Inside MarkLogic Server

Its data model, indexing system, update model, and operational behaviors

April 28, 2013

Jason Hunter, MarkLogic Corporation

Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Some features discussed are covered by MarkLogic patents. Download at http://developer.marklogic.com/inside-marklogic.

This paper describes the MarkLogic Server internals: its data model, indexing system, update model, and operational behaviors. It's intended for a technical audience — either someone new to MarkLogic wanting to understand its capabilities, or someone already familiar with MarkLogic who wants to understand what's going on under the hood.

This paper is not an introduction to using MarkLogic Server. For that you can read the official product documentation. Instead, this paper explains the principles on which MarkLogic is built. The goal isn't to teach you to write code, but to help you understand what's going on behind your code, and thus help you write better and more robust applications.

The paper is organized into sections. The first section provides a high-level overview of MarkLogic Server. The next few sections explain MarkLogic's core indexes. The sections after that explain the transactional storage system, multi-host clustering, and the various connection options. At this point there's a natural stopping point for the casual reader, but those who read on will find sections covering advanced indexing features as well as topics like replication and failover. The final section discusses the ecosystem built up around MarkLogic.

This major update to the original version adds discussion of features introduced in MarkLogic 5 and MarkLogic 6: database replication, journal archiving with point-in-time recovery, multi-statement transactions, XA transactions, Hadoop integration, tiered storage, large binary support, compartment security, SQL/ODBC access for BI tool integration, REST API access, user-defined functions, JSON support, document filters, path range indexes, application packaging, and system performance monitoring. It also adds coverage of several older features: text value matching, phrase handling, stop words, lexicons, and document compression.


Table of Contents

What Is MarkLogic Server?
Document-Centric
Transactional
Search-Centric
Structure-Aware
Schema-Agnostic
Programmatic
High Performance
Clustered
Database Server
Core Topics
Indexing Text and Structure
Indexing Words
Indexing Phrases
Indexing Longer Phrases
Indexing Structure
Indexing Values
Indexing Text Values
Indexing Text with Structure
Special Phrase Handling
Index Size
Reindexing
Relevance
Lifecycle of a Query
Indexing Document Metadata
Collection Indexes
Directory Indexes
Security Indexes
Properties Indexes
Fragmentation
Fragment vs. Document
Estimate and Count
Unfiltered
The Range Index
Range Queries
Data-Type Aware Equality Queries
Extracting Values
Optimized "Order By"
Using Range Indexes for Joins
Using Path Range Indexes for Extra Optimization
Lexicons
Data Management
What's on Disk: Databases, Forests, and Stands
Ingesting Data
Modifying Data
Multi-Version Concurrency Control
Time Travel
Locking
Updates
Documents are Like Rows
Lifecycle of a Document
Multi-Statement Transactions
XA Transactions
Tiered Storage
Fast Data Directory on SSDs
Large Data Directory for Binaries
Clustering and Caching
Cluster Management
Caching
Caching Binary Documents
Cache Partitions
No Need for Global Cache Invalidation
Locks and Timestamps in a Cluster
Lifecycle of a Query in a Cluster
Lifecycle of an Update in a Cluster
Coding and Connecting to MarkLogic
XQuery and XSLT
XDBC/XCC for Java and .NET Access
REST API
WebDAV: Remote Filesystem Access
SQL/ODBC Access for Business Intelligence
QC for Remote Coding
Advanced Topics
Advanced Text Handling
Text Sensitivity Options
Stemmed Indexes
Relevance Scoring
Stop Words
Fields
More with Fields
Registered Queries
The Geospatial Index
The Reverse Index
Reverse Query Use Cases
A Reverse Query Carpool Match
The Reverse Index
Range Queries in Reverse Indexes
Managing Backups
Typical Backup
Flash Backup
Journal Archiving and Point-in-Time Recovery
Failover and Replication
Shared-Disk Failover
Local-Disk Failover
Database Replication
Flexible Replication
Hadoop
Aggregate Functions and UDFs in C++
Low-Level System Control
Outside the Core
Application Services
Content Pump (MLCP)
Content Processing Framework
Office Toolkits
Connector for SharePoint
Document Filters
Unofficial Tools, Libraries, and Connectors
But Wait, There's More


What Is MarkLogic Server?

MarkLogic Server is an Enterprise NoSQL Database.¹ It fuses together database internals, search-style indexing, and application server behaviors into a unified system. It uses XML documents as its data model, and stores the documents within a transactional repository. It indexes the words and values from each of the loaded documents, as well as the document structure. And, because of its unique Universal Index, MarkLogic doesn't require advance knowledge of the document structure (its "schema") nor complete adherence to a particular schema. Through its application server capabilities, it's programmable and extensible.

MarkLogic Server (referred to from here on as just "MarkLogic") clusters on commodity hardware using a shared-nothing architecture and differentiates itself in the market by supporting massive scale and fantastic performance — customer deployments have scaled to hundreds of terabytes of source data while maintaining sub-second query response time.

MarkLogic Server is a document-centric, transactional, search-centric, structure-aware, schema-agnostic, programmatic, high-performance, clustered database server.

Let's look at all of this in more detail.

Document-Centric

MarkLogic uses documents, often written in XML, as its core data model. Because it uses a non-relational data model and doesn't rely on SQL as its primary means of connectivity, MarkLogic is considered a "NoSQL database". Financial contracts, medical records, legal filings, presentations, blogs, tweets, press releases, user manuals, books, articles, web pages, metadata, sparse data, message traffic, sensor data, shipping manifests, itineraries, and emails are all naturally modeled as documents. In some cases the data might start out formatted as XML documents (for example, Microsoft Office 2007 documents or financial products written in FpML), but if not, it can be transformed to XML documents during ingestion. Relational databases, in contrast, with their table-centric data models, can't represent data like this as naturally and so either have to spread the data out across many tables (adding complexity and hurting performance) or keep this data as unindexed BLOBs or CLOBs.

¹ NoSQL originally meant "No SQL" as a descriptor for non-relational databases that didn't rely on SQL, but now, because many non-relational systems including MarkLogic provide SQL interfaces for certain purposes, it has transmogrified into "Not Only SQL".


In addition to XML, MarkLogic can store JSON, text, and binary documents. JSON documents are internally transformed to XML for purposes of indexing. Text documents are indexed as if each was an XML text node without a parent. Binary documents are by default unindexed, with the option to index their metadata and extracted contents.

Transactional

MarkLogic stores documents within its own transactional repository. The repository wasn't built on a relational database or any other third-party technology. It was purpose-built with a focus on maximum performance.

Because of the transactional repository, you can insert or update a set of documents as an atomic unit and have the very next query able to see those changes with zero latency. MarkLogic supports the full set of ACID properties: Atomicity (a set of changes either takes place as a whole or doesn't take place at all), Consistency (system rules are enforced, such as that no two documents should have the same identifier), Isolation (uncompleted transactions are not otherwise visible), and Durability (once a commit is made it will not be lost).

ACID transactions are considered commonplace for relational databases but they're a game changer for document-centric databases and search-style queries.

Search-Centric

When people think of MarkLogic they often think of its text search capabilities. The founding team has a deep background in search: Chris Lindblad was the architect of Ultraseek Server, while Paul Pedersen was the VP of Enterprise Search at Google. MarkLogic supports numerous search features including word and phrase search, boolean search, proximity, wildcarding, stemming, tokenization, decompounding, case-sensitivity options, punctuation-sensitivity options, diacritic-sensitivity options, document quality settings, numerous relevance algorithms, individual term weighting, topic clustering, faceted navigation, custom-indexed fields, and more.

Structure-Aware

MarkLogic indexes both text and structure, and the two can be queried together efficiently. For example, consider the challenge of querying and analyzing intercepted message traffic for threat analysis:

Find all messages sent by IP 74.125.19.103 between April 11th and April 13th where the message contains both "wedding cake" and "empire state building" (case and punctuation insensitive) where the phrases have to be within 15 words of each other but the message can't contain another key phrase such as "presents" (stemmed so "present" matches also). Exclude any message that has a subject equal to "Congratulations". Also exclude any message where the matching phrases were found within a quote block in the email. Then, for matching messages, return the most frequent senders and recipients.


By using XML documents to represent each message and the structure-aware indexing to understand what's an IP, what's a date, what's a subject, and which text is quoted and which isn't, a query like this is actually easy to write and highly performant in MarkLogic. Or consider some other examples.

Find hidden financial exposure:

Extract footnotes from any XBRL financial filing where the footnote contains "threat" and is found within the balance sheet section.

Review images:

Extract all large-format images from the 10 research articles most relevant to the phrase "herniated disc". Relevance should be weighted so that phrase appearance in a title is 5 times more relevant than body text, and appearance in an abstract is 2 times more relevant.

Find a person's phone number from their emails:

From a large corpus of emails, find those sent by a particular user, sort them in reverse chronological order, and locate the last email they sent which had a footer block containing a phone number. Return the phone number.²
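To make the weighting in the "Review images" example concrete, here is a minimal Python sketch. It is an illustration only, not MarkLogic's relevance algorithm: the element names (title, abstract, body), the helper names, and the simple "weighted occurrence count" score are all assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Weights from the example: title 5x, abstract 2x, body text 1x.
# The element names are hypothetical; real articles may use other markup.
FIELD_WEIGHTS = {"title": 5.0, "abstract": 2.0, "body": 1.0}


def normalized_text(elem):
    """Return all text under the element, lowercased, whitespace collapsed."""
    return " ".join("".join(elem.itertext()).lower().split())


def score(article_xml, phrase):
    """Sum weighted phrase occurrences (plain substring matches)."""
    root = ET.fromstring(article_xml)
    needle = " ".join(phrase.lower().split())
    total = 0.0
    for tag, weight in FIELD_WEIGHTS.items():
        for elem in root.iter(tag):
            total += weight * normalized_text(elem).count(needle)
    return total


def top_articles(articles, phrase, limit=10):
    """Return the URIs of the highest-scoring matching articles, best first."""
    scored = [(score(xml, phrase), uri) for uri, xml in articles.items()]
    matching = [pair for pair in scored if pair[0] > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [uri for _, uri in matching[:limit]]


# Made-up sample data for illustration.
articles = {
    "/a.xml": "<article><title>Herniated disc repair</title>"
              "<abstract>Outcomes.</abstract><body>Methods.</body></article>",
    "/b.xml": "<article><title>Spine imaging</title>"
              "<abstract>A herniated disc study.</abstract>"
              "<body>Herniated disc imaging.</body></article>",
    "/c.xml": "<article><title>Knee surgery</title>"
              "<abstract>None.</abstract><body>None.</body></article>",
}

print(top_articles(articles, "herniated disc"))
```

The point of the sketch is that the score depends on where in the document structure the phrase occurs, not just on whether it occurs. Extracting the images from the top-ranked articles would be a further step over the same structure.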

Schema-Agnostic

MarkLogic indexes the XML structure it sees during ingestion, whatever that structure might be. It doesn't have to be told what schema to expect, any more than a search engine has to be told what words exist in the dictionary. MarkLogic sees the challenge of querying for structure or for text as fundamentally the same. At an index level, matching the XPath expression /a/b/c can be performed similarly to matching the phrase "a b c". That's the heart of the Universal Index.
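One way to picture this equivalence is a toy inverted index in Python, in which adjacent word pairs and parent-child element pairs are both just terms that map to sets of documents. This is a sketch under stated assumptions, not MarkLogic's actual index layout: the term names ("word", "word-pair", "parent-child") and all function names are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def index_document(index, doc_id, xml_text):
    """Record word, word-pair, and parent-child terms for one document.

    Simplification: only text directly inside an element is indexed;
    text that follows a child element (its "tail") is ignored.
    """
    def walk(elem):
        words = (elem.text or "").lower().split()
        for word in words:
            index[("word", word)].add(doc_id)
        for first, second in zip(words, words[1:]):
            index[("word-pair", first, second)].add(doc_id)
        for child in elem:
            index[("parent-child", elem.tag, child.tag)].add(doc_id)
            walk(child)

    walk(ET.fromstring(xml_text))


def phrase_terms(phrase):
    """Break the phrase "a b c" into adjacent word pairs: (a, b), (b, c)."""
    words = phrase.lower().split()
    return [("word-pair", a, b) for a, b in zip(words, words[1:])]


def path_terms(path):
    """Break the path /a/b/c into parent-child pairs: (a, b), (b, c)."""
    steps = path.strip("/").split("/")
    return [("parent-child", p, c) for p, c in zip(steps, steps[1:])]


def candidates(index, terms):
    """Intersect the document sets of all terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = defaultdict(set)
index_document(index, "d1", "<a><b><c>red wedding cake</c></b></a>")
index_document(index, "d2", "<a><b>cake</b><c>wedding</c></a>")

print(candidates(index, path_terms("/a/b/c")))              # {'d1'}
print(candidates(index, phrase_terms("red wedding cake")))  # {'d1'}
```

Both lookups go through the same `candidates` function; only the kind of term differs. Note that this toy returns candidates only: a document containing a/b in one place and b/c in another would also be returned for /a/b/c, so the results would still need to be verified against the documents themselves.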

Being able to efficiently index and query without prior knowledge of a schema provides real benefits with unstructured or semi-structured data where:

1. A schema exists, but is either poorly defined or defined but not followed.
2. A schema exists and is enforced at a moment in time, but keeps changing over time, and may not always be kept current.
3. A schema may not be fully knowable, such as intelligence information being gathered about people of interest where anything and everything might turn out to be important.

2 How do you identify footers and phone numbers? You can do it via heuristics, with the markup added during ingestion. You can mark footer blocks as a