Inside MarkLogic Server
Its data model, indexing system, update model, and operational behaviors

April 28, 2013
Jason Hunter, MarkLogic Corporation

Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Some features discussed are covered by MarkLogic patents. Download at http://developer.marklogic.com/inside-marklogic.

This paper describes the MarkLogic Server internals: its data model, indexing system, update model, and operational behaviors. It's intended for a technical audience: either someone new to MarkLogic wanting to understand its capabilities, or someone already familiar with MarkLogic who wants to understand what's going on under the hood.

This paper is not an introduction to using MarkLogic Server. For that you can read the official product documentation. Instead, this paper explains the principles on which MarkLogic is built. The goal isn't to teach you to write code, but to help you understand what's going on behind your code, and thus help you write better and more robust applications.

The paper is organized into sections. The first section provides a high-level overview of MarkLogic Server. The next few sections explain MarkLogic's core indexes. The sections after that explain the transactional storage system, multi-host clustering, and the various connection options. That is a natural stopping point for the casual reader, but those who read on will find sections covering advanced indexing features as well as topics like replication and failover. The final section discusses the ecosystem built up around MarkLogic.

This major update to the original version adds discussion on features introduced in MarkLogic 5 and MarkLogic 6: database replication, journal archiving with point-in-time recovery, multi-statement transactions, XA transactions, Hadoop integration, tiered storage, large binary support, compartment security, SQL/ODBC access for BI tool integration, REST API access, user-defined functions, JSON support, document filters, path range indexes, application packaging, and system performance monitoring. It also adds coverage for several older features: text value matching, phrase handling, stop words, lexicons, and document compression.

Table of Contents

What Is MarkLogic Server?
  Document-Centric
  Transactional
  Search-Centric
  Structure-Aware
  Schema-Agnostic
  Programmatic
  High Performance
  Clustered
  Database Server
Core Topics
Indexing Text and Structure
  Indexing Words
  Indexing Phrases
  Indexing Longer Phrases
  Indexing Structure
  Indexing Values
  Indexing Text Values
  Indexing Text with Structure
  Special Phrase Handling
  Index Size
  Reindexing
  Relevance
  Lifecycle of a Query
Indexing Document Metadata
  Collection Indexes
  Directory Indexes
  Security Indexes
  Properties Indexes
Fragmentation
  Fragment vs. Document
  Estimate and Count
  Unfiltered
The Range Index
  Range Queries
  Data-Type Aware Equality Queries
  Extracting Values
  Optimized "Order By"
  Using Range Indexes for Joins
  Using Path Range Indexes for Extra Optimization
  Lexicons
Data Management
  What's on Disk: Databases, Forests, and Stands
  Ingesting Data
  Modifying Data
  Multi-Version Concurrency Control
  Time Travel
  Locking
  Updates
  Documents are Like Rows
  Lifecycle of a Document
  Multi-Statement Transactions
  XA Transactions
Tiered Storage
  Fast Data Directory on SSDs
  Large Data Directory for Binaries
Clustering and Caching
  Cluster Management
  Caching