Why an Enterprise NoSQL Database for Unstructured Information

Why MarkLogic: Addressing the Challenges of Unstructured Information with Purpose-built Technology

November 2012

Table of Contents

Introduction
Characteristics of Unstructured Information
MarkLogic Addresses Unstructured Information
Summary
About MarkLogic

Abstract

Rapidly changing conditions are forcing organizations to rethink how they use information to meet their objectives. Whether battling in the marketplace or on the battlefield, the need for flexibility and agility with information has never been greater. Organizations are looking to integrate and enrich information to create additional value for users. User expectations are changing too, as they demand Web 2.0 and Enterprise 2.0 style applications that provide modern search capabilities, as well as the ability to interact with information through tagging and user-generated comments. Distribution channels present new challenges as well: information providers must expose their information through rich user interfaces or through syndicated services such as RSS and Atom feeds, allowing users to explore and access information in their own context.

Choosing the right technology for the core of the application architecture is critical: it gives an organization the agility it needs to meet these goals and respond rapidly to unforeseen changes. XML servers such as MarkLogic Server provide that agility through a single unified platform for storing, manipulating, and delivering XML and for building innovative information applications.

This paper provides a technical overview of MarkLogic Server, the industry’s leading XML server, and also discusses some of the challenges facing organizations today for storing, repurposing, and dynamically delivering information.

Introduction

Organizations are exploring new technologies that have the flexibility, extensibility, and enterprise-readiness to handle today’s mission-critical environments. They face challenges with important information that does not fit well in the rows and columns of a relational database management system (RDBMS). This type of information is often referred to as “non-relational” or “unstructured information.” In some cases, unstructured information might be semi- or even highly structured, but due to specific characteristics discussed in this paper, requires significant efforts to load, store, and query in an RDBMS.

Most organizations recognize unstructured information as documents, such as policies, manuals, contracts, reports, articles, cables, journals, and legal briefs. Media such as user-generated content, RSS feeds, emails, social graphs, metadata, images, videos, and audio files are other widely used forms of unstructured information.

MarkLogic Server is the Enterprise NoSQL database that manages all types of data in real time. “Enterprise” means it has the capabilities mission-critical environments require such as ACID transactions, high availability, disaster recovery, and security. “NoSQL” (“Not Only SQL”) means it is ideal for unstructured information.

Most existing tools such as RDBMSs were not built to handle the challenges of unstructured information. These tools either require rigid adherence to a specific structure or ignore any existing structure altogether. In other words, they treat unstructured information as a second-class citizen. This precludes organizations from effectively leveraging this form of information.

You may be asking, “Why weren’t RDBMSs built to handle unstructured data?” The answer is that unstructured data is a progression from when RDBMSs first appeared in the 1980s, when most data that was captured and analyzed was structured in nature. At the turn of this century, the equation shifted: 80% of the data created is now unstructured according to Gartner and is expected to continue to grow at a faster rate than structured data.

Characteristics of Unstructured Information

To understand why today’s most common tools are insufficient for leveraging unstructured information, it is useful to review the specific characteristics of unstructured information that require it to be treated differently than structured information. This section discusses these characteristics while the next section will describe how MarkLogic addresses them.

Heterogeneous

The first important characteristic of unstructured information is that it is heterogeneous. In other words, not only does it look different from structured information, but the many formats of unstructured information vary significantly from one another. Unstructured information includes non-discrete data types such as words, sentences, and concepts, in conjunction with discrete data types such as numbers, dates, and identifiers. Many combinations of these data types are possible, so standards are created to maintain manageability. However, the gains are not always clear, since great variance still exists as evidenced by the many domain-specific standards such as:

• FpML – Financial products Markup Language
• OOXML – Office Open XML for Microsoft Office 2007/2010
• ISO 20022 – the ISO Standard for Financial Services Messaging
• XBRL – eXtensible Business Reporting Language
• RixML – Research Information Markup Language
• DocBook – a popular markup language for documentation
• MDDL – Market Data Definition Language
• DDMS – Department of Defense Discovery Metadata Specification

Also consider the different document formats such as PDF, HTML, Microsoft Office, RTF, etc. These options represent the different ways unstructured information is stored.

Contrast this heterogeneity to the homogeneity of structured information, which is stored in a consistent, tabular form. The data types in structured information primarily consist of numbers, dates, and fixed-length text strings, which limits its format variation. Database tables were invented with this limited variation in mind.

Since unstructured information varies greatly, it is not easily stored in tables. The challenge is that unstructured information must be mapped into tables and discrete data types, which entails an unnatural and time-consuming effort. As an alternative, data types such as character/binary large objects (i.e., CLOBs and BLOBs) of an RDBMS were created to overcome the limitations of the discrete data types, but they facilitate only storage, not querying. Therefore, CLOBs/BLOBs are marginally better than storage on a filesystem. The problem remains that RDBMSs treat unstructured information as a second-class citizen. The monolithic approach of CLOBs/BLOBs ignores the important context in unstructured information, and thus precludes analysis, retrieval, and updates at a granular level.

Complex

In addition to heterogeneity, unstructured information is also very complex. There are three characteristics that contribute to its complexity, any combination of which is found in unstructured information.

First, unstructured information is typically hierarchical, with nested parent/child relationships. Often these relationships are not obvious, but examples include subsections in a chapter of a book or
sub-clauses in a contract. On the other hand, structured information typically has flat, tabular relationships that may be expressed as one-to-one, one-to-many, or many-to-many. Since RDBMSs were not designed for hierarchies, a query to join rows to recreate the hierarchy is slow and inefficient.

Second, unstructured information is irregular, meaning unstructured information does not fit in neat, predefined data elements. Information may vary greatly in length, with no pre-definition or bounded data lengths. It might also be sparsely populated, meaning across a collection of information, there might be thousands of known data elements, many of which are blank. These characteristics are inconsistent with what RDBMSs expect, in which most columns are expected to be filled with values.

Third, unstructured information may or may not conform to a predefined schema. If it does conform, the schema might be poorly defined, not followed strictly, or not known in advance. Even in the case of predefined schemas, large variances may be allowed, making each item appear very different from the next. RDBMSs expect rigid, predefined schemas with predefined data elements, so unstructured information is a poor fit.

While some organizations try to map unstructured information into rows and columns, they face huge tradeoffs. Either data accessibility is compromised, or the system takes a significant performance hit due to inefficient storage and indexing.

Changing in Unpredictable Ways

When unstructured information evolves, it changes in unpredictable and unannounced ways. New standards, new sources, and new applications are created continually. And there are generally no restrictions on how it is updated. Take an example such as a contract. If an attorney amends a contract to revise terms, she updates it in any way she desires without formatting restrictions. She is not limited by the number of words or sentences, or even by the location of the amended text. She typically uses a word processing program like Microsoft Word to make updates, and the user interface does not have hard rules on how the contract should be changed. There also is no preparation required by IT staff to plan for the changes, as the attorney makes the changes ad hoc.

Contrast this to structured information, which changes in well-known ways. For example, each value in an RDBMS changes in an expected way: numbers are increased or decreased, dates are modified with other dates, and text strings are updated within predefined lengths. And when the schema changes, the system is first updated to accommodate that change. Schema changes must be announced before they can be handled by the system. The IT staff necessarily knows what type of changes will be made by users to structured information before the changes can be made. RDBMSs are good for predictable and announced changes, but are not efficient for the changes that unstructured information undergoes.

Text-Centric

Unstructured information is heavily text-centric. It contains language ambiguities typically not clear for processing by computers. For example, a word such as “foot” can have several different meanings including a body part, the bottom of something, or 12 inches. The definition is dependent on the context. Without proper context, users may encounter many false positives, in which they retrieve irrelevant information. They may also encounter many false negatives, in which they miss relevant information because different terminology was used in their search.

Also, text within unstructured information lacks specific identifiers to help define various data elements. In comparison, column names such as “first_name” in an RDBMS table leave no ambiguity about the meaning of the data values. While human readers can easily find names in unstructured information such as in a contract, it is far less obvious when
processed by a computer. Since RDBMSs were designed for tabular data, they do not have the functionality to properly handle the text-centric nature of unstructured information.

Exponentially Growing

Analysts estimate unstructured information grows 10 to 50 times faster than structured information. Information in general continues to grow at a tremendous rate with one estimate at 800% over the next five years. This rapid growth of unstructured information requires new approaches and strategies pertaining to performance and scalability. Though hardware advancements help with scaling, those are only part of the solution. Software must be optimized with modern hardware in mind to maximize efficiency. Organizations that rely on older technologies must choose between excessive expenditures or insufficient functionality when facing today’s unstructured information loads.

MarkLogic Addresses Unstructured Information

Based on the characteristics of unstructured information in the previous section, it is clear today’s most popular technologies are not able to fully leverage unstructured information. RDBMSs lack the flexibility to efficiently handle unstructured information, and search engines lack the management and update capabilities that applications require. Content management systems, which are largely workflow-oriented applications built on RDBMSs and search engines, suffer the same challenges because of the limitations of the underlying platform.

Despite this, many organizations still try to use their current tools, but achieve limited success. Now organizations no longer have to compromise. Since MarkLogic was designed for leveraging unstructured information, it has important features that lead to significant benefits. Some of those key features are described below.

Universal Index

MarkLogic’s Universal Index is a key feature for addressing the heterogeneity of unstructured information. It captures all information users need for precise, high-performance queries. Application development teams spend less time on data modeling, re-modeling, and performance tuning, thus expediting time-to-market and lowering total cost of ownership. Unstructured information wants to be unrestricted, and the Universal Index allows that.

The Universal Index allows users to query all information that the system sees, rather than only the information the system is told to see. In other words, the Universal Index enables MarkLogic to make no presumptions around what information should be expected and enables the system to store information “as is,” without requiring time-consuming data modeling to standardize disparate information formats. This is also referred to as being “schema-agnostic” or “schema-permissive” in which any schema, or even non-existent schemas, can be loaded into MarkLogic with no prior planning. It automatically captures all elements in information, including words, structure, dates, and numbers. This means no information is lost, and all elements can be queried and retrieved.

In addition to effectively handling heterogeneous information, the Universal Index also addresses the complexity of unstructured information due to hierarchy, irregularity, and poor schema definition. It provides the flexibility to accommodate the wide variety of changes end users make to their information.

XML Documents as the Data Model

To properly handle the complexity of unstructured information, MarkLogic uses a data model based on XML documents, which is more efficient and effective for storing unstructured information than the relational model. Support for W3C-standard XSLT and XQuery, both designed for XML, enables fast and easy
querying and transformation. MarkLogic customers have experienced significant improvements in agility and efficiency by eliminating the resource drain of trying to model and store unstructured information in an RDBMS.

An XML data model gives MarkLogic several important advantages for leveraging unstructured information. First, embedded markup in XML creates context to enable granularity for access, updates, reuse, and repurposing. Second, XML is extensible so new data elements can be added ad hoc without having to redesign a schema. Third, XML has the flexibility to fully capture and model the unpredictable and irregular aspects of unstructured information, including non-discrete data elements, hierarchical elements, variable length characters, and sparseness of data.

Using XML documents as the data model was a natural architectural decision for MarkLogic Server. XML is ideal for fully exploiting unstructured information despite its heterogeneity, complexity, and unpredictable change. MarkLogic’s use of XML ensures it can handle current and future requirements around unstructured information.

Transaction Controller

Delays in access to information are often due to limitations in technology. With unpredictable changes in unstructured information, including those pertaining to standards, formats, and content, the potential for delay is increased. MarkLogic Server was designed to immediately accommodate those types of changes, thus eliminating the latency found in structured technologies. As mentioned earlier, MarkLogic’s Universal Index and XML data model provide the flexibility to offset the design overhead for new information types.

Those features represent only part of the real-time access capability. MarkLogic’s ACID (atomicity, consistency, isolation, and durability) transaction controller ensures newly inserted information is indexed in real time and available to users immediately. Its multi-version concurrency control (MVCC) ensures rapid insertion with minimal resource contention. Indexing can be done simultaneously with heavy query loads with no blocking so organizations do not have to settle for delayed information access. And for the most time-sensitive information, MarkLogic’s real-time alerting quickly and efficiently processes millions or billions of queries against a fast incoming feed of new information.

Search and Analytics Capabilities

Resolving language ambiguities is an important requirement in handling text-centric unstructured information. MarkLogic Server helps in two ways to let end users find and make sense of the information they have. First, it provides features to make information clearer. Second, it offers several techniques for finding evidence as the basis for relevance.

To make information clearer, MarkLogic helps with the identification of meaning and context in information. For example, integration with entity enrichment tools enables identification of entities such as people, places, and things. Range indexes provide structure around specific values to enable precise and fast retrievals, as well as sorting, aggregations, and lookups. Support for extensible metadata schemas allows adding any type of identifying data to existing documents.

To improve relevance in searches, MarkLogic Server provides capabilities found in leading enterprise search engines such as phrase, proximity, and thesaurus searches. In addition, MarkLogic supports highly tunable relevance ranking to more precisely match the end user’s needs. The Universal Index captures all components of information to enable a higher level of specificity, granularity, and structure in searches. Range indexes enable classification and faceted navigation, to help organize information in meaningful and structured ways for faster discovery by end users. Geospatial searching enables location-based information retrieval.
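
As a rough illustration of the range-index and faceting ideas described above, the following Python sketch builds a sorted index over one element across a handful of irregular XML documents, then answers a range query and a facet count from it. This is a conceptual toy, not MarkLogic's implementation or API: the sample documents, the element names, and the functions build_range_index, range_query, and facet_counts are all hypothetical and exist only for this example.

```python
import bisect
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample documents. Their structure is deliberately irregular:
# different root elements, nesting in one, and a missing <signed> date in another.
DOCS = {
    "contract-1.xml": "<contract><party>Acme</party><signed>2011-03-14</signed>"
                      "<clause><sub-clause>Payment due in 30 days</sub-clause></clause></contract>",
    "report-7.xml": "<report><author>Lee</author><signed>2012-06-01</signed></report>",
    "memo-3.xml": "<memo><author>Lee</author><body>No date on this one</body></memo>",
}


def build_range_index(docs, element_name):
    """Return a sorted list of (value, uri) pairs, one per occurrence of element_name."""
    entries = []
    for uri, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # iter() walks the whole tree, so the element is found at any depth.
        for node in root.iter(element_name):
            if node.text:
                entries.append((node.text.strip(), uri))
    entries.sort()
    return entries


def range_query(index, low, high):
    """Return URIs whose indexed value v satisfies low <= v <= high (binary search)."""
    values = [value for value, _ in index]
    start = bisect.bisect_left(values, low)
    end = bisect.bisect_right(values, high)
    return [uri for _, uri in index[start:end]]


def facet_counts(index):
    """Count index entries per distinct value, as a facet sidebar might display."""
    return Counter(value for value, _ in index)


# ISO-8601 date strings sort lexicographically in chronological order,
# so plain string comparison is enough for this sketch.
signed_index = build_range_index(DOCS, "signed")
author_index = build_range_index(DOCS, "author")

print(range_query(signed_index, "2012-01-01", "2012-12-31"))
print(facet_counts(author_index))
```

Documents that lack the indexed element, such as the memo above, simply contribute no entries, so nothing has to be remodeled when a new document shape arrives. A production system would maintain such indexes incrementally and with typed values rather than rebuilding them from strings on every call.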

Finally, built-in co-occurrence analysis reveals hidden relationships between various entities in a collection of information.

Shared Nothing Architecture

MarkLogic’s shared nothing architecture allows high performance and massive scalability to address the unanticipated growth of unstructured information. MarkLogic is optimized for commodity hardware, and exhibits linear scaling to easily and efficiently grow to handle future needs. As the number of users or information load increases, performance and response times can be maintained by adding servers to a cluster.

MarkLogic has been deployed in clusters of over 100 hardware servers, with expectations of customers moving well beyond that in the near future. Not only do customers gain cost savings by leveraging commodity hardware, and fewer of them, but the lower administrative overhead has resulted in the ability to reallocate human resources to higher value activities. At one customer site, only one-half of a full-time equivalent is required to administer the 100-server MarkLogic cluster, which pre-MarkLogic needed 20 database administrators.

Summary

The focus on unstructured information has increased over the years, but the ubiquity of RDBMSs has misled many organizations to make tradeoffs around functionality, time-to-market, total costs, and performance that they did not need to. Since RDBMSs were designed for structured information, which is greatly different from unstructured information, there is a clear mismatch that leads to costly inefficiencies.

With its Universal Index, XML data model, transaction controller, search and analytics capabilities, and shared nothing architecture, MarkLogic is the right choice for tackling the challenges of unstructured information. Customers report significant gains with MarkLogic Server, including 10 to 100 times performance improvements, time-to-market in weeks instead of years, and scaling to hundreds of terabytes today and petabytes tomorrow.

About MarkLogic

Since MarkLogic was founded in 2001, we have focused on building a database and applications that enable our customers to capture more data, and do more with it. We provide our customers with an unmatched competitive edge through game-changing technology. We’ve set new standards in scalability, enterprise-readiness, time to value, and innovation.

The world is seeing an explosion in unstructured data, including user-generated content, machine-generated content, aggregation of highly structured heterogeneous data sources (patient records, insurance claims, mortgage documents, etc.), raw data, social media, log files, and sensor data, among others. Relational databases were not designed, and are simply not able, to deal with this variety and complexity in real time. At MarkLogic we are experts at this. We’ve been doing it for a decade.

Some businesses, organizations, and agencies are missing out on a huge opportunity. Our customers are not. The Federal Aviation Administration depends on MarkLogic as the backbone of its Emergency Operations Network. Conde Nast is building new applications in weeks. CQ Roll Call puts up-to-date information about government-in-action at the fingertips of its subscribers. These are just a few MarkLogic customers who are revolutionizing industry and the public sector with Big Data Applications. Become one of them. Transform your organization and outpace your competition with MarkLogic.

MarkLogic is headquartered in Silicon Valley with field offices in Washington D.C., New York, Austin, London, Frankfurt, and Tokyo.

MarkLogic Corporation
www..com

Headquarters
999 Skyway Road, Suite 200
San Carlos, CA 94070
+1 650 655 2300

© Copyright 2010 MarkLogic Corporation. MarkLogic is a registered trademark and MarkLogic Server is a trademark of MarkLogic Corporation, all rights reserved. All other product names mentioned herein are the property of their respective owners.