Why an Enterprise NoSQL Database for Unstructured Information
Why MarkLogic: Addressing the Challenges of Unstructured Information with Purpose-built Technology

November 2012

Table of Contents
Introduction
Characteristics of Unstructured Information
MarkLogic Addresses Unstructured Information
Summary
About MarkLogic

Abstract

Rapidly changing conditions are forcing organizations to rethink how they use information to meet their objectives. Whether battling in the marketplace or on the battlefield, the need for flexibility and agility with information has never been greater. Organizations are looking to integrate and enrich information to create additional value for users. User expectations are changing too, as they demand Web 2.0 and Enterprise 2.0 style applications that provide modern search capabilities, as well as an ability to interact with information through tagging and user-generated comments. Various distribution channels also present new challenges for information providers in exposing their information through rich user interfaces or through syndicated services like RSS and Atom feeds, allowing users to explore and access information in their own context. Choosing the right technology at the core of the application architecture is critical for any organization that needs the agility to meet these goals and respond rapidly to unforeseen changes. XML servers such as MarkLogic Server provide that agility by offering a single unified platform for storing, manipulating, and delivering XML and for building innovative information applications.
This paper provides a technical overview of MarkLogic Server, the industry’s leading XML server, and also discusses some of the challenges facing organizations today for storing, repurposing, and dynamically delivering information.

Introduction

Organizations are exploring new technologies that have the flexibility, extensibility, and enterprise-readiness to handle today’s mission-critical environments. They face challenges with important information that does not fit well in the rows and columns of a relational database management system (RDBMS). This type of information is often referred to as “non-relational” or “unstructured information.” In some cases, unstructured information might be semi-structured or even highly structured, but due to specific characteristics discussed in this paper, it requires significant effort to load, store, and query in an RDBMS.

Most organizations recognize unstructured information as documents, such as policies, manuals, contracts, reports, articles, cables, journals, and legal briefs. Media such as user-generated content, RSS feeds, emails, social graphs, metadata, images, videos, and audio files are other widely used forms of unstructured information.

MarkLogic Server is the Enterprise NoSQL database that manages all types of data in real time. “Enterprise” means it has the capabilities that mission-critical environments require, such as ACID transactions, high availability, disaster recovery, and security. “NoSQL” (“Not Only SQL”) means it is ideal for unstructured information.

Most existing tools, such as RDBMSs, were not built to handle the challenges of unstructured information. These tools either require rigid adherence to a specific structure or ignore any existing structure altogether. In other words, they treat unstructured information as a second-class citizen. This precludes organizations from effectively leveraging this form of information.

You may be asking, “Why weren’t RDBMSs built to handle unstructured data?” The answer is that unstructured data has grown in importance since RDBMSs first appeared in the 1980s, when most data that was captured and analyzed was structured in nature. At the turn of this century, the equation shifted: according to Gartner, 80% of the data created is now unstructured, and it is expected to continue to grow at a faster rate than structured data.

Characteristics of Unstructured Information

To understand why today’s most common tools are insufficient for leveraging unstructured information, it is useful to review the specific characteristics of unstructured information that require it to be treated differently than structured information. This section discusses these characteristics, while the next section describes how MarkLogic addresses them.

Heterogeneous

The first important characteristic of unstructured information is that it is heterogeneous. In other words, not only does it look different from structured information, but the many formats of unstructured information vary significantly from one another. Unstructured information includes non-discrete data types such as words, sentences, and concepts, in conjunction with discrete data types such as numbers, dates, and identifiers. Many combinations of these data types are possible, so standards are created to maintain manageability. However, the gains are not always clear, since great variance still exists, as evidenced by the many domain-specific standards such as:

• FpML – Financial products Markup Language
• OOXML – Office Open XML for Microsoft Office 2007/2010
• ISO 20022 – the ISO Standard for Financial Services Messaging
• XBRL – eXtensible Business Reporting Language
• RixML – Research Information Markup Language
• DocBook – a popular markup language for documentation
• MDDL – Market Data Definition Language
• DDMS – Department of Defense Discovery Metadata Specification

Also consider the different document formats such as PDF, HTML, Microsoft Office, RTF, etc. These options represent the different ways unstructured information is stored.

Contrast this heterogeneity with the homogeneity of structured information, which is stored in a consistent, tabular form. The data types in structured information primarily consist of numbers, dates, and fixed-length text strings, which limits format variation. Database tables were invented with this limited variation in mind.

Since unstructured information varies greatly, it is not easily stored in tables. The challenge is that unstructured information must be mapped into tables and discrete data types, which entails an unnatural and time-consuming effort. As an alternative, RDBMS data types such as character and binary large objects (CLOBs and BLOBs) were created to overcome the limitations of the discrete data types, but they facilitate only storage, not querying. CLOBs/BLOBs are therefore only marginally better than storage on a filesystem. The problem remains that RDBMSs treat unstructured information as a second-class citizen. The monolithic approach of CLOBs/BLOBs ignores the important context in unstructured information, and thus precludes analysis, retrieval, and updates at a granular level.

Complex

In addition to being heterogeneous, unstructured information is also very complex. Three characteristics contribute to its complexity, and any combination of them may be found in unstructured information.

First, unstructured information is typically hierarchical, with nested parent/child relationships. Often these relationships are not obvious, but examples include subsections in a chapter of a book or sub-clauses in a contract. Structured information, on the other hand, typically has flat, tabular relationships that may be expressed as one-to-one, one-to-many, or many-to-many. Since RDBMSs were not designed for hierarchies, a query that joins rows to recreate the hierarchy is slow and inefficient.

Second, unstructured information is irregular, meaning it does not fit in neat, predefined data elements. Information may vary greatly in length, with no predefined or bounded data lengths. It might also be sparsely populated, meaning that across a collection of information there might be thousands of known data elements, many of which are blank. These characteristics are inconsistent with what RDBMSs expect, which is that most columns are filled with values.

Third, unstructured information may or may not conform to a predefined schema. If it does conform, the schema might be poorly defined, not followed strictly, or not known in advance. Even in the case of predefined schemas, large variances may be allowed, making each item appear very different from the next. RDBMSs ...

... by the location of the amended text. She typically uses a word processing program like Microsoft Word to make updates, and the user interface does not have hard rules on how the contract should be changed. There also is no preparation required by IT staff to plan for the changes, as the attorney makes the changes ad hoc.

Contrast this with structured information, which changes in well-known ways. For example, each value in an RDBMS changes in an expected way: numbers are increased or decreased, dates are replaced with other dates, and text strings are updated within predefined lengths. When the schema changes, the system is first updated to accommodate that change. Schema changes must be announced before they can be handled by the system. The IT staff necessarily knows what type of changes users will make to structured information before the changes can be made. RDBMSs are good for predictable and announced changes, but are not efficient for the changes that unstructured information undergoes.

Text-Centric

Unstructured information is heavily text-centric. It contains