Evaluation of Native XML Database Chapter 3. Flattening, Storing, and Mapping

<< Chapter_2_XML_Databases Index of Original Docs Chapter_4_Icons >>

Chapter 3. Flattening, Storing, and Mapping

Flattening The Data Dictionary The Tag Dictionary The Map Store A Note on Data Integrity Preliminary Evaluation of the Storage Architecture

Material in this chapter is taken from the following: a. US Patent Application 20020099712, July 25, 2002, “Method of operating an extensible markup language database” b. US Patent Application 20020099745, July 25, 2002, “Method and system for storing a flattened structured data document” c. US Patent Application 20020099736, July 25, 2002, “Method of storing a structured data document” d. US Patent Application 20020052878, May 2, 2002, “Method of storing and flattening a structured data document” e. Direen, Harry, Ph.D., et. al., “NeoCore XML Information Management System (XMS)” (slides to accompany oral presentation), undated, slides 6-8.

Flattening

In the NeoCore XML database, the first step is to “flatten” the XML document before it is stored. Suppose we start with this simplified XML document:

Page 3-1 Evaluation of Native XML Database Chapter 3. Flattening, Storing, and Mapping

Doe John

1234 Main Street Anytown AL 55555

666-555-4444

Buck Jennie

5678 Byway Street Sometown AL 55555

666-555-3333

After flattening, the tags and data would look like this:

Telephone_book>Entry>@access_category>unlisted Telephone_book>Entry>Name>Last>Doe Telephone_book>Entry>Name>First>John Telephone_book>Entry>Address>Street>1234 Main Street Telephone_book>Entry>Address>City>Anytown Telephone_book>Entry>Address>State>AL Telephone_book>Entry>Address>Zip>55555 Telephone_book>Entry>Phone_number>666-555-4444 Telephone_book>Entry>@access_category>unlisted Telephone_book>Entry>Name>Last>Buck Telephone_book>Entry>Name>First>Jennie Telephone_book>Entry>Address>Street>5678 Byway Street Telephone_book>Entry>Address>City>Sometown Telephone_book>Entry>Address>State>AL Telephone_book>Entry>Address>Zip>55556 Telephone_book>Entry>Phone_number>666-555-3333

To each line, some control/format codes are added so that the original XML file can be easily reconstructed from the flattened document. (The control/format codes are also used to make searches efficient; searching is discussed in Chapter 6.) In addition, some initial lines (“metadata”) are added to aid in document management and document-level searching.

The general idea of flattening is that each data item (e.g., “Doe”) is listed with its complete structural context as defined by the tags as they are actually given in the original document. Where the data item is an attribute value the “@” symbol is inserted before the name of the attribute.

Each flattened line contains a group of tags, a data item, and its set of control/format codes. The tagset will be placed into the Tag Dictionary. The data item will be placed into the Data Dictionary. The control/format codes will be placed into the Map Store. The structures of the dictionaries and the Map Store are discussed below.

The exact methods of handling such things as comments and CDATA in the flattened document are not explained in the publicly available NeoCore materials. While our testing will check whether such constructs are handled well by the database, we will simply accept that the details of flattening are proprietary and will examine them no further. Further, the publicly available NeoCore materials do not explain all the details of the control/format codes, and so we will not examine those closely either. Two significant items in the control/format codes are the P-Level Code and the Parent Code; these two are explained in the publicly available documents, and we will examine them in Chapter 6.

Return to Top

The Data Dictionary

The Data Dictionary is a file on a hard disk or an array of hard disks.

The first data item (in our example, “unlisted”) is placed in the first available storage location of the Data Dictionary. Immediately following it we place the second data item, and then the third, and so on. So, using our example, and assuming it is the only document currently stored, the Data Dictionary would look like this:1

unlisted Doe John “1234 Main Street” Anytown AL 55555 666-555-4444 Buck Jennie “5678 Byway Street” Sometown 55556 666-555-3333

Notice that duplicates are not stored in the Data Dictionary.

The address (offset) of each data item in the Data Dictionary is put into the Map Store, so that the data items can be found easily. If a particular data item is contained 43 times in the database, then the data item will occur once in the Data Dictionary and a pointer to that data item will occur 43 times in the Map Store.

1 When you actually examine the Data Dictionary, you notice NeoCore does not use a mere space as the separator between data items; however, the publicly available documents do not explain the several different separators that are actually used.

Notice how space efficient the Data Dictionary is: there are no duplicates and there is no wasted space. Instead of storing duplicate data items, we store relatively short duplicate pointers.

Return to top

The Tag Dictionary

The Tag Dictionary works exactly the same as the Data Dictionary. Instead of containing data items, the Tag Dictionary contains the tag portion of each line. Again using our example above, the Tag Dictionary will contain the following:2

Telephone_book>Entry>@access_category> Telephone_book>Entry>Name>Last> Telephone_book>Entry>Name>First> Telephone_book>Entry>Address>Street> Telephone_book>Entry>Address>City> Telephone_book>Entry>Address>State> Telephone_book>Entry>Address>Zip> Telephone_book>Entry>Phone_number>

Just like with the Data Dictionary, there are no duplicates and no wasted space. The storage locations are entered into the Map Store.

Of particular note is that the storage methodology of the tags is exactly the same as the storage methodology of the data. The database can handle a tagset that is as dynamic as the data set; that is, tags which shift continually from document to document are no more difficult to handle than data items which change from document to document. There is nothing in the NeoCore XML database to restrict the extensibility inherent in XML. Whatever structure the user provides within the XML document is exactly what the database will store. Whether you do or do not provide a DTD or other externally defined schema has nothing to do with the efficiency of storing documents. And if the structure definined within the document—in the structure of the tagsets—changes continually from document to document, the efficiency of document storage is not affected. There are no triggers or stored procedures to make sure you do it “right”; whatever structure you provide is the structure that is stored. There are no tables or columns into which the data is inserted. A fortiori there is no need to modify any table or column structure if an unfamiliar tagset is encountered.

Return to top

The Map Store

The Map Store is a table that contains a complete record of each document. From the Map Store we can completely reconstruct each XML document in the database.

For each line in the flattened document, there is a row in the Map Store. Each row contains a pointer to the tagset in the Tag Dictionary, a pointer to the data item in the 2 Actually, for ease of subsequent searching, the tag dictionary also stores subsets of the complete tagset; we explain this in Chapter 6.

Data Dictionary, and the set of control/format codes that allows the reconstruction of the original XML document. We know where each document in the Map Store starts and stops because one of the formating codes indicates the first and last lines of each document. And we know the order of the lines in the document, because it is the same order as the lines in the Map Store, as modified by insertions and deletes.

Using the same example above, the contents of the Map Store would be similar to Figure 3-1. (“TDL” indicates “Tag Dictionary location”; “DDL” indicates “Data Dictionary location”; “MSL” indicates “Map Store Location”.)

FORMAT CODE SET TDL POINTER DDL POINTER For Line 1 TDL #1 DDL #1 For Line 2 TDL #2 DDL #2 For Line 3 TDL #3 DDL #3 For Line 4 TDL #4 DDL #4 For Line 5 TDL #5 DDL #5 For Line 6 TDL #6 DDL #6 For Line 7 TDL #7 DDL #7 For Line 8 TDL #8 DDL #8 For Line 9 TDL #1 DDL #1 For Line 10 TDL #2 DDL #9 For Line 11 TDL #3 DDL #10 For Line 12 TDL #4 DDL #11 For Line 13 TDL #5 DDL #12 For Line 14 TDL #6 DDL #6 For Line 15 TDL #7 DDL #13 For Line 16 TDL #8 DDL #14

Figure 3-1. Map Store

For a delete, the lines in the Map Store are not deleted; instead, one of its control codes is set to “deleted” status. The corresponding entries in the tag and data dictionaries are not removed either. (NeoCore advises us that in a future version of the XMS, there will be a defragmentation process to remove deleted items from the dictionaries and the Map Store.)

To insert a new line in a document, you put some “jump” commands in the Map Store. For example, suppose you start with the Map Store in Figure 3-2:

FORMAT CODE TDL POINTER DDL POINTER SET For Line 1 TDL #1 DDL #1 For Line 2 TDL #2 DDL #2 For Line 3 TDL #3 DDL #3 Many additional lines Last current line, n

Figure 3-2. Map Store Before Jump

If you want to insert a line 2a, you would end up with the following. (When the defragmentation process is added to the XMS, it will re-order the Map Store, removing all jumps.)

FORMAT CODE TDL POINTER DDL POINTER SET For Line 1 TDL #1 DDL #1 Jump Flag MSL #(n+1) For Line 3 TDL #3 DDL #3 Many additional lines Last current line, n For Line 2 TDL #2 DDL #2 For Line 2a TDL #next DDL #next Jump Back Flag MSL #3

Figure 3-3. Map Store After Jump

For an update, the new data item is added to the Data Dictionary if it was not already there. Then the DDL pointer in the Map Store is changed to point to the correct data item in the Data Dictionary.

A variety of indices provide access to any specific entry in the Map Store. These are discussed in Chapter 6.

Notice how space-efficient the Map Store is. The pointers to the tag and data dictionaries are of fixed width. The set of format/control characters are of fixed width. There is little wasted space in the Map Store.

Return to top

A Note on Data Integrity

The database is totally agnostic about what is being stored. It doesn’t care if the client- side application fails to provide the telephone number in a telephone directory listing. It simply stores what it is given. And it doesn’t care if someone stores his bowling scores in the same database that stores the telephone directory.

A typical RDBMS uses triggers and most of its stored procedures as part of the database itself to insure data integrity. In the NeoCore arrangement, responsibility for data integrity is outside the database proper and either in the broader XMS or on the client side.3

By removing responsibility for data integrity outside the database proper, the database behaves exactly the same whether it is asked to store incomplete information, erroneous information, or new types of information. Inside the database, there simply is no “correct” schema; there is no such thing as incomplete data; there is no extraneous data. There are only the particular schemata that are implicitly contained in the tag structures of the individual documents. The database stores whatever it is given, provided only that the input is syntactically correct so that the database can parse it.

Return to top

Preliminary Evaluation of the Storage Architecture

For a native XML database, the design of the storage architecture should be done with the following goals in mind:

a. Retain the extensibility and heterogeneity of XML b. Recovery from catastrophic events c. Minimum storage space d. Efficient searches e. Quick stores, inserts, updates, and deletes a. We have a separate Tag Dictionary with as much flexibility as the Data Dictionary. There is a sense in which this is not merely a DATAbase, but also a TAGbase. That is, we are storing not just dynamic data, but also dynamic context. The storage architecture allows us to totally sever ourselves from the limitations of static table and column structures. The architecture does far more than merely allow us to modify table and column structures dynamically; it totally liberates us from such RDBMS limitations. b. We haven’t discussed recovery mechanisms. The Tag Dictionary, Data Dictionary, and Map Store are disk-based structures. And, of course, if the data is critical, this means RAID-based structures given contemporary computer architecture. All that

3 Currently, data integrity is a client-side responsibility. NeoCore planners anticipate adding tools to assist with data integrity issues in future versions of the XMS. But such tools will remain outside the database proper and in the broader XMS.

Page 3-7 Evaluation of Native XML Database Chapter 3. Flattening, Storing, and Mapping needs be done is make sure the writes to disk are completed before the using applications are given a positive response. Obviously, the prudent system administrator will also have a regular archiving program; the NeoCore XMS includes archiving tools. For those whose database is on a single hard disk, prudence demands frequent backup of that disk. There is a transaction log which allows reconstructing the state of the database by starting with the last archive and updating it from the log. c. There is some space overhead to manage the two dictionaries and the Map Store. This is negligible—nothing more than what the OS needs to keep track of these ordinary files. The Map Store takes quite a bit of filespace; but it is actually quite efficient: each line is only the width of two addresses and the control/formatting codes, and in return the structure permits a very efficient Data Dictionary with no duplicates. The NeoCore database has a fairly large Tag Dictionary that has no direct equivalent in most databases; however, the other databases replace it with significant overhead to maintain table and column names, their relationships, and (usually) the type definitions associated with each column. The storage of primary and foreign key information is eliminated from the database, but if the user wishes to have analogous tools to guarantee data integrity, that space requirement is distributed to the client. We eliminate stored procedures; again, any needs the user has to retain the capability contained in stored procedures is moved to the client. Duplicate data (e.g., two customers at the same address or a hundred thousand customers with the same city) is reduced to duplicating a (relatively short) address in the Map Store; the data itself is never duplicated. There is no wasted space in the dictionaries due to varying length data in fixed-length data fields. And there is no wasted space in the Map Store. As we will see, the indexing will be quite extensive; we’ll see later how extensive. Later, when we do our testing, we will see how much space is actually taken by the database files. d. We will examine the searching capability later, in Chapter 6. e. As far as the two dictionaries and the Map Store are concerned, the stores, inserts, updates, and deletes can be done quite rapidly. There is also indexing to consider; we will examine that in Chapter 5 and Chapter 6.

Return to top

<< Chapter_2_XML_Databases Index of Original Docs Chapter_4_Icons >>

Page 3-8