
Appendix C: Technical Details (with code)

Note: This appendix provides technical details about the eXist system. In some cases, the code used to create applications is referenced. Because some of the code files run to several pages, there are two versions of this appendix: one that references each file name and displays the code, and one that lists only the file name.

Table of Contents
1. System Information
2. Template Files
3. File Structure
4. Creating New Collections
5. Search Functionality
6. Search
7. Search Performance
8. Modifying Indexing Parameters
9. Uploading Files
10. Moving Files
11. Checksum Application
12. Stress Test Tool
13. Dublin Core
14. Exporting Methods
15. Migration to XML Formats
16. Role-Based Access Controls
17. Scalability

1. System Information

- eXist1
  o Free download available on eXist’s home page; instructions for loading the software2
  o Play with code in the XQuery Sandbox, preloaded with sample data sets3

- oXygen XML Editor4
  o Download the free trial version or order the licensed product
  o The Database perspective communicates with the eXist database file structure5

- Mozilla Firefox6
  o Download

1 eXist. Home Page. 2010. http://exist.sourceforge.net/
2 Meier, Wolfgang M. Quick Start Guide. eXist. November 2009. http://exist.sourceforge.net/quickstart.html
3 eXist. XQuery Sandbox. http://demo.exist-db.org/exist/sandbox/sandbox.xql
4 oXygen. oXygen: XML Editor Home Page. 2010. http://www.oxygenxml.com/
5 oXygen. XML Database Perspective. http://www.oxygenxml.com/xml_and_relational_database_perspective.

2. Template Files

The following files were used as templates for creating new applications. For most new applications, only minor edits were needed for each file. Specific functionality of specialized applications required additional modifications or files.

File: Description
- index.xq: Main landing page of the application.
- list-items.xq: XQuery for listing items, one item per row.
- view-item.xq: File that transforms an XML document into XHTML for viewing.
- search-form.xq: XHTML search form.
- search.xq: Search service.
- edit.xq: XForms application for saving new data and changing existing data.
- new-instance.: Data for a new item (i.e., includes default values).
- next-id.xml: XML file that contains the next ID number to be assigned to a document.
- save-new.xq: Service that takes an HTTP POST (Save) from the edit form and saves it into the collection; it also assigns an ID to the document and increments the ID for the next save.
- update.xq: Updates an existing document from an edit POST.
- delete-confirm.xq: Confirms with the user that a delete should be performed.
- delete.xq: Deletes a document.
- metrics.xq: Counts the number of items in a collection.
- views collection: Files that provide instructions for additional views of data in HTML format, including various reports.
- scripts collection: XQuery scripts that can be run to import or clean up data sets, or perform other functions.

Sample Template Files

3. File Structure

The folder/file structure of eXist is highly repetitive, which makes locating files simpler. The pilot project folder structure is shown below using oXygen’s Database Perspective.

This clip shows the applications (apps) associated with the Minnesota Historical Society (mhs). The folder structure is as follows:
- localhost (eXist)
  o db (eXist database)
    . cust (customers)
      • mhs (Minnesota Historical Society)
        o apps (the applications created for MHS)

          . the list of applications

6 Mozilla. Firefox 3.6. 2010. http://www.mozilla.com/firefox/

Each application folder contains the same set of folders used to organize the information. The applications ‘ca-bills’, ‘il-bills’, and ‘mn-bills’ are all expanded in this view and show that they all have the following:
- A data folder (data)
- An image folder (images)
- A search folder (search)
- A view folder (views)
- A file for application information (app-info.xml)
- A file for the home page for the application (index.xq)

This view expands the folders found under each collection. The data folder holds the data (in this case one file); the images folder holds the images; the search folder holds the search forms required for the application; and the views folder holds the files and forms required for various views.

By default, the folder structure also determines the web address used to view the database in a browser. The following addresses show pathways to selected individual files:
- Application Main Page:
  o http://localhost:8080/exist/rest/cust/mhs/apps/index.xq
- Glossary Main Page:
  o http://localhost:8080/exist/rest/db/cust/mhs/apps/glossary/index.xq

- California Bill’s Search Page:
  o http://localhost:8080/exist/rest/db/cust/mhs/apps/ca-bills/search/search-form-html.xq
- Minnesota Bill List View:
  o http://localhost:8080/exist/rest/db/cust/mhs/apps/mn-bills/views/list-items.xq
- Checksum Data File:
  o http://localhost:8080/exist/rest/db/cust/mhs/apps/check-sum/data/4.xml

4. Creating New Collections

The following describes the general steps of creating a new collection within the pilot project system. After these steps are taken you should be able to view the new application home page, view a list of items, view individual records, and search the collection.

1. A set of template files was copied, renamed, and placed into the system structure (localhost/db/cust/mhs/apps/collection name). The template files included the files contained in the data, images, search, and view folders as well as the main application files described below.
   a. index.xq: the ‘home page’ for the collection. It shows an icon, the application title, and a list of what you can do with the collection: generally list items, search items, and count items.
   b. app-info.xml: provides the icon and application title shown on the application home page, as well as background information on who created the application and with what versions.

2. Icons and other necessary collection images were uploaded into the ‘images’ folder for the collection.

3. Data was uploaded into the ‘data’ folder for the collection.
   a. The XML tags found within the data determine what is displayed while viewing a record and which tags are searched.

4. The following template files were modified to be able to view the items in the collection.
   a. views/list-items.xq: view the list of bills in the collection
   b. views/view-bill.xq: view an individual bill
      i. Note that the individual bill views include transformation of XML strikeouts to HTML markup elements. This process uses ‘typeswitch’ XQuery transformations that are based on template rules.

5. The following template files were modified to search an individual collection.
   a. search/search-form-html.xq: provides a simple keyword search box for a collection
   b. search/search.xq: query instructions for the keyword search, displaying a list of results

Note: Because of differences in XML tag names between the state collections, it was necessary to write a separate transformation for each of the document viewers. For the applications Syntactica developed, each transformation was approximately two pages of XQuery that used the typeswitch expression to translate each tag into the appropriate XHTML tag. If standard tags were used across collections, this step would not be necessary.
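A minimal sketch of such a typeswitch transformation follows; the element names used here (bill, strike) are illustrative, not the actual bill tags:

```xquery
xquery version "1.0";

(: Template-rule transformation using typeswitch.
   Element names are illustrative; attributes are omitted for brevity. :)
declare function local:transform($nodes as node()*) as node()* {
  for $node in $nodes
  return
    typeswitch ($node)
      (: map an XML strikeout element to an HTML <del> element :)
      case element(strike) return <del>{local:transform($node/node())}</del>
      (: copy other elements, transforming their children recursively :)
      case element() return element {name($node)} {local:transform($node/node())}
      (: text and other nodes pass through unchanged :)
      default return $node
};

local:transform(<bill>Text with <strike>removed words</strike> kept.</bill>)
```

Each `case` clause plays the role of a template rule: it matches one element type and says what XHTML to emit for it, recursing into the children so the rest of the document is handled by the other cases.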

Note: Some form operations on XML data sets are much easier than with a relational database. For example, if a field has a one-to-many relationship, it is easy to add this capability to any form without making any changes to the database or the index configurations. Only a few lines of code were needed to add multiple categories to one item in selected collections.
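In an XForms edit form such as edit.xq, a repeating one-to-many field can be expressed declaratively in a few lines. The fragment below is illustrative only; the instance and element names (item, category) are hypothetical:

```xml
<!-- Illustrative XForms fragment: repeats over category elements in the
     instance; "item" and "category" are hypothetical names. -->
<xf:repeat nodeset="instance('item')/category" id="category-repeat">
  <xf:input ref=".">
    <xf:label>Category</xf:label>
  </xf:input>
</xf:repeat>
<xf:trigger>
  <xf:label>Add category</xf:label>
  <xf:action ev:event="DOMActivate">
    <!-- insert a new category element after the currently selected one -->
    <xf:insert nodeset="instance('item')/category" position="after"
               at="index('category-repeat')"/>
  </xf:action>
</xf:trigger>
```

Because the data is stored as XML, adding another `category` element requires no schema migration or join table; the repeat simply grows with the instance.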

5. Search Functionality

To allow users to search, a simple HTML search form is used (file = search-form-html.xq). The search form displays a web page that contains a keyword search box and a search button. The entire form for searching Illinois bills is shown below:

[File: search-form-html.xq for Illinois bills.]

xquery version "1.0";
import module namespace style='http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm';
declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

let $title := 'Search Illinois Bills'
return
<html>
    <head>
        <title>{$title}</title>
        {style:import-css()}
    </head>
    <body>
        {style:header()}
        {style:breadcrumb()}
        <h1>{$title}</h1>
        <!-- the form passes the keywords as parameter "q", which search.xq reads;
             the markup was elided in this copy and is reconstructed approximately -->
        <form method="get" action="search.xq">
            Search: <input type="text" name="q"/>
            <input type="submit" value="Search"/>
        </form>
        {style:footer()}
    </body>
</html>

6. Search

The keywords that are entered into the search box are passed to a REST search interface that actually performs the search. Using reverse keyword indexes created by the Lucene full-text indexing library, the search.xq file tells the REST service what documents match and gives instructions on how to display the search results. In general, the code for this service is a basic XQuery FLWOR7 statement in the form:

for $hit in collection($my-collection)[ft:query(., $keywords)]
order by ft:score($hit) descending
return local:view-summary($hit)

where:
- $my-collection is the name of the collection to be searched
- $keywords is the string of keywords the user is searching for
- local:view-summary is an XQuery function that renders the summary of a document hit

This service is invoked by simply adding the query parameter to the URL such as the following:

SERVER/apps/mn-bills/search/search.xq?q=healthcare

In addition, all of the search applications use two simple functions that are part of the Lucene function module:
- ft:query($context, $keywords): a function that returns a conditional (true, false) indicating whether a document contains a keyword
- ft:score($hit): a relative numeric score that allows the hit results to be sorted from highest score to lowest score

A small XQuery transformation was used to transform each XML document into a full HTML document for viewing in a web browser. The XQuery ‘typeswitch’ expression was used, which tells the system what HTML tags to generate for each XML element or attribute during the transform.

7 F=for, L=let, W=where, O=order by, and R=return. Kay, Dr. Michael. Blooming FLWOR – An Introduction to the XQuery FLWOR Expression. Stylus Studio. 2010. http://www.stylusstudio.com/xquery_flwor.html

The results that were displayed for this project showed the bill title with a link to the individual file, a section of the bill that contained the keyword in context, and a link to the XML bill file.

The following example file provides the system with the instructions for the search and display results. Once the user is familiar with the basic template, only around three lines of code need to be modified to change what items are returned.

[File: search.xq for Illinois Bills.]

xquery version "1.0";
import module namespace style='http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm';
import module namespace kwic="http://exist-db.org/xquery/kwic";
declare namespace caml="http://lc.ca.gov/legalservices/schemas/caml.1#";
declare option exist:serialize "method=xhtml media-type=text/html";

let $q := xs:string(request:get-parameter("q", ""))
let $start-time := util:system-time()
return
if (string-length($q) < 1)
then Error: Query parameter "q" must be supplied.
else
let $filtered-q := replace($q, "[&"-*;-`~!@#$%^*()_+-=\[\]\{\}\|';:/.,?(:]", "")
let $scope := (
    (:collection('/db/cust/mhs/apps/mn-bills/data')/*:)
    collection('/db/cust/mhs/apps/il-bills/data')/*
    (:collection('/db/cust/mhs/apps/ca-bills/data')/*:)
)
let $hits :=
    for $hit in $scope[ft:query(., $q)]
    order by ft:score($hit) descending
    return $hit
let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $total-result-count := count($hits)
let $end :=
    if ($total-result-count lt $perpage)
    then $total-result-count
    else $start + $perpage
let $results :=
    for $hit in $hits[position() = ($start to $end)]
    let $collection := util:collection-name($hit)
    let $document := util:document-name($hit)
    let $config :=
    let $base-uri := request:get-server-name()
    let $subpath := substring-after($collection,"data")
    let $basepath := concat(substring-before($collection,"data"),"data")
    let $base-uri := replace(concat('http://',$base-uri,"/exist/rest",$basepath),'data','views')
    let $base-url := style:app-base-url()

return if (contains($collection,'mn-bills')) then let $title := data(doc(concat($collection, '/', $document))/bill/btitle/btitle_summary) let $summary := kwic:summarize($hit, $config) let $url := concat('/views/view-bill.xq?Minnesota=', $subpath, "/" ,$document) return

{$title}
{$summary/*}
{concat($base-url, $url)}

else if ($collection = '/db/cust/mhs/apps/il-bills/data') then let $title := doc(concat($collection, '/', $document))//BillTitle/text() let $summary := kwic:summarize($hit, $config) let $url := concat('/views/view-bill.xq?Illinois=', $document) return else if ($collection = '/db/cust/mhs/apps/ca-bills/data') then let $title := data(doc(concat($collection, '/', $document))//caml:Title) let $summary := kwic:summarize($hit, $config) let $url := concat('/views/view-bill.xq?California=', $document) return

{$title}
{$summary/*}
{concat($base-uri, $url)}

else let $title := concat('Unknown result. Collection: ', $collection, '. Document: ', $document, '.') let $summary := kwic:summarize($hit, $config) let $url := concat($collection, '/', $document) return

{$title}
{$summary/*}
{concat($base-uri, $url)}

let $number-of-pages := xs:integer(ceiling($total-result-count div $perpage)) let $current-page := xs:integer(($start + $perpage) div $perpage) let $url-params-without-start := replace(request:get-query-string(), '&start=\d+', '') let $pagination-links := if ($number-of-pages le 1) then () else
    { (: Show 'Previous' for all but the 1st page of results :) if ($current-page = 1) then () else
  • Previous
  • }

    { (: Show links to each page of results :) let $max-pages-to-show := 20 let $padding := xs:integer(round($max-pages-to-show div 2)) let $start-page := if ($current-page le ($padding + 1)) then 1 else $current-page - $padding let $end-page := if ($number-of-pages le ($current-page + $padding)) then $number-of-pages else $current-page + $padding - 1 for $page in ($start-page to $end-page) let $newstart := $perpage * ($page - 1) return ( if ($newstart eq $start) then (

  • {$page}
  • ) else
  • {$page}
  • ) }

    { (: Shows 'Next' for all but the last page of results :) if ($start + $perpage ge $total-result-count) then () else

  • Next
  • }
let $how-many-on-this-page := (: provides textual explanation about how many results are on this page, : i.e. 'all n results', or '10 of n results' :) if ($total-result-count lt $perpage) then concat('all ', $total-result-count, ' results') else concat($start + 1, '-', $end, ' of ', $total-result-count, ' results') let $end-time := util:system-time() let $runtimems := ($end-time - $start-time) div xs:dayTimeDuration('PT1S') (:* 1000 :) return Search Bills {style:import-css()}
{style:header()} {style:breadcrumb()}

Search Bills

Keyword Search:

{ if (empty($hits)) then () else (

Results for keyword search "{$q}". Displaying {$how-many-on-this-page}. Execution Time: {$runtimems} Seconds.
,
{$results}
,
{$pagination-links}
) }

7. Search Performance

To display the time it took to complete a query, the following code was put into each search.xq file. For more information on this function, review the eXist function library.8

[File: Portion of search.xq for Illinois bills.]

let $end-time := util:system-time() let $runtimems := ($end-time - $start-time) div xs:dayTimeDuration('PT1S') (:* 1000 :)

return Search Bills {style:import-css()} [code removed]

{ if (empty($hits)) then () else (

Results for keyword search "{$q}". Displaying {$how-many-on-this-page}.

8 eXist. A RESTful browser for eXist Java-Based Function Modules. http://demo.exist-db.org/exist/functions/util/system-time

Execution Time: {$runtimems} Seconds.

,
{$results}
,
{$pagination-links}
) }

The stress test application used similar code to determine run times. For more information see the Stress Test section of Appendix C.

8. Modifying Indexing Parameters

The default indexing configuration for eXist 1.4 does not use full-text indexing; to modify the indexing parameters (i.e. add full-text indexing) for a collection, an indexing configuration file must be created and modified as necessary. Configuration files are usually created manually by an index designer who is familiar with the structure of the data within the collection. To assist index designers, XML schemas9 are sometimes provided that, if annotated, give great detail about what data each field contains. However, because there is no standard for annotating schemas, index designers are often expected to understand the structure of the data on their own. To create a configuration file for indexing, the designer must understand both how the system works and the relevant programming languages.

In the eXist file structure, the configuration file is located in the system folder. Each collection has its own configuration file, located in a folder specific to its application. The configuration file for the Minnesota bills application follows:

[File: collection.xconf for Minnesota bills.]

9 “An XML Schema describes the structure of an XML document.” W3C Schools. XML Schema Tutorial. 2010. http://www.w3schools.com/schema/default.asp

The file above tells the system how to treat the various XML tags found in a Minnesota bill data file.
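In outline, such a configuration file takes the following shape. This is a minimal sketch: the qname values here are illustrative examples, not the actual Minnesota bill tag set.

```xml
<!-- Illustrative collection.xconf sketch; qname values are examples only -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <!-- full-text index the bill title, boosted above other fields -->
            <text qname="btitle_summary" boost="2.0"/>
            <!-- full-text index the bill element as a whole -->
            <text qname="bill"/>
            <!-- skip a tag that holds the same content in every document -->
            <ignore qname="boilerplate"/>
        </lucene>
    </index>
</collection>
```

Each child of `lucene` names a tag and says how it should be indexed; these entries correspond to the Default/Text/In-Line/Ignore/N-Gram/Boost parameters described in the Index Advisor section below.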

To make things easier for non-programmers who may have to modify the indexing parameters, Syntactica created the Index Advisor, a tool that can be used to edit index configuration files and reindex collections.

This is the first screen of the Index Advisor. Through this screen, users can select a configuration file (Path) and choose ‘Reindex Now’, ‘Estimate Time to ReIndex’, or ‘Schedule Reindex Time’; or, by selecting the ‘Edit Index’ button, use the web interface to edit the indexes for the selected collection. (As this is a proof-of-concept application, users currently cannot schedule a time to reindex the files, use the login, or actually modify the configuration file on the next screen.)

The above image was created with the following file:

[File: ix-advisor.xq]

xquery version "1.0";
declare namespace xrx='http://code.google.com/p/xrx';
import module namespace style='http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm';
declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

declare function local:sitemap($collection as xs:string) as node()* {
    if (empty(xmldb:get-child-collections($collection)))
    then ()
    else
        for $child in xmldb:get-child-collections($collection)
        return (local:config(concat($collection, '/', $child)),
                local:sitemap(concat($collection, '/', $child)))

};

declare function local:config($collection as xs:string) {
    for $child in xmldb:get-child-resources($collection)
    order by $child
    return
        if (ends-with($child, '.xconf'))
        then
        else {$child}
};

let $login := xmldb:login('/db',"admin","admin123")
let $collectionpath := '/db/system/config'
return

Index Advisor {style:import-css()}

{style:header()} {style:breadcrumb()}

{style:app-name()}

Login



Collection



{style:footer()}

After a user selects a configuration file (by using the drop-down menu next to Path) and presses the “Edit Index” button, the Index Advisor brings up a screen that looks like the following:

This screen displays the XML tags that can be used to set indexing parameters on the left side of the screen, with the indexing parameters themselves along the top of the screen in the blue box. Users can edit the indexing parameters by selecting a radio button that corresponds to how the tag should be indexed. Boost values can also be set here. Boosting a value increases its search result ranking; the higher the boost number the more relevant the search result is and the higher up the result will be on the search result page. A ‘Reindex’ button at the bottom of the screen, in a fully developed application, would change the indexing parameters and reindex the collection.

The indexing parameters are as follows:

• Default: in eXist 1.3 the default was full-text indexing; to reduce the size of the index files, the default for eXist 1.4 is to use eXist’s structural indexes, which index how things are classified rather than the full text.
• Text: full-text index using standard word extraction for the selected element.
• In-Line: a way to ignore the style markup and index the content as text no matter how it is visually marked up (strikethrough, bold, or underlined).
• Ignore: the element should not be indexed (useful for tags that contain the same information in each document).
• N-Gram: an index that allows exact matches of strings that include spaces and punctuation; used for indexing software code samples and DNA strings.
• Boost: set a boost score for each tag; tags with a higher number will appear at the top of the search results.

The screen above was created with the following file:

[File: ix-advisor-edit.xq]

xquery version "1.0"; import module namespace style='http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm'; (:import module namespace functx= "http://www.functx.com";:) declare namespace conf="http://exist-db.org/collection-config/1.0"; declare default element namespace "http://exist-db.org/collection-config/1.0"; declare option exist:serialize "method=xhtml media-type=text/html";

(:~
 : All XML elements that don't have any child elements
 :
 : @author Priscilla Walmsley, Datypic
 : @version 1.0
 : @see http://www.xqueryfunctions.com/xq/functx_leaf-elements.html
 : @param $root the root
 :)
declare function local:leaf-elements ($root as node()?) as element()* {
    $root/descendant-or-self::*[not(*)]
};

(:~
 : The distinct names of all elements in an XML fragment
 :
 : @author Priscilla Walmsley, Datypic
 : @version 1.0
 : @see http://www.xqueryfunctions.com/xq/functx_distinct-element-names.html
 : @param $nodes the root(s) to start from
 :)
declare function local:distinct-element-names ($nodes as node()*) as xs:string* {
    distinct-values($nodes/descendant-or-self::*/name(.))
};

(: Get the entry type and boost for the current tag :)
declare function local:test-tag($configfile as node()*, $tag as xs:string) as xs:string {
    let $tagvar :=
        if (local:test-text($configfile, $tag)) then (local:test-text($configfile, $tag))
        else if (local:test-inline($configfile, $tag)) then (local:test-inline($configfile, $tag))
        else if (local:test-ignore($configfile, $tag)) then (local:test-ignore($configfile, $tag))
        else if (local:test-ngram($configfile, $tag)) then (local:test-ngram($configfile, $tag))
        else "none = "
    return $tagvar
};

(: Check the inline value if it exists for this tag :)
declare function local:test-inline($configfile as node()*, $tag as xs:string) as xs:string* {
    for $inline in $configfile//inline
    return
        if ($inline/@qname = $tag)
        then concat("inline = ", $inline/@boost)
        else ()
};

(: Check the text value and boost if they exist for this tag :)
declare function local:test-text($configfile as node()*, $tag as xs:string) as xs:string* {
    for $text in $configfile//text
    return
        if ($text/@qname = $tag)
        then concat("text = ", $text/@boost)
        else ()
};

(: Check the ignore value if it exists for this tag :)
declare function local:test-ignore($configfile as node()*, $tag as xs:string) as xs:string* {
    for $ignore in $configfile//ignore
    return
        if ($ignore/@qname = $tag)
        then concat("ignore = ", $ignore/@boost)
        else ()
};

(: Check the ngram value if it exists for this tag :)
declare function local:test-ngram($configfile as node()*, $tag as xs:string) as xs:string* {
    for $ngram in $configfile//ngram
    return
        if ($ngram/@qname = $tag)
        then concat("ngram = ", $ngram/@boost)
        else ()
};

(: Check whether this tag's value is checked :)
declare function local:test-checked($tagvar as xs:string, $id as xs:string) as xs:string {
    if (contains($tagvar,$id)) then "checked" else "unchecked"
};

(: Check whether this tag's value is selected :)
declare function local:test-selected($tagvar as xs:string, $id as xs:string) as xs:string {
    if (contains($tagvar,$id)) then "selected" else "unselected"
};

(: Determine the analyzer class :)
declare function local:analyzer($configfile as node()*) as xs:string {
    for $analyzer in $configfile//analyzer
    return string($analyzer/@class)
};

declare function local:words($leaves as xs:string*, $configfile as node()* ) { for $word in $leaves let $checked := "checked" let $tagvar := local:test-tag($configfile, $word) return( {concat("<",$word,">")} {attribute {local:test-checked($tagvar,"none")} {local:test-checked($tagvar,"none")}}

{attribute {local:test-checked($tagvar,"text")} {local:test-checked($tagvar,"text")}}

{attribute {local:test-checked($tagvar,"inline")} {local:test- checked($tagvar,"inline")}}

{attribute {local:test-checked($tagvar,"ignore")} {local:test- checked($tagvar,"ignore")}}

{attribute {local:test-checked($tagvar,"ngram")} {local:test- checked($tagvar,"ngram")}}

) };

(: Get parameters from the RESTful interface :)
let $name := xs:string(request:get-parameter("name", ""))
let $password := xs:string(request:get-parameter("password", ""))
(:let $reindex := xs:string(request:get-parameter("reindex", "")):)
let $indexpath := xs:string(request:get-parameter("indexpath", ""))
let $collectionpath := $indexpath
let $systempath := doc(concat('/db/system/config',$collectionpath,'/collection.xconf'))
let $confpath := concat('/db/system/config',$collectionpath,'/collection.xconf')
let $login := xmldb:login('/db',"admin","admin123")
(:let $configfile := doc('/db/system/config/db/cust/mhs/apps/mn-bills/data/collection.xconf')/conf:collection:)
let $configfile := doc(concat('/db/system/config',$indexpath,'/collection.xconf'))/conf:collection
(:let $tag := "btitle_action":)

(:let $tagvar := local:test-tag($configfile, $tag):)

(: Determine unique leaves :)
(:let $file := doc('/db/cust/mhs/apps/mn-bills/data/s2861-1.xml'):)
let $datafile := xmldb:get-child-resources($indexpath)[1]
let $file := doc(concat($indexpath,'/',$datafile))
let $leaves := (local:distinct-element-names(local:leaf-elements($file)))
(:let $analyzer := local:analyzer($configfile):)
(:let $leaf := functx:atomic-type($leaves):)
return

Index Advisor: Edit {style:import-css()}

{style:header()} {style:breadcrumb()}

{style:app-name()}

Index Advisor Edit


{local:words($leaves,$configfile)}

Data Tag

Default Text In-line Ignore Ngram Boost


{style:footer()}

The file that controls the actual reindexing (which currently only works from the first screen on the Index Advisor) is shown below.

[File: reindex.xq]

xquery version "1.0";
import module namespace style='http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm';
declare option exist:serialize "method=xhtml media-type=text/html";

(: Index Advisor
   This application re-indexes a collection and returns the number of files
   contained in the collection, the total collection size in megabytes, the
   length of time it took to re-index the collection, and the indexing rate
   expressed in megabytes per minute. The collection may contain any number
   of subdirectories. :)

declare function local:format-number($n as xs:decimal, $s as xs:string) as xs:string {
    (: formats $n with picture string $s by invoking XSLT's format-number();
       the inline stylesheet markup was elided in this copy and is reconstructed
       approximately :)
    string(transform:transform(
        <dummy/>,
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
            <xsl:template match="/">
                <xsl:value-of select="format-number({$n}, '{$s}')"/>
            </xsl:template>
        </xsl:stylesheet>,
        ()
    ))
};

declare function local:sitemap($collection as xs:string) as node()* { if (empty(xmldb:get-child-collections($collection))) then () else

for $child in xmldb:get-child-collections($collection) return

{concat($collection, '/', $child)} {local:sitemap(concat($collection, '/', $child))}

};

(: Determine collection size :)
declare function local:directorysize($directorypath as xs:string) as xs:integer {
    sum(
        for $file in xmldb:get-child-resources($directorypath)
        return xmldb:size($directorypath, $file)
    )
};

declare function local:collectionsize($directorypath as xs:string) as xs:integer {
    sum(
        for $directorypath in local:sitemap($directorypath)
        return local:directorysize($directorypath)
    )
};

(: Time the reindex operation :)
declare function local:timedreindex($collectionpath as xs:string) as xs:integer {
    let $login := xmldb:login('/db',"admin","admin123")
    let $start := util:system-time()
    let $temp := xmldb:reindex($collectionpath)
    let $end := util:system-time()
    let $runtimems := ($end - $start) div xs:dayTimeDuration('PT1S') (:* 1000 :)
    return $runtimems
};

(: Get parameters from the RESTful interface :)
let $name := xs:string(request:get-parameter("name", ""))
let $password := xs:string(request:get-parameter("password", ""))
let $reindex := xs:string(request:get-parameter("reindex", ""))
let $indexpath := xs:string(request:get-parameter("indexpath", ""))
let $collectionpath := $indexpath
let $systempath := doc(concat('/db/system/config',$collectionpath,'/collection.xconf'))
let $confpath := concat('/db/system/config',$collectionpath,'/collection.xconf')
let $login := xmldb:login('/db',"admin","admin123")

(: Determine collection size :)
let $subsize := local:collectionsize($collectionpath)

(: Determine collection count :)
let $count := count(collection($collectionpath)/element())

(: Time or estimate the time for the reindex operation :)
let $runtimems :=
    if ($reindex = 'reindex')
    then local:timedreindex($collectionpath)
    else (($subsize div 371500)+1)
return

Index Advisor {style:import-css()}

{style:header()} {style:breadcrumb()}

{style:app-name()}

Statistics for {$indexpath}

Estimated reindexing time = {local:format-number(($subsize div 22290000),'#,###.00')} minutes

Number of files = {local:format-number($count,'#,###')}

Total Collection Size = {local:format-number($subsize div 1000000,'#,###.00')} megabytes

Index time = {local:format-number(($runtimems div 60), '#,###.00')} minutes

Reindexing rate = {local:format-number(($subsize div 1000000) div (($runtimems+1) div 60),'#,###.00')} megabytes per minute










{$confpath}

{data($systempath/collection)}

{style:footer()}

9. Uploading Files

The Uploader Tool allows users to upload files through a web interface, as seen in the screen shot below.

The index.xq file displays the screen and, along with the widgets.xq and widgets.js files, runs the upload process. The files are shown below.

[File: index.xq for Uploader]

xquery version "1.0"; import module namespace style='http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm'; import module namespace widgets="http://www.mhs.org/widgets" at "/db/cust/mhs/apps/uploader/widgets.xq"; declare option exist:serialize "method=xhtml media-type=text/html omit-xml-declaration=yes indent=yes";

File Upload Example Module {style:import-css()}

{style:header()} {style:breadcrumb()}

File Upload Example Module

This example illustrates multiple file upload capability.

{ widgets:uploadProcessor( "/db/cust/mhs/apps/uploader/uploads", "xml;xsl;xsd;txt;xq;css;xhtml;html;htm;svg;properties;js", "jpg;png;gif;doc;xls;ppt;pdf;zip", "fileCt", "uploader", "retainStructure") } {style:footer()}

[File: widgets.js for Uploader]

function addFileUpload(me) {
    var pt = me.parentNode;
    var inputNodes = pt.getElementsByTagName("input");
    var fileCt = inputNodes.length - 3;
    var dupNode = inputNodes[0].cloneNode(true);
    pt.insertBefore(dupNode, inputNodes[fileCt]);
    pt.insertBefore(document.createElementNS("http://www.w3.org/1999/xhtml", "br"), dupNode);
    dupNode.setAttribute("name", "uploader" + (fileCt + 1));
    dupNode.value = "";
    var fileCtNode = document.getElementById("fileCt");
    fileCtNode.value = 1 + parseInt(fileCtNode.value);
    var container = document.getElementById("container");
}

[File: widgets.xq for Uploader]

module namespace widgets = "http://www.mhs.org/widgets";
declare namespace request = "http://exist-db.org/xquery/request";
declare namespace xmldb = "http://exist-db.org/xquery/xmldb";
declare namespace util = "http://exist-db.org/xquery/util";
declare namespace compression = "http://exist-db.org/xquery/compression";

declare function widgets:uploadProcessor(
    (: The path to the collection that holds the user collections :)
    $collection-base-name as xs:string,
    (: A semi-colon delimited string listing valid text format extensions :)
    $valid-text-list as xs:string,
    (: A semi-colon delimited string listing valid binary format extensions :)
    $valid-binary-list as xs:string,
    (: The name of the counter variable used in the component :)
    $file-counter-name as xs:string,
    (: The base name of the input control used in the component :)
    $upload-control-base-name as xs:string,
    (: The name of the control to retain zip structure :)
    $retain-structure as xs:string
) as node() {

(: Retrieve the number of items being uploaded from the variable named by $file-counter-name :)
let $fileCt := xs:integer(request:get-parameter($file-counter-name,0))

let $clear-files := request:get-parameter("clearFiles","false")

(: set up a list of valid text or xml formats :) let $valid-text-seq := if ($valid-text-list != "") then tokenize($valid-text-list,";") else ("xml","xsd","xsl","xhtml","html","htm","txt","xq","xp","css")

(: set up a list of valid binary formats :) let $valid-binary-seq := if ($valid-binary-list != "") then tokenize($valid-binary-list,";") else ("jpg","png","gif","doc","ppt","xls")

let $retain-structure-value := request:get-parameter($retain-structure,"false")

(: Get the name of the currently logged in user :) let $user := xmldb:get-current-user()

(: Log in as the administrator - NOTE that this is a KLUDGE until LDAP support :) let $login := xmldb:login($collection-base-name,"admin","admin123")

(: Derive the new collection name from the base name :) let $new-collection-name := concat($collection-base-name,"/",$user) let $clear-collection := if ($clear-files = "true") then xmldb:remove($new-collection-name) else () (: If the collection does not already exist, create it :) let $make-collection := if (not(xmldb:collection-available($new-collection-name)) or ($clear- files="true")) then xmldb:create-collection($collection-base-name,$user) else ()

(: Iterate through uploader1, uploader2, etc., skipping over blank entries :) let $files := for $index in (1 to $fileCt) return let $paramName := concat($upload-control-base-name,$index) (: Get the associated filename :) let $filename :=request:get-uploaded-file-name($paramName) let $filename := translate($filename," ","_") let $extension := lower-case(tokenize($filename,"\.")[last()]) (: Process the files :) let $action := (: If the filename is given :) if ($filename != "") then (: extract the appropriate data block from it :) let $data := request:get-uploaded-file-data($paramName) return (: if the filename is a zip file :) if (($extension="zip")) (: uncompress it into the new collection :) then (: first create new folders :) let $folders := compression:unzip( $data, util:function(xs:QName("local:unzip-folder-filter"),3), ($retain-structure-value,$new-collection-name,string-join($valid-text- seq,','),string-join($valid-binary-seq,',')), util:function(xs:QName("local:unzip-entry-data"),4), ($retain-structure-value,$new-collection-name,string-join($valid-text- seq,','),string-join($valid-binary-seq,','))) (: then create new files :) let $files := compression:unzip( $data, util:function(xs:QName("local:unzip-entry-filter"),3), ($retain-structure-value,$new-collection-name,string-join($valid-text- seq,','),string-join($valid-binary-seq,',')), util:function(xs:QName("local:unzip-entry-data"),4), ($retain-structure-value,$new-collection-name,string-join($valid-text- seq,','),string-join($valid-binary-seq,','))) let $store := xmldb:store($new-collection-name,$filename,$data) (: return the link to the zip file itself :) return ({$filename},$files) (: otherwise if it's a text or xml type :) else if (index-of($valid-text-seq,$extension)>0) (: then convert to text and store it :) then let $proc := xmldb:store($new-collection-name,$filename,util:binary-to- string($data)) return {$filename} (: otherwise, store it as binary :) else (: if it is a valid binary type :) if 
(index-of($valid-binary-seq,$extension)>0) (: store it as binary :) then let $proc := xmldb:store($new-collection-name,$filename,$data) return {$filename} (: otherwise decline to store it :) else {$filename} (: otherwise assume that it was an inadvertantly created entry with no content :) else () (: create a record indicating the filename as part of the iterated sequence :) return $action

(: Generate a report :) return

{style:header()} {style:breadcrumb()}

{$title}

Currently logged in as: {xmldb:get-current-user()}. Hold down the Control key while dragging to copy the file

You are currently in MOVE mode

Source Tree


{style:footer()}

11. Checksum Application

The list-items.xq file for the checksum application computes checksums through eXist and displays the hash value for each file. If a hash value has been wrapped around a file's content, eXist verifies that the stored and computed hashes match and displays a Pass or Fail next to the file. The list-items.xq file is shown below.

[File: list-items.xq for the Checksum Application]

xquery version "1.0";

import module namespace style = 'http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm';

declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

let $app-collection := style:app-base-uri()
let $data-collection := concat($app-collection, '/data')
return

List Hashes of Test Files {style:import-css()}

{style:header()} {style:breadcrumb()}

List Items

File MD5 Verified View XML
{
for $file-name at $count in xmldb:get-child-resources($data-collection)
let $full-path := concat($data-collection, '/', $file-name)
let $doc := doc($full-path)
let $verified :=
    if ($doc/*/md5)
    then (
        if ($doc/*/md5/text() = util:hash($doc/*/root, 'md5'))
        then PASS
        else FAIL
    )
    else ()
order by $file-name
return
    {$file-name} {util:hash($doc, 'md5')} {$verified} View XML
}

{style:footer()}
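For context, the verification branch in list-items.xq expects each stored document to carry an md5 element alongside a root element that wraps the hashed content. A hypothetical sketch of how such a wrapped document could be created and stored (the collection path, file name, and payload are illustrative only):

```xquery
xquery version "1.0";

declare namespace util = "http://exist-db.org/xquery/util";
declare namespace xmldb = "http://exist-db.org/xquery/xmldb";

(: Illustrative sketch: wrap a payload in a <root> element and store its
   MD5 hash next to it, in the shape that list-items.xq checks for. :)
let $payload := <root><message>hello</message></root>
let $wrapped :=
    <item>
        <md5>{util:hash($payload, 'md5')}</md5>
        {$payload}
    </item>
(: The target collection and file name below are examples only. :)
return xmldb:store('/db/cust/mhs/apps/checksum/data', 'example.xml', $wrapped)
```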

12. Stress Test Tool

The code for performing the stress test itself is shown below.

[File: execute-test.xq for Stress Test Tool]

xquery version "1.0";

import module namespace style = 'http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm';

declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

let $id := request:get-parameter('id', '')

(: like "localhost" :)
let $server-name := request:get-server-name()

(: like "8080" :)
let $port := xs:string(request:get-server-port())
let $new-port := if ($port = "80") then '' else concat(':', $port)

(: like "/" or "/exist" :)
let $web-context := request:get-context-path()
let $prefix := concat('http://', $server-name, $new-port, $web-context)
let $title := 'Execute Test'

(: check for required parameters :)
return
if (not($id))
then (
    Parameter "id" is missing. This argument is required for this web service.
)
else

let $app-collection := style:app-base-uri()
let $data-collection := concat($app-collection, '/data')
let $item := collection($data-collection)/stress-test[id = $id]
let $search-url := $item/search-url/text()
let $pause-interval-ms := $item/pause-interval-ms/text()
let $batch-start-time := util:system-time()
return

Execute Stress Test {style:import-css()}
{style:header()} {style:breadcrumb()}

Stress Test Results

Field Value
Name: {$item/name/text()}
Category: {$item/category/text()}
Search URL: {$search-url}
Status: {$item/status/text()}
Tag: {$item/tag/text()}

{
for $query in $item/search-queries/query
let $query-url := xs:anyURI(concat($prefix, $search-url, $query))
return
    Test: {$query/text()}
    {
    let $start-time := util:system-time()
    let $search-result := httpclient:get($query-url, false(), ())
    let $end-time := util:system-time()
    let $runtimems := (($end-time - $start-time) div xs:dayTimeDuration('PT1S')) * 1000
    (: put in the code to wait/pause/sleep for $pause-interval-ms :)
    return
        Result in {$runtimems} milliseconds.
    }
}

{
let $batch-end-time := util:system-time()
let $batch-runtimems := (($batch-end-time - $batch-start-time) div xs:dayTimeDuration('PT1S')) * 1000
return
    Total Batch Run Time = {$batch-runtimems}
}

Rerun Tests

Edit Item Delete Item

{style:footer()}
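The listing above leaves the pause between queries as a comment. One possible way to fill it in, assuming eXist's util:wait extension function (which sleeps for a given number of milliseconds) is available in the build being used:

```xquery
(: Hypothetical sketch: sleep between stress-test queries.
   Assumes util:wait (milliseconds) exists in this eXist build. :)
let $pause :=
    if ($pause-interval-ms castable as xs:long)
    then util:wait(xs:long($pause-interval-ms))
    else ()
return $pause
```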

13. Dublin Core

The Dublin Core application shows how the Zotero Firefox plug-in can import metadata about a document directly into a user's bibliographic tools. It was included to show how different transforms can embed standard metadata in a document. The file below transforms an HTML view of a record into a view that uses HTML 4 metadata tags, which allows Zotero to pull bibliographic information from the file through Firefox. (Other examples are available.)

[File: view-html-4.xq for Dublin Core Application]

xquery version "1.0";

import module namespace style = 'http://www.mnhs.org/style' at '/db/cust/mhs/modules/style.xqm';

declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

let $title := 'Zotero Demo Using HTML4 Metadata Tags'

let $id := request:get-parameter('id', '')

(: check for required parameters :)
return
if (not($id))
then (
    Parameter "id" is missing. This argument is required for this web service.
)
else

let $app-collection := style:app-base-uri()
let $data-collection := concat($app-collection, '/data')
let $item := collection($data-collection)/item[id = $id]
return

{$item/title/text()}

{style:import-css()}

{style:header()}
{style:breadcrumb()}

{$title}

ID: {$item/id/text()}
Title: {$item/title/text()}
Creator: {$item/creator/text()}
Subject: {$item/subject/text()}
Description: {$item/description/text()}
Publisher: {$item/publisher/text()}
Contributor: {$item/contributor/text()}
Date: {$item/date/text()}
Type: {$item/type/text()}
Format: {$item/format/text()}
Identifier: {$item/identifier/text()}
Source: {$item/source/text()}
Language: {$item/language/text()}
Relation: {$item/relation/text()}
Coverage: {$item/coverage/text()}
Rights: {$item/rights/text()}

Edit Item Delete Item
{style:footer()}
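The part of the page that Zotero actually reads is a set of HTML 4 meta tags in the document head, following the common Dublin Core meta-tag convention (DC.* names plus a schema.DC link). A minimal sketch of how the transform could emit them, showing only a few of the fifteen elements:

```xquery
(: Sketch: Dublin Core metadata expressed as HTML 4 meta tags. :)
<head>
    <title>{$item/title/text()}</title>
    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/"/>
    <meta name="DC.title" content="{$item/title/text()}"/>
    <meta name="DC.creator" content="{$item/creator/text()}"/>
    <meta name="DC.date" content="{$item/date/text()}"/>
    <meta name="DC.identifier" content="{$item/identifier/text()}"/>
</head>
```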

14. Exporting Methods

The following provides more details on the methods proposed for exporting files from eXist to other systems.

1. View and Save XML File with Browser: While viewing an XML file in a browser, click File -> Save As on the File or Page menu (menu locations vary from browser to browser), and the file will be saved to the selected location. For files with namespaces, use "View source," which also shows the XML elements and namespaces in the document. (Perfect for saving one file at a time.)

2. WebDAV: Open a collection using a WebDAV client, browse to the application, then drag and drop files from a collection to the desktop or another folder location. (For moving multiple files or collections quickly.)

3. XQuery Dump: Create a custom XQuery and link it to a button such as "Dump All" in the application. The XQuery would be similar to the code used in the FAQ collection:

    { for $item in collection('/db/cust/mhs/apps/faq/data') return $item }

This method is ideal for getting all of the data from a single collection to another location. If filters are added to the XQuery, users could choose to receive only the most recent items, or the items that have changed in the last month.
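A sketch of such a filtered dump, assuming each item carries a last-modified element holding an xs:dateTime value (that element name is an assumption, not part of the FAQ schema shown elsewhere):

```xquery
xquery version "1.0";

(: Sketch: export only the items changed in roughly the last month. :)
<dump>{
    for $item in collection('/db/cust/mhs/apps/faq/data')/item
    (: last-modified is an assumed element name :)
    where xs:dateTime($item/last-modified) gt
          current-dateTime() - xs:dayTimeDuration('P30D')
    return $item
}</dump>
```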

4. XQuery Dump with Compression: The same actions as above, but the files are compressed into a .zip file by adding a line such as return compression:zip($dump, true()) to the XQuery code. Note that the compression:zip function returns a binary file that can be downloaded directly by the client, and the collection hierarchy is maintained when the second parameter is set to true. Functions can also be written to compress an entire collection and its sub-collections.
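A sketch of a whole-collection export under this approach, assuming eXist's compression module and response:stream-binary function; the download file name is illustrative:

```xquery
xquery version "1.0";

declare namespace compression = "http://exist-db.org/xquery/compression";
declare namespace response = "http://exist-db.org/xquery/response";

(: Sketch: zip the FAQ data collection, preserving the collection
   hierarchy (second parameter = true), and stream it to the client. :)
let $zip := compression:zip(xs:anyURI('/db/cust/mhs/apps/faq/data'), true())
return response:stream-binary($zip, 'application/zip', 'faq-dump.zip')
```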

5. Custom Atom Feed Using XQuery: This method requires the user to create an Atom feed using XQuery. The user writes a query that emits XML tags conforming to the Atom specification; this can be done with an XQuery template file for Atom feeds. The tags include information such as a document's id, title, last-updated datetime, link, and summary.
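A sketch of such a template, built over the FAQ collection used in earlier examples; the last-updated and summary element names (and the urn:faq: id scheme) are assumptions:

```xquery
xquery version "1.0";

declare namespace atom = "http://www.w3.org/2005/Atom";
declare option exist:serialize "method=xml media-type=application/atom+xml";

(: Sketch: a minimal Atom feed built from the FAQ collection. :)
<atom:feed>
    <atom:title>FAQ Items</atom:title>
    <atom:updated>{current-dateTime()}</atom:updated>
    {
    for $item in collection('/db/cust/mhs/apps/faq/data')/item
    return
        <atom:entry>
            <atom:id>urn:faq:{$item/id/text()}</atom:id>
            <atom:title>{$item/title/text()}</atom:title>
            <atom:updated>{$item/last-updated/text()}</atom:updated>
            <atom:summary>{$item/summary/text()}</atom:summary>
        </atom:entry>
    }
</atom:feed>
```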

6. Use the Built-in Atom Service: This method is a variation of the Atom feed approach. The service is built into eXist but must be configured using XML configuration files. The user simply specifies which collection to use, and the service automatically creates an Atom feed. The advantage is that no custom XQuery scripts need to be written; the disadvantage is that there is no control over how custom fields map to the Atom elements.

7. Use a SOAP Service: SOAP is another standard service that has been integrated with eXist in the past. Large corporations with big enterprise frameworks that do not leverage the caching structure of the World Wide Web sometimes use SOAP protocols for serializing XML data, because SOAP has a rich set of header specifications for storing digital signatures and document-messaging information.

8. Use XQuery to Write an XHTML Report: Users can write a query that creates XHTML output and displays it on an HTML page.
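A minimal sketch of such a report query; the collection path and field names follow the FAQ examples used earlier:

```xquery
xquery version "1.0";

declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

(: Sketch: list every FAQ title as an XHTML report. :)
<html>
    <body>
        <h1>FAQ Export Report</h1>
        <ul>{
            for $item in collection('/db/cust/mhs/apps/faq/data')/item
            order by $item/title
            return <li>{$item/title/text()}</li>
        }</ul>
    </body>
</html>
```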

9. Use XQuery to Create a Comma-Separated Values (CSV) File: perfect for use with Excel.
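A sketch of a CSV export, again over the FAQ collection; the selected fields are illustrative, and real data would need embedded commas and quotes escaped:

```xquery
xquery version "1.0";

declare option exist:serialize "method=text media-type=text/csv";

(: Sketch: a header row, then one comma-separated line per FAQ item. :)
string-join(
    (
    'id,title,date',
    for $item in collection('/db/cust/mhs/apps/faq/data')/item
    return string-join(($item/id, $item/title, $item/date), ',')
    ),
    '&#10;')
```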

10. Convert XML into PDF Using XSL-FO Libraries: Just as HTML files can be renderings of XML source documents, PDF renderings can also be created using an XML markup format called XSL-FO (Formatting Objects). There are several examples of other output formats in the XQuery wikibook10.

10Wikibooks. XQuery. February 26, 2010. http://en.wikibooks.org/wiki/XQuery

15. Migration to XML Formats

More and more tools are available for migrating non-XML files into XML. To import such files into a native-XML system, filters and tools are needed to manage the document metadata and recreate the original document. This is, in general, easier to do with more recent files that are XML-ready.

For files that are not XML-ready, tools may be available to assist with transformations. The Apache POI project11 provides a library of tools (filters) to extract text from older Microsoft binary document formats. The Apache POI project provides filters for the following document types:

• Excel (SS=HSSF+XSSF)
• Word (HWPF+XWPF)
• PowerPoint (HSLF+XSLF)
• OpenXML4J (OOXML)
• OLE2 Filesystem (POIFS)
• OLE2 Document Props (HPSF)
• Outlook (HSMF)
• Visio (HDGF)
• Publisher (HPBF)

PDF files are very common and pose another concern. There are open-source tools that extract text from PDF documents. For example, the PDFBox12 tools attempt to provide high-quality text extraction, and PDFBox is used by many Lucene systems. With Java programming skills these tools can also be customized using the PDFTextStripper13 Java classes. A simple XHTML format14 can also be created with PDFBox and stored in a native XML database.

16. Role-Based Access Control

eXist, as it stands, provides adequate group-based access control15, but it does not scale well to a large number of users and a large number of applications. The most scalable authorization model available is the Role-Based Access Control (RBAC) model16.

11 Apache. The Apache POI Project. 2009. http://poi.apache.org/; In addition, the Apache project provides a series of Case Studies that show how these filters are used in real systems: Apache. The Apache POI Project: Case Studies. 2009. http://poi.apache.org/casestudies.html
12 The Apache Software Foundation. PDFBox. 2010. http://www.pdfbox.org
13 PDFBox. Class PDFTextStripper. http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html
14 PDFBox. Class PDFText2HTML. http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFText2HTML.html
15 Meier, Wolfgang M. and Loren Cahlander. Resource Permissions. eXist. November 2009. http://exist.sourceforge.net/security.html#permissions
16 Wikipedia. Role-Based Access Control. January 30, 2010. http://en.wikipedia.org/wiki/Role-based_access_control

The current version of eXist does not easily support role-based access control without complex hand-editing of XML configuration files for a system known as XACML17. To make the system more user-friendly, an application would need to be developed that allows RBAC rules to be customized.

For this to be implemented, each application would need to define a series of "actions" such as:
• Create new content
• Edit content
• Approve content
• Publish content

Each project would then define a set of "roles" for its site and associate individual users with one or more roles. A Role Manager application would then show a simple checkbox table, with the vertical axis listing the application's actions and the horizontal axis listing the roles for that project. An example report from this data is shown in the figure below.

FAQ Editor        Contributor   Editor   Approver   Publisher
FAQ Admin         YES           YES      YES        YES
Create new FAQ    YES           YES
Edit FAQ                        YES
Approve FAQ                              YES
Publish FAQ                                         YES

Figure 1: Sample Interface for Setting FAQ Actions for Site Roles
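One way the Role Manager's checkbox table could be represented and queried is sketched below; the element names and the role/action strings are hypothetical illustrations, not part of any existing eXist configuration:

```xquery
xquery version "1.0";

(: Hypothetical sketch: RBAC data for one application, plus a test of
   whether a user holding a given role may perform a given action. :)
let $rbac :=
    <application name="faq">
        <role name="contributor"><action>create-new-faq</action></role>
        <role name="editor">
            <action>create-new-faq</action>
            <action>edit-faq</action>
        </role>
        <role name="approver"><action>approve-faq</action></role>
        <role name="publisher"><action>publish-faq</action></role>
    </application>
let $user-roles := ('editor')  (: roles held by the current user :)
return exists($rbac/role[@name = $user-roles]/action[. = 'edit-faq'])
```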

17. Scalability

Legislative data is constantly being produced. For eXist to be useful as a tool that provides access to legislative materials, the system must be able to scale to multi-terabyte sizes. Currently there are no strong examples of multi-terabyte deployments of eXist, but there are examples using MarkLogic (a commercial native-XML database).

Like many relational database systems, eXist stores its low-level data structures in files called B-Plus tree files, or B+Trees for short. To get eXist to scale to multi-terabyte document collections, the system architect must understand the tradeoffs of scaling B+Tree-based data stores. eXist's contribution to high-performance XML data retrieval is a node-ID system that lends itself to very fast bitmap operations on these B+Tree structures.18 For most systems the dom.dbx file grows the fastest and is the first to span multiple drives.

Most of the other components of eXist can be set up to scale evenly if the B+Tree searches span multiple terabytes. So the real question is how to scale B+Tree backend systems to allow eXist to scale. Luckily, other database systems such as RDF triple stores also use B+Trees and have similar scalability needs.

17 Harrah, Mark. Access Control in eXist. eXist. September 2009. http://exist.sourceforge.net/xacml.html
18 Meier, Wolfgang M. Index-driven XQuery processing in the eXist XML database. eXist. March 2006. http://www.exist-db.org/xmlprague06.html

Some possibilities for scaling open-source products to the same level at low cost are described below.

Cluster Configurations: The eXist system was originally designed as a single-CPU system accessing a single file system. Organizations have successfully used multi-CPU clustered19 versions of eXist on read-mostly collections of data. Central to cluster configuration management is the maintenance of cluster configuration files and journal management20.

Extending eXist with Storage Area Network (SAN) Configurations: Many of the size limitations of eXist systems come down to the limits of storing very large B+Tree files on a single hard drive. If a single index grows beyond the capacity of one drive, no further information can be added; the central file to monitor is dom.dbx. When this file exceeds the size of a single hard drive, an alternative storage system must be used. A simple alternative is a third-party storage area network (SAN), which gives the system the appearance of a single logical drive spanning many physical drives.

Migrating to Distributed B+Tree Backends with BigData: BigData21 is an open-source distributed B+Tree backend that allows for greater scalability. This software is becoming much faster due to the availability of many low-cost 64-bit systems22. BigData is compatible with eXist, and work is being done to possibly integrate it into the eXist 1.5 release.

19 Piranha Group. Cluster Configuration Environment. eXist. September 2009. http://www.exist-db.org/cluster.html
20 Piranha Group. Cluster Journal Description. September 2009. http://www.exist-db.org/journal.html
21 Systap, LLC. Bigdata. 2010. http://www.systap.com/bigdata.htm
22 Thomson, Bryan et al. B+Tree Compression and Buffering. BigData Blog. August 19, 2009. http://www.bigdata.com/blog/2009/08/btree-compression-and-buffering.html