Using MARCXML for Archiving, Transforming, and Displaying Complex Bibliographic Citation Metadata -- A Surprisingly Flexible & Robust Option

I. Why bother?
a. Data variety
1. Lack of standardization from vendors. Vendors include Thomson/ISI, IEEE/IEE, Dept. of Energy, Elsevier, Springer, Wiley, etc. Data differs across year ranges (early 1800s to the present) within any given vendor. New feeds are starting to arrive in XML, but there is no standard Schema/DTD.

2. Examples from different vendors
i. Figure I-a-2-i - Same article metadata record from two different vendors (IEEE and Elsevier)

Figure I-a-2-i - INSPEC (from IEEE) XML metadata

7923447 2004 015 2004 IEE A637-A3007-A003 A637-2003-007-A003 C2004-05-1290D-151 02 21 Journal Paper Optimal portfolio choice for unobservable and regime-switching mean returns Honda T. Graduate Sch. of Int. Corp. Strategy, Hitotsubashi Univ., Tokyo Japan A637 Journal of Economic Dynamics and Control J. Econ. Dyn. Control (Netherlands) vol.28,no.1 28 1 45-78 10 2003 Oct. 2003 Elsevier Netherlands JEDCDH 0165-1889 0165-1889(200310)28:1L.45:OPCU;1-M 0165-1889/03/$30.00 S0165-1889(02)00106-9 10.1016/S0165-1889(02)00106-9

We study dynamic optimal consumption and portfolio choice for a setting in which the mean returns of a risky asset depend on an unobservable regime variable of the economy, which is defined as a continuous-time Markov chain. The investor estimates the current regime by observing past and present asset prices. We compute the optimal consumption and portfolio policies of an investor with power utility. The optimal consumption/portfolio rule of a long-time-horizon investor could be substantially different from that of a short-time-horizon investor. The difference is caused by an investor's hedging demand of assets against fluctuations in the estimated mean returns

46 English economics investment Markov processes partial differential equations optimal portfolio dynamic optimal consumption unobservable regime variable risky asset continuous time Markov chain long time-horizon investor short time-horizon investor partial differential equations economics T C1290D Systems theory applications in economics and business C1140J Markov processes C1120 Mathematical analysis

MARCXML for Archiving... 1 Blake/Goldsmith Lita Forum 2004

Figure I-a-2-i - Elsevier metadata, EFFECT format (not XML)

Article Identifier (ISSN/voliss/art) _t3 OHM04740 01651889 V0028I01 02001069
Publication state (internal) _ps [S300]
Item Identifier _ii S0165-1889(02)00106-9
Item Identifier _ii [DOI] 10.1016/S0165-1889(02)00106-9
Article type _ty FLA
Language code _li EN
Article title _ti Optimal portfolio choice for unobservable and regime-switching mean returns
Author _au Honda, T.
Abstract _ab We study dynamic optimal consumption and portfolio choice for a setting in which the mean returns of a risky asset depend on an unobservable regime variable of the economy, which is defined as a continuous-time Markov chain. The investor estimates the current regime by observing past and present asset prices. We compute the optimal consumption and portfolio policies of an investor with power utility. The optimal consumption/portfolio rule of a long-time-horizon investor could be substantially different from that of a short-time-horizon investor. The difference is caused by an investor's hedging demand of assets against fluctuations in the estimated mean returns.
Translation language _la EN
Keyword _kw [JEL classification codes] G11
Keyword _kw [JEL classification codes] C61
Keyword _kw [JEL classification codes] D90
Keyword _kw Regime switching
Keyword _kw Optimal consumption and portfolio
Keyword _kw Incomplete information
Keyword _kw Degenerate partial differential equation
Keyword _kw Stochastic flows
Pages _pg 45-78
Fulltext manifest info _mf [SGML ART 4.3.1] main
Fulltext manifest info _mf [PDF 1.3 DISTILLED OPTIMIZED BOOKMARKED] main
Fulltext manifest info _mf [Raw ASCII] main
Fulltext manifest info _mf [STRIPIN 1.0] stripin

3. Data quantity: more than 75 million citation records, and growing.

Numbers – totals for vendors, weekly updates

Abstracting & Indexing (A&I) vendors:
• SciSearch, 1945-present: ~30M + 4k weekly
• Social SciSearch, 1973-present: ~15M + 1k weekly
• Arts & Humanities, 1975-present: ~5M + 0.5k weekly
• ISI Proceedings, 1990-present: ~3M + 0.5k weekly
All ISI databases have associated citation records.
• Engineering Index: ~5.5M + 0.5k weekly
• INSPEC: ~8M + 0.5k weekly
• BIOSIS: ~15M + 3k weekly

Electronic journals (metadata + full text) vendors:
• Elsevier, IEEE/IEE, Kluwer, Springer, Wiley, ACS: ~7M + 10k monthly

2. ii. Plans for the future of the LANL repository (fig. I-2-d-ii)

Figure I-2-d-ii - LANL Repository

II. A recognized standard
a. MARCXML
i. MARC (MAchine Readable Cataloging) - The 30-second overview
1. The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form.
2. Formats are defined for five types of data: bibliographic, holdings, authority, classification, and community information.
3. The data in a MARC record is organized into fields, each identified by a three-character tag, with the first character of the tag identifying the function of the data (main entry, subject entry) and the remainder of the tag identifying the type of information in the field (personal name, corporate name).
4. See: http://www.loc.gov/marc
ii. MARCXML
1. MARCXML is a simple XML schema which contains MARC data.
2. All control fields, including the leader, are treated as data strings. Data fields are treated as elements with the tag as an attribute and the indicators as attributes. Subfields are treated as subelements with the subfield code as an attribute.
3. See: http://www.loc.gov/standards/marcxml/
4. Schema (fig. II-c-1)

a. See: http://www.loc.gov/standards/marcxml/schema/MARC20slim.xsd
5. Sample record (fig. II-c-2)

Figure II-c-1

Figure II-c-2
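The mapping described in II.a.ii (leader and control fields as strings, data fields carrying the tag and indicators as attributes, subfields carrying a code attribute) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values come from the citation in Figure I-a-2-i, while the helper name `build_record`, the placeholder leader, and the choice of tags 001/022/100/245 are assumptions made for the example and are not the mapping used at LANL.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC21 "slim" schema published by the Library of Congress.
MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def q(name):
    """Return a namespace-qualified element name."""
    return f"{{{MARC_NS}}}{name}"


def build_record(leader, controlfields, datafields):
    """Hypothetical helper: assemble one MARCXML record element."""
    record = ET.Element(q("record"))
    # The leader is carried as a plain data string.
    ET.SubElement(record, q("leader")).text = leader
    # Control fields: the tag is an attribute, the content a plain string.
    for tag, value in controlfields:
        ET.SubElement(record, q("controlfield"), {"tag": tag}).text = value
    # Data fields: tag and indicators are attributes, subfields are children.
    for tag, ind1, ind2, subfields in datafields:
        field = ET.SubElement(
            record, q("datafield"), {"tag": tag, "ind1": ind1, "ind2": ind2}
        )
        for code, value in subfields:
            ET.SubElement(field, q("subfield"), {"code": code}).text = value
    return record


record = build_record(
    leader="00000nab a2200000 a 4500",  # placeholder leader (assumption)
    controlfields=[("001", "7923447")],  # first value of the INSPEC record
    datafields=[
        ("022", " ", " ", [("a", "0165-1889")]),
        ("100", "1", " ", [("a", "Honda, T.")]),
        ("245", "1", "0", [("a", "Optimal portfolio choice for unobservable "
                                  "and regime-switching mean returns")]),
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because every field is addressed by tag, indicator, and subfield code, the same small set of element names can hold metadata from any vendor.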

III. MARCXML @ LANL: a standards-based, uniform container to store disparate vendor metadata for easy transformation and exchange while retaining the granularity of the original data
b. The question of granularity/“re-buildability”, especially with deeply nested, complex vendor data
i. One subject tree from one vendor (fig. III-b-1)

Figure III-b-1

ii. Sample vendor data (fig. III-b-2)
iii. Our solution (fig. III-b-3)

Figure III-b-2
Pisces
Vertebrata
Chordata
Animalia
Animals, Chordates, Fish, Nonhuman Vertebrates, Vertebrates
Osteichthyes [85206]
Salmo trutta
brown trout
commercial species

Figure III-b-3
Pisces
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/SuperTaxa[1]
Vertebrata
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/SuperTaxa[2]
Chordata
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/SuperTaxa[3]
Animalia
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/SuperTaxa[4]
Animals, Chordates, Fish, Nonhuman Vertebrates, Vertebrates
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/TaxaNotes[1]
Osteichthyes [85206]
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/OrgData[1]/OrgClassifier[1]
Salmo trutta
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/OrgData[1]/OrganismList[1]/OrganismName[1]
brown trout
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/OrgData[1]/OrganismList[1]/OrganismSpec[1]/OrgVariantList[1]/OrganismVariant[1]
commercial species
BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]/OrgData[1]/OrganismList[1]/OrganismSpec[1]/OrgDetail[1]
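Figure III-b-3 pairs each value with the indexed XPath of the vendor element it came from. The Python sketch below illustrates why that is enough to keep the data "re-buildable": it flattens a nested tree into (value, path) pairs and then recreates the nesting from those pairs alone. The element names are taken from the paths in Figure III-b-3; the vendor XML itself is reconstructed from those paths, and the function names `flatten` and `rebuild` are hypothetical. It assumes the pairs are stored in document order.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Vendor tree reconstructed from the paths in Figure III-b-3 (an assumption).
VENDOR_XML = """\
<Index>
  <TaxonomicData>
    <TaxonomicList>
      <SuperTaxa>Pisces</SuperTaxa>
      <SuperTaxa>Vertebrata</SuperTaxa>
      <SuperTaxa>Chordata</SuperTaxa>
      <SuperTaxa>Animalia</SuperTaxa>
      <TaxaNotes>Animals, Chordates, Fish, Nonhuman Vertebrates, Vertebrates</TaxaNotes>
      <OrgData>
        <OrgClassifier>Osteichthyes [85206]</OrgClassifier>
        <OrganismList>
          <OrganismName>Salmo trutta</OrganismName>
          <OrganismSpec>
            <OrgVariantList>
              <OrganismVariant>brown trout</OrganismVariant>
            </OrgVariantList>
            <OrgDetail>commercial species</OrgDetail>
          </OrganismSpec>
        </OrganismList>
      </OrgData>
    </TaxonomicList>
  </TaxonomicData>
</Index>
"""


def flatten(root):
    """Return (text, indexed_path) pairs for every element that holds text."""
    def walk(elem, path):
        text = (elem.text or "").strip()
        if text:
            yield text, path
        seen = Counter()  # position of each child among same-named siblings
        for child in elem:
            seen[child.tag] += 1
            yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
    return list(walk(root, f"/{root.tag}"))


def rebuild(pairs):
    """Recreate the nested tree from (text, indexed_path) pairs."""
    nodes = {}  # indexed path prefix -> element already created
    root = None
    for text, path in pairs:
        parent, key = None, ""
        for step in path.strip("/").split("/"):
            key = f"{key}/{step}"
            node = nodes.get(key)
            if node is None:
                tag = step.split("[")[0]  # drop the positional index
                if parent is None:
                    node = root = ET.Element(tag)
                else:
                    node = ET.SubElement(parent, tag)
                nodes[key] = node
            parent = node
        parent.text = text
    return root


pairs = flatten(ET.fromstring(VENDOR_XML))
for text, path in pairs:
    print(f"{text}\n    {path}")

# Round trip: the rebuilt tree flattens to exactly the same pairs.
print(flatten(rebuild(pairs)) == pairs)
```

Storing the path as data, rather than trying to mirror the vendor hierarchy in MARC tags, keeps the MARCXML record flat while losing none of the original structure.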

c. Standards within MARCXML metadata records
i. XML standards
1. XML: http://www.w3.org/XML/
2. Schemas: http://www.w3.org/TR/NOTE-xml-schema-req
3. XPath: http://www.w3.org/TR/xpath
ii. ISO standards
1. ISO-8601 Standard Date Notation: http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&ICS1=1&ICS2=140&ICS3=30
2. ISO-639-2 Language representation: http://www.loc.gov/standards/iso639-2/ and http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=4767&ICS1=1&ICS2=140&ICS3=20
iii. OpenURL: http://alcme.oclc.org/openurl/servlet/OAIHandler?ver=ListSets and http://library.caltech.edu/openurl/default.htm
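To illustrate what applying the ISO standards inside a record can mean in practice, the sketch below normalizes the vendor values seen in Figure I-a-2-i ("Oct. 2003" for the date; "English" and "EN" for the language) to ISO 8601 and ISO 639-2 forms. The function names and the two-entry language table are assumptions made for the example; a real conversion would need the full ISO 639-2 code list and many more date patterns.

```python
import re

# Three-letter month prefixes mapped to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Only the vendor spellings from Figure I-a-2-i (illustrative, not complete).
ISO_639_2 = {"english": "eng", "en": "eng"}


def to_iso8601(text):
    """Convert 'Oct. 2003', 'October 2003' or '2003' to ISO 8601 (YYYY[-MM])."""
    match = re.fullmatch(
        r"\s*(?:([A-Za-z]{3})[A-Za-z]*\.?\s+)?(\d{4})\s*", text
    )
    if match is None:
        raise ValueError(f"unrecognized date: {text!r}")
    month, year = match.groups()
    if month is None:
        return year
    number = MONTHS.get(month.lower())
    if number is None:
        raise ValueError(f"unrecognized month in: {text!r}")
    return f"{year}-{number:02d}"


def to_iso639_2(text):
    """Map a vendor language label to its ISO 639-2 code."""
    try:
        return ISO_639_2[text.strip().lower()]
    except KeyError:
        raise ValueError(f"unmapped language: {text!r}") from None


print(to_iso8601("Oct. 2003"))
print(to_iso639_2("English"), to_iso639_2("EN"))
```

Normalizing at load time means downstream stylesheets can sort and compare dates and languages without knowing which vendor supplied the record.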

IV. Application
a. Existing tools
i. Conversion (Perl utilities)
1. XML::XPath - a set of modules for parsing and evaluating XPath statements
2. XML::Parser - a Perl module for parsing XML documents
3. MARC.pm - a Perl extension to manipulate MAchine Readable Cataloging records
4. See: http://cpan.org
ii. Transformation
1. Xalan - an XSLT processor for transforming XML documents into HTML, text, or other XML document types
2. See: http://xml.apache.org/xalan-c/index.html
iii. Validation
1. Xerces SAX.counter (Simple API for XML)
2. See: http://www.saxproject.org/
iv. Desktop
1. Altova xmlspy® - XML development environment for modeling, editing, transforming, and debugging XML technologies
2. See: http://www.xmlspy.com
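For readers unfamiliar with SAX-style checking, the following Python sketch shows the general idea behind a counting parser: the document is streamed through an event handler, which tallies what it sees, and any well-formedness error aborts the parse. It uses Python's standard `xml.sax` module rather than Xerces, the class and function names are hypothetical, and it checks well-formedness only, not validity against the MARCXML schema.

```python
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Tally elements, attributes, and character data seen during a parse."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.attributes = 0
        self.text_length = 0

    def startElement(self, name, attrs):
        self.elements += 1
        self.attributes += len(attrs)

    def characters(self, content):
        self.text_length += len(content)


def count(xml_text):
    """Parse a document; raises xml.sax.SAXParseException if it is malformed."""
    handler = CountingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


SAMPLE = (
    '<record><datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">Title</subfield></datafield></record>'
)

result = count(SAMPLE)
print(result.elements, result.attributes, result.text_length)

try:
    count("<record><leader></record>")  # mismatched tags
except xml.sax.SAXParseException as error:
    print("not well-formed:", error.getMessage())
```

A streaming check of this kind scales to very large files because the document is never held in memory as a tree.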

b. Custom transformation stylesheets
i. Using regular XPath expressions to get from “that” to... something else
ii. Stylesheet example (Figure IV-b-1)

Figure IV-b-1

XPath processing for transformation to VerityXml ...

iii. Output example (Figure IV-b-2)

Figure IV-b-2
Pisces
Vertebrata
Chordata
Animalia
Animals, Chordates, Fish, Nonhuman Vertebrates, Vertebrates
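As a rough illustration of the XPath-driven transformation outlined in IV.b, the Python sketch below (Python rather than XSLT, so that it is self-contained) selects MARCXML subfields with XPath expressions and writes the matching values to a flat output document, yielding the five values listed in Figure IV-b-2. Everything about the record layout is an assumption made for the example: the datafield tag 650, subfield "a" for the value, subfield "9" for the source path, and the output element names `document` and `taxon`. It does not reproduce the VerityXml format or the stylesheet in Figure IV-b-1.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}
PREFIX = "BIOSIS : Taxonomy : /Index/TaxonomicData[1]/TaxonomicList[1]"

# (value, final path steps) pairs taken from Figure III-b-3.
ROWS = [
    ("Pisces", "SuperTaxa[1]"),
    ("Vertebrata", "SuperTaxa[2]"),
    ("Chordata", "SuperTaxa[3]"),
    ("Animalia", "SuperTaxa[4]"),
    ("Animals, Chordates, Fish, Nonhuman Vertebrates, Vertebrates", "TaxaNotes[1]"),
    ("Osteichthyes [85206]", "OrgData[1]/OrgClassifier[1]"),
]

# Hypothetical record layout: tag 650, subfield "a" = value, "9" = source path.
MARCXML = (
    '<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">'
    + "".join(
        '<marc:datafield tag="650" ind1=" " ind2=" ">'
        f'<marc:subfield code="a">{value}</marc:subfield>'
        f'<marc:subfield code="9">{PREFIX}/{steps}</marc:subfield>'
        "</marc:datafield>"
        for value, steps in ROWS
    )
    + "</marc:record>"
)


def transform(marcxml, wanted=("SuperTaxa", "TaxaNotes")):
    """Copy values whose source path ends in a wanted element to a flat document."""
    record = ET.fromstring(marcxml)
    out = ET.Element("document")  # hypothetical output vocabulary
    for field in record.findall("marc:datafield[@tag='650']", NS):
        value = field.findtext("marc:subfield[@code='a']", namespaces=NS)
        source = field.findtext(
            "marc:subfield[@code='9']", default="", namespaces=NS
        )
        # Last step of the stored path, without its positional index.
        last_step = source.rsplit("/", 1)[-1].split("[")[0]
        if value and last_step in wanted:
            ET.SubElement(out, "taxon").text = value
    return out


result = transform(MARCXML)
for taxon in result.findall("taxon"):
    print(taxon.text)
```

Because the selection depends only on the stored source path, a different output can be produced by changing the `wanted` names, without touching the archived record.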
