Taxonomy of XML Schema Languages Using Formal Language

TaxonomyofXMLSchemaLanguagesusing FormalLanguageTheory Onthebasisofregulartreelanguages,wepresentaformalframeworkforXMLschemalanguages.This frameworkhelpstodescribe,compare,andimplementsuchschemalanguages.Ourmainresultsareasfollows:(1) fourclassesoftreelanguages,namely"local","single-type","restrainedcompetition"and"regular";(2)document validationalgorithmsfortheseclasses;and(3)classificationandcomparisonofschemalanguages:DTD,XML- Schema,DSD,XDuce,RELAXCore,andTREX. MakotoMurata,IBMTokyoResearchLab./InternationalUniversityofJapan DongwonLee,UCLA/CSD MuraliMani,UCLA/CSD 1. Introduction XMLisametalanguageforcreatingmarkuplanguages.Torepresentaparticulartypeof information,wecreateanXML-basedlanguagebydesigninganinventoryofnamesforelements andattributes.Thesenamesarethenutilizedbyapplicationprogramsdedicatedtothistypeof information. Aschemaisadescriptionofsuchaninventory:aschemaspecifiespermissiblenamesfor elementsandattributes,andfurtherspecifiespermissiblestructuresandvaluesofelementsand attributes.Advantagesofcreatingaschemaareasfollows:(1)theschemapreciselydescribes permissibleXMLdocuments,(2)computerprogramscandeterminewhetherornotagivenXML documentispermittedbytheschema,and(3)wecanusetheschemaforcreatingapplication programs(bygeneratingcodeskeletons,forexample).Thus,schemasplayaveryimportantrole indevelopmentofXML-basedapplications. Severallanguagesforwritingschemas,whichwecallschemalanguages,havebeenproposedin thepast.Somelanguages(e.g.,DTD)areconcernedwithXMLdocumentsingeneral;thatis, theyhandleelementsandattributes.Otherlanguagesareconcernedwithparticulartypeof informationwhichmayberepresentedbyXML;primaryconstructsforsuchinformationarenot elementsorattributes,butratherconstructsspecifictothattypeofinformation.RDFSchemaof W3Cisanexampleofsuchaschemalanguage.SinceprimaryconstructsforRDFinformation areresources,properties,andstatements,RDFSchemaisconcernedwithresources,properties, andstatementsratherthanelementsandattributes.Inthispaper,welimitourconcerntoschema languagesforXMLdocumentsingeneral(i.e.,elementsandattributes);specifically,weconsider DTD[BPS00],XML-Schema[TBMM00],DSD[KMS00],RELAXCore[Mur00b],andTREX [Cla01].1Althoughschemalanguagesdedicatedtoparticulartypesofinformation(e.g.,RDF, XTM,andER)areusefulforparticularapplications,theyareoutsidethescopeofthispaper. Webelievethatprovidingaformalframeworkiscrucialinunderstandingvariousaspectsof schemalanguagesandfacilitatingefficientimplementationsofschemalanguages.Weuse regulartreegrammartheory([Tak75]and[CDG+97])toformallycaptureschemas,schema languages,anddocumentvalidation.Regulartreegrammarshaverecentlybeenusedbymany ExtremeMarkupLanguages2000 1 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory researchersforrepresentingschemasorqueriesforXMLandhavebecomethemainstreamin thisarea(seeOASIS[OA01]andVianu[VV01]);inparticular,XMLQuery[CCF+01]ofW3C isbasedontreegrammars. Ourcontributionsareasfollows: 1. Wedefinefoursubclassesofregulartreegrammarsandtheircorrespondinglanguagesto describeschemasprecisely; 2. Weshowalgorithmsforvalidationofdocumentsagainstschemasforthesesubclasses andconsiderthecharacteristicsofthesealgorithms(e.g.,thetreemodel.vs.theevent model);and 3. Basedonregulartreegrammarsandthesevalidationalgorithms,wepresentadetailed analysisandcomparisonofafewXMLschemaproposalsandtypesystems;anXML schemaproposalAismoreexpressivethananotherproposalBifthesubclasscapturedby AproperlyincludesthatcapturedbyB. Theremainderofthispaperisorganizedasfollows.InSection2,weconsiderrelatedworks suchasothersurveypapersonXMLschemalanguages.InSection3,wefirstintroduceregular treelanguagesandgrammars,andthenintroducerestrictedclasses.InSection4,weintroduce validationalgorithmsforthefourclasses,andconsidertheircharacteristics.InSection5,onthe basisoftheseobservations,weevaluatedifferentXMLschemalanguageproposals.Finally, concludingremarksandthoughtsonfutureresearchdirectionsarediscussedinSection6. 2. RelatedWork MorethantenschemalanguagesforXMLhaveappearedrecently,and[Jel00]and[LC00] attempttocompareandclassifysuchXMLschemaproposalsfromvariousperspectives. However,theirapproachesarebyandlargenotmathematicalsothattheprecisedescriptionand comparisonamongschemalanguageproposalsarenotstraightforward.Ontheotherhand,this paperfirstestablishesaformalframeworkbasedonregulartreegrammars,andthencompares schemalanguageproposals. SinceKilhoShinadvocateduseoftreeautomataforstructureddocumentsin1992,many researchershaveusedregulartreegrammarsortreeautomataforXML(seeOASIS[OA01]and Vianu[VV01]).However,tothebestofourknowledge,nopapershaveusedregulartree grammarstoclassifyandcompareschemalanguageproposals.Furthermore,weintroduce subclassesofregulartreegrammars,andpresentacollectionofvalidationalgorithmsdedicated tothesesubclasses. XML-SchemaFormalDescription(formerlycalledMSL[BFRW01])isamathematicalmodelof XML-Schema.However,itistailoredforXML-Schemaandisthusunabletocaptureother schemalanguages.Meanwhile,ourframeworkisnottailoredforaparticularschemalanguage. Asaresult,allschemalanguagescanbecaptured,althoughfinedetailsofeachschemalanguage arenot. ExtremeMarkupLanguages2000 2 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory 3. TreeGrammars Inthissection,asamechanismfordescribingpermissibletrees,westudytreegrammars.We beginwithaclassoftreegrammarscalled"regular",andthenintroducethreerestrictedclasses called"local","single-type",and"restrained-competition". Somereadersmightwonderwhywedonotusecontext-free(string)grammars.Context-free (string)grammars[HU79]representsetsofstrings.Successfulparsingofstringsagainstsuch grammarsprovidesderivationtrees.Thisscenarioisappropriateforprogramminglanguages andnaturallanguages,whereprogramsandnaturallanguagetextarestringsratherthantrees. Ontheotherhand,starttagsandendtagsinanXMLdocumentcollectivelyrepresentatree. Sincetraditionalcontext-free(string)grammarsareoriginallydesignedtodescribepermissible strings,theyareinappropriatefordescribingpermissibletrees. 3.1 RegularTreeGrammarsandLanguages Weborrowthedefinitionsofregulartreelanguagesandtreeautomatain[CDG+97],butallow treeswith"infinitearity";thatis,weallowanodetohaveanynumberofsubordinatenodes,and allowtheright-handsideofaproductionruletohavearegularexpressionovernon-terminals. Definition1.(RegularTreeGrammar)Aregulartreegrammar(RTG)isa4-tupleG=(N,T, S,P),where: • Nisafinitesetofnon-terminals, • Tisafinitesetofterminals, • Sisasetofstartsymbols,whereSisasubsetofN. • PisafinitesetofproductionrulesoftheformX→ar,whereX∈N,a∈T,andrisa regularexpressionoverN;Xistheleft-handside,aristheright-handside,andristhe contentmodelofthisproductionrule. Example1.ThefollowinggrammarG1=(N,T,S,P)isaregulartreegrammar.Theleft-hand side,right-handside,andcontentmodelofthefirstproductionruleisDoc,Doc(Para1, Para2*),and(Para1,Para2*),respectively. N={Doc,Para1,Para2,Pcdata} T={doc,para,pcdata} S={Doc} P={Doc→doc(Para1,Para2*),Para1→para(Pcdata), Para2→para(Pcdata),Pcdata→pcdata } Werepresenteverytextvaluebythenodepcdataforconvenience. Withoutlossofgenerality,wecanassumethatnotwoproductionruleshavethesamenon- terminalintheleft-handsideandthesameterminalintheright-handsideatthesametime.Ifa regulartreegrammarcontainssuchproductionrules,weonlyhavetomergethemintoasingle ExtremeMarkupLanguages2000 3 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory productionrule.Wealsoassumethateverynon-terminaliseitherastartsymboloroccursinthe contentmodelofsomeproductionrule(inotherwords,nonon-terminalsareuseless). Wehavetodefinehowaregulartreegrammargeneratesasetoftreesoverterminals.Wefirst defineinterpretations. Definition2.(Interpretation)AninterpretationIofatreetagainstaregulartreegrammarGis amappingfromeachnodeeinttoanon-terminal,denotedI(e),suchthat: • I(eroot)isastartsymbolwhereerootistherootoft,and X→ar • foreachnodeeanditssubordinatese0,e1,...,ei,thereexistsaproductionrule suchthat • I(e)isX, • theterminal(label)ofeisa,and • I(e0)I(e1)...I(ei)matchesr. Now,wearereadytodefinegenerationoftreesfromregulartreegrammars,andregulartree languages. Definition3.(Generation)AtreetisgeneratedbyaregulartreegrammarGifthereisan interpretationoftagainstG. Example2.AninstancetreegeneratedbyG1,anditsinterpretationagainstG1areshownin Figure1. d1 Doc p1 p2 Para1 Para2 pcdata1 pcdata2 Pcdata Pcdata a) Instance tree, t b) Interpretation of t generated by G against G Figure1. AninstancetreegeneratedbyG1,anditsinterpretationagainstG1.Weuseunique labelstorepresentthenodesintheinstancetree. Definition4.(RegularTreeLanguage)Aregulartreelanguageisthesetoftreesgeneratedby aregulartreegrammar. ExtremeMarkupLanguages2000 4 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory 3.2 LocalTreeGrammarsandLanguages Wefirstdefinecompetitionofnon-terminals,whichmakesvalidationdifficult.Then,we introducearestrictedclasscalled``local''byprohibitingcompetitionofnon-terminals[Tak75]. ThisclassroughlycorrespondstoDTD. Definition5.(CompetingNon-Terminals)Twodifferentnon-terminalsAandBaresaid competingwitheachotherif • oneproductionrulehasAintheleft-handside, • anotherproductionrulehasBintheleft-handside,and • thesetwoproductionrulessharethesameterminalintheright-handside. Example3.ConsideraregulartreegrammarG3=(N,T,S,P),where: N={Book,Author1,Son,Article,Author2,Daughter} T={book,author,son,daughter} S={Book,Article} P={Book→book(Author1),Author1→author(Son),Son→son , Article→article(Author2),Author2→author(Daughter), Daughter→daughter } Author1andAuthor2competewitheachother,sincetheproductionruleforAuthor1andthatfor Author2sharetheterminalauthorintheright-handside.Therearenoothercompetingnon-

Taxonomy of XML Schema Languages Using Formal Language

Using Tree Automata and Regular Expressions to Manipulate

Custom XML Attribute

Taxonomy of XML Schema Languages Using Formal Language Theory

XML Access Control Using Static Analysis

Towards Conceptual Modeling for XML

Document Declaration Is Missing Html Attribute

Logics for XML

Parsing and Querying XML Documents in SML

PLAN-X 2006 Informal Proceedings

XML Graphs in Program Analysis

Iso/Iec Jtc 1/Sc 34 N 1707 Date: 2011-10-03

Converting Into Pattern-Based Schemas: a Formal Approach