TaxonomyofXMLSchemaLanguagesusing FormalLanguageTheory

Onthebasisofregulartreelanguages,wepresentaformalframeworkforXMLschemalanguages.This frameworkhelpstodescribe,compare,andimplementsuchschemalanguages.Ourmainresultsareasfollows:(1) fourclassesoftreelanguages,namely"local","single-type","restrainedcompetition"and"regular";(2)document validationalgorithmsfortheseclasses;and(3)classificationandcomparisonofschemalanguages:DTD,XML- Schema,DSD,XDuce,RELAXCore,andTREX.

MakotoMurata,IBMTokyoResearchLab./InternationalUniversityof DongwonLee,UCLA/CSD MuraliMani,UCLA/CSD

1. Introduction XMLisametalanguageforcreatingmarkuplanguages.Torepresentaparticulartypeof information,wecreateanXML-basedlanguagebydesigninganinventoryofnamesforelements andattributes.Thesenamesarethenutilizedbyapplicationprogramsdedicatedtothistypeof information. Aschemaisadescriptionofsuchaninventory:aschemaspecifiespermissiblenamesfor elementsandattributes,andfurtherspecifiespermissiblestructuresandvaluesofelementsand attributes.Advantagesofcreatingaschemaareasfollows:(1)theschemapreciselydescribes permissibleXMLdocuments,(2)computerprogramscandeterminewhetherornotagivenXML documentispermittedbytheschema,and(3)wecanusetheschemaforcreatingapplication programs(bygeneratingcodeskeletons,forexample).Thus,schemasplayaveryimportantrole indevelopmentofXML-basedapplications. Severallanguagesforwritingschemas,whichwecallschemalanguages,havebeenproposedin thepast.Somelanguages(e.g.,DTD)areconcernedwithXMLdocumentsingeneral;thatis, theyhandleelementsandattributes.Otherlanguagesareconcernedwithparticulartypeof informationwhichmayberepresentedbyXML;primaryconstructsforsuchinformationarenot elementsorattributes,butratherconstructsspecifictothattypeofinformation.RDFSchemaof W3Cisanexampleofsuchaschemalanguage.SinceprimaryconstructsforRDFinformation areresources,properties,andstatements,RDFSchemaisconcernedwithresources,properties, andstatementsratherthanelementsandattributes.Inthispaper,welimitourconcerntoschema languagesforXMLdocumentsingeneral(i.e.,elementsandattributes);specifically,weconsider DTD[BPS00],XML-Schema[TBMM00],DSD[KMS00],RELAXCore[Mur00b],andTREX [Cla01].1Althoughschemalanguagesdedicatedtoparticulartypesofinformation(e.g.,RDF, XTM,andER)areusefulforparticularapplications,theyareoutsidethescopeofthispaper. Webelievethatprovidingaformalframeworkiscrucialinunderstandingvariousaspectsof schemalanguagesandfacilitatingefficientimplementationsofschemalanguages.Weuse regulartreegrammartheory([Tak75]and[CDG+97])toformallycaptureschemas,schema languages,anddocumentvalidation.Regulartreegrammarshaverecentlybeenusedbymany

ExtremeMarkupLanguages2000 1 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory researchersforrepresentingschemasorqueriesforXMLandhavebecomethemainstreamin thisarea(seeOASIS[OA01]andVianu[VV01]);inparticular,XMLQuery[CCF+01]ofW3C isbasedontreegrammars. Ourcontributionsareasfollows: 1. Wedefinefoursubclassesofregulartreegrammarsandtheircorrespondinglanguagesto describeschemasprecisely; 2. Weshowalgorithmsforvalidationofdocumentsagainstschemasforthesesubclasses andconsiderthecharacteristicsofthesealgorithms(e.g.,thetreemodel.vs.theevent model);and 3. Basedonregulartreegrammarsandthesevalidationalgorithms,wepresentadetailed analysisandcomparisonofafewXMLschemaproposalsandtypesystems;anXML schemaproposalAismoreexpressivethananotherproposalBifthesubclasscapturedby AproperlyincludesthatcapturedbyB.

Theremainderofthispaperisorganizedasfollows.InSection2,weconsiderrelatedworks suchasothersurveypapersonXMLschemalanguages.InSection3,wefirstintroduceregular treelanguagesandgrammars,andthenintroducerestrictedclasses.InSection4,weintroduce validationalgorithmsforthefourclasses,andconsidertheircharacteristics.InSection5,onthe basisoftheseobservations,weevaluatedifferentXMLschemalanguageproposals.Finally, concludingremarksandthoughtsonfutureresearchdirectionsarediscussedinSection6. 2. RelatedWork MorethantenschemalanguagesforXMLhaveappearedrecently,and[Jel00]and[LC00] attempttocompareandclassifysuchXMLschemaproposalsfromvariousperspectives. However,theirapproachesarebyandlargenotmathematicalsothattheprecisedescriptionand comparisonamongschemalanguageproposalsarenotstraightforward.Ontheotherhand,this paperfirstestablishesaformalframeworkbasedonregulartreegrammars,andthencompares schemalanguageproposals. SinceKilhoShinadvocateduseoftreeautomataforstructureddocumentsin1992,many researchershaveusedregulartreegrammarsortreeautomataforXML(seeOASIS[OA01]and Vianu[VV01]).However,tothebestofourknowledge,nopapershaveusedregulartree grammarstoclassifyandcompareschemalanguageproposals.Furthermore,weintroduce subclassesofregulartreegrammars,andpresentacollectionofvalidationalgorithmsdedicated tothesesubclasses. XML-SchemaFormalDescription(formerlycalledMSL[BFRW01])isamathematicalmodelof XML-Schema.However,itistailoredforXML-Schemaandisthusunabletocaptureother schemalanguages.Meanwhile,ourframeworkisnottailoredforaparticularschemalanguage. Asaresult,allschemalanguagescanbecaptured,althoughfinedetailsofeachschemalanguage arenot.

ExtremeMarkupLanguages2000 2 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory

3. TreeGrammars Inthissection,asamechanismfordescribingpermissibletrees,westudytreegrammars.We beginwithaclassoftreegrammarscalled"regular",andthenintroducethreerestrictedclasses called"local","single-type",and"restrained-competition". Somereadersmightwonderwhywedonotusecontext-free(string)grammars.Context-free (string)grammars[HU79]representsetsofstrings.Successfulparsingofstringsagainstsuch grammarsprovidesderivationtrees.Thisscenarioisappropriateforprogramminglanguages andnaturallanguages,whereprogramsandnaturallanguagetextarestringsratherthantrees. Ontheotherhand,starttagsandendtagsinanXMLdocumentcollectivelyrepresentatree. Sincetraditionalcontext-free(string)grammarsareoriginallydesignedtodescribepermissible strings,theyareinappropriatefordescribingpermissibletrees. 3.1 RegularTreeGrammarsandLanguages Weborrowthedefinitionsofregulartreelanguagesandtreeautomatain[CDG+97],butallow treeswith"infinitearity";thatis,weallowanodetohaveanynumberofsubordinatenodes,and allowtheright-handsideofaproductionruletohavearegularexpressionovernon-terminals. Definition1.(RegularTreeGrammar)Aregulartreegrammar(RTG)isa4-tupleG=(N,T, S,P),where: • Nisafinitesetofnon-terminals, • Tisafinitesetofterminals, • Sisasetofstartsymbols,whereSisasubsetofN. • PisafinitesetofproductionrulesoftheformX→ar,whereX∈N,a∈T,andrisa regularexpressionoverN;Xistheleft-handside,aristheright-handside,andristhe contentmodelofthisproductionrule.

Example1.ThefollowinggrammarG1=(N,T,S,P)isaregulartreegrammar.Theleft-hand side,right-handside,andcontentmodelofthefirstproductionruleisDoc,Doc(Para1, Para2*),and(Para1,Para2*),respectively.

N={Doc,Para1,Para2,Pcdata} T={doc,para,pcdata} S={Doc} P={Doc→doc(Para1,Para2*),Para1→para(Pcdata),

Para2→para(Pcdata),Pcdata→pcdata }

Werepresenteverytextvaluebythenodepcdataforconvenience. Withoutlossofgenerality,wecanassumethatnotwoproductionruleshavethesamenon- terminalintheleft-handsideandthesameterminalintheright-handsideatthesametime.Ifa regulartreegrammarcontainssuchproductionrules,weonlyhavetomergethemintoasingle

ExtremeMarkupLanguages2000 3 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory productionrule.Wealsoassumethateverynon-terminaliseitherastartsymboloroccursinthe contentmodelofsomeproductionrule(inotherwords,nonon-terminalsareuseless). Wehavetodefinehowaregulartreegrammargeneratesasetoftreesoverterminals.Wefirst defineinterpretations. Definition2.(Interpretation)AninterpretationIofatreetagainstaregulartreegrammarGis amappingfromeachnodeeinttoanon-terminal,denotedI(e),suchthat:

• I(eroot)isastartsymbolwhereerootistherootoft,and X→ar • foreachnodeeanditssubordinatese0,e1,...,ei,thereexistsaproductionrule suchthat • I(e)isX, • theterminal(label)ofeisa,and

• I(e0)I(e1)...I(ei)matchesr.

Now,wearereadytodefinegenerationoftreesfromregulartreegrammars,andregulartree languages. Definition3.(Generation)AtreetisgeneratedbyaregulartreegrammarGifthereisan interpretationoftagainstG.

Example2.AninstancetreegeneratedbyG1,anditsinterpretationagainstG1areshownin Figure1. d1 Doc

p1 p2 Para1 Para2 pcdata1 pcdata2 Pcdata Pcdata a) Instance tree, t b) Interpretation of t generated by G against G

Figure1.

AninstancetreegeneratedbyG1,anditsinterpretationagainstG1.Weuseunique labelstorepresentthenodesintheinstancetree. Definition4.(RegularTreeLanguage)Aregulartreelanguageisthesetoftreesgeneratedby aregulartreegrammar.

ExtremeMarkupLanguages2000 4 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory

3.2 LocalTreeGrammarsandLanguages Wefirstdefinecompetitionofnon-terminals,whichmakesvalidationdifficult.Then,we introducearestrictedclasscalled``local''byprohibitingcompetitionofnon-terminals[Tak75]. ThisclassroughlycorrespondstoDTD. Definition5.(CompetingNon-Terminals)Twodifferentnon-terminalsAandBaresaid competingwitheachotherif • oneproductionrulehasAintheleft-handside, • anotherproductionrulehasBintheleft-handside,and • thesetwoproductionrulessharethesameterminalintheright-handside.

Example3.ConsideraregulartreegrammarG3=(N,T,S,P),where:

N={Book,Author1,Son,Article,Author2,Daughter} T={book,author,son,daughter} S={Book,Article}

P={Book→book(Author1),Author1→author(Son),Son→son , Article→article(Author2),Author2→author(Daughter),

Daughter→daughter } Author1andAuthor2competewitheachother,sincetheproductionruleforAuthor1andthatfor Author2sharetheterminalauthorintheright-handside.Therearenoothercompetingnon- terminalpairsinthisgrammar. Definition6.(LocalTreeGrammarandLanguage)Alocaltreegrammar(LTG)isaregular treegrammarwithoutcompetingnon-terminals.Atreelanguageisalocaltreelanguageifitis generatedbyalocaltreegrammar.

Example4.G3inExample3isnotlocal,sinceAuthor1andAuthor2competewitheachother.

Example5.ThefollowinggrammarG5=(N,T,S,P)isalocaltreegrammar:

N={Book,Author1,Son,Pcdata} T={book,author,son,pcdata} S={Book} P={Book→book(Author1),Author1→author(Son),

Son→sonPcdata,Pcdata→pcdata }

Observethatlocaltreegrammarsandextendedcontext-free(string)grammars(ECFG)look similar.However,theformerdescribessetsoftrees,whilethelatterdescribessetsofstrings. TheparsetreesetofanECFGisalocaltreelanguage. 3.3 Single-TypeTreeGrammarsandLanguages Next,weintroducealessrestrictedclasscalled``single-type''byprohibitingcompetitionofnon- terminalswithinasinglecontentmodel.ThisclassroughlycorrespondstoXML-Schema.

ExtremeMarkupLanguages2000 5 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory

Definition7.(Single-TypeTreeGrammarandLanguage)Aasingle-typetreegrammarisa regulartreegrammarsuchthat • foreachproductionrule,non-terminalsinitscontentmodeldonotcompetewitheach other,and • startsymbolsdonotcompetewitheachother. Atreelanguageisasingle-typetreelanguageifitisgeneratedbyasingle-typetreegrammar.

Example6.ThegrammarG1showninExample1isnotsingle-type.Observethatnon- terminalsPara1andPara2compete,andtheyoccurinthecontentmodeloftheproductionrule forDoc.

Example7.ConsiderG3inExample3.Thisgrammarisasingle-typetreegrammarsinceno productionruleshavemorethanonenon-terminalintheircontentmodelsandthustherecannot beanycompetingnon-terminalsinthecontentmodels. 3.4 Restrained-CompetitionTreeGrammarsandLanguages Now,weintroduceanevenlessrestrictedclasscalled``restrained-competition''.Thekeyideais tousecontentmodelsforrestrainingcompetitionofnon-terminals. Definition8.(Restraining)Acontentmodelrrestrainscompetitionoftwocompetingnon- terminalsAandBif,foranysequencesU,V,Wofnon-terminals,rdonotgeneratebothUAV andUBW. Definition9.(Restrained-CompetitionTreeGrammarandLanguage)Arestrained- competitiontreegrammarisaregulartreegrammarsuchthat • foreachproductionrule,itscontentmodelrestrainscompetitionofnon-terminals occurringinthecontentmodel,and • startsymbolsdonotcompetewitheachother. Atreelanguageisarestrained-competitiontreelanguageifitisgeneratedbyarestrained- competitiontreegrammar.

Example8.ThegrammarG1isarestrained-competitiontreegrammar.Non-terminalsPara1 andPara2competewitheachother,andtheybothoccurinthecontentmodeloftheproduction ruleforDoc.However,thiscontentmodel(Para1,Para2*)restrainsthecompetitionbetween Para1andPara2,sincePara1mayoccuronlyasthefirstnon-terminalandPara2mayoccur onlyasthenon-firstnon-terminal.

Example9.ThefollowinggrammarG9isnotarestrained-competitiontreegrammar.Observe thatthecontentmodel(Para1*,Para2*)doesnotrestrainthecompetitionofnon-terminals

Para1andPara2.Forexample,supposethatU=V=W= .Then,bothUPara1VandU Para2Wmatchthecontentmodel.

N={Doc,Para1,Para2,Pcdata} T={doc,para,pcdata} S={Doc} P={Doc→doc(Para1*,Para2*),Para1→para(Pcdata),

ExtremeMarkupLanguages2000 6 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory

Para2→para(Pcdata),Pcdata→pcdata }

3.5 SummaryofExamples Eachoftheexamplegrammarsdemonstratesoneoftheclassesoftreegrammars.Thefollowing tableshowstowhichclasseachexamplebelongs.

Regular Restrained- Single-Type Local Competition

G5 Yes Yes Yes Yes

G3 Yes Yes Yes No

G1 Yes Yes No No

G9 Yes No No No 3.6 ExpressivenessandClosure Havingintroducedfourclassesofgrammarsandlanguages,weconsiderexpressivenessand closure.Theinterestedreaderisreferredto[LMM00]fortheproofs. First,weconsiderexpressivenessusingpracticalexamples.Foreachterminal,alocaltree grammarprovidesasinglecontentmodelforallelementsofthisterminal.Forexample, �elements�may�be�subordinate�to�<section>,�<chapter>,�or�<author>�elements,�but permissible�subordinate�children�are�always�the�same.��A�single-type�tree�grammar�does�not�have this�tight�restriction,�but�cannot�allow�the�first�paragraph�and�the�other�paragraphs�in�a�section�to have�different�content�models.��A�restrained-competition�tree�grammar�lifts�this�restriction. Although�some�regular�tree�languages�cannot�be�captured�by�restrained-competition�tree grammars,�we�are�not�aware�of�any�practical�examples. Next,�we�present�theoretical�observations�about�expressiveness. • Regular�tree�grammars�are�strictly�more�expressive�than�restrained-competition�tree grammars.��That�is,�some�regular�tree�grammars�cannot�be�rewritten�as�restrained- competition�tree�grammars. • Restrained-competition�tree�grammars�are�strictly�more�expressive�than�single-type�tree grammars.�That�is,�some�restrained-competition�tree�grammars�cannot�be�rewritten�as single-type�tree�grammars. • Single-type�tree�grammars�are�strictly�more�expressive�than�local�tree�grammars.�That�is, some�single-type�tree�grammars�cannot�be�rewritten�as�local�tree�grammars. �Next,�we�present�observations�on�closure.��A�class�of�languages�is�said�to�be�closed�under union/intersection/difference�when,�for�any�two�languages�in�that�class,�the union/intersection/difference�of�the�two�languages�also�belongs�to�the�same�class. • The�class�of�regular�tree�languages�is�closed�under�union,�intersection,�and�difference.</p><p>Extreme�Markup�Languages�2000 7 Makoto�Murata,�Dongwon�Lee�and�Murali�Mani�•�Taxonomy�of�XML�Schema�Languages�using�Formal�Language Theory</p><p>• The�class�of�local�tree�languages�is�closed�under�intersection,�but�is�not�closed�under union�or�difference. • The�class�of�single-type�tree�languages�is�closed�under�intersection,�but�is�not�closed under�union�or�difference. • The�class�of�restrained-competition�languages�is�closed�under�intersection,�but�is�not closed�under�union�or�difference.</p><p>4. Document�Validation In�this�section,�we�consider�algorithms�for�document�validation.��Given�a�grammar�and�a�tree, these�algorithms�determine�whether�that�tree�is�generated�by�the�grammar,�and�also�construct interpretations�of�the�tree.��All�algorithms�are�based�on�variations�of�tree�automata.��Some algorithms�require�the�tree�model�while�others�do�not. 4.1 Tree�Model�vs.�Event�Model Programs�for�handling�XML�documents,�including�those�for�validation,�are�typically�based�on either�the�tree�model�or�event�model.��In�the�tree�model,�the�XML�parser�creates�a�tree�in memory.�Application�programs�have�access�to�the�tree�in�memory�via�some�API�such�as�DOM [WHA+00],�and�can�traverse�the�tree�any�number�of�times.��In�the�event�model,�the�XML�parser does�not�create�a�tree�in�memory,�but�merely�raise�events�for�start�tags�or�end�tags.��Application programs�are�notified�of�such�events�via�some�API�such�as�SAX�[Meg00].��In�other�words, application�programs�visit�and�leave�elements�in�the�document�in�the�depth-first�manner.��Once an�application�program�visits�an�element�in�the�document,�it�cannot�visit�the�element�again. While�both�models�work�fine�for�small�documents,�the�tree�model�has�performance�problems�for significantly�large�documents. 4.2 DTD�validation�and�their�variations We�begin�with�validation�for�local�tree�grammars.��Since�DTDs�correspond�to�local�tree grammars,�such�validation�has�been�used�for�years.��Remember�that�a�local�tree�grammar�does not�have�competing�non-terminals�(see�Definition�6).�Thus,�for�each�element,�we�can�uniquely determine�a�non-terminal�from�the�terminal�of�this�element. Validation�for�local�tree�grammars�can�be�easily�built�in�the�event�model.��Suppose�that�the�XML processor�has�encountered�a�start�tag�for�an�element�e�in�a�given�document.�We�can�easily determine�a�non-terminal�for�e�from�this�start�tag.��Next,�suppose�that�the�XML�processors�has encountered�the�end�tag�for�e.��Let�e1,�e2,�...�ei�be�the�child�elements�of�e.��Since�the�start�tags�for these�elements�have�been�already�encountered�by�the�XML�parser,�we </div> </article> </div> </div> </div> <script type="text/javascript" async crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-8519364510543070"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.1/jquery.min.js" crossorigin="anonymous" referrerpolicy="no-referrer"></script> <script> var docId = 'a5a3ce390ea2de7bf55930df2a7fa984'; var endPage = 1; var totalPage = 25; var pfLoading = false; window.addEventListener('scroll', function () { if (pfLoading) return; var $now = $('.article-imgview .pf').eq(endPage - 1); if (document.documentElement.scrollTop + $(window).height() > $now.offset().top) { pfLoading = true; endPage++; if (endPage > totalPage) return; var imgEle = new Image(); var imgsrc = "//data.docslib.org/img/a5a3ce390ea2de7bf55930df2a7fa984-" + endPage + (endPage > 3 ? ".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 7) { adcall('pf' + endPage); } } }, { passive: true }); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html>