Taxonomy of XML Schema Languages Using Formal Language
Total Page:16
File Type:pdf, Size:1020Kb
TaxonomyofXMLSchemaLanguagesusing FormalLanguageTheory Onthebasisofregulartreelanguages,wepresentaformalframeworkforXMLschemalanguages.This frameworkhelpstodescribe,compare,andimplementsuchschemalanguages.Ourmainresultsareasfollows:(1) fourclassesoftreelanguages,namely"local","single-type","restrainedcompetition"and"regular";(2)document validationalgorithmsfortheseclasses;and(3)classificationandcomparisonofschemalanguages:DTD,XML- Schema,DSD,XDuce,RELAXCore,andTREX. MakotoMurata,IBMTokyoResearchLab./InternationalUniversityofJapan DongwonLee,UCLA/CSD MuraliMani,UCLA/CSD 1. Introduction XMLisametalanguageforcreatingmarkuplanguages.Torepresentaparticulartypeof information,wecreateanXML-basedlanguagebydesigninganinventoryofnamesforelements andattributes.Thesenamesarethenutilizedbyapplicationprogramsdedicatedtothistypeof information. Aschemaisadescriptionofsuchaninventory:aschemaspecifiespermissiblenamesfor elementsandattributes,andfurtherspecifiespermissiblestructuresandvaluesofelementsand attributes.Advantagesofcreatingaschemaareasfollows:(1)theschemapreciselydescribes permissibleXMLdocuments,(2)computerprogramscandeterminewhetherornotagivenXML documentispermittedbytheschema,and(3)wecanusetheschemaforcreatingapplication programs(bygeneratingcodeskeletons,forexample).Thus,schemasplayaveryimportantrole indevelopmentofXML-basedapplications. Severallanguagesforwritingschemas,whichwecallschemalanguages,havebeenproposedin thepast.Somelanguages(e.g.,DTD)areconcernedwithXMLdocumentsingeneral;thatis, theyhandleelementsandattributes.Otherlanguagesareconcernedwithparticulartypeof informationwhichmayberepresentedbyXML;primaryconstructsforsuchinformationarenot elementsorattributes,butratherconstructsspecifictothattypeofinformation.RDFSchemaof W3Cisanexampleofsuchaschemalanguage.SinceprimaryconstructsforRDFinformation areresources,properties,andstatements,RDFSchemaisconcernedwithresources,properties, andstatementsratherthanelementsandattributes.Inthispaper,welimitourconcerntoschema languagesforXMLdocumentsingeneral(i.e.,elementsandattributes);specifically,weconsider DTD[BPS00],XML-Schema[TBMM00],DSD[KMS00],RELAXCore[Mur00b],andTREX [Cla01].1Althoughschemalanguagesdedicatedtoparticulartypesofinformation(e.g.,RDF, XTM,andER)areusefulforparticularapplications,theyareoutsidethescopeofthispaper. Webelievethatprovidingaformalframeworkiscrucialinunderstandingvariousaspectsof schemalanguagesandfacilitatingefficientimplementationsofschemalanguages.Weuse regulartreegrammartheory([Tak75]and[CDG+97])toformallycaptureschemas,schema languages,anddocumentvalidation.Regulartreegrammarshaverecentlybeenusedbymany ExtremeMarkupLanguages2000 1 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory researchersforrepresentingschemasorqueriesforXMLandhavebecomethemainstreamin thisarea(seeOASIS[OA01]andVianu[VV01]);inparticular,XMLQuery[CCF+01]ofW3C isbasedontreegrammars. Ourcontributionsareasfollows: 1. Wedefinefoursubclassesofregulartreegrammarsandtheircorrespondinglanguagesto describeschemasprecisely; 2. Weshowalgorithmsforvalidationofdocumentsagainstschemasforthesesubclasses andconsiderthecharacteristicsofthesealgorithms(e.g.,thetreemodel.vs.theevent model);and 3. Basedonregulartreegrammarsandthesevalidationalgorithms,wepresentadetailed analysisandcomparisonofafewXMLschemaproposalsandtypesystems;anXML schemaproposalAismoreexpressivethananotherproposalBifthesubclasscapturedby AproperlyincludesthatcapturedbyB. Theremainderofthispaperisorganizedasfollows.InSection2,weconsiderrelatedworks suchasothersurveypapersonXMLschemalanguages.InSection3,wefirstintroduceregular treelanguagesandgrammars,andthenintroducerestrictedclasses.InSection4,weintroduce validationalgorithmsforthefourclasses,andconsidertheircharacteristics.InSection5,onthe basisoftheseobservations,weevaluatedifferentXMLschemalanguageproposals.Finally, concludingremarksandthoughtsonfutureresearchdirectionsarediscussedinSection6. 2. RelatedWork MorethantenschemalanguagesforXMLhaveappearedrecently,and[Jel00]and[LC00] attempttocompareandclassifysuchXMLschemaproposalsfromvariousperspectives. However,theirapproachesarebyandlargenotmathematicalsothattheprecisedescriptionand comparisonamongschemalanguageproposalsarenotstraightforward.Ontheotherhand,this paperfirstestablishesaformalframeworkbasedonregulartreegrammars,andthencompares schemalanguageproposals. SinceKilhoShinadvocateduseoftreeautomataforstructureddocumentsin1992,many researchershaveusedregulartreegrammarsortreeautomataforXML(seeOASIS[OA01]and Vianu[VV01]).However,tothebestofourknowledge,nopapershaveusedregulartree grammarstoclassifyandcompareschemalanguageproposals.Furthermore,weintroduce subclassesofregulartreegrammars,andpresentacollectionofvalidationalgorithmsdedicated tothesesubclasses. XML-SchemaFormalDescription(formerlycalledMSL[BFRW01])isamathematicalmodelof XML-Schema.However,itistailoredforXML-Schemaandisthusunabletocaptureother schemalanguages.Meanwhile,ourframeworkisnottailoredforaparticularschemalanguage. Asaresult,allschemalanguagescanbecaptured,althoughfinedetailsofeachschemalanguage arenot. ExtremeMarkupLanguages2000 2 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory 3. TreeGrammars Inthissection,asamechanismfordescribingpermissibletrees,westudytreegrammars.We beginwithaclassoftreegrammarscalled"regular",andthenintroducethreerestrictedclasses called"local","single-type",and"restrained-competition". Somereadersmightwonderwhywedonotusecontext-free(string)grammars.Context-free (string)grammars[HU79]representsetsofstrings.Successfulparsingofstringsagainstsuch grammarsprovidesderivationtrees.Thisscenarioisappropriateforprogramminglanguages andnaturallanguages,whereprogramsandnaturallanguagetextarestringsratherthantrees. Ontheotherhand,starttagsandendtagsinanXMLdocumentcollectivelyrepresentatree. Sincetraditionalcontext-free(string)grammarsareoriginallydesignedtodescribepermissible strings,theyareinappropriatefordescribingpermissibletrees. 3.1 RegularTreeGrammarsandLanguages Weborrowthedefinitionsofregulartreelanguagesandtreeautomatain[CDG+97],butallow treeswith"infinitearity";thatis,weallowanodetohaveanynumberofsubordinatenodes,and allowtheright-handsideofaproductionruletohavearegularexpressionovernon-terminals. Definition1.(RegularTreeGrammar)Aregulartreegrammar(RTG)isa4-tupleG=(N,T, S,P),where: • Nisafinitesetofnon-terminals, • Tisafinitesetofterminals, • Sisasetofstartsymbols,whereSisasubsetofN. • PisafinitesetofproductionrulesoftheformX→ar,whereX∈N,a∈T,andrisa regularexpressionoverN;Xistheleft-handside,aristheright-handside,andristhe contentmodelofthisproductionrule. Example1.ThefollowinggrammarG1=(N,T,S,P)isaregulartreegrammar.Theleft-hand side,right-handside,andcontentmodelofthefirstproductionruleisDoc,Doc(Para1, Para2*),and(Para1,Para2*),respectively. N={Doc,Para1,Para2,Pcdata} T={doc,para,pcdata} S={Doc} P={Doc→doc(Para1,Para2*),Para1→para(Pcdata), Para2→para(Pcdata),Pcdata→pcdata } Werepresenteverytextvaluebythenodepcdataforconvenience. Withoutlossofgenerality,wecanassumethatnotwoproductionruleshavethesamenon- terminalintheleft-handsideandthesameterminalintheright-handsideatthesametime.Ifa regulartreegrammarcontainssuchproductionrules,weonlyhavetomergethemintoasingle ExtremeMarkupLanguages2000 3 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory productionrule.Wealsoassumethateverynon-terminaliseitherastartsymboloroccursinthe contentmodelofsomeproductionrule(inotherwords,nonon-terminalsareuseless). Wehavetodefinehowaregulartreegrammargeneratesasetoftreesoverterminals.Wefirst defineinterpretations. Definition2.(Interpretation)AninterpretationIofatreetagainstaregulartreegrammarGis amappingfromeachnodeeinttoanon-terminal,denotedI(e),suchthat: • I(eroot)isastartsymbolwhereerootistherootoft,and X→ar • foreachnodeeanditssubordinatese0,e1,...,ei,thereexistsaproductionrule suchthat • I(e)isX, • theterminal(label)ofeisa,and • I(e0)I(e1)...I(ei)matchesr. Now,wearereadytodefinegenerationoftreesfromregulartreegrammars,andregulartree languages. Definition3.(Generation)AtreetisgeneratedbyaregulartreegrammarGifthereisan interpretationoftagainstG. Example2.AninstancetreegeneratedbyG1,anditsinterpretationagainstG1areshownin Figure1. d1 Doc p1 p2 Para1 Para2 pcdata1 pcdata2 Pcdata Pcdata a) Instance tree, t b) Interpretation of t generated by G against G Figure1. AninstancetreegeneratedbyG1,anditsinterpretationagainstG1.Weuseunique labelstorepresentthenodesintheinstancetree. Definition4.(RegularTreeLanguage)Aregulartreelanguageisthesetoftreesgeneratedby aregulartreegrammar. ExtremeMarkupLanguages2000 4 MakotoMurata,DongwonLeeandMuraliMani•TaxonomyofXMLSchemaLanguagesusingFormalLanguage Theory 3.2 LocalTreeGrammarsandLanguages Wefirstdefinecompetitionofnon-terminals,whichmakesvalidationdifficult.Then,we introducearestrictedclasscalled``local''byprohibitingcompetitionofnon-terminals[Tak75]. ThisclassroughlycorrespondstoDTD. Definition5.(CompetingNon-Terminals)Twodifferentnon-terminalsAandBaresaid competingwitheachotherif • oneproductionrulehasAintheleft-handside, • anotherproductionrulehasBintheleft-handside,and • thesetwoproductionrulessharethesameterminalintheright-handside. Example3.ConsideraregulartreegrammarG3=(N,T,S,P),where: N={Book,Author1,Son,Article,Author2,Daughter} T={book,author,son,daughter} S={Book,Article} P={Book→book(Author1),Author1→author(Son),Son→son , Article→article(Author2),Author2→author(Daughter), Daughter→daughter } Author1andAuthor2competewitheachother,sincetheproductionruleforAuthor1andthatfor Author2sharetheterminalauthorintheright-handside.Therearenoothercompetingnon-