Knowledge and Metadata Integration for Warehousing Complex Data
Jean-Christian Ralaivao 1,2 and Jérôme Darmont 2

1 École Nationale d’Informatique, BP 1487, Fianarantsoa (301), Madagascar
2 Université de Lyon (ERIC Lyon 2), 5 avenue Pierre Mendès-France, 69676 Bron Cedex, France
[email protected]fianar.mg, first name.last [email protected]

Abstract: With the ever-growing availability of so-called complex data, especially on the Web, decision-support systems such as data warehouses must store and process data that are not only numerical or symbolic. Warehousing and analyzing such data requires the joint exploitation of metadata and domain-related knowledge, which must thereby be integrated. In this paper, we survey the types of knowledge and metadata that are needed for managing complex data, discuss the issue of knowledge and metadata integration, and propose a CWM-compliant integration solution that we incorporate into an XML complex data warehousing framework we previously designed.

Keywords: XML data warehousing, complex data, metadata, knowledge, metadata and knowledge integration.

1 Introduction

Decision-support technologies, and more particularly data warehousing [Inm02, KR02, JLVV03], are nowadays technologically mature. Data warehouses are aimed at monitoring and analyzing activities that are materialized by numerical measures (facts), while symbolic data describe these facts and constitute analysis axes (dimensions). However, in real life, many decision-support fields (customer relationship management, marketing, competition monitoring, medicine...) need to exploit data that are not only numerical or symbolic. For example, computer-aided diagnosis systems might require the analysis of various and heterogeneous data, such as patient records, medical images, biological analysis results, and previous diagnoses stored as texts [Saa04]. We term such data complex data [DBRA05]. Their availability is now very common, especially since the broad development of the Web, and more recently the Web 2.0 (blogs, wikis, multimedia data sharing sites...).
Complex data might be structured or not, and are often located in several heterogeneous data sources. Specific approaches are needed to collect, integrate, manage and analyze them. A data warehousing solution is interesting in this context, though adaptations are obviously necessary to take data complexity into account (measures might not be numerical, for instance). Data volume and dating are further arguments in favor of the warehousing approach. In this context, metadata and domain-related knowledge are essential in the processing of complex data and play an important role when integrating, managing, and analyzing them.

In this paper, we address the issue of jointly managing knowledge and metadata, in order to warehouse complex data and handle them, at three different levels: at the supplier level (data providers), to identify all input data sources and the role of source type drivers; at the user level (consumers), to identify all data sources for analysis and their source type drivers; at the manager level (administrators), to achieve good performance.

Since data warehouses traditionally handle knowledge in the form of metadata, we discuss the alternatives for integrating domain-related knowledge and metadata. Our position is that knowledge should be integrated as metadata in a complex data warehouse. On this basis, we also present an XML-based architecture framework for complex data warehouses that expands the one we proposed in [DBRA05].

The remainder of this paper is organized as follows. In Section 2, we survey the various kinds of knowledge and metadata that are required for managing complex data. In Section 3, we discuss the issue of knowledge and metadata integration, justify our choice, and present our revised architecture framework for complex data warehouses. In Section 4, we summarize the state of the art regarding knowledge and metadata integration. We finally conclude this paper and provide future research directions in Section 5.
2 Knowledge and Metadata Needs

2.1 Knowledge Types

Two types of knowledge must be taken into consideration: tacit and explicit knowledge [NSIH02]. Tacit knowledge includes beliefs, perspectives and mental models. Explicit knowledge is knowledge that can be expressed formally using a language, symbols, rules, objects or equations, and thus can be communicated to others. In data warehousing environments, we are particularly interested in explicit knowledge.

Different kinds of questions must then be considered regarding the types of knowledge that are needed to manage complex data warehouses. These questions determine the description context (what), the organizational context (who, where and when), the processing context (how) and the motivation and business rules (why).

Responses to the “what”-type question describe business concepts. These elements guide the link between metadata and knowledge, while knowledge representation uses metadata contents and structure. The “how” and “why” questions relate to each process' motivation and the way it operates, in comparison to an existing organization. Finally, answering the “who”, “where” and “when” questions helps in connecting the first two categories of questions to a particular organization.

Furthermore, one type of knowledge that is often forgotten is universal or background knowledge. For example, the number of days in a month and the work schedule with its working days and public holidays constitute background knowledge that is essential for decision or analytical queries.

We must also consider statistical knowledge, which may include descriptive statistics about the data warehouse contents, or hypotheses about attributes' characteristics, such as probabilistic laws or sampling methods. Statistical knowledge may be provided by data analysis or data mining, and results should be reinjected into the system.

Technical knowledge is also very important at different phases of the data warehouse lifecycle.
At a high level of abstraction, it is closely related to metadata. Technical knowledge includes knowledge about data sources and targets, standard and specific data types, database management systems (DBMSs), software and hardware platforms, technologies, etc. Indexing techniques available in each DBMS belong to this type of knowledge too.

Closely related is knowledge about organizational and geographical deployment, which includes information about users, their needs, their attributions and their constraints in regard to their needs (e.g., in terms of response time, volume of processed data, result format, etc.).

The last kind of knowledge we must consider relates to data warehouse administration. It provides information about how the data warehouse is used (access statistics) and how the interface between the data warehouse and the operational systems articulates, i.e., what the transactional applications and their characteristics (frequencies, response times, users...) are; and what the major Extracting, Transforming and Loading (ETL) problems (scheduling to satisfy user requirements with respect to work schedule, identification of peak periods...) are. The refreshment policies of the data warehouse contents are also important here, since they dictate the rotation period of summary data, the purge period and dormant data determination.

2.2 Metadata Types

We identified five transversal and complementary classifications for metadata in the literature. In the first classification [HMT00], metadata are classified based on the data warehouse architecture layers, as follows:
• metadata associated with data loading and transformation, which describe the source data and any changes operated on data;
• metadata associated with data management, which define the data stored in the data warehouse;
• metadata used by the query manager to generate an appropriate query.
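As a minimal sketch of the layer-based classification above, metadata entries can be tagged with the architecture layer they serve and grouped accordingly. The Python names used here (Layer, MetadataEntry, by_layer) and the sample entries are illustrative assumptions; they do not come from [HMT00] or from any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """Architecture layers of the first classification (member names are assumptions)."""
    LOADING = "data loading and transformation"
    MANAGEMENT = "data management"
    QUERY = "query generation"


@dataclass(frozen=True)
class MetadataEntry:
    name: str          # hypothetical identifier of the metadata item
    layer: Layer       # layer the item is associated with
    description: str   # free-text description


def by_layer(entries):
    """Group entry names by layer, keeping the input order within each layer."""
    groups = {layer: [] for layer in Layer}
    for entry in entries:
        groups[entry.layer].append(entry.name)
    return groups


# Hypothetical sample entries for illustration only.
CATALOG = [
    MetadataEntry("source_schema", Layer.LOADING, "structure of a data source"),
    MetadataEntry("cleaning_rule", Layer.LOADING, "change applied while loading"),
    MetadataEntry("fact_table_def", Layer.MANAGEMENT, "definition of stored data"),
    MetadataEntry("query_template", Layer.QUERY, "used to build a query"),
]

if __name__ == "__main__":
    for layer, names in by_layer(CATALOG).items():
        print(f"{layer.value}: {', '.join(names)}")
```

Using an enumeration keeps the set of layers closed, so an entry cannot be assigned to a layer outside the classification.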
The second classification [HMT00] divides metadata into:
• technical metadata that support the technical staff and contain the terms and definition of metadata as they appear in operational databases;
• business metadata that support business end-users who do not have any technical background;
• information navigator metadata, which are tools that help users navigate through both the business metadata and the warehoused data.

In the third classification [HMT00], metadata may be:
• static metadata that are used to document or browse the system;
• dynamic metadata that can be generated and maintained at run time. A new kind of metadata is made of metadata that handle the mapping between systems.

In the fourth classification [Kim05], metadata may be:
• system catalog metadata or data descriptors;
• relationship metadata that store information about the relationships between data entities (primary key/foreign key relationships, generalization/specialization relationship, aggregation relationship, inheritance relationships and any other special semantic relationship implying update or delete dependency);
• content metadata formed by descriptions of the contents of stored data at an arbitrary granularity. Content metadata may be as simple as one keyword, or as complex as business rules, formulae or links to whole documents;
• data lineage metadata, which are lifecycle data about stored data (information about the creation of data, subsequent updates, transformation, versioning, summarization, migration, and replication, transformation rules, and descriptions of migration and replication);
• technical metadata that store technical information about stored data: format, compression or encoding algorithm used, encryption and decryption algorithms, encryption and decryption keys, software used to create or update the data, Application Programming Interfaces (APIs) available to access the data, etc.;
• data usage metadata or business data that are descriptions of how and for what purposes the data are to be used by users and applications;
• system metadata that are descriptions about the overall system environment, including hardware, operating system and application software;
• process metadata that describe the processes in which the applications operate, and any relevant output of each step of these processes.

Finally, the fifth classification we identified [SE06] is based upon functionality