Querying Semi-Structured Data

Serge Abiteboul *
INRIA-Rocquencourt
Serge.Abiteboul@inria.fr

1 Introduction

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured data in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call semi-structured data the data that is, from a particular viewpoint, neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The "lightweight" data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and in Section 4, the issue of the query language.

* Currently visiting the Computer Science Dept., Stanford U. Work supported in part by CESDIS, NASA Goddard Space Flight Center; by the Air Force Wright Laboratory Aeronautical Systems Center under ARPA Contract F33615-93-1-1339, and by the Air Force Rome Laboratories under ARPA Contract F30602-95-C-0119.

2 Semi-Structured Data

In this section, we make more precise what we mean by semi-structured data, how such data arises, and emphasize its main aspects.

Roughly speaking, semi-structured data is data that is neither raw data, nor very strictly typed as in conventional database systems. Clearly, this definition is imprecise. For instance, would a BibTex file be considered structured or semi-structured? Indeed, the same piece of information may be viewed as unstructured at some early processing stage, but later become very structured after some analysis has been performed. In this section, we give examples of semi-structured data, make this notion more precise, and describe important issues in this context.

2.1 Examples

We will often discuss in this paper BibTex files [Lam94], which present the advantage of being more familiar to researchers than other well-accepted formats such as SGML [ISO86] or ASN.1 [ISO87]. Data in BibTex files closely resembles relational data. Such a file is composed of records. But the structure is not as regular. Some fields may be missing. Indeed, it is customary to even find compulsory fields missing. Other fields have some meaningful structure, e.g., author.
There are complex features such as abbreviations or cross references that are not easy to describe in some database systems.

The Web also provides numerous popular examples of semi-structured data. In the Web, data consists of files in a particular format, HTML, with some structuring primitives such as tags and anchors. A typical example is a data source about restaurants in the Bay Area (from the Palo Alto Weekly newspaper), that we will call Guide. It consists of an HTML file with one entry per restaurant and provides some information on prices, addresses, styles of restaurants and reviews. Data in Guide resides in files of text with some implicit structure. One can write a parser to extract the underlying structure. However, there is a large degree of irregularity in the structure since (i) restaurants are not all treated in a uniform manner (e.g., much less information is given for fast-food joints) and (ii) information is entered as plain text by human beings who do not present the standard rigidity of your favorite data loader. Therefore, the parser will have to be tolerant and accept to fail parsing portions of text that will remain as plain text.

Also, semi-structured data arises often when integrating several (possibly structured) sources. Data integration of independent sources has been a popular topic of research since the very early days of databases. Surveys can be found in [SL90, LMR90, Bre90], and more recent work on the integration of heterogeneous sources in, e.g., [LRO96, QRS+95, C+95]. It has gained a new vigor with the recent popularity of the Web. Consider the integration of car retailer databases. Some retailers will represent addresses as strings and others as tuples. Retailers will probably use different conventions for representing dates, prices, invoices, etc. We should expect some information to be missing from some sources. E.g., some retailers may not record whether non-automatic transmission is available.
More generally, a wide heterogeneity in the organization of data should be expected from the car retailer data sources and not all can be resolved by the integration software.

Semi-structured data arises under a variety of forms for a wide range of applications such as genome databases, scientific databases, libraries of programs and more generally, digital libraries, on-line documentations, electronic commerce. It is thus essential to better understand the issue of querying semi-structured data.

2.2 Main aspects

The structure is irregular: This must be clear from the previous discussion. In many of these applications, the large collections that are maintained often consist of heterogeneous elements. Some elements may be incomplete. On the other hand, other elements may record extra information, e.g., annotations. Different types may be used for the same kind of information, e.g., prices may be in dollars in portions of the database and in francs in others. The same piece of information, e.g., an address, may be structured in some places as a string and in others as a tuple. Modelling and querying such irregular structures are essential issues.

The structure is implicit: In many applications, although a precise structuring exists, it is given implicitly. For instance, electronic documents consist often of a text and a grammar (e.g., a DTD in SGML). The parsing of the document then allows one to isolate pieces of information and detect relationships between them. However, the interpretation of these relationships (e.g., SGML exceptions) may be beyond the capabilities of standard database models and are left to the particular applications and specific tools. We view this structure as implicit (although specified explicitly by tags) since (i) some computation is required to obtain it (e.g., parsing) and (ii) the correspondence between the parse-tree and the logical representation of the data is not always immediate.

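As a rough illustration of the two aspects discussed so far, the following Python sketch parses a small tagged document and then queries the irregular records that come out of it. It is only a sketch under stated assumptions: XML and Python's xml.etree.ElementTree stand in for SGML and its tools, and the document, the tag names, the restaurant names, and the helpers to_record and city_of are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged document (XML standing in for SGML).
# The two entries are irregular: one address is a string, the other
# is tuple-like, and the second entry has no price at all.
DOC = """
<guide>
  <restaurant>
    <name>Chez Marie</name>
    <address>12 Main St, Palo Alto</address>
    <price currency="USD">25</price>
  </restaurant>
  <restaurant>
    <name>Burger Stop</name>
    <address><street>400 El Camino</street><city>Menlo Park</city></address>
  </restaurant>
</guide>
"""

def to_record(elem):
    """Map a parse-tree node to a string (leaf) or a dict of child records.

    Attributes are ignored in this simplified mapping.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    return {child.tag: to_record(child) for child in children}

def city_of(record):
    """Return the city whether the address is a string or a tuple-like dict."""
    address = record.get("address")
    if isinstance(address, dict):
        return address.get("city")
    if isinstance(address, str) and "," in address:
        # Assumption: in string addresses the city follows the last comma.
        return address.rsplit(",", 1)[1].strip()
    return None

# The structure only becomes available after parsing.
records = [to_record(r) for r in ET.fromstring(DOC)]
cities = [city_of(r) for r in records]
```

The query helper has to branch on the shape of the address, and a query on price has to tolerate records that lack the field. This is the kind of burden that a data model and query language for semi-structured data would be expected to absorb.
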
It is also sometimes the case, in particular for the Web, that the documents come as plain text. Some ad-hoc analysis is then needed to extract the structure. For instance, in the Guide data source, the description of a restaurant is in plain text. Now, clearly, it is possible to develop some analysis tools to recognize prices, addresses, etc. and then extract the structure of the file. The issue of extracting the structure of some text (e.g., HTML) is a challenging issue.

The structure is partial: To completely structure the data often remains an elusive goal. Parts of the data may lack structure (e.g., bitmaps); other parts may only unveil some very sketchy structure (e.g., unstructured text). Information retrieval tools may provide a limited form of structure, e.g., by computing occurrences of particular words or groups of words and by classifying documents based on their content.

An application may also decide to leave large quantities of data outside the database. This data then remains unstructured from a database viewpoint. The loading of this external data, its analysis, and its integration to the database have to be performed efficiently. We may also want to use optimization techniques to only load selective portions of this data, in the style of [ACM93]. In general, the management and access of this external data and its interoperability with the data from the database is an important issue.
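A tolerant extractor of the kind suggested for Guide might be sketched as follows in Python. The actual layout of Guide entries is not given here, so the sample entries, the two regular expressions, and the field names price_usd and address are assumptions made for illustration only. The point is that the extractor records what it recognizes and leaves everything else as plain text, so the resulting structure is partial.

```python
import re

# Hypothetical patterns; real entries would need more careful rules.
PRICE = re.compile(r"\$(\d+(?:\.\d{2})?)")
ADDRESS = re.compile(r"\b\d+ (?:[A-Z][a-z]+ )+(?:St|Ave|Rd|Blvd)\b")

def extract(entry):
    """Recognize what we can; always keep the entry itself as plain text."""
    record = {"text": entry}
    price = PRICE.search(entry)
    if price:
        record["price_usd"] = float(price.group(1))
    address = ADDRESS.search(entry)
    if address:
        record["address"] = address.group(0)
    return record

# Made-up entries: the second one carries much less information.
ENTRIES = [
    "Chez Marie, 12 Main St, Palo Alto. Dinner around $25. Cozy room.",
    "Burger Stop: burgers and fries.",
]

records = [extract(e) for e in ENTRIES]
```

The first record gains a price and an address, while the second keeps only its text field. Failing to recognize something is not treated as an error, which matches the tolerance asked of the parser in Section 2.1.
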
