Information Extraction from the Web Using a Search Engine
ISBN: 978-90-74445-85-6
Cover design by Paul Verspaget
Photo by Marianne Achterbergh

The work described in this thesis has been carried out at the Philips Research Laboratories in Eindhoven, the Netherlands, as part of the Philips Research programme.

© Philips Electronics N.V. 2008. All rights are reserved. Reproduction in whole or in part is prohibited without the written consent of the copyright owner.

Information Extraction from the Web using a Search Engine

DISSERTATION

to obtain the degree of doctor at the Technische Universiteit Eindhoven, by authority of the Rector Magnificus, prof.dr.ir. C.J. van Duijn, to be defended in public before a committee appointed by the Doctorate Board on Monday 8 December 2008 at 16.00 by Gijs Geleijnse, born in Breda

This dissertation has been approved by the promotor:
prof.dr. E.H.L. Aarts

Copromotor:
dr.ir. J.H.M. Korst

Contents

1 Introduction
  1.1 Information on the Web
  1.2 Information Extraction and Web Information Extraction
  1.3 Related Work
  1.4 Outline

2 A Pattern-Based Approach to Web Information Extraction
  2.1 Introduction
  2.2 Extracting Information from the Web using Patterns

3 Two Subproblems in Extracting Information from the Web using Patterns
  3.1 Identifying Effective Patterns
  3.2 Identifying Instances
4 Evaluation: Extracting Factual Information from the Web
  4.1 Populating a Movie Ontology
  4.2 Identifying Burger King and its Empire
  4.3 Identifying Countries
  4.4 The Presidents of the United States of America
  4.5 Extracting Historical Persons from the Web
  4.6 Conclusions

5 Application: Extracting Inferable Information from the Web
  5.1 Improving the Accessibility of a Thesaurus-Based Catalog
  5.2 Extracting Lyrics from the Web

6 Discovering Information by Extracting Community Data
  6.1 Extracting Subjective Information from the Web
  6.2 Processing Extracted Subjective Information
  6.3 Evaluating Extracted Subjective Information
  6.4 Experimental Results
  6.5 Conclusions

7 Conclusions

Bibliography
Publications
Summary
Acknowledgements
Biography

1 Introduction

"All science is either physics or stamp collecting." — Ernest Rutherford

Whether we want to know the name of Canada's capital or gather opinions on Philip Roth's new novel, the World Wide Web is currently the de facto source for finding an arbitrary piece of information. In an era in which a community-based source such as Wikipedia has been found to be as accurate as the Encyclopaedia Britannica [Giles, 2005], the collective knowledge of the internet's contributors is an unsurpassed collection of facts, analyses and opinions. This knowledge makes it easier for people to gather information, form an opinion, or buy a cheap and reliable product.

With its rise in the late nineties, the web was intended as a medium to distribute content to an audience. As with newspapers and magazines, the communication was merely one-way. The content published on the web was presented in an often attractive format and layout, in a natural language (e.g. Dutch) that we are most acquainted with. Nowadays, only a few years later, the web is a place where people can easily contribute, share and reuse thoughts, stories and other expressions of creativity. The popularity of social web sites enriches the information available on the web. This mechanism has turned the web into a place where people can form nuanced opinions about virtually any imaginable subject.

To enable people to share and reuse content, such as the location of that great Vietnamese restaurant in Avignon on Google Maps, information on the web is currently presented not only in a human-friendly fashion, but also in formats that allow interpretation of the information by machines. The so-called Social Web, or Web 2.0, enables people to easily create and publish content. Moreover, content can easily be reused and combined.

A movement alongside the social web is the semantic web. The semantic web community has created a dedicated formal language to express concepts, predicates and relations between concepts. Using this mathematical language for general information, knowledge can be expressed on every imaginable topic. The semantic web can be seen as a distributed knowledge base: instead of browsing through web pages, the semantic web enables direct access to information. The more information is already expressed in the semantic web languages, the easier it becomes to represent new information. For example, to model the concept of First Lady of the United States, one may first need to model the concepts country, United States, person, president, married, time, period and so on.
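As a minimal sketch of this idea (not part of the thesis), the example below represents such knowledge as subject-predicate-object triples, the basic data model underlying semantic web languages, in plain Python. All identifiers and predicate names are illustrative assumptions rather than an actual semantic web vocabulary; the point is only that a derived concept such as First Lady becomes expressible once country, person, president and married have been modelled.

    # Knowledge as subject-predicate-object triples. All names below are
    # illustrative assumptions, not an actual semantic web vocabulary.
    triples = {
        ("United_States", "is_a", "country"),
        ("George_W_Bush", "is_a", "person"),
        ("Laura_Bush", "is_a", "person"),
        ("George_W_Bush", "president_of", "United_States"),
        ("George_W_Bush", "married_to", "Laura_Bush"),
    }

    def first_lady(country):
        # Derive the new concept from two previously modelled relations:
        # find the person who is president of the country, then the
        # person they are married to.
        for subject, predicate, obj in triples:
            if predicate == "president_of" and obj == country:
                president = subject
                for s, p, o in triples:
                    if s == president and p == "married_to":
                        return o
        return None

    print(first_lady("United_States"))  # prints: Laura_Bush

Because the triples are plain, uniform data, content created by different parties can be merged by simply taking the union of their triple sets, which is the linking and combining described next.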
The use of previously defined notions makes the content of the semantic web richer, as content created by various parties can be linked and combined.

In the late sixties in Eindhoven, N.G. De Bruijn and his group developed Automath, a mathematical language and system for mathematics [De Bruijn, 1968; Nederpelt, Geuvers, & De Vrijer, 2004]. Automath is a dedicated formal language to express mathematics. The project can be seen as an attempt to formulate and propagate a universal language for mathematics that is checked by a system. Such languages serve two goals. On the one hand, they are a means to ensure mathematical correctness: if a theorem is provided with a proof in the mathematical language, and the well-designed system accepts this proof, then the theorem can be considered to be true. On the other hand, the language provides a means of clear and unambiguous communication.

Białystok, Poland, the home town of the constructed language Esperanto, is the base of one of the most active projects on formal mathematical languages. The Mizar system builds on a set of axioms, from which a collection of mathematics has been formalized (i.e. derived) throughout the years. Although the Mizar team succeeded in completely formalizing a whole handbook on continuous lattices (by 16 authors in 8 years' time), the formalization of an elementary theory in another mathematical subject (group theory) proved to be too ambitious [Geleijnse, 2004].

In spite of the work done by semantic web and formal mathematics researchers, both mathematicians and web publishers prefer natural language over dedicated artificial languages to express their thoughts and findings. In mathematics, dedicated researchers are formalizing (or translating) definitions, theorems and their proofs into formal languages. The translation of mathematics into formal languages was the topic of my 2004 master's thesis. In this thesis, I will discuss approaches to capture information on the web in a dedicated formalism. Although both topics may be closer to stamp collecting than to physics, I do hope that you will enjoy this work.

1.1 Information on the Web

In this thesis, we focus on information that is represented in natural language texts on the web. We make use of the text itself rather than of the formatting. Hence, we extract information from unstructured texts rather than from formatted tables or XML. Although some web sites may be more authoritative than others, we do not distinguish between sources as such.

Suppose we are interested in a specific piece of information, for example the capital of Australia. Nowadays, the web is an obvious source to learn this and many other facts. The process of retrieving such information generally starts with the use of a search engine, for example Google or perhaps the search engine within Wikipedia.
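To give a flavour of how such a retrieval process can be automated, the sketch below illustrates, in the spirit of the pattern-based approach developed in the following chapters, how a query phrase such as "the capital of Australia is" can be submitted to a search engine and the words following the phrase collected from the returned snippets. The snippet texts are hard-coded stand-ins for real search engine results, and the regular expression is an illustrative assumption, not the thesis's actual implementation.

    import re
    from collections import Counter

    # Stand-ins for the snippets a search engine might return for the
    # query phrase "the capital of Australia is"; a real implementation
    # would retrieve these from a search engine.
    snippets = [
        "Many believe the capital of Australia is Sydney, but it is not.",
        "In fact, the capital of Australia is Canberra, a planned city.",
        "Since 1927, the capital of Australia is Canberra.",
    ]

    # The pattern: the query phrase followed by a capitalized word.
    pattern = re.compile(r"the capital of Australia is ([A-Z][a-z]+)")

    # Count every candidate instance the pattern matches; frequency
    # across many snippets serves as evidence for the correct answer.
    candidates = Counter(m for s in snippets for m in pattern.findall(s))
    print(candidates.most_common())  # [('Canberra', 2), ('Sydney', 1)]

Note that the most frequent candidate (Canberra) is the correct one even though an individual snippet may suggest otherwise; relying on redundancy across many sources rather than on a single authoritative page is a recurring theme in the chapters that follow.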