Improving Retrieval Accuracy in Main Content Extraction from HTML Web Documents

A DISSERTATION accepted by the Faculty of Mathematics and Computer Science of the University of Leipzig for the award of the academic degree of Doctor rerum naturalium (Dr. rer. nat.) in the field of Computer Science, submitted by Hadi Mohammadzadeh, Master of Applied Mathematics, born on 12 October 1969 in Mashhad, Iran.

Acceptance of the dissertation was recommended by:
1. Professor Dr. Jinan Fiaidhi (Ontario, Canada)
2. Professor Dr. Gerhard Heyer (Leipzig)

The academic degree was conferred upon passing the defense on 27 November 2013 with the overall grade magna cum laude.

Acting Dean: Professor Dr. Hans-Bert Rademacher
Reviewers: Professor Dr. Jinan Fiaidhi, Professor Dr. Gerhard Heyer
Date of the doctoral examination: 27 November 2013

I would like to dedicate this thesis to my loving parents.

Abstract

The rapid growth of text-based information on the World Wide Web and the variety of applications making use of this data motivate the need for efficient and effective methods to identify and separate the "main content" from additional content items, such as navigation menus, advertisements, design elements, or legal disclaimers.

Firstly, in this thesis we study, develop, and evaluate R2L, DANA, DANAg, and AdDANAg, a family of novel algorithms for extracting the main content of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other three algorithms, is to exploit the particularities of Right-to-Left languages to obtain the main content of web pages. Since the English character set and the Right-to-Left character sets are encoded in different intervals of the Unicode character set, we can efficiently distinguish the Right-to-Left characters from the English ones in an HTML file. This enables the R2L approach to recognize areas of the HTML file with a high density of Right-to-Left characters and a low density of characters from the English character set.
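The idea of separating scripts by their Unicode intervals can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the thesis: the chosen block ranges (Hebrew, Arabic, Arabic Supplement) and the 0.5 density threshold are illustrative assumptions.

```python
# Minimal sketch of the R2L idea: classify characters by Unicode block and
# keep lines whose density of Right-to-Left characters is high.
# Block ranges and the 0.5 threshold are illustrative assumptions, not the
# exact values used in the thesis.

def is_rtl(ch: str) -> bool:
    """True for characters in common Right-to-Left Unicode blocks."""
    cp = ord(ch)
    return (0x0590 <= cp <= 0x05FF      # Hebrew
            or 0x0600 <= cp <= 0x06FF   # Arabic (also covers Persian)
            or 0x0750 <= cp <= 0x077F)  # Arabic Supplement

def rtl_density(line: str) -> float:
    """Fraction of the letters in a line that are Right-to-Left."""
    letters = [c for c in line if c.isalpha()]
    if not letters:
        return 0.0
    return sum(is_rtl(c) for c in letters) / len(letters)

def extract_rtl_lines(html_lines, threshold=0.5):
    """Keep only the RTL characters of lines dominated by RTL text."""
    main = []
    for line in html_lines:
        if rtl_density(line) >= threshold:
            main.append("".join(c for c in line if is_rtl(c) or c.isspace()))
    return main
```

Because the classification is a single range check per character, a density profile of an entire HTML file can be computed in one linear pass, which is what makes this family of approaches fast.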
Having recognized these areas, R2L can successfully separate out only the Right-to-Left characters. The first extension of R2L, DANA, improves the effectiveness of the baseline algorithm by employing an HTML parser in a post-processing phase of R2L to extract the main content from areas with a high density of Right-to-Left characters. DANAg is the second extension of R2L and generalizes the idea of R2L to render it language-independent. AdDANAg, the third extension of R2L, integrates a new preprocessing step to normalize hyperlink tags. The presented approaches are analyzed with respect to efficiency and effectiveness. We compare them to several established main content extraction algorithms and show that we extend the state of the art in terms of both efficiency and effectiveness.

Secondly, automatically extracting the headlines of web articles has many applications. We develop and evaluate a content-based and language-independent approach, TitleFinder, for unsupervised extraction of the headlines of web articles. The proposed method achieves high performance in terms of effectiveness and efficiency and outperforms approaches operating on structural and visual features.

Zusammenfassung

The rapid growth of text-based information on the World Wide Web and the variety of applications using this data make it necessary to develop efficient and effective methods that identify the main content and separate it from additional content items such as navigation menus, advertisements, design elements, or legal disclaimers. First, in this thesis we study, develop, and evaluate R2L, DANA, DANAg, and AdDANAg, a family of novel algorithms for extracting the content of web documents. The fundamental concept behind R2L, which also led to the development of the three further algorithms, exploits the particularities of Right-to-Left languages in order to extract the main content of web pages.
Since the Latin character set and the Right-to-Left character sets are encoded in different sections of the Unicode character set, the Right-to-Left characters can easily be distinguished from the Latin characters in an HTML file. This allows the R2L approach to recognize areas of an HTML file with a high density of Right-to-Left characters and few Latin characters. From these areas, R2L can then extract the Right-to-Left characters. The first extension, DANA, improves the effectiveness of the baseline algorithm by using an HTML parser in the post-processing phase of the R2L algorithm to extract the content from areas with a high density of Right-to-Left characters. DANAg extends the approach of the R2L algorithm so that language independence is achieved. The third extension, AdDANAg, integrates a new preprocessing step in order to normalize, among other things, the hyperlinks. The presented approaches are analyzed with respect to efficiency and effectiveness. In comparison with several established main content extraction algorithms, we show that they are superior in these respects.

Furthermore, the extraction of headlines from web articles has many applications. To this end, we develop TitleFinder, a language-independent approach that relies solely on the textual content. The presented method outperforms known approaches based on structural and visual features of the HTML file in terms of effectiveness and efficiency.

Acknowledgements

Like all research, the work described in this dissertation was not conducted in isolation. It is impossible to thank everyone who has in some way contributed to it, and equally impossible to thank individuals for all of their contributions, but I would like to thank the people who gave me the opportunity to work and to enjoy my PhD journey.
I am indebted for technical insights, encouragement, and friendship to many members of the Institute of Applied Information Processing at the University of Ulm, as well as to those I had the privilege of working with in the Department of Natural Language Processing at the University of Leipzig. This thesis would not have been possible without the support and the insights of Prof. Dr. Franz Schweiggert. I am grateful for the chance I was given to reach this point in my career. I am most grateful to Prof. Dr. Gholamreza Nakhaeizadeh. It has been a great pleasure working with him. I have to thank him for all the support, the suggestions, the patience, and every kind of help that he gave me, and for the person I have become; and, of course, for the time spent outside work and his continuous suggestions. Grateful acknowledgment is made to my supervisor Prof. Dr. Gerhard Heyer for all his help, guidance, and support. I sincerely thank Prof. Dr. Jinan Fiaidhi for reviewing this thesis and for her useful comments. Thanks also go to my friends and colleagues at the University of Ulm, in particular Dr. Andreas Borchert, Armin Rutzen, Wolfgang Kaifler, Rene Just, Steffen Kram, and Michaela Weiss. I would like to show my gratitude to Dr. Thomas Gottron for his advice, his collaboration, and the opportunity to work and publish several papers together. I want to thank one of my best friends, Dr. Norbert Heidenbluth, for helping me to prepare the figures and diagrams. Special thanks are dedicated to Mrs Roswitha Jelinek and Mrs Gisela Richter for their support and efforts. Finally, I thank Dr. Ljubow Rutzen-Loesevitz for proofreading this thesis.

Publications

The algorithms for main content extraction from HTML web documents and the experimental results introduced in this thesis have been published in six conference and workshop proceedings in addition to one book chapter, as follows:

• Hadi Mohammadzadeh, Franz Schweiggert, and Gholamreza Nakhaeizadeh.
Using UTF-8 to Extract Main Content of Right to Left Language Web Pages. In: ICSOFT 2011 - Proceedings of the 6th International Conference on Software and Data Technologies, Seville, Spain, 2011, volume 1, pp. 243–249, SciTePress.
• Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, and Gholamreza Nakhaeizadeh. A Fast and Accurate Approach for Main Content Extraction based on Character Encoding. In: TIR'11: Proceedings of the 8th International Workshop on Text-based Information Retrieval (DEXA'11), Toulouse, France, 2011, pp. 167–171, IEEE Computer Society.
• Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, and Gholamreza Nakhaeizadeh. Extracting the Main Content of Web Documents Based on a Naive Smoothing Method. In: KDIR'11: International Conference on Knowledge Discovery and Information Retrieval, Paris, France, 2011, pp. 470–475, SciTePress.
• Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, and Gholamreza Nakhaeizadeh. Extracting the Main Content of Web Documents based on Character Encoding and a Naive Smoothing Method. In: Software and Data Technologies, CCIS Series, Springer, 2013, pp. 217–236, Springer-Verlag Berlin Heidelberg.
• Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, and Gholamreza Nakhaeizadeh. The impact of source code normalization on main content extraction. In: WEBIST'12: 8th International Conference on Web Information Systems and Technologies, Porto, Portugal, 2012, pp. 677–682, SciTePress.
• Hadi Mohammadzadeh, Omid Paknia, Franz Schweiggert, and Thomas Gottron. Revealing trends based on defined queries in biological publications using cosine similarity. In: BIOKDD'12: 23rd International Workshop on