Efficient Construction of Metadata-Enhanced Web Corpora
Adrien Barbaresi
Austrian Academy of Sciences – Berlin-Brandenburg Academy of Sciences
[email protected]

To cite this version: Adrien Barbaresi. Efficient construction of metadata-enhanced web corpora. 10th Web as Corpus Workshop, Association for Computational Linguistics (ACL SIGWAC), Aug 2016, Berlin, Germany. pp. 7-16, 10.18653/v1/W16-2602. HAL Id: hal-01371704v2, https://hal.archives-ouvertes.fr/hal-01371704v2, submitted on 18 Oct 2016. Distributed under a Creative Commons Attribution 4.0 International License.

Abstract

Metadata extraction is known to be a problem in general-purpose web corpora, and so is extensive crawling with little yield. The contributions of this paper are threefold: a method to find and download large numbers of WordPress pages; a targeted extraction of content featuring much-needed metadata; and an analysis of the documents in the corpus with insights into actual blog uses.

The study focuses on a single piece of publishing software (WordPress), which allows for reliable extraction of structural elements such as metadata, posts, and comments. The download of about 9 million documents in the course of two experiments leads after processing to 2.7 billion tokens with usable metadata. This comparatively high yield is a step towards more efficiency with respect to machine power and "Hi-Fi" web corpora.

The resulting corpus complies with formal requirements on metadata-enhanced corpora and on weblogs considered as a series of dated entries. However, existing typologies of web texts have to be revised in the light of this hybrid genre.

1 Introduction

1.1 Context

This article introduces work on focused web corpus construction with linguistic research in mind. The purpose of focused web corpora is to complement existing collections, as they allow for better coverage of specific written text types and genres, user-generated content, as well as the latest language evolutions. However, it is quite rare to find ready-made resources. Specific issues include first the discovery of relevant web documents, and second the extraction of text and metadata, e.g. because of exotic markup and text genres (Schäfer et al., 2013). Nonetheless, proper extraction is necessary for the corpora to be established as scientific objects, as science needs an agreed scheme for identifying and registering research data (Sampson, 2000). Web corpus yield is another recurrent problem (Suchomel and Pomikálek, 2012; Schäfer et al., 2014). The shift from web as corpus to web for corpus – mostly due to an expanding web universe and the need for better text quality (Versley and Panchenko, 2012) – as well as the limited resources of research institutions make extensive downloads costly and call for handy solutions (Barbaresi, 2015).

The DWDS lexicography project¹ at the Berlin-Brandenburg Academy of Sciences already features good coverage of specific written text genres (Geyken, 2007). Further experiments including internet-based text genres are currently conducted in joint work with the Austrian Academy of Sciences (Academy Corpora). The common absence of metadata known to the philological tradition, such as authorship and publication date, accounts for a certain defiance regarding web resources, as linguistic evidence cannot be cited or identified properly in the sense of the tradition. Thus, missing or erroneous metadata in "one size fits all" web corpora may undermine the relevance of web texts for linguistic purposes and in the humanities in general. Additionally, nearly all existing text extraction and classification techniques have been developed in the field of information retrieval, that is, not with linguistic objectives in mind.

¹ Digital Dictionary of German, http://zwei.dwds.de

The contributions of this paper are threefold:
(1) a method to find and download large amounts of WordPress pages;
(2) a targeted extraction of content featuring much-needed metadata;
(3) an analysis of the documents in the corpus with insights into actual uses of the blog genre.

My study focuses on a single piece of publishing software, with two experiments: first on the official platform wordpress.com and second on the .at-domain. WordPress is used by about a quarter of all websites worldwide²; the software has become so broadly used that its current deployments can be expected to differ from the original ones. A total of 158,719 blogs in German had previously been found on wordpress.com (Barbaresi and Würzner, 2014). The .at-domain (Austria) is in quantitative terms the 32nd top-level domain, with about 3.7 million hosts reported.³

² http://w3techs.com/technologies/details/cm-wordpress/all/all
³ http://ftp.isc.org/www/survey/reports/2016/01/bynum.txt
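As a rough, hypothetical illustration of what recognising WordPress-powered sites can involve (it is not the discovery procedure of this study), default installations leave characteristic traces in their pages, such as wp-content asset paths or the generator meta tag, and a simple probe can test for them:

```python
# Hypothetical marker-based check, not the discovery method used in the paper:
# default WordPress installs expose characteristic asset paths and a generator tag.
import urllib.request

WP_MARKERS = (
    "/wp-content/",                          # default theme/upload asset path
    "/wp-includes/",                         # core script and style bundles
    'name="generator" content="WordPress',   # meta tag emitted by default installs
)

def looks_like_wordpress(url, timeout=10):
    """Return True if the landing page contains typical WordPress markers."""
    request = urllib.request.Request(url, headers={"User-Agent": "corpus-probe"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            head = response.read(100_000).decode("utf-8", errors="replace")
    except OSError:
        return False
    return any(marker in head for marker in WP_MARKERS)
```

Such markers can of course be removed or altered by themes and plugins, which is one reason why current deployments drift away from the original setup.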
1.2 Definitional and typological criteria

From the beginning of research on blogs/weblogs, the main definitional criterion has always been their form, "reverse chronological sequences of dated entries" (Kumar et al., 2003). Another formal criterion is the use of dedicated software to articulate and publish the entries, a "weblog publishing software tool" (Glance et al., 2004), "public-domain blog software" (Kumar et al., 2003), or Content Management System (CMS). These tools largely impact the way blogs are created and run. 1996 seems to be the acknowledged beginning of the blog/weblog genre, with an exponential increase in their use starting in 1999 with the emergence of several user-friendly publishing tools (Kumar et al., 2003; Herring et al., 2004).

Whether a blog is considered to be a web page as a whole (Glance et al., 2004) or a website containing a series of dated entries, or posts, each being a web page (Kehoe and Gee, 2012), there are invariant elements, such as "a persistent sidebar containing profile information" as well as links to other blogs (Kumar et al., 2003), or blogroll. For that matter, blogs are intricately intertwined in what has been called the blogosphere: "The cross-linking that takes place between blogs, through blogrolls, explicit linking, trackbacks, and referrals has helped create a strong sense of community in the weblogging world." (Glance et al., 2004). This means that a comprehensive crawl could lead to better yields.

Regarding the classification of blogs, Blood (2002) distinguishes three basic types: filters, personal journals, and notebooks, while Krishnamurthy (2002) builds a typology based on the function and intention of the blogs: online diaries, support group, enhanced column, collaborative content creation. More comprehensive typologies established, on the one hand, several genres: online journal, self-declared expert, news filter, writer/artist, spam/advertisement; and, on the other hand, distinctive "variations": collaborative writing, comments from readers, means of publishing (Glance et al., 2004).

2 Related work

2.1 (Meta-)Data Extraction

Data extraction was first based on "wrappers" (nowadays: "scrapers"), which mostly relied on manual design and tended to be brittle and hard to maintain (Crescenzi et al., 2001). These extraction procedures were also used early on by blog search engines (Glance et al., 2004). Since the genre of "web diaries" was established in Japan before blogs, there have been attempts to target not only blog software but also regular pages (Nanno et al., 2004), in which case the extraction of metadata also allows for a distinction based on heuristics.

Efforts were made to generate wrappers automatically, with emphasis on three different approaches (Guo et al., 2010): wrapper induction (e.g. by building a grammar to parse a web page), sequence labeling (e.g. using labeled examples or a schema of data in the page), and statistical analysis with series of resulting heuristics. This analysis, combined with the inspection of DOM tree characteristics (Wang et al., 2009; Guo et al., 2010), is common ground between the information retrieval and web corpus linguistics communities, with the categorization of HTML elements and linguistic features (Ziegler and Skubacz, 2007) for the former, and the markup and boilerplate removal operations known to the latter (Schäfer and Bildhauer, 2013).

Regarding content-based wrappers for blogs in particular, targets include the title of the entry, the date, the author, the content, the number of comments, the archived link, and the trackback link (Glance et al., 2004); they can also aim at comments specifically (Mishne and Glance, 2006).
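To make the notion of a content-based wrapper concrete, the following hypothetical sketch extracts a few of these targets (title, date, author, text, and comment count) from a WordPress-style page with lxml; the class and element names are assumptions drawn from common default theme markup, not the extraction rules used in this study.

```python
# Hypothetical content-based wrapper for a WordPress-style post page.
# Selectors reflect common default theme markup and are assumptions,
# not the extraction rules used in this study.
from lxml import html

def extract_post(page_source):
    """Extract basic metadata and text from a WordPress-style article page."""
    tree = html.fromstring(page_source)

    def first(xpath_expr):
        nodes = tree.xpath(xpath_expr)
        return nodes[0] if nodes else None

    title = first('//h1[contains(@class, "entry-title")]')    # post title heading
    date = first('//time[@datetime]')                         # machine-readable date
    author = first('//*[contains(@class, "author")]//a')      # author byline link
    body = first('//div[contains(@class, "entry-content")]')  # post body container

    return {
        "title": title.text_content().strip() if title is not None else None,
        "date": date.get("datetime") if date is not None else None,
        "author": author.text_content().strip() if author is not None else None,
        "text": " ".join(body.text_content().split()) if body is not None else None,
        # comments are usually list items inside an ordered "comment-list"
        "comments": len(tree.xpath('//ol[contains(@class, "comment-list")]/li')),
    }
```

Hand-written selectors of this kind are precisely the brittle wrappers discussed above, which is what motivates the automatic and statistical approaches surveyed in this section.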
2.2 Blog corpus construction

The first and foremost issue in blog corpus construction still holds true today: "there is no comprehensive directory of weblogs, although several small directories exist" (Glance et al., 2004). Previous work established several modes of construction, from broad, opportunistic approaches to a focus on a particular method or platform due to the convenience of retrieval processes. [...] with the "freshly pressed" page on WordPress as well as a series of trending blogs used as seeds for the crawls, leading to 222,245 blog posts and 2,253,855 comments from Blogger and WordPress combined, totaling about 95 million tokens (for the posts) and 86 million tokens (for the comments).

The YACIS Corpus (Ptaszynski et al., 2012) is a Japanese corpus consisting of blogs collected from a single blog platform, which features mostly