INTERNALS OF AN AGGREGATED WEB NEWS FEED

Mitja Trampuš, Blaž Novak
Artificial Intelligence Laboratory
Jozef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
e-mail: {mitja.trampus, blaz.novak}@ijs.si

ABSTRACT
We present IJS NewsFeed, a pipeline for acquiring a clean, continuous, real-time aggregated stream of publicly available news articles from web sites across the world. The articles are stripped of the web page chrome and semantically enriched to include e.g. a list of entities appearing in each article. The results are cached and distributed in an efficient manner.

1 INTRODUCTION
The news aggregator is a piece of software which provides a real-time aggregated stream of textual news items provided by RSS-enabled news providers across the world. The pipeline performs the following main steps:
1) Periodically crawls a list of RSS feeds and a subset of Google News and obtains links to news articles
2) Downloads the articles, taking care not to overload any of the hosting servers
3) Parses each article to obtain
   a. Potential new RSS sources, to be used in step (1)
   b. A cleartext version of the article body
4) Processes articles with Enrycher (see Section 3.2)
5) Exposes two streams of news articles (cleartext and Enrycher-processed) to end users.

2 SYSTEM ARCHITECTURE
Figure 1 gives a schematic overview of the architecture. The first part of the aggregator is based around a PostgreSQL database running on a Linux server. The database contains a list of RSS feeds which are periodically downloaded by the RSS monitoring component. RSS feeds contain a list of news article URLs and some associated metadata, such as tags, publication date, etc. Articles that are not already present in the database are added to a list of article URLs and marked for download. Tags and publication date are also stored alongside, if found in the RSS.

Figure 1: The system architecture of IJS NewsFeed.

A separate component periodically retrieves the list of new articles and fetches them from the web. The complete HTML is stored in the database and simultaneously sent to a set of cleaning processes over a 0mq message queue.

The cleaning process converts the HTML into UTF-8 encoding, determines which part of the HTML contains the useful text, and discards the remainder and all of the tags. Finally, a classifier is used to determine the primary language.

The cleaned version of the text is stored back in the database and sent over a message queue to consumers. Documents in the English language are sent to the Enrycher web service, where named entities are extracted and resolved, and the entire document is categorized into a DMOZ topic hierarchy.

Both the cleartext and the enriched versions of documents are fed to a filesystem cache, which stores a sequence of compressed XML files, each containing a series of documents in the order in which they arrived through the processing pipeline. The caching service exposes an HTTP interface to the world through an Apache transparent proxy, serving those compressed XML files on user request.

The Apache server also hosts a CGI process capable of generating HTML5 server-sent events, which contain the article metadata and cleartext as payload. These events can be consumed in a browser using JavaScript's EventSource object.
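To make the front of this pipeline concrete, the sketch below shows how an RSS monitoring pass and the hand-off of raw HTML to the cleaning processes over a 0mq queue might look. It is a minimal sketch under assumed details, not the actual IJS NewsFeed code: the feedparser library, the psycopg2 driver, the table layout (feeds, articles), the queue address and the connection string are all assumptions made for illustration, and the fetch is folded into the same loop here, whereas in the system it is a separate component that also throttles requests per host.

```python
# Minimal sketch of the RSS monitoring / fetch / hand-off step.
# Assumptions (not from the paper): feedparser for RSS parsing, psycopg2 for
# PostgreSQL access, pyzmq for the 0mq queue, and a hypothetical schema.
import feedparser
import psycopg2
import requests
import zmq

DB_DSN = "dbname=newsfeed"           # hypothetical connection string
QUEUE_ADDR = "tcp://127.0.0.1:5556"  # hypothetical 0mq endpoint

def poll_feeds():
    conn = psycopg2.connect(DB_DSN)
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)      # cleaning workers would PULL from this
    push.connect(QUEUE_ADDR)

    with conn, conn.cursor() as cur:
        cur.execute("SELECT id, url FROM feeds")
        for feed_id, feed_url in cur.fetchall():
            for entry in feedparser.parse(feed_url).entries:
                # Skip articles already seen; otherwise store URL + RSS metadata.
                cur.execute("SELECT 1 FROM articles WHERE url = %s", (entry.link,))
                if cur.fetchone():
                    continue
                cur.execute(
                    "INSERT INTO articles (feed_id, url, published, tags) "
                    "VALUES (%s, %s, %s, %s)",
                    (feed_id, entry.link, entry.get("published"),
                     [t["term"] for t in entry.get("tags", [])]),
                )
                # Fetch the article and hand the raw HTML to the cleaning processes.
                html = requests.get(entry.link, timeout=30).text
                push.send_json({"url": entry.link, "html": html})
```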
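On the consumer side, the paper points to JavaScript's EventSource object in a browser; the same HTML5 server-sent-events stream can also be read from a script. Below is a minimal sketch in Python using the requests library. The endpoint URL and the payload layout (one JSON object per data: line, with url and cleartext fields) are assumptions for illustration; the paper only states that events carry the article metadata and cleartext as payload.

```python
# Minimal sketch of consuming the server-sent events stream outside a browser.
# STREAM_URL and the JSON field names are hypothetical.
import json
import requests

STREAM_URL = "http://example.org/newsfeed/stream"  # hypothetical endpoint

def consume():
    with requests.get(STREAM_URL, stream=True,
                      headers={"Accept": "text/event-stream"}) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue  # ignore blank keep-alive lines and comments
            article = json.loads(line[len("data:"):].strip())
            print(article.get("url"), len(article.get("cleartext", "")))

if __name__ == "__main__":
    consume()
```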
3 DATA PREPROCESSING
Data preprocessing is an important part of the pipeline, both in terms of the added value it provides and in terms of the challenges posed by the data volume. The articles themselves are certainly useful, but almost any automated task dealing with them first needs to transform the raw HTML into a form more suitable for further processing. We therefore perform the preprocessing ourselves; this is much like the practice followed by professional data aggregation services like Spinn3r or Gnip.

In terms of data volume, preprocessing is the most interesting stage and the one at which the most trade-offs can be made. The present data download rate of about one article per second is nothing extreme, especially if we consider scaling to multiple processing nodes; however, it is nontrivial in that adding complex preprocessing steps (e.g. full syntactic parsing of text) or drastically increasing the data load (e.g. including a 10% sample of the Twitter feed) would turn preprocessing into a bottleneck and require us to scale the architecture.

3.1 Extracting article body from web pages
Extracting meaningful content from the HTML is the most obviously needed preprocessing step. As this is a pervasive problem, a lot has been published on the topic; see e.g. Pasternack (2009), Arias (2009), and Kohlschütter (2010).

We initially implemented the algorithm by Pasternack because of its simplicity and reported state-of-the-art performance. The algorithm scores each token (a word or a tag) in the document based on how probable it is to be part of the final result (the scores are trained); then it extracts the maximum-scoring token subsequence.

Datasets
We tested the initial algorithm on three manually developed datasets. Each of the three consists of 50 articles, each from a different web site.

english – English articles only.

alphabet – Non-English articles using an alphabet, i.e. one glyph per sound. This includes e.g. Arabic.

syllabary – Non-English articles using a syllabary, i.e. one glyph per syllable. This boils down to Asian languages. They lack word boundaries and have generally shorter articles in terms of glyphs. Also, the design of Asian pages tends to be slightly different.

Some of the input pages (about 5%), realistically, also do not include meaningful content. This is different from other datasets but very relevant to our scenario. Examples are paywall pages and pages with a picture bearing a single-sentence caption.

The fact that each of the 150 articles comes from a different web site is crucial – most of the papers on this topic evaluate on a dataset from a small number of sites, which leads to overfitting and poor performance in the general case. This was also the case with Pasternack's algorithm. As its performance was unsatisfactory, we developed three new algorithms.

Algorithms
WWW – an improved version of Pasternack (2009); it extracts the two most promising contiguous chunks of text from the article, to account for the fact that the first paragraph is often placed separately from the main article body.

WWW++ – a combination of WWW and heuristic pre- and post-processing to account for the most obvious errors of WWW. For instance, the preprocessing tries to remove user comments.

DOM – a completely heuristics-based approach proposed here which requires the DOM tree to be computed. With the fast libxml package, this is not a limiting factor. The core of the heuristic is to take the first large enough DOM element that contains enough promising <p> elements. Failing that, take the first <td> or <div> element which contains enough promising text. The heuristics for the definition of "promising" rely on metrics found in other papers as well; most importantly, the amount of markup within a node. Importantly, none of the heuristics are site-specific.
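To illustrate the flavour of the DOM heuristic, the sketch below implements a simplified variant of the rule just described using lxml. The thresholds, the markup-ratio scoring and the helper names are assumptions, not values from the paper; and where the paper takes the "first large enough" element, this sketch prefers the tightest qualifying container, which is a simplification.

```python
# Simplified sketch of a DOM-based extractor in the spirit of the heuristic
# described above. All thresholds are illustrative; the paper gives none.
from lxml import html

MIN_TEXT_LEN = 1000      # "large enough" container (assumed value)
MIN_GOOD_PARAGRAPHS = 3  # "enough promising <p> elements" (assumed value)
MAX_MARKUP_RATIO = 0.3   # promising = little markup inside the node (assumed)

def markup_ratio(el):
    """Approximate share of a node's serialized length taken up by markup."""
    total = len(html.tostring(el, encoding="unicode"))
    text = len(el.text_content())
    return 1.0 - text / total if total else 1.0

def is_promising(el):
    text = el.text_content().strip()
    return len(text) > 50 and markup_ratio(el) < MAX_MARKUP_RATIO

def extract_body(page_html):
    doc = html.fromstring(page_html)
    # Pass 1: the tightest element with enough text and enough promising <p>s.
    best = None
    for el in doc.iter("*"):
        if len(el.text_content()) < MIN_TEXT_LEN:
            continue
        good = [p for p in el.findall(".//p") if is_promising(p)]
        if len(good) >= MIN_GOOD_PARAGRAPHS:
            if best is None or len(el.text_content()) < len(best.text_content()):
                best = el
    if best is not None:
        return "\n".join(p.text_content().strip()
                         for p in best.findall(".//p") if is_promising(p))
    # Pass 2 (fallback): the first <td> or <div> with enough promising text.
    for el in doc.iter("td", "div"):
        if len(el.text_content()) >= MIN_TEXT_LEN and is_promising(el):
            return el.text_content().strip()
    return ""  # e.g. paywall pages with no meaningful article body
```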
Figure 2: The daily number of downloaded articles (y-axis: number of cleartext articles per day; x-axis: date, May 2008 to August 2012). A weekly pattern is nicely observable. Through most of 2011, only Google News was used as an article source, hence the significantly lower volume in that period.

In all three algorithms, all pages are first normalized to the UTF-8 character set using the BeautifulSoup package (which in turn uses a combination of HTTP headers, meta tags and the chardet tool).

Evaluation
We evaluated two of the three pairs of algorithms by comparing per-article performance. We did not compare WWW and DOM directly; based on an informal inspection of the outputs, DOM would be certain to perform better.

Algo            WWW vs WWW++                       WWW++ vs DOM
                number of articles where one …     number of articles where one …

Garbage [5.8%] – The output contains mostly or exclusively text that is not in the golden standard. These are almost always articles with a very short body and a long copyright disclaimer that gets picked up instead.

Missed [5.8%] – Although the article contains meaningful content, the output is an empty string, i.e. the algorithm fails to find any content.

If we combine "Perfect" and "Good" (where the outcome is most often only a sentence away from the perfect match) into a "Positive" score, both precision and recall for DOM are 94%.
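As a concrete illustration of the normalization step mentioned above, the snippet below shows one way to coerce a downloaded page to UTF-8 with BeautifulSoup's encoding detection, which combines declared encodings with chardet. It is a minimal sketch under assumed inputs, not the pipeline's actual code: the helper name and the way the HTTP-header charset is passed in are assumptions.

```python
# Minimal sketch of normalizing a fetched page to UTF-8, in the spirit of the
# step described above: BeautifulSoup's UnicodeDammit weighs the HTTP-header
# charset, <meta> tags and chardet's guess. The function name and the
# header_charset parameter are illustrative assumptions.
from bs4 import UnicodeDammit

def to_utf8(raw_bytes, header_charset=None):
    hints = [header_charset] if header_charset else []
    dammit = UnicodeDammit(raw_bytes, hints, is_html=True)
    if dammit.unicode_markup is None:
        raise ValueError("could not detect character encoding")
    # Return the decoded text together with the encoding that was detected.
    return dammit.unicode_markup, dammit.original_encoding
```

The returned Unicode string can then be handed to the body-extraction step, while the detected encoding can be logged to spot feeds that repeatedly mis-declare their charset.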