ABC Research & Development Technical Report ABC-TRP-2015-T1 ACE: Automated Content Enhancement An evaluation of natural language processing approaches to generating metadata from the full text of ABC stories Page 1 of 9 © 2015 Australian Broadcasting Corporation ACE: Automated Content Enhancement An evaluation of natural language processing approaches to generating metadata from the full-text of ABC stories. Principal researcher Charlie Szasz Report prepared by Viveka Weiley & Charlie Szasz It seems obvious, but we have to put the audience at the centre of what we do. If we are not delivering distinctive and quality content that finds its way to the people who pay for us, then we are not fulfilling our basic function. ABC Strategy 2015 – Te 2nd pillar – "audience at the centre" Introduction: The Challenge of Discoverability Audience at the centre is more than a goal – it’s a realit. Audiences already expect to find the stories they are most interested in delivered to them at the time, place, device, platform and format of their choice. Te ABC produces a huge volume of high qualit stories on a broad range of topics. However with the proliferation of alternative media outlets, new platforms for consuming and producing media and increasingly diverse audience expectations, traditional means of bringing those stories to audiences are no longer sufcient. Broadcast at the centre Home page at the centre Page 2 of 9 Audience at the centre Solving for relevance Te task then is to help people find the most relevant stories, however they want them. Tis is more than just a technical challenge. To respond we must look into what people want, and into our own methods for organising and presenting stories to them. ACE is focused on the latter question. In this project we are prototping and demonstrating systems to radically improve discoverabilit of relevant ABC stories. Our first prototpe is available now and has been demonstrated to stakeholders across the ABC. We looked into the former question with the Spoke (Location, News and Relevance) pilots1. Te Spoke engine aggregated stories from across the ABC as well as from third parties, tagged according to location and topic. We then built a mobile app to present those stories to pilot users according to their preferences. Within Spoke we implemented a machine learning system so that it could improve its responses over time, and embedded comprehensive metrics followed up by audience interviews to tell us how well the Spoke content reflected their preferences. Tis analysis taught us a lot about what people care about; in particular their strong preference for highly relevant stories based on their locations and topics of interest, and the granularity of that expectation. For example, we discovered that few people are interested in sport as a category; instead they tend to be interested in certain sports and averse to others; they may want to read every article available on a particular team, but find any information on a sport that they don’t like to be intensely irritating. Similarly, people may be interested in science but not technology or vice versa, and very interested in local business but not at all in national or global business. Tis also gave us insight into how well the ABC’s current and future metadata practices can help meet these emerging audience expectations. 
Current metadata practices

The curated home page is no longer the sole entry point; as users turn to search, aggregators and emerging platforms, new discovery methods must be accounted for. This situation complicates the task of delivering the greatest value for audiences.

Every ABC story is tagged with metadata, primarily used by authors to control how content appears in current websites. To the extent that it accurately describes stories, it can also be used as discovery metadata, to help search engines and aggregators find and present those stories.

There are two main gaps in our metadata. First, the existing fields and ontologies do not afford sufficiently detailed description for discovery and recommendation engines to deliver personalised results. Second, those fields that are available are often left empty by authors and editors. For personalised content delivery and location- and topic-based recommendations to deliver audience value, much more granular and comprehensive metadata will be required.

[1] See ABC-WHP-2015-A

Solving the metadata problem

Sufficiently granular and comprehensive metadata could be delivered entirely manually: by training authors and editors in more comprehensive metadata entry, and through policy direction. This would, however, be a labour-intensive solution, in a context where those people have many competing demands on their time and attention.

Another manual option would be the employment of metadata subeditors. This sufficed for the Spoke pilot, where a single part-time editor was able to add sufficient location metadata to create localised feeds for two regional centres. To scale up to the whole country, and to add the task of increasing metadata granularity, would however not be sustainable.

At the other end of the scale is a fully automated system, using artificially intelligent expert systems to extract meaning from the full text of the articles and apply metadata tags. As the success of algorithm-driven solutions such as Google's PageRank demonstrates, machine learning systems and automated content analysis can be a powerful and scalable means of improving discoverability. As computation becomes exponentially cheaper and more powerful, a wide variety of useful machine learning and expert systems are emerging. These systems first appear in the world as research projects; then, as each approach matures, its products coalesce into proprietary solutions and finally into open source and commodity platforms.

Better metadata through Natural Language Processing

Natural Language Processing (NLP) techniques in particular have the potential to become an important and useful tool for sorting through large volumes of stories, increasing discoverability. NLP is a technology based on artificial intelligence. It can take large volumes of text and summarise and codify it, using a deep understanding of the structure of language as well as databases that link words and phrases to their meanings. It allows metadata to be created automatically.

In the past this has been too computationally expensive for our uses, but in recent years advances in NLP techniques and the raw speed of computing have changed that reality. This presents an opportunity, as NLP is now at the point where open source and commodity platforms are becoming available. In recent years a number of NLP engines have been released as SaaS (Software as a Service) offerings, presenting APIs (Application Programming Interfaces) which can be used to analyse large datasets and provide sample results.
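To illustrate what one of these SaaS APIs looks like in practice, the minimal sketch below posts a story's text to the REST endpoint of TextRazor (one of the three engines evaluated later in this report) and prints the entities and topics returned. The endpoint, header, field names and response shape follow our reading of TextRazor's public REST documentation; treat them as assumptions to verify against the current docs, and the API key is a placeholder.

# A minimal sketch, assuming TextRazor's public REST API: POST the
# story text with a list of extractors; entities and topics come
# back as JSON under response["response"]. Field names reflect our
# understanding of the API, not an official ABC integration.
import requests

API_KEY = "YOUR_TEXTRAZOR_KEY"  # placeholder; use your own key

def analyse_story(text):
    resp = requests.post(
        "https://api.textrazor.com/",
        headers={"x-textrazor-key": API_KEY},
        data={"text": text, "extractors": "entities,topics"},
    )
    resp.raise_for_status()
    return resp.json().get("response", {})

result = analyse_story("BHP Billiton has cut its iron ore output forecast...")
for entity in result.get("entities", []):
    # entityId is the disambiguated name; relevanceScore runs 0..1
    print(entity.get("entityId"), entity.get("relevanceScore"))
for topic in result.get("topics", []):
    print(topic.get("label"), topic.get("score"))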
This makes it possible to use third-party NLP engines as a core component of a technology stack to provide customisable content analysis services for an organisation. Applications include increasing discoverability for textual content like news stories, but also anything for which a transcript is available, such as radio interviews and iView content.

Evaluation of automatically extracted metadata

In order to explore the options for NLP engines, ABC R&D first conducted an overview of all offerings available in the open market. We then selected the top three candidates for testing by building a prototype to facilitate analysis.

The ABC ACE prototype

The result is the ABC ACE prototype, a custom-built research apparatus which connects the three NLP systems to a corpus of ABC content aggregated from across the organisation and provides a user interface for exploring the results generated by each NLP engine.

The prototype presents a query interface to afford exploration of the dataset as augmented by the NLP systems. For example, a user can find stories that are in the category of Business_Finance, which must contain the entity BHP Billiton with relevance at least 0.7 (where 1 is highest) and should contain the concepts Iron Ore and Mining with relevance at least 0.5 (a sketch of these query semantics appears after the research procedure below).

This system is designed to expose as much data and functionality as possible in order to support comparative evaluation of the three systems, and also to open up the possibility space to reveal the breadth of possibilities for NLP. Accordingly it presents a comprehensive range of controls which may be daunting for the non-expert user. Future prototypes may explore specific use cases for NLP, intended for specific user populations and with custom-built user interfaces. In the meantime the ACE prototype system is live and available to use on request for further exploration. To gain access or schedule a demo please contact us.

[Photo: Charlie at the ACE console. Left: code. Centre: ACE prototype. Right: DBpedia entity detail page.]

Research Procedure

Using the ACE prototype, over 2,600 ABC stories were analysed with the three chosen NLP services: AlchemyAPI, OpenCalais and TextRazor. All stories were analysed for:

Named entities, fuzzy and disambiguated
e.g. John Howard - Person (fuzzy)
John Howard - Australian politician, http://en.wikipedia.org/wiki/John_Howard (disambiguated)

Concepts
e.g. Prime Minister - http://dbpedia.org/resource/Prime_minister
Concepts don't necessarily match exact words or phrases in the text; they are derived from meaning and linked to entries in various knowledge bases (Wikipedia, DBpedia).

Categories or Topics (Taxonomy)
e.g. law, govt and politics / government
Only AlchemyAPI provides hierarchical categories; TextRazor and OpenCalais derive only the top level. The taxonomies extracted by all three services loosely match the ABC's own.

Sentiment
Only AlchemyAPI provides sentiment analysis. Due to limited API calls, only named entities were analysed for sentiment; the overall sentiment of stories was not established.

Engine Comparison – Initial Conclusion

As a result of this analysis we determined that all engines detected and identified a similar number of named entities, concepts and topics (within an order of magnitude).
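To make the example query's semantics concrete, the sketch below applies it to story records: category Business_Finance and entity BHP Billiton at relevance ≥ 0.7 are hard ("must") constraints, while the concepts Iron Ore and Mining at relevance ≥ 0.5 are soft ("should") constraints used for ranking. The record layout, field names and sample story are invented for illustration; they are not the ACE prototype's actual data model.

# Hypothetical sketch of the example ACE query. The story record
# layout is invented for illustration, not the prototype's actual
# data model. "Must" clauses filter; "should" clauses rank.

def matches(story):
    """Hard constraints: category plus a must-contain entity."""
    if "Business_Finance" not in story["categories"]:
        return False
    return story["entities"].get("BHP Billiton", 0.0) >= 0.7

def score(story):
    """Soft constraints: sum relevance of should-contain concepts."""
    return sum(
        rel
        for concept, rel in story["concepts"].items()
        if concept in ("Iron Ore", "Mining") and rel >= 0.5
    )

stories = [
    {
        "title": "BHP Billiton cuts iron ore forecast",
        "categories": ["Business_Finance"],
        "entities": {"BHP Billiton": 0.92},
        "concepts": {"Iron Ore": 0.81, "Mining": 0.66},
    },
]

results = sorted(filter(matches, stories), key=score, reverse=True)
for story in results:
    print(story["title"], round(score(story), 2))

Splitting the query into a boolean filter and a ranking score mirrors the must/should distinction in the prototype's query interface: a story missing the entity is excluded outright, while weaker concept matches merely rank lower.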