ABC Research & Development Technical Report ABC-TRP-2015-T1

ACE: Automated Content Enhancement

An evaluation of natural language processing approaches to generating metadata from the full text of ABC stories

© 2015 Australian Broadcasting Corporation

Principal researcher: Charlie Szasz
Report prepared by Viveka Weiley & Charlie Szasz

It seems obvious, but we have to put the audience at the centre of what we do. If we are not delivering distinctive and quality content that finds its way to the people who pay for us, then we are not fulfilling our basic function.

ABC Strategy 2015 – The 2nd pillar – "audience at the centre"

Introduction: The Challenge of Discoverability

Audience at the centre is more than a goal – it’s a reality. Audiences already expect to find the stories they are most interested in delivered to them at the time, place, device, platform and format of their choice.

The ABC produces a huge volume of high-quality stories on a broad range of topics. However, with the proliferation of alternative media outlets, new platforms for consuming and producing media and increasingly diverse audience expectations, traditional means of bringing those stories to audiences are no longer sufficient.

[Figure panels: Broadcast at the centre; Home page at the centre; Audience at the centre]

Solving for relevance

The task then is to help people find the most relevant stories, however they want them. This is more than just a technical challenge. To respond we must look into what people want, and into our own methods for organising and presenting stories to them.

ACE is focused on the latter question. In this project we are prototyping and demonstrating systems to radically improve discoverability of relevant ABC stories. Our first prototype is available now and has been demonstrated to stakeholders across the ABC.

We looked into the former question with the Spoke (Location, News and Relevance) pilots¹. The Spoke engine aggregated stories from across the ABC as well as from third parties, tagged according to location and topic. We then built a mobile app to present those stories to pilot users according to their preferences. Within Spoke we implemented a machine learning system so that it could improve its responses over time, and embedded comprehensive metrics followed up by audience interviews to tell us how well the Spoke content reflected their preferences.

This analysis taught us a lot about what people care about; in particular their strong preference for highly relevant stories based on their locations and topics of interest, and the granularity of that expectation. For example, we discovered that few people are interested in sport as a category; instead they tend to be interested in certain sports and averse to others; they may want to read every article available on a particular team, but find any information on a sport that they don’t like to be intensely irritating. Similarly, people may be interested in science but not technology or vice versa, and very interested in local business but not at all in national or global business.

This also gave us insight into how well the ABC’s current and future metadata practices can help meet these emerging audience expectations.

Current metadata practices

The curated home page is no longer the sole entry point; as users turn to search, aggregators and emerging platforms, new discovery methods must be accounted for. This situation complicates the task of delivering the greatest value for audiences.

Every ABC story is tagged with metadata, primarily used by authors to control how content appears in current websites. To the extent that it accurately describes stories it can also be used as discovery metadata, to help search engines and aggregators find and present those stories.

There are two main gaps in our metadata. First, the existing fields and ontologies do not afford sufficiently detailed description for discovery and recommendation engines to deliver personalised results. Second, those fields that are available are often left empty by authors and editors.

For personalised content delivery and location and topic-based recommendations to deliver audience value, much more granular and comprehensive metadata will be required.

1 See ABC-WHP-2015-A

Solving the metadata problem

Sufficiently granular and comprehensive metadata could be delivered entirely manually: by training authors and editors in more comprehensive metadata entry, and through policy direction. This would however be a labour-intensive solution, in a context where those people have many competing demands on their time and attention.

Another manual option would be the employment of metadata subeditors. This sufficed for the Spoke pilot, where a single part-time editor was able to add sufficient location metadata to create localised feeds for two regional centres. However, scaling up to the whole country while also increasing metadata granularity would not be sustainable.

At the other end of the scale is a fully automated system, using artificially intelligent expert systems to extract meaning from the full text of the articles and apply metadata tags. As the success of algorithm-driven solutions such as Google’s PageRank demonstrates, machine learning systems and automated content analysis can be a powerful and scalable means of improving discoverability.

As computation becomes exponentially cheaper and more powerful, a wide variety of useful machine learning and expert systems are emerging. These systems first appear in the world as research projects; then, as each approach matures, its products coalesce into proprietary solutions and finally into open source and commodity platforms.

Better metadata through Natural Language Processing

Natural Language Processing (NLP) techniques in particular have the potential to become an important and useful tool for sorting through large volumes of stories, increasing discoverability.

NLP is a technology based on artificial intelligence. It can take large volumes of text, and summarise and codify it using a deep understanding of the structure of language as well as databases that link words and phrases to their meanings. It allows metadata to be created automatically. In the past this has been too computationally expensive for our uses, but in recent years advances in NLP techniques and the raw speed of computing have changed that reality.

This presents an opportunity, as NLP is now at the point where open source and commodity platforms are becoming available.

In recent years a number of NLP engines have been released as SaaS (Software as a Service) offerings, presenting APIs (Application Programming Interfaces) which can be used to analyse large datasets and provide sample results. This makes it possible to use third-party NLP engines as a core component of a technology stack to provide customisable content analysis services for an organisation.

Applications include increasing discoverability not only for textual content such as news stories, but also for anything for which a transcript is available, such as radio interviews and iView content.

Evaluation of automatically extracted metadata

In order to explore the options for NLP engines, ABC R&D first conducted an overview of all offerings available in the open market. We then selected the top three candidates for testing by building a prototype to facilitate analysis.

The ABC ACE prototype

The result is the ABC ACE prototype, a custom-built research apparatus which connects the three NLP systems to a corpus of ABC content aggregated from across the organisation and provides a user interface for exploring the results generated by each NLP engine.

The prototype presents a query interface to afford exploration of the dataset as augmented by the NLP systems. For example, a user can find stories that are in the category of Business_Finance, which must contain the entity BHP Billiton with relevance at least 0.7 (where 1 is highest) and should contain the concepts Iron Ore and Mining with relevance at least 0.5.
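The example query above can be sketched in a few lines of Python. This is a minimal illustration only, not the prototype's implementation: the `Story` and `Clause` classes, the `run_query` function and the sample stories are all hypothetical, and the sketch assumes that "should contain" clauses influence ranking rather than exclude stories.

```python
from dataclasses import dataclass, field


@dataclass
class Story:
    title: str
    category: str
    # Entity/concept name -> relevance score (0.0 to 1.0) reported by an NLP engine.
    entities: dict = field(default_factory=dict)
    concepts: dict = field(default_factory=dict)


@dataclass
class Clause:
    kind: str             # "entity" or "concept"
    name: str
    min_relevance: float
    must: bool            # True = MUST CONTAIN, False = SHOULD CONTAIN


def clause_matches(story, clause):
    scores = story.entities if clause.kind == "entity" else story.concepts
    return clause.name in scores and scores[clause.name] >= clause.min_relevance


def run_query(stories, category, clauses):
    """Return (story, should_hits) pairs, strongest matches first."""
    results = []
    for story in stories:
        if story.category != category:
            continue
        # Every MUST clause has to hold, otherwise the story is excluded.
        if not all(clause_matches(story, c) for c in clauses if c.must):
            continue
        # SHOULD clauses only affect ranking (an assumption of this sketch).
        should_hits = sum(clause_matches(story, c) for c in clauses if not c.must)
        results.append((story, should_hits))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Hypothetical stories with made-up relevance scores.
stories = [
    Story("Board reshuffle", "Business_Finance", {"BHP Billiton": 0.75}, {"Mining": 0.3}),
    Story("Iron ore slump", "Business_Finance", {"BHP Billiton": 0.9}, {"Iron Ore": 0.8, "Mining": 0.6}),
    Story("Passing mention", "Business_Finance", {"BHP Billiton": 0.4}, {"Iron Ore": 0.9}),
]
clauses = [
    Clause("entity", "BHP Billiton", 0.7, must=True),
    Clause("concept", "Iron Ore", 0.5, must=False),
    Clause("concept", "Mining", 0.5, must=False),
]
matches = run_query(stories, "Business_Finance", clauses)
titles = [story.title for story, _ in matches]
```

Here the third story is excluded because its BHP Billiton relevance falls below the required 0.7, and the remaining two are ordered by how many of the optional concepts they satisfy.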

This system is designed to expose as much data and functionality as possible in order to support comparative evaluation of the three systems, and also to open up the possibility space to reveal the breadth of possibilities for NLP. Accordingly it presents a comprehensive range of controls which may be daunting for the non-expert user.

Future prototypes may explore specific use cases for NLP, intended for specific user populations and with custom-built user interfaces. In the meantime the ACE prototype system is live and available to use on request for further exploration. To gain access or schedule a demo please contact us.

Charlie at the ACE console. Left: code. Centre: ACE prototype. Right: DBPedia entity detail page.

Research Procedure

Using the ACE prototype, over 2600 ABC stories were analysed using the three chosen NLP services: AlchemyAPI, OpenCalais and TextRazor.

All stories were analysed for:

Named entities, fuzzy and disambiguated
e.g. Person (fuzzy); John Howard - Australian politician, http://en.wikipedia.org/wiki/John_Howard (disambiguated)

Concepts
e.g. Prime Minister - http://dbpedia.org/resource/Prime_minister
Concepts don’t necessarily match exact words or phrases in the text; they are derived from meaning and linked to entries in various knowledge bases (Wikipedia, DBpedia).

Categories or Topics (Taxonomy)
e.g. law, govt and politics / government
Only AlchemyAPI provides hierarchical categories; TextRazor and OpenCalais derive only top-level categories. The taxonomy extracted by all three services loosely matches the ABC’s own.

Sentiment
Only AlchemyAPI provides sentiment analysis. Due to limited API calls, only named entities were analysed for sentiment. The overall sentiment of stories was not established.

Engine Comparison – Initial Conclusion

As a result of this analysis we determined that all engines detected and identified a similar number of named entities, concepts and topics (within an order of magnitude). Only Alchemy API could provide sentiment.

Of all the available NLP APIs, AlchemyAPI stands out as the most robust and promising choice at this point, largely due to its capability to connect with its News API, which draws on a corpus collected from over 75,000 news organisations. The capacity for sentiment analysis could also be a useful feature, particularly with regard to providing metadata for a recommendations system.

Further: as all NLP engines are driven by the same underlying concepts, it is possible to build an application architecture which is independent of which NLP engine provides the underlying results. The ACE prototype demonstrates this principle in practice, allowing users to switch between any of the three engines under review.
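One way to picture such an engine-independent architecture is a common interface that every engine adapter implements. The sketch below is an assumption-laden illustration rather than the ACE design: `Annotation`, `NlpEngine`, `KeywordEngine` and `analyse_with` are hypothetical names, and the stand-in engine matches phrases locally instead of calling any real NLP service.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Annotation:
    kind: str         # "entity", "concept" or "topic"
    label: str
    relevance: float  # normalised to the range 0.0 to 1.0


class NlpEngine(Protocol):
    """Interface that every engine adapter is assumed to implement."""
    name: str

    def analyse(self, text: str) -> list[Annotation]: ...


class KeywordEngine:
    """Stand-in engine for illustration only; it calls no external service."""

    def __init__(self, name, vocabulary):
        self.name = name
        self.vocabulary = vocabulary  # phrase -> (kind, relevance)

    def analyse(self, text):
        lowered = text.lower()
        return [
            Annotation(kind, phrase, relevance)
            for phrase, (kind, relevance) in self.vocabulary.items()
            if phrase.lower() in lowered
        ]


def analyse_with(engines, engine_name, text):
    """Run the chosen engine; callers never depend on a specific vendor."""
    return engines[engine_name].analyse(text)


# Two hypothetical engines with made-up vocabularies and scores.
engine_a = KeywordEngine("engine_a", {"BHP Billiton": ("entity", 0.9), "Iron Ore": ("concept", 0.6)})
engine_b = KeywordEngine("engine_b", {"BHP Billiton": ("entity", 0.8)})
engines = {engine.name: engine for engine in (engine_a, engine_b)}

text = "BHP Billiton reported lower iron ore prices."
result_a = analyse_with(engines, "engine_a", text)
result_b = analyse_with(engines, "engine_b", text)
```

Because the rest of the application only sees `Annotation` records, switching engines amounts to choosing a different key in the `engines` mapping.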

Confirmation study with prototype system users

We identified one potential confounding factor: false positives. Any data analysis procedure, including the natural language processing systems under review, can be considered on a range from more specific to more sensitive. A highly specific test will produce fewer results more accurately, whereas a highly sensitive test will produce more results, but with a higher chance of detecting false positives.
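The trade-off described above can be illustrated by applying two different relevance thresholds to the same set of candidate tags. The candidates, scores and human judgements below are invented for illustration; they are not results from the study.

```python
# (label, relevance, judged_correct) - hypothetical values, not study data.
candidates = [
    ("Australian Broadcasting Corporation", 0.82, True),
    ("BHP Billiton", 0.74, True),
    ("Sting (musician)", 0.41, False),
    ("Iron Ore", 0.38, True),
    ("A Private Function", 0.33, False),
]


def evaluate(candidates, threshold):
    """Count what a given relevance threshold keeps, and what it gets wrong."""
    kept = [c for c in candidates if c[1] >= threshold]
    true_positives = sum(1 for c in kept if c[2])
    false_positives = len(kept) - true_positives
    # Correct tags that fell below the threshold and were therefore lost.
    missed = sum(1 for c in candidates if c[2] and c[1] < threshold)
    return {
        "kept": len(kept),
        "true_positives": true_positives,
        "false_positives": false_positives,
        "missed": missed,
    }


sensitive = evaluate(candidates, threshold=0.3)  # more results, more noise
specific = evaluate(candidates, threshold=0.5)   # fewer results, fewer errors
```

With these made-up numbers the lower threshold keeps all five candidates, two of them wrong, while the higher threshold keeps only two correct candidates and loses one correct tag.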

False positives will be missed by entirely automated systems and can pollute results. For example: a story may mention the ABC, and the engine may come up with a disambiguated link for it which is wrong – for example ABC Learning Centres when the article means the Australian Broadcasting Corporation, or vice versa. Another example: a story about a police sting operation may be misclassified as Arts and Entertainment, because the word “sting” has been misidentified as the musician. Sometimes false positives are obvious – “Council of the European Union” for a story about a local council. Others are not: for example “A Private Function” will appear as a topic match, when the story mentions a private function at a club. Only on further investigation will you see that this match links to a DBPedia entry about the 1984 British comedy film starring Michael Palin and Maggie Smith.

To account for this factor we chose a subset of analysed stories (150) and recruited test users from the R&D team to check them for false positives. This was a time-consuming process – appropriate for our experimental testing but not for real-world use in editorial workflows. We also determined that while the number of true false positives is low, they can have damaging editorial effects: for example flagging a story on a death as entertainment.

Quality Issues in Detected Metadata

Through these automated and user-driven analyses we identified the following concerns:

Inconsistent named entity disambiguation
e.g. AlchemyAPI disambiguated “ABC” over 20 different ways in the stories analysed, while over 90% of the instances found refer to the Australian Broadcasting Corporation.

Too many fuzzy named entities, not enough disambiguated ones
This creates noise.

Ontologies for entity types are ambiguous and vague
e.g. “driver” was detected as Position (an OpenCalais type), meaning a person’s occupation. In the context of a traffic accident story this is misleading.

Errors with potential for editorial harm
The machine learning systems can produce errors that a human editor would find embarrassing: for example AlchemyAPI categorising a murder story as “Arts and Entertainment”.

Metadata Quality Findings

Even accounting for confounding factors, all the engines in this study are within the same order of magnitude in results and accuracy. None of them can be reliably used without editorial oversight, especially for Australian content. Even highly customised, proprietary solutions trained on Australian media (for example Fairfax’s Fizzing Panda) could only achieve 84% accuracy when it comes to disambiguating entities.

Notwithstanding this issue, the right workflow could enable content creators and editors to select quality, NLP-generated metadata with minimal effort, enhancing the power of authors and editors to make content richer and more discoverable.

Conclusion

Our initial aim was to explore the usefulness of NLP systems for augmenting metadata at the point of publication. We are now satisfied that those systems are effective at the point of publication, and that their value can go beyond this point. NLP systems can provide tools useful at each stage of the story production process: research, writing and publishing.

We have also demonstrated that by including a human editor in the loop, results can be obtained that are more useful than either a purely manual or entirely automated approach would deliver. Our recommendation therefore is to ensure that any system using NLP results to augment news stories include a human in the loop. Future research can therefore most fruitfully focus on the human experience of using the engines to enrich content.

Our early user tests indicate that this person should select the strongest matches for retention rather than excluding mismatches specifically. We plan to go on to produce a proof of concept prototype to show how such a system could be designed.
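A select-to-retain workflow of this kind could look roughly like the sketch below. It is only an illustration of the idea, not the planned proof of concept: the function names and the shortlist size are assumptions, while the relevance values are taken from the AlchemyAPI case study in Appendix 1.

```python
def shortlist(candidates, limit=3):
    """Return the highest-relevance candidates, strongest first."""
    return sorted(candidates, key=lambda c: c["relevance"], reverse=True)[:limit]


def retain_selected(candidates, approved_labels):
    """Keep only what the editor explicitly approved; everything else is dropped."""
    return [c for c in candidates if c["label"] in approved_labels]


# Relevance values as reported for the Appendix 1 case study.
detected = [
    {"label": "France", "relevance": 0.22},
    {"label": "Veloso", "relevance": 0.83},
    {"label": "Mary Jane", "relevance": 0.33},
    {"label": "Bali Nine", "relevance": 0.55},
    {"label": "Jakarta", "relevance": 0.34},
]

offered = shortlist(detected, limit=3)
# The editor ticks the matches worth keeping (a hypothetical choice).
retained = retain_selected(offered, approved_labels={"Veloso", "Jakarta"})
```

The editor never has to reject "Mary Jane" or "France" explicitly; anything not selected simply does not reach the published metadata.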

The experimental process has already uncovered some qualitative results that could provide insight into the opportunities and challenges of deploying NLP engines to augment editorial workflows. For example, we discovered that the engines could be useful in the research phase, by revealing related stories when an author is still drafting, as well as in writing and publication; but that some terminology used in the field is obscure and if shown should be translated into more recognisable terms for non-expert users.

Accordingly, future phases will explore practical scenarios for integration of NLP techniques in the news gathering and editorial workflow through interactive prototypes focused on specific use cases.

Demos of the ACE prototype have garnered strong interest from Digital Networks Technology, the LRS project, ABC News, Splash, iView and the WCMS project. We anticipate that any division with a significant content corpus in need of better discovery and analysis tools could benefit from the considered application of NLP techniques.

Opportunities for future development

If we can design an effective system that allows users to select the most useful metadata from an automatically extracted list, then we can substantially improve the quality of metadata in the ABC’s systems while simultaneously providing a simpler and more effective editorial workflow.

This would demonstrate the usefulness of the system, but some opportunities for greater value would be missed. These improvements go beyond the scope of an initial proof of concept, but would be fruitful avenues for further research.

1. Identifying consistent errors

By only approving NLP metadata that is correct we would miss out on metadata that is consistently identified erroneously by the NLP service. If we had the ability to teach the system the correct response, then that match would become useful.

For example, the “ABC” is consistently misidentified as something other than the Australian Broadcasting Corporation. In our proposed proof of concept those mismatches would be ignored by the editor, and discarded by the system.

Further research could incorporate a “correction” process into the workflow which may include presenting multiple choices of disambiguation and/or creating our own definition.

2. Identifying missing entities

Some entities are missed by the NLP engines, or identified but not disambiguated as there is no matching database entry. The ability for ABC users to create new database entries for those missing entities would lead to richer results in future.

3. Building a learning system

By teaching the system to correct consistent errors and enriching its concept and topic database, we would over time be building a system that learned from the collective knowledge of ABC contributors, content authors and editors. This would result in a continual opening up of our content to easier discovery and distribution.
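The "correction" process described under "Identifying consistent errors" could, in its simplest form, be a table of editor-approved overrides applied on top of engine output. The following is a sketch under that assumption; the function names, record layout and the example URIs are illustrative and do not describe an existing ABC system.

```python
# Editor-approved overrides: surface form (lower-cased) -> knowledge base URI.
corrections = {}


def record_correction(surface_form, uri):
    """Store the editor's preferred disambiguation for a surface form."""
    corrections[surface_form.lower()] = uri


def apply_corrections(entities):
    """Return copies of the engine's entities with approved overrides applied."""
    corrected = []
    for entity in entities:
        override = corrections.get(entity["text"].lower())
        if override is not None and override != entity.get("uri"):
            corrected.append({**entity, "uri": override, "corrected": True})
        else:
            corrected.append({**entity, "corrected": False})
    return corrected


# An editor teaches the system once what "ABC" means in ABC stories.
record_correction("ABC", "http://dbpedia.org/resource/Australian_Broadcasting_Corporation")

# Hypothetical engine output containing the consistent mismatch.
engine_output = [
    {"text": "ABC", "uri": "http://dbpedia.org/resource/ABC_Learning"},
    {"text": "Jakarta", "uri": "http://dbpedia.org/resource/Jakarta"},
]
fixed = apply_corrections(engine_output)
```

Each stored correction is reused on every later story, which is the sense in which the system would accumulate the collective knowledge of its editors.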

Appendix 1: Natural Language Processing Case Study

Introduction

As part of our Automatic Content Classification Engine (ACE) project we investigated three commercially available NLP services – AlchemyAPI, OpenCalais, TextRazor – to analyse a subset of ABC content.

These services were used to extract named entities, categorise content into topics, generate concepts/tags and, in the case of AlchemyAPI, identify the associated sentiment of the named entities extracted.

In this document we explore in detail the results generated by AlchemyAPI for a single selected story published by the ABC. We chose this example as it is illustrative of the issues, matches and mismatches common to NLP analysis of Australian news stories.

Terminology

1. Named Entities
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. - Wikipedia

2. Named Entity Disambiguation
In natural language processing, entity linking, named entity disambiguation (NED), named entity recognition and disambiguation (NERD) or named entity normalization (NEN) is the task of determining the identity of entities mentioned in text. It is distinct from named entity recognition (NER) in that it identifies not the occurrence of names (and a limited classification of those), but their reference. - Wikipedia

3. Sentiment Analysis
Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader). - Wikipedia

Case study example: Nine.

Fuzzy Named Entities

Named entities correctly disambiguated

Named entities incorrectly disambiguated

Named entities missed or ignored

Bali Nine[1] families and diplomats en route to Cilacap[2] amid emotional pleas

Myuran Sukumaran's[3] sister has issued an emotional plea for his life to be spared, appearing in a YouTube[4] video clutching a photograph of her brother as a young boy wearing a school uniform.

"My brother made a mistake 10 years ago and he's paid for this mistake every single day since then," Brintha Sukumaran[5] said.

"My brother is now a good man and after 10 years in prison, he has taught so many Indonesian prisoners about art and about how to live outside in the world and have a good and productive life.

"From the bottom of my heart, please President Widodo[6] have mercy on my brother ... change punishment for humanity."

Sukumaran[7] and his co-charged Andrew Chan[8] were sentenced to death in Indonesia[9] in 2006, as ringleaders of the Bali Nine[1] drug smuggling gang.

Some of their family members are on the way to Cilacap[2].

Consular officials from the countries whose citizens face execution have also started arriving in Cilacap[2], which is close to the high-security prison island of Nusakambangan[10] where all of the death row convicts are now housed.

Australian and Indonesian officials have met and it is understood they discussed final requests from the condemned men and their funeral arrangements.

Foreign Minister[11] Julie Bishop[12] said Australian officials had been told the execution of the Bali Nine[1] pair was imminent.

"Indonesian authorities today advised Australian consular officials that the executions of Andrew Chan[8] and Myuran Sukumaran[3] will be scheduled imminently at Nusakambangan[10] prison in central Java[13]," she said in a statement on Saturday.

However, she said the Australian Government[14] would still seek clemency from Indonesian president Joko Widodo[15].

Jakarta[16] has said an exact date for the executions could not be decided yet, as a judicial review was still pending for the sole Indonesian in the group of 10 people who face death by firing squad.

Indonesia[9]'s Supreme Court[17] said the ruling on that case could be made as early as Monday, paving the way for the executions to proceed.

Filipina on death row given execution notice: lawyer

A Filipina on death row in Indonesia[9] has been informed that she will be executed on Tuesday, her lawyer said.

"We were informed by Mary Jane[18] herself that she received the notice that the sentence will be implemented on April 28," Veloso's[19] lawyer Minnie Lopez[20] told news agency AFP[21].

Veloso's[19] father and mother, her two sons aged six and 12, and sister pushed through a scrum of waiting journalists.

"If anything bad happens to my daughter, I will hold many people accountable. They owe us my daughter's life," Veloso's[19] 55-year-old mother, Celia[22], told a Philippine radio station.

"I hope my appeal reaches President Widodo[6]."

Lawyers for Veloso[19] have also filed another court bid to halt her execution.

Authorities said on Thursday they had ordered prosecutors to start making preparations for the executions.

However convicts must be given 72 hours' notice before executions are carried out, and this notice is yet to be given.

Lawyers for the say the legal process is not complete, with both a constitutional court challenge and judicial commission still in progress, however Indonesia[9] says all judicial reviews and appeals for clemency have been exhausted, and that the legal manoeuvres amount to delaying tactics.

The 10 inmates facing execution, including Chan[23], Sukumaran[7], Veloso[19], one each from Brazil[24] and France[25] and four from Africa[26], have all lost appeals for clemency from Mr Widodo[27], who has argued that Indonesia[9] is fighting a drugs emergency.

Mr Widodo[27] has turned a deaf ear to increasingly desperate appeals on the convicts' behalf from their governments, from social media and from others such as band Napalm Death[28] – the president is a huge heavy metal fan.

Julian McMahon[29] (centre), the lawyer for the Bali Nine[1] pair on death row, leaves the Cilacap[2] district prosecutor's[30] office. (AAP[31]: Darma Semito[32])

Highlights

• Mary-Jane Veloso, one of the subjects of this story, is never identified by her full name, leading to a number of misidentifications.
• The entity database is light on entries regarding Indonesian political figures, sometimes misidentifying them as entertainers with similar names.
• The entity database is better on Australian political figures, but still incomplete.
• The entity database (while very knowledgeable about Australian landmarks) does not recognise many significant Indonesian landmarks.
• The entity database is unaware of topical phrases such as “”, which it detected and through text analysis defined as an unknown “organisation”.

Page 4 of 7 Bali!Nine! Iden%fied'as'“Organiza%on”,'somewhat'incorrect.' 1 0.55$ Did'not'disambiguate'as:'h=p://dbpedia.org/page/Bali_Nine nega)ve

Cilacap! Iden%fied'as'“City”.' 2 0.55$ Did'not'disambiguate'as:'h=p://dbpedia.org/page/Cilacap_Regency nega)ve

Myuran! Did'not'recognise'as'en%ty.' 3 Sukumaran h=p://dbpedia.org/page/Myuran_Sukumaran

YouTube! 4 0.37$ Correctly'disambiguated'as:'h=p://dbpedia.org/resource/YouTube neutral

Brintha! Sukumaran! Incorrectly'disambiguated'as:'h=p://dbpedia.org/resource/Sukumaran' 5 0.75$ No'entry'exists'in'dbpedia. nega)ve

President!Widodo! Iden%fied'as'“Person”,'somewhat'incorrect'(should'be'just'Widodo)' 6 0.81$ Did'not'disambiguate'as:'h=p://dbpedia.org/resource/Joko_Widodo' nega)ve See:'15

7 Sukumaran See:'3

Andrew!Chan! 8 0.54$ Correctly'disambiguated'as:'h=p://dbpedia.org/resource/Andrew_Chan nega)ve

Indonesia! 9 0.72$ Correctly'disambiguated'as:'h=p://dbpedia.org/resource/Indonesia nega)ve

Nusakambangan! Iden%fied'as'“City”.' 10 0.31$ Did'not'disambiguate'as:'h=p://dbpedia.org/resource/Nusa_Kambangan nega)ve

Foreign!Minister! 11 0.30$ Iden%fied'as'“FieldTerminology”,'quite'ambigous neutral

Julie!Bishop! Iden%fied'as'“Person”.' 12 0.29$ Did'not'disambiguate'as:'h=p://dbpedia.org/page/Julie_Bishop nega)ve

Did'not'recognise'en%ty.' 13 Java h=p://dbpedia.org/resource/Java

Australian! Government! 14 Correctly'disambiguated'as:'h=p://dbpedia.org/resource/Government_of_Australia 0.34$ neutral

Joko!Widodo! Correctly'disambiguated'as:'h=p://dbpedia.org/resource/Joko_Widodo' 15 0.49$ See:'3 nega)ve

Page 5 of 7 Jakarta! 16 0.34$ Correctly'disambiguated'as:'h=p://dbpedia.org/resource/Jakarta nega)ve

Supreme!Court! 17 0.32$ Iden%fied'as'“Organiza%on” neutral

Mary!Jane! 18 0.33$ Incorrectly'disambiguated'as:'h=p://dbpedia.org/resource/Mary_Jane_CroZ neutral

Veloso! Iden%fied'as'“Person”' 19 0.83$ No'entry'exists'in'dbpedia. nega)ve

Minnie!Lopez! Iden%fied'as'“Person”' 20 0.26$ No'entry'exists'in'dbpedia. neutral

AFP! Incorrectly'disambiguated'as:'h=p://dbpedia.org/resource/Philippines' 21 0.30$ The'correct'disambigua%on'is:'h=p://dbpedia.org/page/Agence_France[Presse neutral

Celia! 22 0.28$ Iden%fied'as'“Person” posi)ve

23 Chan Did'not'recognise'as'en%ty.'Did'not'iden%fy'it'to'be'the'same'as'8

Brazil! Incorrectly'disambiguated'as:'h=p://dbpedia.org/resource/ 24 0.26$ Brazilian_military_government' neutral Correct'disambigua%on:'h=p://dbpedia.org/page/Brazil

France! Iden%fied'as'“County”.' 25 0.22$ Did'not'disambiguate'as:'h=p://dbpedia.org/page/France neutral

Africa! 26 0.28$ Correctly'disambiguated'as:'h=p://dbpedia.org/resource/Africa nega)ve

27 Mr!Widodo Did'not'recognise'as'en%ty.'Did'not'iden%fy'it'to'be'the'same'as'15

Did'not'recognise'as'en%ty.'No'entry'exists'in'dbpedia'for'heavy'metal'band'Napalm' 28 Napalm!Death Death

Julian!McMahon! 29 0.33$ Incorrectly'disambiguated'as:'h=p://dbpedia.org/resource/Julian_McMahon nega)ve

Prosecutor! 30 0.28$ Iden%fed'as'“JobTitle” nega)ve

Did'not'recognise'en%ty.'Did'not'disambiguate'as:'h=p://dbpedia.org/page/ 31 AAP Australian_Associated_Press

32 Darma!Semito Did'not'recognise'en%ty.

Named entities in order of detected relevance:

1. Veloso – 0.83

2. President Widodo – 0.81

3. Brintha Sukumaran – 0.75

4. Indonesia – 0.71

5. Bali Nine – 0.55

6. Cilacap – 0.55

7. – 0.54

8. – 0.49

9. Youtube – 0.37

10. Australian Government – 0.34

11. Jakarta – 0.34

12. Mary Jane – 0.33

13. Julian McMahon – 0.33

14. Supreme Court – 0.32

15. Nusakambangan – 0.31

16. Foreign Minister – 0.30

17. AFP – 0.30

18. – 0.29

19. Celia – 0.28

20. Africa – 0.28

21. Prosecutor – 0.28

22. Brazil – 0.26

23. Minnie Lopez – 0.26

24. France – 0.22

Appendix 2: System documentation

ACE ABC Corpus Explorer

Through the ACE prototype interface you can filter stories by the ABC’s explicit metadata tags at left, which returns a list of stories aggregated from across the ABC using a parsing system originally developed for the Spoke project. You can then go on to investigate automatically generated output from our selected

ACE report 2 Appendix 2 Page 1 of 8

ACE NLP Result and query UI

[Screenshot callouts: NLP query form; story marked up with links to detected metadata; generated topics; identified entity (not disambiguated); generated concepts; NLP engine chooser; disambiguated entity; entity detail]

ACE detail showing detected entity

[Screenshot callout: entity detail on rollover]

ACE Entity: further detail on click

[Screenshot callouts: links to knowledge base entries on this entity; human check for false positives; display documents containing exact entity; use this entity in a more detailed query]

ACE Concept: further detail on click

[Screenshot callouts: links to knowledge base entries on this concept; display documents containing exact concept; use this concept in a more detailed query]

ACE detailed query form

[Screenshot callouts: auto-detected topics; checkbox checked = MUST CONTAIN, unchecked = SHOULD CONTAIN; minimum relevance threshold; delete from query parameters]

ACE Topic detection

ABC editorially chosen categories (headings) showing Alchemy API auto-detected topics for those stories below, in order of frequency.

Note that the engine will choose multiple topics and weight them according to confidence. The chart below shows only the single most relevant topic, according to the engine.

Using the ACE prototype you can follow each of these topic links to display a list of the connected articles.

ACE Concept detection

From the query UI the ACE prototype can display a complete list of detected concepts, and allow you to see which stories they are linked from.

ACE NLP Engine Comparison of results from a corpus of 2278 documents

Appendix 3: Cross-divisional feedback

All R&D projects, proposals and demonstrations are designed to align with ABC strategic priorities. As these priorities evolve we check back in with stakeholders regularly to ensure that our projects are correctly pitched. They need to be forward-thinking enough not to duplicate the work of other groups, while also being tactically driven to facilitate implementation by those groups.

R&D projects span a time horizon ranging from 3 to 7 years. ACE is on the near-future end of that range. It investigates the use of an emerging technology which has left the lab and is becoming widely available, but which is not yet on existing product roadmaps.

Feedback to date indicates that the problem that ACE seeks to solve – metadata quality – is increasingly important, and is not otherwise being addressed. Some technical groups and product teams can see near immediate application of the results generated by the current ACE prototype, while others would like to see further work on data quality, tuning and localisation. In response, we have made those issues the focus of the forthcoming ACE Reporter prototype.

Over the coming weeks we will reach out beyond the technical implementation and product teams that have been shown the initial ACE demo. That more wide-ranging cross-divisional feedback will be included in future reports.

Digital Network / WCMS

Very impressed with the work and it is definitely usable in the short term, especially as we migrate content from legacy CMSs to WCMS. Great work.

Ciaran Forde, Head of Digital Architecture and Development, ABC Digital Network

It is exactly Charlie’s focus on business outcomes that makes working with him on his projects so appealing. We will create tools that have real business outcomes for the NLP work he has been doing.

Neil Wilkinson, Manager, Content Services, ABC Digital Network

iView

Really interesting work and could be useful in iview for examining program metadata including:

* series title
* episode title
* cast and director list (if we had these)
* description
* closed captions - this last one might be the most valuable

A natural language parser could be used not just for adding metadata for searching but also for finding related shows.

One concern is how badly each of the three systems behaved in some circumstances. This suggests significant tuning and localisation needs to be done.

I wonder if a much simpler approach might produce results that are perhaps not as good as the best of what you’ve got but also not as bad as the weird cases we saw.

I think we should focus on a particular requirement, say personalised recommendations, and mock up several solutions to compare.

Great work.

Peter Marks, iView mobile development lead

Localisation and Recommendations System (LRS)

The Recommendations Engine uses a number of different techniques to deliver refined recommendations based on data sources including content metadata, audience behaviour and content analysis.

What ACE could provide to Recommendations is a valuable data source, by either directly attributing topics, keywords and sentiment to content or by improving content metadata during the publish process.

The content metadata then forms an integral data source for Recommendations both for generic recommendations as well as personalised recommendations based on user behaviour when combined with content metadata.

From a timing point of view, Recommendations is currently undergoing a series of prototypes to further elaborate business and architectural requirements.

Although the Recommendations roadmap has not been devised as yet, ACE could be immediately useful in driving discussions with Product Managers about what kind of recommendations are possible, particularly for text-based content such as News.

I’ve added ACE/NLP into the following table, which summarises the smorgasbord of techniques that the Recommendations Engine could potentially provide, so that when the initial Recommendations Engine roadmap is defined, NLP can be considered in setting priorities.
