Automatic News Recommendations Via Profiling
Total Page:16
File Type:pdf, Size:1020Kb
Automatic News Recommendations via Profiling Erik Mannens Sam Coppens Toon De Pessemier Ghent University - IBBT Ghent University - IBBT Ghent University - IBBT ELIS - Multimedia Lab ELIS - Multimedia Lab INTEC - WiCa 9050 Ghent, Belgium 9050 Ghent, Belgium 9050 Ghent, Belgium [email protected] [email protected] [email protected] Hendrik Dacquin Davy Van Deursen Rik Van de Walle VRT Ghent University - IBBT Ghent University - IBBT VRT-medialab ELIS - Multimedia Lab ELIS - Multimedia Lab 1043 Brussels, Belgium 9050 Ghent, Belgium 9050 Ghent, Belgium [email protected] [email protected] [email protected] ABSTRACT 1. INTRODUCTION Today, people have only limited, valuable leisure time at In the PISA project1 we have investigated how, given a their hands which they want to fill in as good as possible file-based media production and broadcasting system with a according to their own interests, whereas broadcasters want centralized repository of metadata, many common produc- to produce and distribute news items as fast and targeted tion, indexing and searching tasks could be improved and as possible. These (developing) news stories can be charac- automated. In particular, we automated the production of terised as dynamic, chained, and distributed events in ad- multiple versions of news bulletins for different consumption dition to which it is important to aggregate, link, enrich, platforms since news broadcasters generally aggregate and recommend, and distribute these news event items as tar- produce more material than is required for broadcast or on- geted as possible to the individual, interested user. In this line distribution. We have shown how such an up-to-date paper, we show how personalised recommendation and dis- news bulletin can be dynamically created and personalized tribution of news events, described using an RDF/OWL rep- to match the consumer's static categories preferences [10] resentation of the NewsML-G2 standard, can be enabled by by merging different news sources and using the NewsML- automatically categorising and enriching news events meta- G2 specification2. While the impact of file-based production data via smart indexing and linked open datasets available indeed mainly affected the work methods of the news pro- on the web of data. The recommendations { based on a duction staff - journalists, anchors, editorial staff, etc - [9], global, aggregated profile, which also takes into account the the added-value for the end-user was still marginal, id est, (dis)likings of peer friends { are finally fed to the user via a he might notice that news content is made available faster. personalised RSS feed. As such, the ultimate goal is to pro- Practically, our aim is to demonstrate the possibility of dy- vide an open, user-friendly recommendation platform that namically digesting an up-to-date news bulletin by merging harnesses the end-user with a tool to access useful news event different news sources, assembled to match the real indi- information that goes beyond basic information retrieval. At vidual consumer's likings, by recommending his favourite the same time, we provide the (inter)national community topics. In this paper, we build on our prior work and ex- with standardised mechanisms to describe/distribute news ploit further the semantic capabilities of NewsML-G2 and event and profile information. enhance it via an automatic recommendation system using Categories and Subject Descriptors the end-user's dynamic profile. H.4 [Information Systems Applications]: :Miscellaneous; This paper is organised as follows. In Section 2, we briefly D.2.11 [Software Engineering]: :Software Architectures present the NewsML-G2 standard and its conceptualisation [Domain specific architectures] in an OWL3 ontology being used in Section 3 as a unifying General Terms (meta)datamodel for highlighting the backend of the end-to- end news distribution architecture. Section 4 further elab- Design, Management, Standardization orates on how the flow of news events can be categorised Keywords and automatically enriched with knowledge available in large linked datasets. Afterwards Section 5 and Section 6 unleash News Modelling, Profiling, Recommendation the dynamically harvested user profiles to the recommenda- tion engine to harness the best-fit news items to individual user likings. Section 7 then distributes these recommended and enriched news events to the individual users. Finally, Permission to make digital or hard copies of all or part of this work for conclusions are drawn in Section 8. personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to 1 http://www.ibbt.be/en/projects/overview-projects/p/detail/pisa/ republish, to post on servers or to redistribute to lists, requires prior specific 2 http://www.iptc.com/std/NewsML-G2/NewsML-G2_2.2.zip permission and/or a fee. 3 Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00. http://www.w3.org/TR/owl-features/ 2. NEWS MODELLING ganize the rundown of a classical television news broadcast. The IPCT News Architecture framework (NAR4) is a Within our presented architecture, the essence is retrieved generic model that defines four main objects (newsItem, from the VRT's MAM and copied into a separate MAM packageItem, conceptItem and knowledgeItem) and the pro- for demo purposes. The rundown information is extracted cessing model associated with these structures. Specific lan- from iNews in the Standard Generalized Markup Language guages such as NewsML-G2 or EventsML-G25 are built on (SGML) format, upconverted into a NewsML-G2 instance, top of this architecture. For example, the generic newsItem, and pushed to the NewsML-G2 Parser. The news items a container for one particular news story or dope sheet, is (essence and metadata) from international news agencies specialized into media objects (textual stories, images or au- are received via satellite communication. More and more dio clips) in NewsML-G2. providers also structure their metadata in this NewsML-G2 standard which can directly be pushed to the NewsML-G2 Within a newsItem, the elements catalog and catalogRef parser. The essence just needs to be packed into an Ma- 7 embed the references to appropriate taxonomies; rightsInfo terial Exchange Format (MXF) instance before the MAM holds rights information such as who is accountable, who is can process it. Afterwards, the essence is transcoded into a 8 the copyright holder and what are the usage terms; item- consumer format, such as H.264/ AVC , to be seen as the Meta is a container for specifying the management of the Automated Production component in Figure 1. item (e.g. title, role in the workflow, provider). The core description of a news item is composed of administrative 4. NEWS ENRICHMENT metadata (e.g. creation date, creator, contributor, intended The NewsML-G2 Parser then takes as input a NewsML- audience) and descriptive metadata (e.g. language, genre, G2 instance (XML format) and produces an enriched NewsML- subject, slugline, headline, dateline, description) grouped in G2 instance (RDF triples) compliant to the NewsML on- the contentMeta container. A news item can be decomposed tology. First, the incoming XML elements are parsed and into parts (e.g. shots, scenes, image regions and their respec- converted to instances of their corresponding OWL classes tive descriptive data and time boundaries) within partMeta and properties within the NewsML ontology. Second, plain while contentSet wraps renditions of the asset. Finally, se- text contained in XML elements such as title and descrip- mantic inline markup is provided by the inlineRef container tion is sent to the metadata enrichment service. The lat- for referring to the definition of particular concepts (e.g. per- ter extracts named entities from the plain text and tries to son, organization, company, geopolitical area, POI, etc). find formal descriptions of these found entities on the Web. Hence, the metadata enrichment service returns a number of NAR is a generic model for describing news items as well additional RDF triples containing more information about as their management, packaging, and the way they are ex- concepts occurring in the plain text sections of the incoming changed. Interestingly, this model shares the principles un- NewsML-G2 instance. Finally, the resulting RDF triples are derlying the Semantic Web: i) news items are distributed stored in the AllegroGraph RDF store (see Figure 1). resources that need to be uniquely identified like the Se- mantic Web resources; ii) news items are described with The linguistic processing consists in extracting named en- shared and controlled vocabularies. NAR is however de- tities such as persons, organisations, companies, brands, lo- fined in XML Schema and has thus no formal representa- cations and other events. We use both the i.Know's Informa- tion of its intended semantics (e.g. a NewsItem can be tion Forensics service9 and the OpenCalais infrastructure10 a TextNewsItem, a PhotoNewsItem or a VideoNewsItem). for extracting these named entities. For example, the pro- Extension to other standards is cumbersome since it is hard cessing of the headline \Tom Barman and his band dEUS to state the equivalence between two XML elements. EBU opening their latest album Vantage Point in Rock Werchter" (amongst others) have proposed to model an OWL ontology 6 will result in five named entities: `Tom Barman',`dEUS', of NewsML-G2 to address these shortcomings and we have `Vantage Point', `Rock Werchter', and 'Werchter' together discussed the design decisions regarding its modeling from with their type (i.e. Person, Music Group, Music Album, existing XML Schemas [10]. Event, Location, etc). Once the named entities have been extracted, we map them to formalised knowledge on the 3. NEWS GATHERING web available in GeoNames11 for the locations, or in DBPe- 12 13 News broadcasters receive news information from different dia /FreeBase for the persons, organisations and events.