Proceedings of the Twenty-Eighth International Florida Artificial Intelligence Research Society Conference

Advanced Features of Collaborative Semantic Annotators — The 4A System

Pavel Smrz and Jaroslav Dytrych Brno University of Technology, Faculty of Information Technology, IT4Innovations Centre of Excellence Bozetechova 2, 612 66 Brno, Czech Republic Email: {smrz,idytrych}@fit.vutbr.cz

Abstract museum professionals involved in the project. Yet, the 4A framework is generally applicable in other knowledge engi- This paper deals with collaborative knowledge engineering, particularly focusing on collective editing and semantic an- neering contexts, e.g., for biomedical text annotation. notation of hypertext. It discusses state-of-the-art functions of There are three particular cases in which the preferable the 4A (Annotations Anywhere, Annotations Anytime) sys- text mining scenario cannot be (fully) applied. First, a vari- tem that has been recently extended to be applicable in a ability of natural language constructs to express a seman- broad range of annotation contexts. We introduce advanced tic relation can be high and there can be insufficient data to features and recent improvements that make the tool unique train a machine learning model. For example, relations of in many aspects. A special attention is paid to the social way artistic influences (among artists, artworks, themes, styles, of semantic tagging – complex annotations can be created by techniques, places, etc.) have been studied within the DECI- a single click and immediately shared with other interested PHER project and it showed up that despite the effort, ex- users or reused by external systems. We also compare the 4A system to similar software solutions and show their similari- pressions such as pays tribute/homage to are not well cov- ties and differences. ered by the resulting system. Although various bootstrap- ping approaches on web-scale data provide a help (Zhu et al. 2009), at least an initial seed of examples needs to be Introduction provided by users. Manual annotation serves then clearly as Despite many efforts in formalising knowledge, a vast ma- a source of more training data and generally improves per- jority of textual content on the current Web takes form of formance of automatic annotation procedures. natural language sentences with no explicit semantics. It is Second, the structure of knowledge (a template to be filled not easy to make this content understandable by machines in by an automatic method) can be complex and natural lan- and thus to realize the original vision of the Semantic Web guage processing and machine learning techniques can be to its full extent. Moreover, unless there is a very simple, unable to deal with it. In the cultural heritage domain, a typ- generally accepted mechanism that would allow laymen to ical example is a knowledge scheme analysing different at- better express meaning and that would bring benefits imme- titudes to an artist and his or her work. Many books can be diately, the situation will not change in near future. written about the topic, people can have opposite meanings Information extraction from text offers a solution for this and it is very difficult for automatic methods to generalize problem. However, cutting-edge text mining systems are in such situations. As complex structures often consist of limited to a narrow set of cases they were trained for. In other sub-components that can be recognized automatically, an- cases, there is still a need for manual semantic metadata cre- notation suggestions can significantly speed up the process ation. This paper discusses how the process of manual text of the semantic enrichment of text in this case. annotation can be assisted by an intuitive user interface that Finally, the knowledge structure itself can be unclear, presents annotation suggestions generated by an external se- not well understood, or fuzzy. When annotating particular mantic enrichment system. pieces of relevant texts, users often become aware of a gen- To motivate particular functions, we use examples from eral semantic pattern of knowledge represented by the texts, the cultural heritage domain. It corresponds to the appli- 1 they better realize what attributes are crucial for a task in cation area of the DECIPHER project in which the tool hand and can easily draft a knowledge scheme that reflects was employed. DECIPHER was an EC FP7 project aim- their specific needs. This aspect showed many times in DE- ing to support discovery and exploration of cultural heritage CIPHER. Although we maximally re-used existing knowl- through story and narrative. Semantic enrichment of under- edge resources such as well-established ontologies and con- lying texts brought new quality to the whole range of nar- ceptual hierarchies (CIDOC CRM, Getty Thesaurus, etc.), rative construction, knowledge visualisation and display for many tasks required specific knowledge structures that had Copyright c 2015, Association for the Advancement of Artificial to be created “on the fly”. One cannot expect that ordinary Intelligence (www.aaai.org). All rights reserved. end-users will adopt sophisticated ontology tools such as 1http://www.decipher-research.eu/ Proteg´ e´ to suggest specific additions to knowledge specifi-

233 Figure 1: Architecture of the 4A system cation schemata. It is thus beneficial if an annotation tool is developed. To edit and annotate a new text at the same time, able to play this role too. the Annotation Editor for JavaScript WYSIWYG editors can The 4A tool reflects the needs discussed above. It was in- be employed. An extension for TinyMCE2 is currently avail- troduced in a workshop paper in 2011 (Smrz and Dytrych able. Other editors can be easily supported through a defined 2011). Since then, the system has been significantly im- abstraction level. This form of the client provides also a way proved and extended. Its new modular design can cope with to integrate the tool into popular content management sys- diverse requirements defined by various user groups and par- tems (such as in the case of DECIPHER). ticular annotating contexts. The paper presents the current An essential feature of the 4A system lies in anchoring version 2.0 of the system. annotations in text. Although resulting knowledge structures (e.g., in RDF) can be attached to the whole document, it is Advanced Features of the 4A System useful to associate semantic interpretation directly with par- The 4A system consists of a back-end server generating an- ticular words, phrases, sentences, paragraphs or any other notation suggestions, an annotation server managing seman- piece of text (referred to as textual fragments in this paper). tic annotations and clients presenting the suggestions and The fine-grained annotation is critical for further process- annotations to users, transferring their feedbacks and vi- ing of semantically enriched data and accountability of re- sualising results of the annotation. A general architectural sults. Moreover, pinpointing a source of information helps schema is presented in Figure 1. machine learning methods to infer better models from anno- The stateless SEC API Server invokes semantic enrich- tated examples. ment components – named entity recognizers, co-reference A text can come into existence and be immediately an- resolution packages, and specific relation extractors – that notated. Texts and annotations are processed separately to pre-annotate an input text and mark potential occurrences support interleaved editing and annotating. A sophisticated of relevant entities and relations in it. The SEC API Server annotation management guarantees that most of annotations can run as a remote service or it can be started and managed remain valid after each text editing step and only disqualified by the 4A Annotation server. Variable deployment models annotations are thrown away. The 4A server takes care of an- are supported. For example, the system can be configured notation updates. To find the best match between a stored so that a local SEC API Server is accessed by many remote and an edited version of an annotated text, a cascade of annotation servers. methods with variable sensitivity is applied. A current node Two kinds of clients access the system. The first group of in a hierarchical representation of the text is searched for- clients is intended for annotating any web page or an exist- ward and backward first. If no match is found, the content is ing document viewed in a browser. A client realized as an searched from the current node to other nodes. A fragment in add-on for Mozilla is currently available, MS Inter- net Explorer, and versions are being 2http://www.tinymce.com/

234 an edited text is taken as matching if specific criteria are met (a threshold on the Levenshtein distance, correspondence of first and last letters, etc.) Although annotation structures can be complex, accepting or rejecting suggestions need to be very easy – it should cor- respond to a single click in most cases. As automatic meth- ods never recognize all potential entities, there also needs to be a simple mechanism to add new entries into an underly- ing knowledge base and to maximally reuse relevant exist- ing content. To realize this, the 4A system employs semantic templates that lead the user through the annotation process. For example, when annotating a text corresponding to creat- ing an artwork, the system displays common attributes from the domain ontology model CIDOC CRM. When clicking on attribute Author which is known to be typically filled by a URI corresponding to a person, the system suggests primar- ily fragments that fit this semantic preference. Semantic tem- plates, derived from ontologies in the initialization phase, ef- fectively get users over complexities of existing knowledge structures – concepts are suggested as semantic types of at- tributes, constraints are transformed into template structures and existing annotations are used to improve the results. Any use of an attribute is also linked back to the original ontol- Figure 2: An annotation suggestion ogy. Thus, suggested changes in knowledge structures can be immediately supported by real world examples. Annotations can link to other annotations that exist either independently or as a part of a superordinate annotation. The on the same document, 4A Annotation editors and the server 4A system supports unlimited nesting of annotations. For employ a real-time protocol and send changes immediately example, an annotation referring to an event can include an- to all involved parties. If a new user opens a document being notations of complex attributes and, at the same time, form edited and/or annotated by other users, the system offers ap- a part of another annotation expressing a cause relation be- plying changes of others or start with a separate version of tween two events, which is further attributed to a belief state the text. of a person. To cope with new requirements that were not envisaged There is a non-trivial support for overlapping and inter- in the beginning of the tool development, various enhance- leaving annotations in the 4A system. To preserve well- ments of the 4A system have been recently realized. The key formedness rules of XML documents and other hierarchi- change is the move from a proprietary annotation format to cal formats, the 4A system automatically splits fragments the W3C Open annotation format3. This guarantees inter- on “seams” and joins the parts again when representing an- operability with other annotation systems and enables using notation results. For example, the advanced mechanism en- external RDF reasoners to gain additional information. ables annotating a sentence Renoir and Manet made copies Accuracy of generated annotation suggestions varies, it of Delacroix’ paintings by two separate events with differ- can be very high for some semantic types and very low for ent actors (Renoir and Manet) and shared textual fragments others. If the suggested annotation is not correct, the 4A representing the verb phrase and the object of the relations. framework enables creating a correct annotation manually. Annotations are displayed in popups on moving the However, this presents a tedious task involving the search mouse over a relevant fragment. Links to other annotations for a correct entity link and editing of additional attributes. can be clicked on and any content of linked and nested an- To simplify this process, we developed a new module of the notations can be displayed directly in a particular annotation annotation server which manages alternative annotations for popup too. Document-level annotations are displayed in a a given textual fragment. If the user rejects a current (best) separate window. suggestion, the system shows all other known alternatives Suggested annotations can be accepted one by one (ei- for a particular textual fragment and the user can choose one ther directly in the text – see Figure 2 – or in a separate by a single click again. window with a list of all suggestions for a quick review). The annotation subscription mechanism has been also ex- The system can be also set to confirm all suggestions with tended to allow fine grained setting of filters defining which a confidence value higher than a specified threshold. Users annotations shared by other users should be shown. It is pos- can further manually reject annotations confirmed in a pre- sible to create sets of filters with annotation sources and se- vious step. Low confidence suggestions can be filtered out mantic types and then apply a given set in a particular in- according to user’s preference. The user feedback is stored stance of the editor. This mechanism enables opening a doc- and used for improving automatic suggestions. To prevent problems related to concurrent work of users 3http://www.openannotation.org/spec/core/

235 ument in more tabs and showing different kinds of annota- Conclusions and Future Directions tions (different perspectives) in individual tabs. The 4A system was successfully deployed in the DECI- An advanced scheme generating modification sequence PHER project. It is currently used for various annotation numbers and server side conflict checking and approving tasks in the cultural heritage domain. Museum professionals of modifications have been also added. These mechanisms appreciate advanced functionality of the tool and introduce minimize delays between modifications in multi-user set- it to new environments of their interest. tings. It is not necessary to apply a change on all clients be- Although the 4A systems overcomes other annotation fore a next modification request is processed (as, for exam- tools in various aspects, there are still many enhancements ple, Google Documents need to do). The system guarantees that wait for integration to next versions. We are currently that conflicting changes will be performed in a right order working on advanced export functions for knowledge struc- but non-interfering changes can proceed immediately. More ture extensions proposed by users and on improving inter- instances of the 4A editor can also newly appear at the same operability with other tools. Annotation of images will be web page. also introduced. Finally, the 4A system will newly support a broad range of WYSIWYG editors including Aloha7 and Related Work CKEditor8 which will help to apply the tool in new settings. The RDFaCE content editor4 (Khalili, Auer, and Hladky 2012) is the most similar tool from the user interface per- Acknowledgments spective. It is implemented as a TinyMCE plugin and en- The research leading to these results has received funding ables annotating textual fragments and linking them to on- from the European Community’s Seventh Framework Pro- tology concepts. Both the systems allow users to create gramme (FP7/2007-2013) under the IT4Innovations simple tags as well as complex annotations with attributes. Centre of Excellence project, Registration number However, RDFaCE does not support real-time collaboration CZ.1.05/1.1.00/02. 0070, supported by Operational among users. While the 4A system synchronizes both the Programme – Research and Development for Innovations – textual content of a document and its annotations, RDFaCE funded by Structural Funds of the European Union and the needs to store the annotated text before others can annotate state budget of the Czech Republic. it. Pundit5 (Grassi et al. 2013) is an annotation tool that References stores annotations on the server side in a similar way as the 4A framework. The system also works with textual frag- Ciccarese, P.; Ocana, M.; and Clark, T. 2012. Open se- ments. It employs ontologies and controlled vocabularies. mantic annotation of scientific publications using DOMEO. Journal of Biomedical Semantics On the other hand, the frontend tool is implemented as a 3(Suppl 1). http://www. bookmarklet so that the text being annotated cannot be mod- jbiomedsem.com/content/3/S1/S1. ified during the annotation process. Simple attributes can be Grassi, M.; Morbidoni, C.; Nucci, M.; Fonda, S.; and Do- of a plain text type only and links to ontology concepts need nato, F. D. 2013. Pundit: Creating, exploring and consum- to be entered directly as RDF triples. This slows down the ing semantic annotations. In Proceedings of the 3nd Inter- annotation process and makes it less comfortable than in the national Workshop on Semantic Digital Archives, Valletta, case of 4A or RDFaCE. Pundit does not allow adding new Malta, September 26, 2013. attributes and nesting annotations. As opposed to RDFaCE, Khalili, A.; Auer, S.; and Hladky, D. 2012. The RDFa Con- however, it is possible to create links between annotations in tent Editor – From WYSIWYG to WYSIWYM. In Pro- Pundit. ceedings of COMPSAC 2012 – Trustworthy Software Sys- Storing annotation in the Open Annotation format on tems for the Digital Society. http://svn.aksw.org/papers/ the server is a feature that the 4A system shares with 2012/COMPSAC RDFaCE/public.pdf. 6 Domeo (Ciccarese, Ocana, and Clark 2012). The frontend Smrz, P., and Dytrych, J. 2011. Towards new scholarly com- takes form of a browser plugin which is also one of the 4A munication: A case study of the 4a framework. In SePublica, client types. Domeo excels in the support for work with im- volume 721 of CEUR Workshop Proceedings. ages – a quality no other semantic annotation tool currently Zhu, J.; Nie, Z.; Liu, X.; Zhang, B.; and Wen, J.-R. 2009. matches. It enables creating simple textual annotations as StatSnowball: A statistical approach to extracting entity re- well as linking fragments to ontology concepts. Adding at- lationships. In Proceedings of the 18th International Confer- tributes is complicated as the tool dedicates attribute manip- ence on World Wide Web, WWW ’09, 101–110. New York, ulation to external plugins. Compared to the 4A direct search NY, USA: ACM. for a term in a controlled vocabulary by autocomplete func- tions with previews, Domeo supports the functionality by a simple search field in a separate tab. It also displays anno- tations differently – in a sidebar rather than together with annotated fragments as other tools do.

4http://rdface.aksw.org/ 5http://www.thepund.it/ 7http://aloha-editor.org 6http://swan.mindinformatics.org/ 8http://ckeditor.com/

236