ALMA MATER STUDIORUM – UNIVERSITÀ DEGLI STUDI DI BOLOGNA
FACOLTÀ DI SCIENZE MATEMATICHE FISICHE E NATURALI
Dipartimento di Informatica: Scienza e Ingegneria

Corso di Laurea Specialistica in Informatica

Semantic Enrichment of Scientific Documents with Semantic Lenses – Developing methodologies, tools and prototypes for their concrete use

Relatore: Chiar.mo Prof. Fabio Vitali
Correlatore: Dott. Silvio Peroni
Candidato: Jacopo Zingoni

ANNO ACCADEMICO 2011-2012 Sessione II


To Marta and Stefano, my beloved and exemplary parents, extraordinary, sweet, and always ready for anything. To Giuliana and Tosca, my unforgettable grandmothers, always close to my heart, whose lessons I hope I will always treasure.

Thank you for all the support, for every word of encouragement. And above all, thank you for all the love you have always given me.



Contents

Introduction
1 Introduction and Aims of this Work – Applications for Semantic Lenses
2 Scientific Context and Related Works
2.1 General introduction to Semantic Web and Semantic Publishing
2.2 Semantic Web, Semantic Publishing and the Enrichment of Knowledge
2.3 Other Related Works
3 Technologies and Ontologies
3.1 RDF and Ontologies in General
3.1.1 RDF/XML Syntax Linearization
3.1.2 Terse RDF Triple Language – TURTLE
3.2 EARMARK – Extremely Annotational RDF Markup
3.2.1 Ghost Classes
3.2.2 Shell Classes
3.2.3 Handling Overlapping Markup with EARMARK
3.3 Linguistic Acts Ontology
3.4 SPAR Area Ontologies
3.5 Toulmin Argument Model
4 The Semantic Lenses Model
4.1 Introduction to Semantic Lenses
4.2 Facets of a Document – Different outlooks, aspects and layers
4.3 Semantic Lenses in Detail
4.3.1 The Research Context Lens
4.3.2 The Contributions and Roles Lens
4.3.3 The Publication Context Lens
4.3.4 The Document Structure Lens
4.3.5 The Rhetoric Organization Lens
4.3.6 The Citation Network Lens
4.3.7 The Argumentation Lens
4.3.8 The Textual Semantics Lens
4.5 Structural Patterns and the Pattern Ontology
4.6 DOCO – The Document Components Ontology
4.7 DEO – The Discourse Elements Ontology
4.8 CiTO – The Citation Typing Ontology
4.9 AMO – The Argument Model Ontology
5 SLM – Semantic Lenses Methodology
5.1 Applying Semantic Lenses – Authorial or Editorial Activity?
5.2 Proposing a general methodology
5.3 The target paper – “Ontologies are Us”
5.4 The target lenses
5.5 Porting the target document to EARMARK
5.6 The Application of the Document Structure Lens
5.7 The Application of the Rhetoric Organization Lens
5.8 The Application of the Citation Network Lens
5.9 The Application of the Argumentation Lens
6 SLAM – Semantic Lenses Application & Manipulation
6.1 Introduction – Tasks, Aims, Necessities and Priorities
6.1.1 Jena API
6.1.2 EARMARK API
6.2 SLAM features and architecture
6.3 Details on the code and on SLAM Classes
6.3.1 Core Package
6.3.2 Utilities Package
6.4 From Theory to Practice – Mikalens sub-package and its structure
7 TAL – Through a Lens
7.1 User Interaction with Lenses – The Focusing Activity
7.2 Through a Lens – TAL
7.3 TAL Features
7.3.1 Explorable Argumentation Index
7.3.2 Explorable Citation Index
7.3.3 Rhetoric and Citation Tooltips
7.3.4 Contextual Rhetoric Denotations
7.4 Prototype generation from the test-bed through Java
7.5 Other Implementation details
8 Evaluation of SLAM and TAL
8.1 Enacting the methodology and using the framework to field-test lenses
8.2 Applying the annotations, lens by lens
8.2.1 The Application of the Document Structure Lens
8.2.2 The Application of the Rhetoric Organization Lens
8.2.3 The Application of the Citation Network Lens
8.2.4 The Application of the Argumentation Lens
8.3 Overall statistics
8.4 Lessons Learned and possible, recommended improvements
8.4.1 Suggested changes in the SPAR Ontologies
8.5 User Testing the TAL prototype
9 Conclusions
Conclusioni (Conclusions)
Bibliography – Ontology References
Bibliography



Introduction

With this thesis dissertation I aim to illustrate the results of my research in the field of Semantic Publishing, consisting of the development of a set of methodologies, tools and prototypes, combined with the study of a concrete use case, aimed at the application and focusing of Semantic Lenses: an effective model for the semantic enrichment of scientific documents [PSV12a]. Semantic Publishing is a revolutionary approach to the way scientific documents, such as research articles, can be read, used, addressed and become the object of new forms of interaction. But what exactly is Semantic Publishing? To define it in the very words of its main proponent: “I define semantic publishing as anything that enhances the meaning of a published journal article, facilitates its automated discovery, enables its linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between papers. Among other things, it involves enriching the article with appropriate metadata that are amenable to automated processing and analysis, allowing enhanced verifiability of published information and providing the capacity for automated discovery and summarization.” [SKM09]

The work I have carried out, and which I will discuss in the following pages, is therefore a complete contribution to the field of Semantic Publishing. First of all, it is a way of showing the feasibility and the advantages of the Semantic Lenses model for appropriate metadata enrichment, through the proposal of a detailed methodology for reaching this goal. It is also an indication of a possible way to overcome the challenges that may be encountered along this path, by developing the appropriate tools and viable solutions. And it is a practical demonstration of some of the new interactions and opportunities made possible by an interface prototype generated from a scientific document enriched through the appropriate annotation of lenses on it. My demonstration is based on Semantic Lenses, a model for the semantic enrichment of scientific documents, accompanied by a set of technologies recommended for its implementation. It is important to observe that the enrichment of a traditional scientific article is not a one-dimensional operation: beyond the mere act of adding semantically precise assertions about the textual content, many other facets are involved. All these aspects, which coexist simultaneously in an article, can be captured in the concrete manifestation of the semantics of a document through the application of specific filters, capable of emphasizing a precise set of meaningful information about one domain rather than another: from the rhetorical structure, to the intent of a bibliographic citation, up to a model that explicitly defines the claims within an argumentation. Imagine being able to choose from a specific collection of semantic lenses, each capable of focusing the object of our observation in a different way, highlighting a precise subset of qualities and meanings with respect to the rest of the document. In envisioning such a system, there are two obvious operations to carry out in order to make it fully functional. The first is the application, on the article, of the metadata associated with one of these specific semantic lenses. The other is the focusing, by the reader, of one of the selected lenses over the article itself, so as to favour the emergence of the set of meanings tied to the selected subset and to allow them to come to light, possibly in an interactive way. I chose to expand my investigation beyond the study of a theoretical methodology, towards the development of adequate tools capable of assisting in the use of Lenses, and I finally opted to test the whole Semantic Lenses concept by putting these principles into action: first, by concretely applying some of the proposed lenses to a document (after developing the tools to do so appropriately), then examining the results, and finally showing some of the possible applications and interactions resulting from the focusing of these lenses through an interface prototype.

Consequently, I selected a well-known article as the object of my tests. The choice fell on the HTML version of Peter Mika's “Ontologies are us” [Mik07] (a very important work on folksonomies, ontologies emerging from social contexts), which I converted into the EARMARK format [PV09] for implementation reasons, allowing me to exploit its peculiarities in the handling of overlapping markup [DPV11a]. After selecting the appropriate set of web technologies and ontologies, in accordance with the suggestions of the Semantic Lenses model, I devised a general bottom-up methodology – SLM, or “Semantic Lenses Methodology” – focused on the application of four specific semantic lenses (Document Structure, Rhetoric Organization, Citation Network and Argumentation). I then continued my work with the annotation, on the target document, of the appropriate metadata related to these lenses, through RDF statements, first developing a Java package – SLAM, or “Semantic Lenses Application and Manipulation” – that allowed me to perform the operations required by the methodology adequately. SLAM offers additional functionality with respect to the EARMARK Java API¹ on which it is built, and aims to be the first foundation for a set of tools that can be reused by anyone interested in replicating or extending the methodology I suggest. After that, I proceeded to write the annotations themselves, not manually, but through a series of instructions processed by the aforementioned Java implementation, emulating the authorial (and co-authorial) activity of document enrichment with a view to preserving its semantic correctness, making my decisions so as to translate the perceived meaning of the content while adhering as closely as possible both to the proposed methodology and to the requirements of the Semantic Lenses model. Finally, after analyzing the results of the work, as well as the various possible advantages that can be obtained through the enrichment of a document with Semantic Lenses, I generated a first prototype of an interface based on an HTML page, enhanced with JQuery², generated through Java. This UI prototype – TAL, or “Through A Lens” – allows the user to perform some simple focusing activities within the document, and shows their usefulness in a concrete scenario.

This dissertation is structured as follows: in section 2 I will introduce the general domain and the research context in which this work is placed, together with the notions of Semantic Web and Semantic Publishing, from a general and historical perspective, contextualizing the scientific advances in the same field and other related works. In section 3 I will discuss the technological context of this dissertation, and provide a brief review of the technologies and ontologies ancillary to this work. Section 4 contains a much more detailed exposition of the Semantic Lenses model and of the ontologies related to it. Section 5 follows, with the details of the SLM methodology that I researched and chose to adopt for this final work. Sections 6 and 7 contain, respectively, information and documentation on the design, development and implementation of SLAM and TAL. In section 8 I examine some of the results of the concrete application, through SLAM, of the lenses on [Mik07], as well as the generation and user testing of the TAL prototype produced from those results, thus obtaining a concrete case study for the use of the lenses and for the application of the previously illustrated methodology and tools. There I discuss the results of this application activity, present statistical data collected during the whole project and summarize the experience gained from it, finally examining the results of the user tests performed on TAL. A short summary of future development opportunities is given together with the final conclusions.

¹ S. Peroni, EARMARK API: http://earmark.sourceforge.net/
² JQuery: http://jquery.com/


1 Introduction and Aims of this Work – Applications for Semantic Lenses

With this thesis dissertation I aim to illustrate the results of my research in the field of Semantic Publishing, consisting of the development of a set of methodologies, tools and prototypes, accompanied by a case study, for the application and focusing of Semantic Lenses [PSV12a] as an effective means to semantically enrich a scholarly paper. Semantic Publishing is a revolutionary approach to the way scientific documents, such as research articles, can be read, perused, reached and interacted with. But what are the characteristics of Semantic Publishing? Allow me to define it in the very words of its first proponent: “I define semantic publishing as anything that enhances the meaning of a published journal article, facilitates its automated discovery, enables its linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between papers. Among other things, it involves enriching the article with appropriate metadata that are amenable to automated processing and analysis, allowing enhanced verifiability of published information and providing the capacity for automated discovery and summarization.” [SKM09]

The work I have done, and am going to show in the following pages, is then, according to this definition, a full contribution to the field of Semantic Publishing. It is an exposition of the feasibility and the advantages of the Semantic Lenses model for appropriate metadata enrichment, together with the proposal of a detailed methodology for their application. It is a path to overcome the challenges we are likely to encounter in this process, by developing the appropriate tools and solutions. And it is a showcase for the new interactions and knowledge discovery opportunities enabled through a basic UI prototype generated from the appropriate annotations of semantic lenses on an enriched paper. I am basing my demonstration on Semantic Lenses, a model for the semantic enhancement of scientific papers, accompanied by a set of suggested technologies for its implementation [PSV12a]. It is important to observe that the enhancement of a traditional scientific article is not a straightforward operation, as there are many aspects involved besides the mere act of making semantically precise statements about its content. All these different facets that coexist simultaneously within an article can be defined in the semantic rendering of the paper by applying specific filters emphasizing a specific set of meaningful information, which might be about its rhetorical structure, or the purpose of a citation, or a way to explicitly define the claims of an argumentation. Imagine then being able to choose from a set of semantic lenses, each one allowing users to focus the object of their observation in a different way, magnifying a selected subset, with its qualities and meanings, rather than others. In envisioning such a system, there are two obvious operations involved in making it fully functional. One is the application of the metadata associated with a specific semantic lens over the article. Then there is the focusing, by the reader, of a selected lens over the article, making the chosen set of metadata emerge and putting it in the forefront, possibly in an interactive way.

I have chosen to expand my investigation from a theoretical methodology to the development of adequate tools to assist in the use of Lenses, and I also opted to field-test the whole Semantic Lenses concept by putting these principles into action: first by concretely applying some of the proposed lenses to a document (and developing the appropriate means to do so), then by examining the results, and finally by showing some of the possible applications and interactions resulting from the focusing of those applied lenses. Consequently, I have selected a known paper as the object of my tests: the HTML version of Peter Mika's “Ontologies are us” [Mik07] (a very important work on folksonomies, ontologies emerging from social contexts), which I converted into the EARMARK format [PV09] for implementation purposes, allowing me to use its peculiarities for handling overlapping markup [DPV11a]. After choosing the appropriate set of web technologies and ontologies according to those suggested by the definition of Semantic Lenses, I studied a general bottom-up methodology – SLM, or “Semantic Lenses Methodology” – in order to specify a way to concretely apply four specific semantic lenses (Document Structure, Rhetoric Organization, Citation Network and Argumentation). I then proceeded to annotate the appropriate metadata for those lenses on the whole document through RDF statements, first by developing an adequate Java package – SLAM, or “Semantic Lenses Application and Manipulation” – which allowed me to perform the required operations. SLAM offers extended functionalities over the EARMARK Java API³ on which it has been built, aiming to be the first foundation of a set of tools which will then be reusable by anyone willing to replicate or improve the methodology I suggest. After that, I went on to write the annotations themselves as a series of instructions to be processed by said Java implementation, emulating the authorial and co-authorial task of enriching this document in a semantically correct way. In order to reach this goal, my final decisions on how to best translate the perceived meaning of the content were based on finding a way to adhere both to the methodology I proposed and to the requirements of the Semantic Lenses approach. Finally, after analyzing several possible advantages that might be brought by the enrichment of a document through Semantic Lenses, I created a basic prototype of an HTML-based, JQuery⁴-enhanced, Java-generated interface – TAL, or “Through A Lens” – capable of performing some basic focusing of Semantic Lenses and some of their possible intra-document applications, showing their usefulness in a real-case scenario.

This document is structured as follows: in section 2 I will introduce the general domain of the problem that this work addresses, as well as the notions of Semantic Web and Semantic Publishing, in general and in a historical perspective, contextualizing scientific advancement facing the same issues and other related works. In section 3 I will discuss the technological context for this dissertation, and give a brief review of the technologies and ontologies used in this work. In section 4 there will be a much more detailed explanation of what Semantic Lenses are, together with their related vocabularies. Section 5 follows with the details of the methodology I have researched and chosen to adopt for this thesis. Sections 6 and 7 contain, respectively, information and documentation about the design, development and implementation of SLAM and TAL. In section 8 I report on the results of testing SLAM and TAL over [Mik07], obtaining a concrete case study for the effectiveness of lenses and the application of the methodology I previously detailed. I discuss the results of the application of the lenses, present statistical data collected for the whole project, summarize the experience obtained from this activity and report on the outcomes of the tests executed on TAL. A short summary of future opportunities for development, and of the advantages of a widespread and methodical adoption of Semantic Lenses (as a methodology and a set of technologies), is located, together with the final conclusions, in section 9.

³ S. Peroni, EARMARK API: http://earmark.sourceforge.net/
⁴ JQuery: http://jquery.com/

2 Scientific Context and Related Works

2.1 General introduction to Semantic Web and Semantic Publishing

Words, in all their beauty and heterogeneity, are the fundamental units of language through which human beings communicate. But, as often happens, the little, primal things we usually take for granted are very far from being the simplest notions to wrap our minds around. As John Locke wisely put it, words are not just “regular marks of agreed notions”⁵, but “in truth are no more but the voluntary and unsteady signs of (men's) own ideas.”⁵ And indeed, to keep quoting him, “So difficult it is to show the various meanings and imperfections of words, when we have nothing else but words to do it with”⁵ – an excellent summary of our everyday quest to correctly comprehend the ideas, experience and intentions being communicated by others. Consider the act of saying out loud something as simple as “Good morning!” – if your interlocutor is feeling especially witty, they might reply: “Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on?”⁶

⁵ J. Locke (1689); An Essay Concerning Human Understanding.
⁶ J.R.R. Tolkien (1937); The Hobbit.

Being able to extract meaning (correctly, if possible) from the communications we receive is part of what information science is all about. This daunting but often unconscious daily task our mind is so adept at performing becomes harder and more deliberate when we consider a written text, especially one debating a complex subject whose author we might not be familiar with, as can be the case for a scholarly scientific publication. If we envision the act of examining a piece of written text, there are many different approaches we can take, in as many different contexts, to extract significance from it. Indeed, “meaning is not embedded within words, but rather triggered by them” [DeW10], and takes shape according to the aspects we consider the most important at the moment of our examination. And we do this not just once, but many times, for all the different contexts that are part of natural language, until in our minds we are satisfied with a multi-dimensional representation of the information we processed. Although this may seem overly complicated at first, in practice it is a process which we often apply not just to the interpretation of written text, but to everyday human interactions with reality, from grand ideas to the most mundane of items: we can relate to something as widespread as a modern smartphone in many ways – thinking about it as a medium for communication, a recording device, a mobile entertainment system, a storage for important contacts, a keepsake of memories, a manufactured technological object, a consumer good, a status symbol, or even a badge of affiliation, and so on. “In other words, this meaning is not contained within the words themselves, but in the minds of the participants” [DeW10].

To better clarify this concept, let us contemplate this very piece of text. We might consider its structure, and say that it is a section of text inside a document, made of paragraphs, organized in sentences, which might be modeled as inline components of the text, intermixed with some internal references to other parts of this document. But we can also dwell on the rhetorical aspect of this text. We might say that this is an introduction to the contents of this document, with parts where a problem requiring a solution is stated and other parts where the author explains the motivation behind this authorial effort. And since we are thinking about authors, one might be interested in knowing more about this topic, perhaps in discovering which people played which part in creating this document. And what about the whole document itself, or the data behind it? We might be interested in gathering information on the research context that originated this document, or in finding out whether this is the only Manifestation of an authorial Work, or its possible publication status. Then again, switching back to the text on these pages, we might notice that some of the sentences (visually characterized differently from the others) appear to be quotes and citations. Why are these other authors quoted, and what was the purpose behind each of these citations? In short, how are these citations handled by the author? They might represent a foundation for the expansion of a discourse, they might be called in as examples or as a source of background information, or they might be supporting evidence for a thesis. Speaking of which, the reader will at some point focus on the actual meaning of this text. What are the claims being made by the author? We might inspect the argumentation model used to state and assert these claims, and try to distinguish between the sentences that constitute what is being argued and those that, for instance, make up the evidence sustaining these assertions, or the logical warrants that act as a bridge between the two. Finally, to obtain anything useful from a written communication, the recipient must be able to assign some actual meaning to the words themselves, to understand whether they refer to specific entities or definitions, and to relate to those “unsteady ideas” that the author originally had in mind.

And here we are again, right at the heart of our problem; but “Coming back to where you started is not the same as never leaving”⁷. Thus we are now more familiar with one of the challenges that the multi-faceted discipline of Semantic Publishing [SKM09] is trying to tackle, with a combined effort aimed at improving the effectiveness of written communication (especially scholarly journal articles), enhancing the meaning of a published scientific document by providing a large quantity of information about it as machine-readable metadata, facilitating its automated discovery, querying or integration [Sho09]. It should also be clearer why the enhancement of a traditional scientific article is not a straightforward operation, and we should have a first glimpse of how many different semantically relevant aspects coexist within a (scientific) document, like facets of a gem, even if so far I have sketched them only informally. These aspects are all subtly interlinked, and yet each one adds its specific contribution to our final understanding of the overall meaning of the document.

⁷ T. Pratchett (2004); A Hat Full of Sky.

2.2 Semantic Web, Semantic Publishing and the Enrichment of Knowledge

The idea of semantic publishing is but the latest addition to a long-standing, prolific cooperation between web technologies aimed at content classification and distribution, scientific research in general, and publishing activities; it is one dating back to the inception of the web itself [BCL94], and now growing even more quickly with the widespread popularity of XML-based languages and technologies (such as DocBook or XHTML), online paid content distribution systems, mobile e-readers, wireless connections and, most importantly, with the growth of the Semantic Web.

The Semantic Web, as envisioned by Tim Berners-Lee more than 10 years ago [Ber01], was described as a way to bring structure to the meaningful content of web pages, by extending the traditional web (not substituting it) in a way that could allow newer, better, machine-readable definitions of information and its meaning. This idea soon had an explosive growth, both in popularity and in different definitions, backed by the swift development and evolution of the technologies behind it, such as RDF, OWL or SPARQL, to the point that we have dozens of different ways to describe this evolution. In general, the Semantic Web initiative aims to represent web content in a form that is more easily machine-processable [AH04], by building a web of data with common formats for the integration and combination of said data, and by defining formal languages and technologies to record the meaning of this data, allowing the user to leap seamlessly from one set of information to another⁸.

⁸ W3C Website (2012), Introduction to the Semantic Web Activity: http://www.w3.org/2001/sw/

Berners-Lee himself gave two other interesting definitions of what is becoming the Semantic Web, first observing that it resembles a Giant Global Graph [Ber07] in which data of all kinds, whether social or scientific, are connected in meaningful relationships, discoverable and re-usable, allowing us “to break free of the [single] document layer”. He also argued that a great number of Semantic Web patterns have a fractal nature [BK08], much like human language and the way we already classify knowledge, and strongly advocated the development of web systems made of overlapping specialized communities of all sizes, interlinked with other communities (with the help of properly designed ontologies) and themselves made of sub-communities.

The last few years have seen the great success of one of the main Semantic Web initiatives, Linking Open Data. At its core, LOD is a collaborative community project, sponsored by the W3C, whose aim is to extend the traditional web by encouraging the publication of open, RDF-enriched sets of machine-readable data in a way that adheres to four basic principles [Ber06], together with a standard set of technologies and best practices. Here are those principles as of their latest formulation [HB11]:

1. Use URIs as names for things.
2. Use HTTP URIs, so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.

The goal is to use the Web to connect data that were not previously linked, to do so in a structured, typed way, and to lower the barriers between these datasets, so that their meaning is explicitly defined. Or, to quote [HB11], “In summary, Linked Data is simply about using the Web to create typed links between data from different sources”. The success of Linked Data as an application of the general architecture of the World Wide Web to the task of sharing structured data on a global scale can best be summarized by the fact that the amount of data involved has almost doubled every year, from the already impressive 6.7 billion RDF triples of July 2009 to 32 billion triples as of September 2011. With the resulting web of data based on standards and a common data model, it becomes possible to implement applications capable of operating on this interlinked data graph. There are many analogies with the classic Web, like having data-level links connecting data from different sources into a single global space, much as is done in the World Wide Web. And, just as in the “traditional” WWW, data is self-describing, and anyone can publish data into the LOD cloud. However, this web of Linked Open Data is based on standards for the identification, retrieval and representation of data. This opens up the chance to use general-purpose standardized data browsers to explore the whole giant global space; and, considering that well-structured data from different sources is linked in a typed way, all kinds of data fusion between different sources become possible, and operations such as queries can be performed on this aggregated data. Not only that, but data sources can be discovered at runtime by crawlers simply following the data-level links, allowing for a far greater depth in delivering answers.
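To make these four principles tangible, here is a minimal, hypothetical fragment in the Turtle syntax (introduced in section 3.1.2); the ex: resource and its description are invented for illustration, while the dbo: property names come from the DBpedia ontology:

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix ex:  <http://example.org/people/> .

    # Principles 1 and 2: the resource is named by a dereferenceable HTTP URI.
    ex:tim a dbo:Scientist ;
        # Principle 3: looking up that URI yields useful RDF statements like these.
        dbo:birthPlace <http://dbpedia.org/resource/London> ;
        # Principle 4: a typed link pointing into another dataset (DBpedia).
        owl:sameAs <http://dbpedia.org/resource/Tim_Berners-Lee> .

A crawler that follows the owl:sameAs link discovers an entirely new data source at runtime, which is exactly the behavior described above.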


Fig. 1 - The expansion of the Semantic Web - 2007-2009


Fig. 2 - The Semantic Web, as of September 2011

It is in this context that David Shotton suggested the opportunity for a Semantic Publishing “revolution” in 2009 [Sho09], whose main idea we have already hinted at. We know all too well that scientific innovation is based not only on hypothesis formulation, experimentation, interpretation of data and publication, but also on finding, understanding, re-using, discussing and possibly challenging the results of previous research; discovering ways to improve the effectiveness of this process is tantamount to the betterment of the research output as a whole. A large part of this scientific production is in the form of scholarly papers published by academic journals. Even considering just those publications, and not the various conference proceedings or complete books, the magnitude of the numbers involved is a testament to the determined progress of humanity in the field of knowledge: it has been estimated that, in 2006 alone, more than 1,350,000 articles were peer reviewed and published, roughly 154 per hour [BRL09]. In 2011, the number of publications in just the fields of health and medicine recorded by the US National Library of Medicine amounted to more than 828,000⁹. As it is, there is a widening gap between our ability to generate data and knowledge, and our ability to retrieve and link it. The Semantic Web and all its “children” initiatives are part of an effort to resize this gap.

⁹ United States National Library of Medicine, publications-per-year data available here: http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html

The idea behind Semantic Publishing is to use the most recent developments in web technologies for the semantic enhancement of scholarly journal articles, in a process that would involve contributions from publishers, editors and authors, aiming to assist the publication of data and meaningful metadata related to the article, as well as providing means to explore them and interactive access to content [Sho09]. All these enhancements could then increase the value of the improved articles, making them easier to understand, allowing for a better emergence of knowledge, making datasets interactive and allowing for the development of potential secondary services (called “ecosystem services”) for the integration of said enhanced information between several articles, or between the articles and other parts of the LOD, for example by having named entities automatically linked to the appropriate ontology. Semantic Publishing could then merge the already existing advantages of systematic online article publication, which are similar to those of the traditional web, where documents are designed mainly to be used by people, with the advantages of LOD, so that readers could benefit from quicker, more complete and more practical access to meaningful and reliable information, while possibly discovering and exploring other related data seamlessly. Shotton describes both the current state of on-line journal publishing, including its shortcomings, and his prefigured state of the art for Semantic Publishing, listing a wide range of possible courses of action that could be taken by the stakeholders, such as the semantic markup of text, providing structured digital abstracts, allowing for interaction with media and data, and so on. He also underlines [Sho09] the different contributions that could be made by the different stakeholders, according to their roles (publishers, editors and authors should all be involved, but in different parts of the process). In that paper he also defines principles and guidelines for future semantic publishing activities.

Leading the way in practice as well as in theory, an exemplary application of Semantic Publishing as a semantic enhancement of an existing article by Shotton et al. had just been published at that time [SKM09]: it showed a concrete implementation of several intra-document and inter-document interactive applications (such as data fusion with other data sources, tooltips for citations in context and citation typing, or the highlighting of semantically relevant terms), as well as theorizing many other advantages and uses, ultimately changing the perception of how an article can be better read and understood just with the proper application of existing web technologies, according to the belief that much could be done to make the data contained within a research article more readily accessible. The authors of [SKM09] chose an approach which has been an inspiration for my own, as they selected an existing article [RRF08] to serve as a target for their concrete examples and as a reference platform for the new functionalities they suggest. The result of their work is available for online consultation and interaction¹⁰. The features showcased as functional enhancements to the article in [SKM09] are heterogeneous, and comprise many interesting data fusion experiments and actionable data interfaces, but the part most related to this dissertation is the one detailing several ways to “Add value to text”. These include, to give a non-comprehensive list: the highlighting of named entities and their linking to external information sources (such as appropriate ontologies), citations in context with tooltips, tag clouds and tag trees on the entities, document statistics, citation typing analysis, and the enhancement of links and machine-readable metadata with RDF. The authors also commented on the “needs to approach research publications and research datasets with different presuppositional spectacles”, acknowledging the importance of having tools to emphasize one aspect rather than another, and first advanced the idea behind Semantic Lenses. Many of the suggestions in this article, like the support for structural markup of greater granularity, as well as the already mentioned integration with a citation typing ontology, will be fundamental for the development of Semantic Lenses, which I will explore better in section 4.

¹⁰ Enhanced version of [RRF08] by [SKM09]. DOI: http://dx.doi.org/10.1371/journal.pntd.0000228.x001
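Since citation typing will recur throughout this work, a tiny, hypothetical Turtle fragment can make the idea concrete; it uses CiTO, the Citation Typing Ontology discussed in section 4.8, and the papers named here are mere placeholders:

    @prefix cito: <http://purl.org/spar/cito/> .
    @prefix ex:   <http://example.org/papers/> .

    # Typed citations record not just that A cites B, but why it does so.
    ex:thisPaper cito:extends       ex:earlierModel ;
                 cito:usesMethodIn  ex:methodPaper ;
                 cito:disagreesWith ex:rivalClaim .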

2.3 Other Related Works

Obviously, the idea of semantic enhancement for scientific papers or journal articles predates the formal definition of Semantic Publishing, even though most of these other works focused on one specialized aspect of it. For example, the interest in explicitly defining the rhetorical structure of a scientific publication has been around for a while, as exemplified by De Waard et al. in [DBK06], where the authors made the compelling argument that a scientific article is very much an exercise in rhetoric, having as its main objective to persuade readers of the validity of a particular claim. The authors lamented that, despite the advent of computer-centered ways of creating and accessing scientific knowledge, the format of an article has remained mostly static. Their answer was the development “of a more appropriate model for research publications to structure scientific articles” [DBK06], based on a rhetorical structure which they identify as ubiquitous in scholarly articles. This model, developed for usage in a computerized environment, relies on authors explicitly marking up the rhetorical and argumentational structures of their findings and claims during the authoring/editing process, then making these metadata available to a search engine. The goal was to allow for the creation of well-defined lines of reasoning within a text, and between texts, to present a user with a network of linked claims: some of these ideas were further developed within the concept of Semantic Lenses. The model proposed by [DBK06] was based on three elements, namely a rhetorical schema with the definition of the logical order and the rhetorical role of the document sections, an analysis of the argumentation structure of the paper, and the identification of data and entities within the documents.

De Waard's interest in the rhetorical analysis of papers and in Semantic Publishing technologies did not abate with the passing of time, and her recent “Directions in Semantic Publishing” [DeW10] makes for a most compelling read, as well as a magnificent summary of the state of semantic enrichment at the time of its publication. Expanding the subject of her discourse from the simple enhancement of entities, something that is being done by several tools, like Pubmed¹¹, [DeW10] makes a persuasive case in favor of statements (in the form of subject-predicate-object triples) as the most complete way to provide machine-readable access to pertinent facts, then observes that we should not limit ourselves to simple statements as the only way to transmit meaningful scientific knowledge, arguing that the main method for this communication is scientific discourse, reinforcing her previous claim that scientific articles are akin to “stories that persuade with data”, as well as endorsing the effort to develop and connect scholarly publications to the LOD space.

¹¹ United States National Library of Medicine and Health, Pubmed: http://www.ncbi.nlm.nih.gov/pubmed/

In [DBC09] these talking points evolve into HypER – Hypothesis, Entities, Relationships: the proposal to design a system where specific scientific claims are connected, through sequences of meaningful relationships, to experimental evidence. Once again, at the center of this work lies the fact that knowledge representation focuses on scientific discourse as a “rhetorical activity”, and that tools and modeling processes should take this consideration into proper account. When comparing this approach with others based solely on isolated triples, there is a considerable shift in assigning the epistemic value of sentences to the explicit characterization of author intent, consequently implying a shift of the conceptualization of text towards rhetorical discourse. The main intent of HypER can thus be summarized as changing the focus of reading comprehension from the subject studied back to the author's rhetorical, pragmatic and argumentative intent.

Another must-read recent paper is the one by Pettifer et al. [PMM11], on modeling and representing the scholarly article. Here, one of the initial focuses is once again the consideration that our ability to generate data is surpassing by far our capacity to harness its potential and to retrieve and reuse it effectively: we are losing track of what we know, adding to the costs of research the effort of rediscovering our own knowledge (even if this is not just a problem of the modern age). The authors ponder the conflicting needs of publishing, especially the conflicting nature of having documents that must serve both as platforms for the human reader and as carriers of machine-readable metadata. As is painfully obvious, the two possible recipients require very different languages and structures, and have very specific and different needs. The authors examine the pros and cons of several data formats in light of these considerations, and try (perhaps a little unconvincingly) to deflate some of the arguments against PDF. The paper also explains very well the foundation of FRBR – the Functional Requirements for Bibliographic Records, a model introducing several levels for the classification of a Bibliographic Entity: a Work is realized in one or more Expressions, which are embodied in one or more Manifestations, of which one or more concrete Items exist. The final purpose of [PMM11] is to introduce Utopia Documents, a software approach to mediate between the underlying semantics of an enhanced article and its representation as a PDF document, striving to combine all the interactivity of a Web page with the advantages of the PDF. Once again, the emphasis is on actionable data, recognition of content and features, and automatic linking of citations and entities.
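To make the FRBR layers more concrete, here is a minimal Turtle sketch; it assumes the commonly used FRBR Core vocabulary (http://purl.org/vocab/frbr/core#), and the resource names are purely illustrative:

    @prefix frbr: <http://purl.org/vocab/frbr/core#> .
    @prefix ex:   <http://example.org/frbr/> .

    ex:someWork a frbr:Work ;                  # the abstract intellectual creation,
        frbr:realization ex:journalVersion .   # realized as an Expression...

    ex:journalVersion a frbr:Expression ;
        frbr:embodiment ex:htmlVersion .       # ...embodied as a Manifestation...

    ex:htmlVersion a frbr:Manifestation ;
        frbr:exemplar ex:localCopy .           # ...of which concrete Items exist.

    ex:localCopy a frbr:Item .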


Another approach is the one taken by SALT – Semantically Annotated LaTeX [GHK07], an authoring framework in which authors can embed semantic annotations in LaTeX documents for the enrichment of scientific publications. The framework features three different ontologies: one for capturing the structure of the document, one for the externalization of the rhetorical and argumentational content, and one for linking the argumentation to the structure and to the document itself. It is also available for use outside the LaTeX environment, but that is still its main area of application, and that is where an annotation module is being developed. Specifically, SALT permits the enrichment of the document as an activity concurrent with its writing, giving the author ways to express formal descriptions of claims, supports and rhetorical relations as part of the writing process. However, the final result is an enriched PDF document with slightly fewer features when compared to Utopia, and its scope is more limited than that of Semantic Lenses, as the main usage environment at the base of its design remains LaTeX, with all its peculiar characteristics.

Speaking of ontologies, there are many addressing several of these problem areas. While those used in this work will be better described in sections 3 and 4, some others are more than worthy of at least a passing mention. Aside from the already mentioned FRBR [IFL98] (which has the advantage of not being tied to a specific metadata schema or implementation), there is also BIBO, the Bibliographic Ontology [DG09], able to describe bibliographic entities and their aggregations, and DOAP [Dum04], the Description Of A Project ontology, a vocabulary for the description of software development research projects.

The interest in Semantic Publishing technologies by the stakeholders, especially publishers, has grown quite a lot in the last years. As an example, allow me to mention the Elsevier Grand Challenge (2009) for Knowledge Enhancement in the Life Sciences, a competition between proposals “to improve the way scientific information is communicated and used”. Participants were required to submit descriptions and prototypes of tools “to improve the interpretation and identification of meaning in online journals and text databases relating to the life sciences”. The competition, which offered a total of $50,000 in prize money ($35,000 for the first prize), was won by Reflect [POJ09], an impressive research tool designed to be an augmented browser for life scientists. Reflect is able to identify relevant entities like proteins and genes and to generate pop-up windows with related contextual information, together with additional links to those entities as defined in other ontologies. Its architecture is focused on text-mining the content of the articles and then comparing the results to its internal synonym dictionary for automatic entity recognition, which performs quite accurately in its field. While it provides several useful disambiguation tools and tagging features, its design is quite different from the approach of semantically enriching articles by embedding relevant meta-information in them, as it focuses mainly on the speed of recognition and depends heavily on the maintenance of its synonym lists. Still, it is an excellent example of how many different practical approaches can be taken to create viewpoints tailored to the needs of researchers, with systematic emergence of meaning and quick and easy access to more detailed information. Another recent sign in this direction is the birth of two new important scientific events dedicated to Semantic Web topics. In 2011, the 10th International Semantic Web Conference hosted the Linked Science workshop, a full-day event with discussions about new ways of publishing, linking, sharing and analyzing scientific resources like articles and data, while the 8th Extended Semantic Web Conference inaugurated SePublica, the first formal event entirely dedicated to Semantic Publishing, a workshop where several papers were presented and the best one was awarded a prize.


3 Technologies and Ontologies

3.1 RDF and Ontologies in General

As described in sections 1 and 2, a Semantic Publishing activity such as this one is part of the broader range of works that fall under the domain of the “Semantic Web”. I will start my brief review of the technological context for this work by quickly introducing some of the basic concepts behind most of the technologies presented and used in this demonstration. Let us start with ontologies. An ontology, at least in computer and information science terms, is an agreed-upon formal specification of knowledge, consisting of a set of concepts and properties within a domain. Or, to put it another way, an ontology is the expression of a shared consensus on a way to explicitly represent the meaning of terms in vocabularies and the relationships between those terms [Gru07]. An ontology is designed to give users a set of non-ambiguous, formal methods with which to model a certain domain of knowledge (or discourse). Ontology components are typically members of one of these primitive groups (a small example follows at the end of this overview):

- Classes (or sets), which are the concepts defined in the ontology. Each class will of course have its individual members.
- Properties (or attributes), which are the characteristics or parameters defining or refining the meaning an object can have. There are two main categories of properties: Object Properties, defining meaningful links between individuals, and Data Properties, defining meaningful links between an individual and a (typed) data value.
- Relationships (or relations), which are the ways classes or individuals can be related to each other, for example hierarchy or membership relationships.
- Restrictions, formal requirements that must be met and verified.
- Axioms, assertions used to associate classes and properties with some specification of their characteristics, or to state logical information about classes and properties which is held to be true in the model the ontology uses in its knowledge domain.

Ontologies are often referred to as “vocabularies”. According to the W3C: “There is no clear division between what is referred to as “vocabularies” and “ontologies”. The trend is to use the word “ontology” for more complex, and possibly quite formal collection of terms, whereas “vocabulary” is used when such strict formalism is not necessarily used or only in a very loose sense.”¹²

¹² Definition of an Ontology by the W3C: http://www.w3.org/standards/semanticweb/ontology

Ontologies can be defined in many different languages. For the time being, the reference one for the Semantic Web is the Web Ontology Language – OWL 2 [W3C09][HKP09], a W3C specification offering sublanguages at different levels of expressiveness (OWL Lite, OWL DL and OWL Full in the original OWL; the EL, QL and RL profiles in OWL 2). Being a language aimed at the definition of ontologies, OWL is not especially relevant to this work, as ontology definition is not part of the scope of my activity: Semantic Lenses are meant to operate with already existing, well-designed and widely tested ontologies.
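As promised, here is a small sketch of how the components listed above look in practice, written in the Turtle syntax introduced in section 3.1.2; the ex: names are invented and the modeling is deliberately simplistic:

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/onto#> .

    ex:Document  a owl:Class .                 # a class
    ex:Paragraph a owl:Class .
    ex:Article   a owl:Class ;
        rdfs:subClassOf ex:Document .          # a hierarchy relationship

    ex:contains a owl:ObjectProperty ;         # an object property, with axioms
        rdfs:domain ex:Document ;              # declaring its domain and range
        rdfs:range  ex:Paragraph .

    ex:title a owl:DatatypeProperty ;          # a data property with a typed range
        rdfs:domain ex:Document ;
        rdfs:range  xsd:string .

    # A restriction: every Article must contain at least one Paragraph.
    ex:Article rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:contains ;
        owl:minCardinality "1"^^xsd:nonNegativeInteger
    ] .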

More important within this demonstration is the W3C Resource Description Framework – RDF [KC04]. This is a family of specifications aimed at modeling data representation and interchange over the Web, and is the basis on which OWL is built. More specifically, it can be divided into RDFS (RDF Schema), a schema language used to define RDF vocabularies and allowing some basic ontology definition, and the RDF model itself. The former not being relevant to this work, I will illustrate the basic concepts of the latter, as the metadata specified by the Semantic Lenses, and those I will embed to semantically enrich [Mik07], are based on the RDF model, and the whole enhanced version of [Mik07] will be an RDF document. A fundamental technology for the Semantic Web, the RDF model mimics and extends the basic linking structure of the traditional Web by using URIs to identify relationships between subjects and objects, as well as to identify the two ends of a link. RDF is thus made of statements that hinge on subject-predicate-object triples.

Fig. 3 - The structure of the basic RDF triple

RDF Resources, which are “things” identified by URIs, are the main building blocks of RDF, and can be either subjects or objects of statements; the object of a statement can also be a simple data type, known as a literal, which can be a string or a number. Properties are a special kind of Resource, used to describe relationships between resources – a predicate is a property inserted in a statement. Types can be represented by resources, and can be assigned through the rdf:type property to another resource. RDF supports blank nodes and several types of containers and collections. It is important to realize that RDF is a graph-based model. A set of RDF statements forms a graph, connecting the Resources that are the subjects of statements to their objects by way of URI-identified properties. It is also important to note that identical URIs in different graphs refer to the same resource. This is because RDF is especially designed for representing machine-readable information meant to be processed by applications, such as metadata about documents, and one of the aims of RDF is to provide a framework to express this information so that it can be exchanged and processed between different sources and agents without loss or alteration of meaning [MM04]. To do so, an RDF model graph can be linearized into a list of textual statements, in several languages – such an operation does not produce unique results: a graph can be correctly linearized in more than one way. We will quickly introduce two of the most relevant ones.

Fig. 4 - An example RDF Graph


3.1.1 RDF/XML Syntax Linearization

RDF/XML [Bec04] is the basic syntax for the linearization of an RDF Model: a way to generate triples in textual form from the statements of an RDF graph, and to represent these triples in an XML-compatible format (together with namespaces). It is quite useful for machine accessibility, but somewhat overly verbose in terms of human readability. The Description of a resource collects all the statements having that resource as subject, and the resource is identified by the rdf:about attribute. If the object of a predicate is a resource, it can be identified with the rdf:resource attribute. An rdf:type can be specified as a standalone property, and typed literals can be declared using the rdf:datatype attribute associated with a property.

The linearization of our example graph:

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:exterms="http://www.example.org/terms/">
      <rdf:Description rdf:about="http://www.example.org/index.html">
        <exterms:creation-date>August 16, 1999</exterms:creation-date>
        <dc:language>en</dc:language>
        <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
      </rdf:Description>
    </rdf:RDF>


3.1.2 Terse RDF Triple Language – TURTLE

A Turtle [BBP12] document allows writing down an RDF graph in a compact and natural text form, with abbreviations for common usage patterns and datatypes. It is a non-XML format derived from the N-TRIPLES format, and is an RDF-compatible subset of the Notation-3 language. It is less verbose than RDF/XML, and a reasonable mix of machine readability (it is easy to parse) and human readability, although its syntax might be a little tricky at first. I will give a very short introduction to its syntax.

- @prefix can be used to declare a named shorthand, which can then be combined with a local part of the text to obtain a complete fragment; it is not limited just to XML namespaces.
- Comments are preceded by the hash sign #.
- A statement can be written as: <subject> <predicate> <object> . (where the object can also be a "literal").
- The “;” semicolon allows several triples sharing the same subject to be written without repeating the subject; the “,” comma works in a similar fashion to repeat both subject and predicate.
- The rdf:type property can be abbreviated with “a”, as in: <subject> a <Class> .
- Literal datatypes can be specified by appending ^^nameofdatatype after the literal.

The linearization of our example graph:

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix exterms: <http://www.example.org/terms/> .

    <http://www.example.org/index.html>
        exterms:creation-date "August 16, 1999" ;
        dc:language "en" ;
        dc:creator <http://www.example.org/staffid/85740> .


3.2 EARMARK – Extremely Annotational RDF Markup

EARMARK (Extremely Annotational RDF Markup) [PV09] is an ontological model designed to combine in a single document both embedded markup, which can define the structure of the document (like XML and its derivatives), and annotations and statements over resources (like RDF), with the aim of having all the advantages of both technologies available at the same time within a single model. [DPV11a] With the EARMARK ontological approach to meta-markup the user can explicitly make structural assertions of markup, describing the structure of a document in a way suitable for the Semantic Web. The model is likewise able to express semantic assertions about the document, its content, or the relationships between its components. This allows a very straightforward and powerful integration of syntactic markup (like HTML) with the semantics of the document content (like RDF), combining the qualities of both the traditional Web and the Semantic Web in a single format. Not only that: EARMARK also allows for a seamless integration with ontologies aimed at explicating the semantic meaning of syntactic markup (e.g. the Pattern Ontology), and it is able to support overlapping markup in a way that is easy to handle, without convoluted workarounds. [DPV11a] In short, EARMARK is a way to "bring full RDF expressiveness to document markup (or, equivalently, to provide full fragment addressing to RDF assertions)" [PV09]. The founding idea of EARMARK is to model documents as collections of addressable text fragments, identified by Ranges over text collections called Docuverses, and then to associate said text content with assertions that describe syntactic and structural features (such as the equivalent of a paragraph element), via MarkupItems, or to define semantic enhancements for the content or part of it. As a result, EARMARK allows the representation not just of documents with single hierarchies, but also of ones with multiple overlapping hierarchies, as well as annotations on the content through assertions that can overlap with the ones already present. [DPV11a] [DPV11b] A brief list of the features of EARMARK would include:

- The possibility to express any kind of arbitrarily complex assertion over documents (be they text, XHTML or XML) without any restriction on the overall structure of the assertions, with support for hierarchies and graphs, either according to the document order or independently of it
- The ability to convert any embedded semantic markup into RDF triples, and to externalize it


- Being a model completely compatible with RDF, allowing several types of linearization
- The capacity to express out-of-order and repeated uses of the same text fragment, a property originally unique to EARMARK and very rare among markup embedded in documents
- The possibility to handle overlapping markup easily, and compatibility with the full W3C XPointer standard
- Being able to produce a model over which ontology properties can be verified and validated with reasoners, including the consistency of semantic assertions against OWL ontologies

The EARMARK software package also provides a full-featured set of Java APIs, as well as a very useful HTML to EARMARK converter, which I used to port [Mik07] into EARMARK. I am now going to introduce EARMARK and its core ontology [Per08], the structure of its model, and give a short overview of how to use it to express properties over elements and text, as well as showing how it solves the problem of overlapping markup. The core EARMARK model itself, being an ontological one, is defined by an OWL document specifying classes, properties and relationships. We distinguish between ghost classes, which define the general concepts of the model, and shell classes, which are those actually used to declare individual instances of EARMARK components.


Fig. 5 - The Architecture of EARMARK

3.2.1 Ghost Classes

EARMARK's ghost classes are used to describe its three basic disjoint concepts – Docuverses, Ranges and MarkupItems. [DPV11a]

- Docuverses identify the textual content of an EARMARK document, which is kept apart from ALL annotations on it, regardless of their nature. This textual content is referred to through the Docuverse class, and individuals of this class represent the containers of text in an EARMARK document. For example, if we consider a traditional XML document, there might be a Docuverse for all the textual content of elements, another for the content of all attributes, and another one for all comments. Individuals of the Docuverse class specify their content with the property "hasContent", which has the content as object.

- Ranges are the way for an EARMARK document to identify fragments of text within a Docuverse. The class Range is thus defined for any text lying between two locations of a Docuverse. An instance of the Range class is created by the definition of a starting and an ending location within a specific Docuverse, which is referred to by the property "refersTo". The two main properties for Ranges are "begins" and "ends", which refer to a literal object indicating the starting and ending points of a range. It is interesting to note that there are no order restrictions over the begins and ends properties, so it is perfectly possible to define ranges that either follow or reverse the order of the Docuverse they refer to. For instance, if we consider a Docuverse whose hasContent is the string "bats", I can either refer to it with the begins location (0) lower than the ends location (4), obtaining the text in document order, or reverse it by simply having begins = 4 and ends = 0, thus obtaining "stab".

- MarkupItems are the syntactic artifacts allowing us to define traditional document markup, such as Elements, Attributes and Comments. A MarkupItem individual is a Collection (Set, Bag or List) of individuals belonging to the classes MarkupItem and Range. Through these collections EARMARK specifies that a markup item can be a set, bag or list of other markup items, of text fragments as identified by ranges, or of a mixture of both, by using properties like "element", "item" and "itemContent" (according to the collection used). By doing so, it becomes possible to define elements with nested elements or attributes, mixed content models, overlapping markup, or even other complex multi-hierarchy structures (such as graphs). Besides the mandatory URI, it is possible to define both a general name for an instance of MarkupItem, with the property "hasGeneralIdentifier", and a namespace with "hasNameSpace".


3.2.2 Shell Classes

While the ghost classes presented so far give us an abstraction of the EARMARK conceptual model, there is the need to specialize it, giving concrete definitions of the ghost classes. Thus we have several shell classes as subclasses of the ghost classes, applying specific restrictions to them and being the ones whose instances can be concretely declared. [DPV11a]

- A Docuverse is limited to being either a StringDocuverse or a URIDocuverse. The difference is simple: a StringDocuverse is a docuverse whose actual content is a string included in the document, while a URIDocuverse has its content located at the specified URI.

- A Range can be either a:
  o PointerRange, which is a range defined by counting single characters over a docuverse. In this case, the values of the properties "begins" and "ends" must be non-negative integers that identify positions in the character stream: the index 0 refers to the location just before the first character, while the value n refers to the location just after the n-th character. PointerRanges on the same Docuverse having the same starting and ending points are the same range.
  o XPathRange, which is a range defined by considering a context within a Docuverse selected by an XPath expression, identified by the value of the property "hasXPathContext".
  o XPathPointerRange, a subclass of XPathRange where the values of the properties "begins" and "ends" must be non-negative integers identifying positions in the character stream selected by the XPath context.

- MarkupItem is specialized in three disjoint sub-classes – Element, Attribute and Comment. This is done in order to allow a more specialized and precise characterization of the usual traditional markup items, which typically fall under one of these categories.
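To make these definitions concrete, here is a minimal Turtle sketch of the "bats"/"stab" example from the previous subsection, written in the same unprefixed style as the snippets later in this section; the ex: namespace and the "span" general identifier are invented for the example:

@prefix ex: <http://www.example.org/sample/> .

ex:doc a StringDocuverse ; hasContent "bats" .
ex:r-0-4 a PointerRange ; refersTo ex:doc ;
    begins "0"^^xsd:integer ; ends "4"^^xsd:integer .   # "bats", in document order
ex:r-4-0 a PointerRange ; refersTo ex:doc ;
    begins "4"^^xsd:integer ; ends "0"^^xsd:integer .   # "stab", reversed
ex:word a Element ; hasGeneralIdentifier "span" ;
    c:firstItem [ c:itemContent ex:r-0-4 ] .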


Fig. 6 – Relationships between the main EARMARK Components


3.2.3 Handling Overlapping Markup with EARMARK

EARMARK relies on a sub-ontology, the EARMARK Overlapping Ontology [Per11], to model overlapping scenarios in EARMARK documents. Different types of overlap exist, depending on the subset of items involved, so different approaches are needed to correctly detect the problem and deal with it. There is an especially clear distinction between overlapping ranges and overlapping markup items. [DPV11a] Overlapping ranges are, by definition, two ranges that refer to the same docuverse and such that at least one of the locations of one range is contained within the interval of the other. There can be a total overlap, where both locations of a range are contained within another range, or just a partial overlap. Let's make an example to clarify this problem: suppose we have ranges A, B and C, with A beginning at "0" and ending at "10", B beginning at "5" and ending at "14", and C beginning at "2" and ending at "8". Then A and B are partially overlapping, while C is totally overlapped by both. There is also the case of overlapping markup items, which can happen in one of the following three situations (a short Turtle sketch follows the list). Let us consider two markup elements, X and Y:

- Overlap by range: X contains a range rX that overlaps with another range rY contained by Y.
- Overlap by content hierarchy: both X and Y contain the same range R.
- Overlap by markup hierarchy: both X and Y contain the same markup item Z.
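The following minimal Turtle sketch encodes the ranges A, B and C discussed above, plus an overlap by content hierarchy between two elements sharing range C; the ex: namespace, the docuverse and the element identifiers are invented for the example:

@prefix ex: <http://www.example.org/sample/> .

ex:rA a PointerRange ; refersTo ex:doc ; begins "0"^^xsd:integer ; ends "10"^^xsd:integer .
ex:rB a PointerRange ; refersTo ex:doc ; begins "5"^^xsd:integer ; ends "14"^^xsd:integer .  # partially overlaps A
ex:rC a PointerRange ; refersTo ex:doc ; begins "2"^^xsd:integer ; ends "8"^^xsd:integer .   # totally overlapped by A and B
# Overlap by content hierarchy: X and Y contain the same range C
ex:x a Element ; hasGeneralIdentifier "x" ; c:firstItem [ c:itemContent ex:rC ] .
ex:y a Element ; hasGeneralIdentifier "y" ; c:firstItem [ c:itemContent ex:rC ] .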


Fig. 7 - Examples of Overlapping Markup

Let us conclude this quick overview by giving a brief example of EARMARK in action, together with some overlapping markup, inspired by [DPV11b]. Let us consider a stanza from the Inferno of Dante's Divine Comedy: "E 'l duca lui: 'Caron, non ti crucciare: vuolsi così colà dove si puote ciò che si vuole, e più non dimandare'"13.

Now, if we wanted to model the stanza specifying both the structure of the verses and the dialogue, we would have an overlap, since the dialogue starts in the middle of the first verse and ends with the last one. Indeed, a naïve and INCORRECT XML interpretation could go as follows:

<stanza>
  <verse>E 'l duca lui: <dialogue>"Caron, non ti crucciare:</verse>
  <verse>vuolsi così colà dove si puote</verse>
  <verse>ciò che si vuole, e più non dimandare".</dialogue></verse>
</stanza>

13 D. Alighieri (1304?-1307?); La Divina Commedia, Inferno, Canto Terzo.

As stated before, this is incorrect and invalid XML, but it serves to exemplify the idea of what we would like to do. Fortunately, here EARMARK comes to the rescue. In Turtle notation, the following EARMARK snippet represents the concepts we tried to apply above:

@prefix inf: <http://www.example.org/inferno/> .   # the prefix URI was lost in extraction; an example namespace is assumed

inf:doc hasContent "E 'l duca lui: 'Caron, non ti crucciare: vuolsi così colà dove si puote ciò che si vuole, e più non dimandare'." .

inf:r-0-42 a PointerRange ; refersTo inf:doc ;
    begins "0"^^xsd:integer ; ends "42"^^xsd:integer .
inf:r-42-73 a PointerRange ; refersTo inf:doc ;
    begins "42"^^xsd:integer ; ends "73"^^xsd:integer .
inf:r-73-111 a PointerRange ; refersTo inf:doc ;
    begins "73"^^xsd:integer ; ends "111"^^xsd:integer .
inf:r-16-111 a PointerRange ; refersTo inf:doc ;
    begins "16"^^xsd:integer ; ends "111"^^xsd:integer .

inf:stanza a Element ; hasGeneralIdentifier "stanza" ;
    c:firstItem [ c:itemContent inf:verse1 ;
      c:nextItem [ c:itemContent inf:dialogue ;
        c:nextItem [ c:itemContent inf:verse2 ;
          c:nextItem [ c:itemContent inf:verse3 ] ] ] ] .

inf:verse1 a Element ; hasGeneralIdentifier "verse" ;
    c:firstItem [ c:itemContent inf:r-0-42 ] .
inf:verse2 a Element ; hasGeneralIdentifier "verse" ;
    c:firstItem [ c:itemContent inf:r-42-73 ] .
inf:verse3 a Element ; hasGeneralIdentifier "verse" ;
    c:firstItem [ c:itemContent inf:r-73-111 ] .
inf:dialogue a Element ; hasGeneralIdentifier "dialogue" ;
    c:firstItem [ c:itemContent inf:r-16-111 ] .   # the dialogue spans characters 16 to 111


3.3 Linguistic Acts Ontology

In order to give users a way to correctly interpret markup semantics, this project will also make use of the Linguistic Acts ontology [Gan07]. It is the result of an integration between LMM [PGG08] and EARMARK, as introduced by [PGV11], and its purpose is to act as a means to express clear semantics about meta-markup. I will mostly be using its "expresses" property. The main inspiration behind the Linguistic Acts is the consideration that, while the syntax of XML-based languages is machine-readable, their semantics is not explicitly defined, and so it is meaningless for machines. The authors of [PGV11] consequently resolved to use Semantic Web technologies to fill the gap between the well-defined syntax and the informal specification of its semantics, by integrating LMM, an OWL vocabulary representing some basic semiotic notions, with EARMARK, which we have already presented a few pages before. The origin of the problem is that the growth in the importance of markup as a way to provide metadata (resource descriptions and relationships) led the Semantic Web effort to concentrate mostly on dealing with semantic markup (e.g. the resource r has the string s as title) while skirting around the issue of markup semantics (e.g. what is the meaning of a markup element p contained in resource r?). We also have to consider that avoiding the imposition of any specific semantics alongside the syntax is among the design aims of markup meta-languages. Take for example XML: it expresses simple syntactic labels on the text, leaving the semantics of the markup to the interpretation of humans or of appropriately instructed tools, because it is deliberately designed to do so. However, it would be extremely important to have a mechanism to define machine-readable semantics for markup languages, for many reasons: parsers could perform semantic validation on the document markup in addition to simple syntactic validation, reasoners could infer new assertions from documents, documents could be queried over their markup semantics, and so on. Notably, being able to associate machine-readable semantics with the markup of a document is also a fundamental activity for semantic publishing, and very important in the context of explicitly defining the structure of a semantically enriched paper. So, by "using EARMARK with LMM, it becomes possible to express and assess facts, constraints and rules about the markup structure as well as about the inherent semantics of markup elements themselves, and about the semantics of the content of the document" [PGV11].


We have already discussed the advantages of EARMARK and observed how it makes it feasible to express markup semantics quite simply and in a straightforward way. But to associate coherent semantics with markup items it is advisable to follow precise and theoretically founded principles of semiotics, making their applications interoperable. As a solution, [PGV11] proposes to adopt the Linguistic Act ontology design pattern, based on LMM, as a means to provide a semiotic-cognitive representation of linguistic knowledge.

Fig. 8 - The Architecture of the Linguistic Acts Ontology

The main idea behind it is to be able to handle the representation of knowledge from different sources according to different theories, placing each of them in the context of the semiotic triangle and of some related semiotic notions. These are as follows:

- References are any individual or set of individuals, or any fact from the world being described.
- Meanings are any object explaining something or being intended by something, such as definitions, topic descriptions, concepts, etc.


- Information Entities are any symbols that have a meaning or denote one or more References.
- Linguistic Acts are any communicative situation including Information Entities, Agents, Meanings, References and a possible spatio-temporal context.

Given these premises, Markup Items in EARMARK are specific kinds of Expressions expressing a particular Meaning, assigned by the author of a schema, used to denote local objects or social entities. Focusing more on the aims of this dissertation, the "expresses" property is used to identify a relation between an Expression and a Meaning. The intuition behind 'meaning' is intended to be very broad, as there are many different approaches to meaning characterization and modeling:

For example, let us consider the word "beehive" – in all of the following cases, some aspect of meaning is involved [PGG08]:

- Beehive means "a structure in which bees are kept, typically in the form of a dome or box." (Oxford dictionary)
- 'Beehive' is a synonym in noun synset 09218159 "beehive|hive" (WordNet)
- 'The term Beehive can be interpreted as the fact of being a beehive, i.e. a relation that holds for concepts such as Bee, Honey, Hosting, etc.'
- 'The text of the Italian apiculture regulation expresses a rule by which beehives should be kept at least one kilometer away from inhabited areas'
- 'The term Beehive expresses the concept Beehive'
- ''Beehive' for apiculturists does not express the same meaning as for, say, fishermen'
- 'Your meaning of 'Beautiful' does not seem to fit mine'
- ''Beehive' is formally interpreted as the set of all beehives'
- 'From the term 'Beehive', we can build a vector space of statistically significant co-occurring terms in the documents that contain it'

As the examples suggest, the "meaning of meaning" depends on the background approach or theory that one assumes. One can hardly summarize the many approaches and theories of meaning, and this relation is therefore perhaps the most controversial and difficult to explain. However, the usefulness of having a 'semantic abstraction' in modeling information objects is so high (e.g. for the Semantic Web, interoperability, reengineering, etc.) that [PGV11] chose to tackle this challenging task. The paper also anticipates some of the possible solutions for explicitly specifying the semantics of markup elements, which we will explore further in the sections about the Pattern Ontology (4.5) and about the Document Structure Lens (4.3.4, 5.5).
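As a minimal sketch of the "expresses" property in action, the following snippet states that a markup item expresses the meaning of being a paragraph; the ex: namespace, the la: prefix for the Linguistic Acts ontology and the meaning individual ex:Paragraph are assumed for the example (concrete, grounded uses appear in sections 4.3.4 and 4.3.5):

ex:p a Element ; hasGeneralIdentifier "p" ;
    la:expresses ex:Paragraph .   # the markup item expresses a Meaning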


3.4 Spar Area Ontologies

In recent years, a cohesive effort has been made to develop, merge and rationalize a set of ontologies allowing all aspects of semantic publishing to be covered with metadata. This effort resulted in the development of SPAR – the Semantic Publishing and Referencing Ontologies suite [Sho10a]. This is an integrated ecosystem of independent, reusable, orthogonal and complementary ontology modules, usable for creating comprehensive machine-readable RDF metadata on semantic publishing and referencing: they can be used either individually or in conjunction with each other, according to the user's needs. There are 8 main component ontologies in SPAR, each organized and named following the flower diagram shown below. Each is encoded in OWL 2 DL [W3C09], and together they provide the ability to describe a bibliographic entity in all of its aspects, from its inception to its citations, from bibliographic records to its component parts: the aim is to cover all aspects of the scholarly publication process and to enhance its semantics. As such, these ontologies represent an invaluable asset for the development and the use of Semantic Lenses.

Fig. 9 - Summary of SPAR Ontologies


All eight SPAR ontologies – FaBiO, CiTO, PRO, BiRO, PSO, C4O, PWO and DoCO – are available for inspection, comment and use. As we will further detail in section 4, several of these ontologies are the ideal choice for specific Semantic Lenses, such as FaBiO and BiRO for the Publication Context Lens, DoCO for the Rhetoric Lens and CiTO for the Citation Lens. The modules extensively used in this work will be discussed in more depth further in the text, but for the sake of completeness we are going to give a cursory overview of all the SPAR modules in this section, explaining their usefulness, their composition and their scope.

Some of the modules of the SPAR suite expand and re-use, where appropriate, other popular ontologies and classification models, such as FOAF (Friend of a Friend) to describe individuals, or the FRBR (Functional Requirements for Bibliographic Records) classification model. Since its inception in 2009 (detailed in [Sho10b]), CiTO has also been the subject of an important harmonization process with the SWAN (Semantic Web Applications in Neuromedicine) Scientific Discourse Module and SWAN Collections Module, resulting in the current integration [CSP12]. The architecture of the SPAR suite of ontologies is quite straightforward, and easy to summarize in a scheme:

Fig. 10 - The Architecture of SPAR

A very simple summary of the roles and peculiarities of each ontology follows. Some modules, like the Structural Patterns ontology, DEO, DoCO and CiTO, will be described in more depth further in the text:


FaBiO, the FRBR-aligned Bibliographic Ontology

URL: http://purl.org/spar/fabio SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/FaBiO

FaBiO, the FRBR-aligned Bibliographic Ontology, is an ontology for recording and publishing on the Semantic Web descriptions of entities that are published or potentially publishable, that contain or are referred to by bibliographic references, or that are used to define such bibliographic references. It extends FRBR with formal object properties describing relations across the FRBR objects of the bibliographic universe, namely Works, Expressions, Manifestations and Items. FaBiO entities are designed to provide an extensive set of publication types, aiming to cover primarily textual publications such as books, magazines, newspapers and journals, and items of their content such as poems and journal articles. However, they also include other types, such as datasets, computer algorithms, experimental protocols, formal specifications and vocabularies, legal records, governmental papers, technical and commercial reports and similar publications, as well as bibliographies, reference lists, library catalogues and similar collections. FaBiO imports the FRBR Core ontology and extends the FRBR data model with new properties linking Works and Manifestations (fabio:hasManifestation and fabio:isManifestationOf), Works and Items (fabio:hasPortrayal and fabio:isPortrayedBy), and Expressions and Items (fabio:hasRepresentation and fabio:isRepresentedBy). Its properties and its structure make FaBiO one of the best tools to represent the Publication Context Lens.
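A minimal sketch of these shortcut properties over hypothetical entities (the entity names are invented; the classes and properties are those named above):

:ontologies-are-us a frbr:Work ;
    fabio:hasManifestation :printed-article ;   # Work to Manifestation
    fabio:hasPortrayal :library-copy .          # Work to Item
:version-of-record a frbr:Expression ;
    frbr:realisationOf :ontologies-are-us ;
    fabio:hasRepresentation :library-copy .     # Expression to Item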

CiTO, the Citation Typing Ontology

URL: http://purl.org/spar/cito SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/CiTO

The Citation Typing Ontology (CiTO) is an ontology whose purpose is to enable the characterization of the nature or type of citations, both factually and rhetorically. It allows much more than simply asserting in RDF that citations exist: it also encourages defining the factual or rhetorical nature of a citation, and the reasons behind it. This ontology contains the object property cito:cites and its sub-properties, like cito:updates or cito:obtainsBackgroundFrom, which are used to better

characterize the semantics of a citation in a bibliographic entity. It also contains the inverse property cito:isCitedBy, from the original Citation Typing Ontology (CiTO v1.6), and all the sub-properties are present in their inverted form as well. It is a fundamental ontology in the application of the Citation Lens, and it will be described in more detail in section 4.8.
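A minimal sketch, over hypothetical papers, of the kind of typed citation statements CiTO enables (only the properties named above are used):

:citing-paper cito:cites :cited-paper ;
    cito:obtainsBackgroundFrom :cited-paper ;   # why the citation was made
    cito:updates :older-paper .
:older-paper cito:isCitedBy :citing-paper .     # the inverse form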

BiRO, the Bibliographic Reference Ontology

URL: http://purl.org/spar/biro SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/BiRO

BiRO, the Bibliographic Reference Ontology, is an ontology structured according to the FRBR model to define bibliographic records (as subclasses of frbr:Work) and bibliographic references (as subclasses of frbr:Expression), and their compilation into bibliographic collections and bibliographic lists, respectively. It imports both the FRBR Core Ontology and the SWAN Collections Ontology (to allow for the description of ordered lists), and it provides a logical system for relating an individual bibliographic reference, as it appears in the reference list of a published article (which may lack the title of the cited article, the full names of the listed authors, or indeed the full list of authors):

1. to the full bibliographic record for that cited article, which in addition to the missing reference fields may also include the name of the publisher, and the ISSN or ISBN of the publication;
2. to collections of bibliographic records, such as library catalogues; and
3. to bibliographic lists, such as reference lists.

It is designed to be a necessary part of a complete bibliographic reference system, and it can be used in conjunction with FaBiO to apply the Publication Context Lens to an entity.
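A minimal sketch of this layered structure over hypothetical resources (biro:ReferenceList, biro:references and co:element also appear in the Publication Context Lens excerpt in section 4.3.3; the class names follow the description above):

:ref-3 a biro:BibliographicReference ;    # the reference as printed in the article
    biro:references :record-3 .
:record-3 a biro:BibliographicRecord .    # the full record, e.g. in a library catalogue
:reference-list a biro:ReferenceList ;
    co:element :ref-3 .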

C4O, the Citation Counting and Context Characterization Ontology

URL: http://purl.org/spar/c4o SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/C4O

C4O, the Citation Counting and Context Characterization Ontology, allows the characterization of bibliographic citations in terms of their number and their context.


It imports and extends BiRO (thus indirectly importing FRBR Core and the SWAN Collections Ontology), and it aims to provide the ontological structures required to record the number of in-text citations of a cited source (i.e. the number of in-text reference pointers to a single reference in the citing article's reference list), as well as the number of citations a cited entity has received globally, as determined by a bibliographic information resource such as Google Scholar, Scopus or Web of Knowledge on a particular date. Moreover, it enables ontological descriptions of the context within the citing document in which an in-text reference pointer appears, and permits that context to be related to relevant textual passages in the cited document.

DoCO, the Document Components Ontology

URL: http://purl.org/spar/doco SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/DoCO

DoCO, the Document Components Ontology, is designed to provide a structured vocabulary, written in OWL 2 DL, of document components, both structural (e.g. block, inline, paragraph, section, chapter) and rhetorical (e.g. introduction, discussion, acknowledgements, reference list, figure, appendix), the latter defined by the imported Discourse Elements Ontology (DEO). As such, it allows the description in RDF of these components and of the documents composed of them. Given their important role in the application of the Rhetoric Lens, both DoCO and DEO will be discussed in further detail in sections 4.6 and 4.7.

PRO, the Publishing Roles Ontology

URL: http://purl.org/spar/pro SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/PRO

PRO, the Publishing Roles Ontology, is an ontology written in OWL 2 DL for the characterization of the roles of agents in the publication process, whether they are people, corporate bodies or computational agents. It allows one to specify how an agent holds a role relating to a bibliographic entity, and it permits the recording of time/date information about the period during which that role is held.


Because it is based on the Time-indexed Situation ontology pattern, it is easy to extend the set of specified roles, simply by adding new individuals to the class pro:Role.
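A minimal sketch of such an extension and of a time-indexed role assignment (the individuals are hypothetical; pro:holdsRoleInTime, pro:withRole and pro:relatesToDocument also appear in the Contributions and Roles Lens excerpt in section 4.3.2):

:proof-reader a pro:Role .               # a newly added role individual
:jane-smith pro:holdsRoleInTime [
    pro:withRole :proof-reader ;
    pro:relatesToDocument :some-paper ] .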

PSO, the Publishing Status Ontology

URL: http://purl.org/spar/pso SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/PSO

PSO, the Publishing Status Ontology, is an ontology written in OWL 2 DL for characterizing the publication status of a document or of another bibliographic entity at each of the various stages of the publishing process (e.g. draft, submitted, under review, rejected, accepted for publication, proof, published, Version of Record, catalogued, archived). Because it is based on the Time-indexed Situation ontology pattern, it is easy to extend the set of specified statuses, simply by adding new individuals to the class pso:Status.
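A minimal sketch of such an extension, with a hypothetical new status individual:

:camera-ready a pso:Status .   # a newly added publication status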

PWO, the Publishing Workflow Ontology

URL: http://purl.org/spar/pwo SVN: http://sempublishing.svn.sourceforge.net/viewvc/sempublishing/PWO

PWO, the Publishing Workflow Ontology, is an ontology written in OWL 2 DL for describing the steps of the workflow associated with the publication of a document or of another publication entity (e.g. being written, under review, XML capture, page design, publication on the Web). It is based on the Time-indexed Situation pattern to describe workflow steps and on the Sequence pattern to define their order.


3.5 Toulmin Argument Model

Before concluding this section and moving on to the discussion of Semantic Lenses and the methodology in sections 4 and 5, another important introduction is in order. We have already mentioned the opportunity and the importance of modeling with appropriate markup the argumentative structure of a scientific document and of its contents, as one of the significant aspects worth highlighting within semantically enhanced papers. We are going to delve deeper into this facet while exploring the Argumentation Lens in section 4.3.7 and the ontology related to it, AMO [VP11], in section 4.9, but it is important to introduce here the basis of the argument model we are going to use. Among the many possible argument model descriptions, Stephen Toulmin's model, detailed in [Tou59], is one of the seminal works in the field; in it he addresses several questions of Argumentation Theory and develops a structural model of "practical arguments" by which rhetorical arguments can be analyzed, focusing on the justificatory scope of argumentation. He observed that effective, well-formed and realistic arguments typically consist of six interlinked, explicitly denoted components. Believing that "logic is generalized jurisprudence" [Tou59], and criticizing the over-simplification that classical syllogisms and similar models imply, Toulmin observes that "Many of the current problems in the logical tradition spring from adopting the analytic paradigm-argument as a standard by comparison with which all other arguments can be criticized. But analyticity is one thing, formal validity is another, and neither of these is an universal criterion of necessity, still less of the soundness of arguments". He overturns the classic inferential model of theoretical arguments, arguing that reasoning is a process of testing and improving already existing ideas, an act which requires that practical arguments first declare a claim of interest and then provide a justification for it: "There must be an initial stage at which the charge or claim is clearly stated, a subsequent phase in which evidence is set out or testimony given in support of the charge or the claim, leading on to the final stage at which a verdict is given, and the sentence or other judicial act issuing from the verdict is pronounced".


Toulmin starts from a simple three-element model organized as follows. When structuring an argument, we make an assertion, which is our claim. Then we are challenged to identify the justification behind our claim, and finally we have to explain how we get from said justification to our claim (in a sense, we have to justify how we step from evidence to claim). At this point, the model corresponds to the simple structure CLAIM – DATA – WARRANT.

Fig. 11 - The core components of Toulmin's Model

However, this is only the most basic type of reasoning admitted as a valid argumentation by Toulmin. Toulmin's full model comprises all the following elements:

- Evidence (or Data): The facts or the data used as grounds to prove the argument. It is important that the grounds themselves are not challenged or, if they are, that they are at least the resulting claim of another properly built practical argument.
- Claim: The assertion or thesis being argued and put forward.
- Warrant: The general, hypothetical (and quite often implicit or very concise) statement used as the logical connector between the Evidence and the Claim. The Warrants are the crucial link between evidence and claim, and as such an argument is only as strong as its weakest warrant.
- Qualifier: Statements that limit the strength of the argument or that specify under which conditions the argument holds true. They can be reservations, modal qualifiers, probability statements or assertions of significance.
- Rebuttal: Counter-arguments indicating situations where the general argumentation is not considered true. They are anticipated and expected exceptions to the Claim.
- Backing: Statements which provide additional support to the warrant, like other arguments proving that the warrants are true.


Fig. 12 – The overall Architecture of Toulmin’s Argument Model

It is very important to observe that Toulmin's model does not aim to provide any kind of judgment on the truthfulness of a claim, or on the correctness of the contents of an argument's components. It is not used to determine whether an argument corresponds to the truth, whatever that might be in the case at hand, but to validate whether an argument is well structured, and thus whether it could POSSIBLY be true or, on the contrary, does not even have the chance to stand on its own feet. It is also important to remember that an argument correctly written according to the model reveals both its limits and its strengths, as it should: no argument should strive to apply further than it is meant to. This is because, in step with the jurisprudence analogy, arguments are not simply expressed as absolutes, but rather expressed in a way that lets the reader know how far to take the reasoning, and under which conditions it should apply. Finally, Toulmin's Argument Model closely resembles a fractal, as all components, usually with the exception of claims, can be (and often are) the results of other arguments, and so on. It also fully supports circular argumentations.


Here's a parting example from [Mik07], right from the start of its 4th section, with the pieces colored and identified according to their roles.

[Qualifier] In absence of a golden standard, evaluating the results of ontology learning or ontology mapping is a difficult task: [/Qualifier] [Claim] inevitably, it requires consulting the community or communities whose conceptualizations are being learned or mapped. [/Claim] [Evidence] In order to evaluate our results, we have thus approached in email 61 researchers active in the Semantic Web domain, [/Evidence] [Qualifier] most of whom are members of the ISWC community and many of them are in the graph-theoretical core of the community.7 [/Qualifier] [Evidence] The single question we asked was "In terms of the associations between the concepts, which ontology of Semantic Web related concepts do you consider more accurate?" [/Evidence] [Rebuttal] Lacking a yardstick, there is no principled correct answer to this question that we expected to receive. [/Rebuttal] [Warrant] Instead, we were interested to find out if there is a majority opinion emerging as an answer and, if yes, which of the two ontologies (produced by the two different methods) that majority would accept as more accurate. [/Warrant]


4 The Semantic Lenses Model

4.1 Introduction to Semantic Lenses

As we already discussed in sections 1 and 2, Semantic Publishing is the use of Web technologies (especially those related to the Semantic Web) to enrich a published scientific document, such as a scholarly journal article, aiming to enable several important features: the ability to define a formal representation of the meaning of the paper and of its content; the linking of the paper to other semantically related content (which could be discovered at runtime); the facilitation of the automatic discovery of both the paper and its metadata within the Linked Open Data initiative; and the provision of actionable and interactive data, with data fusion between different sources. We have also discussed the importance of and interest in Semantic Publishing within the scientific publishing domain, as exemplified by initiatives like the Elsevier Grand Challenge or SePublica. As already illustrated, the enhancement of a traditional scientific publication with RDF annotations, serving either as semantic markup or to convey the meaning of markup semantics, is not a simple, straightforward operation, as it is not limited to making specific statements about its content or about named entities within the text. The scope of Semantic Lenses is to give the user the possibility to choose among a set of different views, each allowing them to focus on a specific aspect of the document, enhancing their understanding of the subject and facilitating the emergence of meaning. Of course, for the user to be able to do so, Semantic Lenses first have to be properly and methodically applied as metadata to the scientific document in question, associating the markup of each semantic lens with the proper parts of the article. I have already informally introduced some of the several aspects that can characterize a paper besides its mere textual content, such as its rhetorical or argumentative structure, or the use and context of the citations and data included therein. I am going to review them a little more at length, in a general context and in natural language, in section 4.2, before introducing and discussing their formalization as part of the Semantic Lenses stack in section 4.3, where I will detail each of the 8 Semantic Lenses defined, both in scope and in the technologies suggested for their implementation, each supplemented by short examples.


After that, considering that my final aim in this section of the dissertation is to illustrate the methodology I adopted for annotating 4 specific Semantic Lenses (Structure, Rhetoric, Citation, Argumentation) on [Mik07], I will proceed to introduce in more detail the specific ontologies used to do so; they will be reviewed from section 4.5 to section 4.9.


4.2 Facets of a Document – Different outlooks, aspects and layers.

In section 2.1, I briefly introduced how many different discernible and relevant aspects coexist within a scientific paper such as a scholarly journal article, and observed how they all contribute to the final interpretation and understanding of its meaning, the one that is created within the mind of the reader, almost like all the complex and intricate gears and cogs within a clockwork device have to interact for it to function correctly. However, these aspects are far more than simple components, meaningless in themselves without the others. In practice, while the mental image that is the end result of our comprehension of written communication depends on all these aspects and on their meaning within the plain text itself, these facets do not lose importance when considered alone with just the main textual content, unlike the gears of the previous metaphor. A more fitting similitude would then be one to atlases and geographical maps. Consider the map of a continent: besides the basic contours of the lay of the land, there is so much information that could be of interest in describing the area – the political layout, geographic information about altitude or type of terrain, satellite views, average climate, temperature and weather patterns, economic indicators and the most important imports and exports, administrative organizations, main transportation routes, etc. – and yet there is only so much we can put on a single map before it gets too cluttered to be understood, becoming meaningless. Often we will have to settle for a mixed map highlighting the data deemed more important or searched for more often. But, if we want, all those other specialized maps are there ready to be consulted – traditionally they were available as separate entities (much like literary criticism on a text was a separate document), but now there are plenty of tools to show them as interactive layers, ready to be applied or removed at the user's convenience (for a notable example, see the US National Atlas Mapmaker14). The idea behind Semantic Lenses (and semantic publishing in general) is to enable, in the future, users browsing and researching scientific knowledge stored in enriched scholarly papers to do likewise, with all the aspects and layers that can be part of a scientific publication instead of geographical maps. That being clearly asserted, I will now proceed to list the most relevant of these possible meaningful context layers. Of course, some might be considered

important more frequently than others (it all depends on what kind of information you are after), and some might be a characterization of the whole document (like data on the publishing process and status of a work), as opposed to metadata about specific parts of the document (like the rhetorical attributes of a certain block of text). These aspects thus include:

- The context behind the origin of a publication, including the motivation, the general field (and possible related keywords), the background, the source of research grants, the sponsors, and the institutions involved in the research.
- The people involved in authoring, editing and publishing a document, and in general information about the contributors, their roles, their affiliation or background, and detailed information about which specific contributions they made to a paper, or which parts were authored or reviewed by each person.
- The status of the publication, and data on the publishing process of a paper, including information about its inclusion in journals, conference proceedings, books, annuals and so forth.
- The structure of a paper and its organization into specific components (from chapters to paragraphs, from tables to inline components).
- The already mentioned rhetorical denotation of the paper components, and the overall rhetorical organization of discourse within a document.
- The citations and the quotes within a paper: their role, scope and purpose within the citing work, their characteristics, the denotation of the section of the document in which they are cited, their relationship to it, or the relationships between the authors.
- The argumentation model, as we have seen in section 3.5, including the claims or theses made by the authors, the data and warrants associated with them, the conditions under which these assertions hold, and which sub-arguments are called in support of (or in rebuttal to) which part of the text.
- The semantics of the text itself, phrase by phrase, entity by entity, assertion by assertion.

14 US National Atlas Mapmaker tool, available at: http://nationalatlas.gov/mapmaker

As I will illustrate in the following section, a corresponding Semantic Lens has been defined for each of these informally defined facets, so that, when all the lenses are taken together, it is possible to have a complete and semantically sound description of a scientific publication.


4.3 Semantic Lenses in Detail

As already stated, the semantics of a written document, especially a scientific one, can be defined by applying appropriate metadata markup for different perspectives on the document itself or its content. Each of these different views can be considered an independent Semantic Lens applied to the document, which can then be brought into the spotlight by the reader of a paper, with the act of focusing the lens highlighting the chosen facet. In the previous section I listed and informally described eight different aspects for the complete characterization of a scientific paper. Here is a list where each of these eight aspects is associated with one of the Semantic Lenses formally proposed in [PSV12a] and [PVZ12].

1. Research Context Lens: This lens covers the background from which the publication originated, including the nature and the field of the research described in the paper, the motivations, the sources of funding and possible sponsors, the nature and the details of the grants, the administrative process behind them, the institutions involved in the research, and so on.

2. Contributions and Roles Lens: This lens provides information about the individuals involved in the authorship of the semantically enhanced paper, delivering metadata on the people who had any particular authorship role in the publication and on what their contributions to the Work were.

3. Publication Context Lens: This is the lens including all the data about the publication status of the document, and all information related to the event, the publication or the journal with which the paper is associated and in which it has appeared (or is expected to appear). It is also the correct place to provide links associating the document with the other papers sharing the same publication context, e.g. listing the references of other papers published within the same volume or presented at the same conference.

4. Document Structure Lens: Unlike the previous three, this is the first lens that involves a description not just related to the document as a whole, but to its specific contents. In particular, the Structure Lens aims to describe the paper's structural markup semantics for most of its components (e.g. by linking specific markup elements to the role of blocks, inline elements, containers, tables, etc.) and to hold information about the way its components are arranged, presented and organized.


5. Rhetoric Organization Lens: This lens contributes metadata about the identification and the organization of the rhetorical components of the document, storing information about both the rhetorical discourse and the rhetorical structure of the document. Thus it can assist the reader both by identifying the rhetoric hierarchy of a certain component (e.g. this markup item is a paragraph, this other one denotes a section, this one is a title, and so on) and its role within the overall discourse, e.g. by branding the component as an Introduction, a Quotation, some Data, a Discussion, etc.

6. Citation Network Lens: As already anticipated, this is one of the lenses most relevant to scientific research in the perspective of both inter-document and intra-document interactions. This lens provides all the metadata related to the citations that are part of the document, citation by citation. Each can be associated with information about its purpose and linked to its target document (which could ideally be another enriched paper, or at least be reachable within the LOD); in short, it allows the annotation of semantics relevant to the reasons behind every individual reference within a paper, potentially allowing the construction of a citation network (both within the paper and at a higher, inter-document level).

7. Argumentation Lens: This lens stores the argumentation model of the semantically enriched scientific paper. It allows the definition and markup of argumentations within the text, denoting their inner structure and identifying specific components, such as claims, data, warrants, and so on, according to Toulmin's Argument Model introduced in section 3.5. Markup items, structural or rhetoric components, or specific pieces of text are consequently assigned a role (if relevant) within this model of the argumentative structure of the paper.

8. Textual Semantics Lens: Finally, with this lens we reach the most content-specific layer of the Semantic Lenses model. The final goal of a (scientific) paper is to express findings or concepts that have a specific (scientific) and precise value. This lens serves to highlight the actual meaning of a piece of text itself, entity by entity and statement by statement. While communication of meaning is often designed for human recipients, this lens aims to apply definitions and statements to the text itself as a way to express semantic markup about its content. What can be done is strongly dependent on the actual content and topic of the paper, but some activities surely belong to this level, e.g. linking named entities to appropriate domain-specific ontologies.


Fig. 13 - The Semantic Lenses Stack

I am now going to go over each lens in more depth and detail, explaining the best technologies (mostly in the form of ontologies) suggested for their concrete application, and providing some appropriate examples over [Mik07] for each one, inspired by [PVZ12].


4.3.1 Research Context Lens

Writing a scientific paper is usually the final stage of a long and complex collaborative process, consisting of several research activities, ranging from experimentation to data gathering, from background research to analysis. These activities usually involve many people, organizations and institutions, and they also require appropriate funding to be completed successfully. Describing all the parties involved is the task of the Research Context Lens. While several other ontologies are available, like VIVO and DOAP, to describe the contextual environment that made the writing of an enriched paper possible it is suggested to use FRAPO, the Funding, Research Administration and Projects Ontology, part of the SPAR suite of ontologies [Sho10a] detailed in section 3.4. The following sample excerpt, targeted at [Mik07], specifies the VU University Amsterdam as a University (line 2) that awarded a Ph.D. scholarship in 2004 (line 4) to fund the investigation that led to the aforementioned paper (line 10).

1. :research-context {
2.   :vua a frapo:University ;
3.     foaf:name "VU University Amsterdam" ;
4.     frapo:awards [
5.       a frapo:Grant ;
6.       rdfs:label "Ph.D. Scholarship 2004" ;
7.       frapo:funds :investigation ] .
8.   :investigation a frapo:Investigation ;
9.     # Mika's paper
10.    frapo:hasOutput :ontologies-are-us }

4.3.2 Contributions and Roles Lens

There are several roles that people can have within the research projects that form the context from which a paper originates, just as there are a variety of roles and several levels of contribution in the authorship of a scientific document. The Contributions and Roles Lens deals with the individuals claiming authorship of the paper and with the specific contributions each made. This aspect of the semantic description is provided by SCoRO (the Scholarly Contributions and Roles Ontology) and its imported ontology PRO (the Publishing Roles Ontology) [PSV12b], both introduced in section 3.4, which can be used to identify roles (e.g. being affiliated with VU University Amsterdam during the realization of the paper – lines 6 to 10) and contributions within the context of a paper (e.g. in this case Peter Mika was the only person writing the paper [Mik07] – lines 11 to 16).


1. :c-and-r a lens:ContributionsAndRolesLens .
2. :c-and-r {
3.   :mika a foaf:Person ;
4.     foaf:name "Peter Mika" ;
5.     pro:holdsRoleInTime
6.       [
7.         a scoro:OrganizationalRole ;
8.         pro:withRole scoro:affiliate ;
9.         pro:relatesToOrganization :vua ;
10.        pro:relatesToDocument :ontologies-are-us ] ;
11.    scoro:makesContribution [
12.      a scoro:ContributionSituation ;
13.      scoro:withContribution scoro:writes-paper ;
14.      scoro:withContributionEffort
15.        scoro:solo-effort ;
16.      scoro:relatesToEntity :ontologies-are-us ] }

4.3.3 The Publication Context Lens

This third context-specific lens documents the context in which a scientific document is written and published; its importance lies especially in explaining how the paper is grouped with other documents and publications. It allows one, for example, to know which book, journal or annual contains the article, or with which conference or workshop it is associated, enabling connections to other related scientific works sharing the same context. As such, it is a lens very much aimed at inter-document applications. There are two widely used ontologies for the description of bibliographic entities, the already mentioned BIBO and FRBR, but their models do not really meet the requirements of this lens. However, as already mentioned, the fact that FRBR is not tied to a specific metadata schema or implementation turns to our advantage: two ontologies were developed within SPAR (see again section 3.4 for more details) for the purpose of asserting metadata on the publication context. It is thus possible to describe it using FaBiO, the FRBR-aligned Bibliographic Ontology [PS12], and BiRO, the Bibliographic Reference Ontology, specifying the journal in which the paper was published (lines 4 to 12) and the list of its references to other related documents (lines 13 to the end):

1. :publication-context a lens:PublicationContextLens .
2. :publication-context {
3.   # The textual realization of the paper
4.   :version-of-record a fabio:JournalArticle ;


5.     frbr:realisationOf :ontologies-are-us ;
6.     frbr:partOf [ a fabio:JournalIssue ;
7.       prism:issueIdentifier "1" ;
8.       frbr:partOf [ a fabio:JournalVolume ;
9.         prism:volume "5" ;
10.        frbr:partOf [ a fabio:Journal ;
11.          dcterms:title "Web Semantics: Science, Services
12.            and Agents on the World Wide Web" ] ] ] ;
13.    frbr:part [ a biro:ReferenceList ;
14.      co:element [ biro:references
15.        … ] ,
16.      … ] }

4.3.4 The Document Structure Lens

As I had already anticipated, this is the first lens focusing more on the content of the document rather than on its context or on the document as a whole. This lens aims to provide basic information about the markup semantics of the structure of the document, denoting the role of an element within the structural organization of the paper. Usually, the structure of a textual (scientific) document is expressed through markup languages such as XML (XHTML, DocBook…) or LaTeX, which have plenty of constructs available to describe a tree-like hierarchy of the content structure. But within the Semantic Web domain, it is preferable to have the document represented as an ontology describing the markup structures, possibly in OWL. So if EARMARK (which we already introduced in section 3.2) is used to represent a document, Lenses are able to support a far wider range of hierarchies, as well as overlapping markup. In any case, what really matters within this facet is the possibility to assign a determined and specific structural role to the relevant elements of the hierarchy. To do so, the Semantic Lenses approach recommends the use of the Pattern Ontology [DFP08], which I will present in more detail later in this section. The authors of [DPP12] have identified eleven structural patterns, and most complex and different structural components can be assigned a role within one of them, as they have proven sufficient to explicitly characterize the structure of most documents, especially scientific papers, and they are mostly independent from the underlying document format itself. Thus, with the use of the Pattern Ontology in combination with EARMARK, I have been able to assign specific structural semantics to markup elements, such as the div element being a container of text (lines 3-8), or the h2 and p elements it contains expressing the concept of being blocks of text (lines 9-18), as shown in the following excerpt of the Structure Lens for [Mik07]:


1. :structure a lens:StructureLens .
2. :structure {
3.   :div a earmark:Element ; # Container of the text
4.     la:expresses pattern:Container ;
5.     earmark:hasGeneralIdentifier "div" ;
6.     c:firstItem [ c:itemContent … ; c:nextItem [
7.       c:itemContent :h-sec-2 ; … c:nextItem [ …
8.         c:itemContent :p4-sec-2 … ] ] ] .
9.   :h-sec-2 a earmark:Element ; # Title of Sec 2
10.    la:expresses pattern:Block ;
11.    earmark:hasGeneralIdentifier "h2" ;
12.    c:firstItem [ c:itemContent :r-h-sec-2 ] .
13.  # The title text node
14.  # "A tripartite model of ontologies"
15.  :r-h-sec-2 a earmark:PointerRange …
16.  :p4-sec-2 a earmark:Element ; # Sec 2, Par 4
17.    la:expresses pattern:Block ;
18.    earmark:hasGeneralIdentifier "p" … }

Both the application of the Structure Lens and the Pattern Ontology will be better explained further on in the document.

4.3.5 The Rhetoric Organization Lens

Rising in the Semantic Lenses stack as we get nearer to the aspects more related to content, the next lens we encounter is the Rhetoric Organization Lens, which I defined as the one tasked with describing the organization of the rhetorical components of the document, storing information about both the rhetorical discourse and the rhetorical structure of the document. As anticipated, this is a twofold task. On one side, it is possible to describe the rhetoric meaning of a component within the structure of a document, by identifying, for instance, a <p> markup item with a Paragraph, or a specific <table> item with a Table Box. On the other hand, there is also the need to denote the role of a component within the rhetorical organization of the discourse, and thus this lens makes it possible to assert that said Paragraph can also be seen as a Summary or a Discussion, while a figure box might contain Data or a Caption. Such rhetoric characterization of markup structures can be specified through DoCO [SP11a], the Document Components Ontology, and DEO [SP11b], the Discourse Elements Ontology, both part of the SPAR suite of ontologies [Sho10a] already introduced in section 3.4. The following example, adapted from the application of this lens on [Mik07], expresses that the elements div, h2 and p introduced in the previous excerpt represent, respectively, the front matter of the paper (line 3), a section title (line 4), and a paragraph introducing some background assets (lines 5-6):

1.  :rhetoric a lens:RhetoricLens .
2.  :rhetoric {
3.  :div la:expresses doco:FrontMatter .
4.  :h-sec-2 la:expresses doco:SectionTitle .
5.  :p4-sec-2 la:expresses doco:Paragraph ,
6.    deo:Background . # etc. }

Both the application of the Rhetoric Lens on [Mik07] and the DOCO and DEO ontologies will be better explained further on in the document.

4.3.6 The Citation Network Lens

The measuring of citations between research papers is widely acknowledged as a very important metric in evaluating the impact and the productivity of scientists and research projects. At the moment, citation metrics just take into account the simple fact that one paper cites another, but there is a substantial difference between citing a paper because it is considered an important source or a seminal work within a field, and citing another paper to disprove its findings. Future citation network metrics on the impact of a scientific document could greatly benefit from taking into account the reasons for the citation of a source within a scientific publication: if that information were readily and unequivocally available, it would seem a reasonable consequence, in measuring the impact factor, to weight citations differently according to the motivation behind them. The Citation Network Lens is designed to offer us the tools to encode answers to this problem and to formalize why a paper was cited in a certain context or within a certain document, and with which purpose. As such, this is one of the lenses that offers perhaps the widest possibilities for the research community in terms of inter-document semantics. However, the possible interactions at the intra-document level that might be enabled by this Lens should not be underestimated, as it can immediately offer a quick characterization of the relationship of a document to other known works within its field, as well as other possibilities that we will explore further in the document. As it is, I can state that a document takes part in a citation network with its cited documents, by taking into account the reasons behind each individual citation in the text – e.g. a document could be cited to express qualification of or disagreement with the ideas presented in the cited paper – which may significantly affect the evaluation of the citation network itself. For instance, in analyzing the content of [Mik07], such as in the 4th paragraph of the 2nd section of the paper (i.e. :p4-sec-2), I encountered several citations to other works that are introduced for a particular reason (in this specific case, to express qualification of or disagreement with the ideas presented in the cited papers). Using CiTO, the Citation Typing Ontology [PS12], it is in theory possible to provide descriptions of the factual or rhetorical nature of the citations, as shown in the following example, where paper #5 is used as an authority (lines 6-7) and as a source of conclusions (lines 8-9), while paper #3 is used as a source of background information (lines 11-12) and is corrected by [Mik07] (lines 13-14):

1.  :citation a lens:CitationLens .
2.  :citation {
3.  # Sec 2, Par 4
4.  :ontologies-are-us
5.  # citation to [5]
6.    cito:citesAsAuthority
7.      ;
8.    cito:usesConclusionsFrom
9.      ;
10. # citation to [3]
11.   cito:obtainsBackgroundFrom
12.     ;
13.   cito:corrects ,
14.     ; }

Both the application of the Citation Lens on [Mik07] and CiTO [SP09] will be further explored later on in the document.


4.3.7 The Argumentation Lens

Getting even closer to content-specific semantics, we have the argumentation organization and structure of the paper. As already explained, this is another crucial facet for both the understanding and the summarization of a scientific paper's contents. The role of a scientific document is to propose hypotheses and corroborate them with relevant evidence, explaining why this evidence fits the ideas suggested by the researchers. This can be modeled with several argumentation theories, and one of the most useful and widely acknowledged is Toulmin's Argument Model, already introduced in section 3.5, as its suggested model of data, claims and warrants (together with rebuttals, qualifiers and backings) fits most scientific argumentations quite well. Consequently, the Argumentation Lens's purpose is to define argumentations within the document and to provide information about their structure and their components (identifying the single components, such as claims, data, warrants, and so on), as well as modeling the relationships between the argumentations, all in accordance with Toulmin's Argument Model and its fractal organization of argumentations. There are some ontologies able to model an argumentation within a paper, such as the SALT ontology mentioned in the related works section, but the suggested choice to express an Argumentation lens over a paper is AMO, the Argument Model Ontology, an ontology which implements Toulmin's model of argumentation in OWL, designed with its application within the field of Semantic Publishing in mind.

Fig. 14 - Example for the Argumentation Lens, from [Mik07]


The image on the previous page and the following excerpt show a preliminary and "simplified" (in order to offer a better summarization) application of the Argumentation Lens to the already discussed fragment of [Mik07]. This is the argument organization of the fourth paragraph of Section 2 in Mika's paper:

1.  :argumentation a lens:ArgumentationLens .
2.  :argumentation {
3.  :argument a amo:Argument ;
4.    # the set of these … vocabularies
5.    amo:hasClaim :r-claim-p4 ;
6.    # even
7.    amo:hasQualifier :r-qualifier-p4 ;
8.    amo:hasEvidence
9.      # the set of words is not fixed
10.     :r-evidence-1-p4 ,
11.     # it is clear that … and keywords
12.     :r-evidence-2-p4 ,
13.     # the instances of … classification
14.     :r-evidence-3-p4 ;
15.   amo:hasWarrant
16.     # the users from no … semantics
17.     :r-warrant-1-p4 ,
18.     # it is not always … single keyword
19.     :r-warrant-2-p4 ;
20.   amo:hasBacking
21.     # "Emergent Semantics Principles and Issues"
22.     .
23. :r-qualifier-p4 amo:forces :r-claim-p4 .
24. :r-evidence-1-p4 amo:proves :r-claim-p4 ;
25.   amo:supports :r-warrant-1-p4 .
26. :r-warrant-1-p4 amo:leadsTo :r-claim-p4 .
27.
28.   amo:backs :r-warrant-1-p4 .
29. :r-evidence-2-p4 amo:proves :r-claim-p4 ;
30.   amo:supports :r-warrant-2-p4 .
31. :r-warrant-2-p4 amo:leadsTo :r-claim-p4 .
32. :r-evidence-3-p4 amo:proves :r-claim-p4 }

Both the application of the Argumentation Lens on [Mik07] and AMO will be explored in better detail further on in the document.


4.3.8 The Textual Semantics Lens

We conclude the Semantic Lenses stack with the most content-specific of the lenses, the one addressing the layer dealing with the literal meaning of statements, assertions and words (such as named entities, perhaps belonging to ontologies). The Textual Semantics Lens analyzes the final formal meaning of ideas, definitions and relationships expressed in natural language. For example, the formal description of a claim needs to be expressed in such a way as to represent as faithfully as possible the meaning of the entities involved in the claim itself. As each document usually provides content that is very domain-specific, there is no universal classification of knowledge, in the form of an ontology, suggested for the implementation of this lens. Since it is impossible to provide an all-encompassing ontology to express this lens, it is rather suggested to choose those ontologies most apt to serve our purposes within the relevant knowledge domain. In some cases, the claim of an argument can be encoded using a simple model, e.g. DBPedia [BLK09], as shown in the following excerpt, while in others more appropriate domain-specific ontologies exist.

1.  :semantics a lens:SemanticsLens .
2.  :semantics {
3.  :my-keywords a dbpedia:Set_(mathematics) ,
4.    [ a owl:Class ;
5.      owl:complementOf dbpedia:Vocabulary ] }


4.5 Structural Patterns and the Pattern Ontology

In introducing the definition of the Document Structure Semantic Lens, I already mentioned that its purpose is to provide assertions that enable the association of a predetermined and unambiguous structural role to relevant (markup) elements of the document hierarchy. To do so, we mentioned using Structural Patterns and the Pattern Ontology [DFP08], as introduced in [DPV11c] and refined in [DPP12]. In order to better understand the proposed methodology for the application of the Structure Lens and the motivation behind my implementation choices, a quick overview of the conceptual architecture of the Structural Patterns model is in order, and I will provide it in this section. Patterns were first suggested and developed in [DPV11c] as a way to allow EARMARK (see section 3.2) to explicitly express structural assertions on syntactic markup structures, and the adherence to content model constraints for a document hierarchy as represented in an EARMARK document. In general, patterns are first and foremost a meta-level theory for the description of document structures and of the requirements of use for structural markup, which has then been formalized in the OWL Pattern Ontology. The fundamental intuition is that, regardless of the many different vocabularies that can be used to express the overall syntactic structure of a document, like DocBook or XHTML, all of them share some well-established patterns, thus creating meta-structures (like containers, blocks, inline elements, meta-information placeholders, etc.) which are recurrent and persistent over the whole spectrum of these languages, and as such can be researched and used to generate a more general and schema-independent description of a document's building blocks. By doing so, not only do we gain an improved understanding of what the fundamental structural components of a document are, but we can also identify underlying mechanisms with which it becomes possible to de-structure, re-structure or simply analyze documents and their markup semantics (within the structural facet) without being tied to a specific schema or presentational rendering (such as a stylesheet). So, instead of trying to catalogue all the aspects of a domain, from the most widely used to the most particular cases, Structural Patterns [DPP12] approach the issue in a minimalistic way, aiming to create as few classes as necessary to represent all possible persistent conceptualizations common to most documents and their components, in order to segment the structure into atomic components that can then be manipulated independently and reflowed, re-constructed or de-constructed within different contexts for different purposes. A simpler model eases documents' processing and modeling by other applications, as well as being less prone to errors and misinterpretations, by reducing choices and ambiguities. Patterns are thus expected to have two main characteristics:

- Orthogonality – each pattern has a specific goal and fits a specific context.
- Assemblability – each pattern can be used only in some locations, that is to say, only within some other patterns (while still allowing for overlapping or mixed-content model items).

As it is, nine abstract patterns are defined, and these are used to characterize eleven concrete instanceable patterns. These patterns allow authors to create unambiguous, manageable and well-structured documents. All concrete patterns are organized as part of one of four disjoint abstract classes (Mixed, Bucket, Flat, Marker), defined by their ability to contain text or other elements, and which are thus derived by combining the four possible abstract classes for these properties (Textual, NonTextual, Structured, NonStructured).

Fig. 15 - The Architecture of the Pattern Ontology


In short, all concrete classes are members of one of these four abstract classes:

- Mixed. Individuals of this class can contain other elements and text nodes;
- Bucket. Individuals of this class can contain other elements but no text nodes;
- Flat. Individuals of this class can contain text nodes but no elements;
- Marker. Individuals of this class can contain neither text nodes nor elements.

Formally, these classes are then defined as follows:

Mixed  ≡ Structured ⊓ Textual
Bucket ≡ Structured ⊓ NonTextual
Flat   ≡ Textual ⊓ NonStructured
Marker ≡ NonTextual ⊓ NonStructured
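In OWL terms such a definition can be linearized quite directly. The following Turtle fragment is only an illustrative sketch of the first equivalence (prefixes as in the earlier excerpts; the actual axioms of the Pattern Ontology may be organized differently):

    pattern:Mixed owl:equivalentClass [
        a owl:Class ;
        # a Mixed element is anything that is both structured and textual
        owl:intersectionOf ( pattern:Structured pattern:Textual )
    ] .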

Considering the nature and the purpose of this document, I shall leave further details about ontology design to [DPV11c] and [DPP12], and will just give a summary of the different features and meanings of the instanceable patterns. Please note that all subclasses of Container (Table, Record and HeadedContainer) are disjoint, as are all classes within the same abstract subclass (e.g. Field is disjoint with Atom).


Table 1 - Summary of instanceable Structural Patterns

- Meta (Marker; isContainedBy: Container, Popup). Any content-less structure (though data could be specified in attributes) that is allowed in a container but not in a mixed content structure. The pattern is meant to represent metadata elements disconnected from the content. Examples: script, meta.
- Milestone (Marker; isContainedBy: Inline, Block). Any content-less structure (though data could be specified in attributes) that is allowed in a mixed content structure but not in a container. The pattern is meant to represent relevant locations within the text content. Examples: img, br.
- Atom (Flat; isContainedBy: Inline, Block). Any simple box of text, without internal substructures (simple content), that is allowed in a mixed content structure but not in a container. Examples: em or span without any internal markup.
- Field (Flat; isContainedBy: Container, Popup). Any simple box of text, without internal substructures (simple content), that is allowed in a container but not in a mixed content structure. Examples: title.
- Inline (Mixed; isContainedBy: Inline, Block; contains: Inline, Atom, Milestone, Popup). Any container of text and other substructures, including (even recursively) other inline elements. The pattern is meant to represent inline-level styles such as bold, italic, etc. Examples: a p inside another p, a.
- Block (Mixed; isContainedBy: Container, Popup; contains: Inline, Atom, Milestone, Popup). Any container of text and other substructures except for (even recursively) other block elements. The pattern is meant to represent block-level elements such as paragraphs. Examples: p, li.
- Popup (Bucket; isContainedBy: Inline, Block; contains: Container, Field, Meta, Block). Any structure that, while still not allowing text content inside itself, is nonetheless found in a mixed content context. The pattern is meant to represent complex substructures that interrupt but do not break the main flow of the text. Examples: noscript, iframe.
- Container (Bucket; isContainedBy: Container, Popup; contains: Container, Field, Meta, Block). Any container of a sequence of other substructures that does not directly contain text. The pattern is meant to represent higher document structures that give shape and organization to a text document, but do not directly include the document content. Examples: a div without text, dl.
- Table (Container; contains homogeneous elements: true; contains heterogeneous elements: false). Any container that allows a repetition of homogeneous substructures. The pattern is meant to represent a table of a database with its content of multiple similar records. Examples: ul, ol.
- Record (Container; contains homogeneous elements: false; contains heterogeneous elements: true). Any container that does not allow substructures to repeat themselves internally. The pattern is meant to represent database records with their non-repeatable fields. Examples: html.
- HeadedContainer (Container; containsAsHeader: Block). Any container starting with a head of one or more block elements. The pattern is meant to represent nested hierarchical elements. Examples: a div containing an h2.
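For instance, following the la:expresses convention of the excerpt in section 4.3.4 (the resource name here is hypothetical), a ul element matching the requirements above could be assigned the Table pattern like this:

    :my-list a earmark:Element ;
        earmark:hasGeneralIdentifier "ul" ;
        # all children of :my-list must be a repetition of homogeneous elements
        la:expresses pattern:Table .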


4.6 DOCO – The Document Components Ontology

I have already mentioned DOCO – the Document Components Ontology [SP11a] – both when describing the SPAR (Semantic Publishing And Referencing [Sho10a]) ontologies and when I introduced and defined the Rhetoric Organization Lens. DOCO is the vocabulary that, according to the Semantic Lens model, should be used to denote the rhetoric of a scientific paper, by providing information about the organization of the rhetorical components of the document, the rhetorical discourse and the rhetorical structure of the document. DOCO helps us in this task by presenting the user with a structured OWL 2 DL vocabulary of document components, both structural (e.g. block, inline, paragraph, section, chapter) and rhetorical (e.g. introduction, discussion, acknowledgements, reference list, figure, appendix), the latter defined by the imported Discourse Elements Ontology (DEO) [SP11b]. Indeed, DOCO is a composite ontology, which imports both the Pattern Ontology I have just introduced and DEO, to which I will return in a few pages; here I will just introduce some of the basic concepts behind the core part of DOCO. This core section of DOCO is mostly used as a way to describe markup semantics for document components within the context of the rhetoric hierarchy – that is to say, it allows us to denote that, for example, a certain markup item with general identifier p is a paragraph, while an h2 within a div is a section title, the div containing it is a section, and so on. DOCO is mainly defined as a large set of classes (55), all of which are instanceable, while it uses just the contains and isContainedBy properties of the Pattern Ontology. Each of DOCO's classes usually has several very strict requirements for its usage, both in terms of Patterns and in terms of Discourse Elements, as well as of superclasses within the DOCO hierarchy. On the one hand, this reduces ambiguities and the possibility of misinterpretation errors, allowing for more straightforward characterization and fewer unexplainable choices. On the other hand, given the nature of the domain, there is an impressive amount of constraints, and as I was able to see in my experimental applications, some of DOCO's structures will not be usable if the underlying syntactic markup does not adhere to the ideal modeled by such requirements. Usually, the DOCO model is based on a certain number of high-level general classes (like "label") and several more constrained specialized sub-classes (like "chapter label", "figure label", "section label", "table label"), usually disjoint with one another, that require being part of an appropriate document component, with an appropriate pattern. Just as an example, I am going to show the DOCO organization for a "bibliography" and some related classes, like "bibliographic reference list" (and the "list" class hierarchy) or "section". For the complete documentation of DOCO, see [SP11a].
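As an illustrative sketch of the kind of assertion these classes enable (the resource names are hypothetical, and the prefixes follow the earlier excerpts; the class names are taken from the terms just mentioned):

    # denote a markup item as the reference list of the paper,
    # and the section containing it as its bibliography
    :references-list la:expresses doco:BibliographicReferenceList .
    :back-section la:expresses doco:Bibliography .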

Fig. 16 - Part of the Architecture of DOCO


4.7 DEO – The Discourse Elements Ontology

The other vocabulary used by the Rhetoric Organization Lens, DEO [SP11b], is a subsidiary ontology imported into DOCO [SP11a], and as such is part of SPAR [Sho10a] as well. The Discourse Elements Ontology is an ontology for describing the most important rhetorical elements of a scientific document, providing a structured vocabulary for the denotation of the rhetorical function of elements (e.g. Introduction, Acknowledgements, Discussion, Appendix, Figures, Results, etc.) and components within a document, thus allowing their role within the overall rhetoric organization of scientific discourse to be described by means of RDF triples. DEO defines a single abstract superclass, "Discourse Element", as a subclass of owl:Thing, and 30 other classes of individuals, all of them descendants, directly or indirectly, of Discourse Element. Most of the names are self-explanatory, and some are imported from SRO, the SALT Rhetorical Ontology. DEO also imports 3 important object properties from Dublin Core: "has relation" and its two sub-properties "has part" and "is part of", which are mostly used by DOCO to define relationships and constraints for document components, as we have seen above. This is a short list of all DEO classes, followed by a small usage sketch. For further documentation, see [SP11b].

- DiscourseElement
  - Acknowledgements
  - AuthorContribution
  - Background
  - Biography
  - Caption
    - Legend
  - Conclusion
  - Contribution
  - Data
  - Dedication
  - Discussion
  - Epilogue
  - Evaluation
  - ExternalResourceDescription
    - DatasetDescription
    - SupplementaryInformationDescription
  - FutureWork
  - Introduction
  - Materials
  - Methods
  - Model
  - Motivation
  - Postscript
  - ProblemStatement
  - Prologue
  - Reference
    - BibliographicReference
  - RelatedWork
  - Results
  - Scenario
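A minimal usage sketch (hypothetical resources; the la:expresses convention and the prefixes follow the earlier excerpts):

    # denote the rhetorical function of two markup items of an enriched paper
    :sec-related-work la:expresses deo:RelatedWork .
    :acks-paragraph la:expresses deo:Acknowledgements .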


4.8 CiTO – The Citation Typing Ontology

The purpose of the Citation Network Lens is to provide a formal encoding of all information related to the citations present within the document, especially regarding their intended purpose and role within the citing document itself. Citation by citation, it is possible to associate with each of them information about the motives behind its choice, and perhaps to link it to its target document. The Citation Network Lens thus allows the annotation of semantics relevant to the motivation behind every individual citation reference made within a scientific document, giving us the tools to build a semantic citation network where citations within a context can be evaluated more in depth than would traditionally be possible, thanks to the additional data available to the reader. To do so, the Semantic Lenses model suggests the use of CiTO – the Citation Typing Ontology [PS12] – an ontology written in OWL 2 DL designed to enable the characterization of the nature or type of citations, both factually and rhetorically. It allows much more than simply asserting in RDF that citations exist: it also encourages users to define the factual or rhetorical nature of each citation, and the reasons behind it. In short, CiTO offers the user a way to characterize citations by their factual and rhetorical nature, regardless of them being direct or indirect, implicit or explicit. CiTO properties of a rhetorical nature might also imply a judgment on the act of citation, which might be positive, negative, or neutral. It has evolved from its original formulation (CiTO v1.6) and has recently been the target of a remarkable harmonization with the SWAN (Semantic Web Applications in Neuromedicine) Citation Ontology, to obtain version 2.0 [CSP12], before reaching its current status as version 2.1 [SP09]. According to the authors of [CSP12], "Up until its definition, no public, open, interoperable and complete web-adapted information schema for bibliographic citations and bibliographic references has been made available." CiTO was initially developed as an ontology for the description of the nature of reference citations in scientific documents, as well as web information resources, and for the publishing of these descriptions within the Semantic Web. It allowed the description of citations in terms of both factual and rhetorical relationships between the citing publication and the cited resources. It also allowed several other characterizations, but many entities not directly related to the citation network that were previously part of CiTO have been moved to other components of the SPAR suite of ontologies, like FaBiO and C4O. This ontology does not define any classes for individual entities; its design revolves around the property "cites" and its 33 possible sub-properties, like "updates", "critiques", "refutes", "extends" or "obtainsBackgroundFrom", which are to be used to better characterize the semantics of a citation. It also contains the inverse property "isCitedBy", from the original Citation Typing Ontology, and all the sub-properties are present in their inverted form as well. The most recent change, in December 2011, was the addition of two new properties, "usesConclusionsFrom" and its inverse "providesConclusionsFor", and this led to a version number increment from 2.0 to 2.1 [SP09]. Further changes moved this ontology to its current 2.4 version, with the addition of the "cito:compiles" and "cito:likes" properties, but as these were concurrent with my activity, the version I considered remained CiTO 2.1, in order to remain consistent in my analysis and applications.
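As a sketch of this symmetry (the paper URIs are hypothetical; cito:refutes is one of the 33 sub-properties of cito:cites, and, as stated above, every sub-property has an inverted form):

    :paper-a cito:refutes :paper-b .
    # the corresponding inverse assertion
    :paper-b cito:isRefutedBy :paper-a .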

Fig. 17 - CiTO 2.0 properties clustered by their nature


4.9 AMO – The Argument Model Ontology

We know that the Argumentation Semantic Lens aims to describe the argumentation structure within a paper in accordance with Toulmin's Argument Model, which I have already introduced in detail in section 3.5. As I have previously discussed, this model gives us the terms to denote all claims within a document, as well as all those elements supporting, leading to or limiting said claims, and in general allows for a representation of all kinds of scientific argumentations and their components, even when they are nested or overlapping with each other – and to do so is precisely the purpose of the Argumentation Lens. AMO – the Argument Model Ontology [VP11] – is, very simply, an OWL 2 DL ontology that implements the Toulmin Model of Argument described in [Tou59], encoding it through OWL classes and properties corresponding to the concepts I previously introduced. It does so in order to enable the description of a document's argumentations as a web of inter-linked entities that participate, with a specific role, in one or more arguments. This ontology is also aligned with CiTO, and thus is part of the SPAR suite. The Argument Model Ontology postulates two top-tier classes, "Argument" and "Argumentation Entity". The first is the basic entity corresponding to Toulmin's "practical argument", focusing on the justificatory function of argumentation (first a claim of interest is found, then justification is provided for this claim); a basic requirement of at least one Claim, one Evidence and one Warrant is established for this entity. A super-property, "involves", has Argument as its domain, and from it other specific properties (like "has claim") derive. The second is a superclass for all specific argumentation model components, like Claim, Warrant or Evidence. Appropriate properties, like "proves" or "leadsTo", are defined to link these entities to one another, in order to denote the structure of argumentations in observance of the model.
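The minimum requirement just described can be sketched in Turtle as follows (the resource names are hypothetical; the properties are the same ones used in the section 4.3.7 excerpt):

    # the smallest well-formed AMO argument: one claim, one evidence, one warrant
    :an-argument a amo:Argument ;
        amo:hasClaim :a-claim ;
        amo:hasEvidence :an-evidence ;
        amo:hasWarrant :a-warrant .
    :an-evidence amo:proves :a-claim ;
        amo:supports :a-warrant .
    :a-warrant amo:leadsTo :a-claim .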


Fig. 18 - The Architecture of AMO

AMO's structure, which gives a lot of flexibility on how to concretely represent an argumentation within its established bounds, is easily summarized in the previous diagram, which closely matches the theory presented in a previous part of this dissertation. For further details, see [VP11].


5 SLM – Semantic Lenses Methodology

This dissertation will now proceed with a general discussion on what could be a general-purpose methodology for the application of Semantic Lenses, on what kind of challenges we can expect to face in doing so, and on how I decided to perform a "road test" of the Semantic Lenses model and its related technologies. I will first briefly discuss which roles the different actors involved in the publishing process, like authors and editors, should take on in the process of annotating Semantic Lenses on a scholarly article. A discussion of the general methodology, as well as of the targets of the application, will follow in sections 5.2 to 5.5. Finally, in sections 5.6 to 5.9, I will relate, step by step and lens by lens, the choices, the methodology and the theoretical decisions I took to concretely implement the semantic enrichment assertions.

5.1 Applying Semantic Lenses – Authorial or Editorial Activity?

Before examining the methodology for the application of the Structure, Rhetoric, Citation and Argumentation Lenses that I suggest, and before delving deeper into those lenses and their related ontologies, I think it is important to consider an important part of the overall problem of the application of Lenses – namely the simple question: "Who should be involved in this activity?" Just as a quick reminder, we defined the application of Semantic Lenses as the act of annotating and enriching the document with the appropriate metadata specified by the Semantic Lenses stack we just illustrated, in contrast with the act of focusing Lenses, which is the act of selecting a specific facet to be viewed and highlighted over the enriched document. While it is clear that focusing a lens is something that will involve any reader of the document, through appropriately developed tools and interfaces for browsing papers enhanced with lenses (such as the TAL prototype [PVZ12]), there is more space to debate and discuss the roles involved in applying lenses over a document, which is the fundamental preparation work that will enable any successive focusing.

In general, the application of any particular lens to an article, by adding information about the semantics described by it, is an authorial operation, in the sense that it is an act involving individuals acting as agents, responsible for the choice of determined semantic interpretations of a document or its content. As such, tracking lens application is also a problem of data provenance, which concerns the identification of the processes involved in the creation of a resource, and of the agents controlling those processes. To do so, the authors of [PSV12a] suggest using the PROV-O Ontology [LSM12], a controlled vocabulary to record the provenance of semantic statements. However, this does not exactly address what I was aiming to discuss. There is no question about the need to have an author of some sort for the application of Semantic Lenses, nor about the benefits of tracking the provenance of semantic assertions on an enriched paper. The main point is to understand the possible relationship between the authorship of Semantic Lenses and the individuals involved in the authorship, editing and publication of the paper being enriched. We have seen that Semantic Publishing involves all levels of the publication chain [Sho09], as Authors, Reviewers, Editors and Publishers might all have specific roles to fill in producing a semantically enriched scientific document. Moving within the specific domain of Semantic Lenses, it is quite important to identify how all the individuals involved in the process of scientific publication could contribute to the application of Semantic Lenses; even more importantly, there is the need to understand and discuss, for each of the lenses of the stack, whether its application is an endeavour for which a determined contributor is best suited, or whether, on the other hand, fulfilling the application task is something that could be done at different levels within the publishing chain. In short, the question is: "Whose duty is it to apply a chosen lens?" Sadly, there is no clear-cut answer to this question, but it is surely reasonable to discuss the issue at hand and suggest some guidelines. A very important side of this resides in finding out how much the original authors of the document have to be involved in the generation of semantically accurate Semantic Lenses, and how recommended their participation is in the application of each type of Semantic Lens. On the whole, it should be safe to say that the more content-specific the Lens is, the more important it is for the Author to be involved in its application. On the other side of the coin, it is possible to say that for the three more context-specific lenses – Research Context, Contributions and Roles, and Publication Context – there is no real need to involve the Author of the paper, except perhaps to gather information, and said lenses could easily be created and applied by anyone with the knowledge of the data required to be encoded in them. Publishers (and, to a lesser extent, Editors) are natural candidates for the application of these three lenses. As for the other five, the Document Structure lens does not necessarily require the involvement of the Author, as it is more about assigning the correct structural patterns (as described by the Patterns Ontology [DFP08], see section 4.5) to the element components of the document, and it is something that a tech-savvy person can infer from the original document content organization, especially if it was originally written in a common format like HTML or DocBook, for which relationships between patterns and markup elements can be deduced from the content model. Promising future developments on automatic textual pattern recognition [DPP12] might also make this an automated task in the future. Considering all these factors, Author involvement in the application of this lens is quite limited, as Authors (or other individuals) will probably just be asked to review the results and disambiguate some cases.

Table 2 – Summary of Suggested Involvement in the authoring of Lenses

Semantic Lens           | Author Involvement  | Other Roles Involved
Textual Semantics       | Varies              | Varies
Argumentation           | Highly recommended  | Very limited involvement
Citation Network        | Recommended         | Other roles might assist
Rhetoric Organization   | Recommended         | Limited involvement
Document Structure      | Limited             | Varies (it might be automated)
Publication Context     | Not required        | Mostly Publishers (Editors)
Contributions and Roles | Very limited        | Mostly Publishers (Editors)
Research Context        | Not required        | Mostly Publishers (Editors)

As we get nearer to the actual meaning of the text, to its intentions and to its discourse organization, some kind of involvement by the Authors is obviously recommended, in order to correctly capture and formalize their intended meaning. While in my work of applying lenses to [Mik07] I was obviously forced to "emulate" the intention of the authors, in any semiotically correct application of Semantic Lenses the Authors' opinion about their own words and intentions is quite crucial, if we want to avoid misinterpretation when authoring metadata that is meant to convey the specific and precise meaning of the ideas of the Authors themselves.


For the Rhetoric Organization Lens, while the Author does not need to be much involved in the part of this lens regarding the document components (as characterized by DOCO [SP11a]) and the markup semantics, which in a sense poses challenges similar to the Structure Lens, their involvement in identifying and formalizing the rhetoric discourse of the document (as characterized by DEO [SP11b]) is quite clearly recommended. The same can be said for the Citation Network Lens: although it is possible to theorize what the author's intentions for a citation were, thanks to the context of its appearance, it is evidently much better to have said reasons and purposes explicitly defined by the authors themselves, or at least confirmed by other roles with a high domain-specific know-how, such as reviewers or editors. As it is, an ideal application of the Citation Network lens is at least a co-authorial activity. When we arrive at the Argumentation Lens, there is really no excuse for not involving the Authors of the paper. In fact, the Authors' involvement and participation in annotating this lens is almost mandatory: arguably, there is no one better than the Authors themselves to explicitly indicate their intended claims and the structure of the argumentations used to sustain them. As for the final Textual Semantics Lens, while it is expected that some kind of information exchange with the Authors of the paper will be necessary for an ideal application of it, this lens's characteristics may actually vary a lot from paper to paper or from domain to domain, and there is no definitive answer, just as there was none for a favoured ontology. For example, if its application is to be limited to a simple annotation of named entities and their links to domain-related ontologies, the Authors' involvement might not be necessary, especially if enough clear and unambiguous data were gathered and encoded in the other lenses. On the other hand, if it involves something more, like making a set of precise assertions with specific properties, the Authors' contribution is probably going to be extremely useful.


5.2 Proposing a general methodology

As a quick reminder, we have already defined the application of Semantic Lenses as the act of authoring the semantic annotations enriching the document and its components, in opposition to the focusing of a Lens, which is the act of using said enhanced data to highlight a specific facet of a scientific document. We also reflected on the responsibilities for the authoring of lenses with regard to the different roles within the publication chain (see section 5.1). I have reason to surmise that the application activity for Semantic Lenses would differ greatly in methodology not just with respect to the role of the person involved and the target document of the application, but also according to a) the tools available to assist in said application and, especially, b) the time frame of the application – whether the lenses are applied more or less concurrently with the authoring and editing of the target scientific document, or instead applied to the already finished document subsequently, at a much later date. The first observation is simply the acknowledgement of the well-established fact that our chosen approach when performing a task can be heavily influenced by the tools we decide to use among the ones at our disposal. "If you only have a hammer, everything will look like a nail", or, to put it in a way more related to computer science, the course of action we are more likely to take in order to rename a set of files will be quite different if we are using a file-explorer GUI rather than a shell script, even if we aim for the same end result. The situation for the application of Semantic Lenses is no exception, as no dedicated tools for their application were available, and in order to test them I had to start from scratch and decide the best way to approach the task. The idea of having to manually write and add, one by one, all assertions directly within the RDF/XML or Turtle linearization of a document seemed nightmarish and impractical right from the start, as correcting all kinds of errors, modifying the assertions, or simply having an overview (either general or lens by lens) of the added statements would be quite difficult. However, Semantic Lenses were designed to be able to use all modern Semantic Web technologies, and there is a reasonable amount of tools, languages and libraries for the editing and manipulation of Semantic Web documents and ontologies. A possibility could have been to write the instructions for the annotation of the chosen Lenses as SPARQL queries, e.g. by using the CONSTRUCT keyword, but the end result would still be quite a low-level approach, with many of the same cons as the aforementioned text-based possibility. There is also a good number of tools and APIs for the Semantic Web available in the Java programming language. Given that among these are both the Apache Jena RDF API for RDF document manipulation and a Java API for the EARMARK ontological model, whose architecture and advantages I introduced in section 3.2, I have chosen Java as the privileged way to develop some basic tools to accomplish my intended task of concretely testing the application of some Semantic Lenses. For instance, let's suppose that we want to assign the Table structural pattern to all dd elements having a class attribute equal to 'table'. With the methods and the tools that I have been developing, it is possible to do it as simply as this:

    applier.buildAnnotation("Table", LensesNames.LA_URI, "expresses",
            LensesNames.DOCO_URI + "Table");
    applier.searchWithAttsAndAssert("dd", LensesNames.EMPTY_URI,
            "class", "table", applier.getLastannotation());

This choice allowed me to aim at developing not just a methodology but also some basic tools, based on both the aforementioned APIs, in the form of the SLAM package – Semantic Lenses Application and Manipulation – which might in the future be either re-used or extended. This package will be described in more detail in Section 6.

On to the second observation: I will now explain why I believe that the time frame (and the person tasked with the application) is an important factor in the methodology to be used in lens authoring and lens application on a scientific article. I want to point out the difference between creating and applying Semantic Lens metadata concurrently with the authoring of the document, as opposed to doing so only afterwards, especially if we consider those Lenses that are more content-specific. Indeed, when we examine the time frame for the application of the first three, more context-specific lenses (Research Context, Contributions and Roles, and Publication Context), their very definition implies that, for the information stored within them to be as correct as possible, it should relate to several aspects which can better be collected only after authorship (and, possibly, publication) of the document is completed. However, if we consider the other five lenses, the possibility to write them within the same time frame as the document is worthy of contemplation. If the RDF triples for the enrichment of document components with the data required by the Semantic Lenses (like those I presented in the examples of section 4.3) could be assigned during the authoring of the document (or during any closely related activity), the Authors' involvement would be more straightforward, and as a result the information would arguably be far more accurate than in a post-hoc application. The chances of an ambiguous interpretation of the Authors' intentions would also be slimmer, and such a method could also help the Authors in examining their motivations for selecting a citation or structuring an argumentation, as they would need to make the reasons behind their choices explicit. The main problem with this "ideal" approach is, unfortunately, an eminently practical one. First, it supposes that authors already know how to apply Semantic Lenses, or could quickly become familiar with Semantic Lenses, their definitions and the concepts and meanings encoded by the ontologies used. And, most importantly, the fact persists that, at the moment, the application of Semantic Lenses requires a good amount of technical knowledge of computer systems, Semantic Web technologies and languages, which is an unreasonable expectation of authors not in the field, even when considering the tools I have developed or some of their possible immediate evolutions. Of course, it is possible to envision the future realization of specialized authoring editors or tools (either as separate software or as plugins to existing ones) that could greatly help authors to apply lenses easily, just as it is possible to apply different styles and formats to a text in a document editor, as well as reminding them, perhaps with tooltips, of the meaning of the annotation they have just chosen to apply. If these tools were available, or if the objections highlighted above did not apply to the Authors, then it would be possible to add all annotations for all five content-related lenses to a component as it is in the process of being authored (e.g., a paragraph just written is immediately associated with relevant information, such as "pattern:Block", "doco:Paragraph", "deo:Background"; see the sketch at the end of this section). As it is, this is more of a vision for the future and a final goal to be reached rather than an immediate prospect. While we should not shy away from that ideal final objective, for now the most practicable approach is to semantically enrich scientific documents after they have been completed and finalized, and to pursue the application of lenses with an ex-post approach, much as it was done by Shotton et al. in [SKM09]. Ideally, as discussed in section 5.1, the Authors' involvement as sources of assistance – for the correct interpretation of their purpose in organizing the discourse, of the motivations behind each citation and of what they intended as claims – would be recommended and of great help in reaching a good level of accuracy, while the actual technical implementation of the Semantic Lenses would be done by an Editor with enough tech-savvy and domain knowledge to do so. In any case, we should be mindful that this time-consuming collaborative process might not always be completely possible, for whatever geographical or practical reasons (as was the case for this demonstration), especially for the enrichment of papers published some time ago. Considering what I have just observed, it is my belief that in this situation the best approach is to analyze the paper on a lens-by-lens basis, with multiple "visits" over the target document, each one considering only one aspect of the domain, in a sense mirroring the future act of focusing, although from an inverted standpoint. In doing so, the Semantic Lens Editor can concentrate on gathering the most correct information on the specific facet being considered, trying both to correctly interpret the Authors' intended meanings and not to lose sight of the overall big picture of the article. I think that this kind of approach, especially if performed bottom-up in our Semantic Lenses stack (thus starting from the least content-related lenses, like the Document Structure, then going "up" with the Rhetoric Organization, the Citation Network and so on), has the best chance to correctly catch the original intended meaning and to reduce misinterpretations or internal consistency errors – unlike the more "in depth" approach where all lenses are applied in detail to each component within a single passage. Given these premises, I have chosen to follow this road in my concrete case study on Semantic Lens application as well. After choosing a target paper [Mik07] (section 5.3), selecting some lenses for the tests (section 5.4) and converting the paper to EARMARK (section 5.5), I followed an approach based on what I introduced above. At first I considered each selected lens independently, initially studying a general methodology for its annotation, one not necessarily tied to the target document. For each lens I would then review the document's original source (as markup and text), note down informally its distinguishing features and then start to theorize which statements I would need to add to correctly represent it in conformity with the Semantic Lens model. This gave me an insight into which features would be needed in the tools that I was going to develop. After having gathered this information for each lens, I then proceeded to develop a way to actually annotate them in the document, by means of SLAM, described in section 6. With that done, I concretely applied the lenses to [Mik07]. This will be detailed in section 8.
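Picking up the parenthetical example above, this is a sketch in Turtle (hypothetical resource name; prefixes and conventions as in the section 4.3 excerpts) of the triples such an authoring-time annotation would produce:

    # a paragraph annotated with all three content-related denotations at once
    :p-just-written la:expresses pattern:Block ,
        doco:Paragraph ,
        deo:Background .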


5.3 The target paper – “Ontologies are Us”

For my concrete field test activity I have chosen a well-known and widely cited paper on the Semantic Web – Peter Mika's "Ontologies are Us" [Mik07], an important study on the topic of folksonomies, i.e. ontologies emerging from online communities. The author extends the traditional bipartite model of ontologies into a tripartite model of actors, concepts and instances within a social dimension, studying ontology emergence in del.icio.us and within web pages. This paper has been cited 84 times within the ACM digital library (of which it is a part), and is also cited 166 times in Scopus, 158 times in Microsoft Academic Search, and 365 times according to Google Scholar – these numbers are a testament to the acknowledged importance of this work, as well as to its quality. It is a well-structured paper, adhering to the expected standards for a scholarly article, with a clear and well-thought-out discourse organization, as well as having enough diversity in its contents to make for an interesting test-bed for a wide variety of semantic denotations. The paper is available for online consultation in HTML format to subscribers of ScienceDirect, and this version was the basis for my enhancement activity.

5.4 The target lenses

Given the magnitude of the task of correctly annotating the test paper with all 8 lenses – an activity that goes far beyond the scope of this work and which would have left me little opportunity to develop SLAM and TAL – I chose to focus my field tests on only four of the Semantic Lenses defined in section 4.3. The Lenses I chose were Document Structure, Rhetoric Organization, Citation Network and Argumentation. The Research Context, Contributions and Roles, and Publication Context lenses were discarded as they are the three more context-related ones (as already shown in section 4.3) – they offer information that is on the whole related much more to the document as a whole rather than to its components, and the metadata payload they could carry would offer little in terms of intra-document interaction, being more focused on inter-document interlinks and applications.


The Textual Semantics Lens was discarded for a different reason: as we have seen, it is the lens that is least rigidly defined, and it is extremely domain-specific. While its application might certainly have offered interesting possibilities in terms of user interaction (such as in [SKM09]), the fact remains that it would have been hard to derive general-purpose lessons or a methodology from it, as it is the least universal and the most specific lens of the whole Semantic Lenses stack. Switching back to the four chosen Lenses, they have several advantages: they are content-specific enough to require some annotation and denotation within the document's components, thus allowing for a reasonable variety of application cases and a heterogeneity of challenges to be solved in order to annotate them; they address some extremely relevant facets of a scientific publication regardless of its specific domain, such as the rhetorical discourse, the citations or the argumentation model; and they offer relevant opportunities in terms of the focusing activities related to their presence, both at the inter-document and at the intra-document level.

5.5 Porting the target document to EARMARK

We have already seen the major advantages of the EARMARK model for the representation of annotated documents in the appropriate section (3.2), but just as a very quick reminder: EARMARK offers us an excellent way to express semantic assertions about the document and its content, as well as about relationships between its components, allowing a very straightforward integration with Semantic Web technologies like RDF (any embedded semantic markup can also be converted into RDF triples). It also enables a very straightforward and effective way to address overlapping markup, thanks to the use of Ranges, as well as allowing text fragments to be expressed out of order or reversed. The EARMARK software package also has a full-featured set of Java APIs based on the Apache Jena RDF API. The EARMARK API offers the user several useful methods for the basic manipulation of EARMARK document models, and a very good starting point to extend with my work. There is also a very useful converter tool, XMLTOEARMARK, which is able to convert any well-formed XML document into an equivalent EARMARK document model, and which I used to port the target document [Mik07] into the EARMARK format. First of all, I cleansed the paper of most non-essential HTML markup clutter, especially that related to the ScienceDirect website features and frames, in order to reduce the target paper's structural complexity as much as possible while keeping intact all of its contents, data and internal reference structure. I then converted it to EARMARK with the use of EARMARK's XMLTOEARMARK tool, which worked perfectly and output the representation of the original document as an EARMARK ontological model.


5.6 The Application of the Document Structure Lens

It should be clear that the objective here is not to create a univocal 1:1 representation of HTML 5 or XHTML with the Structural Patterns. That is simply NOT possible. (X)HTML content models are much more flexible than Structural Patterns could ever be, due to the difference in original design goals (the HTML schema is a lot less strict than Patterns), and, for many HTML elements, an unequivocal, generalized assignment of a single Pattern to all possible instances of that element is impossible. Take, for example, the simple div element of HTML. Instinctively, we might be tempted to say that it matches the Container structural pattern, which is defined as "Any container of a sequence of other substructures and that does not directly contain text. The pattern is meant to represent higher document structures that give shape and organization to a text document, but do not directly include the document content". However, two problems immediately arise: the content model for div implies that it might contain almost any other element in the HTML body, including some that could have been branded with unacceptable patterns, like Milestones or Popups, and, perhaps even more important, a div might directly contain text. So it is safe to say that a div can't always be assigned the Container pattern. Can it always be a Block then? The answer is again no, as a div might be nested within another div, but a Block cannot contain other Block elements, even recursively. Is it always an Inline? No: an Inline cannot have any Container patterns inside it, while a div can hold elements which can easily be other containers, and often it is used only to give shape to the document structure. Having shown that the search for a 1:1, univocal, document-detached representation of the whole of HTML with Structural Patterns is an effort in futility, let us detail a more feasible approach. Given the observations just made, it becomes necessary, from a general methodology standpoint, to consider the assignment of Patterns only within the context of the document, and not with an a priori approach. Of course, for some kinds of markup elements, the identification of their structural pattern is easier than for others, and might even be universally acceptable – it is hard to imagine a br element as anything different from a Milestone pattern. In [DPP12], an interesting method for automatic pattern recognition (from DocBook, whose schema is less lax and ambiguous than HTML) was developed concurrently with my activity. The idea is to search for a subset of the schema on which a certain element could be manually given a preferred, predetermined pattern assignment. A three-step algorithm for pattern recognition is then run on the document. The resulting assignment of a certain pattern depends on the content model of the element and on the patterns of its containers and contained elements. Finally, a disambiguation takes place, with three separate reduction activities, such as a pattern shift reduction when an element is assigned compatible patterns in different places in the document: e.g. if an element is assigned both the Block and the Field pattern, the Block pattern is selected, since Block respects all the requirements for Field as well. This helps to mitigate one of the problems in pattern recognition: different authors use the same element in different ways, or the same author might use the same element in an ambiguous way within the same document. (A minimal illustrative sketch of this reduction idea is given right below.)
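The following Java fragment sketches the pattern shift reduction just described; the class name, the compatibility ordering and the method are hypothetical and purely illustrative, not the actual [DPP12] implementation:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;

    // Illustrative sketch only: collapse multiple compatible pattern
    // assignments for the same element into the most permissive pattern.
    public class PatternShiftReduction {

        // Hypothetical compatibility chain, ordered from least to most
        // permissive: Block satisfies all the requirements of Field,
        // so Block wins when both are assigned.
        private static final List<String> SHIFT_ORDER =
                Arrays.asList("Field", "Block");

        public static String reduce(Set<String> assignedPatterns) {
            String reduced = null;
            for (String pattern : SHIFT_ORDER) {
                if (assignedPatterns.contains(pattern)) {
                    reduced = pattern; // keep the most permissive so far
                }
            }
            // fall back to any assigned pattern outside the chain
            // (assumes a non-empty set of assignments)
            return reduced != null ? reduced
                                   : assignedPatterns.iterator().next();
        }
    }

Called with the set {Block, Field}, reduce would return Block, mirroring the example above.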

Aside from that, we have already commented on the fact that this path is not as feasible with HTML as we would wish, since even in the same document some general-purpose markup elements (such as div and span) can be used in widely different ways, precisely because they were designed to be used in such a fashion. As it is, the approach I decided to adopt still took the results from [DPP12] into account, but adapted them to the circumstances, and, as for the application of all the other lenses, it was a non-automated authorial activity. Firstly, I assigned a single general "standard" pattern to as many HTML markup elements as reasonably possible, such as the already mentioned Milestone pattern to the img and br elements, and listed the more likely candidate patterns to choose from for the other markup elements. Then, it was time to go within the context of the target document structure, and to identify how the "ambiguous" elements were used inside it, and what structural patterns the author had used to build up its content. Even with a well-structured, regularly organized document such as this one, I was bound to encounter a heterogeneity of uses for at least some HTML markup elements, and the results confirmed my suspicion – fortunately, such heterogeneity was more limited than expected, as it was mostly concentrated in the div and span elements. What was important, at this point, was also to understand what I needed to be able to search for within the EARMARK representation of the document, in order to find and distinguish the EARMARK MarkupItems that were to be annotated with the Structure Lens assertions, by myself or by other users in the future. The aim here was to minimize the need to address the items one URI at a time, and to do so only when absolutely inevitable, since doing so for all target components would greatly increase the time and the complexity of the applications. To do so, I would surely need to be able to find EARMARK MarkupItems by general identifier, which was already possible with the EARMARK API. The analysis of the target document under the Document Structure Lens also suggested to me that the ability to select a set of items by the content of one of their attributes could greatly help in selecting the right type of items for the assignment of a Structural Pattern. Consider, for example, the "class" attribute in HTML – it is often used to designate a subcategory of usage for certain general-purpose elements, as was the case within [Mik07]. These necessities drove the development of SLAM, which is detailed in section 6, while the end result of the application of the Lens will be discussed in section 8.2.1. In closing, I also wish to state that, through my analysis, I was able to observe a strong limitation in the definition of the Table pattern, which, in my opinion, strongly reduces its applicability. To quickly refresh our memory, the Table pattern content model is defined as "Container; contains homogeneous elements: true; contains heterogeneous elements: false" – it means that all content within a Table should be a repetition (regardless of its size) of homogeneous elements. This fits perfectly, for example, simple structures like Ordered Lists (ol) or Unordered Lists