Digital Content Annotation and Transcoding for a Listing of Recent Titles in the Artech House Digital Audio and Video Library, Turn to the Back of This Book

Total Page:16

File Type:pdf, Size:1020Kb

Digital Content Annotation and Transcoding for a Listing of Recent Titles in the Artech House Digital Audio and Video Library, Turn to the Back of This Book Digital Content Annotation and Transcoding For a listing of recent titles in the Artech House Digital Audio and Video Library, turn to the back of this book. Digital Content Annotation and Transcoding Katashi Nagao Artech House Boston * London www.artechhouse.com Library of Congress Cataloging-in-Publication Data Nagao, Katashi. Digital content annotation and transcoding / Katashi Nagao. p. cm. — (Artech House digital audio and video library) Includes bibliographical references and index. ISBN 1-58053-337-X (alk. paper) 1. File processing (Computer Science). 2. Data structures (Computer Science). 3. Information storage and retrieval systems. 4. Digital media. 5. File conversion (Computer Science). I. Title. II. Series. QA76.9.F53N34 2003 005.74’1—dc21 2002043691 British Library Cataloguing in Publication Data Nagao, Katashi Digital content annotation and transcoding. — (Artech House digital audio and video library) 1. Digital media 2. Multimedia systems I. Title 006.7 ISBN 1-58053-337-X Cover design by Christina Stone q 2003 ARTECH HOUSE, INC. 685 Canton Street Norwood, MA 02062 All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system without permission in writing from the publisher. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark. International Standard Book Number: 1-58053-337-X Library of Congress Catalog Card Number: 2002043691 10987654321 To Kazue Contents Preface xiii Acknowledgments xv 1 Introduction: Survival in Information Deluge 1 1.1 Digital Content Technology 2 1.2 Problems of On-Line Content 3 1.3 Extension of Digital Content 4 1.4 Organization of This Book 9 1.4.1 Chapter 2—Transcoding: A Technique to Transform Digital Content 9 1.4.2 Chapter 3—Annotation: A Technique to Extend Digital Content 9 1.4.3 Chapter 4—Semantic Annotation and Transcoding: Towards Semantically Sharable Digital Content 9 1.4.4 Chapter 5—Future Directions of Digital Content Technology 10 References 10 2 Transcoding: A Technique to Transform Digital Content 11 2.1 Representation of Digital Content 13 2.1.1 HTML 13 2.1.2 XHTML 15 2.1.3 WML 19 2.1.4 VoiceXML 21 2.1.5 SMIL 23 vii viii Digital Content Annotation and Transcoding 2.2 Content Adaptation 24 2.2.1 XSLTs 25 2.3 Transcoding by Proxy Servers 28 2.3.1 HTML Simplification 28 2.3.2 XSLT Style Sheet Selection and Application 31 2.3.3 Transformation of HTML into WML 32 2.3.4 Image Transcoding 34 2.4 Content Personalization 35 2.4.1 Transcoding for Accessibility 35 2.5 Framework for Transcoding 43 2.5.1 Transcoding Framework in Action 46 2.5.2 Limitations of the Transcoding Framework 49 2.5.3 An Implementation of Transcoding Proxies 50 2.5.4 Technical Issues of Transcoding 54 2.5.5 Deployment Models of Transcoding 55 2.6 More Advanced Transcoding 58 References 59 3 Annotation: A Technique to Extend Digital Content 61 3.1 Frameworks of Metadata 62 3.1.1 Dublin Core 62 3.1.2 Warwick Framework 65 3.1.3 Resource Description Framework 67 3.1.4 MPEG-7 75 3.2 Creation of Annotations 81 3.2.1 Annotea 82 3.2.2 Web Site–Wide Annotation for Accessibility 92 Contents ix 3.3 Applications of Annotation 101 3.3.1 Multimedia Content Transcoding and Distribution 101 3.3.2 Content Access Control 105 3.4 More on Annotation 116 3.4.1 Annotation by Digital Watermarking 116 3.4.2 Semantic Interoperability by Ontology Annotation 117 3.4.3 Resource Linking and Meta-Annotation 122 References 128 4 Semantic Annotation and Transcoding: Towards Semantically Sharable Digital Content 131 4.1 Semantics and Grounding 133 4.1.1 Ontology 133 4.1.2 Information Grounding 135 4.2 Content Analysis Techniques 138 4.2.1 Natural Language Analysis 138 4.2.2 Speech Analysis 149 4.2.3 Video Analysis 152 4.3 Semantic Annotation 155 4.3.1 Annotation Environment 157 4.3.2 Annotation Editor 158 4.3.3 Annotation Server 158 4.3.4 Linguistic Annotation 160 4.3.5 Commentary Annotation 164 4.3.6 Multimedia Annotation 167 4.3.7 Multimedia Annotation Editor 168 4.4 Semantic Transcoding 175 4.4.1 Transcoding Proxy 176 4.5 Text Transcoding 178 4.5.1 Text Summarization 178 x Digital Content Annotation and Transcoding 4.5.2 Language Translation 181 4.5.3 Dictionary-Based Text Paraphrasing 181 4.6 Image Transcoding 188 4.7 Voice Transcoding 189 4.8 Multimedia Transcoding 189 4.8.1 Multimodal Document 191 4.8.2 Video Summarization 191 4.8.3 Video Translation 193 4.8.4 Multimedia Content Adaptation for Mobile Devices 193 4.9 More Advanced Applications 194 4.9.1 Scalable Platform of Semantic Annotations 194 4.9.2 Knowledge Discovery 196 4.9.3 Multimedia Restoration 198 4.10 Concluding Remarks 203 References 204 5 Future Directions of Digital Content Technology 209 5.1 Content Management and Distribution for Media Businesses 210 5.1.1 Rights Management 211 5.1.2 Portal Services on Transcoding Proxies 213 5.1.3 Secure Distribution Infrastructure 214 5.2 Intelligent Content with Agent Technology 214 5.2.1 Agent Augmented Reality 215 5.2.2 Community Support System 217 5.3 Situated Content with Ubiquitous Computing 218 5.3.1 Situation Awareness 219 5.3.2 Augmented Memory 220 5.3.3 Situated Content in Action 220 Contents xi 5.4 Towards Social Information Infrastructures 222 5.4.1 Content Management and Preservation 225 5.4.2 Security Mechanisms 225 5.4.3 Human Interfaces 226 5.4.4 Scalable Systems 227 5.5 Final Remarks 228 References 228 About the Author 231 Index 233 Preface Today digital content is no longer limited to exclusive people. In addition to World Wide Web pages, which are the epitome of digital content, there are many more familiar examples such as i-mode for cell phones, photographs taken by digital cameras, movies captured by digital camcoders, compact discs, music clips downloaded from the Internet, and digital broadcasting. Audiences do not have to care if content is digital or not; however, whether they are produced digitally or not means a great deal. Digitization has made it easier for us to duplicate these products, and as a result, information remains stable. In short, digitization has facilitated the storage of information. Information storage, however, is not the only advantage of digitization. It is more important to recognize that manipulation (like editing) has become automatic to some degree. This is not to say that information was unable to be stored and edited at all before. Long ago, for instance, some could change the information on a paper, such as a Buddhist scripture, into another form, such as inscription on a stone. In modern times, some painters imitate paintings and add their own originality, and even some ordinary people record TV programs to edit to extract their favorite scenes and reproduce their own programs. Digital content enables more because it is programmable to manipulate information. Suppose a person consistently records a favorite program. Its first 15 minutes is boring, but the person wants to see the opening. So the person has to record the whole program, from the beginning to the end. If the structure of the TV program is the same, it should be possible to automatically extract the unwanted first 15 minutes by programming. In another example, it would be difficult to extract a specific part from printed material that contains more than one million pages; it would be very easy, however, to extract the necessary part with a program for extracting topics whose content is digitized. Thus, digital content has largely changed the way of dealing with information. This book describes some experiments of further developing xiii xiv Digital Content Annotation and Transcoding digital content to make it possible for people to customize and enjoy it more interestingly. This book focuses on two technologies: transcoding and annotation. Transcoding is a technology to program the transformation of digital content. This enables content to be changed to the most preferable form for audiences. Annotation is a technology that provides additional information to extend content so as to facilitate the programming. Annotation can also play the role of hyperlink, which connects several on-line resources. The purpose of this book is not only to introduce these two technologies precisely but to propose the system that integrates these two and to prospect the future directions. I have great expectations for the future technology of digital content for I believe that the digital content produced by a number of folks can motivate people to better understand both themselves and world events so that society at large will change for the better. I also believe that it is the principal responsibility for researchers to produce a technology that maximizes the utilization of the digital content. It might be outrageous to intend to change social systems; however, on the condition that these technologies can facilitate the voluntary actions of many people and can turn them to the right direction, people will come to notice the nature of the issue. As a result, these technologies will put the things that may at first seem impossible into practice.
Recommended publications
  • 9.1.1 Lesson 5
    NYS Common Core ELA & Literacy Curriculum Grade 9 • Module 1 • Unit 1 • Lesson 5 D R A F T 9.1.1 Lesson 5 Introduction In this lesson, students continue to practice reading closely through annotating text. Students will begin learning how to annotate text using four annotation codes and marking their thinking on the text. Students will reread the Stage 1 epigraph and narrative of "St. Lucy’s Home for Girls Raised by Wolves," to “Neither did they” (pp. 225–227). The text annotation will serve as a scaffold for teacher- led text-dependent questions. Students will participate in a discussion about the reasons for text annotation and its connection to close reading. After an introduction and modeling of annotation, students will practice annotating the text with a partner. As student partners practice annotation, they will discuss the codes and notes they write on the text. The annotation codes introduced in this lesson will be used throughout Unit 1. For the lesson assessment, students will use their annotation as a tool to find evidence to independently answer, in writing, one text-dependent question. For homework, students will continue to read their Accountable Independent Reading (AIR) texts, this time using a focus standard to guide their reading. Standards Assessed Standard(s) RL.9-10.1 Cite strong and thorough textual evidence to support analysis of what the text says explicitly as well as inferences drawn from the text. Addressed Standard(s) RL.9-10.3 Analyze how complex characters (e.g., those with multiple or conflicting motivations) develop over the course of a text, interact with other characters, advance the plot or develop the theme.
    [Show full text]
  • A Super-Lightweight Annotation Tool for Experts
    SLATE: A Super-Lightweight Annotation Tool for Experts Jonathan K. Kummerfeld Computer Science & Engineering University of Michigan Ann Arbor, MI, USA [email protected] Abstract we minimise the time cost of installation by imple- menting the entire system in Python using built-in Many annotation tools have been developed, covering a wide variety of tasks and pro- libraries. Third, we follow the Unix Tools Philos- viding features like user management, pre- ophy (Raymond, 2003) to write programs that do processing, and automatic labeling. However, one thing well, with flat text formats. In our case, all of these tools use Graphical User Inter- (1) the tool only does annotation, not tokenisation, faces, and often require substantial effort to automatic labeling, file management, etc, which install and configure. This paper presents a are covered by other tools, and (2) data is stored in new annotation tool that is designed to fill the a format that both people and command line tools niche of a lightweight interface for users with like grep can easily read. a terminal-based workflow. SLATE supports annotation at different scales (spans of char- SLATE supports annotation of items that are acters, tokens, and lines, or a document) and continuous spans of either characters, tokens, of different types (free text, labels, and links), lines, or documents. For all item types there are with easily customisable keybindings, and uni- three forms of annotation: labeling items with cat- code support. In a user study comparing with egories, writing free-text labels, or linking pairs other tools it was consistently the easiest to in- of items.
    [Show full text]
  • Bibliography of Erik Wilde
    dretbiblio dretbiblio Erik Wilde's Bibliography References [1] AFIPS Fall Joint Computer Conference, San Francisco, California, December 1968. [2] Seventeenth IEEE Conference on Computer Communication Networks, Washington, D.C., 1978. [3] ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, Los Angeles, Cal- ifornia, March 1982. ACM Press. [4] First Conference on Computer-Supported Cooperative Work, 1986. [5] 1987 ACM Conference on Hypertext, Chapel Hill, North Carolina, November 1987. ACM Press. [6] 18th IEEE International Symposium on Fault-Tolerant Computing, Tokyo, Japan, 1988. IEEE Computer Society Press. [7] Conference on Computer-Supported Cooperative Work, Portland, Oregon, 1988. ACM Press. [8] Conference on Office Information Systems, Palo Alto, California, March 1988. [9] 1989 ACM Conference on Hypertext, Pittsburgh, Pennsylvania, November 1989. ACM Press. [10] UNIX | The Legend Evolves. Summer 1990 UKUUG Conference, Buntingford, UK, 1990. UKUUG. [11] Fourth ACM Symposium on User Interface Software and Technology, Hilton Head, South Carolina, November 1991. [12] GLOBECOM'91 Conference, Phoenix, Arizona, 1991. IEEE Computer Society Press. [13] IEEE INFOCOM '91 Conference on Computer Communications, Bal Harbour, Florida, 1991. IEEE Computer Society Press. [14] IEEE International Conference on Communications, Denver, Colorado, June 1991. [15] International Workshop on CSCW, Berlin, Germany, April 1991. [16] Third ACM Conference on Hypertext, San Antonio, Texas, December 1991. ACM Press. [17] 11th Symposium on Reliable Distributed Systems, Houston, Texas, 1992. IEEE Computer Society Press. [18] 3rd Joint European Networking Conference, Innsbruck, Austria, May 1992. [19] Fourth ACM Conference on Hypertext, Milano, Italy, November 1992. ACM Press. [20] GLOBECOM'92 Conference, Orlando, Florida, December 1992. IEEE Computer Society Press. http://github.com/dret/biblio (August 29, 2018) 1 dretbiblio [21] IEEE INFOCOM '92 Conference on Computer Communications, Florence, Italy, 1992.
    [Show full text]
  • Browser Security Features
    Browser Security features Michael Sonntag Institute of Networks and Security Cookies: Securing them Attention: These are “requests” by the server setting the cookie Browsers will follow them, but applications not necessarily Secure/HTTPS-only: Do not send unencrypted This is an element of the Cookie header itself “Set-Cookie: “ …content, domain, expiration… “;Secure” Often the application contains an option to set this automatically HTTP-only: No access by JavaScript This is an element of the Cookie header itself “Set-Cookie: “ …content, domain, expiration… “;HttpOnly” Host-only: Do not set the “Domain” attribute Not set: Send to exactly this host only Domain set: Send to every host at or under this domain Priority: When too many cookies from a single domain, delete those of low priority first Not really a security feature! 2 Cookies: Securing them SameSite: Cookie should not be sent with cross-site requests (some CSRF-prevention; prevent cross-origin information leakage) “Strict” : Never cross origin; not even when clicking on a link on site A leading to B the Cookie set from B is actually sent to B “Lax” (default): Sent when clicking on most links, but not with POST requests: “Same-Site” and “cross-site top-level navigation” Not as good as strict: E.g. “<link rel='prerender’>” is a same-site request fetched automatically (and kept in the background!) Sent: GET requests leading to a top-level target (=URL in address bar changes; but may contain e.g. path) I.e.: Will not be sent for iframes, images, XMLHttpRequests
    [Show full text]
  • Web Application Security Standards
    WEB APPLICATION SECURITY STANDARDS It is much more secure to be feared than to be loved. www.ideas2it.com Security Headers ● Content-Security-Policy ○ Reduce XSS risks on modern browsers ● X-Content-Type-Options ○ Prevent browsers from trying to guess (“sniff”) the MIME type ● X-XSS-Protection ○ Stops pages from loading when they detect reflected cross-site scripting (XSS) ● X-Frame-Options ○ Prevent the site from clickjacking attacks www.ideas2it.com Security Headers ● Strict-Transport-Security ○ Lets a website tell browsers that it should only be accessed using HTTPS ● Referrer-Policy ○ Controls how much referrer information (sent via the Referer header) should be included with requests. ● Feature-policy ○ Provides a mechanism to allow and deny the use of browser features. www.ideas2it.com Request : digicontent.com/assets/css/styles.css Request : digicontent.com/assets/js/filter.js Request : malicious.com/assets/js/xss.js Content-Security-Policy: default-src digicontent.com Content Security Policy Browser Asset Sniff asset to determine MIME type Based on content MIME type = HTML MIME Sniffing HSTS enabled Client origin server Request : http://digicontent.com Response : https://digicontent.com HTTP Strict Transport Security Cross Site Scripting - XSS ● Stealing other user’s cookies ● Stealing their private information ● Performing actions on behalf of other users ● Redirecting to other websites ● Showing ads in hidden IFRAMES and pop-ups www.ideas2it.com Cross Site Scripting (XSS) Secure cache control settings ● Max-age ● No-cache ● No-store ● Public ● Private ● Must-Revalidate ● Proxy-Revalidate www.ideas2it.com Request : digicontent.com/styles.css digicontent.com/styles.css Cache-Control : max-age = 3600 3600s Receive styles.css Store styles.css Browser Cache Cache-Control www.ideas2it.com Cookie attributes ● HTTP-ONLY ● SECURE ● Domain ● SameSite (Strict/Lax/None) ● Path www.ideas2it.com Cookie : Same Site Vulnerable TLS SSL vs.
    [Show full text]
  • Web Application Report
    Web Application Report This report includes important security information about your web application. Security Report This report was created by IBM Security AppScan Standard 9.0.3.13, Rules: 18533 Scan started: 6/2/2020 10:39:20 AM Table of Contents Introduction General Information Login Settings Summary Issue Types Vulnerable URLs Fix Recommendations Security Risks Causes WASC Threat Classification Issues Sorted by Issue Type Cross-Site Scripting 28 Link to Non-Existing Domain Found 2 SQL Injection File Write (requires user verification) 1 Check for SRI (Subresource Integrity) support 1 Credit Card Number Pattern Found (Visa) Over Unencrypted Connection 1 Google Sitemap File Detected 1 Hidden Directory Detected 6 Missing or insecure "Content-Security-Policy" header 5 Unsafe third-party link (target="_blank") 22 Fix Recommendations Remove the non-existing domain from the web site Review possible solutions for hazardous character injection Add the attribute rel = "noopener noreferrer" to each link element with target="_blank" 6/3/2020 1 Add to each third-party script/link element support to SRI(Subresource Integrity). Config your server to use the "Content-Security-Policy" header with secure policies Issue a "404 - Not Found" response status code for a forbidden resource, or remove it completely Remove credit card numbers from your website Validate that your Google Sitemap file only contains URLs that should be publicly available and indexed Advisories Cross-Site Scripting Link to Non-Existing Domain Found SQL Injection File Write (requires user verification) Check for SRI (Subresource Integrity) support Credit Card Number Pattern Found (Visa) Over Unencrypted Connection Google Sitemap File Detected Hidden Directory Detected Missing or insecure "Content-Security-Policy" header TargetBlankLink Application Data Cookies JavaScripts Parameters Comments Visited URLs Failed Requests Filtered URLs 6/3/2020 2 Introduction This report contains the results of a web application security scan performed by IBM Security AppScan Standard.
    [Show full text]
  • Active Learning with a Human in the Loop
    M T R 1 2 0 6 0 3 MITRETECHNICALREPORT Active Learning with a Human In The Loop Seamus Clancy Sam Bayer Robyn Kozierok November 2012 The views, opinions, and/or findings contained in this report are those of The MITRE Corpora- tion and should not be construed as an official Government position, policy, or decision, unless designated by other documentation. This technical data was produced for the U.S. Government under Contract No. W15P7T-12-C- F600, and is subject to the Rights in Technical Data—Noncommercial Items clause (DFARS) 252.227-7013 (NOV 1995) c 2013 The MITRE Corporation. All Rights Reserved. M T R 1 2 0 6 0 3 MITRETECHNICALREPORT Active Learning with a Human In The Loop Sponsor: MITRE Innovation Program Dept. No.: G063 Contract No.: W15P7T-12-C-F600 Project No.: 0712M750-AA Seamus Clancy The views, opinions, and/or findings con- Sam Bayer tained in this report are those of The MITRE Corporation and should not be construed as an Robyn Kozierok official Government position, policy, or deci- sion, unless designated by other documenta- tion. Approved for Public Release. November 2012 c 2013 The MITRE Corporation. All Rights Reserved. Abstract Text annotation is an expensive pre-requisite for applying data-driven natural language processing techniques to new datasets. Tools that can reliably reduce the time and money required to construct an annotated corpus would be of immediate value to MITRE’s sponsors. To this end, we have explored the possibility of using active learning strategies to aid human annotators in performing a basic named entity annotation task.
    [Show full text]
  • The Influence of Text Annotation Tools on Print and Digital Reading Comprehension
    28 The influence of text annotation tools on print and digital reading comprehension The Influence of Text Annotation Tools on Print and Digital Reading Comprehension Gal Ben-Yehudah Yoram Eshet-Alkalai The Open University of Israel The Open University of Israel [email protected] [email protected] Abstract Recent studies that compared reading comprehension between print and digital displays found a disadvantage for digital reading. A separate line of research has shown that effective use of learning strategies and active learning tools (e.g. annotation tools: highlighting, underlining and note- taking) can improve reading comprehension. Here, the influence of annotation tools on comprehension and their utility in diminishing the gap between print and digital reading was investigated. In a 2 X 2 experimental design, ninety three undergraduates (mean age 29.6 years, 72 women) read an academic essay in print or in digital format, with or without the use of annotation tools. In general, the results reinforce previous research reports on the inferiority of digital compared to print reading comprehension. Our findings indicate that for factual-level questions using annotation tools had no effect on comprehension in both formats. For the inference-level questions, however, use of annotation tools improved comprehension of printed text but not of digital text. In other words, text-annotation in the digital format had no effect on reading comprehension. In the digital era, digital reading is the norm rather than the exception; thus, it is essential to improve our understanding of the factors that influence the comprehension of digital text. Keywords: digital reading, print reading, annotation tools, learning strategies, self regulated learning, monitoring.
    [Show full text]
  • SECURE PROGRAMMING TECHNIQUES Web Application Security
    SECURE PROGRAMMING TECHNIQUES Web application security • SQL injection • Parameterized statements • Other injections • Cross-site scripting • Persistent and reflected XSS • AJAX and cross-domains access • Javascript • Cross-Site Request Forgery • Clickjacking • OAuth 2 and OpenID Connect • PHP security MEELIS ROOS 1 SECURE PROGRAMMING TECHNIQUES SQL injection • A SQL injection attack consists of insertion or "injection" of a SQL query via the input data from the client to the application • A successful SQL injection exploit can read sensitive data from the database, modify database data (Insert/Update/Delete), execute administration operations on the database (such as shut down the DBMS), recover the content of a given file present on the DBMS file system and in some cases issue commands to the operating system • Blind SQL injection — when you don’t get to see the query output – But you can see whether error messages appear or not, or how long the query takes MEELIS ROOS 2 SECURE PROGRAMMING TECHNIQUES Fixing SQL injection • Input filtering — only harmless input parameters allowed • Input escaping — dangerous characters are allowed but escaped • Explicit type conversions in SQL — int() etc • Parameterized statements with type-aware parameter substitution – Parameters are sent as metadata (in a structured way, not inline) – Also allows for performance gains with most databases • Stored procedures — fixes SQL query structure, parameters still might need validation MEELIS ROOS 3 SECURE PROGRAMMING TECHNIQUES Parameterized statements • Java with
    [Show full text]
  • A Multi-Layer Markup Language for Geospatial Semantic Annotations Ludovic Moncla, Mauro Gaio
    A multi-layer markup language for geospatial semantic annotations Ludovic Moncla, Mauro Gaio To cite this version: Ludovic Moncla, Mauro Gaio. A multi-layer markup language for geospatial semantic annotations. 9th Workshop on Geographic Information Retrieval, Nov 2015, Paris, France. 10.1145/2837689.2837700. hal-01256370 HAL Id: hal-01256370 https://hal.archives-ouvertes.fr/hal-01256370 Submitted on 1 Jan 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A Multi-Layer Markup Language for Geospatial Semantic Annotations ∗ † Ludovic Moncla Mauro Gaio Universidad de Zaragoza LIUPPA - EA 3000 C/ María de Luna, 1 - Zaragoza, Spain Av. de l’Université - Pau, France [email protected] [email protected] ABSTRACT 1 Motivation and Background In this paper we describe a markup language for semantically Two categories of markup languages can be considered in annotating raw texts. We define a formal representation of geospatial tasks, those dedicated to the encoding of spatial text documents written in natural language that can be ap- data (e.g., gml, kml) and those dedicated to the annota- plied for the task of Named Entities Recognition and Spatial tion of spatial or spatio-temporal information in texts, such Role Labeling.
    [Show full text]
  • BOEMIE Ontology-Based Text Annotation Tool
    BOEMIE ontology-based text annotation tool Pavlina Fragkou, Georgios Petasis, Aris Theodorakos, Vangelis Karkaletsis, Constantine D. Spyropoulos Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications National Center for Scientific Research (N.C.S.R.) “Demokritos” P. Grigoriou and Neapoleos Str., 15310 Aghia Paraskevi Attikis, Greece E-mail: {fragou, petasis, artheo, vangelis, costass}@ iit.demokritos.gr Abstract The huge amount of the available information in the Web creates the need of effective information extraction systems that are able to produce metadata that satisfy user’s information needs. The development of such systems, in the majority of cases, depends on the availability of an appropriately annotated corpus in order to learn extraction models. The production of such corpora can be significantly facilitated by annotation tools that are able to annotate, according to a defined ontology, not only named entities but most importantly relations between them. This paper describes the BOEMIE ontology-based annotation tool which is able to locate blocks of text that correspond to specific types of named entities, fill tables corresponding to ontology concepts with those named entities and link the filled tables based on relations defined in the domain ontology. Additionally, it can perform annotation of blocks of text that refer to the same topic. The tool has a user-friendly interface, supports automatic pre-annotation, annotation comparison as well as customization to other annotation schemata. The annotation tool has been used in a large scale annotation task involving 3000 web pages regarding athletics. It has also been used in another annotation task involving 503 web pages with medical information, in different languages.
    [Show full text]
  • PDF Annotation with Labels and Structure
    PAWLS: PDF Annotation With Labels and Structure Mark Neumann Zejiang Shen Sam Skjonsberg Allen Institute for Artificial Intelligence {markn,shannons,sams}@allenai.org Abstract the extracted text stream lacks any organization or markup, such as paragraph boundaries, figure Adobe’s Portable Document Format (PDF) is placement and page headers/footers. a popular way of distributing view-only docu- Existing popular annotation tools such as BRAT ments with a rich visual markup. This presents a challenge to NLP practitioners who wish to (Stenetorp et al., 2012) focus on annotation of user use the information contained within PDF doc- provided plain text in a web browser specifically uments for training models or data analysis, designed for annotation only. For many labeling because annotating these documents is diffi- tasks, this format is exactly what is required. How- cult. In this paper, we present PDF Annota- ever, as the scope and ability of natural language tion with Labels and Structure (PAWLS), a new processing technology goes beyond purely textual annotation tool designed specifically for the processing due in part to recent advances in large PDF document format. PAWLS is particularly suited for mixed-mode annotation and scenar- language models (Peters et al., 2018; Devlin et al., ios in which annotators require extended con- 2019, inter alia), the context and media in which text to annotate accurately. PAWLS supports datasets are created must evolve as well. span-based textual annotation, N-ary relations In addition, the quality of both data collection and freeform, non-textual bounding boxes, all and evaluation methodology is highly dependent of which can be exported in convenient for- on the particular annotation/evaluation context in mats for training multi-modal machine learn- which the data being annotated is viewed (Joseph ing models.
    [Show full text]