The WANDAML Markup Language for Digital Document Annotation
Total Page:16
File Type:pdf, Size:1020Kb
The WANDAML markup language for digital document annotation Katrin Franke Isabelle Guyon¤ Fraunhofer Institute, Berlin, Germany ClopiNet, 955 Creston Rd, Berkeley, USA. [email protected] [email protected] Lambert Schomaker Louis Vuurpijl Rijksuniversiteit Groningen, The Netherlands. University of Nijmegen, The Netherlands. [email protected] [email protected] Abstract database technology can then be used to apply logical constraints to the search process. WANDAML is an XML-based markup language for Although our work is very application oriented, it is the annotation and filter journaling of digital docu- not particularly application specific. Our format lends it- ments. It addresses in particular the needs of forensic self to promoting research and development of handwrit- handwriting data examination, by allowing experts to ing analysis methods, and establishing common grounds enter information about writer, material (pen, paper), for international exchange of handwritten data samples script and content, and to record chains of image fil- and annotations. It supports on-line, off-line handwrit- tering and feature extraction operations applied to the ing and overlays of on-line over off-line data. Its sim- data. Annotations may be organized in a structure that ple and general structure makes it fit for annotating data reflects the document layout via a hierarchy of document with other modalities, including printed text, pictures, regions. WANDAML lends itself to a variety of applica- and sound. tions, including the annotation all kinds of handwriting To ensure long-term operation we built upon the documents (on-line or off-line), images of printed text, eXtensible Markup Language (XML) [4] that follows medical images, and satellite images. world-wide standardized syntax rules, recommended by Keywords: Handwriting, forensic data, XML, annota- the World Wide Web Consortium (W3C). XML allows tions, data format, document analysis. users to define their own markup language respecting the basic XML syntax. As an ASCII format, XML is human readable and can be edited with regular text pro- 1 Introduction cessors, although there is a large body of existing soft- ware, which manipulates XML. In particular, XML and We designed an XML-based markup language to Java are very complementary: XML is a widely ac- annotate digital documents, called WANDAML . This cepted means of creating portable data, and the Java pro- markup language is designed for processing, analyzing gramming language provides portable code abilities. We and storing handwriting samples, in application to foren- wrote Java applications for WANDAML as part of the sic handwriting examination and writer identification. WANDA project [6, 5]. For this application, particular specifications are met to Before undertaking the task of defining a new XML ensure objectivity and reproducibility of the processing language for forensic applications, we reviewed exist- steps and examination results [6, 5]. Writer identifica- ing standards that are in use in the handwriting recogni- tion may never be as accurate as iris or DNA-based iden- tion and document analysis community. We distinguish tification. However, other information including age, between formats to encode data and formats to anno- handedness, and script style may reduce the size of a tate data. WANDAML is essentially a data annotation reference set to the extent that automatic writer identifi- format. It is compatible with a variety of data encod- cation becomes viable. To that end, a portable and ex- ing formats. For image data, typical examples of data tensible data format is needed for modelling the knowl- encoding formats would be JPEG, GIF, and TIFF. For edge from the forensic application domain. A standard vectorial data (graphics represented by their coordinate ¤Corresponding author. points, not by pixel values) we can cite the Scalable- 1 IML UP SVG IAM XM TV WANDA p p p p p p <my_tag my_attribute="my_value"> XML-based p p p p p <my_element/> raster images p p p p </my_tag> vector images p p p 1 virtual ink p p where, <my tag/> and <my element/> are tags v-ink segments p p p p or “elements”, and my attribute is an attribute of image regions p p my tag, having value my value. In DTDs, “entities” ink/img overlay p p p are defined, which may be used as lists of admissible device annot. p p p attribute values. writer annot. p p script annot. We adopt a minimum set of conventions carried p throughout WANDAML : material annot. p p p content annot. p p p – Entities, attributes and elements contain only low- filters/plugins p p ercase and underscore characters. interactivity p p layers p p – Certain attributes have special meanings: id a styles p p p type external link unique identifier, a type from a pre-defined list, label an optional user-defined string that can Table 1. Existing format comparison. be used for search purpose. IML=InkML, UP=Unipen, XM=Xmillum, – To simplify parsing, we have defined a number TV=Trueviz. of “containers” (<pages/>, <filters/>, <annotations/>, <inputs/>, and Vector Graphics (SVG) format. <outputs/>), which contain elements of Several handwritten document annotation formats the same name (<page/>, <filter/>, etc.) predate WANDAML . Image annotation formats include and have the optional attribute number of. IAM [9], Xmillum [11] and Trueviz [8]. Formats for virtual ink or “electronic ink” combining both data en- Markup languages based on XML are extensible via coding and data annotation, include Unipen [7] and the use of name spaces. A body of XML text may be InkML [3]. We summarize the features of the existing enclosed between tags with an opening tag containing formats in Table 1. We outline in bold the features that the attribute xmlns (XML name space.) The value of are required for forensic applications. No standard pre- the attribute xmlns is a unique name reserved to iden- dating WANDAML meets all of our requirements. Re- tify the definition of a particular XML subset. The URI lated work is discussed in more details in our technical (Uniform Resource Identifier) of a DTD file or a URL report [1]. (Uniform Resource Locators) are frequently used for WANDAML is the result of joint efforts of forensic that purpose. The use of name spaces in WANDAML handwriting experts and researchers in computer-based allows us to separate the definition of the basic language handwriting analysis. Well-established termini and pro- skeleton from XML subsets that are application specific. cedures of forensic science inspired many aspects of the design [2, 10, 12]. WANDAML also benefitted from the 3 A simple scenario experience of some of the team members who invented Unipen, a data format for on-line handwriting [7]. Two of our team members are also involved in the InkML To understand the WANDAML annotation mecha- consortium [3] and WANDAML is currently under eval- nisms and concepts, it is useful to first understand what uation by an InkML subgroup. it was originally intended for by analyzing an example. The concepts introduced in this Section will be later de- fined formally in Section 4. In Figure 1, we show an 2 Notations and conventions example of forensic evidence. The tasks of a computer- assisted human expert analyzing this document may in- A particular XML markup language such as WAN- clude: DAML can be defined in standard ways using the document type definition language (DTD) or XML 1. outlining regions of interest, schemas [4]. We provide DTDs to specify WANDAML [1]. 2. enhancing image quality, Following the general XML nomenclature, in XML 1We often use the shorthand <my tag/> for empty tags annotation we have: <my tag></my tag> or tags whose content is not expanded. 2 3. measuring handwriting characteristics (size of characters, slant, etc.), 4. translating handwriting into typed text, 5. retracing characters with electronic ink, 6. annotating defined regions with information rela- tive to content (threat, envelope, form, etc.), writer (age, gender, etc.), script (handprinted, cursive, etc.), and material (paper, pen, etc.) A document annotation file always start with a root element <wandoc/>: <wandoc id="20032004" label="Anthrax example case" Figure 1. Wanda document annotation us- xmlns="DTD-HOME.html" ing regions. This picture of two envelopes /> from suspects in the recent anthrax crim- inal case provide us with an example of The id is a machine generated unique document an- possible use of our WANDA regions: A re- notation identifier, while the label is provided by the gion of interest is delineated, isolating one user. The name space xmlns points to the wandoc lan- envelope from the page. A filter is applied guage definition. to the region to clean the data. A hierarchy In our example (Figure 1), the <wandoc/> element of regions (e.g. in address block, lines, contains a single page: words, and characters) is defined and an- notated. Source: US Federal Bureau of <wandoc ...> Investigations. <pages numberOf="1"> <page id="2003_5" label="frontpage" next=""/> </pages> <filters number_of="1"> </wandoc> <filter type="import" label="ibisScan"> By convention, we replace tag attributes by “...” to <inputs> shorten the description. The page contains one call to a <input type="stream" filter, two annotations, and one region: number="1" xmlns="scan.dtd"> <page ...> <scan/> <filters number_of="1"/> </input> <annotations number_of="2"/> </inputs> <regions number_of="1"/> <module type="extern" </page> exec="ibis.exe"> <meta version="3.51" /> In the next paragraph, we expand the <filters </module> ...> tag. In this example, a filter is importing an <outputs> image from a scanner using the a scan software called <output type="file"> ”IBIS” and returning a link to the resulting image. In <wanda\_link the wandoc framework, a document consists of one or href="copy5.tif" /> several pages, each of which may be represented by an </output> image. Notice that there is no need for a special tag </outputs> <image/>.