COLING·ACL 2006
Frontiers in Linguistically Annotated Corpora 2006
A Merged Workshop with the 7th International Workshop on Linguistically Interpreted Corpora (LINC-2006) and Frontiers in Corpus Annotation III
Proceedings of the Workshop
Chairs: Adam Meyers, Shigeko Nariyama, Timothy Baldwin and Francis Bond
22 July 2006
Sydney, Australia

Production and Manufacturing by BPA Digital, 11 Evans St, Burwood VIC 3125, Australia
© 2006 The Association for Computational Linguistics
Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]
ISBN 1-932432-78-7
Table of Contents
Preface ...... v
Organizers ...... vii
Workshop Program ...... ix
Challenges for Annotating Images for Sense Disambiguation Cecilia Ovesdotter Alm, Nicolas Loeff and David A. Forsyth ...... 1
A Semi-Automatic Method for Annotating a Biomedical Proposition Bank Wen-Chi Chou, Richard Tzong-Han Tsai, Ying-Shan Su, Wei Ku, Ting-Yi Sung and Wen-Lian Hsu ...... 5
How and Where do People Fail with Time: Temporal Reference Mapping Annotation by Chinese and English Bilinguals Yang Ye and Steven Abney ...... 13
Probing the Space of Grammatical Variation: Induction of Cross-Lingual Grammatical Constraints from Treebanks Felice Dell’Orletta, Alessandro Lenci, Simonetta Montemagni and Vito Pirrelli ...... 21
Frontiers in Linguistic Annotation for Lower-Density Languages Mike Maxwell and Baden Hughes ...... 29
Annotation Compatibility Working Group Report Adam Meyers ...... 38
Manual Annotation of Opinion Categories in Meetings Swapna Somasundaran, Janyce Wiebe, Paul Hoffmann and Diane Litman ...... 54
The Hinoki Sensebank — A Large-Scale Word Sense Tagged Corpus of Japanese — Takaaki Tanaka, Francis Bond and Sanae Fujita ...... 62
Issues in Synchronizing the English Treebank and PropBank Olga Babko-Malaya, Ann Bies, Ann Taylor, Szuting Yi, Martha Palmer, Mitch Marcus, Seth Kulick and Libin Shen ...... 70
On Distance between Deep Syntax and Semantic Representation Václav Novák ...... 78
Corpus Annotation by Generation Elke Teich, John A. Bateman and Richard Eckart ...... 86
Constructing an English Valency Lexicon Jiří Semecký and Silvie Cinková ...... 94
Author Index ...... 99
Preface
Large linguistically interpreted corpora play an increasingly important role for machine learning, evaluation, psycholinguistics as well as theoretical linguistics. Many research groups are engaged in the creation of corpus resources annotated with morphological, syntactic, semantic, discourse and other linguistic information for a variety of languages. This workshop brings together researchers interested in the annotation of linguistically interpreted corpora by combining two workshops: Frontiers in Linguistic Annotation III and the 7th International Workshop on Linguistically Interpreted Corpora (LINC-2006). The goal of the workshop is to identify and disseminate best practice in the development and utilization of linguistically interpreted corpora.
We would like to thank all the authors who submitted papers. There were 19 submissions, of which 10 appear in the proceedings. We would like to particularly thank all the members of the program committee for their time and effort in ensuring that the papers were fairly assessed, and for the useful comments they provided.
In addition to regular papers, the workshop features presentations by working groups on two topics:
Annotation Compatibility: A roadmap of the compatibility of current annotation schemes with each other: Annotation Compatibility Working Group Report.
Low-density Languages: A discussion of low density languages and the problems associated with them: Frontiers in Linguistic Annotation for Lower-Density Languages.
The Innovative Student Annotation Award, for best paper by a student, went to Václav Novák for his paper On Distance between Deep Syntax and Semantic Representation.
Organizers
Chairs:
Timothy Baldwin, University of Melbourne
Francis Bond, NTT Communication Science Laboratories
Adam Meyers, New York University
Shigeko Nariyama, University of Melbourne
Program Committee:
Lars Ahrenberg, Linköpings Universitet
Kathy Baker, U.S. Dept. of Defense
Steven Bird, University of Melbourne
Alex Chengyu Fang, City University Hong Kong
David Farwell, Computing Research Laboratory, New Mexico State University
Chuck Fillmore, International Computer Science Institute, Berkeley
Anette Frank, DFKI
John Fry, SRI International
Eva Hajičová, Center for Computational Linguistics, Charles University, Prague
Erhard W. Hinrichs, University of Tübingen
Ed Hovy, University of Southern California
Baden Hughes, University of Melbourne
Emi Izumi, NICT
Aravind Joshi, University of Pennsylvania, Philadelphia
Sergei Nirenburg, University of Maryland, Baltimore County
Stephan Oepen, University of Oslo
Boyan A. Onyshkevych, U.S. Dept. of Defense
Kyonghee Paik, KLI
Martha Palmer, University of Colorado
Manfred Pinkal, DFKI
Massimo Poesio, University of Essex
Owen Rambow, Columbia University
Peter Rossen Skadhauge, Copenhagen Business School
Beth Sundheim, SPAWAR Systems Center
Jia-Lin Tsai, Tung Nan Institute of Technology
Janyce Wiebe, University of Pittsburgh
Nianwen Xue, University of Pennsylvania
Working Group Organizers:
Adam Meyers, New York University
Mike Maxwell, University of Maryland
Workshop Program
Saturday, 22 July 2006

09:00–09:10  Opening Remarks
09:10–09:30  Challenges for Annotating Images for Sense Disambiguation
             Cecilia Ovesdotter Alm, Nicolas Loeff and David A. Forsyth
09:30–10:00  A Semi-Automatic Method for Annotating a Biomedical Proposition Bank
             Wen-Chi Chou, Richard Tzong-Han Tsai, Ying-Shan Su, Wei Ku, Ting-Yi Sung and Wen-Lian Hsu
10:00–10:30  How and Where do People Fail with Time: Temporal Reference Mapping Annotation by Chinese and English Bilinguals
             Yang Ye and Steven Abney
10:30–11:00  Coffee Break
11:00–11:30  Probing the Space of Grammatical Variation: Induction of Cross-Lingual Grammatical Constraints from Treebanks
             Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni and Vito Pirrelli
11:30–12:00  Frontiers in Linguistic Annotation for Lower-Density Languages
             Mike Maxwell and Baden Hughes
12:00–12:30  Annotation Compatibility Working Group Report
             Adam Meyers
12:30–14:00  Lunch
14:00–14:30  Manual Annotation of Opinion Categories in Meetings
             Swapna Somasundaran, Janyce Wiebe, Paul Hoffmann and Diane Litman
14:30–15:00  The Hinoki Sensebank — A Large-Scale Word Sense Tagged Corpus of Japanese —
             Takaaki Tanaka, Francis Bond and Sanae Fujita
15:00–15:30  Issues in Synchronizing the English Treebank and PropBank
             Olga Babko-Malaya, Ann Bies, Ann Taylor, Szuting Yi, Martha Palmer, Mitch Marcus, Seth Kulick and Libin Shen
15:30–16:00  Coffee Break
16:00–16:30  On Distance between Deep Syntax and Semantic Representation
             Václav Novák
16:30–17:30  Discussion
17:30–17:40  Closing Remarks

Alternate papers:
Corpus Annotation by Generation
             Elke Teich, John A. Bateman and Richard Eckart
Constructing an English Valency Lexicon
             Jiří Semecký and Silvie Cinková
Challenges for annotating images for sense disambiguation
Cecilia Ovesdotter Alm, Dept. of Linguistics, University of Illinois, UC ([email protected])
Nicolas Loeff, Dept. of Computer Science, University of Illinois, UC ([email protected])
David A. Forsyth, Dept. of Computer Science, University of Illinois, UC ([email protected])
Abstract

We describe an unusual data set of thousands of annotated images with interesting sense phenomena. Natural language image sense annotation involves increased semantic complexities compared to disambiguating word senses when annotating text. These issues are discussed and illustrated, including the distinction between word senses and iconographic senses.

1 Introduction

We describe a set of annotated images, each associated with a sense of a small set of words. Building this data set exposes important sense phenomena which not only involve natural language but also vision. The context of our work is Image Sense Discrimination (ISD), where the task is to assign one of several senses to a web image retrieved by an ambiguous keyword. A companion paper introduces the task, presents an unsupervised ISD model, drawing on web page text and image features, and shows experimental results (Loeff et al., 2006). The data was subject to single-annotator labeling, with verification judgements on a part of the data set as a step toward studying agreement. Besides a test bed for ISD, the data set may be applicable to e.g. multimodal word sense disambiguation and cross-language image retrieval. The issues discussed concern concepts, and involve insights into semantics, perception, and knowledge representation, while opening up a bridge for interdisciplinary work involving vision and NLP.

2 Related work

The complex relationship between annotations and images has been explored by the library community, who study management practices for image collections, and by the computer vision community, who would like to provide automated image retrieval tools and possibly learn object recognition methods.

Commercial picture collections are typically annotated by hand, e.g. (Enser, 1993; Armitage and Enser, 1997; Enser, 2000). Subtle phenomena can make this very difficult, and content vs. interpretation may differ; an image of the Eiffel tower could be annotated with Paris or even love, e.g. (Armitage and Enser, 1997), and the resulting annotations are hard to use, cf. (Markkula and Sormunen, 2000), or Enser's result that a specialized indexing language gives only a "blunt pointer to regions of the Hulton collections" (Enser, 1993, p. 35).

Users of image collections have been well studied. Important points for our purposes are: users request images both by object kinds and individual identities; users request images both by what they depict and by what they are about; and text associated with images is extremely useful in practice, with newspaper archivists indexing largely on captions (Markkula and Sormunen, 2000).

The computer vision community has studied methods to predict annotations from images, e.g. (Barnard et al., 2003; Jeon et al., 2003; Blei and Jordan, 2002). The annotations that are predicted most successfully tend to deal with materials whose identity can be determined without shape analysis, like sky, sea and the like. More complex annotations remain difficult. There is no current theory of word sense in this context, because in most current collections, words appear in the most common sense only. Sense is known to be important, and image information can disambiguate word senses (Barnard and Johnson, 2005).
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 1–4, Sydney, July 2006. © 2006 Association for Computational Linguistics

BASS (2881 annotated images; 5 query terms: bass, bass guitar, bass instrument, bass fishing, sea bass)
  1. fish*                         35%  any fish, people holding catch
  2. musical instrument*           28%  any bass-looking instrument, playing
  3. related: fish                 10%  fishing (gear, boats, farms), rel. food, rel. charts/maps
  4. related: musical instrument    8%  speakers, accessories, works, chords, rel. music
  5. unrelated                     12%  miscellaneous (above senses not applicable)
  6. people                         7%  faces, crowds (above senses not applicable)

CRANE (2650 annotated images; 5 query terms: crane, construction cranes, whooping crane, sandhill crane, origami cranes)
  1. machine*                      21%  machine crane, incl. panoramas
  2. bird*                         26%  crane bird or chick
  3. origami*                       4%  origami bird
  4. related: machine              11%  other machinery, construction, motor, steering, seat
  5. related: bird                 11%  egg, other birds, wildlife, insects, hunting, rel. maps/charts
  6. related: origami               1%  origami shapes (stars, pigs), paper folding
  7. people                         7%  faces, crowds (above senses not applicable)
  8. unrelated                     18%  miscellaneous (above senses not applicable)
  9. karate*                        1%  martial arts

SQUASH (1948 annotated images; 10 query terms: squash+: rules, butternut, vegetable, grow, game of, spaghetti, winter, types of, summer)
  1. vegetable*                    24%  squash vegetable
  2. sport*                        13%  people playing, court, equipment
  3. related: vegetable            31%  agriculture, food, plant, flower, insect, vegetables
  4. related: sport                 6%  other sports, sports complex
  5. people                        10%  faces, crowds (above senses not applicable)
  6. unrelated                     16%  miscellaneous (above senses not applicable)

Table 1: Overview of annotated images for three ambiguous query terms, inspired by the WSD literature. For each term, the number of annotated images, the expanded query retrieval terms (terms taken from askjeeves.com), the senses, their distribution coverage, and rough sample annotation guidelines are provided, with core senses marked with an asterisk.
(a) machine (b) bird (c) origami (d) karate (e) rel. to a (f) rel. to b (g) rel. to c (h) people (i) unrel.
Figure 1: CRANE images with clear senses: (a-d) core senses, (e-g) related senses, (h) people and (i) unrelated. Related senses are associated with the semantic field of a core sense, but the core sense is visually absent or undeterminable.
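The coverage column in Table 1 is just a normalized tally of the per-image sense labels. It could be recomputed with a short sketch like the following (the flat list-of-labels input is our own assumption; the released annotation files may be structured differently):

```python
from collections import Counter

def sense_coverage(labels):
    """Map each sense label to its percentage of the annotated images,
    as in Table 1's coverage column."""
    counts = Counter(labels)
    total = len(labels)
    return {sense: 100.0 * n / total for sense, n in counts.items()}

# Toy label list standing in for one label per annotated image.
labels = ["bird"] * 13 + ["machine"] * 10 + ["unrelated"] * 2
print(sense_coverage(labels)["bird"])  # 52.0
```

For a real run, `labels` would hold one entry per retrieved image, and the percentages would sum to 100 as in each block of Table 1.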
3 Data set

The data set has images retrieved from a web search engine. We deliberately focused on three keywords, which cover a range of phenomena in semantic ambiguity: BASS, CRANE, and SQUASH. Table 1 gives an overview of the data set, annotated by one author (CA).¹ The webpage was not considered, to avoid bias, given the ISD task.

For each query, 2 to 4 core word senses were distinguished from inspecting the data using common sense. We chose this approach rather than ontology senses, which tend to be incomplete or too specific for our purposes. For example, the origami sense of CRANE is not included in WordNet under CRANE, but for BASS three different senses appear with fish. WordNet contains bird as part of the description for the separate entry origami, and some query expansion terms are hyponyms which occur as separate WordNet entries (e.g. bass guitar, sea bass, summer squash). Images may show multiple objects; a general strategy preferred a core sense if it was included.

An additional complication is that, given that the images are retrieved by a search engine, there is no guarantee that they depict the query term, so additional senses were introduced. Thus, for most core senses, a RELATED label was included for meanings related to the semantic field of a core sense. Also, a PEOPLE label was included, since such images may occur due to how people take pictures (e.g. portraits of persons, group pictures, or other representations of people outside core and related senses). An UNRELATED label accounted for images that did not fit other labels, or were irrelevant or undeterminable. In fact, distinguishing between PEOPLE and UNRELATED was not always straightforward. Fig. 1 shows examples of CRANE when sense assignment was quite straightforward. However, distinguishing image senses was often not this clear. In fact, many border-line cases occurred when one could argue for different label assignments. Also, annotation cues are subject to interpretation, and disagreements between judges are expected. They simply reflect that image senses are located on a semantic continuum.

¹ We call the data set the UIUC-ISD data set. It is currently at http://www.visionpc.cs.uiuc.edu/isd/.

4 Why annotating image senses is hard

In general, annotating images involves special challenges, such as what to annotate and how extensively. We assign an image one sense. Nevertheless, compared to disambiguating a word, several issues are added for annotation. As noted above, a core sense may not occur, and judgements are characterized by increased subjectivity, with semantics beyond prototypical and peripheral
(k) (l) (m) (n) (o) (p) (q) Figure 2: Annotating images is often challenging for different reasons. Are these images of CRANE birds? (a-c) depiction (d-f) gradient change (g-h) partial display (i-j) domain knowledge (k) unusual appearance (l-n) distance (o-q) not animate. exemplars. Also, the disambiguating context is may also be etymologically related or blend occa- limited to image contents, rather than collocations sionally, or be guided by cultural interpretations, of an ambiguous token. Fig. 2 illustrates selected and so on. challenging judgement calls for assigning or not Moreover, related senses are meant to capture the bird sense of CRANE, as discussed below. images associated with the semantic field of a core Depiction: Images may include man-made de- sense. However, because the notion and borders of pictions of an object in artistic depictions, and the a semantic field are non-specific, related senses question is whether this counts as the object or are tricky. Annotators may build associations not, e.g. Fig. 2(a-c). Gradient changes: Recog- quite wildly, based on personal experience and nition is complicated by objects taking different opinion, thus what is or is not a related sense may forms and shapes, cf. the insight by (Labov, 1973) very quickly get out of hand. For instance, a per- on gradual categories.2 For example, as seen in son may by association reason that if bird cranes Fig. 2(d-f), birds change with age; an egg may be occur frequently in fields, then an image of a field a bird, but a chick is, as is a fledgeling. Partial alone should be marked as related. To avoid this, display: Objects may be rendered in incomplete guidelines attempted to restrict related senses, as condition. For example, Fig. 2(g-h) show merely exemplified in Table 1, with some data-driven re- feathers or a bird neck. Domain knowledge: Peo- visions during the annotation process. 
However, ple may disagree due to differences in domain guidelines are also based on judgement calls. Be- knowledge, e.g. some non-experts may have a dif- sides, for abstract concepts like LOVE, differenti- ficult time determining whether or not other sim- ating core versus related sense is not really valid. ilar bird species can be distinguished from a bird Lastly, an additional complexity of image crane, cf. Fig. 2(i-j). This also affected annota- senses is that in addition to traditional word tions’ granularity depending on keyword, see Ta- senses, images may also capture repeatedly oc- ble 1’s example cues. Unusual appearance: Ob- curring iconographic patterns or senses. As illus- jects may occur in less frequent visual appear- trated in Fig. 3, the iconography of flying cranes ance, or lack distinguishing properties. For in- is quite different from that of standing cranes, as stance, Fig. 2(k) illustrates how sunset background regards motion, shape, identity, and color of figure masks birds’ color information. Scale: The dis- and ground, respectively. Mixed cases also occur, tance to objects may render them unclear and in- e.g. when bird cranes are taking off or are about fluence judgement accuracy, and people may dif- to land in relation to flight. Iconographic senses fer in the degree of certainty required for assign- may compare to more complex linguistic struc- ing a sense. For example, Fig. 2(l-n) show flying tures than nominal categories, e.g. a modified NP or standing potential cranes at distance. Animate: or clause, but are represented by image properties. Fig. 2(o-q) raise the question whether dead, skele- tal, or artificial objects are instantiations or not. A policy for annotating iconographic senses is Other factors complicating the annotation task in- still lacking. 
Image groups based on iconographic clude image crowdedness disguising objects, cer- senses seem to provide increased visual and se- tain entities having less salience, and lacking or mantic harmony for the eye, but experiments are unclear reference to object proportions. Senses needed to confirm how iconographic senses cor- respond to humans’ perception of semantic image 2Function or properties may also influence (Labov, 1973). similarity, and at what level of semantic differen-
3 (a) (b) (c) (d) (e) (f) (g) (h) Figure 3: Iconographic bird CRANE senses: (a-c) flying cranes, (d-f) standing cranes, and (g-h) mixed cases in-between.
(a) 5/2 (b) 1/4 (c) 4/1 (d) 4/1 (e) 4/8 (f) 8/2 (g) 8/1 (h) 6/8,5 (i) 4/1 Figure 4: Disagreement examples (sense numbers in Table 1): (a) crane or other bird? (b) toy crane or scales? (c) crane or other steel structure/elevator? (d) crane or other machine? (e) company is related or not? (f) bird or abstract art? (g) crane in background or not? (h) origami-related paper? (i) inside of crane? (and is inside sufficient to denote image as machine crane?) tiation they become relevant for sense assessment. judgements for image pairs or groups, as well as Lastly, considering the challenges of image an- issues in interannotator agreement for image dis- notation, it is interesting to look at annotation dis- ambiguation, and, finally, to better understand the agreements. Thus, another author (NL) inspected role of iconography for semantic interpretation. CRANE annotations, and recorded disagreement candidates, which amounted to 5%. Rejecting or 6 Acknowledgements accepting a category label seems less hard than Thanks to anonymous reviewers, R. Girju and R. independent annotation but still can give insights Sproat for feedback. Any fallacies are our own. into disagreement tendencies. Several disagree- References ments involved a core category vs. its related label vs. unrelated, rather than two core senses. Also, L. H. Armitage and P. G. B. Enser. 1997. Analysis of user need in image archives. J. of Inform. Sci., some disagreement candidates had tiny, fuzzy, 23(4):287–299. partial or peripheral potential sense objects, or lacked distinguishing object features, so interpre- K. Barnard and M. Johnson. 2005. Word sense disam- biguation with pictures. Artif. Intel., 167:13–30. tation became quite idiosyncratic. The disagree- ment candidates were discussed together, result- K. Barnard, P. Duygulu, N. Freitas, D. Forsyth, D. Blei, ing in 2% being true disagreements, 2% false dis- and M. I. Jordan. 2003. Matching words and pic- tures. J. of Mach. Learn. Research, 3:1107–1135. 
agreements (resolved by consensus on CA’s la- bels), and 1% annotation mistakes. Examples of D. M. Blei and M. I. Jordan. 2002. Modeling anno- true disagreements are in Fig. 4. Often, both par- tated data. Technical Report CSD-02-1202, Div. of Computer Science, Univ. of California, Berkeley. ties could see each others’ points, but opted for an- other interpretation; this confirms that border lines P. G. B. Enser. 1993. Query analysis in a visual infor- tend to merge, indicating that consistency is chal- mation retrieval context. J. of Doc. and Text Man- agement, 1(1):25–52. lenging and not always guaranteed. As the annota- tion procedure advances, criteria may evolve and P. G. B. Enser. 2000. Visual image retrieval: seek- modify the fuzzy sense boundaries. ing the alliance of concept based and content based paradigms. J. of Inform. Sci., 26(4):199–210. 5 Conclusion J. Jeon, V. Lavrenko, and R. Manmatha. 2003. Auto- matic image annotation and retrieval using crossme- This work draws attention to the need for consid- dia relevance models. In SIGIR, pages 119–126. ering natural language semantics in multi-modal W. Labov. 1973. The boundaries of words and their settings. Annotating image senses adds increased meanings. In C. J. Baily and R. Shuy, editors, New complexity compared to word-sense annotation ways of analyzing variation in English, pages 340– 373. Washington D.C: Georgetown Univ. Press. in text due to factors such as image proper- ties, subjective perception, and annotator domain- N. Loeff, C. O. Alm, and D. A. Forsyth. 2006. Dis- knowledge. Moreover, the concept of related criminating image senses by clustering with multi- modal features. In ACL (forthcoming). senses as well as iconographic senses go beyond and diversify the notion of word sense. In the fu- M. Markkula and E. Sormunen. 2000. End-user ture, we would like to perform experimentation searching challenges indexing practices in the digital newspaper photo archive. Inform. Retr., 1:259–285. 
with human subjects to explore both similarity
4 A Semi-Automatic Method for Annotating a Biomedical Proposition Bank
Wen-Chi Chou[1], Richard Tzong-Han Tsai[1,2], Ying-Shan Su[1], Wei Ku[1,3], Ting-Yi Sung[1] and Wen-Lian Hsu[1]
[1] Institute of Information Science, Academia Sinica, Taiwan, ROC
[2] Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan, ROC
[3] Institute of Molecular Medicine, National Taiwan University, Taiwan, ROC
{jacky957,thtsai,qnn,wilmaku,tsung,hsu}@iis.sinica.edu.tw
Abstract

In this paper, we present a semi-automatic approach for annotating semantic information in biomedical texts. The information is used to construct a biomedical proposition bank called BioProp. Like PropBank in the newswire domain, BioProp contains annotations of predicate argument structures and semantic roles in a treebank schema. To construct BioProp, a semantic role labeling (SRL) system trained on PropBank is used to annotate BioProp. Incorrect tagging results are then corrected by human annotators. To suit the needs in the biomedical domain, we modify the PropBank annotation guidelines and characterize semantic roles as components of biological events. The method can substantially reduce annotation efforts, and we introduce a measure of an upper bound for the saving of annotation efforts. Thus far, the method has been applied experimentally to a 4,389-sentence treebank corpus for the construction of BioProp. Inter-annotator agreement measured by the kappa statistic reaches .95 for the combined decision of role identification and classification when all argument labels are considered. In addition, we show that, when trained on BioProp, our biomedical SRL system, called BIOSMILE, achieves an F-score of 87%.

1 Introduction

The volume of biomedical literature available on the Web has grown enormously in recent years, a trend that will probably continue indefinitely. Thus, the ability to process literature automatically would be invaluable for both the design and interpretation of large-scale experiments. To this end, several information extraction (IE) systems using natural language processing techniques have been developed for use in the biomedical field. Currently, the focus of IE is shifting from the extraction of nominal information, such as named entities (NEs), to verbal information that represents the relations between NEs, e.g., events and function (Tateisi et al., 2004; Wattarujeekrit et al., 2004). In the IE of relations, the roles of NEs participating in a relation must be identified along with a verb of interest. This task involves identifying main roles, such as agents and objects, and adjunct roles (ArgM), such as location, manner, timing, condition, and extent. This identification task is called semantic role labeling (SRL). The corresponding roles of the verb (predicate) are called predicate arguments, and the whole proposition is known as a predicate argument structure (PAS).

To develop an automatic SRL system for the biomedical domain, it is necessary to train the system with an annotated corpus, called a proposition bank (Palmer et al., 2005). This corpus contains annotations of semantic PAS's superimposed on the Penn Treebank (PTB) (Marcus et al., 1993; Marcus et al., 1994). However, the process of manually annotating the PAS's to construct a proposition bank is quite time-consuming. In addition, due to the complexity of proposition bank annotation, inconsistent annotation may occur frequently and further complicate
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 5–12, Sydney, July 2006. © 2006 Association for Computational Linguistics

the annotation task. In spite of the above difficulties, there are proposition banks in the newswire domain that are adequate for training SRL systems (Xue and Palmer, 2004; Palmer et al., 2005). In addition, according to the CoNLL-2005 shared task (Carreras and Màrquez, 2005), the performance of SRL systems in general does not decline significantly when tagging out-of-domain corpora. For example, when SRL systems trained on the Wall Street Journal (WSJ) corpus were used to tag the Brown corpus, the performance only dropped by 15%, on average. In comparison to annotating from scratch, annotation efforts based on the results of an available SRL system are much reduced. Thus, we plan to use a newswire SRL system to tag a biomedical corpus and then manually revise the tagging results. This semi-automatic procedure could expedite the construction of a biomedical proposition bank for use in training a biomedical SRL system in the future.

2 The Biomedical Proposition Bank - BioProp

As proposition banks are semantically annotated versions of a Penn-style treebank, they provide consistent semantic role labels across different syntactic realizations of the same verb. The annotation captures predicate-argument structures based on the sense tags of polysemous verbs (called framesets) and semantic role labels for each argument of the verb. Figure 1 shows the annotation of semantic roles, exemplified by the following sentence: "IL4 and IL13 receptors activate STAT6, STAT3 and STAT5 proteins in normal human B cells." The chosen predicate is the word "activate"; its arguments and their associated word groups are illustrated in the figure.

Since proposition banks are annotated on top of a Penn-style treebank, we selected a biomedical corpus that has a Penn-style treebank as our corpus. We chose the GENIA corpus (Kim et al., 2003), a collection of MEDLINE abstracts selected from the search results with the following keywords: human, blood cells, and transcription factors. In the GENIA corpus, the abstracts are encoded in XML format, where each abstract contains a MEDLINE UID, and the title and content of the abstract. The text of the title and content is segmented into sentences, in which biological terms are annotated with their semantic classes. The GENIA corpus is also annotated with part-of-speech (POS) tags (Tateisi and Tsujii, 2004), and co-references are added to part of the GENIA corpus by the MedCo project at the Institute for Infocomm Research, Singapore (Yang et al., 2004).

The Penn-style treebank for GENIA, created by Tateisi et al. (2005), currently contains 500 abstracts. The annotation scheme of the GENIA Treebank (GTB), which basically follows the Penn Treebank II (PTB) scheme (Bies et al., 1995), is encoded in XML. However, in contrast to the WSJ corpus, GENIA lacks a proposition bank. We therefore use its 500 abstracts with GTB as our corpus. To develop our biomedical proposition bank, BioProp, we add the proposition bank annotation on top of the GTB annotation.

In the following, we report on the selection of biomedical verbs, and explain the difference between their meaning in PropBank (Palmer et al., 2005), developed by the University of Pennsylvania, and their meaning in BioProp (a biomedical proposition bank). We then introduce BioProp's annotation scheme, including how we modify a verb's framesets and how we define framesets for biomedical verbs not defined in VerbNet (Kipper et al., 2000; Kipper et al., 2002).
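The tag-then-revise procedure planned in the introduction (automatic SRL tagging followed by manual correction) can be sketched as a loop like the one below. The `srl_tag` and `human_revise` callables are hypothetical stand-ins, not real APIs, and the returned fraction is only a rough illustration of the upper-bound effort saving the paper mentions:

```python
def semi_automatic_annotation(sentences, srl_tag, human_revise):
    """Tag each sentence with a (newswire-trained) SRL system, then let a
    human annotator correct the proposed predicate-argument structures.
    Returns the corrected corpus and the fraction of sentences accepted
    unchanged, a rough upper bound on annotation effort saved."""
    corpus = []
    unchanged = 0
    for sent in sentences:
        proposed = srl_tag(sent)              # automatic PAS proposal
        final = human_revise(sent, proposed)  # manual correction pass
        if final == proposed:
            unchanged += 1
        corpus.append(final)
    return corpus, unchanged / len(sentences)
```

With a revision function that accepts every proposal, the saving measure is 1.0; each human correction lowers it.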
[Figure 1 shows a parse tree (NP-SBJ, VP, PP and NP nodes) over the example sentence, with "IL4 and IL13 receptors" labeled Arg0, "activate" the predicate, "STAT6, STAT3 and STAT5 proteins" Arg1, and the B-cell phrase AM-LOC.]
Figure 1. A treebank annotated with semantic role labels

2.1 Selection of Biomedical Verbs

We selected 30 verbs according to their frequency of use or importance in biomedical texts. Since our targets in IE are the relations of NEs, only sentences containing protein or gene names are used to count each verb's frequency. Verbs that have general usage are filtered out in order to ensure the focus is on biomedical verbs. Some verbs that do not have a high frequency, but play important roles in describing biomedical relations, such as "phosphorylate" and "transactivate", are also selected. The selected verbs are listed in Table 1.
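The labeled structure in Figure 1 can be held in a flat record like the following (the field names and the bracketed rendering are our own illustration, not BioProp's file format):

```python
# Predicate-argument structure (PAS) for the Figure 1 sentence.
pas = {
    "predicate": "activate",
    "args": [
        ("Arg0", "IL4 and IL13 receptors"),
        ("Arg1", "STAT6, STAT3 and STAT5 proteins"),
        ("AM-LOC", "in normal human B cells"),
    ],
}

def linearize(pas):
    """Render the PAS as PropBank-style bracketed text,
    with the predicate between Arg0 and the remaining roles."""
    parts = ["[%s %s]" % (text, role) for role, text in pas["args"][:1]]
    parts.append("[%s predicate]" % pas["predicate"])
    parts += ["[%s %s]" % (text, role) for role, text in pas["args"][1:]]
    return " ".join(parts)

print(linearize(pas))
```

This prints the sentence in the same bracketed style used in the Table 2 examples.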
Table 2. Framesets and examples of "express" and "translate"

express (VerbNet)
  Arg0: agent; Arg1: theme; Arg2: recipient or destination
  Example: [Some legislators_Arg0] [expressed_predicate] [concern that a gas-tax increase would take too long and possibly damage chances of a major gas-tax-increasing ballot initiative that voters will consider next June_Arg1].

translate (VerbNet)
  Arg0: causer of transformation; Arg1: thing changing; Arg2: end state; Arg3: start state
  Example: But some cosmetics-industry executives wonder whether [techniques honed in packaged goods_Arg1] [will_AM-MOD] [translate_predicate] [to the cosmetics business_Arg2].

express (BioProp)
  Arg0: causer of expression; Arg1: thing expressing
  Example: [B lymphocytes and macrophages_Arg0] [express_predicate] [closely related immunoglobulin G (IgG) Fc receptors (Fc gamma RII) that differ only in the structures of their cytoplasmic domains_Arg1].
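The frameset borrowing described for type 2 verbs, where a biomedical verb adopts the frameset of a VerbNet synonym with role descriptions renamed, can be represented as plain data. The dictionary layout and the helper below are our own illustration, not the authors' data format:

```python
# Framesets from Table 2, keyed by (verb, source lexicon).
FRAMESETS = {
    ("express", "VerbNet"): {
        "Arg0": "agent",
        "Arg1": "theme",
        "Arg2": "recipient or destination",
    },
    ("translate", "VerbNet"): {
        "Arg0": "causer of transformation",
        "Arg1": "thing changing",
        "Arg2": "end state",
        "Arg3": "start state",
    },
}

def derive_bioprop_frameset(verb, synonym, renamed_roles):
    """Adopt a VerbNet synonym's frameset for a biomedical verb,
    keeping only the roles the biomedical usage needs and renaming
    their descriptions per the BioProp guidelines."""
    base = FRAMESETS[(synonym, "VerbNet")]
    return {role: renamed_roles[role] for role in base if role in renamed_roles}

bio_express = derive_bioprop_frameset(
    "express", "translate",
    {"Arg0": "causer of expression", "Arg1": "thing expressing"})
print(bio_express)
```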
Table 1. Selected biomedical verbs and their types

Type 1: encode, interact, phosphorylate, transactivate
Type 2: express, modulate
Type 3: bind
Type 4: activate, affect, alter, associate, block, decrease, differentiate, encode, enhance, increase, induce, inhibit, mediate, mutate, prevent, promote, reduce, regulate, repress, signal, stimulate, suppress, transform, trigger

2.2 Framesets of Biomedical Verbs

Annotation of BioProp is mainly based on Levin's verb classes, as defined in the VerbNet lexicon (Kipper et al., 2000). In VerbNet, the arguments of each verb are represented at the semantic level, and thus have associated semantic roles. However, since some verbs may have different usages in biomedical and newswire texts, it is necessary to customize the framesets of biomedical verbs. The 30 verbs in Table 1 are categorized into four types according to the degree of difference in usage: (1) verbs that do not appear in VerbNet due to their low frequency in the newswire domain; (2) verbs that do appear in VerbNet, but whose biomedical meanings and framesets are undefined; (3) verbs that do appear in VerbNet, but whose primary newswire and biomedical usages differ; (4) verbs that have the same usage in both domains.

Verbs of the first type play important roles in biomedical texts, but rarely appear in newswire texts and thus are not defined in VerbNet. For example, "phosphorylate" increasingly appears in the fast-growing PubMed abstracts that report experimental results on phosphorylation events; therefore, it is included in our verb list. However, since VerbNet does not define the frameset for "phosphorylate", we must define it after analyzing all the sentences in our corpus that contain the verb. Other type 1 verbs may correspond to verbs in VerbNet; in such cases, we can borrow the VerbNet definitions and framesets. For example, "transactivate" is not found in VerbNet, but we can adopt the frameset of "activate" for this verb.

Verbs of the second type appear in VerbNet, but have unique biomedical meanings that are undefined there. Therefore, the framesets corresponding to their biomedical meanings must be added. In most cases, we can adopt framesets from VerbNet synonyms. For example, "express" is defined as "say" and "send very quickly" in VerbNet. However, in the biomedical domain, its usage is very similar to "translate". Thus, we can use the frameset of "translate" for "express". Table 2 shows the framesets and corresponding examples of "express" in the newswire and biomedical domains, as well as that of "translate" in VerbNet.

Verbs of the third type also appear in VerbNet. Although the newswire and biological senses are defined therein, their primary newswire sense is not the same as their primary biomedical sense. "Bind," for example, is common in the newswire domain, where it usually means "to tie" or "restrain with bonds." However, in the biomedical domain, its intransitive use, "attach or stick to", is far more common. For example, a Google search for the phrase "glue binds to" returned only 21 results, while the same search replacing "glue" with "protein" yields 197,000 hits. For such verbs, we only need to select the appropriate alternative meanings and corresponding framesets.

Lastly, for verbs of the fourth type, we can di-
rectly adopt the newswire definitions and framesets, since they are identical.

2.3 Distribution of Selected Verbs

There is a significant difference between the occurrence of the 30 selected verbs in biomedical texts and their occurrence in newswire texts. The verbs appearing in verb phrases constitute only 1,297 PAS's, i.e., 1% of all PAS's, in PropBank (shown in Figure 2), compared to 2,382 PAS's, i.e., 16% of all PAS's, in BioProp (shown in Figure 3). Furthermore, some biomedical verbs have very few PAS's in PropBank, as shown in Table 3. The above observations indicate that it is necessary to annotate a biomedical proposition bank for training a biomedical SRL system.

[Figure 2. The percentage of the 30 biomedical verbs and other verbs in PropBank]

[Figure 3. The percentage of the 30 biomedical verbs and other verbs in BioProp]

Table 3. The number and percentage of PAS's for each verb in BioProp and PropBank

Verb            # in BioProp  Ratio(%)  # in PropBank  Ratio(%)
induce          290           1.89      16             0.01
bind            252           1.64      0              0
activate        235           1.53      2              0
express         194           1.26      53             0.03
inhibit         184           1.20      6              0
increase        166           1.08      396            0.24
regulate        122           0.79      23             0.01
mediate         104           0.68      1              0
stimulate       93            0.61      11             0.01
associate       82            0.53      51             0.03
encode          79            0.51      0              0
affect          60            0.39      119            0.07
enhance         60            0.39      28             0.02
block           58            0.38      71             0.04
reduce          55            0.36      241            0.14
decrease        54            0.35      16             0.01
suppress        38            0.25      4              0
interact        36            0.23      0              0
alter           27            0.18      17             0.01
transactivate   24            0.16      0              0
modulate        22            0.14      1              0
phosphorylate   21            0.14      0              0
transform       21            0.14      22             0.01
differentiate   21            0.14      2              0
repress         17            0.11      1              0
prevent         15            0.10      92             0.05
promote         14            0.09      52             0.03
trigger         14            0.09      40             0.02
mutate          14            0.09      1              0
signal          10            0.07      31             0.02

3 Annotation of BioProp

3.1 Annotation Process

After choosing 30 verbs as predicates, we adopted a semi-automatic method to annotate BioProp. The annotation process consists of the following steps: (1) identifying predicate candidates; (2) automatically annotating the biomedical semantic roles with our WSJ SRL system; (3) transforming the automatic tagging results into WordFreak (Morton and LaCivita, 2003) format; and (4) manually correcting the annotation results with the WordFreak annotation tool. We now describe these steps in detail:

1. Each word with a VB POS tag in a verb phrase that matches any lexical variant of the 30 verbs is treated as a predicate candidate. The automatically selected targets are then double-checked by human annotators. As a result, 2,382 predicates were identified in BioProp.

2. Sentences containing the above 2,382 predicates were extracted and labeled automatically by our WSJ SRL system. In total, 7,764 arguments were identified.

3. In this step, sentences with PAS annotations are transformed into WordFreak format (an XML format), which allows annotators to view a sentence in a tree-like fashion. In addition, users can customize the tag set of arguments. Other linguistic information can also be integrated and displayed in WordFreak, which is a convenient annotation tool.

4. In the last step, annotators check the predicted semantic roles using WordFreak and then correct or add semantic roles if the predicted arguments are incorrect or missing, respectively. Three biologists with sufficient biological knowledge in our laboratory performed the annotation task after receiving computational linguistic training for approximately three months.

Figure 4 illustrates an example of BioProp annotation displayed in WordFreak format, using the frameset of "phosphorylate" listed in Table 4. This annotation process can be used to construct a domain-specific corpus when a general-purpose tagging system is available. In our experience, this semi-automatic annotation scheme saves annotation effort and improves annotation consistency.
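The four-step pipeline can be sketched as the following loop. All function names and data shapes here are hypothetical stand-ins: the actual WSJ SRL system and the WordFreak tool are external programs, represented below by stubs:

```python
# A minimal sketch of the semi-automatic annotation loop (steps 1-4).
TARGET_VERBS = {"activate", "bind", "phosphorylate"}  # subset of the 30 verbs

def find_predicate_candidates(tagged_sentence):
    """Step 1: every VB-tagged token matching a target verb is a candidate."""
    return [i for i, (tok, pos) in enumerate(tagged_sentence)
            if pos.startswith("VB") and tok.lower() in TARGET_VERBS]

def wsj_srl_label(tagged_sentence, pred_idx):
    """Step 2 (stub): a general-domain SRL system proposes arguments."""
    return {"predicate": pred_idx, "args": [("Arg0", 0), ("Arg1", pred_idx + 1)]}

def to_wordfreak(annotation):
    """Step 3 (stub): serialize to an XML-like record for the annotation tool."""
    return {"format": "wordfreak-xml", "payload": annotation}

def human_correct(record, corrections):
    """Step 4 (stub): annotators fix wrong labels and add missed ones."""
    record["payload"]["args"] = corrections
    return record

sent = [("IL4", "NN"), ("receptors", "NNS"), ("activate", "VBP"), ("STAT6", "NN")]
for idx in find_predicate_candidates(sent):
    record = to_wordfreak(wsj_srl_label(sent, idx))
    record = human_correct(record, [("Arg0", 0), ("Arg1", 3)])
    print(record["payload"])
```

The design point is that the automatic labeler does the bulk of the work in step 2, so human effort in step 4 is limited to correcting or adding labels rather than annotating every node from scratch.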
Table 4. The frameset of "phosphorylate"

phosphorylate
  Arg0: causer of phosphorylation
  Arg1: thing being phosphorylated
  Arg2: end state
  Arg3: start state

[Figure 4. An example of BioProp displayed with WordFreak]

3.2 Inter-annotation Agreement
We conducted preliminary consistency tests on 2,382 instances of biomedical propositions. The inter-annotation agreement was measured by the kappa statistic (Siegel and Castellan, 1988), the definition of which is based on the probability of inter-annotation agreement, denoted by P(A), and the agreement expected by chance, denoted by P(E). The kappa statistics for inter-annotation agreement were .94 for semantic role identification and .95 for semantic role classification when ArgM labels were included for evaluation. When ArgM labels were omitted, the kappa statistics were .94 and .98 for identification and classification, respectively. We also calculated the results of combined decisions, i.e., identification and classification. (See Table 5.)

Table 5. Inter-annotator agreement

                                  P(A)   P(E)   Kappa
including   role identification   .97    .52    .94
ArgM        role classification   .96    .18    .95
            combined decision     .96    .18    .95
excluding   role identification   .97    .26    .94
ArgM        role classification   .99    .28    .98
            combined decision     .99    .28    .98

3.3 Annotation Efforts

Since we employ a WSJ SRL system that labels semantic roles automatically, human annotators can quickly browse and verify the tagging results; thus, they do not have to examine all tags during the annotation process, as in the fully manual annotation approach. Only incorrectly predicted tags need to be modified, and missed tags need to be added. Therefore, annotation effort can be substantially reduced. To quantify the reduction in annotation effort, we define the saving of annotation effort, ρ, as:

  ρ = (# of correctly labeled nodes) / (# of all nodes)
    < (# of correctly labeled nodes) / (# of correct + # of incorrect + # of missed nodes)   (1)

In Equation (1), since the number of nodes that need to be examined is usually unknown, we
use an easy approximation to obtain an upper bound for ρ. This is based on the extremely optimistic assumption that annotators can recover a missed or incorrect label by checking only one node; in reality, this would be impossible. In our annotation process, the upper bound of ρ for BioProp is given by:

  ρ < 18932 / (18932 + 6682 + 15361) = 18932 / 40975 ≈ 46%,

which means that, at most, the annotation effort could be reduced by 46%. A more accurate tagging system is preferred because the more accurate the tagging system, the higher the upper bound of ρ will be.

4 Disambiguation of Argument Annotation

During the annotation process, we encountered a number of problems resulting from the different vocabulary usage and writing styles of general English and the biomedical domain. In this section, we describe three major problems and propose our solutions.

4.1 Cue Words for Role Classification

The PropBank annotation guidelines provide a list of words that can help annotators decide an argument's type. Similarly, we add some rules to our BioProp annotation guidelines. For example, "in vivo" and "in vitro" are used frequently in the biomedical literature but seldom appear in general English articles. According to their meanings, we classify them as location arguments (AM-LOC).

In addition, some words occur frequently both in general English and in the biomedical domain, but have different meanings/usages. For instance, "development" is often tagged as Arg0 or Arg1 in general English, as shown by the following sentence:

  Despite the strong case for stocks, however, most pros warn that [individuals_Arg0] shouldn't try to [profit_predicate] [from short-term developments_Arg1].

However, in the biomedical domain, "development" always means the stage of a disease, cell, etc. Therefore, we tag it as a temporal argument (AM-TMP), as shown in the following sentence:

  [Rhom-2 mRNA_Arg1] is [expressed_predicate] [in early mouse development_AM-TMP] [in central nervous system, lung, kidney, liver, and spleen but only very low levels occur in thymus_AM-LOC].

4.2 Additional Argument Types

In PropBank, the negative argument (AM-NEG) usually contains explicit negative words such as "not". However, in the biomedical domain, researchers usually express negative meaning implicitly by using "fail", "unable", "inability", "neither", "nor", "failure", etc. Take "fail" as an example. It is tagged as a verb in general English, as shown in the following sentence:

  But [the new pact_Arg1] will force huge debt on the new firm and [could_AM-MOD] [still_AM-TMP] [fail_predicate] [to thwart rival suitor McCaw Cellular_Arg2].

Negative results are important in the biomedical domain. Thus, for annotation purposes, we create an additional negation tag (AM-NEG1) that does not exist in PropBank. The following sentence is an example showing the use of AM-NEG1:

  [They_Arg0] [fail_AM-NEG1] to [induce_predicate] [mRNA of TNF-alpha_Arg1] [after 3 h of culture_AM-TMP].

In this example, if we did not introduce AM-NEG1, "fail" would be treated as a verb, as in PropBank, rather than as a negative argument, and it would not be included in the proposition for the predicate "induce". Thus, BioProp requires the AM-NEG1 tag to express the corresponding proposition precisely.

4.3 Essentiality of Biomedical Knowledge

Since PAS's contain more semantic information, proposition bank annotators require more domain knowledge than annotators of other corpora. In BioProp, many ambiguous expressions require biomedical knowledge to annotate them correctly, as exemplified by the following sentence in BioProp:

  In the cell types tested, the LS mutations indicated an apparent requirement not only for the intact NF-kappa B and SP1-binding sites but also for [several regions between -201 and -130_Arg1] [not_AM-NEG] [previously_AM-MNR] [associated_predicate] [with viral infectivity_Arg2].

Annotators without biomedical knowledge may consider [between -201 and -130] an extent argument (AM-EXT), because the PropBank guidelines define numerical adjuncts as AM-
EXT. However, it denotes a segment of DNA and is an appositive of [several regions]; therefore, it should be annotated as part of Arg1 in this case.

5 Effect of Training Corpora on SRL Systems

To examine the possibility that BioProp can improve the training of SRL systems used for automatic tagging of biomedical texts, we compare the performance of systems trained on BioProp and PropBank in different domains. We construct a new SRL system (called BIOSMILE, for BIOmedical SeMantIc roLe labEler) that is trained on BioProp and employs all the features used in our WSJ SRL system (Tsai et al., 2006).

As with POS tagging, chunking, and named entity recognition, SRL can also be formulated as a sentence tagging problem. A sentence can be represented by a sequence of words, a sequence of phrases, or a parsing tree; the basic units of a sentence in these representations are words, phrases, and constituents, respectively. Hacioglu et al. (2004) showed that tagging phrase-by-phrase (P-by-P) is better than word-by-word (W-by-W). However, Punyakanok et al. (2004) showed that constituent-by-constituent (C-by-C) tagging is better than P-by-P. Therefore, we use C-by-C tagging for SRL in BIOSMILE.

SRL can be divided into two steps. First, we identify all the predicates. This can be easily accomplished by finding all instances of verbs of interest and checking their part-of-speech (POS) tags. Second, we label all arguments corresponding to each predicate. This is a difficult problem, since the number of arguments and their positions vary according to a verb's voice (active/passive) and sense, along with many other factors.

In BIOSMILE, we employ the maximum entropy (ME) model for argument classification. We use Zhang's MaxEnt toolkit (http://www.nlplab.cn/zhangle/maxent_toolkit.html) and the L-BFGS (Nocedal and Wright, 1999) method of parameter estimation for our ME model. Table 6 shows the features we employ in BIOSMILE and our WSJ SRL system.

Table 6. The features used in our argument classification model

BASIC FEATURES
- Predicate - the predicate lemma
- Path - the syntactic path through the parsing tree from the parse constituent being classified to the predicate
- Constituent type
- Position - whether the phrase is located before or after the predicate
- Voice - passive if the predicate has the POS tag VBN and either its chunk is not a VP or it is preceded by a form of "to be" or "to get" within its chunk; otherwise, active
- Head word - calculated using the head word table described by Collins (1999)
- Head POS - the POS of the head word
- Sub-categorization - the phrase structure rule that expands the predicate's parent node in the parsing tree
- First and last word and their POS tags
- Level - the level in the parsing tree

PREDICATE FEATURES
- Predicate's verb class
- Predicate POS tag
- Predicate frequency
- Predicate's context POS
- Number of predicates

FULL PARSING FEATURES
- Parent's, left sibling's, and right sibling's paths, constituent types, positions, head words and head POS tags
- Head of PP parent - if the parent is a PP, then the head of this PP is also used as a feature

COMBINATION FEATURES
- Predicate distance combination
- Predicate phrase type combination
- Head word and predicate combination
- Voice position combination

OTHERS
- Syntactic frame of predicate/NP
- Headword suffixes of lengths 2, 3, and 4
- Number of words in the phrase
- Context words & POS tags

To compare the effects of using biomedical training data versus general English training data, we train BIOSMILE on 30 randomly selected training sets from BioProp (g1, ..., g30) and the WSJ SRL system on 30 from PropBank (w1, ..., w30), each of which has 1,200 training PAS's. We then test both systems on 30 400-PAS test sets from BioProp, with g1 and w1 being tested on test set 1, g2 and w2 on set 2, and so on. We then generate the scores for g1-g30 and w1-w30 and compare their averages.

Table 7 shows the experimental results. When tested on the biomedical corpus, BIOSMILE outperforms the WSJ SRL system by 22.9%. This result is statistically significant, as expected.

Table 7. Performance comparison of SRL systems trained on BioProp and PropBank

Training   Test     Precision  Recall  F-score
PropBank   BioProp  74.78      56.25   64.20
BioProp    BioProp  88.65      85.61   87.10
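The agreement and effort-saving figures reported in Sections 3.2 and 3.3 follow directly from their definitions and can be recomputed. The sketch below is illustrative, not the authors' implementation; the missed-node count is taken as 15,361 on the assumption that the three components sum to the printed denominator of 40,975:

```python
def kappa(p_a, p_e):
    """Kappa statistic: chance-corrected agreement computed from the
    observed agreement P(A) and the agreement expected by chance P(E)."""
    return (p_a - p_e) / (1.0 - p_e)

# First row of Table 5: role identification including ArgM, P(A)=.97, P(E)=.52
print(round(kappa(0.97, 0.52), 2))   # → 0.94, matching the reported score

# Upper bound on the annotation-effort saving rho from Section 3.3
correct, incorrect, missed = 18932, 6682, 15361
rho_bound = correct / (correct + incorrect + missed)
print(f"{rho_bound:.0%}")            # → 46%
```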
6 Conclusion & Future Work

The primary contribution of this study is the annotation of a biomedical proposition bank that incorporates the following features. First, the choice of 30 representative biomedical verbs is made according to their frequency and importance in the biomedical domain. Second, since some of the verbs have different usages and others do not appear in the WSJ proposition bank, we redefine their framesets and add some new argument types. Third, the annotation guidelines of PropBank are slightly modified to suit the needs of the biomedical domain. Fourth, using appropriate argument types, framesets and annotation guidelines, we construct a biomedical proposition bank, BioProp, on top of the popular biomedical GENIA Treebank. Finally, we employ a semi-automatic annotation approach that uses an SRL system trained on the WSJ PropBank; incorrect tagging results are then corrected by human annotators. This approach reduces annotation effort significantly: in BioProp, the annotation effort can be reduced by, at most, 46%. In addition, trained on BioProp, BIOSMILE's F-score increases by 22.9% compared to the SRL system trained on PropBank.

In our future work, we will investigate more biomedical verbs. In addition, since there are few biomedical treebanks, we plan to integrate full parsers in order to annotate syntactic and semantic information simultaneously. It will then be possible to apply SRL techniques more extensively to biomedical relation extraction.

References

Ann Bies, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style Penn Treebank Project. Technical report, University of Pennsylvania.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In Proceedings of CoNLL-2005.

Michael Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Kadri Hacioglu, Sameer Pradhan, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2004. Semantic Role Labeling by Tagging Syntactic Chunks. In Proceedings of CoNLL-2004.

Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun-ichi Tsujii. 2003. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19(Suppl. 1): i180-i182.

Karin Kipper, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of AAAI-2000.

Karin Kipper, Martha Palmer, and Owen Rambow. 2002. Extending PropBank with VerbNet semantic predicates. In Proceedings of AMTA-2002.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the ARPA Human Language Technology Workshop.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2): 313-330.

Thomas Morton and Jeremy LaCivita. 2003. WordFreak: an open tool for linguistic annotation. In Proceedings of HLT/NAACL-2003.

Jorge Nocedal and Stephen J. Wright. 1999. Numerical Optimization. Springer.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1).

Vasin Punyakanok, Dan Roth, Wen-tau Yih, and Dav Zimak. 2004. Semantic Role Labeling via Integer Linear Programming Inference. In Proceedings of COLING-2004.

Sidney Siegel and N. John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.

Yuka Tateisi, Tomoko Ohta, and Jun-ichi Tsujii. 2004. Annotation of Predicate-argument Structure of Molecular Biology Text. In Proceedings of the IJCNLP-04 Workshop on Beyond Shallow Analyses.

Yuka Tateisi and Jun-ichi Tsujii. 2004. Part-of-Speech Annotation of Biology Research Abstracts. In Proceedings of the 4th International Conference on Language Resources and Evaluation.

Yuka Tateisi, Akane Yakushiji, Tomoko Ohta, and Jun-ichi Tsujii. 2005. Syntax Annotation for the GENIA corpus. In Proceedings of IJCNLP-2005.

Richard Tzong-Han Tsai, Wen-Chi Chou, Yu-Chun Lin, Cheng-Lung Sung, Wei Ku, Ying-Shan Su, Ting-Yi Sung, and Wen-Lian Hsu. 2006. BIOSMILE: Adapting Semantic Role Labeling for Biomedical Verbs: An Exponential Model Coupled with Automatically Generated Template Features. In Proceedings of BioNLP'06.

Tuangthong Wattarujeekrit, Parantu K. Shah, and Nigel Collier. 2004. PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics, 5(155).

Nianwen Xue and Martha Palmer. 2004. Calibrating Features for Semantic Role Labeling. In Proceedings of EMNLP-2004.

Xiaofeng Yang, Guodong Zhou, Jian Su, and Chew Lim Tan. 2004. Improving Noun Phrase Coreference Resolution by Matching Strings. In Proceedings of the 1st International Joint Conference on Natural Language Processing: 226-233.
How and Where do People Fail with Time: Temporal Reference Mapping Annotation by Chinese and English Bilinguals
Yang Ye§, Steven Abney†§ †Department of Linguistics §Department of Electrical Engineering and Computer Science University of Michigan
Abstract

This work reports on three human tense annotation experiments for Chinese verbs in Chinese-to-English translation scenarios. The results show that inter-annotator agreement increases as the context of the verb under annotation becomes increasingly specified, i.e. as the context moves from the situation in which the target English sentence is unknown to the situation in which the target lexicon and target syntactic structure are fully specified. The annotation scheme with a fully specified syntax and lexicon in the target English sentence yields a satisfactorily high agreement rate. The annotation results were then analyzed via an ANOVA analysis, a logistic regression model and a log-linear model. The analyses reveal that while both the overt and the latent linguistic factors seem to significantly affect annotation agreement under different scenarios, the latent features are the real driving factors of tense annotation disagreement among multiple annotators. The analyses also find the verb telicity feature, aspect marker presence and syntactic embedding structure to be strongly associated with tense, suggesting their utility in the automatic tense classification task.

1 Introduction

In recent years, the research community has seen a fast-growing volume of work in temporal information processing. Consequently, the investigation and practice of temporal information annotation by human experts have emerged from corpus annotation research. To evaluate automatic temporal relation classification systems, annotated corpora must be created and validated, which motivates experiments and research in temporal information annotation.

One important temporal relation distinction that human beings make is the temporal reference distinction based on the relative positioning of the following three time parameters, as proposed by Reichenbach (1947): speech time (S), event time (E) and reference time (R). Temporal reference distinction is linguistically realized as tense. Languages have various granularities of tense representation; some have finer-grained tenses or aspects than others. This poses a great challenge to automatic cross-lingual tense mapping. The same challenge holds for cross-lingual tense annotation, especially for language pairs that have dramatically different tense strategies. A good solution for cross-lingual tense mapping would benefit a variety of NLP tasks such as Machine Translation, Cross-lingual Question Answering (CLQA), and Multi-lingual Information Summarization.

While automatic cross-lingual tense mapping has recently started to receive research attention, such as in (Olsen et al., 2001) and (Ye et al., 2005), to the best of our knowledge, human performance on tense and aspect annotation for machine translation between English and Chinese has not received any systematic investigation to date. Cross-linguistic NLP tasks, especially those requiring more accurate tense and aspect resolution, await a more focused study of human tense and aspect annotation performance. Chinese and English are a language pair in which tense and aspect are represented at different levels of units: one being realized at the word level and the other at the morpheme level.

This paper reports on a series of cross-linguistic tense annotation experiments between Chinese and English, and provides statistical inference for different linguistic factors via a series of statistical models. Since tense and aspect are morphologically merged in English, tense annotation
discussed in this paper also includes elements of aspect. We deal only with tense annotation in the Chinese-to-English scenario in the scope of this paper.

(Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 13-20, Sydney, July 2006. © 2006 Association for Computational Linguistics)

The remainder of the paper is organized as follows: Section 2 summarizes the significant related work in temporal information annotation and points out how this study relates to, yet differs from, it. Section 3 reports the details of three tense annotation experiments under three scenarios. Section 4 discusses the inter-judge agreement by presenting two measures of agreement: the kappa statistic and an accuracy-based measurement. Section 5 investigates and reports on the significance of different linguistic factors in tense annotation via an ANOVA analysis, a logistic regression analysis and a log-linear model analysis. Finally, Section 6 concludes the paper and points out directions for future research.

2 Related Work

There are two basic types of temporal location relationships. The first is the ternary classification of past, present and future. The second is the binary classification of "BEFORE" versus "AFTER". These two types of temporal relationship are intrinsically related, but each stands as a separate issue and is dealt with in different works. While the "BEFORE" versus "AFTER" relationship can easily be transferred across a language pair, the ternary tense taxonomy is often very hard to transfer from one language to another.

Wilson et al. (1997) describe a multilingual approach to annotating temporal information, which involves flagging a temporal expression in the document and identifying the time value that the expression designates. Their work reports inter-annotator reliability F-measures of 0.79 and 0.86, respectively, for English corpora.

Katz et al. (2001) describe a simple and general technique for the annotation of temporal relation information based on binary interval relation types: precedence and inclusion. Their annotation scheme could benefit a range of NLP applications and is easy to carry out.

Pustejovsky et al. (2004) report an annotation scheme, the TimeML metadata, for the markup of events and their anchoring in documents. The annotation schema of TimeML is very fine-grained, with wide coverage of different event types, dependencies between events and times, as well as "LINK" tags which encode the various relations existing between the temporal elements of a document. The challenge of human labeling of links among eventualities was discussed at great length in their paper. Automatic "time-stamping" was attempted on a small sample of text in an earlier work (Mani, 2003). The result was not particularly promising; it showed the need for a larger quantity of training data as well as more predictive features, especially at the discourse level. At the word level, the semantic representation of tenses could be approached in various ways depending on the application. So far, their work has gone the furthest towards establishing a broad and open standard metadata mark-up language for natural language texts.

Setzer et al. (2004) present a method of evaluating temporal order relation annotations and an approach to facilitating the creation of a gold standard by introducing the notion of temporal closure, which can be deduced from any annotation through a set of inference rules.

From the above works, it can be seen that the effort in temporal information annotation has thus far been dominated by annotating temporal relations that hold between entities such as events or times explicitly mentioned in the text. Cross-linguistic tense and aspect annotation has so far gone unstudied.

3 Chinese Tense Annotation Experiments¹

In the current section, we present three tense annotation experiments with the following scenarios:

1. Null-control situation by native Chinese speakers, where the annotators were provided with the source Chinese sentences but not the English translations;

2. High-control situation by native English speakers, where the annotators were provided with the Chinese sentences as well as English translations with specified syntax and lexicons; and

3. Semi-control situation by native English speakers, where the annotators were allowed to choose the syntax and lexicons for the English sentence with appropriate tenses.

¹All experiments in the paper are approved by the Behavioral Sciences Institutional Review Board at the University of Michigan; the IRB file number is B04-00007481-I.
3.1 Experiment One

Experiment One implements the first scenario: tense annotation of Chinese verbs in a Chinese-to-English cross-lingual setting. The annotation experiment was carried out on 25 news articles from the LDC Xinhua News release with catalog number LDC2001T11. The articles were divided into 5 groups with 5 articles in each group, for a total of 985 verbs. For each group, three native Chinese speakers who were bilingual in Chinese and English annotated the tense of the verbs in the articles independently. Prior to annotating the data, the annotators underwent brief training during which they were asked to read an example Chinese sentence for each tense and to make sure they understood the examples. During the annotation, the annotators were asked to read the whole article first and then select a tense tag based on the context of each verb.

The tense taxonomy provided to the annotators includes the twelve tenses that are the different combinations of the simple tenses (present, past and future) with the progressive and perfect aspects. In cases where the judges were unable to decide the tense of a verb, they were instructed to tag it as "unknown". In this experiment, the annotators were asked to tag the tense of all Chinese words that were tagged as verbs in the Penn Treebank corpora. Conceivably, the task under the current scenario is meta-linguistic in nature, because tense is an elusive notion for Chinese speakers. Nevertheless, the experiment provides a baseline for human tense annotation agreement. The following is an example of the annotation, where the annotators were to choose an appropriate tense tag from the provided tags:

((IP (NP-TPC (NP-PN (NR Ё))(NP (NN ᓎㄥ)(NN Ꮦഎ)))(LCP-TMP (NP (NT 䖥ᑈ))(LC ᴹ)) (NP-SBJ (NP (PP (P ᇍ)(NP (NN )))(NP (NN ᓔᬒ)))(NP (NN ℹӤ)))(VP (ADVP (AD 䖯ϔℹ)) (VP (VV ࡴᖿ)))(PU DŽ)))

1. simple present tense
2. simple past tense
3. simple future tense
4. present perfect tense
5. past perfect tense
6. future perfect tense
7. present progressive tense
8. past progressive tense
9. future progressive tense
10. present perfect progressive
11. past perfect progressive
12. future perfect progressive

3.2 Experiment Two

Experiment Two was carried out on 25 news articles from the parallel Chinese and English news articles available in the LDC Multiple-Translation Chinese corpus (MTC, catalog number LDC2002T01). In the previous experiment, the annotators tagged all verbs. In the current experimental set-up, we preprocessed the materials and removed the verbs that lose their verbal status in translation from Chinese to English due to nominalization. After this preprocessing, a total of 288 verbs were annotated. Three native speakers, bilingually fluent in English and Chinese, were recruited to annotate the tense of the English verbs translated from Chinese. As in the previous scenario, the annotators were encouraged to pay attention to the context of the target verb when tagging its tense. The annotators were provided with the full taxonomy illustrated by examples of English verbs, and they worked independently. The following is an example of the annotation, where the annotators were to choose an appropriate tense tag from the provided tags:

㒳䅵ˈ䖭ѯජᏖএᑈᅠ៤ݙ⫳ѻᘏؐϔⱒбकғܗˈ↨ᓔᬒࠡⱘϔббϔᑈ䭓б៤DŽ
According to statistics, the cities (achieve) a combined gross domestic product of RMB 19 billion last year, an increase of more than 90% over 1991, before their opening.

A. achieves
B. achieved
C. will achieve
D. are achieving
E. were achieving
F. will be achieving
G. have achieved
H. had achieved
I. will have achieved
J. have been achieving
K. had been achieving
L. will have been achieving
M. would achieve

3.3 Experiment Three

Experiment Three was simulated on 52 Xinhua news articles from the Multiple-Translation Chinese corpus (MTC) mentioned in the previous section. Since in the MTC each Chinese article is translated into English by ten human translation teams, we can conceptually view these ten teams as different annotators making decisions about the appropriate tense for the English verbs. These annotators differ from those in Experiment Two in that they were allowed to choose any syntactic structure and verb lexicon, because they were performing tense annotation within the larger task of sentence translation. Their tense annotations were therefore performed with much less specification of the annotation context. We manually aligned the Chinese verbs with the English verbs for the 10 translation teams in the MTC corpus and thus obtained our third source of tense annotation results. For the Chinese verbs
that were not translated as verbs into English, we assigned a "Not Available" tag. There are 1505 verbs in total, including the ones that lost their verbal status across the languages.

4 Inter-Judge Agreement

Researchers use consistency checking to validate human annotation experiments. Various ways of performing consistency checking are described in the literature, depending on the scale of the measurements, each with its advantages and disadvantages. Since our tense taxonomy is nominal, without any ordinal information, the Kappa statistic is the most appropriate choice for measuring inter-judge agreement.

4.1 Kappa Statistic

Kappa scores were calculated for the three human judges' annotation results. The Kappa score is the de facto standard for evaluating inter-judge agreement on tagging tasks. It reports the agreement rate among multiple annotators while correcting for the agreement brought about by pure chance. It is defined by the following formula, where P(A) is the observed agreement among the judges and P(E) is the expected agreement:

    k = (P(A) - P(E)) / (1 - P(E))    (1)

Depending on how one estimates the expected agreement brought about by pure chance, there are two ways to calculate the Kappa score. One is the "Siegel-Castellan" Kappa discussed in (Di Eugenio and Glass, 2004), which assumes a single hypothetical distribution of labels for all judges. In contrast, the "Cohen" Kappa (Cohen, 1960) assumes that each annotator has an individual distribution of labels. This discrepancy slightly affects the calculation of P(E). There is no consensus regarding which Kappa is the "right" one, and researchers use both. In our experiments, we use the "Siegel-Castellan" Kappa.

The Kappa statistics for the annotation results of Experiment One are 0.277 on the full taxonomy and 0.37 if we collapse the tenses into three big classes: present, past and future. The observed agreement rate, that is, P(A), is 0.42.

The Kappa score for tense resolution from the ten human translation teams on the 52 Xinhua news articles is 0.585 on the full taxonomy; we expect the Kappa score to be higher if we exclude the verbs that are nominalized. Interestingly, the Kappa score calculated by collapsing the 13 tenses into 3 (present, past and future) is only slightly higher: 0.595. The observed agreement rate is 0.72.

Human tense annotation in the Chinese-to-English restricted translation scenario achieved a Kappa score of 0.723 on the full taxonomy, with an observed agreement of 0.798. If we collapse simple past and present perfect, the Kappa score goes up to 0.792, with an observed agreement of 0.893. The Kappa score is 0.81 on the reduced taxonomy.

4.2 Accuracy

The Kappa score is a relatively conservative measurement of inter-judge agreement. Conceptually, we could also obtain an alternative measurement of reliability by taking one annotator at a time as the gold standard and averaging the accuracies of the remaining annotators across the different gold standards. While it is true that, numerically, this yields a higher score than the Kappa score and seems to inflate the agreement rate, we argue that the difference between the Kappa score and the accuracy-based measurement is not limited to one being more aggressive than the other: the policies of the two measurements are different. The Kappa score is concerned purely with agreement, without any consideration of truthfulness or falsehood, while the procedure described above gives equal weight to each annotator as the gold standard; it therefore considers both the agreement and the truthfulness of the annotation. Additionally, the accuracy-based measurement is the same measurement typically used to evaluate machine performance, and therefore gives a genuine ceiling for machine performance.

The accuracy under such a scheme for the three annotators in Experiment One is 43% on the full tense taxonomy. The accuracy for tense generation agreement from the three annotators in Experiment Two is 80% on the full tense taxonomy. The accuracy for the ten translation teams in Experiment Three is 70.8% on the full tense taxonomy.

Table 1 summarizes the inter-judge agreement for the three experiments.

Examining the annotation results, we identified the following sources of disagreement. While the
Agreement                            Exp 1    Exp 2    Exp 3
Kappa Statistic                      0.277    0.723    0.585
Kappa Statistic (Reduced Taxonomy)   0.37     0.81     0.595
Accuracy                             43%      80%      70.8%

Table 1: Inter-Annotator Agreement for the Three Tense Annotation Experiments
first two factors can be controlled for by a clearly pre-defined annotation guideline, the last two factors are intrinsically rooted in natural language and are therefore hard to deal with:

1. Different compliance with the Sequence of Tense (SOT) principle among annotators;

2. The "headline effect";

3. Ambiguous part of speech of the "verb": sometimes it is not clear whether a form is an adjective or a past participle, e.g. "The Fenglingdu Economic Development Zone is the only one in China that is/was built on the basis of a small town.";

4. Ambiguous aspectual properties of the verb: the annotator's view of whether the verb is atelic or telic, e.g. "statistics showed/show ...".

Put abstractly, ambiguity is an intrinsic property of natural language. A taxonomy allows us to investigate the research problem, yet any clearly defined discrete taxonomy will inevitably fail on boundary cases between classes.

Figure 1: Interaction between Aspect Marker and Temporal Modifier

5 Significance of Linguistic Factors in Annotation

In the NLP community, researchers carry out annotation experiments mainly to acquire gold-standard data sets for evaluation. Little effort has been made beyond the scope of agreement-rate calculations. We propose that feature analysis for annotation experiments not only falls under the concerns of psycholinguistics; it also merits investigation within the enterprise of natural language processing. There are at least two ways in which the analysis of annotation results can help an NLP task besides providing a gold standard: identifying the features that are responsible for inter-judge disagreement, and modeling the associations among the different features. The former attempts to answer the question of where the challenge for human classification comes from, and thereby provides an external reference for an automatic NLP system, although not necessarily in a direct way. The latter sheds light on the structures hidden among groups of features, whose identification could provide insights for feature selection, as well as offer convergent evidence for the significance of certain features confirmed in classification practice based on machine learning.

In this section, we discuss at some length a feature analysis of the results of each of the annotation experiments discussed in the previous sections and summarize the findings.

5.1 ANOVA analysis of Agreement and Linguistic Factors in Free Translation Tense Annotation

This analysis tries to relate the linguistic properties of a verb to the tense annotation agreement across the ten translation teams in Experiment Three. Specifically, we use an ANOVA analysis to explore how the overall variance in the inconsistency of the tenses of a particular verb across the translation teams can be attributed to the linguistic properties of the Chinese verb. It is a three-way ANOVA with three linguistic factors under investigation: whether the sentence contains a temporal modifier; whether the verb is embedded in a relative clause, a sentential complement, an appositive clause, or none of the above; and whether the verb is followed by an aspect marker. The dependent variable is the inconsistency of the tenses from the teams. The
inconsistency rate is measured as the ratio of the number of distinct tenses to the number of tense tokens from the ten translation teams.

Our ANOVA analysis shows that all three main effects, i.e. the embedding structure of the verb (p < 0.001), the presence of aspect markers (p < 0.01), and the presence of temporal modifiers (p < 0.05), significantly affect the rate of disagreement in tense generation among the translation teams. The graphs show the trend: tense generation disagreement rates are consistently lower when a Chinese aspect marker is present, whether or not a temporal modifier is present (Figure 1). The model also suggests that the presence of a temporal modifier is associated with a lower rate of disagreement for three of the embedding structures, the exception being verbs in sentential complements (Figure 2; 0: the verb is not in any embedding structure; 1: the verb is embedded in a relative clause; 2: the verb is embedded in an appositive clause; 3: the verb is embedded in a sentential complement). Our explanation is that the annotators received varying degrees of prescriptive writing training, so when a temporal modifier is present in the sentence as a confounder, there is a higher incidence of SOT violations than when no temporal modifier is present. On top of this, the difference in tense-tagging disagreement between the cases with and without a temporal modifier depends on the type of embedding structure (Figure 2, p < 0.05).

We also note that the relative clause embedding structure is associated with a much higher disagreement rate than any other embedding structure (Figure 3).

Figure 2: Interaction between the Temporal Modifier and the Syntactic Embedding Structure

5.2 Logistic Regression Analysis of Agreement and Linguistic Factors in Restricted Tense Annotation

The ANOVA analysis in the previous section is concerned with the confounding power of the overt linguistic features. The current section examines the significance of more latent features for tense annotation agreement when the SOT effect is removed by providing the annotators with a clear guideline about the SOT principle. Specifically, we are interested in the effect of the verb's telicity and punctuality features on tense annotation agreement. The telicity and punctuality features were obtained through manual annotation based on the situation in context. The data are from Experiment Two. Since there are only three annotators, the inconsistency rate discussed in 5.1 would have insufficient variance in the current scenario, making logistic regression a more appropriate analysis. The response is now binary: either agreement or disagreement (including partial agreement and pure disagreement). To avoid a multi-collinearity problem, we model the Chinese features and the English features separately. In order to truly investigate the effects of the latent features, we keep the overt linguistic features in the model as well. The overt features include: type of syntactic embedding, presence of an aspect marker, presence of a temporal expression in the sentence, whether the verb is in a headline, and the presence of certain signal adverbs, namely "yijing" (already), "zhengzai" (the Chinese pre-verbal progressive marker) and "jiang" (the Chinese pre-verbal adverb indicating future tense). We used backward elimination to obtain the final model.

The results showed that punctuality is the only factor that significantly affects the agreement rate among the judges, in both the model of English features and the model of Chinese features. The significance level is higher for the punctuality of the English verbs, suggesting that the source language environment is more relevant in tense generation. The annotators are roughly four times more likely to fail to agree on the tense of verbs associated with an interval event. This supports the hypothesis that human beings use latent features in tense classification tasks. Surprisingly, the telicity feature is not significant at all. We suspect this is partly due to the correlation between the punctuality feature and the telicity feature. Additionally, none of the overt linguistic features is significant in the presence of the latent features, which implies that it is the latent features that drive disagreement among multiple annotators.

Figure 3: Effect of Syntactic Embedding Structure on Tense Annotation Disagreement

5.3 Log-linear Model Analysis of Associations between Linguistic Factors in Free Translation Tense Annotation

This section discusses the association patterns between tense and the relevant linguistic factors via a log-linear model. A log-linear model is a special case of the generalized linear model (GLM) and has been widely applied in many fields of social science research for the multivariate analysis of categorical data. The model reveals interactions between categorical variables. The log-linear model differs from other GLMs in that it does not distinguish between "response" and "explanatory" variables: all variables are treated alike, as "response variables" whose mutual associations are explored. Under the log-linear model, the expected cell frequencies are functions of all the variables in the model. The most parsimonious model, the one producing the smallest discrepancy between the expected and the observed cell frequencies, is chosen as the final model; it provides the best explanation of the observed relationships among the variables.

We use the data from Experiment Two for the current analysis. The results show that the three linguistic features under investigation are significantly associated with tense. First, there is a strong association between aspect marker presence and tense, independent of the punctuality, telicity and embedding-structure features. Second, there is a strong association between telicity and tense, independent of punctuality and aspect marker presence. Third, there is a strong association between embedding structure and tense, independent of the telicity and punctuality features and aspect marker presence. This result is consistent with (Olsen et al., 2001), in that the lexical telicity feature, when used heuristically as the single knowledge source, can achieve good prediction of verb tense in Chinese-to-English machine translation. For example, the odds of a verb being atelic in the past tense are 2.5 times the odds of a verb being atelic in the future tense, with a 95% confidence interval of (0.9, 7.2). And the odds of a verb in the future tense having an aspect marker approach zero compared to the odds of a verb in the past tense having an aspect marker.

Putting together the pieces from the logistic analysis and the current analysis, we see that annotators fail to agree on tense selection mostly for apunctual verbs, while the agreed-upon tense is jointly decided by the telicity feature, the aspect marker feature and the syntactic embedding structure associated with the verb.

6 Conclusions and Future Work

As an initial attempt to assess human cross-lingual tense annotation, the current paper carries out a series of tense annotation experiments between Chinese and English under different scenarios. We show that even though tense is an abstract grammatical category, multiple annotators are still able to achieve a good agreement rate when the target English context is fully specified. We also show that, in a non-restricted scenario, the overt linguistic features (aspect markers, embedding structures and temporal modifiers) can cause annotators to disagree with each other significantly in tense annotation. These factors exhibit certain interaction patterns in the annotators' decision making. Our analysis of the annotation results from the scenario with a fully specified context shows that people tend to disagree on tense for verbs associated with interval events; the disagreement seems not to be driven by overt linguistic features such as embedding structure and aspect markers. Lastly, among a set of overt and latent linguistic features, aspect marker presence, embedding structure and
the telicity feature exhibit the strongest associations with tense, potentially indicating their high utility in tense classification tasks.

The current analysis, while suggesting certain interesting patterns in tense annotation, would be more significant if the findings could be replicated in experiments of different scales on different data sets. Furthermore, the statistical analysis could be more finely geared to capture the more subtle distinctions encoded in the features.

Acknowledgement

All of the annotation experiments in this paper were funded by the Rackham Graduate School's Discretionary Funds at the University of Michigan.

References

Jacob Cohen, 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20, 37-46.

Barbara Di Eugenio and Michael Glass, 2004. The Kappa Statistic: A Second Look. Computational Linguistics, 30(1), 95-101.

Graham Katz and Fabrizio Arosio, 2001. The Annotation of Temporal Information in Natural Language Sentences. Proceedings of the ACL 2001 Workshop on Temporal and Spatial Information Processing, 39th Annual Meeting of the ACL, Toulouse, 104-111.

Inderjeet Mani, 2003. Recent Developments in Temporal Information Extraction. In Nicolov, N. and Mitkov, R., editors, Proceedings of RANLP'03. John Benjamins.

Mari Olsen, David Traum, Carol Van Ess-Dykema, and Amy Weinberg, 2001. Implicit Cues for Explicit Generation: Using Telicity as a Cue for Tense Structure in a Chinese to English MT System. Proceedings of Machine Translation Summit VIII, Santiago de Compostela, Spain.

James Pustejovsky, Robert Ingria, Roser Sauri, Jose Castano, Jessica Littman, Rob Gaizauskas, Andrea Setzer, Graham Katz, and Inderjeet Mani, 2004. The Specification Language TimeML. In The Language of Time: A Reader. Oxford, 185-196.

Hans Reichenbach, 1947. Elements of Symbolic Logic. Macmillan, New York, N.Y.

Andrea Setzer, Robert Gaizauskas, and Mark Hepple, 2003. Using Semantic Inferences for Temporal Annotation Comparison. Proceedings of the Fourth International Workshop on Inference in Computational Semantics (ICoS-4), INRIA, Lorraine, Nancy, France, September 25-26, 185-196.

George Wilson, Inderjeet Mani, Beth Sundheim, and Lisa Ferro, 2001. A Multilingual Approach to Annotating and Extracting Temporal Information. Proceedings of the ACL 2001 Workshop on Temporal and Spatial Information Processing, 39th Annual Meeting of the ACL, Toulouse, 81-87.

Yang Ye and Zhu Zhang, 2005. Tense Tagging for Verbs in Cross-Lingual Context: A Case Study. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP), 885-895.
Probing the space of grammatical variation: induction of cross-lingual grammatical constraints from treebanks
Felice Dell'Orletta
Università di Pisa, Dipartimento di Informatica - Largo B. Pontecorvo 3
ILC-CNR - via G. Moruzzi 1
56100 Pisa, Italy
[email protected]

Alessandro Lenci
Università di Pisa, Dipartimento di Linguistica - Via Santa Maria 36
56100 Pisa, Italy
[email protected]
Simonetta Montemagni
ILC-CNR - via G. Moruzzi 1
56100 Pisa, Italy
[email protected]

Vito Pirrelli
ILC-CNR - via G. Moruzzi 1
56100 Pisa, Italy
[email protected]
Abstract

The paper reports on a detailed quantitative analysis of distributional language data of both Italian and Czech, highlighting the relative contribution of a number of distributed grammatical factors to sentence-based identification of subjects and direct objects. The work uses a Maximum Entropy model of stochastic resolution of conflicting grammatical constraints, and is demonstrably capable of putting explanatory theoretical accounts to the test of usage-based empirical verification.

1 Introduction

The paper illustrates the application of a Maximum Entropy (henceforth MaxEnt) model (Ratnaparkhi 1998) to the processing of subjects and direct objects in Italian and Czech. The model makes use of richly annotated treebanks to determine the types of linguistic factors involved in the task and to weigh up their relative salience. In doing so, we set ourselves a two-fold goal. On the one hand, we intend to discuss the use of treebanks to discover typologically relevant and linguistically motivated factors, and to assess their relative contribution to cross-linguistic parsing issues. On the other hand, we are interested in testing the empirical plausibility of constraint-resolution models of language processing (see infra) when confronted with real language data.

Current research in natural language learning and processing supports the view that grammatical competence consists in mastering and integrating multiple, parallel constraints (Seidenberg and MacDonald 1999, MacWhinney 2004). Moreover, there is growing consensus on two major properties of grammatical constraints: i) they are probabilistic "soft constraints" (Bresnan et al. 2001), and ii) they have an inherently functional nature, involving different types of linguistic (and non-linguistic) information (syntactic, semantic, etc.). These features emerge particularly clearly in one of the core aspects of grammar learning: the ability to identify syntactic relations in text. Psycholinguistic evidence shows that speakers learn to identify sentence subjects and direct objects by combining various types of probabilistic, functional cues, such as word order, noun animacy, definiteness, agreement, etc. An important observation is that the relative prominence of each such cue can vary considerably cross-linguistically. Bates et al. (1984), for example, argue that while, in English, word order is the most effective cue for Subject-Object Identification (henceforth SOI), both in syntactic processing and during the child's syntactic development, the same cue plays second fiddle in relatively free phrase-order languages such as Italian or German.

If grammatical constraints are inherently probabilistic (Manning 2003), the path through which adult grammatical competence is acquired can be viewed as the process of building a stochastic model out of the linguistic input. In computational linguistics, MaxEnt models have
proven to be robust statistical learning algorithms that perform well in a number of processing tasks. Being supervised learning models, they require richly annotated data as training input. Before we turn to the use of treebanks for training a MaxEnt model for SOI, we first analyse the range of linguistic factors that are taken to play a significant role in the task.

(Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 21-28, Sydney, July 2006. (c) 2006 Association for Computational Linguistics)

2 Subjects and objects in Czech and Italian

Grammatical relations - such as subject (S) and direct object (O) - are variously encoded in languages, the two most widespread strategies being: i) structural encoding through word order, and ii) morpho-syntactic marking. In turn, morpho-syntactic marking can apply either to the noun head only, in the form of case inflections, or to both the noun and the verb, in the form of agreement marking (Croft 2003). Besides formal coding, the distribution of subjects and objects is also governed by semantic and pragmatic factors, such as noun animacy, definiteness, topicality, etc. As a result, a variety of linguistic clues jointly co-operate in making a particular noun phrase the subject or the direct object of a sentence. Crucially for our present purposes, cross-linguistic variation does not only concern the particular strategy used to encode S and O, but also the relative strength of each factor in a given language. For instance, while in English word order is by and large the dominant clue for identifying S and O, in other languages the presence of a rich morphological system allows word order to have a much looser connection with the coding of grammatical relations, thus playing a secondary role in their identification. Moreover, there are languages where semantic and pragmatic constraints such as animacy and/or definiteness play a predominant role in the processing of grammatical relations. A large spectrum of variation exists, ranging from languages where S must have a higher degree of animacy and/or definiteness relative to O, to languages where this constraint only takes the form of a softer statistical preference (cf. Bresnan et al. 2001).

The goal of this paper is to explore this complex space of grammatical variation through a careful assessment of the distribution of S and O tokens in Italian and Czech. For the present analysis, we have used a MaxEnt statistical model trained on data extracted from two syntactically annotated corpora: the Prague Dependency Treebank (PDT, Böhmová et al. 2003) for Czech, and the Italian Syntactic-Semantic Treebank (ISST, Montemagni et al. 2003) for Italian. These corpora were chosen not only because they are the largest syntactically annotated resources for the two languages, but also because of their high degree of comparability, since both adopt a dependency-based annotation scheme.

Czech and Italian provide an interesting vantage point for the cross-lingual analysis of grammatical variation. They are both Indo-European languages, but they do not belong to the same family: Czech is a West Slavonic language, while Italian is a Romance language. For our present concerns, they appear to share two crucial features: i) the free order of grammatical relations with respect to the verb; and ii) the possible absence of an overt subject. Nevertheless, they also differ greatly, due to the virtual non-existence of case marking in Italian (with the only marginal exception of personal pronouns) and the different degrees of phrase-order freedom in the two languages. Empirical evidence supporting the latter claim is provided in Table 1, which reports data extracted from the PDT and the ISST. Notice that although in both languages S and O can occur either pre-verbally or post-verbally, Czech and Italian greatly differ in their propensity to depart from the (unmarked) SVO order. While in Italian a preverbal O is highly infrequent (1.90%), in Czech more than 30% of O tokens occur before the verb. The situation is similar, but somewhat more balanced, in the case of S, which occurs post-verbally in 22.21% of the Italian cases and in 40% of the Czech ones. One can certainly argue that, in spoken Italian, the number of pre-verbal objects is actually higher, because of the greater number of left dislocations and topicalizations occurring in informal speech. However reasonable, this observation does not explain away the distributional differences between the two corpora, since both the PDT and the ISST contain written language only. We thus suggest that there is clear empirical evidence for a systematically higher phrase-order freedom in Czech, arguably related to the well-known correlation of Czech constituent placement with sentence information structure, with the element carrying new information showing a tendency to occur sentence-finally (Stone 1990). For our present concerns, however, aspects of information structure, albeit central in Czech grammar, were not taken into account, as they happen not to be
Table 1 – Distribution of Czech and Italian S and O wrt word order, agreement and noun animacy

                  Czech               Italian
                  Subj      Obj       Subj      Obj
Pos   Pre          59.82%    30.27%    77.79%     1.90%
      Post         40.18%    69.73%    22.21%    98.10%
      All         100.00%   100.00%   100.00%   100.00%
Agr   Agr          98.50%    56.54%    97.73%    58.33%
      NoAgr         1.50%    43.46%     2.27%    41.67%
      All         100.00%   100.00%   100.00%   100.00%
Anim  Anim         34.10%    15.42%    50.18%    10.67%
      NoAnim       65.90%    84.58%    49.82%    89.33%
      All         100.00%   100.00%   100.00%   100.00%

Table 2 – Distribution of Czech S and O wrt case

                  Subj      Obj
Nominative         53.83%    0.65%
Accusative          0.15%   28.30%
Dative              0.16%    9.54%
Genitive            0.22%    2.03%
Instrumental        0.01%    3.40%
Ambiguous          45.63%   56.08%
All               100.00%  100.00%

marked-up in the Italian corpus.

According to the data reported in Table 1, Czech and Italian show similar correlation patterns between animacy and grammatical relations. S and O in ISST were automatically annotated for animacy using the SIMPLE Italian computational lexicon (Lenci et al. 2000) as a background semantic resource. The annotation was then checked manually. Czech S and O were annotated for animacy using Czech WordNet (Pala and Smrz 2004); it is worth remarking that the Czech animacy annotation was done only automatically, without any manual revision.

Italian shows a prominent asymmetry in the distribution of animate nouns in subject and object roles: over 50% of ISST subjects are animate, while only 10% of the objects are animate. Such a trend is also confirmed in Czech, although to a lesser extent, with 34.10% of animate subjects vs. 15.42% of objects.[1] Such an overwhelming preference for animate subjects in corpus data suggests that animacy may play a very important role for S and O identification in both languages.

[1] In fact, the considerable difference in animacy distribution between the two languages might only be an artefact of the way we annotated Czech nouns semantically, on the basis of their context-free classification in the Czech WordNet.

Corpus data also provide interesting evidence concerning the actual role of morpho-syntactic constraints in the distribution of grammatical relations. Prima facie, agreement and case are the strongest and most directly accessible clues for S/O processing, as they are marked both overtly and locally. This is also confirmed by psycholinguistic evidence, showing that subjects tend to rely on these clues to identify S/O. However, it should be observed that agreement can be relied upon conclusively in S/O processing only when a nominal constituent and a verb do not agree in number and/or person (as in leggono il libro '(they) read the book'). Conversely, when N and V share the same person and number, no conclusion can be drawn, as trivially shown by a sentence like il bambino legge il libro 'the child reads the book'. In ISST, more than 58% of O tokens agree with their governing V, thus being formally indistinguishable from S on the basis of agreement features. PDT also exhibits a similar ratio, with 56% of O tokens agreeing with their verb head.

Analogous considerations apply to case marking, whose perceptual reliability is undermined by morphological syncretism, whereby different cases are realized through the same marker. Czech data reveal the massive extent of this phenomenon and its impact on SOI. As reported in Table 2, more than 56% of O tokens extracted from PDT are formally indistinguishable from S in case ending. Similarly, 45% of S tokens are formally indistinguishable from O uses on the same ground. All in all, this means that in 50% of the cases a Czech noun cannot be understood as the S/O of a sentence by relying on overt case marking only.

To sum up, corpus data lend support to the idea that in both Italian and Czech SOI is governed by a complex interplay of probabilistic constraints of a different nature (morpho-syntactic, semantic, word order, etc.), as the latter are neither singly necessary nor jointly sufficient to tackle the processing task at hand. It is tempting to hypothesize that the joint distribution of these data can provide a statistically reliable basis upon which relevant probabilistic constraints are bootstrapped and combined consistently. This should be possible due to i) the different degrees of clue salience in the two languages and ii) the functional need to minimize
processing ambiguity in ordinary communicative exchanges. With reference to the latter point, for example, we may surmise that a speaker will be more inclined to violate one constraint on S/O distribution (e.g. word order) when another clue is available (e.g. animacy) that strongly supports only the intended interpretation. The following section illustrates how a MaxEnt model can be used to model these intuitions by bootstrapping constraints and their interaction from language data.

3 Maximum Entropy modelling

The MaxEnt framework offers a mathematically sound way to build a probabilistic model for SOI which combines different linguistic cues. Given a linguistic context c and an outcome a ∈ A that depends on c, in the MaxEnt framework the conditional probability distribution p(a|c) is estimated on the basis of the assumption that no a priori constraints must be met other than those related to a set of features f_j(a,c) of c, whose distribution is derived from the training data. It can be proven that the probability distribution p satisfying the above assumption is the one with the highest entropy, is unique and has the following exponential form (Berger et al. 1996):

(1)   p(a|c) = (1/Z(c)) * ∏_{j=1..k} α_j^{f_j(a,c)}

where Z(c) is a normalization factor, the f_j(a,c) are the values of the k features of the pair (a,c) and correspond to the linguistic cues of c that are relevant to predict the outcome a. Features are extracted from the training data and define the constraints that the probabilistic model p must satisfy. The parameters of the distribution α_1, …, α_k correspond to weights associated with the features, and determine the relevance of each feature in the overall model. In the experiments reported below, feature weights have been estimated with the Generalized Iterative Scaling (GIS) algorithm implemented in the AMIS software (Miyao and Tsujii 2002).

We model SOI as the task of predicting the correct syntactic function φ ∈ {subject, object} of a noun occurring in a given syntactic context σ. This is equivalent to building the conditional probability distribution p(φ|σ) of having a syntactic function φ in a syntactic context σ. Adopting the MaxEnt approach, the distribution p can be rewritten in the parametric form of (1), with features corresponding to the linguistic cues relevant to SOI. The context σ is a pair consisting of a verb and its nominal dependent. This notion of σ departs from more traditional ways of describing an SOI context as a triple of one verb and two nouns in a certain syntactic configuration (e.g. SOV or VOS, etc.). In fact, we assume that SOI can be stated in terms of the more local task of establishing the grammatical function of a noun n observed in a verb-noun pair. This simplifying assumption is consistent with the claim in MacWhinney et al. (1984) that SVO word order is actually derivative from SV and VO local patterns, and downplays the role of the transitive complex construction in sentence processing. Evidence in favour of this hypothesis also comes from corpus data: for instance, in ISST complete subject-verb-object configurations represent only 26% of the cases, a small percentage if compared to the 74% of verb tokens appearing with either a subject or an object only; a similar situation can be observed in PDT, where complete subject-verb-object configurations occur in only 20% of the cases.

Due to the comparative sparseness of canonical SVO constructions in Czech and Italian, it seems more reasonable to assume that children should pay a great deal of attention to both SV and VO units as cues in sentence perception (Matthews et al. in press). Reconstruction of the whole lexical SVO pattern can accordingly be seen as the end point of an acquisition process whereby smaller units are re-analyzed as being part of more comprehensive constructions. This hypothesis is more in line with a distributed view of canonical constructions as derivative of more basic local positional patterns, working together to yield more complex and abstract constructions. Last but not least, assuming verb-noun pairs as the relevant context for SOI allows us to simultaneously model the interaction of word order variation with pro-drop.

4 Feature selection

The most important part of any MaxEnt model is the selection of the context features whose weights are to be estimated from data distributions. Our feature selection strategy is grounded on the main assumption that features should correspond to theoretically and typologically well-motivated contextual cues. This allows us to evaluate the probabilistic model also with respect to its consistency with current linguistic generalizations. In turn, the model can be used as a probe into the correspondence between theoretically motivated
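The decision rule in equation (1) can be sketched numerically. The weights below are the Italian feature-value weights the paper reports in Table 5; the feature names and the tiny scoring routine are our own illustration of the product form, not the authors' code.

```python
# Sketch of the MaxEnt decision rule of equation (1):
#   p(a|c) = (1/Z(c)) * prod_j alpha_j ** f_j(a, c)
# Weights are the Italian feature-value weights from Table 5; the dictionary
# layout and function names are hypothetical, chosen for illustration.

ALPHA = {
    ("preverbal", "subj"): 1.31,     ("preverbal", "obj"): 2.11e-02,
    ("postverbal", "subj"): 5.39e-01, ("postverbal", "obj"): 1.38,
    ("agr", "subj"): 1.28,           ("agr", "obj"): 4.67e-01,
    ("noagr", "subj"): 1.52e-01,     ("noagr", "obj"): 1.58,
    ("anim", "subj"): 1.28,          ("anim", "obj"): 3.17e-01,
    ("inanim", "subj"): 8.16e-01,    ("inanim", "obj"): 1.23,
}

def p(function, cues):
    """Conditional probability p(function | sigma) for a verb-noun pair
    whose active binary cues are given as a list of feature names."""
    def score(a):
        s = 1.0
        for cue in cues:
            s *= ALPHA.get((cue, a), 1.0)  # alpha ** f, with f in {0, 1}
        return s
    z = score("subj") + score("obj")   # normalization factor Z(sigma)
    return score(function) / z

# A preverbal, animate noun agreeing with its verb is almost surely a subject:
print(p("subj", ["preverbal", "anim", "agr"]))
```

Because only two outcomes compete, Z(σ) is just the sum of the two unnormalized scores, and the model picks whichever of subject or object has the larger product of weights.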
generalizations and usage-based empirical evidence.

Features are binary functions f_{k_i,φ}(φ,σ), which test whether a certain cue k_i for the function φ occurs in the context σ. For our MaxEnt model, we have selected different feature types that test morpho-syntactic, syntactic, and semantic key dimensions in determining the distribution of S and O.

Morpho-syntactic features. These include N-V agreement, for Italian and Czech, and case, only for Czech. The combined use of such features allows us not only to test the impact of morpho-syntactic information on SOI, but also to analyze patterns of cross-lingual variation stemming from language-specific morphological differences, e.g. the lack of case marking in Italian.

Word order. This feature essentially tests the position of the noun wrt the verb, for instance:

(2)   f_{post,subj}(subj, σ) = 1 if noun.pos_σ = post, 0 otherwise

Animacy. This is the main semantic feature, which tests whether the noun in σ is animate or inanimate (cf. section 2). The centrality of this cue for grammatical relation assignment is widely supported by typological evidence (cf. Aissen 2003, Croft 2003). The Animacy Markedness Hierarchy, representing the relative markedness of the associations between grammatical functions and animacy degrees, is actually assigned the role of a functional universal principle in grammar. The hierarchy is reported below, with each item in these scales being less marked than the elements to its right:

Animacy Markedness Hierarchy
Subj/Human > Subj/Animate > Subj/Inanimate
Obj/Inanimate > Obj/Animate > Obj/Human

Markedness hierarchies have also been interpreted as probabilistic constraints estimated from corpus data (Bresnan et al. 2001). In our MaxEnt model we have used a reduced version of the animacy markedness hierarchy in which human and animate nouns have both been subsumed under the general class animate.

Definiteness tests the degree of "referentiality" of the noun in a context pair σ. Like animacy, definiteness has been claimed to be associated with grammatical functions, giving rise to the following universal markedness hierarchy (Aissen 2003):

Definiteness Markedness Hierarchy
Subj/Pro > Subj/Name > Subj/Def > Subj/Indef
Obj/Indef > Obj/Def > Obj/Name > Obj/Pro

According to this hierarchy, subjects with a low degree of definiteness are more marked than subjects with a high degree of definiteness (for objects the reverse pattern holds). Given the importance assigned to the definiteness markedness hierarchy in current linguistic research, we have included the definiteness cue in the MaxEnt model. In our experiments, for Italian we have used a compact version of the definiteness scale: the definiteness cue tests whether the noun in the context pair i) is a name or a pronoun, ii) has a definite article, iii) has an indefinite article, or iv) is a bare noun (i.e. with no article). It is worth saying that bare nouns are usually placed at the bottom end of the definiteness scale. Since in Czech there is no article, we only make a distinction between proper names and common nouns.

5 Testing the model

The Italian MaxEnt model was trained on 14,643 verb-subject/object pairs extracted from ISST. For Czech, we used a training corpus of 37,947 verb-subject/object pairs extracted from PDT. In both cases, the training set was obtained by extracting all verb-subject and verb-object dependencies headed by an active verb, with the exclusion of all cases where the position of the nominal constituent was grammatically determined (e.g. clitic objects, relative clauses). It is interesting to note that in both training sets the proportion of subject and object relations is nearly the same: 63.06%-65.93% verb-subject pairs and 36.94%-34.07% verb-object pairs for Italian and Czech respectively.

The test corpus consists of a set of verb-noun pairs randomly extracted from the reference treebanks: 1,000 pairs for Italian and 1,373 for Czech. For Italian, 559 pairs contained a subject and 441 contained an object; for Czech, 905 pairs contained a subject and 468 an object. Evaluation was carried out by calculating the percentage of correctly assigned relations over the total number of test pairs (accuracy). As our model always assigns one syntactic relation to each test pair, accuracy equals both standard precision and recall.
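The binary feature functions of equation (2) can be sketched directly. The context fields and function names below are hypothetical, chosen only to illustrate how a feature fires for one cue-outcome combination and stays silent otherwise.

```python
# Sketch of the binary feature functions f_{k,phi}(phi, sigma) of equation (2):
# a feature has value 1 when cue k holds of the noun in context sigma AND the
# candidate outcome is phi. The Context fields are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Context:            # sigma: a verb-noun pair
    noun_pos: str         # "pre" or "post" (position wrt the verb)
    agreement: bool       # noun agrees with the verb in person/number
    animate: bool

def f_post_subj(phi, sigma):
    # Equation (2): 1 iff the noun is post-verbal and the outcome is "subj"
    return 1 if phi == "subj" and sigma.noun_pos == "post" else 0

def f_anim_obj(phi, sigma):
    # Fires for an animate noun considered as an object
    return 1 if phi == "obj" and sigma.animate else 0

sigma = Context(noun_pos="post", agreement=True, animate=False)
print(f_post_subj("subj", sigma))  # 1
print(f_post_subj("obj", sigma))   # 0
```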
Table 3 – Types of errors for Czech and Italian (percentages refer to the whole error set)

              Czech               Italian
              Subj      Obj       Subj      Obj
Preverb        1.99%    19.40%     0.00%     6.90%
Postverb      71.14%     7.46%    71.55%    21.55%
Anim           0.50%     3.98%     6.90%    21.55%
Inanim        72.64%    22.89%    64.66%     6.90%
Nomin          0.00%     1.00%    na        na
Genitive       0.50%     0.00%    na        na
Dative         1.99%     0.00%    na        na
Accus          0.00%     0.00%    na        na
Instrum        0.00%     0.00%    na        na
Ambig         70.65%    25.87%    na        na
Agr           70.15%    25.87%    61.21%    12.07%
NoAgr          2.99%     0.50%     7.76%     1.72%
NAAgr          0.00%     0.50%     2.59%    14.66%

Table 4 – Feature value weights in NLC for Czech and Italian

              Czech                 Italian
              Subj       Obj        Subj       Obj
Preverb       1.24E+00   5.40E-01   1.31E+00   2.11E-02
Postverb      8.77E-01   1.17E+00   5.39E-01   1.38E+00
Anim          1.16E+00   6.63E-01   1.28E+00   3.17E-01
Inanim        1.03E+00   9.63E-01   8.16E-01   1.23E+00
PronName      1.13E+00   7.72E-01   1.13E+00   8.05E-01
DefArt        na         na         1.01E+00   1.02E+00
IndefArt      1.05E+00   9.31E-01   6.82E-01   1.26E+00
NoArticle     na         na         9.91E-01   1.02E+00
Nomin         1.23E+00   2.22E-02   na         na
Genitive      2.94E-01   1.51E+00   na         na
Dative        2.85E-02   1.49E+00   na         na
Accus         8.06E-03   1.39E+00   na         na
Instrum       3.80E-03   1.39E+00   na         na
Agr           1.18E+00   6.67E-01   1.28E+00   4.67E-01
NoAgr         7.71E-02   1.50E+00   1.52E-01   1.58E+00
NAAgr         3.75E-01   1.53E+00   2.61E-01   1.84E+00

We have assumed a baseline score of 56% for Italian and of 66% for Czech, corresponding to the result yielded by a naive model assigning to each test pair the most frequent relation in the training corpus, i.e. subject. Experiments were carried out with the general features illustrated in section 4: verb agreement, case (for Czech only), word order, noun animacy and noun definiteness. Accuracy on the test corpus is 88.4% for Italian and 85.4% for Czech.

A detailed error analysis for the two languages is reported in Table 3, showing that in both languages subject identification appears to be particularly problematic. In Czech, it appears that the prototypically mistaken subjects are post-verbal (71.14%), inanimate (72.64%), ambiguously case-marked (70.65%) and agreeing with the verb (70.15%), where the reported percentages refer to the whole error set. Likewise, Italian mistaken subjects can be described thus: they typically occur in post-verbal position (71.55%), are mostly inanimate (64.66%) and agree with the verb (61.21%). Interestingly, in both languages, the highest number of errors occurs when a) N has the least prototypical syntactic and semantic properties for O or S (relative to word order and noun animacy), and b) morpho-syntactic features such as agreement and case are neutralised. This shows that MaxEnt is able to home in on the core linguistic properties that govern the distribution of S and O in Italian and Czech, while remaining uncertain in the face of somewhat peripheral and occasional cases.

A further way to evaluate the goodness of fit of our model is by inspecting the weights associated with feature values for the two languages. They are reported in Table 4, where grey cells highlight the preference of each feature value for either subject or object identification. In both languages, agreement with the verb strongly relates to the subject relation. For Czech, nominative case is strongly associated with subjects, while the other cases are associated with objects. Moreover, in both languages preverbal subjects are strongly preferred over preverbal objects; animate subjects are preferred over animate objects; pronouns and proper names are typically subjects.

Let us now try to relate these feature values to the Markedness Hierarchies reported in section 4. Interestingly enough, if we rank the Italian Anim and Inanim values for subjects and objects, we observe that they distribute consistently with the Animacy Markedness Hierarchy: Subj/Anim > Subj/Inanim and Obj/Inanim > Obj/Anim. This is confirmed by the Czech results. Similarly, by ranking the Italian values for the definiteness features in the Subj column by decreasing weight values we obtain the following ordering: PronName > DefArt > IndefArt > NoArt, which nicely fits in with the Definiteness Markedness Hierarchy in section 4. The so-called "markedness reversal" is replicated with a good degree of approximation if we focus on the values for the same features in the Obj column: the PronName feature represents the most marked option, followed by IndefArt, DefArt and NoArt (the latter two showing the same feature value). The exception here is represented by the relative ordering of IndefArt and DefArt, which however show very close values. The same
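The evaluation measure described above reduces to plain accuracy: since the model labels every test pair, precision and recall coincide with it. A toy illustration, with made-up gold labels and the naive most-frequent-relation baseline mentioned in the text:

```python
# Accuracy as used in section 5: correctly assigned relations over all test
# pairs. The gold/predicted labels below are invented for illustration.

def accuracy(gold, predicted):
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold = ["subj", "subj", "obj", "subj", "obj"]
baseline = ["subj"] * len(gold)   # naive model: always the most frequent relation
print(accuracy(gold, baseline))   # 0.6
```

On this toy set the always-subject baseline scores 60%, mirroring how the paper's 56% (Italian) and 66% (Czech) baselines arise from the subject/object proportions in the test corpora.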
seems to hold for Czech, where the feature ordering for Subj is PronName > DefArt/IndefArt/NoArt and the reverse is observed for Obj.

5.1 Evaluating comparative feature salience

The relative salience of the different constraints acting on SOI can be inferred by comparing the weights associated with individual feature values. For instance, Goldwater and Johnson (2003) show that MaxEnt can successfully be applied to learn constraint rankings in Optimality Theory, by assuming the parameter weights <α_1, …, α_k> as the ranking values of the constraints. Table 5 illustrates the constraint ranking for the two languages, ordered by decreasing weight values for both S and O. Note that, although not all constraints are applicable in both languages, the weights associated with applicable constraints exhibit the same relative salience in Czech and Italian. This seems to suggest the existence of a rather dominant (if not universal) salience scale of S and O processing constraints, in spite of the considerable difference in the marking strategies adopted by the two languages. As the relative weight of each constraint crucially depends on its overall interaction with other constraints on a given processing task, absolute weight values can considerably vary from language to language, with a resulting impact on the distribution of S and O constructions. For example, the possibility of overtly and unambiguously marking a direct object with case inflection makes wider room for preverbal use of objects in Czech. Conversely, the lack of case marking in Italian considerably limits the preverbal distribution of direct objects. This evidence, however, appears to be an epiphenomenon of the interaction of fairly stable and invariant preferences, reflecting common functional tendencies in language processing. As shown in Table 5, if constraint ranking largely confirms the interplay between animacy and word order in Italian, Czech does not contradict it but rather re-modulates it somewhat, due to the "perturbation" factors introduced by its richer battery of case markers.

6 Conclusions

Probabilistic language models, machine language learning algorithms and linguistic theorizing all appear to support a view of language processing as a process of dynamic, on-line resolution of conflicting grammatical constraints. We begin to gain considerable insights into the complex process of bootstrapping the nature and behaviour of these constraints upon observing their actual distribution in perceptually salient contexts. In our view of things, this trend outlines a promising framework providing fresh support to usage-based models of language acquisition through mathematical and computational simulations. Moreover, it allows scholars to investigate patterns of cross-linguistic typological variation that crucially depend on the appropriate setting of model parameters. Finally, it promises to solve, on a principled basis, traditional performance-oriented cruces of grammar theorizing such as degrees of human acceptability of ill-formed grammatical constructions (Hayes 2000) and the inherently graded compositionality of linguistic constructions such as morpheme-based words and word-based phrases (Bybee 2002, Hay and Baayen 2005).

We argue that the current availability of comparable, richly annotated corpora and of mathematical tools and models for corpus exploration makes the time ripe for probing the space of grammatical variation, both intra- and inter-linguistically, at unprecedented levels of sophistication and granularity. All in all, we anticipate that such a convergence is likely to have a twofold impact: it is bound to shed light on the integration of performance and competence factors in language study, and it will make mathematical models of language increasingly able to accommodate richer and richer language evidence, thus putting explanatory theoretical accounts to the test of usage-based empirical verification.

In the near future, we intend to pursue two parallel lines of development. First, we would like to increase the context-sensitiveness of our processing task by integrating binary grammatical constraints into the broader context of multiply conflicting grammar relations. This way, we will be in a position to capture the constraint that a (transitive) verb has at most one subject and one object, thus avoiding multiple assignment of subject (object) relations in the same context. Suppose, for example, that both nouns in a noun-noun-verb triple are amenable to a subject interpretation, but that one of them is a more likely subject than the other. Then, it is reasonable to expect the model to process the less likely subject candidate as the object of the verb in the triple. Another promising line of development is based on the observation that the
order in which verb arguments appear in context is also lexically governed: in Italian, for example, report verbs show a strong tendency to select subjects post-verbally. Dell'Orletta et al. (2005) report a substantial improvement in model performance on Italian SOI when lexical information is taken into account, as a lexicalized MaxEnt model appears to integrate general constructional and semantic biases with lexically-specific preferences. In a cross-lingual perspective, comparable evidence of lexical constraints on word order would allow us to discover language-wide invariants in the lexicon-grammar interplay.

References

Bates E., MacWhinney B., Caselli C., Devescovi A., Natale F., Venza V. 1984. A crosslinguistic study of the development of sentence interpretation strategies. Child Development, 55: 341-354.

Bohmova A., Hajic J., Hajicova E., Hladka B. 2003. The Prague Dependency Treebank: Three-Level Annotation Scenario. In A. Abeille (ed.) Treebanks: Building and Using Syntactically Annotated Corpora, Kluwer Academic Publishers: 103-128.

Bresnan J., Dingare D., Manning C. D. 2001. Soft constraints mirror hard constraints: voice and person in English and Lummi. Proceedings of the LFG01 Conference, Hong Kong: 13-32.

Bybee J. 2002. Sequentiality as the basis of constituent structure. In T. Givón and B. Malle (eds.) The Evolution of Language out of Pre-Language, Amsterdam: John Benjamins: 107-132.

Croft W. 2003. Typology and Universals. Second Edition, Cambridge University Press, Cambridge.

Dell'Orletta F., Lenci A., Montemagni S., Pirrelli V. 2005. Climbing the path to grammar: a maximum entropy model of subject/object learning. Proceedings of the ACL-2005 Workshop "Psychocomputational Models of Human Language Acquisition", University of Michigan, Ann Arbor (USA), 29-30 June 2005.

Hay J., Baayen R.H. 2005. Shifting paradigms: gradient structure in morphology. Trends in Cognitive Sciences, 9(7): 342-348.

Hayes B. 2000. Gradient Well-Formedness in Optimality Theory. In Joost Dekkers, Frank van der Leeuw and Jeroen van de Weijer (eds.) Optimality Theory: Phonology, Syntax, and Acquisition, Oxford University Press: 88-120.

Lenci A. et al. 2000. SIMPLE: A General Framework for the Development of Multilingual Lexicons. International Journal of Lexicography, 13(4): 249-263.

MacWhinney B. 2004. A unified model of language acquisition. In J. Kroll & A. De Groot (eds.), Handbook of Bilingualism: Psycholinguistic Approaches, Oxford University Press, Oxford.

Manning C. D. 2003. Probabilistic syntax. In R. Bod, J. Hay, S. Jannedy (eds), Probabilistic Linguistics, MIT Press, Cambridge MA: 289-341.

Miyao Y., Tsujii J. 2002. Maximum entropy estimation for feature forests. Proc. HLT 2002.

Montemagni S. et al. 2003. Building the Italian syntactic-semantic treebank. In Abeillé A. (ed.) Treebanks: Building and Using Parsed Corpora, Kluwer, Dordrecht: 189-210.

Ratnaparkhi A. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Dissertation, University of Pennsylvania.
Table 5 – Ranked constraints for S and O in Czech and Italian

Constraints for S                 Constraints for O
Feature     Italian    Czech      Feature     Italian    Czech
Preverbal   1.31E+00   1.24E+00   Genitive    na         1.51E+00
Nomin       na         1.23E+00   NoAgr       1.58E+00   1.50E+00
Agr         1.28E+00   1.18E+00   Dative      na         1.49E+00
Anim        1.28E+00   1.16E+00   Accus       na         1.39E+00
Inanim      8.16E-01   1.03E+00   Instrum     na         1.39E+00
Postverbal  5.39E-01   8.77E-01   Postverbal  1.38E+00   1.17E+00
Genitive    na         2.94E-01   Inanim      1.23E+00   9.63E-01
NoAgr       1.52E-01   7.71E-02   Agr         4.67E-01   6.67E-01
Dative      na         2.85E-02   Anim        3.17E-01   6.63E-01
Accus       na         8.06E-03   Preverbal   2.11E-02   5.40E-01
Instrum     na         3.80E-03   Nomin       na         2.22E-02
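The rankings in Table 5 are obtained simply by sorting feature values on their estimated weights (the move section 5.1 attributes to Goldwater and Johnson 2003). A minimal sketch, using the Czech subject weights from the table:

```python
# Derive a constraint ranking by sorting feature values on their MaxEnt
# weights, as in section 5.1. The weights are the Czech subject column of
# Table 5; "na" rows for Italian are simply absent here.

CZECH_SUBJ = {
    "Preverbal": 1.24, "Nomin": 1.23, "Agr": 1.18, "Anim": 1.16,
    "Inanim": 1.03, "Postverbal": 8.77e-01, "Genitive": 2.94e-01,
    "NoAgr": 7.71e-02, "Dative": 2.85e-02, "Accus": 8.06e-03,
    "Instrum": 3.80e-03,
}

ranking = sorted(CZECH_SUBJ, key=CZECH_SUBJ.get, reverse=True)
print(ranking[:3])  # ['Preverbal', 'Nomin', 'Agr']
```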
Frontiers in Linguistic Annotation for Lower-Density Languages

Mike Maxwell
Center for Advanced Study of Language
University of Maryland
[email protected]

Baden Hughes
Department of Computer Science
The University of Melbourne
[email protected]
Abstract

The languages that are most commonly subject to linguistic annotation on a large scale tend to be those with the largest populations or with recent histories of linguistic scholarship. In this paper we discuss the problems associated with lower-density languages in the context of the development of linguistically annotated resources. We frame our work with three key questions, regarding the definition of lower-density languages, increasing available resources, and reducing data requirements. A number of steps forward are identified for increasing the number of lower-density language corpora with linguistic annotations.

1 Introduction

The process for selecting a target language for research activity in corpus linguistics, natural language processing or computational linguistics is largely arbitrary. To some extent, the motivation for a specific choice is based on one or more of a range of factors: the number of speakers of a given language; the economic and social dominance of the speakers; the extent to which computational and/or lexical resources already exist; the availability of these resources in a manner conducive to research activity; the level of geopolitical support for language-specific activity, or the sensitivity of the language in the political arena; the degree to which the researchers are likely to be appreciated by the speakers of the language simply because of engagement; and the potential scientific returns from working on the language in question (including the likelihood that the language exhibits interesting or unique phenomena). Notably, these factors are also significant in determining whether a language is worked on for documentary and descriptive purposes, although an additional factor in this particular area is the degree of endangerment (which can perhaps be contrasted with the likelihood of economic returns for computational endeavour).

As a result of these influencing factors, it is clear that languages which exhibit positive effects in one or more of these areas are likely to be the target of computational research. If we consider the availability of computationally tractable language resources, we find, unsurprisingly, that major languages such as English, German, French and Japanese are dominant, and research on computational approaches to linguistic analysis tends to be farthest advanced in these languages.

However, renewed interest in the annotation of lower-density languages has arisen for a number of reasons, both theoretical and practical. In this paper we discuss the problems associated with lower-density languages in the context of the development of linguistically annotated resources.

The structure of this paper is as follows. First we define lower-density languages and linguistically annotated resources, thus defining the scope of our interest. We review some related work in the area of linguistically annotated corpora for lower-density languages. Next we pose three questions which frame the body of this paper: What is the current status in terms of lower-density languages which have linguistically annotated corpora? How can we more efficiently create this particular type of data for lower-density languages? Can existing analytical methods perform reliably with less data? A number of steps are identified for advancing the agenda of linguis-
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 29–37, Sydney, July 2006. © 2006 Association for Computational Linguistics

tically annotated resources for lower-density languages, and finally we draw conclusions.

2 Lower-Density Languages

It should be noted from the outset that in this paper we interpret 'density' to refer to the amount of computational resources available, rather than the number of speakers any given language might have. The fundamental problem for annotation of lower-density languages is that they are lower-density. While on the surface this is a tautology, it in fact is the problem. For a few languages of the world (such as English, Chinese and Modern Standard Arabic, and a few Western European languages), resources are abundant; these are the high-density languages. For a few more languages (other European languages, for the most part), resources are, if not exactly abundant, at least existent and growing; these may be considered medium-density languages. Together, high-density and medium-density languages account for perhaps 20 or 30 languages, although of course the boundaries are arbitrary. For all other languages, resources are scarce, and hence they fall into our specific area of interest.

3 Linguistically Annotated Resources

While the scarcity of language resources for lower-density languages is apparent for all resource types (with the possible exception of monolingual text), it is particularly true of linguistically annotated texts. By annotated texts, we include the following sorts of computational linguistic resources:

• Parallel text aligned with another language at the sentence level (and/or at finer levels of parallelism, including morpheme-level glossing)

• Text annotated for named entities at various levels of granularity

• Morphologically analyzed text (for non-isolating languages; at issue here is particularly inflectional morphology, and to a lesser degree of importance for most computational purposes, derivational morphology); also a morphological tag schema appropriate to the particular language

• Text marked for word boundaries (for those scripts which, like Thai, do not mark most word boundaries)

• POS tagged text, and a POS tag schema appropriate to the particular language

• Treebanked (syntactically annotated and parsed) text

• Semantically tagged text (semantic roles) cf. Propbank (Palmer et al., 2005), or frames cf. Framenet[1]

• Electronic dictionaries and other lexical resources, such as Wordnet[2]

There are numerous dimensions for linguistically annotated resources, and a range of research projects have attempted to identify the core properties of interest. While concepts such as the Basic Language Resource Kit (BLARK; Krauwer, 2003; Mapelli and Choukri, 2003) have gained considerable currency in higher-density language resource creation projects, it is clear that the baseline requirements of such schemes are significantly more advanced than we can hope for in the case of lower-density languages in the short to medium term. Notably, the concept of a reduced BLARK ('BLARKette') has recently gained some currency in various forums.

4 Key Questions

Given that the vast majority of the more than seven thousand languages documented in the Ethnologue (Gordon, 2005) fall into the class of lower-density languages, what should we do? Equally important, what can we realistically do? We pose three questions by which to frame the remainder of this paper.

1. Status Indicators: How do we know where we are? How do we keep track of which languages are high-density or medium-density, and which are lower-density?

2. Increasing Available Resources: How (or can we) encourage the movement of languages up the scale from lower-density to medium-density or high-density?

[1] http://framenet.icsi.berkeley.edu/
[2] http://wordnet.princeton.edu
3. Reducing Data Requirements: Given that some languages will always be relatively lower-density, can language processing applications be made smarter, so that they do not require largely unattainable resources in order to perform adequately?

5 Status Indicators

We have been deliberately vague up to this point about how many lower-density languages there are, or the simpler question, how many high- and medium-density languages there are. Of course, one reason for this is that the boundary between low density and medium or high density is inherently vague. Another reason is that the situation is constantly changing; many Central and Eastern European languages which were lower-density languages a decade or so ago are now arguably medium-density, if not high-density. (The standard for high vs. low density changes, too; the bar is considerably higher now than it was ten years ago.)

But the primary reason for being vague about how many – and which – languages are low-density today is that no one is keeping track of what resources are available for most languages. So we simply have no idea which languages are low-density, and, more importantly (since we can guess that, in the absence of evidence to the contrary, a language is likely to be low-density), we do not know which resource types most languages do or do not have.

This lack of knowledge is not for lack of trying, although perhaps we have not been trying hard enough. The following are a few of the available catalogs of information about languages and their resources:

• The Ethnologue (http://www.ethnologue.org): This is the standard listing of the living languages of the world, but it contains little or no information about what resources exist for each language.

• The LDC catalog (http://www.ldc.upenn.edu/Catalog/) and ELDA catalog (http://www.elda.org/rubrique6.html): The Linguistic Data Consortium (LDC) and the European Language Resources Distribution Agency (ELDA) have been among the largest distributors of annotated language data. Their catalogs, naturally, cover only those corpora distributed by each organization, and these include only a small number of languages. Naturally, the economically important languages constitute the majority of the holdings of the LDC and ELDA.

• AILLA (the Archive of the Indigenous Languages of Latin America, http://www.ailla.utexas.org), and numerous other language archiving sites: Such sites maintain archives of linguistic data for languages, often with a specialization, such as the indigenous languages of a country or region. The linguistic data ranges from unannotated speech recordings to morphologically analyzed texts glossed at the morpheme level.

• OLAC (the Open Language Archives Community, http://www.language-archives.org): Given that many of the above resources (particularly those of the many language archives) are hard to find, OLAC is an attempt to be a meta-catalog (or aggregator) of such resources. It allows lookup of data by type, language, etc. for all data repositories that 'belong to' OLAC. In fact, all the above resources are listed in the OLAC union catalogue.

• Web-based catalogs of additional resources: There is a huge number of additional websites which catalog information about languages, ranging from electronic and print dictionaries (e.g. yourDictionary, http://www.yourdictionary.com) to discussion groups about particular languages (http://dir.groups.yahoo.com/dir/Cultures Community/By Language). Most such sites do little vetting of the resources, and dead links abound. Nevertheless, such sites (or a simple search with an Internet search engine) can often turn up useful information (such as grammatical descriptions of minority languages). Very few of these websites are cataloged in OLAC, although recent efforts (Hughes et al., 2006a) are slowly addressing the inclusion of web-based low-density language resources in such indexes.

None of the above catalogs is in any sense complete, and indeed the very notion of completeness is moot when it comes to cataloging Internet resources. But more to the point of this paper, it
is difficult, if not impossible, to get a picture of the state of language resources in general. How many languages have sufficient bitext (and in what genre), for example, that one could put together a statistical machine translation system? Which languages have morphological parsers (and for which languages is such a parser more or less irrelevant, because the language is relatively isolating)? Where can one find character encoding converters for the Ge'ez family of fonts, for languages written in Ethiopic script?

The answers to such questions are important for several reasons:

1. If there were a crisis that involved an arbitrary language of the world, what resources could be deployed? An example of such a situation might be another tsunami near Indonesia, which could affect dozens, if not hundreds, of minority languages. (The December 26, 2004 tsunami was particularly felt in the Aceh province of Indonesia, where one of the main languages is Aceh, spoken by three million people. Aceh is a lower-density language.)

2. Which languages could, with a relatively small amount of effort, move from lower-density status to medium-density or high-density status? For example, where parallel text is harvestable, a relatively small amount of work might suffice to produce many applications or other resources (e.g. by projecting syntactic annotation across languages). On the other hand, where the writing system of a language is in flux, or the language is politically oppressed, a great deal more effort might be necessary.

3. For which low-density languages might related languages provide the leverage needed to build at least first-draft resources? For example, one might think of using Turkish (arguably at least a medium-density language) as a sort of pivot language to build lexicons and morphological parsers for such low-density Turkic languages as Uzbek or Uyghur.

4. For which low-density languages are there extensive communities of speakers living in other countries, who might be better able to build language resources than speakers living in the perhaps less economically developed home countries? (Expatriate communities may also be motivated by a desire to maintain their language among younger speakers, born abroad.)

5. Which languages would require more work (and funding) to build resources, but are still plausible candidates for short-term efforts?

To our knowledge, there is no general, on-going effort to collect the sort of data that would make answers to these questions possible. A survey was done at the Linguistic Data Consortium several years ago (Strassel et al., 2003) of text-based resources for the three hundred or so languages having at least a million speakers (an arbitrary cutoff, to be sure, but necessary for the survey to have had at least some chance of success). It was remarkably successful, considering that it was done by two linguists who did not know the vast majority of the languages surveyed. The survey was funded long enough to 'finish' about 150 languages, but no subsequent update was ever done.

A better model for such a survey might be that of an edited book: one or more computational linguists would serve as 'editors', responsible for the overall framework and for the training of other participants. Section 'editors' would be responsible for a language family, or for the languages of a geographic region or country. Individual language experts would receive a small amount of training to enable them to answer the survey questions for their language, and would then be paid to do the initial survey, plus periodic updates. The model provided by the Ethnologue (Gordon, 2005) may serve as a starting point, although the level of detail that would be useful in assessing language resource availability makes wholesale adoption unsuitable.

6 Increasing Available Resources

Given that a language significantly lacks computational linguistic resources (and in the context of this paper and the associated workshop, annotated text resources), so that it falls into the class of lower-density languages (however that might be defined), what then?

Most large-scale collections of computational linguistics resources have been funded by government agencies, either the US government (typically the Department of Defense) or the governments of countries where the languages in question are spoken (primarily European, but also a
few other financially well-off countries). In some cases, governments have sponsored collections for languages which are not indigenous to the country in question (e.g. the EMILLE project, http://bowland-files.lancs.ac.uk/corplang/emille/; see McEnery et al., 2000).

In most such projects, production of resources for lower-density languages has been the work of a very small team which oversees the effort, together with paid annotators and translators. More specifically, collection and processing of monolingual text can be done by a linguist who need not know the language (although it helps to have a speaker of the language who can be called on to do language identification, etc.). Dictionary collection from on-line dictionaries can also be done by a linguist; but if it takes much more effort than that – for example, if the dictionary needs to be converted from print format to electronic format – it is again preferable to have a language speaker available.

Annotating text (e.g. for named entities) is different: it can only be done by a speaker of the language (more accurately, a reader: for Punjabi, for instance, it can be difficult to find fluent readers of the Gurmukhi script). Preferably the annotator is familiar enough with current events in the country where the language is spoken that they can interpret cross-references in the text. If two or more annotators are available, the work can be done somewhat more quickly. More importantly, there can be some checking for inter-annotator agreement (and revision taking into account such differences as are found).

Earlier work on corpus collection from the web (e.g. Resnik and Smith, 2003) gave some hope that reasonably large quantities of parallel text could be found on the web, so that a bitext collection could be built for interesting language pairs (with one member of the pair usually being English) relatively cheaply. Subsequent experience with lower-density languages has not borne that hope out; parallel text on the web seems relatively rare for most languages. It is unclear why this should be. Certainly in countries like India, there are large amounts of news text in English and in many of the target languages (such as Hindi). Nevertheless, very little of that text seems to be genuinely parallel, although recent work (Munteanu and Marcu, 2005) indicates that true parallelism may not be required for some tasks, e.g. machine translation, in order to gain acceptable results.

Because bitext is so difficult to find for lower-density languages, corpus creation efforts rely largely, if not exclusively, on contracting out text for translation. In most cases, source text is harvested from news sites in the target language, and then translated into English by commercial translation agencies, at a rate usually in the neighborhood of US$0.25 per word. In theory, one could reduce this cost by dealing directly with translators, avoiding the middleman agencies. Since many translators are in the Third World, this might result in considerable cost savings. Nevertheless, quality control issues loom large. The more professional agencies do quality control of their translations; even so, one may need to reject translations in some cases (and the agencies themselves may have difficulty in dealing with translators for languages for which there is comparatively little demand). Obviously this overall cost is high; it means that a 100k-word quantity of parallel text will cost in the neighborhood of US$25K.

Other sources of parallel text might include government archives (but apart from parliamentary proceedings which are published bilingually, such as the Hansards, these are usually not open), and the archives of translation companies (but again, these are seldom if ever open, because the agencies must guard the privacy of those who contracted the translations).

Finally, there is the possibility that parallel text – and indeed, other forms of annotation – could be produced in an open source fashion. Wikipedia (http://en.wikipedia.org) is perhaps the most obvious instance of this, as there are parallel articles in English and other languages. Unfortunately, the quantity of such parallel text at the Wikipedia is very small for all but a few languages. At present (May 2006), there are over 100,000 articles in German, Spanish, French, Italian, Japanese, Dutch, Polish, Portuguese and Swedish (although probably some of these articles are non-parallel; indeed, a random check of Cebuano articles in Wikipedia revealed that many were stubs – a term used in the Wikipedia to refer to "a short article in need of expansion" – or were simply links to Internet blogs, many of which were monolingual in English). Languages with over 10,000 articles include Arabic, Bulgarian, Catalan, Czech, Danish, Estonian, Esperanto and Ido (both constructed languages), Persian, Galician, Hebrew, Croatian, Bahasa Indonesia, Korean, Lithuanian, Hungarian, Bahasa Malaysia, Norwegian
(Bokmål and Nynorsk), Romanian, Russian, Slovak, Slovenian, Serbian, Finnish, Thai, Turkish, Ukrainian, and Chinese. The dominance of European languages in these lists is obvious.

During a TIDES exercise in 2003, researchers at Johns Hopkins University explored an innovative approach to the creation of bitext (parallel English and Hindi text, aligned at the sentence level): they elicited translations into English of Hindi sentences which they posted on an Internet web page (Oard, 2003; Yarowsky, 2003). Participants were paid for the best translations in Amazon.com gift certificates, with the quality of a twenty percent subset of the translations automatically evaluated using BLEU scores against highly scored translations of the same sentences from previous rounds. This pool of high-quality translations was initialized to a set of translations of known quality. A valuable side effect of the use of previously translated texts for evaluation is that this created a pool of multiply translated texts.

The TIDES translation exercise quickly produced a large body of translated text: 300K words, in five days, at a cost of about two cents per word. This approach to resource creation is similar to numerous open source projects, in the sense that the work was done by the public. It differed in that the results of this work were not made publicly available; in the use of an explicit quality control method; and of course in the payments to (some) participants. While the quality control aspect may be essential to producing useful language resources, hiding those resources not currently being used for evaluation is not essential to the methodology.

Open source resource creation efforts are of course common, with the Wikipedia (http://www.wikipedia.org) being the best known. Other such projects include Amazon.com's Mechanical Turk (http://www.mturk.com/mturk/), LiTgloss (http://litgloss.buffalo.edu/), The ESP Game (http://www.espgame.org/), and the Wiktionary (http://wiktionary.org/). Clearly some forms of annotation will be easier to do using an open source methodology than others will. For example, translation and possibly named entity annotation might be fairly straightforward, while morphological analysis is probably more difficult, particularly for morphologically complex languages.

Other researchers have experimented with the automatic creation of corpora using web data (Ghani et al., 2001). Some of these corpora have grown to reasonable sizes: Scannell (2003; 2006) has corpora derived from web crawling which are measured in tens of millions of words, for a variety of lower-density languages. However, it should be noted that in these cases the type of linguistic resource created is often not linguistically annotated text, but rather a lexicon or a collection of primary texts in a given language.

Finally, we may mention efforts to create certain kinds of resources by computer-directed elicitation. Examples of projects sharing this focus include BOAS (Nirenburg and Raskin, 1998) and the AVENUE project (Probst et al., 2002; Lavie et al., 2003).

7 Reducing Data Requirements

Creating more annotated resources is the obvious way to approach the problem of the lack of resources for lower-density languages. A complementary approach is to improve the way the information in smaller resources is used, for example by developing machine translation systems that require less parallel text.

How much reduction in the required amount of resources might be enough? An interesting experiment, which to our knowledge has never been tried, would be for a linguist to attempt as a test case what we hope that computers can do. That is, a linguist could take a 'small' quantity of parallel text, and extract as much lexical and grammatical information from it as possible. The linguist might then take a previously unseen text in the target language and translate it into English, or perform some other useful task on target language texts. One might argue over whether this experiment would constitute an upper bound on how much information could be extracted, but it would probably be more information than current computational approaches extract.

Naturally, this approach partially shifts the problem from the research community interested in linguistically annotated corpora to the research community interested in algorithms. Much effort has been invested in scaling algorithmic approaches upwards, that is, leveraging every last available data point in pursuit of small performance improvements. We argue that scaling down (i.e. using less training data) poses an equally sig-
nificant challenge. The basic question of whether data-rich methods can scale down to impoverished data has been the focus of a number of recent papers in areas such as machine translation (Somers, 1997; Somers, 1998) and language identification (Hughes et al., 2006b). However, tasks which have lower-density languages at their core have yet to become mainstream in the shared evaluation tasks which drive much of the algorithmic improvement in computational linguistics and natural language processing.

Another approach to data reduction is to change the type of data required for a given task. For many lower-density languages a significant volume of linguistically annotated data exists, but not in the form of the curated, standardised corpora to which language technologists are accustomed. Nevertheless, for extremely low-density languages, a degree of standardisation is apparent by virtue of documentary linguistic practice. Consider, for example, the number of Shoebox lexicons and corresponding interlinear texts which are potentially available from documentary sources: while not being the traditional resource types on which systems are trained, they are reasonably accessible, and they cover a larger number of languages. Bible translations are another form of parallel text, available in nearly every written language (see Resnik et al., 1999). There are of course issues of quality, not to mention vocabulary, that arise from using the Bible as a source of parallel text, but for some purposes – such as morphology learning – Bible translations might be a very good source of data.

Similarly, a different compromise may be found in the ratio of the number of words in a corpus to the richness of its linguistic annotation. In many high-density corpus development projects, an arbitrary (and high) target for the number of words is set in advance, and subsequent linguistic annotation is layered over this base corpus in a progressively more granular fashion. It may be that this corpus development model could be modified for lower-density language resource development: we argue that in many cases, the richness of linguistic annotation over a given set of data is more important than the raw quantity of the data set.

A related issue is the existence of different standards for annotating linguistic concepts. We already see this in larger languages (consider the differences in morpho-syntactic tagging between the Penn Treebank and other corpora), but there is a higher diversity of standards among lower-density languages. Solutions may include ontologies for linguistic concepts, e.g. the General Ontology for Linguistic Description (http://www.linguistics-ontology.org) and the ISO Data Category Registry (Ide and Romary, 2004), which allow cross-resource navigation based on common semantics. Of course, cross-language and cross-cultural semantics is a notoriously difficult subject.

Finally, it may be that the development of web-based corpora can act as the middle ground: there are plenty of documents on the web in lower-density languages, and efforts such as the projects by Scannell (http://borel.slu.edu/crubadan/stadas.html) and Lewis (http://www.csufresno.edu/odin) indicate that these can be curated reasonably efficiently, even though the outcomes may be slightly different from those to which we are accustomed. Is it possible to make use of XML or HTML markup directly in these cases? Someday, the semantic web may help us with this type of approach.

8 Moving Forward

Having considered the status of linguistically-annotated resources for lower-density languages, and two broad strategies for improving this situation (innovative approaches to data creation, and scaling down of resource requirements for existing techniques), we now turn to the question of where to go from here. We believe that there are a number of practical steps which can be taken in order to increase the number of linguistically-annotated lower-density language resources available to the research community:

• Encouraging the publication of electronic corpora of lower-density languages: most economic incentives for corpus creation only exhibit a return on investment because of the focus on higher-density languages; new models of funding and commercializing corpora for lower-density languages are required.

• Engaging in research on bootstrapping from higher-density language resources to lower-density surrogates: it seems obvious that, at least for related languages, adopting a derivational approach to the generation of linguistically annotated corpora for lower-density languages, by using automated annotation tools trained on higher-density lan-
guages may at least reduce the human effort required.

• Scaling down (through data requirement reduction) of state-of-the-art algorithms: there has been little work on downscaling state-of-the-art algorithms for tasks such as named entity recognition, POS tagging and syntactic parsing, yet (considerably) reducing the training data requirement seems like one of the few ways that existing analysis technologies can be applied to lower-density languages.

• Shared evaluation tasks which include lower-density languages or smaller amounts of data: most shared evaluation tasks are construed as exercises in cross-linguistic scalability (e.g. CLEF) or data intensivity (e.g. TREC) or both (e.g. NTCIR). Within these constructs there is certainly room for the inclusion of lower-density languages as targets, although notably the overhead here is not in the provision of the language data, but in the derivatives (e.g. query topics) on which these exercises are based.

• Promotion of multilingual corpora which include lower-density languages: as multilingual corpora emerge, there is an opportunity to include lower-density languages at minimal opportunity cost; e.g. EuroGOV (Sigurbjónsson et al., 2005) and JRC-Acquis (Steinberger et al., 2006), which are based on web data from the EU, include a number of lower-density languages by virtue of the corpus creation mechanism not being language-specific.

• Language-specific strategies: collectively we have done well at developing formal strategies for high-density languages, e.g. in EU roadmaps, but not so well at strategies for medium-density or lower-density languages. The models for medium- to long-term strategies of language resource development may be adopted for lower-density languages. Recently this has been evidenced through events such as the LREC 2006 workshop on African language resources and the development of a corresponding roadmap.

• Moving towards interoperability between the annotation schemes which dominate the higher-density languages (e.g. Penn Treebank tagging conventions) and the relatively ad-hoc schemes often exhibited by lower-density languages, through means such as markup ontologies like the General Ontology for Linguistic Description or the ISO Data Category Registry.

Many of these steps are not about to be realised in the short term. However, developing a cohesive strategy for addressing the need for linguistically annotated corpora is a first step in ensuring commitment from interested researchers to a common roadmap.

9 Conclusion

It is clear that the number of linguistically-annotated resources for any language will inevitably be less than optimal. Regardless of the density of the language under consideration, the cost of producing linguistically annotated corpora of a substantial size is significant. Inevitably, languages which do not have a strong political, economic or social status will be less well resourced. Certain avenues of investigation, e.g. collecting language-specific web content or building approximate bitexts from web data, are being explored, but other areas (such as rich morphosyntactic annotation) are not particularly evidenced.

However, there is considerable research interest in the development of linguistically annotated resources for languages of lower density. We are encouraged by the steady rate at which academic papers emerge reporting the development of resources for lower-density language targets. We have proposed a number of steps by which language resources for lower-density languages may be more efficiently created, and we look forward with anticipation to how these ideas motivate future work.

References

Rayid Ghani, Rosie Jones, and Dunja Mladenic. 2001. Mining the web to create minority language corpora. In Proceedings of the 2001 ACM International Conference on Knowledge Management (CIKM 2001), pages 279–286. Association for Computing Machinery.

Raymond G. Gordon. 2005. Ethnologue: Languages of the World (15th Edition). SIL International: Dallas.
Baden Hughes, Timothy Baldwin, and Steven Bird. 2006a. Collecting low-density language data on the web. In Proceedings of the 12th Australasian Web Conference (AusWeb06). Southern Cross University.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006b. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). European Language Resources Association: Paris.

Nancy Ide and Laurent Romary. 2004. A registry of standard data categories for linguistic annotation. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 135–139. European Language Resources Association: Paris.

Steven Krauwer. 2003. The basic language resource kit (BLARK) as the first milestone for the language resources roadmap. In Proceedings of the 2nd International Conference on Speech and Computer (SPECOM 2003).

A. Lavie, S. Vogel, L. Levin, E. Peterson, K. Probst, A. Font Llitjós, R. Reynolds, J. Carbonell, and R. Cohen. 2003. Experiments with a Hindi-to-English transfer-based MT system under a miserly data scenario. ACM Transactions on Asian Language Information Processing (TALIP), 2(2).

Valerie Mapelli and Khalid Choukri. 2003. Report on a minimal set of language resources to be made available for as many languages as possible, and a map of the actual gaps. ENABLER internal project report (Deliverable 5.1).

Tony McEnery, Paul Baker, and Lou Burnard. 2000. Corpus resources and minority language engineering. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000). European Language Resources Association: Paris.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.

Sergei Nirenburg and Victor Raskin. 1998. Universal grammar and lexis for quick ramp-up of MT. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 975–979. Association for Computational Linguistics.

Douglas W. Oard. 2003. The surprise language exercises. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84.

Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The proposition bank: A corpus annotated with semantic roles. Computational Linguistics, 31(1).

K. Probst, L. Levin, E. Peterson, A. Lavie, and J. Carbonell. 2002. MT for minority languages using elicitation-based learning of syntactic transfer rules. Machine Translation, 17(4):245–270.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380.

Philip Resnik, Mari Broman Olsen, and Mona Diab. 1999. The Bible as a parallel corpus: Annotating the 'Book of 2000 Tongues'. Computers and the Humanities, 33(1-2):129–153.

Kevin Scannell. 2003. Automatic thesaurus generation for minority languages: an Irish example. In Actes des Traitement Automatique des Langues Minoritaires et des Petites Langues, volume 2, pages 203–212.

Kevin Scannell. 2006. Machine translation for closely related language pairs. In Proceedings of the LREC 2006 Workshop on Strategies for Developing Machine Translation for Minority Languages. European Language Resources Association: Paris.

B. Sigurbjónsson, J. Kamps, and M. de Rijke. 2005. Blueprint of a cross-lingual web collection. Journal of Digital Information Management, 3(1):9–13.

Harold Somers. 1997. Machine translation and minority languages. Translating and the Computer, 19:1–13.

Harold Somers. 1998. Language resources and minority languages. Language Today, 5:20–24.

R. Steinberger, B. Pouliquen, A. Widger, C. Ignat, T. Erjavec, D. Tufis, and D. Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). European Language Resources Association: Paris.

Stephanie Strassel, Mike Maxwell, and Christopher Cieri. 2003. Linguistic resource creation for research and technology development: A recent experiment. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):101–117.

David Yarowsky. 2003. Scalable elicitation of training data for machine translation. Team Tides, 4.

10 Acknowledgements

The authors are grateful to Kathryn L. Baker for her comments on earlier drafts of this paper. Portions of the research in this paper were supported by the Australian Research Council Special Research Initiative (E-Research) grant number SR0567353, "An Intelligent Search Infrastructure for Language Resources on the Web."
Annotation Compatibility Working Group Report*
Contributors: A. Meyers, A. C. Fang, L. Ferro, S. Kübler, T. Jia-Lin, M. Palmer, M. Poesio, A. Dolbey, K. K. Schuler, E. Loper, H. Zinsmeister, G. Penn, N. Xue, E. Hinrichs, J. Wiebe, J. Pustejovsky, D. Farwell, E. Hajicova, B. Dorr, E. Hovy, B. A. Onyshkevych, L. Levin Editor: A. Meyers [email protected]
*As this report is a compilation, some sections may not reflect the views of individual contributors.

Abstract

This report explores the question of compatibility between annotation projects, including translating annotation formalisms to each other or to common forms. Compatibility issues are crucial for systems that use the results of multiple annotation projects. We hope that this report will begin a concerted effort in the field to track the compatibility of annotation schemes for part of speech tagging, time annotation, treebanking, role labeling and other phenomena.

1. Introduction

Different corpus annotation projects are driven by different goals, are applied to different types of data (different genres, different languages, etc.) and are created by people with different intellectual backgrounds. As a result of these and other factors, different annotation efforts make different underlying theoretical assumptions. Thus, no annotation project is really theory-neutral, and in fact, none should be. It is the theoretical concerns which make it possible to write the specifications for an annotation project and which cause the resulting annotation to be consistent and thus usable for various natural language processing (NLP) applications. Of course, the theories chosen for annotation projects tend to be theories that are useful for NLP. They place a high value on descriptive adequacy (they cover the data), they are formalized sufficiently for consistent annotation to be possible, and they tend to share major theoretical assumptions with other annotation efforts, e.g., the noun is the head of the noun phrase, the verb is the head of the sentence, etc. Thus the term theory-neutral is often used to mean something like NLP-friendly. Obviously, the annotation compatibility problem that we address here is much simpler than it would be if we had to consider theories which place a low emphasis on NLP-friendly properties (Minimalism, Optimality Theory, etc.).

As annotation projects are usually research efforts, the inherent theoretical differences may be viewed as part of a search for the truth, and the enforcement of adherence to a given (potentially wrong) theory could hamper this search. In addition, annotation of particular phenomena may be simplified by making theoretical assumptions conducive to describing those phenomena. For example, relative pronouns (e.g., that in the NP the book that she read) may be viewed as pronouns in an anaphora annotation project, but as intermediate links to arguments for a study of predicate argument structure.

On the other hand, many applications would benefit from merging the results of different annotation projects. Thus, differences between annotation projects may be viewed as obstacles. For example, combining two or more corpora annotated with the same information may improve a system (i.e., "there's no data like more data"). To accomplish this, it may be necessary to convert corpora annotated according to one set of specifications into a different system, or to convert two annotation systems into a third system. For example, to obtain lots of part of speech data for English, it is advantageous to convert POS tags from several tagsets (see Section 2) into a common form. For more temporal data than is available in Timex3 format, one might have to convert Timex2 and Timex3 tags into a common form (see Section 5). Compromises that do not involve conversion can be flawed. For example, a machine learner may determine that feature A in framework 1 predicts feature A' in framework 2. However, the system may miss that features A and B in framework 1 actually both correspond to feature A', i.e., they are subtypes. In our view, directly modeling the parameters of compatibility would be preferable.
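The subtype problem just described can be made concrete: if a converter records tag correspondences as explicit many-to-many relations, the fact that A and B are both subtypes of A' is preserved rather than lost in a learned pairwise mapping. A minimal sketch (the framework names and tags are invented for illustration):

```python
# Hypothetical sketch: modeling tag correspondences as explicit relations.
# Framework 1 makes a distinction (A vs. B) that framework 2 collapses into A'.
CORRESPONDENCE = {
    ("fw1", "A"): {("fw2", "A'")},
    ("fw1", "B"): {("fw2", "A'")},   # A and B are subtypes of A'
    ("fw2", "A'"): {("fw1", "A"), ("fw1", "B")},
}

def convert(framework, tag, target):
    """Return the set of candidate tags in `target` for `tag` in `framework`."""
    return {t for fw, t in CORRESPONDENCE.get((framework, tag), set()) if fw == target}

# Converting fw2's A' back to fw1 is ambiguous: both subtypes are candidates,
# and the explicit table makes that ambiguity visible.
assert convert("fw1", "A", "fw2") == {"A'"}
assert convert("fw2", "A'", "fw1") == {"A", "B"}
```

Because the relation is stored directly rather than induced, a consumer of the converted corpus can see when a mapping is lossy.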
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 38–53, Sydney, July 2006. © 2006 Association for Computational Linguistics

Some researchers have attempted to combine a number of different resource annotations into a single merged form. One motivation is that the merged representation may be more than the sum of its parts. It is likely that inconsistencies and errors (often induced by task-specific biases) can be identified and adjusted in the merging process; inferences may be drawn from how the component annotation systems interact; a complex annotation in a single framework may be easier for a system to process than several annotations in different frameworks; and a merged framework will help guide further annotation research (Pustejovsky, et al. 2005). Another reason to merge is that a merged resource in language A may be similar to an existing resource in language B. Thus merging resources may present opportunities for constructing nearly parallel resources, which in turn could prove useful for a multilingual application. Merging PropBank (Kingsbury and Palmer 2002) and NomBank (Meyers, et al. 2004) would yield a predicate argument structure for nouns and verbs, carrying information more similar to the Prague Dependency TreeBank's TectoGrammatical structure (Hajicova and Ceplova, 2000) than either component.

This report and an expanded online version (http://nlp.cs.nyu.edu/wiki/corpuswg/AnnotationCompatibility) both describe how to find correspondences between annotation frameworks. This information can be used to combine various annotation resources in different ways, according to one's research goals, and, perhaps, could lead to some standards for combining annotation. This report will outline some of our initial findings in this effort, with an eye towards maintaining and updating the online version in the future. We hope this is a step towards making it easier for systems to use multiple annotation resources.

2. Part of Speech and Phrasal Categories

On our website, we provide correspondences among a number of different part of speech tagsets in a version of the table from pp. 141–142 of Manning and Schütze (1999), modified to include the POS classes from CLAWS1 and ICE. Table 1 is a sample taken from this table for expository purposes (the full table is not provided due to space limitations). Traditionally, a part of speech represents a fairly coarse-grained division among types of words, usually distinguishing among nouns, verbs, adjectives, adverbs, determiners and possibly a few other classes. While part of speech classifications may vary for particular words, especially closed class items, we have observed a larger problem. Most part of speech annotation projects incorporate other distinctions into part of speech classification. Furthermore, they incorporate different types of distinctions. As a result, conversion between one tagset and another is rarely one to one. It can, in fact, be many to many; e.g., BROWN does not distinguish the infinitive form of a verb (VB in the Penn Treebank, V.X.infin in ICE) from the present-tense form (VBP in the Penn Treebank, V.X.pres in ICE) that has the same spelling (e.g., see in They see no reason to leave). In contrast, ICE distinguishes among several different subcategories of verb (cop, intr, cxtr, dimontr, ditr, montr and trans) and the Penn Treebank does not.¹ In a hypothetical system which merges all the different POS tagsets, it would be advantageous to factor out different types of features (similar to ICE), but include all the distinctions made by all the tag sets. For example, if a token give is tagged as VBP in the Penn Treebank, VBP would be converted into VERB.anysubc.pres. If another token give was tagged VB in Brown, VB would be converted to VERB.anysubc.{infin,n3pres} (n3pres = not-3rd-person and present tense). This allows systems to acquire the maximum information from corpora tagged by different research groups.

Table 1: Part of Speech Compatibility. Extending Manning and Schütze 1999, pp. 141–142, to cover CLAWS1 and ICE (longer version online).

Class           | Words               | CLAWS (c5, CLAWS1) | Brown   | PTB     | ICE
Adj             | happy, bad          | AJ0                | JJ      | JJ      | ADJ.ge
Adj, comp       | happier, worse      | AJC                | JJR     | JJR     | ADJ.comp
Adj, super      | nicest, worst       | AJS                | JJT     | JJS     | ADJ.sup
Adj, past part  | eaten               | JJ ??              | VBN, JJ | VBN, JJ | ADJ.edp
Adj, pres part  | calming             | JJ ??              | VBG, JJ | VBG, JJ | ADJ.ingp
Adv             | slowly, sweetly     | AV0                | RB      | RB      | ADV.ge
Adv, comp       | faster              | AV0                | RBR     | RBR     | ADV.comp
Adv, super      | fastest             | AV0                | RBT     | RBS     | ADV.sup
Adv, Part       | up, off, out        | AVP, RP, RI        | RP      | RP      | ADV.{phras, ge}
Conj, coord     | and, or             | CJC, CC            | CC      | CC      | CONJUNC.coord
Det             | this, each, another | DT0, DT            | DT      | DT      | PRON.dem.sing, PRON(recip)
Det, pron       | any, some           | DT0, DTI           | DT1     | DT      | PRON.nonass, PRON.ass
Det, pron plur  | these, those        | DT0, DTS           | DTS     | DT      | PRON.dem.plu
Det, preq       | quite               | DT0, ABL           | ABL     | PDT     | ADV.intens
Det, preq       | all, half           | DT0, ABN           | ABN     | PDT     | PRON.univ, PRON.quant
Noun, sing      | aircraft, data      | NN0                | NN      | NN      | N.com.sing
Noun, sing      | cat, pen            | NN1                | NN      | NN      | N.com.sing
Noun, plur      | cats, pens          | NN2                | NNS     | NNS     | N.com.plu
Noun, prop sing | Paris, Mike         | NP0                | NP      | NNP     | N.prop.sing
Verb, base pres | take, live          | VVB                | VB      | VBP     | V.X.{pres, imp}
Verb, infin     | take, live          | VVI                | VB      | VB      | V.X.infin
Verb, past      | took, lived         | VVD                | VBD     | VBD     | V.X.past
Verb, pres part | taking, living      | VVG                | VBG     | VBG     | V.X.ingp
Verb, past part | taken, lived        | VVN                | VBN     | VBN     | V.X.edp
Verb, pres      | takes               | VVZ                | VBZ     | VBZ     | V.X.pres

¹ In the ICE column of Table 1, X represents the disjunction of verb subcategorization types {cop, intr, cxtr, dimontr, ditr, montr, trans}.

The CKIP Chinese Treebank (CCTB) and the Penn Chinese Treebank (PCTB) are two important resources for Treebank-derived Chinese NLP tasks (CKIP, 1995; Xia et al., 2000; Xu et al., 2002; Li et al., 2004). CCTB is developed in traditional Chinese (BIG5-encoded) at the Academia Sinica, Taiwan (Chen et al., 1999; Chen et al., 2003). CCTB uses the Information-based Case Grammar (ICG) framework to express both syntactic and semantic descriptions. The present version, CCTB3 (Version 3), provides 61,087 Chinese sentences, 361,834 words and 6 files that are bracketed and post-edited by humans, based on a 5-million-word tagged Sinica Corpus (CKIP, 1995). CKIP POS tagging is a hierarchical system. The first POS layer includes eight main syntactic categories, i.e. N (noun), V (verb), D (adverb), A (adjective), C (conjunction), I (interjection), T (particle) and P (preposition).
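The factored-tag conversion sketched in Section 2 above amounts to a lookup from (tagset, tag) pairs into a common factored form. A minimal illustration, covering only the verb tags used in the report's give example (the factored tag strings follow the report; the coverage is deliberately tiny):

```python
# Illustrative mapping into a common factored form that keeps every
# distinction any tagset makes (see the Section 2 discussion above).
FACTORED = {
    ("PTB", "VBP"): "VERB.anysubc.pres",
    ("PTB", "VB"): "VERB.anysubc.infin",
    ("Brown", "VB"): "VERB.anysubc.{infin,n3pres}",  # Brown collapses the two forms
    ("Brown", "VBD"): "VERB.anysubc.past",
}

def to_common(tagset, tag):
    """Convert a tagset-specific tag to the common factored form."""
    return FACTORED[(tagset, tag)]

assert to_common("PTB", "VBP") == "VERB.anysubc.pres"
assert to_common("Brown", "VB") == "VERB.anysubc.{infin,n3pres}"
```

Ambiguity in a coarse source tagset surfaces as a disjunction in the factored form (the {infin,n3pres} cell), rather than being silently resolved.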
In CCTB, there are 6 non-terminal phrasal categories: S (a complete tree headed by a predicate), VP (a phrase headed by a predicate), NP (a phrase headed by an N), GP (a phrase headed by a locational noun or adjunct), PP (a phrase headed by a preposition) and XP (a conjunctive phrase that is headed by a conjunction).

PCTB annotates simplified Chinese texts (GB-encoded) from newswire sources (Xinhua newswire, Hong Kong news and the Sinorama news magazine, Taiwan). It is developed at the University of Pennsylvania (UPenn). The PCTB annotates Chinese texts with syntactic bracketing, part of speech information, empty categories and function tags (Xia et al., 2000, 2002, 2005). The predicate-argument structure of Chinese verbs for the PCTB is encoded in the Penn Chinese Proposition Bank (Xue, et al. 2005). The present version, PCTB5.1 (PCTB Version 5.1), contains 18,782 sentences, 507,222 words, 824,983 Hanzi and 890 data files. PCTB's bracketing annotation is in the same framework as the other Penn Treebanks, bearing a loose connection to the Government and Binding Theory paradigm. The PCTB annotation scheme involves 33 POS-tags, 17 phrasal tags, 6 verb compound tags, 7 empty category tags and 26 functional tags.

Table 2 includes Top-Layer (TL)/Bottom-Layer (BL) POS and phrasal category correspondences between PCTB4 and CCTB3 for words/phrases expressed as the same Chinese characters in the same order.

Table 2: Top-Layer/Bottom-Layer correspondences between PCTB and CCTB (examples given by English gloss)

Example (gloss)       | TL PCTB | TL CCTB | BL PCTB | BL CCTB
in other words        | ADVP    | Head    | AD      | Dk
therefore             | ADVP    | result  | AD      | Cbca
because               | P       | reason  | P       | Cbaa
past                  | NP-TMP  | time:NP | NT      | Ndda
last year             | NP-TMP  | NP      | NT      | Ndaba
among                 | NP-ADV  | NP      | NN      | Nep
long                  | DVP     | ADV     | AD:DEV  | Dk
in the last few years | LCP-TMP | GP      | NT:LCG  | Nddc:Ng

3. Differences Between Frameworks

We assume that certain high level differences between annotation schemata should be ignored if at all possible, namely those that represent differences of analyses that are notationally equivalent. In this section, we will discuss those sorts of differences, with an eye towards evaluating whether real differences do in fact exist, so that users of annotation can be careful should these differences be of significance to their particular application.

To clarify, we are talking about the sort of high level differences which reflect differences in the linguistic framework used for representing annotation; e.g., many frameworks represent long distance dependencies in equivalent, but different, ways. In this sense, the linguistic framework of the Penn Treebank is a phrase structure based framework that includes a particular set of node labels (POS tags, phrasal categories, etc.), function tags, indices, etc.²

² Linguistic frameworks are independent of encoding systems; e.g., the Penn Treebank's inline LISP-ish notation can be converted to inline XML, offset annotation, etc. Such encoding differences are outside the scope of this report.

3.1 Dependency vs. Constituency

Figure 1 is a candidate rule for converting a phrase structure tree to a dependency tree or vice versa. Given a phrase consisting of constituents C(n-i) to C(n+j), the rule assumes that: there is one unique constituent C(n) that is the head of the phrase; and it is possible to identify this head in the phrase structure grammar, either using a reliable heuristic or due to annotation that marks the head of the phrase.
When converting the phrase structure tree to the dependency tree, the rule promotes the head to the root of the tree. When converting a dependency tree to a phrase structure tree, the rule demotes the root to a constituent, possibly marking it as the head, and names the phrase based on the head's part of speech, e.g., nouns are heads of NPs. This rule is insufficient because: (1) identifying the head of a phrase by heuristics is not 100% reliable, and most phrase structure annotation does not include a marking for the head; (2) some phrase structure distinctions do not translate well to some Dependency Grammars, e.g., the VP analysis and nestings of prenominal modifiers³; and (3) the rule only works for phrases that fit the head plus modifiers pattern, and many phrases do not fit this pattern (uncontroversially).

³ The Prague Dependency Treebank orders dependency branches from the same head to represent the scope of the dependencies. Applicative Universal Grammar (Shaumyan 1977) incorporates phrases into dependency structure.

Fig. 1: Candidate Constituency/Dependency Mapping

While most assume that verbs act like the head of the sentence, a Subject + VP analysis of a sentence complicates this slightly. Regarding S-bars (relative clauses, that-S, subordinate-conjunction + S, etc.), there is some variation among theories as to whether the verb or the pre-S element (that, a subordinate conjunction, etc.) is assigned the head label. Names, dates, and other "patterned" phrases don't seem to have a unique head. Rather, there are sets of constituents which together act like the head. For example, in Dr. Mary Smith, the string Mary Smith acts like the head. Idioms are a big can of worms. Their headedness properties vary quite a bit. Sometimes they act like normal headed phrases and sometimes they don't. For example, the phrase pull strings for John obeys all the rules of English that would be expected of a verb phrase that consists of a verb, an NP and a complement PP. In contrast, the phrase let alone (Fillmore, et al. 1988) has a syntax unique to that phrase. Semi-idiomatic constructions (including phrasal verbs, complex prepositions, etc.) raise some of the same questions as idioms. While making headedness assumptions similar to other phrases is relatively harmless, there is some variation. For example, in the phrase Mary called Fred up on the phone, there are two common views: (a) called is the head of the VP (or S) and up is a particle that depends on called; and (b) the VP has a complex head called up. For most purposes, the choice between these two analyses is arbitrary. Coordinate structures also require different treatment from head + modifier phrases -- there are multiple head-like constituents.

A crucial factor is that the notion head is used to represent different things (cf. Corbett, et al. 1993, Meyers 1995). However, there are two dominant notions. The first we will call the functor (following Categorial Grammar). The functor is the glue that holds the phrase together -- the word that selects for the other words, determines word order, licenses the construction, etc. For example, coordinate conjunctions are functors because they link the constituents in their phrase together. The second head-like notion we will call the thematic head: the word or words that determine the external selectional properties of the phrase, and usually the phrasal category as well. For example, in the noun phrase the book and the rock, the conjunction and is the functor, but the nouns book and rock are thematic heads. The phrase is a concrete noun phrase due to book and rock. Thus the following sentence is well-formed: I held the book and the rock, but the following sentence is ill-formed: *I held the noise and the redness. Furthermore, the phrase the book and the rock is a noun phrase, not a conjunction phrase.
In summary, there are some differences between phrase structure and dependency analyses which may be lost in translation; e.g., dependency analyses include head-marking by default and phrase structure analyses do not. On the other hand, phrase structure analyses include relations between groupings of words which may not always be preserved when translating to dependencies. Moreover, both identifying heads and combining words into phrases have their own sets of problems, which can come to the forefront when translation between the two modalities is attempted. To be descriptively adequate, frameworks that mark heads do deal with these issues. The problem is that they are dealt with in different ways across dependency frameworks and across those phrase structure frameworks where heads are marked. For example, conjunction may be handled as a distinct phenomenon (another dimension) that can be filtered through to the real heads. Alternatively, a head is selected on theoretical or heuristic grounds (the head of the first conjunct, the conjunction, etc.). When working with multiple frameworks, a user must adjust for the assumptions of each framework.

3.2 Gap Filling Mechanisms

It is well-known that there are several equivalent ways to represent long distance and lexically based dependencies, e.g., (Sag and Fodor, 1994). Re-entrant graphs (graphs with shared structure), empty category/antecedent pairs, and representations of discontinuous constituents, among other mechanisms, can all be used to represent that there is some relation R between two (or more) elements in a linguistic structure that is, in some sense, noncanonical. The link by any of these mechanisms can be used to show that the relation R holds in spite of violations of a proximity constraint (long distance dependencies), a special construction such as control, or many other conditions. Some examples follow.

In some examples, the filler is hypothetical and should be interpreted something like the pronoun anyone (4 below) or the person being addressed (5). In other examples, the identity between filler and gap is not so straightforward. In examples like (6), filler and gap are type identical, not token identical (they represent different reading events). In examples like (7), a gap can take split antecedents. Conventional filler/gap mechanisms of all types have to be modified to handle these types of examples.

4. They explained how e to drive the car
5. e don't talk to me!
6. Sally [read a linguistics article]i, but John didn't ei.
7. Sallyi spoke with Johnj about ei,j leaving together.

3.3 Coreference and Anaphora

There is little agreement concerning coreference annotation in the research community. Funding for the creation of the existing anaphorically annotated corpora (MUC6/7, ACE) has come primarily from initiatives focused on specific application tasks, resulting in task-oriented annotation schemes. On the other hand, a few (typically smaller) corpora have also been created to be consistent with existing, highly developed theoretical accounts of anaphora from a linguistic perspective. Accordingly, many schemes for annotating coreference or anaphora have been proposed, differing significantly with respect to: (1) the task definition, i.e., what type of semantic relations are annotated; and (2) the flexibility that annotators have.

By far the best known and most used scheme is that originally proposed for MUC 6 and later adapted for ACE. This scheme was developed to support information extraction, and its primary aim is to identify all mentions of the same objects in the text ('coreference') so as to collect all predications about them.
but by specifying once and for all which files contain the base level and which files contain each level of representation; each level points to the same base level. Secondly, markables and coref links are contained in the same file.

5. [markable file]

for identifying these problematic links, although syntactic information can sometimes help. The opposite of course is not true; there is no direct way of representing the information in (4) in the MUC format. Conversion between the MAS-XML and the MMAX format is also possible, provided that pointers are used to represent both
argument relations can provide a way to generalize over data and, perhaps, allow systems to mitigate against the sparse data problem.

Systems for representing predicate argument relations vary drastically in granularity. In particular, there is a long history of disagreement about the appropriate level of granularity of role labeling, i.e., the tags used to distinguish between predicate argument relations. At one extreme, no distinction is made between predicate relations: one simply marks that the functor and argument are in a predicate-argument relation (e.g., unlabeled dependency trees). In another approach, one might distinguish among the arguments of each predicate with a small set of labels, sometimes numbered -- examples of this approach include Relational Grammar (Perlmutter 1984), PropBank and NomBank. These labels have different meanings for each functor, e.g., the subjects of eat, write and devour are distinct. This assumes a very high level of granularity, i.e., there are several times as many possible relations as there are distinct functors. So 1000 verbs may license as many as 5000 distinct relations. Under other approaches, a small set of relation types are generalized across functors. For example, under Relational Grammar's Universal Alignment Hypothesis (Perlmutter and Postal 1984, Rosen 1984), subject, object and indirect object relations are assumed to be of the same types regardless of verb. These terms thus are fairly coarse-grained distinctions between types of predicate/argument relations between verbs and their arguments.

Some predicate-neutral relations are more fine-grained, including Panini's Karaka of 2000 years ago, and many of the more recent systems which make distinctions such as agent, patient, theme, recipient, etc. (Gruber 1965, Fillmore 1968, Jackendoff 1972). The (current) Interlingual Annotation of Multilingual Text Corpora project (http://aitc.aitc.net.org/nsf/iamtc/) takes this approach. Critics claim that it can be difficult to maintain consistency across predicates with these systems without constantly increasing the inventory of role labels to describe idiosyncratic relations, e.g., the relation between the verbs multiply, conjugate, and their objects. For example, only a very idiosyncratic classification could capture the fact that only a large round object (like the Earth) can be the object of circumnavigate. It can also be unclear which of two role labels applies. For example, there can be a thin line between a recipient and a goal; e.g., the prepositional object of John sent a letter to the Hospital could take one role or the other depending on a fairly subtle ambiguity.

To avoid these problems, some annotation research (and some linguistic theories) has abandoned predicate-neutral approaches in favor of approaches that define predicate relations on a predicate by predicate basis. Furthermore, various balances have been attempted to solve some of the problems of the predicate-neutral relations. FrameNet defines roles on a scenario by scenario basis, which limits the growth of the inventory of relation labels and insures consistency within semantic domains or frames. On the other hand, the predicate-by-predicate approach is arguably less informative than the predicate-neutral approach, allowing for no generalization of roles across predicates. Thus, although PropBank/NomBank use a strictly predicate by predicate approach, there have been some attempts to regularize the numbering for semantically related predicates. Furthermore, the descriptors used by the annotators to define roles can sometimes be used to help make finer distinctions (descriptors often include familiar role labels like agent, patient, etc.).

The diversity of predicate argument labeling systems and the large inventory of possible role labels make it difficult to provide a simple mapping (like Table 1 for part of speech conversion) between these types of systems. The SemLink project provides some insight into how this mapping problem can be solved.

4.2 SemLink

SemLink is a project to link the lexical resources of FrameNet, PropBank, and VerbNet. The goal is to develop computationally explicit connections between these resources, combining their individual advantages and overcoming their limitations.
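The contrast discussed above, between numbered, predicate-specific labels and predicate-neutral role labels, can be sketched as a mapping whose key must include the predicate (all entries are invented for illustration):

```python
# Illustrative mapping from a predicate-by-predicate labeling (numbered
# arguments) to a predicate-neutral one (agent/patient/recipient).
NUMBERED_TO_NEUTRAL = {
    ("eat", "Arg0"): "agent", ("eat", "Arg1"): "patient",
    ("send", "Arg0"): "agent", ("send", "Arg2"): "recipient",
}

def neutral_role(predicate, arg):
    # Numbered labels mean different things for each predicate, so the
    # predicate must be part of the lookup key.
    return NUMBERED_TO_NEUTRAL[(predicate, arg)]

assert neutral_role("eat", "Arg1") == "patient"
assert neutral_role("send", "Arg2") == "recipient"
```

The table's size illustrates the granularity point: a predicate-neutral scheme needs only a handful of labels, but the mapping to them must be stated once per predicate.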
4.2.1 Background

VerbNet consists of hierarchies of verb classes, extended from those of Levin 1993. Each class and subclass is characterized extensionally by its set of verbs, and intensionally by argument lists and syntactic/semantic features of verbs. The full argument list consists of 23 thematic roles, and possible selectional restrictions on the arguments are expressed using binary predicates. VerbNet has been extended from the Levin classes, and now covers 4526 senses for 3175 lexemes. A primary emphasis for VerbNet is grouping verbs into classes with coherent syntactic and semantic characterizations, in order to facilitate the acquisition of new class members based on observable syntactic/semantic behavior. The hierarchical structure and small number of thematic roles are intended to support generalizations.

FrameNet consists of collections of semantic frames, lexical units that evoke these frames, and annotation reports that demonstrate uses of lexical units. Each semantic frame specifies a set of frame elements: elements that describe the situational props, participants and components that conceptually make up part of the frame. Lexical units appear in a variety of parts of speech, though we focus on verbs here. A lexical unit is a lexeme in a particular sense defined in its containing semantic frame. Lexical units are described in reports that list the syntactic realizations of the frame elements, and valence patterns that describe possible syntactic linking patterns. 3486 verb lexical units have been described in FrameNet, which places a primary emphasis on providing rich, idiosyncratic descriptions of the semantic properties of lexical units in context, and on making explicit subtle differences in meaning. As such, it could provide an important foundation for reasoning about context dependent semantic representations. However, the large number of frame elements and the current sparseness of annotations for each one have hindered machine learning.

PropBank is an annotation of 1M words of the Wall Street Journal portion of the Penn Treebank II with semantic role labels for each verb argument. Although the semantic role labels are purposely chosen to be quite generic, i.e., Arg0, Arg1, etc., they are still intended to consistently annotate the same semantic role across syntactic variations; e.g., the Arg1 in "John broke the window" is the same window (syntactic object) that is annotated as the Arg1 in "The window broke" (syntactic subject). The primary goal of PropBank is to provide consistent, general labeling of semantic roles for a large quantity of text that can provide training data for supervised machine learning algorithms. PropBank can also provide frequency counts for (statistical) analysis or generation. PropBank includes a lexicon which lists, for each broad meaning of each annotated verb, its "frameset": the possible arguments, their labels and all possible syntactic realizations. This lexical resource is used as a set of verb-specific guidelines by the annotators, and can be seen as quite similar in nature to FrameNet, although much more coarse-grained and general purpose in the specifics.

To summarize, PropBank and FrameNet both annotate the same verb arguments, but assign different labels. PropBank has a small number of vague, general purpose labels with sufficient amounts of training data, geared specifically to support successful machine learning. FrameNet provides a much richer and more explicit semantics, but without sufficient amounts of training data for the hundreds of individual frame elements. An ideal environment would allow us to train generic semantic role labelers on PropBank, run them on new data, and then be able to map the resulting PropBank argument labels onto rich FrameNet frame elements.

The goal of SemLink is to create just such an environment. VerbNet provides a level of representation that is still tied to syntax, in the way that PropBank is, but provides a somewhat more fine-grained set of role labels and a set of fairly high level, general purpose semantic predicates, such as contact(x,y), change-of-location(x, path), cause(A, X), etc. As such, it can be seen as a mediator between PropBank and FrameNet. In fact, our approach has been to use the explicit syntactic frames of VerbNet to semi-automatically map the PropBank instances onto specific VerbNet classes and role labels. The mapping can then be hand-corrected. In parallel, SemLink has been creating a mapping table from VerbNet class(es) to FrameNet frame(s), and from role label to frame element. This will allow the SemLink project to automatically generate FrameNet representations for every VerbNet version of a PropBank instance with an entry in the VerbNet-FrameNet mapping table.
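The consistency of PropBank's numbered labels across syntactic variations, as in the broke example above, can be illustrated with a toy frameset lookup (the frameset content is abbreviated and the data format is invented; PropBank's actual lexicon is richer):

```python
# Sketch of a PropBank-style frameset for "break" and its use to label both
# variants of the alternation consistently (Arg1 = the thing broken).
BREAK_FRAMESET = {
    "break.01": {"Arg0": "breaker", "Arg1": "thing broken"},
}

def label(instance):
    """Attach frameset role descriptions to pre-identified arguments (toy)."""
    roles = BREAK_FRAMESET[instance["roleset"]]
    return {arg: roles[arg] for arg in instance["args"]}

# "John broke the window" (transitive) vs. "The window broke" (inchoative):
# the window is Arg1 in both, whether it is object or subject.
transitive = {"roleset": "break.01", "args": {"Arg0": "John", "Arg1": "the window"}}
inchoative = {"roleset": "break.01", "args": {"Arg1": "The window"}}
assert label(transitive)["Arg1"] == "thing broken"
assert label(inchoative)["Arg1"] == "thing broken"
```

This is exactly the property that makes PropBank labels learnable across syntactic contexts, at the cost of the label's meaning being frameset-relative.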
4.2.2 VerbNet <==> FrameNet linking

One of the tasks for the SemLink project is to provide explicit mappings between VerbNet and FrameNet. The mappings between these two resources, which have complementary information about verbs and disjoint coverage, open several possibilities to increase their robustness. The fact that these two resources are now mapped gives researchers different levels of representation for the events these verbs represent, to be used in natural language applications. The mapping between VerbNet and FrameNet was done in two steps: (1) mapping VerbNet verb senses to FrameNet lexical units; (2) mapping VerbNet thematic roles to the equivalent (if present) FrameNet frame elements for the corresponding class/frame mappings uncovered during step 1.

In the first task, VerbNet verb senses were mapped to corresponding FrameNet senses, if available. Each verb member of a VerbNet class was assigned to a (set of) lexical units of FrameNet frames according to semantic meaning and to the roles this verb instance takes. These mappings are not one-to-one, since VerbNet and FrameNet were built with distinctly different design philosophies. VerbNet verb classes are constructed by grouping verbs based mostly on their participation in diathesis alternations. In contrast, FrameNet is designed to group lexical items based on frame semantics; a single FrameNet frame may contain sets of verbs with related senses but different subcategorization properties, and sets of verbs with similar syntactic behavior may appear in multiple frames.

The second task consisted of mapping VerbNet thematic roles to FrameNet frame elements for the pairs of classes/frames found in the first task. As in the first task, the mapping is not always one-to-one, as FrameNet tends to record much more fine-grained distinctions than VerbNet.

So far, 1892 VerbNet senses representing 209 classes have been successfully mapped to FrameNet frames. This resulted in 582 VerbNet class – FrameNet frame mappings, across 263 unique FrameNet frames, for a total of 2170 mappings of VerbNet verbs to FrameNet lexical units.

4.2.3 PropBank <==> VerbNet linking

SemLink is also creating a mapping between VerbNet and PropBank, which will allow the use of the machine learning techniques that have been developed for PropBank annotations to generate more semantically abstract VerbNet representations. The mapping between VerbNet and PropBank can be divided into two parts: a "lexical mapping" and an "instance classifier."

The lexical mapping defines the set of possible mappings between the two lexicons, independent of context. In particular, for each item in the source lexicon, it specifies the possible corresponding items in the target lexicon; and for each of these mappings, it specifies how the detailed fields of the source lexicon item (such as verb arguments) map to the detailed fields of the target lexicon item. The lexical mapping provides a set of possible mappings, but does not specify which of those mappings should be used for each instance; that is the job of the instance classifier, which looks at a source lexicon item in context, and chooses the most appropriate target lexicon items allowed by the lexical mapping.

The lexical mapping was created semi-automatically, based on an initial mapping which put VerbNet thematic roles in correspondence with individual PropBank framesets. This lexical mapping consists of a mapping between the PropBank framesets and VerbNet's verb classes, and a mapping between the roleset argument labels and the VerbNet thematic roles. During this initial mapping, the process of assigning a verb class to a frameset was performed manually while creating new PropBank frames. The thematic role assignment, on the other hand, was a semi-automatic process which finds the best match for the argument labels, based on their descriptors, to the set of thematic role labels of VerbNet. This process required human intervention due to the variety of descriptors for PropBank labels, the fact that the argument label numbers are not consistent across verbs, and gaps in frameset to verb class mappings.

To build the instance classifier, SemLink started with two heuristic classifiers. The first classifier works by running the SenseLearner WSD engine to find the WordNet class of each verb, and then using the existing WordNet/VerbNet mapping to choose the corresponding VerbNet class. This heuristic is limited by the performance of the WSD engine, and by the fact that the WordNet/VerbNet mapping is not available for all VerbNet verbs. The second heuristic classifier examines the syntactic context for each verb instance, and compares it to the syntactic frames of each VerbNet class. The VerbNet class with a syntactic frame that most closely matches the instance's context is assigned to the instance.

The SemLink group ran these two heuristic methods on the Treebank corpus and are hand-
So far, 1892 VerbNet senses representing 209 To build the instance classifier, SemLink started classes were successfully mapped to FrameNet with two heuristic classifiers. The first classifier frames. This resulted in 582 VerbNet class – works by running the SenseLearner WSD engine FrameNet frame mappings, across 263 unique to find the WordNet class of each verb; and then FrameNet frames, for a total of 2170 mappings using the existing WordNet/VerbNet mapping to of VerbNet verbs to FrameNet lexical units. choose the corresponding VerbNet class. This heuristic is limited by the performance of the 4.2.3 PropBank <==> VerbNet linking WSD engine, and by the fact that the WordNet/VerbNet mapping is not available for SemLink is also creating a mapping between all VerbNet verbs. The second heuristic classifier VerbNet and PropBank, which will allow the use examines the syntactic context for each verb of the machine learning techniques that have instance, and compares it to the syntactic frames been developed for PropBank annotations to of each VerbNet class. The VerbNet class with a generate more semantically abstract VerbNet syntactic frame that most closely matches the representations. The mapping between VerbNet instance's context is assigned to the instance. and PropBank can be divided into two parts: a "lexical mapping" and an "instance classifier." The SemLink group ran these two heuristic methods on the Treebank corpus and are hand-
48 correcting the results in order to obtain a are several more CLAWS tagsets VerbNet-annotated version of the Treebank (www.comp.lancs.ac.uk/ucrel/annotation.html), corpus. Since the Treebank corpus is also differing both in degree of detail and choice of annotated with PropBank information, this will distinctions made. Thus a detailed conversion provide a parallel VerbNet/PropBank corpus, table among even just the CLAWS tagsets may which can be used to train a supervised classifier prove handy. Similar issues arise with the year to to map from PropBank frames to VerbNet year changes of the ACE annotation guidelines classes (and vice versa). The feature space for (projects.ldc.upenn.edu/ace/ ) which include this machine learning classifier includes named entity, semantic classes for nouns, information about the lexical and syntactic anaphora, relation and event annotation. As context of the verb and its arguments, as well as annotation formalisms mature, specifications the output of the two heuristic methods. can change to improve annotation consistency, speed or the usefulness for some specific task. In 5. Version Control the interest of using old and new annotation together (more training data), it is helpful to have Annotation compatibility is also an issue for explicit mappings for related formalisms. Table 2 related formalisms. Two columns in Table 1 are is a (preliminary) conversion table for Timex2 devoted to different CLAWS POS tagsets, but there and Timex3, the latter of which can be viewed essentially as an elaboration of the former. 
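To make the correspondence concrete, the attribute renamings and re-encodings summarized in the Timex2/Timex3 table can be sketched as a small converter. This is an illustrative sketch based only on the sample values shown in the table; the function and dictionary names are ours, and it is not an official TimeML conversion tool.

```python
# Illustrative sketch of the TIMEX2 -> TIMEX3 attribute mapping,
# based only on the sample correspondences shown in Table 3.
# Names here are ours; this is not official TimeML tooling.

# Direct attribute renamings (TIMEX2 name -> TIMEX3 name).
DIRECT = {"VAL": "val", "MOD": "mod", "ID": "tid"}

def timex2_to_timex3(attrs):
    """Convert a dict of TIMEX2 attributes to TIMEX3 attributes."""
    out = {}
    for name, value in attrs.items():
        if name in DIRECT:
            # MOD values appear lower-cased in TIMEX3 (MOD="APPROX" -> mod="approx").
            out[DIRECT[name]] = value.lower() if name == "MOD" else value
        elif name == "SET" and value == "YES":
            # Set expressions are signalled via type="SET" in TIMEX3.
            out["type"] = "SET"
        elif name in ("ANCHOR_VAL", "ANCHOR_DIR"):
            # No direct TIMEX3 equivalent; anchoring is expressed through
            # the beginPoint/endPoint attributes instead (see the table).
            out.setdefault("_unmapped", {})[name] = value
    return out

print(timex2_to_timex3({"VAL": "1964-10-16", "MOD": "APPROX", "ID": "12"}))
```

Attributes with no TIMEX2 counterpart (temporalFunction, anchorTimeID, functionInDocument, quant, freq) would have to be filled in by a separate normalization step, which is why the table marks them as "desired in TIMEX2."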
Table 3: Temporal Markup Translation Table [4]

Description                                             | TIMEX2                | TIMEX3                             | Comment
--------------------------------------------------------|-----------------------|------------------------------------|--------------------------------------------------
Contains a normalized form of the date/time             | VAL="1964-10-16"      | val="1964-10-16"                   | Some TIMEX2 points are TIMEX3 durations
Captures temporal modifiers                             | MOD="APPROX"          | mod="approx"                       | ---
Contains a normalized form of an anchoring date/time    | ANCHOR_VAL="1964-W22" | ---                                | See TIMEX3 beginPoint and endPoint
Captures relative direction between VAL and ANCHOR_VAL  | ANCHOR_DIR="BEFORE"   | ---                                | See TIMEX3 beginPoint and endPoint
Identifies set expressions                              | SET="YES"             | type="SET"                         | ---
Provides unique ID number                               | ID="12"               | tid="12"                           | Used to relate time expressions to other objects
Identifies type of expression                           | ---                   | type="DATE"                        | Holdover from TIMEX; derivable from format of VAL/val
Identifies indexical expressions                        | ---                   | temporalFunction="true"            | In TIMEX3, indexical expressions are normalized via a temporal function, applied as a post-process
Identifies reference time used to compute val           | ---                   | anchorTimeID="t12"                 | Desired in TIMEX2
Identifies discourse function                           | ---                   | functionInDocument="CREATION_TIME" | Used for date stamps on documents
Captures anchors for durations                          | ---                   | beginPoint="t11", endPoint="t12"   | Captured by TIMEX2 ANCHOR attributes
Captures quantification of a set expression             | ---                   | quant="EVERY"                      | Desired in TIMEX2
Captures number of reoccurrences in set expressions     | ---                   | freq="2X"                          | Desired in TIMEX2

[4] This preliminary table shows the attributes side by side with only one sample value, although other values are possible.

6. The Effect of Language Differences

Most researchers involved in linguistic annotation (particularly for NLP) take it for granted that coverage of a particular grammar for a particular language is of the utmost importance. The (explanatory) adequacy of the particular linguistic theory assumed for multiple languages is considered much less important. Given the diversity of annotation paradigms, we may go a step further and claim that it may be necessary to change theories when going from one language to another. In particular, language-specific phenomena can complicate theories in ways that prove unnecessary for languages lacking these phenomena. For example, English requires a much simpler morphological framework than languages like German, Russian, Turkish or Pashto. It has also been claimed on several occasions that a VP analysis is needed in some languages (English), but not in others (Japanese). For the purposes of annotation, it would seem simplest to choose the simplest language-specific framework that is capable of capturing the distinctions that one is attempting to annotate. If the annotation is robust, it should be possible to convert it automatically into some language-neutral formalism, should one arise, that maximizes descriptive and explanatory adequacy. In the meanwhile, it would seem unnecessary to complicate the grammars of specific languages to account for phenomena which do not occur in those languages.

6.1 The German TüBa-D/Z Treebank

German has a freer word order than English. This concerns both the distribution of the finite verb and the distribution of arguments and adjuncts. German is a general Verb-Second language, which means that in the default structure in declarative main clauses, as well as in wh-questions, the finite verb surfaces in second position, preceded by only one constituent, which is not necessarily the subject. In embedded clauses the finite verb normally occurs in a verb-phrase-final position, following its arguments and adjuncts and other non-finite verbal elements. German is traditionally assumed to have a head-final verb phrase. The ordering of arguments and adjuncts is relatively free. Firstly, almost any constituent can be topicalised, preceding the finite verb in Verb-Second position. Secondly, the order of the remaining arguments and adjuncts is still relatively free. Ross (1967) coined the term Scrambling to describe the variety of linear orderings. Various factors are discussed as playing a role here, such as pronominal vs. phrasal constituency, information structure, definiteness and animacy (e.g. Uszkoreit 1986).

The annotation scheme of the German TüBa-D/Z treebank was developed with special regard to these properties of German clause structure. The main ordering principle is adopted from the traditional descriptive analysis of German (e.g. Herling 1821, Höhle 1986). It partitions the clause into 'topological fields', which are defined by the distribution of the verbal elements. The top level of the syntactic tree is a flat structure of field categories, including Linke Klammer - left bracket (LK) and Rechte Klammer - verbal complex (VC) for verbal elements, and Vorfeld - initial field (VF), C-Feld - complementiser field (C), Mittelfeld - middle field (MF) and Nachfeld - final field (NF) for other elements.

Below the level of field nodes, the annotation scheme provides hierarchical phrase structures, except for verb phrases: there are no verb phrases annotated in TüBa-D/Z. It was one of the major design decisions to capture the distribution of verbal elements and their arguments and adjuncts in terms of topological fields instead of hierarchical verb phrase structures. The free word order would have required making extensive use of traces or other mechanisms to relate dislocated constituents to their base positions, which is in itself problematic, since there is no consensus among German linguists on what the base ordering is. An alternative which avoids commitment to specific base positions is to use crossing branches to deal with discontinuous constituents. This approach is adopted, for example, by the German TIGER treebank (Brants et al. 2004). A drawback of crossing branches is that the treebank cannot be modeled by a context-free grammar. Since TüBa-D/Z was intended to be used for parser training, this was not a desirable option. Arguments and adjuncts are thus related to their predicates by means of functional labels. In contrast to the Penn Treebank, TüBa-D/Z assigns grammatical functions to all arguments and adjuncts. Due to the freer word order, functions cannot be derived from relative positions alone.

Fig. 2: verb-second
Dort würde er sicher angenommen werden.
there would he surely accepted be
'He would be surely accepted there.'
The choice of labels for grammatical functions is largely based on the insight that grammatical functions in German are directly related to case assignment (Reis 1982). The labels therefore do not refer to grammatical functions such as subject, direct object or indirect object, but make a distinction between complement and adjunct functions and classify the nominal complements according to their case marking: accusative object (OA), dative object (OD), genitive object (OG), and also nominative 'object' (ON), versus verbal modifier (V-MOD) or underspecified modifier (MOD).

Fig. 3: verb-final
Zu hoffen ist, daß der Rückzug vollständig sein wird.
to hope is that the fallback complete be will
'We hope that they will retreat completely.'
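In machine-readable form, this case-driven labeling amounts to a simple lookup. The sketch below merely restates the label inventory listed above; the dictionary and function names are our own, not part of the TüBa-D/Z tooling.

```python
# Sketch of TüBa-D/Z's case-driven labels for nominal complements,
# restating the correspondence given in the text. The helper and
# dictionary names are ours, not part of the treebank's tools.
CASE_TO_LABEL = {
    "nominative": "ON",  # nominative 'object'
    "accusative": "OA",  # accusative object
    "dative":     "OD",  # dative object
    "genitive":   "OG",  # genitive object
}

def complement_label(case):
    """Function label for a nominal complement with the given case."""
    return CASE_TO_LABEL[case]

# e.g. a nominative complement such as 'er' in Fig. 2:
print(complement_label("nominative"))  # ON
```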
Within phrases, a head daughter is marked at each projection level. Exceptions are elliptical phrases, coordinate structures, strings of foreign language, proper names, and appositions within noun phrases. Modifiers of arguments and adjuncts are assigned a default non-head function. In the case of discontinuous constituents, the function of the modifier is either explicitly marked by means of a complex label such as OA-MOD (the modifier of an accusative object), or by means of a secondary edge REFINT in case the modified phrase has a default head or non-head function itself (which holds in the case of, e.g., NP complements of prepositions).

Fig. 4: discont. constituent marked OA-MOD
Wie würdet ihr das Land nennen, in dem ihr geboren wurdet?
how would you the country call in which you born were
'How would you call the country in which you were born?'

Figures 2 to 4 illustrate the German TüBa-D/Z treebank annotation scheme (Telljohann et al. 2005). It combines a flat topological analysis with structural and functional information.
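To make the flat topological analysis concrete, the verb-second clause of Fig. 2 can be represented as a sequence of labeled fields. This is our own sketch of the analysis described above, not the treebank's actual export format.

```python
# Our own sketch of a flat topological-field analysis for the
# verb-second clause in Fig. 2 (not TüBa-D/Z's export format).
# Dort würde er sicher angenommen werden.
clause = [
    ("VF", ["Dort"]),                  # Vorfeld: one topicalised constituent
    ("LK", ["würde"]),                 # Linke Klammer: the finite verb
    ("MF", ["er", "sicher"]),          # Mittelfeld: remaining args/adjuncts
    ("VC", ["angenommen", "werden"]),  # verbal complex: non-finite verbs
]

# The top level is deliberately flat: fields, not a hierarchical VP.
print(" | ".join(f"{f}: {' '.join(ws)}" for f, ws in clause))
```

Below each field node, TüBa-D/Z would then provide hierarchical phrase structure and the functional labels (ON, OA, V-MOD, ...) discussed above.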
7. Concluding Remarks

This report has laid out several major annotation compatibility issues, focusing primarily on conversion among different annotation frameworks that represent the same type of information. We have provided procedures for conversion, along with their limitations. As more work needs to be done in this area, we intend to keep the online version available for cooperative elaboration and extension. Our hope is that the conversion tables will be extended, and that more annotation projects will incorporate details of their projects in order to facilitate compatibility.

Compatibility between annotation frameworks becomes a concern when (for example) a user attempts to use annotation created under two or more distinct frameworks for a single application. This is true regardless of whether the annotation is of the same type (the user wants more data for a particular phenomenon) or of different types (the user wants to combine different types of information).

Acknowledgement

This research was supported, in part, by the National Science Foundation under Grant CNI-0551615.

References

S. Brants, S. Dipper, P. Eisenberg, S. Hansen, E. König, W. Lezius, C. Rohrer, G. Smith and H. Uszkoreit, 2004. TIGER: Linguistic Interpretation of a German Corpus. In E. Hinrichs and K. Simov, eds, Research on Language and Computation, Special Issue, Volume 2: 597-620.

K.-J. Chen, C.-C. Luo, Z.-M. Gao, M.-C. Chang, F.-Y. Chen and C.-J. Chen, 1999. The CKIP Chinese Treebank. In Journées ATALA sur les Corpus annotés pour la syntaxe, Talana, Paris VII, pp. 85-96.

K.-J. Chen et al., 2003. In A. Abeillé, ed., Building and Using Parsed Corpora. Kluwer, Dordrecht.

CKIP, 1995. Technical Report no. 95-02: The Content and Illustration of the Sinica Corpus of Academia Sinica. Inst. of Information Science.

G. Corbett, N. M. Fraser and S. McGlashan, 1993. Heads in Grammatical Theory. Cambridge University Press, Cambridge.

K. van Deemter and R. Kibble, 2001. On Coreferring: Coreference in MUC and Related Annotation Schemes. Computational Linguistics, 26(4): 629-637.

C. Fillmore, 1968. The Case for Case. In E. Bach and R. T. Harms, eds, Universals in Linguistic Theory. Holt, Rinehart and Winston, NY.

C. Fillmore, P. Kay and M. O'Connor, 1988. Regularity and Idiomaticity in Grammatical Constructions: The Case of Let Alone. Language, 64: 501-538.

J. S. Gruber, 1965. Studies in Lexical Relations. Ph.D. thesis, MIT.

E. Hajičová and M. Ceplová, 2000. Deletions and Their Reconstruction in Tectogrammatical Syntactic Tagging of Very Large Corpora. In Proceedings of Coling 2000, pp. 278-284.

S. H. A. Herling, 1821. Über die Topik der deutschen Sprache. In Abhandlungen des frankfurterischen Gelehrtenvereins für deutsche Sprache, Drittes Stück. Frankfurt/M.

T. N. Höhle, 1986. Der Begriff 'Mittelfeld'. Anmerkungen über die Theorie der topologischen Felder. In A. Schöne, ed., Kontroversen, alte und neue. Akten des 7. Internationalen Germanistenkongresses Göttingen, 329-340.

P. Kingsbury and M. Palmer, 2002. From TreeBank to PropBank. In Proc. of LREC-2002.

H. Lee, C.-N. Huang, J. Gao and X. Fan, 2004. Chinese chunking with another type of spec. In SIGHAN-2004, Barcelona, pp. 41-48.

B. Levin, 1993. English Verb Classes and Alternations: A Preliminary Investigation. Univ. of Chicago Press.

C. Manning and H. Schütze, 1999. Foundations of Statistical Natural Language Processing. MIT Press.

A. Meyers, 1995. The NP Analysis of NP. In Papers from the 31st Regional Meeting of the Chicago Linguistic Society, pp. 329-342.

A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young and R. Grishman, 2004. The NomBank Project: An Interim Report. In NAACL/HLT 2004 Workshop Frontiers in Corpus Annotation.

D. M. Perlmutter, 1984. Studies in Relational Grammar 1. Univ. of Chicago Press.

D. M. Perlmutter and P. M. Postal, 1984. The 1-Advancement Exclusiveness Law. In D. M. Perlmutter and C. G. Rosen, eds, Studies in Relational Grammar 2. Univ. of Chicago Press.

M. Poesio, 1999. Coreference. In MATE Deliverable 2.1. http://www.ims.uni-stuttgart.de/projekte/mate/mdag/cr/cr_1.html

M. Poesio, 2004. The MATE/GNOME Scheme for Anaphoric Annotation, Revisited. In Proc. of SIGDIAL.

M. Poesio and R. Artstein, 2005. The Reliability of Anaphoric Annotation, Reconsidered: Taking Ambiguity into Account. In Proc. of the ACL Workshop on Frontiers in Corpus Annotation.

J. Pustejovsky, A. Meyers, M. Palmer and M. Poesio, 2005. Merging PropBank, NomBank, TimeBank, Penn Discourse Treebank and Coreference. In ACL 2005 Workshop: Frontiers in Corpus Annotation II: Pie in the Sky.

M. Reis, 1982. Zum Subjektbegriff im Deutschen. In W. Abraham, ed., Satzglieder im Deutschen. Vorschläge zur syntaktischen, semantischen und pragmatischen Fundierung. Narr, Tübingen, 171-212.

C. G. Rosen, 1984. The Interface between Semantic Roles and Initial Grammatical Relations. In D. M. Perlmutter and C. G. Rosen, eds, Studies in Relational Grammar 2. Univ. of Chicago Press.

J. R. Ross, 1967. Constraints on Variables in Syntax. Doctoral dissertation, MIT.

I. A. Sag and J. D. Fodor, 1994. Extraction without traces. In R. Aranovich, W. Byrne, S. Preuss and M. Senturia, eds, Proc. of the Thirteenth West Coast Conference on Formal Linguistics, volume 13. CSLI Publications/SLA.

S. Salmon-Alt and L. Romary, 2004. RAF: Towards a Reference Annotation Framework. In LREC 2004.

S. Shaumyan, 1977. Applicative Grammar as a Semantic Theory of Natural Language. Chicago Univ. Press.

H. Telljohann, E. Hinrichs, S. Kübler and H. Zinsmeister, 2005. Stylebook of the Tübingen Treebank of Written German (TüBa-D/Z). Technical report, University of Tübingen.

C. Thielen and A. Schiller, 1996. Ein kleines und erweitertes Tagset fürs Deutsche. In H. Feldweg and E. W. Hinrichs, eds, Wiederverwendbare Methoden und Ressourcen zur linguistischen Erschliessung des Deutschen, Vol. 73 of Lexicographica. Niemeyer, Tübingen, 193-203.

J.-L. Tsai, 2005. A Study of Applying BTM Model on the Chinese Chunk Bracketing. In LINC-2005, IJCNLP-2005, pp. 21-30.

H. Uszkoreit, 1986. Constraints on Order. Linguistics 24.

F. Xia, M. Palmer, N. Xue, M. E. Okurowski, J. Kovarik, F.-D. Chiou, S. Huang, T. Kroch and M. Marcus, 2000. Developing Guidelines and Ensuring Consistency for Chinese Text Annotation. In Proc. of LREC-2000, Greece.

N. Xue, F. Chiou and M. Palmer, 2002. Building a Large-Scale Annotated Chinese Corpus. In Proc. of COLING-2002, Taipei, Taiwan.

N. Xue, F. Xia, F.-D. Chiou and M. Palmer, 2005. The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering, 11(2): 207-238.
Manual Annotation of Opinion Categories in Meetings
Swapna Somasundaran [1], Janyce Wiebe [1], Paul Hoffmann [2], Diane Litman [1]
[1] Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260
[2] Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260
{swapna,wiebe,hoffmanp,litman}@cs.pitt.edu
questions like “Show me the conflicts of opin- Abstract ions between X and Y” , “Who made the highest number of positive/negative comments” and This paper applies the categories from an “Give me all the contributions of participant X in opinion annotation scheme developed for favor of alternative A regarding the issue I.” A monologue text to the genre of multiparty QA system with a component to recognize opin- meetings. We describe modifications to ions would be able to help find answers to such the coding guidelines that were required questions. to extend the categories to the new type Consider the following example from a meet- of data, and present the results of an in- ing about an investment firm choosing which car ter-annotator agreement study. As re- to buy1. (In the examples, the words and phrases searchers have found with other types of describing or expressing the opinion are under- annotations in speech data, inter- lined): annotator agreement is higher when the (1)2 OCK: Revenues of less annotators both read and listen to the data than a million and losses of than when they only read the transcripts. like five million you know Previous work exploited prosodic clues that's pathetic to perform automatic detection of speaker Here, the speaker, OCK, shows his strong nega- emotion (Liscombe et al. 2003). Our tive evaluation by using the expression “That’s findings suggest that doing so to recog- pathetic.” nize opinion categories would be a prom- (2) OCK: No it might just be ising line of work. a piece of junk cheap piece of junk that's not a good 1 Introduction investment In (2), the speaker uses the term “just a piece of Subjectivity refers to aspects of language that junk” to express his negative evaluation and uses express opinions, beliefs, evaluations and specu- this to argue for his belief that it is “not a good lations (Wiebe et al. 2005). 
Many natural lan- investment.” guage processing applications could benefit from (3) OCK: Yeah I think that's being able to distinguish between facts and opin- the wrong image for an in- ions of various types, including speech-oriented vestment bank he wants sta- applications such as meeting browsers, meeting bility and s safety and you summarizers, and speech-oriented question an- don't want flashy like zip- swering (QA) systems. Meeting browsers could find instances in meetings where opinions about key topics are expressed. Summarizers could in- 1 Throughout this paper we take examples from a meeting clude strong arguments for and against issues, to where a group of people are deciding on a new car for an make the final outcome of the meeting more un- investment bank. The management wants to attract younger derstandable. A preliminary user survey investors with a sporty car. 2 We have presented the examples the way they were ut- (Lisowska 2003) showed that users would like to tered by the speaker. Hence they may show many false be able to query meeting records with subjective starts and repetitions. Capitalization was added to improve readability.
54 Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 54–61, Sydney, July 2006. c 2006 Association for Computational Linguistics ping around the corner kind values range from 0.32 to 0.69. As has been of thing you know found for other types of annotations in speech, The example above shows that the speaker has a agreement is higher when the annotators both negative judgment towards the suggestion of a read and listen to the data than when they only sports car (that was made in the previous turn) read the transcripts. Interestingly, the advantages which is indicated by the words “wrong image.” are more dramatic for some categories than oth- The speaker then goes on to positively argue for ers. And, in both conditions, agreement is higher what he wants. He further argues against the cur- for the positive than for the negative categories. rent suggestion by using more negative terms We discuss possible reasons for these disparities. like “flashy” and “zipping around the corner.” Prosodic clues have been exploited to perform The speaker believes that “zipping around the automatic detection of speaker emotion (Lis- corner” is bad as it would give a wrong impres- combe et al. 2003). Our findings suggest that sion of the bank to the customers. In the absence doing so to recognize opinion categories is a of such analyses, the decision making process promising line of work. and rationale behind the outcomes of meetings, The rest of the paper is organized as follows: which form an important part of the organiza- In Section 2 we discuss the data and the annota- tion’s memory, might remain unavailable. tion scheme and present examples. We then pre- In this paper, we perform annotation of a sent our inter-annotator agreement results in Sec- meeting corpus to lay the foundation for research tion 3, and in Section 4 we discuss issues and on opinion detection in speech. We show how observations. 
Related work is described in Sec- categories from an opinion (subjectivity) annota- tion 5. Conclusions and Future Work are pre- tion scheme, which was developed for news arti- sented in Section 6. cles, can be applied to the genre of multi-party meetings. The new genre poses challenges as it is 2 Annotation significantly different from the text domain, 2.1 Data where opinion analysis has traditionally been applied. Specifically, differences arise because: The data is from the ISL meeting corpus (Bur- 1) There are many participants interacting with ger et al. 2002). We chose task oriented meet- one another, each expressing his or her own ings from the games/scenario and discussion opinion, and eliciting reactions in the process. genres, as we felt they would be closest to the 2) Social interactions may constrain how openly applications for which the opinion analysis will people express their opinions; i.e., they are often be useful. The ISL speech is accompanied by indirect in their negative evaluations. rich transcriptions, which are tagged according to We also explore the influence of speech on hu- VERBMOBIL conventions. However, since real- man perception of opinions. time applications only have access to ASR out- Specifically, we annotated some meeting data put, we gave the annotators raw text, from which with the opinion categories Sentiment and Argu- all VERBMOBIL tags, punctuation, and capitali- ing as defined in Wilson and Wiebe (2005). In zations were removed. our annotation we first distinguish whether a In order to see how annotations would be af- Sentiment or Arguing is being expressed. If one fected by the presence or absence of speech, we is, we then mark the polarity (i.e., positive or divided each raw text document into 2 segments. negative) and the intensity (i.e., how strong the One part was annotated while reading the raw opinion is). Annotating the individual opinion text only. 
For the annotation of the other part, expressions is useful in this genre, because we speech as well as the raw text was provided. see many utterances that have more than one type of opinion (e.g. (3) above). To investigate 2.2 Opinion Category Definitions how opinions are expressed in speech, we divide We base our annotation definitions on the our annotation into two tasks, one in which the scheme developed by Wiebe et al. (2005) for annotator only reads the raw text, and the other news articles. That scheme centers on the notion in which the annotator reads the raw text and also of subjectivity, the linguistic expression of pri- listens to the speech. We measure inter-annotator vate states. Private states are internal mental agreement for both tasks. states that cannot be objectively observed or veri- We found that the opinion categories apply fied (Quirk et al. 1985) and include opinions, well to the multi-party meeting data, although beliefs, judgments, evaluations, thoughts, and there is some room for improvement: the Kappa feelings. Amongst these many forms of subjec-
55 tivity, we focus on the Sentiment and Arguing child safety locks I think I categories proposed by Wilson and Wiebe think that would be a good (2005). The categories are broken down by po- thing because if our custom- larity and defined as follows: ers happen to have children Positive Sentiments: positive emotions, Example (8) is marked as both Positive Arguing evaluations, judgments and stances. and Positive Sentiment. The more explicit one is (4) TBC: Well ca How about the Positive Sentiment that the locks are good. one of the the newer Cadil- The underlying Argument is that the company lac the Lexus is good car they choose should have child safety locks. In (4), taken from the discussion of which car to Negative Arguing: expressing lack of support buy, the speaker uses the term “good” to express for or attacking the acceptance of an object, his positive evaluation of the Lexus . viewpoint, idea or stance by providing reasoning, Negative Sentiments: negative emotions, justifications, judgment, evaluations or beliefs. evaluations, judgments and stances. This may be explicit or implicit. (5) OCK: I think these are (9) OCK: Town Car But it's a all really bad choices little a It's a little like In (5), the speaker expresses his negative evalua- your grandf Yeah your grand- tion of the choices for the company car. Note that father would drive that “really” makes the evaluation more intense. Example (9) is explicitly stating who would drive Positive Arguing: arguing for something, ar- a Town Car, while implicitly arguing against guing that something is true or is so, arguing that choosing the Town Car (as they want younger something did happen or will happen, etc. investors). (6) ZDN: Yeah definitely moon roof 2.3 Annotation Guidelines In (6), the speaker is arguing that whatever car Due to genre differences, we also needed to they get should have a moon roof. modify the annotation guidelines. 
For each Argu- Negative Arguing: arguing against some- ing or Sentiment the annotator perceives, he or thing, arguing that something is not true or is not she identifies the words or phrases used to ex- so, arguing that something did not happen or will press it (the text span), and then creates an anno- not happen, etc. tation consisting of the following. (7) OCK: Like a Lexus or • Opinion Category and Polarity perhaps a Stretch Lexus something like that but that • Opinion Intensity might be too a little too • Annotator Certainty luxurious In the above example, the speaker is using the Opinion Category and Polarity: These are term “a little too luxurious” to argue against a defined in the previous sub-section. Note that the Lexus for the car choice. target of an opinion is what the opinion is about. In an initial tagging experiment, we applied For example, the target of “John loves baseball” the above definitions, without modification, to is baseball. An opinion may or may not have a some sample meeting data. The definitions cov- separate target. For example, “want stability” in ered much of the arguing and sentiment we ob- “We want stability” denotes a Positive Senti- served. However, we felt that some cases of Ar- ment, and there is no separate target. In contrast, guing that are more prevalent in meeting than in “good” in “The Lexus is good” expresses a Posi- news data needed to be highlighted more, namely tive Sentiment and there is a separate target, Arguing opinions that are implicit or that under- namely the Lexus. lie what is explicitly said. Thus we add the fol- In addition to Sentiments toward a topic of lowing to the arguing definitions. discussion, we also mark Sentiments toward Positive Arguing: expressing support for or other team members (e.g. “Man you guys backing the acceptance of an object, viewpoint, are so limited”). 
We do not mark idea or stance by providing reasoning, justifica- agreements or disagreements as Sentiments, as tions, judgment, evaluations or beliefs. This sup- these are different dialog acts (though they some- port or backing may be explicit or implicit. times co-occur with Sentiments and Arguing). (8) MHJ: That's That's why I Intensity: We use a slightly modified version wanna What about the the of Craggs and Wood's (2004) emotion intensity
annotation scheme. According to that scheme, there are 5 levels of intensity. Level "0" denotes a lack of the emotion (Sentiment or Arguing in our case), "1" denotes traces of emotion, "2" denotes a low level of emotion, "3" denotes a clear expression, while "4" denotes a strong expression. Our intensity levels mean the same, but we do not mark intensity level 0, as this level implies the absence of opinion.

If a turn has multiple, separate expressions marked with the same opinion tag (category and polarity), and all expressions refer to the same target, then the annotators merge all the expressions into a larger text span, including the separating text in between the expressions. This resulting text span has the same opinion tag as its constituents, and it has an intensity that is greater than or equal to the highest intensity of the constituent expressions that were merged.

Annotator Certainty: The annotators use this tag if they are not sure that a given opinion is present, or if, given the context, there are multiple possible interpretations of the utterance and the annotator is not sure which interpretation is correct. This attribute is distinct from the Intensity attribute, because the Intensity attribute indicates the strength of the opinion, while the Annotator Certainty attribute indicates whether the annotator is sure about a given tag (whatever the intensity is).

2.4 Examples

We conclude this section with some examples of annotations from our corpus.

(10) OCK: So Lexun had revenues of a hundred and fifty million last year and profits of like six million. That's pretty good

Annotation: Text span=That's pretty good; Category=Positive Sentiment; Intensity=3; Annotator Certainty=Certain

The annotator marked the text span "That's pretty good" as Positive Sentiment because this expression is used by OCK to show his favorable judgment towards the company revenues. The intensity is 3, as it is a clear expression of Sentiment.

(11) OCK: No it might just be a piece of junk Cheap piece of junk that's not a good investment

Annotation1: Text span=it might just be a piece of junk Cheap piece of junk that's not a good investment; Category=Negative Sentiment; Intensity=4; Annotator Certainty=Certain
Annotation2: Text span=Cheap piece of junk that's not a good investment; Category=Negative Arguing; Intensity=3; Annotator Certainty=Certain

In the above example, there are multiple expressions of opinion. In Annotation1, the expressions "it might just be a piece of junk", "cheap piece of junk" and "not a good investment" express negative evaluations towards the car choice (suggested by another participant in a previous turn). Each of these expressions is a clear case of Negative Sentiment (Intensity=3). As they are all of the same category and polarity and towards the same target, they have been merged by the annotator into one long expression of Intensity=4. In Annotation2, the sub-expression "cheap piece of junk that is not a good investment" is also used by the speaker OCK to argue against the car choice. Hence the annotator has marked this as Negative Arguing.

3 Guideline Development and Inter-Annotator Agreement

3.1 Annotator Training

Two annotators (both co-authors) underwent three rounds of tagging. After each round, discrepancies were discussed, and the guidelines were modified to reflect the resolved ambiguities. A total of 1266 utterances belonging to sections of four meetings (two of the discussion genre and two of the game genre) were used in this phase.

3.2 Agreement

The unit for which agreement was calculated was the turn. The ISL transcript provides demarcation of speaker turns along with the speaker ID. If an expression is marked in a turn, the turn is assigned the label of that expression. If there are multiple expressions marked within a turn with different category tags, the turn is assigned all those categories. This does not pose a problem for our evaluation, as we evaluate each category separately.
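The merging and turn-labelling rules above lend themselves to a simple data model. The following sketch is ours, not the authors' actual tooling, and all class and field names are invented. It illustrates how same-tag expressions in a turn are merged (the merged span keeps the tag, and its intensity is at least the maximum of its constituents; the annotator may raise it further, as in Annotation1 above) and how a turn inherits the categories of all expressions marked in it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expression:
    turn_id: int
    start: int        # character span of the expression within the turn
    end: int
    category: str     # e.g. "Negative Sentiment", "Positive Arguing"
    intensity: int    # 1..4; level 0 (no opinion) is never marked
    certain: bool     # Annotator Certainty

def merge_same_tag(exprs):
    """Merge separate expressions sharing one opinion tag (category and
    polarity) and target into one larger span, including the separating
    text.  The merged intensity is bounded below by the strongest
    constituent; annotators may assign a higher value."""
    exprs = sorted(exprs, key=lambda e: e.start)
    first = exprs[0]
    return Expression(
        turn_id=first.turn_id,
        start=first.start,
        end=max(e.end for e in exprs),
        category=first.category,
        intensity=max(e.intensity for e in exprs),
        certain=all(e.certain for e in exprs),
    )

def turn_labels(exprs):
    """Assign each turn the category of every expression marked in it;
    turns with differently tagged expressions receive all the tags."""
    labels = {}
    for e in exprs:
        labels.setdefault(e.turn_id, set()).add(e.category)
    return labels
```

Under this sketch, the two Intensity=3 Negative Sentiment spans of Annotation1 would merge into a single span covering the intervening text, and a turn containing both a Negative Sentiment and a Negative Arguing expression would carry both labels for the category-wise evaluation described below.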
A previously unseen section of a meeting containing 639 utterances was selected and divided into 2 segments. One part of 319 utterances was annotated using raw text as the only signal, and the remaining 320 utterances were annotated using text and speech. Cohen's Kappa (1960) was used to calculate inter-annotator agreement. We calculated inter-annotator agreement for both conditions: raw-text-only and raw-text+speech. This was done for each of the categories: Positive Sentiment, Positive Arguing, Negative Sentiment, and Negative Arguing. To evaluate a category, we did the following:

• For each turn, if both annotators tagged the turn with the given category, or both did not tag the turn with the category, then it is a match.
• Otherwise, it is a mismatch.

Table 1 shows the inter-annotator Kappa values on the test set.

Agreement (Kappa)     Raw Text only   Raw Text + Speech
Positive Arguing      0.54            0.60
Negative Arguing      0.32            0.65
Positive Sentiment    0.57            0.69
Negative Sentiment    0.41            0.61

Table 1: Inter-annotator agreement on different categories.

With raw-text-only annotation, the Kappa values are in the moderate range according to Landis and Koch (1977), except for Negative Arguing, for which it is 0.32. Positive Arguing and Positive Sentiment were more reliably detected than Negative Arguing and Negative Sentiment. We believe this is because participants were more comfortable with directly expressing their positive sentiments in front of other participants. Given only the raw text data, inter-annotator reliability measures for Negative Arguing and Negative Sentiment are the lowest. We believe this might be due to the fact that participants in social interactions are not very forthright with their Negative Sentiments and Arguing. Negative Sentiments and Arguing towards something may be expressed by saying that something else is better. For example, consider the following response of one participant to another participant's suggestion of aluminum wheels for the company car:

(12) ZDN: Yeah see what kind of wheels you know they have to look dignified to go with the car

The above example was marked as Negative Arguing by one annotator (i.e., they should not get aluminum wheels), while the other annotator did not mark it at all. The implied Negative Arguing toward getting aluminum wheels can be inferred from the statement that the wheels should look dignified. However, the annotators were not sure, as the participant chose to focus on what is desirable (i.e., dignified wheels). This utterance is actually both a general statement of what is desirable and an implication that aluminum wheels are not dignified. But this may be difficult to ascertain from the raw text signal only.

When the annotators had speech to guide their judgments, the Kappa values go up significantly for each category. All the agreement numbers for raw text+speech are in the substantial range according to Landis and Koch (1977). We observe that with speech, Kappa for Negative Arguing has doubled over the Kappa obtained without speech. The Kappa for Negative Sentiment (text+speech) shows a 1.5 times improvement over the one with only raw text. Both these observations indicate that speech helps the annotators tag negativity more reliably. It is quite likely that a seemingly neutral sentence could sound negative, depending on the way words are stressed or pauses are inserted. Comparing the agreement on Positive Sentiment, we get a 1.2 times improvement by using speech. Similarly, agreement improves by 1.1 times for Positive Arguing when speech is used. The improvement with speech for the Positive categories is not as high as for the Negative categories, which conforms to our belief that people are more forthcoming about their positive judgments, evaluations, and beliefs.

In order to test whether the turns where annotators were uncertain were the places that caused mismatch, we calculated the Kappa with the annotator-uncertain cases removed. The corresponding Kappa values are shown in Table 2.

Agreement (Kappa)     Raw Text only   Raw Text + Speech
Positive Arguing      0.52            0.63
Negative Arguing      0.36            0.63
Positive Sentiment    0.60            0.73
Negative Sentiment    0.50            0.61

Table 2: Inter-annotator agreement on different categories, Annotator Uncertain cases removed.

The trends observed in Table 1 are seen in Table 2 as well, namely annotation reliability improving with speech.
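The match/mismatch procedure and Cohen's (1960) Kappa can be reproduced with a few lines of code. This is an illustrative sketch of the standard two-rater computation (our own code, not the authors' scripts): for one category, each annotator's turns are reduced to tagged/not-tagged, and chance agreement is estimated from the two marginal tagging rates.

```python
def cohens_kappa(ann1, ann2):
    """Cohen's Kappa for two parallel lists of binary turn labels
    (True = the annotator tagged the turn with the category)."""
    assert len(ann1) == len(ann2) and ann1
    n = len(ann1)
    # Observed agreement: both tagged, or both did not tag (a "match")
    p_obs = sum(x == y for x, y in zip(ann1, ann2)) / n
    # Chance agreement from each annotator's marginal tagging rate
    p1, p2 = sum(ann1) / n, sum(ann2) / n
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_obs - p_chance) / (1 - p_chance)
```

For example, with ann1 = [True, True, False, False] and ann2 = [True, False, False, False], observed agreement is 0.75, chance agreement is 0.5, and Kappa is 0.5. The Table 2 and Table 3 variants are the same computation after discarding the annotator-uncertain or low-intensity turns.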
Comparing Tables 1 and 2, we see that for the raw text, the inter-annotator agreement goes up by 0.04 points for Negative Arguing and by 0.09 points for Negative Sentiment. However, the agreement for Negative Arguing and Negative Sentiment on raw-text+speech between Tables 1 and 2 remains almost the same. We believe this is because we had 20% fewer Annotator Uncertainty tags in the raw-text+speech annotation as compared to raw-text-only, indicating that some types of uncertainties seen in raw-text-only were resolved in raw-text+speech due to the speech input. The remaining cases of Annotator Uncertainty could have been due to other factors, as discussed in the next section.

Table 3 shows Kappa with the low intensity tags removed. The hypothesis was that low intensity cases might be borderline cases, and that removing them might increase inter-annotator reliability.

Agreement (Kappa)     Raw Text only   Raw Text + Speech
Positive Arguing      0.53            0.66
Negative Arguing      0.26            0.65
Positive Sentiment    0.65            0.74
Negative Sentiment    0.45            0.59

Table 3: Inter-annotator agreement on different categories, Intensity 1 and 2 cases removed.

Comparing Tables 1 and 3 (the raw-text columns), we see that there is an improvement in the agreement on sentiment (both positive and negative) if the low intensity cases are removed. The agreement for Negative Sentiment (raw-text) goes up marginally, by 0.04 points. Surprisingly, the agreement for Negative Arguing (raw-text) goes down by 0.06 points. Similarly, in the raw-text+speech results, removal of the low intensity cases does not improve the agreement for Negative Arguing, while hurting the Negative Sentiment category (by 0.02 points). One possible explanation is that it may be equally difficult to detect the Negative categories at both low and high intensities. Recall that in (12) it was difficult to detect whether there is Negative Arguing at all; if the annotator decided that it is indeed Negative Arguing, it is put at intensity level=3 (i.e., a clear case).

4 Discussion

There were a number of interesting subjectivity-related phenomena in meetings that we observed during our annotation. These are issues that will need to be addressed for improving inter-annotator reliability.

Global and local context for arguing: In the context of a meeting, participants argue for (positively) or against (negatively) a topic. This may become ambiguous when the participant uses an explicit local Positive Arguing and an implicit global Negative Arguing. Consider the following speaker turn, at a point in the meeting when one participant has suggested that the company car should have a moon roof and another participant has opposed it, by saying that a moon roof would compromise the headroom.

(13) OCK: We wanna make sure there's adequate headroom for all those six foot six investors

In the above example, the speaker OCK, in the local context of the turn, is arguing positively that headroom is important. However, in the global context of the meeting, he is arguing against the idea of a moon roof that was suggested by a participant. Such cases occur when one object (or opinion) is endorsed which automatically precludes another, mutually exclusive object (or opinion).

Sarcasm/Humor: The meetings we analyzed had a large amount of sarcasm and humor. Issues arose with sarcasm due to our approach of marking opinions towards the content of the meeting (which forms the target of the opinion). Sarcasm is difficult to annotate because it can be:
1) On topic: Here the target is the topic of discussion, and hence the sarcasm is used as a Negative Sentiment.
2) Off topic: Here the target is not a topic under discussion, and the aim is purely to elicit laughter.
3) Allied topic: In this case, the target is related to the topic in some way, and it is difficult to determine whether the aim of the sarcasm/humor was to elicit laughter or to imply something negative towards the topic.

Multiple modalities: In addition to text and speech, gestures and visual diagrams play an important role in some types of meetings. In one meeting that we analyzed, participants were working together to figure out how to protect an egg when it is dropped from a long distance, given the materials they have. It was evident that they were using gestures to describe their ideas ("we can put tape like this") and that they drew diagrams to get points across. In the absence of visual input, annotators would need to guess
what was happening. This might further hurt inter-annotator reliability.

5 Related Work

Our opinion categories are from the subjectivity schemes described in Wiebe et al. (2005) and Wilson and Wiebe (2005). Wiebe et al. (2005) perform expression-level annotation of opinions and subjectivity in text. They define their annotations as an experiencer having some type of attitude (such as Sentiment or Arguing), of a certain intensity, towards a target. Wilson and Wiebe (2005) extend this basic annotation scheme to include different types of subjectivity, including Positive Sentiment, Negative Sentiment, Positive Arguing, and Negative Arguing.

Speech was found to improve inter-annotator agreement in discourse segmentation of monologs (Hirschberg and Nakatani 1996). Acoustic cues have been successfully employed for the reliable detection of the speaker's emotions, including frustration, annoyance, anger, happiness, sadness, and boredom (Liscombe et al. 2003). Devillers et al. (2003) performed perceptual tests with and without speech in detecting the speaker's fear, anger, satisfaction and embarrassment. Though related, our work is not concerned with the speaker's emotions, but rather with opinions toward the issues and topics addressed in the meeting.

Most annotation work in multiparty conversation has focused on exchange structures and discourse functional units like common grounding (Nakatani and Traum, 1998). In common grounding research, the focus is on whether the participants of the discourse are able to understand each other, not on their opinions towards the content of the discourse. Other tagging schemes, like the one proposed by Flammia and Zue (1997), focus on information seeking and question answering exchanges where one participant is purely seeking information while the other is providing it. The SWBD-DAMSL (Jurafsky et al., 1997) annotation scheme over the Switchboard telephone conversation corpus labels shallow discourse structures. SWBD-DAMSL had a label "sv" for opinions; however, due to poor inter-annotator agreement, the authors discarded these annotations. The ICSI MRDA annotation scheme (Dhillon et al., 2003) adopts the SWBD-DAMSL scheme, but does not distinguish between opinionated and objective statements. The ISL meeting corpus (Burger and Sloane, 2004) is annotated with dialog acts and discourse moves like initiation and response, which in turn consist of dialog tags such as query, align, and statement. Their statement dialog category would not only include the Sentiment and Arguing tags discussed in this paper, but would also include objective statements and other types of subjectivity.

"Hot spots" in meetings closely relate to our work, because they find sections in the meeting where participants are involved in debates or high-arousal activity (Wrede and Shriberg 2003). While that work distinguishes between high arousal and low arousal, it does not distinguish between opinion and non-opinion, or between the different types of opinion. However, Janin et al. (2004) suggest that there is a relationship between dialog acts and involvement, and that involved utterances contain significantly more evaluative and subjective statements as well as extremely positive or negative answers. Thus we believe it may be beneficial for such work to make these distinctions.

Another closely related work that finds participants' positions regarding issues is argument diagramming (Rienks et al. 2005). This approach, based on the IBIS system (Kunz and Rittel 1970), divides a discourse into issues and finds lines of deliberated arguments. However, it does not distinguish between subjective and objective contributions to the meeting.

6 Conclusions and Future Work

In this paper we performed an annotation study of opinions in meetings, and investigated the effects of speech. We have shown that it is possible to reliably detect opinions within multiparty conversations. Our consistently better agreement results with text+speech input over text-only input suggest that speech is a reliable indicator of opinions. We have also found that Annotator Uncertainty decreased with speech input. Our results also show that speech is a more informative indicator for the negative than for the positive categories. We hypothesize that this is due to the fact that people express their positive attitudes more explicitly; the speech signal is thus even more important for discerning negative opinions. This experience has also helped us gain insights into the ambiguities that arise due to sarcasm and humor.

Our promising results open many new avenues for research. It will be interesting to see how our categories relate to other discourse structures, both at the shallow level (agreement/disagreement) as well as at the deeper level
(intentions/goals). It will also be interesting to investigate how other forms of subjectivity, like speculation and intention, are expressed in multiparty discourse. Finding prosodic correlates of speech as well as lexical clues that help in opinion detection would be useful in building subjectivity detection applications for multiparty meetings.

References

Susanne Burger and Zachary A. Sloane. 2004. The ISL Meeting Corpus: Categorical Features of Communicative Group Interactions. NIST Meeting Recognition Workshop 2004, NIST 2004, Montreal, Canada.

Susanne Burger, Victoria MacLaren and Hua Yu. 2002. The ISL Meeting Corpus: The Impact of Meeting Type on Speech Style. ICSLP-2002. Denver, CO: ISCA.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

Richard Craggs and Mary McGee Wood. 2004. A categorical annotation scheme for emotion in the linguistic content of dialogue. Affective Dialogue Systems, 2004.

Laurence Devillers, Lori Lamel and Ioana Vasilescu. 2003. Emotion detection in task-oriented spoken dialogs. IEEE International Conference on Multimedia and Expo (ICME).

Rajdip Dhillon, Sonali Bhagat, Hannah Carvey and Elizabeth Shriberg. 2003. Meeting Recorder Project: Dialog Act Labeling Guide. ICSI Technical Report TR-04-002, Version 3, October 2003.

Giovanni Flammia and Victor Zue. 1997. Learning the Structure of Mixed Initiative Dialogues Using a Corpus of Annotated Conversations. Eurospeech 1997, Rhodes, Greece, pp. 1871–1874.

Julia Hirschberg and Christine Nakatani. 1996. A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues. Annual Meeting of the Association for Computational Linguistics, Vol. 34, pages 286–293.

Adam Janin, Jeremy Ang, Sonali Bhagat, Rajdip Dhillon, Jane Edwards, Javier Macías-Guarasa, Nelson Morgan, Barbara Peskin, Elizabeth Shriberg, Andreas Stolcke, Chuck Wooters and Britta Wrede. 2004. The ICSI Meeting Project: Resources and Research. ICASSP-2004 Meeting Recognition Workshop. Montreal, Canada: NIST.

Daniel Jurafsky, Elizabeth Shriberg and Debra Biasca. 1997. Switchboard-DAMSL Labeling Project Coder's Manual. http://stripe.colorado.edu/˜jurafsky/manual.august1

Werner Kunz and Horst W. J. Rittel. 1970. Issues as elements of information systems. Working Paper WP-131, Univ. Stuttgart, Inst. für Grundlagen der Planung.

Richard Landis and Gary Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics, Vol. 33, No. 1, pp. 159–174.

Jackson Liscombe, Jennifer Venditti and Julia Hirschberg. 2003. Classifying Subject Ratings of Emotional Speech Using Acoustic Features. Eurospeech 2003.

Agnes Lisowska. 2003. Multimodal interface design for the multimodal meeting domain: Preliminary indications from a query analysis study. Technical Report IM2, ISSCO/TIM/ETI, Université de Genève, Switzerland, November 2003.

Christine Nakatani and David Traum. 1998. Draft: Discourse Structure Coding Manual, version 2/27/98.

Randolph Quirk, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik. 1985. A Comprehensive Grammar of the English Language. Longman, New York.

Rutger Rienks, Dirk Heylen and Erik van der Weijden. 2005. Argument diagramming of meeting conversations. In Vinciarelli, A. and Odobez, J., editors, Multimodal Multiparty Meeting Processing, Workshop at the 7th International Conference on Multimodal Interfaces, pages 85–92, Trento, Italy.

Janyce Wiebe, Theresa Wilson and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation (formerly Computers and the Humanities), volume 39, issue 2-3, pp. 165–210.

Theresa Wilson and Janyce Wiebe. 2005. Annotating attributions and private states. ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky.

Britta Wrede and Elizabeth Shriberg. 2003. Spotting "Hotspots" in Meetings: Human Judgments and Prosodic Cues. Eurospeech 2003, Geneva.
The Hinoki Sensebank — A Large-Scale Word Sense Tagged Corpus of Japanese —
Takaaki Tanaka, Francis Bond and Sanae Fujita {takaaki,bond,fujita}@cslab.kecl.ntt.co.jp NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation
Abstract

Semantic information is important for precise word sense disambiguation systems and for the kind of semantic analysis used in sophisticated natural language processing, such as machine translation, question answering, etc. There are at least two kinds of semantic information: lexical semantics for words and phrases, and structural semantics for phrases and sentences. We have built a Japanese corpus of over three million words with both lexical and structural semantic information. In this paper, we focus on our method of annotating the lexical semantics, that is, on building a word sense tagged corpus, and on its properties.

1 Introduction

While there has been considerable research on both structural annotation (such as the Penn Treebank (Taylor et al., 2003) or the Kyoto Corpus (Kurohashi and Nagao, 2003)) and semantic annotation (e.g. Senseval: Kilgarriff and Rosenzweig, 2000; Shirai, 2002), there are almost no corpora that combine both. This makes it difficult to carry out research on the interaction between syntax and semantics.

Projects such as the Penn PropBank are adding structural semantics (i.e. predicate argument structure) to syntactically annotated corpora, but not lexical semantic information (i.e. word senses). Other corpora, such as the English Redwoods Corpus (Oepen et al., 2002), combine both syntactic and structural semantics in a monostratal representation, but still have no lexical semantics.

In this paper we discuss the (lexical) semantic annotation for the Hinoki Corpus, which is part of a larger project in psycholinguistics and computational linguistics ultimately aimed at language understanding (Bond et al., 2004).

2 Corpus Design

In this section we describe the overall design of the corpus and its constituent corpora. The basic aim is to combine structural semantic and lexical semantic markup in a single corpus. In order to make the first phase self-contained, we started with dictionary definition and example sentences. We are currently adding other genres, starting with newspaper text, to make the language description more general.

2.1 Lexeed: A Japanese Basic Lexicon

We use word sense definitions from Lexeed: A Japanese Semantic Lexicon (Kasahara et al., 2004). It was built in a series of psycholinguistic experiments where words from two existing machine-readable dictionaries were presented to subjects, who were asked to rank them on a familiarity scale from one to seven, with seven being the most familiar (Amano and Kondo, 1999). Lexeed consists of all words with a familiarity greater than or equal to five; there are 28,000 words in all. Many words have multiple senses: there were 46,347 different senses. The definition sentences for these senses were rewritten to use only the 28,000 familiar words. In the final configuration, 16,900 different words (60% of all possible words) were actually used in the definition sentences. An example entry for the word ドライバー doraibā "driver" is given in Figure 1, with English glosses added. This figure includes the sense annotation and the information derived from it that is described in this paper.

Table 1 shows the relation between polysemy and familiarity. The #WS column indicates the average number of word senses that polysemous
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 62–69, Sydney, July 2006. © 2006 Association for Computational Linguistics

Figure 1 (beginning of the dictionary entry for doraibā):
INDEX: doraibā
POS: noun
Lexical-Type: noun-lex
FAMILIARITY: 6.5 [1–7] (≥ 5)
Frequency: 37
Entropy: 0.79
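The selection criterion behind Lexeed (keep every word whose mean familiarity rating is at least five) and the polysemy-versus-familiarity summary reported in Table 1 can be sketched as below. This is illustrative code under an invented entry format; it is not Lexeed's actual distribution format or the authors' scripts.

```python
def build_lexicon(entries, min_familiarity=5.0):
    """Keep only words whose familiarity rating (1-7 scale) reaches
    the threshold, as in the construction of Lexeed."""
    return [e for e in entries if e["familiarity"] >= min_familiarity]

def polysemy_by_band(lexicon):
    """Mean number of senses of the polysemous words, grouped by
    integer familiarity band (cf. the #WS column of Table 1)."""
    bands = {}
    for e in lexicon:
        n_senses = len(e["senses"])
        if n_senses > 1:                     # polysemous words only
            bands.setdefault(int(e["familiarity"]), []).append(n_senses)
    return {band: sum(ns) / len(ns) for band, ns in bands.items()}
```

For instance, an entry like {"word": "doraiba-", "familiarity": 6.5, "senses": [...]} would pass the threshold and contribute its sense count to the familiarity-6 band.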