
Partial Parsing as a Method to Expedite Dependency Annotation of a Hindi Treebank

Mridul Gupta, Vineet Yadav, Samar Husain and Dipti Misra Sharma Language Technologies Research Center, International Institute of Information Technology, Hyderabad INDIA – 500 032 E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract

The paper describes an approach to expedite the process of manual annotation of a Hindi dependency treebank which is currently under development. We propose a way by which consistency among a set of manual annotators could be improved. Furthermore, we show that our setup can also prove useful for evaluating when an inexperienced annotator is ready to start participating in the production of the treebank. We test our approach on sample sets of data obtained from ongoing work on the creation of this treebank, and report results from a semi-automated dependency annotation experiment. We measure the rate of agreement between annotators using Cohen's kappa, and we compare the total time taken to annotate the sample data sets using a completely manual approach as opposed to a semi-automated one. The results show that the semi-automated approach, when carried out with experienced and trained human annotators, improves the overall quality of treebank annotation and also speeds up the process.

1. Introduction

One of the most important linguistic resources is a treebank. Resources such as these are of immense utility to various NLP tasks such as syntactic parsing, natural language understanding and machine translation, and have been endorsed universally. Treebanks are also important for machine learning and for extracting various kinds of linguistic information. Treebank annotation is carried out to encode linguistic information at different levels, such as morphological, syntactic, syntactico-semantic and discourse. Consistency and quality are important aspects of treebank annotation. Much focus has been given to annotating syntactic structures using various linguistic paradigms during the last decade. This is understandable, as building efficient syntactic tools, such as parsers, is crucial for various NLP applications. In this paper our focus is syntactico-semantic dependency annotation. The dependency annotation scheme followed is based on the computational Paninian grammar (CPG) (Bharati et al., 1995).

In this paper, we further elaborate on a method proposed by Bharati et al. (2009a), which can lead to faster annotation of sentences in Hindi without compromising accuracy and consistency. Bharati et al. propose an automatic annotator tool, the 'simple parser for Hindi'. The tool uses a knowledge-based methodology which relies on exploiting certain linguistic and syntactic cues deduced from the manually annotated training data (development data) containing about 1800 Hindi sentences. In this paper we use their tool's output as the data over which manual annotation is done. Use of such a partially parsed output is justified, as it involves less analysis than a fully generated parse (Bharati et al., 2009a).

Indian languages, like Hindi, are relatively rich in morphology and have relatively free word order. Words are grouped into chunks (Bharati et al., 2007). The words within a chunk are fixed and cannot move, whereas the chunks themselves can move around within the sentence, which accounts for the free word order. The scope of partial parsing is to capture dependency relations between the chunks only. Hence, it is important that its methodology (data-driven or knowledge-based) is independent of the word order and derives cues from the morphological features of words. By partial parse, we mean a parsed output in which some important dependency labels between chunks have been annotated using a set of simple yet effective rules (a schematic sketch is given at the end of this section).

This approach, as proposed in (Bharati et al., 2009a), with the use of an automated tool, (a) can help in the process of faster and equally (if not more) accurate annotation as opposed to annotation done entirely manually, and (b) could be useful for inexperienced annotators learning the process of dependency annotation over a certain number of iterations. The paper describes an approach to expedite the process of manual annotation of a Hindi dependency treebank. We propose a way by which consistency among a set of manual annotators could be improved. Furthermore, we show that our setup can also prove useful for evaluating when an inexperienced annotator is ready to start participating in the production of the treebank. We test our approach on sample sets of data obtained from ongoing work on the creation of this treebank and report the results asserting our proposal.

The paper is organized as follows. Section 2 is an overview of some of the work related to this paper. In Section 3, we provide a brief description of the Hindi treebank. Section 4 describes the experiments conducted. Results and observations from those experiments follow in Section 5. Section 6 concludes the paper.
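To make the notion of a chunk-level partial parse concrete, here is a minimal, hypothetical sketch. The Chunk structure, the toy sentence and the two vibhakti (case-marker) rules are our own illustration under stated assumptions, not the actual rules or code of the simple parser of Bharati et al. (2009a); only the dependency labels k1 (karta) and k2 (karma) come from the Paninian scheme used in the treebank.

```python
# A minimal, hypothetical sketch of a chunk-level partial parse.
# Not the rules or code of the 'simple parser for Hindi' (Bharati
# et al., 2009a); it only illustrates that dependency relations are
# marked between chunks, not words, using morphological cues.

from dataclasses import dataclass

@dataclass
class Chunk:
    words: list          # words inside a chunk are fixed in order
    tag: str             # chunk tag, e.g. NP or VGF (finite verb group)
    vibhakti: str = ""   # post-position / case marker on the chunk
    head: int = -1       # index of the parent chunk (-1 = unattached)
    label: str = ""      # dependency label, e.g. k1 (karta), k2 (karma)

# Toy sentence: "raam ne kitaab padhii" (Ram read a book)
sentence = [
    Chunk(["raam", "ne"], "NP", vibhakti="ne"),   # ergative marker
    Chunk(["kitaab"], "NP", vibhakti="0"),        # unmarked
    Chunk(["padhii"], "VGF"),                     # finite verb chunk
]

def partial_parse(chunks):
    """Attach noun chunks to the verb chunk using deliberately
    oversimplified morphological cues; chunks the rules cannot
    decide on are left unattached for the human annotator."""
    verb = next(i for i, c in enumerate(chunks) if c.tag == "VGF")
    for i, c in enumerate(chunks):
        if c.tag == "NP" and c.vibhakti == "ne":
            c.head, c.label = verb, "k1"   # 'ne'-marked chunk -> karta
        elif c.tag == "NP" and c.vibhakti == "0":
            c.head, c.label = verb, "k2"   # unmarked chunk -> karma (toy rule)

partial_parse(sentence)
for c in sentence:
    print(" ".join(c.words), c.label or "-")
```

Because rules of this kind consult chunk-level morphological markers rather than word positions, the procedure remains independent of Hindi's free word order, which is the property the introduction emphasizes.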
2. Related Work

This semi-automated approach to annotation holds merit and has previously been used for the annotation of treebanks. Such a methodology has been followed for the annotation of treebanks in languages such as German (Brants and Skut, 1998) and English, on the PARC 700 Dependency Treebank (King et al., 2003).

Inter-annotator agreement is extensively used for evaluating the consistency of corpus annotation. Inter-annotator agreement as a measure of quality has previously been used extensively on resources like the Penn Treebank (Marcus et al., 1993), the NEGRA corpus (Brants, 2000), the Brown corpus (Francis and Kucera, 1982), the Penn Discourse Treebank (Miltsakaki et al., 2004a; Miltsakaki et al., 2004b), the Columbia Arabic Treebank (Habash and Roth, 2009) and others. Apart from inter-annotator agreement, Miltsakaki et al. (2004a) also used the rate of disagreement for analysis of the Penn Discourse Treebank. In (Marsi and Krahmer, 2005), agreement rate was used to evaluate a system on an aligned parallel Dutch corpus.

3. The Hindi Dependency Treebank (HDT)

A multi-layered and multi-representational treebank for Hindi (Bhatt et al., 2009; Xia et al., 2009) is being developed. The treebank will have dependency, predicate-argument (PropBank; Palmer et al., 2005) and phrase structure (PS) representations. Automatic conversion from dependency structure (DS) to phrase structure (PS) is planned. However, the focus of the current paper is to ascertain the effectiveness of partial dependency parsing in helping the manual annotators expedite the process of dependency annotation of the treebank.

The dependency treebank contains information encoded at the morpho-syntactic (morphological, part-of-speech and chunk information) and syntactico-semantic (dependency) levels. Each sentence is represented in SSF format (Bharati et al., 2007). POS and chunk information is encoded following a set of guidelines (Bharati et al., 2006). The guidelines for the dependency framework (Bharati et al., 2009b) have been adapted from the Paninian grammar (Bharati et al., 1995). For Indian languages like Hindi, the Paninian dependency scheme has been shown to be effective (Begum et al., 2008).

4. Procedure and Experimental Setup

To evaluate the effectiveness of our approach, we used some sample sets of data for testing. The sample sets were obtained from previously validated gold standard Hindi data of about 50,000 tokens taken from the HDT (see Section 3). Three sample sets, labeled S-1, S-2 and S-3, were used for performing the experiment. Each set consisted of 25 sentences. Five annotators were employed to carry out the experiment. These annotators were divided into two groups: A (3 annotators: A1, A2 and A3) and B (2 annotators: B1 and B2). The data given to these annotators was previously unseen to them. The annotators in group A were experienced and had been doing dependency annotation for at least 8 months; they were well-trained annotators. They worked only on sample set S-1. Group B annotators were less experienced than group A and had been doing annotation for the past two months at the time of our experiments, although B1 was slightly more experienced than B2. Group B annotated all three sets. We also tested whether the annotation quality of group B improved over the three sets.

The annotators performed annotation in two different modes:
(a) In the first mode, the annotators simply annotated the sample sets of data by themselves, without using any automated tool. The time taken to annotate was noted down.
(b) In the second mode, they annotated and corrected the output of the automatic annotator for the same sample datasets. Care was taken to ensure that there was a sufficient gap in time between the two modes of annotation: at least a week elapsed between them, which ensured that the annotators did not remember their previous annotation decisions.

For each mode, inter-annotator agreement was calculated to evaluate and compare the levels of consistency. The accuracy (precision) of the annotation was also evaluated by comparing it against the corresponding reference gold standard data prepared by the guideline setters.
4.1 Inter-Annotator Agreement

Inter-annotator agreement is a good way to determine consistency in the process of annotation. It also helps in carrying out error analysis in a more focused manner. We therefore chose Cohen's kappa (κ) (Cohen, 1960) to measure agreement between annotators, as it is a widely used measure which also accounts for agreement by chance, rather than simply calculating the percentage of agreement. Although there has been criticism of the robustness of this method (Di Eugenio, 2000), it still provides a standardized way of estimating the agreement rate across annotators. Cohen's kappa is given by the following equation:

    κ = (p(a) − p(e)) / (1 − p(e))

where p(a) is the observed probability of agreement between two annotators, and p(e) is the theoretical probability of chance agreement, estimated from the annotated sample of data using the label probabilities of each annotator.

Cohen's coefficient has previously been applied for evaluating consistency in work on treebanks and other corpora, such as the Hinoki treebank for Japanese (Bond et al., 2006), the EPEC treebank for Basque (Uria et al., 2009), the WordNet Semcor and DSO corpora (Ng et al., 1999) and a spoken Danish corpus (Paggio, 2006). We use the standard scale for the interpretation of kappa values devised by Landis and Koch (1977), shown in Table 1.

    Kappa value    Degree of Agreement
    <0             None
    0 – 0.20       Slight
    0.21 – 0.40    Fair
    0.41 – 0.60    Moderate
    0.61 – 0.80    Substantial
    0.81 – 1.00    Perfect

Table 1: Landis and Koch interpretation of Cohen's kappa.

As the table shows, values greater than 0.61 are considered a very good agreement rate between any pair of annotators.
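As a concrete companion to the equation above, the following sketch computes κ for two annotators' label sequences. The function and the toy label sequences are illustrative assumptions on our part, not part of the paper's tooling.

```python
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators over the same items:
    kappa = (p(a) - p(e)) / (1 - p(e)).  Assumes p(e) < 1."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)
    # p(a): observed proportion of items the annotators agree on
    p_a = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # p(e): chance agreement from each annotator's label distribution
    dist_1, dist_2 = Counter(labels_1), Counter(labels_2)
    p_e = sum(dist_1[lab] * dist_2[lab] for lab in dist_1) / (n * n)
    return (p_a - p_e) / (1 - p_e)

# Toy example with Paninian dependency labels (hypothetical data):
ann_1 = ["k1", "k2", "k1", "k7", "k2", "k1"]
ann_2 = ["k1", "k2", "k2", "k7", "k2", "k1"]
print(round(cohens_kappa(ann_1, ann_2), 3))
```

For the toy sequences above, κ ≈ 0.74, which Table 1 would classify as substantial agreement.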
4.2 Accuracy Measurement

For evaluating the accuracy of the annotators, we used two metrics: LAA (labeled attachment accuracy) and LA (labeled accuracy). LAA is the precision obtained by checking the correctness of both the attachment and the label between a child-parent pair. LA, on the other hand, is the precision score obtained by checking only the label between a child and its parent.
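The two metrics can be made precise with a short sketch. The (parent, label) pair representation, the function name la_laa and the toy annotations are hypothetical; only the definitions of LA and LAA follow the paper.

```python
def la_laa(gold, predicted):
    """LA  (labeled accuracy): fraction of chunks whose dependency
    label matches the gold label, regardless of attachment.
    LAA (labeled attachment accuracy): fraction of chunks where both
    the attachment (parent) and the label match the gold annotation.
    Each annotation is a list of (parent_index, label) pairs,
    one per chunk."""
    assert len(gold) == len(predicted)
    n = len(gold)
    la = sum(gl == pl for (_, gl), (_, pl) in zip(gold, predicted)) / n
    laa = sum(g == p for g, p in zip(gold, predicted)) / n
    return la, laa

# Hypothetical annotations for a 4-chunk sentence:
gold = [(3, "k1"), (3, "k2"), (3, "k7"), (-1, "root")]
pred = [(3, "k1"), (3, "k7"), (2, "k7"), (-1, "root")]
la, laa = la_laa(gold, pred)
print(f"LA = {la:.2%}, LAA = {laa:.2%}")
```

Note that LAA can never exceed LA under these definitions, which is consistent with the pattern seen in Tables 2, 3 and 4.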

5. Results and Observations

In this section we present the results of the experiments conducted on the sample data sets of Hindi. We show agreement results (L(κ), labeled kappa, and LA(κ), labeled attachment kappa) for the annotation done in each iteration by the annotators. We also report the total time taken for annotation in each of the two modes in the corresponding figures. In Tables 2(a) and 2(b), we show the respective accuracies (precision) recorded by the experienced annotators of group A for the two modes of annotation.

    Annotators    Precision (LA)    Precision (LAA)
    A1            86.4%             81.9%
    A2            82.9%             78.9%
    A3            82.4%             78.4%

Table 2(a): Accuracy of A1, A2 and A3 on the first sample data set (S-1) for the first mode of annotation.

Table 2(b) shows the precision values for the annotation done by each annotator in group A when they were provided with the same dataset, now partially pre-annotated by the automatic annotator.

    Annotators    Precision (LA)    Precision (LAA)
    A1            85.8%             81.5%
    A2            83.9%             79.4%
    A3            83.4%             79.4%

Table 2(b): Accuracy of A1, A2 and A3 on the first sample data set (S-1) for the second mode of annotation.

Table 3 shows the accuracy, in terms of precision, of the automatic annotator alone on the three sample sets of data.

    Sample Sets    Precision (LA)    Precision (LAA)
    S-1            73.4%             65.1%
    S-2            72.2%             64.8%
    S-3            74.2%             65.7%

Table 3: Accuracy of the automatic annotator on the three sample sets of data.

[Figure 1(a): Total agreement rate at the first and second modes for A1, A2 and A3.]

Figure 1(a) depicts the agreement rate for annotators A1, A2 and A3. The agreement rate among them went up when the automatic annotator was used as a preprocessing tool for dependency annotation (the second mode). Moreover, the time taken by them was also reduced by a considerable amount in that mode (cf. Figure 1(b)).

[Figure 1(b): Total time taken (in mins) at each mode of annotation by A1, A2 and A3.]

Tables 2(a) and (b) also reveal that the overall accuracy of the annotation in the second mode is at least as good as that of the first mode; the annotation quality of group A remained, more or less, the same for both modes. The time gap between the two modes was about a month, during which the annotators had been working on other data sets. Therefore, one can safely assume that the annotations in the two modes were completely independent. The group A annotators were unavailable to carry out more iterations; thus, only one complete iteration of our experiment could be performed by them. Nonetheless, they recorded a high agreement rate among themselves in the first iteration itself, which re-affirms our hypothesis that a semi-automated process is useful for dependency annotation with experienced annotators.

To ascertain the validity of our claim when the same procedure is applied to relatively inexperienced annotators, we performed the experiment with another set of annotators (group B). Figures 2(a) and (b) show the annotation consistency between the two annotators of group B, as well as the total time taken by them to annotate each sample set. Tables 4(a) and (b) show the accuracy numbers for B1 and B2 for the two modes at all three iterations.

    Annotators at     Precision (LA)    Precision (LAA)
    each iteration
    B1 (1st)          78.9%             73.4%
    B1 (2nd)          77.1%             71.9%
    B1 (3rd)          75.3%             74.1%
    B2 (1st)          81.2%             77.7%
    B2 (2nd)          78.4%             74.8%
    B2 (3rd)          78.3%             75.5%

Table 4(a): Accuracy of B1 and B2 for the first mode of annotation at each iteration.

    Annotators at     Precision (LA)    Precision (LAA)
    each iteration
    B1 (1st)          82.4%             77.4%
    B1 (2nd)          78.2%             73.1%
    B1 (3rd)          80.8%             78.8%
    B2 (1st)          76.8%             71.7%
    B2 (2nd)          76.5%             69.7%
    B2 (3rd)          82.5%             76.4%

Table 4(b): Accuracy of B1 and B2 for the second mode of annotation at each iteration.

[Figure 2(a): Agreement rate for B1-B2 at different iterations.]

Note that B1 records greater accuracy for mode 2 at the first iteration while also taking less time. On the other hand, B2, an even less experienced annotator, recorded lower accuracy in the second mode than in the first mode for the first iteration. This may be attributed to the fact that quite a few of his/her judgments were influenced by the output of the automatic annotator, which caused confusion; thus, accuracy was somewhat lowered for B2, and as a result of this erroneous annotation, the agreement rate between the two annotators drops. To ascertain whether the inexperienced group B annotators had understood the guidelines, we used two more Hindi sample sets for annotation in two further iterations. The annotators of group A had already recorded a higher agreement rate and accuracy in the second mode for the first iteration.

[Figure 2(b): Total time taken (in mins) for each mode at each iteration.]

As noted from Figures 2(a), (b) and Tables 4(a), (b), B1 records considerably less annotation time and higher accuracy in the second mode for all three sets. The same cannot be said of B2 for the first two iterations. However, as B2 became more familiar with the pattern of annotation, his/her annotation quality improved in the third and final iteration. As a result, the agreement rate between B1 and B2 improved in the second mode for this particular iteration.

6. Conclusion

We reported some key observations from a Hindi dependency annotation experiment conducted in two different modes. From the observations, it seems that a semi-automated process is an effective way of doing dependency annotation of treebanks when the human annotators are trained and experienced. We noted in the experiment that the time and effort required of the human annotators was reduced, and an improvement was observed in accuracy and consistency among the annotators. A greater degree of consistency leads to quality assurance. An inexperienced annotator, on the other hand, can find the process helpful in determining when he/she is ready to participate in the dependency annotation process. This also underlines the fact that treebank annotation is not a trivial task and that supervision is required for carrying out the task in an efficient manner.

7. Acknowledgements

We thank the reviewers for their insightful comments. We also thank the annotators for annotating the data sets used in our experiment. The work reported in this paper was supported by the NSF grant (Award Number: CNS 0751202; CFDA Number: 47.070).

8. References

Begum, R., Husain, S., Dhwaj, A., Sharma, D.M., Bai, L., Sangal, R. (2008). Dependency annotation scheme for Indian languages. In Proceedings of IJCNLP-2008.

Bharati, A., Chaitanya, V., Sangal, R. (1995). Natural Language Processing: A Paninian Perspective. Prentice-Hall of India, New Delhi, pp. 65-106.

Bharati, A., Gupta, M., Yadav, V., Gali, K., Sharma, D.M. (2009a). Simple Parser for Indian Languages in a Dependency Framework. In Proc. of the Third Linguistic Annotation Workshop at 47th ACL and 4th IJCNLP.

Bharati, A., Sangal, R., Sharma, D.M. (2007). SSF: Shakti Standard Format Guide. Technical Report TR-LTRC-33, Language Technologies Research Centre, IIIT-Hyderabad, India.

Bharati, A., Sangal, R., Sharma, D.M., Bai, L. (2006). AnnCorra: Annotating Corpora. Guidelines for POS and Chunk Annotation for Indian Languages. Technical Report TR-LTRC-31, Language Technologies Research Centre, IIIT-Hyderabad, India.

Bharati, A., Sharma, D.M., Husain, S., Bai, L., Begum, R., Sangal, R. (2009b). AnnCorra: TreeBanks for Indian Languages, Guidelines for Annotating Hindi TreeBank. http://ltrc.iiit.ac.in/MachineTrans/research/tb/DS-guidelines/DS-guidelines-ver2-28-05-09.pdf

Bhatt, R., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D.M., Xia, F. (2009). A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu. In Proc. of the Third Linguistic Annotation Workshop at 47th ACL and 4th IJCNLP.

Bond, F., Fujita, S., Tanaka, T. (2006). The Hinoki Syntactic and Semantic Treebank of Japanese. Language Resources and Evaluation, Vol. 40, No. 3/4, Asian Language Processing: State-of-the-Art Resources and Processing, pp. 253-261.

Brants, T. (2000). Inter-Annotator Agreement for a German Newspaper Corpus. In Proc. of the Second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece.

Brants, T., Skut, W. (1998). Automation of Treebank Annotation. In Proc. of New Methods in Language Processing (NeMLaP-98).

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, pp. 37-46.

Di Eugenio, B. (2000). On the usage of Kappa to evaluate agreement on coding tasks. In Proc. of the 2nd International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece.

Francis, W.N., Kucera, H. (1982). Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston, MA.

Habash, N., Roth, R. (2009). CATiB: The Columbia Arabic Treebank. In Proc. of 47th ACL and 4th IJCNLP.

King, T.H., Crouch, R., Riezler, S., Dalrymple, M., Kaplan, R.M. (2003). The PARC 700 Dependency Bank. In Proc. of the Workshop on Linguistically Interpreted Corpora at the European Association for Computational Linguistics.

Landis, J.R., Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, Vol. 33, pp. 159-174.

Marcus, M.P., Marcinkiewicz, M.A., Santorini, B. (1993). Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, Vol. 19, Issue 2, pp. 313-330.

Marsi, E., Krahmer, E. (2005). Classification of semantic relations by humans and machines. In Proc. of the ACL 2005 Workshop on Empirical Modeling of Semantic Equivalence and Entailment.

Miltsakaki, E., Prasad, R., Joshi, A., Webber, B. (2004a). Annotating Discourse Connectives and Their Arguments. In Proc. of the HLT/NAACL Workshop on Frontiers in Corpus Annotation.

Miltsakaki, E., Prasad, R., Joshi, A., Webber, B. (2004b). The Penn Discourse TreeBank. In Proc. of the Language Resources and Evaluation Conference, Portugal.

Ng, H.T., Lim, D.C.Y., Foo, S.K. (1999). A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation. In Proc. of the SIGLEX Workshop on Standardizing Lexical Resources.

Paggio, P. (2006). Information structure and pauses in a corpus of spoken Danish. In Proc. of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, Trento, Italy.

Palmer, M., Gildea, D., Kingsbury, P. (2005). The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1), pp. 71-106.

Uria, L., Estarrona, A., Aldezabal, I., Aranzabe, M.J., de Ilarraza, A.D., Iruskieta, M. (2009). Evaluation of the Syntactic Annotation in EPEC, the Reference Corpus for the Processing of Basque. In Proc. of 10th CICLing.

Xia, F., Rambow, O., Bhatt, R., Palmer, M., Sharma, D.M. (2009). Towards a Multi-Representational Treebank. In Proc. of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 2009), Groningen, Netherlands.