Automatic Corpus-Based Extraction of Chinese Legal Terms

Oi Yee Kwong and Benjamin K. Tsou
Language Information Sciences Research Centre
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong
[email protected]  [email protected]
Abstract

This paper reports on a study involving the automatic extraction of Chinese legal terms. We used a word-segmented corpus of Chinese court judgments to extract salient legal expressions with standard collocation learning techniques. Our method takes the characteristics of Chinese legal terms into account. The extracted terms were evaluated by human markers and compared against a legal term glossary manually compiled from the same set of data. Results show that at least 50% of the extracted terms are legally salient. Hence they may supplement the outcome of human efforts and lighten its inconsistency. Moreover, various types of significant knowledge in the legal context are mined from the data as a by-product.

1 Introduction

In this paper, we discuss our study on automatic term extraction for Chinese, not only as a natural language processing topic, but also with broader motivation from society at large.

Following the implementation of legal bilingualism in the 1990s, Hong Kong has experienced an increasing demand for authentic and high-quality legal texts in both Chinese and English. In view of this, the City University of Hong Kong has undertaken, in collaboration with the Judiciary of the Hong Kong Special Administrative Region (HKSAR), to build a bilingual legal document retrieval system called the Electronic Legal Documentation/Corpus System (ELDoS). The system provides a handy reference for the preparation of legal texts, Chinese judgments in particular (Kwong and Luk, 2001; LISRC, 2001).

Apart from original court judgment texts (in bilingual versions), an important part of the data is a bilingual glossary of mostly legal terms derived from the bilingual judgments. There are existing bilingual legal dictionaries (e.g. Department of Justice, 1998; 1999) widely used and referenced by law students and court interpreters. Nevertheless, according to many legal professionals, different terminologies are in fact used for different genres of legal documents such as statutes, judgments, and contracts. Therefore, robust and authentic glossaries are needed for different uses.

The compilation of a glossary from judgments is hence one of the main tasks in the project. However, the identification of legal terms and relevant concepts by humans depends to a large extent on their sensitivity, which is, in turn, based on personal experience and legal knowledge. So not only is the process labour-intensive, the results are also seriously prone to inconsistency. More importantly, inconsistency is to be avoided in the legal domain, where language use should be precise and absolute.

Naturally, one way to facilitate the process is to seek automatic means to extract the relevant terms from texts. Past studies on term extraction have centred on two main ideas, namely collocation and statistics. Most of them dealt with English and only very few with Chinese. In this study, we use a simple but effective method to automatically extract Chinese legal terms, with the method tailored to the characteristics of Chinese legal texts. Although the human judges evaluated our results differently, the method is shown to be able to extract many salient legal expressions and to supplement a manually constructed glossary.

The rest of this paper is organised as follows. In Section 2, we review past studies on term extraction. In Section 3, we discuss the characteristics observed for Chinese legal terms and introduce our algorithm for their extraction. In Section 4, we describe the materials and explain how the evaluation was done. Results are discussed in Section 5 and future directions suggested in Section 6, followed by a conclusion in Section 7.

2 Related Work

In the broader area of lexical acquisition, many studies (e.g. Riloff and Shepherd, 1997; Roark and Charniak, 1998) have demonstrated the usefulness of statistical, corpus-based approaches for the task. They started with noun co-occurrences within specific syntactic patterns (e.g. conjunctions, appositions, etc.) to learn structured, domain-specific semantic lexicons from large corpora.

With a narrower focus, Smadja (1993) developed Xtract to learn collocation patterns within small windows, taking the relative positions of the co-occurring words into account. For a similar purpose, Lin (1998) extracted English collocations from dependency triples obtained from a parser. He used mutual information, with the independence assumption slightly modified, to filter triples which were likely to have co-occurred by chance.

Amongst the few relevant works on Chinese, Fung and Wu (1994) attempted to augment a Chinese machine-readable dictionary by statistically collecting Chinese character groups from an untokenised corpus. They modified Smadja's Xtract to CXtract for Chinese, starting with significant 2-character bigrams within windows of ±5 characters and seeding with these bigrams to match longer n-grams. The corpus is made up of transcriptions of the parliamentary proceedings of the Legislative Council (LegCo) of Hong Kong. On average over 70% of the bigrams were found to be legitimate words, and so were about 30-50% of the other n-grams. With the extracted terms, they were able to obtain a 5% augmentation of a given Chinese dictionary.

Starting from monolingual collocations, bilingual translation lexicons can be acquired (e.g. Wu and Xia, 1995; Smadja et al., 1996). This is particularly useful in machine translation, and is also pertinent to our setting. Wu and Xia (1995), for instance, made use of terms extracted by CXtract to learn collocation translations for English words from the bilingual LegCo proceedings, reporting a precision of about 90%.

3 Extracting Chinese Legal Terms

3.1 Characteristics of Legal Terms

Studies on term extraction generally agree that frequency itself is not sufficient, and some further measure of association is required so that significant but infrequent terms will not be missed. The question is whether similar techniques can be applied to legal terms, especially for Chinese. Below we compare and contrast some of the characteristics of English and Chinese legal terms. Note that we sometimes use the word "term" in a loose way, referring to expressions of various lengths instead of just single and compound words in the normal sense.

3.1.1 Source and Translation

For about 150 years, the legal system in Hong Kong operated through English only. It is only in recent years that parallel Chinese versions of legal documents have been produced. Hence there are few established standards on how some legal concepts in the Common Law tradition should be expressed in Chinese. Translations of English legal texts inevitably lead to innovative use of Chinese expressions, sometimes criticised as unnatural or even clumsy. Nevertheless, constructing legal term glossaries is a step towards setting up a reference standard.

3.1.2 Lexicalisation of Legal Concepts

A legal term glossary does not only contain single-word terms, but also longer expressions for relevant (legal) concepts. Legal concepts are not always lexicalised. For instance, the action of filing a lawsuit against someone is lexicalised as "sue" in English or "s+" in Chinese. But apparently there is no simple term for the action resulting in the status of "assault occasioning actual bodily harm" or "_§ÀÔ3êï9", except to use the whole expression as it is.

3.1.3 Complexity vs Conciseness

Partly because of the lack of cross-lingual parallel lexicalisation, and partly owing to a translator's style, a concise English term can correspond to a long and complex paraphrase in Chinese, and the reverse can also be true. For example, an English term can be of a simple modifier-head structure such as "procedural irregularity", but the Chinese translation is more complex: Óg/B/l/$}/us1.

1 The slashes mark word boundaries in Chinese.

3.2 Customised Approach

Although Chinese legal terms vary in length and structure, consistency is the prime requirement for legal texts, as was noted in Section 1. Hence, legal terms and concepts are expected to be consistently expressed or translated. As a result, standard term extraction techniques should be applicable to Chinese legal terms.

In this study, we employ techniques for collocation learning to extract consistent and significant legal expressions from Chinese court judgments. For a pilot study, we use the t-statistic for significance testing, as described in Manning and Schütze (1999), despite its imperfection in modelling the actual probability distribution. Moreover, we tailor the method to maximise useful output, rather than relying on pattern matching and frequency counting as in Smadja (1993) and Fung and Wu (1994), especially in view of the relatively small corpus in our study.

3.2.3 Recursive Significance Test to Reduce Noise

We start with word bigrams and reiterate the process with previously significant bigrams combined. In this way, longer expressions are extracted only if they are unlikely to be found by chance. It may also possibly reduce noise.

3.3 The Algorithm

In this section, we outline the algorithm for extracting Chinese legal terms, with illustrations.

Step 1: Data Collection

Scan the corpus and collect all words and bigrams, delimited by punctuation marks. For example, from the corpus segment <> <Ê> <0> …

Compute the t-value for each bigram by

    t = (x̄ − µ) / √(s² / N)

where x̄ is the sample mean, s² is the sample variance approximated as in Manning and Schütze (1999), N is the corpus size, and µ is the mean of the distribution.

In this study, the significance level is set at α = 0.005. Hence, bigrams with t-value > 2.576 are selected. For example:

    w1w2        t-value
     Ê           2.418
    Ê /´         0.996
    /´ 0 Æ       4.114
    0 m          1.685

Step 4: Corpus Revision

Scan the corpus again, combining significant bigrams into "new words", i.e., … <> <Ê> …

4 Materials and Method

4.1 Corpus of Court Judgments

The present study is based on a corpus of court judgments written in Chinese2, provided by the Judiciary of the HKSAR. A subset of this corpus consisting of 23 judgments was used, as a part of the ELDoS system mentioned in Section 1. These judgments include both criminal and civil cases tried at various courts such as the High Court and the Court of Final Appeal.

Corpus texts first undergo automatic word segmentation, followed by human verification. While there are different standards for Chinese word segmentation, here we followed the protocols adopted for the LIVAC corpus (Tsou et al., 2000) …
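To make the procedure concrete, the counting, t-statistic significance testing, and corpus-revision steps above can be sketched in Python as follows. This is an illustrative reimplementation under our own assumptions (a greedy left-to-right merge and a fixed maximum number of rounds), not the code used in the study; input segments are assumed to be punctuation-delimited lists of pre-segmented words.

```python
import math

def t_score(pair_count, count1, count2, n):
    """t-statistic for a bigram: t = (x_bar - mu) / sqrt(s^2 / N),
    with the sample variance s^2 approximated by the sample mean,
    as in Manning and Schutze (1999)."""
    x_bar = pair_count / n            # observed bigram probability
    mu = (count1 / n) * (count2 / n)  # expected probability under independence
    return (x_bar - mu) / math.sqrt(x_bar / n)

def extract_terms(segments, threshold=2.576, max_rounds=10):
    """Iteratively extract significant bigrams and merge them into
    'new words' (corpus revision) until no bigram passes the test.
    `segments` are punctuation-delimited lists of pre-segmented words."""
    terms = set()
    for _ in range(max_rounds):
        n = sum(len(seg) for seg in segments)
        unigrams, bigrams = {}, {}
        for seg in segments:
            for w in seg:
                unigrams[w] = unigrams.get(w, 0) + 1
            for a, b in zip(seg, seg[1:]):
                bigrams[(a, b)] = bigrams.get((a, b), 0) + 1
        significant = {
            pair for pair, c in bigrams.items()
            if t_score(c, unigrams[pair[0]], unigrams[pair[1]], n) > threshold
        }
        if not significant:
            break
        terms |= {a + b for a, b in significant}
        # Corpus revision: combine each significant bigram into a single
        # token (greedy left-to-right merge) before the next round.
        revised = []
        for seg in segments:
            out, i = [], 0
            while i < len(seg):
                if i + 1 < len(seg) and (seg[i], seg[i + 1]) in significant:
                    out.append(seg[i] + seg[i + 1])
                    i += 2
                else:
                    out.append(seg[i])
                    i += 1
            revised.append(out)
        segments = revised
    return terms
```

For instance, in a toy corpus of 100 tokens in which "high" and "court" always co-occur (30 times each), the pair scores t ≈ 3.83 > 2.576, so it is merged into a single token for the next round; the iteration then stops because no significant bigram remains.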
With the materials used in this study, the extraction process was iterated for four rounds and stopped at the fifth, when no more statistically significant bigrams were found. The lexical distribution after each round, including the number of word tokens (Token), word types (Type), bigram types (Bigram), and significant bigram types (Sig Bi), is shown in Table 1. In the last column, we also report the results after taking away terms with numerals (-#), as they are mostly not legal terms. We base the subsequent discussion on these numbers.

    Round   Token   Type   Bigram   Sig Bi (-#)
    1       73286   5773   31898    862 (703)
    2       61573   6519   35551    180 (136)
    3       59662   6642   35811     25 (25)
    4       59401   6648   35796      1 (1)

    Table 1  Lexical Distribution after each Round

Table 1 shows that there are fewer tokens but more types (for both words and bigrams) after each round. This is natural because words combine to form "new words" in subsequent rounds, so the total number of word tokens decreases while the combination gives rise to more word types and bigram types.

Naturally the length of the extracted strings grows after each round. Nevertheless, unlike Fung and Wu (1994), we started with word bigrams instead of character bigrams, hence the final lengths of the significant patterns vary. The top ten terms extracted in each of the first three rounds, with their corresponding t-values, are shown in Table 5.

If we group the results according to the length of the extracted terms, Table 3 shows that extracted terms with four or more characters are rather useful.

    Char    N     Marker 1           Marker 2
                  LT+UP   Acc        LT+UP   Acc
    2      124      0      0.00%       5     4.03%
    3      313      6      1.92%      68    21.73%
    4      260    100     38.46%     187    71.92%
    5       65     17     26.15%      53    81.54%
    6       35     12     34.29%      32    91.43%
    7       25      6     24.00%      19    76.00%
    8       12      6     50.00%      11    91.67%
    9       31      8     25.81%      29    93.55%
    Total  865    155     17.92%     404    46.71%

    Table 3  Accuracy and Term Length

As seen from Table 3, most of the 124 two-character strings and 313 three-character strings are not legal terms. This is in contrast to Fung and Wu (1994), who found over 70% of the two-character bigrams to be legitimate words. The reason for the difference is that we used a word-segmented corpus while they used an untokenised one. During segmentation, most two-character words and many three-character ones are already identified. The unigrams left are mostly function words and almost certainly do not immediately form compound terms with neighbouring words, although they might be part of a longer compound term. So in the following discussion we concentrate only on the extracted terms with four or more characters.

The results from Marker 1 and Marker 2 differ considerably. This difference is thus hard evidence for inconsistency in human markup of legal expressions. Of all the extracted terms, there were 108 and 13 which both markers considered legal terms and useful phrases respectively. That is to say, most terms qualified by Marker 1 were also qualified by Marker 2, but not vice versa. Marker 2, having studied law for a longer time than Marker 1, is relatively more sensitive to various legal concepts. Besides, we observed that Marker 1 left most of the names and titles of judges (see Section 6.2) unmarked, showing that what counts as useful or significant information is really a subjective judgement. Despite the variation between the two markers, we nevertheless see a similar trend in their evaluation.

5.3 Comparison with E-Gloss

E-Gloss consists of 1003 Chinese headwords, many of which have multiple English renditions. The average length of the Chinese terms is 3.65 characters. Of all the headwords, 41.1% are two-character terms and 8.5% are three-character terms4. The glossary, however, should not be thought of as an exhaustive capture of useful terms from the judgments, as there were many practical considerations during its construction. For instance, a Chinese legal term would be collected only if a neat English equivalent could be found in the corresponding English data, to form an entry in the bilingual glossary. Also, legal terms which are too common were left out to keep the database size, and thus the search speed, down.

4 There are only two 1-character terms.

The comparison of the automatically extracted terms with the entries in E-Gloss is shown in Table 4. In the table, E-Gloss and Auto Extract correspond to the number of n-gram entries found in the glossary and extracted by our method respectively. Human Check is the number of automatically extracted terms which both human markers found legitimate (legal term or useful phrase), whereas Glossary Check is the amount of overlap between the automatically extracted terms and the glossary entries. Finally, Supplementation refers to the number of extracted terms found legitimate by humans but not found in E-Gloss.

Thus, for terms with four or more characters, overall about 50% were found legally relevant by both markers. Also, there is about a 15% overlap with E-Gloss. Since our corpus is small, many useful terms actually appear only once or twice and can therefore only be identified manually, not automatically. It is interesting that the extracted terms can nevertheless supplement E-Gloss with 15% of its terms.5

5 A proper recall measure is not possible since we did not have an exhaustive list of all terms which should be included.

Furthermore, Table 4 shows that the figures in the last three columns are not always compatible. For instance, 77 of the 4-character words were found to be good by the human markers, and the overlap with E-Gloss is 50, but there are still 40 for supplementation. This means that although some terms manually identified for E-Gloss are also automatically extracted, the human markers did not always agree on their salience.

6 Future Directions

6.1 Further Filtering

The present study demonstrates the feasibility of using standard term extraction techniques on Chinese legal terms. Our immediate next step would therefore be to expand the scale and use a much larger corpus to extract even more salient legal expressions. Moreover, the overall accuracy obtained here may underestimate the "term extraction" ability of the algorithm, because we only asked the markers to identify legal terms and legally related expressions. Hence some legitimate but common words such as "í3I" (police officer) and "à>;" (apprentice jockey) might be left out of the counting. It would be useful to compare the performance of the algorithm on the domain-specific corpus with that on a more general corpus to see how well it fits the legal domain. Also, more filtering of the irrelevant terms extracted by the algorithm would be desirable.

Meanwhile, we shall experiment with the pruning step (i.e. Step 6 of the algorithm). In that case, we will need to make use of part-of-speech and syntactic information. Also, as the task in ELDoS is to achieve a bilingual glossary, a follow-up of this study would be the automatic acquisition of a bilingual translation lexicon from the parallel court judgments.

6.2 Other Types of Knowledge

Apart from legal concepts, long expressions extracted by the method may have special meanings or carry significant information in the legal context. Here are some examples of the different types of knowledge that could be mined as a by-product, which may also be useful for information extraction:

1. Organisation names: Ù (Independent Commission Against Corruption), }ýéô (Hong Kong Special Administrative Region), Î I Ð X * a I (China International Economic and Trade Arbitration Commission)

2. Concepts for reasoning: ^9 %äù1m (cause really serious injury), Ü<g§ó¶þ1éa (conduct of prejudice to good order and discipline)

3. Posts and titles: ìðín (Building Authority), Ù3I (ICAC Officer), '%ûo (Official Receiver)

4. Legal jargon: !uà (I agree), zÛ (before us), (±]3 (both parties), ²¶éB (there is no evidence that)

5. Who's who, with positions and titles: 6 Ãth (Andrew Li, Chief Justice, Court of Final Appeal), 6 ÃzDZt¸äS (Sir Anthony Mason, Non-Permanent Judge, Court of Final Appeal), ÏÛÃB9 ë où è (Michael Stuart-Moore, VP, Court of Appeal of the High Court)

7 Conclusion

In this study we have used a simple but effective method to automatically extract Chinese legal terms. Results show that at least 50% of the extracted terms are legitimate legal terms, and that they can supplement the outcome of human efforts. More importantly, automatic means can lighten the problem caused by the inconsistency of human judgement. We have thus shown that, with slight modifications, standard collocation learning and term extraction techniques can be applied to domain-specific extraction of Chinese legal terms with encouraging outcomes.
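Looking back at the comparison in Section 5.3, the bookkeeping behind the Human Check, Glossary Check, and Supplementation columns of Table 4 can be sketched as follows. This is an illustrative Python sketch under our own assumptions (the function name and the per-length grouping are ours), not code from the original study.

```python
def compare_with_glossary(extracted, human_approved, glossary):
    """Group extracted terms by character length and count, per length:
    'human'      - terms both markers found legitimate (Human Check),
    'glossary'   - terms also present in the glossary (Glossary Check),
    'supplement' - human-approved terms absent from the glossary
                   (Supplementation)."""
    rows = {}
    for term in extracted:
        row = rows.setdefault(len(term),
                              {"human": 0, "glossary": 0, "supplement": 0})
        if term in human_approved:
            row["human"] += 1
            if term not in glossary:
                row["supplement"] += 1
        if term in glossary:
            row["glossary"] += 1
    return rows
```

With sets of extracted terms, marker-approved terms, and glossary headwords as input, this yields one row per term length, mirroring the layout of Table 4.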
    Char    E-Gloss   Auto Extract   Human Check     Glossary Check   Supplementation
    4         288        260         77 (29.62%)     52 (18.06%)      40 (13.89%)
    5          80         65         12 (18.46%)      4 (5.00%)        9 (11.25%)
    6          57         35         12 (34.29%)      4 (7.02%)        8 (14.04%)
    7          31         25          3 (12.00%)      2 (6.45%)        2 (6.45%)
    8          17         12          6 (50.00%)      1 (5.88%)        5 (29.41%)
    9          31         31          6 (19.35%)      1 (3.23%)        5 (16.13%)
    Total     504        428        211 (49.30%)     72 (14.29%)      72 (14.29%)

    Table 4  Comparison of Automatic Extraction with E-Gloss

Acknowledgements

We thank the Judiciary of the HKSAR for providing the judgment data, colleagues on the ELDoS project for discussion, and the two law students for helping with the evaluation. The authors take sole responsibility for the findings and views expressed in this paper.

References

Department of Justice. 1998. The English-Chinese Glossary of Legal Terms. Hong Kong Special Administrative Region.

Department of Justice. 1999. The Chinese-English Glossary of Legal Terms. Hong Kong Special Administrative Region.

P. Fung and D. Wu. 1994. Statistical Augmentation of a Chinese Machine-Readable Dictionary. In Proceedings of the Second Annual Workshop on Very Large Corpora (WVLC-2), Kyoto.

O.Y. Kwong and R. Luk. 2001. Retrieval and Recycling of Salient Linguistic Information in the Legal Domain: Project ELDoS. Presented at the Annual Conference and Joint Meetings of the Pacific Neighborhood Consortium (PNC 2001), Hong Kong.

Language Information Sciences Research Centre (LISRC). 2001. ELDoS Version 1.0: Installation and Operation Manual. City University of Hong Kong.

D. Lin. 1998. Extracting Collocations from Text Corpora. In Proceedings of the First Workshop on Computational Terminology, Montréal, Canada.

C.D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press.

E. Riloff and J. Shepherd. 1997. A Corpus-Based Approach for Building Semantic Lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117-124, Providence, RI.

B. Roark and E. Charniak. 1998. Noun-Phrase Co-occurrence Statistics for Semi-Automatic Semantic Lexicon Construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 1110-1116, Montréal, Canada.

F.Z. Smadja. 1993. Retrieving Collocations from Text: Xtract. Computational Linguistics, 19(1): 143-177.

F.Z. Smadja, K. McKeown, and V. Hatzivassiloglou. 1996. Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics, 22(1): 1-38.

B.K. Tsou, W.F. Tsoi, T.B.Y. Lai, J. Hu, and S.W.K. Chan. 2000. LIVAC, A Chinese Synchronous Corpus, and Some Applications. In Proceedings of the ICCLC International Conference on Chinese Language Computing, pages 233-238, Chicago.

D. Wu and X. Xia. 1995. Large-Scale Automatic Extraction of an English-Chinese Translation Lexicon. Machine Translation, 9(3-4): 285-313.
    Extracted Strings   English Equivalents                                           t-value

    Round 1
    B9                  Court of Appeal                                               15.513
    6 Ã                 Court of Final Appeal                                         14.330
    DZt                 Permanent Judge                                               13.198
    ÏÛÃ                High Court                                                    10.327
    : t                 Trial Judge                                                   10.306
    lh                  not capable of                                                10.016
    t                   Judge [of a particular court]                                  9.885
    Âm                  in the case                                                    9.811
    !                   the present case                                               9.765
    N+é                 enforce                                                        9.142

    Round 2
    6 ÃDZt             Permanent Judge, Court of Final Appeal                         8.733
    ÏÛÃB9             Court of Appeal of the High Court                              6.795
    }ýéô              Hong Kong Special Administrative Region                        6.476
    DZtúé)            Henry Litton, Permanent Judge                                  5.984
    B9 t                Judge of the Court of Appeal                                   5.895
    zDZt               Non-Permanent Judge                                            5.891
    6 ÃzDZ            Non-Permanent [Judge], Court of Final Appeal                   5.801
    DZtåÀn            Kemal Bokhary, Permanent Judge                                 5.730
    B9 ë o              Vice-President, Court of Appeal                                5.721
    n»0¼               Deed of Family Arrangement                                     5.382

    Round 3
    6 ÃDZtúé)        Henry Litton, Permanent Judge, Court of Final Appeal           4.574
    6 ÃDZtåÀn        Kemal Bokhary, Permanent Judge, Court of Final Appeal          4.464
    lÀÞa51ý          Privilege against Self-incrimination                           4.122
    6 Ãth              Andrew Li, Chief Justice, Court of Final Appeal                4.121
    6 ÃDZt§b          Charles Ching, Permanent Judge, Court of Final Appeal          4.116
    6 ÃzDZt           Non-Permanent Judge, Court of Final Appeal                     4.108
    ÏÛÃ:| tZ)        David Yam, Judge of the Court of First Instance, High Court    3.604
    uà@à1N×          Consent Order / An Order by Consent                            3.462
    ºYbç$}1Ô¼        a certificate that those conditions have been complied with    3.161
    ÏÛÃB9 t©/m       Arthur Leong, Justice of Appeal                                3.160

    Table 5  Top Ten Significant Strings in Rounds 1 to 3