Submission Format for NLPRS-01
Automatic Corpus-Based Extraction of Chinese Legal Terms

Oi Yee Kwong and Benjamin K. Tsou
Language Information Sciences Research Centre
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong

Abstract

This paper reports on a study involving the automatic extraction of Chinese legal terms. We used a word segmented corpus of Chinese court judgments to extract salient legal expressions with standard collocation learning techniques. Our method takes the characteristics of Chinese legal terms into account. The extracted terms were evaluated by human markers and compared against a legal term glossary manually compiled from the same set of data. Results show that at least 50% of the extracted terms are legally salient. Hence they may supplement the outcome and lighten the inconsistency of human efforts. Moreover, various types of significant knowledge in the legal context are mined from the data as a by-product.

1 Introduction

In this paper, we discuss our study on automatic term extraction for Chinese, not only as a natural language processing topic, but also with broader motivation from society at large.

Following the implementation of legal bilingualism in the 90’s, Hong Kong has experienced an increasing demand for authentic and high quality legal texts in both Chinese and English. In view of this, the City University of Hong Kong has undertaken, in collaboration with the Judiciary of the Hong Kong Special Administrative Region (HKSAR), to build a bilingual legal document retrieval system called the Electronic Legal Documentation/Corpus System (ELDoS). The system provides a handy reference for the preparation of legal texts, Chinese judgments in particular (Kwong and Luk, 2001; LISRC, 2001).

Apart from original court judgment texts (in bilingual versions), an important part of the data is a bilingual glossary of mostly legal terms derived from the bilingual judgments. There are existing bilingual legal dictionaries (e.g. Department of Justice, 1998; 1999) widely used and referenced by law students and court interpreters. Nevertheless, according to many legal professionals, different terminologies are in fact used for different genres of legal documents such as statutes, judgments, and contracts. Therefore, robust and authentic glossaries are needed for different uses.

The compilation of a glossary from judgments is hence one of the main tasks in the project. However, identification of legal terms and relevant concepts by humans depends to a large extent on their sensitivity which is, in turn, based on personal experience and legal knowledge. So not only is the process labour intensive, the results are also seriously prone to inconsistency. More importantly, inconsistency is to be avoided in the legal domain where language use should be precise and absolute.

Naturally, one way to facilitate the process is to seek automatic means to extract the relevant terms from texts. Past studies on term extraction have centred on two main ideas, namely collocation and statistics. Most of them dealt with English and only very few with Chinese. In this study, we use a simple but effective method to automatically extract Chinese legal terms, with the method tailored to the characteristics of Chinese legal texts. Although human judges evaluated our results differently, the method is shown to be able to extract many salient legal expressions and supplement a manually constructed glossary.

The rest of this paper is organised as follows. In Section 2, we review past studies on term extraction. In Section 3, we discuss the characteristics observed for Chinese legal terms and introduce our algorithm for their extraction. In Section 4, we describe the materials and explain how evaluation was done. Results are discussed in Section 5 and future directions suggested in Section 6, followed by a conclusion in Section 7.
2 Related Work

In the broader area of lexical acquisition, many studies (e.g. Riloff and Shepherd, 1997; Roark and Charniak, 1998) have demonstrated the usefulness of statistical, corpus-based approaches for the task. They started with noun co-occurrences within specific syntactic patterns (e.g. conjunctions, appositions, etc.) to learn structured, domain-specific semantic lexicons from large corpora.

With a narrower focus, Smadja (1993) developed Xtract to learn collocation patterns within small windows, taking the relative positions of the co-occurring words into account. For a similar purpose, Lin (1998) extracted English collocations from dependency triples obtained from a parser. He used mutual information, with the independence assumption slightly modified, to filter out triples which were likely to have co-occurred by chance.

Among the few relevant studies on Chinese, Fung and Wu (1994) attempted to augment a Chinese machine-readable dictionary by collecting Chinese character groups from an untokenised corpus statistically. They modified Smadja’s Xtract to CXtract for Chinese, starting with significant 2-character bigrams within windows of ±5 characters and seeding with these bigrams to match longer n-grams. The corpus is made up of transcriptions of the parliamentary proceedings of the Legislative Council (LegCo) of Hong Kong. On average over 70% of the bigrams were found to be legitimate words, as were about 30-50% of the other n-grams. With the extracted terms, they were able to obtain a 5% augmentation for a given Chinese dictionary.

Starting from monolingual collocations, bilingual translation lexicons can be acquired (e.g. Wu and Xia, 1995; Smadja et al., 1996). This is particularly useful in machine translation, and is also pertinent to our setting. Wu and Xia (1995), for instance, made use of terms extracted by CXtract to learn collocation translations for English words from the bilingual LegCo proceedings, reporting a precision of about 90%.

3 Extracting Chinese Legal Terms

3.1 Characteristics of Legal Terms

Studies on term extraction generally agree that frequency itself is not sufficient, and some further measure of association is required so that significant but infrequent terms will not be missed. The question is whether similar techniques can be applied to legal terms, especially for Chinese. Below we will compare and contrast some of the characteristics of English and Chinese legal terms. Note that we sometimes use the word “term” in a loose way, referring to expressions of various lengths instead of just single and compound words in the normal sense.

3.1.1 Source and Translation

For about 150 years, the legal system in Hong Kong operated through English only. Only in recent years have parallel Chinese versions of legal documents been produced. Hence there are few established standards on how some legal concepts in the Common Law tradition should be expressed in Chinese. Translations of English legal texts inevitably lead to innovative use of Chinese expressions, sometimes criticised as unnatural or even clumsy. Nevertheless, constructing legal term glossaries is a step towards setting up a reference standard.

3.1.2 Lexicalisation of Legal Concepts

A legal term glossary contains not only single-word terms but also longer expressions for relevant (legal) concepts. Legal concepts are not always lexicalised. For instance, the action of filing a lawsuit against someone is lexicalised as “sue” in English or “s+” in Chinese. But apparently there is no simple term for the action resulting in the status of “assault occasioning actual bodily harm” or “_§ÀÔ3êï9” except to use the whole expression as it is.

3.1.3 Complexity vs Conciseness

Partly owing to a lack of cross-lingual parallel lexicalisation and partly owing to a translator’s style, a concise English term can correspond to a long and complex paraphrase in Chinese, and the reverse can also be true. For example, an English term can be of a simple modifier-head structure such as “procedural irregularity”, but the Chinese translation is more complex: Óg/ B/l/$}/us.¹

¹ The slashes mark word boundaries in Chinese.

3.2 Customised Approach

Although Chinese legal terms vary in length and structure, consistency is the prime requirement for legal texts, as was noted in Section 1. Hence, legal terms and concepts are expected to be consistently expressed or translated. As a result, standard term extraction techniques should be applicable to Chinese legal terms.

In this study, we employ the techniques for collocation learning to extract consistent and significant legal expressions from Chinese court judgments. For a pilot study, we use the t-statistic for significance testing, as described in Manning and Schütze (1999), despite its imperfection in modelling the actual probability distribution. Moreover, we tailor the method to fit the properties of Chinese legal terms better. Our approach is hence different from those in past studies in the following respects.

3.2.1 Start with minimal word information

Unlike Lin (1998), who used syntactic dependency information obtained from a parser, or Fung and Wu (1994), who simply used a corpus of untokenised Chinese characters, we use a Chinese corpus with minimal word information. Since compound terms and longer expressions are all formed from words, the word would be a better unit than the character to start with. On the other hand, salient legal expressions can have different linguistic structures, so syntactic information might not be useful until a later stage, for cleaning up the results.

… maximise useful output than pattern matching and frequency counting as in Smadja (1993) and Fung and Wu (1994), especially in view of the relatively small corpus in our study.

3.3 The Algorithm

In this section, we outline the algorithm for extracting Chinese legal terms with illustrations.

Step 1: Data Collection

Scan the corpus and collect all words and bigrams, delimited by punctuation marks. For example, from the corpus segment

<> <Ê> </´> <0> <m>
(… in these provisional agreements …)

four bigrams are collected:

<><Ê>
<Ê></´>
</´><0>
<0><m>

Step 2: Parameter Estimation

Estimate the following parameters from the corpus data:

$$p(w) = \frac{f(w)}{N} \quad \text{(word probability)}$$

$$p(w_1 w_2) = \frac{f(w_1 w_2)}{N} \quad \text{(bigram probability)}$$
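Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `PUNCTUATION` set, and the choice of N as the total number of word tokens are all hypothetical, since the text specifies none of them. The `t_score` function uses the textbook form of the t-statistic from Manning and Schütze (1999), which the paper cites for significance testing; how the paper applies it in later steps is not covered here.

```python
import math
from collections import Counter

# Assumed delimiter set: the paper does not list the punctuation marks it used.
PUNCTUATION = {"，", "。", "、", "；", "：", "？", "！", ",", ".", ";", ":", "?", "!"}


def collect_counts(tokens):
    """Step 1: count words and adjacent-word bigrams, never pairing across punctuation."""
    word_freq, bigram_freq = Counter(), Counter()
    previous = None  # last word seen since the most recent punctuation mark
    for token in tokens:
        if token in PUNCTUATION:
            previous = None  # a punctuation mark closes the current stretch
            continue
        word_freq[token] += 1
        if previous is not None:
            bigram_freq[(previous, token)] += 1
        previous = token
    return word_freq, bigram_freq


def estimate_probabilities(word_freq, bigram_freq):
    """Step 2: p(w) = f(w) / N and p(w1 w2) = f(w1 w2) / N."""
    n = sum(word_freq.values())  # assumption: N is the number of word tokens
    p_word = {w: f / n for w, f in word_freq.items()}
    p_bigram = {b: f / n for b, f in bigram_freq.items()}
    return p_word, p_bigram, n


def t_score(bigram, p_word, p_bigram, n):
    """Textbook t-statistic for an observed bigram (Manning and Schütze, 1999)."""
    w1, w2 = bigram
    observed = p_bigram[bigram]
    expected = p_word[w1] * p_word[w2]  # probability under independence
    # The sample variance is approximated by the observed probability itself.
    return (observed - expected) / math.sqrt(observed / n)
```

For a five-word segment with no internal punctuation, such as the one in Step 1, `collect_counts` yields four bigrams. Resetting `previous` at each punctuation mark is one way to read "delimited by punctuation marks"; other readings, such as treating only sentence-final marks as delimiters, would need a different delimiter set.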