<<

TANGO: Bilingual Collocational Concordancer

Jia-Yan Jian Yu-Chia Chang Jason S. Chang Department of Computer Inst. of Information Department of Computer Science System and Applictaion Science National Tsing Hua National Tsing Hua National Tsing Hua University University University 101, Kuangfu Road, 101, Kuangfu Road, 101, Kuangfu Road, Hsinchu, Taiwan Hsinchu, Taiwan Hsinchu, Taiwan [email protected] [email protected] [email protected] du.tw

on elaborated statistical calculation. Moreover, log Abstract likelihood ratios are regarded as a more effective In this paper, we describe TANGO as a method to identify collocations especially when the collocational concordancer for looking up occurrence count is very low (Dunning, 1993). collocations. The system was designed to Smadja’s XTRACT is the pioneering work on answer user’s query of bilingual collocational extracting collocation types. XTRACT employed usage for nouns, verbs and adjectives. We first three different statistical measures related to how obtained collocations from the large associated a pair to be collocation type. It is monolingual British National Corpus (BNC). complicated to set different thresholds for each Subsequently, we identified collocation statistical measure. We decided to research and instances and counterparts in the develop a new and simple method to extract bilingual corpus such as Sinorama Parallel monolingual collocations. Corpus (SPC) by exploiting the - We also provide a web-based user interface alignment technique. The main goal of the capable of searching those collocations and its concordancer is to provide the user with a usage. The concordancer supports language reference tools for correct collocation use so learners to acquire the usage of collocation. In the as to assist second language learners to acquire following section, we give a brief overview of the the most eminent characteristic of native-like TANGO concordancer. writing. 2 TANGO 1 Introduction TANGO is a concordancer capable of answering Collocations are a phenomenon of word users’ queries on collocation use. Currently, combination occurring together relatively often. TANGO supports two text collections: a Collocations also reflect the speaker’s fluency of a monolingual corpus (BNC) and a bilingual corpus language, and serve as a hallmark of near native- (SPC). The system consists of four main parts: like language capability. 2.1 Chunk and Clause Information is critical to a range of Integrated studies and applications, including natural For CoNLL-2000 shared task, chunking is language generation, computer assisted language considered as a process that divides a sentence into learning, , lexicography, word syntactically correlated parts of . With the sense disambiguation, cross language information benefits of CoNLL training data, we built a retrieval, and so on. chunker that turn sentences into smaller syntactic Hanks and Church (1990) proposed using point- structure of non-recursive basic phrases to wise mutual information to identify collocations in facilitate precise collocation extraction. It becomes lexicography; however, the method may result in easier to identify the argument-predicate unacceptable collocations for low-count pairs. The relationship by looking at adjacent chunks. By best methods for extracting collocations usually doing so, we save time as opposed to n-gram take into consideration both linguistic and statistics or full . Take a text in CoNLL- statistical constraints. Smadja (1993) also detailed 2000 for example: techniques for collocation extraction and The words correlated with the same chunk tag developed a program called XTRACT, which is can be further grouped together (see Table 1). For capable of computing flexible collocations based instance, with chunk information, we can extract As a result, we can avoid combining a verb with Confidence/B-NP in/B-PP the/B-NP an irrelevant noun as its collocate as “have toward pound/I-NP is/B-VP widely/I-VP ex- pected/I-VP to/I-VP take/I-VP an- country” in (1) or “think … people” in (2). When other/B-NP sharp/I-NP dive/I-NP if/B- the sentences in the corpus are annotated with the SBAR trade/B-NP figures/I-NP for/B-PP chunk and clause information, we can September/B-NP consequently extract collocations more precisely.

2.2 Collocation Type Extraction (Note: Every chunk type is associated with two A large set of collocation candidates can be different chunk tags: B-CHUNK for the first word obtained from BNC, via the process of integrating of the chunk and I-CHUNK for the other words in chunk and clause information. We here consider the same chunk) three prevalent Verb-Noun collocation structures in corpus: VP+NP, VP+PP+NP, and VP+NP+PP. Exploiting Logarithmic Likelihood Ratio (LLR) the target VN collocation “take dive” from the statistics, we can calculate the strength of example by considering the last word of two association between two collocates. The adjacent VP and NP chunks. We build a robust and collocational type with threshold higher than 7.88 efficient chunking model from training data of the (confidence level 99.5%) will be kept as one entry CoNLL shared task, with up to 93.7% precision in our collocation type list. and recall. 2.3 Collocation Instance Identification Sentence chunking Features We subsequently identify collocation instances Confidence NP in the bilingual corpus (SPC) with the collocation in PP types extracted from BNC in the previous step. Making use of the sequence of chunk types, we the pound NP again single out the adjacent structures of VN, is expected to take VP VPN, and VNP. With the help of chunk and clause another sharp dive NP information, we thus find the valid instances where if SBAR the expected collocation types are located, so as to build a collocational . Moreover, the trade figures NP quantity and quality of BNC also facilitate the for PP collocation identification in another smaller September NP bilingual corpus with better statistic measure.

Table 1: Chunked Sentence English sentence Chinese sentence If in this time no one 如果這時沒有人 In some cases, only considering the chunk shows concern for them, information is not enough. For example, the 關心他們,引導 and directs them to sentence “…the attitude he had towards the 他們正確思考, correct thinking, and country is positive…” may cause problem. With 教他們表達、宣 teaches them how to the chunk information, the system extracts out the 洩情緒,極易在 express and release type “have towards the country” as a VPN emotions, this could very 人格成長上留下 collocation, yet that obviously cuts across two easily leave them with a 一個打不開的死 clauses and is not a valid collocation. To avoid that terrible personality 結。 kind of errors, we further take the clause complex they can never information into account. resolve. With the training and test data from CoNLL- 2001, we built an efficient HMM model to identify Occasionally some 偶爾有一些武 clause relation between words. The language kungfu movies may 打片對某些外國 model provides sufficient information to avoid appeal to foreign 觀眾有吸引力 , extracting wrong collocations. Examples show as audiences, but these too 但也是個案。 follows (additional clause tags will be attached): are exceptions to the rule. (1) ….the attitude (S* he has *S) toward the country Table 2: Examples of collocational translation (2) (S* I think (S* that the people are most memory concerned with the question of (S* when conditions may become ripe. *S)S)S) Type Collocation types in BNC VN type Example VN 631,638 Exert That means they would VPN 15,394 influence already be exerting their VNP 14,008 influence by the time the microwave background was Table 3: The result of collocation types extracted born. from BNC and collocation instances identified in Exercise The Davies brothers, Adrian SPC influence (who scored 14 points) and Graham (four), exercised an 2.4 Extracting Collocational Translation important creative influence Equivalents in Bilingual Corpus on Cambridge fortunes while When accurate instances are obtained from their flankers Holmes and bilingual corpus, we continue to integrate the Pool-Jones were full of fire statistical word-alignment techniques (Melamed, and tenacity in the loose. 1997) and dictionaries to find the translation Wield Fortunately, George V had candidates for each of the two collocates. We first influence worked well with his father locate the translation of the noun. Subsequently, and knew the nature of the we locate the verb nearest to the noun translation current political trends, but he to find the translation for the verb. We can think of did not wield the same collocation with corresponding as a influence internationally as his kind of (shows in Table 2).The esteemed father. implementation result of BNC and SPC shows in the Table 3, 4, and 5. Table 5: Examples of collocation instances extracted from SPC 3 Collocation Concordance Moreover, using the technique of bilingual With the collocation types and instances collocation alignment and sentence alignment, the extracted from the corpus, we built an online system will display the target collocation with collocational concordancer called TANGO for highlight to show translation equivalents in con- looking up translation memory. A user can type in text. Translators or learners, through this web- any English query and select the intended part of based interface, can easily acquire the usage of speech of query and collocate. For example in each collocation with relevant instances. This Figure 1, after query for the verb collocates of the collocational concordancer is a very useful tool for noun “influence” is submitted, the results are self-inductive learning tailored to intermedi-ate or displayed on the return page. The user can then advanced English learners. browse through different collocates types and also Users can obtain the result of the VN or AN click to get to see all the instances of a certain collocations related to their query. TANGO shows collocation type. the collocation types and instances with collocations and translation counterparts high- Noun VN types lighted. Language 320 The evaluation (shows in Table 6) indicates an Influence 319 average precision of 89.3 % with regard to Threat 222 satisfactory. Doubt 199 Crime 183 4 Conclusion and Future Work Phone 137 In this paper, we describe an algorithm that Cigarette 121 employs linguistic and statistical analyses to Throat 86 extract instance of VN collocations from a very Living 79 large corpus; we also identify the corresponding Suicide 47 translations in a parallel corpus. The algorithm is applicable to other types of collocations without Table 4: Examples of collocation types including being limited by collocation’s span. The main a given noun in BNC difference between our algorithm and previous work lies in that we extract valid instances instead of types, based on linguistic information of chunks and clauses. Moreover, in our research we observe Type The number of Translation Translation Precision of Precision of selected Memory Memory (*) Translation Translation sentences Memory Memory (*) VN 100 73 90 73 90 VPN 100 66 89 66 89 VNP 100 78 89 78 89 Table 6: Experiment result of collocational translation memory from Sinorama parallel Corpus

Figure 1: The caption of the table other types related to VN such as VPN (ie. verb + preposition + noun) and VNP (ie. verb + noun + preposition), which will also be crucial for machine translation and computer assisted language learning. In the future, we will apply our method to more types of collocations, to pave the way for more comprehensive applications.

Acknowledgements This work is carried out under the project “CANDLE” funded by National Science Council in Taiwan (NSC92-2524-S007-002). Further information about CANDLE is available at http://candle.cs.nthu.edu.tw/.

References Dunning, T (1993) Accurate methods for the statistics of surprise and coincidence, Computational 19:1, 61-75. Hanks, P. and Church, K. W. Word association norms, mutual information, and lexicography. Computational Linguistics, 1990, 16(1), pp. 22-29. Melamed, I. Dan. "A Word-to-Word Model of Translational Equivalence". In Procs. of the ACL97. pp 490-497. Madrid Spain, 1997. Smadja, F. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143-177.