<<

Building the Wordnet

Joanna Ut-Seong Sio Luis Morgado da Costa Palacky University Technological University Olomouc, the Czech Republic [email protected] [email protected]

Abstract Chinese languages (Handel, 2015). The term ‘topolects’ is coined by Mair (1991) to refer to Chinese or, more generally, to speech This reports on the development varieties where the label of either ‘language’ or of the Cantonese Wordnet, a new wordnet ‘’ would be controversial. Nevertheless, project based on Kong Cantonese. It for the purpose of this paper, we will continue is built using the expansion approach, lever- to use the term ‘Chinese’ to refer to this fam- aging on the existing Chinese Open Wordnet, ily of languages and the term ‘dialects’ to refer and the Princeton Wordnet’s semantic hierar- to its variants while being fully aware of the chy. The main goal of our project was to pro- complexity involved. duce a high quality, human-curated resource – There are seven most recognised dialectal and this paper reports on the initial efforts and groups: Mandarin (or Northern Chinese), Xi- steady progress of our building method. It is ang, , , , Hakka and (- our belief that the lexical data made available del, 2015). Norman (1988) classifies the tradi- by this wordnet, including romaniza- tional seven dialectal groups into three larger tion, will be useful for a variety of future uses, groups: Northern (Mandarin), Central (Wu, including many language processing tasks and Gan, and Xiang) and Southern (Hakka, Yue, linguistic research on Cantonese and its inter- and Min). Cantonese belongs to Yue, the actions with other Chinese dialects. Southern group, and it is often used as an 1 Introduction alternative name for this whole group. The variety this Cantonese Wordnet is based on is 1.1 Chinese and its Dialects Cantonese. Chinese is generally treated as one language is often considered a prestige variety due to with many dialects for both cultural and po- its association with the prosperous southern litical reasons. The dialects are spoken by peo- provinces as well as with the Cantonese cul- ple who mostly identify as a single nationality ture of films and popular music. with a shared cultural history. Linguistically speaking, this unifying view is problematic as 1.2 Project Motivation the dialects are not always mutually intelli- A few wordnets exist for Chinese languages. gible. Chinese is, more accurately, a family These efforts include some work on Pre- of genetically-related languages most proba- Ancient Chinese ( et al., 2017), Mid- bly descended from a form of late dle Ancient Chinese (Zhang et al., 2014), as dating from the or slightly ear- well as multiple wordnets for Mandarin - lier (with the possible exception of Min (Han- nese, namely: the Sinica Bilingual Ontolog- del, 2015)). Various dialects, including Can- ical Wordnet (, 2003; Huang et al., tonese, had also been importing grammati- 2004, BOW), the Southeast University Chi- cal elements from neighboring languages (Yue- nese WordNet ( et al., 2008, SEW), the Chi- Hashimoto, 1991), creating dialectal varia- nese WordNet (Huang et al., 2010, CWN) and tions that are more than the sum of language- the Chinese Open Wordnet ( and Bond, internal changes. An arguably less confus- 2013, COW). The Chinese Open Wordnet is ing term ‘Sinitic’ is often used to refer to the the best and most recent effort to produce a high quality wordnet for , tonese and Mandarin vocabulary ranges from learning from previous analogous experiences 30-50%. Ouyang (1993, 23) estimates that and developed alongside a sense-tagged corpus about 1/3 of the lexical items used in reg- ( and Bond, 2014; Wang and Bond, 2014; ular Cantonese speech is not found in Man- Seah and Bond, 2014). darin. To give an example for a very common Unfortunately, scholarly efforts often seem item,‘umbrella’ is yǔsǎn ⾬傘 in Mandarin but to forgo Cantonese, as there is a chronic ab- ze1 遮 in Cantonese. In cases where they share sence of digital resources to study and pro- the same lexical item, the item is always pro- cess this important Chinese dialect. To further nounced differently in the two dialects. For ex- this problem, it is important to note that ample, ‘teacher’ ⽼師 is pronounced as lǎoshī the significant differences between Mandarin, in Mandarin and lou5si1 in Cantonese. The for which there are plenty of resources, and of the first in each case are dif- Cantonese make the idea of using Mandarin ferent, and the onsets of the second syllables resources to process Cantonese fairly useless. are also different, not to mention tonal differ- It was this chronic absence of Cantonese dig- ences. Note that the system is ital resources that ultimately fed our motiva- different here as well. Mandarin uses pīnyīn, tions to build the Cantonese Wordnet. Our which is based exclusively upon the pronun- motivation spawns from our belief that Can- ciation of the dialect. In Cantonese, tonese should have plenty of open, compu- Jyutping is used, a point we will come back to tational tractable and linguistically rich re- later. sources, such as wordnets and corpora, that In the existing Mandarin Chinese Wordnet, support scholarly work, as well as this - simplified characters are used. The simplified guage’s maintenance and preservation – simi- script, adopted in 1949, aims to alleviate some lar to what happens with Mandarin Chinese. of the difficulty associated with use of the tra- We would like our Cantonese Wordnet to ditional script, as a measure to eradicate illit- support many Natural Language Processing eracy. In Hong Kong, and , the tasks, such as speech recognition, word sense traditional script is used, though in the former disambiguation, machine translation or infor- two, changes are happening rapidly since their mation retrieval. And, at the same time, to return to in 1997 and 1999, respectively. also support the study of purely linguistic re- Cantonese is primarily a spoken variant. A search topics, such as lexical semantics, tonal lot of lexical items, excluding those shared patterns, subcategorization, etc. with Mandarin, do not have fixed agreed upon characters, these are often called ‘character- 2 Cantonese: an Overview less’ words. It is not always easy to determine Cantonese is the second most widely known which character to use as there is no standard- Chinese dialect after Mandarin (Matthews ization. In some cases, multiple options are and Yip, 1994). It is spoken in - available, while in some other cases, no op- dong Province, Province, the Spe- tions are available. We will pick up on this cial Administrative of Hong Kong and issue in Section 4.3. Macau, as well as diaspora communities in 2.1 Jyutping Romanization System , , , Singa- Pīnyīn is the official romanization system of pore, etc. According to Ethnologue,1 there are Mandarin Chinese or Pǔtōnghuà (lit. ‘com- 73 million Cantonese speakers worldwide. But mon speech’). And since Mandarin Chi- despite the large number of speakers, credi- nese/Pǔtōnghuà and Cantonese have different ble online resources on Cantonese, free or oth- phonological systems, a different romanization erwise, are limited, especially in comparison system is needed for Cantonese. Many ro- with Mandarin. manization systems exists for Cantonese (.g., There is a considerable lexical overlap be- Jyutping, S.L. , Sidney Lau, Yale, the tween Cantonese and Mandarin. Snow (2004, Government System, etc.) ( and Tang, 49) mentions that the difference between Can- 2016). We adopt the Jyutping system, 粵 1https://www.ethnologue.com/language/yue 拼, for the Cantonese Wordnet. Jyutping was developed by the Linguistic Society of Hong a lemma, both its Jyutping representation and Kong (LSHK) in 1993. Its formal name is The its graph (character) are needed. For example, Linguistic Society of Hong Kong Cantonese sing1 can mean ‘to rise’ 升 or ‘star’ 星. With- Romanization Scheme.2 Since its inception, out the character, it is ambiguous. it is used widely in academic as well as social media. 3 Methodology Cantonese syllables contain onset and rime. The rime can be further divided into the nu- There are two main methods to build wordnets cleus and coda. The lists of possible onset, (Vossen, 1998). The first method is known as nucleus and coda in Jyutping are shown in the ‘expansion’ approach, where the structure Table 1. /m/ and /ng/ are syllabic nasals, of another wordnet is used as ‘pivot’, and the meaning they can appear on their own to form main work is essentially a translation effort – a . Kataoka and Lee (2008) provide conserving the structure of the pivot wordnet the correspondence between Jyutping, the In- and translating nodes of the hierarchy. The ternational Phonetic symbol (IPA) and other Princeton Wordnet (Fellbaum, 1998, PWN) Cantonese romanization systems. is, by far, the most frequently used ‘pivot’ for projects that employ this approach. The Jyutping phonemes second method is known as the ‘merge’ ap- Onset b, p, m, f, d, t, n, l, g, k, ng, proach. This is usually a slower method, since h, gw, kw, w, z, c, s, no pivot structure is assumed, but it ensures Nucleus aa, i, u, e, o, ,oe, a, eo a higher degree of freedom to more carefully Coda p, t, k, m, n, ng, i, u model the structure of the wordnet based on the language in question, without depending Table 1: Jyutping Syllable Struture on pre-assumed semantic relations. One of the immediate benefits of this approach is the abil- In Jyutping, tones are expressed numeri- ity to add new concepts that are not part of cally, using numbers 1 to 6. Table 2 shows the ‘pivot’ language, a problem many word- how these numbers relate to their respective net projects that followed the ‘expansion’ ap- tonal contour using Chao’s number (1 is the proach have struggled with. lowest and 5 is the highest) together with their And while the ‘merge’ approach is perhaps description. more principled in theory, the major drawback Jyutping Chao’s description from this approach is that it does not benefit 1 53/55 high falling/high level from the parallel translations available from 2 35 mid rising all other projects that used the same pivot. 3 33 mid level The best example of this benefit is the Open 4 21 low falling Multilingual Wordnet (Bond and Foster, 2013, 5 13 low rising OMW), a project that links dozens of open 6 22 low level wordnets using PWN as the common struc- ture. This language alignment is very useful Table 2: Cantonese Tones for many NLP tasks, such as Machine Trans- lation and Word Sense Disambiguation. Traditional Chinese philology treats sylla- A recent addition to this discussion is the bles with final stops (p, t, k) as distinct conception of the Collaborative Interlingual classes (checked tones), yielding a nine-tone Index (Bond et al., 2016, CILI) – an open, system. Until recently, there was also a con- language agnostic, flat-structured index that trast between high level (55) and high falling links wordnets across languages without im- (53). However, this distinction has collapsed posing the hierarchy of any single wordnet. for most speakers today. And even though PWN was the main contrib- Cantonese has a lot of homophones, char- utor to the initial set of concepts present in acters that have the same pronunciation but CILI, this set is no longer constrained by it – have different meanings. To uniquely identify multiple projects are now able to contribute 2https://www.lshk.org/jyutping to CILI’s set of concepts, and gain the ben- efits of multilingual alignments without the mas that were not present in COW, we started penalty of being frozen within some imposed by building a small ‘Chinese Wordnet’ from structure. To the best of our knowledge, the the union of all four Chinese wordnets: COW, quickest and easiest way to link to CILI and BOW, SEW and CWN. This was fairly easy to access these language alignments without since all these wordnets were linked to PWN’s an imposed structure is, interestingly enough, hierarchy. Next, we used Hanziconv3 to con- to use the expansion approach with PWN hi- vert all lemmas from simplified Mandarin Chi- erarchy has pivot because all PWN concepts nese to traditional characters. And finally, we have direct links to CILI. And this was what generated a list of all candidate senses (with we decided to do. lemmas converted into traditional characters) As we had no urgent need for a high cover- that satisfied any of these three criteria: age wordnet, for which multiple bootstrapping • Senses that belong to the 4,960 ‘core’ techniques are available to quickly create high concepts in Princeton WordNet (Boyd- coverage lower quality resources, we decided to Graber et al., 2006) – a usual measure for build a high quality resource fully checked by coverage of wordnet resources; native speakers. And although we knew from • Senses from all concepts in two sense the start that building a wordnet from scratch tagged Sherlock Holmes stories, as re- would be very time-consuming, without going ported by NTUMC (Tan and Bond, against our commitment to high quality, we 2014); and decided to ease our task by leveraging on ex- • Senses from any concept with sense sum- isting resources as much has possible. frequency score of one or higher, as re- We used the Chinese Open Wordnet (Wang ported by the PWN (i.e. most concepts and Bond, 2013, COW) as pivot. COW is a yield sum-score of 0); high quality hand-checked resource for Man- 3.1 Human Validation and Jyutping darin Chinese that was also created through the expansion approach (using PWN as pivot). The data generated by the process explained This means that by linking our wordnet to above generated a list of 47,499 candidate COW, we would have easy access to PWN’s senses, spanning over 9,340 synsets. Based concept IDs and, as a result, also to CILI. on this information, we created a spreadsheet for our human validation task. As of this - The basic assumption of our method was ment, a single Cantonese native speaker, who that while Mandarin Chinese and Cantonese is also a trained linguist with extensive work are fairly different languages, and it was clear on Cantonese language is manually checking, from the start that resources from one lan- correcting and adding to this data. guage would not perform well in tasks for the An example of this spreadsheet is shown in other language, there is still a fair amount of Table 3. This spreadsheet contains the can- overlap in the lexical usage. This is not with- didate Cantonese lemmas (converted to tra- out caveats, since Cantonese uses traditional ditional characters from one of the existing characters and Mandarin Chinese uses sim- Mandarin lemmas), English lemmas (provided plified characters – and conversion from sim- by the PWN), Mandarin lemmas (provided plified to traditional characters is inherently by the collection of Chinese wordnets), En- lossy. That being said, we decided to automat- glish definitions and examples (provided by ically convert the lemmas in simplified Man- the PWN), and the synset ID of the PWN3.0. darin Chinese to traditional Chinese, and use this to jumpstart the manual construction of The Jyutping romanization is not produced our Cantonese Wordnet. automatically. It is, in fact, being added by hand by the lexicographer. To our knowledge, Since the other Mandarin Chinese wordnets there are no open Jyutping dictionaries avail- such as the Sinica Bilingual Ontological Word- able under an open license. For this reason, net (Huang, 2003; Huang et al., 2004, BOW), we decided to include this valuable resource in the Southeast University Chinese Wordnet our wordnet. Having Jyutping romanization (Xu et al., 2008, SEW) and the Chinese Word- net (Huang et al., 2010, CWN) included lem- 3https://pypi.org/project/hanziconv/ Cantonese English Mandarin English English Synset Lemma Lemmas Lemmas Definitions Examples 今夜 [deleted] [deleted] [deleted] [deleted] [deleted] [deleted] 今 晚, gam1 this night; tonight; 今夜; 今晚 during the night drop by tonight 00079499-r maan5 this evening of the present day 今 晚, gam1 this night; tonight; 今夜; 今晚 during the night drop by tonight 00079499-r maan1 this evening of the present day 今 晚 ⿊, this night; tonight; 今夜; 今晚 during the night drop by tonight 00079499-r gam1 maan5 this evening of the present day hak1 今 晚 ⿊, this night; tonight; 今夜; 今晚 during the night drop by tonight 00079499-r gam1 maan1 this evening of the present day hak1 Table 3: Human Validation and Jyutping (example)

for each sense will not only facilitate search- hand-checked 18,168 (38.25%) of the total set ing, but can also be very useful for a variety of of candidate senses (i.e. 47,499 senses). Out of other tasks, such as speech recognition or even the total number of candidate senses checked, for educational purposes. We will make use of 8,295 (45.7%) were kept (i.e. the conversion of the new structure provided by the WN-LMF Mandarin lemmas was correct), which is in line format to cluster Jyutping as with Snow’s (2004, 49) predictions. In addi- variants inside the canonical lemma (i.e. the tion to these converted senses, a total of 3,797 traditional ). new senses were added by the lexicographer The process explained in the section above (i.e. that were not suggested by the conver- generated one candidate Cantonese sense for sion from simplified Mandarin Chinese) – this each available Mandarin sense inside each con- comprises about 31.4% of the total number of cept. In the example shown in Table 3, the senses we currently have in our wordnet, and concept 00079499-r contained two Mandarin which is in line with Ouyang’s (1993, 23) pre- senses: jīn yè 今夜, and jīn wǎn 今晚. Both dictions concerning the ratio of exclusive Can- these lemmas have the same form in simplified tonese senses. In total, our wordnet currently and traditional Chinese. This resulted in two has 12,092 senses (a summary of this release’s lines produced (the top two lines in Table 3). statistics is provided in Section 5). The human validation task comprised: 1. asserting if the candidate sense in each 4 Issues line provided was a correct Cantonese 4.1 Separated and Intervening sense – incorrect senses would be deleted Lexemes (see Table 3, ); 2. adding Jyutping romanization for each What is represented by one lemma in En- correct sense – senses with more than glish sometimes requires two lexemes sepa- one pronunciation required the line to be rated from each other with an intervening lex- copied and the corresponding romaniza- eme in Cantonese. For example, ‘to punch’ tion added to the new line (see Table 3, in the sense of ‘to deliver a quick blow’ lines 2-3); is expressed as [daai2...jat1kyun4], literally 3. adding any missing senses that were not ‘hit...one punch’ (打... ⼀拳), where ... is the suggested by the conversion of the Man- slot for the recipient of the punch, the object darin lemmas. This was a non-exhaustive of the verb. Another example is ‘to fire’ in the search, and it depended on the lexicogra- sense of ‘terminate the employment of’, which pher’s ability to recall missing senses (see can be expressed as [gaak3...zik1] (one of the Table 3, lines 4-5); many options in Cantonese), ‘remove...duty’ (⾰... 職), where ... is the slot of the per- At this moment, our lexicographer has son being fired, the object. This is essen- tially different from the English ‘pick up’ and posters, novels) as well as newspapers (e.g. ‘pick...up’ cases (where ... is the object) as , a popular newspaper in Hong [daai2jat1kyun4 + ‘object’] and [gaak3zik1 + Kong). conveys a greater ‘object’] are both ungrammatical – the sepa- degree (compare with ) of ration is obligatory. In view of this, we have ‘informality, directness, intimacy, friendliness, used separated lexemes (with ...) whenever it casualness, freedom, modernity and authen- is necessary to be faithful to the English con- ticity’ (Bauer, 2018, 4). At least partly due cept, a practice also adopted by COW. to the special situation in Hong Kong for a long time, where children speak Cantonese but 4.2 Compositionality of Telic write in standard Chinese (the situation has In many cases, the translated term in Can- changed since the handover in 1997), written tonese is compositional. For example ‘to re- Cantonese ranges over a continuum. On the member’ in the sense of ‘recall knowledge one end, there are texts that are essentially from memory’, is nam2hei2 諗起 in Cantonese, standard Chinese but with a few Cantonese where a post-verbal particle hei2 meaning items, on the other end are texts that are writ- ‘up’ is needed. Sybesma (1997) points out ten entirely in Cantonese (Snow, 2004, 60-61). that Mandarin does not have monomorphemic There is substantial overlap between Man- counterparts for English verbs like ‘see’, ‘hear’, darin and Cantonese vocabulary. For shared and ‘find’, which qualify as achievements (in- vocabulary items, e.g., 飯 ‘rice’, fàn in Man- deed claims that Chinese has no inherently darin and faan6 in Cantonese, the traditional telic verbs at all). The Mandarin counterparts version of the same character is used, and with of these verbs are verbs, where the a different pronunciation. second constituent expresses the attainment of the result (‘phase complement’ in Chao It is estimated that about one-third of the (1968); see also and Thompson (1981)), e.g., lexical items in Cantonese are not shared with kàndào 看到 ‘look-arrive > see’; kànjiàn 看 Mandarin (Ouyang, 1993, 23). This also in- ⾒ ‘look-see > see’; tīngjiàn 聽⾒ ‘listen-see > cludes some very basic vocabulary, such as the 不 唔 hear’; zhǎodào 找到 ‘look for-arrive > find’. negator, which is bù in Mandarin and The situation is the same in Cantonese. To m4 in Cantonese, or very basic content words 看 睇 ensure we have high quality translation equiv- like ‘see’, which is kàn in Mandarin, but alents, these particles are included in the lem- tai2 in Cantonese. For Cantonese-specific lex- mas (the same procedure is adopted by COW). ical items, the choice of the characters is not The consequence is that such entries can be always obvious due to the lack of standardiza- analyzed as compositional. tion. The standardization of written Cantonese 4.3 The Lack of Standardization in lexical items exhibits a gradience, ranging Written Cantonese from items like the negator 唔 m4 and ‘see’ Cantonese is primarily a spoken dialect. Can- 睇 tai2, which are not controversial, to items tonese has never been subjected to rigor- which are regularly represented phonetically ous and formal standardization, despite ef- with English letters in its written forms in forts of lexicographers which resulted in a few online forums, e.g. hea he3 ‘to laze around’. Cantonese-standard Chinese dictionaries and In-between the two extremes, there are many Cantonese word lists (Li, 2000). Cantonese cases where two or more characters are used school children are not taught how to read to represent the same lexical item. For ex- or write Cantonese. The knowledge of writ- ample the word bei2 ‘to give’ can be writ- ten Cantonese among its speakers arises in- ten with 4 different characters, ⽐, 俾, 畀, 被 formally through exposure to its pervasive use (Bauer, 2018, 135). For this first version of (Bauer, 2018). the Cantonese Wordnet, items which are only Written Cantonese is mainly used for in- represented by English letters are not listed. formal or less serious kind of communication For cases where multiple characters are used, (Snow, 2004, 18), but is not uncommon. It all options will be given whenever possible. is used regularly in advertising (e.g. signs, For discussion on strategies on how Cantonese characters are formed, see Li (2000) and Bauer 4.5 The Continuum between Spoken (2018). and Written Cantonese Cantonese has different registers (e.g., every- 4.4 Alternation in Pronunciation day conversation vs. news report). A lot of words which are too formal to use in regular In the Cantonese Wordnet, there are many conversation might appear in TV broadcast, cases where a particular character is given or formal speeches and thus some more for- multiple pronunciations. The two common mal versions of such terms (as long as they are causes for alternation is pinjam 變⾳ ‘changed deem possible in Cantonese) are also included tone’ and laan5jam1 懶⾳ ‘lazy pronunciation’. in our wordnet with the aim of covering the Many morphological constructions in Can- range of registers. The consequence is that the tonese are expressed solely or partly by tone boundary is not always clear. When in doubt, change (Yu, 2009). Traditional descriptive lin- the decision was always to include such items. guistic literature of Cantonese refers to this The question as to what to include can be 變⾳ pinjam process. Table 4 shows some ex- determined in a more objective way in the - amples of tone change cases in deverbal nom- ture. We would like to experiment with Can- inalization (Yu, 2009). tonese texts of various registers, using both the Cantonese and Mandarin wordnets in par- character verb allel to help better understand and identify 掃 ‘to sweep’ sou3 ‘broom’ sou2 words that were not included as part of the 磅 ‘to weight’ bong6 ‘scale’ bong2 Cantonese Wordnet. In time, we hope to es- 油 ‘to grease’ jau4 ‘oil’ jau4 tablish the extent of shared vocabulary items Table 4: Cantonese Tone Change (I) between Mandarin and Cantonese, as well as to identify uniquely Cantonese items.

The term laan5jam1 (‘lazy pronunciation’, 5 Statistics lǎnyīn in Mandarin, literally meaning ‘lazy pronunciation’) has been used in recent years Table 6 provides a summary of the current to refer to ongoing sound changes in Hong state of the Cantonese wordnet. Kong Cantonese. This term designates the use No. No. of a variety of variants in the speech POS % % synsets senses of younger native speakers of Hong Kong Can- tonese (, 2010). One example is syllable- 1,830 (0.52) 5,114 (0.42) initial /n/ and /l/ merger (/n/ > /l/), a phe- verbs 975 (0.28) 3,227 (0.27) nomenon that started around the 70s. This is 565 (0.16) 3,044 (0.25) shown in the Table 5. There are many other 163 (0.05) 707 (0.06) examples of ‘lazy pronunciation’ (e.g., /ng/ > Total 3,533 - 12,092 - /m/) in Cantonese. Table 6: WN Statistics character meaning jyutping In total, the first version of our wordnet cov- 男 ‘male’ naam4 or laam4 ers a bit over 3,500 concepts using over 12,000 ⼥ ‘female’ neoi5 or leoi5 senses. The part-of-speech distribution is gen- 呢度 ‘here’ nei1 dou6 or lei1 dou6 erally in sync with other projects, such as the PWN – with perhaps a weaker dominance of Table 5: Cantonese Tone Change (II) nominal senses and concepts to a slight heav- ier presence of their verbal counterparts. Our In addition to pinjam 變⾳ ‘’ current version covers 35.81% (n = 1,776) of and laan5jam1 懶 ⾳ ‘lazy pronunciation’, the ‘core’ PWN concepts. there are also cases of tone change, which are Since our wordnet is currently pivoting on not clear what the motivation is. Nevertheless, the hierarchy provided by PWN, through whenever possible, all options were captured COW, we have no information about seman- by our wordnet. tic relations to report. In further stages of our project, however, we might revise this po- The data for this wordnet is available on sition and consider taking advantage of CILI Github6. to adapt our wordnet’s semantic hierarchy to better fit the assumptions of Cantonese native 7 Conclusions and Future Work speakers. As mentioned above, in Section 3.1, the pro- This paper presented the ongoing efforts to cess of human validation is still ongoing, and build a Cantonese Wordnet. We have moti- we expect to provide an update to these statis- vated this project with the lack of digital re- tics in the camera-ready version of this paper. sources available for Cantonese – a major Chi- nese dialect. We have introduced our method- 6 Release ology, which is to use existing Mandarin word- nets to project Cantonese candidate senses. This Cantonese Wordnet will be released un- So far our wordnet includes over 3,500 con- der a Creative Commons Attribution 4.0 In- cepts and over 12,000 senses. We have dis- 4 ternational License (CC BY 4.0) . cussed some specific challenges encountered Keeping up with the recent changes and re- while building our wordnet and how we ad- quirements of the OMW, the Cantonese Word- dressed them. We hope that this new open net will be primary released and supported for resource will promote a variety of future uses, 5 the recent WN-LMF format, developed and including language processing tasks and lin- maintained by the Global WordNet Associa- guistic research. tion. The use of WN-LMF is not only required We would like to continue our efforts to im- by the most recent version of the OMW, but prove the coverage and quality of our Can- is also an essential vehicle to access the new tonese Wordnet. This would include: Collaborative Interlingual Index (Bond et al., • finish validating and revising the list of 2016, CILI). Once linked to CILI, our wordnet candidate senses generated through the will be able to contribute with new concepts, methods explained in Section 3 (so far present only in Cantonese such as sap1jit6 we have completed 38.25% of this valida- 濕熱, an adjective with the literal meaning tion); of ‘hot wet’ (it describes a general negative • add example sentences for each sense, health condition resulting from an unhealthy which would be the start of an open, lifestyle, e.g. smoking, sleep deprivation, etc.), sense-tagged Cantonese corpus; or gung1zyu2beng6 公主病, a noun that lit- • given that Cantonese is predominantly erally means ‘princess disease’. It describes used in speech, we would also like to add girls who are over-confident, over-reliant and audio recording for each pronunciation of demand princess-like treatment. each lemma; In addition, this release will also include Once the Cantonese wordnet reaches a suf- the tab-separated-value format used by the ficient coverage, we would like to use it tore- original OMW specifications. These files are search a variety of topics, including: still very useful for their size, simplicity, and legacy compatibilities with existing systems. • study the amount of Mandarin words that One such example is the use of this data have entered common Cantonese speech through NLTK: Python Natural Language and writing and, conversely, when and Toolkit (Bird et al., 2009) – which currently why some Mandarin words are never used still uses this legacy format. However, the Cantonese; simplicity of this format doesn’t come without • study the morphologically conditioned a cost. Due to the flatter nature of this for- tone changes in Cantonese such as pin- mat, the Jyutping romanization of Cantonese jam and other less understood phenom- lemmas will be added as separate lemmas (i.e. ena; and effectively doubling the number of words and • shed some light on the potential relation senses within this format). between register (formal register is often tied to , which is based 4https://creativecommons.org/licenses/by/4.0/ 5https://github.com/globalwordnet/schemas 6https://github.com/lmorgadodacosta/CantoneseWN on Mandarin) and tone change (a speech - Huang, -Kai Hsieh, -Fei Hong, phenomenon); Yun-Zhu , I-Li , Yong-Xiang Chen, and - Huang. 2010. Chinese wordnet: - Acknowledgments sign, implementation, and application of an in- frastructure for cross-lingual knowledge process- We thank the reviewers for their comments. ing. Journal of Chinese Information Processing, Joanna Ut-Seong Sio gratefully acknowledges 24(2):14–23. the research funding received from the Fac- Chu-Ren Huang. 2003. Sinica bow: integrating ulty of Arts, Palacky University in Olomouc bilingual wordnet and sumo ontology. In Inter- through its Research Grant Scheme (funding national Conference on Natural Language Pro- cycle 2019-2021). cessing and Knowledge Engineering, 2003. Pro- ceedings. 2003, pages 825–826. IEEE.

Shin Kataoka and Cream Yin- Lee. 2008. A References system without a system: Cantonese roman- . 2018. Cantonese as written lan- ization used in hong kong place and personal guage in hong kong. Global Chinese, 4(1):103– names. Hong Kong Journal of Applied Linguis- 142. tics, 11(1):79–98. Steven Bird, Ewan Klein, and Edward Loper. Charles Li and Sandra Thompson. 1981. A func- 2009. Natural language processing with Python: tional reference of mandarin chinese. analyzing text with the natural language toolkit. Berkeley, CA: University of California Press. O’Reilly Media, Inc. Find this author on.

Francis Bond and Ryan Foster. 2013. Linking and David CS Li. 2000. Phonetic borrowing: Key to extending an open multilingual wordnet. In Pro- the vitality of written cantonese in hong kong. ceedings of the 51st Annual Meeting of the As- Written Language & Literacy, 3(2):199–233. sociation for Computational Linguistics, pages 1352–1362. Victor H Mair. 1991. What is a Chinese” - alect/topolect”?: Reflections on some key Sino- Francis Bond, Piek Vossen, John P McCrae, and English Linguistic Terms. Christiane Fellbaum. 2016. CILI: the Collabo- rative Interlingual Index. In Proceedings of the Stephen Matthews and Virginia Yip. 1994. Can- Global WordNet Conference. tonese. Routledge. Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire. 2006. Adding Jerry Norman. 1988. Chinese. Cambridge Univer- dense, weighted connections to wordnet. In Pro- sity Press. ceedings of the third international WordNet con- ference, pages 29–36. Jueya Ouyang. 1993. Putonghua guangzhouhua de bijiao yu xuexi [comparison and study of - . 1968. Language and symbolic tonghua and cantonese]. systems, volume 457. CUP Archive. Yu Jie Seah and Francis Bond. 2014. Annotation Siu-Pong Cheng and Sze-Wing Tang. 2016. Can- of in a multilingual corpus of Man- tonese Romanizaton. Routledge. darin Chinese, English and Japanese. In Pro- ceedings 10th Joint ISO-ACL SIGSEM Work- Picus Sizhi Ding. 2010. in shop on Interoperable Semantic Annotation, hong kong cantonese through language contact page 82. with chinese topolects and english over the past century. Marginal dialects: Scotland, Ireland Don Snow. 2004. Cantonese as written language: and beyond, pages 198–218. The growth of a written Chinese , vol- Christiane Fellbaum, editor. 1998. WordNet: An ume 1. Hong Kong University Press. Electronic Lexical Database. MIT Press, Cam- Rint Sybesma. 1997. Why chinese verb-le is a re- bridge, . sultative . Journal of East Asian Lin- Zev Handel. 2015. The classification of chi- guistics, 6(3):215–261. nese. The Oxford handbook of Chinese linguis- tics, page 34. Liling Tan and Francis Bond. 2014. NTU- MC toolkit: Annotating a linguistically di- Chu-Ren Huang, -Yng , and Hshiang-Pin verse corpus. In Proceedings of COLING 2014, Lee. 2004. Sinica bow (bilingual ontological the 25th International Conference on Compu- wordnet): Integration of bilingual wordnet and tational Linguistics: System Demonstrations, sumo. In LREC. pages 86–89. Piek Vossen. 1998. A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers. doi, 10:978–94. Shan Wang and Francis Bond. 2013. Building the chinese open wordnet (cow): Starting from core synsets. In Proceedings of the 11th Workshop on Asian Language Resources, pages 10–18. Shan Wang and Francis Bond. 2014. Building the sense-tagged multilingual parallel corpus. In LREC, pages 2403–2409. Renjie Xu, Zhiqiang , Yingji , Yuzhong , and Zhisheng Huang. 2008. An integrated approach for automatic construction of bilingual chinese-english wordnet. In John Domingue and Chutiporn Anutariya, editors, The Seman- tic Web, pages 302–314, Berlin, Heidelberg. Springer Berlin Heidelberg. Alan Yu. 2009. Tonal mapping in cantonese voca- tive . In Annual Meeting of the Berkeley Linguistics Society, volume 35, pages 341–352. Anne Yue-Hashimoto. 1991. The yue dialect. Journal of Chinese Linguistics Monograph - ries, (3):292–322. Yingjie Zhang, Bin Li, Xiaoyu Wang, Xueyang , and Jiajun Chen. 2014. Mapping word senses of middle ancient Chinese to WordNet. In 2014 IEEE/WIC/ACM International Joint Confer- ences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 1, pages 446– 450. IEEE.

Yingjie Zhang, Bin Li, Xinyu Dai, Shujian Huang, and Jiajun Chen. 2017. Pqac-wn: constructing a wordnet for pre-qin ancient chinese. Language Resources and Evaluation, 51(2):525–545.