A Rule System for Chinese Time Entity Recognition by Comprehensive Linguistic Study
Total Page:16
File Type:pdf, Size:1020Kb
A Rule System for Chinese Time Entity Recognition by Comprehensive Linguistic Study Hongzhi Xu Chu-Ren Huang The Department of CBS Faculty of Humanities The Hong Kong Polytechnic University The Hong Kong Polytechnic University [email protected] [email protected] Abstract valid temporal expression is a path from one cer- tain node to another node. The longer the path is, Chinese time entity is quite complex. In the more confident the recognition will be. this paper, we give a comprehensive lin- CTEMP (Wu et al., 2005) also used linguis- guistic study on it. Based on the analy- tic rules for Chinese temporal entity recognition. sis, we present a rule system which on- However, the focus of this work differs from them ly considers the inner structure of Chinese in that we aims to identify Chinese time entities time entities for the recognition. Exper- which could be described with a limited set of iments on Sinica and TempEval-2 corpus rules and can be easily translated into a structured show that the rule system performs much format, such as TIMEX3(Pustejovsky et al., 2010) better than the CRFs model. When using standard. For this part, the set of rules in this work the rules as features within a CRFs mod- are more comprehensive than (Wu et al., 2005). el, the performance could be further im- However, we don’t include events that are used proved. as time entities, since events intrinsically are not time entities. According to the Generative Lexi- 1 Introduction con Theory (Pustejovsky, 1995), this is a case of In SemEval-2010 competition, there is a sub task type coercion. for temporal entity identification, which includes In Section 2, we will give a linguistic study a Chinese corpus. The final goal of the task is to on Chinese time entity expressions. In Section 3, associate a temporal expression to a certain even- we will construct a rule system which is mainly t. It is very important to extract all the elements based on our linguistic study. In Section 4, we for events in that it will be useful for event track- test rule system on Sinica and TempEval-2 corpo- ing. By identifying the time information of events ra and give a discussion on the experimental result. will enable us to make inference on the temporal Section 5 is the conclusion. relation of different events. 2 Chinese time entity: A linguistic study In this paper, we will make a comprehensive s- tudy on Chinese time entities from a linguistic per- We refer to Y.R. Chao’s book (Chao, 1968) as a spective and then present a rule system for recog- starting point of our study. In China, there are nizing them. Chinese temporal entity is very com- different time systems, including the lunar sys- plex due to the flexible grammar of Chinese and tem, TianGan-DiZhi (GZ) system, etc. In ancient the existence of many different time systems, such China, people used the emperor’s reign to count as Gregorian system, the Chinese lunar system, the time. When a new emperor appeared, a new peri- Chinese tian-gan & di-zhi (GZ) time system. od would then started. Based on our linguistic analysis, we formal- In another perspective, people try to divide the ize a set of temporal elements that are the time axis by different levels of granularity. Rough- blocks used to construct time entities, such as ly, the whole axis can be divided into three peri- century, year, month, day, hour etc. We then ods: guo-qu (past), xian-zai (present) and jiang- build a rule system that actually describe the topol- lai (future). Smaller granularity includes century ogy of the temporal elements. For example, year (shi-ji), year (nian), season (ji-jie), month (yue), follows century; month follows year. So, the day (ri), hour (shi), minute (fen), second (miao). model of our system is a directed graph, while a Week (zhou) is a granularity that is independen- 795 International Joint Conference on Natural Language Processing, pages 795–801, Nagoya, Japan, 14-18 October 2013. t to year, season and month. In China, there are (the 90s of 20th century). The first decade is usu- also jie-qi (JQ) that divides one year into 24 d- ally called ling-ling-nian-dai (00s) or tou-shi-nian ifferent periods. One month can also be divided (first ten years). into 3 periods (XUN): the first ten days (shang- If gong-yuan (A.D.) or gong-yuan-qian (B.C) xun), the second ten days (zhong-xun) and the is used before century or year, then the numbers left days (xia-xun). One day can also be divid- will be written as the pronunciation of the num- ed into different vague phases (DP), e.g. be- ber rather than a sequence of digits. For example, fore dawn (ling-chen), early morning (zao-shang), gong-yuan liang-qian-ling-yi-shi-san nian is simi- morning (shang-wu), noon (zhong-wu), afternoon lar to be said as two thousand and thirteenth years (xia-wu), evening and night (wan-shang), mid- A.D. in English. Otherwise, year 2013 will be night (wu-ye). written as er-ling-yi-san-nian (two-zero-one-three To compile rules for the automatic recognition year). of Chinese time entities, one important issue is to Chinese lunar time system uses a similar way find out the construction regularity for each tem- to denote time as the Gregorian system. Howev- poral element and the relations among the ele- er, it refers to the movement of the moon to count ments, which is also the inner structure of Chinese months. So the start of one year in lunar system time entities. is different from the Gregorian system. We can use a flag ‘&’ (nong-li) to denote the lunar system, 2.1 Gregorian system and Chinese lunar such as & 2013-08-08. In addition, the lunar sys- system tem uses chu before the day number for the first Gregorian system starts from the year of Christ’s ten days of a month in order to make up of two birth. Before this year, B.C. (gong-yuan-qian) is syllables, while the day marker ri is usually omit- used with a number to denote time on the time ted. For example, Aug. 8th is said ba-yue chu-ba, axis. After this year, A.D. (gong-yuan) is used, Aug. 11th is said ba-yue shi-yi. The lunar label which is also the default value. Chinese support- nong-li can also be placed before the subsequence s this sytem. For example, 2013-08-08 09:01:01 of year-month-day, such as nong-li wu-yue chu- is said in Chinese (gong yuan) er-ling-yi-san-nian wu (& 05-05), nong-li chu-wu (& 05)’ etc. ba-yue ba-ri jiu-dian ling-yi-fen ling-yi-miao. One hour can also be divided into four quarter- 2.2 TianGan-DiZhi system s (ke). However, only yi-ke (fifteen) and san-ke This system was invented in Ancient China based (forty five) are valid expressions. For the half of on a the Chinese traditional philosophical theory. an hour, ban (half) is used. zheng (right) will be There are ten heavenly stems (tian gan: TG): jia, used as the right start of an hour. So, zheng, yi- yi, bing, ding, wu, ji, geng, xin, ren, gui and twelve ke, ban, san-ke are the four possible values for the mundane branches (di zhi: DZ): zi, chou, yin, mao, KE element. chen, si, wu, wei, shen, you, xu, hai. Then, one One year can be divided into four quarters (ji- year is denoted by a combination of two differ- du:JD) or (ji-jie:season). An ordinal number will ent elements circularly, which generates sixty d- be used to refer to a certain JD, such as di-yi ji- ifferent denotations. If we use a sequence num- du (the first quarter). The ordinal marker di could ber to denote the two elements, i.e. TG0 9 and − be omitted. So, yi ji-du is also a valid expression. DZ0 11, then the ith year of a circulation is de- − Each season has its own name: spring (chun-ji), fined as yi = TGi%10DZi%12, where 0 i < 60 summer (xia-ji), autumn (qiu-ji) and winter (dong- and % is the mod operation. For example,≤ gui- ji). si-nian (2013) can be formally denoted as year For hours, day phases (DP) could be added be- TG9DZ5, or simply GZ9,5. Similarly, month, day fore them. The DP is usually placed before hour, and the Chinese hour can also be denoted like this. such as ling-chen san-dian (3:00am), wu-ye shi- The twelve DZ items are also associated with er-dian (0:00). However, the boundaries of differ- twelve animals (sheng xiao: SX): shu (mouse), ni- ent phases are not clear, such as xia-wu/wan-shang u (cattle), hu (tiger), tu (rabbit), long (dragon), she liu-dian (6:00 in the afternoon/evening) . (snake), ma (house), yang (sheep), hou (monkey), Century (shi-ji) can be followed by decade ji (chick), gou (dog), zhu (pig). So, one year can (nian-dai), such as er-shi-shi-ji jiu-shi-nian-dai also be simplified as [animal] nian. For example, 796 year 2013 can be also called as she-nian (year of s- as qing-ming festival. Festivals are usually used nake), or formally denoted as SX5. However, this independently to other temporal elements. Mean- kind of expression can only be said alone.