Kelsea Hosoda
Kelsea Kanohokuahiwi Hosoda
ICS 661 Spring 2017
Professor David Chin

Identifying and Categorizing Word Ambiguities due to Diacritical Markers within a single Hawaiian Language Corpus

Introduction

Background

The Hawaiian language has a rich history: it was the thriving language of the most literate nation in the late 1800s, declined to fewer than one thousand Native speakers by the 1950s, and is now a leading language in revitalization efforts (Warchauer, 1997). The Hawaiian language has an advantage in its revitalization because of documentation, preservation, and digitization efforts starting from the 1840s, including Hawaiian language newspapers, oral histories, and video recordings of elders (Berez, 2013; Nogelmeir, 2003; Papa Kilo, 2017). These documents are crucial to researchers and students alike in developing the next generation of Hawaiian language speakers and in understanding Hawaiian culture (Grenoble, 2006).

The Hawaiian people recognized the power of documenting the Hawaiian language in printed form. For example, the very first Hawaiian newspaper article, published in 1834, states the importance of documenting the language for future generations (Solomona, 1834). Beginning with that first newspaper, over 40 different Hawaiian language newspapers were printed and distributed between 1834 and the 1980s (Nogelmeir, 2003). The Hawaiian language newspapers have been archived, and recently a large subset has been digitized via the ʻIke Kūʻokoʻa project (ʻIke Kūʻokoʻa, 2017). The content of the newspapers included Hawaiian stories as well as mainstream factual information. The printed Hawaiian language documents hold a large body of information and cultural knowledge (Nogelmeir, 2003).

Hawaiian Orthography

Hawaiian orthography was first developed by American Protestant missionaries in the 1820s (Schultz, 1994).
The Hawaiian orthography commonly taught in modern schools includes 13 letters: 12 Roman characters (A, E, I, O, U, H, K, L, M, N, P, W) and the ʻokina (ʻ), the glottal stop diacritical marker. The alphabet system also includes the macron over vowels (e.g., Ā, Ē, Ī, Ō, Ū), which lengthens and emphasizes the vowel. The Hawaiian language newspaper corpora used the original Hawaiian alphabet system; however, the diacritical markers (the macron and the glottal stop) were omitted because the block printing presses of the time could not print the markers distinctly. Diacritical markers are an issue in the Hawaiian language community because omitting them makes words with different meanings indistinguishable. For example, the word mānoa means vast, deep, or thick, whereas the word manoa means numerous; in the Hawaiian newspapers both mānoa and manoa would appear as manoa. The lack of diacritical markers therefore increases the ambiguity of words.

Current Study

Word sense disambiguation specific to diacritical markers is the focus of this study. The corpus chosen for the study is a machine-readable Hawaiian book, Hiʻiakaikapoliopele (Hiʻiaka for short), a compilation of an ancient story that ran in the Hawaiian newspapers from 1905 to 1906 (Hoʻoulumahiehie, 2008). The diacritical markers in the Hiʻiaka corpus were manually inserted by experts in the field. Based on the Hiʻiaka corpus, the goals of this study were to:
1) Statistically define word ambiguity due to diacritical marker omission
2) Characterize features associated with disambiguating word senses

Methods

The corpus used for analysis was in an electronic, machine-readable format. The corpus was tokenized by removing punctuation, making the text case-insensitive, and then splitting on spaces. The word tokens were counted to obtain the frequency of each word type (a unique word form within the corpus). Python was used for tokenization and occurrence counts, which were then exported to Microsoft Excel.
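The tokenization and counting steps described above could look roughly like the following. This is a minimal sketch, not the script used in the study: the function names, the regular expression, and the sample sentence are illustrative assumptions, and it assumes the ʻokina is encoded as U+02BB (a Unicode letter) rather than as a quotation mark, which would be stripped as punctuation.

```python
import re
import unicodedata
from collections import Counter

# Assumption: anything that is neither a word character nor whitespace is
# punctuation. The ʻokina (U+02BB) is a Unicode modifier letter, so \w keeps it.
PUNCTUATION = re.compile(r"[^\w\s]")


def tokenize(text):
    """Lowercase the text, remove punctuation, and split on whitespace."""
    # NFC makes each macron vowel a single code point so that \w matches it.
    text = unicodedata.normalize("NFC", text).lower()
    return PUNCTUATION.sub("", text).split()


def type_frequencies(tokens):
    """Map each word type to the number of tokens of that type."""
    return Counter(tokens)


# Toy sentence for illustration only; it is not taken from the corpus.
sample = "He aha ka mea? Ka \u02bbaha, he aha."
tokens = tokenize(sample)
freqs = type_frequencies(tokens)
print(len(tokens), "tokens,", len(freqs), "types")
```

The resulting counts could then be written out with the standard csv module for further work in a spreadsheet.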
In Microsoft Excel, an ambiguous word type list was developed by creating a second list of the word types in which all diacritical markers were removed. The pivot table function in Microsoft Excel was used to examine the ambiguous word types that arise from the lack of diacritical markers. Qualitative analysis and categorization of word ambiguities were done manually by cross-referencing in-corpus examples and Hawaiian dictionaries, including those found at wehewehe.org and www.trussel2.com/haw/.

Descriptive Information                        # of Words
Word Tokens                                    292898
Word Types                                     5476
Ambiguous words due to diacritical markers     371
    5 word forms                               5
    4 word forms                               9
    3 word forms                               50
    2 word forms                               307

Table 1. Descriptive information of the Hiʻiakaikapoliopele corpus. Word tokens is the total number of words found in the corpus. Word types is the total number of unique words in the corpus. Ambiguous words due to diacritical markers is the number of word types that have more than one form with diacritical markers; this count is further broken down by the number of different forms.

Obstacles

The most difficult part of the project was identifying generalized patterns for word ambiguity. My method was to build a concordance: sentences containing the ambiguous words were manually cut from the original text and examined in Excel to identify patterns (see Figure 1). With this process, I was able to identify a pattern based on the word immediately preceding the ambiguous word. The example in Figure 1 shows the difference between the forms aha and ʻaha. In the Hiʻiaka corpus, the aha form is always preceded by the word he, whereas ʻaha is preceded by ka. This finding was not surprising because in older Hawaiian language documents he and aha are concatenated to form one word, heaha. This does bring into question the significance of spaces in defining word boundaries, seeing as the Hawaiian language was first oral and only later documented in written form.
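The study did this work in Excel, but the same two steps (grouping word types that collapse to one spelling once diacritical markers are removed, and tallying the word immediately preceding each form) can be sketched in Python. All names below are hypothetical, the token list is a toy example rather than corpus data, and treating quotation marks as ʻokina look-alikes is an assumption.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: besides the ʻokina (U+02BB), a left single quote or an ASCII
# apostrophe may stand in for the glottal stop in some texts.
OKINA_CHARS = "\u02bb\u2018'"


def strip_diacritics(word):
    """Remove macrons and glottal stop marks from a word."""
    decomposed = unicodedata.normalize("NFD", word)  # e.g. ā -> a + U+0304
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in OKINA_CHARS
    )


def ambiguous_groups(word_types):
    """Group word types by stripped spelling; keep groups with 2+ forms."""
    groups = defaultdict(set)
    for word in word_types:
        groups[strip_diacritics(word)].add(word)
    return {base: forms for base, forms in groups.items() if len(forms) > 1}


def preceding_word_counts(tokens, forms):
    """For each form, count the words that appear immediately before it."""
    counts = {form: Counter() for form in forms}
    for previous, current in zip(tokens, tokens[1:]):
        if current in counts:
            counts[current][previous] += 1
    return counts


# Toy token sequence for illustration only.
tokens = ["he", "aha", "ka", "mea", "i", "ka", "\u02bbaha", "he", "aha"]
groups = ambiguous_groups(set(tokens))
context = preceding_word_counts(tokens, groups["aha"])
print(groups)
print(context)
```

Both functions assume the tokens are already lowercased and consistently normalized, so that one word form is not counted under two different encodings.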
Based on the concordances of sample sentences, two generalized patterns were defined: 1) the ke and ka pattern, and 2) the n-gram part-of-speech pattern.

Figure 1. Example of pattern identification in Excel using concordances and qualitative content analysis of words and phrases surrounding the ambiguous words aha and ʻaha.

Ke and Ka Rule

There is a syntactical rule taught in Hawaiian language courses that there are two different forms of the word "the": ke and ka. The rule states that words beginning with the letters K, E, A, or O are preceded by the ke form, whereas words that start with the other characters found in the Hawaiian alphabet (I, U, H, L, M, N, P, W, ʻ) are preceded by the ka form. This rule was tested as a generalized pattern; however, it does not provide substantial statistical power because of the lack of examples. In Figure 2, the words ahi and ʻahi are compared. Following the rule, ahi should follow ke and ʻahi should follow ka. The ka ʻahi pattern holds true for all occurrences in the corpus; however, there is only one instance of ʻahi in the entire corpus. Ahi does follow ke in two of the five examples found. In the other three examples, ahi is preceded by a noun phrase that it describes.

Figure 2. Example comparison of two ambiguous words that should follow the ke/ka rule.

N-grams and Parts-of-Speech

Parts-of-speech (POS) can potentially be used to aid in disambiguating the ambiguous words. The generalized pattern found suggests that two features could discriminate between word forms: the n-gram distance between a context word and the ambiguous word, and the POS tags that co-occur with the ambiguous word. I recognize this finding is not unique to the Hawaiian language because POS tags are commonly used for word disambiguation, as noted in Chapter 20.2 of Martin (2000). An obstacle to using POS tags to disambiguate Hawaiian words is the lack of a POS-tagged data set.
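The ke/ka rule as stated in the Ke and Ka Rule section is simple enough to express as a small check. The sketch below is illustrative only: the function names are hypothetical, the token list is a toy example, and counting a macron vowel (for example ā) as its base vowel is an assumption that the stated rule does not spell out.

```python
import unicodedata

# Initial letters that take "ke" under the rule as stated: K, E, A, O.
KE_INITIALS = set("keao")


def expected_article(word):
    """Return the article ('ke' or 'ka') the rule predicts before a word."""
    if not word:
        raise ValueError("word must be non-empty")
    # NFD splits a macron vowel into base vowel + combining macron, so the
    # first code point is the base letter (assumption: macrons are ignored).
    first = unicodedata.normalize("NFD", word.lower())[0]
    return "ke" if first in KE_INITIALS else "ka"


def article_conformity(tokens, word):
    """Count ke/ka articles directly before `word` that follow or break the rule."""
    follows = breaks = 0
    for previous, current in zip(tokens, tokens[1:]):
        if current == word and previous in ("ke", "ka"):
            if previous == expected_article(word):
                follows += 1
            else:
                breaks += 1
    return follows, breaks


# Toy token sequence for illustration only; \u02bb is the ʻokina.
tokens = ["ke", "ahi", "ka", "\u02bbahi", "ka", "ahi"]
print(expected_article("ahi"), expected_article("\u02bbahi"))
print(article_conformity(tokens, "ahi"))
```

Occurrences preceded by something other than ke or ka, such as the noun phrase cases noted for ahi, are deliberately left out of both counts.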
All referenced examples of POS tags as a method to disambiguate Hawaiian words in this study are based on content analysis of the sample sentences from the Hiʻiaka corpus.

Ambiguous Word Categorization

Fifty of the 307 ambiguous word pairs were examined and manually categorized as following the ke/ka pattern, following the n-gram POS pattern, or falling in neither category. The findings shown in Table 2 suggest that future studies should focus on the n-gram POS pattern. Four of the 15 words in the "other" category had more than three semantic meanings, suggesting that the ambiguity of those words could be due to polysemy, the coexistence of many possible meanings for a word.

Pattern        Occurrence
ke/ka          12
n-gram POS     23
other          15

Table 2. Categorization of 50 of the 307 words with strictly two ambiguous forms found in the corpus.

Analysis & Conclusion

To my knowledge, this is the first study that attempts to identify and categorize features of the Hawaiian language in order to develop algorithms for automatic disambiguation of words that are ambiguous due to the lack of diacritical markers. The project has identified potential patterns that can aid in categorizing Hawaiian word ambiguity based on the omission of diacritical markers. The study lacks the statistical power needed to develop NLP and AI tools to disambiguate words. I have learned that n-grams and POS tags have large potential for disambiguating words in the Hawaiian language.

There is a lot of work that could be done in this field. A POS-tagged data set would drastically advance the work in this project. However, I recognize that the Hawaiian language community has reservations about POS tags because they could inadvertently push the language to evolve away from its original perspective and intent. The patterns identified in this study could be further defined and used to develop a hand-labeled word sense data set for supervised learning.

Bibliography

Berez, Andrea.
“Kaipuleohone: The University of Hawai‘i Digital Ethnographic Archive,” March 2, 2013. http://scholarspace.manoa.hawaii.edu/handle/10125/26188.