
Classifying Amharic News Text Using Self-Organizing Maps

Samuel Eyassu
Department of Information Science
Addis Ababa University, Ethiopia
[email protected]

Björn Gambäck∗
Swedish Institute of Computer Science
Box 1263, SE–164 29 Kista, Sweden
[email protected]

∗ Author for correspondence.

Abstract

The paper addresses using artificial neural networks for classification of Amharic news items. Amharic is the language for countrywide communication in Ethiopia and has its own writing system containing extensive systematic redundancy. It is quite dialectally diversified and probably representative of the languages of a continent that so far has received little attention within the language processing field.

The experiments investigated document clustering around user queries using Self-Organizing Maps, an unsupervised learning neural network strategy. The best ANN model showed a precision of 60.0% when trying to cluster unseen data, and a 69.5% precision when trying to classify it.

1 Introduction

Even though recent years have seen an increasing trend towards applying language processing methods to languages other than English, most of the work is still done on very few and mainly European and East-Asian languages; for the vast number of languages of the African continent there still remains plenty of work to be done. The main obstacles to progress in language processing for these are two-fold. Firstly, the peculiarities of the languages themselves might force new strategies to be developed. Secondly, the lack of already available resources and tools makes the creation and testing of new ones more difficult and time-consuming.

Many of the languages of Africa have few speakers, and some lack a standardised written form, both creating problems for building language processing systems and reducing the need for such systems. However, this is not true for the major African languages, and as an example of one of those this paper takes Amharic, the Semitic language used for countrywide communication in Ethiopia. With more than 20 million speakers, Amharic is today probably one of the five largest on the continent (albeit difficult to determine, given the dramatic population size changes in many African countries in recent years).

The Ethiopian culture is ancient, and so are the written languages of the area, with Amharic using its own script. Several computer fonts for the script have been developed, but for many years it had no standardised computer representation¹, which was a deterrent to electronic publication. An exponentially increasing amount of digital information is now being produced in Ethiopia, but no deep-rooted culture of information exchange and dissemination has been established. Different factors are attributed to this, including lack of digital library facilities and central resource sites, inadequate resources for electronic publication of journals and books, and poor documentation and archive collections. The difficulties in accessing information have led to low expectations and under-utilization of existing information resources, even though the need for accurate and fast information access is acknowledged as a major factor affecting the success and quality of research and development, trade and industry (Furzey, 1996).

¹ An international standard for Amharic was agreed on only in 1998, following Amendment 10 to ISO–10646–1. The standard was finally incorporated into Unicode in 2000: www.unicode.org/charts/PDF/U1200.pdf

Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 71–78, Ann Arbor, June 2005.
© 2005 Association for Computational Linguistics

In recent years this has led to an increasing awareness that Amharic language processing resources and digital information access and storage facilities must be created. To this end, some work has now been carried out, mainly by Ethiopian Telecom, the Ethiopian Science and Technology Commission, Addis Ababa University, the Ge’ez Frontier Foundation, and Ethiopian students abroad. For example, Sisay and Haller (2003) have looked at Amharic word formation and lexicon building; Nega and Willett (2002) at stemming; Atelach et al. (2003a) at treebank building; Daniel (Yacob, 2005) at the collection of an (untagged) corpus, tentatively to be hosted by Oxford University’s Open Archives Initiative; and Cowell and Hussain (2003) at character recognition.² See Atelach et al. (2003b) for an overview of the efforts that have been made so far to develop language processing tools for Amharic.

² In the text we follow the Ethiopian practice of referring to Ethiopians by their given names. However, the reference list follows Western standard and is ordered according to surnames (i.e., the father’s name for an Ethiopian).

The need for investigating Amharic information access has been acknowledged by the European Cross-Language Evaluation Forum, which added an Amharic–English track in 2004. However, the task addressed was for accessing an English database in English, with only the original questions being posed in Amharic (and then translated into English). Three groups participated in this track, with Atelach et al. (2004) reporting the best results.

In the present paper we look at the problem of mapping questions posed in Amharic onto a collection of Amharic news items. We use the Self-Organizing Map (SOM) model of artificial neural networks for the task of retrieving the documents matching a specific query. The SOMs were implemented using the Matlab Neural Network Toolbox.

The rest of the paper is laid out as follows. Section 2 discusses artificial neural networks and in particular the SOM model and its application to information access. In Section 3 we describe the Amharic language and its writing system in more detail, together with the news item corpora used for training and testing of the networks, while Sections 4 and 5 detail the actual experiments, on text retrieval and text classification, respectively. Finally, Section 6 sums up the main contents of the paper.

2 Artificial Neural Networks

Artificial Neural Networks (ANN) is a computational paradigm inspired by the neurological structure of the human brain, and ANN terminology borrows from neurology: the brain consists of millions of neurons connected to each other through long and thin strands called axons; the connecting points between neurons are called synapses.

ANNs have proved themselves useful in deriving meaning from complicated or imprecise data; they can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computational and statistical techniques. Traditionally, the most common ANN setup has been the backpropagation architecture (Rumelhart et al., 1986), a supervised learning strategy where input data is fed forward in the network to the output nodes (normally with an intermediate hidden layer of nodes) while errors in matches are propagated backwards in the net during training.

2.1 Self-Organizing Maps

The Self-Organizing Map (SOM) is a neural network with an unsupervised learning scheme, invented by Kohonen (1999). It was originally developed to project multi-dimensional vectors onto a space of reduced dimensionality. Self-organizing systems can have many kinds of structures; a common one consists of an input layer and an output layer, with feed-forward connections from input to output layers and full connectivity (connections between all neurons) in the output layer.

A SOM is provided with a set of rules of a local nature (a signal affects neurons in the immediate vicinity of the current neuron), enabling it to learn to compute an input-output pairing with specific desirable properties. The learning process consists of repeatedly modifying the synaptic weights of the connections in the system in response to input (activation) patterns and in accordance with prescribed rules, until a final configuration develops. Commonly both the weights of the neuron most closely matching the inputs and the weights of its neighbourhood nodes are increased. At the beginning of the training the neighbourhood (where input patterns cluster depending on their similarity) can be fairly large and then be allowed to decrease over time.

2.2 Neural network-based text classification

Neural networks have been widely used in text classification, where they can be given terms as input, with the output nodes representing categories. Ruiz and Srinivasan (1999) utilize a hierarchical array of backpropagation neural networks for (nonlinear) classification of MEDLINE records, while Ng et al. (1997) use the simplest (and linear) type of ANN classifier, the perceptron. Nonlinear methods have not been shown to add any performance to linear ones for text categorization (Sebastiani, 2002).

SOMs have been used for information access since the beginning of the 90s (Lin et al., 1991). A SOM may show how documents with similar features cluster together by projecting the N-dimensional vector space onto a two-dimensional grid. The radius of neighbouring nodes may be varied to include documents that are more weakly related.

[...]eter, if the training set is a singular value decomposition reduced vector space. Tambouratzis et al. (2003) use SOMs for categorizing texts according to register and author style and show that the results are equivalent to those generated by statistical methods.

3 Processing Amharic

Ethiopia, with some 70 million inhabitants, is the third most populous African country and harbours more than 80 different languages.³ Three of these are dominant: Oromo, a Cushitic language spoken in the South and Central parts of the country and written using the Latin alphabet; Tigrinya, spoken in the North and in neighbouring Eritrea; and Amharic, spoken in most parts of the country, but predominantly in the Eastern, Western, and Central regions. Both Amharic and Tigrinya are Semitic and about as
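As a rough illustration of the SOM learning process outlined in Section 2.1 and of the radius-based document grouping mentioned in Section 2.2, the following minimal Python sketch may be helpful. It is not the setup used in the paper, whose SOMs were built with the Matlab Neural Network Toolbox. The sketch assumes the standard Kohonen update (weights are moved towards the input), a rectangular grid, a Gaussian neighbourhood, and a learning rate and radius that shrink linearly over time. The function names train_som, best_matching_unit and docs_within_radius, and all parameter values, are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map on a list of equal-length vectors (assumed setup)."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = max(rows, cols) / 2.0  # assumed initial neighbourhood size
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)    # neighbourhood shrinks
        order = list(data)
        rng.shuffle(order)                           # present inputs in random order
        for x in order:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    if d2 > radius ** 2:
                        continue                     # outside the neighbourhood
                    h = math.exp(-d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])  # move towards the input
    return weights


def docs_within_radius(weights, docs, query, radius):
    """Indices of docs whose map node lies within `radius` grid steps of the query's node."""
    qr, qc = best_matching_unit(weights, query)
    hits = []
    for i, doc in enumerate(docs):
        r, c = best_matching_unit(weights, doc)
        if (r - qr) ** 2 + (c - qc) ** 2 <= radius ** 2:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Toy "document vectors": two artificial groups, not data from the paper.
    docs = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
            [1.0, 1.0, 0.9], [0.9, 1.0, 1.0], [1.0, 0.9, 1.0]]
    som = train_som(docs, rows=3, cols=3)
    print(docs_within_radius(som, docs, [0.0, 0.0, 0.0], radius=1))
```

In this sketch a query is simply another vector mapped onto the trained grid; increasing the radius argument of docs_within_radius widens the set of grid nodes considered, so more weakly related documents are returned.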