Towards Amharic Semantic Search Engine

Fekade Getahun
Department of Computer Science, Addis Ababa University, Ethiopia
Email: [email protected]

Genet Asefa
Department of Computer Science and IT, Addis Ababa Science and Technology University, Ethiopia
Email: [email protected]

ABSTRACT
Recently, the amount of documents written in the Amharic language has been increasing dramatically. Searching such content using a localized and regional version of a general-purpose search engine such as google.com.et returns documents containing the search key terms while ignoring specific characteristics of the Amharic language. In this paper, we present the design and implementation of a Semantic Search Engine for Amharic documents. The search engine has a Crawler, an Ontology/Knowledge base, an Indexer and a Query Processor that consider the characteristics of the Amharic language. The ontology provides the shared concepts of the sport domain. It is built manually by language and sport domain experts and is used in building the semantic indexer, the ranker and the query engine. In addition, we show how the system facilitates meaning-based searching and relevance- and popularity-based document ranking.

Categories and Subject Descriptors
Information Systems: Information retrieval - Document representation - Ontologies
Information Systems: World Wide Web - Web searching and information discovery - Web search engines - Web indexing

General Terms
Algorithms, Performance

Keywords
Search engine, Semantic information retrieval, Semantic indexing, Football domain ontology, Query processor, Document annotation.

1. INTRODUCTION
Noticing the growing number of non-English language web documents, Google has provided localized and region-based search engines. In addition, there are a number of research works dedicated to searching non-Amharic documents [1,2,3]. However, their approaches do not consider the basic characteristics of the Amharic language.

In Amharic, lexical variations are very common [4]. A word may have more than one meaning (polysemy); for example, the word "lega" has two different meanings, i.e. "kicking a ball" or "very young". Conversely, different words may have similar meanings; for instance, "memtat" and "melegat" both mean "kicking". A search engine should be capable of dealing with these characteristics of the language.

The number of systems and research works associated with the Amharic language is very limited [5,6,7]. In [5,6], a text-based searching approach is adopted, whereas in [7] latent semantic indexing is used to identify concepts. However, both approaches fail to consider the semantic relationships between concepts, and hence the results are restricted to the mere occurrence of terms or concepts. Thus, both are incapable of responding to queries expressed indirectly or with paraphrasing.
2. RELATED WORK
The size of news documents written in the Amharic language has been increasing dramatically, and there is a need to share them with the public. To make this happen, we need a search engine that is responsive to the Amharic language.

The prominent research works conducted in the area of Amharic search engines [5,6] utilize a keyword-based indexing method, which is incapable of representing the semantics of a document. In particular, these works do not address two vital properties of the Amharic language (synonymy and polysemy) as discussed in Section 1.

In [7], the researcher applied Latent Semantic Indexing (LSI) with the Singular Value Decomposition (SVD) method to construct a semantic indexer. LSI extracts concepts from a given corpus by looking for words that co-occur frequently, without giving emphasis to the relationships between concepts. Therefore, this approach is incapable of responding to queries written indirectly. In addition, as the size of the document collection increases, the performance of the indexer degrades.

In order to show the realm of this research, let us consider a user query asking for all sport documents about Lucy in the 2013 African Women's Championship. In this query the term "Lucy" has two different meanings: the Ethiopian Women's National Football Team and the name of a female player. However, in this particular query the term "Lucy" is intended to mean the Ethiopian Women's National Football Team rather than the female player. Responding to such a query using the existing approaches, i.e. keyword based or LSI, is difficult.

The keyword-based approach in [5,6] relies on the mere presence of the query words in a document and does not address the semantic information embedded in the document, which may lead to irrelevant results. LSI as used in [7] takes a query as a pseudo-document and looks for documents very similar to it. However, this approach cannot retrieve documents about the "Ethiopian Women's national team", which is a synonym of "Lucy", or about the "African Women's Championship", which is related to Lucy, because the LSI-based concept indexer ignores the relationships between concepts.

Researchers have been working on semantic search engines that rely on dedicated knowledge bases organized systematically around concepts, associated relationships and documents. A semantic search engine can be domain specific or generic depending on the referenced knowledge base (VikiTron uses a mathematics, chemistry and geography knowledge base; Firmily is a business search engine; Symbolab is a scientific search engine; Evi is an answer engine; Swoogle is an ontology search engine; Falcons is a full semantic search engine). However, these engines are restricted to foreign languages and hence do not address the basic characteristics of the Amharic language.

In this paper, we provide a semantic search engine, AmhS2Eng, which returns a list of Amharic documents very similar to the user query, whether the query is expressed directly or indirectly. In our approach, documents extracted from the web and user queries are annotated with semantic concepts extracted from the football domain.

Table 2-1 shows a summary of works related to the realm of this research, categorized by language.

Table 2-1: Summary of related work

| Language | Research title | Approach | Drawbacks |
| Amharic | [5] Amharic Search Engine; [6] Enhanced Amharic Search Engine | Keyword based | Not concept or meaning based |
| Amharic | [7] Amharic Text Retrieval | LSI | Cannot infer new results; time and memory intensive |
| English | [3] An Ontology-Based Retrieval System Using Semantic Indexing | Knowledge-based similarity approach | Similarity between concepts is not considered; language specific |
| English | [8] Context based Indexing in Search Engines using Ontology | Users must explicitly add context | Incapable of handling indirect queries |
| English | [2] Ontology based text indexing and querying for the semantic web | Ontology having uniform relationships | Incapable of handling indirect queries; each relationship in the ontology has equal weight; dedicated to the English language |
| - | Localized and region-based engine [Google] | Keyword based | Does not take into account the characteristics of the Amharic language |

3. Architecture of Amharic Semantic Search Engine
Figure 1 shows the architecture of the proposed semantic search engine, which takes into account specific features of the Amharic language. It is composed of the following major components: Crawler, Query Interface, Ontology, Concept Indexer, and Query Processor.

The Ontology is built manually with the help of domain and language experts. It contains concepts, relationships between concepts, instances of each concept, and similarities between knowledge entries.

The Crawler extracts URLs and the content of Amharic documents and stores them in a document repository.

The indexing component annotates the crawled documents with semantic information (concepts, instances, words) and associates each piece of semantic information with a weight. This component also handles specific characteristics of the Amharic language, such as stemming.

The Query Processor annotates the original user query with semantic information, conducts document retrieval, and ranks the resulting documents based on relevance, as explained in the next sub-sections.

3.1 Ontology management
The ontology development process is composed of three main activities:
1. Building the high-level ontological schema
2. Populating the ontology with instance values
3. Computing similarity

a) Building the ontology schema
In this work, an ontology O is defined as a collection of related concepts and is represented as follows:

O = {C, E, R}

where
C: a collection of concepts, each representing words having similar meaning (similar to a SynSet in WordNet [9]);
E: a collection of edges that connect related concepts and attach instances to concepts;
R: a collection of semantic relations such as IsA and partOf.

The ontology is built following Uschold and King's methodology as stated in [10]. It provides a conceptual knowledge model of the football domain, which is used later in document and query annotation and in document ranking.
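To make the O = {C, E, R} definition concrete, a minimal sketch of one possible in-memory representation is shown below; the concept names, synonym lists and relations are illustrative assumptions, not entries of the actual knowledge base.

```python
# A minimal sketch of the O = {C, E, R} structure with illustrative
# football-domain entries (assumed, not the paper's actual knowledge base).

# C: concepts, each grouping words with similar meaning (SynSet-like)
C = {
    "NationalTeam": ["national team", "selected squad"],
    "Player": ["player", "footballer", "striker"],
    "Team": ["team", "club"],
}

# E: edges that connect related concepts and attach instances to concepts
E = [
    ("Ethiopian Women National Team", "instanceOf", "NationalTeam"),
    ("Getaneh Kebede", "instanceOf", "Player"),
]

# R: semantic relations between concepts, such as IsA and partOf
R = [
    ("NationalTeam", "IsA", "Team"),
    ("Player", "partOf", "Team"),
]

ontology = {"C": C, "E": E, "R": R}
print(len(ontology["C"]), "concepts,", len(E) + len(R), "edges/relations")
```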

Glossaries of terms are identified and further validated by domain experts to provide the list of domain terms with their full descriptions and synonyms. Figure 2 shows an extract of the domain ontology depicting concepts and the relationships between them. In this work, the ontology is represented in a triplet data format having a subject, a predicate and an object (i.e. a value), and it is stored in a Microsoft SQL Server database.
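As a rough illustration of this triplet representation, the snippet below stores a few (subject, predicate, object) rows in a relational table and runs one lookup; SQLite is used here only as a stand-in for the Microsoft SQL Server database of the actual system, and the sample triples are assumptions.

```python
import sqlite3

# Stand-in for the SQL Server triple store: one (subject, predicate, object) row per fact.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Ethiopian Women National Team", "instanceOf", "NationalTeam"),
        ("Ethiopian Women National Team", "hasSynonym", "Lucy"),
        ("NationalTeam", "IsA", "Team"),
    ],
)

# Example lookup: which concept(s) can the surface term "Lucy" refer to?
rows = conn.execute(
    "SELECT t2.object FROM triples t1 JOIN triples t2 ON t1.subject = t2.subject "
    "WHERE t1.predicate = 'hasSynonym' AND t1.object = ? AND t2.predicate = 'instanceOf'",
    ("Lucy",),
).fetchall()
print(rows)  # [('NationalTeam',)]
```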

Figure 1: General Framework of AmhS2Eng (components shown: Query Interface, Crawler, Concept Indexer, Query Processor, Parser, Language Identifier, Stemmer, Document Retriever, Document Annotator, Query Annotator, Weight Calculator, Ranker, Document Repository, Index, Ontology)

b) Populating the ontology

The ontology is populated with instances or terms of the football domain. Sport domain experts identify and approve the terms to be populated into the defined ontology. The ontology is populated and accessed using the Jena toolkit¹ and SPARQL² queries.

c) Calculating semantic similarity
Semantic query processing demands the use of the concepts, or similar concepts, that encompass each query term. One of the known approaches to measuring the semantic similarity between ontological terms or instances depends on the amount of information common to the concepts. In this work, the similarity between concepts is measured using the popular distance-based approach proposed by Wu and Palmer [11], denoted as:

SIM(o_1, o_2) = \frac{2D}{2D + len(o_1, C) + len(o_2, C)}

where
C: the least common ancestor of o_1 and o_2;
D: the distance between the root element and C;
len(o_1, C): the distance between o_1 and C (and likewise len(o_2, C)).
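The sketch below computes the Wu and Palmer score over a toy IsA taxonomy; the taxonomy, concept names and helper functions are assumptions used only to illustrate the formula, not the paper's implementation.

```python
# A minimal sketch of the Wu & Palmer distance-based similarity
# over a toy IsA taxonomy (illustrative, not the paper's ontology).
parent = {                      # child -> parent in the concept taxonomy
    "NationalTeam": "Team",
    "ClubTeam": "Team",
    "Team": "FootballConcept",
    "Player": "FootballConcept",
    "FootballConcept": None,    # root
}

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path                                 # node, ..., root

def wu_palmer(o1, o2):
    p1, p2 = path_to_root(o1), path_to_root(o2)
    lca = next(n for n in p1 if n in p2)        # least common ancestor C
    depth = len(path_to_root(lca)) - 1          # D: distance from root to C
    len1 = p1.index(lca)                        # len(o1, C)
    len2 = p2.index(lca)                        # len(o2, C)
    return (2 * depth) / (2 * depth + len1 + len2)

print(wu_palmer("NationalTeam", "ClubTeam"))    # LCA is Team -> 2/(2+1+1) = 0.5
print(wu_palmer("NationalTeam", "Player"))      # LCA is the root -> 0.0
```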

¹ https://jena.apache.org/ - a Java system for RDF manipulation, parsing RDF/XML and N3, persistent storage with SQL, and data stored natively as RDF.
² www.w3.org/TR/rdf-sparql-query/ - SPARQL is used to express queries over RDF data.

3.2 Crawler
The crawler component is multithreaded and is responsible for identifying Amharic web pages and their content. Generally, it:
- extracts URLs from a given page,
- fetches the web page pointed to by each URL,
- extracts the set of links from the fetched web page,
- ignores extracted links that have already been fetched.
The crawler uses the HTTP protocol and multiple threads that run concurrently to fetch pages. The fetched pages are then passed to Amharic language identification and stored in the Document repository.
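A rough sketch of such a multithreaded fetch loop is given below, assuming a placeholder seed URL, a crude Ethiopic-script test standing in for language identification, and an in-memory dictionary standing in for the document repository.

```python
import queue, re, threading, urllib.request

seen, repository = set(), {}                    # fetched URLs and the document repository
lock = threading.Lock()
frontier = queue.Queue()
frontier.put("http://www.example.et/")          # placeholder seed URL

def looks_amharic(text):
    # Crude language test: share of Ethiopic code points (U+1200-U+137F).
    ethiopic = sum(1 for ch in text if "\u1200" <= ch <= "\u137f")
    return ethiopic > 0.3 * max(len(text), 1)

def worker():
    while True:
        url = frontier.get()
        with lock:
            already = url in seen               # ignore already-fetched links
            seen.add(url)
        if not already:
            try:
                page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
                if looks_amharic(page):
                    with lock:
                        repository[url] = page  # keep only Amharic pages
                for link in re.findall(r'href="(https?://[^"]+)"', page):
                    frontier.put(link)          # extracted links go back to the frontier
            except Exception:
                pass                            # skip unreachable or malformed pages
        frontier.task_done()

for _ in range(8):                              # multiple threads fetch concurrently
    threading.Thread(target=worker, daemon=True).start()
frontier.join()
```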

Figure 2: The concept taxonomy

3.3 Concept Indexer
The Concept Indexer represents the concepts extracted from unstructured Amharic documents and associates each concept with a weight that shows its importance. It has the Document Annotator and the Indexer as its main components, as presented in the next sub-sections.

3.3.1 Document Annotator (DA) ³Ü+ï´ // rank [0-9]+( )*‡Č`//[0-9]+( )*point DA accepts a document as input, segment it into words, and map each word to semantic concept stored in the ontology. b) knowledge based concept identification In this work, word to concept mapping is done using the Even if the rule based approach is capable in identifying concepts, combination of predefine rules and looking for the best concept it is difficult to formulate patterns that detect all concepts of a in the ontology approaches. given domain. For instance, considering the concept ³kĒºx´/players, it is not easy to have a pattern that represents a) Rule based concept identification its instances which are person name LH ³—߇ ù0´ ³6ތ Rules are regular expressions/ patterns, having numerals and 3Úá´³øn‡ Ÿ[Ü´DQGetc). Thus, consulting a knowledge base literal values, defined by knowledge experts to represent that contains such information is crucial. concepts. Table 3-1 shows sample rules that extract the notion of Let us consider the following example to demonstrate the two ³‡Č`´SRLQW³¸ċp´UHVXOW³Æ0´ URXQG³Ü+ï´UDQNIURP approaches. textual corpus. Example1. Concept identification Table 3-1: Sample regular expressions for concept extraction Consider the title of a sport news article Concepts Pattern ³Õ16’¼ 6 Œp ՙpÛĜØ Ļ-Õ0 ùŒ ÕÜÜ]p œŠ Õ 5 —ČE Õ ‡¸ ‡Č` ú ³ ´ // point ³>-9]+( )* ´// ´>-@ *RDO´ øn‡ Ÿ[Ü [13 ú —Õ. ‡¸ł´ ³¸ċp´ // result ³[0-9]+( )* ( )*[0-9]+´ 7KHQHZVLVDERXW³'HGHELW¶VDQG/XF\¶VVWULNHU*HWDQHK.HEGH th //³[0-9]+( )*to( )*[0-9]+´ who is leading the 16 week Ethiopian primer league with 13 points´ ³Æ0´//round [0-9]+’(6 Œp|Æ0) The content is preprocessed, stemmed and stored in the Document //[0-9]th week* repository. The indexer uses the stemmed version of the article:

Let us consider the following example to demonstrate the two approaches.

Example 1. Concept identification
Consider the title of a sport news article whose English gloss is: "Dedebit's and Lucy's striker Getaneh Kebede is leading the 16th week of the Ethiopian premier league with 13 points." The content is preprocessed, stemmed and stored in the Document repository, and the indexer works on the stemmed version of the article. Applying the two concept identification approaches to the stemmed text, we get the following results:

- Using rules, the following concepts are identified: [16th Week]<Round>, [13 Goal]<Point>.
- Using the knowledge base, the following concepts are extracted: [Lucy]<Coach>, [Lucy = Ethiopian Women National Team]<National Team>, [Lucy = Birtukan Bekele]<Player>, [Dedebit]<Team>, [Ethiopia]<Country>, [Premiere League]<League>, [Getaneh Kebede].

Notice that all terms have exactly one interpretation except for "Lucy", which has three different interpretations: an instance of the concept Coach, a nickname (synonym) for the instance Birtukan Bekele, and a synonym for the instance Ethiopian Women National Team. So the sentence has three different forms of annotation:

1. [16th Week]<Round> [Ethiopia]<Country> [Premiere League]<League> [Dedebit]<Team> [Lucy]<Coach> striker [Getaneh Kebede] [13 Goal]<Point> leads
2. [16th Week]<Round> [Ethiopia]<Country> [Premiere League]<League> [Dedebit]<Team> [Lucy = Ethiopian Women National Team]<National Team> striker [Getaneh Kebede] [13 Goal]<Point> leads
3. [16th Week]<Round> [Ethiopia]<Country> [Premiere League]<League> [Dedebit]<Team> [Lucy = Birtukan Bekele]<Player> striker [Getaneh Kebede] [13 Goal]<Point> leads

3.3.2 Weighting
For each term, instance and concept extracted from a document, a weight determining its importance is computed. In this work, a modified version of TF-IDF is used to compute the weight, formalized as follows:

TFIDF_{t,d} = CF_{t,d} \times \log(IDF_t)

where:
- CF_{t,d} is the number of times the term t occurs in document d, either directly or through one of its instances i, computed as

  CF_{t,d} = Count(t) + \sum_{i \in t} Count(i)

- IDF_t is the inverse document frequency, i.e. the number of documents divided by the number of documents in which the term t occurs.

3.3.3 Ranking
Ranking is the process of determining which annotated text makes more sense than the others; each annotated text is ranked. The correlations among instances, among concepts, and between instances and concepts are defined taking into account their level of similarity, denoted SimInst, SimConc and SimInstConc respectively, and formalized as follows:

simInst(T) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{n} sim(I_i, I_k)}{n^2}

where:
- I is the set of instances identified from the text T;
- n is the number of instances in I;
- sim(I_i, I_k) is the similarity between instances I_i and I_k in I.

simConc(T) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{n} sim(C_i, C_k)}{n^2}

where:
- C is the set of concepts identified from the text T;
- n is the number of concepts in C;
- sim(C_i, C_k) is the similarity between concepts C_i and C_k in C.

simInstConc(T) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{m} sim(C_i, I_k)}{n \times m}

where:
- C is the set of concepts identified from the text T;
- I is the set of instances identified from the text T;
- n and m are the number of concepts in C and the number of instances in I, respectively;
- sim(C_i, I_k) is the similarity between concept C_i and instance I_k.

Rank(T) = \frac{simInst(T) + simConc(T) + simInstConc(T)}{3}

The total similarity is computed for each annotated text, and the texts are ranked according to their similarity values; the one ranked at the top is returned as the result of the document annotation process.
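The sketch below ties Sections 3.3.2 and 3.3.3 together on toy data: it computes the instance-aware CF/TF-IDF weight for one term and the Rank(T) score of one annotated text. The counts and the stand-in similarity function are assumptions; in the real system sim(.,.) would be the Wu and Palmer measure of Section 3.1.

```python
import math
from itertools import product

# --- 3.3.2 Weighting: instance-aware CF and TF-IDF (toy counts, assumed) ---
def cf(term_count, instance_counts):
    # CF_{t,d} = Count(t) + sum of Count(i) over instances i of t in document d
    return term_count + sum(instance_counts)

def tfidf(cf_td, n_docs, n_docs_with_t):
    # TFIDF_{t,d} = CF_{t,d} * log(IDF_t), with IDF_t = N / df_t
    return cf_td * math.log(n_docs / n_docs_with_t)

w = tfidf(cf(term_count=2, instance_counts=[3, 1]), n_docs=138, n_docs_with_t=12)
print(round(w, 2))            # weight of one concept in one document

# --- 3.3.3 Ranking one annotated text T (pairwise similarities assumed) ---
def mean_pairwise(xs, ys, sim):
    return sum(sim(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def rank(instances, concepts, sim):
    sim_inst = mean_pairwise(instances, instances, sim)   # simInst(T)
    sim_conc = mean_pairwise(concepts, concepts, sim)     # simConc(T)
    sim_ic = mean_pairwise(concepts, instances, sim)      # simInstConc(T)
    return (sim_inst + sim_conc + sim_ic) / 3

# Stand-in similarity: 1.0 for identical labels, 0.5 otherwise.
toy_sim = lambda a, b: 1.0 if a == b else 0.5
T_instances = ["Lucy", "Getaneh Kebede", "Dedebit"]
T_concepts = ["NationalTeam", "Player", "Team"]
print(round(rank(T_instances, T_concepts, toy_sim), 3))
```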

3.4 Query Processor
The Query Processor accepts a user query, annotates it using the semantic knowledge base, retrieves documents and ranks them.

a) Query Annotator (QA)
The QA annotates a user query using the predefined rules and the concepts in the ontology. Given a specific user query Q, the QA annotates it into several semantically complete and meaningful queries; it works on the same principles as the DA, the user query Q being treated as a query document. The annotated queries are ranked, the top-most query is evaluated automatically by the Query Processor, and the remaining queries are presented to the user.

b) Document Retrieval (DR)
The role of the DR is to find the list of documents relevant to the semantically annotated user query. Algorithm 1 shows the details of this component. The annotated query returned by the QA has three categories of terms: instances, concepts and words, which are parsed in lines 2, 3 and 4 respectively. Instances and concepts have references in the knowledge base (KB), the sport ontology. The DR looks for documents containing the instances, concepts and words in the index, along with their weights, as shown in lines 5-19.

Algorithm 1: Document Retrieval
Input:
  oQ: Query        // original user query
  KB: Ontology
Intermediate:
  D: Integer       // knowledge depth
  Q: Query         // annotated query
  InSet: Set       // set of instances
  ConSet: Set      // set of concepts
  Words: Set       // set of words
Output:
  DocSetR: Set     // set of documents with ranks
Begin
 1. Q := QA(oQ);                        // semantically annotate query oQ
 2. InSet := ParseInstances(Q);
 3. ConSet := ParseConcepts(Q);
 4. Words := ParseUnAnnotatedWords(Q);
 5. Foreach wd in (InSet U ConSet U Words)
 6.   For i = 0 to D
 7.     If (i == 0)
 8.       W := 1;
 9.       TFIDF := TFIDF(wd);
10.       DocSet := Documents(wd);
11.     Else
12.       W := SimWu&Palmer(wd, Ci);
13.       TFIDF := TFIDF(Ci);
14.       DocSet := Documents(Ci);
15.     End If
16.     Rank := Rank + W * TFIDF
17.     DocSetR := DocSetR.Add(DocSet, Rank)
18.   Next
19. Next
20. Return DocSetR
21. End

For each instance or concept, the weight used depends on the semantic similarity between the query term and the related concept Ci at depth i, and on the TF-IDF of that related concept, as shown in lines 6-16.
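A condensed Python rendering of the scoring loop in Algorithm 1 is sketched below; the index, the related() expansion, the similarity and TF-IDF tables, and the small usage example are all assumptions intended to make the idea concrete, not the system's actual code.

```python
# Condensed sketch of Algorithm 1's scoring loop (illustrative structures).
def retrieve(annotated_query_terms, index, related, sim, tfidf, depth=2):
    """annotated_query_terms: instances/concepts/words parsed from the query.
    index[c]     -> documents containing concept/term c
    related(c,i) -> concepts within i ontology hops of c (related(c, 0) == [c])
    sim(a, b)    -> Wu & Palmer similarity, tfidf[c] -> weight of c."""
    scores = {}
    for term in annotated_query_terms:
        for i in range(depth + 1):
            for c in related(term, i):
                w = 1.0 if i == 0 else sim(term, c)   # exact hit vs. related concept
                for doc in index.get(c, []):
                    scores[doc] = scores.get(doc, 0.0) + w * tfidf.get(c, 0.0)
    # highest-scoring documents first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = retrieve(
    ["Lucy", "Championship"],
    index={"Lucy": ["d1"], "NationalTeam": ["d1", "d2"]},
    related=lambda t, i: [t] if i == 0 else (["NationalTeam"] if t == "Lucy" else []),
    sim=lambda a, b: 0.8,
    tfidf={"Lucy": 2.0, "NationalTeam": 1.5},
)
print(docs)    # d1 ranked above d2
```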

4. EXPERIMENT AND EVALUATION
In order to validate the performance of our approach, we developed the semantic search engine AmhS2Eng and compared its results against those of a classical keyword-based Amharic search engine [5]. The evaluation uses the popular information retrieval relevance measures recall, precision and F-value [12]. The results of both AmhS2Eng and the classical IR system are compared against the list of documents returned by experts. In most research, expert judgments are considered to be correct and absolute. In this research, however, we argue that the initial expert judgments are very limited, and we advised the experts to refine their initial relevance judgments after seeing the results of the two search engines (following the relevance-feedback approach in IR).

In the refinement process, the experts went through the documents retrieved by both systems to determine whether they are relevant to the posed queries or not. Due to this refinement, the proposed search engine captured 8 relevant documents for 6 queries and the classical search engine 5 documents for 3 queries. This indicates that the initial relevance judgment made by the experts was limited and also that the proposed system returned more relevant documents than the classical IR system.

The precision, recall and F-values computed for 25 different queries before and after the refinement are presented in Figure 3, Figure 4 and Figure 5 respectively.

Figure 3: Precision graph (precision per query for AmhS2Eng and the classical IR system, against the original and the refined relevance judgments)
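For reference, the snippet below computes the three measures for a single query from its retrieved set and its expert-judged relevant set; the document identifiers are made up.

```python
def prf(retrieved, relevant):
    """Precision, recall and (harmonic) F-value for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_value

# Made-up judgment for one query: 5 documents returned, 2 relevant overall, 1 of them retrieved.
print(prf(retrieved=["d1", "d2", "d3", "d4", "d5"], relevant=["d2", "d7"]))
# -> (0.2, 0.5, 0.286) approximately
```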

Considering the precision and recall graphs shown in Figure 3 and Figure 4, the precision increased for six queries and the recall increased for two queries under the refined expert judgment. For example, one query initially had no relevant document in the judgment set; AmhS2Eng returned one document that the experts later judged relevant, so its precision increased from 0 to 0.2 and its recall from 0 to 1. Like AmhS2Eng, the classical IR system captured missed documents for three direct queries, so the precision of these three queries increased for the classical IR system as well. However, no change was observed in its recall for any of the queries even though the expert judgment was refined, and the classical IR system did not capture any of the missed documents for the indirect queries. This shows that, unlike the classical IR system, the proposed search engine is based on semantics rather than simple keywords.

Figure 4: Recall graph (recall per query for AmhS2Eng and the classical IR system, against the original and the refined relevance judgments)

Figure 4 shows that AmhS2Eng has the maximum recall value, 1, for all queries, even though it also returned some irrelevant documents. On the contrary, the classical IR system missed some of the relevant documents for several queries and missed all of them for a range of queries. For example, the query "the coach of bafana bafana" has 2 relevant documents according to the refined relevance information, but the classical IR system returned none of them because the term "bafana bafana" does not occur in the documents. In contrast, AmhS2Eng captured the relevant documents because "bafana bafana" has the same meaning as the term "South African national team".

In order to give a better view of the capability of both systems, the F-value is computed and shown in Figure 5. The F-values are computed with the intention of evaluating the results of the systems independently of the individual recall and precision values; they are computed for both systems and for the refined expert judgment. Figure 5 shows that the F-values of AmhS2Eng are greater than those of the classical IR system for 14 queries and equal for the remaining queries. This indicates that, for more than half of the queries, the result of AmhS2Eng is much closer to the expert judgment than that of the classical IR system. Besides, the F-values computed for AmhS2Eng using the refined relevance information are greater than the values computed using the original relevance information for some of the queries, i.e. the "F-value AmhS2Eng Refined" line is much closer to the refined expert judgment than the "F-value AmhS2Eng Original" line.

Figure 5: F-measure graph (F-value per query for AmhS2Eng and the classical IR system, against the original and the refined relevance judgments)

For the 25 user queries and the sets of documents retrieved by each system, the average recall, precision and F-measure values are computed and presented graphically in Figure 6; the averages are calculated over the number of queries. As illustrated in Figure 6, AmhS2Eng has better average values than the classical IR system for all three evaluation measures. In percentage terms, the proposed system achieves 100% recall, 43.2% precision and 53.68% F-measure, whereas the classical IR system achieves 64.56% recall, 29.6% precision and 36.04% F-measure.

Figure 6: A comparison of the two systems (average precision, recall and F-measure of AmhS2Eng and the classical IR system)

5. Discussion
Even though the recall of AmhS2Eng is 100%, its precision is 43.2%, which is less than 50%. This is due to the nature of the documents: the contents of almost all the news documents are different, yet they are published by the same publisher, ERTA (Ethiopian Radio and Television Agency), so it is rare for the news items to be similar to one another. Therefore, for most of the queries the number of relevant documents in the refined relevance information set is either 1 or 2 (only one query had more relevant documents), while the top documents returned by AmhS2Eng and the classical IR system are all taken into consideration. As a result, for queries that have only 1 or 2 relevant documents, the precision goes down because 3 or 4 of the retrieved documents are irrelevant (false positives). This indicates that if the document collection were somewhat different, the precision would be much better.

In addition, considering Figure 4, the recall values of both AmhS2Eng and the classical IR system go up and down as we move from the first query to the last. This happens because of the nature of the queries: all the queries are completely different from one another. The recall would have increased linearly if the queries were related to one another, i.e. if each query contained the previous one.

As mentioned in Section 4, the harmonic F-measure, which gives equal weight to recall and precision, was used to evaluate both AmhS2Eng and the classical IR system. It was chosen over a weighted F-measure because we cannot tell whether recall or precision is more important to users. In fact, it is likely that many users prefer recall to precision; in such a case, a weighted F-measure that counts recall twice as much as precision would be used for the evaluation. If that weighted F-measure had been used in this study instead of the harmonic F-measure, the averaged F-measure would have been much higher than the one reported in this work.

6. CONCLUSION
In this paper, a semantic search engine based on a domain ontology for Amharic text documents is presented. The goal of this study is to explore the advantages of an ontology in building a semantic search engine. The concepts and individuals in the document collection are identified using a sport knowledge base constructed manually by domain experts. These concepts and individuals are used as index terms to represent the documents.

The proposed search engine is tested based on relevance information provided by domain experts. Besides, the proposed system is compared with the classical IR system developed by Tessema [5]. To test the two systems, 138 football news articles and 25 queries are used. Precision, recall and F-measure are used to evaluate the performance of the systems. AmhS2Eng has better average recall, precision and F-measure values than the classical IR system.

7. ACKNOWLEDGMENTS
This research is sponsored by the Ministry of Information Communication and Technology (MCIT), Ethiopia. Our deepest gratitude goes to Mr. Tesfaye, Health and Physical Education expert in the Ministry of Education, for facilitating access to resources from the Ethiopian Football Federation and for his very helpful expert advice on the main parts of the work. Our appreciation also goes to Mr. Getachew Abebe and Zewudinesh Yirdaw, staff members of the Ethiopian Football Federation, for spending their valuable time responding to our questions and for providing the necessary data.

REFERENCES
[1] Parul Gupta and A. Sharma, "Context based Indexing in Search Engines using Ontology," International Journal of Computer Applications, vol. 1, no. 14, pp. 53-56, 2010.
[2] Jacob Kohler, Stephan Philippi, Michael Specht, and Alexander Ruegg, "Ontology based text indexing and querying for the semantic web," Knowledge-Based Systems, vol. 19, no. 8, pp. 744-754, 2006.
[3] Soner Kara, "An ontology-based retrieval system using semantic indexing," 2010.
[4] Wolf Leslau, Amharic Reference Grammar. ERIC Clearinghouse for Linguistics, Center for Applied Linguistics, Washington, D.C., 1969.
[5] Tessema Mindaye Mengistu and Solomon Atnafu, "Design and Implementation of Amharic Search Engine," in International IEEE Conference on Signal-Image Technologies and Internet-Based Systems (SITIS), 2009, pp. 318-325.
[6] Hassen Redwan, Tessema Mindaye, and Solomon Atnafu, "Enhanced Design of Amharic Search Engine (An Amharic Search Engine with Alias and Multi-character Set Support)," in AFRICON '09, Addis Ababa, 2009, pp. 1-6.
[7] Tewodros Hailemeskel Gebermariam, "Amharic Text Retrieval: An Experiment Using Latent Semantic Indexing (LSI) with Singular Value Decomposition (SVD)," Master's Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia, unpublished, 2003.
[8] Parul Gupta and A. Sharma, "Context based Indexing in Search Engines using Ontology," International Journal of Computer Applications, vol. 1, no. 14, pp. 53-56, 2010.
[9] George A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38, no. 11, pp. 39-41, November 1995.
[10] Oscar Corcho, Mariano Fernandez-Lopez, and Asuncion Gomez-Perez, "Methodologies, tools and languages for building ontologies. Where is their meeting point?," Data & Knowledge Engineering, vol. 46, no. 1, pp. 41-64, 2003.
[11] Zhibiao Wu and Martha Palmer, "Verb Semantics and Lexical Selection," in ACL, New Mexico, 1994, p. 133.
[12] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
