
Using Wikipedia and Wiktionary in Domain-Specific Information Retrieval

Christof Müller and Iryna Gurevych

Kang Ji

Wikipedia Mining seminar, conducted by Günter Neumann, WS 2009/2010, Saarland University

Statistical Information Retrieval

• Vector Space Model
• Term Distribution Models, e.g. Poisson distribution
• Latent Semantic Analysis: term usage characterized by term-document co-occurrence

Vector Space Model

• Every document is considered as a vector

• The vector contains the weights of the index terms as components

• With t index terms, the dimension of the vector space is also t

• Similarity of a query to a document is the correlation between their vectors

• Correlation is quantified by the cosine of the angle between the vectors

Vector Space Model (cont.)

• Similarity measure (cosine)

$$\cos(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \cdot \sqrt{\sum_{i=1}^{n} d_i^2}}$$

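To make this concrete, here is a minimal sketch of the cosine measure in Python; the example vectors are invented for illustration:

import math

def cosine(q, d):
    # dot product of the two weight vectors over the product of their norms
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# two documents scored against the same query
q = [0.5, 0.0, 1.0]
print(cosine(q, [0.4, 0.2, 0.9]))  # ~0.98: similar direction
print(cosine(q, [0.0, 1.0, 0.0]))  # 0.0: orthogonal vectors
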
TF and IDF

• term frequency, with freq_{i,j} the frequency with which term i occurs in document j:

$$tf_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$

• inverse document frequency, with n_i the number of documents that contain term i and N the total number of documents:

$$idf_i = \log \frac{N}{n_i}$$

• tf-idf weight: the product tf_{i,j} · idf_i (see the sketch below)

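A minimal sketch of these definitions, using a made-up three-document toy corpus:

import math
from collections import Counter

docs = [["wiki", "dictionary", "wiki"],
        ["semantic", "retrieval"],
        ["wiki", "retrieval"]]

def tf(term, doc):
    # frequency of the term in the doc, normalized by the doc's most frequent term
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, docs):
    # log of the total number of docs over the number of docs containing the term
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

def tf_idf(term, doc, docs):
    # tf-idf weight: the product of both factors
    return tf(term, doc) * idf(term, docs)

print(tf_idf("wiki", docs[0], docs))  # 1.0 * log(3/2) ~ 0.405
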
Motivation

• Consider synonymy, i.e. expressing one concept with different terms.
• Do we then have to issue multiple queries for a single concept?

Approaches

• Local methods: relevance and pseudo-relevance feedback
• Global methods: additional linguistic knowledge bases, e.g. thesauri built on the document collection, or WordNet
• SR models: retrieval models based on semantic relatedness (SR) between query and document terms, computed using linguistic knowledge bases

Expand the Documents (Local)

[Figure: sets of relevant and irrelevant documents among those already returned]

• Refine the representation of the user's information need using either manual or automatic feedback about already returned documents.

Expand the Query (Global)

• Recall is improved
• Precision is degraded

Combine the Query and Document (SR)

• promising results
• but the linguistic knowledge bases used for semantically enhanced IR have low coverage, especially of domain-specific vocabulary

• Hence, "collaborative knowledge bases" are introduced

Collaborative Knowledge Bases

• enabled by Web 2.0 technologies
• constructed by volunteers on the web
• have reached a size that makes them promising for improving IR performance

Wiktionary

• Wiktionary (a portmanteau of the words "wiki" and "dictionary") is a multilingual, web-based project to create a dictionary, available in over 151 languages. Unlike standard dictionaries, it is written collaboratively by volunteers, dubbed "Wiktionarians", using wiki software, allowing articles to be changed by almost anyone with access to the website.
• Wiktionary Homepage

Our SR Models on CLIR

• SR-Text
• SR-Word
• Article titles in Wikipedia are referred to as concepts, and the article texts as the textual representations of these concepts.

• Each word entry in Wiktionary is treated as a distinct concept, and the entry's information is used as the textual representation of the concept, analogous to the text of Wikipedia articles.

SR-Text

$$r_{SR}(q, d) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} tf(t_{d,i}, d) \cdot idf(t_{d,i}) \cdot tf(t_{q,j}, q) \cdot idf(t_{q,j})}{\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} \left(tf(t_{d,i}, d) \cdot idf(t_{d,i})\right)^2} \cdot \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} \left(tf(t_{q,j}, q) \cdot idf(t_{q,j})\right)^2}}$$

• The concept vector of a term consists of its tf values in the respective articles.
• Build the concept vectors for all terms of a document or query.
• Sum up the concept vectors after normalizing, scaling each by its tf-idf weight.

• Use the cosine of the angle between the two vectors as the relevance score (see the sketch below).
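
A rough sketch of this procedure in Python, assuming a tiny invented concept space of three Wikipedia articles and made-up tf-idf weights (none of the values are from the paper):

import math

# invented concept space: per term, its tf values in three Wikipedia articles
CONCEPT_VECTORS = {
    "wiki":       [0.9, 0.1, 0.0],
    "dictionary": [0.2, 0.8, 0.0],
    "retrieval":  [0.0, 0.1, 0.9],
}
TFIDF = {"wiki": 0.4, "dictionary": 0.7, "retrieval": 0.9}  # assumed weights

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def text_vector(terms):
    # sum of the normalized concept vectors, each scaled by the term's tf-idf weight
    acc = [0.0, 0.0, 0.0]
    for t in terms:
        cv = CONCEPT_VECTORS.get(t, [0.0, 0.0, 0.0])
        n = norm(cv)
        if n:
            acc = [a + TFIDF[t] * c / n for a, c in zip(acc, cv)]
    return acc

def sr_text(q_terms, d_terms):
    # relevance score: cosine between the aggregated query and document vectors
    qv, dv = text_vector(q_terms), text_vector(d_terms)
    denom = norm(qv) * norm(dv)
    return sum(a * b for a, b in zip(qv, dv)) / denom if denom else 0.0

print(sr_text(["dictionary"], ["wiki", "retrieval"]))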

SR-Word

$$r_{SR}(q, d) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} tf(t_{d,i}, d) \cdot idf(t_{d,i}) \cdot tf(t_{q,j}, q) \cdot idf(t_{q,j}) \cdot s(t_{d,i}, t_{q,j})}{(1 + n_{nsm}) \cdot (1 + n_{nr})}$$

• n_{nsm}: the number of unique query terms not literally found in the document

• n_{nr}: the number of unique query terms which do not contribute an SR score above a predefined threshold (see the sketch below)
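
A sketch of this scoring in Python; the relatedness function s and the single combined weight per term are invented for illustration (the paper computes the tf and idf factors separately):

def sr_word(q_terms, d_terms, weight, s, threshold=0.1):
    # numerator: weighted sum of pairwise relatedness scores
    score = sum(weight(td) * weight(tq) * s(td, tq)
                for td in d_terms for tq in q_terms)
    # n_nsm: unique query terms not literally found in the document
    n_nsm = len(set(q_terms) - set(d_terms))
    # n_nr: unique query terms without an SR contribution above the threshold
    n_nr = len({tq for tq in q_terms
                if all(s(td, tq) <= threshold for td in d_terms)})
    return score / ((1 + n_nsm) * (1 + n_nr))

# toy relatedness: 1 for identical terms, 0.3 for one known synonym pair
def s(t1, t2):
    if t1 == t2:
        return 1.0
    return 0.3 if {t1, t2} == {"car", "automobile"} else 0.0

print(sr_word(["car"], ["automobile", "engine"], weight=lambda t: 1.0, s=s))
# 0.3 / ((1 + 1) * (1 + 0)) = 0.15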

Combine the Models

• CombSUM: sum of the individual similarity scores
• Before summing, each model's scores are min-max normalized:

$$r_{norm} = \frac{r_{orig} - r_{min}}{r_{max} - r_{min}}$$

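A short sketch of this combination, with made-up per-document scores from the two models:

def min_max(scores):
    # min-max normalization of a score list to [0, 1]
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

# CombSUM: sum the normalized scores of the individual models per document
sr_text_scores = [0.8, 0.2, 0.5]   # invented scores for three documents
sr_word_scores = [0.1, 0.9, 0.4]
comb = [a + b for a, b in zip(min_max(sr_text_scores), min_max(sr_word_scores))]
print(comb)  # [1.0, 1.0, 0.875]
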
Evaluation

• We ran experiments with several query types, using different combinations of the topic fields.
• We found that retrieval effectiveness improved when query terms are weighted depending on the field in which they occur.
• We therefore use the following weights for query terms in all experiments: 1 for title (T), 0.8 for description (D), and 0.6 for narrative (N) (see the sketch below).

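A small sketch of this field-dependent weighting; the topic structure and the max rule for terms occurring in several fields are assumptions, only the three weights are from the slides:

FIELD_WEIGHTS = {"title": 1.0, "description": 0.8, "narrative": 0.6}

def weighted_query(topic):
    # collect query terms, weighted by the topic field they occur in
    weights = {}
    for field, text in topic.items():
        for term in text.lower().split():
            # assumption: keep the highest weight if a term occurs in several fields
            weights[term] = max(weights.get(term, 0.0), FIELD_WEIGHTS[field])
    return weights

topic = {"title": "wiki retrieval",
         "description": "retrieval using wiki dictionaries"}
print(weighted_query(topic))
# {'wiki': 1.0, 'retrieval': 1.0, 'using': 0.8, 'dictionaries': 0.8}
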
Monolingual Retrieval

• For English and German, we also performed experiments using either Wikipedia or Wiktionary separately as the knowledge base.

• For English, using only Wikipedia often performs better than using the combination of both knowledge bases. Using Wiktionary separately always performed worse than using Wikipedia or the combination of both.

• For German, the combination of Wikipedia and Wiktionary slightly improves the performance in most cases.

Bilingual Retrieval

• For the SR-Text model, the cross-language links (CLL) between language-specific editions of Wikipedia are used to map a concept vector represented by articles in the English Wikipedia to the concept vector represented by articles in the German version, without translating the query (see the sketch below).

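A sketch of this mapping idea, with an invented cross-language-link table (real CLLs come from Wikipedia's language links between article pairs):

# invented cross-language links: English article index -> German article index
CLL = {0: 2, 1: 0}   # e.g. en:"Car" -> de:"Automobil", en:"Engine" -> de:"Motor"

def map_concept_vector(en_vector, cll, de_dims):
    # transfer each English concept dimension to its linked German dimension
    de_vector = [0.0] * de_dims
    for en_idx, value in enumerate(en_vector):
        if en_idx in cll:
            de_vector[cll[en_idx]] = value   # dimensions without a link are dropped
    return de_vector

print(map_concept_vector([0.7, 0.2, 0.5], CLL, de_dims=3))  # [0.2, 0.0, 0.7]
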
Conclusion

• Integration of semantic knowledge from collaborative knowledge bases (the combination of Wikipedia and Wiktionary) into IR.
• Evaluated two IR models (SR-Text and SR-Word).
• Explored a different method for CLIR, using the cross-language links between different language editions of Wikipedia for the SR-Text model.

References

• Müller, C., Gurevych, I.: Using Wikipedia and Wiktionary in Domain-Specific Information Retrieval (2009)

• Zesch, T., Müller, C., Gurevych, I.: Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In: Proceedings of the Conference on Language Resources and Evaluation, LREC (2008)

• Tsatsaronis, G., Panagiotopoulou, V.: A Generalized Vector Space Model for Text Retrieval Based on Semantic Relatedness

Thank you for your attention!