
Simulating IBM Watson in the Classroom

Wlodek Zadrozny, Sean Gallagher, Walid Shalaby and Adarsh Avadhani
The University of North Carolina at Charlotte
{wzadrozn,sgalla19,wshalaby,amanjuna}@uncc.edu

ABSTRACT
IBM Watson exemplifies multiple innovations in natural language processing and question answering. In addition, Watson uses most of the known techniques in these two domains as well as many methods from related domains. Hence, there is pedagogical value in a rigorous understanding of its function. The paper provides the description of a text analytics course focused on building a simulator of IBM Watson, conducted in Spring 2014 at UNC Charlotte. We believe this is the first time a simulation containing all the major Watson components was created in a university classroom. The system achieved a respectable accuracy of close to 20% on Jeopardy! questions, and there remain many known and new avenues of improving performance that can be explored in the future. The code and documentation are available on GitHub. The paper is a joint effort of the teacher and some of the students who were leading teams implementing component technologies, and who were therefore deeply involved in making the class successful.

Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and Information Science Education, Curriculum

General Terms
Experimentation

Keywords
IBM Watson; Student Research; Question Answering; QA

1. INTRODUCTION: WATSON CLASS OBJECTIVES AND RESULTS
IBM Watson was a question answering machine which in 2011 won a Jeopardy! match against former winners Brad Rutter and Ken Jennings. Jeopardy! is a television show in which participants are tested on general knowledge. The 2011 match demonstrated the feasibility of real time question answering in open domains, and was hailed by some as the arrival of a new era of "cognitive computing" [22]. This new era starts with Watson as a machine that can "reprogram" itself using machine learning and knowledge it acquires during a competition.

Most of the Jeopardy! questions are "factoid" questions, but there are also others asking for quotations and various kinds of puzzles. Here are some examples of the questions, or clues (with their answers):

• Factoid: The Pekingese, Chihuahua, & …'s "…" fall under this classification of dogs: toys.

• Factoid: Don't be intimidated by the skewers; I'll use them on the marinated lamb to make this: shish kabob.

• Quote: When Ryan O'Neal apologizes to Barbra Streisand in "What's Up, Doc?" she quotes this line from "Love Story": "Love means never having to say you're sorry".

• Fill in the Blank: Lemon ___  Rain ___  Cough ___ : drop.

From the educational perspective, IBM Watson exemplifies multiple innovations in natural language processing and question answering. These include title-oriented organization of background knowledge and various ways of corpus processing and expansion [9][28], massive parallelism of semantic evaluation [10], lack of explicit translation from natural language into a "language of meaning" for the purpose of finding an answer to a clue [14], the use of search for candidate answer generation [12], and others [14].

In addition, Watson uses most of the known techniques in these two and in related domains [14][23][26][30]. In particular, Watson uses two search engines, various query transformation techniques, a parser (which also produces a logical form), machine learning, semi-structured databases (such as DBpedia), and multiple matching techniques such as "skip bi-grams", among many others.
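To make the last of these concrete: a "skip bi-gram" treats an ordered pair of words as a match even when a few other words intervene. The following minimal sketch (our illustration, not code from Watson or from our simulator) generates skip bi-grams and compares two token sequences by the overlap of the resulting sets.

    import java.util.*;

    // Illustrative skip bi-gram matcher (a sketch, not an exact implementation).
    // A skip bi-gram is an ordered pair of tokens that co-occur with at most
    // maxSkip other tokens between them; overlapping sets give a lexical score.
    public class SkipBigrams {
        public static Set<String> skipBigrams(List<String> tokens, int maxSkip) {
            Set<String> pairs = new HashSet<>();
            for (int i = 0; i < tokens.size(); i++) {
                for (int j = i + 1; j < tokens.size() && j <= i + 1 + maxSkip; j++) {
                    pairs.add(tokens.get(i) + " " + tokens.get(j));
                }
            }
            return pairs;
        }

        // Jaccard overlap of two skip bi-gram sets, e.g. of a clue and a passage.
        public static double overlap(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) return 0.0;
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return (double) inter.size() / union.size();
        }
    }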
Therefore there is pedagogical value in an in-depth understanding of how IBM Watson works. That is, a Watson class can be used to teach both about advances in text analytics and about standard methods of text understanding. For example, the use of both the Indri [15] and Lucene [25] search engines raises a natural question about their relative strengths and weaknesses. This question was answered by discussing in class the paper comparing them [29].

Figure 1: Our Watson simulator reproduced most of the original IBM Watson architecture. Light blue: elements with very similar functionality as IBM Watson, for example primary search. Partly blue are the elements only partly reconstructed; for example, we had only 20 evidence scorers compared to over 400 in the IBM system. Original image courtesy of Wikipedia: http://commons.wikimedia.org/wiki/File:DeepQA.svg
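As a rough sketch of how the stages of Figure 1 fit together, the pipeline can be read as the chain below. The interface and class names are ours, invented purely for illustration; they are neither IBM's code nor the exact classes of our simulator.

    import java.util.*;

    // A sketch of the Fig. 1 flow; all names here are hypothetical.
    class Passage { String title; String text; double engineScore; }
    class Answer  { String text; double[] features; double finalScore; }

    interface Searcher { List<Passage> search(String question, int hits); }  // Lucene, Indri, web search, ...
    interface Scorer   { double score(String question, Answer candidate); }  // n-gram overlap, engine score, ...
    interface Ranker   { List<Answer> rank(List<Answer> candidates); }       // e.g. a trained classifier

    class QAPipeline {
        List<Searcher> searchers;
        List<Scorer> scorers;
        Ranker ranker;

        List<Answer> answer(String question) {
            // 1. Primary search: every engine proposes supporting passages.
            List<Answer> candidates = new ArrayList<>();
            for (Searcher s : searchers) {
                for (Passage p : s.search(question, 50)) {
                    Answer a = new Answer();
                    a.text = p.title;   // title-oriented candidate generation
                    candidates.add(a);
                }
            }
            // 2. Evidence scoring: every scorer contributes one feature value.
            for (Answer a : candidates) {
                a.features = new double[scorers.size()];
                for (int i = 0; i < scorers.size(); i++) {
                    a.features[i] = scorers.get(i).score(question, a);
                }
            }
            // 3. Merging and ranking of the candidates, e.g. by a learned model.
            return ranker.rank(candidates);
        }
    }

Each box in Figure 1 then corresponds to one or more implementations of such stages, and most of the class work consisted of filling them in.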

Hands-on experience is perhaps the best way for students to gain a real understanding of how such a complex system works, and of what it takes to make an accurate question answering system. Furthermore, building a Watson simulator requires learning the component technologies, including data preparation, search, information extraction, part of speech tagging, parsing, and machine learning, among others. This paper provides the description of a text analytics class focused on building a simulator of IBM Watson.

Discussing computer science education, Patterson [27] remarked: "The challenge is preventing the large code base from overwhelming the students, interfering with learning the conceptual material of the course". We believe our class met this challenge.

Brief description of the class
IBM Watson remains proprietary, and therefore cannot be downloaded or used in the class directly. However, it has been described by IBM researchers in collections of papers [14][23][26][30] and patents [7][11]. In addition, Jeopardy! questions are freely available from j-archive.com.

Therefore, it is possible in principle for anyone to build a Watson simulator, or any question answering system containing components of the Watson architecture (although commercial uses might be protected by IBM patents).

Yet it is not the case that many systems are replicating the results of the Watson project. This comes as a surprise, given that around the time of Deep Blue many computer chess programs were challenging human chess champions [1][8]. Four years after the Watson – Jeopardy! game, no question answering program seems to approach Watson's performance. This comes in spite of all the technical and scientific innovation embodied in Watson. While we do not boast equivalent performance, we simulated all the major components and experimented with some additions in the course. We also added the feature of Internet search.

Challenges of building a classroom Watson simulator
Obviously, the bar for question answering performance set by IBM Watson is difficult to meet; all-time Jeopardy! champions are extremely accurate.

However, the level of required expertise was perhaps our biggest challenge. Watson was built by about 20 PhDs and a few experienced software engineers over 5 years, and started from an existing competitive question answering system. Members of the Watson team had at least several years of experience building NLP applications. In contrast, our class had one PhD student with some experience in machine learning, 8 Master's students, and 12 undergraduates with some exposure to artificial intelligence or symbolic programming. For almost all of them this was their first encounter with text analytics.

Moreover, Watson is a very large software system. IBM researcher Grady Booch reports that Watson has over 1 million lines of code (mostly in Java) [5]. It operates as an ensemble of integrated subsystems and pipelines rather than any singular, central component. Implementing and testing so many components takes time and labor.

What made the class possible
Three major factors made this class possible. As mentioned above, IBM researchers have published detailed descriptions of how Watson works. IBM also permitted us to use their presentations from a class taught at Columbia University a year earlier. Last but not least, the professor had a history with the Watson team.

Accomplishments of the class
We created a simulation that contains all of the major elements of Watson (shown in Fig. 1). By the end of the semester, the performance of our simulator on Jeopardy! questions was approximately 20% accuracy, compared with Watson's 70% accuracy [14].
Obviously, the objective of the class wasn't to achieve Watson's performance, but rather to see what can be accomplished in one semester. We hope that our experience, and the resources we are releasing, will inspire others to do better. We note, though, that our accuracy is higher than that of the baseline QA system the Watson team started with [14].

Table 1: The gradesheet at the end of the semester (# indicates a team lead). Rows list the students (student1, student2, ..., studentx+2); columns list the component technologies (question classification, query translation, ..., machine learning, git, integration, passage retrieval, data preparation, Indri, Lucene, Bing) and a commit count. A checkmark indicates that the student mastered the corresponding technology.

In the months since the class ended, several changes have been made, including several new scorers and other features not available in the simulator at the end of the semester.

Our code and documentation are available on GitHub, freely accessible to anyone wanting to experiment with them. In the course of creating the software, the students learned about engineering a relatively complex system. They all acquired semantic knowledge in combination with hands-on experience in local and Internet search, NLP, and various aspects of text analytics. For many of them, it was a first encounter with version control (specifically git and GitHub). Some were also afforded career opportunities: one MS student was hired by the IBM Watson team, and another has an offer pending.

Comparison with prior work
The authors are aware of only two other examples of a Watson-focused class: one at Columbia University, as reported by Gliozzo in 2013 [18], and another at Rensselaer Polytechnic Institute (RPI), as reported by Hendler in 2013 [20] and in 2014 [21]. The Columbia class focused on building Watson-inspired tools that could be used for other tasks. The two sets of Hendler's slides discuss the RPI system, which recreated portions of the Watson data pipeline; however, no accuracy results are reported.

Our work differs from these two by attempting to completely recreate the original Watson pipeline, and by measuring the correctness of the resulting system. In addition, we see the opportunity to add components and further improve the question answering accuracy.

We experimented with two local search engines (Lucene and Indri) and two Internet search interfaces (Google and Bing). There were two objectives behind these experiments. Firstly, a large part of the class focused on information retrieval and included comparisons between query languages, corpus content and models for each of the engines. Secondly, adding web search to the system was an easy way of introducing additional scorers and experimenting with machine learning. The use of Internet search as primary search was reported by Hendler in 2013 [20], and again in 2014 [21].

In contrast with Gliozzo et al. [18], we introduced the Apache Unstructured Information Management Architecture (UIMA) [13] late, and the system we built is not UIMA-based. By skipping UIMA we gained 2-3 weeks of development time. However, we did use one of the student teams to create a UIMA version of our system and report on it in class. This provided a minimal exposure to UIMA for all students.

Recently, Wollowski reported using Watson training data (for machine learning models) in class [31]. However, this effort was an exercise in data mining and did not attempt to recreate any elements of the Watson architecture.

Based on this discussion, we believe we have created a unique methodology for teaching a Watson-focused text processing class. The artifacts are also available for others to use and improve.

2. ORGANIZATION OF THE CLASS
The class time was approximately evenly divided between lectures and system building. The lectures used a collection of slides provided by IBM, which were previously used at Columbia University. The system building was done in teams and sub-teams: data preparation, search, machine learning, scoring, and the integration of the components. Team membership changed throughout the class, although two teams (integration and machine learning) had more stability than others (such as data preparation). Almost all homework was about creating software or data components. The IBM slides were supplemented by additional material from the IBM Journal of Research and Development [3] necessary for understanding question classification and question preprocessing. The students had access to the whole collection, and some used it to gain a deeper understanding of the material.

Throughout the semester the students gave short presentations intended to (a) show the progress of their teams and (b) teach others how to use the technologies. For example, a student could present the basics of git version control and the use of GitHub; afterwards, other students were able to contribute to the project directly through GitHub. Teaching material was also presented for maximal impact on the practicum. We started with an introduction to Watson, and proceeded with lectures on information retrieval. When system building was progressing more slowly, short lectures addressed other aspects of Watson, such as its existing and potential applications.

Using Weka [19] allowed a relatively simple way to replace one machine learning algorithm with another. In these experiments logistic regression was indicated to be the best choice; Naive Bayes classifiers performed very poorly. Our later results showed that Weka's default implementation of the multilayer perceptron was marginally more accurate.
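The following sketch shows the kind of one-line algorithm swap Weka makes possible. The ARFF file name and feature layout are hypothetical; the Weka classes and calls are standard.

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ScorerModelExperiment {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF file: one row per candidate answer, one numeric
            // attribute per evidence scorer, and a nominal class (correct/incorrect).
            Instances data = DataSource.read("candidate-features.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Swapping the learning algorithm is a one-line change.
            Classifier[] models = { new Logistic(), new NaiveBayes(), new MultilayerPerceptron() };
            for (Classifier model : models) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new java.util.Random(1));
                System.out.println(model.getClass().getSimpleName()
                        + " accuracy: " + (1.0 - eval.errorRate()));
            }
        }
    }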
The students were graded according to their mastery of the majority of component technologies. We tabulated this according to a matrix of checkmarks relating students to their mastered technologies, as shown in Table 1. For example, mastery of 80% of the subject technologies translated into a "B" grade; an "A" required substantial contribution throughout the course. A monitoring process was established early in the class. The project leads were monitoring the progress of students in their teams, and discussing it with the professor during office hours. Extensive tutoring, mostly by other students, was available and took place throughout the course. Even the exam time was used for additional testing and integration.

Overall, the class discussed in detail most of the major elements of the Watson architecture. It was unfeasible to cover all of Watson in detail. In particular, deeper natural language processing using frameworks similar to OpenNLP wasn't given enough attention. In a future class, we plan to give more time to NLP, at the expense of one of the search engines. We will also consider UIMA or another text processing framework (e.g., GATE [16]). Student opinions were divided on whether we should have based the development on UIMA.

3. TECHNICAL APPROACH
The team clearly did not have the time or expertise to replicate the work of the original Watson team in one semester. The general model of development was to begin with a minimal working subset: only the functionality necessary to answer a question. To complete that, a data source, an index and a search engine were necessary, and the structure of the framework was basically to query the engine. The goal was to complete as much of the known architecture of the Watson project as quickly as possible and have a measurable setup at the end. This included passage retrieval, scoring based on evidence, and specialized pipelines based on question types.

Below we describe selected, perhaps most interesting, elements of the technical approach; the details can be found in [17]. We discuss data resources and how they were used, query generation, and supporting evidence scoring. In the next section we briefly discuss performance evaluation.

Data sources
Data collection is a major part of the question answering process, and occupied a large fraction of the effort expended by the team. The greatest focus was on retrieving, cleaning, processing and indexing Wikipedia materials. These included the full texts of all Wikipedia articles, page view statistics for 100 days of Wikipedia traffic, the full text of Wikiquotes, and the works of Shakespeare.

Unlike IBM Watson, the course project was not limited to offline data sources. As a result, we also used web search engines such as Google and Bing. We encountered some problems, most notably query quotas, which made mass searches of the magnitude necessary for passage retrieval impractical. Coordinating dozens of students to get a sample of queries on a regular basis is mostly unproductive. The situation with Bing was less restrictive than with Google, and therefore Bing was used in measuring performance.
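A minimal sketch of the local (primary) search path over such an index, in the Lucene 4.x style documented in [25], looks as follows. The index path and field names are illustrative, not our exact schema.

    import java.io.File;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PrimarySearch {
        public static void main(String[] args) throws Exception {
            // "wiki-index" and the "title"/"text" fields are placeholder names for
            // an index built from the cleaned Wikipedia dump described above.
            IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("wiki-index")));
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser(Version.LUCENE_4_9, "text",
                    new EnglishAnalyzer(Version.LUCENE_4_9));

            String clue = "The Pekingese and the Chihuahua fall under this classification of dogs";
            TopDocs hits = searcher.search(parser.parse(QueryParser.escape(clue)), 20);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                // The article title acts as a candidate answer; the Lucene
                // relevance score becomes one feature for evidence scoring.
                System.out.println(doc.get("title") + "\t" + sd.score);
            }
            reader.close();
        }
    }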
Clue Analysis and Search Query Generation
In our system there are three major question analysis pipelines, covering factoid questions, fill-in-the-blank questions, and quotations. The selection of which pipeline a question would follow was made according to its category (factoid, quotations, common bonds, ...), of which there was a small predefined set; Lally et al. [23] list twelve of them, and then add a few more in a discussion. The majority of the questions were factoids, so we focused on them. Note that question categories or classes as discussed here and in [23] are not the same as Jeopardy! categories or topics, which are unlimited.

Evidence Scoring
Supporting passages receive a number of scores, which are intended to indicate how much evidence the passage gives toward the correctness of the answer for which it was retrieved. Some scores are taken from the original search engines; this is the case for both offline search engines, Lucene and Indri. Lucene bases its score on Term Frequency-Inverse Document Frequency (TF-IDF) and the Vector Space Model, whereas Indri scores are based on Bayesian inference networks [15][25].

Passages are also subjected to a number of n-gram comparisons. This suite of related scorers provided the strongest boost to ranking found in the project.

One additional scorer takes an approach similar to the n-gram models, but in place of n-grams it searches for the number of common phrase subtrees taken from the OpenNLP dependency parser. Notably, this scorer applies to phrases of any length, and common phrases have weight approximately proportional to the square of their length, because each common smaller phrase will also receive another count.
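A minimal sketch of an n-gram overlap scorer of this general kind is given below. This is our simplification of the idea, not the project's exact scorers; it measures the fraction of the clue's word n-grams that also occur in the supporting passage.

    import java.util.*;

    // Simplified n-gram overlap scorer (a sketch, for illustration only).
    public class NGramOverlapScorer {
        private final int n;

        public NGramOverlapScorer(int n) { this.n = n; }

        private Set<String> ngrams(String text) {
            String[] words = text.toLowerCase().split("\\W+");
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder gram = new StringBuilder(words[i]);
                for (int k = 1; k < n; k++) gram.append(' ').append(words[i + k]);
                grams.add(gram.toString());
            }
            return grams;
        }

        // Fraction of clue n-grams found in the passage; one feature per n.
        public double score(String clue, String passage) {
            Set<String> clueGrams = ngrams(clue);
            if (clueGrams.isEmpty()) return 0.0;
            Set<String> common = new HashSet<>(clueGrams);
            common.retainAll(ngrams(passage));
            return (double) common.size() / clueGrams.size();
        }
    }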
4. PERFORMANCE EVALUATION AND STUDENT RESEARCH
We viewed frequent evaluation of system performance as a necessary step to see what works and what doesn't. As shown in Fig. 2, our progress wasn't smooth. In many cases, usually because of integration errors, the system did not work at all. As mentioned earlier, the final version achieved an accuracy of slightly less than 20% (which is higher than Watson's baseline system in 2007). Our MRR (explained below) was 0.5, which means that on average the correct answer was in the second position in the final list of scored candidate answers (if a correct answer existed in a list of 70 candidates).

Figure 2: During the class we ran multiple experiments and measured the results. This figure shows the changes in the Mean Reciprocal Rank for different configurations, collections of data sources, and sets of questions. The final MRR from May 2014 was measured on a representative sample of 1000 Jeopardy! questions. As expected, most of them were factoid questions.

We did not have access to the comprehensive suite of testing tools employed by the Watson team [6], and given the amount of effort that would be required, we made no attempt to recreate it. Instead, the performance tests for the Watson simulator were created using a JUnit test that takes a set of Jeopardy! questions with known correct answers, and collects statistics on the resulting ranked candidate answer lists. These data were tabulated and uploaded online. The results of our final runs are shown in Table 2. The final tests used Bing search. Since our candidate answer normalization and merging capabilities were limited (see Fig. 1), we did not require a perfect match with the answers from J-archive.com. Candidate answers were marked as correct or incorrect based on whether the candidate and the known correct answer contained the same unordered set of words after filtering using the Lucene EnglishAnalyzer. Obviously, this mostly did not matter when candidate answers were taken from Wikipedia titles (as in IBM Watson), but possibly made a difference in some other situations.

Table 2: Results using all sources except Google
  Recall of Rank 1                      187 (18.70%)
  Recall in Top 3                       278 (27.80%)
  Recall in Full Candidate Answer Set   497 (49.70%)
  MRR                                   0.5072
  Total Questions                       1000
  Total Candidate Answers               46881
  Total Runtime (seconds)               23562.1203

In addition to assessing the performance of the whole system, we evaluated the performance of various sets of scoring programs. The data in Fig. 2 cover the Spring 2014 semester and one additional experiment at the end of May. In these experiments the Mean Reciprocal Rank was used to indicate scorer quality. For every question for which there was at least one correct answer, the reciprocal rank of the first correct answer was recorded. Then the mean was taken of this list of reciprocals.
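A condensed sketch of this evaluation logic, covering the unordered-word-set match via the Lucene EnglishAnalyzer and the recall/MRR bookkeeping, is shown below. It is our approximation of the JUnit harness, and the input lists are hypothetical.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.*;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnswerEvaluation {
        static final EnglishAnalyzer ANALYZER = new EnglishAnalyzer(Version.LUCENE_4_9);

        // Unordered set of filtered tokens, as used for the correctness check.
        static Set<String> tokenSet(String text) throws IOException {
            Set<String> terms = new HashSet<>();
            TokenStream ts = ANALYZER.tokenStream("answer", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) terms.add(term.toString());
            ts.end();
            ts.close();
            return terms;
        }

        static boolean correct(String candidate, String goldAnswer) throws IOException {
            return tokenSet(candidate).equals(tokenSet(goldAnswer));
        }

        // ranked: one ranked candidate list per question; gold: the J-archive
        // answers for the same questions (hypothetical inputs for illustration).
        static void report(List<List<String>> ranked, List<String> gold) throws IOException {
            int top1 = 0, top3 = 0, answered = 0;
            double reciprocalSum = 0.0;
            for (int q = 0; q < gold.size(); q++) {
                List<String> answers = ranked.get(q);
                for (int rank = 0; rank < answers.size(); rank++) {
                    if (correct(answers.get(rank), gold.get(q))) {
                        if (rank == 0) top1++;
                        if (rank < 3) top3++;
                        answered++;
                        reciprocalSum += 1.0 / (rank + 1); // reciprocal rank of first correct answer
                        break;
                    }
                }
            }
            // MRR is averaged over questions with at least one correct answer, as described above.
            double mrr = answered == 0 ? 0.0 : reciprocalSum / answered;
            System.out.printf("Recall@1=%d  Recall@3=%d  MRR=%.4f%n", top1, top3, mrr);
        }
    }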
Student research: The class provided opportunities for student research, including finding answers to interesting open questions. Clearly, building a Watson simulator was novel. But there were other specific questions answered in experiments. For example, we investigated the value of Internet search relative to local search: when trained and used on only the local dataset, the top candidate answer was correct for 12% of questions, and the second or third candidate answers were correct for an additional 2% of questions. When the same system was retrained and used with the Bing Internet search engine, we obtained 18.7% accuracy for the first candidate answer, with correct answers in the second or third ranks in an additional 10% of questions. Thus Internet search improved both scores of correct answers and contributed to better recall.

5. CONCLUSION AND NEXT STEPS
We demonstrated the feasibility of creating a class focused on a very advanced project, and of meeting Patterson's [27] challenge of avoiding "the large code base (...) interfering with learning the conceptual material of the course". The results of the class – in terms, e.g., of produced code, the final accuracy, and the job prospects of the students – seem to prove this point.

We believe there is value in students attempting large and ambitious projects. It removes a lot of mystery from big accomplishments of computer science, and it is a great learning experience that also allows the students to gain confidence and ambition. Obviously it is too early to tell whether this change of attitudes will persist throughout their careers.

As in every project, a course focused on a single project has positive and negative aspects. Combining the output of groups of students makes it possible to achieve a much larger goal; no group of two or three students could have completed this project in this timeline. However, when teams depend on each other's work, it is possible for some students to stall. Elastic group membership was conducive to mitigating this problem. In our case, membership changes were enforced by the grading system represented in Table 1.

We believe, based on student opinions, that to maximize the quality of the resulting project, development should start as early as possible. This is one reason why it is not clear that UIMA is a significant advantage. A steeper learning curve may cost more than a lack of scalability. On the other hand, should someone try to build a more accurate system based on our software, making UIMA part of the class should be considered.

We also see an opportunity to continue this project, for example as part of an intelligent systems class where other text analytic methods could be introduced (as scoring modules), and various knowledge representation and logical methods could be added as well. This would also produce an opportunity to create research projects for graduate students, and to explore some of the avenues that the IBM Watson project did not. To give some examples, available projects which would complement text analytics projects such as ours include JoBimText, with its unique distributional semantic approach to knowledge management [4], Word2Vec (to assist in measuring relationships between phrases) [24], and textual entailment modules such as the Excitement Open Platform [2]. These only scratch the surface, and many would not be very difficult to integrate.

6. ACKNOWLEDGMENTS
We would like to acknowledge IBM for their contribution of lecture material and slides, as well as the substantial contributions to our project by the following students: Chris Gibson, Dhaval Patel, Elliot Mersch, Jagan Vujjini, Jonathan Shuman, Ken Overholt, Phani Rahul, and Varsha Devadas. We also thank the SIGCSE referees for their comments and suggestions. Several of the suggested changes will be incorporated in an expanded version of this paper.

7. REFERENCES
[1] Deep Blue (chess computer). http://en.wikipedia.org/wiki/Deep_Blue_(chess_computer).
[2] Excitement Open Platform. http://hltfbk.github.io/Excitement-Open-Platform/.
[3] IBM Journal of Research and Development, 56(3.4), 2012.
[4] C. Biemann et al. JoBimText project. Retrieved September 03, 2014 from http://sourceforge.net/projects/jobimtext/, 2014.
[5] G. Booch. Watson under the hood: Booch describes Jeopardy champ's system's architecture. Retrieved September 03, 2014 from http://searchsoa.techtarget.com/news/2240038723/Watson-under-the-hood-Booch-describes-Jeopardy-champs-systems-architecture.
[6] E. Brown, E. Epstein, J. W. Murdock, and T.-H. Fin. Tools and methods for building Watson. IBM Research Report RC 25356.
[7] E. Brown, D. Ferrucci, A. Lally, and W. W. Zadrozny. System and method for providing answers to questions, September 2012. US Patent 8,275,803.
[8] M. Campbell, A. J. Hoane Jr., and F. Hsu. Deep Blue. Artificial Intelligence, 134(1):57–83, 2002.
[9] J. Chu-Carroll, J. Fan, N. Schlaefer, and W. Zadrozny. Textual resource acquisition and engineering. IBM Journal of Research and Development, 56(3.4):4:1–4:11, May 2012.
[10] E. A. Epstein, M. I. Schor, B. S. Iyer, A. Lally, E. Brown, and J. Cwiklik. Making Watson fast. IBM Journal of Research and Development, 56(3.4):15:1–15:12, May 2012.
[11] J. Fan, D. Ferrucci, D. C. Gondek, and W. W. Zadrozny. System and method for providing questions and answers with deferred type evaluation, December 2012. US Patent 8,332,394.
[12] D. Ferrucci et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3), 2010.
[13] D. Ferrucci and A. Lally. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348, 2004.
[14] D. A. Ferrucci et al. Introduction to "This is Watson". IBM Journal of Research and Development, 56(3), 2012.
[15] D. Fisher. Indri retrieval model. Retrieved September 03, 2014 from http://sourceforge.net/p/lemur/wiki/Indri Retrieval Model/, 2012.
[16] R. Gaizauskas, H. Cunningham, Y. Wilks, P. Rodgers, and K. Humphreys. GATE: An environment to support research and development in natural language engineering. In Proceedings of the Eighth IEEE International Conference on Tools with Artificial Intelligence, pages 58–66. IEEE, 1996.
[17] S. Gallagher, W. Zadrozny, W. Shalaby, and A. Avadhani. Watsonsim: Overview of a question answering engine. arxiv.org, 2014.
[18] A. Gliozzo, O. Biran, S. Patwardhan, and K. McKeown. Semantic technologies in IBM Watson. ACL 2013, page 85, http://www.aclweb.org/anthology/W13-3413, 2013.
[19] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[20] J. Hendler. Watson at RPI. Retrieved September 03, 2014 from http://www.slideshare.net/jahendler/watson-summer-review82013final.
[21] J. Hendler and S. Ells. Retrieved September 03, 2014 from http://www.slideshare.net/jahendler/why-watson-won-a-cognitive-perspective?next_slideshow=1.
[22] IBM. IBM Research: Cognitive Computing. Retrieved September 03, 2014 from http://www.research.ibm.com/cognitive-computing/.
[23] A. Lally et al. Question analysis: How Watson reads a clue. IBM Journal of Research and Development, 56(3/4), 2012.
[24] T. Mikolov et al. Word2Vec project. Retrieved September 03, 2014 from http://code.google.com/p/word2vec/, 2014.
[25] R. Muir et al. Lucene Similarity class reference. Retrieved September 03, 2014 from http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html, 2014.
[26] J. W. Murdock et al. Textual evidence gathering and analysis. IBM Journal of Research and Development, 56(3.4), 2012.
[27] D. A. Patterson. Computer science education in the 21st century. Communications of the ACM, 49(3):27–30, March 2006.
[28] N. Schlaefer, J. Chu-Carroll, E. Nyberg, J. Fan, W. Zadrozny, and D. Ferrucci. Statistical source expansion for question answering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 345–354, New York, NY, USA, 2011. ACM.
[29] A. Trotman, C. L. Clarke, I. Ounis, S. Culpepper, M.-A. Cartright, and S. Geva. Open source information retrieval: A report on the SIGIR 2012 workshop. SIGIR Forum, 46(2):95–101, 2012.
[30] C. Wang et al. Relation extraction and scoring in DeepQA. IBM Journal of Research and Development, 56(3/4), 2012.
[31] M. Wollowski. Teaching with Watson. Fifth AAAI Symposium on Educational Advances in Artificial Intelligence, http://www.aaai.org/ocs/index.php/EAAI/EAAI14/paper/view/8584/8677, 2014.