Which is the Effective Way for : Information Retrieval or Neural Networks? 1 1,2 1 Shangmin Guo† , Xiangrong Zeng† , Shizhu He , Kang Liu1 and Jun Zhao1

1National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, , 100190, 2University of Chinese Academy of Sciences, Beijing, 10049, China shangmin.guo, xiangrong.zeng @nlpr.ia.ac.cn { shizhu.he, kliu, jzhao @nlpr.ia.ac.cn} { }

After the World War II, U.S. and Soviet Union are fighting against each other in politics, economics and military. To promote the development of economics in Socialist Countries, Abstract Soviet Union establish The Council for Mutual Economic Assistance. This is against A. Truman Doctrine B. Marshall Plan C. NATO D. Federal Republic of Germany As one of the most important test of China, Entity Question Gaokao is designed to be difficult enough From Qin and Han Dynasties to Ming Dynasty, businessmen are always at the bottom of to distinguish the excellent high school hierarchy. One reason for this is that the ruling class thought the businessmen A. are not engaged in production B. do not respect Confucianism students. In this work, we detailed the C. do not respect the clan D. do not pay tax Gaokao Multiple Choice Ques- Sentence Question

tions(GKHMC) and proposed two differ- ent approaches to address them using var- ious resources. One approach is based on Figure 1: Examples of questions and their types. entity search technique (IR approach), the The upper one is an entity question. The lower other is based on text entailment approach one is a sentence question. where we specifically employ deep neu- ral networks(NN approach). The result of questions and essays and it covers several dif- experiment on our collected real Gaokao ferent subjects, like Chinese, Math, History and questions showed that they are good at etc. In this work, we focus on Gaokao History different categories of questions, i.e. IR Multiple Choice questions which is denoted as approach performs much better at entity GKHMC. Both of the factoid question answering questions(EQs) while NN approach shows task and reading comprehension task are similar to its advantage on sentence questions(SQs). GKHMC. But, the GKHMC questions have their Our new method achieves state-of-the-art own characteristics. performance and show that it’s indispens- A multiple-choice question in GKHMC such able to apply hybrid method when partici- as the examples shown in Figure 1 is composed pating in the real-world tests. of a question stem and four candidates. Our goal is to figure out the only one correct candi- 1 Introduction date. But, there are certain obstacles to achieve it. First, several background sentencess and a lead-in Gaokao, namely the National College Entrance sentence conjointly constitutes the question stem, Examination, is the most important examination which makes these questions more complicated for Chinese senior high school students. Ev- than former one-sentence-long factoid questions ery college in China, no matter it is Top10 or that can be handled by the existing approaches, Top100, would only accept the exam-takers whose like (Kolomiyet and Moens, 2011; Kwiatkowski Gaokao score is than its threshold score. et al., 2013; Berant and Liang, 2014; Yih et al., As there are almost 10 million students take the 2015). Secondly, the background sentences gener- examination every year, Gaokao needs to be dif- ally contain various clues to figure out the histori- ficult enough to distinguish the excellent students. cal events or personages which may be the perdue Therefore, it includes various types of questions key to answer the question. These clues may in- such as multiple-choice questions, short-answer clude Tang poem and Song iambic verse, domain- † Both of the two authors contributed equally to this paper. specific expressions, even some mixture of mod-

111 Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 111–120, Valencia, Spain, April 3-7, 2017. c 2017 Association for Computational Linguistics Question Type Candidate Type Count from questions to form a query, then inquire this EQ Entities 160 SQ Sentence 584 query in all the text resources to get the most rele- ALL Whatever 744 vant candidate. In NN approach, we take the ques- tion text and every candidate to form four state- Table 1: The GKHMC dataset. ments respectively, then judge how possible every statement is right so that we can figure out which ern Chinese and excerpt from ancient books and is most likely to be the correct answer. etc. The dependence of background knowledge To test the two approaches’ performance, we makes the models that are designed for read- collected and classified the multiple-choice ques- ing comprehension such as (Penas˜ et al., 2013; tions in Gaokao test papers from 2011 to 2015 all Richardson et al., 2013) fail. Thirdly, the diver- over the country, and they are released. From the sity of candidates’ granularity, i.e. candidates can result, we find that the performance of two ap- either be entities or sentences, makes it harder proaches are significantly discrepant at each kind to match the candidate and stem. So, the an- of questions. That is, IR approach shows notice- swer selection is disparate from the former ap- able advantages on EQs, while NN approach per- proaches whose candidates are usually just enti- forms much better on SQs. This will be further ties. Lastly, as the candidates are already given, discussed in Section 4.4. the answer generation step in former neural net- In this paper, our contributions are as follows: work approaches based question answering sys- We gave a detailed description of the Gaokao • tem is no longer necessary. History Multiple Choice Questions task and As mentioned above and shown in Figure showed its importance and difficulty. 1, in accordance with candidates’ granularity, the GKHMC questions can be divided into two We released a dataset1 for this task. The • types: entity questions(EQs) and sentence ques- dataset is manually collected and classified. tions(SQs). Entity questions are those whose can- All questions in the dataset are real Gaokao didates are all entities, no matter they are peo- quesitons from 2011 to 2015. ple, dynasties, warfares or something else. And, We introduced two different approaches for sentence questions are those whose candidates are • all sentences. We observe that such two types of this task. Each approach achieved a promis- questions have their own specific characteristics. ing results. We also compared this two ap- Most of background sentences in EQs are descrip- proaches and found that they are complemen- tion of the right candidate, so it may be partic- tary, i.e. they are good at different types of ularly suitable to apply information retrieval like questions. approach to handle them. Meanwhile, as the back- We introduced permanent provisional mem- • ground sentences and lead-in sentences in SQs are ory network(PPMN) to model the joint back- more like the entailing text, these questions aren’t ground knowledge and sentences in question appropriate to be addressed by lexically searching stem, and it beats existing memory networks and matching. Therefore, it seems that it’s more on SQs. resonable to resolve SQs by using textual reason- ing techniques. 2 Dataset In this paper, we wonder about which kind of As described in the Introduction, we collected the approach is more effective for GKHMC. Further- historical multiple-choice questions from Gaokao more, whether we should select specific method to all over the country in rencent five years. However, work out different types of questions. In terms of quite a lot contain graphs or tables which require various characteristics of GKHMC questions, we the techniques beyond natural language process- introduce two independent approaches to address ing(NLP). So, we filter out this part of questions them. One is based on entity search technique (IR and manually classified the left into two parts: approach) and the other is based on a text entail- EQs and SQs. The number of different kinds of ment approach where we specifically employ deep questions are listed in Table 1. The examples of neural networks (NN approach). In IR approach, we use the key entities and relationships extracted 1 https://github.com/IACASNLPIR/GKHMC/tree/master/data

112 different types of questions translated into English are shown in Figure 1. It is worth mentioning that there is a special type of questions on test papers named sequential ques- tions. The candidates of this kind of questions are just some ordered numbers. Every number stands for a certain content which is given in question stem. We simply replace every sequential number in candidates with their corresponding contents. Then, we can classify these questions as EQs or SQs according to the type of contents. We also collected a wide diversity of re- sources including Baidu Encyclopedia, textbooks Figure 2: Pipeline of IR approach. and practice questions as our external knowledge when inquiring the generated query. Baidu En- cyclopedia which is also known as Baidu Baike, validation on the GKHMC dataset and the results is something like Wikipedia, but the content of are 90.00% precision and 84.38% recall in EQs it is written in Chinese. We denote this resource and 95.79% precision and 97.43% recalls in SQs. as BAIKE. The textbooks resource contains three 3.1.2 Score Functions compulsory history textbooks published by Peo- To calculate the relevance between question stem ple’s Education Press. We denote them as BOOK. and candidates, we introduce three different meth- And we gathered about 50,000 practice questions ods with seven score functions, which are summa- and their answers, and this is denoted as TIKU. rized in Table 2. 3 Approaches2 Lexical Matching Score: Since the correct can- didate usually directly related to question stem, 3.1 IR Approach it’s reasonable to assume that the facts in question The GKHMC questions require figuring out the stem may appear in documents related to them, most relevant candidate to the question stem from together with the correct candidate. Here we in- the four given candidates. Our IR approach is in- troduce our lexical matching score functions, tak- spired by this observation. The diagram of IR ap- ing BAIKE as our external resource. The four proach is illustrated in Figure 2. queries are formed by each candidate and ques- The pipeline of IR approach is: (1) use the clas- tion stem separately. Then we retrieval every sifier to automatically classify the question and query and sum up the scores of the top three re- select the weights according to the classification turned documents as the lexical matching score. result; (2) calculate the relevance scores for ev- We use scoretopi to denote the score of the top i- ery candidate(we introduce three different meth- th returned documents. scoretopi is calculated by 3 ods with seven score functions to calculate the rel- Lucene’s TFIDFSimilarity function . The lexical evance scores) and combine them together with matching score Scorelexical(candidatek) is cal- specific weights; (3) choose the candidate with culated as highest score as right answer. Despite the simplic- 3

ity of it, IR approach achieves a promising result Scorelexical(candidatek) = (scoretopi ). in experiment. i=1 X (1) 3.1.1 Naive Bayes Classifier We build indices for BAIKE with different We build a naive bayes classifier to classify ques- grains. The index built for every BAIKE docu- tions. Using length of candidates, entity number ment is denoted as BAIKE Document Index(BDI). of candidates and verb number of candidates as The index built for every paragraph in BAIKE is features, every question is classified as EQ or SQ. denoted as BAIKE Paragraph Index(BPI). And, When building the classifier, we do 10-folder cross the index built for every sentence in BAIKE is called BAIKE Sentence Index(BSI). 2 The codes of this project can be obtained at https://github.com/IACASNLPIR/GKHMC 3 https://lucene.apache.org/core/

113 We denote the lexical matcing score function Function Description using BDI, BPI and BSI as ScoreBDI , ScoreBPI ScoreBDI Scorelexical using BDI and ScoreBSI respectively. ScoreBPI Scorelexical using BPI Entity Co-Occurrence Score: We also con- ScoreBSI Scorelexical using BSI sider the relevance of entities in co-occurrence ScoreBDC document level Scoreco aspect. If two entities often appearing together, ScoreBPC paragraph level Scoreco we assume that they are revelent. We use nor- ScoreBSC sentence level Scoreco malized google distance(Cilibrasi and Vitanyi, ScoreBDL document link score function 2007) to calculate the entity co-occurrence score Table 2: Summarization of score functions. Scoreco(candidatek).

3.1.3 Training Weights Max(e , e ) log f(e , e ) NGD(e , e ) = i j − i j Since we have seven score functions, we need i j log N Min(e , e ) − i j combine them together with different weights. (2) For a given question, we calculate the score of Max(e , e ) = max log f(e ), log f(e ) every candidate as follows: i j { i j } (3) 7 Min(ei, ej) = min log f(ei), log f(ej) score = (w f (candidate )) (7) { } candidatek i ∗ i k (4) i=i X Scoreco(candidatek) = log(NGD(ei, ej)) where k 1, 2, 3, 4 , f is one of the seven score − ∈ { } i (5) functions and wi is the corresponding weight. Then we normalize the scores of all candidates: where scorecandidatek scorek = 4 (8) e E , e E . i=1(scorecandidatei ) i ∈ stem j ∈ candidatek We suppose that theP true answer of a question is In which, e is entity; f(e ) is the number of i i the n-th candidate, where n 1, 2, 3, 4 . The parts which contain entity e ; f(e , e ) is the num- ∈ { } i i j loss of it is ber of parts which contain both entity ei and ej;

Estem and Ecandidatek denotes the entities in ques- lossquesiton = log(1 scoren) (9) tion stem and candidate. − − The entity co-occurrence could be in document, Now we can calculate the total loss of the dataset paragraph or sentence, and they are donated as with M questions:

ScoreBDC , ScoreBPC and ScoreBSC respec- M tively. loss = (lossquestioni ) (10) Page Link Score: Inspired from PageRank al- i gorithm(Page et al., 1999), we assume that enti- X ties have links to each other are relevant. Here All operations are derivable so that we can use we introduce the page link score function. We gradient descent algorithm to train the weights. use Link(ei, ej) to denote the number of links 3.2 NN Approach between entities e and e . The link score i j As deep neural networks are widely used in natu- Score (candidate ) could be calculated as: link k ral language processing tasks and has gained great success, it’s naturally to come up with building Scorelink(candidatek) = max(Link(ei, ej)) (6) deep neural networks to handle GKHMC task. So, we built several deep neural networks in different where structures. And, we used both TIKU and BOOK to train these models, in order to teach models not

ei Estem, ej Ecandidatek . only how to answer the questions but also the his- ∈ ∈ torical knowledge. We only count the number of links between To handle the joint inference between back- BAIKE documents, and it is denoted as ScoreBDL ground knowledge and question stems in GKHMC

114 Provisional Memory Module After WWⅡ, the United States and the AfterUnion WWⅡ, of Soviet the United Socialist States Republics and the score , score , score , score After WWⅡ, the United States and the a b c d Uniongradually of Soviet began Socialist competing Republics in all fields AfterUnion graduallyWWⅡ, of Soviet the began United Socialist competing States Republics and in theall fields 1 2 3 4 5 6 t After WWⅡ,including the United politics, States economy and the and m m m m m m …… m Similarity Judger Uniongraduallyincluding of Soviet began politics, Socialist competing economy Republics in andall fields Union of Sovietmilitary. Socialist In order Republics to coordinate and Encoder graduallyincludingmilitary. began politics, In order competing economy to coordinate in andall fields and gradually beganpromote competing economical in all development fields of includingmilitary.promote politics, In order economical economy to coordinate development and and of including politics,member economy countries and in socialist party, A. 活字印刷术 、万有引⼒定律 military.promotemember In order economical countries to coordinate indevelopment socialist and party, of military. InUSSR order established to coordinate Council and for Mutual typography, promotememberUSSR economical establishedcountries indevelopment Council socialist for party, Mutual of promote economicalEconomic Assistance development in 1949. of This the law of gravity memberUSSREconomic established countries Assistance in Council socialist in for1949. party, Mutual This Encoder member countriesmove is mainlyin socialist for confrontingparty, USSREconomicmove established is Assistancemainly Council for inconfronting for1949. Mutual This B.《 九章算术 》、 罗⻢法 USSR establishedMarshall Council Plan. for Mutual EconomicmoveMarshall is Assistancemainly Plan. for confrontingin 1949. This Nine Chapters on the Mathematical Art, Economicmove is Assistancemainly for inconfronting 1949. This Marshall Plan. Input Module lead-in representation Roman law moveMarshall is mainly Plan. for confronting Marshall Plan. C. 蔡伦改进的造纸术 、⽇⼼说 Encoder papermaking technology, 元 年 汉 超 ⼈ 使 洲 国 秦 汉 秦 造 煌 化 Permanent Memory Module 7 曾 都 heliocentric theory 公 9 东 班 派 出 欧 强 ⼤ 东 ⼤ 创 辉 文 属于它们的文化成就分别是 D.《 春秋繁露 》、《 理想国 》 Chun Qiu Fan Lu, The cultural achivements of Utopia 公元 97 年,东汉的班超曾派⼈出使欧洲强国 “⼤秦 ”。 东汉和 “⼤秦 ”都创造了辉煌的文化 。 them are In 97 A.D., Ban Chao of the Eastern Han Dynasty had sent an envoy to European powerful country "DaQin". Both of the Eastern Han and "DaQin" created splendid culture. material sentences lead-in sentence answer candidates

Figure 3: Diagram of PPMN. The questions hasn’t been translated into English. questions, we introduce permanent-provisional Provisional Memory Module: It first inquires memory network(PPMN). As illuminated in Fig- the current word of background sentences in the ure 3, our PPMN is composed by the following permanent memory, then use an attention vector components: generated by current word and lead-in sentence as well as the following words to decide how to ad- 1. Permanent Memory Module that plays the just itself. The update equations are as follow: same role as a knowledge base and stores the original text from history textbooks or other

relevant resource. ht =GRU(wt, ht 1) (11) − p 2. Provisional Memory Module that generates p =softmax(pW ht) (12) some contents based on the current word in K background sentences, permanent knowledge Mt = piki (13) and the lead-in sentence. Xi=1 x =[ht,Mt, l, ht l, Mt l, ht Mt, 3. Input Module that reads the words sequen- ◦ ◦ ◦ (14) h l , M l , h M ] tially in background sentences and maps | t − | | t − | | t − t| g t t g them into high-dimensional vector space. g =σ(W tanh(W x + b ) + b ) (15) mt =g Mt + (1 g) mt 1 (16) 4. Similarity Judger that scores the similarity ◦ − ◦ − between the output of provisional memory In the above equations, wt denotes the t-th word and the vector representations of answer can- in the background sentences, GRU is defined in didates. equation (19-22), ht 1 and ht are the hidden rep- − resentation of wt 1 and wt respectively, l stands 5. Sentence Encoder that encodes lead-in sen- − for the lead-in sentence encoded by the sentence tence, sentences in permanent memory and encoder, is element-wise multiplication and m answer candidates. ◦ t is the computational result of current step. The Permanent Memory Module: We denote the final output of this module is the last provisional sentences encoded by sentence encoder in this memory vector mn where n is the length of module as k , k , ..., k , where K is the scale background sentences. { 1 2 K } of permanent memory. The permanent memory is a constant matrix composed by the concatenation Input Module: This module takes the same of representation vectors of these sentences, weight matrices in sentence encoder and calcu- namely [k1; k2; k3; ...; kK ]. Considering the time lates the hidden states of every word sequentially. complexity of training PPMN, we only take the All the words in background sentences are first syllabus of all history courses including 198 mapped into the hidden states in this module and sentences, i.e. K = 198, as the permanent then can be taken as input by other modules. memory. If necessary, all of the history text books The calculation of hidden states are the same as can be taken into the permanent memory. equation(19-22).

115 Similarity Judger: This module takes the con- Accuracy EQ-WEQ 49.38% catenation of the output from provisional memory SQ-WSQ 28.60% and representation of answer candidate as input and use a classifier based on logistic regression to Table 3: Accuracy of SQs and EQs with their cor- score it. The judging procedure is defined as fol- responding best weights. low:

l l 4 Experiment pˆ = σ(W [mK ; a] + b ) (17) 0 score = softmax(ˆp) (18) 4.1 Experiments of IR Approach 1   To find the best weights for EQ and SQ, We use TIKU as the training dataset. Using gradi- where W l is a matrix that can map the concate- ent descent to optimize parameters, we get the best nation vector [mK ; a] into a vector pˆ of length 2 weights for EQs and SQs separately, that is, W and a stands for the answer candidate encoded by EQ is the weight best for EQs and W is the weight sentence encoder. SQ best for SQs. We test the weights on EQs and SQs of GKHMC with their corresponding weights, and Sentence Encoder: We experimented several re- result is shown in Table 3. As we can see, with current neural networks with different structures these weights, we achieve promising result. as the sentence encoder. Both of Long-Short Term Momery (LSTM)(Hochreiter and Schmidhuber, We use GKHMC as the dataset to test the per- 1997) and Gated Recurrent Unit (GRU)(Cho et formance of IR approach with naive bayes classi- al., 2014) perform much better than the standard fier. The precision of EQs and SQs are 48.75%, tanh RNN. However, considering that the com- 28.42% respectively. It’s clear that the accuracy of putation of LSTM is more complicated and time- both EQs and SQs decreased with automatic clas- consuming, we choose GRU as the sentence en- sification. But still, IR approach achieves much better results on EQs than SQs. coder. The calculation of GRU denoted as ht = GRU(wt, ht 1) is as follow: − 4.2 Results of NN Approach z z z z = σ(W wt + U wt + b ) (19) We take some other neural network models r z r r = σ(W wt + U wt + b ) (20) with memory capability as our baseline models s s s including the standard tanh recurrent neu- s = tanh(W wt + U (r ht 1) + b )(21) ◦ − ral network(RNN), long-short term memory ht = (1 z) s + z ht 1 (22) − ◦ ◦ − network(LSTM)(Hochreiter and Schmidhu- ber, 1997), gated recurrent unite(GRU)(Cho In the above equations, wt is extracted from et al., 2014), end-to-end memory net- a word embedding matrix We initialized by work(MemNN)(Sukhbaatar et al., 2015) and word2vec(Mikolov et al., 2013) through an id dynamic memory network(DMN)(Kumar et al., number that indicates which word it is. 2016). As for our PPMN, we summarize the syllabus of all history textbooks for senior school Loss Function: Intuitively, as we want to encour- students to cover as much knowledge points as age the score as same to the true score (0 or 1) as possible and we get 198 sentences which are possible, a negative log-likelihood loss function is taken into the permanent memory module. For introduced: all the above models, we used rmsprop(Hinton L = log(ˆpy) (23) et al., 2012) with 0.001 as the learning rate to − train them, the size of hidden units as well as where y would be [0 1]> if a is the right answer the size of memory were both set to 400 and the or [1 0]> otherwise. size of batches were set to 1000. Also, we used dropout(Srivastava et al., 2014) to prevent the Optimization Algorithm: We use the AdaDelta models from overfitting and the probability of it introduced by (Zeiler, 2012) to minimize the loss was set to 0.5. We test all these models and the L, and use back propagation through time to opti- results are shown in Table 4. mize the calculation results of intermediate results. From the result, we observe that our PPMN

116 Model EQs SQs All disappeared in the right candidate. So it’s easy for RNN 36.25% 29.74% 31.18% the correct candidate to achieve a higher relevance LSTM 40.63% 40.41% 40.46% score than others. And, that’s why IR approach GRU 40.63% 40.24% 40.32% achieves promising result on EQs. Whereas, in MemNN 43.75% 36.13% 37.77% SQs, the key entity doesn’t appear in any candi- DMN 44.38% 45.38% 45.16% date. And, it needs to be inferred out from ques- PPMN 45.63% 45.72% 45.70% tion stem. No matter in aspect of lexical match- Random 25.00% 25.00% 25.00% ing, entity co-occurrence or page link, the rele- vance between question stem and correct candi- Table 4: Results of all neural network models. date may be as low as other candidates. There- for, it’s not surprised that IR approach is not suffi- gains best performance on all kinds of GKHMC cient to figure out the right choice on SQs. After questions and all memory-capable neural network adding the classifier in IR approach, we notice the models beat RNN. It’s interesting that MemNN decrease of accuracy on both EQs and SQs. This is performs much worse than other memory-capable because of the misclassification on the questions, models on SQs whereas it shows promising capa- which demonstrates that the weights WEQ,WSQ bility on EQs. are particularly efficient on EQs, SQs. The experiment of NN approach declared that 4.3 Combine IR Approach and NN Approach our PPMN does show its advantages on GKHMC It can be easily observed from the above experi- questions. During the training, the performance ments that IR approach and NN approach are some of RNN model is labile, i.e. the precision are kind of complementary, namely they performs bet- still variational when loss is convergent. In con- ter to each other on different categories of ques- trast, other model’s performance is more stable. tions. So, we combine the two approaches to- Hence, we consider that the memory mechanism c 2 2 gether via a weights matrix W R × as fol- ∈ helps model to ”remember” the knowledge that lows: appeared in the training data. Compared with the 4 c scoreIR “inside” memory of LSTM and GRU, the spe- scoreEQ = W1 (24) · scoreNN cially designed memory component in MemNN,   DMN and PPMN are more powerful to find out c scoreIR scoreSQ = W2 (25) the relationships between the question stem and · scoreNN   answer candidates in GKHMC questions. How- c c where the Wi means the i-th row of W and ever, the limited performance of MemNN on SQs · scoreIR, scoreNN are the scores calculated by indicates that the sequences of words in GKHMC IR and NN approaches respectively. Here, the questions are especially important for questions categories of questions are given by the naive containing no distinct entities. Last but not least, bayes classifier. The performance of combined the best performance of PPMN may due highly on model and its comparison to the two individual ap- the novel permanent memory module which can proaches are illustrated in Figure 4. helps finding the implicit relationships with the stored background knowledge. 4.4 Discussion The state-of-the-art performance of hybrid From the global aspect, it can be easily ob- method indicates that combination of IR approach served that IR approach are more proficient on and NN approach is the best strategy to address EQs(49.38% vs 40.63%), whereas NN approach the GKHMC questions. As illustrated in Figure expand superior to it on SQs(28.60% vs 40.24%). 4, the combined method shows its enormous ad- And the hybrid method composed by two ap- vantage on EQs. This may because both character proaches get the best performance(42.60%). and word embedding are more sufficient to cover As for the IR approach itself, the performance the lexical meaning. And, some of EQs may be on EQs is much better than on SQs. This may be- more suitable to be handled as SQs. Compared to cause that IR approach is based on the relevance the NN approach separately, the hybrid way does between candidates and question stem. In EQs, 4 We consider that the memory of LSTM and the information given by the question stem is usu- GRU are kind of stored inside the weight ally the description of the key entity which only matrices.

117 60.00% the question as a kind of semantic parsing which 52.50% 49.38% 50.00% 45.72% 45.37% 45.63% 45.70% 46.90% can not handle the specific expressions with lots

40.00% of background knowledge. Although (Yih et al., 32.80% 28.60% 30.00% 2015) employed knowledge base, but still failed

ACCURACY 20.00% on multiple sentences questions which is beyond

10.00% the scope of semantic parsing. However, the diver-

0.00% sity of candidates in GKHMC makes these mod- SQ EQ ALL els fail to match the question with the right candi- IR Approach NN Approach Combination date. Another nonnegligible task is machine com- prehension, also called reading comprehension. Figure 4: Result of different methods. Although in several different datasets introduced by (Smith et al., 2008; Richardson et al., 2013; Weston et al., 2015), questions are open-domain a little poorly on SQs, which may caused by the and candidates may be entities or sentences, un- loss of classification. derstanding these questions don’t require as much 5 Related Work background knowledge as in GKHMC and these models cannot handle the joint inference between Answering real world questions in various sub- the background knowledge and words in ques- jects already gained attention from the beginning tions. of this century. The ambitious Project Halo (Fried- We are not the first to take up the Gaokao chal- land et al., 2004) was proposed to create a ”digital” lenge, but former information retrieval approach Aristotle that can encompass most of the worlds’s doesn’t fit to part of the questions in GKHMC scientific knowledge and be capable of address- and resources in their system are limited. In con- ing complex problems with novel answers. In trast, we introduced two different approaches to this project, (Angele et al., 2003) employed hand- this task, compared their performance on different crafted rule to answer chemistry questions, (Gun- types of questions, combined them and gained a ning et al., 2010) took the physics and biology into state-of-the-art result. account. Another important trial is solving the mathematical questions. (Mukherjee and Garain, 6 Conclusion and Future Work 2008) attempted to answer them via transform- ing the natural language description into formal In this work, we detailed the multiple choice ques- queries with hand-crafted rules, whereas recent tions in subject History of Gaokao, present two works (Hosseini et al., 2014) started to employ- different approaches to address them and com- ing learning techniques. However, none of these pared these approaches’ performance on all cat- methods are suitable for history questions which egories of questions. We find that the IR approach requires large background knowledge, the same to are more sufficient on EQs cause the words in the Aristo Challenge(Clark, 2015) focused on El- these questions are usually the description of right ementary Grade Tests which is for 6-11 year olds. answer, whereas the NN approach performs much The Todai Robot Project(Fujita et al., better on SQs, and this may because neural net- 2014)aims to build a system that can pass work models can find out the semantic relationship the University of Tokyo’s entrance examination. between questions and candidates. When combin- As parts of this project, (Kanayama et al., 2012) ing them together, we get the state-of-the-art per- mainly focus on addressing the yes-no questions formance on GKHMC, better than any individual via determining the correctness of the original approach. This points out that combining different proposition, and (Miyao et al., 2012) mainly approaches may be a better method to deal with focus on recognizing textual entailment between the real-world questions. a description in Wikipedia and each option of In future work, we will explore whether key- question. But, these two methods are separated value memory network proposed by (Miller et for different kinds of questions and none of them al., 2016) can help improve the performance of introduced neural network approach. PPMN, what content in textbook or encyclopedia It’s inevitable to compare the GKHMC with the should be taken into the permanent memory, how factoid questions. (Berant and Liang, 2014) takes to mathematically organize the permanent mem-

118 ory to make it can be reasoned on as well as problem solving. In Proceedings of the Ninth In- whether transforming the knowledge described in ternational Conference on Language Resources and natural language into formal representation is ben- Evaluation (LREC-2014), Reykjavik, Iceland. Euro- pean Language Resources Association (ELRA). eficial. As a long-term goal, it’s necessary to intro- duce discourse analysis, semantic parsing to help David Gunning, Vinay Chaudhri, Peter Clark, Ken the model truly understand the material sentences, Barker, Shaw-Yi Chaw, Mark Greaves, Benjamin Grosof, Alice Leung, David McDonald, Sunil questions and candidates. Mishra, et al. 2010. Project Halo update-progress toward digital Aristotle. AI Magazine, 31(3):33–58. Acknowledgments Geoffrey Hinton, Nirsh Srivastava, and Kevin We thank for the anonymous reviewers for help- Swersky. 2012. Lecture 6a overview of mini– ful comments. This work was supported by the batch gradient descent. Coursera Lecture National High Technology Development 863 Pro- slides https://class.coursera.org/neuralnets-2012- 001/lecture. gram of China (No.2015AA015405) and the Nat- ural Science Foundation of China (No.61533018). Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. And this research work was also supported by Long short-term memory. Neural computation, 9(8):1735–1780. Google through focused research awards program. Javad Mohammad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning References to solve arithmetic word problems with verb catego- rization. In Proceedings of the 2014 Conference on Jurgen¨ Angele, Eddie Monch,¨ Henrik Oppermann, Empirical Methods in Natural Language Process- Steffen Staab, and Dirk Wenke. 2003. Ontology- ing(EMNLP), pages 523–533, Doha, Qatar. Associ- based query and answering in chemistry: Ontonova ation for Computational Linguistics. project Halo. In The Semantic Web-ISWC 2003, pages 913–928. Springer, Sanibel Island, Florida, Hiroshi Kanayama, Yusuke Miyao, and John Prager. USA. 2012. Answering yes/no questions via question in- version. In Proceedings of COLING 2012, pages Jonathan Berant and Percy Liang. 2014. Seman- 1377–1392, Mumbai, India. The COLING 2012 Or- tic parsing via paraphrasing. In Proceedings of the ganizing Committee. 52nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages Oleksandrs Kolomiyet and Marie-Francine Moens. 1415–1425, Baltimore, Maryland, USA. Associa- 2011. A survey on question answering technology tion for Computational Linguistics. from an information retrieval perspective. Informa- tion Science, 181(24):5412–5434. Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bah- danau, and Yoshua Bengio. 2014. On the properties Ankit Kumar, Ozan Irsoy, Jonathan Su, James Brad- of neural machine translation: Encoder–decoder ap- bury, Robert English, Brian Pierce, Peter Ondruska, proaches. In Proceedings of SSST-8, Eighth Work- Ishaan Gulrajani, and Richard Socher. 2016. Ask shop on Syntax, Semantics and Structure in Statisti- me anything: Dynamic memory networks for nat- cal Translation, pages 103–111, Doha, Qatar. Asso- ural language processing. pages 1378–1387, New ciation for Computational Linguistics. York City, New York, USA. ACM.

Rudi Cilibrasi and Paul Vitanyi. 2007. The google Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke similarity distance. IEEE Transactions on Knowl- Zettlemoyer. 2013. Scaling semantic parsers with edge and Data Engineering, 19(3):370–383. on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natu- Peter Clark. 2015. Elementary school science and ral Language Processing, pages 1545–1556, Seattle, math tests as a driver for AI: Take the Aristo chal- Washington, USA. Association for Computational lenge! In Proceedings of the Twenty-Seventh Linguistics. AAAI Conference on Artificial Intelligence (AAAI- 15), pages 4019–4021, Austin, Texas, USA. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word represen- Noah Friedland, Paul Allen, Gavin Matthews, Michael tations in vector space. Computing Research Repos- Witbrock, David Baxter, Jon Curtis, Blake Shepard, itory, abs/1301.3781. Pierluigi Miraglia, Jurgen Angele, Steffen Staab, et al. 2004. Project Halo: Towards a digital Aris- Alexander Miller, Adam Fisch, Jesse Dodge, Amir- totle. AI magazine, 25(4):29. Hossein Karimi, Antoine Bordes, and Jason We- ston. 2016. Key-value memory networks for di- Akira Fujita, Akihiro Kameda, Ai Kawazoe, and rectly reading documents. In Proceedings of the Yusuke Miyao. 2014. Overview of Todai robot 2016 Conference on Empirical Methods in Natu- project and evaluation framework of its nlp-based ral Language Processing, pages 1400–1409, Austin,

119 Texas, USA. Association for Computational Lin- Matthew D. Zeiler. 2012. Adadelta: An adaptive learn- guistics. ing rate method. Computing Research Repository, abs/1212.5701. Yusuke Miyao, Hideki Shima, Hiroshi Kanayama, and Teruko Mitamura. 2012. Evaluating textual entail- ment recognition for university entrance examina- tions. ACM Transactions on Asian Language Infor- mation Processing (TALIP), 11(4):13. Anirban Mukherjee and Utpal Garain. 2008. A review of methods for automatic understanding of natural language mathematical problems. Artificial Intelli- gence Review, 29(2):93–122. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The pagerank citation rank- ing: bringing order to the web. Technical Report 1, Stanford InfoLab. Anselmo Penas,˜ Eduard Hovy, Pamela Forner, Alvaro´ Rodrigo, Richard Sutcliffe, and Roser Morante. 2013. Qa4mre 2011-2013: Overview of question answering for machine reading evaluation. In Inter- national Conference of the Cross-Language Evalu- ation Forum for European Languages, pages 303– 320. Springer. Matthew Richardson, J.C. Christopher Burges, and Erin Renshaw. 2013. Mctest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Em- pirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Associ- ation for Computational Linguistics. Noah A. Smith, Michael Heilman, and Rebecca Hwa. 2008. Question generation as a competitive un- dergraduate course project. In Proceedings of the NSF Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, Mas- sachusetts, USA. Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi- nov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958. Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, Montreal,´ Canada. Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards ai-complete ques- tion answering: A set of prerequisite toy tasks. Computing Research Repository, abs/1502.05698. Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Lin- guistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331, Beijing, China. Associ- ation for Computational Linguistics.

120