Masaryk University
Faculty of Informatics

Automatic question answering for flective languages

Ph.D. Thesis Proposal

Marek Medveď

Advisor: doc. RNDr. Aleš Horák, Ph.D.

Brno, Fall 2017

Signature of Thesis Advisor

Contents

1 Introduction
  1.1 Current QA challenges
    1.1.1 Building knowledge base
    1.1.2 Question processing
    1.1.3 Document selection (Knowledge base search)
    1.1.4 Sentence selection
    1.1.5 Answer extraction
  1.2 What has to be improved
    1.2.1 Question processing
    1.2.2 Document selection
    1.2.3 Sentence selection
    1.2.4 Answer extraction
  1.3 Goal of the postgraduate study
  1.4 Thesis proposal structure

2 State of the art
  2.1 QA system structure
    2.1.1 Knowledge base
    2.1.2 Question processing module
    2.1.3 Passage retrieval module
    2.1.4 Answer selection module

3 Aims of the Thesis
  3.1 Czech-English syntax differences
    3.1.1 Declarative sentences (statements)
    3.1.2 Interrogative sentences (questions)
  3.2 Proposed system prototype
  3.3 Study plan

4 Achieved Results
  4.1 SQAD database
  4.2 Automatic Question Answering system (AQA)

5 Author's publications

Bibliography

A Research activity

B Teaching activities

C Opponent review

D Selected papers

1 Introduction

Question answering (QA) is a computer science discipline that attracts a lot of interest in the Natural Language Processing field. The information extraction and natural language processing fields aim at building systems that can provide accurate answers to input questions. The main difference between search engine systems and QA systems can be seen in the results they provide. A search engine system usually provides a list of eligible candidates that satisfy the input query. In contrast, QA systems are more complex and go further. By using multiple natural language processing (NLP) techniques, a QA system chooses the best article, extracts suitable passages and picks the shortest part of a paragraph or sentence that satisfies the input question and provides an answer with sufficient information to the user. There are two main types of QA systems: open domain systems [1, 2, 3, 4, 5, 6] and closed domain systems [7, 8, 9, 10]. Open-domain systems are based on sources without any restrictions, whereas closed-domain systems are limited to specific domains such as medicine, weather forecasting, sports results etc.

1.1 Current QA challenges

Current challenges that are studied by the QA community all over the world arise from the complexity of the QA task, which has to go through multiple layers of processing (question processing, document processing, answer selection, answer extraction) to get from a question to an answer. These challenges are presented in the following text based on the techniques in [11, 12, 13, 3, 10, 14, 5, 6, 15, 16].

1.1.1 Building knowledge base

The first challenge of a QA system is to have a large knowledge data source that will represent the knowledge base (KB) of the system. This is the part of the system that provides all the data and is queried for candidate answers.


Usually there is a separate module inside the QA system that processes the data source and stores information inside the QA's KB database. The main purpose of this module is to extract information from the input texts using NLP techniques such as lexical analysis1, morphological analysis2, part-of-speech tagging3 and syntactic analysis4. There are also advanced NLP techniques available which can provide complex information about a text, such as proper nouns (Information Extraction techniques), reference extraction between parts of text (anaphora resolution) or semantic similarity recognition between expressions (thesaurus, word embeddings). A satisfactory answer to the input question requires a proper representation of the QA system's KB. The KB must contain all available information extracted from the input data in a compact form, and the QA system must be able to access this information quickly. A detailed description of current KB techniques is presented in Section 2.1.1.
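As an illustration of such a preprocessing pipeline, the following minimal sketch uses the spaCy library (an arbitrary choice for illustration, not the toolchain used in this work) to obtain tokens, lemmas, part-of-speech tags and syntactic dependencies:

# Minimal NLP preprocessing sketch using spaCy (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")   # small English model
doc = nlp("Who is the founder of Facebook?")

for token in doc:
    # surface form, lemma (base form), POS tag, dependency label, head
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)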

1.1.2 Question processing

Apart from a KB, the only input the QA system usually receives is an input question. Question processing is very important for the QA system itself. If the QA system is not able to extract all available information from the question, the next processing layers could lead to incorrect answers. Besides lexical analysis, morphological analysis, part-of-speech tagging and syntactic analysis, the system usually performs question classification (for question types see Section 2.1.2). This additional information helps the system to focus on certain classes of entities inside candidate answers. According to this information, the final score, which represents the confidence of the answer, is assigned to each candidate answer. There are many ways to extract information from an input question. A question processing module can be based on three main approaches.

1. Token recognition inside a sentence.
2. Assigning a base form to a token.
3. Part-of-speech assignment to a word in a text based on both its definition and its context.
4. Building a syntactic tree according to language grammar rules.


The first approach is to match a question against automatically learned or manually created patterns (e.g. [17]). The second, more linguistically oriented approach (e.g. [8]) is based on building a KB query from information extracted by NLP tools. The last one is the application of statistical or machine learning approaches, which use statistical techniques such as Support Vector Machines, Bayesian classifiers, neural networks etc. to extract features of a question and create a KB query.
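To make the pattern-based approach concrete, the following toy sketch matches a question against hand-written patterns and turns the matched slots into a KB query; the patterns and the query format are hypothetical:

# Toy pattern-based question processing; patterns and query format
# are hypothetical illustrations only.
import re

PATTERNS = [
    (re.compile(r"^Who is the (?P<relation>\w+) of (?P<entity>.+)\?$", re.I), "PERSON"),
    (re.compile(r"^When was (?P<entity>.+) born\?$", re.I), "DATE"),
]

def question_to_query(question):
    for pattern, answer_type in PATTERNS:
        match = pattern.match(question)
        if match:
            return {"answer_type": answer_type, **match.groupdict()}
    return None

print(question_to_query("Who is the founder of Facebook?"))
# {'answer_type': 'PERSON', 'relation': 'founder', 'entity': 'Facebook'}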

1.1.3 Document selection (Knowledge base search)

After extracting all possible information from a question, the system starts to query the KB. The result of this process is usually in the form of a document or a document passage. A document selection module is used by the QA system to select the set of all documents from the KB which contain a suitable answer. This process can be based on several strategies. One of the widely used techniques is Information Retrieval (IR), which extracts keywords from a document that are compared to question keywords. A technique used by many IR systems is Boolean IR, which uses Boolean logic to create a formula from a document and compare its intersections with question keywords [18]. Important aspects of a document selection module are not only to develop an IR tool that can select a list of documents based on a given question but also to select a good sorting technique that will provide the final document ranking. The number of documents sent to the next processing stage is also very important. For example, it is shown in [19] that even when considering the top 1000 text segments, no relevant documents are found by the module for 8 % of the questions.
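A minimal sketch of such Boolean keyword retrieval follows; the toy data and the overlap-based ranking are our own illustrative assumptions, not the cited systems:

# Boolean keyword retrieval sketch: a document is a candidate when its
# keyword set intersects the question keywords; candidates are ranked
# by the overlap size. The data are illustrative.
def select_documents(question_keywords, documents, top_n=10):
    scored = []
    for doc_id, doc_keywords in documents.items():
        overlap = question_keywords & doc_keywords
        if overlap:  # Boolean condition: at least one shared keyword
            scored.append((len(overlap), doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_n]]

docs = {
    "d1": {"facebook", "founder", "zuckerberg"},
    "d2": {"weather", "forecast", "brno"},
}
print(select_documents({"founder", "facebook"}, docs))  # ['d1']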

1.1.4 Sentence selection

The sentence selection area is the most frequently studied component of a QA system. The most important action of a sentence selection module is to pick the correct passage from a document, usually in the form of a sentence or paragraph. Several approaches have been developed to solve this task. Some of them are very advanced and usually use some kind of neural network model [16], others use lexical information that is extracted from questions and sentences or obtained through IR techniques [20]. The main challenge of this area is to find a technique with a very high selection confidence. Current research aims not only to find new techniques to accomplish this task but also to explore possible combinations of existing tools and to find the correct weighting of each feature to get the best results.

1.1.5 Answer extraction

The final stage of a QA system is answer extraction. It uses a combination of Information Extraction (IE) techniques, information about the question type extracted from the input question, NLP tools that pinpoint important parts of a sentence (main verb, subject, object) and advanced tools such as anaphora resolution [21]. According to the information obtained in previous steps, the system has to decide which part of a sentence (passage) must be extracted and shown to the user as the final answer. The main goal is to provide a correct answer that includes enough information to satisfy the user's question.

1.2 What has to be improved

The QA field has become very attractive in recent years to people and companies around the world because of its usage potential. NLP techniques applied in the QA field such as lexical analysis, morphological analysis, syntactic analysis, document selection and knowledge base building still do not reach 100 % confidence and all of them are still under development. The main focus of research is to improve the question processing, sentence (passage) selection and answer extraction areas.

1.2.1 Question processing

• Question type: there is a problematic balance between a small and a large number of question types that the system can recognize. If the system implements too many question types, the complexity of question processing can become very time consuming and a very small change in feature weighting may change the resulting question type. On the other hand, very few question type classes can cause the system to pick a wrong answer.

• Question focus: finding the main focus of the question itself (example in Figure 1.1) is also a challenging task. The confidence of focus recognition greatly influences the resulting answer [22, 23].

Question: Who is the founder of Facebook?
Focus: [person, company]

Figure 1.1: Example of question focus

1.2.2 Document selection

The challenge of finding a list of candidate documents that can answer the input question has been more or less solved. The problem lies in the final document ranking, which could select a wrong document over a suitable one by using an incorrect ranking procedure.

1.2.3 Sentence selection

This is a topic that has been presented at recent conferences. Several techniques that attempt to solve this task have been developed; still, the mean reciprocal rank (MRR) is about 0.8 (e.g. [5]) when measured on large English datasets such as TREC-QA, WikiQA or SQuAD. On the other hand, flective languages such as Polish reach 0.5 MRR [24], evaluated on 598 question-answer pairs created from the Polish web.

1.2.4 Answer extraction

Even though the answer extraction depends on the previous levels of processing, it is a very challenging task. According to the question focus, this part of the system has to select the correct sentence part, one which contains the required answer.

1.3 Goal of the postgraduate study

The goal of this Ph.D. study is to build a QA system prototype that can operate on flective languages (the Slavonic family). The system will implement state-of-the-art techniques that will be supplemented by syntactic information, which should improve the system performance on this language family. We will provide evidence of the significance of syntactic information inside the system prototype. The main reason to study the influence of syntactic information on the system performance is that the majority of state-of-the-art QA systems are built for the English language, which is very different from flective languages. Since the word order inside flective sentences is more free and does not have a strict structure, syntactic information is necessary to identify not only a subject, a verb and an object but also all syntactic dependencies that help in the answer extraction procedure. Therefore we assume that this extracted syntactic information should improve the performance of a QA system that works on flective languages.

1.4 Thesis proposal structure

The structure of this work is as follows. In the first chapter we introduce the main structure of a QA system and describe the latest state-of-the-art techniques that are used in the QA field. The second chapter defines the proposed solution of the Ph.D. study in a detailed study plan. Already achieved results are presented in the third chapter, where we introduce the first prototype of a QA system called AQA. A brief system description is also presented in this chapter.

2 State of the art

In this section we will introduce a basic QA system structure and state-of-the-art techniques that are used in QA systems. At the end of this section we present an overview of the results of the best QA systems.

2.1 QA system structure

All recent QA systems usually have a very similar structure to that presented in Figure 2.1.

Figure 2.1: General QA structure (an input question passes through question processing, passage retrieval over the knowledge base, answer selection and answer extraction, which produces the answer)

The five core modules of most QA systems are a knowledge base, a question processing module, a passage retrieval (document retrieval) module, an answer selection module and an answer extraction module.

2.1.1 Knowledge base

To be able to find the required information, a QA system has to have some kind of knowledge base. There are two main source types for knowledge base building.


Structured data

Structured data for knowledge base building contain some type of additional information which is added by either manual or automatic annotation. FreeBase [25] and DBpedia [26] are the two main representatives of this data class. Freebase is a large collaborative knowledge database with a structure in the form of links between items inside the database. Every text on some topic includes a variety of types that are related to the topic itself. For example, a text with the topic "Arnold Schwarzenegger" should assign types such as actor, bodybuilder and politician to the text. After the Freebase company was bought by Google in 2010, the data moved to the Wikidata [27] database. There have been several QA systems based on FreeBase, such as [28, 29, 30]. A similar data source is DBpedia [26], where each entry contains an RDF triple1 (subject/predicate/object). DBpedia covers domains such as geography, companies, online communities, film, music, books and scientific publications. Descriptions of QA systems based on DBpedia can be found in [31, 32].

Plain text data

The second main source for building a KB is unstructured data harvested from web sites or books. The main disadvantage compared to the previous data source is the lack of relation annotation between entries. On the other hand, this data can be enlarged and updated in a straightforward way. The TREC-QA [33] dataset is built from web sites and books. It was created specifically for the answer selection task and consists of 12,887 training questions where each question has several positive and negative candidate answers. The WikiQA [34] dataset consists of 3,047 questions (originally sampled from Bing query logs) and 29,258 sentences, where 1,473 sentences were labeled as answer sentences to their corresponding questions.

1. An RDF (Resource Description Framework) triple is a metadata data model for modeling information that is implemented in web resources, using a variety of syntax notations and data serialization formats.


Like the TREC-QA dataset, this dataset aims at improving the answer selection task. Both the Stanford Question Answering Dataset (SQuAD) [35] and the DeepMind Q&A Dataset [36] are reading comprehension datasets, consisting of questions and answers where the answer to every question is a segment of text, or span, from the corresponding reading passage. The SQuAD database consists of more than 500 Wikipedia articles and more than 100,000 question-answer pairs, where the questions were posed by crowdworkers. The DeepMind Q&A Dataset includes CNN news articles and Daily Mail news articles. Together it has almost 300,000 documents and almost 1,300,000 questions. Questions in the DeepMind Q&A Dataset are not in the usual form like "QWord + focus?" but in the form of a sentence where the answer word is replaced by a special placeholder to indicate the missing word. The reason why it contains so much data compared to other sources is that the questions can be generated automatically.

2.1.2 Question processing module

Providing a good result requires proper processing of the input question to be able to extract all important information. The introduced QA systems [23, 37] use NLP tools including a tokenizer, a morphological analyzer and a part-of-speech tagger. Apart from these NLP tools, a lot of QA systems perform question classification according to the nature of the question.

• Functional word questions are all non-Wh* questions that usually start with a verb.

• When questions focus on an exact time or a time span.

• Where questions focus on a place.

• Which questions, where the focus lies on the noun phrase following the "Which" word.

• Who questions ask about some entity.

• Why questions want to find out the reason or explanation.

• How questions ask for an explanation or a number.

• What questions are general questions.

According to these question classes, a QA system can determine what kind of entity the question is looking for. Recent state-of-the-art systems transform an input question into its embedded form [14, 16, 15] to determine its semantic meaning (usually in the form of word/sentence vectors), which is then used in an answer selection module to match a question to an answer according to similar semantic meaning.
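A minimal rule-based classifier over these question classes might look as follows; the class labels and the fallback to the functional-word class are our own assumptions:

# Rule-based question classification by question word; the class
# labels mirror the list above and are otherwise arbitrary.
WH_CLASSES = {
    "when": "TIME", "where": "PLACE", "which": "NOUN_PHRASE",
    "who": "ENTITY", "why": "REASON", "how": "EXPLANATION_OR_NUMBER",
    "what": "GENERAL",
}

def classify_question(question):
    first_word = question.strip().split()[0].lower()
    # non-Wh* questions usually start with a verb ("Is...", "Does...")
    return WH_CLASSES.get(first_word, "FUNCTIONAL_WORD")

print(classify_question("Who is the founder of Facebook?"))   # ENTITY
print(classify_question("Is Brno in the Czech Republic?"))    # FUNCTIONAL_WORD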

Word embedding

A word embedding can be seen as a mapping of words to a more computer-friendly form. The simplest approach is to map a word from a sentence to the index of the same word inside a dictionary. Beyond that, this technique can determine the semantics of a word by using neural networks. These networks are trained on millions of sentences, which enables the neural network to catch relations of words from their context. The most popular implementation of word embeddings was developed by Mikolov's team [38] and is called word2vec. It has been implemented in many neural network libraries including TensorFlow2, DyNet3, Theano4, Keras5 and Gensim6. Word2vec utilizes the continuous bag-of-words and skip-gram models to compute vector representations of words. Word vectors typically have several hundred dimensions and each word from a corpus is assigned a corresponding vector inside this space. The vector represents features that have been learned by the neural network, which groups semantically similar words. An example vector embedding trained on an English corpus is shown in Figure 2.2 and its visualization in Figure 2.3. For training, the word2vec algorithm obtains a large tokenized corpus7 and creates a model that stores all features extracted from the corpus.

2. https://www.tensorflow.org/
3. dynet.readthedocs.io
4. http://deeplearning.net/software/theano/
5. https://keras.io/
6. https://radimrehurek.com/gensim
7. A large collection of texts.

Then, if the QA system wants to work with an embedded form of a sentence, the model is queried for the word vector representations of the sentence.

# word vectors, trained on a context window size of 3
must: [-0.04942611 -0.02303079 0.0573978 ..., 0.07257255 0.01359628 -0.06095064]
according: [-0.06460572 0.09217705 -0.16815324 ..., 0.0705426 0.07179514 0.03950354]
school: [-0.11782318 0.00927046 -0.0631732 ..., 0.06049724 0.04135431 -0.04148336]
...

Figure 2.2: Word vectors trained on English corpora (using TensorFlow library)

Figure 2.3: Word embedding visualization8

8. source: suriyadeepan.github.io/2016-06-28-easy-seq2seq/
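A minimal sketch of training and querying such a model with the Gensim implementation mentioned above follows (Gensim 4.x API; the two-sentence corpus is a toy example, a real model needs millions of sentences):

# Training and querying word2vec with Gensim (4.x API); the window of
# 3 matches Figure 2.2, the toy corpus is for illustration only.
from gensim.models import Word2Vec

sentences = [
    ["school", "must", "teach", "according", "to", "the", "plan"],
    ["students", "go", "to", "school", "every", "day"],
]
model = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=1)

vector = model.wv["school"]                        # embedding of one word
similar = model.wv.most_similar("school", topn=3)  # nearest words in the space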


2.1.3 Passage retrieval module

Most question answering systems feature a text retrieval module that searches through a knowledge base and creates a list of passages relevant to the question according to the information extracted from the question. This module usually consists of two parts. The first one provides a list of candidate answers and the second one uses scoring algorithms to rank the most relevant answers to the top of the list.

Search engine

The first step is to find documents relevant to a given question. The state-of-the-art systems usually perform some kind of search technique that is able to find all possible candidate answers. Examples of search engines are Apache Lucene [39] and Indri [40]. Apache Lucene includes multiple ranking models, including the Vector Space Model (based on the term frequency–inverse document frequency (TF-IDF) weighting algorithm) and the Okapi BM25 algorithm (also representing TF-IDF-like retrieval functions used in document retrieval). The Indri system's IR model is based on an inference network approach, which combines multiple features to estimate document relevance to a query (for a more detailed description see [40]).
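For reference, the Okapi BM25 score mentioned above is conventionally defined as follows, where k_1 and b are free parameters, f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length and avgdl is the average document length in the collection:

score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}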

Filtering

To prune down the number of candidate answers, the system combines multiple analysis scores into a filtering score. There are many analysis tools that can contribute to filtering. We introduce those which are most important:

• the longest similar subsequences between a question and a passage

• Lexical Answer Type (LAT): whether the expected lexical type (extracted from the question) is present in a candidate document

• tree distance score on noun phrases

• TF-IDF question-answer score

• proper noun presence (based on IR module results)

Ranking

At the end of the passage retrieval phase, the system combines the vector of multiple features into a final score for each passage. The higher the final score, the more identical question-answer features the system found, which contributes to the confidence of that answer.
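A minimal sketch of such a feature combination as a weighted sum follows; the feature names and weights are illustrative assumptions only:

# Combining multiple filtering features into a final passage score as
# a weighted sum; feature names and weights are illustrative.
WEIGHTS = {
    "longest_subsequence": 0.30,
    "lat_match": 0.25,
    "tree_distance": 0.15,
    "tf_idf": 0.20,
    "proper_noun": 0.10,
}

def final_score(features):
    # features: feature name -> normalized score in [0, 1]
    return sum(WEIGHTS[name] * value for name, value in features.items())

passage = {"longest_subsequence": 0.8, "lat_match": 1.0,
           "tree_distance": 0.4, "tf_idf": 0.6, "proper_noun": 1.0}
print(final_score(passage))  # 0.77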

2.1.4 Answer selection module

Before we explain the state-of-the-art answer selection methods, let us introduce the neural network models that are currently used for this task.

Neural networks

There are two mainstream NN models: Convolutional Neural Networks (CNN) [41] and Recurrent Neural Networks (RNN) [42]. CNN networks proved to be accurate for image recognition tasks [43] such as identifying objects in pictures. On the other hand, RNNs proved to be accurate in sequence-to-sequence [38] tasks like language translation and speech-to-text transcription. In the QA field, RNN networks have become very efficient in matching a question to a candidate sentence (the graphic representation of an RNN network cell is in Figure 2.4). The advantage of RNN over other NN models is in information persistence, which allows an RNN to make decisions according to previous actions. When we unroll the green loop from Figure 2.4.a, the chain that arises in Figure 2.4.b can be directly mapped to a list of words from a text. The specialty of RNN cells is that they are connected and can pass information through from the previous cell to the next cell. This information corresponds to all previous states of the RNN model, which in some cases can be undesirable. Consider a sentence where the word in bold should be predicted by the NN: "The clouds are in the sky." For this prediction we do not need to see all the previous context; it is pretty obvious that the only possible word is "sky".


Figure 2.4: Recurrent network unit (a) a single RNN cell with input word vectors w_1..t, output vectors o_1..t and a tanh activation; (b) the loop unrolled into a chain of cells

On the other hand, RNNs are not good at remembering information from previous text that is further away. Take for example a text where there is a large gap between the sentences "Mark grew up in England." and "His native language is English.". Long Short Term Memory (LSTM) [44] networks were developed to improve the learning process of RNN networks. An LSTM network is still an RNN network, but the advantage of LSTM over a plain RNN network is in the gates that are present within an LSTM network cell. These gates are used to determine whether both the information from previous steps and the current information will be passed to the next step or not (see the illustration of an LSTM cell in Figure 2.5). The cells are formed of sigmoid neural net layers and element-by-element multiplication operations. Each LSTM cell contains an input gate, a forget gate and an output gate. The input gate (the "ignoring" part in Figure 2.5) controls the actual prediction. The forget gate (the "memory" part in Figure 2.5) controls whether the past knowledge will be added to the present result. The output gate (the "selection" part in Figure 2.5) controls what can be released as a prediction.

Figure 2.5: LSTM network unit (the input, "ignoring", "memory" and "selection" parts are built from sigmoid and hyperbolic tangent neural network layers combined by element-wise multiplication and addition)
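In the standard formulation [44], the three gates and the cell state of an LSTM cell at time step t are computed as follows (\sigma is the sigmoid, \odot element-wise multiplication; i_t, f_t and o_t correspond to the "ignoring", "memory" and "selection" parts of Figure 2.5):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)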

Answer selection module

In the previous text we introduced the state-of-the-art neural network models that are used in recent QA systems. These NN models are used in the answer selection module architecture to create a complex neural network that can learn how to find crucial information between a question and candidate answers to provide a final scoring of all candidate answers. There are multiple approaches to putting together the layers of a NN. In the following text we introduce the structure of answer sentence selection that represents multiple state-of-the-art QA systems [5, 14, 15]. The latest research in the answer selection field has introduced new types of LSTM networks, new formulas for the loss function and new NN models. In Figure 2.6 we present a NN model that is based on a Bi-LSTM layer, max pooling and a loss function (for a specific description see [5, 14, 15]).

Figure 2.6: Bi-directional LSTM network for answer selection (the word vectors w_1..w_n of the question and of the answer pass through Bi-LSTM layers and max pooling; the resulting question and answer vectors feed a loss function that outputs probability(sentence|question))


The Bi-LSTM layer is a special kind of LSTM where the input is read from both sides (from the first to the last word and from the last to the first word). This Bi-LSTM network utilizes both the previous and the future context (which cannot be captured by a single-direction LSTM) by processing the input word sequence in the forward and backward directions. In our example the Bi-LSTM layer takes the input question and the answer. The output of this layer is then calculated as an element-by-element addition (which can be substituted by concatenation or another function). In [5] the IBM Watson team introduces a novel attentive pooling method for Bi-LSTM and CNN networks that is very similar to the standard Bi-LSTM layer from the previous text. It differs in the pooling strategy. Non-attentive pooling goes over a matrix with an N×N sized window and picks the greatest number inside the filter into the final result (see the illustration of max pooling with a 2×2 filter in Figure 2.7).

Figure 2.7: Max pooling with a 2×2 filter (the input matrix

1 5 7 0
3 4 1 6
2 9 3 9
5 1 7 2

is reduced to

5 7
9 9

by taking the maximum of each 2×2 block)
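The pooling operation of Figure 2.7 can be reproduced in a few lines of NumPy (a sketch with stride 2, matching the figure):

# Max pooling with a 2x2 filter and stride 2, reproducing Figure 2.7.
import numpy as np

m = np.array([[1, 5, 7, 0],
              [3, 4, 1, 6],
              [2, 9, 3, 9],
              [5, 1, 7, 2]])

# split the 4x4 matrix into 2x2 blocks and keep each block's maximum
pooled = m.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[5 7]
                #  [9 9]]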

In attentive pooling the strategy is the same, but the input matrix contains an input pair (both question and candidate answer vectors), which allows the information from the question to influence the answer and vice versa. This approach has proved to be efficient and is spreading to the majority of new QA systems. Some recently proposed methods consider not only the correct candidate answer in the training process but also incorrect answers (see Figure 2.8), where alongside the question-answer pair, the negative example also affects the computation of the result.

Figure 2.8: Bi-directional LSTM network for answer selection with positive and negative examples (the question, positive example and negative example word vectors pass through Bi-LSTM layers and max pooling; the resulting vectors enter the loss function that outputs probability(sentence|question))
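A common choice of loss function for such training with positive and negative examples (used e.g. in [5]) is the hinge ranking loss over the similarities of the pooled vectors, with a margin hyper-parameter m:

L = \max\{0,\; m - \cos(v_q, v_{a^+}) + \cos(v_q, v_{a^-})\}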

There are several methods to build and set up the NN for the answer selection task, and a lot of work can be done in this area of QA research.

State-of-the-art evaluation overview

In this section we introduce the best state-of-the-art QA systems according to the SQuAD leaderboard9 from September 2017. The best system on the leaderboard is called DCN+ [16] (Dynamic Coattention Network). The system introduces a coattention encoder module where an attention network is used for creating a co-dependent representation of the question and the answer.

9. https://rajpurkar.github.io/SQuAD-explorer/


A novel approach is also present in the final dynamic pointer decoder module, which estimates the start and end positions of the answer span according to a finite state automaton that is maintained by an LSTM model. The AoA Reader [14] introduces a novel attention-over-attention network module that places an attention mechanism over the existing document-level attention. This mechanism allows the system to exploit mutual information between the document and the query and to provide an importance score of each query. The R-NET model [15] proposes a new gated attention-based recurrent network and a self-matching mechanism. The gated attention-based recurrent network adds an additional gate to the attention-based recurrent network. This new model assigns an importance score to passage parts depending on their relevance to the question, which highlights important parts of the passage. The self-matching mechanism refines the passage representation with information from the whole passage. The Reg-RaSoR [45] system introduces an effective mechanism for computing an embedded representation of candidate answer spans. The authors build a neural network with shared substructures of passage spans to avoid the cubically sized network input. The proposed system also augments passage word embeddings with an additional embedding representation of the question to match answer spans with the input question. An overview of the ExactMatch (EM) and F1 scores of these systems is presented in Table 2.1.

Table 2.1: Overview of state-of-the-art QA system evaluation on SQuAD database

System       EM / F1
DCN+         78.706 / 85.619
AoA Reader   77.845 / 85.297
R-NET        77.688 / 84.666
Reg-RaSoR    75.789 / 83.261


3 Aims of the Thesis

The aim of the thesis is to build a prototype of a question answering system that will be able to work on flective languages from the Slavic language family. The most important parts of the prototype will be the passage/sentence selection module and the answer extraction module. The rest of the system will use state-of-the-art approaches and the work will not focus on improvement of these parts. The proposed passage/sentence selection module of the QA system will be built upon neural network technology. The work will study the influence of syntactic information on the resulting ranking of candidate answers. Morphologically rich languages are more complex than less fine-grained ones, which means that a lot of information is stored inside the syntactic dependencies. The majority of recent QA systems operate on some kind of English knowledge base. Our system will focus on Slavonic flective languages, which are more fine-grained. We pick the Czech language from the Slavonic family for testing purposes. The following Czech-English syntax differences describe the importance of the presence of syntactic information inside our system.

3.1 Czech-English syntax differences

3.1.1 Declarative sentences (statements)

In an English text, declarative sentences have a fixed position of the subject and the verb. The subject usually precedes the verb in an English sentence, while in a Czech sentence they can be moved freely, even after the verb. The object in English sentences also has a fixed position: it is usually placed after the verb. On the other hand, the Czech sentence structure is more free. It follows certain rules (some word classes do have their fixed positions) but still is not that strict. In a Czech declarative sentence, unlike in English, it is possible for the subject and the object to switch positions without altering the meaning of the sentence.


The freedom of word ordering within a Czech sentence is supported by inflectional suffixes that determine the grammatical function of sentence elements. For example, the difference between the right and left parts of Figure 3.1 is only in the inflection of the name "Mark", yet the meanings of the sentences are completely different. Therefore, subject and object determination inside a Czech sentence has to be done more carefully, according to the syntactic structure.

Mark miluje Susanne.      Marka miluje Susanne.
[Mark loves Susanne.]     [Susanne loves Mark.]

Figure 3.1: Inflectional suffixes

3.1.2 Interrogative sentences (questions)

In English, interrogative sentences are unique in word ordering. They require the usage of an auxiliary verb (like be, do, have), and the subject-auxiliary inversion rule applies in most cases. Placing the verb before the subject is typical for an interrogative sentence, except for wh* questions where the wh* word has the function of the subject. In Czech interrogative sentences, the word order remains free (mostly the same as in declarative sentences, see Figure 3.2). The subject is usually not expressed, and in case of its presence the inversion is optional.

Začínáš být unavený?        Začínáš být unavený.
[Are you getting tired?]    [You are getting tired.]

Figure 3.2: Order in Czech interrogative sentences

The current state-of-the-art systems for English should be able to process flective languages. However, these systems can learn incorrect patterns from the flective text because of the free word ordering. If we take a closer look at the declarative sentence example (Figure 3.1), a state-of-the-art system has no syntactic information about these sentences, so it can learn that the semantics of these sentences is the same. But the syntactic trees of these sentences are different: we can see the important difference between Figure 3.3 and Figure 3.4, where in the first tree "Mark" is the subject while in the second it is an object.

Figure 3.3: Syntactic tree for "Mark miluje Susanne." [Mark loves Susanne]
Figure 3.4: Syntactic tree for "Marka miluje Susanne." [Susanne loves Mark]

3.2 Proposed system prototype

The aim of our research is to contribute to the question answering passage selection area and answer extraction area. In our first QA system prototype (described in the following chapter) we noticed that syntax-based methods prove to be effective in these two areas and yield better results [46]. We will observe on the testing data whether certain syntactic information has a positive or negative influence. We will design a heuristic for combining syntax-based methods and establish a weighting balance between those that contribute to better results. This part of the work will also study whether this syntactic information will be handled separately or can be put together into a more complex model like the attention LSTM presented previously in the text, which introduces the idea of direct influence of the question on the sentence (and vice versa) in NN training. All designed methods and models will be tested on the Czech SQAD database and compared to similar state-of-the-art QA systems. An evaluated prototype of the QA system will contribute to the QA field that aims to process flective languages.

3.3 Study plan

Spring 2018:

• implementation of a neural network into the passage/sentence selection module
• enlarging the SQAD database
• publishing new results

Autumn 2018:

• testing the new AQA prototype
  – NN with syntactic information
  – balancing of syntactic information weights

Spring 2019:

• identifying the most beneficial syntactic methods for the QA system
• publishing new results

Autumn 2019:

• experimenting with logical sentence representation
• publishing new results

Spring 2020:

• Ph.D. thesis writing

All results will be published at leading Natural Language Processing (NLP) conferences and in NLP journals.

4 Achieved Results

In this section we introduce the question answering system called AQA and the knowledge base called SQAD. Both of them have been created during the Ph.D. study and have been published in [47, 46].

4.1 SQAD database

In [47] we introduced a knowledge base called the Simple Question Answering Database (SQAD). The SQAD database uses Czech Wikipedia articles as a source of different questions and their respective answers. The current SQAD database consists of 3,301 records with the following data fields (see Figure 4.1 for an example):

• the original sentence(s) from Wikipedia

• the question that is directly answered in the text

• the expected answer to the question, as it appears in the original text

• the URL of the Wikipedia web page from which the original text was extracted

• the name of the author of this SQAD record

The SQAD database is still under development. In future work we plan to extend the current 3,301 question-answer pairs up to 9,000 pairs.

4.2 Automatic Question Answering system (AQA)

We have already introduced the first prototype of a QA system called AQA. The system incorporates NLP tools that have been developed at Masaryk University as well as state-of-the-art techniques from the QA field. These tools are mainly used for information extraction in the question processing phase and in the document retrieval module.


Original text: Létající jaguár je novela spisovatele Josefa Formánka z roku 2004.
               [Létající jaguár is a novel by the writer Josef Formánek from 2004.]
Question:      Kdo je autorem novely Létající jaguár?
               [Who is the author of the novel Létající jaguár?]
Answer:        Josef Formánek
URL:           http://cs.wikipedia.org/wiki/L%C3%A9taj%C3%ADc%C3%AD_jagu%C3%A1r
Author:        chalupnikova

Figure 4.1: Example of SQAD record

Table 4.1: Evaluation of the AQA system on the expanded SQAD v1.1 database (syntax oriented Word2Vec sentence representation)

Answer extraction     #        %
Match                 1,257    38.08 %
Partial match           270     8.18 %
Mismatch              1,774    53.74 %

In the passage/sentence retrieval module we used state-of-the-art tools such as word2vec [38] and neural networks that are enhanced by syntactic information. In [46] we identified the tree distance score metrics calculated from the syntactic tree to be efficient and to yield better results than the baseline prototype on the SQAD test set. Also, the sentence similarity built upon syntactic trees and transformed into the sentence embedded form via the word2vec method proved to be efficient in our second system prototype, which has not been published yet. The current evaluation of the AQA system on SQAD v1.1 is shown in Table 4.1.


5 Author’s publications

• MEDVEĎ, Marek and Aleš HORÁK. AQA: Automatic Question Answering System for Czech. In Sojka Petr, Horák Aleš, Kopeček Ivan, Pala Karel. Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12–16, 2016, Proceedings. Switzerland: Springer International Publishing, 2016, pp. 270–278, 9 pp. ISBN 978-3-319-45510-5. doi:10.1007/978-3-319-45510-5_31.

– Contribution 80 %. Introduction of a new syntax-based method for the Question Answering sentence selection module.

• MEDVEĎ, Marek, Aleš HORÁK and Vojtěch KOVÁŘ. Bilingual Logical Analysis of Natural Language Sentences. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016, pp. 69–78, 10 pp. ISBN 978-80-263-1095-2.

– Contribution 30 %. Introduction of multilingual sentence semantic analysis.

• MEDVEĎ, Marek, Vojtěch KOVÁŘ and Miloš JAKUBÍČEK. English-French Document Alignment Based on Keywords and Statistical Translation. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers. Berlin: Association for Computational Linguistics, 2016, pp. 728–732, 5 pp. ISBN 978-1-945626-10-4.

– Contribution 80 %. A new document alignment approach based on keyword search using the TF-IDF score and a bilingual statistical dictionary.

• BAISA, Vít, Jan MICHELFEIT, Marek MEDVEĎ and Miloš JAKUBÍČEK. European Union Language Resources in Sketch Engine. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis.


Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016, pp. 2799–2803, 5 pp. ISBN 978-2-9517408-9-1.

– Contribution 10 %. Parallel corpus development.

• MEDVEĎ, Marek and Aleš HORÁK. AST: New Tool for Logical Analysis of Sentences based on Transparent Intensional Logic. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2015, pp. 95–102, 8 pp. ISBN 978-80-263-0974-1.

– Contribution 30 %. Development of a new standalone system for sentence logical analysis based on Transparent Intensional Logic.

• MEDVEĎ, Marek, Vít BAISA and Aleš HORÁK. Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods. In Constantin Orasan and Rohit Gupta. Proceedings of The Workshop on Natural Language Processing for Translation Memories (NLP4TM). Bulgaria: INCOMA Ltd. Shoumen, 2015, pp. 31–35, 5 pp. ISBN 978-954-452-032-8.

• HORÁK, Aleš and Marek MEDVEĎ. SQAD: Simple Question Answering Database. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 121–128, 8 pp. ISSN 2336-4289.

– Contribution 80 %. Development of a new Czech database for Question Answering system evaluation.

• RYGL, Jan and Marek MEDVEĎ. Style Markers Based on Stop-word List. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 85–89, 5 pp. ISSN 2336-4289.

– Contribution 10 %. Development of an information retrieval module for text processing.


• JAKUBÍČEK, Miloš and Marek MEDVEĎ. Portable Lexical Analysis for Parsing of Morphologically-Rich Languages. In A. Horák, P. Rychlý. RASLAN 2013: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2013, pp. 21–26, 6 pp. ISBN 978-80-263-0520-0.

– Contribution 80 %. New lexical analysis work-flow with easy maintenance and new language portability.

• MEDVEĎ, Marek, Miloš JAKUBÍČEK and Vojtěch KOVÁŘ. Towards taggers and parsers for Slovak. In Zygmunt Vetulani & Hans Uszkoreit. Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań, Poland: Fundacja Uniwersytetu im. A. Mickiewicza, 2013, pp. 527–530, 4 pp. ISBN 978-83-932640-3-2.

– Contribution 80 %. Adaptation of RFTagger and two Czech parsers (Synt and SET) for the Slovak language.

• MEDVEĎ, Marek, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ and Václav NĚMČÍK. Adaptation of Czech Parsers for Slovak. In Aleš Horák, Pavel Rychlý. RASLAN 2012: Recent Advances in Slavonic Natural Language Processing. Brno, Czech Republic: Tribun EU, 2012, pp. 23–30, 8 pp. ISBN 978-80-263-0313-8.

– Contribution 80 %. Adaptation of the Czech part-of-speech tagset for Slovak and modification of the formal grammar inside Czech parsers for Slovak.


Bibliography

1. ITTYCHERIAH, Abraham; FRANZ, Martin; ZHU, Wei-Jing; RATNAPARKHI, Adwait; MAMMONE, Richard J. IBM's Statistical Question Answering System. In: TREC. 2000.

2. SUZUKI, Jun; SASAKI, Yutaka; MAEDA, Eisaku. SVM answer selection for open-domain question answering. In: Proceedings of the 19th international conference on Computational linguistics – Volume 1. 2002, pp. 1–7.

3. FERRUCCI, David et al. Building Watson: An overview of the DeepQA project. In: 2010, vol. 31, pp. 59–79. No. 3.

4. TAN, Ming; XIANG, Bing; ZHOU, Bowen. LSTM-based Deep Learning Models for non-factoid answer selection. CoRR. 2015, vol. abs/1511.04108. Available also from: http://arxiv.org/abs/1511.04108.

5. SANTOS, Cicero Nogueira dos; TAN, Ming; XIANG, Bing; ZHOU, Bowen. Attentive Pooling Networks. CoRR. 2016, vol. abs/1602.03609. Available also from: http://arxiv.org/abs/1602.03609.

6. BAUDIŠ, Petr. YodaQA: a modular question answering system pipeline. In: POSTER 2015 – 19th International Student Conference on Electrical Engineering. 2015, pp. 1156–1165.

7. GREEN Jr., Bert F.; WOLF, Alice K.; CHOMSKY, Carol; LAUGHERY, Kenneth. Baseball: An Automatic Question-answerer. In: Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference. Los Angeles, California: ACM, 1961, pp. 219–224. IRE-AIEE-ACM '61 (Western). Available from DOI: 10.1145/1460690.1460714.

8. WEIZENBAUM, Joseph. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM. 1966, vol. 9, no. 1, pp. 36–45.

9. KATZ, Boris. Using English for indexing and retrieving. 1988.


10. DERICI, Caner; ÇELIK, Kerem; KUTBAY, Ekrem; AYDIN, Yiğit; GÜNGÖR, Tunga; ÖZGÜR, Arzucan; KARTAL, Günizi. Question Analysis for a Closed Domain Question Answering System. In: GELBUKH, Alexander (ed.). Computational Linguistics and Intelligent Text Processing: 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part II. Cham: Springer International Publishing, 2015. ISBN 978-3-319-18117-2. Available from DOI: 10.1007/978-3-319-18117-2_35.

11. GUPTA, Poonam; GUPTA, Vishal. A survey of text question answering techniques. International Journal of Computer Applications. 2012, vol. 53, no. 4.

12. FOUCAULT, Nicolas; ADDA, Gilles; ROSSET, Sophie. Language Modeling for Document Selection in Question Answering. In: RANLP. 2011, pp. 716–720.

13. AJITKUMAR M. PUNDGE, Khillare S.A.; MAHENDER, C. Namrata. Question Answering System, Approaches and Techniques: A Review. International Journal of Computer Applications. 2016, vol. 141, no. 3, pp. 34–39. ISSN 0975-8887. Available from DOI: 10.5120/ijca2016909587.

14. CUI, Yiming; CHEN, Zhipeng; WEI, Si; WANG, Shijin; LIU, Ting; HU, Guoping. Attention-over-Attention Neural Networks for Reading Comprehension. CoRR. 2016, vol. abs/1607.04423. Available also from: http://arxiv.org/abs/1607.04423.

15. NATURAL LANGUAGE COMPUTING GROUP, Microsoft Research Asia. R-NET: Machine Reading Comprehension with Self-Matching Networks. In: ACL. 2017.

16. XIONG, Caiming; ZHONG, Victor; SOCHER, Richard. Dynamic Coattention Networks For Question Answering. CoRR. 2016, vol. abs/1611.01604. Available also from: http://arxiv.org/abs/1611.01604.

17. UNGER, Christina; BÜHMANN, Lorenz; LEHMANN, Jens; NGONGA NGOMO, Axel-Cyrille; GERBER, Daniel; CIMIANO, Philipp. Template-based Question Answering over RDF Data. In: Proceedings of the 21st International Conference on World Wide Web. Lyon, France: ACM, 2012, pp. 639–648. WWW '12. ISBN 978-1-4503-1229-5. Available from DOI: 10.1145/2187836.2187923.


18. SAGGION, Horacio; GAIZAUSKAS, Robert; HEPPLE, Mark; ROBERTS, Ian; GREENWOOD, Mark A. Exploring the performance of boolean retrieval strategies for open domain question answering. In: Proc. of the IR4QA Workshop at SIGIR. 2004.

19. EDUARD, Hovy; LAURIE, Gerber; ULF, Hermjakob; MICHAEL, Junk; CHIN-YEW, Lin. Question Answering in Webclopedia. In: Proceedings of The Ninth Text REtrieval Conference (TREC 2000). 2000.

20. FERRUCCI, D. A. Introduction to "This is Watson". IBM Journal of Research and Development. 2012, vol. 56, no. 3.4, pp. 1:1–1:15. ISSN 0018-8646. Available from DOI: 10.1147/JRD.2012.2184356.

21. QUARTERONI, S.; MANANDHAR, S. Designing an interactive open-domain question answering system. Natural Language Engineering. 2009, vol. 15, no. 1, pp. 73–95. Available from DOI: 10.1017/S1351324908004919.

22. DUAN, Huizhong; CAO, Yunbo; LIN, Chin-Yew; YU, Yong. Searching Questions by Identifying Question Topic and Question Focus. In: ACL. 2008, vol. 8, pp. 156–164.

23. LALLY, Adam; PRAGER, John M; MCCORD, Michael C; BOGURAEV, Branimir K; PATWARDHAN, Siddharth; FAN, James; FODOR, Paul; CHU-CARROLL, Jennifer. Question analysis: How Watson reads a clue. IBM Journal of Research and Development. 2012, vol. 56, no. 3.4, pp. 2–1.

24. MARCIŃCZUK, Michał; RADZISZEWSKI, Adam; PIASECKI, Maciej; PIASECKI, Dominik; PTAK, Marcin. Evaluation of baseline information retrieval for Polish open-domain Question Answering system. In: Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013. 2013, pp. 428–435.

25. BOLLACKER, Kurt; EVANS, Colin; PARITOSH, Praveen; STURGE, Tim; TAYLOR, Jamie. Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, pp. 1247–1250.

26. AUER, Sören; BIZER, Christian; KOBILAROV, Georgi; LEHMANN, Jens; CYGANIAK, Richard; IVES, Zachary. DBpedia: A Nucleus for a Web of Open Data. In: ABERER, Karl et al. (eds.). The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 722–735. ISBN 978-3-540-76298-0. Available from DOI: 10.1007/978-3-540-76298-0_52.

27. VRANDEČIĆ, Denny; KRÖTZSCH, Markus. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM. 2014, vol. 57, no. 10, pp. 78–85. ISSN 0001-0782. Available from DOI: 10.1145/2629489.

28. YAO, Xuchen; VAN DURME, Benjamin. Information Extraction over Structured Data: Question Answering with Freebase. In: ACL (1). 2014, pp. 956–966.

29. DONG, Li; WEI, Furu; ZHOU, Ming; XU, Ke. Question Answering over Freebase with Multi-Column Convolutional Neural Networks. In: ACL (1). 2015, pp. 260–269.

30. BERANT, Jonathan; CHOU, Andrew; FROSTIG, Roy; LIANG, Percy. Semantic Parsing on Freebase from Question-Answer Pairs. In: EMNLP. 2013, vol. 2, p. 6. No. 5.

31. UNGER, Christina; BÜHMANN, Lorenz; LEHMANN, Jens; NGONGA NGOMO, Axel-Cyrille; GERBER, Daniel; CIMIANO, Philipp. Template-based Question Answering over RDF Data. In: Proceedings of the 21st International Conference on World Wide Web. Lyon, France: ACM, 2012, pp. 639–648. WWW '12. ISBN 978-1-4503-1229-5. Available from DOI: 10.1145/2187836.2187923.

32. UNGER, Christina; FORASCU, Corina; LOPEZ, Vanessa; NGONGA NGOMO, Axel-Cyrille; CABRIO, Elena; CIMIANO, Philipp; WALTER, Sebastian. Question Answering over Linked Data (QALD-4). In: CAPPELLATO, Linda; FERRO, Nicola; HALVEY, Martin; KRAAIJ, Wessel (eds.). Working Notes for CLEF 2014 Conference. Sheffield, United Kingdom, 2014. Available also from: https://hal.inria.fr/hal-01086472.

33. VOORHEES, Ellen M; TICE, Dawn M. Building a question answering test collection. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. 2000, pp. 200–207.


34. YANG, Yi; YIH, Wen-tau; MEEK, Christopher. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In: EMNLP. 2015, pp. 2013–2018.

35. RAJPURKAR, Pranav; ZHANG, Jian; LOPYREV, Konstantin; LIANG, Percy. SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR. 2016, vol. abs/1606.05250. Available also from: http://arxiv.org/abs/1606.05250.

36. HERMANN, Karl Moritz; KOČISKÝ, Tomáš; GREFENSTETTE, Edward; ESPEHOLT, Lasse; KAY, Will; SULEYMAN, Mustafa; BLUNSOM, Phil. Teaching Machines to Read and Comprehend. CoRR. 2015, vol. abs/1506.03340. Available also from: http://arxiv.org/abs/1506.03340.

37. SACHAN, Mrinmaya; DUBEY, Kumar Avinava; XING, Eric P; RICHARDSON, Matthew. Learning Answer-Entailing Structures for Machine Comprehension. In: ACL (1). 2015, pp. 239–249.

38. MIKOLOV, Tomas; SUTSKEVER, Ilya; CHEN, Kai; CORRADO, Greg S; DEAN, Jeff. Distributed Representations of Words and Phrases and their Compositionality. In: BURGES, C. J. C.; BOTTOU, L.; WELLING, M.; GHAHRAMANI, Z.; WEINBERGER, K. Q. (eds.). Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 3111–3119. Available also from: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

39. MCCANDLESS, Michael; HATCHER, Erik; GOSPODNETIC, Otis. Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Greenwich, CT, USA: Manning Publications Co., 2010. ISBN 1933988177, 9781933988177.

40. STROHMAN, Trevor; METZLER, Donald; TURTLE, Howard; CROFT, W Bruce. Indri: A language model-based search engine for complex queries. In: Proceedings of the International Conference on Intelligent Analysis. 2005, vol. 2, pp. 2–6. No. 6.

41. RUMELHART, David E; HINTON, Geoffrey E; WILLIAMS, Ronald J. Learning internal representations by error propagation. 1985. Technical report. California Univ San Diego La Jolla Inst for Cognitive Science.


42. HOPFIELD, J J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences. 1982, vol. 79, no. 8, pp. 2554–2558. Available from eprint: http://www.pnas.org/content/79/8/2554.full.pdf.

43. KRIZHEVSKY, Alex; SUTSKEVER, Ilya; HINTON, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. In: PEREIRA, F.; BURGES, C. J. C.; BOTTOU, L.; WEINBERGER, K. Q. (eds.). Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105. Available also from: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

44. HOCHREITER, Sepp; SCHMIDHUBER, Jürgen. Long Short-Term Memory. Neural Comput. 1997, vol. 9, no. 8, pp. 1735–1780. ISSN 0899-7667. Available from DOI: 10.1162/neco.1997.9.8.1735.

45. LEE, Kenton; KWIATKOWSKI, Tom; PARIKH, Ankur P.; DAS, Dipanjan. Learning Recurrent Span Representations for Extractive Question Answering. CoRR. 2016, vol. abs/1611.01436. Available also from: http://arxiv.org/abs/1611.01436.

46. MEDVEĎ, Marek; HORÁK, Aleš. AQA: Automatic Question Answering System for Czech. In: Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12–16, 2016, Proceedings. Switzerland: Springer International Publishing, 2016, pp. 270–278. ISBN 978-3-319-45510-5. Available from DOI: 10.1007/978-3-319-45510-5_31.

47. HORÁK, Aleš; MEDVEĎ, Marek. SQAD: Simple Question Answering Database. In: Eighth Workshop on Recent Advances in Slavonic Natural Language Processing [print version]. Brno: Tribun EU, 2014, pp. 121–128.

A Research activity

• 2017: Project TA ČR, MU Brno, Tool development for confirmation of proposed techniques

• 2017: Project MUNI/33/55939/2017, MU Brno, Tool development for evaluation of information extraction accuracy on scanned documents

• 2017: Project LINDAT-Clarin, MU Brno, Data and tool development for annotation

• 2016-2017: Project HaBit, MU Brno, Automatic annotation techniques implementation

• 2015-2016: Project GA ČR, MU Brno, Development of a tool for logical analysis

• 2014-: Lexical Computing s.r.o., Brno, Corpus development

• 2014: Slovak Academy of Sciences – Ľudovít Štúr Institute, Bratislava, Corpus development

• 2014: Project Authorship, MU Brno, System development

• 2012-2013: Student science project, MU Brno, Czech syntactic parser adaptation for Slovak


B Teaching activities

• 2015-2016: IA161 Advanced Techniques of Natural Language Processing

• 2015-2016: IB111 Foundations of Programming, practical part


C Opponent review

• Dialogový systém s učením znalostí [Dialogue system with knowledge learning], bachelor thesis, Bc. Kristína Miklášová, 2017

• Vývojové prostředí pro digitalizaci deskových her [Development environment for the digitization of board games], bachelor thesis, Bc. Martin Golomb, 2016

• Automatické zodpovídání dotazů nad filmovou databází [Automatic question answering over a film database], bachelor thesis, Bc. Veronika Aksamítová, 2016

• Automatic question generation and adaptive practice, bachelor thesis, Bc. Tomáš Effenberger, 2015


D Selected papers

English-French Document Alignment Based on Keywords and Statistical Translation

Marek Medveď, Miloš Jakubíček, Vojtěch Kovář
Lexical Computing CZ s.r.o. & Centre of Natural Language Processing, Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno
[email protected]

Abstract

In this paper we present our approach to the Bilingual Document Alignment Task (WMT16), where the main goal was to reach the best recall on extracting aligned pages within the provided data. Our approach consists of three main parts: data preprocessing, keyword extraction and text pair scoring based on keyword matching.

For text preprocessing we use the TreeTagger pipeline that contains the Unitok tool (Michelfeit et al., 2014) for tokenization and the TreeTagger morphological analyzer (Schmid, 1994).

After keyword extraction from the texts according to TF-IDF scoring, our system searches for comparable English-French pairs. Using a statistical dictionary created from a large English-French parallel corpus, the system is able to find comparable documents.

At the end this procedure is combined with the baseline algorithm and the best one-to-one pairing is selected. The result reaches 91.6% recall on the provided training data. After a deep error analysis (see Section 5) the recall reached 97.4%.

1 Introduction

In this paper we describe our approach to solving the Bilingual Document Alignment Task (WMT16). It consists of three main parts: data preprocessing, keyword extraction and text pair scoring based on keyword matching.

According to these steps, the text is divided into three main sections. Section 2 describes the data preprocessing that was crucial for keyword extraction. In the next section we describe the keyword extraction process, and Section 4 describes scoring of comparable English-French pairs. The final results on the training data are summarized in Section 5, where we also discuss errors of our system and problematic features of the provided data.

2 Preprocessing

The training and testing data were provided in the .lett format. Each .lett file consists of lines where each line contains these six parts:

• Language ID (e.g. "en")
• Mime type (always "text/html")
• Encoding (always "charset=utf-8")
• URL
• HTML in Base64 encoding
• Text in Base64 encoding

We pick up the language ID, URL and text as the input for our system. To obtain keywords for each text, our system converts plain text into a so-called vertical text, or word-per-line format. This format contains each word on a separate line together with morphological information, namely lemma (base form of the word) and morphological tag. For text tokenization we use the Unitok tool (Michelfeit et al., 2014) that splits sentences into tokens according to a predefined grammar. Unitok has a special grammar model for each language that was created using information extracted from large corpora. An example of Unitok output is the first column of Figure 1. The Unitok output is enhanced by a sentence boundary recognizer (we use <s> and </s> for marking sentence boundaries).

After tokenization and sentence boundary detection, lemmatization and morphological analysis follows. For both we use TreeTagger

Figure 1 contains an example of a morphologically analyzed sentence in the vertical format. Unitok and TreeTagger, together with sentence boundary detection and a few other small pre- and post-processing scripts, form the TreeTagger pipeline that is used in the Sketch Engine (Kilgarriff et al., 2014) corpus query and management system.

    word       tag    lemma
    A          DT     a
    web        NN     web
    page       NN     page
    is         VBZ    be
    a          DT     a
    web        NN     web
    document   NN     document
    .          SENT   .

Figure 1: TreeTagger morphological analysis

3 Keyword Extraction

In the previous section, we described the text preprocessing needed for the next part of our system, the keyword extraction.
The lemma (base form) information from the morphological analysis was used for computing "keyness", or specificity scores, for each word in the text. For this, we used three different variants of the standard TF-IDF score (Equations 1, 2 and 3, which differ only in the TF weight) and the Simple math score (Kilgarriff, 2009) used for keyword extraction in Sketch Engine (Equation 4), a variant of statistic that chooses keywords according to the rule "word W is N times as frequent in document/corpus X vs. document/corpus Y":

    key_t = 1 * log(N / n_t)                        (1)
    key_t = (1 + log(f_{t,d})) * log(N / n_t)       (2)
    key_t = (f_{t,d} / f_d) * log(N / n_t)          (3)
    key_t = (fpm_{t,d} + 1) / (fpm_{t,ref} + 1)     (4)

Legend:

• N: number of documents in the corpus
• n_t: number of documents containing a particular word (token) t
• f_{t,d}: frequency of token t in document d
• f_d: size (length) of document d
• fpm_{t,d}: frequency per million of token t in document d
• fpm_{t,ref}: frequency per million of token t in a reference corpus (a large, representative sample of general language)

As reference corpora, the TenTen web corpora in Sketch Engine for English and French were used (Jakubíček et al., 2013), in particular enTenTen 2013 and frTenTen 2012.
Sometimes the TF-IDF scoring can score some of the most common words (like "the", "a", ...) very high. These so-called stop words do not have any value when finding a match between two texts, as practically all of the texts will contain them. Therefore, we created stop-word lists for English and French (from the enTenTen and frTenTen corpora) that filter out these most frequent words so they are never considered keywords.
As we will see, Equation 3 gives the best results on the training data, therefore we chose it for the final evaluation.

4 Scoring

After obtaining the keyword list from each text, the final step was to find matches between the English and French texts.
We used the top 100 keywords from each text (this number was estimated during the experiments). Then we consulted a statistical dictionary which contains the 10 most probable French translations for each English lemma (see below for more information about this dictionary).
We translated the English keywords into all of their French variants, and intersected this list of translations with the keyword lists extracted from all of the French documents. The French document with the biggest intersection was selected as the best candidate.
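To make Sections 3 and 4 concrete, the following sketch computes Equation 3 keyness scores and the keyword-intersection match. It is an illustration under our own data-structure assumptions, not the authors' implementation (which works on lemmatized vertical text and the logDice dictionary of Section 4.1):

    import math
    from collections import Counter

    def keywords(lemmas, doc_freq, n_docs, stopwords, top_n=100):
        # Equation 3: key_t = (f_{t,d} / f_d) * log(N / n_t)
        f_d = len(lemmas)
        scores = {t: (f_td / f_d) * math.log(n_docs / doc_freq.get(t, 1))
                  for t, f_td in Counter(lemmas).items() if t not in stopwords}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    def best_french_match(en_keywords, fr_keywords_by_url, dictionary):
        # Translate the English keywords into all their French variants
        # (up to 10 per lemma) and pick the French document whose keyword
        # set has the biggest intersection with the translations.
        translations = {fr for en in en_keywords for fr in dictionary.get(en, ())}
        return max(fr_keywords_by_url,
                   key=lambda url: len(translations & fr_keywords_by_url[url]),
                   default=None)

    # fr_keywords_by_url: {url: set of that document's top 100 keywords}
    # dictionary:         {english_lemma: list of up to 10 french lemmas}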

This procedure was combined with the baseline algorithm, which is based on finding language identification in the URLs of the documents (it iterates through all URLs, searches for language identifiers inside them and then produces pairs of URLs that have the same language identifiers): firstly, the baseline was applied, then (if no matching document was found) the matching by keywords was performed. The data processing flow is shown in Figure 2.

Figure 2: System data flow

4.1 Statistical translation dictionary

Sentence alignment in some of the available parallel corpora enables us to compute various statistics over the number of aligned pairs, and to quantify the probability (or another metric) that word X translates to word Y, for each pair of words in the corpus. The procedure is similar to training a translation model in statistical machine translation (Och and Ney, 2003). Our implementation uses the logDice association score (Rychlý, 2008), which is the same measure that is used for scoring collocational strength in word sketches, the key feature of the Sketch Engine system. It depends on:

• frequency of co-occurrence of the two words (e.g. "chat" and "cat") – the higher this frequency, the higher the resulting score; co-occurrence here means that the words occurred in a pair of aligned sentences

• standalone frequencies of the two words – the higher these frequencies, the lower the resulting score

By computing these scores for all word pairs across the corpus, we are able to list the strongest "translation candidates" for each word according to the score; for our purposes, we store the 10 best candidates. The procedure is computationally demanding – quadratic in the number of types (different words) in the corpus – and we exploit an algorithm for computing bi-grams to make it feasible even for very large corpora.
The statistical dictionary for this task was extracted from the English-French Europarl 7 corpus (Koehn, 2005).
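A simplified sketch of the dictionary extraction with logDice scores as described above; co-occurrence is counted over aligned sentence pairs, and the plain quadratic loop stands in for the authors' optimized bi-gram algorithm:

    import math
    from collections import Counter
    from itertools import product

    def translation_candidates(aligned_pairs, top_n=10):
        # aligned_pairs: iterable of (english_lemmas, french_lemmas)
        # per aligned sentence pair.
        f_en, f_fr, f_co = Counter(), Counter(), Counter()
        for en_sent, fr_sent in aligned_pairs:
            f_en.update(set(en_sent))
            f_fr.update(set(fr_sent))
            # every EN-FR lemma pair in an aligned sentence pair co-occurs once
            f_co.update(product(set(en_sent), set(fr_sent)))

        by_en = {}
        for (e, f), co in f_co.items():
            # logDice = 14 + log2(2*f_xy / (f_x + f_y))   (Rychlý, 2008)
            score = 14 + math.log2(2 * co / (f_en[e] + f_fr[f]))
            by_en.setdefault(e, []).append((score, f))
        return {e: [f for _, f in sorted(cands, reverse=True)[:top_n]]
                for e, cands in by_en.items()}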

5 Evaluation

The goal of this task was to find English-French URL pairs. Some training pairs were provided by the authors of the task. Our procedure does not include any learning from the training data, therefore we can use them for quite a reliable evaluation.
With regard to that data, our solution reached 91.6% recall using the most successful TF-IDF variant, Equation 3; the results for the other equations are comparable and are summarized in Table 1. If we did not include the baseline algorithm in the procedure, the recall was 82%.
After a detailed error analysis we found out that the provided data contain duplicate web pages with different URLs. This is an important problem – our error analysis shows that we have found a correct document pair in many cases, but a document with a different URL (and identical text) was marked as correct in the data.

Expected: http://cineuropa.mobi/interview.aspx?lang=en&documentID=65143
          http://cineuropa.mobi/interview.aspx?lang=fr&documentID=65143
Found:    http://cineuropa.mobi/interview.aspx?documentID=65143
          http://cineuropa.mobi/interview.aspx?lang=fr&documentID=65143

Expected: http://creationwiki.org/Noah%27s_ark
          http://creationwiki.org/fr/Arche_de_No%C3%A9
Found:    http://creationwiki.org/Noah%27s_Ark
          http://creationwiki.org/fr/Arche_de_No%C3%A9

Expected: http://pawpeds.com/pawacademy/health/pkd/
          http://pawpeds.com/pawacademy/health/pkd/index_fr.html
Found:    http://pawpeds.com/pawacademy/health/pkd/index.html
          http://pawpeds.com/pawacademy/health/pkd/index_fr.html

Figure 3: Examples of false errors

    Equation   Recall in %
    1          89.2
    2          89.5
    3          91.6
    4          88.7
    Baseline   67.92

Table 1: Overall results according to the "keyness" equations.

We went through the document pairs marked as errors of our algorithm and manually evaluated them for correctness. If we exclude the false errors (correct document pairs evaluated as incorrect), the recall is 97.4%. Some examples of these URL pairs are given in Figure 3; as we can see, in many cases the duplicity is clear directly from the URL.
Unfortunately, we were unable to assess the number of duplicates in the data by the submission deadline. However, we believe it will be done, as the mentioned duplicates significantly reduce the soundness of such evaluation.

6 Conclusion

We have described a method for finding English-French web pages that are translations of each other. The method is based on statistical extraction of keywords and comparing them using a translation dictionary. The results are promising, but a detailed error analysis shows there are significant problems in the testing data, namely unmarked duplicate texts with different URLs.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2015071 and by the Grant Agency of CR within the project 15-13277S. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047.

References

Jan Michelfeit, Jan Pomikálek, Vít Suchomel. Text tokenisation using Unitok. In: 8th Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 71-75, Brno, Tribun EU, 2014.

Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, pp. 44-49, 1994.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel. The Sketch Engine: ten years on. Lexicography, pp. 7-36, 2014.

Adam Kilgarriff. Simple maths for keywords. In: Proceedings of the Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, 2009.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel. The TenTen corpus family. In: The 7th International Corpus Linguistics Conference, Lancaster, 2013.

Franz Josef Och, Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, volume 29, number 1, pp. 19-51, 2003.

Pavel Rychlý. A lexicographer-friendly association score. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, pp. 6-9, 2008.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In: MT Summit, 2005.

European Union Language Resources in Sketch Engine

Vít Baisa, Jan Michelfeit, Marek Medveď, Miloš Jakubíček
Masaryk University, Czech Republic, Brno
Lexical Computing Ltd, United Kingdom, Brighton
{vit.baisa,jan.michelfeit,marek.medved,milos.jakubicek}@sketchengine.co.uk

Abstract
Several parallel corpora built from European Union language resources are presented here. They were processed by state-of-the-art tools and made available to researchers in the Sketch Engine corpus management system. A completely new resource is introduced: the EUR-Lex corpus, one of the largest parallel corpora available at the moment, containing 840 million tokens of English and having the largest language pair (English-French) with more than 25 million aligned segments (paragraphs).

Keywords: JRC-Acquis, DCEP, DGT-TM, Europarl, EUR-Lex, Sketch Engine, parallel corpus, word sketch, parallel concordance

1. Introduction

The European Union is producing a large amount of valuable multilingual textual data every day. To be able to use it in applications, for text analysis, terminology extraction, full text search etc., it must be downloaded, converted into plain text, processed with suitable tools, aligned on the sentence level and finally made available to researchers in some standard format. In this paper we describe our experience with using several resources built from the European Union's (EU) multilingual resources, namely DCEP (Hajlaoui et al., 2014), DGT-TM (Steinberger et al., 2013) and Europarl (Koehn, 2005).

We also describe a new multilingual "EUR-Lex corpus" containing more than 840 million tokens of English. To our knowledge, it is currently the largest parallel corpus built from European language resources. The corpus was downloaded from the official website of EUR-Lex (http://eur-lex.europa.eu), which provides access to up-to-date legal documents published by the European Commission, the European Parliament, national courts, the Council of the European Union and other European institutions. The majority of recently added documents is translated into all official languages of the EU, making it a huge multilingual language resource.

Table 1 compares the mentioned language resources; JRC-Acquis 3.0 (https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis) figures are there for comparison. "Tokens" is the number of tokens (words, numbers and punctuation) in the English parts of the corpora, "Types" is the number of unique English word forms, i.e. the size of the English lexicons, the "L" column contains the number of languages included and "Format" states in which form the source data is available.

    Corpus       Tokens        Types      L   Format
    JRC-Acquis   55,537,910    N/A        22  XML
    DCEP         118,046,857   513,000    23  TXT
    DGT-TM       74,365,007    342,340    24  TMX
    Europarl     60,741,877    139,217    21  XML
    EUR-Lex      839,745,466   2,416,841  24  various

Table 1: Comparison of various EU corpora.

All mentioned corpora are available for language researchers through the Sketch Engine corpus management system (Kilgarriff et al., 2014). The EUR-Lex corpus is released in the form of gzipped archives containing a) documents with meta information in a flat XML format and b) alignment files for all language pairs. The whole gzipped dataset is over 40 GB; to obtain the data, contact us or follow the instructions at https://www.sketchengine.co.uk/eur-lex.

Table 2 contains an overview of language representation in the corpora in millions of tokens per language. The second column states the year when a particular language became an official language of the European Union; it usually corresponds to the amount of documents in the particular language, and the table is sorted by this column. ACQ stands for JRC-Acquis 3.0, CEP for the Digital Corpus of the European Parliament, EUR for Europarl and LEX for the EUR-Lex corpus.

    Language     Since  ACQ  CEP  DGT  EUR  LEX
    Dutch        1958   35   96   63   60   777
    French       1958   39   116  47   67   878
    German       1958   32   98   58   55   732
    Italian      1958   36   103  66   59   807
    Danish       1973   31   88   59   56   731
    English      1973   35   118  74   61   840
    Greek        1981   36   100  64   44   775
    Portuguese   1986   37   99   66   61   793
    Spanish      1986   39   106  69   61   831
    Finnish      1995   25   72   47   41   558
    Swedish      1995   29   86   55   52   640
    Czech        2004   23   51   57   15   500
    Estonian     2004   25   43   46   13   437
    Hungarian    2004   29   50   55   15   500
    Latvian      2004   28   48   54   14   491
    Lithuanian   2004   27   47   52   14   476
    Maltese      2004   21   46   30   —    466
    Polish       2004   30   51   58   15   511
    Slovak       2004   27   50   56   15   495
    Slovenian    2004   28   50   57   15   509
    Bulgarian    2007   16   41   33   11   457
    Irish        2007   —    2    1    —    37
    Romanian     2007   9    42   33   11   462
    Croatian     2013   —    —    5    —    156

Table 2: Representation of languages (millions of tokens).

2. DCEP

The Digital Corpus of the European Parliament (DCEP) (Hajlaoui et al., 2014) is a collection of documents published on the European Parliament's official website (http://www.europarl.europa.eu/). This corpus includes a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The latest version contains documents produced in 2001–2012.

Since the original alignments contained a lot of errors and the sentences were wrongly segmented, we created a new alignment. Instead of the HunAlign aligner (Varga et al., 2007) we used the GaChalign algorithm (https://github.com/alvations/gachalign), an implementation of the Gale-Church sentence aligner (Gale and Church, 1993).

The data has been processed automatically by Sketch Engine: the plain text data has been tokenized with Unitok (Michelfeit et al., 2014) and tagged with various tools: TreeTagger (Schmid, 1995), Hunpos (Halácsy et al., 2007) and Freeling (Carreras et al., 2004). Further processing involved collocation pattern extraction, terminology extraction, distributional thesaurus computation and other specific processing which is available in Sketch Engine for many languages (Kilgarriff et al., 2014).

3. DGT-TM

The European Commission's Directorate-General for Translation, in cooperation with the European Commission's Joint Research Centre, has created the freely available translation memory DGT-TM (Steinberger et al., 2013). The DGT-TM is stored in TMX files with segments aligned in 231 language pairs.

We have processed DGT-TM with Sketch Engine: it supports TMX import, so we just merged all the original TMX files and let Sketch Engine extract the aligned segments, tokenize and PoS-tag the texts. See Figure 2 for an example of the parallel collocation functionality in Sketch Engine.

Figure 1: Bilingual terminology candidates extracted from DGT-TM English-Spanish. [Screenshot omitted.]

Figure 2: Parallel collocation candidates for English "Commission" and the Czech equivalent "komise" derived from the DGT-English and DGT-Czech corpora in Sketch Engine. The joint grey and green columns correspond to a grammar relation (object_of, modifier and coordination) in which the collocation candidates occur in the data. The collocates in the green columns are usually translation equivalents of the collocates in the joint grey columns, e.g. inform–informovat, Electoral–volební, Presidency–předsednictví, etc. [Screenshot omitted.]

4. Europarl

The Europarl parallel corpus is a well-known resource (Koehn, 2005). It is a collection of sentence-aligned texts in 21 languages extracted from the proceedings of the European Parliament. It stands out among the other corpora provided by the EU, which contain mostly legal documents. Its primary goal is to aid statistical machine translation systems. The authors of the corpus have detected sentence boundaries in the raw transcripts and aligned the sentences using a tool based on the Gale and Church algorithm (Gale and Church, 1993).

The Europarl corpus has also been incorporated into the OPUS project, a collection of publicly available parallel corpora (Tiedemann, 2009). Thanks to this, the sentence alignment data is available from the OPUS website in the XCES format, which can be easily translated into the format used internally by Sketch Engine (pairs of structure IDs, here sentence IDs). See Figure 3 for an example of full-text parallel search in Sketch Engine using the Europarl corpus.

All the text for each of the 21 languages was processed by the most up-to-date (at the time of compilation) processing chain for each respective language, including tokenization (Michelfeit et al., 2014) and PoS tagging where available, but excluding sentence boundary detection, which was taken directly from the Europarl data. Each of the resulting 21 corpora is therefore compatible for use as a reference corpus for other corpora in Sketch Engine (including user-created corpora) of the same language. The same holds for the DCEP and DGT corpora. A reference corpus is used for comparison with a focus corpus for the extraction of keywords and terminology. Bilingual terminology (Baisa et al., 2015) can also be extracted, see Figure 1.

All of the Europarl corpora are aligned to each other, giving us a total of 210 language pairs. Each pair of corpora can be exploited to extract a statistical dictionary of words and lemmas (where available), or even term candidates. Due to the nature of the texts, the vocabulary used is relatively broad, while the quality of the data is far better than in other, bigger web-based corpora. This makes Europarl an invaluable resource for the creation of statistical dictionaries and for building translation models for statistical machine translation systems.

5. EUR-Lex corpus

EUR-Lex is an official on-line resource providing access to 1) the Official Journal of the European Union, 2) EU law (EU treaties, directives, regulations, decisions, consolidated legislation, etc.), 3) preparatory acts (legislative proposals, reports, green and white papers, etc.), 4) EU case-law (judgements, orders, etc.), 5) international agreements, 6) EFTA documents and 7) other public documents dating back to the 1950s, in 24 official EU languages. The EUR-Lex website allows querying its database in which each document has meta data ranging from unique IDs (cellar and CELEX numbering, see http://eur-lex.europa.eu/content/help/faq/intro.html#help10), dates of documents, official publication and revision dates, Eurovoc terms (http://eurovoc.europa.eu/), authors (an agent, a state) of a document, type of a document etc.

Figure 3: Parallel search in Sketch Engine for English Commission, French Commission and German Kommission, DGT.

To get all the documents, we first had to query EUR-Lex for meta data year by year, as a list of all documents in EUR-Lex is not available. From the meta data, a list of all available documents with CELEX numbers was retrieved (with all their language variants) and then all the documents were downloaded. Only documents in HTML format have been downloaded, yielding almost 7 million documents in 26 languages (Norwegian and Icelandic are represented in EUR-Lex, but we have omitted them from the final data set due to the negligible number of documents). According to the statistics at http://eur-lex.europa.eu/statistics/eu-law-statistics.html there are more PDF documents than HTML documents, but we decided to download only HTML in the first phase as HTML files are easier for further processing.

We have exploited the fact that the EUR-Lex database contains HTML documents split into fine-grained paragraphs and these paragraphs mostly correspond to each other in the different languages. This can be seen in the parallel view on the EUR-Lex website (e.g. http://eur-lex.europa.eu/legal-content/EN-ES-FR/TXT/?qid=1445777763012&uri=CELEX:32013R1303&from=EN). Sometimes the count of paragraphs is inconsistent in some language mutations, so we have corrected these using a modified Gale-Church algorithm (based on GaChalign, https://github.com/alvations/gachalign).

The resulting corpus has 3.9 million documents. Figure 4 shows the sizes of the aligned documents. The largest language pair, English-French, has 25,211,093 aligned paragraphs. All data from the JRC-Acquis corpus (Steinberger et al., 2006) should be included in the EUR-Lex corpus.

According to the copyright notice on the EUR-Lex website (http://eur-lex.europa.eu/content/legal-notice/legal-notice.html): "Except where otherwise stated, reuse of the EUR-Lex data for commercial or non-commercial purposes is authorised provided the source is acknowledged (© European Union, http://eur-lex.europa.eu/, 1998–2015)". This allows us to provide the downloaded data to researchers (see the download instructions at https://www.sketchengine.co.uk/eur-lex). Fully processed data (tokenized, PoS-tagged) is not available due to the taggers' copyright reasons, but it is available in Sketch Engine.
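For illustration, a minimal sketch of the length-based machinery behind a Gale-Church-style aligner. The constants and priors follow Gale and Church (1993); the paper's actual modified variant is not specified, so this only shows the general idea, restricted to 1-1 matches and skips:

    import math

    # Alignment-pattern priors from Gale and Church (1993)
    P_MATCH, P_SKIP = 0.89, 0.0099

    def length_cost(l1, l2, c=1.0, s2=6.8):
        # -log probability that paragraphs of l1 and l2 characters align,
        # based on the normalized length difference delta
        delta = (l2 - l1 * c) / math.sqrt(max(l1, 1) * s2)
        two_tail = 2 * (1 - 0.5 * (1 + math.erf(abs(delta) / math.sqrt(2))))
        return -math.log(max(two_tail, 1e-12))

    def align_cost(paras1, paras2):
        # Dynamic programme over 1-1 matches and 1-0/0-1 skips (simplified)
        n, m = len(paras1), len(paras2)
        INF = float("inf")
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i and j:
                    c = length_cost(len(paras1[i-1]), len(paras2[j-1]))
                    D[i][j] = min(D[i][j], D[i-1][j-1] + c - math.log(P_MATCH))
                if i:
                    D[i][j] = min(D[i][j], D[i-1][j] - math.log(P_SKIP))
                if j:
                    D[i][j] = min(D[i][j], D[i][j-1] - math.log(P_SKIP))
        return D[n][m]   # backtracking to recover the pairing is omitted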

Figure 4: Aligned paragraph counts in the EUR-Lex corpus for all pairs of the 24 languages, in millions (M) and thousands (k); darker means larger alignment. [The full 24 × 24 matrix is not reproduced here; the largest pair, English-French, has 25M aligned paragraphs, while the smallest pairs involve Irish and Croatian.]

Since EUR-Lex documents contain rich meta data, various aspects can be studied in Sketch Engine: e.g. one can study the trends in keywords and translations over the last 60 years, discover language characteristics per EU body, extract domain terminologies using the EuroVoc thesaurus etc. We will leave the enumeration of all the possibilities to the reader.

6. Conclusion

We have described a few European multilingual resources and how we made them available in the corpus manager Sketch Engine for lexicographers, linguists and language researchers in general. This allows them to search the full text data using a rich query language which is more suitable for linguistically motivated searches than the full text search engine used on the official EUR-Lex web page. Users can also use various statistics derived from the data, e.g. a distributional thesaurus, automatic collocations, keyword and terminology candidates, bilingual terminology candidates, parallel collocates and much more.

We have also described a new resource, the EUR-Lex corpus, which is to our knowledge the largest resource built from EU data at the moment. Thanks to the permissive data policy of the EU we can provide the full data to researchers.

In the future, we plan to download and process EUR-Lex documents also in other formats (PDF, DOCX). This should yield even more parallel data. Another way of getting more parallel data is simply to repeat the whole processing once every few months, since the EU Publication Office adds new documents to EUR-Lex every day.

7. Acknowledgements

This work has been partly supported by the Grant Agency of CR within the project 15-13277S. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047, the LINDAT/CLARIN project LM2010013 and within the Specific University Research.

8. References

Baisa, V., Ulipová, B., and Cukr, M. (2015). Bilingual terminology extraction in Sketch Engine. In Horák, A. and Rychlý, P., editors, Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, pages 61–67, Brno. Tribun EU.

Carreras, X., Chao, I., Padró, L., and Padró, M. (2004). FreeLing: An open-source suite of language analyzers. In LREC.

Gale, W. A. and Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.

    Document type               Docs    | Author                       Docs    | EuroVoc               Docs   | Year  Docs
    Written question            156,744 | European Commission          150,545 | State aid             18,239 | 2013  24,978
    Regulation                  59,758  | European Parliament          104,323 | European Commission   18,057 | 2011  24,852
    judicial information        36,964  | Provisional data             53,230  | information transfer  15,778 | 2012  22,879
    Decision                    20,400  | Council of the EU            31,453  | control of State aid  14,096 | 2010  22,266
    Question at Question Time   19,027  | Court of Justice             22,397  | import                14,074 | 2007  20,216
    Communication               16,384  | Court of Justice of the EU   14,637  | econ. concentration   12,620 | 2008  19,238
    Consolidated text           16,060  | Court of First Instance      12,201  | merger control        12,558 | 2009  18,088
    decision w/out addressee    13,718  | General Court                9,056   | originating product   11,896 | 2006  17,822
    Judgment                    13,709  | EES Committee                4,524   | Italy                 11,831 | 2003  16,587
    Proposal for a regulation   8,608   | United Kingdom               3,995   | Spain                 10,882 | 2005  16,407
    Opinion                     7,774   | EEA Joint Committee          2,880   | annul. of EC decis.   10,698 | 2000  16,248
    National exec. measures     7,745   | Civil Service Tribunal       2,830   | EU Member State       10,562 | 2001  16,044
    Information                 7,314   | Malta                        2,184   | Germany               10,274 | 1996  15,293
    Notice                      7,306   | The Member States            1,978   | interpr. of the law   10,030 | 2004  14,974
    Adv. General's Opinion      7,155   | Ireland                      1,729   | EU programme          9,760  | 1998  14,946
    Treaty                      5,808   | National Courts              1,674   | export refund         9,337  | 1997  14,929
    Own-initiative resolution   5,460   | Committee of the Regions     1,364   | award of contract     9,258  | 2014  14,868
    Report                      4,454   | European Court of Auditors   1,248   | third country         9,210  | 2002  14,868
    Implementing regulation     4,205   | The 12 Member States         1,182   | trademark law         9,110  | 1995  14,319
    proposal for a decision     4,066   | EFTA Surveillance Authority  985     | European trademark    8,912  | 1999  12,667
    Info                        4,066   | European Central Bank        847     | environ. protection   8,693  | 1992  10,768
    Directive                   3,795   | KOSTOPOULOS                  807     | EU financing          8,212  | 1993  9,693
    Order                       3,407   | Others                       686     | import (EU)           8,060  | 1986  9,265
    Own-initiative report       3,054   | Gov. representatives         639     | EU aid                8,015  | 1990  9,259
    Opinion proposing amend.    3,039   | The 6 Member States          622     | France                7,980  | 1985  9,224

Table 3: Example of meta data in the English part of the EUR-Lex corpus, sorted by document frequency.

Hajlaoui, N., Kolovratnik, D., Väyrynen, J., Steinberger, R., and Varga, D. (2014). DCEP - Digital Corpus of the European Parliament. In LREC, pages 3164–3171.

Halácsy, P., Kornai, A., and Oravecz, C. (2007). HunPos: an open source trigram tagger. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 209–212. Association for Computational Linguistics.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., and Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1):7–36.

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Michelfeit, J., Pomikálek, J., and Suchomel, V. (2014). Text tokenisation using Unitok. In 8th Workshop on Recent Advances in Slavonic Natural Language Processing, pages 71–75, Brno. Tribun EU.

Schmid, H. (1995). TreeTagger: a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 43:28.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.

Steinberger, R., Eisele, A., Klocek, S., Pilos, S., and Schlüter, P. (2013). DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Tiedemann, J. (2009). News from OPUS: a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237–248.

Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., and Trón, V. (2007). Parallel corpora for medium density languages. Amsterdam Studies in the Theory and History of Linguistic Science, Series 4, 292:247.

AQA: Automatic Question Answering System for Czech

Marek Medveď and Aleš Horák

Faculty of Informatics, Natural Language Processing Centre, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
{xmedved1,hales}@fi.muni.cz

Abstract. Question answering (QA) systems have become popular nowadays; however, the majority of them concentrate on the English language and most of them are oriented to a specific limited problem domain. In this paper, we present a new question answering system called AQA (Automatic Question Answering). AQA is an open-domain QA system which allows users to ask all common questions related to a selected text collection. The first version of the AQA system is developed and tested for the Czech language, but we also plan to include more languages in future versions. The AQA strategy consists of three main parts: question processing, answer selection and answer extraction. All modules are syntax-based with advanced scoring obtained by a combination of TF-IDF, the tree distance between the question and candidate answers and other selected criteria. The answer extraction module utilizes a named entity recognizer which allows the system to catch entities that are most likely to answer the question. Evaluation of the AQA system is performed on the previously published Simple Question-Answering Database (SQAD) with more than 3,000 question-answer pairs.

Keywords: Question Answering · AQA · Simple Question Answering Database · SQAD · Named entity recognition

1 Introduction

The number of searchable pieces of new information increases every day. Looking up a concrete answer to an asked question simply by information retrieval techniques thus becomes difficult and extremely time consuming. That is why new systems devoted to the Question Answering (QA) task are being developed nowadays [2,9,10]. The majority of them concentrate on the English language and/or rely on a specific limited problem domain or knowledge base.
This leads us to two main drawbacks of such QA systems: firstly, the specific limited problem domain and knowledge base always pre-limit the answers to the knowledge stored in the system, making it possibly useful only for people working in this specific domain but useless for common people in daily usage. The second problem arises when going multilingual, since in most cases the system functionality is not directly transferable to other languages without a decrease in accuracy.
In this paper, we present a new open-domain syntax-based QA system, named AQA. The first version of the system is developed and tested on the Czech language and uses the latest approaches from the QA field. The evaluation was performed on the Simple Question Answering Database (SQAD) [4].

2 The AQA System

In the following text we briefly describe the input format, the database format and the main AQA modules that are necessary to extract a concrete answer to a given question.

2.1 Input

Since the AQA system is aimed at processing texts in morphologically rich languages, it supports two input formats for the user question. The first is a plain text question without any structure or additional information; the second one is the vertical format (one text token per line with multiple attributes separated by tabs) with words enriched by lexical and morphological information, see Fig. 1.

Fig. 1. Question "Kdo je autorem novely Létající jaguár?" (Who wrote the story called Flying jaguar?) in the vertical format.

Internally, the syntax-based AQA system works with syntactic trees, which means that the plain text input form is automatically processed by a morphological tagger (in the current version, we use the morphological analyser Majka [5,11] disambiguated by the DESAMB [7] tagger) and transformed into the (internal representation of) the vertical format.
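A small sketch of reading the vertical input format (one token per line, tab-separated attributes); the word/lemma/tag column order is our assumption for illustration:

    def parse_vertical(text):
        # Parse vertical-format input into a list of annotated tokens,
        # skipping structural tags such as <s> and </s>.
        tokens = []
        for line in text.splitlines():
            if not line or line.startswith("<"):
                continue
            word, lemma, tag = line.split("\t")[:3]
            tokens.append({"word": word, "lemma": lemma, "tag": tag})
        return tokens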

The morphologically annotated question input (obtained from the user or created by the tagger) is sent to the Question processor module that extracts all important information about the question. The Question processor module is described in detail in Sect. 2.3.

2.2 Knowledge Base

To find an appropriate answer to the given question, the AQA system uses a knowledge database that can be created from any natural language text (in Czech, for now). The text can again be in the form of plain text or vertical text, the same as the input question. To create a new knowledge database, the user enters just the source of the input texts (file, directory, ...) and the AQA system automatically processes the texts and gathers the information to be stored in the knowledge base. The texts are processed in several steps and the following features are extracted:

– Syntactic tree: each sentence in the input text is parsed by a syntactic analyser and the resulting syntactic tree is stored in the database. We use the SET (Syntactic Engineering Tool) parser [6] in the current version. The SET parsing process is based on pattern matching rules for link extraction with probabilistic tree construction. For the purposes of AQA, the SET grammar was enriched by answer types that serve as hints to the Answer selection module (see Sect. 2.3) for matching among the candidate answers.
– Tree distance: for each syntactic tree, the system computes tree distance mappings between each word pair in a noun phrase. This information is used in the Answer selection module (see Sect. 2.3) to pick the correct answer from multiple candidate answers.
– Birth/death date, birth/death place: this feature is specific to sentences where the birth/death date and birth/death place are present in the text just after a personal name, usually in parentheses (see the example in Fig. 2).
– Phrase extraction: within the parsing process, the SET parser can provide a list of (noun, prepositional, ...) phrases present in the sentence. This information is used in the Answer extraction module (see Sect. 2.3) when the system searches for a match between the question and the (parts of the) knowledge base sentences.
– Named entity extraction: as mentioned above, the general SET grammar was modified to include, within the resulting syntactic tree, also the information about what kind of questions can be answered by each particular sentence. This information is also supplemented with a list of named entities found in the sentence. The AQA system recognizes three named entity types: a place, an agent and an art work. In the current version, AQA uses the Stanford Named Entity Recognizer [3] with a model trained on the Czech Named Entity Corpus 1.1 [8] and the Czech DBpedia [1] data.
– Inverted index: an inverted word index is created for fast search purposes; this serves mainly as a technical solution (a minimal sketch of this structure follows the list below).
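Of the extracted features, the inverted index is the easiest to illustrate; a minimal sketch under our own layout assumptions, not the AQA internals:

    from collections import defaultdict

    class InvertedIndex:
        # Maps a lemma to the set of knowledge-base sentence IDs containing it.
        def __init__(self):
            self.postings = defaultdict(set)

        def add_sentence(self, sent_id, lemmas):
            for lemma in lemmas:
                self.postings[lemma].add(sent_id)

        def candidates(self, question_lemmas):
            # Sentences sharing at least one lemma with the question,
            # ordered by the number of shared lemmas (most shared first).
            hits = defaultdict(int)
            for lemma in question_lemmas:
                for sent_id in self.postings.get(lemma, ()):
                    hits[sent_id] += 1
            return sorted(hits, key=hits.get, reverse=True)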

Sir Isaac Newton (January 4 1643 in Woolsthorpe – March 31 1727 in London) was an English physicist, mathematician …

Fig. 2. Example sentence for birth/death date and place

For development and evaluation purposes, the AQA knowledge base was trained on the SQAD 1.0 database [4] that was manually created from Czech Wikipedia web pages and contains 3,301 question-answer pairs. The answer is stated both in the form of the direct answer (noun, number, noun phrase, ...) and in the form of the answer context (several sentences) containing the direct answer. The results of an evaluation of AQA with the SQAD database are explicated in Sect. 3.

2.3 The AQA Modules

The AQA system consists of three main modules: the Question processor, the Answer selection module and the Answer extraction module. Figure 3 presents a schematic graphical representation of the AQA system parts.

Fig. 3. The AQA system modules.

The Question Processor. The first step after the system receives the input question in the form of morphologically annotated tokens (either created by the tagger or obtained in the vertical format from the user) consists in extracting the following features:

– Question reformulation: the original question is reformulated (if possible) into a "normalized" question form, which allows to simplify the answer matching process later. For example, if the user asks a question starting with "Jak se jmenuje ... osoba ..." (What is the name of ... a person ...), the system reformulates the question to "Kdo je ..." (Who is ...). The meaning of the sentence remains the same, but for further processing the reformulation directly corresponds to searching for a personal name or denotation among possible answers.
– Question syntactic tree: the same as the AQA system uses syntactic trees in the knowledge base creation process (Answer selection and extraction modules), the system automatically creates a syntactic tree for each question. The question syntactic tree is provided by the same parser as used for the knowledge base processing.
– Question type extraction: due to the enriched parsing grammar, the question type can be extracted during the syntactic tree creation. The question type itself is determined by the sentence structure and the specific pronouns present in the question. For example, when the user asks a question such as "Kdo byl ..." (Who was ...), the system assigns this question the WHO question type. The answer selection process then uses this information to filter the matching question and answer types.
– Question main subject and main verb extraction: the AQA system tries to find in the syntactic tree the main subject and the main verb of the question. This information is important for the answer extraction process. According to this information, the system knows which subject and verb should be present in the candidate answer (if the system picks up more than one). A rule-based sketch of the reformulation and type assignment follows this list.
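The following sketch illustrates the reformulation and type assignment; the surface patterns and type names here are illustrative assumptions, since AQA derives the question type from the enriched parse, not from regular expressions:

    import re

    # Illustrative surface patterns for Czech questions
    RULES = [
        (r"^jak se jmenuje\b", "Kdo je", "WHO"),   # "What is the name of" -> "Who is"
        (r"^kdo\b",            None,     "WHO"),
        (r"^kdy\b",            None,     "WHEN"),
        (r"^kde\b",            None,     "WHERE"),
        (r"^kolik\b",          None,     "HOW_MANY"),
    ]

    def analyse_question(question):
        # Return the (possibly reformulated) question and its question type.
        q = question.strip()
        for pattern, replacement, qtype in RULES:
            if re.match(pattern, q, flags=re.IGNORECASE):
                if replacement:                    # normalize the wording
                    q = re.sub(pattern, replacement, q, count=1,
                               flags=re.IGNORECASE)
                return q, qtype
        return q, "UNKNOWN"

    # analyse_question("Kdo byl prvním prezidentem?")
    # -> ("Kdo byl prvním prezidentem?", "WHO")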

Answer Selection Module. After extracting all the necessary information from the question, the Answer selection module is responsible for recognizing all possible candidate answers. This module communicates with the knowledge database and collects all pieces of text (sentence or paragraph) that contain a possible answer. The selection decision rules are based on the question type, the question main subject, the question main verb and specific words present in the question syntactic tree. Each of the candidate answer texts is then assigned a score and the Answer selection module picks up the first five top-rated answers.
The ranking score for each candidate answer is a combination of a TF-IDF (Term Frequency-Inverse Document Frequency) score and the tree distance between the words in noun phrases in the answer and the question. The TF-IDF score consists of two parts:

– TF-IDF match: for each word (nouns, adjectives, numerals, verbs, adverbs) that matches in the question-answer pair, the TF-IDF score is computed as

    tf_idf_match = (1 + log(tf)) * log(idf)

– TF-IDF mismatch: the TF-IDF score for the rest of the words that did not match in the question-answer pair.

Then the resulting TF-IDF score is determined as:

    tf_idf_res = tf_idf_match − tf_idf_mismatch

The final score of a candidate answer with respect to the question is finally supplemented by the tree distance calculated between the words in the question and answer noun phrases. If the tree distance in the question-answer pair noun phrase is equal, it does not influence the final score. But when the tree distance is not equal, the final score is modified as follows:

    final_score = tf_idf_res − |TreeDistance_q − TreeDistance_a|

The five best scored candidate answers are then sent to the Answer extraction module that extracts the particular parts of each sentence that will be considered as the required answer.
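The three formulas combine as in the following sketch; term frequencies, IDF values and tree distances are assumed to be precomputed, and the log(idf) factor is kept exactly as the paper states it:

    import math

    def answer_score(q_terms, a_terms, idf, tree_dist_q, tree_dist_a):
        # q_terms, a_terms: dicts mapping lemma -> term frequency in the
        # question and in the candidate answer respectively
        matched = q_terms.keys() & a_terms.keys()
        mismatched = a_terms.keys() - q_terms.keys()

        def tf_idf(words):
            return sum((1 + math.log(a_terms[w])) * math.log(idf[w])
                       for w in words)

        tf_idf_res = tf_idf(matched) - tf_idf(mismatched)
        # equal tree distances leave the score untouched,
        # a difference penalizes the candidate
        return tf_idf_res - abs(tree_dist_q - tree_dist_a)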

The Answer Extraction Module. The final step of the AQA processing is accomplished by the Answer extraction module where a particular part of each sentence is extracted and declared the final answer. This part of the system works only with the five best scored candidate answers that were picked by the Answer selector module. The final answer to the asked question is extracted according to the following three factors:

– Question focus: as mentioned in Sect. 2.3, each question receives a question type according to the sentence structure and the words present in the question. The question type is then mapped to one or more named entity types.
– Answer named entities: within the knowledge base creation process, AQA extracts the supported named entities of three types, i.e. Place, Agent and ArtWork. The Answer extraction module then maps the question focus to the extracted answer named entities. In this process, AQA also excludes named entities that are present in the question to avoid an incorrect answer. This is the first attempt to get a concrete answer.
– Answer noun phrases: in case the previous step fails to find an answer, the system selects one phrase from the phrase list as the answer. The failure can be caused by two reasons. First, the question focus can be a date or a number. In this case, the Answer extraction module returns a noun phrase where the question main subject related to a number is present, or it returns a birth/death date if the question asks about the birth or death of some person (the birth/death date is stored in the database alongside the sentence tree, named entities and phrases), see Sect. 2.2. The second reason why the question focus mapping to answer named entities can fail is a missing named entity in the candidate answer. In this case, the system checks the remaining candidate answers, and if it does not succeed, the Answer extraction module returns the noun phrase that contains the main question subject.

3 Evaluation

Within the evaluation process, we have used the SQAD v1.0 database containing 3,301 entries of a question, an answer and an answer text. The AQA knowledge base was built from all the answer texts from SQAD, and all 3,301 questions were answered by the AQA system. The results were evaluated on two levels (see Table 1). The first level evaluates the Answer selection module and the second level evaluates the Answer extraction module. There are three possible types of a match between the AQA answer and the expected answer:

– a (full) Match:
  • for the Answer selection module: the module picked up the correct sentence/paragraph,
  • for the Answer extraction module: the first provided answer is equal to the expected answer;
– a Partial match:
  • for the Answer selection module: a correct sentence/paragraph appears in the Top 5 best scored answers that were selected,
  • for the Answer extraction module: the provided answer is not at the first position or it is not an exact phrase match;
– a Mismatch:
  • for the Answer selection module: the correct sentence/paragraph is not present in the Top 5 best scored answers,
  • for the Answer extraction module: incorrect or no answer produced.

Table 1. Evaluation of the AQA system on the SQAD v1.0 database

                     Answer selection         Answer extraction
    Match            2,645      80.1 %        1,326      40.2 %
    Partial match       66       1.9 %          443      13.4 %
    Mismatch           590      18.0 %        1,532      46.4 %

4 Error Analysis

Within a detailed analysis of the errors (mismatches and partial matches) in the evaluation, we have identified their causes and outlined the directions and tasks of the next development steps of the AQA modules.
The Question processor module needs to use a fine-grained grammar for the question type assignment task. In the evaluation process, 18 % of the errors had been assigned an incorrect question type. The following modules then could not find the required answer.

The Answer selection module is in some cases too biased towards preferring answers that contain a question word multiple times. This was the cause of 20 % of the erroneous answers in the analysis.
The Answer extraction module uses the question subject word, named entities and phrase extraction in the extraction process. The evaluation showed that the module needs to include other phrase and tree matching techniques – 21 % of errors were caused by extracting a different phrase from a correctly selected answer sentence. The current AQA version also suffers from not applying anaphora resolution techniques, which are necessary when the correct answer needs to be extracted via a reference between two sentences.

5 Conclusions

The paper presented details about the architecture of a new open question-answering system named AQA. The system is aimed at morphologically rich languages (the first version is developed and tested with the Czech language). We have described the AQA knowledge base creation from free natural language texts and the step-by-step syntax-based processing of the input question as well as the processing and scoring of the candidate answers to obtain the best specific answer.
The AQA system has been evaluated with the Simple Question Answering Database (SQAD), where it has achieved an accuracy of 40 % correct answers and 53 % partially correct answers.
We have also identified the prevailing causes of errors in the answer selection and answer extraction phases of the system and we are heading to amend them in the next version of the system.

Acknowledgments. This work has been partly supported by the Grant Agency of CR within the project 15-13277S. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047.

References

1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
2. Fader, A., Zettlemoyer, L., Etzioni, O.: Open question answering over curated and extracted knowledge bases. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1156–1165. ACM (2014)
3. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL 2005, pp. 363–370. Association for Computational Linguistics, Stroudsburg (2005). http://dx.doi.org/10.3115/1219840.1219885

4. Horák, A., Medveď, M.: SQAD: simple question answering database. In: Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 121–128. Tribun EU, Brno (2014)
5. Jakubíček, M., Kovář, V., Šmerk, P.: Czech morphological tagset revisited. In: Proceedings of Recent Advances in Slavonic Natural Language Processing 2011, pp. 29–42 (2011)
6. Kovář, V., Horák, A., Jakubíček, M.: Syntactic analysis using finite patterns: a new parsing system for Czech. In: Vetulani, Z. (ed.) LTC 2009. LNCS, vol. 6562, pp. 161–171. Springer, Heidelberg (2011)
7. Šmerk, P.: Towards morphological disambiguation of Czech (2007)
8. Ševčíková, M., Žabokrtský, Z., Straková, J., Straka, M.: Czech named entity corpus 1.1 (2014). http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C, LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague
9. Shtok, A., Dror, G., Maarek, Y., Szpektor, I.: Learning from the past: answering new questions with past answers. In: Proceedings of the 21st International Conference on World Wide Web, pp. 759–768. ACM (2012)
10. Yih, W.T., He, X., Meek, C.: Semantic parsing for single-relation question answering. In: Proceedings of ACL 2014, vol. 2, pp. 643–648. Citeseer (2014)
11. Šmerk, P.: Fast morphological analysis of Czech. In: Proceedings of the RASLAN Workshop 2009, Brno (2009)