Masaryk University
Faculty of Informatics

Automatic question answering for flective languages

Ph.D. Thesis Proposal

Marek Medveď

Advisor: doc. RNDr. Aleš Horák, Ph.D.

Brno, Fall 2017

Signature of Thesis Advisor

Contents

1 Introduction
  1.1 Current QA challenges
    1.1.1 Building knowledge base
    1.1.2 Question processing
    1.1.3 Document selection (Knowledge base search)
    1.1.4 Sentence selection
    1.1.5 Answer extraction
  1.2 What has to be improved
    1.2.1 Question processing
    1.2.2 Document selection
    1.2.3 Sentence selection
    1.2.4 Answer extraction
  1.3 Goal of the postgraduate study
  1.4 Thesis proposal structure

2 State of the art
  2.1 QA system structure
    2.1.1 Knowledge base
    2.1.2 Question processing module
    2.1.3 Passage retrieval module
    2.1.4 Answer selection module

3 Aims of the Thesis
  3.1 Czech-English syntax differences
    3.1.1 Declarative sentences (statements)
    3.1.2 Interrogative sentences (questions)
  3.2 Proposed system prototype
  3.3 Study plan

4 Achieved Results
  4.1 SQAD database
  4.2 Automatic Question Answering system (AQA)

5 Author's publications

Bibliography

A Research activity

B Teaching activities

C Opponent review

D Selected papers

1 Introduction

Question answering (QA) is a computer science discipline that attracts a lot of interest in the Natural Language Processing field. The information extraction and natural language processing fields aim at building systems that can provide accurate answers to input questions. The main difference between search engine systems and QA systems can be seen in the results they provide. A search engine system usually provides a list of eligible candidates that satisfy the input query. In contrast, QA systems are more complex and go further. By using multiple natural language processing (NLP) techniques, a QA system chooses the best article, extracts suitable passages and picks the shortest part of a paragraph or sentence that satisfies the input question and provides an answer with sufficient information to the user. There are two main types of QA systems: open domain systems [1, 2, 3, 4, 5, 6] and closed domain systems [7, 8, 9, 10]. Open-domain systems are based on sources without any restrictions, whereas closed-domain systems are limited to specific domains such as medicine, weather forecasting, sports results etc.

1.1 Current QA challenges

Current challenges that are studied by the QA community all over the world arise from the complexity of the QA task, which has to go through multiple layers of processing (question processing, document processing, answer selection, answer extraction) to get from a question to an answer. These challenges are presented in the following text based on the techniques in [11, 12, 13, 3, 10, 14, 5, 6, 15, 16].

1.1.1 Building knowledge base

The first challenge of a QA system is to have a large knowledge data source that will represent the knowledge base (KB) of the system. This is the part of the system that provides all the data and is queried for candidate answers.


Usually there is a separate module inside the QA system that processes the data source and stores information inside the QA's KB database. The main purpose of this module is to extract information from the input texts using NLP techniques such as lexical analysis1, morphological analysis2, part-of-speech tagging3 and syntactic analysis4. There are also advanced NLP techniques available which can provide complex information about a text, such as proper nouns (Information Extraction techniques), reference extraction between parts of text (anaphora resolution) or semantic similarity recognition between expressions (thesaurus, word embeddings). A satisfactory answer to the input question requires a proper representation of the QA system's KB. The KB must contain all available information extracted from the input data in a compact form, and the QA system must be able to access this information quickly. A detailed description of current KB techniques is presented in Section 2.1.1.
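As an illustration of such a preprocessing pipeline, the following minimal sketch uses the spaCy library (an arbitrary choice for illustration, not the toolchain used in this work) to obtain tokens, lemmas, part-of-speech tags and syntactic dependencies:

# Minimal NLP preprocessing sketch using spaCy (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")   # small English model
doc = nlp("Who is the founder of Facebook?")

for token in doc:
    # surface form, lemma (base form), POS tag, dependency label, head
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)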

1.1.2 Question processing

Apart from a KB, the only input the QA system usually receives is an input question. Question processing is very important for the QA system itself. If the QA system is not able to extract all available information from the question, the next processing layers could lead to incorrect answers. Besides lexical analysis, morphological analysis, part-of-speech tagging and syntactic analysis, the system usually performs question classification (for question types see Section 2.1.2). This additional information helps the system to focus on certain classes of entities inside candidate answers. According to this information, the final score, which represents the confidence of the answer, is assigned to each candidate answer. There are many ways to extract information from an input question. A question processing module can be based on three main approaches.

1. Token recognition inside a sentence.
2. Assigning a base form to a token.
3. Part-of-speech assignment to a word in a text based on both its definition and its context.
4. Building a syntactic tree according to language grammar rules.


The first approach is to match a question against automatically learned or manually created patterns (e.g. [17]). The second, more linguistically oriented approach (e.g. [8]) is based on building a KB query from information extracted by NLP tools. The last one is the application of statistical or machine learning approaches, which use statistical techniques such as Support Vector Machines, Bayesian classifiers, neural networks etc. to extract features of a question and create a KB query.
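To make the pattern-based approach concrete, the following toy sketch matches a question against hand-written patterns and turns the matched slots into a KB query; the patterns and the query format are hypothetical:

# Toy pattern-based question processing; patterns and query format
# are hypothetical illustrations only.
import re

PATTERNS = [
    (re.compile(r"^Who is the (?P<relation>\w+) of (?P<entity>.+)\?$", re.I), "PERSON"),
    (re.compile(r"^When was (?P<entity>.+) born\?$", re.I), "DATE"),
]

def question_to_query(question):
    for pattern, answer_type in PATTERNS:
        match = pattern.match(question)
        if match:
            return {"answer_type": answer_type, **match.groupdict()}
    return None

print(question_to_query("Who is the founder of Facebook?"))
# {'answer_type': 'PERSON', 'relation': 'founder', 'entity': 'Facebook'}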

1.1.3 Document selection (Knowledge base search)

After extracting all possible information from a question, the system starts to query the KB. The result of this process is usually in the form of a document or a document passage. A document selection module is used by the QA system to select the set of all documents from the KB which contain a suitable answer. This process can be based on several strategies. One of the widely used techniques is Information Retrieval (IR), which extracts keywords from a document that are compared to question keywords. A technique used by many IR systems is Boolean IR, which uses Boolean logic to create a formula from a document and compare its intersections with question keywords [18]. Important aspects of a document selection module are not only to develop an IR tool that can select a list of documents based on a given question but also to select a good sorting technique that will provide the final document ranking. The number of documents sent to the next processing stage is also very important. For example, it is shown in [19] that even when considering the top 1000 text segments, no relevant documents are found by the module for 8 % of the questions.
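A minimal sketch of such Boolean keyword retrieval follows; the toy data and the overlap-based ranking are our own illustrative assumptions, not the cited systems:

# Boolean keyword retrieval sketch: a document is a candidate when its
# keyword set intersects the question keywords; candidates are ranked
# by the overlap size. The data are illustrative.
def select_documents(question_keywords, documents, top_n=10):
    scored = []
    for doc_id, doc_keywords in documents.items():
        overlap = question_keywords & doc_keywords
        if overlap:  # Boolean condition: at least one shared keyword
            scored.append((len(overlap), doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_n]]

docs = {
    "d1": {"facebook", "founder", "zuckerberg"},
    "d2": {"weather", "forecast", "brno"},
}
print(select_documents({"founder", "facebook"}, docs))  # ['d1']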

1.1.4 Sentence selection

The sentence selection area is the most frequently studied component of a QA system. The most important action of a sentence selection module is to pick the correct passage from a document, usually in the form of a sentence or paragraph. Several approaches have been developed to solve this task. Some of them are very advanced and usually use some kind of neural network model [16], others use lexical information that is extracted from questions and sentences or obtained through IR techniques [20]. The main challenge of this area is to find a technique with a very high selection confidence. Current research aims not only to find new techniques to accomplish this task but also to explore possible combinations of existing tools and to find the correct weighting of each feature to get the best results.

1.1.5 Answer extraction

The final stage of a QA system is answer extraction. It uses a combination of Information Extraction (IE) techniques, information about the question type extracted from the input question, NLP tools that pinpoint important parts of a sentence (main verb, subject, object) and advanced tools such as anaphora resolution [21]. According to the information obtained in previous steps, the system has to decide which part of a sentence (passage) must be extracted and shown to the user as the final answer. The main goal is to provide a correct answer that includes enough information to satisfy the user's question.

1.2 What has to be improved

The QA field has become very attractive in recent years to people and companies around the world because of its usage potential. NLP techniques applied in the QA field such as lexical analysis, morphological analysis, syntactic analysis, document selection and knowledge base building still do not reach 100 % confidence and all of them are still under development. The main focus of research is to improve the question processing, sentence (passage) selection and answer extraction areas.

1.2.1 Question processing

• Question type: there is a problematic balance between a small and a large number of question types that the system can recognize. If the system implements too many question types, the complexity of question processing can become very time consuming and a very small change in feature weighting may change the resulting question type. On the other hand, very few question type classes can cause the system to pick a wrong answer.

• Question focus: finding the main focus of the question itself (example in Figure 1.1) is also a challenging task. The confidence of focus recognition greatly influences the resulting answer [22, 23].

Question: Who is the founder of Facebook?
Focus: [person, company]

Figure 1.1: Example of question focus

1.2.2 Document selection

The challenge of finding a list of candidate documents that can answer the input question has been more or less solved. The problem lies in the final document ranking, which could select a wrong document over a suitable one by using an incorrect ranking procedure.

1.2.3 Sentence selection

This is a topic that has been presented at recent conferences. Several techniques that attempt to solve this task have been developed; still, the mean reciprocal rank (MRR) is about 0.8 (e.g. [5]) when measured on large English datasets such as TREC-QA, WikiQA or SQuAD. On the other hand, flective languages such as Polish reach 0.5 MRR [24], evaluated on 598 question-answer pairs created from the Polish web.

1.2.4 Answer extraction

Even though the answer extraction depends on the previous levels of processing, it is a very challenging task. According to the question focus, this part of the system has to select the correct sentence part, one which contains the required answer.

1.3 Goal of the postgraduate study

The goal of this Ph.D. study is to build a QA system prototype that can operate on flective languages (the Slavonic family). The system will implement state-of-the-art techniques that will be supplemented by syntactic information, which should improve the system performance on this language family. We will provide evidence of the significance of syntactic information inside the system prototype. The main reason to study the influence of syntactic information on the system performance is that the majority of state-of-the-art QA systems are built for the English language, which is very different from flective languages. Since the word order inside flective sentences is more free and does not have a strict structure, syntactic information is necessary to identify not only a subject, a verb and an object but also all syntactic dependencies that help in the answer extraction procedure. Therefore we assume that this extracted syntactic information should improve the performance of a QA system that works on flective languages.

1.4 Thesis proposal structure

The structure of this work is as follows. In the first chapter we introduce the main structure of a QA system and describe the latest state-of-the-art techniques that are used in the QA field. The second chapter defines the proposed solution of the Ph.D. study in a detailed study plan. Already achieved results are presented in the third chapter, where we introduce the first prototype of a QA system called AQA. A brief system description is also presented in this chapter.

2 State of the art

In this section we will introduce a basic QA system structure and state-of-the-art techniques that are used in QA systems. At the end of this section we present an overview of the results of the best QA systems.

2.1 QA system structure

All recent QA systems usually have a very similar structure to that presented in Figure 2.1.

Figure 2.1: General QA structure (an input question passes through question processing, passage retrieval over the knowledge base, answer selection and answer extraction, which produces the answer)

The five core modules of most QA systems are a knowledge base, a question processing module, a passage retrieval (document retrieval) module, an answer selection module and an answer extraction module.

2.1.1 Knowledge base

To be able to find the required information, a QA system has to have some kind of knowledge base. There are two main source types for knowledge base building.


Structured data

Structured data for knowledge base building contain some type of additional information which is added by either manual or automatic annotation. FreeBase [25] and DBpedia [26] are the two main representatives of this data class. Freebase is a large collaborative knowledge database with a structure in the form of links between items inside the database. Every text on some topic includes a variety of types that are related to the topic itself. For example, a text with the topic "Arnold Schwarzenegger" should assign types such as actor, bodybuilder and politician to the text. After the Freebase company was bought by Google in 2010, the data moved to the Wikidata [27] database. There have been several QA systems based on FreeBase, such as [28, 29, 30]. A similar data source is DBpedia [26], where each entry contains an RDF triple1 (subject/predicate/object). DBpedia covers domains such as geography, companies, online communities, film, music, books and scientific publications. Descriptions of QA systems based on DBpedia can be found in [31, 32].

Plain text data

The second main source for building a KB is unstructured data harvested from web sites or books. The main disadvantage compared to the previous data source is the lack of relation annotation between entries. On the other hand, this data can be enlarged and updated in a straightforward way. The TREC-QA [33] dataset is built from web sites and books. It was created specifically for the answer selection task and consists of 12,887 training questions where each question has several positive and negative candidate answers. The WikiQA [34] dataset consists of 3,047 questions (originally sampled from Bing query logs) and 29,258 sentences, where 1,473 sentences were labeled as answer sentences to their corresponding questions.

1. An RDF (Resource Description Framework) triple is a metadata data model for modeling information that is implemented in web resources, using a variety of syntax notations and data serialization formats.


Like the TREC-QA dataset, this dataset aims at improving the answer selection task. Both the Stanford Question Answering Dataset (SQuAD) [35] and the DeepMind Q&A Dataset [36] are reading comprehension datasets, consisting of questions and answers where the answer to every question is a segment of text, or span, from the corresponding reading passage. The SQuAD database consists of more than 500 Wikipedia articles and more than 100,000 question-answer pairs, where the questions were posed by crowdworkers. The DeepMind Q&A Dataset includes CNN news articles and Daily Mail news articles. Together it has almost 300,000 documents and almost 1,300,000 questions. Questions in the DeepMind Q&A Dataset are not in the usual form like "QWord + focus?" but in the form of a sentence where the answer word is replaced by a special placeholder to indicate the missing word. The reason why it contains so much data compared to other sources is that the questions can be generated automatically.

2.1.2 Question processing module

Providing a good result requires proper processing of the input question to be able to extract all important information. The introduced QA systems [23, 37] use NLP tools including a tokenizer, a morphological analyzer and a part-of-speech tagger. Apart from these NLP tools, a lot of QA systems perform question classification according to the nature of the question.

• Functional word questions are all non-Wh* questions that usually start with a verb.

• When questions focus on an exact time or a time span.

• Where questions focus on a place.

• Which questions, where the focus lies on the noun phrase following the "Which" word.

• Who questions ask about some entity.

• Why questions want to find out the reason or explanation.

• How questions ask for an explanation or a number.

• What questions are general questions.

According to these question classes, a QA system can determine what kind of entity the question is looking for. Recent state-of-the-art systems transform an input question into its embedded form [14, 16, 15] to determine its semantic meaning (usually in the form of word/sentence vectors), which is then used in an answer selection module to match a question to an answer according to similar semantic meaning.
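A minimal rule-based classifier over these question classes might look as follows; the class labels and the fallback to the functional-word class are our own assumptions:

# Rule-based question classification by question word; the class
# labels mirror the list above and are otherwise arbitrary.
WH_CLASSES = {
    "when": "TIME", "where": "PLACE", "which": "NOUN_PHRASE",
    "who": "ENTITY", "why": "REASON", "how": "EXPLANATION_OR_NUMBER",
    "what": "GENERAL",
}

def classify_question(question):
    first_word = question.strip().split()[0].lower()
    # non-Wh* questions usually start with a verb ("Is...", "Does...")
    return WH_CLASSES.get(first_word, "FUNCTIONAL_WORD")

print(classify_question("Who is the founder of Facebook?"))   # ENTITY
print(classify_question("Is Brno in the Czech Republic?"))    # FUNCTIONAL_WORD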

Word embedding

A word embedding can be seen as a mapping of words to a more computer-friendly form. The simplest approach is to map a word from a sentence to the index of the same word inside a dictionary. Beyond that, this technique can determine the semantics of a word by using neural networks. These networks are trained on millions of sentences, which enables the neural network to catch relations of words from their context. The most popular implementation of word embeddings was developed by Mikolov's team [38] and is called word2vec. It has been implemented in many neural network libraries including TensorFlow2, DyNet3, Theano4, Keras5 and Gensim6. Word2vec utilizes the continuous bag-of-words and skip-gram models to compute vector representations of words. Word vectors typically have several hundred dimensions and each word from a corpus is assigned a corresponding vector inside this space. The vector represents features that have been learned by the neural network, which groups semantically similar words. An example vector embedding trained on an English corpus is shown in Figure 2.2 and its visualization in Figure 2.3. For training, the word2vec algorithm obtains a large tokenized corpus7 and creates a model that stores all features extracted from the corpus.

2. https://www.tensorflow.org/
3. dynet.readthedocs.io
4. http://deeplearning.net/software/theano/
5. https://keras.io/
6. https://radimrehurek.com/gensim
7. A large collection of texts.

Then, if the QA system wants to work with an embedded form of a sentence, the model is queried for the word vector representations of the sentence.

# word vectors, trained on a context window size of 3
must: [-0.04942611 -0.02303079 0.0573978 ..., 0.07257255 0.01359628 -0.06095064]
according: [-0.06460572 0.09217705 -0.16815324 ..., 0.0705426 0.07179514 0.03950354]
school: [-0.11782318 0.00927046 -0.0631732 ..., 0.06049724 0.04135431 -0.04148336]
...

Figure 2.2: Word vectors trained on English corpora (using TensorFlow library)

Figure 2.3: Word embedding visualization8

8. source: suriyadeepan.github.io/2016-06-28-easy-seq2seq/
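A minimal sketch of training and querying such a model with the Gensim implementation mentioned above follows (Gensim 4.x API; the two-sentence corpus is a toy example, a real model needs millions of sentences):

# Training and querying word2vec with Gensim (4.x API); the window of
# 3 matches Figure 2.2, the toy corpus is for illustration only.
from gensim.models import Word2Vec

sentences = [
    ["school", "must", "teach", "according", "to", "the", "plan"],
    ["students", "go", "to", "school", "every", "day"],
]
model = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=1)

vector = model.wv["school"]                        # embedding of one word
similar = model.wv.most_similar("school", topn=3)  # nearest words in the space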


2.1.3 Passage retrieval module

Most question answering systems feature a text retrieval module that searches through a knowledge base and creates a list of passages relevant to the question according to the information extracted from the question. This module usually consists of two parts. The first one provides a list of candidate answers and the second one uses scoring algorithms to rank the most relevant answers to the top of the list.

Search engine

The first step is to find documents relevant to a given question. The state-of-the-art systems usually perform some kind of search technique that is able to find all possible candidate answers. Examples of search engines are Apache Lucene [39] and Indri [40]. Apache Lucene includes multiple ranking models, including the Vector Space Model (based on the term frequency–inverse document frequency (TF-IDF) weighting algorithm) and the Okapi BM25 algorithm (also representing TF-IDF-like retrieval functions used in document retrieval). The Indri system's IR model is based on an inference network approach, which combines multiple features to estimate document relevance to a query (for a more detailed description see [40]).
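For reference, the Okapi BM25 score mentioned above is conventionally defined as follows, where k_1 and b are free parameters, f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length and avgdl is the average document length in the collection:

score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}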

Filtering

To prune down the number of candidate answers, the system combines multiple analysis scores into a filtering score. There are many analysis tools that can contribute to filtering. We introduce those which are most important:

• the longest similar subsequences between a question and a passage

• Lexical Answer Type (LAT): whether the expected lexical type (extracted from the question) is present in a candidate document

• tree distance score on noun phrases

• TF-IDF question-answer score

• proper noun presence (based on IR module results)

Ranking

At the end of the passage retrieval phase, the system combines the vector of multiple features into a final score for each passage. The higher the final score, the more identical question-answer features the system found, which contributes to the confidence of that answer.
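A minimal sketch of such a feature combination as a weighted sum follows; the feature names and weights are illustrative assumptions only:

# Combining multiple filtering features into a final passage score as
# a weighted sum; feature names and weights are illustrative.
WEIGHTS = {
    "longest_subsequence": 0.30,
    "lat_match": 0.25,
    "tree_distance": 0.15,
    "tf_idf": 0.20,
    "proper_noun": 0.10,
}

def final_score(features):
    # features: feature name -> normalized score in [0, 1]
    return sum(WEIGHTS[name] * value for name, value in features.items())

passage = {"longest_subsequence": 0.8, "lat_match": 1.0,
           "tree_distance": 0.4, "tf_idf": 0.6, "proper_noun": 1.0}
print(final_score(passage))  # 0.77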

2.1.4 Answer selection module

Before we explain the state-of-the-art answer selection methods, let us introduce the neural network models that are currently used for this task.

Neural networks

There are two mainstream NN models: Convolutional Neural Networks (CNN) [41] and Recurrent Neural Networks (RNN) [42]. CNN networks proved to be accurate for image recognition tasks [43] such as identifying objects in pictures. On the other hand, RNNs proved to be accurate in sequence-to-sequence [38] tasks like language translation and speech-to-text transcription. In the QA field, RNN networks have become very efficient in matching a question to a candidate sentence (the graphic representation of an RNN network cell is in Figure 2.4). The advantage of RNN over other NN models is in information persistence, which allows an RNN to make decisions according to previous actions. When we unroll the green loop from Figure 2.4.a, the chain that arises in Figure 2.4.b can be directly mapped to a list of words from a text. The specialty of RNN cells is that they are connected and can pass information through from the previous cell to the next cell. This information corresponds to all previous states of the RNN model, which in some cases can be undesirable. Consider a sentence where the word in bold should be predicted by the NN: "The clouds are in the sky." For this prediction we do not need to see all the previous context; it is pretty obvious that the only possible word is "sky".


Figure 2.4: Recurrent network unit (a) a single RNN cell with input word vectors w_1..t, output vectors o_1..t and a tanh activation; (b) the loop unrolled into a chain of cells

On the other hand, RNNs are not good at remembering information from previous text that is further away. Take for example a text where there is a large gap between the sentences "Mark grew up in England." and "His native language is English.". Long Short Term Memory (LSTM) [44] networks were developed to improve the learning process of RNN networks. An LSTM network is still an RNN network, but the advantage of LSTM over a plain RNN network is in the gates that are present within an LSTM network cell. These gates are used to determine whether both the information from previous steps and the current information will be passed to the next step or not (see the illustration of an LSTM cell in Figure 2.5). The cells are formed of sigmoid neural net layers and element-by-element multiplication operations. Each LSTM cell contains an input gate, a forget gate and an output gate. The input gate (the "ignoring" part in Figure 2.5) controls the actual prediction. The forget gate (the "memory" part in Figure 2.5) controls whether the past knowledge will be added to the present result. The output gate (the "selection" part in Figure 2.5) controls what can be released as a prediction.

Figure 2.5: LSTM network unit (the input, "ignoring", "memory" and "selection" parts are built from sigmoid and hyperbolic tangent neural network layers combined by element-wise multiplication and addition)
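In the standard formulation [44], the three gates and the cell state of an LSTM cell at time step t are computed as follows (\sigma is the sigmoid, \odot element-wise multiplication; i_t, f_t and o_t correspond to the "ignoring", "memory" and "selection" parts of Figure 2.5):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)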

Answer selection module

In the previous text we introduced the state-of-the-art neural network models that are used in recent QA systems. These NN models are used in the answer selection module architecture to create a complex neural network that can learn how to find crucial information between a question and candidate answers to provide a final scoring of all candidate answers. There are multiple approaches to putting together the layers of a NN. In the following text we introduce the structure of answer sentence selection that represents multiple state-of-the-art QA systems [5, 14, 15]. The latest research in the answer selection field has introduced new types of LSTM networks, new formulas for the loss function and new NN models. In Figure 2.6 we present a NN model that is based on a Bi-LSTM layer, max pooling and a loss function (for a specific description see [5, 14, 15]).

Figure 2.6: Bi-directional LSTM network for answer selection (the word vectors w_1..w_n of the question and of the answer pass through Bi-LSTM layers and max pooling; the resulting question and answer vectors feed a loss function that outputs probability(sentence|question))


The Bi-LSTM layer is a special kind of LSTM where the input is read from both sides (from the first to the last word and from the last to the first word). This Bi-LSTM network utilizes both the previous and the future context (which cannot be captured by a single-direction LSTM) by processing the input word sequence in the forward and backward directions. In our example the Bi-LSTM layer takes the input question and the answer. The output of this layer is then calculated as an element-by-element addition (which can be substituted by concatenation or another function). In [5] the IBM Watson team introduces a novel attentive pooling method for Bi-LSTM and CNN networks that is very similar to the standard Bi-LSTM layer from the previous text. It differs in the pooling strategy. Non-attentive pooling goes over a matrix with an N×N sized window and picks the greatest number inside the filter into the final result (see the illustration of max pooling with a 2×2 filter in Figure 2.7).

Figure 2.7: Max pooling with a 2×2 filter (the input matrix

1 5 7 0
3 4 1 6
2 9 3 9
5 1 7 2

is reduced to

5 7
9 9

by taking the maximum of each 2×2 block)
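The pooling operation of Figure 2.7 can be reproduced in a few lines of NumPy (a sketch with stride 2, matching the figure):

# Max pooling with a 2x2 filter and stride 2, reproducing Figure 2.7.
import numpy as np

m = np.array([[1, 5, 7, 0],
              [3, 4, 1, 6],
              [2, 9, 3, 9],
              [5, 1, 7, 2]])

# split the 4x4 matrix into 2x2 blocks and keep each block's maximum
pooled = m.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[5 7]
                #  [9 9]]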

In attentive pooling the strategy is the same, but the input matrix contains an input pair (both question and candidate answer vectors), which allows the information from the question to influence the answer and vice versa. This approach has proved to be efficient and is spreading to the majority of new QA systems. Some recently proposed methods consider not only the correct candidate answer in the training process but also incorrect answers (see Figure 2.8), where alongside the question-answer pair, the negative example also affects the computation of the result.

Figure 2.8: Bi-directional LSTM network for answer selection with positive and negative examples (the question, positive example and negative example word vectors pass through Bi-LSTM layers and max pooling; the resulting vectors enter the loss function that outputs probability(sentence|question))
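A common choice of loss function for such training with positive and negative examples (used e.g. in [5]) is the hinge ranking loss over the similarities of the pooled vectors, with a margin hyper-parameter m:

L = \max\{0,\; m - \cos(v_q, v_{a^+}) + \cos(v_q, v_{a^-})\}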

There are several methods to build and set up the NN for the answer selection task, and a lot of work can be done in this area of QA research.

State-of-the-art evaluation overview

In this section we introduce the best state-of-the-art QA systems according to the SQuAD leaderboard9 from September 2017. The best system on the leaderboard is called DCN+ [16] (Dynamic Coattention Network). The system introduces a coattention encoder module where an attention network is used for creating a co-dependent representation of the question and the answer.

9. https://rajpurkar.github.io/SQuAD-explorer/


A novel approach is also present in the final dynamic pointer decoder module, which estimates the start and end positions of the answer span according to a finite state automaton that is maintained by an LSTM model. The AoA Reader [14] introduces a novel attention-over-attention network module that places an attention mechanism over the existing document-level attention. This mechanism allows the system to exploit mutual information between the document and the query and to provide an importance score of each query. The R-NET model [15] proposes a new gated attention-based recurrent network and a self-matching mechanism. The gated attention-based recurrent network adds an additional gate to the attention-based recurrent network. This new model assigns an importance score to passage parts depending on their relevance to the question, which highlights important parts of the passage. The self-matching mechanism refines the passage representation with information from the whole passage. The Reg-RaSoR [45] system introduces an effective mechanism for computing an embedded representation of candidate answer spans. The authors build a neural network with shared substructures of passage spans to avoid the cubically sized network input. The proposed system also augments passage word embeddings with an additional embedding representation of the question to match answer spans with the input question. An overview of the ExactMatch (EM) and F1 scores of these systems is presented in Table 2.1.

Table 2.1: Overview of state-of-the-art QA system evaluation on SQuAD database

System       EM / F1
DCN+         78.706 / 85.619
AoA Reader   77.845 / 85.297
R-NET        77.688 / 84.666
Reg-RaSoR    75.789 / 83.261


3 Aims of the Thesis

The aim of the thesis is to build a prototype of a question answering system that will be able to work on flective languages from the Slavic language family. The most important parts of the prototype will be the passage/sentence selection module and the answer extraction module. The rest of the system will use state-of-the-art approaches and the work will not focus on improvement of these parts. The proposed passage/sentence selection module of the QA system will be built upon neural network technology. The work will study the influence of syntactic information on the resulting ranking of candidate answers. Morphologically rich languages are more complex than less fine-grained ones, which means that a lot of information is stored inside the syntactic dependencies. The majority of recent QA systems operate on some kind of English knowledge base. Our system will focus on Slavonic flective languages, which are more fine-grained. We pick the Czech language from the Slavonic family for testing purposes. The following Czech-English syntax differences describe the importance of the presence of syntactic information inside our system.

3.1 Czech-English syntax differences

3.1.1 Declarative sentences (statements)

In an English text, declarative sentences have a fixed position of the subject and the verb. The subject usually precedes the verb in an English sentence, while in a Czech sentence they can be moved freely, even after the verb. The object in English sentences also has a fixed position: it is usually placed after the verb. On the other hand, the Czech sentence structure is more free. It follows certain rules (some word classes do have their fixed positions) but still is not that strict. In a Czech declarative sentence, unlike in English, it is possible for the subject and the object to switch positions without altering the meaning of the sentence.


The freedom of word ordering within a Czech sentence is supported by inflectional suffixes that determine the grammatical function of sentence elements. For example, the difference between the right and left parts of Figure 3.1 is only in the inflection of the name "Mark", yet the meanings of the sentences are completely different. Therefore, subject and object determination inside a Czech sentence has to be done more carefully, according to the syntactic structure.

Mark miluje Susanne.      Marka miluje Susanne.
[Mark loves Susanne.]     [Susanne loves Mark.]

Figure 3.1: Inflectional suffixes

3.1.2 Interrogative sentences (questions)

In English, interrogative sentences are unique in word ordering. They require the usage of an auxiliary verb (like be, do, have), and the subject-auxiliary inversion rule applies in most cases. Placing the verb before the subject is typical for an interrogative sentence, except for wh* questions where the wh* word has the function of the subject. In Czech interrogative sentences, the word order remains free (mostly the same as in declarative sentences, see Figure 3.2). The subject is usually not expressed, and in case of its presence the inversion is optional.

Začínáš být unavený?        Začínáš být unavený.
[Are you getting tired?]    [You are getting tired.]

Figure 3.2: Order in Czech interrogative sentences

The current state-of-the-art systems for English should be able to process flective languages. However, these systems can learn incorrect patterns from the flective text because of the free word ordering. If we take a closer look at the declarative sentence example (Figure 3.1), a state-of-the-art system has no syntactic information about these sentences, so it can learn that the semantics of these sentences is the same. But the syntactic trees of these sentences are different: we can see the important difference between Figure 3.3 and Figure 3.4, where in the first tree "Mark" is the subject while in the second it is an object.

Figure 3.3: Syntactic tree for "Mark miluje Susanne." [Mark loves Susanne]
Figure 3.4: Syntactic tree for "Marka miluje Susanne." [Susanne loves Mark]

3.2 Proposed system prototype

The aim of our research is to contribute to the question answering passage selection area and answer extraction area. In our first QA system prototype (described in the following chapter) we noticed that syntax-based methods prove to be effective in these two areas and yield better results [46]. We will observe on the testing data whether certain syntactic information has a positive or negative influence. We will design a heuristic for combining syntax-based methods and establish a weighting balance between those that contribute to better results. This part of the work will also study whether this syntactic information will be handled separately or can be put together into a more complex model like the attention LSTM presented previously in the text, which introduces the idea of direct influence of the question on the sentence (and vice versa) in NN training. All designed methods and models will be tested on the Czech SQAD database and compared to similar state-of-the-art QA systems. An evaluated prototype of the QA system will contribute to the QA field that aims to process flective languages.

3.3 Study plan

Spring 2018:

• implementation of a neural network into the passage/sentence selection module
• enlarging the SQAD database
• publishing new results

Autumn 2018:

• testing the new AQA prototype
  – NN with syntactic information
  – balancing of syntactic information weights

Spring 2019:

• identifying the most beneficial syntactic methods for the QA system
• publishing new results

Autumn 2019:

• experimenting with logical sentence representation
• publishing new results

Spring 2020:

• Ph.D. thesis writing

All results will be published at leading Natural Language Processing (NLP) conferences and in NLP journals.

4 Achieved Results

In this section we introduce the question answering system called AQA and the knowledge base called SQAD. Both of them have been created during the Ph.D. study and have been published in [47, 46].

4.1 SQAD database

In [47] we introduced a knowledge base called the Simple Question Answering Database (SQAD). The SQAD database uses Czech Wikipedia articles as a source of different questions and their respective answers. The current SQAD database consists of 3,301 records with the following data fields (see Figure 4.1 for an example):

• the original sentence(s) from Wikipedia

• the question that is directly answered in the text

• the expected answer to the question, as it appears in the original text

• the URL of the Wikipedia web page from which the original text was extracted

• the name of the author of this SQAD record

The SQAD database is still under development. In future work we plan to extend the current 3,301 question-answer pairs up to 9,000 pairs.

4.2 Automatic Question Answering system (AQA)

We have already introduced the first prototype of a QA system called AQA. The system incorporates NLP tools that have been developed at Masaryk University as well as state-of-the-art techniques from the QA field. These tools are mainly used for information extraction in the question processing phase and in the document retrieval module.


Original text: Létající jaguár je novela spisovatele Josefa Formánka z roku 2004.
               [Létající jaguár is a novel by the writer Josef Formánek from 2004.]
Question:      Kdo je autorem novely Létající jaguár?
               [Who is the author of the novel Létající jaguár?]
Answer:        Josef Formánek
URL:           http://cs.wikipedia.org/wiki/L%C3%A9taj%C3%ADc%C3%AD_jagu%C3%A1r
Author:        chalupnikova

Figure 4.1: Example of SQAD record

Table 4.1: Evaluation of the AQA system on the expanded SQAD v1.1 database (syntax oriented Word2Vec sentence representation)

Answer extraction     #        %
Match                 1,257    38.08 %
Partial match           270     8.18 %
Mismatch              1,774    53.74 %

In the passage/sentence retrieval module we used state-of-the-art tools such as word2vec [38] and neural networks that are enhanced by syntactic information. In [46] we identified the tree distance score metrics calculated from the syntactic tree to be efficient and to yield better results than the baseline prototype on the SQAD test set. Also, the sentence similarity built upon syntactic trees and transformed into the sentence embedded form via the word2vec method proved to be efficient in our second system prototype, which has not been published yet. The current evaluation of the AQA system on SQAD v1.1 is shown in Table 4.1.


5 Author’s publications

• MEDVEĎ, Marek and Aleš HORÁK. AQA: Automatic Question Answering System for Czech. In Sojka Petr, Horák Aleš, Kopeček Ivan, Pala Karel. Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12–16, 2016, Proceedings. Switzerland: Springer International Publishing, 2016, pp. 270–278, 9 pp. ISBN 978-3-319-45510-5. doi:10.1007/978-3-319-45510-5_31.

– Contribution 80 %. Introduction of a new syntax-based method for the Question Answering sentence selection module.

• MEDVEĎ, Marek, Aleš HORÁK and Vojtěch KOVÁŘ. Bilingual Logical Analysis of Natural Language Sentences. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016, pp. 69–78, 10 pp. ISBN 978-80-263-1095-2.

– Contribution 30 %. Introduction of multilingual sentence semantic analysis.

• MEDVEĎ, Marek, Vojtěch KOVÁŘ and Miloš JAKUBÍČEK. English-French Document Alignment Based on Keywords and Statistical Translation. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers. Berlin: Association for Computational Linguistics, 2016, pp. 728–732, 5 pp. ISBN 978-1-945626-10-4.

– Contribution 80 %. A new document alignment approach based on keyword search using the TF-IDF score and a bilingual statistical dictionary.

• BAISA, Vít, Jan MICHELFEIT, Marek MEDVEĎ and Miloš JAKUBÍČEK. European Union Language Resources in Sketch Engine. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis.


Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016, pp. 2799–2803, 5 pp. ISBN 978-2-9517408-9-1.

– Contribution 10 %. Parallel corpus development.

• MEDVEĎ, Marek and Aleš HORÁK. AST: New Tool for Logical Analysis of Sentences based on Transparent Intensional Logic. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2015, pp. 95–102, 8 pp. ISBN 978-80-263-0974-1.

– Contribution 30 %. Development of a new standalone system for sentence logical analysis based on Transparent Intensional Logic.

• MEDVEĎ, Marek, Vít BAISA and Aleš HORÁK. Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods. In Constantin Orasan and Rohit Gupta. Proceedings of The Workshop on Natural Language Processing for Translation Memories (NLP4TM). Bulgaria: INCOMA Ltd. Shoumen, 2015, pp. 31–35, 5 pp. ISBN 978-954-452-032-8.

• HORÁK, Aleš and Marek MEDVEĎ. SQAD: Simple Question Answering Database. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 121–128, 8 pp. ISSN 2336-4289.

– Contribution 80 %. Development of a new Czech database for Question Answering system evaluation.

• RYGL, Jan and Marek MEDVEĎ. Style Markers Based on Stop-word List. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 85–89, 5 pp. ISSN 2336-4289.

– Contribution 10 %. Development of an information retrieval module for text processing.


• JAKUBÍČEK, Miloš and Marek MEDVEĎ. Portable Lexical Analysis for Parsing of Morphologically-Rich Languages. In A. Horák, P. Rychlý. RASLAN 2013: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2013, pp. 21–26, 6 pp. ISBN 978-80-263-0520-0.

– Contribution 80 %. New lexical analysis work-flow with easy maintenance and new language portability.

• MEDVEĎ, Marek, Miloš JAKUBÍČEK and Vojtěch KOVÁŘ. Towards taggers and parsers for Slovak. In Zygmunt Vetulani & Hans Uszkoreit. Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań, Poland: Fundacja Uniwersytetu im. A. Mickiewicza, 2013, pp. 527–530, 4 pp. ISBN 978-83-932640-3-2.

– Contribution 80 %. Adaptation of RFTagger and two Czech parsers (Synt and SET) for the Slovak language.

• MEDVEĎ, Marek, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ and Václav NĚMČÍK. Adaptation of Czech Parsers for Slovak. In Aleš Horák, Pavel Rychlý. RASLAN 2012: Recent Advances in Slavonic Natural Language Processing. Brno, Czech Republic: Tribun EU, 2012, pp. 23–30, 8 pp. ISBN 978-80-263-0313-8.

– Contribution 80 %. Adaptation of the Czech part-of-speech tagset for Slovak and modification of the formal grammar inside Czech parsers for Slovak.


Bibliography

1. ITTYCHERIAH, Abraham; FRANZ, Martin; ZHU, Wei-Jing; RATNAPARKHI, Adwait; MAMMONE, Richard J. IBM's Statistical Question Answering System. In: TREC. 2000.

2. SUZUKI, Jun; SASAKI, Yutaka; MAEDA, Eisaku. SVM answer selection for open-domain question answering. In: Proceedings of the 19th international conference on Computational linguistics – Volume 1. 2002, pp. 1–7.

3. FERRUCCI, David et al. Building Watson: An overview of the DeepQA project. In: 2010, vol. 31, pp. 59–79. No. 3.

4. TAN, Ming; XIANG, Bing; ZHOU, Bowen. LSTM-based Deep Learning Models for non-factoid answer selection. CoRR. 2015, vol. abs/1511.04108. Available also from: http://arxiv.org/abs/1511.04108.

5. SANTOS, Cicero Nogueira dos; TAN, Ming; XIANG, Bing; ZHOU, Bowen. Attentive Pooling Networks. CoRR. 2016, vol. abs/1602.03609. Available also from: http://arxiv.org/abs/1602.03609.

6. BAUDIŠ, Petr. YodaQA: a modular question answering system pipeline. In: POSTER 2015 – 19th International Student Conference on Electrical Engineering. 2015, pp. 1156–1165.

7. GREEN Jr., Bert F.; WOLF, Alice K.; CHOMSKY, Carol; LAUGHERY, Kenneth. Baseball: An Automatic Question-answerer. In: Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference. Los Angeles, California: ACM, 1961, pp. 219–224. IRE-AIEE-ACM '61 (Western). Available from DOI: 10.1145/1460690.1460714.

8. WEIZENBAUM, Joseph. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM. 1966, vol. 9, no. 1, pp. 36–45.

9. KATZ, Boris. Using English for indexing and retrieving. 1988.


10. DERICI, Caner; ÇELIK, Kerem; KUTBAY, Ekrem; AYDIN, Yiğit; GÜNGÖR, Tunga; ÖZGÜR, Arzucan; KARTAL, Günizi. Question Analysis for a Closed Domain Question Answering System. In: GELBUKH, Alexander (ed.). Computational Linguistics and Intelligent Text Processing: 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part II. Cham: Springer International Publishing, 2015. ISBN 978-3-319-18117-2. Available from DOI: 10.1007/978-3-319-18117-2_35.

11. GUPTA, Poonam; GUPTA, Vishal. A survey of text question answering techniques. International Journal of Computer Applications. 2012, vol. 53, no. 4.

12. FOUCAULT, Nicolas; ADDA, Gilles; ROSSET, Sophie. Language Modeling for Document Selection in Question Answering. In: RANLP. 2011, pp. 716–720.

13. AJITKUMAR M. PUNDGE, Khillare S.A.; MAHENDER, C. Namrata. Question Answering System, Approaches and Techniques: A Review. International Journal of Computer Applications. 2016, vol. 141, no. 3, pp. 34–39. ISSN 0975-8887. Available from DOI: 10.5120/ijca2016909587.

14. CUI, Yiming; CHEN, Zhipeng; WEI, Si; WANG, Shijin; LIU, Ting; HU, Guoping. Attention-over-Attention Neural Networks for Reading Comprehension. CoRR. 2016, vol. abs/1607.04423. Available also from: http://arxiv.org/abs/1607.04423.

15. NATURAL LANGUAGE COMPUTING GROUP, Microsoft Research Asia. R-NET: Machine Reading Comprehension with Self-Matching Networks. In: ACL. 2017.

16. XIONG, Caiming; ZHONG, Victor; SOCHER, Richard. Dynamic Coattention Networks For Question Answering. CoRR. 2016, vol. abs/1611.01604. Available also from: http://arxiv.org/abs/1611.01604.

17. UNGER, Christina; BÜHMANN, Lorenz; LEHMANN, Jens; NGONGA NGOMO, Axel-Cyrille; GERBER, Daniel; CIMIANO, Philipp. Template-based Question Answering over RDF Data. In: Proceedings of the 21st International Conference on World Wide Web. Lyon, France: ACM, 2012, pp. 639–648. WWW '12. ISBN 978-1-4503-1229-5. Available from DOI: 10.1145/2187836.2187923.


18. SAGGION, Horacio; GAIZAUSKAS, Robert; HEPPLE, Mark; ROBERTS, Ian; GREENWOOD, Mark A. Exploring the performance of boolean retrieval strategies for open domain question answering. In: Proc. of the IR4QA Workshop at SIGIR. 2004.

19. EDUARD, Hovy; LAURIE, Gerber; ULF, Hermjakob; MICHAEL, Junk; CHIN-YEW, Lin. Question Answering in Webclopedia. In: Proceedings of The Ninth Text REtrieval Conference (TREC 2000). 2000.

20. FERRUCCI, D. A. Introduction to "This is Watson". IBM Journal of Research and Development. 2012, vol. 56, no. 3.4, pp. 1:1–1:15. ISSN 0018-8646. Available from DOI: 10.1147/JRD.2012.2184356.

21. QUARTERONI, S.; MANANDHAR, S. Designing an interactive open-domain question answering system. Natural Language Engineering. 2009, vol. 15, no. 1, pp. 73–95. Available from DOI: 10.1017/S1351324908004919.

22. DUAN, Huizhong; CAO, Yunbo; LIN, Chin-Yew; YU, Yong. Searching Questions by Identifying Question Topic and Question Focus. In: ACL. 2008, vol. 8, pp. 156–164.

23. LALLY, Adam; PRAGER, John M; MCCORD, Michael C; BOGURAEV, Branimir K; PATWARDHAN, Siddharth; FAN, James; FODOR, Paul; CHU-CARROLL, Jennifer. Question analysis: How Watson reads a clue. IBM Journal of Research and Development. 2012, vol. 56, no. 3.4, pp. 2–1.

24. MARCIŃCZUK, Michał; RADZISZEWSKI, Adam; PIASECKI, Maciej; PIASECKI, Dominik; PTAK, Marcin. Evaluation of baseline information retrieval for Polish open-domain Question Answering system. In: Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013. 2013, pp. 428–435.

25. BOLLACKER, Kurt; EVANS, Colin; PARITOSH, Praveen; STURGE, Tim; TAYLOR, Jamie. Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, pp. 1247–1250.

26. AUER, Sören; BIZER, Christian; KOBILAROV, Georgi; LEHMANN, Jens; CYGANIAK, Richard; IVES, Zachary. DBpedia: A Nucleus for a Web of Open Data. In: ABERER, Karl et al. (eds.). The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 722–735. ISBN 978-3-540-76298-0. Available from DOI: 10.1007/978-3-540-76298-0_52.

27. VRANDEČIĆ, Denny; KRÖTZSCH, Markus. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM. 2014, vol. 57, no. 10, pp. 78–85. ISSN 0001-0782. Available from DOI: 10.1145/2629489.

28. YAO, Xuchen; VAN DURME, Benjamin. Information Extraction over Structured Data: Question Answering with Freebase. In: ACL (1). 2014, pp. 956–966.

29. DONG, Li; WEI, Furu; ZHOU, Ming; XU, Ke. Question Answering over Freebase with Multi-Column Convolutional Neural Networks. In: ACL (1). 2015, pp. 260–269.

30. BERANT, Jonathan; CHOU, Andrew; FROSTIG, Roy; LIANG, Percy. Semantic Parsing on Freebase from Question-Answer Pairs. In: EMNLP. 2013, vol. 2, p. 6. No. 5.

31. UNGER, Christina; BÜHMANN, Lorenz; LEHMANN, Jens; NGONGA NGOMO, Axel-Cyrille; GERBER, Daniel; CIMIANO, Philipp. Template-based Question Answering over RDF Data. In: Proceedings of the 21st International Conference on World Wide Web. Lyon, France: ACM, 2012, pp. 639–648. WWW '12. ISBN 978-1-4503-1229-5. Available from DOI: 10.1145/2187836.2187923.

32. UNGER, Christina; FORASCU, Corina; LOPEZ, Vanessa; NGONGA NGOMO, Axel-Cyrille; CABRIO, Elena; CIMIANO, Philipp; WALTER, Sebastian. Question Answering over Linked Data (QALD-4). In: CAPPELLATO, Linda; FERRO, Nicola; HALVEY, Martin; KRAAIJ, Wessel (eds.). Working Notes for CLEF 2014 Conference. Sheffield, United Kingdom, 2014. Available also from: https://hal.inria.fr/hal-01086472.

33. VOORHEES, Ellen M; TICE, Dawn M. Building a question answering test collection. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. 2000, pp. 200–207.


34. YANG, Yi; YIH, Wen-tau; MEEK, Christopher. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In: EMNLP. 2015, pp. 2013–2018.

35. RAJPURKAR, Pranav; ZHANG, Jian; LOPYREV, Konstantin; LIANG, Percy. SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR. 2016, vol. abs/1606.05250. Available also from: http://arxiv.org/abs/1606.05250.

36. HERMANN, Karl Moritz; KOČISKÝ, Tomáš; GREFENSTETTE, Edward; ESPEHOLT, Lasse; KAY, Will; SULEYMAN, Mustafa; BLUNSOM, Phil. Teaching Machines to Read and Comprehend. CoRR. 2015, vol. abs/1506.03340. Available also from: http://arxiv.org/abs/1506.03340.

37. SACHAN, Mrinmaya; DUBEY, Kumar Avinava; XING, Eric P; RICHARDSON, Matthew. Learning Answer-Entailing Structures for Machine Comprehension. In: ACL (1). 2015, pp. 239–249.

38. MIKOLOV, Tomas; SUTSKEVER, Ilya; CHEN, Kai; CORRADO, Greg S; DEAN, Jeff. Distributed Representations of Words and Phrases and their Compositionality. In: BURGES, C. J. C.; BOTTOU, L.; WELLING, M.; GHAHRAMANI, Z.; WEINBERGER, K. Q. (eds.). Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 3111–3119. Available also from: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

39. MCCANDLESS, Michael; HATCHER, Erik; GOSPODNETIC, Otis. Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Greenwich, CT, USA: Manning Publications Co., 2010. ISBN 1933988177, 9781933988177.

40. STROHMAN, Trevor; METZLER, Donald; TURTLE, Howard; CROFT, W Bruce. Indri: A language model-based search engine for complex queries. In: Proceedings of the International Conference on Intelligent Analysis. 2005, vol. 2, pp. 2–6. No. 6.

41. RUMELHART, David E; HINTON, Geoffrey E; WILLIAMS, Ronald J. Learning internal representations by error propagation. 1985. Technical report. California Univ San Diego La Jolla Inst for Cognitive Science.


42. HOPFIELD, J J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences. 1982, vol. 79, no. 8, pp. 2554–2558. Available from eprint: http://www.pnas.org/content/79/8/2554.full.pdf.

43. KRIZHEVSKY, Alex; SUTSKEVER, Ilya; HINTON, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. In: PEREIRA, F.; BURGES, C. J. C.; BOTTOU, L.; WEINBERGER, K. Q. (eds.). Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105. Available also from: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

44. HOCHREITER, Sepp; SCHMIDHUBER, Jürgen. Long Short-Term Memory. Neural Comput. 1997, vol. 9, no. 8, pp. 1735–1780. ISSN 0899-7667. Available from DOI: 10.1162/neco.1997.9.8.1735.

45. LEE, Kenton; KWIATKOWSKI, Tom; PARIKH, Ankur P.; DAS, Dipanjan. Learning Recurrent Span Representations for Extractive Question Answering. CoRR. 2016, vol. abs/1611.01436. Available also from: http://arxiv.org/abs/1611.01436.

46. MEDVEĎ, Marek; HORÁK, Aleš. AQA: Automatic Question Answering System for Czech. In: Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12–16, 2016, Proceedings. Switzerland: Springer International Publishing, 2016, pp. 270–278. ISBN 978-3-319-45510-5. Available from DOI: 10.1007/978-3-319-45510-5_31.

47. HORÁK, Aleš; MEDVEĎ, Marek. SQAD: Simple Question Answering Database. In: Eighth Workshop on Recent Advances in Slavonic Natural Language Processing [print version]. Brno: Tribun EU, 2014, pp. 121–128.

A Research activity

• 2017: Project TA ČR, MU Brno, Tool development for confirmation of proposed techniques

• 2017: Project MUNI/33/55939/2017, MU Brno, Tool development for evaluation of information extraction accuracy on scanned documents

• 2017: Project LINDAT-Clarin, MU Brno, Data and tool development for annotation

• 2016-2017: Project HaBit, MU Brno, Automatic annotation techniques implementation

• 2015-2016: Project GA ČR, MU Brno, Development of a tool for logical analysis

• 2014-: Lexical Computing s.r.o., Brno, Corpus development

• 2014: Slovak Academy of Sciences – Ľudovít Štúr Institute, Bratislava, Corpus development

• 2014: Project Authorship, MU Brno, System development

• 2012-2013: Student science project, MU Brno, Czech syntactic parser adaptation for Slovak


B Teaching activities

• 2015-2016: IA161 Advanced Techniques of Natural Language Processing

• 2015-2016: IB111 Foundations of Programming, practical part


C Opponent review

• Dialogový systém s učením znalostí [Dialogue system with knowledge learning], bachelor thesis, Bc. Kristína Miklášová, 2017

• Vývojové prostředí pro digitalizaci deskových her [Development environment for the digitization of board games], bachelor thesis, Bc. Martin Golomb, 2016

• Automatické zodpovídání dotazů nad filmovou databází [Automatic question answering over a film database], bachelor thesis, Bc. Veronika Aksamítová, 2016

• Automatic question generation and adaptive practice, bachelor thesis, Bc. Tomáš Effenberger, 2015


D Selected papers

English-French Document Alignment Based on Keywords and Statistical Translation

Marek Medveď, Miloš Jakubíček, Vojtěch Kovář
Lexical Computing CZ s.r.o. & Centre of Natural Language Processing, Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno
[email protected]

Abstract

In this paper we present our approach to the Bilingual Document Alignment Task (WMT16), where the main goal was to reach the best recall on extracting aligned pages within the provided data. Our approach consists of three main parts: data preprocessing, keyword extraction and text pair scoring based on keyword matching.

For text preprocessing we use the TreeTagger pipeline that contains the Unitok tool (Michelfeit et al., 2014) for tokenization and the TreeTagger morphological analyzer (Schmid, 1994).

After keyword extraction from the texts according to TF-IDF scoring, our system searches for comparable English-French pairs. Using a statistical dictionary created from a large English-French parallel corpus, the system is able to find comparable documents.

At the end this procedure is combined with the baseline algorithm and the best one-to-one pairing is selected. The result reaches 91.6% recall on the provided training data. After a deep error analysis (see Section 5) the recall reached 97.4%.

1 Introduction

In this paper we describe our approach to solving the Bilingual Document Alignment Task (WMT16). It consists of three main parts: data preprocessing, keyword extraction and text pair scoring based on keyword matching.

According to these steps, the text is divided into three main sections. Section 2 describes the data preprocessing that was crucial for keyword extraction. In the next section we describe the keyword extraction process, and Section 4 describes scoring of comparable English-French pairs. The final results on the training data are summarized in Section 5, where we also discuss errors of our system and problematic features of the provided data.

2 Preprocessing

The training and testing data were provided in the .lett format. Each .lett file consists of lines where each line contains these six parts:

• Language ID (e.g. "en")
• Mime type (always "text/html")
• Encoding (always "charset=utf-8")
• URL
• HTML in Base64 encoding
• Text in Base64 encoding

We pick up the language ID, URL and text as the input for our system. To obtain keywords for each text, our system converts plain text into a so-called vertical text, or word-per-line format. This format contains each word on a separate line together with morphological information, namely lemma (base form of the word) and morphological tag. For text tokenization we use the Unitok tool (Michelfeit et al., 2014) that splits sentences into tokens according to a predefined grammar. Unitok has a special grammar model for each language that was created using information extracted from large corpora. An example of Unitok output is the first column of Figure 1. The Unitok output is enhanced by a sentence boundary recognizer (we use <s> and </s> for marking sentence boundaries).

After tokenization and sentence boundary detection, lemmatization and morphological analysis follows. For both we use TreeTagger

Figure 1 contains an example of a morphologically analyzed sentence in the vertical format. Unitok and TreeTagger, together with sentence boundary detection and a few other small pre- and post-processing scripts, form the TreeTagger pipeline that is used in the Sketch Engine (Kilgarriff et al., 2014) corpus query and management system.

    word       tag    lemma
    A          DT     a
    web        NN     web
    page       NN     page
    is         VBZ    be
    a          DT     a
    web        NN     web
    document   NN     document
    .          SENT   .

Figure 1: TreeTagger morphological analysis

3 Keyword Extraction

In the previous section, we described the text preprocessing needed for the next part of our system, the keyword extraction.
The lemma (base form) information from the morphological analysis was used for computing "keyness", or specificity scores, for each word in the text. For this, we used three different variants of the standard TF-IDF score (Equations 1, 2 and 3, which differ only in the TF weight) and the Simple math score (Kilgarriff, 2009) used for keyword extraction in Sketch Engine (Equation 4), a variant of statistic that chooses keywords according to the rule "word W is N times as frequent in document/corpus X vs. document/corpus Y":

    key_t = 1 * log(N / n_t)                        (1)
    key_t = (1 + log(f_{t,d})) * log(N / n_t)       (2)
    key_t = (f_{t,d} / f_d) * log(N / n_t)          (3)
    key_t = (fpm_{t,d} + 1) / (fpm_{t,ref} + 1)     (4)

Legend:

• N: number of documents in the corpus
• n_t: number of documents containing a particular word (token) t
• f_{t,d}: frequency of token t in document d
• f_d: size (length) of document d
• fpm_{t,d}: frequency per million of token t in document d
• fpm_{t,ref}: frequency per million of token t in a reference corpus (a large, representative sample of general language)

As reference corpora, the TenTen web corpora in Sketch Engine for English and French were used (Jakubíček et al., 2013), in particular enTenTen 2013 and frTenTen 2012.
Sometimes the TF-IDF scoring can score some of the most common words (like "the", "a", ...) very high. These so-called stop words do not have any value when finding a match between two texts, as practically all of the texts will contain them. Therefore, we created stop-word lists for English and French (from the enTenTen and frTenTen corpora) that filter out these most frequent words so they are never considered keywords.
As we will see, Equation 3 gives the best results on the training data, therefore we chose it for the final evaluation.

4 Scoring

After obtaining the keyword list from each text, the final step was to find matches between the English and French texts.
We used the top 100 keywords from each text (this number was estimated during the experiments). Then we consulted a statistical dictionary which contains the 10 most probable French translations for each English lemma (see below for more information about this dictionary).
We translated the English keywords into all of their French variants, and intersected this list of translations with the keyword lists extracted from all of the French documents. The French document with the biggest intersection was selected as the best candidate.
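To make Sections 3 and 4 concrete, the following sketch computes Equation 3 keyness scores and the keyword-intersection match. It is an illustration under our own data-structure assumptions, not the authors' implementation (which works on lemmatized vertical text and the logDice dictionary of Section 4.1):

    import math
    from collections import Counter

    def keywords(lemmas, doc_freq, n_docs, stopwords, top_n=100):
        # Equation 3: key_t = (f_{t,d} / f_d) * log(N / n_t)
        f_d = len(lemmas)
        scores = {t: (f_td / f_d) * math.log(n_docs / doc_freq.get(t, 1))
                  for t, f_td in Counter(lemmas).items() if t not in stopwords}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    def best_french_match(en_keywords, fr_keywords_by_url, dictionary):
        # Translate the English keywords into all their French variants
        # (up to 10 per lemma) and pick the French document whose keyword
        # set has the biggest intersection with the translations.
        translations = {fr for en in en_keywords for fr in dictionary.get(en, ())}
        return max(fr_keywords_by_url,
                   key=lambda url: len(translations & fr_keywords_by_url[url]),
                   default=None)

    # fr_keywords_by_url: {url: set of that document's top 100 keywords}
    # dictionary:         {english_lemma: list of up to 10 french lemmas}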

This procedure was combined with the baseline algorithm, which is based on finding language identification in the URLs of the documents (it iterates through all URLs, searches for language identifiers inside them and then produces pairs of URLs that have the same language identifiers): firstly, the baseline was applied, then (if no matching document was found) the matching by keywords was performed. The data processing flow is shown in Figure 2.

Figure 2: System data flow

4.1 Statistical translation dictionary

Sentence alignment in some of the available parallel corpora enables us to compute various statistics over the number of aligned pairs, and to quantify the probability (or another metric) that word X translates to word Y, for each pair of words in the corpus. The procedure is similar to training a translation model in statistical machine translation (Och and Ney, 2003). Our implementation uses the logDice association score (Rychlý, 2008), which is the same measure that is used for scoring collocational strength in word sketches, the key feature of the Sketch Engine system. It depends on:

• frequency of co-occurrence of the two words (e.g. "chat" and "cat") – the higher this frequency, the higher the resulting score; co-occurrence here means that the words occurred in a pair of aligned sentences

• standalone frequencies of the two words – the higher these frequencies, the lower the resulting score

By computing these scores for all word pairs across the corpus, we are able to list the strongest "translation candidates" for each word according to the score; for our purposes, we store the 10 best candidates. The procedure is computationally demanding – quadratic in the number of types (different words) in the corpus – and we exploit an algorithm for computing bi-grams to make it feasible even for very large corpora.
The statistical dictionary for this task was extracted from the English-French Europarl 7 corpus (Koehn, 2005).
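A simplified sketch of the dictionary extraction with logDice scores as described above; co-occurrence is counted over aligned sentence pairs, and the plain quadratic loop stands in for the authors' optimized bi-gram algorithm:

    import math
    from collections import Counter
    from itertools import product

    def translation_candidates(aligned_pairs, top_n=10):
        # aligned_pairs: iterable of (english_lemmas, french_lemmas)
        # per aligned sentence pair.
        f_en, f_fr, f_co = Counter(), Counter(), Counter()
        for en_sent, fr_sent in aligned_pairs:
            f_en.update(set(en_sent))
            f_fr.update(set(fr_sent))
            # every EN-FR lemma pair in an aligned sentence pair co-occurs once
            f_co.update(product(set(en_sent), set(fr_sent)))

        by_en = {}
        for (e, f), co in f_co.items():
            # logDice = 14 + log2(2*f_xy / (f_x + f_y))   (Rychlý, 2008)
            score = 14 + math.log2(2 * co / (f_en[e] + f_fr[f]))
            by_en.setdefault(e, []).append((score, f))
        return {e: [f for _, f in sorted(cands, reverse=True)[:top_n]]
                for e, cands in by_en.items()}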

5 Evaluation

The goal of this task was to find English-French URL pairs. Some training pairs were provided by the authors of the task. Our procedure does not include any learning from the training data, therefore we can use them for quite a reliable evaluation.
With regard to that data, our solution reached 91.6% recall using the most successful TF-IDF variant, Equation 3; the results for the other equations are comparable and are summarized in Table 1. If we did not include the baseline algorithm in the procedure, the recall was 82%.
After a detailed error analysis we found out that the provided data contain duplicate web pages with different URLs. This is an important problem – our error analysis shows that we have found a correct document pair in many cases, but a document with a different URL (and identical text) was marked as correct in the data.

Expected: http://cineuropa.mobi/interview.aspx?lang=en&documentID=65143
          http://cineuropa.mobi/interview.aspx?lang=fr&documentID=65143
Found:    http://cineuropa.mobi/interview.aspx?documentID=65143
          http://cineuropa.mobi/interview.aspx?lang=fr&documentID=65143

Expected: http://creationwiki.org/Noah%27s_ark
          http://creationwiki.org/fr/Arche_de_No%C3%A9
Found:    http://creationwiki.org/Noah%27s_Ark
          http://creationwiki.org/fr/Arche_de_No%C3%A9

Expected: http://pawpeds.com/pawacademy/health/pkd/
          http://pawpeds.com/pawacademy/health/pkd/index_fr.html
Found:    http://pawpeds.com/pawacademy/health/pkd/index.html
          http://pawpeds.com/pawacademy/health/pkd/index_fr.html

Figure 3: Examples of false errors

    Equation   Recall in %
    1          89.2
    2          89.5
    3          91.6
    4          88.7
    Baseline   67.92

Table 1: Overall results according to the "keyness" equations.

We went through the document pairs marked as errors of our algorithm and manually evaluated them for correctness. If we exclude the false errors (correct document pairs evaluated as incorrect), the recall is 97.4%. Some examples of these URL pairs are given in Figure 3; as we can see, in many cases the duplicity is clear directly from the URL.
Unfortunately, we were unable to assess the number of duplicates in the data by the submission deadline. However, we believe it will be done, as the mentioned duplicates significantly reduce the soundness of such evaluation.

6 Conclusion

We have described a method for finding English-French web pages that are translations of each other. The method is based on statistical extraction of keywords and comparing them using a translation dictionary. The results are promising, but a detailed error analysis shows there are significant problems in the testing data, namely unmarked duplicate texts with different URLs.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2015071 and by the Grant Agency of CR within the project 15-13277S. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047.

References

Jan Michelfeit, Jan Pomikálek, Vít Suchomel. Text tokenisation using Unitok. In: 8th Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 71-75, Brno, Tribun EU, 2014.

Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, pp. 44-49, 1994.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel. The Sketch Engine: ten years on. Lexicography, pp. 7-36, 2014.

Adam Kilgarriff. Simple maths for keywords. In: Proceedings of the Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, 2009.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel. The TenTen corpus family. In: The 7th International Corpus Linguistics Conference, Lancaster, 2013.

Franz Josef Och, Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, volume 29, number 1, pp. 19-51, 2003.

Pavel Rychlý. A lexicographer-friendly association score. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, pp. 6-9, 2008.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In: MT Summit, 2005.

European Union Language Resources in Sketch Engine

Vít Baisa, Jan Michelfeit, Marek Medveď, Miloš Jakubíček
Masaryk University, Czech Republic, Brno
Lexical Computing Ltd, United Kingdom, Brighton
{vit.baisa,jan.michelfeit,marek.medved,milos.jakubicek}@sketchengine.co.uk

Abstract
Several parallel corpora built from European Union language resources are presented here. They were processed by state-of-the-art tools and made available to researchers in the Sketch Engine corpus management system. A completely new resource is introduced: the EUR-Lex corpus, one of the largest parallel corpora available at the moment, containing 840 million tokens of English and having the largest language pair (English-French) with more than 25 million aligned segments (paragraphs).

Keywords: JRC-Acquis, DCEP, DGT-TM, Europarl, EUR-Lex, Sketch Engine, parallel corpus, word sketch, parallel concordance

1. Introduction

The European Union is producing a large amount of valuable multilingual textual data every day. To be able to use it in applications, for text analysis, terminology extraction, full text search etc., it must be downloaded, converted into plain text, processed with suitable tools, aligned on the sentence level and finally made available to researchers in some standard format. In this paper we describe our experience with using several resources built from the European Union's (EU) multilingual resources, namely DCEP (Hajlaoui et al., 2014), DGT-TM (Steinberger et al., 2013) and Europarl (Koehn, 2005).

We also describe a new multilingual "EUR-Lex corpus" containing more than 840 million tokens of English. To our knowledge, it is currently the largest parallel corpus built from European language resources. The corpus was downloaded from the official website of EUR-Lex (http://eur-lex.europa.eu), which provides access to up-to-date legal documents published by the European Commission, the European Parliament, national courts, the Council of the European Union and other European institutions. The majority of recently added documents is translated into all official languages of the EU, making it a huge multilingual language resource.

Table 1 compares the mentioned language resources; JRC-Acquis 3.0 (https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis) figures are there for comparison. "Tokens" is the number of tokens (words, numbers and punctuation) in the English parts of the corpora, "Types" is the number of unique English word forms, i.e. the size of the English lexicons, the "L" column contains the number of languages included and "Format" states in which form the source data is available.

    Corpus       Tokens        Types      L   Format
    JRC-Acquis   55,537,910    N/A        22  XML
    DCEP         118,046,857   513,000    23  TXT
    DGT-TM       74,365,007    342,340    24  TMX
    Europarl     60,741,877    139,217    21  XML
    EUR-Lex      839,745,466   2,416,841  24  various

Table 1: Comparison of various EU corpora.

All mentioned corpora are available for language researchers through the Sketch Engine corpus management system (Kilgarriff et al., 2014). The EUR-Lex corpus is released in the form of gzipped archives containing a) documents with meta information in a flat XML format and b) alignment files for all language pairs. The whole gzipped dataset is over 40 GB; to obtain the data, contact us or follow the instructions at https://www.sketchengine.co.uk/eur-lex.

Table 2 contains an overview of language representation in the corpora in millions of tokens per language. The second column states the year when a particular language became an official language of the European Union; it usually corresponds to the amount of documents in the particular language, and the table is sorted by this column. ACQ stands for JRC-Acquis 3.0, CEP for the Digital Corpus of the European Parliament, EUR for Europarl and LEX for the EUR-Lex corpus.

    Language     Since  ACQ  CEP  DGT  EUR  LEX
    Dutch        1958   35   96   63   60   777
    French       1958   39   116  47   67   878
    German       1958   32   98   58   55   732
    Italian      1958   36   103  66   59   807
    Danish       1973   31   88   59   56   731
    English      1973   35   118  74   61   840
    Greek        1981   36   100  64   44   775
    Portuguese   1986   37   99   66   61   793
    Spanish      1986   39   106  69   61   831
    Finnish      1995   25   72   47   41   558
    Swedish      1995   29   86   55   52   640
    Czech        2004   23   51   57   15   500
    Estonian     2004   25   43   46   13   437
    Hungarian    2004   29   50   55   15   500
    Latvian      2004   28   48   54   14   491
    Lithuanian   2004   27   47   52   14   476
    Maltese      2004   21   46   30   —    466
    Polish       2004   30   51   58   15   511
    Slovak       2004   27   50   56   15   495
    Slovenian    2004   28   50   57   15   509
    Bulgarian    2007   16   41   33   11   457
    Irish        2007   —    2    1    —    37
    Romanian     2007   9    42   33   11   462
    Croatian     2013   —    —    5    —    156

Table 2: Representation of languages (millions of tokens).

2. DCEP

The Digital Corpus of the European Parliament (DCEP) (Hajlaoui et al., 2014) is a collection of documents published on the European Parliament's official website (http://www.europarl.europa.eu/). This corpus includes a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The latest version contains documents produced in 2001–2012.

Since the original alignments contained a lot of errors and the sentences were wrongly segmented, we created a new alignment. Instead of the HunAlign aligner (Varga et al., 2007) we used the GaChalign algorithm (https://github.com/alvations/gachalign), an implementation of the Gale-Church sentence aligner (Gale and Church, 1993).

The data has been processed automatically by Sketch Engine: the plain text data has been tokenized with Unitok (Michelfeit et al., 2014) and tagged with various tools: TreeTagger (Schmid, 1995), Hunpos (Halácsy et al., 2007) and Freeling (Carreras et al., 2004). Further processing involved collocation pattern extraction, terminology extraction, distributional thesaurus computation and other specific processing which is available in Sketch Engine for many languages (Kilgarriff et al., 2014).

3. DGT-TM

The European Commission's Directorate-General for Translation, in cooperation with the European Commission's Joint Research Centre, has created the freely available translation memory DGT-TM (Steinberger et al., 2013). The DGT-TM is stored in TMX files with segments aligned in 231 language pairs.

We have processed DGT-TM with Sketch Engine: it supports TMX import, so we just merged all the original TMX files and let Sketch Engine extract the aligned segments, tokenize and PoS-tag the texts. See Figure 2 for an example of the parallel collocation functionality in Sketch Engine.

Figure 1: Bilingual terminology candidates extracted from DGT-TM English-Spanish. [Screenshot omitted.]

Figure 2: Parallel collocation candidates for English "Commission" and the Czech equivalent "komise" derived from the DGT-English and DGT-Czech corpora in Sketch Engine. The joint grey and green columns correspond to a grammar relation (object_of, modifier and coordination) in which the collocation candidates occur in the data. The collocates in the green columns are usually translation equivalents of the collocates in the joint grey columns, e.g. inform–informovat, Electoral–volební, Presidency–předsednictví, etc. [Screenshot omitted.]

4. Europarl

The Europarl parallel corpus is a well-known resource (Koehn, 2005). It is a collection of sentence-aligned texts in 21 languages extracted from the proceedings of the European Parliament. It stands out among the other corpora provided by the EU, which contain mostly legal documents. Its primary goal is to aid statistical machine translation systems. The authors of the corpus have detected sentence boundaries in the raw transcripts and aligned the sentences using a tool based on the Gale and Church algorithm (Gale and Church, 1993).

The Europarl corpus has also been incorporated into the OPUS project, a collection of publicly available parallel corpora (Tiedemann, 2009). Thanks to this, the sentence alignment data is available from the OPUS website in the XCES format, which can be easily translated into the format used internally by Sketch Engine (pairs of structure IDs, here sentence IDs). See Figure 3 for an example of full-text parallel search in Sketch Engine using the Europarl corpus.

All the text for each of the 21 languages was processed by the most up-to-date (at the time of compilation) processing chain for each respective language, including tokenization (Michelfeit et al., 2014) and PoS tagging where available, but excluding sentence boundary detection, which was taken directly from the Europarl data. Each of the resulting 21 corpora is therefore compatible for use as a reference corpus for other corpora in Sketch Engine (including user-created corpora) of the same language. The same holds for the DCEP and DGT corpora. A reference corpus is used for comparison with a focus corpus for the extraction of keywords and terminology. Bilingual terminology (Baisa et al., 2015) can also be extracted, see Figure 1.

All of the Europarl corpora are aligned to each other, giving us a total of 210 language pairs. Each pair of corpora can be exploited to extract a statistical dictionary of words and lemmas (where available), or even term candidates. Due to the nature of the texts, the vocabulary used is relatively broad, while the quality of the data is far better than in other, bigger web-based corpora. This makes Europarl an invaluable resource for the creation of statistical dictionaries and for building translation models for statistical machine translation systems.

5. EUR-Lex corpus

EUR-Lex is an official on-line resource providing access to 1) the Official Journal of the European Union, 2) EU law (EU treaties, directives, regulations, decisions, consolidated legislation, etc.), 3) preparatory acts (legislative proposals, reports, green and white papers, etc.), 4) EU case-law (judgements, orders, etc.), 5) international agreements, 6) EFTA documents and 7) other public documents dating back to the 1950s, in 24 official EU languages. The EUR-Lex website allows querying its database in which each document has meta data ranging from unique IDs (cellar and CELEX numbering, see http://eur-lex.europa.eu/content/help/faq/intro.html#help10), dates of documents, official publication and revision dates, Eurovoc terms (http://eurovoc.europa.eu/), authors (an agent, a state) of a document, type of a document etc.

Figure 3: Parallel search in Sketch Engine for English Commission, French Commission and German Kommission, DGT.

To get all the documents, we first had to query EUR-Lex for meta data year by year, as a list of all documents in EUR-Lex is not available. From the meta data, a list of all available documents with CELEX numbers was retrieved (with all their language variants) and then all the documents were downloaded. Only documents in HTML format have been downloaded, yielding almost 7 million documents in 26 languages (Norwegian and Icelandic are represented in EUR-Lex, but we have omitted them from the final data set due to the negligible number of documents). According to the statistics at http://eur-lex.europa.eu/statistics/eu-law-statistics.html there are more PDF documents than HTML documents, but we decided to download only HTML in the first phase as HTML files are easier for further processing.

We have exploited the fact that the EUR-Lex database contains HTML documents split into fine-grained paragraphs and these paragraphs mostly correspond to each other in the different languages. This can be seen in the parallel view on the EUR-Lex website (e.g. http://eur-lex.europa.eu/legal-content/EN-ES-FR/TXT/?qid=1445777763012&uri=CELEX:32013R1303&from=EN). Sometimes the count of paragraphs is inconsistent in some language mutations, so we have corrected these using a modified Gale-Church algorithm (based on GaChalign, https://github.com/alvations/gachalign).

The resulting corpus has 3.9 million documents. Figure 4 shows the sizes of the aligned documents. The largest language pair, English-French, has 25,211,093 aligned paragraphs. All data from the JRC-Acquis corpus (Steinberger et al., 2006) should be included in the EUR-Lex corpus.

According to the copyright notice on the EUR-Lex website (http://eur-lex.europa.eu/content/legal-notice/legal-notice.html): "Except where otherwise stated, reuse of the EUR-Lex data for commercial or non-commercial purposes is authorised provided the source is acknowledged (© European Union, http://eur-lex.europa.eu/, 1998–2015)". This allows us to provide the downloaded data to researchers (see the download instructions at https://www.sketchengine.co.uk/eur-lex). Fully processed data (tokenized, PoS-tagged) is not available due to the taggers' copyright reasons, but it is available in Sketch Engine.
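For illustration, a minimal sketch of the length-based machinery behind a Gale-Church-style aligner. The constants and priors follow Gale and Church (1993); the paper's actual modified variant is not specified, so this only shows the general idea, restricted to 1-1 matches and skips:

    import math

    # Alignment-pattern priors from Gale and Church (1993)
    P_MATCH, P_SKIP = 0.89, 0.0099

    def length_cost(l1, l2, c=1.0, s2=6.8):
        # -log probability that paragraphs of l1 and l2 characters align,
        # based on the normalized length difference delta
        delta = (l2 - l1 * c) / math.sqrt(max(l1, 1) * s2)
        two_tail = 2 * (1 - 0.5 * (1 + math.erf(abs(delta) / math.sqrt(2))))
        return -math.log(max(two_tail, 1e-12))

    def align_cost(paras1, paras2):
        # Dynamic programme over 1-1 matches and 1-0/0-1 skips (simplified)
        n, m = len(paras1), len(paras2)
        INF = float("inf")
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i and j:
                    c = length_cost(len(paras1[i-1]), len(paras2[j-1]))
                    D[i][j] = min(D[i][j], D[i-1][j-1] + c - math.log(P_MATCH))
                if i:
                    D[i][j] = min(D[i][j], D[i-1][j] - math.log(P_SKIP))
                if j:
                    D[i][j] = min(D[i][j], D[i][j-1] - math.log(P_SKIP))
        return D[n][m]   # backtracking to recover the pairing is omitted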

Figure 4: Aligned paragraph counts in the EUR-Lex corpus for all pairs of the 24 languages, in millions (M) and thousands (k); darker means larger alignment. [The full 24 × 24 matrix is not reproduced here; the largest pair, English-French, has 25M aligned paragraphs, while the smallest pairs involve Irish and Croatian.]

Since EUR-Lex documents contain rich meta data, various aspects can be studied in Sketch Engine: e.g. one can study the trends in keywords and translations over the last 60 years, discover language characteristics per EU body, extract domain terminologies using the EuroVoc thesaurus etc. We will leave the enumeration of all the possibilities to the reader.

6. Conclusion

We have described a few European multilingual resources and how we made them available in the corpus manager Sketch Engine for lexicographers, linguists and language researchers in general. This allows them to search the full text data using a rich query language which is more suitable for linguistically motivated searches than the full text search engine used on the official EUR-Lex web page. Users can also use various statistics derived from the data, e.g. a distributional thesaurus, automatic collocations, keyword and terminology candidates, bilingual terminology candidates, parallel collocates and much more.

We have also described a new resource, the EUR-Lex corpus, which is to our knowledge the largest resource built from EU data at the moment. Thanks to the permissive data policy of the EU we can provide the full data to researchers.

In the future, we plan to download and process EUR-Lex documents also in other formats (PDF, DOCX). This should yield even more parallel data. Another way of getting more parallel data is simply to repeat the whole processing once every few months, since the EU Publication Office adds new documents to EUR-Lex every day.

7. Acknowledgements

This work has been partly supported by the Grant Agency of CR within the project 15-13277S. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047, the LINDAT/CLARIN project LM2010013 and within the Specific University Research.

8. References

Baisa, V., Ulipová, B., and Cukr, M. (2015). Bilingual terminology extraction in Sketch Engine. In Horák, A. and Rychlý, P., editors, Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, pages 61–67, Brno. Tribun EU.

Carreras, X., Chao, I., Padró, L., and Padró, M. (2004). FreeLing: An open-source suite of language analyzers. In LREC.

Gale, W. A. and Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.

    Document type               Docs    | Author                       Docs    | EuroVoc               Docs   | Year  Docs
    Written question            156,744 | European Commission          150,545 | State aid             18,239 | 2013  24,978
    Regulation                  59,758  | European Parliament          104,323 | European Commission   18,057 | 2011  24,852
    judicial information        36,964  | Provisional data             53,230  | information transfer  15,778 | 2012  22,879
    Decision                    20,400  | Council of the EU            31,453  | control of State aid  14,096 | 2010  22,266
    Question at Question Time   19,027  | Court of Justice             22,397  | import                14,074 | 2007  20,216
    Communication               16,384  | Court of Justice of the EU   14,637  | econ. concentration   12,620 | 2008  19,238
    Consolidated text           16,060  | Court of First Instance      12,201  | merger control        12,558 | 2009  18,088
    decision w/out addressee    13,718  | General Court                9,056   | originating product   11,896 | 2006  17,822
    Judgment                    13,709  | EES Committee                4,524   | Italy                 11,831 | 2003  16,587
    Proposal for a regulation   8,608   | United Kingdom               3,995   | Spain                 10,882 | 2005  16,407
    Opinion                     7,774   | EEA Joint Committee          2,880   | annul. of EC decis.   10,698 | 2000  16,248
    National exec. measures     7,745   | Civil Service Tribunal       2,830   | EU Member State       10,562 | 2001  16,044
    Information                 7,314   | Malta                        2,184   | Germany               10,274 | 1996  15,293
    Notice                      7,306   | The Member States            1,978   | interpr. of the law   10,030 | 2004  14,974
    Adv. General's Opinion      7,155   | Ireland                      1,729   | EU programme          9,760  | 1998  14,946
    Treaty                      5,808   | National Courts              1,674   | export refund         9,337  | 1997  14,929
    Own-initiative resolution   5,460   | Committee of the Regions     1,364   | award of contract     9,258  | 2014  14,868
    Report                      4,454   | European Court of Auditors   1,248   | third country         9,210  | 2002  14,868
    Implementing regulation     4,205   | The 12 Member States         1,182   | trademark law         9,110  | 1995  14,319
    proposal for a decision     4,066   | EFTA Surveillance Authority  985     | European trademark    8,912  | 1999  12,667
    Info                        4,066   | European Central Bank        847     | environ. protection   8,693  | 1992  10,768
    Directive                   3,795   | KOSTOPOULOS                  807     | EU financing          8,212  | 1993  9,693
    Order                       3,407   | Others                       686     | import (EU)           8,060  | 1986  9,265
    Own-initiative report       3,054   | Gov. representatives         639     | EU aid                8,015  | 1990  9,259
    Opinion proposing amend.    3,039   | The 6 Member States          622     | France                7,980  | 1985  9,224

Table 3: Example of meta data in the English part of the EUR-Lex corpus, sorted by document frequency.

Hajlaoui, N., Kolovratnik, D., Väyrynen, J., Steinberger, R., and Varga, D. (2014). DCEP - Digital Corpus of the European Parliament. In LREC, pages 3164–3171.

Halácsy, P., Kornai, A., and Oravecz, C. (2007). HunPos: an open source trigram tagger. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 209–212. Association for Computational Linguistics.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., and Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1):7–36.

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Michelfeit, J., Pomikálek, J., and Suchomel, V. (2014). Text tokenisation using Unitok. In 8th Workshop on Recent Advances in Slavonic Natural Language Processing, pages 71–75, Brno. Tribun EU.

Schmid, H. (1995). TreeTagger: a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 43:28.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.

Steinberger, R., Eisele, A., Klocek, S., Pilos, S., and Schlüter, P. (2013). DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Tiedemann, J. (2009). News from OPUS: a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237–248.

Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., and Trón, V. (2007). Parallel corpora for medium density languages. Amsterdam Studies in the Theory and History of Linguistic Science, Series 4, 292:247.

AQA: Automatic Question Answering System for Czech

Marek Medveď and Aleš Horák

Faculty of Informatics, Natural Language Processing Centre, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
{xmedved1,hales}@fi.muni.cz

Abstract. Question answering (QA) systems have become popular nowadays; however, the majority of them concentrate on the English language and most of them are oriented to a specific limited problem domain. In this paper, we present a new question answering system called AQA (Automatic Question Answering). AQA is an open-domain QA system which allows users to ask all common questions related to a selected text collection. The first version of the AQA system is developed and tested for the Czech language, but we also plan to include more languages in future versions. The AQA strategy consists of three main parts: question processing, answer selection and answer extraction. All modules are syntax-based with advanced scoring obtained by a combination of TF-IDF, the tree distance between the question and candidate answers and other selected criteria. The answer extraction module utilizes a named entity recognizer which allows the system to catch entities that are most likely to answer the question. Evaluation of the AQA system is performed on the previously published Simple Question-Answering Database (SQAD) with more than 3,000 question-answer pairs.

Keywords: Question Answering · AQA · Simple Question Answering Database · SQAD · Named entity recognition

1 Introduction

The number of searchable pieces of new information increases every day. Looking up a concrete answer to an asked question simply by information retrieval techniques thus becomes difficult and extremely time consuming. That is why new systems devoted to the Question Answering (QA) task are being developed nowadays [2,9,10]. The majority of them concentrate on the English language and/or rely on a specific limited problem domain or knowledge base.
This leads us to two main drawbacks of such QA systems: firstly, the specific limited problem domain and knowledge base always pre-limit the answers to the knowledge stored in the system, making it possibly useful only for people working in this specific domain but useless for common people in daily usage. The second problem arises when going multilingual, since in most cases the system functionality is not directly transferable to other languages without a decrease in accuracy.
In this paper, we present a new open-domain syntax-based QA system, named AQA. The first version of the system is developed and tested on the Czech language and uses the latest approaches from the QA field. The evaluation was performed on the Simple Question Answering Database (SQAD) [4].

2 The AQA System

In the following text we briefly describe the input format, the database format and the main AQA modules that are necessary to extract a concrete answer to a given question.

2.1 Input

Since the AQA system is aimed at processing texts in morphologically rich languages, it supports two input formats for the user question. The first is a plain text question without any structure or additional information; the second one is the vertical format (one text token per line with multiple attributes separated by tabs) with words enriched by lexical and morphological information, see Fig. 1.

Fig. 1. Question "Kdo je autorem novely Létající jaguár?" (Who wrote the story called Flying jaguar?) in the vertical format.

Internally, the syntax-based AQA system works with syntactic trees, which means that the plain text input form is automatically processed by a morphological tagger (in the current version, we use the morphological analyser Majka [5,11] disambiguated by the DESAMB [7] tagger) and transformed into the (internal representation of) the vertical format.
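A small sketch of reading the vertical input format (one token per line, tab-separated attributes); the word/lemma/tag column order is our assumption for illustration:

    def parse_vertical(text):
        # Parse vertical-format input into a list of annotated tokens,
        # skipping structural tags such as <s> and </s>.
        tokens = []
        for line in text.splitlines():
            if not line or line.startswith("<"):
                continue
            word, lemma, tag = line.split("\t")[:3]
            tokens.append({"word": word, "lemma": lemma, "tag": tag})
        return tokens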

The morphologically annotated question input (obtained from the user or created by the tagger) is sent to the Question processor module that extracts all important information about the question. The Question processor module is described in detail in Sect. 2.3.

2.2 Knowledge Base

To find an appropriate answer to the given question, the AQA system uses a knowledge database that can be created from any natural language text (in Czech, for now). The text can again be in the form of plain text or vertical text, the same as the input question. To create a new knowledge database, the user enters just the source of the input texts (file, directory, ...) and the AQA system automatically processes the texts and gathers the information to be stored in the knowledge base. The texts are processed in several steps and the following features are extracted:

– Syntactic tree: each sentence in the input text is parsed by a syntactic analyser and the resulting syntactic tree is stored in the database. We use the SET (Syntactic Engineering Tool) parser [6] in the current version. The SET parsing process is based on pattern matching rules for link extraction with probabilistic tree construction. For the purposes of AQA, the SET grammar was enriched by answer types that serve as hints to the Answer selection module (see Sect. 2.3) for matching among the candidate answers.
– Tree distance: for each syntactic tree, the system computes tree distance mappings between each word pair in a noun phrase. This information is used in the Answer selection module (see Sect. 2.3) to pick the correct answer from multiple candidate answers.
– Birth/death date, birth/death place: this feature is specific to sentences where the birth/death date and birth/death place are present in the text just after a personal name, usually in parentheses (see the example in Fig. 2).
– Phrase extraction: within the parsing process, the SET parser can provide a list of (noun, prepositional, ...) phrases present in the sentence. This information is used in the Answer extraction module (see Sect. 2.3) when the system searches for a match between the question and the (parts of the) knowledge base sentences.
– Named entity extraction: as mentioned above, the general SET grammar was modified to include, within the resulting syntactic tree, also the information about what kind of questions can be answered by each particular sentence. This information is also supplemented with a list of named entities found in the sentence. The AQA system recognizes three named entity types: a place, an agent and an art work. In the current version, AQA uses the Stanford Named Entity Recognizer [3] with a model trained on the Czech Named Entity Corpus 1.1 [8] and the Czech DBpedia [1] data.
– Inverted index: an inverted word index is created for fast search purposes; this serves mainly as a technical solution (a minimal sketch of this structure follows the list below).
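Of the extracted features, the inverted index is the easiest to illustrate; a minimal sketch under our own layout assumptions, not the AQA internals:

    from collections import defaultdict

    class InvertedIndex:
        # Maps a lemma to the set of knowledge-base sentence IDs containing it.
        def __init__(self):
            self.postings = defaultdict(set)

        def add_sentence(self, sent_id, lemmas):
            for lemma in lemmas:
                self.postings[lemma].add(sent_id)

        def candidates(self, question_lemmas):
            # Sentences sharing at least one lemma with the question,
            # ordered by the number of shared lemmas (most shared first).
            hits = defaultdict(int)
            for lemma in question_lemmas:
                for sent_id in self.postings.get(lemma, ()):
                    hits[sent_id] += 1
            return sorted(hits, key=hits.get, reverse=True)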

Sir Isaac Newton (January 4 1643 in Woolsthorpe – March 31 1727 in London) was an English physicist, mathematician …

Fig. 2. Example sentence for birth/death date and place

For development and evaluation purposes, the AQA knowledge base was trained on the SQAD 1.0 database [4] that was manually created from Czech Wikipedia web pages and contains 3,301 question-answer pairs. The answer is stated both in the form of the direct answer (noun, number, noun phrase, ...) and in the form of the answer context (several sentences) containing the direct answer. The results of an evaluation of AQA with the SQAD database are explicated in Sect. 3.

2.3 The AQA Modules

The AQA system consists of three main modules: the Question processor, the Answer selection module and the Answer extraction module. Figure 3 presents a schematic graphical representation of the AQA system parts.

Fig. 3. The AQA system modules.

The Question Processor. The first step after the system receives the input question in the form of morphologically annotated tokens (either created by the tagger or obtained in the vertical format from the user) consists in extracting the following features:

– Question reformulation: the original question is reformulated (if possible) into a "normalized" question form, which allows to simplify the answer matching process later. For example, if the user asks a question starting with "Jak se jmenuje ... osoba ..." (What is the name of ... a person ...), the system reformulates the question to "Kdo je ..." (Who is ...). The meaning of the sentence remains the same, but for further processing the reformulation directly corresponds to searching for a personal name or denotation among possible answers.
– Question syntactic tree: the same as the AQA system uses syntactic trees in the knowledge base creation process (Answer selection and extraction modules), the system automatically creates a syntactic tree for each question. The question syntactic tree is provided by the same parser as used for the knowledge base processing.
– Question type extraction: due to the enriched parsing grammar, the question type can be extracted during the syntactic tree creation. The question type itself is determined by the sentence structure and the specific pronouns present in the question. For example, when the user asks a question such as "Kdo byl ..." (Who was ...), the system assigns this question the WHO question type. The answer selection process then uses this information to filter the matching question and answer types.
– Question main subject and main verb extraction: the AQA system tries to find in the syntactic tree the main subject and the main verb of the question. This information is important for the answer extraction process. According to this information, the system knows which subject and verb should be present in the candidate answer (if the system picks up more than one). A rule-based sketch of the reformulation and type assignment follows this list.
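The following sketch illustrates the reformulation and type assignment; the surface patterns and type names here are illustrative assumptions, since AQA derives the question type from the enriched parse, not from regular expressions:

    import re

    # Illustrative surface patterns for Czech questions
    RULES = [
        (r"^jak se jmenuje\b", "Kdo je", "WHO"),   # "What is the name of" -> "Who is"
        (r"^kdo\b",            None,     "WHO"),
        (r"^kdy\b",            None,     "WHEN"),
        (r"^kde\b",            None,     "WHERE"),
        (r"^kolik\b",          None,     "HOW_MANY"),
    ]

    def analyse_question(question):
        # Return the (possibly reformulated) question and its question type.
        q = question.strip()
        for pattern, replacement, qtype in RULES:
            if re.match(pattern, q, flags=re.IGNORECASE):
                if replacement:                    # normalize the wording
                    q = re.sub(pattern, replacement, q, count=1,
                               flags=re.IGNORECASE)
                return q, qtype
        return q, "UNKNOWN"

    # analyse_question("Kdo byl prvním prezidentem?")
    # -> ("Kdo byl prvním prezidentem?", "WHO")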

Answer Selection Module. After extracting all the necessary information from the question, the Answer selection module is responsible for recognizing all possible candidate answers. This module communicates with the knowledge database and collects all pieces of text (sentence or paragraph) that contain a possible answer. The selection decision rules are based on the question type, the question main subject, the question main verb and specific words present in the question syntactic tree. Each of the candidate answer texts is then assigned a score and the Answer selection module picks up the first five top-rated answers.
The ranking score for each candidate answer is a combination of a TF-IDF (Term Frequency-Inverse Document Frequency) score and the tree distance between the words in noun phrases in the answer and the question. The TF-IDF score consists of two parts:

– TF-IDF match: for each word (nouns, adjectives, numerals, verbs, adverbs) that matches in the question-answer pair, the TF-IDF score is computed as

    tf_idf_match = (1 + log(tf)) * log(idf)

– TF-IDF mismatch: the TF-IDF score for the rest of the words that did not match in the question-answer pair.

Then the resulting TF-IDF score is determined as:

    tf_idf_res = tf_idf_match − tf_idf_mismatch

The final score of a candidate answer with respect to the question is finally supplemented by the tree distance calculated between the words in the question and answer noun phrases. If the tree distance in the question-answer pair noun phrase is equal, it does not influence the final score. But when the tree distance is not equal, the final score is modified as follows:

    final_score = tf_idf_res − |TreeDistance_q − TreeDistance_a|

The five best scored candidate answers are then sent to the Answer extraction module that extracts the particular parts of each sentence that will be considered as the required answer.
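The three formulas combine as in the following sketch; term frequencies, IDF values and tree distances are assumed to be precomputed, and the log(idf) factor is kept exactly as the paper states it:

    import math

    def answer_score(q_terms, a_terms, idf, tree_dist_q, tree_dist_a):
        # q_terms, a_terms: dicts mapping lemma -> term frequency in the
        # question and in the candidate answer respectively
        matched = q_terms.keys() & a_terms.keys()
        mismatched = a_terms.keys() - q_terms.keys()

        def tf_idf(words):
            return sum((1 + math.log(a_terms[w])) * math.log(idf[w])
                       for w in words)

        tf_idf_res = tf_idf(matched) - tf_idf(mismatched)
        # equal tree distances leave the score untouched,
        # a difference penalizes the candidate
        return tf_idf_res - abs(tree_dist_q - tree_dist_a)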

The Answer Extraction Module. The final step of the AQA processing is accomplished by the Answer extraction module where a particular part of each sentence is extracted and declared the final answer. This part of the system works only with the five best scored candidate answers that were picked by the Answer selector module. The final answer to the asked question is extracted according to the following three factors:

– Question focus: as mentioned in Sect. 2.3, each question receives a question type according to the sentence structure and the words present in the question. The question type is then mapped to one or more named entity types.
– Answer named entities: within the knowledge base creation process, AQA extracts the supported named entities of three types, i.e. Place, Agent and ArtWork. The Answer extraction module then maps the question focus to the extracted answer named entities. In this process, AQA also excludes named entities that are present in the question to avoid an incorrect answer. This is the first attempt to get a concrete answer.
– Answer noun phrases: in case the previous step fails to find an answer, the system selects one phrase from the phrase list as the answer. The failure can be caused by two reasons. First, the question focus can be a date or a number. In this case, the Answer extraction module returns a noun phrase where the question main subject related to a number is present, or it returns a birth/death date if the question asks about the birth or death of some person (the birth/death date is stored in the database alongside the sentence tree, named entities and phrases), see Sect. 2.2. The second reason why the question focus mapping to answer named entities can fail is a missing named entity in the candidate answer. In this case, the system checks the remaining candidate answers, and if it does not succeed, the Answer extraction module returns the noun phrase that contains the main question subject.

3 Evaluation

Within the evaluation process, we have used the SQAD v1.0 database containing 3,301 entries of a question, an answer and an answer text. The AQA knowledge base was built from all the answer texts from SQAD, and all 3,301 questions were answered by the AQA system. The results were evaluated on two levels (see Table 1). The first level evaluates the Answer selection module and the second level evaluates the Answer extraction module. There are three possible types of a match between the AQA answer and the expected answer:

– a (full) Match:
  • for the Answer selection module: the module picked up the correct sentence/paragraph,
  • for the Answer extraction module: the first provided answer is equal to the expected answer;
– a Partial match:
  • for the Answer selection module: a correct sentence/paragraph appears in the Top 5 best scored answers that were selected,
  • for the Answer extraction module: the provided answer is not at the first position or it is not an exact phrase match;
– a Mismatch:
  • for the Answer selection module: the correct sentence/paragraph is not present in the Top 5 best scored answers,
  • for the Answer extraction module: incorrect or no answer produced.

Table 1. Evaluation of the AQA system on the SQAD v1.0 database

                     Answer selection         Answer extraction
    Match            2,645      80.1 %        1,326      40.2 %
    Partial match       66       1.9 %          443      13.4 %
    Mismatch           590      18.0 %        1,532      46.4 %

4 Error Analysis

Within a detailed analysis of the errors (mismatches and partial matches) in the evaluation, we have identified their causes and outlined the directions and tasks of the next development steps of the AQA modules.
The Question processor module needs to use a fine-grained grammar for the question type assignment task. In the evaluation process, 18 % of the errors had been assigned an incorrect question type. The following modules then could not find the required answer.

The Answer selection module is in some cases too biased towards preferring answers that contain a question word multiple times. This was the cause of 20 % of the erroneous answers in the analysis.
The Answer extraction module uses the question subject word, named entities and phrase extraction in the extraction process. The evaluation showed that the module needs to include other phrase and tree matching techniques – 21 % of errors were caused by extracting a different phrase from a correctly selected answer sentence. The current AQA version also suffers from not applying anaphora resolution techniques, which are necessary when the correct answer needs to be extracted via a reference between two sentences.

5 Conclusions

The paper presented details about the architecture of a new open question-answering system named AQA. The system is aimed at morphologically rich languages (the first version is developed and tested with the Czech language). We have described the AQA knowledge base creation from free natural language texts and the step-by-step syntax-based processing of the input question as well as the processing and scoring of the candidate answers to obtain the best specific answer.
The AQA system has been evaluated with the Simple Question Answering Database (SQAD), where it has achieved an accuracy of 40 % correct answers and 53 % partially correct answers.
We have also identified the prevailing causes of errors in the answer selection and answer extraction phases of the system and we are heading to amend them in the next version of the system.

Acknowledgments. This work has been partly supported by the Grant Agency of CR within the project 15-13277S. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047.

References

1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
2. Fader, A., Zettlemoyer, L., Etzioni, O.: Open question answering over curated and extracted knowledge bases. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1156–1165. ACM (2014)
3. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL 2005, pp. 363–370. Association for Computational Linguistics, Stroudsburg (2005). http://dx.doi.org/10.3115/1219840.1219885

4. Horák, A., Medveď, M.: SQAD: simple question answering database. In: Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 121–128. Tribun EU, Brno (2014)
5. Jakubíček, M., Kovář, V., Šmerk, P.: Czech morphological tagset revisited. In: Proceedings of Recent Advances in Slavonic Natural Language Processing 2011, pp. 29–42 (2011)
6. Kovář, V., Horák, A., Jakubíček, M.: Syntactic analysis using finite patterns: a new parsing system for Czech. In: Vetulani, Z. (ed.) LTC 2009. LNCS, vol. 6562, pp. 161–171. Springer, Heidelberg (2011)
7. Šmerk, P.: Towards morphological disambiguation of Czech (2007)
8. Ševčíková, M., Žabokrtský, Z., Straková, J., Straka, M.: Czech named entity corpus 1.1 (2014). http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C, LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague
9. Shtok, A., Dror, G., Maarek, Y., Szpektor, I.: Learning from the past: answering new questions with past answers. In: Proceedings of the 21st International Conference on World Wide Web, pp. 759–768. ACM (2012)
10. Yih, W.T., He, X., Meek, C.: Semantic parsing for single-relation question answering. In: Proceedings of ACL 2014, vol. 2, pp. 643–648. Citeseer (2014)
11. Šmerk, P.: Fast morphological analysis of Czech. In: Proceedings of the RASLAN Workshop 2009, Brno (2009)