UNIVERSIDADE DE LISBOA
FACULDADE DE CIÊNCIAS
DEPARTAMENTO DE INFORMÁTICA

Open-Domain Web-based Multiple Document Question Answering for List Questions with Support for Temporal Restrictors

Patricia Nunes Gonçalves

DOUTORAMENTO EM INFORMÁTICA
ESPECIALIDADE CIÊNCIAS DA COMPUTAÇÃO

2015

UNIVERSIDADE DE LISBOA
FACULDADE DE CIÊNCIAS
DEPARTAMENTO DE INFORMÁTICA

Open-Domain Web-based Multiple Document Question Answering for List Questions with Support for Temporal Restrictors

Patricia Nunes Gonçalves

Thesis supervised by Prof. Doutor António Horta Branco

Specially prepared to obtain the degree of DOUTOR EM INFORMÁTICA, ESPECIALIDADE CIÊNCIAS DA COMPUTAÇÃO

2015

I dedicate this thesis to the memory of my father, my great idol.

Acknowledgements

Thank you to everyone who contributed, directly and indirectly, to this work. Thank you to my colleagues at NLX, especially to my inseparable friend João Silva, for all the support and friendship over these last years. I also want to thank my family for their unconditional support, especially my mother, who has been a rock in these last years, my brother Guti, my sister-in-law Grasi and my niece Duda for their affection, and all my uncles, aunts and cousins, and my godfather Tio Pedro (in memoriam), for their support and affection. Thank you to all the friends I made over these 5 years: Marina, Branquinho, Renata, Marcia, Leandro, Maria Clara, Laly, Cecilia. Monica, Giu and Jojo, thank you for welcoming me right upon my arrival and for many happy moments. Rosa and Augusto, thank you for making me fall in love with tango and for the wonderful dinners. Thank you Lê and Vini, who "saved" me at the most critical moment of my life, the birth of my son. Thank you Luciana for our unforgettable coffees, and the whole "The Christians" family, Chris, Zazá and Tico, for their affection and friendship. To Dalva, Tácito and Samsam for the wonderful weekend outings. To Zeca and his family, Juliana and Antonio, for all their support and encouragement. To Cristiane, Marcirio and our prince Vicente, my special thanks for accompanying me on this journey with fundamental support. I want to thank God for bringing me my greatest gift, my son Andrew, and his Daddy, Colin, for making my dreams come true when I became a mom. Thank you to my supervisor António Branco for believing in this work, and special thanks to professors Renata Vieira and Thiago Pardo who, even from afar, encouraged me to develop this thesis. Finally, I thank the Fundação para a Ciência e a Tecnologia for funding this doctoral thesis through grant SFRH/BD/65647/2009, and the QTLeap project funded by the European Commission under contract #610516.

Abstract

With the growth of the Internet, more people are searching for information on the Web. The combination of web growth and improvements in Information Technology has reignited the interest in Question Answering (QA) systems. QA is a type of information retrieval combined with natural language processing techniques that aims at finding answers to natural language questions. List questions have been widely studied in the QA field. These are questions that require a list of correct answers, making the task of correctly answering them more complex. In List questions, the answers may lie in the same document or be spread over multiple documents. In the latter case, a QA system able to answer List questions has to deal with the fusion of partial answers. The current Question Answering state-of-the-art does not yet provide a good way to tackle this complex problem of collecting the exact answers from multiple documents.

Our goal is to provide better QA solutions to users, who desire direct answers, using approaches that deal with the complex problem of extracting answers found spread over several documents. The present dissertation addresses the problem of answering Open-domain List questions by exploring redundancy and combining it with heuristics to improve QA accuracy. Our approach uses the Web as information source, since it is several orders of magnitude larger than other document collections. Besides handling List questions, we develop an approach with special focus on questions that include temporal information. In this regard, the current work addresses a topic that was lacking specific research.

An additional purpose of this dissertation is to report on important results of the research combining Web-based QA, List QA and Temporal QA. Besides the evaluation of our approach itself, we compare our system with other QA systems in order to assess its performance relative to the state-of-the-art. Finally, our approaches to answer List questions and List questions with temporal information are implemented into a fully-fledged Open-domain Web-based Question Answering System that provides answers retrieved from multiple documents.

Keywords: Question Answering, Web-based, List Question, Temporal List Question

Resumo - Portuguese Abstract

Com o crescimento da Internet cada vez mais pessoas buscam informações usando a Web. A combinação do crescimento da Internet com melhoramentos na Tecnologia da Informação traz como consequência o renovado interesse em Sistemas de Respostas a Perguntas (SRP). SRP combina técnicas de recuperação de informação com ferramentas de apoio à linguagem natural com o objetivo de encontrar respostas para perguntas em linguagem natural. Perguntas do tipo lista têm sido largamente estudadas nesta área. Neste tipo de perguntas é esperada uma lista de respostas corretas, o que torna a tarefa de responder a perguntas do tipo lista ainda mais complexa. As respostas para este tipo de pergunta podem ser encontradas num único documento ou espalhadas em múltiplos documentos. No último caso, um SRP deve estar preparado para lidar com a fusão de respostas parciais. Os SRP atuais ainda não providenciam uma boa forma de lidar com este complexo problema de coletar respostas de múltiplos documentos.

Nosso objetivo é prover melhores soluções para utilizadores que desejam buscar respostas diretas usando abordagens para extrair respostas de múltiplos documentos. Esta dissertação aborda o problema de responder a perguntas de domínio aberto explorando redundância combinada com heurísticas. Nossa abordagem usa a Internet como fonte de informação uma vez que a Web é a maior coleção de documentos da atualidade. Para além de responder a perguntas do tipo lista, nós desenvolvemos uma abordagem para responder a perguntas com restrição temporal. Neste sentido, o presente trabalho aborda este tema onde há pouca investigação específica.

Adicionalmente, esta dissertação tem o propósito de informar sobre resultados importantes desta pesquisa que combina várias áreas: SRP com base na Web, SRP especialmente desenvolvidos para responder perguntas do tipo lista e também com restrição temporal. Além da avaliação da nossa própria abordagem, comparamos o nosso sistema com outros SRP, a fim de avaliar o seu desempenho em relação ao estado da arte. Por fim, as nossas abordagens para responder a perguntas do tipo lista e perguntas do tipo lista com informações temporais são implementadas em um Sistema online de Respostas a Perguntas de domínio aberto que funciona diretamente sob a Web e que fornece respostas extraídas de múltiplos documentos.

Palavras Chave: Sistema de Respostas a Perguntas, Domínio Aberto, Perguntas do Tipo Lista, Perguntas do Tipo Temporal

Contents

1 Introduction
1.1 Question Answering
1.2 Types of Questions
1.3 Major Topics
1.3.1 Open-Domain QA
1.3.2 Web-based QA
1.3.3 List QA
1.3.4 Temporal QA
1.4 Motivation and Goals
1.5 Challenges
1.6 Outline of this Dissertation

2 Related Work
2.1 Outline
2.2 Question Answering Evaluation Competitions
2.3 Evaluation on Question Answering
2.4 QA Systems for the Portuguese Language
2.5 Approaches for List Questions
2.6 Approaches for Web-based QA Systems
2.7 Approaches for Temporal QA Systems
2.8 Summary

3 Answering List Questions
3.1 Outline
3.2 List Questions
3.3 Challenges
3.4 Approach
3.4.1 Expected Input


3.4.2 External Resources and Support Tools
3.5 Architecture
3.5.1 Question Processing
3.5.1.1 Question Analysis
3.5.1.2 Transformation of the Question into a Query
3.5.1.3 Keywords Extraction
3.5.1.4 Semantic Category of the Expected Answer
3.5.1.5 Counting Keywords
3.5.1.6 Identifying the Question-Focus
3.5.2 Passage Retrieval
3.5.2.1 Document Retrieval
3.5.2.2 Document Analysis
3.5.3 Answer Extraction
3.5.3.1 Extracting Candidate Answers
3.5.3.2 Building the Answer List
3.6 LX-ListQuestion: an Open-Domain Web-based QA system for List Questions
3.7 Summary

4 Answering Temporal List Questions
4.1 Outline
4.2 Background
4.2.1 Classification of Temporal Questions
4.2.2 Challenges in Temporal Question Answering
4.2.3 Related Work
4.3 Approach
4.3.1 Design Features
4.3.2 Expected Input
4.3.3 Classification of Temporal List Questions
4.3.4 Solving Time-Range
4.3.4.1 Temporal Restriction in an Absolute Time
4.3.4.2 Temporal Restriction with a Relative Reference
4.3.4.3 Temporal Restriction with a Vague Reference
4.4 Architecture


4.4.1 Question Processing Module
4.4.1.1 Identifying the Temporal Expression
4.4.1.2 Defining the Temporal Boundaries
4.4.1.3 Generating the Query
4.4.2 Document Processing Module
4.4.3 Answer Processing Module
4.5 LX-ListQuestion: QA system to List Question with Temporal Restrictors
4.6 Summary

5 Evaluation
5.1 Outline
5.2 Automatic Evaluation
5.3 Question Dataset
5.4 List Questions: Evaluation
5.4.1 The role of Document Retrieval
5.4.2 Evaluation using Different Threshold Parameters
5.4.3 Evaluation with Different Setups
5.5 Comparing LX-ListQuestion and other QA Systems
5.5.1 LX-ListQuestion versus RapPortagico
5.5.2 LX-ListQuestion versus XisQuê
5.5.3 LX-ListQuestion versus START
5.5.4 LX-ListQuestion versus WolframAlpha
5.6 Temporal List Questions: Evaluation
5.6.1 Evaluation for Different Temporal Restrictors
5.6.2 Evaluation of Temporal List Question over Questions from Págico
5.6.3 Evaluation of Temporal List Question over Questions from Yahoo! Answers
5.7 Summary

6 Conclusions
6.1 Summary
6.2 Contributions
6.3 Future Research Directions


A Question Dataset
A.1 Págico Dataset
A.2 QALD Dataset
A.3 Temporal Dataset

B LX-ListQuestion Answers

C XisQuê Answers

D START Answers

E Wolfram Alpha Answers

References

List of Figures

1.1 Google search engine
1.2 Question answering system
1.3 Generic architecture for a question answering system

3.1 Question answering system architecture
3.2 Question processing module
3.3 Passage retrieval module
3.4 Before and after the segmenter tool
3.5 Document analysis process
3.6 Web page title matches the question
3.7 Sentence matches the question
3.8 Answer extraction module
3.9 Example: candidates extracted by LX-Ner
3.10 Example: candidates extracted by question-focus
3.11 Building the list of answers
3.12 LX-ListQuestion - online version architecture
3.13 LX-ListQuestion - online version interface - list answers
3.14 LX-ListQuestion - online version interface - wordcloud answers

4.1 Example of TimeML annotation
4.2 Summary of the processing of temporal list questions
4.3 Document analysis for temporal list questions
4.4 Extracting candidate answers for temporal list questions
4.5 Finding co-occurring temporal expressions
4.6 LX-ListQuestion online QA system - list question with temporal restrictors

5.1 Original answers given by Págico competition
5.2 Answers given by Págico competition after cleaning


5.3 Variation in correct answers
5.4 Relaxed candidate matching in the automatic evaluation
5.5 Correct answers found on the corpus
5.6 Display of XisQuê system operation
5.7 Display of START system operation
5.8 Display of WolframAlpha system operation

B.1 LX-ListQuestion QA system answering the question Pagico_004
B.2 LX-ListQuestion QA system answering the question Pagico_054
B.3 LX-ListQuestion QA system answering the question Pagico_062
B.4 LX-ListQuestion QA system answering the question Pagico_086
B.5 LX-ListQuestion QA system answering the question Pagico_088
B.6 LX-ListQuestion QA system answering the question Pagico_100
B.7 LX-ListQuestion QA system answering the question Pagico_109
B.8 LX-ListQuestion QA system answering the question Pagico_112
B.9 LX-ListQuestion QA system answering the question Pagico_133
B.10 LX-ListQuestion QA system answering the question Pagico_140
B.11 LX-ListQuestion QA system answering the question QALD_010
B.12 LX-ListQuestion QA system answering the question QALD_028
B.13 LX-ListQuestion QA system answering the question QALD_032
B.14 LX-ListQuestion QA system answering the question QALD_036
B.15 LX-ListQuestion QA system answering the question QALD_062
B.16 LX-ListQuestion QA system answering the question QALD_074
B.17 LX-ListQuestion QA system answering the question QALD_114
B.18 LX-ListQuestion QA system answering the question QALD_141
B.19 LX-ListQuestion QA system answering the question QALD_155
B.20 LX-ListQuestion QA system answering the question QALD_176
B.21 LX-ListQuestion QA system answering the question TP_001
B.22 LX-ListQuestion QA system answering the question TP_002
B.23 LX-ListQuestion QA system answering the question TP_003
B.24 LX-ListQuestion QA system answering the question TP_004
B.25 LX-ListQuestion QA system answering the question TP_005

C.1 XisQuê QA system answering the question Pagico_004


C.2 XisQuê QA system answering the question Pagico_054
C.3 XisQuê QA system answering the question Pagico_062
C.4 XisQuê QA system answering the question Pagico_086
C.5 XisQuê QA system answering the question Pagico_088
C.6 XisQuê QA system answering the question Pagico_100
C.7 XisQuê QA system answering the question Pagico_109
C.8 XisQuê QA system answering the question Pagico_112
C.9 XisQuê QA system answering the question Pagico_133
C.10 XisQuê QA system answering the question Pagico_140
C.11 XisQuê QA system answering the question TP_001
C.12 XisQuê QA system answering the question TP_002
C.13 XisQuê QA system answering the question TP_003
C.14 XisQuê QA system answering the question TP_004
C.15 XisQuê QA system answering the question TP_005

D.1 START QA system answering the question QALD_010
D.2 START QA system answering the question QALD_028
D.3 START QA system answering the question QALD_032
D.4 START QA system answering the question QALD_036
D.5 START QA system answering the question QALD_062
D.6 START QA system answering the question QALD_074
D.7 START QA system answering the question QALD_114
D.8 START QA system answering the question QALD_141
D.9 START QA system answering the question QALD_155
D.10 START QA system answering the question QALD_176

E.1 Wolfram Alpha QA system answering the question QALD_010
E.2 Wolfram Alpha QA system answering the question QALD_028
E.3 Wolfram Alpha QA system answering the question QALD_032
E.4 Wolfram Alpha QA system answering the question QALD_036
E.5 Wolfram Alpha QA system answering the question QALD_062
E.6 Wolfram Alpha QA system answering the question QALD_074
E.7 Wolfram Alpha QA system answering the question QALD_114
E.8 Wolfram Alpha QA system answering the question QALD_141


E.9 Wolfram Alpha QA system answering the question QALD_155
E.10 Wolfram Alpha QA system answering the question QALD_176

List of Tables

1.1 Examples of temporal expressions and their temporal constraint

2.1 Subjective assessment
2.2 Portuguese QA summary
2.3 List QA summary
2.4 Web-based QA - summary
2.5 Temporal QA summary

3.1 Examples of list questions
3.2 Examples of simple and complex list questions
3.3 Examples of list questions with constraints
3.4 Examples of list questions with question-focus
3.5 Examples of answering list questions
3.6 Examples of list questions expected input of LX-ListQuestion
3.7 Examples of different types of list questions
3.8 Semantic category of the expected answer
3.9 Examples of counting keywords
3.10 Examples of question-focus identification
3.11 Sentence classification

4.1 The relations defined by Allen (1983)
4.2 Tags of TimeML
4.3 Examples of expected input of LX-ListQuestion
4.4 Examples of questions with temporal restriction in an absolute time
4.5 Examples of questions with temporal restriction with a relative reference
4.6 Examples of questions with temporal restriction with a vague reference
4.7 Examples of patterns to identify temporal restriction in an absolute time
4.8 Examples of patterns to identify temporal restriction with a relative reference


4.9 Examples of temporal questions without explicit datetime mark
4.10 Examples of mapping of the expression to a time interval
4.11 Examples of questions and query
4.12 Building the answer list for temporal list questions

5.1 Subset of question dataset - Págico Competition
5.2 Subset of question dataset - QALD
5.3 Corpus composition
5.4 Evaluation using different threshold parameters
5.5 Evaluation of the LX-ListQuestion system
5.6 Comparing QA systems - RapPortagico and LX-ListQuestion
5.7 Evaluation of QA systems - RapPortagico and LX-ListQuestion
5.8 Comparing answers - LX-ListQuestion and RapPortagico
5.9 Comparing QA systems - XisQuê and LX-ListQuestion
5.10 Evaluation of QA systems - XisQuê and LX-ListQuestion
5.11 Comparing answers - XisQuê and LX-ListQuestion
5.12 Comparing QA systems - START and LX-ListQuestion
5.13 Evaluation of QA systems - LX-ListQuestion and START
5.14 Comparing answers - LX-ListQuestion and START
5.15 Comparing QA systems - WolframAlpha and LX-ListQuestion
5.16 Evaluation of QA systems - LX-ListQuestion and WolframAlpha
5.17 Comparing answers - LX-ListQuestion and WolframAlpha
5.18 Evaluation for different temporal restrictors
5.19 A set of question dataset of Págico Competition
5.20 Evaluation of temporal list questions: LX-ListQuestion versus RapPortagico
5.21 Question coverage: LX-ListQuestion versus RapPortagico
5.22 Temporal questions given by Yahoo! Answers
5.23 Evaluation of temporal list questions: LX-ListQuestion versus XisQuê
5.24 Question coverage: LX-ListQuestion versus XisQuê

6.1 Results overview

1 Introduction

The quantity of readily available electronic information is growing every day. With the development of the Internet, more people are searching for information on the Web and more tools are able to explore large amounts of text. Information Retrieval (IR) techniques are used to return relevant documents in response to a user query. Unfortunately, in their current stage of technological maturity, IR systems do not directly provide the appropriate information. Instead, the users have to extract answers from documents that are returned by IR systems.

Figure 1.1: Google search engine.


Figure 1.1 shows an example of the result of searching for “Who invented the first telephone?” in the search engine Google, which has become synonymous with web search. The answer to this question appears in the piece of information (snippet) provided by the 4th link returned. Despite the developments in recent years, these systems do not provide exact answers, and the user needs to read the documents, or all the pieces of information retrieved by the search engine, to locate the desired answer.

Located at the intersection of Information Retrieval and Natural Language Processing (NLP), Question Answering (QA) is a challenging task involving the extraction of relevant answers to natural language questions from large text collections (Paşca, 2003). In contradistinction to IR systems, in a QA system the user can search for information by writing correct interrogative sentences instead of a few keywords. QA thus realizes information retrieval as originally intended, where the right answer is extracted from documents, and not just some links to relevant documents. Figure 1.2 shows the same example of a user searching for “Who invented the first telephone?” in a QA system, which provides an exact answer to the user.

Figure 1.2: Question answering system

Chapter Outline

In this introductory Chapter, we start with Section 1.1, presenting the background of the Question Answering area. Section 1.2 details the types of questions currently studied in the literature. In Section 1.3 we present the major topics for this doctoral research: Open-Domain QA, Web-based QA, List QA and Temporal QA. The motivation and main goals are presented in Section 1.4. In Section 1.5 we present the challenges that are addressed in this research and, finally, Section 1.6 outlines the remainder of this dissertation.


1.1 Question Answering

Research on QA started around the 1960s with the development of Baseball (Green et al., 1961), a system that answered questions about baseball games. The input question was analyzed and the requested information was extracted from data stored in list structures. Some years later, Lunar (Woods et al., 1974) and Qualm (Lehnert, 1978) appeared. Lunar was a QA system for lunar geologists. Qualm was a QA system that used the theory of conceptual information processing, based on models of human memory organization. In the 1980s and early 1990s, large digital repositories of data were put in place, which, coupled with more powerful computers, paved the way for empirical, data-driven and robust approaches to QA. Currently, QA systems combine elaborate natural language processing techniques, linguistic representations and machine learning methods, making the processing of textual information progress over the years. The QA process of providing a precise answer is quite different from the tasks of Information Retrieval or Information Extraction, but it depends on both of them as important components (Strzalkowski and Harabagiu, 2007). A QA system can be classified in terms of: Information Source, Domain, Language, Data Pre-Processing, Approach Complexity, Answer Rendering, User Interface and Interactivity.

• Information Source: (i) Structure: The information source can be: unstructured (plain text) or structured (XML, Database); (ii) Type: The data can be of different types: texts, images, maps, audio, etc; (iii) Data quality: high quality (books) or poor quality (Internet); (iv) Size: large corpus or small corpus; (v) Form: static (corpus) or dynamic (Internet) information source.

• Domain: (i) Closed: domain specific (e.g. medicine, biology, law); (ii) Open: domain independent.

• Language: (i) Monolingual: QA systems in one language only; (ii) Multilingual: QA systems in several languages.

• Data Pre-Processing: (i) pre-processing: at the time of system execution, the source texts are already pre-processed and annotated with some kind of linguistic information, and they are often also reviewed manually; (ii) without pre-processing: the texts are processed at the time of system execution.


• Approach Complexity: (i) Shallow methods: use local features for the processing, such as predefined patterns, template matching, string similarity and others; (ii) Deep methods: use more sophisticated linguistic processing to extract the answer, such as deep parsers, logical forms, etc.

• Answer Rendering: (i) Extracted: Answers are extracted verbatim from the information source; (ii) Generated: Answers are generated on the basis of source text using techniques based on text entailment, templates, etc.

• User Interface: (i) QA Systems based on written questions; (ii) QA Systems based on spoken questions.

• Interactivity: (i) Interactive: QA Systems that are able to request additional clarification information from the user (similar to a dialogue); (ii) One-way: QA Systems that answer the question input by the user without any further interaction with the user.

The generic architecture for a Question Answering System consists of three modules: question processing, passage retrieval and answer extraction. This generic architecture is presented in Figure 1.3.

Figure 1.3: Generic architecture for a question answering system

Question Processing Module: Question Processing is responsible for transforming the question into a format that can be processed by a query engine, by determining relevant keywords present in the input question. This module is also responsible for recognizing the types of questions and the types of answers. This information is an important component to help the system to deliver the right answer at the end of the processing.


Passage Retrieval Module: The Passage Retrieval module is responsible for selecting relevant passages or documents. Its main task is the control of the search space. If the number of passages is too small, the query may be resubmitted to the query engine using more general terms. Otherwise, if the number of passages is very large, the query may be resubmitted using more specific terms. This module is also responsible for removing redundant or irrelevant information.

Answer Extraction Module: Answer extraction is responsible for choosing the most accurate answer. The retrieved passages are ranked and the possible answers are isolated. Factoid QA systems usually provide 1 to 5 possible answers. The whole answer is composed of a short answer and a justification. The short answer is the most accurate answer and the justification is the text that supports it. Usually, the short answer is extracted from the justification, which provides the context for the answer.
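To make the data flow between these three modules concrete, the following is a minimal sketch of the generic pipeline. The helper logic (stopword list, answer-type guess, length-based ranking) is a toy placeholder, not the concrete components of the system described in later chapters:

```python
from typing import List, Tuple

def question_processing(question: str) -> Tuple[List[str], str]:
    """Turn the question into a keyword query and guess the expected answer type."""
    stopwords = {"who", "what", "which", "where", "when", "is", "are", "the", "of"}
    keywords = [w for w in question.lower().rstrip("?").split() if w not in stopwords]
    # Toy mapping from the interrogative pronoun to a semantic answer category.
    answer_type = "PERSON" if question.lower().startswith("who") else "OTHER"
    return keywords, answer_type

def passage_retrieval(keywords: List[str], corpus: List[str]) -> List[str]:
    """Select passages that contain at least one query keyword."""
    return [p for p in corpus if any(k in p.lower() for k in keywords)]

def answer_extraction(passages: List[str], answer_type: str) -> List[str]:
    """Rank the retrieved passages and isolate candidate answers."""
    ranked = sorted(passages, key=len)  # shorter first: a placeholder ranking criterion
    return ranked[:5]                   # factoid systems usually return 1 to 5 answers

def answer_question(question: str, corpus: List[str]) -> List[str]:
    keywords, answer_type = question_processing(question)
    passages = passage_retrieval(keywords, corpus)
    return answer_extraction(passages, answer_type)

corpus = ["Alexander Graham Bell invented the first telephone.",
          "The telephone changed communication."]
print(answer_question("Who invented the first telephone?", corpus))
```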

1.2 Types of Questions

The types of questions currently studied in the literature are: Factoid, Definition, Complex, Boolean, List and Temporal. So-called factoid questions are the most studied, and several QA approaches have been developed for them over the years. These questions, which syntactically are categorized as Partial Interrogative clauses, usually start with an interrogative pronoun. The pronouns commonly used are: Who, What, Where, Which, When, How many/much. Factoid questions can also start with a verb. From a syntactic point of view, sentences that start with verbs are called Imperative clauses. These are commonly used in QA systems to obtain answers to underlying questions. The most recurrent verbs used are the following: name, say, indicate, mention, determine, specify, disclose, report, check, among others.

(1) Examples of Factoid Questions:
Where is the residence of the prime minister of Spain?
Who is the mayor of Tel Aviv?
What is the area code of ?
How many countries are there in Europe?
Name one investment higher than the one made by APDL in the last twenty years.


Even though factoid questions have been studied since the 1990s, the results obtained still leave room for improvement. Factual questions are becoming increasingly complex and traditional approaches cannot answer such questions satisfactorily. There is a large effort in the area to improve the results for factoid questions.

Definition questions usually start with What is. The research on the handling of definitions goes beyond Question Answering and has spawned its own field. The definition can: (i) provide a description of the concept; (ii) indicate a synonym relationship; (iii) describe the function of the concept or (iv) enumerate the parts of the concept.

(2) Examples of Definition Questions:
What is an atom?
What is FTP?
What is an Ontology?

So-called complex questions usually start with the interrogative pronoun How or Why. The answer to this kind of question is extremely hard to find because finding it generally involves semantic processing. There is an effort in the area to develop systems capable of answering this type of question.

(3) Examples of Complex Questions:
How will Luís Camões Square be adorned?
Why is Itamar criticizing the Brazilian Government?

Boolean questions, from a syntactic perspective, are Total Interrogative sentences. They typically elicit a confirmation or a denial as an answer. In Boolean questions the answer is already explicit in the question, and no interrogative word appears in it. Total Interrogatives can be subdivided into three types: affirmative, negative or alternative.

(4) Examples of Boolean Questions:
Is proinsulin a protein?
Didn’t Italy make it to the EURO quarterfinals?
Can the exam be done with a declaration from his embassy or with a residence certificate?


List questions are another type of question now starting to be widely studied. For these questions, not a single answer is expected but a list of answers. In a certain sense, List questions can be viewed as a complex version of factoid questions because they share some characteristics, namely they can start with a pronoun or a verb. The complexity of List questions lies in the search for the answers but also in determining the number of instances that the answer requires. In List questions, the answers may lie in the same document, when the answer is already a list, e.g. a list of cities in Portugal: Lisbon, Coimbra, Porto and Faro; or the answers may be spread over multiple documents, e.g. (document A): “Lisbon is the capital of Portugal.”, (document B): “Porto is a very important city in Portugal.”, etc. In this case, a QA system able to answer List questions has to deal with the fusion of partial answers.

(5) Examples of List Questions:
Which countries have the white, green and red colors in their national flag?
What European Union countries have national parks in the Alps?
Name rare diseases with dedicated research centers in Europe.

Although Temporal questions are factoid questions, a new field has emerged to specifically study this type of question. Authors define Temporal questions differently. In this dissertation we assume three different types of Temporal questions: (i) Simple temporal questions are questions that require a date as an answer; (ii) Complex temporal questions are questions with a temporal expression that contains more than one event related by a temporal signal (e.g. before, after, when, etc.); (iii) Temporally restricted questions are factoid questions that require an answer restricted to a time interval. We discuss Temporal questions in further detail in Chapter 4.

(6) Examples of Temporal Questions:
When did Christopher Reeve die?
What happened to world oil prices after the Iraq annexation of Kuwait?
Which city hosted the World Cup in 1994?
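The surface cues discussed in this section (interrogative pronouns, imperative verbs, total interrogatives) lend themselves to a coarse first-pass classification. The following is a minimal sketch of such a classifier; the cue lists are illustrative samples drawn from the discussion above, not an exhaustive inventory:

```python
# Illustrative cue lists drawn from the discussion above; not exhaustive.
FACTOID_STARTERS = ("who", "what", "where", "which", "when", "how many", "how much")
IMPERATIVE_VERBS = ("name", "say", "indicate", "mention", "determine", "specify")
BOOLEAN_STARTERS = ("is ", "are ", "was ", "were ", "did", "didn't", "can ", "does ")

def question_type(question: str) -> str:
    q = question.lower().strip()
    if q.startswith("what is"):
        return "definition"
    if q.startswith(("how ", "why ")) and not q.startswith(("how many", "how much")):
        return "complex"
    if q.startswith(BOOLEAN_STARTERS):
        return "boolean"
    if q.startswith(FACTOID_STARTERS + IMPERATIVE_VERBS):
        return "factoid"  # list and temporal questions share this surface form
    return "unknown"

print(question_type("Who is the mayor of Tel Aviv?"))  # factoid
print(question_type("What is an atom?"))               # definition
print(question_type("Is proinsulin a protein?"))       # boolean
print(question_type("Name 8 Chuck Berry songs."))      # factoid (imperative)
```

Note that, as discussed above, List and Temporal questions syntactically look like factoid questions, so distinguishing them requires cues beyond the sentence-initial word.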


1.3 Major Topics

In this Section we will present the major topics for this doctoral research: Open-Domain QA, Web-based QA, List QA and Temporal QA.

1.3.1 Open-Domain QA

In Open-domain Question Answering the range of possible questions is not constrained, hence a much tougher challenge is placed on systems. The goal of an Open-domain QA system is to answer questions on any kind of subject domain. Research in Open-domain Question Answering began in 1999 with the Text REtrieval Conference (TREC), which provides large-scale evaluation of QA systems, thus defining the direction of research in the QA field. The evaluations initially focused on factoid questions, which remain until now the main focus of interest for the development of Open-domain QA systems. However, throughout the years, the question dataset proposed by TREC has become more complex, increasing the difficulty level of the competition. In 2003, list and definition questions were included in addition to factoid questions.

In the following years, constraints were added to the factoid questions, making the questions more complex. The most common constraints are: Temporal, Geographic and Quantitative. Temporal constraints are related to time (months, years, centuries and so on), e.g. “What are the Brazilian poets who published volumes with ballads until 1941?” Geographic constraints are related to localization (cities, regions, countries, continents and so on), e.g. “What are the rare diseases with dedicated research centers in Europe?” Quantitative constraints are related to how often something happens, e.g. “What Belgians won the Ronde Van Vlaanderen exactly twice?”

Research on Open-domain Question Answering is not limited to the complexity of the questions but also encompasses the complexity of the document collection. Initially the document collection of TREC was composed of newspaper articles; then Web pages and blog articles were added to the document collection, increasing the complexity of the development of QA systems. Mining blogs and Web pages for answers introduces new challenges in the Open-domain field, requiring (i) the development of systems that handle texts that are not well formed, and (ii) the handling of discourse structures that are more informal and less reliable than newswire.


1.3.2 Web-based QA

Web documents have been used as corpora for many tasks in NLP and can also be used as an information source for QA systems. The World Wide Web is an extremely large repository of publicly accessible documents covering any topic, making it an obvious source to obtain information. A Web-based QA system differs from the others in many ways, the most important being the use of a search engine to retrieve Web pages that potentially contain answers to the question. Search engines provide a convenient front-end for accessing and filtering enormous amounts of Web data.

Another distinguishing feature of Web-based QA systems is the choice of the type of data to work with. Some systems choose to work only with the snippets (text fragments retrieved by the search engine), thus reducing the search space, as is the case of Lamp (Zhang and Lee, 2003), AskMSR (Zhang and Lee, 2003) and Qualim (Kaisser and Becker, 2004). Others choose to work with the full document retrieved from the link provided by the search engine, such as Mulder (Kwok et al., 2001), NSIR (Radev et al., 2002) and AnswerBus (Zheng, 2002).

In addition to viewing the Web as a repository of unstructured information, some researchers also take advantage of the semi-structured information contained in HTML pages. Cucerzan and Agichtein (2005) use the information found in HTML tables, which typically summarizes various interesting relations. Other semi-structured data in HTML pages that can be explored are: itemized lists, meta-data information (which usually contains keywords of the content), titles with explicit markup in HTML, among others.

Recently, approaches have appeared that work with structured data using knowledge databases freely available on the Web. In this case, the system does not use a search engine but a structured query language named SPARQL [1] that allows searching for information on the Web in a structured way. An example of a system that uses this approach is Deanna (Yahya et al., 2013b), which uses a database named YAGO [2].

While the Web is undeniably a useful resource for Question Answering, it is not without drawbacks (Guda et al., 2011). Useful information on the Web is often drowned out by the amount of irrelevant information. The major issue in the use of the Web as an information source lies in developing an approach capable of separating right from wrong answers.

[1] See more at http://www.w3.org/TR/rdf-sparql-query - Last access on December 20, 2014.
[2] See more at http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/ - Last access on December 20, 2014.
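As an illustration of this structured alternative to search engines, the following is a minimal sketch that sends a SPARQL query to a public endpoint. DBpedia is used here purely as an example of a freely available knowledge base with a public SPARQL endpoint (systems such as Deanna target YAGO instead); the query and endpoint choice are illustrative:

```python
import json
import urllib.parse
import urllib.request

# Ask a structured question ("What is the capital of Portugal?") as a SPARQL query.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?capital WHERE { dbr:Portugal dbo:capital ?capital . }
"""

url = "https://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"})

with urllib.request.urlopen(url) as response:
    results = json.load(response)

for binding in results["results"]["bindings"]:
    print(binding["capital"]["value"])  # a resource URI rather than a text snippet
```

In contrast with the snippet-based systems above, the answer here comes back as a precise database entity, at the cost of having to translate the natural language question into a formal query.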

1.3.3 List QA

List questions started being studied in the context of QA in 2001, when TREC included this type of question in the dataset. The first List questions in the TREC question dataset were explicit regarding the number of items required in the list of answers. Thus, determining the number of items in the answer to a List question was not an issue to be considered by the QA systems at that time.

(7) Examples of List questions with the number of items required explicit in the question:
a. What are 9 novels written by John Updike?
b. What are 12 types of clams?
c. Name 8 Chuck Berry songs.

Later, the number of items in the answers was removed from the questions, which increased the complexity of processing. The systems at the time were not prepared to tackle this complex problem, that is, to find the optimal size of the list for each question. One way to handle this issue was to make the QA system always return a fixed number of responses (usually between 10 and 20 elements) for all List questions.

(8) Examples of List questions where the number of items required is not explicit in the question:
a. Which airlines use Dulles Airport?
b. Name books written by C.S. Lewis.
c. Name all Chuck Berry songs.

Until now, finding the number of correct items in the list of answers remains the key challenge of List question QA.

Finding the correct answers to List questions requires discovering a set of different answers in a single document or across several documents. The approach to answering a List question in a single document is very similar to the approach to finding the correct answer to factoid questions: (i) find the most relevant document; (ii) find the most relevant excerpt, usually by searching for textual patterns; and (iii) extract the answers from this relevant excerpt.

(9) Examples of sentences that contain a list of answers:
a. The list of planets is: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus and Neptune.
b. Some examples of Beatles songs: “All You Need Is Love”, “Here Comes the Sun”, “Yellow Submarine”.
c. Ringo, Lennon and McCartney are members of The Beatles.

When the answers to be extracted are spread over several documents, many new challenges are raised, such as grouping repeated elements, handling more information, and separating the relevant information from the rest. This dissertation addresses this challenging task by extracting and rendering answers from multiple documents retrieved from the Web.
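To give an intuition for the redundancy-based fusion idea pursued in this dissertation, the following is a minimal sketch: candidate answers extracted from several documents are normalized, grouped, and kept only when enough independent documents support them. The lowercasing normalization and the support threshold are illustrative placeholders, not the heuristics developed in Chapter 3:

```python
from collections import defaultdict
from typing import Dict, List

def fuse_candidates(candidates_per_doc: List[List[str]], min_support: int = 2) -> List[str]:
    """Group repeated candidates across documents and keep well-supported ones."""
    support: Dict[str, int] = defaultdict(int)
    surface: Dict[str, str] = {}
    for doc_candidates in candidates_per_doc:
        for candidate in set(doc_candidates):  # count each document only once
            key = candidate.strip().lower()    # naive normalization placeholder
            support[key] += 1
            surface.setdefault(key, candidate.strip())
    # Keep candidates mentioned in at least `min_support` documents,
    # the most redundant ones first.
    keys = sorted(support, key=support.get, reverse=True)
    return [surface[k] for k in keys if support[k] >= min_support]

# Candidates extracted from three hypothetical Web documents:
docs = [["Lisbon", "Porto"], ["lisbon", "Coimbra"], ["Porto", "Faro"]]
print(fuse_candidates(docs))  # ['Lisbon', 'Porto']
```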

1.3.4 Temporal QA

Temporal QA requires the extraction of temporal information encoded in natural language text (Schilder and Habel, 2003). Temporal processing requires the recognition and processing of the temporal information given by the question. A question can refer directly or indirectly to a temporal expression:

• Direct temporal expressions are the most common in questions. In such cases, the temporal expression is explicit in the question, e.g. What country controlled Syria in 1930? Paşca (2008) uses hand-built patterns to identify facts associated with a temporal expression and builds a repository with information retrieved from the Web. The more recent work of Yahya et al. (2012) uses a knowledge database to answer questions with temporal restrictions.

• Indirect temporal expressions, also named Complex Temporal Questions by some authors, are more interesting and have been particularly studied over recent years, e.g. What did George Bush do after the U.N. Security Council ordered a global embargo on trade with Iraq? Harabagiu and Bejan (2005) developed templates associated with complex temporal questions and use inference to find the answer in an annotated corpus. Tao et al. (2010) explore an OWL ontology over clinical data to find time-related events about patients; the system handles complex temporal questions with this approach.

After the identification of the temporal expression in the question, it is necessary to transform the temporal expression into a temporal constraint. The temporal constraint is interconnected with the time-range of the temporal expression. Temporal processing is essential in QA to successfully address a time constraint in the question. Examples of temporal expressions and their temporal constraints may be found in Table 1.1:

Temporal Expression        Temporal Constraint
in 1967                    1967 - 1967
between 1950 and 1965      1950 - 1965
after 1980                 1980 - current date
before 1970                [...] - 1969
in the 20th century        1901 - 2000

Table 1.1: Examples of temporal expressions and their temporal constraint
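A minimal sketch, assuming a fixed reference year for open-ended intervals, of how expressions like those in Table 1.1 could be mapped onto year intervals; the patterns cover only the illustrated cases, not the full range of expressions handled in Chapter 4:

```python
import re
from typing import Optional, Tuple

CURRENT_YEAR = 2015  # assumed reference date for open-ended intervals

def temporal_constraint(expression: str) -> Optional[Tuple[Optional[int], int]]:
    """Map a temporal expression to a (start_year, end_year) constraint."""
    e = expression.lower().strip()
    if m := re.fullmatch(r"in (\d{4})", e):
        year = int(m.group(1))
        return (year, year)
    if m := re.fullmatch(r"between (\d{4}) and (\d{4})", e):
        return (int(m.group(1)), int(m.group(2)))
    if m := re.fullmatch(r"after (\d{4})", e):
        return (int(m.group(1)), CURRENT_YEAR)
    if m := re.fullmatch(r"before (\d{4})", e):
        return (None, int(m.group(1)) - 1)  # open start, as in Table 1.1
    if m := re.fullmatch(r"in the (\d+)(?:st|nd|rd|th) century", e):
        century = int(m.group(1))
        return ((century - 1) * 100 + 1, century * 100)
    return None

print(temporal_constraint("in the 20th century"))  # (1901, 2000)
print(temporal_constraint("after 1980"))           # (1980, 2015)
```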

1.4 Motivation and Goals

With the growth of the Internet, more people are searching for information on the Web. Typically, the user enters a set of keywords into the search engine and gets a list of links to web pages as a result. The task of finding the desired answer among the results that were returned then falls on the user. This process is the same when the user searches for list answers. Consider this scenario: the user wishes to find a list of European countries. To do this, the user inserts a few keywords into a search engine, for instance, European countries, and is quite likely to find a web page containing the desired information among the first hits returned by the search engine. In other words, when the information that is needed is trivial, and a web page with the full answer already exists, a search engine may cope with this task.


However, when the information that is needed is non-trivial and is found spread over several documents, a lot of human effort is required to gather the various separate pieces of data into the desired result, which is not an easy task. The current state-of-the-art, be it in IR or QA, does not yet provide a good way to tackle this complex problem. Users do not generally want to go through several documents and put a lot of effort into finding the desired answer. Ideally, they would prefer to quickly get a precise answer and go on to make use of it, instead of spending time searching and compiling the answer from pieces spread over several documents.

Our motivation is to provide better QA solutions to users who desire direct answers to their queries, by investigating approaches for dealing with the complex problem of extracting answers found spread over several documents and using them to compile as complete and correct a list of answers as possible. In addition to dealing with the problem of answering List questions, we want to develop an approach that focuses on answering Temporal List questions. In order to find answers that satisfy the temporal restriction in the question, our approach needs to ensure that the temporal expression is identified and the temporal constraint correctly solved.

Our research is guided by the following property of QA over free text captured dynamically from the Web: answers may appear redundantly in many places and in many forms. Our key goals are:

• Investigate appropriate ways of processing and rendering answers spread over multiple documents, exploiting the redundancy of information available on the Web to improve the accuracy of List QA;

• Develop an approach to Shallow Temporal Processing to answer temporal list questions in order to improve the state-of-the-art.

To achieve these goals we divide them into the following specific sub-objectives:

• Perform an in-depth study of approaches for Open-Domain Question Answering systems for large text collections that provide answers to natural language questions that require lists of answers.

• Study techniques that allow creating the list of answers from the items that are found spread over multiple documents while coping with redundancy, aiming to improve the state-of-the-art.


• Ensure that the approach is able to process questions of multiple types: (i) interrogative sentences; (ii) imperative sentences or (iii) keyword-based queries.

• Build the system in such a way that it does not require pre-indexing of documents, allowing it to provide answers in real time, and make sure that the system can handle noisy and unstructured data, thus allowing it to run directly over the Web.

• Review the approaches in the current state-of-the-art in the Temporal QA field.

• Gather, through a study of corpora, the most common temporal expressions in List questions and extend the QA system with a shallow temporal processing module to find the list of answers to temporal list questions, aiming to improve the state-of-the-art.

• Finally, implement our approach to answering list questions and temporal list questions as a fully-fledged online QA system that provides a list of answers, extracted from multiple documents, that is as accurate and complete as possible.

1.5 Challenges

There have been many advances in the area of QA. With the increasing complexity of the questions being considered, the development of a QA system continues to be a challenging task. Furthermore, the challenges go beyond the development of the QA system itself. We split the challenges into four groups: (1) general challenges in the QA area; (2) issues specific to the development of an Open-domain Web-based QA System; (3) issues specific to the development of a QA System for List Questions; (4) issues specific to the development of a QA System for Temporal List Questions.

General challenges in QA:

• Natural language: Working with natural language is a challenge in itself. For instance, linguistic diversity in writing varies greatly. Often, when developing a system, tests are run over a specific domain, e.g. academic theses. When one then tries to run the system over another linguistic domain or genre, e.g. a newspaper corpus, there is usually a drop in performance.


• Dependency on the performance of NLP tools: despite the effort in the area, the available NLP tools still have much room for improvement. If a QA system is based on a set of NLP tools, the QA system's performance is tied to the performance of those NLP tools.

• Redundant data: A QA system must identify the same element even if it is written in a different way, for example, when written in another language (Florence - Firenze) or when abbreviated (IBM - International Business Machines Corporation).

Challenges specific to Web-based QA systems:

• Noisy data: There is no control over the textual quality of the documents extracted from the Web. The problem is to separate relevant from irrelevant information. A sentence having a high score can indicate that the answer is probably there. However, it may be a false positive and have no relevance to answering the question.

• Unstructured documents: Beyond the problem of textual quality, the extraction of documents from the Web raises another issue, namely the problem of unstructured data. Furthermore, not all of the documents retrieved from the Web (usually HTML) are well formed. The challenge is that the QA system must deal with data that are not always in the standard HTML format.

• Dynamic extraction: Extracting Web documents dynamically faces many problems, such as broken links and the diversity of the returned files (e.g. .PDF, .PS). A robust QA system must be prepared to deal with these issues.

• Processing requirements: The processing requirements of an automatic QA system running over the Web are quite heavy. Nonetheless, the system must return an answer to the user in a timely manner. If the system takes too long to provide an answer, the user may not use it.

• Evaluation: This is a critical issue, given that when building a corpus dynamically from the Web, one is never sure that the retrieved documents contain an answer to the question being asked. Hence, the evaluation of a QA system over the Web should be made in a qualitative manner. In particular, when a QA system gives a wrong answer one should evaluate whether the set of retrieved documents contained the answer or not. Without this type of review, it cannot be determined whether the fault lies in the collection of documents or in the QA system itself.

Challenges specific to List Questions:

• List accuracy: The challenge is to identify accurately whether each of the elements belongs to the final list of answers, i.e. to ensure that the list is as accurate as possible and that there are no extraneous elements.

• List completeness: The QA system must ensure that the quantity and quality of the retrieved documents are sufficient for all relevant elements to be found and included in the final list of answers.

Challenges specific to Temporal List Questions:

• Identifying the temporal expression: The challenge is to identify the temporal expression and define the correct temporal constraint.

• Accurate answers: A QA system needs to ensure that the list of answers satisfies the temporal constraint given by the question.

1.6 Outline of this Dissertation

This dissertation is organized as follows. Chapter 2 presents an overview of the state-of-the-art of the main topics of this dissertation: Question Answering for the Portuguese language, QA systems that handle List questions, Web-based QA systems and Temporal QA systems.

Chapter 3 covers one of the major goals of this dissertation, developing a Web-based QA system for List questions. First we provide an overview of List questions and we recall the key challenges for this dissertation. Our overall approach to answering List questions is discussed and the design features are presented. We also describe in detail the architecture and modules of the system, and present examples. Finally, we present the user interface of our Open-domain Web-based QA system.

Chapter 4 provides the background for Temporal QA and presents the main concepts and challenges. We present the design features of LX-ListQuestion, the questions that are expected as input, and we explain how we solve the time-range issue of the temporal expression. We present in detail the design features of our Temporal Processing Module and describe the architecture and modules of the system.

Chapter 5 is divided into three parts. In the first part we explain the evaluation of LX-ListQuestion in different ways: (i) we present all correct answers found in corpora of different sizes; (ii) afterwards, we test the system using different threshold parameters; and (iii) we present the evaluation of LX-ListQuestion using four different setups in order to test the efficiency of our approach. The second part compares the results against four different QA systems. In order to assess the positioning of our system in the state-of-the-art, our evaluation has two components: the quantitative evaluation of answers and the question coverage evaluation. In the final part of the Chapter, the Temporal Processing module of LX-ListQuestion is evaluated. We test our approach by applying different temporal restrictors to the same base question and compare the results for Temporal List questions against other QA systems.

In Chapter 6 we summarize all chapters of this dissertation and present a number of contributions to research in the field of List QA and Temporal QA. We also present future directions for this research.


2 Related Work

This research work aims to develop a fully-fledged Web-based QA system to answer natural language questions that require answers in the form of lists extracted from multiple documents retrieved from Portuguese Web pages. Besides handling List questions, the system focuses on questions that include a temporal restriction. In order to achieve this goal, we begin with a review of the current state-of-the-art of Question Answering systems in relevant areas: (1) QA systems for the Portuguese language, (2) QA systems that handle List questions, (3) Web-based QA systems and (4) QA systems that specialize in Temporal Questions.

2.1 Outline

This Chapter presents background on QA evaluation competitions and measures. The different evaluation competitions that have been organized over the years are presented in Section 2.2, while Section 2.3 presents the various evaluation measures for QA systems that have been used. In Section 2.4 we present QA systems developed for the Portuguese language. Most of these systems were developed to participate in the Cross-Language Evaluation Forum (CLEF), which often requires a pre-processing step in which the corpus is pre-processed (e.g. POS tagging, named entity annotation, etc.) and stored in a way that facilitates search (e.g. indexing by keywords). Finally, we present XisQuê, a QA system for Portuguese that answers factoid questions over the open set of documents from the Web.


Section 2.5 is dedicated to discussing approaches that focus on answering List questions. Statistical and machine learning approaches have been used to handle List questions. Other approaches that exploit the semantic content of Wikipedia are also frequently used. Recently, research has been devoted to the task of developing open-domain QA systems based on collections of real-world documents using the Web as a corpus. Section 2.6 presents the different approaches that aim to find correct answers on the Web. Most systems answer only factoid questions. There is a great variety in the approaches that are used to handle Web corpora, such as redundancy, probabilistic models, clustering, etc. QA systems that specialize in temporal questions are also presented in this chapter. In Section 2.7 we address this topic and present the approaches developed to solve simple and complex temporal questions. The features common to these systems are the restriction to closed domains, the reliance on pre-tagged corpora and the usage of knowledge databases.

2.2 Question Answering Evaluation Competitions

Different evaluation competitions have been organized over the years, with the objective of promoting the progress of QA systems. Examples of such competitions are: TREC [1], CLEF [2], QALD [3], GikiCLEF [4] and Págico [5].

TREC is an annual event sponsored by the U.S. National Institute of Standards and Technology (NIST) which has been held annually since 1999. The TREC goals are: (i) to encourage research on information retrieval; (ii) to exchange resource ideas between industry, academia and government; (iii) to demonstrate improvements in retrieval methodologies and (iv) to make available evaluation techniques in information retrieval.

The Cross-Language Evaluation Forum (CLEF) promotes R&D in multilingual information access by (i) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (ii) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes. CLEF has been held annually since 2003.

[1] trec.nist.gov
[2] www.clef-campaign.org
[3] http://www.sc.cit-ec.uni-bielefeld.de/qald
[4] www.linguateca.pt/GikiCLEF
[5] www.linguateca.pt/Pagico


QALD is a series of evaluation campaigns on multilingual question answering over linked data, currently part of the Question Answering lab at CLEF; it started in 2011 and has been held annually since then. The motivation for QALD is the large amount of structured data published on the Web, for which the question of how typical users can access this knowledge becomes of crucial importance. The key challenge in QALD is to translate the user's information needs into a form such that they can be evaluated using standard Semantic Web query processing and inferencing techniques.

GikiCLEF was an evaluation competition specifically designed to investigate cultural and linguistic issues. GikiCLEF was organized under the scope of CLEF 2009, containing 50 topics in Bulgarian, Dutch, English, German, Italian, Norwegian, Portuguese, Romanian and Spanish. The information source was the Wikipedia collections in all these languages. The topics required a list of answers. GikiCLEF was held only once, in 2009.

Págico was an evaluation competition that aimed to encourage the development of QA systems for Portuguese; it had a single edition, in 2012. The Págico corpus is composed of Portuguese Wikipedia pages. The question dataset contains 150 topics among major subjects: Letters, Arts, Geography, Culture, Politics, Sports, Science and Economy. These topics were determined with special attention to ensuring that the search for the answers is non-trivial, since the topics require multiple answers which are to be found spread throughout multiple documents.

2.3 Evaluation on Question Answering

In this section we introduce the evaluation measures for QA systems (Teufel, 2007). The metrics most commonly used are recall, precision and F-measure. When using these metrics we must take into consideration two lists: a reference list (the correct answers expected) and the system list (the answers returned by the QA system). These metrics are described below:

• Precision: with $C$ the number of common elements between the reference and the system lists, and $S$ the number of elements given by the system,

$$\mathrm{precision} = \frac{C}{S}$$

• Recall: with $C$ the number of common elements between the reference and the system lists, and $L$ the number of elements in the reference list,

$$\mathrm{recall} = \frac{C}{L}$$

• F-measure: the harmonic mean of precision and recall. $F_1$ gives equal weight to precision and recall, which simplifies to:

$$F_1 = \frac{2 \cdot \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}$$

Other measures commonly used in competitions like TREC or CLEF are Accuracy, Mean Reciprocal Rank and Hit Success:

• Accuracy: the fraction of questions judged to have at least one correct answer among the first $n$ answers to a question. Let $C_{D,q}$ be the correct answers for question $q$ known to be contained in the document collection $D$, and $F^{S}_{D,q,n}$ be the first $n$ answers found by system $S$ for question $q$ from $D$; then accuracy is defined as:

$$\mathrm{accuracy}^{S}(Q,D,n) = \frac{|\{\, q \in Q \mid F^{S}_{D,q,n} \cap C_{D,q} \neq \emptyset \,\}|}{|Q|}$$

• Mean Reciprocal Rank (MRR): gives information about the capacity of a system to return a correct answer as its first result. The Mean Reciprocal Rank can be computed as:

$$\mathrm{MRR} = \frac{1}{N_q} \sum_{i=1}^{N_q} \frac{1}{\mathrm{rank}_i}$$

where $N_q$ is the number of questions and $\mathrm{rank}_i$ is the rank of the first correct answer.

• Hit Success: measures the ability of the system to provide a correct answer among the top $n$ answers. It is usually calculated for the first answer ($n = 1$) or among the top-3 answers ($n = 3$). Hit Success is computed as:

$$\mathrm{top}\text{-}n = \frac{\#CR_{i \leq n}}{\#\mathrm{questions}}$$

where $\#CR_{i \leq n}$ is the number of correct answers at ranks between 1 and $n$, and $\#\mathrm{questions}$ is the number of questions.

Since MRR and Hit Success do not apply to List questions, the evaluation metrics used in this work are recall, precision and F-measure.
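To make these metrics concrete, the following sketch computes precision, recall and F1 for a single List question. It is an illustration only: exact string matching between system and reference answers is assumed, which real systems would need to relax (e.g., "J. F. Kennedy" vs. "John F. Kennedy").

```python
def list_qa_scores(system_answers, reference_answers):
    """Precision, recall and F1 for a single List question."""
    system = {a.strip().lower() for a in system_answers}
    reference = {a.strip().lower() for a in reference_answers}
    common = len(system & reference)                       # C: answers in both lists
    precision = common / len(system) if system else 0.0    # C / S
    recall = common / len(reference) if reference else 0.0 # C / L
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# The system found 3 answers, 2 of which are in the reference list of 4.
print(list_qa_scores(["France", "Spain", "Brazil"],
                     ["France", "Spain", "Italy", "Germany"]))
# -> (0.666..., 0.5, 0.571...)
```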

Human Evaluation

For the evaluation of QA systems on factoid questions, a type of subjective assessment is also used. This considers the relation between the short answer, its justification and document retrieval. Table 2.1 shows a summary of this subjective assessment. For each answer the assessor gives one of these 6 scores:

• Full: all elements are provided: (1) the precise short answer, (2) the document justifying this answer and (3) a correct textual passage extracted from the text.

• Right: the precise short answer is extracted from the document justifying this answer, but the justification is not provided.

• Unsupported: the precise short answer is given, but the document from which it was extracted does not justify it.

• Supported: the short answer does not provide the right answer to the question, but the passage is relevant.

• Inexact: the short answer does not provide a right answer to the question, but it is extracted from a document justifying it.

• False: wrong answer, document and textual passage.


Assessment    Short-answer (Right / Exact)   Justification (Correct / Document)
Full          YES / -                        YES / YES
Right         YES / -                        NO / YES
Unsupported   YES / -                        NO / NO
Supported     NO / -                         YES / YES
Inexact       - / YES                        NO / YES
False         NO / -                         NO / NO

Table 2.1: Subjective assessment.

2.4 QA Systems for the Portuguese Language

In this section we discuss the state of the art in QA for the Portuguese language, presenting the different approaches developed in the literature.

The Senso Question Answering System (Saias and Quaresma, 2009) participated in the CLEF campaigns of 2004, 2005 and 2008. The system uses texts that were pre-processed with a syntactic parser and indexed for future search. It has three main modules: the local KB, query and solver. The local KB contains common-sense facts about places, entities and events. The query module is responsible for the analysis of the question and for selecting a set of relevant documents. The solver module searches for the answer using two approaches: an ad-hoc solver (for answers that can be directly detected in the text) and a logic solver (a logic-programming based tool that searches for the answer while being aware of the semantics expressed in the local KB). The system achieved an accuracy of 46.5% over all questions. The same techniques are used for list questions: the CLEF 2008 question dataset had 10 list questions, of which the system answered 3 correctly, for an accuracy of 33.33%.

Esfinge (Costa, 2005, 2006) is a general-domain Portuguese QA system which has participated in CLEF since 2004. Each text in the collection of documents was divided into sets of three sentences and these sentences were pre-processed using NLP tools. The system uses external tools such as a syntactic analyzer, a morphological analyzer and a named entity recognizer. Each type of question has patterns associated with it that are used to retrieve relevant text passages, for example, Question: Who is Stephen Hawking?, Pattern: Stephen Hawking is. These patterns are searched for in the pre-processed sentence collection and also on the Web using the Yahoo1 search engine. The system achieved an accuracy of 23.5% (first answer) and 30.5% (all answers) at CLEF 2008.

The QA@L2F system (Coheur et al., 2009) participated in CLEF 2007 and 2008. The system implements 3 main steps. In the corpus pre-processing step, relevant information is extracted, such as named entities and the relations between them, which is then stored in a database. Question interpretation transforms the question into a frame; each frame contains the question type, question target, named entities and auxiliaries (such as verbs, adjectives and adverbs). Answer extraction uses three approaches: linguistic reordering, named entity matching and brute force plus NLP. The system achieved an accuracy of 20% at CLEF 2008.

IdSay (Carvalho et al., 2009) is an open-domain QA system for Portuguese. The system participated at CLEF 2008 and reports an accuracy of 32.5% (first answer) and 42.5% (all answers). The question analysis module uses specific patterns to identify the question category. The document retrieval module searches previously indexed documents for a set of texts that contain all the words and entities present in the question. The passage retrieval module selects passages with fewer than 60 words. The experiment was repeated in Carvalho et al. (2010) with a new version of IdSay. The authors identified strengths and weaknesses when the system was compared to other QA systems for Portuguese. They improved the system using semantic information based on Wikipedia2 and TeP (Thesaurus for Portuguese)3. From Wikipedia they extract entity synonyms and filters based on ontological knowledge. The new IdSay improves the accuracy from 32.5% to 50.5% without degradation of response time. IdSay has no special treatment for list questions.

The approach of the Raposa QA system (Sarmento, 2006; Sarmento et al., 2008a) is based on named entity recognition as the main strategy to answer factoid and definition questions. The system uses a set of 25 rules to analyze the type of question and identify the answer type. The document collection is stored in a database, indexed on cue words such as family relations, locations and proper names. The selection of the answer is based on redundancy. The Raposa system achieved an accuracy of 14.5% at CLEF 2008.

The Priberam QA system (Amaral et al., 2009, 2006; Cassan et al., 2006) relies on the indexation of documents. Indexation is an off-line procedure by which a set of target documents is parsed in order to collect information into index files.

1 Available on http://www.yahoo.com/ - Last access on July, 15 2014.
2 Available on http://www.wikipedia.org/ - Last access on July, 15 2014.
3 Available on http://www.nilc.icmc.usp.br/tep2/ - Last access on July, 15 2014.

The system analyzes the question, classifies it into categories (86 types) and defines the question pattern (QP). Each question pattern has an associated question-answer pattern (QAP). The system also extracts pivots, which are key elements of the question such as words, expressions, named entities, phrases, numbers, dates and abbreviations. Document retrieval submits a query to the index files using as search keys the pivot lemmas, their heads of derivation and their synonyms. Each pivot and synonym has an associated weight; the system selects sentences based on these weights and a score is calculated. The right answer is extracted if the sentence matches the QAP. The system achieved an accuracy of 63.5% at CLEF 2008.

RapPortagico is a QA framework developed by Rodrigues and Oliveira (2012) which participated in the Págico competition in 2012. The main approach combines indexing the content of the Wikipedia pages with identifying the noun phrases in the questions. The approach takes advantage of synonyms, using a lexical ontology to expand the questions. The official results of the Págico competition (Mota, 2012) report an F-measure of 0.104 for RapPortagico, which was the best-performing system in the competition.

The main features that set XisQuê (Branco et al., 2008a,b; Rodrigues, 2007) apart from the other systems are that it is a real-time, Web-based, open-domain system. XisQuê receives a question in Portuguese and processes it on the fly, without any pre-processing or indexing of documents. The answers are searched for in, and extracted from, the Web. The input questions may address issues from any subject domain, such as Sports, History, etc. The system handles "Who", "When", "Where" and "Which-X" types of questions. For answer extraction, the system matches the selected sentences against linguistic patterns. The overall MRR value obtained by XisQuê is 0.73 when short and long answers are considered and 0.48 when only short answers are taken into account.

Summary

This section presented an overview of QA systems for Portuguese. The majority of these systems were developed with a focus on the CLEF campaigns and use document pre-processing and indexing as a starting point, such as Saias and Quaresma (2009), Costa (2005), Coheur et al. (2009), Carvalho et al. (2009), Sarmento et al. (2008a) and Amaral et al. (2009).


We found only one system that fully retrieves the information from the Web and processes it in real time, XisQuê (Branco et al., 2008a). Our system also works with texts from the Web. Up to now, to our knowledge, no QA system that specifically answers List questions from Web texts has been developed for the Portuguese language. Table 2.2 shows the approaches of QA systems for the Portuguese language in summarized form.

System                      List Question   Corpus       Pre-Indexing   Approach
Saias and Quaresma (2009)   No              CLEF + Web   Yes            Semantic
Costa (2006)                No              CLEF + Web   Yes            Patterns
Coheur et al. (2009)        No              CLEF         Yes            NLP Tools
Carvalho et al. (2009)      No              CLEF         Yes            String Matching
Sarmento et al. (2008b)     No              CLEF         Yes            Redundancy
Amaral et al. (2009)        No              CLEF         Yes            Patterns
Branco et al. (2008a)       No              Web          No             Patterns
LX-ListQuestion             Yes             Web          No             Redundancy + Heuristics

Table 2.2: Portuguese QA summary

2.5 Approaches for List Questions

List questions differ from factoid questions in that they expect not a single-item answer but a list of answers. The complexity of List questions lies not only in the search for the answers but also in determining the number of instances that the answer requires. Several approaches have been used to answer List questions. The most common is to take a QA system for factoid questions and extend it: the factoid QA system returns only one answer, while for List questions the system returns N answers, where the value of N is assigned differently by each system (usually N = 10). Systems using this approach include Gaizauskas et al. (2005) and Wu and Strzalkowski (2006); their performance is very low, with F-scores below 0.10. In what follows, we present the other major approaches pursued for List questions.


Exploiting NLP Tools and Linguistic Resources

Answering questions using several NLP tools and linguistic resources has been investigated by researchers such as Hickl et al. (2006) and Yang et al. (2003). These NLP tools and resources include named entity recognition, PropBank, NomBank, FrameNet, semantic dependencies, coreference resolution, WordNet, ontologies and others.

LCC Chaucer-2 (Hickl et al., 2007) is a QA system that combines several strategies for modeling the target of a series of questions and optimizing the extraction of answers. The system works for factoid questions and List questions. Chaucer uses five different answer extraction strategies: (i) entity-based (when the question is about two named entities); (ii) pattern-based (hand-crafted patterns); (iii) soft pattern-based (automatically generated); (iv) FrameNet-based (tries to match the answer against semantic frame dependencies); and (v) predicative question-based (similarity metrics). The authors relied on the same techniques used for basic factoid questions to answer List questions. Chaucer achieved an accuracy of 56.1% for factoid questions and an F-score of 32.4% for List questions at TREC 2007.

QUALIFIER - Question Answering by Lexical Fabric and External Resources (Yang et al., 2003) is built on a modular platform, with a dedicated module responsible for generating the response to each type of question. The system handles Factoid, Definition and List questions. For Definition questions it uses techniques from information retrieval and summarization. For factoid questions it focuses on events and uses external knowledge acquisition to relate named entities and events; from this information, rules are extracted and the factoid questions are answered based on these rules. For List questions it seeks cue words in the texts, such as "list of...", "following list...", or patterns like <same_type_NE>, <same_type_NE> and <same_type_NE> + verb (a toy version of this kind of matching is sketched below). The authors report an accuracy of 0.56 on factoid questions and an F-score of 0.31 on list questions over the TREC-12 question dataset.
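The following sketch illustrates the cue-word and same-type-named-entity matching just described. The regular expressions and the crude capitalization-based entity matching are our own toy approximations for illustration, not QUALIFIER's actual rules; a real system would use a named entity recognizer.

```python
import re

# Capitalized token sequence: a crude stand-in for a named entity.
NAME = r"[A-Z][\w-]+(?:\s[A-Z][\w-]+)*"

# Cue words signalling an enumeration of answers.
CUE_WORDS = re.compile(r"\b(list of|the following list|such as)\b", re.I)

# <NE>, <NE>, ... and <NE>: several same-looking entities in one sentence.
ENUMERATION = re.compile(NAME + r"(?:,\s" + NAME + r")*\s+and\s+" + NAME)

def candidate_sentences(sentences):
    """Keep sentences containing a cue word or an entity enumeration."""
    return [s for s in sentences
            if CUE_WORDS.search(s) or ENUMERATION.search(s)]

print(candidate_sentences([
    "The founders of Intel are Gordon Moore and Robert Noyce.",
    "Cities such as Lisbon, Porto and Braga were visited.",
    "He was born in 1929."]))
# -> keeps the first two sentences only
```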

Statistical Approaches and Machine Learning

Statistical models are another approach to answering questions. The system developed by Whittaker et al. (2006) uses a statistical model based on the probability of an answer given a question. The system was developed to answer Factoid, List and Definition questions. Its best result was a precision of 0.03 on List questions over the CLEF 2006 question dataset.


Zhou et al. (2006) use a statistical approach to rank documents and sentences. They extract answers from the most highly ranked sentences and also from the 2 previous and the 2 following sentences with respect to the top-ranked answers. During extraction they calculate a distance score associated with each answer. The answer ranking procedure is a simple linear combination of three scores: document score, sentence score and maximum distance score. The system achieved an F-score of 0.165 on List questions over the TREC-15 question dataset.

Machine learning has also been used in the context of Question Answering. The Fada (Find All Distinct Answers) system (Wang et al., 2008; Yang and Chua, 2004a) employs the Web to support List question answering. The system focuses on improving the document retrieval module with a web-page classification technique to find complete and distinct answers. The technique needs a large set of documents (around 3000 web pages per question), which are classified into 4 classes: Collection Page, Topic Page, Relevant Page and Irrelevant Page. The decision tree classifiers were trained using 29 features. It is important to highlight that this technique only has an impact when the complete list of answers is spread over different texts. With it, the authors improve their results from an F-score of 0.319 to 0.464 over the TREC 2003 question dataset.

Razmara and Kosseim (2008) answer List questions using a clustering method that groups candidate answers that co-occur more often in the collection. The answer type of a question falls into one of nine classes: person, country, organization, job, movie, nationality, city, state and other. The documents are retrieved and all terms that match the answer type are extracted. The similarity between each pair of candidate answers is computed based on their co-occurrence within sentences. Having clustered the candidates and determined the most likely cluster, the final candidate answers are selected (a bare-bones sketch of this idea follows). The system achieved an F-score of 0.163 over the TREC 2007 question dataset.
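The co-occurrence clustering idea can be sketched as follows. This is a bare-bones illustration: the similarity measure and clustering procedure of Razmara and Kosseim (2008) are more elaborate, and the greedy single-link grouping below is our own simplification.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_similarity(candidates, sentences):
    """Count within-sentence co-occurrences for each pair of candidates."""
    counts = defaultdict(int)
    for sentence in sentences:
        present = [c for c in candidates if c.lower() in sentence.lower()]
        for a, b in combinations(sorted(present), 2):
            counts[(a, b)] += 1
    return counts

def best_cluster(candidates, sentences, threshold=1):
    """Greedy single-link clustering: merge candidates whose pairwise
    co-occurrence reaches `threshold`; return the largest group."""
    sim = cooccurrence_similarity(candidates, sentences)
    clusters = [{c} for c in candidates]
    for (a, b), n in sim.items():
        if n >= threshold:
            ca = next(cl for cl in clusters if a in cl)
            cb = next(cl for cl in clusters if b in cl)
            if ca is not cb:
                ca |= cb
                clusters.remove(cb)
    return max(clusters, key=len)

print(best_cluster(
    ["Lisbon", "Porto", "Madrid"],
    ["Lisbon and Porto are both in Portugal.",
     "Porto lies north of Lisbon.",
     "Madrid is the capital of Spain."]))
# -> {'Lisbon', 'Porto'}
```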

Exploiting Relation between Answers and Questions

Approaches based on the relation between answers and questions are mostly exploited for factoid questions, but can be used for List questions as well. This approach uses information about the Expected Answer Type (EAT), extracted from the question, and builds patterns to locate candidate answers in the texts; the candidate answers must match the EAT. Webber et al. (2002) mention other relationships, such as equivalence (mutual entailment), specificity (one-way entailment), alternativity and aggregation.


The LiQED system (Kor, 2005) uses question-answer relation pairs as the basis of its main algorithm for finding list answers; the system answers only List questions. A previous version of the system, by Ahn et al. (2005), had a collection of patterns that identify the answer to a factoid question, with the top 10 answers composing the list. The new version improves answer precision and recall for List questions: it uses the patterns of the old version to retrieve a collection of sentences, which is then used to identify terms that are in some sense relevant to the question and its answers. The system uses these relevant terms to build new text patterns to find more relevant answers to compose the list. Even with this technique, the authors were unable to significantly improve the results: the system achieved an F-score of 0.21 in the previous version and 0.22 in the new version over the TREC 2004 question dataset.

Dalmas and Webber (2007) also use the relation between answer and question and developed two strategies: (i) a baseline, scoring candidates according to their relation with the question; (ii) fusion, a strategy using the baseline score plus a score based on the relationships between the candidates themselves. The relationships between answers are mapped using the synonymy and hypernymy relations of WordNet1. This approach only works if the word is in WordNet and is linked by such a relation. No evaluation is presented.

Exploiting Semantic Content

Alternatively, some researchers propose to reformulate Question Answering as a semantic-content approach to answer complex questions that require multiple answers. This approach is based on answers that can be inferred using a knowledge database, overcoming the limitations of textual QA approaches (Cardoso et al., 2010). Below we present some systems that adopted this approach.

The prototype system of Cardoso et al. (2009) processed the Portuguese Wikipedia and stored the information in a structured way. The system has 2 main modules: the question interpreter and the question reasoner. The question interpreter converts a natural language question into a machine-interpretable object representing the question. This object is composed of a subject (the entity that defines the type of expected answers) and conditions (a list of criteria that filter the answers). The question reasoner decides on the best strategy to get the answers.

1 WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets. Synsets are interlinked by means of conceptual-semantic and lexical relations. See more: http://wordnet.princeton.edu


The strategies consist of a pipeline of SPARQL1 queries made to knowledge resources to obtain and validate answers and their justifications (a minimal query sketch follows below). The system uses a named entity recognizer and a geographic ontology as external resources. The paper does not report performance scores for the prototype.

GIRSA-WP (Hartrumpf, 2008; Hartrumpf and Leveling, 2010) is a QA system developed for the German language. It uses a technique similar to Cardoso et al. (2009), where the German Wikipedia was processed and the information stored as structured data. The natural language question is transformed into an abstract relation used to search the structured database. The system integrates two basic systems: a deep QA system and a GIR (Geographic Information Retrieval) system. The QA methods were extended with an interesting method of question decomposition, which transforms a complex question into less complex ones. The method has 6 different decomposition heuristics: temporal, local, coordinated, meronymy, description and operational. The main question is decomposed into one or more sub-questions, and the answers to the sub-questions are used to find the answer to the main question. Question decomposition is a powerful technique that can provide answers to complex questions. Its problem is that the success rate depends on finding right answers to the sub-questions: if the system does not find any answer, or finds a wrong one, the error propagates to the main question, resulting in a cumulative error. The best run of GIRSA-WP in GikiCLEF achieved a precision of 0.14.

The EQUAL system (Dornescu, 2009) relies on structural information from Wikipedia to extract answers. The semantic QA architecture of EQUAL is a set of semantic constraints which are used to represent the possible interpretations of questions. The constraints are extracted from each question and are: (1) entity (named entity in the question), (2) relation (main verb of the question), (3) temporal (temporal expressions, time intervals, etc.), (4) property (numeric values), (5) geographic (countries, cities, regions, etc.) and (6) introduction (interrogative pronouns, imperative verbs and declarative constructions). When extracting answers, the constraints are verified and the candidate entities which satisfy all the constraints are selected. Question processing uses the nodes, relations and properties of the Wikipedia ontology. This system ranked first among 8 systems in GikiCLEF, answering 25 out of 50 questions, with a precision of 47% and a recall of 32%.

1Available on http://sparql.org/ - Last access on July, 15 2014.
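To make the semantic-content strategy concrete, the sketch below sends one SPARQL query, in the spirit of "Give me all movies with Tom Cruise", to the public DBpedia endpoint. The endpoint URL and the dbo:starring property follow common DBpedia conventions, but the surveyed systems ran over their own processed Wikipedia resources, so this query is purely illustrative.

```python
import json
import urllib.parse
import urllib.request

# Prefixes are spelled out so the query is self-contained.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?film WHERE { ?film dbo:starring dbr:Tom_Cruise . }
LIMIT 10
"""

url = "https://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": QUERY, "format": "application/sparql-results+json"})

# Each binding is one answer of the list; the list comes from structured
# data, not from extracting and fusing text spans.
with urllib.request.urlopen(url) as response:
    results = json.load(response)
for binding in results["results"]["bindings"]:
    print(binding["film"]["value"])
```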


Summary

In this section we presented the background work on QA systems that focus on List questions; Table 2.3 summarizes the approaches. Some systems simply reapply their factoid-question techniques to List questions, but they show very low performance (Gaizauskas et al., 2005; Wu and Strzalkowski, 2006). Other systems explore NLP tools and resources. This approach seems to achieve better results; however, the processing time required is very high and the performance of these systems depends on the performance of the supporting NLP tools (Hickl et al., 2006; Yang et al., 2003). Systems that take advantage of semantic content also achieve good results, although all information must be previously stored in a database. This approach seems suitable for QA systems that focus on a specific domain, where the information source can be limited and more easily stored (Cardoso et al., 2009; Hartrumpf and Leveling, 2010; Dornescu, 2009). Our overview shows that approaches with statistical models and machine learning are viable solutions for QA systems that need to handle wide and noisy information, such as that found on the Web, to extract List answers (Whittaker et al., 2006; Yang and Chua, 2004a).

System                          Corpus       Language          Main Approach
Hickl et al. (2006)             TREC         English           NLP tools
Yang et al. (2003)              TREC         English           NLP tools
Whittaker et al. (2006)         TREC         English           Statistical
Yang and Chua (2004b)           TREC + Web   English           Machine Learning
Razmara and Kosseim (2008)      TREC + Web   English           Clustering
Kor (2005)                      TREC         English           Patterns
Hartrumpf and Leveling (2010)   Wikipedia    German, English   Semantic Content
Cardoso et al. (2009)           Wikipedia    Portuguese        Semantic Content
Dornescu (2009)                 Wikipedia    English           Semantic Content
LX-ListQuestion                 Web          Portuguese        Redundancy + Heuristics

Table 2.3: List QA summary


2.6 Approaches for Web-based QA Systems

When Question Answering systems extend their data source to include information available from the Web, new opportunities and challenges arise. In this section we present related work on Web-based QA systems. We begin with the pioneering Web-based QA systems, which first proposed to search for answers using the Web as a corpus. Between 2000 and 2005, many works used redundancy as their main approach. Finally, we present the most recent approaches to searching for answers on the Web.

The pioneer systems

MURAX is the oldest QA system using the Web as an information source. Developed in 1993 by Kupiec (1993), the system answers questions using an online encyclopedia. The main approach uses lexico-syntactic patterns. Evaluation was based on 52 Trivial Pursuit questions, with a rate of 75% correct answers in the top-5 answers.

MIT developed the START QA system (Katz, 1997) in 1993, and it is still available on the Web1. Katz et al. (2003) present a unified framework incorporating external knowledge sources into the QA process. The most prominent external source is Wikipedia, which is used as a source of synonymy and hypernymy information. To answer factoid questions, they use the snippets retrieved from the Google search engine and heuristics based on matching the question focus to select the right candidates. The process for answering List questions uses a knowledge base extracted from lists within Wikipedia pages, also relying on heuristics based on matching the question focus. The system achieved an accuracy of 0.273 for factoid questions and an F-score of 0.110 at TREC 2005.

Redundancy Approach

AnswerBus is an open-domain question answering system based on sentence-level Web information retrieval (Zheng, 2002). The system accepts questions in different languages, such as German, French, Spanish, Italian and Portuguese; the questions are translated into English using BabelFish2 and the answer is always provided in English. The system uses the full documents to extract sentences that contain answers, scoring the sentences based on the number of keywords they contain. The evaluation was performed over 200 factoid questions of TREC-8, with a rate of 70.5% correct answers in the top-5 answers.

1 Available on www.start.com - Last access on July, 12 2014.
2 Available on http://www.babelfish.com/ - Last access on July, 18 2014.

The system presented in Clarke et al. (2001) exploits redundancy on the Web to find answers. The success of this approach shows that the volume of information available on the Web is large enough to supply the answer to most factoid questions. To explore how redundancy can improve a QA system, the components of the QA system were simplified, e.g., the identification of candidates is made using a simple syntactic pattern that matches most names in English. The hypothesis was that redundancy could serve as a substitute for deeper analysis. The experiments used a single category of TREC-9 questions, namely those that require the name of a person as an answer (87 questions). The system answered 49 questions correctly (56% of them), and for 34 questions (39%) the correct answer was ranked in the first position.

AskMSR is a QA system developed by the Microsoft Research Group (Banko et al., 2002). The system takes advantage of data redundancy to find the correct answer. Its main feature is the rewrite process, by which a question is transformed into several queries, e.g., Question: "Where is the Louvre Museum located?"; query 1: "the Louvre Museum is located..."; query 2: "the Louvre Museum is in..."; query 3: "the Louvre Museum is near..."; and so on (a minimal version of this rewrite step is sketched below). The answers are selected by collecting 100 snippets per question using the search engine. The system chooses 5 possible answers, ranks them, and presents the result to the user. The experiments used 500 factoid questions of TREC-9 and achieved an MRR of 0.507 (Dumais et al., 2002).
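A minimal version of the AskMSR-style rewrite step might look as follows. The real system uses a much larger set of rewrite rules, each with an associated weight, while this sketch hard-codes a handful of rewrites for one question shape.

```python
def rewrite_where_question(question):
    """Generate declarative rewrites for a 'Where is X ...?' question."""
    if not question.lower().startswith("where is "):
        raise ValueError("only 'Where is ...' questions are handled here")
    subject = question.rstrip("?").split(" ", 2)[2]  # "the Louvre Museum located"
    subject = subject.rsplit(" ", 1)[0]              # drop trailing verb (crude)
    # Each rewrite is a quoted phrase a search engine can match literally.
    return [f'"{subject} is located"',
            f'"{subject} is in"',
            f'"{subject} is near"']

print(rewrite_where_question("Where is the Louvre Museum located?"))
# -> ['"the Louvre Museum is located"', '"the Louvre Museum is in"',
#     '"the Louvre Museum is near"']
```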

Other Approaches: Clustering, Probabilistic and NLP Tools

Lamp is a Web-based QA system developed by Zhang and Lee (2003) that takes advantage of the snippets in the search results returned by the Google search engine. The system answers only factoid questions; 100 snippets are retrieved for each question. The main algorithm is based on constructing snippet clusters that contain the same plausible answers. The system was evaluated over 444 factoid questions of TREC-11 and achieved an MRR of 0.47.

The system named Mulder (Kwok et al., 2001) combines information retrieval with statistical natural language processing, lexical analysis and a voting procedure. The system ranks the candidates according to how close they are to the keywords and clusters similar answers together; the final answer is extracted from the highest-scoring cluster. The experiment used 200 factoid questions of TREC-8, of which the system answered 34% correctly.

Radev et al. (2002) introduce a Web-based question answering system, named NSIR, that uses a probabilistic method. Two probabilistic models were used, one based on N-grams and the other on a vector space model. The system retrieved 40 documents per question using the Google search engine. The evaluation used 200 factoid questions from TREC-8, of which the system answered 164 correctly. No precision results were reported.

QuASM, developed by Pinto et al. (2002), is a QA system that exploits the HTML structure of web pages (e.g., HTML tables and titles) to find the correct answer. Using the HTML structure, the information is indexed into smaller units. No search engine was used; the system relies on a specific crawler over the FedStats website1 to retrieve the web pages. The crawler collected 177,670 documents, from which a random set was selected and used to generate the 73 questions used in testing. The questions covered data in text tables, HTML tables and prose. The authors report that the system answered 23 of these questions, though no recall or precision figures are presented.

Qualim is a QA system that rephrases the question to find correct answers, e.g., from the question "When did Amtrak begin operations?" the system generates the pattern "Amtrak began operations in" and searches for matching strings in the snippets returned by the search engine. Kaisser and Becker (2004) used the same technique to answer list and definition questions. The reported results at TREC 2004 are 0.343 for factoid questions, 0.125 for list questions and 0.211 for definition questions.

Magnini et al. (2001) present DIOGENE, a multilingual Web-based QA system. The system uses linguistic processing as its main approach to answer factoid questions in Italian and English. The processing strongly relies on MULTIWORDNET for multiword expression recognition, word sense disambiguation and answer type identification. The search engine used was PRISE, developed by NIST2. The system participated in the TREC-11 main task and correctly answered just 10% of the questions. A manual error analysis of a set of questions was made, with each module of the system evaluated separately. The search engine produced a very high error rate of 53%, meaning that it retrieved a document containing the correct answer for less than half of the questions.

1 Available on http://www.fedstats.gov - Last access on July, 21 2014.
2 Available on http://www.nist.gov/ - Last access on July, 22 2014.

The named entity recognizer also produced a high error rate of around 60%, mainly due to the low homogeneity between the training and test corpora for this tool.

Recent Approaches

The proposal of Lloret et al. (2011) is to use text summarization to improve a factoid QA system. The system uses summaries instead of snippets to find the answer. The summarizer is integrated with the information retrieval stage: the Web pages are collected and summarized before answer extraction. The main approach uses textual entailment to remove redundancy from the summaries and term frequency to score sentences. The main goal is to show the benefit of using summaries instead of snippets. The authors built their own question set of 100 English questions: 25 about persons, 25 about organizations, 25 about locations, and 25 temporal questions. The system achieves an F-measure of 60.5% when extracting answers from the summaries, against 53.9% when using the snippets.

Wu and Marian (2011) propose a framework that aggregates query results from different sources, saving users the hassle of checking query-related web sites to corroborate answers. The goal is to provide the best answers to the users by computing an individual score for each answer. The score takes into account the number of occurrences, the relevancy and originality of the sources reporting the answer, and the prominence of the answer within those sources. When the scoring process is complete, the scores of similar answers are aggregated (a toy version of this aggregation is sketched below). The system was evaluated over 142 factoid questions from TREC-9 and achieved an MRR of 0.767.

Deanna - Deep Answers for maNy Naturally Asked questions - is a QA system that comprises a suite of components for question decomposition, for mapping components into the semantic concept space and for generating alternative candidate mappings to a database. The system of Yahya et al. (2013b) uses SPARQL to search for answers directly in linked data sources such as YAGO1. The main focus of this work is on transforming the question into a correctly structured SPARQL query. The evaluation presented in Yahya et al. (2013a) reports the results obtained on the QALD-2 question dataset: an F-measure of 0.45 for List questions and 0.68 for Factoid questions.

1Available on http://yago-knowledge.org - Last Access July, 29, 2014
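The aggregation idea can be sketched as follows. The weighting is our own illustrative choice: Wu and Marian (2011) combine occurrence counts, source relevancy, originality and answer prominence in a more refined model.

```python
from collections import defaultdict

def aggregate_answers(extractions):
    """Aggregate per-source answer scores.

    `extractions` is a list of (answer, source_relevance, answer_prominence)
    triples; the multiplicative weighting below is a toy choice, not the
    published model.
    """
    scores = defaultdict(float)
    for answer, relevance, prominence in extractions:
        # Every occurrence contributes; better sources and more prominent
        # mentions contribute more.
        scores[answer.lower()] += relevance * prominence
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(aggregate_answers([
    ("1969", 0.9, 1.0),   # high-quality source, answer in the title
    ("1969", 0.6, 0.5),
    ("1971", 0.4, 0.2)]))
# -> [('1969', 1.2), ('1971', 0.08...)]
```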


Summary

Table 2.4 shows an overview of the state of the art for Web-based QA systems in summarized form.

System                      Information Source           Main Approach                               Question Type               Available
Kupiec (1993)               Online encyclopedia          Lexico-syntactic patterns                   Factoid                     No
Katz (1997)                 External knowledge sources   Redundancy, patterns and question focus     Factoid, Definition, List   Yes
Zheng (2002)                Web (full documents)         Keyword frequency                           Factoid                     No
Clarke et al. (2001)        Web                          Redundancy                                  Factoid                     No
Banko et al. (2002)         Web (snippets)               Rewritten queries                           Factoid                     No
Zhang and Lee (2003)        Web (snippets)               Clustering                                  Factoid                     No
Kwok et al. (2001)          Web (full documents)         Clustering, voting procedure                Factoid                     No
Radev et al. (2002)         Web (full documents)         Probabilistic                               Factoid                     No
Pinto et al. (2002)         FedStats website             HTML structure                              Factoid                     No
Kaisser and Becker (2004)   Web (snippets)               Patterns                                    Factoid, Definition, List   No
Magnini et al. (2001)       Web                          NLP (MultiWordNet)                          Factoid                     No
Lloret et al. (2011)        Web (full documents)         Summaries                                   Factoid                     No
Wu and Marian (2011)        Web (full documents)         Score based on occurrence number, etc.      Factoid                     No
Yahya et al. (2013b)        Linked Data source (YAGO)    Question transformation into SPARQL query   Factoid                     No
LX-ListQuestion             Web (full documents)         Redundancy + Heuristics                     List                        Yes

Table 2.4: Web-based QA - summary

Research on Web-based QA systems has been longstanding, and much effort has gone into creating systems capable of answering questions using the information available on the Web. Most systems answer only factoid questions. Systems such as Katz et al. (2005) and Kaisser and Becker (2004) attempt to answer List questions. START, by Katz et al. (2005), uses a knowledge source extracted from lists in Web pages to answer List questions.

Qualim, by Kaisser and Becker (2004), handles List questions using the Internet, although it has no special treatment for them and uses the same technique to answer factoid and list questions. Generally speaking, answering List questions using the Web as an information source is a challenging task: these systems achieved F-scores of 0.110 (Katz et al., 2005) and 0.125 (Qualim; Kaisser and Becker, 2004).

We note that some systems use full documents and others use snippets to find the correct answer. This decision has a great impact when developing a Web-based QA system. Using full documents gives a greater chance of finding the answer, but leads to a more time-consuming process because the texts are longer. Using snippets is the flip side: faster processing due to smaller texts, but a lower chance of finding the answer in the returned snippet. LX-ListQuestion aims to answer List questions using the Internet, so we chose to process the full documents in order to better select the most relevant information in the search for answers. This way we ensure a greater chance of finding all the elements of the list of answers.

2.7 Approaches for Temporal QA Systems

Research on Temporal Question Answering has been active for nearly 15 years. Our study shows that TERQAS1 was the starting point for research on temporal processing in QA. The workshop, held in 2002, addressed the problem of how to answer temporally based questions about the events and entities in texts. This led to a growing movement to build corpora annotated with temporal information that goes beyond simple times and dates, including events and temporal relations between events.

After TERQAS

Radev and Sundheim (2002) developed a corpus of 50 temporal questions. The corpus was manually annotated with different kinds of information, e.g., the number of temporal expressions in the question, how many events are mentioned and their types, etc.

1Workshop on Temporal and Event Recognition for Question Answering Systems. See more information on http://www.timeml.org/site/terqas - Last access on July, 24 2014.


This corpus was built to investigate how TimeML1 can improve the performance of the NSIR QA system (Radev et al., 2002). Corpora were not the only resources developed to help research in this area: a tagger capable of automatically annotating temporal expressions was developed by Schilder and Habel (2003). The temporal tagger automatically annotates (1) event information, (2) temporal information and (3) temporal relations (e.g., before, after, starts, finishes, etc.). The main goal is to determine whether annotating the data with temporal information helps temporal QA systems find the correct answer.

Using the Internet as an information resource for Temporal QA

The Internet has been used to build data collections and repositories especially designed to answer temporal questions. Ahn et al. (2006) propose two strategies to build a data collection for a temporal QA system using information available on the Web; both strategies use hand-built patterns. The first strategy extracts and stores events from Wikipedia using shallow semantic interpretation, extracting event descriptions including temporal location and participants. The stored information is composed of event, date and description; around 33,000 events between 1600 and 2005 are stored (about 19,000 of them birth and death events). The second strategy searches the Web for temporal relations between the events already extracted from Wikipedia by the first strategy. Patterns are sent to the Google search engine and snippets indicating temporal relations between events are extracted, e.g., "⟨event⟩ gave way to ⟨event⟩", "⟨event⟩ took place after ⟨event⟩", "⟨event⟩ took place during ⟨event⟩". The data collection was built to support temporal question answering systems.

Paşca (2008) describes how to build a repository using nuggets (sentence fragments retrieved by a search engine) to capture events and organize them into fact repositories. The focus is on answering temporal questions that require a date as an answer, e.g., "When was the transistor invented?". The repository was built using lexico-syntactic patterns, e.g., "[Date] [when] [nugget]", "[nugget] [in/on] [Date]", "[verb] [optional adverb] [in/on] [Date]" (a toy version of such a pattern is sketched below). A fact retrieval framework was built to have immediate applications in web search, providing direct results for queries asking about a date or an event. The experimental evaluation was based on a quality assessment that analyzed feedback from 20 users, whose comments about the system's performance were mostly positive.

1TimeML is an annotation scheme for marking up events and time expressions and links between them. See more information on http://www.timeml.org - last access on July, 28 2014.
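A toy version of one such lexico-syntactic pattern, "[nugget] [in/on] [Date]", restricted to four-digit years, could be written as below; the patterns of Paşca (2008) are richer and operate over search-engine nuggets rather than raw text.

```python
import re

# "[nugget] in/on [Date]": a lazily matched text span followed by a year.
NUGGET_IN_DATE = re.compile(
    r"(?P<nugget>[A-Za-z][\w\s,-]{3,60}?)\s+(?:in|on)\s+(?P<date>\d{4})")

def extract_event_dates(text):
    """Return (event description, year) pairs found in the text."""
    return [(m.group("nugget").strip(), m.group("date"))
            for m in NUGGET_IN_DATE.finditer(text)]

print(extract_event_dates(
    "The transistor was invented in 1947, and Bell Labs announced it in 1948."))
# -> [('The transistor was invented', '1947'),
#     ('and Bell Labs announced it', '1948')]
```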


Some users felt that it was sometimes necessary to read the full-length document to assess the correctness of the dates. No evaluation of precision or recall was presented.

The framework developed by Tao et al. (2010) explores the Semantic Web as an environment to represent and reason about the temporal dimension of clinical data. An OWL ontology was used to build an API capable of finding events related in time. This API can return the duration of a given event, the temporal relations between two events, the timeline of a set of events, among other information. The framework was built to search for temporal information in the OWL ontology and allows the user to find concepts and relations with temporal information. The user can exploit historical information to infer new information from clinical narratives, e.g., Day 1: the patient's exams are normal; Day 2: the patient has body aches; Day 3: the patient starts medication. Question: "Did the patient have body aches before starting medication?" To answer this question the system needs to reason over the facts and their times to find the correct answer: yes, the patient had body aches on day 2 and started medication on day 3 (a point-based version of this check is sketched below). No evaluation of the performance of the framework is presented.

Another framework, named DEANNA, was built using structured knowledge bases available on the Internet (Yahya et al., 2012). The system translates questions into SPARQL queries and searches for answers in databases such as DBpedia. In Yahya et al. (2013b), the issue of disambiguating temporal phrases in the question into temporal entities like dates, events and temporal predicates was addressed. In the knowledge base, time is associated with an event; events are expressed through an event entity (e.g., World War) or an instance relation (e.g., was born in). The system can search for answers within a specified time window. The authors did not report results for temporal questions.
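The point-based reasoning in the clinical example above can be sketched as follows. The actual framework reasons over an OWL ontology with interval relations, while this illustration only compares event dates on a toy timeline with invented absolute dates.

```python
from datetime import date

def happened_before(event_a, event_b, timeline):
    """Point-based check: did event_a occur strictly before event_b?

    `timeline` maps event descriptions to dates; the dates below are
    invented, since the original narrative only gives relative days.
    """
    return timeline[event_a] < timeline[event_b]

timeline = {
    "normal exams": date(2014, 7, 1),       # day 1
    "body aches": date(2014, 7, 2),         # day 2
    "starts medication": date(2014, 7, 3),  # day 3
}
print(happened_before("body aches", "starts medication", timeline))  # True
```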

Complex Temporal Questions

Temporal question answering offers many interesting challenges, and some researchers have developed methodologies and algorithms for temporal inference aimed at answering complex temporal questions. Harabagiu and Bejan (2005) introduce a methodology for temporal inference for use in QA systems. Temporal questions require several different forms of inference. The methodology uses annotation produced by TimeML to generate a template for answering temporal questions, named a temporal signature. When the temporal signatures are combined with sources of semantic inference and information about coreferring events, sophisticated temporal inferences are produced. The temporal signatures are used to identify exact answers to temporal questions.

The work in Schockaert et al. (2006) presents an algorithm based on an algebra for deducing temporal knowledge to answer temporal questions. Their focus is on complex temporal questions in which the question relates two events, e.g., "Which battles were fought in Belgium between D-Day and the unconditional surrender of Germany?". The authors use the data collection developed by Ahn et al. (2006) to support this approach. Temporal reasoning is further complicated by the fact that many historical events are vague: their time span cannot be accurately captured by well-defined boundaries. No evaluation is presented.

Moldovan et al. (2005) discuss how temporal context reasoning can boost the performance of a QA system. The approach detects temporally related events in texts and converts them into a logical representation. The reasoning module uses a first-order logic theorem prover. It translates sentences marked up with temporal events into temporally enhanced first-order logic assertions, that is, it transforms each sentence into a predicate, e.g., from the predicate "Sentence1 contains Event1 and Event2" the logic representation during(Event1, Event2) is extracted. The predicates are generated based on hand-coded interpretation rules. The main goal is to measure the contribution of temporal context to a QA system. The system achieved 37.1% correct answers with context against 29.1% without context.

The work presented in Saquete et al. (2009) is worth highlighting with regard to the handling of complex temporal questions. The design of the system has a specialized layer to process complex temporal questions. The authors consider a taxonomy of temporal questions divided into four types of complexity: (1) questions that require a temporal expression as an answer and do not contain a temporal expression, e.g., "When did man arrive on the moon?"; (2) questions that require temporal reasoning over the temporal expression contained in the question, e.g., "Who won the world cup 2010?"; (3) questions with a temporal expression that contain more than one event related by a temporal signal (temporal signals are expressions like "before", "after", "between", "when"), e.g., "What did George Bush do after the U.N. Security Council ordered a global embargo on trade with Iraq in August 90?"; and (4) questions without a temporal expression that contain more than one event related by a temporal signal, e.g., "Who was the president of US when the AARP was founded?". The approach is based on decomposing the question into simple questions, according to the temporal relations expressed in the original question.

At the end of processing, the answers are recomposed to find the correct answers (a minimal version of the decomposition step is sketched below). An extensive evaluation on temporal questions for English and Spanish is reported, based on 200 temporal questions (50 of each complexity type). The system achieves an F-measure of 66.83% for English and 40.36% for Spanish.
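A minimal version of the decomposition step might split the question at the first temporal signal, as below. This is only a sketch of the idea: the system of Saquete et al. (2009) uses a full taxonomy of question types and recomposition rules to combine the answers of the sub-questions.

```python
import re

TEMPORAL_SIGNALS = ("before", "after", "between", "during", "when")

def decompose(question):
    """Split a complex temporal question on the first temporal signal,
    returning the main sub-question, the signal and the restrictor."""
    for signal in TEMPORAL_SIGNALS:
        match = re.search(rf"\b{signal}\b", question, re.I)
        if match and match.start() > 0:  # ignore question-initial "When"
            main = question[:match.start()].rstrip(" ,?") + "?"
            restrictor = question[match.start():].rstrip("?")
            return main, signal, restrictor
    return question, None, None

print(decompose("Who was the president of US when the AARP was founded?"))
# -> ('Who was the president of US?', 'when', 'when the AARP was founded')
```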

Summary

Research in this area began with the concern of having annotated corpora, as shown by the work of Radev and Sundheim (2002), Schilder and Habel (2003) and Ahn et al. (2006). Several approaches have been developed to answer temporal questions: patterns, templates, algebras, first-order logic provers, decomposition, etc. We found no predominant approach, since each researcher chose a different path depending on the type of temporal questions. Some works use the Web as a corpus to answer temporal questions, such as Ahn et al. (2006), Paşca (2008), Tao et al. (2010) and Yahya et al. (2013b). However, these works extract information from the Internet and store it into knowledge bases so that the QA system can access the information more easily. The problem with this approach is that it needs constant effort to keep the database up to date. Table 2.5 shows the related work on QA systems focusing on temporal questions in summarized form.

System                       Corpus                                        Main Approach                Temporal Question Type
Paşca (2008)                 Repository built from the Internet            Hand-built patterns          Simple
Tao et al. (2010)            OWL ontology                                  API to access the ontology   Complex
Yahya et al. (2012)          Knowledge base                                SPARQL                       Temporally restricted
Harabagiu and Bejan (2005)   Corpus annotated with TimeML                  Templates                    Complex
Schockaert et al. (2006)     Data collection                               Algebra                      Complex
Moldovan et al. (2005)       Logical representations extracted from text   First-order logic prover     Complex
Saquete et al. (2009)        TREC corpus                                   Question decomposition       Simple, Complex and Temporally restricted
LX-ListQuestion              Web                                           Redundancy + Heuristics      Temporally restricted

Table 2.5: Temporal QA summary


Our goal is to answer List questions with temporal restrictions using the Web as a corpus, without any pre-processing or stored information; the information is selected from plain text. We did not find any system with the same characteristics as LX-ListQuestion.

2.8 Summary

This chapter presented a review of the current state of the art. First we presented a background on QA competitions and the measures used to evaluate QA systems. We divided the presentation of the approaches into four topics: Portuguese QA systems, List QA systems, Web-based QA systems and Temporal QA systems.

We presented several QA systems developed for Portuguese. Some of these systems were developed to participate in competitions like CLEF, GikiCLEF or Págico. The main approaches exploit (i) NLP tools and linguistic resources, (ii) the relation between the question and the possible answers, and (iii) semantic content.

We discussed the approaches of QA systems that focus on answering list questions. Statistical and machine learning approaches have been used to handle list questions, and approaches that exploit semantic content from Wikipedia are also frequently used.

Web-based QA systems were then presented, displaying a great variety of approaches that use the Web as a corpus, such as those exploring redundancy, probabilistic models, clustering, etc. Most systems answer only factoid questions.

Finally, we presented the approaches for answering temporal questions. Our overview indicates that a predominant approach has not yet emerged, since each researcher chose a different path depending on the type of temporal questions. Some works use the Web as a corpus to answer temporal questions; however, these works extract information and store it into knowledge bases so that the QA system can access the information more easily.


3 Answering List Questions

The general consensus defines Open-domain QA as aiming to provide the correct answer by exploring a large collection of documents. The Web has become several orders of magnitude larger than any other document collection, and the availability of such a tremendous data resource has led to the emergence of Web-based QA. Research on Web-based QA began 15 years ago and is still growing. Recent Web-based QA systems are capable of handling factoid questions, but they lack the ability to produce answers to other types of questions.

This dissertation is a first step towards covering this gap, by presenting an approach to answer List questions using the Web as a corpus. We developed our proposal to investigate appropriate ways of answering List questions, processing and rendering information spread over multiple documents by exploiting the redundancy of information available on the Web, combined with heuristics, aiming to improve on the current state of the art. Our proposal ensures that the system handles List questions of multiple types: (i) syntactically correct interrogative sentences; (ii) imperative sentences; or (iii) keyword-based queries.

A major goal of this dissertation is to implement our approach to answering List questions as a fully-fledged Web-based QA system that provides a list of answers extracted from multiple documents. The system is built in such a way that it does not require pre-indexing of documents, thus allowing it to provide answers in real time, and it can handle noisy and unstructured data, thus allowing it to run directly over the Web.


3.1 Outline

This chapter is organized as follows. Section 3.2 provides an overview of List questions and their different features. The key challenges of this dissertation are recalled in Section 3.3. Our overall approach to answering List questions is discussed in Section 3.4, where we present the design features of LX-ListQuestion, the questions that are expected as input and the supporting tools used to build our system. In Section 3.5 we describe in detail the architecture of the system, in which all modules are described and examples are discussed. Finally, in Section 3.6, we present the online architecture and user interface of our implementation of an Open-domain Web-based QA system: LX-ListQuestion.

3.2 List Questions

The main difference between List questions and Factoid questions is that for Factoid questions there is only one correct answer, while List questions require a list of correct answers, making the task of correctly answering them more complex. We performed a corpus study that allowed us to identify three basic forms of List questions: (i) Interrogative Questions: a question starting with an interrogative pronoun (Where, Which, Who, etc.); (ii) Imperative Questions: a question starting with an imperative verb (List, Name, Say, etc.); (iii) Others: a question without an interrogative pronoun or an imperative verb, usually in the form of a complex noun phrase. Table 3.1 shows some examples of these various forms of List questions; a simple classification sketch follows the table.

Type of List Question     Examples
Interrogative Questions   Which countries adopted the Euro?
                          Which presidents were born in 1945?
                          Who are the founders of Intel?
Imperative Questions      Name all Apollo 14 astronauts.
                          List all companies in Munich.
                          List the children of Margaret Thatcher.
Others                    Communist countries.
                          Soccer clubs in Spain.
                          All movies with Tom Cruise.

Table 3.1: Examples of list questions
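The three basic forms can often be told apart by the first token of the question, as the deliberately naive sketch below illustrates; a robust classifier would need more than this first-token heuristic, e.g., inspection of the syntactic structure of the question.

```python
INTERROGATIVES = {"which", "who", "what", "where", "when", "whom"}
IMPERATIVES = {"list", "name", "say", "give", "enumerate"}

def question_form(question):
    """Classify a List question as interrogative, imperative or other
    (e.g., a bare noun phrase such as "Soccer clubs in Spain")."""
    first = question.strip().split()[0].lower().rstrip(",.")
    if first in INTERROGATIVES:
        return "interrogative"
    if first in IMPERATIVES:
        return "imperative"
    return "other"

for q in ("Which countries adopted the Euro?",
          "Name all Apollo 14 astronauts.",
          "Soccer clubs in Spain."):
    print(question_form(q))
# -> interrogative, imperative, other
```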


Our corpus study also indicates that we can categorize List questions according to their complexity. It is usually easier to answer short questions, while longer questions with increased linguistic sophistication are more complex to process and correctly answer. Table 3.2 shows some examples of List questions with different levels of complexity.

Simple/Complex Questions   Examples
Simple                     Give me all movies with Tom Cruise.
Simple                     Give me all films produced by Hal Roach.
Simple                     Give me all books written by Danielle Steel.
Complex                    In which films did Julia Roberts as well as Richard Gere play?
Complex                    List all episodes of the first season of the HBO television series The Sopranos.
Complex                    Which companies work in the aerospace industry as well as in medicine?

Table 3.2: Examples of simple and complex list questions

List questions can be made even more complex by adding constraints. The most common constraints are:

• Temporal constraints, related to time (months, years, centuries, etc.);

• Geographic constraints, related to localization (cities, region, countries, continents and so on);

• Quantitative constraints, related to numerical quantities such as amounts and frequencies.

Table 3.3 shows some examples of List questions with constraints; a toy constraint detector is sketched after the table.

Constraints    Examples
Temporal       Which presidents were born in 1945?
               Which organizations were founded in 1930?
               Give me all libraries established earlier than 1400.
Geographic     Give me all cars that are produced in Germany.
               List all companies in Munich.
               Give me a list of all lakes in Denmark.
Quantitative   Which caves have more than 3 entrances?
               Which German cities have more than 250000 inhabitants?
               Give me the websites of companies with more than 500000 employees.

Table 3.3: Examples of list questions with constraints.
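Detecting such constraints can be illustrated with a few toy detectors. The patterns and the tiny gazetteer below are illustrative assumptions only: real coverage would require a temporal tagger, a geographic gazetteer and number normalization.

```python
import re

# Four-digit years or calendar words signal a temporal constraint.
TEMPORAL = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b|\b(century|month|year)s?\b", re.I)
# Comparative quantity expressions signal a quantitative constraint.
QUANTITATIVE = re.compile(r"\b(more|less|fewer) than \d[\d,.]*\b", re.I)
# Toy gazetteer standing in for a real geographic resource.
GEOGRAPHIC = {"germany", "munich", "denmark", "spain"}

def constraints(question):
    found = []
    if TEMPORAL.search(question):
        found.append("temporal")
    if QUANTITATIVE.search(question):
        found.append("quantitative")
    if any(place in question.lower() for place in GEOGRAPHIC):
        found.append("geographic")
    return found

print(constraints("Which presidents were born in 1945?"))      # ['temporal']
print(constraints("Which caves have more than 3 entrances?"))  # ['quantitative']
print(constraints("List all companies in Munich."))            # ['geographic']
```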


Identifying the question focus is an important task in Question Answering. Our study shows that the focus can be expressed by a named entity or a common noun, and a question may have a single focus or multiple foci. Table 3.4 shows some examples of List questions whose focus has different features.

Question-Focus Part-of-speech   Type       Example
Common Noun                     Single     Which are the ingredients of gun metal?
Named Entity                    Single     Which states border Illinois?
Named Entity                    Single     Give me all federal chancellors of Germany.
Named Entity                    Multiple   Give me all airports, bridges and highways located in California, USA.
Named Entity                    Multiple   List all newspapers, magazines and books published in New York.

Table 3.4: Examples of list questions with question-focus.

Turning now to the answers to List questions, our study indicates that they may appear in many places and in many forms: the answers can be in the same sentence, when the sentence is already an enumeration of the answers, or they can be spread over multiple sentences or even multiple documents. In the latter case, a QA system that handles List questions needs to find all the answers spread over the several texts and compose the final list of answers. Table 3.5 shows examples of List questions whose answers are found in the same sentence and spread over several documents.

Example of answers to a List question in the same sentence:
Question: Give me all members of Prodigy.
Sentence: The current members include Liam Howlett (composer), Keith Flint (vocalist), Leo Crabtree (drums), Rob Holliday (guitarist) and Maxim Reality (vocalist).

Example of answers to a List question spread over different documents:
Question: Give me all movies with Tom Cruise.
Document 1: Cruise played a fighter pilot in the action drama Top Gun and also starred opposite Paul Newman in the drama The Color of Money.
Document 2: In 1988, Tom Cruise starred with Dustin Hoffman in the drama Rain Man and also, in the same year, in the romantic drama Cocktail.
Document 3: In Born on the Fourth of July (1989), Tom Cruise starred as anti-war activist Ron Kovic in the drama adaptation of Kovic's memoir.
Document 4: In 1999, Cruise starred in the Stanley Kubrick-directed erotic thriller Eyes Wide Shut opposite his then wife Nicole Kidman, and also appeared in the drama Magnolia.

Table 3.5: Examples of answering list questions


3.3 Challenges

The challenges have already been thoroughly discussed in Section 1.5. Here we recall the key challenges involved in the implementation of an Open-domain Web-based QA system that handles List questions:

• Quality: The process of answering questions using the Web as a corpus is complicated by the low average quality of documents. Due to how easy it is to publish something on the Internet, many documents are poorly written or simply contain incorrect information. As a result, an answer extracted from these Web documents cannot be trusted as the correct answer.

This issue can be alleviated through data redundancy, which is the main approach of our dissertation, since multiple occurrences of the same answer in different documents lends credibility to that answer.

• Timeliness: Using the Web as a corpus instead of a database allows giving the user an answer even when the question refers to recent events or facts, which otherwise would require constant effort to maintain an up-to-date database.

Providing an answer to a question in real time while running over the Web is a key challenge covered in this dissertation.

• Accuracy: The answer precision of a QA system is an important aspect. Some researchers believe that incorrect answers are worse than no answers, and producing accurate answers to List questions is a big challenge in this dissertation.

• Completeness: Getting a complete list of answers is desirable. In most cases the answers are spread over several documents, and composing a complete list of answers is challenging.

When the list of answers contains only two or three items, it is relatively easy to find the complete list; when it contains 50 or 100 items, the difficulty of the task increases.


3.4 Approach

Our approach collects answers from multiple documents to compose the final list of answers. The key feature of our approach is that it exploits the redundancy of information available online, combined with heuristics, to improve QA accuracy.

Driving Insight

Approaches to Information Retrieval, and to related tasks such as Question Answering, that use only one frequency threshold face the common precision-recall tradeoff: if the threshold is high, precision increases and recall decreases; if the threshold is low, recall increases and precision decreases. Finding a single threshold that gives the best balance between recall and precision is not an easy task. This suggests that we need a more informed threshold that in some way links frequency with the relevance between sentence and question. This can be done in several ways: we opted for a simple, elegant solution which separates the candidates into two lists based on the relevance between sentence and question. Each list is filtered by a different threshold. Our strategy applies two separate thresholds: a more relaxed one for high-relevance candidates and a more stringent one for low-relevance candidates. This strategy of separating the candidates into two lists and applying different thresholds leads to a better balance between precision and recall. Below we present more details of our approach:

1. Bipartite List Approach: We build two lists: (i) the Premium List, composed of candidates extracted from sentences with high relevance to the question, and (ii) the Work List, composed of candidates extracted from sentences with weak relevance to the question. Our approach exploits redundancy to find all answers to the List question, and uses candidate frequency as a factor to select the correct answers. In each list, repeated candidates are grouped together and their frequency is calculated. Following this, two frequency thresholds are calculated: one is used to filter the candidates of the Premium List and the other to filter the candidates of the Work List. The threshold applied to the Premium List is relaxed, since these candidates are more credible on account of having been extracted from sentences classified with a high relevance score. The threshold applied to the Work List is stringent, since these candidates were extracted from noisy sentences and only the candidates with high frequency go through to the final list of answers.

2. Heuristics: We developed and applied three heuristics based on word occurrence.

• Verb-Rule, which selects a candidate as an answer if the candidate appears in a sentence with the same verb as the question.

• Title-Rule, which selects as answers all candidates from documents whose title matches the question.

• Sentence Match-Rule, which selects as answers all candidates extracted from sentences that match the question.

Combining the bipartite list and heuristics is a novel approach that, to the best of our knowledge, has not been attempted before.

Design Features

A major goal of this dissertation is to implement our approach to answer list questions as a fully-fledged Web-based QA system that provides a list of answers extracted from multiple documents. Our implementation is guided by the following design features:

• Exploits redundancy to find answers to List questions;

• Compiles and extracts the answers from multiple documents;

• Collects the documents from the Web at run-time using a search engine;

• Provides answers in real time without resorting to previously stored information.

3.4.1 Expected Input

Our system expects as input questions in Portuguese that require a list of answers, and accepts all three basic forms of List questions: (i) a question starting with an interrogative pronoun; (ii) a question starting with an imperative verb; and (iii) a question in the form of a complex noun phrase.


The system handles questions with different levels of complexity: (i) simple and usually short questions (no longer than three keywords) and (ii) complex questions with some linguistic sophistication. LX-ListQuestion answers questions that expect named entities as answers. Table 3.6 shows examples of List questions for which LX-ListQuestion is capable of finding the answer:

Question Type   Simple/Complex   Examples
Interrogative   Simple           Quais são as ilhas em Moçambique?
Questions                        (Which are the Mozambican islands?)
                Complex          Quem são os escritores cabo-verdianos com obra publicada em criolo?
                                 (Who are the Cape Verdean writers with published work in Creole?)
                Complex          Em que cidades portuguesas têm festivais medievais?
                                 (Which Portuguese cities have medieval festivals?)
Imperative      Simple           Liste os parques nacionais em Moçambique.
Questions                        (List the National Parks in Mozambique.)
                Complex          Nomeie os escritores moçambicanos que receberam o Prémio Camões.
                                 (Name the Mozambican writers that received the Camões Prize.)
                Complex          Mencione as cidades que fizeram parte do domínio português na India.
                                 (Name cities that were part of the Portuguese Empire in India.)
Other           Simple           Capitais das províncias de Angola.
                                 (The capitals of Angolan provinces.)
                Complex          Cidades lusófonas conhecidas pelo seu carnaval.
                                 (Lusophone cities known for their carnival celebrations.)
                Complex          Países que venceram a Copa do Mundo em uma disputa de pênaltis.
                                 (Countries that won the World Cup in a penalty shootout.)

Table 3.6: Examples of list questions expected input of LX-ListQuestion.

3.4.2 External Resources and Support Tools

Our system was built with the support of some tools. In this section we describe the tools used.

1. LX-Conjugator1: a tool for the conjugation of Portuguese verbs (Costa, 2004). The system takes an infinitive verb form and delivers the corresponding conjugated forms. Verbal inflection is one of the most complex parts of Portuguese morphology, given the high number of conjugated forms for each verb (ca. 70 forms in non-pronominal conjugation).

1http://www.lxcenter.di.fc.ul.pt/services/en/LXServicesConjugator.html


2. LX-Suite1: a system for the shallow processing of Portuguese (Branco and Silva, 2006), based on a pipeline of several tools. The tools for lemmatization and morphological analysis sit at the end of the pipeline and are fed by three other tools: a sentence splitter, a tokenizer and a POS tagger.

3. LX-Ner2: a tool for the recognition of named entity expressions in Portuguese (Ferreira et al., 2007). Named entities are classified as Person (PER), Organization (ORG), Location (LOC), Event (EVT) and Work (WRK).

4. Multi-WordNet PT (MWNPT)3: a lexical semantic network for the Portuguese language, shaped under the ontological model of wordnets. It spans over 17,200 concepts/synsets, linked by the semantic relations of hyponymy and hypernymy. These concepts comprise over 21,000 word senses/word forms and 16,000 lemmas. It includes the subontologies under the concepts of Person, Organization, Event, Location and Work of Art.

5. TEP4: an Electronic Thesaurus for Brazilian Portuguese (Maziero et al., 2008). The TEP database stores sets of synonyms and antonyms for word forms. We use the database to improve keyword expansion.

6. Google Custom Search5: an application programming interface (API) that allows retrieving and displaying search results from Google Custom Search. The API is integrated into the system application and provides 100 free search queries per day.

7. HTML Parser6: a Java library used to parse HTML. The library allows transforming HTML pages into plain text and also creating and editing pages, and it can be coupled with the QA system under development.

1http://www.lxcenter.di.fc.ul.pt/services/en/LXServicesSuite.html
2http://lxcenter.di.fc.ul.pt/services/en/LXServicesNer.html
3http://lxcenter.di.fc.ul.pt/services/en/LXServicesWordnet.html
4http://www.nilc.icmc.usp.br/tep2/
5http://www.google.com.br/cse/
6http://htmlparser.sourceforge.net/


3.5 Architecture

Our architecture follows the basic QA system architecture proposed by Pașca (2003), composed of three main modules: Question Processing, Passage Retrieval and Answer Extraction, as shown in Figure 3.1.

Figure 3.1: Question answering system architecture

This is a widely accepted QA system architecture that provides a simple one-way dataflow. Its modularity hides the high complexity involved in developing an Open-domain Web-based QA system. Each module is discussed in depth in this Section.

3.5.1 Question Processing

The Question Processing module is responsible for converting a natural language question into a format that a computer is capable of further handling. Figure 3.2 shows the main components of the Question Processing module. Question processing relies on the following sub-tasks: (1) question analysis; (2) extraction of keywords; (3) transformation of the question into a query for the search engine; (4) identification of the semantic category of the expected answer; (5) counting of the number of words in the question; and (6) identification of the question-focus.


Figure 3.2: Question processing module

3.5.1.1 Question Analysis

The Question Analysis sub-task is responsible for identifying patterns occurring in questions and for cleaning the question, leaving only its significant part, called the root question. In the context of QA research, List questions may appear in three basic forms (see Table 3.7 for examples):

1. Sentence without an interrogative pronoun;

2. Question that begins with an interrogative pronoun;

3. Sentence that begins with an imperative verb form.

Question Type           Example
Without pronoun         Províncias toscanas que produzem Chianti.
Interrogative pronoun   Quais as províncias toscanas que produzem Chianti?
Imperative verb         Liste as províncias toscanas que produzem Chianti.

Table 3.7: Examples of different type of list questions.


When the pattern is identified, the interrogative pronoun or the imperative verb is discarded. What remains is what we call the root question, which will be processed in the next step. Some examples follow below, and a code sketch of this step follows them:

Sentence Examples: Quais as províncias toscanas que produzem Chianti? Liste as províncias toscanas que produzem Chianti.

All sentences above will be reduced to this root question: Províncias toscanas que produzem Chianti.
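To make the pattern-stripping step concrete, the following is a minimal sketch of how a root question can be obtained. The marker lists in the regular expressions are abridged illustrations, not the actual lexicons used by LX-ListQuestion.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of the Question Analysis step: strip a leading interrogative
 *  pronoun or imperative verb, leaving the root question. */
public class RootQuestionExtractor {

    // Hypothetical (abridged) lists of question-initial markers.
    private static final Pattern INTERROGATIVE =
            Pattern.compile("^(quais( são| as| os)?|quem( são)?|que|onde|em que)\\s+",
                    Pattern.CASE_INSENSITIVE);
    private static final Pattern IMPERATIVE =
            Pattern.compile("^(liste|nomeie|cite|mencione|indique)( os| as)?\\s+",
                    Pattern.CASE_INSENSITIVE);

    public static String rootQuestion(String question) {
        String q = question.trim();
        q = INTERROGATIVE.matcher(q).replaceFirst("");
        q = IMPERATIVE.matcher(q).replaceFirst("");
        // Replace a final question mark with a period.
        return q.replaceAll("\\?$", ".").trim();
    }

    public static void main(String[] args) {
        List.of("Quais as províncias toscanas que produzem Chianti?",
                "Liste as províncias toscanas que produzem Chianti.",
                "Províncias toscanas que produzem Chianti.")
            .forEach(q -> System.out.println(rootQuestion(q)));
        // All three print the same root question.
    }
}
```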

3.5.1.2 Transformation of the Question into a Query

In this processing step, the root question is transformed into a query for the search engine. The root question is processed by LX-Suite which, among other linguistic information, assigns a part-of-speech tag and a lemma to each word, as exemplified below:

Root question: Províncias toscanas que produzem Chianti.

Words of root question and lemmas: províncias/PROVÍNCIA toscanas/TOSCANA que/QUE produzem/PRODUZIR Chianti/CHIANTI

The query is composed of the lemmas of the root question, but only some part-of-speech categories are retained, namely proper names, common nouns, verbs and adjectives. Other categories, such as articles, prepositions and pronouns, are discarded.

Complete query: PROVÍNCIA + TOSCANA + PRODUZIR + Chianti

3.5.1.3 Keywords Extraction

The keywords will be used to select relevant passages from the source texts. To gather the keywords, we use word expansion, resorting to the lexical databases MWNPT and TEP (to expand common nouns) and to the LX-Conjugator tool (to expand verbs). Word expansion only applies to common nouns and verbs.


Nominal Expansion: The algorithm identifies common nouns in the root question and retrieves their synonyms from the lexical databases MWNPT and TEP. Following the example:

Root question: Províncias toscanas que produzem Chianti

For instance, the word “província” is a common noun and when this word is searched in the lexical database its synonyms are retrieved:

Synonyms of “província”: região, zona, distrito, lugar

Verbal Expansion: The algorithm identifies verbs in the root question, and the LX-Conjugator tool takes each infinitive verb form and delivers it conjugated in the Past Perfect and Past Participle forms, which are the forms that most commonly appear in List questions. An example follows:

conjugated forms of “produzir”: produzem, produziu, produzido

Building the full set of keywords: The keyword set is composed of (1) the words of the root question, (2) their lemmas, (3) the synonyms of the common nouns in the root question and (4) the conjugated forms of the verbs in the question (a code sketch of this assembly follows the example):

Full set of keywords: províncias, província, toscanas, toscana, produzem, produzir, produziu, produzido, Chianti, região, zona, distrito, lugar
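The following sketch illustrates how the full keyword set can be assembled. The NOUN_SYNONYMS and VERB_FORMS maps are hypothetical stand-ins for lookups in MWNPT/TEP and LX-Conjugator.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of keyword-set construction for the running example. */
public class KeywordExpander {

    // Stand-ins for the lexical resources; the real system queries
    // MWNPT/TEP for synonyms and LX-Conjugator for verb forms.
    static final Map<String, List<String>> NOUN_SYNONYMS = Map.of(
            "província", List.of("região", "zona", "distrito", "lugar"));
    static final Map<String, List<String>> VERB_FORMS = Map.of(
            "produzir", List.of("produzem", "produziu", "produzido"));

    /** tokens: surface form -> lemma, as produced by LX-Suite. */
    public static Set<String> expand(Map<String, String> tokens) {
        Set<String> keywords = new LinkedHashSet<>();
        tokens.forEach((surface, lemma) -> {
            keywords.add(surface);              // (1) word of the root question
            keywords.add(lemma.toLowerCase());  // (2) its lemma
            // (3) noun synonyms and (4) conjugated verb forms, when available
            keywords.addAll(NOUN_SYNONYMS.getOrDefault(lemma.toLowerCase(), List.of()));
            keywords.addAll(VERB_FORMS.getOrDefault(lemma.toLowerCase(), List.of()));
        });
        return keywords;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = Map.of(
                "províncias", "PROVÍNCIA",
                "toscanas", "TOSCANA",
                "produzem", "PRODUZIR",
                "Chianti", "CHIANTI");
        System.out.println(expand(tokens));
    }
}
```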

3.5.1.4 Semantic Category of the Expected Answer

The semantic category of the question is determined using MWNPT and is used by the Answer Extraction Module (see Section 3.5.3). This information is important to check whether the candidates have the expected semantic category of the question. Following the example in Table 3.8, the semantic category of the word "província" is LOC (meaning LOCATION). In this case, the list of answers will be composed of locations. The other possible semantic categories are: PER (person), ORG (organization), EVT (event) and WRK (work). If a given word is not present in MWNPT, the system assumes the value UNK (unknown) and all categories of named entities are considered for processing.


Word        MWNPT Synonyms                  Semantic Category
província   região, zona, distrito, lugar   LOC

Table 3.8: Semantic category of the expected answer

3.5.1.5 Counting Keywords

The number of keywords is used to calculate the threshold for classifying sentences according to their relevance to the question. We also apply heuristics that recognize named entities on the basis of the part-of-speech tags assigned by LX-Suite. It is worth mentioning that named entities are counted only once, and that articles and prepositions are discarded from the counting. Table 3.9 shows examples of the counting of keywords.

Number of Keywords   Question
3                    {Filmes} {brasileiros} sobre {futebol}.
4                    {Províncias} {Toscanas} que {produzem} {Chianti}
5                    Liste os {eventos} onde {Maria de Lurdes Mutola} {foi} {medalha} de {ouro}

Table 3.9: Examples of counting keywords

3.5.1.6 Identifying the Question-Focus

Moldovan et al. (2000) define the question-focus as a word or a sequence of words that define the question and disambiguate it by indicating what the question is looking for. In the context of this work, we developed a simple heuristic for determining the question-focus: we consider it to be the first common noun of the question. Table 3.10 shows some examples, followed by a code sketch of this heuristic.

Question                                                            Question-Focus
Filmes brasileiros sobre futebol.                                   Filme
Províncias Toscanas que produzem Chianti                            Província
Liste os eventos onde Maria de Lurdes Mutola foi medalha de ouro.   Evento

Table 3.10: Examples of question-focus identification
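A minimal sketch of this heuristic follows; the POS tags in the example are illustrative of LX-Suite output.

```java
import java.util.List;

/** Sketch of the question-focus heuristic: the focus is the first token
 *  tagged as a common noun (CN). */
public class QuestionFocusFinder {

    record TaggedWord(String lemma, String pos) {}

    static String findFocus(List<TaggedWord> taggedQuestion) {
        return taggedQuestion.stream()
                .filter(w -> w.pos().equals("CN")) // first common noun
                .map(TaggedWord::lemma)
                .findFirst()
                .orElse(null);
    }

    public static void main(String[] args) {
        // "Províncias Toscanas que produzem Chianti" (tags illustrative)
        List<TaggedWord> q = List.of(
                new TaggedWord("PROVÍNCIA", "CN"),
                new TaggedWord("TOSCANA", "ADJ"),
                new TaggedWord("QUE", "REL"),
                new TaggedWord("PRODUZIR", "V"),
                new TaggedWord("CHIANTI", "PNM"));
        System.out.println(findFocus(q)); // PROVÍNCIA
    }
}
```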


3.5.2 Passage Retrieval

The Passage Retrieval module is responsible for searching for Web pages based on the generated query and for saving them into local files for pre-processing. This module is also responsible for separating the content from HTML tags and selecting the relevant information related to the input question. It has two main tasks: document retrieval and document analysis. Figure 3.3 shows the Passage Retrieval Module and its components.

Figure 3.3: Passage retrieval module

3.5.2.1 Document Retrieval

In this task, the documents related to the question are retrieved from the Web. We only use HTML files; other types of files are discarded (e.g. .doc, .pdf, .ppt). The files are downloaded and stored on the server for textual processing. The documents are retrieved using the query generated by the Question Processing Module (Section 3.5.1), according to the steps detailed below; a code sketch of these steps follows:


1. send query: submitting the query to the Google Search Engine1;

2. get links: the search engine returns links from the top 10 documents;

3. download files: verify that the link points to an HTML file. If so, download and save the document;

4. clean files: separate HTML markup from content using HTML Parser.
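The sketch below illustrates steps 1, 3 and 4, assuming the Google Custom Search JSON API with placeholder credentials (API_KEY and ENGINE_ID must be supplied by the caller) and a crude tag-stripping regex in place of the HTML Parser library.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Sketch of the Document Retrieval steps. */
public class DocumentRetriever {

    static final HttpClient HTTP = HttpClient.newHttpClient();

    // 1. send query: submit the query to the search engine.
    static String search(String apiKey, String engineId, String query) throws Exception {
        String url = "https://www.googleapis.com/customsearch/v1?key=" + apiKey
                + "&cx=" + engineId
                + "&lr=lang_pt"   // restrict to pages written in Portuguese
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
        return get(url);          // JSON containing the top result links
    }

    // 3. download files: fetch one of the returned links.
    // 4. clean files: separate HTML markup from content.
    static String downloadAsText(String link) throws Exception {
        String html = get(link);
        return html.replaceAll("(?s)<(script|style)[^>]*>.*?</\\1>", " ")
                   .replaceAll("<[^>]+>", " ")      // drop remaining tags
                   .replaceAll("\\s+", " ").trim(); // normalize whitespace
    }

    private static String get(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Step 2 (get links) would parse the JSON returned by search() and
        // keep the top 10 HTML links before calling downloadAsText() on each.
        System.out.println(downloadAsText("http://example.com/"));
    }
}
```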

3.5.2.2 Document Analysis

In this task, the documents retrieved by the previous task are analyzed and the relevant information is selected. Initially, the text is processed by a Segmenter tool, which marks up the text and splits it into sentences. Figure 3.4 shows an example.

Before Sentence Segmentation

O Classico no centro de Chianti, através da províncias de Florença e Siena. Arentino Colli na província de Arezzo. Colli Senesi sul de Chianti Classico, nas colinas de Siena, esta é a maior das sub-regiões. Colline Pisane sub-zona oeste, na província de Pisa. Montespertoli localizado dentro do Colli Fiorentini. Inúmeras vezes a vi emoldurando restaurantes italianos, principalmente em São Paulo. Já a qualidade deste vinho variou em muito.

After Sentence Segmentation

O Classico no centro de Chianti, através da províncias de Florença e Siena.
Arentino Colli na província de Arezzo.
Colli Senesi sul de Chianti Classico, nas colinas de Siena, esta é a maior das sub-regiões.
Colline Pisane sub-zona oeste, na província de Pisa.
Montespertoli localizado dentro do Colli Fiorentini.
Inúmeras vezes a vi emoldurando restaurantes italianos, principalmente em São Paulo.
Já a qualidade deste vinho variou em muito.

Figure 3.4: Before and after the segmenter tool.

1The Google Search Engine is configured to return only pages written in Portuguese.


After the Segmenter tool is applied, three important steps are performed (Figure 3.5 offers a diagram of the process):

Figure 3.5: Document analysis process

1. Verify the title of the text: If the HTML title of the text matches the root question (that is, it contains the same words in the same order), the text is set apart to be sent to the next module. Example:

Root question: “Províncias Toscanas produzem Chianti.”
Web-page title: “As famosas províncias Toscanas que produzem o melhor do vinho Chianti”.

In Figure 3.6, the words in the title, “províncias”, “Toscanas”, “produzem” and “Chianti”, appear in the same order as in the root question.


Figure 3.6: Web page title matches the question.

2. Verify whether some sentences match the root question: If a sentence matches the root question, the sentence is set apart to be processed by the next module.

Example:
Root question: “Províncias Toscanas produzem Chianti”
Sentence: “As províncias Toscanas que atualmente produzem o vinho Chianti são: Florença, Pisa e Pistoia.”

In Figure 3.7, the words in the highlighted sentence, namely “províncias”, “Toscanas”, “produzem” and “Chianti”, appear in the same order as in the root question.


Figure 3.7: Sentence matches the question.

3. Sentence classification: The sentences are classified into three classes according to their relevance with respect to the root question, which depends on a Relevance Threshold set at half the number of keywords. Depending on their classification, the sentences are stored in distinct sets. A sentence has “weak” relevance if it contains fewer keywords than the Relevance Threshold; “medium” relevance if it contains as many keywords as the Relevance Threshold; and “strong” relevance if it contains more keywords than the Relevance Threshold.

For instance, the root question “Províncias Toscanas que produzem Chianti.” has 4 keywords, giving a Relevance Threshold of 2 (half of the number of keywords). Table 3.11 shows examples of sentences classified based on this Relevance Threshold, followed by a code sketch of the classification.


Sentence                                                  Keywords             Class
O grande problema é que somente a palavra Chianti         Chianti              weak
diz pouca coisa sobre o vinho.
Chianti Rufina na parte nordeste da região situada        Chianti, região      medium
em torno do município de Rufina.
O vinho Chianti é produzido em uma região muito vasta     Chianti, Toscana,    strong
da Toscana que compreende as províncias de Firenze,       região, província
Siena, Prato, Arezzo e Pistoia.

Table 3.11: Sentence classification.
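The classification rule can be sketched as follows; keyword matching is reduced to case-insensitive substring containment for illustration.

```java
import java.util.Set;

/** Sketch of sentence classification against the Relevance Threshold. */
public class SentenceClassifier {

    enum Relevance { WEAK, MEDIUM, STRONG }

    static Relevance classify(String sentence, Set<String> keywords, int questionKeywordCount) {
        int threshold = questionKeywordCount / 2;   // Relevance Threshold
        String lower = sentence.toLowerCase();
        long hits = keywords.stream()
                .filter(k -> lower.contains(k.toLowerCase()))
                .count();                           // distinct keywords present
        if (hits < threshold) return Relevance.WEAK;
        if (hits == threshold) return Relevance.MEDIUM;
        return Relevance.STRONG;
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("Chianti", "Toscana", "região", "província");
        String weak = "O grande problema é que somente a palavra Chianti diz pouca coisa sobre o vinho.";
        String strong = "O vinho Chianti é produzido em uma região muito vasta da Toscana "
                + "que compreende as províncias de Firenze, Siena, Prato, Arezzo e Pistoia.";
        System.out.println(classify(weak, keywords, 4));   // WEAK
        System.out.println(classify(strong, keywords, 4)); // STRONG
    }
}
```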

3.5.3 Answer Extraction

The Answer Extraction Module aims at identifying and extracting relevant answers from the previously classified sentences and presenting them in the form of a list. This module uses information provided by the previous modules. The Expected Answer Type identified during question processing guides the identification of candidate answers. Similarly, the previously classified sentences help limit the search space for candidate answers. Figure 3.8 presents a diagram of the Answer Extraction Module.

Figure 3.8: Answer extraction module.


3.5.3.1 Extracting Candidate Answers

We developed two approaches to extract candidate answers: (i) an approach based on a Named Entity Recognizer and (ii) an approach based on the Question-Focus.

Approach based on a Named Entity Recognizer: The candidate answers are a subset of the words extracted from the passages and identified by the Named Entity Recognizer (NER), namely LX-Ner. This approach is common in QA and usually provides the best precision, since it ensures that the semantic category of the candidate matches that of the expected answer. However, it ties the performance of the answer extraction procedure to the performance of the underlying NER tool: a classification error during the NER phase will impede the successful retrieval of candidates. Figure 3.9 shows candidates extracted from the sentence: “Chianti Colli Senesi é produzido numa subzona situada ao sul da região do Chianti Clássico, nas colinas da Província de Siena.” The candidates are extracted using a hand-built pattern based on the LOC tag (meaning LOCATION) given by LX-Ner.

Sentence processed by LX-Ner:

Chianti/PNM Colli/PNM Senesi/PNM é/V produzido/PPA em/PREP a/DA subzona/CN situada/PPA a/PREP o/DA sul/CN de/PREP a/DA região/CN de/PREP o/DA Chianti/PNM Classico/PNM ,/PNT em/PREP as/DA colinas/CN de/PREP a/DA província/CN de/PREP Siena/PNM

Candidates extracted using the hand-built pattern:

região/REGIãO/CN#fs de_/PREP o/DA#ms Chianti/PNM Classico/PNM
província/PROVíNCIA/CN#fs de/PREP Siena/PNM

Candidates without the tags:

região do Chianti Clássico
província de Siena

Figure 3.9: Example: candidates extracted by LX-Ner

Approach based on the Question-Focus: In this approach, the candidate answers are a subset of the words extracted from the passages that match the question-focus, based on a hand-built pattern. The pattern selects as a candidate the words that appear after the question-focus. Figure 3.10 shows candidates extracted based on the question-focus “Província”, followed by a code sketch of the pattern.


Sentences:

Chianti Colli Aretini Produzido em uma área da província de Arezzo, é um vinho jovem fresco e levemente frisante, ideal para ser consumido ainda jovem.
A comuna italiana da região da Toscana, província de Siena, parece ter saído de um filme da Idade Média.
Suas videiras se concentram principalmente na província do Piemonte, na Itália, numa zona onde as uvas Moscato dão o melhor de si.

Candidates extracted by Question-Focus: “Província”

província de Arezzo
província de Siena
província do Piemonte

Figure 3.10: Example: candidates extracted by question-focus
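A sketch of this extraction pattern follows. The regular expression is an illustrative approximation: the real pattern operates over LX-Suite POS tags rather than raw text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of question-focus candidate extraction. */
public class FocusCandidateExtractor {

    static List<String> extract(String sentence, String focus) {
        // focus noun, an optional "de/do/da(s)" contraction, then a
        // capitalized name sequence, e.g. "província de Arezzo".
        Pattern p = Pattern.compile("(?i:" + Pattern.quote(focus) + ")\\s+"
                + "(d[eoa]s?\\s+\\p{Lu}\\p{L}+(?:\\s+\\p{Lu}\\p{L}+)*)");
        Matcher m = p.matcher(sentence);
        List<String> candidates = new ArrayList<>();
        while (m.find()) {
            candidates.add(focus + " " + m.group(1));
        }
        return candidates;
    }

    public static void main(String[] args) {
        String text = "Chianti Colli Aretini Produzido em uma área da província de Arezzo, "
                + "é um vinho jovem. A comuna italiana da região da Toscana, província de Siena, "
                + "parece ter saído de um filme. Suas videiras se concentram principalmente "
                + "na província do Piemonte, na Itália.";
        System.out.println(extract(text, "província"));
        // [província de Arezzo, província de Siena, província do Piemonte]
    }
}
```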

3.5.3.2 Building the Answer List

We introduced our approach in Section 3.4; in this Section we explain our implementation of the approach to answer List questions in detail. The process of building the list of answers is based on redundancy, and the strategy is to take advantage of the sentences previously classified by the Passage Retrieval Module. As explained earlier, our bipartite approach is based on building two lists and applying two thresholds, one for each list. The main elements that compose the Premium List are taken from sentences previously classified as highly relevant and serve to guide the rest of the processing. If we were to consider only these elements, the list of answers would probably contain correct items; however, it might be incomplete. Then, continuing in the same vein of our strategy, we build the Work List using the candidates extracted from the sentences classified as having medium and low relevance. The process of building the answer list based on the bipartite approach is detailed below:

1. The Premium List is built from candidates extracted from the sentences previously classified as highly relevant (see Section 3.5.2).

2. The Work List is built from candidates extracted from the sentences previously classi- fied with medium or low relevance.

3. If the Premium List is empty, the candidates extracted from sentences classified as having medium relevance are used to build the Premium List, leaving only the candidates extracted from low-relevance sentences to build the Work List.


4. The elements in each list that appear repeated are grouped together and their frequency is calculated.

5. Two frequency thresholds are calculated, both from the Work List. One threshold is used to filter the Premium List ($t_P$) and the other to filter the Work List ($t_W$). The thresholds are calculated by the following procedure, sketched in code further below:

• Let $wa = \frac{\sum_j c_j \times j}{\sum_j c_j}$ be the weighted average of the elements in the Work List, where $j$ is a frequency and $c_j$ is the number of elements with frequency $j$ (the frequency of frequencies).

• Let $u$ be an (empirically determined) upper bound on the admissible values for $c_j$, in order to limit the impact of the elements with very low frequencies.

• Let $\hat{c}_{u,j} = \min(c_j, u)$ be the frequency of frequencies bounded by $u$.

• Let $\hat{wa} = \frac{\sum_j \hat{c}_{u,j} \times j}{\sum_j \hat{c}_{u,j}}$ be the weighted average of the elements in the Work List, taking into account the frequency of frequencies bounded by $u$.

• To calculate the threshold $t_P = \hat{wa}$ for the Premium List, $u$ is set to 7 after experimentation.

• To calculate the threshold $t_W = \hat{wa}$ for the Work List, $u$ is set to 1 after experimentation.

6. Filter with $t_P$: Candidates in the Premium List are filtered using $t_P$. The frequency of each candidate is compared to the threshold: for a candidate to pass to the Final List of Answers, its frequency must be equal to or greater than $t_P$.

7. Filter with $t_W$: Candidates in the Work List with a frequency above the threshold $t_W$ are also included in the Final List of Answers.
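The threshold computation and Work List filtering can be sketched as follows; the candidate frequencies in main are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the bipartite filtering step: compute the bounded weighted
 *  average for u = 7 (Premium List) and u = 1 (Work List). */
public class BipartiteFilter {

    /** frequency j -> c_j (how many distinct candidates have frequency j). */
    static Map<Integer, Integer> frequencyOfFrequencies(Map<String, Integer> candidateFreq) {
        Map<Integer, Integer> cj = new HashMap<>();
        candidateFreq.values().forEach(j -> cj.merge(j, 1, Integer::sum));
        return cj;
    }

    /** \hat{wa} = sum_j min(c_j, u) * j / sum_j min(c_j, u). */
    static double boundedWeightedAverage(Map<Integer, Integer> cj, int u) {
        double num = 0, den = 0;
        for (var e : cj.entrySet()) {
            int bounded = Math.min(e.getValue(), u);
            num += bounded * e.getKey();
            den += bounded;
        }
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        // Hypothetical Work List: candidate -> frequency.
        Map<String, Integer> work = Map.of(
                "Firenze", 6, "Siena", 5, "Pisa", 2,
                "Arezzo", 2, "Pistoia", 1, "Rufina", 1, "Montespertoli", 1);
        Map<Integer, Integer> cj = frequencyOfFrequencies(work);

        double tP = boundedWeightedAverage(cj, 7); // relaxed, for the Premium List
        double tW = boundedWeightedAverage(cj, 1); // stringent, for the Work List
        System.out.printf("tP = %.2f, tW = %.2f%n", tP, tW);

        // Work List candidates must exceed tW to enter the final answers.
        work.forEach((cand, f) -> {
            if (f > tW) System.out.println("accepted from Work List: " + cand);
        });
    }
}
```

On this toy data, tP is about 2.57 while tW is 3.50: capping each c_j at 1 mutes the many frequency-1 candidates, which pushes the Work List threshold up and keeps the Premium List threshold relaxed, as intended.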

Our approach combines the redundancy available on the Web with heuristics. The heuristics implemented in our system are described as follows, with a code sketch after their description:

1. Heuristic based on Question-Verb: Candidates in the Premium List that were not promoted to the Final List of Answers after applying threshold $t_P$ still have a second chance to be included, based on an analysis of the verb: if the sentence in which the candidate occurs contains the same verb as the question, the candidate is included in the Final List of Answers.


2. Heuristic based on Document-Title: All candidates extracted from documents whose title matches the root question (i.e. all keywords are present) pass to the Final List of Answers. The process used to extract these candidates was explained in Section 3.5.2.2.

3. Heuristic based on Sentence-Question-Match: All candidates extracted from sentences that match the root question pass to the Final List of Answers. The process used to extract these candidates was explained in Section 3.5.2.2.
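A sketch of the three heuristics follows. For brevity, title and sentence matching are reduced to keyword containment, omitting the same-order check described in Section 3.5.2.2.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Sketch of the Verb-Rule, Title-Rule and Sentence-Match-Rule heuristics. */
public class AnswerHeuristics {

    record Candidate(String answer, String sentence, String documentTitle) {}

    static boolean verbRule(Candidate c, Set<String> questionVerbForms) {
        Set<String> words = tokens(c.sentence());
        return questionVerbForms.stream().map(String::toLowerCase).anyMatch(words::contains);
    }

    static boolean titleRule(Candidate c, List<String> rootQuestionKeywords) {
        return containsAll(c.documentTitle(), rootQuestionKeywords);
    }

    static boolean sentenceMatchRule(Candidate c, List<String> rootQuestionKeywords) {
        return containsAll(c.sentence(), rootQuestionKeywords);
    }

    private static boolean containsAll(String text, List<String> keywords) {
        Set<String> words = tokens(text);
        return keywords.stream().map(String::toLowerCase).allMatch(words::contains);
    }

    private static Set<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Candidate c = new Candidate("Florença",
                "As províncias Toscanas que atualmente produzem o vinho Chianti são: Florença, Pisa e Pistoia.",
                "As famosas províncias Toscanas que produzem o melhor do vinho Chianti");
        List<String> keywords = List.of("províncias", "toscanas", "produzem", "chianti");
        System.out.println(verbRule(c, Set.of("produzem", "produziu", "produzido"))); // true
        System.out.println(titleRule(c, keywords));         // true
        System.out.println(sentenceMatchRule(c, keywords)); // true
    }
}
```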

Figure 3.11 summarizes the process of building the list of answers, combining the bipartite approach and the heuristics.

Figure 3.11: Building the list of answers.


3.6 LX-ListQuestion: an Open-Domain Web-based QA system for List Questions

To satisfy one of the goals of this dissertation, namely to provide an online Open-domain Web-based Question Answering system for List questions, LX-ListQuestion is available online at http://lxlistquestion.di.fc.ul.pt. In this Section we present the architecture underlying the online version of LX-ListQuestion.

Online Architecture

The online version of LX-ListQuestion was developed by combining JavaServer Pages (JSP)1 and Java2. The LX-ListQuestion system itself is developed in Java, and the user interface in JSP, which provides direct access to the Java programming language. Figure 3.12 shows a diagram of this architecture.

Figure 3.12: LX-ListQuestion - online version architecture.

User Interface

The user interface was developed using JSP because it allows an easy connection with the Java system. We sought to develop an easy-to-use interface, with a field where the user can type the question. Before running the system, the user can choose the following options:

1http://www.oracle.com/technetwork/java/javaee/jsp/index.html
2http://www.java.com/


Figure 3.13: LX-ListQuestion - online version interface - list answers.

• Normal or Expandida: The user can choose the Normal option for normal processing. If the user chooses the Expandida option, the system ignores the filters and returns all the candidates found in the documents, with their respective frequencies.

• Lista or WordCloud: The system provides two ways of presenting the answers. The first is a traditional numbered list: the answers at the top of the list are more relevant and the answers at the bottom are less relevant. We also use font size to distinguish the answers, with the most relevant answers shown in a larger font.

Figure 3.13 shows the user interface of the online LX-ListQuestion displaying the answers in the traditional list form.

The second way of visualizing the answers is a wordcloud, in which the most relevant answers appear in a larger font and darker color, and the less relevant answers in a smaller font and lighter colors.


Figure 3.14 shows the user interface of the online LX-ListQuestion using the wordcloud visualization.

Figure 3.14: LX-ListQuestion - online version interface - wordcloud Answers.

3.7 Summary

In this Chapter we presented our approach to answering natural language questions that require a list of answers extracted from multiple Web documents. The design features and the tools that support the system were also presented. We explained the development of the LX-ListQuestion system, and the modules that compose it were described in detail. Finally, we presented the LX-ListQuestion online version, which presents the answers in two ways: a traditional list and a wordcloud.


4 Answering Temporal List Questions

"Time structures our world, and the questions we ask reflect that" (Ahn et al., 2006). Due to the continuous growth of the amount of information available, there is a need for automatic ways of searching for information. As selecting relevant information with temporal informa- tion is an important issue to the users, temporal information became an important research topic in NLP applications, such as Question Answering. Regarding Temporal QA, UzZaman et al. (2012) wrote “The Web-based question an- swering systems are not discussed, since they rely heavily on answer redundancy instead of temporal reasoning”. This statement shows that there is a lack of research that connects Web-Based and Temporal QA. In this dissertation we addressed this issue by joining the Web-based approach to collect all possible answers to the question with a shallow temporal processing approach to filter the answers based on the temporal restriction. When asking a question that refers directly to a temporal expression, the answers need to be validated against the temporal constraints. To achieve such functionality, LX-ListQuestion was extended to identify temporal expressions and rely on this temporal expression to iden- tify the answer satisfying the temporal restriction. Our approach is based on extending the system described in Chapter3. After finding all possible answers for the list question, it checks the temporal restriction in the same corpus retrieved from the Web by searching for temporal expressions in the free text. Only the answers that agree with the temporal restriction defined in the question will be selected to be presented to the user.


4.1 Outline

This chapter is organized as follows. Section 4.2 provides the background for Temporal Question Answering and presents the main concepts and challenges; we also describe how temporal questions are classified. Our approach is detailed in Section 4.3, where we present the design features of LX-ListQuestion and the questions expected as input, and explain how we solve the time-range issue of the temporal expression. In Section 4.4 we describe the architecture of the system in detail; all modules are described and examples are discussed. Finally, a summary of this chapter is presented in Section 4.6.

4.2 Background

Given a text in natural language, understanding the temporal information requires anchoring and ordering the events of the text in time (UzZaman et al., 2012). This task involves the extraction of temporal expressions (e.g., 1999, last year, 5 hours, today), events (e.g. said, arrived, won) and their temporal relations. The de facto standard for classifying temporal relations is based on the work of Allen (1983) which defines thirteen types of relations: before, after, overlap, overlappedBy, start, startedBy, finished, finishedBy, during, contains, meet, metBy and simultaneous. Table 4.1 shows these relations.

Relation       Symbol   Symbol for Inverse   Pictorial Example
X before Y     <        >                    XXX YYY
X equal Y      =        =                    XXX
                                             YYY
X meets Y      m        mi                   XXXYYY
X overlaps Y   o        oi                   XXX
                                               YYY
X during Y     d        di                    XXX
                                             YYYYYY
X starts Y     s        si                   XXX
                                             YYYYYY
X finishes Y   f        fi                     XXXX
                                             YYYYYY

Table 4.1: The relations defined by Allen(1983)


TimeML is a rich specification markup language used to annotate elements related to time (Saurí et al., 2006). According to Pustejovsky et al. (2003), TimeML integrates three efforts in the semantic annotation of text: it (i) systematically anchors event predicates to a broad range of temporally denoting expressions; (ii) provides a language for ordering event expressions in text relative to one another; and (iii) provides a semantics for underspecified temporal expressions, thereby allowing for a delayed interpretation. All elements related to time are annotated: time expressions (e.g. 1999, yesterday, January), signals (e.g. while, before, after), events (e.g. arrived, left, said) and temporal relations (as defined by Allen (1983)). Table 4.2 shows the TimeML elements and their tags.

Element             TimeML Tag
Time Expression     <TIMEX3>
Signal              <SIGNAL>
Event               <EVENT>
Temporal Relation   <TLINK>

Table 4.2: Tags of TimeML.

Figure 4.1 shows an example of TimeML annotation of the sentence: "John left 2 days before the attack".

John <EVENT>left</EVENT> <TIMEX3>2 days</TIMEX3> <SIGNAL>before</SIGNAL> the <EVENT>attack</EVENT>

Figure 4.1: Example of TimeML annotation.


4.2.1 Classification of Temporal Questions

Authors define temporal questions in different ways. In Radev and Sundheim (2002), temporal questions are classified into two classes:

(1) Explicit temporal questions: Questions that require a temporal expression as an answer.
a. When was Elvis Presley born?
b. At what time does the evening start?
c. In which century did Queen Elizabeth I reign?

(2) Implicit temporal questions: Questions that have a temporal expression as a restriction. The answer is not a temporal expression.
a. Who was the president of the US in 1990?
b. Did world steel output increase in the nineties?

This classification by Radev and Sundheim (2002) is oversimplified. It was made in 2002, when research on temporal questions was at its starting point. As research on this topic progressed, the classification of temporal questions became more fine-grained, as the 16 classes of Harabagiu and Bejan (2005)1 show:

1. Factoid Temporal: Questions that require a date as an answer. When did the Pope visit Poland?

2. Time range: Questions with a single event whose answer is related to time (but is not a date). How long did Iraq fight with Iran?

3. Relative time range: Questions with a temporal range as a constraint. The answer is not related to time. Where can I find research information in the Israeli Palestinian issues since 1991?

1The examples were collected from the original paper of Harabagiu and Bejan (2005).


4. Repetitive event: Questions that require a date as an answer, related to a recurring event. When does the temporao normally arrive in Brasil?

5. Typical event: Questions with a typical event (e.g. on average) whose answer is not a date. How long does it take on average to build a 500-room hotel in Las Vegas?

6. Time anchored event: Questions with a temporal restriction (not a time range) anchored to an event. What important things happened in the year 1987?

7. Events in time range: Questions with an event anchored to a temporal restriction. What did George Bush do after the U.N. Security Council ordered a global embargo on trade with Iraq in August 90?

8. Entities related to events/states changes: Questions where an entity (e.g. world oil prices) changes over an event or a state. What happened to world oil prices after the Iraq annexation of Kuwait?

9. Entity change in time period: Questions where an entity changes over a time period. I want to find pictures of presidents from the 1940-1949.

10. Quantity change in time period: Questions that require a quantity as an answer anchored to a temporal restriction. How much did Las Vegas grow in population since 1980?

11. Entity related to events at time stamp: Questions where the entity is related to an event anchored at a time stamp. Which two nations met in Washington on August 14, 1990 to discuss a naval blockade against Iraq?

12. Age at time stamp: Questions that require an age as an answer (type: “How old”) anchored to a temporal restriction. How old was Michael Milken in January 1989?


13. Comparative: Comparative question anchored to a temporal restriction. What is the difference between the teenager average weight today and in the 80’s?

14. Period of comparative/superlative attribute: Questions with a comparative or superlative attribute that require a date as an answer. When was the period of major growth in Las Vegas?

15. Alternative temporal: Alternative question with a temporal relation. Did John Sununu resign before or after George Bush’ ratings began to fail?

16. Temporal relation: Questions where the answer is anchored with a temporal relation. Where did Michael Milken work while attending graduate school?

The classification of Harabagiu and Bejan (2005) is very elaborate. Besides having 16 types, the criteria for deciding how to classify a question vary between being related to the question and being related to the answer, and some types could be merged because they are very similar. More recently, Saquete et al. (2009) defined two major classes of Temporal Questions, Simple and Complex, further divided into sub-classes as follows:

• Simple:

1. Questions that require a temporal expression as an answer and do not contain temporal expression. When did man arrive on the moon? 2. Questions that require a temporal reasoning of the temporal expression contained in the question. Who won the World Cup in 2010?

• Complex:

1. Questions with temporal expressions that contain more than one event related by a temporal signal (temporal signals are expressions like "before", "after", "between", "when"). What did George Bush do after the U.N. Security Council ordered a global embargo on trade with Iraq in August 90?


2. Questions without temporal expressions that contain more than one event related with a temporal signal. Who was the president of the US when the AARP was founded?

The classification of Saquete et al. (2009) is geared towards their own research, which focuses on Complex Temporal Questions.

4.2.2 Challenges in Temporal Question Answering

Temporal Question Answering raises some specific challenges that are not found in general QA. Pustejovsky et al. (2002) bring to light some challenges in Temporal Question Answering. The authors address four research challenges:

1. Time stamping of events — identifying an event and anchoring it in time.

2. Ordering events with respect to each other — relating more than one event in terms of precedence, overlap and inclusion.

3. Reasoning about the ramifications of an event — identifying changes by virtue of an event.

4. Reasoning about the persistence of an event — identifying how long an event or the outcome of an event persists.

In this dissertation we present an approach to tackle the first challenge, finding answers anchored in the time expressed in the question.

4.2.3 Related Work

The seminal work on Temporal Question Answering was related to the development of annotated corpora (Radev and Sundheim, 2002; Schilder and Habel, 2003; Ahn et al., 2006). The idea was to reason over the corpora to find answers related to time. The overview of the state-of-the-art shows a lot of effort on Complex Temporal Questions. We found no predominant approach, since each researcher chose a different path depending on the type of temporal questions. Harabagiu and Bejan (2005) developed a methodology that uses annotations produced by TimeML to generate a template to answer temporal questions. Schockaert et al. (2006) present an algorithm based on an algebra that deduces temporal knowledge to answer temporal questions. The work presented in Saquete et al. (2009) developed an approach based on decomposing the question into simple questions, according to the temporal relations expressed in the original question; at the end of processing, the answers are recomposed to find the correct answers. There are also some works on Temporal Question Answering using the Web as information source. Pașca (2008) answers only simple factoid temporal questions, extracting information from snippets and storing it in repositories. Tao et al. (2010) explore the semantic web using an OWL ontology to answer complex temporal questions; the system works over closed corpora in a specific domain, in this case a clinical corpus. The system developed by Yahya et al. (2013b) uses structured knowledge bases available on the Web to answer simple factoid and temporally restricted questions. See more about related work in Section 2.7 in Chapter 2.

4.3 Approach

In our research we did not find other systems that deal with temporal List questions. Besides tackling this under-researched topic, our approach is innovative in the way it joins the Web-based approach, which collects all possible answers to the question, with a shallow temporal processing approach that filters the answers based on the temporal restriction. According to Harabagiu and Bejan (2005), processing questions that involve a temporal restriction relies on (1) the recognition of events or entities that participate in them; (2) the relative ordering of events and entities in the corpus; and (3) the identification of temporal expressions and their relation with the expected answers. Following this perspective, LX-ListQuestion was extended in three main processing stages:

1. Question processing, for interpreting the question and identifying the temporal restriction. After identifying the temporal restriction, the system defines the boundaries of the time-range of the question.

2. Document processing, for selecting the relevant information from the Web corpus. This processing is very similar to the one described in Section 3.5.2.1.


3. Answer processing, for selecting the right answers to the question while respecting the temporal constraint in the question.

4.3.1 Design Features

LX-ListQuestion with special attention to temporal questions was built using the system described in Chapter 3 as its starting point. Recall the design features described in Section 3.4:

• Exploits redundancy to find all answers to the List question;

• Compiles and extracts the answers from multiple documents;

• Collects at run-time the documents from Web using a search engine;

• Provides answers in real time without resorting to previously stored information.

To these, we add two more:

• Transforms the temporal information contained in the question into a temporal constraint;

• Filters the answers using temporal constraints.

4.3.2 Expected Input

Our system expects as input questions in Portuguese that require a list of answers and contain a temporal restriction. The temporal restrictions the system handles are years and centuries. LX-ListQuestion answers questions that expect named entities as answers. Table 4.3 shows examples of temporal List questions that LX-ListQuestion is capable of answering.

Examples:
Quais eram os partidos políticos existentes antes de 1964?
Quais são os reis de Portugal entre 1500 e 1700?
Quem ganhou o premio nobel entre 1900 a 1920?
Quais são os livros da Danielle Steel na década de 80?
Que países boicotaram os Jogos Olímpicos de 1980?
Cite os países que disputaram território com o Brasil antes de 1900?

Table 4.3: Examples of expected input of LX-ListQuestion.


4.3.3 Classification of Temporal List Questions

Since none of the classifications mentioned in Section 4.2.1 includes categories specific to temporal List questions, we chose to adapt the classifications described by Radev and Sundheim (2002) and Saquete et al. (2009) to classify this type of question. We assume three types of temporal questions: simple, complex and temporally restricted.

• Simple: Explicit temporal question that requires a temporal expression as an answer.

In which years was the World Cup won?

• Complex: Questions with more than one event related by a temporal expression. The temporal expression establishes the order between the events in the question. The answer is not a temporal expression.

Which movies did Sam Raimi direct after Army of Darkness?

• Temporally Restricted: Questions that have a temporal expression as a restriction. The answer is not a temporal expression.

Which films did Steven Spielberg direct in the 80’s? Name all songs of Bruce Springsteen released between 1980 and 1990.

The focus of this dissertation is answering List questions of the Temporally Restricted type. Although, according to our classification, Simple temporal List questions are possible, we did not find this kind of question in our corpus: it appears more easily in Factoid questions (e.g. the "When" question type). We assume that this type of temporal question is rare in List questions and, as such, excluded it from our study. Regarding Complex temporal questions, the review of the state-of-the-art shows that it is necessary to have access to corpora annotated with temporal relations in order to work on this type of question. Given that such resources for Portuguese are still incipient, and to allow us to better focus our effort, we also exclude this type of question from our study.


4.3.4 Solving Time-Range

Answering Temporally Restricted questions is a non-trivial task. This topic was approached by Moldovan et al. (2005) and Schockaert et al. (2006). Based on an analysis of corpora, we identified three classes of temporal restriction: (1) temporal restriction in an absolute time, (2) temporal restriction with a relative reference, and (3) temporal restriction with a vague reference.

4.3.4.1 Temporal Restriction in an Absolute Time

Questions with a temporal restriction in an absolute time are very common. In terms of question processing, it is easy to identify this type of question: the system identifies the temporal expression in the question, and that expression itself is the temporal restriction. For absolute time, a temporal restriction can be single (e.g. a year) or multiple (e.g. a list of years). Table 4.4 shows examples.

Type       Question                                              Time Restriction
Single     O que tocava nas rádios no ano 2010?                  2010
Single     Quais foram os fatos importantes ocorridos            2000
           no ano 2000?
Multiple   Quais os movimentos trabalhistas que ocorreram        1980, 1990, 2000
           em 1980, 1990 e 2000?
Multiple   Que moeda era vigente no Brasil em 1967, 1970,        1967, 1970, 1986, 1989,
           1986, 1989, 1990, 1993 e 1994?                        1990, 1993, 1994

Table 4.4: Examples of questions with temporal restriction in an absolute time.

4.3.4.2 Temporal Restriction with a Relative Reference

Questions with a temporal restriction with a relative reference require careful processing. The system finds the relative references and solves the time-range by anchoring these temporal references in calendar years. The relative references can be located using the following closed set of expressions: a partir de, depois de, até, antes de, últimos, entre, anos, década and século (ENG: starting at, after, until, before, the last, between, years, decade and century). Table 4.5 shows examples, assuming 2014 as the current year.


Question                                                   Time Expression     Time-Range
Quais foram os conflitos que atingiram o Afeganistão       a partir de 1970    1970 - 2014
a partir de 1970?
Quais os países europeus surgidos depois de 1989?          depois de 1989      1989 - 2014
Cite filmes de comédia até 2005                            até 2005            [...] - 2005
Quais eram os títulos do Corínthias antes de 2000?         antes de 2000       [...] - 2000
Quais foram as novelas brasileiras dos últimos 5 anos?     últimos 5 anos      2009 - 2014
Quem ganhou o premio nobel entre 1900 a 1920?              entre 1900 e 1920   1900 - 1920
Liste as bandas famosas dos anos 80.                       anos 80             1980 - 1989
Quais são os clubes campeões mundiais na década de 1950?   década de 1950      1950 - 1959
Liste as empresas fundadas no século XX                    século XX           1901 - 2000

Table 4.5: Examples of questions with temporal restriction with a relative reference

4.3.4.3 Temporal Restriction with a Vague Reference

Temporal reasoning with a vague reference is further complicated by the fact that the time span cannot be accurately captured by an interval with well-defined boundaries (Schockaert et al., 2006). In our approach, when the system identifies a vague temporal expression, this expression is used as a keyword. The list of vague temporal expressions was compiled on the basis of an analysis of a corpus of questions. The expressions are: período, era, época and tempo. Examples are shown in Table 4.6.

Question                                                             Time Expression
Liste as grandes cidades do período romano.                          período romano
Apresente três características dos jogos olímpicos da era moderna.   era moderna
Cite os grandes portos da época dos Grandes Descobrimentos.          época dos Grandes Descobrimentos
Qual o nome do instrumento usado para medir o tempo na era antiga?   era antiga

Table 4.6: Examples of questions with temporal restriction with a vague reference

4.4 Architecture

The previous Sections described the background on Temporal Question Answering and discussed our approach, which uses shallow temporal processing to answer temporal List questions. This Section describes all the changes implemented in LX-ListQuestion to handle this type of question; the changes implemented in each module of the system are described in detail.


Figure 4.2 summarizes the process that joins redundancy and shallow temporal processing.

Figure 4.2: Summary of the processing of temporal list questions.

4.4.1 Question Processing Module

The Question Processing Module underwent most of the changes. Besides all the procedures already described in Section 3.5.1, the module is also responsible for: (i) identifying the temporal expression and (ii) defining the temporal boundaries (if necessary) to solve the time-range.


4.4.1.1 Identifying the Temporal Expression

The process of identifying the temporal expression is based on hand-built patterns.

Temporal restriction in an absolute time

Table 4.7 shows the patterns built for the identification of a temporal restriction in an absolute time. Note that for a temporal restriction in an absolute time, the time expression and the time restriction are the same.

Pattern         Question                                Temporal Restriction
(ANO)           Liste os filmes de 2012.                2012
(ANO), (ANO)*   Liste os filmes de 1989, 1990 e 1994.   1989, 1990, 1994

Table 4.7: Examples of patterns to identify temporal restriction in an absolute time

Temporal restriction with a relative reference

Table 4.8 shows the patterns built to identify a temporal restriction with a relative reference. Besides being identified, temporal expressions are normalized into temporal boundaries: Upper and Lower.

Pattern                                               Question                                              Lower                Upper
anos (Numeral)                                        Liste os filmes dos anos 90                           1990                 1999
década (PREP) (Numeral)                               Liste os filmes da década de 90                       1990                 1999
século (Letras/Numeral)                               Liste as empresas fundadas no século XX               1901                 2000
século (Letras/Numeral)                               Liste as empresas fundadas no século vinte            1901                 2000
a partir (PREP) (ANO)                                 Liste os filmes a partir de 1990                      1990                 (Current Year)
a partir do século (Letras/Numeral)                   Liste as empresas fundadas a partir do século XX      1901                 (Current Year)
depois (PREP) (ANO)                                   Liste os filmes depois de 1990                        1990                 (Current Year)
depois do século (Letras/Numeral)                     Liste as empresas fundadas depois do século XX        1901                 (Current Year)
até (ANO)                                             Liste os filmes até 1990                              [...]                1990
até o século (Letras/Numeral)                         Liste as empresas fundadas até o século XX            [...]                2000
antes (PREP) (ANO)                                    Liste os filmes antes de 1990                         [...]                1990
antes do século (Letras/Numeral)                      Liste as empresas fundadas antes do século XX         [...]                1900
entre (ANO) e (ANO)                                   Liste os filmes entre 1970 e 1990                     1970                 1990
entre os séculos (Letras/Numeral) e (Letras/Numeral)  Liste as empresas fundadas nos séculos XIX e XX       1801                 2000
últimos (Numeral) anos                                Liste os filmes dos últimos 10 anos                   (Current Year) -10   (Current Year)

Table 4.8: Examples of patterns to identify temporal restriction with a relative reference


Temporal restriction with a vague reference

Table 4.9 shows the patterns built to identify a temporal restriction with a vague reference. For this type of restriction, when a temporal expression is identified, LX-ListQuestion performs normal processing using the temporal expression as a keyword. Details are described in Chapter 3.

Pattern          Question
(PREP) período   Liste as grandes cidades do período romano.
(PREP) era       Liste os dinossauros da era paleolítica.
(PREP) época     Liste os grandes portos da época dos Grandes Descobrimentos.
(PREP) tempo     Liste os Quilombos no tempo da escravatura.

Table 4.9: Examples of temporal questions without explicit datetime mark.

4.4.1.2 Defining the Temporal Boundaries

The algorithm that processes the temporal expression and defines the temporal boundaries (Upper and Lower) is only applied to questions with a relative temporal reference. To do this, we use a mapping of the expression to a time interval. Table 4.10 shows an example:

Expression         Time Range      Expression           Time Range
anos 20            1920 - 1929     século I             1 - 100
anos vinte                         século um
anos 30            1930 - 1939     século II            101 - 200
anos trinta                        século dois
...                                ...
anos 2000          2000 - 2009     século XXI           2001 - 2100
anos dois mil                      século vinte e um

Table 4.10: Examples of mapping of the expression to a time interval.
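A minimal sketch of this mapping in Python (the function names are ours; the actual normalization covers all the patterns of Table 4.8):

    import datetime

    CURRENT_YEAR = datetime.date.today().year  # 2014 at the time of the experiments

    def decade_boundaries(decade):
        # "anos 80" / "década de 80" -> (1980, 1989); two-digit decades are
        # assumed to refer to the twentieth century.
        start = 1900 + decade if decade < 100 else decade
        return (start, start + 9)

    def century_boundaries(century):
        # "século XX" -> (1901, 2000): century n spans (n-1)*100+1 to n*100.
        return ((century - 1) * 100 + 1, century * 100)

    def last_n_years(n):
        # "últimos 5 anos" -> (2009, 2014) when the current year is 2014.
        return (CURRENT_YEAR - n, CURRENT_YEAR)

    print(decade_boundaries(80))   # (1980, 1989)
    print(century_boundaries(20))  # (1901, 2000)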


4.4.1.3 Generating the Query

The query generation was already explained in Section 3.5.1.2. As mentioned before, the query is composed of only certain part-of-speech categories: proper names, common nouns, verbs and adjectives. Articles, prepositions and pronouns are discarded. Specifically for Temporal Questions, the system omits the temporal expression when composing the query. Table 4.11 shows some examples of Temporal List Questions and the respective query generated by our system:

Temporal List Question                                      Query
Quais são os livros da Danielle Steel na década de 80?      livros Danielle Steel
Quais são os filmes de comédia de 2011 e 2012?              filmes comédia
Quais são os reis de Portugal entre 1500 e 1700?            reis Portugal
Nomeie artistas contemporaneos depois de 1950?              artistas contemporaneos

Table 4.11: Examples of questions and query
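The following sketch illustrates this filtering step. It assumes the question has already been POS-tagged (the tag names below are illustrative, not necessarily those of the underlying tagger) and that the temporal expression was identified by the previous step; treating copular verb forms as stopwords is also our assumption.

    CONTENT_TAGS = {"PNM", "CN", "V", "ADJ"}   # proper name, common noun, verb, adjective
    STOP_VERBS = {"é", "são", "foi", "foram"}  # assumption: copular forms carry no content

    def generate_query(tagged_question, temporal_expression):
        # Keep only content words; drop the tokens of the temporal expression.
        temporal_tokens = set(temporal_expression.split())
        return " ".join(token for token, tag in tagged_question
                        if tag in CONTENT_TAGS
                        and token not in STOP_VERBS
                        and token not in temporal_tokens)

    tagged = [("Quais", "INT"), ("são", "V"), ("os", "DA"), ("livros", "CN"),
              ("da", "PREP"), ("Danielle", "PNM"), ("Steel", "PNM"),
              ("na", "PREP"), ("década", "CN"), ("de", "PREP"), ("80", "DGT")]
    print(generate_query(tagged, "década de 80"))  # livros Danielle Steel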

4.4.2 Document Processing Module

The document processing module is responsible for collecting the documents from the Web using the query generated by the previous module. The process is the same as described in Section 3.5.2.1. After collecting the documents, the relevant sentences are set apart for identifying the candidate answers. In this stage, the system selects the relevant information from the documents according to its relevance to the question: all relevant sentences are set apart (Figure 4.3).

Figure 4.3: Document analysis for temporal list questions.


4.4.3 Answer Processing Module

The Answer Processing Module aims at identifying and extracting relevant candidates and building the final list of answers. This module was extended with a module for Temporal Processing that is responsible for selecting the correct answers using the temporal constraints as a filter. The process of extracting candidate answers and building the list of answers is explained below.

Extracting Candidate Answers

In this step it is important to gather the maximum number of candidates before the system starts processing. The candidates are extracted from the sentences previously processed by the analysis of the retrieved documents. The process of extracting candidates is the same as previously explained in Section 3.5.3.1. Examples of extracted candidates are highlighted in the sentences in Figure 4.4.

Relevant Sentences:

Os livros de sucesso da Daniele Steel são: Daddy, 5 dias em Paris e Amor sem Igual.
Daniele Steel lança o novo romance Sisters.
A escritora Daniele Steel, autora do livro Zoya, lança hoje seu novo sucesso.
O apelo de amor foi o primeiro sucesso da Daniele Steel.
Na década de 80 uns dos mais famosos livros da Daniele foi Zoya.
Mais um grande livro da escritora Daniele Steel, O anel de noivado chega hoje a venda.
Casa forte, um dos livros mais vendidos desta década conta uma história fascinante.
A autora do livro Álbum de Familia, Danielle Steel, autografa hoje seu livro.

Figure 4.4: Extracting candidate answers for temporal list question.

Building the Answer List with Shallow Temporal Processing

Shallow Temporal Processing is responsible for building the answer list to Temporal List questions. This process goes through two steps:

• The first step is to identify all the temporal expressions that appear in the same sentence as each candidate. For each candidate, we extract from the full set of documents all temporal expressions that co-occur in the same sentence with that candidate. Figure 4.5 shows an example:


Figure 4.5: Finding co-occurring temporal expressions.

• The second step enforces the temporal restriction that appears in the question. For candidates with co-occurring temporal expressions, each expression is checked against the temporal boundaries. If any of the co-occurring temporal expressions lies within the boundaries, the candidate is classified as ACCEPTED and enters the final list of answers. Otherwise, the candidate is classified as REJECTED. Table 4.12 shows examples of candidates with the temporal expressions found in the corpus.

Candidate            Temporal Expression found in the document corpus    ACCEPTED or REJECTED
Daddy                1989                                                ACCEPTED
5 dias em Paris      [no temporal expression found]                      REJECTED
Amor sem igual       1991                                                REJECTED
Sister               [no temporal expression found]                      REJECTED
Zoya                 1989                                                ACCEPTED
O apelo do amor      1973, 1977, 1990                                    REJECTED
O anel de noivado    [no temporal expression found]                      REJECTED
Casa forte           1983                                                ACCEPTED
Álbum de família     1985                                                ACCEPTED

Table 4.12: Building the answer list for temporal list questions.

Following the information in Table 4.12, the candidates “Daddy”, “Zoya”, “Casa forte” and “Álbum de família” were ACCEPTED, since at least one temporal expression lies within the established time boundaries. The remaining candidates were REJECTED for different reasons: “5 dias em Paris”, “Sister” and “O anel de noivado” were REJECTED since no temporal expression was found associated with them; “Amor sem igual” and “O apelo do amor” were REJECTED since all their co-occurring temporal expressions are outside the boundaries of the temporal restriction given by the question.
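A compact sketch of this second step, reproducing the decision made over the candidates of Table 4.12 (the data structure is ours; we assume the co-occurring temporal expressions have already been normalized to years):

    def filter_by_temporal_restriction(candidates, lower, upper):
        # candidates maps each answer to the years that co-occur with it in
        # the retrieved sentences; a candidate is ACCEPTED if at least one
        # co-occurring year lies within [lower, upper].
        return [answer for answer, years in candidates.items()
                if any(lower <= year <= upper for year in years)]

    # The example of Table 4.12, for the restriction "década de 80" (1980-1989):
    candidates = {
        "Daddy": [1989], "5 dias em Paris": [], "Amor sem igual": [1991],
        "Sister": [], "Zoya": [1989], "O apelo do amor": [1973, 1977, 1990],
        "O anel de noivado": [], "Casa forte": [1983], "Álbum de família": [1985],
    }
    print(filter_by_temporal_restriction(candidates, 1980, 1989))
    # ['Daddy', 'Zoya', 'Casa forte', 'Álbum de família']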

4.5 LX-ListQuestion: QA system to List Question with Tem- poral Restrictors

The temporal processing module was integrated into the Web-based QA system, enabling it to also answer questions with temporal restrictors. The architecture and user interface of the system were already presented in Section 3.6. Note that candidate relevance is not taken into account by the temporal processing module. As such, the wordcloud view shows all answers using the same font size, while the list view, instead of ordering answers by relevance, orders them chronologically. An example screenshot may be seen in Figure 4.6.

Figure 4.6: LX-Listquestion online QA system - list question with temporal restrictors


4.6 Summary

This chapter had two main goals. The first one was to provide a review of the current state-of-the-art on Temporal Question Answering and to situate our system within the field. Based on what we have gathered, we can say that QA for Temporal List Questions is an under-researched topic. In this regard, the current work addresses a topic that was lacking specific research. The second goal was to present our innovative approach that combines redundancy and shallow temporal processing. The process is executed in two steps. In the first step, redundancy is used to collect all potential answers to the List Question. In the second step, shallow temporal processing is responsible for verifying whether each potential answer satisfies the temporal restriction given by the question. The system not only answers questions with an absolute time restriction (e.g. “Que países boicotaram os Jogos Olímpicos de 1980?” EN: “Which countries boycotted the Olympic Games in 1980?”), but also answers questions with a relative time restriction (e.g. “Cite os livros da Daniele Steel da década de 80.” EN: “Name all books by Danielle Steel from the 80s.”) using an algorithm that enforces the time-range boundaries. The result of this is the extension of the LX-ListQuestion Web-based QA system for List Questions described in Chapter 3 with functionality that enables it to answer List questions with a temporal restriction.


5 Evaluation

The evaluation of a Question Answering system is a challenging task. It requires a dataset composed of questions and their respective answers. In this Chapter, we use different question datasets to test our system in different scenarios. For instance, the Págico question dataset provides questions focusing on issues related to Portuguese culture, while the QALD question dataset provides a set of questions with an international scope. LX-ListQuestion is a Web-based QA System that focuses on answering list questions whose answers are extracted and composed from several documents retrieved from the Web, as already described in Chapter 3. The first part of this Chapter is dedicated to evaluating and discussing the results obtained by our system. In order to assess the positioning of our system in the state-of-the-art, we compare LX-ListQuestion with four other QA systems. For the comparison, the results were analyzed in two ways: (i) the quantitative evaluation of answers provides recall, precision and F-measure; and (ii) the question coverage indicates the usefulness of the system to the user by counting the number of questions for which the system provides at least one correct answer. In the second part of this Chapter, we consider the Temporal Processing Module connected to LX-ListQuestion. Our approach to answering Temporal List Questions is based on the extended LX-ListQuestion system described in Chapter 4. After finding all possible answers to the list question, it checks the temporal restriction over the same corpus retrieved from the Web by searching for temporal expressions in the free text. In addition to the evaluation of answering Temporal List Questions, we also compare the performance of our system with other QA systems.


5.1 Outline

Section 5.2 will explain our algorithm of automatic evaluation, which facilitates the evaluation task. To perform the experiments in this Chapter, we used the two Question Datasets described in Section 5.3. Section 5.4 presents the evaluation of LX-ListQuestion in different ways. First we present all correct answers found in corpora of different sizes. Afterwards, we test the system using different threshold parameters. We also present the evaluation of LX-ListQuestion using four different setups in order to test the efficiency of our approach. Section 5.5 compares the results against four different QA Systems: RapPortagico, an off-line QA system; XisQuê, a Web-based QA system for Portuguese; START, another Web-based QA system, for English; and WolframAlpha, a knowledge engine. Each system is presented and its design features are compared with the design features of LX-ListQuestion. In order to assess the positioning of our system in the state-of-the-art, our evaluation has two components: the quantitative evaluation of answers and the question coverage evaluation. The Temporal Processing module of LX-ListQuestion is evaluated in Section 5.6. We test our approach by applying different temporal restrictors to the same base question. Afterwards, we compare the results for Temporal List Questions against RapPortagico and XisQuê. Finally, Section 5.7 concludes with a summary and some final remarks.

5.2 Automatic Evaluation

Most of our experiments were based on the Question Dataset of the Págico Competition. This competition provided the list of answers as a list of links to Wikipedia web pages. Figure 5.1 shows an excerpt of the correct answers to question Pagico_004 - Mulheres violoncelistas de língua portuguesa.

...
Pagico_004 pt/c/a/r/Carmen_Monarcha.19d49b.xml
Pagico_004 pt/d/e/n/Denise_Emmer.6b270a.xml
Pagico_004 pt/g/u/i/Guilhermina_Suggia.1a7652.xml
...

Figure 5.1: Original answers given by Págico competition.


...
Pagico_004 Carmen Monarcha
Pagico_004 Denise Emmer
Pagico_004 Guilhermina Suggia
...

Figure 5.2: Answers given by Págico competition after cleaning.

LX-ListQuestion seeks answers within documents retrieved from the Web. To undertake its assessment, we assume that the name of the web page is the correct answer to the question. We manually clean the answers as shown in Figure 5.2. After cleaning the answers in the original file given by the Págico Competition, we are left only with the answers in text format. To facilitate this task we need to compare the reference list of answers with the list given by the system in an automatic way. Note that there are several cases where the system answers can be different from the ones in the reference list and yet be correct, due to many factors like spelling differences, omission of parts of proper names, abbreviations and others. Figure 5.3 shows some examples of answers that are different but still correct.

Answer (A)                                   Answer (B)
Pisa                                         Província de Pisa
Etiópia                                      Ethiópia
Keith Charles Flint                          Keith Flint
Nossa Senhora do Rosário de São Bendito      N. S. do Rosário de São Benedito

Figure 5.3: Variation in correct answers

Considering these cases, it is necessary to create an algorithm that, instead of strict string matching, uses a more relaxed process to assess the correctness of the answers. We implemented an algorithm of automatic evaluation based on the overlap of common words and on the Levenshtein Distance1. Each word in the candidate answer is compared with the reference answer from Págico. If the Levenshtein Distance is less than 3, the words are considered similar enough to match. The candidate answer receives a score based on the number of common words over the set of words in the reference and system answers. Our automatic evaluation allows marking the answers with a certain degree of certainty. Figure 5.4 shows some examples of the automatic evaluation.

1The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.


Correct Answer                                           Answer by LX-ListQuestion             Score   Accept
ilha mocambique                                          ilha mozambique                       1.0     YES
malanje                                                  malange                               1.0     YES
baltasar lopes silva                                     baltasar lopes                        0.67    YES
sao paulo                                                cidade sao paulo                      0.67    YES
igreja nossa senhora rosario sao benedito rio janeiro    nossa senhora rosario sao benedito    0.62    YES
praia cabedelo viana castelo                             praia cabedelo                        0.50    YES
manuel novas                                             manuel ferreira                       0.33    NO
bazaruto                                                 hotel pestana bazaruto lodge          0.33    NO
luena angola                                             angola portal                         0.33    NO
joao branco nuncio                                       francisco nuncio                      0.25    NO
sao joao estoril                                         monte estoril                         0.25    NO
igreja nossa senhora rosario sao benedito rio janeiro    capela nossa senhora                  0.22    NO

Figure 5.4: Relaxed candidate matching in the automatic evaluation.
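A minimal sketch of this relaxed matching follows (our reconstruction from the description above; the acceptance cut-off of 0.5 is inferred from Figure 5.4 and is therefore an assumption):

    def levenshtein(a, b):
        # Minimum number of single-character edits turning a into b.
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                  # deletion
                                   current[j - 1] + 1,               # insertion
                                   previous[j - 1] + (ca != cb)))    # substitution
            previous = current
        return previous[-1]

    def answer_score(reference, candidate):
        # Fraction of matched words over the union of the words of both
        # answers; two words match when their Levenshtein distance is < 3.
        ref_words, unmatched = reference.split(), candidate.split()
        matches = 0
        for r in ref_words:
            for c in unmatched:
                if levenshtein(r, c) < 3:
                    unmatched.remove(c)
                    matches += 1
                    break
        return matches / (len(ref_words) + len(candidate.split()) - matches)

    print(round(answer_score("baltasar lopes silva", "baltasar lopes"), 2))  # 0.67
    print(round(answer_score("manuel novas", "manuel ferreira"), 2))         # 0.33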

5.3 Question Dataset

Two Question Datasets were used to perform the experiments in this Chapter:

Question Dataset of the Págico Competition. The whole dataset is composed of 150 questions about Lusophony extracted from the Portuguese Wikipedia1. The questions are about Geography, History, Politics, Science and others. The main criterion when the corpus was built was to ensure that the search for the answers is non-trivial, and that the answers are spread throughout multiple documents (Freitas, 2012). For the experiments, we use a subset of 30 questions whose expected answer type is Person or Location. We pick these two types since they are the ones most accurately assigned by the underlying LX-NER tool. Note, however, that our approach is not intrinsically limited to only these types. The subset of questions used in the experiments appears in detail in Table 5.1. All questions require a list of answers, amounting to 340 answers in total and, on average, around 11 answers per question.

Question Dataset of QALD. The whole dataset2 is composed of 200 questions in English, covering Factoid, Definition, Boolean and List Questions. We manually annotated each question with the expected answer type. From these, we randomly selected 10 questions of List type, which were translated into Portuguese, to compose the subset used in the experiments, as detailed in Table 5.2.

1All information can be found in http://www.linguateca.pt/Cartola/ - last access on December 1, 2014.
2QALD is a series of evaluation campaigns on multilingual question answering over linked data, currently part of the Question Answering lab at CLEF. All information can be found in http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/ - last access on December 1, 2014.


Pagico_004  Mulheres violoncelistas de língua portuguesa. (EN: Portuguese-language female cellists.)
Pagico_053  Parques do Rio de Janeiro que têm cachoeiras. (EN: Parks with waterfalls in Rio de Janeiro.)
Pagico_054  Igrejas do Rio de Janeiro construídas por irmandades ou confrarias de negros. (EN: Churches in Rio de Janeiro built by black religious brotherhoods or fraternities.)
Pagico_058  Países que venceram a Copa do Mundo em uma disputa de pênaltis. (EN: Countries that won the World Cup by penalty shootouts.)
Pagico_059  Jogadores de basquetebol brasileiros que jogam ou jogaram em campeonatos da NBA. (EN: Brazilian basketball players that play or have played in the NBA.)
Pagico_062  Praias de Portugal boas para a prática de surf. (EN: Good Portuguese beaches for surfing.)
Pagico_063  Estudiosos da música indígena brasileira. (EN: Scholars of indigenous Brazilian music.)
Pagico_085  Destinos turísticos do Brasil cuja temperatura no Inverno pode ser negativa. (EN: Brazilian tourist destinations where the winter temperature can be negative.)
Pagico_086  Compositoras brasileiras de samba. (EN: Brazilian samba songwriters.)
Pagico_088  Cidades portuguesas que têm festivais medievais. (EN: Portuguese cities that have medieval festivals.)
Pagico_091  Estados fronteiriços de Moçambique. (EN: Mozambican border-states.)
Pagico_092  Cidades que fizeram parte do domínio português na India. (EN: Cities that were part of the Portuguese Empire in India.)
Pagico_094  Parques nacionais de Moçambique. (EN: Mozambican national parks.)
Pagico_097  Escritores cabo-verdianos com obra publicada em crioulo. (EN: Cape Verdean writers with published work in creole.)
Pagico_100  Ilhas de Moçambique. (EN: Mozambican islands.)
Pagico_104  Pesquisadores do folclore brasileiro. (EN: Brazilian folklore researchers.)
Pagico_106  Vice-reis da India Portuguesa. (EN: Viceroys of Portuguese India.)
Pagico_108  Jogadores de futebol nascidos em Cabo Verde que representaram a seleção portuguesa. (EN: Football players born in Cape Verde who have represented the Portuguese national team.)
Pagico_109  Candidatos a alguma das eleições presidenciais na Guiné-Bissau. (EN: Candidates for any presidential elections in Guinea-Bissau.)
Pagico_111  Padres católicos que estão ou estiveram ativos em Timor. (EN: Catholic priests who are or were active in Timor.)
Pagico_112  Capitais das províncias de Angola. (EN: The capitals of Angolan provinces.)
Pagico_116  Escritores lusófonos que passaram temporadas na prisão. (EN: Lusophone writers who spent time in prison.)


Pagico_118  Escritores moçambicanos que receberam o Prémio Camões. (EN: Mozambican writers who have received the Camões Prize.)
Pagico_124  Cabo-verdianos que participaram na guerra colonial na Guiné. (EN: Cape Verdeans who participated in the colonial war in Guinea.)
Pagico_128  Escritores portugueses que tenham vivido em Macau. (EN: Portuguese writers who have lived in Macau.)
Pagico_132  Deputados da FRELIMO. (EN: FRELIMO's deputies.)
Pagico_133  Futebolistas do Petro de Luanda. (EN: Petro de Luanda players.)
Pagico_140  Cidades lusófonas conhecidas pelo seu Carnaval. (EN: Lusophone cities known for their carnival celebrations.)
Pagico_149  Arquitetos de países lusófonos com obras em países estrangeiros na América do Norte e na Europa. (EN: Architects from lusophone countries with works in foreign countries in North America and Europe.)
Pagico_153  Toureiros a cavalo de países lusófonos com carreira internacional. (EN: Internationally-known bullfighters on horseback from lusophone countries.)

Table 5.1: Subset of question dataset - Págico Competition

QALD_010  In which country does the Nile start? (PT: Em qual país começa o Rio Nilo?)
QALD_028  Give me all communist countries. (PT: Liste todos paises comunistas.)
QALD_032  Which countries adopted the Euro? (PT: Quais são os paises que adotaram o euro?)
QALD_036  Through which countries does the Yenisei river flow? (PT: Por quais paises o rio Yenisei corre?)
QALD_062  Who created Wikipedia? (PT: Quais são os criadores da Wikipedia?)
QALD_074  Which capitals in Europe were host cities of the summer Olympic games? (PT: Quais as capitais na Europa que hospedaram o Jogos Olímpicos de Verão?)
QALD_114  Give me all members of Prodigy. (PT: Cite todos membros do Prodigy.)
QALD_141  Who founded Intel? (PT: Quais são os fundadores da Intel?)
QALD_155  Which Greek goddesses dwelt on Mount Olympus? (PT: Quais Deusas gregas moravam no Monte Olimpo?)
QALD_176  List the children of Margaret Thatcher. (PT: Liste os filhos de Margaret Thatcher.)

Table 5.2: Subset of question dataset - QALD


5.4 List Questions: Evaluation

In this section we evaluate the LX-ListQuestion system in different ways. We begin by assessing the impact of the number of retrieved documents on recall. Following this, we test how different values for the threshold parameters affect the filtering of candidates. The evaluation proceeds by testing different setups for the system.

5.4.1 The role of Document Retrieval

In this Section we assess the role of the Document Retrieval module. Our goal is to verify how the number of correct answers varies when the amount of documents retrieved from the Web is changed.

Experimental Setup: The question dataset used was presented in Table 5.1. We evaluate a set of List Questions using 5, 10, 15 and 20 documents retrieved from the Web to build a document corpus. The system picks the relevant sentences and extracts all candidates without applying any threshold.

Figure 5.5 illustrates the number of distinct correct answers found when the number of documents retrieved from the Web is changed.

Figure 5.5: Correct answers found on the corpus.


The total number of correct answers found is 59 (when 5 documents are retrieved), 72 (for 10 documents retrieved), 93 (for 15 documents retrieved) and 120 (for 20 documents retrieved). When we increase the number of documents, the number of correct answers increases as well. However, this increase in the number of available correct answers comes with an increase in the number of candidates, which brings noise into the process.
Table 5.3 shows the number of correct answers and the total number of candidates for each corpus size. Note that, regardless of the number of retrieved documents, less than 2% of the candidates found in the corpus are correct answers. Given that none of the scenarios stands out as being less noisy than the others, we opt for setting the number of retrieved documents at 10, since this value is commonly pointed out as providing the best results (Dumais et al., 2002).

                   5 documents   10 documents   15 documents   20 documents
#CorrectAnswers    59            72             93             120
#Candidates        8138          12369          17038          23758

Table 5.3: Corpus composition

5.4.2 Evaluation using Different Threshold Parameters

Our Bipartite List Approach exploits redundancy to find all answers to List questions, and uses their frequency as a factor to select the correct answers. We built two lists with the candidates: the Premium List (PL) and the Work List (WL). In each list, the candidates that appear repeated are grouped together and their frequency is calculated. Following this, two frequency thresholds are calculated. A more relaxed threshold, named TP, is used to filter the candidates of the Premium List and a more stringent one, named TW, to filter the candidates of the Work List. These thresholds are bounded by different parameters (for full detail see Section 3.5.3.2). In this Section we aim to verify how many correct answers LX-ListQuestion finds using different values for the threshold parameters. Experimental Setup: The question dataset used was presented in Table 5.1. The Document Corpus setup is the one that uses 10 retrieved documents, as determined by the previous experiment. Given that LX-ListQuestion is a Web-based system, we chose to freeze the document corpus to ensure the repeatability of the experiments. The system uses two different thresholds, TP and TW. Both thresholds are parameterizable. In this experiment we test how the F-measure is affected by different parameters for TP and TW.


        TW(1)  TW(2)  TW(3)  TW(4)  TW(5)  TW(6)  TW(7)  TW(8)  TW(9)  TW(10)
TP(1)   0.091  0.092  0.085  0.083  0.089  0.093  0.091  0.096  0.093  0.086
TP(2)   0.087  0.088  0.082  0.080  0.086  0.090  0.088  0.090  0.090  0.084
TP(3)   0.095  0.096  0.089  0.088  0.093  0.096  0.094  0.096  0.096  0.090
TP(4)   0.089  0.090  0.084  0.083  0.088  0.092  0.089  0.101  0.101  0.086
TP(5)   0.097  0.098  0.092  0.090  0.093  0.096  0.094  0.096  0.096  0.090
TP(6)   0.099  0.097  0.091  0.090  0.092  0.096  0.093  0.095  0.095  0.090
TP(7)   0.103  0.101  0.095  0.093  0.096  0.099  0.096  0.098  0.098  0.093
TP(8)   0.103  0.101  0.095  0.093  0.096  0.099  0.096  0.098  0.098  0.092
TP(9)   0.103  0.101  0.095  0.093  0.096  0.099  0.096  0.098  0.098  0.092
TP(10)  0.100  0.098  0.093  0.091  0.094  0.097  0.094  0.096  0.096  0.091

Table 5.4: Evaluation using different threshold parameters.

In this experiment we vary the parameters used by TP and TW between 1 and 10 to find the values that give the best F-measure. Based on the results obtained in this experiment, the best parameter for TP is 7 and the best parameter for TW is 1. This threshold parameterization is used in all experiments reported in the following sections.
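To make the role of the two thresholds concrete, here is a hedged sketch of the Bipartite List filtering. The actual threshold computation is defined in Section 3.5.3.2; here we merely assume that each list is cut at its highest candidate frequency divided by the parameter, so that a larger parameter yields a more relaxed cut (TP=7, relaxed, for the Premium List; TW=1, stringent, for the Work List).

    from collections import Counter

    def select_answers(premium_list, work_list, tp=7.0, tw=1.0):
        # Assumption: cut-off = highest frequency in the list / parameter, so
        # TP=7 keeps most Premium List candidates while TW=1 keeps only the
        # most frequent Work List candidates.
        final = set()
        for candidates, parameter in ((premium_list, tp), (work_list, tw)):
            counts = Counter(candidates)
            if counts:
                cutoff = max(counts.values()) / parameter
                final |= {c for c, f in counts.items() if f >= cutoff}
        return final

    premium = ["Zoya", "Zoya", "Daddy", "Sisters"]
    work = ["Zoya", "Casa forte", "Casa forte", "Casa forte"]
    print(sorted(select_answers(premium, work)))
    # ['Casa forte', 'Daddy', 'Sisters', 'Zoya']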

5.4.3 Evaluation with Different Setups

In this Section we aim to verify the efficiency of the main approach of LX-ListQuestion. We obtain evaluation results using a different setup for each Run, as we explain below.

Experimental Setup: The question dataset used was presented in Table 5.1. The Docu- ments Corpus setup is the same used in the previous experiment (10 retrieved docu- ments). For this experiment we use the following four system setups:

• Run_1: The final list of answers is composed of all elements found in the sentences classified as High Relevance (Premium List). No threshold is applied.

• Run_2: The final list of answers is composed of all elements found in sentences classified as High or Medium Relevance (Premium and Work List). No threshold is applied.

• Run_3: The final list of answers is composed of all elements found in sentences classified as High or Medium Relevance (Premium and Work List). The TW threshold is applied to both lists.

• Run_4: The final list of answers is composed of all elements found in sentences classified as High or Medium Relevance (Premium and Work List) and two thresholds are applied (TP and TW). That is, the full LX-ListQuestion system as described in Section 3.5.

The results of this experiment are summarized in Table 5.5.

Experiment   Reference Answer List   Correct Answers   All Answers Retrieved   Recall   Precision   F-Measure
Run_1        340                     37                1225                    0.108    0.030       0.047
Run_2        340                     72                2369                    0.211    0.030       0.053
Run_3        340                     20                146                     0.058    0.136       0.082
Run_4        340                     41                460                     0.120    0.089       0.102

Table 5.5: Evaluation of the LX-ListQuestion system

Run_1 is our starting point to demonstrate how our approach works. In this experiment we verify how many correct answers appear in our Premium List (composed of the candidates extracted from the sentences classified as High Relevance). We found 37 correct answers among 1225 candidates. In Run_2 we seek to increase recall by joining the Premium List and the Work List (composed of the candidates extracted from the sentences classified as High or Medium Relevance). The system achieves 0.21 recall in Run_2, against 0.10 in Run_1. This shows that the Premium List and Work List together have twice the number of correct answers (the Premium List has 37, while the Premium List and Work List together have 72). However, the precision is low, 0.03 in both Runs. In Run_3 we seek to filter the candidate answers to the fullest by applying a threshold, specifically the Threshold Work List (TW1), as presented in Section 3.5.3.2. The goal of this run is to increase precision. The precision score increased from 0.03 (Run_2) to 0.13 (Run_3), with a corresponding increase in F-measure from 0.05 to 0.08. Unfortunately, recall decreased. In this run we learned that filters are needed; however, a single filter is very rigid and discards too many correct answers.

Our goal in Run_4 is to increase F-measure by applying the TP7 and TW1 thresholds to the selected candidates in the Premium and Work Lists. As a result, the system doubled the number of correct answers from 20 to 41, while precision decreased from 0.136 to 0.089. More importantly, this run leads to an overall F-measure increase from 0.082 to 0.102. Accordingly, the setup from Run_4 will be used for the subsequent experiments. In this Section, we have shown that using two lists, the Premium and Work Lists, ensures better recall. Applying a filter to the candidates is a common strategy to improve precision, but it tends to decrease recall to an extent that offsets the gains, leading to a worse F-measure. We have also shown that this issue can be addressed by using two separate thresholds: a more relaxed threshold applied to the Premium List and a more stringent threshold applied to the Work List, leading to a better F-measure.

5.5 Comparing LX-ListQuestion and other QA Systems

Comparing LX-ListQuestion with other QA systems is crucial to provide an assessment of how LX-ListQuestion is positioned relative to the state-of-the-art. In this Section we compare the results of LX-ListQuestion with four other QA systems: (i) RapPortagico, which runs for Portuguese and uses the same question dataset; (ii) XisQuê, a Web-based QA system, also for Portuguese; (iii) START, a state-of-the-art QA system for English; and (iv) WolframAlpha, a well-known and widely used knowledge engine that can be used as a QA system for English. The evaluation has two components: the quantitative evaluation of answers and the question coverage evaluation. The quantitative analysis uses precision, recall and F-measure as metrics. These metrics are the most commonly used for evaluating List Questions. However, these metrics alone do not accurately reflect how effective the systems are in providing correct answers to the maximum number of questions. For that, we use the question coverage, which determines the number of questions that receive at least one correct answer. This is another dimension under which QA systems can be evaluated, one that better indicates the usefulness of the system to the user. This dimension was manually evaluated for all systems. These experiments were run in the period from November to December 2014. As such, their replicability cannot be guaranteed, since these systems are either Web-based or use a knowledge base that may have changed.
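For reference, the quantitative metrics and the question coverage used throughout this comparison can be computed as follows (a minimal sketch using exact answer matching; in practice the relaxed matching of Section 5.2 is applied first):

    def evaluate(reference, retrieved):
        # reference and retrieved map each question id to a set of answers.
        correct = sum(len(reference[q] & retrieved.get(q, set())) for q in reference)
        total_reference = sum(len(answers) for answers in reference.values())
        total_retrieved = sum(len(answers) for answers in retrieved.values())
        recall = correct / total_reference if total_reference else 0.0
        precision = correct / total_retrieved if total_retrieved else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        # Question coverage: questions with at least one correct answer.
        coverage = sum(1 for q in reference if reference[q] & retrieved.get(q, set()))
        return recall, precision, f_measure, coverage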


5.5.1 LX-ListQuestion versus RapPortagico

In this Section we compare the results of LX-ListQuestion and RapPortagico. RapPortagico (Rodrigues and Oliveira, 2012) was the best system of the Págico Competition1. Table 5.6 shows the differences between the design features of both systems.

                      RapPortagico                                LX-ListQuestion
Corpus Pre-indexing   Yes. The system pre-indexes the corpus      No
                      using Noun Phrases.
Corpus Source         Off-line Wikipedia documents                Web
Search Engine         Lucene (indexed to documents stored         Google
                      in local files)
Type of answers       List of Wikipedia pages                     List of Answers

Table 5.6: Comparing QA systems - RapPortagico and LX-ListQuestion

RapPortagico pre-indexes the documents using noun phrases that occur in the sentences of the corpus, while LX-ListQuestion does not use any pre-indexing of documents. RapPortagico uses the off-line Wikipedia as the source of information, while LX-ListQuestion uses the Web to find the answers. The supporting search engines are also different. RapPortagico uses Lucene to find documents indexed in local files and LX-ListQuestion uses Google to retrieve web pages at runtime. The two systems also differ in the type of answers: RapPortagico returns a list of Wikipedia pages and LX-ListQuestion returns a list of answers. Basically, RapPortagico is an off-line system while LX-ListQuestion is an on-line system. Despite the differences between the two systems, we chose RapPortagico as a comparison system because it works for Portuguese and can be directly compared to LX-ListQuestion, since both use the same question dataset from the Págico Competition. The authors kindly provided the output of RapPortagico for comparison.

System            Reference Answer List   Correct Answers   All Answers Retrieved   Recall   Precision   F-Measure
LX-ListQuestion   340                     41                460                     0.120    0.089       0.102
RapPortagico      340                     32                327                     0.097    0.100       0.098

Table 5.7: Evaluation of QA systems - RapPortagico and LX-ListQuestion

1 Págico Competition allowed automatic systems and humans to participate in the competition.


Experimental Setup: In this experiment we use the same results from Section 5.4.3 and compare the results with the output provided by the authors of RapPortagico.

ID           LX-ListQuestion                RapPortagico
Pagico_053   Parque Estadual Ilha Grande    Floresta Tijuca
             Parque Nacional Tijuca         Parque Nacional Tijuca
Pagico_054   Nossa Senhora Rosario          —
             Sao Benedito
Pagico_058   Italia                         —
Pagico_062   Ericeira                       —
             Arrifana
             Praia Vale Homens
             São João Estoril
Pagico_085   —                              —
Pagico_088   Obidos                         —
Pagico_091   Africa Sul                     —
Pagico_092   Damao                          Calecute
             Goa
Pagico_094   Parque Nacional Gorongosa      Parque Nacional Gorongosa
             Parque Nacional Limpopo
Pagico_100   Arquipelago Bazaruto           Arquipelago Bazaruto
             Ilha Bazaruto                  Arquipelago Primeiras Segundas
             Ilha Santa Carolina            Ilha de Santa Carolina
             Ilha Ibo                       Matemo
             Ilha Moçambique                Quirimbas
Pagico_112   Luanda                         Luanda
             Namibe                         Namibe
             Luena
             Ondjdiva
             Sumbe
             Uige
Pagico_140   Salvador                       Cabo Verde
             Recife
             São Paulo
Pagico_004   Guilhermina Suggia             Guilhermina Suggia
Pagico_059   —                              Leandro Barbosa
                                            Tiago Splitter
Pagico_063   —                              —
Pagico_086   —                              —
Pagico_097   Baltasar Lopes                 —
             Eugenio Tavares
Pagico_104   Celso Magalhaes                Atico Vilas-Boas Mota
             Lucia Gallet                   Emilia Biancardi
             Luis Camara Cascudo            Marco Haurelio
             Silvio Romero                  Paixao Cortes
                                            Raul Lody
                                            Saul Alves Martins
                                            Vicente Salles
Pagico_106   Afonso Albuquerque             Constanti Braganca
             Joao Castro
             D. Luis
             Francisco Almeida
             Rei D. Manuel
Pagico_108   Rolando                        Nani
             Varela
Pagico_109   —                              —
Pagico_111   —                              —
Pagico_116   —                              Luis Camoes
Pagico_118   Jose Craveirinha               —
Pagico_124   —                              —
Pagico_128   —                              Deolinda Carmo Salvado Conceicao
                                            Jose Costa Nunes
                                            Jose Rodrigues Santos
                                            Jose Silveira Machado
                                            Maria Ondina Braga
                                            Venceslau Morais
Pagico_132   —                              Malangatana
Pagico_133   —                              Jose Silva
                                            Santana Carlos
Pagico_149   —                              —
Pagico_153   —                              —

Table 5.8: Comparing answers - LX-ListQuestion and RapPortagico

Table 5.7 shows the results of comparing the two systems. The evaluation was performed over the same set of questions for both systems. As we see, LX-ListQuestion obtained higher recall than RapPortagico, 0.120 against 0.097. However, it has lower precision, since it returned more candidates than the other system. When comparing F-measure, LX-ListQuestion achieved slightly better results, obtaining 0.102 against 0.098 for RapPortagico. These scores are too close to allow us to claim a clear superiority of LX-ListQuestion over RapPortagico. We thus turn to the question coverage evaluation, which counts the questions that received at least one correct answer, as a way of assessing a different dimension of evaluation.
From the 30 questions in the dataset, LX-ListQuestion provided at least one correct answer to 17 of them, against 14 for RapPortagico. This low rate of effectiveness is due to the fact that the Págico questions were designed to be non-trivial, requiring a greater effort to answer them.
The question coverage evaluation (Table 5.8) also allowed us to uncover an interesting behavior of these systems. For 7 questions answered by LX-ListQuestion, RapPortagico provided no answer. Conversely, for 5 questions answered by RapPortagico, LX-ListQuestion provided no answer. In addition, we note that when a question is answered by both systems, the answers given by each system tend to be different. This result points towards a certain degree of complementarity between both systems. For instance, a system combining the output of LX-ListQuestion and RapPortagico would leave only 8 of the 30 questions unanswered.

5.5.2 LX-ListQuestion versus XisQuê

In this Section we compare the results of LX-ListQuestion and XisQuê. The design features of both systems are to a certain extent similar, as shown in Table 5.9.

                    XisQuê               LX-ListQuestion
Corpus Source       Web                  Web
Language            Portuguese           Portuguese
Search Engine       Google               Google
Type of questions   Factoid Questions    List Questions
Type of answers     Answer and Snippet   List of Answers

Table 5.9: Comparing QA systems - XisQuê and LX-ListQuestion

Both systems are Web-based QA systems, use the Web as the source of answers, and use Google as the supporting search engine. In addition, the two systems work for Portuguese.


What differs between the systems is that XisQuê answers Factoid Questions while LX-ListQuestion answers List Questions. We chose to compare LX-ListQuestion with XisQuê since XisQuê is available online, which means that we can easily perform the experiment. Figure 5.6 shows an example of the display of the XisQuê system.

Figure 5.6: Display of XisQuê system operation.

Experiment Setup: The evaluation of the XisQuê system was done manually by the author of this dissertation. Due to the effort this task requires, only 10 questions (randomly selected from the question dataset used in the previous experiment) were used for this evaluation. For this experiment we used the same results from Section 5.4.3 and compared them with the output provided by the XisQuê Web system. As XisQuê returns an answer and a snippet, in this assessment, even if the system does not return the correct answer, we consider that the system answered correctly if the answer appears in the snippet. The full set of answers given by XisQuê, with screenshots, may be found in Appendix C.


Table 5.10 shows the evaluation of XisQuê and LX-ListQuestion. Despite both systems having many common design features, the fact that LX-ListQuestion is specifically designed to answer List Questions allows it to perform better than XisQuê, achieving an F-measure of 0.135 against 0.070 for XisQuê, mainly due to its much higher recall.

System            Reference Answer List   Correct Answers   All Answers Retrieved   Recall   Precision   F-Measure
LX-ListQuestion   131                     21                179                     0.160    0.117       0.135
XisQuê            131                     6                 40                      0.045    0.150       0.070

Table 5.10: Evaluation of QA systems - XisQuê and LX-ListQuestion

ID           Question                                        LX-ListQuestion        XisQuê
Pagico_004   Female cellists of portuguese language.         Guilhermina Suggia     —
Pagico_054   Rio de Janeiro churches built by black          Nossa Senhora          —
             brotherhoods or fraternities.                   Rosario São Benedito
Pagico_062   Good portuguese beaches for surfing.            Ericeira               Guincho
                                                             Arrifana               Peniche
                                                             Praia Vale Homens
                                                             São João Estoril
Pagico_086   Brazilian samba songwriters.                    —                      Dolores Duran
Pagico_088   Portuguese cities that have medieval            Obidos                 —
             festivals.
Pagico_100   Mozambique islands.                             Arquipelago Bazaruto   —
                                                             Ilha Bazaruto
                                                             Ilha Santa Carolina
                                                             Ilha Ibo
                                                             Ilha Moçambique
Pagico_109   Candidates for any presidential                 —                      Kumba Yalá
             elections in Guinea-Bissau.
Pagico_112   The capitals of Angola's provinces.             Luena                  Luanda
                                                             Luanda
                                                             Namibe
                                                             Ondjiva
                                                             Sumbe
                                                             Uige
Pagico_133   Petro de Luanda's football players.             —                      —
Pagico_140   Lusophone cities known for their                Salvador               Olinda
             carnival celebrations.                          Recife
                                                             São Paulo

Table 5.11: Comparing answers - XisQuê and LX-ListQuestion

Regarding the question coverage evaluation, which considers the questions that received at least one correct answer (presented in Table 5.11), we find that a large majority of the correct answers given by XisQuê are different from those given by LX-ListQuestion. Namely, in 4 out of 5 questions to which XisQuê provides a correct answer, that answer is not present in the list of answers given by LX-ListQuestion. In addition, for 2 of the questions (Pagico_086 and Pagico_109), XisQuê provides at least one correct answer while LX-ListQuestion gives none. As with RapPortagico, this suggests that these approaches are complementary.

5.5.3 LX-ListQuestion versus START

START (SynTactic Analysis using Reversible Transformations) is a Web-based question answering system developed by the MIT Computer Science and Artificial Intelligence Laboratory (Katz, 1997). Currently, the system can answer millions of English questions about places (e.g., cities, countries, lakes, coordinates, weather, maps, demographics, political and economic systems), movies (e.g., titles, actors, directors), people (e.g., birth dates, biographies), dictionary definitions and others. Figure 5.7 shows an example of the display of the START QA system.

Figure 5.7: Display of START system operation.

The system uses natural language annotation to connect information seekers to multiple information sources. This method allows the system to handle a wide variety of media, including text, diagrams, images, video and audio clips, data sets, Web pages, and others. The NLP component of START consists of two modules that share the same grammar: (i) the understanding module analyzes English text and produces a knowledge base that encodes the information found in the text; (ii) the generating module, given an appropriate segment of the knowledge base, produces English sentences. These two modules, used in conjunction with the natural language annotation technique, allow the system to access all stored information.
Table 5.12 compares the design features of START and LX-ListQuestion. When comparing START and LX-ListQuestion, one must take into account that START has been under development for over 20 years, while LX-ListQuestion was developed in the scope of this dissertation. The differences between the two systems go further than the target language. LX-ListQuestion processes the answers at runtime and uses Web pages as a source of information, while START has multiple resources annotated with all the information necessary to answer various question types. We opted for comparing LX-ListQuestion with START since START is a state-of-the-art Web-based QA system and is available online, which means that we can easily perform the experiment.

                    START                  LX-ListQuestion
Corpus Source       Multiple Resources     Web
Language            English                Portuguese
Search Engine       N.A.                   Google
Type of questions   Factoid Questions      List Questions
                    Definition Questions
                    List Questions
Type of answers     Answer                 List of Answers
                    Summary
                    Images
                    Maps
                    Videos
                    Audio
                    Web-pages
                    etc.

Table 5.12: Comparing QA systems - START and LX-ListQuestion

Experiment Setup: The question dataset used was presented in Table 5.2. Similarly to the previous experiment, the evaluation of the START system was done manually by the author of this dissertation. As START returns an answer or a short summary of the information, in this assessment, even if the system does not return the correct answer, we consider that the system answered correctly if the answer appears in the summary. The full set of answers given by START, with screenshots, may be found in Appendix D.

System            Reference Answer List   Correct Answers   All Answers Retrieved   Recall   Precision   F-Measure
LX-ListQuestion   68                      23                171                     0.338    0.134       0.192
START             68                      7                 18                      0.102    0.368       0.160

Table 5.13: Evaluation of QA systems - LX-ListQuestion and START

ID         Question                                      LX-ListQuestion       START
QALD_010   In which country does the Nile start?         Etiópia               Ethiopia
                                                                               Rwanda
QALD_028   Give me all communist countries.              Coreia do Norte       —
                                                         China
                                                         Cuba
QALD_032   Which countries adopted the Euro?             Grecia                —
                                                         Eslováquia
                                                         Estonia
                                                         Chipre
                                                         Letonia
                                                         Lituania
                                                         Malta
QALD_036   Through which countries does the              Russia                Russia
           Yenisei river flow?                           Mongólia
QALD_062   Who created Wikipedia?                        Jimmy Wales           Jimmy Wales
                                                                               Larry Sanger
QALD_074   Which capitals in Europe were host            Berlim                —
           cities of the summer Olympic Games?
QALD_114   Give me all members of Prodigy.               Keith Charles Flint   —
QALD_141   Who founded Intel?                            Robert Noyce          —
                                                         Gordon Moore
QALD_155   Which Greek goddesses dwelt on                Afrodite              —
           Mount Olympus?                                Hera
QALD_176   List the children of Margaret Thatcher.       Carol Thatcher        Carol Thatcher
                                                         Mark Thatcher         Mark Thatcher

Table 5.14: Comparing answers - LX-ListQuestion and START

Table 5.13 shows the evaluation comparing the results of LX-ListQuestion and START. The fact that START uses pre-annotated resources might lead us to expect that START has an advantage over LX-ListQuestion. However, we note that LX-ListQuestion achieved a better F-measure. START has a higher precision score, which is to be expected since START opts for not providing any answer when it is unable to find any annotation pertaining to the question within its multiple resources. On the other hand, this choice means that START has a much lower recall than LX-ListQuestion, which offsets the gain in precision.
Table 5.14 shows the results of the question coverage evaluation. LX-ListQuestion has a better performance, giving at least one correct answer to all 10 questions, while START leaves 6 questions unanswered. We also note that, with the exception of two cases (“Rwanda” in QALD_010 and “Larry Sanger” in QALD_062), all the answers provided by START were also found by LX-ListQuestion. Despite START being under development for several years and using multiple annotated resources, this experiment leads us to believe that using the Web as an information source can provide results that are more useful to the user.

5.5.4 LX-ListQuestion versus WolframAlpha

Figure 5.8: Display of WolframAlpha system operation.


WolframAlpha is a computational knowledge engine with more than 10 trillion pieces of curated data from primary sources, which are continuously updated. The system uses data curated by human experts to compute a specific answer on the fly. WolframAlpha is capable of answering questions in a large variety of fields, like mathematics, history, physics, chemistry, medicine, engineering, geography, computational sciences, art, finances, places, people, organizations, music and others1. The system accepts free-form input. As such, it is possible to use WolframAlpha as a Question Answering system. Figure 5.8 shows an example of the display of WolframAlpha. Table 5.15 shows the comparison of the design features of WolframAlpha and LX-ListQuestion. WolframAlpha and LX-ListQuestion have few design features in common. Strictly speaking, WolframAlpha goes beyond QA, providing alongside the answer much associated information from its knowledge base (images, graphics, maps, etc.). Conversely, LX-ListQuestion retrieves its answers from the Web and presents them without any other associated information. We chose to compare these two systems since WolframAlpha is well-known, widely used and available online, allowing us to easily perform the experiment. The full set of answers given by WolframAlpha, with screenshots, may be found in Appendix E.

                    WolframAlpha           LX-ListQuestion
Corpus Source       Data Collection        Web
Language            English                Portuguese
Search Engine       N.A.                   Google
Type of questions   Factoid Questions      List Questions
                    Definition Questions
                    List Questions
Type of answers     Answer                 List of Answers
                    Summary
                    Images
                    Maps
                    Graphics
                    Tables
                    Sounds

Table 5.15: Comparing QA systems - WolframAlpha and LX-ListQuestion

1See the complete list of topics in http://www.wolframalpha.com/examples/ - last access on December, 1 2014.


System            Reference Answer List   Correct Answers   All Answers Retrieved   Recall   Precision   F-Measure
LX-ListQuestion   68                      23                171                     0.338    0.134       0.192
WolframAlpha      68                      21                39                      0.308    0.538       0.392

Table 5.16: Evaluation of QA systems - LX-ListQuestion and WolframAlpha

ID         Question                                      LX-ListQuestion       WolframAlpha
QALD_010   In which country does the Nile start?         Etiópia               —
QALD_028   Give me all communist countries.              Coreia do Norte       —
                                                         China
                                                         Cuba
QALD_032   Which countries adopted the Euro?             Grecia                Vatican City
                                                         Eslováquia            Germany
                                                         Estonia               France
                                                         Chipre                Italy
                                                         Letonia               Spain
                                                         Lituania              Netherlands
                                                         Malta                 Belgium
                                                                               Austria
                                                                               Greece
                                                                               Finland
                                                                               Ireland
                                                                               Portugal
                                                                               Slovakia
                                                                               Slovenia
                                                                               Luxemburg
                                                                               Cypros
                                                                               Estonia
QALD_036   Through which countries does the              Russia                Russia
           Yenisei river flow?                           Mongolia              Mongolia
QALD_062   Who created Wikipedia?                        Jimmy Wales           —
QALD_074   Which capitals in Europe were host            Berlim                —
           cities of the summer Olympic Games?
QALD_114   Give me all members of Prodigy.               Keith Charles Flint   —
QALD_141   Who founded Intel?                            Robert Noyce          —
                                                         Gordon Moore
QALD_155   Which Greek goddesses dwelt on                Afrodite              —
           Mount Olympus?                                Hera
QALD_176   List the children of Margaret Thatcher.       Carol Thatcher        Carol Thatcher
                                                         Mark Thatcher         Mark Thatcher

Table 5.17: Comparing answers - LX-ListQuestion and WolframAlpha


Table 5.16 shows the evaluation results comparing LX-ListQuestion and WolframAlpha. The question dataset is the same used in the previous Section to compare LX-ListQuestion and the START system. LX-ListQuestion has higher recall than WolframAlpha, 0.338 against 0.308. However, WolframAlpha, due to its manually curated knowledge base, achieves much higher precision, 0.538 against 0.134. This difference in precision leads to an F-measure score of 0.392 for WolframAlpha against 0.192 for LX-ListQuestion.
As in the previous Sections, the question coverage evaluation assesses a different dimension of evaluation and shows a different perspective. Table 5.17 shows the question coverage evaluation. LX-ListQuestion gives at least one correct answer to all 10 questions, while WolframAlpha only provides correct answers to 3 questions. In fact, WolframAlpha is the system that answered the fewest questions of all the systems that were evaluated. Note that the apparently competitive recall score of WolframAlpha comes mostly from QALD_032, which contributes 17 correct answers. This stresses the importance of question coverage, since precision and recall scores alone hide much of the actual ability of the system to answer questions. Again, these results show that a system that retrieves answers from the Web can be more useful to the user than a system which uses a curated knowledge base, since such a resource is hard to keep up to date with current information.

5.6 Temporal List Questions: Evaluation

In this section we evaluate the Temporal Processing module of the LX-ListQuestion system that was presented in Chapter 4. We start with an experiment that aims to test our approach for different temporal restrictors over the same base question. Following this, we evaluate LX-ListQuestion with the Temporal Processing module using a subset of questions from the Págico Competition randomly selected from those with temporal restrictions. The results are compared with another QA System using the quantitative evaluation of answers and the question coverage evaluation. Since the question dataset of the Págico Competition was built to find the answers in Wikipedia and LX-ListQuestion is a Web-based QA system that uses the whole Web as information source, we also evaluate the system over another question dataset.


This dataset is composed of five questions with temporal restriction where both the questions and the answers were given by humans. The questions were retrieved from YAHOO! Answers1. Finally, we compare the results over this question dataset with another Web-based QA System.

5.6.1 Evaluation for Different Temporal Restrictors

In order to evaluate the Temporal Processing Module of our system we perform an assessment exercise for different temporal restrictors. We decided to use the same base question and measure the performance of LX-ListQuestion when different temporal restrictors are added to the question. The base question is the same used to explain our approach in Chapter 4, “Give all books from Danielle Steel”. We especially chose this question since the author, Danielle Steel, has many publications spanning several years, which allows us to apply a large variety of temporal restrictors to the same base question.

Experimental setup: In this experiment, the Temporal Processing Module is embedded into the LX-ListQuestion system, and we assess and evaluate the results of the same question using different temporal restrictors. The results are presented in Table 5.18.

Question                                                      Reference     Correct   Answers     Recall   Precision   F-Measure
                                                              Answer List   Answers   Retrieved
Quais são os livros da Danielle Steel?                        79            76        259         0.962    0.293       0.449
Quais são os livros da Danielle Steel antes de 1990?          27            26        73          0.963    0.356       0.520
Quais são os livros da Danielle Steel nos últimos 15 anos?    31            29        54          0.935    0.537       0.682
Quais são os livros da Danielle Steel na década de 90?        26            23        36          0.884    0.638       0.741
Quais são os livros da Danielle Steel entre 1985 e 1990?      9             9         17          1.000    0.529       0.692
Quais são os livros da Danielle Steel em 1981, 1991 e 2001.   9             9         14          1.000    0.529       0.692
TOTAL (temporal questions)                                    102           96        194         0.941    0.494       0.648

Table 5.18: Evaluation for different temporal restrictors.

Note that the first line of Table 5.18 shows the results for the base question (i.e. with no temporal restriction). When the system answers the question without any temporal restriction, it achieves a lower precision, since many incorrect answers are selected into the final answer list.

1https://br.answers.yahoo.com/ - last access on October, 20. 2014.


In the following questions we vary the temporal restriction, initially using a broader restrictor (“...antes de 1990.”) and then gradually restricting the timespan of the questions (“...em 1981, 1991 e 2001.”). In general, for these questions LX-ListQuestion with the Temporal Processing module achieves a satisfactory performance, with excellent recall scores (between 0.884 and 1.000). We observe that the system tends to get a higher precision score when the temporal restriction is narrower, as we can see in the question “Quais são os livros da Danielle Steel em 1981, 1991 e 2001.”, which gets a higher precision than “Quais são os livros da Danielle Steel antes de 1990?”. This happens because larger timespans allow for more wrong candidates to be chosen. For instance, the restrictor “...antes de 1990.” (EN: “...before 1990.”) allows for any number lower than 1990.

5.6.2 Evaluation of Temporal List Questions over Questions from Págico

In this Section we evaluate LX-ListQuestion with the Temporal Processing module using five questions from the Págico Competition dataset, randomly selected from those with temporal restriction. The questions and the number of correct answers provided in the reference list of answers are presented in Table 5.19.

ID         | Question                                                                                    | Reference List of Answers
Pagico_008 | Telenovelas brasileiras passadas no tempo da escravatura no Brasil                          | 17
Pagico_034 | Viajantes ou exploradores que escreveram sobre o Brasil do século XVI                       | 8
Pagico_050 | Jornais que circularam no Rio de Janeiro entre 1910 e 1960.                                 | 14
Pagico_068 | Bandas brasileiras de punk formadas até 1980 em São Paulo.                                  | 9
Pagico_078 | Escritoras de língua portuguesa que tenham publicado livros para crianças entre 1850 e 1940 | 3

Table 5.19: Subset of questions from the Págico Competition.

In order to evaluate the results we compare the same questions with the RapPortagico QA system (already presented in Section 5.5.1). In general, both systems struggle to answer the questions of the Págico Competition. As mentioned earlier, the questions provided by Págico are non-trivial and require complex processing by the systems. The scores are very similar, with a slight advantage for LX-ListQuestion, at 0.114 in F-Measure against 0.098 for RapPortagico, and with no system standing out as the clear winner.


LX-ListQuestion:
ID         | Correct Answers | Answers Retrieved | Recall | Precision | F-Measure
Pagico_008 | 0 | 0  | 0.000 | 0.000 | 0.000
Pagico_034 | 1 | 5  | 0.125 | 0.200 | 0.153
Pagico_050 | 0 | 0  | 0.000 | 0.000 | 0.000
Pagico_068 | 4 | 31 | 0.444 | 0.129 | 0.200
Pagico_078 | 0 | 0  | 0.000 | 0.000 | 0.000
TOTAL      | 5 | 36 | 0.098 | 0.138 | 0.114

RapPortagico:
ID         | Correct Answers | Answers Retrieved | Recall | Precision | F-Measure
Pagico_008 | 3 | 6  | 0.176 | 0.500 | 0.260
Pagico_034 | 1 | 22 | 0.125 | 0.045 | 0.066
Pagico_050 | 0 | 21 | 0.000 | 0.000 | 0.000
Pagico_068 | 3 | 24 | 0.333 | 0.125 | 0.181
Pagico_078 | 0 | 18 | 0.000 | 0.000 | 0.000
TOTAL      | 7 | 91 | 0.137 | 0.076 | 0.098

Table 5.20: Evaluation of temporal list questions: LX-ListQuestion versus RapPortagico.

Similarly to what was done in the previous Section, a question coverage evaluation is used to add another dimension to the QA evaluation. The question coverage evaluation can indicate the usefulness of the QA system. The question coverage is presented in Table 5.21 and reveals the same behavior found in the previous analysis (presented in Table 5.8). It points to a certain degree of complementarity between the two systems since, for question Pagico_008, RapPortagico provides answers while LX-ListQuestion provides no correct answer. Conversely, for question Pagico_034, LX-ListQuestion provides a correct answer while RapPortagico provides none. For question Pagico_068, the answers provided by the two systems are different.

ID         | Question                                                                                    | Correct Answers (LX-ListQuestion)                  | Correct Answers (RapPortagico)
Pagico_008 | Telenovelas brasileiras passadas no tempo da escravatura no Brasil                          | —                                                  | Sangue do Meu Sangue; Sangue do Meu Sangue 1969; Sangue do Meu Sangue 1995
Pagico_034 | Viajantes ou exploradores que escreveram sobre o Brasil do século XVI                       | Hans Staden                                        | —
Pagico_050 | Jornais que circularam no Rio de Janeiro entre 1910 e 1960.                                 | —                                                  | —
Pagico_068 | Bandas brasileiras de punk formadas até 1980 em São Paulo.                                  | Condutores Cadaver; Ratos do Porão; Restos; Colera | Lixomania; Olho Seco; Restos
Pagico_078 | Escritoras de língua portuguesa que tenham publicado livros para crianças entre 1850 e 1940 | —                                                  | —

Table 5.21: Question coverage: LX-ListQuestion versus RapPortagico.

5.6.3 Evaluation of Temporal List Questions over Questions from Yahoo! Answers

In this Section we evaluate LX-ListQuestion with the Temporal Processing module using five questions with temporal restriction retrieved from YAHOO! Answers. Note that these are actual questions posted by users, a kind of question that LX-ListQuestion aims to answer.

The questions and the number of correct answers provided in the reference list are presented in Table 5.22.

ID     | Question                                                   | Reference List of Answers
TP_001 | Quais eram os partidos politicos existentes antes de 1964? | 7
TP_002 | Quem ganhou o premio nobel entre 1900 a 1920?              | 100
TP_003 | Quais foram as novelas brasileiras dos últimos 5 anos?     | 10
TP_004 | Que países boicotaram os Jogos Olímpicos de 1980?          | 65
TP_005 | Quais são as bandas brasileiras de rock dos anos 80?       | 45

Table 5.22: Temporal questions retrieved from Yahoo! Answers.

The results of LX-ListQuestion and XisQuê are presented in Table 5.23. LX-ListQuestion achieves better recall, which improves its F-measure score to 0.243, against 0.054 for XisQuê. As mentioned earlier, in Section 5.5.2, LX-ListQuestion stands out with better results since it has a dedicated module to answer Temporal List Questions.

LX-ListQuestion:
ID     | Correct Answers | Answers Retrieved | Recall | Precision | F-Measure
TP_001 | 7  | 46  | 1.000 | 0.152 | 0.264
TP_002 | 26 | 108 | 0.260 | 0.240 | 0.250
TP_003 | 2  | 7   | 0.200 | 0.285 | 0.235
TP_004 | 9  | 24  | 0.138 | 0.375 | 0.202
TP_005 | 8  | 16  | 0.177 | 0.500 | 0.262
TOTAL  | 52 | 201 | 0.229 | 0.258 | 0.243

XisQuê:
ID     | Correct Answers | Answers Retrieved | Recall | Precision | F-Measure
TP_001 | 2 | 5  | 0.285 | 0.400 | 0.333
TP_002 | 0 | 0  | 0.000 | 0.000 | 0.000
TP_003 | 0 | 5  | 0.000 | 0.000 | 0.000
TP_004 | 0 | 1  | 0.000 | 0.000 | 0.000
TP_005 | 5 | 17 | 0.111 | 0.294 | 0.161
TOTAL  | 7 | 28 | 0.038 | 0.250 | 0.054

Table 5.23: Evaluation of temporal list questions: LX-ListQuestion versus XisQuê.

The question coverage evaluation presented in Table 5.24 shows the superiority of LX-ListQuestion in terms of providing correct answers to all questions. The question coverage evaluation, in this case, supports the results obtained in the quantitative evaluation. However, it is important to point out that, despite its lower recall for question TP_005, XisQuê provides some answers distinct from those of LX-ListQuestion. This suggests that the systems may have complementary approaches.


ID     | Question                                                   | Correct Answers (LX-ListQuestion)                                                                                                      | Correct Answers (XisQuê)
TP_001 | Quais eram os partidos politicos existentes antes de 1964? | Arena; Movimento Democratico Brasileiro; Partido Trabalhista; Uniao Democratica Nacional; Partido Comunista; Partido Trabalhista Brasileiro | Arena; Movimento Democratico Brasileiro; Partido Socialista
TP_002 | Quem ganhou o premio nobel entre 1900 a 1920?              | Paul Sabatier; Jacobus Henricus van; Wilhelm Conrad Rontgen; (...21 answers omitted...); Sir William Bragg; Adolf Von Baeyer           | —
TP_003 | Quais foram as novelas brasileiras dos últimos 5 anos?     | Passione; Fina Estampa                                                                                                                 | —
TP_004 | Que países boicotaram os Jogos Olímpicos de 1980?          | Japao; Estados Unidos; (...5 answers omitted...); Hong Kong; Chile                                                                     | —
TP_005 | Quais são as bandas brasileiras de rock dos anos 80?       | Titas; Paralamas sucesso; Blitz; Joao Penca; Lobão; Eduardo Dusek; Angra                                                               | Titas; Paralamas sucesso; Legião Urbana; Cazuza; Barao Vermelho

Table 5.24: Question coverage: LX-ListQuestion versus XisQuê.

5.7 Summary

This Chapter detailed the evaluation process carried out to assess the performance of the approaches presented in Chapters 3 and 4. We divided the evaluation into three parts. In the first part, we evaluated each parameter of LX-ListQuestion separately to find the best setup for the configuration of our system. We tested different scenarios, with corpora of different sizes, and opted for setting the number of retrieved documents to 10 (the default number in the state-of-the-art), since the other scenarios did not perform any better. Then we verified how many correct answers our system finds using different threshold boundaries. Based on the experiment presented, the best parameter for TP is 7 and the best parameter for TW is 1, because these values achieve the best F-Measure score. Following this, we assessed the efficiency of our approach by comparing the results using four different setups. We concluded that using two lists, the Premium and Work lists, ensures better recall. We demonstrated that using two separate thresholds, a more relaxed threshold applied to the Premium List and a more stringent threshold applied to the Work List, leads to a better F-measure.


In the second part of this Chapter, we compared LX-ListQuestion with other QA systems. This comparison is important to assess how LX-ListQuestion is positioned with respect to the state-of-the-art. LX-ListQuestion was compared with four other QA systems: (i) RapPortagico, an off-line List and Factoid QA system for Portuguese; (ii) XisQuê, a Web-based single-answer Factoid QA system, also for Portuguese; (iii) START, a state-of-the-art List and Factoid QA system for English; and (iv) WolframAlpha, a well-known and widely used knowledge engine that can be used as a QA system for English.

For the comparison, we used two analyses: (a) a quantitative evaluation of answers, providing recall, precision and F-measure over the same dataset for each system; and (b) a question coverage evaluation, which better indicates the usefulness of the system to the user, thus providing another dimension of evaluation. This comparison brought interesting results, pointing to a degree of complementarity of LX-ListQuestion when its question coverage is compared with that of RapPortagico and XisQuê.

Regarding the QA systems for English, our analysis shows that, even though START uses pre-annotated resources, which might lead us to expect that START had an advantage, LX-ListQuestion achieved a better F-measure. The question coverage evaluation consolidates the results obtained from the quantitative evaluation.

The quantitative results of the comparison of LX-ListQuestion and WolframAlpha indicate that WolframAlpha achieves higher precision, since it uses a manually curated knowledge base as its source of information. The question coverage evaluation assesses a different dimension and shows a different perspective. WolframAlpha is the system that answered the fewest questions of all the systems evaluated, and the apparently competitive recall score of WolframAlpha comes mostly from a single question, which contributes 17 correct answers; LX-ListQuestion, in contrast, gives at least one correct answer to every question in the dataset.

The last part of this Chapter concerns the evaluation of the Temporal Processing module of LX-ListQuestion. First, we tested our approach for different temporal restrictors over the same base question. LX-ListQuestion with the Temporal Processing module achieves a satisfactory performance and an excellent recall score (between 0.884 and 1.000). Following this, we evaluated the system using the question dataset made available by the Págico Competition and compared the results against RapPortagico. Our experiment shows that both systems struggle to answer the questions of the Págico Competition, since the questions provided by Págico are non-trivial and require complex processing by the systems.

Another dataset was built by extracting questions with temporal restriction from YAHOO! Answers. Note that these are actual questions posted by human users. For this dataset, LX-ListQuestion achieves better recall, which improves the F-measure score to 0.243, against 0.054 for XisQuê. The question coverage evaluation for the comparison between LX-ListQuestion, RapPortagico and XisQuê indicates, as before, a degree of complementarity.

Final Remarks

This chapter extensively evaluated our approaches to tackling List questions and Temporal List questions. Our approach to answering List questions is novel, since it combines redundancy and heuristics. The evaluation performed in this chapter showed that our approach achieves better results than other QA systems and improves on the state-of-the-art.

Regarding Temporal QA, we addressed this issue by combining the Web-based approach, which collects all possible answers to the question, with a shallow temporal processing approach that filters the answers according to the temporal restriction. The experiments in this Chapter showed that we achieved competitive results. Our approach to answering Temporal List questions using Shallow Temporal Processing is the first working system combining Temporal, List and Web-based Question Answering, and it is a valuable contribution to this field.

6 Conclusions

This dissertation addressed the task of answering Open-domain List questions whose answers are extracted from multiple documents retrieved from the Web, with a special focus on questions that include temporal information. This final chapter is organized as follows: Section 6.1 gives a short overview of what is covered in each Chapter. The main goals and contributions of our work are presented in Section 6.2. Finally, some future directions of research are discussed in Section 6.3.

6.1 Summary

The contents of this dissertation are summarized as follows:

Introduction: Chapter 1 is a general introduction to the area of Question Answering that addresses its specific subareas: Open-Domain QA, Web-based QA and Temporal QA. We presented the types of questions most commonly studied. The motivation, goals and challenges for research in this area were also presented.

Related Work: Chapter 2 is a review of the current state-of-the-art. We presented QA systems developed for Portuguese. Some of these systems were developed to participate in competitions like CLEF, GikiCLEF or Págico. For Portuguese, we highlighted the XisQuê system, a Web-based QA system that answers factoid questions and uses linguistic patterns as its main approach. We also presented QA systems that handle List questions. The main approaches exploit (i) NLP tools and linguistic resources; (ii) the relation between question and possible answers; and (iii) the semantic content.

For List questions we highlighted RapPortagico, a QA system developed to answer List questions by searching for answers using an off-line Wikipedia as information source, which was the system with the best performance in the Págico Challenge.

Still in Chapter 2, Web-based QA systems are presented. There is a great variety in the approaches developed that use the Web as a corpus, such as those that exploit redundancy, probabilistic models, clustering, etc. Most systems answer only factoid questions. The START system, developed by MIT and available since 1993, handles List questions as well.

At the end of Chapter 2, we presented the approaches to answering Temporal questions. Our overview indicates that a predominant approach has not emerged yet, since each researcher chose a different path depending on the type of temporal questions. Some works use the Web as a corpus to answer temporal questions. However, these works try to extract the information and store it into knowledge bases to enable the QA system to access it more easily, thus requiring constant effort to keep the database up to date.

Answering List Questions: Chapter 3 covers a major goal of this dissertation: developing an approach for processing List questions and collecting answers spread over multiple documents using the Web as a corpus. Our approach is based on the redundancy of information available on the Web, combined with heuristics to improve QA performance.

Our approach takes advantage of the sentences being classified according to their relevance to the question. Our approach, termed Bipartite List, is based on building two lists in which each element is associated with its frequency. Afterwards, two empirically determined thresholds (one for each list) are applied to select the potentially relevant answers. Besides this process, we developed three heuristics: (1) the Verb-Rule, which selects a candidate as an answer if the sentence in which that candidate occurred contains the same verb as the question; (2) the Title-Rule, which selects a candidate as an answer if it was extracted from texts whose title matches the question (i.e. all keywords are present); and (3) the Sentence Match-Rule, which selects a candidate as an answer if it was extracted from sentences that match the root question.

We implemented our approach into a fully-fledged Open-domain Web-based QA system for List questions. The system architecture is composed of three main modules: Question Processing, Passage Retrieval and Answer Extraction.

The Question Processing module is responsible for converting a natural language question into a form that subsequent modules are capable of handling. The main sub-tasks are (i) question analysis, responsible for cleaning the questions, i.e. removing question marks, interrogative pronouns and imperative verbs; (ii) extraction of keywords, performed using two different algorithms, namely Nominal Expansion and Verbal Expansion; (iii) transformation of the question into a query; (iv) identification of the semantic category of the expected answer; and (v) identification of the question-focus.

The Passage Retrieval Module is responsible for searching Web pages and saving their full textual content into local files for processing. After the content is retrieved, the system selects the relevant sentences. The Answer Extraction Module aims at identifying and extracting relevant answers and presenting them in list form. The candidate answer identification is based on a Named Entity Recognition tool: the candidates are selected if they match the semantic category of the question. The process of building the final answer list is based on frequency and heuristics, as mentioned before. As discussed, this novel approach allows finding a better balance between precision and recall.

Answering Temporal List Questions: Chapter 4 provided the background for Temporal Question Answering and presented the main concepts and challenges. We presented an overview of the state-of-the-art in Temporal QA, the design features of our Temporal Processing Module and the questions that are expected as input. We explained how we solved the time-range issue of the temporal expressions. We also presented our approach to shallow temporal processing to find answers that satisfy the temporal restrictions in the questions. Our approach handles questions with (i) absolute time restriction (e.g. “Que países boicotaram os Jogos Olímpicos de 1980?” EN: “Which countries boycotted the 1980 Olympic Games?”) and (ii) relative time restriction (e.g. “Quais eram os partidos políticos existentes antes de 1964?” EN: “Which political parties existed before 1964?”). For the Temporal Processing, it is very important to have the maximum number of candidates before the system starts processing. For each candidate, we extract from the full set of documents every temporal expression that co-occurs in the same sentence with that candidate. If the temporal expression extracted satisfies the temporal restriction in the question, the candidate is accepted into the final list of answers.

Evaluation: In Chapter 5, we evaluated each parameter of LX-ListQuestion separately to find the best setup for its configuration. We also compared LX-ListQuestion with four other QA systems:


(i) RapPortagico, an off-line List and Factoid QA system for Portuguese; (ii) XisQuê, a Factoid Web-based QA system, also for Portuguese; (iii) START, a state-of-the-art QA system for English; and (iv) WolframAlpha, a well-known and widely used knowledge engine that can be used as a QA system for English.

For the sake of comparison, the results were analyzed in two ways: (1) the quantitative evaluation of answers provides recall, precision and F-measure, comparing LX-ListQuestion with each of the other systems on the same question dataset; (2) the question coverage indicates the usefulness of the system to the user by counting the number of questions for which the system provides at least one correct answer.

Compared with the systems for Portuguese, RapPortagico and XisQuê, our LX-ListQuestion achieved better results, with 0.102 in F-Measure, against 0.098 for RapPortagico and 0.070 for XisQuê. The question coverage evaluation points towards a certain degree of complementarity between these systems. We observe that for a set of questions answered by LX-ListQuestion, the other systems provide no answers. Conversely, for some other questions answered by RapPortagico or XisQuê, LX-ListQuestion provided no answer. In addition, we note that when a question is answered by all three systems, the answers given by each system tend to be different.

Regarding the comparison with the systems for English, START and WolframAlpha, our LX-ListQuestion achieved better results than START, with an F-Measure of 0.192 against 0.160, while WolframAlpha performed better than the other two, achieving an F-Measure of 0.392. The question coverage evaluation revealed a more detailed scenario. LX-ListQuestion has the better performance, giving at least one correct answer to all 10 questions in our experiment, while START leaves 6 questions unanswered and WolframAlpha leaves 7 questions unanswered. In fact, WolframAlpha is the system that answered the fewest questions of all the systems evaluated. This stresses the importance of the question coverage evaluation, since precision and recall scores alone hide much of the actual ability of a system to answer questions.

At the end of Chapter 5, we evaluated more closely the ability of LX-ListQuestion with respect to Temporal questions. We tested our approach for five different temporal restrictors over the same base question, “Quais são os livros da Danielle Steel?” (EN: “Which are Danielle Steel's books?”). Our approach achieved a positive performance and an excellent recall score, achieving 0.941 on average.

The evaluation proceeded using two different datasets of Temporal List Questions. The first dataset was composed of questions from the Págico Competition and the performance was compared against RapPortagico.

Our experiment showed that both systems struggle to answer these questions, since the questions provided by Págico are non-trivial and require complex processing by the systems.

Another dataset was built by collecting questions with temporal restriction from YAHOO! Answers, which are actual questions posted by human users. The results were compared with those of XisQuê. For this dataset, LX-ListQuestion achieves better recall, which improves the F-measure to a score of 0.243, against 0.054 for XisQuê. The assessment of the question coverage for the Temporal Processing Module, between LX-ListQuestion, RapPortagico and XisQuê, indicates, as before, a degree of complementarity.

6.2 Contributions

The present dissertation achieves several goals and makes a number of contributions to research in the field of Open-domain Web-based List QA and Temporal QA. Here, we review the contributions proposed in Chapter 1.

Developing an approach to deal with List questions: We developed an approach to extract a list of answers spread over multiple documents using the Web as a corpus (Gonçalves and Branco, 2014a). Our approach is based on the redundancy found on the Web, combined with heuristics. Our proposed approach consists of three parts:

1. Candidate retrieval: The approach selects and classifies sentences according to their relevance to the question, using the number of question keywords they contain as a score. After this classification, the sentences are used to select all candidate answers in two ways: (i) by using a named entity recognition tool to classify and select the candidates that match the expected semantic type of the question; and (ii) by picking candidates that match the focus conveyed by the question.

2. Bipartite list approach: Our approach exploits redundancy to find all answers to the List question, and uses their frequency as a factor to select the correct answers. We build two lists: (i) the Premium List, composed of candidates extracted from the sentences with high relevance to the question, and (ii) the Work List, composed of candidates extracted from the sentences with weak relevance to the question. In each list, the candidates that appear repeatedly are grouped together and their frequency is calculated.


We use two frequency thresholds, one to filter the candidates of the Premium List and the other to filter the candidates of the Work List. The final list of answers is composed of the candidates that pass these thresholds.

3. Heuristics: We developed and applied three heuristics based on word occurrence: (i) the Verb-Rule, which selects a candidate as an answer if the candidate appears in a sentence with the same verb given by the question; (ii) the Title-Rule, which selects as answers all candidates from documents whose title matches the question; and (iii) the Sentence Match-Rule, which selects as answers all candidates extracted from sentences that match the root question.

Our approach for dealing with List questions is novel, since other methods gather the candidates in a single list (or cluster), meaning that their frequency filters are applied to every candidate. Our evaluation has shown that our system achieves better results when compared with other QA systems. Previous approaches use either frequency-based or rule-based filters to select candidates. Our approach combines frequency with rule-based heuristics, which constitutes a novel contribution and shows that a combined approach improves the state-of-the-art. A minimal sketch of this selection scheme is given below.
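The following Python sketch illustrates the bipartite filtering and the three heuristics under simplifying assumptions; it is not the LX-ListQuestion source code, and all function and parameter names are ours. Candidates are assumed to arrive already paired with a flag saying whether their source sentence is highly relevant, and the heuristic rules are reduced to plain token and substring tests.

from collections import Counter

def bipartite_list(candidates, threshold_premium, threshold_work):
    # candidates: iterable of (answer, high_relevance) pairs.
    # Candidates from highly relevant sentences go to the Premium List,
    # the remainder to the Work List; each list has its own threshold.
    premium = Counter(a for a, high in candidates if high)
    work = Counter(a for a, high in candidates if not high)
    final = {a for a, n in premium.items() if n >= threshold_premium}
    final |= {a for a, n in work.items() if n >= threshold_work}
    return final

def heuristics_accept(sentence, doc_title, question_verb, keywords, root_question):
    # Rescue rules applied on top of the frequency filter.
    if question_verb in sentence.split():       # Verb-Rule
        return True
    if all(k in doc_title for k in keywords):   # Title-Rule
        return True
    return root_question in sentence            # Sentence Match-Rule

The two thresholds are free parameters to be tuned empirically; Chapter 5 reports TP = 7 and TW = 1 as the best setting found.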

Developing an approach to deal with Temporal List questions: Our processing method has three main steps (a minimal sketch is given after the three steps):

1. Identifying the temporal expression: The temporal expression is identified using hand-built patterns. We based our patterns on a corpus study in which we identified the most common temporal expressions in the Págico and QALD question datasets. We presented these patterns in detail in Chapter 4.

2. Transforming the temporal expression into a temporal constraint: After the identification of the temporal expression, we have to anchor this temporal expression in a calendar year, defining its temporal boundaries. This is an important design feature, since the temporal expression can be expressed as an absolute time restriction (e.g. “List films of 2014”) or as a relative temporal reference (e.g. “List films after 1995”). Our approach determines the time-range of these two types of temporal expressions.


3. Shallow temporal processing: For each answer candidate, we extract from the full set of documents every temporal expression that co-occurs in the same sentence with that candidate. The candidate is accepted into the final list of answers if any of these temporal expressions lies within the boundary defined by the temporal constraint.
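A minimal Python sketch of the three steps, assuming year-level granularity (our module is restricted to years and centuries); the patterns and names below are illustrative, not the module's actual pattern set:

import re

def temporal_constraint(question, current_year=2015):
    # Steps 1 and 2: find the temporal restrictor and anchor it as a
    # (start, end) range of calendar years; None marks an open boundary.
    if m := re.search(r"antes de (\d{4})", question):         # "before YYYY"
        return None, int(m.group(1)) - 1
    if m := re.search(r"entre (\d{4}) e (\d{4})", question):  # "between YYYY and YYYY"
        return int(m.group(1)), int(m.group(2))
    if m := re.search(r"últimos (\d+) anos", question):       # "in the last N years"
        return current_year - int(m.group(1)), current_year
    if m := re.search(r"\b(\d{4})\b", question):              # absolute "in YYYY"
        return int(m.group(1)), int(m.group(1))
    return None

def satisfies(cooccurring_years, constraint):
    # Step 3: keep a candidate if any year co-occurring with it in a
    # sentence falls inside the anchored range.
    start, end = constraint
    return any((start is None or start <= y) and (end is None or y <= end)
               for y in cooccurring_years)

print(temporal_constraint("Quais são os livros da Danielle Steel antes de 1990?"))
# (None, 1989)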

Temporal List Question Answering is still an under-researched topic. In our review of the state-of-the-art, we found Web-based approaches to temporal questions, but they handle only Factoid Temporal questions. To the best of our knowledge, our contribution is the first working approach that concomitantly addresses Temporal, List and Web-based Question Answering.

Implementing our approaches into a fully-fledged Web-based QA system: The evaluation that was performed has shown the validity of the approach we developed. Additionally, a further important contribution is the implementation of LX-ListQuestion (Gonçalves and Branco, 2014b) as a freely available online service for answering List questions1. The system developed in this dissertation uses Google as a search engine to collect the relevant documents. LX-ListQuestion provides answers in real time without resorting to previously stored information. To account for different styles of querying, the system allows multiple types of input questions: (i) a syntactically correct interrogative sentence; (ii) an imperative sentence; or (iii) a keyword-based query. Being Web-based, LX-ListQuestion is not tied to a fixed pre-processed database. As such, it handles information that may change quickly over time. All these features make LX-ListQuestion a robust Web-based QA system and a valuable contribution to the field of Question Answering.

6.3 Future Research Directions

There are several challenges that can be addressed in the future in order to improve Question Answering in general, and List Questions and Temporal Questions in particular. In our research, we have found that List questions can be even more complex in several ways. Our corpus study of questions from Págico and QALD brings out a few cases of complex List questions:

1 Available at http://lxlistquestion.di.fc.ul.pt


• Relative questions: Usually, this type of question starts with a relative clause constituent (“In which”), and the referent appears explicitly in the question. QALD_025 - In which films directed by Garry Marshall was Julia Roberts starring?

• Multiple question-focus: This is a type of question in which there is more than one focus: “jornais” and/or “revistas” and/or “publicações periódicas”. PAGICO_138 - Jornais, revistas e outras publicações periódicas de Macau. (EN: Newspapers, magazines and other periodical publications of Macau.)

• Negative questions: These are interrogative sentences which contain negation in their phrasing. PAGICO_113 - Ilhas e ilhotas de Cabo Verde que não são habitadas. (EN: Cape Verde islands and islets that are not inhabited.)

In our assessment of the state-of-the-art, we did not find any approach that handles these more complex types of List questions. All these complex types of questions need a different treatment to find the correct answers, for which more elaborate approaches need to be studied and developed.

Concerning List questions more specifically, on the basis of our experiments we learned that merely increasing the corpus size (retrieving a larger number of documents) does not guarantee a higher F-Measure on List questions, since retrieving more candidates also brings in additional wrong answers. Possible ways to resolve this issue remain fairly unexplored. Initial steps towards addressing it were taken by Yang and Chua (2004a), who proposed a framework that applies categorization at the level of the Web page, aiming to extract distinct answers to List questions and attempting to increase recall without sacrificing precision. Since this work was developed, more than 10 years ago, little has been done to tackle this problem, which still leaves room for developing new approaches.

On the basis of our experiments in Chapter 5, we noted that the approaches of RapPortagico, XisQuê and LX-ListQuestion may reinforce each other. LX-ListQuestion is a Web-based QA system that uses redundancy and heuristics to answer List questions. RapPortagico is an off-line QA system that uses Wikipedia to retrieve the answers for List questions. Since it uses a structured resource as its source of information, it can answer with higher precision than LX-ListQuestion. XisQuê is a Web-based QA system that answers factoid questions by selecting the most important paragraphs of the Web pages and extracting the answers through the use of hand-built patterns.

As such, low-frequency precise answers are not necessarily discarded, as may happen when the frequency thresholds used by LX-ListQuestion are applied.

The idea of combining several QA systems for Portuguese has been proposed before, in Carvalho et al. (2010), where a hypothetical combination of six different QA systems for Portuguese was considered. They demonstrate that the hypothetical system would achieve better results, since each individual system was good at answering a certain type of question. To support this suggestion, we built Table 6.1 with an overview of the results obtained in Chapter 5. The last row is the hypothetical combination of LX-ListQuestion, RapPortagico and XisQuê. As we can see, a QA system that combines their approaches can achieve better results and improve the Recall and F-measure metrics.

Systems         | Reference Answers List | Correct Answers | All Answers Retrieved | Recall | Precision | F-Measure
LX-ListQuestion | 340 | 41 | 460 | 0.120 | 0.089 | 0.102
RapPortagico    | 340 | 32 | 327 | 0.097 | 0.100 | 0.098
XisQuê          | 340 | 6  | 40  | 0.014 | 0.128 | 0.026
Combination     | 340 | 72 | 819 | 0.211 | 0.087 | 0.124

Table 6.1: Results overview.
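As an illustration of how such a combination would be scored, the sketch below computes the metrics of a union of systems; the answer sets in the example are toy placeholders, not the actual outputs behind Table 6.1.

def union_metrics(per_system_answers, reference):
    # Combine systems by taking the union of their answer sets
    # (duplicate answers collapse) and score it against the reference.
    union = set().union(*per_system_answers)
    correct = len(union & reference)
    recall = correct / len(reference)
    precision = correct / len(union)
    return recall, precision, 2 * precision * recall / (precision + recall)

reference = {"a", "b", "c", "d"}
print(union_metrics([{"a", "x"}, {"b", "x", "y"}], reference))
# (0.5, 0.5, 0.5): 2 of the 4 reference answers found among 4 distinct retrieved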

Considering Temporal QA, our approach focuses on temporal expressions related to years and centuries. There are many other kinds of temporal expressions, and a more robust approach to their identification should be designed (a toy normalization sketch follows the examples below).

(1) Examples of other temporal expressions:
a. in the Summer of 95.
b. in the last hours of 1999/12/31.
c. in the morning of September 11, 2001.
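One possible direction, sketched under the assumption that such finer-grained expressions are normalized to day-level intervals; the pattern and the two-digit-year pivot below are illustrative only:

import re
from datetime import date

def normalize_summer(expression):
    # Map a "Summer of 95"-style expression to a (start, end) date interval.
    if m := re.search(r"[Ss]ummer of (\d{2,4})", expression):
        year = int(m.group(1))
        if year < 100:      # two-digit year: assume the 1900s
            year += 1900
        # Northern-hemisphere summer, approximated as Jun 21 - Sep 22.
        return date(year, 6, 21), date(year, 9, 22)
    return None

print(normalize_summer("in the Summer of 95"))
# (datetime.date(1995, 6, 21), datetime.date(1995, 9, 22))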

Still regarding temporal expressions, the research on vague temporal expressions needs to be deepened. Schockaert et al. (2006) address the problem of answering questions with vague temporal information: the fact that many historical events cannot be accurately captured by an interval with well-defined boundaries. They propose to build a knowledge base automatically by extracting the information from Wikipedia using a simple pattern-based approach. We believe that the same strategy can be easily replicated for Portuguese, and that a QA system can take advantage of it to improve the results in the Temporal QA field.


(2) Examples of vague temporal expressions in the Portuguese context:
a. during the Age of Portuguese Discovery.
b. before the government of Salazar.
c. after the great Lisbon earthquake.

Complex Temporal List questions are an under-researched topic. Answering such questions is a non-trivial task due to the potential complexity of the questions. In our overview of the state-of-the-art, we did not find any work considering complex temporal expressions in List questions. Essentially, this field is connected to identifying the temporal relations between events, which are explicitly marked by temporal prepositions (before, at, on, starts, etc.). The approach proposed by Schilder and Habel (2003) uses a temporal tagger in order to annotate and automatically extract temporal expressions and their relations with events. More recently, Costa and Branco (2013) have developed a tool named LX-TimeAnalyzer that extracts temporal information from Portuguese text, aimed at finding the following elements: (i) temporal expressions, which are expressions that occur in the input text and that refer to dates and times; (ii) event terms, which are words that refer to events that happen or hold at some point in time; and (iii) temporal relations between these times and events, i.e. the temporal ordering among these entities (before, after, overlap), according to the input text. A QA system integrated with a tool like LX-TimeAnalyzer could improve the state-of-the-art in Temporal QA.

A Question Dataset

This appendix lists the full set of questions used in our experiments. The English translation of the questions is also shown, as well as the list of correct answers found in the reference list.

A.1 Págico Dataset

Below, you may find the subset of questions from the Págico competition, and their correct answers, that were used in our experiments.

Pagico_004 Mulheres violoncelistas de língua portuguesa. Portuguese-language female cellists. Carmen Monarcha, Denise Emmer, Guilhermina Suggia

Pagico_053 Parques do Rio de Janeiro que têm cachoeiras. Parks with waterfalls in Rio de Janeiro. Floresta da Tijuca, Parque Estadual da Ilha Grande, Parque Nacional da Tijuca

Pagico_054 Igrejas do Rio de Janeiro construídas por irmandades ou confrarias de negros. Churches in Rio de Janeiro built by black religious brotherhoods or fraternities. Igreja de Nossa Senhora do Rosário e São Benedito Rio de Janeiro, Rio de Peixe Mina


Pagico_058 Países que venceram a Copa do Mundo em uma disputa de pênaltis. Countries that won the World Cup by penalty shootouts. Brasil, Itália

Pagico_059 Jogadores de basquetebol brasileiros que jogam ou jogaram em campeonatos da NBA. Brazilian basketball players that play or have played in the NBA. Alex Ribeiro Garcia, Anderson Varejão, Leandro Barbosa, Marcus Vinicius de Souza, Maybyner Rodney Hilário, Paulo Sérgio Prestes, Rafael Paulo de Lara Araújo, Tiago Splitter

Pagico_062 Praias de Portugal boas para a prática de surf. Good Portuguese beaches for surfing. Baleal, Cabedelo do Douro, Cortegaça Ovar, Costa da Caparica, Ericeira, Mindelo Vila do Conde, Peniche, Praia Grande Sintra, Praia da Aguçadoura, Praia da Albandeira, Praia da Amoreira, Praia da Areia Branca, Praia da Arrifana, Praia da Barranha, Praia da Consolação, Praia da Foz do Lizandro, Praia da Lagoa, Praia da Lagoa de Albufeira, Praia da Memória Matosinhos, Praia da Nazaré, Praia da Ribeira d'Ilhas, Praia da Salgueira, Praia de Carcavelos, Praia de Santo Amaro de Oeiras, Praia de Vale dos Homens, Praia do Amado, Praia do Areal, Praia do Beliche, Praia do CDS, Praia do Cabedelo Viana do Castelo, Praia do Furadouro, Praia do Guincho, Praia do Medão, Praias de Sesimbra, Sesimbra, São João do Estoril, São Pedro do Estoril

Pagico_063 Estudiosos da música indígena brasileira. Scholars of indigenous Brazilian music. Antonio Ruiz de Montoya, Heitor Villa-Lobos, Jean de Léry, Mário de Andrade



Pagico_085 Destinos turísticos do Brasil cuja temperatura no Inverno pode ser negativa. Brazilian tourist destinations where the winter temperature can be negative. Bento Gonçalves, Blumenau, Cambará do Sul, Campos do Jordão, Canela Rio Grande do Sul, Caxias do Sul, Curitiba, Florianópolis, Foz do Iguaçu, Gramado, Joinville, Londrina, Morro da Igreja, Parque Nacional da Serra Geral, Pico do Jabre, Planalto Serrano, Serra Catarinense, Serra Gaúcha, São José dos Ausentes, Terras Altas da Mantiqueira, Turismo em Santa Catarina

Pagico_086 Compositoras brasileiras de samba. Brazilian samba songwriters. Adriana Calcanhotto, Adryana Ribeiro, Beth Carvalho, Dolores Duran, Dona Ivone Lara, Elvira Pagã, Elza Soares, Leci Brandão, Marisa Monte, Marília Batista, Zélia Duncan

Pagico_088 Cidades portuguesas que têm festivais medievais. Portuguese cities that have medieval festivals. Aljubarrota, Chaves Portugal, Elvas, Fronteira Portugal, Mões, Santa Maria da Feira

Pagico_091 Estados fronteiriços de Moçambique. Mozambican border-states. Malawi, Suazilândia, Tanzânia, Zimbabwe, Zâmbia, África do Sul

Pagico_092 Cidades que fizeram parte do domínio português na India. Cities that were part of the Portuguese Empire in India. Baçaim, Bombaim, Calecute, Cananor, Cochim, Colombo, Coulão, Cranganor, Dadrá e Nagar-Aveli, Damão, Diu, Goa, Goa Velha, Malé, Mangalore, Masulipatão, Pangim, Ribandar, Surate, São Tomé de Meliapor, Vasco da Gama Goa

Pagico_094 Parques nacionais de Moçambique. Mozambican national parks. Parque Nacional da Gorongosa, Parque Nacional das Quirimbas, Parque Nacional do Bazaruto, Parque Nacional do Limpopo

Pagico_097 Escritores cabo-verdianos com obra publicada em crioulo. Cape Verdean writers with published work in creole.


Baltasar Lopes da Silva, Eugénio Tavares, Ivone Ramos, Luís Romano de Madeira Melo, Manuel de Novas, Ovídio Martins, Sérgio Frusoni

Pagico_100 Ilhas de Moçambique. Mozambican islands. Arquipélago das Primeiras e Segundas, Arquipélago de Bazaruto, Bazaruto, Ilha de Moçambique, Ilha de Santa Carolina, Ilha de São Jorge Moçambique, Ilha de Xefina, Ilha do Ibo, Ilha dos Portugueses, Inhaca, Matemo, Quirimbas

Pagico_104 Pesquisadores do folclore brasileiro. Brazilian folklore researchers. Alfredo de Carvalho, Amadeu Amaral, Arthur Ramos, Augusto Meyer, Basílio de Magalhães, Canuto da Costa Azevedo, Celso de Magalhães, Emilia Biancardi, Franklin Cascaes, Gilberto Felisberto Vasconcellos, Glauco Saraiva, Gonçalves Fernandes, Gustavo Barroso, Heitor Villa-Lobos, José Vieira Couto de Magalhães, Lindolfo Gomes, Luciano Gallet, Luís da Câmara Cascudo, Marco Haurélio, Mário Pinto de Andrade, Mário de Andrade, Nereu do Vale Pereira, Oneida Alvarenga, Paixão Côrtes, Raul Lody, Saul Alves Martins, Sílvio Romero, Vicente Chermont de Miranda, Vicente Salles, Waldeloir Rego, Waldemar Henrique da Costa Pereira, Ático Vilas-Boas da Mota

Pagico_106 Vice-reis da India Portuguesa. Viceroys of Portuguese India. Afonso de Albuquerque, Afonso de Bragança, Duque do Porto, Afonso de Noronha, Aires de Saldanha, Antão de Noronha, António Luís Coutinho da Câmara, António de Melo e Castro, Bernardo José Maria Lorena e Silveira, Caetano de Melo e Castro, Constantino de Bragança, Diogo de Sousa, conde de Rio Pardo, Duarte de Meneses, Filipe de Mascarenhas, Francisco Coutinho, Francisco José de Sampaio e Castro, Francisco Teixeira da Silva, Francisco da Gama, Francisco de Almeida vice-rei da Índia, Francisco de Assis de Távora, Francisco de Mascarenhas, Francisco de Távora, Garcia de Noronha, Jerónimo de Azevedo, João Coutinho, João Nunes da Cunha, João da Silva Telo e Meneses, João de Castro, João de Saldanha da Gama, Luís Carlos Inácio Xavier de Meneses, Luís Mascarenhas, Luís de Ataíde, Luís de Mendonça Furtado e Albuquerque, Manuel Francisco Zacarias de Portugal e Castro, Manuel de Saldanha e Albuquerque,


Martim Afonso de Castro, Matias de Albuquerque, Miguel de Noronha, Pedro António de Meneses Noronha de Albuquerque, Pedro Mascarenhas (1470), Pedro Mascarenhas (1670), Pedro Miguel de Almeida Portugal e Vasconcelos, Pedro de Almeida, Rodrigo da Costa, Rui Lourenço de Távora, neto, Rui Lourenço de Távora, Vasco Fernandes César de Meneses, Vasco da Gama

Pagico_108 Jogadores de futebol nascidos em Cabo Verde que representaram a seleção portuguesa. Football players born in Cape Verde who have represented the Portuguese national team. Nani, Oceano da Cruz, Rolando Jorge Pires da Fonseca

Pagico_109 Candidatos a alguma das eleições presidenciais na Guiné-Bissau. Candidates for any presidential election in Guinea-Bissau. Faustino Fudut Imbali, João Bernardo Vieira, Kumba Yalá, Malam Bacai Sanhá

Pagico_111 Padres católicos que estão ou estiveram ativos em Timor. Catholic priests who are or were active in Timor. Alberto Ricardo da Silva, Basílio do Nascimento, Carlos Filipe Ximenes Belo, Jaime Garcia Goulart, Miguel da Cruz Rangel

Pagico_112 Capitais das províncias de Angola. The capitals of Angolan provinces. Benguela, Cabinda cidade, Caxito, Huambo, Kuito, Lubango, Lucapa, Luena Angola, M'Banza Kongo, Malanje, Menongue, N'dalatando, Namibe, Ondjiva, Saurimo, Serpa Pinto Angola, Sumbe, Sá da Bandeira Angola, Uíge

Pagico_116 Escritores lusófonos que passaram temporadas na prisão. Lusophone writers who spent time in prison. Agostinho Neto, Alves Redol, António José da Silva, Aquilino Ribeiro, Armindo José Rodrigues, Astrojildo Pereira, Camilo Castelo Branco, Carlos Coutinho, Chico Anysio, Francisco Antunes Ferreira da Luz, Gerardo Melo Mourão, Graciliano Ramos, Henrique Abranches, Jaime Montestrela, José Luandino Vieira, José Manuel Tengarrinha, Luís Pereira Brandão, Luís de Camões, Manuel Alegre, Maria da Conceição Vassalo e Silva da Cunha Lamas,


Maurício Paiva de Lacerda, Ovídio Martins, Políbio Braga, Álvaro Cunhal

Pagico_118 Escritores moçambicanos que receberam o Prémio Camões. Mozambican writers who have received The Camões Prize. José Craveirinha

Pagico_124 Cabo-verdianos que participaram na guerra colonial na Guiné. Cape Verdeans who participated in the colonial war in Guinea. Amílcar Cabral, Aristides Maria Pereira, Pedro Pires

Pagico_128 Escritores portugueses que tenham vivido em Macau. Portuguese writers who have lived in Macau. Camilo Pessanha, Deolinda do Carmo Salvado da Conceição, José Rodrigues dos Santos, José Silveira Machado, José da Costa Nunes, Luís de Camões, Manuel Teixeira, Maria Ondina Braga, Venceslau de Morais

Pagico_132 Deputados da FRELIMO. FRELIMO’s deputies. Malangatana

Pagico_133 Futebolistas do Petro de Luanda. Petro de Luanda players. Antônio Lebo Lebo, Fabrice Alcebiades Maieco, Felix Katongo, José da Silva Santana Carlos, João Ricardo Pereira Batalha Santos Ferreira, Luís Delgado, Luís Mamona João Lamá, Paulo Batista Nsimba, Yamba Asha

Pagico_140 Cidades lusófonas conhecidas pelo seu Carnaval. Lusophone cities known for their carnival celebrations. Bissau, Caicó, Capim Branco, Carnaval do Rio de Janeiro, Elvas, Estarreja, Fortaleza, Funchal, Guapé, Lapão Bahia, Loulé, Loures, Luanda, Manaus, Mindelo Cabo Verde, Nova Ponte, Olinda, Ovar, Recife, Rio de Janeiro cidade, Salvador Bahia, Sesimbra, Sines, São Paulo cidade, Torres Vedras, Uruguaiana, Vitória Espírito Santo


Pagico_149 Arquitetos de países lusófonos com obras em países estrangeiros na América do Norte e na Europa. Architects from lusophone countries with works in foreign countries in North America and Europe. Gonçalo Byrne, Lúcio Costa, Oscar Niemeyer, Álvaro Siza Vieira

Pagico_153 Toureiros a cavalo de países lusófonos com carreira internacional. Internationally-known bullfighters on horseback from lusophone countries. António Ribeiro Telles, José Mestre Baptista, João Branco Núncio

A.2 QALD Dataset

Below, you may find the subset of questions from the QALD Competition, and their correct answers, that were used in our experiments.

QALD_010 In which country does the Nile start? Em qual país começa o Rio Nilo? Rwanda, Ethiopia

QALD_028 Give me all communist countries. Liste todos paises comunistas. Republic of China, Republic of Cuba, Lao Peoples Democratic Republic, Socialist Republic of Vietnam

QALD_032 Which countries adopted the Euro? Quais são os paises que adotaram o euro? Austria, Belgium, Cyprus, Estonia, Finland, France, Germany, Greece, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Portugal, Slovakia, Slovenia, Spain

QALD_036 Through which countries does the Yenisei river flow? Por quais paises o rio Yenisei corre? Mongolia, Russia


QALD_062 Who created Wikipedia? Quais são os criadores da Wikipedia? Jimmy Wales, Larry Sanger

QALD_074 Which capitals in Europe were host cities of the summer olympic games? Quais as capitais na Europa que hospedaram o Jogos Olímpicos de Verão? Amsterdam, Athens, Berlin, Helsinki, London, Moscow, Paris, Rome, Stockholm

QALD_114 Give me all members of Prodigy. Cite todos membros do Prodigy. Liam Howlett, Keith Flint, Maxim Reality

QALD_141 Who founded Intel? Quais são os fundadores da Intel? Robert Noyce, Gordon Moore

QALD_155 Which Greek goddesses dwelt on Mount Olympus? Quais Deusas gregas moravam no Monte Olimpo? Aphrodite, Athena, Hera, Eileithyia, Hygieia, Hebe (mythology), Nike (mythology)

QALD_176 List the children of Margaret Thatcher. Liste os filhos de Margaret Thatcher. Carol Thatcher, Mark Thatcher

A.3 Temporal Dataset

Below, you may find the subset of questions, and their correct answers, that were used in our experiments on Temporal List Questions.

Pagico_008 Telenovelas brasileiras passadas no tempo da escravatura no Brasil. Brazilian soap operas set in the time of slavery in Brazil. A Escrava Isaura (2004), Banzo, Dona Beija, Escrava Isaura (1976), Força de um Desejo, Helena (1952), Helena (1975), Helena (1987), Pacto de Sangue, Paixões Proibidas, Sangue do Meu Sangue, Sangue do Meu Sangue (1969), Sangue do Meu Sangue (1995), Sinhazinha Flô, Sinhá Moça (1986), Sinhá Moça (2006), Xica da Silva telenovela


Pagico_034 Viajantes ou exploradores que escreveram sobre o Brasil do século XVI. Travellers or explorers who wrote about Brazil in the 16th century. André Thévet, Binot Paulmier de Gonneville, Gabriel Soares de Sousa, Gaspar de Carvajal, Hans Staden, James Lancaster, Jean de Léry, Pero Vaz de Caminha

Pagico_050 Jornais que circularam no Rio de Janeiro entre 1910 e 1960. Newspapers in circulation in Rio de Janeiro between 1910 and 1960. Correio da Manhã Brasil, Diário Carioca, Diário da Noite Rio de Janeiro, Jornal das Moças, Jornal do Brasil, Jornal do Commercio, Monitor Campista, Mundo Sportivo, O Fluminense, O Globo, O Paiz, Tribuna da Imprensa, Tribuna de Petrópolis, Última Hora

Pagico_068 Bandas brasileiras de punk formadas até 1980 em São Paulo. Brazilian punk bands formed before 1980 in São Paulo. AI-5 banda, Condutores de Cadáver, Cólera banda, DZK, Lixomania, Olho Seco, Ratos de Porão, Restos de Nada, Ulster banda

Pagico_078 Escritoras de língua portuguesa que tenham publicado livros para crianças entre 1850 e 1940. Female Portuguese-language authors who published children's books between 1850 and 1940. Ana de Castro Osório, Maria Amália Vaz de Carvalho, Maria da Conceição Vassalo e Silva da Cunha Lamas

TP_001 Quais eram os partidos politicos existentes antes de 1964? Which were the political parties that existed before 1964? Partido Comunista Brasileiro, Partido Trabalhista Nacional, Partido Trabalhista Brasileiro, Partido Socialista Brasileiro, Partido Comunista do Brasil, Partido do Movimento Democrático Brasileiro, Aliança Renovadora Nacional

TP_002 Quem ganhou o premio Nobel entre 1900 a 1920? Who won the Nobel Prize between 1900 and 1920? Adolf von Baeyer, Albert Abraham Michelson, Alexis Carrel, Alfred Hermann Fried, Alfred Werner, Allvar Gullstrand, August Krogh, Auguste Marie, Bertha von Suttner, Bjørnstjerne Bjørnson, Camillo Golgi, Carl Spitteler, Charles Albert Gobat, Charles Édouard Guillaume, Charles Glover Barkla, Charles Louis Alphonse Laveran,


Charles Richet, Eduard Buchner, Élie Ducommun, Elihu Root, Emil Adolf von Behring, Emil Theodor Kocher, Ernest Rutherford, Ernesto Teodoro Moneta, François Beernaert, Frédéric Mistral, Frédéric Passy, Fredrik Bajer, Fritz Haber, Gabriel Lippmann, Giosuè Carducci, Guglielmo Marconi, Gustaf Dalén, Heike Kamerlingh Onnes, Hendrik Lorentz, Henri Becquerel, Henri La Fontaine, Henri Moissan, Hermann Emil Fischer, Ilya Ilyich Mechnikov, Ivan Pavlov, J. J. Thomson, Jacobus Henricus van 't Hoff, Johannes Diderik van der Waals, Johannes Stark, John William Strutt, José Echegaray, Jules Bordet, Karl Adolph Gjellerup, Karl Ferdinand Braun, Klas Pontus Arnoldson, Louis Renault, Marie Curie, Maurice Maeterlinck, Max Planck, Max von Laue, Niels Ryberg Finsen, Paul Ehrlich, Paul Heyse, Paul Sabatier, Paul-Henri-Benjamin d'Estournelles de Constant, Philipp Lenard, Pierre Curie, Pieter Zeeman, Richard Willstätter, Robert Bárány, Robert Koch, Ronald Ross, Rudyard Kipling, Santiago Ramón y Cajal, Selma Lagerlöf, Svante Arrhenius, Theodore Roosevelt, Theodore William Richards, Tobias Asser, Victor Grignard, Walther Nernst, Wilhelm Ostwald, Wilhelm Röntgen, Wilhelm Wien, William Henry Bragg, William Lawrence Bragg, William Ramsay, William Randal Cremer, Woodrow Wilson

TP_003 Quais foram as novelas brasileiras dos últimos 5 anos? Which were the Brazilian soap operas of the last 5 years? Em família, Amor à vida, Salve Jorge, Avenida Brasil, Fina Estampa, Insensato Coração, Passione, Viver a vida, Caminho das Índias, A Favorita

TP_004 Que países boicotaram os Jogos Olímpicos de 1980? Which countries boycotted the 1980 Olympic Games? Albânia, Antilhas Holandesas, Argentina, Bahamas, Brunei, Bangladesh, Barbados, Belize, Bermudas, Bolívia, Canadá, Ilhas Cayman, República Centro-Africana, Chade, Chile, China, Taipé Chinesa, Costa do Marfim, Egito, El Salvador, Fiji, Gabão, Gâmbia, Gana, Haiti, Honduras, Hong Kong, Indonésia, Irã, Israel, Japão, Quênia, Coreia do Sul, Libéria, Liechtenstein, Malawi, Malásia, Maurícia, Mónaco, Marrocos, Antilhas Holandesas, Níger, Noruega, Paquistão, Panamá, Papua-Nova Guiné, Paraguai, Filipinas, Catar, Arábia Saudita, Singapura, Somália, Sudão, Suriname, Suazilândia,


Tailândia, Togo, Tunísia, Turquia, Emirados Árabes Unidos, Estados Unidos, Uruguai, Ilhas Virgens Americanas, Alemanha Ocidental, Zaire

TP_005 Quais são as bandas brasileiras de rock dos anos 80? Which were the Brazilian rock bands of the 80s? Aborto Elétrico, Barão Vermelho, Biquini Cavadão, Blitz, Brylho, Camisa de Vênus, Capital Inicial, Cascavelettes, Defalla, Engenheiros do Hawaii, Fausto Fawcett e os robôs efêmeros, Garotos da rua, Garotos Podres, Hanoi Hanoi, Heróis da Resistência, Herva Doce, Inimigos do Rei, Ira, João Penca e os seus miquinhos amestrados, Kid Abelha, Legião Urbana, Lobão e os Ronaldos, Nenhum de nós, Paralamas do Sucesso, Plebe Rude, Rádio Taxi, Replicantes, Roupa nova, Rpm, Titãs, Tókio, Ultraje a rigor, Cazuza, Eduardo Dusek, Ed Motta, Guilherme Arantes, Kiko Zambianchi, Léo Jaime, Lobão, Lulu Santos, Marina, Ritchie, Angra


B LX-ListQuestion Answers

Below we show the output of the LX-ListQuestion QA system for the question dataset used in the experiments described in Chapter 5 (Evaluation).

Figure B.1: LX-ListQuestion QA system answering the question Pagico_004


Figure B.2: LX-ListQuestion QA system answering the question Pagico_054

Figure B.3: LX-ListQuestion QA system answering the question Pagico_062

Figure B.4: LX-ListQuestion QA system answering the question Pagico_086

Figure B.5: LX-ListQuestion QA system answering the question Pagico_088

Figure B.6: LX-ListQuestion QA system answering the question Pagico_100

Figure B.7: LX-ListQuestion QA system answering the question Pagico_109


Figure B.8: LX-ListQuestion QA system answering the question Pagico_112

Figure B.9: LX-ListQuestion QA system answering the question Pagico_133

Figure B.10: LX-ListQuestion QA system answering the question Pagico_140


Figure B.11: LX-ListQuestion QA system answering the question QALD_010

Figure B.12: LX-ListQuestion QA system answering the question QALD_028

Figure B.13: LX-ListQuestion QA system answering the question QALD_032

Figure B.14: LX-ListQuestion QA system answering the question QALD_036

Figure B.15: LX-ListQuestion QA system answering the question QALD_062

Figure B.16: LX-ListQuestion QA system answering the question QALD_074

Figure B.17: LX-ListQuestion QA system answering the question QALD_114

Figure B.18: LX-ListQuestion QA system answering the question QALD_141

Figure B.19: LX-ListQuestion QA system answering the question QALD_155

Figure B.20: LX-ListQuestion QA system answering the question QALD_176

Figure B.21: LX-ListQuestion QA system answering the question TP_001

Figure B.22: LX-ListQuestion QA system answering the question TP_002

Figure B.23: LX-ListQuestion QA system answering the question TP_003

Figure B.24: LX-ListQuestion QA system answering the question TP_004

Figure B.25: LX-ListQuestion QA system answering the question TP_005

C XisQuê Answers

Below we show the output of the XisQuê QA system for the question dataset used in the experiments described in Chapter 5 (Evaluation).

Figure C.1: XisQuê QA system answering the question Pagico_004


Figure C.2: XisQuê QA system answering the question Pagico_054

Figure C.3: XisQuê QA system answering the question Pagico_062

Figure C.4: XisQuê QA system answering the question Pagico_086

Figure C.5: XisQuê QA system answering the question Pagico_088


Figure C.6: XisQuê QA system answering the question Pagico_100

Figure C.7: XisQuê QA system answering the question Pagico_109

Figure C.8: XisQuê QA system answering the question Pagico_112

Figure C.9: XisQuê QA system answering the question Pagico_133


Figure C.10: XisQuê QA system answering the question Pagico_140

Figure C.11: XisQuê QA system answering the question TP_001

Figure C.12: XisQuê QA system answering the question TP_002

Figure C.13: XisQuê QA system answering the question TP_003


Figure C.14: XisQuê QA system answering the question TP_004

Figure C.15: XisQuê QA system answering the question TP_005

D START Answers

Below we show the output of the START QA system for the question dataset used in the experiments described in Chapter 5 (Evaluation).

Figure D.1: START QA system answering the question QALD_010


Figure D.2: START QA system answering the question QALD_028

Figure D.3: START QA system answering the question QALD_032

Figure D.4: START QA system answering the question QALD_036

Figure D.5: START QA system answering the question QALD_062

Figure D.6: START QA system answering the question QALD_074


Figure D.7: START QA system answering the question QALD_114

Figure D.8: START QA system answering the question QALD_141

Figure D.9: START QA system answering the question QALD_155

Figure D.10: START QA system answering the question QALD_176


E Wolfram Alpha Answers

Below we show the output of the Wolfram Alpha QA system for the questions in the dataset used in the experiments described in Chapter 5 - Evaluation.

Figure E.1: Wolfram Alpha QA system answering the question QALD_010


Figure E.2: Wolfram Alpha QA system answering the question QALD_028

Figure E.3: Wolfram Alpha QA system answering the question QALD_032


Figure E.4: Wolfram Alpha QA system answering the question QALD_036

Figure E.5: Wolfram Alpha QA system answering the question QALD_062


Figure E.6: Wolfram Alpha QA system answering the question QALD_074

Figure E.7: Wolfram Alpha QA system answering the question QALD_114


Figure E.8: Wolfram Alpha QA system answering the question QALD_141

Figure E.9: Wolfram Alpha QA system answering the question QALD_155

Figure E.10: Wolfram Alpha QA system answering the question QALD_176


References

AHN, DAVID, STEVEN SCHOCKAERT, MARTINE DE COCK AND ETIENNE KERRE, 2006. Supporting Temporal Question Answering: Strategies for Offline Data Collection. In Proceedings of the Fifth International Workshop on Inference in Computational Semantics (ICoS-5). [Cited at pg. 39, 41, 42, 73, 79]

AHN, KISUH, JOHAN BOS, DAVID KOR, MALVINA NISSIM, BONNIE L. WEBBER AND JAMES R. CURRAN, 2005. Question Answering with QED at TREC 2005. In ELLEN M. VOORHEES AND LORI P. BUCKLAND, editors, Proceedings of the Fourteenth Text REtrieval Conference, TREC 2005, Gaithersburg, Maryland, November 15-18, 2005, volume Special Publication 500-266. National Institute of Standards and Technology (NIST). [Cited at pg. 30]

ALLEN, JAMES F., 1983. Maintaining Knowledge About Temporal Intervals. Communications of the ACM, 26(11):832–843. ISSN 0001-0782. [Cited at pg. xvii, 74, 75]

AMARAL, CARLOS, ADÁN CASSAN, HELENA FIGUEIRA, ANDRÉ MARTINS, AFONSO MENDES, PEDRO MENDES, JOSÉ PINA AND CLÁUDIA PINTO, 2009. Priberam's Question Answering System in QA@CLEF 2008. In Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, CLEF'08, pages 337–344. Springer-Verlag, Berlin, Heidelberg. ISBN 3-642-04446-8, 978-3-642-04446-5. [Cited at pg. 25, 26, 27]

AMARAL, CARLOS, HELENA FIGUEIRA, ANDRÉ MARTINS, AFONSO MENDES, PEDRO MENDES AND CLÁUDIA PINTO, 2006. Priberam's Question Answering System for Portuguese. In Proceedings of the 6th International Conference on Cross-Language Evaluation Forum: Accessing Multilingual Information Repositories, CLEF'05, pages 410–419. Springer-Verlag, Berlin, Heidelberg. ISBN 3-540-45697-X, 978-3-540-45697-1. [Cited at pg. 25]


BANKO, MICHELE, ERIC BRILL, SUSAN DUMAIS AND JIMMY LIN, 2002. AskMSR: Question Answering Using the Worldwide Web. In Proceedings of the 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases, pages 7–9. [Cited at pg. 34, 37]

BRANCO, ANTÓNIO, LINO RODRIGUES, JOÃO SILVA AND SARA SILVEIRA, 2008a. Real-Time Open-Domain QA on the Portuguese Web. In Proceedings of the 11th Ibero-American Conference on AI (IBERAMIA '08), pages 322–331. Springer-Verlag, Berlin, Heidelberg. ISBN 978-3-540-88308-1. [Cited at pg. 26, 27]

BRANCO, ANTÓNIO, LINO RODRIGUES, JOÃO SILVA AND SARA SILVEIRA, 2008b. XisQuê: An Online QA Service for Portuguese. In Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language (PROPOR'08), pages 232–235. Springer-Verlag, Berlin, Heidelberg. ISBN 978-3-540-85979-6. [Cited at pg. 26]

BRANCO, ANTÓNIO AND JOÃO SILVA, 2006. A suite of shallow processing tools for Portuguese: LX-Suite. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, EACL '06, pages 179–182. Association for Computational Linguistics, Stroudsburg, PA, USA. [Cited at pg. 53]

CARDOSO, NUNO, DAVID BATISTA, FRANCISCO J. LÓPEZ-PELLICER AND MÁRIO J. SILVA, 2009. Where in the Wikipedia Is That Answer? The XLDB at the GikiCLEF 2009 Task. In CAROL PETERS, GIORGIO MARIA DI NUNZIO, MIKKO KURIMO, DJAMEL MOSTEFA, ANSELMO PEÑAS AND GIOVANNA RODA, editors, CLEF, volume 6241 of Lecture Notes in Computer Science, pages 305–309. Springer. ISBN 978-3-642-15753-0. [Cited at pg. 30, 31, 32]

CARDOSO, NUNO, IUSTIN DORNESCU, SVEN HARTRUMPF AND JOHANNES LEVELING, 2010. Revamping Question Answering with a Semantic Approach over World Knowledge. In MARTIN BRASCHLER, DONNA HARMAN AND EMANUELE PIANTA, editors, CLEF 2010 LABs and Workshops, Notebook Papers, 22-23 September 2010, Padua, Italy. ISBN 978-88-904810-0-0. [Cited at pg. 30]

CARVALHO, GRACINDA, DAVID DE MATOS AND VITOR ROCIO, 2009. IdSay: Question Answering for Portuguese. In CAROL PETERS, THOMAS DESELAERS, NICOLA FERRO, JULIO GONZALO, GARETH JONES, MIKKO KURIMO, THOMAS MANDL, ANSELMO PEÑAS AND VIVIEN PETRAS, editors, Evaluating Systems for Multilingual and Multimodal Information Access, volume 5706 of Lecture Notes in Computer Science, pages 345–352. Springer Berlin Heidelberg. ISBN 978-3-642-04446-5. [Cited at pg. 25, 26, 27]

CARVALHO, GRACINDA, DAVID DE MATOS AND VITOR ROCIO, 2010. Improving IdSay: A Characterization of Strengths and Weaknesses in Question Answering Systems for Portuguese. In THIAGO PARDO, ANTÓNIO BRANCO, ALDEBARO KLAUTAU, RENATA VIEIRA AND VERA DE LIMA, editors, Computational Processing of the Portuguese Language, volume 6001 of Lecture Notes in Computer Science, pages 1–10. Springer Berlin Heidelberg. ISBN 978-3-642-12319-1. [Cited at pg. 25, 133]

CASSAN, ADÁN, HELENA FIGUEIRA, ANDRÉ F. T. MARTINS, AFONSO MENDES, PEDRO MENDES, CLÁUDIA PINTO AND DANIEL VIDAL, 2006. Priberam's Question Answering System in a Cross-Language Environment. In CAROL PETERS, PAUL CLOUGH, FREDRIC C. GEY, JUSSI KARLGREN, BERNARDO MAGNINI, DOUGLAS W. OARD, MAARTEN DE RIJKE AND MAXIMILIAN STEMPFHUBER, editors, Proceedings of the 6th International Conference on Cross-Language Evaluation Forum: Accessing Multilingual Information Repositories, volume 4730 of Lecture Notes in Computer Science, pages 300–309. Springer. ISBN 978-3-540-74998-1. [Cited at pg. 25]

CLARKE, CHARLES L. A., GORDON V. CORMACK AND THOMAS R. LYNAM, 2001. Exploiting Redundancy in Question Answering. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 358–365. ACM, New York, NY, USA. ISBN 1-58113-331-6. [Cited at pg. 34, 37]

COHEUR, LUÍSA, ANA MENDES, JOÃO GUIMARÃES, NUNO J. MAMEDE AND RICARDO RIBEIRO, 2009. Question interpretation in QA@L2F. In Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, CLEF'08, pages 377–384. Springer-Verlag, Berlin, Heidelberg. ISBN 3-642-04446-8, 978-3-642-04446-5. [Cited at pg. 25, 26, 27]


COSTA, FRANCISCO, 2004. Verbal Conjugation in Portuguese. Internal report, Department of Informatics. [Cited at pg. 52]

COSTA, FRANCISCO AND ANTÓNIO BRANCO, 2013. Full-fledged temporal processing: bridging the gap between deep linguistic processing and temporal extraction. Journal of Language Modelling, 1(1):97–154. [Cited at pg. 134]

COSTA, LUÍS FERNANDO, 2005. Esfinge - Resposta a perguntas usando a Rede. In Proceedings of IADIS Ibero-Americana, pages 616–619. IADIS Press. [Cited at pg. 24, 26]

COSTA, LUÍS FERNANDO, 2006. Esfinge: a Question Answering System in the Web using the Web. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006). The Association for Computer Linguistics. ISBN 1-932432-59-0. [Cited at pg. 24, 27]

CUCERZAN, SILVIU AND EUGENE AGICHTEIN, 2005. Factoid Question Answering over Unstructured and Structured Web Content. In ELLEN M. VOORHEES AND LORI P. BUCKLAND, editors, Proceedings of the Text REtrieval Conference (TREC), volume Special Publication 500-266. National Institute of Standards and Technology (NIST). [Cited at pg. 9]

DALMAS, TIPHAINE AND BONNIE L. WEBBER, 2007. Answer Comparison in Automated Question Answering. Journal of Applied Logic, 5(1):104–120. [Cited at pg. 30]

DORNESCU, IUSTIN, 2009. Semantic QA for Encyclopaedic Questions: EQUAL in GikiCLEF. In CAROL PETERS, GIORGIO MARIA DI NUNZIO, MIKKO KURIMO, DJAMEL MOSTEFA, ANSELMO PEÑAS AND GIOVANNA RODA, editors, CLEF, volume 6241 of Lecture Notes in Computer Science, pages 326–333. Springer. ISBN 978-3-642-15753-0. [Cited at pg. 31, 32]

DUMAIS, SUSAN, MICHELE BANKO, ERIC BRILL, JIMMY LIN AND ANDREW NG, 2002. Web Question Answering: Is More Always Better? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '02, pages 291–298. ACM, New York, NY, USA. ISBN 1-58113-561-0. [Cited at pg. 34, 102]


FERREIRA, EDUARDO, JOÃO BALSA AND ANTÓNIO BRANCO, 2007. Combining Rule-Based and Statistical Models for Named Entity Recognition of Portuguese. In Proceedings of Workshop em Tecnologia da Informação e de Linguagem Natural, pages 1615–1624. [Cited at pg. 53]

FREITAS, CLÁUDIA, 2012. A lusofonia na Wikipédia em 150 tópicos. Linguamática, 4(1):9–18. ISSN 1647-0818. [Cited at pg. 98]

GAIZAUSKAS, ROBERT J., MARK A. GREENWOOD, HENK HARKEMA, MARK HEPPLE, HORACIO SAGGION AND ATHEESH SANKA, 2005. The University of Sheffield's TREC 2005 Q&A Experiments. In ELLEN M. VOORHEES AND LORI P. BUCKLAND, editors, Proceedings of the Fourteenth Text REtrieval Conference, TREC 2005, Gaithersburg, Maryland, November 15-18, 2005, volume Special Publication 500-266. National Institute of Standards and Technology (NIST). [Cited at pg. 27, 32]

GONÇALVES, PATRICIA AND ANTÓNIO BRANCO, 2014a. Answering List Questions using Web as a corpus. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 81–84. Association for Computational Linguistics, Gothenburg, Sweden. [Cited at pg. 129]

GONÇALVES, PATRICIA NUNES AND ANTÓNIO BRANCO, 2014b. Open-Domain Web-Based List Question Answering with LX-ListQuestion. In Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics, WIMS '14, pages 43:1–43:6. ACM, New York, NY, USA. ISBN 978-1-4503-2538-7. [Cited at pg. 131]

GREEN, BERT F., JR., ALICE K. WOLF, CAROL CHOMSKY AND KENNETH LAUGHERY, 1961. Baseball: an Automatic Question-Answerer. In Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference, IRE-AIEE-ACM '61 (Western), pages 219–224. ACM, New York, NY, USA. [Cited at pg. 3]

GUDA, VANITHA, SURESH SANAMPUDI AND LAKSHMI MANIKYAMBA, 2011. Approaches for Question Answering Systems. International Journal of Engineering Science and Technology (IJEST), 3(2):990–995. ISSN 0975-5462. [Cited at pg. 9]

HARABAGIU, SANDA AND COSMIN ADRIAN BEJAN, 2005. Question Answering Based on Temporal Inference. In Proceedings of the AAAI-2005 Workshop on Inference for Textual Question Answering. [Cited at pg. 12, 40, 42, 76, 78, 79, 80]


HARTRUMPF, SVEN, 2008. Semantic Decomposition for Question Answering. In MALIK GHALLAB, CONSTANTINE D. SPYROPOULOS, NIKOS FAKOTAKIS AND NIKOLAOS M. AVOURIS, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 313–317. IOS Press. ISBN 978-1-58603-891-5. [Cited at pg. 31]

HARTRUMPF, SVEN AND JOHANNES LEVELING, 2010. GIRSA-WP at GikiCLEF: Integration of structured information and decomposition of questions. In 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece, September 30 - October 2, Revised Selected Papers, Lecture Notes in Computer Science (LNCS). Springer. (to appear). [Cited at pg. 31, 32]

HICKL, ANDREW, KIRK ROBERTS, BRYAN RINK, JEREMY BENSLEY, TOBIAS JUNGEN, YING SHI AND JOHN WILLIAMS, 2007. Question Answering with LCC's Chaucer-2 at TREC 2007. In Proceedings of the Sixteenth Text REtrieval Conference. [Cited at pg. 28]

HICKL, ANDREW, JOHN WILLIAMS, JEREMY BENSLEY, KIRK ROBERTS, YING SHI AND BRYAN RINK, 2006. Question Answering with LCC's CHAUCER at TREC 2006. In TREC. [Cited at pg. 28, 32]

KAISSER, MICHAEL AND TILMAN BECKER, 2004. Question Answering by Searching Large Corpora With Linguistic Methods. In ELLEN M. VOORHEES AND LORI P. BUCKLAND, editors, TREC, volume Special Publication 500-261. National Institute of Standards and Technology (NIST). [Cited at pg. 9, 35, 37, 38]

KATZ, BORIS, 1997. Annotating the World Wide Web using Natural Language. In LUC DEVROYE AND CLAUDE CHRISMENT, editors, Computer-Assisted Information Retrieval - RIAO 1997, 5th International Conference, McGill University, Montreal, Canada, June 25-27, 1997, pages 136–159. [Cited at pg. 33, 37, 111]

KATZ, BORIS, JIMMY J. LIN, DANIEL LORETO, WESLEY HILDEBRANDT, MATTHEW W. BILOTTI, SUE FELSHIN, AARON FERNANDES, GREGORY MARTON AND FEDERICO MORA, 2003. Integrating Web-based and Corpus-based Techniques for Question Answering. In Proceedings of the 14th Annual Text Retrieval Conference, pages 426–435. [Cited at pg. 33]


KATZ, BORIS, GREGORY MARTON, GARY BORCHARDT, ALEXIS BROWNELL, SUE FELSHIN, DANIEL LORETO, JESSE LOUIS-ROSENBERG, BEN LU, FEDERICO MORA, STEPHAN STILLER, ÖZLEM UZUNER AND ANGELA WILCOX, 2005. External Knowledge Sources for Question Answering. In Proceedings of the 14th Annual Text Retrieval Conference (TREC'2005). [Cited at pg. 37, 38]

KOR, KIAN WEI, 2005. Improving Answer Precision And Recall of List Questions. Master's thesis, University of Edinburgh, School of Informatics. [Cited at pg. 30, 32]

KUPIEC, JULIAN, 1993. MURAX: A Robust Linguistic Approach for Question Answering Using an On-Line Encyclopedia. In Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, USA, June 27 - July 1, 1993, pages 181–190. [Cited at pg. 33, 37]

KWOK, CODY, OREN ETZIONI AND DANIEL S. WELD, 2001. Scaling Question Answering to the Web. ACM Transactions on Information Systems (TOIS), 19(3):242–262. ISSN 1046-8188. [Cited at pg. 9, 34, 37]

LEHNERT, WENDY G., 1978. The Process of Question Answering: A Computer Simulation of Cognition. Lawrence Erlbaum Associates, Hillsdale, NJ. Based on 1977 Yale PhD thesis. [Cited at pg. 3]

LLORET, ELENA, HECTOR LLORENS, PALOMA MOREDA, ESTELA SAQUETE AND MANUEL PALOMAR, 2011. Text Summarization Contribution to Semantic Question Answering: New Approaches for Finding Answers on the Web. International Journal of Intelligent Systems, 26(12):1125–1152. ISSN 0884-8173. [Cited at pg. 36, 37]

MAGNINI, BERNARDO, MATTEO NEGRI, ROBERTO PREVETE AND HRISTO TANEV, 2001. Multilingual Question/Answering: the DIOGENE System. In Proceedings of the 10th Text Retrieval Conference. [Cited at pg. 35, 37]

MAZIERO, ERICK G., THIAGO A. S. PARDO, ARIANI DI FELIPPO AND BENTO C. DIAS-DA SILVA, 2008. A Base de Dados Lexical e a Interface Web do TeP 2.0: Thesaurus Eletrônico para o Português do Brasil. In Proceedings of the XIV Brazilian Symposium on Multimedia and the Web, WebMedia '08, pages 390–392. ACM, New York, NY, USA. ISBN 978-85-7669-199-0. [Cited at pg. 53]


MOLDOVAN, DAN, CHRISTINE CLARK AND SANDA HARABAGIU, 2005. Temporal Context Representation and Reasoning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). [Cited at pg. 41, 42, 83]

MOLDOVAN, DAN, SANDA HARABAGIU, MARIUS PAŞCA, RADA MIHALCEA, ROXANA GIRJU, RICHARD GOODRUM AND VASILE RUS, 2000. The Structure and Performance of an Open-domain Question Answering System. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL '00, pages 563–570. Association for Computational Linguistics, Stroudsburg, PA, USA. [Cited at pg. 58]

MOTA, CRISTINA, 2012. Resultados págicos: participação, medidas e pontuação. Linguamática, 4(1):77–91. ISSN 1647-0818. [Cited at pg. 26]

PAŞCA, MARIUS, 2003. Open-Domain Question Answering from Large Text Collections. CSLI Studies in Computational Linguistics. CSLI, Stanford, California. [Cited at pg. 2, 54]

PAŞCA, MARIUS, 2008. Towards Temporal Web Search. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC '08, pages 1117–1121. ACM, New York, NY, USA. ISBN 978-1-59593-753-7. [Cited at pg. 11, 39, 42, 80]

PINTO, DAVID, MICHAEL BRANSTEIN, RYAN COLEMAN, W. BRUCE CROFT, MATTHEW KING, WEI LI AND XING WEI, 2002. QuASM: A System for Question Answering Using Semi-Structured Data. In Proceedings of the Joint Conference on Digital Libraries (JCDL) 2002, pages 46–55. [Cited at pg. 35, 37]

PUSTEJOVSKY, J., J. WIEBE AND M. MAYBURY, 2002. Multi-Perspective and Temporal Question Answering. In Proceedings of the Conference on International Language Resources and Evaluation - Workshop on Question Answering: Strategy and Resources. [Cited at pg. 79]

PUSTEJOVSKY, JAMES, JOSÉ CASTAÑO, ROBERT INGRIA, ROSER SAURÍ, ROBERT GAIZAUSKAS, ANDREA SETZER AND GRAHAM KATZ, 2003. TimeML: Robust specification of event and temporal expressions in text. In Fifth International Workshop on Computational Semantics (IWCS-5). [Cited at pg. 75]


RADEV, DRAGOMIR, WEIGUO FAN, HONG QI, HARRIS WU AND AMARDEEP GREWAL, 2002. Probabilistic Question Answering on the Web. In Proceedings of the 11th International Conference on World Wide Web, WWW '02, pages 408–419. ACM, New York, NY, USA. ISBN 1-58113-449-5. [Cited at pg. 9, 35, 37, 39]

RADEV, DRAGOMIR AND BETH SUNDHEIM, 2002. Using TimeML in Question Answering. [Cited at pg. 38, 42, 76, 79, 82]

RAZMARA, MAJID AND LEILA KOSSEIM, 2008. Answering List Questions using Co-occurrence and Clustering. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco. European Language Resources Association. [Cited at pg. 29, 32]

RODRIGUES, LINO MIGUEL SILVA, 2007. Infra-Estruturas de um serviço online de resposta-a-perguntas com base na web portuguesa. Master's thesis, Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática. [Cited at pg. 26]

RODRIGUES, RICARDO AND HUGO OLIVEIRA, 2012. Uma abordagem ao Págico baseada no Processamento e Análise de Sintagmas dos Tópicos. Linguamática, 4(1):31–39. ISSN 1647-0818. [Cited at pg. 26, 106]

SAIAS, JOSÉ AND PAULO QUARESMA, 2009. The Senso Question Answering System at QA@CLEF 2008. In Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, CLEF'08, pages 337–344. Springer-Verlag, Berlin, Heidelberg. ISBN 3-642-04446-8, 978-3-642-04446-5. [Cited at pg. 24, 26, 27]

SAQUETE, ESTELA, JOSÉ LUIS VICEDO GONZÁLEZ, PATRICIO MARTÍNEZ-BARCO, RAFAEL MUÑOZ AND HECTOR LLORENS, 2009. Enhancing QA Systems with Complex Temporal Question Processing Capabilities. Journal of Artificial Intelligence Research (JAIR), 35:775–811. [Cited at pg. 41, 42, 78, 79, 80, 82]

SARMENTO, LUÍS, 2006. Hunting Answers with RAPOSA (FOX). In ALESSANDRO NARDI, CAROL PETERS AND JOSÉ LUÍS VICEDO, editors, Cross Language Evaluation Forum: Working Notes for the CLEF 2006 Workshop. Springer. [Cited at pg. 25]


SARMENTO, LUÍS, JORGE TEIXEIRA AND C. OLIVEIRA, 2008a. Experiments with Query Expansion in the Raposa (Fox) Question Answering System. In CAROL PETERS AND F. BORRI, editors, Cross Language Evaluation Forum: Working Notes for the CLEF 2008 Workshop. Springer. [Cited at pg. 25, 26]

SARMENTO, LUÍS, JORGE TEIXEIRA AND EUGÉNIO C. OLIVEIRA, 2008b. Assessing the Impact of Thesaurus-Based Expansion Techniques in QA-Centric IR. In CAROL PETERS, THOMAS DESELAERS, NICOLA FERRO, JULIO GONZALO, GARETH J. F. JONES, MIKKO KURIMO, THOMAS MANDL, ANSELMO PEÑAS AND VIVIEN PETRAS, editors, Cross Language Evaluation Forum: Working Notes for the CLEF, volume 5706 of Lecture Notes in Computer Science, pages 325–332. Springer. ISBN 978-3-642-04446-5. [Cited at pg. 27]

SAURÍ, ROSER, JESSICA LITTMAN, ROBERT GAIZAUSKAS, ANDREA SETZER AND JAMES PUSTEJOVSKY, 2006. TimeML Annotation Guidelines, Version 1.2.1. [Cited at pg. 75]

SCHILDER, FRANK AND CHRISTOPHER HABEL, 2003. Temporal Information Extraction for Temporal Question Answering. In MARK T. MAYBURY, editor, New Directions in Question Answering, pages 35–44. AAAI Press. ISBN 1-57735-184-3. [Cited at pg. 11, 39, 42, 79, 134]

SCHOCKAERT, STEVEN, DAVID AHN, MARTINE DE COCK AND ETIENNE E. KERRE, 2006. Question answering with imperfect temporal information. In Proceedings of the 7th International Conference on Flexible Query Answering Systems, LNAI 4027, pages 647–658. [Cited at pg. 41, 42, 80, 83, 84, 133]

STRZALKOWSKI, TOMEK AND SANDA HARABAGIU, 2007. Advances in Open Domain Question Answering. Springer Publishing Company, Incorporated, 1st edition. ISBN 1402047452, 9781402047459. [Cited at pg. 3]

TAO, CUI, HAROLD SOLBRIG, DEEPAK SHARMA, WEI-QI WEI, GUERGANA SAVOVA AND CHRISTOPHER CHUTE, 2010. Time-Oriented Question Answering from Clinical Narratives Using Semantic-Web Techniques. In PETER F. PATEL-SCHNEIDER, YUE PAN, PASCAL HITZLER, PETER MIKA, LEI ZHANG, JEFF Z. PAN, IAN HORROCKS AND BIRTE GLIMM, editors, The Semantic Web – ISWC 2010, volume 6497 of Lecture Notes in Computer Science, pages 241–256. Springer Berlin Heidelberg. ISBN 978-3-642-17748-4. [Cited at pg. 12, 40, 42, 80]

TEUFEL, SIMONE, 2007. An Overview of Evaluation Methods in TREC Ad Hoc Information Retrieval and TREC Question Answering. In LAILA DYBKJÆR, HOLMER HEMSEN AND WOLFGANG MINKER, editors, Evaluation of Text and Speech Systems, volume 37 of Text, Speech and Language Technology, pages 163–186. Springer Netherlands. ISBN 978-1-4020-5816-5. [Cited at pg. 21]

UZZAMAN, NAUSHAD, HECTOR LLORENS AND JAMES F. ALLEN, 2012. Evaluating Temporal Information Understanding with Temporal Question Answering. In Proceedings of the Sixth International Conference on Semantic Computing (ICSC-2012), pages 79–82. IEEE Computer Society. ISBN 978-1-4673-4433-3. [Cited at pg. 73, 74]

WANG, RICHARD C., NICO SCHLAEFER, WILLIAM W. COHEN AND ERIC NYBERG, 2008. Automatic Set Expansion for List Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (2008), pages 947–954. ACL. [Cited at pg. 29]

WEBBER, BONNIE, CLAIRE GARDENT AND JOHAN BOS, 2002. Position statement: Inference in Question Answering. In Proceedings of the LREC Workshop on Question Answering: Strategy and Resources, Las Palmas, Gran Canaria, Spain. [Cited at pg. 29]

WHITTAKER, EDWARD W. D., JOSEF R. NOVAK, PIERRE CHATAIN AND SADAOKI FURUI, 2006. TREC 2006 Question Answering Experiments at Tokyo Institute of Technology. In ELLEN M. VOORHEES AND LORI P. BUCKLAND, editors, Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006, Gaithersburg, Maryland, November 14-17, 2006, volume Special Publication 500-272. National Institute of Standards and Technology (NIST). [Cited at pg. 28, 32]

WOODS, W., R. KAPLAN AND B. NASH-WEBBER, 1974. The Lunar Sciences Natural Language Information System. Final Report 2378, Bolt, Beranek and Newman, Inc., Cambridge, MA. [Cited at pg. 3]

WU, MIN AND TOMEK STRZALKOWSKI, 2006. Utilizing Co-Occurrence of Answers in Question Answering. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics. [Cited at pg. 27, 32]

WU, MINJI AND AMÉLIE MARIAN, 2011. A Framework for Corroborating Answers from Multiple Web Sources. Information Systems, 36(2):431–449. ISSN 0306-4379. [Cited at pg. 36, 37]

YAHYA, MOHAMED, KLAUS BERBERICH, SHADY ELBASSUONI, MAYA RAMANATH, VOLKER TRESP AND GERHARD WEIKUM, 2012. Natural Language Questions for the Web of Data. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 379–390. Association for Computational Linguistics, Stroudsburg, PA, USA. [Cited at pg. 11, 40, 42]

YAHYA, MOHAMED, KLAUS BERBERICH, SHADY ELBASSUONI AND GERHARD WEIKUM, 2013a. Robust Question Answering over the Web of Linked Data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM '13, pages 1107–1116. ACM, New York, NY, USA. ISBN 978-1-4503-2263-8. [Cited at pg. 36]

YAHYA, MOHAMED, KLAUS BERBERICH, MAYA RAMANATH AND GERHARD WEIKUM, 2013b. On the SPOT: Question Answering over Temporally Enhanced Structured Data. In SIGIR 2013 Workshop on Time-aware Information Access. [Cited at pg. 9, 36, 37, 40, 42, 80]

YANG, HUI AND TAT-SENG CHUA, 2004a. Effectiveness of web page classification on finding list answers. In MARK SANDERSON, KALERVO JÄRVELIN, JAMES ALLAN AND PETER BRUZA, editors, Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 522–523. ACM. ISBN 1-58113-881-4. [Cited at pg. 29, 32, 132]

YANG, HUI AND TAT-SENG CHUA, 2004b. Web-based List Question Answering. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), page 1277. Association for Computational Linguistics, Morristown, NJ, USA. [Cited at pg. 32]


YANG, HUI, HANG CUI, MSTISLAV MASLENNIKOV, LONG QIU, MIN-YEN KAN AND TAT-SENG CHUA, 2003. QUALIFIER In TREC-12 QA Main Task. In Proceedings of the 12th Text REtrieval Conference (TREC-12), pages 480–488. [Cited at pg. 28, 32]

ZHANG, DELL AND WEE SUN LEE, 2003. A Web-based Question Answering System. In Proceedings of The Singapore-MIT Alliance (SMA) Annual Symposium 2003, Singapore. [Cited at pg. 9, 34, 37]

ZHENG, ZHIPING, 2002. AnswerBus Question Answering System. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 399–404. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. [Cited at pg. 9, 33, 37]

ZHOU, YAQIAN, XIAOFENG YUAN, JUNKUO CAO, XUANJING HUANG AND LIDE WU, 2006. FDUQA on TREC 2006 QA Track. In ELLEN M. VOORHEES AND LORI P. BUCKLAND, editors, Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006, Gaithersburg, Maryland, November 14-17, 2006, volume Special Publication 500-272. National Institute of Standards and Technology (NIST). [Cited at pg. 29]
