Natural Language Processing: My "Grandchild-Bot"
Carlos Henrique Olim Silva

Thesis to obtain the Master of Science Degree in Electrotechnical and Computer Engineering

Supervisors: Prof. Plínio Moreno López, Prof. José Alberto Rosado dos Santos Vitor

Examination Committee
Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Plínio Moreno López
Member of the Committee: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur

June 2020

Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the requirements of the Code of Conduct and Good Practices of the Universidade de Lisboa.

Acknowledgments
First of all, I would like to acknowledge my dissertation supervisors, Prof. Plínio Moreno and Prof. José Santos-Vitor, for their guidance during the execution of my thesis. I would particularly like to express my gratitude to Plínio Moreno, not only for his suggestions and expertise, but also for his aid and his concern for my work.

Additionally, I would like to thank my family, namely my mother Rita, my father Carlos, my brother Tomás, my cousins, my uncles and my aunts, for their reassurance, for always caring and for giving me precious advice to thrive in my everyday life. I would also like to thank my parents for everything they gave me throughout my upbringing, particularly during my university years, which made it possible for me to pursue and successfully complete this course. I also thank my mother for her understanding and emotional support throughout these years, which gave me the strength to always carry on, even in the darkest days. I am certain that I couldn’t have made it this far without her.
Furthermore, I would like to pay a special tribute to my grandparents: this work was made in their honor.

Finally, I would also like to thank all the friends I made during the years I studied at Instituto Superior Técnico (at the university and in the student residence where I lived) and during my school years. Some of them are Cat, Inês, Rocha, Jinho, Néné, Sofia Lima, Saraiva, Tavares, Menezes, Ema, Rita, Moita, Sérgio, Sofia Jesus, Jacob, Ana and Gi, without whom I would not have been able to overcome my biggest challenges. Thank you for always making my everyday life more interesting and fun and, since I couldn’t go home as often as I would have liked, thank you for being the family that I chose on this side of the ocean.

Abstract
For many years, the communication competences of robots have been oversimplified because of their complexity. Over the last few years, a couple of breakthroughs in Machine Learning and Natural Language Processing have opened many new possibilities for human-robot interaction. However, much of the research and data produced in these fields is in English, with other languages being mostly overlooked. In order to apply these techniques to the Portuguese language, a solution suited to the available databases is investigated, and the differences between the two languages, such as plural and gender endings and word stemming, are taken into consideration. Here, the robot communicates with humans in Portuguese, and the discourse should work in a reactive and rapid manner, considering and inspecting previous dialogues to build a conversational model that is as natural as possible. With this in mind, Latent Semantic Analysis, a Natural Language Processing technique, is combined with a Naïve Bayes classifier to predict what the robot should respond based on the human utterance.
The use of a stop words list and of a keyword extractor, as in the English-language research, is carefully inspected along with the system parameters to understand their influence on the final performance. A new approach for gathering new Portuguese dialogues, based on a form, is also proposed, since the quantity of available data is almost non-existent.

Keywords
Natural Language Processing, Latent Semantic Analysis, Machine Learning, Multinomial Naïve Bayes Classifier, Social Robotics, Human-Robot Interaction, Minimum Document Frequency, N-Gram, Stop Words, Stemming, Term Frequency–Inverse Document Frequency, Singular Value Decomposition, Dimensionality Reduction, Keyword

Resumo (Portuguese abstract, translated into English)
For many years, the communication competences of a robot were simplified because of their complexity. In recent years, a set of advances in the areas of Machine Learning and Natural Language Processing has brought many new possibilities for interactions between humans and robots. However, most of the research and data obtained in these areas is in English, completely neglecting other languages. In order to apply these techniques to the Portuguese language, a solution is investigated that is suited to the existing databases and that considers the differences between the two languages, such as gender and number inflection and word stems. In this work the robot interacts with humans in a dialogue in Portuguese that should work in a fast and reactive way, considering and inspecting previous dialogues in order to build a conversation model that is as natural as possible. With this in mind, Latent Semantic Analysis, a technique from the area of Natural Language Processing, is used together with a Naïve Bayes classifier to predict what the robot should answer based on a human utterance. The use of a stop words list and of a keyword extractor, as in several investigations on the English language, is carefully inspected, together with the system parameters, in order to understand their influence on the final performance. A new approach to acquire new data in Portuguese, through a form, is also presented, given that the amount of available data is almost non-existent.

Palavras Chave (Portuguese keywords, translated into English)
Natural Language Processing, Latent Semantic Analysis, Machine Learning, Multinomial Naïve Bayes Classifier, Social Robotics, Human-Robot Interactions, Minimum Document Frequency, N-Gram, Stop Words, Stemming, Term Frequency–Inverse Document Frequency, Singular Value Decomposition, Dimensionality Reduction, Keyword

Contents
1 Introduction
  1.1 Contextualization and Motivation
  1.2 Objective
  1.3 Organization of the Document
2 State of the Art
3 Feature extraction and selection through Latent Semantic Analysis (LSA)
  3.1 Tokenization, Stemming and Removal of Stop Words
  3.2 Building of TF-IDF matrix
  3.3 SVD Matrix Truncation and Dimensionality Reduction
4 Utterance generation through Naïve Bayes Classifier (NBC)
  4.1 Mathematical Introduction
  4.2 Multinomial Naïve Bayes Classifier
5 System Evaluation
  5.1 Development and Employment of the Software
    5.1.1 Speech Recognition Software
    5.1.2 Keyword Extraction Software
      5.1.2.A 1st Evaluation Metric
      5.1.2.B 2nd Evaluation Metric
      5.1.2.C Results
    5.1.3 Latent Semantic Analysis
    5.1.4 Multinomial Naïve Bayes Classifier
  5.2 Discourse Build-up
  5.3 Training and Testing Set
  5.4 System Adjustment
    5.4.1 Minimum Document Frequency and Maximum N-Gram Value
    5.4.2 Percentage of Cumulative Eigenvalues
    5.4.3 Laplace Smoothing Parameter
    5.4.4 Stop Words
    5.4.5 Keywords
  5.5 Results and Output Analysis
    5.5.1 Class Labels Robustness
    5.5.2 Performance on number of phrases per class
    5.5.3 Correct dialogue provided by the system
    5.5.4 Wrongful predicted phrases
  5.6 Highlights
6 Conclusion
A Appendix
B Large Tables

List of Figures
2.1 Overview of the whole system
5.1 Errors related to the 1st Metric. Blue, red and green represent the software Azure, Yake and LinguaKit, respectively. The dashed line represents the mean error.
5.2 Errors related to the 2nd Metric. Blue, red and green represent the programs Azure, Yake and LinguaKit, respectively. The dashed line represents the mean error.
5.3 Every step of the RSLP Stemmer
5.4 Minimum Frequency Appraisal
5.5 Maximum N-Gram Appraisal
5.6 Percentage of Cumulative Singular Values Appraisal
5.7 Full set split into training and testing subsets [1]
5.8 Influence