Natural Language Processing: My “grandchild-Bot”

Carlos Henrique Olim Silva

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisors: Prof. Plínio Moreno López, Prof. José Alberto Rosado dos Santos Vitor

Examination Committee

Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Plínio Moreno López
Member of the Committee: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur

June 2020

Declaração / Declaration

Declaro que o presente documento é um trabalho original da minha autoria e que cumpre todos os requisitos do Código de Conduta e Boas Práticas da Universidade de Lisboa.

I declare that this document is an original work of my own authorship and that it fulfills all the requirements of the Code of Conduct and Good Practices of the Universidade de Lisboa.

Acknowledgments

First of all, I would like to acknowledge my dissertation supervisors, Prof. Plínio Moreno and Prof. José Santos-Vitor, for their guidance during the execution of my thesis. I would particularly like to express my gratitude to Plínio Moreno, not only for his suggestions and expertise, but also for his aid and his concern for my work. Additionally, I would like to thank my family, namely my mother Rita, my father Carlos, my brother Tomás, my cousins, my uncles and my aunts, for their reassurance, for always caring and for giving me precious advice to thrive in my everyday life. I would also like to thank my parents for everything they gave me throughout my upbringing, particularly during my university years, which made it possible for me to undertake and successfully complete this course. Also, to my mother for her understanding and emotional support throughout these years, which gave me the strength to always carry on, even in the darkest days. I am certain that I couldn't have made it this far without her. Furthermore, I would be pleased to pay a special tribute to my grandparents: this work was made in their honor. Ultimately, I would also like to say thanks to all the friends I made during the years I studied at Instituto Superior Técnico (in the university and in the student residence where I lived) and during my school years. Some of them are Cat, Inês, Rocha, Jinho, Nené, Sofia Lima, Saraiva, Tavares, Menezes, Ema, Rita, Moita, Sérgio, Sofia Jesus, Jacob, Ana and Gi, without whom I would not have been able to overcome my biggest challenges. Thank you for always making my everyday life more interesting and fun and, since I couldn't go home as often as I would like, thank you for being the family that I chose on this side of the ocean.

Abstract

For many years, the communication competences of a robot have been oversimplified due to their complexity. Over the last few years, a couple of breakthroughs in the Machine Learning and Natural Language Processing areas brought many new possibilities to human-robot interaction. However, much of the research and data produced in these fields is in English, with other languages being mostly overlooked. In order to apply these techniques to the Portuguese language, a solution suitable to the available databases is investigated, and the differences between the two languages, such as the plural and gender endings and the word stemming, are taken into consideration. Here, the robot communicates with humans in Portuguese and the discourse should work in a reactive and rapid manner, considering and inspecting previous dialogues to build a conversational model as natural as possible. Bearing this in mind, Latent Semantic Analysis, a Natural Language Processing technique, is used together with a Naïve Bayes classifier to predict what the robot should respond based on the human utterance. The usage of a stop-word list and of a keyword extractor, as in the English-language research, is carefully inspected along with the system parameters to understand their influence on the final performance. A new approach to gather new Portuguese dialogues based on a form is also proposed, since the quantity of available data is almost non-existent.

Keywords

Natural Language Processing, Latent Semantic Analysis, Machine Learning, Multinomial Naïve Bayes Classifier, Social Robotics, Human-Robot Interaction, Minimum Document Frequency, N-Gram, Stop Words, Stemming, Term Frequency–Inverse Document Frequency, Singular Value Decomposition, Dimensionality Reduction, Keyword

Resumo

Durante muitos anos, as competências de comunicação de um robot foram simplificadas devido à sua complexidade. Nos últimos anos, um conjunto de avanços nas áreas de Machine Learning e Processamento Natural de Linguagem trouxe muitas novas possibilidades para as interações entre humanos e robots. Contudo, a maioria da investigação e dos dados obtidos nestas áreas são em inglês, negligenciando completamente outras línguas. De modo a aplicar estas técnicas à língua portuguesa, é investigada uma solução adequada às bases de dados existentes e que considere as diferenças entre as duas línguas, como a flexão em género e em número e a raiz das palavras. Neste trabalho, o robot interage com os humanos num diálogo em português que deve funcionar de um modo rápido e reativo, considerando e inspecionando diálogos anteriores de modo a construir um modelo de conversa o mais natural possível. Com isto em mente, o Latent Semantic Analysis, uma técnica da área de Processamento Natural de Linguagem, é utilizado juntamente com um classificador Naïve Bayes de forma a prever o que o robot deve responder com base numa expressão humana. A utilização de uma lista de palavras vazias e de um extrator de palavras-chave, como em várias investigações na língua inglesa, são cuidadosamente inspecionadas, juntamente com os parâmetros do sistema, de forma a compreender a sua influência no desempenho final. Também uma nova abordagem para adquirir novos dados em português, através de um formulário, é apresentada, dado que a quantidade de dados disponível é quase inexistente.

Palavras-Chave

Processamento Natural de Linguagem, Latent Semantic Analysis, Machine Learning, Multinomial Naïve Bayes Classifier, Robótica Social, Interações Humano-Robot, Mínima Frequência do Documento, N-Gram, Palavras Vazias, Stemização, Frequência do Termo–Inverso da Frequência nos Documentos, Decomposição em Valores Singulares, Redução da Dimensionalidade, Palavra-Chave

Contents

1 Introduction
   1.1 Contextualization and Motivation
   1.2 Objective
   1.3 Organization of the Document

2 State of the Art

3 Feature extraction and selection through Latent Semantic Analysis (LSA)
   3.1 Tokenization, Stemming and Removal of Stop Words
   3.2 Building of the TF-IDF Matrix
   3.3 SVD Matrix Truncation and Dimensionality Reduction

4 Utterance generation through Naïve Bayes Classifier (NBC)
   4.1 Mathematical Introduction
   4.2 Multinomial Naïve Bayes Classifier

5 System Evaluation
   5.1 Development and Employment of the Software
      5.1.1 Speech Recognition Software
      5.1.2 Keyword Extraction Software
         5.1.2.A 1st Evaluation Metric
         5.1.2.B 2nd Evaluation Metric
         5.1.2.C Results
      5.1.3 Latent Semantic Analysis
      5.1.4 Multinomial Naïve Bayes Classifier
   5.2 Discourse Build-up
   5.3 Training and Testing Set
   5.4 System Adjustment
      5.4.1 Minimum Document Frequency and Maximum N-Gram Value
      5.4.2 Percentage of Cumulative Eigenvalues
      5.4.3 Laplace Smoothing Parameter
      5.4.4 Stop Words
      5.4.5 Keywords
   5.5 Results and Output Analysis
      5.5.1 Class Labels Robustness
      5.5.2 Performance by number of phrases per class
      5.5.3 Correct dialogue provided by the system
      5.5.4 Wrongly predicted phrases
   5.6 Highlights

6 Conclusion

A Appendix

B Large Tables

List of Figures

2.1 Overview of the whole system
5.1 Errors related to the 1st metric. Blue, red and green represent the programs Azure, Yake and LinguaKit, respectively; the dashed line represents the mean error
5.2 Errors related to the 2nd metric. Blue, red and green represent the programs Azure, Yake and LinguaKit, respectively; the dashed line represents the mean error
5.3 Every step of the RSLP Stemmer
5.4 Minimum Frequency Appraisal
5.5 Maximum N-Gram Appraisal
5.6 Percentage of Cumulative Singular Values Appraisal
5.7 Full set split into training and testing subsets [1]
5.8 Influence of the minimum document frequency and the maximum n-gram value combined on the final dataset. Each sub-figure represents a different n-gram number, each point of the graph corresponds to a different frequency, and the bar represents the deviation from the average performance
5.9 Influence of the percentage of cumulative eigenvalues on the final dataset. Each point of the graph corresponds to a different percentage and the bar represents the deviation from the average performance
5.10 Influence of the Laplace smoothing on the final dataset. Each point of the graph corresponds to a different alpha and the bar represents the deviation from the average performance
5.11 The impact of the number of class labels, n, on the performance of the final dataset
5.12 The percentage error for each set division associated with each class after executing the system with the whole dataset
5.13 The impact of the number of phrases per class label on the final result

List of Tables

3.1 Comparison of a pair of words with and without stemming
3.2 Tokenization, stop-word removal and stemming of an example set of phrases
3.3 Term Frequency - Inverse Document Frequency (TF-IDF) matrix
3.4 Reconstructed TF-IDF matrix
3.5 Correlation coefficient of the terms before and after the dimensionality reduction
3.6 Correlation coefficient of the terms before the dimensionality reduction
3.7 Correlation coefficient of the terms after the dimensionality reduction
4.1 Input given to the Multinomial Naïve Bayes Classifier (MNBC) and its respective label
4.2 Conditional probabilities
4.3 LSA vector of the testing set
4.4 Probability obtained for each label and the chosen class
5.1 Mean error of each program for both metrics
5.2 Values used in this thesis' system for the listed parameters
5.3 Number of class labels per k
5.4 Best average value for each N, followed by its upper and lower bounds and the corresponding minimum document frequency (min df)
5.5 Best average value for each best percentage, followed by its upper and lower bounds
5.6 Best values chosen for the parameters of the final dataset
5.7 Best average value for each system configuration, followed by its upper and lower bounds
5.8 Best average value for each system configuration, followed by its upper and lower bounds
5.9 Number of class labels per k
5.10 Number of class labels per k
5.11 Values of the conditional, class and total probabilities of the phrase of class 5
5.12 Values of the conditional, class and total probabilities of the phrase of class 6
5.13 Values of the conditional, class and total probabilities of the phrase of class 8
5.14 Values of the conditional, class and total probabilities of the phrase of class 12
5.15 Values of the conditional and class probabilities of a wrong phrase of class 8
B.1 Utterances used in the training phase of the parameters' analysis - First Part
B.2 Utterances used in the training phase of the parameters' analysis - Second Part
B.3 Utterances used in the testing phase of the parameters' analysis

Acronyms

AI Artificial Intelligence

API Application Programming Interface

CV Cross Validation

DB Database

GA Genetic Algorithm

HCI Human-Computer Interaction

HHI Human-Human Interaction

HRI Human-Robot Interaction

IG Information Gain

KCV K-Fold Cross Validation

kNN k-Nearest Neighbour

LSA Latent Semantic Analysis

ML Machine Learning

MNBC Multinomial Naïve Bayes Classifier

NBC Naïve Bayes Classifier

NLP Natural Language Processing

PCA Principal Components Analysis

SKCV Stratified K-Fold Cross Validation

SRob Social Robotics

SR Speech Recognition

STT Speech-to-Text

SVD Singular Value Decomposition

SVM Support Vector Machine

TF-IDF Term Frequency - Inverse Document Frequency

TTS Text-to-Speech

1 Introduction

Contents

1.1 Contextualization and Motivation

1.2 Objective

1.3 Organization of the Document

First, the thesis theme is introduced, along with the background and the reasons that motivated it. Afterwards, the main goals are briefly stated and, finally, the structure of the document is presented, highlighting the order of the work.

1.1 Contextualization and Motivation

Humans are, at their core, social beings, and mingling in the community is the basis on which society functions. Human sociality was, and still is, the evolutionary pressure that drove the increase of the species' brain size and made it possible for hominids to develop more sophisticated societal relations. To this extent, it is safe to claim that social intelligence was one of the main factors behind the advancement of anthropoid intelligence. Note that, although this effect is stronger for humans, it also holds for every primate [2]. It follows that social relationships and conversations are encouraged and, more than that, needed to maintain a healthy mind, making loneliness an emotion to be avoided at all costs.

However, some groups in society cannot escape this unfortunate situation, especially the elderly. Due to the ageing problem present in almost all countries, especially the most developed ones, this group keeps growing, and the lack of friends or family, of skilled personnel in nursing homes or hospitals, of the family's availability, or of their own mobility deepens the danger of their isolation and urges them to seek emotional assistance [3][4]. It is clear that socialization is imperative and, to compensate for the scarcity of human communication, interactions with artificial machines can mitigate this insufficiency. This is where robotics enters.

The development of robotics over the last decades made it possible to look at conversations between humans and robots in a new light, with more complex and sophisticated procedures to precisely mimic Human-Human Interaction (HHI). Many different areas needed to work together in order to produce more refined communication, and from this a new field of study was created: Human-Robot Interaction (HRI), which is deeply intertwined with Social Robotics (SRob) [5].

HRI is the application of SRob techniques to enable the interaction between autonomous robots and humans, comprising all the operations that allow the agent to chat correctly and naturally with a person. SRob, therefore, draws on many different disciplines, such as Natural Language Processing (NLP), Human-Computer Interaction (HCI), Speech Recognition (SR) and Machine Learning (ML). The first is characterized by the process of cleaning and transforming raw text input into meaningful information that can correctly represent human utterances, and it is usually applied to textual problems like text categorization or spell checking. HCI is the interaction between a human and an artificial machine, which is very important in building the bridge to HRI, whereas SR helps the robot to better apprehend the human discourse and to express itself. Finally, ML is essential in providing multiple algorithms that, given a proper representation of human expressions, can generate the right robot response.

In recent decades, the steep growth of computational power, the enhancement of linguistic theories that led to a better understanding of how language works, the burst of available data, the huge expansion of data storage, and the development of new and more efficient algorithms have greatly improved NLP and ML techniques. Although SRob and HRI still have a long way to go, these breakthroughs advance the area and promote the creation of more capable systems applied to increasingly difficult social problems, even if the creation of a fully functional conversational agent is a problem still too complex to solve.

1.2 Objective

The main goal of this thesis is to develop software for a robot that interacts with the elderly in a Portuguese speech-based dialogue. The dialogue should work in a reactive and rapid manner, considering and inspecting previous dialogues between elders and other people to build a conversational model as natural as possible. In order to achieve this primary goal, different techniques are researched and analysed so that the best one is chosen and, due to the scarcity of Portuguese dialogues, it is also necessary to investigate a feasible approach to obtain new data so that the software developed can be properly tested and used.

1.3 Organization of the Document

This thesis is organized as follows.
First chapter: the thesis theme is introduced along with its context and motivation. The main goal is declared and the structure of the document is presented.
Second chapter: describes the existing research in the fields of study and provides a couple of possible solutions to the main problem, explaining why the chosen one makes more sense in the thesis context.
Third chapter: the NLP technique (Latent Semantic Analysis (LSA)) is described. This process includes operations of word cleansing and stemming, computation of the term-frequency matrix and reduction of the data dimensionality. At the end of the procedure, a sentences-by-terms matrix restricted to the most important dimensions is obtained, highlighting the relations between the terms of each utterance.
Fourth chapter: the model of the classifier is exposed and its learning and prediction phases are presented. The algorithm used is the Naïve Bayes Classifier (NBC).
Fifth chapter: the proposal for the overall system, joining all the previous parts, from the extraction of audio and text processing to the training of the model and the prediction of new phrases. It explains the best software for keyword extraction, after analyzing several candidates through specific evaluation metrics. The method used to obtain new Portuguese data is also presented and clarified. Afterwards, the parameters for the best system performance are computed and the LSA components are analyzed to check whether they improve the accuracy. Experiments regarding the number of classes and the number of phrases per class are also carried out to understand their influence on the performance. In the end, some examples of correct and incorrect phrases predicted by the system are presented, with a justification given in the latter case.
Sixth chapter: brief conclusions on the research made and the possible solutions to the problem, as well as recommendations for future iterations and new information retrieved from the experiments executed.

2 State of the Art

Despite robotics being a recent area, dating back only to the 20th century, the concept of robot-like objects and the desire to build a completely autonomous machine have existed for centuries. However, it was only in the 1980s that sufficient progress was made to build the foundation of present-day robotics, not only through the evolution from centralized systems to a more distributed sense-response paradigm, but also through the appearance of hybrid architectures. These changes made robots more robust to external variations and allowed them to participate in longer and more sophisticated human interactions. From the mid-1990s and early 2000s, with the further development of the area, a more multi-disciplinary approach was needed, which in turn brought multiple experts from different areas, such as natural language, psychology and HCI, to work together towards the same final goal. This was the birth of HRI as a new field of study [5]. These types of interactions, along with HHI, are important pieces for the understanding and development of social robotics.

A social robot is defined by Brian R. Duffy as “a physical entity embodied in a complex, dynamic, and social environment sufficiently empowered to behave in a manner conducive to its own goals and those of its community” [6]. This definition can be seen as somewhat ambiguous, since it does not clearly outline the robot's functions and characteristics; nevertheless, it is accurate in the sense that the social level of a robot has numerous degrees and depends on the final objective [2]. Therefore, depending on the application area or the robot's role in the relationship, each robot has a different design, varying from a tool or a mentor to a peer. However, most social robots can be divided into two major groups according to their main functionality: service-type robots, e.g. giving basic or more complex assistance, and companionship providers, related to socialization towards certain sectors of society [4][5]. It becomes clear that, in this scope, the second group deserves more focus, since it is important to create solutions that maximize socializing among the segments of society at higher risk of solitude. By tracking multiple HHIs among elderly people, it is possible to understand which type of relationship is needed and their preferred and disliked conversation themes, which clearly helps in making the interactions between robots and the aged population as natural as possible.

It is important to remark that the development of key areas within SRob and HRI, most importantly NLP and ML, had a huge impact on their evolution. NLP is characterized as a set of procedures capable of cleansing text and transforming the textual information into meaningful numerical data that can be used in, e.g., classification tasks. Although natural language is very important for a social robot, especially when it functions as a companionship provider, programmers often overlook robotic communication due to the complexity of NLP tasks. The approach taken is usually simplistic, basing what the robot should say on template-based, rule-based or even hard-coded systems. Nevertheless, recent progress in artificial intelligence has steered away from these types of systems towards data-driven techniques, taking advantage of ML algorithms.

The main obstacle in using these newer methods is acquiring a suitable quantity of textual information to create decent classifiers. In order to build a system that can work in every context and interaction, there is a need to set up a Database (DB) with millions of entries, so that the robot is sufficiently trained for every situation. However, this is currently infeasible, since there are no NLP techniques or enough data to perfectly accomplish such a system yet. Even so, it is already possible to obtain a good performance if the language processing and utterance generation methods are adequate to the problem [7].

With regard to ML, a couple of improvements in this area make the adoption of certain techniques desirable, and some of them are mentioned here. The increase in computational power, the development of new algorithms, the improvement of existing classifiers and breakthroughs in techniques for selecting and extracting data features are the most important ones; they have led to an evolution in Artificial Intelligence (AI) and a steep growth in the employment of ML algorithms both in new problems and in problems where better procedures are needed to boost the performance.

All of the previous advances give more impetus for SRob and HRI to solve more complex problems and to improve solutions with average or poor performance. As previously explained, there is a need for socialization of the elderly that can be fulfilled by a social robot. For this purpose, it is compelling to understand the HHI between the elderly and humans in general, so that the dialogue between elders and robots can simulate it. Hence, the main objectives are to acquire a dataset that properly represents interactions with the elderly and to create software for a robot that, given a human utterance, can decently respond to the elder. For this purpose, Speech-to-Text (STT) and Text-to-Speech (TTS) procedures are needed so the robot can “hear” and “speak”, respectively, while NLP and ML techniques are to be investigated so that the algorithm can extract fundamental information from the given text in order to correctly link the human utterance to what the robot is supposed to express.

After understanding the problem at hand, solutions appropriate to the goal in sight are researched. Each of these solutions is explained and then the one most appropriate to the main goals is selected and its choice justified. It is important to note that the solution chosen depends on the research done, as there are countless possible solutions. Although some solutions constitute a single system, the majority are divided into smaller parts that are connected together, and each part can be replaced by a similar one. Most of the systems employ an NLP technique and an ML algorithm. In these systems, a set of sentences is extracted from a raw text file and passed to an NLP method so that the textual information can be transformed into a numerical vector that correctly quantifies each sentence. Afterwards, the ML algorithm receives as input these sentence vectors and the corresponding robot sentences and trains the classifier. In the end, a classifier ready to predict new phrases, through their corresponding numerical vector, is obtained.
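As an illustration, the generic NLP-plus-ML pipeline just described can be sketched with scikit-learn. The toy corpus, the labels and the parameter defaults below are invented for the example and are not the thesis's actual data or configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: each human utterance is labelled with the class of the
# robot response it should trigger (utterances and labels are invented).
utterances = [
    "bom dia como esta",
    "ola tudo bem",
    "que horas sao",
    "pode dizer as horas",
]
labels = [0, 0, 1, 1]  # 0 = greeting response, 1 = time response

# NLP step: turn raw sentences into numerical vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(utterances)

# ML step: train a classifier on the vectors and their labels.
clf = MultinomialNB()
clf.fit(X, labels)

# Prediction for a new human utterance.
new = vectorizer.transform(["boa tarde como esta"])
print(clf.predict(new)[0])  # predicts class 0 (greeting)
```

Here `TfidfVectorizer` plays the role of the NLP method and `MultinomialNB` the role of the ML algorithm; either piece could be swapped for the alternatives discussed in the following paragraphs.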

The first nominee is character-level convolutional networks, or ConvNets, which are useful in obtaining information from raw signals. In this case, a text is treated as a raw signal that uses characters (comprising letters, digits and punctuation signs) as the base unit. The whole text is converted into a sequence of characters that is subsequently divided into multiple vectors of a fixed size, l0. It is a deep learning method and thus a neural network is used for classifying the document. The network has 70 features, one for each character, and their set represents the input space. Each feature vector has size l0 and, after being processed by specific convolutional layers, it is associated to a class, i.e. to a context. Without any knowledge about the semantic or syntactic structure of the language or of the words present in the texts, this method performs better on larger datasets (several million entries). Nonetheless, because of the novelty of the problem in Portugal, there are still no reasonable Portuguese datasets for this solution, so, at least in a first iteration, the dataset size fluctuates between the hundreds and the thousands; hence, other methods can be employed that yield a better accuracy [8].
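A minimal sketch of the character-quantization step (not the full ConvNet, which would require a deep learning framework) might look as follows. The alphabet and the value of `l0` are illustrative assumptions, not the exact 70-symbol set of [8]:

```python
import numpy as np

# Illustrative alphabet of letters, digits and punctuation, mirroring the
# character feature set described for character-level ConvNets.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 .,;:!?'\"()-_/\\@#$%&*+=<>[]{}|~^"
char_to_idx = {c: i for i, c in enumerate(alphabet)}
l0 = 16  # fixed input length; real systems use far larger values

def quantize(text, l0=l0):
    """Encode a text as a (len(alphabet), l0) one-hot matrix.
    Unknown characters and positions beyond l0 become all-zero columns."""
    m = np.zeros((len(alphabet), l0))
    for j, c in enumerate(text.lower()[:l0]):
        i = char_to_idx.get(c)
        if i is not None:
            m[i, j] = 1.0
    return m

x = quantize("Ola, robot!")
print(x.shape)  # (alphabet size, l0)
```

Each such fixed-size matrix would then be fed to the convolutional layers, which this sketch omits.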

The following contenders use feature selection to select the features that serve as input to the ML algorithm, in this case either a Support Vector Machine (SVM) or an NBC. The first part is the same for both solutions and goes as follows: build a sentences-by-terms table based on the frequency of each term in each phrase and, subsequently, choose the best features from the original set through the Information Gain (IG) metric. The second part changes with the ML method: the SVM uses a hyperplane to split the data into two classes, maximizing the margin between them [9], while the NBC uses the feature data to train the classifier and to calculate the conditional probabilities of having a specific phrase given the class, together with the a priori probabilities, in order to predict new texts [10].
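The way an NBC combines a priori and conditional probabilities can be illustrated with a small hand-worked example; all the probability values below are invented purely for the illustration:

```python
import numpy as np

# Invented numbers: two classes, three binary features.
priors = np.array([0.5, 0.5])            # P(class)
cond = np.array([[0.8, 0.1, 0.3],        # P(feature present | class 0)
                 [0.2, 0.7, 0.6]])       # P(feature present | class 1)

x = np.array([1, 0, 1])  # observed features of a new phrase

# Naive Bayes: multiply the prior by the conditional probability of each
# feature, assuming feature independence given the class.
likelihood = np.prod(np.where(x == 1, cond, 1 - cond), axis=1)
posterior = priors * likelihood
posterior /= posterior.sum()   # normalize to obtain P(class | phrase)
print(posterior.argmax())      # class 0 wins for this observation
```

With these numbers, class 0 scores 0.5 x 0.8 x 0.9 x 0.3 = 0.108 against 0.018 for class 1, so the phrase is assigned to class 0.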

However, both solutions lack a more effective way to quantify the features, such as Term Frequency - Inverse Document Frequency (TF-IDF), which not only performs better than TF tables but also works well with the size of the dataset used [8][11][12]. It takes the term frequency into account while penalizing a term the more common it is across the whole set of phrases, providing more data-related information. Regarding the SVM, since it does not directly apply to multiple classes, extra issues arise that have to be solved by implementing one-vs-one or one-vs-all voting schemes. Furthermore, as stated in [9] and [13], text categorization vectors are usually characterized by a great number of features and, although some are irrelevant, the majority are relevant. The vectors also have a high-dimensional feature space that, if not correctly reduced, hides relations between terms in the same context, distorting the results. This means that not only is feature selection needed to remove the irrelevant features, but also feature extraction, i.e. a transformation of the feature space into a new one more suitable for modeling, to strongly reduce the data dimensionality and capture more of the data variation. In this light, this solution demands a more concrete algorithm to diminish the dimensionality.

Another contender is feature selection and extraction with, as classifiers, k-Nearest Neighbour (kNN) or decision trees [13]. First, a set of preprocessing tasks (i.e. elimination of punctuation signs and of the common words of a language, and reduction of words to their stems) polishes the text of each sentence; then the algorithm employs a document frequency threshold, a feature selection method that eliminates terms whose frequency falls below a specific threshold. Afterwards, a TF-IDF table is built from the frequency of each feature (the words of the polished text) in each phrase and in the whole set. Feature selection is performed through the IG metric and, through extraction methods like Principal Components Analysis (PCA) or a Genetic Algorithm (GA), the dimensionality is reduced and more of the data variation is captured. Finally, one of the previously mentioned classifiers is used. Briefly, kNN associates a new unknown sentence with the nearest cluster based on its neighbours, while a decision tree is the representation of an algorithm in which each internal node analyzes a feature of the set of sentences and has as many branches as possible outcomes; this process repeats until the leaf nodes are reached, each of which corresponds to a class. Although this process is better than the contender described in the previous paragraph, there is a feature selection and extraction algorithm that is more widely used, more efficient and directly applicable to text categorization: Latent Semantic Analysis.
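These preprocessing tasks can be roughly sketched with a deliberately tiny stop-word list, a naive suffix stripper and a document-frequency cut; a real system would use a full stop-word list and a proper stemmer such as RSLP, and every word and suffix below is illustrative:

```python
import re
from collections import Counter

# Toy Portuguese stop-word list and a deliberately naive suffix stripper
# (both invented for this sketch, not the thesis's actual resources).
stop_words = {"o", "os", "a", "as", "de", "e", "que", "um", "uma"}

def stem(word):
    """Strip one of a few common Portuguese suffixes, naively."""
    for suffix in ("amos", "aram", "ando", "ar", "as", "os", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Tokenize, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [stem(t) for t in tokens if t not in stop_words]

docs = ["O robot gosta de cantar", "Os robots cantaram uma musica"]
processed = [preprocess(d) for d in docs]

# Document-frequency threshold: keep only terms present in >= 2 documents.
df = Counter(t for doc in processed for t in set(doc))
vocab = sorted(t for t, c in df.items() if c >= 2)
print(processed, vocab)
```

After stemming, "cantar" and "cantaram" collapse to the same root and survive the frequency cut, which is exactly the effect the preprocessing is after.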

Along these lines, the next solution is, therefore, the NBC [14][15] with feature selection and extraction through LSA. The latter is an NLP technique that comprises the same preprocessing tasks as the previous candidate and then applies Singular Value Decomposition (SVD) to the sentences-by-terms matrix. Afterwards, it reduces the matrix dimensionality in order to select the most important dimensions and achieve better data insight [16]. In this case, since LSA already performs feature selection and since the ML algorithm used is the NBC, which is very sensitive to feature selection and performs better when using most of the features [9], the IG metric is skipped. The NBC is used to predict which utterance the robot should respond with when triggered by human discourse [17]. These last two references ([9] and [17]) show how much the performance improves by eliminating the lowest dimensions without erasing further features, indicating the preference over the IG metric.
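Assuming scikit-learn is available, the LSA core (TF-IDF followed by truncated SVD) can be sketched as follows; the corpus and the number of retained components are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "o robot fala com o humano",
    "o humano fala com o robot",
    "o robot canta uma musica",
    "a musica e bonita",
]

# Sentences-by-terms TF-IDF matrix.
X = TfidfVectorizer().fit_transform(docs)

# Truncated SVD keeps only the top-k singular directions, i.e. the most
# important latent dimensions used by LSA.
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)  # reduced sentence representation
print(Z.shape)  # four sentences projected into a 2-D latent space
```

Each row of `Z` is the low-dimensional vector that would be handed to the classifier in place of the original sparse TF-IDF row.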

There are other reasonable ML algorithms that could be used in replacement of the NBC, like kNN, decision trees or SVM [9]. Nevertheless, for this solution they are not quite suitable. The first one is extremely sensitive to noisy data (even though the training data has no noise, the new data can have it due to the speech processing), while decision trees are very prone to over-fitting, leaving the classifier somewhat blind to new abnormal predictions, which can easily happen in text categorization [18]. The latter algorithm could be used, since it deals rather well with sparse feature vectors and has an excellent performance [9][19]. However, the dimensionality reduction applied by LSA transforms the otherwise sparse vector into a vector full of non-null values, excessively increasing the computational complexity and discouraging the use of this method. The preference of NBC over other ML methods is mainly due to its rapidity and ease of implementation. However, with the emergence of new procedures throughout the years, together with the advancement of the already existing methods, NBC has been outshone by many [9]. This raises the question of why anyone would ever use this ML method when there are

better ones on the market. However, the LSA data preprocessing transforms the classifier into a more viable method, maintaining the NBC's easy employment while increasing its accuracy [14]. Also, NBC is a technique that thrives on small datasets when compared to more complex solutions, like neural networks or SVM, which need a lot of data to correctly compute the parameters involved [18]. Therefore, the solution previously described is concluded to be the most interesting to obtain the desired goals and will be subsequently explained.

First of all, Latent Semantic Analysis (LSA) is a non-traditional NLP technique developed in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter [16]. This method is based on the SVD, a mathematical decomposition of a matrix into the product of three other matrices. The words and phrases processed by LSA have produced outputs that replicate many human cognitive phenomena, such as dialogue understanding, the capture of semantic similarity between terms and the manner in which humans acquire knowledge from texts. It has been shown that the outcome of LSA is very similar to how people represent the essence of what they hear or read and, consequently, it predicts the effect of word similarity between sentences.

It is important to note that this technique is able to induce deeper relations from the analysis of the text alone, without any type of a priori knowledge, relying heavily on mathematical theory and the information given by the provided sentences. Along these lines, the knowledge extracted by LSA is limited to the dataset used and the sentences provided, basing the analysis purely on the insight given by the phrases, which in turn shapes the final product. This means that LSA misses word relations or contexts that are not present in the set of sentences and, even if they exist in the new phrases to be predicted, the system is unsuccessful in forecasting these relationships. It also fails to take word order, logic or syntactic relations into account. Nonetheless, it still manages to almost correctly represent people's understanding and judgement [16].

Before LSA executes the SVD, the sentences undergo a pre-processing phase: every phrase is divided into tokens (sequences of characters delimited by blank spaces), punctuation signs and stop words (words that are common in a language or that do not give any distinctive meaning to a phrase) are eliminated, and the remaining words are stemmed so that the root of each term is obtained. Finally, the sentences-by-terms matrix is computed from specific information, such as the frequency of each word in each sentence and the number of times it appears in all of the phrases of the training set. Then, the decomposition is executed, obtaining two matrices that characterize the original row and column entities as vectors of derived orthogonal factor values, and a third one, a diagonal matrix containing the singular values or scaling values. These values are called scaling values because each represents the magnitude of a dimension, highlighting its relevance to the data output. Choosing the highest scaling values makes it possible to represent the data through the most important dimensions, obtaining an output with higher variation than before.

LSA's capacity of deriving both the words and sentences in an intertwined manner is very important and is only made possible by picking the correct dimension on which the sentences-terms distribution will be represented. The reduction of dimensionality is a crucial step and its main goal is to create better approximations of the data so that new word-word, word-sentence and sentence-sentence relations emerge, more similar to those of human cognition. However, finding the right number of dimensions is still a problem, with some possible, though not perfect, solutions in practice that will be further explored in Chapter 3. After choosing the right number, the approximated matrix is reconstructed, making the reconstructed matrix a least-squares best fit; each cell's value suffers a minor or major change due to the inference made by LSA, and that variation will depend on the relations existing in the most important dimensions [16]. These new vectors, representing the new sentence-word distribution, are used as input for the training of the NBC classifier.

The Naive Bayes Classifier (NBC) is a ML method deployed in the 1960s, under continual development until the present day, that is particularly notable in the text categorization area, i.e. classifying documents or phrases according to their semantic similarities and disparities. The classifier is called Naive because it assumes that all of the attributes of the evidence are always independent from each other, which is usually a false assumption, as true independence of data attributes is rare. Nevertheless, the results obtained are accurate [15]. Its mathematical foundations lie in the Bayes theorem, which computes the a posteriori probability, i.e. the conditional probability of a certain hypothesis A happening given the evidence B, from the insight of the given data: the a priori probability, i.e. the probability of A happening, the probability of the evidence B, and the conditional probability of the evidence B happening given that A is true. Hence the name Bayes [20]. Succinctly, it results in the following equation:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}, \qquad (2.1)$$

In Figure 2.1, it is possible to visualize how the entire system works. Briefly, in the training phase of our system, a text file consisting of numerous human sentences, each associated to a specific robot utterance, is given as input. An excerpt of an example of this file is displayed in Appendix A, where the number represents the class label ID and the letters H and R are associated with human and robot utterances, respectively. In order to obtain two different sets composed purely of robot or human utterances, the full set of sentences is divided, maintaining the association between which robot sentence follows a certain human utterance. Then, the keywords of the human utterances are extracted by a Keyword Extractor API and, together with the vocabulary extracted by LSA, they constitute the terms to be used in the construction of the TF-IDF table. After processing it through the LSA, the numerical vectors are obtained and serve as the input for the training of the NBC. The NBC's labels are the robot utterances after being converted to their ID numbers, which are different for every distinct phrase. After

training the classifier accordingly, the system is ready to predict new human utterances, transforming the new human audio into a sentence, converting it to numerical information and passing it on to the NBC. Finally, the ML algorithm predicts the robot response to the human utterance and transforms it to audio.
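The training/prediction pipeline just described (TF-IDF, LSA, normalization, MNBC) can be sketched with scikit-learn, the package this thesis adopts; the toy utterances, the robot IDs and the component settings below are illustrative assumptions, not the author's actual training file or code:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB

# Toy training data: human utterances paired with robot-utterance IDs
# (the real system reads these from the text file shown in Appendix A).
human = ["hello how are you", "tell me a story", "how are you feeling",
         "i want a story", "good morning to you", "read me a tale"]
robot_ids = [0, 1, 0, 1, 0, 1]  # each ID maps to one robot utterance

model = make_pipeline(
    TfidfVectorizer(),             # builds the TF-IDF table (Section 3.2)
    TruncatedSVD(n_components=2),  # LSA dimensionality reduction (Section 3.3)
    MinMaxScaler(),                # MNBC requires non-negative features
    MultinomialNB(),               # the utterance classifier (Chapter 4)
)
model.fit(human, robot_ids)
print(model.predict(["tell me a tale"]))
```

The scaler step mirrors the normalization to [0, 1] discussed in Chapter 4, since LSA vectors can contain negative values.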

Figure 2.1: Overview of the whole system

3 Feature extraction and selection through LSA

Contents

3.1 Tokenization, Stemming and Removal of Stop Words

3.2 Building of TF-IDF matrix

3.3 SVD Matrix Truncation and Dimensionality Reduction

As related in Chapter 2, the LSA process comprises numerous steps, which are explained in this section. It is important to point out that all of these phases constitute a single process that receives, as input, a set of phrases and its respective keywords. For every step, an example is given for better visualization. Despite the fact that the system deals with Portuguese sentences and keywords, the examples are given in English.

3.1 Tokenization, Stemming and Removal of Stop Words

1. Tokenization: Given a sequence of characters (i.e. a string), it splits them into tokens through a process of delimitation based on a specific criterion, such as a unique character (usually the blank space, as in this work). Tokens represent, therefore, words with a particular meaning or punctuation signs. In the latter case, these signs are eliminated from the set of tokens.

e.g. The girl likes to fish. → {The, girl, likes, to, fish}

2. Stop Words: In every language there is a group of terms that are extremely common in oral and written forms and, consequently, neither have a distinctive meaning nor give important content to the sentence. Hence, all of these words are eliminated from the set of tokens. It is crucial to remark that there is no objective list of stop words and, as such, there is a considerable number of different stop word sets. However, such a set should include the terms that, according to one individual, fail to provide extra value to the sentence. Although it is quite subjective, this list can deeply affect the LSA output, as will be shown in Chapter 5.

e.g. {The, girl, likes, to, fish} → {girl, likes, fish}

It is important to notice that the word to marked the difference between fish being a noun or a verb, and its elimination contributes to increased ambiguity. Nonetheless, as explained previously in Chapter 2, this technique completely disregards syntactic relations in a phrase, depending purely on the context of the set of tokens. Regardless of representing an activity (verb) or an animal (noun), the context of both variants is equal and knowing the syntactic relation is dispensable to the final result.

3. Porter Stemming: It is the action of replacing each inflected or derived word in the set with its root, i.e. removing verb conjugations, the singular/plural distinction and also the gender factor if the language has gender-specific words (which is very important in the Portuguese language, in contrast with English, where it is quite insignificant). In order to apply the right stemming, there are many lists of rules that are exclusive to each language. In this case the Porter Stemmer rules are used, as they are effective in obtaining the word root [17][21].

e.g. {girl, likes, fish} → {girl, like, fish}

This process is quite significant, since LSA depends highly on the frequency of each term and, without stemming, many words would be considered different despite their equal context, leading to an incorrect induction. Table 3.1 presents a few word pairs that would belong to different contexts if this step were not executed.

Without Stemming       With Stemming
fisher / fishing       fish / fish
plays / played         play / play
amused / amusements    amus / amus
car's / cars           car / car

Table 3.1: Comparison of a pair of words without or with stemming.
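The three steps above can be sketched in plain Python; the STOP_WORDS set and the suffix-stripping stem function below are simplified stand-ins for the real Portuguese stop-word list and the Porter rules used in this work:

```python
import re

# Illustrative stop-word subset (the thesis uses a Portuguese list).
STOP_WORDS = {"the", "a", "an", "to", "and", "of", "in", "is"}

def preprocess(sentence):
    # 1. Tokenization: extract word tokens, dropping punctuation signs.
    tokens = re.findall(r"[a-zA-Z']+", sentence.lower())
    # 2. Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Toy stemming (a stand-in for the Porter rules): strip common
    #    suffixes while keeping a root of at least three characters.
    def stem(word):
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word
    return [stem(t) for t in tokens]

print(preprocess("The girl likes to fish."))  # ['girl', 'like', 'fish']
```

The output reproduces the running example of this section; a real deployment would swap in a proper stemmer and stop-word list for the target language.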

3.2 Building of TF-IDF matrix

A TF-IDF matrix, composed of sentences by terms, consists of two major parts [22]: the first one is the absolute frequency of each term t within the sentence, while the second one is the logarithm of the size of the set of sentences divided by the size of the subset that contains the term t. So, given a set of sentences

S, a set of terms T, and assuming that s and t are elements of each set, respectively, and that $f_{t,s}$ is the frequency of a term t in a sentence s, the following equation is obtained for every $t \in T$ and $s \in S$:

$$\mathrm{TF\text{-}IDF}(t,s) = TF(t,s) \cdot IDF(t) = f_{t,s} \cdot \left[\ln\left(\frac{\#S + 1}{\#\{s' \in S : t \in s'\} + 1}\right) + 1\right] \qquad (3.1)$$

The ones added in the IDF part of the equation are used to smooth the results: inside the log so that a division by zero does not occur, and outside the log in order to avoid the complete suppression of terms that appear in all of the sentences, as the multiplication would otherwise be zero. In terms of importance, according to the TF function, the more a term appears in a sentence, the more it matters and the more its value increases. On the other hand, the IDF function attributes a higher weight to a term if it appears in fewer phrases, reaching its maximum value when it exists in only one sentence, where it depends solely on the total number of sentences. Conversely, it will be minimum, i.e. zero, when a sentence s is deprived of the term t. In the end, a sentences × terms matrix is obtained, and the higher the score a term has in a sentence when compared to the others, the greater is its relevance. Table 3.2 shows a set of phrases and the terms obtained after the sentences passed through the process described in Section 3.1. The TF-IDF matrix is then built up: each row corresponds to a sentence, identified in Table 3.2, and the columns represent the remaining tokens of all the phrases after the tokenization, stop words removal

Utterance                                              After 3.1
s1: The boy owns a dog.                                boy, own, dog
s2: The man takes the cat to the beach.                man, take, cat, beach
s3: The boy goes to the beach with his cat.            boy, go, beach, cat
s4: The man and the boy take the cat to the park.      man, boy, take, cat, park
s5: The men go to the park and they find dogs.         man, go, park, find, dog

Table 3.2: Tokenization, stop words removal and stemming of a set of phrases as an example.

and stemming. The values are rounded to the third decimal.

      boy     cat     man     take    beach   go      park    find    dog     own
s1    1.405   0       0       0       0       0       0       0       1.693   2.099
s2    0       1.405   1.405   1.693   1.693   0       0       0       0       0
s3    1.405   1.405   0       0       1.693   1.693   0       0       0       0
s4    1.405   1.405   1.405   1.693   0       0       1.693   0       0       0
s5    0       0       1.405   0       0       1.693   1.693   2.099   1.693   0

Table 3.3: TF-IDF Matrix

As expected from the theory, it is quickly seen that, within a sentence, a term's value increases when it appears fewer times in the whole set and that, for each term, the value is either zero or a single constant. This happens because the absolute frequency is either zero or one, so the score depends only on the number of sentences the term appears in. The terms find and own have the highest relevance, since each appears only once, in its corresponding phrase.
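Equation 3.1 can be checked directly against Table 3.3; the sketch below recomputes two of its entries (a minimal illustration, not the system's implementation):

```python
import math

# Tokenized sentences from Table 3.2.
sentences = {
    "s1": ["boy", "own", "dog"],
    "s2": ["man", "take", "cat", "beach"],
    "s3": ["boy", "go", "beach", "cat"],
    "s4": ["man", "boy", "take", "cat", "park"],
    "s5": ["man", "go", "park", "find", "dog"],
}

def tfidf(term, tokens, corpus):
    tf = tokens.count(term)                              # f_{t,s}
    df = sum(term in toks for toks in corpus.values())   # #{s' : t in s'}
    idf = math.log((len(corpus) + 1) / (df + 1)) + 1     # smoothed IDF of Eq. 3.1
    return tf * idf

print(round(tfidf("boy", sentences["s1"], sentences), 3))  # 1.405, as in Table 3.3
print(round(tfidf("own", sentences["s1"], sentences), 3))  # 2.099, as in Table 3.3
```

Since "boy" occurs in three of the five sentences, its IDF is ln(6/4) + 1 ≈ 1.405, while "own" occurs in only one, giving ln(6/2) + 1 ≈ 2.099.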

3.3 SVD Matrix Truncation and Dimensionality Reduction

The next phase is to decompose the previously retrieved matrix, called A for easier visualization, into a product of three matrices according to the SVD, which is defined as follows:

$$A = U \cdot \Sigma \cdot V^T, \qquad (3.2)$$

where A is a m × n matrix, U is m × m, Σ is m × n and V is n × n. Σ is a diagonal matrix, often called the singular value matrix, as its non-zero cells represent the singular values, each one linked to one particular dimension. The need to reduce the dimensions of the feature space arises from the fact that, if the irrelevant dimensions are withdrawn and the matrix is rebuilt to its initial dimensions, the obtained matrix is a least-squares best fit. The resulting output has each cell changed, and the new values represent greater or smaller frequency values when compared to the original matrix. This is due to indirect relations between all of the phrases: if two terms do not appear in the same sentences but the contexts in which they appear are similar, then with the reduced dimensions a higher correlation will emerge. The opposite

happens when, for example, two terms occur in the same sentence but, in the rest of the sentences, they always appear in different backgrounds; then their correlation will lower. It is through this approach that LSA performs induction [16].

Now a criterion to evaluate which dimensions are and are not significant is crucial. The magnitude of each singular value determines the importance of its related dimension, so the highest values should be chosen. Criteria such as a fixed number of dimensions or a threshold on the singular value could be used. However, the first one can incorporate dimensions with low magnitude (if the number chosen is too big) or exclude relevant dimensions (if the number is too small). With the second criterion, a given threshold can leave important dimensions out if the magnitudes are not large enough or, if many singular values are high, pick too many dimensions. Hence, these methods are too static and a more dynamic one must be chosen.

In this way, the percentage of cumulative singular values is a method worth using [23]. After normalizing all of the singular values by their total sum, it starts from the most important one and picks all of the following singular values until their total sum reaches a specific percentage. This leads to the selection of the most important dimensions, independently of their number or absolute magnitude, leaving out the insignificant ones. The various values of the percentage will be analyzed and the best one chosen in Chapter 5. Continuing with the matrix from Table 3.2, the SVD and dimensionality reduction are performed. For the percentage of cumulative singular values, the value 50% is assumed:

5.018 0 0 0 0 0 0 0  0 3.524 0 0 0 0 0 0   Σ =  0 0 2.901 0 0 0 0 0 (3.3)    0 0 0 2.523 0 0 0 0 0 0 0 0 1.696 0 0 0

After normalization, matrix 3.4 is obtained. Note that the sum of the first two diagonal cells exceeds 50% (0.320 + 0.225 = 0.545 > 0.5); therefore, the first two dimensions are chosen and the corresponding rows and columns of the decomposition are used to reconstruct the truncated matrix.

 0.320 0 0 0 0 0 0 0   0 0.225 0 0 0 0 0 0    Σ =  0 0 0.185 0 0 0 0 0  (3.4)    0 0 0 0.161 0 0 0 0  0 0 0 0 0.108 0 0 0

 −0.215 0.350 0.834 −0.300 0.213   −0.442 −0.501 −0.103 −0.032 0.736    U =  −0.406 −0.234 0.344 0.747 −0.323  (3.5)    −0.562 −0.242 −0.108 −0.566 −0.542  −0.527 −0.716 −0.404 0.177 0.123

$$V^T = \begin{bmatrix}
-0.331 & -0.395 & -0.429 & -0.339 & -0.286 & -0.315 & -0.368 & -0.221 & -0.250 & -0.090 \\
-0.050 & -0.390 & -0.011 & -0.357 & -0.353 & 0.231 & 0.228 & 0.426 & 0.512 & 0.208 \\
0.518 & 0.064 & -0.298 & -0.123 & 0.141 & -0.035 & -0.299 & -0.292 & 0.251 & 0.603 \\
-0.067 & 0.083 & -0.235 & -0.401 & 0.479 & 0.620 & -0.261 & 0.147 & -0.083 & 0.250 \\
-0.540 & -0.106 & 0.263 & 0.194 & 0.413 & -0.200 & -0.419 & 0.152 & 0.335 & 0.263 \\
-0.156 & 0.158 & 0.470 & -0.045 & -0.476 & 0.474 & -0.347 & -0.329 & -0.110 & 0.193 \\
-0.293 & 0.417 & 0.159 & -0.626 & 0.147 & -0.251 & 0.391 & -0.213 & -0.008 & 0.203 \\
0.073 & 0.517 & -0.088 & -0.083 & -0.273 & -0.217 & -0.333 & 0.657 & -0.191 & 0.106 \\
-0.249 & 0.449 & -0.413 & 0.195 & -0.225 & 0.059 & -0.018 & -0.223 & 0.578 & -0.300 \\
-0.381 & 0.039 & -0.423 & 0.325 & -0.006 & 0.290 & 0.311 & 0.066 & -0.332 & 0.523
\end{bmatrix} \qquad (3.6)$$

To reconstruct the matrix, simply apply equation 3.2 using the truncated matrices.

      boy     cat     man     take    beach   go      park    find    dog     own
s1    0.295   -0.055  0.449   -0.075  -0.127  0.625   0.677   0.764   0.902   0.354
s2    0.824   1.564   0.969   1.381   1.258   0.289   0.412   -0.265  -0.350  -0.169
s3    0.717   1.127   0.883   0.985   0.875   0.451   0.562   0.097   0.088   0.011
s4    0.978   1.446   1.218   1.260   1.109   0.690   0.842   0.258   0.269   0.075
s5    0.749   0.061   1.106   -0.006  -0.135  1.417   1.546   1.659   1.955   0.764

Table 3.4: Reconstructed TF-IDF Matrix
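As a cross-check, the SVD, the 50% cumulative-singular-value truncation and the reconstruction can be sketched with NumPy on the TF-IDF matrix of Table 3.3. This is a sketch, not the thesis code, and small numerical deviations from Table 3.4 are expected because the table entries are rounded:

```python
import numpy as np

# TF-IDF matrix from Table 3.3 (rows s1..s5; columns boy, cat, man, take,
# beach, go, park, find, dog, own).
A = np.array([
    [1.405, 0,     0,     0,     0,     0,     0,     0,     1.693, 2.099],
    [0,     1.405, 1.405, 1.693, 1.693, 0,     0,     0,     0,     0],
    [1.405, 1.405, 0,     0,     1.693, 1.693, 0,     0,     0,     0],
    [1.405, 1.405, 1.405, 1.693, 0,     0,     1.693, 0,     0,     0],
    [0,     0,     1.405, 0,     0,     1.693, 1.693, 2.099, 1.693, 0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the smallest leading set of dimensions whose normalized singular
# values sum to at least 50% (percentage of cumulative singular values).
weights = s / s.sum()
k = int(np.searchsorted(np.cumsum(weights), 0.5) + 1)

# Rank-k least-squares approximation of A (Eq. 3.2 with truncated factors).
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(k)                          # 2 dimensions are kept, as in matrix 3.4
print(round(float(A_k[0, 0]), 3)) # close to the 0.295 of Table 3.4
```

Note that the signs of individual columns of U and V may differ from matrices 3.5 and 3.6, since SVD is only defined up to the sign of each singular vector pair; the reconstruction A_k is unaffected.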

Table 3.4 represents the LSA vector of each sentence based on the vocabulary, i.e. the set of terms. In order to understand what the change of values represents, the correlation coefficient is calculated between each pair of sentences, before and after the dimensionality reduction. In Table 3.5, the correlation values are depicted: the values on the left are the results before the dimensionality reduction, while the ones on the right are after the matrix reconstruction.

      s2               s3               s4               s5
s1    -0.521 / -0.943  -0.202 / -0.889  -0.319 / -0.854  -0.226 / 0.999
s2                     0.171 / 0.991    0.359 / 0.979    -0.501 / -0.931
s3                                      -0.130 / 0.997   -0.368 / -0.873
s4                                                       -0.252 / -0.835

Table 3.5: Correlation coefficient of the sentences before and after the dimensionality reduction

A quick observation of Table 3.5 shows two sets of sentences that are more correlated after the truncation and, looking at the vocabulary extracted from the phrases, there is a cluster of words for each set that is exclusive to it.

$$S_1 = \{s_2, s_3, s_4\} \implies \{cat,\ take,\ beach\} \qquad (3.7)$$

$$S_2 = \{s_1, s_5\} \implies \{dog,\ own,\ find\} \qquad (3.8)$$

To understand the impact of each word subset, the correlation coefficient between terms is presented. In this procedure, each term is compared to the rest of the terms in the set of tokens in order to understand whether they are intertwined or not. Table 3.6 shows the values before the reconstruction of the approximated matrix, while Table 3.7 presents the results after the dimensionality reduction. Comparing the two tables, it is possible to understand how the new approximated matrix is capable of creating new relations between the same words just from reducing the dimensionality.

       cat     man     take    beach   go      park    find    dog     own
boy    0.167   -0.667  -0.167  -0.167  -0.167  -0.167  -0.612  -0.167  0.408
cat            0.167   0.667   0.667   -0.167  -0.167  -0.612  -1.000  -0.612
man                    0.667   -0.167  -0.167  0.667   0.408   -0.167  -0.612
take                           0.167   -0.667  0.167   -0.408  -0.667  -0.167
beach                                  0.167   -0.667  -0.408  -0.667  -0.408
go                                             0.167   0.612   0.167   -0.408
park                                                   0.612   0.167   -0.408
find                                                           0.612   -0.250
dog                                                                    0.612

Table 3.6: Correlation Coefficient of the terms before the dimensionality reduction

       cat     man     take    beach   go      park    find    dog     own
boy    0.750   0.958   0.736   0.706   0.018   0.101   -0.290  -0.300  -0.321
cat            0.529   1.000   0.998   -0.648  -0.583  -0.851  -0.856  -0.868
man                    0.512   0.474   0.303   0.381   -0.004  -0.015  -0.038
take                           0.999   -0.664  -0.599  -0.861  -0.866  -0.878
beach                                  -0.696  -0.633  -0.882  -0.887  -0.898
go                                             0.997   0.952   0.948   0.941
park                                                   0.923   0.919   0.910
find                                                           1.000   0.999
dog                                                                    1.000

Table 3.7: Correlation Coefficient of the terms after the dimensionality reduction

An inspection of both tables shows that the words belonging to each subset increase their correlation substantially. For example, in the first table the pair take-beach is poorly correlated (0.167), as the two only appear together in one sentence but, because the subset of sentences to which they belong has a high association, the value increases to almost 1. In the third phrase, for example, the LSA values of this pair change from (0, 1.693) to (0.985, 0.875), which is much more similar and shows the inference made by this technique. The pair own-find has a similar behaviour: although it had a negative correlation (−0.250) before the reduction of dimensionality and the two words never appear together in a sentence, this value changes to roughly 1. Looking at the LSA values, it is observed that, in all of the phrases, their values change to be

in closer accordance with each other. All of this happens because the words cat and dog are present in all of the phrases that correspond to subsets one and two, respectively. As a consequence, when one of these terms appears, the rest of the words belonging to that subset are expected to appear in the same sentence. LSA also deduces that, when one of the other terms of the subsets, apart from cat or dog, is present in a phrase, the others belonging to the same subset are prone to appear, even if they did not come along in the given utterances. On the other hand, the opposite also happens. Even though the pair of terms boy-own appeared together in an utterance, its correlation drops from 0.408 to −0.321 due to the fact that it is a rare combination and, with the truncation, the term boy becomes an expected term in almost all of the sentences, as seen by its LSA score. The detection of these two terms occurring together is, therefore, discarded by the system. It is crucial to point out that the given examples exist just to facilitate the understanding of how LSA works; the set of utterances used for the final system is more extensive and insightful, leading to a more realistic inference. Despite the fact that the combination of some words seems nonsensical, the induction presented by this technique is well-founded. After the human phrases are processed by this technique, a collection of LSA vectors that represents the textual information through numbers is obtained. The vectors retrieved will be used in the learning phase of the NBC classifier, as explained in Chapter 4.
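The pairwise values in Tables 3.5 to 3.7 are Pearson correlation coefficients; a minimal NumPy check using the reconstructed rows s2 and s3 from Table 3.4 follows. The result only approximates the 0.991 of Table 3.5 because the table entries are rounded to three decimals:

```python
import numpy as np

# Reconstructed LSA vectors of s2 and s3 from Table 3.4 (rounded values).
s2 = np.array([0.824, 1.564, 0.969, 1.381, 1.258, 0.289, 0.412, -0.265, -0.350, -0.169])
s3 = np.array([0.717, 1.127, 0.883, 0.985, 0.875, 0.451, 0.562, 0.097, 0.088, 0.011])

# Pearson correlation coefficient between the two sentence vectors.
r = np.corrcoef(s2, s3)[0, 1]
print(round(float(r), 3))
```

The value comes out close to 1, confirming that s2 and s3 belong to the same highly correlated cluster of sentences.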

4 Utterance generation through NBC

Contents

4.1 Mathematical Introduction

4.2 Multinomial Naïve Bayes Classifier

4.1 Mathematical Introduction

The NBC is a ML algorithm based on the Bayes' Theorem. Before covering the equation, there are some basic concepts that must be introduced:

• Posterior probability: P (y|X)

• Prior probability: P (y)

• Likelihood probability: P (X|y)

• Evidence probability: P (X)

With these notions understood, the mathematical theory that NBC follows is presented:

$$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}, \qquad (4.1)$$

where y is defined as the class label, having m possible outcomes, and X is a vector, composed of n features:

$$X = \{x_1, x_2, \dots, x_{n-1}, x_n\} \qquad (4.2)$$

Equation 4.1, therefore, expresses the probability of obtaining the class y when the array X contains a specific combination of values. This is only tractable after making one important assumption: the features of the evidence X are independent from each other:

$$P(A, B) = P(A) \cdot P(B), \qquad (4.3)$$

where A and B are features and P(A, B) is the joint probability of the two happening together. Taking equations 4.1, 4.2 and 4.3 into account, the final equation is obtained:

$$P(y|X) = \frac{P(x_1, \dots, x_n|y) \cdot P(y)}{P(x_1, \dots, x_n)} = \frac{P(x_1|y) \cdots P(x_n|y) \cdot P(y)}{P(x_1, \dots, x_n)} = \frac{P(y) \cdot \prod_{i=1}^{n} P(x_i|y)}{\prod_{i=1}^{n} P(x_i)} \qquad (4.4)$$

The above equation is used in the prediction phase of the NBC classifier. After estimating the essential parameters for the classification model in the learning phase, i.e. the a priori and likelihood probabilities, the system is ready to predict new elements. For this purpose, equation 4.4 is computed for all m class labels and, in the end, the one with the highest probability is the class selected by the system. The denominator in 4.4 can be suppressed, since it depends solely on the given data and is equal for every class, resulting in equation 4.5:

$$\hat{y} = \operatorname*{argmax}_y P(y|X) = \operatorname*{argmax}_y P(y) \cdot \prod_{i=1}^{n} P(x_i|y) \qquad (4.5)$$
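A toy numeric check of equations 4.4 and 4.5 with two classes and two features; the probabilities below are made up purely for illustration:

```python
# Hypothetical priors P(y) and per-feature likelihoods P(x_i | y)
# for a single observed evidence X; all numbers are illustrative.
priors = {"y1": 0.6, "y2": 0.4}
likelihood = {
    "y1": [0.8, 0.3],
    "y2": [0.2, 0.7],
}

def score(y):
    # Numerator of Eq. 4.4; the denominator is common to all classes
    # and can be dropped, as Eq. 4.5 does.
    p = priors[y]
    for l in likelihood[y]:
        p *= l
    return p

best = max(priors, key=score)
print(best)  # "y1": 0.6 * 0.8 * 0.3 = 0.144 > 0.4 * 0.2 * 0.7 = 0.056
```

The argmax picks y1 because its prior-times-likelihood product dominates, exactly the comparison equation 4.5 performs.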

4.2 Multinomial Naïve Bayes Classifier

The LSA vectors obtained in Chapter 3 are used as the input of the classifier. Each LSA vector corresponds to a specific human utterance and, as stated in Chapter 2, each human utterance is attached to the following robot utterance. Each distinct robot expression is transformed into a numerical ID that will be used as the corresponding class label. Therefore, each LSA vector will have, as the class label in the NBC learning phase, a specific robot ID associated. Looking at equation 4.2, each of the features of X is related to a term of the vocabulary extracted by LSA. Because the features neither follow a normal distribution nor take only two possible values, neither the Gaussian Naïve Bayes Classifier nor the Bernoulli Naïve Bayes Classifier is feasible. Instead, each feature has a multinomial distribution and, therefore, the Multinomial Naïve Bayes Classifier (MNBC) is the most desirable. This type of NBC performs well in text classification problems and the likelihood of observing a data point X is given by the following equation:

$$P(X|y) \propto \prod_{i=1}^{n} \hat{\theta}_{yi}^{x_i}, \qquad (4.6)$$

where $\hat{\theta}_{yi}$ represents the smoothed version of the maximum likelihood (i.e. relative frequency counting) and $x_i$ is the frequency of feature i of X; it is computed through:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}, \qquad (4.7)$$

where $N_{yi} = \sum_{x \in S^*} x_i$ is the sum of the frequencies of feature i over all of the sentences in the set S* (human utterances belonging to class y), and $N_y = \sum_{i=1}^{n} N_{yi}$ is the sum of the frequencies of all the vocabulary terms of all sentences in S*. The α represents the smoothing parameter and it is used in order to prevent the final value of P(y|X) from becoming zero when one of the conditional probabilities is zero. In order to transform the MNBC into a linear classifier and to avoid underflow, equations 4.5, 4.6 and 4.7 are expressed together in log-space [14]. This results in the final equation:

$$\log(\hat{y}) \propto \operatorname*{argmax}_y \left[ \log\left( P(y) \cdot \prod_{i=1}^{n} \hat{\theta}_{yi}^{x_i} \right) \right] = \operatorname*{argmax}_y \left[ \log P(y) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{yi} \right] \qquad (4.8)$$

This model is supposed to work with integers, but fractional values also function adequately, as the Inverse Document part of TF-IDF can be seen as a weight on the word count. However, because the possible LSA values for each feature can be negative and frequencies cannot, it is mandatory to

manipulate the data in order to obtain non-negative values. Hence, each array is normalized to values between 0 and 1. In order to compute the MNBC parameters, the classifier simply attributes each LSA vector to the corresponding class label and computes the a priori and likelihood probabilities, i.e. P(y) and θ_yi respectively. Afterwards, the system is ready to predict new phrases uttered by a human. In the prediction phase, the LSA vector is obtained, equation 4.4 is computed for all the classes and the one with the highest probability is chosen. In the interest of illustrating the theory, the following example is presented. The LSA vectors calculated in Section 3.3 are used in the training phase and the corresponding classes, i.e. the utterances the robot should answer, are: "I love cats" and "I like dogs". The sentences from Table 3.2 are divided with the following distribution:

$$\text{"I love cats"} \implies \text{set } \#1: \{s_2, s_3, s_4\} \qquad \text{"I like dogs"} \implies \text{set } \#2: \{s_1, s_5\} \qquad (4.9)$$

Each previous sentence is transformed into its respective set number and, for each phrase of Table 3.2, the correct class label is attributed according to equation 4.9. After normalizing the previously computed LSA vectors between 0 and 1, the results in Table 4.1 are obtained:

Class Label   boy     cat     man     take    beach   go      park    find    dog     own
2             0.410   0.070   0.560   0.051   0       0.731   0.781   0.866   1       0.467
1             0.613   1       0.689   0.904   0.840   0.334   0.398   0.044   0       0.095
1             0.633   1       0.781   0.873   0.774   0.394   0.494   0.078   0.069   0
1             0.659   1       0.834   0.864   0.754   0.449   0.559   0.133   0.142   0
2             0.423   0.094   0.594   0.062   0       0.743   0.804   0.858   1       0.430

Table 4.1: Input given to the MNBC and its respective label

Now the computation of the a priori and likelihood probabilities is possible and the results are successively presented:

$$P(y=1) = \frac{3}{5} \qquad P(y=2) = \frac{2}{5} \qquad (4.10)$$

P(x_i | y)
y    boy     cat     man     take    beach   go      park    find    dog     own
1    0.124   0.195   0.150   0.171   0.154   0.076   0.094   0.017   0.014   0.006
2    0.084   0.016   0.116   0.011   0       0.148   0.159   0.173   0.201   0.090

Table 4.2: Conditional Probabilities

Examining Table 4.2, the probabilities of almost all of the terms differ by an order of magnitude between classes, demonstrating the different distribution of the LSA vectors previously explained. With all the probabilities calculated, the system is ready to predict new phrases. A testing set is created to

demonstrate the prediction part of MNBC and it is represented by sentences s5, s6, s7 and s8.

• s5 : The boy goes with a cat to the park.

• s6 : A man and a boy find dogs.

• s7 : The man goes to a park.

• s8 : The dogs and cats go to the beach.

      boy     cat     man     take    beach   go      park    find    dog     own
s5    0.845   0.928   0.949   0.586   0.439   0.876   1       0.409   0.539   0
s6    0.598   0.081   0.681   0.069   0       0.558   0.627   0.922   1       0.382
s7    0.424   0.193   0.858   0.121   0       0.897   1       0.747   0.893   0.203
s8    0.654   0.850   0.985   0.453   0.621   1       0.902   0.554   0.941   0

Table 4.3: LSA vector of the testing set

For the purpose of guessing to which class each phrase of the testing set belongs, equation 4.8 is computed for both class labels and the a posteriori probabilities are presented in table 4.4. The class with the highest probability is chosen and is also exhibited in the table.

            a posteriori probability
Utterance   "I love cats."   "I like dogs."   Class chosen
s5          1.060 × 10^-7    9.379 × 10^-13   1
s6          8.839 × 10^-8    1.917 × 10^-5    2
s7          9.643 × 10^-8    6.604 × 10^-6    2
s8          1.358 × 10^-8    1.334 × 10^-14   1

Table 4.4: Probability obtained for each label and chosen class
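The worked example above can be reproduced with scikit-learn's MultinomialNB. The sketch below feeds the normalized vectors of table 4.1 to the classifier and predicts the testing set of table 4.3; a near-zero smoothing constant is assumed so that the fitted likelihoods match the unsmoothed values of table 4.2:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Normalized LSA vectors from Table 4.1 (rows s1..s5) with their class labels.
X_train = np.array([
    [0.410, 0.070, 0.560, 0.051, 0.000, 0.731, 0.781, 0.866, 1.000, 0.467],  # s1 -> 2
    [0.613, 1.000, 0.689, 0.904, 0.840, 0.334, 0.398, 0.044, 0.000, 0.095],  # s2 -> 1
    [0.633, 1.000, 0.781, 0.873, 0.774, 0.394, 0.494, 0.078, 0.069, 0.000],  # s3 -> 1
    [0.659, 1.000, 0.834, 0.864, 0.754, 0.449, 0.559, 0.133, 0.142, 0.000],  # s4 -> 1
    [0.423, 0.094, 0.594, 0.062, 0.000, 0.743, 0.804, 0.858, 1.000, 0.430],  # s5 -> 2
])
y_train = np.array([2, 1, 1, 1, 2])

# Testing-set vectors from Table 4.3 (s5..s8).
X_test = np.array([
    [0.845, 0.928, 0.949, 0.586, 0.439, 0.876, 1.000, 0.409, 0.539, 0.000],
    [0.598, 0.081, 0.681, 0.069, 0.000, 0.558, 0.627, 0.922, 1.000, 0.382],
    [0.424, 0.193, 0.858, 0.121, 0.000, 0.897, 1.000, 0.747, 0.893, 0.203],
    [0.654, 0.850, 0.985, 0.453, 0.621, 1.000, 0.902, 0.554, 0.941, 0.000],
])

# An almost-zero alpha mimics the unsmoothed likelihoods of Table 4.2.
clf = MultinomialNB(alpha=1e-10)
clf.fit(X_train, y_train)

print(np.exp(clf.class_log_prior_))   # a priori, ≈ [0.6, 0.4] for classes [1, 2]
print(clf.predict(X_test))            # chosen classes, cf. Table 4.4
```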

As expected from section 3, when the terms cat and dog appear individually, the phrase to be predicted belongs to the first and second class, respectively. When both of them appear in a phrase, or neither does, the system checks the other terms to verify the information they infer about each class. Observing the testing set, in sentence s7 the word park is much more related to dogs, while the term beach in phrase s8 is more associated with cats; as such, the last two phrases correspond to classes two and one, respectively. This is corroborated by the correlations computed in table 3.7.

5 System Evaluation

Contents

5.1 Development and Employment of the Software...... 35

5.2 Discourse Build-up...... 46

5.3 Training and Testing Set...... 50

5.4 System Adjustment...... 51

5.5 Results and Output Analysis...... 57

5.6 Highlights...... 67

The following tasks are to be carried out throughout this chapter:

• Presentation of the packages employed in the system, from the speech recognition and keyword extractor programs to the packages used for the LSA and MNBC. The computational language adopted for the system is Python, since there are already plenty of suitable packages that can easily perform the desired tasks. The software used for LSA and MNBC belongs to the scikit-learn package [24][25]. It is crucial to note that the software adopted must support the Portuguese language. The fully developed software code is described completely (including all classes, methods, variables and the scripts executed) and can be visualized in the following GitHub repository under the final directory [26].

• Inspection of the LSA and MNBC variables in order to choose the values that deliver the best performance possible. The parameters are the minimum document frequency, the maximum and minimum values of N-GRAM, the percentage of cumulative eigenvalues and the Laplace smooth- ing parameter.

• Enunciation of the technique endorsed in this work to gather new data (the form technique). A new procedure is created since there are essentially no records of Portuguese speech that can be used to build new dialogues based on the elderly's normal flow of conversation.

• Description of the procedure adopted to divide the data into training and testing set.

• Adjustment of the system to the new dataset, validating the parameter values previously selected. The variables' values, along with the employment of a stop word list and the keywords from the keyword extractor, are inspected to understand their influence on the final output.

• Analysis of the final results and output, investigating how the performance varies with different quantities of class labels and of phrases per class, and presenting a correct dialogue produced by the system as well as a set of phrases wrongly predicted by it. The set is thoroughly examined to comprehend the system's weaknesses.

5.1 Development and Employment of the Software

5.1.1 Speech Recognition Software

The Speech Recognition Application Programming Interface (API), as described in figure 2.1 and used for new human utterances, is a piece of software capable of properly recognizing human discourse and transforming it into textual information. It is necessary to mention that, nowadays, SR software is still far from perfect and there is always some error between the discourse it perceives from a human and what was actually said, sometimes confusing a whole idea or simply a letter [17][27].

An investigation for a program is done and the Google Speech API [28] is a reasonable approach for the desired output. It not only supports the Portuguese language but is also powered by ML, applying deep neural networks to the captured audio, which mitigates the imperfection of SR. Besides, it improves over time due to the continuous integration of new user data into the training of the classifier. The Python interface for the Google Speech API, speech recognition, is imported and used in the system as mentioned in the Python API documentation [29]. For adequate operation, the API needs a couple of parameters whose values, apart from two, correspond to the defaults listed in the documentation. The exceptions to the rule are the language (Portuguese) and the microphone utilized, which depends on the machine running the program.

5.1.2 Keyword Extraction Software

A Keyword Extraction software is a program that processes a text, extracting the most relevant words according to the context of the text and to the algorithm of the extractor. After a brief survey of possible software solutions, the most reliable candidates capable of supporting Portuguese sentences are selected and subsequently analyzed in full. They are Azure [30][31], LinguaKit [32][33] and Yake [34][35]. In order to find the best extractor, it is necessary to create metrics to correctly evaluate all of them. First of all, a Portuguese DB with multiple texts and the appropriate keywords is mandatory so that it can be used as a sample for the comparison between it and each software previously enumerated. For an independent program evaluation, the keywords need to be withdrawn from a text by one or more people, which introduces a certain ambiguity and can result in a misleading output. However, so that the execution of these trials is feasible, these keywords are regarded as being the correct keywords of the texts and, thus, are called the gold set. In order to diminish the uncertainty of the keywords extracted by people, two metrics are created and used to evaluate each extractor.

5.1.2.A 1st Evaluation Metric

In the first method, for a set of texts of the DB, the output of each keyword extractor is compared against the keywords of the gold set and the percentage of incorrect keywords is computed. The lower the percentage associated with a software, the more precise it is. However, this percentage is also influenced by the number of keywords extracted by a software compared to the total of the gold set. As such, if one program extracts a huge amount of words, it is going to be penalized. For example, an output can contain all of the right keywords while half of the outcome still consists of incorrect words. Nevertheless, as a keyword is an important piece of information about a text or a sentence, the number of keywords extracted should be as small as possible and a penalization should be applied when this does not occur.
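The original metric was computed in MATLAB; the sketch below is one plausible Python reading of it, where the error is the fraction of extracted keywords absent from the gold set (the gold set and the extractor output are hypothetical):

```python
def first_metric(extracted, gold):
    """Percentage of extracted keywords that do not appear in the gold set.

    Over-generating extractors produce more spurious words, which inflates
    this error: the penalization described in the text.
    """
    extracted, gold = set(extracted), set(gold)
    if not extracted:
        return 0.0
    wrong = extracted - gold
    return 100.0 * len(wrong) / len(extracted)

# Hypothetical gold keywords for a short news text and one extractor's output.
gold = {"incêndio", "bombeiros", "floresta"}
output = {"incêndio", "bombeiros", "verão", "floresta", "norte"}
print(first_metric(output, gold))  # 40.0: 2 of the 5 extracted keywords are wrong
```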

5.1.2.B 2nd Evaluation Metric

The second metric is characterized by comparing each set of keywords extracted by each of the three programs to the other two and calculating the percentage of nonequivalent keywords. A nonequivalent keyword corresponds to a word that only appears in the output of one software, being absent from the remaining ones. Not only does this penalize extractors that output too many keywords, but it also inspects how many of the extracted keywords have equivalents in other software.
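This second metric can be sketched in the same spirit; the three keyword sets below are hypothetical outputs of the extractors on one text:

```python
def second_metric(target, others):
    """Percentage of `target`'s keywords that no other extractor produced."""
    target = set(target)
    union_of_others = set().union(*map(set, others))
    nonequivalent = target - union_of_others
    return 100.0 * len(nonequivalent) / len(target)

# Hypothetical outputs of the three extractors on the same text.
azure = {"governo", "orçamento", "parlamento"}
yake = {"governo", "orçamento", "votação", "debate"}
linguakit = {"governo", "ministro", "crise"}

print(second_metric(azure, [yake, linguakit]))  # ≈ 33.3: 1 of 3 has no equivalent
```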

5.1.2.C Results

The dataset used contains numerous informative texts from broadcast news with very different lengths and rather distinct contexts [36]. Since the context of this thesis is simple dialogues where most of the utterances are short and simple, smaller texts from the database are preferred over longer ones. Consequently, twenty texts are picked from the dataset, where the first ten are considered to be small and the closest to a sentence, whilst the rest range from medium to large size. A text is considered small if it has 4 lines or fewer, which in terms of words ranges from about 10 to 60. The interface used to build the metrics and compute the percentages is MATLAB and the outcome is presented in figures 5.1 and 5.2 and table 5.1.

Figure 5.1: Errors related to the 1st Metric. Blue, red and green represent the software Azure, Yake and LinguaKit respectively. The dashed line represents the mean error.

            Azure    LinguaKit  Yake
1st Metric  34.50%   57.89%     44.65%
2nd Metric  5.93%    28.78%     17.81%

Table 5.1: Mean error for each software for both methods.

Figure 5.2: Errors related to the 2nd Metric. Blue, red and green represent the programs Azure, Yake and LinguaKit, respectively. The dashed line represents the mean error.

After a brief analysis, it is easily visualized that Azure is the best one. Although its percentage of mistakes in the first method is quite high, it is significantly lower than the other two. In the second method, a smaller error is observed for Azure, three times and almost five times smaller than Yake's and LinguaKit's, respectively. For the first ten texts, in the second method there is no error associated with Azure, while in the first method all of its errors are under or around the mean error. For the rest of the documents the error tends to increase, with the last values being the highest. The other programs have an erratic behaviour and ultimately work better with larger texts, as they have a similar, or even better, score than Azure in the last iterations. For texts closer to the size of a sentence, Yake and LinguaKit usually do not have an adequate performance. In conclusion, Azure can provide reasonable results for what is intended.

5.1.3 Latent Semantic Analysis

In pursuance of correctly explaining the deployment of LSA, the packages used in each step of chapter 3 are detailed.

1. Stop Words:

A list of Portuguese stop words is obtained from the nltk.corpus.stopwords package [37]. The obtained list contains the most important words that should be eliminated from a phrase because they do not bring any content to the sentence. However, after some trials, it was noticed that the word é was missing from the list, which was causing abnormal results since it created non-existing relationships between sentences and terms. Therefore, that term is added to the list of stop words. The corresponding list contains, therefore, the following terms:

• {de, a, o, que, e, do, da, em, um, para, com, não, uma, os, no, se, na, por, mais, as, dos, como, mas, ao, ele, das, à, seu, sua, ou, quando, muito, nos, já, eu, também, só, pelo, pela, até, isso, ela, entre, depois, sem, mesmo, aos, seus, quem, nas, me, esse, eles, você, essa, num, nem, suas, meu, às, minha, numa, pelos, elas, qual, nós, lhe, deles, essas, esses, pelas, este, dele, tu, te, vocês, vos, lhes, meus, minhas, teu, tua, teus, tuas, nosso, nossa, nossos, nossas, dela, delas, esta, estes, estas, aquele, aquela, aqueles, aquelas, isto, aquilo, estou, está, estamos, estão, estive, esteve, estivemos, estiveram, estava, estávamos, estavam, estivera, estivéramos, esteja, estejamos, estejam, estivesse, estivéssemos, estivessem, estiver, estivermos, estiverem, hei, há, havemos, hão, houve, houvemos, houveram, houvera, houvéramos, haja, hajamos, hajam, houvesse, houvéssemos, houvessem, houver, houvermos, houverem, houverei, houverá, houveremos, houverão, houveria, houveríamos, houveriam, sou, somos, são, era, éramos, eram, fui, foi, fomos, foram, fora, fôramos, seja, sejamos, sejam, fosse, fôssemos, fossem, for, formos, forem, serei, será, seremos, serão, seria, seríamos, seriam, tenho, tem, temos, têm, tinha, tínhamos, tinham, tive, teve, tivemos, tiveram, tivera, tivéramos, tenha, tenhamos, tenham, tivesse, tivéssemos, tivessem, tiver, tivermos, tiverem, terei, terá, teremos, terão, teria, teríamos, teriam, é}

2. Porter Stemming:

A Portuguese Porter Stemmer is provided by the package nltk.stem.RSLPStemmer. RSLP stands for Removedor de Sufixos da Língua Portuguesa (Portuguese Language Suffix Remover) [38]. Each word processed by this algorithm passes through a sequence of 8 steps that execute, or not, different reductions, depending on the specific combination of letters at the word ending. In figure 5.3, it is possible to observe the order the algorithm must follow for the correct stem withdrawal given a precise word ending [39].

3. Tokenization:

The tokenization is executed by the method word_tokenize from the nltk.tokenize package [40]. The delimiter of each token is the blank space, so that each word is separated from the others and from punctuation signs.

In order to group the three last packages, a method is created that receives a sentence as input and returns a set of terms as output. This function lowercases all of the letters and, after tokenizing the sentence, eliminates the tokens which belong to the list of stop words or to the set of punctuation symbols (given by the punctuation list from the package string). The words which are not eliminated are processed by the Portuguese Porter Stemmer, transforming tokens into word stems and returning them as a list of terms (the LSA's vocabulary) to be used in the next step. Along with the keywords, these words compose the system's vocabulary.
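The pipeline just described can be sketched as follows. This is a compressed, self-contained stand-in: a tiny stop word list and a crude suffix stripper replace the full nltk stopwords corpus and the RSLP stemmer, and a regular expression replaces word_tokenize:

```python
import re
import string

# Simplified stand-ins for the nltk components described above: a tiny stop
# word list and a crude suffix stripper instead of the full 8-step RSLP stemmer.
STOP_WORDS = {"de", "a", "o", "que", "e", "do", "da", "em",
              "um", "para", "no", "os", "não", "é"}
SUFFIXES = ("ações", "ação", "mente", "ndo", "ar", "er", "ir", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping a minimum stem length."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(sentence):
    """Lowercase, tokenize, drop stop words and punctuation, stem the rest."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    kept = [t for t in tokens
            if t not in STOP_WORDS and t not in set(string.punctuation)]
    return [crude_stem(t) for t in kept]

print(preprocess("O rapaz gosta de passear no parque."))
```

The real system would swap `STOP_WORDS`, `crude_stem` and the regular expression for the nltk stop word list, RSLPStemmer and word_tokenize, respectively.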

4. Building of TF-IDF matrix:

39 Figure 5.3: Every step of the RSLP Stemmer

Constructing the desired TF-IDF matrix is easily done with the class TfidfVectorizer from the package sklearn.feature_extraction.text [41]. For this purpose, the class requires a couple of parameters that are subsequently enumerated. It is imperative to remark that the first two arguments are interchangeable.

• The minimum document frequency, min_df, serves as a minimum threshold for the frequency of each term in a document or sentence. If the frequency of a term is below the established value, then the term is eliminated from the set of terms.

OR

• A list of vocabulary, vocabulary. Instead of determining it from the input documents, the algorithm searches for the terms in this list and only computes the TF-IDF for these words.

• A list of stop words, stop_words, as previously described in the first part of subsection 5.1.3.

• A tokenizing method, tokenizer, already mentioned in the preceding step.

• N-Gram's minimum and maximum values, ngram_range. An N-Gram represents a contiguous set of N words of a sentence. In the following example, it is possible to visualize the different sets for the sentence "I love yellow cars and boats", depending on N:

(a) N1 (N = 1) : {”I”, ”love”, ”yellow”, ”cars”, ”and”, ”boats”}

(b) N2 (N = 2) : {”I love”, ”love yellow”, ”yellow cars”, ”cars and”, ”and boats”}

(c) N3 (N = 3) : {"I love yellow", "love yellow cars", "yellow cars and", "cars and boats"}

(d) N4 (N = 4) : {"I love yellow cars", "love yellow cars and", "yellow cars and boats"}

(e) ...

Since it is important to consider each sequence constituted by one word only, the minimum N-Gram value selected is one. However, distinct maximum N-Gram values lead to different performances. As such, a set of values must be analysed prior to choosing the one that best fits the procedure. It is essential to remark that, since the argument is the N-Gram range and the minimum value is 1, the total vocabulary includes all of the N-sets from 1 up to the maximum value. For example, if the maximum N-Gram were 3, the complete word set would include N1, N2 and N3.
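The enumeration of N-Gram sets above can be sketched directly; TfidfVectorizer builds an equivalent vocabulary internally when given ngram_range=(1, n_max):

```python
def ngrams_up_to(words, n_max):
    """All contiguous N-grams for N = 1 .. n_max, i.e. ngram_range=(1, n_max)."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

words = "I love yellow cars and boats".split()
grams = ngrams_up_to(words, 3)  # the N1, N2 and N3 sets combined
print(len(grams))  # 6 unigrams + 5 bigrams + 4 trigrams = 15
```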

5. Singular Value Decomposition and Dimensionality Reduction

The decomposition of the sentences-by-terms matrix and the reduction of its dimensionality are performed together with the help of the function linalg.svd from the package numpy. Beyond the TF-IDF matrix, it requires one argument to behave appropriately:

• A Boolean flag, compute_uv, that, if True, computes all three matrices of the SVD. If False, only the singular values are computed. As the three matrices are needed, compute_uv = True.

The output of this method is the three matrices obtained after an SVD. After normalizing the eigenvalues, the most important components are picked until the percentage of cumulative eigenvalues (p_eig) is attained. This value is, however, unknown and must be computed in order to obtain the best performance for the system. Finally, the necessary components of the three matrices are picked and the matrix is reconstructed.
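This step can be sketched with numpy as follows; the TF-IDF matrix is hypothetical, and interpreting the normalized cumulative eigenvalues as the cumulative fraction of the singular-value mass is an assumption:

```python
import numpy as np

# Hypothetical sentences-by-terms TF-IDF matrix (5 documents x 4 terms).
A = np.array([
    [0.8, 0.1, 0.0, 0.3],
    [0.7, 0.2, 0.1, 0.2],
    [0.0, 0.9, 0.8, 0.0],
    [0.1, 0.8, 0.9, 0.1],
    [0.5, 0.0, 0.1, 0.9],
])

U, s, Vt = np.linalg.svd(A, compute_uv=True)  # full decomposition

# Keep leading components until p_eig of the normalized singular-value mass
# is reached (50% is the value selected later in this chapter).
p_eig = 0.5
cum = np.cumsum(s) / s.sum()
k = int(np.searchsorted(cum, p_eig) + 1)

# Reconstruct the rank-k approximation from the truncated factors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(k, A_k.shape)
```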

6. Parameters’ values computation

The LSA vector is finally obtained as the output. However, for the algorithm to function according to the main objective, there are a couple of parameters that must be defined so that the system is able to execute. Briefly, the arguments to take into consideration are: ngram_range, min_df, p_eigen and α (the Laplace smoothing parameter). As stated in the fourth part of this subsection, the minimum N-Gram value selected is 1. Regarding the values of the minimum document frequency, the maximum N-Gram value and the percentage of cumulative singular values, a collection of experiments is prepared and conducted in order to decide which values deliver the best performance to the algorithm.

For this reason, a dataset is needed to train the LSA subsystem and to test whether the algorithm's efficiency is enhanced and maximized with a specific set of parameter values. A small dataset is created with a couple of distinct topics to test the impact of different values on the system and to check if it indeed functions properly. The dataset adopted for this analysis can be visualized in appendix B and is divided into training (tables B.1 and B.2) and testing sets (table B.3). The experiments are divided into three major sets of tests and, in each group, only one of the three parameters is modified while the remaining ones are kept constant, in order to control each parameter's variance. For every experiment, the number of correct predictions and its percentage are computed for every different set of parameter values. After all trials are completed, the results within each major set of tests are compared and the chosen value corresponds to the one which delivers the best output. Each parameter takes the following default value when it is kept constant:

(min_df, ngram_max, p_eigen) = (2, 2, 50%)        (5.1)

• Minimum document frequency, min_df. The possible values for the minimum document frequency include all natural numbers from 1 to ∞; however, as can easily be visualized in figure 5.4, from a certain frequency onwards the system's performance starts to deteriorate. It is crucial to note that, as the minimum frequency increases, the extent of the set of terms decreases. Hence, it is better to have a smaller set of terms if two different frequencies deliver the same performance.

Figure 5.4: Minimum Frequency Appraisal.

Observing figure 5.4, it is clear that the best value for min_df is either 1 or 2, as the performance produced is equal to 90.5%. Therefore, it is concluded that the best value for the minimum document frequency is two, since a smaller set is preferred. After these values, the output begins to lose accuracy, with the minimum obtained for 5 and 6 (less than 65%). Due to the reduced size of the dataset, the majority of the terms of the phrases have a low frequency, leading the LSA vocabulary to be greatly reduced when the minimum frequency is larger than 2. The vocabulary continues to diminish until the minimum value becomes high enough that the set of LSA terms becomes empty, as the frequency of the terms is lower than the given threshold. At this point, the system relies solely on the keywords given by the keyword extractor.

• Maximum value of N-Gram, ngram_max

The maximum N-Gram values to be analysed range from the minimum N-Gram value, i.e. 1, to ∞, including only natural numbers. For a maximum N-Gram value N, the complete set includes all of the subsets, from the one composed purely of 1-word sequences up to the one containing sequences of N words. Thus, for higher N, not only does the number of N-Gram subsets increase but the subsets with bigger N also contain more words. Therefore, if two N values deliver the same performance, the lower N should be chosen.

Figure 5.5: Maximum N-Gram Appraisal.

After a quick observation of figure 5.5, all the possible N-Gram values have a good performance, with the worst one being 85.7%. However, the best outcome belongs to a maximum N-Gram greater than or equal to 3 (95.2%). Hence, the best maximum N-Gram value is three, as the quantity of terms is lower. So many different values have the same performance because of the reduced number of training phrases which, intertwined with the minimum document frequency, eliminates many N-Gram elements. This effect is particularly harsh for elements composed of more than two words: the more words the contiguous sequences have, the rarer they become, making it more difficult to reach the frequency threshold. Hence, for higher N the vocabularies are very similar or equal, causing the performance to be the same.

• Percentage of cumulative singular values, p_eigen

The percentage of cumulative singular values varies between 0% and 100% (or 0 and 1) but, given that there are too many values in the possible interval, at first only multiples of ten are included, for a broader outlook of the influence of this parameter. Both 0% and 100% are excluded from the possible set of values, since the former leads to no eigenvalues being picked at all, while the latter makes the dimensionality reduction step of LSA pointless, as the matrix obtained is essentially the same as before the decomposition. After running the first set of tests, more values are added to better understand the influence of the parameter; these include every number that ends in 5 from 50 to 70. For lower percentages, fewer columns and rows from the SVD matrices are chosen, prompting an easier reconstruction of the new reduced matrix. Thus, the lowest value should be selected when two or more percentages correctly predict an equal number of phrases.

Figure 5.6: Percentage of Cumulative Singular Values Appraisal.

After analysing figure 5.6, the system manages to function perfectly, correctly predicting 100% of the phrases when the percentage of cumulative eigenvalues is either 50% or 60%. Thus, the percentage of cumulative eigenvalues chosen is 50% (0.5). The system obtains a performance below 90% when dealing with values under 50% or higher than 80%. In the first case, the system cannot work properly since the number of dimensions picked is not enough and some important information is overlooked. In the latter situation, it picks too many dimensions and the additional information given by the insignificant dimensions damages the induction made by LSA. It is also confirmed that it is better to execute the dimensionality reduction, and that the reduction brings more insight to the data than was available before the SVD.

5.1.4 Multinomial Naive Bayes Classifier

The class MultinomialNB of the package sklearn.naive_bayes [42] is picked, since it corresponds exactly to the type of MNBC desired and computes the probability exactly as in equation 4.8. The class has one important variable to be defined, the Laplace smoothing parameter (α) in equation 4.7, which ranges from 0 (no smoothing at all) to 1. In order to understand the influence of the value of α and which value delivers a better performance to the system, a set of tests similar to the ones executed in the previous section is carried out to choose the value of the constant. For this purpose, a couple of values are selected, starting from 0 and ending at 1. In between, 2 values from each order of magnitude are picked, beginning with 3 decimal digits so as to have a different influence when compared to having 0 as the parameter value. Therefore, the Laplace parameters to be analysed are the following:

α = {0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} (5.2)

The performance obtained for each value of α is the same as the best one attained in the former section. Therefore, it can be said that, at least for a dataset of this size, the Laplace smoothing parameter has no influence. However, it is known in ML that the use of this constant usually leads to a better output and, thus, a number bigger than zero and not excessively high is picked; after obtaining more real data, the trials are to be rerun to confirm that the value is the best one and that the assumption holds. Summarizing, the best parameter values, used in this system, are:

Parameter                                  Value
Minimum document frequency                 2
Maximum N-Gram                             3
Percentage of cumulative singular values   50%
Laplace Smoothing                          0.01

Table 5.2: Values used in this thesis’ system for the following parameters

After computing the MNBC parameters, the classifier is ready to predict new phrases. A new phrase is initially processed by LSA and, subsequently, the classifier predicts the utterance the robot should express based on the LSA vector. When the algorithm obtains the robot phrase, it must be transformed to speech so the robot can vocalize the predicted sentence. In pursuance of finding the proper TTS software, and since the Google Speech Recognition API was already selected in 5.1.1, one of the Google APIs is adopted. The choice fell on the class gTTS from the package gtts (Google Text-to-Speech) [43], which requires two parameters: the phrase to be uttered and the language of the discourse.

5.2 Discourse Build-up

The datasets in part 6 of subsection 5.1.3 were built by a Portuguese person who already understands how the system works, which drives the system to misleading performances, i.e. to better achievements than it would otherwise obtain with unknown real data. For all intents and purposes, the human utterances used to train the system must come from people unaware of the advantages or disadvantages of the system's functionality. Before introducing the form technique, it is important to remark that each human utterance is followed by and associated with a certain robot phrase. Also, each different utterance verbalized by the robot represents a unique class label and, as such, the set of phrases expressed by the robot is always the same. On the other hand, every person has distinct manners of articulating the same idea and, therefore, the human part of the discourse is mutable and is the one that needs to be collected. For this purpose, a form is created that follows a base dialogue, depicted in this section, which keeps the robot speech and conceals all the human utterances of the discourse. This is done so that the person filling in the form can rewrite them and create human phrases similar to the ones written in the base dialogue, but that make a coherent discourse when aligned with the robot speech. The main goal of the form is to capture the variance in how people express the same concepts in distinct words. In order to build the base dialogue, the favorite and undesirable themes of conversation among the elderly were inspected. Themes such as religion or politics are to be avoided. In pursuance of obtaining a general discourse, the main theme chosen is preferences and the conversation is split into 4 different main parts: the greeting, favorite activities, food and the goodbye. The first and the last are added to the dialogue as they are the common convention between human beings when starting and finishing a conversation, respectively.
Inside favorite activities, 2 options are made available: being with the family or going for a walk, as these are topics an elder would talk about. In the theme family, the conversation revolves around where the relatives live, while the thematic go for a walk discusses two different types of location: inside and outside. Inside introduces the mall and activities that involve it, e.g. buying clothes and going to the cinema, while outside describes an outdoor location and what a person does when the weather is too bad to go outside. After describing the favorite activity, the conversation evolves to the topic of food, especially the most preferred meat, fish, Italian dish and dessert. Every discourse design decision is explained throughout the unfolding of the dialogue; the letter H represents the human utterances (more specifically the elder's) while R corresponds to the robot discourse. The base dialogue has a couple of different conversation alternatives and, as such, each one is

presented in table 5.3, where each number represents an excerpt of the total conversation:

Alternative  Parts of the speech
A            1, 2
B            1, 3, 4, 5, 11
C            1, 3, 4, 6, 7, 11
D            1, 3, 4, 6, 8, 9, 11
E            1, 3, 4, 6, 8, 10, 11
F            1, 3, 8, 9, 11
G            1, 3, 8, 10, 11

Table 5.3: Parts of the speech composing each conversation alternative.

1. • Human: Bom dia.
Robot: Bom dia. Está tudo bem consigo?

• H: Comigo está tudo ótimo e consigo?
R: Eu estou bem-disposto. Queria saber se tem vontade de conversar.

After the first part of the dialogue, the elder can respond negatively (part 2) or positively (part 3) to the suggestion.

2. • H: Peço desculpa mas não posso conversar. Já tenho planos para hoje.
R: Não faz mal. Então tenha um bom dia.

• H: Muito obrigado, bom dia para si também. Adeus.
R: Adeus.

The robot-elder interaction ceases and this conversation alternative ends after this part. The next segment represents the other path:

3. • H: Sim, seria agradável um pouco de companhia.
R: Muito bem, falaremos dos nossos gostos. O que é que mais gosta de fazer?

Although this question has n possible answers, in the first form iteration only two more general options are given, just to prove that the system can understand distinct alternatives: enjoying being with your family (part 4) or taking a walk (part 8).

4. • H: Gosto imenso de estar com a minha família.
R: Tem muitos familiares que vivam perto?

Now, the interlocutor can either say that his/her family lives nearby (part 5) or far away (part 6).

5. • H: Felizmente tenho possibilidade de poder estar frequentemente com a maioria deles.
R: Ainda bem, fico contente. Agora gostaria de saber a sua comida preferida.

This conversation path leads to a new conversation topic: food, introduced in part 11.

6. • H: Infelizmente só alguns familiares moram aqui perto.
R: É uma pena, às vezes a vida podia ser mais como nós queremos. Mas para além disso, não há nada que aprecie?

The latter utterance contains an open question that can comprehend n possible responses. In this situation, the activity taking a walk appears again as a choice, along with nothing else. Both possibilities are shown in dialogues 8 and 7, respectively.

7. • H: Não há nada assim que me dê tanto gosto.
R: Muito bem, falaremos de outras coisas então. Qual é a tua comida favorita?

The previous alternative leads to the same theme as the fifth, food, presented in part 11.

8. • H: Gosto de sair de casa. Ainda tenho sorte de ter pernas para andar.
R: Sim, é uma sorte. Onde é que gosta mais de ir?

Once more, 2 different choices are presented for this open question with infinite available answers: an indoor location and an outside place, represented by parts 9 and 10, respectively. The indoor dialogue revolves around the mall and activities associated with it (like buying new clothes and going to the cinema):

9. • H: Um dos meus sítios prediletos é o centro comercial.
R: Eu não gosto de centros comerciais. No geral costumam ser aborrecidos. O que é que lá costuma fazer?
H: Normalmente vou às lojas de roupa.
R: Tem por hábito comprar roupas?
H: Normalmente não, mas de vez em quando gosto de comprar uma peça ou para mim ou para alguém de quem goste.
R: Pois, às vezes sabe bem oferecer uma peça bonita, mesmo que seja a nós próprios.
H: Exato. Também, se tiver vontade, gosto de ir ver um filme.
R: Muito bem. Costuma comprar pipocas quando vai ao cinema?
H: Por acaso não, porque detesto comer pipocas.
R: É a primeira pessoa que eu conheço que não gosta. O que é que lhe agrada comer?

The previous section ends with the beginning of a new topic, food, which starts in dialogue 11. It is important to notice that a couple of questions in the ninth part only show a single answer, since there are already enough branching points to demonstrate that the algorithm can pick one option over another. Therefore, the main focus is robustness, showing that the algorithm is able to respond correctly when hearing a new human utterance. This situation repeats itself in the subsequent interactions, all of which ignore the plurality of choices.

The next discourse talks about a park, an outdoor environment, and where the elder would prefer to go if the weather wasn't good enough to be outside.

10.• H: Gosto imenso de ir ao parque Eduardo VII, é bom estar ao ar livre. R: Também prefiro estar no exterior com bom tempo. O que faz quando está chuva ou frio? H: Nesse caso prefiro ficar por casa, ou se puder, ir para casa de alguém, um familiar ou amigo. R: Portanto estou a ver que dispensa o frio. H: Dispenso? Eu sou uma pessoa friorenta, por isso odeio o frio. R: Já somos dois que não gostam do frio. Falando agora noutras preferências, qual é a sua comida de eleição?

This part also finishes by introducing the new topic, food. Finally, the subsequent dialogue marks the start of this topic.

11.• H: Isso é uma pergunta difícil. Há imensos pratos que me dão alegria comer. R: Certo, então vamos por partes. Qual é a sua carne favorita? H: Essa é fácil: carne de vaca. R: E em relação ao peixe? H: Como um bom português, adoro bacalhau. R: Realmente é dos melhores peixes. E qual é a melhor sobremesa para si? H: Essa seria um tiramisu. Mas acho que é porque tenho uma fraqueza por comida italiana. R: Itália é um dos países com melhor comida. Especialmente, com a melhor pizza. H: Concordo plenamente! A pizza, a lasanha e os diferentes tipos de massa tornam a culinária italiana irresistível. R: Nem mais, é mesmo impossível resistir a esses pratos italianos. H: Enfim, a conversa está boa mas eu preciso de ir agora. R: Percebo completamente. Também foi um prazer falar consigo. H: Falamos mais tarde. Até logo. R: Até logo.

After discussing some culinary preferences, the human has to leave due to other commitments and the base discourse is completed.

The form is created with the help of Google Forms and is shared with and filled in by a small sample of 35 people, to check whether this is a valid approach to create a dataset representative of a typical conversation between an elder and a robot. The form has a couple of important rules, such as avoiding yes-or-no answers and giving responses that are as complete as possible. The form can be accessed through the following link. After obtaining the data, all the phrases are analyzed to correct the

existing spelling errors and to check whether the answers given by the interviewees follow the rules. When they do not, they are eliminated from the final dataset, which resulted in the withdrawal of about 15 phrases. However, some one-word human phrases are kept, either because they correspond to a greeting or because they can be used in the final part of this section, in the presentation of incorrect phrases, to explain why they should be avoided. In conclusion, the final set contains 455 human utterances and 22 robot utterances, resulting in 22 class labels used in the MNBC. Each class has a different number of human utterances associated with it, with the smallest classes including only 5 human phrases and the biggest class consisting of 34 human sentences. The average number of human utterances per class is 22. Finally, each human utterance's length ranges between 1 and 27 words. The dataset is available in the file files/database.txt in [26].

5.3 Training and Testing Set

It is very important to divide the acquired data (the LSA vectors obtained after processing the phrases) into two major sets: the training and the testing set. The first set is used for calculating the MNBC probabilities necessary to predict new data, while the second corresponds to the new vectors to be predicted. The main objective of the testing set is to analyse the system's performance on unfamiliar data. Since every different division of the data produces distinct training/testing sets, the division directly affects the learning and prediction phases, which means that some splits yield better performance than others. As such, in order to dismiss the influence of a particular split, the correct way to compute the most probable performance is to use Cross Validation (CV). CV is a model validation technique based on re-sampling the data into different training and testing samples, demonstrating how the system handles an independent set. This procedure generally keeps a portion or a sample of the data aside to be used for validating or testing, while the remainder of the data is applied to train the system. In addition, CV is a great technique when the quantity of data available is limited, which is the case in this thesis. There are many different types of CV but, for this purpose, K-Fold Cross Validation (KCV) is chosen since it divides the data into K equal-sized portions, with one of them being used for testing and the rest for training. Each portion of the set is used once as the testing set so that all of the phrases get the chance to participate in both subsets, creating K different set divisions. In order to obtain an 80%/20% division of the full set into training and testing subsets, respectively, the data needs to be divided into subsets of 20% and, thus, the value chosen for K is 5, as observed in figure 5.7.
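The 5-fold split described above can be sketched with scikit-learn's KFold; this is a minimal illustration with dummy data, not the thesis code, and the real input would be the LSA vectors of the dataset phrases.

```python
# Minimal sketch of the K-Fold split described above (K = 5 gives an
# 80%/20% train/test division). The data is a dummy stand-in for the
# LSA vectors of the dataset phrases.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)          # 50 dummy feature vectors
kf = KFold(n_splits=5, shuffle=True, random_state=0)

split_sizes = [(len(train), len(test)) for train, test in kf.split(X)]
# Each of the 5 folds trains on 40 vectors (80%) and tests on 10 (20%).
```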
However, the number of elements that each class label includes is not constant and so, in order to avoid an unbalanced division, especially regarding the utterances that belong to the

Figure 5.7: Full set split into training and testing subsets [1].

smallest classes, a stratified split is needed. Otherwise, if all the utterances of one class ended up in the testing set, there would be no representation of them in the training phase and the system would fail to correctly predict them. If the inverse happened and all of them were in the training set, there would be no means to evaluate whether the system functions properly. The Stratified K-Fold Cross Validation (SKCV) fits this objective perfectly and, therefore, is selected [44]. For it to operate accordingly, the minimum number of elements per class must be equal to K, i.e. 5, and, as such, any class whose size is below this threshold is eliminated from the complete set along with its corresponding human utterances. The split is always random, thus most executions of the system obtain different outputs but, as it is executed 5 times and all of the data is present in both the training and testing sets throughout the iterations, the average performance value is representative of the overall system attainment.
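The stratified split plus the class-size threshold can be sketched as follows, assuming hypothetical toy labels (scikit-learn's StratifiedKFold in fact raises an error when a class has fewer members than the number of splits, which is why the filtering step comes first):

```python
# Sketch of the stratified 5-fold split: classes with fewer than K = 5
# utterances are dropped first, as described above (toy labels here).
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

K = 5
y = np.array(["a"] * 3 + ["b"] * 10 + ["c"] * 12)   # class "a" is too small
X = np.arange(len(y), dtype=float).reshape(-1, 1)

counts = Counter(y)
keep = np.array([counts[label] >= K for label in y])
X, y = X[keep], y[keep]                              # class "a" is removed

skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=0)
fold_test_classes = [set(y[test]) for _, test in skf.split(X, y)]
# Every remaining class appears in every fold's test portion.
```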

5.4 System Adjustment

To validate the parameters chosen in sections 5.1.3 and 5.1.4 (i.e. min_df, ngram_max, p_eigen and α), all of the previously executed tests are repeated. The values previously selected are depicted in table 5.2.

5.4.1 Minimum Document Frequency and Maximum N-Gram Value

The parameters to be tested are the maximum value of the N-Gram range together with the minimum document frequency, as it became obvious after the first trials that they influence each other. So, for each maximum N, all of the possible frequencies are tested. The potential values for each variable are the same as before in subsection 5.1.3 and the results are presented in image 5.8. Analysing figure 5.8 along with its sub-figures, it is observed that, comparing the N-Grams, the worst performance is for the smallest value (1), as all of the test set performances are below 75% and the average performance over all frequencies is only 70.5%, while the other Ns range between 73% and 74%, peaking at N = 2 (74.0%). This means that introducing N-Grams improves the performance of the system.
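The grid of maximum N-Gram values and minimum document frequencies can be sketched with scikit-learn's TfidfVectorizer; the corpus below is a toy stand-in for the real dataset, not the thesis data.

```python
# Sketch of the (ngram_max, min_df) grid described above; a toy corpus
# stands in for the real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "gosto de passar tempo com a familia",
    "gosto de sair de casa e dar passeios",
    "gosto de ir ao centro comercial",
]

vocab_sizes = {}
for n_max in range(1, 7):                  # maximum N-Gram value
    for min_df in (1, 2, 3):               # minimum document frequency
        vec = TfidfVectorizer(ngram_range=(1, n_max), min_df=min_df)
        vec.fit(corpus)
        vocab_sizes[(n_max, min_df)] = len(vec.vocabulary_)
# Raising min_df shrinks the vocabulary while raising n_max enlarges it;
# each (n_max, min_df) pair would then be scored with the classifier.
```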

Figure 5.8: Influence of the minimum document frequency and maximum N-Gram value combined on the final dataset. Each sub-figure, (a) through (f), corresponds to a different maximum N-Gram value (ngram_max = 1 to 6), each point of the graph corresponds to a different frequency, and the bars represent the deviation from the average performance.

Regarding the minimum document frequency, it is easily noted that the best performance always occurs for either 2 (sub-figures 5.8a, 5.8d and 5.8f) or 3 (sub-figures 5.8b, 5.8c and 5.8e), confirming that the use of the minimum frequency threshold leads to a better LSA performance. Eliminating

5.8a, since it was concluded that ngram_max ≠ 1, and 5.8f, as its best value is less than 76%, there are four possible values left for ngram_max. The maximum average performance obtained for each of the four Ns is shown in table 5.4.

N / min_df    Average    Upper Bound    Lower Bound
2 / 3         77.84%     +9.92%         -9.43%
3 / 3         77.60%     +7.11%         -6.04%
4 / 2         76.71%     +6.22%         -5.11%
5 / 3         76.71%     +4.58%         -4.32%

Table 5.4: Best average value for each N, followed by its upper and lower bounds, along with the corresponding minimum document frequency (min_df).

As explained before in 5.1.3, a bigger minimum frequency for a similar output is preferable, since it means that the performance is maintained with a smaller vocabulary and terms useless to the system are eliminated. Thus, 3 is the best choice for the parameter min_df. It is also quickly seen that the best output is for N = 2 and the worst for N = {4, 5}. However, the classification standard deviations are completely inverted, with the smallest value for N = 5 and the largest for N = 2. The difference between the two averages is very small (approximately 1%) compared to their upper and lower bounds. Nonetheless, it should be kept in mind that for N = 2 a score as high as 88% but as low as 68% can be obtained depending on the set splitting, which is an unreasonable deviation for a classification system. On the other hand, for N = 5 the results range from 72% to 82% and, even if the system cannot attain such high scores and needs a greater vocabulary set, at least the output is more reliable. Hence, the optimal value for the maximum N-Gram is 5.

5.4.2 Percentage of Cumulative Eigenvalues

The possible percentages are not only the ones chosen in the first experiments in image 5.6 but also 45%, 75%, 85% and 95%. The value 100% is included just to prove that the use

of LSA is, indeed, better for the system. The new values of min_df and ngram_max are adopted in the investigation of the influence of this parameter on the classification system, and the results of these trials can be visualized in image 5.9. Quickly analysing figure 5.9, the best performances correspond to the values 75% and 95%. The performance for 100% is not very different from these values; however, it does not correspond to the best score. Even if the favored values are much higher than the initial 50% selected in 5.1.3, from 55% to 95% the output obtained is quite similar, just diverging

on the standard deviation. Therefore, it is concluded that LSA improves the outcome, especially through its dimensionality reduction. Regarding the selection of either one of the best two values, a deeper analysis is carried out in table 5.5.

Figure 5.9: Influence of the percentage of cumulative eigenvalues on the final dataset. Each point of the graph corresponds to a different percentage and the bar represents the deviation from the average performance.

p_eigen    Average    Upper Bound    Lower Bound
75%        81.98%     +5.29%         -8.21%
95%        82.27%     +3.94%         -2.27%

Table 5.5: Best average value for each of the two best percentages, followed by its upper and lower bounds.

Observing table 5.5, the best average performance and the smallest standard deviation correspond to the same percentage, 95%. Note that for 75% it is possible to obtain the highest classification score (87%) but also the lowest (74%), while 95% always keeps its result above 80% accuracy. Consequently, the value selected is 95%.
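The dimensionality choice implied by p_eigen can be sketched as follows; this is an illustration with random data standing in for the TF-IDF matrix, not the thesis implementation.

```python
# Choosing the number of LSA dimensions so that the kept singular values
# account for p_eigen of their cumulative sum (illustrative data only).
import numpy as np

rng = np.random.default_rng(0)
tfidf_matrix = rng.random((40, 30))      # stand-in for the real TF-IDF matrix

_, s, _ = np.linalg.svd(tfidf_matrix, full_matrices=False)
p_eigen = 0.95                           # the percentage selected above
cumulative = np.cumsum(s) / np.sum(s)
k = int(np.searchsorted(cumulative, p_eigen)) + 1  # smallest k reaching 95%
# The LSA projection then keeps only the first k singular vectors.
```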

5.4.3 Laplace Smoothing Parameter

Finally, the last parameter to be evaluated is the Laplace Smoothing parameter (α). Again, the experiment is executed for the same values as before in 5.1.4, already using the new best parameter values. The results are presented in image 5.10. As seen in the first experiments on all parameters, α is the one that least influences the data, fluctuating a meager 2.5% between the best and the worst result. It was presumed in subsection 5.1.4 that introducing a little smoothing would improve the final results; nonetheless, it is clear that the

Figure 5.10: Influence of the Laplace Smoothing on the final dataset. Each point of the graph corresponds to a different alpha and the bar represents the deviation from the average performance.

assumption was not valid, since the best value corresponds to no smoothing at all, α = 0 (which has a performance of 81.13% and also a narrower standard deviation). The justification is grounded in the word frequencies obtained after normalization: they are relatively small, and the slight smoothing inserted by an α greater than zero disturbs these frequencies, damaging the induction made by the LSA. For these reasons, the α picked is 0. In conclusion, the final constants selected for the parameters are presented in table 5.6.

Parameter                                   Value
Maximum N-Gram                              5
Minimum document frequency                  3
Percentage of cumulative singular values    95%
Laplace smoothing parameter                 0

Table 5.6: Best values chosen for the parameters for the final dataset.
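The α sweep can be sketched with scikit-learn's MultinomialNB; the features below are a toy non-negative stand-in, not the real LSA data, and a near-zero α replaces the exact 0 for numerical safety.

```python
# Sketch of sweeping the Laplace smoothing parameter alpha of a
# Multinomial Naive Bayes classifier (toy non-negative features).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(1)
X = rng.random((60, 10))                 # word-frequency-like features
y = rng.integers(0, 3, size=60)

scores = {}
for alpha in (1e-10, 0.01, 0.1, 1.0):    # ~0 up to full Laplace smoothing
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X, y)
    scores[alpha] = clf.score(X, y)
# On the real data, each alpha would be evaluated with cross-validation.
```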

These values are quite different from those in table 5.2, and the reason is the new dataset. The new dataset is much bigger than the one employed in the first trials, and the number of class labels and terms is considerably greater. Hence, for substantial dataset changes, it is necessary to tune the system's parameters in order to maintain the best outcome possible. After adjusting the parameters, a deeper look is taken at the other components of the system to validate that they improve the result.

5.4.4 Stop Words

The list of stop words is a set of words that are very common in a language; it is used in this system to eliminate the terms that do not bring any context to the phrase, since otherwise the induction made by LSA would be incorrectly influenced. However, there is no stop-word set that is guaranteed

to perfectly describe the most common words of a language. Therefore, it is necessary to verify whether the set chosen in 5.1.3 is the one that delivers the best performance. Four different trials are executed, each with a different stop-word set for the LSA: the first one, A, is the one chosen in 5.1.3; the second one, B, contains all the words in the original set of stop words, i.e. all of them except the Portuguese word "é". The third set, C, includes all of the stop words of set B but excludes the verbs and personal pronouns, while the last configuration completely bypasses the use of a list. Each trial is executed 50 times in order to obtain an overall result for the database used. For this reason, it is expected that the classification standard deviation is much higher than the variances previously attained. The results are presented in table 5.7.

LSA type            Average    Upper Bound    Lower Bound
Stop Words set A    82.88%     +10.53%        -9.25%
Stop Words set B    82.76%     +9.55%         -9.13%
Stop Words set C    84.02%     +8.29%         -9.29%
No Stop Words       81.83%     +10.48%        -11.50%

Table 5.7: Best average value for each system configuration followed by its upper and lower bounds

It is evident that set C is the best of the four, while using no set is the worst. The set previously chosen only scored second, but it is noticeable that the introduction of the word "é" into the stop-word set enhances the performance of the whole system when compared to using the original stop-word set. However, even with the term "é", set C is still better in this case. It is also crucial to acknowledge that, since the Inverse Document Frequency in the TF-IDF is adopted in the computation of the word frequencies, the weight of the more common words in the phrases is reduced quite heavily, securing the integrity of the whole system. In conclusion, the final system uses stop-word set C instead of A.
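The derivation of set C from the base list can be sketched as follows; the tiny Portuguese stop-word list and the verb/pronoun subset below are hypothetical, standing in for the real lists used in the thesis.

```python
# Sketch of building stop-word set C: take the full list (a tiny
# hypothetical Portuguese set here) and drop verbs and personal
# pronouns, then pass the result to the vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

base_stop_words = ["de", "que", "não", "é", "sou", "eu", "tu"]  # toy set B
verbs_and_pronouns = {"é", "sou", "eu", "tu"}                   # hypothetical
set_c = [w for w in base_stop_words if w not in verbs_and_pronouns]

corpus = ["eu sou uma pessoa friorenta",
          "não gosto de centros comerciais"]
vec = TfidfVectorizer(stop_words=set_c)
vec.fit(corpus)
# Words in set C are excluded from the vocabulary; verbs/pronouns are kept.
```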

5.4.5 Keywords

The keywords extracted by the Azure software, along with the terms extracted by the TfidfVectorizer, constitute the vocabulary (the terms used to characterize the human phrases). In this subsection, the introduction of keywords into the system's vocabulary is evaluated to determine whether the final performance improves. For this purpose, two trials are executed: in the first one the vocabulary consists purely of the terms extracted by the TfidfVectorizer, while the second also includes the words gathered by the Azure software. Each trial is run 50 times to obtain the most general output over the database possible, leading to a larger classification standard deviation. The results can be observed in table 5.8.

LSA type       Average    Upper Bound    Lower Bound    Total words in the vocabulary
No Keywords    84.15%     +9.26%         -8.33%         576
Keywords       81.99%     +9.22%         -10.56%        792 (= 576 + 216)

Table 5.8: Best average value for each system configuration followed by its upper and lower bounds

Analysing table 5.8, it is clear that the LSA vocabulary delivers a better average performance and a smaller classification standard deviation without the Azure keywords. It is also important to remark that the first system configuration has only 576 terms, while the second one has 792. This means that the TfidfVectorizer alone already extracts more words than the Azure software, which extracts almost 300 words of which only 216 are not already represented in the TfidfVectorizer vocabulary. In conclusion, the introduction of the extracted keywords is rather useless, since it disturbs the system enough to slightly lower its accuracy and precision. It also means that the words selected by the extractor software are not as meaningful as expected and/or that the majority of the terms are already represented in the rest of the vocabulary list. Therefore, the final system does not use the keywords extracted by Azure.

5.5 Results and Output Analysis

5.5.1 Class Labels Robustness

In the first place, a closer look is taken at the behaviour of the system when more classes are included, analysing the loss of performance. For this purpose, an experiment is created that essentially picks n random numbers, where each number represents a class label and must belong to a specific interval, between the first label, no. 1, and the last one, no. 22. The value n is every natural number between 1, the minimum number of classes possible, and 22, the total number of classes in the dataset. Afterwards, the classes that correspond to the numbers picked and their associated phrases are selected. Each n is executed 10 times, obtaining 10 different sets of classes, in order to capture the general influence of distinct class combinations on the performance. This is particularly important since the classes have different numbers of total phrases and this component affects the final output. The results of the trial are represented in image 5.11.

As expected, figure 5.11 shows that the highest value (100%) corresponds to the trial with only one class, while the lowest value (82.2%) corresponds to the one with the highest number of class labels. In general, the performance of the system declines as more classes are added. Analysing it in more detail, the decrease is steeper for n smaller than 7, while for the remaining values the accuracy loss is rather low. Another important aspect to point out is the massive increase of the classification standard deviation from 1 to 7 classes, which afterwards diminishes over the rest of the domain. This demonstrates that introducing more classes in the set raises the precision of the system, even if a little accuracy is lost. Finally, it is important to mention that the system is capable of functioning properly and maintaining an adequate result with at least 22 classes.

Figure 5.11: The impact of the class labels, n, on the performance of the final dataset.
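The class-sampling procedure of this experiment can be sketched as follows, with hypothetical placeholder phrases standing in for the real dataset:

```python
# Sketch of the class-robustness experiment: for each n, sample n of the
# 22 class labels uniformly at random and keep only their phrases
# (placeholder phrases below).
import random

all_labels = list(range(1, 23))                    # class labels 1..22
phrases_by_label = {lab: [f"phrase {lab}-{i}" for i in range(5)]
                    for lab in all_labels}

def sample_subset(n, seed=None):
    """Pick n distinct class labels and gather their phrases."""
    rng = random.Random(seed)
    chosen = rng.sample(all_labels, n)
    data = [(p, lab) for lab in chosen for p in phrases_by_label[lab]]
    return chosen, data

chosen, data = sample_subset(7, seed=0)
# Each n from 1 to 22 is repeated 10 times; every sampled subset is then
# cross-validated to measure the performance loss.
```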

5.5.2 Performance on number of phrases per class

Moving on to the number of phrases per class, the following experiment selects, for each trial, k phrases associated with each of the 22 class labels. The value of k ranges from 5, the minimum number of phrases for KCV to work properly, to 34, the total number of sentences of the biggest class. Because the classes have different total numbers of phrases, when the total number of sentences of a class is below the threshold k, that class is eliminated from the set used in that trial. Each trial is also run 10 different times so that, for lower k, different phrases of the bigger classes are used within the same trial. Table 5.9 depicts the number of classes for each k, and the output of the experiment is displayed in figure 5.13.

k                      5,6   7    8-11   12-21   22   23   24-27   28   29,30   31,32   33,34
No. of Class Labels    22    21   15     14      13   12   11      9    5       4       1

Table 5.9: Number of class labels per k.
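The thresholding rule behind table 5.9 can be sketched with hypothetical class sizes (the real sizes per class are not reproduced here):

```python
# Sketch of the per-class phrase-count experiment: for a given k, any
# class with fewer than k phrases is dropped (hypothetical class sizes).
class_sizes = {1: 34, 2: 22, 3: 8, 4: 5}     # class label -> total phrases

def classes_for_k(k):
    """Class labels that still qualify when k phrases per class are used."""
    return [label for label, size in class_sizes.items() if size >= k]

surviving = {k: len(classes_for_k(k)) for k in (5, 8, 22, 34)}
# As k grows, fewer classes reach the threshold, mirroring table 5.9.
```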

Before analysing the performance outcome, a few remarks must be made. There are 7 classes that have fewer than 10 total phrases each and only 5 classes reach the number k = 30. Also, the size of the majority of the classes lies around 20 total phrases. The last two values of k are ignored since they only have 1 class label in the set and, therefore, a 100% average performance. Following the rule of thumb on the amount of data samples, the 7 classes with fewer phrases will have a worse performance than the rest of them, since they don't have as many training utterances. On the other side of the spectrum, the 5 biggest classes should enhance the system's performance. In order to check the veracity of the previous statements, a trial with the whole dataset is done to see the errors associated with each class, and the results of each different set division are portrayed in figure 5.12. Carefully inspecting it, and taking into account the classes that at least once have a test set that wrongly predicts 10% or more of the phrases, a similar conclusion is reached. From the previous description, classes 5, 6, 7, 8, 10, 11, 12, 13 and 15 are selected. Only the class labels 7, 8 and 10 correspond to bigger classes, while 5, 6, 11, 12, 13 and 15 correspond to smaller ones. Although the majority of the absolute errors belong to the biggest classes, their greater size makes the percentage of errors per class remarkably low, leading to the conclusion that the weakest performances are associated with smaller subsets of training/testing phrases.

Figure 5.12: The percentage error for each different set division associated with each class after executing the system with the whole dataset.

The results of the experiment are depicted in image 5.13. The first clear impression is that the worst results occur for the lower k, i.e. when only a handful of class phrases are used, and that the performance slowly increases as the number of phrases rises. This means that the more utterances are available for LSA and the more training examples are accessible for the MNBC, the better the system's accuracy. As forecast before and confirmed now by these results, a smaller set of phrases always leads to a weaker performance. Even though not all values of k have the same number of class labels, the ones that do, like 5-6, 8-11 or 12-21 for example, clearly show that the performance improves as more phrases are introduced into the system. It is also evident that the performance improvement is much higher when k is lower, reaching around 90% when k is bigger than 23 phrases. Although the last values are higher than 90% (91-93%), they also have a lower total number of class labels, so it can be said that the system slowly converges to a performance of 90%, not only when introducing more phrases but also when including more classes. Additionally, the highest k with the majority of the class labels is 23, corresponding to almost 87% classification accuracy, which is a very favorable result. The main conclusion is that a bigger dataset is needed in further iterations of this work to improve the performance.

Figure 5.13: The impact of the number of phrases per class label on the final result.

5.5.3 Correct dialogue provided by the system

Here, a demonstration of the numerous possible alternatives of a correct conversation, similar to the base dialogue produced in 5.2, is presented to prove that the system works when applied to the problem described in the first chapter. The several alternatives can be visualized in table 5.10, where each letter represents an alternative and each number an excerpt of the total conversation:

Alternatives    Parts of the speech
A               1, 2, 4, 8
B               1, 2, 5, 6, 8
C               1, 2, 5, 7, 8
D               1, 3, 6, 8
E               1, 3, 7, 8

Table 5.10: Conversation alternatives and the parts of the speech that compose them.

All the possible alternatives start with excerpt 1 and finish with number 8, since these correspond, respectively, to the greeting and goodbye (and also food) themes of the speech. This dialogue is quite similar to the one built in 5.2, maintaining the robot utterances but changing the human sentences to the ones written by the interviewees. Nevertheless, some classes are not represented in this dialogue because the number of phrases associated with them was very small and, as previously explained, they were eliminated from the final set.

1.• Human: Bom dia.

• Robot: Bom dia. Está tudo bem?

• Human: Sim, e contigo?

• Robot: Eu estou bem-disposto. Tem vontade de conversar agora?

• Human: Sim, tenho!

• Robot: Muito bem, falaremos dos nossos gostos. O que é que mais gosta de fazer?

2.• Human: O que mais gosto é passar tempo com a minha família.

• Robot: Tem muitos familiares que vivam perto?

OR

3.• Human: Então, gosto muito de sair de casa e dar longos passeios.

• Robot: Onde é que gosta mais de ir?

The previous two extracts represent two favorite activities: being with one’s family or going for a walk. Analysing each human utterance, there are some words written in bold that stand out:

- O que mais gosto é passar tempo com a minha família.

- Então, gosto muito de sair de casa e dar longos passeios.

The words passar (spend), tempo (time) and família (family) are terms that belong to the first phrase, while sair de casa (go out) and passeios (walks) belong to the second. It is clear that the first phrase belongs to the being with the family theme while the second one belongs to going out, even though both of them include the word gosto (like). Because the whole discourse speaks about preferences, the word gosto is extremely frequent and common in many phrases; thus, its weight is reduced when compared to the other relevant words due to the TF-IDF.

4.• Human: A minha família vive toda aqui perto, somos vizinhos.

• Robot: Ainda bem, fico contente. Agora gostaria de saber a sua comida preferida.

OR

5.• Human: Infelizmente não, vivem todos no norte do país ou no estrangeiro.

• Robot: É uma pena, às vezes a vida podia ser mais como nós queremos. Mas para além disso, não há nada que aprecie fazer?

The last two fragments describe the positive and the negative answer, respectively, to the question: 'Tem muitos familiares que vivam perto?'. Examining each of the human utterances:

- A minha família vive toda aqui perto, somos vizinhos.

- Infelizmente não, vivem todos no norte do país ou no estrangeiro.

On one hand, there are some words that, even though they do not appear in both phrases, only determine the subtopic of the conversation (where your family lives): família (family) and vive(m) (live). On the

other side, terms like aqui (here), perto (near) and vizinhos (neighbours) clearly belong to an answer of someone whose family lives nearby, while infelizmente (unfortunately), não (no), norte do país (north of the country) and estrangeiro (abroad) belong to a negative response.

6.• Human: Gosto de ir até ao largo do Rato.

• Robot: Também prefiro estar no exterior com bom tempo. O que faz quando está chuva ou frio?

• Human: Fico em casa com amigos a ver um filme.

• Robot: Portanto estou a ver que dispensa o frio.

• Human: Sim, sou uma pessoa friorenta.

• Robot: Já somos dois que não gostam do frio. Falando agora noutras preferências, qual é a sua comida de eleição?

OR

7.• Human: Eu gosto muito de ir ao centro comercial, principalmente, e tu?

• Robot: Eu não gosto de centros comerciais. No geral costumam ser aborrecidos. O que é que lá costuma fazer?

• Human: Eu gosto de ir comprar novas roupas para estar fashion todos os dias.

• Robot: Tem por hábito comprar roupas?

• Human: Costumo comprar roupas para oferecer à minha mãe!

• Robot: Pois às vezes sabe bem oferecer uma peça bonita, mesmo que seja a nós próprios. O que mais faz?

• Human: Também costumo ir ao cinema.

• Robot: Muito bem. Costuma comprar pipocas quando vai ao cinema?

• Human: Não, não gosto de comer pipocas no cinema!

• Robot: É a primeira pessoa que eu conheço que não gosta. O que é que lhe agrada comer?

Finally, the preceding two excerpts represent two different locations someone would go out to. The first describes an exterior place whilst the second an indoor location. Looking at the first human utterance of each excerpt and carefully evaluating them, some words stick out:

- Gosto de ir até ao largo do Rato.

- Eu gosto muito de ir ao centro comercial, principalmente, e tu?

Thoroughly observing the words in bold, there are a couple of words that are equal in both discourses and they represent the topic of conversation (where do you like to go): gosto (like) and ir

(go). The other terms, largo (square) or largo do Rato (a Portuguese square) and centro comercial (mall), mark the divergence between the two dialogues, since one represents an outdoor place while the other is indoors.

8.• Human: Não sei bem, gosto de muitas coisas.

• Robot: Certo, então vamos por partes. Qual é a sua carne favorita?

• Human: Gosto de carne de vaca, tem mais sabor.

• Robot: E em relação ao peixe?

• Human: Como bom português, bacalhau sem dúvida!

• Robot: Realmente é dos melhores peixes. E qual é a melhor sobremesa para si?

• Human: Olhe, já comia um bom tiramisu. É uma boa sobremesa italiana.

• Robot: Itália é um dos países com melhor comida. Especialmente, com a melhor pizza.

• Human: Também gosto muito de massas.

• Robot: Nem mais, é mesmo impossível resistir a esses pratos italianos.

• Human: Estou a gostar da conversa mas preciso de ir trabalhar. Foi um prazer.

• Robot: Percebo completamente. Também foi um prazer falar consigo.

• Human: Obrigada, igualmente.

• Robot: Até logo.

Following the conversation alternatives presented in table 5.10, it is verified that all of them produce a coherent dialogue between the human and the robot. Therefore, it is confirmed that the program built functions accordingly, managing to answer a human utterance correctly and, in particular, to choose the right option when a topic has two different possible paths. It is also shown that the dataset approach is quite favorable; however, much more data is required. It can be obtained in many different ways, adopting the technique previously described or recording and transcribing dialogues between humans [17], especially between elders.

5.5.4 Wrongly predicted phrases

It is now time to look at some of the wrong results produced by the system. Six incorrect class predictions with at least 10% relative error, as can be observed in image 5.12, are indicated and thoroughly analyzed. The main objective is to demonstrate some of the system's faults, so that improvements in further iterations have a guideline to the system's weaknesses. Note that not all of the classes are represented in this subsection, since many have identical types of errors. For each human phrase that resulted in a wrong outcome, the class and the corresponding correct and predicted phrases are presented. In the human phrase, the keywords (i.e. the words in the vocabulary selected by the system) are in bold, followed by their weight computed through the TF-IDF function. After their introduction, a table is displayed with the likelihood probabilities of the phrase's keywords, followed by the a priori and total probabilities.
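The way a Multinomial Naive Bayes total score combines the a priori and likelihood probabilities can be sketched in a few lines. The combination rule below (class prior times each keyword likelihood raised to its TF-IDF weight) and all of the numbers are illustrative assumptions; they do not reproduce the exact totals shown in the tables that follow.

```python
import math

def nb_score(prior, likelihoods, weights):
    # Unnormalised score: P(c) * prod_i P(w_i | c) ** tfidf_i,
    # computed in log space for numerical stability.
    log_score = math.log(prior)
    for p, w in zip(likelihoods, weights):
        log_score += w * math.log(p)
    return math.exp(log_score)

# Illustrative values in the spirit of Table 5.11, with two
# keywords ("bem", "perto") weighted 0.53 and 0.84:
correct = nb_score(0.014, [0.002, 0.011], [0.53, 0.84])
predicted = nb_score(0.066, [0.007, 0.002], [0.53, 0.84])
print(predicted > correct)  # the much larger prior dominates the decision
```

With these numbers the predicted class wins even though the likelihood of the heaviest keyword favours the correct class, which is exactly the failure mode discussed for class 5 below.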

• Class 5:
Human phrase: Bem (0.53) perto (0.84).
Correct phrase: Ainda bem, fico contente. Agora gostaria de saber a sua comida preferida.
Predicted phrase: Certo, então vamos por partes. Qual é a sua carne favorita?

Label     | "bem" | "perto" | Class | Total
Correct   | 0.002 | 0.011   | 0.014 | 0.037
Predicted | 0.007 | 0.002   | 0.066 | 0.099

Table 5.11: Values of the conditional, class and total probabilities of the phrase of class 5.

The word perto (near) is the more important one in this phrase and is more connected to the correct phrase, while bem (well), even with its lower weight, is more related to the predicted one. Therefore, according to the likelihood probabilities represented in the table, the human phrase should be associated with the real class label. The problem here is the class probability, which is much higher for the predicted class than for the correct one, since the quantity of phrases associated with the predicted label is much greater. Thus, even though the conditional probabilities favoured the right phrase, the class probability does the opposite, resulting in a higher total for the predicted phrase. It must be noted that the correct label comes 14th out of 22 classes when comparing the final probabilities.

• Class 6:
Human phrase: Nem por isso, os mais perto (0.61) vivem (0.54) a uma hora (0.57).
Correct phrase: É uma pena, às vezes a vida podia ser mais como nós queremos. Mas para além disso, não há nada que aprecie fazer?
Predicted phrase: Ainda bem, fico contente. Agora gostaria de saber a sua comida preferida.

Label     | "perto" | "vivem" | "hora" | Class | Total
Correct   | 0.002   | 0.021   | 0.002  | 0.016 | 0.016
Predicted | 0.011   | 0.021   | 0.002  | 0.014 | 0.074

Table 5.12: Values of the conditional, class and total probabilities of the phrase of class 6.

The system fails to decide whether the human's relatives live nearby or not. The main reason is that the term perto (near) not only carries the biggest weight in the phrase but is also five times more likely to be associated with the predicted label than with the correct one. Furthermore, the words vivem (live) and hora (hour) and the a priori probability do not contribute much to the prediction, as both probabilities of belonging to each class are equal or very similar. Having no negative keywords to counterbalance the term perto, the system selects the wrong output. Due to perto, the correct class ends up 19th out of 22 classes, quite far from being selected.

• Class 8:
Human phrase: Gosto (0.22) de ir (0.28) ao café (0.50). Sair (0.39) faz (0.40) bem (0.33).
Correct phrase: Também prefiro estar no exterior com bom tempo. O que faz quando está chuva ou frio?
Predicted phrase: Onde é que gosta mais de ir?

Label | "Gosto" | "ir"  | "café" | "sair" | "faz" | "bem" | "gosto ir" | Class | Total
Corr. | 0.017   | 0.013 | 0.002  | 0.002  | 0.002 | 0.002 | 0.014      | 0.047 | 0.104
Pred. | 0.016   | 0.009 | 0.005  | 0.011  | 0.004 | 0.002 | 0.004      | 0.063 | 0.287

Table 5.13: Values of the conditional, class and total probabilities of the phrase of class 8.

The most important word in this phrase is café (coffee or cafe), followed by the words faz (do) and sair (go out). Inspecting table 5.13, it is observed that all of them are more related to the incorrect class label. The rest of the terms (except bem), along with gosto ir (with a weight of 0.41), are associated with the correct robot phrase; however, their weight in the phrase is low compared to that of the most important terms. The term bem (well) is equally probable for both classes and doesn't contribute to the classification. In conjunction with the greater a priori probability of the predicted phrase, the reason for the system's failure is easily visualized. Nonetheless, the correct label comes right after the predicted one in terms of total probability. It is also noted that the weight of gosto (like) has the lowest score and its probability of belonging to either class is very similar, which confirms that the system doesn't give much weight to a very frequent term.
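That last observation, a very frequent term like gosto receiving little weight, follows directly from the inverse document frequency factor of TF-IDF. A minimal sketch on a made-up corpus (the phrases below are stand-ins, not the thesis dataset):

```python
import math

# Tiny stand-in corpus: "gosto" appears in most phrases, "cafe" in one.
docs = [
    "gosto de ir ao cafe",
    "gosto de carne de vaca",
    "gosto de ir ao centro comercial",
    "sair faz bem",
]

def idf(term):
    # Plain IDF: log(N / document frequency). The thesis's vectorizer may
    # use a smoothed variant, but the qualitative behaviour is the same.
    df = sum(term in d.split() for d in docs)
    return math.log(len(docs) / df)

print(idf("gosto") < idf("cafe"))  # the frequent term gets the lower IDF
```

The more phrases a term appears in, the smaller its IDF, so frequent words like gosto are automatically down-weighted relative to discriminative ones like café.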

• Class 12:
Human phrase: Comprar (0.54) roupas (0.55) e bijuteria.
Correct phrase: Tem por hábito comprar roupas?
Predicted phrase: Pois às vezes sabe bem oferecer uma peça bonita, mesmo que seja a nós próprios. O que mais faz?

Label     | "comprar" | "roupas" | "comprar roupas" | Class | Total
Correct   | 0.017     | 0.026    | 0.008            | 0.014 | 0.173
Predicted | 0.028     | 0.022    | 0.017            | 0.016 | 0.369

Table 5.14: Values of the conditional, class and total probabilities of the phrase of class 12.

Both terms have a similar weight; however, the word comprar (buy) is more associated with the predicted label, while roupas (clothes) is more associated with the correct class. Additionally, the set of words comprar roupas, which has the biggest weight (0.62), is far more related to the wrong label, obtaining greater conditional probabilities for the predicted class. All of the previous reasons, added to the a priori probability being higher for the predicted phrase, drive the system to predict incorrectly. Note that, even though the system didn't select the right phrase, the correct label comes right after the predicted class in terms of total probability.

• Class 13:
Human phrase: Ocasionalmente.
Correct phrase: Pois às vezes sabe bem oferecer uma peça bonita, mesmo que seja a nós próprios. O que mais faz?
Predicted phrase: Percebo completamente. Também foi um prazer falar consigo.

In section 5.2, the reason why the majority of the phrases consisting of a single word were eliminated from the final database was explained. However, a couple of them were kept in order to check the system's robustness and performance. Because the only word that composes the human phrase is not considered a keyword by the system, the LSA vector is filled purely with zeros. Therefore, the system bases its prediction purely on the a priori probability, and the predicted class is chosen because it has the highest a priori probability (0.074). On the other hand, the correct label only comes 17th, as its class probability is only 0.016. It is verified that single-word phrases should be avoided at all cost (apart from phrases like greetings, where Olá or Adeus does not represent a liability to the system, as they are so predictable).
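This fallback behaviour can be sketched directly: with an all-zero LSA vector there is no evidence term left, so the decision reduces to an argmax over the class priors. The class labels and prior values below are illustrative, not the system's actual ones.

```python
# Hypothetical a priori probabilities for three classes.
priors = {13: 0.016, 22: 0.074, 5: 0.066}

def predict_empty_phrase(priors):
    # With an all-zero feature vector there is no evidence to weigh,
    # so the decision is an argmax over the a priori probabilities alone.
    return max(priors, key=priors.get)

print(predict_empty_phrase(priors))  # always the class with the largest prior
```

Whatever the input phrase says, any out-of-vocabulary single word lands on the same majority class, which is why such phrases create the ambiguity described above.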

• Class 15:
Human phrase: Só se tiver (0.39) fome (0.45), não (0.24) sou (0.40) grande (0.46) fã.
Correct phrase: É a primeira pessoa que eu conheço que não gosta. O que é que lhe agrada comer?
Predicted phrase: Certo, então vamos por partes. Qual é a sua carne favorita?

Label | "tiver" | "fome" | "não" | "sou" | "não sou" | "grande" | Class | Total
Corr. | 0.002   | 0.002  | 0.037 | 0.002 | 0.002     | 0.002    | 0.016 | 0.019
Pred. | 0.002   | 0.002  | 0.015 | 0.005 | 0.004     | 0.002    | 0.063 | 0.136

Table 5.15: Values of the conditional, class and total probabilities of the phrase of class 15.

First of all, the words tiver (be or have), fome (hungry) and grande (big) are the most important terms of the phrase; nonetheless, their probability of belonging to either label is equal, making them useless for the prediction. The word sou (am) is also important in terms of weight and is twice as likely to belong to the predicted phrase as to the correct one. On the other hand, the probability of the word não belonging to the correct class is twice as great as for the predicted one, and both are bigger than the probabilities of the term sou; however, the weight of não is much lower. When this term joins the word sou, their combined weight increases to 0.45 and the probability leans more towards the predicted phrase. The much higher class probability of the predicted label, along with the likelihood probabilities, leads the system to a wrong forecast and pushes the right class down to only 15th.

5.6 Highlights

1. Azure is the best keyword extractor of the programs analysed. However, not including the keywords in the final vocabulary of the system actually improves the results.

2. The elimination of stop words from the vocabulary is beneficial; nonetheless, it is concluded that a smaller list with no verbs or personal pronouns is preferable and delivers a better result in the final system.

3. The Minimum Document Frequency is a favorable feature selection method, slightly improving the performance.

4. The introduction of the N-Gram range makes it possible to include specific term combinations in the system's vocabulary, increasing the system's accuracy.

5. The dimensionality reduction through the percentage of cumulative eigenvalues, a feature extraction method, allows the LSA to infer correctly and enhances the performance of the system, already delivering a reasonable performance at a percentage of 50%.

6. The Laplace Smoothing parameter, which usually improves Naive Bayes classifiers, in reality slightly deteriorates the accuracy and precision for this type of categorization.

7. The 4 previous parameters should be tuned whenever the dataset changes, so that the final system always delivers the best possible performance. The keyword and stop-word features are immutable, and the first 2 points describe how they are employed.

8. The dataset built is representative of an elderly conversation; however, in future iterations more data is needed. It can be enlarged through more forms with previously built conversations or by recording and transcribing dialogues between humans.

9. The system improves its performance with the introduction of more phrases for each class, since they increase the training set size.

10. The addition of more classes worsens the final output. However, it slowly converges to an accuracy of 90% when all the classes have a considerable training size (at least 20 phrases).

11. The software developed can reproduce a full conversation based on unknown phrases, but it fails to perceive the temporal aspect of the dialogue and has difficulty dealing with one-worded phrases.

12. The system built is able to obtain an average performance of 84%, with a classification deviation of 3% to 4%.
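The dimensionality selection summarised in point 5 can be sketched as picking the smallest number of LSA components whose cumulative eigenvalue mass reaches a target percentage. The eigenvalues below are made up for illustration; the thesis's actual spectrum and cutoff differ.

```python
def n_components_for(eigenvalues, percentage):
    # Smallest k such that the top-k eigenvalues account for at least
    # `percentage` of the total eigenvalue mass (cumulative-variance rule).
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= percentage:
            return k
    return len(eigenvalues)

# Hypothetical LSA eigenvalue spectrum:
eigs = [9.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.5]
print(n_components_for(eigs, 0.50))  # → 2
```

A higher target percentage keeps more components; the highlights note that even a 50% cutoff already yields a reasonable performance for this system.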

6 Conclusion

This thesis created a solution to the problem of producing a system that generates a robot utterance for every human discourse uttered, so that a coherent conversation between a robot and an elder is obtained. After meticulous research, a feasible solution was obtained. Nonetheless, before concluding that it is indeed the best one, the investigation considered a couple of different systems that could be adopted in its place, which can be divided into 2 groups. The first one needs a huge amount of data to work properly, as it uses neural networks and deep learning. Although the approach delivers a satisfactory performance, the lack of data in Portuguese makes it undesirable [8].

On the other hand, the second group thrives on smaller databases and is based on an NLP technique, supported by newly collected data and intertwined with a ML classifier. After a deep investigation, LSA was found to be the most promising technique for performing feature selection and extraction. It not only incorporates many techniques in a single algorithm but is also quite efficient for text categorization in comparison to other methods. Regarding the ML algorithm, one of the best is the SVM, but the multi-class problem, along with LSA transforming the textual information into a dense vector, makes it rather inconvenient for the classification task at hand, since the SVM works better with sparse vectors [19]. Also, this technique, like the neural networks method, needs a bigger dataset to function properly [18]. In contrast, the NBC maintains an easy implementation with the NLP technique used and the dataset built and, even though its performance solely as a classifier is not that great, the LSA methods for feature selection and extraction generate a much better accuracy, worthy of adoption in the final solution.

Regardless of the results obtained with the developed system, it is not faultless. Inspecting some of the phrases wrongly predicted by the system, it is observed that sometimes the correct and predicted phrases belong to the same topic, but the predicted phrase should come either before or after the correct phrase. Because the robot is purely reactive, it only responds to the last human phrase uttered and thus completely misses the course of the conversation. In future work, this could be solved by feeding the last human and robot utterances, along with the new human phrase, into the algorithm and predicting the robot utterance through neural networks [45]. Additionally, single-word phrases whose word doesn't belong to the set of terms create ambiguity in the system. In the next iterations, a solution for this issue could be either totally eliminating such phrases from the dataset or including the word directly in the vocabulary, avoiding the possibility of null LSA vectors that mislead the MNBC into picking the class based purely on the a priori probabilities. It must also be noted that the approach of building up a conversation and using a form to obtain variations of the human utterances is an acceptable technique to obtain diverse data. However, it is not efficient enough, since more data is required. Therefore, further work on the theme should focus on the creation of a much more complete dataset that can capture the essence of a dialogue between humans. For this purpose, more data can be acquired through the form approach proposed in this thesis or through recording and transcribing

conversations between humans [17]. After the research made, many parts of the whole system were tested in order to check whether they indeed improved the overall performance. When the keywords extracted by Azure were employed in the system, the performance decreased by an overall 3%, which demonstrated that, through the most prevalent words, the system is already capable of correctly characterizing the textual information. Another piece tested was the list of stop words, i.e. the most common words in a language. Although the lack of this list damages the accuracy, the withdrawal of some words from it, such as verbs and personal pronouns, actually brings a better performance to the system (around 2%). Furthermore, it was verified that the NLP technique adopted, LSA, could be used with the Portuguese language and obtain a favorable result. It was also confirmed that components such as the N-Gram range and the Minimum Document Frequency indeed improve the final result and, even though the best percentage of cumulative eigenvalues was quite high, reducing the dimensionality improves the system, already attaining a very satisfactory output from a percentage of 50% onwards. In conclusion, LSA can still provide a positive result nowadays. Together with the NBC, the system is able to obtain an average performance of 84%, which is a very favorable result.

Bibliography

[1] “Importance of k-fold cross validation in machine learning,” https://medium.com/towards-artificial-intelligence/importance-of-k-fold-cross-validation-in-machine-learning-a0d76f49493e, accessed: 2019-12-14.

[2] D. Kerstin, “Socially intelligent robots: dimensions of human–robot interaction,” Philosophical Trans- action of The Royal Society B, vol. 362, 2007.

[3] A. M. Sabelli, T. Kanda, and N. Hagita, “A conversational robot in an elderly care center: An ethnographic study,” in Proceedings of the 6th International Conference on Human-robot Interaction, ser. HRI ’11. New York, NY, USA: ACM, 2011, pp. 37–44. [Online]. Available: http://doi.acm.org/10.1145/1957656.1957669

[4] J. Broekens, M. Heerink, and H. Rosendal, “Assistive social robots in elderly care: A review,” Gerontechnology, vol. 8, no. 2, pp. 94–103, 2009.

[5] M. A. Goodrich and A. C. Schultz, “Human–robot interaction: A survey,” Foundations and Trends® in Human–Computer Interaction, vol. 1, no. 3, pp. 203–275, 2008. [Online]. Available: http://dx.doi.org/10.1561/1100000005

[6] B. R. Duffy, “Anthropomorphism and the social robot,” Robotics and Autonomous Systems, vol. 42, no. 3, pp. 177 – 190, 2003, socially Interactive Robots. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889002003743

[7] M. E. Foster, “Natural language generation for social robotics: opportunities and challenges,” Philosophical Transaction of The Royal Society B, vol. 374, 2019. [Online]. Available: http://dx.doi.org/10.1098/rstb.2018.002

[8] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 649–657. [Online]. Available: http://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf

[9] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Machine Learning: ECML-98, C. Nédellec and C. Rouveirol, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 137–142.

[10] J. Chen, H. Huang, S. Tian, and Y. Qu, “Feature selection for text classification with na¨ıve bayes,” Expert Systems with Applications, vol. 36, no. 3, Part 1, pp. 5432 – 5435, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417408003564

[11] T. Xia and Y. Chai, “An improvement to tf-idf: Term distribution based term weight algorithm.” JSW, vol. 6, no. 3, pp. 413–420, 2011.

[12] P. Nakov, A. Popova, and P. Mateev, “Weight functions impact on LSA performance,” EuroConference RANLP, pp. 187–193, 2001.

[13] H. Uğuz, “A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm,” Knowledge-Based Systems, vol. 24, no. 7, pp. 1024–1032, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705111000803

[14] J. D. Rennie, L. Shih, J. Teevan, and D. R. Karger, “Tackling the poor assumptions of naive bayes text classifiers,” in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 616–623.

[15] I. Rish et al., “An empirical study of the naive bayes classifier,” in IJCAI 2001 workshop on empirical methods in artificial intelligence, vol. 3, no. 22, 2001, pp. 41–46.

[16] T. K. Landauer, P. W. Foltz, and D. Laham, “An introduction to latent semantic analysis,” Discourse Processes, vol. 25, pp. 259–284, 1998.

[17] P. Liu, D. F. Glas, T. Kanda, and H. Ishiguro, “Data-driven HRI: Learning social behaviors by example from human–human interaction,” IEEE Transactions on Robotics, vol. 32, 2016.

[18] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, “Machine learning: a review of classification and combining techniques,” Artificial Intelligence Review, vol. 26, no. 3, pp. 159–190, Nov 2006. [Online]. Available: https://doi.org/10.1007/s10462-007-9052-3

[19] E. Leopold and J. Kindermann, “Text categorization with support vector machines. how to represent texts in input space?” Machine Learning, vol. 46, no. 1, pp. 423–444, Jan 2002. [Online]. Available: https://doi.org/10.1023/A:1012491419635

[20] T. Bayes and R. Price, “LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S.,” Philosophical Transactions of the Royal Society of London, vol. 53, pp. 370–418, 1763. [Online]. Available: https://royalsocietypublishing.org/doi/abs/10.1098/rstl.1763.0053

[21] M. F. Porter et al., “An algorithm for suffix stripping.” Program, vol. 14, no. 3, pp. 130–137, 1980.

[22] “6.2. Feature extraction - scikit-learn 0.21.3 documentation,” https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction, accessed: 2019-07-31.

[23] W. et al, “Parameters driving effectiveness of automated essay scoring with lsa.” Proceedings of the 9th CAA Conference, Loughborough: Loughborough University, 2005.

[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pretten- hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[25] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.

[26] “Thesis/final,” https://github.com/h-s1996/Thesis/tree/master/final, accessed: 2019-07-01.

[27] M. Forsberg, “Why is speech recognition difficult,” Chalmers University of Technology, 2003.

[28] “Cloud speech-to-text documentation,” https://cloud.google.com/speech-to-text/docs, accessed: 2019-09-03.

[29] “Speech Recognition in Python using Google Speech API - GeeksforGeeks,” https://www.geeksforgeeks.org/speech-recognition-in-python-using-google-speech-api/, accessed: 2019-09-03.

[30] “Text Analytics API — Microsoft Azure,” https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/, accessed: 2019-07-25.

[31] “What is the Text Analytics API?” https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/overview, accessed: 2019-07-25.

[32] “Full analysis,” https://linguakit.com/en/full-analysis, accessed: 2019-07-25.

[33] “About linguakit,” https://linguakit.com/en/about, accessed: 2019-07-25.

[34] “Yake! demo for keyword extraction,” http://yake.inesctec.pt/demo/user, accessed: 2019-07-22.

[35] “About YAKE!,” https://github.com/LIAAD/yake, accessed: 2019-07-22.

[36] “KeywordExtractor-Datasets/110-PT-BN-KP.zip,” https://github.com/LIAAD/KeywordExtractor-Datasets/blob/master/datasets/110-PT-BN-KP.zip, accessed: 2019-04-20.

[37] “Examples for Portuguese processing,” http://www.nltk.org/howto/portuguese_en.html, accessed: 2019-08-01.

[38] “nltk.stem package - NLTK 3.4.5 documentation,” https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.rslp, accessed: 2019-08-01.

[39] “Rslp stemmer,” http://www.inf.ufrgs.br/∼viviane/rslp/index.htm, accessed: 2019-10-01.

[40] “nltk.tokenize package - NLTK 3.4.5 documentation,” https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize, accessed: 2019-07-28.

[41] “sklearn.feature_extraction.text.TfidfVectorizer - scikit-learn 0.21.3 documentation,” https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html, accessed: 2019-07-31.

[42] “sklearn.decomposition.TruncatedSVD - scikit-learn 0.21.3 documentation,” https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD, accessed: 2019-08-17.

[43] “Module(gtts) - gTTS documentation,” https://gtts.readthedocs.io/en/latest/module.html, accessed: 2019-09-03.

[44] “sklearn.model_selection.StratifiedKFold,” https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html, accessed: 2019-12-18.

[45] P. Liu, D. F. Glas, T. Kanda, and H. Ishiguro, “Learning proactive behavior for interactive social robots,” Autonomous Robots, vol. 42, no. 5, pp. 1067–1085, 2018.

A Appendix

(...)
5
H Felizmente tenho possibilidade de poder estar frequentemente com a maioria deles.
H Sim tenho a minha filha que mora a uns 20 minutos daqui.
H A minha família vive toda aqui perto, somos vizinhos.
H Bem perto.
H Sim, o meu irmão vive na mesma cidade que eu.
H Sim, quase toda a minha família vive na mesma cidade que eu.
R Ainda bem, fico contente. Agora gostaria de saber a sua comida preferida.
(...)

B Large Tables

Train Utterance | Corresponding Robot Utterance
Bom dia. | Bom dia.
Olá. | Bom dia.
Tudo bem? | Está tudo bem e consigo?
Tudo bem consigo? | Está tudo bem e consigo?
Tudo bem contigo? | Está tudo bem e consigo?
Está tudo a andar? | Está tudo bem e consigo?
Comigo está tudo ótimo. | Ainda bem.
Também estou bem. | Ainda bem.
Eu estou ótimo. | Ainda bem.
Eu estou excelente. | Ainda bem.
Comigo está tudo magnífico. | Ainda bem.
Eu estou excelente. | Ainda bem.
Como está o tempo hoje? | Hoje está um tempo fantástico.
Como está o tempo à tarde? | Hoje está um tempo fantástico.
Como está a meteorologia? | Hoje está um tempo fantástico.
Qual é a meteorologia para esta manhã? | Hoje está um tempo fantástico.
Qual é a meteorologia para hoje? | Hoje está um tempo fantástico.
Devo ir até a casa do meu filho. | Aproveite o tempo com a sua família.
Hoje vou estar com o meu filho. | Aproveite o tempo com a sua família.
Vou jantar hoje com a minha filha. | Aproveite o tempo com a sua família.
Hoje vou almoçar com o meu irmão. | Aproveite o tempo com a sua família.
Vou à casa da minha filha. | Aproveite o tempo com a sua família.
Hoje vou passar o dia com os meus netos. | Aproveite o tempo com a sua família.
Vou hoje ao supermercado. | O que é que vai comprar?
Hoje vou às compras, preciso de comida. | O que é que vai comprar?
Vou às lojas de roupa hoje. | O que é que vai comprar?
Vou às lojas à tarde. | O que é que vai comprar?
Hoje vou ao centro comercial. | O que é que vai comprar?
Vou ao centro comercial de manhã. | O que é que vai comprar?

Table B.1: Utterances used in the training phase of the parameters’ analysis - First Part

Train Utterance | Corresponding Robot Utterance
Hoje quero passear. | Então divirta-se.
Vou dar um passeio. | Então divirta-se.
Quero dar uma volta esta tarde. | Então divirta-se.
Vou sair de casa hoje. | Então divirta-se.
Hoje vou dar uma volta. | Então divirta-se.
Hoje vou fazer uma excursão. | Então divirta-se.
Gosto de tomar um café. | Eu também, o sabor é ótimo.
Adoro um café todas as manhãs. | Eu também, o sabor é ótimo.
Amo café. | Eu também, o sabor é ótimo.
Não consigo passar sem um café por dia. | Eu também, o sabor é ótimo.
A minha bebida favorita é café. | Eu também, o sabor é ótimo.
Gosto de tomar um chá. | Eu prefiro café em relação ao chá.
Adoro um chá todas as manhãs. | Eu prefiro café em relação ao chá.
Amo chá. | Eu prefiro café em relação ao chá.
Não consigo passar sem um chá por dia. | Eu prefiro café em relação ao chá.
A minha bebida favorita é chá. | Eu prefiro café em relação ao chá.

Table B.2: Utterances used in the training phase of the parameters’ analysis - Second Part

Test Utterance | Corresponding Robot Utterance
Bom dia. | Bom dia.
Como é que anda? | Está tudo e contigo?
Está tudo bem? | Está tudo e contigo?
Comigo está tudo magnífico. | Ainda bem.
Eu estou excelente. | Ainda bem.
Quero ir à casa da minha filha | Aproveite o tempo com a sua família.
Vou almoçar amanhã com o meu filho. | Aproveite o tempo com a sua família.
Vou à casa do meu irmão. | Aproveite o tempo com a sua família.
Devo ir ao supermercado. | O que é que vai comprar?
De manhã vou às lojas do centro comercial. | O que é que vai comprar?
Necessito de ir às compras à tarde. | O que é que vai comprar?
Está um bom tempo hoje? | Hoje está um tempo fantástico.
Como está o tempo? | Hoje está um tempo fantástico.
Como é a meteorologia para à tarde? | Hoje está um tempo fantástico.
Vou dar uma volta de manhã. | Então divirta-se.
Quero sair hoje ao café. | Então divirta-se.
Amanhã irei numa excursão. | Então divirta-se.
Adoro uma chávena de café. | Eu também, o sabor é ótimo.
Todos os dias amo beber um café. | Eu também, o sabor é ótimo.
Amo tomar chá. | Eu prefiro café em relação ao chá.
Não há um dia em que não beba chá. | Eu prefiro café em relação ao chá.

Table B.3: Utterances used in the testing phase of the parameters’ analysis
