
Ivo Filipe Pinho dos Anjos

Master of Science

Serious mobile game with sibilant exercises for speech therapy

Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering

Adviser: Prof. Dr. Sofia Cavaco, Assistant Professor, Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa
Co-adviser: Prof. Dr. João Magalhães, Assistant Professor, Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa

Examination Committee
Chairperson: Prof. Dr. Fernando Birra
Rapporteur: Prof. Dr. Sérgio Paulo
Member: Prof. Dr. Sofia Cavaco

November, 2017

Serious mobile game with sibilant consonant exercises for speech therapy

Copyright © Ivo Filipe Pinho dos Anjos, Faculty of Sciences and Technology, NOVA University of Lisbon. The Faculty of Sciences and Technology and the NOVA University of Lisbon have the right, perpetual and without geographical boundaries, to file and publish this dissertation through printed copies reproduced on paper or on digital form, or by any other means known or that may be invented, and to disseminate through scientific repositories and admit its copying and distribution for non-commercial, educational or research purposes, as long as credit is given to the author and editor.

This document was created using the (pdf)LaTeX processor, based on the "unlthesis" template [1], developed at the Dep. Informática of FCT-NOVA [2]. [1] https://github.com/joaomlourenco/unlthesis [2] http://www.di.fct.unl.pt

Acknowledgements

This work was supported by the Portuguese Foundation for Science and Technology under the projects BioVisualSpeech (CMUP-ERI/TIC/0033/2014) and NOVA-LINCS (PEest/UID/CEC/04516/2013). I would like to thank Prof. Dr. Sofia Cavaco for all the counselling and guidance during this last year, which allowed me to produce the work presented in this dissertation and in an accepted scientific paper, for which I am very grateful. I would also like to thank the SLPs Diana Lança and Catarina Duarte for their availability and feedback, which helped me find the main problem to address in this dissertation. I would also like to thank all the 3rd and 4th year SLP students from Escola Superior de Saúde do Alcoitão who collaborated in the data collection task. Many thanks also to Inês Jorge for the graphic design of the game scenarios. I would also like to thank the schools from Agrupamento de Escolas de Almeida Garrett, and all the children who participated in the recordings. Lastly, I would like to thank my family and friends who have helped me a lot during my academic years.


Abstract

The distortion of sibilant sounds is a common type of speech sound disorder (SSD) in European Portuguese (EP) speaking children. Speech and language pathologists (SLPs) frequently use the isolated sibilants exercise to assess and treat this type of speech error. While technological solutions like serious games can help SLPs to motivate children to do the exercises repeatedly, there is a lack of such games for this specific exercise. Another important aspect is that, given the usual small number of therapy sessions per week, children are not improving at their maximum rate, which is only achieved with more intensive therapy.

We propose a serious game for mobile platforms that allows children to practice their isolated sibilants exercises at home to correct sibilant distortions. This will allow children to practice their exercises more frequently, which can lead to faster improvements. We have designed four different scenarios, one for each EP sibilant consonant. The game, which uses an automatic speech recognition (ASR) system to classify the child's sibilant productions, is controlled by the child's voice in real time and gives immediate visual feedback to the child about her sibilant productions. We also used some relevant cues to help the child remember which sound to produce.

In order to keep the computation on the mobile platform as simple as possible, the game has a client-server architecture, in which the external server runs the ASR system. We used the raw Mel frequency cepstral coefficients as features, tested different classifiers, like linear and quadratic discriminant analysis and support vector machines, and compared multiple options for selecting the training and test sets. We were able to achieve very good results, with test accuracy scores above 91% using support vector machines.

Keywords: Sound Analysis, Machine Learning, Supervised Learning, Interactive Environment, Parameterization, Speech Therapy, Articulation Disorders


Resumo

A distorção das sibilantes é um tipo de distúrbio de fala comum em crianças cuja língua materna é o Português Europeu. Os terapeutas da fala e da linguagem frequentemente utilizam o exercício das sibilantes isoladas para avaliar e tratar este tipo de problemas. Apesar de existirem soluções tecnológicas como jogos sérios para ajudar os terapeutas a motivar as crianças para praticarem os exercícios repetidamente, não existem jogos para este exercício específico. Outro aspeto importante é que dado o reduzido número de sessões de terapia por semana, as crianças não estão a melhorar ao seu ritmo máximo, pois isso só é alcançável com recurso a uma terapia mais intensiva. Propomos um jogo sério para plataformas móveis que permite às crianças praticar os seus exercícios das sibilantes isoladas em casa para corrigir as distorções. Isto vai permitir que as crianças pratiquem os exercícios mais frequentemente, o que pode levar a melhorias mais rápidas. Desenhámos quatro cenários, um para cada sibilante do Português Europeu. O jogo, que utiliza um sistema automático de reconhecimento de fala para classificar as produções da criança, é controlado pela voz da criança em tempo real e dá feedback visual imediato à criança sobre a sua produção de som. Também utilizámos algumas pistas, de modo a ajudar a criança a lembrar-se de qual o som a produzir. Para manter a complexidade na plataforma móvel o mais simples possível, o jogo utiliza uma arquitetura de cliente-servidor, sendo que o servidor corre o sistema automático de reconhecimento de fala. Utilizámos Mel frequency cepstral coefficients como features, e testámos vários classificadores, como linear e quadratic discriminant analysis e support vector machines, e comparámos várias opções para selecionar os conjuntos de treino e de teste. Conseguimos atingir resultados muito bons com resultados de precisão em teste acima de 91% utilizando support vector machines.

Palavras-chave: Análise de Som, Aprendizagem Automática, Aprendizagem Supervisionada, Ambiente Interativo, Parametrização, Terapia da Fala, Distúrbios de Articulação


Contents

List of Figures

List of Tables

Listings

1 Introduction
  1.1 Introduction
  1.2 Objectives
  1.3 Proposed Solution

2 Background and Related Work
  2.1 Speech Therapy
    2.1.1 The Process of voice
    2.1.2 Language and Speech
    2.1.3 Speech Disorders
    2.1.4 Speech Therapy Areas - Sibilant Sounds and Minimal Pairs
    2.1.5 Interviews
  2.2 Sound Features Extraction & Machine Learning
    2.2.1 Mel-Frequency Cepstral Coefficients (MFCC)
    2.2.2 Linear Discriminant Analysis
    2.2.3 Linear and Quadratic Discriminant Classifier
    2.2.4 Cross Validation
    2.2.5 Support Vector Machines (SVM)
  2.3 State-of-the-art Tools
    2.3.1 Isolated Games
    2.3.2 Complex Systems
    2.3.3 Tools Comparison

3 Game and Architecture
  3.1 System platform and game engine
  3.2 Mobile game
    3.2.1 Game goal and scenarios
    3.2.2 Visual cues
    3.2.3 Visual Feedback
    3.2.4 Parameterization of the game
  3.3 System architecture
  3.4 Implementation details - modularity and extensibility
    3.4.1 Mobile Game
    3.4.2 Server
    3.4.3 The need for an ASR system

4 Automatic Speech Recognition System
  4.1 Sound data
  4.2 Automatic recognition of isolated sibilants
    4.2.1 The classification algorithm
    4.2.2 Feature vectors
    4.2.3 Different options for the training and test sets
    4.2.4 Model training methodology
  4.3 Classification Results
    4.3.1 Comparing the classifiers using the Naive split
    4.3.2 Comparing the different training and test sets
    4.3.3 Naive split results
    4.3.4 K% children test set results
    4.3.5 One Child Out experiment results
  4.4 Discussion

5 Feedback from the SLPs and children
  5.1 Feedback from children
  5.2 Feedback from SLPs

6 Conclusion and future work

Bibliography

A Accepted scientific paper

List of Figures

1.1 Platform of BioVisualSpeech project
1.2 Child playing the proposed game comfortably at home

2.1 Diagram with the names of the main places of articulation [42]
2.2 Interface of Articulation Station
2.3 Interface of Articulation Test Center
2.4 Interface of Falar a Brincar
2.5 Interface of sPeAK-MAN
2.6 Interface of Flappy Voice
2.7 Interface of Vithea
2.8 European systems comparison

3.1 Forecast of tablet user numbers in Portugal from 2014 to 2021
3.2 The four game scenarios. (a) The scenario for the [S] consonant. (b) The scenario for the [Z] consonant. (c) The scenario for the [s] consonant. (d) The scenario for the [z] consonant.
3.3 Example of the sinusoidal movement of the bumblebee character
3.4 The two developed main cycles. (a) The game cycle ends when the child produces an incorrect sound production. (b) The game cycle waits for another correct production in order to give positive feedback to the child.
3.5 The end game messages. (a) The message the child receives when she reaches the goal of the game. (b) The message the child receives when she stops producing the correct sibilant sound.
3.6 Client-server architecture

4.1 The setup used for the recordings
4.2 The MFCCs matrix with 13 coefficients, obtained for a [z] sound
4.3 Accuracy test scores of the multiclass SVM classifier with RBF kernel, using different numbers of MFCCs
4.4 Accuracy test scores of the three multiclass classifiers
4.5 Accuracy test scores of the multiclass SVM, and four single-phoneme SVMs (both with the RBF kernel) for all four classes
4.6 Accuracy test scores of the four single-phoneme SVMs, while using different training and test sets
4.7 The progression of the MFCC 4 for sound [S] over time, for two different children
4.8 The box plot of the MFCC values for each coefficient, for sound [S], for two different children
4.9 Comparison between the accuracy scores of the validation and test sets, while using different test set sizes

5.1 Child playing the proposed game during the European Researchers Night 2017

List of Tables

4.1 Number of children who performed the recordings
4.2 Number of samples for each sound
4.3 Validation and test scores of the multiple runs of the One Child Out experiment, removing the child that was in the test set at the end of each run
4.4 Number of false negatives and false positives of the multiple runs of the One Child Out experiment, removing the child that was in the test set at the end of each run

Listings

3.1 Pseudo code of the main game cycle
3.2 Pseudo code of the main method in the microphone class
3.3 Pseudo code of the server

1 Introduction

1.1 Introduction

Speech is one of the most important aspects of our life. While children are still learning how to speak and learning the language, regular mistakes are made. These mistakes tend to gradually disappear as children grow up. However, there may be cases in which the mistakes continue even when the child grows older. In these cases the child may have a speech and/or language disorder. While there are many types of speech and language disorders, here we focus on speech sound disorders (SSD). An SSD occurs when a child produces speech sounds incorrectly, after an age at which these mistakes were not supposed to happen [3, 36].

When a child has an SSD, she should be observed by a speech language pathologist (SLP), who can assess the type and severity of the disorder. As part of the treatment to correct the speech errors, SLPs use specific types of speech production exercises that must be performed multiple times in each speech therapy session.

The SLP has to find ways to make the repetition of the exercises enjoyable, in order to keep the child motivated. Most times this is accomplished by transforming the exercises into some sort of game that the child can play. This would be relatively simple with a few repetitions, but given the extensive number of repetitions required to achieve speech improvements, keeping the child motivated may be a hard task. To further motivate the child, many SLPs also use some kind of reward system; for instance, the child may be rewarded with a game during the last minutes of the session if he performs well during the session. Most times this last game also includes some sort of therapy exercise.

Whereas traditionally children attend speech therapy sessions once a week, more intensive therapy, which can consist of having more weekly sessions, has been proven to lead to faster improvements [8, 9, 23]. In particular, when the intensive training is done at home (home training) it can increase the overall exercise practicing time substantially. This is particularly useful when children attend speech therapy sessions only once a week and, as a consequence, do not repeat the speech exercises as often as desirable. With home training children can practice the exercises recommended by their SLP whenever they have free time.

These two concepts of intensive training (with more frequent sessions per week) and home training have become more popular in recent years, and some studies have already shown that this type of training is very beneficial [9, 12, 20]. These studies show considerable improvements when children have more than the regular weekly session. In the cases where children cannot attend more than one session per week, home training has proved to be a good alternative for those extra sessions. The major benefit of intensive training and home training is a faster improvement rate, which can also lead to more motivation. When children are motivated they tend to overcome their problems more easily, so this kind of cycle is beneficial.

In spite of all the benefits of home training, it may not be straightforward to implement when the right tools are not available. The problem is: how can children practice the speech exercises at home? Who is going to verify that they are doing the exercises correctly? The first obvious choice would be the parents, but, for multiple reasons including lack of time, this is not always possible. This leaves room for a combined clinical and technological solution to emerge.

1.2 Objectives

This dissertation is part of the BioVisualSpeech (BVS) project, which is a partnership between Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa, Carnegie Mellon University, Escola Superior de Saúde de Alcoitão, INESC-ID, Centro Hospitalar de Lisboa and Voice Interaction, and is funded by Fundação para a Ciência e Tecnologia. The project aims to research mechanisms that provide bio-feedback in speech therapy through the use of serious games. It proposes a platform (figure 1.1) with two main goals. The first is to keep children motivated to do the speech therapy exercises; for this, the platform includes a gamified interactive environment with several serious games, a reward system that motivates children to keep exercising, and also some tools to help the speech therapists, such as a mirror and the ability to perform audio-visual recordings and annotations of the session. The second goal is to provide the speech therapist with a tool set to plan the course of the ongoing therapy. For this, the platform includes a system for post-session analysis, where the audio-visual recordings can be examined and future sessions can be planned. More recently the project added another area of interest: home practice. Home practice is used a lot by speech therapists, but there are not many tools that provide the necessary feedback on the exercises.


Figure 1.1: Platform of BioVisualSpeech project.

Given the importance of the area of home practice and its recent addition to the project, we decided to focus on this particular area, combined with the gamification part, to make our solution more attractive to children. The BioVisualSpeech project is based on the European Portuguese (EP) language, which is one language in particular that is not addressed by most of the systems already available. It was also proposed to us that we work with the sibilant sounds, since there is no system available that allows children to practice these sounds for EP.

Sibilant sounds are a specific group of consonants, and happen when the air flows through a very narrow channel in the direction of the teeth [19]. This channel can be created in different places, and this creates different types of sibilant sounds. There are two types of sibilant sounds in EP: alveolar and palato-alveolar. In each of these sibilant types the vocal folds can be used or not, resulting in a voiced or a voiceless sibilant, respectively. In EP there are four sibilant sounds: [s], [z], [S], and [Z]. Both [s] and [z] are alveolar sibilants, [s] being unvoiced and [z] voiced. In the case of [S] and [Z], they are palato-alveolar sibilants, and are respectively unvoiced and voiced.

These sibilant sounds also appear in minimal pair words. A minimal pair is a pair of words where only a single sound varies. An example of a minimal pair is the words "zip" and "sip", where only the first sound varies, [z] and [s]. Given their similarities, the words in a minimal pair are very easy to confuse and it is common to use one instead of the other. This is a very common SSD, especially in children. In our example this is even more problematic because the sound that varies between the two words is an alveolar sibilant, meaning that both sounds are produced in the same location, so the only difference in the production of these two words is the use of the vocal folds when producing the first sound. What makes these sounds very interesting for a dissertation is that there is not much research regarding the differences between the different sibilant sounds, and even less comparing the voiced and unvoiced sounds.

Regarding the home practice area, the main goal is to allow the children to perform the exercises more times, which may lead to a faster improvement. The platform we

chose also has an impact on this, so the best option is to use a mobile platform, since this does not create any restriction on using the system other than an internet connection, which allows the children to perform the exercises nearly anywhere.

1.3 Proposed Solution

While many native Portuguese children need to attend speech therapy to correct SSDs related to the production of distorted sibilant sounds, there is a lack of software systems to assist with the training of these sounds. As a contribution to fill this gap, we propose a serious game for mobile platforms for intensive training of the EP sibilant consonants for the correction of distortion errors. The target age group of this game is children from five to nine years old, because usually at these ages the regular phonological development exchange errors have already disappeared.

Figure 1.2 illustrates our goal: a child playing our proposed game comfortably at home, without parent or SLP supervision. Our solution allows the child to practice the speech therapy exercises without leaving the comfort of his home, while still providing all the benefits of intensive training, such as a faster improvement. We believe that the extra motivation from playing a mobile game, instead of practicing the regular speech therapy exercises, combined with the faster improvements, will greatly help the child stay motivated throughout the whole speech therapy process. We also tried to make our game more innovative by allowing the child to control the characters only with their voice. This type of novelty will also help keep the child more engaged with the game.

Figure 1.2: Child playing the proposed game comfortably at home.


The game incorporates the isolated sibilants exercise, which is an exercise frequently used by SLPs during the therapy sessions. The isolated sibilants exercise consists of producing a specific sibilant sound for some duration of time. The main goal of this exercise is to teach the child how to distinguish between the different sibilant sounds, and also how to produce each one of them. We are going to focus on the EP sibilant sounds, which are [z] as in zebra, [s] as in snake, [S] as the sh sound in sheep, and [Z] as the s sound in Asia.

The game goal is to move a character from a starting position to a specific target. In order to control the character, children do not use a keyboard or any other regular input method; instead they use only their voice. Each character responds to a different sibilant sound, and to make the character move, the child has to say that specific sibilant correctly. The game includes four scenarios, one for each of the addressed sibilant sounds. To create the scenarios we took into account the target age group of five to nine year old children. Each of the scenarios has a different character that is related to the sibilant sound the child must produce. This gives the child a visual cue to help her remember the sound she must produce in each scenario. We also chose targets that relate to each game character, to be more appealing to the child. For instance, for the [s] sound, we have a snake (serpente in EP) that is moving towards a log. The game runs on mobile platforms like tablets, iPads and smartphones, which children are usually keen on using. Thus, we are not only providing means for home training of the sibilants, but also motivating children to do the exercises often. The game's main character is controlled in real time by the child's voice and its behaviour gives visual feedback to the child on his speech productions.

While our game can also be used during the speech therapy sessions, our main goal is to provide a game that can be used for home training and does not require the supervision of parents. The main difficulty of home training is how to verify the child's sound productions. Many systems do not have any type of automatic verification, which means that the child needs the parents' or the SLP's supervision to validate the exercises. Our proposal includes an automatic speech recognition (ASR) system to identify whether the child is performing the exercise correctly, and therefore decrease the need for an adult to monitor the child. This ASR system was trained with the speech productions of 90 children.

In order to keep the complex computation of the automatic speech recognition system out of the mobile platform, the game is implemented with a client-server architecture. The client is the mobile game application and the server runs our ASR system, which classifies the child's speech productions. The client records the sound that the child produces in segments and sends them to the server, to check if the sound is being correctly produced. When the server receives a new sound segment, it extracts its audio features and then uses our ASR system to check whether the segment corresponds to a correct production of the respective sibilant sound. The server then sends the answer back to the client, so that it can provide visual feedback to the user.
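To make this exchange concrete, the sketch below shows one round trip written with plain Python sockets. It is only an illustration of the architecture just described: the message framing, the port number, and the classify() placeholder are assumptions for this sketch and not the game's actual implementation (the real pseudo code is given in Listings 3.1-3.3).

    # Minimal sketch of one client-server round trip (illustrative only).
    import socket

    def send_segment(segment_bytes, host="localhost", port=9000):
        """Client side: send one recorded audio segment, read back the verdict."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall(len(segment_bytes).to_bytes(4, "big") + segment_bytes)
            return sock.recv(1) == b"\x01"   # b"\x01" = correct production

    def serve(classify, host="0.0.0.0", port=9000):
        """Server side: receive a segment, classify it, reply with one byte."""
        with socket.create_server((host, port)) as srv:
            while True:
                conn, _ = srv.accept()
                with conn:
                    size = int.from_bytes(conn.recv(4), "big")
                    data = b""
                    while len(data) < size:
                        data += conn.recv(size - len(data))
                    # classify() stands in for the feature extraction + classifier of chapter 4
                    conn.sendall(b"\x01" if classify(data) else b"\x00")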


Summary of contributions:

• Research on the sibilant consonant sounds, particularly for the EP language.

• A base client-server architecture and game developed with three simple main classes, which can be used in further iterations of the project, since they are completely modular and independent of our problem.

• Testing of different machine learning algorithms and also the use of some more sophisticated training set selections, such as the One Child Out approach.

Part of this work was presented in an accepted scientific paper, whose content and further information can be found in Appendix A.

The organization of this dissertation is as follows:

Chapter 2 (Background and Related Work) - the beginning of this chapter discusses some concepts of speech therapy, like voice, language, and speech. Next, a brief description of the multiple speech disorders is presented, together with a more detailed explanation of sibilant consonant sounds and minimal pairs. This helps to understand the foundations of speech therapy. Some interviews are also described; they were very useful in finding the most interesting problems and also provided some validation for this solution. Afterwards we discuss some techniques for sound feature extraction and the machine learning algorithms most used for speech recognition. The chapter ends with a section on state-of-the-art tools, which gives a better idea of the already proposed systems and their solutions.

Chapter 3 (Game and Architecture) - this chapter gives a complete description of all our decisions regarding the choice of system platform, game engine, and system architecture. It also contains a very detailed explanation of the mobile game itself and all its characteristics.

Chapter 4 (Automatic Speech Recognition System) - here we present how we gathered our sound data and what was needed to make it usable by our system. We also describe how we translated the sound files into our feature vectors, and then present the results of the multiple algorithms tested and our attempts to improve the classification scores.

Chapter 5 (Feedback from the SLPs and children) - this chapter presents the feedback of some SLPs that were interviewed, some of them before the game was fully developed and one after, as well as feedback from children.

Chapter 6 (Conclusion and future work) - here we present some final conclusions and some ideas for future work.

2 Background and Related Work

This chapter presents an introduction to some concepts of speech therapy, sound processing, and machine learning, and also describes the systems that have been developed within the scope of speech therapy. Below is a more detailed description of each section:

Section 2.1 - explains the basic concepts of voice, language, and speech. A description of the different types of speech disorders is also presented, with more details on sibilant sounds and minimal pairs. To end the section, the main topics of the interviews with the SLPs are presented.

Section 2.2 - describes some techniques of sound feature extraction, cross validation, and also some very important machine learning algorithms used in automatic speech recognition systems.

Section 2.3 - presents the many systems that were already proposed in this area, and also a comparison between two European Portuguese language systems.

2.1 Speech Therapy

The goal of this section is to explain some concepts of speech therapy that are very important for a better understanding of this dissertation. We start by explaining some basic ideas, like how voice is produced and the differences between speech and language. With this basic knowledge acquired, we can then describe the different types of speech disorders, focusing on the ones that are more important to this dissertation. Next we have a more detailed description of our two main concepts, sibilant sounds and minimal pairs; here we explain most of the choices of this dissertation regarding the speech therapy area. To end this section we present the main topics of the interviews with the SLPs. These interviews helped us to better realize the SLPs' problems and come up with this solution.

2.1.1 The Process of voice

The process of voice production can be described in three main steps [41]:

1. With a coordinated action of the diaphragm, abdominal and chest muscles, and rib cage, the air is moved from the lungs to the vocal folds.

2. The vibration of the vocal folds is a sequence of vibratory cycles:

a) A column of air pressure moves from the bottom to the top of the vocal folds.
b) First this column starts by opening the bottom of the vocal folds.
c) As the column continues to move upwards, it opens the top of the vocal folds.
d) This fast-moving air column creates a low pressure behind it, which causes the bottom of the vocal folds to close, followed by the closure of the top of the vocal folds.
e) The closure of the vocal folds cuts off the air column and releases a pulse of air.
f) Then a new cycle repeats the same process.

This sequence of vibrations produces rapid pulses of air that create a "voiced sound".

3. The vocal tract, which is composed of both resonators and articulators, like the nose, pharynx and mouth, is what allows the "voiced sound" to be modified and amplified to produce voice as we know it.

2.1.2 Language and Speech

Before talking about language or speech disorders, and the multiple speech therapy areas, it is necessary to understand the differences between language and speech. First of all, these are two very distinct concepts [2]. Language can be defined as a set of socially shared rules, like what the words mean, how to make new words by adding prefixes or suffixes, how to combine words to create sentences, and which word combinations are best in specific situations.

Speech can be defined as the verbal means of communicating, and can be divided into three main groups: articulation, voice, and fluency. Articulation refers to how speech sounds are made; for example, a child must learn how to produce certain sounds in order to say the words that contain them (e.g. the sound "s" to produce the word "snake"). So basically words are a combination of separate sounds, and each of those must be learned in order to produce those words. Voice is the use of the vocal folds and breathing to produce sound; the voice can be abused either through overuse or misuse, and this can lead to hoarseness or loss of voice. A common example of this can be seen in singers after a long

tour and in teachers at the end of a school year. Fluency is the rhythm of speech: the ability of a person to speak without hesitations or stuttering.

2.1.3 Speech Disorders

There are multiple types of speech disorders [1]:

• Childhood Apraxia of Speech - a motor speech disorder, in which children have problems saying sounds, syllables, and words, because their brain cannot completely coordinate the body parts involved, such as the lips, jaw, and tongue.

• Dysarthria - also a motor speech disorder, in which children cannot correctly move their muscles, but in this case due to lesions in the nervous system. Although similar to Apraxia, Dysarthria is a movement problem, not a planning problem like Apraxia.

• Orofacial Myofunctional Disorders - the tongue moves forward in an exaggerated way during speech and/or swallowing.

• Speech Sound Disorders - a speech sound disorder occurs when children make mistakes when saying a word, after a certain age. It can be of two kinds, articulatory or phonological. In articulatory disorders sounds can be removed, substituted, added or changed; this normally happens when a child cannot reproduce a sound that composes a word, and substitutes that sound with one that is easier to produce. Phonological disorders can be characterized by patterns of sound errors, for instance when sounds produced in the back of the mouth are substituted by those made in the front of the mouth.

• Stuttering - is a disorder that is characterized by disruptions in the production of speech sounds. This can include repetitions of words or parts of words and also prolongations of speech sounds.

• Voice - voice disorders are problems with the voice that can range from a hoarse voice to the inability to produce any kind of sound. They happen because of voice abuse or colds, allergies, bronchitis or anything else that irritates the vocal folds.

The focus of this dissertation is on speech sound disorders and also on voice disorders. These correspond, respectively, to the areas of articulation and voice.

2.1.4 Speech Therapy Areas - Sibilant Sounds and Minimal Pairs

The speech therapy areas, and the corresponding disorders, can be separated into four distinct categories: articulation, voice, fluency, and language, which were explained above. The main focus of this dissertation is on articulation and also on voice, since the sounds used are sibilant consonants, and those are very useful in both areas.


Sibilant consonants are a special group within the consonants. A consonant is a speech sound that occurs when the air flow is limited by a complete or partial closure of the vocal tract. Sibilant sounds are a specific group of consonants, and happen when the air flows through a very narrow channel in the direction of the teeth [19]. This channel can be created in different places, and this creates different types of sibilant sounds.

These kinds of sounds are very useful in the area of voice, for instance to help reduce the Hard Glottal Attack [44], which is the aggressive closure of the glottis when producing any sound. This can lead to many changes in the sound characteristics of the voice and even to the development of nodules in the vocal folds. The most common exercises involve producing a single sibilant consonant without that initial hard attack. After the patient can correctly perform this exercise, the next step is to add a vowel after the sibilant (e.g. "za" or "so") or to change from one sibilant sound to another. Many other exercises exist and must be adapted to each particular patient case.

These sibilant sounds also have a very interesting feature that comes from how they are produced. The narrow channel that must be created in order to vocalize a sibilant sound can be positioned in different places, and this creates different types of sibilant sounds. There are two types of sibilant sounds in EP: alveolar and palato-alveolar; these regions can be seen in figure 2.1. In each of these sibilant types the vocal folds can be used or not, resulting in a voiced or a voiceless sibilant, respectively. An example of this is the sound [z] (e.g. z in "zip"), which is voiced because it uses the vocal folds, and the sound [s] (e.g. s in "sip"), which does not use the vocal folds and as a result is a voiceless sibilant. These two sounds are alveolar sibilants, so they are produced by creating a narrow channel in the same location (in this case with the tongue tip against the teeth ridge), and the only difference between them is the use of the vocal folds. Palato-alveolar sibilants are produced when the tongue tip is slightly retracted from the teeth ridge, and produce the sound [S] when the vocal folds are not used, and the sound [Z] when they are. There are four different sibilant consonant sounds in EP: [z] as in zebra, [s] as in snake, [S] as the sh sound in sheep, and [Z] as the s sound in Asia. [z] and [s] are both alveolar sibilants, while [S] and [Z] are palato-alveolar sibilants. Both [z] and [Z] are voiced sibilants, and [s] and [S] are voiceless sibilants.

If we look closely at the two words given above as an example, "zip" and "sip", we realize that between these two words only a single sound varies: [z] and [s]. This is the definition of a minimal pair: a pair of words where only a single sound varies. Given their similarities, minimal pairs are very easy to confuse and it is common to use one sound instead of another, which is a very common SSD, especially in children. These words are even more difficult to distinguish when they are not only minimal pairs, but the sound that differs between the two words is a sibilant sound, like in our example "zip" and "sip". In this case the only difference when producing the two words is the use of the vocal folds when producing the initial sound.


Figure 2.1: Diagram with the names of the main places of articulation [42].

When compared to a regular minimal pair (which is already difficult to explain to children), this particular case is considerably harder to explain, because most children do not even know what a vocal fold is, so it is not simply a case of asking the children to use the vocal folds or not. Another thing that makes these sibilant sounds very important is that they appear in many minimal pair words (e.g. "sink" and "zinc"); even though the sounds [Z] and [S] do not appear in any usual minimal pair words in English, they are common in EP. The most common sibilant mistakes committed by children are distortion errors. These can consist of (1) exchanging the voiced and voiceless sounds, for example exchanging a [s], which is voiceless, for a [z], which is produced in the same location of the vocal tract but is voiced, or (2) exchanging the place of production in the vocal tract, which results in producing another sibilant [33]. These types of mistakes are a problem for the area of articulation, and may lead to many words being replaced by others that are very similar except for one single sound, which can lead to many misinterpretations and even sentences that do not make any sense.

As explained by the SLPs we contacted, when a child has an SSD that influences his production of the sibilant consonants, the SLP usually starts by observing how the child reacts to the hearing and production of the isolated sibilants. The process starts by trying to understand if the child can distinguish the different sibilants when hearing them as isolated sounds. If he can, the next step is to practice the isolated sibilants exercise, which consists of producing a specific sibilant sound for some duration of time. The main goal of this exercise is to teach the child how to distinguish between the different sibilant sounds, and also how to produce each one of them. Once the child is able to say the sibilant consonants correctly, the SLP then starts asking for multiple productions of the different sibilants, always alternating both the

place of production in the vocal tract and the use of the vocal folds, in order to try to understand if the child can always produce the correct sibilant or if sometimes the child exchanges one sibilant for another. This process is done with the isolated sibilants exercise and is the basis for detecting what the child's problem is. Only after the problem is correctly identified and the child can say the isolated sibilants correctly can they move on to more complex exercises, like using the sibilant sounds inside words and in multiple positions within a word.

Given the importance to both the areas of voice and articulation, and the many problems that can arise either from the abuse of the voice or from the distortion of the sibilant sounds, we decided to focus on these sounds and their minimal pairs. According to the SLPs that we talked with, there is no software that allows children to practice these sounds, and given their importance, as explained above, a solution for this problem is even more necessary. Our solution is to incorporate the isolated sibilants exercise in a mobile game that motivates children to exercise more often, which could lead to a faster improvement and more motivation. Given the focus on sounds, the game characters are controlled using only the child's voice and not any of the usual input methods like keyboard, mouse or touch; when the child performs the exercise correctly, the character moves towards its goal. All these game concepts and methodology are discussed in detail in chapter 3.

2.1.5 Interviews

In the beginning of this dissertation we had to talk with different speech and language therapists to understand their needs and difficulties, and to learn a bit about the process of speech therapy and their sessions. This began with meetings with the project partners from Escola Superior de Saúde do Alcoitão (ESSA) and also some meetings with other SLPs from private clinics. In these meetings, games for the sibilant sounds were always mentioned. The main idea of these games is that the child has to produce a sibilant sound while the SLP provides some visual feedback to show whether the exercise is being done correctly or not. One example, for the sound [s], was the SLP imitating a snake moving; another, for the sound [z], was a bee flying towards a flower.

We were told that sibilant sounds are very useful for both voice and articulation disorders. For voice disorders they are used a lot, since they help to "break" nodules in the vocal folds and, given how they are produced, they do not strain the vocal folds and the surrounding muscles. This is a very important characteristic for the area of voice, since most of these patients have a hoarse voice and sometimes cannot even produce any sound.

For the area of articulation the sibilants are also interesting because many children tend to confuse the minimal pairs of sibilants. In order to help these children, SLPs usually start by trying to understand if the children can distinguish, by hearing, the

minimal pair sounds. If they can, the next step is to make them produce the sounds in isolation, one after the other, until they understand the differences in the production of both sounds. Afterwards the same principle is applied, but now with these sounds in different positions within words: initial, medial, and final. This has to be done always alternating between the minimal pair sounds, to make sure that the child is actually understanding the differences and not only repeating the same sound over and over.

Another very good idea to help children differentiate the sounds of minimal pairs is to use sounds of nature or animals that are very familiar to them and assign them to a specific sibilant; this helps them remember which sound they must produce. Also, to help with this association, some SLPs use specific animal or nature sounds. For instance, when asking for the sibilant sound [z], some SLPs use the word "Zangão", since it starts with the sibilant sound that was asked for. This is particularly useful when children still have problems distinguishing between the minimal pair sounds. This idea can be applied to all sibilant sounds.

All the SLPs that were interviewed use some sort of homework, since it provides faster improvement. The kind of homework that they use always requires parent supervision, which is sometimes a problem since not all parents have that much time available. Some SLPs also use some sort of self-evaluation method to try to solve this problem, but this does not work with all children.

When asked about mobile applications, they all agreed that there is a big lack of applications for European Portuguese, particularly for these types of exercises with the sibilant sounds. Even though there are some applications developed for European Portuguese, most of them only have resources, like images and sentences, and do not have any type of automatic speech recognition. These applications are not very useful to SLPs, because in this case they are always dependent on the parents' availability. Also, all the SLPs already have lots of these resources in non-digital format.

2.2 Sound Features Extraction & Machine Learning

In this section some important concepts about the extraction of sound features and machine learning that are relevant to our work are discussed. This discussion is based mainly on three automatic speech recognition systems for European Portuguese (EP) that were proposed in the context of computational systems for speech therapy. The first is a robust phoneme recognizer for EP vowels, proposed for articulation disorders in the context of the VisualSpeech project [18]. The second is a robust scoring model for serious games with voice exercises with EP vowels, namely for the sustained and pitch variation exercises [14]. Finally, the Interactive Game for the Training of Portuguese Vowels was also proposed in the context of a serious game, and developed a speech recognition system for the five Portuguese vowels [10]. All these systems use sound feature extraction techniques and different machine learning algorithms. The full

understanding of these concepts can be very beneficial when having to choose the best way to develop the machine learning part of our system.

2.2.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC extraction is probably the most used technique in speech recognition for obtaining sound features. The goal of this extraction is to remove everything that is not important for identifying the linguistic content, like for example background noise. When people produce sound, they are modifying the shape of their vocal tract, and this determines the sound that is produced. This sound can be represented by a short-time power spectrum, which is essentially a distribution of the energy as a function of frequency. The main objective of MFCCs is to represent this short-time power spectrum. MFCCs were used in all the systems mentioned above to extract the sound features: in VisualSpeech the numbers of coefficients that got the best scores were 10 and 12, in the Interactive Game for the Training of Portuguese Vowels 16 coefficients were used, and in the serious games with voice exercises with EP vowels 13 coefficients were used.
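As an illustration only, and not the exact pipeline of any of the systems above, MFCCs can be extracted with an off-the-shelf library such as librosa; the file name and the choice of 13 coefficients below are arbitrary placeholders:

    # Sketch: extracting 13 MFCCs from a recording (illustrative parameters).
    import librosa

    signal, sr = librosa.load("recording.wav", sr=None)      # keep the original sample rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # one 13-value vector per frame
    print(mfcc.shape)                                        # (13, number_of_frames)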

2.2.2 Linear Discriminant Analysis

While Carvalho used 16 MFCCs in her Interactive Game for the Training of Portuguese Vowels [10], she noted that, depending on the sounds they were trying to separate, some of the coefficients are redundant and not useful for producing an efficient machine learning algorithm. So they used Linear Discriminant Analysis (LDA) to select only the four features that give the most relevant information. Multiple algorithms can be used to select these features, but the one that provided the best results in their system was LDA. LDA is a supervised method that tries to separate different classes by identifying the attributes that account for the most variance between classes.
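A minimal sketch of this reduction with scikit-learn, assuming a matrix X of 16 MFCCs per sample and labels y for five vowel classes (with five classes, LDA yields at most four discriminant directions); the random data is purely illustrative and not taken from [10]:

    # Sketch: projecting 16 MFCC features onto the 4 most discriminative LDA directions.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 16))            # placeholder for 16 MFCCs per sample
    y = rng.integers(0, 5, size=500)          # placeholder labels for 5 vowel classes

    lda = LinearDiscriminantAnalysis(n_components=4)   # at most n_classes - 1 components
    X_reduced = lda.fit_transform(X, y)                # shape: (500, 4)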

2.2.3 Linear and Quadratic Discriminant Classifier

Carvalho used both a linear and a quadratic discriminant classifier in the Interactive Game for the Training of Portuguese Vowels. The two classifiers produced very similar results. Both of these algorithms are based on Bayes' rule, which gives the probability of an event based on some information that may be related to that event. Using this rule, these classifiers choose the most likely class given a set of attributes. The class-conditional probability in both of these classifiers is modelled using a multivariate Gaussian distribution. The main difference between the Linear and Quadratic classifiers is that the Linear classifier assumes that the Gaussians of all classes share the same covariance matrix, while the Quadratic classifier makes no assumptions about the covariance matrices. This means that the decision surfaces of the Linear and Quadratic classifiers are, respectively, linear and quadratic.
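In standard textbook form (not taken from [10]), each class k is modelled by a Gaussian with mean \mu_k, covariance \Sigma_k and prior \pi_k, and a sample x is assigned to the class that maximizes the discriminant

    \delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x - \mu_k)^{\top}\Sigma_k^{-1}(x - \mu_k) + \log\pi_k ,
    \qquad \hat{y} = \arg\max_k \delta_k(x).

When all classes share one covariance matrix (\Sigma_k = \Sigma), the term x^{\top}\Sigma^{-1}x is the same for every class and cancels out of the comparison, so \delta_k becomes linear in x; with class-specific \Sigma_k it stays quadratic, which is exactly the linear versus quadratic decision surface distinction described above.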


2.2.4 Cross Validation

The basic goal of cross validation is to give a more accurate error estimate for a given prediction model. This is achieved by partitioning the regular training set into K folds, using all except one of them to train the model, and then validating the model on the fold that was left out. This process is repeated for all K folds, and in the end the error estimates are averaged. This basic and yet widely used procedure is called K-fold cross-validation. It can be very useful, especially for small data sets, because this way more data is used to train the model, compared to a regular split into training, validation, and test sets. Another benefit is that this kind of procedure uses all of the data to perform the learning of the model; thus, unlike the regular split, the results are less dependent on the random choice of the training and validation sets.
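A minimal scikit-learn sketch of 5-fold cross validation; the random features, labels, and the choice of classifier are placeholders for whichever model is being evaluated:

    # Sketch: estimating accuracy with 5-fold cross validation.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 13))        # placeholder feature vectors (e.g. 13 MFCCs)
    y = rng.integers(0, 4, size=200)      # placeholder labels (e.g. the 4 sibilants)

    scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)  # one score per fold
    print(scores.mean(), scores.std())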

Diogo et al. proposed a double cross-validation experiment. They start with a one child out experiment that includes a 5-fold cross validation in each test. The one child out experiment consists of n tests, where n is the number of children that participated in the recordings. In each test, the recordings from child i are used as the test set, and all the others are considered the learning set. A regular 5-fold cross validation is then applied to this learning set. This means that the learning set is divided into five folds, and these are used to perform the cross validation. So in the end there are 5 classification models for each of the n tests. The final step of each test is to select the classification model with the highest accuracy on the validation fold. This selection is also performed one last time over all the n classification models, in order to select the best model. This last selection is itself a sort of cross validation, which is why the procedure is called a double cross validation.
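The outer loop of such an experiment can be expressed with a grouped splitter, so that no child contributes samples to both the learning and the test set. The sketch below is a simplification (it refits on the whole learning set instead of keeping the best of the five inner models), and the data, labels, and SVM classifier are placeholders rather than the procedure of Diogo et al.:

    # Sketch: one-child-out outer loop with an inner 5-fold cross validation.
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 13))            # placeholder feature vectors
    y = rng.integers(0, 2, size=300)          # placeholder labels
    child = rng.integers(0, 10, size=300)     # which child produced each sample

    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=child):
        inner = cross_val_score(SVC(kernel="rbf"), X[train_idx], y[train_idx], cv=5)
        model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        print("held-out child", child[test_idx][0],
              "inner mean", round(inner.mean(), 2),
              "test", round(model.score(X[test_idx], y[test_idx]), 2))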

2.2.5 Support Vector Machines (SVM)

In the serious games with voice exercises with EP vowels, the classifier used was an SVM with a Gaussian radial-basis function kernel. The original SVM algorithm performs a linear classification between two classes. It does this by constructing a hyperplane that maximizes the margin between the two classes. But this leaves us with the problem that not all classes are linearly separable. A kernel trick was later introduced to address this problem: instead of calculating every dot product as in the linear SVM, a non-linear kernel function is used. This allows the maximum-margin hyperplane to be fitted in a higher-dimensional space, which leads to a non-linear separation in the original feature space. The Gaussian radial-basis function is one of the multiple kernels that can be used for this kernel trick.
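In scikit-learn terms, the kernelised classifier amounts to the sketch below; the toy data has a deliberately circular class boundary, which a linear SVM could not separate but the RBF kernel can, and the C and gamma values would need tuning on real data:

    # Sketch: a two-class SVM with a Gaussian radial-basis-function (RBF) kernel.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)   # circular, non-linear boundary

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")          # kernel trick: implicit mapping
    clf.fit(X, y)
    print(clf.score(X, y))                                 # training accuracy on toy data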


2.3 State-of-the-art Tools

2.3.1 Isolated Games

In recent years multiple systems have been developed to help keep children motivated while doing their exercises during and after the therapy sessions. These systems can be divided into two main classes: systems with and without sound analysis. In the following sections we discuss examples of systems of both types.

2.3.1.1 No Sound Analysis

There are many applications for speech therapy without sound analysis. While it may be useful to have a great variety of applications to help SLPs diversify the exercises that they present to the children and to keep them motivated, the lack of sound analysis limits these systems to the context of the session or to practicing at home with the parents' supervision.

Articulation

One of the big names in this kind of application development is LittleBeeSpeech [25], the company behind the group of apps named Articulation. They have multiple apps in English and also in Spanish, with many kinds of exercises and games.

One of the apps is Articulation Station [5], whose main goal is to practice the multiple sounds of the English or Spanish language, with the help of different games and exercises (figure 2.2). They have games like matching, cards with images, sentences with specific words containing phonemes in initial, medial or final position, stories and also questions.

Figure 2.2: Interface of Articulation Station

They also have Articulation Test Center [6], which is specifically designed to help speech therapists or parents assess the articulation and speech production of children. This app has two tests that can be performed (figure 2.3): one is the Screener, where the whole test

is based on the age of the child, meaning that both the sounds and images are age appropriate; the other is the Full Test, which allows full customisation of the test, meaning that everything can be tested, from specific phonemes to vowels.

Figure 2.3: Interface of Articulation Test Center

Articula

Articula [4] is known as the first European Portuguese (EP) application on the market, but although it was supposedly launched sometime around 2014, there is not much information about it, and it is not available on the market. From what was gathered, the exercises in this application were basically the same that speech therapists already have in non-digital format, and maybe this is why the application did not gain much traction.

Falar a Brincar

Falar a Brincar [15] is another EP application (figure 2.4), but this one was launched more recently, in 2015. This app is a compilation of different speech therapy exercises, like counting the number of syllables, identification, and so on. It is designed to be used with supervision, since in most exercises the person using it has to indicate whether the exercise was correctly executed or not.

Figure 2.4: Interface of Falar a Brincar

2.3.1.2 With Sound Analysis

All the above systems lack the ability to inform the user about his/her performance, which means that the children can only use the application when they are being supervised either by their parents or by the speech therapists. This minimizes the overall time that the children could use to practise their exercises. In this section, we will see some systems that use sound analysis to circumvent this problem.

sPeAK-MAN

sPeAK-MAN is a game based on the original Pac-Man (figure 2.5), where the core mechanics involve the vocalisation of certain words generated from a pool commonly used in clinical speech therapy sessions [38]. The game has three difficulty levels, and


the main objective is to say the names of the ghosts in order to scare them. To capture the words pronounced they used the Microsoft Kinect sensor, and to recognize the speech the Microsoft Speech SDK was used. Some problems were detected, such as a slight delay in the speech recognition system and some trouble detecting words spoken with different accents.

Figure 2.5: Interface of sPeAK-MAN

Star

The Star system was created to help treat children with articulation problems [17]. The game is set in a spaceship and the goal is to teach aliens selected words and sentences by spoken example. The game has increasing difficulty, so in the beginning the children must produce simple consonant-vowel syllables and then progress to words and phrases. The words are chosen in a way that lets the children better understand the small differences between minimal pairs. The speech recognition system used was a discrete Hidden Markov Model.

Interactive Game for the Training of Portuguese Vowels


As discussed above in section 2.2, the Interactive Game for the Training of Portuguese Vowels was developed around the five Portuguese vowels [10]. In this game the children must control a car using only the five vowels, each corresponding to a specific action: turn right or left, accelerate or decelerate, and stop the car. To develop the speech recognition system, they tested multiple machine learning algorithms and different features. The algorithms with the best results were the Linear Discriminant Classifier, the Quadratic Discriminant Classifier, and the Nearest-Neighbor Classifier, but Nearest-Neighbor had to be dropped from the final system because they needed an algorithm that could be implemented in a real-time system, and this one is not the most appropriate. In terms of features, their choice was 16 MFCCs, mapped to a 4-dimensional subspace using Linear Discriminant Analysis. They also tested adding the pitch to these four features, and had good results in simulation but not as good in the final real-time system.

Talker Talker is a web platform that was created to help children with speech disorders and also people who have suffered a brain stroke [37]. The part of the system more oriented to children can be separated into three parts: the games and activities where children can practice their exercises; the online browser, which allows recording and comparing the recording with a correct reproduction of the word/sentence; and the virtual mirror, which allows the children to compare their performance in real time with the correct recording.

Speech Adventure This game was developed to help children with cleft palate or lip, as even after surgery it is hard to understand what they are saying [34]. Since this is a serious problem and it takes a lot of time to overcome it, regular therapy alone is not enough, as children become easily tired of always doing the same exercises. So Rubin and Kurniawan developed a game for iPhone and iPad, where the children must pronounce the sentence written on the screen three times to pop the three balloons that are preventing the main character from crossing a bridge. OpenEars, a system based on the CMU Sphinx project, was used as the speech recognition engine.

Flappy Voice Flappy Voice [24] was created to help children with apraxia. It is a game based on Zombie Bird, which is a Flappy Bird clone. The objective is to keep the bird flying as long as possible, in between obstacles and without hitting them (figure 2.6). The main difference to Zombie Bird is that in this game children do not control the bird with a touch on the screen, but rather with the duration and amplitude of their voice. The game was developed in English, but it actually responds to any sound regardless of the language used, so it does not perform any specific sound analysis; it just reacts to sound. The game has two difficulty levels: free and assisted. In free mode, if the player hits an object the game ends, while in assisted mode a soft barrier is used on both sides of the obstacles, giving the bird more space to fly in between them. The game can also be parameterized with

specific quantities and heights for the obstacles, and also the vertical space in between them.

Figure 2.6: Interface of Flappy Voice

2.3.2 Complex Systems

Some systems were designed specifically to help patients train at home and to help the therapist choose the most appropriate exercises for the patient according to their current performance. This means that these systems have a way to automatically evaluate the patient's performance.

The core of this kind of system is always the same, and can be divided into three main components:

• The platform, where patients can do their exercises, either during a therapy session or at home, since these systems do not require any kind of supervision.

• The server that is running the speech-analysis engine and doing all the complex work, from storing every piece of data in the database to processing sound signals.

• Most of these systems also offer another platform, this one for therapists to monitor their patients' progress and assign them new exercises.

The need for such tools is not recent: for instance, the IBM Speech Viewer, which is one of the first systems of this kind to appear, had its second version released in 1992 [22][21], and the third in 1997 [13]. This third version already included most of the common features that can now be seen in this kind of system, like visual feedback of speech attributes, a full range of speech exercises, game-like exercises, and clinical management functions.


In the case of newer tools, some original features are starting to appear, like for instance in OLP [31] and TERAPERS [11], where the user can visualize the correct articulation of the sounds and words. Other systems, like ARTUR - the ARticulation TUtoR [7], even offer a virtual speech therapist to guide the patient during the exercises.

Another system that uses the layout described above, and with good experimental results, is the Remote Therapy Tool for Childhood Apraxia of Speech [32]. This system uses mobile applications in the form of games on the patient side to record the sound samples, and then sends the recordings to a server to be analysed and scored. On the other side, the therapist can review each individual recording and the automated scores, and adapt the training program if needed, all through a web interface.

Vithea [43] is another example of one of these systems, but this one was developed to help people with aphasia and targets the European Portuguese language. It has a virtual therapist that guides the patient through the different exercises (figure 2.7) and validates them using automatic speech recognition. This platform also allows the speech therapists to add more exercises and even new images with sound samples. For the speech recognition engine they integrated AUDIMUS [27], a system that uses Hidden Markov Models combined with a Multilayer Perceptron.

Figure 2.7: Interface of Vithea

2.3.3 Tools Comparison

Figure 2.8 shows a comparison between the two EP language systems. By analysing this figure we can easily realize that both of these systems lack some features that are important for the SLPs.

For instance VITHEA, given its focus on the aphasia disorder, which normally does not occur in children, does not have any kind of game designed for children, and it also does not work on mobile platforms. Given the focus on aphasia, there are no specific games for the sibilant sounds either.


Figure 2.8: European Portuguese language systems comparison.

So although this is a very good platform, it does not focus on the same problems that this dissertation is trying to solve.

The Interactive Game for the Training of Portuguese Vowels does not give much relevance to the speech therapy subject, even though it is a game that could be used for that purpose. The game can also be played by children, since it is relatively simple, but since it is a car racing game it is probably not equally attractive to both genders. Its focus is on the five Portuguese vowels and not on sibilant sounds, and, like VITHEA, it does not work on a mobile platform.

These are the only two systems, from all of the systems analysed above, that were developed for the EP language and have some sort of speech recognition system. This by itself already shows that there is a lack of systems for the EP language. Also, considering all the different types of exercises that the SLPs use, when we analyse these two systems it is simple to realize that they only cover a small percentage of those exercises. Another thing to notice is that neither of these systems offers any type of exercise for the sibilant sounds, so in the case of sibilant sounds SLPs do not have any alternative to their regular exercises. We can also observe that there has not been much focus on mobile platforms for this type of system. Nowadays this is becoming more important for the SLPs, since most children can only have one therapy session per week, and that does not give them enough time to practice their exercises. The only viable alternative is to use some sort of homework, but this has to be done with supervision from their parents, who sometimes do not have that time available. So, to sum up, what is needed is: a system that works on mobile platforms; that has some kind of speech recognition system, so it does not require any type of supervision; and that offers exercises for the sibilant sounds, since there are no alternatives to train these sounds.

Chapter 3

Game and Architecture

In this chapter we will discuss all the details of the game platform and architecture. Our proposal is a serious mobile game for the intensive training of the sibilant consonants, regarding the correction of distortion errors. Our solution must allow children to practice their exercises nearly anywhere, provided they have an internet connection, and still give them the necessary feedback on whether they are producing the correct sound or not. The aimed age group of this game is children from five to nine years old, because at these ages the regular exchange errors from phonological development have usually already disappeared.

The game incorporates the isolated sibilants exercise, which is an exercise frequently used by SLPs during the therapy sessions. The exercise consists of producing a specific sibilant sound for some duration of time. The main goal of the exercise is to teach the child how to produce each one of the different sibilant sounds, and also how to distinguish between them. The goal of our proposed solution was not to create a new speech therapy exercise, but rather to adapt one exercise that SLPs already use into a mobile game that children would find more attractive. While our main purpose is to offer a solution that can be used at home, the game can also be used at the therapy session, and it gives children the opportunity of having more intensive training even when it is difficult for them to attend more than one therapy session per week.

3.1 System platform and game engine

One of the goals of our solution is to give children the possibility to practice their speech therapy exercises for sibilant consonants either in the session or at home. So, the first thing that we must consider is what kind of platform to use. The possible solutions are using laptops and desktop computers, or using a more mobile format like tablets or smartphones. After some research we concluded that the tablet market in Portugal is still expanding (figure 3.1), yet slowly, but this is mainly because almost half of the population already has a tablet and does not need a new one1. This means that if the game is developed for tablets, it has the possibility of reaching a very large user base. Also, some of the SLPs that we contacted already use some sort of tablet applications, some of them even as homework, which means that some of the children are already used to this mobile format. Being developed for a mobile platform also gives other benefits, such as allowing the children to play comfortably anywhere at their house, and even outside, provided that they have an internet connection (as will be explained in section 3.3, the game needs access to the internet).

Figure 3.1: Forecast of number of tablet users in Portugal from 2014 to 2021 (in million users) [39].

The next challenge was to choose the platform, that is, to decide if the game was going to be developed for Android, iOS, or both. Even though Android has a larger market share [28], iOS still has a significant number of users. Thus our decision was to develop the game for both platforms. Instead of developing the game twice, once for each operating system, we decided to use Unity, because this allows the game to be developed only once while creating executable files for each platform.

3.2 Mobile game

As briefly explained above, the proposed game uses the isolated sibilants exercise. In order to play the game the child has to say one of the four sibilant consonants addressed by the game: [z], [s], [S], or [Z], for some duration of time.

1In 2014 the share of monthly active tablet users was 42.38%, and by 2021 this value is projected to reach 52.07% of the total population [40].


3.2.1 Game goal and scenarios

The game goal is to move a character from a starting position to a specific target. In order to control the character, children do not use a keyboard or any other regular input method; instead they use only their voice. Each character responds to a different sibilant sound, and to make the character move, the child has to say that specific sibilant correctly. The game includes four scenarios, one for each of the addressed sibilant sounds (figure 3.2). These scenarios were created with the help of a visual artist and with images from Freepik [16]. To create the scenarios we took into account the aimed age group of five to nine year old children. Each of the scenarios has a different character, which is related to the sibilant sound the child must produce. We also chose targets that relate to each game character, to be more appealing to the child.



Figure 3.2: The four game scenarios. (a) The scenario for the [S] consonant. (b) The scenario for the [Z] consonant. (c) The scenario for the [s] consonant. (d) The scenario for the [z] consonant.

The scenario in figure 3.2.d is used to train the [z] sibilant. The main character, a bumblebee, moves towards the beehive, which is the game target, while the child is saying the [z] sibilant correctly. The characters or goal of each scenario are always related

to a word in EP that starts with the specific sibilant used in the scenario. For instance, a bumblebee was chosen for the [z] sibilant scenario because the EP word for bumblebee starts with a [z] sibilant (zangão). The scenario used to train the [s] sibilant has a snake that moves towards the log while the child is saying the [s] sibilant correctly (figure 3.2.c), and the scenario used for the [Z] sound has a ladybug that flies towards a flower while the child is saying the [Z] sibilant correctly (figure 3.2.b). The EP word for snake starts with the [s] sibilant (serpente) and the EP word for ladybug starts with the [Z] sibilant (joaninha). Finally, there is a scenario for the [S] sibilant (figure 3.2.a). In this case the main character, a boy, has to run away from the rain until the end of the street, which is the game target. The rain comes from a gray rainy cloud that follows the boy while he moves. The boy is able to run away from the rain while the child says the [S] sound correctly. The EP word for rain starts with the [S] sibilant (chuva).

3.2.2 Visual cues

The idea of using a character whose name resembles the EP sound that the child must produce was given to us by one of the SLPs that we talked to. She uses a lot of different exercises and activities, and she always tries to find ways for the children to remember the exercise and the corresponding sounds. This is very useful, not only because she does not need to explain the exercise every time, but also because it helps the children to remember the sounds from one session to another. This is particularly useful when she is trying to see if children already know how to produce the sound without having heard a previous example. We decided to use the same characters that she uses in this activity (e.g. zangão for the [z] sound), since this is a simple and effective way to remind the children of the sound they must produce. Considering that our game is designed to be used at home, and not only during the therapy sessions, it is important that we do not rely on the SLP explanations to help children remember the exercise and the sound that they should produce. So, this type of visual cue is particularly useful in our use case, since this way children do not need the help of their SLP or an adult to remember the sound.

Apart from using characters related to the addressed phoneme, an interesting idea was to relate the movement of the character with the use of the vocal folds. This idea came from the flying movement of the bumblebee and the fact that the [z] sibilant is a voiced one. We did not want a very complex motion, in order not to distract the child, so the movement was simplified to a sinusoidal wave (as seen in figure 3.3), which can be modified by changing its amplitude and frequency. We applied the same concept to the movement of the ladybug, since the [Z] sound is also voiced. We decided to use a simple straight line movement for the remaining sounds, [s] and [S], for the snake and the boy running away from the rain respectively, because these are voiceless sibilants. This is achieved by using the same movement code, but with a zero amplitude parameter. This idea was shown to an SLP, who considered this cue very helpful, particularly because explaining to a child

that she must use the vocal folds is not easy, because most children in our age group do not even know what the vocal folds are.

Figure 3.3: Example of the sinusoidal movement of the bumblebee character.
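To make the relation between this sinusoidal cue and the scenario parameters more concrete, the sketch below (in Python, purely for illustration, since the game itself is built in Unity) shows how the character position could be computed from the elapsed time of a correct production. The function and parameter names are hypothetical, and setting the amplitude to zero yields the straight-line movement used for the voiceless sibilants.

import math

def character_position(t, start, end, travel_time, amplitude, frequency):
    # Position of the character t seconds after the start of a correct production.
    # start and end are (x, y) points; amplitude and frequency control the
    # sinusoidal cue used for the voiced sibilants ([z] and [Z]).
    progress = min(t / travel_time, 1.0)            # fraction of the path covered
    x = start[0] + progress * (end[0] - start[0])   # linear interpolation towards the target
    y = start[1] + progress * (end[1] - start[1])
    y += amplitude * math.sin(2 * math.pi * frequency * t)  # zero amplitude = straight line
    return x, y

# The bumblebee ([z]) oscillates while it flies; the snake ([s]) moves in a straight line.
bee_pos = character_position(1.5, (0, 0), (10, 0), travel_time=6, amplitude=0.5, frequency=1.0)
snake_pos = character_position(1.5, (0, 0), (10, 0), travel_time=6, amplitude=0.0, frequency=0.0)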

3.2.3 Visual Feedback

Visual feedback is very important to let the child know if she is producing the sound correctly or not. We decided to use the character's movements to give visual feedback to the child about her sound productions. The feedback is positive when the production is correct, in which case the character moves towards the goal. The character keeps moving while the child is producing the sibilant consonant correctly. If the production is not correct, the character stops moving to give negative feedback to the child. In the case of negative feedback, there are two available modes: one in which the exercise has to be repeated from the beginning, and another in which the character waits for a correct production to continue moving towards the target. This type of visual feedback is very intuitive for children and also a good way to motivate them to try to say the sibilant consonants correctly. The two different game modes can be seen in figure 3.4. In the game mode in figure 3.4.a, when the ASR system reports incorrectly classified productions, the mobile game gives negative feedback to the child, which ends the game, and the child then has the possibility to try again. In the game mode in figure 3.4.b, when the game gives negative feedback to the child, the character just stops moving, and the game keeps waiting for another correct production to give positive feedback to the child. When the child produces the correct sibilant sound and makes the character reach its goal, we show a congratulations message and give the child the opportunity to keep playing, as can be seen in figure 3.5.a. When we are using the game mode shown in figure 3.4.a, if the child is in the middle of a correct production and then starts producing an incorrect sound, or simply stops producing any sound, we show a message trying to motivate her to try again, and give her that possibility, as can be seen in




Figure 3.4: The two developed main cycles. (a) The game cycle ends when the child produces an incorrect sound production. (b) The game cycle waits for another correct production in order to give positive feedback to the child.

figure 3.5.b. When the game mode in figure 3.4.b is used, we never show the try again message, and always wait for the child to complete the exercise. At any time, we allow children the possibility to exit the exercise by pressing the home button in the top right corner.


Figure 3.5: The end game messages. (a) The message the child receives when she reaches the goal of the game. (b) The message the child receives when she stops producing the correct sibilant sound.

3.2.4 Parameterization of the game

The distance separating the initial position of the main character (starting point) from the target (end point) is related to the speech duration needed to make the character move

until it reaches the target. The isolated sibilants exercise can be done with shorter or longer productions of the sibilants by varying the distance between the starting point and the end point. This possibility is very useful for SLPs because every child is different and the exercise needs to be adapted to her current state or problem. While developing games for speech therapy, we must be very careful not to restrict the options of the SLPs. For instance, in this case, if the time that a child has to produce the sibilant sound was always the same, this would not suit every child in the same way. So, the best alternative is to give SLPs as many options and adjustable parameters as possible, so that they have all the needed flexibility to better adjust the exercise to the problems of each child. As a response to these observations, there are five main parameters that can be adjusted in all four scenarios:

• the starting and end points,

• the time the character takes to move from the starting to the end point,

• the amplitude and frequency of the movement for the voiced consonants.

3.3 System architecture

This game was designed to be used at home, so it cannot depend on the SLP to provide the necessary feedback to the child on whether the exercise is being correctly executed. This task could be given to the child's parents, but many times they are not available, or do not have the knowledge to assess if the exercise is correct or not. So our solution was to use an automatic speech recognition system to provide feedback to the child on whether the exercise is executed correctly. As it was already mentioned, the game will run on a mobile platform; however, an ASR system is very demanding, particularly for a mobile platform, and it could cause delays in the regular execution of the game. This type of delay is even more difficult to predict given that we could be dealing with multiple devices with a wide variety of computational power. The solution is to use a client-server architecture (as seen in figure 3.6), where the game only has to record the sound samples and send them to the server, and the server classifies the samples and responds whether these samples were correct sound productions. This type of architecture gives us great flexibility when choosing what to use for the ASR engine, and also allows changing the algorithm at any time without having to change the mobile application.

Client The main job of the client application is to send the sound samples the child is pro- ducing to the server. This is done by recording segments of sound until these reach a determined length. When a segment reaches its desired length the client sends it to the server. The client then has to wait for the server to classify these samples as correct or

incorrect reproductions. When the client receives the answer from the server, it can then provide visual feedback to the child. This feedback can be either positive or negative. In the case of positive feedback, the character starts or continues moving towards the goal; if the feedback continues to be positive, this cycle continues until the character reaches its goal. If the feedback is negative, the character either stops moving and the child has to repeat the exercise, or, in the other game mode, the character simply waits for another correct production to start moving again. Given this type of connectivity with the server, the client always needs to have a stable internet connection.

Server The server is responsible for classifying each sound segment that is sent by the client. To do this, it first extracts the features from the sound samples, and then uses the previously trained ASR system to classify them as correct or incorrect (section 4.2 discusses the features and algorithms used to classify the child productions). Then the server sends this answer back to the client and waits for the next speech segment. In order to train the ASR engine, child speech productions of the sibilant sounds were used (more details in section 4.1).

Figure 3.6: Client-server architecture. The client is the mobile application, that records segments of sound, and then sends them to the server. The server has to extract the sound features from each segment and uses an ASR system to classify them. Afterwards the server sends the response back to the client, which can now provide the feedback to the child.

3.4 Implementation details - modularity and extensibility

In order to develop our proposed solution, we have three main components: the mobile game, the server, and the ASR system. We decided that the best way to achieve our goal was to develop each one of them in a way that it could be completely independent from the others, as this would allow us to make changes in one component without affecting the remaining ones. In this section we are going to present some of the development choices that allowed this to happen.


3.4.1 Mobile Game

One of the most important things in our mobile game was the need to easily add more scenarios, either for the same sounds or for new ones; it should also allow building more complex scenarios. Our client-server architecture helps to split the mobile game from our ASR system, and this allowed us to build a simple main game cycle that is completely independent from both the scenario and the sound. To be able to get this type of extensibility, all scenarios share three main classes, which are responsible for recording the microphone input, for the connection to the server, and for creating the movement of the character. The main game cycle of each of the scenarios is processed as follows: the microphone class is always recording the input sound; when this recording reaches a certain length it is sent to the server by using the helper class that manages the connections to the server; and then the response from the server is used to compute the current state of the game: proceed (if the child is producing the correct sound) or stop (if the child is not producing the sibilant sound correctly). If the child is producing the correct sound, the movement class is used to produce the animation that moves the character towards the target, and the main cycle continues until the character reaches the target. If the child does not produce the correct sound, the game finishes and the main cycle stops its execution.

In listing 3.1 we have some pseudo code explaining this game cycle. We have our three main classes: mic for the microphone class, http for the class that connects to the server, and mov for the class responsible for the movement of the character. We start by checking if our audio segment already has the predetermined size in order to be sent to the server, in line 3. If it has, we fetch it from the mic class, in line 4, and then use the http class to send it to the server, in line 5. Line 8 represents the beginning of the game, where we are still waiting for a correct production of the sound to occur in order to start the movement of the character. In line 11, we check if we got a correct production, and if we have, we update the position of the character using the movement class, in line 12, and check if we are already at the target location, in line 14. If we are in the target location we then display a congratulations message to the child and end the game. When we do not get a correct production we enter the last else, which displays a try again message and ends the game. This code represents the cycle present in figure 3.4.a. The only change needed to this code in order to produce the game cycle in figure 3.4.b is the removal of the last else, since in this way we never display the try again message, and simply continue waiting for a correct production to move the character again.

Listing 3.1: Pseudo code of the main game cycle.

1  baseGameCycle() {
2
3    if (segmentHasEnoughLength) {
4      audio = mic.getAudioByteArray();
5      http.sendAudioToServer(audio);
6    }
7
8    if (charStopped && !correctProduction) {
9      // waiting for a correct production to start the movement
10   }
11   else if (correctProduction) {
12     charPos = mov.getCharNewPosition();
13
14     if (charPos == target) {
15       displayCongratsMessage();
16       endGame();
17     }
18   }
19   else {
20     displayTryAgainMessage();
21     endGame();
22   }
23 }

With this base main cycle and with the help of the three main classes, it is very simple to add other scenarios for these sounds, and even for new sounds. More complex scenarios can also be built using this base, as proven by our scenario for the [S] sound, which has two moving characters (the boy and the rain) and uses the same logic for the base main game cycle, also relying on our three main helper classes, with some additional code to control the movement of the rain.

3.4.1.1 Real time sound capture

The microphone class is responsible for the real time sound capture, which combined with the remaining classes allows us to provide visual feedback to the child. Without this class, we would not be able to get the audio data necessary to check the sound productions of the children, which would make this game completely dependent on the SLP decisions. We first need to add an audio source object to the main character, and then we can use that object to get the sound that is being captured by the microphone. In listing 3.2, we have the main method of the microphone class that is used to get a byte array of audio samples. We first need to calculate the size of the array that we will need to allocate. For that we get the position where the current recording is, in line 2, and subtract the last position. We can then initialize a float array that is going to be loaded with the samples, in line 6. In the end we just need to convert our float array to a byte array in order to send it with our http class.

Listing 3.2: Pseudo code of the main method in the microphone class.

1  getAudioByteArray() {
2    int pos = Microphone.GetPosition(micName);
3    int diff = pos - lastPos;
4
5    float[] samples = new float[diff];
6    audioSource.clip.GetData(samples, lastPos);
7    lastPos = pos;
8
9    return floatToByteArray(samples);
10 }

3.4.2 Server

As said above, our mobile game can be adapted to have multiple scenarios for our sounds, and even for new sounds. What allows this type of flexibility is the way our server was designed. We do not train our ASR system in the server. Instead we test and train our ASR system independently from the server, and when we get a good model for a specific sound, we serialize that classifier object into a file. We then place that file in a folder in our server. When the server initializes, it first loads every classifier from a file into an object, and then we simply have one method for each of the classifiers. So when a new game scenario is added, a new serialized classifier must be added, and also a new method that uses that classifier to predict the sound samples.

In listing 3.3 we present some pseudo code of the server. In line 2 we can see a classifier being loaded into an object, and when that classifier is loaded, we start the server in line 5. We also have a generic method for each classifier, presented in line 8. From the form that is sent to the server, we extract the audio rate and also the audio block, which comes in the form of a byte array. We then convert that byte array into a float array, in line 12. We then extract the MFCC matrix from those audio samples, and use our previously loaded classifier to predict the samples. Our result is a simple array of ones and zeros, depending on whether each sample was classified as correct or not. We then convert that array into three integer values: the total number of samples, and the number of correctly and incorrectly classified samples. We then send those three values as a JSON response to the mobile game.

Listing 3.3: Pseudo code of the server.

1  if __name__ == '__main__':
2      classifier_XX = load_classifier_XX()
3      # load the remaining classifiers in the same way
4
5      app.run()  # start server
6
7  @app.route('/classify_sound_XX', methods=['POST'])
8  def classifier_sound_XX():
9      rate = getRateFromForm()
10     audio = getAudioBlockFromForm()
11
12     data = convertFromByteToFloatArray(audio)
13     mfcc_feat = getMfccMatrix(data, rate, coefs=13)
14     result = classifier_XX.predict(mfcc_feat)
15
16     total_samples = len(result)
17     wrong_samples = len(result[result == 0])
18     correct_samples = len(result[result == 1])
19
20     return json(total=total_samples, C=correct_samples, W=wrong_samples)
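The pseudo code above can be turned into a small runnable server with little extra machinery. The sketch below is one possible concrete version for a single sibilant, assuming Flask, numpy, python_speech_features for the MFCC extraction, and a scikit-learn classifier serialized with joblib; the endpoint, form field and file names are placeholders and not necessarily the ones used in the actual implementation.

import numpy as np
from flask import Flask, request, jsonify
from joblib import load
from python_speech_features import mfcc

app = Flask(__name__)
classifier_s = load("classifier_s.joblib")   # previously trained and serialized model for [s]

@app.route("/classify_sound_s", methods=["POST"])
def classify_sound_s():
    rate = int(request.form["rate"])                        # sampling rate sent by the game
    audio_bytes = request.files["audio"].read()             # raw audio segment (byte array)
    samples = np.frombuffer(audio_bytes, dtype=np.float32)  # assumes 32-bit float encoding

    # One 13-coefficient MFCC vector per analysis frame; each frame is classified separately.
    mfcc_feat = mfcc(samples, samplerate=rate, numcep=13)
    result = classifier_s.predict(mfcc_feat)                # 1 = correct production, 0 = incorrect

    return jsonify(total=int(len(result)),
                   C=int(np.sum(result == 1)),
                   W=int(np.sum(result == 0)))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # load any remaining classifiers before this call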

3.4.3 The need for an ASR system

As explained earlier, our mobile game is designed to be used at home and without the need for the supervision of an adult. So, in order to be able to provide visual feedback to the child, we had to come up with a solution to know if the child is producing the sibilant sound correctly. We cannot simply use the sound spectrogram or other simple characteristics of the sound to determine if the sound production is correct. So our decision was to use an ASR system, which will be explained in detail in chapter 4. This ASR system is responsible for training all the classifiers that are then used in our server to predict the sound samples.

Chapter 4

Automatic Speech Recognition System

As discussed above, in order to provide feedback to the child, we use an ASR system to classify the child's sound productions. In this chapter we will discuss every aspect of our ASR system, from the gathering and treatment of the sound data, to the algorithms and corresponding scores, and some alternative classification solutions that were used to improve our accuracy test scores.

4.1 Sound data

To train the ASR system, we needed sound samples of children performing the isolated sibilant exercise with the different sibilant consonant sounds. To accomplish this, recordings were made in three schools in the Lisbon area. There were 90 children with ages between 4 and 11 participating in the recordings (table 4.1). There were a bit more girls than boys, and this difference is mainly in the children between five and seven years old. Between eight and eleven years old the number of recordings is well balanced between genders. We had two children that fall outside of our age range, one of four years old and one of eleven years old. These children were in the same class room as the others, so we did not refuse to record them and instead took the opportunity to increase our data set.

The recordings were made using a dedicated microphone and DAT equipment (the setup we used can be seen in figure 4.1). While we tried to make the recordings in a quiet room, there was noise coming from the outside, especially during recess. We used an acoustic foam around the microphone, which allowed us to isolate a bit more the productions of the child and have a better sound quality. However, this does not remove all noise: in the recordings we can still hear, for instance, doors closing, chairs moving and some recess noise, which is good in order to produce an algorithm that is more robust to outside noises.

Table 4.1: Number of children who performed the recordings.

Figure 4.1: The setup used for the recordings. On the left there is one of the DAT devices used, and on the right the microphone, with an acoustic foam around it.

During the recordings there was only one child in the room, and the SLP that gave her the instructions. Most of the times there was also one person from the BVS project present. Most of the recordings were made by an SLP (holding a master's in speech and language therapy), but others were also made by graduate students in speech and language therapy. The SLP (or a graduate student) would first explain to the child what the exercise was, and the sounds that she would have to produce. Then she would start the recording and ask the child to produce each sound by giving an example. First the SLP would ask for the four sibilants in the short version of the sound, and then she would ask for the long version, all in the same recording. So we collected eight different samples from each child, which correspond to short and long repetitions of the sibilant sounds [S], [Z], [s], and [z]. Table 4.2 shows the number of samples from each sound.


Table 4.2: Table with the number of samples for each sound.

For each child, a single file was recorded with all her sound productions. This means that each file includes both the sibilant sounds produced by the child and also every indication from the SLP. In order to be able to use this data, every file had to be split into separate samples, one for each of the eight child productions that it contains. This task could be done by hand, using software like Audacity: we would have to listen to each file, search for every child speech production, and then split it into the multiple productions. This would take a considerable amount of time, and it is very prone to human errors. Our solution to automate this process was to create an algorithm that measures the energy in the beginning of the file (to measure the sound level of the no-speech signal) and then identifies the peaks of energy, which have a high probability of being regions containing speech. For every peak of energy that we find we create a new file, and the file ends when we reach again the energy of the no-speech signal. In this way the files are automatically split into smaller files that contain single productions of sound. These single productions of sound can be either from the SLP or the child, and sometimes also noises. So we still need to listen to each of these smaller files to identify which ones are from the child, to label them, and to delete the remaining files (SLP speech, or noises). When all the files were split we obtained a total of around 2600 smaller files, from which 842 were the child sibilant productions we are interested in.
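The sketch below illustrates this energy-based splitting idea, assuming scipy to read the recordings; the window length, the threshold factor and the assumption that the first second of each file contains no speech are illustrative choices, not the exact values that were used.

import numpy as np
from scipy.io import wavfile

def split_on_energy(path, win=0.05, factor=4.0):
    # Split one long recording into candidate speech segments. The background level
    # is estimated from the first second of the file (assumed to contain no speech);
    # a segment starts when the short-term energy rises above factor * background
    # and ends when it falls back below that level.
    rate, data = wavfile.read(path)
    data = data.astype(np.float64)
    hop = int(win * rate)
    frames = np.array_split(data, max(1, len(data) // hop))
    energy = np.array([np.mean(frame ** 2) for frame in frames])
    background = np.mean(energy[: int(1.0 / win)])   # energy of the no-speech signal
    threshold = factor * background

    segments, start = [], None
    for i, e in enumerate(energy):
        if e > threshold and start is None:
            start = i * hop                      # energy peak: a new segment begins
        elif e <= threshold and start is not None:
            segments.append((start, i * hop))    # back to the background level: segment ends
            start = None
    if start is not None:
        segments.append((start, len(data)))
    return rate, [data[a:b] for a, b in segments]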

4.2 Automatic recognition of isolated sibilant consonants

Our ASR system can be divided into three main components: the classification algorithm, the feature extraction, and the selection of the training and test sets. All of these topics will be explained in greater detail below. In the end, we will also explain our training and testing methodology.

4.2.1 The classification algorithm

One of the goals of our game is to provide visual feedback to the child through the movement of the main character, which is only achievable with the use of our ASR system. In order for this feedback to be useful to the child, the game must react almost instantly to the child's voice. Given the inherent communication delay from the client-server

architecture, we have to consider carefully the classification algorithm in order to keep the time it takes to classify a group of samples as low as possible. This leads to the exclusion of some kinds of algorithms, like for instance lazy ones (e.g. KNN), as those usually take more time to classify samples, since they do not have a previous training phase to learn the model; instead they delay the learning until a request is made to the system, and since the results are always based on a local feature space, this process happens at every request. The game's time restrictions also require that some algorithms that are more complex and take more time to compute an answer must be avoided. Given the game's time restriction, we selected the following classifiers for this study: multiclass support vector machines (SVM) with radial basis function (RBF) kernel, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). An important characteristic of these classifiers is that all three of them support multiclass classification. Since we are trying to distinguish between the four different sibilant sounds, this is a good place to start, because it is a simple and effective way to do it. Another advantage of these classifiers is that, since they all share this characteristic (multiclass classification), it is simple to compare the three algorithms and the tested features. After our first experiments with these classifiers, we decided to try another approach, which was using a single-phoneme SVM. In this case we have one SVM trained for each of the four sibilant sounds, which means that each SVM is trained to recognize only one phoneme. In section 4.3 we will compare these four classifiers in greater detail.

4.2.2 Feature vectors

In order to train our ASR system, we need to extract features from our sound samples. We decided to use the raw Mel frequency cepstral coefficients (MFCC) as features, since these are commonly used in ASR systems. As an illustration, figure 4.2 shows the MFCC matrix obtained for a [z] sound. Our feature vectors consist of columns of the MFCC matrix. We used MFCCs as features to train all three classifiers, and obtained very good results with the combination of these features and classifiers.

Figure 4.2: The MFCCs matrix with 13 coefficients, obtained for a [z] sound.
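As an illustration of this feature extraction step, the sketch below computes a 13-coefficient MFCC matrix for one (placeholder) sound file using the python_speech_features package and treats the coefficients of each analysis frame as one labeled feature vector; this is an assumed implementation, shown only to make the representation concrete.

import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

rate, signal = wavfile.read("child_042_z_long.wav")   # placeholder file name

# One row per analysis frame, 13 coefficients per frame. Each row corresponds to
# one column of the matrix shown in figure 4.2 and is used as one feature vector.
mfcc_matrix = mfcc(signal, samplerate=rate, numcep=13)

feature_vectors = mfcc_matrix                   # shape: (number of frames, 13)
labels = np.full(len(feature_vectors), "z")     # every frame of this file is a [z] production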


While historically the number of MFCCs used is thirteen, there are also studies that use different numbers of coefficients. In particular, Carvalho et al. had good results using different numbers of coefficients to classify isolated phonemes (the EP vowels) [10]. To start testing our different algorithms and solutions, the first thing that needed to be decided was the number of MFCCs to use, as this would allow us to get comparable results between algorithms. We started by training the SVM classifier with different numbers of MFCCs, ranging from 5 to 25 coefficients, as seen in figure 4.3. When using fewer than 9 MFCCs, the accuracy test score is really low, but it rapidly increases with the use of more coefficients. This means that when we use only a small number of coefficients, our classifier does not have enough distinct data to correctly separate the four different sibilant sounds, but when using more data (more coefficients), the four sounds can be more easily separated. The increase in score starts to slow down at around 13 MFCCs, and above that the differences between scores can easily be attributed to the margin of error of the training of each classifier. So we decided to use 13 coefficients, since our results did not show any particular improvement when using more than 13 coefficients. This result is also confirmed by most of the literature, which usually uses around 12 to 13 MFCCs [26, 29, 30, 35].

Figure 4.3: Accuracy test scores of the multiclass SVM classifier with RBF kernel, using different numbers of MFCCs.

4.2.3 Different options for the training and test sets

One thing that heavily influences both the validation and test scores is the way the training and test data are chosen. To address this problem, we present multiple options for the training and test sets. These options range from a simple naive split to create both subsets of data, to more sophisticated algorithms that try to reduce the possible bias in our results created by our data sets.


Considering the limitations of the SVM when dealing with large data sets (the time the SVM takes to train the model), we never use the complete MFCC matrices from the sound samples of every child. Instead we only use a subset of that data set, and then we further divide it into the training and test sets. This first data set is created in the following way: let us consider a number of samples that we would like to have, for instance 20000. We know that we have the recordings from 90 children, and each of them produced recordings for the four different sounds. But we only want the correct recordings to train our algorithm, so let us assume that of those 90 children, only 70 produced all sounds correctly. Now we just need to calculate how many feature vectors we must take from each sound of each child in order to have approximately the total number of samples that we want, which in this case is 72 feature vectors for each sound of each child, leaving us with a total of 20160 samples. We decided to consider only the children that had produced all the sounds correctly in order to have a similar number of feature vectors for each sound. Below we present our three options to split the data into the training and test sets (a code sketch illustrating the three options follows the list):

• The naive split In the naive split, we use the technique explained above to get the first subset of data. Then we use a random split to get both the training and test sets, using a 30% test size. This split is stratified, meaning it maintains the same ratio between classes in the two data sets. Using this naive split, it is very likely that samples from the same child, and even from the same sound file, are present in both the training and test sets.

• K% children test set The goal of this experiment is to simply divide the children into two groups, the training and the test group. We basically use the technique explained above to get the first subset of data, and then we consider the number of children that are in this subset and select a given percentage of them as the test set, with the remaining children as the training set. For instance, using the example above, where the subset contained 70 children, we would randomly take 21 of them (in this case 30%) as the test set, and the remaining 49 children would form the training set. Using this split there are not going to be any child productions in both groups, as children are either in one group or the other. The children are selected randomly for each group, so we may have considerable fluctuations in the scores in different runs of this split.

• One Child Out experiment Our last option is a double cross-validation, which we called our One Child Out experiment, and that was already described in section 2.2.4. The goal of this experiment is to see how well our model will behave with completely new data. This happens because the training set consists of n − 1 children, being n the total number

of children, and the remaining child is used as the test set. This process continues until all children have been used as the test set, and the corresponding models for all the training sets have been created. In the end we will have n models, which have all been trained using a slightly different training set, and the test set of each of them always contained unseen sound samples from a different child.
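The sketch below illustrates the three options with scikit-learn, using random placeholder data in place of the real MFCC feature vectors and child identifiers.

import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit, LeaveOneGroupOut

# X: feature vectors (one row per MFCC frame), y: sibilant labels,
# children: the id of the child each frame came from (all placeholders).
X = np.random.rand(2500, 13)
y = np.random.choice(["s", "z", "S", "Z"], size=2500)
children = np.random.randint(0, 70, size=2500)

# 1) Naive split: stratified random 70/30 split; frames from the same child
#    (and even from the same recording) can end up in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)

# 2) K% children test set: whole children are assigned to one set or the other.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3)
train_idx, test_idx = next(gss.split(X, y, groups=children))

# 3) One Child Out: each child in turn is held out as the test set.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=children):
    pass   # one model is trained per held-out child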

4.2.4 Model training methodology

In order to train our models and achieve comparable results, we had to develop a training methodology. The base idea is to define a group of rules that we must follow for every model that we train. These rules can be split into the main concepts that are further explained below: select a subset of all the features, split the data into the training and test sets, train the model and tune the parameters (in the case of the SVM) using the validation set, and finally test the model against the test set. Each classifier was trained in the following way:

• Once we computed the MFCC matrices from all sounds, we built the feature vectors by selecting columns from these matrices. These vectors were labeled with the corresponding sibilant. We used one class for each sibilant, so we have 4 different classes. Given the limitations of the SVM training complexity, we did not use the whole matrices. Instead we used the different options explained in section 4.2.3 to select both the training and test sets.

• We used a stratified 5-fold cross validation within our training set.

• To choose the best parameters (C and gamma) of the RBF kernel of the SVM classifier, we used a grid search to test different ranges of values for both parameters (a minimal sketch of this step is given after this list). LDA and QDA do not need any parameter fine tuning.

• In the end we test our best model against the test set, to get a better estimate of how our model will perform with new data.
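A minimal sketch of this methodology with scikit-learn is shown below; the data is a random placeholder and the C and gamma ranges are illustrative, not the ones actually searched.

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Placeholder data; in practice these are the labeled MFCC feature vectors.
X = np.random.rand(2500, 13)
y = np.random.choice(["s", "z", "S", "Z"], size=2500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)

param_grid = {"C": [0.1, 1, 10, 100, 1000],
              "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1]}

grid = GridSearchCV(SVC(kernel="rbf"), param_grid,
                    cv=StratifiedKFold(n_splits=5),   # stratified 5-fold cross validation
                    scoring="accuracy")
grid.fit(X_tr, y_tr)

print("best parameters:", grid.best_params_)
print("cross-validation score:", grid.best_score_)
print("accuracy test score:", grid.score(X_te, y_te))   # final check on the held-out test set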

4.3 Classification Results

In this section we are going to present all the classification results that were produced using the multiple ideas explained in the previous sections. We start by comparing our three multiclass classifiers, and then comparing the best one with our single-phoneme SVM, using our Naive split to produce both the training and test sets. Afterwards we pick our best classifier and test it using our different options to produce the training and test sets. At the end of this chapter we present some final conclusions of all our experiments.


4.3.1 Comparing the classifiers using the Naive split

As explained above, we start by comparing the accuracy test scores of our three multiclass classifiers (figure 4.4). As can be observed in the figure, LDA was the classifier with the lowest accuracy test score, a result that can be attributed to one big limitation of this classifier: it can only learn linear boundaries. QDA is around 3% more accurate than LDA, which can be easily explained by QDA not being restricted to a linear space to separate all four classes. Given this difference in scores between LDA and QDA, we can assume that our data set is not linearly separable. Our multiclass SVM was the classifier with the highest accuracy test score, giving us a test score almost 8% higher than QDA, which means that it will make fewer errors in the classification of the four different sounds.

Figure 4.4: Accuracy test scores of the three multiclass classifiers.

The classifier that gave us the best results was the multiclass SVM with the RBF kernel, with a considerable margin over the results of the other two classifiers, so we decided to focus on this classifier to try to achieve even better results. One problem of multiclass classifiers is that they have to adjust in order to give the best average score over all the classes that they are trying to separate. Sometimes this results in not providing the best fit to one particular class, because the overall average score would be lower. So, in order to fix this problem, we replaced the multiclass SVM with four SVMs, one for each class. The idea was to create one SVM classifier, also with the RBF kernel, to classify each of the four different sibilant classes. This solution allows us to fine tune each one of the four classifiers, which here we call single-phoneme SVMs, in order to improve our classification scores. To accomplish this single-phoneme SVM solution, for each classifier the labels of the data were changed to true or false depending on the sound that was being trained. For instance, when we trained the classifier for class [s], all samples from sibilant [s]

were labeled true, and the remaining samples were labeled false. This makes our single-phoneme SVM a binary classifier: a new sound sample for one specific sibilant either is, or is not, a correct production of that sibilant sound. Doing this for each sound let us achieve our goal of one classifier for each sound. Each single-phoneme SVM was trained with the same method that was described in section 4.2.4; the only thing that changed was the different labeling of the data to accommodate our new solution. Our four single-phoneme SVMs were then fine tuned to try to achieve the best accuracy scores for each of the four classes. Figure 4.5 shows the results obtained with this approach. Our first solution of using a multiclass SVM has the same score for all four sounds, since the classifier is always the same, and the score is an average over all the correct labels for all classes. Our solution of using single-phoneme SVMs, basically four different SVMs, one for each sound, brought us better results in the classification of all four sounds, by a margin of around 8%. In addition, it can be observed that these are quite high accuracy test scores (above 91%).

Figure 4.5: Accuracy test scores of the multiclass SVM, and four single-phoneme SVMs (both with the RBF kernel) for all four classes.
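The relabeling that turns the multiclass problem into four single-phoneme classifiers can be sketched as follows; the data is a placeholder and the C and gamma values are only illustrative, since each classifier is tuned independently with the methodology of section 4.2.4.

import numpy as np
from sklearn.svm import SVC

# Placeholder feature vectors and multiclass labels.
X = np.random.rand(2000, 13)
y = np.random.choice(["s", "z", "S", "Z"], size=2000)

single_phoneme_svms = {}
for sibilant in ["s", "z", "S", "Z"]:
    y_binary = (y == sibilant)                  # true for the target sibilant, false otherwise
    clf = SVC(kernel="rbf", C=10, gamma=0.01)   # each classifier is tuned independently
    clf.fit(X, y_binary)
    single_phoneme_svms[sibilant] = clf

# At game time only the classifier of the scenario's sibilant is consulted.
is_correct_s = single_phoneme_svms["s"].predict(X[:5])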

4.3.2 Comparing the different training and test sets

In the previous section we reached the conclusion that our single-phoneme SVM was the classifier that produced the highest accuracy test scores. So, in this section we are going to use it to compare our different options for the training and test sets (these options are explained in detail in section 4.2.3). Given the time it takes to train the SVM, we had to reduce the number of samples used to 2500, instead of the previously used 20000, especially because of the One Child Out experiment, since it has to train a new SVM model for each iteration of the algorithm. Figure 4.6 shows the results of these experiments using 2500 samples.


Figure 4.6: Accuracy test scores of the four single-phoneme SVMs, while using different training and test sets.

4.3.3 Naive split results

The results obtained here are a bit lower than some of the results presented previously; this is due to the change in the number of samples used to train the algorithm. We could not use the scores obtained before, since it would not be a fair comparison between all options, as our first results were obtained with more data samples. But even using considerably fewer samples than before, this option still gave us very good scores, always above 91%.

4.3.4 K% children test set results

This solution showed us that our scores from the Naive split may be a bit biased, yet it still provided us with very good results, the lowest being 87% and the highest 91%. This 3-4% gap between the two options can be attributed to two things. The first is that the Naive split sometimes has samples from the same child in both the training and test sets, and considering that there is not much variation during the production of these sounds by each child, as will be further explained below, this could be leading to a bit of a biased scenario, since we mostly have the "same" data in both data sets, which will obviously lead to higher classification scores. In figure 4.7, we present the evolution of the fourth MFCC over time, in a segment of the sound clips of a production of the [S] sound by two different children. With these two segments we can see two interesting characteristics of these sounds. The first is that even though there is some variation in the values of this particular MFCC, the values always fall into a certain range. For instance for child 001, the values vary between around 35 and 55, regardless of the time period. The second characteristic is that if we compare both figures, even though they both share this initial characteristic, the range

for child 344 is a bit different: in this case the values hover between around 20 and 40. This is observed in different MFCCs, as seen in figure 4.8, and the ranges almost always vary between children.

Figure 4.7: The progression of the MFCC 4 for sound [S] over time, for two different children.
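A simple way to inspect this behaviour is to compute, for each child and recording, the range of a given coefficient over time; the sketch below does this with python_speech_features, using placeholder file names.

import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

def mfcc_range(path, coefficient=3):
    # Range of one MFCC (0-based index, so 3 is the fourth coefficient) over one recording.
    rate, signal = wavfile.read(path)
    feats = mfcc(signal, samplerate=rate, numcep=13)
    values = feats[:, coefficient]
    return values.min(), values.max()

# Within one child the values stay in a narrow band, but the band differs between children.
print("child 001, [S]:", mfcc_range("child_001_S_long.wav"))
print("child 344, [S]:", mfcc_range("child_344_S_long.wav"))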

This leads us to conclude that there is not much variation during the production of these sounds by a particular child, but the same cannot be said if we compare the productions of different children, as in this case they usually have different value ranges. The first characteristic helped us to conclude that we had a bit of a biased scenario with our first option, since different samples of the same sound from the same child are quite similar, and we are using them in both the training and test sets. The second characteristic helps to explain why we had a lower score with the K% children test set split, since it allows the outliers to be sufficiently different from the remaining samples to not be correctly classified. The second reason is that since we are using a 30% test size, it is possible that we are not getting enough distinct data to correctly identify all sound samples, and it is also very likely


Figure 4.8: The box plot of the MFCC values for each coefficient, for sound [S], for two different children.

that we have most of the outlier children in the test set. Figure 4.9 shows a scenario where this is very likely to have occurred. We can see that the validation score decreases a bit when we decrease the test size; this is due to having more children in the training set, meaning a higher probability of having outlier situations in this data set, which leads to a decrease in the validation score. When this occurs, the test score usually increases; in this case we see an increase of over 4% when comparing the 30% and 10% test sizes. The reason for this considerable rise of the test score is the opposite of the one for the validation score: when we have a smaller test size, we have a lower probability of having outlier samples in our test set, which leads to a higher test score. Another thing to notice is that if we compare the


Another thing to notice is that, if we compare the 30% test size model against the 10% one, the latter most likely results in a model that is more robust to less-than-perfect sound productions. We can derive this conclusion by comparison with the other two models' validation scores, which are both higher but show a considerable gap between the validation and test scores, an indication that those models may be a bit overfitted. Also, the model with a test size of only 10% has more children in the training set, so it should be better prepared to deal with completely new sound samples.
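A sketch of how the K% children test set experiment with different test sizes could be set up, assuming scikit-learn. GroupShuffleSplit keeps all samples of a child on one side of the split, which is the property this experiment relies on; the variable names and classifier settings are illustrative, not the exact parameters used in the thesis.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.svm import SVC

def evaluate_split(X, y, child_ids, test_size):
    """Grouped train/validation/test split by child, returning (validation, test) accuracy."""
    X, y, child_ids = map(np.asarray, (X, y, child_ids))

    # Children placed in the test set never appear in the training set.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=0)
    train_idx, test_idx = next(outer.split(X, y, groups=child_ids))

    # A grouped validation split carved out of the training children.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    tr, val = next(inner.split(X[train_idx], y[train_idx],
                               groups=child_ids[train_idx]))

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X[train_idx][tr], y[train_idx][tr])
    return (clf.score(X[train_idx][val], y[train_idx][val]),
            clf.score(X[test_idx], y[test_idx]))

# Compare a 30% and a 10% children test set, as in figure 4.9:
# for ts in (0.3, 0.1):
#     val_acc, test_acc = evaluate_split(X, y, child_ids, ts)
#     print(f"test_size={ts}: validation={val_acc:.2%}, test={test_acc:.2%}")
```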

Figure 4.9: Comparison between the accuracy scores of the validation and test sets, while using different test set sizes.

4.3.5 One Child Out experiment results

The results of this experiment were not as good as expected. The validation score is more or less similar to the results that we previously had, but the test score is always considerably lower than what we obtained with the other methods. After some testing, we found some possible causes for these lower results. We started by listening to the files of each child that produced the lower test scores. What we found is that these particular recordings are lower in volume, have more background noise than other recordings, contain sounds produced by the child that are more strident than normal, or show other variations that fall outside the norm of most of our sound productions. This happens because the One Child Out experiment tries to maximize the validation score. The validation score is always highest when the recordings of one of these outlier children are not in the training set, which, as a consequence, means that the samples from this child are used as the test set. Since the samples in our test set are then just a group of outliers, this leads to a considerably lower test score. This happens because, even though we have a lot of samples, we only have around 90 children, and out of those only a few are these outlier cases. If these few outliers are not present in our training set, our algorithm has no clue how to deal with them, which means that when these cases end up in our test set they lead to these lower scores.


In order to try to validate this theory, we decided to run the One Child Out experiment a few times and, at each run, if the test score was still much lower than our validation score, we removed the child that was in the test set from our data set. In tables 4.3 and 4.4 we can see the results obtained in the multiple runs of the experiment performed for the training of the sound [S].
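The iterative procedure described above can be sketched as follows, assuming scikit-learn's LeaveOneGroupOut to implement the One Child Out splits; the 15% gap threshold, the inner validation split, and the helper names are illustrative choices, not values fixed by the thesis.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GroupShuffleSplit
from sklearn.svm import SVC

def run_one_child_out(X, y, child_ids):
    """One model per held-out child; return (validation, test, child id) of the best model."""
    best = None
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=child_ids):
        # Carve a grouped validation set out of the remaining children.
        inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
        tr, val = next(inner.split(X[train_idx], y[train_idx],
                                   groups=child_ids[train_idx]))
        clf = SVC(kernel="rbf", gamma="scale").fit(X[train_idx][tr], y[train_idx][tr])
        scores = (clf.score(X[train_idx][val], y[train_idx][val]),
                  clf.score(X[test_idx], y[test_idx]),
                  child_ids[test_idx][0])
        if best is None or scores[0] > best[0]:
            best = scores
    return best

def remove_outlier_children(X, y, child_ids, gap=0.15, max_removals=10):
    """Repeat One Child Out, dropping the held-out child while the val/test gap stays large."""
    X, y, child_ids = map(np.asarray, (X, y, child_ids))
    removed = []
    for _ in range(max_removals):
        val, test, child = run_one_child_out(X, y, child_ids)
        if val - test <= gap:        # gap closed: remaining children are not outliers
            break
        removed.append(child)        # treat this child's recordings as outliers
        keep = child_ids != child
        X, y, child_ids = X[keep], y[keep], child_ids[keep]
    return removed, val, test
```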

Table 4.3: Validation and test scores of the multiple runs of the One Child Out experiment, removing the child that was in the test set at the end of each run.

In table 4.3 we can see the results of the best model found and also the average of all models (remember that we are producing one model per child). This difference in scores, mainly between the test score of the best model and the average test score, was what gave us a clue that something else was happening. When we compare these results with all the others obtained before, this is the only case with such a discrepancy between scores. One thing we can observe in this table is that all scores slowly rise over the multiple iterations of our experiment. This is an expected result, since the child removed at each iteration had sound samples with more background noise, or with less-than-perfect sound productions, which harm either the test score or the validation score, depending on the data set in which they are included. After running the algorithm 5 times and deleting the recordings of 5 children, at the sixth run we got the expected results: a validation score of around 90% and a test score of 95%. What happened was that, since we had removed all of the outlier situations from our data set, our validation score improved, and when testing this model against the test child, the samples now present in the test set were perfect examples of the sound production, so our model got very good results, even better than with our other solutions, and using only 2500 samples as opposed to the usual 20000. In table 4.4 we can see the confusion matrix counts that our best model produced. This gives an easy way to understand the main problem with the classification of the test child, which, combined with listening to the recordings of each test child, allowed us to reach some good conclusions. When we test our model against the test child, we are testing it against samples from all four sounds, so we expect different numbers of true positives and true negatives.


Table 4.4: Number of false negatives and false positives of the multiple runs of the One Child Out experiment, removing the child that was in the test set at the end of each run.

For instance, when the total number of samples is 36, we have 9 sound samples of each sound, so we expect 9 true positives and the remaining 27 samples as true negatives, since they belong to the other 3 sounds that this model was not trained to recognize. The same happens when we have 40 sound samples in total: we expect 10 true positives and 30 true negatives. A false negative means that a sample that was supposed to be positive was classified as negative, which means that either our model is a bit overfitted, or the child's production of the [S] sound was not a good representation of it, at least by the standard of the samples used in the training set. When listening to cases where this happened, we found that either the recordings are too low in volume, have more background noise than desired, or the child simply produces the sound a bit differently. The opposite occurs with the false positives, which were supposed to be classified as negative samples. In this case the child produces one of the other sounds in a way that is too similar to the [S] sound, which makes our model consider it a correct production of that sound.
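The expected counts described above can be checked directly from the predictions of one single-phoneme classifier on a held-out child. A minimal sketch, assuming scikit-learn's confusion_matrix; the variable and function names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# y_true: 1 for the child's samples of the target sibilant, 0 for her other sounds.
# y_pred: output of the corresponding single-phoneme classifier on the same samples.
def summarize_test_child(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = tn + fp + fn + tp
    # With 36 samples (9 per sound) we expect 9 true positives and 27 true negatives.
    print(f"{total} samples: TP={tp} (expected {total // 4}), "
          f"TN={tn} (expected {3 * total // 4}), FP={fp}, FN={fn}")
    return fp, fn
```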

It is possible that some of these sound productions should not have been labelled as correct in the first place, in which case these children would not have been present in the data set. This is something that we must try to validate with some SLPs; if they agree that these productions are not correct, then this method would produce very good results without having to remove any children. Another thing to consider is that these scores were obtained using only 2500 sound samples, and they are already higher than the ones obtained with our first option, which used 20000 samples.

For now, these results are not a good representation of our model because we removed all the outliers, which means that we are in a perfect scenario for both the training and test sets, so it is to be expected that we would get good results. The other problem is that these particular results are also not a good representation of how our model will perform with completely new data, which is not always going to be "perfect".


4.4 Discussion

In this chapter we presented the results of our different classifiers and tested multiple options for creating the training and test sets. Regarding the initial classifiers, our multiclass SVM with the RBF kernel always provided the best accuracy scores, followed by QDA and then LDA. In the next iteration we decided to use what we called a single-phoneme SVM, which is essentially one SVM trained to classify only one sound, meaning that we must train four different SVMs in order to recognize our four sibilant sounds. This single-phoneme SVM approach allowed us to fine tune each of the four SVMs to better classify its sound. This brought considerable improvements over our multiclass SVM and provided accuracy test scores above 91% for every sound. These results were obtained using our Naive split, which has the problem of possibly containing samples from the same child in both the training and test sets, which could be influencing our results. In order to test this, we presented two other options for selecting the training and test sets. In the K% children test set, the children are divided into two initial groups: one group is used as the training set and the other as the test set. With this method the samples from one child are only part of one of the two groups; they never appear in both. This method gave us results slightly worse than the Naive split, by only around 3 to 4% when using 30% of the children in the test set, but these results are not biased, since we do not have samples from the same child in the training and test sets. We also tried reducing the test set size to around 10%, in order to train our model with more distinct data, and we were able to almost reach the same results as in the Naive experiment. Our final option was the One Child Out experiment, which brought us results that were not what we expected. The validation score coincided with what was seen in our previous experiments, but the test scores were considerably lower, most of the time more than 15% below the validation scores. When trying to understand what was causing these results, we decided to check the recordings of the child that was being used as the test set. When listening to these recordings, we realized that they were either lower in volume, had more background noise than the rest of the recordings, contained sound productions that were more strident than normal, or showed other variations that fall outside the norm of most of our sound productions. This was causing them to be incorrectly classified, since these samples were initially labelled as correct. We then decided to remove this child from the data set and run our algorithm again. We got similar scores, and when listening to the recordings of this new child, we found that the same thing had occurred: these recordings were also outliers when compared to the remaining data. When all the outliers were removed from the data set, this algorithm brought us our best results so far, with an accuracy test score of 95% for the [S] sound. These results may be somewhat optimistic, since we removed all the outliers that we had in our data set, but if these samples are indeed incorrect, then these scores are not biased and represent the kind of accuracy that we could expect when using a larger data set.

This is a good estimate of the kind of scores that we would get with a larger data set while using a more regular split (like, for example, our K% children test set), because with this option we are using n − 1 children as the training set, n being the number of children in the initial data set. This means that we are using the largest possible amount of distinct data, and then testing our model against the test set. If we had a larger data set, when using our K% children test set option we would see results similar to those obtained here, since there would be enough distinct data in the training set to capture all the small variations in the productions of these sounds, which would then lead to higher accuracy test scores when classifying the corresponding test set. In the end we decided to create two different models, each one with four classifiers, one for each sibilant sound. We used the parameters that got us the best results with the Naive split, and also with the K% children test set. We trained both models with these parameters, using the complete data set, and then serialized both models to files, in order to place them in our server (as explained in section 3.4.2, the models are then deserialized and loaded into the server when it is initialized). We did some testing with both models and did not find any major differences in classification.
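Training the final single-phoneme classifiers on the complete data set and serializing them for the server, as described above, could look like the sketch below. joblib is a common choice for persisting scikit-learn models, but the thesis does not state which serialization library was used, so treat the file format, names, and parameter values as assumptions.

```python
import joblib
from sklearn.svm import SVC

SIBILANTS = ["S", "Z", "s", "z"]

def train_and_serialize(features, labels, params, model_dir="models"):
    """Train one SVM per sibilant on the full data set and write it to disk."""
    for sound in SIBILANTS:
        # labels[sound] is 1 where the sample is the target sibilant, 0 otherwise;
        # params[sound] holds the best hyper-parameters found for that sound.
        clf = SVC(kernel="rbf", **params[sound]).fit(features, labels[sound])
        joblib.dump(clf, f"{model_dir}/svm_{sound}.joblib")

# On the server side, the models are deserialized once at start-up:
# classifiers = {s: joblib.load(f"models/svm_{s}.joblib") for s in SIBILANTS}
```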


Chapter 5

Feedback from the SLPs and children

5.1 Feedback from children

This work was presented at the European Researchers Night (ERN) 2017, at the National Museum of Natural History and Science in Lisbon. This is an event open to the general public that runs simultaneously in several cities in Europe. The goal of ERN is to show the citizens the importance and the impact of science in the development of society, and also to demystify the image of researchers as someone distant and inaccessible in their everyday life. During the event researchers demonstrate and explain their work to the public through activities.

At the ERN 2017 we had the opportunity to show our work to both children and their parents. We had the game set up to allow children to try it, and at the same time explain to parents the importance of these types of games for children that are attending speech therapy (figure 5.1). We used a laptop, a screen and an external microphone.

Around fifty children tried the game. We were able to collect some opinions and to see the children's reactions to the characters and the way of controlling the game. Not all the children who tried the game were attending speech therapy, but our goal here was not to understand if the ASR was correctly classifying all the speech productions or if it contributes to faster speech improvements. We aimed to assess if the graphics, the game characters and the type of interaction appealed to children. While the game was designed for children from five to nine years old, we had children from all ages trying the game, from some very young children to some twelve year old children. All children trying the game at ERN liked the scenarios and characters regardless of their age. They also reacted very well to the type of interaction with the game, that is, controlling the characters only with their voice.


Figure 5.1: Child playing the proposed game during the European Researchers Night 2017.

5.2 Feedback from SLPs

During the development of our solution we were always in contact with multiple SLPs. This allowed us to have impartial feedback about our work, and also to confirm whether our solution was adequate to what they needed. This feedback was very important in order to create a solution that would be useful for both the SLPs and the children. For the SLPs, it provides a simple way of giving children homework to practice the sibilant sounds. Since we did not create a new exercise, but instead adapted one that the SLPs already use, the transition is very simple for both the SLPs and the children: the SLP does not need to explain a new game to each child, and the children are already quite familiar with the exercise, they just need to adapt to the new platform. For children, our solution gives them the opportunity of experiencing the benefits of intensive training, even when they can only attend one speech therapy session per week. This occurs because, with our solution, they can practice their exercises at home in order to increase their weekly training time, which leads to faster improvements and also more motivation, which in turn helps children improve even further. These faster improvements and the extra motivation will have a positive impact on the children's lives.

In an initial phase, we created a simple prototype of the game that was used to demonstrate our proposed scenarios to some SLPs and explain our concept and the main mechanism of the game. We collected feedback from four SLPs that work with children: one with four years of experience, two with eight, and one with thirteen years of experience. All four SLPs agreed that the scenarios were adequate for the aimed age group, and that the game would provide a good way of training the sibilant sounds. Another interesting aspect is that all of them already use some sort of homework, but not on a mobile platform, given the lack of this type of systems for EP. Also, all of them noticed a faster improvement when children did homework, which can probably be attributed to the extra weekly training time.

When asked whether they would rather use our system in their sessions or at home, the results were a bit mixed. One SLP only wanted to use the system during the sessions. The remaining ones were interested in trying it a few times during their sessions first, in order to understand how children react to the system and how the system performs. But they all agreed that if the system is robust enough, and if children react well to it, they would use it as homework. In our opinion, this type of feedback is expected: given the lack of this type of systems for EP, SLPs are not yet completely confident in them. They first need to experiment with the system to understand that using an ASR system combined with the game's visual feedback is a good alternative that reduces the need for adult assistance to give feedback to the child.

After our game was completely developed, we got feedback from one SLP with four years of experience with children. We explained the game concept to her and also performed some demonstrations. The SLP considered that our system is a good way for children to practice the sibilant sounds, and that the scenarios are very interesting for children within the aimed age group. Like some of the SLPs we talked to previously, she would first use our system during the sessions to make sure that everything works as expected, in order not to mislead the children. But she added that once she verified that the game works correctly, she would use it as home training. She already uses other types of homework with her patients, with very good results. Home training with the proposed game would have the advantage of increasing the motivation of the child and reducing the need for adult supervision. The idea of using the movement of the character as a visual cue for the use of the vocal folds was something that she considered very important, since she already tries to incorporate those types of cues when performing the exercises with children. As an additional visual cue, she suggested incorporating the same idea into the background of the scenarios: using more straight lines in the scenarios where the vocal folds are not used, and the opposite where they are.


Chapter 6

Conclusion and future work

Here we proposed a serious game for mobile platforms for correction of sibilant distortion in EP. The game is designed for home training and uses the isolated sibilants exercise. As identified by the project team, there is a lack of systems that focus on the EP sibilants, and even those that do, do not have exercises to practice the isolated sibilants exercise. Another very important aspect is that children should not be restricted to only practice their exercises during their therapy sessions or when their parents have time to supervise them. Instead, children should be able to do these exercises at home, even when parents do not have time, since this would allow them to perform their exercises more frequently, which can lead to faster improvements, thus motivating them even further. Our solution solves both these problems by offering a mobile game with which children can practice their EP sibilant sounds, with the only restriction of having an internet connection. This allows them to perform their exercises nearly anywhere.

With the help of an ASR system that classifies the child speech productions, the game is controlled only by the child's voice. In order to achieve the game goal, the child has to perform the isolated sibilants exercise correctly. In addition, the game gives immediate visual feedback to the child about his speech. This is very important because this way the child knows in real time if he is performing the sound production correctly or not, which allows him to adjust the exercise if needed.

The game is implemented with a client-server architecture. All the complex computation of the ASR system is done on the server, which allows the game to be played even on low end devices, thus allowing more children to benefit from our solution. Our ASR system got very good results, with accuracy test scores of above 91% when using SVMs and MFCCs with our naive split.

We also present other training approaches. Our second option of separating the children in the training and test sets (the K% children test set) gave us some clues that we probably have some outlier children recordings. This is indicated by the considerable variance in the training and test scores, especially when varying the test size. This possibility was further confirmed by our One Child Out experiment, which gave really bad results when we first tested it. Further analysis of the results led us to believe that these low classification test scores could only be explained by some outliers in our recordings. We then decided to find the children in the test set that gave these low test scores and remove them from the data set, repeating this until there was a significant change in the scores. After just a couple of iterations the results were even higher than what we had achieved with our first solution. After listening to the productions of the children that were removed from the data set, we confirmed our initial idea: these children had indeed some sound productions that are a bit different from those of the remaining children.

This particular characteristic of the algorithm made us realize that it can also be used to detect outliers in new data. For instance, if we get new children's sound recordings that are already labelled, we can run our algorithm and see if this difference in scores occurs. If it does, it means that we probably have some outlier recordings that were either incorrectly labelled or are productions of the sounds that are significantly different from the ones we have. Looking at the number of false negatives and false positives, we know exactly which recordings gave us these scores, so it is easy to identify them and listen to them one by one. This opens up different possibilities, like, for instance, using some of our previous algorithms to classify and label new sound samples, instead of listening to all of them and labelling them manually. We can then use our One Child Out technique to check if we have any outliers in our new data set, that is, check if any of the recordings were incorrectly labelled by our algorithm. This type of approach would provide us with a fast way to classify new sound recordings and, at the same time, give us enough confidence in our labelling.

Given that we have one classifier for each of the four sounds, we can also use our system to perform some simple assessment and diagnosis of the sounds the child has more difficulty producing, or of whether the problem is between the voiced and unvoiced sounds, or some sort of articulation problem. For instance, let us assume the child is producing the sound [S]: we take that recording and classify it using our [S] classifier, and the classifier considers that production incorrect. We then take the same recording and classify it using our [Z] classifier; the expected outcome is an incorrect production of the sound. But if our classifier considers this sample correct, it means that the child probably confuses the voiced and unvoiced sounds. In a similar manner, if instead of using our [Z] classifier we used the [s] classifier, and it considered the sound production correct, it would mean that the child has some difficulty with the articulation of the sound, since instead of producing a [S] sound he is actually producing a [s], which is also unvoiced but produced in a different location. This concept could be applied to all four sounds to try to give the SLP a quick idea of the main difficulties of the child.
Even if the child produces all four sounds correctly, we have other metrics, like, for instance, the average of correctly classified samples, which could be used to see which sounds are harder for the child to produce.
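The diagnostic idea described above, namely cross-checking one recording against several single-phoneme classifiers, can be expressed in a few lines. This is a sketch of the reasoning only, assuming the per-sound classifiers already exist; the function name is hypothetical.

```python
def diagnose_S_production(features, classifiers):
    """Guess why an intended [S] production was rejected by the [S] classifier.

    `classifiers` maps each sibilant ("S", "Z", "s", "z") to its trained
    single-phoneme model; `features` are the MFCCs of one recording.
    """
    if classifiers["S"].predict([features])[0] == 1:
        return "correct [S] production"
    if classifiers["Z"].predict([features])[0] == 1:
        # Same place of articulation, but voiced: likely a voicing confusion.
        return "possible confusion between voiced and voiceless sounds"
    if classifiers["s"].predict([features])[0] == 1:
        # Voiceless but alveolar instead of palato-alveolar: articulation issue.
        return "possible articulation (place of production) difficulty"
    return "production not recognized as any trained sibilant"
```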

In the future we need to work on a way to validate our game solution and our ASR system with real users. The best way would be to test our game during some therapy sessions, since this would allow us to use the game with children that really need it, and to see if they understand the base game and the scenarios, and find it motivating. This way of testing would also allow us to have feedback from the SLPs to confirm whether our ASR system is performing as expected. Since we would be testing it with children that are in speech therapy, this would be a good way to understand if our model works correctly for different children, with different problems, and in different phases of their progress. After this test is complete, another interesting way to validate our solution is to give it as homework to some children, selected by the SLPs according to their needs, and see if after some weeks they show any improvements.

As future work we will explore more machine learning algorithms to improve our scores, such as artificial neural networks and multiple deep learning techniques. However, we must keep in mind that we cannot develop a computationally expensive system because of our response time restrictions. Thus the next goal is to balance more complex algorithms to try to improve our ASR system performance, while at the same time maintaining the very low response time that we currently have.


Bibliography

[1] American Speech-Language-Hearing Association (ASHA) - Child Speech and Language. url: http://www.asha.org/public/speech/disorders/ChildSandL.htm (visited on 01/19/2016).
[2] American Speech-Language-Hearing Association (ASHA) - What is Language? What is Speech? url: http://www.asha.org/public/speech/development/language_speech/ (visited on 01/19/2016).
[3] American Speech-Language-Hearing Association (ASHA) - Speech Sound Disorders: Articulation and Phonological Processes. url: http://www.asha.org/public/speech/disorders/SpeechSoundDisorders/ (visited on 07/28/2017).
[4] Articula by Dwitmee. url: http://www.dwitmee.com/ (visited on 01/19/2016).
[5] Articulation Station. url: http://littlebeespeech.com/articulation_station.php (visited on 01/15/2016).
[6] Articulation Test Center. url: http://littlebeespeech.com/articulation_test_center.php (visited on 01/15/2016).
[7] ARTUR - the ARticulation TUtoR. url: http://www.speech.kth.se/multimodal/ARTUR/ (visited on 01/16/2016).
[8] J. Barratt, P. Littlejohns, and J. Thompson. “Trial of intensive compared with weekly speech therapy in preschool children.” In: Archives of Disease in Childhood 67.1 (1992), pp. 106–108.
[9] S. K. Bhogal, R. Teasell, and M. Speechley. “Intensity of aphasia therapy, impact on recovery”. In: Stroke 34.4 (2003), pp. 987–993.
[10] M. Carvalho. “Interactive game for the training of portuguese vowels”. MA thesis. Faculdade de Engenharia da Universidade do Porto, 2008.
[11] M. Danubianu, S.-G. Pentiuc, O. Andrei, S. Marian, N. Ioan, U. Doina, and M. Schipor. “TERAPERS - intelligent solution for personalized therapy of speech disorders”. In: (2009).
[12] G. Denes, C. Perazzolo, A. Piani, and F. Piccione. “Intensive versus regular speech therapy in global aphasia: a controlled study”. In: Aphasiology 10.4 (1996), pp. 385–394.


[13] F. Destombes. “The development and application of the IBM speech viewer”. In: Interactive Learning Technology for the Deaf (1993), pp. 187–196.
[14] M. Diogo, M. Eskenazi, J. Magalhães, and S. Cavaco. “Robust scoring of voice exercises in computer-based speech therapy systems”. In: Signal Processing Conference (EUSIPCO), 2016 24th European. IEEE. 2016, pp. 393–397.
[15] Falar a Brincar. url: https://falarabrincar.wordpress.com/ (visited on 01/16/2016).
[16] Freepik. url: http://www.freepik.com/ (visited on 07/28/2017).
[17] S. Ganapathy, S. Thomas, and H. Hermansky. “Comparison of modulation features for phoneme recognition”. In: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE. 2010, pp. 5038–5041.
[18] A. Grossinho, I. Guimaraes, J. Magalhaes, and S. Cavaco. “Robust phoneme recognition for a speech therapy environment”. In: Serious Games and Applications for Health (SeGAH), 2016 IEEE International Conference on. IEEE. 2016, pp. 1–7.
[19] I. Guimarães. A Ciência e a Arte da Voz Humana. ESSA - Escola Superior de Saúde do Alcoitão, 2007.
[20] P. K. Hall, L. S. Jordan, and D. A. Robin. Developmental apraxia of speech: Theory and clinical practice. Pro Ed, 1993, p. 200.
[21] IBM Speech Viewer 2. url: http://www.speech.cs.cmu.edu/comp.speech/Section1/Aids/speechviewer.html (visited on 01/18/2016).
[22] Info World Magazine. Page: 28. url: https://books.google.pt/books?id=KT0EAAAAMBAJ&lpg=PA28&ots=_zmjCW8rBT&dq=ibm%20speech%20viewer%20release%20date&hl=pt-PT&pg=PA28#v=onepage&q&f=false (visited on 01/16/2016).
[23] S. Kreimer. “Intensive Speech and Language Therapy Found to Benefit Patients with Chronic Aphasia After Stroke”. In: Neurology Today 17.12 (2017), pp. 12–13.
[24] T. Lan, S. Aryal, B. Ahmed, K. Ballard, and R. Gutierrez-Osuna. “Flappy voice: an interactive game for childhood apraxia of speech therapy”. In: Proceedings of the first ACM SIGCHI annual symposium on Computer-human interaction in play. ACM. 2014, pp. 429–430.
[25] Little Bee Speech. url: http://littlebeespeech.com/ (visited on 01/15/2016).
[26] P. Matejka, P. Schwarz, et al. “Analysis of feature extraction and channel compensation in a GMM speaker recognition system”. In: IEEE Transactions on Audio, Speech, and Language Processing 15.7 (2007), pp. 1979–1986.
[27] H. Meinedo, D. Caseiro, J. Neto, and I. Trancoso. “AUDIMUS.media: a Broadcast News speech recognition system for the European Portuguese language”. In: International Workshop on Computational Processing of the Portuguese Language. Springer. 2003, pp. 9–17.


[28] Mobile Operating System Market Share in Portugal, 2016 to 2017. url: http://gs.statcounter.com/os-market-share/mobile/portugal/#yearly-2016-2017-bar (visited on 07/25/2017).
[29] A. V. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy. “A coupled HMM for audio-visual speech recognition”. In: Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on. Vol. 2. IEEE. 2002, pp. II-2013.
[30] T. L. Nwe, S. W. Foo, and L. C. De Silva. “Speech emotion recognition using hidden Markov models”. In: Speech Communication 41.4 (2003), pp. 603–623.
[31] Ortho-Logo-Paedia - OLP. url: http://www.xanthi.ilsp.gr/olp/default.htm (visited on 01/16/2016).
[32] A. Parnandi, V. Karappa, T. Lan, M. Shahin, J. McKechnie, K. Ballard, B. Ahmed, and R. Gutierrez-Osuna. “Development of a Remote Therapy Tool for Childhood Apraxia of Speech”. In: ACM Transactions on Accessible Computing (TACCESS) 7.3 (2015), p. 10.
[33] J. Preston and M. L. Edwards. “Phonological awareness and types of sound errors in preschoolers with speech sound disorders”. In: Journal of Speech, Language, and Hearing Research 53.1 (2010), pp. 44–60.
[34] Z. Rubin and S. Kurniawan. “Speech adventure: using speech recognition for cleft speech therapy”. In: Proceedings of the 6th International Conference on PErvasive Technologies Related to Assistive Environments. ACM. 2013, p. 35.
[35] S. Sharma, D. Ellis, S. Kajarekar, P. Jain, and H. Hermansky. “Feature extraction using non-linear transformation for robust speech recognition on the Aurora database”. In: Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on. Vol. 2. IEEE. 2000, pp. II1117–II1120.
[36] L. D. Shriberg, R. Paul, and P. Flipsen. “Childhood speech sound disorders: From postbehaviorism to the postgenomic era”. In: Speech sound disorders in children (2009), pp. 1–33.
[37] Talker. url: http://speech-trainer.com/children-speech-therapy (visited on 01/15/2016).
[38] C. T. Tan, A. Johnston, K. Ballard, S. Ferguson, and D. Perera-Schulz. “sPeAK-MAN: towards popular gameplay for speech therapy”. In: Proceedings of The 9th Australasian Conference on Interactive Entertainment: Matters of Life and Death. ACM. 2013, p. 28.
[39] The Statistics Portal - Forecast of tablet user numbers in Portugal from 2014 to 2021 (in million users). url: https://www.statista.com/statistics/566416/predicted-number-of-tablet-users-portugal/ (visited on 02/02/2016).


[40] The Statistics Portal - Forecast of the tablet user penetration rate in Portugal from 2014 to 2021. url: https://www.statista.com/statistics/568594/predicted-tablet-user-penetration-rate-in-portugal/ (visited on 02/02/2016).
[41] The Voice Foundation - The Process of Voice. url: http://voicefoundation.org/health-science/voice-disorders/anatomy-physiology-of-voice-production/understanding-voice-production/ (visited on 01/19/2016).
[42] UCL Psychology and Language Sciences - Consonants. url: http://www.phon.ucl.ac.uk/courses/spsci/iss/week6.php (visited on 02/02/2016).
[43] VITHEA - Virtual Therapist for Aphasia treatment. url: https://vithea.l2f.inesc-id.pt/wiki/index.php/Main_Page (visited on 01/16/2016).
[44] VOX CURA - Voice Care Specialists - 10 Common Problems of Singers. url: http://www.voxcura.com/more-at-vox-cura/10-common-problems-of-singers/ (visited on 01/20/2016).

Appendix A

Accepted scientific paper

In this appendix we present the accepted paper submission to ACE 2017 (14th International Conference on Advances in Computer Entertainment Technology), held in London, UK.

A serious mobile game with visual feedback for training sibilant consonants

Ivo Anjos1, Margarida Grilo2, Mariana Ascensão2, Isabel Guimarães2, João Magalhães1, and Sofia Cavaco1

1 NOVA LINCS, Department of Computer Science, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Caparica, Portugal
2 Escola Superior de Saúde do Alcoitão, Rua Conde Barão, Alcoitão, 2649-506 Alcabideche, Portugal
[email protected], {margarida.grilo, mascensao, iguimaraes}@essa.pt, {jm.magalhaes, scavaco}@fct.unl.pt

Abstract. The distortion of sibilant sounds is a common type of speech sound disorder (SSD) in Portuguese speaking children. Speech and language pathologists (SLP) frequently use the isolated sibilants exercise to assess and treat this type of speech errors. While technological solutions like serious games can help SLPs to motivate the children on doing the exercises repeatedly, there is a lack of such games for this specific exercise. Another important aspect is that given the usual small number of therapy sessions per week, children are not improving at their maximum rate, which is only achieved by more intensive therapy. We propose a serious game for mobile platforms that allows children to practice their isolated sibilants exercises at home to correct sibilant distortions. This will allow children to practice their exercises more frequently, which can lead to faster improvements. The game, which uses an automatic speech recognition (ASR) system to classify the child sibilant productions, is controlled by the child's voice in real time and gives immediate visual feedback to the child about her sibilant productions. In order to keep the computation on the mobile platform as simple as possible, the game has a client-server architecture, in which the external server runs the ASR system. We trained it using raw Mel frequency cepstral coefficients, and we achieved very good results with an accuracy test score of above 91% using support vector machines.

1 Introduction

Speech is one of the most important aspects of our life. While regular mistakes are made by children when they are still learning how to speak and learning the language, these mistakes tend to disappear gradually as children grow up. However there are cases in which the mistakes continue even when the child grows older. In these cases the child may have a speech and/or language disorder. While there are many types of speech and language disorders, here we focus on speech sound disorders (SSD). A SSD occurs when a child produces speech sounds incorrectly, after an age at which these mistakes were not supposed to happen [1, 25]. When a child has a SSD, she should be observed by an SLP, who can assess the type and severity of the disorder, and then she should be clinically followed by the SLP to treat the disorder. As part of the treatment to correct the speech errors, SLPs use specific types of speech production exercises that must be performed multiple times in each speech therapy session. The SLP has to find ways to make the repetition of the exercises enjoyable, in order to keep the child motivated. Most times this is accomplished by transforming the exercises into some sort of game that the child can play. This would be relatively simple with a few repetitions, but given the extensive number of repetitions required to achieve speech improvements, keeping the child motivated may be a hard task. To further motivate the child, many SLPs also use some kind of reward system. For instance, the child may be rewarded with a game for the last five minutes of the session if he performs well during the session. Most times this last game also includes some sort of therapy exercise. Whereas traditionally children attend speech therapy sessions once a week, more intensive therapy has been proven to lead to faster improvements [4, 5, 14]. In particular, when the intensive training is done at home (home training) it can increase the overall exercise practicing time substantially. This is particularly useful when children attend speech therapy sessions only once a week, and as a consequence they do not repeat the speech exercises as often as desirable. With home training children can practice the exercises recommended by their SLP whenever they have free time. These two concepts of intensive training (with more frequent sessions per week) and home training are becoming more popular in recent years, and some studies have already proven that this type of training is very beneficial [7, 12, 5]. These studies show considerable improvements when children have more than the regular weekly session. In the cases where children cannot attend more than one session per week, home training has been proved as a good alternative for those extra sessions. The major benefit from intensive training and home training is a faster improvement rate, which can also lead to more motivation. When children are motivated they tend to overcome their problems more easily, so this kind of cycle is beneficial. In spite of all the benefits of home training, this may not be straightforward to implement when the right tools are not available. The problem is how can children practice the speech exercises at home? Who is going to verify if they are doing the exercises correctly? Well, the first obvious choice would be the parents, but, for multiple reasons including lack of time, this is not always possible. This leaves room for a combined clinical and technological solution to emerge.
Fig. 1. Child playing the proposed game comfortably at home.

Some software systems to assist children in training their exercises have already been proposed for several disorders. Nonetheless, some systems like Articulation Station and Falar a Brincar do not have any type of ASR system [2, 8], which leads to the problems that we described above: children always depend on another person other than themselves to train their exercises. This can lead to lower weekly training times, which will lead to slower improvements. There are other systems that have an ASR system or use some sort of sound analysis, like sPeAK-MAN, Star, and ARTUR - the ARticulation TUtoR, that have the goal of helping children with articulation problems [27, 10, 3], Talker [26], Speech Adventure, that was developed for children with cleft palate or lip [23], and Flappy Voice and the Remote Therapy Tool for Childhood Apraxia, which focus on apraxia [15, 21]. All of these systems rely on game-like exercises to keep children motivated and engaged while training, and are developed for the English language. While its main focus is not on speech therapy training, the Interactive Game for the Training of Portuguese Vowels uses an ASR system to distinguish the EP vowels [6]. Vithea is another system developed for EP [30]. It aims to help people with aphasia. This system has a virtual therapist that guides the patient through their exercises and uses an ASR system to validate the exercises. While many native Portuguese children need to attend speech therapy to correct SSDs related to the production of distorted sibilant sounds, there is a lack of software systems to assist with the training of these sounds. As a contribution to fill this gap, we propose a serious mobile game that focuses on the EP sibilant consonants: the serious mobile game for the sibilant consonants. While our game can also be used during the speech therapy sessions, our main goal is to provide a game that can be used for home training and does not require the supervision of parents (figure 1). Thus, our game uses an ASR system to identify if the child is performing the exercise correctly and therefore decrease the need of an adult to monitor the child. The game incorporates the isolated sibilants exercise, which is an exercise used by SLPs to teach children to correctly produce the sibilant consonants. It runs on mobile platforms like tablets, iPads and smartphones, which children usually are keen on using. Thus, we are not only providing means for home training the sibilants, but also motivating children to do the exercises often. The game's main character is controlled in real time by the child's voice and its behaviour gives visual feedback to the child on his speech productions. The game is implemented with a client-server architecture, where the client is the mobile game application and the server has an ASR system that classifies the child speech productions. The Serious Game for the Sustained Vowel Exercise (SVG) [16], previously proposed by some of our team members, and which, like the present work, is also part of the BioVisualSpeech project, uses sound analysis to control the movement of the main character. While at first glance the SVG may look similar to the present work, there are some fundamental differences. The SVG is developed for the sustained vowel exercise, which consists of saying a vowel for a few seconds with approximately constant intensity.
In its current version, that game does not use an ASR system and it does not distinguish the vowel that is being produced. It uses sound analysis to determine the intensity of the speech production and the time, independently of the produced phoneme. The proposed game, on the other hand, is developed for the isolated sibilants exercise. It distinguishes the produced sibilant with the help of an ASR system. On the other hand, the intensity variation of the production is not important for the proposed game. Finally, the SVG is a game to be used during the speech therapy sessions, whereas our game can be used both during the speech therapy sessions and at home. In the remaining sections we explain the sibilants exercise (section 2), we discuss the main characteristics of the proposed game (section 3) and its ASR system (section 5), and the feedback we received from both SLPs and children (section 6).

2 Sibilant consonants and the isolated sibilants exercise

Sibilant sounds are a specific group of consonants, and happen when the air flows through a very narrow channel in the direction of the teeth [11]. This channel can be created in different places, and this creates different types of sibilant sounds. There are two types of sibilant sounds in EP: alveolar and palato-alveolar; these regions can be seen in figure 2. In each of these sibilant types, the vocal folds can be used or not, resulting in a voiced or a voiceless sibilant, respectively. An example of this is the sound [z] (e.g. z in “zip”), which is voiced because it uses the vocal folds, and the sound [s] (e.g. s in “sip”), which does not use the vocal folds and as a result is a voiceless sibilant. These two sounds are alveolar sibilants, so they are produced by creating a narrow channel in the same location, and the only difference between them is the use of the vocal folds.

Fig. 2. Diagram with the names of the main places of articulation [13].

There are four different sibilant consonant sounds in EP: [z] as in zebra, [s] as in snake, [S] as the sh sound in sheep, and [Z] as the s sound in Asia. [z] and [s] are both alveolar sibilants, while [S] and [Z] are palato-alveolar sibilants. Both [z] and [Z] are voiced sibilants, and [s] and [S] are voiceless sibilants. In the words given above as an example, “zip” and “sip”, there is also one interesting characteristic: between these two words only a single sound varies, [z] and [s]. This creates what is called a minimal pair, which is a pair of words where only a single sound varies. Given their similarities, the words in a minimal pair are very easy to confuse and it is common to use one instead of the other. This is a very common SSD, especially in children. It is even more problematic when the minimal pair consists of two sibilants like in our example (the sounds [z] and [s] from “zip” and “sip”), because here the only difference between the two sounds is the use of the vocal folds. Another reason for the choice of the four sibilant sounds named above is that they also appear in many minimal pair words (e.g. “sink” and “zinc”). Even though the sounds [Z] and [S] do not appear in any usual minimal pair words in English, there are some in EP (e.g. “jacto”, “chato”). The most usual sibilant mistakes committed by children are distortion errors. These can consist of (1) exchanging the voiced and voiceless sounds, for example exchanging a [s] that is voiceless for a [z] that is produced in the same location of the vocal tract but is voiced, or (2) exchanging the place of production in the vocal tract, which results in producing another sibilant [22]. When a child has a SSD that influences his production of the sibilant consonants, the SLP usually starts by observing how the child reacts to the hearing and production of the isolated sibilants. The process starts by trying to understand if the child can distinguish the different sibilants when hearing them as isolated sounds. If he can, the next step is to practice the isolated sibilants exercise, which consists of producing a specific sibilant sound for some duration of time. The main goal of this exercise is to teach the child how to distinguish between the different sibilant sounds, and also how to produce each one of them.
Once the child is able to say the sibilant consonants correctly, the SLP then starts asking for multiple productions of the different sibilants, always alternating both the place of production in the vocal tract and the use of the vocal folds, in order to try to understand if the child can always produce the correct sibilant or if sometimes the child switches from one sibilant to another. This process is done with the isolated sibilants exercise and is the basis to detect what the child's problem is. Only after the problem is correctly identified and the child can say the isolated sibilants correctly can they move on to more complex exercises, like using the sibilant sounds inside words and in multiple positions within a word. As it will be discussed in detail in subsequent sections, our solution is to incorporate the isolated sibilants exercise in a computer game that motivates the child to exercise often. An important characteristic of the game is that children do not control the main character with a keyboard or other usual input method; instead the character is controlled only by the child's voice: the child has to do the isolated sibilants exercise correctly in order to make the character move towards a target.

3 Game and architecture

We propose a serious game for mobile platforms for intensive training of the sibilant consonants for correction of distortion errors. The aimed age group of this game is children from five to nine years old, because usually at these ages the regular phonological development exchange errors have already disappeared. The game incorporates the isolated sibilants exercise, which is an exercise frequently used by SLPs during the therapy sessions. While the game can be used in the therapy session, our main purpose is to offer a solution that can be used at home to give children the opportunity of having more intensive training, even when it may be difficult for them to attend more than one therapy session per week.

3.1 System platform and game engine

Our proposal, as already mentioned, is to create a game that children can use to practice the therapy exercises for the sibilant consonants either in the session or at home. Thus, the first decision that had to be made was whether the game was going to be developed for a mobile format, like a tablet or a smartphone, or for laptops and desktop computers. After some research we concluded that the tablet market in Portugal is still expanding (figure 3). In 2014 the share of monthly active tablet users was 42.38% and by 2021 this value is projected to reach 52.07% of the total population [29]. The slower predicted expansion for future years is mainly because almost half of the population already has a tablet and does not need a new one. So this is a good indication that, if the game is developed for tablets, it can reach a very large audience.

Fig. 3. Forecast of number of tablet users in Portugal from 2014 to 2021 (in million users) [28].

Moreover, being developed for a mobile platform, the game allows the child to play comfortably anywhere in the house, and even outside provided there is an internet connection (as it will be explained in section 3.2, the game will need access to the internet). The next challenge was to choose the platform, that is, to decide if the game was going to be developed for Android, iOS or both. Even though Android has a larger market share [18], iOS still has a significant amount of users. Thus our decision was to develop the game for both platforms. Instead of developing the game twice, once for each operating system, we decided to use Unity because this allows the game to be developed only once, while creating executable files for each platform.

3.2 System architecture

Since this game is designed to be used at home, it cannot depend on the SLP to decide if the exercise is being correctly executed, nor on the parents, since many times they are not available. So the best option is to use automatic speech recognition to provide feedback on whether the exercise is executed correctly. The game will run on mobile platforms, but using an ASR system that executes the speech recognition on the mobile platform can be very demanding, which could cause delays in the regular execution of the game. So the best alternative is to use a client-server architecture, where the game sends the sound samples to the server, and the server sends back information on whether the productions were correctly produced (figure 4). This kind of architecture also gives us more flexibility when choosing what to use for the ASR engine, and also allows changing the algorithm at any time without having to change the mobile application.

Fig. 4. Client-server architecture. The client is the mobile application, which captures segments of sound and then sends them to the server. The server extracts the sound features from each segment and uses an ASR system to classify them. Afterwards the server sends the response back to the client, which can then provide the feedback to the child.

Client The client application has to send the sound that the child is producing to the server. In order to do this, the application captures segments of sound of a determined length. When a segment reaches its desired length the client sends it to the server. The client then waits for the server to respond whether the sound production in the segment is correct or not. When the client receives the answer it provides visual feedback to the child through the character's movement. In case of positive feedback this cycle continues until the character reaches its goal. Given this type of connectivity with the server, the client always needs a stable internet connection.

Server When the server receives a new segment of sound it first has to extract its features, and then uses them in the previously trained ASR system in order to get a response on whether the speech segment is correct or not (section 5 discusses the features and algorithms used to classify the child productions). The server then sends this answer back to the client and waits for the next speech segment. To train this ASR system, we used child speech productions of the sibilant sounds (more details in section 4).
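A minimal sketch of the server's classify-and-respond loop, written here with Flask for concreteness; the paper does not specify the web framework, the endpoint name, or the audio payload format, so all of those are assumptions, and the per-segment MFCC aggregation is a simplification of the feature pipeline described in chapter 4.

```python
import io

import joblib
import librosa
from flask import Flask, jsonify, request

app = Flask(__name__)
# Previously trained single-phoneme classifiers, one per sibilant (hypothetical file names).
classifiers = {s: joblib.load(f"models/svm_{s}.joblib") for s in ["S", "Z", "s", "z"]}

@app.route("/classify/<sound>", methods=["POST"])
def classify(sound):
    # The client posts a short audio segment; assumed here to be a WAV payload.
    signal, sr = librosa.load(io.BytesIO(request.data), sr=None)
    # Summarize the segment's MFCCs into one feature vector (a simplification).
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).mean(axis=1)
    correct = int(classifiers[sound].predict([mfccs])[0]) == 1
    return jsonify({"correct": correct})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```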

3.3 Mobile game

As briefly explained above, the proposed game uses the isolated sibilants exercise. In order to play the game the child has to say one of the four sibilant consonants addressed by the game: [z], [s], [S], or [Z]. The game includes four scenarios, one for each of the addressed sibilant sounds (figure 5). These scenarios were created with the help of a visual artist and with images from Freepik [9]. To create the scenarios we took into account the aimed age group of five to nine year old children. Each of the scenarios has a different character, and the game goal is to move it to a specific location. The characters or game goal are related to the addressed sibilant sound, which is very helpful to provide visual cues to the child.


Fig. 5. The four game scenarios. (a) The scenario for the [z] consonant. (b) The scenario for the [s] consonant. (c) The scenario for the [S] consonant. (d) The scenario for the [Z] consonant.

Visual feedback and visual cues In order to control the character, children do not use a keyboard or any other regular input method; instead they use only their voice. Each character responds to a different sibilant sound. To make the character move, the child has to say that specific sibilant correctly. The character's movements give visual feedback to the child about his sound productions. The feedback is positive when the production is correct, in which case the character moves towards the goal. The character keeps moving while the child is producing the sibilant consonant correctly. If the production is not correct, the character stops moving to give negative feedback to the child. In the case of negative feedback, there are two available modes: one in which the exercise has to be repeated from the beginning and another in which the character waits for a correct production to continue moving towards the target. This type of visual feedback is very intuitive for children and also a good way to motivate them to try to say the sibilant consonants correctly. The scenario in figure 5.a is used to train the [z] sibilant. The main character, a bumblebee, moves towards the beehive, which is the game target, while the child is saying the [z] sibilant correctly. The characters or goal of each scenario are always related to a word in EP that starts with the specific sibilant used in the scenario. For instance, a bumblebee was chosen for the [z] sibilant scenario because the EP word for bumblebee starts with a [z] sibilant (zangão). The scenario used to train the [s] sibilant has a snake that moves towards the log while the child is saying the [s] sibilant correctly (figure 5.b), and the scenario used for the [Z] sound has a ladybug that flies towards a flower while the child is saying the [Z] sibilant correctly (figure 5.d). The EP word for snake starts with the [s] sibilant (serpente) and the EP word for ladybug starts with the [Z] sibilant (joaninha). Finally, there is a scenario for the [S] sibilant (figure 5.c). In this case the main character, a boy, has to run away from the rain until the end of the street, which is the game target. The rain comes from a gray rainy cloud that follows the boy while he moves. The boy is able to run away from the rain while the child says the [S] sound correctly. The EP word for rain starts with the [S] sibilant (chuva). Instead of relying only on the SLP explanations, it is important that the game provides visual cues to help the children remember the exercise and the sound that they should produce. An interesting idea was to relate the type of movement of the character with the use of the vocal folds. This idea came from the flying movement of the bumblebee and the fact that the [z] sibilant is a voiced one. We did not want a very complex motion in order not to distract the child, so the movement was simplified to a sinusoidal wave, which can be modified by changing its amplitude and frequency. We applied the same concept to the movement of the ladybug, since the [Z] sound is also voiced. We decided to use a simple straight line movement for the remaining sounds, [s] and [S], for the snake and the boy running away from the rain respectively, because these are voiceless sibilants. This is achieved by using the same code, but with a zero amplitude parameter.

Parameterization of the game

The distance separating the initial position of the main character (starting point) from the target (end point) is related to the speech duration needed to make the character move until it reaches the target. The isolated sibilants exercise can thus be done with shorter or longer productions of the sibilants by varying the distance between the starting point and the end point. This possibility is very useful for SLPs because every child is different and the exercise needs to be adapted to the child's current state or problem.

While developing games for speech therapy, we must be very careful not to restrict the options of the SLPs. For instance, in this case, if the time that a child has to produce the sibilant sound were always the same, this would not suit every child in the same way. So, the best alternative is to give SLPs many options and adjustable parameters, so that they have all the flexibility needed to adjust the exercise to the problems of each child. As a response to these observations, there are several parameters that can be adjusted in all four scenarios (a possible configuration is sketched after this list):

– the starting and end points,
– the time the character takes to move from the starting point to the end point,
– the amplitude and frequency of the movement for the voiced consonants.
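The following Python sketch shows how such a per-scenario configuration could look; the class and field names are hypothetical and only mirror the parameter list above, they are not the game's actual configuration format.

from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    start_point: tuple        # initial position of the main character
    end_point: tuple          # position of the game target
    duration: float           # seconds of correct production needed to reach the target
    amplitude: float = 0.0    # oscillation amplitude (0 for the voiceless sibilants)
    frequency: float = 1.0    # oscillation frequency (ignored when amplitude is 0)

# A longer exercise for the [z] scenario (bumblebee) and a shorter one for [s] (snake).
long_z = ScenarioConfig(start_point=(0, 100), end_point=(900, 100), duration=8.0, amplitude=30, frequency=2)
short_s = ScenarioConfig(start_point=(0, 100), end_point=(400, 100), duration=3.0)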

Implementation details - modularity and extensibility

Another very important characteristic of these scenarios is that, given the client-server system architecture, each one can be individually built and animated, since they are all completely independent from each other. To accomplish this, they all share three main classes, which are responsible for recording the microphone input, managing the connection to the server, and creating the movement of the character.

The main game cycle of each scenario is processed as follows. The microphone class is always recording the input sound; when this recording reaches a certain length, it is sent to the server through the helper class that manages the connections to the server, and the response from the server is then used to compute the current state of the game: proceed (if the child is producing the correct sound) or stop (if the child is not producing the sibilant sound correctly). If the child is producing the correct sound, the movement class is used to produce the animation that moves the character towards the target, and the main cycle continues until the character reaches the target. If the child does not produce the correct sound, the game finishes and the main cycle stops its execution. A simplified version of this cycle is sketched below.

With this base main cycle and with the help of the three main classes, it is very simple to add other scenarios for these sounds, and even for new sounds. In the latter case, in order to have the game characters moving as a response to the added type of sounds, the game's ASR system would need to be retrained to recognize the new class of sounds. The retrained ASR system would be updated in the server, but the base code would not need any changes. More complex scenarios can also be built on this base, as proven by our scenario for the [S] sound, which has two moving characters (the boy and the rain) and uses the same logic for the base main game cycle, also relying on our three main helper classes, with some additional code to control the movement of the rain.
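A minimal sketch of this cycle is given below, assuming three hypothetical helper objects (recorder, server connection, and movement) whose names and methods are our own; it is not the actual game code.

CHUNK_SECONDS = 0.5  # assumed length of each audio block sent to the server

def game_cycle(recorder, server, movement):
    """Move the character while the server keeps confirming correct productions."""
    recorder.start()                            # microphone class records continuously
    while not movement.reached_target():
        chunk = recorder.read(CHUNK_SECONDS)    # latest block of recorded audio
        if server.classify(chunk):              # connection class queries the ASR server
            movement.advance(CHUNK_SECONDS)     # correct sound: animate towards the target
        else:
            # incorrect sound: stop the cycle (the alternative "wait" mode
            # described earlier would keep looping until a correct production)
            break
    recorder.stop()
    return movement.reached_target()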

4 Sound data

In order to train the ASR system, we collected data in three schools in the Lisbon area. There were 90 children participating in the recordings, with ages between 4 and 11 (table 1.a). The recordings were made using a dedicated microphone and DAT equipment (the setup used can be seen in figure 6). While we tried to make the recordings in a quiet room, there was noise coming from the outside, especially during recess. We collected eight different samples from each child, which correspond to short and long repetitions of the sibilant sounds [S], [Z], [s], and [z]. Table 1.b shows the number of samples from each sound. There was always an SLP participating in the recordings, who gave all the necessary indications to the children.

Fig. 6. The setup used for the recordings. On the left there is one of the DAT devices used, and on the right the microphone, with acoustic foam around it.


Table 1. Recorded data. (a) Number of children who performed the recordings. (b) Number of samples from each sound.

The sound productions from each child were recorded in a single file. This means that every file includes the sibilant sounds and also every indication from the SLP. In order to be able to use this data, every file had to be split into separate samples, one for each of the eight child productions that it contains. To automate this process, we created an algorithm that measures the energy at the beginning of the file (to estimate the sound level of the no-speech signal) and then identifies the peaks of energy, which have a high probability of being regions containing speech. In this way the files are automatically split into smaller files that contain single productions of sound (a sketch of this procedure is shown below). Afterwards we still needed to listen to each of these samples, to identify which files contained the child sibilant productions, and delete the remaining files (SLP speech, or noises). When all the files were split, we obtained a total of around 2600 smaller files, of which 842 were the child sibilant productions we are interested in.
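The following Python sketch illustrates this energy-based splitting procedure, assuming NumPy and a mono waveform already loaded into memory; the thresholds, frame length and function name are illustrative assumptions, not the exact values we used.

import numpy as np

def split_on_energy(signal, sr, frame_ms=25, noise_s=1.0, factor=4.0, min_len_s=0.3):
    """Split a recording session into candidate speech segments.

    The noise floor is estimated from the first `noise_s` seconds (assumed to
    contain no speech); frames whose energy exceeds `factor` times that floor
    are kept and merged into contiguous segments.
    """
    frame = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame
    energy = np.array([np.sum(signal[i * frame:(i + 1) * frame] ** 2) for i in range(n_frames)])
    noise_floor = energy[: int(noise_s * 1000 / frame_ms)].mean()
    active = energy > factor * noise_floor

    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                        # a peak of energy begins
        elif not is_active and start is not None:
            if (i - start) * frame / sr >= min_len_s:
                segments.append((start * frame, i * frame))  # sample indices of one production
            start = None
    return segments  # each (begin, end) pair is later saved as its own smaller file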

5 Automatic recognition of isolated sibilant consonants

An important characteristic of our game is the visual feedback given to the child through the movement of the main character. In order to provide useful visual feedback, the game must react almost instantly to the child's voice. Given the inherent communication delay of the client-server architecture, we have to choose the classification algorithm carefully, in order to reduce as much as possible the time it takes to classify a group of samples. This excludes certain kinds of algorithms, such as lazy learners, which usually take more time to classify samples: they do not have a previous training phase to learn a model, but instead delay the learning until a request is made to the system, and since their results are always based on a local feature space, this process happens at every request. The game's time restrictions also require avoiding algorithms that are more complex and take more time to compute an answer.

Given the game's time restrictions, we selected the following classifiers for this study: support vector machines (SVM) with a radial basis function (RBF) kernel, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). An important characteristic of these classifiers is that all three of them support multiclass classification. Given that we are trying to distinguish between the four different sibilant sounds, this is a simple and effective way to do it. Since all three classifiers share this characteristic, it also gave us an easy way of comparing the algorithms and features being tested.
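For reference, the three classifiers compared in this study can be instantiated as below; the use of scikit-learn here is only an illustrative assumption, all three models handle the four-class problem directly.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.svm import SVC

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "SVM (RBF)": SVC(kernel="rbf"),   # multiclass classification handled internally
}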

5.1 Feature vectors

We use the raw Mel frequency cepstral coefficients (MFCC) as features for all three classifiers. As an illustration, figure 7 shows the MFCC matrix obtained for a [z] sound. Our feature vectors consist of columns of the MFCC matrix. MFCCs are commonly used in ASR systems and, as will be seen below, we obtained very good results with the combination of these features and classifiers.

Fig. 7. The MFCC matrix with 13 coefficients, obtained for a [z] sound.

While historically the number of MFCCs used is thirteen, there are also studies that use different numbers of coefficients. In particular, Carvalho et al. had good results using different numbers of coefficients to classify isolated phonemes (the EP vowels) [6]. We trained the multiclass SVM classifier with different numbers of MFCCs, ranging from 5 to 25 coefficients, as seen in figure 8. The accuracy test score is very low when using only 5 MFCCs, but it rapidly increases with the use of more coefficients. This basically means that with a small number of coefficients our machine learning algorithm does not have enough distinct data to correctly separate the four different sibilant sounds, but with more data (more coefficients) the four sounds can be more easily separated. The increase in score starts to slow down at around 13 MFCCs, and above that the differences between scores can be easily attributed to the margin of error of the training of each classifier. So we decided to use 13 coefficients, since our results did not show any particular improvement when using more than 13 coefficients. This result is also in line with most of the literature, which usually uses around 12 to 13 MFCCs [19, 24, 17, 20]. A sketch of this feature extraction is given below.

Fig. 8. Accuracy test scores of the multiclass SVM classifier with RBF kernel, using different numbers of MFCCs.
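As an illustration of the feature extraction, the sketch below computes a 13-coefficient MFCC matrix and returns its columns as feature vectors; it assumes librosa and a hypothetical file name, and is not necessarily the implementation we used.

import librosa

def mfcc_feature_vectors(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)                       # keep the original sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # matrix of shape (n_mfcc, n_frames)
    return mfcc.T                                             # one 13-dimensional vector per frame (column)

vectors = mfcc_feature_vectors("child_042_z_long.wav")        # hypothetical file name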

5.2 Classification results

We trained the LDA, QDA and multiclass SVM with the MFCC feature vectors and the data described in section 4. As stated previously, these are all multiclass classifiers, which means that we can easily train them on the same data set and get comparable results. The three classifiers were trained in the following way:

– Once we computed the MFCC matrices for all sounds, we built the feature vectors by selecting columns from these matrices. Given the complexity of SVM training, we did not use the whole matrices. Instead, we selected around 20000 columns by choosing 66 random samples of each sound for each child. These vectors were labeled with the corresponding sibilant. We use one class for each sibilant, so we have 4 different classes.
– We then used stratified sampling on this set of data to obtain the training and test sets, using a test set size of 30%.
– We also used stratified 5-fold cross validation within our training set.

Fig. 9. Accuracy test scores of the three multiclass classifiers.

Figure 9 shows the accuracy test scores of these three multiclass classifiers. As can be observed in the figure, QDA performs better than LDA, which is expected since QDA is not limited to a linear space when trying to separate all four classes, in contrast with LDA, which can only learn linear boundaries. Even though QDA gives us considerably better scores than LDA, our multiclass SVM classifier has a test score almost 8% higher than QDA, meaning it is around 8% more accurate and produces fewer errors in the classification of the four different sounds.

Since the multiclass SVM with the RBF kernel gave the best results by a considerable margin, we focused on this classifier to try to achieve even better results. The multiclass SVM can lose accuracy on some sounds, due to having to adjust to all four classes at the same time. So we replaced the multiclass SVM with four SVMs, one for each class. The idea was to create one SVM classifier, also with the RBF kernel, to classify each of the four different sibilant classes. This way we could fine tune each one of the four classifiers, which here we call single-sibilant SVMs, in order to improve our classification scores. Each single-sibilant SVM was trained with the same method described above; the only change was that, for each classifier, the labels of the data were changed to true or false depending on the sound being trained. For instance, when we trained the classifier for class [s], all samples from sibilant [s] were labeled true, and the remaining samples were labeled false. Doing this for each sound let us achieve our goal of one classifier per sound. The parameters of each classifier were then fine tuned to try to achieve the best accuracy scores for each of the four classes. A sketch of this training protocol is given below.

Fig. 10. Accuracy test scores of the multiclass SVM, and four single-sibilant SVMs (both with the RBF kernel) for all four classes.

Figure 10 shows the results obtained with this approach. The multiclass SVM has the same score for all sounds, since it is always the same classifier and the score is an average of all the correct labels for all classes. The four single-sibilant SVMs approach brought us better results in the classification of all four sounds, by a margin of around 8%. In addition, it can be observed that these are quite high accuracy test scores (above 91%).
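The sketch below illustrates this training protocol for one single-sibilant SVM (binary labels, stratified 70/30 split, and stratified 5-fold cross validation for parameter tuning); scikit-learn, the parameter grid and the function name are assumptions for illustration only.

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

def train_single_sibilant_svm(X, y, sibilant):
    """X: MFCC column vectors; y: sibilant labels ('s', 'z', 'S', 'Z')."""
    y_bin = (np.asarray(y) == sibilant)                       # true only for the target sibilant
    X_tr, X_te, y_tr, y_te = train_test_split(X, y_bin, test_size=0.3, stratify=y_bin, random_state=0)

    # Fine tune C and gamma with stratified 5-fold cross validation on the training set.
    grid = GridSearchCV(SVC(kernel="rbf"),
                        param_grid={"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.1]},
                        cv=StratifiedKFold(n_splits=5))
    grid.fit(X_tr, y_tr)
    return grid.best_estimator_, grid.score(X_te, y_te)       # accuracy on the held-out 30%

# One classifier per sibilant class:
# single_sibilant_svms = {s: train_single_sibilant_svm(X, y, s) for s in ["s", "z", "S", "Z"]}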

6 Feedback from potential users

As future work we will validate our game solution and our ASR system with real users. In a first phase we are planning to validate it during speech therapy sessions, since this will allow SLPs to confirm that our ASR system is performing as expected. After this test is complete, another interesting way to validate our solution is to assign it as homework to some children, selected by the SLPs according to their needs, and see if using the game at home contributes towards faster improvements.

In the meantime we were able to collect feedback from both SLPs and children. While this was not a formal validation, their feedback, which is discussed below, was important to assess the potential of the game.

6.1 Feedback from SLPs

We were always in contact with multiple SLPs to have impartial feedback about our work. In an initial phase, we created a simple prototype of the game that was used to demonstrate our proposed scenarios to some SLPs and explain the mechanics of the game. We collected feedback from four SLPs who work with children: one with four years of experience, two with eight, and one with thirteen years of experience. All four SLPs agreed that the scenarios were adequate for the target age group, and that the game would provide a good way of training the sibilant sounds. Another interesting aspect is that all of them already use some sort of homework, but not on a mobile platform, given the lack of this type of system for EP. Also, all of them noticed faster improvement when children did homework, which can probably be attributed to the extra weekly training time.

When asked if they would rather use our system in their sessions or at home, the results were a bit mixed. There was one SLP who only wanted to use the system during the sessions. The remaining ones were interested in trying it a few times during their sessions first, in order to understand how children react to the system, and also how the system performs. But they all agreed that if the system is robust enough, and if children react well to it, they would use it as homework. In our opinion, this type of feedback is expected: given the lack of this type of system for EP, SLPs are not completely confident in them. They first need to experiment with the system to understand that using an ASR system combined with the game's visual feedback is a good alternative that reduces the need for adult assistance to give feedback to the child.

After our game was completely developed, we got feedback from one SLP with four years of experience with children. We explained the game concept to her and also performed some demonstrations. The SLP considered that our system is a good way for children to practice the sibilant sounds, and that the scenarios are very interesting for children within the target age group. Like some of the SLPs we talked to previously, she would first use our system during the sessions to make sure that everything is working as expected, in order not to mislead the children. But she added that once she verified that the game works correctly, she would use it as home training. She already uses other types of homework with her patients, with very good results. Home training with the proposed game would have the advantage of increasing the motivation of the child and reducing the need for adult supervision. The idea of using the movement of the character as a visual cue for the use of the vocal folds was something that she considered very important, since she already tries to incorporate those types of cues when performing the exercises with children. As an additional visual cue, she suggested incorporating the same idea into the background of the scenarios: basically, using more straight lines in the scenarios where the vocal folds are not used, and the opposite when the vocal folds are used.

6.2 Feedback from children

This work was presented at the European Researchers Night (ERN) 2017, at the National Museum of Natural History and Science in Lisbon. This is an event open to the general public that runs simultaneously in several cities in Europe. The goal of ERN is to show citizens the importance and the impact of science in their everyday life and in the development of society, and also to demystify the image of researchers as someone distant and inaccessible. During the event, researchers demonstrate and explain their work to the public through activities.

At the ERN 2017 we had the opportunity to show our work to both children and parents. We had the game set up to allow children to try it, and at the same time we explained to parents the importance of these types of games for children who are attending speech therapy (figure 11). We used a laptop, a screen and an external microphone.

Fig. 11. Child playing the proposed game during the European Researchers Night 2017.

Around fifty children tried the game. We were able to collect some opinions and to see the children's reactions to the characters and to the way of controlling the game. Not all children who tried the game were attending speech therapy, but our goal here was not to understand if the game was classifying the speech productions correctly or if it contributes to faster speech improvements. We only aimed to assess whether the graphics, characters and type of interaction with the game appealed to the children. We had children of all ages trying the game, from some very young children to some twelve year old children. While the game was designed for children from five to nine years old, all children trying the game at ERN liked the scenarios and characters regardless of their age. They also reacted very well to the type of interaction with the game, that is, controlling the characters only with their voice.

7 Conclusion and future work

Here we proposed a serious mobile game for the sibilant consonants: a serious game for mobile platforms for the correction of sibilant sound distortions in EP. The game is designed for home training and uses the isolated sibilants exercise. As identified by our team, there is a lack of systems that focus on EP, and even those that do exist do not include the isolated sibilants exercise. Another very important aspect is that children should not be restricted to practicing their exercises only during their therapy sessions or when their parents have time to supervise them. Instead, children should be able to do these exercises at home, even when their parents do not have time, since this would allow them to perform the exercises more frequently, which can lead to faster improvements, thus motivating them even further. Our solution addresses both these problems by offering a mobile game with which children can practice their EP sibilant sounds.

The game is implemented with a client-server architecture in which all the complex computation is done on the server. In this way the game can be played even on low end devices, with the only restriction of having an internet connection. This allows more children to benefit from our solution and perform their exercises nearly anywhere. With the help of an ASR system that classifies the child's speech productions, the game is controlled only by the child's voice. In order to achieve the game goal, the child has to perform the isolated sibilants exercise correctly. In addition, the game gives immediate visual feedback to the child about his speech.

Our ASR system achieved very good results, with accuracy test scores above 91% when using SVMs and MFCCs. As future work we will explore more machine learning algorithms to improve these scores, such as hidden Markov models and artificial neural networks. However, we must keep in mind that we cannot develop a computationally expensive system because of our response time restrictions. Thus the next goal is to balance more complex algorithms against these restrictions, trying to improve the performance of our ASR system while maintaining the very low response time that we currently have.

While a formal validation is planned as future work, we were able to gather feedback from both SLPs and children. The feedback from children confirmed that children of the target age group liked the characters and scenarios, and that they also reacted well to the way of controlling the character with their voices. The SLPs that we contacted showed interest in our solution, and consider it a good way for children to train the sibilant sounds at home. They all consider that this type of visual feedback is a good way to let children know whether they are producing the sounds correctly, and that it will probably be more motivating than the regular exercises they do in their speech therapy sessions. Nonetheless, almost all of them would first like to try the game in their sessions, to check if our ASR system is accurate and if children react well to the game.

Acknowledgments

This work was supported by the Portuguese Foundation for Science and Technology under projects BioVisualSpeech (CMUP-ERI/TIC/0033/2014) and NOVA-LINCS (PEest/UID/CEC/04516/2013). We thank the SLPs Diana Lança and Catarina Duarte for their availability and feedback. We also thank all the 3rd and 4th year SLP students from Escola Superior de Saúde do Alcoitão who collaborated in the data collection task. Many thanks also to Inês Jorge for the graphic design of the game scenarios. Finally, we would like to thank the schools from Agrupamento de Escolas de Almeida Garrett, and all the children who participated in the recordings.

References

[1] American Speech-Language-Hearing Association (ASHA) - Speech Sound Disorders: Articulation and Phonological Processes. url: http://www.asha.org/public/speech/disorders/SpeechSoundDisorders/ (visited on 07/28/2017).
[2] Articulation Station. url: http://littlebeespeech.com/articulation_station.php (visited on 01/15/2016).
[3] ARTUR - the ARticulation TUtoR. url: http://www.speech.kth.se/multimodal/ARTUR/ (visited on 01/16/2016).
[4] Jean Barratt, Peter Littlejohns, and Julie Thompson. “Trial of intensive compared with weekly speech therapy in preschool children.” In: Archives of Disease in Childhood 67.1 (1992), pp. 106–108.
[5] Sanjit K Bhogal, Robert Teasell, and Mark Speechley. “Intensity of aphasia therapy, impact on recovery”. In: Stroke 34.4 (2003), pp. 987–993.
[6] Mara Inês Pires Carvalho et al. “Interactive game for the training of portuguese vowels”. In: (2012).
[7] Gianfranco Denes et al. “Intensive versus regular speech therapy in global aphasia: a controlled study”. In: Aphasiology 10.4 (1996), pp. 385–394.
[8] Falar a Brincar. url: https://falarabrincar.wordpress.com/ (visited on 01/16/2016).
[9] Freepik. url: http://www.freepik.com/ (visited on 07/28/2017).
[10] Sriram Ganapathy, Samuel Thomas, and Hynek Hermansky. “Comparison of modulation features for phoneme recognition”. In: Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE. 2010, pp. 5038–5041.
[11] Isabel Guimarães. Ciência e Arte da Voz Humana. Escola Superior de Saúde de Alcoitão, 2007.
[12] Penelope K Hall, Linda S Jordan, and Donald A Robin. Developmental apraxia of speech: Theory and clinical practice. Pro Ed, 1993, p. 200.
[13] Image Quiz: Vocal Tract. url: http://www.imagequiz.co.uk/img?img_id=ag5zfmltYWdlbGVhcm5lcnIQCxIHUXVpenplcxifotErDA (visited on 07/30/2017).
[14] Susan Kreimer. “Intensive Speech and Language Therapy Found to Benefit Patients with Chronic Aphasia After Stroke”. In: Neurology Today 17.12 (2017), pp. 12–13.
[15] Tian Lan et al. “Flappy voice: an interactive game for childhood apraxia of speech therapy”. In: Proceedings of the first ACM SIGCHI annual symposium on Computer-human interaction in play. ACM. 2014, pp. 429–430.
[16] Marta Lopes, João Magalhães, and Sofia Cavaco. “A voice-controlled serious game for the sustained vowel exercise”. In: Proceedings of the 13th International Conference on Advances in Computer Entertainment Technology. ACM. 2016, p. 32.
[17] Pavel Matejka, Petr Schwarz, et al. “Analysis of feature extraction and channel compensation in a GMM speaker recognition system”. In: IEEE Transactions on Audio, Speech, and Language Processing 15.7 (2007), pp. 1979–1986.
[18] Mobile Operating System Market Share in Portugal, 2016 to 2017. url: http://gs.statcounter.com/os-market-share/mobile/portugal/#yearly-2016-2017-bar (visited on 07/25/2017).
[19] Ara V Nefian et al. “A coupled HMM for audio-visual speech recognition”. In: Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on. Vol. 2. IEEE. 2002, pp. II–2013.
[20] Tin Lay Nwe, Say Wei Foo, and Liyanage C De Silva. “Speech emotion recognition using hidden Markov models”. In: Speech Communication 41.4 (2003), pp. 603–623.
[21] Avinash Parnandi et al. “Development of a Remote Therapy Tool for Childhood Apraxia of Speech”. In: ACM Transactions on Accessible Computing (TACCESS) 7.3 (2015), p. 10.
[22] Jonathan Preston and Mary Louise Edwards. “Phonological awareness and types of sound errors in preschoolers with speech sound disorders”. In: Journal of Speech, Language, and Hearing Research 53.1 (2010), pp. 44–60.
[23] Zak Rubin and Sri Kurniawan. “Speech adventure: using speech recognition for cleft speech therapy”. In: Proceedings of the 6th International Conference on PErvasive Technologies Related to Assistive Environments. ACM. 2013, p. 35.
[24] Sangita Sharma et al. “Feature extraction using non-linear transformation for robust speech recognition on the Aurora database”. In: Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on. Vol. 2. IEEE. 2000, pp. II1117–II1120.
[25] Lawrence D Shriberg, R Paul, and P Flipsen. “Childhood speech sound disorders: From postbehaviorism to the postgenomic era”. In: Speech sound disorders in children (2009), pp. 1–33.
[26] Talker. url: http://speech-trainer.com/children-speech-therapy (visited on 01/15/2016).
[27] Chek Tien Tan et al. “sPeAK-MAN: towards popular gameplay for speech therapy”. In: Proceedings of The 9th Australasian Conference on Interactive Entertainment: Matters of Life and Death. ACM. 2013, p. 28.
[28] The Statistics Portal - Forecast of tablet user numbers in Portugal from 2014 to 2021 (in million users). url: https://www.statista.com/statistics/566416/predicted-number-of-tablet-users-portugal/ (visited on 02/02/2016).
[29] The Statistics Portal - Forecast of the tablet user penetration rate in Portugal from 2014 to 2021. url: https://www.statista.com/statistics/568594/predicted-tablet-user-penetration-rate-in-portugal/ (visited on 02/02/2016).
[30] VITHEA - Virtual Therapist for Aphasia treatment. url: https://vithea.l2f.inesc-id.pt/wiki/index.php/Main_Page (visited on 01/16/2016).