Mobile Alert: Combining Human Motion Detection and Voice Analysis

Telma Cristóvão de Oliveira

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisor: Prof. Paulo Lobato Correia Co-Supervisor: Prof. Isabel Maria Martins Trancoso

Examination Committee

Chairperson: Prof. José Eduardo Sanguino

Supervisor: Prof. Paulo Lobato Correia

Co-Supervisor: Prof. Isabel Maria Martins Trancoso

Member of Committee: Prof. José Miguel de Oliveira Monteiro Sales Dias

Member of Committee: Prof. Daniel Jorge Viegas Gonçalves

April 2015

To my parents.

Acknowledgements

I would like to thank my parents for their unconditional love and support; they always believed in me and always helped me with everything within their reach. To my brother, for his friendship.

To Tiago Leite, who was always present and always helped me overcome the many difficulties throughout my academic life and beyond. Thank you for everything.

To Professor Rui Crespo, who was more than a tutor, a great friend who helped me become more confident and overcome many psychological barriers that for a long time held back my progress in the degree.

To my friends Ana Rita Coelho, João Gonçalves Pereira, Marta Quintão, Miguel Almeida, Teresa Lopes and Ivo Anjo, who at different moments of this long journey stood by my side and gave me the strength to keep going.

To Filipe Tente, Cláudia Mendes, Filipe Teixeira and Gonçalo Silva, thank you for the countless laughs and good spirits that made everything easier!

To my friends from the Confraria de Amigos do NEEC (Núcleo de Estudantes de Electrotecnia e Computadores do IST), who welcomed me with open arms in 2010 and continue to support me 5 years later!

To my CSS MACH FY15 colleagues at Microsoft Portugal, who were an enormous help in collecting the audio samples I needed for this work.

To Professors Paulo Lobato Correia and Isabel Trancoso, for their patience and support throughout this work.

Thank you all.

Abstract

In developed countries, societies are ageing. Due to better health care and better overall living conditions, the average life expectancy of an individual tends to increase, and the fraction of elderly population tends to grow. These societies need to adapt to this reality by providing elderly citizens with ways to assure a certain quality of life and to extend their autonomy. Taking into account the huge growth of the smartphone market, chances are that most people who already own a mobile phone will own a smartphone in the future, if they do not already. This project aims to take advantage of a device that is already present in everyday life, the smartphone, and exploit its multiple built-in sensors to create a mobile monitoring system. This study aims to assess the possibility of having a monitoring system that detects both falls and screams in a distress situation involving an elderly user, and to put into perspective how such a system would work.

Keywords

Monitoring system, elderly, scream detection, fall detection


Resumo

As sociedades dos países desenvolvidos estão a envelhecer. Devido a melhores cuidados de saúde e melhores condições em geral, o tempo médio de vida dos indivíduos está a aumentar. A fracção idosa da população está também a aumentar. Estas sociedades têm que se adaptar a esta realidade e criar meios de garantir aos seus idosos uma certa qualidade de vida e um prolongamento da sua autonomia. Levando em consideração o enorme crescimento do mercado dos smartphones, quem tem um telemóvel hoje em dia possivelmente tem um smartphone, e quem ainda não tem um smartphone é possível que no futuro venha a ter. Este projecto pretende utilizar o que já está presente no dia-a-dia das sociedades, o smartphone, e aproveitar os múltiplos sensores presentes nestes aparelhos para criar um sistema de monitorização. Este estudo tem por objectivo avaliar a possibilidade de se ter um sistema de monitorização que detecte quedas e gritos, procurando facilitar a assistência aos cidadãos idosos em situações de risco. Este estudo pretende também exemplificar como poderá funcionar este tipo de sistema.

Palavras-chave

Sistema de monitorização, idosos, detecção de gritos, detecção de quedas


Table of Contents

Acknowledgements ...... v

Abstract ...... vii

Resumo ...... viii

Table of Contents ...... ix

List of Figures ...... xi

List of Tables ...... xii

1 Introduction ...... 1
1.1 Overview ...... 2
1.2 Motivation ...... 3
1.3 Goals ...... 3
1.4 Contributions ...... 4
1.5 Structure ...... 4

2 State of the Art ...... 6
2.1 Current Commercial Solutions ...... 7
2.1.1 Emergency Monitoring as a Service ...... 7
2.1.2 Video Monitoring ...... 8
2.1.3 GPS Location Monitoring ...... 9
2.1.4 Smartphone Applications as Monitoring Systems ...... 10
2.2 Fall Detection Techniques ...... 12
2.2.1 Low-Complexity Fall Detection Algorithm ...... 13
2.3 Voice Detection Techniques ...... 15
2.3.1 Speech Production and Perception ...... 15
2.3.2 Voice Activity Detection – Short-Time Energy-Based ...... 16
2.3.3 Feature Extraction and Classification ...... 17
2.3.4 Pitch Detection ...... 22

3 Proposed System ...... 24
3.1 Scope ...... 25
3.2 Scenarios and Target Users ...... 25
3.3 Fall Detection ...... 27
3.4 Scream Detection Relying on Pitch Analysis ...... 28

4 Implementation and Discussion ...... 31
4.1 Implementation and Testing ...... 32
4.2 Scream Detection Approaches ...... 32
4.2.1 Voice Activity Detection ...... 33
4.2.2 MFCC and GMM Approach ...... 34
4.2.3 Pitch Analysis Approach ...... 34
4.3 Fall Detection Implementation ...... 35
4.3.1 Testing ...... 36
4.4 Conclusions ...... 38

5 Future Work ...... 39
5.1 Where to Improve ...... 40
5.2 Future Applications ...... 40

References...... 42


List of Figures

Figure 1 - Projection for the growth of the population over 65 years-old in USA (adapted from [1]) ...... 2
Figure 2 - Projection for the growth of the population over 65 years-old in Europe (adapted from [2]) ...... 2
Figure 3 - Projection for the growth of the population over 65 years-old in Japan (adapted from [3]) ...... 3
Figure 4 - ADT Medical devices: wristband, pendant and two-way voice intercommunication device. [17] ...... 8
Figure 5 - ADT system overview [17]. ...... 8
Figure 6 - FamilyLink network model. [18] ...... 9
Figure 7 - How GPS tracking works. [19] ...... 10
Figure 8 - Real-world example that illustrates common components between falls. [14] ...... 14
Figure 9 - Vocal tract diagram. [27] ...... 15
Figure 10 - Block diagram of a VAD implementation [29]. ...... 16
Figure 11 - Cepstrum analysis block diagram ...... 18
Figure 12 - a) Input signal; b) Cepstral output; c) MFCC output. ...... 19
Figure 13 - MFCC calculation block diagram. ...... 19
Figure 14 - Mel-frequency filter bank example [28]. ...... 20
Figure 15 - Proposed algorithm for fall detection in the context of the application. ...... 28
Figure 16 - Regular speech followed by a small scream while using a regular microphone in ideal conditions, i.e., no obstruction to sound capture. ...... 29
Figure 17 - Pitch detection on a sentence. ...... 29
Figure 18 - Pitch detection on a scream. ...... 29
Figure 19 - Diagram of the proposed scream detector based on pitch analysis. ...... 30
Figure 20 - System integration overview. ...... 32
Figure 21 - Speech captured from a smartphone inside a pocket while walking. The noise identified at the beginning happened while inserting the device into the pocket. ...... 33
Figure 22 - Scream captured from a smartphone inside a pocket while walking with little noise. ...... 33
Figure 23 - Scream captured from a smartphone inside a pocket while walking with a lot of noise. ...... 34
Figure 24 - Diagram of the proposed scream detector based on pitch analysis integrated with fall detection. ...... 35
Figure 25 - Code block for the adaptive behaviour. ...... 37
Figure 26 - SenseAck application main screen. ...... 38


List of Tables

Table 1 - Worldwide smartphone sales by operating system in the 3rd Quarter of 2014. ...... 10
Table 2 - Feature comparison between the studied apps. ...... 11
Table 3 - Accuracy test of the studied applications. ...... 12
Table 4 - Scenarios definition ...... 26

Chapter 1

Introduction

In this chapter, a brief overview of the work is presented and the context in which this thesis is framed is introduced. The scope and structure of the thesis are provided at the end of the chapter.

1.1 Overview

In developed countries, due to several factors such as better healthcare services and better general living conditions, the average life expectancy of individuals tends to increase. This means that there is a global tendency for this kind of society to become aged, and therefore the elderly fraction of the total population will grow in the future.

According to Eurostat, in 2011 individuals living in the European Union aged over 65 represented 17.5% of the total population, and individuals over 50 represented roughly 50% of the total population – see Figure 2. Also, according to Eurostat projections, these numbers are expected to increase over the next five decades.

When comparing European statistics to similar numbers compiled by the USA government in 2010, one can see that only 32% of the USA's total population is above 50 years old and 13% is over 65 – see Figure 1. The USA and European numbers for the 65+ population are similar, but when compared to Japan one can see that Japanese society is ageing even faster. In Japan, in 2010, the group of people over 65 years old already represented 23% of the total population, with a strong tendency to grow according to the Japanese government projections – see Figure 3. This shows that ageing is a very concerning subject, with deep economic and social repercussions [1] [2] [3].

[Bar chart: fraction of the total population over 65 years old in the USA – 13.0% (2010), 16.1% (2020), 19.3% (2030), 20.0% (2040), 20.2% (2050).]

Figure 1 - Projection for the growth of the population over 65 years-old in USA (adapted from [1])

[Bar chart: fraction of the total population over 65 years old in Europe – 17.5% (2010), 20.2% (2020), 23.6% (2030), 28.6% (2040), 29.5% (2050).]

Figure 2 - Projection for the growth of the population over 65 years-old in Europe (adapted from [2])

[Bar chart: fraction of the total population over 65 years old in Japan – 23.0% (2010), 29.4% (2020), 32.3% (2030), 37.3% (2040), 41.0% (2050).]

Figure 3 - Projection for the growth of the population over 65 years-old in Japan (adapted from [3])

More than health care support to endure longevity, developed countries' societies now need to adapt to this reality by providing elderly citizens with ways to assure their quality of life. Methods and mechanisms that extend their autonomy can be of extreme importance.

1.2 Motivation

Due to the degradation with ageing of physical capabilities such as balance or motor skills, which affect daily tasks, falls occur frequently among older people. When a person lives alone, some of these falls, depending on the victim's overall condition and physical fragility, may lead to episodes where the person is injured and cannot get up to ask for help, or may lose consciousness without anyone knowing what is going on.

Let us assume that, despite living alone, these individuals have family, friends or neighbours who worry about their wellbeing and can be trusted to ask for help when needed. We will call them caretakers. In scenarios such as those described above, the caretaker might never be warned in time to assist the person in need and, in more tragic situations, help will not arrive within a reasonable period of time.

Of course there may be inconsequential falls where nothing too serious happens to the victim. Still, falls are a recurring event and monitoring them is a way to keep caretakers aware of what is going on with their elders so that, in case of emergency, a more prompt assistance action can be taken.

1.3 Goals

This project aims to study the possibility of a monitoring system solution that can help caretakers understand if their elders are in a risky situation due to a fall event. The proposed system relies on a smartphone, since these devices are not too expensive and their market keeps expanding. Through fall detection, complemented with a scream detection method, the proposed solution's main goal is to determine whether it is possible to correctly identify a fall event, supporting that assessment with a voice analysis process applied to an audio sample collected through the microphone of the mobile device. If a scream is detected then the caretaker should be warned, i.e., a distress call should be performed.

This work proposes two major goals. The first is to study a suitable method for correctly classifying recorded sound as a scream, with the sound recorded from a mobile device inside a pocket. The second is to implement an adaptive fall detection algorithm resilient to changes of movement.

1.4 Contributions

Voice analysis has been the subject of deep study for a long time under multiple circumstances. As technology evolves and innovation takes place, voice recognition has become a common feature of many different devices and services: video games [4] [5], mobile phones that respond to voice commands [6] [7], browsing the internet using natural speech [8], etc. Even though there are many tools and solutions to identify what people say, most rely on reasonably favourable environment conditions that are usually close to optimal for that solution's design. Although this is also evolving, the challenge now lies in performing the same kind of voice analysis under extreme conditions, such as environments with high levels of background noise, without any control over the noise added to the system. During the research period for this work, studies were found that aimed to develop systems operating under such conditions, for example to identify audio events in public transportation [9] or to perform scream detection for audio-surveillance systems [10]. It is under this scope that this work contributes to the topic. No studies were found on classifying sound captured from a mobile device inside a pocket as human voice. This work presents a study on how to use pitch to identify human voice in a very noisy and variable environment without any control over the noise sources.

From another perspective, this work also contributes a proposed fall detection algorithm, based on a two-threshold approach, to increase the system's resilience to false positives when the type of locomotion changes. The algorithm adjusts its thresholds to values more suitable for walking or running. The adjustments are made adaptively, without user interference, for pre-defined profiles. The consulted literature – see [11], [12], [13], [14], [15] – studied fall detection algorithms considering only one kind of motion at a time, i.e., checking whether an individual falls while walking or while running, for example. What about an individual who is walking, then decides to run and then falls? The algorithms studied in the literature were not tested for resilience to motion shifts. The adaptive adjustment of thresholds, accounting for natural variations in motion, is the novelty this work brings to the literature.

1.5 Structure

This dissertation is organised as follows:

Chapter 2 provides a broad study of the state of the art from a commercial and an academic perspective. Fall detectors and health monitoring systems have been available for many years and are established in certain markets. It is interesting to have an idea of the offered products and their scope of action in order to understand where a product based on this work would stand. From an academic point of view it is relevant to understand what kind of methodologies have already been explored regarding fall detection and voice detection.

Chapter 3 aims to explain the parts of the proposed system for fall detection and scream detection under the conditions and scenario also stated in this chapter.

Chapter 4 provides the details of the chosen implementation: how the systems work, the conditions in which they were tested and the development environment. The discussion of the results is also included in this chapter.

Chapter 5 provides information about the future work that can derive from this work, along with the identification of items to be improved.

Chapter 2

State of the Art

2 State of the Art

This chapter aims to present some of the most common commercial solutions available to mitigate the problem stated in the previous chapter. It will also provide an overview of the state-of-the-art regarding the most relevant fall detection algorithms, as well as voice detection techniques studied under the scope of this work.

2.1 Current Commercial Solutions

As stated before, nowadays there are multiple services and solutions to monitor the elderly population, assessing their wellbeing and sending alerts to their caretakers in emergency situations. These solutions serve multiple purposes and monitor different aspects, all related to health.

Along with commercial solutions it’s important to review the relevant related work and research in this area.

2.1.1 Emergency Monitoring as a Service

Even though Europe is ageing faster than the USA, the USA currently offers more solutions, particularly as services, with multiple companies providing monitoring systems. Several services were studied, such as Life Station, Bay Alarm Medical, ADT Medical and Philips Lifeline [16]. All of them offer a device connected to the phone line and an electronic wristband that communicates with that device wirelessly. These gadgets typically have an embedded emergency button, a microphone and a loudspeaker.

In dangerous situations the user can push the emergency button on either device and place a phone call to the call centre associated with the service provider. Once the emergency case is stated, the call centre operator evaluates the event and calls either the national emergency service or a family member or caretaker.

These systems present a low-complexity user interface, making them easy to use, and in some cases the monitoring range goes up to a 300-metre radius. This kind of service is paid through annual fees plus extra expenses, such as device activation fees, which may vary from service to service.

ADT Medical and Bay Alarm Medical differentiate themselves from the rest because they also provide passive monitoring. The user may request carbon monoxide and fire/smoke monitoring, connected to the service provider's monitoring centre. ADT Medical also provides the installation of burglar alarms connected to their monitoring centre.

The ADT Medical service, for example, offers three kinds of devices that work as an emergency trigger and can be used by the elder in distress: a wristband, a pendant and a two-way voice intercommunication device – see Figure 4. As mentioned before, to trigger an emergency event the user only needs to push the button on any of the devices, and a phone call is placed to the ADT call centre. Only the two-way voice intercommunication device allows communication between the person in distress and the call centre assistant through a loudspeaker. This means that if the person in distress is far away from the intercommunication device, he/she may not be able to state his/her problem. The call centre assistant evaluates the situation and activates the most suitable entity to help the person in distress.

ADT also offers an integrated service that combines burglar alarm, fire alarm, carbon monoxide, temperature and flood monitoring – see Figure 5. In this case there is a sensor network that requires a more complex setup. There is a main device, similar to a regular burglar alarm, that is capable of correctly identifying the problem. If a sensor triggers, that information is sent by Wi-Fi to the ADT monitoring centres, where it is processed and passed to an assistant. This assistant will contact the person whose house is being monitored and, if needed, the police, fire department or other emergency personnel, who will be dispatched to the person's home.

Figure 4 – ADT Medical devices: wristband, pendant and two-way voice intercommunication device. [17]

Figure 5 – ADT system overview [17].

In ADT's case the safety problem is addressed to a considerable extent. Still, this kind of system fails to monitor situations of loss of consciousness or inactivity.

2.1.2 Video Monitoring

Video monitoring is another common solution to maintain the safety of an elderly person. This kind of monitoring helps evaluate the amount of activity of the person being monitored. A good example of this type of monitoring system is FamilyLink [18]. This product consists of a laptop that captures video through a built-in web camera and runs software that automatically detects whether there was a change in scene, identifying movement when a change is detected.

If no activity is detected, an alert text message or email is sent to the caretaker. The alerts may be configured according to the elder's routine: if there is no movement in the kitchen at 4 a.m., no alert will be sent, for example.

This product also offers an easy interface to keep the elder included in social networks and in online communication with relatives and friends, which helps check whether the elder is well. It also offers an online dashboard for the elder's caretakers to check the monitored data.

The FamilyLink laptop needs to be connected to the internet. When no activity is measured, the FamilyLink computer's software issues an alert message to the FamilyLink cloud servers. These servers are responsible for keeping the online dashboard up to date and for sending an e-mail and a text message to the caretakers, alerting them to the event – see Figure 6.

Figure 6 – FamilyLink network model. [18]

This solution is not able to identify situations of loss of consciousness, since the device needs to be moved from room to room: for instance, if a person faints within a configured period where inactivity is normal, or in a room different from the one where the device stands, no alarm will be sent and the elderly person's life might be in danger. Besides, if the elderly person has a pet and loses consciousness for some reason, no alert will be sent because the pet's movement might be captured, analysed and counted as valid activity.

Another concern this solution might raise is related to privacy issues, since it records and analyses video footage from the elder person’s environment, independently of who or what might appear in that footage.

2.1.3 GPS Location Monitoring

If an elderly person suffers from disorientation or memory loss and the probability of getting lost is high, there are GPS tracking devices that can help monitor the person's position, allowing the caretaker to be informed of his/her whereabouts at all times.

These devices usually have simple circuitry and small dimensions, so they can fit in a pocket or a purse and can easily be carried around without being noticed. Some of these devices can be programmed with a delimited zone, called a safe zone, and when the person being monitored leaves this area the caretaker receives an alarm informing about the event. These alarms can be received by text message or e-mail.

The device sometimes has a panic button that the elder can press when he/she needs help. This button also sends an alarm so the caretaker knows help is needed.

The GPS tracking devices are localized through GPS signalling. The devices share their location data through GPRS or GSM, which allows the information to be delivered over the internet to monitoring dashboards or sent as automatic alerts via e-mail.

Figure 7 - How GPS tracking works. [19]

These kinds of devices are easy to find on the market and are not tied to a specific service provider.

Despite the usefulness of these gadgets, a state of unconsciousness is difficult to monitor outdoors. Inactivity may have different causes that are not necessarily dangerous, like sitting in a park or waiting in a queue. This solution also cannot monitor a person's activity inside a building, due to technology limitations.

2.1.4 Smartphone Applications as Monitoring Systems

The smartphone market is expanding [20]. As time passes these gadgets tend to become cheaper and more accessible to everyone. Mobile phones are already a necessity, and few people go around without one. Chances are that most people who already have a mobile phone will have a smartphone in the future, if they do not own one now.

2.1.4.1 Markets and Operating Systems

According to Gartner's press release from December 2014 [21], worldwide mobile phone sales to end users totalled 455.8 million units in the third quarter of 2014, flat compared to the same period in 2013 – see Table 1. Despite some variety in existing operating systems, their market shares differ greatly from one another.

Table 1 - Worldwide smartphone sales by operating system in the 3rd Quarter of 2014.

Table 1 shows the growing market leadership of Android. Compared to the same quarter of the previous year, Android consolidated its lead, with 250 060 thousand units sold, against 38 186 thousand units for its direct competitor, iOS. This huge difference translates into the number of new users and the retention of existing users, meaning Android devices are present in a larger number of homes than those of any competitor.

2.1.4.2 Applications Overview

Smartphones, even low-cost ones, have several built-in sensors such as a gyroscope, compass, accelerometer, proximity sensor, light sensor and, in some cases, a barometer. It makes sense to use this hardware, already present in such devices, to develop a monitoring system.

The development of a fall detection system should therefore be feasible and, in fact, some solutions are already available for free on Google Play, the Android app market. Solutions for other operating systems such as Windows Phone or iOS were also considered in the survey: no fall detection application was found for Windows Phone and only one was found for iOS.

Ten Android applications were collected and researched. Most of them use the same techniques and have similar functionalities. Table 2 shows a comparison of the functionalities offered by the studied applications.

[Table 2: feature comparison of the studied applications (CRADAR, Emergency Fall Detector, Fade: Fall Detector, Smart Fall Detector, iFall, T3LAB GmbH Fall Detector, El Abuelo, Fall Detector, Fall Monitor); the compared features are Manual Threshold Tuning, Location ID, Voice Analysis, SMS Sent to Caretaker, GPS Coordinates, Internet Connection and Free.]

Table 2 - Feature comparison between the studied apps.

The applications featured in Table 2 were tested to determine which features are present in each. Manual Threshold Tuning is a setting that allows the user to edit the default threshold values required to detect a fall; this is a very relevant feature to adjust the thresholds to the specific movement the user wishes to monitor. Location ID allows the application to access the device's location. Voice Analysis enables the capture of audio clips through the device's microphone. SMS Sent to Caretaker allows the application to send an SMS to predefined emergency contacts when a fall occurs. Along with the SMS sent to the caretaker, some applications also gather the GPS coordinates of the place where the fall took place. None of the studied applications requires an internet connection to function properly, and all of them are available free of charge.

Besides the study of the available features, all applications were tested regarding the accuracy of their decisions. They were tested in three different situations: jumping, running and falling. Table 3 shows that none of the applications is able to distinguish between the three types of movement, and most of them do not accurately detect a fall. Clearly there is an issue with the monitoring process. Under this scope, this work aims to improve the monitoring process with an adaptive method resilient to movement changes.

[Table 3: for each studied application, whether a fall is detected while jumping, running and falling.]

Table 3 - Accuracy test of the studied applications.

The tests were performed with a single test subject, who performed each action five times. It is relevant to mention that the subject did not fall directly to the ground but onto a mattress. It is also relevant to mention that intentional falls might not be perceived by the systems in the same way as non-intentional falls; in our testing scenario this cannot be determined.

2.2 Fall Detection Techniques

Fall detection can be achieved taking multiple variables into consideration. This subsection explains what kind of parameters can be taken into account when detecting a fall.

To identify a fall, a set of sensors needs to be attached to the subject's body in order to measure the different physical quantities involved in the movement. The studied literature showed that the most commonly used sensors are gyroscopes and accelerometers. Due to the fall characteristics explained in detail in section 2.2.1, a fall can be detected relying solely on the acceleration experienced by the body, so an accelerometer – see [13] [22] [23] [24] – is enough from this point of view. Besides identifying a fall, some studies also aim to determine whether the subject fell forward, backward or sideways. This analysis relies on the angle between the subject and the floor before and after the fall, and for this a gyroscope is needed [12].

Where to attach the sensors in order to obtain the best results was the focus of the study performed in [24]. The experiments carried out had the main purpose of determining the best place to attach the sensors among the wrist, the waist and the head.

2.2.1 Low-Complexity Fall Detection Algorithm

Most of the algorithms studied regarding fall detection using body-attached accelerometer sensors are based on the analysis of the acceleration vector components. The acceleration is described as:

$\vec{A} = A_x \vec{e}_x + A_y \vec{e}_y + A_z \vec{e}_z$   (2.1)

where $A_x$, $A_y$ and $A_z$ are the tri-axial components of the acceleration vector.

All of the algorithms described in the literature [11], [13] and [23] use the resultant acceleration, $A_{res}$, to measure the type of movement performed by the test subject – see equation 2.2.

$A_{res} = \sqrt{A_x^2 + A_y^2 + A_z^2}$   (2.2)

Figure 8 shows a behaviour that is common to many types of falls and consists, basically, of four phases: the pre-fall phase, the free fall phase, the impact and the post-fall phase. In Figure 8 the Y-axis refers to the normalized resultant acceleration, meaning the scale presented comes from the following expression:

$\text{Acceleration [in } g\text{]} = A_{res} / A_{gravity}$   (2.3)

where $A_{gravity}$ is the gravitational acceleration, and the X-axis refers to time, in seconds. The pre-fall phase relates to the movement before the fall happens; in this particular case it represents the walking movement. One can see that the absolute variation of the acceleration is really small, with values around 1, meaning that while walking the resultant body acceleration is roughly the same as the gravitational acceleration.

Figure 8 - Real-world example that illustrates common components between falls. [14]

The free fall phase occurs when the body starts falling. Right before the impact, the resultant acceleration value diminishes because during free fall no force other than gravity is exerted on the body. Therefore the resultant acceleration decreases abruptly and tends to zero, which implies that the normalized acceleration also decreases, as one can see in Figure 8. When the free fall phase ends there is an impact, which occurs at the minimum magnitude value for acceleration. The impact applies a new force to the body, so the magnitude of the resultant acceleration increases abruptly and reaches values much greater than the acceleration of gravity; this corresponds to the maximum magnitude value for acceleration. Because the body does not stop right away once it falls, there is a bounce period in the post-fall phase that converges again to a normalized acceleration value of 1.

Taking this behaviour into account, a fall is detected when certain thresholds of normalized acceleration are crossed. Each type of movement has different values associated with it and, as seen in Figure 8, during the pre-fall phase those values show certain variations. Small variations are usually a normal part of the movement, but when drastic variations happen this may imply that a fall occurred. It is then necessary to establish from which values a fall can be accurately identified. To be able to make such a decision, two threshold values need to be established: an upper one and a lower one.

In [25], data from daily activities, like walking (without falling), was gathered. To determine the threshold values, the maximum and minimum peaks of the acceleration variation are evaluated: the peak with the maximum magnitude is set as the upper threshold value and, likewise, the minimum identified peak is set as the lower threshold value.

Knowing that when a fall occurs the acceleration behaviour is similar to the one in Figure 8, the thresholds set previously certainly do not exceed the low peak at impact and the high peak right after it. So, to detect a fall, the system must constantly monitor whether the threshold values are being exceeded. If both are, then a fall has taken place.
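To make the two-threshold rule concrete, the sketch below (a minimal Python illustration, not the implementation of [25] or of this work) scans a stream of normalized resultant acceleration samples for a dip below a lower threshold followed, within a short window, by a peak above an upper threshold. The threshold values and window length are illustrative assumptions.

```python
import numpy as np

def resultant_acceleration(ax, ay, az):
    """Resultant acceleration magnitude (equation 2.2)."""
    return np.sqrt(ax**2 + ay**2 + az**2)

def detect_fall(acc_norm, lower_thr=0.4, upper_thr=2.5, window=50):
    """Flag a fall when the normalized acceleration first drops below the
    lower threshold (free fall) and then exceeds the upper threshold
    (impact) within `window` samples. Threshold values are illustrative.
    acc_norm: 1-D array of resultant acceleration divided by gravity."""
    acc_norm = np.asarray(acc_norm)
    dips = np.where(acc_norm < lower_thr)[0]
    for i in dips:
        # look for an impact peak shortly after the free-fall dip
        if np.any(acc_norm[i:i + window] > upper_thr):
            return True, i          # fall detected around sample i
    return False, None

# Example: a synthetic trace - walking, a free-fall dip and an impact peak
trace = np.concatenate([np.full(100, 1.0), [0.2, 0.1, 0.3], [3.1, 2.8], np.full(50, 1.0)])
print(detect_fall(trace))  # (True, 100)
```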

2.3 Voice Detection Techniques

There are multiple ways to identify whether human voice is present within an audio signal. Depending on the conditions of the signal, we can rely on the study of the energy of the signal, on feature extraction or on pitch detection. These were the three techniques studied in the scope of this work. The following subsections summarize how they work, along with a very brief explanation of how human speech is produced and how perception works.

2.3.1 Speech Production and Perception

Speech production is a process that takes place in the vocal tract and respiratory passages, powered by the exhaled air that flows from the lungs [26]. The human speech production apparatus is composed of the lungs and, as shown in Figure 9, the vocal cords, velum (soft palate), hard palate, tongue, teeth and lips.

Figure 9 - Vocal tract diagram. [27]

The air from the lungs flows through the glottis. When we speak the glottis shuts making the vocal folds come together. The air makes the vocal folds vibrate. This vibration produces a sound that is enhanced when it reaches the oral cavity or the nasal cavity on its way out and can be perceived by the listener.

The perception of speech and sounds happens in the cochlea.

Depending on the frequency of the input sound, different parts of the cochlea are stimulated and electric impulses are produced only on the active parts. These impulses are transmitted to the brain, where they are later interpreted. The cochlea acts like a set of band-pass filters whose bandwidth increases with their centre frequency.

2.3.2 Voice Activity Detection – Short-Time Energy-Based

Voice Activity Detection (VAD) techniques assume that the speech signal and the noise signal are additive. An algorithm of this kind needs to decide between two options: is this input signal only noise, or is it noise plus speech? To answer this question, such algorithms usually require two main steps. The first is to take a chunk of the acoustic signal and extract some features from it. In the second step a classifier is applied to the features calculated in the first step in order to classify that chunk as speech or non-speech. Many different features have been used to perform the decision step [28]. This work focused on the short-time energy-based approach. Figure 10 depicts the basic principle of a VAD algorithm.

Figure 10 - Block diagram of a VAD implementation [29].

As stated in [30], speech is a time-varying and non-stationary signal, but when it is divided into small time frames the signal may be considered quasi-stationary. Each frame is processed by a rectangular window function $w_R(n)$. Let $x(n)$ be the time series of the input speech; after being split into several frames of length L, it can be addressed as $f_i(iL + n)$, where $i$ is the serial number of the frame. The windowed frame is given by expression 2.4, where the window function may be the rectangular one portrayed in expression 2.5.

$f_{wi}(n) = f_i(iL + n)\, w_R(n), \quad 0 \le n \le L - 1$   (2.4)

$w_R(n) = \begin{cases} 1, & 0 \le n \le L - 1 \\ 0, & \text{otherwise} \end{cases}$   (2.5)

From this, the short-time energy of the i-th frame can be defined as stated in the following expression:

$E_i = \sum_{n=0}^{L-1} f_{wi}^2(n)$   (2.6)

This process constitutes the feature extraction step of Figure 10. After collecting the features, the next step is to define a rule so that the system has a decision model with which frames can be classified as speech or non-speech. Multiple types of rules can be set. Under the scope of this work, the rule is based on a threshold set from the energy of a noise-only frame. After collecting the energy of a frame, that value is compared to the previously set threshold, which was obtained from noise alone. If the energy value is greater than the threshold, the frame is a speech frame; if the value is equal to or below the threshold, the frame is a non-speech frame.

The decision smoothing step improves the robustness of the decisions previously taken; it is a refinement of the decision rule. Since the rule is applied on a frame-by-frame basis, an extra algorithm is needed to increase the robustness of the decision rule and recover speech segments that are masked by noise. The beginning or the end of a word are examples of situations that might generate frames masked by noise.
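As an illustration of this energy-based decision rule, the following sketch (a minimal Python example, not the thesis implementation) frames the signal with a rectangular window, computes the short-time energy of equation 2.6 per frame, thresholds it against the energy observed in a noise-only recording, and applies a simple hangover as a stand-in for the decision-smoothing step. The frame length and hangover size are illustrative assumptions.

```python
import numpy as np

def short_time_energy(x, frame_len):
    """Short-time energy per rectangular-windowed frame (equation 2.6)."""
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
    return np.sum(frames**2, axis=1)

def energy_vad(x, noise, frame_len=400, hangover=3):
    """Classify each frame of x as speech (True) or non-speech (False).
    noise: a noise-only recording used to set the energy threshold."""
    threshold = short_time_energy(noise, frame_len).max()
    decisions = short_time_energy(x, frame_len) > threshold
    # decision smoothing: keep a few frames active after the last speech frame
    smoothed = decisions.copy()
    count = 0
    for i, d in enumerate(decisions):
        count = hangover if d else max(count - 1, 0)
        smoothed[i] = d or count > 0
    return smoothed
```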

2.3.3 Feature Extraction and Classification

Beyond determining whether human speech is present in an audio sample, it is also interesting to determine what that speech represents in an intelligible way. This means developing a system capable of classifying speech segments as humans do. The challenge of mimicking human perception of what a certain sound means can be approached using methods such as feature extraction and the development of classifiers, depending on the goal of the system, i.e., classifying whether an audio signal represents a vowel, a scream, a consonant, etc.

2.3.3.1 Mel Frequency Cepstral Coefficients

The voice signal is composed of two main components: the excitation and the applied modulation, which is responsible for slower variations in the spectrum. The power cepstrum is a mathematical transformation meant to separate these two main components. The power cepstrum can then be used as a feature vector representing the human voice, since its characteristics become easier to analyse (feature extraction).

The power cepstrum of a signal is defined as the squared magnitude of the inverse Fourier transform of the logarithm of the squared magnitude of the Fourier transform of a signal [26].

$\left| F^{-1}\{\log(|F\{f(t)\}|^2)\} \right|^2$   (2.7)

Figure 11 shows how the analysis flows. At the input we have a sound wave, from which a small segment is retrieved based on a previously determined time window. After that, the process proceeds according to equation 2.7.

Figure 11 - Cepstrum analysis block diagram

As shown, after the inverse discrete Fourier transform (IDFT), the signal goes through a process called liftering. This is a filtering process in which a low-pass filter is applied to the cepstrum. From this process we keep the relevant quefrencies corresponding to the vocal tract and discard the higher ones corresponding to the excitation. This method is useful to discard information we do not want to analyse, but it also discards some information belonging to the vocal sound that lies at high quefrencies.
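A minimal NumPy sketch of the cepstrum pipeline of Figure 11 is given below, assuming a single pre-windowed frame as input; the liftering cut-off is an illustrative parameter, not a value taken from this work.

```python
import numpy as np

def power_cepstrum(frame):
    """Power cepstrum of a single windowed frame (equation 2.7)."""
    power_spectrum = np.abs(np.fft.fft(frame))**2
    return np.abs(np.fft.ifft(np.log(power_spectrum + 1e-12)))**2

def low_time_lifter(cepstrum, cutoff=30):
    """Liftering: keep the low-quefrency part (vocal tract envelope) and
    discard the higher quefrencies associated with the excitation."""
    liftered = np.zeros_like(cepstrum)
    liftered[:cutoff] = cepstrum[:cutoff]
    return liftered
```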

Kalamani et al. [29] studied the best way to extract features from signals containing speech. In that work the authors compare cepstral analysis, as described in the previous paragraphs, with the mel-frequency cepstral coefficients (MFCC) method, which reproduces the behaviour of the cochlea in the human auditory system. Their results show that the MFCC method yields a more descriptive representation than the cepstral method. In Figure 12 the input signal was a phone conversation; the signal was not noisy and the conversation was clear. One can see that the cepstral output lost some of the original signal characteristics, whereas the MFCC output preserves characteristics similar to those of the original signal.

Figure 12 - a) Input signal: voice conversation over telephone; b) Cepstral output of the same conversation; c) MFCC output of the same conversation.

To obtain mel-frequency cepstral coefficients we need to filter the power spectrum with a mel filter bank before applying the logarithm function. Because the mel cepstrum coefficients are real numbers, we can convert them to the time domain using a discrete cosine transform (DCT) to retrieve the MFCC.

Figure 13 - MFCC calculation block diagram.

To do so we must use a mel-frequency filter bank. The mel-frequency filter bank is based on a non-linear frequency scale called the mel scale. The filters are overlapped in such a way that the lower boundary of one filter is situated at the centre frequency of the previous filter and the upper boundary is situated at the centre frequency of the next filter. The maximum response of a filter is located at the filter's centre frequency and is normalized to unity. Figure 14 illustrates the general form of a filter bank. The filters used are triangular and they are equally spaced along the mel scale, which is defined by equation 2.8, where $f$ is the frequency in Hertz.

$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$   (2.8)

Figure 14 - Mel-frequency filter bank example [28].

MFCC can now be used as a feature vector.
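The following sketch illustrates the pipeline of Figure 13 with NumPy and SciPy: a power spectrum per frame, a triangular mel filter bank derived from equation 2.8, a logarithm and a DCT. The frame length, FFT size, number of filters and number of kept coefficients are illustrative assumptions; a real system would typically rely on an established feature-extraction library.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)      # equation 2.8

def mel_to_hz(m):
    return 700.0 * (10.0**(m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[i - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, n_filters=26, n_ceps=13):
    """Return one MFCC vector per (non-overlapping) frame of the input signal."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    power = np.abs(np.fft.rfft(frames, n=512, axis=1))**2        # power spectrum
    fbank = mel_filterbank(n_filters, 512, sample_rate)
    energies = np.log(np.dot(power, fbank.T) + 1e-10)            # log filter-bank energies
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```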

2.3.3.2 Gaussian Mixture Models

As stated in [32], a Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs assume that the observed data is made up of a mixture of several Gaussian distributions. These distributions have associated means and variances. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system.

$p(\boldsymbol{x}|\lambda) = \sum_{i=1}^{M} w_i\, g(\boldsymbol{x}|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$   (2.9)

A GMM is given by expression 2.9, where $\boldsymbol{x}$ is a D-dimensional continuous-valued data vector, $w_i$ are the mixture weights (the prior probability of the $i$-th Gaussian) and $g(\boldsymbol{x}|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ are the $M$ component Gaussian densities. The mixture weights satisfy the constraint $\sum_{i=1}^{M} w_i = 1$. Each component density is a Gaussian function, as shown in expression 2.10, where $\boldsymbol{\mu}_i$ is the mean vector and $\boldsymbol{\Sigma}_i$ is the covariance matrix.

$g(\boldsymbol{x}|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_i)' \boldsymbol{\Sigma}_i^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_i) \right\}$   (2.10)

The complete mixture model is parameterized by the mean vectors, the weight vector and the covariance matrices: $\lambda = \{w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\}, \; i = 1, \dots, M$.

When we need to analyse a data set drawn from an unknown distribution, we have no information regarding the means and covariances of the samples to be analysed. In this scenario the main goal is to find a GMM that fits the data under study, which requires finding the, for now, unknown parameter set $\lambda$.

One way of finding the unknown $\lambda$ values is the maximum likelihood (ML) estimation method. The aim of ML estimation is to find the model parameters that maximize the likelihood of the GMM given the training data set. Since the likelihood is a nonlinear function of $\lambda$, direct maximization is not possible and the estimates are obtained iteratively using the Expectation-Maximization (EM) method.

With the EM algorithm we start with an initial model $\lambda$ and use it to estimate a new model, $\bar{\lambda}$, such that it meets the condition in 2.11.

$p(X|\bar{\lambda}) \ge p(X|\lambda)$   (2.11)

The new model is then used for the next iteration and the process continues until the method converges. On each iteration of the EM algorithm the re-estimation formulas 2.12, 2.13 and 2.14, defined for a sequence of T training data vectors, $X = \{x_1, \dots, x_T\}$, are used to guarantee a monotonic increase in the model's likelihood value.

$\bar{w}_i = \frac{1}{T} \sum_{t=1}^{T} \Pr(i|x_t, \lambda)$   (2.12)

$\bar{\boldsymbol{\mu}}_i = \frac{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)\, x_t}{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)}$   (2.13)

$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)} - \bar{\mu}_i^2$   (2.14)

where $\bar{w}_i$ represents the estimated mixture weights, $\bar{\boldsymbol{\mu}}_i$ the estimated mean values and $\bar{\sigma}_i^2$ the estimated variances, all referring to the estimated model $\bar{\lambda}$.

$\Pr(i|x_t, \lambda)$ is the a posteriori probability of component $i$ – see expression 2.15.

$\Pr(i|x_t, \lambda) = \frac{w_i\, g(\boldsymbol{x}_t|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{\sum_{k=1}^{M} w_k\, g(\boldsymbol{x}_t|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$   (2.15)
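For illustration, a single EM re-estimation pass implementing equations 2.12 to 2.15 for a diagonal-covariance GMM can be sketched as follows. This is a minimal NumPy/SciPy example with illustrative array shapes, not the implementation used in this work.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM re-estimation pass (equations 2.12-2.15) for a diagonal-covariance GMM.
    X: (T, D) training vectors; weights: (M,); means: (M, D); covs: (M, D) variances."""
    T, _ = X.shape
    M = len(weights)
    # E-step: a posteriori probability of each component for each vector (eq. 2.15)
    likelihood = np.stack([weights[i] * multivariate_normal.pdf(X, means[i], np.diag(covs[i]))
                           for i in range(M)], axis=1)           # shape (T, M)
    posteriors = likelihood / likelihood.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and variances (eqs. 2.12-2.14)
    n_i = posteriors.sum(axis=0)                                  # shape (M,)
    new_weights = n_i / T
    new_means = (posteriors.T @ X) / n_i[:, None]
    new_vars = (posteriors.T @ (X**2)) / n_i[:, None] - new_means**2
    return new_weights, new_means, new_vars
```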

2.3.3.3 Combining MFCC and GMM

When one wishes to build a system capable of identifying human speech automatically, it is possible to develop machine learning algorithms for that purpose. If, for example, one wishes to set a rule to identify a scream in a noisy environment, as studied by Rouas et al. [9], the first approach is to analyse a signal already classified as a scream and extract an MFCC feature vector from it. Repeating this process with multiple audio samples classified as screams yields multiple different feature vectors, which are likely to share some similarities since they all represent screams. When a probabilistic model is applied to all these MFCC vectors, a probability value is attributed to each value in the set, based on the model described in 2.3.3.2. After fitting the GMM on the feature vectors, when an unclassified audio sample is presented to the system, the process of retrieving the MFCC vector remains the same as before, but the system is now ready to apply a decision rule to classify whether this unknown sample is a scream or not. The classification occurs by comparing the values obtained in the new feature vector with the weighted values previously learned by training. If the new feature vector has a high likelihood under the trained model, then the unclassified sample can be classified as a scream.
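A hedged sketch of this train-then-classify procedure is shown below, using scikit-learn's GaussianMixture (fitted with the EM algorithm described above) together with the mfcc() function sketched in section 2.3.3.1. The number of mixture components, the covariance type, the decision by comparison of average log-likelihoods and the variable names in the usage comments are illustrative assumptions, not the exact setup of [9].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(feature_list, n_components=8):
    """Fit a GMM (EM algorithm) on a list of MFCC matrices, one per training clip."""
    X = np.vstack(feature_list)
    return GaussianMixture(n_components=n_components, covariance_type='diag').fit(X)

def classify(features, scream_gmm, background_gmm):
    """Label a sample as a scream when its average log-likelihood is higher
    under the scream model than under the background model."""
    return scream_gmm.score(features) > background_gmm.score(features)

# Illustrative usage (mfcc() as sketched in section 2.3.3.1; names are hypothetical):
# scream_gmm = train_gmm([mfcc(s) for s in scream_training_signals])
# background_gmm = train_gmm([mfcc(s) for s in background_training_signals])
# is_scream = classify(mfcc(unknown_signal), scream_gmm, background_gmm)
```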

2.3.4 Pitch Detection

One of the most successful approaches to pitch detection is to apply autocorrelation analysis to the speech signal. As stated in [33], the use of autocorrelation in this context has several benefits, the main one being low computational demands: usually it only requires a single multiplier and an accumulator. Using autocorrelation for pitch detection requires a window to compute the short-time autocorrelation function. This window should contain 2 to 3 complete pitch periods, which means that for high-pitched speakers the window should be short (5ms to 20ms), whereas for low-pitched speakers the window should be somewhat longer (20ms to 50ms). One of the difficulties associated with this method is that autocorrelation peaks also occur due to the formants present in the signal. A formant is, in speech, a resonance of the vocal cavities during speech production – see [34]. In order to eliminate the effects of the higher formants on the autocorrelation function, it is typical to apply a low-pass filter with a cut-off around 900 Hz.

2.3.4.1 Short-Time Autocorrelation Function

The autocorrelation function of a signal is usually defined as in expression 2.16.

$\phi_x(m) = \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} x(n)\, x(n + m)$   (2.16)

For pitch detection it is assumed that $x(n)$ is periodic with period P, meaning $x(n) = x(n + P)$ for all $n$, and as a consequence the autocorrelation will be periodic with the same period P – see 2.17.

$\phi_x(m) = \phi_x(m + P)$   (2.17)

The audio signal in its original form is non-stationary, which motivates splitting it into small windows as described before, so that the chunk of signal captured in each window may be considered nearly stationary. This allows the periodicity assumption and the use of autocorrelation to perform pitch detection. In this context, expression 2.16 is not suitable, since it does not consider the chunks of signal, only the signal as a whole. It is necessary to define a new expression to operate on the short segments – see expression 2.18.

$\phi_l(m) = \frac{1}{N} \sum_{n=0}^{N'-1} [x(n + l)\, w(n)][x(n + l + m)\, w(n + m)], \quad 0 \le m \le M_0 - 1$   (2.18)

where $w(n)$ is the window function, the same as in 2.5, $N$ is the size of the window, $N'$ is the number of signal samples used to calculate $\phi_l(m)$, $M_0$ is the number of autocorrelation points to be calculated, and $l$ is the index of the starting sample of the frame.
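A minimal NumPy/SciPy sketch of this procedure is shown below: the frame is low-pass filtered around 900 Hz to attenuate the higher formants, its short-time autocorrelation is computed, and the strongest peak within a plausible pitch-period range is selected. The pitch search range and the voicing threshold are illustrative assumptions, not values taken from this work.

```python
import numpy as np
from scipy.signal import butter, lfilter

def pitch_autocorr(frame, sample_rate=16000, fmin=60.0, fmax=500.0, voicing_thr=0.3):
    """Estimate the pitch (Hz) of one quasi-stationary frame, or None if unvoiced."""
    # low-pass filter around 900 Hz to reduce the influence of the higher formants
    b, a = butter(4, 900.0 / (sample_rate / 2), btype='low')
    x = lfilter(b, a, frame)
    x = x - np.mean(x)
    # short-time autocorrelation (rectangular window), normalized by the lag-0 value
    acf = np.correlate(x, x, mode='full')[len(x) - 1:]
    acf /= acf[0] + 1e-12
    lag_min = int(sample_rate / fmax)                    # shortest plausible pitch period
    lag_max = min(int(sample_rate / fmin), len(acf) - 1)
    lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return sample_rate / lag if acf[lag] > voicing_thr else None
```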

Chapter 3

Proposed System

3 Proposed System

This chapter presents the definition of the scope and usage scenarios for the proposed system. The methodology used to achieve the implementation of an adaptive fall detection algorithm is also presented along with a scream detection algorithm proposal under the particular conditions set for this work.


3.1 Scope

As stated before, this project's goal is to implement an Android application that can correctly detect a fall event and that uses the audio input from the microphone to help validate the detected fall. The application will take advantage of some of the sensors available on the device and, through the analysis of the input values provided by those sensors, a fall will be detected if it occurs. The application will mainly use the accelerometer readings, since the algorithm used for fall detection only needs this kind of input. Regarding audio input, the device's microphone will be used. The device is placed in a pocket at waist/thigh height.

The desired behaviour is that when a fall is detected an alarm message is sent to a designated phone number. A prompt window will also appear after the fall is detected so that, in case of a false positive or a not-so-serious fall, the user is able to cancel the alarm message before it is sent.

After a fall, if a scream is detected, the fall validation is expected to be faster: the timer will be ignored and the message will be sent right away.

This application is meant to be used by the elderly population.

To implement this application it is necessary to define three essential steps: how to detect a fall, how to assess if an audio input is a scream and how to decrease the number of false positives.

The following subsections start by explaining the case study scenario desired for a real test implementation, and then present the proposed algorithms and how they work.

3.2 Scenarios and Target Users

To proceed with the study it is relevant to establish a suitable scenario and a user profile. To do so, several scenarios were taken into account; they are described in Table 4. These scenarios were considered based on daily life activities that are likely to happen. Two major groups were set: indoor and outdoor activities. Regarding indoor activities it is assumed the subject is alone. Usually an indoor environment is not too noisy, having background noise from certain devices such as a television or a washing machine. Still, in indoor scenarios it is likely that the subject keeps the mobile phone on top of a table or on a stand, moving freely inside the house without carrying any device. Elders can fall inside a house while performing daily life activities such as sitting on a chair, going up and down the stairs or climbing on top of a chair to reach some place. These are some examples of possible situations that may end up with a fall.

Regarding outdoor scenarios, people usually take the mobile phone with them, either in a purse or in a pocket. Going out for a stroll may also represent a risky situation for an elder: due to multiple factors, such as loss of balance or strength, the elder may fall if the ground is too irregular, for instance. While walking on the street or sitting in a garden, for example, the subject may or may not be alone. This is relevant from the speech analysis point of view, since multiple speech sources might be present in a captured audio sample. The outdoor environment in a city scenario is also noisier than the indoor scenario.

Indoor (conditions: subject is alone; environment not too noisy, e.g. the television is on, adding some noise to the background; mobile phone rested upon a near surface; low obstruction to sound capture):
Scenario 1 - Sitting on a chair and standing still.
Scenario 2 - Climbing onto a chair to reach somewhere.
Scenario 3 - Climbing up or down the stairs.

Outdoor (conditions: mobile phone held waist high; some obstruction to sound capture; slow pace; irregular pavement, holes in the ground, steps, stranded objects):
Scenario 4 - Subject is alone; calm environment, not too noisy.
Scenario 5 - Crowded place, a lot of people nearby; noisy environment (people chatter).
Scenario 6 - Subject is alone walking down the street; regular traffic noise.

Table 4 - Scenarios definition

According to [21] there are multiple factors that lead to falls among the elderly. Experimental results show that 31% of falls happen due to accidental and environment-related issues; this kind of fall happens while tripping, reaching or bending, and is related to changes in muscle strength, posture control or step weakness. 17% of falls happen due to gait, balance or weakness disorders, and 13% happen due to vertigo or dizziness.

A subject could fall indoors due to tripping, reaching or bending in all the scenarios described. It is unlikely that this kind of fall can be monitored through accelerometer analysis, because this would require the user to have the mobile phone attached to her/his body at all times, and it is usual to put the mobile phone down somewhere inside the house. On the other hand, carrying the mobile phone on the body when leaving home is a more natural thing to do, rather like carrying the keys or a wallet. We shall consider the mobile phone to be held waist high in outdoor scenarios, not only because carrying the device in a pocket is natural behaviour but also because, as stated in [21], the accuracy of the (accelerometer-based) fall detection algorithm is higher.

All the outdoor scenarios could result in a fall; the irregularities of the ground make it easier to trip. Regarding voice validation when a fall occurs, scenarios 5 and 6 are not a favourable option, since there might be multiple voices or too much noise in the input, making it difficult to accurately distinguish the relevant information.

Scenario 4 is the most favourable for a first approach and will be the one considered for the current study. In this scenario there will be two relevant personas: the elder and the caretaker.

Within the context of this study, an elder is a person aged 65 or more who is still independent. The elder is capable of walking by himself/herself, even though he/she might have some movement constraints.

The caretaker is the person responsible for the wellbeing of the elder. The caretaker might be a family member or a friend. In this study, the caretaker will be warned every time the elder falls.

3.3 Fall Detection

From the assessment in subsection 2.1.4.2 one can see that the algorithms present in the available solutions are either too stiff or too sensitive regarding fall detection. This most likely means that their thresholds are hardcoded and not flexible. Three of the applications studied allow manual setting of the thresholds, so a test subject can freely edit them in order to find the most suitable values. Picturing the scenario where the user is an elder, interacting with the device to set thresholds is not the most intuitive way of using such an application, and hardcoded thresholds may or may not be suitable for that particular user. Taking this into account, this section proposes an adaptive fall detection algorithm that requires less interaction with the user and self-adjusts to the type of movement happening at the moment.

The proposed algorithm is based on the methodology referenced in subsection 2.2.1. Nevertheless, none of the consulted literature mentions exactly how the thresholds should be set or updated, or whether a single analysis is enough to generalise those values for similar movements. People do not walk the same way, so those values might vary from person to person. The elderly population is not likely to jump or run, but the algorithm suggested in this section still tries to deal with the threshold issue so it can be used in different applications.

The resultant acceleration, A_res, will simply be referred to as acceleration.
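For reference, the resultant acceleration is assumed here to be the usual tri-axial magnitude, which is also what the euclideanMagnitude call in the code of Figure 25 computes before normalising by gravity:

A_{res} = \sqrt{a_x^2 + a_y^2 + a_z^2}

where a_x, a_y and a_z are the accelerometer readings along the three axes.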

When the application starts it requests the accelerometer readings. This is an ongoing action from the beginning. The frequency of the requests should match the device sampling rate.
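As a minimal illustration of this step on Android (the class and listener names below are illustrative, not the ones used in the actual application), the accelerometer can be registered at the fastest rate the device supports:

import android.app.Activity;
import android.content.Context;
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;
import android.os.Bundle;

// Sketch: start continuous accelerometer readings as soon as the application
// starts, at the fastest rate the device supports.
public class FallMonitorActivity extends Activity implements SensorEventListener {

    private SensorManager sensorManager;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        sensorManager = (SensorManager) getSystemService(Context.SENSOR_SERVICE);
        Sensor accelerometer = sensorManager.getDefaultSensor(Sensor.TYPE_ACCELEROMETER);
        sensorManager.registerListener(this, accelerometer,
                SensorManager.SENSOR_DELAY_FASTEST);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        // event.values[0..2] hold the tri-axial acceleration readings (m/s^2).
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) { /* not used */ }
}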

After obtaining a certain number of samples, the threshold calculation should be performed. As stated before, the threshold setting methodology is not very clear in the consulted literature, so the rule to determine the threshold values comes from observation. From the vector created to store the sampled readings, the highest value is set as the maximum threshold and the lowest value is set as the minimum threshold.
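A minimal sketch of this rule is shown below. The FallDetector class and its setThresholds method are assumed to behave along these lines; this is an illustration consistent with, but not taken from, the code in Figure 25:

// Sketch of the observation-based threshold rule (hypothetical class).
public class FallDetector {

    private double thresholdMax;
    private double thresholdMin;

    // Set the thresholds from a calibration vector of acceleration magnitudes:
    // the highest sampled value becomes the maximum threshold and the lowest
    // sampled value becomes the minimum threshold.
    public void setThresholds(double[] calibrationVector) {
        thresholdMax = calibrationVector[0];
        thresholdMin = calibrationVector[0];
        for (double magnitude : calibrationVector) {
            if (magnitude > thresholdMax) thresholdMax = magnitude;
            if (magnitude < thresholdMin) thresholdMin = magnitude;
        }
    }

    public double getThresholdMax() { return thresholdMax; }
    public double getThresholdMin() { return thresholdMin; }
}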

If the system detects readings where both of those thresholds are surpassed, one of two things might be happening: either the subject changed the way he/she was moving or a fall actually occurred.

If the subject simply changed the type of movement, the system should perceive that change in order to lessen false positives. The change in the type of movement, for example from walking to running, can be identified through the periodicity of the movement: running will not generate a single peak exceeding the thresholds, but rather peaks that exceed the thresholds again within a certain period of time. If the analysis takes this into account, the error resilience of the system is expected to increase. A diagram of how this system should work is shown in Figure 15.

Figure 15 – Proposed algorithm for fall detection in the context of the application.

In the context of this work these variations are also relevant due to the different types of activity an elder performs on a daily basis.

3.4 Scream Detection Relying on Pitch Analysis

In the context of this work, audio samples were taken from mobile phones inside a pocket, waist high. This represents a huge difficulty when aiming to perform scream detection. It is important to understand what a scream is in this context. As stated before, this scream might occur while the fall occurs. It is not a distress scream like it could be in a more dangerous situation; it is most likely a sound slightly louder and higher pitched than average speech – see Figure 16.


Figure 16 - Regular speech followed by a small scream while using a regular microphone in ideal conditions, i.e., no obstruction to sound capture.

Since neither the VAD nor the feature extraction approaches studied in subsection 2.3 proved suitable for this scenario, analysing the pitch was another option. Despite the extremely noisy samples, one can see that for both the sentence used in Figure 17 and the scream captured in Figure 18 it is possible to extract the pitch information.

Figure 17 – Pitch detection on a sentence.

Figure 18 – Pitch detection on a scream.

The Android application in its final form was envisioned to identify whether a scream occurred when a fall happened. Since the fall and the scream usually happen simultaneously, a valid fall detection could not be used as a trigger for the voice analysis to start. To solve this issue, periodic audio recordings should be performed, and the latest periodic recording should be kept in storage.

Since no information is provided at the start of the application regarding the user's pitch or the conditions of the surrounding environment, it is necessary to proceed with periodic calibrations. Each calibration should assess the pitch information of the saved recording. As in the proposed fall detection algorithm, the highest pitch value found in the saved recording is set as a threshold. When a scream occurs, the pitch captured during the recording is compared with the threshold value; if it is surpassed, the scream is detected.

After the detection occurs, the caretaker should be warned – see Figure 19.

Figure 19 - Diagram of the proposed scream detector based on pitch analysis.

Chapter 4

Implementation and Discussion

4. Implementation and Discussion

This chapter presents the solutions that were implemented and the experimental results obtained from them.

4.1. Implementation and Testing

Figure 20 - System integration overview.

Figure 20 represents the proposed system integration overview. The application reads the data from the accelerometer. This data is used by the fall detection module to extract the information on the acceleration acting on the user's body. These acceleration readings are tri-axial, as stated in 2.2.1. The acceleration information is analysed in order to set thresholds and later to test the actual acceleration values against these thresholds.

Simultaneously, the application also reads the data stream captured by the device's microphone, encodes it and saves it to local storage.

When a fall is detected by the fall detector, the scream detector accesses the local storage, loads the input data and checks whether a scream occurred. If so, a possibly risky situation is in order and an emergency contact should be warned.

There will be no scream detection actions performed if the fall detector doesn’t identify a fall.
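The following sketch illustrates this triggering logic. All class and method names (MonitoringPipeline, AudioStore, fallDetected, detectScream, notifyEmergencyContact) are hypothetical and only meant to show the control flow, not the application's actual API:

// Integration sketch (hypothetical names): the scream detector only runs
// when the fall detector reports a fall.
public class MonitoringPipeline {

    private final FallDetector fallDetector = new FallDetector();
    private final ScreamDetector screamDetector = new ScreamDetector();
    private final AudioStore audioStore = new AudioStore();

    // Called for every new acceleration magnitude reading.
    public void onAccelerationSample(double magnitude) {
        if (fallDetector.fallDetected(magnitude)) {
            // Load the latest periodic recording kept in local storage.
            short[] audio = audioStore.loadLatestRecording();
            if (screamDetector.detectScream(audio)) {
                notifyEmergencyContact();
            }
        }
        // No scream detection is performed when no fall is identified.
    }

    private void notifyEmergencyContact() {
        // Warn the caretaker (e.g. SMS or call) - outside the scope of this sketch.
    }
}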

4.2. Scream Detection Approaches

The task of identifying a scream when the sound capture is taken inside a pocket adds extra complexity. In Figure 16 one can inspect the power of the signal and extrapolate which segments correspond to speech and which do not; it is simple to identify that the power corresponding to the silent segments is around 30 dB in this case. In Figure 21, the same sentence used in Figure 16 was captured from a smartphone inside a pocket while the subject was walking. The noise overlapping the speech segments is too strong and no useful information can be retrieved from the power.


Figure 21 - Speech captured from a smartphone inside a pocket while walking. The noise identified at the beginning happened while inserting the device into the pocket.

Since the algorithm was not successfully implemented to perform the scream detection, no testing was possible. Still, a brief explanation follows on why the VAD and feature extraction approaches were not the most suitable options for this purpose.

4.2.1 Voice Activity Detection

From the previous evidence it was concluded that recognising speech under these conditions was not possible through the power information. Another test was made using solely a small scream captured under the same conditions described in Figure 21. This sample – see Figure 22 – raised the possibility of developing a voice activity detection algorithm to identify only a scream. To achieve this, the use of a short-time energy based system was considered to analyse whether human speech was present in the audio signal. The implementation was supposed to follow the system described in subsection 2.3.2. Although regular speech could not be identified through this method, it was expected that a scream would be, since in most of the collected samples the subjects tended to use vowel-like sounds, such as "ah!", to express a scream. This approach did not work out well because most of the collected samples also had noise overlapping the scream, making the distinction very difficult to perform – see Figure 23.
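For illustration, a generic short-time energy computation of the kind considered here could look as follows, assuming 16-bit PCM samples; this is a sketch, not the implementation described in subsection 2.3.2:

// Generic short-time energy sketch: compute the energy of successive frames of
// a 16-bit PCM signal; frames whose energy exceeds a threshold are flagged as active.
public final class ShortTimeEnergy {

    // Returns one energy value per non-overlapping frame of frameLength samples.
    public static double[] frameEnergies(short[] pcm, int frameLength) {
        int frames = pcm.length / frameLength;
        double[] energies = new double[frames];
        for (int f = 0; f < frames; f++) {
            double energy = 0.0;
            for (int n = 0; n < frameLength; n++) {
                double sample = pcm[f * frameLength + n] / 32768.0; // normalise to [-1, 1)
                energy += sample * sample;
            }
            energies[f] = energy / frameLength; // mean-square energy of the frame
        }
        return energies;
    }

    // Simple energy-based activity decision against a fixed threshold.
    public static boolean[] activeFrames(double[] energies, double threshold) {
        boolean[] active = new boolean[energies.length];
        for (int f = 0; f < energies.length; f++) {
            active[f] = energies[f] > threshold;
        }
        return active;
    }
}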

Figure 22 - Scream captured from a smartphone inside a pocket while walking with little noise.


Figure 23 - Scream captured from a smartphone inside a pocket while walking with a lot of noise.

4.2.2 MFCC and GMM Approach

Another approach taken into account was to implement a classifier and build a machine learning algorithm as stated in subsection 2.3.3.3. This algorithm should be trained on a database of screams and should afterwards be able to make autonomous decisions. It would use MFCCs for feature extraction and then an Expectation-Maximization algorithm to estimate the likelihood weights. This was not adequate, as the captured segments were too noisy. Note that the noise in the samples is not added noise: it is real noise and therefore hardly treatable under the stated conditions, since MFCCs are not very robust in these conditions. In addition to the noise, the distance from the sound source (the user's vocal system) to the microphone on the mobile device inside the pocket adds extra attenuation to the captured sound.
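For context, the classification step of such an approach would score MFCC feature vectors against trained Gaussian mixture models. The sketch below shows a generic diagonal-covariance GMM log-likelihood computation, assuming the MFCC vectors and the model parameters (obtained, for example, with EM) are already available; it is an illustration, not the thesis implementation:

// Generic sketch: log-likelihood of a feature vector under a diagonal-covariance GMM,
// log p(x) = logsumexp_k [ log w_k + log N(x; mu_k, diag(var_k)) ].
public final class DiagonalGmm {

    private final double[] logWeights;   // log of the mixture weights, one per component
    private final double[][] means;      // [component][dimension]
    private final double[][] variances;  // [component][dimension]

    public DiagonalGmm(double[] weights, double[][] means, double[][] variances) {
        this.logWeights = new double[weights.length];
        for (int k = 0; k < weights.length; k++) {
            this.logWeights[k] = Math.log(weights[k]);
        }
        this.means = means;
        this.variances = variances;
    }

    public double logLikelihood(double[] x) {
        double max = Double.NEGATIVE_INFINITY;
        double[] componentLogs = new double[logWeights.length];
        for (int k = 0; k < logWeights.length; k++) {
            double log = logWeights[k];
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - means[k][d];
                log += -0.5 * (Math.log(2 * Math.PI * variances[k][d])
                        + diff * diff / variances[k][d]);
            }
            componentLogs[k] = log;
            if (log > max) max = log;
        }
        double sum = 0.0;
        for (double log : componentLogs) sum += Math.exp(log - max); // log-sum-exp trick
        return max + Math.log(sum);
    }
}

A scream/no-scream decision would then compare the accumulated log-likelihoods of a scream model and a background model over all frames of the segment.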

4.2.3 Pitch Analysis Approach

Taking into account what was previously explained in subsection 3.4, periodic recordings should be performed every 5 seconds. The chunk of audio captured would be saved to the device's storage, and this file should be overwritten at each capture for storage management purposes. The first recording after the application starts should be analysed to find the maximum pitch value of that audio file, which is set as the threshold. If no fall occurs in the meantime, this calibration process should happen every 30 minutes. If a fall is detected, the pitch analysis of the audio file that holds the 5 seconds prior to the fall detection should be triggered. In this case, a new audio file would be created to carry on with the periodic recordings while the older one would be kept until the analysis of the signal was finished. This constant recording is necessary since the scream usually occurs while falling, not after, and it is also necessary to keep these small samples of prior timespans since no reasonable trigger exists to identify the exact moment when a scream is most likely to occur. To analyse the audio signal, the techniques described in subsection 2.3.4 should be applied. Since the pitch varies from person to person based on age and gender, the choice of the window to apply the short-time autocorrelation is a sensitive parameter: as stated before, for high pitch speakers the window should be 5 ms to 20 ms and for low pitch speakers the window should be about 20 ms to 50 ms. In this case a 25 ms window was chosen in order to provide a reasonably ranged solution. Having a 25 ms window implies that from a 5 second audio file it is possible to retrieve 200 frames; for each frame the autocorrelation is applied and 200 pitch values are extracted. From these 200 values, the maximum pitch value should be retrieved and compared to the previously set threshold. If it exceeds the threshold value, then it is possible that a scream happened. After the fall occurred and the audio signal has been analysed, a resting period of 10 minutes should take place, after which a new calibration period should start.
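A possible sketch of the pitch extraction step just described, using non-overlapping 25 ms frames and a simple autocorrelation peak search; the sample format, the 60-500 Hz pitch search range and the class names are assumptions made for illustration:

// Sketch: estimate the pitch of each 25 ms frame by short-time autocorrelation
// and return the maximum pitch found over the whole buffer.
public final class AutocorrelationPitch {

    // Estimate the pitch (Hz) of one frame; returns 0 when no peak is found in range.
    public static double framePitch(double[] frame, int sampleRate) {
        int minLag = sampleRate / 500; // upper pitch bound, about 500 Hz
        int maxLag = sampleRate / 60;  // lower pitch bound, about 60 Hz
        double bestValue = 0.0;
        int bestLag = 0;
        for (int lag = minLag; lag <= maxLag && lag < frame.length; lag++) {
            double r = 0.0;
            for (int n = 0; n + lag < frame.length; n++) {
                r += frame[n] * frame[n + lag];
            }
            if (r > bestValue) {
                bestValue = r;
                bestLag = lag;
            }
        }
        return bestLag == 0 ? 0.0 : (double) sampleRate / bestLag;
    }

    // Maximum pitch over non-overlapping 25 ms frames of a 16-bit PCM buffer
    // (a 5 s buffer yields 200 frames).
    public static double maxPitch(short[] pcm, int sampleRate) {
        int frameLength = sampleRate / 40; // 25 ms
        double max = 0.0;
        for (int start = 0; start + frameLength <= pcm.length; start += frameLength) {
            double[] frame = new double[frameLength];
            for (int n = 0; n < frameLength; n++) {
                frame[n] = pcm[start + n] / 32768.0;
            }
            double pitch = framePitch(frame, sampleRate);
            if (pitch > max) max = pitch;
        }
        return max;
    }
}

The value returned by maxPitch would then be compared against the threshold set during the latest calibration.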

Figure 24 - Diagram of the proposed scream detector based on pitch analysis integrated with fall detection.

4.3. Fall Detection Implementation

The fall detector reads data from the accelerometer and, in the performed implementation, collects 20 samples that are kept in a vector. The lowest peak and the highest peak of the acceleration are retrieved from that vector, and the mean value of the vector is also obtained. If the thresholds are not set yet, these values become, respectively, the lower and the upper threshold values, and the procedure returns to the first task. If the thresholds were already set, the amplitude of the acceleration is analysed: if both thresholds are surpassed, it must be checked whether it is a periodic event. If so, the thresholds should be updated. If not, it is likely that a fall has happened.

To determine the periodicity of the captured movement, in order to assess whether the thresholds should be updated, a vector with 20 samples is collected and its mean value is obtained. If the newly calculated mean value is at least twice the previously calculated value, related to the first sample capture, then new thresholds are set. This enables a change of movement from walking to running without a false positive for a fall.

On the other hand, the thresholds are also updated if the newly calculated mean is at most half of the previously calculated mean. This allows the user to change from running to walking, adjusting the thresholds so that a fall can still be detected in a body with less acceleration at the present time.

While running, the maximum and minimum threshold values are usually higher than those obtained while walking. If the user changes pace from running to walking and these thresholds are not reconfigured, then detecting a fall becomes very difficult, since the acceleration of the user has lowered significantly and possibly one of the thresholds will not be surpassed when a fall occurs.

The threshold update is very relevant to improve the accuracy of the decision when a fall happens.

4.3.1 Testing

After adding the fall detection implementation for an adaptive algorithm – see Figure 25 for the corresponding code – fall detection tests were performed.

The testing procedure was meant to test the following behaviours: walking then running, running then walking, walking then falling, and running then falling. Each behaviour was repeated 10 times.

In the scenario where the subject is running and then falls, the application never detected the fall. This was due to the threshold values: while running, the maximum threshold value is usually quite high and the lower threshold value quite low, so the impact caused by the fall was not able to make the thresholds be surpassed.

In the scenario where the subject is walking and then falls, the application correctly detected 8 out of 10 falls, which translates into a success rate of 80% versus a failure rate of 20%.

In the scenario where the subject is walking and then starts to run, the application correctly assumed it was not a fall in 6 out of 10 trials. For this test, the success rate was therefore 60% and the failure rate 40%. The thresholds were also not always updated correctly, and that was one of the main factors that caused the application to misbehave.

Finally, in the scenario where the subject was running and then started to walk, no fall was detected because no thresholds were supposed to be surpassed. However, the thresholds did not update correctly, as they should have lowered to values more suitable to the new type of movement.


if (type == Sensor.TYPE_ACCELEROMETER) {
    int res = 0;
    int i = 0;
    // Resultant acceleration magnitude, normalised by gravity (g).
    double magnitude = (Mathematica.euclideanMagnitude(acceleration[0],
            acceleration[1], acceleration[2])) / 9.8;

    if (calib_cnt < 20) {
        // Calibration phase: collect 20 magnitude samples.
        calib_vector[calib_cnt] = magnitude;
        calib_cnt++;
        firstCountFlag++;
    } else {
        // Set the thresholds from the calibration vector and compute its mean.
        fallDetector.setThresholds(calib_vector);
        newMeanValue = Mathematica.magnitudeSampleMean(calib_vector, 20);
        max.setText("Res " + res + "Max = " + fallDetector.getThresholdMax());
        min.setText("Min = " + fallDetector.getThresholdMin());

        // Test the current magnitude against the thresholds.
        fallDetector.readMagnitude(magnitude);
        if (fallDetector.FALL_DETECTED) {
            Toast.makeText(this, "Fall Detected!!", Toast.LENGTH_SHORT).show();
            confirmIfOk();
            calib_cnt = 0;
        }

        if (firstCountFlag == 1) {
            // First window: just store the mean for later comparisons.
            oldMeanValue = newMeanValue;
        } else if (newMeanValue >= 2 * oldMeanValue) {
            // Mean at least doubled (e.g. walking to running): update the thresholds.
            oldMeanValue = newMeanValue;
            if (i < 20) {
                fallDetector.setThresholds(calib_vector);
                i++;
            } else {
                i = 0;
            }
        } else if (newMeanValue <= oldMeanValue / 2) {
            // Mean at most halved (e.g. running to walking): update the thresholds.
            // Written as oldMeanValue / 2 because the original "1 / 2 * oldMeanValue"
            // evaluates to zero due to integer division.
            oldMeanValue = newMeanValue;
            if (i < 20) {
                fallDetector.setThresholds(calib_vector);
                i++;
            } else {
                i = 0;
            }
        }
    }
}

Figure 25 - Code block for the adaptive behaviour.


4.4. Conclusions

One of the biggest difficulties faced during this project was the number of variables present within its scope. Regarding fall detection, the non-adaptive algorithm used is quite robust for applications of this kind, and adding the adaptive behaviour should increase that robustness. It was observed that most of the time the application can correctly distinguish a movement change; nonetheless, the error rate is still quite high. It was also observed that the thresholds did not update themselves when necessary, and most classification errors derived from that.

On the other hand, the scream detection was far from simple in its nature due to the condition of the audio samples taken. The scenarios where the monitoring system makes more sense are the outdoor ones, where the noise environment is quite unpredictable. Although calm environments are likely to occur, environments with multiple speech sources or extreme noise are also probable, like riding the bus or going to a market fair. These factors add extra noise to the audio signal captured by a device inside a pocket and make the analysis quite difficult to perform.

Other factors, like the kind of clothes the person is wearing, are also relevant to consider, since they affect the obstruction to the sound capture. The conditions set in this work are not favourable to this kind of audio processing.

This work culminated in the adaptation of the application named SenseAck, which is now ready for adaptive recognition of a fall – see Figure 26.

Figure 26 - SenseAck application main screen.

Chapter 5

Future Work

5. Future Work

This chapter presents some considerations on future work to be performed within this scope. Below are some suggestions on how to implement a more interesting solution, how to improve the one presented in this work and what to take into account, as well as some testing parameters that are relevant for the study of similar systems.

5.1. Where to Improve

Although the implementation of a fully functional application was not successfully completed, it is possible to do so.

Regarding the scream detector, it is not trivial to manipulate an audio stream using the Android OS. Manipulating an audio stream in real time, i.e., recording and analysing it at the same time, is quite simple, but reading that stream from the device's storage presents more difficulties. It is necessary to specify a fixed sampling rate and format for the recording part in order to generate uniform audio samples. To be able to read those audio samples, the application needs a decoder responsible for decoding them into raw data; once this raw data is available, the manipulation should be easier. In order to implement the proposed algorithm this is the first step.
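One possible way to side-step the decoder, sketched below under the assumption that uncompressed audio is acceptable, is to capture raw 16-bit PCM at a fixed sampling rate with Android's AudioRecord, so the stored samples can later be read back directly; the 16 kHz rate and the class name are illustrative choices, and the RECORD_AUDIO permission is required:

import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

// Sketch: record a fixed-format, uncompressed chunk of audio so that it can
// later be read back from storage as raw PCM, without needing a decoder.
// Requires the RECORD_AUDIO permission.
public final class RawAudioCapture {

    private static final int SAMPLE_RATE = 16000; // fixed sampling rate (Hz), assumed value

    public static short[] recordChunk(int seconds) {
        int minBuffer = AudioRecord.getMinBufferSize(SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT, Math.max(minBuffer, SAMPLE_RATE * 2));

        short[] chunk = new short[SAMPLE_RATE * seconds];
        recorder.startRecording();
        int offset = 0;
        while (offset < chunk.length) {
            int read = recorder.read(chunk, offset, chunk.length - offset);
            if (read <= 0) break; // stop on error
            offset += read;
        }
        recorder.stop();
        recorder.release();
        return chunk; // raw 16-bit PCM samples, ready for pitch or energy analysis
    }
}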

After the implementation, testing is required, which presents another issue. To test the scream detector, a database of suitable audio samples is required, recorded by a subject carrying the mobile device inside a pocket, waist high, while falling. This is not easily obtained, because if the subject falls on purpose then the scream is less genuine. During this study some audio samples were collected and a small database was created. They are usable for a first approach to testing the proposed scream detector, but not that useful for improving its robustness.

Another aspect that requires improvement is the resilience to false positives of the fall detection algorithm. Having an adaptive algorithm is useful in far more areas than just this scenario. Another feature that can be considered to improve the presented fall detection algorithm is the assessment of the angle between the body of the subject and the vertical, in order to understand in which direction the subject fell: sideways, backwards or forwards.
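As a hint of how that angle could be estimated, the sketch below uses the gravity component measured by the accelerometer once the device is roughly static after the impact; the axis convention and class name are assumptions made for illustration, not a method evaluated in this work:

// Sketch: estimate the angle between the device (assumed aligned with the body)
// and the vertical, using the gravity component measured by the accelerometer.
// Valid only as an approximation when the device is roughly static after the fall.
public final class TiltEstimator {

    // Returns the tilt angle in degrees: 0 = upright, 90 = lying horizontally.
    public static double tiltFromVertical(double ax, double ay, double az) {
        double magnitude = Math.sqrt(ax * ax + ay * ay + az * az);
        if (magnitude == 0.0) return Double.NaN; // no usable reading
        // Assumes the device's y axis points along the body's vertical axis.
        double cos = ay / magnitude;
        cos = Math.max(-1.0, Math.min(1.0, cos)); // clamp for numeric safety
        return Math.toDegrees(Math.acos(cos));
    }
}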

From a technical point of view, no studies were performed during this work to understand the impact of this system on the smartphone's power consumption. Since the accelerometer and the microphone are constantly in use, when thinking about a final product it is relevant to perform an analysis from this perspective.

5.2. Future Applications

Successfully implementing the scream algorithm and correctly integrating it with the fall detection algorithm would be the major next step. From this point, with the system correctly working, it would be interesting to create an adequate interface and study the acceptance of this system among a target group of users. The system developed within the scope of this work aimed to lessen the active interaction between the user and the technology; it was envisioned as an almost autonomous system that would not demand the user to be highly technically educated. One of the next steps in the continuation of this work would pass through this point.

Ultimately, under different usage conditions, such as carrying the smartphone in a small bag around the waist to guarantee less obstruction and noise addition to the sound capture, it would probably be possible to understand whether a distress situation was occurring by recognising keywords like "help". It would be interesting to provide an interface that required no interaction besides speech, performing actions like calling someone through voice commands. This would be particularly useful in situations where the user cannot move but has not lost consciousness. The application scenarios of such a system would be vast.

References

[1] USA Government - Administration on Aging, "Population for States by Age Group Percentages," Department of Health & Human Services, 2010.

[2] Directorate General for Economic and Financial Affairs (ECFIN), European Commission, "The 2009 Ageing Report. Economic and budgetary projections for the EU-27 Member States (2008-2060). European Economy 2/2009," Eurostat, 2009.

[3] Japanese Ministry of Health, Labour and Welfare, "Household Projections for Japan 2010-2035," National Institute of Population and Social Security Research, 2010.

[4] Ace Attorney Wikia, "Ace Attorney Wikia," [Online]. Available: http://aceattorney.wikia.com/wiki/Phoenix_Wright:_Ace_Attorney. [Accessed 19 February 2015].

[5] IGN, "IGN," IGN, [Online]. Available: http://www.ign.com/wikis/mass-effect-3/Kinect_Controls. [Accessed 19 February 2015].

[6] Apple Inc., "Apple," Apple, [Online]. Available: https://www.apple.com/uk/ios/siri/. [Accessed 19 February 2015].

[7] Microsoft, "WindowsPhone," Microsoft, [Online]. Available: http://www.windowsphone.com/en-us/how-to/wp8/cortana/meet-cortana. [Accessed 19 February 2015].

[8] M. Brian, "Engadget," 26 June 2014. [Online]. Available: http://www.engadget.com/2014/06/26/ok-google-voice-commands-to-your-android-lockscreen/. [Accessed 19 February 2015].

[9] J.-L. Rouas, "Audio Events Detection in Public Transport Vehicle," in 2006 IEEE Intelligent Transportation Systems Conference, Toronto, 2006.

[10] G. Valenzise, "Scream and Gunshot Detection and Localization for Audio-Surveillance Systems," Dipartimento di Elettronica e Informazione – Politecnico di Milano, Milan, 2007.

[11] A. Bourke, J. O'Brien and G. Lyons, "Evaluation of a threshold-based tri-axial accelerometer fall detection algorithm," Gait & Posture, Limerick, Ireland, 2006.

[12] Q. Li et al., "Accurate, Fast Fall Detection Using Gyroscopes and Accelerometer-Derived Posture Information," Sixth International Workshop on Wearable & Implantable Body Sensor Networks, Berkeley, 2009.

[13] A. Sorvala, E. Alasaarela, H. Sorvoja and R. Myllylä, "A Two-Threshold Fall Detection Algorithm for Reducing False Alarms," University of Oulu, Oulu, Finland, 2012.

[14] F. Sposaro and G. Tyson, "iFall: An Android Application for Fall Monitoring and Response," 31st Annual International Conference of the IEEE EMBS, Minneapolis, Minnesota, USA, 2009.

[15] P. Jantaraprim et al., "Improving the Accuracy of a Fall Detection Algorithm Using Free Fall Characteristics," 2010 7th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2010), Chiang Mai, 2010.

[16] Medical Alert Systems HQ, "Medical Alert Systems HQ," [Online]. Available: http://medicalalertsystemshq.com/compare-medical-alert-systems. [Accessed November 2013].

[17] ADT, "ADT Home and Health," [Online]. Available: http://www.adt.com/home-health/. [Accessed January 2014].

[18] FamilyLink, "FamilyLink," [Online]. Available: http://www.familylink.net/. [Accessed November 2013].


[19] Corvus, "Corvus," [Online]. Available: http://corvusgps.com/. [Accessed January 2014].

[20] K. Garcha, A. Sultania, P. Chen et al., "Handset Industry 2013 Outlook," Credit Suisse, 2013.

[21] Gartner, "Gartner Says Sales of Smartphones Grew 20 Percent in Third Quarter of 2014," Gartner, 2014. [Online]. Available: http://www.gartner.com/newsroom/id/2944819.

[22] F. Bagala, C. Becker, A. Capello et al., "Evaluation of Accelerometer-Based Fall Detection Algorithms on Real Life Falls," PLoS ONE 7(5): e37062. doi:10.1371/journal.pone.0037062, 16 May 2012.

[23] M. Kangas, A. Konttila, P. Lindgren, T. Jämsä and I. Winblad, "Comparison of low-complexity fall detection algorithms for body attached accelerometers," Gait & Posture, 2007.

[24] A. Konttila et al., "Determination of simple thresholds for accelerometry-based parameters for fall detection," Proceedings of the 29th Annual International Conference of the IEEE EMBS, Lyon, 2007.

[25] L. Z. Rubenstein and K. R. Josephson, "Falls and Their Prevention in Elderly People: What Does the Evidence Show?," Elsevier Saunders.

[26] M. H. Farouk, Application of Wavelets in Speech Processing, Springer, 2014.

[27] Macquarie University, "Department of Linguistics," [Online]. Available: http://clas.mq.edu.au/speech/units/ling110_phonetics/articulation/tongue_palate.html. [Accessed December 2014].

[28] Z.-H. Tan and B. Lindberg, "High-Accuracy, Low-Complexity Voice Activity Detection Based on A Posteriori SNR Weighted Energy," in Interspeech, Brighton, 2009.

[29] M. Grimm and K. Kroschel, Robust Speech Recognition and Understanding, Vienna: I-Tech Education and Publishing, 2007.

[30] D. Enqing et al., "Voice Activity Detection based on Short-Time Energy and Noise Spectrum Adaptation," Signal Processing, 2002 6th International Conference, vol. I, pp. 464-467, 2002.

[31] M. Kalamani, S. Valarmathy et al., "Comparison Of Cepstral And Mel Frequency Cepstral Coefficients For Various Clean And Noisy Speech Signals," International Journal of Innovative Research in Computer and Communication Engineering, Tamilnadu, 2014.

[32] S. Z. Li and A. K. Jain, Encyclopedia of Biometrics, Springer, 2009.

[33] L. Rabiner, "On the Use of Autocorrelation Analysis for Pitch Detection," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-25, no. 1, pp. 24-33, 1977.

[34] X. Huang, A. Acero and H. Hon, Spoken Language Processing, Prentice Hall, 2001.

[35] R. van der Meulen and J. Rivera, "Gartner Says Smartphone Sales Grew 46.5 Per Cent in Second Quarter of 2013 and Exceeded Feature Phone Sales for First Time," Gartner, Egham, UK, 2013.

[36] Coder Software, "Coder Software," [Online]. Available: http://www.coder.es/index.php/elabuelo.html. [Accessed September 2013].

[37] Google Inc., "Google Play," [Online]. Available: https://play.google.com/store/apps/details?id=com.ericcurtin.fallMonitor. [Accessed August 2013].

[38] Google Inc., "Android Developers," [Online]. Available: https://developer.android.com/reference/android/speech/package-summary.html. [Accessed December 2013].

[39] Nuance, "NDEV Nuance Mobile," Nuance, [Online]. Available: http://dragonmobile.nuancemobiledeveloper.com/public/index.php?task=faq. [Accessed December 2013].


[40] iSpeech, "iSpeech," [Online]. Available: https://www.ispeech.org/developers. [Accessed December 2013].

[41] G. Maas, "Nuance Recognizer: Bringing New Levels of Accuracy, Reliability, and Ease of Use to Speech-Based Self-Service Applications," Nuance, Burlington, Massachusetts, 2007.

[42] Nuance Communications, "Nuance Care Solutions: Nuance Vocalizer Studio," Nuance Communications, 2013.

[43] Nuance, "Nuance Mobile SDK Reference Speech Kit Guide," [Online]. Available: http://nuancemobiledeveloper.com/public/Help/DragonMobileSDKReference_Android/SpeechKit_Guide/Introduction.html. [Accessed December 2013].

[44] Speech Enhancement for Android, "Speech Enhancement for Android," [Online]. Available: http://enhancementapp.com/. [Accessed October 2013].

[45] J. Wong, "musicg," [Online]. Available: https://code.google.com/p/musicg/. [Accessed November 2013].

[46] J. Pohjalainen, P. Alku and T. Kinnunen, "Shout Detection In Noise," 2011.

[47] B. P. Bogert, M. J. R. Healy and J. W. Tukey, "The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo Autocovariance, Cross-Cepstrum and Saphe Cracking," Proceedings of the Symposium on Time Series Analysis, New York, 1963.

[48] R. Togneri, "The HTK book by Steve Young," [Online]. Available: http://www.ee.uwa.edu.au/~roberto/research/speech/local/entropic/HTKBook/. [Accessed 2014].

[49] H. J. Fell and J. MacAuslan, "Automatic Detection of in Speech," Firenze University Press, Firenze, Italy, 2003.

[50] A. Sangwan et al., VAD Techniques for Real-Time Speech Transmission on the Internet, Bangalore, 2002.

[51] M. H. Moattar and M. M. Homayounpour, A simple but efficient real-time voice activity detection, Tehran: EUSIPCO 2009, 2009.
