REAL TIME IMPLEMENTATION OF DIRECTION OF ARRIVAL ESTIMATION ON

ANDROID PLATFORMS FOR HEARING AID APPLICATIONS

by

Abdullah Küçük

APPROVED BY SUPERVISORY COMMITTEE:

______Dr. Issa M. S. Panahi, Chair

______Dr. Carlos Busso

______Dr. Mehrdad Nourani

Copyright 2018

Abdullah Küçük

All Rights Reserved

To my wife, family and every person who had faith in me.

REAL TIME IMPLEMENTATION OF DIRECTION OF ARRIVAL ESTIMATION ON

ANDROID PLATFORMS FOR HEARING AID APPLICATIONS

by

Abdullah Küçük, BE

THESIS

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

MASTER OF SCIENCE IN

ELECTRICAL ENGINEERING

THE UNIVERSITY OF TEXAS AT DALLAS

AUGUST 2018

ACKNOWLEDGMENTS

This thesis is the result of two years of research on speech source localization. I am grateful to have completed it with the valuable feedback of many people, and I am thankful to all who encouraged and trusted my aspirations and endeavors.

I would like to express my gratitude to my advisor, Dr. Issa Panahi, for giving me the opportunity to work on this project at the Statistical Signal Processing Laboratory (SSPRL) / UT Acoustic Laboratory (UTAL) and for his valuable mentorship and guidance throughout my research. I am also thankful to Dr. Carlos Busso and Dr. Mehrdad Nourani for serving as committee members for my master's thesis defense.

I would like to thank my lab members for supporting my research: Mr. Anshuman (for supervising me on my research topic), Mr. Yiya Hao (for helping me with the real-time implementation of the algorithm), Mr. Chandan Reddy (for his help with DSP-related questions), and Mr. Gautam Bhat, Mr. Nikhil Shankar, Mr. Parth Mishra, Mr. Serkan Tokgoz, and Ms. Ziyan Zou (for useful discussions about our research and for empathizing with the difficulties of research).

I am thankful to my wife, my parents, my parents-in-law, my siblings, my siblings-in-law, and my friends for their understanding and moral support. Thank you for trusting and encouraging me.

This thesis was supported by the National Institute on Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH) under Award 1R01DC015430-01.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

April 2018


REAL TIME IMPLEMENTATION OF DIRECTION OF ARRIVAL ESTIMATION ON

ANDROID PLATFORMS FOR HEARING AID APPLICATIONS

Abdullah Küçük, MSEE
The University of Texas at Dallas, 2018

ABSTRACT

Supervising Professor: Dr. Issa M.S. Panahi

Sound source localization (SSL) is one of the vital areas in signal processing, especially for hearing aid applications. SSL (or direction of arrival estimation) determines the location of the speaker via multiple fixed microphones (a microphone array). Knowing the speaker's direction of arrival (DOA) helps to improve the performance of the system. Another advantage of DOA estimation is that it helps hearing-impaired people locate the talker, because hearing aid users, especially those older than 60 years, have difficulty determining the location of a speaker. Having the requisite processing capabilities and at least two microphones makes smartphones a cost-effective solution for multi-channel audio signal processing. In this thesis, we propose a new stereo input/output framework for audio signal processing on Android platforms. This framework enables us to perform multi- or single-channel audio signal processing in real time. We also propose a method for two-microphone direction of arrival (DOA) estimation and a real-time implementation of this method on the latest Android smartphones.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

LIST OF FIGURES

LIST OF TABLES

CHAPTER 1 INTRODUCTION
1.1. Motivation
1.2. Contribution of the Thesis
1.2.1 Stereo Channel Input and Output Framework for Android Platforms
1.2.2 Speech Source Localization and Tracking
1.2.3 Real Time Implementation of Proposed DOA Estimation Method on Android Smartphone
1.3. Outline of the Thesis

CHAPTER 2 STEREO CHANNEL INPUT/OUTPUT FRAMEWORK FOR ANDROID PLATFORMS
2.1 Motivation
2.2 The Proposed Framework
2.2.1 Java Module of the Framework
2.2.1.1 Graphical User Interface (GUI)
2.2.1.2 Wave Recorder Class
2.2.1.3 Processing Class
2.2.1.4 Java Native Interface (JNI)
2.2.2 C/C++ Module of the Framework
2.2.2.1 Audio Main
2.2.2.2 Signal Processing C/C++ Class
2.3 Experiment and Results
2.4 Summary

CHAPTER 3 TWO MICROPHONE BASED SPEECH SOURCE LOCALIZATION
3.1 Motivation
3.2 Literature Review
3.2.1 Direction of Arrival Estimation using GCC
3.2.2 Direction of Arrival Estimation using Frequency Weighted GCC
3.2.3 Direction of Arrival (DOA) Estimation on Smartphone/Tablet
3.3 Proposed DOA Estimation Method
3.3.1 Pre-Processing Module
3.3.2 DOA Estimation Module
3.3.3 Post-Processing Module
3.3.4 Proposed DOA Estimation Method and Sound Source-Tracking
3.4 Experimental Evaluations
3.4.1 Noisy Dataset and Experimental Setup
3.4.2 Evaluation Metrics
3.4.3 Results and Discussions
3.5 Summary

CHAPTER 4 REAL TIME IMPLEMENTATION OF DOA ESTIMATION METHOD ON ANDROID PLATFORMS
4.1 Motivation
4.2 Shortcomings of Real Time Implementation of SSL on Mobile Platforms
4.3 Real-Time Implementation of Proposed SSL Algorithm
4.3.1 DOA Estimation Application Version 1
4.3.2 DOA Estimation Application Version 2
4.3.3 DOA Estimation Application Version 3
4.3.4 DOA Estimation Application Version 4
4.4 Summary

CHAPTER 5 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

CURRICULUM VITAE


LIST OF FIGURES

Figure 2.1. The pipeline of the proposed method

Figure 2.2. Architecture of the Android operating system

Figure 2.3. Block diagram of the proposed SE method

Figure 2.4. Screenshot of the developed SE method

Figure 3.1. Traditional GCC based DOA estimation with the pre-filtering method

Figure 3.2. Two-microphone DOA estimation setup

Figure 3.3. Block diagram of the proposed method

Figure 3.4. Image-source model (ISM) setup for the simulation experiments. Black markers represent microphones; hollow markers represent speaker locations

Figure 3.5. Experimental setup for the real-data experiments

Figure 3.6. (Top to bottom) Average accuracy (ACC) for White noise, Machinery noise, Traffic noise, and Babble noise. Higher ACC is preferred

Figure 3.7. (Top to bottom) Average miss rate (MR) for White noise, Machinery noise, Traffic noise, and Babble noise. Lower MR is preferred

Figure 3.8. (From top to bottom) Noisy input signals (5 dB SNR, Babble noise, real recorded data), enhanced signals after MMSE-LSA, spectral flux (SF) feature for the enhanced signals, inter-microphone SF difference, and ground truth and VAD decisions

Figure 3.9. (From top to bottom) DOA estimation and source tracking for the noisy data of Fig. 3.8 (top), for traditional GCC (middle) and the proposed method (bottom). Solid blue and dashed red lines represent the estimated DOA and the ground-truth DOA, respectively

Figure 3.10. Average RMSE (°) for Proposed Method 1, Proposed Method 2, and Proposed Method 3 for different noisy conditions

Figure 3.11. Average RMSE (°) for GCC, GCC+Post-filter (VAD), GCC+Pre-filter (MMSE-LSA), and the proposed method for different noisy conditions


Figure 3.12. (Top to bottom) Average RMSE (°) for DOA estimation using different pre-filtering techniques for simulated noisy data at different SNRs. The proposed post-filter is included for all, for fairness

Figure 3.13. (Top to bottom) Average RMSE (°) for DOA estimation using different pre-filtering techniques for real data with different values of D and T. The proposed VAD is included for all, for fairness

Figure 4.1. Screenshot of the smartphone application on a Google Pixel 2

Figure 4.2. Screenshots of the smartphone application on a Google Pixel 2

Figure 4.3. Block diagram of the modified proposed method. $H(k)$ is the pre-filter; $\hat{\theta}$ is the estimated DOA

Figure 4.4. Average RMSE (°) for the proposed method with modified MMSE-LSA and with MMSE-LSA for different noisy conditions

Figure 4.5. Block diagram of the adaptive thresholding based VAD

Figure 4.6. Average RMSE (°) for the proposed method with SF VAD and with AT VAD for different noisy conditions

Figure 4.7. Screenshots of the smartphone application on a Google Pixel 2 smartphone (Android 8.1). The blue marker shows the estimated DOA


LIST OF TABLES

Table 2.1. The timing for some common methods*

Table 3.1. Summary of weighting functions

Table 3.2. Decision matrix for the binary classifier (VAD)

Table 4.1. The timing for DOA application Version 1

Table 4.2. The timing for DOA application Version 2

Table 4.3. The timing for DOA application Version 3

Table 4.4. The timing for DOA application Version 4


CHAPTER 1

INTRODUCTION

1.1. Motivation

Spatial hearing loss is commonly seen in people over 60 years of age, although it is not always age-related [1]. People with spatial hearing loss have difficulty perceiving and processing different sounds coming from different directions; this is also known as a lack of spatial awareness. Spatial hearing loss prevents people from locating a speaker accurately, which causes problems in group conversations or group activities for those who lack spatial awareness. Hearing aid devices (HADs) apply spectral gains, such as speech enhancement [2, 3], to compensate for hearing loss in noisy environments. However, these spectral gains mostly cannot preserve location cues [4]. Hence, another technique is needed.

Speech source localization (SSL) is a suitable technique for helping hearing-impaired people estimate the speaker's location. SSL is also a powerful preprocessing technique that improves several signal processing algorithms, such as microphone array beamforming [5-8], noise reduction [8, 9], and automatic speech/speaker recognition (ASR) [10, 11].

An SSL system should be robust to background noise and reverberation. It should indicate the speaker's direction for speech, not for background noise. Last but not least, the system should support frame-based operation for real-time use.

The widespread use of smartphones offers hearing-impaired people a cost-effective way to increase the effectiveness of HADs. Smartphones are proposed in [12-14] for implementing speech processing algorithms such as direction of arrival (DOA) estimation [15], speech enhancement [16, 17], and dynamic range compression [18], among others, with minimum latency, since modern smartphones possess the necessary hardware: microphones, powerful processors, and enough battery power for signal processing algorithms.

1.2. Contribution of the thesis

HADs have constraints such as device size, computational capability of the processor, memory size, and number of microphones. To overcome these constraints, smartphones are an excellent candidate, as they have the requisite hardware: at least two microphones, a processor, and a battery. More information can be found in [19].

An SSL method helps to increase the signal-to-noise ratio (SNR) and supports de-reverberation, suppression of background noise, and enhancement of speech with high perceptual quality [6, 8, 20, 21]. In this thesis, a real-time SSL method is proposed, and a real-time Android application is developed for hearing aid applications. The main contributions of the thesis are as follows:

1.2.1 Stereo Channel Input and Output Framework for Android Platforms

As mentioned before, smartphones and tablets are attractive, cost-effective solutions for implementing multi-channel audio signal processing algorithms. Modern smartphones and tablets have at least two microphones and the requisite processing capability to perform real-time audio signal processing. However, it is still challenging to integrate dual-microphone audio capture with the audio processing pipeline in real time for efficient operation. In this thesis, we propose a practical stereo input/output framework for real-time audio signal processing on Android smartphones/tablets. This framework enables a large variety of real-time signal processing applications on Android platforms, such as direction of arrival (DOA) estimation, dynamic range audio compression, and multi-channel speech enhancement. Thanks to the framework, smartphones can be used as assistive devices for hearing aid devices (HADs) [22].

Before the proposed framework, Android smartphones/tablets had to be rooted in order to use multiple microphones of the device. Rooting is not allowed by vendors because it changes the system files of the device. Thanks to the proposed framework, rooting is no longer needed, as the framework uses the Android system libraries. Another main advantage of the proposed framework is its low latency: for audio signal processing, each frame should be processed within 15 milliseconds (ms) to avoid perceptual issues [23]. The proposed framework integrates the Java and C/C++ programming languages to decrease the processing time, and it is fast enough thanks to this integration [22].

Although the proposed framework can be used only on Android platforms, Android covers most of the smartphone market: its usage share for 2017 is reported as 87 percent [24]. The proposed framework is also important as the base framework for our project, which is supported by the NIH. The details are covered in Chapter 2.

1.2.2 Speech Source Localization and Tracking

Using a microphone array to estimate the direction of arrival (DOA) of the source signal is typical for speech source localization (SSL). DOA performance depends on the SNR, noise type, reverberation, the type and geometry of the microphone array, and the number of microphones. In this work, we focus on two microphones separated by 13-16 cm, as in modern smartphones. Another consideration is finding the DOA of only a single (dominant) source, using short frames of incoming data (20-100 milliseconds (ms)).

The focus of this work is SSL or DOA estimation using Android mobile devices with two microphones. Hence, the DOA method used should have high performance and low computational complexity. One of the best candidates under these constraints is time difference of arrival (TDOA) estimation. TDOA exploits the time difference of the incoming signal between the microphones, and using simple statistics of the data makes it suitable for real-time implementation.

We are able to use at most two microphones of a smartphone, which limits the performance of the TDOA algorithm. Therefore, we propose a novel three-step approach for DOA estimation. The first step is a pre-processing stage used to increase the SNR of the speech signal. The second step is a TDOA method, the generalized cross-correlation (GCC) method. The last step is a post-processing stage that provides consistent DOA estimates. The details of this approach are given in Chapter 3.

1.2.3 Real Time Implementation of Proposed DOA Estimation Method on Android Smartphone

The goal of this thesis is to develop a real-time SSL application for smartphones and tablets. Smartphone and tablet applications impose constraints such as frame length, sampling frequency, processor power, and battery life. After objective evaluation of the DOA estimation methods, we released four versions of the DOA application with these real-time constraints in mind. Each release was driven by two critical goals: increasing the performance of the DOA estimation and decreasing the computational complexity of the application. The details of the improvements are given in Chapter 4.

1.3. Outline of the Thesis

The organization of the thesis is as follows: Chapter 2 provides details of the stereo channel input/output framework for Android platforms. Chapter 3 presents the literature background of TDOA estimation with two microphones, the proposed two-microphone DOA estimation approach, and its simulation results. Chapter 4 describes the improvements made across the real-time implementations of DOA estimation. Chapter 5 concludes the thesis.


CHAPTER 2

STEREO CHANNEL INPUT/OUTPUT FRAMEWORK FOR ANDROID PLATFORMS

2.1 Motivation

Smartphones are encountered in every area of life, and this popularity opens up new application areas. One attractive area is using the smartphone as an assistive device for hearing aid devices (HADs). Since HADs have limitations such as device size, computational capability of the processor, memory size, and number of microphones, using smartphones as assistive devices for HADs is needed [19].

For smartphones to serve as assistive devices for hearing aid applications, they must be capable of audio signal processing. With powerful processors, at least two microphones, and sufficient battery, smartphones are capable in terms of hardware; what is still needed is the software. The proposed framework responds to this need.

Two main challenges emerge for audio signal processing on smartphones: latency and multichannel processing. First, the proposed framework will be used for real-time communication, so the signal processing should be completed within 15 milliseconds (ms) to support natural conversation; according to [23], audio latency should be less than 15 ms to avoid perception issues. Moreover, smartphones have their own operating systems that manage task scheduling, and developers cannot manipulate or even observe the scheduling priority. Hence, audio signal processing on smartphones should take less than 15 ms. Second, multichannel processing capability makes the proposed framework a stronger platform for audio signal processing on smartphones: several multichannel algorithms, such as direction of arrival (DOA) estimation and multichannel speech enhancement, which are not used in HADs, can be helpful for hearing-impaired people. The proposed framework meets both challenges [22].

Although the proposed framework is for Android platforms only, it covers most of the smartphone and tablet market: the usage share of Android devices is reported as 87 percent for 2017 [24].

The Java programming language is used for developing Android applications. However, it is not fast enough for audio signal processing. Therefore, the proposed framework integrates Java and the C/C++ programming languages for faster signal processing. Previously, Android smartphones/tablets had to be rooted to use two microphones of the device; vendors do not allow rooting, so it voids the device's support. Fortunately, the proposed framework removes the rooting obstacle. Another main advantage of the proposed framework is that it serves as the base framework for the hearing aid project supported by the National Institutes of Health. The details are given in the following sections.

2.2 The Proposed Framework

This section covers the details of the proposed framework, whose pipeline is shown in Fig. 2.1. The proposed software shell is explained according to this pipeline. To do so, the section is divided into two parts: the Java module and the C/C++ module of the framework. The Graphical User Interface (GUI), the Wave Recorder Java class, the Processing class, and the Java Native Interface (JNI) are explained under the Java module. The C/C++ module is divided into two subsections: Audio Main and Signal Processing.


Figure 2.1. The pipeline of the proposed method: Graphical User Interface (GUI), Wave Recorder, Processing, Audio Main, and Signal Processing.

2.2.1 Java Module of the Framework

The Java programming language is needed for application development on Android platforms. Although Java is a must for Android app development, it is not fast enough for audio signal processing. To overcome the latency problem, Java is integrated with the C/C++ programming language. All parts of the framework except the signal processing part are written in Java. Therefore, the Graphical User Interface (GUI), sound capture, and the communication between Java and C/C++ are handled in the Java module. The details are given below.


2.2.1.1 Graphical User Interface (GUI)

The Graphical User Interface (GUI) is the first screen of the app, through which the user interacts with it. In our framework, it is used for starting/stopping the app, changing parameters of the app/algorithm, and displaying results. The GUI therefore includes text views, buttons, image views, edit-text fields, layouts, list views, and so on.

Once the user starts the app, the Wave Recorder Java class is called. Wave Recorder begins to capture the signal according to the user's preferences and the system setup.

2.2.1.2 Wave Recorder Class

Wave Recorder is a Java class used for capturing the audio signal and playing the processed output. Android classes and libraries are used for sound capture; one of them is the AudioRecord class, which administers the audio sources for recording audio from the built-in microphones of the platform [25]. A few parameters are critical for signal capture and are explained below.

 Audio Source = The AudioRecord class allows us to select the audio source for the recording. Although there are many options, MIC and CAMCORDER are the recommended ones: MIC denotes the bottom microphone of the smartphone, and CAMCORDER denotes the microphone near the camera.

 Sampling Frequency = Determines the sampling rate of the recording. According to the Android developer documentation, 44.1 kHz or 48 kHz is suggested for modern Android smartphones; these sampling rates provide high quality and minimum audio latency [9].

 Channel Configuration = Defines the number of audio channels. Stereo is the maximum that can be selected; however, Android does not guarantee two-channel recording on all phones. The Google Pixel 2 XL, Google Pixel, and Samsung Galaxy S7 were tested, and two-channel recording is possible on these phones.

 Audio Format = The audio data is represented by ENCODING_PCM_16BIT: each sample is 16 bits, and pulse code modulation is used for the data.

 Buffer Size = The size of the recorder buffer is crucial, since we do not want to lose any data while capturing the signal. The getMinBufferSize() function is therefore used to determine the minimum buffer size, and its return value is assigned to Buffer Size.

The audio signal captured in the WaveRecorder class is stored in the input buffer. The stored data is sent to the Processing class on a first-in, first-out (FIFO) basis. After signal processing is completed, the processed signal is sent to the speaker; this operation is also managed by the WaveRecorder class. We exploit the AudioTrack class, provided by Android, for playing the audio. The parameters defined above are needed for AudioTrack as well.
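To make this configuration concrete, the listing below is a minimal sketch of how capture and playback could be set up with the parameters above (stereo, 48 kHz, 16-bit PCM, CAMCORDER source). It uses only the standard Android classes named in this section; the class organization is illustrative, not the thesis's actual code.

    import android.media.AudioFormat;
    import android.media.AudioManager;
    import android.media.AudioRecord;
    import android.media.AudioTrack;
    import android.media.MediaRecorder;

    // Illustrative setup of stereo capture/playback with the parameters above.
    public class WaveRecorderSketch {
        static final int FS = 48000;  // 48 kHz, as recommended for modern phones

        public static AudioRecord createRecorder() {
            // getMinBufferSize() avoids data loss while capturing.
            int minBuf = AudioRecord.getMinBufferSize(FS,
                    AudioFormat.CHANNEL_IN_STEREO, AudioFormat.ENCODING_PCM_16BIT);
            // CAMCORDER = microphone near the camera; MIC = bottom microphone.
            return new AudioRecord(MediaRecorder.AudioSource.CAMCORDER, FS,
                    AudioFormat.CHANNEL_IN_STEREO, AudioFormat.ENCODING_PCM_16BIT,
                    minBuf);
        }

        public static AudioTrack createPlayer() {
            int minBuf = AudioTrack.getMinBufferSize(FS,
                    AudioFormat.CHANNEL_OUT_STEREO, AudioFormat.ENCODING_PCM_16BIT);
            return new AudioTrack(AudioManager.STREAM_MUSIC, FS,
                    AudioFormat.CHANNEL_OUT_STEREO, AudioFormat.ENCODING_PCM_16BIT,
                    minBuf, AudioTrack.MODE_STREAM);
        }
    }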


2.2.1.3 Processing Class

Real-time audio signal processing on smartphones is done frame by frame. The Processing class is responsible for the frame-based operations and for the communication between Java and C/C++. The thread that updates the GUI is also managed in the Processing Java class.

The audio signal captured by the Wave Recorder class is processed frame by frame; frame sizes usually range from 5 milliseconds to 20 milliseconds. The Processing Java class sends each frame to the Audio Main C/C++ function using the Java Native Interface (JNI). After the signal processing is done in the C/C++ module, the processed frame is returned to the Processing class, which then calls a thread to either send the frame to the speaker or update the GUI.

2.2.1.4 Java Native Interface (JNI)

Android offers the Java Native Interface (JNI) framework for managing native code (written in C/C++) on the Android platform. In other words, JNI acts as a bridge that integrates C/C++ and Java code. This integration gives Java the performance boost needed for a natural dialogue: a study shows that latency greater than 15 ms is noticeable, so real-time audio signal processing should be completed within 15 ms [23]. JNI is needed to meet this 15 ms latency budget. Another advantage of JNI is that it makes some hardware features accessible to Java through other languages such as C/C++. The Native Development Kit (NDK) is required to use the JNI framework; the NDK makes native libraries available through JNI. More information can be found in [26].
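As an illustration of this bridge and of the frame loop described above, the sketch below shows the Java side only. The native library name and the method signature are hypothetical, not the thesis's actual code.

    // Hypothetical Java side of the JNI bridge.
    public class Processing {
        static {
            System.loadLibrary("audiomain"); // libaudiomain.so built with the NDK
        }

        // Implemented in C/C++ in the Audio Main module; takes one interleaved
        // stereo PCM frame and returns the processed frame.
        public static native short[] processFrame(short[] stereoFrame, int fs);

        // Simplified frame loop: read a 10 ms stereo frame (480 samples per
        // channel at 48 kHz), process it natively, and play it back.
        public static void run(android.media.AudioRecord recorder,
                               android.media.AudioTrack player) {
            short[] frame = new short[2 * 480];
            player.play();
            recorder.startRecording();
            while (!Thread.currentThread().isInterrupted()) {
                int n = recorder.read(frame, 0, frame.length); // blocking read
                if (n <= 0) continue;
                short[] out = processFrame(frame, 48000);      // JNI call
                player.write(out, 0, out.length);
            }
        }
    }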


2.2.2 C/C++ Module of the Framework

Audio signal processing should be completed in less than 15 ms in order to avoid perception issues. The C/C++ programming language is used for the signal processing algorithms to meet this constraint, as C/C++ runs faster than Java. Fig. 2.2 shows the architecture of the Android operating system [27]. The architecture is divided into five layers. The first two layers, Applications and the Application Framework, use the Java programming language. The third and fourth layers, the Native Libraries and the Hardware Abstraction Layer, require the C/C++ programming language. The system drivers are in the fifth layer, also known as the Linux kernel. As seen in Fig. 2.2, C/C++ is closer to the low layers of the system than Java, which makes calculation and processing faster.

Figure 2.2. Architecture of the Android operating system: Applications and the Application Framework (Java); the Native Libraries (Surface Manager, Media Framework, OpenGL|ES, Dalvik virtual machine, core libraries) and the Hardware Abstraction Layer (C/C++); and the Linux kernel with the display, audio, Bluetooth, USB, Wi-Fi, and power management drivers.


2.2.2.1 Audio Main

Audio Main is the primary C/C++ code. The Processing.java class sends an audio frame to the Audio Main class via JNI. The first operation is converting the little-endian (LE) 16-bit integers to double or float numbers, since Java records the audio signal in LE format (as in the WAV format), with sample values ranging from -32768 to +32767, while the signal processing algorithms expect floating-point samples.
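This conversion can be sketched as follows. The thesis performs this step in C/C++; Java is used here for consistency with the other listings, and the method name is illustrative.

    // Converts interleaved 16-bit little-endian PCM bytes to floats in [-1, 1).
    public static float[] pcm16leToFloat(byte[] raw) {
        float[] out = new float[raw.length / 2];
        for (int i = 0; i < out.length; i++) {
            int lo = raw[2 * i] & 0xFF;      // low byte first (little endian)
            int hi = raw[2 * i + 1];         // high byte carries the sign
            out[i] = ((short) ((hi << 8) | lo)) / 32768.0f;
        }
        return out;
    }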

The Audio Main class has three critical functions: Parameter Initialization, Real-Time Processing, and Sound Output.

 Parameter Initialization = This function is responsible for initializing the parameters defined in the Java part, such as the sampling rate and frame length.

 Real-Time Processing = The number conversion of the frame samples, the input/output buffer filling, and the calls to the C/C++ code of the signal processing method are managed by the real-time processing function.

 Sound Output = The output buffer is filled with the processed frame. Before the filling operation, the number format is converted back from double/float to LE by this function. After the buffer filling operation, the output buffer is sent to the Processing.java class via JNI.

2.2.2.2 Signal Processing C/C++ Class

The signal processing method should be developed in the C/C++ programming language. The signal processing class is called from the real-time processing function, so that each captured audio frame is processed by this class. There are two ways to develop the C/C++ code of a signal processing algorithm: the code can be written from scratch, or MATLAB Coder can be used to generate C/C++ code from MATLAB built-in functions.


2.3 Experiment and Results

Two experiments and their results are presented in this section to show the effectiveness and functionality of the proposed framework. A third experiment, DOA estimation on Android platforms, is the main focus of the thesis and is explained in detail in Chapter 4.

The first experiment measures the timing of the framework, which addresses the latency requirement for audio signal processing. Typical audio signal processing methods, namely the Fast Fourier Transform (FFT), the Inverse Fast Fourier Transform (IFFT), and convolution, are implemented on the Google Pixel 2 XL, Google Pixel, and Samsung Galaxy S7. Table 2.1 shows the timing in milliseconds (ms) for these three methods. The sampling frequency and frame length are 48 kHz and 10 ms, respectively, for Experiment 1. The incoming frames from both microphones are convolved with a stored low-pass filter of order 256. The same experiment is repeated for the FFT and IFFT tests, with 512 FFT points. The three phones perform differently because each has different hardware and a different operating system; a variety of phones with different operating systems is used to display the performance of the framework across devices. As seen in Table 2.1, these common audio signal processing methods take less time than the 10 ms frame length, thanks to the integration of Java and C/C++ via JNI. Taking less time than the frame length is significant because it means the framework will not skip frames. Table 2.1 also shows that the proposed framework will not cause perception issues.


Table 2.1. The timing for some common methods*

Phone        Convolution   FFT     IFFT
Pixel 2 XL   4.395         1.883   0.992
Pixel        4.019         1.614   1.154
Galaxy S7    4.899         1.845   1.301

* Timing in milliseconds
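For reference, a per-frame measurement of this kind can be sketched as below. The convolution helper and the structure are assumptions, not the thesis's benchmark code; the frame size of 480 samples follows from 10 ms at 48 kHz, and the filter order of 256 follows the text.

    // Illustrative timing measurement for the convolution test.
    public class TimingSketch {
        static float[] convolve(float[] x, float[] h) {
            float[] y = new float[x.length + h.length - 1];
            for (int n = 0; n < y.length; n++)
                for (int k = Math.max(0, n - x.length + 1);
                     k <= Math.min(n, h.length - 1); k++)
                    y[n] += h[k] * x[n - k];
            return y;
        }

        public static void main(String[] args) {
            float[] frame = new float[480];    // one 10 ms frame at 48 kHz
            float[] lowPass = new float[257];  // order-256 FIR low-pass filter
            long start = System.nanoTime();
            convolve(frame, lowPass);
            System.out.printf("Convolution took %.3f ms%n",
                    (System.nanoTime() - start) / 1e6);
        }
    }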

The second experiment is designed to show the functionality of the framework: multichannel speech enhancement (SE) on a Google Pixel that works as an assistive device for HADs. In this work, the minimum variance distortionless response (MVDR) beamformer is combined with the log-spectral amplitude estimator (Log-MMSE) SE gain function to increase quality and intelligibility. The pipeline of the proposed method is shown in Figure 2.3.

Figure 2.3. Block Diagram of Proposed SE method

As seen in Figure 2.3, the signal is captured by Mic 1 and Mic 2, and then windowing and the Short-Time Fourier Transform (STFT) are applied. The MVDR beamformer extracts the desired signal by applying a linear filter to the output of the STFT. The Log-MMSE gain function is calculated from the output of the MVDR beamformer and multiplied by it. Finally, the Inverse Fast Fourier Transform and speech synthesis are applied, and the enhanced speech is sent to the HAD. The proposed SE method is evaluated by objective and subjective tests, and the results are compared with traditional methods [28].

The real-time implementation of the proposed SE method is developed on a Google Pixel running Android 7.1 Nougat. The sampling frequency is 48 kHz, and the frame length is 10 ms for this application. Figure 2.4 shows a screenshot of the proposed SE method on the Pixel.

Figure 2.4. Screenshot of developed SE method

When the app is started, the microphones capture the audio signal and play it back to the HADs without any processing. When the bottom button is set to "ON", the captured signal is processed, and the enhanced signal is sent to the HAD. The user-friendly SE application was also evaluated by users, and we received good feedback from them.

There is one more experiment showing the functionality of the framework. It is not presented in this chapter because it, real-time DOA estimation, is the main focus of the thesis. The details of real-time DOA estimation are given in Chapter 4.

2.4 Summary

A dual-channel input/output framework for real-time audio signal processing on Android platforms was proposed in this chapter. A large variety of audio signal processing algorithms, such as direction of arrival (DOA) estimation, audio compression, and multichannel speech enhancement, can be performed on Android platforms thanks to the proposed framework. Another significant advantage of the framework is that smartphones or tablets can be used as assistive devices for hearing aid devices (HADs). Two experiments were given to show the effectiveness and functionality of the proposed framework: Experiment 1 provides an idea of the timing of the proposed method, and Experiment 2 shows the value of using both microphones of the smartphone. Another example of the framework, real-time DOA estimation, is explained in detail in Chapter 4.


CHAPTER 3

TWO MICROPHONE BASED SPEECH SOURCE LOCALIZATION

3.1 Motivation

Speech source localization (SSL) using direction of arrival (DOA) estimation allows us to exploit speech-processing techniques in order to increase the quality and intelligibility of speech. The most popular DOA estimation algorithms are divided into time difference of arrival (TDOA) based methods [15], [29], [30], steered beamformers [31, 32], high-resolution DOA estimation [33, 34, 35], and sparse representation-based DOA estimation [36]. TDOA-based methods estimate the inter-microphone time delay (ITD) using a pair of microphones and convert it into direction information using knowledge of the microphone arrangement and other physical parameters, such as the speed of sound (c) in the medium and the distance between the microphones. In this thesis, TDOA is used for estimating the DOA, since TDOA methods are simple, computationally efficient, and have reasonable performance under high signal-to-noise ratio (SNR). Two-microphone DOA estimation for a single or dominant speech source is explained in this chapter. One of the most popular TDOA methods, generalized cross-correlation (GCC) based TDOA estimation, is used in this thesis for the reasons explained above. First, a literature review of GCC and existing methods is given; then our method is proposed. The proposed methods are compared with existing methods in simulations, and based on the simulation results, the best method is implemented on Android smartphones.


3.2 Literature Review

Generalized Cross Correlation is considered as sufficient method for speech source localization on mobile platforms. Background of traditional GCC and frequency weighted GCC is given under following topics.

3.2.1 Direction of Arrival Estimation using GCC

The traditional GCC framework for two microphones is explained in this subsection; the mathematical notation follows [30] for easier comparison. Consider single-path plane-wave propagation of sound from a single source $s(n)$, received as two signals $x_1(n)$ and $x_2(n)$, which are delayed and attenuated versions of $s(n)$, by two sufficiently separated microphones. $n_1(n)$ and $n_2(n)$ are the noise at microphone 1 and microphone 2, respectively; they are additive and uncorrelated with $s(n)$. Then

$$x_1(n) = s(n + D) + n_1(n) \tag{3.1}$$

$$x_2(n) = \alpha s(n + D + \eta) + n_2(n) \tag{3.2}$$

where $\alpha$ denotes the attenuation factor, $D$ represents the delay from the signal source to the first microphone, and $\eta$ is the TDOA we wish to estimate. The finite-duration discrete generalized cross-correlation (GCC) function between $x_1(n)$ and $x_2(n)$ is estimated in order to determine $\eta$ and the DOA $\hat{\theta}$ relative to the broadside direction ($0^{\circ}$). $\hat{\gamma}_{x_1x_2}(m)$ denotes the GCC function of $x_1(n)$ and $x_2(n)$ and is given by

$$\hat{\gamma}_{x_1x_2}(m) = E[x_1(n)\, x_2(n-m)] \tag{3.3}$$

where $E$ is the expectation operator, $\hat{(\cdot)}$ denotes an estimated value, and $m$ is the lag variable. The cross power spectral density $\Gamma_{x_1x_2}(f)$ in the frequency domain is used for estimating $\eta$ in order to reduce computational complexity:

$$\hat{\gamma}_{x_1x_2}(m) = \int_{-\infty}^{\infty} \hat{\Gamma}_{x_1x_2}(f)\, e^{j2\pi f m}\, df \tag{3.4}$$

where

$$\hat{\Gamma}_{x_1x_2}(f) = X_1(f)\, X_2^{*}(f) \tag{3.5}$$

$X_1(f)$ and $X_2(f)$ denote the Fourier transforms of the data captured by microphones 1 and 2, respectively, and $*$ represents the complex conjugate.

The estimated TDOA is the $\eta$ that maximizes (3.4):

$$\hat{\eta} = \arg\max_{\eta}\, \hat{\gamma}_{x_1x_2}(\eta) \tag{3.6}$$

Assuming $n_1(n)$ and $n_2(n)$ are uncorrelated (i.e., $\gamma_{n_1n_2}(m) = 0$), $\gamma_{x_1x_2}(m)$ is

$$\gamma_{x_1x_2}(m) = \alpha\, \gamma_{ss}(m) \circledast \delta(m - \eta) \tag{3.7}$$

where $\delta(m - \eta)$ produces a sharp peak at the correct TDOA, i.e., at $m = \eta$, under sufficient conditions.


3.2.2 Direction of Arrival Estimation using Frequency Weighted GCC

Most speech content is found at lower frequencies (up to 2-3 kHz). When the microphones are close together, the width of the GCC peak depends on the frequency content and is wider for low-frequency components. This degrades TDOA estimation because of flatter, smeared peaks.

Figure 3.1. Traditional GCC based DOA estimation with the pre-filtering method: the microphone signals $x_1(n)$ and $x_2(n)$ pass through the pre-filters $H_1(f)$ and $H_2(f)$ to give $y_1(n)$ and $y_2(n)$ before the GCC stage produces $\hat{\theta}$.

As seen in Figure 3.1, [30] proposes to pre-filter the signals $X_1(f)$ and $X_2(f)$ prior to the calculation of $\hat{\Gamma}_{x_1x_2}(f)$ in (3.5). The pre-filters emphasize the signal frequencies where the SNR is highest. The pre-filters should be functions of the signal and noise spectra (which can be either known a priori or estimated).

Let $\psi_g(f)$ be the weighting function; then

$$\psi_g(f) = H_1(f)\, H_2^{*}(f) \tag{3.8}$$

where $H_1(f)$ and $H_2(f)$ are the pre-filters for each microphone, as in Figure 3.1. When the pre-filter output signals are denoted by $y_1(n)$ and $y_2(n)$, the output GCC $\hat{\gamma}_{y_1y_2}(m)$ is computed from the output cross power spectral density $\Gamma_{y_1y_2}(f)$ using

$$\hat{\gamma}_{y_1y_2}(m) = \int_{-\infty}^{\infty} \psi_g(f)\, \hat{\Gamma}_{x_1x_2}(f)\, e^{j2\pi f m}\, df \tag{3.9}$$

In the discrete-time domain, we can rewrite (3.9) as

$$\hat{\gamma}_{y_1y_2}(m) = \sum_{k=0}^{N-1} \psi_g(k)\, \hat{\Gamma}_{x_1x_2}(k)\, e^{j2\pi k m / N} \tag{3.10}$$

where $k$ is the discrete frequency bin and $N$ is the FFT length.

The TDOA is estimated from $\hat{\gamma}_{y_1y_2}(m)$ as in (3.6). For traditional GCC [30], the weighting function is equal to one, so $H_1(f) = H_2(f) = 1 \ \forall f$. For reliable TDOA resolution, $\psi_g(f)$ should not distort the sharp peak in $\hat{\gamma}_{y_1y_2}(m)$. Table 3.1 summarizes some common weighting functions with their advantages and disadvantages; details can be found in [30].


Table 3.1. Summary of weighting functions $\psi_g(f) = H_1(f) H_2^{*}(f)$

- GCC [15]: $\psi_g(f) = 1$. Advantage: simplicity. Disadvantage: no frequency weighting.

- Phase Transform (PHAT) [15]: $\psi_g(f) = 1 / |\hat{\Gamma}_{x_1x_2}(f)|$. Advantage: robust to reverberation. Disadvantage: diverges when $\hat{\Gamma}_{x_1x_2}(f) \approx 0$.

- Smoothed Coherence Transform (SCOT) [28]: $\psi_g(f) = 1 / \sqrt{\hat{\Gamma}_{x_1x_1}(f)\, \hat{\Gamma}_{x_2x_2}(f)}$. Advantage: pre-whitening. Disadvantage: sub-optimal under non-ideal conditions.

- Eckart [29]: $\psi_g(f) = \dfrac{\hat{\Gamma}_{s_1s_2}(f)}{\hat{\Gamma}_{n_1n_1}(f)\, \hat{\Gamma}_{n_2n_2}(f)} \approx \dfrac{|\hat{\Gamma}_{x_1x_2}(f)|}{\left[\hat{\Gamma}_{x_1x_1}(f) - |\hat{\Gamma}_{x_1x_2}(f)|\right] \cdot \left[\hat{\Gamma}_{x_2x_2}(f) - |\hat{\Gamma}_{x_1x_2}(f)|\right]}$. Advantage: suppresses frequency bands of high noise. Disadvantage: requires estimates of the noise and speech spectra.

- Maximum Likelihood (ML) [30]: $\psi_g(f) = \dfrac{|\beta_f|^2}{|\hat{\Gamma}_{x_1x_2}(f)|\left[1 - |\beta_f|^2\right]}$ with $\beta_f = \dfrac{|\hat{\Gamma}_{x_1x_2}(f)|}{\sqrt{\hat{\Gamma}_{x_1x_1}(f)\, \hat{\Gamma}_{x_2x_2}(f)}}$. Advantage: robust to ambient noise. Disadvantage: sub-optimal under non-ideal conditions.

- GCC-PHAT-β [31]: $\psi_g(f) = 1 / |\hat{\Gamma}^{W}_{x_1x_2}(f)|^{\beta}$, $\beta = 0.5$, with Wiener weighting ($W$). Advantage: adapts to the background environment. Disadvantage: no consistent relationship between β and the input SNR.

3.2.3 Direction of Arrival (DOA) Estimation on Smartphone/Tablet

The two-microphone setup for DOA estimation on smartphones is shown in Figure 3.2. $\theta$ is the DOA angle to be estimated, $d$ is the distance between the microphones in meters, and $\eta$ is the time delay in seconds. The speech is assumed to be a point source under the far-field assumption.


Figure 3.2. Two-microphone DOA estimation setup: a speech source at angle $\theta$ relative to the line joining Mic 1 and Mic 2, which are separated by a distance $d$.

To estimate the DOA, the distance between the microphones, $d$, and the speed of sound in the medium, $c$, must be known. Since our application area is smartphones and tablets, $d = 0.13$ m or $0.16$ m is assumed (the most common microphone separations for 5.1-6.3-inch smartphones/tablets). The speed of sound is taken as $c = 343$ m/s. Under these assumptions, $\hat{\theta}$ (in radians) can be derived for the far-field scenario as

$$\hat{\theta} = \cos^{-1}\left( \frac{\hat{\eta}\, c}{d} \right) \tag{3.11}$$

As seen from (3.11), DOA estimation depends strongly on $\hat{\eta}$, since $d$ and $c$ are fixed. We cannot control $c$ as long as DOA estimation is done in the same environment. Increasing $d$ improves DOA estimation, but only up to the point where it causes spatial aliasing, which produces unwanted peaks in the spectrum and degrades estimation performance. On the other hand, increasing the sampling frequency improves DOA estimation performance, as a high sampling rate provides more data (information) than a low sampling rate.
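Putting (3.3), (3.6), and (3.11) together, the listing below sketches a direct time-domain GCC search over the physically possible lags. The thesis computes the GCC in the frequency domain for efficiency, so this form is shown for clarity only, and the names are illustrative.

    // Direct time-domain sketch of GCC-based TDOA estimation, eqs. (3.3) and
    // (3.6), with the angle conversion of eq. (3.11).
    public class GccDoaSketch {
        private static final double C = 343.0;  // speed of sound (m/s)
        private static final double D = 0.13;   // microphone spacing (m)

        // x1, x2: one frame per microphone; fs: sampling rate (Hz).
        // Returns the estimated DOA in degrees.
        public static double estimateDoa(float[] x1, float[] x2, int fs) {
            int maxLag = (int) Math.ceil(D / C * fs); // physically possible lags
            int bestLag = 0;
            double bestVal = Double.NEGATIVE_INFINITY;
            for (int m = -maxLag; m <= maxLag; m++) {
                double sum = 0.0;                     // eq. (3.3) over the frame
                for (int n = Math.max(0, m); n < x1.length + Math.min(0, m); n++)
                    sum += x1[n] * x2[n - m];
                if (sum > bestVal) { bestVal = sum; bestLag = m; } // eq. (3.6)
            }
            double cosTheta = (bestLag / (double) fs) * C / D;     // eq. (3.11)
            cosTheta = Math.max(-1.0, Math.min(1.0, cosTheta));
            return Math.toDegrees(Math.acos(cosTheta));
        }
    }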


3.3 Proposed DOA Estimation Method

The pipeline of the proposed DOA estimation method is shown in Figure 3.3. According to the pipeline, the proposed method has three main steps: pre-processing, GCC, and post-processing. Each step is explained in detail in the subsections below.

Figure 3.3. Block diagram of the proposed method: the inputs $x_1(n)$ and $x_2(n)$ are pre-processed by $H(f)$ to give $y_1(n)$ and $y_2(n)$; GCC estimates $\hat{\theta}$; and a VAD-based post-processing stage passes the estimate only for speech frames.

3.3.1 Pre-Processing Module

Instead of using a frequency weighting function, we propose to use different speech enhancement (SE) methods as the pre-filter, since the signal model used to derive most frequency weighting functions is not appropriate for real-time scenarios. Therefore, we suggest methods that can suppress more noise rather than frequency weighting functions. Three different pre-filtering methods are used in this thesis. There is only one criterion when choosing the pre-processing method: the selected method must not modify the phase spectrum of the input signal, since the DOA information resides in the phase spectrum of the input signal.

The idea behind using SE methods for pre-processing is to increase noise suppression and the signal-to-noise ratio (SNR). The only shortcoming of using an SE method is that the computational complexity of the system increases; since the computational complexity of GCC itself is very low, the combination of pre-processing and GCC is still feasible on smartphones. All the methods we use are single-channel: according to Figure 3.3, the gain function is calculated for each channel separately. Hence, the following derivations are for a single channel. The following methods are used for pre-processing.

Spectral Subtraction

One of the oldest SE methods is spectral subtraction, proposed by Boll [37]. In this method, the noise spectral amplitude is estimated and then subtracted from the observed signal. The spectral subtraction method estimates the speech spectral amplitude without phase modification, which is crucial for DOA estimation. The spectral subtraction gain is given by

$$H_{k,SS} = 1 - \frac{|\hat{V}_k|}{|Y_k|} \tag{3.12}$$

where $|\hat{V}_k|$ is an estimate of the noise spectral amplitude; generally, $|\hat{V}_k|$ is selected to be $E[|V_k|]$. The disadvantage of spectral subtraction is that it does not consider any feature of the speech spectrum, which leads to erroneous estimates of $|\hat{V}_k|$.

Wiener Filter

The Wiener filter utilizes the probabilistic features of both the speech and noise spectra. It is obtained by minimizing the mean square error between the clean and estimated signals:

$$J = E\left[ \left| S_k - \hat{S}_k \right|^2 \right] \tag{3.13}$$

With $\hat{S}_k = G_k Y_k$, the cost function expands to

$$J = \sigma_{S_k}^2 + |G_k|^2 E[|Y_k|^2] - G_k E[S_k^{*} Y_k] - G_k^{*} E[S_k Y_k^{*}] \tag{3.14}$$

To find the minimum of the cost function, we differentiate $J$ with respect to $G_k^{*}$ and set the result equal to zero, which gives the Wiener filter gain at frequency bin $k$ as

$$H_{k,\mathrm{Wiener}} = \frac{\sigma_{S_k}^2}{\sigma_{S_k}^2 + \sigma_{V_k}^2} = \frac{\xi_k}{\xi_k + 1} \tag{3.15}$$

where $\xi_k = \sigma_{S_k}^2 / \sigma_{V_k}^2$ is the a priori SNR.

Minimum Mean Square Error Log-Spectral Amplitude Estimator (MMSE-LSA)

The minimum mean square error log-spectral amplitude estimator (MMSE-LSA) was proposed by Ephraim and Malah [38]. It is one of the methods that does not modify the phase spectrum of the signal, as required for DOA estimation. The estimator is obtained by minimizing the following cost function:

$$J_{MMSE} = E\left[ \left| S_k - \hat{S}_k \right|^2 \,\middle|\, Y_k \right] \tag{3.16}$$

Since the focus of the thesis is not the derivation of MMSE-LSA, the optimization steps are skipped. The decision-directed a priori SNR estimate is

$$\hat{\xi}_k(i) = \frac{a\, \hat{S}_k^2(i-1)}{\lambda_d(i-1)} + (1-a)\, \max\left( \gamma_k(i) - 1,\, 0 \right) \tag{3.17}$$

where $0 < a < 1$ and $i$ denotes the frame number. $\hat{S}_k^2(i-1)$ and $\lambda_d(i-1)$ denote the estimated speech and noise spectra of the previous frame, respectively. $\gamma_k$ is the a posteriori SNR, defined as

$$\gamma_k(i) = \frac{x_k^2(i)}{\lambda_d(i-1)} \tag{3.18}$$

where $x_k^2(i)$ refers to the spectrum of the current noisy frame. The MMSE-LSA filter is then

$$H(k, i) = \gamma_k(i)\, \frac{\hat{\xi}_k(i)}{\hat{\xi}_k(i) + 1} - \log\left( 1 + \hat{\xi}_k(i) \right) \tag{3.19}$$

where $k$ denotes the frequency bin. Even though a more sophisticated method could suppress more noise, computational complexity limits us: since the proposed method will be used on a smartphone, its complexity should be as low as possible.
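As a concrete illustration of the decision-directed estimates in (3.17)-(3.18), the per-bin sketch below applies the Wiener gain of (3.15) for brevity; substituting the MMSE-LSA gain of (3.19) in the last line of the loop recovers the thesis's choice. The noise tracking and all names are illustrative assumptions.

    // Per-bin sketch of the decision-directed pre-filter, eqs. (3.15),
    // (3.17), (3.18). Wiener gain used for brevity.
    public class PreFilterSketch {
        private static final double A = 0.98;  // smoothing factor a in (3.17)
        private double[] prevGain;             // G_k of the previous frame
        private double[] prevPowSpec;          // |Y_k|^2 of the previous frame
        private final double[] noisePow;       // lambda_d, assumed pre-estimated

        public PreFilterSketch(int bins, double[] noisePowerEstimate) {
            prevGain = new double[bins];
            prevPowSpec = new double[bins];
            java.util.Arrays.fill(prevPowSpec, 1e-10);  // avoid divide-by-zero
            noisePow = noisePowerEstimate.clone();
        }

        // powSpec[k] = |Y_k|^2 of the current frame; returns the per-bin gain.
        public double[] gains(double[] powSpec) {
            double[] g = new double[powSpec.length];
            for (int k = 0; k < powSpec.length; k++) {
                double gammaK = powSpec[k] / noisePow[k];           // eq. (3.18)
                double prevSpeech = prevGain[k] * prevGain[k] * prevPowSpec[k];
                double xiK = A * prevSpeech / noisePow[k]
                           + (1 - A) * Math.max(gammaK - 1.0, 0.0); // eq. (3.17)
                g[k] = xiK / (xiK + 1.0);                           // eq. (3.15)
            }
            prevGain = g.clone();          // state for the next frame
            prevPowSpec = powSpec.clone();
            return g;
        }
    }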

3.3.2 DOA Estimation Module

The generalized cross-correlation method is used to estimate the direction of arrival (DOA); this module is therefore also called the GCC module. The noisy speech first goes through the pre-processing module, whose output is the enhanced signal, i.e., the SNR is higher than before the pre-processing step. The GCC module's role is then to find the DOA angle of the enhanced signal. The details of the GCC method are covered in Section 3.2.1.


3.3.3 Post-Processing Module

A voice activity detector (VAD) is used as the post-processing module. The VAD classifies signal frames as noise or speech frames. Since background noise leads to incorrect DOA estimates, the VAD plays a critical role in DOA estimation: when the VAD classifies the current frame as speech, the DOA estimate is updated; when it classifies the frame as noise, the DOA estimate is not modified and remains the previous frame's estimate. This can be formulated as

$$\hat{\hat{\theta}}_i = \begin{cases} \hat{\hat{\theta}}_{i-1}, & \text{if } VAD = 0 \ (\text{Noise}) \\ \hat{\theta}_i, & \text{if } VAD = 1 \ (\text{Speech}) \end{cases} \tag{3.20}$$

where $\hat{\hat{\theta}}_i$ is the modified DOA estimate after the VAD for the $i^{th}$ frame. Hence, the VAD provides a smoothed DOA estimate under noisy conditions. In this thesis, a simple single-feature VAD is used. The feature is the spectral flux (SF), whose robustness under stationary noise conditions has been verified [39, 40]. In this thesis, the SF is

$$SF(i) = \frac{1}{N} \sum_{k} \left( |X_i(k)| - |X_{i-1}(k)| \right)^2 \tag{3.21}$$

for the $i^{th}$ frame, where $k = 1, 2, \dots, N$ indexes the frequency bins. The VAD then decides according to

$$VAD = \begin{cases} 0 \ (\text{Noise}), & \text{if } SF(i) < \xi \\ 1 \ (\text{Speech}), & \text{if } SF(i) \ge \xi \end{cases} \tag{3.22}$$

where $\xi$ denotes the threshold.

The SF-based VAD has two parameters, namely 'D' and 'T'. A decision buffer of 'D' frames is used for non-stationary noise conditions: in other words, the SF-based VAD waits for 'D' successive frames before declaring speech. 'T' is used to determine the threshold $\xi$: during initialization, the first 'T' frames are considered noise/silence, and $\xi$ is calculated under this assumption. Since the SF-based VAD has low complexity and is satisfactorily robust, the implementation of the proposed DOA estimation method on a smartphone works well.
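The listing below sketches the SF-based VAD with the D-frame decision buffer and the T-frame threshold initialization, combined with the DOA hold of (3.20). The exact threshold scaling over the first T frames is an assumption, and all names are illustrative, not the thesis's code.

    // Minimal sketch of the SF-based VAD of eqs. (3.20)-(3.22).
    public class SpectralFluxVad {
        private final int d, t;          // decision buffer D, training frames T
        private double threshold = 0.0;  // xi in eq. (3.22)
        private int framesSeen = 0, speechRun = 0;
        private double[] prevMag;        // |X_{i-1}(k)|
        private double heldDoa = 90.0;   // last accepted DOA (broadside start)

        public SpectralFluxVad(int d, int t) { this.d = d; this.t = t; }

        // mag[k] = |X_i(k)| for the current frame; rawDoa = GCC estimate.
        // Returns the post-processed DOA of eq. (3.20).
        public double update(double[] mag, double rawDoa) {
            double sf = 0.0;
            if (prevMag != null) {
                for (int k = 0; k < mag.length; k++) {
                    double diff = mag[k] - prevMag[k];
                    sf += diff * diff;
                }
                sf /= mag.length;                           // eq. (3.21)
            }
            prevMag = mag.clone();
            if (++framesSeen <= t) {                        // first T frames: noise
                threshold = Math.max(threshold, 2.0 * sf);  // assumed scaling
                return heldDoa;
            }
            speechRun = (sf >= threshold) ? speechRun + 1 : 0;  // eq. (3.22)
            if (speechRun >= d) heldDoa = rawDoa;           // speech: update DOA
            return heldDoa;                                 // noise: hold previous
        }
    }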

3.3.4 Proposed DOA Estimation Method and Sound Source-Tracking

The proposed DOA estimation algorithm follows four steps, as in Figure 3.3:

o Pre-filtering using SE methods in order to enhance the signal
o TDOA ($\hat{\eta}$) estimation using GCC
o Converting the TDOA to the DOA ($\hat{\theta}$) for a two-microphone array
o Post-filtering via the SF-based VAD to smooth the DOA estimate under noisy conditions

Since the entire process is done frame by frame, an instantaneous DOA estimate is obtained. Sound tracking is thus possible under noisy conditions, under the assumption that each frame contains only one dominant speech source of interest [41].

3.4 Experimental Evaluations

In this section, experimental results are given for DOA estimation. Before showing the performance of the proposed method, the noisy dataset, the experimental setup, and the evaluation metrics are explained.

3.4.1 Noisy Dataset and Experimental Setup

The clean speech files from the HINT database [42] are stereo (dual-microphone), and the image-source model [43] is used to generate each DOA angle $\theta$. Figure 3.4 shows the setup of the simulation experiments. The room size is assumed to be 5 m x 5 m x 5 m, and the microphone array is positioned at the center of the room (2.5 m, 2.5 m, 2.5 m), as shown in Figure 3.4. The Active Sound Level (ASL) is used for the SNR calculation, i.e., when there is no speech, SNR → −∞, since the SNR is calculated only over speech frames; the ASL approach provides more realistic SNR conditions. This simulated dataset has no reverberation. We consider five different DOA angles {0°, 45°, 90°, 135°, 180°} with respect to the end-fire direction for each of the ten clean speech files. Four different noise types {White (W), Machinery (M), Traffic (T), Babble (B)} at four different SNRs, {-5, 0, 5, 10} dB, are used for the simulations. Test results are averaged over almost 10,000 different noisy test cases, amounting to around ten hours of noisy data.

Figure 3.4. Image-source model (ISM) setup for the simulation experiments. Black markers represent microphones; hollow markers represent speaker locations.


We also recorded data using a Google Pixel. Figure 3.5 shows the setup for recording data with the Google Pixel smartphone. In this setup, five people sat around a table with 45° separation. The smartphone was placed at the center of the table, and a noise generator was placed at approximately 30° (between speakers 1 and 2). Four different types of noise {White (W), Machinery (M), Traffic (T), Babble (B)} were played, and the approximate SNR was 5 dB (SPL-A). The experiment room size was 4 m x 6 m x 8 m with reverberation ($RT_{60} \approx 0.2$ s). The noisy files are available at [19] as well.

Figure 3.5. Experimental setup for the real-data experiments: five talkers at 0°, 45°, 90°, 135°, and 180°, 1.5 m from the smartphone, with the noise source at about 30° and 0.5 m.


3.4.2 Evaluation Metrics

We use one metric to quantify the DOA estimation performance: the root mean square error (RMSE), formulated as

$$RMSE = \sqrt{ \frac{1}{N_F} \sum_{i=1}^{N_F} \left( \theta_i - \hat{\theta}_i \right)^2 } \tag{3.23}$$

where $N_F$ is the total number of frames, and $\theta_i$ and $\hat{\theta}_i$ are the correct and estimated DOA angles for the $i^{th}$ frame, respectively. The RMSE is calculated over the speech portion only; a lower RMSE means a better method.

We also evaluated the VAD performance, using two metrics: accuracy and miss rate, formulated as follows.

1. Accuracy (%):

$$ACC(\%) = \frac{\sum (TP + TN)}{\sum (TP + FP + TN + FN)} \times 100 \tag{3.24}$$

2. Miss Rate (%):

$$MR(\%) = \frac{\sum FN}{\sum (TN + FN)} \times 100 \tag{3.25}$$

where $TP$, $FP$, $TN$, and $FN$ denote the number of true positives, false positives, true negatives, and false negatives, respectively; the definitions are given in the following table. For the VAD decision, '0' and '1' represent the absence and presence of speech, respectively. A better method has higher ACC and lower MR.


Table 3.2. Decision matrix for the binary classifier (VAD)

Ground Truth   VAD Decision   Outcome
0 (Noise)      0 (Noise)      True Negative (TN)
0 (Noise)      1 (Speech)     False Positive (FP)
1 (Speech)     0 (Noise)      False Negative (FN)
1 (Speech)     1 (Speech)     True Positive (TP)

Ground truth (GT) is the baseline for labeling speech and noise frames. GT is determined from a -42 dB root mean square (RMS) threshold together with manually marked speech portions: if the RMS value of a clean speech frame is higher than the threshold and the frame lies in a speech portion, GT is set to 1 (Speech); otherwise GT = 0.
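For reference, the metrics of (3.23)-(3.25) can be computed per frame as in the sketch below (illustrative names; the speech-only masking for RMSE follows the text).

    // Sketch of the evaluation metrics of eqs. (3.23)-(3.25).
    public class MetricsSketch {
        public static double rmse(double[] trueDoa, double[] estDoa,
                                  boolean[] speech) {
            double sum = 0.0;
            int nf = 0;
            for (int i = 0; i < trueDoa.length; i++) {
                if (!speech[i]) continue;        // RMSE over speech frames only
                double e = trueDoa[i] - estDoa[i];
                sum += e * e;
                nf++;
            }
            return nf == 0 ? 0.0 : Math.sqrt(sum / nf);    // eq. (3.23)
        }

        // gt/vad: true = speech. Returns {ACC(%), MR(%)}.
        public static double[] accAndMr(boolean[] gt, boolean[] vad) {
            int tp = 0, fp = 0, tn = 0, fn = 0;
            for (int i = 0; i < gt.length; i++) {
                if (gt[i] && vad[i]) tp++;
                else if (!gt[i] && vad[i]) fp++;
                else if (!gt[i] && !vad[i]) tn++;
                else fn++;
            }
            double acc = 100.0 * (tp + tn) / (tp + fp + tn + fn); // eq. (3.24)
            double mr  = 100.0 * fn / (tn + fn);                  // eq. (3.25)
            return new double[] { acc, mr };
        }
    }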

3.4.3 Results and Discussions

In this section, the DOA estimation performance is presented. Before doing so, we show the performance of the proposed single-feature VAD. The proposed VAD is based on the spectral flux (SF) feature and is compared with three well-known VADs, the ITU-T standard used in G.729 [44], the VQ VAD [45], and the statistical VAD proposed by Sohn [46], over a variety of noise types and SNRs.

First, ITU-T G.729 [44] uses four features: spectral distortion, energy difference, low-band energy difference, and zero-crossing difference. The second method, the VQ VAD [45], first applies pre-processing based on spectral subtraction (SS) and then an energy-based VAD. The third, known as the Sohn VAD [46], uses a statistical, decision-directed estimate via a likelihood ratio.


Figure 3.6. (Top to bottom) Average accuracy (ACC) for White noise, Machinery noise, Traffic noise, and Babble noise at -5, 0, 5, and 10 dB SNR, comparing G729B, Sohn, VQVAD, and the proposed VAD. Higher ACC is preferred.


Figure 3.7. (Top to bottom) Average miss rate (MR) for White noise, Machinery noise, Traffic noise, and Babble noise at -5, 0, 5, and 10 dB SNR, comparing G729B, Sohn, VQVAD, and the proposed VAD. Lower MR is preferred.

Figures 3.6 and 3.7 show the accuracy (ACC) and miss rate (MR), respectively, of the proposed VAD and the other three VADs over different noise types and SNRs. For this experiment, the sampling rate is 48 kHz, and the frame length is 100 milliseconds (ms); 'D' and 'T' are set to 1 and 5, respectively, for the proposed VAD. The performance of all four VADs increases with SNR, irrespective of noise type. The proposed VAD performs best among the four, where best performance refers to the highest ACC and lowest MR. As seen in the figures, the toughest noise type is Babble and the easiest is White noise, as expected.

Figure 3.8. (From top to bottom) Noisy input signals (5 dB SNR, Babble noise, real recorded data), enhanced signals after MMSE-LSA, spectral flux (SF) feature for the enhanced signals, inter-microphone SF difference, and ground truth and VAD decisions


Figure 3.8 displays the proposed VAD's performance on recorded data, for the setup shown in Figure 3.5. For Figure 3.8, the data were recorded under Babble noise, the SNR is around 5 dB, and the reverberation time is about 200 ms. The SF feature for Mic 1 and Mic 2 can be seen in this figure. Compared with the ground truth, the proposed VAD performs satisfactorily on the recorded data.

Figure 3.9. (From top to bottom) DOA estimation and source tracking for the noisy data of Fig. 3.8 (top), with talkers 1-5 at 0°, 45°, 90°, 135°, and 180°, for traditional GCC (middle) and the proposed method (bottom). Solid blue and dashed red lines represent the estimated DOA and the ground-truth DOA, respectively.

Figure 3.9 was obtained using the same data as Figure 3.8. It compares the sound-tracking capability of the proposed DOA estimation method (MMSE-LSA pre-filtering, GCC, SF-based VAD post-filtering) against the traditional GCC method. The DOA angle of the recorded data changes every 2-3 seconds by 45° from the end-fire direction. Thanks to the SF-based VAD, the proposed DOA method is able to track the DOA of the speech source, and its tracking ability is better than that of traditional GCC.

In the following figures, we analyze the DOA estimation performance of the proposed method. We have three variants, denoted Proposed Method 1, Proposed Method 2, and Proposed Method 3:

 Proposed Method 1 refers to Pre-Filtering (MMSE-LSA) + GCC + Post-Filtering (proposed VAD)

 Proposed Method 2 refers to Pre-Filtering (Spectral Subtraction) + GCC + Post-Filtering (proposed VAD)

 Proposed Method 3 refers to Pre-Filtering (Wiener Filtering) + GCC + Post-Filtering (proposed VAD)

Figure 3.10 compares the proposed methods. 'D', 'L' (frame length), and 'T' are set to 2, 100 ms, and 5, respectively, and simulated data is used for this experiment. The average RMSE decreases with increasing SNR for all methods. According to Figure 3.10, the proposed methods have similar performance for White, Machinery, and Babble noise, but Proposed Method 1 has the lowest RMSE. Since the computational complexity of Proposed Methods 1 and 3 is similar and the lowest, we conclude that Proposed Method 1 is the best of the proposed methods for DOA estimation.


Figure 3.10. Average RMSE (°) for Proposed Method 1, Proposed Method 2, and Proposed Method 3 for different noisy conditions (White, Machinery, Traffic, and Babble noise at -5 to 10 dB SNR).


The experiment of Figure 3.10 is repeated in order to compare the effects of pre-filtering and post-filtering. The performance of GCC alone, GCC + post-filtering (SF-based VAD), pre-filtering (MMSE-LSA) + GCC, and Proposed Method 1 can be seen in Figure 3.11. The average RMSE decreases with increasing SNR for all methods. Proposed Method 1 has the lowest average RMSE, meaning it has the best performance. The RMSE results follow the same trend across all noise types. The average RMSE of Proposed Method 1 for Babble is the highest among the noise types, since Babble is the toughest noise type to handle. When the SNR is higher than 0 dB, the performance of Proposed Method 1 improves strikingly; at -5 dB, however, the improvement is subtle. The last observation from Figure 3.11 is that pre-filtering affects DOA estimation more than post-filtering.

We have compared Proposed Method 1 against frequency-weighted GCC methods. Two frequency-weighted GCC methods, selected because they have the highest performance, are used for Figure 3.12. The methods used in this experiment are GCC, GCC_PHAT_β, GCC_ML, and Proposed Method 1. Proposed Method 1 has the lowest average RMSE among the frequency-weighted GCC methods and traditional GCC. The performance of GCC_PHAT_β and GCC_ML is lower than expected. There are two possible reasons for this:

i. These methods are seldom tested on frame-based operations with a changing DOA.

ii. The dataset (SNR conditions) might be tough for these pre-filters.

Similar observations for the frequency-weighted GCC methods can be found in [47].


Figure 3.11. Average RMSE (°) for GCC, GCC + post-filter (VAD), GCC + pre-filter (MMSE-LSA), and Proposed Method 1 under White, Machinery, Traffic, and Babble noise at -5, 0, 5, and 10 dB SNR.


Figure 3.12. Average RMSE (°) for DOA estimation using different pre-filtering techniques (GCC, GCC_PHAT_β, GCC_ML, and the proposed DOA method) on simulated noisy data (White, Machinery, Traffic, and Babble noise) at -5 to 15 dB SNR. The proposed post-filter is included for all methods, for fairness.


Figure 3.13. Average RMSE (°) for DOA estimation using different pre-filtering techniques (GCC, GCC-ML, GCC-PHAT-β, and Proposed Method 1) on real data for different values of D and T (D = 1, T = 5, L = 100 ms; D = 2, T = 5, L = 100 ms; D = 1, T = 5, L = 200 ms, the best setting; D = 2, T = 5, L = 200 ms). The proposed VAD is included for all methods, for fairness.


Figure 3.13 compares the performance of Proposed Method 1 against GCC, GCC_PHAT_β, and GCC_ML on data recorded with a Google Pixel. Using recorded data is important because our final target is implementing the proposed DOA method on a smartphone; hence, a comparison on recorded data gives us an idea of realistic scenarios. We used the same post-filtering for GCC, GCC_PHAT_β, and GCC_ML for a fair comparison. We also tried different 'D' and 'L' values in order to select the best parameters for the smartphone. The lowest average RMSE is achieved when 'D' is 1, 'T' is 5, and 'L' is 200 ms. An error magnitude of less than 20° is acceptable at 1-meter separation, because it is good enough to differentiate speakers in the far-field scenario.

3.5 Summary

In this chapter, an instantaneous DOA estimation method for speech localization is proposed. The proposed DOA methods include a pre-processing step, the Generalized Cross Correlation (GCC) method, and a post-processing step. The pre-filtering step suppresses noise and increases the signal-to-noise ratio (SNR). Three methods are used for the pre-filtering step, namely MMSE LSA, Spectral Subtraction, and Wiener Filtering. We found MMSE LSA to be the best pre-filtering method in terms of high performance and low computational complexity. GCC is used to find the DOA angle. Post-filtering smooths the DOA angle in the presence of high noise. A Spectral Flux (SF) based VAD is proposed for post-filtering and compared against ITU G729, VQ VAD, and Sohn VAD; the proposed VAD performs best among these VADs.


The proposed DOA method is compared with traditional GCC and frequency-weighted GCC methods using both simulated and recorded data. The results show that the proposed method is suitable for real-time implementation on a smartphone [48].


CHAPTER 4

REAL TIME IMPLEMENTATION OF DOA ESTIMATION METHOD ON ANDROID

PLATFORMS

4.1 Motivation

Real-time implementation of speech localization on a smartphone, for the benefit of hearing-impaired people, is the primary goal of this thesis. Spatial hearing loss makes it difficult for people to localize a speaker. A speech source localization (SSL), or direction of arrival (DOA) estimation, application on a smartphone/tablet is a cost-effective solution.

The DOA estimation method was proposed in Chapter 3; the primary focus here is to implement it on Android platforms. However, the proposed method has some constraints for real-time operation, such as its computational complexity and the need to adapt the algorithm to a new environment. Because of these limitations, we have released four different versions of the DOA estimation application.

In this chapter, the shortcomings of real-time implementation of a DOA estimation application on mobile devices and the details of the four versions of the application are discussed.

4.2 Shortcomings of Real Time Implementation of SSL on Mobile Platforms

There are several constraints on implementing an SSL algorithm on mobile devices. First, frame-by-frame operation is needed for real-time processing: each captured frame has to be processed in less time than the frame length. The captured data is stored in buffers with limited capacity; if the processing time exceeds the frame length, data is not picked out of the buffer on time and the system begins to skip frames (a minimal deadline check is sketched below). Therefore, the computational complexity of the SSL method should be as low as possible. Second, SSL methods, including the proposed DOA method, assume that both microphones have the same properties and quality. However, this is not always valid: smartphone vendors design their products for profit and may use microphones of different quality for speech and for the camera, which can degrade the performance of the SSL algorithm. The last constraint stems from using only two microphones for DOA estimation: there is a right-left ambiguity, since the time difference of arrival is the same for a source on the right and its mirror image on the left, so there is no way to tell which side the source is on. The resolution range for two microphones is therefore 0° to 180°.
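As an illustration of the first constraint, the sketch below times one frame of processing against its real-time budget; the class, the constants, and processFrame() are hypothetical placeholders, not the released application's code.

```java
// Minimal sketch of a real-time deadline check for frame-based processing.
// 48 kHz / 40 ms matches the configuration used later in this chapter;
// processFrame() is a hypothetical stand-in for the SSL algorithm.
public final class FrameDeadlineCheck {

    static final int SAMPLE_RATE_HZ = 48000;
    static final int FRAME_MS = 40;
    static final int FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS / 1000; // 1920 per channel

    public static void main(String[] args) {
        short[] left = new short[FRAME_SAMPLES];
        short[] right = new short[FRAME_SAMPLES];

        long start = System.nanoTime();
        processFrame(left, right); // placeholder DSP work
        double elapsedMs = (System.nanoTime() - start) / 1e6;

        // If processing exceeds the frame length, the capture buffer is not
        // drained in time and the system begins to skip frames.
        if (elapsedMs > FRAME_MS) {
            System.out.println("Deadline missed: " + elapsedMs + " ms > " + FRAME_MS + " ms");
        }
    }

    private static void processFrame(short[] left, short[] right) {
        // hypothetical stand-in for pre-filtering + GCC + VAD
    }
}
```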

4.3 Real-Time Implementation of Proposed SSL Algorithm

In this section, the details of the real-time implementation of the SSL estimation method are given. There are four versions of the DOA estimation application; each version was released either to increase performance or to decrease computational complexity. We used three Android smartphones for the real-time implementations: Google Pixel 2 (one of the latest Android smartphones at the time of release), Google Pixel, and Samsung Galaxy S7. The aim of using different devices is to show that the proposed method is applicable to most Android smartphones. The framework described in Chapter 2 is used for implementing the SSL method on the smartphones. The details are given in the following subsections.

4.3.1 DOA Estimation Application Version 1

The combination of Generalized Cross Correlation (GCC) and the Spectral Flux (SF) based VAD is utilized for the first version of the app. It performs best when the sampling rate (Fs) is 48 kHz and the frame length (L) is 40 ms. A minimal sketch of the core GCC computation follows.
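To make the GCC step concrete, the following is a minimal time-domain sketch of unweighted cross-correlation TDOA estimation mapped to an angle; the microphone spacing MIC_DIST_M is an assumed value (actual spacing is device-specific), and the released app's implementation is not reproduced here.

```java
// Sketch of the core GCC-based DOA step: find the inter-microphone lag that
// maximizes the cross-correlation, then map the delay to an angle using the
// far-field model. MIC_DIST_M is an assumed, device-specific spacing.
public final class GccDoaSketch {

    static final double SPEED_OF_SOUND = 343.0; // m/s
    static final double MIC_DIST_M = 0.14;      // assumed mic spacing
    static final int FS = 48000;                // sampling rate (Hz)

    /** Returns the DOA estimate in degrees, from 0° to 180° (end-fire to end-fire). */
    public static double estimateDoa(double[] mic1, double[] mic2) {
        int maxLag = (int) Math.ceil(MIC_DIST_M / SPEED_OF_SOUND * FS); // physically possible lags
        int bestLag = 0;
        double bestCorr = Double.NEGATIVE_INFINITY;

        for (int lag = -maxLag; lag <= maxLag; lag++) {
            double corr = 0.0;
            for (int n = Math.max(0, -lag); n < mic1.length && n + lag < mic2.length; n++) {
                corr += mic1[n] * mic2[n + lag];
            }
            if (corr > bestCorr) { bestCorr = corr; bestLag = lag; }
        }

        double tdoa = bestLag / (double) FS;                  // delay of Mic 2 relative to Mic 1
        double cosTheta = SPEED_OF_SOUND * tdoa / MIC_DIST_M; // far-field plane-wave model
        cosTheta = Math.max(-1.0, Math.min(1.0, cosTheta));   // clamp numerical error
        return Math.toDegrees(Math.acos(cosTheta));
    }
}
```

Note that the GCC of [30] correlates in the frequency domain, which also enables the spectral weightings (PHAT, ML) compared earlier; the unweighted time-domain form above is an equivalent used only for illustration.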


Figure 4.1. Screenshot of the smartphone application on Google Pixel 2, showing the Start/Stop buttons, the tuning parameters (1. Threshold Time (T), 2. Duration (D)), the auto-calibrated threshold, and the estimated angle displayed both as a numeric value and graphically (blue marker).

The Graphical User Interface (GUI) of the application can be seen in Figure 4.1. The DOA application has two features that can be updated by users. The first is 'Threshold Time', mentioned earlier as 'T'. This parameter is used to determine a threshold for the VAD: the application assumes the first few frames are noise/silence, and the number of these decision frames is controlled by 'Threshold Time'. The second feature is 'Duration', which refers to the parameter 'D' in the SF VAD and is needed in the presence of non-stationary noise. Users can update both 'Threshold Time' and 'Duration' according to the noise characteristics (a sketch of this calibration follows). The location of the speaker is shown in two ways. The first is graphical: a blue marker in the app shows the direction of the speech source. The second is a text view named 'Angle', which shows the angle of the speech source as a number. There is another informative text view named 'Threshold', which displays the threshold after 'Threshold Time' has elapsed. When the 'Start' button is touched, the blue marker starts to indicate the speaker under noisy conditions. If there is only noise, the marker shows the last estimated location [41]. A video demonstration can be found at [19].
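As a rough illustration of the auto-calibration, the sketch below derives a threshold from the first 'Threshold Time' frames, which the app assumes to be noise/silence; the mean-based rule and the 1.5 scaling are assumptions for illustration, not the app's exact statistic.

```java
// Hypothetical sketch: derive the VAD threshold from the first few frames,
// which are assumed to contain only noise/silence. The mean-plus-margin rule
// is illustrative; the released app's exact statistic is not shown here.
public final class ThresholdCalibration {

    /**
     * @param sfFeature  spectral-flux value of each incoming frame
     * @param thresholdT number of initial frames assumed to be noise ('Threshold Time')
     */
    public static double calibrate(double[] sfFeature, int thresholdT) {
        double noiseMean = 0.0;
        for (int i = 0; i < thresholdT; i++) {
            noiseMean += sfFeature[i];
        }
        noiseMean /= thresholdT;
        // Frames whose SF exceeds the scaled noise mean are declared speech.
        return 1.5 * noiseMean; // scaling factor chosen only for illustration
    }
}
```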

This application performs well at high SNR, since there is no pre-filtering step to suppress noise. The main advantage of Version 1 is its very low computational complexity. Table 4.1 shows the timing of Version 1 on Google Pixel 2, Google Pixel, and Samsung Galaxy S7.

Table 4.1. The timing for DOA application Version 1

Version 1         Google Pixel 2 XL   Google Pixel   Samsung Galaxy S7
16 kHz, 100 ms    16.39 ms            31.19 ms       32.1 ms
48 kHz, 40 ms     10.36 ms            23.1 ms        26.1 ms

4.3.2 DOA Estimation Application Version 2

Version 2 is an implementation of Proposed Method 1: Pre-processing (MMSE LSA) + GCC + Post-processing (SF-based VAD). The pipeline of this method is given in Figure 3.3. The app performs best when Fs = 48 kHz and L = 40 ms.


The GUI of the app is given in Figure 4.2. This version has three features that can be updated by users. The features 'Threshold Time' and 'Duration' are explained above in Section 4.3.1. The third feature is the 'Enhance' button: when this button is red, the app performs as Version 1, i.e., there is no pre-filtering. To pre-filter the signal, the 'Enhance' button should be green. The operating procedure is otherwise the same as for Version 1.

Figure 4.2. Screenshots of the smartphone application on Google Pixel 2, with pre-filtering in the processing chain and without it; the blue marker shows the estimated DOA (θ̂).

The timing of the application is given in Table 4.2. The timings are not good, because the processing time exceeds the frame length. Since the pre-processing gain is calculated for both channels, it takes too much time, especially on Google Pixel and Samsung Galaxy S7 at 16 kHz and 100 ms. These timings cause data to be skipped, and the app does not perform well enough. This situation compelled us to develop a new version of the app.

Table 4.2. The timing for DOA application Version 2

Version 2         Google Pixel 2 XL   Google Pixel   Samsung Galaxy S7
16 kHz, 100 ms    31.06 ms            103.27 ms      101.3 ms
48 kHz, 40 ms     31.89 ms            61.47 ms       59.13 ms

4.3.3 DOA Estimation Application Version 3

For this version, we updated the pipeline of the Proposed Method, as shown in Figure 4.3. Previously, the pre-filter gain for each channel was calculated from that channel. In the modified block diagram of Figure 4.3, the gain is calculated from the first channel only, and the same gain is applied to the second channel. Therefore, the gain is computed once instead of twice, and the computational complexity decreases (a sketch of this gain sharing follows the figure).

Figure 4.3. Block diagram of the modified Proposed Method: the MMSE-LSA pre-filter gain H(k), computed from Mic 1, is applied to both channels; GCC then estimates the DOA, and the VAD decides whether the frame contains speech. H(k) is the pre-filter; θ̂ is the estimated DOA.
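The following minimal sketch illustrates the gain-sharing idea of Figure 4.3: the pre-filter gain is computed once, from channel 1, and reused for channel 2. computeMmseLsaGain() is a hypothetical placeholder; the actual MMSE-LSA gain computation is not reproduced here.

```java
// Sketch of the Version 3 modification: the per-bin MMSE-LSA gain is computed
// from channel 1 only and reused for channel 2, halving the gain computation.
// Spectra are stored as [bin][re, im]; computeMmseLsaGain() is a placeholder.
public final class SharedGainPreFilter {

    public static void preFilter(double[][] spec1, double[][] spec2) {
        for (int k = 0; k < spec1.length; k++) {
            double gain = computeMmseLsaGain(spec1[k]); // from channel 1 only
            scale(spec1[k], gain);                      // apply to channel 1
            scale(spec2[k], gain);                      // reuse for channel 2
        }
    }

    private static void scale(double[] bin, double g) {
        bin[0] *= g; // real part
        bin[1] *= g; // imaginary part
    }

    private static double computeMmseLsaGain(double[] bin) {
        return 1.0; // placeholder; the actual MMSE-LSA estimator is not shown
    }
}
```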


Before implementing this modification, we analyzed its effect on performance. Figure 4.4 compares the RMSE of the two-channel-based pre-filter gain calculation (shown as MMSE LSA) and the one-channel-based calculation (shown as Modified MMSE LSA). For this figure, the simulated data described in Section 3.4.1 is used. As seen in Figure 4.4, the modification causes only subtle performance degradation; therefore, the modified Proposed Method is suitable for implementation.

As seen in Table 4.3, the modified Proposed Method solves the computational complexity problem we had in Version 2. The application's processing time is now less than the frame length, so the app does not skip frames and performs well.

The GUI and operating procedure of Versions 2 and 3 are the same, so the screenshot is not repeated here.

Table 4.3. The timing for DOA application Version 3

Version 3         Google Pixel 2 XL   Google Pixel   Samsung Galaxy S7
16 kHz, 100 ms    28.06 ms            87.27 ms       66.87 ms
48 kHz, 40 ms     23.51 ms            58.67 ms       49.53 ms


Figure 4.4. Average RMSE (°) for the Proposed Method with Modified MMSE LSA and with MMSE LSA (legend: Modified MMSE LSA, logMMSE) under White, Machinery, Traffic, and Babble noise at -5, 0, 5, and 10 dB SNR.

4.3.4 DOA Estimation Application Version 4

The first three versions all use the SF-based VAD as the post-filtering step. The SF-based VAD treats only the first 'Threshold Time' frames as silence/noise, and the noise mean is calculated from them. Since the noise mean is never recalibrated, the SF VAD is not adaptive to a changing background noise or environment. Because of this limitation, we proposed a new VAD called the noise Adaptive Thresholding (AT) VAD [1]. The proposed method, whose pipeline is shown in Figure 4.5, combines feature smoothing and adaptive thresholding to handle changes in the background noise. Feature smoothing is applied, and the SF feature (defined in equation 3.21) is decomposed per frame into a speech-SF feature SF_s(i) and a noise-SF feature SF_n(i) using a recursive averaging technique, where i denotes the frame index.

Figure 4.5. Block diagram of the Adaptive Thresholding based VAD: the SF feature SF(i) is extracted from y(n), smoothed into the speech and noise features SF_s(i) and SF_n(i), and the adaptive thresholding stage makes the VAD decision.


$$SF_s(i) = \alpha_s\, SF_s(i-1) + (1-\alpha_s)\, SF(i) \tag{4.1}$$

$$SF_n(i) = \alpha_n\, SF_n(i-1) + (1-\alpha_n)\, SF(i) \tag{4.2}$$

where

$$\alpha_s = \begin{cases} \alpha_{As}, & \text{if } SF(i) \geq SF_s(i-1) \\ \alpha_{Ds}, & \text{if } SF(i) < SF_s(i-1) \end{cases}
\qquad
\alpha_n = \begin{cases} \alpha_{An}, & \text{if } SF(i) \geq SF_n(i-1) \\ \alpha_{Dn}, & \text{if } SF(i) < SF_n(i-1) \end{cases} \tag{4.3}$$

As seen from equation 4.3, the smoothing rates, controlled by $\alpha_s$ and $\alpha_n$, each depend on two parameters: the attack and decay constants for the speech SF feature are denoted by $\alpha_{As}$ and $\alpha_{Ds}$, respectively, and the attack and decay constants for the noise SF feature are denoted by $\alpha_{An}$ and $\alpha_{Dn}$, respectively. Based on our experiments, the following values are suggested:

• $\alpha_{As} = 10^{-6}$ (fast attack), $\alpha_{Ds} = 0.5$ (slow decay) for $SF_s(i)$

• $\alpha_{An} = 0.99995$ (slow attack), $\alpha_{Dn} = 10^{-6}$ (fast decay) for $SF_n(i)$

The VAD decision is made by comparing $SF_s$ and $SF_n$ per frame:

$$VAD(i) = \begin{cases} 1\ (\text{Speech}), & \text{if } SF_s(i) \geq \kappa\, SF_n(i) \\ 0\ (\text{Non-Speech}), & \text{if } SF_s(i) < \kappa\, SF_n(i) \end{cases} \tag{4.4}$$

where $\kappa$ is a tunable scaling factor depending on the background noise conditions.
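A compact sketch of the AT VAD defined by equations 4.1-4.4 is given below, using the suggested attack/decay constants; the zero initialization of the smoothed features is a simplification of ours, not specified in the text.

```java
// Sketch of the Adaptive Thresholding (AT) VAD of equations 4.1-4.4:
// the spectral-flux feature is recursively smoothed into a fast speech track
// SF_s and a slow noise track SF_n, and speech is declared when SF_s exceeds
// kappa times SF_n. Constants follow the values suggested in the text.
public final class AdaptiveThresholdVad {

    static final double ALPHA_AS = 1e-6;    // fast attack, speech track
    static final double ALPHA_DS = 0.5;     // slow decay, speech track
    static final double ALPHA_AN = 0.99995; // slow attack, noise track
    static final double ALPHA_DN = 1e-6;    // fast decay, noise track

    private double sfSpeech = 0.0; // SF_s(i-1); zero init is a simplification
    private double sfNoise = 0.0;  // SF_n(i-1)
    private final double kappa;    // tunable scaling factor of equation 4.4

    public AdaptiveThresholdVad(double kappa) {
        this.kappa = kappa;
    }

    /** Returns true if the current frame is declared speech. */
    public boolean isSpeech(double sf) {
        double alphaS = (sf >= sfSpeech) ? ALPHA_AS : ALPHA_DS; // equation 4.3
        double alphaN = (sf >= sfNoise) ? ALPHA_AN : ALPHA_DN;

        sfSpeech = alphaS * sfSpeech + (1 - alphaS) * sf;       // equation 4.1
        sfNoise = alphaN * sfNoise + (1 - alphaN) * sf;         // equation 4.2

        return sfSpeech >= kappa * sfNoise;                     // equation 4.4
    }
}
```

With these constants, SF_s follows upward jumps of the feature almost instantly while SF_n rises only very slowly, so short bursts are attributed to speech and slow drifts to the changing noise floor; no fixed noise mean is needed.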

Figure 4.6 shows the effect of the AT VAD on DOA estimation performance. To this end, we compare the RMSE of two systems: Modified MMSE LSA + GCC + SF VAD and Modified MMSE LSA + GCC + AT VAD. For this figure, the simulated data described in Section 3.4.1 is used.


Figure 4.6. Average RMSE (°) for the Proposed Method with SF VAD and with AT VAD under White, Machinery, Traffic, and Babble noise at -5, 0, 5, and 10 dB SNR.

The general trend is that the SF VAD performs better at low SNRs (-5 and 0 dB) and the AT VAD performs better at high SNRs (5 and 10 dB). The only exception is Babble noise, where both methods perform similarly at 5 dB. Figure 4.6 shows that the AT VAD is suitable for real-time systems.


Figure 4.7. Screenshots of the smartphone application on a Google Pixel 2 (Android 8.1), showing the Start/Stop buttons, the DOA (θ̂) text field, the tuning parameters, and the blue markers indicating the estimated DOA.

Figure 4.7 shows the GUI of the SSL application Version 4. We designed a new Preferences GUI for this release. Five parameters can be updated by the user: 'Use Prefilter?', 'Sampling Frequency', 'Frame Length', 'FFT Size', and 'Thresholding Level', which refers to κ in equation 4.4. After the app starts, there is no need to wait a few frames to calculate the noise mean; thanks to the AT VAD, the noise mean is calculated in the background. The angle of the speech source is indicated with blue markers as well as a text field. Timing information is given in Table 4.4; according to it, the computational complexity of Version 4 is suitable for real-time operation.


Table 4.4. The timing for DOA application Version 4

Version 4         Google Pixel 2 XL   Google Pixel   Samsung Galaxy S7
16 kHz, 100 ms    23.23 ms            89.32 ms       62.1 ms
48 kHz, 40 ms     23.71 ms            50.9 ms        50.38 ms

4.4 Summary

In this chapter, the real-time implementation of speech source localization (SSL) on smartphones is explained, along with its shortcomings. The direction of arrival (DOA) estimation method proposed in Chapter 3 is implemented on three Android smartphones. The method of Chapter 3 has some limitations in realistic scenarios: first, its computational complexity is too high for smartphone implementation; second, the proposed Voice Activity Detector (VAD) of Chapter 3 assumes the background noise is fixed, which does not hold in real life. The issues of decreasing the computational complexity and recalibrating the noise mean are solved in the newer versions, and the SSL application Version 4 is suitable for realistic scenarios.


CHAPTER 5

CONCLUSION

In this thesis, the real-time implementation of speech source localization (SSL) on Android platforms for hearing aid applications is studied. People with spatial hearing loss, especially those older than 60, have difficulty localizing a speaker. Given their popularity, smartphones are a cost-effective solution for hearing-impaired people; hence, this thesis is of critical importance for hearing aid applications.

The thesis has three significant contributions. The first is a stereo input/output framework for audio signal processing on Android platforms. Thanks to this framework, single- or dual-channel audio signal processing algorithms, such as Direction of Arrival (DOA) estimation, dynamic range audio compression, and multi/single-channel Speech Enhancement (SE), can be implemented on Android devices. The Android applications for the NIH Hearing Aid Project [19] will be developed using the proposed framework. The second contribution is a three-stage method for speech source localization using DOA. The first module of the method, called pre-filtering, suppresses noise and increases the signal-to-noise ratio (SNR); three pre-filtering techniques are compared, and the Minimum Mean Square Error Log Spectral Amplitude estimator is selected for real-time implementation. The second module is Generalized Cross Correlation (GCC), used for estimating the DOA angle; GCC, as a time difference of arrival (TDOA) method, has low computational complexity and performs well for two-microphone applications. The third stage, called post-processing, smooths the DOA estimate under noisy conditions; a Spectral Flux (SF) based Voice Activity Detector (VAD) is proposed as the post-filter.


The last contribution is the real-time implementation of the proposed DOA estimation method. The proposed method had some constraints for smartphone/tablet implementation, such as its computational complexity and its lack of adaptivity to background noise changes. Hence, the pre-filtering stage is modified, and a noise adaptive thresholding VAD is proposed for post-filtering. The fourth version of the DOA estimation application is ready for real-life hearing aid use.


REFERENCES

[1] A. Ganguly, A. Kucuk, I. Panahi, “Real-time Smartphone application for improving spatial awareness of Hearing Assistive Devices”, Engineering in Medicine and Biology Society (EMBC), 40th Annual International Conference of the IEEE, Aug. 2018.

[2] M.W. Hoffman, et al. "Robust adaptive microphone array processing for hearing aids: Realistic speech enhancement." The Journal of the Acoustical Society of America 96.2 (1994): 759-770.

[3] V. Montazeri, S. A. Khoubrouy, and I. M. S. Panahi. "Evaluation of a new approach for speech enhancement algorithms in hearing aids." Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE. IEEE, 2012.

[4] T. Van den Bogaert, et al. "Horizontal localization with bilateral hearing aids: without is better than with." The Journal of the Acoustical Society of America 119.1 (2006): 515- 526.

[5] E. Lindemann, "Two microphone nonlinear frequency domain beamformer for hearing aid noise reduction," Proceedings of 1995 Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 1995, pp. 24-27.

[6] M. Brandstein, D. Ward (eds.), Microphone Arrays: Signal Processing Techniques and Applications (Springer, Berlin, 2001).

[7] J. Benesty, J. Chen, Y. Huang, Microphone Array Signal Processing, vol. 1 (Springer Science & Business Media, 2008)

[8] I. J. Tashev, Sound Capture and Processing: Practical Approaches, Wiley, 2009.

[9] M.S. Brandstein, and S. M. Griebel. "Nonlinear, model-based microphone array speech enhancement." Acoustic signal processing for telecommunication. Springer US, 2000. 261-279.

[10] I. A. McCowan, J. Pelecanos, and S. Sridharan. "Robust speaker recognition using microphone arrays." 2001: A Speaker Odyssey-The Speaker Recognition Workshop. 2001.

[11] M. L. Seltzer, Microphone array processing for robust speech recognition. Diss. Carnegie Mellon University Pittsburgh, PA, 2003.


[12] I. Panahi, N. Kehtarnavaz and L. Thibodeau, "Smartphone-based noise adaptive speech enhancement for hearing aid applications," 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, 2016, pp. 85-88.

[13] A. Chern, Y. H. Lai, Y. P. Chang, Y. Tsao, R. Y. Chang and H. W. Chang, "A Smartphone-Based Multi-Functional Hearing Assistive System to Facilitate Speech Recognition in the Classroom," in IEEE Access, vol. 5, pp. 10339-10351, 2017.

[14] M. Mielke and R. Brueck, "Design and evaluation of a smartphone application for non- speech sound awareness for people with hearing loss," 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, 2015, pp. 5008-5011.

[15] A. Ganguly, A. Küçük and I. Panahi, " Real-time Smartphone implementation of noise- robust Speech source localization algorithm for hearing aid users," Proc. Mtgs. Acoust. 30, 055001 (2017).

[16] C. Karadagur Ananda Reddy, N. Shankar, G. Shreedhar Bhat, R. Charan and I. Panahi, "An Individualized Super-Gaussian Single Microphone Speech Enhancement for Hearing Aid Users With Smartphone as an Assistive Device," in IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1601-1605, Nov. 2017.

[17] Y. Rao, Y. Hao, I. M. S. Panahi and N. Kehtarnavaz, "Smartphone-based real-time speech enhancement for improving hearing aids speech perception," 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, 2016, pp. 5885-5888.

[18] Y. Hao, M. C. R. Charan, G. S. Bhat and I. M. S. Panahi, "Robust real-time sound pressure level stabilizer for multi-channel hearing aids compression for dynamically changing acoustic environment," 2017 51st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 2017, pp. 1952-1955.

[19] Smartphone-Based Open Research Platform for Hearing Improvement Studies, http://www.utdallas.edu/ssprl/hearing-aid-project/, May 2017.

[20] K. Kumatani, J. McDonough and B. Raj, "Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors," in IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 127-140, Nov. 2012.

[21] M. S. Brandstein, and H. F. Silverman. "A practical methodology for speech source localization with microphone arrays." Computer Speech & Language 11.2 (1997): 91- 126.


[22] A. Kucuk, Y. Hao, A. Ganguly, I. M. S. Panahi, “Stereo I/O Framework for Audio Signal Processing on Android Platforms”, Proceedings of Meetings on Acoustics, Acoustic Society of America (ASA 2018), Minneapolis, MN, May 2018.

[23] https://en.wikipedia.org/wiki/Latency_(audio)

[24] https://www.statista.com/statistics/271774/share-of-android-platforms-on-mobile- devices-with-android-os/

[25] https://developer.android.com/reference/android/media/AudioRecord.html

[26] https://developer.android.com/ndk/guides/index.html

[27] http://blog.techveda.org/android-architecture-essentials/

[28] N. Shankar, A. Küçük, C. Reddy, G. Bhat, I. M. S. Panahi, “Influence of MVDR beamformer on a Speech Enhancement based Smartphone application for Hearing Aids”, Engineering in Medicine and Biology Society (EMBC), 40th Annual International Conference of the IEEE, Aug. 2018

[29] M. S. Brandstein, J. E. Adcock, and H. F. Silverman. "A practical time-delay estimator for localizing speech sources with a microphone array." Computer Speech & Language 9.2 (1995): 153-169

[30] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, Aug 1976.

[31] H. Do, H. F. Silverman and Y. Yu, "A Real-Time SRP-PHAT Source Location Implementation using Stochastic Region Contraction(SRC) on a Large-Aperture Microphone Array," 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, Honolulu, HI, 2007, pp. I-121-I-124.

[32] M. Cobos, A. Marti and J. J. Lopez, "A Modified SRP-PHAT Functional for Robust Real-Time Sound Source Localization With Scalable Spatial Sampling," in IEEE Signal Processing Letters, vol. 18, no. 1, pp. 71-74, Jan. 2011.

[33] R. Schmidt, "Multiple emitter location and signal parameter estimation," in IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, Mar. 1986.

[34] H. Tang, “DOA estimation based on MUSIC algorithm”. [Thesis]. Linnæus University; 2014.

[35] R. Roy, A. Paulraj and T. Kailath, "Direction-of-arrival estimation by subspace rotation methods - ESPRIT," ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, pp. 2495-2498.


[36] D. Malioutov, M. Cetin and A. S. Willsky, "A sparse signal reconstruction perspective for source localization with sensor arrays," in IEEE Transactions on Signal Processing, vol. 53, no. 8, pp. 3010-3022, Aug. 2005.

[37] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, Apr 1979.

[38] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, Dec 1984.

[39] S. O. Sadjadi and J. H. L. Hansen, "Unsupervised Speech Activity Detection Using Voicing Measures and Perceptual Spectral Flux," in IEEE Signal Processing Letters, vol. 20, no. 3, pp. 197-200, March 2013.

[40] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in Proc. ICASSP '97, Munich, Germany, 1997.

[41] A. Ganguly, A. Küçük, I. M. S. Panahi, "Real-time Smartphone implementation of noise- robust Speech source localization algorithm for hearing aid users", Proceedings of Meetings on Acoustics, 2017 30:1, Boston, MA, May 2017.

[42] M. Nilsson, S. D. Soli, and J. A. Sullivan. "Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise." The Journal of the Acoustical Society of America 95.2 (1994): 1085-1099.

[43] E. Lehmann, A. Johansson, and S. Nordholm, "Reverberation-time prediction method for room impulse responses simulated with the image-source model," Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'07), pp. 159-162, New Paltz, NY, USA, October 2007.

[44] A. Benyassine, E. Shlomot, H.-Y. Su, D. Massaloux, C. Lamblin, and J.-P. Petit, "ITU-T recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Commun. Mag., vol. 35, pp. 64-73, Sep. 1997.

[45] T. Kinnunen and P. Rajan, "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, pp. 7229-7233.

[46] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Process. Lett., vol. 6, pp. 1–3, Jan.1999.


[47] J.M. Perez-Lorenzo, et al. "Evaluation of generalized cross-correlation methods for direction of arrival estimation using two microphones in real environments." Applied Acoustics 73.8 (2012): 698-712

[48] A. Ganguly, A. Küçük, I. M. S. Panahi, "Real-time noise robust dual-microphone Speech Source Localization and Tracking on Smartphone", IEEE Transactions on Audio, Speech and Language Processing, September 2017.


BIOGRAPHICAL SKETCH

Abdullah Küçük is currently pursuing his Master of Science degree (with thesis) at the Statistical Signal Processing Research Laboratory in the Erik Jonsson School of Engineering and Computer Science at The University of Texas at Dallas, Richardson, TX. He expects to graduate in August 2018 with a CGPA of 3.612. He received his Bachelor of Engineering degree in Electronics Engineering from Kadir Has University, Istanbul, Turkey, in 2011 with a CGPA of 3.46/4. His current research interests include real-time speech source localization, microphone array processing, voice activity detection, machine learning, speech enhancement, and adaptive filtering. He is a co-author of one journal paper and five conference papers and has two abstracts. He is a student member of the IEEE Signal Processing Society (SPS) and the IEEE Communications Society (ComSoc).

CURRICULUM VITAE

ABDULLAH KUCUK

Dallas, TX, 75252 [email protected] (762) 728 1063 www.linkedin.com/in/abdullah-kucuk

Objective
Actively seeking an internship for Summer 2019 in Digital Signal Processing (DSP), Direction of Arrival (DOA) Estimation, Speech Enhancement (SE), and Digital Audio Processing.

Education
PhD (Thesis), Electrical Engineering, University of Texas at Dallas, TX. Expected graduation: June 2020. CGPA: 3.61.
Master of Science (Thesis), Electrical Engineering, University of Texas at Dallas, TX, August 2016 - August 2018. CGPA: 3.6.
Bachelor of Engineering, Electronics Engineering (Honors), Kadir Has University, Turkey, August 2006 - June 2011. CGPA: 3.46.

Skills
Programming Languages: MATLAB, C/C++, Android.
Tools: Android Studio, Visual Studio.
Courses: Digital Signal Processing-1, Digital Signal Processing-2, Introduction to Speech Processing, Adaptive Signal Processing, Detection & Estimation Theory, Speech & Speaker Recognition, Random Processes, Signals & Systems, Algorithms and Data Structure.

Industrial Trainings
Software Test Engineer, Huawei Technologies, Istanbul, Turkey (November 2013 - March 2015): attended development of CRM software; performed testing, load testing, and automation testing; managed the test environment and its databases.

Research Experience
Statistical Signal Processing Research Lab (SSPRL) at UTD (October 2016 - Present)
• Worked on implementation of Direction of Arrival (DOA) estimation for two microphones.
• Developed and improved an Android application which estimates DOA using the two microphones of a smartphone.
• Worked on an audio compression algorithm to improve quality and intelligibility of speech for hearing aids; coded in MATLAB and C.
• Worked on implementing speech enhancement algorithms on Android phones.

Projects
2-Channel DOA Application on Smartphone
• This project is funded by NIH.
• Developed and simulated a 2-channel speech DOA (under noisy environment) algorithm using an optimized VAD and logMMSE.
• Implemented this algorithm on Android smartphones (Google Pixel and Samsung S7) in real time.

8-Channel DOA and Wireless Speech Enhancement System [Fall 2016]
• This project is funded by NIH.
• Developed an 8-channel speech DOA algorithm (under noisy environment) and implemented it on hardware in real time.
• Developed a 2-channel speech enhancement app using 8-channel audio data and DOA results from hardware via low-latency WiFi (UDP).

User Controlled Speech Enhancement for Optimal Quality and Intelligibility
• This project is funded by NIH.
• Developed a real-time speech enhancement on smartphone based on JMAP.
• Lets the user control quality and intelligibility via a knob on the GUI of our app.

Publications
• A. Ganguly, Abdullah Küçük, I. Panahi, "Real-time noise robust dual-microphone Speech Source Localization and Tracking on Smartphone", IEEE Transactions on Audio, Speech and Language Processing, September 2017.
• A. Ganguly, Abdullah Küçük, I. Panahi, "Real-time Smartphone application for improving spatial awareness of Hearing Assistive Devices", Engineering in Medicine and Biology Society (EMBC), 40th Annual International Conference of the IEEE, Aug. 2018.
• N. Shankar, Abdullah Küçük, C. Reddy, G. Bhat, I. Panahi, "Influence of MVDR beamformer on a Speech Enhancement based Smartphone application for Hearing Aids", Engineering in Medicine and Biology Society (EMBC), 40th Annual International Conference of the IEEE, Aug. 2018.
• Abdullah Küçük, Anshuman Ganguly, Issa Panahi, "Improved pre-filtering stages for GCC-based direction of arrival estimation using smartphone", Proceedings of Meetings on Acoustics, Acoustical Society of America (ASA 2018), Minneapolis, MN, May 2018.
• Abdullah Küçük, Yiya Hao, Anshuman Ganguly, Issa Panahi, "Stereo I/O Framework for Audio Signal Processing on Android Platforms", Proceedings of Meetings on Acoustics, Acoustical Society of America (ASA 2018), Minneapolis, MN, May 2018.
• P. Mishra, A. Ganguly, Abdullah Küçük, I. Panahi, "Unsupervised Noise-aware Adaptive Feedback Cancellation for Hearing Aid Devices under Noisy speech framework", IEEE Signal Processing in Medicine and Biology Symposium, August 2017.
• A. Ganguly, Abdullah Küçük, I. Panahi, "Real-time Smartphone implementation of noise-robust Speech source localization algorithm for hearing aid users", Proceedings of Meetings on Acoustics, 2017 30:1, Boston, MA, 2017.
• Y. Sabucu, T. Cogalan, Abdullah Küçük, T. Gucluoglu, "Performance Comparison of Coded OFDM Systems with Antenna Selection Techniques", MIC-CCA 2011 Conference (ISBN: 978-0-9836521-1-3).

Visa Status: J1/Eligible to work in USA