LITHUANIAN UNIVERSITY OF HEALTH SCIENCES MEDICAL ACADEMY

Evaldas Padervinskis

THE VALUE OF AUTOMATIC VOICE CATEGORIZATION SYSTEMS BASED ON ACOUSTIC VOICE PARAMETERS AND QUESTIONNAIRE DATA IN THE SCREENING OF VOICE DISORDERS

Doctoral Dissertation Biomedical Sciences, Medicine (06B)

Kaunas, 2016

Dissertation has been prepared at the Lithuanian University of Health Sciences, Medical Academy, Department of Otorhinolaryngology during the period of 2011–2015.

Scientific Supervisor Prof. Dr. Habil. Virgilijus Ulozas (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B).

Dissertation is defended at the Medical Research Council of the Lithuanian University of Health Sciences, Medical Academy:

Chairman Prof. Dr. Habil. Limas Kupcinskas (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B).

Members: Prof. Dr. Habil. Daiva Rastenyte (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B);

Prof. Dr. Dalia Zaliuniene (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B);

Prof. Dr. Vaidotas Marozas (Kaunas University of Technology, Technological Sciences, Electrical and Electronics Engineering – 01T);

Prof. Dr. Habil. Kazimierz Niemczyk (Medical University of Warsaw, Biomedical Sciences, Medicine – 06B).

Dissertation will be defended at the open session of the Medical Research Council of the Lithuanian University of Health Sciences on June 16th, 2016, at 2 p.m. in Auditorium 204 of the Faculty of Pharmacy of the Lithuanian University of Health Sciences. Address: Sukileliu 13, LT-50009 Kaunas, Lithuania.



I dedicate this book to my parents, Ilona and Edmundas Padervinskis

Thank you

"Life is divided into three periods: that which was, that which is, and that which will be. Of these, what we are doing now is brief, what we shall do is uncertain, what we have done is certain."

Seneca


CONTENTS

ABBREVIATIONS
INTRODUCTION
1. THE AIM AND OBJECTIVES OF THE STUDY
2. ORIGINALITY OF THE STUDY
3. SCIENTIFIC LITERATURE REVIEW
3.1. Voice and computer
3.2. Acoustic voice analysis
3.3. Microphone and acoustic voice analysis
3.4. Questionnaires and voice analysis
3.5. Classifiers and voice analysis
3.6. Acoustic voice analysis and smartphones
4. METHODS
4.1. Ethics
4.2. Study design
4.3. Voice recordings
4.4. Acoustic analysis
4.5. Questionnaire data
5. RESULTS
5.1. Study No I (Analysis of oral and throat microphones using discriminant analysis) data
5.2. Study No II (Analysis of oral and throat microphones using Random Forest classifier) data
5.3. Study No III (Analysis of oral and smart phone microphones data using Random Forest classifier) data
5.4. Study No IV (Testing of VoiceTest software) data
6. DISCUSSION
7. CONCLUSION
REFERENCES
LIST OF PUBLICATIONS
PUBLICATIONS
SANTRAUKA
CURRICULUM VITAE
PADĖKA


ABBREVIATIONS

RF – Random forest
SVM – Support vector machine
GERD – Gastroesophageal reflux disease
F0 – Fundamental frequency
SNR – Signal-to-noise ratio
HNR – Harmonic-to-noise ratio
NNE – Normalized noise energy
CCR – Correct classification rate
EER – Equal error rate
GFI – Glottal function index
VQOL – Voice-disordered quality of life
VHI – Voice handicap index
GFI-LT – Glottal function index, Lithuanian version
LVQ – Learning vector quantization
GMM – Gaussian mixture model
HMM – Hidden Markov model
LDA – Linear discriminant analysis
k-NN – k-nearest neighbours
MLP – Multi-layer perceptron
CC – Cepstral coefficient
BEC – Band energy coefficient
RASTA-PLP – Relative spectral transform perceptual linear prediction
LPC – Linear predictive coding
GNE – Glottal-to-noise excitation
VLS – Video laryngostroboscopy
SD – Standard deviation
CART – Classification and regression tree
t-SNE – t-distributed stochastic neighbour embedding algorithm
DET – Detection error trade-off curve
ROC – Receiver operating characteristic curve
AUC – Area under the curve
OOB – Out-of-bag data classification accuracy (%)


INTRODUCTION

Over the past 200 000 years humans have used the lungs, larynx, tongue, and lips in order to produce and modify the highly intricate arrays of voice needed for verbal communication and emotional expression [1]. The vocal folds have evolved to be a key organ in the creation of the human voice: their vibrations serve as the origin of the primary voice signal. The process of voice production is called phonation, and it is the preliminary stage of speech production [2]. So what is a normal, healthy voice? In 1956, Jahnson et al. suggested the description that a healthy voice is a voice of nice quality and colour that reveals the speaker's age and sex, has normal loudness, and offers adequate possibilities to change loudness and tone [3]. Voice is the main and the easiest instrument for communication between people, and it is one of the most challenging means of information transmission from a person to a computer. When we speak, we give certain direct, verbal information to other people; however, we also supply certain indirect, bodily information about ourselves: psychological and emotional status, personality, identity, and aesthetic orientation are also conveyed [4]. Our voice is influenced by internal and external factors every day. External factors, such as dust, dry or very humid air, temperature, noisy background, incorrect body position when speaking, etc., affect our voice [5]. There are internal factors as well, such as viral infections, GERD, larynx pathologies, neurological diseases, and hormone fluctuations [6–8]. A search of the PubMed database shows only about 60 articles on the subject of voice published in 1960; from 2007 onwards, however, more than 1000 articles on voice processing have been published every year, and in 2015 more than 1700. Consequently, it can be concluded that a growing number of experts, doctors, and engineers investigate the subject of voice processing every year. Modern medicine is now focused on screening programs. Screening programs are chosen for their cost-effectiveness; screening has been proved to reduce incidence rates and both disease-specific and overall mortality rates, and it is recommended by all relevant major organizational guidelines [9]. As a consequence, all types of screening programs need a simple, reliable, user-friendly diagnostic tool that could be used by every family doctor, or even by patients themselves, for the diagnosis of voice disorders or for follow-up after the treatment. Thus, present-day medical engineers and doctors are aiming at precisely such a simple, reliable, and user-friendly diagnostic tool, and as technological possibilities develop, what seemed impossible 10 years ago is already possible today. What is more, when talking to another person, a speaker conveys not only verbal information but a great deal of non-verbal information as well. This is the main reason why it is so complicated to input information from a human into a machine and to process it.


1. THE AIM AND OBJECTIVES OF THE STUDY

The aim of the study

To develop an automatic voice categorization system based on the analysis of acoustic voice parameters and the data obtained from patient questionnaires, and to evaluate the system's effectiveness for voice screening purposes.

The objectives of the study

1. To evaluate the reliability of the measurements of acoustic voice parameters obtained using the oral and contact (throat) microphones simultaneously, and to investigate the utility of the combined use of these microphones for voice categorization.
2. To evaluate the possibilities of the Random Forest classifier for the categorization of voice into normal and pathological classes using sets with a different number of acoustic voice parameters.
3. To evaluate the reliability of the measurements of acoustic voice parameters obtained simultaneously using oral and smart phone microphones, and to investigate the utility of the smart phone microphone signal data for voice categorization for voice screening purposes.
4. To evaluate the value of the fusion of information obtained from special voice questionnaires and from the analysis of acoustic voice parameters for automated voice categorization into normal and pathological voice classes.
5. To develop and test software intended for voice categorization into normal and pathological voice classes based on automatic analysis of acoustic voice parameters and voice-related query data.


2. ORIGINALITY OF THE STUDY

A reliable, automatic and objective detector of pathological voice disorders from speech signals is a tool sought by many voice clinicians, engineers, and general practitioners. Nowadays, multi-parameter monitoring and diagnosis, viewed as a way to improve healthcare quality, to reduce the feedback time, and to enable home care at one's own responsibility, is becoming possible as a consequence of progress in diverse computer applications. A detection instrument that could be used for low-cost and non-invasive mass screening, diagnosis, and early detection of voice pathology would contribute to survival and reduce mortality from laryngeal cancer among professionals using voice as their main occupational tool, among all individuals working in risky environments such as chemical factories, among smokers, and in the general population [10]. If diseases are detected earlier, treatment can be started earlier, too; consequently, the price of treatment decreases compared with the treatment of advanced stages of the disease. Voice contains a lot of data, the extraction of which is still too difficult a task to perform. However, the data accumulated during multi-parameter monitoring, supported by the historical records as well as the specific context the patient is acting in, may be exploited for the early detection of possible diseases [11]. Following years of research and significant progress in voice pathology detection and classification, correct detection and classification rates of various pathology stages are still insufficient for reliable and trusted large-scale screening. The research performed in this field generally splits into two stages: first, extraction of meaningful feature sets, and second, using these features for the classification of speech recordings into cases of healthy conditions and different pathologies [11]. Currently, there is an increasing demand for robust voice quality measures. However, a comprehensive, systematic and routine measurement of acoustic voice parameters for diagnostic and/or voice screening purposes, or for follow-up of treatment, is only possible in hospitals with voice laboratory facilities.

One of the possible solutions providing for automated acoustic analysis-based voice screening could be the use of telemedicine and/or a telephony-based voice pathology assessment with various automated analysis algorithms. Several earlier studies showed that the performance of speech and speaker recognition systems declined when processing telephone-quality signals, compared to systems that utilized high-quality recordings.


In the scientific literature, only sporadic articles can be found analysing the possibilities of applying smartphones as a voice analysis instrument. However, the smartphone is currently one of the most rapidly developing technologies. The number of people using a smartphone is rapidly increasing; consequently, it would be of much benefit to use this device to analyse the human voice and to refer the patient to a specialist while the disease is at an early stage. Our experimental study aimed at a more substantial investigation of different types of microphones: contact (throat), oral, and smartphone microphones. It was also intended to test oral and contact (throat) microphones in hospitals with voice laboratory facilities on simultaneously recorded voices, and to establish the differences. We compared acoustic voice data from oral and smartphone microphones, recorded simultaneously, for the first time. Moreover, in the current study we aimed at fusing the data gathered from the questionnaires and from the acoustic analysis in order to show how the fusion improves the classification rate for the two classes of healthy and pathological cases, compared with the usual statistical methods. To the best of our knowledge, we were the first to address this subject. In addition, a Random Forest classifier was tried out on the data gathered from the questionnaires and the acoustic voice analysis in order to see how it improves the classification rate for the two classes (healthy and pathological) compared with the usual statistical methods. Besides, an objective was raised to create software integrating the data gathered from the questionnaire and the acoustic voice analysis, that is, a user-friendly tool for laryngeal pathology detection with high accuracy using non-invasive data.


3. SCIENTIFIC LITERATURE REVIEW

3.1. Voice and computer

In our age of technology, a question is addressed from time to time to computer scientists, as well as to speech specialists, about the possibilities of operating a computer by voice command alone. Currently there are a lot of specialists, even outside the field of medicine, who make use of voice recognition programs; in medicine, specialists from the fields of radiology and pathology, as well as general practitioners, are eager to apply programs for data input with the help of a microphone in order to save precious office time. Voice recognition programs currently available on the market are created to input all the information that is transmitted through the microphone. Installing such programs saves a person's or a company's time and money [12, 13]; these are the factors related to the communication process between the human and the interactive system, with the objective of developing or improving the safety, utility, effectiveness, and usability of interactive computer-based products. Consequently, not only the progress of computer science (as the subject of software engineers) is of great importance, but also human physical and mental characteristics (also known as human factors), not to mention the context where the interaction is carried out [14, 15]. Nevertheless, the early steps of voice recognition programs were confronted with a lot of scepticism. In 1969, the influential John Pierce wrote an article questioning the prospects of the technological possibilities in the field of speech recognition, and criticised "the mad inventors and unreliable engineers" working in the field. In his article entitled "Whither speech recognition", Pierce argued that speech recognition was futile, because the task of speech understanding is too difficult for any machine. It must be noted that Pierce wrote at a time when such technological possibilities did not exist. What Pierce's article failed to foretell was that even a limited success in the field of speech recognition, such as simple small-vocabulary speech recognizers, would suggest important and gratifying applications, especially within the telecommunications industry [16]. The voice-processing market was projected to be over $1.5 billion by 1994 and was growing at about 30 % a year [17]. All speech recognition programs work fairly well if the recorded voice is healthy, if it sounds customary, if the background is not noisy, and if the pronunciation in a given language is faultless. Otherwise, the input of the data might be burdened with certain problems. According to Alcantud et al. [18], the problems that are not related to the larynx and that impede a satisfactory interaction between the human and the computer are as follows: personal phonation differences, acoustic ambiguity, variable utterance of speech sounds, phonetic variation, coarticulation, time variation, and background noise.

3.2. Acoustic voice analysis

In our knowledge-based societies, communication skills have become more and more important in everyday duties. Voice disorders have become a socio-economic factor: in 2000, one study estimated losses within the Gross National Product of the USA of up to $186 billion annually, on the basis that approximately 6–9 % of the entire population suffer from communication or voice disorders [19–21]. In 2015, Mehta et al. published a study showing that voice disorders have been estimated to affect approximately 30 % of the adult population in the United States at some point in their lives, with 6.6–7.6 % of individuals affected at any given point in time [22]. Since the 1960s, acoustic voice analysis has been identified as essential and has been increasingly used both for research and for the objective assessment of voice disorders in clinical settings. In 2001, Dejonckere et al. provided a protocol of the European Laryngological Society (ELS); the protocol contained five multidimensional aspects: visual analysis, perceptual evaluation, acoustic analysis, aerodynamic measures, and self-evaluation by the patient [23, 24]. Consequently, acoustic measures of the severity of dysphonia are already commonly used in various voice clinics, owing to their suitability for automated voice analysis and screening algorithms, the collection of objective non-invasive voice data, and the feasibility of documenting and quantifying dysphonia changes and the outcomes of therapeutic and surgical treatment of voice problems [25]. Voice signals have traditionally been analysed in the time, amplitude, frequency, and quefrency domains [26]. According to Titze, acoustic voice signals can be classified into three types:
Type 1 signals are nearly periodic;
Type 2 signals contain intermittency, strong subharmonics, or modulations;
Type 3 signals are chaotic or random.

Therefore, different methods of voice analysis should be applied depending on the voice signal type. For type 1 signals, perturbation analysis has considerable utility and reliability. For type 2 signals, visual displays (e.g., spectrograms, phase portraits, or next-cycle parameter contours) are most useful for understanding the physical characteristics of the oscillating system. For type 3 signals, perceptual ratings of roughness (and any other auditory manifestation of aperiodicity) are likely to be the best measures for clinical assessment [27].

The acoustic measurement parameters most frequently used for acoustic voice analysis in the scientific literature are jitter, shimmer, fundamental frequency (F0), harmonic-to-noise ratio (HNR), signal-to-noise ratio (SNR), and normalized noise energy (NNE). Perturbation, the cycle-to-cycle variation present in a waveform, is commonly analysed for an acoustic signal using the parameters of jitter and shimmer. Jitter measures the cycle-to-cycle frequency variation of a signal; shimmer measures the cycle-to-cycle amplitude variation. Perturbation parameters of percent jitter and percent shimmer were calculated for the voice sample segments [28]. F0 quantifies the vocal fold vibratory frequency. SNR reflects the dominance of the harmonic signal over noise (measured in dB) [29]. HNR is a measurement of voice pureness; it is based on the ratio of the energy of the harmonics to the noise energy present in the voice (measured in dB) [30]. NNE is computed automatically from the voice signals using an adaptive comb filtering method performed in the frequency domain [31].

The sustained vowel (/a/, /e/, or /i/) is a classical and widely used material for acoustic analysis. Currently, acoustic analysis is performed by selecting a particular segment from each voice signal and analysing the selected segment using defined acoustic algorithms. Titze suggested that only periodic or nearly periodic voice signals should be analysed using acoustic measures [27]. Many studies published in the English scientific literature use two or three of the acoustic voice parameters already mentioned in this thesis [32–36]. Some authors try to use continuous speech analysis in clinical practice, because some voice disorders, such as adductor spasmodic dysphonia, can be characterized by relatively normal voice during sustained vowel productions, whereas voice produced in connected speech is often more severely compromised [35]. Authors such as Krom [37] and Revis et al. [38] reported no significant differences between the ratings of a sustained vowel and running speech. In another study, Wolfe et al. [39] found a significant difference between the ratings of the two sample types, and the latter finding was supported in part by Zraick et al. [40], who reported a statistically significant difference between the judgments of sustained vowels and recordings of a picture description. In the current study it was decided to analyse the sustained vowel samples with the 6 acoustic voice parameters most often described in the scientific literature. Maryn et al. [41] list the following factors contributing to this preference:

First, a sustained vowel represents relatively time-invariant phonation, whereas continuous speech involves rapid and frequent changes caused by glottal and supraglottal mechanisms.

Second, in contrast to continuous speech, sustained mid-vowel segments do not contain nonvoiced phonemes, fast voice onsets and terminations, or prosodic fundamental frequency and amplitude fluctuations.

Third, sustained vowels are not affected by speech rate, vocal pauses, phonetic context, or stress.

Fourth, classic fundamental frequency (or period) perturbation and amplitude perturbation measures rely strongly on pitch detection and extraction algorithms. As a consequence, they lose precision in continuous speech analyses, in which perturbation is significantly affected by intonational patterns, voice onsets and offsets, and unvoiced fragments.

Fifth, sustained vowels can be elicited and produced with less effort and in a more standardized manner than continuous speech.

Sixth, there is no linguistic loading in a sustained vowel, resulting in relative immunity from influences related to dialect and region, language, cognition, and so on [41].

The replacement of analogue recording systems with digital recording systems, the availability of automated analysis algorithms, and the non-invasiveness of acoustic measures, combined with the fact that acoustic parameters provide easy quantification of dysphonia improvement during the treatment process, have led to considerable interest in clinical voice quality measurement using acoustic analysis techniques [42]. Automatic systems for the detection of illnesses related to abnormalities of the vocal signal have been developed; they are mainly based on signal processing or on machine learning and data mining techniques. Several experiences of using algorithmic approaches for the automatic analysis of signals exist. Software tools, both commercial and freely available (e.g. the Multi-Dimensional Voice Program (MDVP), WinPitch, Praat, VOICEBOX), allow manipulating voice components in an efficient way and permit specialists to manipulate and analyse voice signals [43]. A study by Mendes et al. [44] described the automatic voice analysis programs currently on the market, listed in Table 3.2.1.


Table 3.2.1. Voice analysis software

Freely available software:
• Audacity 2.0.0
• EMU Speech Database System 2.3.0
• WaveSurfer 1.8.5
• Praat 5.3.04
• Speech Filing System (SFS) 4.8
• SFS|WASP 1.51
SIL International
• Speech Analyser 3.0.1

Commercial software:
Dr. Speech, version 4
• Vocal Assessment
• Real Analysis
• Speech Training
• ScopeView
• Phonetogram
• Speech Therapy 4
FonoTools
KayPENTAX
• Multi-Speech, Model 3700
• Voice Range Profile (VRP), Model 4326
• Multi-Dimensional Voice Program (MDVP), Model 5105
• Motor Speech Profile (MSP), Model 5141
LingWAVES Voice Clinic Suite Pro
Seegnal
• MasterPitch Pro
• VoiceStudio
• SingingStudio
Estill Voice International
• VoicePrint
• Estill Voiceprint Plus
Time Frequency Analysis Software – TF32
Video Voice Speech Training System 3.0
VoxMetria
• Vocalgrama

Since 1998, the Department of Otorhinolaryngology of the Hospital of the Lithuanian University of Health Sciences Kauno Klinikos has used Tiger Electronics Dr. Speech (Voice Assessment 3.0) software. Tiger DRS software is one of the most frequently used acoustic voice analysis programs and is comparable to the Multidimensional Voice Program (MDVP, KayPentax, NJ, USA), which is considered the gold standard [45, 46]. This software is also comparable to the free open-source voice analysis program Praat. In contrast to the programs mentioned before, Praat can be used with Windows and Macintosh, the free Linux operating system, and other systems such as FreeBSD, SGI, Solaris, and HPUX. This makes it easy to install on any equipment without the need for a specific operating system. However, comparisons between voice analysis programs reveal weak or moderate correlations in frequency perturbation, and moderate or strong correlations in amplitude perturbation [47]. The reason for these differences is the use of distinct algorithms to extract voice data from voice samples. Thus, first of all, anyone wishing to choose a program must be aware that inter-program reliability suffers because the analysis algorithms and methods for the same parameters differ in every software package; consequently, it is difficult to establish a common threshold for the acoustic voice parameters. Secondly, the sampling rate should be 44.1 kHz, and the recordings should be left uncompressed, typically in wav file format [24]. Another requirement for the use of objective acoustic analysis in research or clinical practice is the need to attain a high level of accuracy, as well as reliability of the hardware, which was analysed very well by Svec [48].
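To make the perturbation measures defined in this section concrete, the following minimal sketch (in Python, with invented cycle data; it is not the algorithm implemented in Dr. Speech or any other package mentioned above) computes percent jitter and percent shimmer from sequences of cycle periods and peak amplitudes, assuming the pitch-marking step that extracts the cycles has already been performed:

    import numpy as np

    def percent_jitter(periods_ms):
        # Mean absolute cycle-to-cycle period difference relative to the
        # mean period (classic local jitter, expressed in percent).
        p = np.asarray(periods_ms, dtype=float)
        return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

    def percent_shimmer(amplitudes):
        # Mean absolute cycle-to-cycle amplitude difference relative to the
        # mean amplitude (local shimmer, expressed in percent).
        a = np.asarray(amplitudes, dtype=float)
        return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)

    # Invented cycle data for a sustained /a/ near 200 Hz (period about 5 ms):
    periods = [5.02, 4.98, 5.01, 4.97, 5.03]     # cycle lengths, ms
    amplitudes = [0.81, 0.79, 0.82, 0.80, 0.78]  # cycle peak amplitudes
    print("jitter  = %.2f %%" % percent_jitter(periods))
    print("shimmer = %.2f %%" % percent_shimmer(amplitudes))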

3.3. Microphone and acoustic voice analysis

This part of the literature review addresses the factors that influence the accuracy and comparability of the measurements of acoustic voice parameters, which may arise from variations in the data acquisition environment [49], microphone types or placements [48–50], recording systems, and methods of voice signal analysis [46, 51–54]. During voice and speech production, vibrations from the vocal folds are transmitted through the vocal tract and through the body tissue to the skin surface. These skin surface vibrations can be sensed by contact microphones and/or accelerometers (i.e., vibration sensors that use the piezoelectric effect to convert mechanical energy into electrical energy in response to applied stress), as opposed to microphones recording in the air. The output signal, mirroring the sound signal generated by the vocal fold vibrations, can be used to transmit voice signals into analysis systems [55, 56], even revealing a representation of the rapid subglottal pressure vibrations [57]. As opposed to conventional acoustic microphones routinely used for voice recordings, contact microphones are less sensitive to background noise from the surrounding environment. Moreover, contact microphones and/or accelerometers have the potential to eliminate acoustic effects of the vocal tract, thus providing enhanced voice signal clarity in environments with elevated ambient noise [56, 58]. Microphones are the basic tools for the registration of voice signals, aiming to convert the sound pressure signal to an electric signal with the same characteristics. Consequently, the type and technical characteristics of the microphone may determine the final results of acoustic voice analysis.

18

Despite the fact that voice and speech recordings and measurements are carried out routinely for clinical and research purposes, the subject of microphone selection still reflects some controversies [48–50, 59]. According to Dejonckere et al. [23], microphones have to comply with several conditions to enable acceptable voice recordings:
1. Condenser type. Cardioid characteristics are recommended, because these features allow focusing more directly on the voice signal [48, 50].
2. Frequency range from 20 to 20 000 Hz, to cover the whole spectrum of the human voice [48].
3. The frequency response curve of intensity should be flat, with a maximum variation of 2 dB over 20–8000 Hz, preferably up to 20 000 Hz [48].
4. The voice signal should be protected as much as possible from the equivalent noise level generated by every component of the microphone. The voice signal must be loud enough to cover the intrinsic noise with a minimum difference of 15 dB [48].
5. Maximum sound pressure level for 3 % total harmonic distortion of 126 dB [27].
6. High sensitivity, in order to obviate a higher gain level and thus avoid a higher noise level. Condenser microphones with a sensitivity level lower than 60 dB are not recommended for clinical voice investigations [24, 27].
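These recommendations can be read as a simple specification check. The sketch below encodes the thresholds from the list above and tests a hypothetical microphone data sheet against them; the field names and example values are illustrative and do not describe any particular product:

    # Hypothetical data-sheet values; thresholds follow the list above.
    mic = {
        "type": "condenser",
        "freq_low_hz": 20, "freq_high_hz": 20000,
        "flatness_db": 1.5,          # max response deviation, 20-8000 Hz
        "noise_margin_db": 18,       # voice signal above intrinsic noise
        "max_spl_db_3pct_thd": 130,  # SPL at 3 % total harmonic distortion
    }
    checks = [
        ("condenser type", mic["type"] == "condenser"),
        ("covers 20-20000 Hz", mic["freq_low_hz"] <= 20 and mic["freq_high_hz"] >= 20000),
        ("response flat within 2 dB", mic["flatness_db"] <= 2.0),
        ("signal at least 15 dB above intrinsic noise", mic["noise_margin_db"] >= 15),
        ("max SPL at 3 % THD at least 126 dB", mic["max_spl_db_3pct_thd"] >= 126),
    ]
    for name, ok in checks:
        print(("PASS" if ok else "FAIL") + ": " + name)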

Although voice recordings have been carried out for many years in clinical practice, the debate on microphone selection is still going on. The validity and reliability of acoustic measurements are highly affected by background noise [60]. Due to its vicinity to the voice source, a contact microphone is less sensitive to background noise and provides enhanced voice signal clarity in noisy environments [56, 61–63]. It is suggested that the acoustic environment should have a signal-to-noise ratio of at least 30 dB to produce valid results in audio analysis [60]. This recommendation is fulfilled easily when voice recordings are performed in a special sound-proof booth; however, it can become infeasible when voice recordings are obtained in an ordinary environment for a voice disorder screening task. Nevertheless, several studies with contact microphones revealed decreased speech signal intelligibility compared to headset microphones [56, 63, 64]. Moreover, contact microphones are not very effective in transmitting consonant sounds and high frequencies [65]. The elasticity properties of the underlying human body tissues, acting as a low-pass filter with a 3 kHz cut-off frequency [66], limit the frequency range of the resulting signal.
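The 30 dB recommendation can be verified directly on a recording that contains a stretch of room noise before phonation. A minimal sketch, assuming a mono signal held in a NumPy array and known boundaries between the noise-only and voiced segments (the synthetic data below stand in for a real recording):

    import numpy as np

    def snr_db(voiced, noise):
        # Ratio of mean signal power to mean noise power, in decibels.
        return 10.0 * np.log10(np.mean(voiced ** 2) / np.mean(noise ** 2))

    # Synthetic stand-in: 0.5 s of room noise, then a 5 s vowel-like tone,
    # both at 44.1 kHz, in place of a real recording.
    fs = 44100
    rng = np.random.default_rng(0)
    noise = 0.01 * rng.standard_normal(fs // 2)
    t = np.arange(5 * fs) / fs
    vowel = 0.5 * np.sin(2 * np.pi * 200 * t) + 0.01 * rng.standard_normal(5 * fs)

    print("estimated SNR: %.1f dB (at least 30 dB is recommended)" % snr_db(vowel, noise))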


It was demonstrated that, in the case of non-stationary background noise, the use of contact microphones can significantly improve the accuracy of separation between voice recordings obtained from healthy subjects and from subjects experiencing voice-related problems [67–69]. By using recordings from both types of microphones, Dupont et al. [66] achieved 80 % recognition accuracy when discriminating between pathological and normal cases. Mubeen et al. [70] achieved some increase in performance when combining features of one type (weighted linear predictive cepstral coefficients) extracted from both types of recordings. Erzin [71] proposed a new framework, which learns joint sub-phone patterns of contact and acoustic microphone recordings using a parallel-branch HMM structure; application of this technique resulted in a significant improvement of throat-only speech recognition. In other studies, accelerometers have been used and found useful for voice and speech measurements, that is, for detecting glottal vibrations, extraction of voice fundamental frequency (F0) and frequency perturbation measurements [58], evaluation of acoustic voice characteristics before and after intubation [72], voice accumulation/dosimetry [61, 73], estimation of sound pressure levels of voiced speech [61], mapping of neck surface vibrations during vocalized speech [74], and measurement of facial bone vibration in resonant voice production [75, 76]. There is a lack of data in the scientific literature concerning comparative studies on the applicability of contact microphones for acoustic voice measurements for voice screening purposes and/or on the combined use of standard and contact (throat) microphones. Therefore, one of the objectives of this research was to validate the suitability of the throat microphone signal for the task of voice screening, to evaluate the reliability of acoustic voice parameters obtained simultaneously using oral and contact (throat) microphones, and to investigate the utility of the combined use of these microphones for voice categorization.


3.4. Questionnaires and voice analysis

Questionnaire data, providing essential statements related to various aspects of a subject's health, are easily obtained and also constitute an important, however under-exploited, source of information obtained non-invasively. In 1997, Jacobson et al. [77] for the first time used a questionnaire composed of 30 questions, the Voice Handicap Index (VHI). It was the first questionnaire created to investigate how voice diseases affect different aspects of life: physical, emotional, and functional. In 2005, Franic et al. [78] published a study comparing the psychometric properties of voice-disordered quality of life (VQOL) instruments. Nine VQOL instruments were identified through a comprehensive literature search. These authors evaluated the instruments on the basis of 11 measurement standards related to item information, versatility, practicality, breadth and depth of the health measure, reliability, validity, and responsiveness. In comparison with the other 8 questionnaires, the VHI questionnaire showed much better results. The VHI questionnaire has been validated in 12 different languages [79]. One problem which may arise with the use of the VHI is due to its length: in routine diagnostics, voice patients may need to undergo several further measurements, and the 30 items of the VHI might require too much time (about 10–15 min) [80]. After the VHI became widely used in clinical practice, shortened versions of the questionnaire were created: VHI-12, VHI-10 [81], and VHI-9 [80]. The shorter versions of the VHI have been adopted in the German and French languages [80]. In 2005, Bach et al. [82] created a simple, short, self-administered symptom index of 4 items with an excellent criterion-based and construct validity, the Glottal Function Index (GFI) questionnaire. The GFI questionnaire has been used in the Center for Voice Disorders of Wake Forest University (Winston-Salem, NC) and was initially conceived as an instrument for evaluating glottal insufficiency and its response to therapy. The correlation coefficient between total GFI and total VHI scores was 0.61 (P<0.001), that is, a strong correlation was identified between these questionnaires. The GFI questionnaire was translated and validated in the Lithuanian language in 2011 by Pribuišienė et al. [83]. Based on the normative data, Bach et al. [82] considered a GFI score higher than 4.0 (mean + 2 SDs) to be abnormal. In the GFI-LT study, a score higher than 3.0 was found to be the limiting value distinguishing patients from healthy controls, with a sensitivity of 88 % and a specificity of 84 %, respectively. The same score was found when using ROC curves, and was revealed by Cohen et al. [84] during the validation of the GFI for children. The GFI questionnaire was used successfully for monitoring the results of surgery, and it was found to show statistically significant differences between pre- and postoperative groups [85]. Responses to specific questions may contain information which is not present in the acoustic or visual modalities. Analysis of query data can be used for preventive healthcare of the larynx, yet very few attempts have been made to use it in screening [86]. To obtain the most important statements in the questionnaires, certain authors used a genetic search over different classifiers, and used them in an SVM in order to categorize the questionnaire data into the healthy class and two classes of pathologies: nodular and diffuse [87].
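Because the GFI consists of only four items, its use as a screening cut-off is easy to state in code. The sketch below applies the GFI-LT decision rule quoted above (a total score higher than 3.0 flags a probable voice disorder); the 0–5 rating per item follows the original instrument of Bach et al., and the example answers are invented:

    def gfi_total(items):
        # Sum of the four GFI item ratings, each on a 0-5 scale.
        if len(items) != 4 or not all(0 <= i <= 5 for i in items):
            raise ValueError("GFI expects four item ratings in the range 0-5")
        return sum(items)

    def gfi_lt_abnormal(items, cutoff=3.0):
        # GFI-LT screening rule: a total above the cutoff is abnormal
        # (cutoff 3.0 gave 88 % sensitivity and 84 % specificity).
        return gfi_total(items) > cutoff

    print(gfi_lt_abnormal([1, 0, 1, 0]))  # total 2 -> False, within normal range
    print(gfi_lt_abnormal([2, 1, 1, 1]))  # total 5 -> True, refer for examination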

3.5. Classifiers and voice analysis

Usually, classifiers are used in acoustic voice analysis to find the combination of features needed to achieve the best classification rate. In the scientific literature the following classifiers are used: learning vector quantization (LVQ) [88], the Gaussian mixture model (GMM) [89], the hidden Markov model (HMM) [90], and linear discriminant analysis (LDA) [91]. The following discriminative methods are also being used: decision trees, Random Forest (RF) [92], k-nearest neighbours (k-NN) [93], the multi-layer perceptron (MLP) [94], and the support vector machine (SVM) [95]. Ensemble methods, which combine separate classifiers into a multiple classifier system, are also sometimes used [86]. In 2012, Arjmandi et al. published a study comparing classifiers; it was determined that the SVM was the strongest among the different classifiers investigated for voice quality assessment [96]. Nevertheless, various authors agree that the SVM classifier has some disadvantages: in some cases it is impossible to train the SVM, to mention just one; moreover, there is a certain lack of interpretability when the SVM produces inexplicable values. RF is a popular and efficient algorithm for classification and regression, based on ensemble methods. The advantages of RF were validated and consolidated by its inventors [92, 97]: it is applicable when there are more predictors than observations; it performs embedded feature selection and is relatively insensitive to a large number of irrelevant predictors; it incorporates interactions between predictors; it is based on the theory of ensemble learning, which allows the algorithm to learn both simple and complex classification functions accurately; it is applicable to both binary and multi-category classification tasks; and, according to its inventors, it does not require much fine-tuning of parameters, as the default parameterization often leads to excellent performance.
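As a concrete illustration of the RF properties listed above (embedded feature selection, little tuning, and a built-in out-of-bag accuracy estimate), the sketch below trains a default scikit-learn Random Forest on a synthetic stand-in for an acoustic feature matrix; it mirrors the setup of the studies in this thesis only in outline, and the data are random rather than real voice features:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    # Stand-in data: 200 subjects x 6 acoustic features (in the spirit of
    # F0, jitter, shimmer, NNE, SNR, HNR); 0 = normal, 1 = pathological.
    X = rng.standard_normal((200, 6))
    y = (X[:, 1] + 0.5 * X[:, 3] + 0.3 * rng.standard_normal(200) > 0).astype(int)

    # Default parameterization, as the RF inventors suggest; the OOB score
    # estimates classification accuracy without a separate held-out set.
    clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=1)
    clf.fit(X, y)
    print("OOB accuracy:", round(clf.oob_score_, 3))
    print("feature importances:", np.round(clf.feature_importances_, 3))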


There is a lack of literature comparing those two classifiers. Statnikov et al. identified that Random Forests are outperformed by support vector machines both in settings where no gene selection is performed and when several popular gene selection methods are used; however, in 2012, Englund et al. [98] determined that for some specific tasks RF has advantages in comparison with the SVM. At the moment those two classifiers are very similar, and both have shown high classification results. Some previous attempts to recognize pathology of the larynx using voice signal features are summarized in Table 3.5.1. A non-invasive measurement-based tool enabling preventive screening for laryngeal disorders is the combined use (fusion) of different information sources, such as voice analysis and questionnaire data, which is one of the main objectives of this thesis. One of the aims of this thesis was voice pathology detection from the combined use of non-invasive laryngeal data, specifically voice recordings and responses to a questionnaire, and, by combining those results, to create a user-friendly tool of high accuracy for laryngeal pathology detection using the non-invasive data. Currently, the tool is oriented to experts working at departments of otolaryngology, but in the near future the tool should run on a smart phone, including voice recording, and become much more versatile. Both modalities can be easily collected using off-the-shelf solutions. Due to the missing data in the query modality, imputation before decision-level fusion is compared to complete-case analysis: this part investigates whether any gain can be achieved by imputing RF decisions instead of discarding instances with missing modalities in fusion. The query modality is additionally explored by extracting rules using affinity analysis [86].
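The decision-level fusion idea can be sketched as follows: two classifiers are trained separately on the audio and query modalities, and their predicted class probabilities are averaged; where the query modality is missing for a subject, the audio posterior is used alone, a crude stand-in for the imputation discussed above. The data and the split are synthetic:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(2)
    n = 300
    X_audio = rng.standard_normal((n, 10))  # stand-in acoustic features
    X_query = rng.standard_normal((n, 5))   # stand-in questionnaire features
    y = (X_audio[:, 0] + X_query[:, 0] > 0).astype(int)

    train, test = np.arange(240), np.arange(240, n)
    rf_audio = RandomForestClassifier(random_state=2).fit(X_audio[train], y[train])
    rf_query = RandomForestClassifier(random_state=2).fit(X_query[train], y[train])
    p_audio = rf_audio.predict_proba(X_audio[test])[:, 1]
    p_query = rf_query.predict_proba(X_query[test])[:, 1]

    # Decision-level fusion: average the two posteriors; fall back to the
    # audio posterior alone where the questionnaire is missing.
    missing = rng.random(len(test)) < 0.2   # simulate 20 % missing queries
    p_fused = np.where(missing, p_audio, 0.5 * (p_audio + p_query))
    print("fused accuracy:", np.mean((p_fused > 0.5) == y[test]))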


Table 3.5.1. History of non-invasive screening for voice pathology. Columns: database (recordings) | features used | classifier (accuracy, %) | reference

1. KobriElkobba (15 norm, 20 path) | RASTA-PLP CCs + ∆ | HMM (sens. 87.5, spec. 100) | Saudi et al. (2012) [99]
2. UCLA-RABTA (50 norm, 50 path) | Wavelet transform | MLP (sens. 90, spec. 100) | Salhi et al. (2008) [100]
3. MEEI (53 norm, 44 Edem) | MFCCs | MLP (norm 99, Edem 96, other 93) | Marinus et al. (2009) [101]
4. MEEI (53 norm, 82 path) | MFCCs + ∆ + ∆∆ | GMM (94) | Godino-Llorente et al. (2001) [102]
5. MEEI (53 norm, 82 path) | MFCCs + ∆ | MLP (94), LVQ (96) | Godino-Llorente and Vilda (2004) [103]
6. MEEI (53 norm, 163 path) | HNRs at 4 freq. bands | k-NN (94.28) | Shama et al. (2007) [104]
7. MEEI (53 norm, 173 path) | MFCCs, noise | SVM (95) | Godino-Llorente et al. (2005) [95]
8. MEEI (53 norm, 173 path) | MFCCs + ∆ | GMM (94) | Godino-Llorente et al. (2006) [105]
9. MEEI (53 norm, 173 path) | MFCCs | MLP (89.6) | Sáenz-Lechón et al. (2006) [106]
10. MEEI (53 norm, 173 path) | (MFCCs, HNR, NNE, GNE) + ∆ | SVM (93.01), GMM (94.35) | Sáenz-Lechón et al. (2008) [107]
11. MEEI (53 norm, 173 path) | MFCCs | MLP (88.3) | Fraile et al. (2009) [108]
12. MEEI (53 norm, 173 path) | Chaos (TISEAN) | MLP (99.69) | Henriquez et al. (2009) [109]
13. MEEI (53 norm, 173 path) | Modulation spectrum | SVM (94.1) | Markaki and Stylianou (2009, 2011) [110, 111]
14. MEEI (53 norm, 173 path) | MFCCs | GMM-SVM (96.1) | Wang et al. (2010) [112]
15. MEEI (53 norm, 173 path) | MFCCs, HNR, NNE, GNE | GMM (94.8) | Martínez et al. (2012) [113]
16. MEEI (53 norm, 175 path) | Noise (LPC-derived) | LDA (96.5) | Parsa and Jamieson (2000) [114]
17. MEEI (53 norm, 175 path) | Perturbation, spectral, noise | LDA (98.7) | Parsa and Jamieson (2001) [115]
18. MEEI (53 norm, 224 path) | Cochlear filter-bank | k-NN (89.19) | Shama et al. (2004) [116]
19. MEEI (53 norm, 638 path) | Perturbation, noise | k-NN (96.1) | Hadjitodorov and Mitev (2002) [45]
20. MEEI (53 norm, 657 path) | MFCCs, MDVP | HMM (98.3) | Dibazar et al. (2002) [117]
21. MEEI (53 norm, 657 path) | MFBECs | k-NN (99.59), LDA (98.48) | Hariharan et al. (2009) [118]
22. Doctor Negrin (85 norm, 57 path: 3 GRBAS levels) | Chaos (TISEAN) | MLP (82.47) | Henriquez et al. (2009) [109]
22. (same database) | Jitter, shimmer, spectral, noise, chaos | MLP (92.76) | Alonso et al. (2005) [119]
23. Doctor Negrin (100 norm, 68 path) | Modulation spectrum | SVM (81.2) | Markaki and Stylianou (2009) [110]
24. PdA (100 norm, 100 path) | MFCCs, HNR, NNE, GNE | GMM (79.4) | Martínez et al. (2012) [113]
25. SVD (650 norm, 1320 path) | Various audio features | 5 SVMs (95.13) | Gelzinis et al. (2008) [120]
26. LUHS (75 norm, 237 path) | Various audio features | 4 SVMs (84.65) | Gelzinis et al. (2008) [120]
27. LUHS (75 norm, 75 diffuse, 162 nodular) | MFCCs | GMM-SVM (89) | Vaiciukynas et al. (2012) [121]
28. LUHS (103 norm, 671 path) | MFCCs | GMM-SVM (70) | Vaiciukynas et al. (2012) [121]
29. LUHS (103 norm, 212 diffuse, 459 nodular) | Various audio features | 50 RFs (86.86) | Vaiciukynas et al. (2014a, 2014b) [86, 122]
30. LUHS (139 norm, 112 path)

Abbreviations: 1st deriv., or velocity (∆); 2nd deriv., or acceleration (∆∆); Mel freq. (MF); cepstral coefficient (CC); band energy coefficient (BEC); relative spectral transform perceptual linear prediction (RASTA-PLP); linear predictive coding (LPC); harmonic-to-noise ratio (HNR); normalized noise energy (NNE); glottal-to-noise excitation (GNE); TISEAN (Hegger et al., 1999) [123]; MDVP (Hema et al., 2009) [124].


3.6. Acoustic voice analysis and smartphones

Automated acoustic analysis-based voice screening could be one of the potential approaches helping primary care physicians and other public health care services to identify patients who require early otolaryngological referral, thereby improving the diagnostics and management of patients suffering from voice disorders. The main goal of automated pathological voice/speech detection systems is to categorize any input voice as either normal or pathological [125]. Currently, there is an increasing demand for robust measures of voice quality. However, a comprehensive, systematic and routine measurement of acoustic voice parameters for diagnostic and/or voice screening purposes, or for following the treatment, is only possible in hospitals with voice laboratory facilities [126]. One of the possible solutions providing automated acoustic analysis-based voice screening could be the use of telemedicine and/or telephony-based voice pathology assessment using various automated analysis algorithms. Several earlier studies showed that when processing telephone-quality signals the performance of speech and speaker recognition systems degraded, compared to systems utilizing high-quality recordings [127]. In 2008 and 2014, Vogel et al. [128, 129] published findings comparing modern recording devices intended for speech analysis: smart phones, landline telephones, laptops, and hard disc recorders. Speech samples were acquired simultaneously from 15 healthy adults using the four devices, and these samples were analysed acoustically for measures of timing and voice quality. On the basis of the voice analysis results, the four devices were compared with the benchmark device, a high-quality recorder coupled with a condenser microphone. The conclusion presented by these authors was that acoustic analyses cannot be assumed to be comparable if different recording methods are applied to record the speech. However, more recent studies highlighted a real possibility for cost-effective remote detection and assessment of voice pathology over telephone channels, reaching normal/pathological voice classification accuracy close to 90 % [125, 130–132]. Current progress in digital technologies has enhanced access to portable devices capable of recording acoustic signals in high-quality audio formats, as well as transmitting the digitized audio files via computer networks. The high sampling rate (48.0–90.0 kHz) afforded by contemporary models of smart phones may prove to be an important aspect, providing an easily accessible audio recording tool for collecting voice recordings while preserving sufficient acoustic detail for voice analysis and monitoring [133].


As a result, some sporadic reports regarding the applicability and effectiveness of iPhone-based voice recordings for acoustic voice assessment have already been introduced in the scientific literature [133]. A more recent study by Mat Baki et al. demonstrated that voice recordings performed with the iPod's internal microphone and analysed with the OperaVox™ software application installed on an iPod touch (4th generation) were statistically comparable to the gold standard, i.e., the Multidimensional Voice Program (MDVP, KayPentax, NJ, USA) [126]. In 2013 and 2015, Mehta et al. published studies on a smart phone-based ambulatory voice health monitor connected to an accelerometer on the neck surface below the larynx, used to acquire and analyse a large set of ambulatory data from patients with hyperfunctional voice disorders (before and after treatment) and to compare the findings with matched control subjects. These authors determined that wearable voice monitoring systems have the potential to provide more reliable and objective measures of voice use during everyday activities, which can enhance the diagnostic and treatment strategies for voice disorders [134, 135]. Therefore, one of the aims of the present study was to evaluate the reliability of acoustic voice parameters obtained simultaneously using oral and smart phone microphones, and to investigate the benefit of the combined use of the smart phone (SP) microphone signal and the GFI questionnaire data for voice categorization with regard to voice screening purposes, as well as for the development of software targeted at otolaryngologists for laryngeal disorder screening purposes.


4. METHODS

4.1. Ethics

The current study was approved by Kaunas Regional Biomedical Research Ethics Committee (No. P2-24/2013). All patients and healthy volunteers provided written informed consent. This clinical study was approved by the State Data Protection Inspectorate for dealing with personal patient data (No. 2R-648 (2.6-1)).

4.2. Study design

During the period of 2011–2015, 656 participants were recruited for our study. Nine patients were not included in this study, as their data were lost. There were 337 healthy volunteers and 319 patients who presented to the Department of Otorhinolaryngology of the LUHS. The present study comprised 4 parts.

The normal voice subgroup was composed of 336 selected healthy volunteers who considered their voice to be normal. They had no complaints concerning their voice and no history of chronic laryngeal diseases or other long-lasting voice disorders. All of them were free from any known hearing problems and free from common cold or upper respiratory infections at the time of voice recording. The voices of this group of individuals were also evaluated as healthy by clinical voice specialists. Furthermore, no pathological alterations were found in the larynx of the subjects of the normal voice subgroup during video laryngostroboscopy (VLS). Digital high-quality VLS recordings were performed with an XION EndoSTROB DX device (XION GmbH, Berlin, Germany) using a 70° rigid endoscope. The acoustic voice signal parameters of these normal voice subgroup subjects, obtained using Dr. Speech software (Tiger Electronics, Seattle, WA; subprogram: voice assessment, version 3.0), were within the normal range.

The pathological voice subgroup consisted of 319 patients who represented a rather common and clinically discriminative group of laryngeal diseases, that is, mass lesions of the vocal folds and paralysis. The mass lesions of the vocal folds included in the study consisted of nodules, polyps, cysts, papillomata, keratosis, and carcinoma. As well, there were patients with neurological diseases (Parkinson's disease, Huntington's chorea). Pathological voice group patients were recruited from consecutive patients diagnosed with the laryngeal diseases mentioned previously. The clinical diagnosis was based on typical clinical signs revealed during VLS and direct microlaryngoscopy. Patients with neurological diseases were referred to us from the Neurological department of our clinic; their diagnosis was based on the typical clinical signs. In all cases of mass lesions of the vocal folds, the final diagnosis was confirmed by the results of the histological examination of the removed tissue. The demographic data of the total study group and the diagnoses of the pathological voice subgroup are presented in Table 4.2.1 and Table 4.2.2. These patients were serially enrolled and, therefore, likely represent the real incidence of pathologies in our series and can be considered clinically representative of the population of voice-disordered patients.

Table 4.2.1. The demographic data of the total study group

Group | Total number (n=656) | Male (n=253) | Female (n=403) | Age, mean (years) | Age, SD
Healthy volunteers | 337 | 110 | 227 | 37.7 | 12.86
Patients | 319 | 143 | 176 | 46.7 | 14.90

SD – standard deviation.

In study No I (analysis of oral and throat microphones using discriminant analysis) we admitted 157 individuals. The normal voice subgroup was composed of 105 selected healthy volunteers; the pathological voice subgroup consisted of 52 patients who represented a rather common and clinically discriminative group of laryngeal diseases, that is, mass lesions of the vocal folds and paralysis. In study No II (analysis of oral and throat microphones using the Random Forest classifier) we admitted 273 subjects (163 normal voices and 110 pathological voices) of both genders, ranging from 19 to 85 years of age. In study No III (analysis of oral and smart phone microphone data using the Random Forest classifier) we admitted 118 individuals examined at our Department of Otorhinolaryngology. The normal voice subgroup was composed of 34 selected healthy volunteers. The pathological voice subgroup consisted of 84 patients who represented a rather common, clinically discriminative group of laryngeal diseases, including mass lesions of the vocal folds (nodules, polyps, cysts, papillomata, keratosis, and carcinoma), paralysis, and reflux laryngitis.


Table 4.2.2. The demographic data of the pathological voice subgroup

Diagnosis | Total number (n=319) | Male (n=143) | Female (n=176) | Age, mean (years) | Age, SD
Vocal fold nodules | 45 | 3 | 42 | 35.6 | 11.84
Vocal fold polyp | 81 | 32 | 49 | 44.8 | 11.63
Vocal fold cyst | 15 | 1 | 14 | 40.6 | 12.62
Vocal fold cancer | 31 | 30 | 1 | 61.8 | 8.47
Vocal fold polypoid hyperplasia (Mb. Reinke-Hajek) | 40 | 11 | 29 | 52.4 | 9.44
Vocal fold keratosis | 8 | 7 | 1 | 50.8 | 15.35
Vocal fold papilloma | 21 | 12 | 9 | 37.2 | 13.43
Unilateral vocal fold paralysis | 20 | 10 | 10 | 54.7 | 13.54
Bilateral vocal fold paralysis | 2 | 1 | 1 | 61.5 | 6.36
Chronic hyperplastic laryngitis | 24 | 21 | 3 | 53.4 | 14.21
Cystis vestibulum laryngis | 2 | 0 | 2 | 63 | 4.23
Dysphonia | 6 | 1 | 5 | 36.8 | 13.81
Sulcus glottidis | 4 | 1 | 3 | 38.3 | 22.2
GERD (gastroesophageal reflux disease) | 11 | 8 | 3 | 46.3 | 15.63
Granuloma | 2 | 0 | 2 | 26.5 | 6.36
Acute laryngitis | 5 | 4 | 1 | 51 | 13.06
Presbylaryngis | 2 | 0 | 2 | 75.5 | 4.95
Monochorditis | 1 | 1 | 0 | 48 | –

SD – standard deviation.

In study No IV (testing of the VoiceTest software), a database of 273 subjects of both genders (163 normal voices and 110 pathological voices), ranging from 19 to 85 years of age, was used to train the RF classifier on acoustic data. A mixed-gender database containing 596 subjects (106 healthy men, 221 healthy women, 118 pathological men, 151 pathological women) was used as query data to train the RF classifier. 45 unseen subjects (9 healthy and 36 pathological) were admitted for testing the program.


4.3. Voice recordings

In studies No I and No II, voice recordings of a sustained phonation of the vowel sound /a/ (as in the English word "large") were used. The subjects were asked to utter a sustained vowel /a/ at a comfortable pitch and loudness level for at least 5 seconds. Voice samples obtained from each subject were recorded in a sound-proof booth simultaneously, as shown in Fig. 4.3.1, with the help of two microphones: an oral cardioid AKG Perception 220 microphone (AKG Acoustics, Vienna, Austria) placed at a 10.0 cm distance from the mouth (the subjects were seated with a head rest), keeping a microphone-to-mouth angle of about 90°, and a low-cost small contact (throat) microphone Stryker/Triumph PC (Clearer Communications, Inc, Burnaby, Canada) placed on the projection of the lamina of the thyroid cartilage and fixed with an elastic band. The localization of the throat microphone on the thyroid lamina was chosen to acquire the strongest signal, because the average magnitude of the acceleration tends to be greatest on and in the immediate vicinity of the larynx [74]. The voice recordings were made in the wav file format on separate tracks using Audacity software (http://audacity.sourceforge.net/) at the rate of 44,100 samples per second, as shown in Fig. 4.3.2. Sixteen bits were allocated for one sample. An external M-Audio sound card (Cumberland, RI) was used for digitization of the voice recordings.

Fig. 4.3.1. Voice recording in a soundproof booth simultaneously with two microphones: oral cardioid AKG Perception 220 and contact (throat) microphone Stryker/Triumph PC.


Fig. 4.3.2. The voice recordings were made in the wav file format on separate tracks using Audacity software

In study No III voice recordings of a sustained phonation of the vowel sound /a/ were used. Voice samples obtained from each subject were recorded in a sound-proof booth simultaneously with two microphones: an oral cardioid AKG Perception 220 (AKG Acoustics, Vienna, Austria) microphone and the internal microphone of a Samsung Galaxy Note 3 smart phone, as shown in Fig. 4.3.3. Both microphones were placed alongside each other at a 10.0 cm distance from the mouth (the subjects were seated with a head rest), keeping an approximately 90° microphone-to-mouth angle. The subjects were asked to utter a sustained vowel /a/ at a comfortable pitch and loudness level for at least 5 seconds.

In study No IV voice recordings of a sustained phonation of the vowel sound /a/ (as in the English word “large”) were used. The subjects were asked to utter a sustained vowel /a/ at a comfortable pitch and loudness level for at least 5 seconds. Voice samples obtained from each subject were recorded in a soundproof booth with a single microphone: an oral cardioid AKG Perception 220 (AKG Acoustics, Vienna, Austria) microphone placed at a 10.0 cm distance from the mouth (the subjects were seated with a head rest), keeping an approximately 90° microphone-to-mouth angle.


Fig. 4.3.3. Voice samples were recorded in a sound-proof booth simultaneously with two microphones: oral cardioid AKG Perception 220 and internal smart phone Samsung Galaxy Note 3 microphone

4.4. Acoustical analysis

In studies No I and No III, segments of at least 5 seconds of the sustained vowel /a/ from each recording session were analysed using Dr. Speech software (subprogram: voice assessment, version 3.0). The acoustic voice signal was measured for F0, jitter (%), shimmer (%), normalized noise energy (NNE), signal-to-noise ratio (SNR), and harmonic-to-noise ratio (HNR). According to the results of our previous study, no statistically significant differences between the means of male and female acoustic voice parameters (except the mean F0) were revealed [136]. Therefore, in this study, we did not separate the parameters of acoustic voice analysis between males and females; however, the F0 parameter was analysed separately considering the gender of the subjects.

In study No II, multiple feature sets containing 1051 features in total were used. The feature extraction was done in Matlab. The features are thoroughly discussed in [11, 120, 137]. In study No IV, aiming to obtain a comprehensive description, each audio recording was represented by 14 feature subsets, resulting in a feature vector of 927 elements. Technical details of feature subsets 1–14 can be found in the article by Gelzinis et al [120].
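The internal algorithms of the Dr. Speech software are proprietary; the sketch below only illustrates the textbook definitions of local jitter and shimmer (the mean absolute difference of consecutive cycle periods or peak amplitudes, relative to their mean), assuming that per-cycle periods and amplitudes have already been extracted from the sustained vowel:

```python
import numpy as np

def local_jitter_percent(periods):
    """Mean absolute difference of consecutive cycle lengths, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer_percent(amplitudes):
    """Mean absolute difference of consecutive peak amplitudes, relative to the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Example with per-cycle measurements (values are illustrative only):
periods = [0.0050, 0.0051, 0.0049, 0.0050, 0.0050]  # seconds per glottal cycle
peaks = [0.82, 0.80, 0.83, 0.81, 0.82]              # arbitrary amplitude units
print(local_jitter_percent(periods), local_shimmer_percent(peaks))
```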


4.5. Questionnaire data

Several questionnaires related to voice symptoms are described in the scientific literature. In this study, in order to collect the demographic data, symptoms, and complaints of our patients, the best-performing questions, selected by our team in an earlier study, were used, as published in the article by Bacauskiene et al in 2012 [138]. Additionally, each participant of the study (normal and pathological voice subgroups) filled in the GFI-LT questionnaire at the baseline along with the voice recordings, at least 1 week before the treatment. The query data were collected from subject responses to the set of questions summarized in Table 4.5.1.

Table 4.5.1. Questionnaire questions used to collect patient data

Question content                                          Units (or scale) of measurement
1.  Subject’s gender                                      {Man, woman}
2.  Subject’s age                                         Discrete number
3.  Average duration of intensive speech use              Hours / day
4.  Average duration of intensive speech use              Days / week
5.  Smoking                                               {Yes, no}
6.  Smoking intensity                                     Cigarettes / day
7.  Smoking history                                       Years
8.  Maximum phonation time                                Seconds
9.  SSA of voice function quality                         Visual analogue scale from 0 to 100
10. SSA of voice hoarseness                               From 0 (no hoarseness) to 100 (severe hoarseness)
11. Voice handicap progressing                            Grade from 1 to 4
12. SSA of daily experienced stress level                 From 0 (no stress) to 100 (very much stress)
13. Frequency of singing                                  Grade from 1 to 5
14. Frequency of talking / singing in a                   Grade from 1 to 5
    smoke-filled room
15. SSA of experienced discomfort due to                  From 0 (no discomfort) to 100 (huge discomfort)
    voice disorder
16. SSA of “too weak voice”                               From 0 (no) to 100 (very clear)
17. SSA of repetitive “loss of voice”                     From 0 (no) to 100 (very clear)
18. SSA of reduced voice                                  From 0 (no) to 100 (very distinctly)
19. SSA of reduced ability to sing                        From 0 (no) to 100 (very distinctly)
20. Frequency of voice cracks or aberrant voice           From 0 (no) to 100 (very often)
21. Level of vocal usage                                  Level from 1 to 4
22. Speaking took extra effort (G1)                       From 0 (no problem) to 5 (severe problem)
23. Throat discomfort or pain after voice usage (G2)      From 0 (no problem) to 5 (severe problem)
24. Voice weakens while talking, vocal fatigue (G3)       From 0 (no problem) to 5 (severe problem)
25. Voice cracks or sounds different (G4)                 From 0 (no problem) to 5 (severe problem)
26. Glottal function index [23, 24]                       Grade from 0 to 20
    (GFI = G1 + G2 + G3 + G4)
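Since the GFI used below is simply the sum of items 22–25 (G1–G4), its computation can be sketched as follows (the function name and example values are ours):

```python
def glottal_function_index(g1, g2, g3, g4):
    """GFI = G1 + G2 + G3 + G4, each item scored 0 (no problem) to 5 (severe problem)."""
    items = (g1, g2, g3, g4)
    if any(not 0 <= g <= 5 for g in items):
        raise ValueError("each GFI item must be within 0-5")
    return sum(items)  # total score ranges from 0 to 20

print(glottal_function_index(2, 1, 3, 0))  # -> 6
```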


4.6. Statistical evaluation, classifiers

In studies No I and No III the statistical analysis was performed with IBM SPSS Statistics software for Windows (version 20.0, IBM Corporation, Armonk, NY). The data were presented as mean ± standard deviation (SD). The Student t test was used for testing hypotheses about the equality of means. The size of the differences among the mean values of the groups was evaluated by calculating the type II error, β. The size of a difference was considered significant if β ≤ 0.2 (i.e., the power of the statistical test ≥ 0.8) at a type I error of α = 0.05. Fisher discriminant analysis was performed in order to determine the limiting values of the acoustic voice parameters discriminating the normal and pathological voice groups, and to select an optimum set of parameters for the classification task. The correct classification rate (CCR) was used to evaluate the feasibility of the acoustic voice parameters for classifying normal and pathological voice classes. The correlations among the acoustic voice parameters were evaluated using the Pearson correlation coefficient (r). The level of statistical significance for hypothesis testing was 0.05.

In studies No II, No III, and No IV the Random Forest classifier was used. Random Forest (RF) is a popular and efficient ensemble algorithm for classification and regression. The core idea of RF is to combine many binary decision trees, built using different bootstrap samples of the original data set, to obtain an accurate predictor. Such a tree-based ensemble is known to be robust against over-fitting, and, as the number of trees increases, the generalization error is observed to converge to a limit. The decision tree within RF is the classification and regression tree (CART). Given a training set Z, consisting of n observations and p features, RF is constructed in the following steps:

1. Choosing the forest size t as the number of trees to grow and the subspace size q ≤ p as the number of features to be provided for each node of a tree.
2. Taking a bootstrap sample of Z and randomly selecting q features.
3. Growing an unpruned CART using the bootstrap sample.

Steps 2–3 are repeated until the size of the forest reaches t. To classify an unseen observation x, each tree of the forest is provided with x and outputs a decision. The resulting votes for each class are counted, and the class that collects the most votes is considered the winner; the decision is based on the majority voting scheme, as illustrated in Fig. 4.6.1.
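The studies used their own (Matlab-based) implementation; purely for illustration, an equivalent construction can be sketched with scikit-learn, where n_estimators corresponds to the forest size t and max_features to the subspace size q, and the data are random placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Z: n observations x p features; y: 0 = normal voice, 1 = pathological voice.
rng = np.random.default_rng(0)
Z = rng.normal(size=(273, 24))    # placeholder for a real feature matrix
y = rng.integers(0, 2, size=273)  # placeholder labels

forest = RandomForestClassifier(
    n_estimators=500,   # forest size t
    max_features=5,     # subspace size q <= p, drawn anew at every node
    bootstrap=True,     # each tree sees a bootstrap sample of Z
    oob_score=True,     # out-of-bag estimate of the classification accuracy
    random_state=0,
).fit(Z, y)

# With fully grown (unpruned) trees, predict_proba equals the fraction of
# tree votes per class; the majority class is the predicted label.
print(forest.oob_score_, forest.predict_proba(Z[:1]))
```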


Fig. 4.6.1. A general Random Forest architecture, where k stands for the class label

A data proximity matrix derived from the RF was used in this study for exploration and visualization of data and decisions. To map data and decisions onto the 2D space, the t-distributed stochastic neighbour embedding (t-SNE) algorithm was used. The t-SNE algorithm often outperforms other state-of-the-art techniques for dimensionality reduction and data visualization [139].

We evaluated the performance of a classifier using the following measures: a) the detection error trade-off (DET) curve and the equal error rate (EER); b) the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The DET, EER, ROC, and AUC measures were estimated using an interpolated version of the ROC obtained through the pool adjacent violators algorithm, namely the ROC convex hull method, available in the BOSARIS toolkit [140]; a simplified EER computation is sketched after the usability criteria below.

The ease of use was evaluated taking the ISO 9241 standard into account, because it is impossible to evaluate the ease of use without taking into account the users’ understanding [141]. As mentioned before, users’ satisfaction is another important factor greatly influencing the success of software implementation. The developed software was evaluated according to seven principles of the ISO 9241 standard:

1. Suitability for the task. Software is suitable for the task if the user can easily understand what tasks it can do.
2. Self-descriptiveness. This principle is evaluated by checking whether the software can be understood in an intuitive way with no or very little additional information. It also requires that any possible usage mistake be followed by relevant information.
3. Controllability. Software controllability is achieved by creating a user interface which allows completing the task in one sequence of steps.
4. Conformity with user expectations. Software conforms to the users’ expectations if it is consistent and complies with the characteristics of the user.
5. Error tolerance. A computer program is considered error tolerant if its usage requires no additional effort except in the events of obviously faulty usage.
6. Suitability for individualization. Software is suitable for individualization if it allows personal configuration for each user.
7. Suitability for learning. Software is suitable for learning if minimum effort is required for usage and help information is provided [138, 141].
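As referenced above, a simplified sketch of the EER computation is given here. The BOSARIS toolkit estimates the EER from the ROC convex hull; the approximation below merely locates the operating point where the false alarm rate and the miss rate coincide (the labels and scores are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def equal_error_rate(labels, scores):
    """Approximate EER: the point where the false alarm rate equals the miss rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    miss = 1.0 - tpr
    i = np.argmin(np.abs(fpr - miss))
    return (fpr[i] + miss[i]) / 2.0

# Illustrative scores (higher = more likely pathological):
labels = np.array([0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9])
print(equal_error_rate(labels, scores), roc_auc_score(labels, scores))
```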


5. RESULTS

5.1. Study No I (Analysis of oral and throat microphones using discriminant analysis) data

The mixed gender database of voice recordings used in this study contained 157 digital voice recordings of sustained phonation of the vowel sound /a/. Demographic study data is presented in Table 5.1.1.

Table 5.1.1. Demographic data of study I

                                      Total number   Gender            Age (in years)
Diagnosis                             (n=157)        Female   Male     Mean    SD
                                                     (n=102)  (n=55)
Normal voice                          105            71       34       46.2    6.70
Nodules                               7              7        0        25.4    6.00
Polyps                                14             8        6        41.1    11.70
Carcinoma                             6              0        6        62      7.00
Vocal fold polypoid hyperplasia       9              7        2        50      7.10
  (Mb. Reinke-Hajek)
Vocal fold papilloma                  7              3        4        40      13.50
Other (cyst, granuloma,               9              6        3        45.7    8.10
  monochorditis)

SD – standard deviation.

The mean values and SD of the acoustic voice parameters obtained with both oral and throat microphones in the total study group are presented in Table 5.1.2. Generally, no statistically significant differences (P > 0.05) between the acoustic voice parameters obtained with the oral and throat microphones were found for the parameters reflecting frequency and amplitude perturbations of the voice signal. The only exceptions were the SNR and HNR parameters, which demonstrated slight but statistically significant differences between the microphone measurements; however, these differences were only within the range of 5.64–5.78 %. The observed statistically significant differences between the HNR and SNR parameters of the two microphones could be due to the rather different frequency response curves of the microphones.


Table 5.1.2. Comparison of the means of acoustic voice parameters obtained from the oral and throat microphones

                                                       Paired difference
Pair              Parameter    Mean     N     SD       Absolute   %        P*       β**
Pair 1            O-Jitter     0.40     157   0.30     0.004      1        0.801    –
                  T-Jitter     0.40     157   0.33
Pair 2            O-Shimmer    2.92     157   1.49     0.017      0.59     0.853    –
                  T-Shimmer    2.90     157   1.91
Pair 3            O-NNE        –8.64    157   4.82     0.432      –4.76    0.172    –
                  T-NNE        –9.08    157   5.48
Pair 4            O-HNR        23.00    157   5.06     1.380      5.64     0.000*   0.345**
                  T-HNR        24.38    157   5.26
Pair 5            O-SNR        21.35    157   4.93     1.311      5.78     0.000*   0.396**
                  T-SNR        22.67    157   5.31
Pair 6 – Male     O-F0         139.50   55    83.32    0.455      0.33     0.079    –
                  T-F0         139.95   55    83.19
Pair 7 – Female   O-F0         208.19   102   35.73    0.167      0.08     0.366    –
                  T-F0         208.36   102   35.86

SD – standard deviation; O – oral microphone data; T – throat microphone data; F0 – fundamental frequency; SNR – signal-to-noise ratio; HNR – harmonic-to-noise ratio; NNE – normalized noise energy; * p < 0.05.

In Fig. 5.1.1, the paired correlations between acoustic voice parameters obtained with the oral and throat microphones are presented. As follows from Fig. 5.1.1, statistically significant strong correlations (r = 0.71–0.86, and r = 1.0 for F0) between identical voice measurements registered with the different microphones were revealed.
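For illustration, the paired comparisons of Table 5.1.2 and the correlations of Fig. 5.1.1 correspond, respectively, to a paired Student t test and a Pearson correlation computed on per-subject parameter pairs; a minimal sketch with made-up values:

```python
import numpy as np
from scipy import stats

# Paired measurements of one parameter from the two microphones
# (values are illustrative; the study used n = 157 recordings):
o_shimmer = np.array([2.1, 3.4, 2.8, 5.0, 1.9, 4.2])
t_shimmer = np.array([2.0, 3.6, 2.6, 4.8, 2.1, 4.1])

r, p_corr = stats.pearsonr(o_shimmer, t_shimmer)   # paired correlation (Fig. 5.1.1)
t, p_diff = stats.ttest_rel(o_shimmer, t_shimmer)  # paired t test (Table 5.1.2)
print(f"r = {r:.2f} (p = {p_corr:.3f}); paired t = {t:.2f} (p = {p_diff:.3f})")
```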


Fig. 5.1.1. Paired correlations between acoustic voice parameters obtained with the oral and throat microphones, p < 0.05
O – oral microphone data, T – throat microphone data, F0 – fundamental frequency, SNR – signal-to-noise ratio, HNR – harmonic-to-noise ratio, NNE – normalized noise energy, r – Pearson’s correlation coefficient.

Table 5.1.3 presents the results of voice signal classification into the two classes of normal and pathological voice. As the outcome of the Fisher discriminant analysis of the separate acoustic voice parameters, the optimum limiting values of the parameters discriminating the normal and pathological voice subgroups were determined, and the consequent CCRs were calculated.

Table 5.1.3. Correct classification rate achieved when classifying into normal and pathological voice classes using acoustic voice parameters obtained from the oral and throat microphones

Microphones      Acoustic voice parameters   CCR       Limiting value
Oral             O-Shimmer                   75.2 %    3.20
Throat           T-Jitter                    70.7 %    0.45
Oral & Throat    T-SNR                       80.3 %    22.03
                 O-Shimmer                             3.20
                 O-NNE                                 –7.98

CCR – correct classification rate; O – oral microphone data; T – throat microphone data; SNR – signal-to-noise ratio; NNE – normalized noise energy.
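Each single-parameter rule in Table 5.1.3 amounts to thresholding one acoustic parameter at its limiting value. A sketch of how the CCR of such a rule can be computed is given below; the direction of the threshold and all values are illustrative assumptions:

```python
import numpy as np

def ccr_with_threshold(values, labels, limit, pathological_above=True):
    """Correct classification rate of a one-parameter rule with a limiting value."""
    values = np.asarray(values, dtype=float)
    predicted = values > limit if pathological_above else values < limit
    return np.mean(predicted == np.asarray(labels, dtype=bool))

# E.g. classify as pathological when O-Shimmer exceeds the limiting value 3.20
# (labels: True = pathological; values are illustrative only):
shimmer = [2.1, 3.4, 2.8, 5.0, 1.9, 4.2]
labels = [False, True, False, True, False, True]
print(ccr_with_threshold(shimmer, labels, limit=3.20))
```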


Detailed results of study No I are presented in the article: Uloza V, Padervinskis E, Uloziene I, Saferis V, and Verikas A. Combined use of standard and throat microphones for measurement of acoustic voice parameters and voice categorization. Journal of Voice. 2015;29(5):552-559.

5.2. Study No II (Analysis of oral and throat microphones using Random Forest classifier) data

In this study we compared the classification results obtained with 14 different feature sets, with a single forest designed for each feature set. The number of initial variables and the number of variables providing the highest classification accuracy on the out-of-bag (OOB) data are shown in Table 5.2.1.

Table 5.2.1. The number of initial/selected features in different feature sets and the out-of-bag (OOB) data classification accuracy (%)

No.  Features                   All   Selected   Acoustic (%)   Selected   Contact (%)
1.   Perturbation               24    6          77.26          14         76.10
2.   Frequency                  100   13         71.30          12         70.07
3.   Mel-frequency              35    7          69.16          8          70.03
4.   Cepstral energy            100   27         72.51          20         69.95
5.   Mel-coefficients           35    10         70.14          13         67.87
6.   Autocorrelation            80    13         65.09          10         64.08
7.   HNR-spectral               11    8          62.17          10         59.70
8.   HNR-cepstral               11    6          64.44          3          60.70
9.   LP-coefficients            77    12         76.66          25         64.78
10.  LPCT-coefficients          77    13         78.70          7          64.13
11.  Signal shape               128   50         70.62          10         68.70
12.  Reflection-coefficients    24    9          76.60          10         69.56
13.  Tract irregularity         71    11         80.36          21         69.44
14.  PLPC-coefficients          154   11         81.20          29         76.35
     Average                          14.0       72.59          13.7       67.96

The superiority of the acoustic microphone was observed for 13 feature sets, and the difference was statistically significant at the 95 % confidence level for most of the feature sets. RFs built with those 14 feature sets were further used in various fusion schemes. To study confidence in RF decisions, we created a two-dimensional map of the generalized proximity matrix and labelled the data with predicted class labels. The results are shown in Fig. 5.2.1, where the size of a marker reflects confidence in the decision: the larger the marker, the more confident we are.
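A minimal sketch of this mapping, assuming that the RF proximities (the fraction of trees in which two samples end up in the same leaf) are already available; the dissimilarities 1 − proximity are passed to t-SNE with a precomputed metric:

```python
import numpy as np
from sklearn.manifold import TSNE

# proximity: n x n matrix from the RF, proximity[i, j] = fraction of trees in
# which samples i and j land in the same leaf (random placeholders here).
rng = np.random.default_rng(0)
proximity = rng.uniform(0.0, 1.0, size=(50, 50))
proximity = (proximity + proximity.T) / 2.0  # enforce symmetry
np.fill_diagonal(proximity, 1.0)             # a sample is fully similar to itself

distance = 1.0 - proximity  # turn similarities into dissimilarities
embedding = TSNE(n_components=2, metric="precomputed",
                 init="random", random_state=0).fit_transform(distance)
print(embedding.shape)  # (50, 2) coordinates for the 2D map
```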

Fig. 5.2.1. The proximity matrix, created by the meta RF, mapped onto the 2D space, where ‘□’ denotes a pathological observation and ‘△’ stands for normal.

We used detection error trade-off (DET) curves to obtain a comprehensive comparison of the classification accuracy achieved using data from the two microphones (Fig. 5.2.2). The curve was generated using results from the meta RF, which was the most accurate fusion scheme for the contact microphone data. Fusion of information obtained from the two microphones was not effective for the studied data sets, even if small improvements in classification accuracy can be observed for some fusion schemes. Detailed results of study No II are presented in the publication: Verikas A, Gelžinis A, Vaičiukynas E, Bacauskienė M, Minelga J, Hållander M, Uloza V, and Padervinskis E. Data dependent random forest applied to screening for laryngeal disorders through analysis of sustained phonation: Acoustic versus contact microphone. Medical Engineering & Physics. 2015;37(2):210-218.


Fig. 5.2.2. Detection error trade-off for the data from two microphones; EER means equal error rate

5.3. Study No III (Analysis of oral and smart phone microphone data using Random Forest classifier) data

The mixed gender database of voice recordings used in this study contained 118 digital voice recordings. The demographic study data are presented in Table 5.3.1. For the classification of the voice signal into the two classes of normal and pathological voice, the LDA classifier was used in order to determine the suitability of the acoustic voice parameters for discriminating the normal and pathological voice groups, and for selecting an optimum set of parameters for the classification. The performance of the LDA was summarized by the correct classification rate (CCR). The LDA data are presented in Table 5.3.2.


Table 5.3.1. Demographic data of the study No III

                                   Total number   Gender           Age (in years)
Diagnosis                          (n=118)        Female   Male    Mean    SD
                                                  (n=73)   (n=45)
Normal voice                       34             23       11      41.8    16.96
Nodules, cysts                     16             14       2       34.6    14.48
Polyps                             26             15       11      45.9    10.76
Carcinoma, keratosis,              21             11       10      50.5    13.25
  papillomatosis
Vocal fold paralysis               9              6        3       54.7    11.93
Reflux laryngitis                  10             3        7       52      14.02
Dysphonia, presbylaryngis          2              1        1       61.5    24.75

SD – standard deviation.

Table 5.3.2. CCR achieved by the LDA when classifying into normal and pathological voice classes using acoustic voice parameters obtained from the oral and SP microphones and GFI data

Microphones    Parameters            CCR
Oral           O-NNE                 73.7 %
SP             SP-Shimmer, SP-NNE    79.5 %
Oral & GFI     O-NNE, GFI            85.1 %
SP & GFI       SP-NNE, GFI           83.8 %

CCR – correct classification rate; SP – smart phone microphone data; O – oral microphone data; GFI – glottal function index; LDA – linear discriminant analysis; NNE – normalized noise energy.

As follows from Table 5.3.2, for the oral microphone, O-NNE was the most discriminative parameter and provided a CCR of 73.7 %. For the SP microphone, a pair of acoustic voice parameters, i.e., SP-Shimmer and SP-NNE, provided a CCR of 79.5 %. LDA fusing the entire set of acoustic voice parameters and the GFI data selected an optimum pair of parameters discriminating the normal and pathological voice subgroups. For the oral microphone this pair included O-NNE and GFI, achieving a CCR of 85.1 %, and for the SP microphone the pair included SP-NNE and GFI, achieving a CCR of 83.8 %. Consequently, the combination of acoustic voice parameters and GFI data increased the CCR when discriminating normal and pathological voice classes for both oral and SP microphone voice recordings.
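For illustration, fusing one acoustic parameter with the GFI score in an LDA reduces to fitting the classifier on a two-column feature matrix; a sketch with placeholder values (not the study data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Columns: NNE (dB) and GFI score; labels: 1 = pathological voice.
# All values below are placeholders for the real measurements.
X = np.array([[-12.0, 2], [-9.5, 7], [-11.2, 1], [-7.8, 12], [-8.4, 9], [-13.1, 0]])
y = np.array([0, 1, 0, 1, 1, 0])

lda = LinearDiscriminantAnalysis().fit(X, y)
ccr = lda.score(X, y)  # correct classification rate on the given data
print(f"CCR = {ccr:.1%}")
```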


Results of the RFC performance when classifying data into normal and pathological voice classes using acoustic voice parameters obtained from the oral and SP microphones and GFI data are summarized in Fig. 5.3.1.

Fig. 5.3.1. Detection performance of the Random Forest classifier: DET curves (left) and ROC curves (right)
O – oral microphone, SP – smart phone, GFI – glottal function index, EER – equal error rate, AUC – area under the curve.


As shown in Fig. 5.3.1, the oral microphone (EER = 29.78 %) was outperformed by the SP microphone (EER = 21.32 %); however, the GFI items (EER = 10.15 %) proved to be an even better single non-invasive modality for voice pathology detection. Fusing audio data with the responses to GFI items improved detection further; the fusion of the SP microphone with GFI was the most successful, achieving the best overall EER of 7.94 %. A further combination of both microphones and the GFI data could not outperform this result. Detailed results of study No III are presented in the publication: Uloza V, Padervinskis E, Vegiene A, Pribuisiene R, Saferis V, Vaiciukynas E, Gelzinis A, and Verikas A. Exploring the feasibility of smart phone microphone for measurement of acoustic voice parameters and voice pathology screening. European Archives of Oto-Rhino-Laryngology. 2015;272(11):3391-3399.

5.4. Study No IV (Testing of VoiceTest software) data

A mixed gender database of 273 subjects (163 normal voices and 110 pathological voices), ranging from 19 to 85 years of age, was used to train the RF classifier. A mixed gender database containing 596 subjects (106 healthy men, 221 healthy women, 118 pathological men, 151 pathological women) was used as the query data set for training the RF classifier. To map data and decisions onto the 2D space, the t-distributed stochastic neighbour embedding (t-SNE) algorithm was used. The developed software provided the user with the predicted class, the classification certainty, and a 2D map. A screenshot of the main software window is shown in Fig. 5.4.1; the user interface consists of 3 parts: audio file selection, graphical visualization of data (a 2D map), and a textual view of the analysis results. The suitability of detection was assessed using voice and query data from 45 unseen subjects (9 healthy and 36 pathological). Detection was performed separately for the voice and query modalities, certainties were saved as scores, and the results were evaluated using the performance measures discussed in the previous section. As can be seen in Fig. 5.4.2, the models built on the query data outperform the ones created using the voice data: the EER is 11.1 % for the query modality and 14.8 % for the voice modality. Judging from the DET (or ROC) curves, one can conclude that the query modality is efficient not only around the EER operating point, but also has the lowest false alarm probability (or highest specificity) near the low miss probability (or high sensitivity) mode of operation, which can be considered an appealing property for initial screening in preventive health care.


Fig. 5.4.1. Screenshot of the main software window containing three UI parts: audio recording selector, 2D map, and textual results viewer, where ‘□’ denotes a pathological observation and ‘△’ stands for normal.

Fig. 5.4.2. DET curves and EER (equal error rate) for the unseen voice and query data


6. DISCUSSION

Automated acoustic analysis of voice is increasingly being used in voice clinics for the collection of objective non-invasive voice data, for documenting and quantifying dysphonia changes and the outcomes of therapeutic and/or phonosurgical treatment of voice problems [46, 136, 142-145], as well as for screening laryngeal disorders [52, 130, 146, 147]. One of the most important factors determining the reliability and practical utility of screening and categorization of voice disorders is voice recordings of acceptable quality; therefore, the choice of an appropriate microphone plays an important role in this matter. A study performed by Titze and Winholtz [50] demonstrated that the type of microphone used in acoustic voice analysis has a significant impact on the quality of the measurement outcome. The results showed that condenser microphones give better results than dynamic microphones, that microphones with a balanced output perform better than those with unbalanced outputs, and that microphone sensitivity and distance have the largest effect on the perturbation measures [60]. An acoustic cardioid microphone has been considered by some investigators to be the best choice when voice is measured in clinical settings, especially if perturbation measurements are the main interest [148]. However, because of the proximity effect of the microphone, spectral measurements may be distorted and inaccuracies of voice measurements may occur even with these microphones [48].

Basically, measurements of acoustic signal perturbations represent measurements of noise and assess the nonstationary characteristics of the acoustic voice signals. Of note, deviations from the stationary cyclic behaviour of the voice signal can result either from the larynx or from noise, either in the acoustic environment or in the data acquisition hardware [60]. Therefore, it is of great importance to control the noise level in the environment and to select appropriate recording systems combined with microphones that provide a high SNR [49]. Consequently, all noise-contributing factors should result in an acoustic environment with an SNR of at least 30 dB to produce valid results [60]. These requirements can be fulfilled rather easily if voice recordings are performed in a special soundproof booth. However, this may not be feasible for voice recordings occurring in an ordinary environment, when voice recordings are carried out for a voice disorder screening task. On the other hand, contact microphones, providing reduced sensitivity to environmental noise, could be one of the solutions to preclude the influence of background noise. Moreover, the waveform of a contact microphone is reasonably independent of the articulation because of the high glottal impedance [55]. Consequently, the waveform of a contact microphone is suitable for F0 measurements and has frequently been found useful for F0 detection and perturbation measurements [58]. These circumstances were considered in the present study investigating the suitability of the throat microphone signal for voice categorization and screening purposes. Moreover, it was presumed that a combination of oral and throat microphones would increase the CCR discriminating the normal and pathological voice groups.

Horii [58] used a contact microphone (accelerometer) to eliminate the acoustic effects of the vocal tract and found no significant differences in the jitter and shimmer measurements among eight vowels. However, the airborne voice signals had approximately twice as much shimmer as the accelerometer signals; the jitter values, on the other hand, showed only a slight tendency to increase in the airborne signals. The results of the present study are in some discrepancy with the data of Horii, because we identified a strong correlation between the jitter and shimmer values measured with the acoustic and throat microphones. Moreover, there were no statistically significant differences between the mean values of these voice perturbation parameters obtained using the acoustic and throat microphones. Generally, in the F0 data there was a perfect agreement between the two microphones both in the male and in the female group in our series. These findings are in some controversy with the results of the study by Askenfelt et al [55], who found that the mean F0 tended to be slightly higher in the contact microphone measurements compared with the electroglottogram; however, running speech, the F0 extraction algorithm, and several other factors could have contributed to this, which complicates a direct comparison with the study by Askenfelt et al. To the best of our knowledge, the measurements of voice signal turbulences (SNR, HNR, and NNE) obtained from the throat microphone in the present study have been presented for the first time. Strong correlations among these acoustic voice parameters registered with the different microphones (oral vs throat) were revealed (r = 0.71–0.78), confirming the acceptability of throat microphones for measurements in clinical settings and/or for screening purposes. Despite the statistically significant differences among the mean values of HNR and SNR found in this study, with the oral microphone showing a slight tendency towards higher HNR and SNR values, these differences in the total study group were only in the range of 5.6–5.8 %. The observed statistically significant difference between the HNR and SNR parameters of the two microphones could be due to the rather different frequency response curves of the oral and throat microphones. Further studies are required to assess possible differences in the HNR and SNR measurements using the two types of microphones in a more clinically realistic environment.


In the first study, the combined use of oral and throat microphones revealed some benefits in discriminating the normal and pathological voice subgroups. The discriminant analysis determined an optimum set of acoustic voice parameters, including T-SNR, O-Shimmer, and O-NNE, which provided a CCR of 80.3 % when categorizing normal and pathological voice samples. In comparison, from the separate oral microphone recordings O-Shimmer provided a CCR of 75.2 %, and from the separate throat microphone recordings T-Jitter provided a CCR of 70.7 %. Thus, the most discriminative parameter of the throat microphone, T-Jitter, was not included in the combined set of parameters. Such behaviour is often observed in variable selection, because the two individually best variables do not necessarily comprise the best subset of two variables. The combined set of variables, containing parameters related to both throat and oral microphones, indicates that the throat microphone may bring additional information useful for the task.

In the second study we used a larger database of patient and healthy voices. The study was concerned with the screening of laryngeal disorders based on the classification of voice samples, recorded by acoustic and contact microphones, into normal and pathological classes. To obtain a comprehensive characterization of the voice samples, 14 different sets of features were extracted and used in the RF. Novel ways to build an adaptive data dependent RF and to explore data and automated decisions were suggested. The acoustic microphone was superior to the contact one for all feature sets except the “Mel-frequency bands” set, and the observed difference was statistically significant at the 95 % confidence level for most of the sets. The perceptual linear predictive cepstral coefficients were the best feature set for both microphones. The tract irregularity features built upon the reflection coefficients were able to significantly improve the classification accuracy when using the acoustic microphone, contrary to the contact microphone case. While the linear predictive coefficients and the linear predictive cosine transform coefficients showed good performance when using the acoustic microphone, these feature sets were less suitable for the contact microphone. The perturbation measures performed approximately equally well for both microphones, and this set was the next best in the contact microphone case. In the first study, where we used a smaller voice database, we reached a 5.1 % better classification rate with discriminant analysis using the data of both microphones. In our second study we did not find any evidence that voice signals recorded by the contact microphone could bring additional information useful for the classification, compared to the information available from the acoustic microphone. This result seems to contradict the finding of Mubeen et al [70]. The performance increase observed by Mubeen et al [70] when using information from microphones of both types is probably due to the fact that features of only one type were used in that study, in contrast to the very broad set of features used in our study. We think that the reason why we attained different results in the two studies is that the database was much larger in the second study, and we also used different features extracted from the voice samples: in the first study we used only six features, whereas in the second study 14 different sets of features were extracted and used. Another reason for the difference is that in the first study we used a discriminant analysis-based classification, in order to compare our data, while in the other study the comparison was based on the Random Forest classification. This choice was motivated by the fact that Random Forests are among the most accurate general-purpose classifiers [92] and have shown excellent performance on many practical problems [149]. The superiority of the acoustic microphone over the contact one was observed for both feature-level and decision-level approaches to fusion. We believe that the conclusion concerning the superiority of the acoustic microphone over the contact one would also hold for other advanced classifiers, for example, support vector machines.


We also believe that other feature sets capable of providing a comprehensive representation of the voice signal would lead to a similar conclusion. The superiority of an acoustic microphone over a contact one is expected, and one may wonder why this comparison is necessary. The comparison results show that a reasonable performance can be achieved using voice recordings made with a contact microphone. Thus, being much less sensitive to ambient noise while still providing acceptable performance, a contact microphone could be very useful when voice recordings are to be made in noisy environments. Some limitations, originating both from the design of the present study and from the inherent restrictions of throat microphones for acoustic voice analysis, must be considered. Throat microphones are not considered to be very effective in transmitting consonant sounds and high frequencies [150]. The intrinsic elasticity of the underlying human body tissues, acting as a low-pass filter with a 3-kHz cut-off frequency, limits the frequency representation of the signal [66]. This feature of the throat microphone may influence the accuracy of voice signal turbulence noise measurements. Presumably, these features of the throat microphone determined some differences among the mean values of the voice signal turbulence measurements (SNR and HNR) carried out with the two types of microphones in the present study. Most of the voice signals recorded in our study can be attributed to the type 1 group according to Titze [27]; however, signals of type 2 and type 3 are also present. As is well known, per-cycle measurement of F0 cannot be done reliably for type 2 and type 3 signals.

An acoustic microphone exhibiting a cardioid polar pattern, as the AKG Perception 220 does, is not an ideal microphone for performing spectral analysis, because its frequency response drops by about 3 dB at 50 Hz. However, because the frequency range of the contact Stryker/Triumph PC microphone is limited to 100 Hz at the lower end, this deficiency of the acoustic microphone is not crucial for the pairwise comparison of voice parameters obtained using these two microphones. The relative discomfort of wearing a throat microphone during voice recording, as well as some difficulty in properly positioning the device and in quantifying the effects of contact pressure on the skin frequency response, should be considered in future studies [66, 151]. Also, it will be of great importance to analyse how well the throat microphone performs in an ordinary environment and in the presence of background noise.

The proposed way of building data dependent Random Forests proved to be the best approach to fusing the information available from the 14 different feature sets and allowed achieving statistically significant improvements in classification accuracy compared to the accuracy achieved by Random Forests built on the single best feature set. Elaborate selection of the trees to be included in a Random Forest and the use of weighted voting instead of simple voting are two issues to consider when aiming to obtain accurate Random Forests. The generalized proximity matrix is able to gather the information available in multiple feature subsets in such a way that similar voice samples are mapped close together in the 2D space. A map exhibiting such an ordering property is very helpful for data exploration, especially bearing in mind that sometimes even erroneous decisions are made with high confidence. It is worth noting that only the features used by the trees of the forests contribute to the proximity values.

How do the classification results of this study compare to the results obtained by other researchers in similar tasks? The classification accuracy obtained in several recent studies solving a pathological voice related two-class classification problem varies in a broad range: 80.0 % [146], 89.1 % [130], 91.0 % [152], 93.4 % [153], 93.8 % using sustained phonation and 96.3 % using running speech [36], 95.5 % [120], and 97.5 % and 99.0 % using text-independent and text-dependent spoken digits, respectively [147]. Even if there is a clear difference in the accuracy achieved by the different techniques, it is a rather complicated matter to establish the superiority of one or another technique. Different data sets, different procedures used to assess classification accuracy, and different aims pursued in the studies are the main issues making the comparison difficult. Data sets usually differ in several important aspects: size, composition, quality of records, variation in pathology, severity of illness, and subject age. Most data sets are rather small and unbalanced, e.g. 120 subjects (8 with normal voice and 112 pathological) [2], 120 subjects (50 normal and 70 pathological) [6], or 140 subjects (23 normal and 117 pathological) [5]. A balanced data set collected from 200 subjects (100 normal and 100 pathological), aged 33 ± 12 years, was used in [38]. A big public database covering a broad range of voice pathologies and subject ages would greatly facilitate the comparison of different algorithms. Different procedures used to assess classification accuracy can provide rather different results. Often several voice recordings of the same subject are used in the studies. In such studies, if care is taken only not to use the same voice records (rather than the same subjects) in the training and test sets – which is frequently the case – the voice of the same subject can be used for both training and testing the models, which leads to overoptimistic classification accuracy. Some studies have a specific aim, such as voice analysis-based detection of Parkinson's disease [154]; in such cases, the voice pathology is of a specific character, mainly due to Parkinson's disease. Different procedures used to estimate classification accuracy often lead to different variances of the estimate; for example, high variance is characteristic of the leave-one-out estimate. This also brings some uncertainty when comparing the classification accuracies obtained in different studies. Tsanas and Gómez-Vilda, in a recent work based on standard Random Forests, obtained a 91 % classification accuracy in a two-class classification problem concerning the discrimination between normal and pathological voices [152]. Though the classification accuracy obtained in our work is lower, the proposed data dependent Random Forest outperformed the standard one used in [152]. The lower accuracy can be explained by the different, bigger data set used in this study, which covers a larger range of pathologies and subject ages.

In all our current studies, sustained phonation of the vowel /a/ was chosen for analysis because steady-state phonation (i.e., time and frequency invariance) is simple and time effective, allows the reduction of variance in sustained vowels, and provides reliable detection and computation of acoustic features [33, 52, 144]. Moreover, sustained vowels are not influenced by speech rate and stress; they typically do not contain voiceless phonemes, fast voice onsets and terminations, or prosodic fluctuations in F0 and amplitude [54]. Although sustained vowel phonation cannot be a complete substitute for real-life phonation in acoustic analysis [33], sustained vowels are relatively insulated from influences related to different languages and can therefore be considered universal and suitable for voice screening purposes. Nevertheless, the analysis of connected speech samples would be of interest in future research, because symptoms of disordered voice quality are more typically revealed in continuous speech [58]. The evaluation of voice disorders using sustained vowels does not always correspond with running speech, and this might be a limitation of our studies. However, the 'classically' objective acoustic parameters can scarcely be used in the direct analysis of running speech, in contrast to spectral and cepstral methods [155, 156]: for the commonly used objective acoustic parameters (e.g. jitter, shimmer, harmonics-to-noise analysis, etc.), running speech first has to be filtered into voiced and non-voiced segments before it is eligible for analysis. At the moment there are only a few products on the market that successfully apply multi-parametric models to evaluate voice quality both on sustained vowels and on running speech: the Acoustic Voice Quality Index (AVQI) and the Cepstral Spectral Index of Dysphonia (CSID), proposed by Maryn et al [35] and Awan et al [156], respectively. Further investigations are needed, although both models could confirm accuracy as well as reliability in detecting voice abnormality: AVQI was tested in 4 studies [157–161], and CSID was tested in 5 studies [162–167]. All those studies focused on patients with functional dysphonia, in contrast to our main objective, where we studied patients with organic laryngeal diseases; this is one of the reasons why we chose sustained vowel phonation.

In the third study we compared the acoustic microphone and the smart phone microphone options for voice recordings. The study was concerned with the screening of laryngeal disorders based on the classification of voice samples into a pathological and a healthy class; for better classification results we also used query data from the GFI-LT questionnaire. In the scientific literature we found only a few studies about the possibilities of smart phone microphones for recording patient voices. In 2014, Vogel et al published a paper about the possibilities of smart phones in voice analysis, but these authors suggested that smart phones were not comparable with the benchmark devices – high-quality recorders coupled with a condenser microphone. The authors' conclusion was that acoustic analyses could not be assumed to be comparable if different recording methods were used to record the speech. Still, the authors admitted only 15 healthy volunteers into their study, which was its weak point. In other studies the authors identified that remote detection is cost-effective and that the assessment of voice pathology over telephone channels reached a normal/pathological voice classification accuracy close to 90 % [125, 130–132]. Moran et al [130] used a linear classifier processing measurements of pitch perturbation, amplitude perturbation, and harmonic-to-noise ratio derived from digitized speech recordings. The results showed that while a sustained phonation recorded in a controlled environment could be classified into two classes with an accuracy of 89.1 %, telephone-quality speech could be classified with an accuracy of 74.2 % using the same scheme. In 2008, Wormald et al [131] used a voice database of 78 patients with vocal fold paralysis; these authors analysed sustained phonation recorded over a standard telephone network. The automated speech analysis system demonstrated 92 % sensitivity and 75 % specificity for detecting vocal fold paralysis. Kaleem et al [125] used continuous speech samples from a 212-recording voice database of 51 normal and 161 pathological speakers, which had been modified to simulate telephone-quality speech under different levels of noise, and a linear classifier was used with the feature vector; as these authors indicated, high classification accuracy was obtained (89.7 % at a signal-to-noise ratio of 30 dB). In 2012, Lin et al [133] recorded 11 healthy and 10 pathological voice samples and compared an iPhone microphone with a headset condenser microphone. High inter-recorder reliabilities were found for the evaluated acoustic measures; in particular, F0 and jitter were found to be less susceptible to inter-recorder differences. However, the finding of a significant recorder effect on shimmer and SNR counter-indicated a direct comparison between voice measures obtained from different digital recording systems. The small number of participants was the drawback of that study. In 2013 and 2015, Mehta et al published studies on smart phone-based ambulatory voice health monitoring. These authors used a miniature accelerometer sensing acceleration in one dimension, with a vibration sensitivity suitable for obtaining meaningful information about the voice. They analysed the voices of 51 pathological and 20 healthy volunteers and used a machine learning system for the classification. They concluded that wearable voice monitoring systems have the potential to provide more reliable and objective measures of daily voice use that can enhance the diagnostic and treatment strategies for voice disorders [134, 135]. Mat Baki et al determined that recordings performed with an iPod's internal microphone and analysed with the OperaVoxTM software application, installed on an iPod touch (4th generation), were statistically comparable to the gold standard, i.e., the Multidimensional Voice Program (MDVP, KayPentax, NJ, USA) [126].

In our third study we determined strong statistically significant inter-correlations (r = 0.78–0.91), with a small exception for jitter (r = 0.68), as was also identified by Lin et al [133], between the acoustic voice parameters obtained using the standard oral cardioid and SP microphones, thus confirming the acceptability of SP microphones for acoustic voice measurements in clinical settings and/or for screening purposes. Moreover, for the F0 data there was a perfect agreement between the recordings of the two microphones in our series; our results were similar to those presented by Lin et al [133].


Despite the statistically significant differences among the mean values of some acoustic voice parameters found in this study, with the oral microphone showing a slight tendency towards higher mean values, these differences were only in the range of 3.4–9.5 %. Some exceptions were observed for jitter in the normal voice group (difference 19.9 %) and for NNE in the pathological voice group (difference 19.0 %). However, these differences between the acoustic voice parameters obtained with the different microphones ultimately had no significant impact on the accuracy of classification into the normal/pathological voice classes. The acoustic voice parameters were more useful for voice pathology detection with the RF when estimated from the recordings made with the SP microphone than from the standard microphone voice recordings; for example, SP-based jitter was found to be the most important variable for the RF after the GFI items.

It was a presumption and a planned design of the present study that a combination of the results of the automated acoustic analysis of the sustained vowel /a/ and the voice-related questionnaire data would increase the discrimination of the normal and pathological voice classes. In this study, the combined use of the acoustic voice analysis results and the GFI-LT questionnaire data revealed evident benefits in discriminating the normal and pathological voice groups; to the best of our knowledge, this has been presented for the first time. The discriminant analysis determined the O-NNE and SP-NNE parameters as optimal, providing CCRs of 73.7 % and 79.5 %, respectively, when classifying normal and pathological voice samples. However, fusion of the results obtained from the voice recordings and the GFI-LT data increased the CCR to 84.2 % for the oral microphone voice recordings and to 84.6 % for the SP microphone recordings. Furthermore, fusing the audio data with the responses to the GFI-LT items improved detection further: the fusion of the SP microphone with GFI was the most successful, achieving the best overall EER of 7.94 %, and including both microphones besides the GFI-LT could not outperform this result. Noteworthy is the fact that in the task of distinguishing between the normal and pathological voice classes, the GFI-LT data outperformed the acoustic data when using the RFC. This is not surprising, since our previous investigations have shown that questionnaire data may carry more information relevant for the classification task than acoustic data [11]. On the other hand, one can expect higher classification accuracy from the RF when more parameters are used to represent the acoustic data than the few computed by the Dr. Speech software and used in the present study. Moreover, the relatively high discrimination power of the GFI-LT data is an encouraging result for the developers of future web-based voice screening systems, because such a sensor-independent data source of high discrimination power may lessen possible acoustic parameter-dependent differences in the sensitivity of a combined classifier built using data of both types (acoustic parameters and questionnaire data). This would be of great importance if different voice recording devices, i.e., different smart phones and different microphones, were to be used.

Some limitations of the present study must be considered, because only the Dr. Speech system was used, registering a rather limited number of acoustic voice parameters reflecting perturbation (jitter and shimmer) and turbulent noise variables (NNE, HNR, SNR) in the voice signal [46]. This limitation of the analysis system presumably reduces the accuracy of classification into the normal and pathological voice classes. Therefore, future investigations should concentrate on the utility of a large variety of voice signal feature types in classifying the voice into healthy and different pathological voice classes, using sophisticated contemporary methods of automated voice analysis. Also, it will be of great importance to analyse how well the SP microphone performs in an ordinary environment and in the presence of background noise. The results of the present study confirmed that SP-based voice recordings provide suitable quality for automated acoustic voice analysis. Moreover, the portability, patient/user-friendliness, low cost, and applicability of SP-based devices beyond clinical settings give them greater utility; they may therefore be preferred by patients and clinicians for voice data collection in both home and clinical settings [126]. It is important to point out that SP-based voice recordings and the automatic voice analysis system are not considered a substitute for clinical examination; however, they are seen as having a potential role in screening for laryngeal diseases and in the subsequent referral of selected individuals for earlier otolaryngological examination and visualization of the larynx (videolaryngostroboscopy, indirect/direct microlaryngoscopy), thus improving the diagnostics of laryngeal diseases. On the other hand, acoustic voice analysis may be an important part of the follow-up and monitoring of voice treatment results.

In the fourth study we tested the automated software for the screening of laryngeal disorders. The data classification algorithms presented in our previous articles were successfully used to detect laryngeal pathology, together with the t-distributed stochastic neighbour embedding (t-SNE) algorithm for data visualization [168]. A software usability evaluation was also performed as a final software acceptance step. Accurate detection was observed on unseen subjects, where an EER of 14.8 % for the voice data and 11.1 % for the query data was achieved. The results of the experimental studies have shown that detection using association rules generated from the query data outperformed detection based on audio features extracted from the voice data by almost 4 %. However, the small size of the testing group was a limitation of this part of the study; therefore, future investigations should concentrate on the utility of a larger database of voice records.

As indicated by Mendes et al [44], there are many programs for acoustic voice analysis; however, all currently available programs provide the analysis from voice samples only. This constitutes the main difference of our program: the user can have both sets of information – the acoustic voice analysis and the query data analysis – in the same window. The developed software was assessed by the users according to the ISO 9241 standard as suitable for the task [141]. Likewise, the software rules published by Barsties et al [24] were kept in mind when creating the program. Data classification based on association rules is a transparent (white-box) approach, relating specific questionnaire responses to a diagnosis with an estimated confidence and providing an expert system solution for medical decision support and preventive healthcare. The low EER of the association rule-based classifier indicates a relatively high preventive healthcare potential. The t-SNE algorithm proved to be a useful tool for visualizing multi-dimensional data represented by the pairwise similarities in a proximity matrix generated in the RF designing process. As noted by the otolaryngology specialists, the information provided by the developed software is very useful in the education process, for comparative studies and for providing deeper insights into data representing various groups of subjects. The ability to relate basic voice parameters to the diagnosis and to the parameters of other similar (in terms of parameters) subjects helps laryngologists to make associations and generalize over different cases. The 2D map is also of great help in identifying erroneously labelled data and other unexpected deviations in both the voice and the query data. Probability density functions, created using the Epanechnikov kernel smoothing method, provide additional information when evaluating patients with respect to the audio parameters.
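For illustration, such density estimates can be obtained with an Epanechnikov kernel, as sketched below; the bandwidth and the data values are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# One-dimensional density of an acoustic parameter (illustrative values):
values = np.array([2.1, 2.4, 2.8, 3.4, 3.9, 4.2, 5.0]).reshape(-1, 1)

kde = KernelDensity(kernel="epanechnikov", bandwidth=0.5).fit(values)
grid = np.linspace(1.0, 6.0, 101).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))  # score_samples returns the log-density
print(density.max())
```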


7. CONCLUSIONS

1. The measurements of acoustic voice parameters using a combination of oral and throat microphones proved to be reliable in clinical settings and demonstrated a high CCR (80.3 %) in distinguishing the healthy and pathological voice patient subgroups. In situations where the use of conventional microphones could be complicated or restricted due to background noise, a contact (throat) microphone can be considered a valuable and beneficial alternative and/or supplement for voice recordings and analysis.

2. The proposed way of building a data-dependent Random Forest classifier proved to be the best approach to fusing the information available from 14 different acoustic voice feature sets based on data obtained from acoustic and throat microphones, and allowed achieving a high (CCR 86.62 %) and statistically significant improvement in the accuracy of classification into normal and pathological voice classes.

3. The measurements of acoustic voice parameters using a smart phone microphone were shown to be reliable in clinical settings, demonstrating a high CCR and a low EER when distinguishing between the healthy and pathological voice classes, and validated the suitability of the smart phone microphone signal for the task of automatic voice analysis and screening.

4. The fusion of audio data with the responses to the GFI-LT questionnaire items improved the detection rate and proved to be reliable in clinical settings when distinguishing between the normal and pathological voice classes. The fusion of acoustic voice parameters registered with the smart phone microphone and the GFI-LT results was the most successful, achieving the best overall EER of 7.94 %.

5. The developed VoiceTest software for automatic voice and query data analysis corresponds to the ISO 9241 standard and was assessed as suitable for voice classification into normal and pathological voice classes. The observed EER was 14.8 % for the acoustic voice data and 11.1 % for the query data. The software shows the potential to be used in clinical practice.


REFERENCES

1. Döllinger M, McWhorter A, Svec J, Lohscheller J, and Kunduk M. Support vector machine classification of vocal fold vibrations based on phonovibrogram features. INTECH Open Access Publisher; 2011.
2. Titze IR. The Myoelastic Aerodynamic Theory of Phonation. National Centre for Voice and Speech, Iowa City; 2006. ISBN: 0-87414-122-2.
3. Aronson AE, and Bless D. Clinical voice disorders. Thieme; 2011.
4. Ryan EB, and Bulik CM. Evaluations of middle class and lower class speakers of standard American and German-accented English. Journal of Language and Social Psychology. Sage Publications; 1982;1(1):51-61.
5. Verdolini K, and Ramig LO. Review: occupational risks for voice problems. Logopedics Phoniatrics Vocology. Informa UK Ltd; 2001;26(1):37-46.
6. Pribuisiene R, Uloza V, Kupcinskas L, and Jonaitis L. Perceptual and acoustic characteristics of voice changes in reflux laryngitis patients. Journal of Voice. Elsevier; 2006;20(1):128-136.
7. Carding PN, Roulstone S, Northstone K, and the ALSPAC Study Team. The prevalence of childhood dysphonia: a cross-sectional study. Journal of Voice. Elsevier; 2006;20(4):623-630.
8. D'haeseleer E, Depypere H, Claeys S, Wuyts FL, De Ley S, and Van Lierde KM. The impact of menopause on vocal quality. Menopause. LWW; 2011;18(3):267-272.
9. Verma M, Sarfaty M, Brooks D, and Wender RC. Population-based programs for increasing colorectal cancer screening in the United States. CA Cancer J Clin. United States; 2015;65(6):496-510.
10. Kons Z, Satt A, Hoory R, Uloza V, Vaiciukynas E, Gelzinis A, and Bacauskiene M. On feature extraction for voice pathology detection from speech signals. In Proceedings of the 1st Annual Afeka-AVIOS Speech Processing Conference, Tel Aviv Academic College of Engineering, Tel Aviv, Israel. 2011.
11. Verikas A, Bacauskiene M, Gelzinis A, Vaiciukynas E, and Uloza V. Questionnaire- versus voice-based screening for laryngeal disorders. Expert Systems with Applications. Elsevier; 2012;39(6):6254-6262.


12. Bhan SN, Coblentz CL, and Ali SH. Effect of voice recognition on radiologist reporting time. Canadian Association of Radiologists Journal. 2008;59(4):203.
13. Henricks WH, Roumina K, Skilton BE, Ozan DJ, and Goss GR. The utility and cost effectiveness of voice recognition technology in surgical pathology. Modern Pathology. Nature Publishing Group; 2002;15(5):565-571.
14. Hewett B, Card C, Gasen M, Perlman S, and Verplank W. ACM SIGCHI Curricula for Human-Computer Interaction; 1996. Last updated 2004.
15. Aurelija A, Ulozas V, Kupčinskas L, Jašinskas V, Dalia D, Marozas V, Simutis R, Rasa R, Jegelevičius D, and Verikas A. Balso daugiaparametrio tyrimo sisteminės analizės reikšmė pirminei gerklų ligų atrankai. Lietuvos sveikatos mokslų universitetas; 2014.
16. Wilpon JG. Voice-processing technologies--their application in telecommunications. Proc Natl Acad Sci U S A. United States; 1995;92(22):9991-8.
17. Meisel W. Speech Recognition UPDATE. TMA Associates, Encino, Calif. 1993.
18. Alcantud F, Dolz I, Gaya C, and Martin M. The voice recognition system as a way of accessing the computer for people with physical standards as usual. Technology and Disability. 2006;18(3):89-97.
19. Ruben RJ. Redefining the survival of the fittest: communication disorders in the 21st century. The Laryngoscope. Wiley Online Library; 2000;110(2):241-241.
20. Roy N, Merrill RM, Thibeault S, Parsa RA, Gray SD, and Smith EM. Prevalence of voice disorders in teachers and the general population. J Speech Lang Hear Res. United States; 2004;47(2):281-93.
21. Branski RC, Cukier-Blaj S, Pusic A, Cano SJ, Klassen A, Mener D, Patel S, and Kraus DH. Measuring quality of life in dysphonic patients: a systematic review of content development in patient-reported outcomes measures. J Voice. United States; 2010;24(2):193-8.
22. Bhattacharyya N. The prevalence of voice problems among adults in the United States. The Laryngoscope. Wiley Online Library; 2014;124(10):2359-2362.
23. Dejonckere PH, Bradley P, Clemente P, Cornut G, Crevier-Buchman L, Friedrich G, Van De Heyning P, Remacle M, and Woisard V; Committee on Phoniatrics of the European Laryngological Society. A basic protocol for functional assessment of voice pathology, especially for investigating the efficacy of (phonosurgical) treatments and evaluating new assessment techniques. Guideline elaborated by the Committee on Phoniatrics of the European Laryngological Society (ELS). Eur Arch Otorhinolaryngol. 2001;258:77-82.
24. Barsties B, and De Bodt M. Assessment of voice quality: Current state-of-the-art. Auris Nasus Larynx. Elsevier; 2015;42(3):183-188.
25. Uloza V, Padervinskis E, Uloziene I, Saferis V, and Verikas A. Combined use of standard and throat microphones for measurement of acoustic voice parameters and voice categorization. Journal of Voice. Elsevier; 2015.
26. Buder EH. Acoustic analysis of voice quality: A tabulation of algorithms 1902-1990. Voice quality measurement. Singular Publishing, San Diego, CA; 2000:119-244.
27. Titze IR. Workshop on acoustic voice analysis: Summary statement. National Center for Voice and Speech; 1995.
28. MacCallum JK, Zhang Y, and Jiang JJ. Vowel selection and its effects on perturbation and nonlinear dynamic measures. Folia Phoniatrica et Logopaedica. Karger Publishers; 2011;63(2):88-97.
29. Gelfer MP, and Fendel DM. Comparisons of jitter, shimmer, and signal-to-noise ratio from directly digitized versus taped voice samples. Journal of Voice. Elsevier; 1995;9(4):378-382.
30. Ferrand CT. Harmonics-to-noise ratio: an index of vocal aging. Journal of Voice. Elsevier; 2002;16(4):480-487.
31. Kasuya H, Ogawa S, Mashima K, and Ebihara S. Normalized noise energy as an acoustic measure to evaluate pathologic voice. The Journal of the Acoustical Society of America. Acoustical Society of America; 1986;80(5):1329-1334.
32. Choi SH, Lee J, Sprecher AJ, and Jiang JJ. The effect of segment selection on acoustic analysis. Journal of Voice. Elsevier; 2012;26(1):1-7.
33. Moon KR, Chung SM, Park HS, and Kim HS. Materials of acoustic analysis: sustained vowel versus sentence. Journal of Voice. Elsevier; 2012;26(5):563-565.
34. Awan SN, Giovinco A, and Owens J. Effects of vocal intensity and vowel type on cepstral analysis of voice. Journal of Voice. Elsevier; 2012;26(5):670.e15.
35. Maryn Y, Corthals P, Van Cauwenberge P, Roy N, and De Bodt M. Toward improved ecological validity in the acoustic measurement of overall voice quality: combining continuous speech and sustained vowels. Journal of Voice. Elsevier; 2010;24(5):540-555.


36. Godino-Llorente JI, Fraile R, Saenz-Lechon N, Osma-Ruiz V, and Gomez-Vilda P. Automatic detection of voice impairments from text-dependent running speech. Biomedical Signal Processing and Control. Elsevier; 2009;4(3):176-182.
37. de Krom G. Consistency and reliability of voice quality ratings for different types of speech fragments. Journal of Speech, Language, and Hearing Research. ASHA; 1994;37(5):985-1000.
38. Revis J, Giovanni A, Wuyts F, and Triglia J-M. Comparison of different voice samples for perceptual analysis. Folia Phoniatrica et Logopaedica. Karger Publishers; 1999;51(3):108-116.
39. Wolfe V, Cornell R, and Fitch J. Sentence/vowel correlation in the evaluation of dysphonia. Journal of Voice. Elsevier; 1995;9(3):297-303.
40. Zraick RI, Wendel K, and Smith-Olinde L. The effect of speaking task on perceptual judgment of the severity of dysphonic voice. Journal of Voice. Elsevier; 2005;19(4):574-581.
41. Maryn Y, Corthals P, Van Cauwenberge P, Roy N, and De Bodt M. Toward improved ecological validity in the acoustic measurement of overall voice quality: combining continuous speech and sustained vowels. Journal of Voice. Elsevier; 2010;24(5):540-555.
42. Maryn Y, Roy N, De Bodt M, Van Cauwenberge P, and Corthals P. Acoustic measurement of overall voice quality: A meta-analysis. The Journal of the Acoustical Society of America. Acoustical Society of America; 2009;126(5):2619-2634.
43. Amato F, Cannataro M, Cosentino C, Garozzo A, Lombardo N, Manfredi C, Montefusco F, Tradigo G, and Veltri P. Early detection of voice diseases via a web-based system. Biomedical Signal Processing and Control. Elsevier; 2009;4(3):206-211.
44. Mendes AP, Ferreira LJ, and Castro E. Softwares e hardwares de análise acústica da voz e da fala. Distúrbios da Comunicação. ISSN 2176-2724. 2012;24(3).
45. Hadjitodorov S, and Mitev P. A computer system for acoustic analysis of pathological voices and laryngeal diseases screening. Med Eng Phys. England; 2002;24(6):419-29.
46. Smits I, Ceuppens P, and De Bodt MS. A comparative study of acoustic voice measurements by means of Dr. Speech and Computerized Speech Lab. Journal of Voice. Elsevier; 2005;19(2):187-196.
47. Batalla FN, Márquez RG, González MBP, Laborda IG, Fernández MF, and Galán MM. Acoustic voice analysis using the Praat programme: comparative study with the Dr. Speech programme. Acta Otorrinolaringologica (English Edition). Elsevier; 2014;65(3):170-176.
48. Svec JG, and Granqvist S. Guidelines for selecting microphones for human voice production research. American Journal of Speech-Language Pathology. ASHA; 2010;19(4):356-368.
49. Deliyski DD, Evans MK, and Shaw HS. Influence of data acquisition environment on accuracy of acoustic voice quality measurements. Journal of Voice. Elsevier; 2005;19(2):176-186.
50. Titze IR, and Winholtz WS. Effect of microphone type and placement on voice perturbation measurements. Journal of Speech, Language, and Hearing Research. ASHA; 1993;36(6):1177-1190.
51. Maryn Y, Corthals P, De Bodt M, Van Cauwenberge P, and Deliyski D. Perturbation measures of voice: a comparative study between Multi-Dimensional Voice Program and Praat. Folia Phoniatrica et Logopaedica. Karger Publishers; 2009;61(4):217-226.
52. Wormald RN, Moran RJ, Reilly RB, and Lacy PD. Performance of an automated, remote system to detect vocal fold paralysis. Annals of Otology, Rhinology & Laryngology. SAGE Publications; 2008;117(11):834-838.
53. Lin E, Hornibrook J, and Ormond T. Evaluating iPhone recordings for acoustic voice assessment. Folia Phoniatrica et Logopaedica. Karger Publishers; 2012;64(3):122-130.
54. Maryn Y, De Bodt M, Barsties B, and Roy N. The value of the Acoustic Voice Quality Index as a measure of dysphonia severity in subjects speaking different languages. European Archives of Oto-Rhino-Laryngology. Springer; 2014;271(6):1609-1619.
55. Askenfelt A, Gauffin J, and Sundberg J. A comparison of contact microphone and electroglottograph for the measurement of vocal fundamental frequency. Journal of Speech, Language, and Hearing Research. ASHA; 1980;23(2):258-273.
56. Munger JB, and Thomson SL. Frequency response of the skin on the head and neck during production of selected speech sounds. The Journal of the Acoustical Society of America. Acoustical Society of America; 2008;124(6):4001-4012.
57. Neumann K, Gall V, Schutte HK, and Miller DG. A new method to record subglottal pressure waves: potential applications. Journal of Voice. Elsevier; 2003;17(2):140-159.
58. Horii Y. Jitter and shimmer differences among sustained vowel phonations. Journal of Speech, Language, and Hearing Research. ASHA; 1982;25(1):12-14.


59. Hancock A, and Helenius L. Adolescent male-to-female transgender voice and communication therapy. Journal of Communication Disorders. Elsevier; 2012;45(5):313-324.
60. Deliyski DD, Shaw HS, and Evans MK. Adverse effects of environmental noise on acoustic voice quality measurements. Journal of Voice. Elsevier; 2005;19(1):15-28.
61. Svec JG, Titze IR, and Popolo PS. Estimation of sound pressure levels of voiced speech from skin vibration of the neck. The Journal of the Acoustical Society of America. Acoustical Society of America; 2005;117(3):1386-1394.
62. Shahina A, and Yegnanarayana B. Mapping speech spectra from throat microphone to close-speaking microphone: A neural network approach. EURASIP Journal on Advances in Signal Processing. Hindawi Publishing Corp.; 2007;2007(2):10-10.
63. Graciarena M, Franco H, Sonmez K, and Bratt H. Combining standard and throat microphones for robust speech recognition. Signal Processing Letters, IEEE. IEEE; 2003;10(3):72-74.
64. Acker-Mills BE, Houtsma AJ, and Ahroon WA. Speech intelligibility in noise using throat and acoustic microphones. Aviat Space Environ Med. United States; 2006;77(1):26-31.
65. Herzog M, Kühnel T, Bremert T, Herzog B, Hosemann W, and Kaftan H. The impact of the microphone position on the frequency analysis of snoring sounds. European Archives of Oto-Rhino-Laryngology. Springer; 2009;266(8):1315-1322.
66. Dupont S, Ris C, and Bachelart D. Combined use of close-talk and throat microphones for improved speech recognition under non-stationary background noise. In COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction. 2004.
67. Nigade AS, and Chitode JS. Throat microphone signals for isolated word recognition using LPC. International Journal of Advanced Research in Computer Science and Software Engineering. 2012:401-407.
68. Dekens T, Verhelst W, Capman F, and Beaugendre F. Improved speech recognition in noisy environments by using a throat microphone for accurate voicing detection. In 18th European Signal Processing Conf. (EUSIPCO). 2010. p. 23-27.
69. Dekens T, Patsis Y, Verhelst W, Beaugendre F, and Capman F. A Multi-sensor Speech Database with Applications towards Robust Speech Processing in Hostile Environments. In LREC. 2008.


70. Mubeen N, Shahina A, and Vinoth G. Combining spectral features of standard and throat microphones for speaker identification. In Recent Trends in Information Technology (ICRTIT), 2012 International Conference on. 2012. p. 119-122.
71. Erzin E. Improving throat microphone speech recognition by joint analysis of throat and acoustic microphone recordings. Audio, Speech, and Language Processing, IEEE Transactions on. IEEE; 2009;17(7):1316-1324.
72. Horii Y, and Fuller BF. Selected acoustic characteristics of voices before intubation and after extubation. Journal of Speech, Language, and Hearing Research. ASHA; 1990;33(3):505-510.
73. Cheyne HA, Hanson HM, Genereux RP, Stevens KN, and Hillman RE. Development and testing of a portable vocal accumulator. Journal of Speech, Language, and Hearing Research. ASHA; 2003;46(6):1457-1467.
74. Nolan M, Madden B, and Burke E. Accelerometer based measurement for the mapping of neck surface vibrations during vocalized speech. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE. 2009. p. 4453-4456.
75. Yiu EM-L, Chen FC, Lo G, and Pang G. Vibratory and perceptual measurement of resonant voice. Journal of Voice. Elsevier; 2012;26(5):675.e13.
76. Chen FC, Ma EP-M, and Yiu EM-L. Facial bone vibration in resonant voice production. Journal of Voice. Elsevier; 2014;28(5):596-602.
77. Jacobson BH, Johnson A, Grywalski C, Silbergleit A, Jacobson G, Benninger MS, and Newman CW. The Voice Handicap Index (VHI): development and validation. American Journal of Speech-Language Pathology. ASHA; 1997;6(3):66-70.
78. Franic DM, Bramlett RE, and Bothe AC. Psychometric evaluation of disease specific quality of life instruments in voice disorders. Journal of Voice. Elsevier; 2005;19(2):300-315.
79. Aurelija A, Ulozas V, Kupčinskas L, Jašinskas V, Dalia D, Marozas V, Simutis R, Rasa R, Jegelevičius D, and Verikas A. Balso daugiaparametrio tyrimo sisteminės analizės reikšmė pirminei gerklų ligų atrankai. Lietuvos sveikatos mokslų universitetas; 2014.
80. Nawka T, Wiesmann U, and Gonnermann U. [Validation of the German version of the Voice Handicap Index]. HNO. 2003;51(11):921-930.
81. Rosen CA, Lee AS, Osborne J, Zullo T, and Murry T. Development and validation of the Voice Handicap Index-10. The Laryngoscope. Wiley Online Library; 2004;114(9):1549-1556.


82. Bach KK, Belafsky PC, Wasylik K, Postma GN, and Koufman JA. Validity and reliability of the glottal function index. Archives of Otolaryngology--Head & Neck Surgery. American Medical Association; 2005;131(11):961-964.
83. Ruta R, Baceviciene M, Uloza V, Vegiene A, and Antuseva J. Validation of the Lithuanian version of the Glottal Function Index. Journal of Voice. Elsevier; 2012;26(2):e73-e78.
84. Cohen JT, Oestreicher-Kedem Y, Fliss DM, and DeRowe A. Glottal function index: a predictor of glottal disorders in children. Ann Otol Rhinol Laryngol. United States; 2007;116(2):81-4.
85. Buckmire RA, Bryson PC, and Patel MR. Type I gore-tex laryngoplasty for glottic incompetence in mobile vocal folds. J Voice. United States; 2011;25(3):288-92.
86. Vaiciukynas E, Verikas A, Gelzinis A, Bacauskiene M, Minelga J, Hållander M, Padervinskis E, and Uloza V. Fusing voice and query data for non-invasive detection of laryngeal disorders. Expert Systems with Applications. Elsevier; 2015;42(22):8445-8453.
87. Verikas A, Gelzinis A, Bacauskiene M, Uloza V, and Kaseta M. Using the patient's questionnaire data to screen laryngeal disorders. Computers in Biology and Medicine. Elsevier; 2009;39(2):148-155.
88. Godino-Llorente JI, and Gomez-Vilda P. Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. Biomedical Engineering, IEEE Transactions on. IEEE; 2004;51(2):380-384.
89. Mashao DJ, and Skosan M. Combining classifier decisions for robust speaker identification. Pattern Recognition. Elsevier; 2006;39(1):147-155.
90. Dibazar A, Narayanan S, and Berger TW. Feature analysis for automatic detection of pathological speech. In Engineering in Medicine and Biology, 2002. 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society EMBS/BMES Conference, 2002. Proceedings of the Second Joint. 2002. p. 182-183.
91. Moran RJ, Reilly RB, de Chazal P, and Lacy PD. Telephony-based voice pathology assessment using automated speech analysis. IEEE Trans Biomed Eng. United States; 2006;53(3):468-77.
92. Breiman L. Random forests. Machine Learning. Springer; 2001;45(1):5-32.
93. Hadjitodorov S, and Mitev P. A computer system for acoustic analysis of pathological voices and laryngeal diseases screening. Medical Engineering & Physics. Elsevier; 2002;24(6):419-429.


94. Martinez CE, and Rufiner HL. Acoustic analysis of speech for detection of laryngeal pathologies. In Engineering in Medicine and Biology Society, 2000. Proceedings of the 22nd Annual International Conference of the IEEE. 2000. p. 2369-2372.
95. Godino-Llorente JI, Gómez-Vilda P, Sáenz-Lechón N, Blanco-Velasco M, Cruz-Roldán F, and Ferrer-Ballester MA. Support vector machines applied to the detection of voice disorders. Nonlinear Analyses and Algorithms for Speech Processing. Springer; 2005. p. 219-230.
96. Arjmandi MK, and Pooyan M. An optimum algorithm in pathological voice quality assessment using wavelet-packet-based features, linear discriminant analysis and support vector machine. Biomedical Signal Processing and Control. Elsevier; 2012;7(1):3-19.
97. Statnikov A, Wang L, and Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. England; 2008;9:319.
98. Englund C, and Verikas A. A novel approach to estimate proximity in a random forest: An exploratory study. Expert Systems with Applications. Elsevier; 2012;39(17):13046-13050.
99. Saudi ASM, Youssif AA, and Ghalwash AZ. Computer aided recognition of vocal folds disorders by means of RASTA-PLP. Computer and Information Science. 2012;5(2):39.
100. Salhi L, Talbi M, and Cherif A. Voice disorders identification using hybrid approach: Wavelet analysis and multilayer neural networks. World Academy of Science, Engineering and Technology. 2008;45:330-339.
101. Marinus JV, Fechine JM, Gomes HM, and Costa SC. On the use of cepstral coefficients and Multilayer Perceptron Networks for vocal fold edema diagnosis. In Information Technology and Applications in Biomedicine, 2009. ITAB 2009. 9th International Conference on. 2009. p. 1-4.
102. Godino-Llorente J, Aguilera-Navarro S, and Gomez-Vilda P. Automatic detection of voice impairments due to vocal misuse by means of Gaussian mixture models. In Engineering in Medicine and Biology Society, 2001. Proceedings of the 23rd Annual International Conference of the IEEE. 2001. p. 1723-1726.
103. Godino-Llorente JI, and Gomez-Vilda P. Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. Biomedical Engineering, IEEE Transactions on. IEEE; 2004;51(2):380-384.


104. Shama K, and Cholayya NU. Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology. EURASIP Journal on Applied Signal Processing. Hindawi Publishing Corp.; 2007;2007(1):50-50.
105. Godino-Llorente JI, Gomez-Vilda P, and Blanco-Velasco M. Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. Biomedical Engineering, IEEE Transactions on. IEEE; 2006;53(10):1943-1953.
106. Saenz-Lechon N, Godino-Llorente JI, Osma-Ruiz V, and Gomez-Vilda P. Methodological issues in the development of automatic systems for voice pathology detection. Biomedical Signal Processing and Control. Elsevier; 2006;1(2):120-128.
107. Sáenz-Lechón N, Osma-Ruiz V, Godino-Llorente J, Blanco-Velasco M, Cruz-Roldán F, and Arias-Londono JD. Effects of audio compression in automatic detection of voice pathologies. Biomedical Engineering, IEEE Transactions on. IEEE; 2008;55(12):2831-2835.
108. Fraile R, Saenz-Lechon N, Godino-Llorente JI, Osma-Ruiz V, and Fredouille C. Automatic detection of laryngeal pathologies in records of sustained vowels by means of mel-frequency cepstral coefficient parameters and differentiation of patients by sex. Folia Phoniatrica et Logopaedica. Karger Publishers; 2009;61(3):146-152.
109. Henríquez P, Alonso JB, Ferrer MA, Travieso CM, Godino-Llorente JI, and Díaz-de-María F. Characterization of healthy and pathological voice through measures based on nonlinear dynamics. Audio, Speech, and Language Processing, IEEE Transactions on. IEEE; 2009;17(6):1186-1195.
110. Markaki ME, and Stylianou Y. Normalized modulation spectral features for cross-database voice pathology detection. In INTERSPEECH. 2009. p. 935-938.
111. Markaki M, and Stylianou Y. Voice pathology detection and discrimination based on modulation spectral features. Audio, Speech, and Language Processing, IEEE Transactions on. IEEE; 2011;19(7):1938-1948.
112. Wang X, Zhang J, and Yan Y. Discrimination between pathological and normal voices using GMM-SVM approach. Journal of Voice. Elsevier; 2011;25(1):38-43.
113. Martínez D, Lleida E, Ortega A, Miguel A, and Villalba J. Voice Pathology Detection on the Saarbruecken Voice Database with Calibration and Fusion of Scores Using MultiFocal Toolkit. Advances in Speech and Language Technologies for Iberian Languages. Springer; 2012. p. 99-109.
114. Parsa V, and Jamieson DG. Identification of pathological voices using glottal noise measures. Journal of Speech, Language, and Hearing Research. ASHA; 2000;43(2):469-485.
115. Parsa V, and Jamieson DG. Acoustic discrimination of pathological voice: sustained vowels versus continuous speech. Journal of Speech, Language, and Hearing Research. ASHA; 2001;44(2):327-339.
116. Ananthakrishna T, Shama K, and Niranjan UC. k-Means nearest neighbor classifier for voice pathology. In India Annual Conference, 2004. Proceedings of the IEEE INDICON 2004. First. 2004. p. 352-354.
117. Dibazar AA, Narayanan S, and Berger TW. Feature analysis for automatic detection of pathological speech. In Engineering in Medicine and Biology, 2002. 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society EMBS/BMES Conference, 2002. Proceedings of the Second Joint. 2002. p. 182-183.
118. Hariharan M, Paulraj MP, and Yaacob S. Identification of vocal fold pathology based on mel frequency band energy coefficients and singular value decomposition. In Signal and Image Processing Applications (ICSIPA), 2009 IEEE International Conference on. 2009. p. 514-517.
119. Alonso JB, Díaz-de-María F, Travieso CM, and Ferrer MA. Using nonlinear features for voice disorder detection. In ISCA Tutorial and Research Workshop (ITRW) on Non-Linear Speech Processing. 2005.
120. Gelzinis A, Verikas A, and Bacauskiene M. Automated speech analysis applied to laryngeal disease categorization. Comput Methods Programs Biomed. Ireland; 2008;91(1):36-47.
121. Vaiciukynas E, Verikas A, Gelzinis A, Bacauskiene M, and Uloza V. Exploring similarity-based classification of larynx disorders from human voice. Speech Communication. Elsevier; 2012;54(5):601-610.
122. Kons Z. Enhancing decision-level fusion through cluster-based partitioning of feature set. In The MENDEL Soft Computing Journal: International Conference on Soft Computing MENDEL. 2014. p. 259-264.
123. Hegger R, Kantz H, and Schreiber T. Practical implementation of nonlinear time series methods: The TISEAN package. Chaos: An Interdisciplinary Journal of Nonlinear Science. AIP Publishing; 1999;9(2):413-435.


124. Hema N, Mahesh S, and Pushpavathi M. Normative data for Multi-Dimensional Voice Program (MDVP) for adults--a computerized voice analysis system. J All India Inst Speech Hear. 2009;28:1-7.
125. Kaleem MF, Ghoraani B, Guergachi A, and Krishnan S. Telephone-quality pathological speech classification using empirical mode decomposition. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE. 2011. p. 7095-7098.
126. Mat Baki M, Wood G, Alston M, Ratcliffe P, Sandhu G, Rubin JS, and Birchall MA. Reliability of OperaVOX against Multidimensional Voice Program (MDVP). Clinical Otolaryngology. Wiley Online Library; 2015;40(1):22-28.
127. Reynolds D. Large population speaker identification using clean and telephone speech. Signal Processing Letters, IEEE. IEEE; 1995;2(3):46-48.
128. Vogel AP, and Maruff P. Comparison of voice acquisition methodologies in speech research. Behavior Research Methods. Springer; 2008;40(4):982-987.
129. Vogel AP, Rosen KM, Morgan AT, and Reilly S. Comparability of Modern Recording Devices for Speech Analysis: Smartphone, Landline, Laptop, and Hard Disc Recorder. Folia Phoniatrica et Logopaedica. Karger Publishers; 2014;66(6):244-250.
130. Moran RJ, Reilly RB, De Chazal P, and Lacy PD. Telephony-based voice pathology assessment using automated speech analysis. Biomedical Engineering, IEEE Transactions on. IEEE; 2006;53(3):468-477.
131. Wormald RN, Moran RJ, Reilly RB, and Lacy PD. Performance of an automated, remote system to detect vocal fold paralysis. Annals of Otology, Rhinology & Laryngology. SAGE Publications; 2008;117(11):834-838.
132. Jokinen E, Yrttiaho S, Pulakka H, Vainio M, and Alku P. Signal-to-noise ratio adaptive post-filtering method for intelligibility enhancement of telephone speech. The Journal of the Acoustical Society of America. Acoustical Society of America; 2012;132(6):3990-4001.
133. Lin E, Hornibrook J, and Ormond T. Evaluating iPhone recordings for acoustic voice assessment. Folia Phoniatrica et Logopaedica. Karger Publishers; 2012;64(3):122-130.
134. Mehta DD, Zanartu M, Van Stan JH, Feng SW, Cheyne HA, and Hillman RE. Smartphone-based detection of voice disorders by long-term monitoring of neck acceleration features. In Body Sensor Networks (BSN), 2013 IEEE International Conference on. 2013. p. 1-6.


135. Mehta DD, Van Stan JH, Zañartu M, Ghassemi M, Guttag JV, Espinoza VM, Cortés JP, and Cheyne HA. Using ambulatory voice monitoring to investigate common voice disorders: research update. Frontiers in Bioengineering and Biotechnology. Frontiers Media SA; 2015;3.
136. Uloza V, Saferis V, and Uloziene I. Perceptual and acoustic assessment of voice pathology and the efficacy of endolaryngeal phonomicrosurgery. Journal of Voice. Elsevier; 2005;19(1):138-145.
137. Uloza V, Verikas A, Bacauskiene M, Gelzinis A, Pribuisiene R, Kaseta M, and Saferis V. Categorizing normal and pathological voices: automated and perceptual categorization. J Voice. United States; 2011;25(6):700-8.
138. Bacauskiene M, Verikas A, Gelzinis A, and Vegiene A. Random forests based monitoring of human larynx using questionnaire data. Expert Systems with Applications. Elsevier; 2012;39(5):5506-5512.
139. Van der Maaten L, and Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9:2579-2605.
140. Brümmer N, and de Villiers E. The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF. arXiv preprint arXiv:1304.2865. 2013.
141. Safdari R, Dargahi H, Shahmoradi L, and Nejad AF. Comparing four softwares based on ISO 9241 part 10. Journal of Medical Systems. Springer; 2012;36(5):2787-2793.
142. Eadie TL, and Doyle PC. Classification of dysphonic voice: acoustic and auditory-perceptual measures. Journal of Voice. Elsevier; 2005;19(1):1-14.
143. Oguz H, Demirci M, Safak MA, Arslan N, Islam A, and Kargin S. Effects of unilateral vocal cord paralysis on objective voice measures obtained by Praat. Eur Arch Otorhinolaryngol. Germany; 2007;264(3):257-61.
144. Zhang Y, and Jiang JJ. Acoustic analyses of sustained and running voices from patients with laryngeal pathologies. J Voice. United States; 2008;22(1):1-9.
145. Maryn Y, Corthals P, De Bodt M, Van Cauwenberge P, and Deliyski D. Perturbation measures of voice: a comparative study between Multi-Dimensional Voice Program and Praat. Folia Phoniatr Logop. Switzerland; 2009;61(4):217-26.
146. Linder R, Albers AE, Hess M, Pöppl SJ, and Schönweiler R. Artificial neural network-based classification to screen for dysphonia using psychoacoustic scaling of acoustic voice features. Journal of Voice. Elsevier; 2008;22(2):155-163.


147. Muhammad G, Mesallam TA, Malki KH, Farahat M, Mahmood A, and Alsulaiman M. Multidirectional regression (MDR)-based features for automatic voice disorder detection. J Voice. United States; 2012;26(6):817.e19-27.
148. Baken RJ, and Orlikoff RF. Clinical measurement of speech and voice. Cengage Learning; 2000.
149. Verikas A, Gelzinis A, and Bacauskiene M. Mining data with random forests: A survey and results of new tests. Pattern Recognition. Elsevier; 2011;44(2):330-349.
150. Acker-Mills BE, Houtsma AJ, and Ahroon WA. Speech intelligibility in noise using throat and acoustic microphones. Aviation, Space, and Environmental Medicine. Aerospace Medical Association; 2006;77(1):26-31.
151. Munger JB, and Thomson SL. Frequency response of the skin on the head and neck during production of selected speech sounds. J Acoust Soc Am. United States; 2008;124(6):4001-12.
152. Tsanas A, and Gómez-Vilda P. Novel robust decision support tool assisting early diagnosis of pathological voices using acoustic analysis of sustained vowels. In Multidisciplinary Conference of Users of Voice, Speech and Singing. 2013. p. 3-12.
153. Umapathy K, Krishnan S, Parsa V, and Jamieson DG. Discrimination of pathological voices using a time-frequency approach. Biomedical Engineering, IEEE Transactions on. IEEE; 2005;52(3):421-430.
154. Tsanas A, Little M, McSharry PE, Spielman J, and Ramig LO. Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease. Biomedical Engineering, IEEE Transactions on. IEEE; 2012;59(5):1264-1271.
155. Awan SN, Roy N, and Dromey C. Estimating dysphonia severity in continuous speech: application of a multi-parameter spectral/cepstral model. Clin Linguist Phon. England; 2009;23(11):825-41.
156. Awan SN, Roy N, Jetté ME, Meltzner GS, and Hillman RE. Quantifying dysphonia severity using a spectral/cepstral-based acoustic index: Comparisons with auditory-perceptual judgements from the CAPE-V. Clin Linguist Phon. England; 2010;24(9):742-58.
157. Maryn Y, De Bodt M, and Roy N. The Acoustic Voice Quality Index: toward improved treatment outcomes assessment in voice disorders. J Commun Disord. United States; 2010;43(3):161-74.
158. Barsties B, and Maryn Y. Der Acoustic Voice Quality Index in Deutsch. HNO. Springer; 2012;60(8):715-720.


159. Reynolds V, Buckland A, Bailey J, Lipscombe J, Nathan E, Vijayasekaran S, Kelly R, Maryn Y, and French N. Objective assessment of pediatric voice disorders with the acoustic voice quality index. J Voice. United States; 2012;26(5):672.e1-7.
160. Barsties B, and Maryn Y. Test-Retest-Variabilität und interne Konsistenz des Acoustic Voice Quality Index. HNO. Springer; 2013;61(5):399-403.
161. Maryn Y, De Bodt M, Barsties B, and Roy N. The value of the acoustic voice quality index as a measure of dysphonia severity in subjects speaking different languages. Eur Arch Otorhinolaryngol. Germany; 2014;271(6):1609-19.
162. Awan SN, and Roy N. Outcomes measurement in voice disorders: application of an acoustic index of dysphonia severity. J Speech Lang Hear Res. United States; 2009;52(2):482-99.
163. Lowell SY, Kelley RT, Awan SN, Colton RH, and Chan NH. Spectral- and cepstral-based acoustic features of dysphonic, strained voice quality. Ann Otol Rhinol Laryngol. United States; 2012;121(8):539-48.
164. Awan SN, Solomon NP, Helou LB, and Stojadinovic A. Spectral-cepstral estimation of dysphonia severity: external validation. Ann Otol Rhinol Laryngol. United States; 2013;122(1):40-8.
165. Peterson EA, Roy N, Awan SN, Merrill RM, Banks R, and Tanner K. Toward validation of the cepstral spectral index of dysphonia (CSID) as an objective treatment outcomes measure. J Voice. United States; 2013;27(4):401-10.
166. Roy N, Mazin A, and Awan SN. Automated acoustic analysis of task dependency in adductor spasmodic dysphonia versus muscle tension dysphonia. The Laryngoscope. Wiley Online Library; 2014;124(3):718-724.
167. Awan SN, Roy N, and Cohen SM. Exploring the relationship between spectral and cepstral measures of voice and the Voice Handicap Index (VHI). J Voice. United States; 2014;28(4):430-9.
168. Verikas A, Gelzinis A, Vaiciukynas E, Bacauskiene M, Minelga J, Hållander M, Uloza V, and Padervinskis E. Data dependent random forest applied to screening for laryngeal disorders through analysis of sustained phonation: Acoustic versus contact microphone. Medical Engineering & Physics. Elsevier; 2015;37(2):210-218.


LIST OF PUBLICATIONS

Publications based on the results of the dissertation

1. Uloza V, Padervinskis E, Uloziene I, Saferis V, and Verikas A. Combined use of standard and throat microphones for measurement of acoustic voice parameters and voice categorization. Journal of Voice. 2015;29(5):552-559.
2. Verikas A, Gelzinis A, Vaiciukynas E, Bacauskiene M, Minelga J, Hållander M, Uloza V, and Padervinskis E. Data dependent random forest applied to screening for laryngeal disorders through analysis of sustained phonation: Acoustic versus contact microphone. Medical Engineering & Physics. 2015;37(2):210-218.
3. Uloza V, Padervinskis E, Vegiene A, Pribuisiene R, Saferis V, Vaiciukynas E, Gelzinis A, and Verikas A. Exploring the feasibility of smart phone microphone for measurement of acoustic voice parameters and voice pathology screening. European Archives of Oto-Rhino-Laryngology. 2015;272(11):3391-3399.

Publications related to the topic of the dissertation

1. Vaiciukynas E, Verikas A, Gelzinis A, Bacauskiene M, Minelga J, Hållander M, Padervinskis E, Uloza V. Fusing voice and query data for non-invasive detection of laryngeal disorders. Expert Systems with Applications. 2015;42(22):8445-8453.
2. Minelga J, Gelžinis A, Vaičiukynas E, Verikas A, Bačauskienė M, Padervinskis E, Uloza V. Comparing throat and acoustic microphones for laryngeal pathology detection from human voice. In Electrical and Control Technologies: proceedings of the 9th international conference on electrical and control technologies ECT 2014, May 8-9, 2014, Kaunas, Lithuania. Kaunas University of Technology, IFAC Committee of National Lithuanian Organisation, Lithuanian Electricity Association.
3. Gelžinis A, Verikas A, Bačauskienė M, Minelga J, Hållander M, Ulozas V, Padervinskis E. Exploring sustained phonation recorded with acoustic and contact microphones to screen for laryngeal disorders. In Computational Intelligence in Healthcare and e-health (CICARE): 2014 IEEE Symposium, 9-12 December 2014, Orlando. IEEE, 2014. p. 125-132.


Abstracts of conferences

1. Padervinskis E, Uloza V. Išmaniųjų telefonų mikrofonų panaudojimas balso akustinių parametrų registravimui ir balso patologijos pirminei patikrai. VIII nacionalinė doktorantų mokslinė konferencija „Mokslas – sveikatai": konferencijos pranešimų tezės, balandžio 7 d., 2015. p. 61-62.
2. Uloza V, Padervinskis E, Pribuišienė R, Vegienė A. Reliability of mobile phone microphones for measurement of acoustic voice parameters and voice categorization. 3rd Congress of European ORL-HNS: 6th Czech-Slovak Congress of Otorhinolaryngology and Head and Neck Surgery, 77th Congress of the Czech Society of Otorhinolaryngology and Head and Neck Surgery, 62nd Congress of the Slovak Society of Otorhinolaryngology and Head and Neck Surgery. Abstract book: Prague, Czech Republic. June 7-11, 2015. p. 97.
3. Uloza V, Padervinskis E, Vaičiukynas E, Gelžinis A, Verikas A. Utility of smart phone microphone for measurement of acoustic voice parameters and voice pathology screening. Pan European Voice Conference PEVOC 11: Abstract book: August 31 - September 2, 2015, Firenze, Italy. p. 139.
4. Uloza V, Padervinskis E, Šaferis V. A comparison of throat (contact) and oral microphones for the measurement of acoustic voice parameters. 6th Baltic ENT: 22nd-24th May, 2014, Kaunas, Lithuania: Programme and Abstract Book. p. 29.
5. Uloza V, Padervinskis E, Ulozienė I. Measurement of acoustic voice parameters using standard and contact microphones. 10th Congress of the European Laryngological Society, 2nd Joint Meeting of ABEA: 9-12 April, 2014, Antalya, Turkey. Antalya: ELS, 2014. p. 22.
6. Padervinskis E, Uloza V. Orinio ir kontaktinio mikrofonų panaudojimas objektyviai balso analizei. VII nacionalinė doktorantų mokslinė konferencija „Mokslas – sveikatai": konferencijos pranešimų tezės, 2014 m. balandžio 9 d. p. 69-70, Nr. 3.


PUBLICATIONS

[Full-text reprints of the dissertation publications occupy pages 76-101 of the printed dissertation and are not reproduced here.]

SANTRAUKA

Introduction

The voice is the principal instrument of human communication. When speaking, we not only transmit information but also reveal information about ourselves. Over the last 200,000 years the human voice has evolved as a tool that conveys more than verbal information. What matters in speech is not only what is said but also how it is said (non-verbal information): the timbre, intonation, loudness and rate of the voice. This multidimensionality of the voice causes the greatest problems when verbal information is passed to computer systems, since the voice changes during the day, and part of the non-verbal information, which a listener perceives subconsciously while communicating with the speaker, is lost.

One of the important components of voice analysis software is the microphone with which the voice signal is registered. The quality of voice recordings depends to a considerable degree on the type of microphone used; however, there is no general agreement in the scientific literature on which microphones should be used for voice analysis in clinical and scientific work.

Modern medicine is oriented towards preventive programmes, since they reduce the costs of medical services and prolong patients' lives. The science of voice analysis helps to diagnose laryngeal diseases earlier and to develop potential preventive programmes. Preventive programmes are based on simple examination algorithms that family physicians or medical specialists could apply without difficulty. In today's technological world the aim is to create computerized voice analysis programs accessible not only to physicians but also to patients themselves; it is therefore worthwhile to evaluate the possibility of recording the voice with smartphones, which are readily available to the public. In that case, patients could perform a simple examination of their own voice and, should the system detect suspicious changes, they would be advised to see a medical specialist for a targeted consultation.

Aim

To develop an automatic voice categorization system based on the analysis of acoustic voice parameters and patient questionnaire data, and to evaluate its effectiveness for the primary screening of voice disorders.

Objectives

1. To evaluate the possibilities of a combination of oral and contact (throat) microphones for registering and measuring acoustic voice parameters, and their suitability for classifying voices into healthy and pathological groups.


2. To evaluate the possibilities of the Random Forest classifier in separating the healthy and pathological voice groups when different numbers of acoustic voice parameter sets are used for classification.
3. To evaluate the possibilities of oral and smartphone microphones for registering and measuring acoustic voice parameters, and to assess the suitability of the smartphone microphone for the primary screening of voice disorders.
4. To evaluate the effect of fusing the information obtained from dedicated questionnaires and acoustic voice parameters on the accuracy of automatic voice classification into the healthy and pathological voice groups.
5. To develop a system for automatic voice categorization into the healthy and pathological voice groups, based on the analysis of dedicated voice questionnaire data and acoustic voice parameters.

Novelty of the work

The diagnostics of laryngeal diseases is based on invasive methods of laryngeal examination: indirect and direct laryngoscopy and video laryngostroboscopy. These examinations are performed by otorhinolaryngologists using dedicated diagnostic equipment. For various reasons, patients often consult specialists too late, when the laryngeal disease has already become chronic or has spread. This is one of the causes of late diagnosis of laryngeal cancer.

On the other hand, most laryngeal diseases already manifest at early stages with various voice disorders. Detecting these by automatic voice analysis would make it possible to perform primary screening of voice disorders effectively, improve the early diagnostics of laryngeal diseases, including laryngeal cancer, and thus increase treatment effectiveness.

Researchers working in the field of voice analysis have for many years sought to create an automatic voice analysis program; however, a reliable, inexpensive and easy-to-use program suitable for the primary screening of voice disorders has not yet been created. The research work is divided into two stages. The first stage is to recognize important acoustic voice parameters or features in voice recordings and to encode them into numerical data (extraction). The second stage is to classify the voices into two groups, healthy and pathological, using new acoustic parameters, advanced mathematical classifiers and questionnaire data. At present, however, both stages can be implemented only in a laboratory, which entails high maintenance costs and specially trained personnel and does not fit the concept of an inexpensive and simple program. Since voice analysis tests are currently performed only under laboratory conditions, the limited throughput of subjects makes population studies impossible.


This dissertation clarifies the possibilities of using different types of microphones for acoustic voice analysis and compares the differences in the obtained data; it also evaluates the capabilities of the oral and the contact (throat) microphone in simultaneous voice recording, again with the differences between the data compared. In this study, simultaneous voice recordings using an oral microphone and a smartphone microphone were captured for the first time, and the acoustic voice parameter results of the two microphones were compared by conventional statistical methods. In collaboration with researchers from KTU, data from the Voice questionnaire were for the first time linked with acoustic voice parameters, using both conventional statistical methods and the advanced algorithms of the Random Forest classifier, with the voices classified into two groups: healthy and pathological. The VoiceTest program, which links questionnaire data and acoustic voice parameter data by means of the Random Forest classifier, was developed and tested in clinical practice, and its capabilities in classifying the examined voices into the two groups, healthy and pathological, were evaluated.

Methods

The research was carried out with the permission of the Kaunas Regional Biomedical Research Ethics Committee (No. P2-24/2013). Permission to work with personal data was also obtained from the State Data Protection Inspectorate (No. 2R-648 (2.6-1)).

The study, conducted at the LSMU Department of Otorhinolaryngology from August 2011 to September 2015, enrolled 656 participants; 9 were excluded because of insufficient data. Of the remainder, 337 were healthy volunteers and 319 were patients diagnosed with mass lesions of the larynx or vocal fold paralysis. The research was carried out in four stages (four studies); the results obtained have been published in the scientific articles included in the dissertation.

Study I, "Analysis of oral and contact microphone data using discriminant analysis", included 157 subjects: 105 healthy and 52 patients (mass lesions of the larynx and vocal fold paralysis). The voice was recorded in a sound-proof booth while the subject phonated the vowel /a/ for at least 5 seconds, simultaneously with two microphones: an oral cardioid microphone (AKG Perception 220) and a contact (throat) microphone (omnidirectional Triumph PC). The audio file was recorded in wav format with the Audacity program. Acoustic voice analysis was performed with the Tiger Electronics Dr. Speech program, evaluating six acoustic voice parameters. Statistical analysis was performed with the SPSS 20 package: the means of the acoustic voice parameters were evaluated, and Student's t test was used to test the hypothesis of equality of means. Pearson's correlation coefficient (r) was used to evaluate the correlation between the acoustic voice parameters registered with the different microphones. The correct classification rate (CCR) of classifying voices into the healthy and pathological groups was evaluated using the results of the six acoustic voice parameters. Discriminant analysis was applied to determine the threshold values of the acoustic parameters separating healthy from pathological voice.

Study II, "Analysis of oral and contact microphone data using the Random Forest classifier", included 273 subjects: 163 healthy and 110 patients (mass lesions of the larynx and vocal fold paralysis). The voice was recorded in the same way as in Study I. For acoustic voice analysis, 14 sets of voice parameters comprising 1051 voice parameters were used. Using these parameter sets, the voices were classified into the two groups, healthy and pathological, with the Random Forest classifier. The usefulness of the classifier was evaluated with the following measures: the detection error trade-off curve, the equal error rate (EER), the ROC curve, and the area under the ROC curve.

Study III, "Analysis of oral and smartphone microphone data using the Random Forest classifier", included 152 subjects: 34 healthy and 118 patients (mass lesions of the larynx and vocal fold paralysis). The voice was recorded in a sound-proof booth while the subject phonated the vowel /a/ for at least 5 seconds, simultaneously with two microphones: an oral cardioid microphone (AKG Perception 220) and a smartphone microphone (Samsung Galaxy Note 3). The audio file was recorded to the computer in wav format with the Audacity program, and to the smartphone in wav format with the Smart Voice Recorder program. Acoustic voice analysis was performed with the Tiger Electronics Dr. Speech program, evaluating six acoustic voice parameters. The patients were surveyed with the GFI-LT and Voice questionnaires, and the questionnaire data were related to the acoustic voice parameters by statistical methods. Statistical analysis was performed with the SPSS 20.0 package: the means of the acoustic voice parameters were evaluated, and Student's t test was used to test the hypothesis of equality of means. Pearson's correlation coefficient (r) was applied to evaluate the correlation between the voice parameters from the different microphones. The CCR was used to evaluate the capability of the acoustic voice parameters to classify voices into the two groups, healthy and pathological, and the Random Forest classifier was used for the same two-class classification.

Study IV, "Testing of the VoiceTest software", included 273 subjects: 163 healthy and 110 patients (diagnosed mass lesions of the larynx and vocal fold paralysis). These voices were used to train the Random Forest classifier to separate the groups according to acoustic voice parameters. Questionnaire data of 596 subjects, 327 healthy and 269 patients, were used to train the Random Forest classifier to separate subjects into groups according to questionnaire data. The VoiceTest testing group consisted of 45 subjects: 9 healthy and 36 patients. The voice was recorded in a sound-proof booth while the subject phonated the vowel /a/ for at least 5 seconds, with a single microphone, an oral cardioid microphone (AKG Perception 220). The audio file was recorded in wav format with the Audacity program. For acoustic voice analysis, the 14 sets of voice parameters were used with the Random Forest classifier; using the voice parameter sets and the questionnaire data, the voices were assigned to the two groups, healthy and pathological.

In all the studies, a difference was considered statistically significant at p < 0.05.
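For illustration, a minimal sketch of the statistical steps named above (Python with SciPy and scikit-learn; synthetic stand-in values, not the study's recordings): a paired Student's t test and Pearson correlation for one parameter measured simultaneously by two microphones, and a cross-validated linear discriminant analysis whose accuracy stands in for the CCR.

```python
import numpy as np
from scipy.stats import ttest_rel, pearsonr
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# One parameter (e.g. jitter, %) measured simultaneously by two microphones.
jitter_oral = rng.normal(1.0, 0.4, 150)
jitter_throat = jitter_oral + rng.normal(0.0, 0.1, 150)

t_stat, p_value = ttest_rel(jitter_oral, jitter_throat)  # paired Student's t test of means
r, _ = pearsonr(jitter_oral, jitter_throat)              # correlation between microphones

# Discriminant analysis on six parameters; cross-validated accuracy approximates CCR.
X = rng.normal(size=(150, 6))                            # stand-in for 6 acoustic parameters
y = (X @ rng.normal(size=6) + rng.normal(size=150) > 0).astype(int)
ccr = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, r = {r:.2f}, CCR = {100 * ccr:.1f} %")
```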

Results

Results of Study I: in the patient group, no statistically significant difference was observed between the means of four acoustic parameters (F0, jitter, shimmer, NNE); a statistically significant difference was observed between the SNR and HNR parameters, although the difference amounted to only 5.64-5.78%. This result is attributable to the fact that the two microphones have different frequency response curves. Analogous changes were observed in the healthy voice group: here the difference between SNR and HNR was even smaller, 3.13-3.25%, yet still statistically significant. When the corresponding voice parameters of the oral and contact microphones were compared with each other, a statistically significant strong correlation was observed between all parameters: r = 0.71-0.86, and for F0 r = 1.0. The CCR of discriminant analysis was used to separate the healthy and pathological voice groups. With the oral microphone data, the most important discriminant parameter was O-Shimmer, whose CCR was 75.2%; with the contact microphone data, the main discriminant parameter was T-Jitter, whose CCR reached 70.7%. When the data of both microphones were combined, an optimal set of acoustic voice parameters yielding the best two-class classification result was identified: O-Shimmer, O-NNE and T-SNR, with a CCR of 80.3%.

Results of Study II: with the Random Forest classifier, better classification into the healthy and pathological voice groups was observed with the oral microphone; of the 14 parameter sets, the contact microphone showed a better classification result in only one. When classifying the voices into the two groups using the 14 voice parameter sets, the best results were achieved by combining the oral and contact microphone data: the CCR was 86.82%. In terms of EER, the oral microphone data gave the better result, 19.37%, while the contact microphone data yielded 21.64%. Combining the oral and contact microphone data improved the classification into the healthy/pathological voice groups, with an EER of 18.94%.


Results of Study III: a strong correlation (r = 0.78-0.91) was found between the main acoustic voice parameters registered with the oral and smartphone microphones (for jitter the correlation was r = 0.67; for F0, r = 1.0). When classifying voices into the healthy and pathological groups using the acoustic voice parameters and discriminant analysis, a CCR of 79.5% was obtained with the smartphone data and 73.7% with the oral microphone data. The best classification result was achieved by combining the acoustic voice parameters registered with the smartphone microphone with the GFI-LT questionnaire results and processing the data with the Random Forest classifier: EER = 7.94%.

Results of Study IV: the VoiceTest program was developed, and an experiment was performed with it: 45 voices and questionnaire records unknown to the system were presented, and the Random Forest classifier was used to assign the voices to the two groups, healthy and pathological. The classifier results were evaluated by calculating the EER. Classification of the voices into the healthy and pathological groups using the Voice questionnaire data achieved an EER of 11.11%; classification using the acoustic voice parameters reached an EER of 14.81%.
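The fusion result above can be illustrated with a minimal sketch (Python with scikit-learn; synthetic stand-ins for the six acoustic parameters and the GFI-LT items, and a plain Random Forest rather than the data-dependent variant used in the dissertation), concatenating the two feature views and estimating the EER from cross-validated scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
n = 152
audio = rng.normal(size=(n, 6))          # stand-in for six acoustic voice parameters
gfi = rng.integers(0, 6, size=(n, 4))    # stand-in for four GFI-LT items scored 0..5
y = (audio[:, 0] + gfi.sum(axis=1) / 10.0 + rng.normal(size=n) > 1.5).astype(int)

fused = np.hstack([audio, gfi])          # feature-level fusion of the two views
proba = cross_val_predict(
    RandomForestClassifier(n_estimators=400, random_state=0),
    fused, y, cv=5, method="predict_proba",
)[:, 1]

fpr, tpr, _ = roc_curve(y, proba)        # sweep the decision threshold
fnr = 1.0 - tpr
i = np.argmin(np.abs(fpr - fnr))         # point where the two error rates coincide
print(f"fused EER = {100 * (fpr[i] + fnr[i]) / 2:.1f} %")
```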

Conclusions

1. The registration and measurement of acoustic voice parameters using a combination of oral and contact microphones for voice recordings, with discriminant analysis of the data, is suitable for clinical needs and achieves a high CCR (80.3%) when the healthy and pathological voice groups are separated using acoustic voice parameters. In cases where ambient noise makes the use of conventional microphones for voice recording impossible or difficult, the contact microphone can be a valuable and reliable alternative.
2. The proposed data-dependent way of building the Random Forest classifier was confirmed as best suited to fusing the multifaceted information obtained from 14 different sets of acoustic voice parameters and to achieving a statistically significantly more accurate classification into the healthy and pathological voice groups based on the acoustic and contact microphone data. With the Random Forest classifier, a high CCR (86.62%) is achieved in separating the healthy and pathological voice groups.
3. The registration and measurement of acoustic voice parameters using a smartphone microphone for voice recordings is reliable and suitable for clinical needs. When voice recordings registered with such a microphone are used for the classification of the healthy and pathological voice groups, a sufficiently high CCR (79.5%) is achieved. The study data confirm the suitability of the smartphone microphone signal for automatic voice analysis and the primary screening of voice disorders.
4. Fusing acoustic voice analysis data with GFI-LT questionnaire data improves the accuracy of classification into the healthy and pathological voice groups. The lowest EER (7.94%) is obtained by combining the GFI-LT questionnaire data with the acoustic voice parameters registered with the smartphone microphone.
5. The developed VoiceTest program for the automatic evaluation of acoustic voice analysis and questionnaire data conforms to the ISO-9241 standard. The EER of classification into the healthy and pathological voice groups is 14.8% by voice parameters and 11.1% by questionnaire data. The program is potentially usable in clinical practice.


CURRICULUM VITAE

EVALDAS PADERVINSKIS M.D.

Work address: Hospital of Lithuanian University of Health Sciences Kauno Klinikos, Department of Otorhinolaryngology. Eivenių 2, LT-50009, Kaunas, Lithuania

Email: [email protected]

Work place: Hospital of Lithuanian University of Health Sciences Kauno Klinikos, Department of Otorhinolaryngology; otorhinolaryngologist, from 2007 till present. Lithuanian University of Health Sciences, Medical Academy; assistant, from 2011 till present.

Education: Kaunas University of Medicine, Lithuania Master Degree in Medicine September 2000 – June 2006

University Hospital of Klaipėda, Lithuania Internship; September 2006 – June 2007

Kaunas University of Medicine, Lithuania Residency in Otorhinolaryngology; August 2007 – June 2010

Lithuanian University of Health Sciences, Lithuania PhD Student; September 2011 – August 2015


PADĖKA

I sincerely thank my parents for giving me the opportunity to study at university, for their understanding and support, and for the occasionally stern but fair word.

I thank Prof. Saulius Vaitkus for his lessons in surgery and his constant support; had I not met this man on my life's path, I would probably not have chosen the specialty of an ENT physician and would not have become the doctor I am today.

I thank Prof. Virgilijus Uloza for his instruction and advice, and for his exceptionally attentive supervision of the entire research work; without him this dissertation would never have seen the light of day. I could not imagine a better scientific supervisor than Prof. Virgilijus Uloza.

I thank the research staff of the KTU Department of Electric Power Systems, Prof. Antanas Verikas and Jonas Minelga, for our good, fruitful and constructive work together.

I thank all the physicians and nurses of the ENT clinic and outpatient department for their teaching, advice and understanding on the demanding daily path of a physician.

I am immensely grateful to my wife Lina for her support and understanding, for her excellent knowledge of the Lithuanian language, and for her constant help in everything. Sincere thanks to my parents-in-law, without whose help everyday life would be difficult. I thank all my friends and relatives for their understanding and support.
