Secure, Efficient Automatic Speaker Verification for Embedded Applications
Total Page:16
File Type:pdf, Size:1020Kb
Secure, efficient automatic speaker verification for embedded applications Giacomo Valenti To cite this version: Giacomo Valenti. Secure, efficient automatic speaker verification for embedded applications. Artificial Intelligence [cs.AI]. Sorbonne Université, 2019. English. NNT : 2019SORUS471. tel-03001286 HAL Id: tel-03001286 https://tel.archives-ouvertes.fr/tel-03001286 Submitted on 12 Nov 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. PHD THESIS SORBONNE UNIVERSITÉ Thèse de doctorat (CIFRE) École doctorale ED 130 Informatique, Telecommunications et Electronique de Paris EURECOM - Sécurité numérique Thesis director : prof. Nicholas Evans presented by Giacomo VALENTI Thesis title : Secure, efficient automatic speaker verification for embedded applications defense scheduled on 4th March 2019 before a committee composed of : Prof. Magne JOHNSEN Rapporteur Prof. Tomi KINNUNEN Rapporteur Prof. Marc DACIER Examiner M. Nicolas OBIN Examiner Mme Florence TRESSOLS Examiner M. Adrien DANIEL Examiner M. Laurent PILATI Examiner, industrial supervisor Prof. Nicholas EVANS Thesis director ii Abstract This industrial CIFRE PhD thesis addresses automatic speaker verification (ASV) issues in the context of embedded applications. The first part of this thesis focuses on more traditional problems and topics, which are introduced with a specific literature review. The first work investigates the minimum enrolment data requirements for a practical, text- dependent short-utterance ASV system. Results on the RSR2015 database and protocols indicate that the need for up to 97% enrolment data can be eliminated with only a negligible impact on performance. Contributions in the first part of the thesis consist in a statistical analysis onto the aforementioned RSR corpus. The objective is to isolate text-dependent factors and prove they are consistent across different sets of speakers. Experiments suggest that, for very short utterances, the influence of a specific text content on the system performance can be considered a speaker-independent factor and is named spoken password strength. If the user could be made aware of the strength of the chosen password, ASV reliability could be improved with the judicious choice of a more secure password over a less secure one. The second part of the thesis focuses on neural network-based approaches and is accompanied by a second, specific literature review. While it was clear from the beginning of the thesis that neural networks and deep learning were becoming state-of-the-art in several machine learning domains, their use for embedded solutions was hindered by their complexity. Contributions described in the second part of the thesis comprise blue-sky, experimen- tal research which tackles the substitution of hand-crafted, traditional speaker features in favour of operating directly upon the raw audio waveform and the search for optimal network architectures and weights by means of genetic algorithms. This work is the most fundamental contribution of this thesis: neuro-evolved network structures consisting of only a few hundred connections which are able to learn from the raw audio input. While the approach is undoubtedly still in its infancy, it shows promising results for experiments carried out for text-independent speaker verification (including a subset of the NIST 2016 SRE data) and anti-spoofing (on the official ASVspoof 2017 protocols). iii iv Abstract Acknowledgements This thesis was carried out at EURECOM and NXP Semiconductors and was entirely funded by NXP. During this 3-year long journey, lasting from July 2015 to August 2018, I was not alone. I would like to thank my academic advisor, Prof. Nicholas Evans for his relentless support not only during the time of this thesis, but also during the preceding 5 months, which he allowed me to spend at EURECOM, in order to familiarise with speaker verifica- tion. My gratitude also goes to Dr. Adrien Daniel, who has been my industrial supervisor for the most part of my thesis. He is responsible for pushing me to explore new horizons of research; our active collaboration led to the most fundamental contributions of this thesis. Since I had a few jobs before working at NXP Semiconductors, I can sincerely say that the NXP working environment, especially at the human level, is the best I ever experienced. I am very much grateful to my boss and current industrial supervisor, Laurent Pilati, for letting me do research with a great degree of freedom, something which I am told is not to be taken for granted in industrial PhDs. A special thank goes to my former classmate, former colleague and current friend, Daniele Battaglino, who introduced me to my academic supervisor and to NXP at a time when I was really dissatisfied with my professional life. Very warm thanks go to my soulmate Francesca, her "never-settle" attitude is what pushed me to leave a stable (albeit boring) job for a definitely more adventurous experi- ence. Among the people who encouraged me and always supported me I cannot avoid to mention my parents, whom I deeply thank for teaching me the importance of education since my first day of school. I would also like to express my gratitude to my fellow PhD students and post-docs in EURECOM for the nice atmosphere every time I "dropped by", with special thanks to Héctor for his patience and kindness. Last, at the risk of repeating myself with regard to my master thesis’ acknowledgements, I want to thank friends in my home town. They have exactly nothing to do with audio-related research or data science, but they know the true meaning of "decompress". Giacomo Valenti v vi Acknowledgements Contents Abstract iii Acknowledgementsv I Introduction1 I.1 Speaker recognition terminology.......................2 I.2 Issues with current ASV status........................3 I.3 Contributions and publications........................4 I.4 Thesis structure.................................7 I.4.1 Part A..................................7 I.4.2 Part B..................................8 II A review of traditional speaker verification approaches 11 II.1 Speech as a biometric............................. 11 II.2 Front-end: speaker features.......................... 12 II.2.1 Short-term features........................... 13 II.2.2 Longer-term features.......................... 14 II.3 Back end: models and classifiers....................... 15 II.3.1 Gaussian Mixture Models....................... 15 II.3.2 GMM-UBM............................... 17 II.3.3 Hidden Markov Models........................ 17 II.3.4 Towards i-vectors............................ 19 II.3.5 The HiLAM system.......................... 20 II.4 Performance Metrics.............................. 21 II.4.1 Receiver Operating Characteristic (ROC).............. 21 II.4.2 Equal Error Rate (EER)........................ 23 II.4.3 Score normalisation.......................... 23 II.5 Challenges and Databases........................... 23 II.5.1 NIST Speaker Recognition Evaluations................ 23 II.5.1.a The early years........................ 24 II.5.1.b Broader scope and higher dimensionality......... 24 II.5.1.c Bi-annual big data challenges................ 24 II.5.1.d SRE16............................ 25 II.5.2 RSR2015 corpus............................ 26 vii viii CONTENTS II.5.2.a Training........................... 26 II.5.2.b Testing............................ 27 II.6 Summary.................................... 27 IIISimplified HiLAM 29 III.1 HiLAM baseline implementation....................... 30 III.1.1 Preprocessing and feature extraction................. 30 III.1.2 GMM optimisation........................... 31 III.1.3 Relevance factor optimisation..................... 31 III.1.4 Baseline performance.......................... 31 III.2 Protocols.................................... 32 III.3 Simplified HiLAM............................... 33 III.3.1 Middle-layer training reduction.................... 33 III.3.2 Middle layer removal.......................... 34 III.4 Evaluation Results............................... 35 III.5 The Matlab demo................................ 37 III.6 Conclusions................................... 38 IV Spoken password strength 39 IV.1 The concept of spoken password strength.................. 39 IV.2 Preliminary observations............................ 40 IV.2.1 The text-dependent shift....................... 40 IV.2.2 The text-dependent overlap...................... 40 IV.3 Database and protocols............................ 42 IV.4 Statistical analysis............................... 42 IV.4.1 Variable strength command groups.................. 42 IV.4.2 Sampling distribution of the EER.................. 43 IV.4.3 Isolating the influence of overlap................... 43 IV.5 Results interpretation............................. 45 IV.6 Conclusions................................... 46 V A review of deep learning speaker verification approaches 47 V.1 Neural networks and deep learning...................... 47 V.1.1 Deep Belief Networks......................... 48 V.1.2 Deep Auto-encoders.........................