VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ BRNO UNIVERSITY OF TECHNOLOGY

FAKULTA INFORMAČNÍCH TECHNOLOGIÍ ÚSTAV POČÍTAČOVÉ GRAFIKY A MULTIMÉDIÍ

FACULTY OF INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER GRAPHICS AND MULTIMEDIA

FINITE-STATE BASED RECOGNITION NETWORKS FOR FORWARD-BACKWARD SPEECH DECODING

DISERTAČNÍ PRÁCE PHD THESIS

AUTOR PRÁCE Dipl.-Ing. MIRKO HANNEMANN AUTHOR

BRNO 2016 VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ BRNO UNIVERSITY OF TECHNOLOGY

FAKULTA INFORMAČNÍCH TECHNOLOGIÍ ÚSTAV POČÍTAČOVÉ GRAFIKY A MULTIMÉDIÍ

FACULTY OF INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER GRAPHICS AND MULTIMEDIA

FINITE-STATE BASED RECOGNITION NETWORKS FOR FORWARD-BACKWARD SPEECH DECODING ROZPOZNÁVACÍ SÍTĚ ZALOŽENÉ NA KONEČNÝCH STAVOVÝCH PŘEVODNÍCÍCH PRO DOPŘEDNÉ

A ZPĚTNÉ DEKÓDOVÁNÍ V ROZPOZNÁVÁNÍ ŘEČI

DISERTAČNÍ PRÁCE PHD THESIS

AUTOR PRÁCE Dipl.-Ing. MIRKO HANNEMANN AUTHOR

VEDOUCÍ PRÁCE Doc. Ing. LUKÁŠ BURGET, Ph.D. SUPERVISOR

BRNO 2016 Abstract Many tasks can be formulated in the mathematical framework of weighted finite state trans- ducers (WFST). This is also the case for automatic speech recognition (ASR). Nowadays, ASR makes extensive use of composed probabilistic models – called decoding graphs or recognition networks. They are constructed from the individual components via WFST operations like composition. Each component is a probabilistic knowledge source that con- strains the search for the best path through the composed graph – called decoding. The usage of a coherent framework guarantees, that the resulting automata will be optimal in a well-defined sense. WFSTs can be optimized with the help of determinization and minimization in a given semi-ring. The application of these algorithms results in the op- timal structure for search and the optimal distribution of weights is achieved by applying a weight pushing algorithm. The goal of this thesis is to further develop the recipes and algorithms for the construction of optimal recognition networks. We introduce an alterna- tive weight pushing algorithm, that is suitable for an important class of models – language model transducers, or more generally cyclic WFSTs and WFSTs with failure (back-off) transitions. We also present a recipe to construct recognition networks, which are suitable for decoding backwards in time, and which, at the same time, are guaranteed to give exactly the same probabilities as the forward recognition network. For that purpose, we develop an algorithm for exact reversal of back-off language models and their corresponding language model transducers. We apply these backward recognition networks in an optimization tech- nique: In a static network decoder, we use it for a two-pass decoding setup (forward search and backward search). This approach is called tracked decoding and allows to incorporate the first pass decoding into the second pass decoding by tracking hypotheses from the first pass lattice. This technique results in significant speed-ups, since it allows to decode with a variable beam width, which is most of the time much smaller than the baseline beam. We also show that it is possible to apply the algorithms in a dynamic network decoder by using the incrementally refining recognition setup. This additionally leads to a partial parallelization of the decoding.

Keywords Automatic speech recognition, LVCSR decoding, recognition networks, weighted finite state transducers, N-gram language models, weight pushing

Bibliographic citation Mirko Hannemann: Finite-state based recognition networks for forward-backward speech decoding, PhD thesis, Brno, Brno University of Technology, Faculty of Information Tech- nology, 2016 Abstrakt Pomocí matematického formalismu váhovaných konečných stavových převodníků (weighted finite state transducers WFST) může být formulována řada úloh včetně automatického rozpoznávání řeči (automatic speech recognition ASR). Dnešní ASR systémy široce využí- vají složených pravděpodobnostních modelů nazývaných dekódovací grafy nebo rozpozná- vací sítě. Ty jsou z jednotlivých komponent konstruovány pomocí WFST operací, např. kom- pozice. Každá komponenta je zde zdrojem znalostí a omezuje vyhledávání nejlepší cesty ve složeném grafu v operaci zvané dekódování. Využití koherentního teoretického rámce garantuje, že výsledná struktura bude optimální podle definovaného kritéria. WFST mo- hou být v rámci daného polookruhu (semi-ring) optimalizovány pomocí determinizace a minimalizace. Aplikací těchto algoritmů získáme optimální strukturu pro prohledávání, optimální distribuce vah je pak získána aplikací “weight pushing” algoritmu. Cílem této práce je zdokonalit postupy a algoritmy pro konstrukci optimálních rozpoznávacích sítí. Zavádíme alternativní weight pushing algoritmus, který je vhodný pro důležitou třídu mod- elů - převodníky jazykového modelu (language model transducers) a obecně pro všechny cyklické WFST a WFST se záložními (back-off) přechody. Představujeme také způsob konstrukce rozpoznávací sítě vhodné pro dekódování zpětně v čase, které prokazatelně pro- dukuje ty samé pravděpodobnosti jako dopředná síť. K tomuto účelu jsme vyvinuli algo- ritmus pro exaktní reverzi back-off jazykových modelů a převodníků, které je reprezentují. Pomocí zpětných rozpoznávacích sítí optimalizujeme dekódování: ve statickém dekodéru je využíváme pro dvoustupňové dekódování (dopředné a zpětné vyhledávání). Tento přístup — “sledovací” dekódování (tracked decoding) — umožnuje zahrnout výsledky vyhledávání z prvního stupně do druhého stupně tak, že se sledují hypotézy obsažené v rozpoznávacím grafu (lattice) prvního stupně. Výsledkem je podstatné zrychlení dekódování, protože tato technika umožnuje prohledávat s variabilním prohledávacím paprskem (search beam) – ten je povětšinou mnohem užší než u základního přístupu. Ukazujeme rovněž, že uvedenou tech- niku je možné využít v dynamickém dekodéru tím, že postupně zjemňujeme rozpoznávání. To navíc vede i k částečné paralelizaci dekódování.

Klíčová slova Automatické rozpoznávání řeči, dekodování řeči, rozpoznávací sítě, váhované konečné stavové automaty, jazykové modely

Bibliografická citace Mirko Hannemann: Rozpoznávací sítě založené na konečných stavových převodnících pro dopředné a zpětné dekódování v rozpoznávání řeči, disertační práce, Brno, FIT VUT v Brně, 2016

4 Finite-state based recognition networks for forward- backward speech decoding

Declaration of Originality I hereby declare that this thesis and the work reported herein was composed by and originated entirely from me. The work has been supervised by Doc. Ing. Lukáš Burget, Ph.D. and Doc. Dr. Ing. Jan Černocký. Information derived from the published and un- published work of others has been acknowledged in the text and references are given in the list of sources. Some of the used recognition systems were set-up by the members of the BUT Speech@FIT research group or in cooperation with third parties (Microsoft Research, Kaldi team, Johns Hopkins University).

...... Mirko Hannemann 17.07.

c Mirko Hannemann, 2016. This work was created at the Brno University of Technology at the Faculty of Information Technology. It is protected by the Czech copyright law and any use without the permission of the author is illegal, except for the exceptions specified in the law. Rozpoznávací sítě založené na konečných stavových převodnících pro dopředné a zpětné dekódování v rozpoznávání řeči

Prohlášení Prohlašuji, že jsem tuto disertační práci vypracoval samostatně pod vedením Doc. Ing. Lukáše Burgeta, Ph.D. a Doc. Dr. Ing. Jana Černockého. Uvedl jsem všechny literární prameny a publikace, ze kterých jsem čerpal. Některé systémy použité v práci byly vytvořeny členy výzkumné skupiny BUT Speech@FIT samostatně nebo ve spolupráci s třetími stranami (Microsoft Research, Kaldi team, Johns Hopkins University).

...... Mirko Hannemann 17.07.

c Mirko Hannemann, 2016. Tato práce vznikla jako školní dílo na Vysokém učení technickém v Brně, Fakultě in- formačních technologií. Práce je chráněna autorským zákonem a její užití bez udělení oprávnění autorem je nezákonné, s výjimkou zákonem definovaných případů.

6 Acknowledgements I want to thank Lukáš Burget, who was my thesis supervisor, for his feedback and for the discussions, especially for his efforts to advance our general machine learning knowledge. I especially want to thank Daniel Povey, who was my first supervisor at Microsoft Research and who had the original idea of decoding forwards and backwards in time and to perform an iterative weight pushing. I also want to thank him and his family for their hospitality. Thanks go to Honza Černocký for reviewing this thesis and for never being tired in coming up with new projects and funding and to all colleagues from Speech@FIT for their help and good atmosphere at work. I also want to thank Jasha Droppo, who was my second supervisor at MSR, for the discussions on parallelizing the decoding algorithm and revers- ing the language model. I thank Geoffrey Zweig for hosting me at MSR and for useful discussions on decoding in general. Thanks go to Sanjeev Khudanpur for his input on the weight-pushing algorithm and on language modeling. I want to thank my wife Tereza for the patience she had with the inconveniences caused by the process of writing this thesis, and I want to thank God for providing me with the time and health necessary to achieve it. Some of the work described here was done when the author was summer intern at Microsoft Research, Redmond, WA. This work was partly supported by the Intelligence Advanced Research Projects Activity (IARPA) BABEL program, the European Union’s Horizon 2020 project No. 645523 BISON and Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project IT4Innovations ” excellence in science – LQ1602“.

7 Contents

Table of Contents2

1 Introduction5 1.1 Motivation: search graphs and decoding networks...... 5 1.2 Claims of the thesis...... 6 1.2.1 Contribution and authorship...... 7 1.2.2 Structure of the thesis...... 8

2 Weighted finite state transducers and LVCSR decoding9 2.1 Automatic speech recognition...... 9 2.2 Speech recognition decoding...... 12 2.3 Weighted finite state transducers...... 13 2.4 Weighted finite state transducer based decoding...... 15 2.4.1 Decoding graph construction in the Kaldi toolkit...... 16 2.5 Back-off language models as finite state automata...... 19 2.5.1 Difficulties with the representation of back-off arcs...... 21 2.6 Parallel Speech Decoding...... 25 2.6.1 Coarse and fine-grained parallelization...... 25 2.6.2 Stage parallelism through rank convergence...... 27

3 An alternative weight pushing algorithm 29 3.1 Weight pushing algorithm...... 29 3.2 Ergodic Markov chains and non-negative matrices...... 33 3.3 Alternative weight pushing algorithm...... 37 3.4 Experimental validation...... 41 3.5 Conclusions...... 43

4 Exact reversal of ARPA back-off language models 44 4.1 Motivation: forwards and backwards search...... 44 4.2 Construction of an exactly reversed language model...... 45 4.3 The treatment of missing N-grams...... 48 4.4 Proof: Exact reversal of the language model...... 51 4.5 Motivation by Bayes’ formula...... 57 4.6 Conclusions...... 60

5 Combining forward and backward search in decoding 62 5.1 Introduction: combining forward and backward search...... 62 5.2 Construction of a reversed decoding graph...... 65

1 5.2.1 Reversing L, C and H ...... 66 5.3 Incremental forward and backward search...... 68 5.3.1 Finding the optimal operating point...... 68 5.3.2 Tuning the beam parameters...... 70 5.3.3 Parallel incremental forward and backward search...... 72 5.4 Tracked decoding...... 74 5.4.1 Tracking tokens with an arc-lattice...... 76 5.4.2 Beam-width policy...... 77 5.4.3 Generation of the arc-lattice...... 79 5.4.4 Experimental results...... 81 5.4.5 Importance of beam parameters...... 82 5.5 Conclusions...... 86

6 Conclusions 87 6.1 Summary of the findings...... 87 6.2 Future work...... 89

Bibliography 91

A Scripts and executables in the Kaldi toolkit 97

2 List of Figures

2.1 Three-state Hidden Markov Model...... 10 2.2 Viterbi search in composite model...... 10 2.3 Simple phoneme-based recognition network...... 11 2.4 Components of automatic speech recognition...... 11 2.5 Dependencies and parallelism in the Viterbi algorithm...... 12 2.6 WFST corresponding to three-state HMM from figure 2.1...... 15 2.7 Acceptor U describing the acoustic scores of the utterance...... 15 2.8 One path of the context dependency transducer C...... 17 2.9 Deterministic context dependency transducer C...... 18 2.10 Pronunciation lexicon transducer L...... 18 2.11 Weighted finite state acceptor (WFSA) implementation of a bi-gram LM.. 20 2.12 WFSA resulting from the tri-gram back-off ARPA LM...... 21 2.13 Weighted finite state acceptor (WFSA) implementation of a tri-gram LM. 24 2.14 Parallel LVCSR implementation on CPU and GPU...... 26 2.15 Rank convergence in the Viterbi algorithm...... 27 2.16 Rank convergence in Viterbi algorithm with rank bigger than one...... 28

3.1 Example of weight pushing...... 31 3.2 Pseudocode of single-source shortest path algorithm [Mohri(2002)]..... 32 3.3 Example of ergodic Markov chain and corresponding transition matrix... 35 3.4 Decoding performance of backward decoding network...... 42

4.1 Example of a forward path through a tri-gram language model...... 46 4.2 The backward path corresponding to the path in figure 4.1...... 46 4.3 The same example of figure 4.1 with backing-off...... 47 4.4 The backward structure corresponding to figure 4.3...... 47 4.5 Forward and backward WFSA for tri-gram back-off ARPA LM...... 49 4.6 WFSA for tri-gram LM with just three words A, B, C ...... 52 4.7 Decoding performance of backward decoding network and tri-gram LM... 61

5.1 Forward and backward speech recognition: Example ASR result...... 63 5.2 Illustration of forward and backward search [Nolden et al.(2013)]...... 64 5.3 Reversing lexicon transducer L ...... 67 5.4 One path of the deterministic context transducer C ...... 67 5.5 Reversal of HMM structure for phoneme HMM...... 67 5.6 Reversing the HMM transducer Ha ...... 68 5.7 Finding the optimal operating point on the RTF/WER curve...... 70 5.8 Max-tokens beam: frame-wise scores for three files from Eval2000 test set. 72 5.9 Parallel implementation of incremental forward backward decoding..... 74

3 5.10 First iteration of incremental forward-backward decoding on Eval2000 set. 75 5.11 One example utterance from the WSJ Nov’92 test set...... 76 5.12 Frame score differences between current best partial and final best path.. 77 5.13 Tracked decoding example illustrating the beam width policy...... 78 5.14 Performance of tracked decoding RTF/WER on WSJ Nov’92 test set.... 82 5.15 Profiling the tracked two-pass decoding...... 83 5.16 Testing the extra-beam - RTF/WER...... 84 5.17 Analyzing the effect of the lattice-beam...... 85 5.18 Analyzing the effect of different max-beam settings...... 85

List of Tables

5.1 Example analysis of search errors on the WSJ Nov’92 test set...... 81 5.2 Quantitative analysis of search errors on WSJ Nov’92 test set...... 81

4 Chapter 1

Introduction

1.1 Motivation: search graphs and decoding networks

The application I had in mind while writing this thesis was the search for the best path through a composed probabilistic model, represented as weighted finite state acceptor (WFSA) or transducer (WFST). The task can be for example the decoding of the most probable sequence of words in large vocabulary automatic speech recognition (LVCSR). However, the approach presented here can also be used in other tasks, which can be formu- lated in the WFST framework, as e.g. in finding the most probable sentence in statistical machine translation and finding the most probable pronunciation of a spelled word in grapheme-to-phoneme conversion. Automatic speech recognition (ASR) can be formulated in the WFST framework [Al- lauzen et al.(2004)], [Mohri et al.(2008)]. Nowadays, ASR makes extensive use of composed WFSTs, called decoding graphs or recognition networks. WFSTs are used to represent the language model (LM), the pronunciation lexicon and the Hidden Markov Models (HMM) in a unified framework. These component WFSTs are integrated into a single WFST by the composition operation. Each component is a probabilistic knowledge source that constrains the search for the best path through the composed graph. This search is called decoding. The usage of a coherent framework guarantees, that the resulting automata will be opti- mal in a well defined sense. WFST can be optimized by operations like determinization and minimization in a given semi-ring. The application of these algorithms results in the optimal (deterministic and minimal) structure for search. An optimized recognition network can contain up to millions of states, and the resulting search state space (trellis) is even several orders of magnitude larger. Given the complexity of most of the tasks, the resulting huge search spaces cannot be explored exhaustively. It is necessary to use heuristic pruning techniques. In this case, we have to distinguish search errors, which are due to the incomplete exploration of the search space (e.g. through search beams and other pruning techniques), from modeling errors, which are due to insufficient (or bad) training data or due to inaccurate models (independence assumptions, choice of distribution, smoothing, . . . ). In general, the goal is to reduce the amount of search errors at given run-time requirements (decoding speed). This can be achieved by operations like weight pushing, which aim to distribute the weights along the path in a way that is optimal for pruned search. The goal of this thesis is to further develop the recipes and algorithms for the con- struction of optimal recognition networks. We aim to find the optimal trade-off between improving search speed and reducing search errors.

5 1.2 Claims of the thesis

The focus of this thesis is the construction of optimal forward and backward recognition networks and the development of decoding techniques that combine forward and backward decoding to achieve speed-ups. We introduce the idea of symmetrically decoding forwards and backwards in time. For some tasks, the pruned backward search can be more efficient than the forward search. Moreover, we show, that the search errors of forward and backward search are mutually independent. To concentrate on search errors rather than on modeling errors, we require both decoding passes to be symmetric – i.e. both models are equally powerful and are constructed to assign exactly the same probabilities to hypotheses. This guarantees that each difference in comparing the results of forward and backward decoding corresponds to a search error. For most of the time frames in beam search decoding, a very narrow beam is sufficient. Therefore, we decode with a variable beam width – using a small baseline beam and only increasing it in places, where the forward and backward searches disagree. Decoding with a variable beam width results in significant speed-ups. The main contributions of this thesis can be summarized in the following points:

• Symmetric forward and backward decoding: To speed-up the decoding, as op- posed to multi-pass recognition techniques [Nguyen et al.(1993)], we use forward and backward recognition passes which are equally powerful. Equally powerful forward and backward decoding has been used before for the purpose of system combination [Li et al.(2009)] and confidence estimation [Jouvet and Fohr(2014)]. However, we require that the forward and backward recognition networks assign exactly the same probability scores, which allows us to detect search errors, to recombine partial paths and to incorporate the first pass into the second pass.

• WFSTs resulting from back-off and interpolated language models: We show, that the common practice to convert interpolated LMs into back-off LMs, when storing them in the ARPA file format, leads to problems in the construction of the recognition network in the log-probability semi-ring. We give details about the approximation and the correct handling of back-off arcs and explain “missing” N-grams.

• Alternative weight pushing algorithm: We give the theoretical justification and explain details of the alternative weight pushing algorithm, that is suitable for an important class of models – language model transducers, or more generally cyclic WFSTs and WFSTs with failure (back-off) transitions.

• Construction of symmetric backward recognition networks: We present a recipe to construct recognition networks, which are suitable for decoding backwards in time, fulfill the criteria of determinism and similar size, and which, at the same time, are guaranteed to give exactly the same probabilities as the forward recognition network.

• Exact back-off language model reversal: For the purpose of constructing back- ward recognition networks, we develop an algorithm for exact reversal of back-off language models and their corresponding language model transducers, which is valid for both types of approximations: using epsilon arcs or using failure arcs. We show the derivation of the formulas by a series of steps guaranteeing WFST equivalence, as well as the derivation from Bayes’ rule.

6 • Tracked decoding and variable beam width: We develop a two-pass decoding setup (forward search and backward search), that allows to incorporate the first pass decoding into the second pass decoding by tracking hypotheses from the first pass lattice. This technique allows to decode with a variable beam width, which is most of the time much smaller than the smallest single-pass beam at the same word error rate. The beam is only increased in areas, where forward and backward decoding disagree.

• Speed-up and parallelization: We have implemented the backward recognition networks for both static and dynamic network decoders and show experiments that demonstrate significant speed-ups in both cases. Applying the incrementally refining recognition setup of [Nolden et al.(2013)] additionally leads to a partial parallelization of the decoding.

1.2.1 Contribution and authorship My work on the topic of this thesis started at the Kaldi workshop 2010, where I was part of the team implementing a WFST based speech decoder. Several implementation designs were tested, finally the one from Daniel Povey was the simplest and fastest and was used further on as the main Kaldi decoder. The outcome of these efforts is described in the common paper [Povey et al.(2011)]. In the Kaldi workshop 2011, I was part of the team, whose task was to add lattice generation to the Kaldi decoder. Again, several approaches were discussed and implemented, and the final format of the lattices was decided. I conducted the experiments exploring the properties of the new Kaldi lattice generation [Povey et al.(2012)]. While I was at an internship at Microsoft Research (MSR), under supervision of Daniel Povey, we developed the technique of forward-backward decoding. The original idea came up in discussions with Daniel Povey and Geoffrey Zweig. I developed a recipe for the generation of backward decoding networks (section 5.2), together with an initial reversal algorithm for bi-gram language models and I ran a series of experiments exploring the properties of forward and backward decoding and an analysis of the pruning behavior of the Kaldi decoder. During the internship, my task was to come up with a method for guiding the second pass decoding with the results from the fist pass. For that purpose, I developed the graph- arc lattices and the algorithm to construct them from the Kaldi lattices (section 5.4.3). I also designed the tracked decoding algorithm (section 5.4), implemented it and ran the experimental validation of tracked decoding on a smaller LM. Afterwards, I implemented the tracked decoding into the Kaldi toolkit and continued the experiments by switching to a larger LM. Together with Daniel Povey, we developed the initial reversal algorithm for WFSTs resulting from ARPA LMs with epsilon arcs. I derived the formulas for this approach in the first part of section 4.4. I also explained the presence of missing N-grams in the LM (section 4.3) and showed how to correctly deal with them in the forward and backward models. This approach for LM reversal was used in a more thorough evaluation of the tracked decoding, which we published in [Hannemann et al.(2013)]. In this thesis, I also analyze the contribution of the different pruning parameters of the technique (section 5.4.5). Since the standard weight pushing algorithm was failing for higher order LMs, together with Daniel Povey, we discussed several approaches to weight pushing. The final idea of using the matrix power method was suggested by Daniel Povey and Sanjeev Khudanpur

7 and first implemented into the Kaldi toolkit by Ehsan Variani and Pegah Ghahrmani. However, we only mentioned it very briefly in [Hannemann et al.(2013)] without giving much explanation. In chapter3, I provide a theoretical justification of the algorithm from the theory of Markov chains and non-negative matrices and explain the derivation of the algorithm in detail. I generalized the algorithm to be able to push towards the final state and show the relation between the test for stochasticity and the propagation in the matrix- vector multiplication. After we published the forward-backward decoding, the authors of [Nolden et al.(2013)] showed that it is possible to use our recipe for generating the exactly matching back- ward models [Hannemann et al.(2013)] in an incremental forward-backward decoding setup. While I was at my second internship at MSR under the supervision of Jasha Droppo, to- gether with the authors of [Maleki et al.(2014)], we were looking for a way to implement the approach of [Maleki et al.(2014)] in a parallel speech decoder for LVCSR. Realizing the analogy of the approach of decoding mis-matching portions of speech [Nolden et al.(2013)] and the approach of decoding chunks [Maleki et al.(2014)], I implemented the incremental forward-backward decoding into the MSR recognizer and ran experiments to analyze the potential of using the incremental decoding for the parallelization of the decoding (section 5.3). Since the initial algorithm for the reversal of ARPA LMs was only for transducers using the epsilon-arc approximation for back-off arcs, together with Jasha Droppo, I developed the constructive approach for LM reversal described in section 4.2 and showed that it is also valid for failure arcs. Afterwards, I also derived the proof for correctness in section 4.4 by a series of pushing operations and the derivation of the LM reversal from Bayes’ rule in section 4.5.

1.2.2 Structure of the thesis This thesis is organized as follows:

• Chapter2 introduces the basic concepts and necessary definitions for automatic speech recognition, weighted finite state transducers, language models and the con- struction of recognition networks. • Chapter3 provides the theoretical framework of the alternative weight pushing algorithm by deriving it from the theory of Markov chains and non-negatives matrices. • Chapter4 describes the constructive approach for the exact reversal of back-off language models as well as the formal proof and derivation. • Chapter5 Explains the application of the forward and backward recognition net- works in speed-up techniques. – Section 5.2 explains the construction of the symmetric backward recognition network from its components. – Section 5.3 shows experiments with the incremental decoding in a dynamic network decoder. – Section 5.4 explains the tracked decoding and shows experiments exploring its parameters. • Chapter6 Summarizes the findings in this thesis.

8 Chapter 2

Weighted finite state transducers and LVCSR decoding

2.1 Automatic speech recognition

The task in automatic speech recognition (ASR) is to recognize the words uttered in a segment of recorded audio and to correctly transcribe them to their corresponding textual word form. W = w1, ...... , wn is the (unknown) uttered sequence of words (usually from a fixed vocabulary V ). The encoding and transmission of the audio signal introduces errors due acoustic deviations of the channel (microphone, telephone network, etc.). The resulting audio will be recorded, and then some acoustic analysis (feature extraction) is performed, resulting in the sequence of acoustic vectors X = x1, . . . , xm, called acoustic observation. The task in ASR is to decode the observation X to the (possibly wrong) word sequence Wˆ . The dominant approach to ASR is statistical pattern matching - i.e. to learn patterns from training examples and for the recognition, the observation is compared against the trained patterns and classified according to the goodness of match. To decode an utterance, we search for the word sequence Wˆ with the maximum a-posteriori probability (MAP) given the acoustics. This results in the fundamental equation of speech recognition:

P (W)P (X|W) Wˆ = arg max P (W|X) = arg max (2.1) W W P (X) Since the posterior probability P (W|X) is difficult to model, we applied Bayes’ rule. The probability of the observation sequence P (X) for a particular utterance doesn’t depend on the hypothesized word sequence and doesn’t influence the arg max. P (X|W) is called the acoustic model, which computes the likelihood that the observation X will be produced, when the speaker utters the words W and P (W) is the language model, which is the prior probability that the speaker utters the word sequence W. Commonly, the acoustic model is a hidden Markov model (HMM, example in figure 2.1). A first-order HMM is defined by a finite set of states si ∈ Q, the state transition probabilities aij = P (sj|si), a set of emission symbols x ∈ X (in our case continuous, but can be also discrete) and the emission probabilities bi(xt) = p(xt|si). The state transitions P (si+1|si) model the temporal structure of speech. The sequence of states is not observed. The emission probabilities model the acoustic observations P (xi|si), i.e. the sequence of emission symbols is observed. In the HMM framework, a common approximation is to

9 a11 a22 a33

a12 a23 a34

b1(x) b2(x) b3(x)

Figure 2.1: Three-state left-to-right Hidden Markov Model with depicted one-dimensional continuous emission probabilities and a selected path through it (figure from Lukáš Burget). search for the optimal state sequence S instead of searching for the optimal word sequence:

m−1 Y Wˆ ≈ arg max P (xi|si)P (si+1|si) (2.2) S i=0 In the most simple case of isolated word recognition, the HMMs model whole words and in decoding, the Viterbi algorithm searches for best path for each HMM separately, and then the scores of the best paths are compared. In connected speech recognition, we construct a composite model of the word models and the language model (LM) functions as a grammar, which constrains which words can follow each other. This is the most simple form of a recognition network. while state C is a collector state to save full expanded links between every word pairs.

W 2

W 1

0123 t Time Figure 2.2: Viterbi search in composite model [Young et al.(2006)]. Each word is repre- sented with a left-to-right HMM, and the final states of words are connected to the initial states according to the LM. The initial and final states are not tied to an observation (non- emitting). Therefore, during decoding, the word connections are followed within the same time frame.

If there are too many words in the vocabulary to reliably estimate all word models on

10 the training data, it is necessary to use HMMs that model sub-word units (e.g. phonemes) instead of whole words. In this case, we need a mapping from words to phonemes, which is given in the form of a pronunciation lexicon. We give a small example:

ONE w ah n TWO t uw THREE th r iy

Given this example pronunciation lexicon, figure 2.3 shows the corresponding simple recognition network for connected speech recognition with phoneme sub-word units.

one P(W2|W1) w ah n sil one

two t uw sil two

three th r iy sil three

Figure 2.3: Simple phoneme-based recognition network. Words are modeled by phonemes (sil: silence) and bi-gram probabilities are applied at word transitions (figure from Lukáš Burget). Each phoneme is modeled by a three-state HMM, thus in the figure, we show only the transitions connecting the phoneme models and words.

In figure 2.4, we summarize the basic structure of an ASR system. The recognition network is a composition of the LM (accepting word sequences), the pronunciation lexi- con (mapping the words to phonemes), and the HMM structure, modeling the temporal structure of the phonemes. We conceptually split the HMMs into the HMM structure (transition probabilities aij), which are considered as part of the static recognition network and emission probabilities bi(ot), which produce the acoustic likelihoods scores p(xi|si) for each frame, and are usually applied dynamically during the recognition.

pronunciation language HMM structure lexicon model P (W)

recognition network

acoustic decoding: search for speech input feature model best hypothesis Wˆ = recognized text extraction P (X|W) argmaxwP (W) · P (X|W)

Figure 2.4: Components of automatic speech recognition

11 2.2 Speech recognition decoding

The algorithm [Viterbi(1967)] that is used for the search for the best path through the recognition network belongs to the class of algorithms called [Bell- man(1952)]. In this class of algorithms, we can reduce the global task of finding the best path, to the task of recursively solving the sub-problem of choosing the predecessor with the best partial path up to this time. Figure 2.5 shows this for the Viterbi algorithm.

State Stage 6 ݌௜ିଵǡଵ

5 a35 ࢏ି૚ǡ࢐ି૚ ࢏ି૚ǡ࢐ ௜ିଵǡଶ ࡯ ࡯ ݌ ࢚૚ǡ࢐ 4 ࢏ǡ࢐ b3 ()o 4 ࢾ ૙ ௜ିଵǡଷ ࡯࢏ǡ࢐ି૚ 3 ݌ ࢏ǡ࢐ ࢏ǡ࢐ ࢖ ૙ ࡯

2 ࢚૞ǡ࢐ ݌௜ିଵǡସ Speech 1 Frame 1 2 3 4 5 6 (Time) ݌௜ିଵǡହ

ܥ௜ିଵǡ௝ିଵ ൅ ߜ௜ǡ௝ ௜ǡ௝ ௜ିଵǡ௞ ௞ǡ௝ ௜ǡ௝ିଵܥ ௜ǡ௝ ൌ ƒš ൞ܥ algorithmݐ ሻ כFigure 2.5: Dependencies and parallelism in the Viterbi algorithm.݌ ൌ ƒš Left:௞ ሺ݌ Viterbi applied to an HMM in isolated word recognition [Young et al.(2006)] Right: Dependencies ܥ௜ିଵǡ௝ in time-synchronous Viterbi search: The sub-problem of finding the max token in the cur- rent time step depends on all incoming arcs from the previous time step (stage) [Maleki et al.(2014)].

The Viterbi algorithm [Viterbi(1967)] is a special form of the single source shortest path problem (SSSP), which has been extensively studied [Gibbons(1985)], [Cormen et al.(2009)]. As seen in the left part of figure 2.5, the search graph (in the example the HMM out-most left) unfolds to the trellis structure of search states, where each HMM state is copied for each time step. Often, the Viterbi algorithm is implement as a token passing algorithm [Young et al.(1989)]. We think of a token as a record of a particular state in the HMM that is active on a particular time frame and contains the accumulated score of the partial path explored so far (as well as a back-pointer). As opposed to figure 2.5, the search graphs used in LVCSR (seen as composite HMM) are usually huge and the resulting number of search states in the trellis is an order of magnitude higher. Therefore, we would not construct the full trellis, but build it frame by frame. Every graph state in a particular time frame can only be reached by states from the previous time frame (figure 2.5). This special dependency structure is used in the time- synchronous Viterbi algorithms, which are applied in the majority of speech decoders. The advantage is, that only scores of paths of the same length need to be compared, and only the states of the current frame need to be kept in memory. Alternatively, we can represent the state transitions in a matrix P, where each entry pij represents the sum of all transitions from state i to state j. Given a vector of forward (Viterbi) probabilities for each state at a certain time, the vector of forward probabilities for the next time step is obtained by “multiplying” the matrix Pij (tropical semi-ring,

12 explained in next section 2.3). The weights of the matrix Pij need to be computed at each time step – the transition probabilities aij are fixed, but we need to evaluate the emission probabilities bj(ot) given the observation at the current time step. Usually, the probabilities are computed on-the-fly by combining the so called acoustic scores (bj(ot)) with the so called graph scores (aij, containing LM and HMM transitions). Due to the huge search spaces, it is not efficient or not possible at all to perform an exhaustive search. In this case, pruning needs to be applied (e.g. beam search [Low- erre(1976)]). The choice of an efficient pruning strategy is the dominant factor in deter- mining the recognizer speed. Typically, only a percent of the states of the search graph are active at each time frame and as a result, for the straight forward implementation, their data structures are scattered in memory, which leads to cache failures and slow memory access. Thus, the application of pruning changes the properties of the basic SSSP fundamentally and the algorithm design becomes more complex.

2.3 Weighted finite state transducers

Throughout this work, we think of weighted finite state acceptors/transducers (WFSA/WFST) as having a set of states with one distinguished start state1. Each state has a final weight (or 0¯ (infinite cost) for non-final states) and there is a set of arcs between the states, where each arc has an input label (for WFST also an output label), and a weight. Formally, we introduce a WFSA [Mohri(1997)], [Mohri and Riley(2001)] as:

A = (Σ, Q, i, F, E, λ, ρ) over a semi-ring (K, ⊕, ⊗, 0¯, 1)¯ (2.3)

A semi-ring [Kuich and Salomaa(1986)] is an algebraic structure – it is a ring that may lack negation. It has two associative operations ⊕ and ⊗ that are closed over the set K, they have identities 0¯ and 1¯, respectively. ⊗ distributes over ⊕ and 0¯ is an annihilator. When the weights represent probabilities, the appropriate semi-ring is the probability semi-ring 2 (R+, +, ×, 0, 1) . For numerical stability, often log-probabilities are used, which results in the log semi-ring (R ∪ ∞, ⊕log, +, ∞, 0) with ∀a, b ∈ R ∪ ∞, a ⊕log b = − log(exp(−a) + exp(−b)). When we use the Viterbi approximation, we replace the ⊕log with the minimum and the resulting semi-ring is the tropical semi-ring (R+ ∪ ∞, min, +, ∞, 0). A special class ¯ of semi-rings are divisible semi-rings, i.e. ∀a, b ∈ K, a⊕b 6= 0 : ∃a1 ∈ K : a = (a⊕b)⊗a1. In other words, a1 is the remainder of the division of a by a ⊕ b and we introduce the inversion −1 operation: a1 = (a ⊕ b) ⊗ a. A WFSA is given by:

• an alphabet or label set Σ • a finite set of states Q • an initial state i ∈ Q • a set of final states F ⊆ Q • a finite set of transitions E ⊆ Q × (Σ ∪ ) × K × Q • an initial weight λ • and a final weight ρ(q)

1As used in the OpenFST toolkit: www.openfst.org 2Sometimes, it is used in a more general sense, not limiting the numbers to between zero and one.

13 A transition t = (p[t], l[t], w[t], n[t]) ∈ E can be represented as an arc from the source state p[t] to the destination state n[t], with the label l[t] and weight w[t], which is typically a probability (or log-probability). Transitions labeled with the empty string  consume no input. For each state q ∈ Q, E[q] denotes the set of transitions leaving q. The transition weights can be also represented in form of the transition matrix Pi,j ∈ |Q| × |Q|, where each entry pij = w[t] contains the sum of weights of all transitions t from state i = p[t] to state j = n[t]. If no corresponding transition exists, the entry is 0¯. A path in A is a sequence of consecutive transitions π = t1 . . . tn with n[ti] = p[ti+1], i = 1, . . . , n − 1.A successful path is a path from the initial state i to one of the final states f ∈ F . The label of a path π is the concatenation of the labels of its constituent transitions: l[π] = l[t1] . . . l[tn] and the weight associated to π is the ⊗-product of the initial weight, the weights of its constituent transitions and the final weight ρ(n[tn]) of the state f = n[tn] reached by π:

w[π] = λ ⊗ w[t1] ⊗ ... ⊗ w[tn] ⊗ ρ(n[tn]) (2.4) The total weight of an WFSA is the sum of all successful paths from the initial state i to all of the final states F : M wtot = w[π] (2.5) ∀π,p[π]=i,n[π]=f∈F A symbol sequence is accepted by A if there exists at least one successful path π labeled with x = l[π]. The weight associated by A to the sequence x is then the ⊕-sum of the weights of all the successful paths π labeled with x. In the same way, the weight of a set of paths is the ⊕-sum of the weights of the individual paths. A state q is accessible if there is a path from the initial state i to q. A state q is co-accessible if there is a path from q to a final state f ∈ F . A WFSA is trim or connected if it contains neither inaccessible nor co-inaccessible states. A WFSA is stochastic, if the transitions out of each state q (and the final-probability) “sum to one” in the given semi-ring:   M ∀q ∈ Q,  w[e] ⊕ ρ[q] = 1¯ (2.6) e∈E[q] It is only possible to make the WFSA stochastic if the total weight of the entire WFSA is 1¯. Otherwise, there is a left-over weight that must be handled. In practice this may be discarded, or put on the initial or final states of the WFSA. WFSTs generalize WFSAs by replacing the single transition label by a pair (i, o) of an input label and an output label:

A = (A, B, Q, i, F, E, λ, ρ) over a semi-ring (K, ⊕, ⊗, 0¯, 1)¯ (2.7) , where A is the finite input alphabet, B is the final output alphabet, and a transition t = (p[t], i[t], o[t], w[t], n[t]) ∈ E has an input label i[t] and an output label o[t]. A WFST associates pairs of symbol sequences and weights, i.e. it represents a weighted binary relation between symbol sequences. Two WFSAs are equivalent if they associate the same weight to each input string, i.e. weights may be distributed differently along the paths of two equivalent acceptors [Mohri and Riley(2001)]. Two WFSTs are equivalent if they associate the same output sequence and weights to each input sequence, i.e. the distribution of the weight or output labels along paths need not be the same in the two WFSTs.

14 There might be some confusion about the terms “weight” and “cost”. Usually, by mentioning “weights” in this thesis, we refer to a general probability representation in any semi-ring, and by mentioning “costs”, we refer to probabilities in the log- or tropical semi-rings, where a cost is a floating point number that typically represents a negated log-probability.

2.4 Weighted finite state transducer based decoding

Also LVCSR can be formulated in the framework of weighted finite state acceptors/trans- ducers (WFSA/WFST) [Allauzen et al.(2004)], [Mohri et al.(2008)]. As seen in figure 2.6, the HMM search graph can be represented as WFST. We refer to this WFST as the HMM structure transducer H. The input labels are identifiers of probability density functions (PDF-ids, often context-dependent HMM states).

pdf1:/a_11 pdf2:/a_22 pdf3:/a_33

pdf1:aa/1.0 pdf2:/a_12 pdf3:/a_23 :/a_34 0 1 2 3 4

Figure 2.6: WFST H corresponding to three-state left-to-right HMM from figure 2.1. The notation of an arc is input:output/weight“, where “ stands for the  (no sym- ” ” bol). Instead of attaching the emission probabilities to the state, we attach them to the incoming arcs of a state. Thus, the input labels correspond to identifiers of probability den- sity functions (PDF-ids). During decoding, the PDF-ids are used to evaluate the emission probabilities assigned to the destination state of the arc. The acoustic likelihood score is combined (⊗) with the weight of the arc, corresponding to the transition probability aij. The output label is the identity of the phoneme (aa). The final arc is non-emitting ( input). In composite models, it can serve to interconnect the individual (sub-word HMM) models according to the pronunciation lexicon and the LM.

1/4.86 1/4.16 1/5.23

2/4.94 2/5.44 2/8.53

3/5.31 3/6.19 3/6.02 0 1 2 3 4/5.91 4/5.02 4/6.47 5/4.56 5/4.78 5/7.42 6/5.99 6/5.33 6/6.11

Figure 2.7: Acceptor U describing the acoustic scores of the utterance [Povey et al.(2012)].

To decode an utterance of T frames in the WFST framework, i.e. to find the most likely state sequence through the trellis, we construct an acceptor (WFSA) U, as in figure 2.7. It has T +1 states, with an arc for each combination of (time, PDF-id). The weights on these arcs correspond to negated and scaled acoustic log-likelihoods3. We construct the trellis S

3In figure 2.7, we represented the acoustic likelihoods (which can be very small numbers) in the negative log-semi-ring, while in figure 2.6, we showed the weights the probability semi-ring for illustrational purposes. Of course, the composition must be done with both weights in the same semi-ring.

15 with WFST composition [Povey et al.(2012)]: S ≡ U ◦ H. (2.8) The trellis has approximately T +1 times more states than H. The decoding problem is equivalent to finding the best path through S, which can be done with the shortest path algorithm in the corresponding semi-ring. The best path is represented as a linear WFST. The output symbol sequence of this best path represents the decoding result, i.e. the recognized sequence of phonemes (and words)4. The input symbol sequence of the best path represents the sequence of PDF-ids used for each time frame. If there is a direct correspondence between the PDF-id and the state in the H transducer (for example if pdf1 = 1, pdf2 = 2, pdf3 = 3 in figure 2.6), we obtain the sequence of states as well. This is called the state-level alignment. As done in the Kaldi toolkit [Povey et al.(2011)], the input labels can be constructed in such a way, that they represent the PDF-id, the graph state and the transition number, so that all this information will be available in the state-level alignment. We refer to these identifiers as Kaldi transition-ids. In practice, S is not searched exhaustively, but beam pruning is used. Let B be the searched subset of S, containing a subset of the states and arcs of S obtained by some heuristic pruning procedure. When we do Viterbi decoding with beam-pruning, we are finding the best path through B. Since a LVCSR system can have up to ten-thousands of PDF-ids and there are typically hundreds to thousands of frames in an utterance, it is not very practical to construct U in advance. Also, due to pruning, just a subset of PDF-ids needs to be evaluated for each frame. Therefore, we are dynamically composing U during decoding. This corresponds to combining (⊗) the acoustic likelihood score with the arc weights (transition probabilities, called graph score) on-the-fly. Conceptually, we split HMMs (containing transition probabilities and emission prob- abilities) into the HMM structure transducer H (figure 2.6) and the acoustic model (the emission probabilities, which produce the acoustic likelihoods scores p(xi|si) for each frame). The HMM structure transducer H represents the transitions, i.e. the part of the HMM net- work, that is fixed for all time steps. It maps from a sequence of acoustic unit identifiers (e.g. one PDF-id per frame) to a sequence of phonemes. Nowadays, instead of phonemes as sub-word acoustic units, usually we use context- dependent phonemes. Most often, tri-phones are used (including the current phoneme and one to the left and to the right). Thus, the basic building block of the graph (figure 2.6) are tri-phone HMMs. In this case, an additional component is needed: the phoneme-to-context- phoneme mapping. Due to data sparsity not all possible context-dependent phonemes can be observed sufficiently often in training, usually, a clustering algorithm (e.g. decision ) is applied to reduce the number of units to train.

2.4.1 Decoding graph construction in the Kaldi toolkit Already in figure 2.4, we indicated the building blocks of a recognition network: the HMM structure, the pronunciation lexicon and the LM. Instead of the simple three-state HMM in figure 2.6, now, we use a recognition network composed of thousands of sub-word HMMs, connected according to the phoneme-to-context-phoneme mapping, the pronunciation lex- icon and the LM. We represent each component as WFST. The standard recipe for the decoding graph construction is [Mohri et al.(2008)]: HCLG = min(det(H ◦ C ◦ L ◦ G)), (2.9)

4In our example just “aa”, but for composite models, we obtain the sequence of phoneme HMMs used.

16 here, H, C, L and G are the components, which are created separately and are integrated into a single WFST(HCLG) (called decoding graph) with WFST composition (denoted as ◦). H, C, L and G represent the HMM structure, the phonetic context-dependency transducer, the lexicon transducer and the LM (grammar), respectively. As in figure 2.6, the result is a “fully expanded” graph, where the arcs correspond to HMM transitions, the input labels are the identifiers of PDF-ids (context-dependent HMM states), and the output labels represent words as accepted by the LM. For both the input and output labels, the special symbol  may appear, meaning “no label is present.” In the following, we briefly describe the H, C and L transducers. The G transducer is described in detail in section 2.5. Unless otherwise mentioned, the experiments in this work were conducted with the Kaldi toolkit [Povey et al.(2011)]5. The HMM structure transducer H was already described in figure 2.6, but here, we extend our model to an ergodic loop of many sub-word HMMs. An example will be given in the upper part of figure 5.6. As a particularity of the Kaldi toolkit [Povey et al.(2011)], the HMM structure transducer is created without self-loops (called Ha) to reduce the size of the model. The self-loops are added in a final step. The context-dependency transducer C is a mapping from context-dependent phonemes to phonemes. Figure 2.8 explains how to construct a deterministic mapping and figure 2.9 shows the full context WFST C for the toy example with only two phonemes. From this example, it is clear, that C is huge, and therefore it is often constructed and composed on-the-fly.

-k-ae:k k-ae-t:ae ae-t-:t ,k k,ae ae,t t,

:k -k-ae:ae k-ae-t:t ae-t-:$ , ,k k,ae ae,t t,

Figure 2.8: One path of the context-dependency transducer C, mapping from context-dependent phonemes to phonemes. We compose it with the lexicon WFST L from the right, i.e. we think of the phonemes being generated from the lexicon. Therefore, the output symbols are actually the input of the mapping, which might be confusing. Upper sequence: Given the word ‘cat’ and its pronunciation ‘k ae t’, the naive implementation would be to have one arc for each phoneme (output symbol) and put the corresponding tri-phone on the input label. The tri-phone encoding ‘k-ae-t’ means that ‘ae’ is the center phoneme, with ‘k’ and ‘t’ as left and right context. This naive implementation results in a FST, that is not deterministic (given the output symbols). Lower part: The deterministic solution [Mohri et al.(2008)] is to delay the tri-phone symbols until all its constituting phonemes have been observed. To compensate the delay, we introduce a special end-of-sequence symbol ‘$’ on the last arc.

The lexicon WFST is a mapping from words to phoneme sequences. We give an example taken from Vasil Panayotov’s blog6. We are given the following pronunciation lexicon:

5The graph construction is only described very briefly in [Povey et al.(2011)], so whenever we cite this paper, we also refer to the official Kaldi documentation http://kaldi-asr.org/doc/graph.html. 6http://vpanayotov.blogspot.cz/2012/06/kaldi-decoding-graph-construction.html

17 -y-:$

-y-x:x

x-y-:$ y, y-y-:$ y-y-y:y ,y -y-y:y :y y,y x-y-y:y y-y-x:x ,

:x x,y -x-y:y y-x-y:y x-x-y:y x-y-x:x y,x ,x -x-x:x x-x-x:x y-x-x:x y-x-:$

x,x

x-x-:$ x,

-x-:$

Figure 2.9: Deterministic context-dependency transducer C, mapping from context-dependent phonemes to phonemes, shown for only two phonemes ‘x’ and ‘y’. We compose it with the lexicon WFST L from the right and with the HMM transducer H from the left. The tri-phone encoding ‘x-y-z’ means that ‘y’ is the center phoneme, with ‘x’ and ‘z’ as left and right context. ‘$’ is the end-of-sequence symbol.

k:Cay 4 ey:

#1:/0.69315 5 #1:/0.69315 #0:#0 ey:ache :/0.69315 3 k:/0.69315 0 1 k:/0.69315 sil:/0.69315 sil: 2

#2:/0.69315 #2:/0.69315 7 k:K. ey:

6

Figure 2.10: Pronunciation lexicon transducer L. The first arc for each word (starting from state 1) outputs the word identifier. The last arc of a word is looped back to the word initial state 1, so that all possible words can follow. There is an optional silence (sil) at the begin of the sentence and in between words. Thus, we ‘split’ the word-final arc, either looping back to state 1 with log-probability -log(0.5) or going over state 2 and producing a silence (sil). The ‘#0’ is a disambiguation symbol forwarded from the G transducer. ache ey k Cay k ey #1 K. k ey #2

The symbols #1 and #2 are disambiguation symbols [Mohri(1997)]. Without adding them, the resulting WFST would not be determinizable, since the phoneme sequence ‘k ey’ can result in two different words. Therefore, we insert auxiliary phone symbols disam- biguating the two possible homophones before the determinization7. We also need to add

7Disambiguation symbols need to be passed through the C and H transducers by adding self-loops at each state, and after determinizing the final WFST HCLG, we replace them by .

18 disambiguation symbols if a phoneme sequence can be a prefix of another. The resulting lexicon WFST is shown in figure 2.10. So far, we have introduced the WFST components Ha, C and L. The remaining WFST G is explained in the following section. The final formula for the graph creation in Kaldi is (asl - add self loops, rds - remove disambiguation symbols):

HCLG = asl(min(rds(det(Ha ◦ min(det(C ◦ min(det(L ◦ G)))))))) (2.10)

2.5 Back-off language models as finite state automata

The ARPA language model (LM) format is one of the most widely used standards for encoding N-gram back-off LMs in text form. The ARPA format has most probably been created by Douglas B. Paul [Paul and Baker(1992)] from MIT Lincoln Labs for the DARPA Spoken Language System (SLS) community - hence its name. A wide class of LMs can be encoded in the ARPA format - including e.g. interpolated LMs. Statistical language models estimate the probability of a word sequence W (usually sentence or utterance):

P (W ) = P (w1, w2, . . . , wN ) (2.11)

P (W ) = P (w1)P (w2|w1)P (w3|w1, w2) ...P (wN |w1, . . . , wN−1), where the terms P (wi| ...) are the conditional probabilities of words given their history. N-gram LMs approximate the conditional probability of a word by shortening the history to the previous N − 1 words:

m Y P (w1 . . . wm) ≈ P (wi|wi−N+1 . . . wi−1) (2.12) i=1 Every history corresponds to a possible state of the search space. Even if more powerful LMs are available today (based on e.g. on the maximum-entropy principle or on recursive neural networks), often they are still approximated with N-gram models, since long histories lead to intractable search spaces. Statistical smoothing techniques are applied to the distribution of counts, since it is not possible to observe all possible word sequences (N-gram, including history) sufficiently often in the training texts. On top of that, typically models of different N-gram order are combined. Either different orders of history are interpolated, or the higher order model performs backing-off by leaving out the first word in the history and looking-up the shorter history in the lower order model. This process is repeated recursively until the word in context is found. N-gram LMs can be expressed as weighted finite state acceptors (WFSA) - each LM history corresponds to one state of the automaton (hi = wi−N+1 . . . wi−1). N-gram LMs can be conveniently integrated into the speech decoding process - the search space is defined by the WFSA corresponding to the N-gram LM. However, the number of possible states of a model of order N with a vocabulary size V is V N−1 and the number of possible arcs (and N- grams) is V N , which becomes clearly intractable for higher orders of N (typical vocabulary sizes go into the hundreds of thousands). As a consequence, ARPA language models only store the probabilities of those N-grams that occur sufficiently often. The probability of other N-grams is estimated by recursively backing-off“ to models of lower order (N − 1) ”

19 that use shortened histories (leaving out the first word of the history). Back-off LMs were introduced by S. Katz [Katz(1987)]:

$$P_{Katz}(w_i \mid w_{i-N+1} \ldots w_{i-1}) = P'(w_i|h_i) = \begin{cases} d_{w_{i-N+1} \ldots w_i} \cdot \dfrac{C(w_{i-N+1} \ldots w_{i-1} w_i)}{C(w_{i-N+1} \ldots w_{i-1})} & \text{if } C(w_{i-N+1} \ldots w_i) > k \\[2mm] \alpha_{w_{i-N+1} \ldots w_{i-1}} \cdot P_{lower}(w_i \mid w_{i-N+2} \ldots w_{i-1}) & \text{otherwise} \end{cases} \qquad (2.13)$$

Here, d is the amount of discounting applied ([Katz(1987)] used Good-Turing smoothing), C is the occurrence count of the given N-gram in the training corpus and k is the cut-off frequency (minimum number of occurrences). $\alpha_{w_{i-N+1} \ldots w_{i-1}}$ is the so-called back-off weight, which depends on the current history. It usually corresponds to the sum of the probability mass that was discounted from all N-grams sharing the same history and is now available to be re-distributed by the lower order distribution $P_{lower}$, which can be recursively defined in exactly the same way as $P_{Katz}$. Alternatively, a different type of back-off distribution can be used, as e.g. in Kneser-Ney smoothing [Kneser and Ney(1995)].
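The recursion in equation 2.13 can be made concrete with a small Python sketch (toy probabilities and back-off weights, not produced by any LM toolkit): seen N-grams are looked up directly, unseen ones multiply the back-off weight of the history and recurse with the shortened history:

```python
# Minimal sketch of recursive back-off lookup (toy data). The model stores
# discounted probabilities P'(w|h) for kept N-grams and back-off weights
# alpha(h); unseen N-grams recurse to the shortened history.
probs = {                       # P'(w | h) for N-grams kept in the model
    ('the', 'cat'): 0.4,        # P'(cat | the)
    ('cat',): 0.1,              # uni-gram P'(cat)
    ('dog',): 0.2,              # uni-gram P'(dog)
    ('the',): 0.3,
}
backoff = {                     # alpha(h), back-off weight per history
    ('the',): 0.5,
    (): 1.0,
}

def p_backoff(word, history):
    """P(word | history) with recursive backing-off (cf. equation 2.13)."""
    ngram = history + (word,)
    if ngram in probs:                        # N-gram seen often enough
        return probs[ngram]
    alpha = backoff.get(history, 1.0)         # discounted mass of this history
    if not history:                           # nothing left to shorten
        return alpha * 1e-10                  # would be an OOV / <unk> in practice
    return alpha * p_backoff(word, history[1:])   # drop the first history word

print(p_backoff('cat', ('the',)))   # seen bi-gram: 0.4
print(p_backoff('dog', ('the',)))   # backs off: 0.5 * P'(dog) = 0.1
```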


Figure 2.11: Weighted finite state acceptor (WFSA) implementation of a bi-gram LM. Left: fully connected model (V × V arcs). Right: WFSA approximation of a bi-gram back-off model [Mohri et al.(2008)], just showing the representation of transitions leaving state w1. The bi-gram w1w2 was seen sufficiently often during training and is thus represented by a direct link between the history states w1 and w2. The bi-gram w1w3 was not seen sufficiently often, thus the model backs off to the history-less state "bo" with the cost of the back-off weight α(w1). No symbol is consumed in this transition - indicated by the ε-symbol. Leaving the back-off state, the lower order (uni-gram) probabilities are applied (P(w3)). The approximation with the back-off state can greatly reduce the number of arcs, but it also introduces non-determinism. If ε were a regular label, the WFSA would be deterministic (only a single outgoing arc per label in each state). However, since ε doesn't consume any symbol, the bi-gram w1w2 can either be formed by taking the arc w1 → w2 or by going over the back-off arc: w1 → bo → w2.

As seen in figure 2.11, back-off LMs can be represented as WFSA, but since not all possible history states and arcs can be specified for higher N and V, usually an approximate structure with back-off arcs is used [Allauzen et al.(2003)], [Mohri et al.(2008)]. The probabilities P′(wi|hi) and the back-off weights α (equation 2.13) are pre-computed, and only those are stored in the ARPA format. For each N-gram (e.g. tri-gram abc) with C > k, an ARPA file contains an entry in the form 'P′(c|ab) abc α(abc)', where P′(c|ab) is the discounted probability P(c|ab), and α(abc) is the back-off weight of backing-off from

the higher order N-gram abc to the shortened history bc. Thus, if C(abcd) ≤ k, then P(d|abc) = α(abc)P(d|bc). For the highest order, there are no explicit history states. For example, in a tri-gram LM, we don't create states for every tri-gram, but we use the tri-grams as connections between bi-gram states (e.g. the tri-gram abc connects the states ab and bc). We can interpret the highest order (e.g. tri-gram) connections in an alternative way: we could create the highest order (tri-gram) history state as target of the transition, but there would only be the possibility of immediately backing-off to the corresponding lower order (bi-gram) state. In this interpretation, the back-off weights for the highest order N are assumed to be always one (zero in the log-domain), and therefore there is no need to specify them. This interpretation has the advantage that there are only two types of arcs: going towards a higher order by extending the history, and backing-off to lower orders by shortening the history. This simplifies some derivations in chapter 4. Figure 2.12 shows an example ARPA text file and the corresponding WFSA.

[Figure 2.12 shows, on the left, an ARPA file listing: the \data\ header with ngram 1=4, ngram 2=2, ngram 3=2, followed by the \1-grams:, \2-grams: (including the entry '−1.45678 a b −3.23' referenced in the caption) and \3-grams: sections and \end\; on the right, the WFSA built from it, with history states such as a and ab, word arcs and ε back-off arcs.]

Figure 2.12: Left: Definition of a tri-gram ARPA back-off language model. For each N-gram 'abc', there is an entry in the form 'P′(c|a, b) abc α(a, b, c)', where P′(c|a, b) is the discounted probability P(c|a, b), and α(a, b, c) is the back-off weight applied when backing-off from the history abc to the shortened history bc. The probabilities are, following the ARPA convention, given as base-ten logarithms. Right: The WFSA resulting from the tri-gram back-off ARPA LM defined on the left. The highest-order N-grams (tri-grams) behave slightly differently than lower-order N-grams: the transition for the tri-gram ending in 'a b' goes from state a to state ab. In an alternative interpretation, this is equivalent to going to an imaginary highest-order state and immediately backing-off to state ab. If for some reason the bi-gram 'a b' were missing in the ARPA file (removed line '−1.45678 a b −3.23'), the state ab would still be created as target state of that tri-gram; however, the arc from a to ab would not exist, and the back-off arc from ab to b would have zero cost.
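The mapping from ARPA entries to WFSA states and arcs described above can be sketched in a few lines of Python (toy N-gram values, ε written as '<eps>'; this is an illustration, not the converter used in this work): every N-gram below the highest order creates a word arc into its own history state plus an ε back-off arc to the shortened history, while highest-order N-grams connect two shortened-history states directly:

```python
# Sketch: turn the N-gram entries of an ARPA back-off LM (toy values, already
# parsed into a dictionary) into WFSA arcs. States are history tuples; the
# empty tuple () is the history-less back-off (uni-gram) state.
import math

arpa = {   # ngram -> (log10 probability, log10 back-off weight or None)
    ('a',):          (-0.8, -0.3),
    ('b',):          (-0.9, -0.2),
    ('a', 'b'):      (-0.4, -0.1),
    ('b', 'a'):      (-0.7, -0.2),
    ('a', 'b', 'a'): (-0.5, None),     # highest order: no back-off weight stored
}
ORDER = 3

def arcs_from_arpa(arpa, order):
    arcs = []    # (source state, input label, -log weight, destination state)
    for ngram, (logp, bow) in arpa.items():
        hist, word = ngram[:-1], ngram[-1]
        # highest-order N-grams connect shortened histories directly
        dest = ngram if len(ngram) < order else ngram[1:]
        arcs.append((hist, word, -logp * math.log(10), dest))
        if bow is not None:            # back-off arc: epsilon to shortened history
            arcs.append((ngram, '<eps>', -bow * math.log(10), ngram[1:]))
    return arcs

for arc in arcs_from_arpa(arpa, ORDER):
    print(arc)
```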

2.5.1 Difficulties with the representation of back-off arcs

As seen in figure 2.11, the approximate structure of back-off arcs with the symbol ε introduces non-determinism. Therefore, it is important to pay attention to the computation of the arc weights of the LM WFST G. The same structure (figure 2.11) can be used to represent both back-off LMs and interpolated LMs. We compare the formulas for both models:

$$P_{backoff}(w_i|h_i) = \begin{cases} P'(w_i|h_i) & \text{if } C(w_i, h_i) > k \\ \alpha_{bo}(h_i) \cdot P_{lower}(w_i|\bar{h}_i) & \text{otherwise} \end{cases} \qquad (2.14)$$
$$P_{int}(w_i|h_i) = P'(w_i|h_i) + \alpha_{int}(h_i) \cdot P_{lower}(w_i|\bar{h}_i) \qquad (2.15)$$
$$\bar{h}_i = w_{i-N+2} \ldots w_{i-1}. \qquad (2.16)$$

We observe that the only principal difference is the incorporation of the lower order probabilities – either we always add them in the interpolated LM, or we decide whether to use them based on the condition C(wi, hi) > k. That means that in the back-off LM, we should not use the lower order distribution if the higher order N-gram was observed sufficiently often. This has the important consequence that the back-off weights α are computed differently in the two cases:

$$\alpha_{int}(h_i) = 1 - \sum_{w_i} P'(w_i|h_i) \qquad (2.17)$$
$$\alpha_{bo}(h_i) = \frac{\alpha_{int}(h_i)}{\sum_{w_i\,|\,C(w_i,h_i) \le k} P_{lower}(w_i|\bar{h}_i)}. \qquad (2.18)$$

αint(hi) is computed so that Pint(wi|hi) forms a valid distribution under the assumption that we always add the lower order distribution Plower. This means that if we compute P′(wi|hi), αint(hi) and Plower(wi|h̄i) for an interpolated LM and use them as arc weights in the LM transducer G, we should always take the sum of all possible paths which accept the same symbol sequence. For example, in figure 2.11, we should sum the arc w1 → w2 and the path going over the back-off arc w1 → bo → w2 to correctly obtain the weight for the bi-gram w1w2. In other words, we should use the probability semi-ring (or log-semi-ring). If we use the tropical semi-ring in decoding, which only picks the best of the possible paths, we do not obtain the correct probability. Partly to avoid this problem and partly since the ARPA format is the most commonly used file format (which was designed for back-off LMs), many popular LM tool-kits8 convert the probabilities of interpolated LMs to back-off LMs before saving them to the ARPA file:

$$P''(w_i|h_i) = P_{int}(w_i|h_i) = P'(w_i|h_i) + \alpha_{int}(h_i) \cdot P_{lower}(w_i|\bar{h}_i) \qquad (2.19)$$
$$\alpha''(h_i) = \frac{1 - \sum_{w_i} P''(w_i|h_i)}{\sum_{w_i\,|\,C(w_i,h_i) \le k} P_{lower}(w_i|\bar{h}_i)}. \qquad (2.20)$$

That means we use the interpolated probability Pint(wi|hi) instead of P′(wi|hi). In case Plower is itself an interpolated probability, we have to recursively add all lower orders. The resulting model can be used as a back-off LM. Its origin as an interpolated LM is no longer visible when stored in the ARPA format. If we would just use Pint(wi|hi), without changing the back-off weights α′′(hi), the decoding in the tropical semi-ring would produce the correct result, since it is guaranteed for interpolated LMs that Pint(wi|hi) ≥ αint(hi) · Plower(wi|h̄i). However, to interpret the model as a back-off LM, the back-off weights α′′(hi) are re-computed. Therefore, if the original counts are no longer available (only the ARPA file is given), the original interpolation weights αint are lost9, and the decoding in the tropical semi-ring is not guaranteed to produce the correct result, since it is now possible that Pint(wi|hi) < α′′(hi) · Plower(wi|h̄i)10.

8For example SRILM http://www.speech.sri.com/projects/srilm/.
9We could reverse-engineer the equations, but due to complexities like missing N-grams and pruning, this is non-trivial.
10In rare cases, where C(wi, hi) > k for almost all wi, it is even possible that α′′(hi) > 1.0.

Luckily, this does not happen very frequently, and it will not have a big impact e.g. on word error rates. For back-off LMs, αbo(hi) (equation 2.18) is computed under the assumption that we only back off if C(wi, hi) ≤ k. Therefore, when we compute P′(wi|hi), αbo(hi) and Plower(wi|h̄i) for a back-off LM (or convert an interpolated LM to a back-off representation as in equation 2.20) and use these as arc weights in G, we should only take the back-off arcs if there is no corresponding arc with the higher-order N-gram. For example, in figure 2.11, we are not allowed to use the back-off path w1 → bo → w2, because there is a bi-gram arc w1 → w2. In other words, for these models, when we represent back-off arcs with ε, we do not obtain the correct probability, neither with the probability semi-ring, which sums all possible paths, nor with the tropical semi-ring (taking only the best path). When decoding with the tropical semi-ring, we would incorrectly take the back-off path if P′(wi|hi) < αbo(hi) · Plower(wi|h̄i)11. As already said, luckily, this does not happen very frequently, and most often this inconsistency for the tropical semi-ring is neglected. [Allauzen et al.(2003)] introduce an algorithm to obtain a back-off WFST G′ that produces correct results in the tropical semi-ring, even when using ε-arcs. An exact and deterministic implementation of back-off LMs with WFSA would require a different type of arc. The so-called failure arcs were introduced for efficient string matching [Aho and Corasick(1975)]. Usually, in the literature (e.g. [Allauzen et al.(2003)]), a special arc label ϕ (or φ) is used to mark failure arcs. A failure arc doesn't consume any symbol, and it has the semantic interpretation that it can only be taken if no symbol on any of the other out-going arcs of the same state can be accepted. This works similarly to the ‘default’ case in a C-language ‘switch’ statement. [Allauzen et al.(2003)] show how to evaluate paths through WFSTs with failure arcs. These failure-arc-type WFSA accept sequences of words with exactly the same probabilities as when correctly implemented as a back-off LM in any of the LM tool-kits. The algorithm that we are going to develop in chapter 4 will work for both ε- and failure-arc-type WFSA. Figure 2.13 shows an example of a tri-gram back-off LM implemented with failure arcs, and explains why failure arcs contradict the Markov assumption. As explained, there are correct solutions for WFSTs generated from back-off LMs when working in the tropical semi-ring. This is true when taking the probabilities as estimated for a back-off LM or when converting an interpolated LM to a back-off LM. However, during the construction of the recognition network (section 2.4.1), we usually work in the (log-) probability semi-ring – most important are the determinization and weight pushing operations. When we compute the arc weights for back-off LMs, but implement the back-off arcs with ε, the summation of the redundant back-off paths in the probability semi-ring has the consequence that the probabilities of outgoing arcs do not sum to one, but to a slightly higher value, i.e. the resulting WFSA is no longer stochastic. Since the LM WFSA has cycles, the weight of a cycle can be greater than one, which causes the WFSA to have an infinite total weight (equation 2.5). As we will see in section 3.1, this can cause the conventional weight pushing algorithm to fail. The correct solution would be to implement weight pushing (i.e. the shortest path algorithm) and the determinization for WFSA with failure arcs, i.e. respecting the semantics of failure arcs. As seen in figure 2.13, this is a non-trivial task, since failure arcs violate the Markov assumption, i.e. when following a failure arc, we have to remember the history of arcs to be able to choose the successor arcs correctly12.

11This might be the case if P(w1, w2) ≪ P(w1)P(w2).
12In the composite recognition network HCLG, the number of past arcs to be remembered can be quite high, since the arcs with word labels can be farther apart.

Figure 2.13: Weighted finite state acceptor (WFSA) implementation of a tri-gram back-off LM using failure arcs. Only selected arcs are shown. A failure arc, indicated by ϕ, doesn't consume any symbol, and it can only be taken if no symbol on any of the other out-going arcs of the same state can be accepted. I.e. we cannot take the arc (v) → (v, w), since there is a direct arc (u, v) → (v, w). Failure arcs have the peculiarity that the decision which arc to take is made based on the current symbol, but the symbol is only consumed later in the next non-failure arc. If there are several failure arcs in a row (e.g. backing-off (u, v) → (v) → (bo)), we compare the same input symbol (w) several times against the outgoing arcs of different states (v and bo). Therefore, the decision to not take the arc (bo) → (w) is based on the fact that there exists an arc (u, v) → (v, w). This contradicts the Markov property of the model, because after backing-off (u, v) → (v), due to the Markov assumption, it is not possible to decide in state bo whether we came originally from (u, v), from some other state (x, v), or just from (v). Therefore, to correctly implement the semantics of failure arcs, completely different algorithms are necessary.
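To make the difference concrete, the following Python sketch (toy probabilities and hand-made data structures, not tied to any toolkit) scores a word once under failure-arc semantics and once under ε-arc semantics in the probability semi-ring. With ϕ the back-off transition is skipped whenever a direct arc exists; with ε both paths are combined by the semi-ring, so a back-off LM represented with ε-arcs over-counts the bi-gram probability:

```python
# Toy comparison of failure-arc and epsilon-arc semantics for one back-off step.
# state -> {label: (probability, next_state)}; '<phi>' marks the back-off arc.
wfsa = {
    ('w1',): {'w2': (0.6, ('w2',)), '<phi>': (0.3, ())},    # history state w1
    ():      {'w2': (0.5, ('w2',)), 'w3': (0.5, ('w3',))},  # uni-gram state
}

def score_phi(state, word):
    """Failure semantics: the back-off arc is taken only if `word` has no direct arc."""
    if word in wfsa[state]:
        return wfsa[state][word][0]
    bow, backoff_state = wfsa[state]['<phi>']
    return bow * score_phi(backoff_state, word)

def score_eps_sum(state, word):
    """Epsilon semantics in the probability semi-ring: sum over all accepting paths."""
    total = wfsa[state][word][0] if word in wfsa[state] else 0.0
    if '<phi>' in wfsa[state]:                 # treated as an ordinary epsilon arc
        bow, backoff_state = wfsa[state]['<phi>']
        total += bow * score_eps_sum(backoff_state, word)
    return total

print(score_phi(('w1',), 'w2'))       # 0.6  -- the correct back-off LM probability
print(score_eps_sum(('w1',), 'w2'))   # 0.75 -- the duplicate back-off path is added
print(score_phi(('w1',), 'w3'))       # 0.15 -- both semantics agree for unseen words
```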

Since, to the best of our knowledge, determinization and shortest path algorithms for failure arcs in the probability semi-ring are not yet available, we conclude that the correct way to construct recognition networks is to use interpolated LMs. However, we should not convert those probabilities to a back-off LM, as is usually done. We would then compose the recognition network in the log-semi-ring, i.e. apply determinization and weight pushing in the log-semi-ring. Once the final recognition network is obtained, we apply the algorithm in [Allauzen et al.(2003)] (section "Exact offline representation") to obtain a WFST that can be correctly used for decoding in the tropical semi-ring. For back-off LMs, the situation is worse. We would have to convert the back-off LMs to an interpolated LM for the purpose of building and optimizing the recognition network in the log-semi-ring. Without going into details, this is not always possible. Therefore, we have to be aware of the fact that, when interpreted in the log-semi-ring (using ε-arcs), these models are not stochastic and can have an infinite total weight. This has important consequences for the weight pushing algorithm (section 3.1). Therefore, we present an alternative weight pushing algorithm (section 3.3) that can handle this problem.

2.6 Parallel Speech Decoding

During the last decade, Moore's law, the trend of increasing clock rates by reducing transistor gate lengths, has slowed down, and the power consumption of chips has become a major issue [Horowitz et al.(2005)]. There seems to be a Pareto relation [Horowitz et al.(2005)] between increasing performance and increased power consumption. Therefore, it is more efficient to run more units or cores in parallel at a lower clock speed instead of a single core running at a higher clock speed. As a result, there is a need for parallel algorithms, which can efficiently use the multiple cores that are usually present in recent systems. The challenge in parallelization is to divide the task into sub-problems that are as independent of each other as possible - to minimize the communication between the tasks and to avoid waiting times. At the same time, each of the sub-problems should be of approximately the same computational complexity to achieve a good load balancing and thus work efficiency. The maximum possible speed-up in parallelization is determined by the proportion of code that needs to be run serially (Amdahl's law). Depending on the size of the sub-problems that can be identified as being independent, we can distinguish coarse level and fine-grained parallelization. We think of fine-grained parallelizations as working on the level of individual states/arcs/densities or even on instruction level, while coarse refers to parallelism among decoding passes or chunks/segments of the utterance.

2.6.1 Coarse and fine-grained parallelization

Coarse level parallelization for LVCSR can be achieved if different stages of processing or different knowledge sources are distributed to different cores. Speech recognition has several stages of processing (feature extraction, acoustic model evaluation, graph search), which can be distributed among cores. Another opportunity for coarse parallelization is the presence of multiple acoustic feature sets (acoustic models), where different feature streams can be computed on multiple cores. Coarse level parallelization has the advantage that the serial algorithms do not need to be changed, i.e. no overhead due to communication and extra data structures is introduced. However, the scalability is limited by the number of available stages or feature streams and model components, which is usually small. The optimal distribution of computation tasks to multiple cores depends on the task (e.g. size of the recognition network, complexity of the acoustic model) and on the processor and memory configuration [You et al.(2009)]. A typical system can have a combination of several CPU cores with shared memory and a graphics processing unit (GPU), or multiple CPUs can be connected over a network. Two recent examples of systems using parallel CPUs and GPUs are given in figure 2.14 and in [Cardinal et al.(2013)]. In a single-threaded system, typically the majority of time is spent in the acoustic model evaluation (e.g. 80% in [You et al.(2009)]). At the same time, the acoustic model evaluation (either GMM or neural networks) is easily parallelizable (e.g. [Dixon et al.(2009)]). While the acoustic model evaluation can be easily parallelized in a fine-grained way [Dixon et al.(2009)], the fine-grained parallelization of the graph search is less trivial. In dynamic programming algorithms (and SSSP), the sub-problems that do not depend on each other, and thus can be computed in parallel, form stages or wave-fronts [Maleki et al.(2014)] (see figure 2.5). There are efficient and scalable parallel algorithms to solve the general SSSP, many of them based on the delta-stepping algorithm [Meyer and Sanders(2003)]. The idea is that nodes are assigned to buckets and all nodes within a bucket are updated at the same time in parallel. Also the queue (and sorting) operations can be parallelized.

[Figure 2.14 block diagram: the decoder is split between the CPU and a many-core GPU. After reading the input files (WFST, HMMs, large LM) and initializing the data structures (ActiveSet table, backtrack log), each iteration runs Phase 0 (iteration control, preparing the ActiveSet), Phase 1 (acoustic likelihood computation and LM lookup) and Phase 2 (graph traversal); backtrack information is collected and saved, and the final backtrack produces the output results.]
Figure 2.14: LVCSR implementation on CPU and GPU [Kim et al.(2012)] – acoustic model evaluation and graph search are implemented on the GPU, language model re-scoring, due to its memory requirements, on the CPU.

There are also recent implementations on GPUs, which are able to process huge graphs and achieve significant speed-ups [Davidson et al.(2014)]. While the general SSSP is well parallelizable, this is not true for search in LVCSR. As already pointed out, the arc weights are computed dynamically (acoustic scores are added). Also, due to pruning, only a fraction of the states of the recognition network are kept in memory. Thus, the application of pruning changes the properties of the basic SSSP fundamentally, which makes it difficult to design a parallel algorithm for Viterbi search in LVCSR that would be scalable to a high degree of parallelism and at the same time stays efficient.

Several parallelization attempts have been made with word-based HMMs and recognizers using linear lexica; however, for ASR systems with large vocabularies, lexical-tree or more efficient WFST based decoders are desirable. [Phillips and Rogers(1999)] describe a WFST-based approach that introduces fine-grained parallelism by organizing the data structures and grouping the computations according to a state of the recognition network. The computations belonging to each state are assigned to a core according to the modulo operation, which should achieve uniform load balancing. They achieved approximately 4x speed-up. Another idea was pursued by [Parihar and Hansen(2008)]. They use a lexical-tree based decoder, i.e. using the lexicon transducer (uni-gram) and storing the word history with each token. Then, it is possible to split the lexical tree at the root into several sub-trees for each thread. To achieve a good load balancing, the split was done between similar sounding phones. However, the approach did not scale to more than 2-4 threads. [You et al.(2009)] present an analysis of several algorithm designs to implement fine-grained parallelism for the Viterbi graph search. They showed that the optimal algorithm design varies with the architecture (multi-core CPU vs. GPU). [Chong et al.(2009)] implemented a WFST based recognizer for a medium sized LVCSR task completely on the GPU, which already achieved an 11x speed-up. However, in case huge language models of higher order (tri-gram and more) need to be used, the limited memory on the GPU would not be sufficient. To solve this, [Kim et al.(2012)] showed how to utilize the CPU in parallel for the language model on-the-fly re-scoring. An alternative architecture is presented in [Cardinal et al.(2013)] - here, an A-star algorithm is used for search, and the computation of the heuristic is performed on the GPU. For that purpose, a backward decoding with a uni-gram recognition network (LM look-ahead) is performed, which is parallelized by distributing transitions among cores according to the destination state (aggregation approach).

2.6.2 Stage parallelism through rank convergence


Figure 2.15: Rank convergence in the Viterbi algorithm [Maleki et al.(2014)]. The vector $\vec{v}_i$ contains the scores of the states at time i. If at time k the search converges to only one active state, then all future frames depend only on the score of this single state. Therefore, independent of the initialization at time i, the resulting vector $\vec{v}_j$ will be equal, except for an additive offset, which is constant for all its components. That means the rank of the state scores is independent of the initialization.

An interesting observation concerning Viterbi decoding (and many FST based algorithms in general) was made by [Maleki et al.(2014)]: If we start decoding in the middle of a sentence by assigning a random score to all states, usually, after a quite limited number of time frames (20-50), the algorithm converges to a small set of active states, which is independent of the initialization at the start frame. We can interpret this as an all-pairs-shortest-path problem [Cormen et al.(2009)], i.e. finding the shortest path between any pair of states in the graph. If we represent the possible transitions between states in one time step as a transition matrix, each time step of the Viterbi algorithm can be seen as a matrix-matrix multiplication of the transition matrix in the tropical semi-ring. The observation is that the resulting all-pairs-shortest-path matrix (containing the score of paths between two states) will converge to a matrix of small orthogonal rank after several frames. We obtain the Viterbi forward score for the final time frame by multiplying the all-pairs matrix with an initial vector from the left. As seen in figure 2.15, if the rank of the all-pairs matrix is one, this leads to the situation that the Viterbi forward scores for all frames after the point of convergence are independent of the initialization vector (off by a constant). This fact can be exploited to parallelize the Viterbi algorithm across stages - in other words, to split it into time chunks which can be processed in parallel. If each randomly initialized chunk is long enough for the algorithm to converge to a single state at some point, then the state sequence after that is independent of the initialization and only the beginning frames of each chunk need to be fixed in a consecutive parallel fix-up phase.
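The following numpy sketch (toy transition and emission costs, not the setup of [Maleki et al.(2014)]) illustrates the effect: each Viterbi step is a matrix-vector product in the tropical semi-ring, and after a frame in which effectively only one state survives, two differently initialized runs agree up to a constant offset:

```python
# Rank convergence demo: Viterbi as min-plus matrix-vector products (toy data).
import numpy as np

rng = np.random.default_rng(0)
S = 6
trans = -np.log(rng.dirichlet(np.ones(S), size=S))   # transition costs (S x S)

def viterbi_step(v, emit):
    # tropical "matrix-vector product": new_j = min_i (v_i + trans_ij) + emit_j
    return np.min(v[:, None] + trans, axis=0) + emit

v1 = rng.uniform(0, 10, size=S)     # two different random initialisations
v2 = rng.uniform(0, 10, size=S)
for t in range(30):
    emit = -np.log(rng.dirichlet(np.ones(S)))   # per-frame emission costs
    if t == 15:                                  # a very "confident" frame:
        emit = np.full(S, 100.0)
        emit[2] = 0.0                            # effectively only state 2 survives
    v1 = viterbi_step(v1, emit)
    v2 = viterbi_step(v2, emit)

print(np.ptp(v1 - v2))   # ~0: after convergence the runs differ only by a constant
```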


Figure 2.16: Rank convergence in the Viterbi algorithm with rank greater than one [Maleki et al.(2014)], unpublished. During the fix-up phase $S_{i+1}$, only the values of those input nodes $S_i$ that originate from different active states (different color) need to be fixed.

For small decoding tasks, this algorithm showed very promising results [Maleki et al.(2014)], but for LVCSR, the rank (number of active states) usually does not converge to one, but to a small number. In this case, the state scores will be linear combinations of vectors resulting from the few active states - see figure 2.16. A similar parallelization can be applied as in the singular case, but the fix-up phase gets slightly more complex. Using huge networks also makes it necessary to introduce state pruning. It is not clear which states to activate during the random initialization. Therefore, the set of states in the fix-up phase might only partly overlap with that of the random-initialization phase, which complicates the fix-up phase of the algorithm. An open research question is whether it is possible to automatically detect in advance the frames where the rank will converge, and what the optimal segmentation into chunks is for a given task. While we have no direct answer to that question, we suspect that at the points with few remaining active states, the decoding results of a forward and a backward search will agree (see chapter 5). Therefore, it should be possible to split the segments at points where forward and backward search agree. This leads to an approach to parallelization, which is described in section 5.3.

Chapter 3

An alternative weight pushing algorithm

We explain the connection between Markov chains, non-negative matrices and weighted finite state acceptors (WFSA). Based on that, we introduce an alternative weight pushing algorithm that is able to deal with a possibly infinite total weight and is much more efficient for acceptors with cycles. This alternative weight pushing algorithm is suitable for an important class of models - i.e. language model transducers, or more generally (cyclic) WFSAs with failure transitions.

3.1 Weight pushing algorithm

As a prerequisite to this chapter, we assume that we are able to construct a model (WFSA) that has some desired properties (i.e. being deterministic and minimal). The application we had in mind (chapter 4) was to construct a (back-off) language model (LM) acceptor that has the desired size and structure and is deterministic (except for the ε-arcs). If we want to use the resulting acceptor in a pruned search (i.e. for LVCSR), it is desirable that the acceptor has yet another property - to be (locally) stochastic. Two WFSAs are equivalent if they accept the same set of input label sequences with the same path weights. In other words, two equivalent WFSAs (or WFSTs) may differ by the way the weights (and output labels) are distributed along the paths, even if they have the same topology with the same input labels [Mohri and Riley(2001)]. It was pointed out by [Mohri and Riley(2001)] that the distribution of weights along the path plays a crucial role in pruned search. Pruning is typically based on the weight accumulated along a path explored so far - often it is a combined weight (e.g. acoustic, pronunciation and language model for LVCSR). Typically, we prune by limiting the breadth of the search around the current best path (called beam pruning). [Mohri and Riley(2001)] conjectured that the optimal distribution of weights for pruned search should be such that the weights (coming from different knowledge sources such as acoustic and language model) are locally synchronized for the sequential decisions which arc to take next. Another common wisdom is that the knowledge should be applied as early as possible in the search - to be able to rule out unlikely paths as early as possible. This manifests itself in techniques like LM look-ahead [Ortmanns et al.(1996)], which are used in LVCSR decoding with dynamic networks. For statically compiled networks (or monolithic models like the LM WFSAs dealt with here), this corresponds to "pushing" the weights as

much as possible towards the initial state. As shown in [Mohri et al.(2008)], pushing weights towards the initial state is actually equivalent to making the weights of the outgoing arcs of every state sum to one in the given semi-ring, i.e. making the WFSA stochastic. When the weights are distributed in such a way, the pruned search will be more effective - i.e. a smaller beam can be used. However, the overall best path (and accuracy) is still the same - in the asymptotic case of a very wide beam. [Mohri and Riley(2001)] show substantial speed-ups for several tasks in LVCSR when modifying the transition probabilities of a WFSA in such a way that the weights of paths through the WFSA form a stochastic distribution. Therefore, for optimal pruning in LVCSR with the probability semi-ring, we want to obtain a WFSA where the weights of outgoing arcs sum to one for each state. We give a general definition of weight pushing for WFSAs, where we refer to the definition of a WFSA given in section 2.3. The generalization to WFSTs is given by interpreting weight-output label pairs as new weights combined by the appropriate semi-ring [Mohri(1997)]1. Re-weighting [Mohri et al.(2008)] is an operation on WFSAs that alters the weights $w[t_i]$ of individual transitions and the final probabilities $\rho(n[t_n])$, while leaving unaffected the weights w[π] of successful paths (i.e. from initial to final states). The possible ways to change the transition weights of a WFSA can be expressed with the help of a potential function $V: Q \rightarrow \mathbb{K} \setminus \{\bar{0}\}$, which can be an arbitrary function on states, assigning a value of $\mathbb{K}$ (except $\bar{0}$) to every state q. Given such a function, we can update the initial weight λ, the transition weights w[e] and the final weights ρ(f) according to the following [Mohri and Riley(2001)]:

$$\lambda \leftarrow \lambda \otimes V(i) \qquad (3.1)$$
$$\forall e \in E: \quad w[e] \leftarrow [V(p[e])]^{-1} \otimes (w[e] \otimes V(n[e])) \qquad (3.2)$$
$$\forall f \in F: \quad \rho(f) \leftarrow [V(f)]^{-1} \otimes \rho(f) \qquad (3.3)$$

If the re-weighting is carried out this way, it is easy to see that the overall weight of a successful path is not changed, since the potentials along any successful path cancel each other. Thus, the resulting WFSA is equivalent to the original one. The simplest possible re-weighting operation is to multiply (⊗) a fixed value k into the weights of all incoming arcs of a particular state q₀ and to divide (⊗ the inverse) the same value out of the weights of all arcs leaving that state. This is achieved with the potential function:

$$V(q) = \begin{cases} k \in \mathbb{K} \setminus \{\bar{0}\} & \text{if } q = q_0 \\ \bar{1} & \text{otherwise} \end{cases} \qquad (3.4)$$

Weight pushing is a special case of re-weighting that aims to make the WFSA stochastic, or in other words to push the weights as much as possible towards the initial state. This is achieved [Mohri and Riley(2001)] by setting the potential function V(q) to the shortest distance d[q] from q to any of the final states F:

$$\forall q \in Q: \quad V(q) = d[q] = \bigoplus_{\pi \in P(q)} w[\pi] \qquad (3.5)$$

Here, P(q) is the set of all paths from q to any of the final states F. Figure 3.1 shows an example of weight pushing in the tropical and probability semi-ring.

1OpenFST uses the Gallic semi-ring, which uses composite weights (ProductWeight) of an output label string and the arc weight. For the strings, we use the longest common prefix as ⊕ and concatenation as ⊗.

Figure 3.1: Example of weight pushing [Mohri et al.(2008)]. Left: WFSA before applying the weight pushing. Center: Weight pushing in the tropical semi-ring (⊕ is the minimum operation and ⊗ is addition). In this case, the potential function is the shortest distance to the final state, as computed by a Viterbi algorithm that runs backwards from the final state. States 1 and 3 can be reached with zero cost, state 2 with cost 4. Thus, for each arc we ⊗ (add) the potential of the destination state and ⊗⁻¹ (subtract) the potential of the source state. Right: Weight pushing in the probability semi-ring. The potential function is the sum of all paths meeting in a state, as computed by the forward algorithm.
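As a small illustration of equations 3.1-3.3, the following Python sketch (a made-up acyclic WFSA, not one of the examples in figure 3.1) computes the potentials in the probability semi-ring and applies the re-weighting; afterwards the outgoing weights of every state sum to one while path weights are preserved:

```python
# Toy weight pushing in the probability semi-ring via re-weighting.
arcs = [  # (source, label, weight, destination)
    (0, 'a', 0.2, 1), (0, 'b', 0.6, 2),
    (1, 'c', 0.5, 3), (1, 'd', 0.3, 3),
    (2, 'e', 0.1, 3),
]
initial, final = 0, 3

# Potential V(q): sum of the weights of all paths from q to the final state,
# computed in reverse topological order (3, 2, 1, 0).
V = {final: 1.0}
for q in (2, 1, 0):
    V[q] = sum(w * V[d] for (s, _, w, d) in arcs if s == q)

lam = V[initial]                                              # eq. 3.1
pushed = [(s, l, w * V[d] / V[s], d) for (s, l, w, d) in arcs]  # eq. 3.2

for q in (0, 1, 2):          # outgoing weights of every state now sum to one
    print(q, sum(w for (s, _, w, _) in pushed if s == q))
# the weight of a path, e.g. 0 -a-> 1 -c-> 3, is preserved once lam is included
print(lam * (0.2 * V[1] / V[0]) * (0.5 * V[3] / V[1]), 0.2 * 0.5)
```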

There are several algorithms to compute the shortest distance, based on the dynamic programming principle, whose complexity depends on the semi-ring and the type of WFSA that is dealt with. For the tropical semi-ring, a Viterbi algorithm can be used. If the log-probability or probability semi-ring is used, however, all possible paths towards a state need to be summed, which is especially difficult if the WFSA has cycles. A cycle can be followed an infinite number of times, generating an infinite number of paths that need to be summed. So we need to guarantee that the weight of any cycle is $w(\pi) < \bar{1}$. In other words, we need to be able to compute the closure $\bigoplus_{i=1}^{\infty} w^i$, otherwise the cycle would result in an infinite total weight. If a semi-ring fulfills this condition for all $w \in \mathbb{K}$, it is called a closed semi-ring - see [Mohri(2002)] for an exact general definition. If the structure of the WFSA is simple, i.e. the cycles are not nested and can be easily identified, the closure operation could be directly applied. For WFSAs resulting from LMs, this is not true, since the cycles are nested in a complex way. In [Mohri(2002)], the set of k-closed semi-rings is introduced, which guarantees that the maximum number of times a cycle needs to be followed is k:

$$\forall a \in \mathbb{K}, \ \exists k: \quad \bigoplus_{i=0}^{k+1} a^i = \bigoplus_{i=0}^{k} a^i = \bigoplus_{i=0}^{l} a^i, \quad \forall l \ge k \qquad (3.6)$$

For that class of semi-rings, figure 3.2 presents a generic shortest distance algorithm as given by [Mohri(2002)]. The algorithm manages a queue of states S that need to be updated. After extracting the state q from the queue, the so-called relaxation operation (starting from line 10) consists in propagating the accumulated weight update r[q] to all arcs e leaving the state. For all destination states n[e] which meet the relaxation condition (line 11), i.e. the update is bigger than zero, we update the distance d[n[e]] and the tentative update r[n[e]] and add the state to the queue. The algorithm continues until the queue is empty. For k-closed semi-rings, the relaxation condition (line 11) makes an update unnecessary if the weight of a loop has already been added k times. The algorithm is thus iterative and operates by locally forwarding weight mass through the WFSA according to the queue policy. The algorithm is efficient if the number k is small, and if the state transition matrix is sparse, i.e. $|E| \ll |Q|^2$.

31 Figure 3.2: Pseudocode of single-source shortest path algorithm used in the generic weight pushing algorithm [Mohri(2002)]. The algorithm computes the shortest distance d[q] from the initial state s for each state q of the WFSA G. To compute the shortest distance to the final state, we have to start with the final state and follow the arcs in the opposite direction.

For the (log-) probability semi-ring, there is no k < ∞ for which equation 3.6 would hold, but it is still a closed semi-ring (k → ∞). The generic shortest-distance algorithm can't be used in this case; however, closed semi-rings are covered by the generic Floyd-Warshall and Gauss-Jordan algorithms [Lehmann(1977)]. These algorithms solve the so-called algebraic path problem by computing the all-pairs shortest distance with a time complexity of $O(n^3)$ (n proportional to the number of states), but they are not as efficient for our purpose, since they don't take advantage of the sparsity of the transition matrix and since we are actually just interested in the distance from one single (start) state. As soon as the WFSA has thousands of states, an algorithm with complexity $O(n^3)$ is clearly not feasible. One strategy is to decompose the WFSA into several strongly connected components, where any state of a component is reachable from any other state of the same component by a path of limited length. In this case, the all-pairs shortest distance only needs to be computed for each component separately. However, it is relatively easy to see that a model of language such as the N-gram LM consists basically of just a single huge strongly connected component. Since the history is limited to a few previous words, and even completely erased on sentence boundaries, it is obvious that in principle any sequence of words is repeatable after a limited number of time steps. Thus, the complexity of the Floyd-Warshall algorithm cannot be reduced this way. The original relaxation condition for the generic algorithm (figure 3.2, line 11) is given by:

$$d[n[e]] \ne d[n[e]] \oplus (r' \otimes w[e]) \qquad (3.7)$$

where d is the distance and r' is the tentative update to be propagated, i.e. the weight accumulated since the last relaxation of q = p[e]. Thus, $(r' \otimes w[e])$ is the weight to be added in the relaxation. To handle also semi-rings that are not k-closed, [Mohri(2002)] replaces the relaxation condition by an approximate test with a metric Δ:

$$\Delta(d[n[e]],\ d[n[e]] \oplus (r' \otimes w[e])) \ge \delta \qquad (3.8)$$

where $\delta \ge \bar{0}$ is a positive number used for approximation. For the probability semi-ring, the condition simplifies to $(r' \times w[e]) \ge \delta$. Due to limited machine precision, there is actually always some δ for which this condition will not be met. Thus, as with a k-closed semi-ring, a cycle will not be followed more than a limited number of times, and in our case the algorithm stops updating as soon as:

$$\exists l: \quad \Delta\!\left(\bigoplus_{i=0}^{l} w[\pi]^i,\ \bigoplus_{i=0}^{l-1} w[\pi]^i\right) = w[\pi]^l \le \delta \qquad (3.9)$$

where w[π] is the weight of the cycle. If l is large ($w[\pi] \to \bar{1}$ or $\delta \to \bar{0}$), the algorithm will iterate for a long time until it converges. For $w[\pi] \ge \bar{1}$, the algorithm fails to converge at all. In this case, the total weight of the WFSA becomes infinite. When we apply the weight pushing algorithm to WFSAs that are constructed from back-off language models, we have to distinguish two cases: As explained in section 2.5.1 and figure 2.11, the ε-style back-off arcs lead to duplicate paths. In case the weights were taken from a back-off LM estimated for failure arcs, but the back-off arcs are represented with ε, the outgoing arcs will not sum exactly to one, but to a slightly higher value. When this occurs in a loop, the condition $w[\pi] < \bar{1}$ no longer holds. That means the generic weight pushing algorithm [Mohri(2002)] as implemented in OpenFST will fail to converge, because the total weight of the entire WFSA will not be finite. If the weights are correctly estimated as an interpolated LM, the representation with ε-arcs results in a stochastic WFSA. However, the weight in cycles can still be very close to one, so that the generic algorithm is inefficient (equation 3.9).
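For reference, the following Python sketch transcribes the generic single-source shortest-distance algorithm of [Mohri(2002)], specialised to the probability semi-ring (the pseudocode in figure 3.2 is the general semi-ring version; variable names here are chosen for readability, not taken from any implementation). The approximate relaxation test of equation 3.8 appears as the comparison with delta; note that if a cycle weight is ≥ 1, the updates never fall below delta and the loop does not terminate - exactly the failure mode discussed above:

```python
# Generic single-source shortest-distance (probability semi-ring: + and *).
from collections import defaultdict, deque

def shortest_distance(arcs, source, num_states, delta=1e-8):
    """arcs: list of (src, weight, dst). Returns d[q], the sum of the weights
    of all paths from `source` to q (if the sums converge)."""
    out = defaultdict(list)
    for s, w, d in arcs:
        out[s].append((w, d))

    d = [0.0] * num_states      # accumulated shortest distance
    r = [0.0] * num_states      # weight added since the state was last relaxed
    d[source] = r[source] = 1.0
    queue, in_queue = deque([source]), {source}

    while queue:
        q = queue.popleft()
        in_queue.discard(q)
        r_q, r[q] = r[q], 0.0
        for w, nxt in out[q]:
            update = r_q * w
            if update > delta:                  # approximate relaxation test
                d[nxt] += update
                r[nxt] += update
                if nxt not in in_queue:
                    queue.append(nxt)
                    in_queue.add(nxt)
    return d

# cyclic toy WFSA: the self-loop 1 -> 1 is summed as a geometric series
arcs = [(0, 0.5, 1), (1, 0.2, 1), (1, 0.5, 2)]
print(shortest_distance(arcs, 0, 3))   # d[2] ~ 0.5 * 1/(1-0.2) * 0.5 = 0.3125
```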

3.2 Ergodic Markov chains and non-negative matrices

In the previous section, we explained why we want a model that has a (locally) stochastic weight distribution, and why for WFSAs that are cyclic, the standard weight pushing algorithm [Mohri and Riley(2001)] is either inefficient or completely fails to converge2 if the total weight of the WFSA is not finite. For this purpose, we need a weight pushing algorithm that will always succeed. We show here that this problem can be solved if we represent the WFSA as an ergodic Markov chain. Here, we deal with Markov chains, which are, by definition, already stochastic, and thus don't need weight pushing. However, they will serve us for the purpose of introducing important concepts and the basic idea of our algorithm. Later, we generalize to non-negative matrices. A Markov chain [Grinstead and Snell(1997)] can be defined similarly to the WFSA (equation 2.3), but discarding the labels Σ. To have a more flexible definition, every state can be a potential initial state:

$$M = (Q, F, E, \lambda) \quad \text{over a semi-ring } (\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1}). \qquad (3.10)$$

It is given by:

2The algorithm for weight pushing in the log semi-ring provided with OpenFST www.openfst.org might still work for some smaller models (e.g. LMs with a small vocabulary), if the delta parameter is chosen to be sufficiently small. We observed that typically, as soon as the WFSA contains states with a huge fan-out (≫ 1000), the algorithm fails to converge.

33 • a set of states Q

• a set of final or absorbing states F ⊆ Q

• a set of transitions E ⊆ Q × K × Q

• an initial weight λ

A transition t = (p[t], w[t], n[t]) ∈ E is seen as a move (step) from the source state p[t] to the destination state n[t] with weight w[t]. Markov chains were introduced with probabilities as weights; thus, in this section, we only consider the probability semi-ring ([0, 1], +, ×, 0, 1). Usually, the transition weights are represented in the form of the transition matrix $P \in \mathbb{R}^{|Q| \times |Q|}$, where each entry $p_{ij} = w[t]$ contains the weight of the transition t from state i = p[t] to state j = n[t]. If the corresponding transition doesn't exist, the entry is 0̄. In place of the initial weight λ, one can add a super-initial state i with outgoing weights defined by λ – as done in the WFSA definition (equation 2.3). Instead of having a final weight ρ as in equation 2.3, a Markov chain can be seen as having a super-final state f′ and transitions from f ∈ F to f′ with $p_{f,f'} = \rho[f]$. The literature on Markov chains doesn't use the term final state, but instead uses the term absorbing state, which is a state that cannot be left, i.e. it has a self-loop probability of one. All other states are called transient. A Markov chain is absorbing if it has at least one absorbing state and from every state it is possible to reach an absorbing state. If we represent a WFSA as a Markov chain by using the sum of all arcs from state i to state j as entry $p_{ij}$ in the transition matrix P, we see that every trim (connected) WFSA corresponds by definition to an absorbing Markov chain. Representing the transition weights in a matrix has the advantage that we can elegantly compute the outcome of a process after several steps: the ij-th entry $p^{(n)}_{ij}$ of the matrix $P^n$ (n-th power of the matrix) gives the probability that the Markov chain, starting in state $q_i$, will be in state $q_j$ after n steps [Grinstead and Snell(1997)]. The function λ with the probability of starting in a particular state can be represented as a state probability vector λ. Similarly, if v is the row vector with elements $v_i$ representing the probability of being in state $q_i$ at a certain time n, then:

$$v^{(n)} = \lambda\, P^n \qquad (3.11)$$

An important class of Markov chains are ergodic Markov chains [Grinstead and Snell(1997)], also called irreducible. A Markov chain is ergodic if it is possible to go from any state to any state (not necessarily in one move). In this case, the corresponding WFSA consists of only one strongly connected component, and there are no absorbing final states. An important sub-class of ergodic Markov chains are regular chains (also called primitive). A Markov chain is called a regular chain if some positive power of the transition matrix $P^n$ has only positive elements3. In other words, for some n, it is possible to go from any state to all other states (including over a self-loop) in exactly n steps. Every regular chain is ergodic, but not all ergodic chains are regular - see the example in figure 3.3. An absorbing Markov chain is not ergodic. This holds for WFSAs, which are absorbing Markov chains with only the final states being absorbing states. However, if the WFSA is trim (every state can be reached on a successful path), we can make the WFSA ergodic by connecting the final states f ∈ F to the initial state i. Now, every state can be reached from any other state by going over any of the final states.

3When talking about probabilities, this means not zero.

[Figure 3.3 graph: three states 0, 1, 2 arranged in a cycle, with self-loops of probability a, b, c and transitions 0 → 1 with probability 1−a, 1 → 2 with probability 1−b, and 2 → 0 with probability 1−c.]

$$P = \begin{pmatrix} a & 1-a & 0 \\ 0 & b & 1-b \\ 1-c & 0 & c \end{pmatrix}$$

Figure 3.3: Example of an ergodic Markov chain and its corresponding transition matrix. If a > 0 ∧ b > 0 ∧ c > 0, then it can easily be shown that already for $P^2$ (n = 2) all $p^{(2)}_{ij} > 0$, so that the Markov chain is also regular. That means any state can be reached from any state within a maximum of two steps. For that reason, with n → ∞, the state distribution approaches an equilibrium, according to the proportions of a, b and c. If a = b = c = 0 (removing the self-loops), the chain is no longer regular. It is still obvious that every state can be reached from any other state, but the matrix P becomes a permutation matrix, which means that the state distribution oscillates between three different configurations, but never converges. From this example, it is easy to see that adding self-loops, i.e. increasing the values on the diagonal, makes an ergodic chain a regular one.
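The following numpy snippet (arbitrary toy values for a, b and c) checks the statements of figure 3.3 numerically: for positive a, b, c the square of P is strictly positive and repeated multiplication converges to the stationary distribution, whereas for a = b = c = 0 the chain merely cycles through its three states:

```python
# Numeric check of the three-state chain from figure 3.3 (toy parameters).
import numpy as np

def chain(a, b, c):
    return np.array([[a, 1 - a, 0.0],
                     [0.0, b, 1 - b],
                     [1 - c, 0.0, c]])

P = chain(0.2, 0.5, 0.3)
print((np.linalg.matrix_power(P, 2) > 0).all())   # True: the chain is regular

v = np.array([1.0, 0.0, 0.0])        # start in state 0
for _ in range(200):
    v = v @ P                        # left multiplication of the row-stochastic chain
print(v)                             # stationary distribution w with w P = w
print(v @ P - v)                     # ~0

P0 = chain(0.0, 0.0, 0.0)            # ergodic but not regular: pure 3-cycle
print(np.array([1.0, 0.0, 0.0]) @ np.linalg.matrix_power(P0, 3))   # back to the start
```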

As a next step, if P is the transition matrix of an ergodic Markov chain, then we can obtain the transition matrix of a regular chain by:

$$P' = k\, I + (1 - k)\, P, \qquad 0 < k < 1,\ k \in \mathbb{R} \qquad (3.12)$$

Since the ergodicity guarantees that every state can be reached, interpolating with the identity matrix I guarantees that the diagonal elements $p'_{ii} > 0$ are positive, which means that it is possible to take self-loops to stay in every state. Thus, after n steps, when all states of the ergodic chain have been reached, $P'^n$ will have all elements positive4. It is easy to see that $P'$ and $P$ have the same eigenvectors v5 ($P'v = k'v$):

$$0 = (P' - k'I)\,v = \left(kI + (1-k)P - k'I\right)v = (1-k)\left(P - \frac{k'-k}{1-k}\,I\right)v \qquad (3.13)$$

The fundamental limit theorem for regular chains [Grinstead and Snell(1997)] says that if P is the transition matrix of a regular Markov chain, then with n → ∞, the powers $P^n$ approach a limiting matrix W with all rows containing the same vector w, where all components of w are positive and sum to one:

$$W = \lim_{n \to \infty} P^n. \qquad (3.14)$$

This states that the probability of being in state qj (the jth entry of v) in the long run is independent of the starting state qi (vj → wj). From this, it follows that w P = w, and any row vector v with v P = v is a constant multiple of w. The unique normalized vector w is called fixed row vector and represents the stationary distribution of the process. In other words, there is just one stationary distribution, i.e. only one left eigenvector corresponding to the eigenvalue one6, that solves the equation v P = v. From equation 3.14, it follows that for any initial probability vector λ, the process approaches the fixed row vector w for n → ∞:

4Once a state is entered with some probability, the non-zero self-loop guarantees that it is possible to stay there in all successive time steps.
5If k ≪ 1 or k′ → 1, then also the eigenvalues will be very similar.
6There can be other eigenvectors, whose eigenvalues (in absolute value) are smaller than one, and which will vanish with $\lim_{n \to \infty} P^n$.

$$\lim_{n \to \infty} \lambda\, P^n = \lambda\, W = w. \qquad (3.15)$$

Given equation 3.12, we can convert every ergodic chain into a regular chain with the same eigenvector. Thus it is also clear that there is only one strictly positive fixed vector for ergodic Markov chains. However, this fixed vector has a slightly different interpretation [Grinstead and Snell(1997)]:

$$A_n = \frac{I + P + P^2 + \ldots + P^n}{n+1}, \qquad \lim_{n \to \infty} A_n = W, \qquad (3.16)$$

where W is a matrix all of whose rows are equal to the unique fixed probability vector w for P. Therefore, the ij-th entry of the matrix $A_n$ gives the expected value of the proportion of times that the process is in state $q_j$ in the first n steps, when starting from state $q_i$. As already seen in figure 3.3, for ergodic Markov chains that are not regular, the state distribution doesn't converge. However, the state distribution averaged over time does converge. The law of large numbers for ergodic Markov chains [Grinstead and Snell(1997)] states that the proportion of times that an ergodic chain is in state $q_j$ in n steps - $H_j(n)$ - is independent of the starting state $q_i$:

$$P(|H_j(n) - w_j| > \epsilon) \to 0, \qquad \forall \epsilon > 0. \qquad (3.17)$$

So far, we were dealing with regular Markov chains, i.e. we assumed that P is a row-stochastic matrix, where every row sums to one. Now, we want to generalize equation 3.14 to the case where P is not normalized. A generalization of the results on Markov chains is given within the theory of non-negative matrices [Berman and Shaked-Monderer(2012)]. A non-negative matrix is a matrix with all entries $p_{ij} \ge 0$. In the same way as the transition matrix of Markov chains, every non-negative matrix can be associated with a directed graph, with the only difference that the transition weights are no longer limited to the interval [0, 1] and that the matrix is not required to be row-stochastic. The Perron theorem [Berman and Shaked-Monderer(2012)] states that for every non-negative primitive (i.e. regular) matrix P, the maximum eigenvalue ρ(P) (also called the spectral radius) is positive, simple7, singular (only one eigenvalue of this modulus) and has a positive eigenvector (called left and right Perron vector, whose normalized entries sum to one). For n → ∞, the matrix converges:

$$\lim_{n \to \infty} \left(\frac{P}{\rho(P)}\right)^n = L, \qquad L = x^T y, \qquad x\, y^T = 1, \qquad (3.18)$$

where x and y are positive right and left eigenvectors: $P x^T = \rho(P)\, x^T$, $x > 0$, and $y\, P = \rho(P)\, y$, $y > 0$. From this, it follows that for any initial probability vector λ, the ratio of state weights in λ will converge8 to a vector proportional to the left eigenvector y:

$$\lim_{n \to \infty} \lambda \left(\frac{P}{\rho(P)}\right)^n = \lambda\, L = (\lambda\, x^T)\, y = c\, y. \qquad (3.19)$$

7The algebraic multiplicity is one. Algebraic multiplicity is the number of times an eigenvalue appears in the characteristic polynomial of a matrix. 8We don’t know ρ(P), but independent of the normalizing constant, the ratio of the components of vector λ will converge to the ratio in the left eigenvector. To achieve numerical stability, we need to normalize, and we could normalize λ to unit length or to the first component being one.

Similarly, for any initial probability vector λ, if we multiply from the right, the ratio of state weights in λ will converge to a vector proportional to the right eigenvector $x^T$:

$$\lim_{n \to \infty} \left(\frac{P}{\rho(P)}\right)^n \lambda^T = L\, \lambda^T = x^T (y\, \lambda^T) = d\, x^T. \qquad (3.20)$$

If the matrix P is not regular, but non-negative and irreducible (ergodic), the Perron-Frobenius theorem [Berman and Shaked-Monderer(2012)] states that the maximum (absolute value) eigenvalue ρ(P) is still positive, simple (algebraic multiplicity one) and has a positive eigenvector (called the Perron vector). There are no non-negative eigenvectors for P except for multiples of the Perron vector; all of them have the eigenvalue ρ(P). However, there can be several complex eigenvalues with this maximum modulus. The eigenvalues with modulus ρ(P) are $\rho(P)\, e^{2\pi \mathrm{i} l / k}$ with l = 0, 1, ..., k − 1, and k is called the index of cyclicity. With the help of a permutation matrix R, every non-negative matrix P can be converted into the Frobenius form:

$$R^T P\, R = \begin{pmatrix} 0 & P_{12} & 0 & \cdots & 0 \\ 0 & 0 & P_{23} & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & P_{k-1,k} \\ P_{k1} & 0 & 0 & \cdots & 0 \end{pmatrix}, \qquad (3.21)$$

where the 0-matrices on the diagonal are square. This means that every non-negative matrix with index of cyclicity k > 1 (i.e. ergodic but not regular) corresponds to a directed graph whose states can be clustered into k stages, where the states of stage l + 1 can only be reached from the states of stage l. This can be seen in the example in figure 3.3. When a = b = c = 0, there are three alternating stages with one state each.

3.3 Alternative weight pushing algorithm

In section 3.1, we motivated the need to make the WFSA stochastic through the use of a weight pushing algorithm. We showed that for the (log-) probability semi-ring, the generic (exact) algorithm for k-closed semi-rings is not applicable, and the generic algorithm for closed semi-rings (all-pairs-shortest-path) is not feasible for WFSAs with big strongly connected components (which is the case for WFSAs based on N-gram LMs). We showed that the convergence of the approximate iterative algorithm depends on the weight in a loop (equation 3.9). If the weight of a loop $w[\pi] \ge 1$ (or the sum of several loops meeting in the same state), the algorithm fails to converge. As already explained, this can be the case for ε-style back-off LMs, when the weights were taken from a back-off LM (section 2.5.1). Thus, we need a weight pushing algorithm that will also succeed for those kinds of WFSAs. We represent the WFSA in the probability semi-ring9 by using the transition matrix P, where $p_{ij}$ is the sum of the weights of all transitions between state i and state j. The transition matrix is usually sparse (contains 0̄ for all non-existing transitions). Our solution is based on the theory of non-negative matrices and ergodic Markov chains as introduced in section 3.2. The fundamental limit theorem for regular chains in equation 3.15 suggests an

9Even for the implementation, we found it more convenient to use actual probabilities instead of negative logs, as used in the log-semi-ring.

iterative algorithm to find the stationary distribution. This is similar to the power method for finding the dominant eigenvector w of the matrix P, by starting from a random or uniform positive vector v and iterating by letting v ← P v each time. If the WFSA is not normalized, the generalization is given by the Perron theorem in equation 3.18. Since we do not know the normalizing spectral radius ρ(P) in advance, we re-normalize the resulting vector v at each step so that $v_I$ is 1, where I is the initial state of the WFSA10. By equation 3.19, we know that if we iterate $u^T \leftarrow u^T P$, we converge to a multiple of the dominant left eigenvector y. This corresponds to a multiple of the stationary distribution w of the normalized chain and also corresponds to the minimum distance from the initial state in the probability semi-ring. The law of large numbers for Markov chains (equation 3.17) tells us that the elements of this vector correspond to the average proportion of times that the chain is in each of the states in the long run. If we instead iterate v ← P v (equation 3.20), it results in the dominant right eigenvector of P, which, in the probability semi-ring, is the minimum distance towards the final states (or the super-final state). The Perron theorem is only true for regular chains, but as explained in section 3.2, we can make every trim WFSA ergodic by connecting the final states f ∈ F to the initial state I11. That means we modify one column in the transition matrix: if j is the initial state, then $p_{ij}$ is set to the final probability ρ[i] of state i. As a second step, we need to guarantee that the resulting ergodic WFSA is also regular12. We can make the WFSA regular by interpolating P with the identity matrix (equation 3.12). Alternatively, we can modify the iteration to v ← P v + k v13. The parameter k is set to a small value (0.1) so as not to slow down the convergence too much. This algorithm is very efficient in practice; it generally converges within several tens of iterations. At the end, we have a vector v with $v_I = 1$, and a scalar c > 0, such that

c v = P v. (3.22)

The vector v contains the distribution of average state occupancies and is used as the potential function $V(q): Q \rightarrow \mathbb{K} \setminus \{\bar{0}\}$ for the re-weighting operation (equation 3.2). This means we compute a modified transition matrix $P^*$ by letting

$$p^*_{ij} = p_{ij}\, v_j / v_i, \qquad (3.23)$$

and transforming the final probabilities by $\rho^*_i = \rho_i\, v_I / v_i$, where $v_I$ is the potential of the initial state. Using the re-weighting with the potential function V guarantees that the resulting WFSA is equivalent to the original one. If we apply the left Perron eigenvector as the potential function in the re-weighting operation (equation 3.2), it results in pushing the weight towards the final state, or more precisely in making the WFSA input-stochastic. That means either all incoming arcs sum to one, if the total weight is one, or more generally they sum to the same quantity for all states. If we use the right Perron eigenvector as the potential function in the re-weighting, it results in

10Any normalization will lead to the same eigenvector. Actually, in the implementation, we normalize to unit length, as in the matrix power algorithm. 11This acts like an arc from the super-final state to the initial state with probability one. It will not change an already stochastic WFSA, since all components of the resulting eigenvector will be equal. 12That is, we want an index of cyclicity of one. This is, for example, necessary for linear WFSAs (figure 3.3), which have several eigenvalues with the same magnitude but different complex phases. 13More exact would be (1 − k) P v + k v, but this does not affect the result.

pushing the weights towards the initial state and in making the WFSA output stochastic - i.e. all outgoing arcs sum to one (or more generally to the same quantity). We show this by writing one element of equation 3.22 as

c vi = Σj pij vj .    (3.24)

By dividing by vi, it easily follows that c = Σj p*ij. This means each row of the modified matrix P* sums to c (the modulus of the Perron eigenvalue). In the classical weight pushing algorithm [Mohri and Riley(2001)], we assume a stochastic WFSA (equation 2.6), so that after weight pushing, all outgoing transitions of a state "sum to" 1̄ in the given semi-ring. Our solution is to use a modified pushing operation, which results in a WFSA for which the transitions out of all states (and the final probability ρ) "sum to" the same quantity c:

∀q ∈ Q:  ( ⊕e∈E[q] w[e] ) ⊕ ρ[q] = c.    (3.25)

This means that the left-over weight, which is usually added to the initial or final states and which can cause the standard algorithm to fail, is now uniformly "smeared" all over the WFSA. Algorithm 1 gives the pseudo-code of the alternative weight pushing algorithm14. The main program consists of the flat initialization of the vector v; then we iterate until convergence (Iterate) and finally apply the reweighting operation (ModifyFst). The heart of the algorithm is the function Propagate, which is used both in the test for stochasticity (TestAccuracy) and in the main iterative algorithm (Iterate). The only difference is whether the outgoing arcs are reweighted with the potential function (prob·v[d]/v[s]) or the propagated probability mass is summed from the destination states of the outgoing arcs (prob·v[d]). Notice also the symmetry of pushing towards the final or towards the initial state – we only have to switch the roles of source and destination state15. The stopping criterion in TestAccuracy is the ratio between the maximum and minimum arc sum – which should converge to one. Our algorithm is in practice an order of magnitude faster than the more generic algorithm for conventional weight pushing [Mohri and Riley(2001)] when applied to cyclic WFSAs. The speed of the algorithm is determined by the convergence of the matrix power method, which iterates by repeatedly multiplying the state distribution vector with the (sparse) transition matrix, i.e. going through all states in a pre-defined order every time. The convergence of the matrix power method depends on the ratio between the biggest eigenvalue ρ(P) and the second biggest eigenvalue. Mohri's algorithm (figure 3.2) is similar to a backward (Viterbi-like) algorithm in the given semi-ring, using the new relaxation condition (equation 3.8) to propagate the probability mass and to update the queue. A state is put onto the queue if the accumulated probability mass has changed by more than the delta parameter since the last visit. The queue is processed according to the queue policy until it is empty. For the queue method, the proofs of convergence of the matrix power algorithm can no longer be applied.

14The algorithm assumes that the weights were stored in the log-semi-ring. 15We check if the log-ratio is below the threshold. We do not test at each iteration, to save the time of re-weighting the arcs.

Algorithm 1 Pseudo-code of the alternative weight pushing algorithm.

void Main() {
  vector v = 1 / sqrt(num_states)                    // flat init
  for (all states s):
    for (all arcs arc leaving s):
      prob = exp(-arc.weight)                        // convert to probability semi-ring
      predecessors[t].add(tuple(s, prob))
    final = exp(-finalweight(s))
    predecessors[initial_state].add(tuple(s, final)) // force ergodicity
  Iterate(backwards, delta)                          // until stationary distribution found
  ModifyFst(backwards)                               // weight pushing operation
}

double TestAccuracy(bool backwards) {                // test stochasticity
  vector state_sums = Propagate(backwards, do_reweighting)
  return log(max(state_sums) / min(state_sums))
}

void Iterate(bool backwards, delta) {
  for (maximal 2000 iterations):
    vector new_v = Propagate(backwards, not reweighting)
    new_v += 0.1 * v                                 // regular, using power method (M + 0.1*I)
    v = new_v / sqrt(new_v dot new_v)                // renormalize with L2-norm
    if (test iteration and TestAccuracy() <= delta):
      return                                         // has converged
  output warning: Did not converge!
}

void ModifyFst(bool backwards) {                     // weight pushing operation
  v = -log(v)                                        // convert to log-probability
  for (all states s):
    for (all arcs arc leaving s):
      if (backwards)
        arc.weight = arc.weight + (v[t] - v[s])      // outgoing normalization
      else // forwards
        arc.weight = arc.weight + (v[s] - v[t])      // incoming normalization
    if (backwards)
      finalweight(s) = finalweight(s) + (v[initial_state] - v[s])
    else
      finalweight(s) = finalweight(s) + (v[s] - v[initial_state])
}

vector Propagate(bool backwards, bool reweighting) {
  for (all states d):
    for (all (s, prob) in predecessors[d]):
      if (backwards) {
        if (reweighting) state_sums[s] += prob * v[d] / v[s]
        else             state_sums[s] += prob * v[d]   // v_j = sum_i v_i * p_ji
        // summing in the source state -> distance to the final state
        // pushing the weights towards the initial state
      } else { // forwards
        if (reweighting) state_sums[d] += prob * v[s] / v[d]
        else             state_sums[d] += prob * v[s]   // v_i = sum_j v_j * p_ji
        // summing in the destination state -> distance to the initial state
        // pushing the weights towards the final state
      }
  return state_sums
}
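For readers who prefer the matrix formulation of section 3.2 over the arc-based pseudo-code above, the following is a minimal NumPy sketch of the same idea (an illustration under simplifying assumptions, not the thesis implementation). It assumes the WFSA has already been turned into a sparse transition matrix P with the final probabilities folded into the column of the initial state (the ergodicity trick described above), runs the regularized power iteration v ← P v + k v, and re-weights the matrix so that all rows sum to the same constant c. The function name, the tolerance and the toy automaton are illustrative.

import numpy as np
import scipy.sparse as sp

def alternative_weight_pushing(P, k=0.1, delta=1e-6, max_iter=2000):
    # P[i, j]: total probability mass of arcs from state i to state j,
    # with final probabilities already added to the initial state's column.
    n = P.shape[0]
    v = np.full(n, 1.0 / np.sqrt(n))            # flat initialization
    P_star = P
    for _ in range(max_iter):
        new_v = P @ v + k * v                   # regularized power step (P + kI)
        v = new_v / np.linalg.norm(new_v)       # re-normalize (L2 norm)
        # re-weighting p*_ij = p_ij * v_j / v_i, then test stochasticity
        P_star = sp.diags(1.0 / v) @ P @ sp.diags(v)
        row_sums = np.asarray(P_star.sum(axis=1)).ravel()
        if np.log(row_sums.max() / row_sums.min()) <= delta:
            break
    return v, P_star

# toy cyclic WFSA: the final probability 0.5 of state 2 is folded back to state 0
P = sp.csr_matrix(np.array([[0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.2],    # an arc weight larger than one
                            [0.5, 0.5, 0.0]]))
v, P_star = alternative_weight_pushing(P)
print(np.asarray(P_star.sum(axis=1)).ravel())   # all rows sum to the same c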

Apart from that, both algorithms use different initial distributions. Mohri's algorithm starts from the initial state, while the new algorithm uses a uniform or a random vector as initial state distribution. Also, a different test for convergence is used: Mohri's algorithm uses the delta relaxation, and the new algorithm checks the minimum and maximum, over all states, of the sum of the outgoing weights of a state. However, both differences can be considered minor. One issue with the alternative weight pushing algorithm is that it was derived under the assumption that all arcs in the WFST are of the same type. The transition matrix Pij treats all arcs in the same way. The LM transducer, which accepts/emits sequences of words, actually consists of arcs with a word label and of back-off arcs that don't accept any symbol. Therefore, there are multiple paths through the model with different numbers of arcs that accept the same word sequence. We can see this in analogy to composite HMMs, which have emitting and non-emitting arcs that are computed in two separate steps in the forward/Viterbi algorithm. Another related open problem is to derive a weight pushing algorithm, i.e. a shortest distance algorithm in the log-semi-ring, which respects the special semantics of failure arcs. Under this interpretation, the back-off LM would be correctly normalized and the total weight of the transducer would be one. Due to the incorrect interpretation of the back-off LM (section 2.5.1), we are pushing weights that are greater than one. In case a state has a single outgoing arc, this results in a negative arc weight (when represented as negated log-probability), which can cause problems for several graph algorithms and their common implementations.

3.4 Experimental validation

We measure the effect of the new weight pushing algorithm by constructing a full recognition network (HCLG, section 2.4.1). In addition to the forward network, we construct a backward network (section 5.2) as the composition of the reversed components. The language model is reversed according to chapter 4, but the resulting transducer is not yet stochastic. We want to emphasize that this is not just a reversed LM WFST. As will be explained in chapter 4, plain WFST reversal is not sufficient for the LM WFST; we have to use the algorithm explained in that chapter. The outcome of the LM reversal is not yet normalized, thus we want to apply weight pushing. The generic weight pushing algorithm (figure 3.2) is not applicable in this case; therefore, we apply the new weight pushing algorithm described in this chapter. We measure the decoding performance of the backward network with and without applying the new weight pushing algorithm to the reversed language model before the composition of HCLG. The experiment was done using Kaldi's Switchboard recipe (egs/swbd/s5c/tri3/). We report the performance on the Eval2000 data set16, using a speaker-independent tri-phone GMM model on LDA-transformed MFCC features and a language model trained on Switchboard. We test on two recognition networks of different sizes (using a bi-gram and a tri-gram LM). The real-time factor was measured on a single core of an Intel(R) Core(TM) i5-2500 CPU at 3.30GHz with 8 GB of memory. As seen from figure 3.4, the application of the weight pushing is crucial for the performance - the un-pushed backward language model performs much worse. We also compare to the performance of the forward graph. Since all components, including the language

16The decoding parameters are set to: rescore-acoustic-score 13.0, word insertion penalty 0.0, acoustic scale 1/12, max-active 7000, lattice-beam 6.0.

[Figure 3.4: two plots of real-time factor versus word error rate (bi-gram LM on top, tri-gram LM at the bottom), comparing HCLG forward, HCLG backward push G and HCLG backward.]

Figure 3.4: Decoding performance of the backward decoding network reported on the Eval2000 data set with a GMM model. We tested it using a bi-gram LM (top) and a tri-gram LM (bottom). Shown is the relation between word error rate and real-time factor. Better performance is indicated by curves closer to the lower left corner. We compare the performance of the backward decoding network (HCLG backward) with and without the application of the new weight pushing algorithm. For comparison, we also show the performance of the forward decoding network (HCLG forward). In this case, no weight pushing is necessary.

model, are stochastic (except for the issues with back-off arcs mentioned above) and thus should have optimal performance, it is not necessary to apply weight pushing. As explained in chapter 5, the performance of the forward and backward graphs is not necessarily the same, depending e.g. on properties of the language and domain (here conversational English). However, the comparison indicates that the weight pushing results in a network that is not far from the optimal performance.

3.5 Conclusions

To achieve an optimal pruning behavior, it is desirable that the given WFST is stochastic. We showed that, unless the WFST is already stochastic or has a similarly optimal weight distribution, it is necessary to perform weight pushing. Given the theoretical background for weight pushing, we showed why the standard algorithm is either inefficient for the type of models used or fails to converge at all. By explaining the connection between non-negative matrices and ergodic Markov chains, we motivated an alternative weight pushing algorithm that always converges and is much more efficient for the given task of optimizing language model transducers.

Chapter 4

Exact reversal of ARPA back-off language models

We present details about how to exactly reverse ARPA back-off language models (LMs). For the purpose of searching for the best path through composed probabilistic models forwards and backwards in time, and of combining these searches, it is desirable to have a backwards language model that assigns exactly the same scores as the forward language model, while at the same time having the same properties of being deterministic, stochastic and of minimal size, to guarantee an optimal search. We show an approach to construct such a backwards LM, which is valid when using failure arcs and also when using epsilon arcs to represent the back-off structure as a weighted finite state acceptor. This means that the weight of a path in the backward LM WFSA is equal to the weight of the corresponding path in the forward LM WFSA, independent of the origin of the weights, i.e. whether they were estimated as an interpolated, converted interpolated or back-off LM. We test the reversal algorithm on language models of different sizes and on different corpora, and we compare it to training a language model on the time-reversed training texts.

4.1 Motivation: forwards and backwards search

The application that we had in mind while investigating this kind of models was the search for the best path through a composed probabilistic model. This can be, for example, the decoding of the most probable sequence of words in large vocabulary automatic speech recognition (LVCSR). However, the reversed language models presented here can also be used in many other tasks, e.g. in finding the most probable sentence in statistical machine translation. Given the complexity of most of these tasks, it is necessary to use heuristic pruning techniques, which introduce search errors. As will be explained in more detail in section 5.1, the search errors of searching forwards and backwards are mutually independent. Therefore, backward search has the potential to find the best path, even if it was pruned by the forward search. Both models, forward and backward, should be equally powerful, i.e. have roughly the same accuracy and run-time requirements, and have similar structure, size and level of determinism to guarantee an optimal search. If both models, forward and backward, assign exactly the same probabilities to a hypothesis, this has the advantage that the results of forward and backward decoding can be compared or combined (section 5.1). To be able to compare and combine the scores of partial results (paths), also the model structure (the

distribution of weights along paths) should be similar in the forward and backward model. Our assumption is that we are given a composed forward graph HCLGfwd (section 2.4.1), where one of the components is an LM acceptor G (section 2.5). The task is to obtain a backward graph HCLGbwd that fulfills all of the above-mentioned requirements. As will be explained in more detail in section 5.2, the solution is to reverse each component separately and then construct the backward graph HCLGbwd in analogy to the forward graph. The trivial solution of applying WFST reversal is not sufficient, since the resulting graph would not have a similar level of determinism nor a similar distribution of weights as the forward graph, i.e. it would show sub-optimal behavior when used in a pruned search. As we will see in this chapter, especially the (reversed) LM component would introduce a great degree of local ambiguity - the word context is delayed in the reversed model (section 4.2), which might result in falsely pruned paths. Reversing the LM acceptor G turns out to be a complex task – this is the topic dealt with in this chapter. The task is to construct an LM acceptor Gbwd that assigns exactly the same scores as Gfwd (to the reversed utterances). Again, a trivial solution is to apply WFSA reversal to Gfwd (followed by epsilon removal, determinization, and weight pushing in the log semi-ring - as explained in section 3.1, this is needed to achieve optimal pruning behavior and makes the WFSA stochastic). However, for LMs of higher order than bi-gram, this approach fails. As explained in section 4.2, this is because of the delayed word context and different backing-off states. Additionally, as already explained in section 3.1, the conventional weight pushing cannot be applied to WFSAs resulting from LMs; we should use the alternative weight pushing instead (section 3.3). Thus, to achieve optimal search behavior, we need to construct an LM acceptor Gbwd with a similar structure as the forward acceptor Gfwd, that assigns exactly the same scores to the reversed utterances, and that also makes it possible to compare partial word sequences of forward and backward decoding. Another trivial solution would be to train a new model on the reversed training texts (e.g. [Tang and Cristo(2008)]) - given that those are still available. This does, however, not result in exactly the same scores for the same utterances, since there is usually no such constraint applied in the LM estimation1. Since we wanted exactly the same scores, we did not follow this approach further. Also, such an approach would be inconvenient to use in cases where the original LM training text is no longer available. To determine the impact of those score differences, we compare an exactly reversed tri-gram model to a tri-gram model trained on the reversed training texts. We test the LM reversal algorithm on language models of different sizes.

4.2 Construction of an exactly reversed language model

In section 2.5, we explained how N-gram back-off language models are represented as weighted finite state automata and stored in the ARPA format. In this section, we show how to construct a backward LM that has the same properties as the forward LM – i.e. it is deterministic (except for the ε-arcs), has the same size and a similar structure. The algorithm presented here is valid for failure arcs and for epsilon arcs.

1Based on the idea that Kneser-Ney models apply a constraint on the marginal distributions to estimate the probabilities in the LM, we assume that it is possible to formulate another type of constraint: the LM probabilities should be estimated in such a way that, when estimated on reversed training texts and applying the exact reversal algorithm presented here, they give the same probabilities as when estimating the normal forward LM.

The WFSA corresponding to the forward LM accepts a sequence of words and accumulates the weights along the path - see figure 4.1. If the probability semi-ring is used, the path weight is the product of the individual probabilities. If logarithmic probabilities are used, the path weight is the sum of the individual scores. Two WFSAs are equivalent if they accept the same set of sequences with the same path weights. Thus, it is possible to distribute the weights differently along the path, as long as the total product (or sum, for logarithmic weights) stays the same for all paths. If we directly applied FSA reversal, which basically corresponds to swapping the source and destination states of the arcs, the resulting structure would be highly non-deterministic. In the example (figure 4.1), we would start backwards from the final state. All incoming arcs into the final state (only one example is shown) have the </s> label. Thus, we would have to apply the tri-gram probability P(</s>|c, d) after having seen only one symbol of the tri-gram (</s>). However, only after two more symbols d, c have been seen can the destination state be determined unambiguously. For that reason, it would be logical to delay the application of the weight (probability) until a sufficient number of symbols have been consumed to unambiguously determine the destination state. For a tri-gram LM, this means delaying the weights by two steps. Figure 4.2 shows the corresponding path in the backward LM.

[Figure 4.1: a forward path with arc labels <s>:1, a:P(a|<s>), b:P(b|a), c:P(c|a,b), d:P(d|b,c) and </s>:P(</s>|c,d).]

Figure 4.1: Example of a forward path through a tri-gram language model - every state corresponds to a history of the two last symbols consumed. The model accepts the sequence a, b, c, d (input symbols) and the path weight is the product of the individual probabilities. For simplicity, sentence-start and sentence-end are treated here as ordinary symbols. Only one path is shown, but the reader has to keep in mind that there are, for example, multiple arcs entering the final state, all with the same label.

[Figure 4.2: the corresponding backward path, with arc labels </s>:1, d:1, c:P(</s>|c,d), b:P(d|b,c), a:P(c|a,b) and <s>:P(b|a)·P(a|<s>).]

Figure 4.2: The backward path corresponding to the path in figure 4.1. In addition to reversing the path, the weights/probabilities have been delayed by two steps. Therefore, two arcs with probability one have been inserted at the beginning. To compensate, we could add two ε-arcs at the end (depicted as dashed), where the last arc corresponds to backing off to a history-less state at the sentence beginning (now end). Instead of introducing back-off arcs at the end of the sentence, we can collapse the probabilities of all the lower-order back-offs onto one arc, i.e. we use P(b|a) · P(a|<s>) · P(<s>). This is well in line with the common WFST LM implementation, which assumes that the state reached by an N-gram containing the sentence-end symbol is a final state.

When certain N-grams do not have sufficient coverage in the training corpus and are approximated by backing-off to lower order N-grams (see figure 4.3), the sequence of the weights in the backward LM is again exactly reversed as in the forward LM, and the same

[Figure 4.3: the forward path of figure 4.1 with backing-off - the tri-gram arc for bcd is replaced by a back-off arc ϕ:α(b,c) to state c followed by the bi-gram arc d:P(d|c); further arcs (e.g. towards states ch, cg, ci and the history-less state 0) illustrate the back-off structure.]

Figure 4.3: The same example of a forward path as in figure 4.1, but with backing-off. The thick arcs correspond to the path in fig. 4.1 and further arcs have been added to illustrate the structure of the back-off LM. Since the N-gram bcd was not seen sufficiently often, it is approximated by backing off to state c with back-off weight α(b, c) and then using the bi-gram cd with probability P(d|c). The failure arc (symbolized by ϕ) doesn't consume any symbol, but this arc is chosen for all symbols that have no outgoing arc out of the same state ('default' clause). For the non-deterministic WFSA approximation, we would use the symbol ε instead and not consume a symbol either. The state 0 corresponds to the history-less back-off state when backing off to uni-grams.

[Figure 4.4: the corresponding backward structure - the thick arcs are the reversed path of figure 4.2 and the solid arcs are the reversed arcs of figure 4.3; the bi-gram probability P(d|c) now appears on a back-off arc ϕ, while the back-off weight α(b,c) appears on the word arc b.]

Figure 4.4: The backward structure corresponding to figure 4.3. The thick arcs correspond to the path in fig. 4.2, and all solid arcs are reversed arcs from fig. 4.3. Dashed arcs have been added to illustrate further structure of the backward model. Similarly to figure 4.2, the weights have been delayed by two steps. Compared to the forward structure in figure 4.3, the sequence of weights is exactly reversed. The probability on an arc between two particular states is the same in the forward and backward model, i.e. compare the forward arcs bc − c − cd in fig. 4.3 to the backward arcs dc − c − cb in this figure (cb corresponds to bc). However, since all labels are off by two states in the backward model, the back-off probability α(b, c) is now actually applied on a bi-gram arc with a label (b), and the bi-gram probability P(d|c) is applied on a back-off arc with ϕ. Since all backward tri-grams ending in cb (like dcb, hcb) share b as their last label, it is logical to first back off from the history (dc, hc) to the common history c and then apply the common label b. Since the reverse order of the weights has been preserved, the bi-gram probabilities now serve as back-off weights, and the former back-off weights serve as bi-gram probabilities. The same holds for the history-less state 0 - the uni-gram back-off weight α(l) and the uni-gram probability P(c) have switched their roles.

delay of the weights is applied to make the model deterministic2. However, the sequence of the labels for back-off arcs is changed - back-off weights and lower-order N-grams change their role. The reason for this is that we always have to back off to a common history before consuming the next label - so the failure arc (symbolized by ϕ) in the backward model takes the lower-order N-gram probability (from the forward model), and the label arc takes the former back-off weight. Figure 4.4 shows this in the construction of a backward back-off LM from figure 4.3. Figure 4.4 shows that it is possible to construct a backward LM that has the same size and structure as the forward LM and is deterministic. From the construction, we observe that a forward LM can be transformed into a backward LM by a series of relatively simple steps: Since the sequence of labels is processed in reversed order, the names of all states and N-grams are reversed (abc becomes cba). The N-grams of the highest order do not have back-off weights, and thus they stay unchanged (arcs appear similar in the forward and backward models). However, for all lower-order N-grams, the roles of the back-off weight and the N-gram probability change. When represented in the ARPA format (figure 2.12), the transformation becomes even simpler: For all lower-order N-grams, the whole line is reversed, and for the highest-order N-gram, only the N-gram is reversed. E.g. for a tri-gram LM, a bi-gram entry "P(b|a) a b α(a, b)" becomes "α(a, b) b a P(b|a)" and a tri-gram entry "P(c|a, b) a b c" becomes "P(c|a, b) c b a". The symbols for sentence begin and sentence end have to be exchanged, and special care has to be taken for N-grams starting and ending a sentence. For all N-grams ending a sentence, we multiply in all lower-order probabilities (e.g. for the N-gram ba<s> we use P(b|a, <s>) · P(a|<s>) · P(<s>)). By introducing the short-hand notation P(ABCD) = P(D|A, B, C), we can write the rules for a four-gram LM in the form of equations:

pb(A) = α(A)                    (4.1)
αb(A) = p(A)
pb(BA) = α(AB)
αb(BA) = p(AB)
pb(CBA) = α(ABC)
αb(CBA) = p(ABC)
pb(DCBA) = p(ABCD).
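As an illustration of these rules (a minimal Python sketch, not the tool used in the thesis), the transformation can be written as a few lines operating on an ARPA model parsed into per-order lists of (log-probability, word tuple, log-back-off) entries; the data layout and names are assumptions, and the special handling of <s>/</s> and of missing N-grams (section 4.3) is omitted:

def reverse_arpa(ngrams, highest_order):
    # ngrams: {order: [(logprob, words_tuple, log_backoff), ...]};
    # log_backoff is 0.0 when the entry has no explicit back-off weight.
    reversed_ngrams = {}
    for order, entries in ngrams.items():
        out = []
        for logprob, words, log_backoff in entries:
            rwords = tuple(reversed(words))      # abc becomes cba
            if order == highest_order:
                # highest order: only the N-gram itself is reversed
                out.append((logprob, rwords, 0.0))
            else:
                # lower orders: the whole line is reversed, i.e. the
                # N-gram probability and the back-off weight swap roles
                out.append((log_backoff, rwords, logprob))
        reversed_ngrams[order] = out
    return reversed_ngrams

# a bi-gram entry "log P(b|a)  a b  log alpha(a,b)" becomes
# "log alpha(a,b)  b a  log P(b|a)":
fwd = {1: [(-5.23, ("a",), -3.3)],
       2: [(-1.46, ("a", "b"), -3.23)],
       3: [(-0.35, ("a", "b", "a"), 0.0)]}
print(reverse_arpa(fwd, highest_order=3)[2])     # [(-3.23, ('b', 'a'), -1.46)]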

4.3 The treatment of missing N-grams

Figure 4.5 shows the LM reversal rules (equation 4.2) applied to a tri-gram back-off ARPA LM. While the rules are rather simple, an additional complexity arises when representing ARPA models (back-off N-gram LMs in general) as WFSAs. If there is an N-gram entry for abcd in the ARPA, the resulting WFSA needs the back-off states bcd, cd and d. Due to LM pruning, and due to other reasons that we are going to explain in this section, for some of the N-grams abcd defined in the ARPA file there is no corresponding tri-gram entry bcd or bi-gram entry cd, i.e. we are given neither the probability α(bcd) of backing off abcd → bcd, nor P(d|b, c).

2The model is only truly deterministic if we use failure arcs, but the construction presented here is also valid for ε-arcs.

[Figure 4.5: ARPA listing of a toy tri-gram back-off LM (ngram 1=4, ngram 2=2, ngram 3=2), its forward WFSA (upper part) and the backward WFSA obtained from it by the reversal rules (lower part).]

Figure 4.5: Upper part: Forward WFSA for the tri-gram back-off ARPA LM (repeated from figure 2.12). We apply the rules from equation 4.2 to obtain the backward WFSA (lower part). We see that exactly the same probabilities are used between the states (e.g. a → ab in the upper model and a → ba in the lower model) and that back-off weights in the upper model are now on word arcs in the lower model. As already said, an alternative interpretation of a tri-gram transition ab → bc is to go to an imaginary state abc and to immediately back off to state bc. Therefore, the final state in the upper part is the tri-gram state ab</s>. However, since this is a final state, there is no way to back off from it to the state b</s>. Moreover, in the forward ARPA definition (upper left), the N-gram corresponding to the state b</s> is missing.

N-grams that are needed for the construction of the WFSA, but not defined in the ARPA, we call missing N-grams. During the construction of the recognition graph from the forward LM, missing back-off states are usually added automatically. For example, in the tri-gram LM in the upper part of figure 4.5, the tri-gram <s>ab leads into the state ab. Let's imagine the corresponding back-off bi-gram ab is not given in the ARPA file (e.g. due to pruning): In this case, during the construction of the recognition graph, the state ab needs to be created automatically, as it is the target of the tri-gram. Since there is no bi-gram probability for ab (P(b|a) = 0.0, and no successor bi-grams), we should immediately back off to state b. Thus, the bi-gram ab is added with back-off weight α(a, b) = 1.0 (zero in the log-domain). However, it should not be possible to reach the newly created state ab from a, since the N-gram ab is missing (P(b|a) = 0.0, or minus infinity in the log-domain). In terms of the WFSA representation of the LM (right part of figure 4.5), this would mean that there would be no link between a and ab, and the link between ab and b would be added with zero cost. In the reversed LM, where forward probability and back-off weight change their role, this leads to the situation that we are able to reach ba from b with α(a, b) = 1.0, but we are not able to back off from ba to a, since this corresponds to a path

that was not present in the forward model (P(b|a) = 0.0). If we made the missing N-grams explicit in the ARPA file, the missing N-gram would result in an entry '−inf a b 0.0' in the forward ARPA file, and in the backward ARPA file this results in entries of the type '0.0 b a −inf'. This might seem awkward, because we never have infinite log-back-off weights in the forward LM, but it is necessary to make the forward and backward LMs match exactly. An example of a missing N-gram is in figure 4.5: For the entry ab</s>, the back-off state b</s> is missing in the ARPA. It would be automatically added when constructing the WFSA, but here it is not necessary, since after observing the sentence-end symbol, no other N-gram can follow3. Even if we don't need it in the forward model, it still has an effect on the backward model. To make the missing N-gram explicit in the forward model (upper part), we would add it in such a way that we can back off to the missing state b</s> and from it to the final state </s>. However, b</s> should not be reachable through the missing N-gram b → b</s>. According to the rule, we create '−inf b </s> 0.0' in the forward ARPA. In the reverse model (lower part), this corresponds to being able to reach the missing state </s>b by the N-gram </s> → </s>b, but not being able to back off from </s>b to the lower order state b, since this corresponds to a path that was not present in the forward model. This results in adding '0.0 </s> b −inf' in the backward ARPA and exactly corresponds to the state </s>b in figure 4.5, which can be reached with probability one, but there can be no back-off arc leaving this state (indicated as a dashed arc with infinite cost). To summarize, we need to add the state </s>b in the reverse model, but, in order to have an equivalent backward model, the back-off link </s>b → b is not allowed. Missing N-grams result from a complex interplay of the type of back-off distribution, cut-off frequencies and LM pruning (e.g. based on entropy [Stolcke(1998)]). As we have already explained, the first type of N-grams that are missing in the ARPA file are back-off N-grams that end a sentence (e.g. b</s>). Otherwise, if we don't apply pruning, we would expect that the presence of a higher-order N-gram implies the presence of the lower-order N-gram (e.g. with a shortened history), since the absolute observation count of the lower-order N-gram should be equal to or higher than the count of the higher-order N-gram. We encounter missing N-grams when we use different cut-off frequencies (parameter k in equation 2.13) for different N-gram orders. For example, when we use SRILM's default setting k4 = 1, k3 = k2 = k1 = 2 for four-gram LMs, we get missing tri-grams for all four-grams whose back-off tri-gram was only observed once. As we will see now, we also get missing N-grams if we use lower-order distributions which are not based on counts. The distribution for the highest-order N-grams P′(wi|hi) (equations 2.13, 2.16) is usually based on the counts C(hi, wi). When back-off models were introduced [Katz(1987)], the lower-order back-off distributions Plower were also based on counts. However, when we have to back off, we should make use of the fact that this particular word is unseen in the given context. We would expect a different distribution of words than when we were not given that information, i.e. not just expect any frequent word. In other words, we should use a different type of distribution for Plower(wi|h̄i) than for P′(wi|hi).
Following that intuition, Kneser-Ney-type LMs [Kneser and Ney(1995)] use a back-off distribution where the probability of a word, unseen in a certain context, is proportional to the number of possible predecessor word types that can occur before that context: 3By convention, a state reached by an N-gram containing the sentence-end symbol is a final state.

Pbackoff(wi | wi−n+1 . . . wi−1) = |{wi−n : C(wi−n . . . wi) > k}| / Σwi |{wi−n : C(wi−n . . . wi) > k}| .    (4.2)

As a consequence, we can expect words or phrases that appear frequently, but only in very few different contexts, to have a low probability in the back-off model. For example, we would not expect the word "Francisco" to appear in many other contexts than together with "San Francisco", despite the fact that it is a frequent word. For that reason, we often find higher-order N-grams in the LM, such as "San Francisco area", for which the back-off N-gram "Francisco area" is missing. When constructing the backward LM, we will add the missing N-gram "area Francisco" with probability one and infinite back-off weight. This means that after observing "area Francisco", we are only able to continue with "San" and there can be no back-off to "Francisco", which would allow continuing with another word. In fact, when experimenting with LMs trained on sentences from the Wall Street Journal corpus [Paul and Baker(1992)], we observed that any common multi-word phrase can result in missing lower-order N-grams. An N-gram starting within a multi-word phrase has very few different left contexts, which causes it to have a low back-off probability. If the right context of that N-gram is either almost completely undetermined or completely determined (e.g. sentence end), all N-grams that would continue the phrase fall below the cut-off frequency and are thus not present in the LM. Typically, a multi-word phrase like "on behalf of" or "New York City" is followed by a word that introduces lots of ambiguity - e.g. "on behalf of the". If no N-gram "behalf of the X" is above the cut-off frequency, then the back-off N-gram "behalf of the" is also missing in the LM, since the probability of seeing it in a new context other than "on" is extremely low. As already mentioned, for all N-grams ending a sentence there is also no succeeding N-gram, which is a similar situation. It is quite obvious that LM pruning (e.g. based on entropy [Stolcke(1998)]) will increase the number of missing N-grams. According to the same principle, N-grams with a low probability in the back-off distribution and no successor N-grams (due to pruning) are missing as well.
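To make the bookkeeping for missing N-grams concrete, the following Python sketch (an illustration, not the thesis tool; names, data layout and the toy numbers are assumptions) scans a parsed ARPA model from the highest order downwards, finds every back-off history that a higher-order entry requires but that is not listed itself, and adds it with the '−inf ... 0.0' convention described above:

NEG_INF = float("-inf")

def add_missing_ngrams(ngrams, highest_order):
    # ngrams: {order: {words_tuple: (logprob, log_backoff)}}.
    # Every N-gram of order n > 1 needs its length-(n-1) suffix as a
    # back-off state; if that suffix is not an entry itself, add it as a
    # missing N-gram with probability -inf and back-off weight 0.0.
    added = []
    for order in range(highest_order, 1, -1):    # downwards, to catch cascades
        lower = ngrams.setdefault(order - 1, {})
        for words in ngrams.get(order, {}):
            suffix = words[1:]                   # e.g. (a, b, </s>) -> (b, </s>)
            if suffix not in lower:
                lower[suffix] = (NEG_INF, 0.0)   # '-inf  b </s>  0.0'
                added.append(suffix)
    return added

# toy model in the spirit of figure 4.5: the back-off bi-gram "b </s>" is not listed
arpa = {1: {("a",): (-5.2, -3.3), ("b",): (-3.5, 0.0),
            ("<s>",): (-2.5, -4.3), ("</s>",): (-3.5, 0.0)},
        2: {("a", "b"): (-1.5, -3.2), ("<s>", "a"): (-1.3, -4.2)},
        3: {("<s>", "a", "b"): (-0.35, 0.0), ("a", "b", "</s>"): (-0.24, 0.0)}}
print(add_missing_ngrams(arpa, highest_order=3))   # [('b', '</s>')]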

4.4 Proof: Exact reversal of the language model

We have verified that our "reversed" ARPA LM, and also the corresponding WFST, assigns the same score to a reversed sentence that our original ARPA LM assigned to the original sentence. In this section, we sketch a proof of the correctness of the algorithm for reversing the LM, as presented in section 4.2. The steps are valid for back-off and interpolated LMs. We do this by introducing a series of simple transformations that each guarantee the equivalence of the LM WFSA (the same sequence of symbols gets the same score):

1. Modify the ARPA model to make the back-off costs zero while maintaining the sentence-level scores the same,

2. Convert to “max-ent” form, reverse in the “max-ent” form, which is easy,

3. Convert back to ARPA form [still not normalized per word],

4. Convert to a WFSA, and apply the new weight pushing.

Input to the algorithm is a language model (LM) in ARPA format4, which contains entries in the form ’p(ABC) ABC α(ABC)’, and its representation as a WFSA (top of

4ARPA stores log-probabilities, but for simplicity, we show probabilities here.

figure 4.5). Here, ABC stands for the three words A,B,C, p is the N-gram weight and α is the back-off weight. We use the notation p(ABC) meaning P (C|A, B). The result of these steps is a reverse LM that assigns exactly the same scores as the forward LM.

First step: Pushing the back-off costs

[Figure 4.6: WFSA of a toy tri-gram LM over the three words A, B, C, showing word arcs such as B/p(AB), B/p(AAB), C/p(ABC) and back-off arcs such as γ/α(A), γ/α(AB), γ/α(ABC) between the states 0, A, B, AA, AB, BA, BB, CA and the tri-gram states.]

Figure 4.6: WFSA of a toy tri-gram LM with just three words A, B, C, focusing on the N-grams that contain AB. On the left, there are states corresponding to histories ending in A, in the center are N-grams that start with A and on the right are N-grams starting in B. For a vocabulary V = 3, every state (except for the zero-gram 0) has four incoming and four outgoing arcs. Conceptually, we see a tri-gram arc AB → BC as two arcs AB → ABC with p(ABC) and backing-off ABC → BC with α(ABC) = 1.0. In this interpretation, all arcs with word labels go up one level in the hierarchy, and all back-off arcs go down one level in the hierarchy.

In the first step, we push the weights α() of back-off transitions. We do this using the simple potential function defined in equation 3.4. This means, we multiply a fixed value k to the weights of all incoming arcs into a particular state q0 and we divide the same value from the weights of all arcs leaving that state. We do so starting from the uni-gram-back-offs, and then going upwards to the bi-gram-back-offs, and so on (the highest order back-offs are 1.0 anyway). In figure 4.6, we show the relevant arcs for this operation: To push the back-off weight α(A) from the arc A → 0, we divide all outgoing arcs of the state A by α(A) (e.g. the arc A → AB, thus we change p(AB)), and we multiply all incoming arcs of A by α(A) (e.g. the arc BA → A, thus we change α(BA)). Pushing the weights from all uni-gram back-offs results in a WFSA Pz1:

pz1(A) = p(A) α(A) (4.3)

αz1(A) = 1.0

pz1(AB) = p(AB)/α(A)

αz1(AB) = α(AB) α(B)

pz1(ABC) = p(ABC)

52 We only write one equation for each type of arcs, i.e. from pz1(AB) = p(AB)/α(A), we also know that pz1(CB) = p(CB)/α(C), and so on. With our notation, we want to show the whole WFSA for all possible N-gram orders. Therefore, we write also the tri-gram probabilities pz1(ABC), which are not affected by this step. If we would have a four-gram LM, it is clear, that also pz1(ABCD) is not affected. If we would have a bi-gram LM, we join the tri-gram arc and the tri-gram back-off arc into one arc, using α(AB) = 1.0: pz1,bigram(AB) = pz1(AB)·αz1(AB) = (p(AB)/α(A))·(α(AB) α(B)) = p(AB)·α(B)/α(A), and we would truncate our derivation, since we would already have an WFSA with back-off weights one. For tri-gram LMs, we apply now a second pushing step, where we push the weight αz1(AB) = α(AB) α(B) from the back-off arc AB → B. From figure 4.6, we see, that we divide all outgoing arcs of the state AB by αz1(AB), and we multiply all incoming arcs of AB by the same quantity. From figure 4.6, we see, that the arc AB → ABC is influenced by AB → B and BC → C, thus both αz1(AB) and αz1(BC) influence p(ABC). This step results in a WFSA Pz2:

pz2(A) = p(A) α(A)

αz2(A) = 1.0 p(AB)α(AB) α(B) p (AB) = z2 α(A)

αz2(AB) = 1.0 p(ABC) p (ABC) = z2 α(AB) α(B)

αz2(ABC) = α(ABC) α(BC) α(C) For tri-grams, we use α(ABC) = 1.0 and truncate or derivation with: p(ABC) α(BC) α(C) p (ABC) = p (ABC) α (ABC) = . z2,trigram z2 z2 α(AB) α(B) A clear pattern is observed. To summarize, for an N-gram of any order, by a series of N −1 weight pushing steps, we obtain a WFSA Pz that has all back-off weights one (log zero):

pz(A) = p(A) α(A)                                            (4.4)
pz(AB) = p(AB) α(AB) α(B) / α(A)
pz(ABC) = p(ABC) α(ABC) α(BC) α(C) / (α(AB) α(B))
...

Second step: Converting to "max-ent" form

In a second step, we view the WFSA in a multiplicative space, which is inspired by the N-gram features in a maximum entropy LM [Berger et al.(1996)]. It models the probability of an N-gram with the help of a set of feature functions fi(hi, wi):

pλ(wi|hi) = 1/Zλ(hi) · exp( Σi λi fi(hi, wi) )    (4.5)

53 Here, Zλ(hi) is the normalizer to form a valid distribution and λi is the weight of the feature function fi(hi, wi), which in the simplest case is a binary indicator function, which can select particular words and histories (N-grams), but also other types of features can be used, asking about part-of-speech etc. Here, we assume an indicator function fi for each uni- gram, bi-gram, tri-gram and so on. When evaluating the words ABC, a back-off LM would only consider p(BC), if there is no p(ABC) in the LM, in which case it has to back-off to history B. On the other hand, in interpolated LMs, p(ABC) is estimated by interpolating the N-gram features of all orders. This is similar in a ’maxent-type’ model, however, we don’t use additive interpolation, but multiply (exponential of sum) the feature contributions of lower orders. Inspired by this, we transform our LM weights into a multiplicative space, where all N-gram orders contribute:

pmaxent(ABC) = pf (ABC) · pf (BC) · pf (C). (4.6) With this step we actually leave the original semi-ring, but we use it just as an inter- mediate step for explanation. When we use a back-off LM, we want to construct the model so that:   pz(ABC) if pz(ABC) pmaxent(ABC) = pz(BC) else if pz(BC) . (4.7)  pz(C) elsewhere This can be achieved by setting:

pf (C) = pz(C) (4.8)

pf (BC) = pz(BC)/pz(C)

pf (ABC) = pz(ABC)/pz(BC) ...

The combined weight pushing from back-off arcs and conversion to “max-ent” type results in:

pf(A) = p(A) α(A)                                            (4.9)
pf(AB) = p(AB) α(AB) / (α(A) p(B))
pf(ABC) = p(ABC) α(ABC) / (α(AB) p(BC))
pf(ABCD) = p(ABCD) α(ABCD) / (α(ABC) p(BCD))
...

The advantage is that this ’maxent-type’ model can be easily reversed by:

pfr(ABC) = pf (CBA) (4.10)

As already mentioned in section 4.2, the sentence begin and end symbols have to be switched in the ARPA file, and we assign the uni-gram probability of the former sentence begin (usually it is ignored, but in our case it won’t be one) to the new sentence begin (former sentence end). Also, as explained in section 4.3, we have to explicitly add the missing back-off states before the reveral (e.g. if back-off state AB is not present for an

54 entry XAB - see lower part of figure 4.5). In the following equations, we treat pb(AB) = 1.0 if the state AB was added as a missing back-off state. The resulting reverse maxent-type model can be transformed back to the ARPA-type:

pb(C) = pfr(C) (4.11)

pb(BC) = pfr(BC) · pfr(C)

pb(ABC) = pfr(ABC) · pfr(BC) · pfr(C) ... The resulting backward LM is then:

pb(A) = p(A) α(A)                                                          (4.12)
pb(BA) = p(AB) α(AB) p(A) / p(B)
pb(CBA) = p(ABC) α(ABC) p(AB) p(A) / (p(BC) p(B))
pb(DCBA) = p(ABCD) α(ABCD) p(ABC) p(AB) p(A) / (p(BCD) p(BC) p(B))
...

So far, for the conversion to the "max-ent" form, we assumed a back-off LM (equations 4.7, 4.8). If the probabilities are represented as an interpolated LM, we want to construct a model, so that:

  pz(ABC) + pz(BC) + pz(C) if pz(ABC) pmaxent,int(ABC) = pz(BC) + pz(C) else if pz(BC) . (4.13)  pz(C) elsewhere This can be achieved by setting:

pm(C) = pz(C) (4.14) pz(BC) + pz(C) pm(BC) = pz(C) pz(ABC) + pz(BC) + pz(C) pm(ABC) = pz(BC) + pz(C) ... Also here, the model can be easily reversed by:

pmr(ABC) = pm(CBA) (4.15) We can convert the max-ent model for interpolated LMs back with:

pb(C) = pmr(C) (4.16)

pb(BC) = pmr(BC) · pmr(C) − pmr(C)

pb(ABC) = pmr(ABC) · pmr(BC) − pmr(BC) · pmr(C)

pb(XABC) = pmr(XABC) · pmr(ABC) − pmr(ABC) · pmr(BC) · pmr(C) ... Here, we show also the equation resulting for four-grams, to indicate continuation of the series. In the further steps, we only continue with the equations for back-off LMs, to save space.

55 Last step: Pushing the forward probabilities to back-off arcs Now, in a final step, we take the result for back-off LMs (equation 4.12) and we apply a similar but inverse operation to what we applied to push the weights from the back-off arcs (equation 4.4). We can show, that this results exactly in the algorithm presented in section 4.2 (equation 4.2). First, we transform the uni-gram probabilities into the desired form (equation 4.2) by pushing the p(A). This is the equivalent step to equation 4.3, but this time, we don’t push α(A) from the arc A → 0, but 1/p(A). From figure 4.6 (now we have to read AB as BA), we see that this affects the arcs to and from the state A (and B), among them are bi-gram probabilities and bi-gram back-offs:

pb1(A) = α(A) (4.17)

αb1(A) = p(A) p(AB) α(AB) p(A) α(B) p (BA) = b1 p(B) 1 α (BA) = b1 p(A) (ABC) α(ABC) p(AB) p(A) p (CBA) = b1 p(BC) p(B)

αb1(CBA) = 1.0 p(ABCD) α(ABCD) p(ABC) p(AB) p(A) p (DCBA) = b1 p(BCD) p(BC) p(B)

αb1(DCBA) = 1.0.

Of course, pb1(CBA) and pb1(DCBA) are not affected by this step, but we copy them from equation 4.12. Now, we transform the bi-gram probabilities by pushing 1/ (p(AB) p(A)). This also affects the tri-gram CBA and its back-off arc:

pb2(A) = α(A) (4.18)

αb2(A) = p(A)

pb2(BA) = α(AB)

αb2(BA) = p(AB) p(ABC) α(ABC) p(AB) p(A) pb2(CBA) = 1 α (CBA) = b2 p(AB) p(A) p(ABCD) α(ABCD) p(ABC) p(AB) p(A) p (DCBA) = b2 p(BCD) p(BC) p(B)

αb2(DCBA) = 1.0.

56 The third step is analogous for the tri-grams, by pushing 1/ (p(ABC) p(AB) p(A)):

pb3(A) = α(A) (4.19)

αb3(A) = p(A)

pb3(BA) = α(AB)

αb3(BA) = p(AB)

pb3(CBA) = α(ABC)

αb3(CBA) = p(ABC)

pb3(DCBA) = p(ABCD) α(ABCD) p(ABC) p(AB) p(A)

αb3(DCBA) = 1/ (p(ABC) p(AB) p(A)) .

For a four-gram LM α(ABCD) = 1.0 and there is only one arc with weight:

pb4(DCBA) = pb3(DCBA) αb3(DCBA) = p(ABCD) (4.20)

We see that the N-gram probability for the highest order stays the same, pb(DCBA) = p(ABCD), and for all lower orders, the N-gram and back-off probabilities change their role. Thus, we have shown that these steps result in exactly the same solution as in equation 4.2.

4.5 Motivation by Bayes’ formula

[Lee and Kawahara(2009)] point out that the reverse LM can be constructed with the help of Bayes' rule. However, no details were given; in particular, it is unclear how to treat back-offs and back-off states. Here, we try to derive the LM reversal using Bayes' rule. The basic assumption is that the joint probability of word sequences should be the same in the forward and backward models:

pb(wN , . . . , w1) = pf (w1, . . . , wN ). (4.21) This sounds reasonable, if the probabilities are based on counts. However, this might not be the case, if the lower-order probabilities are following another distribution, as e.g. the left-continuation probabilities used in Kneser-Ney language models (equation 4.2). We start our derivation with the uni-grams (pb(A) = p(A)) and bi-grams:

pb(A) = p(A)

pb(B) = p(B)

pb(B,A) = p(A, B)

pb(B) pb(A|B) = p(A) p(B|A)
pb(A|B) = p(A) p(B|A) / p(B).                        (4.22)

57 We continue with the tri-grams:

pb(C,B,A) = p(A, B, C)

pb(C) pb(B|C) pb(A|C,B) = p(A) p(B|A) p(C|A, B)
pb(A|C,B) = p(A) p(B|A) p(C|A, B) / (p(B) p(C|B)) = P(A, B, C) / P(B, C).    (4.23)

We can generalize this derivation for all N-gram orders. Given the forward word se- N quence w1 , we derive:

pb(w1 | wN, wN−1, . . . , w2) = p(w1, . . . , wN) / p(w2, . . . , wN) = Πi=1..N p(wi | w1 . . . wi−1) / Πi=2..N p(wi | w2 . . . wi−1).    (4.24)

If we compare these formulas to equation 4.12, we see that we derived the same formulas – except for the additional back-off weights of the forward model. We repeat equation 4.12 here (moving the back-offs for clarity):

pb0(A) = p(A) α(A) p(AB) p(A) p (BA) = · α(AB) b0 p(B) p(ABC) p(AB) p(A) p (CBA) = · α(ABC) b0 p(BC) p(B) p(ABCD) p(ABC) p(AB) p(A) p (DCBA) = · α(ABCD) b0 p(BCD) p(BC) p(B) ...

With the help of Bayes' rule, we determined the N-gram probabilities of the reversed model, but we didn't figure out the back-off weights. If we transform our solution to a form that exactly retrieves the probabilities obtained from Bayes' rule, we can use the resulting back-off weights as a Bayes'-like solution, i.e. we can assume that the resulting model is correctly normalized to sum to one. In a similar series of pushing steps as before, we can push the former back-off weights α of the forward model, so that the structure of the Bayes formula is retrieved. Thus, we start again from equation 4.12, as we did in equation 4.17, but instead of pushing 1/p(A), we push the back-off weights (1/α(A)) from the uni-grams:

58 pb1(A) = p(A)

αb1(A) = α(A) p(AB) p(A) p (BA) = · α(AB) α(B) b1 p(B)

αb1(BA) = 1/α(A) p(ABC) p(AB) p(A) p (CBA) = · α(ABC) b1 p(BC) p(B)

αb1(CBA) = 1.0 p(ABCD) p(ABC) p(AB) p(A) p (DCBA) = · α(ABCD) b1 p(BCD) p(BC) p(B)

αb1(DCBA) = 1.0. (4.25)

Then, we push 1/ (α(AB) α(B)) from the bi-grams:

pb2(A) = p(A)

αb2(A) = α(A) p(AB) p(A) p (BA) = b2 p(B) α(AB) α(B) α (BA) = b2 α(A) p(ABC) p(AB) p(A) p (CBA) = · α(ABC) α(BC) α(C) b2 p(BC) p(B)

αb2(CBA) = 1/ (α(BC) α(C)) p(ABCD) p(ABC) p(AB) p(A) p (DCBA) = · α(ABCD) b2 p(BCD) p(BC) p(B)

αb2(DCBA) = 1.0. (4.26)

Now, we push 1/(α(ABC) α(BC) α(C)) from the tri-grams:

pb3(A) = p(A)

αb3(A) = α(A) p(AB) p(A) p (BA) = b3 p(B) α(AB) α(B) α (BA) = b3 α(A) p(ABC) p(AB) p(A) p (CBA) = b3 p(BC) p(B) α(ABC) α(BC) α(C) α (CBA) = b3 α(BC) α(C) p(ABCD) p(ABC) p(AB) p(A) p (DCBA) = · α(ABCD) α(BCD) α(CD) α(D) b3 p(BCD) p(BC) p(B) 1 α (DCBA) = . b3 α(ABC) α(BC) α(C) (4.27)

Now, we would continue5:

p(ABCD) p(ABC) p(AB) p(A) p (DCBA) = b4 p(BCD) p(BC) p(B) α(ABCD) α(BCD) α(CD) α(D) α (DCBA) = . b4 α(ABC) α(BC) α(C) (4.28)

The general rule for the back-off arcs is then:

αb(wN, wN−1, . . . , w1) = Πi=1..N α(wi, . . . , wN) / Πi=1..N−1 α(wi, . . . , wN−1).    (4.29)

Together, equations 4.24 and 4.29 give another formalism to construct the backward LM probabilities. The computation is slightly more complicated than the simple rule given in section 4.2, but the resulting probabilities are closer to a normalized (stochastic) distribution, so that it should be possible to skip the weight pushing step.
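As a small illustration of this alternative formalism (an illustrative Python sketch, not the thesis code), equations 4.24 and 4.29 can be written directly as functions of the forward conditional probabilities and forward back-off weights; the two helper callables are assumed to be provided by whatever data structure holds the forward LM:

def backward_logprob(fwd_logprob, words):
    # equation 4.24: log p_b(w1 | wN ... w2) for the forward sequence
    # words = (w1, ..., wN); fwd_logprob(word, history) returns the
    # forward conditional log-probability log p(w | history).
    n = len(words)
    num = sum(fwd_logprob(words[i], words[:i]) for i in range(n))
    den = sum(fwd_logprob(words[i], words[1:i]) for i in range(1, n))
    return num - den

def backward_logbackoff(fwd_logbackoff, words):
    # equation 4.29: log alpha_b(wN, ..., w1); fwd_logbackoff(ngram)
    # returns log alpha(ngram), or 0.0 if there is no explicit back-off weight.
    n = len(words)
    num = sum(fwd_logbackoff(words[i:n]) for i in range(n))          # alpha(w_i ... w_N)
    den = sum(fwd_logbackoff(words[i:n - 1]) for i in range(n - 1))  # alpha(w_i ... w_{N-1})
    return num - den

As noted above, this computation is more involved than the simple swap rule of section 4.2, but its output is already close to a normalized distribution.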

4.6 Conclusions

We motivated the idea of performing a forwards and backwards search through composed finite state machines and explained that, in order to construct the backward decoding graph, we need a reversed language model that has a similar structure and gives similar scores as the forward LM. We explained the approximation of the LM with weighted finite state acceptors and showed a constructive solution for an algorithm that results in a backward LM that assigns exactly the same scores as the forward LM. We paid special attention to the back-off structure and explained how to deal with missing N-grams. Finally, we showed

5For the highest order, there would be just one arc, multiplying p · α, e.g. for the four-gram model, there is an arc from state DCB to state CBA with weight pb(DCBA) · αb(DCBA).

[Figure 4.7: real-time factor versus word error rate curves for HCLG forward, HCLG backward push G and HCLG backward text.]

Figure 4.7: Decoding performance of the backward decoding network reported on the Eval2000 data set with a GMM model and a tri-gram language model. Shown is the relation between word error rate and real-time factor. Better performance is indicated by curves closer to the lower left corner. We compare the performance of the backward decoding network with the application of the new weight pushing algorithm (HCLG backward push G) to the performance of a tri-gram LM trained on the reversed training texts (HCLG backward text). For comparison, we also show the performance of the forward decoding network (HCLG forward). In this case, no weight pushing is necessary.

that the constructive algorithm for the language model reversal can be derived by a series of steps, where each step guarantees WFSA equivalence, as well as its motivation from Bayes' rule with constraints on joint word probabilities. Up to this point, we are able to construct a backward LM that gives exactly the same scores for the reversed sentences as the forward LM, and that at the same time has the same size, a similar structure and is deterministic (except for the ε-arcs). For tasks in which the backward LM is used in a pruned search (including LVCSR), it is desirable that it has yet another property - (local) stochasticity. For that purpose, we can apply the alternative weight pushing algorithm introduced in section 3.3. We tested the algorithms on a bi-gram (see section 3.4) and a tri-gram LM and compared them to the simpler but less exact method of reversing the training texts. Figure 4.7 shows the results of the decoding graphs built from tri-gram LMs. The experimental setting is the same as in section 3.4. The performance of the backward model with exact reversal and that of the model trained on the reversed training texts are very similar; however, in the area with low word error rates, the performance of the exact model matches the performance of the forward model more closely. As already pointed out, the performances of the forward and backward models are not necessarily the same, depending on the task (i.e. the properties of the language). However, the performances of forward and backward models are not far from each other. We can speculate that language evolved in such a way that humans can understand it with more ease. Therefore, we would expect that language is optimized to be easier to understand in the forward time direction than when reversed in time. This might explain the advantage of the forward decoding.

Chapter 5

Combining forward and backward search in decoding

We introduce a speed-up technique for weighted finite state transducer (WFST) based decoders - applicable to both static and dynamic network decoders. The technique is based on the idea that one decoding pass using a wider beam can be replaced by two decoding passes with smaller beams, decoding forwards and backwards in time. The advantages of decoding backwards in time are explained in section 5.1. The approach followed in this thesis is to use forward and backward passes in a decoder that works with a variable beam width, controlled by the (dis)agreement of the two decoding passes. For the purpose of backwards decoding, we have to construct a backwards decoding network with certain properties, explained in section 5.2. The details of LM reversal have already been explained in chapter 4. One possible realization of the variable beam width decoding is to run the forward and backward passes in parallel and to perform an iterative refinement with increased beam width in those places where forward and backward decoding disagree. This is explored in section 5.3. Another realization of the basic idea is a technique we call tracked decoding, detailed in section 5.4. The main idea is that the second decoding pass (backwards) can use detailed information gathered from the first pass (forwards) to increase the decoding beam in places where the two passes disagree. The speed-up is achieved by using a narrow beam during the first pass, as well as in the second pass in places where no disagreement is detected. Otherwise, the beam is increased to include all 'tracked' tokens. In section 5.4.4, we give an experimental validation of our method on a Wall Street Journal (WSJ) decoding task. We find that our method gives a substantial speed-up of two to three times or even more at the "more accurate" operating points of decoding where search errors are small.

5.1 Introduction: combining forward and backward search

The application that we had in mind while writing this thesis was the decoding of the most probable sequence of words in LVCSR. Given the complexity of the task - the search graph can contain up to millions of states - the resulting huge search spaces cannot be explored exhaustively. It is necessary to use heuristic pruning techniques. In this case, we have to distinguish search errors, which are due to the incomplete exploration of the search space, from modeling errors, which are due to insufficient training data or due to inaccurate models.

fwd: IS SHERMAN ARE CONIFER AND THREE MOST RECENT CASUALTY REPORT
bwd: IS BADGER A REMARK ON VANCOUVER+S MOST RECENT CASUALTY REPORT
ref: IS THERE A REMARK ON VANCOUVER+S MOST RECENT CASUALTY REPORT

Figure 5.1: Forward and backward speech recognition: example ASR result on the Resource Management corpus. Bad signal quality at the start of the utterance confused almost the whole utterance in forward recognition, while it almost didn’t harm the backward decoding (only the word ’there’ right at the start is mis-recognized). An OOV could cause similar effects.

The most widely used search technique in LVCSR is the Viterbi algorithm with beam search [Lowerre(1976)]. Beam search is a breadth-first style search, comparing partial paths of the same length (time-synchronously). At each time step, only those paths are kept and further expanded whose partial path score is within a beam width of the current best score. The beam width is a trade-off between speed and accuracy. For many search tasks (e.g. in planning algorithms) for which the search space cannot be explored exhaustively, it is known that if the average branching factor of the backward search graph is smaller than that of the forward graph, it is better to perform the search on the backward graph. For example, [Tang and Cristo(2008)] showed that the number of errors in automatic speech recognition of street-city-state tuples (as used e.g. in the US) can be reduced by performing the search backwards in time, since this gives a lower ‘dynamic task complexity’ [Tang and Cristo(2008)]. Since forward search uses the ‘history’ and backward search uses the ‘future’, there is hope that the search errors of searching forwards and backwards are mutually independent. A path that is not promising (low scores) at the beginning is likely to be pruned by the forward search, even if it has a high overall score towards the end. It has a chance not to be pruned by the backward search, because looking backwards this path has high scores at the beginning (which was the end in the forward search). Figure 5.1 shows a recognition result obtained in an early stage of our experiments that demonstrates this situation. Figure 5.2 illustrates the potential of forward and backward search. In addition to beam search, another strategy to deal with the complexity of the task is to use multiple decoding passes, which has been common practice for a long time (e.g. [Murveit et al.(1993)]). Usually, inexpensive and approximate models are used in a first pass to generate an intermediate representation, which is then ‘re-scored’ using more complex models. As intermediate representation, among others, lists of N-best recognition results or lattices of possible hypothesis sequences are used. Not only can different types of models be used in the successive decoding passes, but also different approaches to search. In [Austin et al.(1991)], the idea of performing the second pass backwards in time was introduced. As an intermediate representation, they use the active words for each time frame and the corresponding word end scores, obtained from a Viterbi beam search in the forward pass using approximate and faster models. The active words per frame are used to limit the word expansion in the backward search, which is also a Viterbi search, and the word end scores serve as a good estimate of the path cost of the remaining speech. Thus, the second pass usually takes only a fraction of the time of the first pass, so that more complex algorithms or models can be used. Alternatively, the forward pass can be sped up by using approximate models [Nguyen et al.(1993)]. More recent re-discoveries of the same idea are [Lee et al.(1998)] and [Lee and Kawahara(2009)], which use a word trellis as intermediate representation and stack decoding (A-star search) in the backward pass. Also, [Cardinal et al.(2013)] use a uni-gram Viterbi backward pass,

which is then used as a heuristic in A-star forward decoding with the full language model. Opposed to these works, this work (first published in [Hannemann et al.(2013)]) focuses on using forward and backward passes that are balanced or symmetric, i.e. on using models that are similarly powerful in both passes. This has the advantage that the hypotheses of both passes can be used for comparison or combination. The idea of symmetric passes was already used by [Li et al.(2009)] and [Abo-Gannemhy et al.(2010)] (see also [Tang and Cristo(2008)]). They combine the outputs of the symmetric forward and backward passes based on LM scores or confidence measures (ROVER technique1).

Figure 5.2: Illustration of forward and backward search [Nolden et al.(2013)]. In the background, acoustic likelihoods for each state are shown as they evolve over time. Bright colors indicate higher probability. In the forward search (upper part), the low-score ’valley’ around frame 7/8 causes the correct path (green) to fall out of the beam (dotted). The red path is chosen, but later (frames 20-30) it turns out to have poor scores. Even if it has better overall scores, the correct path cannot be recovered, since it was already pruned. In the backward search (lower part), the situation is different - starting from the end, the lower path looks much more promising (frames 30-35) and the upper path falls out of the beam. The low likelihoods around frame 7/8 do not distract the recognizer this time, so the backward search does find the correct path. The illustration explains that, to a certain extent, search errors of forward and backward search are independent. Of course, with a wide enough beam, the forward search would also find the overall best path.

1The ROVER [Fiscus(1997)] procedure aligns the different hypotheses and relies on a voting procedure to determine the best candidate word sequence.

Also [Jouvet and Fohr(2013a)] and [Jouvet and Fohr(2013b)] use the framework of [Lee and Kawahara(2009)] to ROVER two symmetric passes, and they show that the combination of forward and backward passes is especially effective in improving the performance. The follow-up work [Jouvet and Fohr(2014)] shows that the comparison of hypotheses from the forward and backward passes is an effective confidence measure for selecting automatically transcribed data for semi-supervised LVCSR training. The idea of our work [Hannemann et al.(2013)] is to speed up the decoding by using the (dis)agreement of the two symmetric decoding passes - decoding forwards and backwards in time. In beam search, a constant beam width is usually applied to the whole test set. We, however, use a decoder with a variable beam width that is only increased in areas where the two decoding passes disagree. There are two ways to implement this idea: Inspired by [Hannemann et al.(2013)], the authors of [Nolden et al.(2013)] showed that the comparison of the hypotheses of two symmetric forward and backward passes can be used in incremental decoding, where the search beam is extended in areas where the two passes do not agree in the first run. As a consequence, the system uses a variable beam width and dynamically focuses only on the parts that are difficult. Similar to all symmetric techniques mentioned so far, they use two independent forward and backward passes, which has the advantage that the two passes can run in parallel (section 5.3). In analogy to the non-symmetric techniques, in this work we want to use the information gathered in the first pass (e.g. forwards) to guide the search of the second pass (e.g. backwards), as shown in [Hannemann et al.(2013)]. In this approach, the beam width can be adjusted for every frame, so that a more careful search (increased beam) is only carried out in areas where the two passes disagree. The speed-up is achieved by using a narrow beam during the forward pass, and in the backward pass in places where no disagreement is detected (section 5.4). The application of the presented methods assumes that a segmentation or an algorithm for end point detection is given.

5.2 Construction of a reversed decoding graph

The construction of decoding graphs with Kaldi was described in section 2.4.1. If we want to perform the search in two symmetric forward and backward decoding passes, we need two corresponding decoding graphs - HCLGfwd and HCLGbwd. Both models should be equally powerful, i.e. have roughly the same accuracy and run-time requirements, and should have a similar structure, size and level of determinism in order to have optimal pruning behavior. We also want to compare the probabilities (or scores) of the outputs from the forward and backward passes (e.g. to estimate the optimal beam width). That means we want two models HCLGfwd and HCLGbwd that ideally produce the same overall score for the same hypothesis in both the forward and the backward pass. Due to the pruned search, both passes can result in different search errors (due to the different dynamic task complexity forwards and backwards). However, the two models should not make different modeling errors, hence they should assign the same scores to the same hypotheses. We also want to be able to compare the scores of partial results (paths). Therefore, the model structure (the distribution of weights along paths) should also be similar in the forward and backward passes. Given a forward graph HCLGfwd, the task is to obtain a backward graph HCLGbwd that will assign exactly the same overall score to the same utterance and will fulfill all the requirements stated above. Because our method treats disagreement between the best paths found by the two passes as a search error, we want the backward decoding graph to be equivalent to the reverse of the forward one.

The trivial solution of applying WFST reversal to HCLGfwd is not sufficient, since the resulting graph would not have a similar level of determinism and distribution of weights as the forward graph, i.e. it would show sub-optimal behavior when used in a pruned search. To make the resulting WFST determinizable, we would have to introduce disambiguation symbols [Mohri et al.(2008)] at different places than in the forward graph. As we already explained in section 4.2, especially the (reversed) LM component would introduce a great degree of local ambiguity. Instead, the solution is to separately construct the time-reversed versions of H, C, L and G, and then to build a composed model HCLGbwd in an analogous way as the forward graph was constructed (section 2.4.1). Since the resulting HCLGbwd is a cyclic transducer, the conventional weight pushing algorithm cannot be used in case the total weight is greater than one, as was explained in section 3.1. We can resort to the alternative weight pushing introduced in section 3.3. The time-reversed versions of H, C, L and G are again not simply the WFST reverses of the forward ones, but must be constructed separately. Depending on the task, the reversal of each component is of different complexity. The hardest input to reverse was the ARPA-format LM acceptor G. We have already given an algorithm for creating an equivalent but “time-reversed” LM in chapter 4. The reversal of H, C and L is rather trivial [Hannemann et al.(2013)] and is described in the next section. We made the code for all methods described here available as part of the Kaldi toolkit.

5.2.1 Reversing L, C and H

The construction of the reversed pronunciation lexicon transducer Lbwd (phones to words) is simple: the individual phone sequences (pronunciations) are reversed, and the disambiguation symbols [Mohri(1997)] (figure 2.10) are introduced after that. The disambiguation symbols now distinguish suffixes (ambiguous sequences at word endings), while in the forward case they distinguish prefixes. Figure 5.3 shows a reversed toy lexicon and the resulting transducer. The context-dependency transducer Cbwd (figure 5.4) is constructed in the usual way and looks identical to Cfwd. After the composition of Lfwd ◦ Gbwd, the phonetic context window (the input symbols of C) is reversed in time (a-b-c to c-b-a). Therefore, to look up the corresponding models (PDFs) in the phonetic decision tree, we have to reverse the phonetic context. Then, we look up using the phoneme context window and the HMM state. The HMM structure transducer Hbwd is constructed in the same way as Hfwd, except for the reversed phonetic context. The individual (three-state) HMMs for each phone are constructed separately and the relevant PDFs are looked up from the decision tree. Then, the phone HMMs must be reversed and weight-pushed in the log-semi-ring (including epsilon removal) to make the time-reversed transition probabilities of each state stochastic. As seen in figure 5.5, for left-to-right HMMs, there is a simpler way to determine the transition probabilities of the reversed model: we can assign them in the reversed order. This observation even holds for more complicated symmetric structures. After reversing the phone HMMs individually, we construct the composite Ha transducer, which contains them in self-loops (an example is shown in figure 5.6). Due to the reversal of the individual HMMs, the ordering of the self-loops and forward transitions changes, which does not matter for decoding, but needs to be considered when mapping the resulting alignments at the transition level.
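The lexicon reversal can be illustrated with a minimal, self-contained sketch (plain Python, not the actual Kaldi lexicon scripts; the tiny example lexicon and the simplified disambiguation rule are assumptions for illustration, so the output shows fewer #1 symbols than the excerpt in figure 5.3, which stems from the full lexicon):

from collections import Counter

def add_disambig(lexicon):
    # Append #1 to pronunciations that are a prefix of another entry or have
    # homophones, so that the lexicon transducer stays determinizable
    # (simplified: the real recipe hands out #1, #2, ... per homophone group).
    prefixes = Counter()
    for _, phones in lexicon:
        for i in range(1, len(phones) + 1):
            prefixes[tuple(phones[:i])] += 1
    counts = Counter(tuple(p) for _, p in lexicon)
    result = []
    for word, phones in lexicon:
        key = tuple(phones)
        if prefixes[key] > 1 or counts[key] > 1:
            result.append((word, phones + ["#1"]))
        else:
            result.append((word, phones))
    return result

def reverse_lexicon(lexicon):
    # Reverse the phone sequences first and only then add disambiguation
    # symbols, so that they now disambiguate (reversed) word endings.
    reversed_lex = [(word, list(reversed(phones))) for word, phones in lexicon]
    return add_disambig(reversed_lex)

forward_lexicon = [
    ("A",      ["ax"]),
    ("ABOARD", ["ax", "b", "ao", "r", "dd"]),
    ("ABOVE",  ["ax", "b", "ah", "v"]),
    ("ADD",    ["ae", "dd"]),
    ("BOARD",  ["b", "ao", "r", "dd"]),
]

for word, phones in reverse_lexicon(forward_lexicon):
    print(word, " ".join(phones))
# For this toy lexicon, only BOARD gets #1, because its reversed
# pronunciation "dd r ao b" is a prefix of the reversed ABOARD.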

A          ax #1
ABERDEEN   n iy d er b ae
ABOARD     dd r ao b ax
ABOVE      v ah b ax
ADD        dd ae #1
BOARD      dd r ao b #1

(The drawing of the resulting lexicon transducer is not reproduced here; see the caption below.)

Figure 5.3: Reversing lexicon transducer L. The phone sequences are reversed (upper part), and new disambiguation symbols (#1) are inserted afterwards. Then, the lexicon transducer is built in the same way as in the forward network (lower part).

(Path through C with states eps-eps, eps-a, a-b, b-c, c-d, d-eps and arc labels #-1/a, eps-a-b/b, a-b-c/c, b-c-d/d, c-d-eps/$.)

Figure 5.4: One path of the context transducer C. The deterministic version [Mohri et al.(2008)] has a delay of two input symbols until the tri-phone symbol is produced. The C transducer looks identical in the forward and backward networks.

(Forward HMM arcs: self-loop/forward-transition probabilities 0.61105/0.38895, 0.44853/0.55147 and 0.524/0.476 for the three emitting states; backward HMM: the same probability pairs assigned to the states in reverse order.)

Figure 5.5: Reversal of the HMM structure for a phoneme HMM. Top: forward HMM. We apply WFST reversal, weight pushing in the log-semi-ring and epsilon removal to obtain the backward HMM (bottom). We observe that for left-to-right HMMs, the transition probabilities of the backward model are exactly those of the forward model assigned in reverse order.
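The observation from figure 5.5 can be stated compactly: for a strictly left-to-right HMM, the reversed model re-uses the same self-loop/forward probability pairs, just attached to the states in reverse order. A minimal sketch (plain Python; the probability values are the ones shown in the figure, everything else is illustrative and not the Kaldi implementation):

def reverse_left_to_right_hmm(transitions):
    # transitions: list of (self_loop_prob, forward_prob) per emitting state,
    # in forward time order; each pair must sum to one.
    # Returns the transition probabilities of the reversed HMM, which for a
    # left-to-right topology are simply the same pairs in reverse order.
    assert all(abs(s + f - 1.0) < 1e-6 for s, f in transitions)
    return list(reversed(transitions))

# values as in figure 5.5
forward_hmm = [(0.61105, 0.38895), (0.44853, 0.55147), (0.524, 0.476)]
print(reverse_left_to_right_hmm(forward_hmm))
# -> [(0.524, 0.476), (0.44853, 0.55147), (0.61105, 0.38895)]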


Figure 5.6: Reversing the HMM transducer Ha - upper part: forward transducer, lower part: reversed backward transducer. Here, we show the mono-phone case, without self-loops. For each transducer, we show two mono-phone models (aa, ae) with three-state forward HMMs (Kaldi transition-ids 2, 4, 6 and 8, 10, 12; the odd numbers are for self-loops - not shown) and the silence model (transition-ids 284-300, almost ergodic connections between states). The mono-phone models (aa, ae, silence) are reversed individually (including epsilon removal and weight pushing in the log-semi-ring) before composing Ha.

5.3 Incremental forward and backward search

5.3.1 Finding the optimal operating point

So far, we have explained how to construct a static WFST based recognition network for backward decoding. However, the approach to the construction of the backward decoding network described in this chapter is not limited to static networks. Already [Nolden et al.(2013)] applied the reversal of the components described here in a dynamic network decoder. Many recent dynamic network decoders basically compile a WFST based recognition network, but leave out one component, which is then composed dynamically. For example, [Soltau and Saon(2009)] use a uni-gram LM (more precisely, LM look-up scores) to compile a WFST based recognition network and then apply the higher-order N-gram LM dynamically. Since in our approach all components are reversed individually,

no change is necessary when dynamically composing the components. We want to replace one decoding pass with a wide beam by a forward and a backward pass with narrow beams. Thus, we must find the right operating point for the forward and backward passes. If badly chosen, the two passes will be two times slower than the single pass. The beam should be small enough to allow for substantial speed-ups, but on the other hand, the beam must be big enough to allow for a reasonable comparison of the forward and backward paths. For significant portions of the decoding, we would like to find a good path with one of the two (forward or backward) passes. If both forward and backward decoding are completely off, we have to increase the beam everywhere, and there is no advantage over the single-pass approach. Usually, a decoder does not have only a single parameter (beam width) to tune [Lowerre(1976)], but a series of parameters which are not independent of each other. The most important parameter is the global beam width, given as a log-constant, indicating how much the likelihood of partial paths can be worse than the current best partial path before the partial path gets pruned out. This is called acoustic pruning. Additionally, most decoders apply so-called histogram pruning [Steinbiss et al.(1994)]. The idea is to limit the number of hypotheses being generated at a certain point in time. This is an upper limit, which is applied mainly in portions of the speech signal with high uncertainty. If the best partial hypothesis has a low score, too many other bad hypotheses are kept. In this case, too much computational effort is spent with little chance of actually finding the correct path. Thus, by limiting the maximum number of active tokens, the computation can be significantly reduced without much affecting the word error rate. We can select a tightened beam limit based on a histogram over state hypothesis scores, hence the name histogram pruning. However, for our purposes it is sufficient to think of the tokens as being ranked. To effectively limit the number of tokens to the given upper limit (called “max-tokens”), we pick the score of the token at rank “max-tokens” and use it as a tightened beam threshold, which we call the max-tokens beam. As soon as the number of tokens exceeds the limit, this max-tokens beam width is used for the decoding. Depending on the architecture, other tuning parameters might be applied, too. [Nolden et al.(2012)] gives an overview of pruning techniques. Many decoders predict the max-tokens beam based on the max-tokens beam used in the last frame, to avoid generating tokens which will be pruned anyway. [van Hamme and van Aelten(1996)] formulate this approach as an adaptive controller. Dynamic decoders usually apply tighter beams on tokens at word ends [Steinbiss et al.(1994)]. If the decoder is implemented with a re-entrant tree and token passing, where lists of tokens are attached to a state of the search network, we can impose a limit on the maximum number of tokens assigned to each state (called LM state pruning). When dynamically composing the recognition network with higher-order language models, special techniques to deal with LM scores might be effective. [Agarwal et al.(2014)] describes the use of a language model slack2 on top of the beam to ‘smear’ the effect of the LM score over several frames.
Also, parameters like the number of N-best paths being generated or the width of the generated lattice [Povey et al.(2012)] have a significant effect on the decoding speed. In our case, we use lattices as intermediate representation and we try to find a balance between a sufficient depth to contain the relevant hypotheses and a minimum impact on the real-time factor.

2The word is used in a similar way as for the slack variables used in support vector machines.
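The interplay of the acoustic beam and the max-tokens limit described above can be made concrete with a small sketch that prunes the tokens of one frame (plain Python with hypothetical token costs; real decoders work incrementally and per state, this only illustrates the thresholds):

def prune_frame(token_costs, beam, max_tokens):
    # Costs are negative log-likelihoods, so smaller is better.  The
    # "max-tokens beam" is the distance from the best token to the token at
    # rank max_tokens; the tighter of the two thresholds is applied.
    best = min(token_costs)
    threshold = best + beam                      # acoustic pruning
    if len(token_costs) > max_tokens:
        ranked = sorted(token_costs)             # think of tokens as ranked
        max_tokens_beam = ranked[max_tokens - 1] - best
        threshold = min(threshold, best + max_tokens_beam)
    return [c for c in token_costs if c <= threshold]

# toy example: 8 active tokens, acoustic beam 10, at most 4 tokens survive
costs = [100.0, 101.5, 103.0, 104.0, 108.0, 109.5, 112.0, 125.0]
print(prune_frame(costs, beam=10.0, max_tokens=4))
# -> [100.0, 101.5, 103.0, 104.0]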

5.3.2 Tuning the beam parameters

We ran initial experiments to determine the effect of the two most important parameters: the beam width and the maximum number of active tokens applied in the histogram pruning. These two parameters are used in almost all types of decoders. We ran the experiment with the Microsoft Argon decoder (documented in [Agarwal et al.(2014)], Version 2016-02-17). It is a highly optimized dynamic network decoder, developed by Microsoft Research (mainly Geoffrey Zweig and Jasha Droppo). We report the results on the HUB5 2000 English Evaluation Speech database from LDC (“Eval 2000”). However, the word error rates reported here are not computed with the official scoring tools (NIST scoring toolkit SCTK), thus they are about 3% worse than when using this tool. The acoustic model is a deep neural network trained on a subset of Switchboard, using 1500 context-dependent tied states. For decoding, we use a tri-gram language model (7.2 million entries) that is dynamically composed. Figure 5.7 summarizes the relation between performance (word error rate - WER) and speed (real-time factor - RTF) at many different operating points (defined by a setting of beam width and maximum active tokens - called max-tokens). We observe that both parameters depend on each other in a non-trivial way. Therefore, we would have to test all possible combinations of parameters and then determine the optimal WER and the corresponding tuning parameters for each RTF. The resulting curve is sometimes called Pareto-optimal. For the forward and backward passes, we want to achieve the most accurate decoding using only a fraction of the decoding time of the single pass. Thus, starting from a point on the Pareto-optimal curve with low WER (and high RTF), we would move along the optimal curve towards lower RTF.


Figure 5.7: Finding the optimal operating point on the real-time-factor and word error rate curve, while tuning the maximum number of active tokens (max-tokens) and beam width (beam). The settings of beam width and max-tokens are grouped by lines that leave one of the parameters fixed while varying the other. All curves ’beam’ leave the beam width constant while running experiments with different values for max-tokens. For clarity, we don’t show the curves for beam width 12,13,15 which follow a similar trend. The curves ’max-tokens’ (black) measure different beam widths for a fixed number of max-tokens.

One simplified strategy that is used most often is to treat the beam width as the main parameter that is varied, and to use a high number of max-tokens which is only effective in areas of high confusion. From figure 5.7, it is evident (black lines) that this strategy is not optimal in our case. For the lower RTFs, this setting is sub-optimal, since too many tokens are created. The decoder is at the optimal operating point when only those tokens are generated that actually have a chance to become the best path. In other words, in areas of high confusion (e.g. due to noisy speech), many tokens are generated, but most probably these portions of speech will result in errors anyway. Therefore, the max-tokens beam was introduced. We should set it as low as possible, i.e. to the value from where the WER starts increasing. From this, it is clear that when decreasing the acoustic beam to tune to a lower RTF (and unfortunately higher WER), we should also decrease the max-tokens beam. For this reason, the optimal setting of max-tokens is to some extent proportional to the average number of active tokens that we get if we decode with a certain acoustic beam. For the highly optimized token-passing decoder used here, we observed that the optimal operating point is reached when we set max-tokens in such a way that for more than half of the frames, the resulting max-tokens beam is smaller than the acoustic beam. That means that max-tokens (the number of active tokens) is the dominant parameter determining the amount of computation that needs to be done for each frame. In the lower part of figure 5.8, we observe that the optimal operating point (beam 13, max-tokens 14000) is actually most of the time dominated by the max-tokens (i.e. the max-tokens beam is smaller than the beam 13 - gray line). If we assume that max-tokens is the main factor determining the computation time, another simple strategy would be to keep the beam width fixed and to change only the max-tokens. From figure 5.7, we see that as soon as the beam is wide enough (around 13-14), this is (almost) the optimal solution for a wide range of RTFs. Only for the higher WERs (above 38-39%) is this strategy slightly sub-optimal; we would have to decrease the beam width as well. What we observe is that along the Pareto-optimal curve, we have to proportionally increase both the beam width and the max-tokens. As a first approximation, it may be sufficient to choose a reasonable operating point and keep one parameter fixed while varying the other. The Argon decoder uses the idea of [van Hamme and van Aelten(1996)], who propose an adaptive controller for steering the beam width for each frame in such a way that the resulting number of tokens is approximately equal to the max-tokens parameter. Thus, for each frame a different (adaptive) beam is used, which is increased if fewer than max-tokens tokens have been generated, and decreased if too many tokens have been generated. The upper limit is the acoustic beam width. Since at the optimal operating point the max-tokens beam is the limiting parameter, it is clear that we will not improve the WER by increasing the acoustic beam width while keeping max-tokens fixed. However, what we observed when increasing the beam much further is that the WER actually increased (figure 5.7, red line, beam 30). This (perhaps) surprising non-linear effect is a particularity of the decoder that most probably results from the adaptive beam [van Hamme and van Aelten(1996)].
In the lower part of figure 5.8, we see that the resulting max-tokens beam at acoustic beam 30 is most of the time lower than the max-tokens beam at acoustic beam 13. The dynamically adapted beam (steering to follow the max-tokens beam) is sometimes narrower than necessary (i.e. under-generates).
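The controller idea can be sketched as follows (plain Python; the proportional step size and the toy token counts are assumptions made for this illustration, not the actual Argon or [van Hamme and van Aelten(1996)] update rule):

def update_adaptive_beam(current_beam, n_generated, max_tokens,
                         acoustic_beam, step=0.5):
    # Increase the beam when fewer than max_tokens tokens were generated on
    # the last frame, decrease it when more were generated; the global
    # acoustic beam is the upper limit.
    if n_generated > max_tokens:
        current_beam -= step
    elif n_generated < max_tokens:
        current_beam += step
    return max(1.0, min(current_beam, acoustic_beam))

beam = 10.0
for n_active in [9000, 16000, 20000, 12000, 8000]:   # tokens per frame (toy)
    beam = update_adaptive_beam(beam, n_active, max_tokens=14000,
                                acoustic_beam=13.0)
    print(round(beam, 1))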


Figure 5.8: Frame-wise scores for three files from the Eval2000 test set. Utterance boundaries are around frames 400 and 880. Compared are two different acoustic beam widths - 13 (black), which is about optimal for the RTF/WER trade-off, and 30 (red), which is over-shooting. Upper part: numbers of active tokens before and after the application of the histogram pruning (‘after max-tokens 30’ is not shown, as it looks very similar to ‘after max-tokens 13’). Max-tokens is set to 14000 in both cases, which is the optimal setting for beam 13. Lower part: max-tokens beam (after applying the max-tokens limit).

5.3.3 Parallel incremental forward and backward search

As introduced in section 2.6, the parallelization of the decoding of an utterance into chunks seems to be an interesting idea. According to [Maleki et al.(2014)], it is possible to split an utterance at places where the rank of the all-pairs-shortest-path matrix converges, i.e. where the matrix becomes singular. In other words, this happens at frames where just one token will survive. An open question is whether it is possible to automatically detect such frames in advance, in order to find the optimal segmentation of a given utterance into chunks. At the points with low rank, i.e. with few remaining active states, a small beam should be sufficient to decode them. In other words, at those points we would expect the decoding results of the forward and backward search to agree, even if both run with a small beam. Therefore, a good segmentation for the parallelization of the decoding is to split the utterance at points where forward and backward search agree. These thoughts lead to the approach to parallelization described here. The idea of performing a symmetric forward and backward search as introduced in section 5.1, first published in [Hannemann et al.(2013)], was used by [Nolden et al.(2013)] to implement an incremental high-level decoding algorithm that can tune the pruning beam for individual words in an unsupervised way. As opposed to [Hannemann et al.(2013)], where the results of the first pass are integrated into the second decoding pass, both passes, forward and backward search, are run independently and symmetrically.

The incremental decoding, as described in [Nolden et al.(2013)], first runs a forward and a backward decoding on the whole utterance, but with a small beam. Then the decoded words are aligned to each other. Words are considered matching if they have the same word identity as well as a matching time boundary. All non-matching words are grouped into continuous segments, which are extended by one matching word to the left and to the right. The assumption is that the acoustic alignment of words further apart than one matching word will have no effect on the alignment and acoustic score of the current word to be decoded. The identified segments of non-matching words are then decoded with an increased beam and the results are integrated into the results of the first pass. This process is iterated until the whole utterance matches. This way, the beam for each word is tuned to the minimum necessary beam. As pointed out by [Nolden et al.(2013)], for the incremental decoding of partial utterances, the left and right LM contexts of the segment need to be correctly initialized in the decoding. We also need to remember the left and right acoustic cross-word contexts, which can be achieved by remembering the states of the recognition network at the segment boundaries in the first pass – these can then serve as initial and final states for the second-pass decoding. We re-implemented the incremental forward-backward decoding in the Microsoft Argon decoder (documented in [Agarwal et al.(2014)], Version 2016-02-17), and show an additional analysis focussing on the parallelization of the approach. Since the forward and backward searches are run independently, this approach has the advantage that forward and backward search can be run in parallel. Figure 5.9 illustrates a parallel implementation of the incremental decoding. Due to the high-level nature of the incremental forward-backward decoding, even each mis-matching segment on its own can be decoded in parallel using the approaches to parallelization described in section 2.6. In figure 5.10, we show an analysis of the incremental forward-backward decoding on the Eval2000 database. We ran the experiment with the Microsoft Argon decoder (documented in [Agarwal et al.(2014)], Version 2016-02-17) and used the setup described in the last section. We observe that the overall speed-up of the technique will be determined by the setup of the first-pass (forward and backward) decoding. We choose an operating point from the Pareto-optimal curve in figure 5.7 that is several times faster than a well-tuned baseline (tuned for an RTF/WER trade-off, around RTF 0.3-0.5), but still in the area where the results of forward and backward decoding are partially matching. Using such a setting, we observe that after the first parallel forward/backward pass, on average approximately 50% of the complete utterances agree and thus the decoding can be finished. For the utterances that are partially mis-matching, we find on average around 1.5 mis-matching segments (’islands’) per file. That means we can achieve a speed-up of 1.5 on these utterances, and only a part of the utterance actually needs to be decoded again. Therefore, the total amount of time spent in the second pass will be much smaller than in the first pass, even if it runs at a higher RTF (increased beam). Similarly, the amount of time spent in further iterations will quickly decrease.
One could reduce the scheme to a two-pass decoding and directly set the beam to the single-pass beam in the second pass (parallel forward/backward), which would still result in a significant overall speed-up. The acoustic model scores in the parallel forward/backward decoding can be shared, as well as they can be shared across the iterations. The amount of time spent in calculating the acoustic scores can be significant, however, for the particular acoustic model used in the experiments (a DNN implemented on a GPU) the computation of scores consumes only ca. 20% of the time, and this could even be further reduced by further parallelization.

Figure 5.9: Parallel implementation of the incremental forward-backward decoding [Nolden et al.(2013)]. First (upper part), two cores run a quick initial forward and backward decoding of the whole utterance with a narrow beam in parallel, then (center) the results are aligned and mis-matching regions (‘islands’) are identified (indicated in red). If there are no mis-matching regions, the decoding is done. Otherwise (lower part), in a second pass, the identified mis-matching segments are decoded in parallel with a wider beam. In this example, there are two ‘islands’; both of them are decoded forwards and backwards, which means four cores can be used in parallel. The results of the decoded segments are integrated into the results of decoding the whole utterance, and this process is iterated until the results for the whole utterance match.

5.4 Tracked decoding

After using independent and parallel forward/backward decoding passes in the last section, in this section, we want to use the information gathered in the first pass (e.g. forwards) to guide the search of the second pass (e.g. backwards). In this approach, the beam width can be adjusted for every frame, so that a more careful search is only carried out in the areas where the two passes disagree. While analyzing the pruning behavior of the Kaldi decoder on the Wall Street Journal (WSJ) test set, we found that, except for a few points in time, for most of the speech frames a narrow beam is sufficient. We analyze the pruning behavior by comparing the score of the current best active token at each frame3 and the score of the token that will ultimately result in the best overall path4. Figure 5.11 explains the effect of pruning with the help of one example utterance, and figure 5.12 quantitatively analyzes the score differences between the current best and the final best path. Most of the time this difference is much smaller

3We think of a token as a record of a particular state in HCLG that is active on a particular frame and has the accumulated score of the partial path explored so far.
4To determine this, we run a full decoding, back-track the best path and compute its score at each frame.


Figure 5.10: The first iteration of the incremental forward-backward decoding on the Eval2000 test set. The test set has 1831 files; each is decoded independently with the forward and the backward decoder. Shown are the numbers of utterances that either have mis-matching total utterance scores (’different scores’) or differ in the decoded words (’different words’). We see that requiring exactly the same score is a stricter criterion than requiring that the same sequence of words is decoded. Not shown, but very close above the line ’different scores’, is the line for ’differing state sequences’, which is an even stricter criterion. To evaluate these criteria, we selected a set of operating points approximating the Pareto-optimal curve from figure 5.7 (resulting in the RTF along the x-axis). As a baseline, we assume that the single-pass decoding will run at approximately 0.3-0.5 RTF, where we start approaching the lowest WER (37.3%). Therefore, we show operating points that are two to five times faster than that (corresponding to WER 38.3%-40.3%) - this constitutes the speed-up we can expect from the technique (forward and backward decoding run in parallel). As seen from figure 5.7, going for an even lower RTF would result in a much worse WER. The number of non-matching stretches of words (’islands’, shown as a red line) is related to the number of utterances with different words (black line). The ratio is between one and two and increases slightly towards lower RTF.

than the typical beam width between 10 and 15. This suggests that it would be beneficial to identify those problematic areas (frames) and to use the wide beam only in these areas, while otherwise using a small beam. We aim to use the decoding results of an initial forward pass to identify the problematic frames in the backward pass. Based on this motivation, our approach to decoding is to do a first pass (which happens to be a forward pass) with a narrow beam, and then to do a second pass in the opposite direction, also with a narrow beam, but using knowledge obtained during the first pass. The first pass outputs a lattice with state-level alignments [Povey et al.(2012)]. Note that this lattice does not contain all partial paths explored in the first pass, but only those word sequences that are within a specified beam of the best word sequence (posterior pruning with the lattice beam). We want to treat the paths in this lattice in a special way in the second decoding. That is:

1. We want to avoid pruning out paths that appeared in the first-pass lattice.
2. On frames where we would otherwise have pruned out those paths, we want to increase the pruning beam.

Figure 5.11: Beam search: example utterance from the WSJ Nov’92 test set. We analyze partial scores of forward decoding for two different beam widths (beam 11.0 and 19.0). Looking at the score of the current best token for each frame, the absolute differences between beam 11.0 and beam 19.0 are small compared to the overall path score. Therefore, we show relative score differences: “partial diff” (yellow): the score difference of the current best tokens when decoding with beam 19.0 and 11.0; “best diff 11” (green): the difference between the current best token (beam 11.0) and the partial score of the final best path (beam 11.0) at the same frame - i.e. a quantity which is only known after finishing the decoding; “best diff 19” (magenta): the difference between the current best token (beam 11.0) and the partial score of the final best path at beam 19.0. We see (“partial diff”) that beginning around frame 325, the search with beam 19.0 found a better path, so the difference becomes negative. It is also observable (“best diff 11”) that most of the time, the current best partial score is also the score of the (future) best path, which means a small beam would be sufficient. Only in a few places is the path that is going to win off for a short time. Around frame 290, we miss the final winning path (“best diff 19”) if the beam is too small. Not immediately, but only after frame 325, this results in better overall scores (“best diff 19” vs. “best diff 11”).

5.4.1 Tracking tokens with an arc-lattice

During decoding, we need to be able to identify which active tokens in our second-pass decoder correspond to paths in the first-pass lattice. One possible way to do this would be to designate a set of context-dependent HMM states (PDF-ids) on each frame that are “special” because they appear in the first-pass lattices. However, we did not pursue this because it could lead to too many irrelevant tokens being kept in the beam. Instead, we chose to identify those paths through the second-pass decoding graph that correspond to paths in the first-pass lattice. We implemented this as a separate step, outside of the decoder code. It takes the standard output lattice from the first pass and processes it into something we call an arc-lattice, whose symbols identify arcs (see below) in our second-pass decoding graph HCLG2nd.


Figure 5.12: Histogram of score differences: Shown are the scores of the current best partial path at each frame minus the partial score of the path that is going to be the final best path, not necessarily the correct one (decode beam 13.0, WSJ Nov’92 test set at WER 10.8%).

We explain the arc-lattice generation process below (Section 5.4.3). The second-pass decoder, which we will refer to as our tracking decoder, is a lattice-generating decoder that takes an extra input5, namely the arc-lattices for each utterance. Let a token be a record of a particular state in HCLG that is active on a particular frame. Our tracking decoder gives tokens an extra Boolean property that identifies whether they are tracked or not. A tracked token is one that corresponds to a state in the arc-lattice. Tracked tokens are never pruned. Tracked tokens are also used to determine the pruning beam used on each frame.

5.4.2 Beam-width policy

For the second-pass decoding with the tracking decoder, we use the tracked tokens to determine the beam width to use for each frame. Here we describe the policy we use to set the beam width. The decoder has three configurable values that specify how it sets the frame-specific beam: the beam, the max-beam and the extra-beam. On a particular frame, let the score difference between the highest-score token and the lowest-score tracked token be D. Then the beam width on that frame is given by:

max(beam, min(max-beam,D + extra-beam)).

Figure 5.13 illustrates the beam width policy. Unless otherwise specified, we let extra-beam be zero and max-beam be large6; we try various values of the beam for our experiments here7.

5Usually, the inputs are the decoding graph HCLG, the acoustic model and the acoustic features.
6This is system-specific. We e.g. selected 100 for this task in Kaldi, although this may be too large.
7For a few utterances, the decoding does not terminate in a final state when decoding with a small beam. This poses a problem for the reversal and the creation of the arc-lattice. In these cases, we used an increased final-beam in order not to prune away the path that leads to the final state.
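The policy itself is a one-liner; the following sketch evaluates it for a few hypothetical values of D (plain Python; the numeric settings are illustrative, only the formula comes from the text above):

def frame_beam(D, beam, max_beam, extra_beam=0.0):
    # D: score difference between the highest-score token and the
    # lowest-score tracked token on the current frame.  Tracked tokens are
    # never pruned, irrespective of the returned beam.
    return max(beam, min(max_beam, D + extra_beam))

# typical settings from the experiments: extra-beam 0, max-beam large
print(frame_beam(D=3.5,   beam=9.0, max_beam=100.0))  # 9.0   (tracked tokens already inside the beam)
print(frame_beam(D=17.2,  beam=9.0, max_beam=100.0))  # 17.2  (beam widened to include all tracked tokens)
print(frame_beam(D=250.0, beam=9.0, max_beam=100.0))  # 100.0 (capped at max-beam, tracked tokens still kept)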

Figure 5.13: Tracked decoding example illustrating the beam width policy. The illustration of forward and backward search is repeated from figure 5.2 [Nolden et al.(2013)]. (a) Single-pass backward decoding in reversed time direction; shows the accumulated scores of the best path. Towards the left, the partial acoustic scores are worse, thus the accumulated log-scores increase faster. (b) Single-pass forward decoding. At the beginning good acoustic scores, but towards the end the partial log-scores increase faster. The overall path is worse due to pruning. (c) Backward-forward tracked decoding: paths in the first-pass lattice (red) are time-reversed and tracked. Since the scores of the tracked tokens are farther off than the initial second-pass beam, the beam is increased to include all tracked tokens, plus an extra beam. If the beam exceeds the max-beam, it is not increased further, but all tracked tokens are still kept.

Regardless of the beam width, we never prune away the tracked tokens. Note that even if we keep the beam equal to the single-pass beam during the tracked second pass, our method is doing more than simply choosing the best path from two (forward and backward) passes, because paths found by the first-pass search can “recombine” with paths that were found by the second-pass search. Some parts of the utterance might have scores similar to figure 5.13, i.e. be advantageous for backward decoding; other parts might have the opposite characteristic. If partial paths of tracked tokens and second-pass tokens meet in the same state, they can recombine and thus we continue decoding the rest of the utterance with the maximum of the two partial scores (likelihoods). Therefore, the combined path can have a better score than either of the two single paths.

5.4.3 Generation of the arc-lattice

As mentioned above, the arc-lattice is a special kind of lattice that allows us to identify arcs in HCLG2nd that were present in the first-pass lattice. This means there is a path in the lattice that went through the corresponding state in HCLG1st at the given time. The arc-lattice is an acceptor FST, i.e. it has only one symbol on each arc. These symbols correspond to arcs in HCLG2nd. We first construct a mapping between integers and the individual arcs in HCLG2nd; this involves creating tables for mapping pairs of (node, arc) to integers, because the product of (#states) × (maximum #arcs) may be greater than the 32-bit integer range. We now describe how we create the arc-lattice. First, let us point out that the standard Kaldi lattices [Povey et al.(2012)] (and also HCLG) are WFSTs whose input symbols correspond to integers called transition-ids and whose output symbols correspond to words. The transition-ids may be mapped to PDF-ids, which correspond to context-dependent HMM states (the transition-ids contain more information about the exact transition used, but this is not needed here). We first map the transition-ids in the input lattice to PDF-ids, and also map the input symbols of HCLG2nd from transition-ids to PDF-ids. This is necessary because the order of self-loops versus “forward transitions” differs between the forward and backward graphs, which makes the sequences of transition-ids differ even for paths that are “really” the same; this issue does not arise with PDF-ids. We then change the output symbols of HCLG2nd (which were previously words) to symbols identifying the arc in HCLG2nd (the integer mapping to (node, arc) pairs). Let the resulting FST be called HCLGarc; it has the same structure as HCLG2nd but different labels on the arcs. After doing the symbol mappings described above, we reverse the first-pass lattice LAT1st to retrieve the labels in reversed time order and obtain LATrev. We map the input labels from transition-ids to PDF-ids to correct for the self-loop order, “project it on the input”, which means we keep only the input labels (PDF-ids), and then we remove the weights (they will be contained in HCLG2nd) and remove epsilon arcs. Now, we can compose LATrev ◦ HCLGarc to obtain a transducer from PDF-id sequences in the lattice (input) to sequences of symbols for HCLG2nd arcs (output). Lattice determinization [Povey et al.(2012)] is an operation in a special semi-ring that keeps only the best path for a symbol sequence (e.g. the best segmentation), but assigns it the weight of all paths with that symbol sequence. We apply lattice determinization to the resulting transducer to retain only the best path for each sequence of PDF-ids in the lattice. As a result, for each sequence of PDF-ids, we have a single path of HCLG2nd arcs. Then, we project on the output, i.e. we keep only the output labels corresponding to arcs

in HCLG2nd, and we determinize again8 – this time on the output labels, i.e. we keep only the best path for each arc sequence. The result is an acceptor lattice for HCLG2nd arcs which we call LATarc. Since the first-pass lattice contains the alignments (the sequence of PDF-ids), the resulting arc-lattice also contains timing information (it is a trellis). The timing information is represented in the sequences of HCLG2nd arcs. For example, we see sequences of repeated arcs on self-loops, followed by a forward arc. Algorithm 2 summarizes the arc-lattice generation.

Algorithm 2 Generation of arc-lattices (graph-state-lattices):

1. Map HCLG2nd to the PDF-to-Arc transducer HCLGarc:

(a) HCLG2nd: transduces PDF-ids into words.
(b) Encode HCLG2nd (node-id, arc-id) into output symbols.
(c) Map input to be self-loop order independent.

2. Map first-pass lattice LAT1st to LATrev:

(a) Map input (self-loops), project on input, remove weights.
(b) Time-reverse lattice and remove epsilons.

3. Compose: LATarc = LATrev ◦ HCLGarc:

(a) Obtains sequences of HCLG2nd arcs for each PDF sequence in the lattice.
(b) det(LATarc): Lattice-determinize (on PDF-ids) in the special semi-ring → a single HCLG2nd path is left for each sequence of PDFs.
(c) Project to HCLG2nd (node, arc) symbols, determinize again.

→ The output is an acceptor lattice for HCLG2nd graph arcs.
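The encoding of (node, arc) pairs into symbols mentioned in step 1(b) can be illustrated with a small lookup table (plain Python sketch; the class and its consecutive id assignment are an assumption for illustration, not Kaldi's actual symbol handling):

class ArcSymbolTable:
    # Maps (state, arc-index) pairs of HCLG2nd to consecutive integer
    # symbols.  A plain product encoding state * max_arcs + arc could exceed
    # the 32-bit symbol range for large graphs, so a table is used instead.
    def __init__(self):
        self._to_id = {}      # (state, arc_index) -> symbol id
        self._to_pair = []    # symbol id -> (state, arc_index)

    def symbol(self, state, arc_index):
        key = (state, arc_index)
        if key not in self._to_id:
            self._to_id[key] = len(self._to_pair) + 1   # 0 reserved for epsilon
            self._to_pair.append(key)
        return self._to_id[key]

    def pair(self, symbol_id):
        return self._to_pair[symbol_id - 1]

table = ArcSymbolTable()
print(table.symbol(42, 3))    # -> 1
print(table.symbol(42, 4))    # -> 2
print(table.pair(2))          # -> (42, 4)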

During decoding, a token is tracked and never pruned if it was reached by a sequence of HCLG2nd arcs in the arc-lattice that corresponds to a path in the first-pass lattice. One way to think of it is that at each time step, there is a set of states which we should keep under any circumstances. Since we explore backwards and with a wider beam than in the forward pass, it is possible that these states are reached by other paths than those used in the arc-lattice, and that these paths (and their corresponding tokens) have a better score than those following the arc-lattice. In this case, the tokens recombine, i.e. we only keep the better token. We implemented it in such a way that the winning token inherits the status of being tracked, so that we still keep tracking the path. It needs to be pointed out that the implementation with the external creation of the arc-lattices is just one possibility. It would also be possible to compute a mapping of graph states between HCLG1st and HCLG2nd at the time of graph construction, and to provide this mapping together with the two graphs as input to the tracked decoding. This implementation has the advantage that fewer changes to the decoder had to be made and that the memory consumption is smaller.

8Using the standard determinization algorithm.
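The recombination rule described above can be sketched as follows (plain Python; the field names are illustrative and not Kaldi's actual token structure):

from dataclasses import dataclass

@dataclass
class Token:
    state: int        # state in HCLG2nd
    cost: float       # accumulated cost (lower is better)
    tracked: bool     # reached via a path from the first-pass arc-lattice

def recombine(a: Token, b: Token) -> Token:
    # The better-scoring token survives; if either token was tracked, the
    # survivor inherits the tracked status so the path keeps being tracked.
    assert a.state == b.state
    winner = a if a.cost <= b.cost else b
    return Token(winner.state, winner.cost, a.tracked or b.tracked)

t1 = Token(state=7, cost=123.4, tracked=True)    # tracked, but worse score
t2 = Token(state=7, cost=119.8, tracked=False)   # found by the backward search
print(recombine(t1, t2))   # Token(state=7, cost=119.8, tracked=True)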

5.4.4 Experimental results

We tested the proposed forward-backward tracked decoding on the Wall Street Journal (WSJ) November’92 open-vocabulary test set (333 utterances) using a standard tri-phone HMM+GMM system (Kaldi recipe ’tri2a’ [Povey et al.(2011)], trained on the ’si84’ portion of WSJ). The experiments were conducted with the extended 146k vocabulary using the pruned tri-gram language model ’bd tgpr’ that was trained on all WSJ training texts. The lattices [Povey et al.(2012)] were generated with a lattice beam of 4.0. We can detect and evaluate search errors by aligning the recognition outputs to a decoding with a very wide beam. We align the results of both forward and (reversed) backward decodings with the wide-beam decoding. Table 5.1 shows an example of such an alignment. We implemented a 4D-Levenshtein edit-distance algorithm for that purpose. Table 5.2 confirms the intuition that forward and backward search errors are independent. With the help of the tracked forward-backward decoding, most of the search errors were eliminated.

f: BRIAN J. KILLING CHAIRMAN OF BELL - ATLANTA X. INVESTMENT ..S.....S.
b: BRIAN J. DAILY CHAIRMAN OF BELL AND LAND SIX INVESTMENT ...... ISS.
p: BRIAN J. DAILY CHAIRMAN OF BELL - ATLANTA ITS INVESTMENT ......
w: BRIAN J. DAILY CHAIRMAN OF BELL - ATLANTA ITS INVESTMENT
r: BRIAN J. KELLY CHAIRMAN OF BELL - ATLANTIC’S INVESTMENT

Table 5.1: Analysis of search errors on the WSJ Nov’92 test set by aligning forward and backward search errors (with beam 11.0) against a decoding with a wide beam (29.0). Shown are the outputs of forward decoding (f), backward decoding (b) and forward-backward ’ping-pong’ decoding (p), aligned to a decoding with a very wide beam (w) and the reference transcription (r). The search errors are indicated by ’I’ for insertion, ’S’ for substitution and ’-’ for deletion.

beam width   forward errors   backward errors   co-occur   ping-pong
11.0         144              230               32         14
13.0         84               108               14         6

Table 5.2: Analysis of search errors on the WSJ Nov’92 test set by aligning against a wide beam (29.0). The co-occurrence of an error (’co-occur’) means that both the forward and the backward pass made an error at the same alignment position. It does not necessarily mean that both produced the same error. With the two-pass ’ping-pong’ decoding, all independent search errors were corrected (all those that are not co-occurring), and even a good portion of the co-occurring errors could be removed.
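The marking of search errors in tables 5.1 and 5.2 can be reproduced, in simplified form, with a standard two-sequence Levenshtein alignment against the wide-beam output (plain Python sketch; the thesis uses a 4-dimensional variant that aligns several outputs simultaneously):

def align(hyp, ref):
    # Dynamic-programming edit distance, then back-tracking to label each
    # position with '.' (correct), 'S' (substitution), 'I' (insertion) or
    # '-' (deletion), following the convention of table 5.1.
    n, m = len(hyp), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    labels, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            labels.append('.' if hyp[i - 1] == ref[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            labels.append('I'); i -= 1
        else:
            labels.append('-'); j -= 1
    return ''.join(reversed(labels))

wide = "BRIAN J. DAILY CHAIRMAN OF BELL - ATLANTA ITS INVESTMENT".split()
fwd  = "BRIAN J. KILLING CHAIRMAN OF BELL - ATLANTA X. INVESTMENT".split()
print(align(fwd, wide))   # -> ..S.....S.  (the two forward-pass search errors of table 5.1)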

We measured the total elapsed time for the two-pass forward and backward (tracked) decoding and relate it to the word error rate (WER). The real-time factor was measured on a single core of an Intel(R) CPU i5-2500 (3.3GHz, 8GB RAM). The results in figure 5.14 show that for the lowest word error rates (WER < 10.5%), the two-pass tracked decoding runs about 2-3 times faster than the individual forward/backward passes at the same WER. This corresponds to the “more accurate” operating points of decoding where search errors are small. However, in this setup, the speed-ups are diminishing for operating points faster than ≈ 0.6 real-time using our method. The issue seems to be that if the beams are


Figure 5.14: Performance of tracked decoding: shown are curves for word error rate vs. real-time factor on the WSJ Nov’92 test set. For single-pass decodings, the beam varies between 10-18; for the two-pass (‘pingpong’) decoding the beam varies between 7-13. We used extrabeam = 0 and found maxbeam = 2·beam to be a good compromise between speed and accuracy. The lattice-beam is 4.0, but for beam < 10.0 we decrease it step-wise by 0.5 down to 0.5. As will be explained in section 5.4.5: we compare the variable-beam decoding (‘2beam var’, orange) to a decoding without generating extra tokens in the variable beam (‘noextra’, red) by setting maxbeam = beam, which shows the additional benefit of the variable beam over just combining the lattices of the forward and backward passes.

too narrow, the two decoding passes disagree substantially and too much effort is spent in decoding with a widened beam in areas that disagree. Also, [Nolden et al.(2013)] points out that a too narrow beam could lead to a degenerated search, where both passes produce the same errors (e.g. focussing on silence and noise models, which are symmetric). The WER curve in figure 5.14 is not always smooth, which points to the fact that fixing a search error does not necessarily mean fixing a word error.

5.4.5 Importance of beam parameters

The proposed decoder has several parameters to tune: forward beam, backward beam, lattice-beam, extra-beam and max-beam. This section analyzes the importance and typical settings of those parameters. Since the WER-RTF curves for single-pass forward and backward decodings are similar, we typically set the forward beam and the backward beam to the same value. In the backward pass (tracked decoding), we have three types of tokens:

• Tokens that are generated in the normal way within the narrow baseline beam.
• Tracked tokens, which are never pruned.
• Extra tokens, which are generated due to the increased variable beam, i.e. the difference between the best token and the worst tracked token plus the extra-beam (see the sketch after this list).
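The per-frame pruning cutoff implied by this beam-width policy can be sketched in a few lines. This is a simplified illustration of the policy of section 5.4.2, not the Kaldi implementation (which lives in GetCutOff() of the tracking decoder); the function and parameter names are chosen here for readability.

    # Sketch of the variable-beam pruning policy for one frame of the second
    # (tracked) pass. Costs are negative log-probabilities, so smaller is better.
    # Tracked tokens themselves are kept regardless of this cutoff.
    def pruning_cutoff(best_cost, worst_tracked_cost, beam, extra_beam, max_beam):
        """Return the cost threshold above which (untracked) tokens are pruned."""
        if worst_tracked_cost is None:
            # No tracked tokens on this frame: use the narrow baseline beam.
            effective_beam = beam
        else:
            # Widen the beam just enough to cover the worst tracked token,
            # plus a safety margin (extra-beam), but never beyond max-beam.
            variable_beam = (worst_tracked_cost - best_cost) + extra_beam
            effective_beam = min(max(beam, variable_beam), max_beam)
        return best_cost + effective_beam

    # Example: a tracked token sits 9.2 units behind the best token, so with
    # beam=8.5, extra_beam=0 and max_beam=17 the beam is temporarily widened.
    print(pruning_cutoff(best_cost=120.0, worst_tracked_cost=129.2,
                         beam=8.5, extra_beam=0.0, max_beam=17.0))  # -> 129.2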

Looking at the different components of the beam-width policy (section 5.4.2), there seem to be two strategies one could pursue: either track many tokens and try to combine good forward and backward paths while limiting the generation of extra tokens, or track only a few tokens and generate many extra tokens up to the variable-beam difference.


Figure 5.15: Profiling the tracked two-pass decoding on a single CPU core. Shown is the percentage of time spent in different parts of the algorithm at three operating points (beam 8.5 as optimal, the others as non-optimal). The first pass is the lattice-generating 'forward search' (which is also our single-pass baseline), and the second pass can be seen as consisting of a) normal backward decoding (column 'backward search'), b) generating the arc-lattice ('arc-graph'), c) additionally tracking tokens from the first pass ('tracking') and d) generating extra tokens within the increased variable beam ('extra beam'). The acoustic scores were not cached between the two passes. The contributions of 'arc-graph' and 'tracking' (together < 20%) could possibly be reduced by a better implementation, but the two individual passes constitute a lower bound (around 70% of the time is spent there). For beam 7.0, we used lattice-beam 1.0 and got 11.36% WER at 0.85 RTF. For beam 8.5: lattice-beam 4.0, 10.38% WER at 1.06 RTF. For beam 12.0: lattice-beam 4.0, 10.23% WER at 2.63 RTF.

To analyze the importance of the extra tokens, we can compare the proposed tracked decoding using the variable beam (the distance of the best active token to the worst tracked token plus the extra-beam) to a decoding without extra tokens. We achieve this by limiting the beam with max-beam = beam, thus effectively disabling the variable beam. Since tokens 'tracked' by the first-pass lattice are kept anyway, this effectively corresponds to combining the lattices of the forward and backward pass. Figure 5.14 ('2*beam var' vs. 'noextra') shows that creating extra tokens within the variable beam gives a substantial improvement on top of that. The extra tokens are thus important, especially for the operating points with low WERs.

To gain insight into the optimal size of the forward/backward beam, we profiled the tracked decoding in figure 5.15. We observe that the time spent in the two individual decoding passes (without tracking / extra tokens) is the dominant factor, so we want to keep this value small. However, if we reduce the beam too much, we observe (figure 5.14, for error rates > 11.5%) that the two-pass decoding is no longer better than the single-pass decoding. From figure 5.15 we see that, for narrow beam widths, most of the time is consumed by the generation of extra tokens, which effectively means decoding with a wider beam. Below a certain beam width (around 11% WER in figure 5.14), the error rates of the single passes grow rapidly with only little RTF to gain. This means that the divergence between the best paths from forward and backward decoding is too big, so that the algorithm has to increase the variable beam a lot to track the first-pass tokens. The max-beam parameter limits the variable beam, so that in these situations the decoding is not slowed down too much (at the cost of a higher WER). With an optimal setting of the beam, we can reach a significant WER decrease by generating only a small number of extra tokens in the variable beam ('extra beam' in figure 5.15, optimal around beam 8.5). This point corresponds to the turning point in figure 5.14 (around RTF 1.0) - it is the 'sweet spot'. Above it, although little time needs to be spent on tracking and on generating extra tokens, too much time is spent in the individual forward/backward decodings, and the overall RTF increases rapidly.

0.9 "rt_wer_forward" 0.8 "rt_wer_backward" "rt_wer_pingpong0" 0.7 "rt_wer_pingpong1" "rt_wer_pingpong2" 0.6 "rt_wer_pingpong4"

0.5

0.4 realtime factor 0.3

0.2

0.1

0 10 10.2 10.4 10.6 10.8 11 11.2 11.4 word error rate

Figure 5.16: Testing the extra-beam: WER vs. real-time factor on the WSJ Nov'92 test set using the bi-gram LM with the 5k vocabulary. We set the lattice-beam to 6.0 and the max-beam to 100.0 and varied the extra-beam from 0.0 to 4.0 ('pingpong0...4'). All settings of the extra-beam resulted in very similar curves.

In an experiment using a smaller vocabulary, we tested the influence of the extra-beam parameter. Figure 5.16 suggests that this parameter does not have any significant influence; increasing it seems to have a similar effect to simply decoding with a wider beam. Therefore, we set the extra-beam to zero in the further experiments (also in figure 5.14).

Next, in figure 5.17, we investigate the effect of the lattice-beam. Pruning the lattice with different beams and generating the corresponding arc-lattice mainly has the effect that larger lattices result in a higher RTF, which is visible in the region of higher WER. Thus, we want to make the lattices as small as possible. However, for the most accurate operating points with low WER, we want a wider lattice that is more likely to contain the best path. Figure 5.17 suggests that a good strategy is to increase the lattice-beam linearly with the beam. We can set an upper bound of 4.0, which is enough to get good results when re-scoring the lattice.

Finally, after tuning all other parameters, we investigate different settings of the max-beam parameter. Figure 5.18 suggests that the exact setting of max-beam does not influence the potential speed-up of the technique (the 'sweet spot'), but mainly influences the shape of the curve from the 'sweet spot' towards the higher WER. Using no limit for the beam, even for huge divergences between the forward and backward pass, seems wasteful. Therefore, it seems reasonable (figure 5.18) to increase the max-beam slowly with the increasing beam. Once a reasonable beam has been reached, the divergence between the forward and backward passes gets smaller, and the max-beam is no longer needed.
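These tuning heuristics can be collected into a small helper that returns one parameter set per operating point. The following is an illustrative sketch only, not code from the thesis: the lattice-beam ramp follows the linear increase from the caption of figure 5.17 (capped at 4.0 as suggested above), max-beam follows the 2·beam rule, and the extra-beam is fixed at zero; individual experiments in this chapter used slightly different constants.

    # Illustrative beam-parameter schedule per operating point (not thesis code):
    # extra-beam fixed to zero, max-beam tied to the beam (2*beam), and the
    # lattice-beam increased linearly with the beam, capped at 4.0.
    def beam_schedule(beam):
        extra_beam = 0.0                                # figure 5.16
        max_beam = 2.0 * beam                           # figure 5.18
        lattice_beam = min(4.0, max(0.5, beam - 6.0))   # linear ramp, fig. 5.17
        return {"beam": beam, "lattice_beam": lattice_beam,
                "extra_beam": extra_beam, "max_beam": max_beam}

    for b in (7.0, 8.5, 11.0):
        print(beam_schedule(b))
    # e.g. beam 8.5 -> lattice-beam 2.5, max-beam 17.0, extra-beam 0.0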

84 4 "rt_wer_pingpong4" "rt_wer_pingpong5" 3.5 "rt_wer_pingpong6" "rt_wer_pingpong_var" 3

2.5

2 realtime factor

1.5

1

0.5 10 10.2 10.4 10.6 10.8 11 11.2 11.4 word error rate

Figure 5.17: Analyzing the effect of the lattice-beam on the WSJ Nov'92 test set using the big bi-gram LM with the 147k vocabulary. We set extra-beam = 0.0 and max-beam = 100.0 and varied the lattice-beam from 4.0 to 6.0 (curves 'pingpong{4,5,6}'). Then we tried increasing the lattice-beam linearly from 0.5 to 5.0 (curve 'pingpong var'), i.e. we started with lattice-beam 0.5 at beam 6.5 and increased it until we reached 5.0 at beam 11.0.

[Figure 5.18: plot of real-time factor (y-axis, 0.5 to 3.5) vs. word error rate (x-axis, 10 to 11.8); curves: 'pingpong max100 var', 'pingpong max20 var', 'pingpong 2*beam var'. See caption below.]

Figure 5.18: Analyzing the effect of different max-beam settings on the WSJ Nov'92 test set using the big bi-gram LM with the 147k vocabulary. As already explored, we set extra-beam = 0.0 and linearly increased the lattice-beam from 0.5 to a maximum of 4.0. We compare three strategies of setting max-beam: a) a fixed max-beam of 100.0, b) a fixed max-beam of 20.0, c) changing the max-beam linearly with the beam: max-beam = 2·beam. We also tried max-beam = beam, which performed slightly worse for WER > 11.0. Using a fixed max-beam leads to a slight increase of the RTF for the lowest WER, which indicates that too many extra tokens are generated due to the variable beam.

5.5 Conclusions

We proposed how to integrate information from two symmetric decoding passes, decoding forwards and then backwards in time. To implement this, we needed to construct reverse decoding networks that assign exactly the same scores as the forward decoding. We explored two implementations: one approach using an incremental decoding that can easily be parallelized, and another approach that allows a more fine-grained steering of the beam by tracking the paths from a first-pass lattice in the second pass. More specifically, in the second pass of tracked decoding, we modify the pruning behavior of the decoder to treat tokens that were part of successful paths in the first pass specially, and to increase the decoding beam for parts of the utterance where the forward and backward decodings disagree. Our decoding method results in a roughly two- to three-fold speed-up.

The proposed speed-up method can be applied in any ASR-based technology, for example in the fast generation of lattices for audio indexing. The tracked decoding could also be used to generate lattices that contain certain desired paths (e.g. the reference forced alignment for discriminative training). Our algorithms use the WFST approach [Mohri et al.(2008)] to speech recognition. For the tracked decoding, other speed-up techniques such as acoustic look-ahead [Nolden et al.(2011)] and various types of fast acoustic score computation are also applicable. We expect that those methods can be combined with the technique described here and bring complementary speed-ups.

Chapter 6

Conclusions

6.1 Summary of the findings

In this thesis, we have introduced the idea of symmetrically decoding forwards and backwards in time. For tasks like LVCSR decoding, the search space cannot be explored exhaustively. For some tasks, the pruned backward search is more efficient than the forward search. Moreover, we showed experimentally that the search errors of forward and backward search are mutually independent: forward search prunes based on the "history" and backward search prunes based on the "future". To be able to concentrate on search errors rather than on modeling errors, we require both decoding passes to be symmetric, i.e. both models are equally powerful and are constructed to assign exactly the same probabilities to hypotheses (paths, word sequences). The symmetry of both passes allows us to compare the recognition results of forward and backward decoding; each difference detects a search error. We have shown that, for most of the time frames in beam search decoding, a very narrow beam is sufficient to keep the final best path. Therefore, we are able to decode with a variable beam width: we use a small baseline beam and only increase it in places where the forward and backward searches disagree.

One possible realization of the variable-beam-width decoding is to run the forward and backward passes in parallel and to iteratively refine the decoding (by increasing the beam width) in places where both passes disagree. We showed that, for about 50% of the utterances, the results already match after the first iteration. For the remaining utterances, the stretches of mis-matching words (on average 1.5 per utterance) can be decoded in parallel. This approach is very similar to chunk-based decoding and is a high-level technique that can be applied on top of other coarse-grained and fine-grained parallelization techniques.

Another realization of the variable beam width is the tracked decoding presented in this thesis, which runs forward and backward decoding sequentially. During the second pass (tracked decoding, backwards), we are able to identify which active tokens correspond to paths that were present in the first-pass lattice. These are called tracked tokens and they are never pruned, regardless of the beam width. We track tokens with an acceptor lattice of graph-states of the backward decoding graph, which is generated from the first-pass lattice with a series of WFST operations. Tracked tokens are used to determine the variable pruning beam for each frame. In places where disagreement is detected, the beam is increased to include all tracked tokens. Otherwise, the second pass uses the same narrow beam as the first pass.

Even if we do not increase the beam in the second pass, our method does more than simply choosing the best path from the two passes, because it is possible to "recombine" partial paths from the first-pass and second-pass search (effectively combining the forward and backward lattices). On top of that, the variable beam leads to the generation of extra partial hypotheses in areas where both passes disagree, which gives an additional speed-up. Tracked decoding leads to a 2-3 times speed-up compared to a single-pass forward decoding. Since most of the time is spent in the forward and backward decoding with the narrow beam, this beam determines the possible speed-up. It should be small enough to decode at least two times faster than the original single pass, and it should be wide enough to allow for a reasonable comparison of the forward and backward search results, i.e. either of the two passes should obtain a solution that is at least partly correct. If we decrease the beam below a critical threshold, the speed-up vanishes, since an excessive number of extra tokens is generated. Thus, we introduce an upper limit to the variable beam, which becomes effective in the areas of higher word error rates. We show that the main tuning parameters, which are the (log) beam width and the maximum number of active tokens for the histogram pruning, are dependent on each other.

Reversal of the recognition network

To construct the backward recognition network, it is not sufficient to apply WFST reversal to the forward network, since this results in highly non-deterministic structures. It is necessary to construct reverse models for each component separately and to compose the components in the same way as in the forward network. It turned out that the transducers for the HMM structure, the context-dependency and the pronunciation lexicon are rather easy to reverse; the reversal of the LM transducer, however, is difficult. The stochasticity of outgoing arcs is not preserved when reversing the model, i.e. the optimal weight distribution for backward search is different from the one used in forward search. Therefore, we have to apply weight pushing to the reversed components. Our approach to the construction of backward recognition networks is not limited to static network decoders: since all components are reversed individually, no change is necessary when dynamically composing the components in a dynamic network decoder.

To represent N-gram LMs as WFSTs, an approximate structure is necessary, since a fully connected model is prohibitive. When representing back-off arcs as either failure arcs or epsilon arcs, we actually violate the assumptions of the WFST algorithms: when using failure arcs, the semi-ring concept is changed and a new class of algorithms is needed; when approximating back-offs with epsilon arcs, non-determinism is introduced. If the weights are taken from back-off LMs, the weight of cycles can be greater than one, which results in an infinite total weight. A general weight pushing algorithm is based on the shortest-path algorithm in the given semi-ring. The (log) probability semi-ring is not closed (due to cycles); therefore, an approximate iterative weight pushing algorithm is used as the standard weight pushing (e.g. in OpenFST), whose convergence depends on the weight of a loop, which must be smaller than one. However, this is not the case for WFSAs resulting from back-off LMs, and the weight pushing algorithm will not converge.

We presented an alternative weight pushing algorithm, which always converges. Similar to the power method for finding the dominant eigenvector of a matrix, we use the Perron theorem to obtain the dominant right eigenvector of the transition matrix of an ergodic WFST. This vector represents the minimum distance towards the final states (stationary state distribution), which we can use as the potential function in re-weighting. This results in pushing the weights towards the initial state and making the WFSA output stochastic. More precisely, the outgoing arcs sum to the same quantity for all states, which means that the total weight, which causes the standard algorithm to fail, is now uniformly "smeared" over the whole WFSA. In practice, our algorithm is an order of magnitude faster than the more generic conventional weight pushing algorithm.
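The core of this alternative weight pushing can be illustrated with a few lines of code. The sketch below is a simplified version under stated assumptions (a single ergodic WFSA given as a dense transition matrix in the probability domain); it is not the Kaldi implementation, but it shows the two steps described above: power iteration for the dominant right eigenvector, and re-weighting of the arcs with that vector as a potential function.

    import numpy as np

    # Simplified sketch of weight pushing via the power method (probability
    # domain; M[i, j] is the total arc weight from state i to state j of an
    # ergodic WFSA with strictly positive transition weights).
    def alternative_weight_push(M, iters=100):
        n = M.shape[0]
        v = np.ones(n) / n
        for _ in range(iters):            # power iteration: v <- M v, normalized
            v = M @ v
            v /= v.sum()
        lam = (M @ v)[0] / v[0]           # dominant eigenvalue (total weight)
        # Re-weight every arc with the potential v: w'(i, j) = w(i, j) * v[j] / v[i].
        pushed = M * v[np.newaxis, :] / v[:, np.newaxis]
        return pushed, lam, v

    # Toy example: a 2-state cyclic automaton whose outgoing weights do not sum
    # to one. After pushing, every state's outgoing weights sum to the same
    # quantity lam, i.e. the total weight is uniformly spread over the WFSA.
    M = np.array([[0.2, 0.9],
                  [0.6, 0.4]])
    pushed, lam, v = alternative_weight_push(M)
    print(pushed.sum(axis=1))   # identical row sums (= lam) for all states
    print(lam)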

The most difficult component to reverse is the WFST resulting from the back-off LM. We require that it assigns exactly the same probabilities as the forward LM. To guarantee an optimal search, the backward WFST should also be deterministic, stochastic and of minimal size; thus, simple WFST reversal is not sufficient. We derive the construction of a backward LM satisfying these requirements, which is valid both when using exact back-off models with failure arcs and when approximating them with epsilon arcs. The constructive approach to obtain the backward LM consists of applying the N-gram probabilities with a delay and of switching the functions of labeled word arcs and back-off arcs. We also explain the origin of missing N-grams and how to represent them correctly in the backward LM. With the help of a series of weight pushing operations and changes in the representation of the probabilities, where each step guarantees WFSA equivalence, we show that our LM reversal algorithm can also be derived step by step. By applying the constraint that the joint word probabilities should be the same for the forward and backward LM for all N-gram orders, we are able to show that the same algorithm can be derived from Bayes' rule. The application of weight pushing to the resulting backward LM is crucial for optimal performance. We compared this 'exact' LM with a backward LM trained on the reversed training texts. The performance of both is very similar, except for low word error rates, where the exact model performs better, i.e. closer to the forward LM.
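For the lowest order, the Bayes-rule view of this reversal can be written out explicitly. The following display is only a sketch of the bigram case with shared unigram marginals (the thesis derives the general N-gram case including back-off weights): requiring equal joint probabilities for the forward and the backward model,

    \[ P_{\mathrm{fwd}}(w_1)\,P_{\mathrm{fwd}}(w_2 \mid w_1) \;=\; P(w_1, w_2) \;=\; P_{\mathrm{bwd}}(w_2)\,P_{\mathrm{bwd}}(w_1 \mid w_2), \]

with P_bwd(w) = P_fwd(w), immediately gives the backward bigram probability

    \[ P_{\mathrm{bwd}}(w_1 \mid w_2) \;=\; \frac{P_{\mathrm{fwd}}(w_1)\,P_{\mathrm{fwd}}(w_2 \mid w_1)}{P_{\mathrm{fwd}}(w_2)}. \]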

6.2 Future work

The proposed speed-up method can be applied in any ASR-based technology, e.g. in the fast generation of lattices for audio indexing, and the tracked decoding could be used to generate lattices that contain desired paths, such as the forced-alignment reference for the discriminative training of acoustic models. In addition to decoding forwards and backwards in time, depending on the task, there might be other ways of decoding that result in independent search errors and thus lead to additional speed-ups.

The alternative weight pushing algorithm was derived under certain assumptions. In particular, we assume that all arcs in the WFST are of the same type. However, there are "emitting" arcs with a word label and "non-emitting" arcs representing e.g. the back-off arcs. An open problem is to derive a weight pushing algorithm that respects the special semantics of back-off arcs. Under this correct interpretation, if the back-off LM is correctly normalized, the total weight of the transducer will be one, and we avoid the negative log-probabilities that result from pushing weights greater than one. The original Kaldi recipe for the construction of recognition networks [Povey et al.(2011)] used the assumption that all components are stochastic, which eliminates the necessity for weight pushing. We want to find a derivation of the exact LM reversal that directly produces a properly normalized stochastic WFST.

There is some inconsistency between the algorithms for decoding graph construction, which usually assume the log semi-ring, and the decoding algorithms, which use the tropical semi-ring. Together with the different interpretations of the failure/epsilon arcs, this opens several dimensions of design choices, and the different options should be explored systematically to find a consistent framework for decoding graph construction that results in an optimal decoding. When using epsilon arcs for back-offs, the WFSTs resulting from back-off LMs introduce non-determinism into the graph, which results in multiple evaluations of the same models during decoding. It is not possible to apply determinization to the LM transducer, since this would lead to a fully connected N-gram model, which is not feasible. However, in a preliminary experiment, we showed that, after the composition with the lexicon transducer, it is possible to apply another, slightly modified determinization step, which respects the special semantics of failure arcs. The resulting transducer is bigger, but still manageable. After this step, the transducer is deterministic, and no special arcs (e.g. failure arcs) are needed to correctly represent the back-off LM, i.e. the resulting transducer is consistent with the log semi-ring. Therefore, the resulting WFST LG is either already stochastic, or it can be normalized with weight pushing in the log semi-ring. Thus, many of the problems pointed out in this thesis could be solved.

Bibliography

[Abo-Gannemhy et al.(2010)] W. Abo-Gannemhy, I. Lapidot, and H. Guterman, “Speech recognition using combined forward and backward Viterbi search.” in IEEE Convention of the Electrical and Electronic Engineers in Israel, 2010.

[Agarwal et al.(2014)] A. Agarwal, E. Akchurin, C. Basoglu, G. Chen, S. Cyphers, J. Droppo, A. Eversole, B. Guenter, M. Hillebrand, R. Hoens, X. Huang, Z. Huang, V. Ivanov, A. Kamenev, P. Kranen, O. Kuchaiev, W. Manousek, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Padmilac, H. Parthasarathi, B. Peng, A. Reznichenko, F. Seide, M. L. Seltzer, M. Slaney, A. Stolcke, Y. Wang, H. Wang, K. Yao, D. Yu, Y. Zhang, and G. Zweig, “An introduction to computational networks and the computational network toolkit.” Tech. Rep. MSR-TR-2014-112, August 2014. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=226641

[Aho and Corasick(1975)] A. V. Aho and M. J. Corasick, “Efficient string matching: an aid to bibliographic search.” Communications of the ACM, vol. 18, no. 6, pp. 333–340, 1975.

[Allauzen et al.(2003)] C. Allauzen, M. Mohri, and B. Roark, “Generalized algorithms for constructing statistical language models.” in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics – Volume 1, ser. ACL ’03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 40–47.

[Allauzen et al.(2004)] C. Allauzen, M. Mohri, M. Riley, and B. Roark, “A generalized construction of integrated speech recognition transducers.” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004. (ICASSP ’04), vol. 1, May 2004, pp. I–761–4 vol.1.

[Austin et al.(1991)] S. Austin, R. Schwartz, and P. Placeway, “The forward-backward search algorithm.” in Proc. ICASSP, 1991, pp. 697–700.

[Bellman(1952)] R. Bellman, “On the theory of dynamic programming.” Proceedings of the National Academy of Sciences, vol. 38, no. 8, pp. 716–719, 1952.

[Berger et al.(1996)] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra, “A maximum entropy approach to natural language processing.” Computational Linguistics, vol. 22, no. 1, pp. 39–71, 1996.

[Berman and Shaked-Monderer(2012)] A. Berman and N. Shaked-Monderer, “Non-negative matrices and digraphs.” in Computational Complexity, R. A. Meyers, Ed. Springer New York, 2012, pp. 2082–2095.

[Cardinal et al.(2013)] P. Cardinal, P. Dumouchel, and G. Boulianne, “Large vocabulary speech recognition on parallel architectures.” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2290–2300, Nov 2013.

[Chong et al.(2009)] J. Chong, E. Gonina, Y. Yi, and K. Keutzer, “A fully data parallel WFST-based large vocabulary continuous speech recognition on a graphics processing unit.” in Interspeech 2009, 10th Annual Conference of the International Speech Communication Association, September 2009.

[Cormen et al.(2009)] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition. MIT press, 2009.

[Davidson et al.(2014)] A. Davidson, S. Baxter, M. Garland, and J. D. Owens, “Work-efficient parallel GPU methods for single-source shortest paths.” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 2014, pp. 349–359.

[Dixon et al.(2009)] P. Dixon, T. Oonishi, and S. Furui, “Fast acoustic computations using graphics processors.” in ICASSP 2009. IEEE International Conference on Acoustics, Speech and Signal Processing, April 2009, pp. 4321–4324.

[Fiscus(1997)] J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER).” in Proceedings 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, Dec 1997, pp. 347–354.

[Gibbons(1985)] A. Gibbons, Algorithmic graph theory. Cambridge University Press, 1985.

[Grinstead and Snell(1997)] C. M. Grinstead and J. L. Snell, Introduction to Probability, 2nd ed. American Mathematical Society, GNU General Public License, July 1997.

[Hannemann et al.(2013)] M. Hannemann, D. Povey, and G. Zweig, “Combining Forward and Backward Search in Decoding.” in Proc. ICASSP 2013, 2013, pp. 6739–6743. [Online]. Available: http://www.fit.vutbr.cz/research/view_pub.php.en?id=10324

[Horowitz et al.(2005)] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein, “Scaling, power, and the future of CMOS.” in Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International. IEEE, 2005, 7 pp.

[Jouvet and Fohr(2013b)] D. Jouvet and D. Fohr, “Combining Forward-based and Backward-based Decoders for Improved Speech Recognition Performance.” in Proc. Interspeech 2013 - 14th Annual Conference of the International Speech Communication Association, 2013.

[Jouvet and Fohr(2014)] ——, “About Combining Forward and Backward-Based Decoders for Selecting Data for Unsupervised Training of Acoustic Models.” in Proc. Interspeech 2014, 2014, pp. 815–819.

[Jouvet and Fohr(2013a)] ——, “Analysis and Combination of Forward and Backward Based Decoders for Improved Speech Transcription.” in Text, Speech, and Dialogue, ser. Lecture Notes in Computer Science, I. Habernal and V. Matoušek, Eds.

Springer Berlin Heidelberg, 2013, vol. 8082, pp. 84–91. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-40585-3_12

[Katz(1987)] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer.” in IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35(3), 1987, pp. 400–401.

[Kim et al.(2012)] J. Kim, J. Chong, and I. Lane, “Efficient On-The-Fly Hypothesis Rescoring in a Hybrid GPU/CPU-based Large Vocabulary Continuous Speech Recognition Engine.” in Proc. Interspeech, 2012, pp. 1183–1186.

[Kneser and Ney(1995)] R. Kneser and H. Ney, “Improved backing-off for M-gram language modeling.” in Proc. ICASSP-95, International Conference on Acoustics, Speech, and Signal Processing, vol. 1, May 1995, pp. 181–184.

[Kuich and Salomaa(1986)] W. Kuich and A. Salomaa, “Semirings, Automata, Languages.” in EATCS Monographs on Theoretical Computer Science, vol. 5. Springer-Verlag, Berlin, Germany, 1986.

[Lee and Kawahara(2009)] A. Lee and T. Kawahara, “Recent development of open-source speech recognition engine Julius.” in Proc. APSIPA Annual Summit and Conference, 2009.

[Lee et al.(1998)] A. Lee, T. Kawahara, and S. Doshita, “An efficient two-pass search algorithm using word trellis index.” in Proc. ICSLP, 1998.

[Lehmann(1977)] D. J. Lehmann, “Algebraic structures for transitive closure.” Theoretical Computer Science, vol. 4, pp. 59–76, 1977.

[Li et al.(2009)] T. Li, W. Xu, J. Pan, and Y. Yan, “Improving automatic speech recognizer of voice search using system combination.” in Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009. FSKD ’09, vol. 4, Aug 2009, pp. 477–480.

[Lowerre(1976)] B. Lowerre, “The Harpy Speech Recognition System.” Ph.D. dissertation, Carnegie Mellon University, 1976.

[Maleki et al.(2014)] S. Maleki, M. Musuvathi, and T. Mytkowicz, “Parallelizing Dynamic Programming Through Rank Convergence.” in Proc. ACM PPoPP’14. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2014. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=208241

[Meyer and Sanders(2003)] U. Meyer and P. Sanders, “∆-stepping: a parallelizable shortest path algorithm.” Journal of Algorithms, vol. 49, no. 1, pp. 114–152, 2003, 1998 European Symposium on Algorithms.

[Mohri(2002)] M. Mohri, “Semiring frameworks and algorithms for shortest-distance problems.” Journal of Automata, Languages and Combinatorics, vol. 7, pp. 321–350, March 2002.

[Mohri(1997)] ——, “Finite-state transducers in language and speech processing.” Computational linguistics, vol. 23, no. 2, pp. 269–311, 1997.

[Mohri and Riley(2001)] M. Mohri and M. Riley, “A Weight Pushing Algorithm for Large Vocabulary Speech Recognition.” in Proc. Eurospeech 2001, 7th European Conference on Speech Communication and Technology, 2001.

[Mohri et al.(2008)] M. Mohri, F. C. N. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers.” in Handbook on Speech Processing and Speech Communication, Part E: Speech recognition, L. Rabiner and F. Juang, Eds. Heidelberg, Germany: Springer-Verlag, 2008, p. 31.

[Murveit et al.(1993)] H. Murveit, J. W. Butzberger, V. V. Digalakis, and M. Weintraub, “Large-vocabulary dictation using SRI’s decipher speech recognition system: Progressive search techniques.” in Proc. ICASSP Vol. 2, 1993, pp. 319–322.

[Nguyen et al.(1993)] L. Nguyen, R. Schwartz, F. Kubala, and P. Placeway, “Search algorithms for software-only real-time recognition with very large vocabularies.” in Proceedings of the Workshop on Human Language Technology, 1993, pp. 91–95.

[Nolden et al.(2011)] D. Nolden, R. Schlüter, and H. Ney, “Acoustic look-ahead for more efficient decoding in LVCSR.” in Proc. Interspeech, 2011.

[Nolden et al.(2012)] ——, “Extended search space pruning in LVCSR.” in Proc. ICASSP. IEEE, 2012.

[Nolden et al.(2013)] ——, “Efficient nearly error-less LVCSR decoding based on incremental forward and backward passes.” in Proceedings ASRU 2013, IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 1–6.

[Ortmanns et al.(1996)] S. Ortmanns, H. Ney, and A. Eiden, “Language-model look-ahead for large vocabulary speech recognition.” in Proc. ICSLP 96, Fourth International Conference on Spoken Language Processing, vol. 4, Oct 1996, pp. 2095–2098.

[Parihar and Hansen(2008)] N. Parihar and E. Hansen, “A lexical-tree division-based approach to parallelizing a cross-word speech decoder for multi-core processors.” in EUSIPCO 2008, 16th European Signal Processing Conference, Aug 2008, pp. 1–5.

[Paul and Baker(1992)] D. B. Paul and J. M. Baker, “The Design for the Wall Street Journal-based CSR Corpus.” in DARPA Speech and Language Workshop. Morgan Kaufmann Publishers, 1992.

[Phillips and Rogers(1999)] S. Phillips and A. Rogers, “Parallel speech recognition.” International Journal of Parallel Programming, vol. 27, no. 4, pp. 257–288, 1999.

[Povey et al.(2011)] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit.” in Proc. ASRU. IEEE, 2011.

[Povey et al.(2012)] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiat, S. Kombrink, P. Motlicek, Y. Quian, N. Thang Vu, K. Riedhammer, and K. Vesely, “Generating exact lattices in the WFST framework.” in Proc. ICASSP. IEEE, 2012, pp. 4213–4216.

[Soltau and Saon(2009)] H. Soltau and G. Saon, “Dynamic network decoding revisited.” in Proc. ASRU 2009, IEEE Workshop on Automatic Speech Recognition Understanding, Nov 2009, pp. 276–281.

[Steinbiss et al.(1994)] V. Steinbiss, B.-H. Tran, and H. Ney, “Improvements in beam search.” in Proc. ICSLP, vol. 94, no. 4, 1994, pp. 2143–2146.

[Stolcke(1998)] A. Stolcke, “Entropy-based pruning of backoff language models.” in Proceedings DARPA Broadcast News Transcription and Understanding Workshop. Morgan Kaufmann, February 1998, pp. 270–274.

[Tang and Cristo(2008)] M. Tang and P. D. Cristo, “Backward viterbi beam search for utilizing dynamic task complexity information.” in Proc. Interspeech, 2008, pp. 2090–2093.

[van Hamme and van Aelten(1996)] H. van Hamme and F. van Aelten, “An adaptive-beam pruning technique for continuous speech recognition.” in Proc. ICSLP 96, Fourth International Conference on Spoken Language Processing, vol. 4, Oct 1996, pp. 2083–2086.

[Viterbi(1967)] A. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm.” IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, April 1967.

[You et al.(2009)] K. You, J. Chong, Y. Yi, E. Gonina, C. J. Hughes, Y.-K. Chen, W. Sung, and K. Keutzer, “Parallel scalability in speech recognition.” IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 124–135, November 2009.

[Young et al.(1989)] S. J. Young, N. Russell, and J. Thornton, Token passing: a simple conceptual model for connected speech recognition systems. Cambridge University Engineering Department Cambridge, UK, 1989.

[Young et al.(2006)] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, “The HTK Book. Revised for HTK Version 3.4.” 2006.

Appendix A

Scripts and executables in the Kaldi toolkit

Most of the algorithms and recipes described in this thesis have been integrated into the Kaldi toolkit. The master script invoking the scripts for single-pass backward and two-pass tracked decoding can be found in egs/wsj/s5/local/run_fwdbwd.sh. However, since the utils/ directory is linked to all experiment directories egs/, the described scripts can be accessed from all recipes.

During the preparation of the training/decoding directories, the first step is to reverse the lexicon. This is done by providing the --reverse option to utils/prepare_lang.sh. For the preparation of the decoding directory, we use the script utils/reverse_lm.sh, which creates a new lang_test/ directory with the reversed LM transducer. It is very similar to the normal utils/prepare_lang_test.sh, i.e. it creates the LM WFST with src/bin/arpa2fst; however, its heart is a call to utils/reverse_arpa.py, which takes as input a textual LM in ARPA format and outputs the exactly reversed LM in ARPA format. In this Python script, we first read the ARPA file and add missing N-grams (section 4.3), and in a second pass we create the backward LM. At the end of utils/reverse_lm.sh, we apply the alternative weight pushing algorithm to make the WFST stochastic. One particularity is that arpa2fst does not support the representation of back-off arcs of missing N-grams (section 4.3) in the backward LM; therefore, we have to remove these arcs manually. As a sanity check that everything went right, we can use the script utils/reverse_lm_test.sh, which generates random word sequences from the forward LM, reverses them and checks that they are assigned the same scores in the forward and backward LM (including different ways of backing off).

The last step towards the creation of a backward recognition network is to compose the HCLG transducer from the individual components with the script utils/mkgraph.sh, which also has a --reverse option. Since the lexicon and LM transducers are already reversed, the only thing left to do is the reversal of the HMM transducer; the --reverse option is passed on to the executable src/bin/make-h-transducer. The relevant source code is in src/hmm/hmm-utils.cc. Here, the context window into the decision tree is reversed (section 5.2.1), and the individual (context-phone) HMMs are reversed and pushed before being composed into the Ha transducer.
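The kind of consistency check performed by utils/reverse_lm_test.sh can be sketched in a few lines. The sketch below is not the Kaldi script; it assumes toy forward and backward bigram models given as plain dictionaries (the function and variable names are illustrative) and only verifies that a sentence and its reversal receive the same total log score, without the back-off cases handled by the real models.

    import math

    # Toy forward/backward bigram scorers (log10 probabilities), standing in
    # for the ARPA models produced by utils/reverse_arpa.py; back-off handling
    # is omitted in this sketch.
    def sentence_logprob(words, bigram_logprob):
        total = 0.0
        for hist, word in zip(["<s>"] + words, words + ["</s>"]):
            total += bigram_logprob[(hist, word)]
        return total

    fwd = {("<s>", "a"): math.log10(0.6), ("a", "b"): math.log10(0.5),
           ("b", "</s>"): math.log10(0.4)}
    # A consistent backward model must give the reversed sentence the same score.
    bwd = {("<s>", "b"): math.log10(0.3), ("b", "a"): math.log10(0.8),
           ("a", "</s>"): math.log10(0.5)}

    sentence = ["a", "b"]
    f = sentence_logprob(sentence, fwd)
    b = sentence_logprob(list(reversed(sentence)), bwd)
    print(f, b, abs(f - b) < 1e-9)   # scores agree for this toy pair of models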

The forward/backward decoding is done with the script steps/decode_fwdbwd.sh. In the case of a simple backward decoding (using src/gmmbin/gmm-latgen-faster), we use the --reverse option, and the only two things that need to be changed compared to a forward decoding are the time reversal of the acoustic features with the executable src/featbin/reverse-feats (the code is actually in src/feat/feature-functions.cc) and the reversal of the decoded text in the scoring script steps/score_kaldi.sh (called by local/score.sh).

To use the tracked decoding, we run the first pass as just described (--beam is the baseline beam width and --latbeam is the lattice pruning beam), and then we decode in the opposite direction, using the --first_pass option as an additional input, followed by the first-pass decoding directory, from which we take the lattices. The script steps/decode_fwdbwd.sh has two additional options, --extra_beam and --max_beam, which set the extra beam and the beam limit (section 5.4.2), respectively. If the --first_pass option is given, we convert the first-pass lattice to the graph-arc acceptor lattice (section 5.4.3) and use the executable src/gmmbin/gmm-latgen-tracking to perform the tracked decoding. All the necessary source code for the arc-lattice generation is in src/latbin/lattice-arcgraph.cc, which also compiles to the corresponding executable.

The implementation of the tracking decoder is in src/decoder/lattice-tracking-decoder.{cc,h}. The arc-lattice is read as a standard WFSA; the central method LatticeTrackingDecoder::Decode() gets it as an input parameter. Each token (represented as a struct Token) has an additional component: the state in the arc-lattice. The main methods inside Decode() that realize the decoding and tracking are ProcessEmitting() and ProcessNonemitting(), which are called alternately. The beam width policy (section 5.4.2) is implemented in the method GetCutOff(), where the variable extra_cutoff is computed, as well as in ProcessNonemitting().
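The data flow around the additional arc-lattice state in each token can be illustrated in a language-agnostic way. The sketch below is not the C++ implementation from lattice-tracking-decoder.cc; under assumed names, it only shows how a token's tracking status could be propagated when an emitting arc of the decoding graph is traversed.

    import collections

    # Illustrative token with the extra arc-lattice state used for tracking
    # (None means the token is not on any first-pass path).
    Token = collections.namedtuple("Token", ["cost", "arc_lattice_state"])

    def propagate(token, graph_arc_id, arc_cost, arc_lattice):
        """Follow one emitting arc and keep the arc-lattice state in sync.

        arc_lattice maps (arc_lattice_state, graph_arc_id) -> next state, i.e.
        it is the acceptor over decoding-graph arcs built from the first-pass
        lattice."""
        next_state = None
        if token.arc_lattice_state is not None:
            next_state = arc_lattice.get((token.arc_lattice_state, graph_arc_id))
        new_token = Token(cost=token.cost + arc_cost, arc_lattice_state=next_state)
        tracked = next_state is not None      # tracked tokens are never pruned
        return new_token, tracked

    # Tiny example: the first-pass lattice contains graph arc 7 from state 0.
    arc_lattice = {(0, 7): 1}
    tok = Token(cost=10.0, arc_lattice_state=0)
    print(propagate(tok, graph_arc_id=7, arc_cost=2.5, arc_lattice=arc_lattice))
    print(propagate(tok, graph_arc_id=8, arc_cost=2.5, arc_lattice=arc_lattice))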
