
Research Project Report: Automatic Speech Recognition and Natural Language Processing

Christoph Kuhr (11046220)

18.03.2016

Abstract

This report describes the theoretical background, work and results of a research project at the Cologne University of Applied Sciences conducted from September 2015 to March 2016. The project Automatic Speech Recognition and Natural Language Processing shall outline state-of-the-art research in multiple fields of computational linguistics. To demonstrate a use case, an automatic speech recognition (ASR) system with natural language processing (NLP) capabilities will be implemented. The aspects under investigation are both the deep learning algorithms used in ASR systems and their acceleration. Current computational linguistic concepts and the customization of the corpora of both the speech recognition and the language processing shall be investigated as well.

Contents

1 Introduction

2 Automatic Speech Recognition
  2.1 Brief History
  2.2 Theory
    2.2.1 Speech and its Acoustic Features
    2.2.2 Hidden-Markov-Model
    2.2.3 N-grams and Statistical Language Modeling
      2.2.3.1 Decoding
      2.2.3.2 Smoothing
      2.2.3.3 Quality Measurements
    2.2.4 Knowledge Base
      2.2.4.1 Phonetic Dictionary
      2.2.4.2 Acoustic Model
      2.2.4.3 Language Model
  2.3 Recent Research
    2.3.1 Recurrent Neural Network Language Model
    2.3.2 Continuous Space Language Model
    2.3.3 Case Studies
      2.3.3.1 Google Voice Search in Mobile Applications
      2.3.3.2 Emotion-Detection Voice Applications in Automated Call Centers
      2.3.3.3 You're as Sick as you Sound

3 Natural Language Processing
  3.1 Brief History
  3.2 Theory
    3.2.1 Parsing
    3.2.2 Chunking
    3.2.3 Part-of-Speech (POS) Tagging
    3.2.4 Stemming
      3.2.4.1 Suffix-stripping
      3.2.4.2 Lemmatisation
      3.2.4.3 Stochastic Algorithms
      3.2.4.4 Matching Algorithms
  3.3 Recent Research
    3.3.1 POS Tagging and Chunking
    3.3.2 Native Language Acquisition
    3.3.3 Simple Synchrony Networks (SSN)

4 Test System, Training and Experiments
  4.1 Heterogeneous System Architecture (HSA)
  4.2 CMU-Sphinx
    4.2.1 Sphinx-4
    4.2.2 Sphinxbase


    4.2.3 Pocketsphinx
    4.2.4 Sphinxtrain
  4.3 Knowledge Base
    4.3.1 Voxforge Dictionary, Acoustic and Language Models
    4.3.2 German Corpus for Building Language Models
      4.3.2.1 Wikipedia Dump
      4.3.2.2 Gutenberg Online Archive
      4.3.2.3 Zeit Online Archive
    4.3.3 Language Model Training and Toolkits
      4.3.3.1 SRILM Toolkit
      4.3.3.2 RNNLM Toolkit
      4.3.3.3 CSLM Toolkit
      4.3.3.4 CMUCLM Toolkit
  4.4 NLP with Parsers and Taggers
    4.4.1 Pattern
    4.4.2 Stanford Parser
    4.4.3 NLTK Unigram POS Tagger
    4.4.4 GermaNet
  4.5 psApp.py - A Python ASR and NLP Tool
    4.5.1 User Interface
    4.5.2 Recognition Process
      4.5.2.1 Activation by Keyword Search
      4.5.2.2 Continuous Speech Recognition
      4.5.2.3 Processing of Recognized Sentences
  4.6 Experiments
    4.6.1 ASR
      4.6.1.1 Test Environments and Configurations
      4.6.1.2 Test Sets
      4.6.1.3 Test Results
    4.6.2 NLP
      4.6.2.1 Imperative
      4.6.2.2 Question - Yes-No
      4.6.2.3 Question - Probe
      4.6.2.4 Identify Numerics

5 Conclusions
  5.1 Automatic Speech Recognition
  5.2 Natural Language Processing
  5.3 Future Work

Appendices

A Wikipedia Dump Cleanup Python Script

B Gutenberg Online Cleanup Python Script

C Building RNNLM Toolkit from Source

D RNNLM Toolkit Training Output

E librnnlm.cpp with clBLAS SGEMV Implementation

F Building CMULM Toolkit from Source

G CMULM Toolkit Training Output

H Building SRILM Toolkit from Source

I SRILM Toolkit Training Output

J Building Pocketsphinx from Source
  J.1 Dependencies
  J.2 Sphinxbase
  J.3 Pocketsphinx

K Recognition Process C++ Implementation

L Numeric Reduction Python Parser

M CKY Recognition Algorithm

N Probabilistic CKY Recognition Algorithm

O Earley Recognition Algorithm


List of Figures

2.1 Spectrogram of the Word Sequence "Hello World"
2.2 Hidden Markov Model Chain of an Utterance
2.3 N-gram of the Sentence Ist die Sonne grün?
2.4 Architecture of a Recurrent Neural Network
2.5 Architecture of a CSLM Neural Network
2.6 Block Diagram of Google Search by Voice
2.7 Anger Detection System
2.8 HMM for Anger Turn Prediction
3.9 POS Tagging Decision Tree
3.10 Decision Tree with Classification Error Count
3.11 Simple Synchrony Networks Unfolded over a Derivation Sequence
4.12 Non-Uniform Memory Access
4.13 Heterogeneous Unified Memory Access
4.14 Sphinx-4 Decoder Architecture
4.15 Python Application psApp.py UI

List of Tables

2.1 HF Optimizer Algorithm
4.2 "Heimdall" IPA and Arpabet Notation
4.3 Pocketsphinx Keyword Search Parameters
4.4 Pocketsphinx Voxforge Parameters
4.5 Test Run Configuration for the Voxforge Language Model
4.6 Test Run Configuration for the Merged Gutenberg and Voxforge Language Model
4.7 Recognition Test Set Voxforge
4.8 Recognition Test Set Voxforge and Gutenberg-Online
4.9 Recognition Test Results Showing the Number of Misrecognized Words of Voxforge
4.10 Recognition Test Results Showing the Number of Misrecognized Words of Voxforge and Gutenberg-Online
4.11 NLP Imperative: Chunks, POS Tags, Relations
4.12 NLP Question - Yes-No: Chunks, POS Tags, Relations
4.13 NLP Question - Probe: Chunks, POS Tags, Relations
4.14 NLP Numerical Expression Reduction
M.15 CKY Recognition Algorithm
N.16 Probabilistic CKY Recognition Algorithm
O.17 Earley Recognition Algorithm


List of Equations

1 PDF Properties
2 PDF Properties Short
3 Prior Probability for N-grams
4 Backoff Weight
5 Kneser-Ney Smoothing
6 Word Error Rate (WER)
7 Perplexity (PPL)
8 HF-Optimization
9 F-Score / F-Measure
10 Logarithmic Probability of Hypothesis


Nomenclature

API Application Programming Interface
APU Accelerated Processing Unit
ASD Autism Spectrum Disorder
ASR Automatic Speech Recognition
BLAS Basic Linear Algebra Subprograms
CFG Context-Free Grammar
CG Linear Conjugate Gradient Algorithm
CKY Cocke-Kasami-Younger Algorithm
clBLAS OpenCL Basic Linear Algebra Subprograms
CMU Carnegie Mellon University
CMUCLMTK Carnegie Mellon University-Cambridge Language Model Toolkit
CNF Chomsky Normal Form
CPU Central Processing Unit
CSLM Continuous Space Language Model
CUDA Compute Unified Device Architecture
DTMF Dual Tone Multiple Frequency
EM Expectation Maximization
FSG Finite State Grammar
FST Finite State Transducer
GMA Google Mobile App
GMM Gaussian Mixture Model
GMM-HMM Gaussian Mixture Model - Hidden-Markov-Model
GPGPU General Purpose Graphics Processing Unit
GPU Graphics Processing Unit
HF Hessian Free
HMM Hidden Markov Model
HSA Heterogeneous System Architecture
HSAIL HSA Intermediate Language
HUMA Heterogeneous Unified Memory Access
IPA International Phonetic Alphabet
IPA International Phonetic Association
IVR Interactive Voice Response System
JVM Java Virtual Machine
KenLM Kenneth Heafield Language Model
L2 Level Two (Cache)
LPC Linear Predictive Coding


MFCC Mel-Frequency Cepstral Coefficients
MISD Multiple Instruction Single Data
MIT Massachusetts Institute of Technology
MITLM Massachusetts Institute of Technology Language Model
MLP Multilayer Perceptron
NLP Natural Language Processing
NLTK Natural Language Toolkit
NUMA Non-Uniform Memory Access
OOV Out-of-Vocabulary
OSC Open Sound Control
PDF Probability Density Function
PLP Perceptual Linear Prediction
POS Part-of-Speech
PPL Perplexity
RAM Random Access Memory
RNN Recurrent Neural Network
RNNLM Recurrent Neural Network Language Model
SGEMV Single-Precision General Matrix-Vector Multiplication
SLM Statistical Language Modeling
SRI Stanford Research Institute
SRILM Stanford Research Institute Language Model
SSN Simple Synchrony Networks
STTS Stuttgart-Tübingen-TagSet
SVM Support Vector Machine
UI User Interface
WER Word Error Rate
WSJ Wall Street Journal


1 Introduction

Automatic speech recognition (ASR) systems have increasingly come into the focus of research due to the widespread use of smartphones. With the statistical data acquired by large internet companies, such as Google Inc., via such devices, speech input can be processed in server farms and is already producing accurate results. There are also different approaches for small ASR systems running on desktop computers and even Raspberry Pi single-board computers. One of those approaches is CMU Sphinx [1], an open source speech recognition framework developed by Carnegie Mellon University (CMU) [2]. The investigation of an implemented ASR system using Pocketsphinx, a lightweight implementation of the CMU Sphinx framework, is one of the major aims of this research project and shall provide further insight into the applied concepts of state-of-the-art computational linguistics. Pocketsphinx uses an American English corpus by default. The VoxForge project [3] provides transcribed speech corpora in German. The ASR test system shall be configured to use both.

Several toolkits can be used to create custom grammars, dictionaries, language models and acoustic models; CMUCLMTK [4], SRILM [5], KenLM [6] and MITLM [7] are the most common. These are known to be compatible with the Sphinx framework, mainly because they output ARPA files. ARPA, also called Arpabet [8], is a file format containing text in a phonetic transcription code and is named after the Advanced Research Projects Agency (ARPA). CMUCLMTK is the reference implementation of tools for the estimation of statistical N-gram language models (which are described in section 2.2.3), for perplexity calculations, for language model pruning and for other modeling tasks. The SRILM and KenLM toolkits are based on the same concepts; however, KenLM has better performance due to its multithreaded implementation.

In recent years the computational power of single computer systems has reached a point at which algorithms that do not necessarily have to be supervised during training (deep learning algorithms) can be implemented and return results in feasible time. This has led to approaches that accelerate learning algorithms with General Purpose Graphics Processing Unit (GPGPU) programming, such as CUDA or OpenCL [9].


Automatic speech recognition uses two different models from different fields of research: acoustic models and language models. Both types of models need to be trained, and thus deep learning can be applied. The topic of deep learning for acoustic models will not be covered in this project; for detailed work on this topic see [10]. For language models, RNNs in particular came into focus due to recent developments. With a new method, called the Hessian-Free Optimizer, the "vanishing/exploding gradient" phenomenon can be avoided. This enables RNNs to make suggestions for text auto-completion, which can be used to accelerate speech recognition based on statistical language modeling. Recent studies [15] show first results of RNN language models outperforming backoff N-gram language models, which are, as of now, the most common choice for domain-specific language modeling.

Current research in the field of natural language processing (NLP) is mostly concerned with unsupervised learning, where algorithms can learn from unannotated input data. NLP is already used in everyday work, in fields such as financial analysis, social network analysis and general web content analysis. An existing and widespread tool for this purpose is the Python NLTK (Natural Language Toolkit) [11]. The NLTK will be used to analyze the text that is recognized from speech by Pocketsphinx and to further investigate the possibilities of unsupervised learning. By default the NLTK uses an American English corpus and grammar. The NLTK shall be configured to use the German corpus and grammar of GermaNLTK, a project at the Hochschule der Medien Stuttgart [12] which integrates GermaNet [13] into NLTK.

The test system under investigation will be an industrial computer with an AMD APU (Accelerated Processing Unit) running a Linux operating system. An APU is a processor which shares the L2 cache directly between CPU and GPU; thus the bottleneck of copying data into and reading it from RAM is reduced. The audio system for the speech input will be realized with GStreamer-1.0 [14]. GStreamer is a multimedia framework for Linux that allows the handling of all kinds of data in pipelines. Such a pipeline will be implemented to provide an audio stream to Pocketsphinx; a plugin for GStreamer is natively integrated into Pocketsphinx. In order to process the recognized speech with NLTK, as well as to control the audio stream and the speech recognition itself, a Python application will be developed. The following aspects will be investigated, once the basic system is operating properly:

• Investigate and describe the concepts of computational linguistics in use

• Configuration of German corpora for Pocketsphinx and NLTK

• Comparison of deep learning algorithms: the Recurrent Neural Network Language Model (RNNLM) Toolkit [15] and the Continuous Space Language Model (CSLM) Toolkit [16]

• Parallelization of deep learning and search algorithms with GPGPU (OpenCL)

• Evaluate the possibilities of unsupervised learning with NLTK.

In section 2, a brief history of ASR is given and the theoretical concepts are discussed. Furthermore, it covers research that has been done in language modeling over the past decade. Section 3 gives a brief history of NLP, along with a discussion of the theoretical concepts

and recent research. The test system for the demonstration of ASR and NLP is described and discussed in section 4. In section 5, the effort of setting up the custom test system with the application of the most recent research results is described. The experiments conducted for this project are discussed in detail, with an outlook on future work.


2 Automatic Speech Recognition

2.1 Brief History

The roots of speech recognition date back to 1952, when Bell Laboratories built a system that could detect formants in the power spectrum of utterances. It was limited to a single speaker and a vocabulary of approximately ten words. In 1960 the first continuous speech recognition system, able to recognize chess commands, was developed at Stanford University. At about the same time the Dynamic Time Warping algorithm was developed, which allowed larger vocabularies. The major obstacle in this decade was the recognition of different speakers. In the late 1960s the mathematics of Hidden Markov Models (HMMs) were developed and found their way into the CMU speech recognition system. In 1971 DARPA founded a research program for the recognition of vocabularies of up to 1000 words; BBN, IBM, Carnegie Mellon and the Stanford Research Institute all participated in the program. In the mid 1980s IBM developed the typewriter Tangora, which could recognize 20,000 words and was based on HMMs. In the 1990s a variety of speech recognition systems came into commercial use. Since Apple acquired Nuance in 2005, its system Siri has been available on Apple's iPhones. Two years later, in 2007, Google started to develop speech recognition systems, too. When the first deep learning algorithms for acoustic models were proposed in 2009, the word error rates of the past decades could be improved by 30%. Artificial neural network approaches had been studied in this context throughout the 1980s, 90s and early 2000s, with little success. For further reading see [17].

2.2 Theory

The following subsections explain the theoretical concepts of speech recognition. Since a huge number of different approaches exists, only the concepts relevant to the implementation under investigation are elaborated on.


2.2.1 Speech and its Acoustic Features

Speech is a sequence of the acoustic representations of words. The same word can be pronounced in different ways, depending on several aspects such as the emotional context, the speaker him/herself or his/her accent. A word that succeeds another also influences the predecessor's pronunciation. This leads to the following model of the composition of words, utterances and sentences. A word can be subdivided into syllables, and each syllable can be further divided into triphones. Even more precise divisions exist, like quinphones, but for the purpose of speech recognition triphones are sufficient. A triphone describes three segments of a syllable, three phones: the first phone of a syllable depends on the preceding word, the middle phone is constant, and the third phone depends on the succeeding word or syllable, respectively. Practical work has determined 48 different, context-independent phones [18]. These phones have been defined by the International Phonetic Association (IPA) [19] and form the International Phonetic Alphabet (IPA) [20]; one has to be aware that the abbreviation is ambiguous. Words that are spoken in a sequence are called an utterance and are terminated by a pause. In most cases utterances form sentences, but not necessarily; they can also form parts of sentences or even a simple exclamation. The relevant information of the human voice ranges up to a frequency of 8 kHz. Thus, according to Shannon's sampling theorem [21], a sampling rate of 16 kHz is sufficient to record the audio information. The recorded audio signal is divided into frames, each frame being several samples long. Usually a feature frame is about 20 ms long and two frames overlap for 10 ms, i.e. 320 and 160 speech samples, respectively [30].

Figure 2.1: Spectrogram of the Word Sequence “Hello World”


However, the exact duration depends on the implementation. A frame is represented by features describing the envelope of the power spectrum. The set of features per frame is called a feature vector. Fig. 2.1 shows the power spectrum of a speech signal, each column representing one feature vector. The computation of features depends on the implementation. The most important methods to compute features are Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC) parameters and Perceptual Linear Prediction (PLP) [31]. MFCC is based on the Fast Fourier Transformation and relates the analyzed frequency scale to the Mel scale, a non-linear scale based on human hearing perception. Thus it is a good approximation of the human hearing process. LPC coefficients are calculated from the power spectrum of the signal and are used for formant (vocal tract) analysis. This technique is based on the removal of redundant information and residual error calculation. LPC uses 12 coefficients to describe a feature vector; these coefficients are quantized into a codebook of 256 prototype vectors [18]. The PLP cepstra computation is based on psychoacoustic concepts, but is otherwise identical to LPC. It only takes acoustic information into account that is relevant to human hearing. A scalar continuous random variable can be considered Gaussian distributed if its probability density function (PDF) fulfills the property shown in eq. (1). A Gaussian mixture distribution has been shown to perform very well for speech recognition, since speech can be represented on a frame basis rather than on a temporal basis [10]. Thus, a Gaussian Mixture Model (GMM) is only able to represent momentary information, as in statistical classification, but fails to represent sequential data. For the representation of sequential data, HMMs are used; the concept of HMMs is discussed in section 2.2.2. The combination of both models, an observable and a hidden process, leads to a doubly stochastic model, the GMM-HMM. With a GMM-HMM, segments are computed from the acoustic features, and the summation of several segments results in a phone.

p(x) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] = \mathcal{N}(x; \mu, \sigma^2), \quad (-\infty < x < \infty;\ \sigma > 0)    (1)

Or in short: x \sim \mathcal{N}(\mu, \sigma^2).    (2)
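To make the frame and feature parameters above concrete, the following sketch extracts MFCC feature vectors from a recording. It uses the librosa library, which is not part of the Sphinx frontend and serves here only as an illustration; the file name hello_world.wav is a hypothetical example.

import librosa

# Load a mono recording at the 16 kHz sampling rate discussed above.
signal, rate = librosa.load("hello_world.wav", sr=16000)

# 20 ms frames (320 samples) with a 10 ms hop (160 samples),
# 13 cepstral coefficients per feature vector.
features = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=13,
                                n_fft=320, hop_length=160)
print(features.shape)  # (13, number of frames)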

2.2.2 Hidden-Markov-Model

A Markov model is a graph whose edges describe the transitional probabilities from one state or node of the model to another. The HMM is an extension of the Markov model in which the stochastic process, the transition, is not observable. For the application of speech recognition, the particular type of left-right model, or Bakis model, fits best. This model has the property that the state index increases as time increases; in this case the underlying sequence, speech, proceeds with time [34]. Used with an acoustic model, the observation vectors for the HMMs, which are built for each of the 48 phones, are obtained by matching the LPC-encoded feature vectors with the likelihoods of the acoustic model. In fig. 2.2 an example utterance with the corresponding HMM is provided [35].


Figure 2.2: Hidden Markov Model Chain of an Utterance

2.2.3 N-grams and Statistical Language Modeling

N-grams are a subset of Statistical Language Modeling (SLM) and represent a Bayes classification problem. An N-gram model describes the probability of one word or character succeeding another; this results in a path for every possible word sequence. In an SLM task, the implementation of complex grammar rules is not necessary. It depends solely on the probabilities assigned to sentences from the training data. The likelihood of a specific word succeeding another is calculated during the training phase of the modeling process. The prior probability P(W) of a string of words W = w_1, \ldots, w_n is given in eq. (3). Usually the string consists of sentences or utterances [23].

P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})    (3)

An example of the N-gram analysis of a sentence is shown in fig. 2.3 for the sentence already used in the previous subsection: Ist die Sonne grün? Every N-gram path starts with a begin-sentence marker <s> and ends with an end-sentence marker </s>. Each node in between is connected to both of these markers and to its preceding and succeeding nodes, which are omitted in fig. 2.3 for clarity. Paths which have not been seen in the training data, but are still possible, have a low transitional probability and might never occur. A statistical language model can combine words from its vocabulary into new sentences (or phrases) which have not been seen in the training data. The probabilities associated with these new sequences might be quite low compared to those actually found in the training data, but they are not discarded. In an N-gram model this is called a backoff N-gram and is achieved by calculating the reduced order, an (N-1)-gram. The backoff weight is computed as follows:

backoff(W_{N-1}) = \frac{1 - \sum_w P(w \mid W_{N-1})}{\sum_w P(w \mid W_{N-2})}    (4)
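A minimal sketch of the counting behind eqs. (3) and (4) is given below; it estimates maximum-likelihood bigram probabilities from a toy corpus and multiplies them along a sentence according to eq. (3). The two example sentences are made up, and a real model would of course be trained on a large corpus.

from collections import Counter

# Hypothetical toy corpus with sentence markers.
sentences = [["<s>", "ist", "die", "sonne", "gruen", "</s>"],
             ["<s>", "die", "sonne", "scheint", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    unigrams.update(sent[:-1])  # contexts; </s> never precedes a word
    bigrams.update(zip(sent, sent[1:]))

def p_bigram(w, prev):
    # Maximum-likelihood estimate P(w | prev) = C(prev, w) / C(prev).
    return bigrams[(prev, w)] / unigrams[prev]

# P(W) as the product over the sentence, following eq. (3).
sent = ["<s>", "die", "sonne", "scheint", "</s>"]
p = 1.0
for prev, w in zip(sent, sent[1:]):
    p *= p_bigram(w, prev)
print(p)  # 0.25 for this toy corpus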


Figure 2.3: N-gram of the Sentence Ist die Sonne grün?

2.2.3.1 Decoding

Decoding is done in two or more passes. In the first decoding pass, language models of lower complexity are used. Second-pass decoding, which is also called rescoring, is more complex and is done by one of the following methods:

• N-best Rescoring: A list of hypotheses is retained from the first pass, along with their scores for the complete path in the decision tree. The hypotheses are then rescored.

• Lattice Rescoring: A hypothesis is stored as a directed acyclic graph, where the nodes are language model states and the arcs are labeled with the suggested words. Compared to N-best rescoring, this method has the advantage of preserving intermediate scores and not only the absolute score of a path, which can be useful for subsequent processing tasks [38].

2.2.3.2 Smoothing

Without smoothing, words that have not been seen in the training data would be assigned a probability of 0.0. This leads to the smoothing requirement given in eq. (5). Over the years several techniques to avoid this have been developed, such as Katz back-off or Kneser-Ney smoothing.

P(w_k \mid \theta(W_{k-1})) > \epsilon > 0, \quad \forall w_k, W_{k-1}    (5)
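Continuing the bigram sketch from section 2.2.3, the simplest scheme satisfying eq. (5) is add-one (Laplace) smoothing, shown below. It is used here only for brevity; Kneser-Ney smoothing, as used in practice, redistributes the probability mass in a more refined way.

def p_laplace(w, prev, bigrams, unigrams, vocab_size):
    # Every bigram, seen or unseen, receives a non-zero probability,
    # so the requirement of eq. (5) is met.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)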

2.2.3.3 Quality Measurements

• Word Error Rate (WER): The WER is a metric for the performance of a speech recognition task. Its calculation is shown in eq. (6), where S denotes the number of substitutions, D the number of deletions, I the number of insertions and C the number of correct words. The total number of words in the reference sentence is N, with N = S + D + C. A small reference implementation is sketched after this list.

WER = \frac{S + D + I}{S + D + C} = \frac{S + D + I}{N}    (6)

• Out-of-Vocabulary (OOV) Rate: Any word not included in the training set, and hence not in the vocabulary, is represented by an unknown-word symbol. The rate of unknown words in a test set is called the Out-of-Vocabulary (OOV) rate.

• Perplexity (PPL): Perplexity, given in eq. (7), compares two models M, one from the training data and


one from the test data. It represents the number of guesses the model needs to find the next word w_k. The upper limit is given by the vocabulary size of the model. The lower the perplexity, the better the language model.

PPL(M) = \exp\left(-\frac{1}{N} \sum_{k=1}^{N} \ln P_M(w_k \mid W_{k-1})\right)    (7)
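The sketch below implements eq. (6) via the Levenshtein distance over words, together with eq. (7) for a list of word log-probabilities. Both are minimal illustrations rather than the evaluation code used in the experiments of section 4.6.

import math

def wer(reference, hypothesis):
    # Word error rate, eq. (6), via the Levenshtein distance.
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def perplexity(log_probs):
    # Perplexity, eq. (7): log_probs holds ln P_M(w_k | W_{k-1})
    # for every word of a test set.
    return math.exp(-sum(log_probs) / len(log_probs))

print(wer("ist die sonne gruen", "ist die tonne gruen"))  # 0.25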

2.2.4 Knowledge Base The knowledge base contains information on which the decoding process of the speech recog- nition depends. There are three different types of databases: the phonetic dictionary, the acoustic model and the language model. Each holds information for the different stages of the decoding process.

2.2.4.1 Phonetic Dictionary

The dictionary contains a mapping of the vocabulary in use, which will occur in the language model, to phones. The phones are stored as a phonetic spelling dictionary in a modified Arpabet [49] form.

2.2.4.2 Acoustic Model

Acoustic models are data collections that provide estimates of the likelihood of a certain feature occurring in a given word context [33]. Acoustic models are trained by matching transcribed speech to acoustic features.

2.2.4.3 Language Model

The language model represents a vast set of sentences, providing the source for constructing the HMM graphs on which the decoder will later perform its searches. Probabilities are assigned for certain words succeeding other words. A language model is constructed from a text corpus specific to the domain in which the ASR system is used. The construction of the language model is the process of training; in this process the probabilities for the word sequences are estimated.

Different types of language models are listed below:

• Keyword Lists: Keyword lists that contain several words combined with a threshold value can be specified. Individual keywords from those lists can then be spotted in a continuous speech recognition process. To adjust the detection accuracy between missed detections and false alarms, a threshold value needs to be tuned for each keyword (an example list is sketched after this list).

• Grammar: Grammars represent a very simple type of language model that is best suited for command-and-control applications. A grammar expects certain types of words in a pre-defined order; if the recognition of an intermediate word fails, the whole grammar detection fails. To make the recognition more robust, grammars have to be simple and short (an example grammar is sketched after this list).


• Statistical Language Model and N-Grams: Statistical language models describe methods to recognize natural language. In contrast to grammars, the design of a statistical language model takes much less engineering effort, because not every possible word combination has to be forecast. This type is discussed in section 2.2.3.
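To illustrate the first two model types, a Pocketsphinx keyword list and a JSGF grammar are sketched below. The key phrases and threshold values are hypothetical examples; thresholds have to be tuned per keyword as described above. A keyword list file contains one key phrase per line, followed by its detection threshold:

heimdall /1e-20/
licht an /1e-10/
licht aus /1e-10/

A simple command-and-control grammar in JSGF notation could look like this:

#JSGF V1.0;
grammar commands;
public <command> = schalte das licht (an | aus);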

2.3 Recent Research

This section gives a short overview of recent research in the field of language modeling for speech recognition tasks. Furthermore, it presents three case studies from different scientific domains utilizing speech recognition. Other popular fields of research in the context of speech recognition are studies of deep learning algorithms that learn acoustic features for acoustic modeling. There are, for example, studies using RNNs to train acoustic models and accelerating them by means of parallelization on GPUs. The aspects of acoustic modeling are not further investigated in this project; for further reading see [10].

2.3.1 Recurrent Neural Network Language Model

An RNN is a feed-forward neural network that also has feedback connections through time. In the context of speech recognition this means the input to the RNN is a word of the vocabulary; thus the input layer has the size of the vocabulary. The first dimension of the hidden layer has the same size as the input layer, and the second dimension is the length of the sequence that can be learned. The output layer also has the size of the vocabulary and represents the probabilities that a certain word from the vocabulary appears next. The hidden layer weights are then fed into the hidden layer of the next time step, i.e. the next word. The architecture is shown in fig. 2.4 [15].

Figure 2.4: Architecture of a Recurrent Neural Network

The neural network predicts which word might appear next in the sequence, based on context information from the previous words. This approach to RNNs was first proposed in 1994 [36], but faced a major difficulty called the "vanishing/exploding gradient" phenomenon. It describes the observation that, with a gradient descent search algorithm, context information either increases exponentially or approaches zero over time. In 2010


an alternative algorithm for optimizing ∇f(θ_n) (the gradient of f) was proposed: the Hessian-Free Optimization algorithm shown in table 2.1 [37].

1: for n = 1, 2, ... do
2:   g_n ← ∇f(θ_n)
3:   compute/adjust λ by some method¹
4:   define the function B_n(d) = H(θ_n)d + λd
5:   p_n ← CG-Minimize(B_n, −g_n)
6:   θ_{n+1} ← θ_n + p_n
7: end for

Table 2.1: HF Optimizer Algorithm

It is a second-order optimization algorithm, also called a truncated-Newton algorithm. The standard Newton method computes an N × N matrix B and multiplies it with the system. HF-Optimization takes two different approaches: firstly, it exploits the faster computation of finite differences shown in eq. (8); secondly, it uses the linear conjugate gradient algorithm (CG) for the search [37].

Hd = \lim_{\epsilon \to 0} \frac{\nabla f(\theta + \epsilon d) - \nabla f(\theta)}{\epsilon}    (8)
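The finite-difference trick of eq. (8) is easy to demonstrate: the Hessian-vector products needed by CG can be formed from two gradient evaluations, without ever building the N × N matrix. The quadratic objective below is made-up example data, not part of the RNNLM training itself.

import numpy as np

def hessian_vector_product(grad_f, theta, d, eps=1e-6):
    # Finite-difference approximation of H d from eq. (8); only two
    # gradient evaluations, no explicit N x N Hessian.
    return (grad_f(theta + eps * d) - grad_f(theta)) / eps

# Toy objective f(theta) = 0.5 theta^T A theta - b^T theta,
# whose gradient is A theta - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda theta: A @ theta - b

d = np.array([1.0, 0.0])
print(hessian_vector_product(grad_f, np.zeros(2), d))  # approx. A @ d = [3. 1.]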

2.3.2 Continuous Space Language Model

The continuous space language model is a neural network language model. It addresses the data sparseness problem: the impossibility of considering the full word history is circumvented by probability estimation in a continuous space [38], whereas backoff N-gram models represent words in a vocabulary, which is a discrete space [39]. The improvement is based on the usage of fast algorithms for training and recognition. One solution is to train several neural networks in parallel; another is to use large projection (output) layers, because they have little influence on the complexity and only require more computational power. The architecture of a CSLM is shown in fig. 2.5 [39].

Figure 2.5: Architecture of a CSLM Neural Network

1e.g. the Newton-Lanczos methods, where λ corresponds to a given trust-region radius τ[37]


2.3.3 Case Studies

2.3.3.1 Google Voice Search in Mobile Applications

Google basically provides a search engine. Over the years the interfaces of this search engine have been extended; one example of such an extension is the voice search function. Prior to 2008, Google used a telephone-based interactive voice response (IVR) system, a dialogue system to pick an entry from a list of businesses without knowing the number. First it prompted for the city and state, followed by a prompt to pick an entry from the provided business list. In 2008 Google deployed a system for continuous recognition of speech. Google Maps for Mobile [33] was the first Google service to utilize speech recognition; it was able to search for business information on the web and include the information within the Google Maps application on the smartphone. Later in 2008, Google extended the speech recognition service to web searches. Since that time, the Google Mobile App (GMA) [33] has been using Google Search by Voice. Fig. 2.6 shows the principal architecture of Google Search by Voice [33].

Figure 2.6: Block Diagram of Google Search by Voice

Google uses the spoken user data that any voice search provides to generate and train acoustic models. Data for the language model is collected from all web searches performed with the Google search engine. With this vast amount of data, very complex acoustic and language models can be built. These models take advantage of the huge computational power provided by the company's infrastructure to deliver fast and accurate results.

2.3.3.2 Emotion-Detection Voice Applications in Automated Call Centers

The Institute for Information Technology at Ulm University conducted a case study regarding emotion detection [40], particularly the detection of anger and annoyance in users of a customer care voice application. At the time of the study, IVR systems were already able to work with natural language instead of having the user pick an option from a list of possible commands. With earlier dual tone multiple frequency (DTMF) systems, requiring the user to select

between options by pressing the telephone's keys, only few dialogue scenarios needed to be implemented. With natural language as input, the demand for dialogues has increased substantially, and the number of steps within a dialogue has grown from a few steps up to 50-100 steps. During a dialogue, a user can become annoyed or even angry. There are several reasons for this: for example, the user's reply is misrecognized due to loud background noise, or the user tries to use commands that are not implemented or speaks with inappropriate grammar. There are many more reasons for the user to become frustrated. For the implementation of an emotion-detection system, four basic steps have to be taken. Firstly, user utterances have to be collected for training purposes; either hypothetical data spoken by actors or real user data can be used, the latter giving better results. Secondly, this data has to be labeled with the possible emotion expressed. A best practice is to let several people rate the utterances and take the majority vote. Thirdly, acoustic features holding indications for the labeled emotions are extracted. In the fourth and final step, a machine learning algorithm is trained on the data; in this case support vector machines (SVM) are used. Fig. 2.7 visualizes the implemented emotion-detection system [40].

Figure 2.7: Anger Detection System

The acoustic subsystem estimates whether acoustic features indicating anger are being recognized. Features like longer pauses in utterances or increased energy throughout the whole voice spectrum are examined; they play the most important role among the 50 most relevant features indicating anger. Other features like pitch (vocal excitation), loudness, intensity, MFCC, formants (vocal tract sound) or the harmonics-to-noise ratio are examined as well. One might assume that the estimation of the linguistic subsystem should be quite important, but the opposite is true. Only a few words are directly related to anger, and those words, e.g. "Damn", are seldom integrated into language models. Therefore the linguistic subsystem is not of great importance. However, it can be taken into account to increase the detection accuracy, because words like "operator", "person" or "representative" are a good indication that the user is annoyed by misrecognition and wants to speak to a real person. The reliability of the ASR system can also be taken into account, since misrecognition plays an important role in anger detection. The system can make an estimation based on the assumption that no input has been given or that the user input could not be matched. This so-called sliding window feature takes several dialogue turns into account. The cumulative feature counts the results of the sliding window feature to give an overall estimate of the call quality.


The frustration history subsystem evaluates the estimations of preceding dialogue turns. For this purpose, dialogue turns are labeled as not-angry (N) or angry (A). The user's anger most likely has a source: it is unlikely that an A will occur after a sequence of N's, whereas if A's have already occurred at some point in the dialogue, it is more likely for further A's to occur. This likelihood can be expressed using HMMs, as depicted in fig. 2.8 [40]. All the estimations of the subsystems are input to the meta-classifier, which then gives a prediction of the user's emotional state in terms of not angry, annoyed or angry. The combination of all the subsystems in this study reaches an accuracy of 78.1%, while the acoustic model alone has an accuracy of 77.3%.

Figure 2.8: HMM for Anger Turn Prediction

2.3.3.3 You're as Sick as you Sound

Automatic speech recognition systems have also found their way into the medical domain, although for most applications they remain subject to research and are not yet used in practice [41]. In a hospital, ASR systems could be used to assist in the screening process of a patient or to make assessments about ongoing treatments. Particularly in the screening process for mental disorders, e.g. autism spectrum disorder (ASD), they promise to lower the rate of undiagnosed cases. Beyond that, they might assist in supervised learning tasks for patients. Research in this domain covers a broad spectrum; three main paradigms have been established: assessment, treatment and assistive technologies. Prosodic cues describe properties of speech that are not covered by phonetic segments, such as intonation and tone, rhythm and stress. Amongst other things, they provide information about the emotional state of the speaker, and whether a statement is ironic or sarcastic. Prosodic cues are used for the assessment of coping strategies for different types of cancer, for the diagnosis of depression and schizophrenia, and for the classification of ASD. Additionally, textual analysis methods, a field of applied natural language processing, seem to be useful in the development of cancer coping mechanisms, in psychiatric diagnosis and in the analysis of suicide notes. Assistive technologies are explored for treating aphasia or ASD patients, as well as for the evaluation of cochlear implants. Since the analyses of the speech signal and the speech context have so far been investigated separately, future research aims to reconcile both approaches. This might lead to more accurate speaker state assessments and the development of more applications in a larger field.


3 Natural Language Processing

3.1 Brief History

Alan Turing laid the foundation of NLP in 1950, when he published his article Computing Machinery and Intelligence [24]. Throughout the 50s and 60s progress was slow; only a few working systems were built, SHRDLU (1968-70) and ELIZA (1964-66), both at MIT. During the 70s several ontologies were developed to preserve real-world information in computer-understandable structures. Before the 1980s, NLP had to be done by implementing complex rules based on the linguistic theories of Noam Chomsky. This changed with the introduction of machine learning algorithms and the acceptance of corpus linguistics. In the 1990s the Penn Treebank [25], a training data set which contains reliable statistics for dealing with grammatical and lexical phenomena, was introduced. Together with WordNet it represents the most common treebank. For further reading see [26]. Today's research is mostly concerned with unsupervised and semi-supervised learning algorithms. It is hard work to annotate a text corpus; unannotated corpora, however, can be generated from the huge amounts of text on the world wide web. In the last decade the processing power of computer systems and the content of the world wide web have increased drastically [23].

3.2 Theory

The following subsections present the theoretical concepts of NLP. One of the most challenging tasks of NLP is the automatic translation between different natural languages. NLP can be specified as a classification task, which allows supervised machine learning to be applied. An annotated text corpus is a requirement for classification-based learning; the quality of the annotations has great influence on the accuracy of the parsing. In combination with an inference system, NLP can extract the meaning of sentences and convert it to a logical form [23].


3.2.1 Parsing

In NLP, parsing means analyzing a sentence, a sequence of words, in order to find its syntactical structure, its grammar. The parsing task requires a formal grammar, which defines the rules for how words and sentences of the language using this grammar are formed. Two algorithms exist to parse a sentence with a context-free grammar: the Cocke-Kasami-Younger (CKY) algorithm and Earley's algorithm. The CKY algorithm, shown in appendix M [23], follows a bottom-up strategy; it starts at the leaves and ends at the root of the parsing tree and requires the grammar in Chomsky Normal Form (CNF). Earley's algorithm, shown in appendix O [23], can be used with context-free grammars in arbitrary form. However, those algorithms fail to recognize subtle differences in the meaning of words. This can be overcome with an extended CKY algorithm: the probabilistic CKY algorithm in appendix N assigns a weight to each rule [23], and the highest weight is assigned to the most probable meaning.
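The appendices give the CKY algorithm in full; a compact recognition-only sketch for a grammar in CNF is shown here. The toy grammar and lexicon are hypothetical and stand in for a real treebank grammar.

def cky_recognize(words, lexicon, rules):
    # Bottom-up CKY recognition for a grammar in Chomsky Normal Form.
    # lexicon: word -> set of non-terminals producing it.
    # rules: (B, C) -> set of non-terminals A with a rule A -> B C.
    n = len(words)
    # chart[i][j] holds the non-terminals spanning words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= rules.get((B, C), set())
    return "S" in chart[0][n]

# Hypothetical toy grammar: S -> NP VP, NP -> D N, VP -> V NP.
rules = {("NP", "VP"): {"S"}, ("D", "N"): {"NP"}, ("V", "NP"): {"VP"}}
lexicon = {"die": {"D"}, "sonne": {"N"}, "sieht": {"V"},
           "den": {"D"}, "mond": {"N"}}
print(cky_recognize("die sonne sieht den mond".split(), lexicon, rules))  # True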

3.2.2 Chunking

The determination of the relations between words is called chunking. Chunking only analyzes the references between the preceding or succeeding words. Chunking is also called anaphora resolution and is used to extract context information from a sentence; it is likewise based on decision trees [23]. An anaphora is information that depends on words that appeared in a preceding sentence or clause; the term is then called an anaphor, with the reference to the antecedent. In the other case, where a word depends on its succeeding sentence or clause, it is called a cataphor, with the reference to the postcedent.

3.2.3 Part-of-Speech (POS) Tagging

Assigning a part of speech to a given word is called part-of-speech tagging. A major difficulty for POS tagging is the ambiguity of words: a single word can belong to more than one part of speech. For example, the word bug can be a noun in the sentence "I have swallowed a bug!", or it can be a verb in the sentence "He bugged me on the telephone". Most commonly in English, nine parts of speech are distinguished: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. Those parts of speech are further analyzed for their grammatical structure to provide a more precise categorization. In fig. 3.9 an example of a parsed sentence is given [23]. The root of the decision tree is a noun phrase (NP) chunk, with arcs to the subsequent chunks, again a noun phrase and a prepositional phrase (PP). The first chunk in the second layer, NP, contains two words with the POS tags D and N, where D identifies a determiner and N a noun. The second chunk contains a preposition P and another chunk NP. The NP chunk of the second path again contains the POS tags D and N. The classification of the chunks and POS tags is defined in the Penn Treebank [22]. The grammatical structure is analyzed by a set of rules or, which is the focus of modern research in this field, by probabilities assigned through semi- or unsupervised learning. The latter is discussed in section 3.3. The rules for the grammatical structure have to be provided for the tagging process by the underlying text corpora. Hereby, 50 to 150 parts of speech can be distinguished by modern NLP systems. To overcome the ambiguity in the part of speech of a word, HMMs are used. The


Figure 3.9: POS Tagging Decision Tree

HMMs assign probabilities to parts of speech depending on the preceding word and its part of speech. For example, the article the could precede a noun in 40%, an adjective in 40% and a number in 20% of the cases. For each ambiguity a separate decision tree, also called a parse tree, can be constructed, and an accumulated probability is assigned to the tree, as shown in fig. 3.10 [23].

Figure 3.10: Decision Tree with Classification Error Count
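The ambiguity of bug described above can be observed directly with the NLTK tagger used later in this project; the tagger resolves it from the context. Note that the bundled model is the English Penn Treebank tagger; German tagging with the STTS tag set requires a separately trained model.

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

for sentence in ["I have swallowed a bug!", "He bugged me on the telephone."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# 'bug' is tagged NN (noun) in the first sentence and
# 'bugged' VBD (past-tense verb) in the second.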

3.2.4 Stemming

The stemming task reduces inflected words to their root, the word stem. There are several different approaches.

3.2.4.1 Suffix-stripping

Suffix-stripping finds the path to the root by stripping the word of its ending, e.g. the end syllable "ing". The algorithm identifies the suffix "-ing" and removes it from the word in order to get to the root. This approach has weaknesses in dealing with irregularities; e.g., the root of the word "met" is "meet".

3.2.4.2 Lemmatisation

A lemma is the base form of a given word. For noun phrases in the German language, it is given by the nominative singular form, while the lemma of a verb phrase is given by its active infinitive present form. Lemmatisation is based on the information collected with POS tagging. A different set of normalization rules may be applied to find the root of a word,

depending on the POS tags. These normalization rules are able to modify the word stem, which makes lemmatisation more powerful than suffix-stripping at finding irregular word endings.

3.2.4.3 Stochastic Algorithms

Stochastic algorithms are very similar to suffix-stripping and lemmatisation. They are based on sets of rules that have to be engineered. However, stochastic algorithms need training; a table mapping the word stem to its inflected words has to be provided for this purpose. With the trained data set as common ground, the algorithm can decide which rules are applied to find the root, according to the estimated probability.

3.2.4.4 Matching Algorithms

Matching algorithms try to look up the word stem for a given inflected word in a provided table.
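The difference between the approaches can be seen with the NLTK implementations: the Porter suffix-stripper handles the regular ending of "meeting" but misses the irregular form "met", which the WordNet-based lemmatiser resolves using the POS information.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("meeting"))               # "meet" - regular suffix stripped
print(stemmer.stem("met"))                   # "met"  - irregularity missed
print(lemmatizer.lemmatize("met", pos="v"))  # "meet" - resolved as a verb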

3.3 Recent Research

This subsection is merely a summary of chapters eight and nine, pages 197-236, of [23]. For supervised learning, the training data needs linguistic annotations, classifications and structures for the intended test data; such classifiers are, for example, POS tagging or anaphora resolution. Supervised learning algorithms generally achieve higher recognition accuracy with less data than unsupervised learning. Unsupervised learning, which is also called deep learning, does not provide such levels of accuracy. The major advantage unsupervised learning has over supervised learning is its higher capacity for handling erroneous input data, or data that is not erroneous but has never occurred in the training data.

3.3.1 POS Tagging and Chunking

In the case of POS tagging, the entire procedure can be regarded as unsupervised grammar induction. With the application of an expectation maximization (EM) algorithm, the most likely parse for a sentence can be determined. This method identifies constituents in unannotated text, which depend on POS relations. The POS relations of the set of sentences are compared, and the intersection of the sentences represents the learned probability. A measure for the accuracy of the test results is the F-score or F-measure. It represents the weighted average of precision and recall, where precision is the number of correct positive results divided by the number of all positive results, and recall is the number of correct positive results divided by the number of results that should have been positive. It is given in eq. (9).

F_{score} = 2 \cdot \frac{precision \cdot recall}{precision + recall}    (9)
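As a small worked example of eq. (9): a tagger producing 80 correct positives, 20 false positives and 10 false negatives has a precision of 0.8 and a recall of about 0.89, giving an F-score of about 0.84.

def f_score(tp, fp, fn):
    # Harmonic mean of precision and recall, eq. (9).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_score(80, 20, 10))  # approx. 0.842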

A model using EM achieves an F-score of 71% when compared with the Penn Treebank on the Wall Street Journal (WSJ) text corpus. This could be improved to 87% accuracy by using binary branching, a feature which the Penn Treebank does not provide. With the combination of a model for dependency grammar induction and one for constituent structure grammar induction, parsers can achieve an F-score of 77.6% with Penn Treebank POS tagging, and an F-score of


72.9% with an unsupervised tagger. In the field of NLP, unsupervised learning has successfully been applied to POS tagging. An F-score of 88% to 91% can be achieved with a WSJ corpus and structural information from the Penn Treebank; 75% to 79% accuracy can be achieved with unsupervised methods. A method for semi-supervised learning that combines aspects of both systems can be trained with distinct classifiers for a word disambiguation task. The training data has to be annotated in this case. After training the classifiers, the parsing process is applied to an unannotated corpus. The instance of classifiers for which all results are the same is selected. The annotation is then added to the original training data and a new training epoch is started. This procedure is repeated until no more unannotated data is available and the task is done. The task of linguistic annotation of text corpora will most likely not become obsolete. However, the engineering effort for supervised learning algorithms is very expensive. This is in contrast to the raw material for unsupervised learning, for which the world wide web is a source of nearly unlimited and constantly growing data.

3.3.2 Native Language Acquisition

In 1996 it was shown that machine learning methods play an important role in human language acquisition. An experiment showed that eight-month-old infants are able to learn word boundaries in continuous syllable sequences from two minutes of training material. The results suggest that transitional probabilities can be used to identify the constituent structures of word sequences. With each training epoch, the learned constraints influence the following epoch and its results. Much research still needs to be done before such a model can be used with the required confidence. It provides a serious alternative perspective to the approaches of the past five decades on how humans acquire natural language.

3.3.3 Simple Synchrony Networks (SSN)

A Simple Synchrony Network, introduced by Lane and Henderson [27], [28], is another type of ANN. It is an MLP and has interconnections between hidden layers as defined in RNNs, with the difference that the input to the previous hidden layer is a derivation sequence and not just a sequence. This is visualized in fig. 3.11 [23]. The correlation between different nodes of a partial structure of the hidden layers is defined by the engineer. The current hidden layers are linked to previous decisions with a weight. SSNs applied to parsing tasks have been shown to be among the best parsing approaches for NLP. With this approach, constituency parsing, dependency parsing and semantic role parsing can be done.

The benchmarking of the parser is done with the labeled Penn Treebank data of the WSJ, which contains constituency trees. Its performance is also measured by the F-measure and reaches an F-measure of 89.5 percent on the WSJ test set. SSNs are generative, which means that they calculate the constituency tree on the output and try to predict the next word on the input. The prediction of the input word represents the correlation between previous words and the decisions made; this influences the probability of the word that is predicted in the next step. With a discriminative, rather than a generative approach, the requirement to predict words has been removed, while maintaining the ability to differentiate between compatible (high probability) and incompatible (low probability) next words. This improves the F-measure of the SSN to 90.1 percent. SSNs are used in the same way for dependency parsing.


Figure 3.11: Simple Synchrony Networks Unfolded over a Derivation Sequence

Henderson et al. [29] suggest a constituency parser combined with a semantic role parser to calculate the syntactic and semantic probabilities. The hidden neurons are divided into two parts, a syntactic and a semantic weight. The derivation steps are then able to make conditionable decisions. These decisions represent the close relation between the two parsers and improve the accuracy significantly over a separate observation.


4 Test System, Training and Experiments

4.1 Heterogeneous System Architecture (HSA)

To exploit the special features of the AMD APU architecture, a driver stack for HSA is required. The idea behind HSA [42] is the coherent use of the system memory by the CPU and GPU of a computer system. The result is an APU, where the GPU can be utilized as a coprocessor. In fig. 4.12 the system architecture of typical modern computer systems is depicted; it is called Non-Uniform Memory Access (NUMA). If an application is designed to do parallel computing tasks on the GPU, it has to copy the data for processing purposes from the virtual memory of the CPU to the physical memory of the GPU and vice versa. This is not efficient and represents a bottleneck for parallel processing; it is further discussed in paragraph 4.3.3.2. Consequently, GPGPU programming is mostly suitable for multiple instruction single data (MISD) computations.

Figure 4.12: Non-Uniform Memory Access

With the system architecture shown in fig. 4.13, GPU and CPU share the same virtual memory. As a consequence, no data has to be copied; instead, the CPU only passes a pointer to the GPU and directly reads the result of the parallel processing task from the virtual memory. Such a memory architecture is called heterogeneous unified memory access (HUMA) and is a key feature of HSA. Further features are the avoidance of kernel space calls and the queueing of tasks between CPU and GPU. To access these features, different libraries are provided, as well as the new HSA Intermediate Language (HSAIL).


Figure 4.13: Heterogeneous Unified Memory Access
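The difference between the two memory models can be sketched at the OpenCL API level. PyOpenCL is used here purely for illustration and is not part of the project's toolchain; whether USE_HOST_PTR actually avoids the copy depends on the platform.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
host_data = np.arange(1 << 20, dtype=np.float32)

mf = cl.mem_flags
# NUMA-style: the host array is copied into a device allocation.
copied = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_data)

# HUMA-style: the device may work on the host allocation directly;
# on an APU with shared memory, no copy is required.
shared = cl.Buffer(ctx, mf.READ_ONLY | mf.USE_HOST_PTR, hostbuf=host_data)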

Since the APU architecture of the test system is of the Piledriver generation with the codename Trinity, and not at least of the Steamroller generation with the codename Kaveri, the HSA driver stack is not supported. This can be tested with the included script "kfd_check_installation.sh". As the following standard output shows, a compatible APU could not be found. Thus it is impossible to further investigate the benefits of HSA in this project.

./kfd_check_installation.sh

Carrizo detected:...... NO Kaveri detected:...... NO amdkfd module is loaded:...... NO AMD IOMMU V2 module is loaded:...... Yes KFD device exists:...... NO KFD device has correct permissions:...... NO Valid GPU ID is detected:...... NO

Can run HSA...... NO

4.2 CMU-Sphinx

The CMU Sphinx toolkit [1] is a set of tools to build various types of speech applications. Its four major tools [43] are explained in the following sections. While Sphinx-4 targets server and cloud-based services, the Pocketsphinx implementation targets real-time applications [44]. They share the same concepts, but are completely different implementations.

4.2.1 Sphinx-4

Sphinx-4 is a Java library and provides an easy API with which to build complex speech recognition applications [45]. As this API is of no relevance to this project, further details are omitted; however, Pocketsphinx/Sphinxbase share the overall architecture.

In fig. 4.14 the architecture of the Sphinx decoder is shown [46]. The architecture consists of four components. The application provides the knowledge base and feeds speech and control input to the frontend. The frontend computes the features of the speech input. These two components and the knowledge base component are connected to the decoder component.


The decoder itself has three components. The first component, graph construction, is also called the linguist; it takes information from the knowledge base to construct language HMM graphs. It translates the language model into an internal grammar. This offers the advantage that different types of grammar can be provided by the application, e.g. a statistical N-gram model (as in this case), a context free grammar (CFG), a finite state grammar (FSG), or a finite state transducer (FST). FSTs produce a specific output for a given input, which means that FSTs can map one formal language to another. In the case of N-gram language models, each word represents a node in the grammar and is connected to every other node. The grammar is then converted to an HMM and the nodes are expanded with the information from the dictionary and the acoustic models. The construction of the language HMMs can be done either statically or dynamically. If they are constructed statically, the language HMM represents the whole language model and therefore takes a lot of memory. In the case of dynamic construction, every time the search reaches the end of a word, a language HMM is built for the possible following words.

Figure 4.14: Sphinx-4 Decoder Architecture

The second component is the acoustic scorer, also called state probability computation in fig. 4.14. On demand of the search module, the acoustic scorer requests a feature vector from the feature computation of the frontend. It matches the feature vector against those of the acoustic model and computes the scores of the vector. The scores are then returned to the search module. The third component is the search module. It takes the language HMM provided by the graph construction module and generates a trellis that will be searched. For searching the trellis, two algorithms can be used, either Viterbi or Bushderby. While the former performs classification based on likelihood, the latter relies on free energy. The search module takes the state probabilities provided by the acoustic scorer component and searches for matches in the trellis. From each matching node in the trellis, a token is passed

to its succeeding nodes and the node is added to the list of active tokens. Amongst other information, a token contains the overall score of a path through the trellis. The termination of a path in a token is considered a hypothesis of the word sequence. The complete token tree of hypotheses is then sent to the application. This architecture is mostly identical to that of Google Search by Voice, which is shown in fig. 2.6. The knowledge base and graph construction module are identical in both architectures. Sphinx uses Viterbi decoding in its search module, which is also the case in Google Search by Voice. The state probability computation from fig. 4.14 corresponds to the Acoustic Model Evaluation Module in fig. 2.6.

4.2.2 Sphinxbase

Sphinxbase provides low level functions and algorithms that are used in different parts of Pocketsphinx. It provides platform independent access to the installed audio devices, a parser for JSGF [47] grammars into Sphinx finite-state grammars, and tools for language model formatting.

4.2.3 Pocketsphinx

Pocketsphinx is a C library that requires the same version of the Sphinxbase library to be installed. The API of Pocketsphinx provides various possibilities to develop customized speech recognition applications. Words of utterances can be found in an iterable list, together with information about time points, scores, and posterior probabilities. The posterior probability of the whole utterance can be accessed as well. If needed, single lattices can be examined. Pocketsphinx allows multiple searches that can be switched at runtime. For example, an activation utterance detection based on simple keyword search detects a keyword, and Pocketsphinx switches to a continuous N-gram search based on a complex language model. The following search modes are available:

• keyword

• grammar

• N-gram / Language Model

• allphone

If needed, Pocketsphinx can be configured to use custom acoustic or language models, grammars, dictionaries or keyword lists. For developing a Python application based on Pocketsphinx, a wrapper API has to be used [48]; a minimal decoding sketch follows the list below. The native C API provides full access to decoders, their properties, and search modes. Some of the provided functionality is listed below; a complete list can be found in the library include file pocketsphinx.h [49].

• The recognition processing can be controlled

• Hypotheses of the current recognition can be analyzed, depending on the rescoring mode

• N-best score or the complete lattice with full contextual information of a hypothesis can be accessed


• Unknown words can be added to the vocabulary and the language model during runtime
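As an illustration of the Python wrapper [48], the following minimal sketch decodes a raw audio file with the Voxforge models from section 4.3.1; the input file name and the audio format (16 kHz, 16 bit mono PCM) are assumptions, and the exact module layout of the wrapper may differ between versions:

from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'voxforge-de-r20140216/model_parameters/voxforge.cd_cont_4000')
config.set_string('-lm', 'voxforge-de-r20140216/etc/voxforge.lm.DMP')
config.set_string('-dict', 'voxforge-de-r20140216/etc/voxforge.dic')
decoder = Decoder(config)

decoder.start_utt()
with open('utterance.raw', 'rb') as audio:   # hypothetical test recording
    decoder.process_raw(audio.read(), False, True)
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)              # best hypothesis string
    for seg in decoder.seg():                # word segments with time points
        print(seg.word, seg.start_frame, seg.end_frame)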

4.2.4 Sphinxtrain

Sphinxtrain is a tool to train customized acoustic models based on GMMs. For brevity, further details are omitted, as they have no influence on the matters discussed here.

4.3 Knowledge Base

In order to recognize spoken German, Pocketsphinx requires a German dictionary as well as a German acoustic model and language model. These three components of the knowledge base are investigated in more detail throughout this section.

4.3.1 Voxforge Dictionary, Acoustic and Language Models

VoxForge is a multi-lingual community which provides acoustic models together with different feature extraction formats of speech, e.g. 8 kHz and 16 kHz sampling rate, spoken by different speakers in different languages. The Open Speech Data Corpus for German was created by the LT and Telecooperation Group [54] and is mirrored by Voxforge. It is published under the Creative Commons license CC-BY and contains about 36 hours of speech, spoken by about 180 speakers, 130 male and 50 female. A 3-gram based language model is available for any supported language, as well as the phonetic dictionary, which contains the vocabulary for the acoustic and language model. Voxforge also provides scripts and tools, e.g. to extract pronunciations from the Wiktionary webpage.

4.3.2 German Corpus for Building Language Models

In the domain of automatic speech recognition, three sources have mostly been used as training text:

• Wikipedia Dump

• Gutenberg Online

• Zeit Online Archive

To serve as raw text data for a language model toolkit, punctuation has to be removed, and digits and special characters have to be expanded to their textual representation or removed if there is none. For this purpose a set of Python scripts is used to produce the raw text data: one sentence per line, all in capital letters.
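A simplified sketch of this cleanup, assuming plain-text input (the number expansion performed by the ParseNumberNames helper of appendices A and B is omitted here):

import re

def normalize(line):
    """One cleaned sentence per line: punctuation and digits removed,
    whitespace collapsed, everything upper case."""
    line = re.sub(r"[^\w\s]|_", " ", line)   # strip punctuation
    line = re.sub(r"\d+", " ", line)         # drop digits not expanded to words
    return re.sub(r"\s+", " ", line).strip().upper()

print(normalize('Die Antenne hat eine "unterschiedliche" Größe.'))
# -> DIE ANTENNE HAT EINE UNTERSCHIEDLICHE GRÖSSE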

4.3.2.1 Wikipedia Dump

Every two to four weeks, Wikipedia stores [50] a snapshot of the whole Wikipedia database for any given language and provides it as a compressed XML file. The German Wikipedia dump file from 6 August 2015 was used to build the text corpus. It is accessible at [51].


The data extracted from the dump file is in XML format. Any section regarding the formatting of the article has to be removed, and the relevant sections have to be cleaned of XML annotations and tags with the Python script in appendix A. The relevant text content has a size of approximately 3.5 GB. After cleaning, the corpus has a size of 1.7 GB.

4.3.2.2 Gutenberg Online Archive

The Gutenberg Online Archive contains the textual representation of works of classical German literature. The complete collection of multiple German writers is available at [52]. In contrast to the Wikipedia dump, there is no single dump file. Instead, the text is accessible through a web interface. For this purpose another Python script, see appendix B, is used to access the web site and extract the text content for all works of all writers. Harvesting this text results in a corpus which has to be cleaned as described in section 4.3.2.1, resulting in a file approximately 2.3 GB in size. After cleaning, the corpus has a size of 514.8 MB.

4.3.2.3 Zeit Online Archive

Downloading text from the Zeit Online archive [53] proved difficult, because the server canceled the session after one year of articles had been extracted. For this reason the Zeit Online archive is not used any further in this project.

4.3.3 Language Model Training and Toolkits

Different toolkits are available to train and build customized language models. SRILM and CMUCLMTK provide N-gram based language models; CSLM and RNNLM provide neural network based language models. While RNNLM trains a recurrent neural network, CSLM trains an MLP neural network. The toolkits under investigation are described in the following sections.

4.3.3.1 SRILM Toolkit

The SRILM toolkit is provided by SRI International (formerly Stanford Research Institute) and contains several command line tools to create backoff N-gram language models from cleaned text files. At first, the language model, a 7-gram model with Kneser-Ney smoothing, is built with the following command:

ngram-count -debug 2 -text TRAINFILE -order 7 -lm MODELNAME \
    -kndiscount -interpolate -gt3min 1 -gt4min 1

With the following command, the PPL of the language model can be determined:

ngram -debug 2 -lm MODELNAME -order 7 -ppl TESTFILE > PPLFILE

The second step is to sort the model with the command:

27 ASR and NLP 4.3 Knowledge Base

sphinx_lm_sort MODELNAME_SORTED

The third and last step is to convert the sorted model to a format that Pocketsphinx understands, using the following command:

sphinx_lm_convert MODELNAME_SORTED_CONVERTED

4.3.3.2 RNNLM Toolkit

The following commands are used to train and test a language model based on the probabilities of a recurrent neural network with the RNNLM toolkit. The test and validation files each contain a tenth of the sentences of the training file. The RNNLM is trained with the following command:

rnnlm -train lmData/rnnlmtk/train__GutenbergVoxforge.merge.rnnlm \
    -valid lmData/rnnlmtk/valid__GutenbergContentRaw \
    -rnnlm lmData/rnnlmtk/GutenbergVoxforge.merge.rnn.lm \
    -hidden 15 -rand-seed 1 -debug 3 -class 100 -bptt 4 \
    -bptt-block 10 -direct-order 2 -alpha 0.1 -direct 5

The parameters -hidden 15, -bptt-block 10 and -alpha 0.1 are the most important ones. With -hidden 15, the size of the hidden layer is defined. With -bptt-block 10, the backpropagation-through-time algorithm is applied in blocks of 10 time steps. With -alpha 0.1, the initial learning rate of the training is set (cf. the training output in appendix D). The RNNLM is tested and the PPL is determined with the following command:

./rnnlm -rnnlm lmData/rnnlmtk/GutenbergVoxforge.merge.rnn.lm \
    -test lmData/rnnlmtk/test__GutenbergContentRaw

The output of this command sequence is shown in appendix D.

The simplest way to accelerate the training of RNNLM with GPGPU is to perform the vector-matrix multiplication on the GPU. For this purpose, the C++ code of the function vectorXMatrix() in the file rnnlmlib.cpp has to be translated to use the clBLAS SGEMV GPGPU implementation. A first attempt, which requires further modification, is shown in the listing in appendix E. The numerical calculations during the training phase are not yet correct and require further debugging. There will be no further investigation in this research, due to the problems mentioned in section 4.1. However, first time measurements of the data handling were done. Data handling for matrices with a size of 11966x5127198 ((vocabulary) x (words in train file)), with OpenCL on a conventional CPU/GPU system, took the following times in microseconds:

Create Buffers in us:          A 3   X 1   Y 1
Create Enqueue Buffers in us:  A 65  X 53  Y 47
Copy Src Vector in us:         X 0
Copy Src Matrix in us:         A 0
Vector x Matrix in us:         Y 2
Copy Result in us:             A 1

Here, A is the source vector, X the source matrix, and Y the destination vector of the operation.
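In NumPy terms, the operation that vectorXMatrix() performs, and that clBLAS' SGEMV is meant to take over, is a single-precision matrix-vector product; the sizes below are illustrative only:

import numpy as np

hidden, vocab = 15, 11966            # layer sizes as used in this section
matrix = np.random.rand(hidden, vocab).astype(np.float32)
src = np.random.rand(vocab).astype(np.float32)
dest = matrix @ src                  # what SGEMV would compute on the GPU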


4.3.3.3 CSLM Toolkit

The CSLM toolkit requires BLAS libraries that could not be provided. There are three different BLAS libraries that could be used. The first is the MKL library by Intel, which cannot be used on an AMD architecture. The same is true for the CUDA libraries: CUDA only works on Nvidia GPUs and not on AMD GPUs. The standard and architecture independent library libatlas could not be used due to version mismatches. At this point clBLAS, as mentioned in section 4.3.3.2, could be implemented instead.

4.3.3.4 CMUCLM Toolkit

The CMUCLMTK is the reference toolkit for Pocketsphinx and provides the tools that are required to convert and compile language models for use with Pocketsphinx. Firstly, a temporary vocabulary file has to be generated. This is a list of all the words in the training text file:

text2wfreq < lmData/train__GutenbergVoxforge.merge | \
    wfreq2vocab > lmData/cmulmtk/train__GutenbergVoxforge.merge.tmp.vocab

In the second step the ARPA format language model is generated, using the commands:

text2idngram -vocab lmData/cmulmtk/train__GutenbergVoxforge.merge.tmp.vocab \
    -idngram lmData/cmulmtk/train__GutenbergVoxforge.merge.idngram \
    < lmData/train__GutenbergVoxforge.merge

idngram2lm -vocab_type 0 \
    -idngram lmData/cmulmtk/train__GutenbergVoxforge.merge.idngram \
    -vocab lmData/cmulmtk/train__GutenbergVoxforge.merge.tmp.vocab \
    -arpa lmData/cmulmtk/GutenbergVoxforge.merge.lm

The last step is to compile the language model to CMU binary form (BIN) with:

sphinx_lm_convert -i lmData/cmulmtk/GutenbergVoxforge.merge.lm \
    -o lmData/cmulmtk/GutenbergVoxforge.merge.lm.bin

The training output of the merged Voxforge and Gutenberg corpora is shown in appendix G. With the tool ngram from the SRILM toolkit, OOV and PPL of the language model can be estimated as follows:

ngram -lm lmData/cmulmtk/GutenbergVoxforge.merge.lm \
    -ppl lmData/cmulmtk/voxforge_test.transcription.cleared -debug 2 \
    > lmData/cmulmtk/GutenbergVoxforge.merge.lm.ppl

The following output lines show the averages of the language model:

file lmData/cmulmtk/voxforge_test.transcription.cleared: 2888 sentences, 17891 words, 157 OOVs
0 zeroprobs, logprob= -36025.7 ppl= 55.8412 ppl1= 107.51

An example test, as performed by ngram, is shown below:


PARAGRAPHSYMBOL ZWEIHUNDERTDREIUNDFÜNFZIG STGB SCHÜTZT AUCH DIE FREIE WILLENSBILDUNG
p( PARAGRAPHSYMBOL | <s> ) = [1gram] 2.80414e-09 [ -8.5522 ]
p( ZWEIHUNDERTDREIUNDFÜNFZIG | PARAGRAPHSYMBOL ...) = [1gram] 5.84925e-07 [ -6.2329 ]
p( STGB | ZWEIHUNDERTDREIUNDFÜNFZIG ...) = [2gram] 0.886543 [ -0.0523 ]
p( SCHÜTZT | STGB ...) = [3gram] 0.831764 [ -0.08 ]
p( AUCH | SCHÜTZT ...) = [3gram] 0.851138 [ -0.07 ]
p( DIE | AUCH ...) = [3gram] 0.851138 [ -0.07 ]
p( FREIE | DIE ...) = [3gram] 0.00708435 [ -2.1497 ]
p( WILLENSBILDUNG | FREIE ...) = [3gram] 0.212765 [ -0.6721 ]
p( </s> | WILLENSBILDUNG ...) = [3gram] 0.851138 [ -0.07 ]
1 sentences, 8 words, 0 OOVs
0 zeroprobs, logprob= -17.9492 ppl= 98.7087 ppl1= 175.247

4.4 NLP with Parsers, Treebanks and Taggers

4.4.1 Pattern

For the first approach, the pattern.de Python package is used to analyze the spoken language. Pattern is a freely available native Python NLP tool, provided by the Computational Linguistics & Psycholinguistics Research Center [55] in Antwerp, Belgium. It has to be executed in a separate process, because Python 3 cannot execute Python 2.7 code and the module pattern.de is only available for Python 2.7. The recognized utterance is written to a file and the PatternWrapper process parses the sentence. Pattern performs POS tagging, chunking and anaphora resolution, according to the Penn Treebank tags [56].
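A minimal sketch of such a parse under Python 2.7, using pattern.de's documented parse() function (the example sentence is taken from section 4.6.2.2):

# -*- coding: utf-8 -*-
# Runs under Python 2.7 only; psApp.py talks to it through the PatternWrapper process.
from pattern.de import parse

# POS tags, chunks and relations in Penn Treebank style, one token as word/TAG/chunk/role:
print(parse(u'ist die sonne gruen', relations=True, lemmata=True))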

4.4.2 Stanford Parser

The Stanford Parser is written in Java and provides an API for NLTK and also a Python 2.7 API through JPype, a wrapper which provides bidirectional exchange between the Python interpreter and the Java Virtual Machine (JVM). The configuration of both APIs was not successful on the test system, so the parser could not be evaluated.

4.4.3 NLTK Unigram POS Tagger

NLTK is available for both Python 2.7 and Python 3, thus it integrates into psApp.py without problems. It provides the possibility to use custom annotated corpora for POS tagging based on unigrams. As a requirement, the text has to be annotated with POS tags. For the German language, POS tagged text is available at [57]. There are two corpora: the first contains speeches of the German presidency covering the years 1984 to 2012, and the second contains German chancellery speeches from 1998 to 2011. The corpora are available in an XML format and have to be parsed into a format NLTK understands: word/tag. The tags follow the Stuttgart-Tübingen-TagSet (STTS) notation [58] instead of the Penn Treebank notation. After the conversion, the tagged corpus can be loaded by the NLTK TaggedCorpusReader. The TaggedCorpusReader generates tagged sentences, with which the NLTK UnigramTagger is trained [59]. After the training phase, tokenized sentences can be POS tagged.
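A sketch of this training pipeline with the standard NLTK classes; the corpus directory name is hypothetical, and the expected tags follow the example in section 4.6.2.1:

from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import UnigramTagger

# 'stts_corpus' is an assumed directory holding the converted word/tag files
reader = TaggedCorpusReader('stts_corpus', r'.*\.txt')
tagger = UnigramTagger(reader.tagged_sents())

print(tagger.tag('MACHE DAS LICHT AN'.split()))
# e.g. [('MACHE', 'VVIMP'), ('DAS', 'ART'), ('LICHT', 'NN'), ('AN', 'PREP')]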

4.4.4 GermaNet

The GermaNet corpus (version 10.0), provided by the University of Tübingen [13], is a lexical-semantic database. It contains synonyms, hypernyms and hyponyms, meronyms and holonyms, and lemmas of German words, structured in the same way as the WordNet [60] provided by Princeton University. For research purposes an academic license must be obtained, which is free of charge in this case. In general, GermaNet is a resource for word sense disambiguation of nouns, verbs and adjectives, to be used in NLP applications. GermaNet requires a running database like MongoDB and a frontend for the database queries, in this case pygermanet [61]. The integration with NLTK should also be possible, but requires modifications to the NLTK source code. This integration was not successful; a tutorial is available, but it refers to different versions of NLTK and GermaNet.

4.5 psApp.py - A Python ASR and NLP Tool

For the frontend application, two different approaches were investigated. The first approach was to use the GStreamer [14] multimedia framework to record audio data, process the speech signal with a Pocketsphinx plugin and forward the result to the Python PyQt UI. This attempt was not successful, because the Pocketsphinx GStreamer plugin could not keep up with the incoming samples; GStreamer reported several thousand dropped samples. This approach was then discarded in favour of a multiprocessing solution based on a recognition process running C code. Fig. 4.15 shows the UI component and listing 1 shows the recognition process. They use the Open Sound Control protocol (OSC) [62] for inter-process communication via localhost. Despite being a simple messaging scheme, it has the advantage that the application could also be run as a distributed application.
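The OSC link can be sketched in a few lines, assuming the python-osc package on the Python 3 side; the recognizer itself is C code (appendix K) and may use another OSC binding, and the address and port below are hypothetical:

from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# sender side: report a hypothesis to the UI over localhost
client = SimpleUDPClient('127.0.0.1', 9000)
client.send_message('/asr/hypothesis', 'IST DIE SONNE GRUEN')

# receiver side: the UI maps the address to a handler and listens
dispatcher = Dispatcher()
dispatcher.map('/asr/hypothesis', lambda addr, text: print(text))
BlockingOSCUDPServer(('127.0.0.1', 9000), dispatcher).serve_forever()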

4.5.1 User Interface

The user interface was designed with the Qt Designer [63], converted to Python code and integrated with PyQt [64], a Python wrapper for the popular Qt framework. It provides switches for enabling/disabling the audio recording and for switching between the English and German language models. It further offers the possibility to read text files sentence by sentence, as long as the sentences are separated by periods. The results of the speech recognition process, or the sentence being read from the input file, are shown in the first text field. The second text field shows how a sentence was separated into tokens by NLTK, and the third text field shows the corresponding tags of the tokens. In the bottom text field, the OSC messages received from the recognition process can be monitored.

4.5.2 Recognition Process

4.5.2.1 Activation by Keyword Search

The activation of the continuous recognition with the language model is performed by a keyword search. To allow Pocketsphinx to recognize the word Heimdall, the phones of the word have to be provided in a dictionary file. The Arpabet notation of Heimdall has been derived from the IPA notation of the German words Heim and Hall, provided by the German Wiktionary database [65]. The IPA notation for "Heim" is [haim] and for "Hall" is [hal]. The "H" of "Hall" is then substituted by "D", although they do not share the same category of

consonants: "D" is a stop consonant while "H" is a fricative. The resulting IPA and Arpabet notations are shown in table 4.2.

IPA      [haimdal]
Arpabet  HH AY M D AA L

Table 4.2: “Heimdall” IPA and Arpabet Notation

With the following parameters, pocketsphinx searches for the keyword Heimdall:

-inmic yes
-lw 10
-feat 1s_c_d_dd
-beam 1e-80
-wbeam 1e-40
-wip 0.2
-agc none
-varnorm no
-cmn current
-dict keyword/key.dic
-keyphrase "HEIMDALL"
-kws_threshold 1e-20

Table 4.3: Pocketsphinx Keyword Search Parameters

4.5.2.2 Continuous Speech Recognition

The recognition process receives audio data from the underlying Sphinxbase library. It separates utterances by one second of silence; this is the rate at which sentences or utterances are recognized. A lattice search is performed successfully, while an N-best search had a yet untraceable error: a segmentation fault occurs when the hypothesis string is accessed. Another yet unexplained error is that switching the language model causes the recognition process to crash when the process is launched from inside PyQt as a QProcess. To overcome this problem, the recognition process is launched from a terminal. Together with all subsequent hypotheses, the recognition process reports the log probability of the hypotheses to the UI, which is calculated as shown in eq. (10).

\[
\mathrm{logProb} = 10^{\frac{\mathrm{bestScore}}{\sum_{i=0}^{N} 10^{\mathrm{score}_i}}}
\tag{10}
\]

where N is the number of recognized words and score_i is the score of the i-th recognized word.


The following parameters are set for the language model recognition, based on recommenda- tions from the VoxForge Community:

-inmic yes
-lw 10
-feat 1s_c_d_dd
-beam 1e-80
-wbeam 1e-40
-wip 0.2
-agc none
-varnorm no
-cmn current
-dict voxforge-de-r20140216/etc/voxforge.dic
-lm voxforge-de-r20140216/etc/voxforge.lm.DMP
-hmm voxforge-de-r20140216/model_parameters/voxforge.cd_cont_4000

Table 4.4: Pocketsphinx Voxforge Parameters

The listing in appendix K shows the source code of the recognition process.

Figure 4.15: Python Application psApp.py UI


~/Images/ResearchProject/testData/psApp.git $ ./startASRNLP.sh

Training Unigram POS Tagger
Got bus address: "unix:abstract=/tmp/dbus-QtKkYBLacx,guid=893e7d59d5bbaae18a43059d56ea92d0"
Connected to accessibility bus at: "unix:abstract=/tmp/dbus-QtKkYBLacx,guid=893e7d59d5bbaae18a43059d56ea92d0"
Registered DEC: true
rm: cannot remove tmp/response : No such file or directory
rm: cannot remove tmp/sentence : No such file or directory
Registered event listener change listener: true
waiting for GUI...
starting PS Decoder...

0 -@- pocketsphinx_continuous
1 -@- -argfile
2 -@- ResearchProject/testData/psApp.git/config/de
INFO: continuousOSC.c(337): pocketsphinx_continuous COMPILED ON: Dec 15 2015, AT: 13:21:17
INFO: pocketsphinx.c(145): Parsed model-specific feature parameters from ResearchProject/testData/psApp.git/voxforge-de-r20140216/model_parameters/voxforge.cd_cont_4000/feat.params

Current configuration:
------8<------
Keyword search configuration ...
Keyword search info ...
------8<------
Current configuration:
------8<------
Language Model search configuration ...
Language Model info ...
------8<------

Keyword Listening...
INFO: cmn_prior.c(131): cmn_prior_update: from < 39.95 -4.23 -0.34 8.20 0.66 4.18 -2.45 -1.60 -0.15 1.68 0.00 0.99 -0.74 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 39.60 -3.45 -1.74 7.24 -2.19 3.14 -2.01 -1.44 0.57 2.03 0.25 1.64 -1.13 >
Keyword spotted: HEIMDALL
LM Listening...
LM IN -1566
LM IST -2121
LM IST -2742
LM IST DIE -3596
LM IST DIESE -4238
LM IST DIESER -5377
LM IST DIE SONNE -5659
LM IST DIE SONNE -7019
LM IST DIE SONNE GRÜN -8333
LM IST DIE SONNE GRÜNEN -8103
LM IST DIE SONNE GRÜNEN -8861
LM IST DIE SONNE GRÜNEN -8864
LM IST DIE SONNE GRÜNEN -8947
INFO: cmn_prior.c(131): cmn_prior_update: from < 32.68 -14.38 -2.68 -1.07 -2.64 -0.51 -2.72 1.71 -1.13 0.96 -1.57 0.73 -1.12 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 38.95 -9.52 1.52 9.59 2.57 4.67 -5.26 -1.34 1.40 0.07 -2.68 -0.15 -1.41 >
INFO: ngram_search_fwdtree.c(1553): 1057 words recognized (6/fr)
INFO: ngram_search_fwdtree.c(1555): 219104 senones evaluated (1224/fr)
INFO: ngram_search_fwdtree.c(1559): 296110 channels searched (1654/fr), 83555 1st, 17566 last
INFO: ngram_search_fwdtree.c(1562): 1447 words for which last channels evaluated (8/fr)
INFO: ngram_search_fwdtree.c(1564): 3951 candidate words for entering last phone (22/fr)
INFO: ngram_search_fwdtree.c(1567): fwdtree 0.36 CPU 0.200 xRT
INFO: ngram_search_fwdtree.c(1570): fwdtree 3.90 wall 2.181 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 49 words
INFO: ngram_search_fwdflat.c(948): 965 words recognized (5/fr)
INFO: ngram_search_fwdflat.c(950): 46523 senones evaluated (260/fr)
INFO: ngram_search_fwdflat.c(952): 46902 channels searched (262/fr)
INFO: ngram_search_fwdflat.c(954): 3670 words searched (20/fr)
INFO: ngram_search_fwdflat.c(957): 2100 word transitions (11/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.05 CPU 0.031 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.06 wall 0.031 xRT
INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.168
INFO: ngram_search.c(1279): Eliminated 0 nodes before end node
INFO: ngram_search.c(1384): Lattice has 100 nodes, 151 links
INFO: ps_lattice.c(1380): Bestpath score: -7280
INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:168:177) = -513835
INFO: ps_lattice.c(1441): Joint P(O,S) = -523603 P(S|O) = -9768
INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
LM READY.... IST/VAFIN DIE/ART SONNE/NN GRÜNEN/NN
INFO: cmn_prior.c(131): cmn_prior_update: from < 39.60 -3.45 -1.74 7.24 -2.19 3.14 -2.01 -1.44 0.57 2.03 0.25 1.64 -1.13 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 39.60 -3.45 -1.74 7.24 -2.19 3.14 -2.01 -1.44 0.57 2.03 0.25 1.64 -1.13 >
Keyword ready....

Listing 1: pocketsphinx continuousOSC Console Output


4.5.2.3 Processing of Recognized Sentences

In order to enable the test system to respond to a command or request of the user, NLP is used. Commands usually contain a predicate in imperative form. Requests occur in two forms, as yes-no or probe questions. A probe question contains a question word like "who" or "where", while a yes-no question begins with a predicate at the first position of the sentence, having the general form predicate-subject-object. Numeric words shall also be converted into digits. The POS tag for cardinality determines whether a numeric word is present. If so, the cardinality word is converted to its integer form. A parser that does this for the German language, for words below eine Milliarde, is shown in appendix L; a simplified sketch follows below.
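A much reduced sketch of this cardinal-word conversion; it handles compounds like DREIUNDZWANZIG but not the full range below eine Milliarde covered by appendix L:

UNITS = {'NULL': 0, 'EINS': 1, 'EIN': 1, 'EINE': 1, 'ZWEI': 2, 'DREI': 3,
         'VIER': 4, 'FUENF': 5, 'SECHS': 6, 'SIEBEN': 7, 'ACHT': 8, 'NEUN': 9,
         'ZEHN': 10, 'ELF': 11, 'ZWOELF': 12, 'ZWANZIG': 20, 'DREISSIG': 30,
         'VIERZIG': 40, 'FUENFZIG': 50, 'SECHZIG': 60, 'SIEBZIG': 70,
         'ACHTZIG': 80, 'NEUNZIG': 90, 'HUNDERT': 100, 'TAUSEND': 1000}

def cardinal_to_int(word):
    word = word.upper()
    if word in UNITS:
        return UNITS[word]
    if word.endswith('HUNDERT'):          # e.g. DREIHUNDERT = 3 * 100
        return UNITS[word[:-len('HUNDERT')]] * 100
    if 'UND' in word:                     # e.g. DREIUNDZWANZIG = 3 + 20
        unit, tens = word.split('UND', 1)
        return UNITS[unit] + UNITS[tens]
    raise ValueError('unhandled cardinal: %s' % word)

assert cardinal_to_int('DREIUNDZWANZIG') == 23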

4.6 Experiments

4.6.1 ASR

To test the recognition accuracy, a set of sentences was chosen from the language model provided by Voxforge. From those sentences, variations were derived, one variation per word in the sentence. The first variation contained one changed word position, the second variation contained two changed word positions and so on, until every position had been changed. In table 4.7 and table 4.8, words written in bold are not contained in the dictionary. At first the original sentence was read, followed by the variations. For every recognition, the number of misrecognitions was counted. To test the merged corpus of the Voxforge and Gutenberg-Online corpora, the same procedure was used; only the last two test cases were modified to apply to the Gutenberg-Online corpus as well. The entire test sets are shown in table 4.7 and table 4.8, and the test results in table 4.9 and table 4.10. The test runs represent different audio hardware configurations and environments, shown in table 4.5 for the Voxforge language model and table 4.6 for the merged Gutenberg and Voxforge language model.


4.6.1.1 Test Environments and Configurations

Number  Configuration
1  AMD 8-Core Workstation with Focusrite Scarlett 2i2 USB audio interface and Shure Beta 91 condenser microphone, quiet environment
2  Intel i3 2-Core Laptop with integrated audio interface and microphone, quiet environment
3  Intel i3 2-Core Laptop with integrated audio interface and microphone, noisy environment
4  AMD 8-Core Workstation with Focusrite Scarlett 2i2 USB audio interface and Shure Beta 91 condenser microphone, quiet environment

Table 4.5: Test Run Configuration for the Voxforge Language Model

Number  Configuration
1  AMD 8-Core Workstation with Focusrite Scarlett 2i2 USB audio interface and Shure Beta 91 condenser microphone, quiet environment
2  Intel i3 2-Core Laptop with integrated audio interface and microphone, quiet environment
3  Intel i3 2-Core Laptop with integrated audio interface and microphone, noisy environment
4  AMD 8-Core Workstation with Focusrite Scarlett 2i2 USB audio interface and Shure Beta 91 condenser microphone, quiet environment
5  AMD 8-Core Workstation with Focusrite Scarlett 2i2 USB audio interface and Shure Beta 91 condenser microphone, quiet environment, Gutenberg sentences with Voxforge language model
6  Intel i3 2-Core Laptop with integrated audio interface and microphone, quiet environment, Gutenberg sentences with Voxforge language model
7  AMD 4-Core APU with integrated audio interface and low budget ELV condenser microphone, Gutenberg sentences with Voxforge language model

Table 4.6: Test Run Configuration for the merged Gutenberg and Voxforge Language Model


4.6.1.2 Test Sets

Word position: 1 2 3 4 5 6 7 8 9

LM 1: Norwegen grenzt an Schweden Finnland und Russland
V1: Schweden grenzt an Norwegen und Finnland
V2: Spanien grenzt an Portugal Frankreich und Andorra
V3: Deutschland grenzt an die Schweiz und Österreich
V4: Die USA grenzen an Mexico
LM 2: Die Antenne hat eine unterschiedliche Größe
V1: Die Festplatte hat eine unterschiedliche Größe
V2: Der Rahmen hat eine unterschiedliche Größe
V3: Das Auto hat eine normale Größe
V4: Die Bäume haben eine kleine Größe
V5: Die Läufer haben eine durchschnittliche Geschwindigkeit
V6: Die Bücher sind heute elektronischen Formats
LM 3: Man kann davon sehr viel Brot essen
V1: Man kann davon sehr viel Käse essen
V2: Man kann davon sehr viel Wasser trinken
V3: Man kann davon nicht viel Wein keltern
V4: Ich kann davon auch viel Bier tragen
V5: Wir können davon einmal viel Gemüse anbauen
V6: Sie werden damit noch viel Erfolg haben
V7: Er wird daraus wenig neue Hoffnung schöpfen
LM 4: Diese Nummer wird nur ein einziges Mal vergeben
V1: Diese Medallie wird nur ein einziges Mal vergeben
V2: Diese Variante wird nur ein einziges Mal ausgeführt
V3: Dieser Fall wird nur ein einziges Mal auftreten
V4: Dieser Tag wird nur kein einziges Mal gezählt
V5: Dieser Weg wird nur beim zweiten Mal gewertet
V6: Dieses Haus wurde nur durch letztes Mal einbezogen
V7: Dieses Bild sollte nur sein heutiges Ansehen beschmutzen
V8: Dieses Blatt könnte nicht unter leichte Dinge fallen
V9: Das Geld wird nur ein einziges Mal ausgegeben
V10: Das Bild wird nur ein einziges Mal gemalt
LM 5: Nur wenige Minuten Zeit werden benötigt
V1: Nicht wenige Minuten Zeit werden benötigt
V2: Sehr viele Minuten Zeit werden benötigt
V3: Ein paar Minuten Zeit würden benötigt
V4: Viele wertvolle Minuten Zeit könnten verstreichen
V5: Eher stressige Momente Zeit müssten aufkommen
V6: Bald keine Zeilen Text sollten wegfallen
LM 6: Das Datenpaket wurde an den Computer im Netzwerk weitergeleitet
V1: Das Ergebnis wurde an den Computer im Netzwerk weitergeleitet
V2: Die Information wurde an den Computer im Netzwerk weitergeleitet
V3: Die Einstellung wurde an den Laptop im Netzwerk weitergeleitet
V4: Diese Nachrichten wird an den Server im Netzwerk weitergeleitet
V5: Manche Befehle würden an die Workstation im Netzwerk weitergeleitet
V6: Andere Einstellungen wurden an dem PC im Netzwerk getätigt
V7: Solche Eingaben würden an dem Terminal im Rechenzentrum stattfinden
V8: Kurze Kabel sind an der Schnittstelle am Server benutzt
V9: Die Katze hat schnell die Tasten am PC gedrückt

Table 4.7: Recognition Test Set Voxforge


Word position: 1 2 3 4 5 6 7 8 9

LM 1: Norwegen grenzt an Schweden Finnland und Russland
V1: Schweden grenzt an Norwegen und Finnland
V2: Spanien grenzt an Portugal Frankreich und Andorra
V3: Deutschland grenzt an die Schweiz und Österreich
V4: Die USA grenzen an Mexico
LM 2: Die Antenne hat eine unterschiedliche Größe
V1: Die Festplatte hat eine unterschiedliche Größe
V2: Der Rahmen hat eine unterschiedliche Größe
V3: Das Auto hat eine normale Größe
V4: Die Bäume haben eine kleine Größe
V5: Die Läufer haben eine durchschnittliche Geschwindigkeit
V6: Die Bücher sind heute elektronischen Formats
LM 3: Man kann davon sehr viel Brot essen
V1: Man kann davon sehr viel Käse essen
V2: Man kann davon sehr viel Wasser trinken
V3: Man kann davon nicht viel Wein keltern
V4: Ich kann davon auch viel Bier tragen
V5: Wir können davon einmal viel Gemüse anbauen
V6: Sie werden damit noch viel Erfolg haben
V7: Er wird daraus wenig neue Hoffnung schöpfen
LM 4: Diese Nummer wird nur ein einziges Mal vergeben
V1: Diese Medallie wird nur ein einziges Mal vergeben
V2: Diese Variante wird nur ein einziges Mal ausgeführt
V3: Dieser Fall wird nur ein einziges Mal auftreten
V4: Dieser Tag wird nur kein einziges Mal gezählt
V5: Dieser Weg wird nur beim zweiten Mal gewertet
V6: Dieses Haus wurde nur durch letztes Mal einbezogen
V7: Dieses Bild sollte nur sein heutiges Ansehen beschmutzen
V8: Dieses Blatt könnte nicht unter leichte Dinge fallen
V9: Das Geld wird nur ein einziges Mal ausgegeben
V10: Das Bild wird nur ein einziges Mal gemalt
LM 5: So wenig wird hier auf meine Bemerkungen gegeben
V1: So wenig wird hier auf meine Bemerkungen eingegangen
V2: So wenig wird hier auf meine Erfahrungen verwiesen
V3: So wenig wird hier auf eure Kreise geachtet
V4: So wenig wird hier an meinem Fall gearbeitet
V5: So wenig wird dort für mein Wohlergehen gesorgt
V6: So wenig können da auch solche Ansichten zählen
V7: So einige werden mancherorts um meinen Kommentar flehen
V8: Sehr viele haben anderswo an ihr Aussehen gedacht
LM 6: Den Eltern darf man das nicht sagen
V1: Den Eltern darf er das nicht sagen
V2: Den Eltern darf sie das nicht zeigen
V3: Den Eltern darfst du das nicht flüstern
V4: Den Eltern dürfen wir das nicht erzählen
V5: Dem Gremium muss ich das nicht berichten
V6: Der Menge wird er es doch beichten
V7: Die Leute müssen ihm das auch zumuten

Table 4.8: Recognition Test Set Voxforge and Gutenberg-Online


4.6.1.3 Test Results

Table 4.9: Recognition Test Results showing the Number of Misrecognized Words (Voxforge)

Test Run:  1 2 3 4
LM 1: 0 0 0
V1: 1 0 0
V2: 3 5 3
V3: 2 2 2
V4: 1 1 2
LM 2: 0 0 0
V1: 0 3 0
V2: 0 2 1
V3: 1 0 1 2
V4: 1 3 1
V5: 1 1 1
V6: 0 2 0
LM 3: 0 0 0
V1: 2 0 2
V2: 0 0 0
V3: 3 2 2
V4: 0 0 3
V5: 1 1 0
V6: 0 0 0
V7: 1 3 4
LM 4: 0 0 2 0
V1: 1 1 0 1
V2: 0 0 0
V3: 0 2 0
V4: 1 0 0
V5: 0 1 2
V6: 1 0 1
V7: 2 3 4
V8: 0 1 5
V9: 0 2
V10: 1 1
LM 5: 0 0 0
V1: 1 0 1
V2: 0 0 0
V3: 0 0 2
V4: 1 1 1
V5: 4 3 2
V6: 2 1 4
LM 6: 0 0 0
V1: 0 0 1
V2: 0 0 0
V3: 2 2 3
V4: 0 1 1
V5: 5 4 3
V6: 2 1 2
V7: 3 2 3
V8: 0 0 3
V9: 4 3 3

Table 4.10: Recognition Test Results showing the Number of Misrecognized Words (Voxforge and Gutenberg-Online)

Test Run:  1 2 3 4 5 6 7
LM 1: 0 0 0
V1: 1 1 0
V2: 2 2 3
V3: 0 0 0
V4: 0 1 1
LM 2: 0 0 0
V1: 0 0 0
V2: 1 1 1
V3: 2 1 1 2
V4: 0 0 2
V5: 1 1 2
V6: 0 0 1
LM 3: 0 0 0
V1: 1 2 0
V2: 0 0 0
V3: 1 1 1
V4: 1 1 1
V5: 0 1 2
V6: 0 1 0
V7: 2 2 3
LM 4: 0 0 2 0
V1: 1 1 0 1
V2: 0 0 1
V3: 0 0 1
V4: 0 1 2
V5: 0 1 2
V6: 0 1 0
V7: 1 2 3
V8: 0 3 3
V9: 0 0 0 1
V10: 0 1 1 1
LM 5: 0 0 0 1 1 8
V1: 0 0 4 0 1 7
V2: 1 0 1 1 3 4
V3: 0 1 3 2 3 3
V4: 0 1 1 0 2 3
V5: 0 2 3 4 1 4
V6: 1 1 3 2 2 2
V7: 1 2 1 2 3 6
V8: 0 1 1 1 2 3
LM 6: 0 0 0 0 0 2
V1: 0 0 1 0 2 2
V2: 0 0 0 1 0 1
V3: 1 1 2 1 3 3
V4: 0 0 0 0 0 1
V5: 0 0 1 1 0 1
V6: 1 1 3 2 1 4
V7: 1 1 2 1 1 4


4.6.2 NLP

In the following subsections the experiments carried out with Pattern.de and NLTK are elaborated on. Their aim is to determine whether a sentence is a question or an imperative. Questions are further distinguished into probe questions and yes-no questions. Tables 4.11 to 4.14 show the test results of the Pattern.de parser; the test results of NLTK are shown in the succeeding lines. The annotations of Pattern.de are based on the Penn Treebank POS tags [22], while the NLTK annotations follow the STTS. There is no punctuation or capitalization, because those aspects are not produced by the speech recognition.

4.6.2.1 Imperative

Table 4.11 shows the POS tags of the Pattern.de parser. In order to identify an imperative, the words have to be lemmatised. The imperative of the verb is derived from the word root, and the result is compared with the word from the input sentence. If there is a match, the verb is an imperative; a sketch of this test follows table 4.11.

Annotations  1      2    3       4
Sentence     mache  das  licht   an
POS Tags     NN     DT   JJ      O
Chunks       B-NP   O    B-ADJP  B-PP
Relation     O      O    O       O

Table 4.11: NLP Imperative: Chunks, POS Tags, Relations
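A sketch of the re-conjugation test under Python 2.7; the '2sg!' alias for the second-person-singular imperative is an assumption taken from the Pattern documentation and may need the mood argument passed explicitly in other versions:

# -*- coding: utf-8 -*-
from pattern.de import lemma, conjugate

def is_imperative(word):
    # lemmatise, re-conjugate to the imperative and compare with the input word
    return conjugate(lemma(word), '2sg!') == word.lower()

print(is_imperative('mache'))  # expected: True (lemma 'machen' -> imperative 'mache')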

In contrast to Pattern.de, the custom German POS corpus for NLTK is tagged with the STTS instead of the Penn Treebank tags. This has the advantage that verbs in imperative form are instantly identified. The customized corpus for the POS tagging procedure contains about 250 imperatives. The phonetic dictionary, however, does not contain a single imperative that is also contained in the POS tagging corpus. This leads to a problem in recognizing imperatives: there cannot be an input to the NLP containing an imperative, because the speech recognition does not know any imperatives. The output of the unigram tagger for a corresponding string input shows the verb in imperative form:

MACHE/VVIMP DAS/ART LICHT/NN AN/PREP

40 ASR and NLP 4.6 Experiments

4.6.2.2 Question - Yes-No

Table 4.12 shows the POS tags for the Pattern.de parser.

Annotations  1       2         3         4
Sentence     ist     die       sonne     grün
POS Tags     VB      DT        NN        JJ
Chunks       B-VP-1  B-NP      I-NP      B-ADJP
Relation     VP-1    NP-SBJ-1  NP-SBJ-1  O

Table 4.12: NLP Question - Yes-No: Chunks, POS Tags, Relations

A yes-no question is determined in the same way for both approaches. If the first word is the predicate of the sentence, followed by subject and object, the sentence is a yes-no question. The unigram tagger identifies the following tags:

IST/VAFIN DIE/ART SONNE/NN GRÜNEN/NN

4.6.2.3 Question - Probe

Table 4.13 shows the POS tags of the Pattern.de parser. The pronoun "welche" is tagged correctly, which makes it easy to determine that the sentence is a question.

Annotations  1       2         3     4         5
Sentence     welche  farbe     hat   die       sonne
POS Tags     WP      NN        VB    DT        NN
Chunks       O       B-NP      B-VP  B-NP      I-NP
Relation     O       NP-SBJ-1  VP-1  NP-OBJ-1  NP-OBJ-1

Table 4.13: NLP Question - Probe: Chunks, POS Tags, Relations

With the unigram tagger, the result is neither correct nor reliable. The pronoun "WELCHE" is tagged as PRELS, a relative pronoun appearing in relative clauses. The customized POS corpus only contains sentences where the pronoun is used this way. An attempt to correct the tag by adding the sentence shown below to the training material was not successful. Adding more sentences containing such tags to the corpus is expected to make a difference, by increasing the probability of its occurrence.

WELCHE/PWAT FARBE/NN HAT/VAFIN DIE/ART SONNE/NN


4.6.2.4 Identify Numerics

To enable the system to process numerics extracted from the speech input, the corresponding words have to be identified. Pattern.de tags numerical expressions as "CD", cardinal number, as shown in table 4.14.

Annotations  1      2     3     4     5     6
Sentence     heute  gibt  es    zwei  neue  möglichkeiten
POS Tags     VB     VB    PRP   CD    JJ    NNS
Chunks       B-VP   I-VP  B-NP  I-NP  I-NP  I-NP
Relation     O      O     O     O     O     O

Table 4.14: NLP Numerical Expression Reduction

While the unigram tagger tags them as “CARD”:

HEUTE/ADV GIBT/VVFIN ES/PPER ZWEI/CARD NEUE/ADJA MÖGLICHKEITEN/NN


5 Conclusions

5.1 Automatic Speech Recognition

The processing of speech with a machine has been continuously researched during the past six decades, although with varying levels of intensity. Recent improvements in computing power and resources have fostered a lot of research in this field. The aspects and approaches that have been developed in this time make working solutions for automatic speech recognition feasible. The accuracy of such systems can be tuned, although with a lot of effort, so that real working applications are possible. There are example configurations and knowledge bases for the German language, but for a real application the phonetic dictionary and the language model have to be fine-tuned. Without adjustments that take into account the domain for which it is used, an ASR system is of little use; the accuracy would be too poor, as indicated by the experiments shown in section 4.6.1. This problem can be overcome with enough computing power and resources: many recognition processes running in parallel on huge server farms, with access to nearly unlimited training data available from the world wide web.

The training process is very delicate and time consuming. Text data that is not properly cleared of non-alphabetical symbols can lead to errors in a later task. The Wikipedia corpus contained too many misspelled words and names to serve as a text corpus, in contrast to the Gutenberg Online corpus, which is completely reviewed and free of misspellings. Nonetheless, names and noun phrases had to be matched with the Voxforge vocabulary and truncated in order to build a language model. Words that are missing in the phonetic dictionary could be added manually, in order to train the algorithm on them and enable it to recognize them. Other approaches would be automatic grapheme-to-phoneme conversion, or accessing online resources such as Wiktionary to look up the phonetic spelling and convert it from IPA to the Arpabet. In theory, all the toolkits mentioned in section 4.3.3 should be compatible with Pocketsphinx. Except for CMUCLMTK, all attempts to build a correct and readable language model for Pocketsphinx failed. Subtle differences in the text input files for the training tools let the

tools fail at some point in the process. The training with RNNLM was successful, but the format of the language model is not readable by the Sphinxbase converter and thus not readable by Pocketsphinx. CSLM could not be evaluated, because it required BLAS libraries that could not be provided. Converting the language model trained with SRILM to a format understandable by Pocketsphinx always resulted in segmentation faults. CMUCLMTK is the reference toolkit to be used with Pocketsphinx. After the raw input text had been cleared properly, the language model created with this toolkit worked well with Pocketsphinx. It uses the most common type of N-gram language model.

5.2 Natural Language Processing

There are some natural language processing tools for the German language, like the Java based parser provided by Stanford University, along with a lot of other tools for NLP. For Python, a generic solution exists, utilizing the Pattern.de parser; however, it only works under Python 2.7. An alternative, running under both Python 2.7 and Python 3, is available: the NLTK. NLTK can be trained with custom corpora, such that later parsing can be adjusted to the requirements of the application for which it is used. With the Pattern.de parser, some verbs were not tagged correctly in the processing, which leads to wrong conjugations for the imperative form, and hence the processing of a command fails. Because of the comparison with the input data, this approach is more error prone than NLTK. The unigram tagger of NLTK can be trained with the more precise STTS. As a result, the tags are better matched to the German language. The NLTK toolkit also provides the possibility to use HMM-based unsupervised POS taggers, but due to the time limitation of this project, further evaluation of this type of tagger has not been done.

5.3 Future Work

For parallelization of language model training and decoding, different approaches could be used. A simpler approach is the computation of mathematical functions on the GPU. BLAS libraries executed on the GPU can accelerate the computation in comparison to a CPU, but only if the data of the computation is big enough; [68] shows such results. [69] and [70] have already presented working solutions with BLAS libraries. The approach in this project was to extend the RNNLM toolkit to use the OpenCL clBLAS libraries. It could be shown that copying the data and executing mathematical functions on the GPU took a considerable amount of time. This bottleneck is eliminated with the Heterogeneous System Architecture. Further investigations should compare the traditional CPU/GPU approach with the new HSA, but this has not been possible due to mismatching driver and hardware versions. Data handling with OpenCL on an APU system with HSA should improve significantly over standard GPU data handling. This could be especially useful to accelerate the Viterbi decoding of HMMs and language models. A more complex approach to GPU acceleration, mainly aiming at the training phase, is the implementation of the activation functions of the neural networks on the GPU as well. In [70] a solution with this approach is presented.


NLP becomes more useful to applications if they have access to some sort of inference system. In this way the processed language can be assigned a meaning. This would make it possible not only to recognize if and what kind of question is posed, as shown in the experiments in paragraphs 4.6.2.2 and 4.6.2.3, but also to answer said questions. Lexical databases like GermaNet could be used to increase the accuracy of an inference system with preprocessing of synonyms. The synonyms could also be matched with rhymes. Rhymes would take slightly misrecognized phones into account, thus giving the system some sort of error correction capability.


Appendices

A Wikipedia Dump Cleanup Python Script

def processContentWikipedia(dstdir, subdir, files, q):
    checkString = ""
    for file in files:
        voxForgeDict = []
        lineStr0 = ""
        with open(rootdir + "voxforge.dic", "r") as dictSet:
            completeDict = dictSet.read()
        # collect the first column (the words) of the Voxforge dictionary
        for char0 in completeDict:
            if "\n" == char0:
                lineStr0 += char0
                voxForgeDict.append(lineStr0.split(" ")[0])
                lineStr0 = ""
            else:
                lineStr0 += char0

        with open(subdir + "/" + file, "r") as fd:
            content = fd.read()

        ''' Substitute <...> with "" '''
        content = re.sub(r"<[^<>]+>", "", content)
        content = re.sub(r"]+ref>", "", content)  # pattern damaged in extraction; originally removed <ref>...</ref> blocks
        ''' Substitute dash-delimited insertions with "" (the dash characters were lost in extraction) '''
        content = re.sub(r" [^ ]+ ", "", content)
        ''' Substitute (...) and {...} with "" '''
        content = re.sub(r"\([^\(\)]+\)", "", content)
        content = re.sub(r"\{[^\(\)]+\}", "", content)
        ''' Substitute [...] with "" '''
        content = re.sub(r"\[[^\[\]]+\]", "", content)

        ''' Decode HTML entities to umlauts (replacement characters reconstructed) '''
        content = content.replace("&uuml;", "ü")
        content = content.replace("&Uuml;", "Ü")
        content = content.replace("&auml;", "ä")
        content = content.replace("&Auml;", "Ä")
        content = content.replace("&ouml;", "ö")
        content = content.replace("&Ouml;", "Ö")
        content = content.replace("&szlig;", "ß")

        ''' Protect umlauts as ASCII placeholders before stripping non-ASCII (search characters reconstructed) '''
        content = content.replace("ä", "AE")
        content = content.replace("Ä", "AAE")
        content = content.replace("ö", "OE")
        content = content.replace("Ö", "OOE")
        content = content.replace("ü", "UE")
        content = content.replace("Ü", "UUE")
        content = content.replace("ß", "SSS")
        content = re.sub(r'[^\x00-\x7F]+', ' ', content)
        content = content.replace("½", ",5")  # reconstructed: vulgar fraction one half

        content = content.replace("!", " Fakultaet ")
        content = content.replace("\\", " ")
        content = content.replace("~", " ")
        content = content.replace("?", ".")
        content = content.replace("×", "*")  # search characters in this block reconstructed where lost in extraction
        content = content.replace("·", "*")
        content = content.replace("°", " Grad ")
        content = content.replace("@", " bei ")
        content = content.replace("″", " Zoll ")
        content = content.replace("%", " Prozent ")
        content = content.replace("#", " ")
        content = content.replace("*", " mal ")
        content = content.replace("/", "")
        content = content.replace("€", " Euro ")
        content = content.replace("$", " Dollar ")
        content = content.replace("+", " plus ")
        content = content.replace("^", " hoch ")
        content = content.replace("\\u", "")
        content = content.replace("–", "-")
        content = content.replace("—", "-")
        content = content.replace("_", "")
        content = content.replace("-", " - ")
        content = content.replace("&", " und ")
        content = content.replace("<=", " kleiner gleich ")  # two-character operators moved before "<" and ">" so they survive
        content = content.replace(">=", " größer gleich ")
        content = content.replace("<", " kleiner als ")
        content = content.replace(">", " größer als ")
        content = content.replace("=", " gleich ")
        content = content.replace("Mrd.", " Milliarden ")
        content = content.replace("ggf.", " gegebenenfalls ")


        content = content.replace("bzw.", " beziehungsweise ")
        content = content.replace("bspw.", " beispielsweise ")
        content = content.replace("etc.", " et cetera ")
        content = content.replace("int.", " intern ")
        content = content.replace("ext.", " extern ")
        content = content.replace("bzgl.", " bezüglich ")
        content = content.replace("zb.", " zum Beispiel ")
        content = content.replace("evtl.", " eventuell ")
        content = content.replace("Evtl.", " eventuell ")
        content = content.replace("Mio.", " Million ")
        content = content.replace("Jhdt.", " Jahrhundert ")
        content = content.replace("d.h.", " das heißt ")
        content = content.replace("D.h.", " Das heißt ")
        content = content.replace("z.B.", " zum Beispiel ")
        content = content.replace("Z.B.", " Zum Beispiel ")
        content = content.replace("u.a.", " unter anderem ")
        content = content.replace("U.a.", " unter anderem ")
        content = content.replace("v.Chr.", " vor Christus ")
        content = content.replace("n.Chr.", " nach Christus ")
        content = content.replace("U.S.", " Vereinigte Staaten ")
        content = content.replace("BRD", " Bundesrepublik Deutschland ")
        content = content.replace("DDR", " Deutsche Demokratische Republik ")
        content = content.replace("gr.", " groß ")
        content = content.replace("kl.", " klein ")
        ''' Strip quotation marks; several of the original search characters were lost in extraction '''
        content = content.replace("‘", "")
        content = content.replace("’", "")
        content = content.replace('"', "")
        content = content.replace(";", "")
        content = content.replace(":", "")

        ''' Remove chemical names '''
        content = re.sub(r"\d+[a-zA-Z]+", "", content)    # e.g. 287Ac
        content = re.sub(r"[a-zA-Z]+\d+", "", content)    # e.g. Ac287
        content = re.sub(r"\d+[a-zA-Z]+\d+", "", content)

        content = content.replace("[", " ")
        content = content.replace("]", " ")
        content = content.replace("|", " ")
        content = content.replace("{", " ")
        content = content.replace("}", " ")

        ''' Map the ASCII placeholders back to umlauts (replacement characters reconstructed) '''
        content = content.replace("AAE", "Ä")
        content = content.replace("AE", "ä")
        content = content.replace("OOE", "Ö")
        content = content.replace("OE", "ö")
        content = content.replace("UUE", "Ü")
        content = content.replace("UE", "ü")
        content = content.replace("SSS", "ß")

        lineList = []
        lineStr = ""
        for char in content:
            if "\n" == char:
                lineStr += char
                if ".\n" in lineStr:
                    if not "(" in lineStr and not ")" in lineStr \
                            and not "formula_" in lineStr and not "/" in lineStr \
                            and not "*" in lineStr and not "https" in lineStr \
                            and not "Bibel" in lineStr:
                        # keep periods that act as decimal points, split at sentence-final ones
                        regexp0 = re.compile("\d+")
                        idx = 0
                        for cnt in range(0, lineStr.count(".")):
                            if cnt < lineStr.count(".") - 2:
                                idx = lineStr.index(".", idx)
                                match100 = regexp0.match(lineStr[idx-1])
                                match001 = regexp0.match(lineStr[idx+1])
                                if match100:
                                    if match001:
                                        lineStr[idx].replace(".", "")
                                    else:
                                        lineStr[idx].replace(".", ".\n")
                        numberClearedText = ParseNumberNames(lineStr)
                        lineContent = numberClearedText.get_CheckedString().replace("  ", " ")
                        lineContent = lineContent.replace(",", "")
                        lineContent = lineContent.replace(".", "")
                        lineContent = lineContent.replace(" - ", "")
                        lineContent = lineContent.upper()
                        lineContent2 = lineContent.split()
                        if len(lineContent2) > 3:
                            lineContent = os.linesep.join([s for s in lineContent.splitlines() if s])
                            lineList.append(lineContent.lstrip() + "\n")
                lineStr = ""
            else:
                lineStr += char

srcdir = rawData

        with open(dstdir + "train_" + subdir.replace(srcdir, "") + "_" + file, "w") as trainingSet:
            with open(dstdir + "valid_" + subdir.replace(srcdir, "") + "_" + file, "w") as validationSet:
                with open(dstdir + "test_" + subdir.replace(srcdir, "") + "_" + file, "w") as testSet:
                    # every ninth line goes to validation, every eighth to test, the rest to training
                    for idx, line in enumerate(lineList):
                        if idx % 9 == 0:
                            validationSet.write(line.lstrip())
                        elif idx % 8 == 0:
                            testSet.write(line.lstrip())
                        else:
                            trainingSet.write(line.lstrip())

B Gutenberg Online Cleanup Python Script

def processContentGutenberg(dstdir, subdir, files):
    for file in files:
        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Loading File: %s/%s\n" % (datetime.datetime.now(), subdir, file))
        print(datetime.datetime.now(), " Loading File: " + subdir + "/" + file)

        with open(subdir + file, "r") as fd:
            content = fd.read()

        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s File read complete... Opening Output files\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " File read complete... Opening Output files")

        ''' Klammerung entfernen (remove bracketed text) '''
        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Klammerung entfernen\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " Klammerung entfernen")

        ''' Substitute <...> with "" '''
        content = re.sub(r"<[^<>]+>", "", content)
        content = re.sub(r"]+ref>", "", content)  # pattern damaged in extraction; originally removed <ref>...</ref> blocks
        ''' Substitute dash-delimited insertions with "" (the dash characters were lost in extraction) '''
        content = re.sub(r" [^ ]+ ", "", content)
        ''' Substitute (...) and {...} with "" '''
        content = re.sub(r"\([^\(\)]+\)", "", content)
        content = re.sub(r"\{[^\(\)]+\}", "", content)
        ''' Substitute [...] with "" '''
        content = re.sub(r"\[[^\[\]]+\]", "", content)
        ''' Remove chemical names '''
        content = re.sub(r"\d+[a-zA-Z]+", "", content)    # e.g. 287Ac
        content = re.sub(r"[a-zA-Z]+\d+", "", content)    # e.g. Ac287
        content = re.sub(r"\d+[a-zA-Z]+\d+", "", content)

        ''' Gedankenstriche entfernen (remove dashes) '''
        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Gedankenstriche entfernen\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " Gedankenstriche entfernen")

        content = content.replace("–", "\n")  # en/em dashes reconstructed; lost in extraction
        content = content.replace("—", "\n")
        content = content.replace("_", " ")
        content = content.replace("-", "\n")

        ''' Anführungszeichen entfernen (remove quotation marks) '''
        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Anführungszeichen entfernen\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " Anführungszeichen entfernen")

        content = content.replace("‘", " ")
        content = content.replace("’", " ")
        content = content.replace('"', " ")
        # several further quotation-mark replacements followed; their search characters were lost in extraction


        # one more replace() followed here; its search character was lost in extraction

        ''' Sonstige Sonderzeichen ersetzen (replace other special characters) '''
        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Sonstige Sonderzeichen ersetzen\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " Sonstige Sonderzeichen ersetzen")

        # the first two replaces of this block lost their search characters in extraction;
        # the remaining ones are reconstructed as fraction and power characters
        content = content.replace("½", ",5")
        content = content.replace("²", " zum Quadrat")
        content = content.replace("³", " hoch drei")
        content = content.replace("¾", ",75")
        content = content.replace("¼", ",25")
        content = content.replace("⅓", ",33")

        content = content.replace("1/2", "0,5")
        content = content.replace("%", " Prozent")
        content = content.replace("3 ann ", " ")  # damaged in extraction; original search string unrecoverable
        content = content.replace("[", " ")
        content = content.replace("]", " ")
        content = content.replace("|", " ")
        content = content.replace("{", " ")
        content = content.replace("}", " ")
        content = content.replace("(", " ")
        content = content.replace(")", " ")

        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Zeilenumbrüche einfügen\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " Zeilenumbrüche einfügen")
        ''' Zeilenumbrüche einfügen (insert line breaks) '''
        content = content.replace(".", "\n")
        content = content.replace("!", "\n")
        content = content.replace("?", "\n")
        content = content.replace(",", "")
        content = content.replace(";", "\n")
        content = content.replace(":", "\n")

        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Inhalt in Zeile aufteilen\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " Inhalt in Zeile aufteilen")

        ''' Split the content into lines (Inhalt in Zeile aufteilen) '''
        lineStr = ""
        lineCnt = 0
        for idx, char in enumerate(content):
            if "\n" == char:
                ''' Remove non-alphanumeric characters, but keep whitespace. '''
                lineStr = re.sub(r"([^\s\w]|_)+", "", lineStr)  # \W? <- at least one occurrence

                ''' Write out numeric expressions as words (Numerische Ausdrücke ausschreiben) '''
                numberClearedText = ParseNumberNames(lineStr)
                lineStr = numberClearedText.get_CheckedString()
                lineSplit = lineStr.split()
                if len(lineSplit) > 3:
                    lineCnt += 1
                    # roughly 1/9 of the lines go to the validation set,
                    # 1/8 of the remainder to the test set, the rest to training
                    if lineCnt % 9 == 0:
                        with open(dstdir + "valid_" + subdir.replace(rawData, "") + "_" + file, "a") as validationSet:
                            validationSet.write(lineStr.lstrip().upper() + "\n")
                    elif lineCnt % 8 == 0:
                        with open(dstdir + "test_" + subdir.replace(rawData, "") + "_" + file, "a") as testSet:
                            testSet.write(lineStr.lstrip().upper() + "\n")
                    else:
                        with open(dstdir + "train_" + subdir.replace(rawData, "") + "_" + file, "a") as trainingSet:
                            trainingSet.write(lineStr.lstrip().upper() + "\n")
                lineStr = ""
            else:
                lineStr += char

        with open(rootdir + "command.log", "a") as commandLog:
            commandLog.write("%s Ready\n" % (datetime.datetime.now()))
        print(datetime.datetime.now(), " Ready")
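The function above is driven by a plain directory walk. A minimal sketch of a plausible driver, assuming the globals rootdir, rawData and dstdir are configured as in the listing and that subdir is passed with a trailing slash (the function concatenates subdir and file directly):

    import os

    # Hypothetical driver: preprocess every file of the raw Gutenberg dump.
    # rootdir, rawData and dstdir are assumed to be set up elsewhere.
    for subdir, dirs, files in os.walk(rawData):
        processContentGutenberg(dstdir, subdir + "/", files)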

C Building RNNLM Toolkit from Source

g++ -D WEIGHTTYPE=float -D STRIDE=8 -lm -Ofast -march=native -Wall -funroll-loops -ffast-math \
    -Wno-unused-result -Wno-narrowing -c rnnlmlib.cpp
g++ -D WEIGHTTYPE=float -D STRIDE=8 -lm -Ofast -march=native -Wall -funroll-loops -ffast-math \
    -Wno-unused-result -Wno-narrowing rnnlm.cpp rnnlmlib.o -o rnnlm


D RNNLM Toolkit Training Output

debug mode: 3
train file: lmData/rnnlmtk/train__GutenbergVoxforge.merge.rnnlm
valid file: lmData/rnnlmtk/valid__GutenbergContentRaw
class size: 100
Starting learning rate: 0.100000
Hidden layer size: 15
Direct connections: 5M
Order of direct connections: 2
BPTT: 4
BPTT block: 10
Rand seed: 1
rnnlm file: lmData/rnnlmtk/GutenbergVoxforge.merge.rnn.lm

Starting training using file lmData/rnnlmtk/train__GutenbergVoxforge.merge.rnnlm
Vocab size: 11966
Words in train file: 5127198
Iter:  0  Alpha: 0.100000  TRAIN entropy: 7.0752  Words/sec: 117451.2  VALID entropy: 10.9204
Iter:  1  Alpha: 0.100000  TRAIN entropy: 6.8070  Words/sec: 118677.6  VALID entropy: 10.8602
Iter:  2  Alpha: 0.100000  TRAIN entropy: 6.7260  Words/sec: 118644.1  VALID entropy: 11.0641
Iter:  3  Alpha: 0.050000  TRAIN entropy: 6.6325  Words/sec: 118593.8  VALID entropy: 10.5144
Iter:  4  Alpha: 0.025000  TRAIN entropy: 6.5805  Words/sec: 118428.9  VALID entropy: 9.8425
Iter:  5  Alpha: 0.012500  TRAIN entropy: 6.5613  Words/sec: 118735.5  VALID entropy: 9.3684
Iter:  6  Alpha: 0.006250  TRAIN entropy: 6.5572  Words/sec: 116890.7  VALID entropy: 9.0048
Iter:  7  Alpha: 0.003125  TRAIN entropy: 6.5600  Words/sec: 114780.4  VALID entropy: 8.6788
Iter:  8  Alpha: 0.001563  TRAIN entropy: 6.5659  Words/sec: 114145.8  VALID entropy: 8.3703
Iter:  9  Alpha: 0.000781  TRAIN entropy: 6.5724  Words/sec: 114911.7  VALID entropy: 8.1590
Iter: 10  Alpha: 0.000391  TRAIN entropy: 6.5776  Words/sec: 118779.4  VALID entropy: 8.0218
Iter: 11  Alpha: 0.000195  TRAIN entropy: 6.5814  Words/sec: 118786.3  VALID entropy: 7.9124
Iter: 12  Alpha: 0.000098  TRAIN entropy: 6.5845  Words/sec: 118867.2  VALID entropy: 7.8039
Iter: 13  Alpha: 0.000049  TRAIN entropy: 6.5867  Words/sec: 118782.3  VALID entropy: 7.7165
Iter: 14  Alpha: 0.000024  TRAIN entropy: 6.5877  Words/sec: 118770.2  VALID entropy: 7.6703
Iter: 15  Alpha: 0.000012  TRAIN entropy: 6.5880  Words/sec: 114331.0  VALID entropy: 7.6519

real 20m37.691s
user 20m26.864s
sys  0m3.111s

test file: lmData/rnnlmtk/test__GutenbergContentRaw
rnnlm file: lmData/rnnlmtk/GutenbergVoxforge.merge.rnn.lm
test log probability: -22763332.468480

PPL net: 201.176043
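For reference, the RNNLM toolkit reports entropies in bits per word, so the reported perplexity is related to the entropy H by

    PPL = 2^H,   H = −(1/N) · Σ log2 P(wi | hi),

which is consistent here: log2(201.176) ≈ 7.65, in the same range as the final validation entropy of 7.6519 (the two values are computed on different sets).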

E librnnlm.cpp with clBLAS SGEMV Implementation

void CRnnLM::matrixXvector(struct neuron *dest, struct neuron *srcvec,
                           struct synapse *srcmatrix, int matrix_width,
                           int from, int to, int from2, int to2, int type,
                           cl_context ctx, cl_command_queue queue, int opencl)
{
    int a, b, c;
    real val[STRIDE];

    if (opencl == 1) { // OpenCL path
        cl_int err;
        cl_mem bufA, bufX, bufY;
        cl_event event = NULL;
        time_t rawtime;
        struct timeval timeinfoS, timeinfoE;

        int m = to - from;    // number of output neurons
        int n = to2 - from2;  // number of input neurons
        int mn = m * n;

        // host-side staging arrays (variable-length arrays, a GCC extension)
        cl_float A[mn];
        cl_float X[n];
        cl_float Y[m];
        cl_float result[m];

        bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, mn * sizeof(*A), NULL, &err);
        bufX = clCreateBuffer(ctx, CL_MEM_READ_ONLY, n * sizeof(*X), NULL, &err);
        bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, m * sizeof(*Y), NULL, &err);

        // Upload the host arrays to the device buffers.
        // NB: for correct results these uploads must happen after A and X are
        // filled by the copies below.
        err = clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, mn * sizeof(*A), A, 0, NULL, NULL);
        err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, n * sizeof(*X), X, 0, NULL, NULL);
        err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, m * sizeof(*Y), Y, 0, NULL, NULL);

        if (type == 0) { // forward pass: dest.ac += W * srcvec.ac
            // copy the .ac fields of neurons [from2, to2) into the dense input
            // vector X (element-wise; the original listing memcpy'd whole
            // neuron structs, which overruns the float array)
            for (int i = 0; i < n; i++) X[i] = srcvec[from2 + i].ac;
            int k = 0;
            for (int i = from; i < m; i++) {
                memcpy(&A[k], &srcmatrix[matrix_width * i + from], n * sizeof(*srcmatrix));
                k += n;


            }

            err = clblasSgemv(clblasRowMajor, clblasTrans, m, n, 1, bufA, 0, n,
                              bufX, 0, 1, 1, bufY, 0, 1,
                              1, &queue, 0, NULL, &event);
            if (err != CL_SUCCESS)
                fprintf(stderr, "clblasSgemv (ac) failed: %d\n", err);

            err = clWaitForEvents(1, &event);
            err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0,
                                      m * sizeof(*result), result, 0, NULL, NULL);

            // accumulate the SGEMV result into the destination activations
            int j = 0;
            for (int i = from; i < m; i++) dest[i].ac += result[j++];
        } else { // backward pass: dest.er += W^T * srcvec.er
            // (reconstructed: this part of the listing was truncated)
            // copy the .er fields of neurons [from, to) into X
            for (int i = 0; i < m; i++) X[i] = srcvec[from + i].er;

            int k = 0;
            for (int i = from2; i < n; i++) {
                memcpy(&A[k], &srcmatrix[matrix_width * i + from2], m * sizeof(*srcmatrix));
                k += m;
            }

            err = clblasSgemv(clblasColumnMajor, clblasTrans, n, m, 1, bufA, 0, m,
                              bufX, 0, 1, 1, bufY, 0, 1,
                              1, &queue, 0, NULL, &event);
            if (err != CL_SUCCESS)
                fprintf(stderr, "clblasSgemv (er) failed: %d\n", err);
            err = clWaitForEvents(1, &event);
            err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0,
                                      m * sizeof(*result), result, 0, NULL, NULL);
            // accumulate the SGEMV result into the destination error terms
            int j = 0;
            for (int i = from2; i < n; i++) dest[i].er += result[j++];

            // clip the propagated errors to +/- gradient_cutoff
            if (gradient_cutoff > 0)
                for (a = from2; a < to2; a++) {
                    if (dest[a].er > gradient_cutoff) dest[a].er = gradient_cutoff;
                    if (dest[a].er < -gradient_cutoff) dest[a].er = -gradient_cutoff;
                }
        }

        clReleaseMemObject(bufY);
        clReleaseMemObject(bufX);
        clReleaseMemObject(bufA);

    } else { // Without OpenCL
        if (type == 0) {
            for (b = 0; b < to - from - STRIDE; b += STRIDE) { // ac mode [0 : m-STRIDE-1]
                for (c = 0; c < STRIDE; c++) val[c] = 0; // init accumulators with 0
                for (a = from2; a < to2; a++)
                    for (c = 0; c < STRIDE; c++) // accumulate inputs [from2 : to2-1]
                        val[c] += srcvec[a].ac * srcmatrix[a + (b + from + c) * matrix_width].weight;
                for (c = 0; c < STRIDE; c++) // copy computed activations to dest vector
                    dest[b + from + c].ac += val[c];
            }
            for (; b < to - from; b++)
                for (a = from2; a < to2; a++) // compute remaining activations
                    dest[b + from].ac += srcvec[a].ac * srcmatrix[a + (b + from) * matrix_width].weight;
        } else {
            for (a = 0; a < to2 - from2 - STRIDE; a += STRIDE) { // er mode [0 : n-STRIDE-1]
                for (c = 0; c < STRIDE; c++) val[c] = 0; // init accumulators with 0
                for (b = from; b < to; b++)
                    for (c = 0; c < STRIDE; c++) // back-propagate errors [from : to-1]
                        val[c] += srcvec[b].er * srcmatrix[a + from2 + c + b * matrix_width].weight;
                for (c = 0; c < STRIDE; c++)
                    dest[a + from2 + c].er += val[c]; // copy propagated errors to dest vector
            }
            for (; a < to2 - from2; a++)
                for (b = from; b < to; b++) // propagate remaining errors
                    dest[a + from2].er += srcvec[b].er * srcmatrix[a + from2 + b * matrix_width].weight;

            // clip the propagated errors to +/- gradient_cutoff
            if (gradient_cutoff > 0)
                for (a = from2; a < to2; a++) {
                    if (dest[a].er > gradient_cutoff) dest[a].er = gradient_cutoff;
                    if (dest[a].er < -gradient_cutoff) dest[a].er = -gradient_cutoff;
                }
        }
    }
}
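Both branches map the same dense operation onto a single BLAS level-2 call. The SGEMV routine of clBLAS computes the general matrix-vector product

    y ← α · op(A) · x + β · y,   op(A) ∈ {A, Aᵀ},

so the forward pass (type == 0) realizes dest.ac += W · srcvec.ac and the error back-propagation realizes dest.er += Wᵀ · srcvec.er, with α = β = 1 as passed in the calls above; the unrolled STRIDE loops of the CPU fallback compute exactly the same sums.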


F Building CMULM Toolkit from Source

cd POCKETSPHINXDIR/cmulmtk/
./autogen.sh
make
sudo make install

G CMULM Toolkit Training Output

cmuclmtk/src/programs $ ./text2idngram -vocab lmData/train__GutenbergVoxforge.merge.tmp.vocab \
    -idngram lmData/train__GutenbergVoxforge.merge.idngram < lmData/train__GutenbergVoxforge.merge

text2idngram
Vocab               : lmData/train__GutenbergVoxforge.merge.tmp.vocab
Output idngram      : lmData/train__GutenbergVoxforge.merge.idngram
N-gram buffer size  : 100
Hash table size     : 2000000
Temp directory      : cmuclmtk-5lIXrw
Max open files      : 20
FOF size            : 10
n                   : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
......
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-5lIXrw/1
Merging 1 temporary files...

2-grams occurring   N times   > N times   Sug. -spec_num value
     0                           600335                 606348
     1               343724      256611                 259187
     2                90423      166188                 167859
     3                39750      126438                 127712
     4                23835      102603                 103639
     5                15951       86652                  87528
     6                11149       75503                  76268
     7                 8640       66863                  67541
     8                 6675       60188                  60799
     9                 5409       54779                  55336
    10                 4508       50271                  50783

3-grams occurring   N times   > N times   Sug. -spec_num value
     0                          2085436                2106300
     1              1589563      495873                 500841
     2               241305      254568                 257123
     3                82973      171595                 173320
     4                43869      127726                 129013
     5                26751      100975                 101994
     6                17571       83404                  84248
     7                12649       70755                  71472
     8                 9499       61256                  61878
     9                 7316       53940                  54489
    10                 5996       47944                  48433

text2idngram : Done.

cmuclmtk/src/programs $ ./idngram2lm -vocab_type 0 -idngram lmData/train__GutenbergVoxforge.merge.idngram \
    -vocab lmData/train__GutenbergVoxforge.merge.tmp.vocab -arpa lmData/GutenbergVoxforge.merge.lm

n                      : 3
Input file             : lmData/train__GutenbergVoxforge.merge.idngram (binary format)
Output files           :
  ARPA format          : lmData/GutenbergVoxforge.merge.lm
Vocabulary file        : lmData/train__GutenbergVoxforge.merge.tmp.vocab
Cutoffs                : 2-gram : 0   3-gram : 0
Vocabulary type        : Closed
Minimum unigram count  : 0
Zeroton fraction       : 1
Counts will be stored in two bytes.
Count table size       : 65535
Discounting method     : Good-Turing


Discounting ranges     : 1-gram : 1   2-gram : 7   3-gram : 7
Memory allocation for tree structure :
  Allocate 100 MB of memory, shared equally between all n-gram tables.
Back-off weight storage :
  Back-off weights will be stored in four bytes.
Reading vocabulary...
read_wlist_into_siht: a list of 11967 words was read from "lmData/train__GutenbergVoxforge.merge.tmp.vocab".
read_wlist_into_array: a list of 11967 words was read from "lmData/train__GutenbergVoxforge.merge.tmp.vocab".
WARNING: appears as a vocabulary item, but is not labelled as a context cue.
Allocated space for 3571428 2-grams.
Allocated space for 8333333 3-grams.
table_size 11968
Allocated 57142848 bytes to table for 2-grams.
Allocated (2+33333332) bytes to table for 3-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
......
Calculating discounted counts.
Warning : 1-gram : Discounting range of 1 is equivalent to excluding singletons.
Warning : 1-gram : GT statistics are out of range; lowering cutoff to 0.
Warning : 1-gram : Discounting is disabled.
Unigrams' discount mass is 2.69784e-14 (n1/N = 0.000272119)
Discount mass was rounded to zero.
prob[UNK] = 1e-99
Incrementing contexts...
Calculating back-off weights...
Writing out language model...
ARPA-style 3-gram will be written to lmData/GutenbergVoxforge.merge.lm
idngram2lm : Done.

cmuclmtk/src/programs $ sphinx_lm_convert -i lmData/GutenbergVoxforge.merge.lm -o lmData/GutenbergVoxforge.merge.lm.bin

Current configuration:
[NAME]      [DEFLT]    [VALUE]
-case
-debug      0
-help       no         no
-i                     lmData/GutenbergVoxforge.merge.lm
-ifmt
-lm_trie    no         no
-logbase    1.0001     1.000100e+00
-mmap       no         no
-o                     lmData/GutenbergVoxforge.merge.lm.bin
-ofmt

INFO: ngram_model_trie.c(398): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(409): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(193): LM of order 3
INFO: ngram_model_trie.c(195): #1-grams: 11967
INFO: ngram_model_trie.c(195): #2-grams: 600336
INFO: ngram_model_trie.c(195): #3-grams: 2085436
INFO: lm_trie.c(317): Training quantizer
INFO: lm_trie.c(323): Building LM trie
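The Good-Turing method selected in the idngram2lm configuration discounts an n-gram observed r times to the effective count

    r* = (r + 1) · n(r+1) / n(r),

where n(r) is the number of distinct n-grams occurring exactly r times, i.e. the counts-of-counts tabulated for the 2- and 3-grams above; the reported discounting ranges state up to which r this correction is applied before the raw counts are trusted.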

H Building SRILM Toolkit from Source

Set the SRILM variable in the top-level Makefile to the absolute path of POCKETSPHINXDIR/SRILM/srilm-1.7.1. Then:

cd POCKETSPHINXDIR/SRILM/srilm-1.7.1
make World

The result of the above should be a fair number of .h and .cc files in include/, libraries in lib/$MACHINETYPE, and programs in bin/$MACHINETYPE. To make these available in your shell, add the following lines to the bottom of $HOME/.profile (e.g. with nano):

export PATH=$PATH:/POCKETSPHINXDIR/SRILM/srilm-1.7.1/bin/i686-m64/
export MANPATH=$MANPATH:/POCKETSPHINXDIR/SRILM/srilm-1.7.1/man

To test the compiled tools, run


cd POCKETSPHINXDIR/SRILM/srilm-1.7.1
make test

from the top-level directory. After a successful build, clean up the source directories of object and binary files that are no longer needed:

make cleanest

I SRILM Toolkit Training Output

train__GutenbergContentRaw: line 5660441: 5.66044e+06 sentences, 8.19793e+07 words, 0 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
using ModKneserNey for 1-grams
modifying 1-gram counts for Kneser-Ney smoothing
Kneser-Ney smoothing 1-grams
n1 = 603148   n2 = 141079   n3 = 67962   n4 = 41282
D1 = 0.681288   D2 = 1.01541   D3+ = 1.34467
using ModKneserNey for 2-grams
modifying 2-gram counts for Kneser-Ney smoothing
Kneser-Ney smoothing 2-grams
n1 = 12061559   n2 = 1632059   n3 = 641875   n4 = 343429
D1 = 0.787016   D2 = 1.07142   D3+ = 1.31566
using ModKneserNey for 3-grams
modifying 3-gram counts for Kneser-Ney smoothing
Kneser-Ney smoothing 3-grams
n1 = 39291751   n2 = 2944302   n3 = 977658   n4 = 478160
D1 = 0.869665   D2 = 1.13368   D3+ = 1.29863
using ModKneserNey for 4-grams
modifying 4-gram counts for Kneser-Ney smoothing
Kneser-Ney smoothing 4-grams
n1 = 60631243   n2 = 2028831   n3 = 540254   n4 = 233675
D1 = 0.937274   D2 = 1.25124   D3+ = 1.37841
using ModKneserNey for 5-grams
modifying 5-gram counts for Kneser-Ney smoothing
Kneser-Ney smoothing 5-grams
n1 = 65731861   n2 = 805637   n3 = 158927   n4 = 61074
D1 = 0.976074   D2 = 1.42235   D3+ = 1.49962
using ModKneserNey for 6-grams
modifying 6-gram counts for Kneser-Ney smoothing
Kneser-Ney smoothing 6-grams
n1 = 62903932   n2 = 307934   n3 = 40016   n4 = 13496
D1 = 0.990304   D2 = 1.61393   D3+ = 1.66402
using ModKneserNey for 7-grams
Kneser-Ney smoothing 7-grams
n1 = 57117765   n2 = 1027630   n3 = 27188   n4 = 6499
D1 = 0.965267   D2 = 1.92339   D3+ = 2.07705
discarded 1 2-gram contexts containing pseudo-events
discarded 307855 3-gram contexts containing pseudo-events
discarded 2503568 4-gram contexts containing pseudo-events
discarded 4360170 5-gram contexts containing pseudo-events


discarded 65731861 5-gram probs discounted to zero
discarded 5176317 6-gram contexts containing pseudo-events
discarded 62903932 6-gram probs discounted to zero
discarded 5419443 7-gram contexts containing pseudo-events
discarded 57117765 7-gram probs discounted to zero
inserted 998810 redundant 5-gram probs
inserted 893000 redundant 6-gram probs
writing 1062356 1-grams
writing 15795561 2-grams
writing 44901948 3-grams
writing 63883751 4-grams
writing 2119408 5-grams
writing 1272608 6-grams
writing 1068831 7-grams
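The discount constants in the log follow directly from the counts-of-counts n1 ... n4 reported per order; for modified Kneser-Ney smoothing they are estimated as

    Y = n1 / (n1 + 2·n2),
    D1 = 1 − 2Y·n2/n1,   D2 = 2 − 3Y·n3/n2,   D3+ = 3 − 4Y·n4/n3.

Plugging in the 1-gram counts above (n1 = 603148, n2 = 141079, n3 = 67962, n4 = 41282) gives D1 ≈ 0.6813, D2 ≈ 1.0154 and D3+ ≈ 1.3446, reproducing the logged values up to rounding.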

J Building Pocketsphinx from Source

J.1 Dependencies

GStreamer has to be built prior to Pocketsphinx so that the correct development headers are available for the Pocketsphinx GStreamer plugin.

sudo apt-get install subversion autoconf libtool automake gfortran g++ python-dev \
    python-pip pocketsphinx-hmm-en-hub4wsj swig bison flex --yes

J.2 Sphinxbase

cd POCKETSPHINXDIR
git clone https://github.com/cmusphinx/sphinxbase.git
cd sphinxbase
sudo ./autogen.sh && make && sudo make install

J.3 Pocketsphinx

cd POCKETSPHINXDIR
git clone https://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx
sudo ./autogen.sh && make && sudo make install
cd src/src/gst-plugin/
make && sudo make install

K Recognition Process C++ Implementation

static void recognize(int argc, char *argv[])
{
    int32 k;
    char const *hyp;
    int probability;
    int finalFlag;
    double out_nspeech, out_ncpu, out_nwall;
    int i, latticeScore;

    initDecoderContinuous(argc, argv);
    memcpy((char *) argv[2], keyPath, strlen(keyPath) + 1);
    initDecoderKeyword(argc, argv);

    if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
                          (int) cmd_ln_float32_r(config, "-samprate"))) == NULL)
        E_FATAL("Failed to open audio device\n");
    if (ad_start_rec(ad) < 0)
        E_FATAL("Failed to start recording\n");
    if (ps_start_utt(psKey) < 0)
        E_FATAL("Failed to start Keyword utterance\n");

oscState = PAUSED;

    while (oscState) {
        switch (oscState) {
        case SWITCHLM:
            /* ... (body omitted in the listing) */


        case PAUSED:
            break;

        case MUTE:
        case UNMUTE:
            if (oscState == MUTE) mute = TRUE;
            else if (oscState == UNMUTE) mute = FALSE;
            oscState = RECORD;
            /* fall through */

        case RECORD:
            if (keywordSpotted) {
                if ((k = ad_read(ad, adbuf, 2048)) < 0)
                    E_FATAL("Failed to read LM audio\n");

                ps_process_raw(ps, adbuf, k, FALSE, FALSE);
                in_speech = ps_get_in_speech(ps);

                if (mute) {
                    in_speech = FALSE;
                    if (utt_started) ps_end_utt(ps);
                }

                if (in_speech && !utt_started) {
                    utt_started = TRUE;
                    printf("LM Listening...\n");
                }
                if (utt_started) {
                    hyp = ps_get_hyp(ps, &latticeScore);
                    if (hyp != NULL) {
                        printf("LM %s %d\n", hyp, latticeScore);
                    }
                }
                if (!in_speech && utt_started) {
                    ps_end_utt(ps);
                    hyp = ps_get_hyp_final(ps, &finalFlag);

                    if (hyp != NULL && strcmp("", hyp) != 0) {
                        probability = ps_get_prob(ps);
                    }

                    keywordSpotted = FALSE;
                    if (ps_start_utt(psKey) < 0)
                        E_FATAL("Failed to start Keyword utterance\n");
                    utt_started = FALSE;
                    printf("LM READY....\n");
                }
            } else {
                if ((k = ad_read(ad, adbufKey, 2048)) < 0)
                    E_FATAL("Failed to read audio\n");

                ps_process_raw(psKey, adbufKey, k, FALSE, FALSE);
                in_speechKey = ps_get_in_speech(psKey);

                if (mute) {
                    in_speechKey = FALSE;
                    if (utt_startedKey) ps_end_utt(psKey);
                }

                if (in_speechKey && !utt_startedKey) {
                    utt_startedKey = TRUE;
                    printf("Keyword Listening...\n");
                }

                if (!in_speechKey && utt_startedKey) {
                    ps_end_utt(psKey);
                    hyp = ps_get_hyp_final(psKey, &finalFlag);

                    if (hyp != NULL && strcmp("", hyp) != 0) {
                        printf("Keyword spotted: %s\n", hyp);
                        if (strcmp(hyp, "HEIMDALL") == 0) keywordSpotted = TRUE;
                    }

                    if (keywordSpotted) {
                        if (ps_start_utt(ps) < 0)
                            E_FATAL("Failed to start LM utterance\n");
                    } else {
                        if (ps_start_utt(psKey) < 0)
                            E_FATAL("Failed to start Keyword utterance\n");
                        utt_startedKey = FALSE;
                        printf("Keyword ready....\n");
                    }
                }
            }
            break; // case RECORD
        } // switch
        lo_server_recv_noblock(st, 100);


    } // while

    ad_close(ad);
    lo_server_free(st);
}

L Numeric Reduction Python Parser

class NumberWordReducer():
    def __init__(self, numberWord):
        self.numberWord = numberWord.lower()
    #______

    def einer(self, digitWord):
        zehn = 0
        if "zehn" in digitWord:
            zehn = 10
        elif "zig" in digitWord or "ßig" in digitWord:  # "ßig" as in "dreißig"
            zehn = self.zig(digitWord)

if "ein" in digitWord: return 1 + zehn elif "zwei" in digitWord: return 2 + zehn elif "drei" in digitWord: return 3 + zehn elif "vier" in digitWord: return 4 + zehn elif "f nf" in digitWord: return 5 + zehn elif "sech" in digitWord: return 6 + zehn elif "sieb" in digitWord: return 7 + zehn elif "acht" in digitWord: return 8 + zehn elif "neun" in digitWord: return 9 + zehn elif "elf" in digitWord: return 11 elif "zw lf" in digitWord: return 12 else: return zehn #______

    def zig(self, digitWord):
        if "zwanzig" in digitWord: return 20
        elif "dreißig" in digitWord: return 30
        elif "vierzig" in digitWord: return 40
        elif "fünfzig" in digitWord: return 50
        elif "sechzig" in digitWord: return 60
        elif "siebzig" in digitWord: return 70
        elif "achtzig" in digitWord: return 80
        elif "neunzig" in digitWord: return 90
        return 0  # no tens word found (the original listing fell through, returning None)
    #______

    def hundert(self, digitWord):
        integerNumber = 0
        if "hundert" in digitWord:
            hundertList = digitWord.split("hundert")
            integerNumber = self.einer(hundertList[-1])
            integerNumber += self.einer(hundertList[-2]) * 100
        else:
            integerNumber = self.einer(digitWord)

        return integerNumber
    #______

    def tausend(self, digitWord):
        integerNumber = 0
        if "tausend" in digitWord:
            tausendList = digitWord.split("tausend")
            integerNumber = self.hundert(tausendList[-1])
            integerNumber += self.hundert(tausendList[-2]) * 1000
        else:
            integerNumber = self.hundert(digitWord)


        return integerNumber
    #______

    def million(self, digitWord):
        integerNumber = 0
        if "million" in digitWord:
            millionList = digitWord.split("million")
            integerNumber = self.tausend(millionList[-1])
            integerNumber += self.tausend(millionList[-2]) * 1000000
        else:
            integerNumber = self.tausend(digitWord)

        return integerNumber
    #______

    def getWordRepresentation(self):
        return self.million(self.numberWord)
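A quick illustration of the reducer on a spelled-out German number (hypothetical input; the class is defined above):

    # "dreihundertfünfundzwanzig" = three hundred twenty-five
    reducer = NumberWordReducer("dreihundertfünfundzwanzig")
    print(reducer.getWordRepresentation())  # expected output: 325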

M CKY Recognition Algorithm

 1: Function CKY(G, w)  { w = a1 ... an ; R the rules of G }
 2:   T ← ∅ ;
 3:   for all j from 1 up to n do
 4:     for all rules A → aj in R do
 5:       add [j − 1, A, j] to T ;
 6:     for all i from j − 2 down to 0 do
 7:       for all rules A → B C in R do
 8:         for all k from i + 1 up to j − 1 do
 9:           if [i, B, k] and [k, C, j] are both in T then
10:             add [i, A, j] to T ;
11:   if [0, S, n] is in T then
12:     return true ;
13:   else
14:     return false ;

Table M.15: CKY Recognition Algorithm
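For concreteness, a minimal Python sketch of the recognizer in Table M.15, assuming a grammar in Chomsky normal form given as two hypothetical dictionaries (lexical maps a terminal to the set of non-terminals producing it, binary maps a pair (B, C) to the set of heads A with A → B C):

    def cky_recognize(words, lexical, binary, start="S"):
        n = len(words)
        T = set()                                       # chart of items (i, A, j)
        for j in range(1, n + 1):
            for A in lexical.get(words[j - 1], ()):     # rules A -> a_j
                T.add((j - 1, A, j))
            for i in range(j - 2, -1, -1):              # wider spans ending at j
                for k in range(i + 1, j):               # split point
                    for (B, C), heads in binary.items():
                        if (i, B, k) in T and (k, C, j) in T:
                            for A in heads:             # rules A -> B C
                                T.add((i, A, j))
        return (0, start, n) in T

For example, with lexical = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}} and binary = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}, the call cky_recognize("she eats fish".split(), lexical, binary) returns True.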


N Probabilistic CKY Recognition Algorithm

 1: Function CKY(G, w)  { w = a1 ... an ; R the rules of G }
 2:   T ← ∅ ;
 3:   for all j from 1 up to n do
 4:     for all non-terminals A do
 5:       if there is a rule A → aj then
 6:         pmax([j − 1, A, j]) ← p(A → aj) ;
 7:       else
 8:         pmax([j − 1, A, j]) ← 0 ;
 9:     for all i from j − 2 down to 0 do
10:       for all non-terminals A do
11:         pmax([i, A, j]) ← 0 ;
12:       for all k from i + 1 up to j − 1 do
13:         for all rules A → B C in R do
14:           pmax([i, A, j]) ← max( pmax([i, A, j]), p(A → B C) · pmax([i, B, k]) · pmax([k, C, j]) ) ;
15:   return pmax([0, S, n]) ;

Table N.16: Probabilistic CKY Recognition Algorithm
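A small worked example: for the grammar S → S S with p = 0.3 and S → a with p = 0.7, and the input w = a a, the algorithm computes

    pmax([0, S, 1]) = pmax([1, S, 2]) = 0.7,
    pmax([0, S, 2]) = 0.3 · 0.7 · 0.7 = 0.147,

so the most probable parse of the two-word input has probability 0.147.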

O Earley Recognition Algorithm

 1: Function EARLEY(G, w)  { w = a1 ... an ; R the rules of G }
 2:   T ← A ← {[0, S → • σ, 0]} ;
 3:   for all j from 0 to n do
 4:     for all items [i, A → α • a α′, j − 1] in T do
 5:       if a = aj then
 6:         add [i, A → α a • α′, j] to T and to A ;
 7:     while A ≠ ∅ do
 8:       remove some [k, A → α • α′, j] from A
 9:       if α′ = B β then
10:         for all rules B → γ in R do
11:           if item [j, B → • γ, j] is not in T then
12:             add [j, B → • γ, j] to T and to A ;
13:         for all items [j, B → γ •, j] in T do
14:           if item [k, A → α B • β, j] is not in T then
15:             add [k, A → α B • β, j] to T and to A ;
16:       if α′ = ε then
17:         for all items [i, B → β • A γ, k] in T do
18:           if item [i, B → β A • γ, j] is not in T then
19:             add [i, B → β A • γ, j] to T and to A ;
20:   if [0, S → σ •, n] is in T then
21:     return true ;
22:   else
23:     return false ;

Table O.17: Earley Recognition Algorithm
