Lecture 5: N-Gram Language Models


Speech Recognition: Lecture 5, N-gram Language Models
Mehryar Mohri, Courant Institute and Google Research ([email protected])

Language Models

Definition: a language model is a probability distribution $\Pr[w]$ over sequences of words $w = w_1 \ldots w_k$.
• Critical component of a speech recognition system.

Problems:
• Learning: use a large text corpus (e.g., several million words) to estimate $\Pr[w]$. Models in this course: n-gram models, maximum entropy models.
• Efficiency: computational representation and use.

This Lecture
• n-gram models: definition and problems
• Good-Turing estimate
• Smoothing techniques
• Evaluation
• Representation of n-gram models
• Shrinking LMs based on probabilistic automata

N-Gram Models

Definition: an n-gram model is a probability distribution based on the nth-order Markov assumption
$$\forall i,\quad \Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i], \quad |h_i| \le n - 1.$$
• Most widely used language models.

Consequence: by the chain rule,
$$\Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \ldots w_{i-1}] = \prod_{i=1}^{k} \Pr[w_i \mid h_i].$$

Maximum Likelihood

Likelihood: the probability of observing the sample under a distribution $p \in \mathcal{P}$, which, given the independence assumption, is
$$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).$$

Principle: select the distribution maximizing the sample probability,
$$\hat{p} = \operatorname*{argmax}_{p \in \mathcal{P}} \prod_{i=1}^{m} p(x_i), \quad\text{or}\quad \hat{p} = \operatorname*{argmax}_{p \in \mathcal{P}} \sum_{i=1}^{m} \log p(x_i).$$

Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips H, T, T, H, T, H, T, H, H, H, T, T, ..., H.

Bernoulli distribution: $p(H) = \theta$, $p(T) = 1 - \theta$.

Likelihood:
$$l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta).$$

Solution: $l$ is differentiable and concave;
$$\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \iff \theta = \frac{N(H)}{N(H) + N(T)}.$$

Maximum Likelihood Estimation

Definitions:
• n-gram: a sequence of n consecutive words.
• $S$: sample or corpus of size $m$.
• $c(w_1 \ldots w_k)$: count of the sequence $w_1 \ldots w_k$.

ML estimates: for $c(w_1 \ldots w_{n-1}) \ne 0$,
$$\Pr[w_n \mid w_1 \ldots w_{n-1}] = \frac{c(w_1 \ldots w_n)}{c(w_1 \ldots w_{n-1})}.$$
• But $c(w_1 \ldots w_n) = 0 \implies \Pr[w_n \mid w_1 \ldots w_{n-1}] = 0$!

N-Gram Model Problems

Sparsity: assigning probability zero to sequences not found in the sample $\implies$ speech recognition errors.
• Smoothing: adjusting ML estimates to reserve probability mass for unseen events. Central techniques in language modeling.
• Class-based models: create models based on classes (e.g., DAY) or phrases.

Representation: for $|\Sigma| = 100{,}000$, the number of bigrams is $10^{10}$, the number of trigrams $10^{15}$!
• Weighted automata: exploiting sparsity.

Smoothing Techniques

Typical form: interpolation of n-gram models, e.g., trigram, bigram, and unigram frequencies:
$$\Pr[w_3 \mid w_1 w_2] = \alpha_1\, c(w_3 \mid w_1 w_2) + \alpha_2\, c(w_3 \mid w_2) + \alpha_3\, c(w_3),$$
where $c(\cdot \mid \cdot)$ denotes empirical conditional frequencies.

Some widely used techniques:
• Katz back-off models (Katz, 1987).
• Interpolated models (Jelinek and Mercer, 1980).
• Kneser-Ney models (Kneser and Ney, 1995).

Good-Turing Estimate (Good, 1953)

Definitions:
• Sample $S$ of $m$ words drawn from vocabulary $\Sigma$.
• $c(x)$: count of word $x$ in $S$.
• $S_k$: set of words appearing $k$ times.
• $M_k$: probability of drawing a point in $S_k$.

Good-Turing estimate of $M_k$:
$$G_k = \frac{(k+1)\,|S_{k+1}|}{m}.$$
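As a concrete illustration (not from the lecture; the corpus and all names are illustrative), the following Python sketch computes $G_k$ from word counts. In particular, $G_0 = |S_1|/m$ estimates the total mass of unseen words:

```python
from collections import Counter

def good_turing(sample):
    """Compute Good-Turing estimates G_k = (k+1) * |S_{k+1}| / m,
    the estimated total probability mass of words seen k times."""
    m = len(sample)
    counts = Counter(sample)            # c(x) for each word x
    n = Counter(counts.values())        # n_k = |S_k|
    # each n_k contributes the estimate G_{k-1} = k * n_k / m
    return {k - 1: k * n_k / m for k, n_k in n.items()}

sample = "the cat sat on the mat the cat ran".split()
G = good_turing(sample)
print(G[0])   # estimated mass of unseen words: |S_1| / m
```

On this toy sample, four of the six word types occur exactly once, so the estimated missing mass is $G_0 = 4/9$.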
Properties

Theorem: the Good-Turing estimate is an estimate of $M_k$ with small bias, for small values of $k/m$:
$$\mathbb{E}[G_k] = \mathbb{E}[M_k] + O\Big(\frac{k+1}{m}\Big).$$

Proof:
$$\mathbb{E}[M_k] = \sum_{x \in \Sigma} \Pr[x]\,\Pr[x \in S_k]
= \sum_{x \in \Sigma} \Pr[x] \binom{m}{k} \Pr[x]^k (1 - \Pr[x])^{m-k}$$
$$= \frac{k+1}{m-k} \sum_{x \in \Sigma} \binom{m}{k+1} \Pr[x]^{k+1} (1 - \Pr[x])^{m-(k+1)} (1 - \Pr[x])$$
$$= \frac{k+1}{m-k} \Big( \mathbb{E}\big[|S_{k+1}|\big] - \mathbb{E}[M_{k+1}] \Big)
= \frac{m}{m-k}\,\mathbb{E}[G_k] - \frac{k+1}{m-k}\,\mathbb{E}[M_{k+1}].$$

Properties (cont.)

Thus,
$$\big|\mathbb{E}[M_k] - \mathbb{E}[G_k]\big|
= \Big|\frac{k}{m-k}\,\mathbb{E}[G_k] - \frac{k+1}{m-k}\,\mathbb{E}[M_{k+1}]\Big|
\le \frac{k+1}{m-k},$$
since $0 \le \mathbb{E}[G_k], \mathbb{E}[M_{k+1}] \le 1$. In particular, for $k = 0$,
$$\big|\mathbb{E}[M_0] - \mathbb{E}[G_0]\big| \le \frac{1}{m}.$$
It can be proved using McDiarmid's inequality that with probability at least $1 - \delta$ (McAllester and Schapire, 2000),
$$M_0 \le G_0 + O\Bigg(\sqrt{\frac{\log\frac{1}{\delta}}{m}}\Bigg).$$

Good-Turing Count Estimate

Definition: let $r = c(w_1 \ldots w_k)$ and $n_r = |S_r|$; then
$$c^*(w_1 \ldots w_k) = G_r \times \frac{m}{n_r} = (r+1)\,\frac{n_{r+1}}{n_r}.$$

Simple Method (Jeffreys, 1948)

Additive smoothing: add $\delta \in [0, 1]$ to the count of each n-gram:
$$\Pr[w_n \mid w_1 \ldots w_{n-1}] = \frac{c(w_1 \ldots w_n) + \delta}{c(w_1 \ldots w_{n-1}) + \delta|\Sigma|}.$$
• Poor performance (Gale and Church, 1994).
• Not a principled attempt to estimate, or make use of an estimate of, the missing mass.

Katz Back-off Model (Katz, 1987)

Idea: back off to a lower-order model for zero counts.
• If $c(w_1^{n-1}) = 0$, then $\Pr[w_n \mid w_1^{n-1}] = \Pr[w_n \mid w_2^{n-1}]$.
• Otherwise,
$$\Pr[w_n \mid w_1^{n-1}] =
\begin{cases}
d_{c(w_1^n)}\,\dfrac{c(w_1^n)}{c(w_1^{n-1})} & \text{if } c(w_1^n) > 0,\\[1ex]
\beta\,\Pr[w_n \mid w_2^{n-1}] & \text{otherwise,}
\end{cases}$$
where $d_k$ is a discount coefficient such that
$$d_k = \begin{cases} 1 & \text{if } k > K;\\ \approx \dfrac{(k+1)\,n_{k+1}}{k\,n_k} & \text{otherwise.} \end{cases}$$
Katz suggests $K = 5$.

Discount Coefficient

With the Good-Turing estimate, the total probability mass for unseen n-grams is $G_0 = n_1/m$. The discount coefficients must therefore satisfy
$$\sum_{c(w_1^n) > 0} d_{c(w_1^n)}\,\frac{c(w_1^n)}{m} = 1 - \frac{n_1}{m}
\iff \sum_{k > 0} d_k\, k\, n_k = m - n_1
\iff \sum_{k > 0} d_k\, k\, n_k = \sum_{k > 0} k\, n_k - n_1$$
$$\iff \sum_{k > 0} (1 - d_k)\, k\, n_k = n_1
\iff \sum_{k=1}^{K} (1 - d_k)\, k\, n_k = n_1,$$
where the last step uses $d_k = 1$ for $k > K$. (A code sketch of the resulting coefficients appears after the Kneser-Ney section below.)

Discount Coefficient (cont.)

Solution: a search of the form $1 - d_k = \mu\big(1 - \frac{k^*}{k}\big)$, with $k^* = (k+1)\,n_{k+1}/n_k$ the Good-Turing count estimate, leads to
$$\mu \sum_{k=1}^{K} \Big(1 - \frac{k^*}{k}\Big)\, k\, n_k = n_1
\iff \mu \sum_{k=1}^{K} \Big(1 - \frac{(k+1)\,n_{k+1}}{k\,n_k}\Big)\, k\, n_k = n_1$$
$$\iff \mu \sum_{k=1}^{K} \big[k\,n_k - (k+1)\,n_{k+1}\big] = n_1
\iff \mu\big[n_1 - (K+1)\,n_{K+1}\big] = n_1 \quad \text{(the sum telescopes)}$$
$$\iff \mu = \frac{1}{1 - \frac{(K+1)\,n_{K+1}}{n_1}}
\iff d_k = \frac{\dfrac{k^*}{k} - \dfrac{(K+1)\,n_{K+1}}{n_1}}{1 - \dfrac{(K+1)\,n_{K+1}}{n_1}}.$$

Interpolated Models (Jelinek and Mercer, 1980)

Idea: interpolation of models of different orders:
$$\Pr[w_3 \mid w_1 w_2] = \alpha\, c(w_3 \mid w_1 w_2) + \beta\, c(w_3 \mid w_2) + (1 - \alpha - \beta)\, c(w_3), \quad 0 \le \alpha, \beta \le 1.$$
• $\alpha$ and $\beta$ are estimated using held-out samples:
• the sample is split into two parts, used for training the higher-order and lower-order models;
• optimization using the expectation-maximization (EM) algorithm;
• deleted interpolation: k-fold cross-validation.

Kneser-Ney Model

Idea: a combination of back-off and interpolation, but backing off to the lower-order model based on counts of contexts. An extension of absolute discounting:
$$\Pr[w_3 \mid w_1 w_2] = \frac{\max\{0,\, c(w_1 w_2 w_3) - D\}}{c(w_1 w_2)} + \alpha\,\frac{c(\cdot\, w_3)}{\sum_{w} c(\cdot\, w)},$$
where $D$ is a constant and $c(\cdot\, w)$ denotes the number of distinct words preceding $w$ in the corpus. Modified version (Chen and Goodman, 1998): $D$ is a function of $c(w_1 w_2 w_3)$.
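To make the context-count idea concrete, here is a minimal Python sketch of a bigram Kneser-Ney model, a simplification of the trigram formula above. It is not from the lecture: setting the back-off weight from the discounted mass of each context is one standard normalization, and all function and variable names are illustrative.

```python
from collections import Counter

def kneser_ney_bigram(tokens, D=0.75):
    """Return a function prob(w1, w2) ~ P(w2 | w1) with absolute
    discounting and a continuation-count lower-order distribution."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])                # c(w1 .)
    contexts = Counter(w2 for _, w2 in bigrams)    # distinct contexts of w2
    total_types = len(bigrams)                     # distinct bigram types

    def prob(w1, w2):
        # discounted higher-order term: max(0, c(w1 w2) - D) / c(w1 .)
        # (assumes w1 was seen in training, so c(w1 .) > 0)
        higher = max(0.0, bigrams[(w1, w2)] - D) / unigrams[w1]
        # back-off weight: mass removed by discounting in context w1
        distinct_after_w1 = sum(1 for (a, _) in bigrams if a == w1)
        lam = D * distinct_after_w1 / unigrams[w1]
        # continuation probability: fraction of bigram types ending in w2
        cont = contexts[w2] / total_types
        return higher + lam * cont

    return prob

p = kneser_ney_bigram("the cat sat on the mat the cat ran".split())
print(p("the", "cat"))
```

On this toy sample, `p("the", "cat")` combines the discounted bigram estimate with a small continuation term; summing `prob(w1, w2)` over every word that ends at least one bigram type gives 1 for any seen `w1`.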
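Returning to the Katz discount coefficients derived above, the closed form for $d_k$ can be computed directly from the frequency-of-frequencies table. A minimal sketch, with an illustrative table $n_1, \ldots, n_{K+1}$ (not real corpus statistics):

```python
def katz_discounts(n, K=5):
    """Compute Katz discount coefficients d_k for k = 1..K from the
    frequency-of-frequencies table n (n[k] = number of n-grams seen
    k times), using

        d_k = (k*/k - (K+1) n_{K+1} / n_1) / (1 - (K+1) n_{K+1} / n_1),

    with k* = (k+1) n_{k+1} / n_k the Good-Turing count estimate."""
    A = (K + 1) * n[K + 1] / n[1]
    d = {}
    for k in range(1, K + 1):
        k_star = (k + 1) * n[k + 1] / n[k]
        d[k] = (k_star / k - A) / (1 - A)
    return d

# illustrative table n_1 .. n_6
n = {1: 120, 2: 40, 3: 24, 4: 13, 5: 8, 6: 5}
print(katz_discounts(n))
```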
Evaluation

Average log-likelihood of a test sample of size $N$:
$$\hat{L}(q) = \frac{1}{N} \sum_{k=1}^{N} \log_2 q[w_k \mid h_k], \quad |h_k| \le n - 1.$$

Perplexity: the perplexity $PP(q)$ of the model is defined as (the 'average branching factor')
$$PP(q) = 2^{-\hat{L}(q)}.$$
• For English texts, typically $PP(q) \in [50, 1000]$, i.e., a cross-entropy $-\hat{L}(q)$ of roughly 6 to 10 bits per word. (A code sketch of this computation appears at the end of the lecture.)

In Practice

Evaluation: an empirical observation based on the word error rate of a speech recognizer is often a better evaluation than perplexity.

n-gram order: typically n = 3, 4, or 5. Higher-order n-grams typically do not yield any benefit.

Smoothing: small differences between back-off, interpolated, and other models (Chen and Goodman, 1998).

Special symbols: beginning and end of sentences, start and stop symbols.

Representation of n-gram Models

The set of states of the WFA representing a back-off or interpolated model is defined by associating a state $q_h$ with each sequence of length less than $n$ found in the corpus:
$$Q = \{q_h : |h| < n \text{ and } c(h) > 0\}.$$
Its transition set $E$ is defined as the union of the following set of failure transitions:
$$\{(q_{wh'},\, \phi,\, -\log(\alpha_h),\, q_{h'}) : q_{wh'} \in Q\}$$
and the following set of regular transitions:
$$\{(q_h,\, w,\, -\log(\Pr(w \mid h)),\, n_{hw}) : q_{wh} \in Q\},$$
where $n_{hw}$ denotes the destination state for the extended history $hw$ (the state of the longest suffix of $hw$ in $Q$).

[Figure 3: Example of representation of a bigram model with failure transitions.]

In the example of the figure, when the failure transitions are approximated by regular epsilon transitions, an invalid path arises: the cost of the invalid path to the state, the sum of the two transition costs (0.672), is lower than the cost of the true path.
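As referenced in the evaluation section, here is a minimal sketch of the perplexity computation. The unigram model with additive smoothing (the 'Simple Method' above) stands in for an arbitrary smoothed n-gram model, and the training and test samples are illustrative.

```python
import math
from collections import Counter

def unigram_model(train_tokens, delta=0.5):
    """ML unigram estimates with additive smoothing, so that unseen
    words receive nonzero probability mass."""
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab_size = len(counts) + 1   # +1 slot for an unseen-word type
    return lambda w: (counts[w] + delta) / (total + delta * vocab_size)

def perplexity(prob, test_tokens):
    """PP(q) = 2 ** (-L(q)), where L(q) is the average log2-likelihood
    of the test sample under the model q."""
    L = sum(math.log2(prob(w)) for w in test_tokens) / len(test_tokens)
    return 2 ** (-L)

train = "the cat sat on the mat".split()
test = "the cat ran".split()
print(perplexity(unigram_model(train), test))
```

Any conditional n-gram model exposing a `prob(word)` or `prob(word, history)` function can be plugged into `perplexity` in the same way.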