Speech Recognition (Speech to Text) Software

How It Works Explained in Nearly Plain English

When speech experts talk about a "rules-based" approach, they mean a set of rules derived from the knowledge of experts. In the context of speech processing, these rules can be analogized to a set of English grammar rules. The most straightforward approach to creating speech recognition software would be to compare speaker utterances against a set of English grammar rules and then search a vocabulary list for the closest match. In practice, no one has been able to do this with acceptable accuracy at an acceptable processing speed, even though rules-based approaches have produced acceptable accuracy for speech synthesis (text to speech) software.

Speaking is quite different from written text for a number of reasons. First, even individuals with very polished diction still insert the occasional "um" or awkward silence into their conversations; this is much rarer in written text unless one is dealing with a verbatim transcript. Second, regional diction matters much less in written text than it does in spoken conversation. A sports columnist from Birmingham and a sports columnist from Boston may have difficulty understanding each other in conversation, but their written product is likely to be very similar. Third, noise interferes with spoken input. All of these things make speech more difficult to model and process than text.

When speech experts talk about a "corpus-based" approach, they are talking about combining statistical analysis with speech data. In speech processing, a corpus is a collection of audio recordings of speech and speech transcripts. Current commercially available speech recognition software programs use a corpus-based approach. They perform acoustical and statistical analysis of voiceprints to determine what a speaker has said. The typical program uses a mathematical model called a Hidden Markov Model to assign probabilities to words in its vocabulary list based on its analysis of the audio segments and chooses the word most likely to be next in the sequence.
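To make the idea concrete, here is a minimal sketch of how a Hidden Markov Model picks the most probable word sequence using the Viterbi algorithm. The two-word vocabulary, the "acoustic labels," and every probability below are invented toy numbers for illustration only; a real recognizer works from trained acoustic and language models with thousands of states.

```python
# Toy HMM: hidden states are words; observations are crude "acoustic labels".
# All vocabulary entries and probabilities are invented for illustration.
words = ["two", "to"]
start_p = {"two": 0.5, "to": 0.5}
trans_p = {  # P(next word | current word), from a toy language model
    "two": {"two": 0.1, "to": 0.9},
    "to": {"two": 0.6, "to": 0.4},
}
emit_p = {  # P(acoustic observation | word), from a toy acoustic model
    "two": {"TUW": 0.7, "TAH": 0.3},
    "to": {"TUW": 0.4, "TAH": 0.6},
}

def viterbi(observations):
    """Return the most probable word sequence for the observations."""
    # trellis[i] maps each word to (best probability, best path ending there)
    trellis = [{w: (start_p[w] * emit_p[w][observations[0]], [w]) for w in words}]
    for obs in observations[1:]:
        column = {}
        for w in words:
            # Pick the best previous word to transition from
            prob, path = max(
                (trellis[-1][prev][0] * trans_p[prev][w] * emit_p[w][obs],
                 trellis[-1][prev][1] + [w])
                for prev in words
            )
            column[w] = (prob, path)
        trellis.append(column)
    return max(trellis[-1].values())[1]

print(viterbi(["TUW", "TAH"]))  # -> ['two', 'to']
```

The program never consults a grammar; it only multiplies probabilities and keeps the highest-scoring path, which is why its occasional errors can look so strange to a human reader.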

To use a classroom analogy, the speaker/teacher isn't assigning English homework to the program; he or she is assigning math homework instead. This is why every such program will occasionally produce very strange results. The program has no knowledge of English grammar and only indirectly filters out words and phrases that make no sense in context.

Inherent Limitations

Many people have predicted the imminent disappearance of keyboards and mice as input devices. They may have overlooked certain issues that would limit speech recognition software even if it were much better than it actually is.

While it is possible to train speech recognition software to recognize one's diction with a very high degree of accuracy and to drastically decrease the need for a keyboard and mouse, there are several significant drawbacks (at least in a work environment). First, speech input is inherently noisy. While this is considered acceptable in some work environments (e.g., telemarketing), there are many others where it would not be. Second, if one is using one's voice to dictate, one cannot be using it for anything else; in many work environments, one might be on the phone at the same time that one is typing. Third, some tasks are inherently faster for able-bodied users with a mouse or keyboard (e.g., opening Microsoft Word, changing fonts, changing margins). Fourth, while it is reasonable to assume that the need for training will decrease as the software improves, it is unlikely that one will be able to get the maximum benefit out of the software without training anytime in the foreseeable future. For many users, the time spent training will not be worth it because using a keyboard and mouse will be easier, if not faster.

Improving It for Current Users

Those who have mobility issues in the upper body but retain their speaking skills can still benefit from the software. Despite the inherent limitations, certain classes of able-bodied users (e.g., radiologists) could also make use of a sufficiently robust program. Radiologists dictate more reports and make more use of transcription than perhaps any other medical specialists. Despite the fact that they would be the most obvious able-bodied customers, Nuance has failed to make much headway in selling speech recognition software to them.

Radiology journals maintain an ongoing debate between those who use speech recognition software and those who do not. The speech recognition software users are a minority of radiologists. Those who do not use speech recognition software have the usual complaints regarding accuracy. It is not clear from the various articles which version of Dragon NaturallySpeaking proponents and critics are using. While Nuance markets products specifically for radiologists, it is unclear whether any radiologists have purchased them.

The fundamental barrier to improvement is processing power. The Hidden Markov Model (HMM) and other mathematical models used in speech recognition can be understood as a series of mathematical assumptions that are often but not always accurate. These assumptions are made to reduce the amount of data necessary to run the program (thereby reducing the processing power necessary to run the program). Eliminating assumptions and incorporating data in place of those assumptions requires additional processing power.
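The savings such assumptions buy can be illustrated with simple counting. Assume, hypothetically, a 30,000-word vocabulary (the size mentioned for early commercial programs): predicting each word from only its immediate predecessor, rather than from its two predecessors, shrinks the probability table enormously.

```python
# Toy illustration: how a Markov-style assumption shrinks a language model.
# A 30,000-word vocabulary is assumed, matching early commercial programs.
vocab = 30_000

# Bigram model: P(word | previous word) -- one entry per word pair.
bigram_entries = vocab ** 2

# Trigram model: P(word | two previous words) -- one entry per word triple.
trigram_entries = vocab ** 3

print(f"bigram table:  {bigram_entries:,} entries")   # 900,000,000
print(f"trigram table: {trigram_entries:,} entries")  # 27,000,000,000,000
print(f"growth factor: {trigram_entries // bigram_entries:,}x")  # 30,000x
```

Every weakened assumption multiplies the table (and the arithmetic needed to search it) by another factor of the vocabulary size, which is why richer models had to wait for faster hardware.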

It took nearly a quarter-century from the first application of a Hidden Markov Model to the speech recognition problem for the first commercially viable continuous speech recognition program to appear because earlier computers lacked the processing power to run such a program. There have been only incremental improvements in mathematical modeling since the mid-1970s largely because even early 21st century personal computers lack the power to run more sophisticated mathematical models while processing speech in real-time.

Research grants for the study of speech recognition and speech synthesis have waxed and waned but never entirely dried up. Unfortunately for current users of speech recognition software, most current research is focused on the long term rather than on improving the products that are on the market today.

Though IBM funds some speech research, IBM cared so little about its ViaVoice software that it licensed it to a competitor that had bought the rights to Dragon NaturallySpeaking in a bankruptcy proceeding. The two products were subsequently merged. Microsoft bundles its speech recognition software with the newer Windows operating systems, something it does only for programs that it does not value highly; Microsoft's high-value programs (e.g., Microsoft Office) are never bundled with the OS. Even the creator of Dragon NaturallySpeaking (who lost his intellectual property rights to it in bankruptcy, see below for details) is focused on long-term projects unlikely to bear fruit in his lifetime.

Nevertheless, speech recognition software may continue to see at least incremental improvements, and it may benefit indirectly from the ongoing development of speech recognition in other contexts. Many companies have implemented speech recognition in their voicemail menus. This task is much simpler because the vocabulary is limited. The system responds to any unknown input by asking the customer to repeat it or offering to transfer the customer to a human customer service representative.
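A rough sketch of why a limited vocabulary simplifies the task follows. The menu entries, the cutoff value, and the use of plain fuzzy text matching are all invented for illustration; a real voicemail system matches against acoustic models rather than transcribed text, but the principle is the same: with only a handful of legal inputs, near-misses are easy to resolve and everything else triggers a reprompt or a transfer.

```python
import difflib

# Hypothetical voicemail menu -- a handful of commands instead of a
# 30,000-word dictionary.
MENU = ["billing", "sales", "support", "operator"]

def route(heard, cutoff=0.6):
    """Match a (hypothetically transcribed) utterance against the menu.

    Returns the matched command, or None -- in which case a real system
    would ask the caller to repeat or transfer them to a human.
    """
    matches = difflib.get_close_matches(heard.lower(), MENU, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(route("billin"))     # 'billing'  (close enough to resolve)
print(route("xylophone"))  # None -> reprompt or transfer to a human
```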

Historical Dates Relevant to Current Commercial Products

1947 – Researchers publish a book detailing their work with spectrographs that produce spectrograms. Spectrograms render speech in visual form, creating "visible speech."1

1 Ralph K. Potter, George A. Kopp, Harriet C. Green, Visible Speech (1947).

1952 -- Researchers at Bell Labs build a machine capable of recognizing spoken digits. They claim a 97% accuracy rate.2

1956 -- Researchers at RCA Labs build a "phonetic typewriter." They claim a 98% accuracy rate.3

1960 -- An MIT researcher proposes a model for speech recognition and speech synthesis while acknowledging that building such devices is beyond 1960 capabilities. The discussion includes the recognition of problems (e.g., multiple speakers) that have not yet been solved.4

1966 -- Leonard E. Baum and Ted Petrie publish the first mathematical proof for a Hidden Markov Model.5

1969 -- Bell Labs engineer John R. Pierce writes a letter to the editor of The Journal of the Acoustical Society of America highly critical of speech recognition research.6

1971 -- DARPA begins funding the Speech Understanding Research project at selected universities.7

1975 -- IBM computer scientist James K. Baker completes his PhD thesis requirements at Carnegie Mellon8 while working on the SUR project.9 He becomes the first person to use a Hidden Markov Model10 to address the problem of speech recognition, applying the HMM to a program known as Dragon (named after the pattern on his wedding china).11 It is the only meaningful result from $15 million of DARPA funding for the SUR project.12

2 K.H. Davis, R. Biddulph & S. Balashek, Automatic Recognition of Spoken Digits, 24 J. Acous. Soc’y Am. 637 (1952).

3 Harry F. Olson & Herbert Belar, Phonetic Typewriter, 28 J. Acous. Soc’y Am. 1072 (1956).

4 Kenneth N. Stevens, Toward a Model for Speech Recognition, 32 J. Acous. Soc’y Am. 47 (1960).

5 Leonard E. Baum & Ted Petrie, Statistical Inference for Probabilistic Functions of Finite State Markov Chains, 37 Annals Math. Stats. 1554 (1966).

6 John R. Pierce, Whither Speech Recognition?, 46 J. Acous. Soc’y Am. 1049 (1969).

7 Dennis H. Klatt, Review of the ARPA Speech Understanding Project, 62 J. Acous. Soc’y Am. 1345 (1977).

8 James K. Baker, Stochastic Modeling as a Means of Automatic Speech Recognition (Apr. 1, 1975) (unpublished PhD thesis, Carnegie Mellon University).

9 Simson L. Garfinkel, Enter the Dragon, Technology Review, Sept./Oct. 1998, at 58, 60-61.

1982 -- James and Janet Baker found Dragon Systems. Dragon Systems generates income through federal contracts and business-to-business projects.13

1990 -- Dragon Systems releases DragonDictate, a “discrete” speech recognition program (i.e., it requires pausing between words) with a 30,000 word vocabulary.14

1997

April -- Dragon Systems announces Dragon NaturallySpeaking, a continuous speech recognition program, offering a Personal Edition for $695 and a Professional Edition for $995. It claims a 95% accuracy rate with a 30,000 word "active" vocabulary and a 230,000 word "backup" dictionary from which words can be added to the active vocabulary during correction.15

June -- IBM announces ViaVoice, a continuous speech recognition program for only $99.16 Dragon Systems’ immediate response is to extend a $299 special offer.17

1998 -- Dragon Systems announces Dragon NaturallySpeaking Standard and Dragon NaturallySpeaking Preferred18 (products that would remain until 2010). Standard is priced at $10919 and Preferred at $19920 (adjusted for inflation, their 2010 prices were about one third less).

10 Baker, supra, at 5.

11 Garfinkel, supra, at 61.

12 Klatt, supra, at 1354.

13 Garfinkel, supra, at 62.

14 Id.

15 Dragon Systems, Dragon Systems Unveils Revolutionary Breakthrough with Continuous Speech Recognition, DragonSys.com via Internet Archive WayBack Machine, http://replay.waybackmachine.org/19970613205637/http://www.dragonsys.com/marketing/kodiak.html (Apr. 2, 1997 press release).

16 IBM, IBM VoiceType Simply Speaking Gold and IBM ViaVoice Expand IBM's Speech Offerings, IBM.com, http://www.www-304.ibm.com/jct01003c/cgi-bin/common/ssi/ssialias?infotype=an&subtype=ca&htmlfid=897/ENUS297-201&appname=xldata&language=enus (June 10, 1997 press release).

17 Dragon Systems, Special Introductory Offer!, DragonSys.com via Internet Archive WayBack Machine, http://replay.waybackmachine.org/19970613204719/http://www.dragonsys.com/ (crawled June 13, 1997).

18 Dragon Systems, June 16, 1998 press release, DragonSys.com via Internet Archive WayBack Machine, http://replay.waybackmachine.org/19980625171244/http://www.dragonsys.com/frameset/product-frame.html.

2000

March -- Lernout & Hauspie announces the acquisition of Dragon Systems.21

August -- The Wall Street Journal reports that Lernout & Hauspie misstated the extent of its Korean business.22

November -- Lernout & Hauspie files for bankruptcy.23

2009 -- James and Janet Baker's lawsuit against Goldman Sachs (for recommending the merger with Lernout & Hauspie) survives a motion to dismiss.24

2010 -- Jo Lernout and Pol Hauspie are sentenced to five years in prison.25

2011 – Nuance releases Dragon NaturallySpeaking 11. While it is significantly improved over the earliest versions, the improvements from version 3 to version 11 resemble the incremental improvements between Word 97 and Word 2007 rather than any fundamental change.

Conclusion

The inherent limitations of speech recognition software, combined with fundamental barriers to its improvement, limit its possibilities as a general-use product. Because it is not a mass-market product, research and development will be more limited than it otherwise would be. Even so, speech recognition software is sufficiently robust to be highly useful to users whose motor skills allow voice input but discourage keyboard input. Despite its flaws, it can be useful in bridging the digital divide between able-bodied and disabled users.

19 Dragon Systems, Dragon NaturallySpeaking® Standard 4.0, DragonSys.com via Internet Archive WayBack Machine, http://replay.waybackmachine.org/19981212024456/http://www.dragonsys.com/ (crawled December 12, 1998).

20 Dragon Systems, Dragon NaturallySpeaking® Preferred 4.0, DragonSys.com via Internet Archive WayBack Machine, http://replay.waybackmachine.org/19981212024456/http://www.dragonsys.com/ (crawled December 12, 1998).

21 Lernout & Hauspie, Lernout & Hauspie Signs Definitive Agreement to Acquire Dragon Systems, lhsl.com via Internet Archive WayBack Machine, http://replay.waybackmachine.org/20001019010743/http://www.lhsl.com/news/releases/20000328_dragon.asp (Mar. 28, 2000 press release).

22 Charles Forelle & Mark Maremont, Lernout & Hauspie Founders Guilty in Fraud, Wall St. J., Sept. 20, 2010, available at http://online.wsj.com/article/SB10001424052748703989304575503500899087566.html.

23 Baker v. Goldman Sachs & Co., 656 F. Supp. 2d 226, 229 (D. Mass. 2009).

24 Id.

25 Forelle & Maremont, supra.

Selected Bibliography

Wikipedia Articles (in suggested reading order)

Stochastic process, Wikipedia.org, http://en.wikipedia.org/wiki/Stochastic_process (last modified on 5 March 2011)

Markov property, Wikipedia.org, http://en.wikipedia.org/wiki/Markov_property (last modified on 24 February 2011)

Markov process, Wikipedia.org, http://en.wikipedia.org/wiki/Markov_process (last modified on 7 April 2011)

Hidden Markov model, Wikipedia.org, http://en.wikipedia.org/wiki/Hidden_Markov_model (last modified on 15 April 2011)

Viterbi algorithm, Wikipedia.org, http://en.wikipedia.org/wiki/Viterbi_algorithm (last modified on 14 April 2011)

Academic Articles and Monographs (in suggested reading order)

Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, 77 Proc. IEEE 257 (1989), available at http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/tutorial%20on%20hmm%20and%20applications.pdf (last visited April 18, 2011).

Various authors, Part E: Speech Recognition, in Springer Handbook of Speech Processing 521-724 (Jacob Benesty, M. Mohan Sondhi & Yiteng Huang eds., 2008) [10 articles on various aspects of speech recognition].

Radiology Articles (in chronological order)

Douglas J. Quint, Voice Recognition: Ready for Prime Time?, 4 J. Am. C. Radiol. 667 (2007).

Daniel I. Rosenthal, Re:Voice Recognition: Ready for Prime Time?, 4 J. Am. C. Radiol. 670 (2007).

James Y. Chen et al., Letter to the Editor Re: Voice Recognition Dictation: Radiologist as Transcriptionist and Improvement of Report Workflow and Productivity Using Speech Recognition-A Follow-Up Study, 22 J. Dig. Img. 560 (2009).

Patent Applications

Speech Recognition System and Method, U.S. Patent App. Pub. No. 20110029316, (filed October 18, 2010) [assigned to Nuance Communications]

Robust Pattern Recognition System and Method Using Socratic Agents, U.S. Patent App. Pub. No. 20080069437 (filed September 13, 2007) [inventor James K. Baker]
