Recent Development of Open-Source Speech Recognition Engine Julius

Akinobu Lee* and Tatsuya Kawahara†
* Nagoya Institute of Technology, Nagoya, Aichi 466-8555, Japan
E-mail: [email protected]
† Kyoto University, Kyoto 606-8501, Japan
E-mail: [email protected]

Abstract—Julius is an open-source large-vocabulary speech recognition software used for both academic research and industrial applications. It executes real-time speech recognition of a 60k-word dictation task on low-spec PCs with a small footprint, and even on embedded devices. Julius supports standard language models such as statistical N-gram models and rule-based grammars, as well as Hidden Markov Models (HMM) as an acoustic model. One can build a speech recognition system for one's own purpose, or integrate the speech recognition capability into a variety of applications using Julius. This article gives an overview of Julius with its major features and specifications, and summarizes the developments conducted in recent years.
I. INTRODUCTION

"Julius"¹ is an open-source, high-performance speech recognition decoder used for both academic research and industrial applications. It incorporates major state-of-the-art speech recognition techniques, and can perform large vocabulary continuous speech recognition (LVCSR) tasks effectively in real time with a relatively small footprint.

It also has much versatility and scalability. One can easily build a speech recognition system by combining a language model (LM) and an acoustic model (AM) for the task, from simple word recognition to an LVCSR task with tens of thousands of words. Standard file formats are adopted for compatibility with other standard toolkits such as HTK (HMM Toolkit)[1], the CMU-Cambridge SLM toolkit[2] and SRILM[3].

Julius is written in pure C. It runs on Linux, Windows, Mac OS X, Solaris and other Unix variants. It has been ported to the SH-4A microprocessor[4], and also runs on Apple's iPhone[5]. Most research institutes in Japan use Julius for their research[6][7], and it has been applied to various languages such as English, French[8], Mandarin Chinese[9], Thai[10], Estonian[11], Slovenian[12], and Korean[13].

Julius is available as open-source software. The license term is similar to the BSD license: no restriction is imposed on research, development or even commercial use².

The web page³ contains the latest source code and pre-compiled binaries for Windows and Linux. Several acoustic and language models for Japanese can be obtained from the site. The current development snapshot is also available via CVS. There is also a web forum for developers and users.

This article first introduces general information and the history of Julius, followed by its internal system architecture and decoding algorithm. Then, the model specification is fully described to show what types of speech recognition task it can execute. The way of integrating Julius with other applications is also briefly explained. Finally, recent developments are described as a list of updates.

¹ Julius was named after "Gaius Julius Caesar", who was a "dictator" of the Roman Republic in 100 B.C.
² See the license term included in the distribution package for details.
³ http://julius.sourceforge.jp

II. OVERVIEW

An overview of the Julius system is illustrated in Fig. 1.

Fig. 1. Overview of Julius.

Given a language model and an acoustic model, Julius functions as a speech recognition system for the given task.

Julius supports processing of both audio files and a live audio stream.
For file input, Julius assumes one sentence utterance per input file. It also supports auto-splitting of the input at long pauses, where pause detection is performed based on level and zero-cross thresholds. Audio input via a network stream is also supported.

A language model and an acoustic model are needed to run Julius. The language model consists of a word pronunciation dictionary and a syntactic constraint. Various types of language model are supported: word N-gram models, rule-based grammars and a simple word list for isolated word recognition. Acoustic models should be HMMs defined over sub-word units. Julius fully supports the HTK HMM definition file: any number of states, any state transition and any parameter tying scheme can be treated, the same as in HTK.

Applications can interact with Julius in two ways: socket-based server-client messaging and function-based library embedding. In either case, the recognition result is fed into the application as soon as the recognition process ends for an input.
The application can get the live status and statistics of the Julius engine, and control it. The latest version also supports a plug-in facility so that users can extend the capability of Julius easily.

A. Summary of Features

Here is a list of major features based on the current version.

Performance:
• Real-time recognition of 60k-word dictation on PCs, PDAs and handheld devices
• Small footprint (about 60MB for 20k-word Japanese triphone dictation, including an N-gram of 38MB on memory)
• No machine-specific or hard-coded optimization

Functions:
• Live audio input recognition via microphone / socket
• Multi-level voice activity detection based on power / Gaussian mixture model (GMM) / decoder statistics
• Multi-model parallel recognition within a single thread
• Output of N-best list / word graph / confusion network
• Forced alignment at word, phone or HMM-state level
• Confidence scoring
• Successive decoding for long input by segmenting at short pauses

Supported Models and Features:
• N-gram language model with arbitrary N
• Rule-based grammar
• Isolated word recognition
• Triphone HMM / tied-mixture HMM / phonetic tied-mixture HMM with any number of states, mixtures and models supported in HTK
• Most mel-frequency cepstral coefficient (MFCC) variants supported in HTK
• Multi-stream HMM and MSD-HMM[14] trained by HTS[15]

Integration / API:
• Embeddable into other applications as a C library
• Socket-based server-client interaction
• Recognition process control by clients / applications
• Plug-in extension
B. History

Julius was first released in 1998, as a result of a study on efficient algorithms for LVCSR[16]. Our motivation to develop and maintain such an open-source speech recognition engine comes from the public requirement of sharing a baseline platform for speech research. With a common platform, researchers on acoustic models or language models can easily demonstrate and compare their work by speech recognition performance. Julius is also intended for easy development of speech applications. Now, the software is used as a reference for speech technologies.

It was developed as a part of the platform for Japanese LVCSR funded by IPA, Japan[17] from 1997 to 2000. The decoding algorithm was refined to improve recognition performance in this period (ver. 3.1p2)[18]. After that, the Continuous Speech Recognition Consortium (CSRC)[19] was founded to maintain the software repository for Japanese LVCSR. A grammar-based version of Julius named "Julian" was developed in that project, the algorithms were further refined, and several new features for spontaneous speech recognition were implemented (ver. 3.4.2)[20]. In 2003, the effort was continued within the Interactive Speech Technology Consortium (ISTC)[21]. A number of features were added for real-world speech recognition: robust voice activity detection (VAD) based on GMM, lattice output, and confidence scoring.

The latest major revision, 4, was released in 2007. The entire source code was re-organized from a stand-alone application into a set of libraries, and modularity was significantly improved. The details are fully described in section VI. The current version is 4.1.2, released in February 2009.

III. INSIDE JULIUS

A. System Architecture

The internal module structure of Julius is illustrated in Fig. 2.

Fig. 2. Internal structure of Julius.

The top-level structure is the "engine instance", which contains all the modules required for a recognition system: audio input, voice detection, feature extraction, language model, acoustic model and search process.

An "AM process instance" holds an acoustic HMM and the work area for acoustic likelihood computation. An "MFCC instance" is generated from the AM process instance to extract a feature vector sequence from the speech waveform input. An "LM process instance" holds a language model and the work area for computing linguistic likelihoods. A "recognition process instance" is the main recognition process, using the AM process instance and the LM process instance. These modules are created in the engine instance according to the given configuration parameters. When doing multi-model recognition, a module can be shared among several upper instances for efficiency.
B. Decoding Algorithm

Julius performs a two-pass forward-backward search[16]. An overview of the decoding algorithm is illustrated in Fig. 3.

Fig. 3. Decoding algorithm.

On the first pass, a tree-structured lexicon assigned with the language model constraint is applied with a standard frame-synchronous beam search algorithm. For efficient decoding, a reduced LM constraint that concerns only the word-to-word connection, ignoring further context, is used on this pass. The actual constraint depends on the LM type: when using an N-gram model, 2-gram probabilities will be applied

on this pass. When using a rule-based grammar, a word-pair constraint extracted from the given grammar is applied. Many other approximations are introduced on the first pass for fast and efficient decoding. The cross-word context dependency is handled with an approximation which applies only the best model for the best history. Also, we adopt the one-best approximation rather than the word-pair approximation for the word context approximation[22].

The first pass generates a "word trellis index" at its end. This is the set of surviving word-end nodes per frame, with their scores and their corresponding starting frames. It is used to efficiently look up word candidates and their scores on the later pass.

On the second pass, the full language model and cross-word context dependency are applied for re-scoring. The search is performed in the reverse direction, and precise sentence-dependent Viterbi scores can be obtained by word-level stack decoding search. The speech input is again evaluated by connecting with the forward trellis obtained from the first pass. Unlike conventional word-graph rescoring, the second pass can conduct wider Viterbi scoring and wider word expansion, which allows the later pass to recover the accuracy lost to the first-pass approximations. We enhanced the stack-decoding search by setting an upper limit on the number of hypotheses generated at every sentence length, to avoid search failures for long inputs.

Julius basically assumes one sentence utterance per input. However, in natural spontaneous speech such as lectures and meetings, the input segmentation is sometimes uncertain and the input often gets long. To handle such input, Julius has a function to perform successive decoding with short-pause segmentation, automatically splitting a long input during recognition at short pauses. When a short pause is detected, it finalizes the current search at that point and then re-starts recognition from that point.

IV. SPEECH RECOGNITION SYSTEM BASED ON JULIUS

This section describes the specifications of the acoustic features, acoustic models, dictionaries and language models supported in Julius.

A. Feature Extraction

Julius can extract an MFCC-based feature vector sequence from speech input. It supports almost all variations of MFCC and energy parameters that are covered by HTK: base MFCC coefficients, energy, 0th cepstrum, and their delta and acceleration coefficients. It also supports utterance-based cepstral mean normalization (CMN), energy normalization, cepstral variance normalization (CVN) and MAP-CMN for live input. On live input, these normalization methods are approximated by using the values of the previous input as initial normalization factors.

The parameter extraction configuration required for the given acoustic model is partly set automatically from the AM header. However, most of the parameter values, such as audio sampling frequency, window shift, window length and the number of filter bank channels, should be given manually to fit the acoustic conditions of the acoustic model.
B. Acoustic Model (AM)

Julius supports monophone and triphone HMMs with any number of mixtures, states, and phone units. It can also handle tied-mixture models and phonetic tied-mixture models[23]. The current Julius also supports multi-stream HMMs and MSD-HMMs[14] trained by HTS[15]. The output probability function should be a mixture of Gaussian densities with diagonal covariances. Speaker adaptation and full-covariance models are not supported yet.

The file format should be the HTK ASCII format. It can be converted to the Julius binary format with the tool mkbinhmm for faster loading at start-up. The tool can also be used to embed feature extraction parameters into a binary HMM file.
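A typical conversion with mkbinhmm might look like the following sketch. The option and argument order follow the Julius tool documentation as far as we know; the file names here are placeholders:

```
# Convert an HTK ASCII hmmdefs into the Julius binary format, embedding
# the feature extraction parameters read from an HTK Config file:
% mkbinhmm -htkconf Config hmmdefs.ascii hmmdefs.bin
```

The resulting hmmdefs.bin loads considerably faster at engine start-up than the ASCII original.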
C. Dictionary

The format of the pronunciation dictionary is common to all LM types. It is based on the HTK dictionary format. Each pronunciation should be a sequence of sub-word unit names as defined in the acoustic model. Multiple pronunciations of a word can be specified as separate entries.

Julius converts pronunciations to a context-aware form (e.g., "a-k+i") when a triphone model is used. To specify the mapping from the logical triphone names in the dictionary to the defined (physical) model names in the acoustic model, a one-to-one mapping rule should be given as an "HMMList" file.
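The context-aware naming scheme can be sketched as follows. This is an illustrative re-implementation of the naming convention only (word-internal context; the actual engine additionally handles cross-word context and the HMMList lookup):

```python
def to_triphones(phones):
    """Expand a phone sequence into HTK-style context-dependent names,
    e.g. ["a", "k", "i"] -> ["a+k", "a-k+i", "k-i"]."""
    names = []
    for i, p in enumerate(phones):
        name = p
        if i > 0:                       # attach left context as "L-p"
            name = phones[i - 1] + "-" + name
        if i < len(phones) - 1:         # attach right context as "p+R"
            name = name + "+" + phones[i + 1]
        names.append(name)
    return names

print(to_triphones(["a", "k", "i"]))  # -> ['a+k', 'a-k+i', 'k-i']
```

Each generated logical name is then resolved to a physical model name through the HMMList mapping before likelihood computation.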
For memory efficiency, the maximum number of lexical (pronunciation) entries in the dictionary is limited to 65,535 by default. To use a larger dictionary, Julius should be configured accordingly at compilation time.

D. Language Models (LM)

Julius supports speech recognition based on an N-gram model, on rule-based grammars, or on a dictionary alone.

1) N-gram: N-grams of arbitrary length are supported in the recent version. Class N-grams are also supported, in which case the in-class word probabilities should be given not in the language model but in the dictionary. Note that the older versions (3.x) support backward 3-grams only.

At recognition time, a word 2-gram is used on the first pass and a backward word N-gram is applied on the second pass. Julius can operate with a single N-gram in either the forward or the backward direction, in which case it computes the probabilities of the other direction using the Bayes rule. However, for best performance, both a forward 2-gram for the first pass and a backward N-gram for the second pass should be provided.

The file format should be the standard ARPA format, which is the most popular format supported by various language model toolkits. Since the ARPA file is a text format and takes much time to parse, Julius provides a tool "mkbingram" to convert it into a pre-compiled binary format.

2) Rule-based Grammar: Julius can perform speech recognition based on a written grammar. The file format is an original one based on a BNF-like form, writing the category-level syntax and the per-category lexical entries in separate files. They should be compiled into a finite state automaton (FSA) and a recognition dictionary using the script "mkdfa.pl". Multiple grammars can be used at a time; in this case, Julius outputs the best hypothesis among all grammars. A tool to convert the popular HTK standard lattice format (SLF) file to the Julius FSA grammar file is also available at the website.
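As a concrete illustration of the two-file grammar format, a minimal task grammar might look like the sketch below. This is a hypothetical example: the category names are free, but the silence words and the phone names must match the acoustic model in use:

```
# sample.grammar : category-level syntax
S : NS_B FRUIT NS_E

# sample.voca : per-category lexical entries (word + pronunciation)
% NS_B
<s>      silB
% NS_E
</s>     silE
% FRUIT
apple    a p l
grape    g r e p
```

Running "mkdfa.pl sample" would then compile these into the FSA (sample.dfa) and the recognition dictionary (sample.dict) to be passed to Julius.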
3) Isolated Word Recognition: When given only a dictionary without any language model, Julius performs isolated word recognition. Since the recognition process requires no inter-word approximation, recognition ends at the first pass.

V. APPLICATION DEVELOPMENT WITH JULIUS

Currently, Julius offers several ways to interact with other applications or to extend its capability.

Julius can be run as a recognition server that receives an audio stream and sends events and results via a network socket. The audio format is a raw audio stream, and the output format is XML text. Various events are sent, such as voice detection events, partial recognition results, and the engine status. Moreover, clients can pause/resume the recognition process, send a rule-based grammar, or activate/deactivate the current grammars in Julius. The interaction is done immediately, without blocking, even while the recognition process is running.

The core engine is now implemented as a C library. It can be embedded into other applications. Application programs can assign callback functions to the main recognition loop to handle events.

The recent version also has a plug-in facility to extend its capability. Users can extend Julius with their own plug-in as a DLL or a shared object. The types of plug-ins currently supported are extensions for audio input, audio post-processing, feature extraction, feature extraction post-processing, Gaussian computation and recognition result processing.

Several sample codes are included in the source archive to help development. A simple program to conduct speech recognition is located in the julius-simple folder. Sample code and documents for plug-ins are included under the plugin folder. They are a good starting point for making applications with Julius.
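Since the server-mode output is XML text, a client can parse each result message with any XML parser. The sketch below assumes a RECOGOUT/WHYPO message shape in the style of the Julius server (module) mode; treat the exact element and attribute names, and the sample message itself, as illustrative:

```python
import xml.etree.ElementTree as ET

# One recognition-result message as a client might receive it over the
# socket (illustrative content; in module mode each message is a chunk
# of XML text).
msg = """<RECOGOUT>
  <SHYPO RANK="1" SCORE="-6888.7">
    <WHYPO WORD="hello" CM="0.95"/>
    <WHYPO WORD="world" CM="0.88"/>
  </SHYPO>
</RECOGOUT>"""

best = ET.fromstring(msg).find("SHYPO")          # best-ranked hypothesis
words = [w.get("WORD") for w in best.findall("WHYPO")]
score = float(best.get("SCORE"))                 # hypothesis score
print(" ".join(words))  # -> hello world
```

A real client would additionally frame the stream into messages and react to the event notifications (voice detection, engine status) interleaved with the results.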
VI. LATEST VERSION: JULIUS-4

Julius version 3.x was a stand-alone program mainly for research purposes. As a result of continuous incremental improvements over the years, it contained much legacy code and was not well modularized.

Version 4 of Julius (Julius-4) was released in December 2007. The core engine part was re-written as a C library to be easily incorporated into other applications. The internal structure was re-organized and modularized to enhance readability and flexibility. As a benefit of the modularization, it now enables multi-model decoding with multiple acoustic models and/or language models at the same time. The major features are listed below:

• The core engine as a separate C library
• Grammar-based Julian is integrated
• Decoding with multiple LMs and AMs
• Support for N-grams longer than 3
• Support for user-defined LM functions
• Confusion network output
• GMM-based VAD and decoder-based VAD[24]
• Performance optimization for memory efficiency
• New tools and functions
• Improved documentation

Julius-4 keeps full backward compatibility with the older versions and can be used in the same way. The performance of Julius-4.0 is almost the same as the old versions, with a slight memory improvement.

This section summarizes the major features of Julius-4.

A. Re-organized internal structure

The internal structure of Julius has been greatly modified through an anatomical analysis of the source code. All the global variables have been gathered, re-organized and encapsulated into a hierarchical structure to provide much modularity and flexibility. This successful modularization also contributes to other new features in Julius-4, namely the unified LM implementation and the multi-model decoding.







B. Engine now becomes a library: JuliusLib

The core recognition engine has moved into a C library. In the old versions, the main program consisted of two parts: a low-level library called "libsent" that handles input, output and models, in directory libsent, and the decoder itself, in directory julius. In Julius-4, the decoder part has been divided into two parts: the core engine part (in directory libjulius) and the application part (in directory julius). Functions such as character set conversion, waveform recording and the server-client modules have moved to julius.

The new engine library is called "JuliusLib". It contains all recognition procedures, configuration parsing, decoding and the miscellaneous parts required for speech recognition. It provides public functions to stop and resume the recognition process, and to add or remove rule-based grammars and dictionaries in the running process. It further supports addition/removal and activation/deactivation of models and recognition process instances, not only at start-up but also while running.

Julius-4 is re-implemented using the new libraries described above. It still keeps full backward compatibility with the older versions. For more information about the API and the list of callbacks, please refer to the HTML documents and other documents on the website.

C. Multi-model decoding

Julius-4 newly supports multi-model recognition with an arbitrary number of AMs, LMs and their combinations. Users can add, remove, activate and deactivate each recognition process in the course of recognition from the application side. Fig. 4 illustrates the creation of multiple instances corresponding to the multiple model definitions given in the configuration parameters, and their assignment in the engine.

Fig. 4. Multi-model decoding.

LMs of different types can be used at once. AMs with different feature parameters can also be used together. An acoustic feature (MFCC) instance is created for each parameter type required by the AMs. Different types and combinations of MFCC calculation, CMN, spectral subtraction and other front-end processing can be used. However, the sampling frequency, window size and window shift should be shared by all the AMs.

In multi-model decoding, the first pass is performed frame-synchronously for all recognition process instances concurrently in a single thread. Then, the second pass is performed sequentially for each process instance. After all recognition results are obtained, the engine outputs all the results.
D. Longer N-gram support

The old versions support only 3-grams, and always require two N-gram files: a forward 2-gram and a backward 3-gram. Julius-4 now supports N-grams of arbitrary length, and can operate with only one N-gram in either direction.

When only a forward N-gram is given, Julius-4 uses its 2-gram part on the first pass, and uses the full N-gram on the second pass by calculating the backward probabilities from the forward N-gram using the Bayes rule:

  P(w_1 | w_2, ..., w_N) = P(w_1, w_2, ..., w_N) / P(w_2, ..., w_N)
                         = [ Π_{i=1..N} P(w_i | w_1, ..., w_{i-1}) ] / [ Π_{i=2..N} P(w_i | w_2, ..., w_{i-1}) ]

When only a backward N-gram is given, Julius-4 calculates a forward 2-gram from the 2-gram part of the backward N-gram for the first pass, and applies the full N-gram on the second pass. When both forward and backward N-gram models are specified, Julius uses the 2-gram part of the forward N-gram on the first pass, and the full backward N-gram on the second pass to get the final result.

The backward N-gram should be trained on a corpus in which the word order is reversed.
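The Bayes-rule conversion can be checked numerically for the simplest case (N = 2). The probabilities below are made up for illustration; the sketch only demonstrates that the backward conditionals are fully determined by the forward model:

```python
# Toy forward model: unigram P(w) and forward bigram P(next | prev).
P1 = {"a": 0.6, "b": 0.4}
P2 = {("a", "a"): 0.1, ("a", "b"): 0.9,
      ("b", "a"): 0.7, ("b", "b"): 0.3}   # P2[(prev, next)]

def joint(w1, w2):
    # P(w1, w2) = P(w1) * P(w2 | w1), using the forward model only
    return P1[w1] * P2[(w1, w2)]

def backward(w1, w2):
    # Bayes rule: P(w1 | w2) = P(w1, w2) / P(w2)
    return joint(w1, w2) / sum(joint(v, w2) for v in P1)

# The derived backward conditionals form a proper distribution over w1:
for w2 in P1:
    assert abs(sum(backward(v, w2) for v in P1) - 1.0) < 1e-12
print(round(backward("a", "b"), 3))  # -> 0.818
```

For N > 2 the same cancellation applies, which is why only the numerator terms involving w_1 survive in the ratio above.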
When using both a forward N-gram and a backward N-gram, they should be trained on the same corpus with the same cut-off values.

E. User-defined LM function support

Julius-4 allows word probabilities to be given by user-defined functions. When a set of functions is defined which returns the output probability of a word in a given context, Julius uses those functions to compute the word probabilities during decoding. This feature enables incorporating user-side linguistic knowledge or constraints directly into the search stage of the recognition engine. To use this feature, users need to define these functions and register them to JuliusLib using an API function. Also, the option -userlm must be specified at start-up to tell Julius to switch to the registered functions internally.

F. Isolated word recognition

Julius-4 has a dedicated mode for simple isolated word recognition. Given a dictionary only, it performs one-pass isolated word recognition.

G. Confusion network output

Julius-4 can output recognition results as a confusion network using Mangu's method[25]. The output presents word candidates in descending order of confidence score. Note that the search parameters should be set to search for many hypotheses in order to obtain a large network.
H. Enhanced voice activity detection (VAD)

To improve robustness for real-world speech recognition, Julius-4 features two new VAD methods. One is GMM-based VAD, and the other is called "decoder-VAD", an experimental method using the acoustic model and decoding status for speech detection[24]. Currently, both methods are experimental and not activated by default. They can be activated by specifying the configuration options --enable-gmm-vad and --enable-decoder-vad at compilation time.

I. Miscellaneous updates

• Memory improvement in the lexicon tree
• New tool slf2dfa to convert HTK SLF files into Julius grammar
• Full support of all variations of MFCC extraction
• MAP-CMN for real-time input
• Parsing of feature extraction parameters from an HTK Config file
• Embedding of feature extraction parameters in binary HMM files
VII. SUMMARY OF CHANGES

The recent version (as of July 2009) is 4.1.2. The major changes between milestone versions from 3.4.2 (released in May 2004) to 4.1.2 are summarized below.

A. Changes from 4.1 to 4.1.2

• SRILM[3] support
• N-gram size limit expanded to 4 GByte
• Improved OOV mapping in N-grams
• Multiple-level forced alignments at a time
• Faster start-up

B. Changes from 4.0 to 4.1

• Support for plug-in extension
• Support for multi-stream HMMs
• Support for MSD-HMMs
• Support for CVN and VTLN
• Improved microphone API handling on Linux
• Support for "USEPOWER=T" as in HTK

C. Changes from 3.5.3 to 4.0

New features:
• Multi-model recognition
• Output of each recognition result to a separate file
• Logging to a file instead of stdout, or no log output at all
• Environment variables allowed in jconf files ("$VARNAME")
• Audio input delay time configurable via an environment variable
• Input rejection based on average power (-powerthres, --enable-power-reject)
• GMM-based VAD
• Decoder-based VAD
• Support for N-grams longer than 3-gram
• Support for recognition with forward-only or backward-only N-grams
• Initial support for user-defined LMs
• Support for isolated word recognition using a dictionary only
• Confusion network output

Compatibility issues:
• Grammar-based Julian is merged into Julius
• Multi-path mode is integrated; Julius automatically switches to multi-path mode when the AM requires it
• Module mode enhanced
• Dictionary format becomes the same as HTK's
• Dictionary allows quotation

D. Changes from 3.5 to 3.5.3

• Speedup of approx. 30% by code optimization
• Greatly reduced memory access
• New grammar tools to minimize finite state automata
• New tool generate-ngram to output random sentences from an N-gram
• Fully HTK-compliant dictionary format (output string field can be omitted)
• Updated all source-code documentation for Doxygen[26]

E. Changes from 3.4.2 to 3.5

• Input rejection based on GMM
• Word lattice output
• Recognition with multiple rule-based grammars
• Character set conversion in output
• Use of integrated zlib instead of executing an external gzip command
• Integration of all variants (Linux / Windows / Multi-path ...) into one source tree
• MinGW support
• Source code documentation using Doxygen

VIII. CONCLUSION

This article briefly introduced the open-source speech recognition software Julius and described its recent developments. Julius is the product of over eleven years' work, and development still continues on an academic volunteer basis. Future work should include speaker adaptation, integration of robust front-end processing, and support for standard grammar formats such as the Java speech grammar format (JSGF).

REFERENCES

[1] http://htk.eng.cam.ac.uk/
[2] P. R. Clarkson and R. Rosenfeld. Statistical Language Modeling Using the CMU-Cambridge Toolkit. In Proc. ESCA Eurospeech '97, vol. 5, pp. 2707–2710, 1997.
[3] A. Stolcke. SRILM - An Extensible Language Modeling Toolkit. In Proc. ICSLP, pp. 901–904, 2002.
[4] H. Kokubo, N. Hataoka, A. Lee, T. Kawahara and K. Shikano. Real-Time Continuous Speech Recognition System on SH-4A Microprocessor. In Proc. International Workshop on Multimedia Signal Processing (MMSP), pp. 35–38, 2007.
[5] http://www.creaceed.com/vocalia/
[6] T. Cincarek et al. Development, Long-Term Operation and Portability of a Real-Environment Speech-oriented Guidance System. IEICE Trans. Information and Systems, Vol. E91-D, No. 3, pp. 576–587, 2008.
[7] T. Kawahara, H. Nanjo, T. Shinozaki and S. Furui. Benchmark Test for Speech Recognition Using the Corpus of Spontaneous Japanese. In Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), 2003.
[8] D. Fohr, O. Mella, C. Cerisara and I. Illina. The Automatic News Transcription System: ANTS, some Real Time Experiments. In Proc. INTERSPEECH, pp. 377–380, 2004.
[9] D. Yang, K. Iwano and S. Furui. Accent Analysis for Mandarin Large Vocabulary Continuous Speech Recognition. IEICE Technical Report, Asian Workshop on Speech Science and Technology, SP-2007-201, pp. 87–91, 2003.
[10] M. Jongtaveesataporn, C. Wutiwiwatchai, K. Iwano and S. Furui. Development of a Thai Broadcast News Corpus and an LVCSR System. In ASJ Annual Meeting, 3-10-1, 2008.
[11] T. Alumäe. Large Vocabulary Continuous Speech Recognition for Estonian using Morphemes and Classes. In Proc. ICSLP, pp. 389–392, 2004.
[12] T. Rotovnik, M. S. Maucec, B. Horvat and Z. Kacic. A Comparison of HTK, ISIP and Julius in Slovenian Large Vocabulary Continuous Speech Recognition. In Proc. ICSLP, pp. 671–684, 2002.
[13] J.-G. Kim, H.-Y. Jung and H.-Y. Chung. A Keyword Spotting Approach Based on Pseudo N-Gram Language Model. In Proc. SPECOM, pp. 256–259, 2004.
[14] K. Tokuda, T. Masuko, N. Miyazaki and T. Kobayashi. Hidden Markov Models Based on Multi-Space Probability Distribution for Pitch Pattern Modeling. In Proc. IEEE-ICASSP, Vol. 1, pp. 229–232, 1999.
[15] http://hts.sp.nitech.ac.jp/
[16] A. Lee, T. Kawahara and S. Doshita. An Efficient Two-pass Search Algorithm using Word Trellis Index. In Proc. ICSLP, pp. 1831–1834, 1998.
[17] T. Kawahara et al. Free Software Toolkit for Japanese Large Vocabulary Continuous Speech Recognition. In Proc. ICSLP, Vol. 4, pp. 476–479, 2000.
[18] A. Lee, T. Kawahara and K. Shikano. Julius – an Open Source Real-Time Large Vocabulary Recognition Engine. In Proc. EUROSPEECH, pp. 1691–1694, 2001.
[19] A. Lee et al. Continuous Speech Recognition Consortium – an Open Repository for CSR Tools and Models. In Proc. IEEE International Conference on Language Resources and Evaluation, pp. 1438–1441, 2002.
[20] T. Kawahara, A. Lee, K. Takeda, K. Itou and K. Shikano. Recent Progress of Open-Source LVCSR Engine Julius and Japanese Model Repository. In Proc. ICSLP, pp. 3069–3072, 2004.
[21] http://www.astem.or.jp/istc/index_e.html
[22] R. Schwartz and S. Austin. A Comparison of Several Approximate Algorithms for Finding Multiple (N-best) Sentence Hypotheses. In Proc. IEEE-ICASSP, Vol. 1, pp. 701–704, 1991.
[23] A. Lee, T. Kawahara, K. Takeda and K. Shikano. A New Phonetic Tied-Mixture Model for Efficient Decoding. In Proc. IEEE-ICASSP, Vol. 3, pp. 1269–1272, 2000.
[24] H. Sakai et al. Voice Activity Detection Applied to Hands-Free Spoken Dialogue Robot based on Decoding using Acoustic and Language Model. In Proc. ROBOCOMM, pp. 1–8, 2007.
[25] L. Mangu et al. Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks. Computer Speech and Language, vol. 14, no. 4, pp. 373–400, 2000.
[26] http://www.doxygen.org/