
NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

(An Autonomous Institute, Affiliated to Visvesvaraya Technological University, Belgaum, Approved by AICTE & Govt. of Karnataka)

ACADEMIC YEAR 2015-16

Final Year Project Report on

“Providing voice-enabled gadget assistance to inmates of an old age home (vriddhashrama), including physically disabled people”

Submitted in partial fulfillment of the requirement for the award of the degree of

BACHELOR OF ENGINEERING

Submitted by

ABHIRAMI BALARAMAN (1NT12EC004)
AKSHATHA P (1NT12EC013)

INTERNAL GUIDE: Ms. Kushalatha M R (Assistant Professor)

Department of Electronics and Communication Engineering
NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

Yelahanka, Bangalore-560064


NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

(An Autonomous Institute, Affiliated to VTU, Belgaum, Approved by AICTE & State Govt. of Karnataka), Yelahanka, Bangalore-560064

Department Of Electronics And Communication Engineering

CERTIFICATE

Certified that the project work entitled “Providing voice-enabled gadget assistance to inmates of an old age home (vriddhashrama), including physically disabled people”, guided by IISc, was carried out by Abhirami Balaraman (1NT12EC004) and Akshatha P (1NT12EC013), bonafide students of Nitte Meenakshi Institute of Technology, in partial fulfillment for the award of Bachelor of Engineering in Electronics and Communication of the Visvesvaraya Technological University, Belgaum, during the academic year 2015-2016. The project report has been approved as it satisfies the academic requirement in respect of project work for completion of the autonomous scheme of Nitte Meenakshi Institute of Technology for the above said degree.

Signature of the Guide (Ms. Kushalatha M R)
Signature of the HOD (Dr. S. Sandya)

External Viva
Name of the Examiners          Signature with Date
………………………………          …………


ACKNOWLEDGEMENT

We express our deepest thanks to our Principal, Dr. H. C. Nagaraj, and to Dr. N. R. Shetty, Director, Nitte Meenakshi Institute of Technology, Bangalore, for allowing us to carry out the industrial training and supporting us throughout.

We also thank the Indian Institute of Science for giving us the opportunity to carry out our internship project in their esteemed institution and for giving us all the support we needed to carry the idea forward as our final year project.

We express our deepest thanks to Dr. Rathna G N for her useful decisions, guidance and the necessary equipment for the project, and for helping us progress it into our final year project. We take this moment to acknowledge her contribution gratefully.

We also express our deepest thanks to our HOD, Dr. S. Sandya, for allowing us to carry out our industrial training and for helping us in every way so that we could gain practical experience of the industry. We also take this opportunity to thank Ms. Kushalatha M R (Asst. Prof., ECE Dept.) for guiding us on the right path and being of immense help to us.

Finally, we thank everyone else, unnamed here, who helped us in various ways to gain knowledge and have a good training.


ABSTRACT

Speech recognition is one of the most rapidly developing fields of research at both industrial and scientific levels. Until recently, the idea of holding a conversation with a computer seemed pure science fiction. If you asked a computer to “open the pod bay doors”, well, that happened only in movies. But things are changing, and quickly. A growing number of people now talk to their mobile smartphones, asking them to send e-mail and text messages, search for directions, or find information on the Web. Our project aims at one such application. The project was designed keeping in mind the various categories of people who suffer from loneliness due to the absence of others to care for them, especially those undergoing cancer treatment and the elderly. The system provides interaction and entertainment and controls appliances such as the television through voice commands.


LITERATURE SURVEY

Books are available to read and learn about speech recognition; these enabled us to see what happens beyond the code.

Claudio Becchetti and Lucio Prina Ricotti, “Speech Recognition: Theory and C++ Implementation”, 2008 edition. From this book we learned how to write and implement the C++ code.

A recent comprehensive textbook, "Fundamentals of Speaker Recognition" by Homayoon Beigi, is an in-depth source for up-to-date details on the theory and practice. A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

"Automatic Speech Recognition: A Deep Learning Approach" (Publisher: Springer) written by D. Yu and L. Deng published near the end of 2014, with highly mathematically-oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods.This gave us an insight into the conversion algorithm used by Google.

Here are some IEEE and other articles we referred to:

Waibel, Hanazawa, Hinton, Shikano, Lang (1989). "Phoneme recognition using time-delay neural networks". IEEE Transactions on Acoustics, Speech and Signal Processing.

Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models" (PDF). IEEE Transactions on Speech and Audio Processing (IEEE) 3 (1): 72–83. doi:10.1109/89.365379. ISSN 1063-6676. OCLC 26108901. Retrieved 21 February 2014.


SURVEY QUESTIONNAIRE CONDUCTED IN THE OLD AGE HOME:

1. What is the total number of people in this old age home? Ans. There are 22 people, all above the age of 70.

2.What are the facilities available to you?

Ans. All basic needs are provided.

3. Is 24/7 medical assistance available for someone who is bedridden? Ans. No, there is no 24/7 nursing.

4. What are the technological facilities provided at your organization for entertainment purposes? Ans. There is only a television in each room.

5. Do you have access to computers, the internet and mobile phones at your organization?

Ans. No, we are not aware of how to use any of these. They are also expensive.

6. What are the changes you would like to have in your daily routine?

Ans. The routine is monotonous, so we would like ways to pass the time, like playing games or learning anything based on our interests.

7. Do you think our project is helpful to you?

Ans. Yes. It provides us entertainment and keeps us engaged so that we do not feel bored. Speech activation is very helpful for us as it is easy to use, especially for the TV.

8. Any suggestions?

Ans. Add books and scriptures, since our eyes get weak with age. Add quiz games so that we can improve our knowledge. We need something that can train us in learning new languages or anything based on our interests, without using the internet.


CONTENTS

1. INTRODUCTION

2. OUR OBJECTIVE

3. SYSTEM REQUIREMENTS

3.1 HARDWARE COMPONENTS

3.2 SOFTWARE REQUIRED

4. IMPLEMENTATION

4.1 ALGORITHMS

4.2 SETTING UP RASPBERRY PI

4.3 DOWNLOADING OTHER SOFTWARE

4.4 SETTING UP LIRC

4.5 WORKING OF IR LED

4.6 FLOWCHART

4.7 BLOCK DIAGRAM

5. FURTHER ENHANCEMENTS

6. APPLICATIONS

7. REFERENCES


LIST OF FIGURES

1. Block diagram of WATSON recognition system

2. Raspberry Pi Model B

3. Sound Card (Quantum)

4. Collar Mic

5. IR LED

6. IR Receiver

7. PN2222

8. The Raspbian Desktop

9. Jasper Client

10. Schematic

11. Flowchart

12. Block Diagram of System

13. GSM Quadband 800A

14. Home automation possibilities

15. Car automation


CHAPTER 1

INTRODUCTION TO SPEECH RECOGNITION

In computer science and electrical engineering, speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT).

Some SR systems use "training" (also called "enrolment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent"[1] systems. Systems that use training are called "speaker dependent".

Speech recognition applications include voice user interfaces such as voice dialling (e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input).

The term voice recognition[2][3][4] or speaker identification[5][6] refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.

From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems. These speech industry players include Google, IBM, Baidu (China), Apple, Amazon, Nuance and IflyTek (China), many of which have publicized that the core technology in their speech recognition systems is based on deep learning.

Fig 1.WATSON block diagram

Now the rapid rise of powerful mobile devices is making voice interfaces even more useful and pervasive.

Jim Glass, a senior research scientist at MIT who has been working on speech interfaces since the 1980s, says today’s smart phones pack as much processing power as the laboratory machines he worked with in the ’90s. Smart phones also have high-bandwidth data connections to the cloud, where servers can do the heavy lifting involved with both voice recognition and understanding spoken queries. “The combination of more data and more computing power means you can do things today that you just couldn’t do before,” says Glass. “You can use more sophisticated statistical models.”

The most prominent example of a mobile voice interface is, of course, Siri, the voice-activated personal assistant that comes built into the latest iPhone. But voice functionality is built into the Android platform and most other mobile systems, as well as many apps. While these interfaces still have considerable limitations (see Social Intelligence), we are inching closer to machine interfaces we can actually talk to.


In 1971, DARPA funded five years of speech recognition research through its Speech Understanding Research program with ambitious end goals, including a minimum vocabulary size of 1,000 words. BBN, IBM, Carnegie Mellon and Stanford Research Institute all participated in the program.[11] The government funding revived speech recognition research that had been largely abandoned in the United States after John Pierce's letter. Despite the fact that CMU's Harpy system met the goals established at the outset of the program, many of the predictions turned out to be nothing more than hype, disappointing DARPA administrators. This disappointment led to DARPA not continuing the funding.[12] Several innovations happened during this time, such as the invention of beam search for use in CMU's Harpy system.[13] The field also benefited from the discovery of several algorithms in other fields such as linear predictive coding and cepstral analysis.


CHAPTER 2

OUR OBJECTIVE

The system provides information and entertainment to otherwise solitary people, and hence acts as a personal assistant.

People with disabilities can benefit from speech recognition programs. For individuals that are Deaf or Hard of Hearing, speech recognition software is used to automatically generate a closed-captioning of conversations such as discussions in conference rooms, classroom lectures, and/or religious services.[4]

Speech recognition is also very useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to involved disabilities that preclude using conventional computer input devices. In fact, people who used the keyboard a lot and developed RSI became an urgent early market for speech recognition.[6] Speech recognition is used in deaf telephony, such as voicemail to text, relay services, and captioned telephone. Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is processed incorrectly, causing it to end up differently on paper) can possibly benefit from the software, but the technology is not bug proof.[7] Also, the whole idea of speech to text can be hard for intellectually disabled persons, due to the fact that it is rare that anyone tries to learn the technology in order to teach the person with the disability.[8]

Being bedridden can be very difficult for many patients to adjust to and it can also cause other health problems as well. It is important for family caregivers to know what to expect so that they can manage or avoid the health risks that bedridden patients are prone to. In this article we would like to offer some information about common health risks of the bedridden patient and some tips for family caregivers to follow in order to try and prevent those health risks.

Depression is also a very common health risk for those who are bedridden, because they are unable to care for themselves and maintain the social life they used to have. Many seniors begin to feel hopeless when they become bedridden, but this can be prevented with proper care. Family caregivers should make sure that they are caring for their loved one’s social and emotional needs as well as their physical needs. Many family caregivers focus only on the physical needs of their loved ones and forget that they have emotional and social needs as well. Family caregivers can help their loved ones by providing them with regular social activities and arranging times for friends and other family members to come over so that they will not feel lonely and forgotten. Family caregivers can also remind their loved ones that being bedridden does not necessarily mean that they have to give up everything they used to enjoy.[10]

But since family members will not always be available at home, the above-mentioned problems are still prevalent in these patients; hence our interactive system will provide them with entertainment (music, movies) and voice responses to general questions. It therefore behaves as an electronic companion.


CHAPTER 3

SYSTEM REQUIREMENTS

The project needs both hardware and software components. The hardware components include the Raspberry Pi Model B, keyboard, mouse, earphones, microphone with sound card, Ethernet cable, HDMI screen and HDMI cable. The software components are the Raspbian OS on an SD card, a C++ compiler and the online resources Google Speech API and Wolfram Alpha. They are described in detail below.

3.1 HARDWARE COMPONENTS

1. RASPBERRY PI MODEL B

The Raspberry Pi is a series of credit card–sized single-board computers developed in the UK by the Raspberry Pi Foundation with the intention of promoting the teaching of basic computer science in schools.[5][6][7]

The system is developed around an ARM microprocessor (ARM is a registered trademark of ARM Limited). Linux now provides support for the ARM11 family of processors; it gives consumer device manufacturers a commercial-quality Linux implementation along with tools to reduce time-to-market and development costs. The Raspberry Pi is a credit card sized computer development platform based on a BCM2835 system on chip with an ARM11 processor, developed in the UK by the Raspberry Pi Foundation. The Raspberry Pi functions as a regular desktop computer when connected to a keyboard and monitor. It is very cheap and reliable enough even to build a Raspberry Pi supercomputer. The Raspberry Pi runs a Linux kernel-based operating system.

The Foundation provides Debian and Arch Linux ARM distributions for download. Tools are available for Python as the main programming language, with support for BBC BASIC (via the RISC OS image or the Brandy Basic clone for Linux), C, C++, Java, Perl and Ruby.

Fig 2.Raspberry Pi Model B


Specifications include:

SoC: Broadcom BCM2835 (CPU, GPU, DSP, SDRAM, one USB port)

CPU: 700 MHz single-core ARM1176JZF-S

GPU: Broadcom VideoCore IV @ 250 MHz; OpenGL ES 2.0 (24 GFLOPS); MPEG-2 and VC-1 (with license); 1080p30 H.264/MPEG-4 AVC high-profile decoder and encoder

Memory (SDRAM): 512 MB (shared with GPU) as of 15 October 2012

USB 2.0 ports: 2 (via the on-board 3-port USB hub)

Video outputs: HDMI (rev 1.3 & 1.4), 14 HDMI resolutions from 640×350 to 1920×1200 plus various PAL and NTSC standards; composite video (PAL and NTSC) via RCA jack

Audio outputs: analog via 3.5 mm phone jack; digital via HDMI and, as of revision 2 boards, I²S

On-board storage: SD / MMC / SDIO card slot

On-board network:[11] 10/100 Mbit/s Ethernet (8P8C) via a USB adapter on the third/fifth port of the USB hub (SMSC LAN9514-JZX)[42]

Low-level peripherals: 8× GPIO plus the following, which can also be used as GPIO: UART, I²C bus, SPI bus with two chip selects, I²S audio, +3.3 V, +5 V, ground

Power rating: 700 mA (3.5 W)

Power source: 5 V via MicroUSB or GPIO header

Size: 85.60 mm × 56.5 mm (3.370 in × 2.224 in), not including protruding connectors

Weight: 45 g (1.6 oz)

The main differences between the two flavours of Pi are the RAM, the number of USB 2.0 ports and the fact that the Model A does not have an Ethernet port (meaning a USB Wi-Fi adapter is required to access the internet). While that results in a lower price for the Model A, it means that a user will have to buy a powered USB hub in order to get it to work for many projects. The Model A is aimed more at those creating electronics projects that require programming and control directly from the command line interface. Both Pi models use the Broadcom BCM2835 CPU, which is an ARM11-based processor running at 700 MHz. There are overclocking modes built in for users to increase the speed as long as the core does not get too hot, at which point it is throttled back. Also included is the Broadcom VideoCore IV GPU with support for OpenGL ES 2.0, which can perform 24 GFLOPS and decode and play H.264 video at 1080p resolution. Originally the Model A was due to use 128 MB RAM, but this was upgraded to 256 MB, with the Model B going from 256 MB to 512 MB. The power supply to the Pi is via the 5 V microUSB socket. As the Model A has fewer powered interfaces it only requires 300 mA, compared to the 700 mA that the Model B needs. The standard way of connecting the Pi models is to use the HDMI port to connect to an HDMI socket on a TV or a DVI port on a monitor. Both HDMI-HDMI and HDMI-DVI cables work well, delivering 1080p (1920x1080) video. Sound is also sent through the HDMI connection, but if using a monitor without speakers then there is the standard 3.5 mm jack socket for audio. The RCA composite video connection was designed for use in countries where the level of technology is lower and more basic displays such as older TVs are used.

2. SOUND CARD WITH MICROPHONE

A sound card is used since the Raspberry Pi has no on-board ADC.

A sound card (also known as an audio card) is an internal computer expansion card that facilitates economical input and output of audio signals to and from a computer under control of computer programs. The term sound card is also applied to external audio interfaces that use software to generate sound, as opposed to using hardware inside the PC. Typical uses of sound cards include providing the audio component for multimedia applications such as music composition, editing video or audio, presentation, education and entertainment (games) and video projection.

Sound functionality can also be integrated onto the motherboard, using components similar to plug-in cards. The best plug-in cards, which use better and more expensive components, can achieve higher quality than integrated sound. The integrated sound system is often still referred to as a "sound card". Sound processing hardware is also present on modern video cards with HDMI to output sound along with the video using that connector; previously they used a SPDIF connection to the motherboard or sound card.


We are using a Quantum sound card and a Huawei collar mic.

A microphone, colloquially nicknamed mic or mike (/ˈmaɪk/),[1] is an acoustic-to-electric transducer or sensor that converts sound into an electrical signal. Electromagnetic transducers facilitate the conversion of acoustic signals into electrical signals.[2] Microphones are used in many applications such as telephones, hearing aids, public address systems for concert halls and public events, motion picture production, live and recorded audio engineering, two-way radios, megaphones, radio and television broadcasting, and in computers for recording voice, speech recognition, VoIP, and for non-acoustic purposes such as ultrasonic checking or knock sensors.

Most microphones today use electromagnetic induction (dynamic microphones), capacitance change (condenser microphones) or piezoelectricity (piezoelectric microphones) to produce an electrical signal from air pressure variations. Microphones typically need to be connected to a preamplifier before the signal can be amplified with an audio power amplifier and a speaker or recorded.

Fig 3.Sound Card Fig 4. Collar Mic

3. KEYBOARD, MOUSE AND HDMI SCREEN are the other peripherals.


4. 940 nm IR LEDs - one with a 20 degree viewing angle and one with a 40 degree viewing angle, both bright and tuned to the 940 nm wavelength.

Fig 5. IR Transmitter LED

5. 38 kHz IR RECEIVER - receives IR signals at remote control frequencies.

It is a photo detector and preamplifier in one package, with high photo sensitivity, improved inner shielding against electrical field disturbance, low power consumption, a suitable burst length of ≥ 10 cycles/burst, TTL and CMOS compatibility, improved immunity against ambient light, and an internal filter for the PCM frequency. Bi-CMOS manufactured IC; ESD HBM > 4000 V; MM > 250 V.

It is a miniaturized receiver for infrared remote control systems, with a high speed PIN phototransistor and a full wave band preamplifier. Some of its applications are: infrared applied systems, the light detecting portion of remote controls, AV instruments such as audio, TV, VCR, CD, MD, etc., CATV set top boxes, other equipment with wireless remote control, home appliances such as air-conditioners, fans, etc., and multimedia equipment.

Fig 6. IR Receiver


6. PN2222 TRANSISTOR - The transistor is used here to help drive the IR LED. It is a general purpose amplifier, model PN2222, with a standard EBC pinout. It can switch up to 40 V at peak currents of 1 A, with a DC gain of about 100.

A similar transistor with the same current rating, the KSP2222, can also be used.

Fig 8.PN2222 Pinout

7. 10k Ohm RESISTOR - the resistor that goes between the RPi GPIO pin and the base of the PN2222 transistor.

8. BREADBOARD - used to assemble the above components.
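The IR LED, transistor and 10k resistor together form a simple transmitter driven from one of the Pi's GPIO pins. A minimal Python sketch of this wiring is given below; the BCM pin number is only an assumption for illustration, and in the actual system the 38 kHz modulated signal is generated by LIRC rather than by this slow blink test.

import time
import RPi.GPIO as GPIO

IR_PIN = 17  # assumed BCM pin wired through the 10k resistor to the transistor base

GPIO.setmode(GPIO.BCM)
GPIO.setup(IR_PIN, GPIO.OUT)

try:
    for _ in range(10):
        GPIO.output(IR_PIN, GPIO.HIGH)   # transistor conducts, IR LED on
        time.sleep(0.5)
        GPIO.output(IR_PIN, GPIO.LOW)    # transistor off, IR LED off
        time.sleep(0.5)
finally:
    GPIO.cleanup()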


3.2 SOFTWARE REQUIRED

1. RASPBIAN OS

Although the Raspberry Pi’s operating system is closer to the Mac than to Windows, it is the latter that the desktop most closely resembles.

It might seem a little alien at first glance, but using Raspbian is hardly any different from using Windows (barring Windows 8, of course). There is a menu bar, a web browser, a file manager and no shortage of desktop shortcuts to pre-installed applications.

Raspbian is an unofficial port of Debian Wheezy armhf with compilation settings adjusted to produce optimized "hard float" code that will run on the Raspberry Pi. This provides significantly faster performance for applications that make heavy use of floating point arithmetic operations. All other applications will also gain some performance through the use of advanced instructions of the ARMv6 CPU in Raspberry Pi.

Although Raspbian is primarily the efforts of Mike Thompson (mpthompson) and Peter Green (plugwash), it has also benefited greatly from the enthusiastic support of Raspberry Pi community members who wish to get the maximum performance from their device.

Fig 9.The Raspbian Desktop


2.JASPER CLIENT

Jasper is an open source platform for developing always-on, voice-controlled applications. You can use your voice to ask for information, update social networks, control your home, and more. Jasper is always on, always listening for commands, and you can speak from meters away. You can build it yourself with off-the-shelf hardware and use the documentation to write your own modules.

Fig 10.Jasper client
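To give an idea of how Jasper modules are written, the sketch below follows Jasper's documented module convention (a WORDS list plus isValid and handle functions); the greeting itself is only an illustrative example and not part of Jasper.

# -*- coding: utf-8 -*-
import re

WORDS = ["HELLO"]

def isValid(text):
    # Return True if this module should handle the recognised text.
    return bool(re.search(r'\bhello\b', text, re.IGNORECASE))

def handle(text, mic, profile):
    # Respond to the user through Jasper's text-to-speech output.
    name = profile.get('first_name', 'there')
    mic.say("Hello %s, how can I help you?" % name)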

3. CMU Sphinx

CMUSphinx (http://cmusphinx.sourceforge.net) collects over 20 years of CMU research. All advantages are hard to list, but just to name a few:

State of the art speech recognition algorithms for efficient speech recognition; CMUSphinx tools are designed specifically for low-resource platforms

Flexible design

Focus on practical application development and not on research

Support for several languages like US English, UK English, French, Mandarin, German, Dutch and Russian, and the ability to build models for others

BSD-like license which allows commercial distribution

Commercial support

Active development and release schedule

Active community (more than 400 users in the LinkedIn CMUSphinx group)

Wide range of tools for many speech-recognition related purposes (keyword spotting, alignment, pronunciation evaluation)

CMU Sphinx, also called Sphinx in short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2 - 4) and an acoustic model trainer (SphinxTrain).

In 2000, the Sphinx group at Carnegie Mellon committed to open source several speech recognizer components, including Sphinx 2 and later Sphinx 3 (in 2001). The speech decoders come with acoustic models and sample applications. The available resources include, in addition, software for acoustic model training, language model compilation and a public-domain pronunciation dictionary, cmudict.

Here, we use the PocketSphinx tool, a version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation.
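As a rough illustration of how PocketSphinx is used from Python, the sketch below decodes a 16 kHz mono WAV file with the pocketsphinx Python bindings; the model paths and the file name are placeholders that depend on the installation.

import wave
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')        # acoustic model
config.set_string('-lm', '/usr/local/share/pocketsphinx/model/en-us/en-us.lm.bin')  # language model
config.set_string('-dict', '/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict')

decoder = Decoder(config)
wav = wave.open('command.wav', 'rb')   # hypothetical 16 kHz, 16-bit mono recording

decoder.start_utt()
while True:
    buf = wav.readframes(1024)
    if not buf:
        break
    decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)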


4. WinSCP

WinSCP (Windows Secure Copy) is a free and open-source SFTP, FTP, WebDAV and SCP client for Microsoft Windows. Its main function is secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager and file synchronization functionality. For secure transfers, it uses Secure Shell (SSH) and supports the SCP protocol in addition to SFTP.[3]

Development of WinSCP started around March 2000 and continues. Originally it was hosted by the University of Economics in Prague, where its author worked at the time. Since July 16, 2003, it has been licensed under the GNU GPL and hosted on SourceForge.net.[4]

WinSCP is based on the implementation of the SSH protocol from PuTTY and FTP protocol from FileZilla.[5] It is also available as a plugin for Altap Salamander file manager,[6] and there exists a third-party plugin for the FAR file manager.[7]

5.PUTTY

PuTTY is a free and open-source terminal emulator, serial console and network file transfer application. It supports several network protocols, including SCP, SSH, Telnet, rlogin, and raw socket connection. It can also connect to a serial port (since version 0.59). The name "PuTTY" has no definitive meaning.[3]


PuTTY was originally written for Microsoft Windows, but it has been ported to various other operating systems. Official ports are available for some Unix-like platforms, with work-in-progress ports to Classic Mac OS and Mac OS X, and unofficial ports have been contributed to platforms such as Symbian[4][5] and Windows Phone.

PuTTY was written and is maintained primarily by Simon Tatham and is currently beta software.

6. LIRC:

LIRC (Linux Infrared remote control) is an open source package that allows users to receive and send infrared signals with a Linux-based computer system. There is a Microsoft Windows equivalent of LIRC called WinLIRC. With LIRC and an IR receiver the user can control their computer with almost any infrared remote control (e.g. a TV remote control). The user may for instance control DVD or music playback with their remote control. One GUI frontend is KDELirc, built on the KDE libraries.
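Once LIRC is configured with a remote definition, a recognized voice command can be turned into an infrared code using LIRC's standard irsend utility. The sketch below assumes a remote configuration named "tv"; the remote and key names are placeholders.

import subprocess

def send_ir(key, remote="tv"):
    # Send a single IR key press through LIRC's irsend tool.
    subprocess.call(["irsend", "SEND_ONCE", remote, key])

# Example: switch the television on or off when the user says "power".
send_ir("KEY_POWER")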

7.Python 2.7

Python is a widely used high-level, general-purpose, interpreted, dynamic programming language.[3][4] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.[5][6] The language provides constructs intended to enable clear programs on both a small and large scale.[7]

Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles. It features a dynamic type system and automatic memory management and has a large and comprehensive standard library.[8]


Python interpreters are available for installation on many operating systems, allowing Python code execution on a wide variety of systems. Using third-party tools, such as Py2exe or Pyinstaller,[29] Python code can be packaged into stand-alone executable programs for some of the most popular operating systems, allowing the distribution of Python-based software for use on those environments without requiring the installation of a Python interpreter.

CPython, the reference implementation of Python, is free and open-source software and has a community-based development model, as do nearly all of its alternative implementations. CPython is managed by the non-profit Python Software Foundation.

Why Python 2.7?

If you can do exactly what you want with Python 3.x, great! There are a few minor downsides, such as slightly worse library support and the fact that most current Linux distributions and Macs still use 2.x as the default, but as a language Python 3.x is definitely ready. As long as Python 3.x is installed on your users' computers (which ought to be easy, since many people reading this may only be developing something for themselves or an environment they control) and you are writing things where you know none of the Python 2.x modules are needed, it is an excellent choice. Also, most Linux distributions have Python 3.x already installed, and all have it available for end users. Some are phasing out Python 2 as the preinstalled default.
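A small sketch of how the project code can be kept compatible with both Python 2.7 (used here) and Python 3.x:

from __future__ import print_function, division

def greet(name):
    print("Hello,", name)   # print() behaves the same under 2.7 and 3.x

greet("world")
print(7 / 2)   # true division (3.5) in both versions because of the import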


CHAPTER 4

IMPLEMENTATION

Both acoustic modeling and language modeling are important parts of modern statistically- based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in many other natural language processing applications such as document classification or statistical machine translation.

4.1 ALGORITHMS

HMM

Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. In a short time-scale (e.g., 10 milliseconds), speech can be approximated as a stationary process. Speech can be thought of as a Markov model for many stochastic purposes.

Another reason why HMMs are popular is because they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems), each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
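The sketch below illustrates the cepstral computation described above for a single 25 ms frame: window, short-time Fourier transform, log magnitude spectrum, cosine transform, and keeping the first coefficients. It is only an illustration of the idea, not the exact front end of any particular toolkit.

import numpy as np
from scipy.fftpack import dct

def frame_cepstrum(frame, n_coeffs=13):
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spectrum = np.log(spectrum + 1e-10)      # avoid log(0)
    cepstrum = dct(log_spectrum, type=2, norm='ortho')
    return cepstrum[:n_coeffs]                   # keep the most significant coefficients

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000.0)  # 25 ms of a 440 Hz tone at 16 kHz
print(frame_cepstrum(frame))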


Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male- female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied co variance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
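To make the decoding step concrete, the following toy example runs the Viterbi algorithm over a two-state HMM with made-up probabilities; a real recognizer performs the same kind of search over HMM states built from the acoustic and language models.

import numpy as np

states = ['s1', 's2']
start = np.array([0.6, 0.4])            # initial state probabilities
trans = np.array([[0.7, 0.3],           # transition probabilities
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],       # P(observation | state)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                         # observed symbol indices

def viterbi(obs, start, trans, emit):
    n_states = len(start)
    delta = np.zeros((len(obs), n_states))
    psi = np.zeros((len(obs), n_states), dtype=int)
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, len(obs)):
        for j in range(n_states):
            scores = delta[t - 1] * trans[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * emit[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, int(psi[t, path[0]]))
    return path

print([states[i] for i in viterbi(obs, start, trans, emit)])   # most likely state sequence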

A possible improvement to decoding is to keep a set of good candidates instead of just keeping the best candidate, and to use a better scoring function (re scoring) to rate these good candidates so that we may pick the best one according to this refined score. The set of candidates can be kept either as a list (the N-best list approach) or as a subset of the models (a lattice). Re scoring is usually done by trying to minimize the Bayes risk[7] (or an approximation thereof): Instead of taking the source sentence with maximal probability, we try to take the sentence that minimizes the expectancy of a given loss function with regards to all possible transcriptions (i.e., we take the sentence that minimizes the average distance to other possible sentences weighted by their estimated probability).


The loss function is usually the Levenshtein distance, though it can be different distances for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been devised to re score lattices represented as weighted finite state transducers with edit distances represented themselves as a finite state transducer verifying certain assumptions.[8]

DEEP NEURAL NETWORK

A deep neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers.[6] Similar to shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning capacity and thus the potential of modeling complex patterns of speech data.[6] The DNN has been the most popular type of deep learning architecture successfully used as an acoustic model for speech recognition since 2010.
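The toy sketch below shows the structure such a DNN acoustic model takes: several hidden layers mapping an acoustic feature vector to a posterior over HMM states. The layer sizes are arbitrary and the weights are random, purely to show the shape of the computation; a real model is trained on large amounts of speech data.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.RandomState(0)
layer_sizes = [39, 256, 256, 256, 42]   # e.g. 39-dim features mapped to 42 HMM states
weights = [rng.randn(m, n) * 0.01 for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(features):
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h.dot(W) + b)                          # hidden layers
    return softmax(h.dot(weights[-1]) + biases[-1])     # posterior over HMM states

print(forward(rng.randn(39)).shape)   # (42,)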

The success of DNNs in large vocabulary speech recognition came in 2010, when industrial researchers, in collaboration with academic researchers, adopted large DNN output layers based on context dependent HMM states constructed by decision trees.[7][8][9]

One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features. This principle was first explored successfully in the architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features,[2] showing its superiority over the Mel-Cepstral features which contain a few stages of fixed transformation from spectrograms. The true "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.[3]

Since the initial successful debut of DNNs for speech recognition around 2009-2011, substantial new progress has been made. This progress (as well as future directions) has been summarized into the following eight major areas:[8]


Scaling up/out and speedup DNN training and decoding;

Sequence discriminative training of DNNs;

Feature processing by deep models with solid understanding of the underlying mechanisms;

Adaptation of DNNs and of related deep models;

Multi-task and transfer learning by DNNs and related deep models;

Convolutional neural networks and how to design them to best exploit domain knowledge of speech;

Recurrent neural networks and their rich LSTM variants;

Other types of deep models including tensor-based models and integrated deep generative/discriminative models.

Large-scale automatic speech recognition is the first and most convincing successful case of deep learning in recent history, embraced by both industry and academia across the board. Between 2010 and 2014, the two major conferences on signal processing and speech recognition, IEEE-ICASSP and Interspeech, saw near exponential growth in the number of accepted papers on the topic of deep learning for speech recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products) are nowadays based on deep learning methods.[5]


4.2 STEPS TO SET UP THE RASPBERRY PI

1.1. Connecting Everything Together

1. Plug the preloaded SD Card into the RPi.

2. Plug the USB keyboard and mouse into the RPi, perhaps via a USB hub. Connect the Hub to power, if necessary.

3. Plug a video cable into the screen (TV or monitor) and into the RPi.

4. Plug your extras into the RPi (USB WiFi, Ethernet cable, external hard drive etc.). This is where you may really need a USB hub.

5. Ensure that your USB hub (if any) and screen are working.

6. Plug the power supply into the mains socket.

7. With your screen on, plug the power supply into the RPi microUSB socket.

8. The RPi should boot up and display messages on the screen.

It is always recommended to connect the MicroUSB power to the unit last (while most connections can be made live, it is best practice to connect items such as displays with the power turned off).

1.2. Operating System SD Card

As the RPi has no internal mass storage or built-in operating system it requires an SD card preloaded with a version of the Linux Operating System.

• You can create your own preloaded card using any suitable SD card (4 GB or above) you have to hand. We suggest you use a new blank card to avoid arguments over lost pictures.

• Preloaded SD cards will be available from the RPi Shop.

1.3. Keyboard & Mouse

Most standard USB keyboards and mice will work with the RPi. Wireless keyboard/mice should also function, and only require a single USB port for an RF dongle. In order to use a Bluetooth keyboard or mouse you will need a Bluetooth USB dongle, which again uses a single port.

Remember that the Model A has a single USB port and the Model B has two (typically a keyboard and mouse will use a USB port each).


1.4. Display

There are two main connection options for the RPi display, HDMI (High Definition) and Composite (Standard Definition).

• HD TVs and many LCD monitors can be connected using a full-size 'male' HDMI cable, and with an inexpensive adaptor if DVI is used. HDMI versions 1.3 and 1.4 are supported and a version 1.4 cable is recommended. The RPi outputs audio and video via HDMI, but does not support HDMI input.

• Older TVs can be connected using Composite video (a yellow-to-yellow RCA cable) or via SCART (using a Composite video to SCART adaptor). Both PAL and NTSC format TVs are supported.

When using a composite video connection, audio is available from the 3.5mm jack socket, and can be sent to your TV, headphones or an amplifier. To send audio to your TV, you will need a cable which adapts from 3.5mm to double (red and white) RCA connectors.

Note: There is no analogue VGA output available. This is the connection required by many computer monitors, apart from the latest ones. If you have a monitor with only a D-shaped plug containing 15 pins, then it is unsuitable.

1.5. Power Supply

The unit is powered via the microUSB connector (only the power pins are connected, so it will not transfer data over this connection). A standard modern phone charger with a microUSB connector will do, providing it can supply at least 700mA at +5Vdc. Check your power supply's ratings carefully. Suitable mains adaptors will be available from the RPi Shop and are recommended if you are unsure what to use.

Note: The individual USB ports on a powered hub or a PC are usually rated to provide 500mA maximum. If you wish to use either of these as a power source then you will need a special cable which plugs into two ports providing a combined current capability of 1000mA.

1.6. Cables

You will need one or more cables to connect up your RPi system.

• Video cable alternatives:
  o HDMI-A cable
  o HDMI-A cable + DVI adapter
  o Composite video cable
  o Composite video cable + SCART adaptor
• Audio cable (not needed if you use the HDMI video connection to a TV)
• Ethernet/LAN cable (Model B only)


1.7. Preparing your SD card for the Raspberry Pi

In order to use your Raspberry Pi, you will need to install an Operating System (OS) onto an SD card. An Operating System is the set of basic programs and utilities that allow your computer to run; examples include Windows on a PC or OSX on a Mac.

These instructions will guide you through installing a recovery program on your SD card that will allow you to easily install different OSes and to recover your card if you break it.

1. Insert an SD card that is 4 GB or greater in size into your computer

2. Format the SD card so that the Pi can read it.
a. Windows
i. Download the SD Association's Formatting Tool from https://www.sdcard.org/downloads/formatter_4/eula_windows/
ii. Install and run the Formatting Tool on your machine
iii. Set the "FORMAT SIZE ADJUSTMENT" option to "ON" in the "Options" menu
iv. Check that the SD card you inserted matches the one selected by the Tool
v. Click the "Format" button
b. Mac
i. Download the SD Association's Formatting Tool from https://www.sdcard.org/downloads/formatter_4/eula_mac/
ii. Install and run the Formatting Tool on your machine
iii. Select "Overwrite Format"
iv. Check that the SD card you inserted matches the one selected by the Tool
v. Click the "Format" button
c. Linux
i. We recommend using gparted (or the command line version parted)
ii. Format the entire disk as FAT

3. Download the New Out Of Box Software (NOOBS) from: downloads.raspberrypi.org/noobs

4. Unzip the downloaded file
a. Windows: Right click on the file and choose "Extract all"
b. Mac: Double click on the file
c. Linux: Run unzip [downloaded filename]

5. Copy the extracted files onto the SD card that you just formatted

6. Insert the SD card into your Pi and connect the power supply

7.You can also alternatively download the raspbian image from https://raspberrypi.org

Your Pi will now boot into NOOBS and should display a list of operating systems that you can choose to install. If your display remains blank, you should select the correct output mode for your display by pressing one of the following number keys on your keyboard:

1. HDMI mode - this is the default display mode.

2. HDMI safe mode - select this mode if you are using the HDMI connector and cannot see anything on screen when the Pi has booted.

3. Composite PAL mode - select either this mode or composite NTSC mode if you are using the composite RCA video connector.

4. Composite NTSC mode


4.3.DOWNLOADING OTHER SOFTWARE

1.CMU SPHINX

The CMU Sphinx toolkit has a number of packages for different tasks and applications. It is sometimes confusing what to choose. To clear this up, here is the list:

Pocketsphinx — recognizer library written in C.

Sphinxtrain — acoustic model training tools

Sphinxbase — support library required by Pocketsphinx and Sphinxtrain

Sphinx4 — adjustable, modifiable recognizer written in Java

We have chosen pocketsphinx.

To build pocketsphinx in a unix-like environment (such as Linux, Solaris, FreeBSD, etc.) you need to make sure you have the following dependencies installed: gcc, automake, autoconf, libtool, bison, swig (at least version 2.0), the python development package and the pulseaudio development package. If you want to build without dependencies you can use the appropriate configure options, such as --without-swig-python, but for beginners it is recommended to install all dependencies.

You need to download both the sphinxbase and pocketsphinx packages and unpack them. Please note that you cannot use sphinxbase and pocketsphinx of different versions; make sure the versions are in sync. After unpacking you should see the following two main folders:

sphinxbase-X.X
pocketsphinx-X.X

On step one, build and install SphinxBase. Change current directory to sphinxbase folder. If you downloaded directly from the repository, you need to do this at least once to generate the configure file:

DEPARTMENT OF ECE, NMIT PROJECT REPORT 2015-2016

% ./autogen.sh

If you downloaded the release version, or ran autogen.sh at least once, then compile and install:

% ./configure

% make

% make install

The last step might require root permissions, so it might be sudo make install. If you want to use fixed-point arithmetic, you must configure SphinxBase with the --enable-fixed option. You can also set the installation prefix with --prefix. You can also configure with or without SWIG python support.

Sphinxbase will be installed in the /usr/local/ folder by default. Not every system loads libraries from this folder automatically. To load them you need to configure the path to look for shared libraries. This can be done either in the file /etc/ld.so.conf or by exporting environment variables:

export LD_LIBRARY_PATH=/usr/local/lib

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

BUILDING LANGUAGE MODEL

There are several types of models that describe the language to recognize: keyword lists, grammars, statistical language models and phonetic statistical language models. You can choose any decoding mode according to your needs, and you can even switch between modes at runtime.

Keyword lists

Pocketsphinx supports a keyword spotting mode where you can specify the keyword list to look for. The advantage of this mode is that you can specify a threshold for each keyword so that keywords can be detected in continuous speech. All other modes will try to detect the words from the grammar even if you used words which are not in the grammar. The keyword list looks like this:

oh mighty computer /1e-40/
hello world /1e-30/
other phrase /1e-20/

A threshold must be specified for every keyphrase. For shorter keyphrases you can use smaller thresholds like 1e-1; for longer keyphrases the threshold must be bigger. The threshold must be tuned to balance between false alarms and missed detections; the best way to tune the threshold is to use a prerecorded audio file. The tuning process is the following:

Take a long recording with a few occurrences of your keywords and some other sounds. You can take a movie soundtrack or something else. The length of the audio should be approximately one hour.

Run keyword spotting on that file with different thresholds for every keyword, using the following command:

pocketsphinx_continuous -infile <your file> -keyphrase <"your keyphrase"> -kws_threshold <your threshold> -time yes

From the keyword spotting results, count how many false alarms and missed detections you have encountered.

Select the threshold with the smallest number of false alarms and missed detections.

For the best accuracy it is better to have a keyphrase with 3-4 syllables. Too short phrases are easily confused.

Keyword lists are supported by pocketsphinx only, not by sphinx4.
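A rough sketch of keyword spotting from Python with PocketSphinx is shown below, using a keyword list file of the form given above; the model paths and file names are placeholders.

from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')
config.set_string('-dict', '/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict')
config.set_string('-kws', 'keyphrase.list')   # one keyphrase per line with its /threshold/

decoder = Decoder(config)
decoder.start_utt()
with open('command.raw', 'rb') as f:          # raw 16 kHz, 16-bit mono audio
    decoder.process_raw(f.read(), False, True)
decoder.end_utt()

if decoder.hyp() is not None:
    print("Detected keyphrase: " + decoder.hyp().hypstr)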

Grammars

Grammars describe a very simple type of language for command and control; they are usually written by hand or generated automatically within the code. Grammars usually do not have probabilities for word sequences, but some elements might be weighted. Grammars can be created in JSGF format and usually have an extension like .gram or .jsgf.

Grammars allow you to specify possible inputs very precisely, for example, that a certain word might be repeated only two or three times. However, this strictness might be harmful if your user accidentally skips words which the grammar requires. In that case the whole recognition will fail. For that reason it is better to make grammars more relaxed: instead of listing phrases, just list the bag of words, allowing arbitrary order. Avoid very complex grammars with many rules and cases; they just slow down the recognizer, and you can use simple rules instead. In the past, grammars required a lot of effort to tune, to assign variants properly and so on; the big VXML consulting industry was about that.


Language models

Statistical language models describe more complex language. They contain probabilities of words and word combinations. Those probabilities are estimated from sample data and automatically have some flexibility. For example, every combination from the vocabulary is possible, though the probability of each combination might vary. For example, if you create a statistical language model from a list of words it will still allow you to decode word combinations, though that might not be your intent. Overall, statistical language models are recommended for free-form input where the user could say anything in natural language, and they require far less engineering effort than grammars: you just list the possible sentences. For example, you might list numbers like "twenty one" and "thirty three", and the statistical language model will allow "thirty one" with a certain probability as well.

Overall, modern speech recognition interfaces tend to be more natural and avoid the command-and-control style of the previous generation. For that reason most interface designers prefer natural language recognition with a statistical language model over old-fashioned VXML grammars.

On the topic of design of the VUI interfaces you might be interested in the following book: It's Better to Be a Good Machine Than a Bad Person: Speech Recognition and Other Exotic User Interfaces at the Twilight of the Jetsonian Age by Bruce Balentine

There are many ways to build statistical language models. When your data set is large, it makes sense to use the CMU language modeling toolkit. When a model is small, you can use a quick online web service. When you need specific options or you just want to use your favorite toolkit which builds ARPA models, you can use that.

A language model can be stored and loaded in three different formats: text ARPA format, binary BIN format and binary DMP format. The ARPA format takes more space but it is possible to edit it; ARPA files have the .lm extension. The binary format takes significantly less space and is faster to load; binary files have the .lm.bin extension. It is also possible to convert between formats. The DMP format is obsolete and not recommended.

Building a grammar

Grammars are usually written manually in JSGF format:


#JSGF V1.0;

/**
 * JSGF Grammar for Hello World example
 */

grammar hello;

public <greet> = (good morning | hello) ( bhiksha | evandro | paul | philip | rita | will );
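Such a grammar can be passed to PocketSphinx in place of a statistical language model. The sketch below only shows the configuration step; the grammar file name and model paths are placeholders, and decoding then proceeds exactly as in the earlier PocketSphinx example.

from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')
config.set_string('-dict', '/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict')
config.set_string('-jsgf', 'hello.gram')      # the JSGF grammar shown above

decoder = Decoder(config)   # ready to decode command-and-control phrases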

Building a Statistical Language Model

Text preparation

First of all you need to prepare a large collection of clean texts. Expand abbreviations, convert numbers to words, and clean non-word items. For example, to clean a Wikipedia XML dump you can use special python scripts like https://github.com/attardi/wikiextractor. To clean HTML pages you can try http://code.google.com/p/boilerpipe, a nice package specifically created to extract text from HTML.

For an example of how to create a language model from Wikipedia texts, please see

http://trulymadlywordly.blogspot.ru/2011/03/creating-text-corpus-from-wikipedia.html

Once you have gone through the language model process, please submit your language model to the CMUSphinx project, which would be glad to share it!

Movie subtitles are a good source of spoken language.

Language modeling for many languages like Mandarin is largely the same as in English, with one additional consideration, which is that the input text must be word segmented. A segmentation tool and an associated word list are provided to accomplish this.

Using other Language Model Toolkits

There are many toolkits that create ARPA n-gram language model from text files.


Some toolkits you can try:

IRSTLM

MITLM

SRILM

If you are training a large vocabulary speech recognition system, the language model training is outlined on a separate page, Building a large scale language model for domain-specific transcription.

Once you have created the ARPA file you can convert the model to binary format if needed.

ARPA model training with SRILM

Training with SRILM is easy, which is why we recommend it. Moreover, SRILM is the most advanced toolkit to date. To train the model you can use the following command:

ngram-count -kndiscount -interpolate -text train-text.txt -lm your.lm

You can prune the model afterwards to reduce its size:

ngram -lm your.lm -prune 1e-8 -write-lm your-pruned.lm

After training, it is worth testing the perplexity of the model on held-out test data:

ngram -lm your.lm -ppl test-text.txt

ARPA model training with CMUCLMTK

You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.

The process for creating a language model is as follows:

1) Prepare a reference text that will be used to generate the language model. The language model toolkit expects its input to be in the form of normalized text files, with utterances delimited by <s> and </s> tags. A number of input filters are available for specific corpora such as Switchboard, ISL and NIST meetings, and HUB5 transcripts. The result should be the set of sentences that are bounded by the start and end sentence markers, <s> and </s>. Here's an example:

<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>

<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>

<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>

<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>

More data will generate better language models. The weather.txt file from sphinx4 (used to generate the weather language model) contains nearly 100,000 sentences.

2) Generate the vocabulary file. This is a list of all the words in the file:

text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab

3) You may want to edit the vocabulary file to remove words (numbers, misspellings, names). If you find misspellings, it is a good idea to fix them in the input transcript.

4) If you want a closed vocabulary language model (a language model that has no provisions for unknown words), you should remove from your input transcript any sentences that contain words not in your vocabulary file (a short filtering sketch is given after these steps).

5) Generate the arpa format language model with the commands:

% text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt

% idngram2lm -vocab_type 0 -idngram weather.idngram -vocab \

weather.vocab -arpa weather.lm

6) Generate the CMU binary form (BIN)

sphinx_lm_convert -i weather.lm -o weather.lm.bin


The CMUCLMTK tools and commands are documented on The CMU-Cambridge Language Modeling Toolkit page.
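For step 4 above, a rough Python sketch such as the following can be used to drop sentences containing out-of-vocabulary words before building a closed-vocabulary model; the file names match the weather example, and the simple '#' comment check for the vocabulary file is an assumption.

# Keep only sentences whose words all appear in the vocabulary file.
with open('weather.vocab') as f:
    vocab = {w.strip() for w in f if w.strip() and not w.startswith('#')}

with open('weather.txt') as src, open('weather.closed.txt', 'w') as dst:
    for line in src:
        words = line.replace('<s>', '').replace('</s>', '').split()
        if all(w in vocab for w in words):
            dst.write(line)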

Building a simple language model using web service

If your language is English and the text is small, it is sometimes more convenient to use a web service to build the model. Language models built in this way are quite functional for simple command-and-control tasks. First of all you need to create a corpus.

The “corpus” is just a list of sentences that you will use to train the language model. As an example, we will use a hypothetical voice control task for a mobile Internet device. We'd like to tell it things like “open browser”, “new e-mail”, “forward”, “backward”, “next window”, “last window”, “open music player”, and so forth. So, we'll start by creating a file called corpus.txt:

open browser
new e-mail
forward
backward
next window
last window
open music player

Then go to the page http://www.speech.cs.cmu.edu/tools/lmtool-new.html. Simply click on the “Browse…” button, select the corpus.txt file you created, then click “COMPILE KNOWLEDGE BASE”.

The legacy version is still available online also here: http://www.speech.cs.cmu.edu/tools/lmtool.html

You should see a page with some status messages, followed by a page entitled “Sphinx knowledge base”. This page will contain links entitled “Dictionary” and “Language Model”. Download these files and make a note of their names (they should consist of a 4-digit number followed by the extensions .dic and .lm). You can now test your newly created language model with PocketSphinx.

Converting model into binary format


To load large models quickly, you will probably want to convert them to binary format, which saves decoder initialization time. That is not necessary with small models. Pocketsphinx and sphinx3 can handle both formats with the -lm option. Sphinx4 automatically detects the format from the extension of the LM file.

The ARPA format and the binary format are mutually convertible. You can produce one from the other with the sphinx_lm_convert command from sphinxbase:

sphinx_lm_convert -i model.lm -o model.lm.bin

sphinx_lm_convert -i model.lm.bin -ifmt bin -o model.lm -ofmt arpa

You can also convert old DMP models to the bin format this way.

Using your language model

This section will show you how to use, test, and improve the language model you created.

Using your language model with PocketSphinx

If you have installed PocketSphinx, you will have a program called pocketsphinx_continuous which can be run from the command-line to recognize speech. Assuming it is installed under /usr/local, and your language model and dictionary are called 8521.dic and 8521.lm and placed in the current folder, try running the following command:

pocketsphinx_continuous -inmic yes -lm 8521.lm -dict 8521.dic

This will use your new language model together with the dictionary and the default acoustic model. On Windows you also have to specify the acoustic model folder with the -hmm option:

bin/Release/pocketsphinx_continuous.exe -inmic yes -lm 8521.lm -dict 8521.dic -hmm model/en-us/en-us

You will see a lot of diagnostic messages, followed by a pause, then “READY…”. Now you can try speaking some of the commands. It should be able to recognize them with complete accuracy. If not, you may have problems with your microphone or sound card.
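The same model can also be used from a Python script with the PocketSphinx bindings; the sketch below is a minimal example assuming the pocketsphinx Python package and the default en-us acoustic model are installed and that 8521.lm and 8521.dic are in the current folder.

from pocketsphinx import LiveSpeech

# Listen on the default microphone and print every recognized phrase.
speech = LiveSpeech(lm='8521.lm', dic='8521.dic')
for phrase in speech:
    print(phrase)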


Using your language model with Sphinx4

In the Sphinx4 high-level API you need to specify the location of the language model in the Configuration:

configuration.setLanguageModelPath("file:8754.lm");

If the model is in the resources you can reference it with a resource: URL:

configuration.setLanguageModelPath("resource:/com/example/8754.lm");

GENERATING THE DICTIONARY

There are various tools to help you extend an existing dictionary to cover new words, or to build a new dictionary from scratch. If your language already has a dictionary, it is recommended to use it, since it has been carefully tuned for best performance. If you are starting a new language, you need to account for various reduction and coarticulation effects, which make it very hard to create accurate rules for converting text to sounds. However, practice shows that even naive conversion can produce good results for speech recognition. For example, many developers have successfully created ASR systems with a simple grapheme-based dictionary, where each letter is just mapped to itself rather than to the corresponding phone.
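As a toy illustration of the naive grapheme-based approach mentioned above, the sketch below writes a dictionary in which every letter is treated as its own “phone”; the word list and the naive.dic file name are placeholders, and a proper g2p tool (described next) should be preferred.

words = ["open", "browser", "music", "player"]   # placeholder vocabulary

with open('naive.dic', 'w') as f:
    for word in sorted(set(words)):
        phones = ' '.join(word.upper())          # each letter maps to itself
        f.write(word + '\t' + phones + '\n')

# e.g. "open" becomes:  open    O P E N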

For most languages you need to use specialized grapheme-to-phoneme (g2p) code that does the conversion using machine learning methods and a small existing database. Nowadays the most accurate g2p tools are Phonetisaurus:

http://code.google.com/p/phonetisaurus

And sequitur-g2p:

http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html


Also note that almost every TTS package has g2p code included. For example, you can use the g2p code from FreeTTS, written in Java:

http://cmusphinx.sourceforge.net/projects/freetts

See FreeTTS example in Sphinx4 here

OpenMary Java TTS:

http://mary.dfki.de/

or espeak for C:

http://espeak.sourceforge.net

Please note that if you use TTS you often need to do a phoneset conversion. TTS phonesets are usually more extensive than required for ASR. However, there is a great advantage in TTS tools because they usually contain more of the required functionality than a simple g2p converter. For example, they perform tokenization by converting numbers and abbreviations to their spoken form.

For English you can use simpler capabilities by using the online web service:

http://www.speech.cs.cmu.edu/tools/lmtool.html

The online LM Tool produces a dictionary which matches its language model. It uses the latest CMU dictionary as a base, and is programmed to guess at pronunciations of words not in the existing dictionary. You can look at the log file to find which words were guessed, and make your own corrections if necessary. With the advanced option, LM Tool can use a hand-made dictionary that you specify for your specialized vocabulary, or for your own pronunciations as corrections. The hand dictionary must be in the same format as the main dictionary.

If you want to run lmtool offline you can check it out from Subversion:

http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/logios

2. TEXT TO SPEECH

eSpeak is a compact open-source speech synthesizer for many platforms. Speech synthesis is done offline, but most voices can sound very “robotic”.

Festival uses the Festival Speech Synthesis System, an open-source speech synthesizer developed by the Centre for Speech Technology Research at the University of Edinburgh. Like eSpeak, it also synthesizes speech offline.

The initial voice was eSpeak; it was later changed to Festival:

sudo apt-get update

sudo apt-get install festival festvox-don
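Once Festival is installed, it can be driven from the project's Python code by piping text into its --tts mode; the short sketch below is one possible way to do this (the speak() helper is ours, not part of Festival).

import subprocess

def speak(text):
    # Pipe the text into Festival's text-to-speech mode.
    subprocess.run(['festival', '--tts'], input=text.encode(), check=True)

speak("Welcome. Please say a command after the beep.")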


4.4 SETTING UP LIRC

First, we’ll need to install and configure LIRC to run on the Raspberry Pi:

sudo apt-get install lirc

Second, you have to modify two files before you can start testing the receiver and IR LED. Add this to your /etc/modules file:

lirc_dev

lirc_rpi gpio_in_pin=23 gpio_out_pin=22

Change your /etc/lirc/hardware.conf file to:

########################################################

# /etc/lirc/hardware.conf

#

# Arguments which will be used when launching lircd
LIRCD_ARGS="--uinput"

# Don't start lircmd even if there seems to be a good config file

# START_LIRCMD=false

# Don't start irexec, even if a good config file seems to exist.

# START_IREXEC=false

# Try to load appropriate kernel modules

LOAD_MODULES=true


# Run "lircd --driver=help" for a list of supported drivers. DRIVER="default" # usually /dev/lirc0 is the correct setting for systems using udev DEVICE="/dev/lirc0" MODULES="lirc_rpi"

# Default configuration files for your hardware if any

LIRCD_CONF=""

LIRCMD_CONF=""

########################################################

Now restart lircd so it picks up these changes:

sudo /etc/init.d/lirc stop
sudo /etc/init.d/lirc start

Edit your /boot/config.txt file and add:

dtoverlay=lirc-rpi,gpio_in_pin=23,gpio_out_pin=22


Now, connect the circuit as shown in the schematic (Fig 11).

Fig 11.Schematic

Testing the IR receiver

Testing the IR receiver is relatively straightforward.

Run these two commands to stop lircd and start outputting raw data from the IR receiver:

sudo /etc/init.d/lirc stop

mode2 -d /dev/lirc0

Point a remote control at your IR receiver and press some buttons. You should see something like this:

space 16300
pulse 95
space 28794
pulse 80
space 19395

Next, use irrecord to record the signals from your remote control and generate a remote configuration file. irrecord will ask you to name the buttons you’re programming as you program them. Be sure to run irrecord --list-namespace to see the valid names before you begin.

Here were the commands that we ran to generate a remote configuration file:

# Stop lirc to free up /dev/lirc0
sudo /etc/init.d/lirc stop

# Create a new remote control configuration file (using /dev/lirc0) and save the output to ~/lircd.conf
irrecord -d /dev/lirc0 ~/lircd.conf

# Make a backup of the original lircd.conf file
sudo mv /etc/lirc/lircd.conf /etc/lirc/lircd_original.conf

# Copy over your new configuration file
sudo cp ~/lircd.conf /etc/lirc/lircd.conf

# Start up lirc again
sudo /etc/init.d/lirc start

Once you’ve completed a remote configuration file and saved/added it to /etc/lirc/lircd.conf you can try testing the IR LED. We’ll be using the irsend application that comes with LIRC to facilitate sending commands. You’ll definitely want to check out the documentation to learn more about the options irsend has.

Here are the commands we ran to test the IR LED (using the “tatasky” remote configuration file we created):

# List all of the commands that LIRC knows for 'tatasky'
irsend LIST tatasky ""


# Send the KEY_POWER command once
irsend SEND_ONCE tatasky KEY_POWER

# Send the KEY_VOLUMEDOWN command once
irsend SEND_ONCE tatasky KEY_VOLUMEDOWN

The last step is to connect this module to the Python program.
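A thin wrapper around irsend is enough for that; the sketch below (with our hypothetical send_ir() helper and the tatasky remote name from above) simply shells out to LIRC from Python.

import subprocess

REMOTE = 'tatasky'   # remote name used in /etc/lirc/lircd.conf

def send_ir(key):
    # Send one IR code, e.g. 'KEY_POWER', through LIRC's irsend utility.
    subprocess.call(['irsend', 'SEND_ONCE', REMOTE, key])

send_ir('KEY_POWER')        # toggle the set-top box power
send_ir('KEY_VOLUMEDOWN')   # lower the volume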

4.5 WORKING OF IR RECEIVER AND TRANSMITTER

An IR LED, also known as an IR transmitter, is a special-purpose LED that transmits infrared rays in the 760 nm wavelength range. Such LEDs are usually made of gallium arsenide or aluminium gallium arsenide. They, along with IR receivers, are commonly used as sensors.

The emitter is simply an IR LED (Light Emitting Diode) and the detector is simply an IR photodiode which is sensitive to IR light of the same wavelength as that emitted by the IR LED. When IR light falls on the photodiode, its resistance and, correspondingly, its output voltage change in proportion to the magnitude of the IR light received. This is the underlying working principle of the IR sensor.


4.6 FLOW CHART OF PROGRAM

Fig 12 Flowchart

The flowchart of the Python script is shown in Fig 12. The voice input is first checked to see whether it is the keyword. The system then sends a high beep through the audio output to indicate that the microphone is actively listening. The next voice input is compared with the configured commands and the corresponding function is called.
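The flow can be condensed into a small Python sketch; listen_for(), play_beep() and the command table below are hypothetical placeholders for the actual microphone, PocketSphinx and module calls used in the project.

KEYWORD = 'jasper'

def listen_for():
    # Placeholder: return the decoded text of the next utterance.
    return input('say> ')

def play_beep():
    print('*beep*')          # stand-in for the audible prompt

def play_music():
    print('starting music player')

def change_channel():
    print('sending IR code to the TV')

COMMANDS = {'play music': play_music, 'change channel': change_channel}

while True:
    if listen_for().strip().lower() == KEYWORD:   # 1. verify the keyword
        play_beep()                               # 2. signal that we are listening
        command = listen_for().strip().lower()    # 3. take the command
        action = COMMANDS.get(command)
        if action:
            action()                              # 4. run the matching function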


4.7 BLOCK DIAGRAM

Here we are using CMU Sphinx with the jasper-client brain, which implements a deep learning algorithm.

Python modules are written for various functions.

First, the configured keyword is spoken; we will hear a high beep, which means Jasper is listening.

Now the command is given, which is decoded and looked up in the PocketSphinx dictionary using HMM computation.

A match is found with the words listed in the modules and the appropriate function is executed, which can be playing a song or video, reading a book, changing a TV channel or playing a quiz game.

The song and video database can also include songs in any regional language. The output of the system is then heard through the speakers or earphones.

Fig 13.Block Diagram of System
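For reference, a module in this scheme looks roughly like the sketch below, which follows the jasper-client module convention (WORDS, isValid, handle); the song paths and the use of omxplayer are assumptions for illustration.

import random
import re
import subprocess

WORDS = ["SONG", "MUSIC"]

SONGS = ["/home/pi/music/song1.mp3", "/home/pi/music/song2.mp3"]   # assumed paths

def isValid(text):
    # Jasper calls this to decide whether this module should handle the text.
    return bool(re.search(r'\b(song|music)\b', text, re.IGNORECASE))

def handle(text, mic, profile):
    # Announce and play one song from the local database.
    song = random.choice(SONGS)
    mic.say("Playing a song for you")
    subprocess.call(["omxplayer", song])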


CHAPTER 5

FURTHER ENHANCEMENTS

1.RECOGNITION WITHOUT INTERNET ACCESS

We are well aware that internet access is not available throughout our country. Currently, India is nowhere near meeting the target for a service which is considered almost a basic necessity in many developed countries.

In such cases this project may not function; therefore we are enhancing it to work even without internet access, using offline recognition toolkits such as CMU Sphinx.

2. GSM MODULE FOR VOICE ACTIVATED CALLING

The Raspberry Pi SIM800 GSM/GPRS Add-on V2.0 is customized for the Raspberry Pi interface and is based on the SIM800 quad-band GSM/GPRS/BT module. AT commands can be sent via the serial port on the Raspberry Pi, so functions such as dialling and answering calls, sending and receiving messages, and surfing online can be realized. Moreover, the module supports powering on and resetting via software.

Fig.14 GSM Quadband 800A
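As a hedged sketch of how voice-activated calling could be wired in, the Python snippet below sends AT commands to the SIM800 over the serial port using pyserial; the device name, baud rate and phone number are placeholders that depend on the actual wiring and configuration.

import time
import serial   # pyserial

port = serial.Serial('/dev/ttyAMA0', baudrate=115200, timeout=1)   # assumed device and baud rate

def send_at(command):
    # Send one AT command and return whatever the module replies.
    port.write((command + '\r\n').encode())
    time.sleep(0.5)
    return port.read(port.in_waiting or 1).decode(errors='ignore')

print(send_at('AT'))                  # should answer OK if the module is alive
print(send_at('ATD+911234567890;'))   # dial a placeholder number; trailing ';' means voice call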


3. HOME AUTOMATION

With the right level of ingenuity, the sky's the limit on things you can automate in your home, but here are a few basic categories of tasks that you can pursue:

Automate your lights to turn on and off on a schedule, remotely, or when certain conditions are triggered.

Set your air conditioner to keep the house temperate when you're home and save energy while you're away.

Fig.15 Home automation possibilities


CHAPTER 6

APPLICATIONS

Usage in education and daily life

For language learning, speech recognition can be useful for learning a second language. It can teach proper pronunciation, in addition to helping a person develop fluency with their speaking skills.[6]

Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard. [6]

Aerospace (e.g. space exploration, spacecraft, etc.): NASA’s Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the lander.[7]

Automatic subtitling with speech recognition[7]

Automatic translation

Court reporting (Realtime Speech Writing)

Telephony and other domains

ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. Despite the high level of integration with word processing in general personal computing, ASR in the field of document production has not seen the expected increases in use.

The improvement of mobile processor speeds has made speech-enabled Symbian and Windows Mobile smartphones feasible. Speech is used mostly as a part of the user interface, for creating predefined or custom speech commands. Leading software vendors in this field are: Google, Microsoft Corporation (Microsoft Voice Command), Digital Syphon (Sonic Extractor), LumenVox, Nuance Communications (Nuance Voice Control), VoiceBox Technology, Speech Technology Center, Vito Technologies (VITO Voice2Go), Speereo Software (Speereo Voice Translator), Verbyx VRX and SVOX.

In-car systems

Typically a manual control input, for example by means of a finger control on the steering-wheel, enables the speech recognition system and this is signalled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for recognition.

Simple voice commands may be used to initiate phone calls, select radio stations or play music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recent car models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is therefore no need for the user to memorize a set of fixed command words.

Fig 16.Car automation


Helicopters

The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the jet fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot, in general, does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade in speech recognition systems applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice applications have included: control of communication radios, setting of navigation systems, and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done both in speech recognition and in overall speech technology in order to consistently achieve performance improvements in operational settings.

High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note is the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), and a program in France installing speech recognition systems on Mirage aircraft, and also programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight display.

REFERENCES

[1] D. Yu and L. Deng, "Automatic Speech Recognition: A Deep Learning Approach", Springer, 2014.

[2] Claudio Becchetti and Lucio Prina Ricotti, "Speech Recognition: Theory and C++ Implementation", 2008 edition.

[3] Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models" (PDF). IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83. doi:10.1109/89.365379. ISSN 1063-6676. OCLC 26108901. Retrieved 21 February 2014.

[4] Waibel, Hanazawa, Hinton, Shikano, Lang (1989). "Phoneme recognition using time-delay neural networks". IEEE Transactions on Acoustics, Speech and Signal Processing.

[5] Microsoft Research. "Speaker Identification (WhisperID)". Microsoft. Retrieved 21 February 2014.

[6] "Low Cost Home Automation Using Offline Speech Recognition", International Journal of Signal Processing Systems, vol. 2, no. 2, pp. 96-101, 2014.

[7]Juang, B. H.; Rabiner, Lawrence R. "Automatic speech recognition–a brief history of the technology development" (PDF). p. 6. Retrieved 17 January 2015.

[8] Deng, L.; Li, Xiao (2013). "Machine Learning Paradigms for Speech Recognition: An Overview". IEEE Transactions on Audio, Speech, and Language Processing.

[9] P. V. Hajar and A. Andurkar, "Review Paper on System for Voice and Facial Recognition using Raspberry Pi", International Journal of Advanced Research in Computer and Communication Engineering, vol. 4, no. 4, pp. 232-234, 2015.

[10] "Common Health Risks of the Bedridden Patient", Carefect Blog Team, October 24, 2013.
