High Technology Letters ISSN NO : 1006-6748

Voice Control Desktop Assistant

Jessica Sarah 1, Amisha Michelle Danny 2, Juan Mark Deen 3, Ankit Ahirwar 4, Abhishek Solanki 5, Anurag Shrivastava 6, Divyansh Gajbe 7

1 Department of Computer Science and Engineering, Vellore Institute of Technology, Kotri Kalan, Ashta, Near Indore Road, Bhopal, Madhya Pradesh 466114, India; 2 Department of Computer Science and Engineering, Kalinga Institute of Industrial Technology, KIIT Road, Patia, Bhubaneswar, Odisha 751024, India; 3 Department of Computer Science and Engineering and Bioinformatics, Vellore Institute of Technology, Vellore Campus, Tiruvalam Rd, Katpadi, Vellore, Tamil Nadu 632014, India; 4,5,6,7 Department of Computer Science and Engineering, University Institute of Technology, RGPV, Bhopal 462033, India

Abstract

In recent times, the demand for controlling electronic devices through voice has been increasing steadily. Work on computers has grown drastically during the pandemic, and with more and more work done on a desktop, performing general and routine tasks with extra manual effort becomes tedious alongside academic or professional work. A desktop assistant can be built to act as a better companion that automates some of these tasks. This paper shows how to automate the day-to-day tasks of users with voice commands and make desktop work more comfortable.

Keywords: Desktop Assistant, Human voice interaction, Python, SQL.

1. Introduction

At present, human-machine interaction is an exciting topic, toward which much work has already been done and continues to develop. Interaction with the machine in the form of voice and gestures gives the user a comfortable working experience. Developments in machine learning algorithms and natural language processing give more flexibility and convenience to human-machine interaction. AT&T researchers at Bell Labs collected information on vowel formant frequency shifts and, in the 1950s, produced the world's first test system for the pronunciation of the ten English digits. Dynamic programming approaches were proposed in the 1960s by Soviet researchers. In the 1970s, pitch and cepstrum technologies were increasingly applied to speech recognition with the advent of LPC voice feature parameters. Speech recognition technology reached a climax with the hidden Markov model (HMM) put forward in the 1980s. Several speech recognition systems have since been introduced, such as the Whisper system and the IBM ViaVoice system. Related technology was pushed ahead simultaneously; discriminative training based on the criterion of maximum likelihood estimation appeared [1]. Speech recognition is a practical application that establishes a solid foundation for automatic speech recognition

Volume 27, Issue 7, 2021, 754, http://www.gjstx-e.cn/

robust against acoustic environmental distortion. The literature provides a thorough overview of classical and modern noise- and reverberation-robust techniques developed over the past thirty years, emphasizing practical methods proven to be successful and likely to be developed further for future applications. The strengths and weaknesses of robustness-enhancing speech recognition techniques have been carefully analyzed. The author [2] discussed acoustic models based on Gaussian mixture models and deep neural networks for noise-robust technology; in addition, guidance is offered for the selection of best practices. Speech recognition systems are projected to be employed as the primary man-machine interface for robots used in rehabilitation, entertainment, and other applications in the future. Study [3] outlines the creation of a hidden Markov model-based voice recognition system for biped robot control. According to the researchers in [4], unimodal repair is less accurate than multimodal error correction, and multimodal correction is faster than unimodal correction by respeaking on a dictation job. Their research also shows that system-initiated error repair (based on confidence metrics) may not speed up error correction. The key technologies developed in the 1990s were stochastic language understanding, statistical learning of acoustic and language models, and methods for implementing large-vocabulary speech understanding systems. After five decades of research, speech recognition technology has finally entered the marketplace, benefiting users in various ways. The challenge of designing a machine that genuinely functions like an intelligent human remains a major one going forward [5]. This study is a work toward making machine operation automated with the least manual effort from the user.
The proposed assistant can perform the following tasks through voice commands:
• Take voice commands from the user and execute general tasks (the basic functionality of the assistant).
• Search the Web, Wikipedia, and YouTube in a browser with voice instructions.
• Send emails to contacts, with document attachments, while users are busy with their own tasks.
• Play the user's favorite videos on YouTube and also download them to the computer.
• Play music from the user's computer.
• Maintain the user's queries and quick notes.
• Take screenshots and set up reminders.
All these tasks are performed through voice commands and are almost fully automated.

1.1 Objective of the Proposed Method
The objective of this study is to develop an application that performs general tasks with the help of voice commands. The expected achievements for fulfilling this objective are:
• Interact with the user through voice and visuals.
• Accept the user's voice command.
• Based on the given command, extract the task and the data needed to perform the operation.


• Present the results to the user in a convenient manner.

2. Proposed Methodology This section deals primarily with the proposed techniques, methodologies, and concepts relevant to voice-controlled desktop assistants, focusing on a single pipeline of speech detection, processing, data extraction, and operations.

2.1 Proposed Work The proposed model can be divided into three main modules. The modules and their functions are defined as follows.

2.1.1: Speech Recognition Speech recognition is the process of converting spoken words to text, and it is the leading block for interaction with the machine in this study. Python supports many speech recognition engines, including the Google Speech Engine, Google Cloud Speech API, Microsoft Bing Voice Recognition, and IBM Speech to Text.

2.1.2: Processing of the Query (Command) The user-given command is processed to extract the type of task to be performed and the data required for performing it. Processing user queries is a complex task; it is done procedurally by matching the query's keywords against the commands defined in the dataset. After the command is detected, the required data is obtained by filtering the keywords by their relative positions. Example query: send email to juanmark0521

2.1.3: Performing Operations: After the command and its data are extracted, the main task is to execute the given command. In the program, every task the proposed model can execute is defined as a function. The query-recognizer part of the program, which runs in a continuous loop, calls the required function with the data obtained from the query.
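The function-based execution described above can be sketched as a dispatch table that maps recognized command keywords to handler functions. This is an illustrative sketch under stated assumptions, not the paper's actual code; the function and keyword names are hypothetical.

```python
# Hypothetical sketch: each task the assistant supports is defined as a
# function, and a dispatch table maps command keywords to those functions.

def open_browser(data=None):
    return "opening browser"

def search_wikipedia(data=None):
    return f"searching wikipedia for {data}"

# Dispatch table: command keyword -> handler function (illustrative names).
COMMANDS = {
    "open browser": open_browser,
    "wikipedia": search_wikipedia,
}

def execute(command, data=None):
    """Look up the recognized command and call its handler with the data."""
    handler = COMMANDS.get(command)
    if handler is None:
        return "command not recognized"
    return handler(data)
```

A dictionary dispatch keeps the continuous query-recognizer loop simple: identifying a command reduces to one lookup, and new tasks are added by registering one more function.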

2.2: Use Case Diagram The use case diagram proposed in fig. 2.2 shows the assistant's tasks.


Fig.2.2 Use case Diagram

2.3: Proposed Logic and Algorithm The working of any program depends mainly on the algorithm used in it. The logic used in this study is procedural, as described below:
1. Import all the required modules and libraries.
2. Create the database if it does not exist. ## when running the program for the first time
3. Define the speech recognition method and the speak method.
4. Define the setup to store the user credentials and contact data.
5. Define the methods that execute the operations performed by the assistant.
6. Define the query-identification logic inside an infinite loop to continuously identify each command and execute it with the required data.
7. End of program.

2.3.1: Algorithm for Speech Recognition Algorithm for taking the user command as voice input:
1. Create an instance of the speech_recognition recognizer for speech identification.
2. Using the microphone as an operating system service, capture the user's audio through the speech_recognition instance.
3. Process the recorded audio with the speech recognition engine API to convert the audio to text form.
4. Return the text query to the calling function.


2.3.2: Algorithm for Query Identification and Execution: The following logic is implemented to identify and execute the user's command:
1. while(true) ## infinite loop for taking continuous commands from the user
2. Get the command as a text query from the algorithm above. ## using the speech recognition algorithm
3. Identify the command with the help of conditional logic and the pre-defined dataset of commands.
4. Remove the command keywords from the query and, if required, filter out the data needed for execution.
5. Execute the command by calling the respective function with the data.
6. if (end command): break the while loop and exit the program; else: continue the command-identification logic ## go to step 2
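The steps above can be sketched in plain Python. In this illustrative sketch, a list of scripted text queries stands in for live speech input, and the tiny keyword dataset is an assumption made only for demonstration.

```python
# Illustrative sketch of the query-identification loop (steps 1-6 above);
# scripted queries replace live microphone input for demonstration.

COMMAND_KEYWORDS = {"send email", "search", "play music"}  # toy dataset

def identify(query):
    """Return the matched command keyword, 'exit' for the end command,
    or None when no keyword from the dataset appears in the query."""
    q = query.lower()
    if "exit" in q:
        return "exit"
    for keyword in COMMAND_KEYWORDS:
        if keyword in q:
            return keyword
    return None

def run(queries):
    """Process queries until the end command, collecting executed commands."""
    executed = []
    for query in queries:          # stands in for while(true) + speech input
        command = identify(query)
        if command == "exit":
            break                  # step 6: end command exits the loop
        if command is not None:
            executed.append(command)   # step 5: stand-in for execution
    return executed
```

In the real assistant, `run` would loop forever over microphone input and each matched keyword would dispatch to a task function rather than being collected in a list.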

2.4: Control Flow Diagram The following diagram describes the control flow of the proposed model, as shown in fig 2.4.

Fig.2.4 Control Flow Diagram

2.5: Dataflow Diagram Fig 2.5 shows the flow of data (speech) through the program for processing, where the query flow depends on the speech command.


Fig. 2.5 Dataflow Diagram

2.6: Speech Recognition Synthesis

2.6.1 Speech Recognition

A desktop voice assistant is based on human speech, in which speech recognition is the technology that enables a computer to capture the words spoken by a human with the help of a microphone. The speech recognizer then recognizes these words, and in the end the system outputs the recognized words. The process of speech recognition consists of different steps, discussed one by one in the following sections. The ideal situation is that a speech recognition engine recognizes all words uttered by a human; in practice, however, the performance of a speech recognition engine depends on several factors. Multiple speakers and noisy environments are the most significant of these factors, as shown in fig 2.6.

Fig 2.6 Methods of Speech Recognition Synthesis (overview diagram covering: types of speech recognition, the speech recognition process model, components of a speech recognition system, and speech recognition weaknesses and flaws)


2.6.1.1 Types of speech recognition

Speech recognition systems can be divided into a number of classes based on their ability to recognize words and the list of words they have. A few classes of speech recognition are classified as follows:

• Isolated Speech: Isolated-word recognition usually involves a pause between two utterances; it does not mean that it accepts only a single word, but rather that it requires one utterance at a time [4].
• Connected Speech: Connected-word recognition is similar to isolated speech but allows separate utterances with minimal pause between them.
• Continuous Speech: Continuous speech recognition allows the user to speak almost naturally; it is also called computer dictation.
• Spontaneous Speech: At a basic level, this can be thought of as speech that sounds natural and is not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words run together, "ums" and "ahs", and even slight stutters.

2.6.1.2 Speech Recognition Process Model

Fig.2.6.1.2 Model of Speech Recognition Process

2.6.1.3 Components of Speech Recognition System

Voice Input: With the help of a microphone, audio is input to the system; the PC sound card produces the equivalent digital representation of the received audio.
Digitization: The process of converting the analog signal into digital form is known as digitization, and it involves both sampling and quantization. Sampling converts a continuous signal into a discrete signal, while approximating a continuous range of values to a finite set is known as quantization.
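Sampling and quantization can be illustrated on a toy signal. The sketch below is only a didactic model of what the sound card does; the sample rate, tone frequency, and number of quantization levels are arbitrary choices for demonstration.

```python
import math

def digitize(signal, sample_rate, duration, levels):
    """Sample a continuous signal at sample_rate Hz for `duration` seconds,
    then quantize each sample (assumed in [-1, 1]) to `levels` discrete steps."""
    n = int(sample_rate * duration)
    samples = [signal(i / sample_rate) for i in range(n)]   # sampling
    step = 2.0 / (levels - 1)
    quantized = [round(s / step) * step for s in samples]   # quantization
    return quantized

# A 5 Hz tone sampled at 40 Hz and quantized to 9 amplitude levels.
tone = lambda t: math.sin(2 * math.pi * 5 * t)
digital = digitize(tone, sample_rate=40, duration=0.5, levels=9)
```

Real digitization in a sound card works the same way in principle, but at rates such as 16 kHz or 44.1 kHz and with 16-bit (65,536-level) quantization.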


Acoustic Model: An acoustic model is created by taking audio recordings of speech and their text transcriptions and using software to create statistical representations of the sounds that make up each word. A speech recognition engine uses it to recognize speech; the acoustic model breaks the words into phonemes.
Language Model: Language modeling is used in many natural language processing applications, such as speech recognition, to capture the properties of a language and predict the next word in the speech sequence. The language model compares the phonemes to words in its built-in dictionary.
Speech Engine: The job of the speech recognition engine is to convert the input audio into text; to accomplish this, it uses all sorts of data, software algorithms, and statistics. Its first operation is digitization, as discussed earlier, to convert the audio into a suitable format for further processing. Once the audio signal is in a proper format, it searches for the best match by considering the words it knows; once the signal is recognized, it returns the corresponding text string.

2.6.1.4 Speech Recognition weakness and flaws

Despite all these advantages and benefits, a hundred-percent-perfect speech recognition system cannot be developed. Many factors can reduce the accuracy and performance of a speech recognition program. The speech recognition process is easy for a human but difficult for a machine; compared with the human mind, speech recognition programs seem far less intelligent. The human capabilities of thinking, understanding, and reacting are natural, while they are complicated tasks for a computer program. The program first needs to understand the spoken words with respect to their meanings, and it has to strike a sufficient balance between words, noise, and silences. A human has a built-in capability for filtering noise from speech, while a machine requires training; a computer requires help in separating the speech sound from other sounds.

2.6.2 Speech Synthesis

A speech synthesizer converts written text into spoken language. Speech synthesis is also referred to as text-to-speech (TTS) conversion as shown in fig.2.6.2.


Fig.2.6.2 Speech Synthesis Model (diagram of the pipeline: structure analysis, text pre-processing, text-to-phoneme conversion, prosody analysis, and waveform production)

The major steps in producing speech from text are as follows:

2.6.2.1 Structure analysis

Process the input text to determine where paragraphs, sentences, and other structures start and end. For most languages, punctuation and formatting data are used in this stage.

2.6.2.2 Text pre-processing

Analyze the input text for particular constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, email addresses, and many other forms. Other languages need special processing for these forms, and most languages have other specialized requirements.

2.6.2.3 Text-to-phoneme conversion

Convert each word to phonemes. A phoneme is a basic unit of sound in a language. US English has around 45 phonemes, including consonant and vowel sounds. For example, "times" is spoken as four phonemes "t ay m s." Different languages have different sets of sounds (different phonemes). For example, Japanese has fewer phonemes, including sounds not found in English, such as "ts" in "tsunami."
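A dictionary-based text-to-phoneme lookup can be sketched as follows. The tiny phoneme dictionary is purely illustrative; real synthesizers use large pronunciation lexicons plus letter-to-sound rules for unknown words.

```python
# Toy grapheme-to-phoneme lookup (illustrative entries only).
PHONEME_DICT = {
    "times": ["t", "ay", "m", "s"],   # the four-phoneme example from the text
    "hello": ["hh", "ah", "l", "ow"],
}

def to_phonemes(word):
    """Return the phoneme sequence for a word, or None if it is unknown."""
    return PHONEME_DICT.get(word.lower())
```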

2.6.2.4 Prosody analysis

The sentence structure, words, and phonemes determined in the previous steps are used to work out the appropriate prosody for the sentence. Prosody includes many features of speech other than the sounds of the words being spoken: the pitch (or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words, and many other features. Correct prosody is


essential for making speech sound right and for correctly conveying the meaning of a sentence.

2.6.2.5 Waveform production

Finally, the phonemes and prosody information are used to produce the audio waveform for each sentence. There are many ways in which the speech can be produced from the phoneme and prosody information. Most current systems do it in two ways: concatenation of chunks of recorded human speech or formant synthesis using signal processing techniques based on knowledge of how phonemes sound and how prosody affects those phonemes. The details of waveform generation are not typically necessary to application developers.

2.7 Mechanism used in Proposed Model

The proposed model, as shown in fig 2.7, uses the desktop system's default microphone for speech recognition. Converting speech to the relevant text is done through the API used by the speech recognition engine of the Python module speech_recognition. The speech_recognition instance created in the program first uses the microphone to capture the user's voice. The recorded audio is then passed to the speech_recognition engine, which sends the audio to a speech-to-text API on a server that returns the relevant text for the given speech. The engine offers multiple API services for speech-to-text conversion; this model uses recognize_google(), as Google has a large dataset and provides the most relevant results. The obtained text of the captured audio is then used for further processing, i.e., identifying and executing the user command.

Fig. 2.7 Proposed mechanism for speech to text conversion

2.8 Capturing and Identifying Voice The proposed model continuously captures the user's voice, scans it, and identifies whether the command is addressed to the assistant by checking for the triggering word 'jarvis'.


A command whose starting word is identified as 'jarvis' is forwarded for executing the task specified in the command. If the captured speech does not start with 'jarvis', the command is rejected and the assistant again starts capturing audio, as shown in fig 2.8. Voices in which the triggering keyword is not mentioned are ignored.

Fig.2.8 Accepting the command given to the assistant and rejecting other voices
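The trigger-word check can be sketched as a simple prefix test on the recognized text. The wake word 'jarvis' is from the paper; the function name is an illustrative assumption.

```python
def accept_command(recognized_text):
    """Accept the command only when it starts with the trigger word 'jarvis';
    return the remainder of the command, or None when the speech is rejected."""
    words = recognized_text.lower().split()
    if words and words[0] == "jarvis":
        return " ".join(words[1:])
    return None   # trigger word absent: ignore this audio and keep listening
```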

2.9 User command processing

The user's queries, once received, are converted to text statements. The general formats of commands given by the user are, for example:
1. send email to anurag
2. Akshay Kumar search on Wikipedia
3. Tell me news headlines
For any of these formats, the first processing step is to extract the command keywords. The predefined command models in the dataset are matched one by one against the keywords present in the query. When the command keywords match the query, the task is performed according to the command. If data is required for the execution, it is extracted from the remaining part of the query by removing the search keywords and identifiers.

2.9.1 Commands with Data and Identifiers

Processing this type of command involves three steps:
1. The main keyword of the command is matched against the command dataset to identify the query.
2. After the command is matched with a suitable command from the dataset, further action is taken to identify the current working directory for performing the query action.


3. To execute the command identified in the query, the data required for execution is extracted by removing the identified parts from the text query. This matched query data is forwarded to the executing program, as shown in fig 2.9.1. Commands for YouTube and email operations belong to this category.

Fig.2.9.1 Structure Representation of user query
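The three steps above can be sketched for the email example, where "send email" is the command keyword, "to" is the identifier, and the remainder of the query is the data. This is an illustrative sketch; the parsing in the actual assistant may differ.

```python
def parse_email_command(query):
    """Split a query like 'send email to juanmark0521' into the command
    keyword and the data that follows the identifier 'to'."""
    q = query.lower()
    if "send email" not in q:           # step 1: match the command keyword
        return None
    _, _, rest = q.partition(" to ")    # steps 2-3: strip identifier, keep data
    return {"command": "send email", "recipient": rest.strip()}
```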

2.9.2 Commands with Data Only These commands are pre-processed in the same manner described above:
1. As there are no identifiers present in the query, the command keyword is matched directly against the dataset.
2. After identifying the command, the query is executed with the data available in the query.
The commands for web searching and Wikipedia searching belong to this category.

2.9.3 Direct Commands These commands do not require as much processing as commands with data. The query is matched directly against the dataset by its main keywords. These commands are used for routine tasks; the commands for getting news headlines, opening a browser, and playing offline music belong to this category.

2.10 Limitations of the Speech Recognition Model Used

The following are some limitations of the model used in the proposed system. • Noise in the surroundings makes it challenging to capture and process the user's voice.


• The desktop system must have the required hardware, i.e., an inbuilt microphone and speaker.
• Commands given by users that the voice assistant does not correctly recognize cannot be evaluated as text; they are discarded.
• Every command must start with 'Jarvis'.

3. Experimental Setups and Result Outcomes

A Windows prompt interface is used for the working of the assistant. Control inputs are taken through voice, and the results are given as audio as well as text in the window.

3.1 System Requirements
1. Intel Core i3 processor with a minimum of 4 GB RAM
2. Active Internet connection
3. Microsoft Windows operating system
4. System microphone and speaker

3.2 Starting Interface and Setup The assistant starts by greeting the user, as shown in fig 3.2(a) and (b), and displaying a brief introduction in the window. After that, the user can change the setup details.

Fig.3.2(a) Starting Interface


Fig.3.2(b) Setup interface

3.3 Getting News Updates The assistant can present the user with the latest news headlines from across the world. It speaks the headlines as audio and also logs them onto the screen window for reading, as shown in fig 3.3. User Voice Input: Jarvis tell me the latest news headlines Output: Speak news headlines

Fig.3.3 Showing result for news headlines


3.4 Setting a Reminder When the assistant is given the command to set a reminder, it asks what it should remind the user about; the user describes the reminder, and then a GUI appears to set the time, as shown in fig.3.4. User voice command: Jarvis set reminder for me Output: preparing reminder -> take reminder description as voice input -> take day and time -> schedule the reminder

Fig.3.4 Setting the reminder
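A reminder can be scheduled with the standard library's threading.Timer, as a simplified stand-in for the GUI-driven scheduling described above. The callback here just records the description; in the assistant it would speak the reminder aloud. The delay and description are illustrative values.

```python
import threading

reminders = []   # fired reminders collect here (stand-in for speaking aloud)

def set_reminder(description, delay_seconds):
    """Schedule a reminder to fire after delay_seconds using a timer thread."""
    timer = threading.Timer(delay_seconds,
                            lambda: reminders.append(description))
    timer.start()
    return timer

t = set_reminder("team meeting", 0.05)
t.join()   # wait for the reminder to fire (for demonstration only)
```

In the real assistant the timer would run in the background while the query loop keeps listening, rather than being joined immediately.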

3.5 Making a text note

To make a text note, the command is given by the user; after identifying the query, the assistant starts taking voice input continuously and stores it in a buffer until the user completes the note. It then writes the given note into a text file with a date-time stamp, as shown in fig.3.5(a) and (b).

User voice input: Jarvis make note for me Output: Taking Note->saving to the text file in document


Fig. 3.5(a) Taking Note from the user

Fig 3.5(b) Text file of saved note
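Saving a note to a timestamped text file can be sketched with the standard library. The file-name pattern and target directory are illustrative assumptions; a temporary directory is used here only for demonstration.

```python
import datetime
import os
import tempfile

def save_note(text, directory):
    """Write the dictated note to a text file named with a date-time stamp."""
    stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    path = os.path.join(directory, f"note_{stamp}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

# Demonstration: save a note into a temporary directory.
tmp = tempfile.mkdtemp()
note_path = save_note("buy milk", tmp)
```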

3.6 Sending Email

To send an email, the receiver's email ID is obtained from the database contacts, and the content to be sent is taken from the user's voice. After capturing continuous voice into a buffer, it is sent as an email, as shown in fig.3.6.

User Voice Input: Jarvis send email to juanmark0521 Output: Preparing email -> getting email address from contacts -> Taking content for email through user voice -> sending the email



Fig.3.6 Preparing and sending email
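The email itself can be assembled with the standard library's email module and sent with smtplib. The sketch below only builds the message; the addresses, server, and credentials are placeholders, and the actual send is shown in comments rather than executed.

```python
from email.mime.text import MIMEText

def build_email(sender, recipient, body):
    """Assemble a plain-text email message from dictated content."""
    msg = MIMEText(body)
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Message from voice assistant"   # illustrative subject
    return msg

msg = build_email("me@example.com", "friend@example.com", "hello from Jarvis")

# Sending would then use smtplib, e.g. (placeholders, not executed here):
# import smtplib
# with smtplib.SMTP("smtp.example.com", 587) as s:
#     s.starttls()
#     s.login(user, password)
#     s.send_message(msg)
```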

3.7 Sending Email with Attachment To send an email with an attachment, the command is given to the assistant and, based on it, a GUI is provided to format the email and add the attachment. As fig.3.7 shows, after getting the required input the assistant sends the email. User Voice Input: Jarvis make email with attachments Output: Preparing the email -> GUI for filling email content -> user browses for the file -> Sending email

Fig.3.7 Preparing and sending email with attachments

3.8 Wikipedia Search To search Wikipedia, the search data is extracted from the query. The result is then fetched from Wikipedia and presented to the user as text and voice, as shown in fig.3.8.


User Voice Input: Jarvis search Narendra Modi Wikipedia Output: Search result fetched from Wikipedia -> Presented to the user as text and audio

Fig.3.8 Presenting Wikipedia result

3.9 Playing Video on YouTube To play a video on YouTube, the data describing the video is fetched from the query; after filtering the results obtained for the search data, the best-suited video is played, as shown in fig.3.9. User Voice Input: Jarvis play Hanuman Chalisa on Youtube Output: Searching for the appropriate result on Youtube -> Play the found video in the default web browser

Fig.3.9 Playing Youtube Video


3.10 Downloading a Video from YouTube There are two options for downloading a YouTube video: first, provide the video link explicitly; second, if a video is already playing on YouTube through the assistant, it is downloaded directly to desktop storage at the location provided in the setup, as shown in fig 3.10(a) and (b).

User voice input: Jarvis download this youtube video Output: Preparing download resources -> Download the video to the Download directory Note: The user can also provide the video link explicitly.

Fig 3.10(a) Downloading youtube video

Fig.3.10(b) Downloaded Video saved to Download Directory


3.11 Web Searching To search the web, the search data is embedded in the URL, and the result is shown, as in fig 3.11(a) and (b), using the desktop's default web browser. User Voice Input: Jarvis search new cabinet minister on google Output: Preparing URL for the given search data -> Showing result in the default web browser

Fig.3.11(a) Getting query for web searching

Fig.3.11(b) Showing result in default Web Browser
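Embedding the spoken search data in a URL can be sketched with the standard library; the query string must be percent-encoded so that spaces and special characters survive. The search endpoint shown is Google's public one, and opening the browser is left as a comment.

```python
import urllib.parse

def search_url(query):
    """Build a Google search URL with the spoken query safely encoded."""
    return "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)

url = search_url("new cabinet minister")
# webbrowser.open(url) would then show the result in the default browser.
```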

3.12 Taking a Screenshot After receiving the command to take a screenshot, the captured image is saved to the Images directory, as shown in fig. 3.12(a) and (b). User Voice Input: Jarvis capture screen


Output: Screenshot taken -> saved to Images Directory

Fig.3.12(a) Taking Screenshot on user command

Fig.3.12(b) Saving Screenshot in Image Directory

3.13 Battery Status The assistant gives the battery percentage, plugged-in status, and estimated battery time remaining in both visual and voice form, as shown in fig 3.13. User Voice Input: Jarvis show current battery status Output: Battery status is shown


Fig. 3.13 Showing Battery Status to User

3.14 Programming Language and Tools

3.14.1.Python Programming Language

Python is an interpreted, high-level, general-purpose programming language. Python's design philosophy emphasizes code readability, with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (mainly procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library. Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000 and introduced new features such as list comprehensions and a garbage collection system using reference counting. Python 3.0 was released in 2008 and was a significant revision of the language that is not entirely backward-compatible; much Python 2 code does not run unmodified on Python 3. Python 2 was discontinued with version 2.7.18 in 2020. This study uses Python 3.7 as the programming language. Python has been widely used in data science, machine learning, software development, and web development; the main reasons for its wide use are its readability and its flexibility across domains, with libraries, modules, and frameworks for each [6],[7],[8],[9].

3.14.2.SQL

SQL (Structured Query Language) is a domain-specific language used in programming, designed for managing data held in a relational database management system (RDBMS) or for stream processing in a relational data stream management system (RDSMS). It is advantageous in handling structured data, i.e., data incorporating relations among entities and variables. In this study, SQLite3, Python's built-in database, is used to store user data, and SQL is used to manipulate and update the SQLite3 database. SQL is a high-level language whose syntax is similar to general English sentences, which makes it easily readable [10],[11].
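Storing the contacts used by the email commands can be sketched with Python's sqlite3 module. The table name, columns, and sample data are illustrative, and an in-memory database stands in for the on-disk file the assistant would use.

```python
import sqlite3

# In-memory database for demonstration; the assistant would use a file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS contacts (name TEXT PRIMARY KEY, email TEXT)"
)
conn.execute("INSERT INTO contacts VALUES (?, ?)",
             ("juanmark0521", "juanmark0521@example.com"))
conn.commit()

def lookup_email(name):
    """Fetch a contact's email address by name, or None if absent."""
    row = conn.execute("SELECT email FROM contacts WHERE name = ?",
                       (name,)).fetchone()
    return row[0] if row else None
```

Parameterized queries (the `?` placeholders) keep voice-derived strings from being interpreted as SQL, which matters when the data comes from recognized speech.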


3.14.3. Programming tools

1. pyttsx3 pyttsx3 is a text-to-speech conversion library in Python that is very easy to use; an engine instance converts the entered text into speech. Unlike alternative libraries, it works offline and is compatible with both Python 2 and 3. An application invokes the pyttsx3.init() factory function to get a reference to a pyttsx3 engine. o The pyttsx3 module supports two voices, female and male, provided by "sapi5" on Windows. o pyttsx3 is used for the assistant's speech output and works in offline mode.

2. sapi5 driver o Microsoft Speech API (SAPI5) is the technology for voice recognition and synthesis provided by Microsoft. Starting with Windows XP, it ships as part of the Windows OS. o The Python pyttsx3 library uses the SAPI5 driver to manage the program's speech output. o It supports two voices, one male and one female.

3. speech_recognition
o speech_recognition is a Python library used to convert given audio data to text.
o It uses the system microphone as a source for the user's voice and, with the help of the Google Speech Recognition API, converts the audio into the text data used for functioning.
o It also supports several other APIs for converting speech to text.

4. PyAudio
o PyAudio is a Python binding used for working with audio streams.
o PyAudio provides Python bindings for PortAudio, the cross-platform audio I/O library. With PyAudio, users can easily use Python to play and record audio on a variety of platforms.

5. wikipedia
o wikipedia is a Python library that makes it easy to access and parse data from Wikipedia.
o It is used to search Wikipedia, get article summaries, get data such as links and images from a page, and more.
o It wraps the MediaWiki API so users can focus on using Wikipedia data.

6. pytube
o pytube is a lightweight, dependency-free Python library used for downloading videos from the web.
o pytube is used to sort the different file formats and links from the YouTube media to get the specific URL for downloading the video to the user's system.

7. Tkinter
Tkinter is the standard GUI library for Python. Python, when combined with Tkinter, provides a quick and easy way to create GUI applications. Tkinter provides a powerful


object-oriented interface to the Tk GUI toolkit. Creating a GUI application using Tkinter is an easy task; all a user needs to do is perform the following steps:
o Import the Tkinter module.
o Create the GUI application's main window.
o Add one or more widgets to the GUI application.
o Enter the main event loop to react to each event triggered by the user.
Tkinter is used to build the graphical user interface for interacting with the user. It is lightweight and one of the default modules of Python 3. In this study, the GUI is used for setting the timer and sending an email.

8. smtplib
Simple Mail Transfer Protocol (SMTP) is the protocol that handles sending emails and routing them between mail servers. Python provides the smtplib module, which defines an SMTP client session object that can send mail to any Internet machine with an SMTP or ESMTP listener daemon. Here, it is used to connect the sender and the desktop assistant so that emails can be sent from Python scripts.

9. Sqlite3
Sqlite3 is a built-in offline database module in Python used to store data. In this study, SQLite3 is used to store user data such as email login credentials, default directory paths, the user's email contacts, and the data used to set up reminders.

10. PyInstaller

This is a Python tool to create an executable file from a Python script. PyInstaller bundles a Python application and all its dependencies into a single package. The user can then run the packaged app without installing a Python interpreter or any modules. PyInstaller supports Python 3.6 or newer and correctly bundles major Python packages such as NumPy, PyQt, Django, wxPython, etc. PyInstaller is tested against Windows, Mac OS X, and GNU/Linux. However, it is not a cross-compiler: to make a Windows app, users run PyInstaller on Windows; to make a GNU/Linux app, on GNU/Linux, and so on. PyInstaller has been used successfully with AIX, Solaris, FreeBSD, and OpenBSD, but testing against them is not part of its continuous integration tests. Using PyInstaller, we can convert our Python application into a Windows executable file, which can run freely on the Windows operating system. It also makes the application functional on a system that does not have Python and its modules installed.
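As a sketch, a single-file Windows executable for the assistant could typically be built with a command like the following (the script name is an illustrative assumption, not taken from the paper):

```
pyinstaller --onefile --noconsole assistant.py
```

Here --onefile bundles everything into one executable, and --noconsole suppresses the console window on Windows so only the assistant's GUI appears.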

3.15 Working of the Assistant

The assistant works in the following way:
• The assistant starts with a greeting and waits continuously for input from the user.
• When a user opens it for the first time, it asks the user to complete a setup in which some necessary details must be provided for its functioning.


• It takes the user's email account credentials for sending emails.
• It takes paths for the default directories in which it will manage documents and downloaded files.
• It also takes some of the user's contact emails for communication.
• Users can also add other directory paths for file management.
• All these settings can be changed afterward as well.
• All this data is stored using sqlite3, the built-in offline database for Python.
• After the setup, the assistant is ready for work.
• Give it voice commands to automate the user's tasks.
• Downloading tasks run in parallel on another thread in the background.
• When a user wants to watch videos on YouTube, give it a command like 'Jarvis, play … on YouTube'.
• When a user wants to send an email to a contact, give it a command like 'Jarvis, send email to …'; it will ask the user what to send, and the content is given through voice. It will then send the email to the user's contact.
• If a user wants to download a YouTube video to their device, give it a command like 'Jarvis, download the YouTube video for me'. It will download the video the user is playing, or it will ask the user to provide the link externally.
• If the user wants to search for anything or any person, give it a command like 'Jarvis, search for …'; it will provide the search results and speak the response from Wikipedia.
• Users can also do a direct search on Wikipedia.
• It can take a quick note with a command like 'make a note.'
• Users can set a reminder by giving a command like 'set reminder for me.'
• Users can play music from their music directory by giving a command like 'play some music.'
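The command phrases above suggest a simple keyword-based router. A minimal sketch of how recognized queries might be mapped to handlers follows; the trigger phrases and handler labels are illustrative assumptions, not the paper's actual routing code.

```python
# Hypothetical command dispatcher: checks for trigger phrases in the
# recognized query and returns a label naming the matched task.
def dispatch(query: str) -> str:
    query = query.lower()
    if "send email" in query:
        return "email"
    if "download youtube video" in query:
        return "download"
    if "on youtube" in query:
        return "play_youtube"
    if "search for" in query:
        return "wikipedia_search"
    if "set reminder" in query:
        return "reminder"
    if "play some music" in query:
        return "music"
    return "unknown"

print(dispatch("jarvis play a song on youtube"))  # play_youtube
```

Note that the more specific phrase "download youtube video" is tested before the generic "on youtube" so a download request is not mistaken for a playback request.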

4. Programming Codes for the functioning of the Assistant

Following are the main functions of the assistant with their programming code.

4.1.Taking command from the User

The command is taken from the user with the help of the speech_recognition module and the microphone service of the operating system. The command is then converted into text using the Google Speech Recognition API, and this query is returned to the calling program for further processing. Following is the Python code for taking the user command:

Code for the user query command

def takeCommand():
    # It takes microphone input from the user and returns string output
    r = sr.Recognizer()
    with sr.Microphone() as source:


        print("Listening...")
        r.pause_threshold = 1
        audio = r.listen(source)
    try:
        print("Recognizing...")
        query = r.recognize_google(audio, language='en-in')
        print(f"User said: {query}\n")
    except Exception as e:
        # print(e)
        print("Say that again please...")
        return "None"
    return query

4.2.For sending Email

For sending email, the smtplib and email EmailMessage modules are used. The recipient's email id is taken from the database according to the user's command; if the user wants to send email to an unknown contact, the email id is taken from the user. The email content is taken as voice input from the user. Before sending the email, the assistant asks the user whether they want to add any document as an attachment; according to the user's input, the document is grabbed from the directory, or the user is asked to attach a file by giving its path. The email is then sent using the SMTP SSL server. For sending emails, the user's Gmail account credentials are required to be set during the setup. Following is the programming code used for sending email.

Code for sending Email

def sendEmail(to, content, file_path=None):
    msg = EmailMessage()
    msg['From'] = str(getEmail_id())
    msg['To'] = str(to)
    msg.set_content(str(content))
    if file_path != None:
        with open(file_path, 'rb') as f:
            file_data = f.read()
            file_name = f.name  # name is an attribute, not a method
        msg.add_attachment(file_data, maintype='application',
                           subtype='octet-stream', filename=file_name)
    server = smtplib.SMTP_SSL("smtp.gmail.com", 465)
    server.ehlo()
    # server.starttls()
    server.login(str(getEmail_id()), str(getEmail_pass()))
    server.send_message(msg)
    server.quit()


4.3. Playing and downloading Videos from YouTube

For playing videos on YouTube, the urllib, webbrowser, and "pytube" modules are used. When the user asks the assistant to play a YouTube video, it extracts the search string from the command, fetches the HTML of the YouTube search results, filters the first video link out of that HTML, and then plays the video in the web browser through this link. The same link is used for downloading the video from YouTube using the pytube module. The playing and downloading of YouTube videos is done with the following program code.

Code for playing and downloading

def getYTSearchString(s):
    return s.replace(' ', '+')

# Play the YouTube video of the given title
def playYoutube(videoTitle):
    search_keyword = getYTSearchString(videoTitle)
    html = urllib.request.urlopen(
        "https://www.youtube.com/results?search_query=" + search_keyword)
    video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
    url = "https://www.youtube.com/watch?v=" + video_ids[0]
    ytvt = YouTube(url)
    print(ytvt.title)
    global TEMPRORY_VARIABLE_youtubePlayingVideoLink
    TEMPRORY_VARIABLE_youtubePlayingVideoLink = url
    time.sleep(1)
    webbrowser.open(url)

# Download the YouTube video of the given link in another thread
def downloadYoutubeVideo(link):
    yt = YouTube(link)
    # Showing details
    print("Title: ", yt.title)
    print("Number of views: ", yt.views)
    print("Length of video: ", yt.length)
    print("Rating of video: ", yt.rating)
    # Getting the highest resolution possible
    ys = yt.streams.get_highest_resolution()
    # Starting download
    loc = getDefaultDirectoryPath(defaultDirectories[1])
    try:
        if loc == None:
            print('Directory not set')
            speak('directory not set for downloads')
        else:
            print("Downloading...")
            ys.download(str(loc))


            print("Download completed!!")
    except Exception as e:
        print(e)

4.4. Searching on Wikipedia

The wikipedia module is used to get the search result from Wikipedia, and that content is provided to the user: the assistant speaks the search results. The search keyword is extracted from the query made by the user, and that keyword is searched through the wikipedia module, which returns the search result in text form. This text is spoken by the speak function defined in the program.
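The keyword-extraction step can be sketched as below. The trigger phrase and helper name are illustrative assumptions, not the paper's exact code.

```python
# Hypothetical extraction of the search keyword from a recognized query:
# everything after the trigger phrase is treated as the keyword.
def extractSearchKeyword(query: str, trigger: str = "search for") -> str:
    query = query.lower()
    if trigger in query:
        return query.split(trigger, 1)[1].strip()
    return query.strip()

print(extractSearchKeyword("jarvis search for alan turing"))  # alan turing
```

The extracted keyword would then be passed to the wikipedia module's summary lookup, and the returned text handed to the speak function.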

4.5.Taking Screenshot

The screenshot is taken using pyautogui, then converted to a numpy array and to BGR so that it can be saved to storage. Following is the Python code for taking a screenshot:

Code for screenshot

import cv2
import numpy as np
import pyautogui

image = pyautogui.screenshot()
# convert it to a numpy array and BGR
# so we can write it to the disk
image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
# writing it to the disk using opencv
cv2.imwrite("image1.png", image)

4.6. Searching on the Web Browser
Searching on the web browser is done using the urllib and webbrowser modules. Based on the user's input, the search query URL is generated and opened with the web browser.
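A minimal sketch of that step follows; the search engine URL is an assumption, since the paper does not specify which one is used.

```python
import urllib.parse
import webbrowser

def buildSearchUrl(query: str) -> str:
    # percent-encode the query so spaces and symbols are URL-safe
    return "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)

url = buildSearchUrl("voice control desktop assistant")
print(url)
# webbrowser.open(url)  # would open the URL in the user's default browser
```

quote_plus handles the encoding, turning spaces into '+' and escaping any characters that are not valid in a URL query string.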

5. Conclusion
The voice assistant is a small work toward automating user tasks through human-machine interaction. It is designed to minimize the human effort needed to interact with many other subsystems that would otherwise have to be operated manually. With more integration of technologies like machine learning and Natural Language Processing, the system can make human life more comfortable and perform more complex tasks with automation. The advantage of a voice desktop assistant is that it performs tasks with less human effort: there is less physical interaction with the computer, and tasks can run in parallel with the user's other activities. The limitation of this study is that, while running in parallel with other applications, the assistant could affect their performance. Furthermore, it requires the system microphone, and a noisy environment can disturb its working. It provides many applications: it can perform the user's daily tasks and is helpful for people with vision disabilities. In the future, it can be integrated with deep learning and Natural


Language Processing methods to perform tasks with more flexibility in the language and queries given to it.

References

[1] Hector Perez-Meana, "Advances in Audio and Speech Signal Processing: Technologies and Applications", US, Idea Group Publishing, 2007.
[2] Jinyu Li, Li Deng, Reinhold Haeb-Umbach, Yifan Gong, "Robust Automatic Speech Recognition: A Bridge to Practical Applications", 2016.
[3] Dwivedi S, Dutta A, Mukerjee A and Kulkarni P, "Development of a speech interface for control of a biped robot," Proc. 2004 IEEE International Workshop on Robot and Human Interactive Communication, IEEE Press, Sept. 2004, pp. 601-605.
[4] B. Suhm, B. Myers and A. Weibel, "Multimodal Error Correction for Speech User Interfaces", ACM Transactions on Computer-Human Interaction, 8(1), pp. 60-98, doi:10.1145/371127.371166.
[5] Van Rossum, Guido, "An Introduction to Python for UNIX/C Programmers", Proceedings of the NLUUG Najaarsconferentie (Dutch UNIX Users Group), pp. 1-8, 1993.
[6] Kuhlman, Dave, "A Python Book: Beginning Python, Advanced Python, and Python Exercises", 23 June 2012.
[7] Guttag, John V., "Introduction to Computation and Programming Using Python: With Application to Understanding Data", MIT Press, ISBN 978-0-262-52962-4, 12 August 2016.
[8] Allen, Grant; Owens, Mike, "The Definitive Guide to SQLite" (2nd ed.), Apress, p. 368, ISBN 978-1-4302-3225-4, November 5, 2010.
[9] Kreibich, Jay A., "Using SQLite" (1st ed.), O'Reilly Media, p. 528, ISBN 978-0-596-52118-9, August 17, 2010.
[10] Newman, Chris, "SQLite (Developer's Library)" (1st ed.), Sams, p. 336, ISBN 0-672-32685-X, November 9, 2004.
[11] Hinegardner, Jeremy, "Skype client using SQLite?", sqlite-users (mailing list), August 28, 2007.
