Improving Automatic Speech Recognition for Mobile Learning of Mathematics Through Incremental Parsing
Marina ISAAC a, Eckhard PFLUEGEL a, Gordon HUNTER a, James DENHOLM-PRICE a, Dilaksha ATTANAYAKE a and Guillaume COTER b

a Kingston University, London, U.K.
b Ecole des Mines de Douai, Douai, France

Abstract. Knowledge of at least elementary mathematics is essential for success in many areas of study and a large number of careers. However, many people find this subject difficult, partly due to its specialised language and notation. These problems tend to be even more marked for people with disabilities, including blindness or impaired vision, and people with limited motor control or use of their hands. Learning activities are increasingly carried out in mobile computing environments, replacing the roles of textbooks or handwritten lecture notes. In the near future we expect the modality of use of mobile devices to shift from predominantly output to both input and output. The generally poor performance and usability of text input on these devices motivates the quest for alternative input technologies. One viable alternative input modality is automatic speech recognition, which has matured significantly over the last decade. We discuss tools that allow the creation and modification of mathematical content using speech, which could be of particular benefit to the groups of disabled people mentioned above. We review previously proposed systems, including our own software for use on desktop PCs, and conclude that there is currently a lack of tools implementing these requirements to a satisfactory level. We address some technical challenges that need to be solved, including the problem of defining and parsing suitable languages for spoken mathematics, and focus on the parsing of structured mathematical input, highlighting the importance of incrementality in this process. We believe our incremental parsing algorithm for expressions in operator precedence grammar languages will improve upon previous algorithms for such tasks in terms of performance.

Keywords. Learning mathematics, mobile learning, disabled students, Automatic Speech Recognition, parsing

Introduction

Learning is increasingly enhanced by technology, and the surge in use of mobile devices such as tablet PCs and smartphones offers new learning opportunities [1]. Students benefiting from mobile learning technology include, for example, online (distance) learners, people working or studying "on the move", and sometimes disadvantaged users such as people with physical disabilities or other special needs [2].

Until relatively recently, automatic speech recognition was only available to a small group of highly specialised scientists. Even around twenty-five years ago, using what was then considered state-of-the-art speech recognition software required highly expensive equipment. Today, even hand-held devices such as iOS or Android smartphones, tablet PCs and game consoles such as the Xbox and Nintendo Wii are often equipped with some form of speech recognition software. The reasons behind this rapid evolution are the tremendous progress made in speech recognition technology, the advances in the memory and processing power of personal computers, and the fact that complex and computationally intensive tasks can now be carried out with the help of powerful cloud-based applications.
At present, speech recognition is used in a large number of application domains. As a consequence, speech input plays an increasingly important role in making computer tasks accessible to users who wish, or need, to rely on input modalities other than the conventional keyboard and mouse. However, despite the great importance of Mathematics within the curriculum, both at school level and in higher education, and for specialists and non-specialists alike (including students of Science, Engineering, Economics and Business subjects), the development of such technology for making Mathematics more accessible, to disabled and other users, lags a long way behind its use in more conventional, alphabetic text-based subject disciplines.

One of the main motivations behind our work is to make speech-driven systems as suitable as possible for users who may wish to engage in hands-free computing, and in particular to facilitate mobile learning of Mathematics. This is realised through our TalkMaths project [3], which has delivered several prototype systems and interfaces, as described in Section 2, the latest under development being a mobile app.

The main contribution of this paper, presented as part of a progress report on recent activities within the TalkMaths project, is a high-level description of an efficient implementation of incremental processing of speech input, commonly referred to as incremental parsing. Our algorithm, an improvement on an existing algorithm by Heeman, consumes less battery power and provides a more responsive interface, and is hence well suited to a mobile version of the existing TalkMaths system.

This paper is organised as follows: in Section 1, after briefly reviewing existing commercial and freely available automatic speech recognition tools, we discuss past work on speech-driven editors for mathematics and the challenge of defining spoken mathematics. In Section 2, we present the current architecture of the TalkMaths system and the parsing technique it uses. Section 3 describes the main contribution of this paper. We present our conclusions and propose future developments in Section 4.

1. Interfacing Mathematics using Speech Technology

In this section, we discuss issues that arise when creating mathematical content using speech as input. We review the state of the art in automatic speech recognition technology and discuss approaches to formalised spoken mathematics, including, in more detail, the language used in the TalkMaths system.

1.1. Review of Speech Recognition Technologies

Automatic speech recognition (ASR) is the process of converting human spoken words to human- or computer-readable text [4]. Most ASR systems are speaker dependent, i.e. they need to be trained by a particular user in order to provide the best and most accurate recognition of that user's speech.

There are several commercial ASR packages available on the current market, the most widely available ones being sold by Nuance and Microsoft. With claims of 99% word accuracy under good conditions, Nuance Dragon NaturallySpeaking (DNS) [5] is the market leader. Originally developed for Windows, recent versions of DNS also run on Mac OS X. Microsoft has included Windows Speech in Windows 7 onwards, and its speech recognition solution is likely to be further improved in future versions of its operating systems. At present, it remains unclear which of these two main players will eventually dominate the market.
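To make the role of such off-the-shelf recognisers concrete, the following minimal sketch shows one way an application could obtain a plain-text transcript from a general-purpose ASR backend and hand it on to a downstream component, such as a parser for spoken mathematics. It is purely illustrative and is not part of the TalkMaths system; the third-party Python SpeechRecognition package, the choice of the recognize_google backend and the parse_spoken_mathematics placeholder are assumptions made only for this example.

```python
# Illustrative sketch only (not the TalkMaths implementation): capture one spoken
# utterance with a general-purpose ASR backend and pass the raw transcript to a
# downstream parser. Assumes the third-party "SpeechRecognition" package and a
# working microphone.
import speech_recognition as sr


def parse_spoken_mathematics(transcript: str) -> str:
    """Hypothetical placeholder for a spoken-mathematics parser."""
    # A real system would parse the transcript against a spoken-mathematics
    # grammar; here we simply echo the recognised text back.
    return transcript


def dictate_once() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate to background noise
        audio = recognizer.listen(source)            # record a single utterance
    try:
        # One possible backend among several; an offline engine could equally be used.
        return recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError):
        return ""  # nothing usable was recognised


if __name__ == "__main__":
    spoken = dictate_once()
    print(parse_spoken_mathematics(spoken))
```

The point of the sketch is that the recogniser delivers only plain text; everything specific to mathematics, including the structured, incremental parsing discussed in Sections 2 and 3, happens downstream of this step.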
A free alternative to the commercial products is Sphinx, an open-source toolkit for speech recognition [6] developed at Carnegie Mellon University. With Sphinx, additional data such as prosodic and intonation information can be captured. Currently, its recognition accuracy is inferior to that of the commercial ASR systems.

Increasingly, vendors are adding speech support to their software. For example, Google has enabled speech recognition in its Chrome web browser, designed specifically for speech-enabled applications such as "Google Voice Search" [7]. Mobile devices and gaming consoles in particular are widely speech-enabled, with speech-controlled assistants offered by Google and Apple.

Complementary to ASR technology is speech synthesis (or Text-To-Speech, TTS) technology, whereby a computer system tries to create a comprehensible set of sounds approximating the way a person would read a particular word, or sequence of words, aloud. However, a detailed discussion of TTS technology is beyond the scope of this paper, and we refer the reader to [8] for further details.

1.2. Previous Work and Overview of Existing Systems

We are aware of a number of systems that translate spoken mathematical input to different output formats and display the structure of the mathematical expressions. These work together with ASR systems installed on the user's machine. MathTalk [9] is a commercially available system that implements speech macros for use within the Scientific Notebook environment. Its functionality, even when compared with the other academic prototype systems mentioned below, is quite limited. Fateman has undertaken work leading to Math Speak & Write [20], a multimodal approach combining spoken input with handwriting recognition. Unfortunately, the ambitious aim of simultaneous multimodal input was not achieved. Another system is CamMath, described in [10], which requires the same support environment as MathTalk but appears to offer a better-developed command language.

Previous approaches to allowing spoken input of mathematics include the research prototype systems of Bernareggi & Brigatti [11] (which only works in Italian) and Hanakovič and Nagy [12] (restricted to use with the Opera web browser due to their use of X+V (XML + voice) technology).

While various mobile apps exist that parse mathematical input, entered either via an on-screen keyboard or by drawing symbols on the screen, none offer dictation as an input modality. Although speech-to-text input is widely available, its functionality conforms to standard ASR offerings, and does not handle specialised languages such as those needed