Edinburgh Research Explorer

Studio report: Linux audio for multi-speaker natural speech technology.

Citation for published version: Fox, C, Christensen, H & Hain, T 2012, Studio report: Linux audio for multi-speaker natural speech technology. in Linux Audio Conference 2012 Proceedings. CCRMA, Stanford University.


Document Version: Peer reviewed version

Published In: Linux Audio Conference 2012 Proceedings


Studio report: Linux audio for multi-speaker natural speech technology

Charles FOX, Heidi CHRISTENSEN and Thomas HAIN
Speech and Hearing, Department of Computer Science, University of Sheffield, UK

charles.fox@sheffield.ac.uk

Abstract

The Natural Speech Technology (NST) project is the UK's flagship research programme for speech recognition research in natural environments. NST is a collaboration between Edinburgh, Cambridge and Sheffield Universities; public sector institutions the BBC, NHS and GCHQ; and companies including Nuance, EADS, Cisco and Toshiba. In contrast to assumptions made by most current commercial speech recognisers, natural environments include situations such as multi-participant meetings, where participants may talk over one another, move around the meeting room, make non-speech vocalisations, and all in the presence of noises from office equipment and external sources such as traffic and people outside the room. To generate data for such cases, we have set up a meeting room / recording studio equipped to record 16 channels of audio from real-life meetings, as well as a large computing cluster for audio analysis. These systems run on free, Linux-based software, and this paper gives details of their implementation as a case study for other users considering Linux audio for similar large projects.

Keywords

Studio report, case study, speech recognition, diarisation, multichannel

1 Introduction

The speech recognition community has evolved into a niche distinct from general computer audio, and Linux audio in particular. It has its own large collection of tools, some of which have been developed continually for over 20 years, such as the HTK Hidden Markov Model ToolKit [Young et al., 2006]¹. We believe there could be more crosstalk between the speech and Linux audio worlds, and to this end we present a report of our experiences in setting up a new Linux-based studio for dedicated natural speech research. In contrast to assumptions made by current commercial speech recognisers such as Dragon Dictate, natural environments include situations such as multi-participant meetings [Hain et al., 2009], where participants may talk over one another, move around the meeting room, make non-sentence utterances, and all in the presence of noises from office equipment and external sources such as traffic and people outside the room. The UK Natural Speech Technology project aims to explore these issues, and their applications to scenarios as diverse as automated TV programme subtitling; assistive technology for disabled and elderly health service users; automated business meeting transcription and retrieval; and homeland security.

The use of open source software is practically a prerequisite for exploratory research of this kind, as it is never known in advance which parts of existing systems will need to be opened up and edited in the course of research. The speech community generally works on offline, statistical, large data-set based research. For example, corpora of 1000 hours of audio are not uncommon and require the use of large compute clusters to process them. These clusters already run Linux and HTK, so it is natural to extend the use of Linux into the audio capture phase of research. As speech research progresses from clean to natural speech, and from offline to real-time processing, it is becoming more integrated with general sound processing [Wolfel and McDonough, 2009], for example developing tools to detect and classify sounds as precursors to recognition. The use of Bayesian techniques in particular emphasises the advantages of considering the sound processing and recognition as tightly coupled problems, and using tightly integrated computer systems. For example, it may be useful for Linux cluster machines running HTK in real-time to use high-level language models to generate Bayesian prior beliefs for low-level sound processing occurring in Linux audio.

¹ Currently owned by Microsoft; source available gratis but not libre. Kaldi is a libre alternative currently under development (kaldi.sourceforge.net).
This paper provides a studio report of our initial experiences setting up a Linux-based studio for NST research. Our studio is based on a typical meeting room, where participants give presentations and hold discussions. We hope that it will serve as a self-contained tutorial recipe for other speech researchers who are new to the Linux audio community (and have thus included detailed explanations of relatively simple Linux audio concepts). It also serves as an example of the audio requirements of the natural speech research community, and as a case study of a successful Linux audio deployment.

2 Research applications

The NST project aims to use a meeting room studio, networked home installations, and our analysis cluster to improve recognition rates in natural environments, with multiple, mobile speakers and noise sources. We give here some examples of algorithms relevant to natural speech, and their requirements for Linux audio.

Beamforming and ICA are microphone-array based techniques for separating sources of audio signals, such as extracting individual speakers from mixtures of multiple speakers and noise sources. ICA [Roberts and Everson, 2001] typically makes weak assumptions about the data, such as assuming that the sources are non-Gaussian, in order to find a mixing matrix M which minimises the Gaussian-ness over time t of the latent sources vector x_t, from the microphone array time series vectors y_t, in y_t = M x_t. ICA can be performed with as few microphones as there are sound sources, but gives improved results as the number of microphones increases. Beamforming [Trees, 2002] seeks a similar output, but can include stronger physical assumptions, for example known microphone and source locations. It then uses expected sound wave propagation and interference patterns to infer the source waves from the array data. Beamforming is a high-precision activity, requiring sample-synchronous accuracy between recorded channels, and often using up to 64 channels of simultaneous audio in microphone arrays (see for example the NIST Mark-III arrays [Brayda et al., 2005]).

Reverberation removal has been performed in various ways, using single and multi-channel data. In multi-channel settings, sample-synchronous audio is again used to find temporal correlations which can be used to separate the original sound from the echoes. In the iPhone 4 this is performed with two microphones, but performance may increase with larger arrays [Watts, 2009].

Speaker tracking may use SLAM techniques from robotics, coupled with acoustic observation models, to infer positions of moving speakers in a room (e.g. [Fox et al., 2012], [Christensen and Barker, 2010]). This can be used in conjunction with beamforming to attempt retrieval of individual speaker channels from natural meeting environments, and again relies on large microphone arrays and sample-accurate recording.

Part of the NST project, called 'homeService', aims to provide a natural language interface to electrical and electronic devices, and digital services, in people's homes. Users will mainly be disabled people with conditions affecting their ability to use more conventional means of access such as keyboard, computer mouse, remote control and power switches. The assistive technology (AT) domain presents many challenges; of particular consequence for NST research is the fact that users in need of AT typically have physical disabilities associated with motor control, and such conditions (e.g. cerebral palsy) will also often affect the musculature surrounding the articulatory system, resulting in slurred and less clear speech, known as dysarthric speech.

3 Meeting room studio setup

Our meeting room studio, shown in fig. 1, is used to collect natural speech and training data from real meetings. It is centred on a six-person table, with additional chairs around the walls for around a further 10 people. It has a whiteboard at the head of the table, and a presentation projector. Typical meetings involve participants speaking from their chairs but also getting up and walking around to present or to use the whiteboard.
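The instantaneous mixing model y_t = M x_t of sec. 2 can be made concrete with a toy example. The sketch below (standard Python, with an invented 2×2 mixing matrix and random sources) applies a known M and recovers the sources by inverting it; a real ICA would instead have to estimate such an unmixing matrix blindly, from the non-Gaussianity of the recorded mixtures.

```python
# Toy illustration of the mixing model y_t = M x_t (2 sources, 2 mics).
# The matrix M and the source signals are invented for illustration;
# this is NOT an ICA implementation, which must estimate M blindly.
import random

def mix(M, x):
    """Apply 2x2 mixing matrix M to a source vector x = (x1, x2)."""
    return (M[0][0]*x[0] + M[0][1]*x[1],
            M[1][0]*x[0] + M[1][1]*x[1])

def unmix(M, y):
    """Recover x from y by inverting M (possible here only because M is known)."""
    det = M[0][0]*M[1][1] - M[0][1]*M[1][0]
    return ((M[1][1]*y[0] - M[0][1]*y[1]) / det,
            (-M[1][0]*y[0] + M[0][0]*y[1]) / det)

random.seed(0)
M = ((1.0, 0.5), (0.3, 1.0))   # assumed mixing matrix
sources = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5)]
mics = [mix(M, x) for x in sources]        # what the microphone pair records
recovered = [unmix(M, y) for y in mics]
for x, xr in zip(sources, recovered):
    assert abs(x[0] - xr[0]) < 1e-9 and abs(x[1] - xr[1]) < 1e-9
```

With more microphones than sources the same model becomes overdetermined, which is one reason larger arrays improve separation results.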
A 2×2.5m aluminium frame is suspended from the ceiling above the table and used for mounting audio equipment. Currently this consists of eight AKG C417/III vocal condenser microphones, arranged in an ellipse around the table perimeter. A further eight AKG C417/IIIs are embedded in a 100mm radius cylinder placed in the table centre, to act similarly to an eight-channel multi-directional tele-conferencing recorder. The table can also include three 7-channel DevAudio Microcones (www.dev-audio.com), which are commercial products performing a similar function. The Microcone is a 6-channel microphone array which comes with proprietary drivers and an API. A further, 7th, audio channel contains a mix of the other 6 channels, as well as voice activity detection and sound source localisation annotation. Some noise reduction and speech enhancement capabilities are provided, although details of the exact processing are not made public.

There are four Sennheiser ew100 wireless headsets which may be mounted on selected participants to record their speech directly. The studio will soon also include a Radvision Scopia XT1000 videoconferencing system, comprising a further source-tracking steered microphone and two 1.5m HD presentation screens. In the four upper corners of the meeting room are mounted Ubisense infrared active badge receivers, which may be used to track the 3D locations of 15 mobile badges worn by meeting participants. (The university also has a 24-channel surround sound diffusion system used in an MA Electroacoustic Music course [Mooney, 2005], which may be useful for generating spatial audio test sets.)

Sixteen of the mics are currently routed through two MOTU 8Pre interfaces, which take eight XLR or line inputs each. Both currently run at 48kHz but can operate up to 96kHz. The first of these runs in A/D conversion mode and sends all 8 digitised channels via a single ADAT Lightpipe fibre optic cable to the second 8Pre. The second 8Pre receives this input along with its own eight audio channels, and outputs all 16 channels to the DAW by Firewire 400 (IEEE 1394, 400 Mbit/s). (The two boxes must be configured to (a) have the same speed, (b) use the 1x ADAT protocol, and (c) be in converter mode and interface mode respectively.) Further firewire devices will be added to the bus later to accommodate the rest of the microphones in the room.

Figure 1: Meeting room recording setup. The boxes on the far wall are active-badge trackers. The frame on the ceiling and the black cylinder on the table each contain eight condenser microphones. Participants wear headsets and active badges.
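To make the recorded format concrete, the following sketch (Python standard library only; the file name is hypothetical, and 16-bit samples are used here for simplicity although sessions can be recorded at 32-bit) writes and inspects a one-second, 16-channel, 48kHz WAV file of the shape produced by routing both 8Pres into one session.

```python
# Sketch: write and inspect a 16-channel, 48 kHz WAV file, matching the
# channel count and sample rate of the two-8Pre setup. The file name is
# hypothetical; real sessions are recorded in Ardour, not like this.
import math
import struct
import wave

CHANNELS, RATE, SECONDS = 16, 48000, 1

with wave.open("meeting_test.wav", "wb") as w:
    w.setnchannels(CHANNELS)
    w.setsampwidth(2)          # 16-bit samples (for simplicity)
    w.setframerate(RATE)
    for n in range(RATE * SECONDS):
        # One interleaved frame: a different test tone per channel.
        frame = [int(10000 * math.sin(2 * math.pi * (220 + 10 * c) * n / RATE))
                 for c in range(CHANNELS)]
        w.writeframes(struct.pack("<%dh" % CHANNELS, *frame))

with wave.open("meeting_test.wav", "rb") as r:
    assert r.getnchannels() == 16 and r.getframerate() == 48000
    print(r.getnframes())      # 48000 frames = 1 second
```

Each WAV frame holds one sample per channel, so sample-synchrony across the 16 channels is implicit in the file layout; keeping the capture chain sample-synchronous is the harder, hardware-side problem discussed above.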
4 Linux audio system review

Fig. 2 outlines the Linux audio system and highlights the parts of the many possible Linux audio stacks that are used for recording in the meeting room studio. The Linux audio architecture has grown quite complex in recent years, so is reviewed here in detail.

OSS (Open Sound System) was developed in the early 1990s, focused initially on Creative SoundBlaster cards then extending to others. It was a locking system, allowing only one program at a time to access the sound device, and lacked support for features such as surround sound. It allowed low-level access to the card, for example by cat audio.wav > /dev/dsp0. ALSA (Advanced Linux Sound Architecture) was designed to replace OSS, and is used on most current distributions including our Ubuntu Studio 11.10. PortAudio is an API with backends that abstract both OSS and ALSA, as well as sound systems of non-free platforms such as Win32 sound and Mac CoreAudio, created to allow portable audio programs to be written.

Several software mixer systems were built to resolve the locking problem for consumer-audio applications, including PulseAudio, ESD and aRts. Some of these mixers grew to take advantage of, and to control, hardware mixing provided by sound cards, and provided additional features such as network streaming. They provided their own APIs as well as emulation layers for older (or mixer-agnostic) OSS and ALSA applications. (To complicate matters further, recent versions of OSS4 and ALSA have now begun to provide their own software mixers, as well as emulation layers for each other.) Many current Linux distributions including Ubuntu 11.10 deploy PulseAudio running on ALSA, and also include an ALSA emulation layer on Pulse to allow multiple ALSA and Pulse applications to run together through the mixer. Media libraries such as GStreamer (which powers consumer-audio applications such as VLC, Skype and Flash) and libcanberra (the GNOME desktop sound system) have been developed closely with PulseAudio, increasing its popularity. However, Pulse is not designed for pro-audio work, as such work requires very low latencies and minimal drop-outs.

The JACK system is an alternative software mixer for pro-audio work. Like the other soft mixers, JACK runs on many lower-level platforms – usually ALSA on modern Linux machines. The bulk of pro-audio applications such as Ardour, zynAddSubFx and qSynth run on JACK. JACK also provides network streaming, and emulations/interfaces for other audio APIs including ALSA, OSS and PulseAudio. (Pulse-on-JACK is useful when using pro and consumer applications at the same time, such as when watching a YouTube tutorial about how to use a pro application. This re-configuration happens automatically when JACK is launched on a modern Pulse machine such as Ubuntu 11.10.)

Figure 2: Audio system for recording.

5 Software setup

Our DAW is a relatively low-power Intel E8400 (Wolfdale) duo-core, 3GHz, 4Gb Ubuntu Studio 11.10 64-bit machine. Ubuntu Studio was installed directly from CD – not added as packages to an existing Ubuntu installation – as this gives a more minimalist installation than the latter approach. In particular, the window manager defaults to the low-power XFCE, and resource-intensive programs such as Gnome-Network-Monitor (which periodically searches for new wifi networks in the background) are not installed. Ubuntu was chosen for compatibility and familiarity with our other desktop machines. (Several other audio distributions are available, including PlanetCCRMA (Fedora), 64Studio (Debian) and ArchProAudio (Arch).)

The standard ALSA and OSS provide interfaces to USB and PCI devices below, and to JACK above. However, for firewire devices such as our 8Pre, the ffado driver provides a direct interface to JACK from the hardware, bypassing ALSA or OSS. (Though the latest/development version provides an ALSA output layer as well.) Our DAW uses ffado with JACK2 (Ubuntu packages: jackd2, jackd2-firewire, libffado, jackd, laditools; JACK1 is the older but perhaps more stable single-processor implementation of the JACK API), and fig. 3 shows our JACK settings in the qjackctl tool. The firewire backend driver (ffado) is selected rather than ALSA.

We found it useful to unlock memory for good JACK performance.² As well as ticking the unlock memory option, the user must also be allowed to use it, e.g. adduser charles audio. Also the file /etc/security/limits.d/audio.conf was edited (followed by a reboot) to include

@audio - rtprio 95
@audio - memlock unlimited

These settings can be checked by ulimit -r -l. The JACK sample rate was set to 48kHz, matching the 8Pres. (This is a good sample rate for speech research work, as it is similar to CD quality but allows simple sub-sampling to power-of-two frequencies used in analysis.)

Fig. 4 shows the JACK connections (again in qjackctl) for our meeting room studio setup. The eight channels from the converter-mode 8Pre appear as ADAT optical inputs, and the eight channels from the interface-mode 8Pre appear as 'Analog' inputs, all within the firewire device. Ardour was used with two tracks of eight-channel audio to record, as shown in fig. 5.

Figure 3: 16 channel recording JACK settings.

² By default, JACK locks all memory of its clients into RAM, i.e. tells the kernel not to swap their pages to virtual memory on disc; see mlock(2). Unlock memory relaxes this slightly, allowing just the large GUI components of clients to perform swapping, leaving more RAM free for the audio parts of clients.
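The same limits that ulimit -r -l reports can also be checked programmatically. This sketch (Python standard library only) reads them via the resource module; the values depend on each machine's /etc/security/limits.d configuration, so it only reports what it finds rather than asserting particular numbers.

```python
# Sketch: query the locked-memory and real-time-priority limits that JACK
# relies on (the same values `ulimit -l` and `ulimit -r` report).
# Values are machine-dependent, so we report rather than assert them.
import resource

def show(limit):
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)

soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("memlock soft/hard:", show(soft), show(hard))

# RLIMIT_RTPRIO is Linux-only; guard for portability.
if hasattr(resource, "RLIMIT_RTPRIO"):
    rt_soft, rt_hard = resource.getrlimit(resource.RLIMIT_RTPRIO)
    print("rtprio soft/hard:", show(rt_soft), show(rt_hard))
```

If memlock shows a small finite value, JACK clients may fail to start with memory-locking errors even when the qjackctl settings look correct.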
Figure 4: 16 channel recording JACK connections.

5.1 Results

Using this setup we were able to record simultaneously from six overhead microphones, eight table-centre microphones, and two wireless headsets, as illustrated in fig. 5. We experienced no JACK xruns in a ten minute, 48kHz, 32-bit, 16-channel recording, and the reported JACK latency was 8ms. A one hour meeting recording with the same settings experienced only 11 xruns. Total CPU usage was below 25% at all times, with top listing the following typical total process CPU usages: jack 11%, ardour 8%, jack.real 3%.

However, we were unable to play audio back through the 8Pres, hearing distorted versions of the recording. For our speech recognition research this is relatively unimportant, and can be worked around by streaming the output over the network with JACK and playing back on a second machine. The ffado driver's support for the 8Pre hardware is currently listed as 'experimental', so work is needed here to fix this problem.

The present two-8Pre system is limited to 16 audio channels; we plan to extend it with further firewire devices to record from more audio sources around the meeting room and from tele-conferencing channels in future. We have not yet needed to make further speed optimisations, but we note that for future, more-channel systems, two speedups include disabling PulseAudio (adding pulseaudio --kill and pulseaudio --start to qjackctl's startup and shutdown options is a simple way to do this), and installing the real-time rt-linux kernel.

Figure 5: 16 channel meeting room recording in Ardour, using two 8-channel tracks.

5.1.1 Audio analysis

Analysis of our audio data is performed on a compute cluster of 20 Linux nodes with 96 processor cores in total, running the Oracle (formerly Sun) Grid Engine, the HTK Hidden Markov Model Tool Kit [Young et al., 2006] and the Juicer recognition engine [Moore et al., 2006]. During analysis, audio playback on desktop machines is useful and is done with the setup of fig. 6. For direct audio connections the HTK tools make use of the OSS sound system, which may be emulated on PulseAudio on a modern machine by installing the padsp tool (Ubuntu package pulseaudio-utils 0.9.10-1ubuntu1 i386) then prefixing all audio HTK commands with padsp. Similarly, Juicer allows the use of an OSS front-end; the implementation of a JACK plugin is in progress.

The cluster may be used both for online and offline processing. An example of online processing can be found in the description of an online transcription system for meetings [Garner et al., 2009]. Such systems distinguish between far- and near-field audio sources and employ beamforming and echo cancellation procedures [Hain et al., 2009], for which either sample synchronicity or at least minimal latency between channels is of utmost importance. For both offline and online processing, audio at 16kHz/16bit is typically required, to be converted into so-called feature streams (for example Perceptual Linear Predictive (PLP) features [Hermansky, 1990]) of much lower bit rate. Even higher sampling frequencies (up to 96kHz are commonplace) are often required for cross-correlation based sound source localisation algorithms, to provide sufficient time resolution to detect changes in angles down to one degree. In offline mode the recognition of audio typically operates at 10 times real-time (i.e. 1 hour of audio in many channels takes 10 hours to process). However, the grid framework allows the latency of processing to drop to close to 1.5 times real-time using massive parallelisation.
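The one-degree figure can be checked with a small calculation. This sketch (standard library only; the 0.2m microphone spacing is an illustrative assumption, not a measurement from our array) converts one sample period at 96kHz into the corresponding broadside angular resolution of a two-microphone delay estimate.

```python
# Sketch: angular resolution of cross-correlation delay estimation.
# Assumptions (illustrative, not from our hardware): speed of sound
# c = 343 m/s, microphone spacing d = 0.2 m, sample rate 96 kHz.
# The inter-mic delay for a far-field source at angle theta from
# broadside is tau = d * sin(theta) / c, so one sample period dt
# corresponds to roughly dtheta = c * dt / d radians near broadside.
import math

c = 343.0           # speed of sound, m/s
d = 0.2             # assumed microphone spacing, m
rate = 96000        # sample rate, Hz
dt = 1.0 / rate     # one sample period, s

dtheta_deg = math.degrees(c * dt / d)
print(round(dtheta_deg, 2))   # ≈ 1.02 degrees, consistent with the claim
```

At 48kHz the same geometry gives roughly twice that, which is why localisation work pushes the sample rate higher than recognition itself needs.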

Figure 6: Audio system for data analysis.

6 Future directions

The ultimate goal of NST is to obtain transcriptions of what was said in natural environments. Traditionally, the source separation and denoising techniques of sec. 2 have been treated as a separate preprocessing step, before the cleaned audio is passed to a separate transcriber such as HTK. However, for challenging environments this is suboptimal, as it reports only a single estimate of the denoised signal rather than Bayesian information about its uncertainty. Future integrated systems could fuse the predictions from transcription language models with inference and reporting of low-level audio data, for example by passing real-time probabilistic messages between HTK's transcription inference (on a Linux computing cluster) and low-level audio processing (on desktop or embedded Linux close to the recording hardware).

For training and testing speaker location tracking systems it is useful to build a database of known speaker position sequences, which need to be synchronised to the audio. Positions change on the order of seconds, so it is wasteful to use audio channels to record them.
However, we note that JACK is able to record MIDI information alongside audio, and one possibility would be to encode positions from our Ubisense active badges as MIDI messages, synchronous with the audio, and record them together, for example in Ardour 3. It could also be useful to – somehow – synchronise video of meetings to JACK audio.

The homeService system has two major parts: hardware components that are deployed in people's homes during trials (an Android tablet, a Microcone, and a Linux box responsible for audio capture, network access, and infrared and Bluetooth communication), and our Linux computing cluster back at the university, running the processes with particularly high demands in terms of processing power and memory usage. The main audio capturing will take place on the Linux box in the users' homes, and we plan to develop a setup which will enable the Microcone and JACK to work together and provide – if needed – live streaming of audio over the network.

The main requirements of NST research for Linux audio are support for sample-synchronous, many-channel (e.g. 128 or more) recording, and communication with cluster computing and speech tools such as HTK. As natural speech technology makes closer contact with signal processing research, we expect to see more speech researchers moving to Linux audio in the near future, and we hope that this paper has provided some guidance for those who wish to make this move, as well as a guide for the Linux audio community about what technologies are important to this field.

Acknowledgements

The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).

References

L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, and P. Svaizer. 2005. Modifications on NIST MarkIII array to improve coherence properties among input signals. In Proc. of 118th Audio Engineering Society Conv.

H. Christensen and J. Barker. 2010. Speaker turn tracking with mobile microphones: combining location and pitch information. In Proc. of EUSIPCO.

C. Fox, M. Evans, M. Pearson, and T. Prescott. 2012. Tactile SLAM with a biomimetic whiskered robot. In Proc. ICRA.

P. N. Garner, J. Dines, T. Hain, A. el Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V. Wan, and L. Zhang. 2009. Real-time ASR from meetings. In Proc. Interspeech, pages 2119–2122.

T. Hain, L. Burget, J. Dines, P. N. Garner, A. el Hannani, M. Huijbregts, M. Karafiat, M. Lincoln, and V. Wan. 2009. The AMIDA 2009 meeting transcription system. In Proc. Interspeech.

H. Hermansky. 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoustical Society of America, pages 1738–1752.

J. Mooney. 2005. Sound Diffusion Systems for the Live Performance of Electroacoustic Music. Ph.D. thesis, University of Sheffield.

D. Moore, J. Dines, M. Magimai-Doss, J. Vepa, O. Cheng, and T. Hain. 2006. Juicer: A weighted finite state transducer speech decoder. In Machine Learning for Multimodal Interaction, pages 285–296. Springer-Verlag.

S. Roberts and R. Everson, editors. 2001. Independent Component Analysis: Principles and Practice. Cambridge University Press.

H. L. Van Trees. 2002. Optimum Array Processing. Wiley.

L. Watts. 2009. Reverberation removal. United States Patent 7,508,948.

M. Wolfel and J. McDonough. 2009. Distant Speech Recognition. Wiley.

S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al. 2006. The HTK Book.