Proceedings of the Linux Audio Conference 2012

April 12th - 15th, 2012 Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, California

Published by CCRMA, Stanford University, California, US
April 2012
All copyrights remain with the authors
http://lac.linuxaudio.org/2012
ISBN 978-1-105-62546-6

Credits
Cover design: Fernando Lopez-Lezcano
Layout: Frank Neumann
Typesetting: LaTeX and pdfLaTeX
Logo Design: The Linuxaudio.org logo and its variations copyright Thorsten Wilms © 2006, imported into the "LAC 2012" logo by Robin Gareus

Thanks to: Martin Monperrus for his webpage "Creating proceedings from PDF files"

Printed in the US

Partners and Sponsors

CCRMA
Linux Weekly News
The Fedora Project
SiCa
Stanford University
LinuxAudio.org
CiTU

Foreword

Stanford, California, March 2012

A long time ago in a country far far away, a group of Linux Audio enthusiasts got together at LinuxTag 2002 in Karlsruhe/Germany, organized a Linux Audio Developers Booth there, and plotted future plans. Little did they know (or perhaps they did?) that their initiative would have staying power and rally new people from all over the world every year. Today, the Linux Audio Conference celebrates its 10th anniversary, hosted in the United States for the first time. Read the whole story and look at the pictures at http://lac.linuxaudio.org, and you will see that most of those early enthusiasts are still around, as well as new ones who have been joining us every year.

We at Stanford University’s Center for Computer Research in Music and Acoustics (CCRMA, pronounced “karma”) are very happy to be the hosts of LAC 2012. Bear with me as I tell you a bit of our history. CCRMA has been using and sharing Free Software as part of a collaborative environment for research and the arts since its beginning in the early 1970s. Mainframes were the only game in town and computer time for making music and creating new sound algorithms was so valuable that, at the beginning, it could only be borrowed at night in the Artificial Intelligence Lab at Stanford. Eventually CCRMA had its own shared computing monster and similarly-sized digital synthesizer (we called it the Samson Box), and much music and research came out of them. But in time, mainframes, minicomputers, and mammoth synths withered away and gave way around 1990 to a network of nimble NeXT computers and pure software sound making (with Unix under the hood! Alas, not a free and open hood...). The demise of that world around 1997 was followed by a gradual and determined transition to a fully free and open GNU/Linux computing environment. All thanks to a lot of hard work by hundreds of dedicated programmers, hackers, enthusiasts, and visionaries from all over the world who pieced together a complete, working and reliable computing environment seemingly out of nothing.

By now made up of millions and millions of lines of code, all completely free and open for the world to see, use, share and improve. Amazing, if you “think abyte it!”

By 2001, the packages I had created and maintained to populate our GNU/Linux workstations with free sound and music software became available through the Internet for anyone to download. You could work at home in an environment similar to the one CCRMA users took for granted every day. The collection of packages was named Planet CCRMA and was offered to the world in the spirit of sharing. Planet CCRMA took me to the LAD conference for the first time in 2003 (LAC was LAD at the time). I found it simply wonderful that I could match faces to email addresses, and get to know a little about many of the people that had made possible the freedom we were enjoying at CCRMA. After that first time, I’ve had the pleasure of attending LAC pretty much every year, save for once, when the party for the inauguration of the renovated CCRMA building landed on the same weekend as LAC 2006.

For some years many LAC’ers had asked the same question: when is CCRMA going to host LAC? It was the enthusiasm of Bruno Ruviaro and his promise to help that convinced me that it was possible. The timing was good, a fantastically energetic friend committed to

was offering to help, and we had encouragement from Chris Chafe (CCRMA’s Director, power Linux user) and the community. This LAC is going to be, for the first time, far away from its usual geographic “center of gravity,” so we will miss some friends for whom the trip is too long and expensive. But we are sure we will meet new faces and match them with email addresses, exchanging new ideas with lots of people that could not go to LAC before because it was too far away.

The geekiest amongst us may see no reason to give special attention to the number 10 (after all, it is not a power of two); still, let us celebrate the 10th anniversary of LAC with four days of talks and music, science and art, free software and free and open hardware, new reflections and, of course, a lot of food, beer and coffee. 25 technical and musical talks, 3 full concerts plus the Linux Sound Night (20+ pieces total) and a continuous 3D listening session for 17 additional pieces and sound installations, 5 workshops, etc, etc. Quite a program!

I’m glad to have an old friend and long-time Linux musician, Dave Phillips, as the keynote speaker this year. I’m sure he will captivate us with memories and prognostications, anecdotes and numbers. His book and his patient encouragement helped raise awareness of the possibilities of this evolving platform and provide a lot of the glue that holds the community together.

Many thanks to all those who made this conference a reality. Bruno Ruviaro especially; without him this would not have happened. And of course Chris Chafe and Jonathan Berger for their support and help in so many ways. Robin Gareus, amazing, fast, impossible to stop, for his help with the web site and organization. Frank Neumann for his (again!) amazing work in massaging TeX and figures, and creating a great book of proceedings. Julius Smith, Carr Wilkerson, Sasha Leitman, Jaroslaw Kapuscinski and many others at CCRMA who volunteered their efforts to make this a memorable event. Support for making this conference possible came from CCRMA itself, the Music Department at Stanford, SiCa (the Stanford Institute for Creativity and the Arts), the Fedora Project and other institutions and friends. Thanks to Jörn Nettingsmeier, Robin, and the streaming team helpers: if you are halfway around the world, you will be able to tune in and interact with most events remotely.

Here’s my toast for many many more years of better and better software. Hey developers: surprise us yet again with cool unpredictable programs! Change the world for the better! Hopefully by next year’s LAC we will all be wearing cheap, fast, 16-core powered, GNU/Linux free and completely open “communicator” devices on our wrists (Dick Tracy style, that was 1946!!), of course with full bandwidth sound and running a super low latency Mini-Jack sound server that will connect us to the world and to each other.

Fernando Lopez-Lezcano
LAC 2012 Conference Chair
http://lac.linuxaudio.org/2012/
https://ccrma.stanford.edu/

PS: and last but not least come and meet Ping in California: he will welcome you to his native habitat. He’s been traveling with me to many events and conferences all over the world since 2004, and he loves GNU/Linux! See what I mean here: http://ccrma.stanford.edu/~nando/ping/

Conference Organization

Fernando Lopez-Lezcano
Bruno Ruviaro
Robin Gareus

Live Streaming

Jörn Nettingsmeier Robin Gareus

Conference Design and Website

Robin Gareus

Concert Sound

Carr Wilkerson
Sasha Leitman
Bruno Ruviaro
Fernando Lopez-Lezcano

Paper Administration and Proceedings

Frank Neumann

Special Thanks to

Chris Chafe
Jonathan Berger
Julius Smith
Carr Wilkerson
Sasha Leitman
Jaroslaw Kapuscinski
Tricia Schroeter
Jay Kadis
Scott Kepley
Laura Dahl

...and to everyone else who helped in many ways after the editorial deadline of this publication!

Review Committee

Fons Adriaensen, Casa della Musica, Parma, Italy
Tim Blechmann, Vienna, Austria
Ico Bukvic, Virginia Tech, VA, United States
Götz Dipper, ZKM | Institute for Music and Acoustics, Karlsruhe, Germany
John ffitch, retired / U of Bath, United Kingdom
Robin Gareus, Linuxaudio.org, France
Dr. Albert Gräf, Johannes Gutenberg University Mainz, Germany
Fernando Lopez-Lezcano, CCRMA, Stanford, United States
Frank Neumann, PenguinNoises, Germany
Yann Orlarey, Grame, France
Jussi Pekonen, Finland
Dave Phillips, linux-sound.org, United States
Martin Rumori, Institute of Electronic Music and Acoustics, Graz, Austria
Bruno Ruviaro, CCRMA, Stanford, United States
Julius Smith, CCRMA, Stanford, United States
Michael Wilson, CCRMA, Stanford, United States
IOhannes Zmölnig, Institute of Electronic Music and Acoustics (IEM@KUG), Austria

Music Jury

Bruno Ruviaro
Fernando Lopez-Lezcano

Table of Contents

• The IEM Demosuite, a large-scale jukebox for the MUMUTH concert venue (p. 1)
  Peter Plessas, IOhannes Zmölnig

• Network distribution in music applications with Medusa (p. 9)
  Flávio Luiz Schiavoni, Marcelo Queiroz

• Luppp - A real-time audio looping program (p. 15)
  Harry van Haaren

• Csound as a Real-time Application (p. 21)
  Joachim Heintz

• Csound for Android (p. 29)
  Steven Yi, Victor Lazzarini

• Ardour3 - Video Integration (p. 35)
  Robin Gareus

• INScore - An Environment for the Design of Live Music Scores (p. 47)
  Dominique Fober, Yann Orlarey, Stéphane Letz

• A Behind-the-Scenes Peek at World’s First Linux-Based Laptop Orchestra - The Design of L2Ork Infrastructure and Lessons Learned (p. 55)
  Ivica Bukvic

• Studio report: Linux audio for multi-speaker natural speech technology (p. 61)
  Charles Fox, Heidi Christensen, Thomas Hain

• A framework for dynamic spatial acoustic scene generation with Ambisonics in low delay realtime (p. 69)
  Giso Grimm, Tobias Herzke

• A Toolkit for the Design of Ambisonic Decoders (p. 77)
  Aaron Heller, Eric Benjamin, Richard Lee

• JunctionBox for Android: An Interaction Toolkit for Android-based Mobile Devices (p. 89)
  Lawrence Fyfe, Adam Tindale, Sheelagh Carpendale

• An Introduction to the Synth-A-Modeler: For Modular and Open-Source Sound Synthesis using Physical Models (p. 93)
  Edgar Berdahl, Julius O. Smith III

• pd-faust: An integrated environment for running Faust objects in Pd (p. 101)
  Albert Gräf

• The Faust Online Compiler: a Web-Based IDE for the Faust Programming Language (p. 111)
  Romain Michon, Yann Orlarey

• FaustPad: A free open-source mobile app for multi-touch interaction with Faust generated modules (p. 117)
  Hyung-Suk Kim, Julius O. Smith

• The why and how of with-height surround production in Ambisonics (p. 121)
  Jörn Nettingsmeier

• Field Report II - A contemporary music recording in Higher-order Ambisonics (p. 127)
  Jörn Nettingsmeier

• From Jack to UDP packets to sound, and back (p. 137)
  Fernando Lopez-Lezcano

• Controlling adaptive resampling (p. 145)
  Fons Adriaensen

• Signal Processing Libraries for FAUST (p. 153)
  Julius Smith

• The Integration of the PCSlib PD library in a Touch-Sensitive Interface with Musical Application (p. 163)
  José Rafael Subía Valdez

• Rite of the Earth - composition with frequency-based harmony and ambisonic space projection (p. 171)
  Krzysztof Gawlas

• Minivosc - a minimal virtual oscillator driver for ALSA (Advanced Linux Sound Architecture) (p. 175)
  Smilen Dimitrov, Stefania Serafin

• An Open Source C++ Framework for Multithreaded Realtime Multichannel Audio Applications (p. 183)
  Matthias Geier, Torben Hohn, Sascha Spors

• Using the beagleboard as hardware to process sound (p. 189)
  Rafael Vega González, Daniel Gómez

The IEM Demosuite, a large-scale jukebox for the MUMUTH concert venue

Peter PLESSAS and IOhannes m ZMÖLNIG
Institute of Electronic Music and Acoustics (IEM), University of Music and Performing Arts (KUG)
Inffeldgasse 10/III, 8010 Graz, Austria
{plessas, zmoelnig}@iem.at

Abstract

In order to present the manifold possibilities of surround sound design in the new MUMUTH concert hall to a broad audience we developed an easy to use application running and interacting with audio demonstration scenes from a mobile computing device. An extensible number of such demonstrations using the large-scale and variable loudspeaker hemisphere in the hall are controlled with a user interface optimized for touch-sensitive input. Strategies for improving operating safety, as well as the introduction of a client/server model using the programming language Pure Data, are discussed in connection with a 3D Ambisonics rendering server, both implemented using the Linux operating system.

Keywords
demonstration, interaction, client-server systems, Ambisonics, Pure Data

1 The Mumuth

The University of Music and Performing Arts Graz (KUG) opened a new building, the Music and Music Theatre House (MUMUTH), in 2009. It holds a 600 sqm concert hall, named in honor of the Austrian composer György Ligeti. This new venue is used for productions by the university's different departments. Therefore, the hall has to suit music performances of different styles as well as theatre and opera works. This multi-faceted usage pattern demands a flexible performance space in terms of stage machinery, room acoustics, sound reinforcement, lighting and seating. As a consequence MUMUTH has no fixed stage and audience position but offers an array of pedestals making up the hall's floor, which can be individually raised to provide an elevated audience platform or stages of variable size and height. While this flexible layout allows for creative staging of different performances it also imposes a serious challenge for the installation of sound reinforcement equipment, since any part of the room can now become the stage or audience area and as such be of varying size.

Figure 1: MUMUTH top view: Floor pedestals and loudspeaker hang points

1.1 Variable position loudspeaker hemisphere

The university's Institute of Electronic Music and Acoustics (IEM) designed a unique sound reinforcement system in order to cope with the many different demands. It consists of 33 loudspeakers, which can be lowered from the ceiling on telescopic lifts and be positioned at arbitrary heights as well as rotation and elevation angles. Figure 2 gives an impression of the room. The speaker layouts, which can be easily refined within seconds, are stored and recalled, if desired also dynamically, during a show. Electroacoustic measurements and listening tests with different loudspeaker positions permit instantaneously perceivable comparisons. It is this very setup that is used in 3D sound field rendering using higher order ambisonics from IEM's own software. It is also employed in the reproduction of the different audio scenes in the IEM Demosuite.

1.3 Enhanced room acoustics

In addition to the sound reinforcement system described above, MUMUTH's concert hall is equipped with a variable room-acoustics system2, allowing to switch between different reverberation scenarios enhancing reverb envelopment and speech intelligibility. With this system the venue's acoustics can be tailored to different performance styles such as theatre, classical music, jazz and to IEM's concerts of electronic music. As the variable room-acoustics system is optimized for sound sources located on the floor level, it is in general not used together with the loudspeaker hemisphere.

Figure 2: Loudspeaker hemisphere 2 Presenting MUMUTH’s sonic abilities to a broad audience 1.2 The integration of higher-order Having build a venue the size of MUMUTH, it is Ambisonics important to convey the idea of a concert hall be- ing ”a musical instrument on its own” to the gen- With many years of experience in the design eral public. In order to demonstrate how sound and development of Ambisonics capture and play- in space can be creatively composed as well as in- back systems, IEM developed the scalable CUBE- teractively explored, we decided to create a suite mixer ambisonics authoring and rendering system of music demos that would immediately catch the [Musil et al., 2008], which is distributed under a listener’s attention and present the abilities of this GPL license1. This system consists of a mixing new location in an entertaining and convincing application that can run on general purpose hard- way. An easy-to-use and interactive demo appli- ware and operating systems, and which is open to cation for visitors, playable even while the room user extention and external control being imple- may be set up for a different production, had to mented in the Pure Data (Pd) programming lan- be developed. While this suite of demos would guage. CUBEmixer is employed in the discrete or provide ready-to-run sound scenes, it should at Ambisonic spatialization of an arbitrary number the same time invite pre-rendered or interactive of sources on the loudspeaker hemisphere. The extensions to its repertoire playlist from other helical positions of the mount points on the ceiling artists and researchers alike. as shown in Figure 1 allow to arrange the loud- speakers in a hemisphere encircling the venue’s 3 Software and Hardware available space with a minimum angular differ- ence between the speaker positions. CUBEmixer While looking for a suitable software platform in integrates very well with the MUMUTH audio in- which to implement the Demosuite, we decided it frastructure, consisting of a Lawo HD Core mix- would be preferable to use an environment that ing engine and signal matrix, controlled through future contributors would most likely be familiar the Lawo mc266 fanless mixing console. Inputs with. In our research we and our colleagues of- and outputs are via dedicated interface racks in- ten use Pure Data, so taking that environment terconnected via MADI optical cables, providing as the starting point for implementing the De- a total of 4096 physical input and output chan- mosuite seemed a good choice: in theory, soft- nels. Interfacing to the main rendering computer ware taken out of our development department is done via two RME HDSP MADI PCIe cards us- could be transferred directly to the suite. An ad- ing Alsa drivers with 64 bit Debian/GNU Linux. ditional bonus was the graphical nature of Pd, as there was no need to switch between environments

1http://ambisonics.iem.at/xchange/products/cubemixer 2Meyersound Constellation

2 when creating the visual user-interface and the DSP side of a demo. Finally, the cross-platform nature of Pd would make it possible to develop demos on one’s own preferred platform, before de- ploying it on a stable and low-latency in the venue. The MUMUTH had already been equipped with a PC for realtime signal processing running Debian GNU/Linux, so we decided to use that machine as a stable and well-tested base platform that was already well integrated into the existing audio in- frastructure. This computer is usually controlled from the MU- MUTH’s separate sound recording studio, though Figure 3: Example demo GUI with global volume it is possible to extend keyboard/monitor/mouse fader right and demo selector strip on top into the concert hall. However, for the needs of the Demosuite, this ap- some global parameters such as master vol- proach was discarded as it would involve the need ume, all optimized for touch sensitive user to set up a workstation in the concert hall ev- interfaces. ery time the demo should be presented, which a growing number of demos, which are inter- might in fact happen spontaneously between two • active audio applications implemented as Pd rehearsals of an opera without advance warning. patches. Instead we decided to control this computer with a convenient and portable remote control the pre- Demos are mutually exclusive, meaning that if a senter could carry around while running the de- given demo is currently loaded/running, no other mos. While the environment of our choice (Pd) demo can be loaded/running at the same time. is known to run well on both iOS and Android An example of a user interface for a selected demo, based smartphone and tablet devices [Brinkmann controlling the playback of pre-rendered sound et al., 2011], this is unfortunately only true for files allowing directly switchable different Am- the DSP-part of Pd and not for the user inter- bisonics orders is shown in Figure 3. Here, play- face requiring Tcl/Tk. The only available touch back of pre-rendered sound files is controlled along device we found that could run the Pd-GUI was with switchable Ambisonics orders for direct com- the Neophonie WePad. This tablet device runs a parison, with additional written comments about stock i386 based Linux system3, making it very the demo’s contents. easy to develop new and deploy already existing To complicate things a bit, the Demosuite soft- software on it. ware is to run on two hosts in parallel: In order to make Pd suitable for tablet use, its a powerful DSP server, capable of processing visual appearance was slightly adapted with a • and interfacing multichannel sound fields in ”kiosk mode” gui plug-in [zmoelnig, 2011] dis- realtime playing a single patch window in full-screen mode without any menus. a lightweight GUI client that controls this • server over a network connection 4 Demosuite Software Design The Demosuite software has to provide the com- The Demosuite software consists of two semanti- munication between the DSP and the GUI parts cally different parts: and ensure that the internal states of the two parts stay synchronized even during network dis- a static framework that takes care of selecting ruptions. In order to add a new demo to the • and activating a demo and that is controlling playlist, corresponding GUI and DSP patches have to be copied to the two machines respec- 3Linux Foundation’s MeeGo tively.
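The two halves of each demo exchange short OSC-style text messages carried in Pd's FUDI framing over the network, for example the "/playlist entry <name>" request that starts the loading sequence described in section 4.3. Purely as an illustration of what such a message looks like at the socket level, here is a minimal C sketch; the real system sends these from within Pd objects, and the host address, port and exact message spelling below are placeholders.

    /* Hypothetical illustration only: sending one OSC-style control message
     * in FUDI framing over UDP. The Demosuite itself does this with Pd
     * objects; host, port and message text below are placeholders. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    static int send_fudi(const char *host, int port, const char *msg)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(port);
        inet_pton(AF_INET, host, &addr.sin_addr);

        /* FUDI framing: whitespace-separated atoms, terminated by ';' */
        char buf[256];
        snprintf(buf, sizeof(buf), "%s;\n", msg);

        ssize_t n = sendto(fd, buf, strlen(buf), 0,
                           (struct sockaddr *)&addr, sizeof(addr));
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int main(void)
    {
        /* e.g. ask the DSP host to load the demo called "foobar" */
        return send_fudi("192.168.1.10", 9001, "/playlist_entry foobar");
    }

A FUDI message is just a run of whitespace-separated atoms terminated by a semicolon, which keeps it trivial to generate and parse on both hosts.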

3 4.1 Slot Management – requires dynamic patching which makes We call the container mechanism, that ensures code less readable (bad for long term that the matching parts (GUI and DSP patches) maintenance) of a single demo are active and synchronized across the two computers, the ”slot”. Since the demos are to be implemented by mul- Basically two different approaches to managing a tiple collaborators and will presumably be of di- variable number of demos via this slot mechanism verse code quality, the perfect isolation between exist: demos in the ”dynamic loading” approach was the winning argument. Assuming that a single demo either all demos are preloaded into a match- will consume moderate resources in return, the • ing number of slots and, upon selection of a elongated load time due to dynamic resource al- given demo its containing slot gets activated location was considered acceptable. Furthermore, while all other slots are deactivated this approach scales well even for a very large number of demos. or there is only a single slot that manages • the synchronous loading and activation of an 4.2 Slot Interface arbitrary demo across the two hosts. A slot consists of two separate programs (the DSP Both methods have their advantages and draw- and the GUI part) that communicate with each backs. In the case of the MUMUTH Demosuite other as well as produce perceptible output (the we identified the following relevant characteristics DSP will usually render sound, whereas the GUI of the ”pre-load” technique: will visualize an interface to control the former). The communication channels for the two parts are Pros • standardized, in order to be able to globally re- – fast (since initialization can be done at place them with different networking techniques. startup time) 4.2.1 control communication Cons Usually GUI and DSP will run on different com- • puters that are linked via a network connection. – more resources needed (scales badly) To keep the implementation simple, this commu- – ”deactivated” demos can remain half- nication is abstracted as unidirectional and asyn- active (e.g. by continuously generating chronous busses. The network message interface messages) inside PD is implemented as single [inlet] resp. [outlet] objects in the GUI and DSP patch of – co-existing demos can influence each each demo. Each Pd message that is to be sent other (if not carefully implemented) from one peer to the other, has to go through that Whereas with the technique of ”dynamic loading single [outlet] and will then appear on the other into a single slot” can be characterized as follows: side’s sole [inlet] object. No assumption must be made on the speed of the transmission. Pros • 4.2.2 audible and visible output – resources allocated on demand (scales In addition to choosing and interacting with a well) demo, the user should be able to control the – ”deactivated” demos don’t exist and can global volume with a single fader, in a unified therefore not take any resources way for each and every demo. It is therefore – demos never co-exist important that the DSP patch does not access the soundcard’s DACs (or rather Pd’s representa- Cons tion thereof) directly, but through a global au- • dio bus. The Demosuite applies a few post- – resources allocated on the fly (slow, since processing steps on this multi channel audio bus, no pre-loading is possible) namely master volume, multichannels limiting – instantiation locks the Pd process and a global mute. This is achieved by using

4 a special [throw chn-] object rather than 5.1 TCP/IP vs. UDP Pd’s own [dac ]. Traditionally, network connections involving re- The GUI part is trying to make best use of the e altime audio environments use the simple UDP allocated screen estate by being implemented as e rather than the more reliable TCP/IP protocol, a Pd ”graph on parent” interface of a fixed size, mainly because of the following two reasons: defining the available area for the UI part of a demo.4 overhead: due to it’s simpler nature UDP has • less traffic overhead than TCP/IP, and will 4.3 Loading a Demo therefore consume less resources Since the GUI and DSP parts are asynchronous robustness: since UDP does not track ac- processes, it is necessary to pay special attention • tive connections, it handles loss of connec- to the sequence of actions in which to load a demo tivity more gracefully than TCP/IP; esp. it in order to guarantee consistent behavior and de- does not cause the remote point to hang if fined states of both systems at each instant in the local point experiences a network out- time. age. This is especially important considering Loading is initiated by the user selecting a scene the volatile connection with a wireless mobile (e.g. ”foobar”) on the interface. The GUI will device and the non-threaded network imple- then delete the previously loaded demo, so it can mentation in Pd (that would cause the main no longer send control messages to the DSP part audio thread to hang in case the TCP/IP re- of the demo, and enter a transitional state, (show- mote control left the coverage of the wireless ing a ”loading...” splash screen). Once this is ac- LAN access point!) complished, a request ”/playlist entry foobar” is sent to the DSP part. 5.2 to OSC or not to OSC... Once DSP receives such a request, it will first State of the art communication between net- delete its part of the the previous scene and sub- worked audio workstations is usually implemented sequently load the ”foobar/dsp.pd” patch, which using (OSC). While Pd still receives an initialization trigger. Once the DSP has no built-in support for OSC and instead pro- part of the demo is initialized, it must then send vides objects that speak PD’s own FUDI protocol, a ”/dsp init done” message, which will cause the it can be extended using Martin Peach’s ”osc” li- DSP to send a ”/create gui slot foobar” request brary. Unfortunately we found the various avail- back to the GUI. Now that the GUI knows that able network transport implementations needed the DSP is properly loaded, it can load and dis- for this ”osc” library to be less stable than ex- play the ”foobar/gui.pd” patch. The GUI part of pected, especially in cases where the connection demos having non-trivial internal states can now can easily drop, which is expected to happen once query the DSP to transmit the current state of the tablet leaves the W-LAN coverage. Rather internal parameters. than fixing these implementations, we decided to write a set of abstractions that hide the actual 5 Communication between GUI and implementation of the OSC transport. DSP On the application layer directly accessible within Since the DSP and the GUI are running on two a demo (and within most parts of the framework), different machines, their communication is done ”OSC-style” messages are used. On the layer via a network connection. In MUMUTH, the below, these messages are currently transmitted tablet is connected via a wireless LAN to en- within FUDI containers. 
Once the issues of the sure maximum mobility of the presenter. This network transport libraries are addressed with the required the following considerations. ongoing development of Pd’s externals, the wrap-

4 per objects can easily be switched back to use This is probably the most tedious part when writ- OSC containers. ing new demos. Unfortunately, at the time of this writ- ing, Pd does not provide a usable auto-scaling function of 5.3 Network Security ”graph on parent” interfaces, which means that whenever the screen resolution of the remote control changes, the In the current implementation, absolutely no net- GUI patches have to be adapted... work security is built into the system. An intruder

5 who has knowledge of the protocol, the network sources on different spatialisation paths on address and port of the DSP machine can send the hemisphere with dynamically changing arbitrary commands. Since the wireless network early reflections and late reverberation, with is only available within the concert hall and is Ambisonics-encoded reverb return signals. encrypted, we have not experienced any security Ambisonic sources in comparison to discrete related problems so far. In any case, it should be • loudspeaker playback with additional depth easy enough to protect the two hosts via a clever perception cues. set of firewall rules and network tunnelling tech- nology such as ssh, SSL or VPN. Extracts from IEM concert productions. • 5.4 Operating Safety Dedicated pieces from IEM staff and featured • Whenever the connection between the main DSP guest composers. computer machine and the GUI remote control A 3D panning demo using allowing interac- is interrupted, it is necessary to fallback into a • tive user control. safe state. In order to monitor the status of the Ambisonics field recording examples. UDP connection, a special ”heartbeat” message • is constantly exchanged between the two comput- 7 Summary ers. Whenever the user interface assumes to be connected to the DSP server, it sends out a mes- In this contribution we described how an easy- sage ”/client alive 1” in regular intervals (cur- to-use system of audio demo scenes for the in- rently about 3 times per second), which is an- teractive exploration of the new MUMUTH con- swered by a ”/server alive 1” message in return. cert space has been realised. We presented an ap- If either side does not receive its peer’s ”alive” proach towards networked Pd patches using con- message for a certain amount of time, it considers nection state monitoring together with dynamic the remote side to have disconnected. Once the patch creation on different host computers. A so- DSP server detects a disconnected GUI, it will lution allowing for easy extensibility of the demo immediately mute its audio outputs in order to repertoire while preserving system stability was prevent damage to the listener’s ears as well as discussed along with an alternative network trans- to the equipment by uncontrollable audio signals. port layer for Pd messages. The adaption of Pd to Whenever the GUI detects a disconnected DSP touch-based tablet interfaces and considerations server, it displays a warning message of the lost regarding operational reliability with regard to connection and locks the interface that otherwise high-volume PA systems were shown. would control the DSP part of a demo. This is last 8 Acknowledgements measure is important in order to ensure that the state on both DSP and GUI sides divert as little While the design and realisation of the audio in- as possible, making the reconnection and resum- frastructure in MUMUTH has been the result ing of a demo an easy task. The GUI will try of a huge team of collaborators, the authors to reestablish the connection by periodically at- would especially like to express their gratitude to tempting to reconnect, send a ”/conn request 1” IEM/KUG members Gerhard Eckel for initiating message. This message will eventually trigger an the Demosuite project, and Thomas Musil, Stefan reconnection attempt of the DSP server. Warum and Ulrich Gladisch for their exceptional As soon as the connection is re-established, the support. 
GUI queries the DSP server about it’s current References state and adjusts its internal state and hence the state of the user controls accordingly. Peter Brinkmann, Peter Kirn, Richard Lawler, Chris McCormick, Martin Roth, and Hans- 6 Repertoire Christoph Steiner. 2011. Embedding pure data Current examples of demos available in the suite with libpd. In Proceedings of the Pure Data are: Convention 2011, Weimar, Germany. http:// www.uni-weimar.de/medien/wiki/images/ Audio scenes of pre-rendered Ambisonic Embedding_Pure_Data_with_libpd.pdf. •

6 Thomas Musil, Winfried Ritsch, and Johannes Zm¨olnig.2008. The cubemixer a performance-, mixing- and masteringtool. In Proceedings of the Linux Audio Conference 2008, Cologne, Germany. http://old.iem.at/projekte/ publications/paper/cm/cm.pdf. IOhannes zmoelnig. 2011. New clothes for pure data. In Proceedings of the Linux Audio Conference 2011, Maynooth, Ire- land. http://lac.linuxaudio.org/2011/ download/lac2011_proceedings.pdf.

Network distribution in music applications with Medusa

Flávio Luiz SCHIAVONI and Marcelo QUEIROZ
Computer Science Department, University of São Paulo
São Paulo, Brazil
{fls, mqz}@ime.usp.br

Abstract

This paper introduces an extension of Medusa, a distributed music environment, that allows an easy use of network music communication in common music applications. Medusa was firstly developed as a Jack application and now it is being ported to other audio APIs as an attempt to make network music experimentation more widely accessible. The APIs chosen were LADSPA, LV2 and Pure Data (external). This new project approach required a complete review of the original Medusa architecture, which consisted of a monolithic implementation. As a result of the new modular development, some possibilities of using networked audio streaming via Medusa plugins in well-known audio processing environments will be presented and commented on.

Keywords
Network Music, Pure Data, LADSPA, LV2, Medusa.

1 Introduction

Ever since the availability of network and Internet connections became an undisputed fact, collaborative and cooperative music tools have been developed. The desire of using realtime audio streaming in live musical performances inspired the creation of various tools focused on distributed performance, where synchronous music communication is a priority. Some related network music tools which address the problem of synchronous music communication between networked computers are NetJack [Carôt et al., 2009], SoundJack [Carôt et al., 2006], JackTrip [Cáceres and Chafe, 2009b; Cáceres and Chafe, 2009a], llcon [Fischer, 2006] and LDAS [Sæbø and Svensson, 2006].

The success cases of network music performance concerts could give the wrong impression that the only use for network music is distributed performance, but several other problems in computer music can take advantage of or benefit from network music distribution. For instance, recordings can be made using a pool of networked computers, a resource which might be seen as a scalable distributed sound card; music spatialization could be done using a mesh network topology; digital signal processing power can be vastly improved using clusters of computers; and musical composition may explore fresh new grounds by viewing computer networks as nonstandard acoustical environments for performance and listening.

There are at least two requirements for all these scenarios to be fully explorable by potentially interested users, which are often musicians or aficionados and usually nonprogrammers: first, flexible audio network tools must be made available, and second, easy integration with popular music/sound processing applications must be guaranteed. Despite the fact that most high-level Linux music applications nowadays run over Jack and ALSA, we see that time and again end users can't deal with this audio infrastructure lying below the application they are running. Also, the lack of automated setup and graphical user interfaces keeps common users away from many existing tools, like the network music tools mentioned above.

The work presented in this paper is part of the investigation behind the development of Medusa [Schiavoni et al., 2011], a distributed audio environment. Medusa is a FLOSS project which is primarily focused on usability and transparency, as means to making network music connections easier for end users. The first implementation of Medusa, presented in LAC 2011, was developed in C++ with GUI and Jack as sound

9 API. The Jack API was extended with a series of with the chosen sound APIs. Second, the mono- network functionalities, such as add/remove re- lithic structure has been changed to the devel- mote ports and remote control of the Jack Trans- opment of a core Medusa library (libmedusa.so), port functionality. comprising control and network functions, which Recently, the development of Medusa has been could be re-used by each new plugin implemen- strongly guided by the attempt to deal with the tation. Third, graphical user interfaces, text- two aforementioned requirements, namely flexibil- based interfaces and plugins now occupy a sep- ity and integrability. These goals may be reached arate layer, that uses the core Medusa library as by extending regular sound processing applica- an API on its own. tions, such as Pure Data, or , allowing them to function as network music tools. The very basic idea is trying to reach end users wherever they already are. Most music applications can be extended by plugins: Pure Data can be extended by the cre- ation of C externals (which might be viewed as a plugin), whereas digital audio workspaces such as Ardour, Rosegarden, and Traverso can be extended by LADSPA, VST and LV2 plugins. In this paper we will explore the possibility of developing Medusa network music plugins using three popular audio APIs: LADSPA, LV2 and Pure Data API (via C externals). It is worth mentioning that these APIs are all open-source, Figure 1: Medusa Jack implementation with Qt developed in C and widely used in Linux music GUI environments. A reimplementation of Medusa has been re- quired in order to grant code reuse in the im- The new Medusa architecture is divided in plementation of these plugins, and also for easy three layers: Sound, Control and Network. The maintenance of the source code. All source code Control and Network layers are implemented as a of the Medusa project, including Medusa plug- unique library and are used by all Medusa imple- ins, are freely available in the project site1. This mentations. The Sound layer corresponds to spe- reimplementation of Medusa and the proposed ar- cific applications, like the proposed plugins, that chitecture are presented in section 2; section 3 are built using each plugin API and the Medusa discusses the chosen audio APIs and presents the library. This architecture facilitates code mainte- developed plugins, and section 4 brings some con- nance and bug correction. clusions and a discussion of future works.
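The first Medusa was built directly on the Jack API that most Linux audio applications already use, with network ports added on top of an ordinary Jack client. For readers who have not written one, a minimal pass-through client (a plain illustration, not Medusa source code) looks roughly like this:

    /* Minimal JACK client sketch (illustration only, not Medusa source):
     * one input port, one output port, audio copied straight through. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <jack/jack.h>

    static jack_port_t *in_port, *out_port;

    static int process(jack_nframes_t nframes, void *arg)
    {
        (void)arg;
        jack_default_audio_sample_t *in  = jack_port_get_buffer(in_port,  nframes);
        jack_default_audio_sample_t *out = jack_port_get_buffer(out_port, nframes);
        memcpy(out, in, sizeof(jack_default_audio_sample_t) * nframes);
        return 0;
    }

    int main(void)
    {
        jack_client_t *client = jack_client_open("passthrough", JackNullOption, NULL);
        if (!client) {
            fprintf(stderr, "could not connect to the JACK server\n");
            return 1;
        }

        jack_set_process_callback(client, process, NULL);
        in_port  = jack_port_register(client, "in",  JACK_DEFAULT_AUDIO_TYPE,
                                      JackPortIsInput, 0);
        out_port = jack_port_register(client, "out", JACK_DEFAULT_AUDIO_TYPE,
                                      JackPortIsOutput, 0);

        if (jack_activate(client)) {
            fprintf(stderr, "cannot activate client\n");
            return 1;
        }

        for (;;)        /* audio runs in JACK's real-time thread */
            sleep(1);
    }

Moving from one such monolithic Jack client to several plugin front-ends is what motivated splitting the shared control and network code into the core library described above.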

2 Medusa Although some promising preliminary results had been achieved with the first version of Medusa [Schiavoni et al., 2011], the original monolithic implementation raised several diffi- culties in the implementation of the proposed Medusa plugins, which eventually triggered a fundamental architectural change in the project. First, the implementation language was changed from C++ to ANSI C, which is more compatible Figure 2: Architectural view of the implementa- tion 1http://sourceforge.net/projects/medusa-audionet/
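The Network layer described in section 2.1 below lets the user pick one of four kernel transport protocols (UDP, TCP, SCTP and DCCP). On Linux all four are obtained through the ordinary socket API by varying only the type/protocol pair, as the following sketch shows; this is an assumption-laden illustration rather than libmedusa code, and DCCP/SCTP additionally require kernel support (on older systems the constants may live in Linux-specific headers).

    /* Selecting a transport protocol on Linux (illustration, not libmedusa):
     * the same socket() call serves UDP, TCP, SCTP and DCCP. */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>   /* IPPROTO_UDP/TCP/SCTP/DCCP on recent glibc */

    int open_transport_socket(const char *proto)
    {
        if (strcmp(proto, "udp") == 0)
            return socket(AF_INET, SOCK_DGRAM,  IPPROTO_UDP);
        if (strcmp(proto, "tcp") == 0)
            return socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (strcmp(proto, "sctp") == 0)
            return socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP); /* one-to-one style */
        if (strcmp(proto, "dccp") == 0)
            return socket(AF_INET, SOCK_DCCP,   IPPROTO_DCCP);
        return -1;   /* unknown protocol name */
    }

Whether reliability (TCP, SCTP) or low latency with tolerated loss (UDP, DCCP) is preferable depends on the musical use case, which is exactly the choice the Network layer exposes to the user.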

10 2.1 Network layer senders and receivers and provides them to the The network layer is responsible for managing Sound layer. Instead of requiring that all pro- connections between network clients and servers. tocols in the Network layer to have full-duplex This layer’s implementation was made based on communication, the Control layer always sepa- a fixed set of network transport protocols, which rates the roles of sending and receiving network are normally provided as part of the operating data. system kernel, since its development and deploy- ment requires superuser privileges. Medusa cur- rently allows the user to choose among 4 transport protocols: UDP, TCP, SCTP e DCCP: Figure 3: Sender / Receiver communication UDP [Postel, 1980]: User Datagram Protocol is In order to create a sender it is necessary to in- the classical unreliable (but faster) transport form the network protocol, the network port and protocol. the number of channels that will be sent. The TCP [Padlipsky, 1982]: Transmission Control sender creates the network server and one ring Protocol is a reliable transport protocol, buffer for each channel, and provides functions which ensures absence of packet losses. for the sound API to have easy access to these buffers. Each ring buffer receives sound content SCTP [Ong and Yoakum, 2002]: Stream Con- from applications (usually in DSP chunks), out of trol Transmission Protocol is a connection- which the sender prepares network chunks (usu- oriented transport protocol that provides a ally of a different size) for the Network Layer. reliable full-duplex association. This proto- The receiver is created in a way similar to the col was not originally meant as a replacement sender, but besides the basic parameters the cor- for TCP, but was developed for carrying voice responding server IP address is also required. The over IP (VoIP). receiver creates a network client and the required DCCP [Kohler et al., 2006; Floyd et al., number of ring buffers (one per channel). Un- 2006]: Datagram Congestion Control Pro- like the sender, the ring buffer of the receiver will tocol is a transport protocol that combines be fed by a network client and consumed by the TCP-friendly congestion control with unre- sound API. liable datagram semantics for applications 2.3 Sound layer that transfer fairly large amounts of data [Lai and Kohler, 2005]. The outermost Medusa layer is the Sound layer, which deals not only with audio streams but also This multi-protocol network communication with MIDI streams. The Sound layer lies be- layer is intended to offer alternatives for users that tween the Medusa API and several sound applica- may need different specific features in data trans- tion APIs, and represents a collection of Medusa fer according to application context. For instance, front-ends or interfaces, since each integration of an interactive musical performance with strong Medusa with a particular sound application may rhythmic interactions may require the smallest have its own user interface, defined by each sound possible latency while tolerating audio glitches application API. due to packet losses, and a remote recording ses- The integration of Sound and Network layers sion may tolerate high latency and jitter, but still uses the Control layer through its sender / re- require that every packet be delivered. With a few ceiver plugins. 
Sender and receiver roles defined alternatives available, the user may choose which by the Control layer are converted into Sound protocol is more appropriate to its own musical layer plugins that simply exchange audio streams use. between machines. Each plugin has a host application and a partic- 2.2 Control layer ular way to communicate with it. The API defines The next layer in the Medusa API is the Con- how a plugin is initialized, how it processes data trol layer. The Control layer essentially creates and how it is presented. Each plugin implemen-

11 tation generates an independent software pack- documentation suggests the use of Rosegarden age that can be individually used and distributed. DSSI to deal with it. The separation of these implementations avoids the mixing of different plugin libraries, which is 3.2 LV2 better for code maintenance, chasing bugs and so LV2 (LADSPA version 2) [Steve Harris, 2008] is on. acknowledged as the official LADSPA successor. An LV2 plugin is a bundle that includes the plugin 3 Implementations itself, an RDF descriptor in Turtle and any other 3.1 LADSPA resource needed. The LV2 bundle may include a LADSPA [Furse, 2000] stands for “Linux Audio GTK GUI, thus offering developers a way to cre- Developer Simple Plugin API”, and it is the most ate better user interfaces. As LADSPA, LV2 is an common audio plugin API for Linux applications. easy-to-use library, and plenty of documentation Many Linux programs, such as , Ardour, and examples are provided by the developers. Rosegarden, QTractor and Jack Rack, support LADSPA plugins. The LADSPA API is captured within a header file and is very easy to use. Some examples are provided with an SDK and there is a lot of open source code with great documentation available.

Figure 5: Medusa LV2 implementation

The possibility of creating GTK interfaces al- lowed a more natural integration of Medusa as an LV2 plugin. The IP mask entry allows easy input for the user, and on-the-fly changes to the inter- face may be used by Medusa in the future to allow for finding users and sound resources and keeping this information up-to-date on the interface. 3.3 Pure Data external Pure Data [Puckette, 1996] (aka Pd) is a largely used realtime programming environment for au- Figure 4: Medusa LADSPA implementation dio, video, and graphical processing, which can be extended by the use of externals written in Figure 4 presents Medusa LADSPA sender and C/C++ language [Zm¨olnig, 2001]. Pure Data receiver interfaces. The controls of a LADSPA does have some network externals available, but plugin are represented exclusively by 32 bit float- none offers all transport protocols provided by ing point numbers, and it was awkward to Medusa. A Pd external may also be implemented use these to represent IP addresses. LADSPA with a tcl/tk GUI, which can be useful to improve GUIs are defined in RDF files that cannot be usability and also to implement Medusa service changed on-the-fly, which makes them very cum- control in the future. In addition, Pd offers a good bersome as interfaces for network audio connec- environment to measure latency, packet loss and tions. LADSPA doesn’t support MIDI and the jitter in Medusa Network layer implementation.

12 With that in mind, the level of usability orig- inally pretended by this project will only be reached when Medusa’s control service is imple- mented in the presented Medusa plugins, which is the very next step in our work. This control ser- vice is going to be responsible for improving trans- parency and usability, by including a discovery service, and a name server to publish networked music resources, thus allowing users to refer to Figure 6: Medusa Pure Data external implemen- other users and audio streams, instead of IP ad- tation dresses and network ports. Probably the Medusa LADSPA plugin will have to be abandoned at this point because the control service is impossible to The Medusa Pd external implementation is ac- implement without on-the-fly GUI modification tually a collection of externals (a library) that (although the current version of the plugin, with- includes the sender and receiver objects (see fig- out control service, should be kept). ure 6) and a network meter to measure latency and packet loss. Some examples of use were also Up to this point the plugin development has developed and are available on the project web- focused on audio streaming, but some other fea- site. tures must soon be addressed, such as treating MIDI streams and designing better GUIs. There 4 Conclusions and Future Works are also other sound APIs that are been inves- tigated and maybe soon will be incorporated in The possibility of exchanging sound data between Medusa Sound layer. applications using the Jack sound server, for in- stance, has brought new perspectives in audio Acknowledgements software development and usage. Extending this possibility to allow for network data exchange Thanks to uncountable developers of LADSPA, from within popular user applications may fa- LV2 and Pure Data externals and their amazing cilitate the emergence of new ways to compose, open source code. Without their anonymous help record or play music. It may be early to con- this project would not have been possible. clude that network plugins for audio software will Thanks also go to Andr´e Jucovsky Bianchi, really expand the musical use of computer net- Beraldo Leal, Santiago Davila, AntˆonioGoulart, works, but it is not early to observe in users and Giuliano Obici, Danilo Leite and the Computer musicians new expectations about what might be Music Group of IME/USP for their interest, feed- done in network music, and also the desire to try back and support. out new ways to do old things, maybe more easily This work has been supported by the fund- and more transparently than before. ing agencies CNPq (grant 141730/2010-2) and The effort that went into changing Medusa ar- FAPESP - S˜aoPaulo Research Foundation (grant chitecture and reimplementeing it is totally justi- 2008/08623-8). fied, in the sense that now the effort of implement- ing other network music plugins (i.e. Medusa plu- References gins for other sound applications) involve a lot of Juan-Pablo C´aceresand Chris Chafe. 2009a. code reuse and thus have been made much lighter. Jacktrip: Under the hood of an engine for net- On the other hand, the first version of Medusa work audio. In Proceedings of International had an additional structure called Control Ser- Computer Music Conference, page 509–512, vice, which was responsible for helping users to San Francisco, California: International Com- connect to network music resources in a transpar- puter Music Association. 
ent way, in other words, without having to specify IP, network ports or audio configuration, using Juan-Pablo C´aceresand Chris Chafe. 2009b. broadcast messages to publish networked audio Jacktrip/Soundwire meets server farm. In In resources. Proceedings of the SMC 2009 - 6th Sound

13 and Music Computing Conference, pages 95– Fl´avioLuiz Schiavoni, Marcelo Queiroz, and 98, Porto, Portugal. Fernando Iazzetta. 2011. Medusa - a dis- tributed sound environment. In Proceedings of A. Carˆot,U. Kramer, and G. Schuller. 2006. the Linux Audio Conference, pages 149–156, Network music performance (NMP) in narrow Maynooth, Ireland. band networks. In Proceedings of the 120th AES Convention, Paris, France. David Robillard Steve Harris. 2008. Lv2 track. http://lv2plug.in/trac/. A. Carˆot, T. Hohn, and C. Werner. 2009. Netjack–remote music collaboration with elec- Johannes M Zm¨olnig. 2001. Howto write an tronic sequencers on the internet. In In Pro- external for puredata. http://pdstatic.iem. ceedings of the Linux Audio Conference, page at/externals-HOWTO/. 118, Parma, Italy. Volker Fischer. 2006. Internet jam session soft- ware. http://llcon.sourceforge.net/. S. Floyd, E. Kohler, and J. Padhye. 2006. Pro- file for Datagram Congestion Control Proto- col (DCCP) Congestion Control ID 3: TCP- Friendly Rate Control (TFRC). RFC 4342 (Proposed Standard), March. Updated by RFC 5348. Richard Furse. 2000. Linux audio devel- oper’s simple plugin (). http://www. ladspa.org/. E. Kohler, M. Handley, and S. Floyd. 2006. Datagram Congestion Control Proto- col (DCCP). RFC 4340 (Proposed Standard), March. Updated by RFCs 5595, 5596. Junwen Lai and Eddie Kohler. 2005. A congestion-controlled unreliable datagram api. http://www.icir.org/kohler/dccp/ nsdiab- stract.pdf. L. Ong and J. Yoakum. 2002. An Introduction to the Stream Control Transmission Protocol (SCTP). RFC 3286 (Informational), May. M.A. Padlipsky. 1982. TCP-on-a-LAN. RFC 872, September. J. Postel. 1980. User Datagram Protocol. RFC 768 (Standard), August. Miller Puckette. 1996. Pure data: another inte- grated computer music environment. In in Pro- ceedings, International Computer Music Con- ference, pages 37–41. Asbjørn Sæbø and U. Peter Svensson. 2006. A low-latency full-duplex audio over IP streamer. In Proceedings of the Linux Audio Conference, pages 25–31, Karlsruhe, Germany.

Luppp - A real-time audio looping program

Harry VAN HAAREN, University of Limerick, Ireland

[email protected] http://harryhaaren.blogspot.com https://github.com/harryhaaren/Luppp

Abstract libconfig [5] • Luppp is a live performance tool that allows the play- SUIL [6] back of audio while applying effects to it in real time. • It aims to make creating and modifying audio as easy LILV [7] as possible for an artist in a live situation. • At its core lies the JACK Audio Connection Kit 2.2 Overview of components for access to audio and MIDI, usable effects include LADSPA, LV2 and custom Luppp effects. There is a The core of the program consists of a process- GUI built using , however it is encouraged to ing engine. Its function is to process all input use a hardware controller for faster and more intuitive events and produce its output within a certain interaction. time frame. Its primary inputs are MIDI mes- sages and events from the GUI, while its primary Keywords output is audio data that is written to the JACK Live performance, real-time looping, audio production ports. It also facilitates the display of data on screen in the user interface. This involves trans- 1 Introduction porting data from the real-time engine thread to Luppp is the ”Luppp Untuitive Personal Perform- the user interface thread to be displayed. ing Program”, it’s a tool to be used in a live sce- nario to creatively produce music. The fact that 2.3 Engine Events it is to be used live imposes certain requirements The EngineEvent class is a class that represents like low latency input-output and real-time effect any action that the engine can perform. For each rendering. action there is a function to set the right parame- This paper gives an overview of the structure of ters in the class, while reading data from the class the audio processing engine in Luppp and how it is done through the use of public data members. delegates tasks to different threads, as well as how communication between the threads is achieved. 2.4 Events and ring-buffers The loading of effects is discussed, as is the user To communicate between the various threads in interface and its design. the engine in a lock-free manner ring-buffers are EngineEvent 2 Engine used. Pointers to instances of the class are passed through the ring-buffers, which 2.1 Dependencies allows the use of these events from a real-time The core engine depends on a number of different thread. The GUI thread creates a pool of libraries: EngineEvent instances while the real-time JACK thread can pull event pointers from this queue, JACK [1] • sending the messages without having to allocate Gtkmm [2] them. This causes minimal overhead in the real- • time thread while a background thread occasion- libsndfile [3] • ally polls the ring-buffer for write space, and fills Fluidsynth [4] it with new instances. •
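Luppp itself is written in C++, but the lock-free hand-off just described can be illustrated in a few lines of C using JACK's ring buffer: only fixed-size pointers travel through the queue, allocation happens outside the real-time thread, and the process callback merely drains whatever is waiting. This is a simplified, hypothetical sketch of the pattern (single direction, without the pre-allocated pool topped up by a background thread), not Luppp's actual EngineEvent code.

    /* Passing event pointers between a GUI thread and the real-time
     * thread through a lock-free ring buffer (illustration in plain C). */
    #include <stdlib.h>
    #include <jack/ringbuffer.h>

    typedef struct {
        int   type;    /* e.g. "set track volume" (hypothetical enum)   */
        int   track;   /* which track the event applies to              */
        float value;   /* new parameter value                           */
    } engine_event;

    /* GUI -> engine queue, created once at startup */
    static jack_ringbuffer_t *event_queue;

    void queue_init(void)
    {
        /* room for 1024 pointers */
        event_queue = jack_ringbuffer_create(1024 * sizeof(engine_event *));
    }

    /* Called from the GUI (non-real-time) thread. */
    void send_event(int type, int track, float value)
    {
        engine_event *ev = malloc(sizeof(*ev));  /* allocation outside RT thread */
        ev->type = type; ev->track = track; ev->value = value;
        jack_ringbuffer_write(event_queue, (const char *)&ev, sizeof(ev));
    }

    /* Called at the top of the JACK process callback (real-time thread). */
    void process_events(void)
    {
        engine_event *ev;
        while (jack_ringbuffer_read_space(event_queue) >= sizeof(ev)) {
            jack_ringbuffer_read(event_queue, (char *)&ev, sizeof(ev));
            /* ...apply ev to the engine state here...                       */
            /* return the pointer to a pool or hand it to a helper thread;   */
            /* never call free() from the real-time thread                   */
        }
    }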

2.5 State store

The StateStore class holds the state of all other components of the engine. It is a centralized location for all data like AudioBuffers, AudioSinks, BufferAudioSourceStates etc. The real-time thread takes EngineEvents and writes the data contained within them to the StateStore. This state data is later requested by each engine component when it is required to do its processing.

The requesting of state data is done using unique identification numbers for each state instance. When an instance of a class that needs a state is created, it automatically increments its own ID number, while also adding a new state with its own ID number to the StateStore. It can now request the state with its own ID number from the StateStore and it will receive its own state data.

2.6 Audio processing

The AudioTrack is the main structure in audio processing: the track's AudioSource provides a channel of mono audio, a list of Effects is applied to this signal, and finally the AudioSink passes the produced data to the Mixer class. The Mixer then takes each track's output, applies master effects and writes the data to the master output ports.

Meta data is available to each element inside an AudioTrack so AudioSource, Effect and AudioSink class instances can make informed decisions on how to process the next block of audio based on the engine's current state.

Figure 1: AudioTrack block diagram

2.6.1 Audio sources

The AudioSource base class is defined as the start of the processing chain. It is subclassed to provide varying functionality: the BufferAudioSource reads samples from an AudioBuffer and then writes them to the AudioTrack's buffer. In the case of instrument sources like Fluidsynth or an LV2 synth, it gathers incoming MIDI data from JACK or a playing MIDI clip, generates samples and writes them to the AudioTrack's buffer.

2.6.2 Effects and plugins

The Effect base class is defined as a class that requests its state from the StateStore using its own ID, sets its new parameters, and performs some processing on an AudioTrack's buffer. Currently there is no latency compensation done for any of the effects, however the effects that are available have been extensively tested to comply both with the real-time requirements of the program and with not inducing unacceptable latency to the track.

This base class can be subclassed to host any kind of audio effect; the only constraint is that the number of input samples must match the number of output samples. LADSPA [8] and LV2 [9] effects are implemented, however theoretically any type of plugin format could be supported.

2.6.3 Audio sinks

The AudioSink class is the base class that represents the end of the processing chain. Every track has an OutputAudioSink, whose purpose is to mix the mono input samples into ambisonic B-format, while also copying the samples to the post-fade send buffer and headphones pre-fade listen buffers.

2.6.4 Headphones PFL

Each track has headphones pre-fade listen functionality, which allows the user to preview tracks before actually mixing them into the master output. This functionality is achieved by writing the mono signal that is passed to the AudioSink to the headphones buffer in Mixer. After each track has been processed, Mixer will write the headphones buffer to the JACK port.

2.6.5 Post-fade Sends

Each track has a post-fade send. All tracks are summed together into a single buffer, which is then written to a JACK audio port. This JACK port can be connected to arbitrary JACK clients to add effects to the signal. Although a mono signal is sent, the return ports are B-format to support scenarios where an ambisonic effect is used. The returned signal is then mixed into the master bus.
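To make the signal flow of sections 2.6 to 2.6.5 concrete, the following sketch shows how one processing block could travel from the AudioSource through the Effect list into the AudioSink. The class interfaces and method names here are illustrative assumptions and are simplified compared to Luppp itself (no StateStore lookups, no meta data).

// Sketch of the per-block signal flow described in section 2.6; the class and
// method names are assumptions for illustration, not Luppp's actual code.
#include <vector>

class AudioSource { public: virtual void process(int n, float* buf) = 0; };
class Effect      { public: virtual void process(int n, float* buf) = 0; };
class AudioSink   { public: virtual void process(int n, const float* buf) = 0; };

class AudioTrack {
public:
  void process(int nframes) {
    // In a real-time context this working buffer would be allocated up front,
    // not resized per block; this is glossed over in the sketch.
    buffer.resize(nframes);
    // 1) the source writes one channel of mono audio into the track buffer
    source->process(nframes, buffer.data());
    // 2) each effect processes the buffer in place (same frame count in/out)
    for (size_t i = 0; i < effects.size(); ++i)
      effects[i]->process(nframes, buffer.data());
    // 3) the sink encodes to B-format and feeds the PFL and post-fade buffers
    sink->process(nframes, buffer.data());
  }
private:
  AudioSource*         source;
  std::vector<Effect*> effects;
  AudioSink*           sink;
  std::vector<float>   buffer;  // mono working buffer for this track
};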

2.7 Real-time recording

In order to record an arbitrary length of audio, an arbitrary-length array is needed to store it, and due to real-time constraints we cannot create such an array in the real-time thread. This problem was solved by writing all incoming audio to a ring-buffer, and allowing the GUI thread to create the actual buffers that are later used in the engine. The buffer is passed to the real-time thread using an event and ring-buffer as described earlier. This allows the user to record any length of audio in real time without any non-real-time operations occurring.

3 User Interface

The user interface for Luppp is geared towards live use, and hence it makes as little use of popup dialogs and extra windows as possible. The main view is laid out with tracks oriented vertically, and song-parts horizontally.

3.1 Clip view

The main part of the GUI is dedicated to the clip view, which selects the currently playing AudioBuffer. To load a buffer into a clip we right-click on the desired clip, and we will be presented with a file-chooser dialog. It is also possible to drag-and-drop samples onto the clip using the side-pane file browser. Recording audio from a JACK audio input port is also possible: just record-enable the track and click on the clip.

3.2 Track view

The lower part of the window shows the state of the currently selected track. Its AudioSource is on the left while any effects are listed to the right. The main parameters of an effect can be modified using the mouse, by click-and-drag in the graph area. The X and Y axes of the graph are mapped to the two most used parameters of that effect type.

3.3 Master track

The master track provides feedback as to what scene is currently playing, the rotation and elevation values of the B-format ambisonic output and the headphones volume levels.

3.4 Side-pane browser

On the left of the UI there is a browser that allows the previewing of samples, instruments and effects. Drag-and-drop operations from the samples to the clips are supported, as is dropping effects and instruments onto the track view. The browser filters its contents based on file extension; samples must have a .wav extension. The Instrument and Effect lists are hard coded.

4 Meta data files

To inform the engine how long an audio loop is in musical beats, meta data files are used. The files are read by libconfig, and can provide information like the length of the loop in beats, or the key of a loop. This gives the engine a chance to process the audio in a musically meaningful way, like stretching drum loops to match the tempo of the currently playing material.

4.1 Sample meta data

When loading a sample, Luppp attempts to open a file called lupppSamplePack.cfg from the same directory as the sample is located in. If this file exists, we iterate through all information in the file, and check if any of the metadata is relevant to the sample that we want to load. If meta data exists about the sample, we set properties on the AudioBuffer instance that the sample will get loaded into. These properties can be its number of beats (in the musical sense), or what musical key it is in, etc.

Example lupppSamplePack.cfg file:

luppp :
{
  samplePack:
  {
    name = "SamplePackName";
    numSamples = 1;

    s0:
    {
      name = "sampleName.wav";
      numBeats = 16;
    }
  }
}
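As an illustration of how such a meta data file can be parsed, the sketch below reads the numBeats entry of the example file with the libconfig C++ API. Only the file layout follows the example above; the variable names and the minimal error handling are assumptions, not Luppp's actual loader.

// Sketch: read sample meta data with libconfig++ (layout as in the example above).
#include <libconfig.h++>
#include <iostream>
#include <string>

int main() {
  libconfig::Config cfg;
  try {
    cfg.readFile("lupppSamplePack.cfg");
    const libconfig::Setting& s0 = cfg.lookup("luppp.samplePack.s0");
    std::string sampleName;
    int numBeats = 0;
    if (s0.lookupValue("name", sampleName) && s0.lookupValue("numBeats", numBeats)) {
      // In Luppp these values would be stored on the AudioBuffer instance.
      std::cout << sampleName << " is " << numBeats << " beats long\n";
    }
  } catch (const libconfig::FileIOException&) {
    // no meta data file next to the sample: load without meta data
  } catch (const libconfig::ParseException& e) {
    std::cerr << "parse error in meta data file at line " << e.getLine() << "\n";
  } catch (const libconfig::SettingNotFoundException&) {
    // file exists but holds no entry for this sample
  }
  return 0;
}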

Figure 2: Luppp user interface showing the clip selector, track effects and file browser.

5 Future work

In future I would like to implement more features for Luppp, particularly around easier access to core Luppp functionality from controllers. Also on the todo list is MIDI sequencing functionality, which will allow a faster workflow while creating new musical ideas.

Some improvements can be made in performance, and multi-threading the actual audio processing path in the engine is a possibility.

5.1 MIDI sequencing and editing

It is intended to support the looping of MIDI clips in a similar fashion to Seq24. MIDI clips will be shown in the user interface just like audio clips, and clicking on one will select it to be shown in the MIDI editor.

The MIDI editor will be as streamlined as possible, with minimal features and as fast a workflow as is possible. Some initial sketches of possible workflows have been made, however this functionality is still in its planning stages.

5.2 Immutable Engine

For performance reasons it is planned to make AudioBuffer instances immutable, so that they can be used in the real-time thread as well as shown in the GUI without expensive copying. Plotting the audio clips in the GUI will become an easier task, as we can safely access the AudioBuffers from multiple threads due to their immutability.

To modify an AudioBuffer, a new one will be created in the GUI thread, and that new instance will be swapped into the engine. This modified buffer will keep the same unique ID number, so that all other parts of the engine will automatically use the updated buffer.
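A minimal sketch of how such an ID-preserving buffer swap could look is given below. Since this is planned rather than existing Luppp functionality, the class layout and the swap function are assumptions based on the description above; the returned old buffer would be handed back to a non-real-time thread for deletion.

// Sketch only: ID-preserving swap of an immutable AudioBuffer (planned design).
#include <map>
#include <vector>
#include <utility>

struct AudioBuffer {
  const int id;                      // stable ID, kept across updates
  const std::vector<float> samples;  // contents never change after creation
  AudioBuffer(int i, std::vector<float> s) : id(i), samples(std::move(s)) {}
};

class StateStore {
public:
  // Real-time thread: called when an event delivers a replacement buffer that
  // was created by the GUI thread. The previous buffer is returned so a
  // non-real-time thread can delete it later (no freeing in the RT path).
  // A real implementation would pre-allocate the map entries; sketch only.
  AudioBuffer* swapBuffer(AudioBuffer* newBuf) {
    AudioBuffer*& slot = buffers[newBuf->id];
    AudioBuffer* old = slot;
    slot = newBuf;                   // same ID, so all users pick up the new data
    return old;
  }
private:
  std::map<int, AudioBuffer*> buffers;
};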

hence a method to request the processing of another AudioTrack than the current one is needed.

If each SidechainCompressor instance has a pointer to a SidechainSource, it can request the source to produce the needed samples, and then continue to process its own AudioTrack's audio based on the sidechain buffer of samples. This also involves implementing a system whereby each Effect or AudioSource can be marked as finished processing, so that we avoid re-calculating parts of a track that have already been processed because of the call to the SidechainSource.

Essentially this makes the SidechainSource a cache for audio samples, which can be requested to process the next block of audio either from the AudioTrack that it is part of, or from another AudioTrack which needs this SidechainSource's output to complete its own processing.

5.4 Additional effect classes

Initial work to host Pure Data patches or Csound instruments as AudioSources or Effects has gone well, although some design is needed with regard to the real-time constraints of the effects and loading files. Another issue that needs to be addressed is that non-real-time-safe operations might be carried out by user-defined patches.

VAMP plugins [10] will hopefully be supported in the future, to allow the user to analyze existing material and work with a higher level of control over the music. If more meta data can be provided to the Luppp engine, more high-level features can be implemented, as more data exists for use during the processing cycle. I feel this is quite an important aspect of Luppp to develop, as it gives the user more scope for creativity and inspiration instead of hindering the workflow with mathematical details of the processing that is going on behind the GUI.

6 Conclusions

While Luppp is currently in a usable state, there are some known issues that must be dealt with. I would like to rethink the method of communication between the Engine and the StateStore for the user interface. The way that the user interface widgets are stored in the gui class can also be much improved. Some design issues like the concept of tracks and how ID numbers relate to tracks could also be upgraded, as currently some functionality is hindered by how the IDs are set.

7 Acknowledgments

I would like to thank the entire LAD community for all that I have learnt through reading their code, asking questions on the LAD mailing list as well as on IRC. Thanks also to all those who have helped me in any way while I have been working on Luppp, for providing suggestions, testing buggy code and just chatting about the project in general.

8 References

[1] JACK Audio Connection Kit. http://www.jackaudio.org
[2] Gtkmm, official C++ interface for the popular GUI library GTK+. http://www.gtkmm.org
[3] libsndfile, C library for reading and writing files containing sampled sound. http://www.mega-nerd.com/libsndfile
[4] Fluidsynth, a real-time software synthesizer based on the SoundFont 2 specifications. http://www.fluidsynth.org
[5] libconfig, a simple library for processing structured configuration files. http://www.hyperrealm.com/libconfig
[6] SUIL, a lightweight C library for loading and wrapping LV2 plugin UIs. http://drobilla.net/software/suil
[7] LILV, a library to make the use of LV2 plugins as simple as possible. http://drobilla.net/software/lilv
[8] LADSPA, Linux Audio Developer's Simple Plugin API. http://www.ladspa.org
[9] LV2, a plugin standard for audio systems. http://lv2plug.in/trac/
[10] VAMP, an audio processing plugin system for plugins that extract descriptive information from audio data. http://vamp-plugins.org/

Csound as a Real-time Application
Typical Problems and Solutions for the Use of Csound in a Live Context

Joachim HEINTZ Incontri - Institute for new music HMTM Hannover Emmichplatz 1, 30175 Hannover, Germany [email protected]

Abstract

This article discusses the usage of Csound for live electronics. Embedding Csound as an audio engine in Pd is shown, as well as working in CsoundQt for live performance. Typical problems and possible solutions are demonstrated as examples. The pros and contras of both environments are discussed in comparison.

Keywords

Csound, Pd, Live Electronics.

1 Introduction

Csound [1] is well known as one of the most powerful and approved audio libraries, but its main programming model stems from a non-real-time approach: 'instruments' are called for specified durations by a 'score'. This is perfectly suited for tape compositions, and there is still a predominant opinion that Csound is good for fixed media compositions but cannot be used in live situations. Or at least, that it is a pain to do so.

Indeed, if the user does not want to hit a dead end when using Csound in a real-time situation, the architecture and event structure of Csound need to be carefully considered. It is the aim of this contribution to exemplify how Csound can be used successfully for live performance.

As Csound has no native user interface, it can be embedded in different hosts. Each host application offers different advantages and its own manner of interfacing with Csound. In the field of free software, the use of Csound in Pd [2] and the use of Csound in CsoundQt [3] are probably the most interesting choices.1 The examples given here utilize these two hosts. They will describe the real-time transposition of a live input, which represents a typical real-time transformation.

2 Csound in Pd: Building a polyphonic real-time transposer

Combining Csound's DSP power with Pd's flexibility and interfacing should be a really nice idea, both for Csound and Pd users. Victor Lazzarini's csoundapi~ external for Pd2 offers this connection. Strangely enough, it is much less used than would be expected, perhaps due to a certain lack of documentation. The first example uses Csound as a polyphonic transposition machine inside Pd and discusses some typical issues.

2.1 Running the csoundapi~ object

It is beyond the scope of this article to describe the installation of the csoundapi~ external in Linux, MS Windows and Apple OSX. Descriptions can be found, for instance, in the Csound Floss Manual.3 Once the csoundapi~ object has been installed, it is available in Pd and refers to a Csound file (*.csd).

1 The use of Csound in MaxMsp is well known and perfectly documented by Davis Pyon [4]. As any Max object can be embedded in Ableton Live via "Max4Live", it is possible to call Csound in Ableton Live, too. This has been popularized by Richard Boulanger and partners as "Csound4Live" [5].
2 The sources can be found in the "frontends" directory of the Csound code [1] or in the repository at http://csound.git.sourceforge.net/git/gitweb-index.cgi
3 www.flossmanuals.net/csound/index

Now audio and control data can be sent from Pd to Csound and back.

2.2 Building a four voice transposition instrument

In this example, audio from a microphone is received by Pd and passed to Csound, which then creates the four voice transposition which is then sent back to Pd as a stereo mix. Csound implements these transpositions by first converting the received audio into a frequency-domain signal and then creating the four pitch-shifted signals. These four signals are converted back into the time domain, panned and mixed to create a stereo output.4 This stereo signal is finally passed back into Pd which in turn sends it to the speakers.

Figure 1: Scheme of embedding Csound in Pd for a four-voice live transposition

The related Csound file is very simple:

sr = 44100
ksmps = 32
nchnls = 2
0dbfs = 1

instr transp
 ;input from pd
 aIn inch 1
 ;transform to frequency domain
 fIn pvsanal aIn, 1024, 256, 1024, 1
 ;transposition
 fTransp1 pvscale fIn, cent(-600)
 fTransp2 pvscale fIn, cent(-100)
 fTransp3 pvscale fIn, cent(200)
 fTransp4 pvscale fIn, cent(500)
 ;back to time domain
 aTransp1 pvsynth fTransp1
 aTransp2 pvsynth fTransp2
 aTransp3 pvsynth fTransp3
 aTransp4 pvsynth fTransp4
 ;panning
 aL1, aR1 pan2 aTransp1, .1
 aL2, aR2 pan2 aTransp2, .33
 aL3, aR3 pan2 aTransp3, .67
 aL4, aR4 pan2 aTransp4, .9
 aL = (aL1+aL2+aL3+aL4)/3
 aR = (aR1+aR2+aR3+aR4)/3
 ;output to pd
 outs aL, aR
endin

i "transp" 0 99999

The whole Pd patch looks like this:

Figure 2: Pd patch for four-voice live transposition embedding Csound via csoundapi~

2.3 Sending and receiving control data

For a fuller interaction between Pd and the embedded Csound it is normally desirable to also pass control signals between Pd and Csound. The transposition values in the given example might have to be controlled from Pd instead of being fixed:

4 A quadrophonic output redirection into Pd would also be possible.

Figure 3: Sending cent values from Pd to Csound

From Pd's point of view, the cent values are a "message" to be sent to Csound. From Csound's point of view, the cent values are "control signals" which are received on certain "control channels". The csoundapi~ object understands the message "chnset ...". A Pd message box containing "chnset cent1 -600" means: send the value -600 on the software channel named "cent1".

In Csound, the related message needs to be received by a chnget opcode.5 The instrument code has to be changed slightly now, as follows:

...
instr transp
 ;audio input from pd
 aIn inch 1
 ;control input from pd
 kCent1 chnget "cent1"
 kCent2 chnget "cent2"
 kCent3 chnget "cent3"
 kCent4 chnget "cent4"
 ;transform to frequency domain
 fIn pvsanal aIn, 1024, 256, 1024, 1
 ;transposition
 fTransp1 pvscale fIn, cent(kCent1)
 fTransp2 pvscale fIn, cent(kCent2)
 fTransp3 pvscale fIn, cent(kCent3)
 fTransp4 pvscale fIn, cent(kCent4)
...

And this is the Pd patch, with some choices to send values:

Figure 4: Pd sending cent values to Csound

2.4 The "just once" problem

It is worth having a closer look at what is actually happening in the example above. It looks easy, and it works, but actually two languages are communicating with each other in a strange way. In this case, they understand each other, but this is pure chance if the differences have not been understood by the user.

Pd differentiates between messages and audio streams.6 Messages are sent just once, while audio streams send data continuously, at the sample rate. Csound has no type for sending something in realtime "just once". If something is sent as a control value to Csound, like the cent values in the example above, Csound considers this as a control signal, i.e. a continuous stream of control data.7

In this case, it works. Pd sends a new cent message just once, and Csound interprets this as a repeated control message. It works, because the pvscale opcode needs a control-rate input. But if the user wants to trigger something "just once", they have to ensure that Csound does not repeat the message from Pd all the time.

As a simple example, the user may want to trigger a short beep each time a message box with a '1' has been pressed:

Figure 5: Sending a '1' message from Pd to Csound

It looks straightforward to code the beep.csd in this way:

sr = 44100
ksmps = 32
nchnls = 1
0dbfs = 1

instr master ;wrong!
 kTrigger chnget "trigger"
 if kTrigger == 1 then
  event "i", "beep", 0, 1
 endif
endin

instr beep
 aBeep oscils .2, 400, 0
 aEnv transeg 1, p3, -6, 0
 out aBeep * aEnv
endin

i "master" 0 99999

A master instrument would receive the data on the channel called 'trigger' from Pd, and call a subinstrument "beep" each time a '1' has been received. That is the wish of the user. But what really happens: the kTrigger variable will always be set to '1', because there is no other message coming from Pd, and so Csound will trigger a beep not "just once", but once at each control cycle, in this case 44100/32 times per second ...

Two things have to be done here: Pd must send a '0' if the message box has not been pressed. The following patch sends, when hitting the 'bang', first a '1' and then after 10 milliseconds a '0':8

Figure 6: Pd sending '1' followed by '0' to Csound

On the Csound side, just the first '1' received after any zeros should trigger a beep. So Csound has to ignore the repetitions. This is done by the following code in the master instrument:

instr master ;now correct
 kTrigger chnget "trigger"
 kNewVal changed kTrigger
 if kTrigger == 1 && kNewVal == 1 then
  event "i", "beep", 0, 1
 endif
endin

The changed opcode returns '1' exactly in the control cycle when a new value of kTrigger has been received. So the if-clause has to ask whether both kNewVal and kTrigger equal 1. Only in this case is the subinstrument triggered.

2.5 Results

The use of Csound for realtime applications in Pd can be considered as easy and straightforward in general. The major difficulties come from the different signal paradigms in both languages. Special care has to be taken by the user for "just once" cases, which are common in Pd but need cautious coding in Csound.

5 The alternative choice is to send messages as "control ..." from Pd to Csound, and receive them with the invalue opcode in Csound.
6 This is the same in MaxMsp, and is the well known difference between an object with a tilde and an object without a tilde. For instance, the object '*' multiplies two numbers: just once, when the left inlet has received a number as a message. The object '*~' multiplies two audio streams (or an audio stream and a number) continuously.
7 This stream of control data has a lower rate than the audio rate, depending on the ksmps value. If ksmps=32, one control sample is used for 32 audio samples.
8 Using a trigger in Pd which sends a '1' from the right output, followed by a '0' from the left output, will not work, because Csound will just receive the '0'.

3 Presets, software channels and more: An extended live instrument in CsoundQt

CsoundQt has been written and developed by Andrés Cabrera since 2008.9 It is now the most widely used frontend for Csound. It uses the Qt toolkit10 for building a graphical user interface. CsoundQt offers all necessary tools to work with live electronics in Csound, without using another host application. But how can Csound be trimmed to support presets, which are necessary for each extended live electronic context? An example is given which uses a keyboard as an instrument for controlling both some real-time processing of a microphone input and electronic sounds generated in real time.

3.1 Presets

A preset is a general configuration in a live electronic application which defines the meaning of certain input signals in the context of this preset environment. For instance, if the midi key number 60 has been pressed in the context of preset 1, it may mean "open the live input with a fade in". If the same key has been pressed in preset 2, it may mean something totally different, for instance "play a percussive sound".

As CsoundQt has native widgets, these widgets can be used to represent a preset.11 Csound will then look at the value of this widget, and will decide what to do at a certain midi input. This is the general scheme:

Figure 7: General preset scheme

As "triggering an event" in Csound usually means "calling an instrument", the Csound code looks like this:

instr midi_receive
 ;getting the midi note number
 iNotNum notnum
 ;getting the actual preset number
 iPreset = i(gkPreset)
 ;bindings for preset 1
 if iPreset == 1 then
  if iNotNum == 60 then
   event_i "i", "live_fade_in", 0, 1
  elseif iNotNum == ... then
   event_i ...
  endif
 ;bindings for preset 2
 elseif iPreset == 2 then
  if iNotNum == 60 then
   event_i "i", "trigger_sound", 0, 1
  elseif iNotNum == ... then
   event_i ...
  endif
 ;bindings for preset ...
 elseif ...
  ...
 endif
endin

The gkPreset variable, which holds the status of the preset given by the preset widget, is created in an "always on" instrument. This instrument should be the first of all the instruments,12 and its definition should contain a line like this ("preset" is a string which defines the name of the widget's channel):

gkPreset invalue "preset"

By this, the current preset state is received from the spin box widget, and sent as the global variable gkPreset to each instrument. The widget itself can be changed by the user either via the GUI or via any input event, for example a reserved midi note for increasing, and another note number for decreasing the preset number. Assuming the reserved midi note for increasing is 72, and for decreasing 48, the code looks like this:13

if iNotNum == 72 then
 outvalue "preset", iPreset + 1
elseif iNotNum == 48 then
 outvalue "preset", iPreset - 1
endif

The method proposed here for handling presets offers the user all flexibility. All events are embedded in instruments which are triggered if a midi key is received in the context of a particular preset. General conditions for a new preset can be set in a similar way as shown in chapter 2.4 with the changed method. For instance, if the live microphone is to be open at the beginning of preset 1, but closed at the beginning of preset 2, the code in the master instrument could be:

gkPreset invalue "preset"
kNewPrest changed gkPreset
if kNewPrest == 1 then
 if gkPreset == 1 then
  kLiveVol = 1 ;mic open
 elseif gkPreset == 2 then
  kLiveVol = 0 ;mic closed
 endif
endif

9 The name has been changed from QuteCsound to CsoundQt after discussions at the Csound Conference in Hannover in October 2011.
10 http://qt.nokia.com
11 CsoundQt also offers a possibility to store widget states as presets. This is very useful in many cases, but not sufficient here, because here the presets do not affect just widgets, but also determine the effect of a Midi key pressed, a parameter being set, and more.
12 Technically speaking: the instrument with the smallest number.
13 The instrument is the same as the one above called "midi_receive". This instrument is triggered directly by a midi note on message. It works during Csound's initialization pass, which is in some cases another option for the "just once" problem.

3.2 Software busses for control and audio data

Many situations need a flexible interchange of control data. Suppose the user wants to transpose the live input first in four voices with the cent values -449, -315, 71 and 688, and then in the next bar with the cent values -568, -386, 428 and 498. Both events (activating the four voice transposition and changing the values) are to be triggered by two different midi keys, for instance 60 and 62.

This is a typical case for the use of internal software busses. Four audio signals are set. For each of them a software control bus is created, holding the transposition value. The first instrument sets the initial values. When the second instrument is triggered, the control busses are set to the new values (figure 8).

Figure 8: Working with software channels to change transposition values

This is the related Csound code:

instr set_transp
 ;create software busses
 chn_k "transp1", 3
 chn_k "transp2", 3
 chn_k "transp3", 3
 chn_k "transp4", 3
 ;set initial values
 chnset -449, "transp1"
 chnset -315, "transp2"
 chnset 71, "transp3"
 chnset 688, "transp4"
 ;receive the values from the software busses
 kCent1 chnget "transp1"
 kCent2 chnget "transp2"
 kCent3 chnget "transp3"
 kCent4 chnget "transp4"
 ;receive the live audio input
 aLvIn chnget "live_in"
 ;perform fourier transform
 fLvIn pvsanal aLvIn, 1024, 256, 1024, 1
 ;perform four voice transposition
 fTp1 pvscale fLvIn, cent(kCent1)
 fTp2 pvscale fLvIn, cent(kCent2)
 fTp3 pvscale fLvIn, cent(kCent3)
 fTp4 pvscale fLvIn, cent(kCent4)
 ;resynthesize
 aTp1 pvsynth fTp1
 aTp2 pvsynth fTp2
 aTp3 pvsynth fTp3
 aTp4 pvsynth fTp4
 ;add and apply envelope
 aTp = aTp1+aTp2+aTp3+aTp4
 kHul linsegr 0, .3, 1, p3-0.3, 1, .5, 0
 aOut = aTp * kHul
 ;mix to global audio bus for live out
 chnmix aOut, "live_out"
endin

instr change_transp
 chnset -568, "transp1"
 chnset -386, "transp2"
 chnset 428, "transp3"
 chnset 498, "transp4"
endin

As can be seen from this example, the software busses are not just used for control but also for audio signals. The live audio input is sent to a channel called "live_in". The instrument set_transp gets the live input via the line

aLvIn chnget "live_in"

After processing, the live transposed signals are sent to an audio bus called "live_out". As other instruments may also send audio to this bus, chnmix is used instead of chnset:

chnmix aOut, "live_out"

Software channels are the solution to many different situations in programming live electronics: mixing, routing, interchanging of values. The chn opcodes offer a flexible and reliable system to work with them in Csound.

3.3 Performance tweakings

Csound's performance for live applications depends mainly on its vector and buffer sizes.14 As mentioned above, the ksmps constant defines the internal vector size. A value of ksmps=32 should be adequate for most live situations. The software and hardware buffer sizes must not be kept at Csound's defaults, but should be set to lower values to avoid an audible latency.15 CsoundQt offers an easy way to adjust them in the configure panel. These are fair values:

Figure 9: Adjusting the buffer sizes in CsoundQt's configuration panel

In future versions of Csound, the use of multiple cores and threads will further improve the performance.16 This would also be adjusted in CsoundQt's configure panel.

In addition to these general Csound adjustments, there are some performance tweaks particularly related to CsoundQt. The python callback, especially, must be disabled for the best live performance:

Figure 10: Performance tweaks in CsoundQt

14 More information can be found at www.csounds.com/manual/html/UsingOptimizing.html
15 There are some opcodes which are not suited for live use in Csound. Usually they have a fast and modern alternative. Especially the old pv (phase vocoder) opcodes should in practical use be substituted by the excellent pvs (phase vocoder streaming) opcodes: www.csounds.com/manual/html/SpectralRealTime.html
16 Thanks to the work of John ffitch and others, the new parser for Csound is at the time of writing this article (end of the year 2011) in the process of becoming the default. Amongst other improvements, it will offer multithreading.

3.4 Results

CsoundQt offers a nice graphical user interface for the use of "pure" Csound with standard widgets for live electronics. Instead of changing between different platforms, the users have an easier

workflow than using Csound in Pd. (When making changes to a Csound file running within Pd it is necessary to save the Csound file and to reset the csoundapi~ object that uses it for those changes to take effect.) Users can add their own functions ("user defined opcodes") or integrate any from the repository,17 for instance for simplifying the use of the ascii keyboard or for easily recording and playing buffers.

On the user's wish list for using CsoundQt live is certainly an increased array of widgets, for example an audio meter, a midi keyboard widget and an interactive table editor. The possibility of creating more than one widget panel would certainly make CsoundQt much more flexible for live use.18

4 Conclusion

Csound can be used perfectly for any live application. The usage of Csound in Pd is very easy. It offers many new possibilities for Pd users, and an easy connection with any external devices (game pads, arduino boards and more) for Csound users. The usage of CsoundQt for live applications offers a clean and pleasant internal graphical interface and a comfortable workflow to the user. Complex features like presets and software busses can be programmed easily. However, some additional widgets and GUI options might be desirable. Ultimately, whether a user ought to use Csound live within Pd or purely with CsoundQt as the front-end will depend on their personal preferences and the situation at hand.

5 Acknowledgements

Thanks to Anna, Alex, Andrés and Iain for reading the manuscript.

References

[1] Csound: http://csound.sourceforge.net (with many links to related sites)
[2] Pd: http://puredata.info
[3] CsoundQt: http://qutecsound.sourceforge.net
[4] csound~ (a Max external to run Csound): www.davixology.com/csound~.html
[5] Csound4Live (Csound for use in Ableton Live via csound~ [4] and the Max4Live bridge): www.csoundforlive.com

17 www.csounds.com/udo
18 Although not covered in this article, see also: http://www.youtube.com/watch?v=O9WU7DzdUmE, http://www.youtube.com/watch?v=Hs3eO7o349k, http://www.youtube.com/watch?v=yUMzp6556Kw

Csound for Android

Steven YI and Victor LAZZARINI
National University of Ireland, Maynooth
{steven.yi.2012, victor.lazzarini}@nuim.ie

Abstract

The Csound computer music synthesis system has grown from its roots in 1986 on desktop Unix systems to today's many different desktop and embedded operating systems. With the growing popularity of the Linux-based Android operating system, Csound has been ported to this vibrant mobile platform. This paper will discuss using the Csound for Android platform, use cases, and possible future explorations.

Keywords

Csound, Android, Cross-Platform, Linux

1 Introduction

Csound is a computer music language of the MUSIC N type, developed originally at MIT for UNIX-like operating systems [Boulanger, 2000]. It is Free Software, released under the LGPL. In 2006, a major new version, Csound 5, was released, offering a completely re-engineered software, which is now used as a programming library with its own application programming interface (API). It can now be embedded and integrated into several systems, and it can be used from a variety of programming languages and environments (C/C++, Objective-C, Python, Java, Lua, Pure Data, Lisp, etc.). The API provides full control of Csound compilation and performance, software bus access to its control and audio signals, as well as hooks into various aspects of its internal data representation. Several frontends and composition systems have been developed to take advantage of these features. The Csound API has been described in a number of articles [Lazzarini, 2006], [Lazzarini and Piche, 2006], [Lazzarini and Walsh, 2007].

The increasing popularity of mobile devices for computing (in the form of mobile phones, tablets and netbooks) has brought to the fore new platforms for Computer Music. Csound has already been featured as the sound engine for one of the pioneer systems, the XO-based computer used in the One Laptop per Child (OLPC) project [Lazzarini, 2008]. This system, based on a Linux kernel with the Sugar user interface, was an excellent example of the possibilities allowed by the re-engineered Csound. It sparked the ideas for a Ubiquitous Csound, which is steadily coming to fruition with a number of parallel projects, collectively called the Mobile Csound Platform (MCP). One such project is the development of a software development kit (SDK) for Android platforms, which is embodied by the CsoundObj API, an extension to the underlying Csound 5 API.

Android1 is a Linux-kernel-based, open-source operating system, which has been deployed on a number of mobile devices (phones and tablets). Although not providing a full GNU/Linux environment, Android nevertheless allows the development of Free software for various uses, one of which is audio and music. It is a platform with some good potential for musical applications, although at the moment, it has a severe problem for realtime use that is brought by a lack of support for low-latency audio.

In this article we will discuss Csound usage on Android. We will explore the CsoundObj API that has been created to ease developing Android applications with Csound, as well as demonstrate some use cases. Finally, we will look at what Csound uniquely brings to Android, with a look at the global Csound ecosystem and how mobile apps can be integrated into it.

1 http://www.android.com

2 Csound for Android

The Csound for Android platform is made up of a native shared library (libCsoundandroid.so) built using the Android Native Development Kit (NDK)2, as well as Java classes that are compilable with the more commonly used Android Dalvik compiler. The native library is linked using the object files that are normally used to make up the libcsound, libcsnd, and libsndfile3 libraries that are part of the desktop version of Csound. The Java classes include those commonly found in the csnd.jar library used for desktop Java-based Csound development, as well as unique classes created for easing Csound development on Android.

The SWIG4 wrapping used for Android contains all of the same classes as those used in the Java wrapping that is used for desktop Java development with Csound. Consequently, those users who are familiar with Csound and Java can transfer their knowledge when working on Android, and users who learn Csound development on Android can take their experience and work on desktop Java applications. However, the two platforms do differ in some areas such as classes for accessing hardware and different user interface libraries. To help ease development, a CsoundObj class was developed to provide out-of-the-box solutions for common tasks (such as routing audio from Csound to hardware output). Also, applications using CsoundObj can be more easily ported to other platforms where CsoundObj is implemented (i.e. iOS).5

One of the first issues arising in the development of Csound for Android was the question of plugin modules. Since the first release of Csound 5, the bulk of its unit generators (opcodes) were provided as dynamically-loaded libraries, which resided in a special location (the OPCODEDIR or OPCODEDIR64 directories) and were loaded by Csound at the orchestra compilation stage. However, due to the uncertain situation regarding dynamic libraries (not only in Android but also in other mobile platforms), it was decided that all modules without any dependencies or licensing issues could be moved to the main Csound library code. This was a major change (in Csound 5.15), which made the majority of opcodes part of the base system, about 1,500 of them, with the remaining 400 or so being left in plugin modules. The present release of Csound for Android includes only the internal unit generators. Another major internal change to Csound, which was needed to facilitate development for Android, was the move to use core (memory) files instead of temporary disk files in orchestra and score parsing.

Audio IO has been developed on two fronts: using pure Java code through the AudioTrack API provided by the Android SDK and, using C code, as a Csound IO module that uses the OpenSL API that is offered by the Android NDK. The latter was developed as a possible window into a future lower-latency mode, which is not available at the moment. It is built as a replacement for the usual Csound IO modules (PortAudio, ALSA, JACK, etc.), using the provided API hooks. The Csound input and output functions, called synchronously in its performance loop, pass a buffer of audio samples to the DAC/ADC using the OpenSL enqueue mechanism. This includes a callback that is used to notify when a new buffer needs to be enqueued. A double buffer is used, so that while one half is being written or read by Csound, the other is enqueued to be consumed or filled by the device. The code fragment below in listing 1 shows the output function and its associated callback. The OpenSL module is the default mode of IO in Csound for Android. Although it does not currently offer low latency, it is a more efficient means of passing data to the audio device and it operates outside the influence of the Dalvik virtual machine garbage collector (which executes the Java application code).

The AudioTrack code offers an alternative means of accessing the device. It pushes/retrieves input/output frames into/from the main processing buffers (spin/spout) of Csound synchronously at control cycle intervals. It is offered as an option to developers, which can be used, for instance, in older versions of Android without OpenSL support.

2 http://developer.android.com/sdk/ndk/index.html
3 http://www.mega-nerd.com/libsndfile/
4 http://www.swig.org
5 There are plans to create CsoundObj implementations for other object-oriented desktop development languages/platforms such as C++, Objective-C, Java, and Python, but at the time of this writing, CsoundObj is only available in Objective-C for iOS.

3 Application Development using CsoundObj

Developers using the CsoundObj API will essentially partition their codebase into three parts: application code, audio code, and glue code. The application code contains the standard Android code for creating applications, including such things as view controllers, views, database handling, and application logic. The audio code is a standard Csound CSD project that contains code written in Csound and will be run using a CsoundObj object. Finally, the glue code is what will bridge the user interface with Csound.

/* this callback handler is called every time a buffer finishes playing */
void bqPlayerCallback(SLAndroidSimpleBufferQueueItf bq, void *context)
{
  open_sl_params *params = (open_sl_params *) context;
  params->csound->NotifyThreadLock(params->clientLockOut);
}

/* put samples to DAC */
void androidrtplay_(CSOUND *csound, const MYFLT *buffer, int nbytes)
{
  open_sl_params *params;
  int i = 0, samples = nbytes / (int) sizeof(MYFLT);
  short* openslBuffer;

  params = (open_sl_params *) *(csound->GetRtPlayUserData(csound));
  if (params == NULL) return;
  openslBuffer = params->outputBuffer[params->currentOutputBuffer];
  do {
    /* fill one of the double buffer halves */
    openslBuffer[params->currentOutputIndex++] = (short) (buffer[i]*CONV16BIT);
    if (params->currentOutputIndex >= params->outBufSamples) {
      /* wait for notification */
      csound->WaitThreadLock(params->clientLockOut, (size_t) 1000);
      /* enqueue audio data */
      (*params->bqPlayerBufferQueue)->Enqueue(params->bqPlayerBufferQueue,
          openslBuffer, params->outBufSamples*sizeof(short));
      /* switch double buffer half */
      params->currentOutputBuffer = (params->currentOutputBuffer ? 0 : 1);
      params->currentOutputIndex = 0;
      openslBuffer = params->outputBuffer[params->currentOutputBuffer];
    }
  } while (++i < samples);
}

Listing 1: OpenSL module output C function and associated callback

public interface CsoundValueCacheable {
  public void setup(CsoundObj csoundObj);
  public void updateValuesToCsound();
  public void updateValuesFromCsound();
  public void cleanup();
}

Listing 2: CsoundValueCacheable Interface

String csd = getResourceFileAsString(R.raw.test);
File f = createTempFile(csd);
csoundObj.addSlider(fSlider, "slider", 0.0, 1.0);
csoundObj.startCsound(f);

Listing 3: Example CsoundObj usage

CsoundObj uses objects that implement the CsoundValueCacheable interface for reading values from and writing values to Csound (listing 2). Any number of cacheables can be used with CsoundObj. The design is flexible enough that you can design your application to use one cacheable per user interface or hardware sensor element, or one can make a cacheable that reads and writes along many channels.

CsoundObj contains utility methods for binding Android Buttons and SeekBars to a Csound channel, as well as a method for binding the hardware Accelerometer to preset Csound channels. These methods wrap the View or sensor objects with pre-made CsoundValueCacheables that come with the CsoundObj API. Since these are commonly used items that would be bound, the utility methods were added to CsoundObj as a built-in convenience to those using the API. Note that CsoundValueCacheables are run within the context of the audio processing thread; this was done intentionally so that the cacheable could copy any values it needed to from Csound, then continue to do processing in another thread and eventually post back to the main UI thread via a Handler.

Figure 1: Android Emulator showing Simple Test 1 Activity

Listing 3 shows example code of using CsoundObj with a single slider, from the Simple Test 1 Activity, shown in Figure 1. The code above shows how a CSD file is read from the project's resources using the getResourceFileAsString utility method, saved as a temporary file, then used as an argument to CsoundObj's startCsound method. The second-to-last line shows the addSlider method being used to bind fSlider, an instance of a SeekBar, to Csound with a channel name of "slider" and a range from 0.0 to 1.0. When Csound is started, the values from that SeekBar will be read by the Csound project using the chnget opcode, which will be reading from the "slider" channel.

Figure 2 shows the relationships between different parts of the platform and different usage scenarios. An application may work with CsoundObj alone if it is only going to be starting and stopping a CSD. The application may also use CsoundValueCacheables for reading and writing values from either CsoundObj or the Csound object. Finally, an application may do additional interaction with the Csound object that the CsoundObj has as its member, taking advantage of the standard Csound API.

Figure 2: CsoundObj Usage Diagram

A Csound for Android examples project has been created that contains a number of different Csound example applications. These examples demonstrate different ways of using the CsoundObj API as well as different approaches to applications, such as realtime synthesis instruments and generative music. The examples were ported over from the Csound for iOS examples project and users can study the code to better understand both the CsoundObj API on Android as well as what is required to do cross-platform development with Csound as an audio platform.

4 Benefits of using Csound on Android

Using Csound on Android provides many benefits. First, Csound contains one of the largest libraries of synthesis and signal processing routines. By leveraging what is available in Csound, the developer can spend more time working on the user interface and application code and rely on the Csound library for audio-related programming. The Csound code library is also tested and supported by an open-source community, meaning less testing work is required for your project.

In addition to the productivity gain of using a library for audio, Csound projects, developed in text files with .csd extensions, can be developed on the desktop and later moved to the Android application. Developing and testing on the desktop allows for a faster development process than testing in the Android emulator or on a device, as it removes the application compilation and deployment stage, which can be slow at times.

Having the audio-related code in a CSD file for a project also brings with it two benefits. First, development of an application can be split amongst multiple people; one can work on the audio code while the other focuses on developing other areas of the application. Second, developing an application based around Csound allows for moving that CSD to other platforms, such as iOS or desktop operating systems. The developer would then only have to develop the user-interface and glue code to work with that CSD on each platform.

Additionally, cleanly separating out the audio system of an application and enforcing a strict API (Application Programmer Interface) to that system is a good practice for application development. This helps to prevent tangled, hard to maintain code. This is of benefit to the beginning and advanced programmer alike.

5 Conclusions

From its roots in the Music N family of programs, Csound has grown over the years, continually expanding its features as a synthesis library as well as its usefulness as a music platform. With its availability on multiple operating systems, Csound offers a multi-platform option for developing musical applications. Current Csound 6 developments to enable realtime modification of the processing graph as well as other features will expand the types of applications that can be built with Csound. As Android is now supported within the core Csound repository, it will continue to be developed as a primary platform for deployment as part of the MCP distribution.

6 Availability

The Csound for Android platform and examples project are included in the main Csound GIT repository. Build files are included for those interested in building Csound with the Android Native Development Kit. Archives including a pre-compiled Csound as well as examples are available at http://sourceforge.net/projects/csound/files/csound5/Android/.

7 Acknowledgements

This research was partly funded by the Program of Research in Third Level Institutions (PRTLI 5) of the Higher Education Authority (HEA) of Ireland, through the Digital Arts and Humanities programme.

References

R. Boulanger, editor. 2000. The Csound Book. MIT Press, Cambridge, Mass.

V. Lazzarini and J. Piche. 2006. Cecilia and tclcsound. In Proc. of the 9th Int. Conf. on Digital Audio Effects (DAFX), pages 315–318, Montreal, Canada.

V. Lazzarini and R. Walsh. 2007. Developing LADSPA plugins with Csound. In Proceedings of the 5th Linux Audio Developers Conference, pages 30–36, Berlin, Germany.

V. Lazzarini. 2006. Scripting Csound 5. In Proceedings of the 4th Linux Audio Developers Conference, pages 73–78, Karlsruhe, Germany.

V. Lazzarini. 2008. A toolkit for audio and music applications in the XO computer. In Proc. of the International Computer Music Conference 2008, pages 62–65, Belfast, Northern Ireland.

Ardour3 - Video Integration

Robin Gareus gareus.org, linuxaudio.org, CiTu.fr Paris, France [email protected]

Abstract

This article describes video-integration in the Ardour3 Digital Audio Workstation to facilitate sound-track creation and film post-production. It aims to lay a foundation for users and developers, towards establishing a maintainable tool-set for using free-software in A/V soundtrack production. To that end a client-server interface for communication between non-linear editing systems is specified. The paper describes the reference implementation and documents the current state of video integration into Ardour3 and future planned development. In the final sections a user-manual and setup/install information is presented.

Keywords

Ardour, Video, A/V, post-production, sound-track, film

1 Introduction

The idea of combining motion pictures with sound is nearly as old as the concept of cinema itself. Ever since the invention of the moving picture, films have entailed sound and music: be it mechanical music-boxes accompanying candle-light projections, the Kinetoscope in the late 19th century, live music supporting silent films in the early 20th century, or completely digital special effects in the 2000s.

Charlie Chaplin composed his own music for City Lights (1931) and many of his later movies. He was not clear whose job it was to score the soundtracks [Scaruffi, 2007]. That was the exception, and few film-makers would imitate him. In the 1930s, after a few years of experimentation, scoring film soundtracks became an art in earnest. By the mid-1940s, cinema's composers had become a well-established category.

Apart from the film-score (music), a film's soundtrack usually includes dialogue and sound effects. One goal is to synchronize dramatic events happening on screen with musical events in the score, but there are many different (artistic) methods for syncing music to picture. With only a few exceptions, namely song or dance scenes, music composition and sound-design usually take place after recording and editing the video [Wikipedia, 2011].

Major problems persisted, leading to pictures and sound recording largely taking separate paths for a generation. The primary issue was, and is, synchronization: pictures and sound are recorded and played back by separate devices, which were difficult to start and maintain in tandem. This stigma still prevails for most professional audio/video recordings, for reasons both creative and technical: i.e. there is too much gear on camera already, more (budgetary) choices are available, unions define different job functions, etc.

Post-production requires synchronizing the original audio recorded on the set (on-camera voice) with the edited video. In the analog days visual cues (slate) were used towards that end. With the advent of digital technology, time-code (most commonly SMPTE, produced by the camera) is commonly recorded along with the sound to allow reconstructing the audio-tracks after the video has been cut.

As you can imagine, the technical skills and details involved can become quite complex, and neither composers nor sound-designers want to concern themselves with that task. Creative technicians have come up with various tools to aid the process.

These days many Digital Audio Workstations provide features to import EDLs (edit decision lists) or variants thereof (AAF: Advanced Authoring Format, MXF: Material eXchange Format, BWF: Broadcast Wave Format, OMF/OMFI: Open Media Framework Interchange, ...) and display a film-clip in sync with the audio in order to facilitate soundtrack creation.

There are very few to no free-software solutions for this process. However, Ardour [Davis and others, 1999-2012] provides nearly all relevant features for composition and recording as well as mastering. The underlying JACK audio connection kit allows for inter-application synchronization [Davis and others, 2001-2012] and suggests itself to be used in the context of film post-production.

Figure 1: Ardour 3.0-beta1 with video-timeline patch and video-monitor

2 Design

Other FLOSS projects are tackling the process of video production (NLE: non-linear editing), most notably Blender [Foundation, 1999-2012], LiVES [Finch, 2004-2012] and Cinelerra [Ltd, 2002-2011]; however these emphasize the visuals and often significantly lack on the audio side.

Other free-software digital audio workstations (DAWs) capable of the job include MusE [Schweer and others, 1999-2012] and qtractor [Capela, 2005-2011] amongst others. However, with MIDI support arriving in Ardour3, continued cross-platform support and available high-end professional derivatives (Mixbus by Harrison), it is currently the most suitable and promising DAW from a developer's perspective.

2.1 Design-goals and use-cases

The aim is to provide an easy-to-use, professional workflow for film-sound production using free-software. More specifically: integration of video-elements into the Ardour Digital Audio Workstation.

The resulting interface must not be limited to the software at hand (Ardour, Xjadeo, icsd) but allow for further adaption or interoperability.

2.1.1 General

An important aspect of the envisaged project is modular design. This is a prerequisite to allow the project to be developed gradually and maintained long-term. With interface definitions in place, the building blocks can be implemented individually. It is also crucial to cope with the current situation of video-codec licensing: depending on the availability of licenses and user preferences, it should be possible to enable/disable certain parts of the system without breaking overall functionality.

The utter minimum for both composing as well as producing sound-tracks is a video-player that synchronizes to an external time-source provided by the DAW. To navigate around in larger projects a timeline becomes very valuable.

Session-management as well as archiving is important as soon as one works on more than one project. Project-revisions or snapshots come in handy.

Import, export and inter-operability with other software: those are likely going to be the hardest part, yet also the most important. Transcoding the video on import is necessary to provide smooth response times for forward/backward seeking. It may also be required for inter-operability (codecs). Demuxing is needed to import edited video-tracks along with the original set of audio tracks, aligned to time-code.

Exporting the soundtrack means aligning the materials and multiplexing the original video with the audio, or providing exact information on how to do that. Various formats, both proprietary and free, are in use, which complicates the process.

Sometimes you (or the director) notice during sound-design that some video edit decisions were not optimal, or that there is just a video frame missing for proper alignment. With digital technology, incremental updates of the video are not only feasible but relatively simple. Empowering the audio-engineer to quickly do minor video editing is a very useful but also dangerous feature.

2.1.2 Short-Term

Increase the usability of existing tools with focus on soundtrack creation (video-timeline, video-monitor). In particular the learnability and efficiency of the workflow should be streamlined: video-transcoding, video-monitor and A/V-session management functionality should be accessible from a single application. The complexity of the whole process should be abstracted to focus on the use-case at hand while not limiting the actual system.

Practically this is mapped to adding support into Ardour3 to communicate with and control the Xjadeo [Gareus and Garrido, 2006-2012b] video-monitor. Furthermore a video-timeline axis is aligned to the Ardour-canvas. Lastly, the possibility to invoke ffmpeg from within Ardour's GUI to import and export audio/video is implemented. The user-interaction is kept to a minimum: Import Video, Show/Hide Timeline, Show/Hide video-monitor.

2.1.3 Mid-Term

Improve system integration. Configuration presets for multi-host setups. Streamline video-export, MUX/codec/format presets. Stabilize the video-session and edit API. Add interoperability with external video-editors. Import various EDL dialects.

2.1.4 Long-Term

A top-notch client-server distributed A/V non-linear editor, using Ardour3 for the sound and a dedicated GUI for the video. Seamless integration.

2.2 Architecture

Ardour itself is separated into two main parts: the back-end (audio-engine, libardour), which manages audio-tracks, audio-routes and session-state, and the front-end (gtk2-ardour), which provides a stateless graphical user interface to the back-end. The idea is to construct the system such that intrusion into Ardour itself is minimal: only Ardour's front-end GUI should be aware of video-elements.

The video-decoding and video-session management is done in a separate application. A client-server model is used to partition tasks. Ardour as client does not share any of its resources, but requests video-content from the server. The overall outline is depicted in figure 2.

Figure 2: Overview of the client-server architecture

The video-server itself is a modular system. The essential and minimum system comprises a video-decoder and video-frame cache. Video-frames are referenced by frame-number. Time-code mapping is optional and can be done on the client and server side (using server-side sessions). However, the server must provide information about the file's (or session's) frame-rate, aspect-ratio and start-time offset. Furthermore, the server must be able to present server-side sessions as a single file to the client.

The minimal configuration that needs to be shared by the server and client is the server's access URL and document-root.

Even though tight integration is planned, the prototype architecture leaves room for side-chains. In particular this concerns extraction of audio-tracks from video files as well as multiplexing and mastering the final video. While interfaces are specified to handle these on the server-side, tight user-interface integration and rapid prototyping motivate a client-side implementation. This is accomplished by sharing the underlying file-system storage.

3 Interface

For read-only (no video-editing) access to video information, all communication between the client (here: Ardour) and the video-server is via HTTP.

37 This is motivated by: /index/* (optional) file and directory index HTTP is a well supported protocol with ex- /session/* (optional) server-side project and • isting infrastructure. session management with client-side caching: throughput is more The file extension defines the format of the re- • important than low latency. HTTP overhead turned data. For textual (non-binary) data, valid is reasonably small1 requests include .html, .htm , .json, .xml, .raw, web-interface possibility .edl, .null. Valid image file extensions are .jpg, • .jpeg, .png, .ppm, .yuv, .rgb, .rgba. For video persistent HTTP connections are possible streaming, the extension defines the container for- • .avi dv .rm . .ogv .mpg .mpeg possibility to make use of established proxy mat e.g. , , , , , , , .flv .mov .mp4 .webm .mkv .vob • and load-balancing systems , , , , , , . . . . The extension is case-insensitive. 3.1 Protocol: Request Parameters Alternatively the format can be specified using HTTP verbs can be taken into account for spe- the format=FMT query parameter which overrides cific requests e.g. HEAD: query file-information, the extension, if any. DELETE: remove assets, regions, files. The vast 3.1.1 File and Session information majority will bet GET and POST requests. The /info handler returns basic information The server URL must conform to RFC3986 about the file or session. specifications. It must be possible to have Request parameters: the server in a sub-path on the server’s root (e.g. http://example.org/my/server/). file file-name relative to the server’s document The URI must allow path and file- root. names to be added below that path (e.g. Reply: http://example.org/my/server/status.html). A default endpoint handler needs to catch access to version reply-format version. Currently 1, may the doc-root and provide service for the following change in the future requests URLs, which are described in further framerate a double floating point value of video detail below: frames-per-second /status print server-status (user-readable, ver- duration long integer count of video-frames sion, etc) must return 200 OK if the server is start-offset double precision - time of first video- running. frame; specified in (fractional) seconds /info return file or session info aspect-ratio floating point value of the /frame query image-frame data width/height geometry ratio including pixel and display-aspect ratio multipliers / generic handler to above according to request- parameters width (optional) integer video width in pixels /admin/flush cache (optional, POST) reset height (optional) integer video height in pixels cache length (optional) video-duration in seconds /admin/shutdown (optional, POST) request size (optional) video-size in bytes clean shutdown of server The raw (un-formatted) reply concatenates the /stream (optional) server-side rendering of values above in order, separated by unix-newlines video chunks - export and preview (’ n’, ASCII 0x0a). Optional values require all previous\ values to be given: i.e. length requires 1RGB24 frame sizes, thumbnail 160x90: 337kiB, 768x576/PAL:10.1MiB, 1920x1080/full-HD: 47.5MiB. width and height. HTTP request/response headers are typically between 200 json or xml formatted replies must return an and 800 bytes. associative array with the key-name as specified

38 #curl -d file=tmp/test.avi \ one video frame. the default value is -1: until http://localhost:1554/info the end. 1 25.000 w (optional) width in pixels; default -1: auto 15262 h (optional) height in pixels; default -1: auto 0.0 1.833333 container (optional) video format; default : avi nosound (optional) if set (1) only video is en- coded - default: unset, 0 Figure 3: Example request of file information 3.1.4 Server administration in the reply format list. html or txt replies are Since the only way to communicate with the intended for user readability and may differ in server is HTTP, specific interfaces are needed for formatting, but should include the required in- administrative requests. The current standard de- formation. fines only two; namely re-start (or cache-flush) 3.1.2 Image (preview, thumbnails) and shutdown. Request parameters: These commands are mostly for debugging and development purposes: The clean shutdown was file file-name relative to the server’s document motivated to check for memory-leaks; but comes root. in handy to launch temporary servers with each Ardour-session. frame the frame-number (starting at zero for the Cache-flush is useful if the video-file changes. first frame). Cached images must be cleaned and the de- w (optional) width in pixels - default -1: auto coders must release open file-descriptor and re- read the updated video-file. Future implementa- h (optional) height in pixels - default -1: auto tions should actually monitor the file for modifi- format (optional) image format and/or encoding cation and do this automatically. If neither width or height are specified, the 3.1.5 The Session API original size (in pixels, not scaled with pixel or Session API replicates /info, /frame and display-aspect ratio) must be used. If both width /stream request-URLs below the /session/ and height are given, the image must be scaled namespace. Each request to those handlers must disregarding the aspect-ratio. If either width or specify a session query parameter. The file pa- height are specified, the returned image must be rameter - if given - can be used to identify chunks scaled according to the movie’s real aspect ratio. inside a session. The default format is raw RGB, 24 bits per pix- Additional request handlers are needed to pro- els. vide access to editing functionality, in particular asset-management and clip arrangement. 3.1.3 Request video export, rendering The detailed description of the session API is The /stream handler is used to encode a video on beyond the scope of this paper and will be pub- the server. lished separately 2. Request parameters: 3.2 Extensions file file-name relative to the server’s document 3.2.1 Web GUI root or session-id. The /gui/ namespace is reserved for an end-user frame (optional) the frame-number (starting at web-interface which is out of the scope of this zero for the first frame) where to start encod- paper. Currently a prototype using XSLT and ing. the default is 0, start at the beginning. icsd2’s XML function is prototyped in XHTML duration (optional) number of frames to en- 2A prototype of the session API for research code (minus one), negative values indicate no is implemented in libs/libsodan/jvsession.c and limit. A value of 0 (zero) must encode exactly src/ics sessionhandler.c
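To make the /info and /frame requests of sections 3.1.1 and 3.1.2 concrete, the following is a minimal client sketch in Python (not part of the Ardour patch). The server address, port and file name are assumptions, and the reply parsing follows the raw format described above (one value per line: version, frame-rate, duration, start-offset, aspect-ratio).

# Minimal client sketch for the /info and /frame requests.
# Assumes an icsd-compatible server on localhost:1554 and a file
# below the server's document-root; "png" mirrors the image
# extensions listed in section 3.1.
from urllib.parse import urlencode
from urllib.request import urlopen

SERVER = "http://localhost:1554"

def video_info(file_uri):
    url = "%s/info?%s" % (SERVER, urlencode({"file": file_uri, "format": "raw"}))
    lines = urlopen(url).read().decode().splitlines()
    return {"version": int(lines[0]),
            "framerate": float(lines[1]),
            "duration": int(lines[2]),        # in video-frames
            "start_offset": float(lines[3]),  # in (fractional) seconds
            "aspect_ratio": float(lines[4])}

def fetch_frame(file_uri, frame, w=-1, h=-1, fmt="png"):
    # w/h of -1 keep the original geometry (cf. section 3.1.2)
    query = urlencode({"file": file_uri, "frame": frame,
                       "w": w, "h": h, "format": fmt})
    return urlopen("%s/frame?%s" % (SERVER, query)).read()

info = video_info("tmp/test.avi")
thumb = fetch_frame("tmp/test.avi", 0, w=160, h=90)
print(info["framerate"], len(thumb), "bytes")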

39 and JavaScript and available with the source-code 3.3.1 Database and persistent session (see figures 4, 7). information storage The server keeps a database for each session. A session consists of one or more chunks (audio or video sources) and a map how to combine them. Furthermore the session-database includes config- uration and session-properties in a generic key- value store. An extended EDL table is used to for map- ping the chunks. It provides for (track) layer- ing of chunk-excerpts (in-point, out-point). The data-model does not yet provide for per edit or per chunk attributes (zoom, effect automation) but transitions between chunks can be specified in EDL style. The current database schema can be found in source-code. Figure 4: icsd2: JavaScript+XHTML EDL editor 3.3.2 Server-side video monitoring An important feature is to be able to to physi- cally separate audio (Ardour) and video (server 3.2.2 Out-of-band notifications and monitor). This is motivated by various fac- Since it is not possible to use HTTP to send out- tors: All video-outputs of the client computer of-band notifications, it is not a suitable protocol may be needed for audio, leaving none available for concurrent editing. for a full-screen video projector. It also preempts Both XMPP (Extensible Messaging and Pres- resource conflicts between audio processing and ence Protocol) and OSC (OpenSoundControl) are video-decoding. options. To provide server-side video monitoring (video- OSC has less overhead, yet only OSC over display), two approaches are currently proto- TCP provides reliable transport. XMPP has typed: the advantage that by spec processing between any two entities at the server must be in order. monolithic approach: share the decoder and Furthermore XMPP is a protocol intended for • frame-cache back-end between the HTTP- publish/subscribe, and capable to notify multiple server and the video-monitor using shared- clients. Authentication methods have also been memory. defined for XMPP, whereas both authentication modular approach: share the source- and Pub/Sub would require custom development • code/libraries: HTTP-server and video- on top of OSC. monitor are independent applications. 3.2.3 Authentication and Access Control Both HTTP as well as XML offer means of au- The former obviously requires less resource and thentication below the application layer. puts less strain on the server (memory and CPU While some requests motivate access-control as usage) at the cost of potential instability (resource part of the protocol - e.g. user-specific read-only conflicts) since maintaining cache-coherence com- access to certain frame-ranges or chunks - these plicates the code. can in general be handled by URI filtering, much A middle-ground - building a video-monitor on as a .htaccess files does in apache. top of the existing HTTP frame/ interface - is unsuitable for low-latency seeks. Even though it 3.3 Server Internal API performs nicely in linear playback it effectively This section covers the planned interface for purges the cache and introduces a penalty for server-side sessions and non-linear video-editing. time-line use. The back-end is partially implemented in icsd2 With computing resources being easily avail- and there is a basic web-interface to it. able (and cheap compared to video-production in

40 general), a modular approach - using independent The video-timeline is a unique “ruler” video-monitor software - is the way to follow. in Ardour (similar to bar/beat or 3.3.3 Server to Server communication markers) and globally accessible via ARDOUR UI::instance()->video timeline. A cache-coherence protocol MOESI (Modified, The video-timeline contains video-frame images: Owned, Exclusive, Shared, Invalid) is envisaged gtk2 ardour/video image frame.cc. to distribute load among servers. An algorithm based on file-id and frame-number can be used to The timeline is populated by aligning the first dispatch requests to servers in a pool. This can video-frame. The zoom-level (audio-frames per be done directly in the client implementation or pixel) and timeline-height * aspect-ratio (video- by a proxy-server (http load balancer). frame width in pixels) define the spacing of video- frames after the first frame. Images on the current Server-to-server communication will use pub- page of the canvas as well as the next and previous lish/subscribe to synchronize and distribute up- page are requested and cached locally to allow dates among servers in the pool. smooth scrolling. 4 Implementation VideoTimeLine and VideoImageFrame are the only two classes that communicate with the video- Historically things went a bit backwards. Af- server. The former is used to request video-session ter the groundwork (Xjadeo, Ardour1) had been information (frame-rate, duration, aspect-ratio), laid, different endeavours (gjvidtimline, xj-five) the latter handles (raw RGB) image-data. The culminated in the video-editing-server software actual curl http get() function is defined in the (vcsd) for Ardour3. The recurring need to fo- file video image frame.cc which also includes a cus on movie-soundtracks motivated the short- threaded handler for async replies of image-data. term goals of the current implementation: Ar- The GET request-parameters to query infor- dour3 [Davis and others, 1999 2012] provides the mation about the video-file are: DAW as well as the GUI. The video-server (icsd) SERVER-URL/info?file=fileURI&format=plain project is named Sodankyl¨a[Gareus, 2008 2012] The GET request-parameters for image data are after its place of birth. as follows: 4.1 client – Ardour3 SERVER-URL/?frame=frame-number The patch3 can be broken out into small parts. &w=width&h=height &file=fileURI&format=rgb Video-timeline display The width and height are calculated by us- • HTTP interaction with the video-server ing the video’s aspect-ratio and the height of the • timeline-ruler. Current options are defined in Xjadeo remote control editor rulers.cc and include: • ffmpeg interaction “Small”: 3H • • t Dialogs (the largest part) • “Normal”: 4Ht Support and helper functions • • “Large”: 6Ht The implementation is completely additive4, and • all video related code is separated by #ifdefs. with Ht = Editor::timebar height = 15.0; pixels defined in editor.cc. 4.1.1 Video timeline gtk2 ardour/video timeline.cc defines the 4.1.2 system-exec timing and alignment of the video-frames depend- Starting an external application with bi- ing on the current scroll-position and zoom-level directional communication over standard-IO, of the Ardour canvas. requires dedicated code. 
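The following Python sketch illustrates the launch-and-pipe pattern that the wrapper described next has to provide. It is an illustration of the idea only, not a translation of the C++ implementation, and the echoing child process ('cat') is a stand-in for a real tool.

# Bi-directional standard-IO communication with a child process,
# the pattern that system_exec provides portably in C++.
# 'cat' merely echoes its input and stands in for a real tool.
import subprocess

child = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, text=True)
child.stdin.write("seek 1500\n")   # send a command line to the child
child.stdin.flush()
print("child replied:", child.stdout.readline().strip())
child.terminate()                  # request a clean shutdown
child.wait()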
gtk2 ardour/system exec.cc implements a 3http://gareus.org/gitweb/?p=ardour3.git;a= commitdiff_plain;hp=master;h=videotl cross-platform compatible (POSIX, windows) 4No existing code is removed or changed, however the API to launch, terminate and pipe data to/from build-script and some of the documentation is modified. child processes.

41 It is used to communicate with Xjadeo 5 as well as to launch ffmpeg and the video-server (on lo- calhost). 4.1.3 transcoding gtk2 ardour/transcode .cc includes low-level interaction with ffmpeg. It is concerned with locating and starting ffmpeg and ffprobe executables as well as parsing output and progress information. The command-parameters (presets, options) as well as progress-bar display Figure 5: A3: the video-open dialog is part of the transcode video dialog.cc. 4.1.4 Dialogs The vast majority of the code contributed to Ar- extracted from the video-file and imported dour consists of dialogs to interact with the user. as audio-track into Ardour. gtk2 ardour/video server dialog.cc asks to start the video-server on localhost. It is shown on video-import if the configured video-server can not be reached. gtk2 ardour/add video dialog.cc Called from Session-menu Import Video, asks which file to import.→ The file-selection→ dialog as- sumes that the video-server runs on localhost or the server’s document-root is shared via NFS6. The dialog offers options to transcode the file on import, or copy/hardlink it to the session folder. Furthermore there are options to launch the video-monitor Figure 6: A3: the import-video, transcode dialog and set the Ardour session’s timecode-rate to match the video-file’s FPS (see figure 5). gtk2 ardour/video copy dialog.cc is gtk2 ardour/open video monitor dialog.cc a potential follow-up dialog with progress-bar optional settings for Xjadeo. This dialog and overwrite confirmation. can be bypassed (configuration option) If the video-server is running on localhost, and allows to customize settings of Xjadeo the dialog also checks if the (imported) file that should be retained between sessions: is under the configured document-root of the e.g. window-size, position and state, video-server. On-Screen-Display settings, etc. gtk2 ardour/transcode video dialog.cc gtk2 ardour/export video dialog.cc is the allows to transcode video-files on import. It most complex dialog. It includes various is recommended to do so in order to decrease (hardcoded) presets for video-encoding and the CPU load and optimize I/O resource ffmpeg options. It also includes a small usage of the video-decoder. The dialog wrapper around Ardour’s audio-export also allows to select an audio-track to be function. 5actually xj-remote which in turn communicates with 4.1.5 video-monitor / Xjadeo Xjadeo, platform dependent either via message-queues or RPC calls gtk2 ardour/video monitor.cc implements in- 6A prefix can be configured to be removed from the teraction with Xjadeo. In particular sending selected path for making requests to the video-server. video-seek messages if Ardour is using internal

42 transport and time-offset parameters if the video By default icsd listens on all network inter- is re-aligned. faces’ TCP port 1554 and runs in foreground, Switching Ardour’s sync mechanism between serving files below a given document-root. JACK and internal-transport automatically tog- gles the way Ardour controls Xjadeo. Also usage: icsd [OPTION] Xjadeo’s window state (on-top), on-screen-display mode (SMPTE, frame-number display) as well The number of cached video-frames can be as the window position and size are remembered specified with the -C option. It defaults to across sessions. Ardour sets override-flags in 128. The IP address to listen on can be set with Xjadeo’s GUI that disable some of the direct user- -P (default: 0.0.0.0). There are interaction with Xjadeo; in particular: you can various options regarding daemonizing, chroot, not close the monitor-window nor load different set-uid, set-gid and logging that allow icsd be files by drag-drop on the Xjadeo window or nei- started as system-service. They are documented ther modify the time-offset using keyboard short- in the manual-page. cuts. These interactions need to be done through Ardour. Note that, icsd does not perform any access control. As with a web-server, all documents be- 4.1.6 misc low the document-root are publicly readable. There are various small patches - mostly glue The internal video frame-cache uses LRU - to Ardour’s editor *.cc, ardour ui*.cc (least-recently requested video-frame) cache-line as well as preference (video-server URL) and expiry strategy. icsd is multi-threaded, it keeps session-option (sync, pull up/down) dialogs. a pool of decoders and tries to map consecutive Dedicated video functions have been collected frames from one keyframe to the same-decoder; in gtk2 ardour/editor videotimeline.cc and however a new decoder is only spawned if the cur- gtk2 ardour/utils videotl.cc rent one is busy in another thread. 4.1.7 libardour While the video-timeline patch itself only con- cerns the Ardour GUI, a few helper functions se- mantically belong in libardour. They are almost exclusively related to saving preferences or resolv- ing directories. 4.2 prototype server The source-code includes a simple PHP script tools/videotimeline/vseq. that emulates the behaviour of the video-server (no frame- caching) and implements the minimal interface protocol specifications. It is built on top of the command-line applications: ffmpeg, ffprobe and ImageMagick’s convert. Figure 7: icsd2: JavaScript+XHTML video- monitor 4.3 video-server – Sodankyl¨a– icsd The video-server was born in Sodankyl¨a June 2008. It’s a motley collection of tools for handling icsd2 extends the simple API towards video- video files. The ‘image compositor socket dae- session and non-linear-editing. mon’ (icsd - the names goes back to Ardour-1) To allow interoperability with various client- implements a video-frame-caching HTTP server side tools, a REST-API is being defined. XLST according to specs defined in section 3.1. icsd2 in is used to export information in XML, as well as the same source-tree is a (work-in-progress) ver- generate XHTML pages. For prototyping and de- sion adding the video-session and NLE capabili- bugging a simple web-player and JavaScript EDL ties. editor came to be (screenshot in figure 7).
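As a rough illustration of the cache strategy just described, the sketch below implements a least-recently-requested frame-cache in Python. It is a simplification for clarity only; icsd itself is multi-threaded and additionally manages a pool of decoders.

# Simplified LRU frame-cache in the spirit of icsd's strategy:
# frames are keyed by file and frame-number and the least-recently
# requested entry is evicted once the configured capacity
# (cf. the -C option, default 128) is exceeded.
from collections import OrderedDict

class FrameCache:
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.frames = OrderedDict()          # (file, frame, w, h) -> bytes

    def get(self, key, decode):
        if key in self.frames:
            self.frames.move_to_end(key)     # mark as most recently used
            return self.frames[key]
        image = decode(*key)                 # cache miss: decode the frame
        self.frames[key] = image
        if len(self.frames) > self.capacity:
            self.frames.popitem(last=False)  # expire the oldest entry
        return image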

43 5 QA and Performance Tests rates and can be neglected. Systematic latency- The most important part is to assure fitness for compensation tests have successfully been per- the purpose. Usability and integration is useless formed at 25fps file of alternating black and white if fundamental quality is not sufficient. Criteria frames on a 75Hz display frequency. Other rates that require verification include accuracy, time- have been verified but not analyzed in depth. code mapping, as well as latency and system-load. 5.2 Performance tests Depending on the zoom-level video-frames in Performance is evaluated by benchmarking unit- the audio-timeline must be properly aligned. Dur- tests of the individual components as well as mon- ing playback the video-monitor must not lag be- itoring overall system performance. hind. Audio latency must be compensated so The former includes measuring request latency that the video-frame on screen corresponds to to query video-frames which is further broken the audio-signal currently reaching the speakers. up into distributing workload on multiple video- The system load must remain within reasonable decoders. An average session will spawn 16-20 bounds and the applications should run smoothly video-decoders on a dual-core. The memory foot- without spiking. print of a decoder is negligible compared to the 5.1 Quality Assurance and Testing cached image data, yet it greatly improves re- sponse time when seeking in files to use a decoder Audio/Video frame alignment has been tested us- positioned close to the key-frame. ing the “time-stamp-movie-maker” [Gareus and Request and decode latencies have been mea- Garrido, 2006 2012a]. It was used to create sured with ab - The Apache HTTP server bench- movies with time-code and frame-number ren- marking tool as well as by timing the requests dered on the image at most common frame-rates with Ardour. Cache hits are dominated by trans- (23.976, 24, 24.976, 25, 29.97df, 30, 59.94 and 60 fer time which is on the order of a few (1-3) mil- fps) using common codecs + format combinations liseconds depending on image-size and network. (mjpeg/avi, h264/avi, h264/mov, mpeg4/mov, Decoder latency is highly dependent on the geom- /ogg). All 40 videos have been loaded into etry of the movie-file, codec, scaling and CPU or Ardour and tested to align correctly. I/O. On slower CPUs (< 1.6 GHz Intel) a full HD video can be decoded and scaled at 25 fps using mjpeg codec width only intra-frames at the cost of high I/O. Faster CPUs can shift the load towards the CPU. Parallelizing requests increases latency to up to a few hundred milliseconds. This is in- tended behavior: image for a whole view-point page will arrive simultaneously.
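Besides 'ab', request latency can be sampled with a few lines of client code. The sketch below reports average and worst-case round-trip times for thumbnail-sized frames; the URL, file name, geometry and frame range are arbitrary example values.

# Timing /frame requests from the client side, complementing the
# 'ab' measurements described above.
import time
from urllib.request import urlopen

URL = "http://localhost:1554/frame?file=tmp/test.avi&frame=%d&w=160&h=90"

samples = []
for frame in range(0, 500, 10):
    t0 = time.perf_counter()
    urlopen(URL % frame).read()
    samples.append((time.perf_counter() - t0) * 1000.0)   # in ms

print("%d requests, avg %.1f ms, max %.1f ms"
      % (len(samples), sum(samples) / len(samples), max(samples)))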

Figure 8: testing time-code mapping using a time- 6 User Manual stamped movie 6.1 Setup get Ardour3 with video-timeline patch. • Latency compensation is verified by playing a install icsd ( alpha-11) and optionally ffm- • peg, ffprobe≥ and Xjadeo ( 0.4.12) to movie of a unique series of black and white frames ≥ full-screen synchronously to an audio-track that $PATH. sends an impulse on every white frame. The make sure that there is no firewall on local- audio-impulse signal must occur aligned to the • host TCP-port 1554 (required for communi- (luminance) signal of the video within the bounds cation between Ardour and the video-server). of display frequency. Since there are 3 differ- ent clocks (display-refresh frequency, video frame- If you run the video-server on the same host as rate, audio sample-rate) involved the problem is Ardour, no further configuration is required. not trivial. However audio sample-rate is an order When opening the first video, Ardour3 will ask of magnitude smaller than the usual video frame- for the path of the video-server binary (unless

44 6.2 Getting Started It is pretty much self-explanatory and intended to be intuitive. All functions are available from the Ardour Menu. Select actions are accessible from the context-menu on the video-timeline.

• Menu → Session → Open Video
• Menu → Session → Export → Video
• Menu → View → Rulers → Video (or right-click the ruler/marker bar)
• Menu → View → Video Monitor (Xjadeo)
• Menu → Edit → Preferences → Video
• Menu → Session → Video maintenance → . . . (manual video server interaction)

Figure 9: latency for decoding thumbnail frames for 1, 2 and 8 parallel requests

A quick walk-through goes something like this: Launch Ardour3, Menu Open Video. If the video-server is not running,→ Ardour asks to start it. You choose a video-file and optionally extract the original sound of the video to an Ardour track. By default Ardour also sets the session FPS and opens a video-monitor window. From then on, everything else is plain Ardour. Add audio-tracks, import samples, mix and mas- ter,. . . Eventually you’ll want to export your work. Session Export Video covers the common cases from→ file-creation→ for online-services over producing a snapshot for the client at the end of the day to optimized high-quality two-pass Figure 10: decoder and request latency for PAL MPEG encoding. However it is no substitute video. for dedicated software (such as dvd-creator) or a video engineer with a black-belt in ffmpeg or ex- pertise in similar transcoding software, especially when it comes to interlacing and scan-ordering.. it is found in $PATH) and also allows you to The export dialog defaults to use the imported change the TCP port number as well as the set video as source, note however that using the orig- cache size. Transcoding is only available if ffm- inal file usually produces better results; every peg and ffprobe are found in $PATH (fallback on transcoding step potentially decreases quality. OSX: /usr/local/bin/ffmpeg sodankyla - pro- 6.3 Advanced Setups vided with icsd.pkg). The video-server can be run on a remote-machine The video-monitor likewise requires xjremote with Ardour connecting to it. In this case you to be present in $PATH or on OSX under need to launch icsd on the server-machine and /Applications/Jadeo.app/Contents/MacOS/ configure Ardour to access it. The server’s URL and windows $PROGRAMFILES xjadeo . Xjre- and docroot configuration can be accessed via Ar- mote comes with Xjadeo and Jadeo\ respectively.\ dour’s menu Edit Preferences. → →
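For such a remote setup, the essential mapping is from a locally visible file to a request URL: the document-root configured in Ardour is stripped from the local absolute path before the file is requested from the server (see the notes on file-system sharing below). A small sketch, with made-up host name and paths:

# Sketch of the docroot mapping for a remote video-server: the
# configured document-root prefix is removed from the selected
# absolute path before building the request.
from urllib.parse import quote

def request_url(server_url, docroot, local_path):
    if not local_path.startswith(docroot):
        raise ValueError("file is not below the shared document-root")
    relative = local_path[len(docroot):].lstrip("/")
    return "%s/info?file=%s" % (server_url.rstrip("/"), quote(relative))

print(request_url("http://videoserver:1554", "/mnt/video",
                  "/mnt/video/project/cut3.avi"))
# -> http://videoserver:1554/info?file=project/cut3.avi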

45 You should have a fast network connection video-frames are missing (black-cross images) in ( 100Mbit/s, low latency (a switch performs bet- the timeline. The Ardour internal frame-cache ter≥ than router) and preferably a dedicated net- should require a minimum of frames regardless of work) between the machine running Ardour and canvas pages to compensate for the request and the video-server. redraw latency. While not required for operation, it is handy to share the video-server’s document-root file- 7 Acknowledgements system with the machine running Ardour. It is, Thanks go to Paul Davis, Luis Garrido, Rui Nuno however, required to Capela, Dave Phillips and Natanael Olaiz. This project would not have been possible with- browse to a video-file to open7 • out various free-software projects, most notably display local video-monitor window ffmpeg/libav. Kudos go to the Ardour and JACK • development-teams and to linux-audio-users for import/transcode a video-file to the Ardour • feedback. session folder Last but not least to Carolina Feix for motivat- extract audio from a video-file 7 ing this project in the first place. • export to video-file7 References • File-system sharing can be done using any Rui Nuno Capela. 2005–2011. Qtractor. http: network-file-system (video-files reside on the //qtractor.sourceforge.net/. server, not the Ardour workstation) or using NAS. Paul Davis et al. 1999–2012. Ardour. http: Alternatively a remote replica of the Ardour- //ardour.org. project-tree that only contains the video-files is an Paul Davis et al. 2001–2012. Jack trans- option. The local project folder only needs a copy port. http://jackaudio.org/files/docs/ of the video-file for displaying a video-monitor. html/transport-design.html. The document-root configured in Ardour is re- moved from the local absolute-path to the selected Gabriel Finch. 2004–2012. Lives. http:// file when making a request to the video-server. lives.sourceforge.net/. Xjadeo itself includes a remote-control API Blender Foundation. 1999–2012. Blender. that allows to control the video-monitor over a http://blender.org. network connection. Please refer to the Xjadeo manual for details. Robin Gareus and Luis Garrido. 2006– 2012a. Time-stamp-movie-maker. http:// 6.4 Known Issues xjadeo.sf.net. Video-monitor settings (window state, position, Robin Gareus and Luis Garrido. 2006–2012b. size) are sent to Ardour when the video-monitor Xjadeo - the x-jack-video-monitor. http:// is terminated. Ardour saves and restores these xjadeo.sf.net. settings. The video-monitor is closed when a ses- sion is closed. If any of the video-settings have Robin Gareus. 2008–2012. Sodankyl¨a, icsd. changed since the last save, session-close will ask http://gareus.org/wiki/a3vtl. again to save the session, even if it was saved just Heroine Virtual Ltd. 2002–2011. Cinelerra. before closing. Every session-save should query http://cinelerra.org/. and save the current state of the video-monitor to prevent this. Piero Scaruffi. 2007. A brief history of Popular Operating Ardour with very close-up zoom Music before Rock Music. Omniware. (only one or two video-frames visible in the time- Werner Schweer et al. 1999–2012. Musescore: line), caching images only for the previous and Music score editor. . next viewpoint is not sufficient. On playback Wikipedia. 2011. Film-score. http://en. 7later versions of the video-server and Ardour may not wikipedia.org/wiki/Film_score. require this anymore.

INScore
An Environment for the Design of Live Music Scores

D. Fober, Y. Orlarey and S. Letz
Grame - Centre national de création musicale
9, rue du Garet BP 1185
69202 Lyon Cedex 01, France
{fober, orlarey, letz}@grame.fr

Abstract of the interaction system. New needs in terms of INScore is an open source framework for the design of music representation emerge of this context. interactive, augmented, live music scores. Augmented In the domain of electro-acoustic music, ana- music scores are graphic spaces providing representa- lytic scores - music scores made a post´eriori, be- tion, composition and manipulation of heterogeneous come common tools for the musicologists but have and arbitrary music objects (music scores but also im- little support from the existing computer music ages, text, signals...), both in the graphic and time do- software, apart the approach proposed for years mains. INScore includes also a dynamic system for by the Acousmograph [Geslin and Lefevre, 2004]. the representation of the music performance, consid- ered as a specific sound or gesture instance of the score, In the music pedagogy domain and based on a and viewed as signals. It integrates an event based in- mirror metaphor, experiments have been made to teraction mechanism that opens the door to original extend the music score in order to provide feed- uses and designs, transforming a score as a user in- back to students learning and practicing a tradi- terface or allowing a score self-modification based on tional music instruments [Fober et al., 2007]. This temporal events. This paper presents the system fea- approach was based on an extended music score, tures, the underlying formalisms, and introduces the supporting various annotations, including perfor- OSC based scripting language. mance representations based on the audio signal, Keywords but the system was limited by a monophonic score centered approach and a static design of the per- score, interaction, signal, synchronization formance representation. 1 Introduction Today, new technologies allow for real-time in- teraction and processing of musical, sound and Music notation has a long history and evolved gestural information. But the symbolic dimen- through ages. From the ancient neumes to the sion of the music is generally excluded from the contemporary music notation, the western culture interaction scheme. is rich of the many ways explored to represent the INScore has been designed in answer to these music. From symbolic or prescriptive notations observations. It is an open source framework1 for to pure graphic representation, the music score the design of interactive, augmented, live music has always been in constant interaction with the scores. It extends the traditional music score to creative and artistic process. arbitrary heterogeneous graphic objects: it sup- However, although the music representations ports symbolic music notation (Guido [Hoos et al., have exploded with the advent of computer music 1998] or MusicXML [Good, 2001] format), text [Dannenberg, 1993; Hewlett and Selfridge-Field, (utf8 encoded or html format), images (jpeg, tiff, 2001], the music score, intended to the performer, gif, png, bmp), vectorial graphics (custom basic didn’t evolved in proportion to the new music shapes or SVG), video files, as well as an original forms. In particular, there is a significant gap performance representation system [Fober et al., between interactive music and the static way it is 2010b]. usually notated: a performer has generally a tra- Each component of an augmented score has a ditional paper score, plus a computer screen dis- playing a number or a letter to indicate the state 1http://inscore.sf.net

47 graphic and temporal dimension and can be ad- may be viewed as a portion of an element. It dressed in both the graphic and temporal space. could be a 2D graphic segment (a rectangle), a A simple formalism, is used to draw relations be- time interval, a text section, or any segment de- tween the graphic and time space and to repre- scribing a part of the considered element in its sent the time relations of any score components local coordinates space. in the graphic space. Based on the graphic and Mappings are mathematical relations between time space segmentation, the formalism relies on segmentations. A composition operation is used simple mathematical relations between segments to relate the graphic space of two arbitrary score and on relations compositions. We talk of time elements, using their relation to the time space. synchronization in the graphic domain[ Fober et Table 2 lists the segmentations and mappings al., 2010a] to refer to this specific feature. used by the different component types. Mappings INScore is a message driven system that makes 2 are indicated by arrows ( ). Note that the ar- use of the Open Sound Control [OSC] format. rows link segments of different↔ types. Segmen- This design opens the door to remote control and tations and mappings in italic are automatically to interaction using any OSC capable application computed by the system, those in bold are user or device. In addition, it includes interaction fea- defined. tures provided at score component level by the way of watchable events. Note that for music scores, an intermediate All these characteristics make INScore a time segmentation, the wrapped time, is necessary unique system in the landscape of music notation to catch repeated sections and jumps (to sign, to or of multi-media presentation. The next section coda, etc.). gives details about the system features, including the underlying formalisms. Next the OSC API is introduced with a special emphasis on how it is turned into a scripting language. The last section gives an overview of the system architecture and dependencies before some directions are presented for future work. 2 Features Table 1 gives the typology of the graphic resources supported by the system. All the score elements have graphic properties (position, scale, rotation, color, etc.) and time properties as well (date and Figure 1: Graphic segments of a score and its duration). performance displayed using colors and annotated INScore provides a message based API to cre- with the corresponding time segments. ate and control the score elements, both in the graphic and time spaces. The Open Sound Con- trol [OSC] protocol is used as basis format for Figure 1 shows segments defined on a score and these messages, described in section 3. an image, annotated with the corresponding time segments. Synchronizing the image to the score 2.1 Time to graphic relations will stretch and align the graphic segments corre- All the elements of a score have a time dimension sponding to similar time segments as illustrated and the system is able to graphically represent the by figure 2. time relationships of these elements. INScore is using segmentations and mappings to achieve this Description of the image graphic to time rela- time synchronization in the graphic domain. tion is given in table 3 as a set of pairs including The segmentation of a score element is a set a graphic segment defined as 2 intervals on the x of segments included in this element. 
A segment and y axis, and a time segment defined as 2 ratio- nals expressing musical time (where 1 represents 2http://opensoundcontrol.org/ a whole note).
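A mapping of this kind can be queried with a few lines of code. The following sketch uses the segment pairs of table 3 (below) and places a musical date on the x-axis by piecewise linear interpolation; it is an illustration only, not part of INScore, and the y-intervals are omitted for brevity.

# Illustration: querying a graphic-to-time mapping like the one of
# table 3. Each entry pairs an x-interval in pixels with a time
# interval in music time (1 = whole note); a date is mapped to a
# position by piecewise linear interpolation.
from fractions import Fraction as F

MAPPING = [
    ((0,   67),  (F(0),    F(1, 2))),
    ((67,  113), (F(1, 2), F(1))),
    ((113, 153), (F(1),    F(5, 4))),
    ((153, 190), (F(5, 4), F(3, 2))),
    ((190, 235), (F(3, 2), F(2))),
]

def date_to_x(date):
    for (x0, x1), (t0, t1) in MAPPING:
        if t0 <= date < t1:
            return x0 + float((date - t0) / (t1 - t0)) * (x1 - x0)
    raise ValueError("date outside the mapped time range")

print(date_to_x(F(3, 4)))   # three quarter-notes into the score -> 90.0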

Symbolic music description     Guido Music Notation and MusicXML formats
Text                           plain text or html (utf8)
Images                         jpg, gif, tiff, png, bmp
Vectorial graphics             line, rectangle, ellipse, polygon, curve, SVG code
Video files                    using the phonon plugin
Performance representations    see section 2.2

Table 1: Graphic resources supported by the system.

type                           segmentations and mappings required
Symbolic music description     graphic ↔ wrapped time ↔ time
Text                           graphic ↔ text ↔ time
Images                         graphic ↔ pixel ↔ time
Vectorial graphics             graphic ↔ vectorial ↔ time
Performance representations    graphic ↔ frame ↔ time

Table 2: Segmentations and mappings for each component type

Figure 2: Time synchronization of the segments displayed in figure 1.

( [0, 67[    [0, 86[ )   ( [0/2, 1/2[ )
( [67, 113[  [0, 86[ )   ( [1/2, 1/1[ )
( [113, 153[ [0, 86[ )   ( [1/1, 5/4[ )
( [153, 190[ [0, 86[ )   ( [5/4, 3/2[ )
( [190, 235[ [0, 86[ )   ( [3/2, 4/2[ )

Table 3: A mapping described as a set of relations between graphic and time segments.

2.2 Performance representation

Music performance representation is based on signals, whether audio or gestural signals. To provide a flexible and extensible system, the graphic representation of a signal is viewed as a graphic signal, i.e. as a composite signal made of:

• a y coordinate signal y
• a thickness signal h
• a color signal c

Such a composite signal (see figure 3) includes all the information required to be drawn without additional computation.

Figure 3: A composite graphic signal at time t.

The following examples show some simple representations defined using this model.

2.2.1 Pitch representation

Represents notes pitches on the y-axis using the fundamental frequency (figure 4).

Figure 4: Pitch representation.

The corresponding graphic signal is expressed as:

g = Sf0 / kt / kc

where Sf0 : fundamental frequency

49 kt : a constant thickness signal whereS f0 : fundamental frequency kc : a constant color signal Srms0 : f0 RMS values 2.2.2 Articulations kc0 : a constant color signal Next we build the graphic for the harmonic 1: Makes use of the signal RMS values to control the graphic thickness (figure 5). The corresponding g1=S f0 /Srms1 +S rms0 / kc1

Srms1 : harmonic 1 RMS values kc1 : a constant color signal Next, the graphic for the harmonic 2: Figure 5: Articulations. graphic signal is expressed as: g2=S f0 /Srms2 +S rms1 +S rms0 / kc2

g=k y /Srms / kc Srms2 : harmonic 2 RMS values kc2 : a constant color signal wherek y : signaly constant etc. Srms : RMS signal Finally, the graphic signals are combined into a kc : a constant color signal parallel graphic signal: 2.2.3 Pitch and articulation combined g=g2/g1/g0 Makes use of the fundamental frequency and RMS values to draw articulations shifted by the pitches 2.3 Interaction (figure 6). INScore is a message driven system that makes use of Open Sound Control [OSC] format. It in- cludes interaction features provided at individ- ual score component level by the way of watch- able events, which are typical events available in Figure 6: Pitch and articulation combined. a graphic environment, extended in the temporal The corresponding graphic signal is expressed domain. The list of supported events is given in as: table 4. g=S f0 /Srms / kc mouse events time events misc. whereS f0 : fundamental frequency mouse enter time enter new element Srms : RMS signal mouse leave time leave (at scene level) kc : a constant color signal mouse down mouse up 2.2.4 Pitch and harmonics combined mouse move Combines the fundamental frequency to the first double click harmonics RMS values (figure 7). Each harmonic has a different color. Table 4: Typology of watchable events.
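The composite graphic-signal model of section 2.2 can be paraphrased in a few lines of code: a graphic signal is simply a per-sample tuple (y, thickness, colour), and the '/' composition of the formulas becomes a zip. The data values and the constant colour below are invented for illustration; this is not INScore code.

# Sketch of the composite graphic signal of section 2.2: a per-sample
# tuple (y, thickness, colour) built from ordinary signals.
def graphic_signal(y, thickness, colour):
    return list(zip(y, thickness, colour))

f0  = [220.0, 221.5, 440.0, 438.2]      # fundamental frequency S_f0 (Hz)
rms = [0.10, 0.25, 0.30, 0.05]          # RMS envelope S_rms
k_c = ["#808080"] * len(f0)             # constant colour signal

# "pitch and articulation combined":  g = S_f0 / S_rms / k_c
g = graphic_signal(f0, rms, k_c)
print(g[0])   # (220.0, 0.1, '#808080')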

Events are associated to user defined messages that are triggered by the event occurrence. The Figure 7: Pitch and harmonics combined. message includes a destination address (INScore by default) that supports a url-like specification, The graphic signal is described in several steps. allowing to target any IP host on any UDP port. First, we build the fundamental frequency graphic The message associated to mouse events may use as above (see section 2.2.3): predefined variables, instantiated at event time with the current mouse position or the current g0=S f0 /Srms0 / kc0 position time.

50 This simple event based mechanism makes easy ���� ����������� to describe for example an intelligent cursor i.e. an arbitrary object that is synchronized to the ������ ������ ������ ����� score and that turns the page to the next or pre- vious one, depending on the time zone it enters. ����� ������� �������� �������� �������� ���������� 3 INScore messages ���� ���� ������� The INScore API is an OSC messages API. The general format is illustrated in figure 8: it con- sists in an address followed by a message string, Figure 9: INScore address space. followed by parameters. setMsg g set type data ✎ ☞ OSCAddress message parameters ✍ ✌ ☞ ✎ Figure 10: The set message. ✍ ✌ Figure 8: General format of a message. object type as argument, followed by type specific parameters. The address may be viewed as an object Messages given in figure 11 are used to set the pointer, the message string as a method name objects graphic properties. and the parameters as the method parameters. For example, the message: x float32 /ITL/scene/score color 255 128 40 150 ✎☞✎ ☞ ☞ ✎ may be viewed as the following method call: ✍y ✌✍float32 ✌ score->color(255 128 40 150) ✎☞✎ ☞ ✍ ✌ ✍z ✌✍float32 ✌ 3.1 Address space ✎☞✎ ☞ ✍ ✌ ✍angle✌✍ float32✌ The OSC address space is strictly organized like ✎ ☞✎ ☞ the internal score representation, which is a tree ✍ ✌ ✍scale ✌✍float32 ✌ with 4 depths levels (figure 9). Messages can be ✎ ☞✎ ☞ ✍ ✌ sent to any level of the hierarchy. ✍ ✌✍ ✌ The first level is the application level: a static Figure 11: Graphic space management. node with a fixed address that is also used to discriminate incoming messages. An application Messages given in figure 12 set the time proper- contains several scenes, which corresponds to dif- ties. Time is encoded as rational values represent- ferent windows and different scores. Each scene ing music time (where 1 is a whole note). Graphic contains components and two static nodes: a sync space and time management messages have rela- node for the components synchronization and a tive positioning forms: the message string pre- signal node, that may be viewed as a special fixed with ’d’. folder containing signals. The name in blue are All the messages that modify the system state user defined, those in black are static reserved have a counterpart get form illustrated by figure names. 13. Synchronization between components is based 3.2 Message strings on a master / slave scheme. It is described by This section gives some of the main messages sup- a message addressed to the static sync node as ported by all the score components. The list is far illustrated by figure 14. syncmode indicates how from being exhaustive, it is intended to show ex- to align the slave component to its master: hori- amples of the system API. zontal and/or vertical stretch, etc. Score components are created and/or modified Interactive features are available by request- using a set message (figure 10) that takes the ing to a component to watch an event using

51 1 date time slave master ✎ ☞ ☞ ✎ ☞ ☞ ✎✎ ✍duration✌ time syncmode ✎ ☞ ✍ ✌ 2 ✍ ✌ ✍ ✌ Figure 12: Time management. ✍ ✌ Figure 14: Synchronization message: the second get form (2) removes the synchronization from the ✎ ☞ ☞ ✎ slave object. ✍ ✌ getParam ✍✎ ☞✌ watch what oscMsg ✎ ☞ ✍ ✌ ✍ ✌ Figure 13: Querying the system state. Figure 15: Watching events. a message like figure 15, where what indicates Similarly, INScore may support the lua lan- the event (mouseUp, mouseDown, mouseEnter, guage in sections delimited by . mouseLeave, timeEnter, timeLeave) and OSCMsg By default, lua support is not embedded in the represents any valid OSC message including the INScore binary distribution. The engine needs address part. to be compiled with the appropriate options to The example in figure 16 creates a simple score, support lua. do some scaling, creates a red cursor and synchro- nizes it to the score with a vertical stretch to the 4 Architecture score height. Output is presented by figure 17. INScore is both a shared library and a standalone score viewer application without user interface. 3.3 INScore scripts Some insight of the system architecture is given Actually, the example given in figure 16) is mak- in this section for a better understanding and an ing use of the file format of a score, which consists optimal use of the system. The general architec- in a list of textual OSC messages, separated by a ture is a Model View Controller [MVC] designed semi-colon. This textual form is particularly suit- to handle OSC message streams. able to be used as a scripting language and addi- 4.1 The MVC Model tional support is provided in the form of variables and javascript and/or lua support. The MVC architecture is illustrated in figure 19. Modification of the model state is achieved by in- 3.3.1 Variables coming OSC messages or by the library C/C++ INScore scripts supports variables declarations API, that is actually also message based. An in the form illustrated by figure 18. Variables may OSC message is packaged into an internal mes- be used in place of any message parameter pre- sages representation and stacked on a lock-free fixed using a $ sign. A variable must be declared fifo stack. These operations are synchronous to before being used. the incoming OSC stream and to the library API call. 3.3.2 Scripting support On a second step, messages are popped from INScore scripts may include javascript sections, the stack by a Controler that takes in charge delimited by The javascript code is evaluated at parse time. It is expected /ITL/scene/score set ’gmn’ ’[g e f d]’; to produce valid INScore messages as output. /ITL/scene/score scale 5.0; These messages are then expanded in place of the /ITL/scene/cursor set ’rect’ 0.01 0.1; javascript code. /ITL/scene/cursor color 255 0 0 150; Variables declared before the javascript section /ITL/scene/sync ’cursor’ ’score’ ’v’; are exported to the javascript environment, which make these variables usable in both contexts. Figure 16: INScore sample script.
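Because the script is nothing but textual OSC messages, the same score can also be driven from any OSC-capable environment. The sketch below uses the third-party python-osc package; UDP port 7000 is assumed to be the port INScore listens on and should be adapted to the actual setup.

# Sending the messages of figure 16 over OSC/UDP instead of loading
# them from a script file. Requires the python-osc package; the
# destination port is an assumption about INScore's default.
from pythonosc.udp_client import SimpleUDPClient

inscore = SimpleUDPClient("127.0.0.1", 7000)
inscore.send_message("/ITL/scene/score",  ["set", "gmn", "[g e f d]"])
inscore.send_message("/ITL/scene/score",  ["scale", 5.0])
inscore.send_message("/ITL/scene/cursor", ["set", "rect", 0.01, 0.1])
inscore.send_message("/ITL/scene/cursor", ["color", 255, 0, 0, 150])
inscore.send_message("/ITL/scene/sync",   ["cursor", "score", "v"])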


Figure 17: Sample script output.

Figure 18: Variable declaration (ident = int32 | float32 | string ;).

address decoding and message passing to the corresponding object of the model. The final operation concerns the view update. An Updater is in charge of producing a view of the model, which is currently based on the Qt Framework.
This second step of the model modification scheme is asynchronous: it is processed on a regular time base.

Figure 19: An overview of the MVC architecture

4.2 Dependencies

The INScore project depends on external open source libraries:

• the Qt framework (http://qt.nokia.com/) providing the graphic layer and platform independence.
• the GuidoEngine (http://guidolib.sourceforge.net) for music score layout
• the GuidoQt static library, actually part of the Guido library project
• the oscpack library (http://www.rossbencina.com/code/oscpack) for OSC support

and optionally:

– the MusicXML library (http://libmusicxml.sourceforge.net) to support the MusicXML format.
– the v8 javascript engine (http://code.google.com/p/v8/) to support javascript in INScore script files.
– the lua engine (http://www.lua.org/) to support lua in INScore script files.

The GuidoEngine, the GuidoQt and oscpack libraries are required to compile INScore. Binary packages are distributed for Ubuntu Linux, Mac OS and Windows. Detailed instructions are included in the project repository for compiling.

5 Future work

Interactive features described in section 2.3 result from an experimental approach. The formalization and extension of these features are planned.
Time synchronization features are based on a continuous music time. Supporting other time representations like the Allen relations [Allen, 1983] is part of the future work.
Due to its dynamic nature, INScore is particularly suitable to interactive music, i.e. where parts of the music score could be computed in real-time by an interaction system. It could be particularly interesting to provide a musically meaningful visualization of the interaction system state; extensions in this direction are also planned.

53 6 Acknowledgements INScore was initiated in the Interlude project funded by the French National Research Agency [ANR- 08-CORD-010]. References James F. Allen. 1983. Maintaining knowl- edge about temporal intervals. Commun. ACM, 26:832–843, November. R. B. Dannenberg. 1993. Music representation issues, techniques and systems. Computer Mu- sic Journal, 17(3):20–30. D. Fober, S. Letz, and Y. Orlarey. 2007. Ve- mus - feedback and groupware technologies for music instrument learning. In Proceedings of the 4th Sound and Music Computing Confer- ence SMC’07 - Lefkada, Greece, pages 117–123. D. Fober, C. Daudin, S. Letz, and Y. Orlarey. 2010a. Time synchronization in graphic domain - a new paradigm for augmented music scores. In ICMA, editor, Proceedings of the Interna- tional Computer Music Conference, pages 458– 461. D. Fober, C. Daudin, Y. Orlarey, and S. Letz. 2010b. Interlude - a framework for augmented music scores. In Proceedings of the Sound and Music Computing conference - SMC’10, pages 233–240. Yann Geslin and Adrien Lefevre. 2004. Sound and musical representation: the acousmo- graphe software. In ICMC’04: Proceedings of the International Computer Music Conference, pages 285–289. ICMA. M. Good. 2001. MusicXML for Notation and Analysis. In W. B. Hewlett and E. Selfridge- Field, editors, The Virtual Score, pages 113– 124. MIT Press. Walter B. Hewlett and Eleanor Selfridge-Field, editors. 2001. The Virtual Score; representa- tion, retrieval and restoration. Computing in Musicology. MIT Press. H. Hoos, K. Hamel, K. Renz, and J. Kilian. 1998. The GUIDO Music Notation Format - a Novel Approach for Adequately Represent- ing Score-level Music. In Proceedings of the In- ternational Computer Music Conference, pages 451–454. ICMA.

A Behind-the-Scenes Peek at World's First Linux-Based Laptop Orchestra – The Design of L2Ork Infrastructure and Lessons Learned

Ivica Ico Bukvic
Virginia Tech, DISIS, L2Ork, ICAT
Department of Music - 0240
Blacksburg, VA, 24061
[email protected]

Abstract computer music genre has primarily relied on either acousmatic spatialization or a more traditional In recent years we've experienced a proliferation of stereoscopic means of sound reproduction. Both of laptop orchestras and ensembles. Linux Laptop the said approaches pose a challenge in that they can Orchestra or L2Ork founded in the spring 2009 introduces exclusive reliance on open-source disassociate the music-making process with its software to this novel genre. Its hardware design also source. In such a setting typically one or more provides minimal cost overhead. Most notably, computer musicians located on stage making (or L2Ork provides a homogenous software and mixing) sound are being heard across a vast array of hardware environment with focus on usability and speakers, none of which correspond with their transparency. In the following paper we present an physical location. This is where hemispherical overview of L2Ork's infrastructure and lessons speakers make a considerable difference in bridging learned through its design and implementation. the said cognitive gap (Smallwood, Cook, & Trueman, 2009) and reintroducing the critical Keywords of co-located sound inherent to acoustic music instruments. Linux Laptop Orchestra, L2Ork, Infrastructure, Lessons Learned 2 Introducing L2Ork 1 Introduction Linux Laptop Orchestra or L2Ork (I. Bukvic et al., 2010) was founded in Spring 2009 as part of a Laptop orchestras arguably need no introduction. research initiative with focus on providing a Their number is now growing at an exponential rate discipline-agnostic point of convergence for arts, with new ensembles being introduced at Universities technology, and engineering. The initiative has and other independent institutions across the world gained considerable traction, attracting twenty on- (“International Association of Laptop Orchestras,” campus stakeholders and four corporate sponsors. In n.d.). With the ensuing growth, the laptop ensemble part due to its open design, and in part due to its repertoire is increasingly hampered by the lack of ambitious goal of complementing K-12 (elementary software and hardware standardization. The school to high school) education by introducing this mounting cost of obtaining and/or building unusual integration of arts and sciences, L2Ork has supporting infrastructure has resulted in also received an unprecedented amount of media compromises, further exacerbating the said attention, including the front cover feature in the fragmentation (I. Bukvic, Martin, Standley, & Linux Journal (Phillips, 2010). By partnering with Matthews, 2010; Kapur et al., 2010; Trueman, Cook, the regional Boys & Girls Club (an after-school Smallwood, & Wang, 2006; Wang, Bryan, Oh, & program for inner city children) and through the help Hamilton, 2009). One notable aspect of ensembles of a series of grants the project has produced a based on the PLOrk model (Trueman et al., 2006), is smaller complementing satellite laptop orchestra for the use of hemispherical speakers whose purpose is the purpose of training 4th and 5th graders in the to provide co-located sound. Prior to the newfound art of making laptop ensemble music introduction of the hemispherical speakers, the

(Ivica Bukvic, Martin, & Matthews, 2011). Since its inception, the ensemble has had four major tours, including a twenty-one-day tour of Europe with stops at STEIM and IRCAM. Last February, a work titled Half-Life won first place in the first international laptop orchestra composition competition (Elliott, 2011). Apart from the fact that the ensemble is a result of academic research and a part of the academic curriculum, one of its core goals is to establish L2Ork as a professional and active ensemble outside the academic circles.

By its very design, the ensemble attracts students from various backgrounds, amounts of musical training, and/or computer literacy. For this reason, a part of the project's focus is on the development of optimal interfaces for the delivery of critical musical cues, with the assumption that users have no prior musical experience. This challenge is further amplified by the ensemble's focus on physical presence and choreography, most recently with the inclusion of Taiji (Tai Chi) mind-body practice and the system's ability to "harvest" the ensuing motion data and translate it into meaningful musical cues the audience can clearly associate with.

Four unique, yet mutually dependent components that in many ways define the ensemble's approach to the laptop orchestra genre are:

1. interest in developing physical presence and choreography;
2. exclusive reliance on open-source;
3. affordable design, and
4. focus on a homogenous environment and optimal usability.

In the following section we will visit some of the most notable challenges and solutions to problems related to both the genre and the Linux platform.

3 Homogenous Environment

One of the challenges in maintaining a laptop orchestra is the uncertainty caused by the use of individualized software platforms. Even for ensembles relying on fairly homogenous environments, such as Windows and OSX, hurdles remain in terms of installed software, some of which can at times collide with each other and cause compatibility problems. Such challenges can easily waste a majority of the time allocated for a rehearsal on seemingly trivial things, often resulting in a frustrating experience. There is clearly a need for a turnkey solution, a system that simply does what it is meant to do. The author argues that the best chance of achieving such an environment, particularly within the Linux ecosystem, is to ensure that the OS and supporting hardware are truly homogenous. L2Ork achieves this by essentially loaning out fully preinstalled and preconfigured systems together with supporting hardware to students/users who are also expected to treat the ensuing amalgamation of hardware and software as an integrated instrument.

This is where the affordable design plays a critical role. With each seat relying on an MSI Wind Intel Atom-based notebook, the total cost per station at $750 is typically less than most mainstream mobile devices and their requisite audio hardware. Given the convenient name, each station was assigned a letter of the alphabet, a corresponding wind name (e.g. Austru, Briza, Cyclone, etc.) and a static IP address. IP addresses are distributed evenly starting with 192.168.2.10 for Austru, *.12 for Briza, *.14 for Cyclone, etc., with the remaining odd IP addresses reserved for projected future growth.

Another advantage of such a low-power setup is in the way it encourages creation of new content for the ensemble. The severe computational limits of the individual computers in the ensemble literally force artists to approach new compositions by thinking about each station as a small piece of a much larger ecosystem, rather than composing essentially the entire work on a single laptop and then "exploding" parts to individual machines. Based on the experience accrued through L2Ork, it is the author's belief that the two approaches yield distinctly different results for performers and audience alike. Another advantage related to the choice of the aforesaid netbooks is reliable open-source driver support. As a result, L2Ork's setup supports all functions provided by the hardware, including standby, hibernate, and other advanced functionality that may be of use during pre-concert tech setup. For a more detailed breakdown of L2Ork's hardware infrastructure, please consult the NIME 2011 publication (I. Bukvic et al., 2010).

While initially based on modified Ubuntu 9.04 and 9.10 releases, L2Ork currently relies on the mainstream Ubuntu 10.04 LTS distribution. In the near term the ensemble is looking to migrate to newer hardware, with the next target OS choice likely being Ubuntu 12.04. With every Ubuntu

upgrade, the author has identified fewer system modifications that were necessary to produce an optimal configuration. The most notable alteration to the current iteration is the real-time kernel that enables even low-power Atom-based notebooks to deliver reliable low-latency performance. To further streamline system use in both rehearsal and performance scenarios the desktop has been enhanced with a set of operational and usability tweaks and customizations. Operational enhancements deal with use-specific needs that minimize potentially undesirable degradation in the system's performance. These include real-time priorities for audio-related tasks, appropriate HD sleep and swappiness settings that abate potential xruns caused by HD usage, a CPU throttling policy that automatically detects a running JACK server and transitions into "performance" mode, disabling screen savers and screen sleep timers that may interfere with the display during rehearsals and performances, associating a static local IP address with the wired ethernet interface, and automatic reloading of the bluetooth driver upon resuming the notebook to minimize potential problems with Wiimote pairing. In addition, usability enhancements include a series of shell script wrappers that provide turnkey initialization of various pieces stored in the form of application shortcuts (we will discuss these further below), as well as the use of compiz ("Compiz Home," n.d.) for the purpose of providing a more responsive and user-friendly desktop environment. A limited number of desktop shortcuts provides access to the most common functionality, like Web browsing, as well as a shortcut for force-quitting potential runaway processes. For the purpose of minimizing initial setup overhead, every system by default has auto-login enabled and the desktop's appearance is streamlined to promote seamless performer migration among different stations, should such a need arise.

3.1 Synchronization

L2Ork's infrastructure supports up to sixteen performers and, despite its homogeneous design, several challenges arose from the ongoing attempts to synchronize stations and provide a new way to affect performance structure by cross-pollinating control data. A number of works written specifically for the ensemble rely on this component; for example, in the composition Rain, one performer's action audibly percolates through other systems. Consequently, increased performer activity can quickly result in a relatively high amount of traffic, to the point where the ensuing aural soundscape is built from stochastic percolations of short attack-based sounds propagated through the network. To achieve the said goal in this and other similar scenarios we rely primarily upon control data. Given that the ensemble relies on a wired, rather than wireless, network, it is also possible to exchange high-bandwidth audio data among different members of the ensemble. In such a potentially high-network-traffic environment, wireless communication has proved surprisingly unreliable. Even when experimenting with an industrial-grade Cisco wireless router in conjunction with TCP packets over an isolated local network, we still encountered one or more machines receiving time-critical cues up to a second later. Therefore wireless communication has proven inadequate for any kind of synchronized music production.

The ensuing local area network is shielded from external network traffic and thus limits potential network packet collisions. For this reason, all notebooks broadcast control data using UDP packets, enabling the individual stations to simply filter the streams they wish to monitor and/or interact with. This has allowed the system to have high flexibility while requiring minimal configuration (Fig. 1).

Figure 1. System synchronization.
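The control-data exchange described above amounts to each station broadcasting small state updates over the shielded LAN and letting peers filter what they need. The following is a minimal, hypothetical sketch of such a broadcast sender and receiver in Python; the message format and port number are illustrative assumptions, not L2Ork's actual protocol.

    import socket

    PORT = 9000  # hypothetical control-data port

    def send_cue(payload: bytes):
        # Broadcast a short control message to every station on the LAN.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(payload, ("192.168.2.255", PORT))
        s.close()

    def listen(prefix: bytes):
        # Each station filters only the streams it wants to monitor or react to.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("", PORT))
        while True:
            data, addr = s.recvfrom(1024)
            if data.startswith(prefix):
                print(addr[0], data)

    # e.g. send_cue(b"/rain/ping 1") on one station, and
    # listen(b"/rain/") on the stations that play the Rain part.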
Another challenge associated with synchronization is related to a seemingly trivial

matter of connecting concurrently up to 16 co-located Nintendo Wiimotes. Discovering devices in such an environment simply isn't an option, as there is no guarantee that the Wiimote will pair with the right station. For this reason, the system uses ethernet connectivity with a predetermined local IP address as the foundation for all synchronization with external devices, through the use of a series of shell scripts. All shell scripts are run as part of the initialization sequence for each of the pieces. The whatismyip script is in charge of detecting the local machine's IP address, which may be used for matching a score part with the station/performer. whatismywiimote uses the IP address to pair it with an appropriate Wiimote MAC address. The reason we did not rely on the computer's MAC address as a means of identifying individual stations is that, unlike the hard-wired MAC address, the IP configuration can be easily altered. Therefore, through a simple alteration of the IP address, any station can be re-purposed to take over the role of a different computer in the ensemble. This has proved critical in situations where a potential hardware failure has rendered a particular station inoperable and yet where a work required the first n contiguous machines, as is the case with the aforesaid Rain composition that echoes individual aural events through the ensemble.
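As a rough illustration of this IP-based identification, the mapping from the static address to a station name and its Wiimote could be as simple as a lookup table keyed on the last octet. The actual whatismyip and whatismywiimote tools are shell scripts; the Python sketch below only mirrors the idea, and the station table and MAC addresses are invented for the example.

    import socket

    # Hypothetical table: last IP octet -> (wind name, Wiimote MAC address)
    STATIONS = {
        10: ("Austru", "00:1F:XX:XX:XX:01"),
        12: ("Briza", "00:1F:XX:XX:XX:02"),
        14: ("Cyclone", "00:1F:XX:XX:XX:03"),
    }

    def whatismyip():
        # Ask the kernel which local address would be used on the ensemble LAN.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(("192.168.2.1", 1))
        ip = s.getsockname()[0]
        s.close()
        return ip

    def whatismywiimote(ip):
        octet = int(ip.split(".")[-1])
        return STATIONS.get(octet)

    if __name__ == "__main__":
        ip = whatismyip()
        print(ip, whatismywiimote(ip))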
Finally, as the ensemble grew, there was a growing overhead associated with patching and synchronization of each system, including the latest L2Ork-related software updates, as well as revisions to compositions and the supporting software. This has proven a problem of exponential nature, often requiring hours to administer on all 16 machines. It has also proven prone to human error due to its repetitive nature. Therefore, the ensemble developed a series of shell scripts (l2ork-send and l2ork-do) that, akin to GRENDL technology (Beck, Jha, Ullmer, Branton, & Maddineni, 2010), provide near real-time synchronization with two notable differences:

1. The L2Ork synchronization system relies on the git framework, and
2. L2Ork is a homogeneous environment, which has made synchronization considerably easier (Fig. 2).

As a result, at the beginning of every rehearsal, systems are synced from the instructor/developer's master machine in a matter of seconds. Additional supporting tools were provided using the ssh and sftp protocols, where the instructor is capable of uploading new versions of system software that need to be applied locally using administrator permissions. Such files are accompanied by an installer that administers all changes with minimal interaction from users. All these changes have vastly minimized the amount of time required to administer the ensemble, as well as provided for a more enjoyable experience among its users, many of whom have limited computer literacy and/or knowledge of music.

Figure 2. System synchronization.

3.2 Pd-L2Ork

For its audio-related digital signal processing and interfacing with external devices, L2Ork relies exclusively on the combination of the JACK server (Davis & others, n.d.) and an in-house version of Pure Data (PD) (Puckette, 1996) titled Pd-L2Ork (Ivica Bukvic, n.d.). The former is integrated through the QjackCtl JACK front-end, which is also responsible for running a series of start-up scripts to ensure that the OS is running optimally (e.g. disabling the wireless network to minimize potential confusion from having two active network connections, and setting the CPU into performance mode).

Pd-L2Ork is an ambitious project in and of itself. Initially, the project's output was limited to upstream

patches. In the fall of 2010, however, mainly due to increasingly divergent interests and needs between L2Ork and the upstream PD distribution, the author decided to create a fork. Since then, the project has significantly altered PD's core functionality with more than two hundred bug-fixes and new features. Based on the 0.42.6 branch of Pd-Extended ("Pd-extended — PD Community Site," n.d.), Pd-L2Ork focuses on two key areas: robust and stable operation of both the audio engine and the GUI, and usability improvements to the runtime environment and editor. Unlike early iterations, more recent versions of Pd-L2Ork have had a crash-free streak for over a year, including over twenty evening-long shows and performances. The system's GUI is more resilient in high-bandwidth scenarios, while a series of new objects, including a threaded implementation of a Linux-specific Nintendo Wiimote external that allows bi-directional communication without xruns, have further enhanced its usability. The disis_wiimote external is of particular importance, as L2Ork makes extensive use of haptic feedback to allow performers to monitor critical operation, even sense tempo and beat, without having to look at the screen, something that has proven particularly useful in practicing Taiji choreography.

Another notable area of Pd-L2Ork is the extensive improvements to the editor. These include an improved scrolling algorithm, fixing all known GOP redrawing bugs and instabilities, addressing all known major bugs, adding infinite undo, the ability for iemgui objects' and canvas' properties to be altered via GUI handles rather than through a separate properties window, and a re-skinning of the GUI to give the application a more contemporary feel. Most of the core editing actions, e.g. displacing a large number of objects, now rely on the Tcl toolkit's "tags." This change has resulted in a significantly lower CPU footprint. Based on preliminary tests, moving a modest graph-on-parent-enabled abstraction now yields no noticeable CPU overhead on a low-power Intel i3 processor, while the same action using continuous redraws (the legacy approach) encumbers up to 15% of the CPU on the same hardware. Objects can also be easily layered on top of each other, and the magicglass signal monitoring tool has offered a more efficient approach to debugging. All of these features have helped streamline the production of performance interfaces (e.g. input/output monitoring and the visual score engine) within PD's environment.

The Pd-L2Ork project is certainly interested in seeing its contributions adopted upstream. However, given the lack of control over this process or its pace, it is the author's belief that maintaining a separate version has allowed the ensemble to dramatically increase the pace of the software's development and as such it remains the preferred choice.

4 Lessons Learned

In the world of computing there is a saying that the last 20% towards fine-tuning a turnkey system takes 80% of the overall project time. Undoubtedly, the same holds true for L2Ork as well. Many of the seemingly trivial bug-fixes to the core Linux system and, more notably, the pd-l2ork code base have taken weeks if not months to identify. Now, we are finally in a position where our integrated system is observed by a newcomer as something that just works. The L2Ork infrastructure is therefore regarded as one fully integrated whole, and the fact that it relies on Linux is, beyond the obvious potential benefits of such an approach, becoming entirely transparent to the user. It is the author's conviction that this is where Linux's future lies. It is not the advocacy, or that curious yet often incomplete feature. It is the transparency coupled with the unprecedented flexibility and power of such a system that make it a truly powerful platform. It is truly those last 20% that have consumed countless hours and yet rewarded in unforeseen ways.

Ironically, as we look forward to L2Ork's next rehearsal and a tour, the excitement behind the fact that the infrastructure is robust and stable has all but worn off. And even though some of us still remember the countless hours spent on many of pd-l2ork's novel features, we've finally arrived at the point where one can simply expect nothing less than stability, thinking this is simply the way it's supposed to be.

For additional information on the project, pd-l2ork and other supporting software, and video instructables, visit L2Ork at http://l2ork.music.vt.edu.

5 Acknowledgments

The author would like to hereby thank all L2Ork's Stakeholders and Sponsors, without whose vision and support this project would've never seen the light of day. Stakeholders are (listed in alphabetic order): Virginia Tech Alumni Relations,

the Arts Initiative, College of Architecture and Urban Studies, Collaborative for Creative Technologies in the Arts and Design, Center for Human-Computer Interaction, Center for Instructional Design and Educational Research, College of Liberal Arts & Human Sciences, Creative Technologies Experiential Galleria, Digital Interactive Sound & Intermedia Studio, Institute for Creativity, Arts & Technology, Institute for Critical Technology and Applied Science, Institute for Distance and Distributed Learning, Institute for Society, Culture & Environment, Information Technology, Learning Technologies, Department of Music, Office of Academic Assessment, Outreach & International Affairs, School of Education, and the School of Performing Arts and Cinema. Sponsors are MSI Computer Inc., Roland Corporation, and Sweetwater Inc.

References

Beck, S. D., Jha, S., Ullmer, B., Branton, C., & Maddineni, S. (2010). GRENDL: grid enabled distribution and control for Laptop Orchestras. ACM SIGGRAPH 2010 Posters (p. 84).

Bukvic, I., Martin, T., Standley, E., & Matthews, M. (2010). Introducing L2Ork: Linux Laptop Orchestra. Proceedings of New Interfaces for Musical Expression (NIME), 170–173.

Bukvic, I. (n.d.). Virginia Tech Department of Music L2Ork - Software - Linux Laptop Orchestra. Retrieved February 8, 2012, from http://l2ork.music.vt.edu/main/?page_id=56

Bukvic, I., Martin, T., & Matthews, M. (2011). Moving Beyond Academia Through Open Source Solutions - Introducing L2Ork, Virginia Tech's Linux Laptop Orchestra. Journal SEAMUS.

Compiz Home. (n.d.). Retrieved February 28, 2012, from http://www.compiz.org/

Davis, P., & others. (n.d.). JACK Audio Connection Kit (software).

Elliott, J. (2011, September 29). Virginia Tech's Linux Laptop Orchestra to perform at Virginia State Fair. Virginia Tech News. Retrieved February 28, 2012, from http://www.vtnews.vt.edu/articles/2011/09/092911-clahs-l2orkstatefair.html

International Association of Laptop Orchestras. (n.d.). Retrieved February 28, 2012, from http://ialo.org/doku.php/laptop_orchestras/orchestras

Kapur, A., Darling, M., Wiley, M., Vallis, O., Hochenbaum, J., Murphy, J., Diakopolus, D., et al. (2010). The Machine Orchestra. Proceedings of the International Computer Music Conference (pp. 554–558).

Pd-extended — PD Community Site. (n.d.). Retrieved February 8, 2012, from http://puredata.info/community/projects/software/pd-extended/?searchterm=pd-extended

Phillips, D. (2010, May). Introducing L2Ork: the Linux Laptop Orchestra. Linux Journal. Retrieved February 28, 2012, from http://www.linuxjournal.com/article/10694

Puckette, M. (1996). Pure Data: another integrated computer music environment. In Proceedings of the International Computer Music Conference, 37–41.

Smallwood, S., Cook, P., & Trueman, D. (2009). Don't Forget the Laptop: A History of Hemispherical Speakers at 10. Proceedings of New Interfaces for Musical Expression (NIME), Pittsburgh, PA, USA.

Trueman, D., Cook, P., Smallwood, S., & Wang, G. (2006). PLOrk: Princeton laptop orchestra, year 1. Proceedings of the International Computer Music Conference (pp. 443–450).

Wang, G., Bryan, N., Oh, J., & Hamilton, R. (2009). Stanford laptop orchestra (SLOrk). Proceedings of the International Computer Music Conference (pp. 505–508).

Studio report: Linux audio for multi-speaker natural speech technology

Charles FOX, Heidi CHRISTENSEN and Thomas HAIN
Speech and Hearing, Department of Computer Science
University of Sheffield, UK

charles.fox@sheffield.ac.uk

Abstract

The Natural Speech Technology (NST) project is the UK's flagship research programme for speech recognition research in natural environments. NST is a collaboration between Edinburgh, Cambridge and Sheffield Universities; public sector institutions the BBC, NHS and GCHQ; and companies including Nuance, EADS, Cisco and Toshiba. In contrast to assumptions made by most current commercial speech recognisers, natural environments include situations such as multi-participant meetings, where participants may talk over one another, move around the meeting room, make non-speech vocalisations, and all in the presence of noises from office equipment and external sources such as traffic and people outside the room. To generate data for such cases, we have set up a meeting room / recording studio equipped to record 16 channels of audio from real-life meetings, as well as a large computing cluster for audio analysis. These systems run on free, Linux-based software and this paper gives details of their implementation as a case study for other users considering Linux audio for similar large projects.

Keywords

Studio report, case study, speech recognition, diarisation, multichannel

1 Introduction

The speech recognition community has evolved into a niche distinct from general computer audio and Linux audio in particular. It has its own large collection of tools, some of which have been developed continually for over 20 years, such as the HTK Hidden Markov Model ToolKit [Young et al., 2006] [1]. We believe there could be more crosstalk between the speech and Linux audio worlds, and to this end we present a report of our experiences in setting up a new Linux-based studio for dedicated natural speech research.

[1] Currently owned by Microsoft; source available gratis but not libre. Kaldi is a libre alternative currently under development (kaldi.sourceforge.net).

In contrast to assumptions made by current commercial speech recognisers such as Dragon Dictate, natural environments include situations such as multi-participant meetings [Hain et al., 2009], where participants may talk over one another, move around the meeting room, make non-sentence utterances, and all in the presence of noises from office equipment and external sources such as traffic and people outside the room. The UK Natural Speech Technology project aims to explore these issues, and their applications to scenarios as diverse as automated TV programme subtitling; assistive technology for disabled and elderly health service users; automated business meeting transcription and retrieval; and homeland security.

The use of open source software is practically a prerequisite for exploratory research of this kind, as it is never known in advance which parts of existing systems will need to be opened up and edited in the course of research. The speech community generally works on offline statistical, large data-set based research. For example, corpora of 1000 hours of audio are not uncommon and require the use of large compute clusters to process them. These clusters already run Linux and HTK, so it is natural to extend the use of Linux into the audio capture phase of research.

As speech research progresses from clean to natural speech, and from offline to real-time processing, it is becoming more integrated with general sound processing [Wolfel and McDonough, 2009], for example developing tools to detect and classify sounds as precursors to recognition. The use of Bayesian techniques in particular emphasises
the advantages of considering the sound processing and recognition as tightly coupled problems, and using tightly integrated computer systems. For example, it may be useful for Linux cluster machines running HTK in real-time to use high-level language models to generate Bayesian prior beliefs for low-level sound processing occurring in Linux audio.

This paper provides a studio report of our initial experiences setting up a Linux-based studio for NST research. Our studio is based on a typical meeting room, where participants give presentations and hold discussions. We hope that it will serve as a self-contained tutorial recipe for other speech researchers who are new to the Linux audio community (and have thus included detailed explanations of relatively simple Linux audio concepts). It also serves as an example of the audio requirements of the natural speech research community, and as a case study of a successful Linux audio deployment.

2 Research applications

The NST project aims to use a meeting room studio, networked home installations, and our analysis cluster to improve recognition rates in natural environments, with multiple, mobile speakers and noise sources. We give here some examples of algorithms relevant to natural speech, and their requirements for Linux audio.

Beamforming and ICA are microphone-array based techniques for separating sources of audio signals, such as extracting individual speakers from mixtures of multiple speakers and noise sources. ICA [Roberts and Everson, 2001] typically makes weak assumptions about the data, such as assuming that the sources are non-Gaussian, in order to find a mixing matrix M which minimises the Gaussian-ness over time t of the latent sources vector x_t, from the microphone array time series vectors y_t, in y_t = M x_t. ICA can be performed with as few microphones as there are sound sources, but gives improved results as the number of microphones increases. Beamforming [Trees, 2002] seeks a similar output, but can include stronger physical assumptions - for example known microphone and source locations. It then uses expected sound wave propagation and interference patterns to infer the source waves from the array data. Beamforming is a high-precision activity, requiring sample-synchronous accuracy between recorded channels, and often using up to 64 channels of simultaneous audio in microphone arrays (see for example the NIST Mark-III arrays [Brayda et al., 2005]).
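To make the y_t = M x_t formulation above concrete, the toy example below unmixes two synthetic sources from two simulated microphone signals using FastICA from scikit-learn. This is only an illustrative sketch of the model, not part of the NST tool chain; the mixing matrix and signals are invented for the example.

    import numpy as np
    from sklearn.decomposition import FastICA

    fs = 16000                      # assumed sample rate
    t = np.arange(fs) / fs
    x = np.c_[np.sin(2 * np.pi * 5 * t),           # latent source 1
              np.sign(np.sin(2 * np.pi * 3 * t))]  # latent source 2
    M = np.array([[1.0, 0.5],
                  [0.3, 1.0]])      # made-up mixing matrix
    y = x @ M.T                     # microphone signals, y_t = M x_t

    ica = FastICA(n_components=2, random_state=0)
    x_hat = ica.fit_transform(y)    # estimated sources (up to scale/permutation)
    M_hat = ica.mixing_             # estimated mixing matrix
    print(M_hat)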
Reverberation removal has been performed in various ways, using single and multi-channel data. In multi-channel settings, sample-synchronous audio is again used to find temporal correlations which can be used to separate the original sound from the echos. In the iPhone 4 this is performed with two microphones, but performance may increase with larger arrays [Watts, 2009].

Speaker tracking may use SLAM techniques from robotics, coupled with acoustic observation models, to infer positions of moving speakers in a room (e.g. [Fox et al., 2012], [Christensen and Barker, 2010]). This can be used in conjunction with beamforming to attempt retrieval of individual speaker channels from natural meeting environments, and again relies on large microphone arrays and sample-accurate recording.

Part of the NST project called 'homeService' aims to provide a natural language interface to electrical and electronic devices, and digital services, in people's homes. Users will mainly be disabled people with conditions affecting their ability to use more conventional means of access such as keyboard, computer mouse, remote control and power switches. The assistive technology (AT) domain presents many challenges; of particular consequence for NST research is the fact that users in need of AT typically have physical disabilities associated with motor control, and such conditions (e.g. cerebral palsy) will also often affect the musculature surrounding the articulatory system, resulting in slurred and less clear speech, known as dysarthric speech.

3 Meeting room studio setup

Our meeting room studio, shown in fig. 1, is used to collect natural speech and training data from real meetings. It is centred on a six-person table, with additional chairs around the walls for around a further 10 people. It has a whiteboard at the head of the table, and a presentation projector. Typical meetings involve participants speaking from their chairs but also getting up and walking around to present or to use the whiteboard. A 2 x 2.5 m aluminium frame is suspended from the ceiling above the table and used for mounting audio

equipment. Currently this consists of eight AKG C417/III vocal condenser microphones, arranged in an ellipse around the table perimeter. A further eight AKG C417/IIIs are embedded in a 100mm radius cylinder placed in the table centre to act similarly to an eight-channel multi-directional tele-conferencing recorder. The table can also include three 7-channel DevAudio Microcones (www.dev-audio.com), which are commercial products performing a similar function. The Microcone is a 6-channel microphone array which comes with proprietary drivers and an API. A further 7th audio channel contains a mix of the other 6 channels as well as voice activity detection and sound source localisation annotation. Some noise reduction and speech enhancement capabilities are provided, although details of the exact processing are not made public. There are four Sennheiser ew100 wireless headsets which may be mounted on selected participants to record their speech directly. The studio will soon also include a Radvision Scopia XT1000 videoconferencing system, comprised of a further source-tracking steered microphone and two 1.5m HD presentation screens. In the four upper corners of the meeting room are mounted Ubisense infrared active badge receivers, which may be used to track the 3D locations of 15 mobile badges worn by meeting participants. (The university also has a 24-channel surround sound diffusion system used in an MA Electroacoustic music course [Mooney, 2005], which may be useful for generating spatial audio test sets.)

Figure 1: Meeting room recording setup. The boxes on the far wall are active-badge trackers. The frame on the ceiling and the black cylinder on the table each contain eight condenser microphones. Participants wear headsets and active badges.

Sixteen of the mics are currently routed through two MOTU 8Pre interfaces, which take eight XLR or line inputs each. Both currently run at 48kHz but can operate up to 96kHz. The first of these runs in A/D conversion mode and sends all 8 digitised channels via a single ADAT Lightpipe fibre optic cable to the second 8Pre. The second 8Pre receives this input, along with its own eight audio channels, and outputs all 16 channels to the DAW by Firewire 400 (IEEE 1394, 400 Mbit/sec). (The two boxes must be configured to (a) have the same speed, (b) use the 1x ADAT protocol and (c) be in converter/interface mode respectively.) Further firewire devices will be added to the bus later to accommodate the rest of the microphones in the room.

4 Linux audio system review

Fig. 2 outlines the Linux audio system and highlights the parts of the many possible Linux audio stacks that are used for recording in the meeting room studio. The Linux audio architecture has grown quite complex in recent years, so is reviewed here in detail.

OSS (Open Sound System) was developed in the early 1990s, focused initially on Creative SoundBlaster cards then extending to others. It was a locking system allowing only one program at a time to access the sound card, and lacked support for features such as surround sound. It allowed low level access to the card, for example by cat audio.wav > /dev/dsp0. ALSA (Advanced Linux Sound Architecture) was designed to replace OSS, and is used on most current distributions including our Ubuntu 11.10. PortAudio is an API with backends that abstract both OSS and ALSA, as well as sound systems of non-free platforms such as Win32 sound and Mac CoreAudio, created to allow portable audio programs to be written. Several software mixer systems were built to resolve the locking problem for

consumer-audio applications, including PulseAudio, ESD and aRts. Some of these mixers grew to take advantage of and to control hardware mixing provided by sound cards, and provided additional features such as network streaming. They provided their own APIs as well as emulation layers for older (or mixer-agnostic) OSS and ALSA applications. (To complicate matters further, recent versions of OSS4 and ALSA have now begun to provide their own software mixers, as well as emulation layers for each other.) Many current Linux distributions, including Ubuntu 11.10, deploy PulseAudio running on ALSA, and also include an ALSA emulation layer on Pulse to allow multiple ALSA and Pulse applications to run together through the mixer. Media libraries such as GStreamer (which powers consumer-audio applications such as VLC, Skype and Flash) and libcanberra (the GNOME desktop sound system) have been developed closely with PulseAudio, increasing its popularity. However, Pulse is not designed for pro-audio work, as such work requires very low latencies and minimal drop-outs.

The JACK system is an alternative software mixer for pro-audio work. Like the other soft mixers, JACK runs on many lower level platforms – usually ALSA on modern Linux machines. The bulk of pro-audio applications such as Ardour, zynAddSubFx and qSynth run on JACK. JACK also provides network streaming, and emulations/interfaces for other audio APIs including ALSA, OSS and PulseAudio. (Pulse-on-JACK is useful when using pro and consumer applications at the same time, such as when watching a YouTube tutorial about how to use a pro application. This re-configuration happens automatically when JACK is launched on a modern Pulse machine such as Ubuntu 11.10.)

Figure 2: Audio system for recording.

5 Software setup

Our DAW is a relatively low-power Intel E8400 (Wolfdale) dual-core, 3GHz, 4Gb Ubuntu Studio 11.10 64-bit machine. Ubuntu Studio was installed directly from CD – not added as packages to an existing Ubuntu installation – as this gives a more minimalist installation than the latter approach. In particular the window manager defaults to the low-power Xfce, and resource-intensive programs such as Gnome-Network-Monitor (which periodically searches for new wifi networks in the background) are not installed. Ubuntu was chosen for compatibility and familiarity with our other desktop machines. (Several other audio distributions are available, including PlanetCCRMA (Fedora), 64Studio (Debian) and ArchProAudio (Arch).)

The standard ALSA and OSS provide interfaces to USB and PCI devices below, and to JACK above. However, for firewire devices such as our 8Pre, the ffado driver provides a direct interface to JACK from the hardware, bypassing ALSA or OSS. (Though the latest/development version provides an ALSA output layer as well.) Our DAW uses ffado with JACK2 (Ubuntu packages: jackd2, jackd2-firewire, libffado, jackd, laditools; JACK1 is the older but perhaps more stable single-processor implementation of the JACK API), and fig. 3 shows our JACK settings in the qjackctl tool. The firewire backend driver (ffado) is selected rather than ALSA.

We found it useful to unlock memory for good JACK performance.[2] As well as ticking the unlock memory option, the user must also be allowed to use it, e.g. adduser charles audio. Also the file /etc/security/limits.d/audio.conf was edited (followed by a reboot) to include
@audio - rtprio 95
@audio - memlock unlimited
These settings can be checked by ulimit -r -l.

[2] By default, JACK locks all memory of its clients into RAM, i.e. tells the kernel not to swap their pages to virtual memory on disc; see mlock(2). Unlock memory relaxes this slightly, allowing just the large GUI components of clients to perform swapping, leaving more RAM free for the audio parts of clients.

Figure 3: 16 channel recording JACK settings.

The JACK sample rate was set to 48kHz, matching the 8Pres. (This is a good sample rate for speech research work as it is similar to CD quality but allows simple sub-sampling to power-of-two frequencies used in analysis.) Fig. 4 shows the JACK connections (again in qjackctl) for our meeting room studio setup. The eight channels from the converter-mode 8Pre appear as ADAT optical inputs, and the eight channels from the interface-mode 8Pre appear as 'Analog' inputs, all within the firewire device. Ardour was used with two tracks of eight-channel audio to record, as shown in fig. 5.

Figure 4: 16 channel recording JACK connections.

5.1 Results

Using this setup we were able to record simultaneously from six overhead microphones, eight table-centre microphones, and two wireless headsets, as illustrated in fig. 5. We experienced no JACK xruns in a ten minute, 48kHz, 32-bit, 16-channel recording, and the reported JACK latency was 8ms. A one hour meeting recording with the same settings experienced only 11 xruns. Total CPU usage was below 25% at all times, with top listing the following typical total process CPU usages: jack 11%, ardour 8%, jack.real 3%.

However, we were unable to play audio back through the 8Pres, hearing distorted versions of the recording. For our speech recognition this is relatively unimportant, and can be worked around by streaming the output over the network with JACK and playing back on a second machine. The ffado driver's support for the 8Pre hardware is currently listed as 'experimental' so work is needed here to fix this problem.

The present two-8Pre system is limited to 16 audio channels; we plan to extend it with further firewire devices to record from more audio sources around the meeting room and from tele-conferencing channels in future. We have not yet needed to make further speed optimisations, but we note that for future, more-channel systems, two speedups include disabling PulseAudio (adding pulseaudio --kill and pulseaudio --start to qjackctl's startup and shutdown options is a simple way to do this); and installing the real-time rt-linux kernel.
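As a rough sanity check (this arithmetic is ours, not a figure from the paper), the sustained data rate of a 16-channel, 48 kHz, 32-bit capture is modest and well within the capabilities of a desktop disk and FireWire 400 bus:

    channels, rate, bytes_per_sample = 16, 48000, 4
    rate_mb_s = channels * rate * bytes_per_sample / 1e6
    hour_gb = rate_mb_s * 3600 / 1e3
    print(f"{rate_mb_s:.1f} MB/s sustained, about {hour_gb:.0f} GB per hour of recording")
    # -> roughly 3.1 MB/s, i.e. about 11 GB per recorded hour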

Figure 5: 16 channel meeting room recording in Ardour, using two 8-channel tracks.

5.1.1 Audio analysis

Analysis of our audio data is performed on a compute cluster of 20 Linux nodes with 96 processor cores in total, running the Oracle (formerly Sun) Grid Engine, the HTK Hidden Markov Model Tool Kit [Young et al., 2006] and the Juicer recognition engine [Moore et al., 2006]. During analysis, audio playback on desktop machines is useful and is done with the setup of fig. 6. For direct audio connections the HTK tools make use of the OSS sound system, which may be emulated on PulseAudio on a modern machine by installing the padsp tool (Ubuntu package pulseaudio-utils 0.9.10-1ubuntu1 i386) then prefixing all audio HTK commands with padsp. Similarly, Juicer allows the use of an OSS front-end; the implementation of a JACK plugin is in progress.

Figure 6: Audio system for data analysis.

The cluster may be used both for online and offline processing. An example of online processing can be found in the description of an online transcription system for meetings [Garner et al., 2009]. Such systems distinguish between far and near-field audio sources and employ beamforming and echo cancellation procedures [Hain et al., 2009] for which either sample synchronicity or at least minimal latency between channels is of utmost importance. For both offline and online processing, typically audio at 16kHz/16bit is required, to be converted into so-called feature streams (for example Perceptual Linear Predictive (PLP) features [Hermansky, 1990]) of much lower bit rate. Even higher sampling frequencies (up to 96kHz are commonplace) are often required for cross-correlation based sound source localisation algorithms to provide sufficient time resolution in order to detect changes in angles down to one degree. In offline mode the recognition of audio typically operates at 10 times real-time (i.e. 1 hour of audio in many channels takes 10 hours to process). However, the grid framework allows the latency of processing to drop to close to 1.5 times real-time using massive parallelisation.

6 Future directions

The ultimate goal of NST is to obtain transcriptions of what was said in natural environments. Traditionally, the source separation and denoising techniques of sec. 2 have been treated as a separate preprocessing step, before the cleaned audio is passed to a separate transcriber such as HTK. However for challenging environments this is suboptimal, as it reports only a single estimate of the denoised signal rather than Bayesian information about its uncertainty. Future integrated systems could fuse the predictions from transcription language models with inference and reporting of low-level audio data, for example by passing real-time probabilistic messages between HTK's transcription

inference (on a Linux computing cluster) and low-level audio processing (on desktop or embedded Linux close to the recording hardware).

For training and testing speaker location tracking systems it is useful to build a database of known speaker position sequences, which need to be synchronised to the audio. Positions change at the order of seconds, so it is wasteful to use audio channels to record them – however we note that JACK is able to record MIDI information alongside audio, and one possibility would be to encode positions from our Ubisense active badges as MIDI messages, synchronous with the audio, and record them together, for example in Ardour3. It could also be useful to – somehow – synchronise video of meetings to JACK audio.

The homeService system has two major parts, namely hardware components that are deployed in people's homes during trials (Android tablet, Microcone, and a Linux box responsible for audio capture, network access, infrared and bluetooth communication), as well as our Linux computing cluster back at the university running the processes with particularly high demands in terms of processing power and memory usage. The main audio capturing will take place on a Linux box in the users' home, and we plan to develop a setup which will enable the Microcone and JACK to work together and provide – if needed – live streaming of audio over the network.

The main requirements of NST research for Linux audio are support for sample-synchronous, many (e.g. 128 or more) channel recording, and communication with cluster computing and speech tools such as HTK. As natural speech technology makes closer contact with signal processing research, we expect to see more speech researchers moving to Linux audio in the near future, and we hope that this paper has provided some guidance for those who wish to make this move, as well as a guide for the Linux audio community about what technologies are important to this field.

Acknowledgements

The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).

References

L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, and P. Svaizer. 2005. Modifications on NIST MarkIII array to improve coherence properties among input signals. In Proc. of 118th Audio Engineering Society Conv.

H. Christensen and J. Barker. 2010. Speaker turn tracking with mobile microphones: combining location and pitch information. In Proc. of EUSIPCO.

C. Fox, M. Evans, M. Pearson, and T. Prescott. 2012. Tactile SLAM with a biomimetic whiskered robot. In Proc. ICRA.

P. N. Garner, J. Dines, T. Hain, A. el Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V. Wan, and L. Zhang. 2009. Real-Time ASR from Meetings. In Interspeech'09, pages 2119–2122.

T. Hain, L. Burget, J. Dines, P. N. Garner, A. el Hannani, M. Huijbregts, M. Karafiat, M. Lincoln, and V. Wan. 2009. The AMIDA 2009 meeting transcription system. In Proc. Interspeech.

H. Hermansky. 1990. Perceptual linear predictive analysis for speech. J. Acoustical Society of America, pages 1738–1752.

J. Mooney. 2005. Sound Diffusion Systems for the Live Performance of Electroacoustic Music. Ph.D. thesis, University of Sheffield.

D. Moore, J. Dines, M. Magimai Doss, J. Vepa, O. Cheng, and T. Hain. 2006. Juicer: A weighted finite state transducer speech decoder. In Machine Learning for Multimodal Interaction, pages 285–296. Springer-Verlag.

S. Roberts and R. Everson, editors. 2001. Independent Components Analysis: Principles and Practice. Cambridge.

H. L. Van Trees. 2002. Optimum array processing. Wiley.

L. Watts. 2009. Reverberation removal. United States Patent Number 7,508,948.

M. Wolfel and J. McDonough. 2009. Distant Speech Recognition. Wiley.

S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al. 2006. The HTK book.

A framework for dynamic spatial acoustic scene generation with Ambisonics in low delay realtime

Giso GRIMM
Medical Physics Group, Universität Oldenburg
26111 Oldenburg, Germany
[email protected]

Tobias HERZKE
HörTech gGmbH
Marie-Curie-Str. 2, 26129 Oldenburg, Germany
[email protected]

Abstract

A software toolbox developed for a concert in which acoustic instruments are amplified and spatially processed in low latency is described in this article. The spatial image is created in Ambisonics format by a set of dynamic acoustic scene generators which can create periodic spatial trajectories. Parameterization of the trajectories can be selected by a body tracking interface optimized for seated musicians. Application of this toolbox in the field of hearing research is discussed.

Keywords

Ambisonics, digital processing, concert performance

1 Introduction

Spatially distributed presentation of music can contribute to an improved perception. In the 16th century, many composers used several choirs and groups of musicians at distributed places in church music. However, most conventional concert locations do not provide the capabilities to have the musicians placed around the audience. Spatial presentation with loudspeakers offers new possibilities.

The ensemble ORLANDOviols[1] developed and performed a concert program in which the audience is surrounded by the five musicians. Additionally, the sound is processed by a low-latency Ambisonics system and presented in dynamic spatial configurations. The instruments virtually move around the audience. Trajectories matched to the music potentially provide easier access to the concepts in the music and may add further levels of interpretation. A central piece in the program is an "In Nomine" fantasy from the 16th century by Mr. Picforth, in which each voice is related to one of the planets known in that time [Grimm, 2012]. For example, during this piece the musicians virtually follow the planets' trajectories.

[1] http://www.orlandoviols.de/

Such a concert with simultaneous presentation of acoustic sources and their spatially processed images requires careful consideration of psycho-acoustic effects, specifically the precedence effect. Appropriate amplification and placement of loudspeakers and musicians are a prerequisite for a spatial image which is dominated by the processed sound and not by the direct sound of the musicians [Grimm et al., 2011]. A set of software components is required to make such a concert possible. The Open Sound Control (OSC) protocol [Wright, 2002] facilitates communication of these components across the borders of processes and hardware machines. The several software components and their applications are described in the following sections.

2 Application in a concert setup

The tools described in this paper have been developed for a concert performance entitled "Harmony of the Spheres". The ensemble consists of five musicians playing on viola da gamba (or viols), a historic bowed string instrument. Viola da gambas come as a family - the smallest has the size of a violin, the largest that of the double bass. The instrument – except for the largest one – is held with the legs, which led to its name ('gamba' means leg in Italian).

In the concert, five musicians (including one of the authors) sit on the corners of a large pentagon. Within the pentagon a 10-channel 3rd order horizontal Ambisonics system is placed on a regular decagon with a radius of approximately 6 meters. Within this decagon sits the audience, with space for roughly 100 listeners. The instruments are picked up with cardioid microphones

at a distance of approximately 40 cm, which is a compromise between a close distance for minimal feedback problems and a large distance for a balanced sound of the historic acoustic instruments. The microphone signals are converted to the digital domain (Behringer ADA8000) and processed on a Linux PC (Athlon X2 250). The processed signals are then played back by the powered near-field monitors (KRK RP5). Due to the large distance between the musicians – up to 15 meters – monitoring is required. The monitor signals are created on the hardware mixer of the sound card (RME HDSP9652) and played back to the musicians through in-ear monitor headphones. Two hardware controllers (Behringer BCF2000) are used to control the loudspeaker mix and the monitor. A separate PC is used for visualization of levels, monitor mixer settings and the spatial parameters. A foot switch (Behringer FBC1010) and a virtual foot switch (Microsoft Kinect connected to a netbook PC) are used for parameter control by one of the musicians. An additional netbook PC is used for projection of the trajectories of the virtual sources at the ceiling of the room.

Figure 1: Concert setup in a church, during rehearsal. The diameter of the loudspeaker ring was 13 m, fitting about 100 chairs.

3 Software components

The software components used in the toolbox for dynamic scene generation can be divided into a set of third party software, and specifically developed applications. The third party software includes the jack audio connection kit (jack) [Letz and Davis, 2011]. Jack2 is used to allow for parallel processing of independent signal graphs. The digital audio workstation 'ardour' [Davis, 2011] is used for signal routing, mixing, recording and playback. Convolution reverb is used for distance coding; the convolution is done by 'jconvolver' [Adriaensen, 2011]. The Ambisonics signals are decoded into the speaker layout with 'ambdec' [Adriaensen, 2011]. The tool 'mididings' [Sacré, 2010] triggers the execution of shell scripts upon incoming MIDI events.

3.1 Spatial processing: 'sphere' panning application

For the spatial processing a stand-alone jack panning application with an integrated scene generator is used. The scene generator is inspired by the planet movements. It can create Kepler ellipses with an additional epicyclic component and a random phase increment. The parameters are radius ρ, rotation frequency f, eccentricity ε, rotation of the ellipse main axis θ, starting angle ϕ_0, and the epicycle parameters radius ρ_epi, rotation frequency f_epi and starting angle ϕ_0,epi. The source position on a complex plane is

z = \frac{\rho \sqrt{1 - \varepsilon^2}}{1 - \varepsilon \cos(\varphi - \theta)} e^{i\varphi} + \rho_{epi} e^{i\varphi_{epi}}    (1)

The azimuth ∠z is the input argument of a 3rd order horizontal ambisonics panner. Distance perception in reverberant environments is dominated by the amount of reverberation [Sheeline, 1982; Bosi, 1990]. The distance |z| is here coded by adding a reverberant copy of the signal. The reverberant signal is created by a convolution stereo-reverb. A virtual stereo loudspeaker pair centered around the target direction given by the virtual source azimuth is used for playback. The width of the virtual stereo pair is controlled by the distance. It is chosen to be maximal at the critical distance ρ_crit, and converging to zero for close and distant sources. Also the ratio between the dry signal and the reverberant stereo signal is controlled by the distance. The distance to the origin normalized by the critical distance is ρ̂ = ρ/ρ_crit. Then the parameters of the distance coding are:

w(\rho) = w_{max} \frac{\hat\rho}{\hat\rho^2 + 1}    (2)

G_{dry} = \frac{1}{1 + \hat\rho}    (3)

G_{wet} = 1 - G_{dry}    (4)

The maximal width w_max and the critical distance ρ_crit can be controlled externally. Examples of trajectories used in the concert are shown in Figure 2.
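Read together, equations (1) to (4) map a handful of orbit parameters to a panning azimuth and a dry/wet gain pair. The Python sketch below simply evaluates the published formulas; it is an illustration, not code from the 'sphere' application, and the parameter values are arbitrary.

    import numpy as np

    def source_position(phi, phi_epi, rho, eps, theta, rho_epi):
        # Kepler ellipse with epicycle, eq. (1); returns a complex position z.
        r = rho * np.sqrt(1 - eps**2) / (1 - eps * np.cos(phi - theta))
        return r * np.exp(1j * phi) + rho_epi * np.exp(1j * phi_epi)

    def distance_coding(dist, rho_crit, w_max):
        # Virtual stereo pair width and dry/wet gains, eqs. (2)-(4).
        rho_hat = dist / rho_crit
        w = w_max * rho_hat / (rho_hat**2 + 1)
        g_dry = 1.0 / (1.0 + rho_hat)
        return w, g_dry, 1.0 - g_dry

    # Example: sample one revolution of a moderately eccentric orbit.
    for phi in np.linspace(0, 2 * np.pi, 8, endpoint=False):
        z = source_position(phi, phi_epi=3 * phi, rho=4.0, eps=0.4,
                            theta=0.0, rho_epi=0.5)
        azimuth = np.angle(z)          # input to the 3rd-order panner
        w, g_dry, g_wet = distance_coding(abs(z), rho_crit=6.0, w_max=1.0)
        print(f"{np.degrees(azimuth):7.1f} deg  |z|={abs(z):4.2f}  "
              f"w={w:4.2f}  dry={g_dry:4.2f}  wet={g_wet:4.2f}")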

The scene generator and panning application, named 'sphere', is a jack application with an input port for the dry signal and a stereo input port pair for the reverberated signal. The parameters can be controlled via OSC messages. The OSC receiver can be configured to listen at a multicast address, which makes it possible to control multiple instances with a single OSC message. To achieve a dynamic behavior independent from the actual processing block size, the update rate of the panning parameters is independent from the block rate. Changes of different parameters can be accumulated and applied simultaneously, either immediately or slowly within a given time.

Other features of the scene generator are sending of the current azimuth to other scene generator instances via OSC, application of parameter changes at predefined azimuths, randomization of the azimuth, and level-dependent azimuth.

The ambisonics panner uses horizontal panning only. For artistic reasons, a simulation of elevation is reached by gradually mixing the signal to the zeroth order component of the ambisonics signal. By doing so the resulting ambisonics signal does not correspond to any physical acoustic sound field anymore; however, the intention is to have the possibility of creating a virtual sound source without any distinct direction.

Figure 2: Example trajectories in two of the pieces: Kepler ellipses in the "In Nomine" by Mr. Picforth (top), and chaotic movements in the piece "Five" by John Cage (bottom).

3.2 Matrix mixer

Creating an individual headphone monitor mix for several musicians and many sources is a demanding task, and usually needed in situations when a quick and efficient solution is required. In this setup, the RME HDSP9652 sound card was used, which offers a hardware based matrix mixer. The existing interfaces on Linux - amixer (console) and hdspmixer (graphical) - provide full access to that mixer. However, the lack of a hardware-based remote control option makes these tools inefficient in time-critical situations. Therefore, a set of applications has been developed which attempts to provide a flexible solution: One application provides an interface to the audio hardware. A second application reads and stores the mixer settings from and to XML files. Mixer settings include panning – sin-cos-panning for two-channel output, vector-based amplitude panning (VBAP) for multi-channel output – and gains for a pre-defined set of inputs and outputs. Channels which are not used in a configuration will not be accessed. No logical difference is made between software and hardware inputs. A third application allows control via MIDI CC events (here coming from a Behringer BCF2000). Feedback is sent to the controller device. A fourth application visualizes the gains. Editing of the gains is planned, but not implemented yet at the time of writing. A screen-shot of the visualization tool is

shown in Fig. 3. The four applications share their settings via OSC messages. A feedback filter prevents OSC message feedback.

Figure 3: Screen shot of a matrix mixer example session.

3.3 Level metering

For the polyphonic character of many of the pieces played in the above mentioned concert, well balanced levels of the five instruments are of major importance. Level differences between the instruments of only a few dB can easily destroy the musical structure in some pieces. Space limitations make it impossible to place the mixing desk and its operator within the Ambisonics system. Therefore, a level metering application was developed. This level meter can measure the root-mean-square (RMS) level and short time peak values of multiple inputs. To account for the loudness difference between low instruments (double bass size) and medium and high instruments, a band pass filter with a pass band from 300 Hz to 3 kHz is applied before metering. The RMS level meter operates on 10 s and 125 ms rectangular windows. The 125 ms window is overlapping by 50%. The peak levels are measured in the same windows as the short time levels. The levels are sent via OSC messages to a visualization application. The level meter application was implemented in the HörTech Master Hearing Aid (MHA) [Grimm et al., 2006].

To allow for optimal mixing even without hearing the balanced final mix at the sweet spot, an indicator is used which shows the long term target level. This level can be changed by the presets for each piece and also within pieces. Long term level deviations can then be corrected by the operator. Automatic adaption to the desired levels bears the risk of causing feedback problems.
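As a rough illustration of the metering scheme described above (band-pass filtering followed by short rectangular RMS windows), the Python sketch below computes the 125 ms RMS track for one channel. It only mirrors the published parameters; the filter design and block handling of the actual MHA implementation are not shown, and the variable names are invented.

    import numpy as np
    from scipy.signal import butter, sosfilt

    FS = 48000                        # assumed sample rate

    def rms_track(x, win_s=0.125, hop_s=0.0625):
        # Band-pass 300 Hz to 3 kHz, then RMS over rectangular windows of
        # win_s seconds with 50% overlap (hop_s = win_s / 2), in dB full scale.
        sos = butter(4, [300, 3000], btype="bandpass", fs=FS, output="sos")
        y = sosfilt(sos, x)
        win, hop = int(win_s * FS), int(hop_s * FS)
        levels = []
        for start in range(0, len(y) - win + 1, hop):
            frame = y[start:start + win]
            rms = np.sqrt(np.mean(frame ** 2))
            levels.append(20 * np.log10(rms + 1e-12))
        return np.array(levels)

    # Example with a synthetic 440 Hz tone at -12 dBFS RMS.
    t = np.arange(FS) / FS
    x = 10 ** (-12 / 20) * np.sqrt(2) * np.sin(2 * np.pi * 440 * t)
    print(rms_track(x)[:5])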

3.4 Parameter exchange and preset selection

All applications exchange parameters and control data via OSC messages. To allow flexible inter-process communication across hardware borders, most applications subscribe themselves to UDP multicast addresses.

Settings of the scene generators 'sphere' can be loaded from setting files. Reading of preset files and sending them as OSC messages can be triggered either from the console, from the virtual footswitch controller, from the physical footswitch or by ardour markers. For the last option, an application has been developed which converts ardour markers into OSC messages: the application reads an ardour session file, subscribes to jack, and emits OSC messages whenever an ardour marker position falls into the time interval of the current jack audio block. The preset files can also control the transport of ardour, to control recording of inputs and trigger playback of additional sounds.

3.5 Visualization

Simple visualization classes have been implemented for display of the trajectories of the 'sphere' scene generators, for level monitoring and for visual control of the monitor mixer. With these classes applications can be built to contain any combination of visualization tools. No graphical user interaction is possible.

3.6 Implementation of virtual foot switch

In the concert application, one of the musicians is responsible for switching scene generation presets at musically appropriate times while playing. Since viola da gambas are played while sitting on a chair, and the instruments are held between the legs, it is sometimes difficult to use a real footswitch, especially elevating the heel, without interrupting the music. To circumvent these problems, a virtual footswitch is used which can be controlled by a minor movement of the tip of the foot, without the need of elevating the heel. Care

Care was taken that this footswitch is sufficiently robust to avoid false alarms and missed movements.

The virtual foot switch is implemented using the PrimeSense depth camera of the Microsoft Kinect game accessory [Kin, 2010]. The depth camera captures the area of the right foot of the musician in charge of the switching. The switch differentiates between 4 states: No foot present ("free"), and 3 different directions of the foot ("straight", "left" and "right"). The switch is inactive in the states "free" and "straight", and triggers actions in the states "left" and "right".

The depth camera uses an infrared laser diode to illuminate the room with an infrared dot pattern, which is then picked up by the camera sensor. From the distortion of the dot pattern, the camera is able to associate image pixels with depth information. The computation of depth information is performed inside the camera device. Because the infrared laser diode and the camera sensor are laterally separated, not the complete scene visible to the camera sensor is illuminated with the infrared dot pattern. For these shadow areas, the camera does not provide depth information. Also, for small surfaces like fingertips and reflecting surfaces like glass, the camera will often not return valid depth information.

The camera produces depth images of 640x480 pixels at a frame rate of 30 Hz.² The depth information in each captured depth image consists of a matrix of 11 bit integer depth values d_raw. The highest bit indicates an invalid depth value, e.g. a shadow. The range of detectable depth is approximately 0.5 m to 5 m, with better resolution (< 1 cm) at the start of this range. The depth in meters is

d = 0.1236\,\mathrm{m} \cdot \tan\left( d_{\mathrm{raw}} / 2842.5 + 1.1863 \right)    (5)

[Magnenat, 2011]. For depth pixel values where this formula yields a depth d < 0 m or d > 5 m, we conclude an invalid depth value.

To capture the foot of the musician, the depth camera is mounted on a small pedestal next to the music stand. The camera is tilted downwards by 16 degrees to capture the floor in the area of the foot as shown in Figure 4.

² We use the libfreenect library from the OpenKinect [Blake, 2011] project to receive a stream of depth images.

Figure 4: Kinect depth camera detects depth difference caused by foot.

Our software for the virtual foot switch starts with a training phase before entering the detection phase. It first needs to capture the scene without the foot, then directs the musician to place the foot pointing naturally straight, or rotated, around the heel, to the left and right, respectively. 10 images are captured for each of the 4 situations. Mean and standard deviation of each pixel's depth are computed. Three differential depth images are computed by subtracting the mean depth information of the images containing a foot from the empty scene. Differential depth for pixels with invalid depth information or too high standard deviation in either image is set to 0, effectively excluding these pixels from further consideration. The presence of a foot where previously was only floor in the image will reduce the depth information captured by the camera as shown in Figure 4. As we are mainly interested in the position of the front part of the foot, we restrict the differential depth image to depth differences in the range of 3 cm to 9 cm. All depth differences outside this range are also set to zero and will not be considered for locating the foot. Figure 5 shows an example of such a differential depth image for a straight pointing foot.

To identify the foot, the software searches for a cluster of depth difference pixels from bottom to top. The first such cluster of at least 100 pixels that is found is considered to be caused by the foot. The minimum rectangle that contains all three (straight, left, right) foot clusters defines the region of interest (ROI) to be used during the detection phase. Figure 6 shows the ROI from an example session, containing three foot clusters.
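Equation (5) is straightforward to apply to a whole depth frame at once. The sketch below (Python/NumPy; not the concert software, and the exact invalid-value handling is an assumption based on the description above) converts the 11-bit raw values to meters and discards values flagged as invalid or falling outside the 0 m to 5 m range.

import numpy as np

def raw_to_meters(d_raw):
    """Convert raw 11-bit Kinect depth values to depth in meters, per Eq. (5)."""
    d_raw = np.asarray(d_raw, dtype=float)
    d = 0.1236 * np.tan(d_raw / 2842.5 + 1.1863)
    # highest bit set (>= 1024), or a computed depth below 0 m or above 5 m, is invalid
    invalid = (d_raw >= 1024) | (d < 0.0) | (d > 5.0)
    d[invalid] = np.nan
    return d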

Also from depth images captured during detection, differential images are computed by subtracting their depth information from the empty scene training image, and considering only depth differences in the range of 3 cm to 9 cm. During detection, this processing is restricted to the ROI determined from the training data.

Figure 5: Differential depth image for the straight foot state from training, identified foot cluster painted black.

Figure 6: The region of interest (ROI) for the virtual foot switch containing the three foot clusters (displayed using additive color).

To be able to classify depth images captured during detection, an error function to compute the distance of the current state to each of the three training states containing a foot is used. Within the ROI, for each scan line, the horizontal coordinate x_{y,i} with the maximum depth difference (within the range of 3 cm to 9 cm) is located, where y is one of the scan lines in the ROI, and i ∈ {current, straight, left, right}. The error function that we use is

e_i = \sum_y \left( x_{y,\mathrm{current}} - x_{y,i} \right)^2    (6)

For some combinations of y and i, x_{y,i} is not defined. These combinations do not contribute to the error summation.

Figure 7: For each foot cluster from Figure 6 and for each scan line, the horizontal coordinate with the maximum depth difference between the training image with the foot and the empty scene is shown.

Figure 7 shows these horizontal coordinates for the example foot clusters of Figure 6. The training state with the lowest squared error sum e_i is then considered recognized. If, however, the lowest error is zero, then the empty scene is recognized. If the left or right foot state is recognized for seven consecutive frames (i.e. for a quarter of a second), then the switch performs the configured action. The straight foot state and the empty scene do not trigger actions.

The virtual foot switch is initialized with a list of actions to perform. When the foot is recognized in the right or left position reliably, then the software triggers the next or previous action from this list, respectively. Actions are system commands and executed in a shell. For the concert, the actions influence the trajectory generator by sending appropriate OSC messages.

4 Application in hearing research
The scene generation toolbox described in the previous sections is planned to be used also in hearing research and hearing aid development, after some modifications and extensions. Here, the task is to generate dynamic acoustic scenes which are fully reproducible and controllable. Dynamic acoustic scenes can be presented to real listeners, who - within a certain area - can move freely. Hearing aid algorithms can directly be evaluated.³

³ For the evaluation of hearing aid algorithms in Ambisonics systems, the effect of the limitations of Ambisonics systems on the respective algorithms (e.g., spatial aliasing, unmatched wave fronts, high frequency localization) has to be carefully considered.
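Returning to the foot-state detection of Section 3.6, the decision rule built around Equation (6) can be sketched compactly. The function below is an illustration in Python/NumPy under assumed data structures (a differential depth image already restricted to the ROI and clipped to the 3-9 cm range, and one array of per-scan-line training columns per foot state); it is not the concert implementation.

import numpy as np

def classify_foot_state(diff_current, training_columns):
    """Pick the trained foot state with the lowest squared error sum of Eq. (6)."""
    # for each scan line y of the ROI, the column with the largest depth difference;
    # scan lines without any difference in range stay undefined (NaN)
    x_current = np.where(diff_current.max(axis=1) > 0,
                         diff_current.argmax(axis=1), np.nan)
    errors = {}
    for state, x_train in training_columns.items():          # "straight", "left", "right"
        valid = ~np.isnan(x_current) & ~np.isnan(x_train)    # undefined pairs do not contribute
        errors[state] = float(np.sum((x_current[valid] - x_train[valid]) ** 2))
    best = min(errors, key=errors.get)
    # as described above, a lowest error of zero is taken to mean the empty scene
    return "free" if errors[best] == 0.0 else best

A recognized "left" or "right" result would then still have to persist for seven consecutive frames before the switch triggers its action.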

In hearing aid research, discrete loudspeaker setups are commonly used for spatial evaluation of algorithms. However, discrete setups are unable to reproduce acoustic scenes with moving sources. Vector based amplitude panning (VBAP), as a simple extension of discrete speaker setups, has the disadvantage of creating phantom sources when the virtual source is between two speakers and real sources when the virtual source direction is matched by a speaker. Phantom sources are a general problem for directional filter algorithms. Higher order Ambisonics offers the advantage that moving sources can be created with a steady image. A detailed evaluation of applicability for hearing aid algorithm testing is the subject of an ongoing study.

5 Discussion
Low-latency real-time spatial processing of acoustic instruments is a field with many gaps. Although much effort has been made to find solutions to most problems, many open questions still remain. One major flaw is the coding of distances: while distance coding by changing the amount of reverberation may work in low-reverberant rooms like the monitor room used during development, the effect will break down in more reverberant rooms, like most performance spaces. A potential solution might be to simulate the Doppler effect in order to at least create an image of distance changes, at the cost of additional delay. Also spectral changes which are observed at large distance differences may help in the distance perception.

Pseudo-elevation was used in the concert to create the image of sources without any distinct direction. Mixing a source to the zeroth order Ambisonics channel results in an identical signal from all loudspeakers. The effect of no distinct direction, however, can only be perceived right in the middle of the ring. At any other listening position, the precedence effect will cause the nearest loudspeaker to be perceived as the dominant direction. The main difference to correct panning is that the perceived direction will differ for all listeners. A solution might be to use a full 3-dimensional Ambisonics system with at least first order in the vertical direction, or to find alternative panning strategies.

6 Conclusions
A set of software tools for application in a concert performance with spatial processing of acoustic instruments has been described in this paper. The technical components of the performance were completely built on top of Linux Audio. All major parts are implemented as open source applications.

The authors believe that spatial processing can substantially contribute to a concert experience with Early and contemporary music. This belief was supported by statements of many concert listeners – "the experience of musicians moving behind your back sends a shiver down your spine", "I felt like being in a huge bell, right in the heart of the music". However, this software toolbox still leaves space for further development, for improvement of the concert experience and for application in hearing research.

7 Acknowledgements
Our thanks go to the members of the ensemble ORLANDOviols, who all shared the enthusiasm in creating this concert project. Special thanks also go to Fons Adriaensen for his great Linux audio tools and for helpful advice on Ambisonics. This study was partly funded by the German research funding organization Deutsche Forschungsgemeinschaft (DFG) in the Research Unit 1732 "Individualisierte Hörakustik" and by "klangpol – Neue Musik im Nordwesten" (Contemporary Music in north-west Germany).

References
Fons Adriaensen. 2011. Linux audio projects at Kokkini Zita. http://kokkinizita.linuxaudio.org/.

Joshua Blake. 2011. OpenKinect. http://openkinect.org.

Marina Bosi. 1990. An interactive real-time system for the control of sound localization. Computer Music Journal, 14(4):59–64.

Paul Davis. 2011. Ardour. http://ardour.org/.

Giso Grimm, Tobias Herzke, Daniel Berg, and Volker Hohmann. 2006. The Master Hearing Aid – a PC-based platform for algorithm development and evaluation. Acustica · acta acustica, 92:618–628.

Giso Grimm, Volker Hohmann, and Stephan Ewert. 2011. Object fusion and localization dominance in real-time spatial processing of acoustic sources using higher order ambisonics. In György Wersényi and David Worrall, editors, Proceedings of the 17th Annual Conference on Auditory Display (ICAD), Budapest, Hungary. OPAKFI Egyesület. ISBN 978-963-8241-72-6.

Giso Grimm. 2012. Harmony of the spheres - cosmology and number aesthetics in 16th and 20th century music. Musica Antiqua, 1(1):18–22, Jan-Mar.

Kinect. 2010. http://microsoft.com/Presspass/press/2010/mar10/03-31PrimeSensePR.mspx.

Stephane Letz and Paul Davis. 2011. Jack audio connection kit. http://jackaudio.org/.

Stéphane Magnenat. 2011. http://openkinect.org/wiki/Imaging_Information, version from June 7 2011. The wiki page attributes this formula to Stéphane Magnenat.

Dominic Sacré. 2010. Mididings. http://das.nasophon.de/mididings/.

Christopher W. Sheeline. 1982. An investigation of the effects of direct and reverberant signal interactions on auditory distance perception. Ph.D. thesis, CCRMA, Department of Music, Stanford University, Stanford, California.

Matthew Wright. 2002. Open sound control 1.0 specification. http://opensoundcontrol.org/spec-1_0.

76 A Toolkit for the Design of Ambisonic Decoders

Aaron J. HELLER Eric M. BENJAMIN Richard LEE AI Center, SRI International Surround Research Pandit Litoral Menlo Park, CA, US Pacifica, CA, US Cooktown, QLD, AU [email protected] [email protected] [email protected]

Abstract arrays [Heller et al., 2010]. In this paper, we ex- Implementation of Ambisonic reproduction systems is tend that work to full periphonic (3-D) arrays and limited by the number and placement of the loudspeak- higher-order Ambisonics (HOA). The techniques ers. In practice, real-world systems tend to have in- are implemented as a MATLAB [2011] and GNU sufficient loudspeaker coverage above and below the Octave [Gnu, 2011] toolkit that makes use of the listening position. Because the localization experi- NLOpt library [Johnson, 2011] to perform the op- enced by the listener is a nonlinear function of the timization. loudspeaker signals it is difficult to derive suitable de- We use the term decoder to mean the configu- coders analytically. As an alternative, it is possible to ration for a decoding engine that does the actual derive decoders via a search process in which analytic estimators of the localization quality are evaluated at signal processing. Examples are Ambdec [Adri- each search position. We discuss the issues involved aensen, 2011] that operates in real time, as well and describe a set of tools for generating optimized as an offline decoder we have implemented as part decoder solutions for irregular loudspeaker arrays and of this toolkit.1 demonstrate those tools with practical examples. In this paper, we use bold roman type to denote vectors, italic type to denote scalars, and sans serif Keywords type to denote signals. A scalar with the same Ambisonic decoder, HOA, periphonic, nonlinear opti- name as a vector denotes the magnitude of the mization vector. A vector with a circumflex (“hat”) is a unit vector, so, for example, ˆrE = rE/rE. 1 Introduction We start with a discussion of the design process and the tradeoffs involved, then the specifics of the Ambisonics is a versatile surround sound record- optimization process, and finally results of two ar- ing and reproduction system. One of the attrac- rays, a third-order decoder for the 22-loudspeaker tions is that the transmission format is indepen- CCRMA array, and a second-order decoder for the dent of the loudspeaker layout. However, this 12-loudspeaker 30 tri-rectangle array. means that each playback system needs a custom ◦ decoder that is matched to the loudspeaker ar- 2 Designing Ambisonic Decoders ray. The decoder creates the loudspeaker signals from the transmission signals. Ambisonics theory Ambisonics represents a sound field with a group provides simple encapsulations of high- and low- of signals that are proportional to spherical har- frequency auditory localization that can be used monics. The original Ambisonic systems were first to design decoders, as well as theorems that ease order only, but more recently, higher-order sys- the design of decoders for regular polygonal and 1Another important function of an Ambisonic decoder is polyhedral loudspeaker arrays. to provide near-field compensation. This compensates for In earlier papers, the present authors have dis- the curvature of the wavefronts due to the finite distance cussed the design and testing of first-order de- to the loudspeaker and is strictly a function of distance of the speaker from the center of the array and the order coders for regular horizontal loudspeaker layouts of reproduction. Ambdec and the offline decoder in this [Heller et al., 2008] as well as the use of nonlin- toolkit provide such filters and they will not be discussed ear optimization to design decoders for ITU 5.1 further in this paper.

77 tems have been implemented. In first-order Am- to create realistic atmosphere even if more precise bisonics the zeroth-order component represents methods like HOA are used for special sounds. the sound pressure, and the three first-order com- To preserve this advantage requires the preser- ponents represent the acoustic particle velocity. vation of a good facsimile of the diffuse field. En- If these components are reproduced exactly, then ergy gain that varies with direction and “bunch- the sound will be correct at the center. How- ing” of directions, particularly in the horizontal ever, it is not possible to get the first-order com- plane, are all detrimental, as is “speaker detent” ponents correct except at a single point and not where individual loudspeakers draw attention to practical to get them correct at higher frequen- themselves. cies, where the wavelengths become smaller than 2.1 Auditory Localization the size of the human head. The task of the de- coder is to create the best perceptual impression Due to the range of wavelengths involved, the that the soundfield is being reproduced accurately human auditory localization mechanism utilizes given the loudspeaker array being used. different directional cues over different frequency In practical terms, the following are necessary: regimes. At low frequencies, localization depends on the detection of Interaural Time Differences Constant amplitude gain for all source direc- (ITDs), but at high frequencies there is an am- • tions biguity because a human head is multiple wave- lengths across above about 1 kHz. For this rea- Constant energy gain for all source directions • son, localization switches abruptly, depending on At low frequencies, correct reproduced wave- Interaural Level Differences (ILDs) above that fre- • front direction and velocity quency. One way to predict localization would be to use Head Related Transfer Functions (HRTFs) At high frequencies, maximum concentration to calculate the actual ear signals of a listener, but • of energy in the source direction this turns out to be computationally difficult and Matching high- and low-frequency perceived would vary from listener to listener. • directions Gerzon developed a series of metrics for pre- dicting localization that are simpler than using These criteria may, themselves, have different the HRTFs [Gerzon, 1992]. The simplest of these interpretation or importance depending on the metrics are the velocity localization vector, r , source material and the intended use. We can V and the energy localization vector, rE. The direc- identify three distinct types of program: tion of each indicates the direction of the expected Natural recordings made with a first-order localization perception, while the magnitude indi- • soundfield microphone. cates the quality of the localization. In natural hearing from a single source, the magnitude of Natural recordings made with higher order each vector should be exactly 1, and the direc- • microphones. As of this writing, such mi- tion of the vectors is the direction to the source. crophones are just becoming available com- It should be noted that, while rV is proportional mercially, but practical constraints will mean to the physical quantity of the acoustic particle that these are still first order at lower fre- 2 velocity, rE is an abstract construct. quencies. Following Gerzon [1992], the pressure (ampli- Artificial recordings. 
First order as well as tude gain), P , and total energy gain, E, are • Higher Order Ambisonic (HOA) program ma- n terial. P = Gi (1) The first case is Ambisonic’s greatest strength. ∑i=1 Good first-order Ambisonic reproduction is prob- 2Note that these metrics are not specific to Ambisonics; ably the closest to recreating a virtual sound envi- they can be used to predict the quality of the phantom images produced by any multispeaker reproduction system, ronment, whether the buzz of a busy Asian mar- regardless of the panning laws used, including plain old ketplace or the sound of a concert in a good hall two-channel stereo. Gerzon shows this for several well- in your living room. It will most likely be used known stereo phenomena [Gerzon, 1992].

E = \sum_{i=1}^{n} G_i G_i^*    (2)

The magnitude and direction of the velocity vector, rV and \hat{r}_V, at the center of an array with n loudspeakers is

r_V \hat{r}_V = \frac{1}{P} \operatorname{Re} \sum_{i=1}^{n} G_i \hat{u}_i    (3)

whereas the magnitude and direction of the energy vector, rE and \hat{r}_E, are computed by

r_E \hat{r}_E = \frac{1}{E} \sum_{i=1}^{n} (G_i G_i^*) \, \hat{u}_i    (4)

where the G_i are the (possibly complex) gains from the source to the i-th loudspeaker, \hat{u}_i is a unit vector in the direction of the loudspeaker, and G_i^* is the complex conjugate of G_i.

Figure 1: Maximum average rE depending on order and type. "matching" and "max rE" refer to the decoder matrices described in Sections 2.2 and 2.3, respectively.

The velocity vector points in the same direction and is proportional to the acoustic particle velocity. It has been shown that the velocity vector predicts the ITDs very accurately [Benjamin et al., 2010]. The energy vector predicts the ILDs, but in practice it is not possible to get rE = 1 unless the sound is coming from just one loudspeaker. This is representative of a pervasive problem in multichannel sound reproduction. The maximum average value of rE that can be obtained for a given Ambisonic order is shown in Figure 1. The formulas to compute these are given in Appendix A.

Because different sets of gains are needed to satisfy the low- and high-frequency models, many ambisonic decoders split the audio into two bands, apply different decoder matrices, and then recombine to produce the loudspeaker signals.³ Daniel has suggested that a three-band decoder may provide better reproduction under some listening conditions [Daniel, 2001]. This remains an open question at this time.

2.2 Computing the Low-Frequency Matrix
The low-frequency matrix provides gains from each channel of the ambisonic program material to each loudspeaker that are needed to optimize localization as predicted by the velocity localization vector, rV. Numerous authors have provided derivations of the low-frequency solution for a given loudspeaker array, and thus a number of different terms are used to refer to it, including "velocity", "matching", "basic", "exact", "mode matching", "re-encoding" and so forth.

In practice, these reduce to projecting (or encoding) the loudspeaker directions onto the selected spherical harmonic basis set,⁴ assembling these vectors into an array, and computing the Moore-Penrose pseudoinverse of the array [Weisstein, 2008]. Examples of this can be found in Appendix A of [Heller et al., 2008]. In general, there are an infinite number of solutions and this procedure provides the solution with the minimum L2-norm (i.e., the least-squares fit), which has the desirable property of requiring the minimum total radiated power.⁵

³ This places certain constraints on the phase response of the band splitting filters. We discuss the design and implementation of suitable filters in Appendix B of [Heller et al., 2008] and note that the filters in Ambdec meet these requirements.
⁴ The toolkit is neutral as to the conventions for component ordering and normalization. These conventions are encapsulated in a single function. The current implementation supports the Furse-Malham set [Malham, 2003], but others can be added easily.
⁵ Recently, some authors, drawing upon compressive sensing theory, have suggested that the L1-norm may be more suitable [Wabnitz et al., 2011; Zotter et al., 2012]. The L1-norm minimizes the sum of absolute errors. Compared to least-squares, it allows larger maximum errors in exchange for more zero errors.
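As a concrete illustration of Equations (1)-(4), the localization predictors require only a few lines of code. The sketch below is Python/NumPy written for this text (the toolkit itself is MATLAB/Octave code); G and U are assumed inputs: the per-loudspeaker gains for one source direction and the loudspeaker unit vectors.

import numpy as np

def localization_vectors(G, U):
    """Velocity and energy localization vectors for one source direction.

    G: (possibly complex) gains from the source to the n loudspeakers, shape (n,)
    U: unit vectors towards the loudspeakers, shape (n, 3)
    Returns (rV_vec, rE_vec); their norms are rV and rE, and their directions
    are the predicted low- and high-frequency localization directions.
    """
    G = np.asarray(G, dtype=complex)
    U = np.asarray(U, dtype=float)
    P = np.sum(G).real                  # Eq. (1): pressure (amplitude) gain
    E = np.sum(np.abs(G) ** 2)          # Eq. (2): energy gain
    rV_vec = np.real(G @ U) / P         # Eq. (3)
    rE_vec = (np.abs(G) ** 2) @ U / E   # Eq. (4)
    return rV_vec, rE_vec

For a regular array driven by the matching (low-frequency) matrix this reproduces the behaviour discussed in Section 2.5: the velocity vector has magnitude 1 and points in the source direction, while the magnitude of the energy vector falls short of 1.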

79 Except in the case of degenerate configurations, teria and due to the nonlinear nature of the crite- where all the loudspeakers lie in the null of one or ria, numerical optimization is needed to compute more of the spherical harmonics, this procedure the solutions, which will be discussed in Section will result in a decoder matrix that satisfies the 3. low-frequency localization criteria exactly; how- 2.4 Merging the LF and HF Matrix ever, it may utilize a great deal of power to get them correct in directions where there is a large The existence of different optimum decoder coef- angular separation between the loudspeakers in ficients for optimum rV and rE would typically mean having to make a choice or compromise be- the array. This will result in low rE values in those directions. As we shall see, except in the tween the two. In this case, however, both can be case of regular polyhedra and polygons, it is im- had. The decoder that optimizes rV can be used possible to fully satisfy all the ambisonic criteria at low frequencies and the decoder that maximizes simultaneously. This implies that while the am- rE can be used at high frequencies, by the simple bisonic transmission format is independent of the expedient of using filters to cross over between the loudspeaker array, not all loudspeaker array ge- two. This is typically done at around 400 Hz. ometries perform equally well. Because the higher-order components are re- duced in order to maximize rE, this causes a 2.3 Computing the High-Frequency reduction in the total signal level of the high- Matrix frequency decoder outputs, and thus a reduction in the high frequencies heard by the listener. The The high-frequency matrix provides gains from gains that maximize rE specify the relationship each channel of the ambisonic program material among the signals of different order, but not how to each loudspeakers that are needed to optimize that gain should be apportioned between high- localization as predicted by the energy localization frequency cuts and low-frequency boosts. There vector, rE. Gerzon proved two theorems for first- are three possibilities: order reproduction, the polygonal decoder theo- rem and the diametric decoder theorem. They 1) Preservation of the amplitude. That is, sim- state that in an array with a minimum of four ply use the gains produced by the optimizer or loudspeakers for 2-D and six speakers for 3-D, those given in Tables 1 and 2. where the loudspeakers are spaced in equal angles 2) Preservation of the root-mean-square (RMS) or in diametrically opposed pairs, rE is guaran- level. This is what Gerzon [1980] suggests teed to point in the same direction as rV. The and is what is implemented in older analog de- polygonal decoder theorem also holds for higher- coders. order Ambisonic reproduction, provided there is an adequate number of loudspeakers in the ar- 3) Conservation of the total energy. Daniel [2001] ray to support the desired order. This simplifies suggests this, and the configuration files in- the task of designing the high-frequency matrix to cluded with Ambdec follow this recommenda- that of selecting the gain for each order such that tion. This method results in more high fre- quencies with more speakers. the overall magnitude of rE is maximized. For first-order decoders, Gerzon provided the values The calculations involved are given in Appendix √2 √3 of 2 for horizontal arrays and 3 for periphonic B. In listening tests, we have found that preserva- arrays. 
Daniel derived general formulas for these tion of the RMS level works well for small arrays. gains [Daniel, 2001], which are given in Tables 1 We have also found that using the conservation and 2. (See Appendix A for programs that com- of energy approach on large 3-D arrays results in pute the values in these tables.) overemphasizing high frequencies and near-head As we will see in the example in Section 2.5, imaging artifacts and nulls. In practice, we the set once the array deviates from having equal angles, the LF/HF balance by ear, comparing the balance there is no longer a guarantee that rE and rV of the two-band decoder to that of a single-band point in the same direction or that there is a single rE-max decoder. More work is needed to find a set of gains that maximize rE in every direction. procedure for this that does not involve tuning by Because of this, we must trade off the various cri- ear.
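The procedures of Sections 2.2 and 2.3 can be made concrete with a small sketch. The Python/NumPy fragment below is an illustration only (the toolkit is MATLAB/Octave and handles arbitrary orders, component orderings and normalizations; here everything is restricted to first order with a plain [W, X, Y, Z] = [1, x, y, z] encoding): it builds the low-frequency matrix as the Moore-Penrose pseudoinverse of the re-encoding matrix, then derives a high-frequency matrix by applying the per-order max-rE gain from Table 2.

import numpy as np

def basic_first_order_decoder(speaker_dirs):
    """Low-frequency ('matching') decoder via the pseudoinverse (Section 2.2)."""
    U = np.asarray(speaker_dirs, dtype=float)      # (n, 3) loudspeaker unit vectors
    # encode each loudspeaker direction on the first-order basis [W, X, Y, Z]
    K = np.column_stack([np.ones(len(U)), U]).T    # (4, n) re-encoding matrix
    return np.linalg.pinv(K)                       # (n, 4) least-squares decoder matrix

def max_re_first_order(M_lf):
    """High-frequency matrix: scale each order by its max-rE gain (Section 2.3).
    0.57735 is the first-order periphonic value from Table 2."""
    return M_lf * np.array([1.0, 0.57735, 0.57735, 0.57735])

Loudspeaker feeds are then obtained by multiplying the ambisonic signal vector [W, X, Y, Z] by one matrix below and by the other above the crossover described in Section 2.4.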

Order   Max rE      Gains
1       0.707107    1, 0.707107
2       0.866025    1, 0.866025, 0.5
3       0.92388     1, 0.92388, 0.707107, 0.382683
4       0.951057    1, 0.951057, 0.809017, 0.587785, 0.309017
5       0.965926    1, 0.965926, 0.866025, 0.707107, 0.5, 0.258819

Table 1: Per-order gains for max-rE decoding with 2-D regular polygonal arrays.

Order   Max rE      Gains
1       0.57735     1, 0.57735
2       0.774597    1, 0.774597, 0.4
3       0.861136    1, 0.861136, 0.612334, 0.304747
4       0.90618     1, 0.90618, 0.731743, 0.501031, 0.245735
5       0.93247     1, 0.93247, 0.804249, 0.62825, 0.422005, 0.205712

Table 2: Per-order gains for max-rE decoding with periphonic regular polyhedral arrays.

2.5 Selection of a speaker array

LF 1 1 1 0 W Due to symmetry, regular loudspeaker arrays have RF √2 1 1 1 0 X the advantage of uniformity in the localization = − (5) RR 4 1 1 1 0  Y  predictors rV and rE. As noted above, practi- − − LR 1 1 1 0  Z  cal difficulties usually prevent the attainment of a    −    completely regular array. There will be a tendency This gives exact recovery of the pressure and for rE to be greater in the directions where the an- velocity at the center of the array: rV = 1 and gular density of loudspeakers is greater, and less points in the intended direction.| But| because in the directions where there are few loudspeak- the angular separation of the loudspeakers is 90◦, ers and rE will tend to point in the directions of 2 rE = 3 . However, if we investigate what happens concentrations of loudspeakers. as the ratio of pressure (W) and velocity (X, Y It should be noted at this point that it is im- and Z) is varied, it develops that rE is maximum for the case where the first-order components are possible to get rE to be larger in the direction be- √2 tween loudspeakers than the value achieved sim- reduced to 2 of their original value. This gives ply by driving the loudspeakers nearest to the gap √2 a magnitude of rE of 2 . equally. This means that the best that can be If the square is replaced with a rectangle with achieved by an Ambisonic decoder is to have a an aspect ratio of √3 : 1, the front and rear loud- smooth transition between areas where the perfor- speakers now subtend an angle of 60◦ and the side mance is good (large number of loudspeakers, high loudspeakers subtend an angle of 120◦. This re- magnitude of r and r points in the intended di- E E duces rE at the sides but increases it in the front, rection) and areas where the performance is less relative to a square. If the same gain as derived for good (fewer loudspeakers, r has small magnitude E the square ( √2 for the first-order components) is and points in an incorrect direction). As such, we 2 applied, then there is a substantial improvement must be careful in choosing the decoder parame- in r to the sides, and a very tiny decrease in r ters so that the performance in the good directions E E in the front. This is shown in Figure 2. is good enough, and the performance in the poor But is this the “optimum”? Figure 3 shows that directions is not too bad. if we further vary the ratio of the zero- and first- A simple example of a four-speaker array will order components it develops that rE, evaluated illustrate these difficulties. A square horizontal at the sides, is a maximum at a different ratio. array has a basic decoder solution of It is thus possible to maximize rE in front or at

81 For the previous example of a rectangular array with a √3 : 1 aspect ratio, the ratio of the energy in the forward direction to the energy at the sides was calculated and is also plotted in Figure 3. rV obtains its correct value of 1 in all directions and the pressure response is also omnidirectional. At higher frequencies, where the “max-rE” decoder is in effect, rE is maximized for front and back direc- tions (where the speakers are closer together). At these frequencies, sounds from the sides (perceived as “energy”) are 2.4 dB louder. This is a pervasive problem for irregular arrays and will be addressed in greater detail below. The simple √3 : 1 rectan- gle as above is used widely and is known to give good results. On the other hand, listening tests in- Figure 2: Locus of rE for a rectangular array, dicate that the 5.5 dB energy imbalance exhibited matching and “max rE” decoders. by some first-order decoders for ITU 5.1 arrays is too large. From this, we propose 3 dB varia- tion in “energy” with horizontal direction as the maximum imbalance acceptable. 2.5.1 Discussion of compromises of speaker arrays The selection of a loudspeaker array for Ambisonic reproduction is subject to a number of compro- mises, notably the space available to house the array and the budget for purchasing loudspeakers. It may be that the array already exists, in which case the decoder design task is one of selecting a decoder design that provides the best audible performance. In other situations, however, the design of the array has not been fixed although the number of loudspeakers may have been. In that case, there is substantial latitude to trade off Figure 3: r and the energy, E, as a function of E between high-order performance horizontally and the ratio of the first-order to zero-order scaling. periphonic performance. the sides, but not both at once. One might wish 3 Optimizer to use a different decoder depending on whether As noted in Section 2.5, in an irregular array, sim- sound images are expected at the front, or on the ply scaling the LF and HF matrices does not re- sides, or a decoder that gives a compromise be- sult in rV and rE pointing in the same direction; tween the two. hence, the design procedure becomes somewhat Thus far, only the quality and direction of the more complex. localization have been discussed. There is also an Because the key psychoacoustic criteria for good effect on the loudness of the sound, depending on decoder performance are nonlinear functions of direction. If the loudspeaker array is irregular, the speaker signals, we utilize numerical optimiza- then the solution to recover pressure and velocity tion techniques. To do this, a single objective results in an increase in energy in the directions function is formulated that takes as input the where the angular spacing is greatest. This results decoder matrix and produces a single figure of in an increase in reproduced loudness in those di- merit that decreases as the decoder performance rections. improves. The nonlinear optimization algorithm

82 will then try different sets of matrix elements, at- 3.1 Optimization Criteria tempting to arrive at the lowest value possible. For each test direction, the following are com- Because there are a number of criteria, we use puted: amplitude gain, P , energy gain, E, the the weighted sum to provide an overall figure of velocity localization vector, rV, and the energy merit. A user can adjust the weights to set the localization vector, rE. From these, we compute relative importance of the different criteria, say the pairwise angles between the test direction, ˆrV, uniform energy gain (loudness) versus angular ac- and ˆrE. These are summarized with the following curacy. In addition, each test direction can have figures of merit: deviation of amplitude gain from its own set of weights, so that, for example, an- 1 along the x-axis, minimum, maximum, and RMS gular accuracy can be emphasized for the front, values of amplitude gain, energy gain, magnitude while uniform energy gain is emphasized in other of rV, magnitude of rE, and the pairwise angular directions. This might be the preferred configura- deviations. tion for classical music recorded in a reverberant It is important that the criteria are “well be- performance hall. On the other hand, environ- haved” near zero, so as not to trigger oscillating mental recordings made outdoors have very lit- behavior in the optimizer. They should be contin- tle diffuse content, so overall angular accuracy is uous and have first derivatives. In practical terms, more important. Another application of direction absolute value and thresholds should not be used; weightings is in highly asymmetrical arrays, such squaring can be used for the former and the expo- as a dome, where few speakers are below the lis- nential function for the latter cases. tener. In this case, we expect poor performance in Finally, directional weightings are applied to those directions, so they are deemphasized when each criteria and then an overall weighted sum computing the objective function. produces the single figure of merit for that partic- ular configuration. We have employed the NLOpt library for non- 3.2 Test Directions linear optimization [Johnson, 2011]. NLOpt pro- vides a common application programming inter- As mentioned in the previous section, each candi- face (API) for a collection of nonlinear optimiza- date set of parameters is evaluated from a number tion techniques. In particular, it supports a num- of directions. For 2-D speaker arrays, 180 or 360 ber of “derivative free” optimization algorithms, evenly spaced directions are often used [Wiggins which are well suited to the current application et al., 2003; Moore and Wakefield, 2008]. For 3-D where the objective function is the result of a com- arrays, the situation is more complex because no putation, rather than an analytic function. more than 20 points can be distributed uniformly on a sphere (a dodecahedron). Lebedev-Laikov quadrature defines sets of An earlier version of the optimizer that was lim- points on the unit sphere and weights with the ited to first-order horizontal arrays was written property that they provide exact results for inte- in C++ [Heller et al., 2010]. To extend that to gration of the spherical harmonics [Lebedev and higher-order and periphonic arrays required a sig- Laikov, 1999]. The current implementation pro- nificant rewrite, so an initial prototype was writ- vides a function that returns Lebedev grids of ten in MATLAB, with plans to recode in C++. 
points and corresponding weights for as many as Because the bulk of the computation is matrix 5810 directions. Our current experiments have multiplication, which is handled by highly opti- used a grid with 2702 points, which corresponds mized code in MATLAB, it turned out that the roughly to 3◦. The toolkit also has functions pro- execution speed was almost as fast as the orig- viding 2-D and 3-D grids that are sampled in uni- inal C++ version, so we abandoned plans for form azimuth and elevation increments, which are the rewrite. To make the code widely usable, useful for visualization of the results. it was kept compatible with GNU Octave. The key change is that GNU Octave does not support 3.3 Optimization Behavior nested functions, so a number of variables need to As part of the optimization setup, the user sup- be declared global to make them accessible to the plies a set of stopping criteria. This can be spec- objective function. ified as a threshold on the absolute and relative
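To make the optimizer structure of Section 3 concrete, here is a deliberately tiny example of handing a scalar figure of merit to a derivative-free NLopt algorithm. It is written in Python with the nlopt bindings purely for illustration (the toolkit itself is MATLAB/Octave code calling the same library), and it optimizes only the two per-order gains of a simplified first-order horizontal decode over a rectangle of loudspeakers; the decode expression, the speaker layout and the single criterion (mean rE) are all assumptions, far simpler than the weighted multi-criteria objective described above.

import numpy as np
import nlopt

speakers = np.radians([30.0, 150.0, 210.0, 330.0])          # a wide rectangle (assumed layout)
U = np.column_stack([np.cos(speakers), np.sin(speakers)])   # (4, 2) loudspeaker unit vectors
test_az = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)

def neg_mean_rE(x, grad):          # grad is unused: derivative-free algorithm
    g0, g1 = x                     # per-order gains; only their ratio affects rE
    rE = []
    for az in test_az:
        # simplified first-order decode of a plane wave with per-order gains applied
        G = 0.25 * (g0 + 2.0 * g1 * (np.cos(az) * U[:, 0] + np.sin(az) * U[:, 1]))
        rE.append(np.linalg.norm((G ** 2) @ U) / np.sum(G ** 2))
    return -float(np.mean(rE))     # NLopt minimizes, so return the negative

opt = nlopt.opt(nlopt.LN_SBPLX, 2)          # a local, derivative-free algorithm
opt.set_min_objective(neg_mean_rE)
opt.set_lower_bounds([0.1, 0.1])
opt.set_upper_bounds([2.0, 2.0])
opt.set_xtol_rel(1e-7)
g_opt = opt.optimize([1.0, 1.0])            # start from the plain matching decode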

83 changes in the parameters and/or the objective the upper and lower loudspeakers at 30◦ with function, as well as a maximum running time and respect to horizontal. ± maximum number of iterations. The default val- 7 4.1 CCRMA Listening Room ues in the current implementation are 1 10− for the parameters and objective function. × The described software was applied to deriving de- For small 2-D arrays (say, 12 to 24 parameters), coders for the Listening Room at CCRMA (Center the optimizer typically converges in less than 1 for Computer Research in Music and Acoustics) minute, examining 40,000 to 1,500,000 configura- at Stanford. This facility consists of 22 identical tions. For large high-order arrays (say, 200 to loudspeakers arranged in five rings. There is a 400+ parameters), it typically converges in less horizontal ring of eight loudspeakers, two rings of than an hour. These timings were done with Oc- six loudspeakers, one 50◦ below and one 40◦ above tave version 3.2.4-atlas on a 2.66 GHz Intel Core i7 horizontal, and one loudspeaker directly above with 8 GB of memory. The bulk of the computa- and one directly below the listening position. The tion comprises matrix multiplications and is there- two hexagonal rings are thus not exactly horizon- fore suitable for parallel implementations. The tally opposed. A schematic of the array is shown timings in MATLAB were approximately 2x faster in Figure 4a. than Octave since it can make use of the multiple An initial solution was derived by calculating cores in the i7 processor. the pseudoinverse of the loudspeaker projection With large optimization problems, using a lo- matrix as described above. The decoder was mod- cal optimization algorithm and providing an ini- ified to optimize the magnitude of rE at high fre- tial solution that is near to the optimum is im- quencies by applying the weighting factors given portant for reliable convergence. The toolkit cur- in Table 2 to the gains of the signal components of rently supports three strategies: each order. Given that the theoretical maximum average value for rE can be no greater than 0.866 Using the low-frequency solution modified at third order, the average value of 0.850 for the • with the per-order gains that would provide third order decoder given here does not leave a the max-rE solution for a uniform array. great deal of margin for improvement. Nonethe- less, the optimization software was applied to the “Musil Design” where additional “virtual” • problem. Figure 5 shows the performance of the loudspeakers are inserted into the array to initial solution and the optimized result. Aver- make the spacing more uniform, and hence age rE was increased slightly, and maximum di- more suited to a pseudo-inverse solution. Af- rectional error reduced by a factor of 5. ter the optimization is complete, the signals An informal listening test comparing this de- for the virtual speakers are either ignored or coder to the existing one was conducted using distributed to the adjacent speakers [Zotter third-order test signals and studio recordings, as et al., 2010]. well as first-order acoustic recordings. The general A hierarchical approach, decomposing the op- impression was that the new decoder did a better • timization by establishing a solution for each job of keeping horizontal sources in the horizontal order consecutively, freezing the individual plane, whereas the existing decoder rendered such coefficients for orders below the current one, sources above the horizontal plane. 
but allowing an overall adjustment on the 4.2 The 30 Tri-Rectangle gain of the lower orders. ◦ As discussed elsewhere, a dodecahedron or other 4 Examples large regular array is difficult to fit into nor- mally dimensioned spaces. One large array that The software tools described above were applied does fit into normal spaces is the so-called tri- to the derivation of decoders for several real- rectangle, patterned after a suggestion by Gerzon. world systems. The examples given here are the 6 A schematic is shown in Figure 4b. It consists of CCRMA listening room and a tri-rectangle with three interlocking rectangles of loudspeakers, one 6See https://ccrma.stanford.edu/room-guides/ in the horizontal plane, one in the XZ plane, and listening-room/ one in the YZ plane. The projection of the loud-

84 (a) The 22-loudspeaker array at CCRMA (b) The 12-loudspeaker tri-rectangle.

Figure 4: Schematics of the loudspeaker arrays used in the examples.

(a) The initial solution calculated by pseudoinverse and (b) The optimized solution. max-rE gains.

Figure 5: rE in the vertical plane for the CCRMA array before and after optimization. The arrows show the directional error between the low- and high-frequency matrices. In this case, average rE was increased slightly, from 0.85 to 0.86 and the maximum directional error reduced by a factor of 5. speakers into any plane is an octagon, which hints Performance of the initial solution by inversion at its utility for reproducing second-order program is shown in Figure 6. The magnitude of rE is in material. However, to enable it to fit into typical red, with both the horizontal and vertical (in the spaces the vertical rectangles must be squashed to XZ plane) shown. The horizontal shape is essen- an approximate 30 vertical angle. This gives a tially circular, with perfect direction (not shown in ± ◦ solid angle of 120◦ above and below the listening the figure), but the magnitude of rE decreases dra- position with no loudspeakers. Naturally, this has matically for sources above or below about 30◦ a profound effect on the localization for sources of elevation. Furthermore, there is an increasing± above and below the listening position. error in the direction of rE indicating that high-

85 Figure 6: rE in the horizontal and vertical planes for the initial second-order decoder. Figure 7: 3-D plot of rE from an unconstrained optimization of the second-order decoder for the 30 tri-rectangle. frequency sounds will be perceived as coming from ◦ near the poles. The extreme errors in the direction of rE are compounded by the low values, making rately. This resulted in a slightly lower rE, but localization in the up and down directions vague significantly reduced angular error in the vertical in any case. plane. Figure 8b shows the optimized solution. The large angle subtended by the loudspeaker placement with respect to the vertical axis makes 5 Conclusions it impossible to get precise localization for sources An open source package for the design of am- directly above or below the listening position. It bisonic decoders has been presented. The soft- may, however, be possible to improve the localiza- ware allows the derivation of decoders for arbi- tion for sources near horizontal by correcting the trary loudspeaker arrays, 2-D or 3-D. The soft- direction of rE. ware operates under Octave or MATLAB, with Running the optimizer with this configuration the nonlinear optimization performed by the open as an initial solution resulted in a highly distorted source package NLOPT. Auditory localization at solution where the sounds are drawn strongly to middle and high frequencies is a nonlinear func- the loudspeakers. A 3-D plot of rE for this so- tion of the loudspeaker signals, which necessitates lution is shown in Figure 7. As can be seen, the the finding of solutions that work well for those performance is very non-uniform (a sphere would frequencies via an optimization process. be ideal) and the maximum angular error is over Two example systems were solved. The first 30◦. was a third-order decoder for the 22-loudspeaker Next, a Musil design was attempted. Virtual CCRMA listening room. That system is nearly loudspeakers were inserted into the array directly regular, and it was found that a solution obtained above and below the center. This was optimized by inversion of the loudspeaker matrix, with per- and then the signals for the virtual loudspeakers order gains, was nearly as good as one obtained by reassigned to the nearest real loudspeakers. This the nonlinear optimization process. Nonetheless, resulted in improved rE in the horizontal plane, the magnitude of rE was improved and the angle as well as elevations as high as 30 ; however, error was reduced. ± ◦ it suffers from directional errors as large as 31◦. The second system was a 12-loudspeaker tri- Figure 8a shows the optimized solution. rectangle, with the upper and lower loudspeakers Finally, a hierarchical design was attempted, at 30◦ above and below the horizontal plane. A where each subsequent order is optimized sepa- decoder derived for that system via the technique

86 (a) Musil design. (b) Hierarchical design.

Figure 8: rE in the vertical plane for second-order decoders for the 30◦ tri-rectangle. of inversion followed by per-order gains shows high Eric Benjamin, Richard Lee, and Aaron Heller. magnitudes of rE in the horizontal plane but low 2010. Why Ambisonics Does Work. In AES magnitudes in the polar regions and large errors 129th Convention, San Francisco, November. 3 in the direction of . rE Jerome Daniel. 2001. Représentation de champs Two additional methods were tried in a search acoustiques, application à la transmission et à la for a superior solution. The first was the Musil de- reproduction de scènes sonores complexes dans coder in which the array was filled out with virtual un contexte multimédia. Ph.D. thesis, Univer- loudspeakers at the poles and the signals for those sity of Paris. 3, 4 speakers are routed to the nearest real speakers. The second method was a hierarchical one in Michael A. Gerzon. 1980. Practical Periphony: which a solution for each order was established The Reproduction of Full-Sphere Sound. In consecutively, such that a higher-order decoder 65th Audio Engineering Society Convention is also optimum for lower-order program sources. Preprints, number 1571, London, February. This results in a very well behaved decoder, but AES E-lib http://www.aes.org/e-lib/browse. cfm?elib=3794. 4 with slightly lower values of rE. Michael A. Gerzon. 1992. General Metatheory 6 Acknowledgements of Auditory Localisation. In 92nd Audio Engi- The authors thank Fernando Lopez-Lezcano for neering Society Convention Preprints, number encouraging us to write this paper and for pos- 3306, Vienna, March. AES E-lib http://www. ing the problem of deriving a third-order decoder aes.org/e-lib/browse.cfm?elib=6827. 2 for the CCRMA listening room; Andrew Kim- 2011. Octave. http://www.octave.org/ (ac- pel and Elan Rosenman for stimulating discus- cessed July 1, 2011). 1 sion and access to their second-order periphonic and fifth-order horizontal loudspeaker arrays; and Aaron J. Heller, Richard Lee, and Eric M. Ben- Isaac Heller for the idea of using the exponential jamin. 2008. Is My Decoder Ambisonic? In function to implement soft thresholds. AES 125th Convention, number 7553. 1, 3 Aaron Heller, Eric Benjamin, and Richard Lee. References 2010. Design of Ambisonic Decoders for Irregu- Fons Adriaensen, 2011. AmbDec User Manual, lar Arrays of Loudspeakers by Non-Linear Opti- 0.4.3 edition. http://kokkinizita.linuxaudio. mization. In AES 129th Convention, San Fran- org/ambisonics/. 1 cisco. 1, 7

Steven G. Johnson. 2011. The nlopt nonlinear-optimization package. http://ab-initio.mit.edu/nlopt. 1, 7

V.I. Lebedev and D.N. Laikov. 1999. A Quadrature Formula for the Sphere of the 131st Algebraic Order of Accuracy. Doklady Mathematics, 59(3):477–481. 7

David G Malham. 2003. Space in music - music in space: Higher order ambisonic systems. Master's thesis, University of York. 3

MATLAB. 2011. version 7.13.0 (R2011b). The MathWorks Inc., Natick, Massachusetts. 1

David Moore and Jonathan Wakefield. 2008. The Design of Ambisonic Decoders for the ITU 5.1 Layout with Even Performance Characteristics. In 124th Audio Engineering Society Convention Preprints, number 7473, Amsterdam, May. 7

Andrew Wabnitz, Nicolas Epain, A van Schaik, and C Jin. 2011. Time Domain Reconstruction of Spatial Sound Fields using Compressed Sensing. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 465–468. IEEE. 3

Eric W. Weisstein. 2008. Moore-Penrose Matrix Inverse. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/Moore-PenroseMatrixInverse.html. 3

Bruce Wiggins, Iain Paterson-Stephens, Val Lowndes, and Stuart Berry. 2003. The Design and Optimisation of Surround Sound Decoders Using Heuristic Methods. In Proceedings of UKSim 2003, Conference of the UK Simulation Society, pages 106–114. UK Simulation Society. Available from http://sparg.derby.ac.uk/SPARG//SPARG_UKSIM_Paper.pdf (accessed May 15, 2006). 7

Franz Zotter, H. Pomberger, and M. Noisternig. 2010. Ambisonic Decoding with and without Mode-Matching: A Case Study Using the Hemisphere. 2nd Ambisonics Symposium, Paris. 8

Franz Zotter, H. Pomberger, and M. Noisternig. 2012. Energy-Preserving Ambisonic Decoding. Acta Acustica united with Acustica, 98(1):37–47, January. 3

A Formulas for maximum average rE and per-order gains

A.1 Horizontal Arrays
For regular horizontal arrays (Table 1), the maximum value of rE and the gains for each Ambisonic order, M, are given by

r_E = \text{largest root of } T_{M+1}(x)    (6)

g_m = T_m(r_E), \quad m = 0 \ldots M    (7)

where T_m is the m-th Chebyshev polynomial of the first kind. In Mathematica, this can be written as⁷

Table[ChebyshevT[Range[0, M],
  x /. FindRoot[ChebyshevT[M+1,x], {x,1}]], {M, 1, 5}]

A.2 Periphonic Arrays
For regular periphonic arrays (Table 2), the maximum value of rE and the gains for each Ambisonic order, M, are given by

r_E = \text{largest root of } P_{M+1}(x)    (8)

g_m = P_m(r_E), \quad m = 0 \ldots M    (9)

where P_m is the m-th Legendre polynomial. In Mathematica, this can be written as

Table[LegendreP[Range[0, M],
  x /. FindRoot[LegendreP[M+1,x], {x,1}]], {M, 1, 5}]

B LF/HF Matching
As mentioned in Section 2.4, there are three approaches to adjusting the g_m to match LF/HF loudness, g_m' = g_0' g_m. For approach 1, g_0' = 1. For approaches 2 and 3, g_0' is calculated as

E\{g_m\} = \sum_{m=0}^{M} C_m \, g_m^2    (10)

g_0' = \sqrt{N / E\{g_m\}}    (11)

where C_m is the number of signals in the m-th order component. In 3-D, C_m = 2m + 1; in 2-D, C_0 = 1 and C_{m>0} = 2. For approach 2, N is the total number of components: in 3-D, (M + 1)^2; in 2-D, 2M + 1. For approach 3, N is the number of loudspeakers in the array.

⁷ For those without access to Mathematica, these can also be computed interactively using the Wolfram|Alpha online service, http://alpha.wolfram.com.
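For those who prefer neither Mathematica nor Wolfram|Alpha, the Appendix A recipe also fits in a few lines of NumPy. The fragment below is an equivalent sketch written for this text (it is not part of the toolkit): the largest root of the Chebyshev or Legendre polynomial of degree M+1 gives the maximum rE, and evaluating the lower-degree polynomials at that root gives the per-order gains of Tables 1 and 2.

import numpy as np
from numpy.polynomial import Chebyshev, Legendre

def max_re_gains(M, periphonic=False):
    """Maximum rE and per-order gains g_0..g_M (Eqs. 6-9)."""
    B = Legendre if periphonic else Chebyshev
    rE = float(np.max(B.basis(M + 1).roots()))     # largest root of T_{M+1} or P_{M+1}
    return rE, [float(B.basis(m)(rE)) for m in range(M + 1)]

# e.g. max_re_gains(3, periphonic=True) -> (0.8611..., [1.0, 0.8611, 0.6123, 0.3047])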

88 JunctionBox for Android: An Interaction Toolkit for Android-based Mobile Devices Lawrence FYFE Adam TINDALE Sheelagh CARPENDALE InnoVis Group Alberta College of InnoVis Group University of Calgary Art + Design University of Calgary 2500 University Drive NW 1407 14 Avenue NW 2500 University Drive NW Calgary, AB T2N 1N4 Calgary, AB T2N 4R3 Calgary, AB T2N 1N4 Canada Canada Canada [email protected] [email protected] [email protected]

Abstract computers, can be made easier by a toolkit that JunctionBox is an interaction toolkit specifically de- puts commonly used or repetitively coded features signed for building multi-touch sound control inter- into a single place for reuse. faces. The toolkit allows developers to build inter- JunctionBox is an interaction toolkit for build- faces for Android mobile devices, including phones and ing sound control interfaces that addresses this tablets. Those devices can then be used to remotely control any sound engine via OSC messaging. While scenario. The toolkit is written in Java and is de- the toolkit makes many aspects of interface develop- signed as a library for the Processing development ment easy, the toolkit is designed to offer considerable environment [Reas and Fry, 2006]. JunctionBox power to developers looking to build novel interfaces. bridges the interaction gap between the visual de- sign possibilities offered by Processing and the Keywords sound control possibilities afforded by the Open Android, OSC, mobile, interaction, toolkit. Sound Control (OSC) protocol [Wright, 2005]. Figure 1 shows how JunctionBox relates these two 1 Introduction functions. It does this by handling touch inter- The Android operating system [Google, 2012a], actions and by making it easy for developers to developed by Google Inc., runs on a wide variety map those interactions to OSC messages (via the of mobile devices, including phones and tablets. JavaOSC library [Ramakrishnan, 2003]). Junc- With the easy availability of these devices, it be- tionBox has mapping features that are described comes desirable to develop sound and music ap- in detail in an earlier paper by the current authors plications for them. ([Fyfe et al., 2011]). One application scenario involves using inter- faces to control sound in which the interface is one computer and the sound generator is another. The advantage of this scenario is that the quality of the audio on mobile devices will likely not be as good as with combinations of audio interface and speakers in which both are designed to provide the highest quality audio experience. Another related advantage is the possibility of multi-channel sce- narios that go beyond the two channels available on portable devices. For example, an Android device would handle input by a performer, with the option of project- ing the interface, while another computer handles all sound processing. This is not a hypothetical scenario but is the current musical practice of the Figure 1: How JunctionBox serves as a hub for first author. With this kind of scenario, certain both sound and visuals. aspects of development, like messaging between

2 Background

Other toolkits exist that have features similar to those of JunctionBox. The MT4J toolkit [Laufs et al., 2010] has many features for multi-touch interaction building and is available for use with Android. However, MT4J lacks the OSC mapping capabilities that make JunctionBox a toolkit for building sound and music control applications. TouchOSC [hexler.net, 2012] offers functionality that is similar to what is offered by JunctionBox in that widgets can be mapped to custom OSC messages. Widgets include faders, push buttons, rotary controls and others that are skeuomorphs of controls for physical hardware devices like mixers.

Instead of simply offering widgets, JunctionBox is a full-fledged development toolkit that is meant to be used by developers who are not necessarily interested in skeuomorphs. The JunctionBox approach to interaction design can be summed up as BYOW, for "build your own widgets". The focus of JunctionBox is to provide functionality like OSC message mapping while offering the full range of visual design options available via Processing. This approach to toolkit design is meant to encourage greater creativity in the design of both interactions and interfaces.

3 JunctionBox for Android

The original version of the JunctionBox toolkit was not developed for Android. However, since it was written in Java, porting to Android did not require a complete rewrite of the existing code. The main difference between the standard Java version and the Android version of JunctionBox is that TUIO [Kaltenbrunner et al., 2005] is not needed on the Android version since Android devices have a separate system for handling touch interactions. Android-based systems get touch information via handling the data contained in the MotionEvent class [Google, 2012b].

Both versions of JunctionBox contain a Dispatcher class that is responsible for handling touch interactions that ultimately get mapped to OSC messages for controlling audio. Figure 2 shows the relationship of the MotionEvent class to JunctionBox classes including the Dispatcher. The mapping process is exactly the same for both the standard version of JunctionBox and the Android version.

Figure 2: JunctionBox/Android internals with classes shown as boxes with solid lines.

Differences in using the Dispatcher in code have been minimized, with both versions using the same code for construction. A new Dispatcher will take four arguments: the width and height of the screen/touch interface and a socket destination for OSC messages. The following code constructs a new Dispatcher object:

    Dispatcher dispatcher = new Dispatcher(
        screenWidth,
        screenHeight,
        "192.168.1.1",
        7000);

All code written for the standard version of JunctionBox will work with Android with the addition of a single method made available in the Android Dispatcher class. The additional method, handleMotionEvent(), gets touch information from the MotionEvent class. The Dispatcher's handleMotionEvent() method is utilized by creating a dispatchTouchEvent() method in the code for the Processing sketch:

    boolean dispatchTouchEvent(MotionEvent ev) {
        dispatcher.handleMotionEvent(ev);
        return true;
    }

While only a single additional method is required for the Android version of JunctionBox, internally an important difference in touch location handling is that TUIO touch tracking data is normalized relative to the size of the interface, while for Android devices touch position is absolute. This difference is made transparent to the developer and the same code will work in both systems with only one minor change, the addition of the new handleMotionEvent() method. Making differences like this transparent is part of the JunctionBox approach of making development easier without taking away the ability to build highly customized interfaces.

90 4 Interfaces The first author of this paper wrote the code for JunctionBox not only to provide the community with a toolkit for creating networked instruments but for his own performance practice. That prac- tice currently consists of the building of touch in- terfaces for Android devices that run the interface to handle touch input. That interface then com- municates with another computer running audio via OSC. The audio computer used to develop the following interfaces runs Ubuntu Linux, JACK and Pd. Figure 3: The Orrerator interface with the first 4.1 Orrerator and third planets active and playing. The Orrerator interface has widget-like controls but with a much more stylized appearance. It is designed both as a general instrument that could white and half black. The black tiles feature re- be used for a variety of pieces but also specifically versed numbers 1-5. Each of the tiles can be for the composition by the first author entitled moved anywhere in the interface with a single Sol Aur. touch. When the tiles are moved into the tar- The metaphor for the Orrerator is the Orrery get in the center of the interface and the touch or a model of the solar system, as shown in Figure is released, a sound file corresponding to that tile 3. The sound engine controlled by the Orrerator number is played. Reversed tiles play the same is written in Pd and features four FM oscillators. sound reversed. Figure 4 shows the tiles and the The interface has four planet buttons that light up target. Sound files play until either they have when toggled by simply touching the particular reached the end of the file or if a touch is ap- planet. Each planet button turns a single FM plied while playing. This allows the sounds to be oscillator on or off. paused and moved across the target. In addition to on or off controls, each planet button can be rotated around its orbit by touch- ing the orbit area and rotating it around the cen- ter. Orbit areas look increasingly lighter towards the inner planet. This change of rotation will de- tune that particular FM oscillator from its base frequency by a factor of between one and two with the starting vertical position being one and a full rotation back to that same position being two. Orbital rotations are limited to 360 degrees from the starting position. On the left and right edges of the interface are two sliders: the left slider changes the index of modulation and the modulation frequency and Figure 4: The Distance 2 interface with tile 2 in the right slider changes the gain of all four os- the center set to play at normal rate. cillators. Tiles placed in the very center of the target are 4.2 Distance2 played back at normal playback rate. Tiles in the The Distance2 interface is designed specifically next circle out play at half that rate. The next for a composition entitled Distance 2 (Toshi circle plays at one-third rate and the outermost Ichiyanagi) by the first author. It features ten circle plays at one-quarter rate. If a tile is paused tiles numbered 1-5 with half of the tiles being by touch and moved from one target circle to an-

91 other, the sound will pause and resume at the new References playback rate. Lawrence Fyfe, Adam Tindale, and Sheelagh Tiles placed in the circle but on the left of the Carpendale. 2011. Junctionbox: A toolkit for vertical dividing line play in the left channel while creating multi-touch sound control interfaces. tiles on the right side play in the right channel. In Proceedings of the Conference on New Inter- Tiles placed in the center (normal rate) play in faces for Musical Expression, pages 276–279. the center of the stereo field. Google. 2012a. Android. http://developer. The sound engine for this piece is written in Pd android.com/index.html. with a custom file playback control mechanism de- signed by the first author. The 5 sound files are all Google. 2012b. Motionevent. http: various recorded clips from an interview with ex- //developer.android.com/reference/ perimental composer Toshi Ichiyanagi. Distance android/view/MotionEvent.html. 2, the piece, is based on a piece by Ichiyanagi hexler.net. 2012. Touchosc. http://hexler. called Distance [Ichiyanagi, 1961] in which the net/software/touchosc. performer must be at least three meters away from his or her instrument while playing. In a Toshi Ichiyanagi. 1961. Distance. time of computer music in which wireless net- Martin Kaltenbrunner, Till Bovermann, Ross working makes this kind of setup trivial, the no- Bencina, and Enrico Costanza. 2005. Tuio - a tion of distance can be explored in other ways. In protocol for table-top tangible user interfaces. the context of the piece, distance from the center In Proceedings of the 6th International Work- of the target represents the recognizbility of the shop on Gesture in Human-Computer Interac- recorded clips with the clips getting further away tion and Simulation. from the original as the tiles get further away from the center. Uwe Laufs, Christopher Ruff, and Jan Zibuschka. 2010. Mt4j - a cross-platform multi- 5 Summary touch development. In Proceedings of the 2nd ACM SIGCHI symposium on Engineering in- Bringing the JunctionBox toolkit to the Android teractive computing systems, EICS ’10, New operating system allows developers to build sound York, NY, USA. ACM. and music control interfaces on an increasing ar- ray of touch devices. With a focus on inter- Chandrasekhar Ramakrishnan. 2003. Javaosc. face coding rather than providing pre-built wid- http://www.illposed.com/software/ gets, JunctionBox provides developers with an op- javaoscdoc/. portunity to create new touch-based interactions Casey Reas and Ben Fry. 2006. Processing: pro- with highly customized visuals. Two example in- gramming for the media arts. AI & Society, terfaces, the Orrerator and Distance2, show some 20(4):526–538. of the creative interfaces that can be built using the toolkit. Matthew Wright. 2005. Open sound control: an enabling technology for musical networking. 6 Acknowledgements Organised Sound, 10(3):193–200. We would like to thank the Alberta College of Art + Design, the Canada Council for the Arts, the Natural Science and Engineering Research Coun- cil of Canada, SMART Technologies, the Cana- dian Foundation for Innovation, and the Alberta Association of Colleges and Technical Institutes for research support. We would also like to thank the members of the Interactions Lab at the Uni- versity of Calgary for feedback and support during the development of this project.

92 An Introduction to the Synth-A-Modeler Compiler: Modular and Open-Source Sound Synthesis using Physical Models

Edgar BERDAHL
Audio Communication Group, TU Berlin
Einsteinufer 17c
10587 Berlin
Germany
[email protected]

Julius O. SMITH III
CCRMA, Stanford University
Stanford, CA 94305
USA
[email protected]

Abstract nection of passive mechanical, acoustical, and/or The tool is not a synthesizer—it is a Synth-A-Modeler! electrical elements, rather than programming any This paper introduces the Synth-A-Modeler compiler, equations. This makes it much easier for artists which enables artists to synthesize binary DSP mod- to employ physical modeling to make new sounds ules according to mechanical analog model specifica- without needing to understand all of the details tions. This open-source tool promotes modular design under the hood. and ease of use. By leveraging the Faust DSP pro- However, most of these tools have not been im- gramming environment, an output Pd, Max/MSP, Su- mediately available to artists, partly due to the perCollider, VST, LADSPA, or other external module cost of purchasing the tools. For example, the is created, allowing the artist to hear the sound of the physical model in real time using an audio host appli- Modalys modal synthesis environment requires cation. To show how the compiler works, the example both the purchase of Max/MSP and the IRCAM model “touch a resonator” is presented. Forum Recherche [Ellis et al., 2005]. GENESIS is a complete modeling package and can be pur- Keywords chased from the Association pour la Cr´eationet la physical modeling, virtual acoustics, Faust DSP, hap- Recherche sur les Outils d’Expression [Castagne tic force-feedback, open source and Cadoz, 2002]. GENESIS includes a graphi- cal user interface, which makes it easier for artists 1 Introduction to specify the interconnection of the mechanical elements. Simulating the equations of motion of acoustic In contrast, the BlockCompiler by Matti Kar- musical instruments, also known as physical mod- jalainen is open-source; however, it is a complex eling, has been employed for decades to synthe- package that has been rewritten at least three size sound digitally [Smith, 2010; Smith III, 1982; times and requires the artist to write Lisp code Cadoz et al., 1981]. It is an intriguing paradigm to create models [Karjalainen, 2003]. Stefan Bil- for synthesizing sound in new media applications bao has written a modular environment for syn- because it enables sound synthesis for fictional thesizing percussion sounds, but it runs in MAT- acoustic-like instruments that would be imprac- LAB and so is not immediately applicable to real- tical or very labor-intensive to construct in real time synthesis [Bilbao, 2009]. Finally, the Syn- life. thesis ToolKit (STK) has become popular due Physical modeling DSP algorithms incorporate to its MIDI capability and ability to run within internal feedback loops, which means that an er- Max/MSP, Pd, and other sound synthesis envi- ror or inaccuracy in the modeling algorithm can ronments [Cook and Scavone, 1999].1 However, cause the algorithm to become unstable. This re- new models can only be created in the STK by sults in a loud sound, which can be very unpleas- artists who can manually write stable difference ant for an artist working with a physical model. equations in C++ for physical modeling. Several tools have been packaged for imple- For the above reasons, we decided to create a menting modular physical modeling with algo- new, free and open-source tool for modular sound rithms that are designed to be stable. The basic concept is that the artist specifies an intercon- 1See http://ccrma.stanford.edu/software/stk

93 synthesis using physical models. Pd external

Model Max/MSP external Faust 2 Synth-A-Modeler specification Synth−A−Model .DSP SuperCollider external file compiler .MDL file 2.1 Requirements VST plug−in Faust The following requirements of the Synth-A- compiler Custom host Modeler project have guided the design. The Synth-A-Modeler compiler should Figure 1: Dataflow for synthesizing a model with enable efficient real-time physical modeling Synth-A-Modeler • sound synthesis for new media applications, target binary format as suitable for Pd, Super- be free and open-source, • Collider, Max/MSP, CSound, LADSPA plug-in, be as modular as possible, VST plug-in, a generic audio application, generic • C++ code, or other host target as desired by the be easy to extend and modify, • artist. serve as a platform for pedagogical explo- • ration of the physics of mechanically vibrat- 2.4 Specifying MDL Files ing systems, For synthesizing sound with traditional signal flow-based approaches such as Pure Data (pd), be accessible to artists who may have little or Max/MSP, Simulink, or others, a user specifies • no experience in programming, digital signal a directed graph of sound sources, processing el- processing (DSP), or physics, ements, and sound outputs. The signal flow is be accessible from as many sound synthesis considered to be unidirectional: from the sources • host environments as possible, to the sinks. However, in the case of physical modeling, enable the development of MIDI-based syn- the signal flow is bidirectional among the ele- • thesizers, and ments. One reason for this is Newton’s third be compatible with programming haptic law: “For every action, there is an equal and • force-feedback systems. opposite reaction.” Hence, in physical model- ing, a graph with bidirectional edges describes 2.2 Faust the signal flow. An engineer must be keenly To enable efficient real-time synthesis while tar- aware of the bidirectional signal flow between each geting as many host environments as possible, pair of elements; however, an artist designing a we decided to use the Functional AUdio STream- physical model using a modular approach needs ing (Faust) programming language [Orlarey et al., only to understand which elements are connected 2009; Barkati et al., 2011; Orlarey et al., 2002]. In to which. For this reason, the artist can sim- fact, some physical models had already been writ- ply specify an undirected graph of virtual phys- ten directly in Faust code [Michon and Smith III, ical elements [Castagne and Cadoz, 2002]. For 2011]; however, most artists would not be ready the Synth-A-Modeler compiler, the artist speci- to put in the detailed effort required to program fies the model in an MDL file by specifying con- physical models in Faust. Hence, we planned to nections between elements, such as digital wave- extend Faust with the development of Synth-A- guides, masses, springs, dampers, ports, etc., us- Modeler. ing a netlist-like format with some extensions. Of 2.3 Dataflow course the artist may also specify the physical pa- rameters for each element. Consequently, we adopted the dataflow shown in Figure 1. The Synth-A-Modeler compiler 3 Modeling Paradigm receives a netlist-like model specification in an MDL file and compiles it into Faust DSP code 3.1 Example Model [Vladimirescu, 1994]. Then the Faust compiler Consider the model shown in Figure 2, which im- together with g++ can transform the code into a plements a very simple synthesizer with only a

94 single resonance frequency. The model describes m1 a series of mechanical elements that connect to a tt user’s finger, allowing the user to “touch” a vir- tual mechanical resonator. This model features dev1 audioout objects as in GENESIS and CORDIS-ANIMA, ll except that the units are SI units and En- glish names are employed [Castagne and Cadoz, 2002][Cadoz et al., 1993][Kontogeorgakopoulos g and Cadoz, 2007]. The following text specifies the same model using an MDL model specifica- tion file: Figure 2: Model for “touch a resonator” link(4200.0,0.001),ll,m1,g,(); the content within the dash-dotted box in Fig- touch(1000.0,0.03,0.0),tt,m1,dev1,(); ure 2. Our resonator implementation in Synth- A-Modeler allows for the frequency and damping mass(0.001),m1,(); time of the resonator to be adjusted in real-time ground(0.0),g,(); while minimizing transients. Modalys might op- port( ),dev1,(); erate in a similar fashion as it also allows inter- polation of resonance frequencies in real time. In audioout,a1,m1,1000.0; Synth-A-Modeler, we use a state-space implemen- The external port named dev1 is connected by tation employing a well-conditioned rotation ma- a touch link tt to a mass m1 of 0.001 kg. The trix [Mathews and Smith III, 2003]. Max Math- mass resonates because the linear link ll con- ews employed banks of this style of resonator in nects it to mechanical ground g, which always his piece Angel Phasered, which was performed remains at the position 0 m. The linear link at the CCRMA Transitions Concert on Sept. 16, ll consists of the parallel combination of a spring 2010. with stiffness 4200 N/m and damper with param- eter 0.001 N/(m/s). The touch link tt (analogous 4 The Synth-A-Modeler Compiler to the BUT element in GENESIS and CORDIS- 4.1 Strategy ANIMA) is similar to a linear link, except it only Although it is possible to represent any explic- results in a force when one of the objects is push- itly computable linear block diagram in Faust [Or- ing “inside” the other one [Kontogeorgakopoulos larey et al., 2002], Faust’s block diagram algebra and Cadoz, 2007]. is oriented toward signals that flow from the left to 3.2 Inputs And Outputs the right. To obtain a signal flowing back from the A port models a mechanical connection from right to the left, it is necessary to employ Faust’s within a model to the outside. By default each recursive composition operator, which automat- port brings an external position signal into the ically incorporates a single sample of delay. In model and sends a collocated force signal outside physical modeling, signals are bidirectional, which the model. This is why a port can easily be used means that roughly half of the signals must flow to control a haptic force-feedback device [Berdahl, from the right to the left. However, it is chal- 2009]. lenging to organize Faust networks in this fashion In addition, the audioout object in Synth-A- without inserting a plethora of single-sample de- Modeler provides a simple audio output. When lays, which can detract from the stability of the applied to a mass-like object, it outputs position models. (see Figure 2, right), and when applied to a link- For example, consider the Faust code for a mass like object, it outputs force. of m [kg] with zero initial conditions and a linear link but with stiffness k [N/m], damping factor 3.3 Resonator Abstraction R [N/(m/s)], and spring centering position offset The generic resonator object (related to CEL in o [m], as given in Figure 3. 
These equations are GENESIS) could have been employed instead of similar to those in CORDIS-ANIMA and GENE-

    mass(m) = (/(m*fs*fs) : ((_,_ : +) ~ _) : ((_,_ : +) ~ _));

    link(k,R,o) = _ : (_-o) <: (_,_) : (*(k), (_<:(_,_):(_,mem):-:*(R*fs))) : (_,_):+:_;

Figure 3: Faust code for a mass and a linear link from physicalmodeling.lib
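As a reading aid (this gloss is derived from the code above, not taken from the paper's text): mass(m) divides the incoming force by m*fs^2 and passes it through two accumulators, so its output position x[n] satisfies

    x[n] - 2 x[n-1] + x[n-2] = F[n] / (m fs^2),

a finite-difference form of Newton's second law F = m \ddot{x}. Similarly, link(k,R,o) subtracts the centering offset o from the incoming displacement, scales one copy by the stiffness k and the first difference of the other copy by R*fs (the damper force R times a velocity estimate), and sums the two forces. For the parameter values of the example model (k = 4200 N/m, m = 0.001 kg, with damping light enough to ignore here), the single resonance therefore sits near f = (1/2\pi)\sqrt{k/m} \approx 326 Hz.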

[Figure 4: Representation of Faust DSP code for "touch a resonator". The diagram shows the link-like objects (link(4200,0.001,0) and touch(500,0.003,0)) feeding the mass-like objects (mass(0.001) and ground(0)), the dev1 force and position signals, the additional audio output a1, and the one-sample delays (marked -1) that close the feedback loops around bigBlock.]

SIS, except that the units are SI units. put (see Figure 4, bottom left dash-dot-dotted in Neither the mass nor the link incorporates a magenta) and the force signal output (see Fig- single sample of delay; however, in order to wrap ure 4, bottom dashed in cyan) could be connected these specific elements in feedback about one an- to an impedance controlled haptic force-feedback other, it is necessary to insert samples of delay to device in order to control the sound synthesis. make the network explicitly computable. In the Alternatively, using a more conventional kind of CORDIS-ANIMA equations, the delay is included new media controller, the position input could be in the masses [Kontogeorgakopoulos and Cadoz, used independently of any force output to control 2007], so we follow suit by having Faust’s right to the sound synthesis. Finally, the additional au- left signals emanate from the masses. dio output a1, which corresponds to the position For example, Figure 4 shows the representation of the virtual mass, is also provided (see Figure of the Faust DSP code for the model depicted in 4, right dash-dot-dotted in magenta) as a more Figure 2. The mass-like objects are placed on sensible audio output. the right, and the link-like objects are placed fur- ther to the left. The signals representing displace- 4.2 Compiled Code ments are shown in magenta, and the signals rep- Applying the Synth-A-Modeler compiler to the resenting net forces are shown dashed in cyan (see model specification MDL file given in Section 3.1 Figure 4—for color see the online version of this results in the Faust code presented in Figure 5. paper). Linear combinations are formed in or- The code begins by first importing the physical der to properly interconnect the mass-like objects modeling library, which contains Faust code de- and link-like objects. Finally, the single-sample of scribing how each of the elements works. Then delay per feedback loop is represented on the far bigBlock is defined about which the feedback right using the small red boxes as in the Faust - paths will be wrapped. In this case, m1 is fed gram notation. According to this interconnection back, which represents the position of mass m1, strategy, there is no sample of delay associated and the ground position g is also fed back. The with the link-like objects. letter “p” is appended to each variable fed back, The single port in the model (ref. “dev1” in denoting previous, since the variable is delayed by Figure 2) is responsible for the position signal in- a single sample (see the small red squares on the

right-hand side of Figure 4).

The identifier for each element (e.g. ll, tt, m1, g, dev1, and a1) is then employed as an output variable from the element, which can only be accessed from within bigBlock. The inputs to the elements are formed as linear combinations of the other variables (see Figure 5).

    import("physicalmodeling.lib");

    bigBlock(m1p,gp,dev1p) = (m1,g,dev1,a1) with {
      // Link-like objects:
      ll = (m1p - gp) : link(4200.0,0.001,0.0);
      tt = (m1p - dev1p) : touch(1000.0,0.03,0.0);

      // Mass-like objects:
      m1 = (0.0-ll-tt) : mass(0.001);
      g = (0.0+ll) : ground(0.0);
      dev1 = (0.0+tt);

      // Additional audio output
      a1 = 0.0+m1*(1000.0);
    };
    process = (bigBlock)~(_,_):(!,!,_,_);

Figure 5: Compiled Faust DSP code for implementing "touch a resonator"

4.3 Example Pure Data Patch

Next the Faust compiler together with g++ can compile the Faust code into a target format as suitable for an audio host application. We briefly present an example of compiling our example model into a Pure Data (pd) external object called touch_a_resonator~. The pd patch in Figure 6 shows how the external object can be employed for real-time sound synthesis without requiring a haptic force-feedback user input device. The leftmost inlet and outlet are automatically generated by the faust2pd script,2 and we do not use them in this example. The remaining audio inlets and outlets (see Figure 6) are ordered in correspondence with the input and outputs in Figure 4 and in the third line of Figure 5.

The rightmost inlet corresponds to the input position dev1. A horizontal slider GUI object from pd is employed as a user interface. The slider's output is converted into an "audio" signal, smoothed, and fed into the dev1 position input. The force outlet from the dev1 output (see Figure 6) is not needed in this case since there is no mechanism to provide force feedback. The audio output comes from the additional a1 audio output that was specified in the model. Despite having only a single resonance frequency, we find that the quality of the user interaction with the model is intriguing, due to our process of physically modeling as many elements as possible in the interaction loop being simulated.

2 More information is available on this inlet and outlet [Graef, 2007].

Figure 6: Example pd patch for "touch a resonator"

5 Conclusion

Due to its modular character, the Synth-A-Modeler compiler enables the creation of complex models for artistic applications using simple building blocks. In some sense, it is an extension

97 of our own prior work, rewritten to employ Faust Claude Cadoz, Annie Luciani, and Jean-Loup for efficient DSP and to target multiple host ap- Florens. 1981. Synth`ese musicale par simu- plications [Berdahl et al., 2010]. lation des m´ecanismesinstrumentaux. Revue We are currently working to add support for d’acouqistique, 59:279–292. digital-waveguide objects to the Synth-A-Modeler compiler. We look forward to an open-source Claude Cadoz, Annie Luciani, and Jean-Loup platform for prototyping, developing, and releas- Florens. 1993. CORDIS-ANIMA: A modeling ing physical modeling binaries that incorporate and simulation system for sound and image the popular modeling techniques of mass and link synthesis—The general formalism. Computer (i.e. mass-interaction) modeling, modal synthesis, Music Journal, 17(1):19–29. and digital waveguide synthesis. The implemen- Nicolas Castagne and Claude Cadoz. 2002. Cre- tation is simple enough that we hope other devel- ating music by means of ‘physical thinking’: opers can easily add support for further modeling The musician oriented Genesis environment. In formalisms. Proc. 5th Internat’l Conference on Digital Au- Due to the open-format licensing of Synth-A- dio Effects, pages 169–174, Hamburg, Germany. Modeler via the GPL version 2, we believe that Synth-A-Modeler could be especially attractive Perry Cook and Gary Scavone. 1999. The syn- for commercial applications for compiling opti- thesis toolkit (STK), version 2.1. In Proc. In- mized and fine-tuned models into portable binary ternat’l Computer Music Conf., Beijing, China. modules. We hope that this way we can create a Nicholas Ellis, Jo¨elBensoam, and Ren´eCauss´e. large user base including industrial, artistic, and 2005. Modalys demonstration. In Proc. Int. scientific users. Comp. Music Conf. (ICMC’05), pages 101–102, Barcelona, Spain. 6 Acknowledgments We graciously thank Alexandros Kontogeor- Albert Graef. 2007. Interfacing Pure Data with gakopoulos, Claude Cadoz, Stefan Weinzierl, An- Faust. In Proc. of the Linux Audio Confer- nie Luciani, Andre Bartetzki, Chris Chafe, and ence, Technical University of Berlin, Berlin, the Alexander von Humboldt foundation. Germany. Matti Karjalainen. 2003. BlockCompiler: Effi- References cient simulation of acoustic and audio systems. K. Barkati, D. Fober, S. Letz, and Y. Orlarey. In Proc. 114th Convention of the Audio Engi- 2011. Two recent extensions to the FAUST neering Society, Preprint #5756, Amsterdam, compiler. In Proc. of the Linux Audio Confer- The Netherlands. ence, Maynooth, Ireland. Alexandros Kontogeorgakopoulos and Claude Edgar Berdahl, Alexandros Kontogeorgakopou- Cadoz. 2007. Cordis Anima physical model- los, and Dan Overholt. 2010. HSP v2: Hap- ing and simulation system analysis. In Proc. tic signal processing with extensions for physi- 4th Sound and Music Computing Conference, cal modeling. In Proceedings of the Haptic Au- pages 275–282, Lefkada, Greece. dio Interaction Design Conference, pages 61– 62, Copenhagen, Denmark. Max Mathews and Julius O. Smith III. 2003. Methods for synthesizing very high Q para- Edgar Berdahl. 2009. Applications of Feed- metrically well behaved two pole filters. In back Control to Musical Instrument Design. Proc. Stockholm Musical Acoustic Conference Ph.D. thesis, Stanford University, Stanford, (SMAC), Stockholm, Sweden. CA, USA, December. Romain Michon and Julius O. Smith III. 2011. Stefan Bilbao. 2009. A modular percussion syn- Faust-STK: A set of linear and nonlinear phys- thesis environment. In Proc. 12th Int. 
Con- ical models for the Faust programming lan- ference on Digital Audio Effects (DAFx-09), guage. In Proc. 14th Int. Conference on Digital Como, Italy. Audio Effects (DAFx-11), Paris, France.

98 Yann Orlarey, Dominique Fober, and St´ephane Letz. 2002. An algebra for block diagram lan- guages. In Proceedings of International Com- puter Music Conference, G¨oteborg, Sweden. Yann Orlarey, Dominique Fober, and St´ephane Letz. 2009. Faust: an Efficient Functional Ap- proach to DSP Programming. Edition Delatour, France. J.O. Smith III. 1982. Synthesis of bowed strings. In Proceedings of the International Computer Music Conference, Venice, Italy. Julius O. Smith. 2010. Physical Audio Signal Processing: For Virtual Musical Instruments and Audio Effects. W3K Publishing, http: //ccrma.stanford.edu/~jos/pasp/, Decem- ber, ISBN 978-0-9745607-2-4. Andrei Vladimirescu. 1994. The SPICE Book. Wiley, Hoboken, NJ.

pd-faust: An integrated environment for running Faust objects in Pd

Albert Gräf Dept. of Computer Music, Institute of Musicology Johannes Gutenberg University 55099 Mainz, Germany [email protected]

Abstract make the changes take effect. Another obstacle is This paper introduces pd-faust, a library for running that the necessary infrastructure for processing signal processing modules written in Grame’s func- MIDI note input and control changes is imple- tional dsp programming language Faust in Miller mented entirely in Pd, which may involve a lot Puckette’s graphical computer music environment of Pd objects and become a cpu hog for complex Pure Data a.k.a. Pd. pd-faust is based on the author’s Faust programs. faust2pd script which generates Pd GUIs from Faust pd-faust was designed to overcome these lim- programs and also provides the necessary infrastruc- ture for running Faust dsps in Pd. pd-faust combines itations. Most notably, it loads Faust dsps dy- this functionality with its own Faust plugin loader namically and allows them to be reloaded at any which makes it possible to reload Faust dsps while time, in which case the Pd GUI is regenerated au- a patch is running. It also adds automatic configura- tomatically and instantly. Thus you can now just tion of MIDI and OSC controller assignments, as well edit and recompile the Faust source and have pd- as OSC-based automation features. faust pick up the changes on the fly, while the Pd patch keeps running. Keywords pd-faust is implemented as a library of Pd ob- Faust, Pd, Pure, functional signal processing. jects written in the author’s Pure programming language [2], which is compiled to native code, 1 Introduction so that Faust dsps involving a lot of different con- We assume that the reader is familiar (or has at trols are handled in an efficient manner. Another least heard of) Grame’s popular Faust program- advantage of using Pure is that at present it is the ming language [5], which greatly facilitates the only programming language with a built-in Faust programming of custom audio processing plu- interface based on the LLVM toolkit (which Pure gins. Faust programs can be compiled to na- itself uses as its code generation backend). This tive code for an abundance of different signal enables pd-faust to directly load compiled Faust processing environments and plugin standards. programs in LLVM bitcode format [3] if you’re In particular, Faust has had support for Miller running a suitable version of the Faust compiler Puckette’s Pd [6] for some time now through [4]. its puredata.cpp architecture. Pd users in the pd-faust offers a number of other enhance- Faust community have also been using the au- ments facilitating the use of Faust dsps in Pd. thor’s faust2pd script to create Pd GUIs (graph- While it is based on the GUI generation code ical user interfaces) for Faust externals which of the faust2pd script and thus supports all of seem to work quite well for operating Faust dsps faust2pd’s global GUI layout options, it also pro- inside Pd [1]. vides various options to adjust the layout of in- One of faust2pd’s shortcomings is that it gen- dividual control items. In addition, pd-faust rec- erates the Pd GUIs outside of Pd. Thus the gener- ognizes the midi and osc controller attributes in ated GUI is static, and changing the Faust source the Faust source and automatically provides cor- of a dsp generally requires regeneration of the responding MIDI and OSC controller mappings. GUI and reloading of the hosting Pd patch to OSC-based controller automation is also avail-

101 the following section). The basic functionality to do this is actually provided by another Pure module, pure-faust, which can load Faust mod- ules in one of two formats:

Native modules are shared modules in the • format supported by the host operating sys- tem (.so on ELF systems, .dll on Win- dows, etc.). These are created from Faust source programs by invoking the Faust com- piler with the pure.cpp architecture file Figure 1: An overview of the pd-faust system. (faust -a pure.cpp) and compiling the re- sulting C++ source to a shared module. Bitcode modules are modules in LLVM bit- able. These additional features are all described • in some detail in this paper. code format which can be created directly Section 2 starts out with a brief overview of pd- by Faust (faust -lang llvm). You’ll need faust. In Sections 3 and 4 we then take a more in- Faust2, the development version of Faust, depth look at the pd-faust objects and describe to make this work [4]. The Pure interpreter how pd-faust constructs the Pd GUI at runtime. has a built-in LLVM bitcode linker which en- Section 5 briefly discusses how to operate the re- ables it to load these modules [3]; the exe- sulting Pd patches. Sections 6 and 7 show how cutable code of the module is then generated pd-faust uses metadata in Faust programs to de- on the fly when the module is loaded. fine controller mappings and adjust the GUI lay- out. Section 8 briefly touches on the auxiliary The internal workings of pd-faust are illus- facilities provided to ease livecoding with Faust trated in Fig. 2. The fdsp~ and fsynth~ objects dsps. Section 9 concludes with pointers to addi- use the operations provided by the Faust module tional information and examples, and discusses to instantiate a Faust dsp and extract the needed possible directions for further work. information from the dsp in order to construct its Pd GUI. The GUI is then inserted into the host- The paper assumes some working knowledge 1 of Faust and Pd; please consult [5] and [6] if nec- ing Pd patch by means of Pd’s “FUDI” protocol. essary. All this happens automatically whenever the Pd patch is loaded or a Faust module gets reloaded 2 Overview by sending it a corresponding control message. While the patch is running, an fdsp~ or Before we go into the technical details, let us first fsynth~ object receives incoming control mes- give a brief overview of pd-faust and the various sages and audio data through its inlets and in- software components which are involved in the vokes the operations of the dsp to change the system (cf. Fig. 1). control values and compute blocks of audio data pd-faust is implemented as an object library as they are requested by Pd’s audio loop. The for Pd. Since the pd-faust objects are written in generated data is then output through the ob- Pure, you’ll also need the pd-pure plugin loader ject’s audio and control outlets. [2] which provides the necessary infrastructure to run Pure objects in Pd. Both libraries are avail- 3 The fdsp and fsynth objects able in the form of shared modules which are loaded by Pd on startup, using either Pd’s -lib Working with pd-faust basically involves adding option or a corresponding entry in Pd’s startup a bunch of fsynth~ and fdsp~ objects to a Pd preferences. patch along with the corresponding GUI sub- The main ingredients of pd-faust are the fdsp~ patches, and wiring up the Faust units in some fsynth~ and objects which are used to load and 1See http://wiki.puredata.info/en/FUDI. This proto- run different kinds of Faust dsps inside Pd (the col is also used internally by Pd to represent the contents of differences between these will be explained in patches and communicate with its GUI process.

Figure 2: Internals of the pd-faust system.

variation of a synth-effects chain which typically takes input from Pd's MIDI interface (notein, ctlin, etc.) and outputs the signals produced by the Faust units to Pd's audio interface (dac~). For convenience, pd-faust also includes the midiseq and oscseq objects as well as a corresponding midiosc abstraction which can be used to handle MIDI input and playback as well as OSC controller automation. These objects are described in more detail in Section 5.

The fdsp~ object is invoked as follows:

    fdsp~ dspname instname channel

• dspname denotes the name of the Faust dsp (usually this is just the name of the .dsp file with the extension stripped off). Please note that, as already mentioned, the Faust dsp must be provided in a form which can be loaded in Pure (not Pd!), so the pure.cpp architecture included in recent Faust versions must be used to compile the dsp to a shared library. (If you're already running Faust2, you can also compile to an LLVM bitcode file instead; Pure has built-in support for loading these.) The Makefiles included in the pd-faust distribution show how to do this.

• instname denotes the name of the instance of the Faust unit. Multiple instances of the same Faust dsp can be used in a Pd patch, which must all have different instance names. In addition, the instance name is also used to identify the GUI subpatch of the unit (see below) and to generate unique OSC addresses for the unit's control elements.

• channel is the number of the MIDI channel the unit responds to. This can be 1..16, or 0 to specify "omni" operation (listen to MIDI messages on all channels).

The fdsp~ object requires a Faust dsp which can work as an effect unit, processing audio input and producing audio output.

The fsynth~ object works in a similar fashion, but has an additional creation argument specifying the desired number of voices:

    fsynth~ dspname instname channel nvoices

The fsynth~ object requires a Faust dsp which can work as a monophonic synthesizer (having zero audio inputs and a nonzero number of audio outputs). To these ends, pd-faust assumes that the Faust unit provides three so-called "voice controls" which indicate which note to play:

• freq is the fundamental frequency of the note in Hz.

• gain is the velocity of the note, as a normalized value between 0 and 1. This usually controls the volume of the output signal.

• gate indicates whether a note is currently playing. This value is either 0 (no note to play) or 1 (play a note), and usually triggers the envelope function (ADSR or similar).

pd-faust doesn't care at which path inside the Faust dsp these controls are located, but for the synthesizer to function properly they must all be there, and the basenames of the controls must be unique throughout the entire dsp. (A minimal example dsp with these controls is sketched below.)

Like faust2pd, pd-faust implements the necessary logic to drive the given number of voices of an fsynth~ object. That is, it will actually create a separate instance of the Faust dsp for each voice and handle polyphony by allocating voices from this pool in a round-robin fashion, performing the usual voice stealing if the number of simultaneous notes to play exceeds the number of voices.

The fdsp~ and fsynth~ objects respond to the following messages:

• bang outputs the current control settings on the control outlet in (symbolic) OSC format.2

2 pd-faust represents OSC messages as ordinary Pd messages with the OSC address in the selector symbol of the message. Input and output of binary OSC messages is assumed to be handled by a separate OSC library which is not
103 write outputs the current control settings to • external MIDI and/or OSC devices. This message can also be invoked with a numeric argument to toggle the “write mode” of the unit; please see Section 6 for details. reload reloads the Faust unit. This also • reloads the shared library or bitcode file if the unit was recompiled since the object was last loaded. addr value changes the control indicated by • the OSC address addr. This is also used in- ternally for communication with the Pd GUI and for controller automation.

In addition, the fdsp~ and fsynth~ objects respond to MIDI controller messages of the form ctl val num chan, and the fsynth~ ob- ject also understands note-related messages of the form note num vel chan (note on/off) and bend val chan (pitch bend). In either case, pd- faust provides the necessary logic to map con- troller and note-related messages to the corre- Figure 3: Sample pd-faust patch. sponding control changes in the Faust unit.

4 GUI subpatches The relative order in which you insert an fdsp~ For each fdsp~ and fsynth~ object, the Pd patch or fsynth~ object and its GUI subpatch into the should also contain an (initially empty) “one-off” main patch matters. Normally, the GUI subpatch graph-on-parent subpatch with the same name should be inserted first, so that it will be updated as the instance name of the Faust unit: automatically when its associated Faust unit is first created, and also when the main patch is pd instname saved and then reloaded later. You shouldn’t insert anything into this sub- However, in some situations it may be prefer- patch, its contents (a bunch of Pd GUI ele- able to insert the GUI subpatch after its associ- ments corresponding to the control elements of ated Faust unit. If you do this, the GUI will not the Faust unit) will be generated automatically be updated automatically when the main patch is by pd-faust when the corresponding fdsp~ or loaded, so you’ll have to reload the dsp manually fsynth~ object is created, and whenever the unit (sending it a reload message) to force an update gets reloaded at runtime. See Fig. 3 for an exam- of the GUI subpatch. This is useful, in particular, ple. if you’d like to edit the GUI patch manually after As with faust2pd, the GUI layout follows the it has been generated. hierarchical structure of the controls in the Faust In some cases it may even be desirable to com- program which places controls in different con- pletely “lock down” the GUI subpatch. This can trol groups, please check the Faust documentation be done by simply renaming the GUI subpatch for details. The default appearance of the GUI after it has been generated. When Pd saves the can also be adjusted in various ways; see Section main patch, it saves the current status of the 7 for details. GUI subpatches along with it, so that the re- part of pd-faust. E.g., one might use Martin Peach’s collec- named subpatch will remain static and will never tion of OSC objects for that purpose, see http://puredata. be updated, even if its associated Faust unit gets info/Members/martinrp/OSCobjects. reloaded. This generally makes sense only if the

104 is stopped, you can also use these to change the start position for playback and recording. Here is a brief rundown of the available con- trols:

The gray “record” toggle in the upper right • corner of the abstraction enables record- ing of OSC controller automation data. Figure 4: Close-up view of the midiosc abstrac- Note that this toggle merely arms the OSC tion. recorder; you still have to actually start the recording with the start button. However, control interface of the Faust unit isn’t changed you can also first start playback with start after locking down its GUI patch. To “unlock” a and then switch recording on and off as GUI subpatch, you just rename it back to its orig- needed at any point in the sequence. Push- inal name. ing the stop button then stores the recorded sequence for later playback. Also note that 5 Operating the patches before you can start recording any OSC data, you first have to arm the Faust units that The generated Pd GUI elements for the Faust you want to record. This is done with the dsps are pretty much the same as with faust2pd. “record” toggle in the Pd GUI of each unit. See Fig. 3, which shows the midiosc abstraction and the actual Faust objects on the left, and the The “bang” control next to the “record” tog- • corresponding GUI subpatches on the right. The gle lets you record a snapshot of the current only obvious change is the addition of a “record” controller settings. This is also done auto- button (the gray toggle on the left of the but- matically when you start recording a new ton row located in the upper right corner of each sequence. GUI) which enables recording of OSC automa- The start, stop and cont controls in the first tion data (described below). • row of control elements start, stop and con- The midiosc abstraction (cf. Fig. 4) shown in tinue MIDI and OSC playback, respectively. the examples serves as a little sequencer ap- The echo toggle in this row causes played plet that enables you to control MIDI playback MIDI events to be printed in the Pd main and OSC recording. The creation arguments of window. midiosc are the names of the MIDI and OSC files. There are some additional controls related If the second argument is omitted, it defaults to • the name of the MIDI file with new extension to OSC recording in the second row: save .osc. You can also omit both arguments if neither saves the currently recorded data in an OSC MIDI file playback nor saving recorded OSC data file for later use; abort is like stop in that is required. it stops recording and playback, but also The abstraction has a single control outlet throws away the data recorded in this take through which it feeds the generated MIDI data (rather than keeping it for later playback); clear and other messages to the connected fsynth~ purges the entire recorded OSC se- and fdsp~ objects. Live MIDI input is also ac- quence so that you can start a new one; and cepted and forwarded to the control outlet, af- the echo toggle, if enabled, prints the OSC ter being translated to the format understood messages as they are played back. by fsynth~ and fdsp~ objects. Moreover, you The controls in the third row provide some can also control midiosc with an external se- • additional ways to configure the playback quencer program, employing MIDI Machine process. The loop button can be used to en- Control (MMC) for synchronization. able looping, which repeats the playback of At the bottom of the abstraction there is a lit- the MIDI (and OSC) sequence ad infinitum. tle progress bar along with a time display which The thru button, when switched on, routes indicates the current song position. 
If playback the MIDI data during playback through Pd’s

105 MIDI output so that it can be used to drive forego midiosc, or you may replace them alto- an external MIDI device in addition to the gether with your own sequencer objects. Faust instruments. The write button does the same with MIDI and OSC controller data 6 External MIDI and OSC controllers generated either through automation data The fsynth~ object has built-in (and hard-wired) or by manually operating the control ele- support for MIDI notes, pitch bend and MIDI ments in the Pd GUI, see Section 6 for de- controller 123 (all notes off). tails. The send button recalls the recorded Other controller data received from external OSC parameter settings at a given point in MIDI and OSC devices is interpreted according the sequence, and updates the GUI controls to the controller mappings defined in the Faust accordingly. source (this is explained below), by updating the corresponding GUI elements and the control Once some automation data has been variables of the Faust dsp. For obvious reasons, recorded, it will be played back along with this only works with active Faust controls.3 the MIDI file. You can then just listen to it, or go An fdsp~ or fsynth~ object can also be put on to record more automation data as needed. in write mode by feeding a message of the form If you save the automation data with the save write 1 into its control inlet (the write 0 mes- button, it will be reloaded from its OSC file next sage disables write mode again). For conve- time the patch is opened. nience, the write toggle in the midiosc abstrac- OSC sequences are saved in a simple ASCII tion allows you to do this simultaneously for all format which should be self-explanatory and can Faust units connected to midiosc’s control outlet. be edited with any text editor. For instance: When an object is in write mode, it also out- puts MIDI and OSC controller data in response # written by oscseq Sun Dec 11 19:39:45 2011 to both automation data and the manual oper- # delta /oscaddr value ation of the Pd GUI elements according to the 0 /synth:Nonlinear-Filter/typeMod 0 controller mappings defined in the Faust source, 0 /synth:Nonlinearity 0 so that it can drive an external device such as a 0 /synth:Reverb/reverbGain 0.137 MIDI fader box or a multitouch OSC controller. 0 /synth:Reverb/roomSize 0.72 Note that this works with both active and passive Faust controls. The OSC addresses are generated automati- To configure MIDI controller assignments, the cally from the hierarchical structure of the con- labels of the Faust control elements have to be trols in the Faust program, using the instance marked up with the special midi attribute in the name as well as the labels of control groups and Faust source. For instance, a pan control (MIDI elements to assign a unique pathname to each controller 10) may be implemented in the Faust control of each Faust unit in the patch. source as follows: Note that midiosc is merely an example which pan = hslider( should cover most common uses and can be cus- "pan[midi:ctrl 10]",0,-1,1,0.01); tomized for the target application as needed. You may even do without it and have your patches pd-faust will then provide the necessary logic feed control messages directly into fdsp~ and to handle MIDI input from controller 10 by fsynth~ objects instead. 
Internally, midiosc uses changing the pan control in the Faust unit ac- two utility objects midiseq and oscseq written cordingly, mapping the controller values 0..127 in Pure and contained in the pd-faust object li- to the range and step size given in the Faust brary. Together these implement most of the source. Moreover, in write mode corresponding MIDI playback and OSC recording functional- MIDI controller data will be generated and sent ity; midiosc basically just adds the necessary 3Faust distinguishes between active controls which take wiring with Pd’s MIDI I/O and a few control el- input from the user and passive controls displaying con- ements. The midiseq and oscseq objects can also trol data computed in the Faust program, such as RMS en- be used directly in your patch if you prefer to velopes.

106 to Pd’s MIDI output, on the MIDI channel spec- radio-sliders option, the option is only ap- ified in the creation arguments of the Faust unit plicable if the control is zero-based and has (0 meaning “omni”, i.e., output on all MIDI chan- a stepsize of 1. nels). slider-nums: Add a number box to each The same functionality is also available for ex- • slider control. Note that in pd-faust this is ternal OSC devices, employing explicit OSC con- actually the default, which can be disabled troller assignments in the Faust source by means with the no-slider-nums option. of the osc attribute. E.g., the following enables exclude=pat,...: Exclude the controls input and output of OSC messages for the OSC • /pan address: whose labels match the given glob patterns from the Pd GUI. pan = hslider("pan[osc:/pan]",0,-1,1,0.01); In pd-faust there is no way to specify the above Note that in contrast to some architectures in- options on the command line, so you’ll have to cluded in the Faust distribution, pd-faust only al- put them as pd attributes on the main group of lows literal OSC addresses here. That is, glob- your Faust program, as described in the faust2pd style OSC patterns are not supported as values documentation. For instance: for the osc attribute. Also note that pd-faust cur- rently does not include any facilities for actual process = vgroup( "[pd:no-slider-nums][pd:font-size=12]", OSC input/output, but it’s easy to add this to ...); midiosc if needed.4 In addition, the following options can be used 7 Tweaking the GUI layout to change the appearance of individual control items. If present, these options override the cor- As already mentioned, pd-faust provides the responding defaults. (Each option can also be same global GUI layout options as faust2pd. prefixed with “no-” to negate the option value. There are a few minor changes in the meaning Thus, e.g., no-hidden makes items visible which of some of the options, though. Here is a brief would otherwise, by means of the global exclude rundown of the available options, as they are im- option, be removed from the GUI.) plemented in pd-faust: hidden: Hides the corresponding control in width=wd, height=ht: Specify the maximum • the Pd GUI. This is the only option which • horizontal and/or vertical dimensions of the can also be used for control groups, in which layout area. If one or both of these values are case all controls in the group will become in- nonzero, pd-faust will try to make the GUI visible in the Pd GUI. fit within this area. fake-button, radio-slider, slider-num: • font-size=sz: Specify the font size (default These have the same meaning as the corre- • is 10). sponding global options, but apply to indi- vidual control items. fake-buttons: Render button controls as • Pd toggles rather than bangs. These options are specified with the pd at- radio-sliders=max: Render sliders with up tribute in the label of the corresponding Faust • to max different values as Pd radio controls control or control group. For instance, the fol- aux rather than Pd sliders. Note that in pd-faust lowing Faust code hides the controls in the pan this option not only applies to sliders, but group, removes the number entry from the preset also to numeric entries, i.e., nentry in the control, and renders the item as a Pd ra- Faust source. 
However, as with faust2pd’s dio control: aux = vgroup("aux[pd:hidden]", aux_part); 4“Vanilla” Pd doesn’t provide any objects for OSC I/O, pan = hslider( but various suitable externals are readily available, such as "pan[pd:no-slider-num]",0,-1,1,0.01); Martin Peach’s OSC objects already mentioned above. The preset = nentry( pd-faust distribution includes a variation of the midiosc ab- "preset[pd:radio-slider]",0,0,7,1); straction which implements this.

8 Remote control

Also included in the sources is a helper abstraction faust-remote.pd and an accompanying elisp program faust-remote.el. These work pretty much like pure-remote.pd and pure-remote.el in the pd-pure distribution, but are tailored for the remote control of Faust dsps in a Pd patch. In particular, they enable you to quickly reload the Faust dsps in Pd using a simple keyboard command (C-C C-X by default) from Emacs. The faust-remote.el program was designed to be used with Juan Romero's Emacs Faust mode.5

9 Conclusion

We hopefully convinced the reader that pd-faust is an interesting solution for developing, testing, deploying and running Faust dsps in the graphical Pd environment. It provides Pd and Faust users with a fairly complete integrated development environment for Faust, and supports a freewheeling interactive and experimental development style which should appeal to both developers and artists.

The software described in this paper is available freely under the LGPL6 from the Pure website.7 As already mentioned, pd-faust is written in Pure; thus, to build and run the software you'll also need an installation of the Pure interpreter and a couple of Pure addon packages, including the latest release of pd-pure [2] which is needed to run Pure externals in Pd; please check the pd-faust documentation for details.

The package also contains a few example patches and accompanying files which illustrate how this all works. Here are some of the examples that you might want to take a look at:

• test.pd: very basic example running a single Faust instrument
• synth.pd: slightly more elaborate patch featuring a synth-effects chain
• bouree.pd: full-featured example running various instruments

pd-faust development continues, so you might want to check out the latest development sources in the Pure source code repository. There are some technical issues and limitations in the current pd-faust version which will hopefully be ironed out in the future. Most notably:

• In the present implementation, pure-faust supports Faust modules either in native or in LLVM bitcode form, but not both at the same time. Therefore there are two corresponding versions of the pd-faust object library, and you'll have to decide in advance whether you'd like to work with native or bitcode modules.

• The names of the fsynth~ voice controls (freq, gain, gate) are currently hardcoded, and there must be exactly one instance of each of these controls in the dsp, otherwise fsynth~ may not function as advertised (a minimal sketch of such a dsp is given at the end of this section).

• Passive Faust controls are only supported in fdsp~ objects right now.

• The OSC recording capabilities are somewhat limited in the current version. In particular, it should be possible to erase and edit already recorded controller data while recording. At present you can only change existing automation data by purging the entire sequence and starting over, or by editing the OSC file. This is sufficient for testing purposes but not adequate for real musical work.

Further research

Faust's abstract GUI model seems perfectly appropriate for the type of control processing applications that are typically written using pd-pure. pd-faust is just one of the many possible pd-pure applications which might benefit from this approach. Therefore an interesting question for further research is how to integrate the corresponding pd-faust functionality into pd-pure and make it available to all pd-pure applications.

Also, it might be useful to port pd-faust to other realtime environments, such as SuperCollider, and suitable plugin environments, such as LV2 and VST. In principle it's also conceivable to run a suitably modified version of pd-faust directly as a Jack client, without the extra Pd layer. Of course, the details of integrating Faust dsps with these host environments differ considerably from the current implementation running inside Pd, so porting the interface will be a substantial amount of work.

5 https://github.com/rukano/emacs-faust-mode
6 http://www.gnu.org/copyleft/lgpl.html
7 http://pure-lang.googlecode.com
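To make the second limitation above more tangible, the following is a minimal sketch of a dsp with the expected voice controls. It is not one of the distributed examples; it assumes the osc sine oscillator from the standard music.lib, and a real instrument would normally also apply an envelope to the gate signal.

// Minimal fsynth~-style voice: exactly one freq, gain and gate control.
import("music.lib");
freq = nentry("freq", 440, 20, 20000, 0.01);  // voice pitch, set by fsynth~
gain = nentry("gain", 0.3, 0, 1, 0.01);       // velocity/amplitude
gate = button("gate");                        // note on/off
process = osc(freq) * gain * gate;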

Acknowledgements

Special thanks are due to Julius O. Smith from CCRMA for following this project from its inception, taking the time to test early alpha versions and suggesting improvements. I'd also like to thank Johannes Zmölnig from the IEM Graz for pointing me to Pd's internal FUDI protocol which made this project feasible in the first place.

References

[1] A. Gräf. Interfacing Pure Data with Faust. In Proceedings of the 5th International Linux Audio Conference, pages 24–32, Berlin, 2007. TU Berlin.

[2] A. Gräf. Signal processing in the Pure programming language. In Proceedings of the 7th International Linux Audio Conference, Parma, 2009. Casa della Musica.

[3] A. Gräf. An LLVM bitcode interface between Pure and Faust. In Proceedings of the 9th International Linux Audio Conference, Maynooth, 2011. NUI Maynooth, Dept. of Music.

[4] S. Letz. LLVM backend for Faust. http://www.grame.fr/~letz/faust_llvm.html, 2012.

[5] Y. Orlarey, D. Fober, and S. Letz. FAUST: an efficient functional approach to DSP programming. In G. Assayag and A. Gerzso, editors, New Computational Paradigms for Computer Music. Editions Delatour France, 2009.

[6] M. Puckette. The Theory and Techniques of Electronic Music. World Scientific Publishing Co., 2007.

The Faust Online Compiler: a Web-Based IDE for the Faust Programming Language

Romain MICHON and Yann ORLAREY
GRAME, Centre national de création musicale
9 rue du Garet, 69202 Lyon, France
[email protected] and [email protected]

Abstract

The Faust Online Compiler is a PHP/JavaScript based web application that provides a cross-platform and cross-processor programming environment for the Faust language. It makes it possible to use most of Faust's features directly in a web browser, and it integrates an editable catalog of examples, making it a platform to easily share and use Faust objects.

Keywords

Faust, Web Application, Digital Signal Processing.

1 Introduction

Faust [Orlarey et al., 2002] is a functional programming language specifically designed for real-time signal processing and synthesis. Thanks to specific architecture files, a single Faust program can be used to produce code for a variety of platforms and plug-in formats. These architecture files act as wrappers and describe the interactions with the host audio and GUI system.

The Faust Online Compiler is a PHP/JavaScript based web application that provides a cross-platform and cross-processor programming environment for the Faust language. It makes it possible to use most of Faust's features directly in a web browser. In many ways, we think it solves one of Faust's weaknesses: the great number of dependencies that using its architecture files involves. Indeed, even though Faust doesn't need any specific library to be installed on a system to use it, some architecture files prove to be quite complicated to use because of their dependencies. It also solves the system architecture issues that some Windows users, for example, might have encountered in the past.

The Faust Online Compiler integrates an editable catalog of examples, making it a platform to easily share and use Faust objects. We also think it is a pedagogical tool where it is possible to see, in a very interactive way, a digital signal processing statement, its C++ equivalent, its diagram representation, its documentation and finally its results.

It has been implemented as part of the development of the new Faust website1 and it can be easily embedded in any HTML web page or installed on an Apache server.

2 Interface

The Faust Online Compiler is a web application made of a single fixed-size block. Its size has been defined to fit in the new Faust website, but it can be easily embedded and scaled to fit any div of an HTML page using the iframe mechanism (see section 4). A full-page mode is also available to provide a more comfortable workspace.

The navigation between the states of the Faust object is made through different tabs that can be selected in a navigation bar (see frame 1 in figure 1).

2.1 Editing the code

The Faust code in the Faust Online Compiler can be edited in two different ways. The most obvious one is to use the code editor that can be found in the Faust Code tab. The CodeMirror2 code editor is used to carry out this task. It has been chosen because of its compatibility with a great number of web browsers (Firefox, Safari, Opera, IE, Google Chrome, etc.). It uses syntax highlighting specific to the Faust programming language.

Thanks to an AJAX script, the Faust Online Compiler also allows any .dsp file containing a Faust program to be dropped in the file drop area located just above the code editor (see figure 1). This drop area is present in every tab. This makes it possible to compile a Faust code into many different elements very interactively and very quickly.

1 http://faust.grame.fr/
2 http://codemirror.net/
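As a concrete illustration (this little program is made up for this description, it is not one of the catalog examples), the Faust code typed into the Faust Code tab or dropped as a .dsp file can be as small as a one-slider gain effect; any program with at least one UI element is enough to exercise the C++ code, diagram, documentation and executable tabs described below.

// Minimal test program: a gain slider applied to a mono input.
level = hslider("level", 0.5, 0, 1, 0.01);
process = _ * level;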



Figure 1: Overview of the compiler zone of the Faust Online Compiler. Frame 1 contains the navigation bar, frame 2 the compilation options and frame 3 the Faust code editor or the different kinds of results of the compilation.

2.2 Compilation

2.2.1 Compilation Options

The compilation options menu can be found just under the navigation bar (see frame 2 in figure 1). It is used to set the kind of executable file that will be generated during the C++ compilation. Most of the Faust architectures can be used, and the OSC support option [Fober et al., 2011] can be selected if it is allowed by the architecture. Further options can be given to the Faust compiler by filling in the Opts text area.

The 64-bit Linux server (Ubuntu) that hosts the Faust Online Compiler at GRAME makes it possible to compile the C++ code generated by the Faust compiler in 32 or 64 bits for the following platforms (depending on the selected architecture):

• Linux
• Windows, using the MinGW3 cross compiler
• OSX 10.6, using the remote command execution system of SSH4

If any of these options is changed in the C++ Code or Exec File tabs, the result will be automatically updated.

2.2.2 Compilation

In order to carry out the compilation, a temporary folder is created on the server, containing the Faust file and a Makefile generated from the options given by the user and the architecture he selected. It is used by the Faust Online Compiler to generate the C++ code, the SVG diagram, the documentation, the executable file and the different downloadable packages.

3 http://www.mingw.org/
4 http://www.openssh.com/


Figure 2: Diagram of the server components.

2.3 Getting the C++ Code

If the C++ tab is selected, the Faust compiler is called on the server by the Makefile, which also uses Highlight5 to color the generated C++ code. If the Faust code given by the user contains errors, the message returned by the Faust compiler is displayed instead of the C++ code. The Faust Online Compiler reacts the same way in the case of the SVG Diagram and Automatic Doc tabs.

2.4 Displaying the SVG Diagram of the Faust Code

When the SVG Diagram button is clicked, in the same way as for the C++ code, the Makefile created on the server calls the Faust compiler to generate an SVG block diagram of the Faust code. The result is then displayed just under the file dropping area (see figure 3) and can be browsed by clicking on the different boxes.

Figure 3: Screenshot of the SVG Diagram tab.

2.5 Displaying the Mathematical Documentation of the Faust Object

Some recent developments made at GRAME on the Faust compiler, in the frame of the ASTREE project (see Acknowledgments), make it possible to automatically generate an all-comprehensive mathematical documentation of a Faust program in the form of a complete set of LaTeX formulas and diagrams [Barkati et al., 2011]. The goals of such a mathematical self-documentation are:

1. Preservation, i.e. to preserve signal processors, independently from any computer language but only under a mathematical form;

2. Validation, i.e. to bring some help for debugging tasks, by showing the formulas as they are really computed after the compilation stage;

3. Teaching, i.e. to give a new teaching support, as a bridge between code and formulas for signal processing;

4. Publishing, i.e. to output publishing material, by preparing LaTeX formulas and SVG block diagrams easy to include in a paper.

The Faust Online Compiler is compatible with this function and can display the resulting PDF file using Google Document Viewer6. The Makefile described in 2.3 is also used here to carry out the compilation process. A screenshot of the Automatic Doc tab can be seen in figure 4.

5 http://www.andre-simon.de/doku/highlight/en/highlight.html
6 https://docs.google.com/viewer/
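As a rough illustration (the exact notation and layout produced by the documentation generator may differ), for a trivial program such as the one-slider gain sketched in section 2.1 above, the generated documentation essentially boils down to a LaTeX formula of the form

y(t) = g(t) \cdot x(t)

where x is the input signal, y the output signal and g the current value of the level slider, together with the corresponding block diagram.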

Figure 6: The catalog of examples of the Faust Online Compiler.

Figure 4: Screenshot of the Automatic Doc tab.

2.6 Executable File

When the Exec File button is clicked in the navigation bar, the C++ compilation is carried out for the selected platform and processor architecture. If this task is successful, a download button appears; otherwise the error message returned by the C++ compiler is displayed. The downloadable file will be either an executable file or a package containing several files, depending on the Faust architecture.

3 Catalog of Examples

The Faust Online Compiler has been upgraded with an interactive platform to easily share and use pieces of Faust code: the catalog of examples. It is made of two elements: the catalog itself and the example saver.

3.1 The Catalog

The catalog has the form of a file browser where each Faust code is sorted into different categories (see figure 6):

• Effects
• Faust-STK [Michon and Smith, 2011]
• Synthesizers
• Tools

A set of "user" categories editable by the users was also created.

When a Faust object is selected, a short description and a screenshot of it are displayed in the Description column. The object can then be used in the online compiler either by double-clicking on it or by pushing the Choose it! button. The catalog of examples can be displayed in every tab of the Faust Online Compiler. So, for example, someone can choose just to see the block diagram of an object or its automatically generated documentation.

3.2 The Example Saver

As said before, the catalog of examples of the Faust Online Compiler is intended to be a platform to share Faust objects between users. This is made possible by the example saver (see figure 7), which allows someone to save the Faust code they edited or dropped into the catalog. No identification is required to carry out this task: anyone can freely upload their work to the catalog. When creating a new example, users can also add a quick description of their object as well as a screenshot.

If an existing example is modified by a user different from the one who created it, every user that contributed to this example will be notified by e-mail of the changes made to their Faust code. Finally, any user can freely delete examples from the catalog. As in the case mentioned before, editors will be notified of this modification by e-mail.

In order to keep the catalog in good order and to avoid duplicates or spam, a voting system has been implemented. This can also help to promote the best user examples from the catalog.


Figure 5: Faust Online Compiler overview.

Figure 7: The example saver of the Faust Online Compiler.

4 Exporting and embedding the Faust Online Compiler

The Faust Online Compiler can be very easily embedded in any website using an iframe.

Ongoing work on the Faust compiler should also make it possible to generate JAVA applets directly usable from a web browser. Thus, the Faust Online Compiler should soon be upgraded with a new tab running a web app generated from the Faust code edited in the Faust Code section. Our hope is that the JAVA applet will be updated almost in real time, without having to restart it every time a change is made. Therefore, this feature could offer some sort of live coding possibilities.

As the Faust Online Compiler proves to be a good solution to the platform and dependency issues related to the use of architecture files in Faust, it could be interesting to establish links between it and the FaustWorks9 program. In that special case, FaustWorks would be used