Digital Music Input Rendering for Graphical Presentations in SoundStroll (MaxMSP)

Justin Kerobo, School of Computer Science and Music, Earlham College, Richmond, Indiana 47374. Email: [email protected]

Abstract—A graphical presentation is produced at the display of a host computer such that a scene description is rendered and updated by a received digital music input. The digital music input is matched to trigger events of the scene description, and the actions of each matched trigger event are executed in accordance with the action processes of the scene description, thereby updating the scene description with respect to the objects depicted in the scene on which the actions are executed. The updated scene description is then rendered. The system provides a means for connecting a graphics API to a musical instrument digital interface (e.g., MIDI) data stream, possibly from Ableton or Reason, to MaxMSP, producing a graphical presentation in SoundStroll in MaxMSP.

Keywords—MIDI, LaTeX, OpenGL, Graphics, MaxMSP, Music Technology, Ableton, FFT, IFFT, Computer Music.

I. INTRODUCTION

A variety of computer software programs are available for defining and manipulating objects in a virtual three-dimensional (3D) world. For example, "3DSMax" from Autodesk, Inc. and SolidWorks provide an assortment of tools in a convenient graphical user interface (GUI) for manipulating and editing 3D virtual objects. Programs for computer display screensavers also permit manipulation of moving images. Also very popular are computer software programs for manipulating video clips, multimedia clips, and the like, such as Aperture and ArKaos. Another popular medium that supports creativity with computers is the family of software applications built around the musical instrument digital interface (MIDI) standard.

The MIDI standard permits connection of musical instruments with digital output to related digital sound processing devices, including computers with sound cards and sound editing applications, soundboards, broadcast equipment, and the like. Music has commonly been performed with instruments that send digital MIDI data since the introduction of MIDI in approximately 1985. MIDI provides a flexible set of instructions that are sent via a serial data link from a controller to a receiver, which processes those commands in a variety of ways that pertain to the output functions of the receiving device. The data and instructions most commonly concern sounds and music, but can also include instructions for machine control and lighting control devices.

A separate branch of technology is the development of computer video graphics: the digital electronic representation and manipulation of virtual worlds comprised of three-dimensional objects in a 3D space, with applications in many fields, from microscopic imaging to galactic modeling and, notably, computer graphics for films and gaming environments. There have been a few attempts to associate the direct performance of music with computer video and graphics to create new art forms. One program, Bliss Paint for the Macintosh, used MIDI input to change colors on an evolving rendering of a fractal image. Another program, ArKaos, uses MIDI commands to play video clips in a DJ-like process. Yet another program, MaxMSP, uses MIDI commands in a flexible environment to drive video clips, audio clips, and external events.

There are many computer programs that control sound in various ways in response to a MIDI command stream. The "3DMIDI" program appears to be unsupported, and it is not clear whether the software works or ever worked. The available documentation describes a set of separate programs, each of which performs a prescribed set of transformations on an embedded set of objects in response to MIDI. Each different performance is loaded and executed separately and has its own unique tool set for making specific adjustments to the objects in that scene. There is an API that invites others to develop their own performances, each with its own unique set of objects and tools, which cannot be edited at that point.

Unfortunately, there is no convenient user interface available for interfacing computer graphics with musical instrument digital data. Conventional methods generally require cumbersome specification of input sources, scene description parameters, and data objects, and linking of input sources to scene description objects. As a result, a relatively high level of computer skill is necessary for creating graphical presentations in conjunction with music input. Creative output would improve if users could create scenes with objects and change both the objects and the nature of the interaction between the video graphics and the MIDI music data.

Because of these difficulties and this increased complexity, there is a need for a graphical user interface that supports integration with digital musical instruments and vocal recognition, and it is possible to build one through and with the Fourier Transform (FT). This interface can be created as a three-dimensional audio sequencer and spatializer in MaxMSP, together with a speech processing application, using Fast Fourier Transform (FFT), Inverse Fast Fourier Transform (IFFT), and Discrete Fourier Transform (DFT) analyses to filter for keywords, find objects, and create a scene that you can traverse.

II. HISTORY OF MIDI

The MIDI (Musical Instrument Digital Interface) protocol has become the dominant method of connecting pieces of electronic musical equipment, and when you consider the previous standard, MIDI arrived at just the right time. The control voltage (CV) and gate trigger system used on early analogue synths was severely limited in its scope and flexibility. Analogue synths tended to have very few features that could be controlled remotely, relying as they did on physical knobs and sliders, patch cables, and manual programming. Furthermore, there was no universal standard for the way CV control should work, complicating the process of interfacing products from different manufacturers. The majority of vintage CV-controlled synths can now be adapted with a CV-to-MIDI converter, so you can use MIDI to control them.

Dave Smith, founder of Californian synth legend Sequential Circuits and now head of Dave Smith Instruments, anticipated the demand for a more powerful universal protocol and developed the first version of the MIDI standard, which was released in 1983. With the increasing complexity of synths, and as the music industry shifted towards digital technology and computer-based studios, MIDI took off and became the standard for connecting equipment.

A. How It Works

Absolutely no sound is sent via MIDI, just digital signals known as event messages, which instruct pieces of equipment. The most basic example can be illustrated by considering a controller keyboard and a sound module. When you push a key on the keyboard, the controller sends an event message that corresponds to that pitch and tells the sound module to start playing the note. When you let go of the key, the controller sends a message to stop playing the note.

Of course, the MIDI protocol allows for control over more than just when a note should be played. Essentially, a message is sent each time some variable changes, whether it be note-on/off (including, of course, exactly which note it is), velocity (determined by how hard you hit the key), after-touch (how hard the key is held down), pitch-bend, pan, modulation, volume, or any other MIDI-controllable function.

The protocol supports a total of 128 notes (from C five octaves below middle C through to G ten octaves higher), 16 channels (so that 16 separate devices can be controlled per signal chain, or multiple devices can be assigned the same channel so they respond to the same input), and 128 programs (corresponding to patches or voice/effect setting changes). MIDI signals also include built-in clock pulses, which define the tempo of the track and allow basic timing synchronization between equipment.

The other major piece of the jigsaw is the SysEx (System Exclusive) message, designed so that manufacturers could utilize MIDI to control features specific to their own equipment. In order to control a SysEx function, a manufacturer-specific ID code is sent. Equipment that is not set up to recognize that particular code will ignore the rest of the message, while devices that do recognize it will continue to listen. SysEx messages are usually used for tasks such as loading custom patches and are typically recorded into a sequencer using a 'SysEx Dump' feature on the equipment.

MIDI information was originally sent over a screened twisted-pair cable (two signal wires plus an earthed shield to protect them from interference) terminated with 5-pin DIN plugs. However, this format has been superseded to some extent by USB connections, as discussed later. No waves or varying voltages are transmitted, since MIDI data is sent digitally: the signal pins either carry a voltage or none at all, corresponding to the binary logical values 1 and 0. These binary digits (bits) are combined into 8-bit messages. The protocol supports data rates of up to 31,250 bits per second. Each MIDI connection sends information in one direction only, meaning two cables are needed if a device is used both to send and to receive data (unless you are working over USB).

In addition to the expected IN and OUT connections, most MIDI devices also have a THRU port. This simply repeats the signal received at the IN port so it can be sent on to other devices further down the chain. Devices may be connected in series and, for the largest MIDI setups, an interface with multiple output ports may be used to control more than 16 separate chained devices.
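The note and controller messages described above share a simple byte-level layout: a status byte carrying the message type and channel, followed by one or two data bytes. The following minimal Python sketch is an added illustration of that general channel-message format; it is not code from SoundStroll, and the function and variable names are invented for this example.

# Illustrative sketch (not from the paper): decoding a 3-byte MIDI
# channel voice message into its note-on/note-off fields.

NOTE_OFF, NOTE_ON = 0x8, 0x9  # upper nibble of the status byte

def decode_channel_message(msg: bytes):
    """Decode a 3-byte MIDI message such as 0x90 0x3C 0x64."""
    status, data1, data2 = msg
    kind = status >> 4          # message type (0x9 = note on, 0x8 = note off)
    channel = status & 0x0F     # one of the 16 MIDI channels (0-15)
    if kind == NOTE_ON and data2 > 0:
        return {"event": "note_on", "channel": channel,
                "note": data1, "velocity": data2}
    if kind == NOTE_OFF or (kind == NOTE_ON and data2 == 0):
        return {"event": "note_off", "channel": channel, "note": data1}
    return {"event": "other", "status": hex(status)}

# Middle C (note 60) struck at velocity 100 on MIDI channel 1 (index 0):
print(decode_channel_message(bytes([0x90, 60, 100])))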
B. Becoming a Standard

The key feature of MIDI when it was launched was its efficiency: it allowed a relatively significant amount of information to be transmitted using only a small amount of data. Given the limitations of early '80s digital data transmission methods, this was essential to ensure that the reproduction of musical timing was sufficiently accurate.

Manufacturers quickly adopted MIDI, and its popularity was cemented by the arrival of MIDI-compatible computer hardware (most notably the built-in MIDI ports of the Atari ST, which was released in 1985). As weaknesses or potential extra features were identified, the MIDI Manufacturers Association updated the standard regularly following its first publication. The most notable updates - Roland MT-32 (1987), General MIDI (1991) and GM2 (1999), Roland GS (1991), and Yamaha XG (1997-99) - added further features or standards, generally without making previous ones obsolete. It is questionable just how relevant the majority of these standards are to digital musicians and producers, since most of them relate in large part to standardizing the playback of music distributed in the MIDI format. Unless you intend to distribute your music as MIDI files, most of them probably won't affect you.

C. Right On Time

The most common criticisms of the MIDI protocol relate to timing issues. Although MIDI was efficient by the standards of the early '80s, it is still undeniably flawed to some extent. There is some degree of jitter (variation in timing) present in MIDI, resulting in discernible sloppiness in recording and playback.

Perhaps even more obvious to most of us is latency, the delay between triggering a function (such as a sound) via MIDI and the function being carried out (in this case the sound being reproduced). The more information sent via MIDI, the more latency is created. It may only be in the order of milliseconds, but it is enough to become noticeable to the listener.

Even more problematic is the fact that most of us use MIDI in a computer-based studio, and each link in the MIDI and audio chain could potentially add to the latency. This could be due either to software (drivers, DAWs, soft synths) or to hardware (RAM, hard drives, processors), but the end result is sloppy timing. The blame cannot be laid entirely at the door of MIDI, but the weaknesses of multiple pieces of MIDI equipment combined with all the other sources of timing error can have a significant detrimental effect on the end result.

Most new MIDI equipment is supplied not only with traditional 5-pin DIN connections but also with standard Type A or B USB ports that allow direct connection to your computer. However, USB is not the solution to all your MIDI timing problems. Despite the higher data transfer rates possible over USB, latency is actually higher than over a standard DIN-based MIDI connection. Furthermore, jitter is significantly higher when using MIDI over USB, leading to unpredictable inaccuracies in timing.

D. Beyond MIDI

It is clear that while MIDI has been massively important to the development of music technology over the last 25 years, it does come with a few major weaknesses. One heavily researched alternative, the Zeta Instrument Processor Interface protocol proposed in the mid-'90s, failed to gain support from manufacturers and never saw commercial release. However, the same development team helped to develop the OpenSound Control (OSC) protocol used by the likes of Native Instruments' Reaktor and Traktor and the Max/MSP and SuperCollider development environments.

OSC is a much higher-bandwidth system which overcomes many of the timing issues of MIDI, most notably by transmitting information with built-in timing messages as quickly as possible through high-bandwidth connections rather than relying on the real-time event messages used by MIDI devices, which simply assume that timing is correct and respond to each message as soon as it is received.

One significant barrier to the development of a universal protocol for contemporary music equipment is that there is so much variation between equipment. With so many different synthesis methods, programming systems, levels of user control, and forms of sound manipulation available on different pieces of gear, it is unlikely that any universal system for their control is possible. However, as computer processing and interfacing technologies have developed so rapidly since the early '80s, perhaps the solution lies not with updating or replacing MIDI but rather with placing greater onus on manufacturers and software developers to come up with their own powerful proprietary DAW-based control systems operating via existing USB, FireWire, or even Ethernet connections, or wirelessly.

III. HISTORY OF CAD

Modern engineering design and drafting can be traced back to the development of descriptive geometry in the 16th and 17th centuries. Drafting methods improved with the introduction of drafting machines, but the creation of engineering drawings changed very little until after World War II. During the war, considerable work was done in the development of real-time computing, particularly at MIT, and by the 1950s there were dozens of people working on numerical control of machine tools and on automating engineering design. But it is the work of two people in particular, Patrick Hanratty and Ivan Sutherland, that is largely credited with setting the stage for what we know today as CAD, or Computer-Aided Design.

A. The Fathers of CAD

Hanratty is widely credited as "the Father of CADD/CAM." In 1957, while working at GE, he developed PRONTO (Program for Numerical Tooling Operations), the first commercial CNC programming system. Five years later, Sutherland presented his Ph.D. thesis at MIT, titled "Sketchpad, A Man-Machine Graphical Communication System." Among its features was the first graphical user interface, using a light pen to manipulate objects displayed on a CRT.

The 1960s brought other developments, including the first digitizer (from Auto-trol) and DAC-1, the first production interactive graphics manufacturing system. By the end of the decade, a number of companies had been founded to commercialize their fledgling CAD programs, including SDRC, Evans and Sutherland, Applicon, Computervision, and M and S Computing.

By the 1970s, research had moved from 2D to 3D. Major milestones included the work of Ken Versprille, whose invention of NURBS for his Ph.D. thesis formed the basis of modern 3D curve and surface modeling, and the development by Alan Grayer, Charles Lang, and Ian Braid of the PADL (Part and Assembly Description Language) solid modeler.

With the emergence of UNIX workstations in the early '80s, commercial CAD systems like CATIA and others began showing up in aerospace, automotive, and other industries. But it was the introduction of the first IBM PC in 1981 that set the stage for the large-scale adoption of CAD. The following year, a group of programmers formed Autodesk, and in 1983 released AutoCAD, the first significant CAD program for the IBM PC.

B. The CAD Revolution

AutoCAD marked a huge milestone in the evolution of CAD. Its developers set out to deliver 80 percent of the functionality of the other CAD programs of the day for 20 percent of their cost. From then on, increasingly advanced drafting and engineering functionality became more affordable. But it was still largely 2D.

That changed in 1987 with the release of Pro/ENGINEER, a CAD program based on solid geometry and feature-based parametric techniques for defining parts and assemblies. It ran on UNIX workstations (PCs of the time were simply not powerful enough), but it was a game changer. The later years of the decade saw the release of several 3D modeling kernels, most notably ACIS and Parasolids, which would form the basis for other history-based parametric CAD programs.

C. CAD Today, CAD Tomorrow

The modern CAD era has been marked by improvements in modeling, the incorporation of analysis, and the management of the products we create, from conception and engineering to manufacturing, sales, and maintenance (what has become known as PLM, product lifecycle management). "Engineers and designers are being asked to create more, faster, and with higher quality," says Bill McClure, vice president of product development at Siemens PLM. With all of this pressure on engineers and designers, what is the next big evolution in CAD?

IV. MAXMSP AND ITS USES

A. Introduction

Max, also known as Max/MSP/Jitter, is a visual programming language for music and multimedia developed and maintained by the San Francisco-based software company Cycling '74. Over its more than thirty-year history, it has been used by composers, performers, software designers, researchers, and artists to create recordings, performances, and installations.

The Max program is modular, with most routines existing as shared libraries. An application programming interface (API) allows third-party development of new routines (named external objects). Thus, Max has a large user base of programmers unaffiliated with Cycling '74 who enhance the software with commercial and non-commercial extensions to the program. Because of this ingenious extensible design, which simultaneously represents both the program's structure and its graphical user interface (GUI), Max has been described as the lingua franca for developing interactive music performance software.

B. History

1) 1980s: Miller Puckette began work on Max in 1985 at the Institut de Recherche et Coordination Acoustique/Musique (IRCAM) in Paris. Originally called The Patcher, this first version provided composers with a graphical interface for creating interactive computer music scores on the Macintosh. At this point in its development, Max could not perform its own real-time sound synthesis in software, but instead sent control messages to external hardware synthesizers and samplers using MIDI or a similar protocol. Its earliest widely recognized use in composition was for Pluton, a 1988 piano and computer piece by Philippe Manoury; the software synchronized a computer to a piano and controlled a Sogitec 4X for audio processing.

In 1989, IRCAM developed Max/FTS ("Faster Than Sound"), a version of Max ported to the IRCAM Signal Processing Workstation (ISPW) for the NeXT. Also known as "Audio Max", it would prove a forerunner to Max's MSP audio extensions, adding the ability to do real-time synthesis using an internal hardware digital signal processor (DSP) board. The same year, IRCAM licensed the software to Opcode Systems.

2) 1990s: Opcode launched a commercial version named Max in 1990, developed and extended by David Zicarelli. However, by 1997, Opcode was considering cancelling it. Instead, Zicarelli acquired the publishing rights and founded a new company, Cycling '74, to continue commercial development. The timing was fortunate, as Opcode was acquired by Gibson Guitar in 1998 and ended operations in 1999.

IRCAM's in-house Max development was also winding down; the last version produced there was jMax, a direct descendant of Max/FTS developed in 1998 for Silicon Graphics (SGI) and later for Linux systems. It used Java for its graphical interface and C for its real-time backend and was eventually released as open-source software. Meanwhile, Puckette had independently released a fully redesigned open-source composition tool named Pure Data (Pd) in 1996, which, despite some underlying engineering differences from the IRCAM versions, continued in the same tradition. Cycling '74's first Max release, in 1997, was derived partly from Puckette's work on Pure Data. Called Max/MSP ("Max Signal Processing", or the initials of Miller Smith Puckette), it remains the most notable of Max's many extensions and incarnations: it made Max capable of manipulating real-time digital audio signals without dedicated DSP hardware. This meant that composers could now create their own complex synthesizers and effects processors using only a general-purpose computer like the Macintosh PowerBook G3. In 1999, the Netochka Nezvanova collective released nato.0+55, a suite of externals that added extensive real-time video control to Max.
3) 2000s: Though nato became increasingly popular among multimedia artists, its development stopped in 2001. SoftVNS, another set of extensions for visual processing in Max, was released in 2002 by Canadian media artist David Rokeby. Cycling '74 released their own set of video extensions, Jitter, alongside Max 4 in 2003, adding real-time video, OpenGL graphics, and matrix processing capabilities. Max 4 was also the first version to run on Windows. Max 5, released in 2008, redesigned the patching GUI for the first time in Max's commercial history.

4) 2010s: In 2011, Max 6 added a new audio engine compatible with 64-bit operating systems, integration with the Ableton Live sequencer software, and an extension called Gen, which can compile optimized Max patches for higher performance. Max 7, the most recent major version, was released in 2014 and focused on 3D rendering improvements. On June 6, 2017, Ableton announced its purchase of Cycling '74, with Max continuing to be published by Cycling '74 and David Zicarelli remaining with the company. Programs sharing Max's visual programming concepts are now commonly used for real-time audio and video synthesis and processing.

C. Language

Max is named after composer Max Mathews, and can be considered a descendant of his MUSIC language, though its graphical nature disguises that fact. Like most MUSIC-N languages, Max distinguishes between two levels of time: that of an event scheduler and that of the DSP (this corresponds to the distinction between k-rate and a-rate processes in Csound, and between control rate and audio rate in SuperCollider).

The basic language of Max and its sibling programs is that of a data-flow system: Max programs (named patches) are made by arranging and connecting building blocks of objects within a patcher, or visual canvas. These objects act as self-contained programs (in reality, they are dynamically linked libraries), each of which may receive input (through one or more visual inlets), generate output (through visual outlets), or both. Objects pass messages from their outlets to the inlets of connected objects.

Max supports six basic atomic data types that can be transmitted as messages from object to object: int, float, list, symbol, bang, and signal (for MSP audio connections). Several more complex data structures exist within the program for handling numeric arrays (table data), hash tables (coll data), XML information (pattr data), and JSON-based dictionaries (dict data). An MSP data structure (buffer~) can hold digital audio information within program memory. In addition, the Jitter package adds a scalable, multi-dimensional data structure for handling large sets of numbers for storing video and other datasets (matrix data).

Max is typically learned through acquiring a vocabulary of objects and how they function within a patcher; for example, the metro object functions as a simple metronome, and the random object generates random integers. Most objects are non-graphical, consisting only of an object's name and several arguments-attributes (in essence, class properties) typed into an object box. Other objects are graphical, including sliders, number boxes, dials, table editors, pull-down menus, buttons, and other objects for running the program interactively. Max/MSP/Jitter comes with about 600 of these objects as the standard package; extensions to the program can be written by third-party developers as Max patchers (e.g., by encapsulating some of the functionality of a patcher into a sub-program that is itself a Max patch), or as objects written in C, C++, Java, or JavaScript.

The order of execution for messages traversing the graph of objects is defined by the visual organization of the objects in the patcher itself. As a result of this organizing principle, Max is unusual in that the program logic and the interface as presented to the user are typically related, though newer versions of Max provide several technologies for more standard GUI design.

Max documents (named patchers) can be bundled into stand-alone applications and distributed free or sold commercially. In addition, Max can be used to author audio and MIDI plugin software for Ableton Live through the Max for Live extension. With the increased integration of laptop computers into live music performance, Max/MSP and Max/Jitter have received attention as a development environment available to those serious about laptop music/video performance.
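Since Max patches are visual rather than textual, they cannot be quoted directly here; the following hypothetical Python sketch is only a rough analogy (not Max code) for the data-flow idea described above, with a metro-like object sending a bang to a random-like object whose output is printed.

# Rough Python analogy (not Max code) of a tiny metro -> random -> print
# patch: objects receive messages on an inlet and pass results downstream.
import random, time

class Metro:
    """Emits a 'bang' at a fixed interval, loosely analogous to Max's metro object."""
    def __init__(self, interval_ms, outlet):
        self.interval = interval_ms / 1000.0
        self.outlet = outlet  # the "patch cord": a callable receiving each message
    def run(self, ticks):
        for _ in range(ticks):
            self.outlet("bang")        # send a bang out of the outlet
            time.sleep(self.interval)

class RandomInt:
    """On receiving 'bang', outputs a random integer, like Max's random object."""
    def __init__(self, maximum, outlet):
        self.maximum = maximum
        self.outlet = outlet
    def inlet(self, msg):
        if msg == "bang":
            self.outlet(random.randrange(self.maximum))

# "Patch cords": metro -> random -> print
rand = RandomInt(128, print)
Metro(250, rand.inlet).run(4)  # four bangs produce four random values in 0-127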
V. SOUNDSTROLL

SoundStroll is a kind of 3D audio sequencer: a tool for placing sounds in a virtual landscape (a very bare landscape) constructed with Jitter OpenGL, and for triggering them and spatializing them around you as you take a stroll in this virtual sound world. In other words, SoundStroll is a tool for placing sounds in an open 3D virtual landscape and triggering them as you take a stroll through your soundscape. SoundStroll exists as a MaxMSP project, which is to say a set of MaxMSP patches; as such it should be compatible with Windows and Mac OSX (it was made with Max 6.1.9 on OSX 10.6.8). It is intended to be free and open source (though one still needs a MaxMSP 6 licence to modify the sources) under the terms of the Creative Commons Attribution-NonCommercial licence (CC BY-NC).

VI. HISTORY OF SPEECH RECOGNITION SOFTWARE

Speech recognition software (or speech recognition technology) enables phones, computers, tablets, and other machines to receive, recognize, and understand human utterances. It uses natural language as input to trigger an action, enabling our devices to respond to our spoken commands. The technology is being used to replace other, more 'overused' methods of input like typing, texting, and clicking. This turns out to be slightly ironic, given that texting has become the norm over voice.

In this day and age, speech recognition can be found in anything and everything, from cars with Bluetooth connections, to asking Google to search for 'spaghetti', to processing speech over Microsoft Skype connections, and much more. The ability to talk to our devices has expanded to encompass the vast majority of the technology that we use in our daily lives.

A. 1950s and 60s

The first speech recognition systems were focused on numbers, not words. In 1952, Bell Laboratories designed the "Audrey" system, which could recognize a single voice speaking digits aloud. Ten years later, IBM introduced "Shoebox", which understood and responded to 16 words in English. Across the globe, other nations developed hardware that could recognize sound and speech, and by the end of the '60s the technology could support words with four vowels and nine consonants.

B. 1970s

Speech recognition made several meaningful advancements in this decade, mostly due to the US Department of Defense and DARPA. The Speech Understanding Research (SUR) program they ran was one of the largest of its kind in the history of speech recognition. Carnegie Mellon's "Harpy" speech system came from this program and was capable of understanding over 1,000 words, which is about the same as a three-year-old's vocabulary. Also significant in the '70s was Bell Laboratories' introduction of a system that could interpret multiple voices.

C. 1980s

The '80s saw speech recognition vocabulary go from a few hundred words to several thousand words. One of the breakthroughs came from a statistical method known as the Hidden Markov Model (HMM). Instead of just using words and looking for sound patterns, the HMM estimated the probability of the unknown sounds actually being words.

D. 1990s

Speech recognition was propelled forward in the '90s in large part because of the personal computer. Faster processors made it possible for software like Dragon Dictate to become more widely used. Bell South introduced the voice portal (VAL), which was a dial-in interactive voice recognition system. This system gave birth to the myriad of phone tree systems that are still in existence today.

E. 2000s

By the year 2001, speech recognition technology had achieved close to 80 percent accuracy. For most of the decade there was not a lot of advancement, until Google arrived with the launch of Google Voice Search. Because it was an app, this put speech recognition into the hands of millions of people. It was also significant because the processing power could be offloaded to Google's data centers. Not only that, Google was collecting data from billions of searches, which could help it predict what a person is actually saying. At the time, Google's English Voice Search system included 230 billion words from user searches.

F. 2010s

In 2011, Apple launched Siri, which was similar to Google Voice Search. The early part of this decade saw an explosion of other voice recognition apps, and with Amazon's Alexa and Google Home we have seen consumers becoming more and more comfortable talking to machines. Today, some of the largest tech companies are competing for the speech accuracy title. In 2016, IBM achieved a word error rate of 6.9 percent. In 2017, Microsoft beat IBM with a claimed 5.9 percent. Shortly after that, IBM improved their rate to 5.5 percent. However, it is Google that claims the lowest rate, at 4.9 percent.

VII. DIAGRAMS AND EXPLANATIONS

Fig. 1. Overall Description

The processing is illustrated by the flow diagram of FIG. 1, which shows that a host computer system processes a scene as specified by a scene description, thereby beginning execution of the scene description. The scene processing comprises loading a scene description into the working memory of the computer. This processing is represented by the flow diagram box numbered 102. The scene description defines one or more objects located in the three-dimensional space of the rendered scene, as specified by the scene description. The system then monitors, or listens, for input from two sources at the host computer: user interface (UI) events from the user at the host computer (box 104) and digital music input received from a port of the host computer (box 106). The UI events may comprise events such as mouse or computer keyboard activation by the user, graphics tablet input, and the like. The musical instrument digital interface input can comprise signals received from a digital musical instrument connected to a suitable port of the host computer or from a stored digital music file.

The processed UI events from the user (at box 104 of FIG. 1) can comprise a variety of user input at the host computer that relates to the scene description and modifies the rendered scene in accordance with the user input. Examples of UI events from the user include playback controls, by which the user can halt operation of the rendering and close the scene description. The user also can launch a scene description editor application, which provides a graphical user interface (GUI) through which the user can manipulate and change values in the scene description to be rendered, thereby affecting the scene that will be rendered. The user-editable scene description parameters are described in more detail below.

The digital music input (at box 106 of FIG. 1) may comprise, for example, input received over an interface that is compatible with the MMA interface, wherein MMA is the MIDI (Musical Instrument Digital Interface) Manufacturers Association protocol specification. Those skilled in the art will appreciate that a variety of musical instrument digital interfaces may be used, although the MIDI standard of the MMA is the most well-known and widely used for digital music representation. A wide variety of electronic musical instruments can be supported, including synthesizers that produce MIDI command streams for electronic piano, drums, guitar, and the like. Thus, the digital music input at box 106 can comprise a MIDI command stream that is delivered live (that is, in response to activation in real time) or delivered serially from a conventional MIDI file.

Those skilled in the art will appreciate that a MIDI command stream can produce sounds that are triggered from a MIDI-enabled sound engine that receives MIDI commands as control inputs and that can produce corresponding sounds and musical notes. Such sounds and musical notes can be stored as *.wav, *.aiff, *.mp3 files, and the like. Other digitally encoded audio files can be used for input as well, and such audio files can be easily played through digital media players. Moreover, musical interfaces such as the MMA MIDI interface can interact with graphical interfaces in real time as a digital instrument is played. For example, the illustrated embodiment utilizes graphics control through the DirectX interface, but OpenGL or any other graphics API could also be supported. Those skilled in the art will understand the integration details for such interaction, in view of the description herein. In the description herein, a MIDI input stream will be assumed for the digital music input, unless otherwise indicated. That is, references to "MIDI" input will be understood to include all varieties of digital music input described herein, unless otherwise indicated.

After the user UI events and MIDI port events are processed, the system updates the scene (box 108). Next, at box 109, the scene is rendered, meaning that the applicable video and audio output is generated. Lastly, if no halt instruction or the like is received at box 110, execution continues by returning to listening for, and processing, input from the user (box 104) and the musical instrument (box 106).

Fig. 2. User Input

FIG. 2 shows additional details of processing the user input. First, at box 202, conventional computer operating system listening is performed to await input received from the user interface. When user input is received that changes the scene description, such as object manipulation commands or changes in MIDI processing parameters, the scene description in memory that is being executed (rendered) is changed in accordance with that input. This is represented in FIG. 2 by the box 204. The system continues to listen for additional user events, as indicated at box 206. As noted above, execution of the scene description and listening for further user input continue, as indicated by the return from box 210 to box 202, until execution is halted by a user input.

Fig. 3. Digital Music Input

FIG. 3 shows additional details of processing the musical instrument digital interface (MIDI) input. First, at box 302, the host computer receives the MIDI input at a sound card through a MIDI port. The system then matches the digital music input to trigger events of the scene description, as indicated by box 304. The trigger events correspond to musical instrument actions that generate music note events, such as piano keys that are struck, guitar strings that are plucked, drum surfaces that are hit, and the like. The trigger events comprise MIDI commands such as the output from a synthesizer or other electronic music instrument. As noted above, a MIDI command stream can be played through a MIDI-enabled sound engine and can be stored as audio data in such common formats as Windows Media Player or RealPlayer files, including music files such as *.WAV or *.AIFF files and the like.

Each trigger event is associated with process functions that are specified in the scene description. At box 306, the process functions are executed, thereby producing changes to the defined objects in the rendered scene. As noted previously, the scene description is updated per the digital music events and process functions, and the updated scene is rendered, while digital music input listening continues. This processing is indicated by box 310.

A variety of actions associated with the process functions may be carried out. For example, the actions may specify collisions between two or more objects of the scene description, and can include the explosion of one or more objects of the scene description, or other movements of the objects in the scene description. The actions can be specified by user input so as to permit changes in the speed, size, movement, color, and behavior of the scene objects.
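As an illustration only, the following hypothetical Python sketch outlines the kind of listen-match-execute-render loop that FIG. 1 and FIG. 3 describe; the names and data structures are invented for this sketch and are not taken from the actual SoundStroll patches.

# Hypothetical sketch of the FIG. 1 / FIG. 3 processing loop; names are
# illustrative and not taken from the actual SoundStroll implementation.

class Scene:
    def __init__(self, triggers):
        # triggers maps a (channel, note) pair to a list of action functions
        self.triggers = triggers
        self.objects = {}

def run_scene(scene, midi_events, render, halt_requested):
    """Boxes 104-110: listen for MIDI, match trigger events, update, render."""
    for channel, note, velocity in midi_events:          # boxes 302/304
        for action in scene.triggers.get((channel, note), []):
            action(scene, velocity)                      # boxes 306/310
        render(scene)                                    # box 109
        if halt_requested():                             # box 110
            break

# Example: a note-on for note 60 on channel 0 rescales a cube by velocity.
def scale_cube(scene, velocity):
    scene.objects["cube"] = {"scale": velocity / 127.0}

scene = Scene({(0, 60): [scale_cube]})
run_scene(scene,
          midi_events=[(0, 60, 100)],
          render=lambda s: print("render", s.objects),
          halt_requested=lambda: True)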

Fig. 4. Exemplary Host Computer

FIG. 4 is a block diagram of an exemplary host computer 1200 that performs the processing described herein. The computer includes a processor 1202, such as a general-purpose computer chip and ancillary components, as provided in conventional personal computers, workstations, and the like that are generally available. Through the processor 1202, the computer executes program instructions to carry out the operations described herein. The processor communicates with the other components of the computer over a system bus 1203 for data exchange and operations. The processor can operate with a sound card 1204 that processes digital music data, such as a digital music input data stream received from a digital music input device, including a music synthesizer and the like, and can produce audio (sound) output 1205.

The processor 1202 also responds to input devices 1206 that receive user input, including such input devices as a computer keyboard, mouse, and other similar devices. The computer includes memory 1208, typically provided as volatile (dynamic) memory, for storing program instructions, operating data, and so forth. The datastore 1210 is typically non-volatile memory, such as data disks or disk arrays.

The computer can also include a program product reader 1212 that receives externally accessible media 1214, such as flash drives, optical media discs, and the like. Such media 1214 can include program instructions, comprising program products, that can be read by the reader 1212 and executed by the processor 1202 to provide the operation described herein. The processor uses a graphics or video card 1216 to visually render the objects in a scene description according to the digital music input received through the sound card 1204. The visually rendered graphics output can be viewed at a display device 1218, such as visual display devices and the like. The sound output 1205 and the rendered graphics output 1218 together comprise the rendered scene output, providing a multimedia presentation.

VIII. THE FREQUENCY DOMAIN

A. The DFT, FFT, and IFFT

The most common tools used to perform Fourier analysis and synthesis are called the Fast Fourier Transform (FFT) and the Inverse Fast Fourier Transform (IFFT). The FFT and IFFT are optimized (very fast) computer-based algorithms that perform a generalized mathematical process called the Discrete Fourier Transform (DFT). The DFT is the actual mathematical transformation that the data go through when converted from one domain to another (time to frequency). Put more simply, the DFT is just a slower version of the FFT.

FFTs, IFFTs, and DFTs became really important to a lot of disciplines when engineers figured out how to take samples quickly enough to generate enough data to re-create sound and other analog phenomena digitally. But they do not just work on sounds; they work on any continuous signal (images, radio waves, seismographic data, etc.).

An FFT of a time-domain signal takes the samples and gives us a new set of numbers representing the frequencies, amplitudes, and phases of the sine waves that make up the sound we have analyzed. It is these data that are displayed in sonograms.

Fig. 5. First 16 Bins of an FFT Frame (Amplitudes)

Figure 5 shows the first 16 bins of a typical FFT analysis after the conversion is made from real and imaginary numbers to amplitude/phase pairs. The phases are left out because it is hard to make up a bunch of arbitrary phases between 0 and 2π; in a lot of cases you might not need them (and in a lot of cases you would!). In this case, the sample rate is 44.1 kHz and the FFT size is 1,024, so the bin width (in frequency) is the Nyquist frequency (44,100/2 = 22,050 Hz) divided by the number of bins (1,024/2 = 512), or about 43 Hz.

Amplitude values are assumed to be between 0 and 1; notice that they are quite small because they all must sum to 1, and there are a lot of bins! The numbers are not real measurements, but they are made up to represent a sound that has a simple, more or less harmonic structure, with a fundamental somewhere in the 66 Hz to 88 Hz range (you can see its harmonics at around 2, 3, 4, 5, and 6 times its frequency, and note that the harmonics decrease in amplitude more or less like they would in a sawtooth wave).
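The following short NumPy sketch (an added illustration, assuming NumPy is available; it is not part of the original text) computes one such analysis frame with the same numbers used above, a 1,024-sample frame at 44.1 kHz, and prints the amplitudes of the first 16 bins.

# Illustrative NumPy sketch of one FFT analysis frame, using the same
# numbers as the text: 44.1 kHz sample rate, 1,024-sample frame.
import numpy as np

sample_rate = 44_100
frame_size = 1_024
bin_width = sample_rate / frame_size              # about 43 Hz per bin

t = np.arange(frame_size) / sample_rate
frame = 0.6 * np.sin(2 * np.pi * 86.1 * t)        # test tone near bin 2

spectrum = np.fft.rfft(frame)                     # complex bins 0..512
amplitudes = np.abs(spectrum) / (frame_size / 2)  # rough 0..1 scaling
phases = np.angle(spectrum)                       # phases, if needed

for k in range(16):                               # the "first 16 bins"
    print(f"bin {k:2d}  {k * bin_width:7.1f} Hz  amplitude {amplitudes[k]:.3f}")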

1) How the FFT Works: The Fast Fourier Transform in a Nutshell: Computing Fourier Coefficients. Here is a little three-step procedure for digital sound processing:

1) Window
2) Periodicize
3) Fourier transform (this also requires sampling, at a rate equal to 2 times the highest frequency required); you do this with the FFT.

The following is an illustration of steps 1 and 2. Here is the graph of a (periodic) function, f(t). (Note that f(t) need not be a periodic function.)

Fig. 6. Graph of a (periodic) function, f(t)

Look at the portion of the graph between 0 ≤ t ≤ 1. The following is a graph of the window function we need to use. The function is called w(t). Note that w(t) equals 1 only in the interval 0 ≤ t ≤ 1 and is 0 everywhere else.

Fig. 7. w(t)

In step 1, you window the function. In Figure 7, both the window function w(t) (which is nonzero in the region of interest) and the function f(t) are plotted in the same picture.

Fig. 8. f(t)*w(t)

In Figure 8, f(t)*w(t) is plotted: the periodic function multiplied by the windowing function. From this figure, it is obvious what part of f(t) is the area of interest. In step 2, you need to periodically extend the windowed function, f(t)*w(t), all along the t-axis.

Fig. 9. f(t)*w(t)

Fig. 10. f(t)*w(t)

You now have a periodic function, and the Fourier theorem says you can represent this function as a sum of sines and cosines. This is step 3. You can also use other, non-square windows; this is done to ameliorate the effect of the square window on the frequency content of the original signal.
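As a small added example (not from the original text, and assuming NumPy), the non-square window idea can be tried with a Hann window, one common choice: the frame is multiplied by w(t) before the FFT, which tames the artifacts a square window would introduce.

# Illustrative sketch of step 1 with a non-square (Hann) window: taper the
# frame's edges before the FFT to soften the effect of a square window.
import numpy as np

sample_rate, frame_size = 44_100, 1_024
t = np.arange(frame_size) / sample_rate
f = np.sin(2 * np.pi * 440.0 * t)         # f(t): the signal inside the frame

w = np.hanning(frame_size)                # w(t): nonzero only inside the frame
windowed = f * w                          # f(t) * w(t), as in Figures 7 and 8

square_spectrum = np.abs(np.fft.rfft(f))         # implicit square window
hann_spectrum = np.abs(np.fft.rfft(windowed))    # Hann-windowed analysis
print("peak bin (square window):", int(np.argmax(square_spectrum)))
print("peak bin (Hann window):  ", int(np.argmax(hann_spectrum)))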

B. The DFT, FFT, and IFFT

Now, once you have a periodic function, all you need to do is figure out, using the FFT, what the component sine waves of that waveform are. It is possible to represent any periodic waveform as a sum of phase-shifted sine waves. In theory, the number of component sine waves is infinite; there is no limit to how many frequency components a sound might have. In practice, you need to limit it to some predetermined number, and this limit has a serious effect on the accuracy of the analysis.

Here is how that works: rather than looking for the frequency content of the sound at all possible frequencies (an infinitely large number: 100.000000001 Hz, 100.000000002 Hz, 100.000000003 Hz, etc.), you divide the frequency spectrum into a number of frequency bands and call them bins. The size of these bins is determined by the number of samples in the analysis frame (the chunk of time mentioned above). The number of bins is given by the formula:

number of bins = frame size / 2

1) Frame Size: For example, decide on a frame size of 1,024 samples. This is a common choice because most FFT algorithms in use for sound processing require a number of samples that is a power of two, and it is important not to take too much or too little of the sound. A frame size of 1,024 samples gives us 512 frequency bands. Assuming a sample rate of 44.1 kHz, we know that we have a frequency range (remember the Nyquist theorem) of 0 kHz to 22.05 kHz. To find out how wide each of the frequency bins is, use the following formula:

bin width = frequency range / number of bins

This formula gives us a bin width of about 43 Hz. Remember that frequency perception is logarithmic, so 43 Hz gives us worse resolution at the low frequencies and better resolution at higher frequencies. By selecting a certain frame size and its corresponding bandwidth, you avoid the problem of having to compute an infinite number of frequency components in a sound. Instead, you just compute one component for each frequency band.
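Plugging the numbers from this subsection into those two formulas (an added worked example, not part of the original text):

# Worked numbers for the two formulas above, using the values in the text.
sample_rate = 44_100                              # Hz
frame_size = 1_024                                # samples per analysis frame
number_of_bins = frame_size // 2                  # = 512
bin_width = (sample_rate / 2) / number_of_bins    # 22,050 Hz / 512 ≈ 43 Hz
print(number_of_bins, round(bin_width, 1))        # prints: 512 43.1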
2) Software That Uses the FFT: There are many software packages available that will do FFTs and IFFTs of your data for you and then let you mess around with the frequency content of a sound.

Fig. 11. Example of a commonly used FFT-based program: the phase vocoder menu from Tom Erbe's SoundHack. Note that the user is allowed to select (among several other parameters) the number of bands in the analysis. This means that the user can customize what is called the time/frequency resolution trade-off of the FFT.

Fig. 12. Another way to look at the frequency spectrum is to remove time as an axis and just consider a sound as a histogram of frequencies. Think of this as averaging the frequencies over a long time interval. This kind of picture (where there is no time axis) is useful for looking at a short-term snapshot of a sound (often just one frame), or perhaps even for trying to examine the spectral features of a sound that does not change much over time (because all we see are the "averages").

In a histogram like the one in Figure 12, the y-axis tells us the amplitude of each component frequency. Looking at just one frame of an FFT, you usually assume a periodic, unchanging signal. A histogram is generally most useful for investigating the steady-state portion of a sound. (Figure 12 is a screen shot from SoundHack.)

IX. TIMELINE

1) Sept 10, 2018 - Oct 01, 2018: Research and Editing
2) Oct 01, 2018 - Nov 08, 2018: SDK/Implementation
3) Nov 08, 2018 - Jan 15, 2019: Data Collection
4) Jan 15, 2019 - Feb 12, 2019: Implementation/Testing
5) Feb 12, 2019 - Mar 12, 2019: Debugging
6) Mar 12, 2019 - Apr 02, 2019: Review/Final Debugging

X. CONCLUSION

In accordance with embodiments of the software, a graphical scene representation is produced at the display of a host computer such that a scene description is rendered and updated by a received digital music input, which can include keywords gathered from a specific song or from vocal recognition. The digital music input is matched to trigger events of the scene description, and the actions of each matched trigger event are executed in accordance with the action processes of the scene description, thereby updating the scene description with respect to the objects depicted in the scene on which the actions are executed. The updated scene description is then rendered. Thus, the software provides a patcher in MaxMSP that can link a musical instrument digital interface (e.g., MIDI) data stream, through a sound and word processing application, to a scene produced from that data, a scene that reacts as you traverse it. In this way, the software can also take keywords from vocal recognition and use the FFT to pass everything through a spectral filter, which you can edit to change the sound of what you listen to as you traverse the world. It also places sounds in an open three-dimensional virtual landscape and triggers them as you take a stroll through your soundscape, and, finally, it allows you to edit the objects in the soundscape using the spatialization tools; those objects can be anything, including CAD objects.

ACKNOWLEDGMENT

The author would like to thank Charlie Peck, Marc Benamou, Forrest Tobey, Xunfei Jiang, and David Barbella.

REFERENCES

[1] Cycling '74. Max/MSP History - Where did Max/MSP come from? June 2009. URL: https://web.archive.org/web/20090609205550/http://www.cycling74.com/twiki/bin/view/FAQs/MaxMSPHistory.
[2] Phil Burk. The Frequency Domain. May 2011. URL: http://sites.music.columbia.edu/cmc/MusicAndComputers/chapter3/03_04.php.
[3] David Cohn. Evolution of Computer-Aided Design. May 2014. URL: http://www.digitaleng.news/de/evolution-of-computer-aided-design/.
[4] 3D Innovations. The History of Computer-Aided Design (CAD). Nov. 2014. URL: https://3d-innovations.com/blog/the-history-of-computer-aided-design-cad/.
[5] IRCAM. A brief history of MAX. June 2009. URL: https://web.archive.org/web/20090603230029/http://freesoftware..fr/article.php3?id_article=5.
[6] Peter Kirn. A conversation with David Zicarelli and Gerhard Behles. June 2017. URL: http://cdm.link/2017/06/conversation-david-zicarelli-gerhard-behles/.
[7] Future Music. 30 years of MIDI: a brief history. Dec. 2012. URL: http://www.musicradar.com/news/tech/30-years-of-midi-a-brief-history-568009.
[8] Tim Place. A modular standard for structuring patches in Max. URL: http://jamoma.org/publications/attachments/jamoma-icmc2006.pdf.
[9] Miller Puckette. Synthetic Rehearsal: Training the Synthetic Performer. URL: https://quod.lib.umich.edu/cgi/p/pod/dod-idx/synthetic-rehearsal-training-the-synthetic-performer.pdf?c=icmc;idno=bbp2372.1985.043;format=pdf.
[10] Miller Puckette. The Patcher. URL: http://msp.ucsd.edu/Publications/icmc88.pdf.
[11] Mike Sheffield. Max/MSP for average music junkies. Jan. 2018. URL: http://www.hopesandfears.com/hopes/culture/music/168579-max-msp-primer.
[12] Harvey W. Starr and Timothy M. Doyle. Patent US20090015583 - Digital music input rendering for graphical presentations. Jan. 2009. URL: https://www.google.com/patents/US20090015583.
[13] Naomi van der Velde. Speech Recognition Software: Past, Present & Future. Sept. 2017. URL: https://www.globalme.net/blog/speech-recognition-software-history-future.