Current Research in Concatenative Sound Synthesis
Proceedings of the International Computer Music Conference (ICMC), Barcelona, Spain, September 5-9, 2005

Diemo Schwarz
Ircam – Centre Pompidou
1, place Igor-Stravinsky, 75004 Paris, France

ABSTRACT

Concatenative synthesis is a promising method of musical sound synthesis with a steady stream of work and publications in recent years. It uses a large database of sound snippets to assemble a given target phrase. We explain its principle and components and its main applications, and compare existing concatenative synthesis approaches. We then list the most urgent problems for further work on concatenative synthesis.

1. INTRODUCTION

Concatenative synthesis methods use a large database of source sounds, segmented into units, and a unit selection algorithm that finds the sequence of units that best match the sound or phrase to be synthesised, called the target. The selection is performed according to the descriptors of the units, which are characteristics extracted from the source sounds, or higher level descriptors attributed to them. The selected units can then be transformed to fully match the target specification, and are concatenated. However, if the database is sufficiently large, the probability is high that a matching unit will be found, so the need to apply transformations is reduced. The units can be non-uniform, i.e. they can comprise a sound snippet, an instrument note, up to a whole phrase.

Concatenative synthesis can be more or less data-driven, where, instead of supplying rules constructed by careful thinking as in a rule based approach, the rules are induced from the data itself. The advantage of this approach is that the information contained in the many sound examples in the database can be exploited.

The current work on concatenative sound synthesis (CSS) focuses on three main applications:

High Level Instrument Synthesis: Because concatenative synthesis is aware of the context of the database as well as the target units, it can synthesise natural sounding transitions by selecting units from matching contexts. Information attributed to the source sounds can be exploited for unit selection, which allows high-level control of synthesis, where the fine details lacking in the target specification are filled in by the units in the database.

Resynthesis of audio: A sound or phrase is taken as the audio score, which is resynthesized with the same pitch, amplitude, and timbre characteristics using units from the database.

Free synthesis from heterogeneous sound databases offers a sound composer efficient control of the result by using perceptually meaningful descriptors, and allows browsing and exploring a corpus of sounds.

Concatenative synthesis sprang up independently in multiple places and is sometimes referred to as mosaicing. It is a complex method that needs many different concepts working together, thus much work on only one single aspect fails to relate to the whole. It has seen accelerating development over the past few years as can be seen in the chronology in section 3. There is now the first commercial product available (3.13), and, last but not least, ICMC 2004 saw the first musical pieces using concatenative synthesis (3.12). We try in this article to acknowledge this young field of musical sound synthesis that has been identified as such only five years ago.

Any CSS system must perform the following tasks, sometimes implicitly. This list of tasks will serve later for a taxonomy of existing systems.

Analysis: The source sound files are segmented into units and analysed to express their characteristics with sound descriptors. Segmentation can be by automatic alignment of music with its score for instrument corpora, by blind or arbitrary grain segmentation for free and resynthesis, or can happen on-the-fly. The descriptors can be categorical (class membership), static (constant over a unit), or dynamic (varying over the duration of a unit).

Database: Source file references, units and unit descriptors are stored in a database. The subset of the database that is preselected for one particular synthesis is called the corpus.

Target: The target specification is generated from a symbolic score (expressed in notes or descriptors), or analysed from an audio score (using the same segmentation and analysis methods as for the source sounds).

Selection: Units are selected from the database that best match the given target descriptors according to a distance function and a concatenation quality function. The selection can be local (the best match for each target unit is found individually), or global (the sequence with the least total distance is found).

Synthesis is done by concatenation of selected units, possibly applying transformations. Depending on the application, the selected units are placed at the times given by the target (musical or rhythmic synthesis), or are concatenated with their natural duration (free synthesis or speech synthesis).

2. RELATED WORK

Concatenative synthesis is at the intersection of many fields of research, such as music information retrieval, database technology, real-time and interactive methods, sound synthesis models, musical modeling, classification, perception. Concatenative text-to-speech synthesis shares many concepts and methods with concatenative sound synthesis, but has different goals. Singing voice synthesis occupies an intermediate position between speech and sound synthesis and often uses concatenative methods. For instance, Meron [13] uses an automatically constituted large unit database of one singer.

Content based processing is a new paradigm in digital audio processing that shares the analysis with CSS. It performs symbolic manipulations of elements of a sound, rather than using signal processing alone. Lindsay [12] proposes context-sensitive effects by utilising MPEG-7 descriptors. Jehan [7] works on the object segmentation and perception-based description of audio material and then manipulates the audio in terms of its musical structure. The Song Sampler [1] is a system which automatically samples meaningful units of a song and assigns them to the keys of a MIDI keyboard to be played with by a user.

Related to selection based sound synthesis is music selection where a sequence of songs is generated according to their characteristics and a desired evolution over the playlist. An innovative solution based on constraint satisfaction is proposed in [16], which ultimately inspired the use of constraints for CSS in [27] (section 3.3).

The Musescape music browser [25] works by specifying high-level musical features (tempo, genre, year) on sliders. The system then selects in real time musical excerpts that match the desired features.

3. CHRONOLOGY

Approaches to musical sound synthesis that are somehow data-driven and concatenative can be found throughout history. They are usually not identified as such but can arguably be seen as instances of fixed inventory or manual concatenative synthesis.

The groundwork for concatenative synthesis was laid in 1948 by the Groupe de Recherches Musicales (GRM) of Pierre Schaeffer, using for the first time recorded segments of sound to create their pieces of Musique Concrète. Schaeffer defines the notion of sound object, which is a clearly delimited segment in a source recording. This is not so far from what is here called unit.

Concatenative aspects can also be found in sampling. The sound database consists of a fixed unit inventory analysed by instrument, playing style, pitch, and dynamics, and the selection is reduced to a fixed mapping of MIDI note and velocity to a sample.

[...] by hand. The sound base was manually labeled with musical genre and tempo as descriptors.

3.2. Caterpillar (2000)

Caterpillar [19, 20, 21, 22] performs data-driven concatenative musical sound synthesis from large heterogeneous sound databases. Units are segmented by automatic alignment of music with its score [14], or by blind segmentation. The descriptors are based on the MPEG-7 low-level descriptor set [17], plus descriptors derived from the score and the sound class. The low-level descriptors are condensed to unit descriptors by modeling of their temporal evolution over the unit (mean value, slope, range, etc.). The database is implemented using a relational SQL database management system for reliability and flexibility.

The unit selection algorithm, inspired by speech synthesis, finds the sequence of database units that best match the given synthesis target units using two cost functions: The target cost expresses the similarity of a target unit to the database units, including a context around the target, and the concatenation cost predicts the quality of the join of two database units. The optimal sequence of units is found by a Viterbi algorithm as the best path through the network of database units.

The Caterpillar framework is also used for expressive speech synthesis [2], and first attempts for hybrid synthesis combining music and speech are described in [3].

3.3. Musical Mosaicing (2001)

Musical Mosaicing, or Musaicing [27], performs a kind of automated remix of songs. It is aimed at a sound database of pop music, selecting pre-analysed homogeneous snippets of songs and reassembling them. Its great innovation was to formulate the unit selection as a constraint solving problem. The set of descriptors used for the selection is: mean pitch, loudness, percussivity, timbre. Work on adding more descriptors has picked up again with [28].

3.4. Soundmosaic (2001)

Soundmosaic [5] constructs an approximation of one sound out of small units of varying size from other sounds. For version 1.0 of Soundmosaic, the selection of the best source unit uses a direct match of the normalised waveform (Manhattan distance). Version 1.1 introduced as distance metric the correlation between normalized units.
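
To make the two Soundmosaic matching criteria concrete, the following Python sketch selects the best source unit for a target unit, once by Manhattan distance and once by correlation. It is only an illustration under stated assumptions: peak normalisation, equal-length units, and all function names (normalise, manhattan, correlation, best_unit) are hypothetical and are not taken from Soundmosaic itself.

```python
import math


def normalise(unit):
    """Scale a unit to peak amplitude 1 (assumed normalisation scheme)."""
    peak = max(abs(x) for x in unit)
    return [x / peak for x in unit] if peak > 0 else list(unit)


def manhattan(a, b):
    """Sum of absolute sample differences between two equal-length units."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation(a, b):
    """Pearson correlation of two equal-length units (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom > 0 else 0.0


def best_unit(target, sources, metric="manhattan"):
    """Return the index of the source unit that best matches the target.

    metric="manhattan": smallest distance wins.
    any other value:    largest correlation wins.
    """
    t = normalise(target)
    candidates = [normalise(s) for s in sources]
    indices = range(len(candidates))
    if metric == "manhattan":
        return min(indices, key=lambda i: manhattan(t, candidates[i]))
    return max(indices, key=lambda i: correlation(t, candidates[i]))
```

Called as best_unit(target, sources) or best_unit(target, sources, metric="correlation"), the function returns an index into sources. Note that correlation ignores offset and overall scale, whereas the Manhattan distance on peak-normalised waveforms is still sensitive to a constant offset, so the two criteria can prefer different units.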