ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 101 (2015) 1039 –1051 DOI 10.3813/AAA.918898

The Role of Spectral-Envelope Characteristics in Perceptual Blending of Wind-Instrument Sounds

Sven-Amin Lembke, Stephen McAdams Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT), Schulich School of Music, McGill University,555 SherbrookeStreet West, Montréal, Québec, Canada H3A 1E3. [email protected]

Summary Certain combinations of musical instruments lead to perceptually more blended than others. Orchestra- tion commonly seeks these combinations and can benefitfrom generalized acoustical descriptions of perceptu- ally relevant features that allowthe prediction of blend. Previous research on correlating such instrument-specific features with the perception of blend shows an important role of spectral-envelope characteristics, leaving unan- swered, however, whether global or local characteristics are more important (e.g., spectral centroid or formant structure). This paper reports howwind instruments can be characterized through pitch-generalized spectral- envelope descriptions that exhibit their formant structure and howthis is represented in an auditory model. Two experiments employing blend-production and blend-rating tasks study the perceptual relevance of formants to blend, involving dyads of arecorded instrument sound and aparametrically varied synthesized sound. Frequency relationships between formants influence blend critically,asdoes the degree of formant prominence. In addition, multiple linear regression relying primarily on local spectral-envelope characteristics explains 87% of the vari- ance in blend ratings. Aperceptual model for the contribution of spectral characteristics to perceivedblend is proposed.

PACS no. 43.66.Ba, 43.66.Jh, 43.75.Cd

1. Introduction Along aperceptual continuum, maximum blend is most likely only achievedfor concurrent sounds in pitch unison Knowledge of instrument leads composers to se- or octaves. Even though other intervals may be rightly as- lect certain instruments overothers to fulfill adesired sumed to maketwo instruments more distinct, certain in- purpose in orchestrating amusical work. One such pur- strument combinations would still exhibit higher degrees pose is achieving a blended combination of instruments. of blend than others. On the opposite extreme of this The blending of instrumental timbres is thought to depend continuum, astrong distinctness of individual instruments mainly on factors such as the synchronybetween note leads to the perception of aheterogeneous, non-blended onsets, partial tones aligned along the harmonic series, sound. Assuming auditory fusion to rely on low-levelpro- and specificcombinations of instruments [1]. Whereas the cesses (related to auditory scene analysis, see [5]), in- first twofactors depend on compositional decisions and creasingly strong and congruent perceptual cues for blend their precise execution during musical performance, the should counteract even deliberate attempts to identify in- third factor strongly relies on instrument-specificacous- dividual sounds. tical characteristics. Arepresentative characterization of Previous research on timbre perception has shown a these features would thus facilitate explaining and theoriz- dominant importance of spectral properties. Timbre sim- ing perceptual effects related to blend. In agreement with ilarity has been linked to spectral-envelope characteristics past research [1, 2, 3], blend is defined as the perceptual fu- [6]. Similarity-based behavioral groupings of stimuli re- sion of concurrent sounds, with acorresponding decrease flect acategorization into distinct spectral-envelope types in the distinctness of individual sounds. It can involvedif- [7] or the exchange of spectral envelopes between synthe- ferent practical applications, such as augmenting adomi- sized instruments results in an analogous inversion of po- nant timbre by adding other subordinate timbres or creat- sitions in multidimensional timbre space [8]. Furthermore, ing an entirely novel, emergent timbre [4]. This paper ad- Strong &Clark [9] reported increasing confusion in in- dresses only the first scenario, as the latter likely involves strument identification (e.g., oboe with trumpet)whenever more than twoinstruments. prominent spectral-envelope traits are disfigured, mak- ing instruments resemble each other more. With regard Received15June 2014, to blending, Kendall &Carterette [2] established alink accepted 11 May 2015. between timbre similarity and blend, by relating closer

©S.Hirzel Verlag · EAA 1039 ACTA ACUSTICA UNITED WITH ACUSTICA Lembke,McAdams: Spectral-envelope characteristics Vol. 101 (2015) timbre-space proximity between pairs of single-instrument introduces the chosen approach to spectral-envelope de- sounds to higher blend ratings for the same sounds form- scription, its corresponding representation through audi- ing dyads. ‘Darker’ timbres have been hypothesized to be tory models, and howinthe perceptual investigation the favorable to blend [4, 10], quantified through the global spectral description is operationalized in terms of paramet- spectral-envelope descriptor spectral centroid,with ‘dark’ ric variations of formant frequencylocation. Section 3out- referring to lower centroids. Strong blend wasfound to be lines the design of twobehavioral experiments that inves- best explained by alow centroid composite,i.e., the cen- tigate the relevance of local variations of formant structure troid sum of the sounds forming adyad. to blend perception, with their specificmethods and find- By contrast with global descriptors, attempts to ex- ings presented in Sections 4and 5, respectively.Finally, plain blending through local spectral-envelope character- the combined results from acoustical and perceptual in- istics focus on prominent spectral maxima, also termed vestigations are discussed in Section 6, leading to the es- formants [11, 12] in this context. Reuter [3] reported be- tablishment of aspectral model for blend in Section 7. havioral findings in favoroftimbre blend occurring when- ever formant regions between twoinstruments coincide. 2. Spectral-envelope characteristics His explanation argues that this coincidence avoids in- complete masking [13], which inversely hypothesizes that Acorpus of wind instrument recordings wasused to es- the non-coincidence of formant locations prevents audi- tablish ageneralized acoustical description for each in- tory fusion due to incomplete mutual masking of the pre- strument. The orchestral instrument samples were drawn sumedly salient formants between instruments, facilitating from the Vienna Symphonic Library1 (VSL), supplied as the detection of their distinct identities. stereo WAVfiles (44.1 kHz sampling rate, 16-bit dynamic As prominent signifiers of spectral envelopes, formants resolution), with only left-channel data considered. The have been employed widely to describe wind instruments investigated instruments comprised (French)horn, bas- [3, 7, 8, 12, 14, 15, 16, 17, 18, 19]. Likethe formant soon, Ctrumpet, B clarinet, oboe, and flute, with the structure found in the human voice [11, 12, 20], formants available audio samples spanning their respective pitch in wind instruments are located at absolute frequencyre- ranges in semitone increments. Because the primary fo- gions, which remain largely unaffected by pitch change cus concerned spectral aspects, all selected samples con- [14, 17, 18]. This invariance may in fact allowfor the gen- sisted of long, sustained notes without vibrato. As spectral eralized acoustical description for these instruments and envelopes commonly exhibit significant variation across together with assessing its potential constraints (e.g., in- dynamic markings, all samples included only mezzoforte strument register,dynamic marking), it will be of value to markings, representing an intermediate levelofinstrument musical applications (e.g., [21]). Furthermore, it is mean- dynamics. ingful to assess howsuch prominent spectral features are represented at an intermediary stage between acoustics 2.1. Spectral-envelope description and perception, i.e., at asensorineural level, simulated by Past investigations of pitch-invariant spectral-envelope computational models of the human auditory system. Au- characteristics pursued comprehensive assessments of ditory models can account for effects related to spectral spectral analyses encompassing extended pitch ranges of masking, i.e., to what neural excitation pattern aspectrum instruments [14, 17, 18]. The spectral-envelope descrip- of asingle or compound sound leads. Forinstance, ex- tion employed in this paper wasbased on an empirical es- citation patterns typically involveanasymmetric upward timation technique relying on the initial computation of spread in frequency, butthe shape of excitation still varies power-density spectra for the sustained portions of sounds both as afunction of frequencyand excitation level. The (excluding onset and offset), followed by the detection Auditory Image Model (AIM)simulates different stages of of partial tones, i.e., their frequencies and power levels. the peripheral auditory system, covering the transduction Acurve-fitting procedure employing a cubic smoothing of acoustical signals into neural responses and the sub- spline (piecewise polynomial of order 3) applied to the sequent temporal integration across auditory filters yield- composite distribution of partial tones overall pitches ing the stabilized auditory image (SAI), which provides yielded the spectral-envelope estimates. The procedure the closest representation relating to acoustical spectral- balanced the contrary aims of achieving adetailed spline envelope traits for human-voice and musical-instrument fit and alinear regression, involving iterative minimization sounds [22]. AIM’smost recent development employs of deviations between the estimate and the composite dis- dynamic, compressive gammachirp (DCGC)filterbanks, tribution until an optimal criterion wasmet. These pitch- which account for both frequencyand leveldependency generalized spectral-envelope estimates then served as the of basilar excitation by adapting filter shape accordingly basis for the identification and categorization of formants. [23]. AIM may therefore aid in assessing the relevance The main formant represented the most prominent spectral of hypotheses concerning blend, as previous theories had maximum with decreasing magnitude towards both lower also employed representations or explanations which took and higher frequencies or if not available, the most promi- spectral-masking effects into account [4, 3]. This paper addresses whether pitch-invariant spectral- envelope characterization is relevant to blending. Section 2 1 URL: http://vsl.co.at/. Last accessed: April 12, 2014.

1040 Lembke,McAdams: Spectral-envelope characteristics ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 101 (2015)

10 10 10 spectral-env.estimate oboe clarinet composite distribution 34 pitches 42 pitches 0 formant maximum 0 0 formant bounds, 3dB (dB) -10 formant bounds, 6dB -10 -10 level -20 -20 -20

Power horn -30 -30 -30 38 pitches

-40 -40 -40 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000

10 10 10 bassoon trumpet flute 41 pitches 31 pitches 39 pitches 0 0 0

(dB) -10 -10 -10 level -20 -20 -20

Power -30 -30 -30

-40 -40 -40 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 Frequency (Hz) Frequency (Hz) Frequency (Hz)

Figure 1. Estimated spectral-envelope descriptions for all six instruments (labelled in individual panels). Estimates are based on the composite distribution of partial tones compiled from the specified number of pitches across the range of each instrument. nent spectral plateau, i.e., the point exhibiting the flat- 2.2. Auditory-model representation test slope along aregion of decreasing magnitude towards If pitch-invariant spectral-envelope characteristics are per- higher frequencies. Furthermore, descriptors for the main ceptually relevant, theyshould also become apparent in formant F were derivedfrom the estimated spectral enve- arepresentation closer to perception, likethe output of lope. Theycomprised the frequencies of the formant max- acomputational auditory model. Using the AIM, SAIs imum F as well as upper and lower bounds (e.g., F → max 3dB were derivedfrom the DCGC basilar-membrane model, and F ← )atwhich the power magnitude had decreased by 3dB comprising 50 filter channels, equidistantly spaced along either 3dBor6dBrelative to F . max equivalent-rectangular-bandwidth (ERB)rate [24] and The spectral-envelope estimates for all investigated in- covering the audible range up to 5kHz2.Atime-averaged struments generally suggested pitch-invariant trends, as SAI magnitude profile wasobtained by computing the me- shown in Figure 1. Anarrower spread of the partial tones dians across time frames per filter channel, which resem- (circles)around the estimate (curve) argues for astronger bled the auditory excitation pattern [22]. pitch-invariant trend. The lower-pitched instruments, horn Astrong similarity among SAIs across an extended and bassoon (left panels), exhibited strong tendencies for range of pitches wastaken as an indicator for pitch- prominent spectral-envelope traits, i.e., formants. Higher- invariant tendencies. Pearson correlation matrices for all pitched instruments yielded twodifferent kinds of descrip- possible pitch combinations were computed, comparing tion. Oboe and trumpet (middle panels)displayed mod- the profiles of SAI magnitudes overfilter channels. In ad- erately weaker pitch-invariant trends, nonetheless exhibit- dition, this approach also aided in identifying the limits ing main formants, with that of the trumpet being of con- of pitch invariance, as adjacent regions exhibiting weaker siderable frequencyextent compared to more locally con- correlations delimited instrument registers where SAIs strained ones reported for the other instruments. Although varied as afunction of pitch. Three representative cases still following an apparent pitch-invariant trend, the re- are illustrated in Figure 2. Forhorn (left panel)and bas- maining instruments, clarinet and flute (right panels), dis- soon (not shown), broad regions of pitch-invariant SAI played only weakly pronounced formant structure, with profiles became apparent (dark square), spanning large the identified formants more resembling local spectral parts of their ranges up to pitches of about D4. Oboe plateaus. Furthermore, the unique acoustical trait of the (middle panel)and trumpet (not shown)exhibited more clarinet concerning its low, chalumeau register prevented constrained and fragmented regions of high SAI similar- anyvalid assumption of pitch invariance to be made for ity,contrasted by increasingly pitch-variant SAI profiles the lower frequencyrange. This register is characterized above A4. Forthese four instruments, pitch-invariant char- by amarked attenuation of the lower even-order partials acterization appeared to be more prevalent and stable in whose locations accordingly varied as afunction of pitch. Figure 1also displays the associated formant descriptors 2 (vertical lines), from which it can be shown that the iden- As band-limited analysis economized computational cost and no promi- nent formants above 5kHz were found, the audio samples were sub- tified main formant for the clarinet (top-right panel)was sampled by afactor of 4toasampling rate of 11025 Hz only for the located above the pitch-variant lowfrequencies. purposes of analysis with AIM.

1041 ACTA ACUSTICA UNITED WITH ACUSTICA Lembke,McAdams: Spectral-envelope characteristics Vol. 101 (2015)

Figure 2. (Color online)SAI correlation matrices for horn, oboe, and clarinet (left-to-right). Correlations (Pearson r)between SAI magnitudes across 50 filter channels consider all possible pitch combinations, with obtained r falling within [0, 1] (see legend, far- right). lower pitch regions, from which low-pitched instruments in particular would benefit. All of these instruments lost  -II -I 0+I+II pitch-invariant tendencies in their high registers. The re- 0 spectral-envelopeestimate maining instruments, clarinet (right panel)and flute (not formant bounds, 3dB synthesis filter for ΔF= 0 shown), lacked widespread pitch-invariant SAI charac- -10 synthesis filters for ΔF≠0 teristics, as strong patterns of correlation were only ob- tained between directly adjacent pitches (diagonal)and (dB) -20 not across wider pitch regions. level -30

2.3. Parametric variation of main-formant fre- Power quency -40 In order to study the contribution of local variations of -50 spectral characteristics, asynthesis model wasemployed 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Frequency (Hz) that provided parametric control overseparate spectral- envelope components. The synthesis infrastructure relied Figure 3. Spectral-envelope estimate of horn and filter magni- on asource-filter model and wasrealized for real-time tude responses of its synthesis analogue. The original analogue modification of the control parameters [25], based on is modeled for ΔF = 0; the other responses are variations of ΔF . which the spectral envelope remained invariant to pitch The top axis displays the equivalent scale for the five ΔF levels changes. During synthesis, the filter structure wasfed a investigated in Experiment B. harmonic source signal of variable , containing harmonics up to 5kHz. The filter structure con- frequencyregion (see Section 2.1). It only modeled the sisted of twoindependent filters, modeling the main for- formant above that region as well as the remaining spec- mant on the one hand and the remaining spectral-envelope tral envelope towards higher frequencies in order to orient regions on the other.Aparameter allowing the main for- the investigation toward specifically testing the relevance mant to be shifted in frequencyrelative to the remain- of the identified, albeit less pronounced, formant. ing regions wasimplemented as an absolute deviation ΔF in Hz from apredefined origin, i.e., ΔF = 0. Analogue 3. General methods models for each instrument were designed for ΔF = 0by matching the frequencyresponse of the composite filter The perceptual relevance of main-formant frequencyto structure to the spectral-envelope estimates, as illustrated blending wastested for sound dyads. All dyads comprised in Figure 3for the horn (dashed black line), superimposed a sampled instrument and its synthesized analogue model. overits corresponding estimate (solid greyline). The ana- In agiven dyad, the instrument sample wasconstant, and logues were not meant to deliverrealistic emulations of its synthesized analogue wasvariable with respect to the the instruments per se, butrather to achieve agood fit parameter ΔF .Variations with ΔF>0shifted the main between the main formants of the analogue and spectral- formant of the synthesized sound higher in frequencyrel- envelope estimate. Limiting differences in shape between ative to the sampled instrument’smain formant and, ac- main formants helped to deduce the measured perceptual cordingly, ΔF<0corresponded to shifts toward lower differences that resulted from frequencyrelationships be- frequencies. Twoperceptual experiments were conducted tween them. It should be noted that the synthesis filter to investigate how ΔF variations relate to blend. In Ex- structure for the clarinet excluded its pitch-variant lower periment A, participants controlled ΔF directly and were

1042 Lembke,McAdams: Spectral-envelope characteristics ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 101 (2015) asked to find the ΔF that gave optimal blend, whereas not occur twice in succession between blocks. The main in Experiment B, listeners provided direct blend ratings experiments were in each case preceded by 10 practice tri- for predefined ΔF variations. Using the instruments pre- als under the guidance of the experimenter,tofamiliar- sented in Section 2, the robustness of perceptual effects ize participants with the task and with representative ex- wasassessed overthe twoexperimental tasks for differ- amples of stimulus variations. Dyads were played repeat- ent pitches, unison and non-unison intervals, and stimulus edly throughout experimental trials, allowing participants contexts. Givensix instruments, and various pitches and to pause playback at anytime. intervals, it wasimpractical to test all possible combina- tions. An exploratory approach waschosen instead, with 3.4. Data analysis not all instruments being tested across all factors. Whereas this could reveal blend-related dependencies concerning With respect to investigated factors, Experiment Aevalu- ΔF across the factors of interest, it did not allowgener- ated the influence of the factors instrument register and in- alizing across all instruments to the same degree as well as terval type. Experiment Bassessed pitch-invariant percep- determining perceptual thresholds. Still, each factor was tual performance across anumber of factors and further- studied with at least three instruments for greater gener- more correlated the perceptual data with spectral-envelope alizability.Pitches were chosen to represent common reg- traits. Separate analyses of variance (ANOVA s) were con- isters of the individual instruments. Non-unison intervals ducted for each instrument, testing for statistically signifi- included both smaller and larger intervals. cant main effects within factors and interaction effects be- The methods both experiments share in common are tween them. Acriterion significance level α = .05 was presented in this section, before addressing their specifics chosen and, if multiple analyses on split factor levels or and results in the following sections. individual post-hoc analyses were conducted, Bonferroni corrections were applied. In repeated-measures ANOVA s, 3.1. Participants the Greenhouse-Geisser correction (ε)was applied when- Due to the demanding experimental tasks, participants ever the assumption of sphericity wasviolated. In addition, of both experiments were musically experienced listen- Experiment Aalso considered one-sample t-tests against ers. Theywere recruited primarily from the Schulich amean of zero for testing differences to ΔF = 0. Statis- 2 School of Music, McGill University.Their backgrounds tical effect sizes ηp and r are reported for ANOVA sand were assessed through self-reported degree of formal t-tests, respectively.The analyses considered participant- musical training, accumulated across several disciplines, based averages for trial repetitions of identical conditions. e.g., instrumental performance, composition, music the- ory,and/or sound recording. All participants passed astan- dardized hearing test [26, 27]. No participant took part in 4. Experiment A both experiments. 4.1. Method 3.2. Stimuli 4.1.1. Participants All stimuli involved dyads, comprising one sampled (drawn from VSL)and one synthesized sound. Forasam- The experiment wasconducted with 17 participants, 6fe- ple at anygiven pitch, the spectral envelope wasap- male and 11 male, with amedian age of 27 years (range proximated by the pitch-generalized description from Sec- 20-57). Fifteen participants reported more than 10 years tion 2.1, which resulted in the main formants of sampled of formal musical training, with 10 indicating experi- and synthesized sounds resembling each other for ΔF = 0. ence with wind instruments. Participants were remuner- With regard to the temporal envelope, both instruments ated with 15 Canadian dollars. were synchronized in their note onsets, followed by the 4.1.2. Stimuli sustain portion and ending with an artificial 100-ms linear amplitude decay ramp applied to both instrument sounds. Table Ilists the 17 investigated dyad conditions (column The sampled sound retained its original onset character- entries of bottom row).All instruments included unison istics, whereas across all modeled analogues, the synthe- intervals (0 semitones, ST). With regard to additional fac- sized onsets were characterized by aconstant 100-ms lin- tors, three levels of the Interval factor compared unison ear amplitude ramp. Stimuli were presented overastan- intervals to consonant (7 or −3ST) and dissonant (6 or dard two-channel stereophonic loudspeaker setup inside −2ST),non-unison intervals. Twolevels of the Register an Industrial Acoustics Companydouble-walled sound factor contrasted low(A2, C4 or E3)tohigh (D5orB5) in- booth, with the instruments simulated as being captured strument registers for unison dyads, with the high-register by astereo main microphone at spatially distinct locations pitches being derivedfrom the pitch-variant regions iden- inside amid-sized, moderately reverberant room (see [25] tified in Section 2.2. The sampled sound remained at the for details). indicated reference pitch, whereas the synthesized sound varied relative to it to form the non-unison intervals. All 3.3. Procedure dyads had constant durations of 4900 ms. The levelbal- Experimental conditions were presented in randomized or- ance between instruments wasvariable and determined by der within blocks of repetitions. Aspecificcondition could the participant.

1043 ACTA ACUSTICA UNITED WITH ACUSTICA Lembke,McAdams: Spectral-envelope characteristics Vol. 101 (2015)

Table I. Seventeen dyad conditions from Experiment Aacross instruments, pitches, and intervals (top-to-bottom). Intervals in semitones relative to the specified reference pitch.

horn bassoon oboe trumpet clarinet flute C3 A2 D5 C4 C4 B5 E3 D5 C4

0670-2-3 0 0 067 0 0-2-3 0 0

each instrument’srespective slider range. Forthe instru- ments horn, bassoon, oboe, and trumpet (left panel), opti- Γ=-100 Hz fslider A mal blend wasfound in direct proximity to ΔF = 0. For ΔF B high the unison intervals of horn and bassoon, ΔF did not differ significantly from zero; ΔF for the other twoinstruments low -II -I 0 +I +II were located slightly lower [t(16) ≤−5.6, p<.0001, Figure 4. ΔF variations investigated in Experiments Aand B. A: r ≥.82]. By contrast, optimal ΔF for the clarinet and flute (right panel)were relatively distant from ΔF = 0, in line Participants controlled fslider,which provided aconstant range of 700 Hz (white arrows). Γ (e.g., −100 Hz)represented arandom- with significant underestimations [t(16)≤−3.8, p ≤.0015, ized roving parameter,preventing the range from always being r ≥ .69]. In summary, ΔF values leading to optimal blend centered on ΔF = 0. B: Participants rated four dyads varying were limited to cases in which the formant of the synthe- in ΔF ,drawn from the low or high context. The contexts repre- sized instrument wasatorbelowthat of the sampled in- sented subsets of four of the total of fivepredefined ΔF levels. strument.

4.2.2. Instrument register and interval type 4.1.3. Procedure The influence of instrument register on the optimal ΔF Aproduction task required participants to adjust ΔF di- wasinvestigated for trumpet, bassoon, and clarinet at rectly,inorder to achieve the maximum attainable blend, pitches corresponding to instrument-specificlow and high with the produced value serving as the dependent vari- registers. One-way repeated-measures ANOVA sfor Reg- able. User control wasprovided via atwo-dimensional ister yielded moderately strong main effects for trumpet graphical interface, including controls for ΔF and the 2 and bassoon [F (1, 16) ≥ 19.2, p ≤ .0005, ηp ≥ .55], due levelbalance between instruments. The slider controls for to an increase of the optimal ΔF in the high register. ΔF = fslider +Γprovided aconstant range of 700 Hz, Aless pronounced effect wasobtained for the clarinet with fslider ∈ [−350, +350], and including arandomized 2 [F (1, 16) =5.3,p=.0358,ηp =.25]. roving offset Γ ∈ [−100, +100] between trials. As visu- The investigation of possible differences in optimal ΔF alized in Figure 4(top), minimal or maximal Γ limited between interval types involved comparisons between uni- the range covered by all trials to 500 Hz (solid thick grey son and non-unison intervals as well as adistinction be- line), with all possible ΔF deviations spanning arange of tween consonant and dissonant for the latter.One-way 900 Hz (dashed thick line). Participants completed atotal repeated-measures ANOVA sonInterval conducted for of 88 experimental trials (22conditions3 × 4repetitions) bassoon, clarinet, and trumpet only led to aweak main taking about 50 minutes and including a5-minute break 2 effect for the trumpet [F (2, 32) = 3.7,p = .0347,ηp = .19]. after about 44 trials. Post-hoc tests for the three possible comparisons yielded asingle significant difference between the interval sizes 0 4.2. Results and 6ST[t(16) = −3.5, p = .0033, r = .65]. 4.2.1. General trends Forall six instruments, participants associated optimal 5. Experiment B blend with deviations ΔF ≤ 0. Figure 6(diamonds in lower part)illustrates the means for optimal ΔF ,from 5.1. Method which twodifferent patterns become apparent among in- 5.1.1. Participants struments. ΔF are displayed relative to ascale of equiv- alent variations tested in Experiment B, with the scale The experiment wascompleted by 20 participants, 9fe- value 0 corresponding to ΔF = 0.4 The greylines indicate male and 11 male, with amedian age of 22 years (range 18–35). Fifteen participants reported more than 10 years of formal musical training, with 11 indicating experi- 3 Only 17 conditions investigated ΔF ;the remainder studied other for- ence with wind instruments. Participants were remuner- mant properties that lie outside the focus of this article. Forinstance, ated with 20 Canadian dollars. relative main-formant magnitude variations ΔL were also investigated, which will be addressed in afuture publication. 5.1.2. Stimuli 4 ΔF were linearly interpolated to ascale of equi-distant levels, e.g., −I, 0,and +I corresponding to the numerical scale values -1, 0, and 1, re- Table II lists the 22 investigated dyad conditions. The In- spectively. terval factor investigated twolevels, comparing unison to

1044 Lembke,McAdams: Spectral-envelope characteristics ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 101 (2015)

Table II. Twenty-twodyad conditions from Experiment Bacross instruments, pitches, and intervals (top-to-bottom). Intervals in semi- tones relative to the specified reference pitch.

horn bassoon oboe trumpet clarinet flute C3 B3 A2 D4 C4 G4E5 C4 B4 E3 A4 C4 G4E5

06060-20-2 0 0 0 06060-20-2 0 0 0 non-unison (6 or -2 ST,dissonant)intervals. Depending on the instrument, the Pitch factor involved two(horn, bas-  -II -I 0+I+II soon, trumpet, clarinet)orthree (oboe, flute)levels. In the 0 case of horn, bassoon, trumpet, and clarinet, there were -2 ΔF twolevels of Interval for each levelofPitch. In addition, 3dB this experiment included twofactors that were related to -4 ΔF variations alone, which applied to all conditions listed (dB) in Table II. The first wassynonymous with ΔF ,asitex- level -6 plored atotal of five ΔF levels, including ΔF = 0and two -8 sets of predefined moderate and extreme deviations above Power and belowit, i.e., the ΔF levels hereafter labeled 0, ±I, -10 F← F← F F→ F→ and ±II.The second factor grouped the fivelevels contex- 6dB 3dB max 3dB 6dB -12 tually into twosubsets of four,which are denoted as low 200 400 600 800 1000 1200 and high contexts and defined in Figure 4(bottom). Frequency (Hz) Employing the formant descriptors from Section 2.1, Figure 5. ΔF levels from Experiment B, defined relative to the investigated ΔF levels were expressed on acommon aspectral-envelope estimate’sformant maximum and bounds. scale of spectral-envelope description, which provided a ΔF (±I)fall 10% inside of ΔF3dB’s extent. ΔF (+II)aligns with better basis of comparison than taking equal frequency F → ,whereas ΔF (−II)aligns with either 80% · F ← or 150 Hz, differences in Hz, as the frequencyextent of formants 6dB 6dB whicheveriscloser to Fmax. across instruments varied considerably.Figure 5provides examples for all resulting ΔF levels of the horn. The four levels ΔF = 0were defined as frequencydistances be- either after the corresponding interquartile ranges first fell below4dBorafter running amaximum of 10 participants. tween the formant maximum Fmax and measures related → to the location and width of its bounds (e.g., F6dB and 5.1.3. Procedure ΔF ,respectively). Forexample, the positive deviation 3dB Arelative-rating task required participants to compare ΔF ΔF (+I)was the distance between the formant maximum levels for agiven condition from Table II. In each ex- and its upper bound minus 10% of the width between perimental trial, participants were presented four dyads the 3dBbounds. If spectral-envelope descriptions lacked and asked to provide four corresponding ratings. The four lower bounds (e.g., trumpet, clarinet, flute), the frequency dyads represented one of the two ΔF contexts labeled high located below F that exhibited the lowest magnitude max and low in Figure 4. Acontinuous rating scale wasem- wastaken as asubstitute value. ployed, which spanned from most blended to least blended Unlikethe dyads in Experiment A, the synthesized (values 1to0,respectively)and served as the dependent sound always remained at the reference pitch, whereas the variable. Participants needed to assign twodyads to the sampled sound varied its pitch for non-unison intervals, scale extremes (e.g., most and least); the remaining two because this tested the assumption of pitch-invariant de- dyads were positioned along the scale continuum relative scription for the recorded sounds more thoroughly.The to the chosen extremes. Playback could be switched freely dyads had aconstant duration of 4700 ms. In addition, between the four dyads, with the visual order of the se- the conditions listed in Table II, including the associated lection buttons and rating scales for individual dyads ran- five ΔF levels per condition, had predetermined values for domized between trials. Participants completed 120 trials the levelbalance between sounds and had also been equal- (30conditions5 × 2contexts × 2repetitions)taking about ized for . The first author determined the levelbal- 75 minutes, including two5-minute breaks after about 40 ance, aiming for good balance between both sounds while and 80 trials. maintaining discriminability between ΔF levels, which wassubsequently verified by the second author.Loudness 5.2. Results equalization wasconducted subjectively in aseparate pi- lot experiment, anchored to aglobal reference dyad for all 5.2.1. General trends conditions and ΔF levels. Forall stimuli, gain levels were Group medians of ratings aggregated across the factors determined that equalized stimulus loudness to the global Pitch, Interval, and Context illustrate howblend varies as a reference. These gain levels were based on median val- ues from at least fiveparticipants, which were determined 5 Only the 22 conditions investigating ΔF are reported here.

1045 ACTA ACUSTICA UNITED WITH ACUSTICA Lembke,McAdams: Spectral-envelope characteristics Vol. 101 (2015) function of ΔF .Figure 6suggests that participants mainly associated higher degrees of blend with the levels ΔF ≤ 0, 1 horn clarinet bassoon flute whereas much lower ratings were obtained for ΔF>0. In oboe profile terms of higher degrees of blend, twotypical rating pro- trumpet profile files as afunction of ΔF emerged (shown as the ideal- ized dashed-and-dotted curves in Figure 6):1)For the in- struments horn, bassoon, oboe, and trumpet (left panel), rating medium to high blend ratings were obtained at and be- Blend low ΔF = 0, above which ratings decreased markedly, resembling the profile of a plateau.2)The instruments Produced optimal blend Produced optimal blend clarinet and flute (right panel)exhibited amonotonically 0 decreasing and approximately linear rating profile as ΔF -II -I 0 +I +II -II -I 0 +I +II increases, in which ΔF = 0did not appear to assume ano- ΔF ΔF table role. These differences in plateau vs. linear profiles for the twoinstrument subsets are analyzed more closely Figure 6. Perceptual results for the different instruments, grouped in the following sections, also taking into account potential according to twotypical response patterns (left and right panels). effects due to the other factors. Experiment A(diamonds, bottom): mean ΔF for produced opti- mal blend, transformed to acontinuous scale of ΔF levels. The 5.2.2. Blend and pitch invariance greylines indicate slider ranges (compare to Figure 4, top). Ex- periment B(curves): median blend ratings across ΔF levels and Spectral characteristics that remain stable with pitch varia- idealized profile. tion, such as formants, may have apitch-invariant percep- tual relevance. To test this, wheneverthe profiles of blend ratings over ΔF remained largely unaffected by different 1 1 pitch 1 pitches, intervals, and ΔF contexts, the perceptual results pitch 2 were assumed to be pitch-invariant. Forinstance, Figure 7 suggests this tendencyfor the horn, in which the plateau rating

profile wasmaintained overall factorial manipulations. As Blend context: low context: high afirst step, the main effects across ΔF were tested to con- interval: unison interval: unison 0 0 firmthat ratings served as reliable indicators of perceptual -II -I 0 +I -I 0 +I +II differences. Giventhese main effects, perceptual robust- 1 1 ness to pitch variation wasfulfilled if no ΔF × Pitch or ΔF × Interval interaction effects were found across both ΔF contexts. An absence of main effects between ΔF con- rating texts would indicate further perceptual robustness. Blend context: low context: high As the Context factor only involved ΔF levels com- interval: non-unison interval: non-unison 0 0 mon to both the high and low contexts, namely 0 and -II -I 0 +I -I 0 +I +II ΔF ΔF ±I (see Figure 4, bottom), the ratings for these levels re- quired range normalization and separate analyses from the Figure 7. Medians and interquartile ranges of blend ratings remaining factors. Forthe instruments involving the In- for horn across ΔF levels and the factorial manipulations terval factor,these were conducted on split levels of that Pitch×Context×Interval. factor.Furthermore, the experimental task imposed the us- age of the rating-scale extremes, which resulted in several for the most liberal and conservative p-values are reported violations of normality due to skewed distributions for the (e.g., conserv.|liberal), with the conservative finding being dyads selected as extremes. As aresult, all main and in- assumed valid if statistical significance is in doubt. teraction effects were tested with abattery of fiveindepen- Strong main effects were found for all instruments, dent repeated-measures ANOVA sonthe rawand trans- which indicated clear differences in perceivedblend formed ratings. The data transformations included non- among the investigated ΔF levels. Table III lists ANOVA parametric approaches of rank transformation [28] and statistics for the range between strongest (clarinet)and prior alignment of ‘nuisance’ factors [29].6 The statistics weakest (bassoon)main effects among the instruments, which reflects analogous differences in the utilized rating- scale ranges in Figure 6. Furthermore, the rating profiles of 6 Giventhe unavailability of non-parametric alternativesfor repeated- measures, three-way ANOVA sthat include tests for interaction effects, the instruments horn, bassoon, oboe, and trumpet fulfilled an approach waschosen that assesses tests overmultiple variants of the criteria for pitch-invariant robustness, as theyappeared dependent-variable transformations, presuming that the most conserva- tive test in the ANOVA battery minimizes accepting false positives. Rank transformation is acommon approach in non-parametric tests, such as the by removing the main effects for Aand B. The four data transformations one-way Friedman test [28]. Issues with tests for interaction effects los- processed the rawdata with or without alignment and for tworanking ing power in the presence of strong main effects were addressed through methods. The first method computed global ranks across the entire data ‘alignment’ of the rawdata prior to rank transformation [29]. Forin- set, i.e., across participants and conditions, whereas the second method stance, atest for the interaction A×Bwould align its ‘nuisance’ factors evaluated within-participant ranks across conditions per participant.

1046 Lembke,McAdams: Spectral-envelope characteristics ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 101 (2015)

Table III. Range of strongest to weakest ANOVA main effects along ΔF across all six instruments.

lowcontext high context Effect Stat. conserv.liberal conserv.liberal

F86.0 82.6 165.1 165.1 Clarinet df 1.6,30.7 2.1,39.1 3,57 3,57 (strongest) ε .54 .69 -- p<.0001 <.0001 <.0001 <.0001 2 ηp .82 .81 .90 .90

F16.4 16.8 12.6 15.2 Bassoon df 3,57 3,57 1.4,27.1 3,57 (weakest) ε --.48 - p <.0001 <.0001 .0005 .0001 2 ηp .46 .47 .40 .44 unaffected by pitch-related variation. There wasonly one Table IV.ANOVA e ff ects for clarinet and flute suggesting an ab- exception from acomplete absence of effects interacting sence of pitch invariance. ∗:The column header low context does with ΔF :amoderate main effect for Context for trum- not apply in this case. pet wasfound only at non-unison intervals [F (1, 19) = 10.48|25.04, p = .0043|.0001, η2 = .355|.569]. By con- Clarinet lowcontext high context p Effect Stat. conserv.liberal conserv.liberal trast, the rating profiles for clarinet and flute did not exhibit pitch invariance, as theyclearly violated the criteria across F2.9 3.4 3.8 4.4 both ΔF contexts. The interaction effects with ΔF and a ΔF × df 3,57 3,57 2.0,38.7 3,57 main effect for Context leading to their disqualification are Interval ε --.68 - described in Table IV. p.044 .024 .030 .008 2 The instruments displaying pitch invariance were the ηp .13 .15 .17 .19 same ones with plateau rating profiles, possibly attributing aspecial role to ΔF = 0asdefining aboundary governing Flute lowcontext high context Effect Stat. conserv.liberal conserv.liberal blend. To further support this assumption by joint analysis of the four instruments, twohierarchical cluster analyses F2.8 4.3 4.4 7.2 were employed that grouped ΔF levels based on their sim- ΔF × df 3.9,75.0 6,114 6,114 6,114 ilarity in perceptual ratings or auditory-model representa- Pitch ε .66 -- - tions. The first cluster analysis reinterpreted rating differ- p.031 .0006 .0005 <.0001 2 ences between ΔF as adissimilarity measure. This mea- ηp .13 .18 .19 .28 sure considered effect sizes of statistically significant non- parametric post-hoc analyses (Wilcoxon signed-rank test) F4.9 15.7 -- df 1,19 1,19 -- for pairwise comparisons between ΔF levels, i.e., greater Context∗ ε --- - statistical effects between two ΔF levels were expressed p.039 .0008 -- as being more dissimilar in the perceiveddegree of blend. 2 ηp.21 .45 -- Fornon-significant differences, dissimilarity wasassumed to be zero. The second analysis relied on correlation coef- ficients (Pearson r)between dyad SAI profiles across ΔF levels (see Figure 9for examples). The dissimilarity mea- 0.5 0.2 sure considered the complement value 1− r,and as all cor- 0.4 0.3 relations fall within the range [0, 1], no special treatment 0.1 for negative correlations wasrequired. Both cluster anal- 0.2 Dissimilarity yses employed complete-linkage algorithms. The dissimi- 0.1 larity input matrices were obtained by averaging 30 inde- -II -I 0 +I +II -II -I 0 +I +II ΔF ΔF pendent data sets, aggregated across the four instruments, and the factors Context, Pitch, and Interval. As shown in Figure 8. Dendrograms of ΔF -levelgroupings for the pitch- Figure 8, both analyses led to analogous solutions in which invariant instruments. Dissimilarity measures are derivedfrom the twolevels ΔF>0are maximally dissimilar to a perceptual ratings (left)and auditory-modelled dyad SAI profiles compact cluster associating the three levels ΔF ≤ 0. In (right). other words, ΔF associated with lowand high degrees of blend group into twodistinct clusters, clearly relating to 5.2.3. Blend and its spectral correlates the plateau profile, where ΔF = 0defines the boundary to Explaining blending between instruments with the help higher degrees of blend. of spectral-envelope characteristics could eventually al-

1047 ACTA ACUSTICA UNITED WITH ACUSTICA Lembke,McAdams: Spectral-envelope characteristics Vol. 101 (2015)

Table V. Variables entering stepwise-regression algorithm to ob- invariance. The initial pool of regressors comprised 32 tain models reported in Table VI. a:Not for pitch-variant subset, variables, subsequently reduced to apre-selected set that inter-variable correlation |r| >.7. b:Computed as described in exhibited inter-variable correlations |r| <.7, in order to [30]. c:Accounting for pitch. d:Accounting for perceivable beat- avoid pronounced collinearity among regressors. The pre- ing between partial tones. selection wasdetermined by first identifying the vari- No. Variable Description able that in simple linear regression exhibited the high- est R2 and subsequently adding all remaining variables → → 1 ΔL3dB derivate of F3dB,Equation (1) that yielded permissible inter-variable correlations. Ta- 2 ΔL ΔL betw.formants F & F F1vsF2 1 2 ble Vlists the pre-selected variables entered into the re- 3 ΔS a,b spectral slope above F → slopeF1 3dB gression, which comprised spectral-envelope descriptors b 4 ΔSslope global spectral slope b (nos. 1-6)and variables representing other potential fac- 5 |ΔScentroid absolute centroid difference b tors of influence (nos. 7-12). Stepwise multiple-regression 6 Scentroid centroid composite 7 ERB c reference pitch in Table II algorithms with both forward-selection and backward- rate elimination schemes were considered, which converged on 8 I(non)unison interval category (binary) a optimum models by iteratively adding or eliminating re- 9 IST interval size in semitones 10 Clo/hi ΔF context (binary) gressors, respectively.Models with similar combinations 11 mixratio balance betw.instruments of regressors to the optimum models were explored as b,d 12 AMdepth amplitude modulation depth well. In anticipation of reporting the results, the inclusion of abinary regressor for ΔF context Clo/hi benefited all in- vestigated regression models, as it corrected the systematic lowthe prediction of blend through these instrument- offset of scaled ratings between the lowand high contexts specifictraits. In addition, it could help understand the way (see Figure 7). in which spectral characteristics contribute perceptually. In simple linear regression, the strongest spectral-enve- Giventhis aim, multiple linear regression wasemployed to lope descriptors all concerned local formant characteriza- model the median blend ratings through anumber of vari- tion and did not involvethe global descriptors. Among the ables or regressors.The regression models assessed the formant descriptors, the highest correlations were obtained for the main-formant upper bound F → ,applied to both relative contributions of regressors describing both global 3dB 2 and local spectral-envelope traits. Global descriptors in- pitch-invariant [R (118) = .656, p<.0001] and pitch- 2 volved the commonly reported spectral centroid and spec- variant subsets [R (54) = .713, p<.0001]. Notably,the formant maximum F did not perform better than F → , tral slope [30], whereas local descriptors concerned the max 3dB formant characterization discussed in Section 2.1. Because likely due to differing skewness properties between instru- adyad yielded twodescriptor values across its constituent ment formants (see Figure 1).Atthe same time, the utility of F → implies that it could assume an important role in sounds, the regressor measure had to associate the twoin 3dB some way. Forthe spectral centroid, twomeasures were explaining blend, as perhaps the perceptually most salient considered, namely, composite (sum)and absolute differ- feature of formants. It performed slightly better for the ence [4]. Although the centroid composite relates to the pitch-variant than for the pitch-invariant subset, because ΔF → essentially follows astrictly monotonic function ‘darker’-timbre hypothesis mentioned in the Introduction, 3dB the centroid difference had still been found to best explain across the investigated ΔF levels, which apparently mod- blend in non-unison intervals [4], which left some uncer- els the linear blend profile better.Toimprove the model- tainty as to which of these twomeasures wasmore appro- ing for the plateau blend profile, derivate descriptors of F → were explored. The most effective derivate ΔL → re- priate in explaining blend in general. The remaining spec- 3dB 3dB lated the upper-bound frequencies F → of the twoinstru- tral regressors were implemented as polarity-preserving 3dB differences between descriptors, with the sampled instru- ments to acorresponding change in levelacross the ref- ment serving as the reference (ref)and the synthesized in- erence instrument’sspectral-envelope magnitude function strument being variable (var)across ΔF .For example, the Lref (f)indB, as formalized in Equation 1. In other words, this measure evaluated magnitude differences relative to difference of descriptor values dx for instrument x would correspond to Δd = d − d . the reference instrument, at frequencies appearing to be of ref var particular perceptual relevance to both instruments, which Regression models for twoseparate subsets of the per- therefore may relate to spectral-masking effects (e.g., in- ceptual data were explored: pitch-invariant (horn, bassoon, complete masking [13]). oboe, trumpet)and pitch-variant instruments (clarinet,     flute). The datasets comprised all conditions tested across → → → ΔL = Lref F − Lref F . (1) factors and instruments, with N = 118 and N = 54 for 3dB 3dB|ref 3dB|var the pitch-invariant and -variant subsets, respectively.Re- The obtained solutions from stepwise multiple regression gressor variables were pooled from the spectral-envelope yielded identical models for both instrument subsets, in- → descriptors and additional variables that were included to volving the regressors ΔL3dB,absolute spectral-centroid account for potential confounding factors, e.g., pitch, in- difference |ΔScentroid|,and context Clo/hi.Aslight gain in terval. If these factors did not contribute as regressors, performance wasachievedbysubstituting the |ΔScentroid| this would further support aperceptual relevance of pitch based on audio signals from individual sounds with avari-

1048 Lembke,McAdams: Spectral-envelope characteristics ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 101 (2015)

Table VI. Multiple-regression models best predicting timbre- profiles clearly varying as afunction of pitch, implying blend ratings for twoinstrument subsets. that this pitch dependencymay also extend to perception. The perceptual investigation in Sections 3to5con- Pitch-invariant subset R2 =.87 adj firms the acoustical implications, showing that strong for- F (3, 116)=272.4, p<.0001 mant characterization results in main formants becoming

Regressors βstd tp perceptually relevant to blending. Givenadyad in which aputative main formant is variable in frequencyrelative ΔL→ 1.00 28.6 <.0001 3dB to afixedreference formant, the investigated instruments |ΔScentroid| .26 7.7 <.0001 display twoarchetypical profiles based on their formant Clo/hi .27 7.9 <.0001 prominence. Forthe pitch-variant clarinet and flute, blend 2 Pitch-variant subset Radj=.88 increases as amonotonic, quasi-linear function if the vari- F (3, 52)=134.2, p<.0001 able formant movesfrom above to belowthe reference. Forpitch-invariant instruments, the frequencyalignment Regressors β tp std between the variable formant and the reference (ΔF= 0) → functions as aboundary,delimiting aregion of higher de- ΔL3dB 1.03 20.0 <.0001 |ΔScentroid| .16 3.3 .0018 grees of blend at and belowthe reference and contrasted Clo/hi .34 6.8 <.0001 by amarked decrease in blend when the variable formant exceeds it, which overall resembles a plateau profile. The pitch invariance even extends to different interval types, ant computed on the pitch-generalized spectral-envelope as the plateau profile remains unaffected in non-unison estimates. Table VI displays these optimized regression intervals, regardless of their degree of consonance. How- models for pitch-invariant and pitch-variant subsets, both ever,the findings suggest that the perceptual relevance of leading to about 87% explained variance (adjusted R2). spectral-envelope estimates diminishes in higher instru- The standardized regression-slope coefficients βstd indi- ment registers. The limited sampling of conditions across cate the relative contribution of regressors, with the rel- the investigated factors prevented amore comprehensive ative weights being very similar for both subsets. In these evaluation of all instruments. Whereas the findings allow → models, ΔL3dB acted as the strongest predictor for the general relationships for formant frequencytobededuced, blend ratings, contributing about fivetimes more than more comprehensive investigations are needed to attain |ΔScentroid|,which furthermore did not perform better than more generalizable quantification of absolute frequency Clo/hi.These findings clearly argue for local spectral- ranges and thresholds. envelope descriptors to be more meaningful than global In correlating acoustical and perceptual factors, spec- ones in explaining blending in the investigated dyads. tral-envelope characteristics alone explain up to 87% of Moreover, the remaining global descriptor spectral slope the variance in blend ratings. In addition, local spectral appeared to play no role. Furthermore, finding both in- traits seem to be more powerful acoustical predictors of strument subsets to be modeled equally well through the blend than global traits. The formant descriptor for the same spectral-envelope descriptors points to ageneral util- → upper formant bound F3dB,when expressed as aderivate ity of pitch-generalized descriptions for all instruments. → descriptor ΔL3dB,acts as the strongest predictor for the Despite the findings in Section 5.2.2 arguing against pitch- blend ratings, regardless of whether instruments belong invariant perceptual robustness for clarinet and flute, the to the pitch-invariant group or not. With regard to clar- obtained regression models excluded the Pitch and Interval inet and flute, the departures from pitch invariance found variables, thus appearing to be less relevant to explaining in Section 5.2.2 contradict the utility of pitch-generalized the blend ratings. spectral-envelope description in predicting blend ratings, as reported in Section 5.2.3. Taking both findings into ac- 6. General discussion → count, this for one argues that the descriptor F3dB still Orchestrators would benefitfrom acoustical descriptions succeeds in explaining blend well even for clarinet and of instruments that correspond to the perceptual processes flute. On the other hand, the same instruments display involved in achieving blended timbres. Section 2sug- agreater perceptual sensitivity to the Pitch and Interval gests that common orchestral wind instruments are rea- factors, likely associated with their less pronounced for- sonably well described through pitch-generalized spectral- mant structure. Overall, strong formant prominence leads envelope estimates, which furthermore showthe instru- to more drastic changes in blend. → ments horn, bassoon, oboe, and trumpet to be charac- The prediction of blend using ΔL3dB still presumes terized by prominent formant structure. Auditory mod- that one of the instruments serves as areference formant, els employing stabilized auditory images (SAI)confirm as the employed difference descriptors are anchored to that for strong formant characterization and for lower to the sampled instrument. The dependence on areference middle pitch ranges, the pitch-invariant characterization is leavessome ambiguity,because an arbitrary combination stable. In higher instrument registers, however, SAI pro- of twoinstruments would lead to contradictory predictions files indicate limitations to pitch-invariant characteriza- of blend if both instruments were givenequal importance tion. Other instruments, likeclarinet and flute, yield SAI in serving as the reference. Giventhe context of both ex-

1049 ACTA ACUSTICA UNITED WITH ACUSTICA Lembke,McAdams: Spectral-envelope characteristics Vol. 101 (2015) periments, it can be assumed that the sampled instrument, acting as aconstant anchor,had been biased into serving 0.02 0 as the reference by combining it with avariable synthe- -II -I sized instrument. In addition, apossible perceptual expla- 0.01 +I nation could concern audio samples of instruments playing +II 0 without vibrato generally still exhibiting coherent micro- 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 modulations of partial tones. These modulations have been

magnitude 0.08 shown to contribute to astronger and more stable audi-

SAI 0.06 tory image [31] and may thus bias the more stable image 0.04 toward acting as the reference, especially as the synthe- 0.02 0 sized partials remain static overtime. Even in the context 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 of blending in musical performance, one instrument as- Frequency(Hz) sumes the role of the leading voice, in which it possibly serves as the reference, whereas an accompanying instru- Figure 9. SAI profiles of dyads for all ΔF levels (Experiment B), ment avoids exceeding the lead instrument’smain-formant depicting twoexperimental conditions for horn. Top: pitch 1, uni- son; bottom: pitch 2, non-unison; the grid lines correspond to frequency. Likewise, returning to the notion of blend lead- partial-tone locations. ing to augmented timbres [4], the dominant timbre to be embellished by another would seem predestined to func- tion as the reference, based on which the less dominant pass characteristic that attenuates spectral-envelope re- timbre should not exhibit formant frequencies exceeding gions below500 Hz, affecting the perceivedmagnitudes those of the reference. of the respective partial tones (grid lines). This implies Finally,the results allowareassessment of previous that in the region below500 Hz, frequencydeviations be- explanations for blend. The ‘darker’-timbre hypothesis tween main formants no longer affect the achieveddegree [4] is directly reflected in the obtained linear blend pro- of blend, as reflected both in Figure 9and the perceptual file, in which lower ΔF increases blend and at the same findings. Horn and bassoon would especially benefitfrom time causes adecrease in the spectral-centroid compos- this, as their main formants are centered around 500 Hz. ite. By contrast, this hypothesis is not well explained by → Oboe and trumpet, both exhibiting higher F3dB,can be as- the plateau profile, as blend ratings remain similarly high sumed to benefittoalesser degree. In summary,main for- for ΔF ≤ 0. The alternative hypothesis of coincidence of mants located in proximity to 500 Hz will benefitmore, on formant regions [3] would have predicted stronger blend top of which lower pitches would also increase the num- ratings for ΔF = 0than for all other levels, which in the ber of partial tones falling into this favorable frequency perceptual results is only achievedfor the levels ΔF>0. region. This reflects tendencies for pitch-invariant traits in While the hypothesis only achievespartial fulfillment with SAI correlations (see Section 2.2)tobemore pronounced respect to spectral variations ΔF ,itfinds more agreement at lower pitch ranges and for instruments of lower register, in the corresponding SAI representations. As shown in two which would lend support to the ‘darker-timbre’ hypothe- example cases for horn in Figure 9, the dyad SAI pro- sis in terms of limiting the spectral centroid in frequency. files for the levels ΔF>0are distinguishable from the remaining levels through clear deviations between 1and 7 → 7. Conclusion 2kHz and located just above the horn’sestimated F3dB. Remarkably,the formant shifts related to ΔF< 0(Fig- Evidence from acoustical and psychoacoustical descrip- ure 3) are not reflected in the corresponding dyad SAI tions of wind instruments and from perceptual validation profiles (Figure 9),which instead exhibit direct align- shows that relative location and prominence of main for- ment below500 Hz for all three levels ΔF ≤ 0. There- mants affect the perception of timbre blend critically.Fur- fore, only ΔF>0seems to lead to incomplete masking thermore, these pitch-invariant spectral characteristics ex- [13], revealing the presence of the synthesized instrument, plain and predict the perception of blend to apromis- whereas the spectral-envelope variations ΔF<0evoke ingly high degree. Remaining discrepancies between the little change compared to the dyad excitation pattern for acoustic and perceptual domains can be explained through ΔF = 0. apparent constraints of the simulated auditory system. Of still greater importance, the auditory system, as mod- In conclusion, aperceptual model for the contribution eled by AIM using the DCGC, seemingly involves ahigh- of local spectral-envelope characteristics to blending is proposed, keeping in mind that it would serveasan 7 Concerning the output from the AIM, amisalignment between actual instrument-specificcomponent in amore complex, gen- sinusoidal frequencies and the corresponding SAI peaks wasobserved. Through personal communication with the developer of the utilized AIM eral perceptual model involving compositional and per- implementation, this wasexplained as being an inherent property of the formance factors, as initially discussed in Section 1. The dynamic-compression filters. Acorrection function wasderivedbycom- main factors influencing the perception of spectral blend puting SAIs for various sinusoidal frequencies and fitting the twofre- are summarized: quencyscales, yielding the linear function fSAI = 1.17 · f + 28.2Hz [r2(50) =.999,p <.0001]. In Figure 9, the correction manifests itself in 1. Frequencyrelationships between upper bounds of main the compressed frequencyextent for the SAIs. formants are critical to blend. Among several instru-

1050 Lembke,McAdams: Spectral-envelope characteristics ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 101 (2015)

ments, one is expected to serveasthe reference (e.g., [11] L. Hermann: Phonophotographische Untersuchungen lead instrument, dominant timbre), above which the (Phonophotographic investigations). Pflüger,Archivfür presence of other instruments’ formants would strongly die Gesamte Physiologie des Menschen und der Thiere 58 (1894)264–279. result in decreased blend. 2. Prominence of the main formants governs whether [12] C. Stumpf: Die Sprachlaute (The speech sounds). Springer, Berlin, 1926. these relationships lead to plateau or linear blend pro- [13] J. P. Fricke: Zur Anwendung digitaler Klangfarbenfilter bei files, and in the first case pitch-invariant perceptual ro- Aufnahme und Wiedergabe (Onthe application of digital bustness extends to non-unison intervals. timbre filters during recording and playback). Proc. 14th 3. Spectral-envelope relationships below500 Hz may be Tonmeistertagung, Munich, Germany, 1986, 135–148. negligible, due to constraints of the auditory system. At [14] K. E. Schumann: Physik der Klangfarben (Physics of tim- the same time, blend decreases at higher pitches due to bres). Professorial dissertation. Universität Berlin, Berlin, adegraded perceptual robustness of formants. 1929. This hypothetical model still requires further investigation [15] E. L. Saldanha, J. F. Corso: Timbre cues and the identifica- concerning amore systematic study on 1) the apparent tion of musical instruments. J. Acoust. Soc. Am. 36 (1964) constraints of the auditory system as modeled by AIM, 2021–2026. 2) howinmusical practice one instrument may function [16] W. Strong, M. Clark: Perturbations of synthetic orchestral wind-instrument tones. J. Acoust. Soc. Am. 41 (1967)277– as the reference, 3) establishing amore specificdescrip- 285. tion of formant prominence, and 4) addressing the contri- [17] D. Luce, J. Clark: Physical correlates of brass-instrument bution of loudness balance between instruments to blend. tones. J. Acoust. Soc. Am. 42 (1967)1232–1243. These future investigations will further validate and refine [18] D. A. Luce: Dynamic spectrum changes of orchestral in- the proposed perceptual model and will improve compu- struments. J. Audio Eng. Soc. 23 (1975)565–568. tational prediction tools for the instrument-specific, spec- [19] J. Meyer: Acoustics and the performance of music. 5th ed. tral component of blend. Orchestration practice will ben- Springer,New York, 2009. efitfrom these research efforts even beyond aiming for [20] G. Fant: Acoustic theory of speech production. Mouton, blend, as knowledge of favorable instrument relationships The Hague, 1960. also informs orchestrators as to howtoavoid it. [21] C. Reuter: Klangfarbe und Instrumentation: Geschichte – Ursachen –Wirkung (Timbre and instrumentation: history References –causes –effect). P. Lang, Frankfurt am Main, 2002. [22] R. vanDinther,R.D.Patterson: Perception of acoustic [1] G. J. Sandell: Concurrent timbres in orchestration: aper- scale and size in musical instrument sounds. J. Acoust. ceptual study of factors determining blend. PhDdisserta- Soc. Am. 120 (2006)2158–2176. tion. Northwestern University,1991. [23] T. Irino, R. D. Patterson: Adynamic compressive gam- [2] R. A. Kendall, E. C. Carterette: Identification and blend of machirp auditory filterbank. IEEE Trans. Audio Speech timbres as abasis for orchestration. Contemp. Music Rev. Lang. Process. 14 (2006)2222–2232. 9 (1993)51–67. [24] B. C. Moore, B. R. Glasberg: Suggested formulae for cal- [3] C. Reuter: Die auditive Diskrimination vonOrchesterin- culating auditory-filter bandwidths and excitation patterns. strumenten -Verschmelzung und Heraushörbarkeit von J. Acoust. Soc. Am. 74 (1983)750–753. Instrumentalklangfarben im Ensemblespiel (The auditory [25] S.-A. Lembke, S. McAdams: Aspectral-envelope synthe- discrimination of orchestral instruments: fusion and distin- sis model to study perceptual blend between wind instru- guishability of instrumental timbres in ensemble playing). ments. Proc. 11th Congrès Français d’Acoustique /IOA P. Lang, Frankfurt am Main, 1996. annual meeting /Acoustics 2012, Nantes, France, 2012, [4] G. J. Sandell: Roles for spectral centroid and other factors 1025–1030. in determining “blended" instrument pairings in orchestra- [26] F. Martin, C. Champlin: Reconsidering the limits of normal tion. Music Percept. 13 (1995)209–246. hearing. J. Am. Acad. Audiol. 11 (2000)64–66. [5] A. S. Bregman: Auditory scene analysis: the perceptual or- [27] ISO 389–8: Acoustics: Reference zero for the calibration ganization of sound. MIT Press, Cambridge, MA, 1990. of audiometric equipment—Part 8: Reference equivalent [6] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, J. threshold sound pressure levels for pure tones and circum- Krimphoff:Perceptual scaling of synthesized musical tim- aural earphones. Tech. Rept. International Organization for bres: common dimensions, specificities, and latent subject Standardization, Geneva,Switzerland, 2004. classes. Psychol. Res. 58 (1995)177–192. [28] W. J. Conover, R. L. Iman: Rank transformations as a [7] L. Wedin, G. Goude: Dimension analysis of the perception bridge between parametric and nonparametric statistics. of instrumental timbre. Scand. J. Psychol. 13 (1972)228– Am. Stat. 35 (1981)124–129. 240. [29] J. J. Higgins, S. Tashtoush: An aligned rank transform test [8] J. M. Grey, J. W. Gordon: Perceptual effects of spectral for interaction. Nonlinear World 1 (1994)201–221. modifications on musical timbres. J. Acoust. Soc. Am. 63 [30] G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, S. (1978)1493–1500. McAdams: The Timbre Toolbox: extracting audio descrip- tors from musical signals. J. Acoust. Soc. Am. 130 (2011) [9] W. Strong, M. Clark: Synthesis of wind-instrument tones. 2902–2916. J. Acoust. Soc. Am. 41 (1967)39–52. [31] S. McAdams: Spectral fusion, spectral parsing and the for- [10] D. Tardieu, S. McAdams: Perception of dyads of impulsive mation of auditory images. PhD dissertation. Stanford Uni- and sustained instrument sounds. Music Percept. 30 (2012) versity,1984. 117–128.

1051