
Philosophy of Mind and Music Theory: Describe how the evolution of human thought about the mind-body problem was related to our understanding of music. Discuss how psychological theories and music theory are related.

Answer: Ancient Greek philosophers described different relationships between sound and sense. Pythagoras mainly addressed the physical aspects by considering a mathematical order underlying pitch relationships. Aristoxenos addressed perception and musical experience. Plato and Aristotle understood the relationship between sound and sense in terms of a mimesis theory, which stated that rhythms and melodies contain similarities with the true nature of qualities in human character, such as anger, gentleness, courage, temperance and the contrary qualities.

Descartesʼ approach distinguished sound and sense. In Descartesʼ opinion, the soul and the body are two entirely different things. The soul can think and does not need extension, whereas the body cannot think and needs extension. Knowledge of the soul requires introspection, whereas knowledge of the body requires scientific methods and descriptions that focus on moving objects. According to Descartes, the link between “I” and the world is due to an organ in the brain that connects the parallel worlds of the subjective mind and the objective body. His focus on moving objects opened the way for scientific investigations in acoustics and psychoacoustics, and it pushed matters related to sense and meaning further away, towards a disembodied mental phenomenon.

In music this dualism resulted in the development of rule-based accounts of musical practices such as Zarlinoʼs, and later Rameauʼs and Matthesonʼs. Sound became the subject of a scientific theory, while sense was still considered to be the by-product of something subjective that is done with sound.

Cognitive approaches: Through the emerging empirical disciplines of the nineteenth century, the idea was launched that between sound and sense there is the human brain, whose principles could be understood first in terms of psychic principles and later as principles of information processing. In 1863 the field of psychoacoustics was established by von Helmholtz. This led to experimental psychology, then to Gestalt psychology in the first half of the 20th century, and to the cognitive sciences approach of the second half of the 20th century. In the cognitive approach the sound/sense relationship was mainly conceived from the point of view of mental processing.

The embodied cognition approach claims that the link between sound and sense is based on the role of action as mediator between physical energy and meaning. The embodied approach differs from the Gestalt approach in that it puts more emphasis on action and is based on action-relevant perception, as reflected in the ecological psychology of Gibson. The embodiment hypothesis entails that meaningful activities of humans proceed in terms of goals, values, and interpretations, while the physical world in which these activities are embedded can be described from the point of view of physical energy, features, and descriptors. Mediation concerns the intermediary processes that bridge the semantic gap between the human approach (subject-centered) and the physical approach (object- or sound-centered).

Tonal Acculturation and Implicit Learning: Explain the levels of tonal acculturation based on implicit learning of Western pitch regularities (tones, chords, and keys). How does the co-occurrence of tones and chords relate to within-key hierarchies? Discuss the difference between rule-based and implicit learning.

Answer: Implicit learning of language grammar: Reber (1967). Implicit learning is the process through which we become sensitive to certain regularities in the environment (1) in the absence of an intention to learn about those regularities, (2) in the absence of awareness that one is learning, and (3) in such a way that the resulting knowledge is difficult to express.

In Western music, the functions of tones and chords depend on the established key. Within-key hierarchies are strongly correlated with the frequency of occurrence of tones and chords in Western musical pieces: more frequent = more important. When listening to music in everyday life, listeners become sensitive to the regularities of the tonal system without necessarily being able to verbalise them.
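As a hedged illustration of this frequency-of-occurrence principle, the sketch below counts pitch classes in a toy melody (a hypothetical input, not data from the studies discussed here) and correlates the counts with the Krumhansl-Kessler probe-tone profile for C major:

```python
# Minimal sketch: more frequent tones tend to be the more "important"
# tones of the within-key hierarchy. The melody is a made-up example.
import numpy as np

# Krumhansl & Kessler (1982) probe-tone ratings for C major (C, C#, ..., B).
KK_C_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                       2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

# A toy C-major melody as MIDI note numbers (hypothetical input).
melody = [60, 62, 64, 65, 67, 67, 69, 67, 65, 64, 62, 60, 64, 67, 72]

# Frequency of occurrence per pitch class (0 = C, ..., 11 = B).
counts = np.bincount(np.array(melody) % 12, minlength=12)

# A high correlation mirrors the "more frequent = more important" link.
r = np.corrcoef(counts, KK_C_MAJOR)[0, 1]
print(f"correlation with the C-major hierarchy: {r:.2f}")
```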

Connectionist models: Discuss how Connectionist models address issues of acoustic (perceptual) versus cultural (cognitive) learning of Western Tonal System. Explain how three levels of Western Tonal System can be described by a connectionist model.

Answer: Connectionist models have two principal advantages over traditional rule-based models: (a) The rules governing the domain are not explicit but rather emerge from the simultaneous satisfaction of multiple constraints represented by individual connections, and (b) these constraints themselves can be learned through passive exposure.

MUSACT highlights a crucial issue of Western music: whether the relations between chords are driven by similarities based on acoustic properties of tones or by implicit knowledge of cultural conventions and usage. MUSACT disentangles these two factors by charting the time course of bottom-up and top-down influences. It predicts that the activation pattern reflects bottom-up influences at early activation cycles, whereas top-down influences are predominant when the model has enough time to reach equilibrium.
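The following is a much-simplified sketch of such a network (an assumption-laden toy, not Bharucha's actual implementation): only major triads and major keys are modeled, tones connect to the chords that contain them, chords connect to the keys in which they function as I, IV, or V, and activation reverberates between layers until it settles.

```python
# Toy tone-chord-key network with bidirectional spreading activation.
# MUSACT proper also includes minor chords/keys and calibrated weights;
# this sketch only illustrates the bottom-up vs. top-down time course.
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

W_tc = np.zeros((12, 12))           # tone -> major-chord links
for root in range(12):
    for interval in (0, 4, 7):      # root, major third, fifth
        W_tc[(root + interval) % 12, root] = 1.0

W_ck = np.zeros((12, 12))           # chord -> major-key links
for key in range(12):
    for degree in (0, 5, 7):        # roots of I, IV, V
        W_ck[(key + degree) % 12, key] = 1.0

def chord_activations(tones_on, cycles):
    """Clamp the input tones and let activation reverberate for `cycles`."""
    tones = np.zeros(12); tones[tones_on] = 1.0
    keys = np.zeros(12)
    for _ in range(cycles):
        chords = W_tc.T @ tones + W_ck @ keys    # bottom-up + top-down
        keys = W_ck.T @ chords
        chords /= chords.max(); keys /= keys.max()
    return chords

# Sound a C major chord (C, E, G) and inspect the chord layer.
for cycles, label in [(1, "early cycles"), (50, "near equilibrium")]:
    act = chord_activations([0, 4, 7], cycles)
    best = np.argsort(act)[::-1][:4]
    print(label, [(NOTES[i], round(float(act[i]), 2)) for i in best])
```

At early cycles the pattern mainly reflects shared tones (bottom-up); after many cycles the key layer feeds activation back, boosting harmonically related chords (top-down).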

Draw the graph of tones, chords and keys to show how a pattern of connections constitutes a knowledge representation of Western harmony and discuss how it is related to human perception of tonality.

Priming experiments: Support for the MUSACT model has come from empirical studies using a harmonic priming paradigm. The rationale of these studies is that a previous chord primes harmonically related chords so that their processing is speeded up. Participants heard a prime chord followed by a harmonically closely or distantly related target chord. The priming effect created (a) a bias to judge targets to be in tune when related to the prime and out of tune when unrelated, and (b) shorter response times for in-tune targets when related and for out-of-tune targets when unrelated.

Summary: A previous musical context (a single chord in these experiments) thus generates expectancies for related chords to follow, resulting in greater consonance and faster processing for expected chords. This is consistent with MUSACT simulations.

Bharucha vs. Tillmann: Describe the difference between Bharucha's MUSACT and Tillmann's SOM and how it relates to implicit learning of the tonal system.

Answer: MUSACT, as originally conceived, was based on music-theoretic constraints; neither the connections nor their weights resulted from a learning process. Tillmann used a SOM (self-organizing map) to simulate learning by mere exposure. The SOM is based on competitive learning, an algorithm for data-driven self-organized learning. With this algorithm, the neural net units gradually become sensitive to different input stimuli or categories. The learned connections and the activation patterns after training mirror the outcome of the hardwired MUSACT network.
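A minimal sketch of SOM-style competitive learning appears below (a generic toy; the actual model was a more elaborate hierarchical SOM trained on encodings of tones, chords, and keys):

```python
# Toy 1-D self-organizing map trained by mere exposure to major triads.
# Map size, learning rate, and neighborhood are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_units, n_features = 16, 12                 # map units x pitch-class dims
weights = rng.random((n_units, n_features))

def triad(root):
    v = np.zeros(n_features)
    v[[root % 12, (root + 4) % 12, (root + 7) % 12]] = 1.0
    return v

def train(samples, epochs=20, lr=0.1, radius=2):
    for _ in range(epochs):
        for x in samples:
            # competitive step: find the best-matching unit
            winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            # cooperative step: the winner and its map neighbors move toward x
            for u in range(max(0, winner - radius), min(n_units, winner + radius + 1)):
                influence = np.exp(-((u - winner) ** 2) / 2.0)
                weights[u] += lr * influence * (x - weights[u])

train([triad(r) for r in range(12)])         # passive exposure, no labels
print(int(np.argmin(np.linalg.norm(weights - triad(0), axis=1))))  # unit for C major
```

No explicit rules are given: after exposure, individual units have specialized for the recurring input categories.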

Acoustical versus Statistical Regularities: Discuss the relations between acoustic and statistical regularities for the case of tonal and non-tonal music. Are these relations necessary for learning musical concepts by listeners? Give examples.

Answer: In tonal acculturation, frequently co-occurring musical events also share component tones (i.e., they are acoustically related). Arabic and serial music lack this convergence between acoustical and statistical features.

Ayari and McAdams show acculturation in Arab music: Arabic modes involve not just a tuning system, but also essential melodico-rhythmic configurations that are emblematic of the maqam. Arab listeners make segmentations that are defined by subtle modal changes, which often go unnoticed by European listeners.

Conflicting evidence for serial music: Francès, Delannoy, and Dienes and Longuet-Higgins show that listeners could not differentiate “grammatical” from “non-grammatical” serial music. Krumhansl and Sandell show that previous exposure is critical for the perception of rules in serial music. Bigand et al. show that musicians and non-musicians can discriminate above chance (~60%) between serial compositions that do or do not belong to a certain row or its retrograde inversion.

TPST: Show how the idea of modeling cognitive musical tension by TPST generalizes the perceptual dissonance/consonance relation to the cognitive level. Describe the levels of TPST and discuss how it can be used for the design of multimedia learning tools.

Answer: In TPST the tonal hierarchy is represented at three levels: (1) the pitch-class level, which describes distances between the 12 pitch classes; (2) the chordal level, which describes distances between chords within a key; (3) the regional level, which describes distances between keys.

TPST treats tension as a combination of the number of changes at the pitch-class level created by the second chord, the number of steps that separate the roots of the chords along the circle of fifths, and the regional distance, which combines the circle of fifths and the parallel/relative major-minor relations.
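A simplified, hedged sketch of this distance idea follows; Lerdahl's actual rule also weights pitch classes by their level in the basic space and adds the regional term, both omitted here for brevity:

```python
# Toy TPST-style chord distance: circle-of-fifths root motion plus the
# number of new pitch classes the second chord introduces. This is a
# simplification of Lerdahl's d = i + j + k (the regional term i and
# the basic-space weighting are left out).

def fifths_steps(root_x, root_y):
    """Shortest distance between two chord roots on the circle of fifths."""
    pos_x, pos_y = (root_x * 7) % 12, (root_y * 7) % 12   # fifths ordering
    d = abs(pos_x - pos_y)
    return min(d, 12 - d)

def triad(root, minor=False):
    third = 3 if minor else 4
    return {root % 12, (root + third) % 12, (root + 7) % 12}

def chord_distance(root_x, root_y, minor_x=False, minor_y=False):
    j = fifths_steps(root_x, root_y)                          # root motion
    k = len(triad(root_y, minor_y) - triad(root_x, minor_x))  # pc changes
    return j + k

# Tension grows with harmonic distance from the local tonic (C major here):
for name, root in [("C", 0), ("G", 7), ("F", 5), ("D", 2), ("F#", 6)]:
    print(name, chord_distance(0, root))   # 0, 3, 3, 5, 9
```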

TPST generates quantitative predictions of tension and attraction for the sequence of events in any passage of tonal music. “Tension,” as employed here, refers both to sensory dissonance and to cognitive dissonance or instability; similarly, “relaxation” refers to sensory consonance and to cognitive consonance or stability.

To sum up so far, psychoacoustics provides explanations for aspects of tonality and pitch structure, such as the preference for small-integer frequency ratios that produce relative sensory consonance. Melodic continuity, by contrast, favors small frequency differences, even though such intervals are harmonically rough. These preferences can be partially explained by the general Gestalt principles of proximity and good continuation, but they do not explain why the particular intervals of the whole step and half step are so prevalent in melodic organization across the musical idioms of the world. Moreover, tonal pitch space cannot be explained by psychoacoustic considerations alone and requires more abstract cognitive features. TPST works out quantitative metrics of degrees of tension and attraction within a melody and/or chord progression at any point.

Principles for the design of multimedia learning tools: 1. let learners easily develop a mental representation (affordances) that is compatible with the representation of experts; 2. reduce the quantity of information and optimize the form in which it is presented, so as to improve attentional processes, aid memorisation of musical material, and develop the capacity to represent musical structures.

Metadata and Essence: Define what metadata is and discuss the difficulties in using metadata as a content descriptor for music. Specifically address the issues of user modeling and the semantic gap, and give examples of the use of metadata in real applications.

Answer: Essence is the media itself; metadata (literally, “data about the data”) is used to describe the essence. Metadata categories:
Essential - technical information needed to reproduce the essence (like file format)
Access - permissions to access content, such as copyright information
Parametric - essence capture set-up (camera or microphone set-up, perspective, etc.)
Relational - synchronization between different content components (e.g. time-code)
Descriptive - description of the content to facilitate the cataloging, search, retrieval and administration of content (title, cast, keywords, classifications of the images and texts, etc.)

Another classification:
Descriptive metadata - such as title, abstract, author, and keywords
Structural metadata - how essence elements are related (e.g. how audio and video are put together)
Administrative metadata - technical and rights-management information
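As a hypothetical illustration (all values invented, not from any real catalog), a single recording's metadata might be grouped along these categories as follows:

```python
# Illustrative metadata record for one audio asset (toy values).
track_metadata = {
    "descriptive": {                      # what the content is about
        "title": "Prelude in C",
        "author": "J. S. Bach",
        "keywords": ["baroque", "keyboard", "C major"],
    },
    "structural": {                       # how essence elements relate
        "components": ["audio.wav", "score.pdf"],
        "timecode_offset_s": 0.0,
    },
    "administrative": {                   # technical and rights information
        "file_format": "WAV, 44.1 kHz / 16 bit",
        "rights": "CC BY-NC 4.0",
    },
}
```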

Levels of Analysis (Marr):
Computational level - what is computed and why (the problem)
Algorithmic level - the steps to be carried out to solve a given problem
Implementational level - the realization of the algorithm in different ways

Descriptor levels:
Low-level descriptors - directly derived from the signal (signal-centered)
Mid-level descriptors - inferred from a larger set of signals to describe sound “objects” (object-centered)
High-level descriptors - require induction and a user model, bridging the “semantic gap” (user-centered)

When we talk about a content descriptor, we mean “a distinctive characteristic of the data which signifies something to somebody.”

The semantic gap is “the lack of coincidence between the information that one can extract from the (sensory) data and the interpretation that the same data has for a user in a given situation.”

The “user-modeling problem” is that metadata are not guaranteed to be understandable by every user.

Content Processing: Describe the difference between signal and content processing. Give examples for each. What disciplines and technologies are involved in content processing?

Answer: Technologies required for content processing include signal analysis, automatic (machine) learning, large databases, and models of human information processing. Music content processing can be characterised by two different tasks: describing and exploiting content.

“the science of musical content processing aims at explaining and modelling the mechanisms that transform information streams into meaningful musical units (both cognitive and emotional).”

“content processing is meant as a general term covering feature extraction and modeling techniques for enabling basic retrieval, interaction and creation functionality.”

Audio Features: Describe the process of extracting features in time and frequency for low-level audio content description, and discuss the different time scales of descriptors, going from frame-based to region features.

Answer: Low-level features are frame-based, with a frame size, overlap, and hop size.
Temporal features: mean amplitude, energy, zero-crossing rate, temporal centroid, auto-correlation.
Spectral features: spectrum energy, energy in sub-bands, mean, geometric mean, spread, centroid, flatness, kurtosis, skewness, spectral slope, high-frequency content and roll-off.
More advanced features include:
Spectral modeling: sinusoidal models, spectral peaks and residual, noisiness, harmonic distortion, harmonic spectral centroid, harmonic spectral tilt.
MFCC: separates the spectral envelope from the spectral detail. Uses FFT, Mel filterbank, Log, DCT. MFCCs are “natural” features for comparison (measuring similarity) between audio recordings.
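A minimal sketch of frame-based extraction is given below (the frame size, hop size, and test signal are assumptions), computing one temporal feature (zero-crossing rate) and one spectral feature (spectral centroid) per frame:

```python
# Frame-based low-level features: zero-crossing rate and spectral centroid.
import numpy as np

def frames(signal, frame_size=1024, hop_size=512):
    """Slice the signal into overlapping frames."""
    n = 1 + max(0, (len(signal) - frame_size) // hop_size)
    return np.stack([signal[i * hop_size : i * hop_size + frame_size]
                     for i in range(n)])

def zero_crossing_rate(frame):
    # fraction of sample pairs where the waveform changes sign
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

def spectral_centroid(frame, sr):
    # magnitude-weighted mean frequency of the windowed spectrum
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)       # toy input: a 440 Hz sine
for f in frames(signal)[:3]:
    print(f"zcr={zero_crossing_rate(f):.3f}  centroid={spectral_centroid(f, sr):.0f} Hz")
```

The MFCC pipeline follows the same frame-based scheme, inserting the Mel filterbank, log, and DCT between the FFT and the final coefficients.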

Segmentation: determination of relevant region boundaries. “[Segmentation] refers to the process of dividing an event sequence into distinct groups of sounds. The factors that play a role in segmentation are similar to the grouping principles addressed by Gestalt psychology.” “[Sound] segmentation is a process of partitioning [the sound file/stream] into non-intersecting regions such that each region is homogeneous and the union of no two adjacent regions is homogeneous.”
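A toy homogeneity-based segmenter in this spirit (the features and threshold are arbitrary assumptions) can be written as:

```python
# Place a region boundary wherever consecutive feature vectors differ by
# more than a threshold, so each resulting region stays roughly homogeneous.
import numpy as np

def segment(features, threshold=1.0):
    """features: (n_frames, n_dims) array; returns indices of region starts."""
    dist = np.linalg.norm(np.diff(features, axis=0), axis=1)
    boundaries = np.where(dist > threshold)[0] + 1
    return np.concatenate(([0], boundaries))

# Two artificial homogeneous regions with an abrupt change between them.
feats = np.vstack([np.full((50, 2), 0.0), np.full((40, 2), 3.0)])
print(segment(feats))   # -> [ 0 50 ]
```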

The level of abstraction that can be attributed to the resulting regions may depend on the features used in the first place. For instance, if a set of low-level features is known to correlate strongly with a human percept (as the fundamental frequency correlates with pitch, and the energy in Bark bands correlates with loudness), then the obtained regions may have some relevance as features at a mid level of abstraction (e.g. music notes in this case).

Model-based segmentation detects mid-level feature boundaries. A classical example can be found in speech processing, with segmentation into phonemes or words.

Content Processing Applications: Give examples of content processing applications for describing and exploiting musical contents.

Describing contents:
Mid-level: tonal descriptors. Pitch detection in the time domain (zero crossing, autocorrelation; see the sketch below) or the frequency domain (cepstrum, harmonic matching); multi-pitch estimation; predominant pitch estimation; melody extraction; pitch class distributions (HPCP); key finding.
Rhythm: tempo, meter, timing (long term - rubato; short term - swing).
High level: genre.
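The sketch below illustrates the time-domain autocorrelation approach to pitch detection mentioned above (the window length and search range are assumptions):

```python
# Autocorrelation pitch detection: the lag of the strongest peak beyond
# lag zero estimates the waveform's period.
import numpy as np

def pitch_autocorr(frame, sr, fmin=50.0, fmax=1000.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range to search
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 22050
t = np.arange(2048) / sr
print(round(pitch_autocorr(np.sin(2 * np.pi * 220 * t), sr), 1))  # ~220 Hz
```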

Exploitation:
Content-based search and retrieval; content monitoring; added value (providing metadata from databases); summarization; play-list generation; visualization.
Content-based transformations: time scaling (warping), mapping and morphing (vocoder), melodic transformations (the “T-Pain effect”), intelligent harmonizing.

Audio Fingerprints: Describe the process of creating a fingerprint and its main applications.

Answer: Audio fingerprints are compact content-based signatures summarizing audio recordings. The fingerprint extraction derives a set of features from a recording in a concise and robust form. Fingerprint requirements include: discrimination power over huge numbers of other fingerprints; invariance to distortions; compactness; computational simplicity.

The sequence of features calculated on a frame-by-frame basis is then further reduced by calculating statistics of the frame values (mean and variance) and clustering the feature vectors.
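A toy extractor in the spirit of band-energy fingerprinting schemes (e.g. Haitsma and Kalker's; the band layout and sizes below are simplified assumptions) might look like this:

```python
# Per frame: log-spaced band energies; fingerprint bits are the signs of
# band-energy differences across both band and time, which makes the
# signature compact and fairly robust to level and coding distortions.
import numpy as np

def fingerprint(signal, sr, frame=2048, hop=1024, n_bands=17):
    edges = np.geomspace(300.0, 2000.0, n_bands + 1)
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    bits, prev = [], None
    for i in range(0, len(signal) - frame, hop):
        spec = np.abs(np.fft.rfft(signal[i:i + frame] * np.hanning(frame))) ** 2
        e = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                      for lo, hi in zip(edges[:-1], edges[1:])])
        if prev is not None:
            bits.append(((e[1:] - e[:-1]) - (prev[1:] - prev[:-1])) > 0)
        prev = e
    return np.array(bits, dtype=np.uint8)   # (n_frames - 1) x (n_bands - 1)

# Matching an unknown excerpt then reduces to finding the database entry
# with the smallest Hamming distance between fingerprint blocks.
```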

With the help of fingerprinting systems it is possible to identify an unlabeled piece of audio and thereby provide a link to the corresponding metadata (e.g. artist and song name). Fingerprinting can thus help identify unlabeled audio when monitoring TV and radio channelsʼ repositories.

Music Information Retrieval: Describe what the field of MIR does and how it is related to content description and processing.

MIR deals with automatically extracting high-level atomic descriptors for the characterisation of music. These high-level terms are inferred via a combination of bottom-up audio descriptor extraction with the application of machine learning algorithms. The gap between what can be extracted bottom-up and more abstract, human-centered concepts can be partly closed with the help of inductive machine learning. Meaningful descriptors can be extracted not just from an analysis of the music (audio) itself, but also from extra-musical sources, such as the internet (via web mining).

One typical task of inductive learning is the automatic construction of classifiers from pre-classified training examples. Common training and classification algorithms in statistical pattern classification include nearest-neighbour classifiers (k-NN), Gaussian mixture models, neural networks (mostly multi-layer feed-forward perceptrons), and support vector machines. Other typical examples of machine learning algorithms that are also used in music classification are decision trees and other rule-learning algorithms.
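For instance, a classifier can be constructed from pre-classified examples as in the hedged sketch below (the feature vectors are random stand-ins for, say, per-track mean MFCCs, and the genre labels are invented):

```python
# k-NN genre classification from labeled training examples (toy data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 13)),    # stand-in "classical" features
               rng.normal(2, 1, (50, 13))])   # stand-in "rock" features
y = np.array(["classical"] * 50 + ["rock"] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```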

Labeled data sets can be created in the lab by asking users to tag music with different textual terms, or they can be derived by automatically extracting relevant descriptors from the Internet (mostly from general, unstructured web pages) via information retrieval, text mining, and information extraction techniques.

The goal of Music Information Retrieval is not so much for a computer to understand music in a human-like way, but simply to have enough intelligence to support intelligent musical services and applications. Perfect musical understanding may not be required here. For instance, genre classification need not reach 100% accuracy to be useful in music recommendation systems. Likewise, a system for quick music browsing need not perform a perfect segmentation of the music; if it finds roughly those parts in a recording where some of the interesting things are going on, that may be perfectly sufficient. Also, relatively simple capabilities like classifying music recordings into broad categories (genres) or assigning other high-level semantic labels to pieces can be immensely useful.

List of Terms:

1. Gestalt psychology
2. Timbre
3. Basilar Membrane
4. Shepard-Risset Glissando (Shepardʼs Tone)
5.
6. Affordance
7. Pitch Class
8. Tonal Pitch Space Theory (defined as: idealised knowledge representation of tonal hierarchy)
9. Content Processing
10. Music Information Retrieval
11. Metadata
12. Low/Medium/High Level Content Descriptors
13. Temporal Features
14. Spectral Features
15. MFCCs
16. Audio Fingerprints
17. Critical Bandwidth
18. Beating
19. Roughness
20. Consonance
21. Dissonance
22. Autocorrelation
23. Semantic Gap
24. A Generative Theory of Tonal Music (GTTM)

GTTM: Motivated by Chomsky's view of linguistic theory as the formal study of the human capacity for language. The theory begins with musical surfaces and generates their structural descriptions. Its rules are motivated psychologically and represent cognitive principles of organization. The structural descriptions correspond to predicted heard structures. The theory is, in principle, testable.