
IMPROVING MUSIC MOOD CLASSIFICATION USING LYRICS, AUDIO AND SOCIAL TAGS

BY

XIAO HU

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Library and Information Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2010

Urbana, Illinois

Doctoral Committee:

Professor Linda C. Smith, Chair
Associate Professor J. Stephen Downie, Director of Research
Associate Professor ChengXiang Zhai
Associate Professor P. Bryan Heidorn, University of Arizona

ABSTRACT

The affective aspect of music (popularly known as music mood) is a newly emerging metadata type and access point to music information, but it has not been well studied in information science. There has yet to be developed a suitable set of mood categories that can reflect the reality of music listening and can be well adopted in the Music Information Retrieval

(MIR) community. As music repositories have grown to an unprecedentedly large scale, people call for automatic tools for music classification and recommendation. However, there have been only a few music mood classification systems with suboptimal performances, and most of them are solely based on the audio content of the music. Lyric text and social tags are resources independent of and complementary to audio content but have yet to be fully exploited.

This dissertation research takes up these problems and aims to 1) summarize fundamental insights in music psychology that can help information scientists interpret music mood; 2) identify mood categories that are frequently used by real-world music listeners, through an empirical investigation of real-life social tags applied to music; 3) advance the technology in automatic music mood classification by a thorough investigation on lyric text analysis and the combination of lyrics and audio. Using linguistic resources and human expertise, 36 mood categories were identified from the most popular social tags collected from last.fm, a major

Western music tagging site. A ground truth dataset of 5,296 songs in 18 mood categories was built with mood labels given by a number of real-life users. Both commonly used text features and advanced linguistic features were investigated, as well as different feature representation models and feature combinations. The best performing lyric feature sets were then compared to a leading audio-based system. In combining lyric and audio sources, both methods of feature

concatenation and late fusion (linear interpolation) of classifiers were examined and compared.

Finally, system performances on various numbers of training examples and different audio lengths were compared. The results indicate: 1) social tags can help identify mood categories suitable for a real world music listening environment; 2) the most useful lyric features are linguistic features combined with text stylistic features; 3) lyric features outperform audio features in terms of averaged accuracy across all considered mood categories; 4) systems combining lyrics and audio outperform audio-only and lyric-only systems; 5) combining lyrics and audio can reduce the requirement on training data size, both in number of examples and in audio length.

Contributions of this research are threefold. On methodology, it improves the state of the art in music mood classification and text analysis in the music domain. The mood categories identified from empirical social tags can complement those in theoretical psychology models. In addition, many of the lyric text features examined in this study have never been formally studied in the context of music mood classification nor been compared to each other using a common dataset. On evaluation, the ground truth dataset built in this research is large and unique with ternary information available: audio, lyrics and social tags. Part of the dataset has been made available to the MIR community through the Music Information Retrieval Evaluation eXchange

(MIREX) 2009 and 2010, the community-based evaluation framework. The proposed method of deriving ground truth from social tags provides an effective alternative to the expensive human assessments on music and thus clears the way to large scale experiments. On application, findings of this research help build effective and efficient music mood classification and recommendation systems by optimizing the interaction of music audio and lyrics. A prototype of such systems can be accessed at http://moodydb.com.


To Mom, Huamin, and Ethan

ACKNOWLEDGMENTS

I would have never made it to this stage of Ph.D. study without the help and support from many people. First and foremost, I am greatly indebted to my academic advisor and mentor, Dr.

Linda C. Smith. I am blessed with an advisor and mentor who is supportive and encouraging, sincerely cares about students professionally and personally, provides detailed guidance while at the same time respecting one’s initiatives and academic creativity, and treats her advisees as future fellow scholars. The impact which Linda has had on me will keep motivating and guiding me as I step beyond my dissertation.

I also cannot say thank you enough to my research director, Dr. J. Stephen Downie. I am grateful that Stephen introduced me to the general topic of Music Information Retrieval and then gave me the freedom to identify a project which fit my interests. He also set me up with research funding and the opportunity to gain hands-on experience on real-world research projects. In addition, Stephen offered incredible guidance on writing research articles as well as completing this manuscript of my dissertation. Through interactions with Stephen, I also learned the benefits of efficiency and the active pursuit of research opportunities.

Many thanks go to Dr. ChengXiang Zhai. An early experience working with Cheng opened my eyes to information retrieval and text analytics. Besides helping me make progress on research by giving me invaluable feedback, advice and encouragement, Cheng also inspires me to mature as an academic and consider the big picture of research in information science.

Moreover, I am also very grateful to Dr. P. Bryan Heidorn. Bryan was my research supervisor when I first joined the Ph.D. program and has since served on every committee of

mine during my Ph.D. study. He provided kind support and guidance as I familiarized myself with this new academic and living environment. Over the years, Bryan has given me plenty of advice, not only on research methodology, but also on future career paths.

I am thankful to the Graduate School of Library and Information Science for awarding me a

Dissertation Completion Fellowship and providing me with financial aid to complete this program and to attend academic conferences, so that I could meet and communicate with researchers from other parts of the world.

I also wanted to thank my teammates in the International Music Information Retrieval

Systems Evaluation Laboratory (IMIRSEL): Mert Bay, Andreas Ehmann, Anatoliy Gruzd,

Cameron Jones, Jin Ha Lee and Kris West, for their wonderful collaboration and beneficial discussions on research as well as the sweet memories of marching to conferences together. I also thank my peer doctoral students who shared the journey with me during the years in the program. It was a privilege to grow up with them all.

Finally, I simply owe too much to my family, especially my mom, my husband and my son.

They endured the long process with great patience and always believed in me and had faith in me. This dissertation is dedicated to them.

TABLE OF CONTENTS

LIST OF FIGURES ...... x

LIST OF TABLES ...... xi

CHAPTER 1: INTRODUCTION ...... 1
1.1 INTRODUCTION ...... 1
1.2 ISSUES IN MUSIC MOOD CLASSIFICATION ...... 3
1.2.1 Mood Categories ...... 3
1.2.2 Ground Truth Datasets ...... 3
1.2.3 Multi-modal Classification ...... 4
1.3 RESEARCH QUESTIONS ...... 5
1.3.1 Identifying Mood Categories from Social Tags ...... 5
1.3.2 Best Lyric Features ...... 7
1.3.3 Lyrics vs. Audio ...... 8
1.3.4 Combining Lyrics and Audio ...... 9
1.3.5 Learning Curves and Audio Lengths ...... 10
1.3.6 Research Question Summary ...... 11
1.4 CONTRIBUTIONS ...... 11
1.4.1 Contributions to Methodology ...... 12
1.4.2 Contributions to Evaluation ...... 13
1.4.3 Contributions to Application ...... 14
1.5 SUMMARY ...... 15

CHAPTER 2: LITERATURE REVIEW ...... 16
2.1 MOOD AS A NOVEL METADATA TYPE FOR MUSIC ...... 16
2.2 MOOD IN MUSIC PSYCHOLOGY ...... 16
2.2.1 Mood vs. Emotion ...... 17
2.2.2 Sources of Music Mood ...... 19
2.2.3 What We Know about Music Mood ...... 20
2.2.4 Music Mood Categories ...... 21
2.3 MUSIC MOOD CLASSIFICATION ...... 24
2.3.1 Audio-based Music Mood Classification ...... 24
2.3.2 Text-based Music Mood Classification ...... 26
2.3.3 Music Mood Classification Combining Audio and Text ...... 27
2.4 LYRICS AND SOCIAL TAGS IN MUSIC INFORMATION RETRIEVAL ...... 28
2.4.1 Lyrics ...... 29
2.4.2 Social Tags ...... 31
2.5 SUMMARY ...... 32

CHAPTER 3: MOOD CATEGORIES IN SOCIAL TAGS ...... 33
3.1 IDENTIFYING MOOD CATEGORIES FROM SOCIAL TAGS ...... 33
3.1.1 Identifying Mood-related Terms ...... 33
3.1.2 Obtaining Mood-related Social Tags ...... 34

3.1.3 Cleaning up Social Tags by Human Experts ...... 35
3.1.4 Grouping Mood-related Social Tags ...... 35
3.2 MOOD CATEGORIES ...... 36
3.3 COMPARISONS TO MUSIC PSYCHOLOGY MODELS ...... 38
3.3.1 Hevner’s Circle vs. Derived Categories ...... 38
3.3.2 Russell’s Model vs. Derived Categories ...... 40
3.3.3 Distances between Categories ...... 42
3.4 SUMMARY ...... 44

CHAPTER 4: CLASSIFICATION EXPERIMENT DESIGN ...... 46
4.1 EVALUATION METHOD AND MEASURE ...... 46
4.1.1 Evaluation Task ...... 46
4.1.2 Performance Measure and Statistical Test ...... 47
4.2 CLASSIFICATION ALGORITHM AND IMPLEMENTATION ...... 49
4.2.1 Supervised Learning and Support Vector Machines ...... 49
4.2.2 Algorithm Implementation ...... 51
4.3 SUMMARY ...... 52

CHAPTER 5: BUILDING A DATASET WITH TERNARY INFORMATION ...... 53
5.1 DATA COLLECTION ...... 53
5.1.1 Audio Data ...... 53
5.1.2 Social Tags ...... 55
5.1.3 Lyric Data ...... 55
5.1.4 Summary ...... 56
5.2 GROUND TRUTH DATASET ...... 57
5.2.1 Identifying Mood Categories ...... 58
5.2.2 Selecting Songs ...... 61
5.2.3 Summary of the Dataset ...... 65

CHAPTER 6: BEST LYRIC FEATURES ...... 68
6.1 TEXT AFFECT ANALYSIS ...... 68
6.2 LYRIC FEATURES ...... 70
6.2.1 Basic Lyric Features ...... 71
6.2.2 Linguistic Lyric Features ...... 74
6.2.3 Text Stylistic Features ...... 77
6.2.4 Lyric Feature Type Concatenations ...... 78
6.3 IMPLEMENTATION ...... 80
6.4 RESULTS AND ANALYSIS ...... 80
6.4.1 Best Individual Lyric Feature Types ...... 80
6.4.2 Best Combined Lyric Feature Types ...... 81
6.4.3 Analysis of Text Stylistic Features ...... 84
6.5 SUMMARY ...... 87

CHAPTER 7: HYBRID SYSTEMS AND SINGLE-SOURCE SYSTEMS ...... 89
7.1 AUDIO FEATURES AND CLASSIFIER ...... 89
7.2 HYBRID METHODS ...... 90
7.2.1 Two Hybrid Methods ...... 90

7.2.2 Best Hybrid Method ...... 92
7.3 LYRICS VS. AUDIO VS. HYBRID SYSTEMS ...... 93
7.4 LYRICS VS. AUDIO ON INDIVIDUAL CATEGORIES ...... 97
7.4.1 Top Features in Content Word N-Grams ...... 99
7.4.2 Top-Ranked Features Based on General Inquirer ...... 100
7.4.3 Top Features Based on ANEW and WordNet ...... 101
7.4.4 Top Text Stylistic Features ...... 102
7.4.5 Top Lyric Features in “Calm” ...... 103
7.5 SUMMARY ...... 104

CHAPTER 8: LEARNING CURVES AND AUDIO LENGTH ...... 105
8.1 LEARNING CURVES ...... 105
8.2 AUDIO LENGTHS ...... 106
8.3 SUMMARY ...... 108

CHAPTER 9: CONCLUSIONS AND FUTURE RESEARCH ...... 110
9.1 CONCLUSIONS ...... 110
9.2 LIMITATIONS OF THIS RESEARCH ...... 113
9.2.1 Music Diversity ...... 113
9.2.2 Methods and Techniques ...... 113
9.3 FUTURE RESEARCH ...... 114
9.3.1 Feature Ranking and Selection ...... 114
9.3.2 More Classification Models and Audio Features ...... 114
9.3.3 Enrich Music Mood Theories ...... 115
9.3.4 Music of Other Types and in Other Cultures ...... 115
9.3.5 Other Music-related Social Media Than Social Tags ...... 116

REFERENCES ...... 117

APPENDIX A: LYRIC REPETITION AND ANNOTATION PATTERNS ...... 125

APPENDIX B. PAPERS AND POSTERS ON THIS RESEARCH ...... 129

LIST OF FIGURES

Figure 1.1 Research questions and experiment flow ...... 12
Figure 2.1 Hevner's adjective circle (Hevner, 1936) ...... 22
Figure 2.2 Russell’s model with two dimensions: arousal and valence (Russell, 1980) ...... 24
Figure 3.1 Words in Hevner’s circle that match tags in the derived categories ...... 39
Figure 3.2 Words in Russell’s model that match tags in the derived categories ...... 41
Figure 3.3 Distances of the 36 derived mood categories based on artist co-occurrences ...... 43
Figure 4.1 Binary classification ...... 46
Figure 4.2 Support Vector Machines in a two-dimensional space ...... 51
Figure 5.1 Example of labeling a song using social tags ...... 64
Figure 5.2 Distances between the 18 mood categories in the ground truth dataset ...... 65
Figure 6.1 Distributions of “!” across categories ...... 86
Figure 6.2 Distributions of “hey” across categories ...... 86
Figure 6.3 Distributions of “numberOfWordsPerMinute” across categories ...... 87
Figure 7.1 Effect of α value in late fusion on average accuracy ...... 92
Figure 7.2 Box plots of system accuracies for the BEST lyric feature set ...... 93
Figure 7.3 System accuracies across individual categories for the BEST lyric feature set ...... 96
Figure 8.1 Learning curves of hybrid and single-source-based systems ...... 105
Figure 8.2 System accuracies with varied audio lengths ...... 107

LIST OF TABLES

Table 3.1 Mood categories derived from last.fm tags ...... 36
Table 4.1 Contingency table of binary classification results ...... 47
Table 5.1 Information of audio collections hosted in IMIRSEL ...... 54
Table 5.2 Descriptions and statistics of the collections ...... 56
Table 5.3 Mood categories and song distributions ...... 60
Table 5.4 Distribution of songs with multiple labels ...... 63
Table 5.5 Genre distribution of songs in the experiment dataset (“Other” includes genres occurring very infrequently such as “World,” “Folk,” “Easy listening,” and “Big band”) ...... 66
Table 5.6 Genre and mood distribution of positive examples ...... 67
Table 6.1 Summary of basic lyric features ...... 73
Table 6.2 Text stylistic features evaluated in this research ...... 78
Table 6.3 Summary of linguistic and stylistic lyric features ...... 78
Table 6.4 Individual lyric feature type performances ...... 81
Table 6.5 Best performing concatenated lyric feature types ...... 82
Table 6.6 Performance comparison of “Content,” “FW,” “GI,” and “TextStyle” ...... 83
Table 6.7 Feature selection for TextStyle ...... 84
Table 7.1 Comparisons on accuracies of two hybrid methods ...... 93
Table 7.2 Accuracies of single-source-based and hybrid systems ...... 94
Table 7.3 Statistical tests on pair-wise system performances ...... 95
Table 7.4 Accuracies of lyric and audio feature types for individual categories ...... 98
Table 7.5 Top-ranked content word features for categories where content words significantly outperformed audio ...... 99
Table 7.6 Top GI features for “aggressive” mood category ...... 100
Table 7.7 Top-ranked GI-lex features for categories where GI-lex significantly outperformed audio ...... 100
Table 7.8 Top ANEW and Affect-lex features for categories where ANEW or Affect-lex significantly outperformed audio ...... 101
Table 7.9 Top-ranked text stylistic features for categories where text stylistics significantly outperformed audio ...... 102
Table 7.10 Top lyric features in “calm” category ...... 103

CHAPTER 1: INTRODUCTION

1.1 INTRODUCTION

“Some sort of emotional experience is probably the main reason behind most people’s engagement with music” (Juslin & Sloboda, 2001).

Nowadays people play music more often than ever, and the need for easy ways for everyday users to search for music continues to rise. Research has been conducted to analyze similarities among music pieces, on the basis of which music can be organized into groups and recommended to users with matching tastes. Until recently, most music classification studies focused on classifying music according to genre or artist style. The affective aspect of music (popularly known as music mood) has recently become yet another important criterion in classifying music.

Music psychology research has disclosed that the affective aspects of music are important in defining “being musically similar” (Huron, 2000). The emotional component of music has been recognized as the most strongly associated with music expressivity (Juslin, Karlsson, Lindström,

Friberg, & Schoonderwaldt, 2006), and research on music information behavior has also identified music mood as an important criterion used by people in music seeking and organization (Vignoli, 2004; Bainbridge, Cunningham, & Downie, 2003; Downie &

Cunningham, 2002; Lee & Downie, 2004; Cunningham, Jones, & Jones, 2004; Cunningham,

Bainbridge, & Falconer, 2006; to name a few). However, most existing music repositories do not support access to music by mood. In fact, music mood, due to its subjectivity, has been far from well studied in information science. Especially under the concept of Web 2.0, the general public can post their opinions on music pieces and thus yield collective wisdom, augmenting the value of

music itself and creating the social context of music seeking and listening. Although the affective aspect of music has long been studied in music psychology, such social context has not been considered. Therefore, music mood turns out to be a new metadata type for music. There has yet to be developed a suitable set of mood categories that can reflect the reality of music listening and can be well adopted in the music information retrieval (MIR) community.

As the Internet and computer technologies enable people to access and share information on an unprecedentedly large scale, people call for automatic tools for music classification and recommendation. However, only a few existing music classification systems focus on mood.

Most of them are solely based on the audio content 1 of the music, but mood classification also involves sociocultural aspects not extractable from audio using current audio technology.

Studies have indicated lyrics and social tags are important in MIR. For example,

Cunningham et al. (2006) reported lyrics as the most mentioned feature by respondents in answering why they hated a song. Geleijnse, Schedl, and Knees (2007) used social tags associated with music tracks to create a ground truth set for the task of artist similarity identification. Recently, researchers have started to exploit music lyrics in music classification

(e.g., Laurier, Grivolla, & Herrera, 2008; Mayer, Neumayer, & Rauber, 2008) and hypothesize that lyrics, as a separate source from music audio, might be complementary to audio content.

This dissertation research is also premised on the hypothesis that lyrics and social tags would be

1 In this dissertation, “audio” in “audio content,” “audio-based” and “audio-only” refers to the audio media of music files such as .mp3 and .wav formats. In fact, the singing of lyrics is recorded in the audio media files, but audio engineering technology has yet to be developed to correctly and reliably transcribe lyrics from media files, and thus “audio” in most MIR research is independent of lyrics.

complementary to audio information, and thus it strives to improve the state of the art in automatic music mood classification by using audio, lyrics and social tags.

1.2 ISSUES IN MUSIC MOOD CLASSIFICATION

In recent years, while more and more attention of MIR researchers has been drawn to music mood classification, several important issues have emerged, and resolving these issues becomes necessary for further progress on this topic. These issues are: mood categories, ground truth datasets and multi-modal systems. This dissertation attempts to respond to these issues by exploiting multiple information sources of music: lyrics, audio and social tags.

1.2.1 Mood Categories

There is no consensus on a set of mood categories, either in MIR or in music psychology. Music psychologists have proposed a number of music mood models over the years, but the models generally lack the social context of music listening (Juslin & Laukka, 2004). It is unknown whether the models can fit well with today’s reality. On the other hand, MIR researchers have conducted experiments using different and small sets of mood categories. Besides the problem that these mood categories may oversimplify the problem in the real-life music listening environment, using different mood categories makes it hard to compare the performances across various classification approaches.

1.2.2 Ground Truth Datasets

Besides mood categories, a dataset with ground truth labels is necessary for a scientific evaluation on music mood classification. As in many tasks in information retrieval, the human is

the ultimate judge. Thus ground truth datasets in existing experiments were mostly built by recruiting human assessors to manually label music pieces, and then selecting pieces with (near) consensus on human judgments. However, judgments on music are very subjective, and it is hard to achieve agreement across assessors (Skowronek, McKinney, & van de Par, 2006). This has seriously limited the sizes of experimental datasets and the necessary validation of inter-assessor credibility. As a result, experimental datasets usually consist of merely several hundred music pieces, and each piece is judged by at most three human assessors, and in many cases, by only one assessor (e.g., Trohidis, Tsoumakas, Kalliris, & Vlahavas, 2008; Li & Ogihara, 2003;

Lu, Liu, & Zhang, 2006).

The situation is worsened by the intellectual property regulations on music materials, which effectively prevent sharing ground truth datasets with audio content among MIR researchers affiliated with different institutions. Therefore, it is clear that to enhance the development and evaluation in music mood classification, and in MIR research in general, a sound method is much needed for building ground truth sets of reliable quality in an efficient manner.

1.2.3 Multi-modal Classification

Until recent years, MIR systems have focused on single-modal representation of music, mostly on audio content and some on symbolic scores. The seminal work of Aucouturier and

Pachet (2004) pointed out that there appeared to be a “glass ceiling” in audio-based MIR, due to the fact that some high-level music features with semantic meanings might be too difficult to be derived from audio using current technology. Hence, researchers started paying attention to multi-modal classification systems that combine audio and text (e.g., Neumayer & Rauber, 2007;

Aucouturier, Pachet, Roy, & Beurivé, 2007; Dhanaraj & Logan, 2005; Muller, Kurth, Damm,

Fremerey, & Clausen, 2007) or audio, scores and text (McKay & Fujinaga, 2008). However, to date only a handful of multi-modal studies were on mood classification (Yang & Lee, 2004;

Laurier et al., 2008; Yang et al., 2008), and these studies only used basic text features extracted from lyrics. Systematic studies on various lyric features and hybrid methods of combining multiple sources are needed to advance the state of the art in multi-modal music mood classification.

1.3 RESEARCH QUESTIONS

Aiming at resolving the aforementioned issues in music mood classification, this dissertation research raises an overarching research question: to what extent can lyrics and social tags help in categorizing music in regard to mood? It can be divided into the following specific questions.

1.3.1 Identifying Mood Categories from Social Tags

With the birth of Web 2.0, the general public can now post text tags on music pieces and share these tags with others. The accumulated user tags can yield so-called “collective wisdom” that can augment the value of music itself and create the social context of music seeking and listening. Specifically, there are two major advantages of social tags. First, social tags are assigned by real music users in real-life music listening environments; thus they represent the context of real-life music information behaviors better than labels assigned by human assessors in laboratory settings. Second, social tags are available online in quantities far larger than datasets collected in human evaluation experiments, providing a much richer resource for discovering users’ perspectives.
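As a concrete illustration of how such tags can be gathered, the hedged sketch below queries the publicly documented Last.fm web API for a track's top tags. It is not the dissertation's actual collection procedure (that is described in Chapter 5); the API key, artist, and track names are placeholders, and the response field names are assumptions based on the documented API.

```python
# A minimal sketch of fetching a track's top social tags from the Last.fm API.
# Assumptions: the documented "track.gettoptags" method of the Last.fm 2.0 API
# and a valid API key (placeholder below); not the dissertation's own crawler.
import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"
API_KEY = "YOUR_LASTFM_API_KEY"  # placeholder

def get_top_tags(artist, track):
    """Return a list of (tag, count) pairs for one track."""
    params = {
        "method": "track.gettoptags",
        "artist": artist,
        "track": track,
        "api_key": API_KEY,
        "format": "json",
    }
    response = requests.get(API_ROOT, params=params, timeout=10)
    response.raise_for_status()
    tags = response.json().get("toptags", {}).get("tag", [])
    if isinstance(tags, dict):  # single-tag responses may not be wrapped in a list
        tags = [tags]
    return [(t["name"], int(t["count"])) for t in tags]

if __name__ == "__main__":
    for name, count in get_top_tags("Radiohead", "Creep"):
        print(name, count)
```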

The advantages have attracted researchers to exploit social tags in categorizing music by mood (Hu, Bay, & Downie, 2007a) and by artist similarity (Levy & Sandler, 2007), yet the study by Hu et al. (2007a) yielded an oversimplified set of only 3 mood categories. Very few, if any, of these studies have adequately addressed the following shortcomings of social tags as summarized by Guy and Tonkin (2006). First, social tags are uncontrolled and thus contain much noise and many junk tags. Second, many tags have ambiguous meanings. For example, “love” can be the theme of a song or a user’s attitude towards a song. Third, a majority of tags are applied to only a few songs, and thus are not representative (the so-called “long-tail” problem 2). Fourth, some tags are essentially synonyms (e.g., “cheerful” and “joyful”), and thus do not represent distinguishable categories. To address these problems, the first research question of this dissertation is:

Research Question 1: Can social tags be used to identify a set of reasonable mood categories?

The “reasonableness” of the resultant mood categories is defined using two criteria: 1) they should comply with common intuitions regarding music mood; and 2) they should be at least partially supported by theories in music psychology. It is unreasonable to expect full accordance between mood categories identified from social tags and those in music psychology theories, because today’s music listening environment with Web 2.0 and social tagging is very different from the laboratory settings where the music psychology studies were conducted. In fact, mood models in music psychology have been criticized for lacking the social context of music listening

2 “long-tail” means the tag distribution follows a power law: many tags are used by a few users while only a few tags are used by many users (Guy & Tonkin, 2006).

(Juslin & Laukka, 2004). For this very reason, the mood categories identified from empirical data of social tags can be complementary and beneficial to music psychology research.

To answer this question, a new method is proposed to derive mood categories from social tags, by combining the strength of linguistic resources and human expertise. The resultant mood categories are then compared to influential mood models proposed by music psychologists.

Particularly, the following questions will be addressed in the comparison:

1) Is there any correspondence between the identified categories and the categories in psychological models?

2) Do the distances between the identified mood categories show similar patterns to those in the psychological models?

Such comparison will disclose whether findings from social tags can be supported by the theoretical models and what the differences are between them.

1.3.2 Best Lyric Features

Lyrics contain very rich information from which many types of features can be extracted.

However, existing work on music mood classification that used lyric information only exploited the very basic, commonly used features such as content words and part-of-speech. To fill this intellectual gap, this dissertation examines and evaluates a wide range of lyric text features including the basic features used in previous studies, linguistic features derived from affect lexicons and psycholinguistic resources, as well as text stylistic features. The author attempts to determine the most useful lyric features by systematically comparing various lyric feature types and their combinations.

Research Question 2: Which type(s) of lyric features are the most useful in classifying music by mood?

A variety of text features are extracted from lyrics. Details on the features are described in

Section 6.2. As there is not enough evidence to hypothesize which feature type(s) or their combination would be the most useful, all feature types and combinations are evaluated in the task of mood classification, and their classification performances are compared using statistical tests (see Section 4.1).

1.3.3 Lyrics vs. Audio

Previous studies have generally reported lyrics alone were not as effective as audio in music classification (e.g., Li & Ogihara, 2004; Mayer et al., 2008) and artist similarity identification

(Logan, Kositsky, & Moreno, 2004). The third research question is to find out whether this is true for music mood classification with the lyric features that have not been previously evaluated.

Research Question 3: Are there significant differences between lyric-based and audio-based systems in music mood classification, given both systems use the Support Vector

Machines (SVM) classification model?

The best performing lyric features determined in research question 2 are used to build a lyric-based mood classification system. It is then compared to the best performing audio-based system evaluated in the Audio Mood Classification (AMC) task in the Music Information

Retrieval Evaluation eXchange (MIREX) 2007 and 2008. MIREX is a community-based framework for the formal evaluation of algorithms and techniques related to MIR development.

Its results reflect the state of the art for the evaluated tasks (Downie, 2008). The performance of a top-ranked system in the AMC tasks thus sets a difficult baseline against which comparisons must be made.

All audio-based classification systems in MIREX as well as systems in most other music classification studies applied standard supervised learning models such as K-Nearest Neighbor

(KNN), Naïve Bayes, and Support Vector Machines (SVM). Among these learning models, SVM appears to be the most popular and among the best performing. This dissertation uses SVM as the classification model for two reasons: 1) the selected audio-based system uses SVM; and 2) SVM achieved the best or close to the best results in both MIR and text categorization experiments in general (Mandel, Poliner, & Ellis, 2006; Hu, Downie, Laurier, Bay, & Ehmann, 2008a;

Tzanetakis & Cook, 2002; Yu, 2008).
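As a minimal illustration of the kind of supervised set-up described above, the sketch below trains a linear SVM for one binary mood category and evaluates it with cross-validated accuracy using scikit-learn. The feature matrix and labels are synthetic placeholders; this is not the dissertation's own implementation (see Section 4.2.2).

```python
# A minimal sketch of binary mood classification with an SVM (scikit-learn).
# X is a feature matrix (e.g., lyric or audio features) and y holds binary
# labels (1 = song belongs to the mood category, 0 = it does not).
# Placeholder data only; not the dissertation's actual implementation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))      # 200 songs, 40 features (synthetic)
y = rng.integers(0, 2, size=200)    # synthetic binary mood labels

model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# 10-fold cross-validated accuracy, the kind of measure used in the experiments.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("mean accuracy: %.3f" % scores.mean())
```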

1.3.4 Combining Lyrics and Audio

In machine learning, it is acknowledged that multiple independent sources of features will likely compensate for one another, resulting in better performances than approaches using any one of the sources (Dietterich, 2000). Previous work in music classification has used such hybrid sources as audio and lyrics (e.g., Mayer et al., 2008), audio and symbolic scores (e.g., McKay &

Fujinaga, 2008), etc., and has shown improved performance. Thus, one hypothesis in this dissertation is that hybrid systems combining audio and lyrics perform better than systems using either source.

Research Question 4: Are systems combining audio and lyrics significantly better than audio-only or lyric-only systems?

There are two popular approaches to assembling hybrid systems (also called “fusion methods”). The most straightforward one is feature concatenation, where two feature sets are concatenated and the classification algorithms run on the combined feature vectors (e.g., Laurier et al., 2008; Mayer et al., 2008). The other method is often called “late fusion”; it combines the outputs of individual classifiers based on different sources, either by (weighted) averaging

(e.g., Bischoff et al., 2009b; Whitman & Smaragdis, 2002) or by multiplying (e.g., Li & Ogihara,

2004). To answer this research question, both fusion methods are implemented, and the performances of hybrid systems and systems based on single sources are compared using statistical tests.
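To make the two fusion methods concrete, the sketch below contrasts feature concatenation with late fusion by linear interpolation of two classifiers' probability outputs. It is a hedged illustration on synthetic data with an assumed interpolation weight alpha; it is not the dissertation's implementation.

```python
# A sketch of the two fusion methods discussed above (synthetic data, not the
# dissertation's code): (1) feature concatenation, (2) late fusion by linearly
# interpolating the probability outputs of a lyric-based and an audio-based SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_lyrics = rng.normal(size=(300, 50))   # synthetic lyric features
X_audio = rng.normal(size=(300, 60))    # synthetic audio features
y = rng.integers(0, 2, size=300)
train, test = slice(0, 200), slice(200, 300)

# 1) Feature concatenation: one classifier over the joined feature vectors.
X_concat = np.hstack([X_lyrics, X_audio])
concat_clf = SVC(kernel="linear", probability=True).fit(X_concat[train], y[train])
p_concat = concat_clf.predict_proba(X_concat[test])[:, 1]

# 2) Late fusion: train one classifier per source, then interpolate the scores.
lyric_clf = SVC(kernel="linear", probability=True).fit(X_lyrics[train], y[train])
audio_clf = SVC(kernel="linear", probability=True).fit(X_audio[train], y[train])
alpha = 0.5  # interpolation weight, tuned empirically (cf. Figure 7.1)
p_fused = (alpha * lyric_clf.predict_proba(X_lyrics[test])[:, 1]
           + (1 - alpha) * audio_clf.predict_proba(X_audio[test])[:, 1])

for name, p in [("concatenation", p_concat), ("late fusion", p_fused)]:
    acc = ((p >= 0.5).astype(int) == y[test]).mean()
    print(name, "accuracy: %.3f" % acc)
```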

1.3.5 Learning Curves and Audio Lengths

A learning curve describes the relationship between classification performance and the number of training examples. Usually performance increases with the number of training examples, and the point where performance stops increasing indicates the minimum number of training examples needed for achieving the best performance. In addition to classification performance, the learning curve is thus an important measure of the effectiveness of a classification system. Therefore, comparing the learning curves of the hybrid systems and single-source-based systems can reveal whether combining lyrics and audio helps reduce the number of training examples needed for achieving performance comparable to or better than audio-only or lyric-only systems.

Due to the time complexity of audio processing, MIR systems often process x-second audio clips truncated from the middle of the original tracks instead of the complete tracks, where x often equals 30 or 15. As text processing is much faster than audio processing, it is of practical

value to find out whether combining complete lyrics with short audio excerpts can help compensate for the (possibly significant) information loss due to the approximation of complete tracks with short clips.

Research Question 5: Can combining lyrics and audio help reduce the amount of training data needed for effective classification, in terms of the number of training examples and audio length?

To answer this question, performances of the hybrid and single-source-based systems are compared given increasing numbers of training examples as well as audio clips of increasing lengths extracted from the original tracks.
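The sketch below shows one way such a learning curve could be computed with scikit-learn's learning_curve utility, again on placeholder data; the audio-length comparison in Chapter 8 follows the same pattern, with features extracted from clips of different durations.

```python
# A sketch of computing a learning curve (accuracy vs. number of training
# examples) for a single mood category. Synthetic data; illustrative only.
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = rng.integers(0, 2, size=500)

train_sizes, _, test_scores = learning_curve(
    SVC(kernel="linear"), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, scores in zip(train_sizes, test_scores):
    print("%4d training examples -> mean accuracy %.3f" % (n, scores.mean()))
```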

1.3.6 Research Question Summary

The five research questions are closely related and each is built upon the previous one.

Together they answer the overarching question of how lyrics and social tags can help in music mood classification. Figure 1.1 illustrates the connections among the research questions.

1.4 CONTRIBUTIONS

This dissertation is one of the first efforts in exploiting music associated text in classifying music in the mood dimension, and also is one of the first systematic evaluations of lyric features in music mood classification. Contributions can be classified into three levels: methodology, evaluation and application.


Figure 1.1 Research questions and experiment flow

1.4.1 Contributions to Methodology

This research is one of the first in the MIR community to review and summarize important studies on music mood in music psychology literature, giving MIR researchers and information scientists theoretical ground and insights on music mood.

Mood categories have been a much debated topic in MIR. This dissertation research identifies mood categories that have been used by real users in a real-life music listening environment. The categories derived from empirical social tag data serve as a concrete case of aggregated real-life music listening behaviors, which can then be used as a reference for studies

in music psychology. In general, the comparison of the mood categories derived from social tags to those in psychological models establishes an example of refining and/or adapting theories and models to better fit the reality of users’ information behaviors.

Text affect analysis has been an active research topic in text mining in recent years (Pang &

Lee, 2008), but has just started being applied to the music domain. Many of the lyric text features examined in this dissertation have never been formally studied in the context of music mood classification. Similarly, most of the feature type combinations have never previously been compared to each other using a common dataset. Thus, this dissertation research advances the state of text affect analysis in the music domain.

Fusion methods have recently started being used in combining multiple sources in music classifications, but different fusion methods have rarely been compared on a common dataset.

This dissertation compares two fusion methods, and the result provides suggestions for future research in music mood classification.

1.4.2 Contributions to Evaluation

As mentioned before, an effective and scalable evaluation approach is much needed in the music domain. In this dissertation, a large ground truth dataset is built from social tags without recruiting human assessors (see Section 5.2). The proposed method of deriving ground truth from social tags can help reduce the prohibitive cost of human assessments and clear the way to large scale experiments in MIR.

The ground truth dataset built for this study is unique. It contains 5,296 unique songs in 18 mood categories with mood labels given by a number of real-life users. This is one of the largest

experimental datasets in music mood classification with ternary information sources available: audio, lyrics and social tags. Part of the dataset has been made available to the MIR community through the 2009 and 2010 iterations of MIREX. Serving as a testbed accessible to the entire

MIR community, this dataset helps facilitate the development and comparison of new techniques in music mood classification.

1.4.3 Contributions to Application

Music mood classification and recommendation systems are direct applications of this dissertation research. Based on the findings, one can plug in existing tools for text categorization, audio feature extraction and fusion methods to build a system that combines audio and text in an optimized way. Moody is an online prototype of such applications (Hu et al., 2008b). It recommends music in similar moods and classifies users’ songs on-the-fly.

The answers to research question 5 on learning curves and audio length give a practical reference on whether combining lyrics and audio can reduce the number of needed training examples and the length of audio a system has to process. Training examples are expensive to obtain and audio processing is computationally complex. Therefore, answers to this research question may help improve the efficiency of music mood classification systems.

In this research, lyrics and social tags associated with songs are collected from various Web services such as lyrics databases and music sharing/tagging sites. This is an example of applications collecting and integrating information from heterogeneous resources. During a pilot study of this dissertation research, a prototype Web search system has been developed to crawl and integrate complementary information of albums from multiple websites, including mldb.org

for lyrics, last.fm for tags, epinions.com for user reviews and amazon.com for images and editorial reviews (Hu & Wang, 2007).

1.5 SUMMARY

This chapter introduced three important issues in music mood classification that motivated this dissertation research: the lack of a set of mood categories suitable for today's music listening environment, the prohibitively expensive cost involved in building ground truth datasets, and the premise of combining multiple resources in order to improve classification performances.

Five research questions were proposed in this chapter. These questions are closely related and each is built upon the previous one. The answers to these questions will collectively shed light on the fundamental question of how lyrics, social tags and audio interact with one another with regard to music mood.

Expected contributions of this research were summarized into three levels: methodology, evaluation and application. This research will contribute to the literature of music information retrieval, text affect analysis as well as music psychology. These related fields will be reviewed in the next chapter.

CHAPTER 2: LITERATURE REVIEW

2.1 MOOD AS A NOVEL METADATA TYPE FOR MUSIC

Traditionally music information is organized and accessed by bibliographic metadata such as title, composer, performer and release date. However, recent MIR studies have pointed out that traditional bibliographic information of music is far from enough for effective MIR systems

(Futrelle & Downie, 2002; Byrd & Crawford, 2002; Lee & Downie, 2004). Futrelle and Downie

(2002) called for both descriptive and contextual metadata; Lee and Downie (2004) pointed out the importance of associative and perceptual metadata (e.g., use, mood, rating, etc.).

In recent years, as online music repositories are becoming more popular, new non-traditional metadata have emerged in organizing music. Hu and Downie (2007) explored the 179 mood labels on AllMusicGuide 3 and analyzed the relationships that mood has with genre and artist.

The results show that music mood, as a new type of music metadata, appears to be independent of genre and artist. Therefore, mood, as a new access point to music information, is a necessary complement to established ones.

2.2 MOOD IN MUSIC PSYCHOLOGY

In the information science and MIR community, mood is a novel metadata type and thus there are many fundamental issues on music mood remaining unresolved. For example, there is no terminology consensus on the very topic in question: some researchers use “music emotion,” some others use “music mood” to refer to the affective aspects of music. On the other hand, there

3 http://www.allmusic.com, a popular metadata service that reviews and categorizes albums, songs and artists.

is a long history of influential studies in music psychology where these issues have been well studied. Hence, MIR researchers and information scientists who are interested in music mood should learn from music psychology literature on theoretical issues such as terminology and sources of music mood. This section reviews and summarizes important findings in representative studies on music and mood in music psychology.

2.2.1 Mood vs. Emotion

Since the early stage of music psychology studies, researchers have paid attention to clarifying the concepts of mood and emotion. The first influential work formally analyzing music and mood in psychological terms is probably Meyer’s Emotion and Meaning in Music (Meyer, 1956). In this book, Meyer stated that emotion is “temporary and evanescent” while mood is “relatively permanent and stable.” Sloboda and Juslin (2001) followed Meyer’s point after summarizing related studies over nearly half a century.

In music psychology, both emotion and mood have been used to refer to the affective effects of music, but emotion seems to be more popular (Capurso et al., 1952; Juslin et al., 2006; Meyer,

1956; Schoen & Gatewood, 1927; Sloboda & Juslin, 2001). However, in MIR, researchers tend to choose mood over emotion (Feng, Zhuang, & Pan, 2003; Lu et al., 2006; Mandel et al., 2006;

Pohle, Pampalk, & Widmer, 2005). In addition, existing music repositories also use mood rather than emotion as a metadata type for organizing music (e.g., AllMusicGuide and APM 4). While

MIR researchers have yet to be formally interviewed on why they chose to use mood, the author

4 http://www.apmmusic.com. It claims to be “the world's leading production music library and music services.”

hypothesizes that there are at least two reasons for MIR researchers to make a different choice from their colleagues in music psychology:

First, as stated by Meyer, mood refers to a relatively long lasting and stable emotional state.

While psychologists emphasize human responses to various stimuli of emotion, MIR researchers, at least at the current stage, are more interested in the general sentiment that music can convey.

In other words, music psychologists focus on the very subjective responses to music which can be acute, momentary and fast changing, while the MIR community tries to find the common affective themes of music that are recognized by many people and are less volatile.

Second, the research purposes of the two disciplines are different. Music psychologists want to discover why a human has emotional responses to music, while MIR researchers want to find a new metadata type to organize and access music objects. The former focuses on a human’s responses; the latter focuses on music. It is the human who has emotion. Music does not have emotion, but it can carry a certain mood.

Therefore, this research continues the choice of MIR researchers and adopts the term music mood rather than emotion. However, it is noteworthy that the two concepts are not absolutely detached. To some extent, their difference mainly lies in granularity. MIR researchers can still borrow insights from music psychology studies. In fact, when MIR technologies are developed to a level where individual and transitory affective responses become the focus of study, it is possible that the MIR community may change to adopt the notion of music emotion.

2.2.2 Sources of Music Mood

Where music mood comes from is a question MIR researchers are interested in. Does it come from the intrinsic characteristics of music pieces or from the extrinsic context of music listening behaviors? The answer to this question would have significant implications on assigning mood labels to music pieces either by hand or by computer programs.

From as early as Meyer (1956), there have been two contrasting views of music meanings in music psychology: the absolutist versus referentialist views. The absolutist view claimed

“musical meaning lies exclusively within the context of the work itself” (p. 1) while the referentialist proposed “musical meanings refer to the extra-musical world of concepts, actions, emotional states, and character.” (p. 1) Meyer acknowledged the existence of both types of musical meanings. Later, Sloboda and Juslin (2001) echoed Meyer’s view by presenting two sources of emotion in music: intrinsic emotion and extrinsic emotion. Intrinsic emotion is triggered by specific structural characteristics of the music while extrinsic emotion is from the semantic context related but outside the music. Therefore, the suggestion for MIR is that music mood should be a combination of music content itself and the social context where people listen to and share opinions about music. In fact, recent user studies in MIR have confirmed this point of view (e.g., Lee & Downie, 2004) and automatic music categorization systems (e.g.,

Aucouturier et al., 2007) have started to combine music content (e.g., audio, lyrics, and symbolic scores) and context (e.g., social tags, playlists, and reviews).

2.2.3 What We Know about Music Mood

Besides terminology and sources of music mood, music psychology studies have produced a number of fundamental generalizations about music mood that can benefit MIR research.

1. Mood effects in music do exist. Ever since early experiments (pre-1950) on the psychological effects of music, studies have confirmed that music functions to change people’s moods (Capurso et al., 1952). It is also agreed that it seems natural for listeners to attach mood labels to music pieces (Sloboda & Juslin, 2001).

2. Not all moods are equally likely to be aroused by listening to music. In a study conducted by Schoen and Gatewood (1927), human subjects were asked to choose from a pre-selected list of mood terms to describe their feelings while listening to 589 music pieces. Among the presented moods, joy, rest, love, and longing were among the most frequently reported, while irritation was among the least frequently reported.

3. There do exist uniform mood effects among different people. Sloboda and Juslin (2001) summarized that listeners are often consistent in their judgments about the emotional expression of music. Early experiments by Schoen and Gatewood (1927) also showed “the moods induced by each (music) selection, or the same class of selection, as reported by the large majority of our hearers, are strikingly similar in type” (p. 143). Such consistency is an important ground for developing and evaluating music mood classification techniques.

4. Not all types of moods have the same level of agreement among listeners. Schoen and

Gatewood (1927) ranked joy, sadness, stirring, rest, and love as the most consistent

moods while disgust, irritation, and dignity were of the lowest consistency. The implication for

MIR is that some mood categories would be harder to classify than others.

5. There is some correspondence between listeners’ judgments on mood and musical parameters such as tempo, dynamics, rhythm, timbre, articulation, pitch, mode, tone attacks, and harmony (Sloboda & Juslin, 2001). Early experiments showed that the most important music element for excitement was a swift tempo; modality was important for sadness, but was useless for distinguishing excitement from calm; and melody played a very small part in producing a given affective state (Capurso et al., 1952). Schoen and Gatewood (1927) pointed out that the mood of amusement largely depended upon vocal music: “humorous description, ridiculous words, peculiarities of voice and manner are the most striking means of amusing people through music” (p. 163). This has been evidenced by the category,

“humorous/silly/quirky” used in the AMC task in MIREX from 2007 to 2010. A subsequent examination of the AMC data found that music pieces manually labeled with this category primarily had the above-mentioned quality. Such correspondence between music mood and musical parameters has very important implications for designing and developing music mood classification algorithms.

2.2.4 Music Mood Categories

Studies in psychology have proposed a number of models of human emotions, and music psychologists have adopted and extended a few influential models.

The six “universal” emotions defined by Ekman (1982), namely anger, disgust, fear, happiness, sadness, and surprise, are well known in psychology. However, since they were designed for

encoding facial expressions, some of them may not be suitable for music (e.g., disgust), and some common music moods are missing (e.g., calm, soothing, and mellow). In music psychology, the earliest and still best-known systematic attempt at creating a music mood taxonomy was by Hevner (1936). Hevner designed an adjective circle of eight clusters of adjectives, as shown in Figure 2.1, from which one can see: 1) the adjectives within each cluster are close in meaning; 2) the meanings of adjacent clusters differ slightly; and, 3) the difference between clusters gets larger step by step until a cluster at the opposite position is reached.

Figure 2.1 Hevner's adjective circle (Hevner, 1936)

Both Ekman’s and Hevner’s models are categorical models because their mood spaces consist of a set of discrete mood categories. Another well-recognized kind of model is the dimensional model, where emotions are positioned in a continuous multidimensional space. The

most influential ones contain such dimensions as valence (happy-unhappy), arousal (active-inactive), and dominance (dominant-submissive) (Mehrabian, 1996; Russell, 1980; Thayer,

1989). However, there is no consensus on how many dimensions there should be and which dimensions to consider. For example, a well-cited study by Wedin (1972) identified three dimensions: intensity-softness, pleasantness-unpleasantness and solemnity-triviality, while another study by Asmus (1995) found nine dimensions: evil, sensual, potency, humor, pastoral, longing, depression, sedative, and activity.

Among all these dimensional models, Russell’s model combining the valence and arousal dimensions (Russell, 1980) has been adopted in a few experimental studies in music psychology (e.g., Schubert, 1996; Tyler, 1996), and MIR researchers have been using similar taxonomies based on this model (e.g., Kim, Schmidt, & Emelle, 2008; Laurier et al., 2008; Lu et al., 2006). As shown in Figure 2.2, Russell’s original model places 28 emotion-denoting adjectives on a circle in a bipolar space consisting of the valence and arousal dimensions.

In fact, categorical models and dimensional models cannot be completely separated.

Gabrielsson and Lindström (2001) argued that Hevner’s model suggested an implicit dimensionality similar to the combination of valence (cluster 2 – cluster 6) and arousal (clusters

7/8 – clusters 4/3).

All these psychological models were proposed in laboratory settings and thus have been criticized for lacking the social context of music listening (Juslin & Laukka, 2004). Therefore, this dissertation research posits that a set of music mood categories derived from social tags would be rich in social context, since social tags are posted in the most natural music listening

23 environment by real-life users. The identified mood categories are compared to both categories in Hevner’s model and those in Russell’s model (see Section 3.3).

Figure 2.2 Russell’s model with two dimensions: arousal and valence (Russell, 1980)

2.3 MUSIC MOOD CLASSIFICATION

2.3.1 Audio-based Music Mood Classification

Most existing work on automatic music mood classification is exclusively based on audio features among which timbral and rhythmic features are the most popular across studies (e.g., Lu et al., 2006; Pohle et al., 2005; Trohidis et al., 2008). The datasets used in these experiments usually consisted of several hundred to a thousand songs labeled with four to six mood categories.

Timbral features are musical surface features based on signal spectrum obtained by a Short

Time Fourier Transformation (McEnnis, McKay, Fujinaga, & Depalle, 2005). The most popular ones are:

1) Spectral Centroid: the mean of the amplitudes of the spectrum. It indicates the “brightness” of a musical signal.

2) Spectral Rolloff: the frequency below which 85% of the energy in the spectrum falls. It is an indicator of the skewness of the frequencies in a musical signal.

3) Spectral Flux: the spectral correlation between adjacent time windows. It is often used as an indication of the degree of change of the spectrum between windows.

4) Average Silence Ratio (also called Low Energy Rate): the percentage of frames with less than average energy.

5) MFCCs: Mel Frequency Cepstral Coefficients, the dominant features used for speech recognition. The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. The MFC has been proven to approximate the human auditory system’s response more closely than the normal cepstrum, and MFCCs are the coefficients that collectively make up an MFC.

Rhythmic features represent the beat and tempo of a musical piece. These can be useful in music mood classification: intuitively, fast music tends to be exciting rather than relaxing. The most frequently used rhythmic features include the following (a short feature-extraction sketch is given after the list):

1) Tempo: also called “rhythmic periodicity,” tempo is estimated from both a spectral analysis of each band of the spectrogram and the assessment of autocorrelation in the amplitude envelope extracted from the audio.

2) Beat Histogram: a histogram representing the distribution of tempi in a musical excerpt. Usually the most prominent peak corresponds to the best tempo match. Experiments often use several properties of the beat histogram, such as the bpm (beats per minute) values of the two highest peaks and the sum of all histogram bins.
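For illustration, the sketch below extracts a few of the timbral and rhythmic features listed above with the librosa library. This is a hedged example under the assumption that librosa is available; it is not the feature-extraction software used in the dissertation's experiments, and the audio path is a placeholder.

```python
# A sketch of extracting some of the timbral and rhythmic features listed above
# using librosa. The audio path is a placeholder; librosa is an assumed toolkit,
# not the extractor used in the dissertation's experiments.
import numpy as np
import librosa

y, sr = librosa.load("example_clip.mp3", duration=30.0)  # 30-second excerpt

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)              # "brightness"
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                    # timbral MFCCs
rms = librosa.feature.rms(y=y)                                        # frame energies
low_energy_rate = float(np.mean(rms < rms.mean()))                    # cf. Average Silence Ratio
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)                        # global tempo (bpm)

# Summarize frame-level features into a fixed-length vector (means and standard
# deviations), a common way to turn them into classifier input.
features = np.hstack([
    centroid.mean(), centroid.std(),
    rolloff.mean(), rolloff.std(),
    mfcc.mean(axis=1), mfcc.std(axis=1),
    low_energy_rate, tempo,
])
print(features.shape)
```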

2.3.2 Text-based Music Mood Classification

Very recently, several studies on music mood classification have been conducted using only music lyrics (He et al., 2008; Hu, Chen, & Yang, 2009b). He et al. (2008) compared traditional bag-of-words features in unigrams, bigrams, trigrams and their combinations, as well as three feature representation models (i.e., Boolean, absolute term frequency and tfidf weighting). Their results showed that the combination of unigram, bigram and trigram tokens with tfidf weighting performed the best, indicating that higher-order bag-of-words features captured more semantics useful for mood classification. Hu et al. (2009b) moved beyond the bag-of-words lyric features and extracted features based on an affective lexicon translated from the Affective Norms for

English Words (ANEW) (Bradley & Lang, 1999). The datasets used in both studies were relatively small: the dataset in He et al. (2008) contained 1,903 songs in only two mood categories, “love” and “lovelorn,” while Hu et al. (2009b) classified 500 Chinese songs into four mood categories derived from Russell’s arousal-valence model.
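As a hedged illustration of the bag-of-words set-up these studies describe, the sketch below builds a combined unigram-to-trigram representation with tfidf weighting and trains a linear SVM; the example lyrics and labels are placeholders, and this is not the code used in He et al. (2008) or in this dissertation.

```python
# A sketch of the unigram+bigram+trigram bag-of-words representation with tfidf
# weighting described above, fed to a linear SVM. The two example "lyrics" and
# their labels are placeholders, not data from the studies cited.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

lyrics = [
    "i feel so happy dancing in the sun",      # placeholder lyric text
    "tears fall down i am so lonely tonight",  # placeholder lyric text
]
labels = [1, 0]  # e.g., 1 = belongs to a "happy" category, 0 = does not

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), lowercase=True),  # unigrams to trigrams
    LinearSVC(),
)
model.fit(lyrics, labels)
print(model.predict(["so happy in the sun tonight"]))
```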

From a different angle, Bischoff, Firan, Nejdl, and Paiu (2009a) tried to use social tags to predict mood and theme labels of popular songs. The authors designed the experiments as a tag

recommendation task where the algorithm automatically suggested mood or theme descriptors given social tags associated with a song. In contrast to a classification problem, the problem in

Bischoff et al. (2009a) was a recommendation task where only the first N descriptors were evaluated (N = 3).

2.3.3 Music Mood Classification Combining Audio and Text

The early work combining lyrics and audio in music mood classification can be traced back to Yang and Lee (2004) where they used both lyric bag-of-words features and the 182 psychological features proposed in the General Inquirer (Stone, 1966) to disambiguate categories that audio-based classifiers found confusing. Although the overall classification accuracy was improved by 2.1%, their dataset was too small (145 songs) to draw any reliable conclusions.

Laurier et al. (2008) also combined audio and lyric bag-of-words features. Their experiments on

1,000 songs in four categories (also from Russell’s model) showed that the combined features with audio and lyrics improved classification accuracies in all four categories. Yang et al. (2008) evaluated both unigram and bigram bag-of-words lyric features as well as three methods for fusing lyric and audio sources on 1,240 songs in four categories (again from Russell’s model) and concluded that leveraging both lyrics and audio could improve classification accuracy over audio-only classifiers.

As a very recent work, Bischoff et al. (2009b) combined social tags and audio in music mood and theme classification. The experiments on 1,612 songs in four and five mood categories showed that tag-based classifiers performed better than audio-based classifiers while the combined classifiers were the best. Again, it suggested that combining heterogeneous resources helped improve classification performances. Instead of concatenating two feature sets like most

previous research did, Bischoff et al. (2009b) combined the tag-based classifier and audio-based classifier via linear interpolation (one variation of late fusion). As their two classifiers were built with different classification models, it would not be reasonable to compare their single-source-based systems to a hybrid system built by feature concatenation. In this dissertation research, both audio-based and lyric-based classifiers use the same classification model so that it is feasible to compare the two commonly used fusion methods: feature concatenation and late fusion, using the same dataset.
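The difference between the two fusion strategies compared in this research can be summarized in a few lines. The sketch below assumes that each single-source classifier can output a posterior probability for the positive class (as LIBSVM can), and the interpolation weight alpha is a tunable parameter; it illustrates the general idea rather than the exact systems built later in this dissertation.

    import numpy as np

    def late_fusion(p_lyrics, p_audio, alpha=0.5):
        """Linear interpolation of the posterior probabilities produced by a
        lyric-based and an audio-based classifier for the same songs."""
        return alpha * np.asarray(p_lyrics) + (1 - alpha) * np.asarray(p_audio)

    # Feature concatenation, by contrast, merges the two feature matrices
    # before a single classifier is trained:
    # X_hybrid = np.hstack([X_lyrics, X_audio])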

The aforementioned studies on music mood classification mostly used two to six mood categories which were most likely oversimplified and might not reflect the reality of the music listening environment, since the categories were mostly adapted from music psychology models and especially Russell’s model. Furthermore, the datasets were relatively small, which made their results hard to generalize. Finally, only a few of the most common lyric feature types were evaluated. It should also be noted that the performances of these studies were not comparable because they all used different datasets.

2.4 LYRICS AND SOCIAL TAGS IN MUSIC INFORMATION RETRIEVAL

Audio-based approaches have seemed to dominate the field of MIR for the last decade.

However, studies have started to take advantage of text, the ubiquitous media of information.

This section reviews MIR studies that exploited lyrics and social tags in various tasks.

2.4.1 Lyrics

Despite being an important part of the content of vocal music, lyrics have not received as much attention as their audio counterpart. Scott and Matwin (1998) are often cited as the first

MIR researchers exploiting lyrics. They conducted topic-wise text categorization using lyrics from more than 400 folk songs. Extending the traditional bag-of-words approach by integrating

WordNet hypernyms (Fellbaum, 1998), the experiments showed that classification accuracies were improved over the approach using plain bag-of-words. In 2004, Logan et al. indexed the lyrics of 15,589 songs of 399 artists using the technique of Probabilistic Latent Semantic

Analysis (PLSA) (Hofmann, 1999) in an attempt to determine artist similarity. Their experimental results showed that the lyric-based method was not as good as audio-based methods in the task of artist similarity, but the error analysis revealed that both methods made different errors and suggested that a combination of audio-based and lyric-based approaches would be a better technique. Researchers also studied lyrics for other tasks such as topic identification (Kleedorfer, Knees, & Pohle, 2008), language identification, structure extraction, theme categorization and similarity searches (Mahedero, Martinez, Cano, Koppenberger, &

Gouyon, 2005). Other research combined lyrics and audio in a range of tasks, such as artist style detection (Li & Ogihara, 2004), genre classification (Neumayer & Rauber, 2007), hit song prediction (Dhanaraj & Logan, 2005) and music retrieval and navigation (Muller et al., 2007).

However, all these studies used shallow text analysis and simple features such as bag-of-words and part-of-speech (POS) tags. As the most recent progress, Mayer et al. (2008) explored rhyme and stylistic features for the task of genre classification. Rhyme features in lyrics were defined as patterns distinguishing whether or not two subsequent lines in lyrics rhyme with each

other. Stylistic features, borrowed from stylometric analysis (Rudman, 1998), included punctuation counts, digit counts, POS counts, words per line, unique words per line, unique words ratio, etc. The experiments showed that stylistic features were the best among individual lyric feature sets, but a combination of rhyme and stylistic features achieved the best performance in the task of genre classification. Both stylistic features alone and the combination of rhyme and stylistic features performed twice as well as the bag-of-words approach. The authors also compared results yielded with and without stemming, and no significant differences were found.
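The text stylistic features described above are straightforward to compute. The following sketch derives a few of them (words per line, unique words ratio, punctuation and digit counts) from a raw lyric string; it illustrates the general idea rather than reimplementing Mayer et al.'s feature set.

    import re
    import string

    def stylistic_features(lyric):
        """A handful of simple stylometric descriptors for one lyric text."""
        lines = [ln for ln in lyric.splitlines() if ln.strip()]
        words = re.findall(r"[a-z']+", lyric.lower())
        n_words = max(len(words), 1)
        return {
            "words_per_line": n_words / max(len(lines), 1),
            "unique_words_per_line": len(set(words)) / max(len(lines), 1),
            "unique_words_ratio": len(set(words)) / n_words,
            "punctuation_count": sum(lyric.count(c) for c in string.punctuation),
            "digit_count": sum(ch.isdigit() for ch in lyric),
        }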

As an unusual example of studies on music mood classification, Li and Ogihara (2004) compared several lyric feature types besides bag-of-words. They also borrowed wisdom in stylometric analysis and used function words, POS statistics and orthographic features of lexical items such as capitalization, word placement, word length, and line length.

In predicting hit songs, Dhanaraj and Logan (2005) converted lyrics of each song to a vector using Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999). In PLSA, each dimension of the vector represents the likelihood that the song is about a pre-learned topic.

Logan et al. (2004) showed that, for the task of artist similarity, the topics learned using a lyrics corpus were better than those learned from other general corpora such as news.

As shown from previous research on or using lyrics, bag-of-words features still dominate.

Dimension reduction techniques and shallow linguistic features borrowed from stylometric analysis are also used. In addition, it is noteworthy that very few of the above studies compared performances of different feature types.

2.4.2 Social Tags

Very recently, the increasing number of musical social tags on the Web has stimulated great interest in analyzing and exploiting social tags in MIR. Geleijnse et al. (2007) investigated social tags associated with 224 artists and their famous tracks on last.fm and used the tags to create a ground truth set of artist similarity. Levy and Sandler (2007) analyzed track tags published on last.fm and mystrands.com and concluded social tags were effective in capturing music similarity. Using tags on artists, albums and tracks provided by last.fm, Hu et al. (2007a) derived a set of music mood categories as well as a ground truth track set corresponding to these categories. Symeonidis, Ruxanda, Nanopoulos, and Manolopoulos (2008) again exploited last.fm tags, but considered one more dimension – the users, for personalized music recommendations.

Other research attempted to link social tags to audio content. Eck, Bertin-Mahieux, and

Lamere (2007) proposed a method of predicting social tags from audio input using supervised learning. Their dataset was social tags applied to nearly 100,000 artists obtained from last.fm.

Indeed, social tags have become so popular in the MIR community that since 2008, the MIREX has added a new task, Audio Tag Classification 5, which compares various systems with regard to the abilities of associating 10-second audio clips of music with tags collected from the

MajorMiner 6 game (Mandel & Ellis, 2007).

5 http://www.music-ir.org/mirex/wiki/2008:Audio_Tag_Classification

6 http://majorminer.com/

2.5 SUMMARY

This chapter provided a brief overview of music mood as a newly emerging music metadata type and, at the same time, a well-studied subject in music psychology. This chapter also reviewed important findings in music psychology which could benefit information scientists and

MIR researchers. In particular, two influential music emotion models were introduced in detail.

These models will be used for comparisons in the next chapter.

By reviewing the various approaches to music mood classification and the applications of lyrics and social tags in MIR, this chapter also presented the context for the research questions raised in this dissertation.

CHAPTER 3: MOOD CATEGORIES IN SOCIAL TAGS

This chapter describes the method and results of identifying mood categories from social tags on last.fm. It also presents result analysis on comparing the identified categories to models in music psychology. Research question 1, whether social tags can be used to identify a reasonable set of mood categories, is answered with the findings.

3.1 IDENTIFYING MOOD CATEGORIES FROM SOCIAL TAGS

A new method is proposed to derive mood categories from social tags by combining the strength of linguistic resources and human expertise. This section describes this method in detail.

3.1.1 Identifying Mood-related Terms

First, a set of mood-related terms are identified using linguistic resources. WordNet-Affect

(Strapparava & Valitutti, 2004) is an affective extension of WordNet (Fellbaum, 1998). It assigns affective labels to words representing emotions, moods, situations eliciting emotions, or emotional responses. As a major resource used in text sentiment analysis, WordNet-Affect has a good coverage of mood-related words. There are 1,586 terms in WordNet-Affect, but some of them are judgmental, such as “bad,” “poor,” “miserable,” “good,” “great,” and “amazing.”

Although these terms are related to mood, their applications on songs probably represent users’ judgments towards the songs, rather than descriptions of the moods conveyed by the songs.

Therefore, such tags are noise and should be eliminated. Another linguistic resource, General

Inquirer (Stone, 1966), provides a list of judgmental terms. General Inquirer is a lexicon comprised of 11,788 words organized in 182 psychological categories, two of which are about

“evaluation” containing 492 words implying judgment and evaluation. In this study, these 492 words were subtracted from the terms in WordNet-Affect, which resulted in 1,384 terms.
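Operationally, this filtering step is a simple set difference. The sketch below assumes the two word lists have already been loaded from WordNet-Affect and General Inquirer; the small sets shown are placeholders, not the real lexicons.

    # Placeholder stand-ins for the 1,586 WordNet-Affect terms and the 492
    # General Inquirer "evaluation" words used in this research.
    wordnet_affect_terms = {"joy", "gloomy", "calm", "amazing", "miserable"}
    gi_evaluation_terms = {"amazing", "miserable", "good", "bad", "great"}

    # Removing judgmental words leaves the candidate mood-related terms
    # (1,384 terms in the actual data).
    mood_terms = wordnet_affect_terms - gi_evaluation_terms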

3.1.2 Obtaining Mood-related Social Tags

Many of the mood-related terms obtained so far may not be used by music taggers, and thus cannot be used for mood categories that aim to reflect the music listening reality. Hence, the next step is to obtain mood-related social tags. Many music websites provide social tagging functions, and a few of them have gained significant popularity among the general public and have accumulated a large number of social tags. Last.fm is one of the most popular tagging sites for

Western music 7. With 30 million users every month, it provides a rich resource of studying how people tag music. According to Lamere (2008), last.fm has over 40 million tags as of 2008. Over half of them are on tracks and 5% of the tags are directly related to mood.

In this research, it is the author’s intention to only use tags published on one website. This is because each website has its own user community, and combining tags across websites would lose the coherence and identity of the user community. The author queried last.fm through its

API 8 with the 1,384 mood-related terms identified so far, and 611 of them had been used as tags by last.fm users as of June 2009. To untangle the “long-tail” problem mentioned above, tags that were used less than 100 times were eliminated. 236 terms/tags remained after this step.

7 Social Media Statistics http://socialmediastatistics.wikidot.com/lastfm Retrieved on July 22, 2008.

8 Accessible at http://www.last.fm/api
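The popularity filter applied in the step just described amounts to keeping only terms whose last.fm usage count reaches the threshold. The sketch below assumes the usage counts have already been retrieved through the last.fm API and stored in a dictionary; the numbers shown are invented.

    # tag -> number of times the tag has been applied on last.fm (invented values).
    tag_usage = {"sad": 250000, "wistful": 4300, "lachrymose": 57, "serene": 812}

    # Keep only tags used at least 100 times, as described above.
    popular_mood_tags = {tag for tag, count in tag_usage.items() if count >= 100}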

3.1.3 Cleaning up Social Tags by Human Experts

Human expertise is applied as a final step to ensure the quality of the mood-related term list.

This research consulted two human experts who were respected MIR researchers with a music background and native English speakers. In this dissertation research where human experts were consulted with regard to music mood terms, the two experts worked together at the same place and time. They manually examined the terms and discussed discrepant opinions with each other until they reached the same decisions. In this way, all terms were considered by both experts and all decisions were made by the best judgments of both experts. In this particular task of cleaning up non-mood tags, the experts examined the remaining 236 terms. They first identified and removed tags with music meanings that did not involve an affective aspect (e.g., “trance” and

“beat”). Then, they removed words with ambiguous meanings. For example, “chill” can mean

“to calm down” or “depressing,” but social tags do not provide enough context to disambiguate the term. Finally, they also identified and removed additional evaluation words that were not included in General Inquirer, such as “fascinating” and “dazzling.” After this step, there remained 136 mood-related terms.

3.1.4 Grouping Mood-related Social Tags

As a means of solving the synonym problem of social tags, the mood-related tags are organized into groups such that synonyms are merged together into one group. Tags in each resultant group then collectively define a mood category. This step again uses WordNet-Affect.

WordNet is a natural resource for identifying synonyms, because it organizes words into synsets.

Words in the same synset are synonyms from a linguistic point of view. Moreover, WordNet-

Affect also links each non-noun synset (verb, adjective and adverb) with the noun synset from which it is derived. For instance, the synset of “joyful” is marked as derived from the synset of

“joy.” Both synsets of “joyful” and “joy” represent the same kind of mood and should be merged into the same category. Hence, among the 136 mood-related tags, those appearing in and being derived from the same synset in WordNet-Affect were merged into one group.
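A rough approximation of this grouping step can be written with NLTK's WordNet interface, which exposes synsets and derivational links between lemmas. This sketch does not use WordNet-Affect itself and does not reproduce the experts' manual splitting and merging, so it only illustrates the mechanism.

    from collections import defaultdict
    from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus

    def synset_keys(word):
        """Synsets of a word plus the synsets its lemmas are derived from."""
        keys = set()
        for syn in wn.synsets(word.replace(" ", "_")):
            keys.add(syn.name())
            for lemma in syn.lemmas():
                for derived in lemma.derivationally_related_forms():
                    keys.add(derived.synset().name())
        return keys

    # Tags that share at least one key end up in the same bucket, so "joy" and
    # "joyful" are expected to be grouped together.
    buckets = defaultdict(set)
    for tag in ["joy", "joyful", "joyous", "gloom", "gloomy"]:
        for key in synset_keys(tag):
            buckets[key].add(tag)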

Finally, human experts were again consulted to modify the grouping of tags when they saw the need for splitting or further merging some groups. Each of the resultant groups of social tags is taken as one mood category that is collectively defined by all the tags in the group. The categories and the comparisons to psychological models are reported in the next sections.

3.2 MOOD CATEGORIES

Using the method described in the last section, a set of 36 mood categories consisting of 136 social tags were identified from the most popular mood-related social tags published on last.fm.

Using the linguistic resources allows this process to proceed quickly and minimizes the workload of the human experts. Hence the experts can focus on the few tasks that need human expertise most and ensure the quality of their work. Table 3.1 presents the categories and the tags contained in each category.

Table 3.1 Mood categories derived from last.fm tags

Categories (member tags)                                                                               Number of tags
calm, calm down, calming, …, comfort, comforting, cool down, quiet, …, serene, serenity, soothe, soothing, still, tranquil, tranquility    16
gloomy, blue, dark, depress, depressed, depressing, depression, depressive, gloom    9
mournful, …, heartache, heartbreak, heartbreaking, mourning, …, …, sorrowful    9
gleeful, …, euphoric, high spirits, joy, joyful, joyous, uplift    8
cheerful, cheer up, cheer, cheery, festive, jolly, merry, sunny    8
brooding, broody, contemplative, meditative, pensive, reflective, wistful    7
confident, encouragement, encouraging, fearless, …, optimistic    6
angry, anger, furious, fury, …    5
anxious, …, …, jumpy, nervous    5
exciting, exhilarating, stimulating, thrill, thrilling    5
cynical, misanthropic, misanthropy, pessimistic    4
compassionate, mercy, …, …    4
desolate, desolation, …, …    4
scary, fear, …, terror    4
hostile, …, malevolent, venom    4
sad, melancholic, sadness    3
desperate, despair, hopeless    3
tender, caring, tenderness    3
glad, happiness, happy    3
hopeful, …, …    3
earnest, heartfelt    2
aggression, aggressive    2
…, worshipful    2
hysterical, …    2
disturbing, distress    2
…, …    2
hectic, restless    2
dreamy    1
romantic    1
suspense    1
…    1
surprising    1
…    1
satisfaction    1
carefree    1
triumphant    1
TOTAL    136

37 3.3 COMPARISONS TO MUSIC PSYCHOLOGY MODELS

In this section, the identified mood categories are compared to both Hevner’s categorical model and Russell’s two-dimensional model, with regard to the following two aspects:

1) Is there any correspondence between the identified categories and those in the

psychological models?

2) Do the distances between mood categories show similar patterns to those in the

psychological models?

3.3.1 Hevner’s Circle vs. Derived Categories

Some of the terms in Hevner’s circle (Figure 2.1) are known to be old-fashioned and are rarely used for describing moods nowadays. This is reflected by the fact that only 37 of the 66 words in Hevner’s circle were found in WordNet-Affect, including matches of terms in different derived forms (e.g., “solemnity” and “solemn” were counted as a match). By comparing the clusters in Hevner’s circle to the set of categories identified from social tags, it was found that 23 words (35% of all) in Hevner’s circle matched tags in the derived categories, as indicated in

Figure 3.1, where matched words are surrounded by rectangles. Please note that in Figure 3.1 the order of words within each cluster may be changed from Figure 2.1, so that words in the same derived categories are within one rectangle. The observation that the rectangles never cross

Hevner’s clusters suggests that the boundaries of Hevner’s clusters and derived categories are in accordance with each other, despite the finer granularity of the derived categories.


Figure 3.1 Words in Hevner’s circle that match tags in the derived categories

It also can be seen from Figure 3.1 that Clusters 2, 4, 6, 7 have the most matched words among all clusters, indicating Western popular songs (as the main music type in last.fm) mostly fall into these mood clusters. Besides exact matches, there are five categories in Table 3.1 with meanings close to some of the clusters in Hevner’s model: categories “angry” and “aggressive” are close to Cluster 8, the category “desire” is close to the “longing” and “yearning” in Cluster 3, and the category “earnest” is close to “serious” in Cluster 1. This use of different words for the same or similar meanings indicates a vocabulary mismatch between social tags and adjectives used in Clusters 3 and 8. Clusters 1 and 5 have the least matched or nearly matched words, reflecting that they are not good descriptors for Western popular songs. In fact, Hevner’s circle was mainly developed for classical music, for which words in Clusters 1 and 5 (“light,”

“delicate,” “graceful,” and “lofty”) would be a good fit.

In total, 20 of the 36 derived categories have at least one tag contained in Hevner’s circle or with close meanings to terms in Hevner’s circle. It is not surprising that the empirical data entailed more categories, since social tags were aggregated from millions of users while

Hevner’s model was developed by studying merely hundreds of subjects.

In conclusion, after more than seven decades, Hevner’s circle is still largely in accordance with categories derived from today’s empirical music listening data. Admittedly, there are more mood categories in today’s empirical data, and there is a vocabulary mismatch issue, since language itself is evolving with time.

3.3.2 Russell’s Model vs. Derived Categories

Figure 3.2 marks the words appearing in both Russell’s model and the derived sets of mood categories. In this figure, terms that match tags in the derived categories are marked with bold font and terms that have close meanings with tags in the derived categories are marked with italic font, with corresponding tags shown in parentheses. Words in the same derived categories are circled together.

Figure 3.2 shows that 13 of the 28 words in Russell’s model match tags in the derived categories, and another three words have close meanings with tags in the derived categories.

Hence, more than half of the words in Russell’s model match or nearly match tags in the derived categories. For those unmatched words, there are several cases: 1) Some words are synonyms according to WordNet, such as “content” and “satisfied,” “at ease” and “relaxed,” “droopy” and

“tired,” “pleased” and “delighted.” Words in these pairs represent similar mood; 2) Some words are ambiguous and can be judgmental (“miserable,” “bored,” “annoyed”). If used as social tags,

these terms may represent users’ preferences towards the songs rather than the moods carried by the songs. Hence these terms were removed during the process of deriving mood categories from social tags; and, 3) five of the 28 adjectives in Russell’s model are not in WordNet-Affect:

“aroused,” “tense,” “droopy,” “tired” and “sleepy.” They are either rarely used in daily life or are not deemed as mood-related. Nevertheless, the high percentage of matched vocabulary with

WordNet-Affect (23 out of 28) does reflect the fact that Russell’s model is newer than Hevner’s.

Figure 3.2 Words in Russell’s model that match tags in the derived categories

It can also be seen from Figure 3.2 that matched words in the same category (circled together) are placed closely in Russell’s model, and the matched words distribute evenly across the four quadrants of the two dimensional space. This indicates the derived categories have a good coverage of moods in Russell’s model. On the other hand, 2/3 of the 36 derived categories do not have matched or closely matched words in Russell’s model. This indicates that the original Russell model with 28 adjectives reflects some but not most of the mood categories used

in today’s music listening environment. Hence, the MIR experiments using four mood categories based on Russell’s model have been trying to solve part of the real problem, but by no means the complete problem. However, let us recall that Russell’s model is a dimensional model instead of a categorical one, and thus it can be extended to include more adjectives. In fact, later studies have extended this model in many different ways (Schubert, 1996; Thayer, 1989; Tyler, 1996). It is possible that many, if not all, tags in the derived categories could find their places in the two-dimensional space, but it is a topic beyond the scope of this dissertation.

3.3.3 Distances between Categories

Both Hevner’s circle and Russell’s space demonstrate relative distances between moods. For instance, in Russell’s space, “sad” and “happy,” “calm” and “angry” are at opposite places while

“happy” and “glad” are close to each other.

To see if there are similar patterns in the derived categories, the distances between the categories were calculated according to the co-occurrences of artists associated with the tags.

The last.fm API provides the top 50 artists associated with each tag, and thus the top artists for each of the 136 tags in the derived categories were collected, and then the distances between the categories were calculated based on artist co-occurrences. Figure 3.3 shows the distances of the sets of categories plotted in a two-dimensional space using Multidimensional Scaling (Borg &

Groenen, 2004). In this figure, each category is represented by one tag in this category and a bubble whose size is proportional to the total times for which the tags in this category are used.
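The plotting step can be reproduced in outline with scikit-learn's MDS implementation. The tiny co-occurrence matrix below is invented, and the conversion from artist co-occurrence to distance (here simply one minus the normalized overlap) is one reasonable choice rather than the exact formula used for Figure 3.3.

    import numpy as np
    from sklearn.manifold import MDS

    # co[i, j]: number of top-50 artists shared by mood categories i and j
    # (invented numbers for three categories).
    co = np.array([[50, 30,  2],
                   [30, 50,  1],
                   [ 2,  1, 50]], dtype=float)

    distance = 1.0 - co / 50.0          # turn overlap into a dissimilarity
    np.fill_diagonal(distance, 0.0)

    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(distance)
    # coords can then be plotted, one point (bubble) per mood category.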


Figure 3.3 Distances of the 36 derived mood categories based on artist co-occurrences

As shown in Figure 3.3, categories that are intuitively close (e.g., those denoted by “glad,”

“cheerful,” “gleeful”) are positioned together, while those placed at almost opposite positions indeed represent contrasting moods (e.g., the ones denoted as “aggressive” and “calm,”

“cheerful” and “sad”). This evidences that the mood categories derived from social tags have similar patterns of category distances to those in psychological mood models. More interestingly,

Figure 3.3 also shows the valence and arousal dimensions similar to those in Russell’s model. The horizontal dimension is similar to valence, indicating positive or negative feelings, while the vertical dimension is similar to arousal, indicating active or passive states of being. Finally, an interesting observation from the sizes of the bubbles is that tags reflecting sad feelings (e.g.,

“sad” and “gloomy”) are much more frequently applied than those reflecting happy feelings

(e.g., “glad” and “cheerful”).

3.4 SUMMARY

The identified categories are intuitively reasonable according to common sense, because 1) most music mood types mentioned in the literature are covered; and 2) the relative distances among the categories are natural. From the above comparisons, it is clear that the mood categories identified from social tags are indeed supported by the theoretical music psychology models to a large extent. In particular, the distance plot implies the well-known arousal and valence dimensions. There are differences between the identified categories and those in the models.

Particularly, there are more categories identified in social tags, and they are in a finer granularity than those in the models. However, these differences are well explained by the sizes of samples used in the two approaches. Therefore, the identified mood categories are reasonable according to the definition of “reasonableness” in Section 1.3. The answer to research question 1 is positive: social tags can be used to identify a set of reasonable mood categories.

In MIR, one of the most debated topics on music and mood is mood categories. Theoretical models in psychology were designed in laboratory settings and may not be suitable for today’s reality of music listening. By deriving a set of mood categories from social tags and comparing them to the two most representative mood models in music psychology, this research reveals that there is common ground between theoretical models and categories derived from empirical music listening data in real life. On the other hand, there are also non-negligible differences between the two:

1) Vocabularies. Some words used in theoretical models are outdated, or otherwise not used in today’s daily life;

2) Targeted music. Theoretical models were mostly designed for classical music, while there are a variety of music genres in today’s music listening environment;

3) Numbers of categories and granularity. While theoretical models often have a handful of mood categories, in the real world there can be more categories in a finer granularity.

Therefore, in developing music mood classification techniques for today’s music and users, MIR researchers should extend classic mood models according to the context of targeted users and music listening reality. For example, to classify Western popular songs, Hevner’s circle can be adapted by introducing more categories found from social tags and trimming Clusters 1 and 5 which are mostly for classical music.

CHAPTER 4: CLASSIFICATION EXPERIMENT DESIGN

The method used to answer research questions 2 to 5 is to compare the performances of multiple music mood classification systems based on features extracted from different information sources (audio, lyrics, and hybrid). This chapter addresses issues related to this method: evaluation task, performance measure and comparison method, as well as classification algorithm and implementation.

4.1 EVALUATION METHOD AND MEASURE

4.1.1 Evaluation Task

To answer the research questions, various classification systems will be evaluated and compared in the task of binary classification. In a binary classification task (Figure 4.1), a classification model is built for each mood category, and the model, after being trained, gives a binary output for each track: either it belongs to this category or not.

Figure 4.1 Binary classification

There are two reasons that a binary classification task, instead of a multi-class classification task, is chosen for this research. First, this research aims to take a realistic look at the problem of

music mood classification which involves dozens of mood categories. Previous experiments on automatic music mood classification usually considered only a handful of mood categories, which likely simplified the real problem. However, in music mood classification, when the number of categories grows beyond 10, the performances of multi-class classification algorithms become very low and lose their practical value (Li & Ogihara, 2004). Second, multi-class classification is usually adopted for experiments where the number of instances in each category is equal.

However, in order to maximize the usage of the available audio and lyrics data, the experiment dataset in this dissertation research contains different numbers of instances in each category (see

Section 5.2).

4.1.2 Performance Measure and Statistical Test

Commonly used performance measures for classification problems include accuracy, precision, recall and F-measure. Table 4.1 shows a contingency table of a binary prediction.

Compared to the ground truth, a prediction can be the following: true positive (TP): the prediction and truth are both positive; false negative (FN): the prediction is negative but the truth is positive; false positive (FP): the prediction is positive but the truth is negative; and true negative (TN): the prediction and truth are both negative.

Table 4.1 Contingency table of binary classification results

                          Prediction
                          Positive            Negative
Ground truth   Positive   True positive       False negative
               Negative   False positive      True negative

The performance measures are defined as follows:

\[
\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN}; \quad
\text{precision} = \frac{TP}{TP + FP}; \quad
\text{recall} = \frac{TP}{TP + FN}
\]

\[
F_\beta = \frac{(\beta^2 + 1)\,\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}, \quad \beta \ge 0, \quad
\text{where } \beta = \frac{\text{importance of recall}}{\text{importance of precision}}.
\]

Usually \(\beta = 1\), giving equal importance to precision and recall:
\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.
\]
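Expressed in code, the measures above reduce to a few arithmetic functions over the contingency counts; this small sketch follows the formulas directly.

    def accuracy(tp, fn, fp, tn):
        return (tp + tn) / (tp + fn + fp + tn)

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_beta(p, r, beta=1.0):
        """F-measure; beta = 1 weights precision and recall equally."""
        return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)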

Accuracy has been extensively adopted in binary classification evaluations in text categorization. In MIR, especially MIREX, accuracy has been commonly reported in evaluating classification tasks. Therefore, accuracy will be used as the classification performance measure in this dissertation research.

In evaluations of multiple categories, a concise and reliable measure of average performance is desirable. There are two approaches to calculating the average performance over all categories: micro-average and macro-average. Micro-average first gets the sums for all four cells in the contingency table (Table 4.1) across categories before calculating the final performance measure using the above formulas, while macro-average calculates the performance measures for each category and then takes the mean as the final score. Micro-averaging gives equal weight to each instance and therefore tends to be dominated by the classifier’s performance on big categories.

Macro-averaging gives equal weight to each category, regardless of its size. Thus the two measures may give very different scores. This dissertation research puts equal emphasis on each mood category and thus macro-averaged measures are adopted for evaluation and comparison.

In terms of splitting data into training and testing sets, both multiple randomized hold out tests and cross validation are often used in MIR classification evaluations. In a hold-out test the

entire labeled dataset is split into training and testing subsets, and an average performance can be evaluated with multiple randomized hold-out tests with the same train/test split ratio. Cross validation (CV) is a simple heuristic evaluation. In the setting of m-fold cross validation, a training set is randomly or strategically divided into m disjoint subsets (folds) of equal size. The classifier is trained m times, each time with a different fold held out as the testing set. An average performance on the m runs can be calculated and evaluated. m = 3, 5, 10 are popular choices in

MIR studies. For example, the AMC task in MIREX 2007 adopted a 3-fold cross validation. This dissertation research uses 10-fold cross validation.

In comparing system performances, Friedman’s ANOVA will be applied to determine whether there are significant differences between the systems considered in each research question. Friedman’s ANOVA is a non-parametric test which does not require normal distribution of the sample data, and accuracy data are rarely distributed normally (Downie,

2008). The samples used in the tests will be accuracies on individual mood categories, unless otherwise indicated.
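SciPy provides Friedman's test directly. The sketch below assumes three systems evaluated on the same set of mood categories, with one (invented) accuracy value per category for each system.

    from scipy.stats import friedmanchisquare

    # One accuracy per mood category for each system (invented values, aligned
    # so that index i refers to the same category in every list).
    acc_audio = [0.58, 0.61, 0.55, 0.63, 0.60, 0.57]
    acc_lyric = [0.62, 0.60, 0.59, 0.66, 0.64, 0.61]
    acc_hybrid = [0.66, 0.65, 0.62, 0.70, 0.67, 0.64]

    statistic, p_value = friedmanchisquare(acc_audio, acc_lyric, acc_hybrid)
    if p_value < 0.05:
        print("significant difference among the systems")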

4.2 CLASSIFICATION ALGORITHM AND IMPLEMENTATION

4.2.1 Supervised Learning and Support Vector Machines

A number of supervised learning algorithms have been invented and extensively adopted in both automatic text categorization and music classification. Supervised learning is a technique that calculates a classification function or model from training data and then uses the function or model to classify new and unseen data. Common supervised learning algorithms include decision trees such as Quinlan’s ID3 and C4.5, K-Nearest Neighbors (KNN), Naïve Bayesian algorithm,

Support Vector Machines (SVM), etc. (Sebastiani, 2002). In text categorization evaluation studies, Naïve Bayes and SVM are almost always considered. Naïve Bayes often serves as a baseline, while SVM seems to have the top performances (Yu, 2008). In MIR, music classification studies (mostly on genre classification) often choose KNN and/or decision trees

(C4.5) as baselines to be compared to SVM. Results in both existing music classification experiments and MIREX classification tasks have shown that the SVM generally, if not always, outperforms other algorithms (e.g., Hu et al., 2008a; Laurier et al., 2008). As this research needs to combine both sources of audio and text, SVM is chosen as the classification algorithm.

By design, SVM is a binary classification algorithm. For multi-class classification problems, a number of SVMs have to be learned, each of which predicts the membership of examples in one class. In order to reduce the chance of overfitting, SVM attempts to find the classification plane between the two classes that maximizes the margin to either class (Burges, 1998). The data instances on the margins are called support vectors, while other instances are considered not to contribute to the classification. SVM classifies a new instance by deciding on which side of the plane the vector of the instance falls.

An SVM with a linear kernel means that there exists a straight line in a two-dimensional space that separates one class from another (Figure 4.2). For datasets that are not linearly separable, higher order kernels are used to project the data to a higher dimensional space where they become linearly separable. Finding the classification plane involves a complicated computation of quadratic programming, and thus SVM are more computationally expensive than

Naïve Bayes classifiers. However, SVM are very robust with noisy examples, and they can

achieve very good performance with relatively few training examples, because only support vectors are taken into account.

Figure 4.2 Support Vector Machines in a two-dimensional space

4.2.2 Algorithm Implementation

The LIBSVM implementation of SVM (Chang & Lin, 2001) is used in this dissertation research. The LIBSVM package has been widely used in text categorization and MIR experiments, including the Marsyas system, the chosen audio-based system for comparisons in this research (see Section 7.1). The LIBSVM package can output posterior probability of each testing instance, and thus can be adapted for implementing the late fusion hybrid method. The

LIBSVM has a few parameters to set. A pilot study (Hu, Downie, & Ehmann, 2009a) tuned the parameters using the grid search tool in the LIBSVM and found the default parameters performed the best for most cases. Therefore, experiments in this research will use the default parameters in the LIBSVM. It was also found that a linear kernel yielded similar results as polynomial kernels. Hence, the linear kernel is chosen for experiments in this research since polynomial kernels are computationally much more expensive.
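For readers who prefer a scripted interface, scikit-learn's SVC wraps the same LIBSVM library. The sketch below shows a linear-kernel classifier with probability output evaluated by 10-fold cross validation on random placeholder data; it mirrors the setup described in this chapter rather than the actual experiment code.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Placeholder feature matrix and binary labels for one mood category.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    y = rng.integers(0, 2, size=200)

    # Linear kernel with default parameters; probability=True enables the
    # posterior estimates needed for late fusion.
    clf = SVC(kernel="linear", probability=True)

    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(scores.mean())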

4.3 SUMMARY

This chapter described the design of classification experiments for answering research questions 2 to 5, as well as the rationale behind the design. The evaluation task is a binary classification where a classification model is built for each mood category. Accuracy will be used as the evaluation measure and performances on individual categories will be combined using macro-averaging so as to give equal weight to each category. A 10-fold cross validation evaluation will be employed for splitting training and testing datasets. The performances of different systems will be rigorously compared using Friedman’s ANOVA tests. The classification systems will be built using the SVM classification algorithm because of its superior performances in related classification tasks, and the LIBSVM software package will be used as the classification tool due to its popularity and flexibility.

CHAPTER 5: BUILDING A DATASET WITH TERNARY INFORMATION

The experiments described in Chapter 4 need to be conducted against a ground truth dataset with ternary information sources available: audio, lyrics and social tags. Audio and lyrics are used to build the classifiers, while social tags are used for giving ground truth labels to examples in the dataset. This chapter describes the process of collecting and preprocessing the data with ternary information sources, as well as the process of building the ground truth dataset with mood labels given by social tags.

5.1 DATA COLLECTION

5.1.1 Audio Data

Audio is the most difficult to obtain among all the three information sources, due to intellectual property and copyright laws imposed on music materials. For this reason, data collection for this research started from audio data accessible to the author. The author is affiliated with the International Music Information Retrieval Systems Evaluation Laboratory

(IMIRSEL) where this dissertation research is conducted. The IMIRSEL is the host of MIREX each year, and has accumulated multiple audio collections of significant sizes and diversity (see

Table 5.1). The audio data in this dissertation research were selected from the IMIRSEL collections.

Table 5.1 Information of audio collections hosted in IMIRSEL

Audio Sets      Format         Number of tracks   Avg. length of tracks (second)   Short description
USPOP           wav (stereo)   8,791              253.6                            Popular music in US
USCRAP          wav (stereo)   3,993              243.5                            Unpopular music in US
American        wav (stereo)   5,291              183.2                            American music
Classical       wav (stereo)   9,750              242.4                            Classical music
Metal/Elect.    wav (stereo)   290                311.8                            Metal & Electronica music
Magnatune       mp3            4,648              253.9                            Music released by Magnatune
The Beatles     wav (stereo)   180                163.8                            12 CDs of The Beatles
Latin           wav (stereo)   3,227              221.6                            Latin music
Assorted Pop    mp3            609                233.8                            Pop music in US and Europe

Some audio collections shown in Table 5.1 are not usable for this research. Electronica and

Classical music usually do not have lyrics. While the Latin collection has lyrics in Spanish, this research is limited to investigating lyrics in English. In addition, after these audio collections were merged into a super collection, a number of songs were found to be duplicates. In many cases, the duplicates were different recordings of the same song. For example, the song “Help!” by The Beatles appeared in two albums: one was the album “Help!” released in 1965; the other was the album “1” released in 2000. In such cases, the recording with the latest release date was chosen because its sound quality was (sometimes much) better than the older ones. After eliminating duplicates, the number of songs in each audio collection is shown in Table 5.2.

As Marsyas, the audio-based system to be evaluated in this research, takes .wav files as input, the mp3 tracks were converted to .wav files using the ffmpeg program 9. All the audio tracks used in the experiments were converted into 44.1 kHz stereo format before audio features were extracted using Marsyas.

9 Available at http://www.ffmpeg.org/

5.1.2 Social Tags

Since social tags on last.fm were used for identifying mood categories and the ground truth dataset will be built using a very similar method (see Section 5.2), last.fm is used for collecting social tags applied to the songs in the audio collections. For each song, the 100 most popular tags applied to it are provided by the last.fm API. The social tags used in building the dataset were collected during the month of February 2009, and 12,066 of the audio pieces had at least one last.fm tag.

5.1.3 Lyric Data

Knees, Schedl, and Widmer (2005) extracted lyrics from the Internet by querying the Google search engine with keywords in the form “track name” + “artist name” + “lyrics,” but the results showed limited precision despite high recall. For this thesis research, precise lyrics are required, and thus lyrics were gathered from online lyric databases, instead of using general search engines.

Lyricwiki.org was the primary resource because of its broad coverage and standardized format.

Mldb.org was the secondary website which was consulted only when no lyrics were found on the primary database. To ensure data quality, the crawlers were implemented to use song title, artist and album information to identify the correct lyrics. In total, 8,839 songs had both social tags and lyrics. A language identification program 10 was then run against the lyrics, and 55 songs were identified and manually confirmed as non-English, leaving lyrics for 8,784 songs.

The lyrics databases do not provide APIs for downloading. Hence one has to query the databases and download the displayed pages. This makes it a necessary step to clean up

10 Available at http://search.cpan.org/search%3fmodule=Lingua::Ident

irrelevant parts such as HTML markups and advertisements. In addition, lyrics need special preprocessing techniques because they have unique structures and characteristics. First, most lyrics consist of such sections as intro, interlude, verse, pre-chorus, chorus and outro, many with annotations on these segments. Second, repetitions of words and sections are extremely common.

However, very few available lyric texts were found as verbatim transcripts. Instead, repetitions were annotated as instructions like “[repeat chorus 2x],” “(x5),” etc. Third, many lyrics contain notes about the song (e.g., “written by …”), instrumentation (e.g., “(SOLO PIANO)”), and/or the performing artists. In building a preprocessing program that takes these characteristics into consideration, the author manually identified about 50 repetition patterns and 25 annotation patterns (see Appendix A for a complete list). The program converted repetition instructions into the actual repeated segments for the indicated number of times while recognizing and removing other annotations.
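The sketch below illustrates the general shape of such a preprocessing routine on two of the simplest cases, per-line repetition marks like “(x5)” and bracketed annotations like “[Verse 1]”. The roughly 50 repetition and 25 annotation patterns actually handled are listed in Appendix A; the regular expressions here are only indicative.

    import re

    REPEAT = re.compile(r"\(\s*x\s*(\d+)\s*\)|\(\s*(\d+)\s*x\s*\)", re.IGNORECASE)
    ANNOTATION = re.compile(r"\[[^\]]*\]")

    def unfold_lyric(raw):
        """Expand simple per-line repetition marks and drop bracketed annotations."""
        out = []
        for line in raw.splitlines():
            line = ANNOTATION.sub("", line)
            match = REPEAT.search(line)
            times = int(match.group(1) or match.group(2)) if match else 1
            line = REPEAT.sub("", line).strip()
            if line:
                out.extend([line] * times)
        return "\n".join(out)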

5.1.4 Summary

Table 5.2 summarizes the composition of the collected data.

Table 5.2 Descriptions and statistics of the collections

Collection        Avg. length (second)   Unique songs   Have tags   Have English lyrics
USPOP             253.6                  8,271          7,301       6,948
USCRAP            243.5                  2,553          456         237
American music    183.2                  5,049          2,209       790
Metal music       311.8                  105            105         104
Beatles           163.8                  163            162         161
Magnatune         253.9                  4,204          1,261       19
Assorted Pop      233.8                  600            572         525
Total (Avg.)      234.8                  20,945         12,066      8,784

5.2 GROUND TRUTH DATASET

As discussed in Chapters 1 and 2, most previous experiments on music mood classification were conducted on small experiment datasets, and thus an efficient method is sorely needed for building ground truth datasets for music mood classification experimentation and evaluation.

This section describes the process of building the experiment dataset for this research.

McKay, McEnnis, and Fujinaga (2006) proposed a list of desired attributes of new music databases, based on which the author summarizes the following desired characteristics of a ground truth set for music mood classification:

1) There should be several thousand pieces of music in the ground truth set. Datasets

with hundreds of songs used in previous experiments are too small to draw

generalizable conclusions.

2) The mood categories should cover most of the moods expressed by the collection of music

being studied. The three to six categories used in most previous studies

oversimplified the real question.

3) Each of the mood categories should represent a distinctive meaning.

4) One music piece can be labeled with multiple mood categories. This is more realistic

than single-label classification, since a music piece may be “happy and calm,”

“aggressive and depressed,” etc.

5) Each assignment of a mood label to a music piece is validated by multiple human

judges. The more judges, the better it is.

Starting from the dataset collected by the procedure described in Section 5.1, i.e., the 8,784 songs with ternary information available, the author identified mood categories in this set of songs using a method similar to that described in Section 3.1 and then selected songs for each of the categories. The following subsections describe the process in detail.

5.2.1 Identifying Mood Categories

The mood categories identified in Section 3.2 are not directly applicable to labeling the dataset, because those categories were identified from the most popular social tags in last.fm.

The songs associated with those most popular social tags might be very different from the songs available for this research. On the other hand, with the method described in Section 3.1, it is straightforward and efficient to derive mood categories that fit a given set of songs. In fact, it is the strength of this method to be able to efficiently derive mood categories for any set of songs with social tags available.

The process started from the social tags applied to the 8,784 songs in the dataset via the last.fm API. There were 61,849 unique tags associated with these songs as of February 2009.

WordNet-Affect was employed to filter out junk tags and tags with little or no affective meanings. Among the 61,849 unique tags, 348 were included in WordNet-Affect. However, these 348 words were not all mood-related in the music domain. Human expertise was applied to clean up these words. Just as in identifying mood categories from last.fm tags, the same two human experts identified and removed judgmental tags, ambiguous tags and tags with music meanings that did not involve an affective aspect. As a result of this step, 186 words remained.

The 186 words were then grouped into 49 groups using the synset structure in WordNet.

Tags in each group were synonyms according to WordNet. After that, the two experts further merged several tag groups which were deemed musically similar. For instance, the group of

(“cheer up,” “cheerful”) was merged with (“jolly,” “rejoice”); (“melancholic,” “melancholy”) was merged with (“sad,” “sadness”). This resulted in 34 tag groups, each representing a mood category for this dataset.

Finally, the author manually screened a number of tags that did not exactly match words in

WordNet-Affect but were most frequently applied to the songs in the dataset. Some of those tags had exactly the same meaning as matched words in WordNet-Affect and thus were added into corresponding categories. For instance, “sad song” and “ sad” were added into the category of (“sad,” “sadness”); “mood: happy” and “happy songs” were added into the category of (“happy,” “happiness”). In addition, there were some very popular tags with affective meanings in the music domain that were not included in WordNet-Affect, such as “mellow” and “upbeat.”

The experts recommended including these tags in the categories of the same meaning. For example, “mellow” was added to the (“calm,” “quiet”) category, and “upbeat” was added to the category of (“gleeful,” “high spirits”).

For the classification experiments, each category should have enough samples to build classification models. Thus, categories with fewer than 30 songs were dropped, resulting in 18 mood categories containing 135 tags. These categories and their member tags were then validated for reasonableness by a number of native English speakers. Table 5.3 lists the categories, their member tags and number of songs in each category (see next subsection).

Table 5.3 Mood categories and song distributions

Categories (member tags)                                                                               Number of tags   Number of songs
calm, comfort, quiet, serene, mellow, chill out, calm down, calming, chillout, comforting, content, cool down, mellow music, mellow rock, peace of mind, quietness, relaxation, serenity, solace, soothe, soothing, still, tranquil, tranquility, tranquillity    25    1,680
sad, sadness, unhappy, melancholic, melancholy, feeling sad, mood: sad – slightly, sad song    8    1,178
glad, happy, happiness, happy songs, happy music, mood: happy    6    749
romantic, romantic music    2    619
gleeful, upbeat, high spirits, zest, enthusiastic, buoyancy, elation, mood: upbeat    8    543
gloomy, depressed, blue, dark, depressive, dreary, gloom, darkness, depress, depression, depressing    11    471
angry, anger, choleric, fury, outraged, rage, angry music    7    254
mournful, grief, heartbreak, sorrow, sorry, doleful, heartache, heartbreaking, heartsick, lachrymose, mourning, plaintive, regret, sorrowful    14    183
dreamy    1    146
cheerful, cheer up, festive, jolly, jovial, merry, cheer, cheering, cheery, get happy, rejoice, songs that are cheerful, sunny    13    142
brooding, contemplative, meditative, reflective, broody, pensive, pondering, wistful    8    116
aggressive, aggression    2    115
anxious, angst, anxiety, jumpy, nervous, angsty    6    80
confident, encouraging, encouragement, optimism, optimistic    5    61
hopeful, desire, hope, mood: hopeful    4    45
earnest, heartfelt    2    40
cynical, …, pessimistic, weltschmerz, cynical/sarcastic    5    38
exciting, excitement, exhilarating, thrill, ardor, stimulating, thrilling, titillating    8    30
TOTAL    135    6,490

5.2.2 Selecting Songs

The next step is to select positive and negative examples for each of the 18 categories. The general idea is that if a song is frequently tagged with a term in a category, it should be selected as a positive example for that category. On the other hand, if a song is never tagged with any term in a category, but at the same time is heavily tagged with other tags (mood-related or not), then it should be taken as a negative example for the category. Therefore, the frequency or count of the social tags is crucial for this step.

5.2.2.1 Tag Count on Last.Fm

The last.fm API provides the 100 most popular tags applied to each song and the number of times each tag is applied to this song (called “count” hereafter). To date, the API only provides normalized tag counts instead of real, absolute counts. For each song, the most popular tag gets count 100, and other tags get integer numbers between 0 and 100 proportional to the count of the most popular tag. Tags with count 0 are those appearing too few times compared to other tags associated with a song.

In selecting songs for these categories, one should avoid songs that are only tagged with a term by accident or worse, by mistake or mischief. Ideally, one should select songs with high counts. However, with only the normalized tag counts available, there is no way to calculate the real, absolute tag counts. Hence, a heuristic is used to ensure a tag is picked up for a song only when it has been applied to this song, at the very least, more than once. Only songs satisfying one of the following conditions were counted as candidate positive songs in a category:

1) a song has been tagged with one tag in the category and the count of this tag is not the

smallest among all tags applied to this song;

2) a song has been tagged with at least two tags in the category.

Given normalized counts, if a tag’s normalized count is larger than the minimum count among all the tags associated to a song, it is guaranteed to have appeared more than once. This is the rationale behind condition 1. As for condition 2, if a tag appears in a song’s tag list, then it has been applied to this song for at least once. If two tags in the same category appear in a song’s tag list, then they in sum must have been applied to this song for at least twice. In fact, the absolute counts of these tags are probably far more than once or twice, because for popular songs like those in this dataset, the most popular tags are probably applied hundreds of thousands of times. For example, suppose the most popular tag applied to a hypothetical song “S” is “rock” and it has been applied 20,000 times to “S,” the normalized count for “rock” would be shown as

100 (since it is the most popular one). If there is another tag, “sad,” applied to “S” with a normalized count 1, then according to the proportion, the absolute count of “sad” on “S” would be 200.
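The two selection conditions and their rationale can be captured in a short helper. The sketch assumes each song's tags are available as a dictionary mapping tag to its normalized last.fm count.

    def is_candidate_positive(song_tags, category_tags):
        """song_tags: {tag: normalized count (0-100)} for one song.
        category_tags: the set of tags defining one mood category."""
        matched = [t for t in song_tags if t in category_tags]
        if len(matched) >= 2:                         # condition 2)
            return True
        if len(matched) == 1:                         # condition 1)
            return song_tags[matched[0]] > min(song_tags.values())
        return False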

5.2.2.2 Song filtering

A song should not be selected for a category if its title or artist contains the same terms within that category. For example, all but six songs tagged with “disturbed” in this dataset were songs by the artist “Disturbed.” In this case, the taggers may simply have used the tag to restate the artist instead of describing the mood of the song. Besides, in order to ensure enough data for lyric-based experiments, a selected song should have lyrics with no less than 100 words (after

unfolding repetitions as explained in Section 5.1.3). After these filtering criteria were applied, there were 3,469 unique songs which form the positive example set.

5.2.2.3 Multi-label

Multi-label classification is relatively new in MIR, but in the mood dimension, it is more realistic than single-label classification: This is evident in the dataset as there are many songs that are members of more than one mood category. For example, the song, “I’ll Be Back” by

“The Beatles” is a positive example of the categories “calm” and “sad,” while the song, “Down

With the Sickness” by “Disturbed” is a positive example of the categories “angry,” “aggressive” and “anxious.” Table 5.4 shows the distribution of songs belonging to multiple categories.

Table 5.4 Distribution of songs with multiple labels

Number of categories    1        2        3      4      5     6
Number of songs         1,639    1,010    539    205    62    14

Here an example is presented to illustrate how a song is labeled with these identified mood categories. Figure 5.1 shows the most popular social tags on a song of “The Beatles,” “Here

Comes the Sun,” as published on last.fm on February 5th, 2010. Among these tags, eight match the category terms listed in Table 5.3, and these terms belong to five categories as shown in circles of the same colors. Therefore, this particular song is labeled as positive examples of these five categories.


Figure 5.1 Example of labeling a song using social tags

5.2.2.4 Negative Samples

In a binary classification task, each category needs negative samples as well. The negative sample set for a given category is chosen from songs that are not tagged with any of the terms found within that category but are heavily tagged with many other terms. Since there are plenty of negative samples for each category, a song must satisfy all of the following conditions to be selected as a negative sample:

1) It has not been tagged by any of the terms in this category;

2) The total normalized counts of all tags that are not in this category is no less than 100;

3) The minimum normalized count among all tags associated with this song is 0 or 1.

Conditions 2) and 3) together make sure the total absolute count of “other” tags is no less than, and probably much more than, 100.

Similar to positive samples, all negative samples have at least 100 words in their unfolded lyric transcripts. For each category, the positive and negative set sizes are balanced, and thus the total number of examples in all categories is 12,980.
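Analogously, the negative-example conditions (plus the 100-word lyric requirement) can be checked with a helper of the same shape as the one above; the normalized tag counts are again assumed to be available as a dictionary.

    def is_candidate_negative(song_tags, category_tags, lyric_word_count):
        """Check the three negative-sample conditions and the lyric length filter."""
        if any(t in category_tags for t in song_tags):        # condition 1)
            return False
        if sum(song_tags.values()) < 100:                      # condition 2)
            return False
        if min(song_tags.values()) > 1:                        # condition 3)
            return False
        return lyric_word_count >= 100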

5.2.3 Summary of the Dataset

So far the experiment ground truth dataset has been built. There are 18 mood categories and each category has a number of positive examples and an equal number of negative examples.

The relative distances between these 18 mood categories were calculated by the co-occurrence of songs in the positive examples. That is, if two categories share a lot of positive songs, they should be similar. Figure 5.2 illustrates the distances of the 18 categories plotted in a two-dimensional space using Multidimensional Scaling. In this figure, each category is represented by one tag in this category and a bubble whose size is proportional to the number of positive songs in this category.

Figure 5.2 Distances between the 18 mood categories in the ground truth dataset

The patterns shown in this figure are similar to those found in Russell’s model as shown in

Figure 2.2: 1) Categories placed together are intuitively similar; 2) Categories at opposite positions represent contrasting moods; and, 3) The horizontal and vertical dimensions correspond to valence and arousal respectively. Taken together, these similarities indicate that these 18 mood categories fit well with Russell’s mood model, which is the most commonly used model in MIR mood classification research. In addition, it is interesting that both Figure 5.2 and Figure 3.3 show there are more sad songs than happy songs. Although this observation looks intuitively reasonable (e.g., most poems are sad rather than happy), further validation is needed from musicology and/or music psychology.

The full dataset comprises 5,296 unique songs, including positive and negative examples.

This number is much smaller than the total number of examples in all categories (which is

12,980) because categories often share samples. The genre distribution of this dataset is shown in Table 5.5, from which we can see that most of the songs in this dataset are pop music.

Table 5.5 Genre distribution of songs in the experiment dataset (“Other” includes genres occurring very infrequently such as “World,” “Folk,” “Easy listening,” and “Big band”)

Genre No. of songs Genre No. of songs Genre No. of songs Rock 3,977 55 15 214 40 Other 35 Country 136 40 Unknown 564 Electronic 94 Metal 37 TOTAL 5,296 R & B 64 New Age 25

To take a closer look at the relationship between mood and genre, Table 5.6 summarizes how the 6,490 positive examples are distributed across genres and moods. Although all moods in this dataset are dominated by Rock songs, there are still observations that comply with

common knowledge on music. For example, all Metal songs are in negative moods, particularly

“aggressive” and “angry” while 40% of Electronic examples are associated with category

“calm.” Moreover, it is not surprising that most New Age examples are labeled with the moods of “calm,” “sad,” and “dreamy.”

Table 5.6 Genre and mood distribution of positive examples

Mood         Blues  Oldies  Jazz  Electronic  Metal  New Age  R & B  Reggae  Country  Hip Hop  Rock    Other  Unknown  Total
calm         2      2       9     40          0      24       15     34      21       29       1,239   8      257      1,680
sad          1      1       1     11          4      9        9      3       19       10       907     12     191      1,178
glad         1      3       8     17          0      4        9      10      8        7        466     3      213      749
romantic     0      2       5     3           0      6        13     2       15       1        477     9      86       619
gleeful      0      0       0     11          0      2        4      2       12       11       366     2      133      543
gloomy       3      0       1     4           6      0        0      1       3        15       390     2      46       471
angry        0      0       0     1           9      0        0      0       0        9        187     0      48       254
mournful     0      0       0     1           0      0        2      0       13       3        130     1      33       183
dreamy       0      0       0     5           0      9        0      0       0        0        105     0      27       146
cheerful     0      3       1     3           0      1        1      1       2        2        101     3      24       142
brooding     0      0       0     2           0      2        1      0       0        0        90      0      21       116
aggressive   0      0       0     2           14     0        0      0       0        5        82      0      12       115
anxious      0      0       0     1           1      0        0      0       2        0        70      0      6        80
confident    0      0       0     2           0      0        0      1       1        2        43      1      11       61
hopeful      0      0       0     2           0      0        0      0       1        0        32      1      9        45
earnest      0      0       0     0           0      2        0      0       1        0        35      0      2        40
cynical      0      0       0     0           0      0        0      0       0        1        35      0      2        38
exciting     0      0       0     1           0      0        0      0       0        1        27      0      1        30
TOTAL        7      11      25    106         34     59       54     54      98       96       4,782   42     1,122    6,490

CHAPTER 6: BEST LYRIC FEATURES

This chapter presents the experiments and results that aim to answer research question 2: which type(s) of lyric features are the most useful in classifying music by mood. Starting with an overview of the state of the art in text affect analysis, the chapter then describes the lyric features investigated in this research and finally presents the results and discussions.

6.1 TEXT AFFECT ANALYSIS

Pang and Lee (2008) recently published a comprehensive survey on sentiment analysis in text. By sentiment analysis, they mainly referred to analysis on subjectivity, sentimental polarity

(negative vs. positive), and political viewpoints (liberal vs. conservative). They summarized the features that have been used in sentiment analysis: bag-of-words (in pre-built lexicons), part-of-speech (POS) tags, position in the document, higher-order n-grams, and dependency or constituent-based features. However, which features are most useful depends on the specific task. Moreover, as sentiment analysis is a relatively new area, it is still too early to make firm assertions about the best features.

A related area is stylometric analysis, which usually refers to authorship attribution, text genre identification, and authority classification. Previous studies on stylometric analysis have shown that statistical measures on text properties (e.g., word length, punctuation and function words, contractions, named entities, non-standard spellings) could be very useful (e.g., Argamon,

Saric, & Stein, 2003; Hu, Downie, & Ehmann, 2007b). Over one thousand stylometric features have been proposed in a variety of research (Rudman, 1998). However, there is no agreement on the best set of features for a wide range of application domains.

There are several operational systems focused on text categorization according to affect.

Subasic and Huettner (2001) manually constructed a word lexicon for each affect category considered in their study, and classified documents by comparing the average scores of terms in the affect categories. A more complex approach was taken by Liu, Lieberman, and Selker (2003), which was based on a common sense knowledge base, under the assumption that common sense is important for affect interpretation.

As lyrics are a special genre quite different from everyday documents, a common sense knowledge base may not work for lyrics; nor may word lexicons built for other genres of documents. While manually building a lexicon is very labor-intensive, methods for automatic lexicon induction have been proposed. Pang and Lee (2008) summarized such methods and categorized them into two groups: unsupervised and supervised. The three feature selection methods applied to lyrics in Hu et al. (2009a) (i.e., language model comparison, F-score feature ranking, and SVM feature ranking) are examples of supervised lexicon induction.

Unsupervised methods start from a few seed words for which the affect is already known, and then propagate the labels of the seed words to words that co-occur with them in a text corpus, to synonyms, and/or to words that co-occur with them in other resources like WordNet. For instance, Turney (2002) proposed the joint use of mutual information and co-occurrence in a general corpus with a small set of seed words.

There have been interesting studies on the affective aspect of text in the context of weblogs

(Nicolov, Salvetti, Liberman, & Martin, 2006). Most of them still used bag-of-words features of all words or words in specific POSs (mostly adjectives and ). Among them, Mihalcea and

Liu (2006) identified discriminative words in blog posts in two categories, “happy” and “sad,” using Naïve Bayesian classifiers and word frequency threshold.

Alm (2008) studied the affect of sentences in children’s tales and included a very rich feature set covering syntactic aspects (e.g., POS ratios, interjection word counts), rhetorical aspects (e.g., repetitions, onomatopoeia counts), lexical aspects (counts of words in pre-built, emotion-related word lists), and orthographic aspects (e.g., special punctuation). Although the affect categories in Alm’s study were not from a dimensional model, Alm included in the feature set dimensional lexical scores calculated from the ANEW word list (Bradley & Lang, 1999). ANEW stands for

Affective Norms for English Words. It contains 1,034 unique words with scores in three dimensions: valence (a scale from unpleasant to pleasant), arousal (a scale from calm to excited), and dominance (a scale from submissive to dominant). All dimensions are scored on a scale of 1 to 9. Alm’s features used the average scores of word hits in the ANEW list.

Unfortunately, Alm did not evaluate which of these features were most useful in predicting affect categories. However, Alm’s study, among the few studies on text affect prediction, does suggest possible features for consideration in this research.

6.2 LYRIC FEATURES

Based on the aforementioned studies on text affect analysis, this dissertation research investigates a range of lyric feature types that can be categorized into the following three classes:

1) basic text features that are commonly used in text categorization tasks; 2) linguistic features based on psycholinguistic resources; and, 3) text stylistic features including those proven useful

in a study on music genre classification (Mayer et al., 2008). In addition, combinations of these feature types are also evaluated in this study and are described in this section.

6.2.1 Basic Lyric Features

As a starting point, this research evaluates bag-of-words features with the following types:

1) Content words (Content): all words except function words, without stemming;

2) Content words with stemming (Cont-stem): stemming means combining words with

the same roots;

3) Part-of-speech (POS) tags: such as noun, verb, proper noun, etc. In this research, the

Stanford POS tagger 11 is used to tag each lyric word with one of the 36 unique POS

tags in the Penn Treebank project 12 ;

4) Function words (FW): as opposed to content words, also called “stopwords” in text

information retrieval. The function word list used in this study is the one compiled by

S. Argamon, a well-known scholar in the area of text stylistic analysis 13 .

For each of the feature types, four representation models are compared: 1) Boolean; 2) term frequency; 3) normalized frequency; and, 4) tfidf weighting. In a Boolean representation model, each feature value is term presence or absence (one or zero). The term frequency and normalized frequency models, as their names indicate, use term frequencies and normalized term frequencies

11 http://nlp.stanford.edu/software/tagger.shtml

12 http://www.cis.upenn.edu/~treebank/

13 The function word list can be accessed at http://www.ir.iit.edu/~argamon/function-words.txt

as feature values respectively. The model of tfidf weighting uses the product of term frequency and inverted document frequency as feature values.

The name “bag-of-words” simply means a collection of unordered terms, and the terms could be single words (also called “unigrams”), POS tags, or ordered combinations of multiple words (also called “n-grams”). In this study, unigrams, bigrams and trigrams of the above feature types and representation models are all evaluated. For each n-gram feature type, features that occurred fewer than five times in the training dataset were discarded. In addition, for bigrams and trigrams of Content and Cont-stem, function words were not eliminated, because content words are usually connected via function words, as in “I love you,” where “I” and “you” are function words. For

Cont-stem, words were stemmed before bigrams and trigrams were calculated. That is, every word in a bigram or trigram was stemmed.

Theoretically, high order n-grams can capture features of phrases and compound words. A previous study on lyric mood classification (He et al., 2008) found the combination of unigrams, bigrams and trigrams yielded the best results among all n-gram features (n <= 3). Hence, in this study, the combinations of unigrams and bigrams, then those of unigrams, bigrams and trigrams are evaluated to investigate the effect of progressively expanded feature sets. The basic lyric feature sets evaluated in this study are listed in Table 6.1.
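As a hedged illustration of how such feature sets can be produced (this is not the code used in the dissertation), the sketch below builds uni+bi+trigram bag-of-words features with scikit-learn in the Boolean, term-frequency and tfidf representations. Note that `min_df=5` is a document-frequency cut-off and only approximates the corpus-frequency threshold of five described above, and `train_lyrics` is a hypothetical list of unfolded lyric strings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def build_ngram_features(train_lyrics):
    # Boolean presence/absence of unigrams, bigrams and trigrams.
    boolean_vec = CountVectorizer(ngram_range=(1, 3), min_df=5, binary=True)
    x_bool = boolean_vec.fit_transform(train_lyrics)

    # Raw term frequencies over the same kind of vocabulary; a normalized-frequency
    # representation can be obtained by dividing each row by its sum.
    tf_vec = CountVectorizer(ngram_range=(1, 3), min_df=5)
    x_tf = tf_vec.fit_transform(train_lyrics)

    # tf x inverted document frequency weighting.
    x_tfidf = TfidfTransformer().fit_transform(x_tf)
    return x_bool, x_tf, x_tfidf
```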

The effect of stemming on n-gram dimensionality reflects the unique characteristics of lyrics. For unigrams of content words, stemming reduced the number of terms from 7,227 to

6,098, with a reduction rate of 15.6%. However, the reduction rate decreased to 3.3% for bigrams and 0.2% for trigrams. The reduction rate is very low compared to other genres of text such as web pages and newspaper text. While a thorough analysis is needed in the future to

compare the differences between lyrics and text in other genres, an initial examination of the lyric text suggests that the repetitions frequently used in lyrics indeed make a difference in stemming. For example, the lines “bounce bounce bounce” and “but just bounce bounce bounce, yeah” were stemmed to “bounc bounc bounc” and “but just bounc bounc bounce, yeah.” The original bigram “bounce bounce” then expanded into two bigrams after stemming: “bounc bounc” and “bounc bounce,” while the original trigram “bounce bounce bounce” also became two trigrams after stemming: “bounc bounc bounc” and “bounc bounc bounce.”

Table 6.1 Summary of basic lyric features

Feature Type                                 n-grams           No. of dimensions
Content words without stemming (Content)     unigrams          7,227
                                             bigrams           34,133
                                             trigrams          42,795
                                             uni+bigrams       41,360
                                             uni+bi+trigrams   84,155
Content words with stemming (Cont-stem)      unigrams          6,098
                                             bigrams           33,008
                                             trigrams          42,707
                                             uni+bigrams       39,106
                                             uni+bi+trigrams   81,813
Part-of-speech (POS)                         unigrams          36
                                             bigrams           1,057
                                             trigrams          8,474
                                             uni+bigrams       1,093
                                             uni+bi+trigrams   9,567
Function words (FW)                          unigrams          467
                                             bigrams           6,474
                                             trigrams          8,289
                                             uni+bigrams       6,941
                                             uni+bi+trigrams   15,230

6.2.2 Linguistic Lyric Features

In the realm of text sentiment analysis, domain dependent lexicons are often consulted in building feature sets. For example, Subasic and Huettner (2001) manually constructed a word lexicon with affective scores for each affect category considered in their study and classified documents by comparing the average scores of terms included in the lexicon. Pang and Lee

(2008) summarized that studies on text sentiment analysis often used existing off-the-shelf lexicons. In this study, a range of psycholinguistic resources are exploited in extracting lyric features: General Inquirer (GI), WordNet, WordNet-Affect, and Affective Norms for English

Words (ANEW).

6.2.2.1 Lyric Features based on General Inquirer

General Inquirer (GI) is a psycholinguistic lexicon containing 8,315 unique English words and 182 psychological categories (Stone, 1966). Each sense of the 8,315 words in the lexicon is manually labeled, with one or more of the 182 psychological categories to which the sense belongs. For example, the word “happiness” is associated with the categories “Emotion,”

“Positive,” “Psychological well being,” etc. The mapping between words and psychological categories provided by GI can be very helpful in looking beyond word forms and into word meanings, especially for affect analysis where a person’s psychological state is exactly the subject of study. One of the previous studies on music mood classification (Yang & Lee,

2004) used GI features together with lyric bag-of-words and suggested representative GI features for each of their six mood categories.

GI’s 182 psychological features are also evaluated in this research. It is noteworthy that some words in GI have multiple senses (e.g., “happy” has four senses). However, sense disambiguation in lyrics is an open research problem that can be computationally expensive.

Therefore, the author merged all the psychological categories associated with any sense of a word, and based the matching of lyric terms on words instead of senses. The GI features were represented as a 182-dimensional vector with the value at each dimension corresponding to either word frequency, tfidf, normalized frequency, or a Boolean value. This feature type is denoted as

“GI.”
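A minimal sketch of such a 182-dimensional GI vector is given below, assuming a hypothetical `gi_lexicon` dictionary that maps a lower-cased word to the set of GI category indices merged across all of its senses; the normalized-frequency and tfidf representations follow analogously.

```python
import numpy as np

def gi_features(lyric_tokens, gi_lexicon, n_categories=182, boolean=False):
    """Aggregate lyric tokens into one value per GI psychological category.

    `gi_lexicon` is an assumed mapping word -> set of category indices (0-181).
    """
    vec = np.zeros(n_categories)
    for token in lyric_tokens:
        for cat in gi_lexicon.get(token.lower(), ()):
            vec[cat] += 1.0                    # term-frequency representation
    if boolean:
        vec = (vec > 0).astype(float)          # Boolean representation
    return vec
```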

The 8,315 words in General Inquirer comprise a lexicon oriented to the psychological domain, since they must be related to at least one of the 182 psychological categories. Therefore, a set of bag-of-words features is built using these words (denoted as “GI-lex”). Again, all four of the aforementioned representation models are used for this feature type, which has 8,315 dimensions.

6.2.2.2 Lyric Features based on ANEW and WordNet

Affective Norms for English Words (ANEW) is another specialized English lexicon

(Bradley & Lang, 1999). It contains 1,034 unique English words with scores in three dimensions: valence (a scale from unpleasant to pleasant), arousal (a scale from calm to excited), and dominance (a scale from submissive to dominant). All dimensions are scored on a scale of 1 to 9.

The scores were calculated from the responses of a number of human subjects in psycholinguistic experiments and thus are deemed to represent the general impression of these words in the three affect-related dimensions. ANEW has been used in text affect analysis for such genres as children’s tales (Alm, 2009) and blogs (Liu et al., 2003), but the results were

mixed with regard to its usefulness. This dissertation research strives to find out whether and how the ANEW scores can help classify text sentiment in the lyrics domain.

Besides scores in the three dimensions, for each word ANEW also provides the standard deviation of the scores in each dimension given by the human subjects. Therefore there are six values associated with each word in ANEW. For the lyrics of each song, means and standard deviations for each of these values are calculated for words included in ANEW, which results in

12 features.
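A minimal sketch of these 12 features, assuming a hypothetical `anew` dictionary that maps a word to its six ANEW values (the valence, arousal and dominance means plus their standard deviations), could look as follows:

```python
import numpy as np

def anew_features(lyric_tokens, anew):
    """Song-level means and standard deviations of the six ANEW word values."""
    # `anew` is an assumed mapping word -> list of 6 floats.
    hits = np.array([anew[t.lower()] for t in lyric_tokens if t.lower() in anew])
    if hits.size == 0:                     # no lyric word matched the lexicon
        return np.zeros(12)
    return np.concatenate([hits.mean(axis=0), hits.std(axis=0)])
```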

As the number of words in the original ANEW is probably too small to guarantee that each song in the experiment dataset contains at least one ANEW word, the ANEW word list is expanded using

WordNet. WordNet, as mentioned before, is an English lexicon with marked linguistic relationships among word senses. It is organized by synsets such that word senses in one synset are essentially synonyms. Hence, ANEW is expanded by including all words in WordNet that share a synset with a word in ANEW and giving these words the same scores as the original ANEW word. Again, word senses are not differentiated since ANEW only presents word forms without specifying which sense is used. After expansion, there are 6,732 words in the expanded ANEW, which covers all songs in the experiment dataset. That is, every song has non-zero values in the 12 dimensions. This feature type is denoted as “ANEW.”
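The expansion could be implemented along the following lines, assuming NLTK's WordNet interface (with its WordNet data already downloaded) is available; this is a sketch, not the original code:

```python
from nltk.corpus import wordnet as wn

def expand_anew(anew):
    """Give every lemma that shares a synset with an ANEW word that word's scores."""
    expanded = dict(anew)                          # keep the original ANEW entries
    for word, scores in anew.items():
        for synset in wn.synsets(word):
            for lemma in synset.lemma_names():
                lemma = lemma.replace("_", " ").lower()
                # Do not overwrite a word that already has (original) scores.
                expanded.setdefault(lemma, scores)
    return expanded
```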

Like the words from General Inquirer, the 6,732 words in the expanded ANEW can be seen as a lexicon of affect-related words. Together with the 1,586 unique words in the latest version of

WordNet-Affect, the expanded ANEW forms an affect lexicon of 7,756 unique words. This set of words are used to build bag-of-words features under the aforementioned four representation models. This feature type is denoted as “Affect-lex.”

6.2.3 Text Stylistic Features

Text stylistic features often refer to interjection words (e.g., “ooh,” “ah”), special punctuation marks (e.g., “!,” “?”) and text statistics (e.g., number of unique words, length of words, etc.). They have been used effectively in text stylometric analyses dealing with authorship attribution, text genre identification, and authority classification (Argamon et al., 2003).

In the music domain, text stylistic features on lyrics were successfully used in a study on music genre classification (Mayer et al., 2008). In particular, Mayer et al. (2008) demonstrated interesting distribution patterns of some exemplar lyric features across different genres. For example, the words “nuh” and “fi” mostly occurred in reggae and hip-hop songs. Their experiment results showed that the combination of text stylistic features, part-of-speech features and audio spectral features significantly outperformed both the classifier using audio spectral features only and the classifier combining audio and bag-of-words lyric features.

In the task of mood classification, the usefulness of text stylistic features has not been formally evaluated, and thus this dissertation research includes text stylistic features. In particular, the text stylistic features evaluated in this study are defined in Table 6.2, which includes 25 dimensions: six interjection words, two special punctuation marks and 17 text statistics (also see Section 6.4.3).

Table 6.3 summarizes the aforementioned linguistic features and text stylistic features with their numbers of dimensions and numbers of representation models.

Table 6.2 Text stylistic features evaluated in this research

Feature                       Definition
interjection words            normalized frequencies of “hey,” “ooh,” “yo,” “uh,” “ah,” “yeah”
special punctuation marks     normalized frequencies of “!,” “-”
NUMBER                        normalized frequency of all non-year numbers
numberOfWords                 total number of words
numberOfUniqWords             total number of unique words
repeatWordRatio               (number of words - number of uniqWords) / number of words
avgWordLength                 average number of characters per word
numberOfLines                 total number of lines
numberOfUniqLines             total number of unique lines
numberOfBlankLines            number of blank lines
blankLineRatio                number of blankLines / number of lines
avgLineLength                 number of words / number of lines
stdLineLength                 standard deviation of number of words per line
uniqWordsPerLine              number of uniqWords / number of lines
repeatLineRatio               (number of lines - number of uniqLines) / number of lines
avgRepeatWordRatioPerLine     average repeat word ratio per line
stdRepeatWordRatioPerLine     standard deviation of repeat word ratio per line
numberOfWordsPerMin           number of words / song length in minutes
numberOfLinesPerMin           number of lines / song length in minutes
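For illustration, a few of the statistics in Table 6.2 can be computed as in the sketch below (not the original code); `lyric` is a hypothetical unfolded lyric string and `minutes` an assumed song length in minutes.

```python
def some_text_style_features(lyric, minutes):
    """Compute a subset of the Table 6.2 text statistics for one song."""
    lines = lyric.splitlines()
    words = lyric.split()
    uniq_words = set(w.lower() for w in words)
    n_lines = max(len(lines), 1)
    n_words = max(len(words), 1)
    return {
        "numberOfWords": len(words),
        "numberOfUniqWords": len(uniq_words),
        "repeatWordRatio": (len(words) - len(uniq_words)) / n_words,
        "avgWordLength": sum(len(w) for w in words) / n_words,
        "avgLineLength": len(words) / n_lines,
        "repeatLineRatio": (len(lines) - len(set(lines))) / n_lines,
        "numberOfWordsPerMin": len(words) / minutes if minutes else 0.0,
        # Interjection words use normalized frequencies, e.g. for "hey":
        "interjection_hey": sum(w.lower() == "hey" for w in words) / n_words,
    }
```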

Table 6.3 Summary of linguistic and stylistic lyric features

Abbreviation   Feature Type                                  No. of dimensions   No. of representations
GI             GI psychological features                     182                 4
GI-lex         words in GI                                   8,315               4
ANEW           scores in expanded ANEW                       12                  1
Affect-lex     words in expanded ANEW and WordNet-Affect     7,756               4
TextStyle      text stylistic features                       25                  1

6.2.4 Lyric Feature Type Concatenations

Combinations of different lyric feature types may yield performance improvements. For example, Mayer et al. (2008) found the combination of text stylistic features and part-of-speech

features achieved better classification performance than using either feature type alone. This dissertation research first determines the best representation of each feature type and then the best representations are concatenated with one another.

Specifically, for the basic lyric feature types listed in Table 6.1, the best performing n-grams and representation of each type (i.e., content words, part-of-speech, and function words) are chosen and then further concatenated with linguistic and stylistic features. For each of the linguistic feature types with four representation models, the best representation is selected and then further concatenated with other feature types. In total, there are eight selected feature types:

1) n-grams of content words (either with or without stemming); 2) n-grams of part-of-speech; 3) n-grams of function words; 4) GI; 5) GI-lex; 6) ANEW; 7) Affect-lex; and, 8) TextStyle. The total number of feature type concatenations can be calculated as follows:

∑_{i=1}^{8} C(8, i) = 255        (1)

where C(8, i) denotes the number of combinations of choosing i types from all eight types (i = 1, …, 8). All 255 feature type concatenations as well as the original feature types are compared in the experiments to find out which lyric feature type or concatenation of multiple types is the best for the task of music mood classification.
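The enumeration itself is straightforward; a sketch using Python's itertools (with the eight type names used only as labels) is shown below.

```python
from itertools import combinations

feature_types = ["Content", "POS", "FW", "GI", "GI-lex", "ANEW", "Affect-lex", "TextStyle"]

# All non-empty subsets of the eight selected feature types.
concatenations = [combo for r in range(1, len(feature_types) + 1)
                  for combo in combinations(feature_types, r)]

assert len(concatenations) == 2 ** len(feature_types) - 1   # 255 concatenations
```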

6.3 IMPLEMENTATION

The Snowball stemmer 14 is used for experiments that require stemming. As this stemmer cannot handle irregular words, it is supplemented with lists of irregular nouns and verbs 15.

The Stanford POS tagger implements two tagging models: one uses the preceding three tags as tagging context, the other considers both preceding and following tags (Toutanova, Klein,

Manning, & Singer, 2003). The bidirectional model performs slightly better than the left-side-only model, but is significantly slower. As the lyric dataset used in this research is large, the more efficient left-side-only model is adopted. The Stanford tagger is trained on a corpus consisting of articles from the Wall Street Journal. News articles are in a different text genre from lyrics, but there is no available training corpus of lyrics with annotated POS tags. Nevertheless, the lyric data are also in modern English, and the combinations of POS tags in lyrics are not much different from those in news articles. The author manually examined about 50 tagged lyrics and found the results to be generally correct.

6.4 RESULTS AND ANALYSIS

6.4.1 Best Individual Lyric Feature Types

For the basic lyric features summarized in Table 6.1, the variations of uni+bi+trigrams in the

Boolean representation worked best for all three feature types (content words, part-of-speech, and function words). Stemming did not make a significant difference on the performances of

14 http://snowball.tartarus.org/

15 The irregular verb list was obtained from http://www.englishpage.com/irregularverbs/irregularverbs.html, and the irregular noun list was obtained from http://www.esldesk.com/esl-quizzes/irregular-nouns/irregular-nouns.htm

content word features, but features without stemming had higher averaged accuracy. The best performance of each individual feature type is presented in Table 6.4.

For individual feature types, the best performing one was Content, the bag-of-words features of content words with multiple orders of n-grams. Individual linguistic feature types did not perform as well as Content. In addition, among linguistic feature types, bag-of-words features

(i.e., GI-lex and Affect-lex) were the best. The poorest performing feature types were ANEW and TextStyle, both of which were statistically different from the other feature types (at p <

0.05). There was no statistically significant difference among the remaining feature types.

Table 6.4 Individual lyric feature type performances

Abbreviation   Feature Type                                  Representation   Accuracy
Content        uni+bi+trigrams of content words              Boolean          0.617
Cont-stem      uni+bi+trigrams of stemmed content words      tfidf            0.613
GI-lex         words in GI                                   Boolean          0.596
Affect-lex     words in expanded ANEW and WordNet-Affect     tfidf            0.594
FW             uni+bi+trigrams of function words             Boolean          0.594
GI             GI psychological features                     tfidf            0.586
POS            uni+bi+trigrams of part-of-speech             Boolean          0.579
ANEW           scores in expanded ANEW                       -                0.545
TextStyle      text stylistic features                       -                0.529

6.4.2 Best Combined Lyric Feature Types

The best individual feature types (shown in Table 6.4 excluding “Cont-stem”) were concatenated with one another, resulting in 255 combined feature types. Because value ranges of the feature types varied a great deal (e.g., some are counts, others are normalized weights, etc.), all feature values were normalized to the interval of [0, 1] prior to concatenation. Table 6.5

shows the best combined feature sets among which there was no significant difference (at p <

0.05).

The best performing feature combination was Content + FW + GI + ANEW + Affect-lex +

TextStyle which achieved an accuracy 2.1% higher than the best individual feature type, Content

(0.638 vs. 0.617). All of the best performing lyric feature type concatenations listed in Table 6.5 contain certain linguistic features and text stylistic features (“TextStyle”), although TextStyle performed the worst among all individual feature types (as shown in Table 6.4). This indicates that TextStyle must have captured very different characteristics of the data than the other feature types and thus could be complementary to them. The top three feature combinations also contain ANEW scores, and ANEW scores alone were also significantly worse than the other individual feature types (at p < 0.05). It is interesting to see that the two poorest performing feature types scored second best (with no statistically significant difference from the best) when combined with each other. In addition, the ANEW and TextStyle feature types are the only two of the eight individual feature types that do not conform to the bag-of-words framework.

Table 6.5 Best performing concatenated lyric feature types

Type                                               No. of dimensions   Accuracy
Content+FW+GI+ANEW+Affect-lex+TextStyle            107,360             0.638
ANEW+TextStyle                                     37                  0.637
Content+FW+GI+GI-lex+ANEW+Affect-lex+TextStyle     115,675             0.637
Content+FW+GI+GI-lex+TextStyle                     107,907             0.636
Content+FW+GI+Affect-lex+TextStyle                 107,348             0.636
Content+FW+GI+TextStyle                            99,592              0.635

Except for the combination of ANEW and TextStyle, all of the other top performing feature combinations shown in Table 6.5 are concatenations of four or more feature types, and thus have very high dimensionality. In contrast, ANEW+TextStyle has only 37 dimensions, which makes it far more efficient than the others. On the other hand, high dimensionality provides room for feature selection and reduction. Indeed, a previous study by the author (Hu et al., 2009a) applied three feature selection methods (i.e., F-Score, SVM score and language model comparisons) to basic unigram lyric features and showed improved performances. Investigating feature selection and reduction for high-dimensional feature combinations is a direction for future research.

Except for ANEW+TextStyle, all other top performing feature concatenations contain the combination of “Content,” “FW,” “GI,” and “TextStyle.” The relative importance of the four individual feature types can be revealed by comparing the combinations of any three of the four types. As shown in Table 6.6, the combination of FW + GI + TextStyle performed the worst.

Together with the fact that Content performed the best among all individual feature types, it is safe to conclude that content words are still very important in the task of lyric mood classification.

Table 6.6 Performance comparison of “Content,” “FW,” “GI,” and “TextStyle”

Type                      Accuracy
Content+FW+TextStyle      0.632
Content+FW+GI             0.631
Content+GI+TextStyle      0.624
FW+GI+TextStyle           0.619

6.4.3 Analysis of Text Stylistic Features

As shown in the results, TextStyle is a very interesting feature type. It captures characteristics of the lyrics that are very different from those captured by the other feature types; that is, TextStyle is largely orthogonal to them. This subsection takes a closer look at TextStyle to determine the most important features within this type.

The specific features in TextStyle are listed in Table 6.2. The interjection words and punctuation marks were selected using a series of experiments. Classification performances using varied numbers of top-ranked features are compared in Table 6.7, with the row of best performances marked in bold.

Table 6.7 Feature selection for TextStyle

Number of features                    Accuracy
TextStats   I&P    Total              TextStyle   ANEW+TextStyle
17          0      17                 0.524       0.634
17          8      25                 0.529       0.637
17          15     32                 0.526       0.632
17          25     42                 0.514       0.631
17          45     62                 0.514       0.628
17          75     92                 0.513       0.615
17          134    151                0.513       0.612

Initially all common punctuation marks and interjection words 16 were considered. There are

134 of them. Then the interjection words and punctuation marks (denoted as “I&P” in Table 6.7) were ranked according to their SVM weights (see below), and the n most important ones were

16 The list of English interjection words was based on the one obtained from http://www.english-grammar- revolution.com/list-of-interjections.html

selected. The 17 text statistic features defined in Table 6.2 are denoted as “TextStats” in Table

6.7. These statistics were kept unchanged in this experiment because their 17 dimensions were already compact compared to the 134 interjection words and punctuation marks. Since SVM is used as the classifier, and a previous study (Yu, 2008) suggested that feature selection using SVM ranking worked best for SVM classifiers, the punctuation marks and interjection words were ranked according to the feature weights calculated by the SVM classifier. Like all experiments in this research, the results were averaged across a 10-fold cross validation, and feature ranking and selection were performed using only the training data in each fold. The results in Table 6.7 show that many of the interjection words and punctuation marks are indeed redundant. This is how the 25 TextStyle features in Table 6.2 were determined.
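A sketch of this SVM-weight ranking step is shown below, with scikit-learn's LinearSVC standing in for the linear-kernel SVM used in this research; `X_train`, `y_train` and `feature_names` are hypothetical inputs restricted to the candidate interjection-word and punctuation features.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_features_by_svm_weight(X_train, y_train, feature_names):
    """Rank features by the magnitude of their weights in a linear SVM."""
    svm = LinearSVC(C=1.0).fit(X_train, y_train)
    weights = np.abs(svm.coef_).ravel()      # |weight| as importance in a linear model
    order = np.argsort(weights)[::-1]        # descending importance
    return [feature_names[i] for i in order]
```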

To provide a sense of how the top features are distributed across the positive and negative samples of the categories, the distributions of each of the 25 TextStyle features (six interjection words, two special punctuation marks and 17 text statistics) were plotted. Figure 6.1, Figure 6.2 and

Figure 6.3 illustrate the distributions of three sample features: “!,” “hey,” and

“numberOfWordsPerMinute.”

In these figures, the categories are in descending order of the number of songs in each category. As can be seen in the figures, the positive and negative bars for each category generally have uneven heights. The greater the differences, the more distinguishing power the feature would have for that category.


Figure 6.1 Distributions of “!” across categories

Figure 6.2 Distributions of “hey” across categories


Figure 6.3 Distributions of “numberOfWordsPerMinute” across categories

6.5 SUMMARY

This chapter described and evaluated a number of lyric text features in the task of music mood classification, including the basic, commonly used bag-of-words features, features based on psycholinguistic resources, and text stylistic features. The experiments on the large ground truth dataset revealed that content words were still the most useful individual feature type, while the most useful lyric features were a combination of content words, function words, General

Inquirer psychological features, ANEW scores, affect-related words, and text stylistic features. A surprising finding was that the combination of ANEW scores and text stylistic features, with only 37 dimensions, achieved the second best performance among all feature types and combinations (compared to 107,360 in the top performing lyric feature combination). As text

stylistic features appeared to be a very interesting feature type, they were analyzed in detail, and the distributions of exemplar stylistic features across mood categories were also presented.

CHAPTER 7: HYBRID SYSTEMS AND SINGLE-SOURCE SYSTEMS

This chapter presents the experiments and results that aim to answer research question 3: whether there are significant differences between lyric-based and audio-based systems in music mood classification, given that both systems use the Support Vector Machines (SVM) classification model, and research question 4: whether systems combining audio and lyrics are significantly better than audio-only or lyric-only systems. First, the selected audio-based system is introduced, and then the two hybrid methods are described and compared. Second, system performances are compared to answer the research questions. Finally, the lyric-only and audio-only systems are compared on individual mood categories, and the top lyric features of various types are examined.

7.1 AUDIO FEATURES AND CLASSIFIER

To answer research question 3, whether there are significant differences between a lyric- based and an audio-based system, a system using the best performing lyric feature sets determined in research question 2 is compared to a leading audio-based classification system evaluated in the AMC task of MIREX 2007 and 2008: Marsyas (Tzanetakis, 2007). Because

Marsyas was the top-ranked system in AMC, its performance sets a difficult baseline against which comparisons must be made.

Marsyas used 63 spectral features: means and variances of Spectral Centroid, Rolloff, Flux,

Mel-Frequency Cepstral Coefficients (MFCC), etc. These features are musical surface features based on the signal spectrum and were described in Section 2.3.1. The Marsyas system used

89 Support Vector Machines (SVM) as its classification model. Specifically, it integrated the

LIBSVM (Chang & Lin, 2001) implementation with a linear kernel to build the classifiers.
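The sketch below is not the Marsyas extractor and does not reproduce its exact 63-dimensional feature set; it only illustrates, with the librosa library, the kind of spectral summary features described above (means and variances of spectral centroid, rolloff, a simple spectral flux, and MFCCs).

```python
import numpy as np
import librosa

def spectral_summary(path):
    """Rough spectral summary of an audio clip (illustrative, not Marsyas)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]
    S = np.abs(librosa.stft(y))
    flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))   # simple spectral flux
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    parts = [centroid, rolloff, flux] + list(mfcc)
    # Mean and variance of each frame-level feature over the clip.
    return np.array([f(p) for p in parts for f in (np.mean, np.var)])
```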

7.2 HYBRID METHODS

Hybrid methods can be used to flexibly integrate heterogeneous data sources to improve classification performance, and they work best when the sources are sufficiently diverse and thus can possibly make up for each other's mistakes. Previous work in music classification has used such hybrid sources as audio and social tags, audio and lyrics, audio and symbolic representation of scores, etc.

7.2.1 Two Hybrid Methods

Previous work in music classification has used two popular hybrid methods to combine multiple information sources. The most straightforward hybrid method is feature concatenation where two feature sets are concatenated and the classification algorithms run on the combined feature vectors. The other method is often called “late fusion” which is to combine the outputs of individual classifiers based on different sources, either by (weighted) averaging or by multiplying.

According to Tax, van Breukelen, Duin and Kittler (2000), in the case of combining two classifiers for binary classification as in this research, the two late fusion variations, averaging and multiplying, are essentially the same. The following is a formal proof of this assertion.

Lemma : For combining the outputs of two classifiers (lyric-based and audio-based) for binary classification, the rules of multiplying and averaging are equivalent.

Proof: Let p_lyrics and p_audio denote the posterior probabilities of a data sample being estimated as positive by the lyric-based and audio-based classifiers, and let (1 − p_lyrics) and (1 − p_audio) denote the probabilities of the sample being estimated as negative by the two classifiers respectively.

The multiplying rule says:

if p_lyrics × p_audio > (1 − p_lyrics) × (1 − p_audio)        (2)

then the hybrid classifier would predict positive; otherwise, it would predict negative. In the case of binary classification, (2) can be rewritten as:

p_lyrics × p_audio > (1 − p_lyrics)(1 − p_audio) = 1 − p_lyrics − p_audio + p_lyrics × p_audio
⇒ p_lyrics + p_audio > 1
⇒ (p_lyrics + p_audio) / 2 > 0.5

This is, in fact, the rule of averaging. ∎

Therefore, this research uses weighted averaging as the rule of late fusion. For each testing instance, the final estimation probability is calculated as:

p_hybrid = α × p_lyrics + (1 − α) × p_audio        (3)

where α is the weight given to the posterior probability estimated by the lyric-based classifier. A song is classified as positive when the hybrid posterior probability is larger than or equal to 0.5. In this experiment, the value of α was varied from 0.1 to 0.9 with an increment step of 0.1. The α value resulting in the best performing system was used to build the late fusion system, which was then compared to the feature concatenation hybrid system, as well as the systems based on lyrics only and audio only.
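A minimal sketch of this late fusion procedure, following Equation (3), is given below; `p_lyrics`, `p_audio` and `y_true` are hypothetical NumPy arrays holding the two classifiers' posterior probabilities and the true labels for the test songs.

```python
import numpy as np

def late_fusion(p_lyrics, p_audio, alpha):
    """Equation (3): weighted average of posteriors, positive if >= 0.5."""
    p_hybrid = alpha * p_lyrics + (1.0 - alpha) * p_audio
    return (p_hybrid >= 0.5).astype(int)

def best_alpha(p_lyrics, p_audio, y_true):
    """Sweep alpha from 0.1 to 0.9 in steps of 0.1 and keep the most accurate."""
    scores = {}
    for alpha in np.arange(0.1, 1.0, 0.1):
        pred = late_fusion(p_lyrics, p_audio, alpha)
        scores[round(alpha, 1)] = (pred == y_true).mean()   # accuracy
    return max(scores, key=scores.get), scores
```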

7.2.2 Best Hybrid Method

Since the best lyric feature set was Content + FW + GI + ANEW + Affect-lex + TextStyle (denoted as “BEST” hereafter) and the second best feature set, ANEW + TextStyle, was very interesting, each of the two lyric feature sets was combined with the audio-based system described in Section 7.1. For the late fusion method, an experiment was conducted to determine the α value that led to the best performance. The results are shown in Figure 7.1.

Figure 7.1 Effect of α value in late fusion on average accuracy

The highest average accuracy was achieved when α = 0.5 for both lyric feature sets, that is when the lyric-based and audio-based classifiers got equal weights. This indicates both lyrics and audio are equally important for maximizing the advantage of the late fusion hybrid systems.

Table 7.1 presents the average accuracies of systems using the two hybrid methods, as well as the result of a statistical test on system performances. The method of late fusion outperformed

feature concatenation by 3% for both lyric feature sets, but the difference was not significant (at p < 0.05).

Table 7.1 Comparisons on accuracies of two hybrid methods

Feature set        Feature concatenation   Late fusion   p value
BEST               0.645                   0.675         0.327
ANEW + TextStyle   0.629                   0.659         0.569

7.3 LYRICS VS. AUDIO VS. HYBRID SYSTEMS

Performances of the two hybrid systems, lyric-only and audio-only systems were compared.

Figure 7.2 illustrates the box plots of the accuracies of the four systems using the BEST lyric feature set, with mean accuracies across categories labeled beside each box plot.

Figure 7.2 Box plots of system accuracies for the BEST lyric feature set

According to the average accuracies, both hybrid systems outperformed single-source-based systems. The box plots also show that the late fusion system had the least performance variance across categories among the four systems and thus was the most stable system. On the other hand, the hybrid system using feature concatenation seemed the least stable.

Table 7.2 presents the average accuracies of these four systems. It shows that the hybrid system with late fusion improved accuracy over the audio-only system by 9.6% and 8% for the top two lyric feature sets respectively. It can also be seen from Table 7.2 that feature concatenation was not a good method for combining the ANEW + TextStyle lyric feature set with audio, as the hybrid system using this method performed worse than the lyric-only system (0.629 vs. 0.637).

Table 7.2 Accuracies of single-source-based and hybrid systems

Feature set       Audio-only   Lyric-only   Feature concatenation   Late fusion
BEST              0.579        0.638        0.645                   0.675
ANEW+TextStyle    0.579        0.637        0.629                   0.659

The raw difference of 5.9% between the performances of the lyric-only system and the audio-only system is noteworthy (Table 7.2). The findings of other researchers (e.g., Laurier et al., 2008; Mayer et al., 2008; Yang et al., 2008; Logan et al., 2004) have never shown lyric-only systems to outperform audio-only systems in terms of averaged accuracy across all categories.

The author surmises that this difference could be due to the new lyric features applied in this study. However, according to Table 7.3, which lists the results of pair-wise statistical tests on system performances for the top two lyric feature sets, the performance difference between the lyric-only and audio-only systems was just shy of being accepted as significant (p = 0.054 for the

BEST feature set), and thus more work is needed in the future before this claim could be

conclusively made. Therefore, the answer to research question 3 would be: the lyric-based system outperformed the audio-based system in terms of average accuracy across all 18 mood categories in the experiment dataset, but their difference was just shy of being accepted as statistically significant.

Table 7.3 Statistical tests on pair-wise system performances

Feature set       Better system                     Worse system                       p value
BEST              Hybrid (late fusion)              Audio-only                         0.001
BEST              Hybrid (feature concatenation)    Audio-only                         0.027
BEST              Lyric-only                        Audio-only                         0.054
BEST              Hybrid (late fusion)              Lyric-only                         0.110
BEST              Hybrid (feature concatenation)    Lyric-only                         0.517
BEST              Hybrid (late fusion)              Hybrid (feature concatenation)     0.327
ANEW+TextStyle    Hybrid (late fusion)              Audio-only                         0.004
ANEW+TextStyle    Hybrid (feature concatenation)    Audio-only                         0.045
ANEW+TextStyle    Lyric-only                        Audio-only                         0.074
ANEW+TextStyle    Hybrid (late fusion)              Lyric-only                         0.217
ANEW+TextStyle    Hybrid (late fusion)              Hybrid (feature concatenation)     0.569
ANEW+TextStyle    Lyric-only                        Hybrid (feature concatenation)     0.681

The statistical tests presented in Table 7.3 also show that both hybrid systems, using late fusion and feature concatenation, were significantly better than the audio-only system at p < 0.05.

In particular, the hybrid systems with late fusion improved accuracy over the audio-only system by 9.6% and 8% for the top two lyric feature sets respectively (Table 7.2). These demonstrate the usefulness of lyrics in complementing music audio in the task of mood classification. However, the differences between the hybrid systems and the lyric-only system were not statistically significant ( p = 0.11 and 0.217 for the late fusion system and p = 0.517 and 0.681 for the feature concatenation system). Therefore, the answer to research question 4 is: systems combining lyrics and audio outperformed systems based on either lyric-only or audio-only in terms of average accuracy across all 18 mood categories in the experiment dataset. The difference between hybrid

systems and the audio-only system was statistically significant, but the difference between hybrid systems and the lyric-only system was not statistically significant.

Figure 7.3 shows the system accuracies across individual mood categories for the BEST lyric feature set where the categories are in descending order of the number of songs in each category.

Figure 7.3 System accuracies across individual categories for the BEST lyric feature set

Figure 7.3 reveals that system performances become more erratic and unstable after the category “cheerful.” Those categories to the right of “cheerful” have fewer than 142 positive examples. This suggests that the systems are vulnerable to the data scarcity problem. For some of the smaller categories, system performances were even lower than baseline performance (50% for binary classification). This is a somewhat expected result as the lengths of the feature vectors

far exceed the number of training instances. Therefore, it is difficult to make broad generalizations about these extremely sparsely represented mood categories.

Another way to compare the performances is to consider only the larger mood categories with more stable performances. Statistical tests on the performances of these four systems on the nine largest categories, from “calm” to “dreamy,” show that the late fusion and feature concatenation hybrid systems significantly outperformed the audio-only system at p = 0.002 and p = 0.009 respectively. In addition, the late fusion hybrid system was also significantly better than the lyric-only system at p = 0.047. There was no other statistically significant difference among the systems.

7.4 LYRICS VS. AUDIO ON INDIVIDUAL CATEGORIES

Figure 7.3 also shows that lyrics and audio seem to have different advantages across individual mood categories. Based on the system performances, this section investigates the following two questions: 1) For which moods is audio more useful and for which moods are lyrics more useful? and 2) How do lyric features associate with different mood categories?

Answers to these questions can help shed light on a profoundly important music perception question: How does the interaction of sound and text establish a music mood?

Table 7.4 shows the accuracies of audio and lyric feature types on individual mood categories. Each of the accuracy values was averaged across a 10-fold cross validation. For each lyric feature set, the categories where its accuracies are significantly higher than that of the audio feature set are marked as bold (at p < 0.05). Similarly, for the audio feature set, bold accuracies are those significantly higher than all lyric features (at p < 0.05).

Table 7.4 Accuracies of lyric and audio feature types for individual categories

Category     Content  GI      GI-lex  ANEW    Affect-lex  TextStyle  Audio
calm         0.5905   0.5851  0.5804  0.5563  0.5708      0.5039     0.6574
sad          0.6655   0.6218  0.6010  0.5441  0.5836      0.5153     0.6749
glad         0.5627   0.5547  0.5600  0.5635  0.5508      0.5380     0.5882
romantic     0.6866   0.6228  0.6721  0.6027  0.6333      0.5153     0.6188
gleeful      0.5864   0.5763  0.5405  0.5103  0.5443      0.5670     0.6253
gloomy       0.6157   0.5710  0.6124  0.5520  0.5859      0.5468     0.6178
angry        0.7047   0.6362  0.6497  0.6363  0.6849      0.4924     0.5905
mournful     0.6670   0.6344  0.5871  0.6058  0.6615      0.5001     0.6278
dreamy       0.6143   0.5686  0.6264  0.5183  0.6269      0.5645     0.6681
cheerful     0.6226   0.5633  0.5707  0.5955  0.5171      0.5105     0.5133
brooding     0.5261   0.5295  0.5739  0.4985  0.5383      0.5045     0.6019
aggressive   0.7966   0.7178  0.7549  0.6432  0.6746      0.5345     0.6417
anxious      0.6125   0.5375  0.5750  0.5687  0.5875      0.4875     0.4875
confident    0.3917   0.4429  0.4774  0.4190  0.5548      0.5083     0.5417
hopeful      0.5700   0.4975  0.6025  0.5125  0.6350      0.5375     0.4000
earnest      0.6125   0.6500  0.5500  0.6250  0.6000      0.6375     0.5750
cynical      0.7000   0.6792  0.6375  0.4625  0.6667      0.5250     0.6292
exciting     0.5833   0.5500  0.5833  0.4000  0.4667      0.5333     0.3667
AVERAGE      0.6172   0.5855  0.5975  0.5452  0.5935      0.5290     0.5792

The accuracies marked in bold in Table 7.4 demonstrate that lyrics and audio indeed have their respective advantages in different mood categories. Audio features significantly outperformed all lyric feature types in only one mood category: “calm.” However, lyric features achieved significantly better performances than audio in seven divergent categories: “romantic,”

“angry,” “cheerful,” “aggressive,” “anxious,” “hopeful,” and “exciting.”

The rest of the section presents and analyzes the most influential features of those lyric feature types that outperformed audio features in the seven aforementioned mood categories.

Since the classification model used in this research was SVM with a linear kernel, the features were ranked by the same SVM models trained in the classification experiments.

7.4.1 Top Features in Content Word N-Grams

There are six categories where Content n-gram features significantly outperformed audio features. Table 7.5 lists the top-ranked content word features in these categories. Note how

“love” seems an eternal topic of music regardless of the mood category! Highly ranked content words seem to have intuitively meaningful connections to the categories, such as “with you” in

“romantic” songs, “happy” in “cheerful” songs, and “dreams” in “hopeful” songs. The categories, “angry,” “aggressive,” and “anxious” share quite a few top-ranked terms highlighting their emotional similarities. It is interesting to note that these last three categories sit closely in the same top-left quadrant in Figure 5.2.

Table 7.5 Top-ranked content word features for categories where content words significantly outperformed audio

romantic cheerful hopeful angry aggressive anxious with you i love you ll baby fuck hey on me night strong i am dead to you with your ve got i get shit i am change crazy happy loving scream girl left come on for you dreams to you man fuck i said new i ll run kill i know burn care if you shut baby dead hate for me to be i can love and if kiss living god control hurt wait let me rest lonely don t know but you waiting hold and now friend dead fear need to die all around dream love don t i don t why you heaven in the eye hell i m i ll met coming fighting lost listen tonight she says want hurt you i ve never again and i want you ve got kill hate but you love more than waiting if you want have you my heart give me the sun i love oh baby love you hurt cry you like you best you re my yeah yeah night

7.4.2 Top-Ranked Features Based on General Inquirer

Table 7.6 lists the top GI features for “aggressive”, the only category where the GI set of 182 psychological features significantly outperformed audio. Table 7.7 presents top GI word features in the four categories where “GI-lex” features significantly outperformed audio features.

Table 7.6 Top GI features for “aggressive” mood category

GI Feature: Example Words
Words connoting the physical aspects of well being, including its absence: blood, dead, drunk, fever, pain, sick, tired
Words referring to the perceptual process of recognizing or identifying something by means of the senses: dazzle, fantasy, hear, look, make, tell, view
Action words: hit, kick, drag, upset
Words indicating time: noon, night, midnight
Words referring to all human collectivities: people, gang, party
Words related to a loss in a state of well being, including being upset: burn, die, hurt, mad

Table 7.7 Top-ranked GI-lex features for categories where GI-lex significantly outperformed audio

romantic aggressive hopeful exciting paradise baby i’m come existence fuck been now hit let would see hate am what up sympathy hurt do will jealous girl in tear kill be lonely bounce young another saw to destiny need like him found kill strong better anywhere can there shake soul but run everything swear just will us divine because found gonna across man when her clue one come free rascal dead lose me tale alone think more crazy why mine keep

It is somewhat surprising that the psychological feature indicating “hostile attitude or aggressiveness” (e.g., “devil,” “hate,” “kill”) was ranked at 134 among the 182 features.

Although such individual words ranked high as content word features, the GI features were aggregations of certain kinds of words. By looking at rankings on specific words in General

Inquirer, one can have a clearer understanding about which GI words were important.

7.4.3 Top Features Based on ANEW and WordNet

According to Table 7.4, “ANEW” features significantly outperformed audio features on one category, “hopeful,” while “Affect-lex” features worked significantly better than audio features on categories “angry” and “hopeful.” Table 7.8 presents top-ranked features.

Table 7.8 Top ANEW and Affect-lex features for categories where ANEW or Affect-lex significantly outperformed audio

ANEW Affect-lex hopeful angry hopeful Average Valence score one wonderful Standard deviation (std) of Arousal scores baby sun Std of Dominance scores surprise loving Std of Valence scores care read Std of Arousal std death smile Std of Dominance std alive better Std of Valence std heart Average Dominance std happiness lonely Average Arousal score hurt friend Average Valence std straight free Average Dominance score thrill found Average Arousal std cute strong suicide grow babe safe frightened god motherfucker girl down memory misery happy mad dream

Again, these top-ranked features seem to have strong semantic connections to the categories, and they share common words with the top-ranked features listed in Table 7.5 and Table 7.7.

Although both Affect-lex and GI-lex are domain-oriented lexicons built from psycholinguistic resources, they contain different words, and thus each of them identified some novel features that are not shared by the other. The category “hopeful” is positioned at the center of Figure 5.2, with small values in both the valence and arousal dimensions, and thus it is not surprising that the top

ANEW features for “hopeful” involve both valence and arousal scores.

7.4.4 Top Text Stylistic Features

Text stylistic features performed the worst among all individual lyric feature types considered in this research (Table 6.4). In fact, the average accuracy of text stylistic features was significantly worse than each of the other feature types ( p < 0.05). However, text stylistic features did outperform audio features in two categories: “hopeful” and “exciting.” Table 7.9 shows the top-ranked stylistic features (defined in Table 6.2) in these two categories.

Table 7.9 Top-ranked text stylistic features for categories where text stylistics significantly outperformed audio

hopeful                    exciting
stdLineLength              uniqWordsPerLine
uniqWordsPerLine           avgRepeatWordRatioPerLine
avgWordLength              stdLineLength
repeatLineRatio            repeatWordRatio
avgLineLength              repeatLineRatio
repeatWordRatio            avgLineLength
numberOfUniqLines          numberOfBlankLines

Note how the top-ranked features in Table 7.9 are all text statistics without interjection words or punctuation marks. Also noteworthy is that these two categories both have relatively low positive valence values (but opposite arousal) as shown in Figure 5.2.

7.4.5 Top Lyric Features in “Calm”

“Calm,” which sits in the bottom-left quadrant and has the lowest arousal of any category in

Figure 5.2, is the only mood category where audio features were significantly better than all lyric feature types. It is useful to compare the top lyric features in this category to those in categories where lyric features outperformed audio features. Top-ranked words and stylistics from various lyric feature types in “calm” are shown in Table 7.10.

Table 7.10 Top lyric features in “calm” category

Content GI-lex Affect-lex ANEW Stylistic you all look float list Std of Dominance std stdRepeatWordRatioPer all look eager moral Average Arousal std Line all look at irish saviour Average Dominance score repeatWordRatio you all i appreciate satan Std of Dominance scores avgRepeatWordRatioPer burning collar Average Valence std Line that is selfish pup Std of Valence std repeatLineRatio you d convince splash Std of Arousal std interjection word: “hey” control foolish clams Std of Arousal scores uniqWordsPerLine boy island blooming Average Arousal score numberOfLinesPerMin that s curious nimble Average Dominance std blankLineRatio all i thursday disgusting Average Valence score interjection word: “ooh” believe in pie introduce Std of Valence scores avgLineLength be free melt amazing interjection word: “ah” speak couple arrangement punctuation mark: “!” blind team mercifully interjection word: “yo” beautiful doorway soaked the sea lowly abide

As Table 7.10 indicates, top-ranked lyric words from the content words, GI-lex and Affect-lex feature types do not present much in the way of obvious semantic connections with the category “calm” (e.g., “satan”). Category “calm” has the lowest arousal value among all categories shown in Figure 5.2, but the top ANEW features in Table 7.10 include more valence and dominance scores. This may be the reason that ANEW features performed badly on “calm.”

However, some might argue that word repetition can have a calming effect, and if this is the case, then the text stylistic features do appear to be picking up on the notion of repetition as a mechanism for instilling calmness or serenity.

7.5 SUMMARY

This chapter began by introducing the audio-only system and two approaches to combining lyrics and music audio. Performances of lyric-only, audio-only and hybrid systems were then compared in terms of average accuracies across all mood categories. The experiments indicated that late fusion (linear interpolation with equal weights given to both classifiers) yielded better results than feature concatenation. Among all systems, the hybrid system using late fusion achieved the best performance and outperformed the audio-only system by 9.6%. Both hybrid systems, using late fusion and feature concatenation, were significantly better than the audio-based system (at p < 0.05), but the difference between the lyric-only and audio-only systems fell just short of statistical significance (p = 0.054). Similar patterns were observed when only the largest half of the categories was considered.
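As a sketch of the late fusion scheme summarized above (linear interpolation of the two classifiers’ outputs with equal weights), assuming each base classifier emits a positive-class probability per song; the numbers below are hypothetical:

```python
import numpy as np

def late_fusion(p_lyrics: np.ndarray, p_audio: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Linearly interpolate per-song positive-class probabilities from the
    lyric-based and audio-based classifiers; alpha = 0.5 gives equal weights."""
    return alpha * p_lyrics + (1 - alpha) * p_audio

# Hypothetical probabilities for three songs in one binary mood category.
p_lyrics = np.array([0.80, 0.40, 0.55])
p_audio = np.array([0.60, 0.70, 0.30])
labels = (late_fusion(p_lyrics, p_audio) >= 0.5).astype(int)
print(labels)  # [1 1 0]
```

Feature concatenation, by contrast, merges the lyric and audio feature vectors into one input before a single classifier is trained, so the fusion happens before rather than after classification.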

The chapter then examined in depth those feature types that showed statistically significant improvements in correctly classifying individual mood categories. Among the 18 mood categories, certain lyric feature types significantly outperformed audio on seven divergent categories, while audio outperformed all lyric-based features on only one category (p < 0.05). For those seven categories where lyrics performed better than audio, the top-ranked words show strong and obvious semantic connections to the categories. In two cases, simple text stylistics provided significant advantages over audio. In the one case where audio outperformed lyrics, no obvious semantic connections between terms and the category could be discerned.

CHAPTER 8: LEARNING CURVES AND AUDIO LENGTH

This chapter presents the experiments and results that answer research question 5: whether combining lyrics and audio can help reduce the amount of training data needed for effective classification, in terms of the number of training examples and audio length.

8.1 LEARNING CURVES

Part of research question 5 is to find out whether lyrics can help reduce the number of training instances required to achieve certain performance levels. To answer this question, this research examined the learning curves of the single-source-based systems and the hybrid system using late fusion. In this experiment, in each fold of the 10-fold cross validation, the testing examples were kept unchanged, while the training data size varied from 10% to 100% of all available training samples, in increments of 10%. The accuracies averaged across all categories were then used to draw the learning curves, which are presented in Figure 8.1.

Figure 8.1 Learning curves of hybrid and single-source-based systems
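A minimal sketch of the learning-curve protocol described above, written against scikit-learn (an assumed toolkit rather than the exact pipeline used in this research): within each cross-validation fold the test set is fixed while the training set is subsampled from 10% to 100%.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def learning_curve_accuracies(X, y, fractions=np.arange(0.1, 1.01, 0.1), seed=0):
    """Average accuracy per training fraction over 10-fold cross validation."""
    rng = np.random.default_rng(seed)
    accs = {round(f, 1): [] for f in fractions}
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        for f in fractions:
            # Subsample the training portion; the test portion stays unchanged.
            # (In practice the subsample should be stratified so that both
            # classes remain represented.)
            sub = rng.choice(train_idx, size=int(f * len(train_idx)), replace=False)
            clf = SVC(kernel="linear").fit(X[sub], y[sub])
            accs[round(f, 1)].append(clf.score(X[test_idx], y[test_idx]))
    return {f: float(np.mean(a)) for f, a in accs.items()}
```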

Figure 8.1 shows a general trend that all system performances increased with more training data, but the performance of the audio-based system increased much more slowly than that of the other systems. With 20% of the training samples, the accuracies of the hybrid and the lyric-only systems were already better than the highest accuracy the audio-only system achieved with any amount of training data. To achieve similar accuracy, the hybrid system needed about 20% fewer training examples than the lyric-only system. This validates the hypothesis that combining lyrics and audio can reduce the number of training examples needed to achieve certain classification performance levels. In addition, the learning curve of the audio-only system levels off (i.e., stops increasing) at 80% of the training sample size, while the curves of the other two systems never level off. This indicates that the hybrid and lyric-only systems may further improve their performances if given more training examples. It is also worth noting that the performances of the lyric-only and audio-only systems drop at the points of 40% and 70% training examples respectively. This observation seems to contradict the general trend that performance increases with the amount of available training data. However, the performance differences between these points and their neighboring points are not statistically significant (at p < 0.05), and thus these drops can be regarded as random effects that do not form a counterexample to the general trend of the learning curves.

8.2 AUDIO LENGTHS

The second part of research question 5 concerns the effect of audio length on classification performance, and whether incorporating lyrics can reduce the length of audio data required to achieve certain performance levels. This research compares the performances of the audio-based, lyric-based and late fusion hybrid systems on datasets with audio clips of various lengths extracted from the song tracks. Almost all of the audio clips were extracted from the middle of the songs, as the middle part is deemed the most representative of the whole song (Silla, Kaestner, & Koerich, 2007). For the very few songs whose middle parts contained significant amounts of silence, the audio clips were extracted from the beginning of the tracks. In this experiment, the audio length was set to 5, 10, 15, 30, 45, 60, 90 and 120 seconds, as well as the total length of the track, while the lyric-based system and the hybrid system always used the complete lyrics. The accuracies averaged across all categories were used for comparison. Figure 8.2 shows the results.

Figure 8.2 System accuracies with varied audio lengths
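A sketch of how a fixed-length clip could be cut from the middle of a decoded track, falling back to the beginning when the middle is mostly silent. The audio is assumed to be a mono floating-point sample array; the decoding step and the silence threshold are illustrative assumptions, not the tooling actually used in this research.

```python
import numpy as np

def extract_clip(samples: np.ndarray, sr: int, clip_seconds: int = 30,
                 silence_rms: float = 0.01) -> np.ndarray:
    """Cut a clip_seconds excerpt from the middle of a track; if the middle is
    mostly silence, take the excerpt from the beginning instead."""
    n = int(clip_seconds * sr)
    if len(samples) <= n:
        return samples                    # track shorter than the requested clip
    start = (len(samples) - n) // 2       # centre the clip on the track middle
    clip = samples[start:start + n]
    if np.sqrt(np.mean(clip ** 2)) < silence_rms:   # crude RMS silence check
        clip = samples[:n]
    return clip
```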

The hybrid system outperformed the single-source-based systems consistently. With the shortest audio clips (5 seconds), the hybrid system already performed better than the best performances of the single-source-based systems. Therefore, combining lyrics and audio can reduce the length of audio needed by audio-based systems to achieve better results.

There are other interesting observations as well. For the hybrid system, the best performance was achieved with full audio tracks, but the differences from the performances using shorter audio clips were not significant. The audio-based system, on the other hand, displayed a different pattern: it performed best when the audio length was 60 seconds, and worst when given the entire audio tracks. In fact, more often than not, the beginning and ending parts of a music track can be quite different from the theme of the song, and thus may convey distracting and confusing information. The reason why the hybrid system nevertheless worked well with full audio tracks is left as a topic for future work.

In summary, the answer to research question 5 is positive: combining lyrics and audio can help reduce the number of training examples and audio length required for achieving certain performance levels.

8.3 SUMMARY

This chapter described the experiments and results for answering research question 5: whether combining lyrics and audio helps reduce the amount of training data needed for effective classification. Experiments were conducted to examine the learning curves of the single-source-based systems and the late fusion hybrid system with the best performing lyric feature set discovered in Chapter 6. The results showed that complementing audio with lyrics could reduce the number of training samples required to achieve the same or better performance than the single-source-based systems.

Another set of experiments was conducted to examine how the length of audio clips would affect the performances of the audio-based system and the late fusion hybrid system. The results showed that combining lyrics with audio could reduce the demand on the length of audio data while still improving classification performance.

These findings can help improve the effectiveness and efficiency of music mood classification systems and thus pave the way to making mood a practical and affordable access point in music digital libraries and repositories.

CHAPTER 9: CONCLUSIONS AND FUTURE RESEARCH

9.1 CONCLUSIONS

Music mood is a newly emerging metadata type for music. Information scientists and researchers in the MIR community have much to learn from the music psychology literature, from basic terminology to music mood categories. This research reviewed seminal works in the long history of psychological studies of music and mood, and summarized fundamental points of view and their important implications for MIR research.

Social tags are a rich resource for exploring users’ perspectives. As mood categories in music psychology models might lack the social context of today’s music listening environment, this research derived a set of mood categories from social tags using linguistic resources and human expertise. The resultant mood categories were compared to two representative models in music psychology. The results show there is common ground between theoretical models and categories derived from empirical music listening data in real life. While the mood categories identified from social tags could still be partially supported by classic psychological models, they were more comprehensive and more closely connected with the reality of music listening. There are two principal conclusions. First, if handled properly, social tags can be used to identify a set of reasonable mood categories that both are supported by classical theories and reflect the reality of music listening. Second, theoretical models need to be modified to better fit today’s reality. This research exemplifies an approach of using empirical data to refine and adapt theoretical models to better fit the reality of users’ information behaviors.

Information science is an interdisciplinary field. It often involves topics that have traditionally been studied in other fields. Borrowing findings from the literature of other fields is a very important research method in information science, but researchers need to pay attention to connecting theories in the literature to the reality and social context of the problems under investigation.

This research also proposed a method of building ground truth datasets using social tags. The method is efficient and flexible. It does not require recruiting human assessors and thus does not suffer from low cross-assessor consistency, which is exactly the bottleneck in building large ground truth datasets in MIR. The method is flexible in that it can be applied to any music data available to the researcher. To date, the ground truth dataset built in this research is the largest experimental dataset with audio, lyrics and social tags for music mood classification.

This research evaluated a number of lyric text features for the task of music mood classification, including the basic, commonly used bag-of-words features, features based on psycholinguistic lexicons, and text stylistic features. The results revealed that the most useful lyric features were combinations of content words, certain linguistic features, and text stylistic features. A surprising finding was that the combination of ANEW scores and text stylistic features achieved the second best performance among all feature types and combinations (with no significant difference from the best one), using only 37 dimensions (compared to 107,360 dimensions in the top-performing feature set).
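To make the dimensionality contrast concrete, the following sketch concatenates a hypothetical high-dimensional sparse bag-of-words block with a compact 37-dimensional block; the matrices are random placeholders whose shapes merely mirror the figures quoted above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, random as sparse_random

n_songs = 5296                                   # size of the ground truth set
X_bow = sparse_random(n_songs, 107360, density=0.001, format="csr")
X_compact = np.random.rand(n_songs, 37)          # ANEW + text stylistic stand-in

# Feature concatenation simply places the two blocks side by side.
X_concat = hstack([X_bow, csr_matrix(X_compact)], format="csr")
print(X_concat.shape)                            # (5296, 107397)
```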

In terms of averaged performance across categories, the lyric-only system outperformed a leading audio-only system on this task, although the performance difference fell just short of statistical significance. On individual categories, the two information sources show different strengths. Lyric-based systems seem to have an advantage on categories where the words in the lyrics have a clear connection to the category, such as “angry” and “romantic.” However, future work is needed to make a conclusive claim.

In combining lyrics and music audio, late fusion (linear interpolation with equal weights given to both classifiers) yielded the best performance, and its performance was more stable across mood categories than that of the other hybrid method, feature concatenation. Both hybrid systems significantly outperformed (at p < 0.05) the audio-only system, which was a top-ranked system on this task.

The late fusion system improved the performance of the audio-only system by 9.6%, demonstrating the effectiveness of combining lyrics and audio.

Experiments on learning curves showed that complementing audio with lyrics could reduce the number of training examples required to achieve the same performance level as single-source-based systems. The audio-only system appeared to have reached its potential and stopped improving when given 80% of all training examples. In contrast, the hybrid systems could continue to improve their performance if more training examples became available.

Combining lyrics and audio can also reduce the demand on the length of audio used by the classifier. Very short audio clips (as short as 5 seconds), when combined with complete lyrics, outperformed single-source-based systems using all available audio or lyrics.

In summary, this research identified music mood categories that reflect the reality of the music listening environment, made advances in lyric affect analysis, and improved the effectiveness and efficiency of automatic music mood classification, thus helping make mood a practical metadata type and access point for music repositories.

9.2 LIMITATIONS OF THIS RESEARCH

9.2.1 Music Diversity

It is noteworthy that the dataset used in this research consists of popular vocal music with lyrics in English. As the social tags used in this research were solely provided by last.fm, they are naturally limited by the user population of last.fm. The demographic statistics of last.fm users (http://socialmediastatistics.wikidot.com/lastfm, last updated on 22 Jul 2008) show that most users are in Western countries such as the United States, Britain, Germany and Poland. Also, last.fm users tend to be young and proficient with computers. Over 90% of its European users are 16 to 34 years old, which at least partially explains why the most popular tags on last.fm are “rock” and “pop.” Therefore, the dataset of this research is limited to popular Western vocal music with lyrics in English, and the mood categories and ground truth labels are biased toward young Western listeners. Thus the conclusions of this research are only applicable to this kind of music, and further exploration is needed for music from other cultural backgrounds, languages and groups of users.

9.2.2 Methods and Techniques

Besides the models and techniques adopted in this research, there are other methods that might provide insights from different angles. For example, SVM is a discriminative model, in contrast to generative models, which have also been popular in both text mining (the Naïve Bayes model) and MIR (Gaussian models). In addition, some suboptimal audio features may yield better results when combined with text features. The spectral features used in this research achieved the best performance in MIREX, but other audio features (e.g., rhythm features) are expected to have a close relationship with music mood. Future research may look into those features. Finally, dimension reduction techniques other than multidimensional scaling, such as Latent Semantic Analysis (LSA) and Principal Component Analysis (PCA), can be applied to analyzing mood categories, providing additional views on empirical music listening data.
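As one way such alternatives could be tried, the sketch below swaps the SVM for a generative Naïve Bayes model on the same lyric term counts, using scikit-learn as an assumed toolkit; neither the data nor the classifier settings reflect the systems evaluated in this research.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def compare_classifiers(X_counts, y):
    """10-fold accuracy of a discriminative (SVM) versus a generative
    (Naive Bayes) classifier on the same non-negative term-count features."""
    results = {}
    for name, clf in [("LinearSVC", LinearSVC()), ("MultinomialNB", MultinomialNB())]:
        results[name] = cross_val_score(clf, X_counts, y, cv=10).mean()
    return results
```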

9.3 FUTURE RESEARCH

This research analyzed the general trends and results of improving music mood classification by combining lyrics, audio and social tags. While it answered the formulated research questions, it raised even more questions for future research.

9.3.1 Feature Ranking and Selection

This research discovered that many top-performing lyric feature combinations are of high dimensionality. Feature selection has great potential to further improve performance. In Section 7.4, top features in individual lyric feature types were examined. The author plans to systematically analyze features in combined feature spaces such as GI + TextStyle. In addition, it was observed that no lyric-based feature provided significant improvements in the bottom-left (negative valence, negative arousal) quadrant of Figure 5.2, where audio features performed relatively well (i.e., “calm”). Whether feature selection could improve classification on these categories is worthy of further study.
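A sketch of the kind of feature ranking and selection envisioned here, using a chi-squared filter over a combined feature space; SelectKBest from scikit-learn is an assumed choice, and other rankers (e.g., information gain) would fit the same pattern.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def select_and_classify(X_combined, y, k=1000):
    """Rank features by chi-squared score, keep the top k, then train an SVM.
    X_combined must be non-negative (e.g., term counts plus scaled stylistics)."""
    model = make_pipeline(SelectKBest(chi2, k=k), LinearSVC())
    return model.fit(X_combined, y)
```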

9.3.2 More Classification Models and Audio Features

The interaction of features and classifiers is worthy of further investigation. Using classification models other than SVM (e.g., Naïve Bayes), the top-ranked features might be different from those selected by SVM. In addition, novel and high-level audio features have recently been proposed, such as “danceability” (Laurier et al., 2008), which measures how likely listeners are to dance to the music. Combining lyrics with those new audio features may further improve classification performance.

9.3.3 Enrich Music Mood Theories

This study has identified mood categories from social tags and presented an empirical and real-life case for the reference of music psychologists. In the future, the author will strive to identify more patterns in people’s music listening behaviors from social media data, find connections between these patterns and music psychology theories, and offer suggestions and insights to music psychology research.

An example of this type of research question is the degree of moods in individual music pieces. In this research, membership in each mood category is binary: a song either belongs to a category or not. In reality, songs often exhibit a combination of moods to varying degrees, such as “a mostly calm song with a bit of sadness.” Other related topics include the taxonomy of degrees, the definition of correctness, the differentiation of the two types of errors (i.e., false positives and false negatives), etc. Once discovered, the findings will help enrich and extend theories on music mood.
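One plausible starting point, sketched below, is to keep the per-category classifier probabilities rather than thresholding them into binary labels; the classifiers dictionary and the use of probability-enabled models are assumptions for illustration, not part of the systems built in this research.

```python
def mood_degrees(classifiers, x):
    """Return a soft mood profile for one song: each binary classifier
    contributes a degree in [0, 1] instead of a hard 0/1 label.

    `classifiers` maps category names (e.g., "calm", "sad") to fitted models
    exposing predict_proba, such as sklearn's SVC(probability=True)."""
    return {mood: float(clf.predict_proba([x])[0, 1])
            for mood, clf in classifiers.items()}
```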

9.3.4 Music of Other Types and in Other Cultures

Due to data availability, this study focuses on popular English songs. Other types of music, such as classical music, have long been studied by music psychologists. When data become available, it will be interesting to explore whether the same methods used in this study can be reliably applied to other music types that have both audio and lyric parts (e.g., classical songs).

Music is culturally dependent. Conclusions drawn from music in one culture may be radically different from those in another. This research focused on popular English songs, and thus it would be interesting to extend this research to popular music in other cultures, such as Chinese and Spanish songs. Comparisons of the findings will be instructive for designing cross-cultural music repositories and services.

9.3.5 Other Music-related Social Media Than Social Tags

With the advent of Web 2.0, there is a large and growing amount of user-generated data available online, such as social tags, blogs, microblogs, customer reviews, etc. Such data provide first-hand resources for studying and understanding users in daily life settings. This study only exploits social tags on music materials. In the future, music blogs, playlists, and other music-related information published by users on social media websites can also be exploited.

REFERENCES

Alm, E. C. O. (2008). Affect in Text and Speech . Doctoral Dissertation. University of Illinois at Urbana-Champaign.

Argamon, S., Saric, M., & Stein, S. S. (2003). Style mining of electronic messages for multiple authorship discrimination: first results. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2003 (KDD'03) . pp. 475-480.

Asmus, E. (1995). The development of a multidimensional instrument for the measurement of affective responses to music. Psychology of Music , 13(1): 19-30.

Aucouturier, J-J., & Pachet, F. (2004), Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1 (1). Retrieved from http://ayesha.lti.cs.cmu.edu/ojs/index.php/jnrsas/article/viewFile/2/2 on September 20, 2008.

Aucouturier, J-J., Pachet, F., Roy, P., & Beurivé, A. (2007). Signal + Context = Better Classification. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07). Sep. 2007, Vienna, Austria. Retrieved from http://ismir2007.ismir.net/proceedings/ISMIR2007_p425_aucouturier.pdf on October 1, 2008.

Bainbridge, D., Cunningham, S. J., & Downie, J. S. (2003). How people describe their music information needs: a grounded theory analysis of music queries. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR’03). Baltimore, Maryland, USA, pp. 221-222.

Bischoff, K., Firan, C. S., Nejdl, W., & Paiu, R. (2009a). How do you feel about "Dancing Queen"? Deriving mood and theme annotations from user tags. In Proceedings of Joint Conference on Digital Libraries (JCDL’09) , Austin, USA, pp. 285-294.

Bischoff, K., Firan, C., Paiu, R., Nejdl, W., Laurier, C., & Sordo, M. (2009b). Music mood and theme classification - a hybrid approach. In Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR’09), Kobe, Japan, pp. 657-662.

Borg, I., & Groenen, P. J. F. (2004). Modern Multidimensional Scaling: Theory and Applications . Springer.

Bradley, M. M., & Lang, P. J. (1999). Affective Norms for English Words (ANEW): Stimuli, Instruction Manual and Affective Ratings. (Technical report C-1). Gainesville, FL. The Center for Research in Psychophysiology, University of Florida.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery , 2: 121–167.

Byrd, D., & Crawford, T. (2002). Problems of music information retrieval in the real world. Information Processing and Management, 38(2): 249-272.

Capurso, A., Fisichelli, V. R., Gilman, L., Gutheil, E. A., Wright, J. T., & Paperte, F. (1952). Music and Your Emotions . Liveright Publishing Corporation.

Chang, C., & Lin. C. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm .

Cunningham, S. J., Bainbridge, D., & Falconer, A. (2006). More of an art than a science: supporting the creation of playlists and mixes. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR’06). Victoria, Canada, pp. 240-245.

Cunningham, S. J., Jones, M., & Jones, S. (2004). Organizing digital music for use: an examination of personal music collections. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR’04). Barcelona, Spain, pp. 447-454.

Dhanaraj, R., & Logan, B. (2005). Automatic prediction of hit songs. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR’05) . London, UK, pp. 488-491.

Dietterich, T. G. (2000). Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems , London, UK, pp. 1-15.

Downie, J. S. (2008). The Music Information Retrieval Evaluation eXchange (2005-2007): a window into music information retrieval research. Acoustical Science and Technology, 29(4): 247-255.

Downie, J. S., & Cunningham, S. J. (2002). Toward a theory of music information retrieval queries: system design implications. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR’02). Paris, France, pp. 299-300.

Eck, D., Bertin-Mahieux, T., & Lamere, P. (2007). Autotagging music using supervised machine learning. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07) . Sep. 2007, Vienna, Austria. Retrieved from http://ismir2007.ismir.net/proceedings/ISMIR2007_p367_eck.pdf on September 20, 2008.

Ekman, P. (1982). Emotion in the Human Face . Cambridge University Press, Second ed.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database . MIT Press.

Feng, Y., Zhuang, Y., & Pan, Y. (2003). Popular music retrieval by detecting mood. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval . Jul. 2003, Toronto, Canada, pp. 375-376.

Futrelle, J., & Downie, J. S. (2002). Interdisciplinary communities and research issues in music information retrieval. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR’02). Paris, France, pp. 215-221.

Gabrielsson, A., & Lindström, E. (2001). The influence of musical structure on emotional expression. In P. N. Juslin and J. A. Sloboda (Eds.), Music and Emotion: Theory and Research. New York: Oxford University Press, pp. 223-248.

Geleijnse, G., Schedl, M., & Knees, P. (2007). The quest for ground truth in musical artist tagging in the social web era. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07), Sep. 2007, Vienna, Austria. Retrieved from http://ismir2007.ismir.net/proceedings/ISMIR2007_p525_geleijnse.pdf on September 20, 2008.

Guy, M., & Tonkin, E. (2006). Tidying up tags. D-Lib Magazine . 12(1). Retrieved from http://www.dlib.org/dlib/january06/guy/01guy.html on November 18, 2008.

He, H., Jin, J., Xiong, Y., Chen, B., Sun, W., & Zhao, L. (2008). Language feature mining for music via supervised learning from lyrics. Advances in Computation and Intelligence, Lecture Notes in Computer Science , 5370: 426-435. Springer Berlin / Heidelberg.

Hevner, K. (1936). Experimental studies of the elements of expression in music. American Journal of Psychology , 48: 246-268.

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence . Jul. 1999, Stockholm, Sweden, pp. 289-296.

Hu, X., Bay, M., & Downie, J. S. (2007a). Creating a simplified music mood classification groundtruth set, In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07). Sep. 2007, Vienna, Austria, pp. 309-310.

Hu, X., & Downie, J. S. (2007). Exploring mood metadata: relationships with genre, artist and usage metadata, In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07). Sep. 2007, Vienna, Austria, pp. 462-467.

Hu, X., Downie, J. S., & Ehmann, A. (2007b). Distinguishing editorial and customer critiques of cultural objects using text mining. In Proceedings of the joint conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ACH/ALLC), Jun. 2007, Champaign, IL.

Hu, X., Downie, J. S., & Ehmann, A. (2009a). Lyric text mining in music mood classification, In Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR’09) , Kobe, Japan, pp. 411 – 416.

Hu, X., Downie, J. S., Laurier, C., Bay, M., & Ehmann, A. (2008a). The 2007 MIREX Audio Music Classification task: lessons learned, In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08). Sep. 2008, Philadelphia, USA, pp. 462-467.

Hu, X., Sanghvi, V., Vong, B., On, P. J., Leong, C., & Angelica, J. (2008b). Moody: a web-based music mood classification and recommendation system. Presented at the 9th International Conference on Music Information Retrieval (ISMIR’08). Sep. 2008, Philadelphia, Pennsylvania. Abstract retrieved from http://ismir2008.ismir.net/latebreak/hu.pdf on October 19, 2008.

Hu, X., & Wang, J. (2007). An integrated song database, Course Project Report, CS511: Advanced Database Management Systems (Fall 2007), Computer Science department, University of Illinois at Urbana-Champaign.

Hu, Y., Chen, X., & Yang, D. (2009b). Lyric-based song emotion detection with affective lexicon and fuzzy clustering method. In Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR’09), Kobe, Japan, pp. 123-128.

Huron, D. (2000). Perceptual and cognitive applications in music information retrieval. In Proceedings of the 1st International Conference on Music Information Retrieval (ISMIR’00) , Oct. 2000, Plymouth, Massachusetts.

Juslin, P. N., Karlsson, J., Lindström E., Friberg, A., & Schoonderwaldt, E. (2006). Play it again with feeling: computer feedback in musical communication of emotions. Journal of Experimental Psychology: Applied , 12(1): 79-95.

Juslin, P. N., & Laukka, P. (2004). Expression, perception, and induction of musical emotions: a review and a questionnaire study of everyday listening. Journal of New Music Research , 33(3): 217-238.

Juslin, P. N., & Sloboda, J. A. (2001). Music and emotion: introduction. In P. N. Juslin and J. A. Sloboda (Eds.), Music and Emotion: Theory and Research. New York: Oxford University Press, pp. 3-22.

Kim, Y., Schmidt, E., & Emelle, L. (2008). Moodswings: A collaborative game for music mood label collection. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08) . Philadelphia, USA, pp. 231-236.

Kleedorfer, F., Knees, P., & Pohle, T. (2008). Oh oh oh whoah! Towards automatic topic detection in song lyrics. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08) . Philadelphia, USA, pp. 287-292.

Knees, P., Schedl, M., & Widmer, G. (2005). Multiple lyrics alignment: automatic retrieval of song lyrics. In Proceedings of the 6th International Symposium on Music Information Retrieval (ISMIR’05). London, UK, pp. 564-569.

Lamere, P. (2008). Social tagging and music information retrieval. Journal of New Music Research , 37(2): 101-114.

Laurier, C., Grivolla, J., & Herrera, P. (2008). Multimodal music mood classification using audio and lyrics. In Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA'08). Dec. 2008, San Diego, California, USA, pp. 688-693.

Lee, J. H., & Downie, J. S. (2004). Survey of music information needs, uses, and seeking behaviours: preliminary findings. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR’04), Barcelona, Spain, pp. 441-446.

Levy, M., & Sandler, M. (2007). A semantic space for music derived from social tags. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07). Sep. 2007, Vienna, Austria. Retrieved from http://ismir2007.ismir.net/proceedings/ISMIR2007_p411_levy.pdf on September 18, 2008.

Li, T., & Ogihara, M. (2003). Detecting emotion in music. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR’03) . Baltimore, Maryland, USA, pp. 239-240.

Li, T., & Ogihara, M. (2004). Semi-supervised learning from different information sources. Knowledge and Information Systems , 7(3): 289-309.

Liu, H., Lieberman, H., & Selker, T. (2003). A model of textual affect sensing using real-world knowledge. In Proceedings of the 8th International Conference on Intelligent User Interfaces. New York, NY, USA. pp. 125-132.

Logan, B., Kositsky, A., & Moreno, P. (2004). Semantic analysis of song lyrics. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME'04). Jun. 2004, Taipei, Taiwan, pp. 827- 830.

Lu, L., Liu, D., & Zhang, H. (2006). Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1): 5-18.

Mahedero, J. P. G., Martinez, A., Cano, P., Koppenberger, M., & Gouyon, F. (2005). Natural language processing of lyrics. In Proceedings of the 13th ACM International Conference on Multimedia (MULTIMEDIA’05) . New York, NY, USA, pp. 475- 478.

Mandel, M., & Ellis, D. (2007). A web-based game for collecting music metadata. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07) . Sep. 2007, Vienna, Austria. Retrieved from http://ismir2007.ismir.net/proceedings/ISMIR2007_p365_mandel.pdf on September 18, 2008.

Mandel, M., Poliner, G., & Ellis, D. (2006). Support vector machine active learning for music retrieval. Multimedia Systems , 12 (1): 3-13.

Mayer, R., Neumayer, R., & Rauber, A. (2008). Rhyme and style features for musical genre categorisation by song lyrics. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08). Sep. 2008, Philadelphia, Pennsylvania, USA, pp. 337-342.

McEnnis, D., McKay, C., Fujinaga, I., & Depalle, P. (2005). jAudio: A feature extraction library. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR’05). Sep. 2005, London, UK, pp. 600-603.

McKay, C., & Fujinaga, I. (2008). Combining features extracted from audio, symbolic and cultural sources. In Proceedings of the 9th International Symposium on Music Information Retrieval (ISMIR’08). Sep. 2008, Philadelphia, Pennsylvania, USA, pp. 597-602.

McKay, C., McEnnis, D., & Fujinaga. I. (2006). A large publicly accessible prototype audio database for music research. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR’06). Oct. 2006, Victoria, Canada, pp. 160 – 163.

Mehrabian, A. (1996). Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Current Psychology: Developmental, Learning, Personality, Social , 14: 261-292.

Meyer, L. B. (1956). Emotion and meaning in music. Chicago: University of Chicago Press.

Mihalcea, R., & Liu, H. (2006). A corpus-based approach to finding happiness. In N. Nicolov, F. Salvetti, M. Liberman, and J. H. Martin, (Eds.) , AAAI Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW). AAAI Press, pp. 139-144.

Muller, M., Kurth, F., Damm, D., Fremerey, C., & Clausen, M. (2007). Lyrics-based audio retrieval and multimodal navigation in music collections. In Proceedings of the 11th European Conference on Digital Libraries (ECDL’07), pp. 112-123.

Neumayer, R., & Rauber, A. (2007). Integration of text and audio features for genre classification in music information retrieval. In Proceedings of the 29th European Conference on Information Retrieval (ECIR'07) , Apr. 2007, Rome, Italy, pp.724 – 727.

Nicolov, N., Salvetti, F., Liberman, M., & Martin, J. H. (Eds.) (2006). AAAI Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW ). AAAI Press.

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2): 1–135.

Pohle, T., Pampalk, E., & Widmer, G. (2005). Evaluation of frequently used audio features for classification of music into perceptual categories. Technical Report, Österreichisches Forschungsinstitut für Artificial Intelligence, Wien, TR-2005-06.

Rudman, J. (1998). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities , 31: 351–365.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology , 39: 1161-1178.

Schoen, M., & Gatewood, E. L. (1927). The mood effects of music. In M. Schoen (Ed.), The Effects of Music (International Library of Psychology). Routledge, 1999, pp. 131-183.

Schubert, E. (1996). Continuous response to music using a two dimensional emotion space. In Proceedings of the 4th International Conference of Music Perception and Cognition . pp. 263- 268.

Scott, S., & Matwin, S. (1998). Text classification using WordNet hypernyms. In Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems , Aug. 16, Montreal, Canada, pp. 45-51.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1-47.

Silla, C. N., Kaestner, C. A. A., & Koerich, A. L. (2007). Automatic music genre classification using ensemble of classifiers. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics , pp. 1687 – 1692.

Skowronek, J., McKinney, M. F., & van de Par, S. (2006). Ground truth for automatic music mood classification. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR’06), Oct. 2006, Victoria, Canada. Retrieved from http://ismir2006.ismir.net/PAPERS/ISMIR06105_Paper.pdf on September 19, 2008.

Sloboda, J. A., & Juslin, P. N. (2001). Psychological perspectives on music and emotion. In P. N. Juslin and J. A. Sloboda (Eds.), Music and Emotion: Theory and Research . New York: Oxford University Press, pp. 71-104.

Stone, P. J. (1966). General Inquirer: a Computer Approach to Content Analysis . Cambridge: M.I.T. Press.

Strapparava, C., & Valitutti, A. (2004). WordNet-Affect: an affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC’04). May 2004, Lisbon, Portugal, pp. 1083-1086

Subasic, P., & Huettner, A. (2001). Affect analysis of text using fuzzy semantic typing. IEEE Transactions on Fuzzy Systems, Special Issue , 9: 483–496.

Symeonidis, P., Ruxanda, M., Nanopoulos, A., & Manolopoulos, Y. (2008). Ternary semantic analysis of social tags for personalized music recommendation. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08), Sep. 2008, Philadelphia, Pennsylvania, USA, pp. 219-225.

Tax, D. M. J., van Breukelen, M., Duin, R. P. W., & Kittler, J. (2000). Combining multiple classifiers by averaging or by multiplying. Pattern Recognition, 33: 1475-1485.

Thayer, R. E. (1989). The Biopsychology of Mood and Arousal. New York: Oxford University Press.

Toutanova, K., Klein, D., Manning, C., & Singer, Y. (2003). Feature-rich Part-of-Speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL 2003 , pp. 252-259.

Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08), Sep. 2008, Philadelphia, Pennsylvania, USA, pp.325-330.

Turney, P. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL) , pp. 417–424.

Tyler, P. (1996). Developing A Two-Dimensional Continuous Response Space for Emotions Perceived in Music . Doctoral dissertation. Florida State University.

Tzanetakis, G. (2007). Marsyas submissions to MIREX 2007. Retrieved from http://www.music-ir.org/mirex/2007/abs/AI_CC_GC_MC_AS_tzanetakis.pdf on September 20, 2008.

Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing , 10(5): 293-302.

Vignoli, F. (2004). Digital Music Interaction concepts: a user study. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR’04). Barcelona, Spain. Retrieved from http://ismir2004.ismir.net/proceedings/p075-page-415-paper152.pdf on September 20, 2008.

Wedin, L. (1972). A multidimensional study of perceptual-emotional qualities in music. Scandinavian Journal of Psychology , 13: 241–257.

Whitman, B., & Smaragdis, P. (2002). Combining musical and cultural features for intelligent style detection. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR’02) , Paris, France, pp. 47–52.

Yang, D., & Lee, W. (2004). Disambiguating music emotion using software agents. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR'04). Oct. 2004, Barcelona, Spain, pp. 52-58.

Yang, Y.-H., Lin, Y.-C., Cheng, H.-T., Liao, I.-B., Ho, Y.-C., & Chen, H. H. (2008). Toward multi-modal music emotion classification. In Proceedings of Pacific Rim Conference on Multimedia (PCM’08) , pp. 70-79.

Yu, B. (2008). An evaluation of text classification methods for literary study, Literary and Linguistic Computing , 23(3): 327-343.

APPENDIX A: LYRIC REPETITION AND ANNOTATION PATTERNS

Annotation patterns:

1. Any of the following words and/or numbers in “()” or “[],” sometimes with numbers and letters: “repeat,” “solo,” “verse,” “intro,” “chorus,” “outro,” “bridge,” “pre-chorus,” “interlude”: e.g., (Chorus #1)
2. Things in () or [] in their own segment.
3. Single lines of “repeat,” “solo,” “verse,” “intro,” “chorus,” “outro,” “bridge,” “pre-chorus,” “interlude,” “piano,” “violin,” “instrument”
4. Any of the following terms at the beginning of a line with “:”: “repeat,” “solo,” “verse,” “intro,” “chorus,” “outro,” “bridge,” “pre-chorus,” “interlude”
5. “end of” at the beginning of a line, followed by any of the following terms: “repeat,” “solo,” “verse,” “intro,” “chorus,” “outro,” “bridge,” “pre-chorus,” “interlude”
6. Any words in between “*” and “*”: e.g., *Stadium announcer*
7. Artist name plus “:” at the beginning of lines: e.g., Dina Rea: Uh huh...
8. Artist name at the beginning of a segment with “:” or “-”: e.g., Sadat X: Cause it's the funky beat …. Ali- Now comes first …
9. Artist name in “()” or “[]”: e.g., “[DMX],” “(Kool Keith)”
10. Artist name in front of “repeat” annotations: e.g., Nelly (repeat 2X) This is for my ...
11. Segments starting with “Transcribed from…”: e.g., Transcribed from patsy cline recordings by yvonne.
12. Segments starting with “Written by…”: e.g., Written by L. Claiborne, J. Crawford, Jr. & V. Hensley
13. Segments starting with “(copyright …”: e.g., (copyright 1966 b.feldman & co. ltd.) (p.samwell-smith/k.relf/j.mccarty)
14. Segments starting with “Words by…,” “Lyrics by” or “Music by…”: e.g., Words by Per Gessle Music by Marie Fredriksson & Per Gessle Vocals: Marie Fredriksson & Per Gessle
15. “sung by” or “spoken words by” at the beginning of a line, followed by people’s names
16. “End chorus”: marking the end of a chorus
17. “fade to end” at the end of a lyric: denoting a sound effect

18. “music fade” at the end of a segment: denoting a sound effect
19. Lines with both “Lyrics” and the title of a song
20. Segments starting with a line containing “SongFooter”: e.g., {{SongFooter |artist = The_Coasters |song = Young_Blood |fLetter = Y |akuma = http://www.akuma.de/mp3artist/121196-the-coasters.html }}
21. Lines with http:// or www.
22. Lines with “Inc.” and “Music” or “Publishing”: e.g., Acuff Rose Publishing, Inc. (BMI)
23. Lines starting with “Time: [digits of time]”: e.g., Time: 2:52
24. Lines with “are sung by”: e.g., (words in parentheses are sung by background singers only)
25. Repetition notations of any of the following patterns.

Repetition patterns:

1. Annotation of a song segment followed by a number indicating times: e.g., [Chorus (2x)]
2. Similar to above, but not in “[]”: e.g., CHORUS II (2x)
3. Similar to above, with “x” preceding the number of times: e.g., 1st chorus(x2)
4. Similar to above, without “()” around the number: e.g., CHORUS x4
5. Similar to above, without a space between segment annotation and number: e.g., (ChorusX2), (Chorus3x)
6. Similar to above, with “[]” around the number: e.g., CHORUS [x2]
7. Similar to above, with a “-” between segment annotation and number: e.g., [Chorus - 2X]
8. Similar to above, with “()” instead of “[]”: e.g., (Chorus A - 2x)
9. Similar to above, with “*” replacing “x”: e.g., (Chorus *2)
10. Similar to above, with a space between the number and “x”: e.g., Chorus x 3
11. Number of repetitions in front of segment annotation: e.g., [2x Chorus]
12. Similar to above, with a “:” at the end: e.g., [2x Chorus:]
13. Similar to above, with “*” surrounding the annotation of the song segment: e.g., *CHORUS* (3X)
14. Similar to above, with “~” replacing “*”: e.g., ~Chorus~(2x)
15. Similar to above, with “times” replacing “x”: e.g., Chorus (2 times)
16. Similar to above, replacing numbers with words: e.g., (chorus twice), [Chorus - two times]
17. Similar to above, with the word “lyrics”: e.g., (chorus lyrics)3 times
18. Similar to above, with artist names: e.g., Eminem & Dina Rea Over Chorus 2x
19. The word “repeat” followed by segment annotation: e.g., Repeat Bridge, Repeat chorus

20. The word “repeat” followed by segment annotation and number of times: e.g., Repeat Chorus 2X
21. Similar to above, with a space between the number and “x”: e.g., Repeat chorus x 2
22. Similar to above, with a space inside the segment annotation: e.g., Repeat Chorus 2
23. Segment annotation followed by “repeat”: e.g., chorus (repeat)
24. Similar to above, with artist name in between: e.g., Chorus: Nelly (repeat 2X)
25. Similar to above, with “,” between segment annotation and “repeat”: e.g., (chorus, repeat 3x)
26. Similar to above, with “:” between segment annotation and “repeat”: e.g., Chorus: (repeat 2X)
27. Similar to above, replacing numbers with words: e.g., (REPEAT CHORUS TWICE), REPEAT CHORUS THREE TIMES
28. Similar to above, without number of times, interpreted as repeating once: e.g., REPEAT CHORUS
29. Similar to above, specifying the relative position of the repeated segment: e.g., (repeat last verse)
30. Similar to above, plus “till end”: e.g., (repeat chorus 1 till end)
31. Repetition instruction at the end of a line: e.g., Timbaland's beat Timbaland's beat (repeat 7 more times)
32. Similar to above, with surrounding marks: e.g., Hello, hello (yo, yo) {*repeat 6X*}
33. Annotation of song segment followed by “w/ minor variations”: e.g., Chorus w/ minor variations
34. Annotation of song segment followed by “till fade”: e.g., Chorus till fade
35. Repetitions of the segment annotation itself: e.g., (chorus) (chorus) (chorus)
36. Segment annotations combined with “+”: e.g., (Chorus 1) + (Chorus 2)
37. Segment annotations combined with “&”: e.g., (Chorus) & (Bridge 2)
38. “[repeat]” at the end of a segment.
39. “[repeat]” at the end of a segment with number of times: e.g., I'm calling out your name (repeat 3x)
40. “(repeat to fade)” at the end of a segment
41. “(repeat to fade)” at the end of a line: e.g., Rock, rock on (*repeat to fade*)
42. “(repeat and fade...)” at the end of a line: e.g., Come sail away with me (repeat and fade...)
43. “(repeated till end)” at the end of a segment or a line
44. Annotation of song segment followed by a number and artist names: e.g., (Chorus 2X: Tim Dog) + (Kool Keith)
45. Number of times at the end of a line: e.g., So listen up baby before you hit the floor (2x); Mary, Mary, Mary, quite contrary (x12)
46. Multiple repetition instructions combined by “and”: e.g., I'm going home, Lord, I'm going home. (Repeat and then chorus twice)
47. Multiple repetition instructions combined without “and”: e.g., [Repeat 1 Repeat 2]

48. Segment annotation plus a specific instruction: e.g., (chorus, inserting "It's" before "In my secret garden")
49. Specifications on repeated lines: e.g., {repeat last 3 lines, 4 times}
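A minimal sketch of how a few of the patterns above could be expressed as regular expressions when cleaning lyric text; the rule set actually applied in this research was far more extensive, and the patterns and names here are illustrative only.

```python
import re

# An annotation-only line, e.g. "[Chorus]", "(Verse 2)", "CHORUS x4" (cf. annotation patterns 1-5).
ANNOTATION_LINE = re.compile(
    r"^\s*[\[\(]?\s*(repeat|solo|verse|intro|chorus|outro|bridge|pre-chorus|interlude)\b.*$",
    re.IGNORECASE,
)
# A "(2x)", "x2" or "2X"-style repetition count (cf. repetition patterns 1-10).
REPEAT_COUNT = re.compile(r"[\(\[]?\s*(\d+)\s*[xX]|[xX]\s*(\d+)")

def strip_annotation_lines(lyrics: str) -> str:
    """Remove lines that are purely structural annotations."""
    return "\n".join(ln for ln in lyrics.splitlines()
                     if not ANNOTATION_LINE.match(ln))

def repeat_times(line: str) -> int:
    """Return the repetition count encoded in a line like '[Chorus (2x)]'."""
    m = REPEAT_COUNT.search(line)
    return int(m.group(1) or m.group(2)) if m else 1

print(repeat_times("[Chorus (2x)]"))                          # 2
print(strip_annotation_lines("[Chorus]\nCome sail away with me"))
```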

APPENDIX B. PAPERS AND POSTERS ON THIS RESEARCH*

Hu, X. (2009). Categorizing Music Mood in Social Context. In Proceedings of the Annual Meeting of ASIS&T (CD-ROM), Nov. 2009, Vancouver, Canada.

Hu, X. (2009). Combining Text and Audio for Music Mood Classification in Music Digital Libraries. IEEE Bulletin of Technical Committee on Digital Libraries (TCDL), Vol. 5. No. 3. Accessible at http://www.ieee-tcdl.org/Bulletin/v5n3/Hu/hu.html .

Hu, X. (2010). Multi-modal Music Mood Classification. Presented in the Jean Tague-Sutcliffe Doctoral Research Poster session at the ALISE Annual Conference, Jan. 2010, Boston, MA. (3rd Place Award).

Hu, X. (2010). Music and Mood: Where Theory and Reality Meet. In Proceedings of the 5th iConference, University of Illinois at Urbana-Champaign, Feb. 2010, Champaign, IL. (Best Student Paper Award).

Hu, X., & Downie, J. S. (2010). Improving Mood Classification in Music Digital Libraries by Combining Lyrics and Audio. In Proceedings of the Joint Conference on Digital Libraries 2010 (JCDL), Jun. 2010, Surfers Paradise, Australia, pp. 159-168. (Best Student Paper Award).

Hu, X., & Downie, J. S. (2010). When Lyrics Outperform Audio for Music Mood Classification: A Feature Analysis. In Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), Aug. 2010, Utrecht, Netherlands, pp.619-624.

* The author of this dissertation holds the copyright to all of these publications.
