
MULTIVARIATE ANALYSIS OF KOREAN AUDIO FEATURES

Mary Solomon

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

May 2021

Committee:

John Chen, Advisor

Junfeng Shang

Copyright © May 2021 Mary Solomon
All rights reserved

ABSTRACT

John Chen, Advisor

K-pop, or Korean pop music, is a genre originating from South Korea that features various musical styles such as hip-hop, R&B, and electronic dance. Modern K-pop started with Seo Taiji and Boys in 1992 and has since evolved through stylistic eras called 'generations' to become a worldwide sensation. K-pop's global popularity can be recognized by the success of groups such as BTS and Blackpink. How do the musical qualities of K-pop songs contribute to the genre's popularity? Furthermore, how have those musical qualities contributed to the genre's evolution into the global phenomenon it is today? To explore these questions and more, multivariate analysis will be performed on a curated dataset of 12,012 K-pop songs and their audio features. The audio features, collected with Spotify's Web API, include variables such as Danceability, Loudness, Acousticness, and Valence. The audio features' contributions and trends in the evolution of K-pop will be analyzed with nonparametric statistical approaches, Multiple Linear Regression (MLR), and Logistic Regression models. MLR and Logistic Regression will also be used to examine the relationship between the audio features and popularity. Finally, dimension reduction of the audio features performed by Principal Components Analysis, paired with K-means clustering, will be utilized to explore the possibility of optimizing song clusters within K-pop.

This thesis is dedicated, in memoriam, to Jonghyun, Sulli (Choi Jin-Ri), and Goo Hara. Their artistry, talent, hard work, and influence in Korean pop music will continue to live on in their legacy.

ACKNOWLEDGMENTS

First, I would like to thank my advisor, Dr. John Chen, for his supportive guidance and encouragement throughout this process. Additionally, I would like to thank Dr. Junfeng Shang for serving on my committee and providing valuable feedback on my work. Overall, I extend my gratitude to the Bowling Green State University Mathematics and Statistics department for being a supportive community, providing endless amounts of help and always encouraging my exploration of creative research pursuits. Additionally, I would like to thank all of the music teachers who have nurtured my lifelong passion for music. I would like to extend a special thanks to my friend Minso Choi for providing Korean to English translations, allowing me to thoroughly perform this research. Finally, I give my greatest appreciation to all of my friends and family who have cheered me on throughout all of my endeavors.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION ...... 1

CHAPTER 2 BACKGROUND ...... 3
2.1 Defining K-pop ...... 3
2.2 K-pop Generations ...... 4

CHAPTER 3 DATA COLLECTION ...... 6
3.1 Overview ...... 6
3.2 Spotify Audio Features ...... 8
3.3 Data Filtering and Selection Criteria ...... 10
3.4 Distribution of Data ...... 11

CHAPTER 4 MODELING STRATEGIES FOR SPOTIFY AUDIO FEATURES ...... 15
4.1 Overview ...... 15
4.2 Multiple Linear Regression ...... 15
4.3 Binary Logistic Regression ...... 16
4.4 Variable Selection ...... 19
4.5 Regularized Regression ...... 20

CHAPTER 5 NONPARAMETRIC ANALYSIS OF AUDIO FEATURES ...... 22

5.1 Introduction ...... 22

5.2 Methodology ...... 22

5.2.1 Wilcoxon Rank Sum Test ...... 22

5.2.2 Kruskal-Wallis Test ...... 23

5.3 Comparing K-pop Generations ...... 24
5.4 Comparing Male and Female Artists ...... 29
5.5 Comparing Group and Solo Artists ...... 31

CHAPTER 6 CLASSIFYING NEW GENERATION SONGS ...... 34
6.1 Introduction ...... 34
6.2 Minor Mode Results ...... 35
6.3 Major Mode Results ...... 36
6.4 Comparison of Minor and Major Mode Models ...... 37

CHAPTER 7 PREDICTING SONG RELEASE DATE ...... 40
7.1 Introduction ...... 40
7.2 Data Preparation ...... 40
7.3 Model Assumptions ...... 41
7.4 Minor Mode Results ...... 43
7.5 Major Mode Results ...... 45
7.6 Comparison of Minor and Major Mode Models ...... 48

CHAPTER 8 PREDICTING POPULARITY ...... 49
8.1 Introduction ...... 49
8.2 Linear Regression Approach ...... 49
8.2.1 Data Preparation ...... 49
8.2.2 Minor Mode Results ...... 50
8.2.3 Major Mode Results ...... 53
8.2.4 Comparison of Minor and Major Mode Models ...... 54

8.3 Logistic Regression Approach ...... 55

8.3.1 Minor Mode Results ...... 55

8.3.2 Major Mode Results ...... 56

8.3.3 Comparison of Minor and Major Mode Models ...... 57

8.4 Comparing Linear and Logistic Regression Approach ...... 58

CHAPTER 9 PRINCIPAL COMPONENTS ANALYSIS ...... 59
9.1 Introduction ...... 59
9.2 Methodology ...... 59
9.2.1 Principal Components Analysis ...... 59
9.2.2 K-means Clustering ...... 60
9.3 Dimension Reduction: PCA ...... 61
9.3.1 Data Preparation ...... 61
9.3.2 Minor Mode Results ...... 62
9.3.3 Major Mode Results ...... 65
9.3.4 Comparison of Minor and Major Mode Results ...... 67
9.4 Clustering on the Principal Components ...... 68

CHAPTER 10 RESEARCH LIMITATIONS ...... 71

CHAPTER 11 CONCLUSION ...... 73

BIBLIOGRAPHY ...... 76

APPENDIX A WILCOXON PAIRWISE COMPARISON RESULTS ...... 79

APPENDIX B MLR AND LOGISTIC REGRESSION MODEL RESULTS ...... 82

LIST OF FIGURES

3.1 Idology's Generation Theory Table ...... 6
3.2 Translated Idology Generation Theory Table ...... 7
3.3 Distribution of Popularity, Acousticness, Instrumentalness, and Speechiness ...... 11
3.4 Distribution of Energy and Loudness ...... 12
3.5 Distribution of Duration, Danceability, Tempo, and Valence ...... 13
3.6 Frequency of Musical Keys ...... 14

7.1 Distribution of Month Release ...... 41
7.2 Distribution of Transformed Month Release ...... 42
7.3 Diagnostic Plots for Minor Mode Song Release Dates ...... 43
7.4 Diagnostic Plots for Major Mode Song Release Dates ...... 46

8.1 Distribution of Popularity ...... 50
8.2 Distribution of Transformed Popularity ...... 51
8.3 Diagnostic Plots for Minor Mode Popularity ...... 52
8.4 Diagnostic Plots for Major Mode Popularity ...... 53

9.1 Scree Plot for Minor Mode Principal Components ...... 62
9.2 Scree Plot for Major Mode Principal Components ...... 65
9.3 Silhouette Plots for Optimal K ...... 69

9.4 K-means Clustering Scatter-plots ...... 70

1 Full Logistic model for Minor Mode Generation Classification ...... 82

2 Stepwise Logistic Model for Minor Mode Generation Classification ...... 83

3 All Possible Subsets Logistic Model for Minor Mode Generation Classification ...... 84
4 Full Logistic Model for Major Mode Generation Classification ...... 86
5 Stepwise Logistic Model for Major Mode Generation Classification ...... 87
6 All Possible Subsets Logistic Model for Major Mode Generation Classification ...... 88
7 Full MLR Model for Minor Mode Song Release Date ...... 90
8 Stepwise Linear Regression Model for Minor Mode Song Release Date ...... 91
9 All Possible Subsets Linear Regression Model for Minor Mode Song Release Date ...... 92
10 Full MLR Model for Major Mode Song Release Date ...... 94
11 Stepwise Linear Regression Model for Major Mode Song Release Date ...... 95
12 All Possible Subsets Linear Regression Model for Major Mode Song Release Date ...... 96
13 Full MLR Model for Minor Mode Song Popularity ...... 98
14 Stepwise Linear Regression Model for Minor Mode Song Popularity ...... 99
15 All Possible Subsets Linear Regression Model for Minor Mode Song Popularity ...... 100
16 Full MLR Model for Major Mode Song Popularity ...... 102
17 Stepwise Linear Regression Model for Major Mode Song Popularity ...... 103
18 All Possible Subsets Linear Regression Model for Major Mode Song Popularity ...... 104
19 Full Logistic Model for Minor Mode Popularity Classification ...... 106
20 Stepwise Logistic Model for Minor Mode Popularity Classification ...... 107
21 All Possible Subsets Logistic Model for Minor Mode Popularity Classification ...... 108
22 Full Logistic Model for Major Mode Popularity Classification ...... 110
23 Stepwise Logistic Model for Major Mode Popularity Classification ...... 111
24 All Possible Subsets Logistic Model for Major Mode Popularity Classification ...... 112

LIST OF TABLES

4.1 Classification Assessment: Confusion Matrix ...... 18

5.1 Kruskal-Wallis Test Results: Generations of K-pop ...... 25
5.2 Pairwise Comparison Results: Popularity by Generation ...... 26
5.3 Wilcoxon Test Results: Male vs Female Artists ...... 30
5.4 Wilcoxon Test Results: Group vs Solo Artists ...... 31

6.1 Generation Logistic Model Diagnostics for Minor Mode Songs ...... 35
6.2 Generation Logistic Model Diagnostics for Major Mode Songs ...... 36

7.1 Release Date MLR Model Diagnostics for Minor Mode Songs ...... 44
7.2 Release Date MLR Model Diagnostics for Major Mode Songs ...... 46

8.1 Popularity MLR Model Diagnostics for Minor Mode Songs ...... 51
8.2 Popularity MLR Model Diagnostics for Major Mode Songs ...... 54
8.3 Popular Logistic Model Diagnostics for Mode 0 Songs ...... 55
8.4 Popular Logistic Model Diagnostics for Mode 1 Songs ...... 56

9.1 PCA Variance Explained for Minor Mode Songs ...... 63
9.2 First Five PCs for Minor Mode Songs ...... 63
9.3 PCA Variance Explained for Major Mode Songs ...... 66

9.4 First Five PCs for Major Mode Songs ...... 66

A.1 Pairwise Comparison Results: Duration by Generation ...... 79

A.2 Pairwise Comparison Results: Acousticness by Generation ...... 79

A.3 Pairwise Comparison Results: Energy by Generation ...... 79

A.4 Pairwise Comparison Results: Instrumentalness by Generation ...... 80
A.5 Pairwise Comparison Results: Speechiness by Generation ...... 80
A.6 Pairwise Comparison Results: Loudness by Generation ...... 80
A.7 Pairwise Comparison Results: Tempo by Generation ...... 80
A.8 Pairwise Comparison Results: Valence by Generation ...... 81

CHAPTER 1 INTRODUCTION

Korean pop music (K-pop) has become a globalized success and phenomenon. BTS has charted on the Billboard Hot 100 and Blackpink headlined Coachella in 2019. Modern K-pop music started in 1992 and has since evolved into the global phenomenon we know today. Many know K-pop for its catchy point dances, impressive visuals, and upbeat music. But can we understand the success and appeal of Korean pop music from its audio features alone? Spotify, a popular music streaming platform, allows the public to retrieve Audio Features of songs on its platform through its Web API. A curated data set of 12,012 K-pop songs and their Spotify Audio Features will be analyzed by multivariate analysis to address the following topics. K-pop is viewed by some as a 'genre-less' music style, as it takes inspiration from and adapts to many styles such as hip-hop, R&B, soul, euro-pop, house, or even Caribbean dancehall. This is believed to be one of the contributors to its success (Sherman, 2020). Since this is the case, rather than discussing K-pop via sub-genres and styles, this discussion revolves around attributes of the artist such as the K-pop generation, gender (Male/Female/Coed), and whether the artist is a group or solo act. Nonparametric hypothesis testing with the Kruskal-Wallis and Wilcoxon Rank Sum tests will be used to analyze whether the audio features differ between these artist attributes. Of these artist attributes, the discussion of K-pop generations is the most frequent in distinguishing artists and their works from one another. More importantly, this concept of K-pop generations also serves to define the evolution of K-pop and how the genre has changed over time.

Because K-pop generations have a strong prevalence in the discussion around the genre, this thesis will seek to answer the question: How do audio features contribute to distinguishing music into the newer versus older generation? This will be addressed via classification using Binary Logistic

Regression. This method models multiple predictor variables to a binary response. Furthermore, variable selection and regularized regression methods will be assessed to determine which statistical model will be the most effective at performing these classifications of songs into the new or old generation.

While the discussion on K-pop Generations serves as an understanding of how the genre has evolved over time, the boundaries are not well defined. There are frequent disputes between fans and critics on this topic. Furthermore, since the definition of K-pop Generations relies on an artist's year of debut, there exists an overlap in a song's timeline for an artist whose career spans multiple generations. Therefore, a more precise analysis of how the genre has musically evolved can be achieved by modeling the audio features of a song against the song's time of release. This relationship between the audio features and a song's release date will be modeled with Multiple Linear Regression (MLR). MLR models the relationship between multiple predictor variables, continuous or categorical, and a continuous response variable. Similar to the analysis involving Binary Logistic Regression, multiple MLR models will be created from variable selection and regularized regression methods. These models will be assessed to determine which statistical model is the best for predicting a song's release date given its audio features. Not only does this thesis serve to explore how the K-pop genre has evolved musically over time, but we also want to gain an understanding of which qualities of the music make it so successful. Therefore we will explore how audio features contribute to a song's popularity. This will be assessed using two approaches. One is the Multiple Linear Regression approach, to see how audio features can predict a song's popularity score. The second is a Binary Logistic Classification approach, to see how the audio features contribute to classifying a song as popular or not. While both approaches serve a common goal, they will be assessed to determine which is the most informative in explaining the relationship between the audio features and music popularity.

Lastly, this thesis will try to answer the research question: can fewer audio features describe a

K-pop song without losing corresponding information? This is equivalent to a dimension reduction task in multivariate statistical inference, which will be performed with Principal Components

Analysis. With these reduced dimensions, how many clusters in the music will be detected? The clustering will be performed on the resulting principal components by the K-means clustering algorithm.

CHAPTER 2 BACKGROUND

2.1 Defining K-pop

K-pop, or Korean pop music, is a genre originating from South Korea that features various musical styles such as hip-hop, R&B, and electronic dance music (EDM). The term K-pop more specifically refers to Korean pop idol music, separate from the general pop genre in the Korean music market. Idol music is separate from the general genre of pop due to its emphasis on visuals, high production value, and the studio production system discussed later in this introduction (Sherman, 2020; Kim, 2020). K-pop has been popularized outside of Korea since the late 1990s as part of the "Hallyu Wave", which refers to the spread of South Korean popular culture overseas. The Hallyu Wave was initially the most impactful throughout countries in East Asia and later captivated audiences in the Middle East, Latin America, and the United States (Song, 2020). This impact of the Hallyu Wave via K-pop has recently been influential for the American audience through the rise in popularity of groups such as BTS and Blackpink. Although the world had been introduced to K-pop through Psy's "Gangnam Style" in 2012, the success and impact were largely short-lived due to its viral nature. K-pop's recent success, largely led by BTS and Blackpink, is distinguished from that of "Gangnam Style" by its sustained relevance and popularity with audiences (Pramudita, 2018). These two groups have even achieved accolades such as charting on the Billboard Hot 100, headlining Coachella, and setting record-breaking viewer counts on YouTube. Beyond the accolades, K-pop has infiltrated American pop culture, where BTS, Blackpink, and other groups such as SuperM and NCT are making appearances on American television programs such as Saturday Night Live, Late Night with Jimmy Fallon, and The Late Late Show with James Corden, just to name a few.

With K-pop becoming a global phenomenon, many wonder: what makes the genre so popular?

NPR's Sherman (2020) attributes its popularity to its genre-less nature, as the genre draws inspiration and influences from many musical styles such as hip-hop, R&B, soul, euro-pop, house, or Caribbean dancehall. Vox's Romano (2018) describes the success in the following manner: "K-pop has become a truly global phenomenon thanks to its distinctive blend of addictive melodies, slick choreography and production values, and an endless parade of attractive South Korean performers who spend years in grueling studio systems learning to sing and dance in synchronized perfection." On the other hand, Vox describes that rather than trying to define K-pop by the music, it is better to understand K-pop by its performance elements such as high quality performances (especially dance), a polished aesthetic, and the studio production process. While K-pop itself is frequently characterized by factors outside of the actual music, this research will aim to draw conclusions about the musical features of the genre. Another topic that arises when assessing the popularity of K-pop is how the genre evolved to become a global phenomenon. This will be discussed in the next section on the concept of K-pop generations.

2.2 K-pop Generations

The evolution of Korean pop music can be understood by its Generations of artists. K-pop Generations are defined by the time period in which an artist debuted as well as the general style and trends of that era. However, the definitions of when the Generations start and end are not concrete and are often debated by fans and critics. While the exact timeline and categorization of artists into these generations is not explicitly defined, there exist clear themes that define each generation. The First Generation is the origin of K-pop and roughly lasted from 1992 to 2003. Modern Korean pop started with Seo Taiji and Boys in 1992. This group was the first to introduce styles such as hip-hop and New Jack Swing to South Koreans. Their music offered a new alternative to existing styles of music like Trot, a traditional style of Korean popular music. The first true K-pop idol group was a five-piece boy band, H.O.T., with their release of the bright pop song

"Candy" in 1996. They were formed by Lee Soo Man, founder of SM Entertainment, creating the standard for the K-pop idol formation and training process through entertainment companies. Their music style was inspired by American boy bands and Japanese pop music (J-pop). Other groups and artists that define this era are Sechskies, S.E.S., and Fin.K.L. Overall, the music from this era featured styles of dance, hip-hop, and 90s R&B (Squip, 2020; Sherman, 2020). Second Generation K-pop artists (2004-2012) are the pioneers of the 'Hallyu wave' (Romano, 2018). During this era, groups such as Girls' Generation, Kara, TVXQ!, and Big Bang promoted in countries like Japan and China, and even the United States. Influenced by the popularity of H.O.T. and other first generation idols, the idol formation process was solidified during this generation. This idol-making process consists of competitive audition/recruitment, rigorous training, and official debut. Promotion strategies also expanded during this generation to mediums that created a more personalized connection to fans, such as variety shows, dramas, and reality shows. The Third Generation (2012-2017) is the era that achieved the worldwide globalization of K-pop with groups such as BTS and Blackpink. This is also the generation that started the popularity of idol competition shows where hundreds of young trainees compete for stardom. Besides BTS and Blackpink, other popular groups from this era include Red Velvet and I.O.I. The Fourth Generation (2018 - now) is yet to be clearly defined. Some sources do not even recognize a 4th generation in K-pop, such as The Pudding (Chua and de Luna, 2020). For those that do define a 4th generation of K-pop, it is clear that the genre has the goal of appealing to a global audience. Sources such as NPR's Sherman (2020) describe this generation as one "no longer bound by borders". Popular 4th Generation idols include (G)I-DLE, among others.

CHAPTER 3 DATA COLLECTION

3.1 Overview

The data set used for this research was created using the Spotify API to collect audio features for 12,012 K-pop songs. Using the spotipy package for Python, I programmed Python scripts to pull entire discography collections for select K-pop artists defined in Idology's Idol Generation Theory article (Squip, 2020), seen in Figure 3.1. Idology is a long-running K-pop critic web-zine. While the definitions of which artists are categorized under their respective generations are not universally agreed upon, I chose this source as a reference since it had been mentioned in several popular K-pop fan-sites. Although the Idology table specifies half generations, this is not commonly referenced by fans and critics. Therefore, an artist listed as a half generation is categorized in the generation rounded down (ex: 1.5 categorized as 1). The translated timeline of artists can be seen below in Figure 3.2.

Figure 3.1 Idology's Generation Theory Table

(a) generations 1 & 2

(b) generations 3 & 4

Figure 3.2 Translation of Figure 3.1, the Idology generation theory table. Artists added to the table are in italics. Parentheses next to an artist's name indicate the group they originally debuted in. Artists not available on Spotify are crossed out. The exception is one artist whose songs are available on Spotify but fall under R&B rather than K-pop.

3.2 Spotify Audio Features

Spotify is a subscription-based audio streaming platform that provides users access to millions of songs, audio books, and podcasts. As of March 2021, Spotify operates in 178 markets around the world. Spotify for Developers is a technical service that offers Web APIs for developing apps and/or working with Spotify data. This research focuses on using the Spotify Audio Features data, which are calculated through algorithms developed by EchoNest, now owned by Spotify (Skidén, 2016). The definitions of the relevant Audio Features, provided by Spotify, are described in the following bulleted list:

• Release Date: The date the track was first released, for example "1981-12-15". Depending on the precision, it might be shown as "1981" or "1981-12".

• Popularity: The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.

• Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

• Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

• Duration: The track length in milliseconds. Converted to seconds for analysis.

• Energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

• Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the Instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

• Key: The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C#/D-flat, 2 = D, and so on.

• Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

• Mode: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

• Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

• Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

• Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

3.3 Data Filtering and Selection Criteria

The data collection process pulls every single song available on Spotify by the artists requested, along with its metadata (album name, release date, etc.). However, due to this workflow, if a Spotify artist is featured on an album such as a 'top hits', 'club mix', or 'best of' compilation, all songs on that album are pulled even if the artist contributed to only one song on the album. This created a need for extensive data cleaning after collection. In addition to this, songs were pulled from multiple Spotify markets, in other words different countries. Therefore, an entire album could be duplicated in the data set due to being uploaded in different markets. The following selection criteria were used to clean and de-duplicate the data.

• Songs from irrelevant artist URIs (those not in the artist list) are dropped on the condition that the name of the artist of interest is not mentioned in the song title (ex: "Sour Candy featuring Blackpink") or in any artist name (ex: "Blackpink, Lady Gaga").

• Remove any songs from albums that are live concert/tour albums or compilation albums, using string matching.

• Remove any songs that are instrumental, live, or alternate versions, or versions in another language.

• Remove duplicate songs whose IDs are different. The song with the higher popularity among the duplicates is kept (see the sketch after this list). A song is considered a duplicate if the song name, artist name, and album name simultaneously match. Even if the song name and artist name are the same, sometimes artists change the instrumentation or arrangement of a song on different albums.

• Removal of ’intro’, ’outro’, or ’interlude’ songs which serve as short tracks that transition between songs on an album.

• Remove songs shorter than a minute or longer than 10 minutes.
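To make the de-duplication and length rules concrete, the following is a minimal R sketch using dplyr. This is not the thesis' actual cleaning code (which was written in Python); the data frame and its column names (song_name, artist_name, album_name, popularity, duration_sec) are illustrative assumptions.

```r
library(dplyr)

# Toy stand-in for the collected data; column names are assumptions.
songs <- tibble(
  song_name    = c("Song A", "Song A", "Song B"),
  artist_name  = c("Artist X", "Artist X", "Artist Y"),
  album_name   = c("Album 1", "Album 1", "Album 2"),
  popularity   = c(35, 20, 10),
  duration_sec = c(215, 215, 780)
)

deduped <- songs %>%
  # a duplicate = song name, artist name, and album name all match
  group_by(song_name, artist_name, album_name) %>%
  # keep the most popular copy among the duplicates
  slice_max(popularity, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # drop songs shorter than 1 minute or longer than 10 minutes
  filter(duration_sec >= 60, duration_sec <= 600)
```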

3.4 Distribution of Data

The general distribution of the data for each audio feature is discussed in this section. Since the songs all come from one genre, specific trends and attributes of its musical style are likely to be visible in the distributions. Popularity, Acousticness, Instrumentalness, and Speechiness are all right-skewed, such that the majority of songs in the dataset have low values of these features.

Figure 3.3 Distribution of Popularity, Acousticness, Instrumentalness, and Speechiness.

Since the entire discography for each artist was pulled for this dataset, it is reasonable that the majority of the data have low popularity scores. When an artist releases an album, only a handful of tracks will be popular; furthermore, due to Popularity's time component, the majority of the data are older songs, so few people may be listening to those tracks in recent times. For Acousticness, it is no surprise that a genre so heavily influenced by electronic dance and hip-hop would tend to have low Acousticness values. The same can be concluded from the observation of a right-skewed distribution for Instrumentalness. For Instrumentalness, it is important to note that 8,841 rows of the data have an Instrumentalness value of 0, so the distribution shown in Figure 3.3 displays the non-zero values on a log scale. Meanwhile, for features such as Energy and Loudness, the data tend to have higher frequencies at relatively higher measures. On the other hand, there are gradually lower frequencies for lower measures, creating a left-skewed histogram for each feature.

Figure 3.4 Distribution of Energy and Loudness.

Although the range for Loudness is between -60 and 0, almost all of the data have loudness measures between -10 and 0. As expected for music that tends to have high levels of Danceability, the energy levels are very high for K-pop songs, with the typical value being around 0.85. K-pop is especially known for its upbeat, high-energy tracks, so it is no surprise that the distribution has a high typical value with a tail to the left. Otherwise, the features of Duration, Danceability, Tempo, and Valence are only lightly skewed and are roughly normal around the center.

Figure 3.5 Distribution of Duration, Danceability, Tempo, and Valence.

For the Duration distribution, the majority of tracks are between 2.5 and 5 minutes long. This is typical, as most pop songs are 2-5 minutes. Upon investigation, there are two significantly long songs that have been removed. The first is Turbo's "non-stop summer dj remix", which is 22 minutes long. This song's length could be explained by the likelihood that it is intended to be played at clubs or party events. The second is the 13-minute "magic - origin". The remaining songs in the dataset are less than 10 minutes long. For Danceability, it is understandable that the majority of the tracks are above 0.50. This could be explained by the fact that K-pop is well known for its focus on eye-catching choreography and dance to accompany the song.

The dataset contains three categorical audio features: Mode, Time Signature, and Key. For Time Signature, 97% of the songs in this data set are detected to have a time signature of 4 beats per measure. This dominance makes sense in the context of pop music, as most of the songs don't adopt complex time signatures compared to the norm of 4/4 or 3/4 time. Meanwhile, the proportions of musical mode are more fairly represented in the dataset, where 38.9% of the songs are composed in a minor key and 61.1% are composed in a major key. Figure 3.6 shows the distribution of songs per musical key. There are a wide variety of songs composed in each key, with the key of C being the most popular. In the context of music composition for the pop genre, the key of C for both modes is very standard and is frequently used. Surprisingly, the key of C#/D-flat is the second most frequent in the data set. For pop songs, the keys of C, D, and G are the most typical, so it is surprising that one of those keys is not the second most frequent.

Figure 3.6 Frequency of Musical Keys

CHAPTER 4 MODELING STRATEGIES FOR SPOTIFY AUDIO FEATURES

4.1 Overview

This chapter will provide an overview of the Linear and Logistic Regression modeling methodology that is used in multiple chapters of this thesis. Furthermore, optimization of these models through variable selection and regularized regression methods will be explored. Assessing the performance of these models requires determining each model's ability to describe the population data. In other words, the model is assessed on how well it predicts or classifies an unknown subset of data, representing a random sample from the population. This is achieved by splitting the sample data into a training and test set. The training data is used to fit the models, and the test data, which represents an unknown random sample from the population, is used to evaluate the performance. Furthermore, a third split into a validation set is made for the Logistic Regression models in order to choose optimal cut-off values for classification. Even with data splitting, the model performance may be dependent on the training set, or the method may require parameter tuning. These concerns are addressed with 10-fold Cross Validation such that, by the end of the procedure, all data will have served as both test and training data. The steps of Cross Validation are (a sketch follows the list):

• Data is partitioned into 10 non-overlapping subsamples called 'folds'.

• The model is fit 10 times. Each time, one of the folds is used as the test data and the remaining 9 folds are used as the training data.

• The average of the 10 test errors is obtained.
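As a concrete illustration of these steps, the R sketch below performs 10-fold cross validation by hand. The built-in mtcars data set and the lm() formula are placeholders standing in for the audio-feature data and the thesis' models.

```r
set.seed(1)
dat   <- mtcars                                     # stand-in for the song data
folds <- sample(rep(1:10, length.out = nrow(dat)))  # assign each row to a fold

cv_rmse <- sapply(1:10, function(k) {
  train <- dat[folds != k, ]   # 9 folds are used for fitting
  test  <- dat[folds == k, ]   # the held-out fold is used for evaluation
  fit   <- lm(mpg ~ ., data = train)
  sqrt(mean((test$mpg - predict(fit, test))^2))
})
mean(cv_rmse)                  # average of the 10 test errors
```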

4.2 Multiple Linear Regression

Multiple Linear Regression (MLR) models the relationship of multiple predictor variables to a single continuous response variable. This model serves many purposes such as measuring the strength of association between the predictor and response variable. MLR also provides a tool 16 for predicting a continuous response given many predictor variables which can be continuous or categorical. This method will be used for modeling the relationship of the audio features to song release date and then popularity. The model is defined as:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i \quad (4.2.1)

for i = 1, 2, . . . , n. The assumptions of the model are:

• Residuals are Normally Distributed.

• A linear relationship is assumed between the dependent and independent variable.

• Residuals are homoscedastic. In other words the residuals exhibit equal variance.

• No multicollinearity between independent variables. Independent variables should not be correlated to one another

The predictive performance of the MLR models is assessed with the Root Mean Square Error (RMSE). The RMSE measures the deviation between the predicted value and the actual value in the following manner:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \quad (4.2.2)

When comparing multiple MLR models to one another, the model that performs the best will yield predictions that are close to the actual value. Therefore, the model that produces the lowest

RMSE will be chosen as the optimal MLR model.
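The following is a minimal R sketch of this fit-predict-score workflow, again using the built-in mtcars data and an 80/20 train/test split as stand-ins for the thesis' data and splitting scheme.

```r
set.seed(1)
idx   <- sample(nrow(mtcars), size = 0.8 * nrow(mtcars))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- lm(mpg ~ ., data = train)   # full MLR model (equation 4.2.1)
pred <- predict(fit, newdata = test)
sqrt(mean((test$mpg - pred)^2))     # test-set RMSE (equation 4.2.2)
```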

4.3 Binary Logistic Regression

Binary Logistic Regression models the relationship of either continuous or categorical predictor variables to a single binary outcome. The relationship of the audio features will be measured with the outcomes of a song belonging to the old or new generations, as well as whether a song is popular or not. Therefore, in addition to measuring the relationship between the predictors and the output, Logistic Regression serves as a classification technique. Unlike traditional linear models, logistic regression models do not hold the same assumptions of (1) requiring a linear relationship between the dependent and independent variables, (2) error terms or residuals being normally distributed, and (3) homoscedasticity of the residual or error terms. The assumptions that do need to be met for Logistic Regression are:

• Outcome variable is a binary random variable.

• Observations are independent from one another.

• Little to no multicollinearity between dependent/predictor variables.

• Independent variables should be linearly dependent on the log odds.

• Large sample size

The binary random variable is defined as

Z = \begin{cases} 1, & \text{if the outcome is a success} \\ 0, & \text{if the outcome is a failure} \end{cases}

where the probability of 'success' is defined as P(Z = 1) = \pi and the probability of 'failure' is defined as P(Z = 0) = 1 - \pi. Thus, the odds of success are defined as the ratio of the probability of success over failure, \frac{\pi}{1-\pi}; if \frac{\pi}{1-\pi} > 1, the success probability is greater than the failure probability, and the observation that yields these odds is categorized into the success group. For the classification of songs into the new or old generations, classification into the new generation is the success outcome. For the classification of a song as popular or not popular, popular is the success outcome. Logistic Regression models the logit transformation of these odds, called the log odds, \log(\frac{\pi}{1-\pi}). Thus, the logistic model can be defined as:

\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \quad (4.3.1)

where \beta_0, \ldots, \beta_p are the coefficients for the x_1, \ldots, x_p predictor variables and \epsilon is the error term. The interpretation of the model can be understood on the scale of the odds:

\frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon} \quad (4.3.2)

where the \beta_1, \ldots, \beta_p can be interpreted in the following manner: as the value of x_i increases by one unit, the odds of success increase or decrease multiplicatively by e^{\beta_i}. However, for many of the audio features such as Danceability, a unit increase of 1 does not make sense in context due to the variable's scale from 0 to 1. Therefore, the following interpretation may be more appropriate: given all other variables are held constant, as x_i increases by 0.1, the odds of success increase or decrease by a multiple of e^{0.1\beta_i}. When evaluating the performance of a logistic regression model, a residual error cannot be assessed due to its binary nature. Therefore, measures such as accuracy, sensitivity, and specificity will be used. The calculations for these measures can be best understood with the help of a confusion matrix:

Table 4.1 Classification Assessment: Confusion Matrix

              Actual: 1            Actual: 0
Predicted: 1  True Positive (TP)   False Positive (FP)
Predicted: 0  False Negative (FN)  True Negative (TN)

Accuracy, \frac{TP + TN}{TP + FP + FN + TN}, is the proportion of observations that are correctly categorized by the logistic model. Therefore, the optimal model will have the highest accuracy rate. Sensitivity, \frac{TP}{TP + FN}, measures the ability to detect the 'important' class or success outcome. Specificity, \frac{TN}{FP + TN}, measures the ability to rule out the 'unimportant' class or failure outcome. Therefore, in addition to a high accuracy rate, the optimal model should return a balanced trade-off between the sensitivity and specificity rates. In other words, the difference between sensitivity and specificity should be minimized.
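The sketch below fits a binary logistic model in R and computes these three measures. The binary am variable in mtcars stands in for the thesis' new-vs-old-generation (or popular-vs-not) outcome, and the 0.5 cutoff is a placeholder for the validation-tuned cutoff described above.

```r
fit  <- glm(am ~ mpg, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)   # placeholder cutoff; the thesis tunes this

cm <- table(Predicted = pred, Actual = mtcars$am)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

(TP + TN) / sum(cm)   # accuracy
TP / (TP + FN)        # sensitivity: detecting the success class
TN / (FP + TN)        # specificity: ruling out the failure class
exp(0.1 * coef(fit))  # odds multipliers per 0.1-unit increase (ignore intercept)
```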

4.4 Variable Selection

Fitting all of the predictor variables to create a full Linear or Logistic Regression model may not yield the optimal model for a prediction or classification task. The prediction or classification accuracy of a model, as well as its interpretability, can be improved through variable selection methods. This section will introduce the methods of All Subsets Regression and Stepwise Regression.

All Subsets Regression aims to find the subset of size k, for each k = 1, \ldots, p, that gives the smallest Residual Sum of Squares (RSS). Then, the k-variable model that optimizes the trade-off between bias and variance is chosen. This thesis uses the Akaike Information Criterion (AIC) to choose the optimal model, where \hat{\sigma}^2 is an estimate of the irreducible error:

AIC = \frac{1}{n}\left(RSS + 2k\hat{\sigma}^2\right) \quad (4.4.1)

where RSS = \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2. However, this method is computationally expensive since it searches through all possible subsets. There are \binom{p}{k} = \frac{p!}{k!(p-k)!} possible models of size k, which becomes computationally difficult for large values of p. An alternative variable selection method is Stepwise selection, which is computationally more feasible since it limits the number of model subsets fitted. The Stepwise method has three different approaches: forward, backward, and bi-directional.

This research utilizes the backward approach, which initially fits the full model and sequentially deletes the variable with the least impact on the fit of the model. This judgement is made according to whether dropping the variable will minimize the overall AIC of the model, as sketched below.
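For reference, backward stepwise selection by AIC is available in base R via step(); the mtcars data here is a placeholder for the audio-feature data, and the commented lines show the all-subsets counterpart, assuming the leaps package is installed.

```r
full <- lm(mpg ~ ., data = mtcars)          # start from the full model
best <- step(full, direction = "backward")  # repeatedly drop the variable
summary(best)                               #   whose removal most lowers AIC

# All-subsets search, assuming the leaps package is installed:
# library(leaps)
# subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
```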

Because the Stepwise selection method requires fewer subsets to be fit, it results in lower variance than the All Subsets approach, but risks the possibility of higher bias. This method is highly dependent on the training data, so in order to minimize the dependency of the model's performance on the data, Cross Validation is used.

4.5 Regularized Regression

The All Subsets and Stepwise Regression methods are discrete in nature, characterized by the simplicity of keeping or discarding predictor variables from the model. However, this attribute often yields high variance and does not reduce prediction error in comparison to the full model. Regularized Regression, also referred to as Shrinkage Methods, offers an alternative approach that does not suffer as much from high variability. By decreasing the variance of the estimates, Regularized Regression methods can decrease the observed test error. Test error is the error of the predictions resulting from the performance of the trained model on the test data. Three regularized regression methods will be applied to both MLR and Logistic Regression. The methods of Ridge, Lasso, and Elastic Net will be explained in the context of Linear Regression, but an extension to the logistic regression setting can easily be understood via a model of the log odds. Ridge Regression shrinks the regression coefficients towards zero and each other by imposing a penalty term.

\hat{\beta}_{Ridge} = \min_{\beta}\left\{\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\} \quad (4.5.1)

where \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 is the residual sum of squares (RSS) for the model at \beta

and \lambda\sum_{j=1}^{p}\beta_j^2 is the penalty term. This penalty term is also referred to as L2 since it represents a second-order penalty on the coefficients. The tuning parameter \lambda controls the penalty term. When \lambda = 0, no regularization is performed and the model is equivalent to Ordinary Least Squares

Regression. As the penalty increases, λ → ∞, the coefficients are shrunk towards zero. The

optimal tuning parameter λ that minimizes error is chosen through Cross Validation. This method is not scale invariant, so normalization of the data is required prior to fitting the

model. Packages in R, such as glmnet(), will do this automatically. It is important to note that Ridge

Regression does not perform feature selection. Rather, Ridge keeps all variables in the model and shrinks the coefficients of variables with minor contributions towards zero. A Regularized Regression approach that does perform variable selection as well as shrinkage

p 2 The optimization is similar to that of Ridge, however the L2 Ridge penalty λΣj=1βj is replaced p with the L1 Lasso penalty, λΣj=1 |βj |.

ˆ N p 2 p βLasso = minβ {Σi=1(yi − β0 − Σj=1 βj xij ) + λΣj=1 |βj |} (4.5.2)

Similarly to the Ridge method, choosing the optimal tuning parameter λ that minimizes the error, can be done using Cross Validation. While Lasso has the advantage of performing feature selection with shrinkage, this can result in difficulties when dealing with correlated predictors. One of the correlated features may be pushed to zero and removed from the model while the other stays. Depending on the dataset, removing the feature at the expense of its correlation to another predictor may not result in the optimal model. In the absence of performing feature selection, Ridge, effectively reduces the correlated features together without the expense of removing one of the variables. This trade off is balanced by the Elastic Net model which combines the penalties for optimiza- tion:

ˆ N p 2 p 2 p βElasticNet = minβ {Σi=1 (yi − β0 − Σj=1 βj xij ) + λΣj=1 βj + λΣj=1 |βj |} (4.5.3)

By combining the penalties, the Elastic Net method keeps the quality of effective regularization from Ridge and feature selection from Lasso. In addition to having the tuning parameter λ, Elastic

Net is also controlled by the α parameter which determines the balance of the two penalties. α =

0.05 performs an equal combination of the two penalty terms, α → 0 applies the ridge penalty more heavily and when α → 1 will have a heavier Lasso penalty. Both the λ and α parameters are tuned during Cross Validation (Friedman, Hastie, Tibshirani, et al., 2001; Boehmke, Boehmke). 22

CHAPTER 5 NONPARAMETRIC ANALYSIS OF AUDIO FEATURES

5.1 Introduction

For this section, nonparametric hypothesis tests are performed to compare audio features based on different artist attributes. We will do so for each audio feature, comparing attributes such as artist generation (1, 2, 3, and 4), artist gender (male and female), and artist type (group and solo).

5.2 Methodology

In order to compare the differences in audio features between groups, a method that does not make assumptions about the underlying distribution is needed, because many of the distributions of the audio features are skewed. To conduct analysis that is free of these assumptions, nonparametric hypothesis testing and confidence interval methods will be used. The methods used in this section are the Wilcoxon Rank Sum Test for comparisons of two independent samples and the Kruskal-Wallis Test for comparisons of two or more independent samples.

5.2.1 Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum Test is a nonparametric test to compare the centrality of two independent samples, where population 1 is expressed as x_1, \ldots, x_n and population 2 as y_1, \ldots, y_n. The Wilcoxon Rank Sum Test is also considered equivalent to the Mann-Whitney-Wilcoxon test. This paper will be conducting a two-sided Wilcoxon Rank Sum Test for the comparison of audio features for different artist genders as well as artist types.

The assumptions of this test are as follows:

• Observations x_1, \ldots, x_n and y_1, \ldots, y_n are independent and identically distributed for populations X and Y, respectively.

• The median completely characterizes the distribution of each independent sample. Therefore, if a population difference exists, it will be completely characterized by the difference in population medians.

• Populations X and Y are continuous and symmetrically distributed about the population

medians, notated \theta_x and \theta_y, respectively.

The null hypothesis is that the population medians are the same between groups and the alternative hypothesis is that the population medians are different from one another. The hypotheses for a two-sided Wilcoxon Rank Sum Test are notated as follows:

H_0 : \theta_x = \theta_y \quad (5.2.1)

H_1 : \theta_x \neq \theta_y

which can also be expressed in terms of the population median difference:

H_0 : \theta_x - \theta_y = 0 \quad (5.2.2)

H_1 : \theta_x - \theta_y \neq 0

The statistic of this test, the Wilcoxon statistic (W), is computed by first creating a joint ranking of observations x_1, \ldots, x_n and y_1, \ldots, y_n in increasing order. Then W is calculated by summing all of the ranks of observations from the y population. The null hypothesis is rejected when the p-value of the W statistic indicates that one of the population medians is significantly larger than the other.
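In R, this test (with a confidence interval for the estimated location shift) is available through wilcox.test(); the vectors below are toy values standing in for an audio feature measured on two artist groups.

```r
x <- c(0.62, 0.71, 0.58, 0.66, 0.73)  # e.g. Danceability, group 1 (toy values)
y <- c(0.55, 0.60, 0.52, 0.64, 0.57)  # e.g. Danceability, group 2 (toy values)

wilcox.test(x, y, alternative = "two.sided",
            conf.int = TRUE, conf.level = 0.95)
```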

5.2.2 Kruskal-Wallis Test

The Kruskal-Wallis test is an extension of the Wilcoxon-Mann-Whitney test, extending the Wilcoxon statistic formulation to three or more samples. This method is used for comparing the audio features between the four K-pop generations.

Generally, the Kruskal-Wallis test compares k independent random samples, ranging from population 1, represented as x_{11}, \ldots, x_{1n_1}, to population k, represented as x_{k1}, \ldots, x_{kn_k}. The assumptions of this test are identical to those of the Wilcoxon test, but expanded to a larger number of samples.

ulation 1 is represented as x11, ..., x1n1 to population k which is represented as xk1, ..., xknk. The assumptions of this test are identical to that of the Wilcoxon test, but expanded to a larger number of samples. The assumptions of this test are as follows:

• Observations x_{11}, \ldots, x_{1n_1} through x_{k1}, \ldots, x_{kn_k} are independent and identically distributed for populations 1, \ldots, k, respectively.

• The median completely characterizes the distribution of each independent sample. There- fore, if a population difference exists, it will be completely characterized by the difference in population medians.

• Populations X_1, \ldots, X_k are continuous and symmetrically distributed about the population medians, notated \theta_1, \ldots, \theta_k, respectively.

The hypotheses for the Kruskal-Wallis test are notated:

H_0 : \theta_1 = \cdots = \theta_k \quad (5.2.3)

H_1 : \theta_i \neq \theta_j \text{ for at least one pair } (i, j)

The Kruskal-Wallis H statistic is equivalent to the Kruskal-Wallis Chi-Squared statistic, which is calculated with the kruskal.test() function in R. The H statistic is computed by the following procedure:

• Form a joint ranking of all N = \sum_{j=1}^{k} n_j observations. Let r_{ij} denote the rank of observation x_{ij} in the joint ranking.

• The statistic H is defined as H = \frac{12}{N(N+1)} \sum_{j=1}^{k} n_j \left(\bar{R}_j - \frac{N+1}{2}\right)^2, where R_j = \sum_{i=1}^{n_j} r_{ij} and \bar{R}_j = \frac{R_j}{n_j} for each j = 1, \ldots, k.

The null hypothesis is rejected in the event that high or low ranks dominate in one or more samples.
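The sketch below applies kruskal.test() and follow-up pairwise Wilcoxon comparisons in R. The simulated data frame is a toy stand-in for the songs-by-generation data, and the unadjusted p-values would be compared against a Bonferroni-corrected level, as done in Section 5.3.

```r
set.seed(1)
songs <- data.frame(
  generation   = factor(rep(1:4, each = 30)),     # toy group labels
  danceability = runif(120, min = 0.4, max = 0.9) # toy feature values
)

kruskal.test(danceability ~ generation, data = songs)

# Pairwise Wilcoxon tests for all 6 generation pairs; p-values are left
# unadjusted here and judged against the Bonferroni level alpha/6.
pairwise.wilcox.test(songs$danceability, songs$generation,
                     p.adjust.method = "none")
```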

5.3 Comparing K-pop Generations

K-pop Generations are a distinct way that the genre is categorized in accordance with the evolution of the genre over time. The generations are an attribute of the artist, as the concept is roughly defined by when an artist debuts into the music scene. Because the K-pop generations roughly define trends in K-pop over time, it will be interesting to see the exact differences that may exist in audio features between the 4 generations. 16.5% of the songs were released by generation 1 artists, 46.3% by generation 2, 31.5% by generation 3, and 5.6% by generation 4. While 5.6% may seem too small a proportion to compare to the other samples, the sample size of songs from 4th generation artists is 673, which is large enough to be analyzed with the Kruskal-Wallis Test. Furthermore, the first generation is only 16.5% of the data set since many of those artists' songs are not available on the streaming platform of Spotify. Therefore, this generation is underrepresented compared to the other generations.

Table 5.1 Kruskal-Wallis Test Results: Generations of K-pop

Audio Feature     Kruskal-Wallis Chi-Squared   P-Value     Significant
Popularity        4079.200                     < 2.2e-16   ***
Duration          749.450                      < 2.2e-16   ***
Acousticness      81.984                       < 2.2e-16   ***
Danceability      184.520                      < 2.2e-16   ***
Energy            28.023                       3.591e-06   ***
Instrumentalness  681.010                      < 2.2e-16   ***
Speechiness       258.186                      < 2.2e-16   ***
Loudness          681.810                      < 2.2e-16   ***
Tempo             36.012                       7.444e-08   ***
Valence           92.884                       < 2.2e-16   ***

The results of the Kruskal-Wallis test are given in Table 5.1 where the significant column indicates whether the hypothesis test returns significant results indicated by the significance level codes from R: ”Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1”. All of the hypothesis tests for each audio feature are significant under a level of significance α = 0.05 which means there is sufficient evidence to reject the null hypothesis and conclude that at least one of the generations has a population median that differs from the others.

Now, to investigate which of the generations significantly differ from one another and the estimated differences, pairwise comparisons with the Wilcoxon test will be performed for each of the audio features with a Bonferroni correction of \alpha/6 = 0.05/6 = 0.0083. The audio features for which all of the pairwise comparisons are significant at the \alpha/6 = 0.0083 significance level are Popularity, Duration, Speechiness, and Loudness. In other words, these are the audio features for which every pair of generations is significantly different from one another. For example, the Popularity table shows the results for the Wilcoxon statistic, p-value, 99.17% confidence interval, and whether the result is significant according to the significance level codes. Note that the differences are calculated in the manner that the pair is listed. For example, the difference in population medians for generation pair "1 - 2" can be expressed as

\theta_1 - \theta_2. The Popularity table is shown as an example of the results, but the rest of the discussion for pairwise comparisons will reference the remaining pairwise Wilcoxon test results in the appendix.

Table 5.2 Pairwise Comparison Results: Popularity by Generation

Generation Pair   W         P-Value     99.17% Confidence Interval   Significant
1-2               3315856   < 2.2e-16   (-9.000078, -7.999948)       ***
1-3               808533    < 2.2e-16   (-27.00002, -25.00003)       ***
1-4               40423     < 2.2e-16   (-36.99996, -34.00001)       ***
2-3               4827719   < 2.2e-16   (-18.00001, -16.00003)       ***
2-4               364107    < 2.2e-16   (-28.00005, -25.00000)       ***
3-4               856299    < 2.2e-16   (-10.999985, -7.000043)      ***

As seen in Table 5.2 above, there are significant differences in popularity between each pair of K-pop generations. One attribute of this difference can be explained by Spotify's popularity score, which considers both the number of plays and how recent those plays are. Therefore, a song that is newly released is going to be more popular than an older song with the same number of plays but less recent activity. As expected, the detected differences increase as the gap between generations increases. What is notable, however, is that while the differences between neighboring generations, such as generation one versus generation two and generation three versus generation four, have 99.17% confidence intervals that estimate the difference to be between about (-9, -8) and (-11, -7) respectively, the difference between generation two and generation three popularity is much larger: with 99.17% confidence, the popularity of generation 2 songs is between about 16 and 18 units less than generation 3. This may provide evidence that there is a distinct popularity difference between the older generations (1 & 2) and the new (3 & 4).

The pairs are significantly different in median Duration such that the older generation in the pair is estimated to have a longer duration than the newer generation. Furthermore, as the gap in generation grows larger, so does the estimated difference in median duration. We can see this trend as we follow the estimated confidence intervals for the pairs involving the first generation. With 99.17% confidence, the median duration of a generation one song is between about 10 to 14 seconds more than the second generation, about 17 to 21 seconds longer than the third generation, and finally about 23 to 30 seconds longer than the fourth generation. This evidence supports the overall trend that the duration of music is decreasing over time.

Overall, the pairwise comparisons for the difference in median Loudness show that the older generation in the pair has a more negative loudness measure than the newer generation it is being compared to. In other words, the newer generation songs are estimated to be louder than the older generation songs. The difference in loudness is observed to be greater in the pairwise comparisons to generation one, where generation one's median loudness is significantly more negative than the newer generations. For example, there is 99.17% confidence that the median difference between generation one and generation two is between about -1.069 and -0.805 decibels, while there is 99.17% confidence that the median difference between generation two and generation three is between about -0.390 and -0.208 decibels. While the loudness measure ranges from -60 to 0, the large majority of the data falls within the range of -10 to 0, which makes a 1 decibel difference fairly notable in context. These observed median differences could be attributed to the changes in audio technology over time, which allow music producers to master music at higher decibel levels. The considerable difference in generation one's comparisons to the rest of the pairs may indicate that the more significant change in audio technology could have happened after the first generation.

While the pairwise comparisons of Speechiness all returned significant, the estimated differences are very small in the context of the Speechiness range between 0 and 1. For example, the greatest difference detected is between generations one and four, where the generation four Speechiness median is estimated to be between about 0.0057 to 0.013 units more than generation one with 99.17% confidence. The upper limit difference of 0.013 is, however, not impactful in the context of the Speechiness range from 0 to 1.

Finally, while not all of the pairs were significant for the remaining audio features, there were some noticeable measures of difference for select statistically significant pairs comparing Acousticness and Valence between K-pop generations. The pairs that are statistically different in their median Acousticness are generation pairs 1-2, 1-3, 2-3, and 3-4. The comparisons with a notable difference in context of the range for Acousticness are pairs 1-3 and 3-4. There is 99.17% confidence that the median Acousticness of generations one and three differs by an interval of about (-0.033, -0.016), where the Acousticness of generation three is greater. Likewise, there is 99.17% confidence that the median Acousticness of generations three and four differs by an interval of about (0.01, 0.036), where the Acousticness of generation three is greater than generation four. Furthermore, the difference in Acousticness between generations one and four is not significant, nor is the difference between generations two and four. Since the definition of acoustic references a musical sound with the absence of electronic modification, this implies that the electronic sound that trended in the older generations is making a comeback in the newest generation of music. However, these estimated differences in the population median are not particularly large in context of the range of Acousticness, which is a continuous measure between 0 and 1.

For Valence, only the pairwise comparisons involving the first generation are statistically significant, which provides sufficient evidence to reject the null hypothesis and conclude that the first generation's median valence score is significantly different from generations two, three, and four. With 99.17% confidence, the median valence score of the first generation is between 0.029 to 0.060 units greater than the second generation, 0.042 to 0.074 units greater than the third generation, and 0.038 to 0.09 units greater than the fourth generation. Overall, the music released by the first generation conveys more 'musical positiveness' than all generations after it. One theory for this could be that South Korea's IMF financial crisis occurred in 1997 during the first generation, and the music being released might have served the purpose of providing a source of positivity to the public. Another theory could be that, due to the increasing diversity of K-pop, there may be an increased proportion of songs with lower valence in the genre.

5.4 Comparing Male and Female Artists

There are three categories for artist gender: Male, Female, and Coed. Coed refers to a group that has a mix of male and female artists. Songs by Male artists make up 59.6% of the data, Female artists make up 39%, and Coed groups make up only 1.4% of the data. While the nonparametric analysis used in this section can be done with unequal sample sizes, the sample sizes must be adequately large. In this case, the number of songs released by a Coed group is just 170, which is too small to be compared to groups with sample sizes in the thousands. For this reason, this section will only make comparisons of the audio features between Male and Female artists. The hypotheses are as follows,

H0 : θx = θy (5.4.1)

H1 : θx ≠ θy

which can also be expressed in the population median differences

H0 : θx − θy = 0 (5.4.2)

H1 : θx − θy ≠ 0

where xi are the female artist observations and yi are the male artist observations, and the median is treated as the location parameter characterizing each population. Here θx and θy represent the population medians of the female artists' and male artists' audio features, respectively.

This hypothesis test is conducted for each of the audio features separately. The 95% confidence intervals are also calculated in order to investigate the estimated differences that exist between the population medians. If 0 is contained within the interval, then this further supports the lack of evidence for rejecting the null. In other words, if 0 is contained in the 95% confidence interval, the male and female medians of the audio feature being analyzed are not significantly different from one another.
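For concreteness, the sketch below shows how one of these comparisons could be computed. It is a minimal illustration in Python (the W statistics and p-value formatting in the tables suggest the thesis used R's wilcox.test, which this is not), assuming NumPy arrays of one audio feature per gender, and it approximates the Hodges-Lehmann style confidence interval by quantiles of sampled pairwise differences.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def wilcoxon_median_comparison(x, y, alpha=0.05, n_pairs=200_000, seed=1):
    """Two-sided Wilcoxon rank-sum test plus an approximate
    confidence interval for the median difference x - y."""
    w, p = mannwhitneyu(x, y, alternative="two-sided")
    rng = np.random.default_rng(seed)
    # Sample pairwise differences instead of forming all len(x)*len(y) pairs;
    # their quantiles roughly approximate the rank-based interval.
    diffs = rng.choice(x, n_pairs) - rng.choice(y, n_pairs)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return w, p, (lo, hi)

# Hypothetical usage with a pandas DataFrame `songs` holding the features:
# female = songs.loc[songs.gender == "Female", "danceability"].to_numpy()
# male   = songs.loc[songs.gender == "Male",   "danceability"].to_numpy()
# w, p, ci = wilcoxon_median_comparison(female, male)
```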

Table 5.3 Wilcoxon Test Results: Male vs Female Artists

Audio Feature      W          P-Value     95% Confidence Interval          Significant
Popularity         17269160   0.00584     (6.619757e-05, 1.999983)         ***
Duration           15023888   < 2.2e-16   (-5.841980, -3.863938)           ***
Acousticness       18778150   < 2.2e-16   (0.01747208, 0.02615184)         ***
Danceability       18947920   < 2.2e-16   (0.02104514, 0.02903244)         ***
Energy             17019986   0.1656      (-0.001047909, 0.008038362)
Instrumentalness   19834072   < 2.2e-16   (5.152227e-05, 2.934808e-05)     ***
Speechiness        15815847   1.674e-07   (-0.003534776, -0.001553387)     ***
Loudness           19152022   < 2.2e-16   (0.364007, 0.491073)             ***
Tempo              17524256   3.205e-05   (0.9780129, 2.9559491)           ***
Valence            18084946   4.475e-13   (0.02201195, 0.03899812)         ***

The differences in audio feature medians between songs produced by female artists and those produced by male artists are statistically significant for all features with the exception of Energy. This means that Male and Female artists produce music with a similar median level of Energy. The features for which the female artists have a significantly higher median are Popularity, Acousticness, Danceability, Instrumentalness, Loudness, Tempo and Valence, whereas for Duration and Speechiness the male artists have a significantly higher median. However, the estimated intervals of difference for these features are small in the context of the range of each feature. For example, there is 95% confidence that the popularity scores of songs produced by Female artists are between about 0.0000662 and 1.99998 units greater than those of the Male artists. While the difference is significant, considering that the popularity score ranges from 0 to 100, an upper limit difference of 2 popularity units is not impactful in context. Overall, one can conclude that while there exist significant differences in the median measurements of the audio features of songs made by Male and Female artists, the differences are minimal in context.

5.5 Comparing Group and Solo Artists

This section compares audio features between types of K-pop artists, either group or solo. Groups dominate the mainstream K-pop industry: 84.3% of all songs in the dataset are released by groups, whereas 15.7% are released by soloists. The hypotheses for each Wilcoxon test comparing the population median of group artists to that of solo artists, for each audio feature, are:

H0 : θx = θy (5.5.1)

H1 : θx ≠ θy

which can also be expressed in the population median differences

H0 : θx − θy = 0 (5.5.2)

H1 : θx − θy ≠ 0

where xi are the group artist observations and yi are the solo artist observations, and θx and θy are the population medians of the audio features for group and solo artists, respectively.

Table 5.4 Wilcoxon Test Results: Group vs Solo Artists

Audio Feature      W          P-Value     95% Confidence Interval            Significant
Popularity         9236738    0.03005     (-1.999938, -5.320767e-05)         *
Duration           8820586    2.21e-07    (-5.251045, -2.372974)             ***
Acousticness       7796532    < 2.2e-16   (-0.05300053, -0.03538317)         ***
Danceability       9778026    0.08042     (-0.000967768, 0.011008498)
Energy             12570686   < 2.2e-16   (0.07899757, 0.09498637)           ***
Instrumentalness   8729034    4.864e-14   (-3.225573e-05, -4.457195e-05)     ***
Speechiness        10189583   2.282e-06   (0.001700548, 0.004244932)         ***
Loudness           11851162   < 2.2e-16   (0.6829554, 0.8619889)             ***
Tempo              10606602   9.587e-15   (3.931052, 6.029029)               ***
Valence            10771683   < 2.2e-16   (0.04101360, 0.06407435)           ***

Judging significance with a level of α = 0.05, all of the differences between Group and Solo artist music are significant with the exception of Danceability. This is an interesting observation, since the element of dancing and attractive choreography is an expected element of K-pop groups rather than soloists. Although the feature of having dancing involved in performances is separate from Spotify's Danceability, one would expect the Danceability of group K-pop artists to be greater than that of the soloists since the music is purposefully paired with choreography. However, the data shows that there is no significant difference in the median of the Danceability measure. Some rationale for this could be that there are soloists who do release music paired with catchy dances, such as Psy and JYP. In addition, there are K-pop groups in this dataset that do not pair their songs with dances, mainly bands such as F.T. Island and CNBlue.

The group songs have significantly greater medians than soloists for the features of Energy, Speechiness, Loudness, Tempo, and Valence. In turn, soloists have significantly greater medians than the groups for the features of Popularity, Duration, Acousticness, and Instrumentalness. It is interesting that soloists have significantly higher median scores for Popularity despite only making up around 16% of songs in the dataset. Perhaps soloists in the mainstream K-pop market are significantly higher in popularity because they have had to achieve high success just to break into the highly saturated market of K-pop groups. Examples of these successful solo artists include IU, BoA, or Rain. Alternatively, the high popularity could be attributed to soloists who were already popular before going solo due to their current or previous involvement in a K-pop group, such as members of Girls' Generation and BigBang, or CL from 2NE1.

However, the estimated significant differences for some of the audio features between group and solo artists are not particularly large in context. The 95% confidence interval for the difference in median popularity can be used as an example here. There is 95% confidence that solo artists have a median popularity score between about 0.000053 and 1.99998 units greater than the group artists. But considering the range of popularity is 0 to 100, this is a relatively small difference in context.

There do exist significant differences that are meaningful in context of the audio feature's range. The most notable one is Energy, where there is 95% confidence that the difference between the population medians is about 0.079 to 0.095 units larger for group artists than solo artists. Considering the range of the Energy measure is 0 to 1, the upper limit difference of 0.095 is nearly a tenth of the scale, a much more notable difference than, say, an upper limit difference of 0.01. This significant and notable difference in Energy could possibly be explained by the fact that solo artists are likely to have more ballad style songs in their discography or to contribute ballad style songs to Korean television soundtracks. Furthermore, this higher level of detected energy could be related to the trend that K-pop groups commonly pair their songs with dance performances, which is not a common expectation for soloists. Other variables with a notable size of difference are Valence and Acousticness.

Overall, one can differentiate the audio features between group and solo artists: groups are likely to have higher Energy, Speechiness, Loudness, Tempo and Valence measures, where the differences in Energy and Valence are notable in context, whereas solo artists are likely to have higher Popularity, Duration, Acousticness and Instrumentalness measures, where the difference in Acousticness is notable.

CHAPTER 6 CLASSIFYING NEW GENERATION SONGS

6.1 Introduction

This section explores the ability of the audio features to identify whether a song belongs to a new generation or an old generation of the K-pop genre. For the purposes of this research, generations one and two are considered the older generation of K-pop, which lasts from about 1992 to 2012. Generations three and four are considered the new generation, defined as K-pop artists who are active from around 2012 to 2021. However, because generations are defined by the year an artist debuted rather than the time a song was released, there is some overlap, with older generation artists such as J.Y. Park and PSY releasing music in the new generations. This could be a source of variation and misclassification in the analysis.

This division of new versus old is made, rather than running the classification on the 4 groups separately, because the newer generation is the era of K-pop that not only achieved mainstream global recognition, but also aims to appeal to the global audience. There had been some global success for groups in the older generation via the Hallyu Wave; however, their wider influence had not yet reached recognition on a global scale. With the success of BTS and BlackPink (from the third generation), K-pop is now mainstream around the world. This research is mainly interested in the contribution of the audio features to this global breakthrough rather than modeling the intricacies of the 4 separate generations, although their differences are briefly discussed in chapter 5. To evaluate the contribution and accuracy of the audio features' ability to classify a song into the new generation versus the old generation, binary logistic regression analysis will be used.

Therefore, the response variable is the binary classification of a song into the new generation or the old generation. Classification into the new generation is considered the success outcome. The predictor variables used are the audio features of Popularity, Duration, Acousticness, Danceability, Energy, Instrumentalness, Key, Loudness, Speechiness, Tempo and Valence. The data set is split into two groups based on musical mode; therefore, Binary Logistic Regression is fitted to minor mode (Mode = 0) and major mode (Mode = 1) songs separately. For the model building process, each of the Mode datasets is then split into a training, validation and test set by ratio 70:15:15. The training set is used to fit the model, the validation set is used to determine the optimal cutoff value, and finally the test set is used to evaluate the performance of the models. In this section only the final models will be presented; the other models can be found in appendix B.

6.2 Minor Mode Results

Table 6.1 Generation Logistic Model Diagnostics for Minor Mode Songs

Model              Cutoff Value   AIC      Accuracy   Kappa    Sensitivity   Specificity
Full Logistic      0.35           3815.2   0.7175     0.4222   0.7143        0.7196
Stepwise 10CV      0.35           3804.3   0.7218     0.4320   0.7253        0.7196
All Subsets        0.36           3813.4   0.7247     0.4342   0.7106        0.7336
Ridge 10CV         0.37           NA       0.7261     0.4360   0.7070        0.7383
Lasso 10CV         0.35           NA       0.7361     0.4597   0.7363        0.7360
Elastic Net 10CV   0.35           NA       0.7375     0.4616   0.7326        0.7407

The model with the highest classification accuracy and a good balance between sensitivity and specificity is the Elastic Net model.

\[
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 0.116 &+ 0.070\,Popularity - 0.008\,Duration + 0.737\,Acousticness \\
&+ 1.771\,Instrumentalness + 0.305\,Key1 + 0.174\,Key2 - 0.129\,Key3 + 0.454\,Key7 \\
&- 0.0004\,Key10 + 0.170\,Loudness + 1.154\,Speechiness - 0.001\,Tempo - 0.847\,Valence
\end{aligned} \tag{6.2.1}
\]

An increase in the values of Popularity, Acousticness, Instrumentalness, Loudness, or Speechiness, and choosing Key1 (C#/D-flat Minor), Key2 (D Minor), or Key7 (G Minor) rather than Key0 (C Minor), contributes to a multiplicative increase in the odds of being categorized as a new generation K-pop song. Whereas an increase in the values of Duration, Tempo, or Valence, and choosing Key3 (D#/E-flat Minor) or Key10 (A#/B-flat Minor) instead of Key0 (C Minor), contribute to a multiplicative decrease in the odds of being categorized as a new generation K-pop song. In other words, an increase in value for these variables is more indicative of a song from the old generation. The variables Instrumentalness and Speechiness are the most influential in the positive direction, while the variable Valence is the most influential in the negative direction.

6.3 Major Mode Results

Table 6.2 Generation Logistic Model Diagnostics for Major Mode Songs

Model              Cutoff   AIC      Accuracy   Kappa    Sensitivity   Specificity
Full Logistic      0.34     5936.1   0.7575     0.5006   0.7689        0.7507
Stepwise 10CV      0.34     5925.4   0.7557     0.4974   0.7689        0.7478
All Subsets        0.34     5934.0   0.7566     0.4995   0.7713        0.7478
Ridge 10CV         0.35     NA       0.7566     0.4985   0.7664        0.7507
Lasso 10CV         0.34     NA       0.7520     0.4894   0.7616        0.7464
Elastic Net 10CV   0.35     NA       0.7566     0.4952   0.7494        0.7609

The model that yields the highest accuracy and optimal balance between the Sensitivity and Specificity is the Full Logistic Regression model.

\[
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -0.783 &+ 0.079\,Popularity - 0.009\,Duration + 0.882\,Acousticness + 0.665\,Danceability \\
&+ 0.537\,Energy + 2.471\,Instrumentalness - 0.415\,Key1 + 0.002\,Key2 \\
&+ 0.149\,Key3 + 0.280\,Key4 - 0.217\,Key5 - 0.280\,Key6 - 0.161\,Key7 \\
&- 0.185\,Key8 + 0.006\,Key9 - 0.071\,Key10 - 0.094\,Key11 + 0.154\,Loudness \\
&+ 0.792\,Speechiness + 0.0002\,Tempo - 0.742\,Valence
\end{aligned} \tag{6.3.1}
\]

The audio features in this model that multiplicatively increase the odds of being categorized as a new generation K-pop song are Popularity, Acousticness, Danceability, Energy, Instrumentalness, Loudness, Speechiness, Tempo and choosing Key2 (D Major), Key3 (D#/E-flat Major), Key4 (E Major), or Key9 (A Major) instead of Key0 (C Major). The features that multiplicatively decrease the odds of being categorized as a new generation K-pop song are Duration, Valence and choosing Key5 (F Major), Key6 (F#/G-flat Major), Key7 (G Major), Key8 (G#/A-flat Major), Key10 (A#/B-flat Major), or Key11 (B Major) instead of Key0 (C Major).

Only the variables Popularity, Duration, Acousticness, Danceability, Instrumentalness, Key1 (C#/D-flat Major), Loudness, Speechiness, and Valence are significant predictors contributing to the log odds of categorizing a song into the newer generations of K-pop versus the older generation for songs composed in the major mode.

6.4 Comparison of Minor and Major Mode Models

Overall, classifying songs composed in a major mode into the new or old generations requires more information than songs composed in a minor mode. Compared to the full logistic model, which optimizes the prediction for the major mode songs, the optimal elastic net logistic model for minor mode songs removes the variables of Danceability, Energy, Key4 (E Minor), Key5 (F Minor), Key6 (F#/G-flat Minor), Key8 (G#/A-flat Minor), Key9 (A Minor), and Key11 (B Minor). For both major and minor mode songs, an increase in the variables Popularity, Acousticness, Instrumentalness, Loudness, or Speechiness, or being composed in Key2 (D Major or Minor) rather than Key0 (C Major or Minor), increases the odds of being classified as a new generation song on average. Whereas an increase in Duration or Valence decreases the odds of being classified as a new generation song on average.

Since higher values of Valence indicate a happier or more positive mood in a song, it is interesting that an increase in this feature would multiplicatively decrease the odds of being categorized into the newer generation. Considering the range for Valence is just 0 to 1, a 0.1 increase in Valence for a minor mode song multiplies the odds of categorization into the new generation by e^{-0.847×0.1} ≈ 0.92. For a major mode song, the multiplicative decrease per 0.1 units of Valence is comparable at e^{-0.742×0.1} ≈ 0.93. This tells us that perhaps the music in the older generations had a higher tendency to be composed with a musical sound conveying happiness and joy compared to the newer generation. This observation is consistent with the nonparametric comparison of Valence between the four generations made in chapter 5; however, the results in this chapter quantify the contribution of Valence towards classifying a song into the new or old generation. Similarly, the multiplicative decrease in odds due to an increase in Duration tells us that older generation songs tended to be longer and those in the newer generation are shorter on average.

The increase in Popularity multiplicatively increasing the odds of being a new generation song matches the definition of Spotify's Popularity measure as well as the observations made in the exploratory data analysis. Since Spotify measures Popularity in relation to time, considering recent songs that have been streamed most frequently to have higher Popularity, we would expect Popularity to increase the odds of a song belonging to the new generation. With each unit increase in Popularity, the odds increase multiplicatively by e^{0.070} ≈ 1.07 for songs in a minor mode and by e^{0.079} ≈ 1.08 for songs in a major mode. Interestingly, however, this multiplicative increase is not as great as the increase caused by other variables such as Acousticness and Instrumentalness.

Acousticness is defined as the confidence of detecting acoustic qualities in the music, where acoustic music primarily uses musical instruments with the absence of electronic influence. With coefficients around 0.8, the odds of categorization as a new generation song increase by e^{0.737×0.1} ≈ 1.076 for songs in a minor mode and e^{0.882×0.1} ≈ 1.092 for songs in a major mode with every 0.1 increase in Acousticness. Instrumentalness, which is colloquially discussed alongside Acousticness, predicts the proportion of a song that does not contain vocals or rap. This is the most heavily weighted factor in the model. A 0.1 unit increase in the Instrumentalness of a song increases the odds of classification into the new generation by e^{1.771×0.1} ≈ 1.194 for songs in a minor mode and e^{2.471×0.1} ≈ 1.280 for songs in a major mode. These observations of Acousticness and Instrumentalness being very influential in predicting the odds of being classified into the new generation can possibly be explained by the heavy techno and electronic dance influence on music in the older generations. Although new influences such as EDM play an influential role in the music of the newer generations, the Korean pop music genre has diversified to include many other musical influences such as House, Tropical, Rock and more. Furthermore, it has become more common in the newer generation to use a musical excerpt as the chorus rather than the main vocals.

Some differences in variable influence between the two song types include the weight of Speechiness on the odds ratio. Speechiness is more influential for the minor mode songs, where βSpeechiness = 1.154, while the Speechiness coefficient for the major mode songs is βSpeechiness = 0.792. Other discrepancies are the directions of the coefficients of Key1, Key3, and Key7, which are opposite for minor mode songs versus major mode songs. This makes sense musically since, say for Key7, G Minor and G Major are not equivalent to one another. Another notable discrepancy is that the direction of Tempo differs between the two modes. Songs in a minor mode have a negative coefficient, meaning that with every 1 unit increase in Tempo the odds of categorization as a new generation song decrease multiplicatively, whereas songs in a major mode have a positive coefficient such that a unit increase in Tempo increases the odds multiplicatively. This shows that the relationship between generation and Tempo is opposite for major and minor mode songs.

CHAPTER 7 PREDICTING SONG RELEASE DATE

7.1 Introduction

Because a song's assigned K-pop generation is determined by the artist's debut date, there can exist a disconnect between the song's date of release and the defined time period of its categorized K-pop generation. For example, SHINHWA is a first generation male idol group that debuted in the late 1990s and has sustained their career into the 4th generation, with their most recent EP release in 2018. Therefore, while SHINHWA is labeled as a 1st generation K-pop group, their music has been released throughout the timeline of the 2nd to 4th generations due to their longevity. The analysis could therefore be more indicative of the music's change over time by exploring the relationship between audio features and the song's release date rather than its classified K-pop generation. With this approach, we aim to answer the question, "Can the audio features predict a song's month of release?", with multiple linear regression.

While multiple linear regression is a predictive modeling technique, MLR will solely be used to analyze the relationship between the audio features and the release dates of songs from 1992 to 2021. The models developed in this section are not intended to predict release dates of songs outside the time period of 1992 to 2021. Therefore, the goal of measuring prediction performance is to gauge the accuracy of the model in measuring the audio features' contributions towards identifying when a song was released.

7.2 Data Preparation

The variable release date is stored in the data set as a "yyyy-mm-dd" timestamp. To convert the song's release date to a continuous variable, the timestamp is converted into the number of months that have passed since the reference date of January 1992. January 1992 is used as the reference date because the first song in the data set was released in March of 1992 by Seo Taiji and Boys. The data collected spans from Seo Taiji and Boys' release in March 1992 to the newest song, released in January 2021. Therefore, the smallest value of month of song release is 3 and the largest value is 349.

The response variable for this analysis is Months, representing the number of months since January 1992 that the song was released. The predictor variables are Popularity, Duration, Acousticness, Danceability, Energy, Instrumentalness, Key, Loudness, Speechiness, Tempo, and Valence. Key is a factor variable which is modeled as dummy variables Key1 - Key11, where Key0 is the reference category. Furthermore, the analysis is divided into two parts in order to build and evaluate models for songs composed in a minor mode (Mode = 0) and a major mode (Mode = 1) separately. For model building and evaluation, the minor and major mode songs are each split into a 75% training data set and a 25% test data set.
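A small sketch of this conversion, assuming the release dates are stored as "yyyy-mm-dd" strings in a pandas Series (the column name is hypothetical); the arithmetic reproduces the endpoints quoted above (March 1992 → 3, January 2021 → 349):

```python
import pandas as pd

def months_since_jan_1992(dates: pd.Series) -> pd.Series:
    """Convert 'yyyy-mm-dd' release dates to months elapsed since January 1992,
    so that March 1992 -> 3 and January 2021 -> 349."""
    d = pd.to_datetime(dates)
    return (d.dt.year - 1992) * 12 + d.dt.month

# Example:
# months_since_jan_1992(pd.Series(["1992-03-21", "2021-01-15"]))  # -> 3, 349
```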

7.3 Model Assumptions

The distribution of months is moderately skewed left; therefore, the residuals of the multiple linear model will not meet the model assumptions. A transformation is needed.

Figure 7.1 Distribution of Month Release

The transformation log(360 − months) is ideal for meeting the assumptions about the residuals of the multiple linear regression model. However, this transformation makes the interpretability of the final model more difficult, so it is important to discuss the interpretation prior to the model performance evaluation.

Figure 7.2 Distribution of Transformed Month Release

This means, our final model will be in the form of:

\[ \log(360 - months) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon \tag{7.3.1} \]

Which can be re-expressed by ’undoing’ the log transform with exponentiation:

\[ 360 - months = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon} \tag{7.3.2} \]

or

\[ 360 - months = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon) \tag{7.3.3} \]

Therefore, we would interpret the coefficient \( \beta_1 \) as: "With all other variables held constant, with a one unit increase in \( x_1 \), the outcome \( 360 - \widehat{months} \) will increase (decrease if the coefficient is negative) multiplicatively by about \( e^{\beta_1} \) on average."

Because the outcome of this transformed equation is \( 360 - \widehat{months} \), a coefficient with a multiplicative increase on this transformed outcome corresponds to a lower, or earlier, estimate for the month of release. On the other hand, a multiplicative decrease in the transformed outcome \( 360 - \widehat{months} \) corresponds to a higher, or later, estimated month of release.
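As a worked example of this back-transformation, using an illustrative fitted value of 4.0 on the transformed scale:

\[ \widehat{\log(360 - months)} = 4.0 \;\Longrightarrow\; \widehat{months} = 360 - e^{4.0} \approx 360 - 54.6 = 305.4, \]

i.e., a predicted release around month 305, roughly mid-2017 on the K-pop timeline.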

7.4 Minor Mode Results

Prior to fitting the multiple linear models, the assumptions for the model must be checked. With the transformed months response fitted to a full multiple linear model on all predictor variables, the following diagnostic plots can be assessed:

Figure 7.3 Diagnostic Plots for Minor Mode Song Release Dates

Interpreting the diagnostic plots:

• Residuals vs. Fitted: The line through this plot is a horizontal straight line through 0. This indicates that the assumption of the predictor variables being linearly related to the response is met.

• Normal-QQ: The residuals roughly follow the diagonal Normal-QQ line, indicating that the residuals are roughly normally distributed. Thus, the normality of residuals assumption is met.

• Scale-Location: The red line passes through the scale-location graph horizontally. Therefore, the residuals satisfy the homoscedasticity assumption; in other words, the MLR requirement of equal variance across residuals is satisfied.

• Residuals vs. Leverage: There are some residuals that fall below a standardized residual of -3, however not an alarming amount compared to the rest of the data set.

The following table reports the Root Mean Square Error (RMSE) and the R-Squared statistics to evaluate the performance of the models. We will want to choose a model that has a combination of the lowest RMSE and highest R-squared.

Table 7.1 Release Date MLR Model Diagnostics for Minor Mode Songs

Model              RMSE      R²
Full MLR           0.59768   0.39321
Stepwise 10CV      0.59802   0.39254
All Subsets        0.59802   0.39254
Ridge 10CV         0.59818   0.39215
Lasso 10CV         0.59759   0.39324
Elastic Net 10CV   0.59761   0.39316
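These prediction statistics can be computed on the held-out test set as sketched below (scikit-learn metric functions; the fitted model object is a placeholder):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(model, X_test, y_test):
    """Report RMSE and R-squared on the transformed log(360 - months) scale,
    matching the comparison criteria in Table 7.1."""
    y_hat = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_hat))
    return rmse, r2_score(y_test, y_hat)
```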

With the lowest RMSE and the highest R² value resulting from the prediction assessments, the Lasso method provides the best model for predicting the month a song was released based on the Spotify audio features. The Lasso model keeps all of the predictor variables except for Key2 and Key5.

\[
\begin{aligned}
\log(360 - \widehat{months}) = 2.837 &- 0.022\,Popularity + 0.003\,Duration - 0.240\,Acousticness \\
&+ 0.409\,Danceability + 0.533\,Energy - 0.808\,Instrumentalness \\
&- 0.027\,Key1 + 0.004\,Key3 + 0.092\,Key4 + 0.024\,Key6 - 0.056\,Key7 \\
&- 0.068\,Key8 + 0.082\,Key9 + 0.099\,Key10 + 0.077\,Key11 \\
&- 0.084\,Loudness - 0.280\,Speechiness + 0.001\,Tempo + 0.321\,Valence
\end{aligned} \tag{7.4.1}
\]

The equation below is the Lasso model rewritten with the log transformation reversed by exponentiation.

\[
\begin{aligned}
360 - \widehat{months} = \exp(2.837 &- 0.022\,Popularity + 0.003\,Duration - 0.240\,Acousticness \\
&+ 0.409\,Danceability + 0.533\,Energy - 0.808\,Instrumentalness \\
&- 0.027\,Key1 + 0.004\,Key3 + 0.092\,Key4 + 0.024\,Key6 - 0.056\,Key7 \\
&- 0.068\,Key8 + 0.082\,Key9 + 0.099\,Key10 + 0.077\,Key11 \\
&- 0.084\,Loudness - 0.280\,Speechiness + 0.001\,Tempo + 0.321\,Valence)
\end{aligned} \tag{7.4.2}
\]

With all other variables held constant, the variables that multiplicatively increase the pseudo-months outcome (360 − months) with a unit increase in the predictor are: Duration, Danceability, Energy, Tempo, Valence, or having a minor key of Key3 (D#/E-flat Minor), Key4 (E Minor), Key6 (F#/G-flat Minor), Key9 (A Minor), Key10 (A#/B-flat Minor), or Key11 (B Minor) instead of Key0 (C Minor); these correspond to earlier estimated release dates. Among these positive coefficients, Danceability, Energy, and Valence have the greatest multiplicative effect on the outcome.

With all other variables held constant, a unit increase in the variables Popularity, Acousticness, Instrumentalness, Loudness, or Speechiness, or being composed in Key1 (C#/D-flat Minor), Key7 (G Minor), or Key8 (G#/A-flat Minor) instead of Key0 (C Minor), would multiplicatively decrease the predicted transformed outcome, corresponding to later estimated release dates. The variable Instrumentalness is estimated to have the greatest weight in multiplicatively decreasing the predicted transformed outcome, followed by Speechiness and Acousticness.

7.5 Major Mode Results

Similar to the process of model fitting for the minor mode songs, the assumptions for the MLR model need to be checked first.

The diagnostic plots in figure 7.4 show the following assumptions being met: a linear relationship between the response and predictor variables, normally distributed residuals, and equal variance among residuals.

Figure 7.4 Diagnostic Plots for Major Mode Song Release Dates

Table 7.2 Release Date MLR Model Diagnostics for Major Mode Songs

Model              RMSE      R²
Full MLR           0.60783   0.33124
Stepwise 10CV      0.60707   0.33294
All Subsets        0.60712   0.33283
Ridge 10CV         0.60786   0.33231
Lasso 10CV         0.60739   0.33228
Elastic Net 10CV   0.60737   0.33236

The model with the smallest RMSE (on the log(360 − y) scale) and largest R-squared value is the multiple linear regression from the Stepwise method using 10-fold cross validation. The Stepwise multiple regression leaves out the dummy variables Key2, Key3, Key4, Key11, and the variable Speechiness. All of the variables in the model besides Key5 and Key10 have a significant relationship with the transformed month of song release at a significance level of α = 0.05.

\[
\begin{aligned}
\log(360 - \widehat{months}) = 3.020 &- 0.020\,Popularity + 0.002\,Duration - 0.124\,Acousticness + 0.630\,Danceability \\
&+ 0.357\,Energy - 0.765\,Instrumentalness + 0.062\,Key1 + 0.053\,Key5 \\
&+ 0.073\,Key6 + 0.079\,Key7 + 0.096\,Key8 + 0.060\,Key10 \\
&- 0.074\,Loudness + 0.001\,Tempo + 0.289\,Valence
\end{aligned} \tag{7.5.1}
\]

Converting it back to the original scale results in the following:

\[
\begin{aligned}
360 - \widehat{months} = \exp(3.020 &- 0.020\,Popularity + 0.002\,Duration - 0.124\,Acousticness + 0.630\,Danceability \\
&+ 0.357\,Energy - 0.765\,Instrumentalness + 0.062\,Key1 + 0.053\,Key5 \\
&+ 0.073\,Key6 + 0.079\,Key7 + 0.096\,Key8 + 0.060\,Key10 \\
&- 0.074\,Loudness + 0.001\,Tempo + 0.289\,Valence)
\end{aligned} \tag{7.5.2}
\]

With all other variables held constant, the variables that multiplicatively increase the pseudo-months outcome (360 − months) with a unit increase in the predictor are: Duration, Danceability, Energy, Tempo, Valence, or having a major key of Key1 (C#/D-flat Major), Key5 (F Major), Key6 (F#/G-flat Major), or Key10 (A#/B-flat Major) instead of Key0 (C Major). Among these positive coefficients, Danceability, Energy, and Valence have the greatest multiplicative effect on the outcome.

With all other variables held constant, a unit increase in the variables Popularity, Acousticness, Instrumentalness, or Loudness would multiplicatively decrease the predicted outcome of the transformed months. The variable Instrumentalness is estimated to have the greatest weight in multiplicatively decreasing the predicted transformed outcome, with a coefficient of -0.765.

7.6 Comparison of Minor and Major Mode Models

Generally, the performance of the models for minor mode songs is slightly better, with R-squared values around 0.39 compared to around 0.33 for the major mode songs. The RMSE values are comparable at around 0.60 for all of the models.

The signs of the coefficients for the optimal major and minor mode song models are all comparable, with the exceptions of Key1 and Key7. These variables cause a multiplicative decrease in the response for the minor mode model, while being Key1 or Key7 rather than Key0 creates a multiplicative increase for the major mode songs. Therefore, minor mode songs are more likely to have been released later in the timeline of K-pop if composed in the key of C#/D-flat Minor or G Minor rather than the key of C Minor, whereas songs are more likely to have been released earlier in the timeline of K-pop if composed in the key of C#/D-flat Major or G Major rather than the key of C Major.

Otherwise, the model coefficients for major and minor mode songs have the same sign direction for each of the predictor variables. Unit increases in the variables Popularity, Acousticness, Instrumentalness, or Loudness correspond to the song being predicted to have been released later in the timeline of K-pop; the variable that contributes the most weight in this direction is Instrumentalness. A unit increase in the variables Duration, Danceability, Energy, Tempo, or Valence corresponds to the song being predicted to have an earlier release. The most heavily weighted variables in this direction are Danceability, Energy, and Valence.

CHAPTER 8 PREDICTING POPULARITY

8.1 Introduction

The Spotify popularity score ranges from 0 to 100, where 100 is the most popular. The Spotify algorithm calculates the popularity score by considering the total number of plays a song has and how recent those plays are. K-pop is a diverse genre with a rich history, and the recent rise of its global popularity has made many wonder: what makes K-pop music so appealing? What qualities of the music are fans so crazy about? This section explores how audio features contribute to a K-pop song's popularity score on Spotify from both a prediction and a classification approach. Multiple Linear Regression will be used to compare models for predicting the exact popularity score, which will allow us to investigate precisely how the audio features contribute to the song's popularity. After that, Binary Logistic Regression will be used to see how audio features more broadly determine whether a song will be popular or not. Popular songs are defined as songs that have a popularity score of 50 or above; those with a popularity score below 50 are not considered popular.

8.2 Linear Regression Approach

8.2.1 Data Preparation

The goal of predicting popularity from a Multiple Linear Regression approach is to analyze the effect and relationship of the audio features on the exact value of the popularity score. Before building the models, a general understanding of the overall distribution of popularity is needed.

As observed in the exploratory data analysis, the overall distribution of the popularity scores is moderately right skewed, with few songs having a popularity score above 70 and the majority of data points having a popularity score below 20. The distribution with a fitted density curve can be observed in figure 8.1.

Figure 8.1 Distribution of Popularity

The distribution is clearly not normal; therefore, the data is unlikely to meet the assumptions of the multiple linear regression model. In order to meet these assumptions, the response variable will be transformed to be roughly normally distributed. Luckily, a simple square root transformation makes the popularity distribution roughly normal, as observed in figure 8.2. While the distribution is not perfectly normal, the transformation will likely improve the ability of the data to meet the assumptions of the multiple linear regression model. The model assumptions will be evaluated in the analysis with the report of the major and minor mode model results.

Figure 8.2 Distribution of Transformed Popularity

Prior to building the models, the data set is split with 75% in the training set and 25% allocated to the test set. This split is done for each of the mode groups. The response variable is the continuous variable Popularity, and the predictor variables are the continuous measures of Duration, Acousticness, Danceability, Energy, Instrumentalness, Loudness, Speechiness, Tempo, and Valence, plus the categorical variable Key.
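A brief sketch of this preparation step, with hypothetical column names for the data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical column names: 'popularity' (0-100) and 'mode' (0 = minor, 1 = major).
def prepare_popularity(df, mode, seed=7):
    """Square-root transform the response and make the 75/25 split for one mode group."""
    sub = df[df["mode"] == mode].copy()
    sub["sqrt_popularity"] = np.sqrt(sub["popularity"])  # pulls in the right skew
    return train_test_split(sub, test_size=0.25, random_state=seed)

# Predictions on the transformed scale are squared to return to the 0-100 scale:
# popularity_hat = y_hat ** 2
```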

8.2.2 Minor Mode Results

Before fitting and evaluating the MLR model, the assumptions of the model are checked. The interpretation of the diagnostic plots in figure 8.3 is as follows:

• Residuals vs. Fitted: The line through this plot is somewhat horizontal but still exhibits some curvature. Overall, though, there is roughly a linear relationship between the predictors and the outcome variable.


• Normal-QQ: The residuals roughly follow the diagonal Normal-QQ line, indicating that the residuals are normally distributed. In comparison to the original scale of the data, there is slightly more deviation on the upper tail, with the points falling below the Normal-QQ line.

• Scale-Location: The red line is roughly horizontal. Therefore, the residuals exhibit homoscedasticity; in other words, the model now satisfies the MLR requirement of equal variance across residuals.

• Residuals vs. Leverage: Most of the residuals fall within the standardized residual range of -3 to 3; therefore, there are few outliers in the data set.

Figure 8.3 Diagnostic Plots for Minor Mode Popularity

Table 8.1 Popularity MLR Model Diagnostics for Minor Mode Songs

Model              RMSE     R²
Full MLR           1.8635   0.1333
Stepwise 10CV      1.8612   0.1351
All Subsets        1.8612   0.1351
Ridge 10CV         1.8611   0.1340
Lasso 10CV         1.8588   0.1351
Elastic Net 10CV   1.8592   0.1351

The model that yields the lowest RMSE and highest R² from predictions is the Lasso regularized multiple linear regression model. The Lasso model leaves out the following predictor variables: Acousticness, Danceability, Key1, Key5, Key8, Key9, and Tempo.

\[
\begin{aligned}
\sqrt{\widehat{popularity}} = 11.897 &- 0.010\,Duration - 3.492\,Energy - 2.047\,Instrumentalness \\
&- 0.155\,Key2 + 0.212\,Key3 - 0.359\,Key4 + 0.221\,Key6 \\
&- 0.033\,Key7 - 0.040\,Key10 - 0.118\,Key11 \\
&+ 0.398\,Loudness + 1.755\,Speechiness - 1.135\,Valence
\end{aligned} \tag{8.2.1}
\]

An increase in Duration, Energy, Instrumentalness, or Valence, and choosing Key2 (D Minor), Key4 (E Minor), Key7 (G Minor), Key10 (A#/B-flat Minor), or Key11 (B Minor) instead of Key0 (C Minor), decreases the square root of the popularity score, on average. An increase in Loudness or Speechiness, and choosing Key3 (D#/E-flat Minor) or Key6 (F#/G-flat Minor) over Key0 (C Minor), increases the square root of the popularity score, on average.

8.2.3 Major Mode Results

Before fitting and evaluating the MLR model, the assumptions of the model are checked.

Figure 8.4 Diagnostic Plots for Major Mode Popularity

The interpretation of the diagnostic plots in figure 8.4 is as follows:

• Residuals vs. Fitted: The line through this plot is somewhat horizontal but still exhibits some curvature. Overall, though, there is roughly a linear relationship between the predictors and the outcome variable.

• Normal-QQ: The residuals roughly follow the diagonal Normal-QQ line, indicating that the residuals are normally distributed. In comparison to the original scale of the data, there is slightly more deviation on the upper tail, with the points falling below the Normal-QQ line.

• Scale-Location: The red line is roughly horizontal. Therefore, the residuals exhibit homoscedasticity; in other words, the model now satisfies the MLR requirement of equal variance across residuals.

• Residuals vs. Leverage: Some of the residuals fall outside of the standardized residual range of -3 to 3; therefore, there exist some outliers in the data set.

Table 8.2 Popularity MLR Model Diagnostics for Major Mode Songs

Model              RMSE     R²
Full MLR           1.8585   0.1258
Stepwise 10CV      1.8585   0.1257
All Subsets        1.8585   0.1257
Ridge 10CV         1.8582   0.1254
Lasso 10CV         1.8584   0.1256
Elastic Net 10CV   1.8584   0.1256

The model with the lowest RMSE and highest R² is the full multiple linear regression model, as reported in Table 8.2. The full multiple linear regression model is written:

\[
\begin{aligned}
\sqrt{\widehat{popularity}} = 10.857 &- 0.010\,Duration + 0.081\,Acousticness + 0.051\,Danceability \\
&- 3.078\,Energy - 1.877\,Instrumentalness + 0.321\,Key1 + 0.207\,Key2 \\
&+ 0.431\,Key3 - 0.029\,Key4 + 0.204\,Key5 + 0.217\,Key6 + 0.283\,Key7 \\
&+ 0.297\,Key8 + 0.271\,Key9 + 0.080\,Key10 + 0.390\,Key11 \\
&+ 0.386\,Loudness + 4.580\,Speechiness - 0.001\,Tempo - 0.849\,Valence
\end{aligned} \tag{8.2.2}
\]

The variables that contribute significantly at a significance level of α = 0.05 in predicting the popularity score are Duration, Energy, Instrumentalness, Key1, Key2, Key3, Key6, Key7, Key8, Key9, Key11, Loudness, Speechiness, and Valence.

A unit increase in Duration, Energy, Instrumentalness, Tempo, or Valence, and choosing Key4 (E Major) instead of Key0 (C Major), decreases the square root popularity score, on average. A unit increase in Acousticness, Danceability, Loudness, or Speechiness, and choosing Key1 (C#/D-flat Major), Key2 (D Major), Key3 (D#/E-flat Major), Key5 (F Major), Key6 (F#/G-flat Major), Key7 (G Major), Key8 (G#/A-flat Major), Key9 (A Major), Key10 (A#/B-flat Major), or Key11 (B Major) over Key0 (C Major) increases the square root popularity score, on average.

8.2.4 Comparison of Minor and Major Mode Models

Overall, the ability to predict the exact popularity score with a multiple linear regression model is not strong, since less than 20% of the variance is explained by either of the final models. Mode 1 songs need to keep all variables in the model for the best performance, whereas the Mode 0 songs can be optimally modeled with variable reduction through the Lasso regularized regression model. The performance of the Lasso model for Mode 0 songs is slightly better than the full MLR model for Mode 1 songs: the RMSE values are comparable at around 1.858, but the R² for Mode 0 songs is about 1% higher. This can be interpreted to mean that the Lasso model for Mode 0 songs explains about 1% more of the variation in the prediction results than the full MLR model for Mode 1 songs.

The audio features contribute to predicting the popularity score in the same directions for both Mode 0 and Mode 1. With all other variables held constant, an increase in Speechiness increases the square root of popularity; the contribution of Speechiness for Mode 1 songs is greater, with a coefficient of 4.580 compared to 1.755 for Mode 0 songs. Meanwhile, unit increases in Valence, Instrumentalness, and Energy decrease the square root of the popularity score, with a stronger negative effect for Mode 0 songs than for Mode 1 songs. The only variables that hold opposite effects for Mode 0 and Mode 1 songs are composition in Key2, Key7, Key10, or Key11 instead of Key0, which has a negative effect on popularity for Mode 0 and a positive effect for Mode 1 songs. A discrepancy in the effect of the keys is logical, since the key of, say, G Major and the key of G Minor are different musically.

8.3 Logistic Regression Approach

8.3.1 Minor Mode Results

Table 8.3 Popular Logistic Model Diagnostics for Mode 0 Songs

Model              Cutoff   AIC      Accuracy   Kappa    Sensitivity   Specificity
Full Logistic      0.14     2946.0   0.6205     0.1013   0.55319       0.63097
Stepwise 10CV      0.14     2928.0   0.6362     0.1007   0.52130       0.65400
All Subsets        0.14     2938.2   0.6505     0.1269   0.55319       0.66557
Ridge 10CV         0.14     NA       0.6405     0.0901   0.48936       0.66392
Elastic Net 10CV   0.14     NA       0.6405     0.0901   0.48936       0.66392

The best model is the All Subsets regression model, as it has the highest accuracy, sensitivity, and specificity values. Compared to the full logistic model, the model from All Subsets regression leaves out the variables Danceability, all Key dummy variables, and Tempo. All variables in this model, except for Instrumentalness, are considered significant predictors.

\[
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 5.369 &- 0.009\,Duration - 1.537\,Acousticness - 3.955\,Energy \\
&- 2.722\,Instrumentalness + 0.354\,Loudness + 3.278\,Speechiness \\
&- 1.445\,Valence
\end{aligned} \tag{8.3.1}
\]

A unit increase in Duration, Acousticness, Energy, Instrumentalness, or Valence multiplicatively decreases the odds of classification as a popular song, whereas a unit increase in Loudness or Speechiness multiplicatively increases the odds of categorization as a popular song.
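To turn a fitted log-odds value into a classification, the probability is recovered with the logistic function and compared against the 0.14 cutoff from Table 8.3. For an illustrative fitted log-odds of \( \hat{\eta} = -1.5 \):

\[ \hat{\pi} = \frac{e^{\hat{\eta}}}{1 + e^{\hat{\eta}}} = \frac{e^{-1.5}}{1 + e^{-1.5}} \approx 0.182 \ge 0.14 \;\Rightarrow\; \text{classified as popular.} \]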

8.3.2 Major Mode Results

Table 8.4 Popular Logistic Model Diagnostics for Mode 1 Songs

Model              Cutoff   AIC      Accuracy   Kappa    Sensitivity   Specificity
Full Logistic      0.11     4131.1   0.6167     0.1018   0.61947       0.61640
Stepwise 10CV      0.11     4121.0   0.6013     0.0857   0.60177       0.60121
All Subsets        0.11     4127.6   0.6022     0.0893   0.61062       0.60121
Ridge 10CV         0.11     NA       0.5758     0.0962   0.68142       0.56377
Elastic Net 10CV   0.11     NA       0.5758     0.0962   0.68142       0.56377

The Full logistic model yields the best performance in Accuracy and optimal balance between sensitivity and specificity.

\[
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = 3.198 &- 0.008\,Duration - 0.747\,Acousticness + 0.935\,Danceability \\
&- 3.457\,Energy - 12.325\,Instrumentalness + 0.134\,Key1 + 0.070\,Key2 \\
&+ 0.586\,Key3 - 0.159\,Key4 + 0.031\,Key5 + 0.052\,Key6 + 0.174\,Key7 \\
&+ 0.344\,Key8 + 0.275\,Key9 - 0.385\,Key10 - 0.071\,Key11 \\
&+ 0.330\,Loudness + 4.734\,Speechiness + 0.003\,Tempo - 1.386\,Valence
\end{aligned} \tag{8.3.2}
\]

The variables Duration, Acousticness, Danceability, Energy, Key3 (D#/E-flat Major), Key8 (G#/A-flat Major), Loudness, Speechiness, and Valence are significant variables in the model at a significance level of α = 0.05. A unit increase in Duration, Acousticness, Energy, Instrumentalness, or Valence, or choosing Key4 (E Major), Key10, or Key11 instead of Key0 (C Major), multiplicatively decreases the odds of classification as a popular song. Whereas a unit increase in Loudness, Speechiness, or Tempo, or choosing Key1 - Key3 or Key5 - Key9 instead of Key0 (C Major), multiplicatively increases the odds.

8.3.3 Comparison of Minor and Major Mode Models

Overall, the models produced have an accuracy of around 62%, which is just 12 percentage points better than a 50-50 chance of correctly predicting whether a song is popular or not. The performance of the All Subsets logistic regression for minor mode songs is slightly better than the full logistic regression for major mode songs, with accuracy rates of 65.05% and 61.67%, respectively. Therefore, the reduced model of 7 variables from the All Subsets logistic regression for minor mode songs has slightly higher accuracy and better performance than keeping all the variables in the full logistic regression for major mode songs. However, the full logistic regression model achieves a better balance of sensitivity and specificity, where both are just under 62%, whereas the All Subsets regression returns a sensitivity of around 55% and a larger specificity of about 67%. Even though minor mode songs have a significantly reduced model, the audio features that are kept have contributions to the log odds of popularity similar to those in the major mode model.

A unit increase in Speechiness contributes to a multiplicative increase in the odds of being popular, whereas an increase in Instrumentalness, Energy, Valence, or Acousticness contributes to a multiplicative decrease in the odds of being popular. However, the weight of the Instrumentalness feature varies greatly: its coefficient in the major mode model is -12.325, a much larger coefficient in the negative direction than the -2.722 for the minor mode songs.

8.4 Comparing Linear and Logistic Regression Approach

For both of the regression approaches, it is observed that the minor mode songs can be explained by fewer predictors and still yield performance that is slightly better in accuracy than the models for major mode songs. For the major mode songs, in contrast, all predictor variables were needed in both the linear and logistic regressions for predicting or classifying the song's popularity.

Between the two models for major mode songs, all audio features have the same direction of influence towards predicting or classifying popularity, with some exceptions. The choice of composing a song in Key10 or Key11 rather than Key0 has a negative effect on classifying a song as popular, whereas the choice of these keys instead of Key0 has a positive relationship for predicting the exact popularity score. The other discrepancy is that Tempo has a negative effect on predicting the popularity score in linear regression but a positive influence on the log odds of classifying a song as popular. The models for minor mode songs, on the other hand, have audio features contributing to the prediction or classification in the same direction. The primary difference between the linear approach and the logistic approach for Mode 0 songs is that fewer predictors are required for the logistic model, which removes all the Key dummy variables.

When deciding which of the regression approaches is more ideal for explaining the relationship of audio features to a song's popularity, the logistic regression approach provides a more interpretable analysis. Since the popularity score is transformed and the predicted R² is relatively low, it is slightly more difficult to understand the exact impact that the audio features have on prediction. For logistic regression, on the other hand, there is a clearer takeaway on how the audio features contribute to the log odds of whether a song is popular or not, for a model that provides around 62% accuracy.

CHAPTER 9 PRINCIPAL COMPONENTS ANALYSIS

9.1 Introduction

Chapter 6 aimed to determine what kind of model can be built to predict which generation a K-pop song is from based on audio features alone. However, could there be an alternative way to group the songs in the K-pop genre based on audio features? K-pop generations are grouped by the time and evolution of the K-pop genre. By excluding time period, cultural trends, catchy visuals, and dances, and only focusing on the audio features of the songs, how will the music be naturally clustered?

Before clustering, dimension reduction is performed, which allows for analysis in fewer dimensions while retaining as much of the variation in the data as possible. This dimension reduction will be performed using Principal Components Analysis (PCA). By conducting PCA, the dimension reduction can show whether there exists a more efficient way of explaining the data set.

9.2 Methodology

9.2.1 Principal Components Analysis

PCA is a dimension reduction technique that defines uncorrelated linear combinations of the variables that maximize the variance explained. In other words, PCA reduces the dimensions with linear combinations that maximize the information retained from the data. Suppose the predictor variables are represented by the matrix X. There exists a spectral decomposition of X'X,

\[ X'X = V D V' \]

where \( D_{p \times p} = \mathrm{diag}[\lambda_1, \dots, \lambda_p] \) with \( \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \) denoting the non-negative eigenvalues, also known as the principal values of X'X.

The columns \( v_j \) of V are the principal component directions, or PCA loadings, of X, such that the first principal component direction \( v_1 \) is associated with the largest eigenvalue, the second principal component direction \( v_2 \) corresponds to the second largest eigenvalue, and so on. Therefore, \( z_j = X v_j \) is known as the j-th principal component of X such that:

\[
z_j = X v_j = (x_1, x_2, \dots, x_k) \begin{pmatrix} v_{j1} \\ v_{j2} \\ \vdots \\ v_{jk} \end{pmatrix} = v_{j1} x_1 + v_{j2} x_2 + \dots + v_{jk} x_k \tag{9.2.1}
\]

Letting \( Z = (z_1, z_2, \dots, z_p) \), the above can be denoted by \( Z = XV \) in matrix notation.

This linear combination of loadings can be interpreted as follows: the first principal component \( z_1 \) has the largest sample variance among all linear combinations of the columns of X. The second principal component \( z_2 \) is orthogonal to the first principal component and has the second highest sample variance. In general, the j-th principal component \( z_j \) is orthogonal to the first (j − 1) principal components and accounts for the j-th highest sample variance (Fricke, Greenberg, Rentfrow, and Herzberg, 2018).
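The decomposition above can be verified numerically in a few lines of code; the sketch below follows Section 9.2.1 directly. Standardizing the columns first is an assumption added here, since features such as Loudness and the 0-1 measures are on very different scales.

```python
import numpy as np

def pca_loadings(X, k=5):
    """Principal components via the spectral decomposition of X'X,
    following Section 9.2.1. Columns of X are the audio features."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # center and scale each feature
    eigvals, V = np.linalg.eigh(Xc.T @ Xc)      # X'X = V D V'
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues descending
    eigvals, V = eigvals[order], V[:, order]
    Z = Xc @ V[:, :k]                           # scores z_j = X v_j
    explained = eigvals / eigvals.sum()         # proportion of variation per PC
    return V[:, :k], Z, explained[:k]
```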

9.2.2 K-means Clustering

After reducing the dimension to these linear combinations of the weighted predictors, the components are used to perform clustering with the K-means algorithm. The outline of the K-means algorithm is as follows (Friedman et al., 2001; Georges and Nguyen, 2019), with a short sketch following the list:

• Choose a pre-specified number of clusters, K.

• Centroids (the center of each cluster) are initialized by randomly assigning each sample to a cluster from 1 to K.

• Compute the Euclidean distance between all points and the centroids, where the Euclidean distance is \( d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2} \).

• Assign each point to the group of its closest center.

• Centroids for each cluster are redefined by calculating the average of all the samples in that cluster.

• Iterate through steps 3 - 5 until the centroids are stabilized.
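The listed steps map onto a short implementation; the sketch below mirrors them directly. In practice a library routine such as sklearn.cluster.KMeans with multiple random restarts would be preferred, and this sketch assumes no cluster empties out during iteration.

```python
import numpy as np

def kmeans(Z, k, n_iter=100, seed=0):
    """K-means on the PCA scores Z (n x d), following the steps outlined above."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, len(Z))          # step 2: random initial assignment
    for _ in range(n_iter):
        # step 5 (and initialization): centroid = mean of the samples in each cluster
        centroids = np.array([Z[labels == j].mean(axis=0) for j in range(k)])
        # step 3: Euclidean distances between every point and every centroid
        dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)        # step 4: assign to the closest center
        if np.array_equal(new_labels, labels):   # step 6: stop once centroids stabilize
            break
        labels = new_labels
    return labels, centroids
```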

9.3 Dimension Reduction: PCA

9.3.1 Data Preparation

To perform PCA, the assumption of independence between observations in the random sample must be met. For this data set, each song is composed independently of the others. Furthermore, any duplicate songs have been reduced as much as possible. Another requirement is that PCA must be performed on continuous data, which is an issue for some of the variables in the data set. There are categorical variables such as key, mode, and time signature. In order to eliminate the presence of the categorical variables, time signature will be removed from the analysis, since 97% of the songs are detected as being in 4/4 time. Key will also be removed, as its integer values 0-11 represent a classification into a musical key where the distances between them are not equal. However, the categorical variable of mode will be used to create two divisions of the entire data set into songs with a minor mode versus those with a major mode. Therefore, PCA will be performed on the songs composed in a minor mode (Mode = 0) separately from the songs composed in a major mode (Mode = 1).

Unlike many common methods of statistical analysis, no assumptions need to be made about the underlying distribution of the variables. This is good for this data set, as many of the distributions of the predictor variables are skewed. Another benefit of performing dimension reduction with PCA is that no independence between predictor variables needs to be assumed, since the principal components create orthogonal linear combinations of the predictor variables.

The predictor variables used for principal components analysis are: Popularity, Duration, Acousticness, Danceability, Energy, Instrumentalness, Loudness, Speechiness, Tempo, and Valence.

9.3.2 Minor Mode Results

Before interpreting the principal components, the number of principal components that sufficiently explain the variance of the data must be determined. This can be done by applying the Elbow Method to a scree plot.

Figure 9.1 Scree Plot for Minor Mode Principal Components

There is no distinct "elbow" where the plot drops sharply and then levels off. Therefore, the Elbow Method is ineffective in determining the optimal number of principal components for this data set. A general rule of thumb is that the retained principal components should explain at least 70% of the variation in the data, so this threshold will be used instead. Table 9.1 provides the variance explained and the cumulative variance explained: the first five principal components capture at least 70% of the variation, with a cumulative proportion of about 71.7%. The first principal component explains about 23.86% of the variation, the second 14.72%, the third 12.32%, the fourth 11.28%, and the fifth 9.58%.

Table 9.1 PCA Variance Explained for Minor Mode Songs

Component   Variance    Proportion of Variation   Cumulative Proportion of Variation
PC1         2.3860895   0.23860895                0.2386090
PC2         1.4715942   0.14715942                0.3857684
PC3         1.2323853   0.12323853                0.5090069
PC4         1.1277912   0.11277912                0.6217860
PC5         0.9479749   0.09479749                0.7165835
PC6         0.8192458   0.08192458                0.7985081
PC7         0.7050589   0.07050589                0.8690140
PC8         0.6546053   0.06546053                0.9344745
PC9         0.4147763   0.04147763                0.9759521
PC10        0.2404785   0.02404785                1.0000000

When interpreting the principal components, it is typical practice to emphasize factors that exhibit strong correlation loadings, with absolute values greater than 0.5. However, this data set will be evaluated with a lower threshold of 0.4.
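A minimal sketch, assuming the pca_minor object from the preparation sketch, of how the scree plot and these proportions can be computed in R:

```r
# Minimal R sketch: variance explained by each principal component.
vars <- pca_minor$sdev^2  # eigenvalues (variances of the PCs)
prop <- vars / sum(vars)  # proportion of variation explained
cum  <- cumsum(prop)      # cumulative proportion of variation

round(rbind(Variance = vars, Proportion = prop, Cumulative = cum), 4)

# Scree plot for the Elbow Method
plot(vars, type = "b", xlab = "Principal Component", ylab = "Variance")

# Smallest number of components reaching the 70% threshold
which(cum >= 0.70)[1]
```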

Table 9.2 First Five PCs for Minor Mode Songs

Audio Feature       PC1      PC2      PC3      PC4      PC5
popularity          0.043   -0.283    0.654   -0.017    0.307
duration           -0.138   -0.102   -0.480   -0.549   -0.173
acousticness       -0.440   -0.005    0.233   -0.009   -0.147
danceability        0.222    0.634    0.161    0.046   -0.096
energy              0.556   -0.146   -0.164   -0.059    0.094
instrumentalness   -0.115    0.120   -0.342    0.543    0.572
loudness            0.476   -0.253    0.131   -0.257    0.193
speechiness         0.116   -0.264    0.109    0.469   -0.658
tempo               0.107   -0.448   -0.295    0.317   -0.047
valence             0.403    0.371   -0.049    0.100   -0.192

The most influential loading factors for the first principal component are Energy ($\hat{e}_{1,5} = 0.556$), Loudness ($\hat{e}_{1,7} = 0.476$), and Valence ($\hat{e}_{1,10} = 0.403$) in the positive direction and Acousticness ($\hat{e}_{1,3} = -0.440$) in the negative direction. The grouping of Energy, Loudness, and Valence implies that these features are correlated with one another: if the Loudness of a song is high, the Energy and Valence measures are likely to be high as well. Since Acousticness loads in the negative direction, it has an inverse relationship with the other three variables.

The most significant loading factors for principal component two are Danceability ($\hat{e}_{2,4} = 0.634$) in the positive direction and Tempo ($\hat{e}_{2,9} = -0.448$) in the negative direction. It is interesting that Danceability and Tempo have opposite contributions, since one would assume these features are related: too slow a song would leave less room for dancing. However, this general assumption does not hold for these data, and the rationale for why these two variables have opposite correlations with the component is unclear.

For principal component three, Popularity ($\hat{e}_{3,1} = 0.654$) is weighted very heavily in the positive direction compared to the other positively weighted variables. In the negative direction, the most heavily weighted factor is Duration ($\hat{e}_{3,2} = -0.480$). Other chapters have shown evidence of an inverse relationship between the features of Popularity and Duration. This opposite effect on the principal component could perhaps be attributed to the confounding variable of time: Popularity relies largely on how recently a track has been played by listeners, so songs released at later dates tend to have higher Popularity, while Duration has been observed to decrease over time. This in turn would make the relationship between Popularity and Duration inverse as well. Without the context of this possible confounder, however, one could easily interpret the component as meaning that shorter songs are more popular, which is a plausible reading since many people enjoy songs that typically last between 2.5 and 4.5 minutes.

The largest loading factors in the positive direction for principal component four are Instrumentalness ($\hat{e}_{4,6} = 0.543$) and Speechiness ($\hat{e}_{4,8} = 0.469$). The largest negative loading factor is Duration ($\hat{e}_{4,2} = -0.549$). Since Instrumentalness measures the absence of speech and Speechiness the presence of speech, it is interesting that both are correlated in the same direction. Both of these features are inversely related to Duration: when

For principal component five, the greatest weight in the positive direction is Instrumentalness ($\hat{e}_{5,6} = 0.572$) and the greatest weight in the negative direction is Speechiness ($\hat{e}_{5,8} = -0.658$). For the minor mode songs, it is interesting that Instrumentalness and Speechiness are accounted for in two separate principal components, each with a different pattern of loading directions. For this component the two have opposite signs, indicating an inverse relationship.

9.3.3 Major Mode Results

Figure 9.2 Scree Plot for Major Mode Principal Components

The "elbow" in the scree plot in Figure 9.2 occurs at the second principal component; however, at that point less than 50% of the variation in the data is explained. Therefore, the number of principal components will instead be determined by the cumulative proportion of variance explained, which ideally should be at least 70%. Table 9.3, which reports the variance explained by the principal components, shows that the first five principal components explain 75.5% of the total variation in the data. Therefore, it is reasonable to treat these five principal components as sufficient for capturing the variation in the K-pop songs. Furthermore, one can observe from the table that the first principal component is the most influential, capturing about 30% of the information in the data, while the remaining principal components explain successively smaller percentages.

Table 9.3 PCA Variance Explained for Major Mode Songs

Component   Variance    Proportion of Variation   Cumulative Proportion of Variation
PC1         3.0379661   0.30379661                0.3037966
PC2         1.3077608   0.13077608                0.4345727
PC3         1.1871509   0.11871509                0.5532878
PC4         1.0961557   0.10961557                0.6629033
PC5         0.9233248   0.09233248                0.7552358
PC6         0.7780514   0.07780514                0.8330410
PC7         0.6314120   0.06314120                0.8961822
PC8         0.5164992   0.05164992                0.9478321
PC9         0.3592573   0.03592573                0.9837578
PC10        0.1624219   0.01624219                1.0000000

The 0.40 threshold used for interpreting the most influential loadings of the minor mode (Mode = 0) principal components will also be used for interpreting the principal components of the major mode (Mode = 1) songs. The first five principal components and their loading factors are given in Table 9.4, with the most influential loadings in bold.

Table 9.4 First Five PCs for Major Mode Songs

Audio Feature       PC1      PC2      PC3      PC4      PC5
popularity         -0.087   -0.321    0.539   -0.432    0.387
duration            0.235   -0.124   -0.570   -0.146   -0.147
acousticness        0.427    0.024    0.221   -0.131   -0.128
danceability       -0.330    0.521    0.105   -0.148   -0.180
energy             -0.503   -0.133   -0.211    0.069    0.133
instrumentalness    0.089    0.256    0.095    0.658    0.591
loudness           -0.421   -0.269   -0.199   -0.204    0.287
speechiness        -0.178   -0.196    0.478    0.274   -0.482
tempo              -0.066   -0.583   -0.056    0.440   -0.202
valence            -0.415    0.275   -0.012    0.068   -0.240

The most influential loading factors for the first principal component are Energy ($\hat{e}_{1,5} = -0.503$), Loudness ($\hat{e}_{1,7} = -0.421$), and Valence ($\hat{e}_{1,10} = -0.415$) in the negative direction and Acousticness ($\hat{e}_{1,3} = 0.427$) in the positive direction. The grouping of Energy, Loudness, and Valence implies that these features are correlated with one another. Since Acousticness loads in the opposite direction, it has an inverse relationship with the other three variables.

The most influential loading factors for the second principal component are Danceability ($\hat{e}_{2,4} = 0.521$) in the positive direction and Tempo ($\hat{e}_{2,9} = -0.583$) in the negative direction. The interpretation of these loading factors is the same as that observed for the minor mode songs, although the absolute values of the coefficients are more similar for this mode.

The features with the largest absolute coefficients for principal component three are Popularity ($\hat{e}_{3,1} = 0.539$) and Speechiness ($\hat{e}_{3,8} = 0.478$), both in the positive direction. Unlike the minor mode songs, this principal component has the additional impact of Speechiness. Because both Speechiness and Popularity load in the positive direction, songs with high Popularity also tend to have high Speechiness measures and vice versa. In the context of these data, high measures of Speechiness refer to music that contains rap; therefore, songs with more elements of rap would also be expected to have higher Popularity. The Duration ($\hat{e}_{3,2} = -0.570$) feature loads in the negative direction, so Duration has an inverse relationship with Popularity and Speechiness.

For principal component four, the variable weighted most heavily is Instrumentalness, with a loading factor of 0.658. The second largest loading factor in the positive direction is Tempo, at 0.440; therefore, songs with higher Instrumentalness likely have a faster Tempo and vice versa. Popularity, however, has a substantial negative loading factor of -0.432.

Finally, for principal component five, the variable with the greatest weight in the positive direction is Instrumentalness, with a loading factor of 0.591, and the largest contributing loading factor in the negative direction is Speechiness. This implies that songs with higher measurements of Instrumentalness have lower levels of Speechiness. Since Spotify's Instrumentalness feature measures the absence of vocal elements, it is logical that it has an inverse relationship with Speechiness, which detects elements of spoken word.

9.3.4 Comparison of Minor and Major Mode Results

Overall, PCA appears effective, retaining around 70% of the variation in the data with just five dimensions. There is some similarity between modes in which variables contribute to each of the components. However, the groupings do not have a particularly obvious theme from which to generate names or labels for each principal component. Instead, general characteristics of each principal component for both major and minor mode songs can be described.

For both musical modes, the first principal component groups the highest contributing variables in the same manner: Energy, Loudness, and Valence versus Acousticness. However, the signs of the loadings on these variables are opposite between major and minor modes. The minor mode principal component has positive coefficients for Energy, Loudness, and Valence and a negative coefficient for Acousticness, whereas the signs are reversed for the respective variables in the major mode songs. This shows that the first principal component captures the differences between major and minor mode songs.

The second component characterizes the inverse relationship between Danceability and Tempo, such that higher values of Danceability for a song correspond to lower values of Tempo. The third component captures the inverse relationship between Popularity and Duration, such that high Popularity scores correspond to lower Duration. Furthermore, for the major mode songs, Speechiness is a notable contributor, grouped in the positive direction with Popularity.

The fourth principal component of the minor mode songs has a different group of influential loading factors than the fourth principal component of the major mode songs. For minor mode songs, the component captures that higher values of Speechiness and Instrumentalness correspond to lower values of Duration, whereas for major mode songs the component explains the inverse relationship in which higher values of Instrumentalness correspond to lower Popularity scores. Overall, for both major and minor modes, the contribution of Instrumentalness is explained. Finally, the fifth component for both major and minor mode songs explains the relationship that higher measures of Instrumentalness correspond with lower measures of Speechiness.

9.4 Clustering on the Principal Components

Now that the optimal principal components have been chosen and interpreted, the reduced dimensions can be used to cluster the data. This section uses K-means clustering to find optimal groupings within the data based on the principal components of the audio features. The K-means clustering algorithm requires that the number of clusters be specified before clustering occurs; therefore, the silhouette method is used to identify the optimal number of clusters for the data.

(a) Minor Mode Silhouette plot (b) Major Mode Silhouette plot

Figure 9.3 Silhouette Plots for Optimal K

Figure 9.3 shows the silhouette plots for minor and major mode songs side by side. Both silhouette plots conclude that 2 clusters is the optimal number for the data set. K-means clustering can now be performed with two clusters on the first five principal components. The data points are plotted in the scatter plots in Figure 9.4, with the first principal component on the x-axis and the second principal component on the y-axis. Because the majority of the variation explained by the first principal component had opposite feature effects between minor and major modes, clusters 1 and 2 are located on opposite sides of the plot. Since the data set is split into two populations based on mode, it is difficult to say what the optimal number of groups would be for the entire K-pop genre. But if one accepts the division of the music into these two modes, then within each musical mode there are two recognized clusters defined by the Spotify Audio Features, rather than the commonly discussed grouping of four generations. A minimal R sketch of this procedure follows the figures below.

(a) Minor Mode K-means plot (b) Major Mode K-means plot

Figure 9.4 K-means Clustering Scatter-plots
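A minimal sketch in R of the silhouette-based choice of K and the clustering shown in these figures, using fviz_nbclust() from the factoextra package (assuming the hypothetical pca_minor object from the earlier sketches):

```r
# Minimal R sketch: silhouette method and K-means on the retained PCs.
library(factoextra)

Z_minor <- pca_minor$x[, 1:5]  # scores on the first five PCs

# Average silhouette width for K = 2, ..., 10
fviz_nbclust(Z_minor, kmeans, method = "silhouette", k.max = 10)

# Fit K-means with the chosen K = 2 and plot on the first two PCs
set.seed(2021)
km_minor <- kmeans(Z_minor, centers = 2, nstart = 25)
plot(Z_minor[, 1], Z_minor[, 2], col = km_minor$cluster,
     xlab = "PC1", ylab = "PC2")
```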

CHAPTER 10 RESEARCH LIMITATIONS

This analysis is written from the viewpoint of an international audience member. I do not have strong insight into the influence of Korean pop from the perspective of native South Korean people; therefore, this analysis is performed purely from the perspective of sources provided in English and artists known to the international audience. Not every K-pop artist is featured in this study. Artists used for analysis were determined from Idology's K-pop generations chart as well as other pop culture articles discussing the most influential K-pop idol artists. Generally, soloists do not become as mainstream in K-pop as the groups, so there are significantly fewer solo acts in the data set.

Beyond these limitations, many more arise that are related to Spotify in particular. Songs in the data set are limited to what is available on Spotify's platform. Because of this, K-pop artists from the first generation are underrepresented compared to the other generations of K-pop. Some artists from Idology's table were missing, and not all albums, singles, or EPs released by artists who are on Spotify were available on the platform. For example, the artist Lexy had only 2 songs available on the platform despite having multiple album releases. This restriction can be attributed to the licensing and publishing agreements made between Spotify and music publishers that determine which artists and songs can be available on the streaming platform. For example, many first generation artists and their entire discographies were not on the platform. In addition, terminations of contracts or disputes between music publishers and Spotify can cause songs that were once available on the platform to be removed. An example of this occurred on March 1st, 2021, when hundreds of K-pop songs were removed from Spotify's global markets due to the termination of a licensing deal with Kakao M, a South Korean music distributor

(Savage, 2021). Therefore, the capacity to collect audio features for this genre in the future depends on these licensing and distribution deals with Spotify. Additionally, Spotify was not available in South Korea until February 2021; the popularity data are therefore calculated entirely from the activity of international listeners, not the native South Korean audience. The popularity measures collected through Spotify thus reflect only the global audience, excluding listeners in South Korea. Finally, the audio feature measurements rely entirely on the calculations of Spotify's API; some measurements could be incorrect or fail to reflect the true nature of a song, depending on how well Spotify's algorithm performs.

CHAPTER 11 CONCLUSION

Multivariate analysis was used to investigate the contributions of Spotify Audio Features to the evolution and success of the K-pop genre. The following research questions were explored:

• Do audio features differ between groups of artist characteristics such as artist generation, male vs. female, or solo vs. group?

• How do audio features distinguish music of the newer generations from that of the older generations? Which statistical model is the most effective at performing these predictions?

• How are the audio features of a song associated with its time of release? How successful can a statistical model be at predicting a song's release date given its audio features?

• How do audio features contribute to a song’s popularity?

• Can a song's audio features be reduced in dimension to explain the music with fewer variables? With these reduced dimensions, how many clusters in the music will be detected?

Nonparametric hypothesis testing showed differences in audio features between different characteristics of K-pop artists. Although K-pop generations are not explicitly defined in theory, the Kruskal-Wallis test yielded statistically significant differences between the generations for each of the audio features. Some features, such as Popularity, Loudness, and Duration, showed a longitudinal trend: Popularity and Loudness increase from generation to generation while Duration decreases. There were also significant differences in Acousticness between specific generation pairs that are worth noting. The pairwise Wilcoxon comparisons found that the electronic sounds that were popular in the first generation are trending again in the newest, fourth generation.

There were also statistically significant differences detected in all audio features between male and female artists; however, these differences were not impactful given the ranges of the audio features. Therefore, large differences between the features of male and female artists do not exist. The comparisons between group and solo artists found that songs released by K-pop groups have significantly higher measures of Energy and Valence. Despite dance being a main feature of K-pop group performances, no significant differences in Danceability were detected between group and solo artists.

The analysis classifying songs into the new or old generation using Logistic Regression conveyed the influence of the audio features on these divisions. Notable contributors that increase the log odds of classification into the new generation were Acousticness and Instrumentalness for both mode types. Features whose unit increases decreased these log odds were Duration and Valence; in other words, higher Duration and Valence increased the log odds of classification into the older generations of K-pop. These feature contributions were similar for major and minor mode songs. Overall, the models showed good performance, with accuracies of 75.75% for the Full Logistic Regression model on major mode songs and 73.75% for the Elastic Net model on minor mode songs.

The last method used to explore the audio features' contributions to the evolution of K-pop was Multiple Linear Regression. The optimal MLR models predicting song release date from the audio features returned predicted R² values of 0.39 for the minor mode songs and 0.33 for the major mode songs. For both modes, it was observed that increases in Popularity, Acousticness, Instrumentalness, and Loudness indicated that a song was released in the later months of the K-pop timeline, whereas the features of Duration, Danceability, Energy, and Tempo were related to earlier months of the timeline.

Next, the audio features' relationship to the popularity of K-pop was assessed through Linear and Logistic Regression approaches. The contribution of the audio features to predicting or classifying popularity was generally consistent across mode types and modeling methods. It was observed that a unit increase in Danceability, Loudness, Speechiness, or Tempo contributed to higher popularity, whereas unit increases in Acousticness, Instrumentalness, Energy, or Valence contributed to lower popularity. Overall, the Logistic method is more interpretable for understanding the relationship between the audio features and popularity.

Finally, Principal Components Analysis was able to reduce the 10 audio features to 5 principal components that explained at least 70% of the variation in the data. The major contributor to the variance, the first principal component, was heavily weighted by Loudness, Valence, and Energy in the positive direction and Acousticness in the negative direction for minor mode songs. The groupings are the same for major mode songs but with the signs switched. Therefore, the groupings and directions of these four variables capture distinct musical attributes along with the majority of the variance for the two mode types. Clustering on the first five principal components resulted in an optimal clustering of two clusters for both minor and major mode songs.

BIBLIOGRAPHY

Boehmke, B. UC Business Analytics R Programming Guide. University of Cincinnati. Accessed February 20, 2021. https://uc-r.github.io/regularized_regression.

Chua, I. and E. de Luna (2020). Why are k-pop groups so big? Accessed October 1, 2020. https://pudding.cool/2020/10/kpop/.

Dietzel, A. (2020, March). How to extract any artist's data using Spotify's API, Python, and Spotipy. Accessed December 17, 2021. https://betterprogramming.pub/how-to-extract-any-artists-data-using-spotify-s-api-python-and-spotipy-4c079401bc37.

Fricke, K. R., D. M. Greenberg, P. J. Rentfrow, and P. Y. Herzberg (2018). Computer-based music feature analysis mirrors human perception and can be used to measure individual music preference. Journal of Research in Personality 75, 94-102.

Friedman, J., T. Hastie, R. Tibshirani, et al. (2001). The Elements of Statistical Learning, Volume 1. Springer Series in Statistics, New York.

Georges, P. and N. Nguyen (2019). Visualizing music similarity: clustering and mapping 500 composers. Scientometrics 120(3), 975-1003.

Jung, H. (2018, October). The data science of K-pop: Understanding BTS through data and A.I. Accessed December 17, 2021. https://towardsdatascience.com/the-data-science-of-k-pop-understanding-bts-through-data-and-a-i-part-1-50783b198ac2.

Kim, R. (2020). K-pop is only half the story of Korean pop music. Accessed February 20, 2021. https://www.rollingstone.com/music/music-features/kpop-korea-culture-trot-indie-genres-1100124/.

Lamere, P. (2020). Spotipy. Accessed December 17, 2021. https://spotipy.readthedocs.io/en/2.17.1/#.

Pramudita, D. A. (2018). Tracing the K-pop wave. Accessed December 17, 2021. https://datanibbl.es/tracing-kpop-wave/.

R Core Team (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

Romano, A. (2018). How K-pop became a global phenomenon. Last modified February 26, 2018, 1:01am EST. https://www.vox.com/culture/2018/2/16/16915672/what-is-kpop-history-explained.

Savage, M. (2021, March). Hundreds of K-pop songs disappear from Spotify. Accessed March 5, 2021. https://www.bbc.com/news/entertainment-arts-56237626.

Sherman, M. (2020). Start here: Your guide to getting into K-pop. Accessed February 20, 2021. https://www.npr.org/2020/07/13/888933244/start-here-your-guide-to-getting-into-k-pop.

Skidén, P. (2016). New endpoints: Audio features, recommendations and user taste. Accessed December 17, 2021. https://developer.spotify.com/community/news/2016/03/29/audio-features-recommendations-user-taste/.

Song, S. (2020). The evolution of the Korean Wave: How is the third generation different from previous ones? Korea Observer 51(1), 125-150.

Spotify. Spotify Web API documentation. https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features.

Squip (2020). 2020 idol pop generation theory. Accessed November 21, 2020. http://idology.kr/13070?fbclid=IwAR3QIF1d5_PsEcWfHqz6cYXlU2aGI36vKT6vMXHeYDi0te0BNRq-kgCJFMo.

Van Rossum, G. and F. L. Drake (2009). Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.

APPENDIX A WILCOXON PAIRWISE COMPARISON RESULTS

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table A.1 Pairwise Comparison Results: Duration by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval      Significant
1-2               6716006    < 2.2e-16   (9.60805, 13.81404)             ***
1-3               5168638    < 2.2e-16   (16.98804, 21.08295)            ***
1-4               1013108    < 2.2e-16   (23.40404, 29.84504)            ***
2-3               12237926   < 2.2e-16   (5.546969, 8.319974)            ***
2-4               2507990    < 2.2e-16   (11.65498, 16.90297)            ***
3-4               1526027    2.905e-16   (4.960046, 9.662030)            ***

Table A.2 Pairwise Comparison Results: Acousticness by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval            Significant
1-2               5082732    1.12e-07    (-0.019403049, -0.005477621)          ***
1-3               3231122    < 2.2e-16   (-0.03282232, -0.01646448)            ***
1-4               648436     0.2567      (-0.013777980, 0.005443325)
2-3               9997900    2.263e-05   (-0.015168099, -0.003189845)          ***
2-4               1985756    0.01084     (-0.0002510459, 0.0220042216)         *
3-4               1432583    2.727e-07   (0.01017743, 0.03571814)              ***

Table A.3 Pairwise Comparison Results: Energy by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval            Significant
1-2               5582152    0.495       (-0.007971006, 0.013086582)
1-3               3715470    0.4733      (-0.014031869, 0.007970819)
1-4               597314     4.032e-05   (-0.04394520, -0.00902158)            ***
2-3               10327704   0.09598     (-0.011964001, 0.002963283)
2-4               1643106    1.841e-07   (-0.03904493, -0.01295186)            ***
3-4               1135920    6.909e-06   (-0.033008665, -0.008978771)          ***

Table A.4 Pairwise Comparison Results: Instrumentalness by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval            Significant
1-2               6827255    < 2.2e-16   (5.330529e-06, 3.635337e-05)          ***
1-3               4962824    < 2.2e-16   (6.242531e-05, 1.324542e-05)          ***
1-4               845691     < 2.2e-16   (2.214427e-05, 3.656347e-05)          ***
2-3               11428636   < 2.2e-16   (1.562298e-05, 6.900444e-06)          ***
2-4               1938095    0.05353     (-1.880325e-05, 6.295205e-05)
3-4               1212845    0.002867    (-1.116460e-05, 3.371762e-05)         **

Table A.5 Pairwise Comparison Results: Speechiness by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval            Significant
1-2               5967231    1.16e-07    (0.001754788, 0.005245966)            ***
1-3               3443958    1.676e-07   (-0.006280242, -0.002035958)          ***
1-4               553926     3.415e-11   (-0.013304688, -0.005701992)          ***
2-3               8765167    < 2.2e-16   (-0.009428564, -0.006263813)          ***
2-4               1386588    < 2.2e-16   (-0.01704981, -0.01034887)            ***
3-4               1160456    0.0002162   (-0.008878170, -0.001443872)          ***

Table A.6 Pairwise Comparison Results: Loudness by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval      Significant
1-2               3987612    < 2.2e-16   (-1.0690260, -0.8050314)        ***
1-3               2325016    < 2.2e-16   (-1.380008, -1.110075)          ***
1-4               341425     < 2.2e-16   (-1.783976, -1.359077)          ***
2-3               9432992    < 2.2e-16   (-0.3900043, -0.2079796)        ***
2-4               1451189    < 2.2e-16   (-0.7900220, -0.4470353)        ***
3-4               1117533    3.509e-07   (-0.4850057, -0.1540285)        ***

Table A.7 Pairwise Comparison Results: Tempo by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval      Significant
1-2               5147904    6.052e-06   (-4.948980, -1.075945)          ***
1-3               3672100    0.1503      (-2.983978, 0.874021)
1-4               593010     1.326e-05   (-8.044934, -1.998050)          ***
2-3               11044058   8.742e-05   (0.2600649, 3.1589815)          ***
2-4               1802280    0.1077      (-4.2850588, 0.9970885)
3-4               1165645    0.0004144   (-6.587050, -0.946052)          **

Table A.8 Pairwise Comparison Results: Valence by Generation

Generation Pair   W          P-Value     99.17% Confidence Interval            Significant
1-2               6145471    1.031e-13   (0.02893193, 0.06002026)              ***
1-3               4319974    < 2.2e-16   (0.04198552, 0.07405643)              ***
1-4               778713     1.213e-10   (0.03800416, 0.09000625)              ***
2-3               10865590   0.01137     (-0.0009727786, 0.0240385187)         *
2-4               1958912    0.05243     (-0.006001465, 0.041033561)
3-4               1294594    0.5102      (-0.01798393, 0.02894915)

APPENDIX B MLR AND LOGISTIC REGRESSION MODEL RESULTS

New Generation Classification: Minor Mode Results

Full Multiple Logistic Regression

Figure 1 Full Logistic Model for Minor Mode Generation Classification. The figure displays sample R code and the statistical significance of the model coefficients.

The model is written as:

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 0.453 + 0.074\,\text{Popularity} - 0.009\,\text{Duration} + 1.280\,\text{Acousticness} \\
&+ 0.715\,\text{Danceability} + 0.227\,\text{Energy} + 2.790\,\text{Instrumentalness} \\
&+ 0.301\,\text{Key1} + 0.242\,\text{Key2} - 0.540\,\text{Key3} - 0.308\,\text{Key4} - 0.253\,\text{Key5} \\
&- 0.148\,\text{Key6} + 0.498\,\text{Key7} - 0.194\,\text{Key8} - 0.270\,\text{Key9} - 0.307\,\text{Key10} \\
&- 0.075\,\text{Key11} + 0.236\,\text{Loudness} + 2.170\,\text{Speechiness} - 0.003\,\text{Tempo} \\
&- 1.359\,\text{Valence}
\end{aligned}
\tag{B.1}
$$
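The thesis reproduces the R code only as figure images. As a hedged illustration of how such a full model could be fit (the data frame minor and the column new_gen are hypothetical names, not the thesis' actual objects):

```r
# Minimal R sketch: full logistic model for new-generation classification.
# 'minor' is assumed to hold the minor mode songs with a 0/1 'new_gen'
# indicator and 'key' stored as integers 0-11.
fit_full <- glm(new_gen ~ popularity + duration + acousticness +
                  danceability + energy + instrumentalness + factor(key) +
                  loudness + speechiness + tempo + valence,
                family = binomial, data = minor)
summary(fit_full)  # coefficients correspond to the log-odds equation above
```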

Stepwise 10 Fold Cross Validation

(a) Minor Mode Stepwise Sample R Code

(b) Minor Mode Stepwise Model Results

Figure 2 Stepwise Logistic Model for Minor Mode Generation Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 0.480 + 0.074\,\text{Popularity} - 0.009\,\text{Duration} + 1.216\,\text{Acousticness} \\
&+ 0.672\,\text{Danceability} + 2.807\,\text{Instrumentalness} + 0.503\,\text{Key1} \\
&+ 0.445\,\text{Key2} + 0.702\,\text{Key7} + 0.245\,\text{Loudness} \\
&+ 2.207\,\text{Speechiness} - 0.003\,\text{Tempo} - 1.319\,\text{Valence}
\end{aligned}
\tag{B.2}
$$
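A hedged sketch of stepwise selection with 10-fold cross-validation using the caret package, whose "glmStepAIC" method wraps MASS::stepAIC (again assuming the hypothetical minor data frame):

```r
# Minimal R sketch: stepwise logistic regression with 10-fold CV via caret.
library(caret)

minor$gen_class <- factor(minor$new_gen, labels = c("old", "new"))
ctrl <- trainControl(method = "cv", number = 10)
fit_step <- train(gen_class ~ popularity + duration + acousticness +
                    danceability + energy + instrumentalness + factor(key) +
                    loudness + speechiness + tempo + valence,
                  data = minor, method = "glmStepAIC",
                  trControl = ctrl, trace = FALSE)
summary(fit_step$finalModel)  # terms retained after stepwise selection
```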

All Possible Subsets Regression

(a) Minor Mode All Possible Subsets Sample R Code

(b) Minor Mode All Possible Subsets Model Results

Figure 3 All Possible Subsets Logistic Model for Minor Mode Generation Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 0.684 + 0.074\,\text{Popularity} - 0.009\,\text{Duration} + 1.221\,\text{Acousticness} \\
&+ 0.695\,\text{Danceability} + 2.807\,\text{Instrumentalness} + 0.298\,\text{Key1} + 0.240\,\text{Key2} \\
&- 0.545\,\text{Key3} - 0.308\,\text{Key4} - 0.256\,\text{Key5} - 0.153\,\text{Key6} + 0.497\,\text{Key7} \\
&- 0.197\,\text{Key8} - 0.271\,\text{Key9} - 0.310\,\text{Key10} - 0.076\,\text{Key11} + 0.246\,\text{Loudness} \\
&+ 2.214\,\text{Speechiness} - 0.003\,\text{Tempo} - 1.331\,\text{Valence}
\end{aligned}
\tag{B.3}
$$
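A hedged sketch of best-subsets selection with the bestglm package, which expects a data frame with the predictors followed by the response (named y) as the last column; the Key dummies are omitted here for brevity, so this is only an approximation of the model above:

```r
# Minimal R sketch: all-possible-subsets logistic regression via bestglm.
library(bestglm)

Xy <- data.frame(minor[, c("popularity", "duration", "acousticness",
                           "danceability", "energy", "instrumentalness",
                           "loudness", "speechiness", "tempo", "valence")],
                 y = minor$new_gen)
fit_best <- bestglm(Xy, family = binomial, IC = "AIC")
fit_best$BestModel  # the selected subset model
```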

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 0.457 + 0.060\,\text{Popularity} - 0.008\,\text{Duration} + 0.904\,\text{Acousticness} \\
&+ 0.281\,\text{Danceability} + 0.160\,\text{Energy} + 1.840\,\text{Instrumentalness} + 0.288\,\text{Key1} \\
&+ 0.236\,\text{Key2} - 0.358\,\text{Key3} - 0.202\,\text{Key4} - 0.137\,\text{Key5} - 0.048\,\text{Key6} \\
&+ 0.445\,\text{Key7} - 0.120\,\text{Key8} - 0.177\,\text{Key9} - 0.191\,\text{Key10} - 0.030\,\text{Key11} \\
&+ 0.175\,\text{Loudness} + 1.769\,\text{Speechiness} - 0.002\,\text{Tempo} - 1.020\,\text{Valence}
\end{aligned}
\tag{B.4}
$$

Lasso 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 0.149 + 0.070\,\text{Popularity} - 0.008\,\text{Duration} + 0.809\,\text{Acousticness} \\
&+ 1.880\,\text{Instrumentalness} + 0.204\,\text{Key1} + 0.103\,\text{Key2} - 0.139\,\text{Key3} \\
&+ 0.526\,\text{Key7} - 0.008\,\text{Key9} - 0.0004\,\text{Key10} \\
&+ 0.170\,\text{Loudness} + 0.947\,\text{Speechiness} - 0.001\,\text{Tempo} - 0.834\,\text{Valence}
\end{aligned}
\tag{B.5}
$$
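The ridge, lasso, and elastic net fits can all be sketched with the glmnet package, where alpha = 0 gives ridge, alpha = 1 gives lasso, and intermediate values give an elastic net mixture; cv.glmnet carries out the 10-fold cross-validation over the penalty parameter (object names remain hypothetical):

```r
# Minimal R sketch: regularized logistic regression with 10-fold CV.
library(glmnet)

x <- model.matrix(new_gen ~ popularity + duration + acousticness +
                    danceability + energy + instrumentalness + factor(key) +
                    loudness + speechiness + tempo + valence,
                  data = minor)[, -1]  # drop the intercept column
y <- minor$new_gen

cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0,   nfolds = 10)
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1,   nfolds = 10)
cv_enet  <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 10)

coef(cv_lasso, s = "lambda.min")  # coefficients at the CV-optimal penalty
```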

New Generation Classification: Major Mode Results

Full Multiple Logistic Regression

Figure 4 Full Logistic Model for Major Mode Generation Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -0.783 + 0.079\,\text{Popularity} - 0.009\,\text{Duration} + 0.882\,\text{Acousticness} \\
&+ 0.665\,\text{Danceability} + 0.537\,\text{Energy} + 2.471\,\text{Instrumentalness} \\
&- 0.415\,\text{Key1} + 0.002\,\text{Key2} + 0.149\,\text{Key3} + 0.280\,\text{Key4} - 0.217\,\text{Key5} \\
&- 0.280\,\text{Key6} - 0.161\,\text{Key7} - 0.185\,\text{Key8} + 0.006\,\text{Key9} \\
&- 0.071\,\text{Key10} - 0.094\,\text{Key11} + 0.154\,\text{Loudness} \\
&+ 0.792\,\text{Speechiness} + 0.0002\,\text{Tempo} - 0.742\,\text{Valence}
\end{aligned}
\tag{B.6}
$$

Stepwise 10 Fold Cross Validation

Figure 5 Stepwise Logistic Model for Major Mode Generation Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -0.233 + 0.079\,\text{Popularity} - 0.009\,\text{Duration} + 0.743\,\text{Acousticness} \\
&+ 0.621\,\text{Danceability} + 2.524\,\text{Instrumentalness} - 0.406\,\text{Key1} + 0.287\,\text{Key4} \\
&- 0.214\,\text{Key5} - 0.268\,\text{Key6} - 0.154\,\text{Key7} - 0.179\,\text{Key8} \\
&+ 0.178\,\text{Loudness} + 0.992\,\text{Speechiness} - 0.674\,\text{Valence}
\end{aligned}
\tag{B.7}
$$

All Possible Subsets Regression

Figure 6 All Possible Subsets Logistic Model for Major Mode Generation Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -0.226 + 0.079\,\text{Popularity} - 0.009\,\text{Duration} + 0.739\,\text{Acousticness} \\
&+ 0.632\,\text{Danceability} + 2.516\,\text{Instrumentalness} - 0.415\,\text{Key1} + 0.002\,\text{Key2} \\
&+ 0.147\,\text{Key3} + 0.279\,\text{Key4} - 0.222\,\text{Key5} - 0.276\,\text{Key6} - 0.162\,\text{Key7} \\
&- 0.187\,\text{Key8} + 0.002\,\text{Key9} - 0.072\,\text{Key10} - 0.094\,\text{Key11} + 0.178\,\text{Loudness} \\
&+ 0.980\,\text{Speechiness} - 0.668\,\text{Valence}
\end{aligned}
\tag{B.8}
$$

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -0.538 + 0.062\,\text{Popularity} - 0.007\,\text{Duration} + 0.553\,\text{Acousticness} \\
&+ 0.350\,\text{Danceability} + 0.318\,\text{Energy} + 1.609\,\text{Instrumentalness} - 0.257\,\text{Key1} \\
&+ 0.064\,\text{Key2} + 0.205\,\text{Key3} + 0.256\,\text{Key4} - 0.121\,\text{Key5} - 0.162\,\text{Key6} \\
&- 0.056\,\text{Key7} - 0.071\,\text{Key8} + 0.068\,\text{Key9} - 0.025\,\text{Key10} - 0.011\,\text{Key11} \\
&+ 0.117\,\text{Loudness} + 0.915\,\text{Speechiness} + 0.0005\,\text{Tempo} - 0.502\,\text{Valence}
\end{aligned}
\tag{B.9}
$$

Lasso 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -0.928 + 0.075\,\text{Popularity} - 0.006\,\text{Duration} + 0.195\,\text{Acousticness} \\
&+ 1.169\,\text{Instrumentalness} - 0.173\,\text{Key1} + 0.002\,\text{Key4} \\
&+ 0.081\,\text{Loudness} - 0.098\,\text{Valence}
\end{aligned}
\tag{B.10}
$$

Elastic Net 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -1.072 + 0.071\,\text{Popularity} - 0.005\,\text{Duration} \\
&+ 0.671\,\text{Instrumentalness} - 0.059\,\text{Key1} + 0.054\,\text{Loudness}
\end{aligned}
\tag{B.11}
$$

Full Multiple Linear Regression

Figure 7 Full MLR Model for Minor Mode Song Release Date

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 2.739 - 0.022\,\text{Popularity} + 0.003\,\text{Duration} - 0.240\,\text{Acousticness} \\
&+ 0.424\,\text{Danceability} + 0.572\,\text{Energy} - 0.848\,\text{Instrumentalness} \\
&- 0.013\,\text{Key1} + 0.016\,\text{Key2} + 0.035\,\text{Key3} + 0.118\,\text{Key4} \\
&+ 0.017\,\text{Key5} + 0.051\,\text{Key6} - 0.087\,\text{Key7} - 0.056\,\text{Key8} \\
&+ 0.109\,\text{Key9} + 0.126\,\text{Key10} + 0.101\,\text{Key11} - 0.087\,\text{Loudness} \\
&- 0.330\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.324\,\text{Valence}
\end{aligned}
\tag{B.12}
$$
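A hedged sketch of this fit in R: the response is the log(360 − months) transform used throughout these equations, where months is assumed to be a hypothetical column counting months since the start of the K-pop timeline:

```r
# Minimal R sketch: MLR for song release date on the transformed scale.
# 'months' is assumed to count months from the start of the K-pop timeline.
fit_date <- lm(log(360 - months) ~ popularity + duration + acousticness +
                 danceability + energy + instrumentalness + factor(key) +
                 loudness + speechiness + tempo + valence,
               data = minor)
summary(fit_date)  # coefficients on the log(360 - months) scale
```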

Stepwise 10 Fold Cross Validation

Figure 8 Stepwise Linear Regression Model for Minor Mode Song Release Date

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 2.732 - 0.022\,\text{Popularity} + 0.003\,\text{Duration} - 0.241\,\text{Acousticness} \\
&+ 0.425\,\text{Danceability} + 0.572\,\text{Energy} - 0.852\,\text{Instrumentalness} \\
&+ 0.125\,\text{Key4} + 0.057\,\text{Key6} + 0.115\,\text{Key9} + 0.133\,\text{Key10} + 0.107\,\text{Key11} \\
&- 0.087\,\text{Loudness} - 0.328\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.324\,\text{Valence}
\end{aligned}
\tag{B.13}
$$

All Possible Subsets Regression

Figure 9 All Possible Subsets Linear Regression Model for Minor Mode Song Release Date

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 2.733 - 0.022\,\text{Popularity} + 0.003\,\text{Duration} - 0.241\,\text{Acousticness} \\
&+ 0.425\,\text{Danceability} + 0.572\,\text{Energy} - 0.852\,\text{Instrumentalness} \\
&+ 0.125\,\text{Key4} + 0.057\,\text{Key6} + 0.115\,\text{Key9} + 0.133\,\text{Key10} + 0.107\,\text{Key11} \\
&- 0.087\,\text{Loudness} - 0.328\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.324\,\text{Valence}
\end{aligned}
\tag{B.14}
$$

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 2.866 - 0.021\,\text{Popularity} + 0.003\,\text{Duration} - 0.245\,\text{Acousticness} \\
&+ 0.413\,\text{Danceability} + 0.482\,\text{Energy} - 0.749\,\text{Instrumentalness} \\
&- 0.034\,\text{Key1} - 0.006\,\text{Key2} + 0.008\,\text{Key3} + 0.095\,\text{Key4} \\
&- 0.007\,\text{Key5} + 0.023\,\text{Key6} - 0.063\,\text{Key7} - 0.075\,\text{Key8} \\
&+ 0.082\,\text{Key9} + 0.099\,\text{Key10} + 0.078\,\text{Key11} \\
&- 0.080\,\text{Loudness} - 0.308\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.323\,\text{Valence}
\end{aligned}
\tag{B.15}
$$

Lasso 10 Fold Cross Validation

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 2.837 - 0.022\,\text{Popularity} + 0.003\,\text{Duration} - 0.240\,\text{Acousticness} \\
&+ 0.409\,\text{Danceability} + 0.533\,\text{Energy} - 0.808\,\text{Instrumentalness} \\
&- 0.027\,\text{Key1} + 0.004\,\text{Key3} + 0.092\,\text{Key4} + 0.024\,\text{Key6} - 0.056\,\text{Key7} \\
&- 0.068\,\text{Key8} + 0.082\,\text{Key9} + 0.099\,\text{Key10} + 0.077\,\text{Key11} \\
&- 0.084\,\text{Loudness} - 0.280\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.321\,\text{Valence}
\end{aligned}
\tag{B.16}
$$

Elastic Net 10 Fold Cross Validation

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 2.838 - 0.021\,\text{Popularity} + 0.003\,\text{Duration} - 0.240\,\text{Acousticness} \\
&+ 0.409\,\text{Danceability} + 0.528\,\text{Energy} - 0.802\,\text{Instrumentalness} \\
&- 0.027\,\text{Key1} + 0.005\,\text{Key3} + 0.093\,\text{Key4} + 0.024\,\text{Key6} - 0.056\,\text{Key7} \\
&- 0.068\,\text{Key8} + 0.083\,\text{Key9} + 0.010\,\text{Key10} + 0.078\,\text{Key11} \\
&- 0.084\,\text{Loudness} - 0.280\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.322\,\text{Valence}
\end{aligned}
\tag{B.17}
$$

Song Release Date Prediction: Major Mode Results

Full Multiple Linear Regression

Figure 10 Full MLR Model for Major Mode Song Release Date

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 2.990 - 0.020\,\text{Popularity} + 0.002\,\text{Duration} - 0.127\,\text{Acousticness} \\
&+ 0.631\,\text{Danceability} + 0.383\,\text{Energy} - 0.771\,\text{Instrumentalness} \\
&+ 0.084\,\text{Key1} + 0.029\,\text{Key2} + 0.020\,\text{Key3} + 0.068\,\text{Key4} \\
&+ 0.075\,\text{Key5} + 0.096\,\text{Key6} + 0.102\,\text{Key7} + 0.119\,\text{Key8} \\
&+ 0.012\,\text{Key9} + 0.081\,\text{Key10} + 0.046\,\text{Key11} \\
&- 0.076\,\text{Loudness} - 0.180\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.294\,\text{Valence}
\end{aligned}
\tag{B.18}
$$

Stepwise 10 Fold Cross Validation

Figure 11 Stepwise Linear Regression Model for Major Mode Song Release Date

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 3.020 - 0.020\,\text{Popularity} + 0.002\,\text{Duration} - 0.124\,\text{Acousticness} \\
&+ 0.630\,\text{Danceability} + 0.357\,\text{Energy} - 0.765\,\text{Instrumentalness} \\
&+ 0.062\,\text{Key1} + 0.053\,\text{Key5} + 0.073\,\text{Key6} + 0.079\,\text{Key7} \\
&+ 0.096\,\text{Key8} + 0.060\,\text{Key10} - 0.074\,\text{Loudness} \\
&+ 0.001\,\text{Tempo} + 0.289\,\text{Valence}
\end{aligned}
\tag{B.19}
$$

All Possible Subsets Regression

Figure 12 All Possible Subsets Linear Regression Model for Major Mode Song Release Date

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 3.033 - 0.020\,\text{Popularity} + 0.002\,\text{Duration} - 0.122\,\text{Acousticness} \\
&+ 0.630\,\text{Danceability} + 0.357\,\text{Energy} - 0.768\,\text{Instrumentalness} \\
&+ 0.057\,\text{Key1} + 0.048\,\text{Key5} + 0.068\,\text{Key6} + 0.074\,\text{Key7} \\
&+ 0.091\,\text{Key8} - 0.074\,\text{Loudness} + 0.001\,\text{Tempo} + 0.289\,\text{Valence}
\end{aligned}
\tag{B.20}
$$

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 3.143 - 0.019\,\text{Popularity} + 0.002\,\text{Duration} - 0.142\,\text{Acousticness} \\
&+ 0.597\,\text{Danceability} + 0.279\,\text{Energy} - 0.690\,\text{Instrumentalness} \\
&+ 0.066\,\text{Key1} + 0.012\,\text{Key2} + 0.001\,\text{Key3} + 0.050\,\text{Key4} \\
&+ 0.056\,\text{Key5} + 0.077\,\text{Key6} + 0.081\,\text{Key7} + 0.096\,\text{Key8} \\
&- 0.005\,\text{Key9} + 0.065\,\text{Key10} + 0.028\,\text{Key11} \\
&- 0.068\,\text{Loudness} - 0.158\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.297\,\text{Valence}
\end{aligned}
\tag{B.21}
$$

Lasso 10 Fold Cross Validation

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 3.055 - 0.020\,\text{Popularity} + 0.002\,\text{Duration} - 0.127\,\text{Acousticness} \\
&+ 0.622\,\text{Danceability} + 0.354\,\text{Energy} - 0.750\,\text{Instrumentalness} \\
&+ 0.066\,\text{Key1} + 0.010\,\text{Key2} + 0.047\,\text{Key4} + 0.055\,\text{Key5} + 0.077\,\text{Key6} \\
&+ 0.084\,\text{Key7} + 0.0997\,\text{Key8} + 0.061\,\text{Key10} + 0.027\,\text{Key11} \\
&- 0.073\,\text{Loudness} - 0.148\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.291\,\text{Valence}
\end{aligned}
\tag{B.22}
$$

Elastic Net 10 Fold Cross Validation

$$
\begin{aligned}
\log(360 - \widehat{\text{months}}) ={}& 3.101 - 0.020\,\text{Popularity} + 0.002\,\text{Duration} - 0.127\,\text{Acousticness} \\
&+ 0.615\,\text{Danceability} + 0.334\,\text{Energy} - 0.734\,\text{Instrumentalness} \\
&+ 0.056\,\text{Key1} + 0.035\,\text{Key4} + 0.044\,\text{Key5} + 0.066\,\text{Key6} + 0.073\,\text{Key7} \\
&+ 0.088\,\text{Key8} - 0.005\,\text{Key9} + 0.049\,\text{Key10} + 0.016\,\text{Key11} \\
&- 0.072\,\text{Loudness} - 0.123\,\text{Speechiness} + 0.001\,\text{Tempo} + 0.291\,\text{Valence}
\end{aligned}
\tag{B.23}
$$

Popularity Score Prediction: Minor Mode Results

Full Multiple Linear Regression

Figure 13 Full MLR model for Minor Mode Song Popularity

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 12.803 - 0.010\,\text{Duration} - 0.142\,\text{Acousticness} - 0.138\,\text{Danceability} \\
&- 3.909\,\text{Energy} - 2.244\,\text{Instrumentalness} - 0.131\,\text{Key1} - 0.375\,\text{Key2} \\
&+ 0.182\,\text{Key3} - 0.544\,\text{Key4} - 0.119\,\text{Key5} + 0.138\,\text{Key6} - 0.024\,\text{Key7} \\
&- 0.157\,\text{Key8} - 0.152\,\text{Key9} - 0.235\,\text{Key10} - 0.291\,\text{Key11} \\
&+ 0.423\,\text{Loudness} + 2.157\,\text{Speechiness} - 0.001\,\text{Tempo} - 1.189\,\text{Valence}
\end{aligned}
\tag{B.24}
$$
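On this response scale, a hedged sketch of the corresponding fit (the square-root transform of the popularity score follows the equations above; the object names are hypothetical):

```r
# Minimal R sketch: MLR for the square-root-transformed popularity score.
fit_pop <- lm(sqrt(popularity) ~ duration + acousticness + danceability +
                energy + instrumentalness + factor(key) + loudness +
                speechiness + tempo + valence,
              data = minor)
coef(summary(fit_pop))  # compare with the coefficients reported above
```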

Stepwise 10 Fold Cross Validation

Figure 14 Stepwise Linear Regression Model for Minor Mode Song Popularity

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 12.404 - 0.010\,\text{Duration} - 3.801\,\text{Energy} - 2.244\,\text{Instrumentalness} \\
&- 0.259\,\text{Key2} + 0.305\,\text{Key3} - 0.424\,\text{Key4} + 0.262\,\text{Key6} - 0.169\,\text{Key11} \\
&+ 0.421\,\text{Loudness} + 2.100\,\text{Speechiness} - 1.189\,\text{Valence}
\end{aligned}
\tag{B.25}
$$

All Possible Subsets Regression

Figure 15 Sample R Code and All Possible Subsets Linear Regression model table for minor mode song popularity prediction

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 12.404 - 0.010\,\text{Duration} - 3.801\,\text{Energy} - 2.244\,\text{Instrumentalness} \\
&- 0.259\,\text{Key2} + 0.305\,\text{Key3} - 0.424\,\text{Key4} + 0.262\,\text{Key6} - 0.169\,\text{Key11} \\
&+ 0.421\,\text{Loudness} + 2.100\,\text{Speechiness} - 1.233\,\text{Valence}
\end{aligned}
\tag{B.26}
$$

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 12.341 - 0.010\,\text{Duration} - 0.066\,\text{Acousticness} - 0.149\,\text{Danceability} \\
&- 3.560\,\text{Energy} - 2.269\,\text{Instrumentalness} - 0.083\,\text{Key1} \\
&- 0.324\,\text{Key2} + 0.226\,\text{Key3} - 0.493\,\text{Key4} - 0.070\,\text{Key5} + 0.184\,\text{Key6} \\
&+ 0.017\,\text{Key7} - 0.106\,\text{Key8} - 0.108\,\text{Key9} - 0.184\,\text{Key10} - 0.245\,\text{Key11} \\
&+ 0.399\,\text{Loudness} + 2.040\,\text{Speechiness} - 0.001\,\text{Tempo} - 1.175\,\text{Valence}
\end{aligned}
\tag{B.27}
$$

Lasso 10 Fold Cross Validation

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 11.897 - 0.010\,\text{Duration} - 3.492\,\text{Energy} - 2.047\,\text{Instrumentalness} \\
&- 0.155\,\text{Key2} + 0.212\,\text{Key3} - 0.359\,\text{Key4} + 0.221\,\text{Key6} \\
&- 0.033\,\text{Key7} - 0.040\,\text{Key10} - 0.118\,\text{Key11} \\
&+ 0.398\,\text{Loudness} + 1.755\,\text{Speechiness} - 1.135\,\text{Valence}
\end{aligned}
\tag{B.28}
$$

Elastic Net 10 Fold Cross Validation

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 11.982 - 0.010\,\text{Duration} - 3.542\,\text{Energy} - 2.081\,\text{Instrumentalness} \\
&- 0.174\,\text{Key2} + 0.227\,\text{Key3} - 0.371\,\text{Key4} + 0.227\,\text{Key6} \\
&+ 0.043\,\text{Key7} - 0.053\,\text{Key10} - 0.127\,\text{Key11} \\
&+ 0.402\,\text{Loudness} + 1.817\,\text{Speechiness} - 0.000001\,\text{Tempo} - 1.150\,\text{Valence}
\end{aligned}
\tag{B.29}
$$

Popularity Score Prediction: Major Mode Results

Full Multiple Linear Regression

Figure 16 Full MLR Model for Major Mode Song Popularity

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 10.857 - 0.010\,\text{Duration} + 0.081\,\text{Acousticness} + 0.051\,\text{Danceability} \\
&- 3.078\,\text{Energy} - 1.877\,\text{Instrumentalness} + 0.321\,\text{Key1} \\
&+ 0.207\,\text{Key2} + 0.431\,\text{Key3} - 0.029\,\text{Key4} + 0.204\,\text{Key5} + 0.217\,\text{Key6} \\
&+ 0.283\,\text{Key7} + 0.297\,\text{Key8} + 0.271\,\text{Key9} + 0.080\,\text{Key10} + 0.390\,\text{Key11} \\
&+ 0.386\,\text{Loudness} + 4.580\,\text{Speechiness} - 0.001\,\text{Tempo} - 0.849\,\text{Valence}
\end{aligned}
\tag{B.30}
$$

Stepwise 10 Fold Cross Validation

Figure 17 Stepwise Linear Regression Model for Major Mode Song Popularity

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 10.985 - 0.009\,\text{Duration} - 3.161\,\text{Energy} - 1.884\,\text{Instrumentalness} \\
&+ 0.312\,\text{Key1} + 0.199\,\text{Key2} + 0.425\,\text{Key3} + 0.196\,\text{Key5} + 0.210\,\text{Key6} \\
&+ 0.275\,\text{Key7} + 0.289\,\text{Key8} + 0.262\,\text{Key9} + 0.383\,\text{Key11} \\
&+ 0.387\,\text{Loudness} + 4.579\,\text{Speechiness} - 0.002\,\text{Tempo} - 0.837\,\text{Valence}
\end{aligned}
\tag{B.31}
$$

All Possible Subsets Regression

Figure 18 All Possible Subsets Linear Regression Model for Major Mode Song Popularity

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 10.985 - 0.009\,\text{Duration} - 3.161\,\text{Energy} - 1.884\,\text{Instrumentalness} \\
&+ 0.312\,\text{Key1} + 0.199\,\text{Key2} + 0.425\,\text{Key3} + 0.196\,\text{Key5} + 0.210\,\text{Key6} \\
&+ 0.275\,\text{Key7} + 0.289\,\text{Key8} + 0.262\,\text{Key9} + 0.383\,\text{Key11} \\
&+ 0.387\,\text{Loudness} + 4.579\,\text{Speechiness} - 0.002\,\text{Tempo} - 0.837\,\text{Valence}
\end{aligned}
\tag{B.32}
$$

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 10.348 - 0.010\,\text{Duration} + 0.170\,\text{Acousticness} + 0.050\,\text{Danceability} \\
&- 2.621\,\text{Energy} - 1.928\,\text{Instrumentalness} + 0.294\,\text{Key1} \\
&+ 0.183\,\text{Key2} + 0.399\,\text{Key3} - 0.049\,\text{Key4} + 0.179\,\text{Key5} + 0.189\,\text{Key6} \\
&+ 0.260\,\text{Key7} + 0.273\,\text{Key8} + 0.249\,\text{Key9} + 0.052\,\text{Key10} + 0.361\,\text{Key11} \\
&+ 0.356\,\text{Loudness} + 4.325\,\text{Speechiness} - 0.001\,\text{Tempo} - 0.859\,\text{Valence}
\end{aligned}
\tag{B.33}
$$

Lasso 10 Fold Cross Validation

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 10.823 - 0.010\,\text{Duration} + 0.079\,\text{Acousticness} + 0.007\,\text{Danceability} \\
&- 3.012\,\text{Energy} - 1.855\,\text{Instrumentalness} + 0.269\,\text{Key1} \\
&+ 0.153\,\text{Key2} + 0.368\,\text{Key3} - 0.061\,\text{Key4} + 0.148\,\text{Key5} + 0.160\,\text{Key6} \\
&+ 0.232\,\text{Key7} + 0.243\,\text{Key8} + 0.216\,\text{Key9} + 0.020\,\text{Key10} + 0.334\,\text{Key11} \\
&+ 0.381\,\text{Loudness} + 4.504\,\text{Speechiness} - 0.001\,\text{Tempo} - 0.830\,\text{Valence}
\end{aligned}
\tag{B.34}
$$

Elastic Net 10 Fold Cross Validation

$$
\begin{aligned}
\sqrt{\widehat{\text{popularity}}} ={}& 10.771 - 0.010\,\text{Duration} + 0.092\,\text{Acousticness} + 0.027\,\text{Danceability} \\
&- 2.979\,\text{Energy} - 1.872\,\text{Instrumentalness} + 0.288\,\text{Key1} \\
&+ 0.173\,\text{Key2} + 0.391\,\text{Key3} - 0.050\,\text{Key4} + 0.169\,\text{Key5} + 0.181\,\text{Key6} \\
&+ 0.251\,\text{Key7} + 0.263\,\text{Key8} + 0.237\,\text{Key9} + 0.042\,\text{Key10} + 0.354\,\text{Key11} \\
&+ 0.379\,\text{Loudness} + 4.504\,\text{Speechiness} - 0.001\,\text{Tempo} - 0.841\,\text{Valence}
\end{aligned}
\tag{B.35}
$$

Popular Classification: Minor Mode Results

Full Logistic Regression

Figure 19 Sample R code and the Full Logistic model table for minor mode popularity classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 5.059 - 0.009\,\text{Duration} - 1.472\,\text{Acousticness} + 0.311\,\text{Danceability} \\
&- 3.723\,\text{Energy} - 2.809\,\text{Instrumentalness} + 0.111\,\text{Key1} \\
&- 0.088\,\text{Key2} + 0.772\,\text{Key3} - 0.327\,\text{Key4} + 0.180\,\text{Key5} + 0.314\,\text{Key6} \\
&+ 0.175\,\text{Key7} + 0.218\,\text{Key8} + 0.032\,\text{Key9} + 0.152\,\text{Key10} + 0.059\,\text{Key11} \\
&+ 0.350\,\text{Loudness} + 3.453\,\text{Speechiness} - 0.001\,\text{Tempo} - 1.639\,\text{Valence}
\end{aligned}
\tag{B.36}
$$

Stepwise 10 Fold Cross Validation

Figure 20 Stepwise Logistic Model for Minor Mode Popularity Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 5.377 - 0.009\,\text{Duration} - 1.525\,\text{Acousticness} - 3.860\,\text{Energy} \\
&- 2.739\,\text{Instrumentalness} + 0.639\,\text{Key3} - 0.443\,\text{Key4} \\
&+ 0.351\,\text{Loudness} + 3.328\,\text{Speechiness} - 1.522\,\text{Valence}
\end{aligned}
\tag{B.37}
$$

All Possible Subsets Regression

Figure 21 All Possible Subsets Logistic Model for Minor Mode Popularity Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 5.369 - 0.009\,\text{Duration} - 1.537\,\text{Acousticness} - 3.955\,\text{Energy} \\
&- 2.722\,\text{Instrumentalness} + 0.354\,\text{Loudness} + 3.278\,\text{Speechiness} \\
&- 1.445\,\text{Valence}
\end{aligned}
\tag{B.38}
$$

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 0.322 - 0.004\,\text{Duration} - 0.409\,\text{Acousticness} - 0.063\,\text{Danceability} \\
&- 0.607\,\text{Energy} - 0.839\,\text{Instrumentalness} + 0.013\,\text{Key1} \\
&- 0.091\,\text{Key2} + 0.401\,\text{Key3} - 0.053\,\text{Key4} + 0.054\,\text{Key5} + 0.142\,\text{Key6} \\
&+ 0.037\,\text{Key7} + 0.030\,\text{Key8} - 0.042\,\text{Key9} + 0.034\,\text{Key10} - 0.034\,\text{Key11} \\
&+ 0.084\,\text{Loudness} + 1.616\,\text{Speechiness} - 0.0003\,\text{Tempo} - 0.724\,\text{Valence}
\end{aligned}
\tag{B.39}
$$

Lasso 10 Fold Cross Validation

$$\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -1.882 \tag{B.40}$$

Elastic Net 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 0.322 - 0.004\,\text{Duration} - 0.409\,\text{Acousticness} - 0.063\,\text{Danceability} \\
&- 0.607\,\text{Energy} - 0.839\,\text{Instrumentalness} + 0.013\,\text{Key1} \\
&- 0.091\,\text{Key2} + 0.401\,\text{Key3} - 0.198\,\text{Key4} + 0.054\,\text{Key5} + 0.142\,\text{Key6} \\
&+ 0.037\,\text{Key7} + 0.030\,\text{Key8} - 0.041\,\text{Key9} + 0.034\,\text{Key10} - 0.034\,\text{Key11} \\
&+ 0.084\,\text{Loudness} + 1.616\,\text{Speechiness} - 0.0003\,\text{Tempo} - 0.724\,\text{Valence}
\end{aligned}
\tag{B.41}
$$

Popular Classification: Major Mode Results

Full Logistic Regression

Figure 22 Full Logistic Model for Major Mode Popularity Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 3.198 - 0.008\,\text{Duration} - 0.747\,\text{Acousticness} + 0.935\,\text{Danceability} \\
&- 3.457\,\text{Energy} - 12.325\,\text{Instrumentalness} + 0.134\,\text{Key1} \\
&+ 0.070\,\text{Key2} + 0.586\,\text{Key3} - 0.159\,\text{Key4} + 0.031\,\text{Key5} + 0.052\,\text{Key6} \\
&+ 0.174\,\text{Key7} + 0.344\,\text{Key8} + 0.275\,\text{Key9} - 0.385\,\text{Key10} - 0.071\,\text{Key11} \\
&+ 0.330\,\text{Loudness} + 4.734\,\text{Speechiness} + 0.003\,\text{Tempo} - 1.386\,\text{Valence}
\end{aligned}
\tag{B.42}
$$

Stepwise 10 Fold Cross Validation

Figure 23 Stepwise Logistic Model for Major Mode Popularity Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 3.280 - 0.008\,\text{Duration} - 0.777\,\text{Acousticness} + 0.976\,\text{Danceability} \\
&- 3.478\,\text{Energy} - 12.106\,\text{Instrumentalness} \\
&+ 0.513\,\text{Key3} - 0.267\,\text{Key8} - 0.460\,\text{Key10} \\
&+ 0.333\,\text{Loudness} + 4.790\,\text{Speechiness} + 0.003\,\text{Tempo} - 1.411\,\text{Valence}
\end{aligned}
\tag{B.43}
$$

All Possible Subsets Regression

Figure 24 All Possible Subsets Logistic Model for Major Mode Popularity Classification

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& 3.289 - 0.008\,\text{Duration} - 0.750\,\text{Acousticness} + 0.942\,\text{Danceability} \\
&- 3.483\,\text{Energy} - 11.929\,\text{Instrumentalness} + 0.333\,\text{Loudness} \\
&+ 4.822\,\text{Speechiness} + 0.003\,\text{Tempo} - 1.406\,\text{Valence}
\end{aligned}
\tag{B.44}
$$

Ridge 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -1.031 - 0.003\,\text{Duration} - 0.103\,\text{Acousticness} + 0.159\,\text{Danceability} \\
&- 0.270\,\text{Energy} - 0.817\,\text{Instrumentalness} + 0.027\,\text{Key1} \\
&+ 0.007\,\text{Key2} + 0.266\,\text{Key3} - 0.114\,\text{Key4} - 0.025\,\text{Key5} - 0.032\,\text{Key6} \\
&+ 0.078\,\text{Key7} + 0.138\,\text{Key8} + 0.130\,\text{Key9} - 0.214\,\text{Key10} - 0.064\,\text{Key11} \\
&+ 0.065\,\text{Loudness} + 2.178\,\text{Speechiness} + 0.002\,\text{Tempo} - 0.474\,\text{Valence}
\end{aligned}
\tag{B.45}
$$

Lasso 10 Fold Cross Validation

$$\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.087 \tag{B.46}$$

Elastic Net 10 Fold Cross Validation

$$
\begin{aligned}
\log\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) ={}& -1.031 - 0.003\,\text{Duration} - 0.103\,\text{Acousticness} + 0.159\,\text{Danceability} \\
&- 0.270\,\text{Energy} - 0.817\,\text{Instrumentalness} + 0.027\,\text{Key1} \\
&+ 0.007\,\text{Key2} + 0.266\,\text{Key3} - 0.114\,\text{Key4} - 0.025\,\text{Key5} - 0.032\,\text{Key6} \\
&+ 0.078\,\text{Key7} + 0.138\,\text{Key8} + 0.130\,\text{Key9} - 0.214\,\text{Key10} - 0.064\,\text{Key11} \\
&+ 0.065\,\text{Loudness} + 2.178\,\text{Speechiness} + 0.002\,\text{Tempo} - 0.474\,\text{Valence}
\end{aligned}
\tag{B.47}
$$