
What Makes the Market Tix: A Machine Learning Analysis of the Resale Concert Ticket Market


Citation: Haglund, George. 2020. What Makes the Market Tix: A Machine Learning Analysis of the Resale Concert Ticket Market. Bachelor's thesis, Harvard College.

Citable link: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364693

Terms of Use: This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

What Makes The Market Tix
A Machine Learning Analysis of The Resale Concert Ticket Market

a thesis presented by
George Haglund
to
The Department of Computer Science

in partial fulfillment of the requirements for the joint degree of Bachelor of Arts in the subjects of Computer Science and Statistics

Harvard University
Cambridge, Massachusetts
April 2020

© 2020 George Haglund. All rights reserved.

Thesis advisor: Kevin Rader
George Haglund

What Makes The Market Tix

Abstract

The resale concert ticket market is a largely unexplored domain in terms of rigorous statistical analysis. It is a market in which tickets are sold at multiples of their face value minutes after they are released to the public on the primary market. In this paper, I build a number of predictive models on data collected from resale ticket websites in order to explain how prices behave in the secondary ticket market. I then introduce parametric models to explain the factors that influence price changes in this market. I am able to use these models to explain how some tickets tend to behave over the course of their life on the secondary market, as well as some factors that influence this behavior. I find that Long Short-Term Memory Neural Networks perform extremely well in a predictive capacity for both average and individual ticket prices. The inputs to the parametric models are affected the most by the number of days remaining until a concert and the population of the city where the concert takes place. These models find that the shape of the trend of ticket prices tends to be similar between concerts, while the starting price of a ticket varies significantly between concerts. The findings of this paper shed some light on the confusing secondary market and provide insight for buyers and sellers in both the primary and secondary concert ticket markets.

Contents

1 Introduction

2 Background

3 Data
  3.1 Raw Data
    3.1.1 Artist Data
    3.1.2 Concert Data
    3.1.3 Ticketing Data
  3.2 Data Collection
    3.2.1 Artist Data Collection
    3.2.2 Concert Data Collection
    3.2.3 Ticket Data Collection
    3.2.4 Avoiding Bans
  3.3 Cleaning
    3.3.1 Artist Reduction
    3.3.2 General Admission
    3.3.3 Date Cutoff
    3.3.4 Extra Dates
    3.3.5 Missing Data
    3.3.6 Data Inconsistencies
  3.4 Exploratory Data Analysis
    3.4.1 Artists
    3.4.2 Venues
    3.4.3 Tickets
    3.4.4 Likelihood Justification

4 Methods
  4.1 Log Mean Tickets
    4.1.1 Linear Regression
    4.1.2 Mixed Effects Model
    4.1.3 Generalized Additive Model
    4.1.4 Random Forest
    4.1.5 Auto-Regressive Integrated Moving Average
    4.1.6 Long Short-Term Memory Neural Network
  4.2 ID Tickets
    4.2.1 Likelihood Model
  4.3 Model Evaluation
    4.3.1 Predictions
    4.3.2 Inference

5 Results
  5.1 Log Mean Tickets
    5.1.1 Ticket Level
    5.1.2 Venue Level
    5.1.3 Artist Level
  5.2 ID Tickets
    5.2.1 Ticket Level
    5.2.2 Venue Level
    5.2.3 Artist Level

6 Discussion
  6.1 Early Trends
  6.2 Predictive Power
  6.3 Variable Contributions
  6.4 Affected Parties
  6.5 Diagnostics and Decisions
  6.6 Next Steps
  6.7 Effects of COVID-19

7 Conclusion

References

Acknowledgments

Thank you first and foremost to Kevin Rader, my advisor in this project. He has gently guided me through this year-long endeavor, and I am extremely grateful for all the help that he has given me. A special thank you to my family for always supporting me in everything that I pursue and always being there for me. An additional thank you to my sister for her creative help in naming this thesis. Thank you to my friend Ethan for taking me to the Astroworld concert where the idea for this paper had its inception, and to my friend Miles for providing his knowledge of the music industry to help turn that idea into a reality. Finally, thank you to my roommates, teammates, and everyone else who encouraged and supported me in my journey over the last year. I would also like to apologize to most people mentioned here, who had to listen to me talk at length about the small intricacies of how tickets are priced and sold.

1 Introduction

The music industry is an industry that will likely never go out of style. It has seen incredible growth in recent years due to the popularization of streaming services. There is one area of the music industry, however, that appears to have stagnated: the quantity of concert tickets sold. According to Pollstar, a magazine that keeps a database of ticket sales and releases a yearly report of its findings, there was a 2.1% drop in total tickets sold between 2017 and 2018, from 23.4 million to 22.9 million [13]. This number does not tell the full story. Pollstar also reports that 2018 was a record-setting year for sales, with a “12% jump in total gross from last year’s $1.97 billion to a record-setting $2.21 billion” for the top 50 tours worldwide [13]. In its 2018 Media and Entertainment report, PricewaterhouseCoopers found that “live music ticket sales will increase at a compound annual growth rate (CAGR) of 3.33% from 2018 to 2023, from $21.256bn (projected) in 2018 to $25.036bn in 2023.” [5] The number of tickets

sold has dipped, but the total revenue from sales has made a drastic jump. The primary ticket market is raising its prices to keep up with the secondary market. In recent years, the popular music concert ticket market has witnessed a disturbing phenomenon whereby tickets sell out within seconds of release and then reappear on the resale market with astronomical markups. In 2016, the New York Attorney General released a report on the state of ticket scalping in New York. In the report, they “studied six ticket brokers and found they marked up their tickets an estimated 49% on average, ranging by broker from an average of 15% to 118%.” [18] The reason ticket brokers are able to mark up their prices in this manner is the problem of fairness and pricing in the concert ticket industry. Historically, artists have priced their concerts well below what the demand for this good would suggest [10]. These artificially low prices have allowed for rent-seeking behavior, whereby ticket scalpers can buy up tickets at face value and resell the tickets at the market value [10]. The ability of brokers to price tickets at their true value without having to worry about image and reputation has helped the secondary ticket market grow to between ten and fifteen billion dollars in annual volume [4]. As a result of its rapid growth, “The ticket marketplace has become a fiercely competitive game in which major corporations compete over resale prices with the fan next door, scalpers have a Washington lobbyist and thousands of tickets disappear in a fraction of a second.” [19] Unlike in the primary market, sellers are able to, and frequently do, dynamically alter the prices of their tickets in the time leading up to a concert. Because of this, the secondary market is highly variable. All these factors result in a growing, dynamic marketplace that affects nearly every person who listens to music. Despite the problems in the resale market, there seems to be very little happening to fix it.
Some major artists, like Taylor Swift, have tried to institute a dynamic pricing model for their tours [16]. However, the systems that institute these “dynamic” changes in price tend to be outdated and require significant work on the part of the ticket seller. Previous solutions have been flawed, possibly because there is little understanding of how this market functions. In his paper “Rockonomics”, Alan Krueger notes that “because scalping is primarily an

underground activity, little systematic empirical analysis has been done on secondary ticket markets” [6]. What analysis has been performed on the concert ticket market is largely made up of economic theory regarding how tickets are priced and sold in the primary market. The goal of this paper is to introduce a machine learning approach to the little-understood secondary ticket market. This paper proposes to use machine learning methods and statistical inference in order to perform an analysis on the behavior of tickets in the resale market. Specifically, this paper will look at how ticket prices change over time - from when they go on sale to the day of the concert - as well as what factors influence these changes in price. The goal of these findings is to shed some light on the mechanics of this market in order to both help artists combat the rent-seeking behavior of ticket brokers and to give some power back to consumers in the process of purchasing concert tickets.
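As a quick arithmetic check on the market-growth figures cited in this introduction, the CAGR that PwC reports follows from the standard compound-growth formula. The dollar figures below are the ones quoted above; the snippet itself is only an illustration, not part of the original analysis.

```python
# Compound annual growth rate: CAGR = (end / start) ** (1 / years) - 1
# Figures (in $bn) are the PwC projections quoted in this chapter.
start = 21.256   # projected 2018 live-music ticket sales
end = 25.036     # projected 2023 live-music ticket sales
years = 2023 - 2018

cagr = (end / start) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.2%}")  # matches the quoted 3.33% figure
```

Reassuringly, the implied rate reproduces the 3.33% figure quoted in the PwC report.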

2 Background

The ultimate goal of this paper is to analyze how and why prices change over time in the secondary market. In order to give a better understanding of the models presented, this paper will attempt to explain how the secondary market operates, as well as what causes it to work the way it does. To do so, this paper will start by explaining how the primary market operates, and why this creates the need for a secondary market. Then, this paper will show how the secondary market combined with the operation of the primary market gives rise to the phenomenon of astronomical secondary market prices. Selling tickets is a complicated process with a number of intermediaries. The process begins with the artists themselves, or rather, their representation in the form of managers. Once an artist and their manager have decided to go on tour, they hire promoters to find venues in the cities where they would like to perform. Once a promoter has found a venue, “the three agree on a revenue sharing rule”

[8]. The process of deciding on revenue sharing is also very complicated, but is outside the scope of this paper. The next steps for the artist and the promoter are to decide on the price of tickets, when to start selling tickets, and how the concert will be advertised [8]. Tickets are eventually released to the public through the venue box office and ticketing agencies, although most sales come from ticketing agencies. Pascal Courty explains, “Although the ticketing agency charges a processing fee on top of the face price, it typically sells a majority of the tickets because it can reach a much wider audience than the box office” [8]. The process of deciding on how to price and sell concert tickets is extremely complicated, with a number of intermediaries between the artist and the fan. However, the common denominator between all primary market tickets is that they are all priced extremely low. The first step in understanding secondary market prices is to note that, relative to market demand, primary market tickets are severely underpriced. According to economist Eric Budish, “Artists often want to sell their tickets at a — to an economist — an artificially low price. And what I mean by that is a price at which demand dramatically exceeds supply.” [10] There are a few different reasons artists sell their tickets at such a low price. In order to understand these reasons, it is beneficial to first understand two characteristics of concert tickets. Economist Alan Krueger explains these characteristics as the following: “concerts are an experience good, whose quality is only known after it is consumed;... bands sell complementary products, such as merchandise and records” [6]. Artists rely on both the experience of their fans and products purchased at the concert to keep up their revenue. The ultimate goal of an artist is typically to make the most money from their concert, while also garnering the largest possible audience.
This speaks to Krueger’s first characteristic of tickets, which is that concerts are an experience good. A common boast in the music industry is that of a sell-out concert. It is a proxy for the popularity of an artist, and gives concert-goers the impression that the artist is worth seeing. By selling tickets at a low cost, the artist better guarantees a large audience, which leads to a more enjoyable concert experience

for the fans. In other words, “underpricing guaranties a sellout, and hence generates a certain amount of prestige that acts as a kind of explicit validation of the worthiness of attendance” [7]. A larger audience not only makes the concert seem more valuable, but also provides greater revenue streams from products sold at the concert itself. Increased revenue from the sale of complementary goods due to a larger audience is Krueger’s second characteristic. In the optimal scenario, an artist would like to sell out their concert, while also attaining the greatest possible revenue for the show. Krueger explains that lower prices allow more fans to attend the concert, giving them a chance to buy complementary goods. He notes, “The price of a concert ticket is set lower than it would be in the absence of complementary goods, because a larger audience increases sales of complements and raises revenue.” [6] Complementary goods are also a more easily predicted source of revenue for artists. “General admission seats are priced uniformly low as a loss leader in order to attract a sell-out crowd that will produce concession revenue that the promoter is able to predict with accuracy in advance of pricing tickets for the concerts.” [11] Artists trade a decrease in revenue from low ticket prices for an increase in revenue from merchandise, which is also more predictable revenue. Both prestige and complementary goods provide initial explanations as to why tickets are priced cheaply relative to demand; however, the most common and potentially most salient argument is one of fairness. The final key to ticket pricing in the primary market is fans wanting to be treated fairly. As previously discussed, artists charge an artificially low price for their tickets so that fans can afford them. However, there is also a large element of reputation tied in to low prices. If an artist has a reputation for gouging their fans, fewer people will be willing to attend that artist’s show.
In the modern age of social media especially, a poor reputation for a band can be devastating. “Kahneman, et al. (1986) argue that customers value being treated fairly, and the market clearing price may be considered unfair. Fairness is likely to be a more important consideration if attendance at a concert is viewed as a social event rather than an economic transaction” [6]. It is not a stretch to say that most

people consider attending a concert to be more than just a purchase of services, which leads to a serious dilemma for artists. In a one-off scenario, artists could charge the market value for a ticket, and as a result, capture the highest possible ticket revenue for a show. However, touring is typically the main source of income for an artist, so artists are constantly on tour. As a result, charging what consumers consider “unfair” prices can have a dismal effect on attendance at concerts. Artists “risk creating a public perception as price gougers, thereby hurting their reputations and long-run box office sales” [11]. Research by Daniel Kahneman on what consumers perceive as “fair” shows that consumers see increases in price relative to the “normal” price as fair only when a firm is losing money [12]. Since concert tickets have traditionally been sold at below market value, increasing prices in order to capture more profits would be seen as unfair and would harm both attendance at individual concerts and artists’ bottom line. At the same time, “performers sacrifice considerable income if they price their shows below the market rate.” [6] Ultimately, performers are forced to underprice their tickets to increase other revenue sources and protect their reputations. Therefore, they collectively forgo billions of dollars in potential revenue, all of which is funneled into the secondary market. The secondary market, in theory, is for fans who have already purchased tickets and who can no longer attend an event to recoup as much of their losses as the market will allow. Although this works to an extent, low ticket face values have also provided the opportunity for third parties to profit off the disparity between a ticket’s face value and its market price. This rent-seeking behavior is by no means a new idea. A story from the 1867 New York Times reads, “In early 1868, Charles Dickens read from A Christmas Carol at Steinway Hall.
Tickets sold out in half a day at their face value of $2, and reportedly had a secondary-market value of as high as $20; another report indicated that a young boy was paid $30 in gold for a good spot in line” [4]. This is an old example of ticket resale on the secondary market; however, the practice is almost exactly the same in modern times. An artist will list tickets, brokers will use various tricks to buy as many tickets as quickly as possible, and then turn around to sell those

tickets at a markup. Brokers are able to profit in this way because of how tickets are priced in the primary market and because they do not share the artists’ concerns. Brokers earn no revenue from merchandise or concessions, and they have no need to worry about reputation. Ken Lowson, a ticket broker, notes, “You want the fan to get mad at a misdirected person than at the artists, because they lose their fans that way. Like they can price their ticket at $150 before their fans puke. But you know a scalper can sell the same fan a seat for $2,000, and they’re not mad at the artists, they’re mad at the scalper; but they still pay it.” [10] The result of these practices is a massive secondary ticket market that is filled with arbitrage opportunities. “Recent industry estimates are that fully 20% of all tickets purchased in the primary market are now resold in the secondary market, constituting on the order of $10-15bn of volume annually. In extreme cases, speculators amass as many as 90% of the tickets available for a particular event.” [4] However, despite the longtime existence of resale opportunities, the market’s massive increase in sales volume is a relatively recent phenomenon. The secondary market has become a massive industry in recent years as a result of technological advances in ticket selling. The ultimate goal for ticket brokers has always been to create a monopoly on the best tickets by being the first ones in line. The goal has not changed, but the methods by which it is accomplished have. Before the internet, tickets were purchased at brick-and-mortar businesses such as record stores [10]. Scalpers would pay off managers to let them get in early and take the best tickets, which they could resell at a high price [10]. The introduction of phone sales forced brokers to change their tactics. Instead of paying off managers, brokers would hire teams of people to all call in at the time of ticket release [10].
They also had strategies to get the phone operators to sell them their tickets faster [10]. Then, ticket sales moved to the internet, and brokers were able to cut out the human element while still massively scaling up their operations. This was a result of what are known as “bots”. At a high level, bots are computers that act like humans. They can be programmed to carry out a task such as purchasing concert tickets. Bots have given ticket brokers the ability to always be first in line and to buy as many tickets as they could possibly want. The introduction of bots resulted in a dramatic change in the quantity and quality of tickets that brokers were able to purchase. Brokers now had the ability to purchase as many tickets as they wanted the instant a concert’s sales started. Bots, although they perform the same transactions as humans, are far faster. This increase in speed gave brokers the ability to snap up premium tickets in the blink of an eye. Since the introduction of bots, ticketing websites have introduced protections by limiting how many tickets can be purchased on a single credit card and including CAPTCHAs on their websites. Brokers, however, have managed to find ways around these protections. When ticket sales shifted to the internet, brokers benefited in three ways. “First, the internet has lowered the costs of amassing tickets in the primary market. Second, the internet has lowered the cost of reselling tickets in the secondary market. Third, the internet has made it easier to skirt state rules on ticket reselling.” [4] The ultimate outcome of the introduction of bots has been fewer tickets left for fans in the primary market and higher prices in the secondary market. Brokers’ ability to buy out tickets in a matter of seconds has forced consumers to purchase more and more from the secondary market at significant markups. In a report by the New York Attorney General’s office, it was found that brokers “marked up their tickets an estimated 49% on average, ranging by broker from an average of 15% to 118%.” [18] While these average markups by themselves are no doubt frustrating to consumers, there are far more egregious examples as well. One of these examples comes from a 2008 Miley Cyrus concert.
On the opening day of sales, “tickets with a face value of at most $64 sold out in approximately twelve minutes, and were then immediately posted on secondary-market venues such as eBay and StubHub at prices that in some instances exceeded $2,000 (Levitt, 2008)” [4]. These extraordinary markups have angered both consumers, who have to buy tickets at astronomical prices, as well as artists, who are losing out on potential profits. The NYAG lists fan complaints it has received about the secondary market, such as “price gouging,” “scalping,” “outrageous fees” and “immediate sell-outs.” [18] The anger felt by both artists and consumers has led

to many people attempting to put a stop to the practice. Brokers and bots have become such a serious problem in the ticketing industry that there have been large-scale efforts to stop them. As previously mentioned, ticketing websites have tried using anti-bot technology such as CAPTCHA, as well as limiting purchase sizes by tracking credit card numbers. There have also been instances of “selling non-removable bracelets and admitting only bracelet-wearers at the consumption date” [7]. However, there are factors that limit the number of methods available to ticket sellers in the primary market. One of these factors is the purchasing behavior of fans. While some fans are certain of their intent to attend a concert well in advance, many fans are not. So, even if fans could buy a ticket more quickly than a broker’s bots, many do not even know if they will be able to attend a concert.

This feature of consumer preferences can explain the conflict between promoters and brokers. Promoters have to make tickets available early in advance to satisfy the needs of those consumers who value planning and would not attend if they were not able to secure early the property rights for a guaranteed experience. However, among those who postpone the decision to purchase a ticket, some will end up with a high valuation close to the event date. If resale is possible, this situation opens the door for profit opportunities where brokers can buy early tickets that they resell later to those consumers who eventually find out that they are eager to attend the event. [8]

Despite these limitations, artists and ticket sellers have still made efforts to combat the problem. One potential solution, proposed by third-party researchers, was the use of auctions for premium tickets alongside traditional selling methods [4]. The researchers set up the auction as follows: a small number of premium tickets were set aside to be bid on before traditional tickets went on sale. Bidders entered a per-ticket price, as well as the quantity of tickets they

would like to buy. Bidders were allowed to increase their bids at any time during the auction, but all bids were binding. Each auction lasted for a few days, and at the close of the auction, bidders were ranked according to their highest bid. The tickets were then distributed by quality, with the best tickets going to the highest bidder, and then in decreasing quality down the list of ranked bidders. Winners paid their bid price and losers paid nothing [4]. The paper came to a number of conclusions from the results of these auctions. The first conclusion is that auctions are quite good at selling tickets at market price and generating maximum revenue for the artists, under the condition that all tickets are sold. The second is that, although brokers are typically able to outbid traditional fans, brokers’ ability to profit off of these tickets in the secondary market is severely reduced [4]. One major issue with this strategy is that fans with little auction experience sometimes overbid for tickets due to the complexity of the auction design. This type of auction has ultimately failed to take off, as the paper notes, for two possible reasons. Auctions are complex mechanisms, and fans prefer to deal with the simplicity of posted ticket prices. Auctions are also viewed by consumers as unfair [4]. Research on fairness has shown that “the introduction of an explicit auction to allocate scarce goods or jobs would also enable the firm to gain at the expense of its transactors, and is consequently judged unfair.” [12] In this case, the artist is the firm, and the fans are the transactors. Another potential solution discussed as an alternative to auctions is ticket bans. Ticket bans would completely solve the issue of brokers on the secondary market, as they would eliminate the secondary market in its entirety.
The problem with this solution is that the secondary market is in fact quite useful for fans who find out after purchasing a ticket that they cannot attend the concert. In addition, the legality of banning ticket sales is not particularly clear-cut [4]. Budish’s previously discussed auction paper mentions the 2012 Olympics as a “cautionary tale regarding resale bans” [4]. For the Olympics, “Tickets in the primary market were allocated in large part to corporate sponsors, who frequently discover at the last minute that they are unable to attend... As a result, there were large blocks of empty seats at the Olympics, which was both wasteful

and embarrassing for the event’s organizers (Economist, 2012).” [4] Problems with resale bans for concerts follow a similar logic. Fans who are unable to attend suffer the full monetary loss of their ticket purchase, and the concert venue, although the tickets may be sold out, will be less than full. As discussed previously, this harms both the fans and the artists. To further compound this issue, the NYAG report finds that “on average, only about 46% of tickets are reserved for the public. The remaining 54% of tickets are divided among two groups: holds (16%) and pre-sales (38%). Holds are tickets that are reserved for industry insiders, such as artists, agents, venues, promoters, marketing departments, record labels, and sponsors. Pre-sales make tickets available to non-public groups before they go on sale to the general public.” [18] If half of all tickets are reserved for corporations, banning resale runs the risk of large swaths of a venue being empty, just as in the 2012 Olympics. Banning resale is quite risky, so some artists have decided to follow in the footsteps of some ticket brokers by instituting dynamic pricing. Dynamic pricing is a relatively new way of listing tickets on the primary market that seeks to more closely match the market price. However, this approach as it appears today is still fairly rudimentary and still fails to address the issue of fairness for consumers. The general idea is that instead of posting a single list price for tickets and having it remain static, a ticket’s price will change from when it goes on sale until it is sold. The change in price is based on a variety of opaque factors. Another industry with a similar problem, the sports industry, has more readily and transparently adopted this practice. Ticket prices there typically change based on factors such as weather, the opponent, and teams’ win records. However, the changes in price are relatively small.
The concert industry has been less keen on adopting this strategy, although some high-profile artists have attempted it. One such artist is Taylor Swift. For one of Swift’s concerts, it was reported that “a third-row seat for Swift’s June 2 show in Chicago cost $995 in January. Three months later, the same seat was priced at $595.” [16] This method fixes the issue of a complicated mechanism for fans that was found in the ticket auctions. While this strategy manages to make up for some

lost ticket revenue, the perception of unfairness remains, and consumers may end up paying the same prices they would have paid on the secondary market. A discussion with an employee involved in premium ticket packages at AQR, another major ticketing company, also revealed that the implementation of this method is fairly rudimentary, involving simple equations to predict demand and manual updating of ticket prices. Ultimately, all of these solutions manage to fall short in some way, although that may be somewhat due to a lack of trying. Ticket brokers have a bad reputation, but they also provide some economic benefits to parties involved in selling tickets. In his paper on ticket resale economics, Courty lays out these benefits.

They invest in activities that may create value in three ways. First, they seek new consumers who may not consume otherwise. Brokers aggregate tickets for multiple events, satisfying a broad range of consumer demands. Second, brokers help market clearing. In fact, brokers sometimes earn large profits, but at other times are left with unsold tickets. Some brokers even argue that they provide a form of insurance for the event promoter by buying tickets early and endorsing the event. Third, brokers price discriminate, and it is possible that the promoter ends up selling more tickets with the presence of brokers. Under that interpretation, brokers are welfare enhancing, since they help the promoter to sell to consumers whom the promoter would find it hard to reach or otherwise attract. [8]

These services are all profitable for the artists, but there is one additional service that brokers provide that is helpful to the fans themselves. “Brokers also provide some liquidity in the secondary market by buying tickets from consumers who decide that they do not want anymore to (or cannot anymore) attend the performance. They later sell these tickets to those most eager consumers who could not plan ahead but find out that they want to attend a performance only in the last minute.” [7] From an economic standpoint, ticket brokers have some merit for helping the market. In some cases, they also provide some additional

under-the-table revenue directly to the artist. Parties involved in selling tickets on the primary market can use brokers to secretly increase their revenue. The report from the NYAG's office explains, "the entities most able to stop this practice are either powerless or not economically incentivized to stop the practice. Artists may want to avoid this from happening but have to rely on the venues to sell tickets. Venues have an interest in selling out tickets and have little incentive to put protections in place. Ticket vendors also have an interest in selling as many tickets as they can; they collect a fee with each ticket that is sold, a fee that is the same regardless of whether it is paid by you or a reseller." [18] An interview from Freakonomics found something similar. "Well, promoters and teams sell directly to brokers. You know, and then those brokers and list them on the marketplace. You know, for a team owner, it's their ticket. And for a promoter, it's their ticket, it's not the artist's ticket. I don't know another industry that intentionally advertises one price to intentionally hold it, and resell it secretly." [10] This problem is confirmed by the man who invented the use of bots for ticket resale. After having a change of heart, and wanting to use technology to combat a problem he created, he found that a solution was not wanted. "I went to four primaries, and four rock stars, and three teams, and promoters, and managers, and offered, basically, to do anti-bot and anti-scalping for free if they wanted to. To do a proof of concept. And nobody wanted to do it." [10] The secondary ticket market is a corrupt, opaque beast filled with contradictions and no obvious solutions. In order to help both artists and their adoring fans, this paper will explore the workings of the secondary market. This paper will present how the prices of tickets sold in the resale market change over time, as well as what factors most drive these price changes.

3 Data

3.1 Raw Data

The data used in this paper was collected over the course of about 5 months, from September 2019 until February 2020. The ticketing data was collected directly from a few ticketing resale websites. Ticketing companies are very private about their data. As such, despite reaching out to a few major ticketing companies, retrieving a data set directly from a ticketing company proved to be impossible. The methodology for the data collection will be explained in the following sections. The three categories of data used in this paper can be classified as artist data, concert data, and ticketing data. The data exists in a hierarchical structure where each artist has a number of concerts and each concert has tickets that were sold for it.

3.1.1 Artist Data

Variable Name   Datatype   Description
name            string     Artist's name
followers       integer    Number of Spotify followers
artist_pop      integer    Artist's Spotify popularity metric

Table 3.1.1: Artist Variables

Artist data includes variables that relate directly to the artist and their music. These variables are largely categorical and serve to differentiate artists when looking at tickets across artists. Artists were chosen from the Billboard Top 100 list shortly before the beginning of the data collection process. The list was updated once, at the beginning of the data collection process, to include additional artists. Artists who were not on tour in the next few months were dropped from the data set. A full list of artist variables can be seen in Table 3.1.1. All artist data was collected by querying Spotify's public API.

3.1.2 Concert Data

Variable Name   Datatype   Description
id              integer    ID specific to resale site
site            string     Resale site indicator
concert_name    string     Concert name
artist          string     Artist name
date            date       Concert date
venue           string     Venue name
city            string     City name
city_pop        string     City population
day             string     Day of the week

Table 3.1.2: Concert Variables

The concert-level data includes variables that relate to the time and place of an artist's concert. This data was also collected at the beginning of the data collection process. The list of concerts was updated once after the start of the scraping process. All concerts for an artist were retrieved directly from the listings of resale ticket websites. The concert-level data distinguishes how venue size and concert dates may affect prices. A full list of concert-level variables can be seen in Table 3.1.2.

3.1.3 Ticketing Data

Variable Name   Datatype   Description
ticket_id       integer    Ticket ID specific to resale site
concert_id      integer    Concert ID specific to resale site
site            string     Resale site indicator
price           float      Ticket price
log_price       float      Log ticket price
quantity        integer    Number of tickets on sale at given price
section         string     Section of venue
row             string     Row in section
date            date       Collection date

Table 3.1.3: Ticket Variables

The third level of data is ticketing data. This is where the bulk of the analysis was performed. The ticketing data is set up so that each ticket has a date on which the price was collected and some information about the ticket from that date. Ticketing data was collected once a day, every day, at midnight EST. An inventory of tickets is unlikely to change drastically over the course of 24 hours, which is why information was collected at this interval. Collecting the data in the middle of the night was intended to coincide with a period of low ticket buying activity. Every ticket in the collected set of concerts was retrieved during the daily data collection. A full list of ticket-level variables can be seen in Table 3.1.3. One note is that ticket_id is blank for some tickets, as it began being collected about halfway through the data collection process.

3.2 Data Collection

Collecting data for this paper was extremely difficult. As previously mentioned, ticketing companies are very opaque with their data. While many have public

APIs that once contained all the necessary information about on-sale tickets, they have since been restricted to, at best, the maximum and minimum ticket prices for a concert. The companies themselves also refused to release their data for the purposes of this paper. As a result, it was necessary to scrape the data directly from ticket resale websites. However, due to ticket scalpers using bots to buy large quantities of tickets, the two largest players in the ticketing industry, Ticketmaster and StubHub, have exceedingly strong protections against automated access to their sites. Despite concerted efforts, it was too difficult to collect data from these sites. Luckily, SeatGeek, TicketCity, TicketNetwork, and TickPick had fewer anti-scraping protections and still make up a significant portion of the ticket resale market. The data collection process can be broken down into three steps: retrieving a list of artists and their relevant information, finding concerts for these artists, and finally collecting the ticketing data. All the data was eventually stored in a SQL (RDS) database hosted on Amazon Web Services.

3.2.1 Artist Data Collection

To select artists, I first retrieved the Billboard Top 100 list and then queried Spotify's API for each artist's information. I scraped the Billboard site using a web scraper written in Python. The web scraper was implemented using Selenium WebDriver with Google Chrome as the browser. With this, I accessed the HTML elements for each artist on the list that contained their name, and stored the names in a CSV file. Once I had the list of artists, I retrieved their information from Spotify's public API. I again used Python, with a package built to access Spotify's API called spotipy. For each artist on my list, I queried the Spotify API for information such as genre, popularity, and followers. Once I had retrieved all this information, I transferred it to my SQL database.
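The Spotify step can be sketched as follows. This is only an illustration, not the actual collection code: the helper names are my own, and the sample response below only mirrors the relevant fields of Spotify's artist object.

```python
# Sketch of the artist-data step. A real run would create a spotipy client
# (e.g. spotipy.Spotify with client credentials) and pass it in as `sp`.
def parse_artist(item):
    """Extract the fields stored in the artist table from one API result."""
    return {
        "name": item["name"],
        "followers": item["followers"]["total"],
        "artist_pop": item["popularity"],
    }

def fetch_artist(sp, artist_name):
    """Query Spotify's search endpoint and keep the top match, if any."""
    results = sp.search(q=artist_name, type="artist", limit=1)
    items = results["artists"]["items"]
    return parse_artist(items[0]) if items else None

if __name__ == "__main__":
    # Example of the JSON shape the search endpoint returns for one artist.
    sample = {"name": "The Lumineers", "followers": {"total": 5000000}, "popularity": 80}
    print(parse_artist(sample))
```

The follower count and popularity value here are invented; the real values come from the API response at query time.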

3.2.2 Concert Data Collection

To retrieve information about the individual concerts, I needed to access the resale sites. SeatGeek's public API allows users to query concerts by artist name and returns all the necessary info about a concert. In order to access tickets, it was also necessary to retrieve the concert ID specific to each individual resale site. SeatGeek's API had this information available. The remaining sites, however, did not have any publicly available APIs. The remaining sites all followed a similar site structure where it was possible to query an artist's concerts directly through a URL. This query returned a webpage with a list of all upcoming concerts from an artist. The list contained the concert name, time, place, and a URL to the ticket listing page. In order to scrape the remaining sites, I switched from Python to Node.js. Python has some powerful, high-level scraping packages built for it; however, they tend to struggle when going past basic web scraping. Node.js has a browser automation tool called puppeteer that allows for more fine-grained control over how your web scraper interacts with a site. Most websites also now use jQuery to dynamically load content on a webpage. jQuery makes scraping slightly more difficult, as the content is not contained directly in the HTML file of a webpage. Rather, the content is retrieved once the page has already loaded. I got around this by using an option in puppeteer to include a javascript plugin when accessing a site. I had the script load the page and wait until it saw an element containing the relevant concert info. If the number of concerts found exceeded the page limit, they would be loaded dynamically as the user scrolled to the bottom of the page. In order to collect all the concerts, the script continuously automated scrolling until there were no more concerts to load. Then it retrieved the list of concerts from the page.
Although the elements containing the concert information on these sites did not have a single sub-element that contained the ID for a concert, the IDs were contained in the URL linking to the ticket page. In order to collect all the concert data at the same time, I found a Python

package called Naked that allowed me to run Node.js files through a Python script and returned the output in a usable format. I then wrote a Python script that retrieved a list of all the artists and passed it to both the SeatGeek API and the scrapers for the remaining sites. Once all this data was collected, it was stored in the SQL database.

3.2.3 Ticket Data Collection

The final step in the data collection was collecting the ticket data. For the purpose of user experience, all the resale websites used extensive javascript on their sites to load tickets dynamically. Tickets on sale often numbered in the hundreds, so implementing the same technique used for retrieving concerts would have resulted in significant data consumption and wait times. However, there was a significant upside to these sites loading their content dynamically. For the pages on these resale sites to load their content dynamically, it is necessary for all the data to be stored in a single location. In this case, it was a database that could be accessed through an API. These APIs were not marketed as available to the public. They were simply how a web page could retrieve the list of tickets on sale. Each of the resale site’s ticket listing pages, once loaded, retrieved a JSON file through this API. These APIs could be queried directly through a URL. I found these APIs by looking through the source code for a ticket page. The APIs required a key to access them, but this key was plainly available in the URLs I found in the source code and did not change. As a result of these APIs, the full list of tickets available could be retrieved by querying this site using the site specific concert IDs that had been collected. Once the JSON file loaded, I parsed the relevant ticket information and stored it in the SQL database. Ultimately, the scrapers were run through a Python script which retrieved a list of all upcoming concerts from the SQL database and passed the concert IDs to the scraper scripts. All the scripts were later moved to an AWS EC2 server so they could be run on the cloud instead of locally.
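The shape of this step can be sketched as follows. This is an illustration only: the endpoint URL and JSON field names are placeholders, since each real site exposes its own undocumented API discovered in the page source as described above.

```python
import json

# Hypothetical sketch of the daily ticket pull. LISTINGS_URL and the field
# names below are illustrative placeholders, not any site's real endpoint.
LISTINGS_URL = "https://api.example-resale-site.com/listings?event_id={}&key={}"

def parse_listings(payload, concert_id, collection_date):
    """Flatten one JSON response into rows matching the ticket table."""
    rows = []
    for t in payload["listings"]:
        rows.append({
            "ticket_id": t.get("id"),       # may be absent on some sites
            "concert_id": concert_id,
            "price": float(t["price"]),
            "quantity": int(t["quantity"]),
            "section": t["section"],
            "row": t["row"],
            "date": collection_date,
        })
    return rows

if __name__ == "__main__":
    # Example of the JSON shape such an endpoint might return.
    payload = json.loads('{"listings": [{"id": 1, "price": "95.50", '
                         '"quantity": 2, "section": "GA", "row": "GA"}]}')
    print(parse_listings(payload, concert_id=42, collection_date="2019-12-01"))
```

In the actual pipeline the flattened rows would then be written to the SQL database rather than printed.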

3.2.4 Avoiding Bans

In addition to retrieving the data once a web page loads, a major component of web scraping is avoiding website blocks and bans, as most sites have some form of protection against automated access. Protections typically come in two forms. The first is to look for IP addresses that are repeatedly loading pages too quickly for a normal person. The second is checking the characteristics of a site request to see if there are any signs of automation. In order to circumvent the first issue, I used an IP rotation service. This service passes the page requests through proxies so that they appear as if they are coming from different IP addresses. These services are not perfect, as sites can still sometimes detect that a request is coming through a proxy. Ticketmaster and StubHub will typically block any site requests that come through proxies, as this is how brokers access sites in order to buy tickets quickly in bulk. These blocks can be circumvented by purchasing blocks of "fresh" IP addresses that have not seen frequent use. However, this method is very expensive, which is why Ticketmaster and StubHub were not scraped. The second form of protection required some altering of the scraper's request characteristics. I found a demonstration of how to get around many of the common checks for automation on Intoli, which I implemented in my own scrapers. The most common checks are for the browser type and a browser's user agent. Most web scraping tools will list the browser and/or the user agent as the name of the tool, which will indicate automation. The user agent property is also commonly checked in conjunction with an IP address in the case of the first type of protection. More complicated protections include checking characteristics that indirectly indicate browser automation. These include a browser's languages, plugins, and permissions. Automation packages will typically lack one or more of these, unlike a traditional user-operated browser.
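The two countermeasures can be sketched together as follows. The proxy addresses and user-agent strings below are placeholders, and the helper name is my own; a real IP rotation service would supply the proxy pool.

```python
import random

# Illustrative sketch of the two countermeasures: rotating the apparent IP
# via a proxy pool, and sending a realistic browser user-agent string.
PROXIES = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/79.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 Chrome/79.0 Safari/537.36",
]

def request_settings():
    """Pick a fresh proxy and user agent for each page request."""
    return {
        "proxies": {"http": random.choice(PROXIES)},
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
    }

# These settings would then be passed to an HTTP client, e.g.
# requests.get(url, **request_settings()).
```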
puppeteer also allows a simple way of customizing these properties, which is one of its main advantages over most web scraping packages built for Python. Once these checks had been circumvented, I was able to access the resale sites

without trouble and collect the necessary data.

3.3 Cleaning

The initial data set contained approximately 44 million observations. Building models to fit this data raised a few issues with some elements of the data set that needed to be corrected. Some additional elements were also added during this process.

3.3.1 Artist Reduction

The first part of cleaning the raw data involved limiting the number of artists in the final data set. The initial data set had over 100 artists. However, there were a few problems with the set of artists. The first problem was that not all the artists in the initial set of Billboard Top 100 artists were still performing. In the process of data collection, the web scrapers encountered events such as cover bands and Cirque du Soleil shows themed around these artists. All such artists and resulting shows were removed from the data set. Some artists also only had a few shows. In order to have a range of events to analyze for each artist, artists who had fewer than 10 shows that occurred before the final day of data collection were removed from the data set.
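The artist filter can be sketched with pandas. This is an illustrative helper rather than the actual cleaning code; the column names follow the concert table above, and the cutoff date stands in for the final day of data collection.

```python
import pandas as pd

# Sketch of the artist reduction: drop artists with fewer than `min_shows`
# concerts occurring before the final collection day.
def filter_artists(concerts, min_shows=10, cutoff="2020-02-17"):
    past = concerts[concerts["date"] < pd.Timestamp(cutoff)]
    counts = past.groupby("artist")["id"].count()
    keep = counts[counts >= min_shows].index
    return concerts[concerts["artist"].isin(keep)]

if __name__ == "__main__":
    # Toy data: artist A has 10 past shows, artist B only 2, so B is dropped.
    demo = pd.DataFrame({
        "id": range(12),
        "artist": ["A"] * 10 + ["B"] * 2,
        "date": pd.to_datetime(["2020-01-01"] * 12),
    })
    print(filter_artists(demo)["artist"].unique())
```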

3.3.2 General Admission

The next reduction in the data set was a result of the layout of seats in a venue. Each ticket has a section and row designation. The problem is that there is no consistency in the value of a seat across venues according to their row and section. There are two potential fixes to this problem. The first is to use the location of a seat relative to the stage in order to assign a value denoting how good a seat it is. Determining this metric, while certainly an interesting exercise, is out of the scope of this paper. The second solution is to focus exclusively on floor and general admission tickets. These tickets are all largely

equivalent within a venue. Therefore, we can ignore discrepancies between the seats themselves, and just focus on differences between venues. After including only GA and floor tickets, there were 3,150,956 observations remaining.

3.3.3 Date Cutoff

There were some concerts in the data set that had performance dates well into the future. To limit the shows to ones that had already been performed, only shows that happened before February 17th, 2020 were included in the data set. This left 1,700,690 observations in the data set.

3.3.4 Extra Dates

Some concerts had data from after the date of the concert. All such rows were dropped. This left 1,700,657 observations in the data set.

3.3.5 Missing Data

Some tickets with unique IDs had missing dates. The data for missing dates was filled in by copying the ticket information from the previous day.
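This forward-fill can be sketched with pandas. The helper below is illustrative, not the actual cleaning code: for one ticket's history, it expands the observed dates to a complete daily range and carries the last observed values forward.

```python
import pandas as pd

# Sketch of the gap-filling step: for a single ticket, reindex its time
# series to a complete daily range and copy the previous day's values
# into any missing days.
def fill_missing_days(ticket_history):
    """ticket_history: rows for one ticket_id, one row per observed day."""
    th = ticket_history.set_index("date").sort_index()
    full_range = pd.date_range(th.index.min(), th.index.max(), freq="D")
    return th.reindex(full_range).ffill().rename_axis("date").reset_index()

if __name__ == "__main__":
    # A ticket observed on Jan 1 and Jan 3; Jan 2 inherits Jan 1's price.
    hist = pd.DataFrame({
        "date": pd.to_datetime(["2020-01-01", "2020-01-03"]),
        "price": [50.0, 70.0],
    })
    print(fill_missing_days(hist))
```

In the real pipeline this would be applied per ticket_id, e.g. via a groupby over the ticket table.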

3.3.6 Data Inconsistencies

The final piece of missing or incorrect values in the data set was also an issue with the web scrapers. The main inconsistencies had to do with the city and venue of a performance. For venues, due to the design of some of the ticketing sites, some concerts had different venue names for a single concert. To fix this, a single venue was chosen from events by that artist on that date, and all values for that date and artist were replaced. There were also a few instances where a venue column had a city and state listed. These were fixed manually. There were also some instances of the same venue with different names across concerts. These were found by searching venues in the data set by city. These were also fixed manually. Finally, some venues listed multiple cities as their

location. These were fixed manually. Once the cities for each observation were corrected, the population for each city was added into the data set.
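The population join can be sketched with pandas. The city names and population figures below are placeholders for illustration only.

```python
import pandas as pd

# Sketch of the final cleaning step: attach a population to each concert by
# joining on the corrected city name. The population figures are placeholders.
city_pop = pd.DataFrame({
    "city": ["Boston", "Denver"],
    "city_pop": [692600, 727211],
})

concerts = pd.DataFrame({"id": [1, 2, 3], "city": ["Boston", "Denver", "Boston"]})
concerts = concerts.merge(city_pop, on="city", how="left")
print(concerts)
```

A left merge keeps every concert row and leaves city_pop missing for any city without a match, which makes unresolved city names easy to spot.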

3.4 Exploratory Data Analysis

Figure 3.4.1: Ticket Price Distribution

The main focus of this paper is how ticket prices vary over time. As such, most of the early data analysis is conducted to find general trends in the ticketing time series. As previously discussed, there are three levels of depth: artists, venues, and individual tickets. The first step is to look at how the tickets behave at a very high level across all artists and venues. The data is heavily right-skewed. Most of the ticket prices are clustered around $300. Some tickets are listed at extremely high prices. This is seen in Figure 3.4.1. This right-skew is likely due to brokers posting a few potentially valuable tickets at an extremely high price in the hopes of finding fans

Figure 3.4.2: Log Ticket Price Distribution

willing to purchase them. The remaining tickets are priced closer to the mean price, which still tends to be much more expensive than the face value. To correct for the skewness in the data, all prices are converted to the log scale. This results in an approximately normal distribution of ticket prices, which can be seen in Figure 3.4.2. The tickets appear to follow a log-normal distribution. All ticket prices going forward are considered on the log scale. The next step is to consider the change in prices over time. Figure 3.4.3 shows the mean log price of tickets starting at 140 days before a concert for all tickets in the data set. The general trend of prices appears to be a decrease in price from listing until around 90 days before the concert. At this point, the average ticket price increases until about 60 days before the concert. From 60 days out until the concert, the average price steadily decreases. One hypothesis for the shape of this trend is as follows: before 90 days until the concert, brokers are likely listing the sold-out tickets for the best seats at

Figure 3.4.3: Mean Log Ticket Price Over Time

extremely high prices. These prices are targeting dedicated fans who missed the release of tickets, and are now forced to look for the best tickets on the resale market. The mean price decreases as more tickets are listed at prices reasonable for the resale market, or as fans realize they are unable to attend a concert. The 90-day mark appears to be when fans start purchasing tickets in greater numbers. The result is that the cheapest tickets get bought out first, so the average price steadily increases. The final downward trend in prices is the result of sellers lowering their prices to sell off inventory. Once a concert ends, tickets lose all value, so anyone still holding tickets needs to sell them before this point. Thus, the average ticket price steadily drops until the day of the concert.
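The effect of the log transform described above can be illustrated with synthetic data. This is a sketch on invented distribution parameters, not the thesis data: a log-normal draw is heavily right-skewed in dollars but approximately symmetric on the log scale.

```python
import numpy as np

# Synthetic right-skewed "prices": a log-normal draw with invented parameters.
rng = np.random.default_rng(0)
prices = np.exp(rng.normal(loc=5.0, scale=0.8, size=10_000))

log_prices = np.log(prices)

# On the raw scale the mean sits well above the median (right skew);
# on the log scale the two nearly coincide (approximate symmetry).
print(np.mean(prices) - np.median(prices))          # large and positive
print(np.mean(log_prices) - np.median(log_prices))  # near zero
```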

3.4.1 Artists

Figure 3.4.4: Artist Mean Log Ticket Price Distribution

The first category to delve into after looking at all tickets is how different artists' tickets varied. Tickets are averaged across each day by artist. The distribution of the average ticket prices appears multimodal, potentially the combination of three normal distributions. It is possible that this distribution is a result of varying levels of prices for tickets. Highly popular artists garner a higher ticket price for their shows, and this translates into higher prices in the resale market. Based on Figure 3.4.5, it is clear that mean ticket prices vary widely across artists. Most artists appear to have a fairly low variance in their ticket prices, while a few others have extremely high variation. Outliers appear evenly distributed at both extremes. These outliers are potentially the result of the high ticket prices seen at the early stages of sales, while the low prices may be a result of dropping prices close to the concert date.

Figure 3.4.5: Artist Log Ticket Price Boxplot

Returning to the change in ticket prices over time, we see that the trends that appear at the highest level are much less prevalent when split up by artist. However, prices do still appear to decrease as the concert approaches. The delineation of prices for artists is also clearly visible in Figure 3.4.6.

3.4.2 Venues

The next category of tickets is the venue level. Here, the data is subset to only include concerts by a single artist, The Lumineers. By filtering the data in this way, the effect of price variation due to different artists is removed. The distribution of tickets for these concerts again appears to be multimodal, potentially indicating a separation between larger and smaller venues. As with Figure 3.4.5, we see varying levels of ticket prices across different

Figure 3.4.6: Mean Artist Log Ticket Price Over Time

Figure 3.4.7: Lumineers Log Ticket Price Distribution

Figure 3.4.8: Lumineers Venue Log Ticket Price Boxplot

venues in Figure 3.4.8. Multiple shows in the same venue on a single weekend exhibit different prices. The variation when considering single venues is also much less significant than when looking across artists. It is likely that both the venue and the artist have a significant effect on ticket prices.

Figure 3.4.9: Mean Venue Log Ticket Price Over Time

When considering price trends at the venue level, we see a few patterns similar to the previous graphs. Tickets start by increasing in price, and drop in price closer to the concert. The trends in Figure 3.4.9 have prices increasing until much closer to the concert. The trends are also very similar across venues. It is possible that the general shape of the price curves is more a result of the artist, while the price levels are affected by the venue.

3.4.3 Tickets

The final level of focus is a single concert. The following section focuses on the February 11th concert by The Lumineers. Once again, the distribution of ticket prices appears multimodal. This distribution could be the result of regular fans' and ticket brokers' habits. Ticket brokers make up the bulk of tickets and

Figure 3.4.10: Concert Log Ticket Price Distribution

price their tickets higher at the mean of the left distribution. Cheap tickets from fans who are looking to recoup the price of the original ticket appear in the right distribution. The trends at the ticket level appear to support our earlier hypotheses about the trend of ticket prices over time. Cheap tickets are sold between one and two months before the concert, and the unsold tickets have their prices reduced as the concert date grows nearer. It is also interesting to note that while some tickets are listed at a single price, many vary considerably in price over time. This is possibly the result of fan sales versus broker sales. The goal of a fan is typically to get rid of a ticket for a concert they can no longer attend, while brokers focus on extracting value. It is also interesting to note that many ticket prices move perfectly in concert with one another, suggesting multiple tickets being sold by a single broker.

Figure 3.4.11: Individual Log Ticket Price Over Time

3.4.4 Likelihood Justification

In the next section, I will introduce a parametric model used to describe the life cycle of a ticket on the secondary market. An important assumption of this model is that both the price of tickets entering the market and the change in price of tickets are normally distributed. The distributions for these two variables can be seen in Figures 3.4.12 and 3.4.13.

Figure 3.4.12: Starting Log Ticket Price Distribution

Figure 3.4.13: Log Ticket Price Change Distribution

4 Methods

4.1 Log Mean Tickets

The models in this section are all used to predict the log mean ticket price for a concert on a given day. The one exception is the Long Short-Term Memory model, which is used to predict both log mean ticket prices and individual log ticket prices. Note: the reference for each method is included at the end of each section.

4.1.1 Linear Regression

In order to construct a baseline model for the data, we start by looking at a simple linear regression. Linear regression assumes a linear relationship between the

outcome variable Y and the data, X. This relationship takes the form

$$ Y = X\beta \tag{4.1} $$

This equation is optimized over the data to minimize the least squares loss function

$$ L = \lVert Y - X\beta \rVert^2 \tag{4.2} $$

which results in estimating $\hat{\beta}$ as

$$ \hat{\beta} = (X^\top X)^{-1} X^\top Y \tag{4.3} $$

This model is implemented using the lm command in R. In all models, days_until is the predictor, and the outcome is the average log_price for a given day. [3]
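As a sanity check on the estimator in Equation 4.3, it can be computed directly on synthetic data. The thesis models were fit with lm in R; the numbers below are invented for illustration.

```python
import numpy as np

# Closed-form OLS from equation (4.3) on synthetic data: recover the slope
# of log price on days_until from noisy observations.
rng = np.random.default_rng(1)
days_until = rng.uniform(0, 140, size=500)
log_price = 5.0 - 0.004 * days_until + rng.normal(scale=0.01, size=500)

X = np.column_stack([np.ones_like(days_until), days_until])  # intercept + predictor
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ log_price

print(beta_hat)  # close to the true values [5.0, -0.004]
```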

4.1.2 Mixed Effects Model

The next step in building models is to include effects from different venues and artists. If we use a simple linear regression with categorical variables to account for venues and artists, the standard errors would be incorrect due to the dependence between observations in the same group. Tickets that come from the same event or from the same artist are dependent on one another. As a result, we transition to a mixed effects model. A mixed effects model is very similar to an ordinary least squares model. In a mixed effects model, we have data X and coefficients for the fixed effects β. In addition to the fixed effects, there are also now random effects that take the form of a grouping variable. Random effects are random variables, $u \sim N(0, \sigma_Z)$. In this case, the random effects variable is the venue or the artist. This model takes the

form

y = Xβ + Zu + ε (4.4)

The result of using a mixed effects model is that the intercept of the line of best fit varies according to an observation's group. These models are implemented using the lme4 package in R. In all models, days_until is treated as a fixed effect, and the outcome is the mean log_price for a given day. In one set of models, the venue is treated as a random effect. In the other set of models, both the artist and venue are treated as random effects. [2]
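For readers working in Python, the same random-intercept specification can be sketched with statsmodels' MixedLM. The thesis uses lme4 in R; the data below is synthetic, with invented venue effects, and only illustrates the specification log_price ~ days_until with a random intercept per venue.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: a shared days_until slope plus an invented per-venue
# intercept shift, mirroring the lme4 model log_price ~ days_until + (1 | venue).
rng = np.random.default_rng(2)
venues = np.repeat(["v1", "v2", "v3", "v4"], 50)
venue_effect = {"v1": 0.3, "v2": -0.2, "v3": 0.1, "v4": -0.4}
days_until = rng.uniform(0, 140, size=200)
log_price = (5.0 - 0.004 * days_until
             + np.array([venue_effect[v] for v in venues])
             + rng.normal(scale=0.05, size=200))

df = pd.DataFrame({"log_price": log_price, "days_until": days_until, "venue": venues})
model = smf.mixedlm("log_price ~ days_until", df, groups=df["venue"]).fit()
print(model.params["days_until"])  # near the true fixed effect of -0.004
```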

4.1.3 Generalized Additive Model

After considering mixed effects models, the next step is to better model the non-linearities in the data. There is clearly a non-linear relationship between ticket prices and time until a concert. As such, we transition into generalized additive models. GAMs are also a somewhat modified form of linear regression. GAMs take the form

$$ g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m) \tag{4.5} $$

where fi is a smooth function and g is a link function. The goal is to model these smooth functions in order to minimize the loss. [1]

4.1.4 Random Forest

It is clear from initial data analysis that the data is highly non-linear. To better model the data, we move from linear models to random forests. Random forests work by taking n random samples from data X, and then creating a decision tree regressor for each sample. Each decision tree is created in the following way: For

each feature in the data set, create m threshold values such that all the data points are partitioned into two groups with values above and below this threshold. Next, for each split, create a ŷ by averaging the y values within the two partitions. Then find the total loss of the decision tree using mean squared error. We get m different loss values for each of the features. Then determine which split provides the smallest loss and choose that split. Repeat this process on each partition. Continue this process on smaller partitions until, in any partition, there are not enough data points to split again, as determined by the set hyperparameters. The decision tree is complete once all partitions cannot be split any further. In a tree, the ŷ for any data point that is sorted into a given partition is the average y value of all the data points from the training set in that split. To get a prediction on a data point from the entire forest, that data point is entered into each of the decision trees and sorted into each partition until it cannot be sorted again. Then, the ŷ for this data point is the average of all the ŷ values from each of the n decision trees. This model is implemented using the sklearn package in Python. [9]
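A minimal sketch of this kind of fit with sklearn's RandomForestRegressor, on synthetic data with an invented non-linear trend (the actual models were fit on the ticket data, not this toy curve):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear relationship between days_until and mean log price.
rng = np.random.default_rng(3)
days_until = rng.uniform(0, 140, size=(1000, 1))
log_price = 5.0 + 0.3 * np.sin(days_until[:, 0] / 20) + rng.normal(scale=0.02, size=1000)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
forest.fit(days_until, log_price)

# The forest's piecewise-constant partitions track the curve that a single
# straight line could not.
print(forest.score(days_until, log_price))  # in-sample R^2, close to 1
```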

4.1.5 Auto-Regressive Integrated Moving Average

We now transition into time-series models. In all prior models, the data is considered to be independent observations. This assumption is a simplifying one required for the previous models. However, our data is time-series data, so each observation is dependent on prior observations. This leads us into Auto-Regressive Integrated Moving Average (ARIMA) models. ARIMA is the combination of three different models: auto-regressive, integrated, and moving average. Each model takes a single parameter, denoted p, d, and q respectively. In the auto-regressive model, the outcome is a linear combination of the prior p outcome observations. This can be considered a linear regression where the predictors are the last p observed outcomes. This piece of the model takes the

form

$$ \hat{y}_t = \alpha + \beta_1 y_{t-1} + \cdots + \beta_p y_{t-p} + \varepsilon \tag{4.6} $$

where α is a constant, the βi are the estimated coefficients, the yt are the outcomes, and ε is the error term. The next model is the moving average model. The moving average model is also in the form of a linear regression. In the moving average model, the predictors are the errors from modeling the last q terms in the series. This takes the form

$$ \hat{y}_t = \alpha + \varepsilon_t + \varphi_1 \varepsilon_{t-1} + \cdots + \varphi_q \varepsilon_{t-q} + \varepsilon \tag{4.7} $$

where α is a constant, the φi are the estimated coefficients, εt are the errors from previous time steps, and ε is the error term. The error terms are found through the equations

$$ y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_0 y_0 + \varepsilon_t \tag{4.8} $$

$$ y_{t-1} = \beta_2 y_{t-2} + \beta_3 y_{t-3} + \cdots + \beta_0 y_0 + \varepsilon_{t-1} \tag{4.9} $$

The final piece of ARIMA is the integrated model. This piece of the model is intended to make the series stationary. Stationarity means that the statistical properties of our time series, such as the mean and variance, are constant over time. In order to induce stationarity, we must use what is called "differencing". Differencing modifies the outcomes such that the resulting time series is stationary. An outcome for a value of d = 1 is given by the equation

$$ y'_t = y_t - y_{t-1} \tag{4.10} $$

In the case where d = 1 does not induce stationarity, we consider d = 2, which is given by

y'_t = y_t - 2y_{t-1} + y_{t-2} \qquad (4.11)
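The two differencing formulas above can be checked numerically. The following sketch uses NumPy's `diff` (one reasonable implementation choice, not the thesis's code) on a short series of made-up log prices:

```python
import numpy as np

# Hypothetical daily log-price series (illustrative values only)
y = np.array([5.40, 5.38, 5.41, 5.45, 5.44, 5.47])

# First-order differencing (d = 1): y'_t = y_t - y_{t-1}
d1 = np.diff(y, n=1)

# Second-order differencing (d = 2): y'_t = y_t - 2*y_{t-1} + y_{t-2}
d2 = np.diff(y, n=2)

# np.diff(n=2) matches the expanded d = 2 formula term by term
manual = y[2:] - 2 * y[1:-1] + y[:-2]
assert np.allclose(d2, manual)
```

Note that second-order differencing is simply first-order differencing applied twice, which is why `np.diff(y, n=2)` and `np.diff(np.diff(y))` agree.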

The combination of these three models is ARIMA. We start by inducing stationarity in the time series, and then solve the equation for the model, which takes the form

\hat{y}_t = \alpha + \beta_1 y_{t-1} + \cdots + \beta_p y_{t-p} + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \cdots + \phi_q \varepsilon_{t-q} \qquad (4.12)

This model is implemented using the statsmodels package in Python. [14]

4.1.6 Long Short-Term Memory Neural Network

ARIMA models are somewhat limited in their predictive power. As a result, the next step is to consider a more powerful class of models, neural networks. When working with time-series data, the most common type of neural network is a recurrent neural network. The benefit of recurrent neural networks is that at each time step, information from previous iterations can be passed as input. There is a specific type of recurrent neural network even better suited to our problem: the long short-term memory neural network, or LSTM. LSTMs are recurrent neural networks with a different type of internal structure. At time t, the network considers the state of the network at time t − 1 and the current input. These inputs are passed through what are known as the operation gates. The three operation gates are the forget gate, the input gate, and the output gate. The forget gate takes the output of the network at time t − 1, h_{t−1}, and the input at time t, x_t, and combines them into a single value f_t. This

transformation is given by

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (4.13)

where W_f, U_f, and b_f are the weight matrices and bias vector for the forget gate, and σ denotes the sigmoid function. This notation will be used for the rest of this section to denote weights and biases. These parameters are learned during training. The result is a value between 0 and 1 which determines how much prior information should be kept in the network's memory. The input gate has the same structure as the forget gate. It takes the output of the network at time t − 1, h_{t−1}, and the input at time t, x_t, and combines them

into a single value i_t. This transformation is given by

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (4.14)

The result is a value between 0 and 1 which determines how much weight should be given to the input to the cell. The outputs of the forget gate and the input gate are then combined to update

the cell state at time t, c_t. The update rule is given by

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (4.15)

where ⊙ denotes an element-wise product. After the cell state has been updated, we now consider the output of the cell. The output gate has the same structure as the forget gate and the input gate. It takes the form

o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (4.16)

The output gate determines how much of the cell state is included in the output of the cell. The final output at time t, h_t, is given by

h_t = o_t \odot \tanh(c_t) \qquad (4.17)
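Equations 4.13 through 4.17 can be traced directly in code. The following NumPy sketch runs one forward pass of a single LSTM cell; the dimensions, random weights, and the short price sequence are all illustrative, not values from the thesis:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts keyed by gate: 'f', 'i', 'c', 'o'."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate (4.13)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate (4.14)
    c_t = f_t * c_prev + i_t * np.tanh(
        W['c'] @ x_t + U['c'] @ h_prev + b['c'])            # cell state (4.15)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate (4.16)
    h_t = o_t * np.tanh(c_t)                                # output (4.17)
    return h_t, c_t

# Toy dimensions: 1-d input (a log price), 4 hidden units, random weights
rng = np.random.default_rng(1)
n_in, n_hid = 1, 4
W = {g: rng.normal(0, 0.1, (n_hid, n_in)) for g in 'fico'}
U = {g: rng.normal(0, 0.1, (n_hid, n_hid)) for g in 'fico'}
b = {g: np.zeros(n_hid) for g in 'fico'}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for price in [5.40, 5.38, 5.41]:  # a short log-price sequence
    h, c = lstm_step(np.array([price]), h, c, W, U, b)
```

Because o_t lies in (0, 1) and tanh is bounded by 1, every component of the output h_t is bounded in magnitude by 1.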

During training, the loss for the final output of the cell is computed using mean squared error

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - h_i)^2 \qquad (4.18)

Then, the weights and biases are updated according to the loss via gradient descent and back-propagation through time. This model is used on the full set of data structures in this paper. In addition to predicting the log mean ticket prices for a concert, LSTMs are used to model individual tickets. Unlike the previous models in this section, LSTMs do not need features besides prices to make strong predictions. As a result, we can use them to model individual tickets without being forced to include an indicator variable for each separate ticket. The LSTM model was implemented using Keras and TensorFlow in Python. [15]
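A minimal version of this setup in Keras might look as follows. The window length, layer size, training schedule, and simulated data are assumptions for illustration, not the thesis's exact architecture:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(2)
# Hypothetical training data: sliding windows of the last 10 log prices
# are used to predict the next day's price (window length is an assumption)
series = 5.4 + 0.05 * np.sin(np.arange(200) / 20) + rng.normal(0, 0.01, 200)
window = 10
X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., None]
y = series[window:]

model = keras.Sequential([
    keras.layers.Input(shape=(window, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")  # MSE loss, as in Eq. 4.18
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

preds = model.predict(X[:5], verbose=0)
```

The model sees only sequences of prices, mirroring the point above that no additional covariates are required for the LSTM to predict well.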

4.2 ID Tickets

The LSTM model and our custom likelihood model are both used to model individual ticket prices. We use the LSTM for predictions, while the parametric model is primarily for statistical inference.

4.2.1 Likelihood Model

The final class of models applied to this data set is a parametric likelihood model. The data for these models is at the individual ticket level. Each ticket has a unique identifier, which allows the model to track the change in a ticket's price over time. The likelihood combines distributions for 3 different variables: the starting price of a ticket, the probability of a ticket's price change, and the percentage change in the price of a ticket.

For the following notation, i denotes the index of a ticket and j denotes the day. Here, j = 0 indicates the day on which a ticket enters the market.

The first element is the log starting price of a ticket, which we call y_{i0}. y_{i0} denotes the log starting price on the day ticket i enters the market. As we saw from Section 3.4, the distribution of log ticket prices as they enter the market is approximately normal. Therefore, we parameterize y_{i0} ∼ N(μ_0, σ_0) to simulate the true distribution. The probability of a ticket changing price is an independent trial with probability p, so for ticket i on day j, we have k_{ij} ∼ Bern(p), where k_{ij} is an indicator variable that denotes whether ticket i changed in price on day j. The

percentage change in the price of ticket i on day j is denoted by ε_{ij}. Price change is also approximately normally distributed, so we parameterize it as ε_{ij} ∼ N(μ_ε, σ_ε). This gives us the following likelihood functions:

\mathcal{L}(\mu_0, \sigma_0; y_{i0}) = (2\pi\sigma_0^2)^{-1/2} \exp\left(-\frac{(y_{i0} - \mu_0)^2}{2\sigma_0^2}\right) \qquad (4.19)

\mathcal{L}(p; k_{ij}) = p^{k_{ij}} \qquad (4.20)

\mathcal{L}(\mu_\varepsilon, \sigma_\varepsilon; \varepsilon_{ij}) = (2\pi\sigma_\varepsilon^2)^{-1/2} \exp\left(-\frac{(\varepsilon_{ij} - \mu_\varepsilon)^2}{2\sigma_\varepsilon^2}\right) \qquad (4.21)

For notational simplicity, we write the parameters in vector form:

y = [y00 ... yn0]

ε = [ε00 ... εnm]

k = [k00 ... knm]

Then, the model has the following form:

\mathcal{L}(\mu_0, \sigma_0, p, \mu_\varepsilon, \sigma_\varepsilon; y, \varepsilon, k) = \prod_i \prod_j \mathcal{L}(\mu_0, \sigma_0; y_{i0}) \cdot \mathcal{L}(\mu_\varepsilon, \sigma_\varepsilon; \varepsilon_{ij}) \cdot \mathcal{L}(p; k_{ij}) \qquad (4.22)

Taking the log of this likelihood gives:

\log \mathcal{L}(\mu_0, \sigma_0, p, \mu_\varepsilon, \sigma_\varepsilon; y, \varepsilon, k) = \sum_i \sum_j \log \mathcal{L}(\mu_0, \sigma_0; y_{i0}) + \log \mathcal{L}(\mu_\varepsilon, \sigma_\varepsilon; \varepsilon_{ij}) + \log \mathcal{L}(p; k_{ij}) \qquad (4.23)

The parameters of the likelihood are then optimized conditional on the data. In the first models, the starting values of the parameters are the values given by the data. These parameters are assumed to be the same for all tickets and days. More flexibility is included in later models. Parameters are allowed to vary according to days_until, as well as other characteristics specific to a ticket or group of tickets. σ_0 and σ_ε are left equal for all tickets. p is given the log-odds form and allowed to vary by days_until, city_pop, and artist_pop. p is given by:

p = \left(1 + \exp\left(-(\beta_0 + \beta_1 x_{ij} + \beta_2 w_{ij} + \beta_3 z_{ij})\right)\right)^{-1} \qquad (4.24)

In the same way, μ0 and με are varied as follows:

\mu_0 = \gamma_0 + \gamma_1 x_{ij} + \gamma_2 w_{ij} + \gamma_3 z_{ij} \qquad (4.25)

\mu_\varepsilon = \alpha_0 + \alpha_1 x_{ij} + \alpha_2 w_{ij} + \alpha_3 z_{ij} \qquad (4.26)

where x_{ij} is the days remaining until a concert, w_{ij} is the log population of the city where the concert is being performed, and z_{ij} is a measure of the popularity of an artist. The initial values of the intercept parameters α_0, β_0, and γ_0 are set to the sample values given by the data, while the remaining coefficients are set to 0. The parameters are then optimized. The final result of this model is the set of maximum likelihood estimates of the parameters for this functional form, given the observed data.

4.3 Model Evaluation

4.3.1 Predictions

All models at each level of data are evaluated in the same way in order to have comparable prediction statistics. For each model, the data is first split in half on the total number of days before the concert. If there is data for 100 days before the concert, the data is split at 50 days. Each model is then trained on half the data, and tested on a sliding window of one fifth of the remaining data. The model's performance is evaluated, and then the next model is trained on both the train and test data from the first model. The second model is tested on the next fifth of the data. If the first model is trained on the first 50 days and tested on the next 10 days, the second model is trained on the first 60 days and tested on the next 10. This method results in 6 different performance statistics, one for each model. These statistics are then averaged to find the final model performance. The primary statistic used in determining predictive power is root mean squared error (RMSE). RMSE is given by

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (4.27)

where y_i is the true outcome and ŷ_i is the predicted outcome. Mean absolute error (MAE) is also calculated. MAE has the following

equation:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (4.28)
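The expanding-window scheme described above can be sketched as follows. This simplified version uses a naive "carry the last value forward" forecaster and five test windows; the function names and window count are illustrative, not the thesis's code:

```python
import numpy as np

def expanding_window_eval(series, fit_fn, n_splits=5):
    """Train on the first half, then test on successive fifths of the
    remainder, retraining on all data seen so far before each window.
    fit_fn(train) must return a callable predict(horizon)."""
    half = len(series) // 2
    step = (len(series) - half) // n_splits
    rmses, maes = [], []
    for s in range(n_splits):
        train = series[: half + s * step]
        test = series[half + s * step : half + (s + 1) * step]
        err = test - fit_fn(train)(len(test))
        rmses.append(np.sqrt(np.mean(err ** 2)))  # Eq. 4.27
        maes.append(np.mean(np.abs(err)))         # Eq. 4.28
    return np.mean(rmses), np.mean(maes)

# Toy example: a naive forecaster that repeats the last training value
naive = lambda train: (lambda h: np.full(h, train[-1]))
series = np.linspace(5.4, 5.2, 100)   # a slowly declining log-price series
rmse, mae = expanding_window_eval(series, naive)
```

Averaging the per-window statistics, as the final lines do, mirrors how the final model performance numbers in the tables below are produced.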

4.3.2 Inference

To evaluate the significance of the parameters in our likelihood models, we have to calculate standard errors for our coefficients. In calculating the log likelihood, we also find the Hessian matrix for our models: the matrix of second derivatives of the log likelihood with respect to the coefficients in the model. In this case, the negative Hessian is equivalent to the observed Fisher information I_n(θ̂). Therefore, we can compute standard errors as the following:

\mathrm{SE} = \sqrt{I_n(\hat{\theta})^{-1}} \qquad (4.29)

We can then use the Wald Test to determine whether the coefficients in our model are significantly different from 0. We can compute the test statistic as

W = \frac{\hat{\theta}}{\sqrt{I_n(\hat{\theta})^{-1}}} \qquad (4.30)

Here θ̂ is the maximum likelihood estimate of our parameter. The resulting test statistic is a standard normal random variable. Therefore, we can use this test statistic to calculate the p-value and determine the significance of the coefficient. Due to the large amount of data used, it is more likely for coefficients to be significant at the α = 0.05 level. To correct for this, we set α = 0.01. [17]
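Concretely, the standard error and Wald statistic of Eqs. 4.29 and 4.30 can be computed as follows; the coefficient estimate and information value here are made-up numbers of a plausible scale, purely for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical MLE output for one coefficient: the estimate and the
# corresponding diagonal entry of the observed Fisher information
theta_hat = -0.010
fisher_info = 1.0e8

se = np.sqrt(1.0 / fisher_info)    # Eq. 4.29
w = theta_hat / se                 # Wald statistic, Eq. 4.30
p_value = 2 * norm.sf(abs(w))      # two-sided p-value from N(0, 1)

significant = p_value < 0.01       # alpha = 0.01, as in the text
```

With a large information value the standard error is tiny, so even a small coefficient like this one is highly significant, which is exactly the large-sample effect the stricter α = 0.01 threshold is meant to offset.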

5 Results

The data analysis is split into two distinct sets. The first set of analyses is conducted at the log mean ticket price level. All the tickets for a single concert are aggregated by days until the concert, and each of these days is a single data point. The second set of analyses is conducted at the individual ticket level. Tickets with IDs are grouped together to form a series over the life of tickets across days.

5.1 Log Mean Tickets

For the first set of analyses, all observations of a ticket price for any given day are treated as independent. The mean ticket price for each day is calculated from all observations on that day for a single concert. The models are then built at each level of the data hierarchy: ticket level, venue level, and artist level. This method of modeling the data is chosen for two reasons. First, by building models up by time, then venue, and finally artist, we can see the effects of different variables independent of the effects at a higher level. For example, the effect of city population may be different when looking at a single artist than at multiple artists. The second reason for breaking up the data in this way is a result of the volume of data we have. The models running on the full data set take significant time to train, and running them on subsets of the data allows us to predict what modeling decisions will work on the larger data. As discussed in the methods section, the first step is to build a baseline linear model. Next, I move to more complex linear models, then to non-linear models, and finally to time series models.

5.1.1 Ticket Level

Linear Models

Model Type                        RMSE    MAE
Baseline Linear Regression        0.063   0.057
Linear Regression + days_until^3  0.044   0.040
GAM                               0.048   0.040
ARIMA                             0.046   0.036
RF                                0.038   0.031
LSTM                              0.035   0.031

Table 5.1.1: Log-Mean Ticket Model Performance

The first model to build is a baseline model. This model is for a single concert by The Lumineers at Van Andel Arena in Grand Rapids, MI. This model is a linear regression with just the response variable, mean log_price, and one predictor variable, days_until. Because linear regression is not a time series model, it treats each observation as independent. After finding a starting point with a basic linear regression, the next model introduces additional predictor variables. As seen in the exploratory phase of data analysis, one trend seen in the data is that of an initial decline in price, followed by a rise, and then a final decline as the concert date approaches. This trend seems to lend itself to a third order polynomial for days_until, the variable representing the number of days until the concert. This addition shows significant improvement over the most basic model.

Generalized Additive Models

Due to the non-linear nature of the relationship between the mean ticket price and the number of days until a concert, the next logical iteration on a polynomial linear regression is a generalized additive model. This model allows for additional flexibility compared to the fixed polynomial in the ordinary least squares regression. However, the performance of the model is not any better than the third degree polynomial linear model. This result suggests that the polynomial linear regression is a fairly good measure of the non-linear structure of the data.

ARIMA

After establishing a set of results for parametric non-time-series methods, I move on to an ARIMA model. This model takes into account the dependency between subsequent days' prices, unlike the previous models. However, the results of the ARIMA model are fairly similar to those of the polynomial linear regression and generalized additive model. ARIMA, one of the most basic time series models, still outperforms the non-time-series models.

Random Forest

To more flexibly model the data, we move to a random forest, which is a non-parametric model. Random forests tend to perform well in many settings, so they seem like a reasonable next step from the more basic linear models. Again, the model is built using only the mean log price and days_until. The random forest does a very good job outperforming the previous models, as its structure allows it to better capture the non-linearities in the data.

Figure 5.1.1: Log-Mean Ticket Predictions. (a) Linear Regression, (b) GAM, (c) RF, (d) LSTM

LSTM

The final model for the mean log ticket level is a Long Short Term Memory Neural Network. This model is the most complex, and also delivers the best predictions. It takes in a sequence of prices from the previous n days, and uses this information to predict the following day. We see from Table 5.1.1 that it outperformed all prior models in both RMSE and MAE.

Model Type                                                   RMSE    MAE
Baseline Linear Regression                                   0.211   0.163
Linear Regression + day + city_pop                           0.119   0.095
Linear Regression + days_until^3                             0.208   0.164
All Vars. Linear Regression                                  0.110   0.088
Mixed Effects Variable Intercept                             0.211   0.163
Mixed Effects Variable Intercept and Slope                   0.211   0.158
Mixed Effects Variable Intercept and Slope + day + city_pop  0.121   0.098
GAM                                                          0.206   0.161
GAM + day + city_pop                                         0.105   0.083
RF                                                           0.212   0.164
RF + day + city_pop                                          0.076   0.052
LSTM                                                         0.043   0.026

Table 5.1.2: Log-Mean Venue Model Performance

5.1.2 Venue Level

Figure 5.1.2: Log-Mean Venue LSTM Predictions

The next data set to analyze is at the venue level. This set consists of 10 concerts performed by The Lumineers. Only one artist is chosen so as to eliminate effects caused by variation in different artists' prices. Tickets are aggregated by the date of the concert. This aggregation results in a sequence of mean prices for each concert performed by the artist. To avoid undue repetition, the process of iterating over models in this section is the same as at the ticket level. The results of these models can be seen in Table 5.1.2. Each model type has a baseline version that includes days_until, as well as venue. Then, a second model is built that includes day and city_pop. Introducing these covariates into each model significantly boosts the performance of the predictions, nearly halving both the RMSE and MAE. Less complex models with these additional covariates tend to outperform the more complex models that lack the additional information.

5.1.3 Artist Level

Model Type                                                                 RMSE    MAE
Baseline Linear Regression                                                 0.311   0.215
Linear Regression + days_until^3                                           0.303   0.207
Linear Regression + city_pop + followers + artistpop + day + days_until^3  0.300   0.203
Mixed Effects Variable Intercept                                           0.303   0.212
Mixed Effects Variable Intercept + city_pop + followers + artistpop + day  0.300   0.209
GAM                                                                        0.298   0.205
GAM + city_pop + followers + artistpop + day                               0.294   0.201
RF                                                                         0.203   0.123
RF + city_pop + followers + artistpop + day                                0.191   0.109
LSTM                                                                       0.087   0.045

Table 5.1.3: Log-Mean Artist Model Performance

The final data set to analyze is the artist level, or the full data set.

Figure 5.1.3: Log-Mean Artist LSTM Predictions

This data set includes all concerts by all artists. Ticket prices are aggregated at the concert level. Each observation is for a specific day with the mean log price for all tickets in a single concert. Once again, to avoid undue repetition, the process of iterating over models in this section is the same as at the previous levels. The results of these models can be seen in Table 5.1.3. Each model type has a baseline version that includes days_until, as well as the name of the venue and the artist as categorical variables. Then, a second model is built that includes day, city_pop, followers, and artist_pop. Introducing these covariates into each model gives a much less drastic performance boost compared to the boost seen at the venue level.

5.2 ID Tickets

The second set of analyses is performed on all tickets with a unique identifier. This allows us to observe and model the price changes of a single ticket over time. Once again, tickets are analyzed for a single concert, for a single artist, and then on the complete data set of artists and concerts. The data are broken up for the same reasons mentioned in the previous section. Working with these subsets allows us to both look at effects of variables independently from other factors that might influence them and spend less time iterating over models on large data sets. The data sets in this section are the same as for the log-mean models. The only two types of models utilized in this section are the LSTM and the parametric likelihood model.

5.2.1 Ticket Level

Likelihood

Est.    Model 1±SE        Model 2±SE        Model 3±SE         Model 4±SE
μ0      5.408±0.022*      5.408±0.022*      -                  -
γ0      -                 -                 5.408±0.005*       5.408±0.004*
γ1      -                 -                 -1.84e-7±8.7e-5    -3.82e-15±8.6e-5
σ0      0.483±0.015*      0.483±0.015*      0.483±0.002*       0.478±0.002*
με      -0.0043±0.0005*   -                 -                  -
α0      -                 -0.011±0.001*     -0.011±0.0001*     -0.011±0.0001*
α1      -                 0.0001±1.8e-5*    0.0001±1.9e-6*     0.0001±1.9e-6*
σε      0.060±0.0003*     0.060±0.0003*     0.060±3.4e-5*      0.060±3.4e-5*
p       0.113±0.002*      0.113±0.002*      0.113±0.0003*      -
β0      -                 -                 -                  -1.589±0.005*
β1      -                 -                 -                  -0.010±9.9e-5*
log L   -17604            -17635            -1604747           -1609772

Table 5.2.1: ID Ticket MLE Parameters (* Indicates significance at the α = 0.01 level)

We have four different parametric models for each category of data. Each of the four models adds variability to an additional parameter. At the single concert level, the only variable the parameters are allowed to vary on is the number of days until the concert.

We can see from Table 5.2.1 that allowing the mean starting price μ0 to vary results in an extremely large increase in the magnitude of the negative log likelihood. Varying the mean price change με and the probability of a change p has a much smaller effect; however, these parameters are still very significant.

LSTM

Model Type    RMSE    MAE
LSTM          0.061   0.020

Table 5.2.2: ID Ticket LSTM Performance

The only predictive model for individual tickets with IDs is the LSTM. As seen in Table 5.2.2, using the individual tickets for the single concert level gives slightly worse performance than predicting the log-mean ticket price at the same level.

Figure 5.2.1: ID Ticket LSTM Predictions

However, the model still performs extremely well. Figure 5.2.1 shows the predicted ticket prices compared with the true ticket prices. The LSTM model is clearly doing a strong predictive job here.

5.2.2 Venue Level

Using the same data set as for the log-mean models, we build another parametric likelihood model and LSTM.

Likelihood

Est.    Model 5±SE        Model 6±SE        Model 7±SE         Model 8±SE
μ0      5.359±0.012*      5.359±0.012*      -                  -
γ0      -                 -                 5.359±0.012*       5.359±0.012*
γ1      -                 -                 -0.0008±4.5e-5*    -0.0007±4.5e-5*
γ2      -                 -                 0.004±0.001*       0.004±0.001*
σ0      0.765±0.009*      0.765±0.009*      0.765±0.009*       0.765±0.009*
με      -0.002±0.0002*    -                 -                  -
α0      -                 -0.016±0.002*     -0.016±0.0002*     -0.015±0.0002*
α1      -                 0.00009±8.1e-6*   0.00009±8.2e-7*    0.00009±8.2e-7*
α2      -                 0.0007±0.0002*    0.0007±1.5e-5*     0.0006±1.5e-5*
σε      0.066±0.0001*     0.066±0.0001*     0.065±1.4e-5*      0.065±1.4e-5*
p       0.096±0.0009*     0.096±0.0009*     0.096±9.0e-5*      -
β0      -                 -                 -                  -2.246±0.010*
β1      -                 -                 -                  -0.006±4.2e-5*
β2      -                 -                 -                  0.023±0.0008*
log L   -103586           -103659           -10055124          -10065133

Table 5.2.3: ID Venue MLE Parameters (* Indicates significance at the α = 0.01 level)

The results at this level of analysis are fairly similar to the single concert level. Parameters are now allowed to vary on both days_until and the log of city_pop. The value of both standard deviation parameters increases compared to the single concert level. The probability of a price change also decreases. Most notably, the significance of the mean starting price coefficients remains. The jump in negative log likelihood from the second model to the third model is extremely apparent.

LSTM

Model Type    RMSE    MAE
LSTM          0.058   0.016

Table 5.2.4: ID Venue LSTM Performance

Figure 5.2.2: ID Venue LSTM Predictions

The LSTM at the single artist level shows similar performance relative to the single concert model. The RMSE for individual tickets is slightly higher than for the log-mean LSTM. However, Figure 5.2.2 shows that the model has extremely strong predictive capability. It is also important to note that the LSTM is trained solely on the sequences of log prices. No additional covariates are contained in these models.

5.2.3 Artist Level

Now using the full data set, we build another parametric likelihood model and LSTM.

Likelihood

Est.    Model 9±SE        Model 10±SE       Model 11±SE        Model 12±SE
μ0      5.153±0.006*      5.153±0.006*      -                  -
γ0      -                 -                 5.152±0.007*       5.153±0.007*
γ1      -                 -                 -0.0009±2.3e-5*    -0.0002±2.3e-5*
γ2      -                 -                 0.004±0.0005*      0.001±0.0005
γ3      -                 -                 0.004±0.004        0.0006±0.004
σ0      1.268±0.005*      1.268±0.005*      1.268±0.0005*      1.268±0.005*
με      -0.006±0.0001*    -                 -                  -
α0      -                 -0.013±0.001*     -0.014±0.0001*     -0.008±0.0001*
α1      -                 0.0002±5.4e-6*    0.0002±5.4e-7*     0.0002±5.4e-7*
α2      -                 -0.0001±9.9e-5    -0.0001±1.0e-5*    -0.0005±1.0e-5*
α3      -                 0.0006±0.0007     -0.0008±6.9e-5*    0.0005±6.9e-5*
σε      0.094±0.0001*     0.094±0.0001*     0.094±1.0e-5*      0.094±1.0e-5*
p       0.147±0.0005*     0.147±0.0005*     0.147±5.4e-5*      -
β0      -                 -                 -                  -1.752±0.004*
β1      -                 -                 -                  -0.020±1.9e-5*
β2      -                 -                 -                  0.052±0.0003*
β3      -                 -                 -                  0.004±0.002
log L   -164402           -165177           -16022947          -16693360

Table 5.2.5: ID Artist MLE Parameters (* Indicates significance at the α = 0.01 level)

Once again, we see similar results to the previous levels of analysis. Parameters are now allowed to vary on days_until, the log of city_pop, and artist_pop. The value of both standard deviation parameters increases again compared to the venue level. The probability of a price change increases.

The jump in negative log likelihood from the second model to the third model is still extremely apparent.

Rank    City                Population    μ0          με        p
1       New York, NY        8550971       5.148910    0.0047    0.052
2       Los Angeles, CA     3990456       5.148151    0.0050    0.050
3       Chicago, IL         2705994       5.147764    0.0052    0.050
4       Houston, TX         2278069       5.147592    0.0053    0.049
5       Phoenix, AZ         1660272       5.147277    0.0055    0.048
162     McKees Rocks, PA    5885          5.141656    0.0082    0.037
163     Northfield, OH      5874          5.141654    0.0082    0.036
164     Louisburg, KS       4508          5.141390    0.0083    0.035
165     Rosemont, IL        2305          5.140722    0.0086    0.035
166     Douglass, KS        1662          5.140396    0.0088    0.034

Table 5.2.6: Top 5 and Bottom 5 City Estimates for Median Artist Popularity at Day 100

The results from the likelihood model can also translate into estimates of starting prices and price changes for tickets. If we assume a median population for our city and a median popularity for our artist and plug these into the final likelihood model, we estimate the mean daily log price change at 100 days before the concert to be 0.006. 10 days before the concert, the price change is estimated to be −0.012. At 100 days, prices are expected to increase very slightly, while at 10 days, prices are dropping with a larger magnitude. Because these effects are linear, the differences between starting prices for different artists and venues are straightforward. Large cities such as New York have a higher starting price than small cities. Popular artists also command higher starting prices than less popular artists. These results can be seen in Tables 5.2.6 and 5.2.7.

Rank    Artist            Popularity    μ0          με        p
1       Post Malone       1.0           5.146031    0.0060    0.045
2       J Balvin          0.97          5.146014    0.0060    0.045
3       Bad Bunny         0.96          5.146009    0.0060    0.045
4       Ariana Grande     0.93          5.145992    0.0061    0.045
5       Chris Brown       0.91          5.145981    0.0061    0.045
38      Chris Young       0.71          5.145869    0.0062    0.045
39      Lee Brice         0.70          5.145863    0.0062    0.045
40      Chris Lane        0.69          5.145857    0.0062    0.045
41      As I Lay Dying    0.67          5.145846    0.0062    0.045
42      The Raconteurs    0.66          5.145841    0.0062    0.045

Table 5.2.7: Top 5 and Bottom 5 Artist Estimates for Median City Population at Day 100

Figure 5.2.3: Likelihood Simulation

We also use the final parameters from the likelihood model to simulate the life of 100 tickets in groups of 10 tickets. A new group of tickets enters the simulation every 10 days. The simulation is for a Lumineers concert at the Barclays Center. This artist and venue give a city population of 2582830 and an artist popularity of 0.8. The result of this simulation can be seen in Figure 5.2.3. The behavior of the tickets in this simulation is somewhat less variable than in Figure 3.4.11. At the same time, much of the ticket behavior that we have seen is for individual tickets to remain relatively static, which this model captures.

LSTM

Model Type    RMSE    MAE
LSTM          0.088   0.021

Table 5.2.8: ID Artist LSTM Performance

Figure 5.2.4: ID Artist LSTM Predictions

The final model for the second type of analysis is the LSTM on the full data set. This model has the best performance relative to its log-mean counterpart. The RMSE is almost exactly the same as the log-mean model, with a lower MAE. Figure 5.2.4 shows that the model has extremely strong predictive capability, similar to the previous LSTMs.

6 Discussion

6.1 Early Trends

The first goal of this paper was to give an idea of how prices on the secondary concert ticket market change over time. To that end, I started by looking generally at how tickets behave over time. One of the first things I noticed was that the shape of the trends appeared smooth and non-linear, which could be captured by a polynomial model. From four or more months away from a concert until the date of the concert, there appeared to be only a few significant extrema in the ticket prices. The major changes in ticket prices were occurring over weeks as opposed to days. The mean prices changed incrementally each day, but on the whole, average prices moved slowly in a single direction. One of the most apparent reasons for this behavior is that individual tickets tend not to change. It is clear from the exploratory data analysis that many individual ticket prices stay relatively constant. As a result, the mean price moving is caused by only a few tickets moving, or by tickets entering and exiting the market. Although this seems to be behavior shared across concerts, there did not appear to be a single overall shape for the behavior of ticket prices. Tickets for various artists and venues behave in different ways. The analysis limited to one artist showed similar trend shapes at different price levels; however, this is the behavior of just a few concerts by a single artist. It was fairly clear from the initial analysis that tickets behave differently based on the location of the concert and the performer. Looking at a number of concerts by The Lumineers showed very similar trends at different price levels. Plotting trends for different artists showed tickets being sold with both differing shapes and price levels. Because of this behavior by the ticket prices, I decided to conduct my analysis at the ticket level, venue level, and artist level. In doing so, I could eliminate the variation introduced into the data by different artists or by different venues. Sub-setting the data in this way allowed me to look closely at how tickets behave according to their location and performer, which were found to be very important.

6.2 Predictive Power

After looking at a few general ticket trends, I built models to predict how these prices would change. As I noted earlier, the curves seemed to follow a polynomial shape. This observation was confirmed by the fact that using a third order polynomial on the time variable significantly improved predictions. The GAMs also tended to have strong performance with their ability to model smooth curves. The strongest models were the LSTMs. These models had extremely accurate predictions at all levels of my analysis. They were able to precisely model the behavior of both mean ticket prices and individual tickets. These models were trained solely on sequences of ticket prices with no additional variables. The mostly slow trends of the ticket prices likely played into this strong predictive power. Because prices tended not to change from day to day, the LSTMs were able to predict small changes in the price and model the overall curves.

6.3 Variable Contributions

The second goal of this paper was to attempt to explain why prices change in the way that they do. This question was answered largely by the parametric likelihood model. There were three major variables included in this model: days_until, city_pop, and artist_pop. These variables are meant to explain how time, venue, and artist affect the behavior of ticket prices.

Figure 6.3.1: μ0 vs. Days Until Concert

The first variable, days_until, is the most important variable in this analysis. I looked at its effect on the starting price of a ticket, the magnitude of a ticket's change in price, and the probability that a ticket price changes. The effects of days_until on these variables can be seen in Figures 6.3.1, 6.3.2, and 6.3.3.

Figure 6.3.2: με vs. Days Until Concert

Figure 6.3.3: p vs. Days Until Concert

These graphs use the parameters from the final likelihood model with the median city population and artist popularity. In every model, the coefficients corresponding to this variable were extremely significant. The likelihood model tells us that as a concert gets closer, starting prices will trend very slightly upward, while the changes in price will decrease. The average price change is negative; however, it starts out positive and decreases as the concert approaches. Depending on the artist and venue, the mean price change can go fairly negative as the concert approaches. At the same time, the probability that a ticket price will change increases significantly as we get closer to the concert. One possible reason for tickets entering the market to have a higher price closer to the concert is that cheap tickets get sold off, so the average price goes up, and new tickets must match that price. The overall trend in price is downward. So tickets that are already on the market tend to decrease in price, albeit less significantly closer to the concert, while new tickets have to enter at a higher price. More frequent price changes are also a reasonable result of the passage of time, as tickets that have not been sold will be forced to lower their prices. Once the concert ends, tickets have no value, so any unsold tickets are a complete loss. The second variable, city_pop, also showed extremely significant results in the likelihood model. The effect of a city's population was considered on the same variables as days_until. The effect of an increase in city size had the same direction of effects as approaching the concert date. Increased city size was correlated with higher starting ticket prices, smaller price changes, and a higher probability of price change. While all these coefficients were very significant on the full data set, the coefficient for the probability of price change was also quite large. Cities with larger populations see dramatically more movement in their resale market ticket prices. We see in Table 5.2.6 that a city like New York early on has an estimated probability 0.052 of price change while a much smaller city like Douglass, KS has an estimated probability of 0.034. Close to the concert, with 10 days remaining, these probabilities become 0.246 and 0.163 respectively.

One explanation for this result is that this is where ticket brokers focus their attention. Larger cities also had more expensive tickets on average, so there is more money to be made on concerts in these cities. The higher prices in larger cities have a few explanations as well. Bigger venues likely cost more for the artist, so there is a larger overhead cost. Bigger venues also allow artists more freedom with theatrics such as pyrotechnics. Bigger cities may also generate more excitement around artists and hold a larger market of people willing to buy tickets. In addition to the strong results seen in the likelihood model, introducing this variable into the predictive models led to a very significant boost in predictive power, further underscoring its importance. The third variable, artist_pop, had much less of an effect on model predictions and the behavior of ticket prices. There was no significant effect of an artist's popularity on the starting price of a ticket. This result may be due to the fact that the data set consisted entirely of very popular artists. However, there was still some delineation in the popularity of these artists. Similar versions of the likelihood model also had conflicting parameters for the effect of this variable on the average price change. The final model with all data and parameters had a slightly negative coefficient, while the prior model had a slightly positive coefficient. The one parameter that appeared to have some meaning was the effect of artist popularity on the probability of price change. Similar to city population, an increase in artist popularity was related to an increase in the movement of prices. This result may also be caused by brokers focusing their attention on bigger artists and looking closely at price changes for these concerts. One other important piece of the likelihood model was the mean starting price of a ticket.
In every model, allowing this parameter to vary by day, city, and artist had an outsized effect on the final log likelihood. Varying the mean price change and the probability of price change both had a significant impact on the model, but orders of magnitude less than varying the mean starting price. The importance of varying the starting price suggests that the starting points of tickets are the largest difference between concerts, while the general movement of tickets, parameterized by με and p, is much more consistent across tickets.
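The benefit of letting the mean starting price vary can be illustrated with a toy Gaussian log-likelihood comparison. All numbers below are invented for illustration, and the thesis's actual likelihood model is far richer, but the mechanism is the same: group-specific means always fit at least as well as a single pooled mean.

```python
import math
from statistics import mean

def gaussian_loglik(xs, mu, sigma):
    """Log-likelihood of observations xs under N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

# hypothetical starting log prices for two concerts (made-up numbers)
concert_a = [4.1, 4.3, 4.0, 4.2]
concert_b = [5.8, 6.1, 5.9, 6.0]
both = concert_a + concert_b

sigma = 0.2
pooled = gaussian_loglik(both, mean(both), sigma)           # one shared mu0
varied = (gaussian_loglik(concert_a, mean(concert_a), sigma)
          + gaussian_loglik(concert_b, mean(concert_b), sigma))
# varied > pooled: per-concert starting means raise the log-likelihood,
# and the gap grows with the spread between concerts' starting prices
```

When starting prices differ sharply between concerts, as the thesis reports, this gap dwarfs the gains from varying the change parameters.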

6.4 Affected Parties

The three parties invested in this market are fans, ticket brokers, and artists. This paper is relevant to the first two parties mostly in how the ticket prices behave. Fans are involved in both the buying and selling of tickets, while brokers are only selling tickets. One issue that has been previously raised is whether there is a way to determine if a ticket is being sold by a ticket broker or a fan. There is no clear answer to this question, as there is no way to determine which tickets are sold by whom. I hypothesize, however, that one could tell the difference based on the movement of a ticket. Individual tickets tended to be either very mobile or very static. My hypothesis is that brokers are selling the tickets that have constantly changing prices, while fans list their tickets and leave them alone. This theory is supported by the fact that, when looking at graphs of individual tickets, some tickets appear to move perfectly in concert with one another. These movements are likely the result of a single broker, as fans are unlikely to be selling multiple tickets at different price points. One interesting follow-up to this question would be to consider the effect of a price change on the probability of future price changes. Tickets that move once are probably more likely to move again, while tickets that stay static are likely to remain static. Fans also have an interest in buying tickets from the secondary market. The parametric likelihood model provides some insight into advantageous behavior in this area. Prices appear to generally trend down, but do so less as the concert approaches. More tickets have changing prices closer to the concert as well. The size of a given price change is smaller, but more tickets are dropping their prices. This result suggests that a fan should consider either buying up a cheap ticket as early as possible or waiting until the very last minute before a concert to see a large price drop.
Both tactics come with risks, such as spending too much early on or missing out close to the concert, and there is no definitive strategy that this paper can suggest. Two mirrored trends appear often in the behavior of the tickets. Tickets are either at very high or very low prices about two months from the concert date, and as the concert approaches,

these prices invert. Concerts that started out high trend down steadily, while concerts that started low trend upward. The other relevant party in this discussion is the artist and their team. When it comes to touring, this paper's findings suggest that venues in cities with larger populations command higher ticket prices and see more movement in the resale ticket market. These cities potentially provide the opportunity for artists to experiment with dynamic ticket pricing. The average changes in price in the secondary market tend to be small, so artists could consider introducing small amounts of variation in their ticket prices based on the days before the concert. Small but frequent changes in price, similar to those on the secondary market, could allow an artist to generate more revenue without incurring the wrath of fans and damaging their reputation. One additional matter to explore in the future when it comes to touring is how back-to-back shows impact prices. Based on a few small cases, it appears that one show tends to be the main show, with higher prices and greater attendance, while the second is for the extra fans who could not make it or get tickets to the first show. Lower ticket prices may be tied to the expected level of attendance here. Just as smaller cities have cheaper tickets, concerts that fans expect to have fewer people may also have lower ticket prices. Introducing a measure of expected attendance would be an interesting addition to future models.
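The broker-versus-fan hypothesis raised earlier in this section could be operationalized as a simple price-mobility heuristic. The function names and the threshold below are arbitrary assumptions for illustration, not estimates from the data.

```python
def count_changes(prices):
    """Number of day-over-day price changes in one ticket's price history."""
    return sum(1 for a, b in zip(prices, prices[1:]) if a != b)

def likely_broker(prices, threshold=3):
    """Heuristic from the hypothesis above: tickets whose prices move
    often are flagged as broker-listed.  The threshold is an assumption."""
    return count_changes(prices) >= threshold

# a fan lists once and never touches the price; a broker reprices constantly
static_fan = [120.0] * 30
active_broker = [120.0, 118.0, 118.0, 115.0, 113.0, 110.0]
```

A natural refinement, following the autocorrelation idea above, would be to condition the threshold on whether the ticket has already moved at least once.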

6.5 Diagnostics and Decisions

Figure 6.5.1: Linear Regression Diagnostics ((a) Residuals Plot, (b) QQ Plot)

Figure 6.5.2: Mixed Effects Diagnostics ((a) Residuals Plot, (b) QQ Plot)
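As a minimal, self-contained illustration of what these residual diagnostics check, the sketch below fits an ordinary least squares line to synthetic log-price data and verifies that the residuals are centred at zero. The data and coefficients are invented and stand in for the thesis's actual models.

```python
import random
from statistics import mean

# synthetic stand-in for log ticket prices over time (illustrative only)
random.seed(1)
x = [i / 10 for i in range(100)]
y = [4.0 - 0.02 * xi + random.gauss(0, 0.1) for xi in x]

# ordinary least squares by hand: slope, intercept, residuals
xbar, ybar = mean(x), mean(y)
beta = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        / sum((xi - xbar) ** 2 for xi in x))
alpha = ybar - beta * xbar
resid = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]

# with an intercept, OLS residuals always average to zero; the residuals
# plot then looks for patterns, and the QQ plot compares the residual
# quantiles against normal quantiles to spot heavy tails
assert abs(mean(resid)) < 1e-8
```

The heavy tails noted in the text would show up here as residual quantiles pulling away from the normal line at both extremes of the QQ plot.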

Figure 6.5.3: GAM Diagnostics ((a) Residuals Plot, (b) QQ Plot)

Many of the models used in this analysis had an assumption that was seriously violated: they applied non-time-series models to time series data. Most models introduced in the process of predicting log mean ticket prices were not time-series models. Models such as linear regression work on the assumption that observations are independent, but in a time series setting, each observation depends on previous observations. A number of the models used were fit by ordinary least squares, so this assumption is incorrect for multiple models. I treated the independence of observations as a simplifying assumption and accepted that these models would not be perfect for the sake of some basic initial results. Beyond this assumption, model assumptions were largely met. Diagnostic plots for these assumptions appear in Figures 6.5.1, 6.5.2, and 6.5.3. For all of the linear parametric models, the normality and independence of residuals assumptions were largely met. It is clear from all three sets of diagnostics that there are some fairly significant outliers in the data. All three models have some issues with the residuals at very high values of the log price. We also see in all three models that the residuals have fairly heavy tails.

Another major point is to note some decisions that were made in the process of putting together this analysis. The first of these decisions was my method of selecting artists for the final data set. Artists were chosen from the Billboard Top 100 artists, and only artists with many shows were kept in the data set. This decision left the final data set with a group of artists who are currently extremely popular. As a result, some of the attempts to differentiate tickets based on artist information fell flat. In addition, the findings here would likely not generalize to a data set with less popular artists. This problem also arises when it comes to the type of tickets included in the data set. I only considered general admission and floor tickets in this analysis, as other seats in the venue introduce differences that are hard to model. This choice likely means that the results in this paper do not

generalize to other types of seats. The last major decision was in how the data was cleaned. Missing data for tickets with IDs was filled in using the information for that ticket from the previous day. Although this likely had little effect on the performance of the predictive models, it may have biased some coefficients in the likelihood model. Using the same price as the previous day means that there are weakly fewer price changes in the corrected data set than in reality. This choice may also have led to more extreme jumps in some ticket prices, as an intermediate ticket price could be missing.
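The forward-fill rule described above can be sketched as follows. The function name and the day-to-price dict representation are illustrative, not the thesis's actual cleaning code.

```python
def forward_fill(history):
    """Fill missing daily prices for one ticket ID with the previous
    day's observed price.  `history` maps day -> price, with None on
    days the scraper missed; a leading None stays None."""
    filled, last = {}, None
    for day in sorted(history):
        price = history[day]
        if price is None:
            price = last          # carry the previous observation forward
        filled[day] = price
        last = price
    return filled

raw = {1: 150.0, 2: None, 3: None, 4: 120.0}
```

Note how days 2 and 3 become 150.0, so the single observed drop to 120.0 registers as one large jump on day 4, which is exactly the "weakly fewer, more extreme" bias described in the text.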

6.6 Next Steps

There are quite a few different areas that I would like to explore further based on the results of this paper. This paper looked at approximately 1.7 million observations of general admission and floor seat tickets. In the data collection process, I gathered more than 44 million tickets. Generating a metric to differentiate these tickets would make for an even broader and more informative analysis. I would also like to expand the data set to less well-known artists. The findings of this paper regarding the effect of artists on ticket prices were minimal, so introducing more variability and other types of information regarding artists could also improve this analysis. In addition to adding these sources of data, I would also like to explore the behavior of tickets exiting the market. This idea was not explored in the likelihood model. All tickets were essentially treated as entering the market and staying until the concert, which is obviously not the case. Answering both when and why tickets are bought may provide some insight for fans looking to purchase tickets as well as artists and brokers selling tickets. In addition to supplementing the likelihood model with tickets exiting the market, it might help to include interaction effects between the variables. This addition would allow us to determine how the venue or artist affects ticket prices based on the days until the concert. There are also always additional covariates to

explore and include in the model. Further research with the likelihood model could be helpful to artists in tour planning as well. The likelihood model allows us to predict and simulate the behavior of tickets for new concerts. We can use the likelihood model for new venues and artists to understand how tickets will behave on a previously unseen data set. The final area I would like to further explore involves the shape of the ticket price trends. The best predictive models hug the true curves extremely closely, which makes for good predictions. However, it may be more useful from an inferential standpoint to generate smoother curves. This type of analysis would allow us to more easily determine general trends across venues and artists. We could also potentially group concerts by their shape and determine the factors behind each shape.
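One simple way to produce such smoother curves is a trailing moving average over the daily price series; the window size below is an arbitrary assumption, and many other smoothers (splines, kernel regression) would serve equally well.

```python
def moving_average(series, window=7):
    """Trailing moving average: each point is the mean of up to `window`
    preceding observations (including itself).  Early points average
    over whatever history exists so the output has the same length."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# hypothetical daily mean prices for one concert
prices = [100, 102, 98, 97, 99, 95, 94, 96, 92, 90]
smooth = moving_average(prices, window=3)
```

Curves smoothed this way could then be compared or clustered by shape across concerts, as suggested above.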

6.7 Effects of COVID-19

At the time of writing this paper, the world is undergoing a global pandemic due to the virus COVID-19. The United States is largely locked down. In most states, residents have been ordered to shelter in place, gatherings of more than ten people are illegal, and all non-essential businesses have been closed. As a result, the performing arts have been completely shut down. There are no popular music concerts happening in the United States at this time. The data in this paper was collected up to February 2020, well before the country began to shut down. To a certain extent, the end date of my data collection was very fortunate. This paper is able to provide a picture of the secondary ticket market before concerts disappeared for at least a month, potentially longer. At the same time, there may be drastic changes to both the primary and secondary markets in the near future as a result of this pandemic. The growth predictions that I present in the early sections of this paper could become extremely inaccurate, and the models that I have built may no longer apply. It is my hypothesis that the concert ticket industry will largely return to normal within a year, but it is impossible to say for sure.

7 Conclusion

The secondary concert ticket market is a strange and little-studied beast. Tickets are sold far too low in the primary market and end up orders of magnitude more expensive on the secondary market. Artists and ticket sellers have little power to change these trends, as many solutions to this problem would cost them their reputations and, in turn, adoring fans. Existing research in this field focuses primarily on economic analysis of ticket sales. This paper seeks to go beyond this type of analysis and provide tools for both predicting prices and explaining their behavior as a result of venue and artist characteristics. Ultimately, the goal of this paper is to provide some insight into how and why this strange market works. This paper introduces both predictive and inferential models at many levels of depth. The predictive models allow us to answer the first question of how prices change over time. The inferential model allows us to say why prices change in the way they do. Both types of models can help all parties involved in the ticket

selling business. Fans can use these findings to have a better chance of recouping the cost of a ticket or buying a ticket at less than astronomical prices. Brokers may be able to better predict how prices will change and price their tickets accordingly. The findings of this paper are also beneficial to artists planning tours and selling tickets on the primary market, as I show where some ticket prices can be increased to generate more revenue. The analysis presented in this paper is by no means exhaustive; however, I hope that it sheds some light on this strange market that has some people laughing all the way to the bank and others foaming at the mouth.


Listing of figures

3.4.1 Ticket Price Distribution ...... 25
3.4.2 Log Ticket Price Distribution ...... 26
3.4.3 Mean Log Ticket Price Over Time ...... 27
3.4.4 Artist Mean Log Ticket Price Distribution ...... 28
3.4.5 Artist Log Ticket Price Boxplot ...... 29
3.4.6 Mean Artist Log Ticket Price Over Time ...... 30
3.4.7 Lumineers Log Ticket Price Distribution ...... 30
3.4.8 Lumineers Venue Log Ticket Price Boxplot ...... 31
3.4.9 Mean Venue Log Ticket Price Over Time ...... 32
3.4.10 Concert Log Ticket Price Distribution ...... 33
3.4.11 Individual Log Ticket Price Over Time ...... 34
3.4.12 Starting Log Ticket Price Distribution ...... 35
3.4.13 Log Ticket Price Change Distribution ...... 35

5.1.1 Log-Mean Ticket Predictions ...... 51
5.1.2 Log-Mean Venue LSTM Predictions ...... 52
5.1.3 Log-Mean Artist LSTM Predictions ...... 54
5.2.1 ID Ticket LSTM Predictions ...... 57
5.2.2 ID Venue LSTM Predictions ...... 59
5.2.3 Likelihood Simulation ...... 62
5.2.4 ID Artist LSTM Predictions ...... 63

6.3.1 μ0 vs. Days Until Concert ...... 67
6.3.2 με vs. Days Until Concert ...... 68
6.3.3 p vs. Days Until Concert ...... 69
6.5.1 Linear Regression Diagnostics ...... 73
6.5.2 Mixed Effects Diagnostics ...... 73
6.5.3 GAM Diagnostics ...... 74

Listing of tables

3.1.1 Artist Variables ...... 16
3.1.2 Concert Variables ...... 17
3.1.3 Ticket Variables ...... 18

5.1.1 Log-Mean Ticket Model Performance ...... 49
5.1.2 Log-Mean Venue Model Performance ...... 52
5.1.3 Log-Mean Artist Model Performance ...... 53
5.2.1 ID Ticket MLE Parameters ...... 55
5.2.2 ID Ticket LSTM Performance ...... 56
5.2.3 ID Venue MLE Parameters ...... 58
5.2.4 ID Venue LSTM Performance ...... 59
5.2.5 ID Artist MLE Parameters ...... 60
5.2.6 Top 5 and Bottom 5 City Estimates for Median Artist Popularity at Day 100 ...... 61
5.2.7 Top 5 and Bottom 5 Artist Estimates for Median City Population at Day 100 ...... 62
5.2.8 ID Artist LSTM Performance ...... 63
