Pitch Perfect: Predicting Startup Funding Success Based on Shark Tank Audio

Shubha Raghvendra, Jeremy Wood, Minna Xiao
[email protected] [email protected] [email protected]

Abstract

In this paper we describe the design and evaluation of a neural network trained to distinguish between funded and unfunded venture capital pitches, particularly within the context of the television show Shark Tank. This work represents a novel application of existing research in the realm of emotion and persuasion detection and broader speech processing. After attempting various architectures, including a support vector machine-based model, a recurrent neural net (RNN), and a convolutional neural net (CNN), we settled on a hybrid CNN-LSTM. Utilizing this optimal model, we were able to obtain validation accuracy of up to 68%. Given prior work in the field and the challenges associated with this problem, the test accuracy produced by our optimal model exceeded our expectations. This work demonstrates the feasibility of applying speech features to gauge startup pitch quality, as well as the utility of hybrid neural networks in representing the persuasiveness of small segments of speech data.

1 Introduction

1.1 Motivation

Venture capital as a field has long struggled with issues of diversity (Cutler, 2015). Because success in securing funding is largely a function of presentation quality, we were interested in understanding which specific aspects of a pitch predispose an entrepreneur to securing funding. Equipped with such knowledge, minority founders could inch closer to equal footing in securing venture capital.

The Emmy Award-winning television show Shark Tank, which has been on the air since 2009, embodies made-for-TV venture capitalism. In any given episode, several entrepreneurs pitch their ideas for a company to sharks (a panel of potential investors), who decide whether or not to fund the enterprise, and on what terms. While we initially hoped to evaluate actual venture capital pitches, perhaps from the records of a Silicon Valley firm, given barriers to accessing confidential early-stage information, we opted to evaluate publicly available Shark Tank pitches. To do so, we found several YouTube playlists of each episode and, for some seasons of the show, segmented by individual pitch, scraped the audio file associated with each video. (For those seasons for which nicely segmented playlists did not exist, we segmented the episodes manually; our methodology is described below.) We labeled this raw data with information about whether or not the venture was funded, and to what extent, using publicly available tabulated information about the show. Our approach is described in greater detail below.

Figure 1: The Kang sisters, founders of Coffee Meets Bagel, pitching on Shark Tank in 2015.

1.2 Problem Statement

Our goal was to understand which features of a startup pitch correspond to whether or not it was funded in the context of Shark Tank. While we initially planned to segment based on precisely which shark elected to fund an entrepreneur, in our survey of relevant literature we found that even two-class problems in this realm were sufficiently challenging. Thus we focused on refining our techniques in the realm of binary classification for the purposes of this project (Chernykh et al., 2017). We planned to extract both raw audio and MFCC features (Han et al., 2006), as well as other emergent speech features such as prosodic features (Agarwal et al., 2011; Schuller et al., 2016).

1.3 Challenges

From the outset, we knew this would be a challenging problem to tackle, and hence wanted to adjust our expectations accordingly. First, Shark Tank is a network television show whose ability to engage its audience is predicated on building suspense and injecting drama into the process of selecting pitches to fund. We therefore expected the show's audio to be edited to obfuscate the sharks' ultimate decision on a pitch, making outcome prediction a challenging undertaking.

Second, because not all seasons were available as segmented individual pitches, we had to manually segment over one hundred episodes into individual pitches. This process was somewhat imperfect given that occasionally a shark would unexpectedly interject or ask a question, and that some pitches involved gimmicks such as demos or performances.

Finally, technical problems in this realm have been shown to be quite pernicious. For instance, Chernykh et al. studied the efficacy of labeling utterances in the IEMOCAP database (discussed below), and achieved a modest 54% accuracy on a four-class classification problem (Chernykh et al., 2017). Given that the utterances in IEMOCAP are both much shorter in length than the pitches we trained our model on, and that they were recorded in a much more controlled environment (without any background noise) with a set number of actors, we expected achieving a very high validation accuracy to be fairly challenging (Busso et al., 2008). There is, moreover, a low rate of agreement amongst sharks, rendering this an even more exacting problem.

2 Background/Related Work

Existing research in the realm of classifying audio utterances has been conducted on the tasks of emotion recognition, personality identification, and deception detection in speech. Prior to 2014, most research on classifying emotion in speech involved extracting prosodic (pitch, energy) and cepstral (LPCC, MFCC) features from the audio and running them through a support vector machine (SVM). Pan et al. achieved a best recognition rate of 90% on a small dataset consisting of 212 utterances from the Berlin Database of Emotional Speech (Emo-DB) on a three-class classification task over the emotions sad, happy, and neutral (Pan et al., 2012). Experiments involving the Big-Five personality traits (Extroversion, Agreeableness, Conscientiousness, Neuroticism, Openness) have also been performed using MFCC and prosodic features with SVM classifiers (Polzehl et al., 2010; Mohammadi and Vinciarelli, 2012).

Recent work has begun applying neural network architectures to the aforementioned problems. Lee and Tashev trained a bi-directional long short-term memory (BLSTM) recurrent neural network on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database for four emotion classes (Lee and Tashev, 2015), extracting features including F0, zero-crossing rate, and MFCCs, which were then used as input to the two-hidden-layer BLSTM. They achieved up to 63% accuracy on the IEMOCAP database. In 2017, Chernykh et al. also performed utterance-level classification on the IEMOCAP database using an LSTM architecture (Chernykh et al., 2017). They tried two approaches to training their network: 1) a one-label approach and 2) a Connectionist Temporal Classification (CTC) approach. In the one-label approach, each utterance has only one emotional label regardless of the utterance length; in the CTC approach, the probability of a particular labeling is summed over the probabilities of every alignment. The authors found the best results with the CTC approach, achieving up to 54% accuracy on their four-class task.

3 Approach

3.1 Dataset

Our dataset consists of audio scraped from YouTube uploads of Shark Tank episodes, segmented by pitch.

To collect the data we needed for labeling the pitches, we referenced a database cultivated by Halle Tecco, an angel investor, the founder of Rock Health, and a self-proclaimed “Shark Tank fanatic.” The database contains investment data from every season of Shark Tank, including the ongoing 8th season. For the purposes of our project, we reached out to Ms. Tecco for guidance and were able to access the entirety of her database as a result. For each company that has pitched on the show, the database contains information on the final deal terms for the product, including amount, equity, and valuation. Additionally, we have supplementary information on the industry, the entrepreneur's gender, and which sharks agreed to fund the company.

3.1.1 Data Collection and Preprocessing

In order to collect the raw labeled audio, we wrote a scraper in Python using the youtube-dl package to pull audio from pre-assembled playlists of Shark Tank pitches. We extracted audio clips from these videos in the .wav format, which is widely supported by several Python packages, including TensorFlow. We then ran Mel-Frequency Cepstral Coefficient (MFCC) feature extraction on our raw audio files in order to prepare them for our model (Han et al., 2006), which was necessary because our inputs differed in elapsed time.
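The paper names youtube-dl but does not show the scraper itself; the following is a minimal sketch of such a scraper using youtube-dl's Python API, with an ffmpeg post-processing step to produce .wav files. The playlist URL and output paths are placeholders, not the playlists the authors actually used.

```python
# Sketch of an audio scraper of the kind described above. Assumes the
# youtube-dl Python package and ffmpeg are installed; the playlist URL
# below is a placeholder, not one of the playlists used in the paper.
import youtube_dl

PLAYLIST_URL = "https://www.youtube.com/playlist?list=EXAMPLE"  # hypothetical

ydl_opts = {
    "format": "bestaudio/best",               # take the best available audio stream
    "outtmpl": "pitches/%(title)s.%(ext)s",   # one file per pitch video
    "postprocessors": [{
        "key": "FFmpegExtractAudio",          # convert the download to .wav via ffmpeg
        "preferredcodec": "wav",
    }],
    "ignoreerrors": True,                     # skip videos that fail to download
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([PLAYLIST_URL])
```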
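The labeling step, joining each scraped clip to a funding outcome from the deal database, might then look like the sketch below. The CSV filename and column names are invented for illustration; the actual schema of Ms. Tecco's database is not described in this paper.

```python
# Hypothetical labeling step: map scraped .wav files to funding outcomes.
# "sharktank_deals.csv", "company", and "deal_made" are placeholder names.
import os
import pandas as pd

deals = pd.read_csv("sharktank_deals.csv")
labels = {}
for _, row in deals.iterrows():
    wav = os.path.join("pitches", row["company"] + ".wav")
    if os.path.exists(wav):
        labels[wav] = int(row["deal_made"])  # 1 = funded, 0 = not funded
```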
3.2 Baseline Approach

Motivated by existing research in emotion recognition and personality detection (Pan et al., 2012; Polzehl et al., 2010; Mohammadi and Vinciarelli, 2012), which uses low-level feature extraction for high-level classification tasks, we implemented a baseline binary support vector machine.

3.2.1 Mel-Frequency Cepstral Coefficients

We extracted the first 13 mel-frequency cepstral coefficients for each frame of an input audio segment. MFCC-based features are widely used in automatic speech recognition (ASR) tasks. Given the limited size of our dataset, we consciously attempted to reduce the dimensionality of our feature vectors. Thus, instead of concatenating all the MFCCs over an audio segment to create one large feature vector, we computed the mean, standard deviation, median, maximum, and minimum of each coefficient over all the frames in an audio segment.
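As a concrete illustration of this baseline, the sketch below computes the 65-dimensional summary-statistics vector just described (13 coefficients × 5 statistics) and fits a binary SVM. librosa and scikit-learn are our assumptions for the MFCC extraction and the SVM implementation, as the paper does not name its libraries; the file paths and labels are hypothetical.

```python
# Sketch of the baseline pipeline: per-frame MFCCs summarized by segment-level
# statistics, then a binary SVM. librosa and scikit-learn are assumptions, not
# the paper's stated tooling; file paths and labels below are placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_summary_features(wav_path):
    """13 MFCCs per frame -> 5 summary stats per coefficient = 65-dim vector."""
    y, sr = librosa.load(wav_path, sr=None)             # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    stats = [np.mean, np.std, np.median, np.max, np.min]
    return np.concatenate([f(mfcc, axis=1) for f in stats])  # shape: (65,)

# Hypothetical training files; labels are 1 for funded pitches, 0 for unfunded.
train_paths, train_labels = ["pitches/a.wav", "pitches/b.wav"], [1, 0]

X_train = np.stack([mfcc_summary_features(p) for p in train_paths])
clf = SVC(kernel="rbf")  # binary SVM baseline
clf.fit(X_train, train_labels)

# Predict whether a held-out pitch was funded.
print(clf.predict(mfcc_summary_features("pitches/new_pitch.wav").reshape(1, -1)))
```

The summary statistics give every segment a fixed-length representation regardless of duration, which is what lets pitches of differing lengths share one SVM.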
3.2.2 Prosodic Features

We also experimented with accounting for the prosodic features in our audio segments. Prosody, which refers to the aspect of speech not specific