Pitch Perfect: Predicting Startup Funding Success Based on Shark Tank Audio

Shubha Raghvendra, Jeremy Wood, Minna Xiao
[email protected], [email protected], [email protected]

Abstract

In this paper we describe the design and evaluation of a neural network trained to distinguish between funded and unfunded venture capital pitches, particularly within the context of the television show Shark Tank. This work represents a novel application of existing research in the realm of emotion and persuasion detection and broader speech processing. After attempting various architectures, including a support vector machine-based model, a recurrent neural net (RNN), and a convolutional neural net (CNN), we settled on a hybrid CNN-LSTM. Utilizing this optimal model, we were able to obtain validation accuracy of up to 68%. Given prior work in the field and the challenges associated with this problem, the test accuracy produced by our optimal model exceeded our expectations. This work demonstrates the feasibility of applying speech features to gauge startup pitch quality, as well as the utility of hybrid neural networks in representing the persuasiveness of small segments of speech data.

[Figure 1: The Kang sisters, founders of Coffee Meets Bagel, pitching on Shark Tank in 2015.]

1 Introduction

1.1 Motivation

Venture capital as a field has long struggled with issues of diversity (Cutler, 2015). Because success in securing funding is largely a function of presentation quality, we were interested in understanding which specific aspects of a pitch predispose an entrepreneur to securing funding. Equipped with such knowledge, minority founders could inch closer to equal footing in securing venture capital.

The Emmy Award-winning television show Shark Tank, which has been on the air since 2009, embodies made-for-TV venture capitalism. In any given episode, several entrepreneurs pitch their ideas for a company to sharks (a panel of potential investors) who decide whether or not to fund the enterprise, and on what terms. While we initially hoped to evaluate actual venture capital pitches, perhaps from the records of a Silicon Valley firm, given barriers to access to confidential early-stage information, we opted to evaluate publicly available Shark Tank pitches. To do so, we found several YouTube playlists of each episode (for some seasons of the show, already segmented by individual pitch) and scraped the audio files associated with each video. For those seasons for which nicely segmented playlists did not exist, we segmented them manually; our methodology is described below. We labeled this raw data with whether or not each venture was funded, and to what extent, using information about the show tabulated online. Our approach is described in greater detail below.

1.2 Problem Statement

Our goal was to understand which features of a startup pitch correspond to whether or not it was funded in the context of Shark Tank. While we initially planned to segment based on precisely which shark elected to fund an entrepreneur, in our survey of relevant literature we found that even two-class problems in this realm were sufficiently challenging. Thus we focused on refining our techniques in the realm of binary classification for the purposes of this project (Chernykh et al., 2017). We planned to extract both raw audio and MFCC features (Han et al., 2006), as well as other emergent speech features such as prosodic features (Agarwal et al., 2011; Schuller et al., 2016).

1.3 Challenges

From the outset, we knew this would be a challenging problem to tackle, and hence wanted to adjust our expectations accordingly. First, Shark Tank is a network television show whose ability to engage its audience is predicated on building suspense and injecting drama into the process of selecting pitches to fund. Therefore, we expect the audio of the television show to attempt to obfuscate the sharks' ultimate decision on a pitch, making predicting outcomes a challenging undertaking.

Second, because not all seasons were available as segmented individual pitches, we had to manually segment over one hundred episodes into individual pitches. This process was somewhat imperfect given that occasionally a shark would unexpectedly interject or ask a question, and that some pitches involved gimmicks such as demos or performances.

Finally, technical problems in this realm have been shown to be quite pernicious. For instance, Chernykh et al. studied the efficacy of labeling utterances in the IEMOCAP database (discussed below), and achieved a modest 54% accuracy on a four-class classification problem (Chernykh et al., 2017). Given that the utterances in IEMOCAP are both much shorter in length than the pitches we trained our model on, and that they were recorded in a much more controlled environment (without any background noise) with a set number of actors, we expected achieving a very high validation accuracy to be fairly challenging (Busso et al., 2008).
2 Background/Related Work

Existing research in the realm of classifying audio utterances has been conducted on the tasks of emotion recognition, personality identification, and deception detection in speech. Prior to 2014, most research on classifying emotion in speech involved extracting prosodic (pitch, energy) and cepstral (LPCC, MFCC) features from the audio and running them through a support vector machine (SVM). Pan et al. achieved a best recognition rate of 90% on a small dataset consisting of 212 utterances from the Berlin Database of Emotional Speech (Emo-DB) for a three-class classification task over the emotions sad, happy, and neutral (Pan et al., 2012). Experiments involving the Big Five personality traits (Extroversion, Agreeableness, Conscientiousness, Neuroticism, Openness) have also been performed using MFCC and prosodic features with SVM classifiers (Polzehl et al., 2010; Mohammadi and Vinciarelli, 2012).

Recent work has begun introducing neural network architectures to the aforementioned problems. Lee and Tashev (2015) trained a bi-directional long short-term memory (BLSTM) recurrent neural network on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database for four emotion classes, extracting features including F0, zero-crossing rate, and MFCCs, which were then used as input to the two-hidden-layer BLSTM; they achieved up to 63% accuracy on IEMOCAP. In 2017, Chernykh et al. also performed utterance-level classification on the IEMOCAP database using an LSTM architecture (Chernykh et al., 2017). They tried two approaches to training their network: 1) a one-label approach and 2) a Connectionist Temporal Classification (CTC) approach. In the one-label approach, each utterance receives only one emotional label regardless of the utterance length; in the CTC approach, the probability of a particular labeling is summed over the probabilities of every alignment. The authors found the best results with the CTC approach, achieving up to 54% accuracy on their four-class task.

3 Approach

3.1 Dataset

Our dataset consists of audio scraped from YouTube uploads of Shark Tank episodes, segmented by pitch. To collect the data we needed for labeling the pitches, we referenced a database cultivated by Halle Tecco, an angel investor, the founder of Rock Health, and a self-proclaimed "Shark Tank fanatic." The database contains investment data from every season of Shark Tank, including the ongoing 8th season. For the purposes of our project, we reached out to Ms. Tecco for guidance and were able to access the entirety of her database as a result. For each company that has pitched on the show, the database contains information on the final deal terms for the product, including amount, equity, and valuation. Additionally, we have some supplementary information on the industry, entrepreneur gender, and which sharks agreed to fund the company.

3.1.1 Data Collection and Preprocessing

In order to collect the raw labeled audio, we wrote a scraper in Python using the youtube-dl package to pull audio from pre-assembled playlists of Shark Tank pitches. We extracted audio clips from these videos in the .wav format, which is widely supported by several Python packages, including TensorFlow. We then ran Mel-Frequency Cepstral Coefficient (MFCC) feature extraction on our raw audio files in order to prepare them for our model (Han et al., 2006), which was necessary because our inputs differed in elapsed time. These .wav files were then mapped to labels (funded versus not funded) obtained from the aforementioned database of Shark Tank outcomes compiled by Halle Tecco. We ultimately collected 509 pitches in .wav form, each approximately two minutes in length. We segmented each pitch into five-second segments and generated MFCC features (totaling 7,895 data points) for each segment to be fed into our models.
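As a concrete illustration of this preprocessing step, the sketch below splits one pitch recording into five-second segments and extracts 13 MFCCs per frame. The use of librosa and the example file path are assumptions made for exposition, not necessarily the exact tooling of our pipeline.

```python
import librosa

SEGMENT_SECONDS = 5
N_MFCC = 13

def pitch_to_mfcc_segments(wav_path):
    """Split one scraped pitch recording into five-second segments of MFCC frames."""
    # sr=None keeps the native sampling rate of the scraped .wav file.
    signal, sr = librosa.load(wav_path, sr=None)
    samples_per_segment = int(SEGMENT_SECONDS * sr)

    segments = []
    for start in range(0, len(signal) - samples_per_segment + 1, samples_per_segment):
        chunk = signal[start:start + samples_per_segment]
        # 13 MFCCs per frame; librosa returns an array of shape (n_mfcc, n_frames).
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=N_MFCC)
        segments.append(mfcc.T)  # time-major (n_frames, 13) for the temporal models
    return segments

# Hypothetical usage: each segment inherits the funded/not-funded label of its pitch.
segments = pitch_to_mfcc_segments("pitches/s01e01_pitch1.wav")
```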
3.1.2 Dataset Distribution

One reason why this dataset was ideal for a classification problem was the near-even split between funded and unfunded pitches. Most Shark Tank episodes featured two funded pitches and two unfunded pitches, which prevented oversampling of any one particular label type in model training. However, agreement amongst the sharks (where we define "agreement" as either more than one shark expressing positive interest in a pitch or no shark expressing a desire to fund a pitch) averaged just 52.3%, per the Tecco database. This is a relatively low rate of agreement amongst the sharks, rendering this an even more exacting problem.

3.2 Baseline Approach

Motivated by existing research in emotion recognition and personality detection (Pan et al., 2012; Polzehl et al., 2010; Mohammadi and Vinciarelli, 2012), which uses low-level feature extraction for high-level classification tasks, we implemented a baseline binary support vector machine.

3.2.1 Mel-Frequency Cepstral Coefficients

We extracted the first 13-order mel-frequency cepstral coefficients for each frame of an input audio segment. MFCC-based features are widely used in automatic speech recognition (ASR) tasks. Due to the limited size of our dataset, we consciously attempted to reduce the dimensionality of our feature vectors. Thus, instead of concatenating all the MFCCs over an audio segment to create one large feature vector, we computed the statistics of mean, standard deviation, median, maximum, and minimum over the MFCCs of all the frames in an audio segment.
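A minimal sketch of this statistic-based feature construction, again assuming librosa for the MFCC computation, is shown below; the result is a fixed-length 65-dimensional vector per segment (5 statistics for each of the 13 coefficients).

```python
import numpy as np
import librosa

def mfcc_statistics(signal, sr, n_mfcc=13):
    """Fixed-length MFCC feature vector for the SVM baseline: five summary
    statistics per coefficient, computed over all frames of the segment."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (13, n_frames)
    stats = [mfcc.mean(axis=1),
             mfcc.std(axis=1),
             np.median(mfcc, axis=1),
             mfcc.max(axis=1),
             mfcc.min(axis=1)]
    return np.concatenate(stats)  # 13 * 5 = 65 values
```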
3.2.2 Prosodic Features

We also experimented with accounting for the prosodic features of our audio segments. Prosody, which refers to the aspects of speech not specific to the individual phoneme but rather the tune and rhythm of speech, is characterized by such factors as vocal pitch (fundamental frequency), loudness (acoustic intensity), and rhythm. We surmised that the intonation of a founder's speech and the suprasegmental aspects of her startup pitch delivery could have some influence over the sharks' perceptions of the company. Thus, we extracted values for the fundamental frequency (F0) and intensity of each syllable of the audio using Praat, a piece of software for linguistic and phonetic analysis of speech. For each data point, we once again computed statistics (mean, standard deviation, median, maximum, and minimum) over the F0 and intensity figures.

3.2.3 Support Vector Machine Model

For our baseline approach, we combined the MFCC and prosodic features explained above in a support vector machine model. We tested the MFCC and prosodic features individually, and also evaluated the performance of the model when both were combined additively, for which there is precedent in the literature. As explained in the Results section below, the SVMs performed poorly, which we believe is a result of the fact that they do not adequately model temporal information, which is critical in applications such as this. Thus we were motivated to attempt a more complex architecture grounded in deep learning, explained below.
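The following sketch illustrates such a baseline with scikit-learn. The feature matrices here are random stand-ins for the real per-segment statistics (and are smaller than our actual 7,895 segments); the scaling step, linear kernel, and five folds mirror the setup described above only approximately.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Stand-ins for the real per-segment features: 65 MFCC statistics and
# 10 prosodic (F0 + intensity) statistics per data point, with binary labels.
X_mfcc = rng.normal(size=(500, 65))
X_prosody = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)

# Additive combination of the two feature sets, as evaluated in Table 1.
X = np.hstack([X_mfcc, X_prosody])

model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print(cross_val_score(model, X, y, cv=5).mean())  # five-fold cross validation
```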
[Figure 2: CNN-LSTM architecture]

3.3 Intermediate Experimental Architectures

We took a variety of approaches to modeling our data in an effort to improve results before arriving at our best-performing model, the hybrid CNN-LSTM described in the subsequent subsection. We can split our attempts into two categories: data handling and model architecture.

3.3.1 Data Handling Experiments

Our initial pipeline involved the full one-to-two-minute pitches. However, we quickly came to the conclusion that this length (over 10,000 frames, even after MFCC extraction reduced the raw signal) was untenable for either RNNs or one-dimensional CNNs, so we initially split each pitch into ten-second (minimum six-second) segments. We saw further improvement from splitting them into five-second segments. These proved much more manageable for our temporal networks and yielded better results, but it is possible that we discarded some information about the overall structure of each pitch in the process. We also tried using both prosodic features and MFCC features (using raw features was difficult because of the length of the raw signal); however, while prosodic features worked better than MFCCs for the SVM, they proved too great a reduction of the data for our hybrid neural models, and all of our neural models performed very poorly on them. Ultimately we settled on using the MFCC features on five-second segments of audio from the pitches.

3.3.2 Neural Network Architectures

Given the time-based nature of our data, we were limited to architectures that operate on series of data points. Two families of networks fit this constraint: RNNs and 1-D CNNs. Knowing this, we wanted to try out different combinations thereof.

While first developing, our default model was a vanilla RNN with a final affine layer on top of the final output state. However, we also attempted to run the affine layer on all of the RNN's hidden-state outputs across the time series. This produced poor results, perhaps due to the larger size of the final layer (we were always limited in the amount of data and therefore could not fit too large a model). Thus, we ultimately only ran the post-RNN layer on the final hidden state output of the RNN. We also switched to using LSTM cells and tanh activation functions within the RNN, as ReLU units caused the gradient to explode too quickly.

Eventually, we branched out into experimenting with other network architectures. We initially tried using just one-dimensional convolutional neural nets. 1-D CNNs stride across a single dimension (i.e., time) instead of across two dimensions (as in an image). Similar to the motivation for their usage in image systems, we wanted to try CNNs because they can, at least in theory, model higher-level abstract concepts from the amalgamation of smaller signals. In images this amounts to understanding higher-level visual concepts; in audio analysis, we were hoping to use them to capture semantic content. However, as can be seen in Table 2, pure CNN models did not perform well. This led us to combine them with LSTMs, as seen in our final model; together, their results were superior to either model operating in isolation.

Finally, we also tried concatenating meta-features to the input of the final hidden layer, feeding them in at the end to increase their importance. However, as we hypothesized from the fact that predicting off only the meta-features did not produce a model better than random, this model showed no improvement over previous architectures.
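A minimal Keras sketch of the intermediate recurrent baseline described above is given below; the layer width and the loss function are illustrative assumptions rather than our exact configuration.

```python
import tensorflow as tf

# One LSTM layer (tanh activations) over the per-frame MFCC vectors of a segment,
# with a single affine layer applied only to the final hidden state.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 13)),                # (time frames, 13 MFCCs)
    tf.keras.layers.LSTM(64, activation="tanh"),     # returns only the final state
    tf.keras.layers.Dense(1, activation="sigmoid"),  # funded / not funded
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```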

3.4 CNN-LSTM Model

Our optimal model configuration was a hybrid CNN-LSTM model consisting of a temporal convolutional neural net feeding into a recurrent neural net, as shown in Figure 2. The CNN performs a one-dimensional convolution over the time dimension of an audio segment; we shape the input to the CNN by concatenating the first 13-order MFCC features of each time frame to create one long 1-D feature vector.

3.4.1 Convolutional Layer

Our optimal architecture uses three convolutional layers (see Figure 3). The first convolutional layer, for example, consists of 64 filters with a kernel width of six. These layers convolve nearby MFCC features, thereby capturing abstract representations of discrete segments of speech. Theoretically, this allows the recurrent layer on top of the CNN to operate over emergent semantic representations instead of low-level feature sets.

After each convolutional layer, we apply an activation layer using the Rectified Linear Unit (ReLU) operation (1) in order to introduce non-linearity into our network. We also tried Parametrized ReLUs (PReLUs), wherein the rectified unit is similar to a leaky ReLU except that the leakiness factor a is a learned variable instead of a set constant (2). Ultimately we found better performance using the ReLU activation.

\mathrm{ReLU}(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases} \qquad (1)

\mathrm{PReLU}(x) = \begin{cases} x & x > 0 \\ ax & x \le 0 \end{cases} \qquad (2)
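In code, the two activations of equations (1) and (2) amount to the following simple NumPy sketch, where a is the learned leakiness factor:

```python
import numpy as np

def relu(x):
    # Equation (1): pass positive activations through, zero out the rest.
    return np.maximum(x, 0.0)

def prelu(x, a):
    # Equation (2): like a leaky ReLU, but the negative-side slope `a`
    # is a learned parameter rather than a fixed constant.
    return np.where(x > 0, x, a * x)
```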

[Figure 3: Layers of the CNN-LSTM model]

3.4.2 Max-pooling Layer

For our pooling layers, we use a 2x2 max pool with strides of 3, 2, and finally 1 (in the respective layers) on our rectified feature maps in order to downsample the spatial size of our representation and reduce the number of parameters in the network. Such an operation helped to control overfitting and accommodate the size of our dataset.

3.4.3 Batch Normalization Layer

We found a 3% gain in accuracy when we incorporated batch normalization layers after the pooling layers in our network. Using the technique introduced by Ioffe and Szegedy (2015), which involves normalizing activations using mini-batch statistics, we were able to achieve faster convergence and higher accuracy. Since batch normalization reduces the internal covariate shift caused by the changing network parameters during training, the network became more robust to badly initialized weights, and we were able to use a higher learning rate during training.

During training, for the activations x in a mini-batch B of size m, we apply the Batch Normalizing Transform BN_{\gamma,\beta}: x_{1 \ldots m} \rightarrow y_{1 \ldots m} to obtain the normalized values, using the mini-batch estimates of the mean and variance, \mu_B and \sigma_B^2:

\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad (3)

y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \qquad (4)

The output y values are then passed as input to the next convolutional layer.
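A direct NumPy transcription of the transform in equations (3) and (4) looks as follows; gamma and beta are the learned scale and shift parameters.

```python
import numpy as np

def batch_norm_transform(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform of equations (3)-(4) for a mini-batch of
    activations x (one row per example); gamma and beta are learned per feature."""
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # equation (3)
    return gamma * x_hat + beta             # equation (4)
```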

3.4.4 LSTM

For the second part of our hybrid model, we decided to use an LSTM instead of a vanilla RNN because of the difficulties vanilla RNNs have in learning long-range dependencies due to the vanishing gradient problem. LSTMs use a gating mechanism to combat this problem: for each LSTM unit, the input, forget, and output gates (i, f, o) squash their corresponding values between 0 and 1 using the sigmoid function, and the "candidate" hidden state g is computed from the current input x_t and the previous hidden state h_{t-1}, as in equation (5). Equation (6) gives the internal memory of the unit, the cell state c_t, while the output hidden state h_t is computed by elementwise multiplication of the output gate with the tanh activation of the cell state, as in equation (7).

\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix} \qquad (5)

c_t = f \odot c_{t-1} + i \odot g \qquad (6)

h_t = o \odot \tanh(c_t) \qquad (7)
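For concreteness, a single LSTM step implementing equations (5)-(7) can be written in NumPy as below; the weight matrix W stacks the pre-activations of the four blocks, exactly as in equation (5).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step following equations (5)-(7). W has shape (4H, H + D) and maps
    the concatenated [h_{t-1}; x_t] to the stacked i, f, o, g pre-activations."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t])   # equation (5), before the nonlinearities
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate state
    c_t = f * c_prev + i * g                # equation (6)
    h_t = o * np.tanh(c_t)                  # equation (7)
    return h_t, c_t
```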

Our final two-layer portion uses one hidden LSTM layer followed by a fully-connected (affine) layer that reduces the final hidden state of the LSTM to a single output (the prediction). For each time step t of the LSTM, the input x_t was the output of the final 1-D convolutional layer at that step; i.e., each x_t had a dimension of 16, corresponding to the 16 filters applied in the final convolutional layer.

Finally, one additional component that yielded modest improvements was using meta-features to condition the LSTM. We took two meta-features available to us from the Tecco database: a sparse representation of the gender of the team (all female, all male, or mixed) and a sparse representation of the industry the pitch was in. These are both reasonable meta-features that do not in and of themselves solve the problem: we tried running an SVM on just the meta-features, and the accuracy achieved barely surpassed random-assignment strategies, so on their own these features were not valuable. However, we concatenated the sparse representations of gender and industry and used an affine layer to transform them into the hidden and internal states fed as the initial state to the LSTM.
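A minimal Keras sketch of this hybrid configuration is shown below. The first-layer filter count (64), kernel width (six), final-layer filter count (16), and pooling strides (3, 2, 1) follow the description above; the middle layer's filter count, the LSTM width, the meta-feature dimensionality, the number of frames per segment, and the hinge-loss setup are illustrative assumptions rather than our exact hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_FRAMES, N_MFCC, N_META, HIDDEN = 500, 13, 20, 64   # illustrative sizes

# Acoustic branch: three Conv1D + ReLU + max-pool + batch-norm blocks over the
# MFCC frames of one five-second segment (pooling strides of 3, 2, and 1).
mfcc_in = layers.Input(shape=(N_FRAMES, N_MFCC), name="mfcc_frames")
x = mfcc_in
for filters, stride in [(64, 3), (32, 2), (16, 1)]:
    x = layers.Conv1D(filters, kernel_size=6, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2, strides=stride)(x)
    x = layers.BatchNormalization()(x)

# Meta-feature branch: sparse gender/industry indicators are mapped by affine
# layers to the LSTM's initial hidden and cell states.
meta_in = layers.Input(shape=(N_META,), name="meta_features")
h0 = layers.Dense(HIDDEN)(meta_in)
c0 = layers.Dense(HIDDEN)(meta_in)

# LSTM over the 16-dimensional convolutional outputs, then an affine layer that
# reduces the final hidden state to a single funded/not-funded score.
h = layers.LSTM(HIDDEN)(x, initial_state=[h0, c0])
out = layers.Dense(1, activation="tanh")(h)          # +/-1 targets for hinge loss

model = tf.keras.Model([mfcc_in, meta_in], out)
model.compile(optimizer="adam", loss="hinge", metrics=["accuracy"])
```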
4 Results

We ran our baseline SVM and our various neural network models on validation and test sets of our Shark Tank data, ultimately achieving the best results with our proposed hybrid CNN-LSTM model.

4.1 Baseline Model

We achieved at best a 55% validation accuracy using the prosodic features of fundamental frequency and intensity with our tuned linear-kernel SVM. The slight performance improvement of the prosodic features over the MFCC features could be attributed to the suprasegmental nature of prosodic features, which may lend itself better to computed statistics (our final feature form).

Table 1: Five-fold cross validation of the SVM with various combinations of features.

    Feature Combination        Val. Accuracy    Test Accuracy
    MFCC                       50%              54.6%
    F0 + Intensity             55%              54.8%
    MFCC + F0 + Intensity      51%              50.1%

The subpar performance of SVMs was not unexpected: although cepstral and prosodic features fed through an SVM have achieved favorable results for tasks such as emotion recognition, much of the existing work in those areas is performed on datasets consisting of very short utterances (e.g., Emo-DB). For our data, which consist of longer audio segments, the SVM model with acoustic statistics loses much of the temporal information integral to the data. Thus we turned to recurrent neural network architectures, which ultimately yielded more fruitful results.

4.2 Neural Architectures

Results of the neural architectures are presented in Table 2. The results show a stark contrast between the performance of the hybrid model and any single "pure" architecture. While combining predicted labels from individual segments of the same pitch to predict the label of the overall pitch falls within our further research, we expect that an accuracy of around 68% on individual segments would yield an even higher accuracy on the labeling of entire pitches.

Table 2: Performance of the best versions (hyperparameters tuned and number of units/layers optimized) of our various models on the funded/not funded Shark Tank classification task. The CNN-LSTM hybrid is our optimal model.

    Model               Val. Accuracy    Test Accuracy
    CNN                 54.3%            51.7%
    Vanilla RNN         56.8%            54.5%
    LSTM                58.3%            58.0%
    CNN-LSTM hybrid     67.9%            65.7%

[Figure 4: Validation accuracy for the optimal CNN-LSTM model, terminating in approximately 68% accuracy.]

[Figure 5: Training loss for the optimal CNN-LSTM model; loss was calculated using hinge loss.]
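As noted above, aggregating segment-level predictions into a pitch-level label remains future work; one simple possibility is a majority vote over the segments of a pitch, sketched below with hypothetical names.

```python
import numpy as np

def pitch_level_prediction(segment_probs, threshold=0.5):
    """Aggregate per-segment funded probabilities for one pitch by majority vote."""
    votes = np.asarray(segment_probs) >= threshold
    return int(votes.mean() >= 0.5)  # funded if at least half the segments vote funded

# Hypothetical usage with the roughly 24 five-second segments of a two-minute pitch.
print(pitch_level_prediction([0.7, 0.4, 0.9, 0.6]))
```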
5 Conclusion

Our project represents a novel application of prior work in the fields of emotion recognition and speech processing: in particular, a model trained to distinguish between funded and unfunded pitches on the television program Shark Tank. Our first-pass meta-feature SVM, trained exclusively on features such as founder attributes, achieved only 50% accuracy, suggesting the potential utility of speech features in constructing such a model. The modest performance of SVMs trained on MFCC and prosodic features (and combinations thereof) motivated our attempts at neural-net-based architectures. The optimal architecture we designed was a hybrid temporal CNN-RNN combination, which attained a 68% validation accuracy. By employing parallel convolutions, we preserve temporal information critical to understanding the overall sequence, yet develop compact representations of abstract phenomena within the speech segments. Moreover, RNNs have been used in a wide variety of speech and language processing applications, and also allow us to leverage time information. Given prior published attempts at similar problems, we believe this work represents a significant step in emotion recognition, particularly within the realm of speech segments aimed at persuasion (e.g., venture capital pitches).

Possible extensions of our current work include introducing lexical features, in addition to the current acoustic features, to our CNN-LSTM model; we could do so by generating text transcripts of each audio pitch segment. This seems like a natural extension, considering that the actual content of a pitch plays a significant role in determining a founder's outcome. Furthermore, we could consider expanding to a more complex problem space by building a model that both buckets pitches into funded and not funded and regresses against equity and valuation for a more granular prediction.

We also wish to train our model on larger datasets and on real VC pitches without the theatrical elements of Shark Tank (which contribute unwanted noise to the pitch audio). We would also like to try combining the results of the segmented pitch predictions in order to predict the outcome of the unified pitch. While all of the above are viable extensions, they lie outside the scope of our original problem. Given how noisy our data was, the fact that not all VCs agree on which ventures to fund, the relatively small size of our dataset, and the higher-level considerations (e.g., a company's sales to date) needed to decide which ventures to fund, 68% accuracy was a strong step towards proving that neural models can capture the cogency of complex persuasive speeches.

Acknowledgments

We would like to give special thanks to our instructor Andrew Maas and our teaching assistant Jiwei Li for their support and guidance.

References

Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media. Association for Computational Linguistics, pages 30–38.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42(4):335.

Vladimir Chernykh, Grigoriy Sterling, and Pavel Prihodko. 2017. Emotion recognition from speech with recurrent neural networks. arXiv preprint arXiv:1701.08071.

Kim-Mai Cutler. 2015. Here's a detailed breakdown of racial and gender diversity data across U.S. venture capital firms. https://techcrunch.com/2015/10/06/s23p-racial-gender-diversity-venture/.

Wei Han, Cheong-Fat Chan, Chiu-Sing Choy, and Kong-Pang Pun. 2006. An efficient MFCC extraction method in speech recognition. In Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on. IEEE, 4 pp.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.

Jinkyu Lee and Ivan Tashev. 2015. High-level feature representation using recurrent neural network for speech emotion recognition. In INTERSPEECH, pages 1537–1540.

Gelareh Mohammadi and Alessandro Vinciarelli. 2012. Automatic personality perception: Prediction of trait attribution based on prosodic features. IEEE Transactions on Affective Computing 3(3):273–284.

Yixiong Pan, Peipei Shen, and Liping Shen. 2012. Speech emotion recognition using support vector machine. International Journal of Smart Home 6(2):101–108.

Tim Polzehl, Sebastian Moller, and Florian Metze. 2010. Automatically assessing personality from speech. In Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on. IEEE, pages 134–140.

Björn Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee K. Burgoon, Alice Baird, Aaron Elkins, Yue Zhang, Eduardo Coutinho, and Keelan Evanini. 2016. The INTERSPEECH 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Proceedings of INTERSPEECH.