Start-up valuation in Switzerland: analysis and methods

Master Thesis

Candidate: Silvia Lama Supervisor: Prof. Dr. Didier Sornette

ETH Zürich Department of Management, Technology and Economics (D-MTEC) Chair of Entrepreneurial Risks June 2019 – November 2019


ABSTRACT

The aim of this master thesis is to provide an overview of start-up valuations in Switzerland. The first part analyses funding rounds closed in Switzerland between 2010 and 2019: the existence of patterns and trends is investigated, visualized, and commented on. The second part selects the best model to estimate a range of pre-money valuations for a target start-up, as a fair benchmark that investors and co-founders could use as a starting point in their investment negotiation process. Indeed, traditional valuation methods1 cannot be applied to start-ups, due to their uncertainty, their short history, and the absence of publicly available data on financials, comparable companies, or transactions. As a consequence, new valuation methods have emerged; in the conclusion chapter, they are compared to our approach, stressing their lack of objectivity, contextuality, accuracy, and precision. Finally, several avenues for further research are recommended.

1 E.g. the Discounted Cash Flow method and Valuation Multiples.

ACKNOWLEDGEMENTS

I would like to express all my gratitude to Professor Didier Sornette, for the opportunity to conduct my Master Thesis at the Chair of Entrepreneurial Risks and for his guidance, and to Dr. Spencer Wheatley for the useful advice.

Besides, I would like to sincerely thank Steffen Wagner and Michael Blank, for the trust demonstrated by choosing me to pursue this delicate and extremely interesting research project at Investiere.

My warmest thank goes to Dr. Matteo Farnè, for the valuable support and interest in my work, and for being a reliable point of reference in my life.

I also wish to heartily thank my mentor and angel investor, Professor Silvio Marenco, who has always believed in me, for all the care and time he has invested in my professional growth. Above all, during these years he taught me the value of respect and of building trustful relationships.

If I achieved this goal, it is also thanks to the precious advice of my mentor Andrea Girardello. I feel truly grateful for all his attention and effort, which allowed me to avoid many mistakes and inspired my winding path as a student-entrepreneur. He taught me to never give up, and that it is always possible to find a smarter way to face challenges by thinking outside the box.

This journey would not have been so special and unforgettable without the fantastic company of my dearest friends, and of my lovely flatmates, bringing sparkling colours to every day of my life.

Finally, I am enormously grateful to my family, who make me feel the luckiest person on Earth, by supporting all my passions and activities, and by loving me as I am.

TABLE OF CONTENTS

Abstract
Acknowledgements
1 Introduction
1.1 Motivation and overview
1.2 Research questions
2 Start-up valuation methods
2.1 Overview
2.2 Scorecard method
2.3 Berkus model
2.4 Venture Capital Method
3 Data Collection
3.1 Sources of data
3.2 Process
3.3 Description and pre-processing of the data set
3.4 Log Transformation
4 Multivariate Data Analysis
4.1 Treatment of missing data
4.2 Correlation analysis between continuous variables
4.2.1 Methodology
4.2.2 lThrough_Investiere analysis
4.2.3 Employees analysis
4.2.4 Analysis of the entire data set
4.3 Correlation analysis between categorical variables
4.3.1 Pooling levels together
4.3.2 Methodology
4.3.3 Results
4.4 Correlation analysis between continuous and categorical variables
4.4.1 Methodology
4.4.2 Results
5 Predicting the future success of a Swiss start-up
5.1 Overview
5.2 Methodology
5.3 Results
6 Multiple Regression
6.1 Purpose
6.2 Methodology
6.2.1 Overview
6.2.2 Steps
6.3 Data pre-processing
6.4 Second manual variable selection (from 20 to 8 independent variables)
6.5 Best models comparison
6.6 Best selected model
6.6.1 Confidence and prediction intervals
6.7 MLR BLUE assumptions check
6.7.1 Outlier detection
6.7.2 Check MLR assumptions
7 Conclusions
8 References
9 Appendix: Model specification and selection
9.1 Automated models
9.1.1
9.1.2 Best Subset Selection
9.2 Third manual variable selection (from 8 to 5 predictors)
9.2.1 Basic.model
9.2.2 Results
9.3 Interaction Terms
9.3.1 Stepwise regression
9.3.2 Best Subset Selection
9.4 Further methods used to select the best model


1 INTRODUCTION

1.1 MOTIVATION AND OVERVIEW

My personal passion for entrepreneurship is a long and most probably never-ending journey that started in September 2012. That day, instead of attending classes at high school, my best friend (later my start-up partner) and I attended a cycle of lectures on student entrepreneurship, organized by the University of Bologna. That day, I understood that my way was to be an entrepreneur, and I set the goal of founding my own company at around 30 years old. But the occasion presented itself much earlier: at 21 I grasped it and founded my first start-up, Musa, in the education-technology industry.

When the need for a second investment round was approaching, I faced a big question mark: what is the value of our company? It is not yet profitable, it has very low revenue, and its risk is hard to quantify; as a result, none of the traditional financial methods to value companies is of any help. By interacting with founders and with private and institutional investors, I learned that the lack of an objective method to value early-stage start-ups is a common, worldwide problem.

Indeed, nowadays the pre-money valuation of a start-up is nothing but the result of a negotiation process between the investor and the founders. It requires on average 7-9 months2, and enormous effort from both parties (Clarysse and Kiefer, 2011). Often, the result is determined mainly by the negotiation power of each actor (e.g. the number of interested investors and their experience, or the start-up's urgency for money), rather than by the value of the underlying risky business. As in a poker match, each player withholds information and tries to convince the opponents that their hand is better than it actually is. But, unlike in poker, the participants in investment negotiations should communicate complete information and work together toward the shared goal of growing a successful business. The valuation, in fact, is only one part of the investment process, and it often leads to controversies that get the founder-investor relationship off on the wrong foot (Villalobos, 2007).

2 Average time elapsed in Switzerland between the business plan submission to a Venture Capital firm and the actual investment.

When I approached the Swiss VC Investiere, we found common ground in investigating this subject which, under the supervision of Prof. Dr. Sornette (Chair of Entrepreneurial Risks), became the topic of the present Master thesis.

This chapter proceeds with the presentation of the specific research questions that the project wants to address, while in Chapter 2, we will review the literature about start-up valuation methods.

Chapter 3 is dedicated to the description of the data set used in our research and its collection process, while its multivariate analysis is visualized and commented on in Chapter 4.

Further chapters, instead, have the ambitious aim of building and selecting models that, given the data of a specific start-up as input, predict its future success with the minimum error rate (Chapter 5), and estimate a benchmark for its pre-money valuation with the highest possible accuracy and precision (Chapter 6).

Keeping in mind that “all models are wrong, but some are useful”, we summarize our main findings in Chapter 7, compare them to the literature, and finally suggest possible trajectories for future research on the topic.

1.2 RESEARCH QUESTIONS

The goal of this research is to investigate the pre-money valuations achieved by Swiss start-ups at their investment rounds between 2010 and 2019. In particular, in the following chapters we will answer, among others, the following questions:

− Are there differences in start-up valuations between industries? (Chapter 4.4.2.2)
− Does the type of lead investor involved in the round have an influence on the valuation? (Chapter 4.4.2.4)
− Is there a correlation between the size of the round and the total funding previously raised? And with the pre-money valuation? (Chapter 4.2.3)
− Can we predict the future success of a start-up based on its current status? (Chapter 5)
− What is the best model to estimate the pre-money valuation of a start-up? (Chapter 6)

2 START-UP VALUATION METHODS

2.1 OVERVIEW

Traditional valuation methods for companies are usually based on the forecasted revenue and profit that an organization is expected to make. When it comes to early-stage start-ups, these classical financial formulae fail miserably: such firms are not yet profitable and have very low or zero revenue, so their valuation is inevitably determined by other factors.

In the following paragraphs, we briefly describe the main available methods for valuing early-stage start-ups. These approaches are vague and leave ample room for interpretation. Behrmann (2016) showed how different the valuation outcomes for the same firm can be when applying them, and demonstrated that the same valuation method, when used on different firms, may understate as well as overstate their market values. Finally, he stresses that a valuation obtained through these methods can only be as good as its assumptions: a change in just one number can dramatically change the results (Behrmann, 2016).

2.2 SCORECARD METHOD

The Scorecard method, also known as the Bill Payne valuation method, compares pre-money, pre-revenue start-ups to average valuations, and then adjusts them according to certain metrics. Following Payne (2011), it is first necessary to survey the pre-money, pre-revenue valuations set by venture capitalists or private angels for start-ups in the industry and region of the target company. Next, the start-up is compared qualitatively to the comparable start-ups in the valuation survey, according to the following categories and weights (Table 2.1):

Strength of the Management Team: 0-30%
Size of the Opportunity: 0-25%
Product/Technology: 0-15%
Competitive Environment: 0-10%
Marketing/Sales Channels/Partnerships: 0-10%
Need for Additional Investment: 0-5%
Other: 0-5%

Table 2.1: Scorecard method: valuation categories with corresponding weights.

When the actual assessment is performed, the start-up is compared to the average of the surveyed start-ups. The full 30% for the team would be awarded to a team that is average with respect to the comparable companies; when the valuation subject has a far better than average team, e.g. 150% of the average, the resulting factor would be 0.45. If, in one category, the start-up underperforms the peer group, less than the full amount is noted. In the end, the sum of all factors is multiplied by the average valuation obtained from the valuation survey. If, for example, we have a total factor of 1.2 (an above-average venture) and a mean valuation on the market of €1.6 million, the target would be valued at €1.92 million pre-money. Table 2.2 showcases a complete scorecard valuation of a start-up.

Example of motivations behind the choice of the % of Norm:
− A few co-founders, Advisory Board not yet established
− The market is there and it is growing
− The concept is nailed down, Minimum Viable Product is in development
− Competition definitely exists, but the company has a business model supposed to be disruptive
− No sales yet, partnerships are in place for distribution
− In need of funds to finish development, launch, test, etc.
− Tested the market, have positive feedback

Table 2.2: Exemplary assessment of a start-up using the Scorecard method (from Gunn, 2016)
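The weighting-and-scaling arithmetic of the Scorecard method can be sketched in a few lines of Python (an illustrative helper, not code from the thesis; the uniform 120% scores below are invented for the example):

```python
# Illustrative sketch of the Scorecard method's arithmetic (not code from the
# thesis). Each category weight is multiplied by the target's performance
# relative to the average comparable start-up; the summed factor then scales
# the average pre-money valuation from the survey.

WEIGHTS = {  # category weights from Table 2.1 (upper bounds)
    "management_team": 0.30,
    "opportunity_size": 0.25,
    "product_technology": 0.15,
    "competitive_environment": 0.10,
    "marketing_sales_partnerships": 0.10,
    "additional_investment_need": 0.05,
    "other": 0.05,
}

def scorecard_valuation(avg_valuation, relative_scores):
    """relative_scores maps category -> performance vs. the peer average
    (1.0 = exactly average, 1.5 = 150% of average)."""
    total_factor = sum(WEIGHTS[c] * s for c, s in relative_scores.items())
    return total_factor * avg_valuation

# Example from the text: a total factor of 1.2 applied to a EUR 1.6M average
# market valuation yields EUR 1.92M pre-money.
scores = {c: 1.2 for c in WEIGHTS}  # hypothetical: 120% of average everywhere
print(scorecard_valuation(1_600_000, scores))
```

Note that a start-up scoring exactly average in every category receives a factor of 1.0 and is thus valued at the survey average.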

2.3 BERKUS MODEL

The Berkus model was developed and proposed by the angel investor Dave Berkus (2009) to evaluate very early-stage companies with zero or very low revenue, but with the potential of reaching over $20 million in revenues within five years. According to him, “the universal truth is that fewer than one in a thousand start-ups meet or exceed their projected revenues in the periods planned” (Berkus, 2009). Therefore, his method to establish an initial pre-money valuation does not take financials into account.

If it exists, add to company value up to:
1. Sound Idea (basic value, product risk): USD 0.5m
2. Prototype (reducing technology risk): USD 0.5m
3. Quality Management Team (reducing execution risk): USD 0.5m
4. Strategic Relationships (reducing market risk and competitive risk): USD 0.5m
5. Product Rollout or Sales (reducing financial or production risk): USD 0.5m

Table 2.3: The Berkus Model: valuation dimensions

Berkus's proposition is to add up to half a million USD for each dimension in Table 2.3, depending on the degree to which the start-up fulfils it. Once the company starts to generate revenues, Berkus states, this method loses credibility, and most investors will use the actual revenues to project value over time.
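The additive logic of the model can be sketched similarly (an illustrative Python helper; the fulfilment fractions below are invented assumptions, not data from the thesis):

```python
# Illustrative sketch of the Berkus model (not code from the thesis): each of
# the five dimensions of Table 2.3 adds up to USD 0.5M, depending on the
# degree to which the start-up fulfils it; the total is thus capped at USD 2.5M.
BERKUS_CAP = 500_000  # maximum value added per dimension, in USD

def berkus_valuation(fulfilment):
    """fulfilment maps dimension -> degree of fulfilment, clamped to [0, 1]."""
    return sum(BERKUS_CAP * min(max(f, 0.0), 1.0) for f in fulfilment.values())

# Hypothetical venture: the fulfilment degrees below are invented assumptions.
venture = {
    "sound_idea": 1.0,               # product risk addressed
    "prototype": 0.8,                # technology risk partly reduced
    "management_team": 1.0,          # execution risk addressed
    "strategic_relationships": 0.5,  # market/competitive risk partly reduced
    "rollout_or_sales": 0.0,         # no sales yet
}
print(berkus_valuation(venture))  # roughly USD 1.65M pre-money
```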

2.4 VENTURE CAPITAL METHOD

The third most common method for start-up valuation is the Venture Capital Method. It was first conceived by Sahlman and Scherlis (1989), then revised by Sahlman in his 2009 Harvard Business School case study, and also thoroughly described by Engel (2002). The procedure starts by estimating the terminal value (TV) of the company some years from now, when the exit is planned: for that year, revenues are estimated and translated into a TV by multiplying them by the P/E ratios or sales multiples of similar companies in the industry. As an example, a venture has estimated revenues of €15M in five years (t), with similar businesses having a sales multiple of two. This leads to a TV of approximately €30M in five years. This value is then discounted to the present day, with the discount rate (r) estimated by the VC, usually the required internal rate of return (IRR) or, more generally, the target rate of return (Damodaran, 2007). Let us say that, as this is a rather risky business, r is 60%. This translates to a present value of PV = 30M / (1 + 0.6)^5 ≈ 2.86M. In Table 2.4 we show the summary of the steps, adapted from Engel (2002):

Step 1: Estimating terminal value: TV = P/E * Earnings, or TV = (Sales multiple) * Sales
Step 2: Determining present value: PV = TV / (1 + r)^t
Step 3: Calculating demanded ownership fraction: F = (Round size) / PV

Table 2.4: The Venture Capital Method: summary of steps (Engel, 2002).
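The three steps of Table 2.4 can be reproduced with the worked example from the text (a Python sketch; the €1M round size used in Step 3 is an invented assumption for illustration):

```python
# Sketch of the three steps of Table 2.4 (Engel, 2002), using the worked
# example from the text: EUR 15M estimated revenues in t = 5 years, a sales
# multiple of 2, and a discount rate r of 60%.
def terminal_value(sales, sales_multiple):
    return sales * sales_multiple      # Step 1: TV = (sales multiple) * sales

def present_value(tv, r, t):
    return tv / (1 + r) ** t           # Step 2: PV = TV / (1 + r)^t

def ownership_fraction(round_size, pv):
    return round_size / pv             # Step 3: F = round size / PV

tv = terminal_value(15_000_000, 2)     # EUR 30M
pv = present_value(tv, r=0.60, t=5)    # ~EUR 2.86M, as in the text
f = ownership_fraction(1_000_000, pv)  # a EUR 1M round would demand ~35%
print(round(pv), round(f, 3))
```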

Engel (2002) also states that the pre-money valuation is calculated by subtracting the round size from the post-money valuation.

3 DATA COLLECTION

3.1 SOURCES OF DATA

This research project would not have been possible without the collaboration of the following organizations, which gave us direct or indirect access to their data sets on start-up investment rounds:

• Investiere | Verve Capital Partners AG: the Swiss Venture Capital firm Investiere played a key role, by raising the need to investigate Swiss start-up valuations and by providing the conditions to pursue the analysis.
• Dr. Hervé Lebret: he supported our research by sharing the data set behind his study “The Analysis of 500+ start-ups”, published on www.startup-book.com (Lebret, 2019). He is the Manager of Innogrants (EPFL) and a Senior Scientist in the field of high-tech entrepreneurship; his research concentrates on academic spin-offs, including those from Stanford University and Silicon Valley.
• Startupticker.ch: this organization shared with us all the data from their annual Venture Capital reports, from 2012 to 2018. Startupticker.ch is the main online news portal about young Swiss companies.
• Commercial Registries of the Swiss Confederation: through the Registries of commerce – Zefix online portal it is possible to access the legal acts of Swiss companies for some cantons of Switzerland. These comprise some details of funding rounds (e.g. post-money valuation, number of issued shares).
• Crunchbase: this platform provides company insights, and we extracted from it some data about the analysed start-ups.

3.2 PROCESS

Data collection has been the research task that required the most time and effort overall. The result is a unique collection of precious, extremely confidential, sensitive data regarding the details of start-ups’ investment rounds (306 samples overall, concerning 190 companies). Because this data is protected by non-disclosure agreements between investors and co-founders, all contents and results of the research will be shared anonymously.

A first bunch of samples has been provided by Investiere | Verve Capital Partners AG, related to the investment rounds in which it was directly involved as an investor. After that, data collection proceeded in two simultaneous directions:

• Search for new sources of data, by directly contacting all the main active organizations in the Swiss start-up ecosystem (e.g. incubators, accelerators, university technology transfer offices, facilitators, investors’ clubs).
• Search for data on specific start-ups to replace missing values (e.g. Swiss commercial registries, CB Insights, the start-ups themselves).

3.3 DESCRIPTION AND PRE-PROCESSING OF THE DATA SET

First of all, we pre-process the collected data to ensure integrity and coherence. For the purpose of this analysis, we decide to focus only on equity investment rounds closed by Swiss start-ups between 2010 and 2019. Therefore, we remove 12 samples concerning convertible investment rounds and 8 samples related to non-Swiss companies. We can thus exclude all variables regarding the details of the convertible rounds, plus the following variables: Round Type (as we consider only equity rounds) and Company name. In fact, this research could only be pursued anonymously, because of the signed NDAs protecting the data. We can also remove the variable Data_source, because samples have been randomly collected from different sources, and we can assume zero correlation between values and their original source.

After this preliminary selection of variables and samples, we deal with a data set comprising 286 observations and 16 variables.

In Table 3.2 we show an overview of this starting data set (a legend is provided in Table 3.1), while in Table 3.3 we provide an extended description of the variables.

Abbreviation: Meaning
Cat: Categorical
N: Nominal
D: Dichotomous
O: Ordinal
I: Interval
Num: Numeric
C: Continuous
Dis: Discrete

Table 3.1: Legend of Table 3.2

Variable’s name | Type | Short description | Nr. of groups | % NA’s
Foundation_Year | Num Dis | Company’s foundation year | / | 0.00
Round_name | Cat N | Harmonized official round name | 9 | 0.00
Industry | Cat N | Company’s industry | 9 | 0.00
Stage | Cat O | Company’s development stage | 7 | 61.54
Pre_valuation | Num C | Pre-money valuation | / | 0.00
Prev_raised | Num C | Tot. funding previously raised | / | 0.00
Amount_raised | Num C | Size of the investment round | / | 0.00
Through_Investiere | Num C | Amount invested by Investiere | / | 54.89
Type_Lead_Investor | Cat N | Type of the main investor | 4 | 4.19
Profitable | Cat D | Is the company profitable? | 2 | 62.24
Revenue | Cat O, I | Last 12 months’ revenues | 6 | 59.79
Closing_Year | Num Dis | Year of round’s closure | (10) | 1.05
Still_operating | Cat N | Company’s present status | 3 | 0.00
Employees | Num Dis | Nr. of employees | / | 82.87
Currency | Cat N | Funding’s currency | 1 | 0.00
Location | Cat N | Company’s legal location | 1 | 0.00

Table 3.2: Data set overview

Foundation_Year: year in which the company has been officially incorporated. Numerical, discrete.

Round_name: harmonized name of the round used in the official company documentation. Categorical, nominal, 9 groups:
− Pre-seed
− Seed Round
− Series A Round
− Series B Round
− Series C Round
− Series D Round
− Series E Round
− Pre-Exit
− IPO

Industry: area of business according to the Swiss Venture Capital Report. Categorical, nominal, 9 groups:
− Biotech
− Cleantech
− Consumer Products
− Fintech
− Healthcare
− ICT
− Medtech
− Micro / Nano
− Other

Stage: indicates the development stage of the company at the time of closing. Categorical, ordinal, 7 groups:
− Idea (0 samples)
− Prototyping
− Beta-Phase
− Clinical Trials
− First Clients
− Growth
− Internationalisation

Pre_valuation: pre-money valuation, i.e. the value of a company just before that specific round of financing. When summed with Amount_raised, it gives the post-money valuation (Frei and Leleux, 2004). Unit of measure is CHF. Numerical, continuous.

Prev_raised: the sum of all funds the start-up raised from incorporation until the moment just before closing that specific investment round. Unit of measure is CHF. Numerical, continuous.

Through_investiere: the VC Investiere’s tranche of the respective financing round. Unit of measure is CHF. Numerical, continuous.

Type_Lead_Investor (TLI): classification of the lead investor of the investment round. Categorical, nominal, 4 groups:
− Accelerator/Incubator: accelerators and incubators are organizations helping start-ups attain success. Incubators usually offer dedicated office and development space to start-ups for a set period of time, and a first grant or funding round to allow the start-up’s incorporation and the beginning of its activities. Start-up accelerators tend to focus on providing mentorship and resources to help the start-ups succeed, but usually do not offer dedicated office space. Accelerators and incubators usually get involved at an early stage. Some of them focus on a specific industry, market, or technology, whereas others are generalists. Start-ups are usually admitted in batches, after a screening process (Isabelle, 2013).
− Private Angel: an angel investor (also known as a business angel, informal investor, angel funder, private investor, or seed investor) is an affluent individual who provides capital, advice and contacts to a start-up, usually in exchange for convertible debt or ownership equity. Unlike venture capitalists, they usually play an indirect role as advisors in the operations of the investee firm (Wong, Bhatia, and Freeman, 2009).
− Institutional Financial: an institutional investor is an organization that invests on behalf of its members. A financial investor invests in a business merely to maximize its financial returns over a specified period of time. These investors often take board seats and add value by introducing co-founders to a larger network, or by helping with strategy, hiring, financials and industry insights (Arping and Falconieri, 2009).
− Institutional Strategic: strategic institutional investors are not only looking for a return on their capital, but also for a ‘strategic’ scope: access to technology/assets, or to a new market or target segment. They are more patient to see returns on their investments than financial investors (Arping and Falconieri, 2009).

Profitable: answers the question “Is the company profitable at the time of the investment round’s closure (i.e. is it in the condition of yielding a financial profit or gain)?”. Categorical, dichotomous:
− Yes: the company is profitable
− No: the company is not profitable

Revenue: the income generated in the 12 months before the funding round, from the sale of goods or services, or any other use of capital or assets, associated with the main operations of the organization, before any costs or expenses are deducted. Also called sales or (in the UK) turnover. Categorical, ordinal, interval, 6 groups:
− 0 - 50k
− 50k - 100k
− 100k - 500k
− 500k - 1M
− 1M - 5M
− >5M

Closing_Year: year of closing of the investment round. We created two identical variables for this data: one numerical discrete, the other categorical ordinal with 10 groups, from 2010 to 2019 (we will later decide which variable is the most useful for our analysis).

Still_operating: indicates the current3 status of the company. Categorical, 3 groups:
− Yes: the company is still operating (i.e. an active company)
− No: the company has been liquidated
− Exit: the company has been acquired by another company

Employees: number of employees of the start-up at the time of the funding round. Numerical, discrete.

3 Last update: Oct 2019.

Location: country of the start-up’s registered office. Categorical, nominal, 1 group:
− Switzerland

Currency: primary currency of the financing round. Categorical, nominal, 1 group:
− CHF

Table 3.3: Extended variables description

3.4 LOG TRANSFORMATION

By analysing the distributions of the continuous variables (Pre_valuation, Amount_raised, Prev_raised, and Through_Investiere), we can state that they are all far from normality, which restricts the application of statistical methods that strictly assume normal distributions (e.g. Pearson’s correlation, ANOVA). Given the shape of their distributions, the best transformation we can apply is the natural logarithm. In the following graphs (Figures 3.1 - 3.2) we report the significant improvement achieved thanks to this transformation4. Applying it a second or third time (i.e. iterating the log, as in log(log(log(Pre_valuation)))), however, yields no significant additional improvement.

If we now test the normality of these transformed variables, for example with the Shapiro test, we are still forced to reject the null hypothesis of normality. Indeed, the graphs in Figure 3.1 show very long tails in the variables’ distributions and some degree of skewness. These tails could simply be outlier cases: if we detect outliers with the R function aq.plot, 38.81% of samples are flagged as outliers. Such a large proportion, of course, does not allow us to remove them now. However, this is not an issue: linear regression analysis does not assume normality for either predictors or outcome; the main role is instead played by the distribution of residuals. (The distribution of residuals, together with outlier detection, will be examined for specific models during the regression analysis, paragraph 6.7.2.)

4 As log(0) is undefined, we add 0.1 to all zero values in these continuous variables before applying the log transformation.
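The shift-then-log convention of footnote 4 can be sketched as follows (a Python illustration on simulated, not thesis, data; the plain sample-skewness helper simply quantifies how asymmetric the raw data is):

```python
# Illustration of the shift-then-log convention: zeros are replaced by 0.1
# before taking the natural log, since log(0) is undefined. The data below are
# simulated heavy-tailed values, not the thesis data set.
import math
import random

random.seed(0)
# fake pre-money valuations in CHF, plus a few zero entries
pre_valuation = [random.lognormvariate(15, 1.5) for _ in range(280)] + [0.0] * 6

shifted = [v if v > 0 else 0.1 for v in pre_valuation]
l_pre_valuation = [math.log(v) for v in shifted]

def skewness(xs):
    """Plain (biased) sample skewness: third central moment over sigma^3."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / (n * s2 ** 1.5)

print(f"skew raw: {skewness(pre_valuation):.1f}  "
      f"skew log: {skewness(l_pre_valuation):.1f}")
```

The raw data is strongly right-skewed, while the log-transformed data is far closer to symmetric, apart from the spike at log(0.1) produced by the shifted zeros.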



Figure 3.1: On the left graphs we use log-log axis, base 10: (top) adaptive Kernel density estimation of pre-money valuation of a company at each round (Pre_valuation); (bottom) adaptive Kernel density estimation of funds raised before a round (Prev_raised). On the right, top and bottom: Kernel density estimation of the same variables, after their natural log transformation. Red lines indicate the median.


Figure 3.2: On the left graphs we use log-log axes, base 10: (top) adaptive Kernel density estimation of the amount raised at each round, in CHF (Amount_raised); (bottom) Kernel density estimation of the amount invested by the VC Investiere, in CHF (Through_investiere) at each round. On the right, top and bottom: Kernel density estimation of the same variables after their natural log transformation. Red lines indicate the median.

4 MULTIVARIATE DATA ANALYSIS

4.1 TREATMENT OF MISSING DATA

During this research project, we spent most of our time and effort on the data collection process (paragraph 3.2). This phase did not aim only at collecting as many samples as possible, but also at substituting missing data with the true values. After this long, time-consuming process, we list the variables sorted by decreasing fraction of missing data:

Variable | Fraction of NA’s
Employees | 0.83
Profitable | 0.62
Stage | 0.62
Revenue | 0.60
Through_investiere (CHF) | 0.55
Type_Lead_Investor | 0.04
Closing_Year | 0.01
Foundation_Year | 0.00
Industry | 0.00
Round_name | 0.00
Still_operating | 0.00
Amount_raised | 0.00
Pre_valuation | 0.00
Prev_raised | 0.00
Country | 0.00
Currency | 0.00

We see that only 8 of the 20 variables (16 original plus 4 log-transformed) still contain missing values. For the purpose of our research, imputation of missing data would be misleading and useless, due to the low ratio of available samples per variable. Therefore, we decide to keep all NA’s and all samples, in order to avoid loss or distortion of information.

4.2 CORRELATION ANALYSIS BETWEEN CONTINUOUS VARIABLES

4.2.1 Methodology

In order to analyse the correlations existing between the continuous variables in the data set, we will adopt the following statistical tools:

A. Correlation matrix
B. Scatterplot
C. Boxplot

Correlation matrices will be calculated with Kendall’s Tau method5. We generally consider a correlation between two variables as “low” if its absolute value is lower than 0.3, “moderate” if it lies between 0.3 and 0.7, and “strong” if higher than 0.7.
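For illustration, Kendall’s tau and the low/moderate/strong labelling can be sketched as follows (a Python sketch of the simple tau-a variant, with no tie correction, unlike the tau-b most statistics packages report):

```python
# Minimal sketch of Kendall's tau (tau-a: no tie correction) and of the
# low/moderate/strong labelling used in the text. Being rank-based, it needs
# no normality assumption, unlike Pearson's correlation.
from itertools import combinations

def kendall_tau(x, y):
    concordant = discordant = 0
    # compare every pair of observations: do x and y move the same way?
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

def strength(tau):
    a = abs(tau)
    return "low" if a < 0.3 else "moderate" if a <= 0.7 else "strong"

x = [1, 2, 3, 4, 5, 6]
y = [1, 3, 2, 4, 6, 5]  # mostly increasing together, two swapped pairs
tau = kendall_tau(x, y)
print(round(tau, 3), strength(tau))  # 0.733 strong
```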

Scatterplots graphically show the linear fit of each pair of variables. The regression line and correlation coefficients allow us to distinguish which pairs of variables show an interesting, significant correlation and which do not. Scatterplot analysis is useful in preparation for the Multiple Regression Analysis (Chapter 6): MLR, in fact, requires the relationships between the independent and dependent variables to be linear, and this linearity assumption is best tested and visualized through scatterplots.

Boxplots are useful to easily identify outliers and to schematically visualize the distribution of each variable. The values within approximately ±2.698 sigma (for a normally distributed variable) stay within the min and max border lines of each boxplot:

MIN = max[MIN, Q1 - 1.5*(Q3 - Q1)]
MAX = min[MAX, Q3 + 1.5*(Q3 - Q1)]

The remaining extreme values are identified as outliers and represented in the boxplot beyond the whiskers (we are neither interested in nor allowed to identify which companies they correspond to).
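The whisker rule above can be sketched as follows (a Python illustration; the midpoint-interpolation quartile helper is one of several common conventions and may differ from R’s default):

```python
# Sketch of the boxplot whisker rule: whiskers end at the most extreme
# observations still inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; everything beyond
# is drawn as an outlier point.
def quartiles(xs):
    s = sorted(xs)
    n = len(s)
    def q(p):
        k = p * (n - 1)          # fractional position in the sorted sample
        lo = int(k)
        hi = min(lo + 1, n - 1)
        return s[lo] + (k - lo) * (s[hi] - s[lo])
    return q(0.25), q(0.75)

def whiskers_and_outliers(xs):
    q1, q3 = quartiles(xs)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return min(inside), max(inside), outliers

data = [2, 3, 3, 4, 4, 5, 5, 6, 40]  # 40 lies far beyond the upper fence
print(whiskers_and_outliers(data))   # (2, 6, [40])
```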

As we have just seen in the previous paragraph, two continuous variables have a very high fraction of missing data, while the other four continuous variables have approximately 100% of their values available. Employees – the number of employees of the start-up at the time of a specific round – has 82.9% NA’s, while the transformed continuous variable lThrough_investiere – the amount invested in that specific financing round by the VC Investiere – has 55.1% missing values. Keeping these variables in the further analysis would therefore force us to neglect over 82% of our samples. Besides, studying the relationship between lThrough_investiere and the other variables is not meaningful for all Swiss investment rounds, but only for the ones in which Investiere was actually involved (96 samples out of 286, i.e. 33.6%). For these reasons, we now conduct a separate analysis for those two continuous variables, and we will then omit them from our data set so as to take into account all of the available rounds.

5 As we noticed in paragraph 3.4, none of the continuous variables follows a normal distribution, as proved via the Shapiro test. We therefore calculate correlation values through Kendall’s Tau, instead of Pearson’s correlation, which assumes normality.

4.2.2 lThrough_Investiere analysis

We now focus our analysis on the rounds in which we know whether and how much the VC Investiere contributed. In Figure 4.1, we represent the frequency distribution of the natural log of the amount invested by the VC Investiere, in CHF, at each round (lThrough_investiere), omitting all its NA values (55%). We clearly see two relative maxima in this distribution, where the lower one corresponds to rounds in which Investiere did not invest. In Figure 4.2, instead, we consider only the rounds in which Investiere invested (96 out of 129).

Figure 4.1: On the x-axis: the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere) at each round. On the top of the figure, a boxplot representation of this variable. On the y-axis: the Kernel density estimation (higher density for higher probability of seeing a point at that location). N is the number of samples, and Bandwidth is the parameter controlling the smoothness of the curve (higher values make smoother curves), and it equals the standard deviation of the kernel used. The red line indicates the median.
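The density curves in these figures can be reproduced with a Gaussian kernel whose standard deviation equals the bandwidth, as the caption notes. A minimal sketch of this estimator (a helper of our own, not the plotting code actually used):

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density at x: the average of
    Gaussian bumps of standard deviation `bandwidth` centred on the samples."""
    n = len(samples)
    norm = bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        ) / (n * norm)
    return density

# Hypothetical log-amounts: the estimated density is higher near the
# cluster of samples than far away from it.
f = gaussian_kde([13.1, 13.4, 14.0, 14.2], bandwidth=0.3)
print(f(13.7) > f(16.0))  # True
```

A larger bandwidth widens each bump and therefore smooths the curve, which is exactly the behaviour described in the caption.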

When measuring the relationships among continuous variables with the Kendall's Tau method, the correlations involving lThrough_investiere are not significant or very low. This changes if we calculate correlations by considering only the rounds in which Investiere actually invested (96 samples). The corresponding correlation matrix is in Figure 4.3, where non-significant values (significance level 0.05) are hidden by black crosses.

Figure 4.2: On the x-axis: the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere). Here we only consider rounds in which Investiere actually invested. On the top of the figure, a boxplot representation of this variable. On the y-axis: the Kernel density estimation (higher density for higher probability of seeing a point at that location). Bandwidth is the parameter controlling the smoothness of the curve (higher values make smoother curves), and it equals the standard deviation of the kernel used. The red line indicates the median.

Except for Foundation_year, all the other variables now have a significant, positive correlation with lThrough_investiere. We analyse them in more detail with the following scatterplots (considering only the rounds in which we know that Investiere invested money).

In Figure 4.4, the correlation coefficient is r=0.403 (moderate) and significant (p << 0.05). The trend is visible but noisy. The explanation of this correlation lies in the moderate correlation existing between lAmount_raised and lPre_valuation. In fact, lThrough_investiere is a value belonging to the interval [0; lAmount_raised], and its relationship with lAmount_raised is represented in Figure 4.5. It is a matter of fact that the companies in which Investiere invested more also raised more money. So, the real reason behind the previous trend and correlation (Figure 4.4) is that lThrough_investiere is moderately correlated with lAmount_raised, which in turn is moderately-to-strongly correlated with lPre_valuation, as shown in the matrix (R=0.57, Figure 4.3).

Figure 4.3: Correlation matrix summarizing all correlations among continuous variables. Only rounds in which the VC Investiere participated are considered. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (significance level threshold is p-value= 0.05).

Figure 4.4: Scatterplot of the natural log of Pre-money valuation in CHF (lPre_valuation) and the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere), considering only rounds in which the VC Investiere participated. R is the Kendall's Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

We now want to analyse the relationship between lThrough_investiere and lPrev_raised (Figure 4.6). Also in this case, the trend is disturbed by the significant proportion of samples having zero money previously raised. The correlation remains low even if we exclude these samples (Figure 4.7).

Figure 4.5: Scatterplot of the log of the size of the round (lAmount_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC Investiere participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

Figure 4.6: Scatterplot of the log of the funds previously raised (lPrev_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC participated. R is the Kendall's Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

It is interesting to notice that in 57.29% of the cases in which Investiere invested, it was the first investment received by that specific start-up. This is in accordance with what we read on Investiere's website, F.A.Q. page (https://www.investiere.ch/startup-vc-investment/):

"When do you invest? We invest in early stage as well as growth stage rounds. A pitch deck or idea without validation is not sufficient. The right timing for a funding round can vary depending on the industry or other factors but generally being able to show market traction, proof of technology and a complete and well-functioning core team are decisive factors."

So, there is no doubt that, for Investiere, having zero money previously raised is not a limitation to its investment commitment. The graph also tells us that, if the start-up has already raised money in the past, the amount then invested by Investiere tends to increase slightly, with a significant correlation coefficient of 0.27.

Figure 4.7: Scatterplot of the log of the funds previously raised (lPrev_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC participated and the start-up had already raised funds in the past. R is the Kendall's Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

We will not show the relationship between lThrough_investiere and Foundation_Year, as it is low and not significant. Nevertheless, we know that Investiere invested only in companies founded in the last 15 years, except for one outlier. The histogram (Figure 4.8) represents the number of investments made by Investiere, by the year in which the start-up was founded. No particular trend can be observed; the distribution instead roughly recalls a normal one.

We now plot lThrough_investiere against the Closing Year of the round (Figure 4.9):


Figure 4.8: Histogram showing the number of samples sharing the same Foundation Year, by only considering rounds in which the VC Investiere invested.

Figure 4.9: Scatterplot of Closing Year and the amount invested by the VC (lThrough_investiere), considering only rounds in which Investiere participated. R is the Kendall's Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

In this case, a trend is evident, with a moderate correlation of 0.44. So we can state that, over the years, Investiere is on average investing more in each deal. From the histogram (Figure 4.10), we can add that the number of deals is also increasing over time (2019 underestimates the real value, because the year is still ongoing).

Figure 4.10: Histogram showing the number of samples sharing the same Closing Year, by only considering rounds in which the VC Investiere invested.

4.2.3 Employees analysis

We now proceed with the variable Employees – indicating the number of people employed in the company at the time of a specific round – in the same way we did for lThrough_investiere. This time, as 81% of the Employees values are NA, we only take into account 48 complete samples of our data set. Figure 4.11 shows its distribution (we removed four extreme outliers with more than 100 employees). It has a very long right tail, making the mean higher than the median and the mode. In the next Figure 4.12 we show the correlation matrix, considering all 48 available complete samples.

We plot all the moderate correlations between Employees and the other variables in Figures 4.13 – 4.16. The strongest correlation is between Employees and lPre_valuation. This suggests that, with more data available, Employees would be a relevant and significant predictor of the valuation of a start-up. Nevertheless, having more employees does not necessarily mean that the start-up has previously raised more funds (moderate-low correlation). We also observe a moderate correlation between Employees and lAmount_raised, and between Employees and Closing_year. There is no correlation, instead, between Employees and either Foundation_Year or lThrough_investiere.

Figure 4.11: Employees distribution (boxplot and probability density function; N=46, bandwidth = 5.778). The red line indicates the median.

Figure 4.12: Correlation matrix summarizing all correlations among continuous variables. Only complete samples are considered here. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (p-value above 0.05).

Figure 4.13: Scatterplot of the log of the amount raised in the round (lAmount_raised) and the number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

Figure 4.14: Scatterplot of the log of the funds previously raised (lPrev_raised) and the number of Employees, considering only rounds in which the number of Employees is known, and lPrev_raised is above zero. R is the Kendall's Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

Figure 4.15: Scatterplot of the log of Pre-money valuation (lPre_valuation) and the number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

Figure 4.16: Scatterplot of Closing Year of the round and number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

4.2.4 Analysis of the entire data set

After having separately analysed the impacts of lThrough_investiere and Employees on the other variables, there is no need to keep them in our further analysis. In fact, we want to consider all rounds in our data set, regardless of whether we have information about Investiere's participation or the number of employees. If we kept those variables, we would exclude 82.9% of the samples (the percentage of Employees' missing values). Including now all samples of our data set, we visualize the distribution of each continuous variable (Figures 4.17 – 4.21):

Figure 4.17: Distribution of the log of pre-money valuation (lPre_valuation; N=286, bandwidth = 0.3031), via boxplot and probability density function. The red line indicates the median.

Figure 4.18: Closing_Year distribution (boxplot and probability density function; N=283, bandwidth = 0.7059). The red line indicates the median.

lPre_valuation and lAmount_raised show many outliers beyond their MAX values6, creating the long tails in the distribution curves. The correlation matrix, including all complete samples for the selected continuous variables, and a summary overview of the relationships are in Figure 4.22. As the Shapiro test makes us reject the null hypothesis of normality, we continue using the Kendall's Tau method.

Figure 4.19: Distribution of the log of funds previously raised (lPrev_raised; N=286, bandwidth = 2.057), via boxplot and probability density function. The red line indicates the median.

Figure 4.20: Distribution of the log of the amount raised in the round (lAmount_raised; N=286, bandwidth = 0.295), via boxplot and probability density function. The red line indicates the median.

6 MAX=min[MAX, Q3+1.5*(Q3-Q1)]

All variables tend towards a normal distribution, except for Closing_Year (clear growing trend; 2019 is still in progress) and lPrev_raised (which has two relative maxima, because of the conspicuous number of zero values). In any case, normality of the variables is an assumption neither of Kendall's method nor of MLR.

Figure 4.21: Distribution of the Foundation Year of the samples (boxplot and probability density function; N=286, bandwidth = 1.084). The red line indicates the median.

Figure 4.22: Correlation matrix summarizing all correlations among continuous variables. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (p-value above 0.05). 4 Multivariate Data Analysis 37

The correlation between lAmount_raised and lPre_valuation is the strongest one among our continuous variables (0.59), and it is highly significant (Figure 4.23). This correlation is expected: otherwise, raising large investments would lead start-ups to enormous dilution, not sustainable for further growth. Nevertheless, if this factor is considered alone, it can be misleading for companies. In fact, it could lead a company to display a higher financial need in order to raise more money, and therefore obtain a higher valuation. This strategy is not recommendable, as it is likely to bring the start-up to over-dilution and lower credibility, if not properly justified. So, every company will have to carefully evaluate the combination of factors influencing its pre-money valuation (which will be revealed in full in Chapter 6), and carefully weigh the trade-off between the amount raised (and therefore the lPre_valuation obtained) and the consequent dilution.
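The trade-off can be made concrete with the standard round arithmetic: post-money valuation = pre-money valuation + amount raised, and the new investors' stake (i.e. the existing shareholders' dilution) is the amount raised over the post-money valuation. The figures below are hypothetical:

```python
def dilution(pre_valuation, amount_raised):
    """Fraction of the company sold to the new investors in the round."""
    post_valuation = pre_valuation + amount_raised
    return amount_raised / post_valuation

# Hypothetical round: raising CHF 1M at a CHF 4M pre-money valuation
# hands 20% of the company to the round's investors...
print(dilution(4_000_000, 1_000_000))  # 0.2
# ...while raising CHF 2M at the same pre-money costs a third of it.
print(round(dilution(4_000_000, 2_000_000), 3))  # 0.333
```

This is why inflating the financial need to obtain a higher valuation backfires: the extra amount raised directly increases the dilution unless the pre-money valuation grows proportionally.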

Figure 4.23: Scatterplot of the log of Pre-money valuation (lPre_valuation) and the log of the amount raised (lAmount_raised), considering the entire data set. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

It is followed by the correlation between Foundation Year and Closing Year (0.47): the youngest companies have the most recent investments, and it always holds that Foundation_Year <= Closing_Year. We find several outliers in this trend.

Between lPre_valuation and lPrev_raised (Figure 4.24) there is also a moderate, significant correlation (0.43). We now look at their trend considering only samples with lPrev_raised > 0 (the correlation grows to 0.5). lPrev_raised is certainly a main factor in determining the valuation of a company (we measure its impact in more detail in Chapter 6). The range of lPre_valuation for the rounds previously excluded (having lPrev_raised = 0), instead, is wide and includes many outliers.

Figure 4.24: Scatterplot of lPre_valuation and lPrev_raised, considering all samples with lPrev_raised above zero. R is the Kendall's Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.

We plot now the relationship between lAmount_raised and lPrev_raised (Figure 4.25), as both these variables are strongly correlated with lPre_valuation.


Figure 4.25: Scatterplot of the natural log of the funds raised by the company before a specific round (x-axis= lPrev_raised) and the natural log of the amount raised at that round (y-axis= lAmount_raised), by considering only rounds with lPrev_raised above zero. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. On the top of the figure, the equation of the plotted regression line is shown, where y=y-axis variable, and x=x-axis variable. The 95% confidence interval is displayed by the grey area.

Indeed, they show a moderate, significant correlation: the more a company previously raised, the more it is likely to raise. This is perfectly normal: if this is not the first financing round of the company, the company is most probably at a later development stage and has been able to bring to the table more proofs of product/service validation, reducing the risk for the investors. At the same time, a low lPrev_raised does not prevent a company from raising large investments. In particular, for companies with lPrev_raised = 0, lAmount_raised has the distribution shown in Figure 4.26. The volatility is extremely high and five outliers are visible.

Figure 4.26: Distribution of the log of the amount raised (lAmount_raised; N=174, bandwidth = 0.362) for samples with no funds previously raised, via boxplot and probability density function. The red line indicates the median.

All in all, lPrev_raised is an important factor in determining lAmount_raised, which in turn plays an even more relevant role in influencing lPre_valuation (highest existing correlation). As these three variables are mutually correlated, we add a three-variable bubble plot (Figure 4.27) to offer a final overview of their relationships (lPre_valuation on the y-axis, lAmount_raised on the x-axis, circle size representing lPrev_raised).

Finally, we found a highly significant but low positive correlation between Closing_year and lPre_valuation (0.16), while between Foundation Year and lPrev_raised there is a low, negative correlation: the older the company, the higher its lPrev_raised. This is also obvious and expected.

All the other pairwise relationships among variables can be neglected (extremely low correlations or non-significant).

Figure 4.27: Three-dimensional representation: the log of pre-money valuation (lPre_valuation) on the ordinate, the log of the amount raised (lAmount_raised) on the abscissa, and the log of funds previously raised (lPrev_raised) encoded through colour and point size.

4.3 CORRELATION ANALYSIS BETWEEN CATEGORICAL VARIABLES

In this section, we want to investigate the correlation existing between categorical variables.

4.3.1 Pooling levels together

In order to dive deeper into this analysis and to obtain significant results, we first pool appropriate levels together, making sure that, when comparing variables in pairs, each level contains at least 5 samples. The final distribution of each variable is represented in Figures 4.28 a) and b).


a) [Histograms of the categorical variables Round name (groups: Pre-seed, Seed, Series A, Series B, Series C, From Series D to IPO) and Industry (groups: ICT, Biotech, Medtech/Healthcare, Fintech, Micro/Nano, Cleantech, Consumer Products, Other), showing the number of samples per group.]

Figure 4.28: a) Overview of the categorical variables Round Name (top) and Industry (bottom). On the top of each histogram, the title refers to the name of the displayed variable. On the abscissa are written the names of the groups belonging to that categorical variable, while the ordinate indicates the absolute number of samples in the data set belonging to that specific group. The total number of samples for each categorical variable changes between variables, because of missing values (see paragraph 4.1).


b) [Histograms of the categorical variables Revenue (groups: 0-50k, 50k-100k, 100k-500k, 500k-1M, >1M), Still operating (groups: yes, no, exit), Type Lead Investor (groups: Acc/Inc/PA, Inst. Strategic, Inst. Financial), and Stage (groups: Prototyping, Beta/Clinical Trials, First Clients, Growth/International), showing the number of samples per group.]

Figure 4.28: b) Overview of the categorical variables Revenue (top-left), Still_operating (top-right), TLI (bottom-left), and Stage (bottom-right). On the top of each histogram, the title refers to the name of the displayed variable. On the abscissa are written the names of the groups belonging to that categorical variable, while the ordinate indicates the absolute number of samples in the data set belonging to that specific group. The total number of samples for each categorical variable changes between variables, because of missing values (see paragraph 4.1).

4.3.2 Methodology

After the pre-processing phase, we apply the following methods:

• Contingency Analysis (or Chi-square independence test)
• Cramer's V
• Contingency coefficient (or Pearson's coefficient)

The Contingency Analysis tests the null hypothesis that the two considered variables are mutually independent, i.e. that knowing one does not help us predict the value of the other. On the other hand, if the p-value is below the significance level (0.05), we reject the null hypothesis and conclude that there is a statistically significant relationship between the two categorical variables, i.e. that they are not independent. The test makes use of contingency tables, hence the name 'Contingency Analysis'.

In case we reject independence, Cramer's V and the Contingency coefficient provide measures of the correlation existing between two categorical variables. As for continuous variables, we consider a coefficient in the range [0, 0.3] as weak, in [0.3, 0.7] as moderate, and above 0.7 as strong.
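The three measures can be sketched in a few lines. This is an illustrative pure-Python version of our own; a statistics package would additionally return the chi-square p-value used for the independence decision.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and sample size for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2, n

def cramers_v(table):
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))  # min(number of rows, number of columns)
    return math.sqrt(chi2 / (n * (k - 1)))

def contingency_c(table):
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / (chi2 + n))

# A perfectly associated 2x2 table gives V = 1; independence would give 0.
print(cramers_v([[10, 0], [0, 10]]))  # 1.0
```

Note that, unlike V, the contingency coefficient C cannot reach 1 even under perfect association, which is why the two measures are reported side by side in Table 4.1.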

4.3.3 Results

Legend: V = Cramer's V; C = Contingency coefficient; "independent" = no significant correlation.

                  Round_name        Industry          Stage             Type_Investor     Revenue
Industry          independent
Stage             V=0.392 C=0.485   V=0.401 C=0.570
Type_Investor     V=0.266 C=0.352   V=0.262 C=0.348   independent
Revenue           V=0.323 C=0.416   V=0.358 C=0.625   V=0.526 C=0.674   independent
Still_operating   V=0.204 C=0.277   independent       V=0.359 C=0.453   V=0.242 C=0.324   V=0.289 C=0.378

Table 4.1: The Table shows the relationships existing between categorical variables. The cell "independent" means that no significant correlation has been revealed. In all the other cases, V and C indicate the Cramer's V coefficient and the Contingency coefficient, respectively.

Final results are summarized in Table 4.1: correlations are moderate or low, and the strongest one is between Revenue and Stage. That was already evident from the data, and it makes logical sense (e.g. there cannot be significant revenues at the Prototyping stage, while they are necessary to be in the Growth/Internationalisation stage). We can also confirm that Industry is independent from the Round name (all round names can apply to any industry). We underline that belonging to a particular Industry does not influence the success of the start-up (independence from Still_operating), but it does influence its revenue (see distribution, Figure 4.29).

Revenue groups:

Figure 4.29: Distribution of Revenue given the Industry. On the ordinate, the Industry (a bar for each Industry group). On the abscissa, how many samples (in percentage) of that Industry group have a certain interval of Revenue. Following the legend, each colour section of the bars corresponds to a certain Revenue group.

We were expecting, to a certain extent, a correlation between the Type of Lead Investor and the variables Stage and Revenue, but this is not confirmed by the numbers. So, we cannot say that a particular type of investor invests mainly in start-ups at a particular stage or with a particular range of revenues. Instead, all types of investors invest in a diversified portfolio of companies, as we will see in more detail in the next paragraph 1.5.2.4.

A moderate correlation is identified between Stage and Industry, but this is most likely a chance feature of our data set: of course, all Industries are populated by start-ups at all stages.

Finally, we might think that high revenues would be an important factor in determining the success of a start-up (Still_operating), but from our data set we can only state the existence of a low correlation. We will make a specific analysis to investigate the impact of different variables on the future success of Swiss start-ups in Chapter 5.

4.4 CORRELATION ANALYSIS BETWEEN CONTINUOUS AND CATEGORICAL VARIABLES

4.4.1 Methodology

There are several possible methods to assess whether a continuous and a categorical variable are significantly correlated:

• Point biserial correlation: the categorical variable must be dichotomous, which is never our case;
• Logistic regression: the dependent variable must be binomial, which is not our case (lPre_valuation is continuous);
• Boxplot analysis: see results in the upcoming paragraph;
• ANOVA and ANCOVA: their normality assumptions are not respected in our data set, as we saw in paragraph 3.4;
• Kruskal-Wallis H-Test: a non-parametric alternative to ANOVA, which does not assume that the data come from a particular distribution. We decide to use the H-test precisely because the assumptions for ANOVA (such as normality) are not met. It is sometimes called the "one-way ANOVA on ranks", as the ranks of the data values are used in the test rather than the actual data points. The test determines whether the means of two or more groups are significantly different. The test statistic is called "the H statistic", and the hypotheses for the test are:
  o H0: the population means are equal.
  o H1: the population means are not equal.
We reject H0 if the adjusted p-value, calculated through the default "holm" method, is below the threshold of 0.05. However, this test alone does not tell us which groups differ. In order to know that, we run a post-hoc Wilcoxon test and comment on its results. We therefore adopt this method, and results are shown in the upcoming paragraph.
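For illustration, the H statistic itself can be computed as follows. This is a sketch of our own without the tie-correction factor and without the chi-square p-value that a statistical package adds on top.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H: one-way ANOVA on ranks (tie correction omitted).
    Tied values receive the average (mid) rank."""
    pooled = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1  # extend over a run of tied values
        midrank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += midrank
        i = j
    return 12 / (n * (n + 1)) * sum(
        rs ** 2 / len(grp) for rs, grp in zip(rank_sums, groups)
    ) - 3 * (n + 1)

# Two clearly shifted groups yield a larger H than two identical ones,
# for which H is exactly zero.
print(kruskal_wallis_h([1, 2, 3], [10, 11, 12]))  # ~3.857
```

Because only ranks enter the statistic, H is insensitive to the heavy tails and skew observed in our variables, which is what makes the test applicable here.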

4.4.2 Results

We now plot the dependent variable lPre_valuation against all considered categorical variables. In each graph, we also show the result of the Kruskal-Wallis test.

4.4.2.1 Round_name

The graph in Figure 4.30 shows a strong correlation, which is straightforward: the round name is usually assigned based on the pre-money valuation. So, unsurprisingly, later rounds have higher lPre_valuation. The boxplot of "From Series D to IPO" is the highest one but also the tallest, so it has the highest volatility in pre-money valuation. All groups have some outliers and tend to be left-skewed (the tail of the distribution is longer on the left-hand side than on the right-hand side, and the median is closer to the third quartile than to the first one). Finally, the population mean of each group is significantly different from those of all the other groups.

Figure 4.30: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Round name. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

4.4.2.2 Industry

In Figure 4.31 we see the boxplots of lPre_valuation for each Industry group. Some Industries are more volatile than others: ICT shows the widest range and several outliers, while the Cleantech and Consumer Products ranges are much more restricted. MedTech/Healthcare shows one particularly extreme case. Consumer Products and Other are strongly left-skewed, which means that the 3rd and 4th quartiles have a more restricted range than the first two. We do not obtain significant mean differences among groups, so we pool the Industries with fewer than 20 samples into the group Other, to see if we obtain different results. Figure 4.32 shows the resulting boxplots. Also in this case, the group means are not significantly different from one another. So, we presume that adding this explanatory variable to our regression model (see Chapter 6) would bring no advantage.

Figure 4.31: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Industry. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

Figure 4.32 Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the industry, after we pooled together the groups having less than 20 samples in the Industry group Other. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

4 Multivariate Data Analysis 49

4.4.2.3 Stage

The relationship with the variable Stage is represented in Figure 4.33. For this variable we have 61.7% of missing values, so inserting it into our regression model would make us lose the majority of the samples in our data set. For this reason, it is even more important to analyse separately its relationship with the dependent variable lPre_valuation. Prototyping is the most volatile group, and it is right-skewed. Beta-phase/Clinical Trials is also right-skewed (the 50% of these samples with the highest lPre_valuation span a wider range). Although the high volatility involves all groups, we can clearly see a growing trend: the later the stage of a start-up, the higher its lPre_valuation. Through the Wilcoxon test, we can state that the means of the following pairs of groups are statistically significantly different:

• Prototyping and Growth/International
• Beta-Phase/Clinical Trials and First Clients
• Beta-Phase/Clinical Trials and Growth/International
• First Clients and Growth/International
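The pairwise comparisons above can be sketched with scipy's Mann-Whitney U test, the rank-sum analogue of the Wilcoxon test used in the thesis. The group samples below are hypothetical, chosen only to illustrate the mechanics:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Hypothetical lPre_valuation samples per Stage group (illustrative only).
stages = {
    "Prototyping":                [13.5, 13.9, 14.2, 13.7],
    "Beta-Phase/Clinical Trials": [14.0, 14.5, 14.3, 14.8],
    "First Clients":              [14.9, 15.2, 15.6, 15.0],
    "Growth/International":       [16.0, 16.4, 16.8, 16.2],
}

# One-sided test for every pair, in stage order: does the earlier
# stage have systematically lower valuations than the later one?
results = {}
for (a, b) in combinations(stages, 2):
    stat, p = mannwhitneyu(stages[a], stages[b], alternative="less")
    results[(a, b)] = p
    print(f"{a} < {b}: p = {p:.4f}")
```

With such small groups scipy computes an exact p-value; in the thesis the same logic, applied per pair of Stage groups, yields the list of significant differences above.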

Figure 4.33: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Stage. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

As Prototyping has only 13 samples and there is no significant difference between this group and Beta-Phase/Clinical Trials, we now try to pool them together in a group called "Early-stage" and test whether there is a significant difference with First Clients and with Growth/International, which we rename, for coherence, as "Later-stage". We now obtain Figure 4.34 and significant mean differences between all groups. This makes us keep the new structure of the variable and state that Stage is a potentially useful predictor of lPre_valuation.

Figure 4.34: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Stage, after we pooled together the groups Prototyping and Beta-Phase/Clinical Trials Stages in the new group Early-stage. For coherence, the group Growth/International is here renamed as Later-Stage. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

4.4.2.4 Type Lead Investor (TLI)

The graph in Figure 4.35 for TLI is very interesting: the trend is evident and highly significant. The mean valuation obtained in rounds involving Accelerators/incubators is much lower than in all the other ones. As expected, Private Angels lie between accelerators and Institutional investors. Is their valuation lower because they only invest in early-stage start-ups? The answer is shown in Figure 4.36.

Except for Accelerator/incubators, all other types of investors show a differentiated portfolio of rounds, based on the start-up Stage. Institutional Financial shows many more samples and therefore a wider range of offered valuations compared to the other groups, while Institutional Strategic has the highest mean lPre-valuation.

Nevertheless, for a fair comparison, we need to verify whether these significant mean differences hold even when separating rounds by companies' development stage. In fact, it could be possible that Private Angels offer on average lower pre-valuations only because they mainly invest in early-stage start-ups, while Institutional Strategic investors only invest in later-stage companies. In the following pie charts and histogram (Figures 4.37 – 4.38), we represent the distribution of rounds' Stages across the different types of investors. For example, we note that Private Angel investments involve early-stage start-ups (Prototyping + Beta-Phase + Clinical Trials) in the largest share of cases (43%). However, the lower valuation offered by Private Angels cannot be attributed merely to the fact that they mainly invest in early-stage start-ups. A more comprehensive reason is that they are willing to take higher risk than the other players, and therefore ask for higher returns on investment.
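The proportions behind the pie charts and the stacked bars amount to a simple cross-tabulation of round records. A pure-Python sketch on hypothetical (TLI, Stage) pairs, chosen so that Private Angels' early-stage share comes out near the 43% cited above:

```python
from collections import Counter

# Hypothetical (TLI, Stage) pairs for a handful of rounds; illustrative only.
rounds = [
    ("Private Angel", "Early-stage"), ("Private Angel", "Early-stage"),
    ("Private Angel", "Early-stage"), ("Private Angel", "First Clients"),
    ("Private Angel", "First Clients"), ("Private Angel", "Later-stage"),
    ("Private Angel", "Later-stage"),
    ("Inst.Financial", "First Clients"), ("Inst.Financial", "Later-stage"),
    ("Inst.Strategic", "Later-stage"), ("Inst.Strategic", "First Clients"),
]

# Count stages within each investor type, then convert to percentages.
by_tli = {}
for tli, stage in rounds:
    by_tli.setdefault(tli, Counter())[stage] += 1

for tli, counts in by_tli.items():
    total = sum(counts.values())
    shares = {s: round(100 * c / total) for s, c in counts.items()}
    print(tli, shares)
```

The same table, with the real data, is what Figures 4.37 and 4.38 visualize as bars and pies.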

Figure 4.35: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type of Lead Investor. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.


Figure 4.36: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI) and the Stage. Given a TLI, a different boxplot of lPre_valuation is created for each Stage (following the legend, a colour is associated to each Stage). Samples with unknown investor have been excluded from the graph. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.


Figure 4.37: Distribution of the Type of Lead Investor (TLI) across Stages. On the abscissa we read the Stage (a bar for each Stage group). On the ordinate we read (in percentage) the proportion of samples in that Stage group belonging to a certain TLI group. Following the legend, each colour section of the bars corresponds to a certain TLI group.

Figure 4.38: These pies show the distribution of the variable Stage for each Type of Lead Investor (TLI). In a) we consider only samples in which the TLI is Institutional Financial, in b) only Private Angel, and in c) only Institutional Strategic. Following the legends, each colour section of the pies corresponds to a certain Stage group, and its proportion of samples (in percentage) is written inside the section.

Figures 4.39 – 4.41 confirm that all types of investors (except for Accelerators/incubators) show a well-diversified portfolio of start-ups in terms of development stage. Nevertheless, the mean differences among groups continue to be significant only when considering rounds of:

• Early-stage start-ups, between all types of investors
• First Clients stage start-ups, just between Private Angels and Inst. Financials (one-sided test)

In those cases, the type of investor involved in the round has a significant impact on lPre_valuation. As the majority of rounds involving Private Angels are related to early-stage start-ups (but not only), the overall outcome is a significant mean difference among the three types of investors.

How can we interpret this? Private Angels are known to be willing to take more risk than institutional investors, and they impose fewer constraints on companies. On the other hand, they require a higher return on their investment, by investing at a relatively low valuation. As for the overall highest valuation offered by strategic investors, we have to remember that they are called "strategic" because they invest by virtue of a particular strategic interest they have in a specific start-up, an interest that other investors (strategic or not) might not share. Examples of strategic reasons behind a start-up investment are: exploitation of the developed technology, IP rights, complementary products, control of competition, reaching new customer segments or new markets, access to know-how or specific resources, etc. That is why their valuations of companies are higher than those of Institutional Financial investors, which do not derive any strategic advantage from their investments.

Thinking ahead to our regression analysis, this information suggests that Type_Lead_Investor would have a relevant impact on lPre_valuation when considering its interaction with the variable Stage.

Early-stage rounds

Figure 4.39: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI), considering only Early-stage start-up rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

First Clients rounds

Figure 4.40: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI), considering only start-up rounds at the stage First Clients. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

Later-stage rounds

Figure 4.41: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type of Lead Investor, considering only Later-stage start-up rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

4.4.2.5 Revenue

The Revenue variable has 60% missing values (NA). From the representation in Figure 4.42 we can see a weak positive trend. Some of these groups' means differ significantly (one-sided test, as we assume a growing trend):

• 0 – 50k and 500k – 1M
• 0 – 50k and >1M
• 50k – 100k and >1M
• 100k – 500k and >1M

To retain only significant differences, we pool groups together. The final Revenue structure includes 3 levels: 0 – 50k, 50k – 1M, >1M. The resulting boxplots are shown in Figure 4.43.

Figure 4.42: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Revenue. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

The Kruskal-Wallis test now indicates a more significant mean difference (the p-value is now 0.008 instead of 0.014). Still, we cannot reject the null hypothesis between 0 – 50k and 50k – 1M, but we can between >1M and the other two groups. Having fewer NAs would improve the precision of our results. For these reasons, Revenue does not seem to be an essential predictor in a regression model estimating lPre_valuation.


Figure 4.43: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Revenue, after we pooled together the groups 50k – 100k, 100k – 500k, and 500k – 1M, in the new group 50k – 1M. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

4.4.2.6 Still_operating

The relationship between Still_operating and lPre_valuation is extremely important (Figure 4.44), not to predict the valuation by knowing whether the start-up is still operating, but the other way around. In fact, if we want to estimate the Pre_valuation of a start-up, it means that the start-up has not been acquired yet (so it does not belong to the exit group), nor has it been liquidated. So, every time we aim to predict the pre_valuation of a start-up, the start-up is indeed still operating.

What we find more interesting is to investigate the possibility of predicting the future of a start-up (acquired, liquidated, or still operating) from its lPre_valuation. We do that in Chapter 5.

Figure 4.44 takes all rounds into consideration, and it confirms that a relevant relationship exists between these two variables. In fact, we count a significant difference in mean lPre_valuation among all three groups. Companies that reached an exit have the highest median lPre_valuation, liquidated companies have the lowest one, while start-ups that are still operating have the highest volatility, but a median lying between the other two groups.

We are aware that, one day, the majority of start-ups now belonging to the yes group will belong to the no or exit groups. As we still do not know their destiny, let us now exclude them from the analysis and focus only on the exit and no groups (Figure 4.45).

Figure 4.44: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating). Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

Figure 4.45 shows an enormous, significant difference between the means of the two groups. This is extremely precious information that makes us wonder the following important question: was the future of the start-up already evident when those companies were just at their early-stage round? Was their future already predictable at their first investment round (when they had zero money previously raised)? In Figure 4.46 we analyse a selection of rounds with Prev_raised=0. Even if we only have a few samples in the exit group, the mean difference between the exit and no groups, and between the yes and no groups, is significant! This is even more evident when looking only at the exit and no groups (Figure 4.47).


Figure 4.45: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the samples in the groups “exit” and “no”. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

Rounds with lPrev_raised=0

Figure 4.46: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the rounds with zero money previously raised. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

Rounds with lPrev_raised=0

Figure 4.47: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the rounds with zero money previously raised and belonging to the groups “exit” or “no”. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

In Figure 4.48, we consider only early-stage start-ups and show their precise development stage with different colours. This time we have a more substantial number of samples in the no and yes groups, and their mean difference is again significant, below the 0.05 threshold. Nevertheless, we have no sample belonging to the exit group, so by knowing the lPre_valuation of a company at its early-stage round, we can only predict whether it will face a liquidation in the future or not.

Overall, our conclusion would be extremely useful in practice but, unfortunately, the very limited availability of samples and the high volatility of lPre_valuation for companies belonging to the yes group prevent our statements from being strongly reliable.

We can nevertheless add the following comment on the demonstrated relationship between the future success of a start-up and its lPre_valuation: reaching a high pre-money valuation can be both a cause and an effect of a start-up's still_operating status. In fact, if the company is still operating or achieved an exit, it was probably already showing a lower risk of failure at the investment round, and for this reason it got a higher valuation. But the reverse can also be true: as the company got a higher valuation in the round, it could invest resources more efficiently, it probably raised more money (recall the strong correlation between lPre_valuation and lAmount_raised), and it therefore maximized its probability of still operating or being sold.

Early-stage rounds

Figure 4.48: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the early-stage rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.

5 PREDICTING THE FUTURE SUCCESS OF A SWISS START-UP

5.1 OVERVIEW

In paragraph 4.4.2.6, we showed the existence of a significant correlation between the continuous variable lPre_valuation (indicating the Pre-money valuation obtained by a company in a funding round) and the categorical variable Still_operating (revealing the current status of a start-up: yes = still operating, no = liquidated, exit = acquired). Based on that information, in this chapter we build a model predicting the future status of a company (Still_operating), with the minimum possible error rate.

5.2 METHODOLOGY

To reach our goal, we apply the following Discriminant Function Analysis approaches to several combinations of explanatory variables (both continuous and categorical):

• Linear discriminant analysis (LDA)
• Quadratic discriminant analysis (QDA)
• Multiple discriminant analysis (MDA)
• Flexible discriminant analysis (FDA)

Discriminant Analysis, indeed, can be used to determine which variables are the best predictors of the outcome categorical variable. It assumes that the data represent a sample from a multivariate normal distribution. However, violations of the normality assumption are usually not "fatal", meaning that the resulting significance tests are still "trustworthy", especially for FDA. So, theoretically, FDA should be the best method to apply in our case, as we have a multivariate non-normal data set (see the distribution analysis in paragraph 3.4). Nevertheless, we test and compare the performance of all these classifiers (LDA, QDA, MDA, FDA) and, for each of them, we create several models involving different combinations of continuous and categorical explanatory variables. We train the models on 80% of our data set and test them on the remaining 20% of samples. Finally, we compare the accuracy of these models in terms of the percentage of observations that are classified correctly, and we select the optimal one.
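The 80/20 train-test comparison described above was carried out in R. A minimal Python analogue with scikit-learn, on synthetic data standing in for the thesis variables (the class shift injected below is an assumption used only to make the toy problem separable):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for (Closing_Year, lPre_valuation, lPrev_raised,
# lAmount_raised); three classes mimic Still_operating = yes / no / exit.
n = 300
X = rng.normal(size=(n, 4))
y = rng.integers(0, 3, size=n)
X += y[:, None] * 1.5  # inject signal: shift features by class

# 80% training, 20% held-out test, as in the thesis.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(clf).__name__, f"accuracy = {acc:.2%}")
```

MDA and FDA have no direct scikit-learn equivalent (the thesis used R's mda package for those); the comparison logic, however, is the same: fit each classifier on the training split and rank them by held-out accuracy.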

5.3 RESULTS

By comparing the performance of our models, we find that the best combination of explanatory variables to predict Still_operating is made of:

− Closing_Year
− lPre_valuation
− lPrev_raised
− lAmount_raised

Both the LDA and FDA classifiers provide the most accurate predictions: they correctly predict the future company status for 96.36% of the observations in our data set. This value could be improved with more observations and measured variables. In more detail, we report the coefficients used by our final selected model:

Coefficients of linear discriminants:
                    LD1     LD2
Closing_Year      0.984   0.305
lPre_valuation    0.432  -0.986
lPrev_raised     -0.096  -0.573
lAmount_raised   -0.262   0.481

Proportion of trace:
  LD1   LD2
0.783 0.217

The weakest point of this model is that we cannot estimate in which year this future status will be realized. We only know that all companies in our data set closed investment rounds between 2010 and 2019, while the Still_operating variable is updated to October 2019. So, overall, the predicted future status of a company is expected to be realized within up to nine years after the inputs of the model are measured.

The strength of the model is that it reveals that Closing_Year and lPre_valuation are the most influential factors in determining the future status (and success) of a company.

6 MULTIPLE REGRESSION ANALYSIS

6.1 PURPOSE

The goal of this Multiple Regression Analysis is to estimate a fair benchmark range for the pre-money valuation of a start-up at its upcoming investment round, given as input some explanatory variables related to the company. This benchmark could be useful for the co-founders, as well as for the investors evaluating the company before making an investment proposal.

While conducting this analysis, we keep in mind that correlation does not imply causation. Even if we build a model having significant predictors and a high adjusted R-squared, still nothing is known about causal relationships.

6.2 METHODOLOGY

6.2.1 Overview

We want to predict the dependent variable (lPre_valuation) based on the values of a set of predictors (mixing continuous and categorical variables). The choice of the regression model type depends on the distribution followed by the dependent variable:

− Linear regression, for continuous variables having linear relationships with the predictors
− Logistic regression, for a dichotomous distribution
− Log-linear analysis, for a Poisson or multinomial distribution
− Cox regression, for time-to-event data in the presence of censored cases (survival-type)
− Non-linear regression, for continuous dependent variables having non-linear relationships with the predictors

In our case, the dependent variable lPre_valuation is continuous and moderately correlated with the other continuous predictors (as we saw in paragraph 4.2). So, it is a good rule to start by creating the simplest possible model via multiple linear regression, and to make it more complicated only when truly needed. If we make a model more complex, we should confirm, by testing its performance, that we are not heading toward overfitting. Besides, the prediction intervals should become more precise (narrower). If we have several models with comparable predictive abilities, the simplest one is likely to be the best model (Zellner, Keuzenkamp and McAleer, 2001).

Because of the restricted number of samples in our data set (286) and the large proportion of missing values for some variables, we decided to exploit all available data to train models, and then to test them on the same data set, via k-fold cross-validation, LOOCV, a validation set, and the bootstrap.
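The validation strategies named above can be sketched with scikit-learn (the thesis used R). Below, 5-fold cross-validation and LOOCV are compared on synthetic data of the same size as the thesis data set (286 samples); the coefficients and noise level are assumptions of the toy example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)

# Synthetic regression data standing in for the 286-sample data set,
# with a continuous target playing the role of lPre_valuation.
X = rng.normal(size=(286, 3))
y = X @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.2, size=286)

model = LinearRegression()
for name, cv in (("5-fold CV", KFold(5, shuffle=True, random_state=1)),
                 ("LOOCV", LeaveOneOut())):
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.3f}")
```

Both strategies reuse all available data for both fitting and testing, which is exactly the reason they were chosen over a single hold-out split given the small sample size.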

All methods have been tested and visualized via the software R. To distinguish models, in the following paragraphs we give them names within square brackets (e.g. [regsub.best]).

6.2.2 Steps

We follow and iterate the following steps:

1. Pre-processing the data set (paragraph 6.3);
2. Manual variable selection based on previous knowledge and the conducted analyses (paragraph 6.4);
3. Model specification by applying a combination of methods (Appendix: Model Specification). For each created model:
o Outlier check and, if justified, removal from the model;
o Assumptions check;7
o Calculation of model fit: internal measures, and test measures through validation set, cross validation and bootstrap;
4. Comparison of the best models, making appropriate considerations (paragraph 6.5);
5. Choice of the optimal model among the best models, final comments (paragraph 6.6).

Regarding step 3, we list here the methods we apply to find the best model fit:

1. Automated models: stepwise regression ("both": backward and forward);
2. Automated models: best subset regression. Train and test models via:
a. validation set;
b. 5-fold cross validation;

7 For reasons of brevity, this procedure and its results are shown only once, for the final selected best model (paragraph 6.7).

3. Removal of insignificant terms (through the drop1 function);
4. Curve fitting using polynomial terms;
5. Fractional exponents;
6. Splines;
7. Log transformation of predictors;
8. Non-linear regression: GLM (Generalized Linear Models);
9. Loess regression;
10. Kernel regression.
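As an illustration of step 4 (curve fitting using polynomial terms): adding a power of a predictor to a linear model and checking whether the fit improves. A sketch with numpy on synthetic data with a mild quadratic relationship (coefficients and noise are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic predictor/response with a mild quadratic relationship.
x = rng.uniform(-2, 2, size=120)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=0.1, size=120)

# Compare a linear fit and a quadratic fit via residual sum of squares.
for degree in (1, 2):
    coefs = np.polyfit(x, y, deg=degree)
    resid = y - np.polyval(coefs, x)
    rss = float(resid @ resid)
    print(f"degree {degree}: RSS = {rss:.3f}")
```

The quadratic term captures the curvature, so its RSS drops sharply; in the thesis, the analogous check (combined with the test metrics of paragraph 6.2.1) decides whether a polynomial term is kept.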

6.3 DATA PRE-PROCESSING

In the following regression analysis, we start by maintaining the same group structure of categorical variables as it resulted out of the analysis conducted in paragraph 4.4.

6.4 SECOND MANUAL VARIABLE SELECTION (FROM 20 TO 8 INDEPENDENT VARIABLES)

Based on previous knowledge and the results of conducted analyses, we exclude the following variables:

− Foundation_Year: its correlation coefficient with the dependent variable is close to zero (0.08). This can be confirmed by testing a simple regression model with Foundation_Year as the only explanatory variable and lPre_valuation as the dependent variable. The share of lPre_valuation explained by this predictor is close to 0% and not significant;
− Still_operating and Round_name: these variables are strongly correlated with lPre_valuation, but they are the effect of a determined lPre_valuation, not the cause. When using our model for the purpose explained in paragraph 6.1, the user knows neither the future of the company nor the name of the round that the company is considering. For these reasons, we exclude them from our model;
− Profitable: in our data set, only 1 sample is profitable, and 179 missing values are present. So, we cannot extract useful information from this variable;
− Pre_Valuation, Amount_raised, Prev_raised, Through_investiere: as we saw in paragraph 3.4, the log transformations of these variables have better properties for regression and correlation analysis than the original ones;

− lThrough_investiere and Employees: a separate analysis has been conducted for each of these variables. Keeping them in our analysis would make us lose 82% of the samples, due to their large proportion of NA values. Besides, lThrough_investiere is not a universal explanatory factor for all rounds closed in Switzerland;
− Country and Currency: as we consider only Swiss start-ups, all rounds are closed in Switzerland and in Swiss Francs (CHF). These conditions must be respected when using the model.

The resulting regression data set is composed of 286 samples -of which 108 are complete cases- one continuous dependent variable (lPre_valuation), and eight explanatory variables8 (five categorical and three continuous):

− Industry (Cat.)
− Revenue (Cat.)
− Closing_Year_factor (Cat.)
− Type_Lead_Investor (Cat.)
− Stage (Cat.)
− Closing_Year (Cont.)
− lPrev_raised (Cont.)
− lAmount_raised (Cont.)

Besides, we already know that we will have to decide whether to include Closing_Year or Closing_Year_factor as a predictor. We will test their performance and then decide.

After this second manual variable selection, we are still interested in reducing the number of explanatory variables in our model. In fact, as five of these variables are categorical with more than two groups, including all of them would already give a model with 15 explanatory variables, before adding interaction terms.

Green (1991) indicates that N > 50 + 8m samples (where m is the number of independent variables) are needed for testing multiple correlation. Harris (1985) says that the number of samples should exceed the number of predictors by at least 50. Van Voorhis, Besty and Morgan (2007) affirm it is better to have 30 samples per predictor. Finally, the "one in ten rule" is a rule of thumb for how many predictor parameters can be estimated from data in a regression analysis (in particular, in proportional hazards models and logistic regression) while keeping the risk of overfitting low. The rule states that one predictive variable can be studied for every ten events (Harrell, Lee, Califf, Pryor and Rosati, 1984). As a rule of thumb, we decide to have a model with at least 15-20 samples per predictor which, in our case, considering only complete samples, means a maximum of 5-7 predictors.

8 Detailed variable description and overview are provided in paragraph 3.3.
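The rules of thumb above reduce to quick arithmetic. Taking the 108 complete cases mentioned in the text (the count comes from the thesis; the rest is a sketch):

```python
# Number of complete cases available for regression (from the text).
n = 108

# Green (1991): N > 50 + 8m  =>  m < (N - 50) / 8
m_green = (n - 50) // 8
# Harris (1985): samples should exceed predictors by at least 50.
m_harris = n - 50
# Van Voorhis & Morgan (2007): about 30 samples per predictor.
m_voorhis = n // 30
# One-in-ten rule: one predictor per ten events.
m_one_in_ten = n // 10
# Thesis rule of thumb: 15-20 samples per predictor.
m_thesis = (n // 20, n // 15)

print("Green:", m_green, "| Harris:", m_harris,
      "| 30/pred:", m_voorhis, "| 1-in-10:", m_one_in_ten,
      "| 15-20/pred:", m_thesis)
```

The last line reproduces the "maximum of 5-7 predictors" stated above; the stricter rules (30 samples per predictor) would allow even fewer.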

In the Appendix: Model Specification, we first investigate which selection the automated models would suggest, and then compare it with the tested application of the other methods, as listed in paragraph 6.2.2, without excluding further pooling of groups if necessary.

6.5 BEST MODELS COMPARISON

We selected the two best models: model [A] and model [B]. The complete procedure that led us to select these two models is reported in the Appendix: Model Specification. These two models are [A]:

lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised

And [B]: lPre_valuation ~ lPrev_raised + lAmount_raised + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised + Closing_Year:lAmount_raised + Stage:lAmount_raised
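An interaction term such as Closing_Year:lPrev_raised in R's formula notation is simply a product column in the design matrix. A minimal numpy sketch on synthetic data (variable names mirror the thesis; values and coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Synthetic stand-ins for two continuous predictors.
closing_year = rng.integers(2010, 2020, size=n).astype(float)
l_prev_raised = rng.normal(12.0, 2.0, size=n)

# Target generated with a true interaction effect of 0.02.
y = (0.1 * closing_year + 0.4 * l_prev_raised
     + 0.02 * closing_year * l_prev_raised
     + rng.normal(scale=0.5, size=n))

# Design matrix: intercept, main effects, and the interaction column
# (this product is what Closing_Year:lPrev_raised expands to).
X = np.column_stack([
    np.ones(n),
    closing_year,
    l_prev_raised,
    closing_year * l_prev_raised,
])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated interaction coefficient:", round(float(beta[3]), 3))
```

With a categorical predictor such as TLI, the same construction applies after dummy-coding: each interaction becomes a product of a dummy column with the continuous predictor, which is how models [A] and [B] obtain one slope per investor type.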

Note that the first one does not include the variable Stage, which has a significantly higher proportion of missing values compared to the other predictors. Moreover, neither of these two models includes the interaction term TLI:Stage, which we thought would be a relevant predictor (4.4.2.3). So, in this paragraph, after comparing the performance of these two models, we select the winner, and we test whether adding the interaction term TLI:Stage to it yields better results.

Figure 6.1 gives us a clear picture of how the magnitude of the effect differs across these predictors (circle/square), plus their uncertainty (horizontal lines indicate the 95% confidence interval). In Figures 6.2 and 6.3, instead, we read that the vast majority of performance metrics agree that model [A] is the best one (we compare them with the model resulting from best subset selection, and with the two variations of [basic.model] previously considered during Model Specification: [TLI*Stage] and [TLI + Stage]9).

[Figure 6.1 here: a coefficient plot with legend [A]/[B] and an effect-magnitude axis from -0.5 to 0.5; the vertical axis lists the predictors lPrev_raised, lAmount_raised, Closing_Year, Inst.Financial, Inst.Strategic, Closing_Year:Inst.Financial, Closing_Year:Inst.Strategic, lPrev_raised:Closing_Year, Acc/Inc/PA:Closing_Year, Inst.Financial:Closing_Year, Inst.Strategic:Closing_Year, lAmount_raised:Closing_Year, lAmount_raised:First_Clients, lAmount_raised:Later-Stage.]

Figure 6.1: On the right, the names of predictors belonging to model [A] and/or model [B]. The figure compares the magnitude effect (blue circles for model [A] and orange squares for model [B]) of each predictor on Pre-money valuation, plus its uncertainty (the horizontal lines indicate the 95% confidence interval of these estimates).

So we select [A] as the best model, and we now try to add the interaction term TLI:Stage to it, as a final confirmation.

9 The full explanation of these models is provided in the Appendix: Model Specification.

Model: A + TLI:Stage

Residuals:
     Min       1Q   Median       3Q      Max
-0.80396 -0.23757 -0.00219  0.26622  0.92276

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -2.826e+02  6.792e+01  -4.160 7.14e-05 ***
lPrev_raised                              9.600e-05  3.172e-05   3.027  0.00321 **
lAmount_raised                            3.792e-01  6.239e-02   6.078 2.73e-08 ***
Closing_Year                              1.451e-01  3.384e-02   4.289 4.43e-05 ***
TLI_Institutional Financial              -3.685e+00  1.022e+02  -0.036  0.97133
TLI_Institutional Strategic               4.082e+02  1.535e+02   2.659  0.00925 **
Closing_Year:TLI_Institutional Financial  1.690e-03  5.074e-02   0.033  0.97351
Closing_Year:TLI_Institutional Strategic -2.023e-01  7.619e-02  -2.655  0.00935 **
lPrev_raised:Closing_Year                -4.753e-08  1.572e-08  -3.023  0.00324 **
TLI_Acc/Inc/PA:StageFirst Clients        -3.436e-02  1.560e-01  -0.220  0.82610
TLI_Instit.Financial:StageFirst Clients   2.052e-01  1.575e-01   1.303  0.19597
TLI_Inst.Strategic:StageFirst Clients    -3.389e-01  3.033e-01  -1.118  0.26669
TLI_Acc/Inc/PA:StageLater-stage           1.099e-01  2.160e-01   0.509  0.61205
TLI_Inst.Financial:StageLater-stage       2.704e-01  1.548e-01   1.746  0.08413 .
TLI_Inst.Strategic:StageLater-stage      -8.369e-02  2.823e-01  -0.296  0.76753
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3956 on 92 degrees of freedom Multiple R-squared: 0.8196, Adjusted R-squared: 0.7921 F-statistic: 29.85 on 14 and 92 DF, p-value: < 2.2e-16

[Figure: bar chart “Interaction terms selection” — PRESS, BIC, and AIC for models basic.model + TLI + Stage, TLI*Stage, A, and B]

Figure 6.2: Comparison of models’ performance. Reading the legend, each bar colour corresponds to a specific model. The metrics, calculated through the software R, are shown on the ordinate; models with lower values for these metrics are preferred. They are defined as follows:
− AIC = −2(log-likelihood) + 2K, where K is the number of model parameters, and the log-likelihood is a measure of model fit (the higher the number, the better the fit);
− BIC = −2(log-likelihood) + log(n)K, where K is the number of model parameters and n the number of observations;
− As defined by Allen (1974), PRESS is based on the leave-one-out technique: from a fitted model, each of the n samples in turn is removed, and the model is refitted to the remaining (n−1) points. The predicted value is calculated at the excluded point, and the PRESS statistic is the sum of the squares of all the resulting prediction errors.
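The thesis computes these metrics in R. As an illustrative sketch only (function names are ours), the same three quantities can be reproduced with plain numpy; note that PRESS does not require refitting n models, thanks to the hat-matrix identity e(i) = e_i / (1 − h_ii):

```python
import numpy as np

def ols_fit(X, y):
    """Fit OLS via least squares; X should include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def aic_bic(X, y):
    """Gaussian-likelihood AIC and BIC, counting the coefficients plus
    the error variance as parameters (as R does for lm objects)."""
    n, p = X.shape
    resid = y - X @ ols_fit(X, y)
    sigma2 = resid @ resid / n                        # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 1                                         # coefficients + sigma^2
    return -2 * loglik + 2 * k, -2 * loglik + np.log(n) * k

def press(X, y):
    """Leave-one-out PRESS via the hat-matrix shortcut: the deleted
    residual is e_i / (1 - h_ii), so no explicit refitting is needed."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    resid = y - H @ y
    loo = resid / (1 - np.diag(H))
    return float(loo @ loo)
```

The hat-matrix shortcut is exactly equivalent to Allen's refit-n-times definition, just far cheaper.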

The ANOVA test does not show a significant improvement:

Model 1: lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised
Model 2: lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI + Closing_Year:Type_Lead_Investor + lPrev_raised:Closing_Year + Type_Lead_Investor:Stage

  Res.Df    RSS Df Sum of Sq Pr(>Chi)
1     98 15.238
2     92 14.400  6   0.83771   0.4995

We then compare them in figures 6.4 and 6.5. Except for the validation set measures and the internal RMSE and MAE, all other metrics indicate that [A + TLI:Stage] performs worse than [A] alone. Besides, many predictors would no longer be significant. Before concluding that [A] is our best model, we finally test through the function drop1 whether any of its predictors should be removed:

Single term deletions

Model: lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised

                                Df Sum of Sq    RSS     AIC  Pr(>Chi)
<none>                                       16.484 -187.90
Closing_Year:Type_Lead_Investor  2    1.9901 18.474 -179.47 0.0020057 **
lPrev_raised:Closing_Year        1    1.7901 18.274 -178.66 0.0008018 ***
lAmount_raised                   1    7.3165 23.800 -149.86 2.491e-10 ***

All terms are significant and we can therefore conclude that [A] is the best selected model.

[Figure: grouped bar chart “Interaction terms selection” — internal measures (S, Adj.R2, Pred.R2, Pred.err.rate, MAE, RMSE) and k-fold 5.5, LOOCV, validation set, and bootstrap metrics (RMSE, MAE, R2, Pred.err.rate) for models basic.model + TLI + Stage, TLI*Stage, A, and B]

Figure 6.3: Comparison of models’ performance. Reading the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate, and models with lower values are preferred, except for the R2 measures, where the contrary is true. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “k-fold 5.5” means that we split the data set into five folds and repeat cross validation five times; the final model error is taken as the mean error over the repeats. MAE is the average, over the test sample, of the absolute differences between prediction and actual observation (all having equal weight). RMSE is the square root of the average of squared differences between prediction and actual observation. S is the standard error (an absolute measure of the typical distance of the data points from the regression line, in the units of the dependent variable). R-squared (R2) provides the relative measure of the percentage of the dependent variable variance that the model explains (from 0 to 1).
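The repeated k-fold procedure described in the caption can be sketched as follows. The thesis runs it in R; this numpy version, with our own function names, is only a hedged illustration of the mechanics (shuffle, split into k folds, fit on k−1 folds, score on the held-out fold, average over all folds and repeats):

```python
import numpy as np

def kfold_cv(X, y, k=5, repeats=5, seed=0):
    """Repeated k-fold cross validation for an OLS model:
    returns the mean RMSE, MAE, and out-of-fold R2 over all folds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    rmses, maes, r2s = [], [], []
    for _ in range(repeats):
        idx = rng.permutation(n)                  # reshuffle before each repeat
        for test in np.array_split(idx, k):
            train = np.setdiff1d(idx, test)
            beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            err = y[test] - X[test] @ beta
            rmses.append(np.sqrt(np.mean(err ** 2)))
            maes.append(np.mean(np.abs(err)))
            r2s.append(1 - (err @ err) / ((y[test] - y[test].mean()) ** 2).sum())
    return float(np.mean(rmses)), float(np.mean(maes)), float(np.mean(r2s))
```

With k = 5 and repeats = 5 this mirrors the “k-fold 5.5” setting of the figure.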

[Figure: bar chart “Interaction terms selection” — PRESS, BIC, and AIC for models A + TLI:Stage and A]

Figure 6.4: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model.

[Figure: grouped bar chart “Interaction terms selection” — internal measures (S, Adj.R2, Pred.R2, Pred.err.rate, MAE, RMSE) and k-fold 5.5, LOOCV, validation set, and bootstrap metrics (RMSE, MAE, R2, Pred.err.rate) for models A + TLI:Stage and A]

Figure 6.5: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to the standard error, which in the literature is also called sigma, or residual standard error. “k-fold 5.5” means that we split the data set into five folds and repeat cross validation five times; the final model error is taken as the mean error over the repeats.

6.6 BEST SELECTED MODEL

In the previous paragraph 6.5, we selected [A] as our best model, having the following formula:

lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + Type_Lead_Investor + Closing_Year:Type_Lead_Investor + Closing_Year:lPrev_raised

Compared with our Third manual variable selection (Appendix, paragraph 9.2), model [A] includes a combination of only four explanatory variables. The importance of these predictors has already been discussed in Chapter 4. All of them, in fact, are significant, except for the group Institutional Financial of the variable Type_Lead_Investor (as we already noted in the analysis of this variable, conducted in paragraph 4.4.2.4) and its interaction with Closing_Year. The interaction term between Closing_Year and lPrev_raised corrects the impact that the total funding of the start-up has on its valuation, depending on the year in which the round is concluded (and vice versa).
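The structure of model [A] can be made concrete by writing out its design matrix. The thesis fits the model in R; the numpy sketch below is a hedged illustration, and the function name and the lead-investor level labels are our assumptions about the coding:

```python
import numpy as np

def design_matrix(prev, amount, year, tli,
                  levels=("Acc/Inc/PA", "Institutional Financial",
                          "Institutional Strategic")):
    """Build a design matrix mirroring model [A]:
    lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year
                     + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised
    The first TLI level acts as the reference category (no dummy column)."""
    cols = [np.ones_like(prev), prev, amount, year]
    names = ["(Intercept)", "lPrev_raised", "lAmount_raised", "Closing_Year"]
    for lev in levels[1:]:
        d = (tli == lev).astype(float)   # dummy for this lead-investor type
        cols += [d, d * year]            # main effect + Closing_Year interaction
        names += [f"TLI_{lev}", f"Closing_Year:TLI_{lev}"]
    cols.append(prev * year)             # lPrev_raised:Closing_Year interaction
    names.append("lPrev_raised:Closing_Year")
    return np.column_stack(cols), names
```

With three lead-investor levels this yields nine columns, matching the coefficient count (minus the intercept) in the regression output above.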

6.6.1 Confidence and Prediction intervals10

When using the suggested best model [A] to predict lPre_valuation for new data (funding rounds), prediction intervals become very important for making real-world predictions with realistic bounds of uncertainty.

For reasons of brevity, we will only report confidence and prediction intervals for the first eight samples of our data set. Of course, narrower prediction intervals indicate a better model, but we did not use them to select the best model, because they rely strongly on the assumption that the residual errors are normally distributed with constant variance, an assumption that linear regression itself does not require so strictly (see Assumptions check, in paragraph 6.7.2).

10 The confidence interval reflects the uncertainty around the mean predictions. It states which, according to the model, is on average the lPre_valuation range for a funding round with some specific inputs as independent variables. A prediction interval gives an interval within which we expect the dependent variable (lPre_valuation) to lie with a specified probability (95%, in our case); it gives the uncertainty around a single value, given the specified values of the independent variables. If the prediction intervals are too wide, the predictions do not provide useful information; narrow prediction intervals represent more precise predictions. A prediction interval will thus generally be much wider than a confidence interval for the same value.

  lPre_valuation      fit lwr.conf.int upr.conf.int lwr.pred.int upr.pred.int
1       15.84721 15.06559     14.90795     15.22324     13.89531     16.23588
2       20.62511 19.01204     18.66894     19.35514     17.80273     20.22135
3       13.38473 14.35444     14.19031     14.51856     13.18326     15.52561
4       13.30468 13.78248     13.58372     13.98125     12.60595     14.95901
5       12.50618 13.03486     12.74742     13.32231     11.84015     14.22958
6       12.76569 13.12320     12.80986     13.43654     11.92199     14.32441
7       13.28788 13.96765     13.76486     14.17044     12.79043     15.14486
8       12.89922 13.68907     13.44615     13.93200     12.50428     14.87386
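Intervals like those in the table follow from the standard OLS formulas: the confidence interval uses the standard error of the fitted mean, while the prediction interval adds the irreducible error variance. The thesis obtains them in R; this Python version is a hedged sketch of those textbook formulas, not the original code:

```python
import numpy as np
from scipy import stats

def ols_intervals(X, y, X0, alpha=0.05):
    """Confidence and prediction intervals for new rows X0 under OLS.
    Returns (fit, conf_lo, conf_hi, pred_lo, pred_hi)."""
    n, p = X.shape
    XtX_inv = np.linalg.pinv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)                 # residual variance estimate
    tq = stats.t.ppf(1 - alpha / 2, df=n - p)    # two-sided t quantile
    fit = X0 @ beta
    leverage = np.einsum("ij,jk,ik->i", X0, XtX_inv, X0)
    se_conf = np.sqrt(s2 * leverage)             # uncertainty of the mean
    se_pred = np.sqrt(s2 * (1 + leverage))       # adds the irreducible noise
    return (fit, fit - tq * se_conf, fit + tq * se_conf,
            fit - tq * se_pred, fit + tq * se_pred)
```

Because se_pred > se_conf for every row, the prediction interval always contains the confidence interval, exactly as in the table above.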

In the graph (figure 6.6) we show the first 70 samples of the data set (points), their prediction intervals (red lines), and confidence intervals (green lines).


Figure 6.6: On the abscissa is the log of the Amount raised at the round (lAmount_raised) and on the ordinate the log of Pre-money valuation (lPre_valuation). The Figure shows, for Model [A], the actual Pre-money valuations (black points), the confidence intervals (green lines), and the prediction intervals (red lines).

6.7 MLR BLUE ASSUMPTIONS CHECK

We now show the procedure that we applied to each resulting model in order to test whether it met the Multiple Linear Regression BLUE assumptions. For reasons of brevity, we show the process only once, for our best model [A]. Overall, all of the MLR BLUE assumptions are met, as linear regression allows a certain margin of tolerance. Our best model [A] is thus valid from a theoretical viewpoint.

6.7.1 Outlier detection

We remove all the samples with a Cook’s distance higher than 0.1 (Figures 6.7 – 6.8). Then, under Bonferroni correction of the p-values, the outlier test does not find any significant outlier for this model (no observation has an adjusted p-value below 0.05).

Figure 6.7: For Model [A], on the abscissa is plotted the Leverage and on the ordinate the standardized residuals.

Figure 6.8: For Model [A], on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
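For illustration, Cook’s distances can be computed directly from the hat matrix; the thesis uses R, and this numpy sketch (with our own function name) implements the standard formula D_i = e_i² h_ii / (p s² (1 − h_ii)²), against which one can then apply a cutoff such as the 0.1 used above:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of each observation for an OLS fit with design X."""
    n, p = X.shape
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverages h_ii
    resid = y - H @ y                   # ordinary residuals
    s2 = resid @ resid / (n - p)        # residual variance estimate
    return resid ** 2 * h / (p * s2 * (1 - h) ** 2)
```

The formula is algebraically identical to the deletion definition (the shift in all fitted values when observation i is removed), so no refitting is required.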

6.7.2 Check MLR assumptions

6.7.2.1 Assumption 1. Linearity

This assumption is overall respected: a linear trend between the dependent variable lPre_valuation and the predictors is evident in figures 6.9 – 6.10.

Figure 6.9: For Model [A], on the abscissa are plotted the fitted values of the Pre-money valuation and on the ordinate the absolute residuals.

Figure 6.10: For Model [A], on the abscissa are plotted the predicted values of Pre-money valuation and on the ordinate its observed values. The red line is the quadrant bisector used to evaluate if samples (points) are homogeneously distributed over and under the line. 6.7 MLR BLUE Assumptions Check 78

6.7.2.2 Assumption 2: Random sampling of observations

This assumption is true because of the data collection process that we adopted and explained in detail in Chapter 3.

6.7.2.3 Assumption 3: Zero conditional mean of residuals

This assumption is fully respected, as shown below:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.96992 -0.26535  0.02779  0.00000  0.26272  1.00308

6.7.2.4 Assumption 4: No multicollinearity (or perfect collinearity)

There should be no linear relationship between the independent variables. The correlation matrix in Chapter 4 (figure 4.22) shows the existence of moderate relationships among predictors, but the result of vif.test reassures us that these values are tolerable.
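The VIF check can be illustrated with a small numpy sketch (the thesis runs it in R; the implementation below is our own, based on the textbook definition VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on all the others):

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (predictors only,
    no intercept column): regress column j on the remaining columns
    plus an intercept and return 1 / (1 - R2_j)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        ss_res = resid @ resid
        ss_tot = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Values near 1 indicate independent predictors; common rules of thumb flag VIFs above 5 or 10 as problematic.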

An important implication of this assumption is that there should be sufficient variation in the predictors. The larger the variability in the explanatory variables, the better the OLS estimates can determine the impact of the predictors on lPre_valuation. Below we check the variability of the predictors:

               freqRatio percentUnique zeroVar   nzv
Closing_Year    1.222222      9.345794   FALSE FALSE
lPre_valuation  1.333333     80.373832   FALSE FALSE
lPrev_raised   14.666667     51.401869   FALSE FALSE
lAmount_raised  1.428571     57.009346   FALSE FALSE

Perfect: no predictor has zero variance (zeroVar) or near-zero variance (nzv).

6.7.2.5 Assumption 5. Spherical errors

This assumption requires homoscedasticity and no autocorrelation among residuals. In figure 6.9 the residuals do not show a trend. In the next figures 6.11 – 6.12 we plot the residuals against Closing_Year to make sure that no time trend exists among them. Kruskal-Wallis’ test confirms that there are no significant differences among residuals over time.

Figure 6.11: For Model [A], on the abscissa we find the Closing Year of the rounds and on the ordinate the absolute residuals. The red horizontal line is set at zero to evaluate whether the residuals (points) are homogeneously distributed above and below the line, independently of the Closing Year.

Figure 6.12: Conditional boxplot of standardized residuals given the Closing Year of the round.

6.7.2.6 Assumption 6 (optional): Error terms should be normally distributed.

This assumption is not strict. In figure 6.13 below, we show the distribution of the residuals; the results of the normality tests are the following:

Test                 Statistic   p-value
Shapiro-Wilk            0.9899    0.6070
Kolmogorov-Smirnov      0.0674    0.7165
Anderson-Darling        0.4357    0.2935

None of these normality tests rejects the null hypothesis of normality, as the p-value is always above 0.05. Our model is thus valid from a theoretical viewpoint.
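As a hedged Python counterpart of two of these tests (the thesis runs them in R; scipy exposes Shapiro-Wilk and Kolmogorov-Smirnov directly, and the function name below is ours):

```python
import numpy as np
from scipy import stats

def normality_report(resid):
    """Run Shapiro-Wilk and Kolmogorov-Smirnov tests on the residuals,
    after standardizing them. Large p-values mean normality cannot be
    rejected; small p-values flag a departure from normality."""
    z = (resid - resid.mean()) / resid.std(ddof=1)
    sw_stat, sw_p = stats.shapiro(z)
    ks_stat, ks_p = stats.kstest(z, "norm")
    return {"shapiro": (sw_stat, sw_p), "ks": (ks_stat, ks_p)}
```

Standardizing first makes the KS comparison against the standard normal meaningful; for a strict KS test with estimated parameters, a Lilliefors correction would be more appropriate.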

[Figure: density of the standardized residuals, N = 107, bandwidth = 0.3551]

Figure 6.13: Probability density function and boxplot of the standardized residuals of Model [A].

7 CONCLUSIONS

Coming to the conclusions of this research, we bear in mind that “all models are wrong, some models are useful” (Box, 1976). As stated in the Introduction, the aim of this thesis is to provide an overview of start-up valuations in Switzerland, and a model estimating a fair range of pre-money valuation for a target start-up, which could be a useful starting point in the negotiation process between co-founders and investors. With the proposed model, therefore, we do not aspire to predict the exact valuations of Swiss early-stage companies, but rather to indicate a (prediction) interval within which their valuation is most likely to lie. We are also aware of the existence of other factors influencing the valuation of start-ups, which we could not measure in our analysis (as we explain further on in the limitations of our approach). Besides, we tend to agree with the following statement by Matthew Schubring, managing director at Chartwell (Dahl, 2015):

“A good valuation is 75% art and 25% science, because it takes into account the story behind the numbers of a business. Appraisals fall down when there isn’t enough support for the story behind it. It’s based not on just what happened, but on why things happened.”

With these premises made, in Chapter 4 we presented the main findings from the Multivariate Analysis. For the VC Investiere, having zero money previously raised is not a limitation to its investment commitment. Nonetheless, if a start-up has already raised money in the past, the amount then invested by Investiere tends to increase slightly, with a significant correlation of 0.27. This trend is confirmed also for rounds in which Investiere was not involved, as the size of the round (lAmount_raised) and the total funding previously raised by the start-up (lPrev_raised) are linked by a significant correlation (r = 0.42).

The correlation between the Amount raised and the pre-money valuation is the strongest one among our continuous variables (r=0.59, paragraph 4.2.4). This helps entrepreneurs to find a balance between the size of the investment and the consequent dilution.

Among categorical variables, the strongest existing correlation is between the revenues generated by a company and its development stage. We also found that belonging to a particular industry does not affect the success of the start-up, but it influences its revenues (Figure 4.34).

In our data set, all types of investors have a diversified portfolio of companies, in terms of generated revenues and development stage, as we saw more in detail in paragraph 4.4.2.4. On the other hand, the type of investor involved in the round makes a significant impact on the start-up valuation.

We also could not find evidence that the future status of a company (acquired/liquidated/still operating) is determined by the generated revenues. Instead, the Discriminant Function Analysis conducted in Chapter 5 showed that the main predictors of the success of a start-up are the pre-money valuation (confirming the importance of the topic of this research) and the closing year of the round. We also proposed a model able to correctly predict the future status of a start-up (acquired, liquidated, or still operating) for 96.36% of the observations in our test data set. This value could be improved in further research by increasing the number of observations and measured variables.

In addition, we found significant differences in start-up valuations depending on their development stage, but not on the industry they belong to (paragraph 4.4.2.2). The most volatile sector is ICT, while Cleantech and Consumer Products ranges are relatively restricted.

In Chapter 2, we presented an overview of the main start-up valuation methods, which we are now ready to comment on in comparison with our findings and with the best model resulting from the Multiple Regression Analysis conducted in Chapter 6.

The Scorecard Method (Payne, 2011) is based on the average valuation of early-stage start-ups in the industry and region of the target company. As explained in Chapter 3, that kind of sensitive data collection procedure is extremely time-consuming, making it quite unrealistic that co-founders and investors could take advantage of it.

The Berkus Model (Berkus, 2009), instead, considers very broad concepts as the influencing factors of start-up valuations and, when it comes to numbers, leaves enormous space open to interpretation. It is therefore highly probable that the same start-up, when evaluated by different actors via the Berkus Model, receives very disparate valuations.

Finally, the Venture Capital Method (Sahlman and Scherlis, 1989, revised in 2009) is the oldest one and the closest to classical financial methods. Its weakest point is that it rests on the assumption that the target company will generate a certain estimated amount of revenues in five years. As Berkus (2009) states, this is a goal with less than a 0.1% probability of being met.

In contrast to all these methods, the approach that we propose stands out in several respects.

First of all, it takes as input only data which are 100% objective and accessible to all players at zero cost: the round size (Amount_raised), the total funding previously raised by the company (Prev_raised), the year in which the round is going to be closed (Closing_Year), and the nature of the main investor involved (Type_Lead_Investor). An important consequence is that its results are objective (independent of the user).

The second important difference between our model and the other ones is its immediacy: the valuation estimate is obtained instantly and effortlessly.

Another main dissimilarity from the literature concerns the richness of inputs and outcomes. Our model considers inputs that none of the previous models has taken into account before. These factors allow our valuation estimation to be more precise, because it is contextualized in time (Closing_Year), space (Country), and actors involved (type of investor). In Chapter 4, we also found reason to believe that, with a larger availability of data, other factors not included in previous approaches would turn out to be significant predictors of pre-money valuation: the industry, the revenue generated by the start-up, and the number of its employees. Regarding the richness of the outcome, our method does not provide only a single valuation estimate, but offers a range (prediction interval) within which the market value is supposed to lie, with a certain probability chosen by the user (e.g. 95%). It also offers a prospectus of the individual effects played by each factor on the pre-money valuation, and their significance.

Last but not least, a relevant dimension on which our model beats all the others is the validity of its performance. In paragraph 6.5 we provided a transparent and detailed analysis of its internal measures and of its results when tested through validation set, cross validation, and bootstrap. All of these tests agree on an adjusted R-squared of about 0.8, a predicted R-squared of 0.77, and a prediction error rate close to 2.7%.

To sum up, the advantages of our model compared to the available ones in the literature are:

− Accessibility: everyone has access to the input data at zero cost;
− Immediacy: given the inputs, the method is instantaneous and effortless;
− Objectiveness: results are independent of the user;

− Outcome richness: not just a valuation benchmark but a probability interval, plus the individual effect of each factor;
− Contextuality: location, timing, and investors involved are taken into account;
− Performance: the prediction error rate of the model is around 2.7%11, the adjusted R-squared is 0.8, and the predicted R-squared about 0.77.

On the other side, our approach presents several limitations:

− Limited geographic validity (Switzerland);
− Limited dimension of the training set (286 samples), and therefore the absence of a separate test set;
− Abundance of missing values on specific variables (revenue, stage, profitability, employees);
− Other relevant qualitative and quantitative measures are not considered (background and experience of the team, market size, validity of the business strategy, proof of validation, tangible and intangible assets, market positioning (leadership, barriers to entry, brand awareness), expected synergies, patents/technology, competition, …)

These limitations could be overcome in further research, by expanding the data set for Swiss rounds, or by focusing on another country or on a specific industry. Overall, the literature on this topic is young and still poorly able to provide co-founders and investors with a starting point in their “dance of concessions”. Nonetheless, this thesis shows that the room for improvement and the pool of interested stakeholders are broad. This has also been underlined by a recent survey conducted by the Canadian Golden Triangle Angel Network, involving 200 private angels. The result was that, for 50% of them, the most likely reason behind the decision not to invest in a start-up is that “Companies overstated their valuations” (Douglas, 2016).

11 When tested through validation set, cross validation, and bootstrap techniques.

8 REFERENCES

Allen, D. M. (1974). The relationship between variable selection and data agumentation and a method for prediction. Technometrics, 16(1), 125-127.

Arping, S., & Falconieri, S. (2009). Strategic versus financial investors: The role of strategic objectives in financial contracting. Oxford Economic Papers, 62(4), 691-714.

Behrmann, G. (2016). Internet Company Valuation - A Study of Valuation Methods and Their Accuracy. EBS Universität für Wirtschaft und Recht, Oestrich-Winkel, Germany.

Berkus, D. (2009, Nov 4). The Berkus Method – Valuing the Early Stage Investment. Berkonomics. https://berkonomics.com/?p=131.

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791-799.

Clarysse, B., & Kiefer, S. (2011). Introducing the venture roadmap and basic financials. In The Smart Entrepreneur: How to Build for a Successful Business (p. 191). Elliott & Thompson.

Dahl, D. (2015, Oct 30). Why Valuing Your Business Is More Art Than Science. Forbes.

Damodaran, A. (2007). Valuation approaches and metrics: a survey of the theory and evidence. Foundations and Trends® in Finance, 1(8), 693-784.

Douglas, R. (2016, Sept 2). Early-stage startup valuations: More art than science. Communitech News - https://news.communitech.ca/early-stage-startup-valuations-more-art-than- science/

Engel, R. (2002). Teaching note: An introduction to the venture capital method.

Frei, P., & Leleux, B. (2004). Valuation—what you need to know. Bioentrepreneur, 1-3.

Gentle, J. E. (2009). Computational statistics (Vol. 308). New York: Springer.

Green, S. B. (1991). How many subjects does it take to do a regression analysis. Multivariate behavioral research, 26(3), 499-510.

Gunn, M. A. (2016). When science meets entrepreneurship: Ensuring biobusiness graduate students understand the business of biotechnology. Journal of Entrepreneurship Education, 19(2), 53.

Harris, R. J. (1985). A primer of multivariate statistics (2nd ed.). New York: Academic Press.

Harrell, F. E. Jr., Lee, K. L., Califf, R. M., Pryor, D. B., & Rosati, R. A. (1984). Regression modelling strategies for improved prognostic prediction. Stat Med, 3(2), 143-152.

Isabelle, D. (2013). Key factors affecting a technology entrepreneur's choice of incubator or accelerator. Technology innovation management review, 16-22.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.

Lebret, H. (2019). The Analysis of 500+ startups. Retrieved from: http://www.startup-book.com/2019/04/26/the-analysis-of-500-startups/.

Payne, B. (2011). Scorecard valuation methodology. Establishing the Valuation of Pre-revenue, Startup Companies. Retrieved from: http://docplayer.net/14290190-Scorecard-valuation-methodologyestablishing-the-valuation-of-pre-revenue-start-up-companies-by-billpayne.html.

Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12).

Rao, M. B., & Rao, C. R. (2014). Computational Statistics with R (Vol. 32). Elsevier.

Sahlman, W., & Scherlis, D. (1989). A Method For Valuing High-Risk, Long-Term Investments: The "Venture Capital Method". Harvard Business School, 9-288. (Revised October 2009.)

Van Voorhis, C. R. W., & Morgan, B. L. (2007). Understanding Power and Rules of Thumb for Determining Sample Sizes. Tutorials in Quantitative Methods for Psychology, 3(2), 43-50.

Villalobos, L. (2007). Investment Valuations of Seed- and Early-Stage Ventures. The entrepreneur’s trusted guide to high growth (pp. 3-4). Ewing Marion Kauffman Foundation.

Wong, A., Bhatia, M., & Freeman, Z. (2009). Angel finance: the other venture capital. Strategic Change: Briefings in Entrepreneurial Finance, 18(7‐8), 221-230.

Zellner, A., Keuzenkamp, H. A., & McAleer, M. (Eds.). (2001). Simplicity, inference and modelling: keeping it sophisticatedly simple. Cambridge University Press.

9 APPENDIX: MODEL SPECIFICATION AND SELECTION

9.1 AUTOMATED MODELS

9.1.1 Stepwise regression

We report the best model out of stepwise regression, with direction = "both":

Residuals:
     Min       1Q   Median       3Q      Max
-1.35936 -0.34815  0.05134  0.31283  1.30798

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -1.810e+02  5.255e+01  -3.443 0.000835 ***
Closing_Year                               9.467e-02  2.626e-02   3.604 0.000486 ***
Type_Lead_InvestorInstitutional Financial -7.293e-02  1.227e-01  -0.594 0.553601
Type_Lead_InvestorInstitutional Strategic  2.749e-01  1.535e-01   1.791 0.076237 .
lPrev_raised                               1.124e-07  2.202e-08   5.104 1.55e-06 ***
lAmount_raised                             3.859e-01  7.354e-02   5.248 8.41e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4949 on 102 degrees of freedom
Multiple R-squared: 0.674, Adjusted R-squared: 0.658
F-statistic: 42.17 on 5 and 102 DF, p-value: < 2.2e-16

Industry, Stage, Closing_Year_factor, and Revenue are excluded from the model, which counts five predictors (out of eight). The intercept is negative and Institutional Financial is not significant.
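For illustration, the stepwise search with direction = "both" can be sketched as a greedy AIC loop. The thesis uses R's step(); the Python version below is our own simplification, and its AIC drops constants that do not affect the comparison between candidate models:

```python
import numpy as np

def aic(X, y):
    """AIC up to an additive constant: n*log(RSS/n) + 2*(p + 1)."""
    n = len(y)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + 2 * (X.shape[1] + 1)

def stepwise_both(cols, y, names):
    """Greedy stepwise selection (direction = "both"): at every pass, try
    adding or dropping a single predictor and keep the move that lowers
    the AIC; stop when no move improves it. cols is a list of 1-D arrays."""
    n = len(y)
    build = lambda sel: np.column_stack([np.ones(n)] + [cols[j] for j in sel])
    selected = []
    best = aic(build(selected), y)
    improved = True
    while improved:
        improved = False
        candidates = [selected + [j] for j in range(len(cols)) if j not in selected]
        candidates += [[s for s in selected if s != j] for j in selected]
        for cand in candidates:
            a = aic(build(cand), y)
            if a < best - 1e-9:
                best, selected, improved = a, cand, True
    selected = sorted(selected)
    return selected, [names[j] for j in selected]
```

Like R's step(), the loop can both add and remove terms, so a predictor admitted early can still be dropped once better ones enter.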

9.1.2 Best Subset Selection

Figure 9.1: The Figure shows three graphs to analyse the results of the Best subset selection method, applied via the software R. On the abscissa is represented the number of variables in the model, while on the ordinate is shown the resulting value of the following metrics: adjusted R-squared, Cp, and BIC. The optimal value for each metric is shown in red.

We obtain different results in Figure 9.1; five predictors could be a good compromise. The first variable selected with this method is lAmount_raised, and the second one is lPrev_raised, as expected because of their moderate-high correlation with lPre_valuation. They are followed by Closing_Year, Institutional Strategic, and Later-Stage.

Model: regsub.best

Residuals:
     Min       1Q   Median       3Q      Max
-1.29992 -0.27916  0.00431  0.33149  1.27445

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -1.513e+02  4.743e+01  -3.190  0.00188 **
lAmount_raised                            3.432e-01  7.235e-02   4.744 6.70e-06 ***
lPrev_raised                              1.109e-07  2.137e-08   5.191 1.04e-06 ***
Closing_Year                              8.021e-02  2.373e-02   3.379  0.00102 **
Type_Lead_InvestorInstitutional Strategic 3.299e-01  1.273e-01   2.592  0.01093 *
StageLater-stage                          2.156e-01  1.125e-01   1.917  0.05793 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4827 on 104 degrees of freedom
(176 observations deleted due to missingness)
Multiple R-squared: 0.7012, Adjusted R-squared: 0.6868
F-statistic: 48.81 on 5 and 104 DF, p-value: < 2.2e-16

These predictors are indeed all significant and the adj.R2 is higher than before. The intercept is negative.

Note also that the adjusted R2, BIC and Cp are calculated on the training data that have been used to fit the model. This means that model selection through these metrics is possibly subject to overfitting and may not perform as well when applied to new data. As justified in Methodology (paragraph 6.2), we will adopt validation set and cross validation to test automated models.

9.1.2.1 Model selection using a validation set

We split the data into two halves (one for training, one for testing), run Best Subset Selection on the training half, and evaluate it on the test half. Figure 9.2 shows that the optimal number of predictors (having the minimum Mean Squared Error, MSE) is five once again, so the best model is confirmed to be [regsub.best].
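The split-fit-score loop just described can be sketched as follows. The thesis performs this in R; the Python version, including its function name and the 50/50 split, is an illustrative assumption:

```python
import numpy as np
from itertools import combinations

def best_subset_validation(X, y, test_frac=0.5, seed=0):
    """Exhaustive best subset search scored on a held-out validation half:
    fit every non-empty subset of predictors (plus intercept) on the
    training half and return the subset with minimum validation MSE."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.permutation(n)
    cut = int(n * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    best_subset, best_mse = (), np.inf
    for size in range(1, p + 1):
        for sub in combinations(range(p), size):
            Xtr = np.column_stack([np.ones(len(tr)), X[np.ix_(tr, sub)]])
            Xte = np.column_stack([np.ones(len(te)), X[np.ix_(te, sub)]])
            beta, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
            mse = np.mean((y[te] - Xte @ beta) ** 2)
            if mse < best_mse:
                best_subset, best_mse = sub, mse
    return best_subset, float(best_mse)
```

Because the score comes from held-out data, this criterion, unlike the in-sample adjusted R2, BIC, or Cp, directly penalizes overfitting.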

9.1.2.2 Model selection by k-fold cross validation

We test the models with the 5-fold cross validation method. Figure 9.3 shows that the model with minimum cross validation error has four predictors (in accordance with minimum BIC in Best subset selection, Figure 9.1), which gives the following result:

Model: regsub.cv.best

Residuals:
    Min      1Q  Median      3Q     Max
-1.5498 -0.4447 -0.0130  0.3711  1.9436

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -7.595e+01  3.335e+01  -2.277   0.0236 *
lAmount_raised                            6.717e-01  3.478e-02  19.313  < 2e-16 ***
lPrev_raised                              2.966e-08  6.517e-09   4.551 8.12e-06 ***
Closing_Year                              4.063e-02  1.660e-02   2.448   0.0150 *
Type_Lead_InvestorInstitutional Strategic 3.327e-01  1.489e-01   2.234   0.0263 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6324 on 266 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.7215, Adjusted R-squared: 0.7173
F-statistic: 172.3 on 4 and 266 DF, p-value: < 2.2e-16

Figure 9.2: The Figure shows a graph for selecting the best model via the validation set method. On the abscissa is represented the number of predictors in the model, and on the ordinate the corresponding MSE. The optimal value is coloured in red.

Figure 9.3: The Figure shows a graph for selecting the best model via the k-fold cross validation method. On the abscissa is represented the number of predictors in the model, and on the ordinate the corresponding mean cross validation error. The optimal value is coloured in red.

By not including the fifth predictor Stage, this model excludes only 15 samples due to NAs, instead of the 176 excluded by the 5-predictor model [regsub.best]. As a consequence, both the degrees of freedom and the adjusted R-squared are much higher than for model [regsub.best].
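The k-fold cross-validation estimate used in this section can be sketched as follows. The thesis computes it in R; this is a minimal Python version on toy data, included only to illustrate the mechanics:

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Mean held-out MSE of OLS (with intercept) over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr = np.column_stack([np.ones(len(tr)), X[tr]])
        Xte = np.column_stack([np.ones(len(te)), X[te]])
        beta, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
        errs.append(np.mean((y[te] - Xte @ beta) ** 2))
    return float(np.mean(errs))

# Toy check: dropping a truly relevant predictor inflates the CV error.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
print(kfold_mse(X, y), kfold_mse(X[:, :1], y))
```

The same principle underlies the comparison between [regsub.best] and [regsub.cv.best]: the model with the lower average held-out error is preferred.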

9.2 THIRD MANUAL VARIABLE SELECTION (FROM 8 TO 5 PREDICTORS)

In Automated models, we experimented with different methods, compared the resulting models and selected [regsub.best] and [regsub.cv.best] as the best ones. We have no doubts about the importance of the four predictors selected in [regsub.cv.best], and we look for further confirmation about the fifth predictor (Stage) included in [regsub.best]. We also want to make sure that no relevant variable has been excluded.

In order to confirm the automated model results, we will manually add one variable at a time to a [basic.model] (including only the four most relevant predictors, which we definitely want in our models). We will then compare the regression results to the correlation analysis conducted in paragraph 4.4, to see how the significance of the coefficients and the performance of our model change as new predictors are introduced.

9.2.1 Basic.model

In [basic.model] we include only the most relevant predictors, those that we cannot omit as explanatory variables in our models.

9.2.1.1 Continuous variables

Based on our correlation analysis, conducted in paragraph 4.2, lAmount_raised and lPrev_raised show the tightest relationships with the dependent variable lPre_valuation (Kendall’s tau coefficients are 0.59 and 0.43 respectively, both highly significant). As the scatterplots showed (figures 4.23 - 4.25), these relationships follow a linear trend. For these reasons, our [basic.model] formula will be:

lPre_valuation ~ lAmount_raised + lPrev_raised

When applied to the full pre-processed data set, we obtain:

Model: basic.model
Residuals:
      Min       1Q   Median       3Q      Max
 -1.49462 -0.42526  0.00596  0.36189  1.93327


Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     5.510e+00  4.508e-01  12.222  < 2e-16 ***
lAmount_raised  7.059e-01  3.184e-02  22.173  < 2e-16 ***
lPrev_raised    2.743e-08  6.177e-09   4.441 1.28e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6491 on 283 degrees of freedom
Multiple R-squared: 0.7302, Adjusted R-squared: 0.7283
F-statistic: 382.9 on 2 and 283 DF, p-value: < 2.2e-16

All estimated coefficients are strongly significant, the intercept is positive, and the adjusted R-squared is above 72%. With only two predictors and df = 283, the risk of overfitting is negligible.
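The adjusted R-squared quoted throughout this chapter penalises the plain R-squared for the number of predictors used. A small Python sketch of the formula, applied to hypothetical data shaped like [basic.model] (two predictors, n = 286; the coefficients below are illustrative, not the fitted thesis values):

```python
import numpy as np

def adjusted_r2(y, yhat, n_predictors):
    """Adjusted R^2: penalises R^2 for the number of predictors used."""
    n = len(y)
    ss_res = float(np.sum((y - yhat) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Toy two-predictor fit mirroring the structure of [basic.model].
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(286), rng.normal(size=(286, 2))])
y = X @ np.array([5.5, 0.7, 0.3]) + rng.normal(scale=0.65, size=286)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
a = adjusted_r2(y, X @ beta, n_predictors=2)
print(round(a, 3))
```

With few predictors relative to n, the adjustment is small, which is why the two values reported above (0.7302 vs 0.7283) are so close.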

The MLR BLUE assumptions are overall respected, except at the highest observed values of lPre_valuation, where residuals tend to increase. Six outliers are identified, two of which are particularly extreme (observations 259 and 281), so we remove them before proceeding with model optimisation (see figure 9.4, showing Cook’s distances with an indicative red line set at 4*mean(cooksd)).

Figure 9.4: For basic.model, on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
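Cook's distances and the 4*mean(cooksd) rule of thumb used above can be sketched as follows. The thesis computes the distances in R; this Python illustration uses synthetic data with one deliberately planted outlier:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distances for an OLS fit; X must already contain the intercept column."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)
    resid = y - H @ y                       # OLS residuals
    s2 = float(resid @ resid) / (n - p)     # residual variance estimate
    return resid ** 2 / (p * s2) * h / (1 - h) ** 2

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X[:, 1] + rng.normal(scale=0.2, size=50)
y[0] += 5.0                                 # plant one gross outlier
d = cooks_distance(X, y)
flagged = np.where(d > 4 * d.mean())[0]     # indicative threshold used in the thesis
print(flagged)
```

Points above the 4*mean threshold (the red line in Figure 9.4) are candidates for removal, exactly as done with observations 259 and 281.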

9.2.1.1.1 Interaction term

As lPrev_raised and lAmount_raised are moderately correlated with each other, we wonder whether adding an interaction term between these two variables would improve the model.

Model: interaction.basic.model
Residuals:
      Min       1Q   Median       3Q      Max
 -1.68162 -0.41095  0.01992  0.34764  2.02972


Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  5.472e+00  4.499e-01  12.161   <2e-16 ***
lAmount_raised               7.067e-01  3.174e-02  22.267   <2e-16 ***
lPrev_raised                 1.677e-07  8.298e-08   2.021   0.0442 *
lAmount_raised:lPrev_raised -8.088e-09  4.771e-09  -1.695   0.0912 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.647 on 282 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7301
F-statistic: 257.9 on 3 and 282 DF, p-value: < 2.2e-16

The estimated coefficient of the interaction term is extremely small and not significant at the 5% level, while the other estimates barely change. The ANOVA test and the model performance metrics confirm that no significant improvement is provided. This means that the effect of lAmount_raised on lPre_valuation is not influenced by lPrev_raised (and vice versa). So, we retain our [basic.model] formula as the starting point for building the best model.

9.2.1.1.2 Closing_Year

The third predictor we want to add to [basic.model] is Closing_Year, as strongly suggested by all automated models.

Model: basic.model + Closing_Year
Residuals:
      Min       1Q   Median       3Q      Max
 -1.48129 -0.45169 -0.03373  0.38469  1.92212

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -8.759e+01  3.154e+01  -2.777  0.00586 **
lAmount_raised  6.239e-01  3.531e-02  17.670  < 2e-16 ***
lPrev_raised    5.037e-08  9.126e-09   5.520 7.88e-08 ***
Closing_Year    4.674e-02  1.570e-02   2.977  0.00317 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6189 on 274 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared: 0.705, Adjusted R-squared: 0.7017
F-statistic: 218.2 on 3 and 274 DF, p-value: < 2.2e-16

Its contribution is positive and highly significant, and the intercept is now negative. The adjusted R2 has slightly decreased, because three observations were deleted due to missingness. We cannot compare its performance to [basic.model] through ANOVA, because the sample sizes differ (three NAs in Closing_Year). We have no hesitation in accepting this predictor into our [basic.model].

9.2.1.2 Categorical variables

Looking at the correlations between lPre_valuation and the categorical variables (paragraph 4.4), Stage is a potentially meaningful predictor, and the interaction between Type_Lead_investor and Stage also appears promising. Finally, we do not expect Revenue to be a significant predictor, although this is partly due to its large proportion of missing values.12

In the following paragraphs we will manually test these intuitions concerning categorical variables, by adding them to [basic.model] one by one.

9.2.1.2.1 Type_Lead_Investor (TLI)

Automated models suggest including the group Institutional Strategic (belonging to the variable Type_Lead_Investor) in the model. Let us compare models including the full variable and the variable with pooled levels. The automatic procedure suggests that there is no significant difference between the groups “Institutional Financial” and “Acc/Inc/PA” (Accelerators/Incubators/Private Angels), both belonging to the categorical variable Type_Lead_Investor (TLI). Therefore, by “pooled levels” we mean that we merged these two groups into a single group called “Non-strategic investments”, to distinguish it from the other TLI group, “Institutional Strategic”.
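The pooling described above amounts to recoding the factor levels before fitting. A minimal Python sketch (level names taken from the text; the sample values are illustrative):

```python
# Map each original TLI level to its pooled level.
pool = {
    "Acc/Inc/PA": "Non-strategic investments",
    "Institutional Financial": "Non-strategic investments",
    "Institutional Strategic": "Institutional Strategic",
}

# Hypothetical column of lead-investor types for four funding rounds.
tli = ["Acc/Inc/PA", "Institutional Strategic", "Institutional Financial", "Acc/Inc/PA"]
tli_pooled = [pool[level] for level in tli]
print(sorted(set(tli_pooled)))  # the three original levels collapse to two
```

After pooling, the regression needs one dummy variable instead of two for this factor, which is why the pooled model below gains a degree of freedom.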

In the next paragraph we will also test the interaction term TLI:Stage (as suggested by our previous correlation analysis), and compare performances.

Model: basic.model + TLI
Residuals:
     Min      1Q  Median      3Q     Max
 -1.4723 -0.4208 -0.0169  0.3713  1.9430

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -9.225e+01  3.285e+01  -2.808  0.00536 **
lAmount_raised                             5.865e-01  3.620e-02  16.201  < 2e-16 ***
lPrev_raised                               6.329e-08  1.100e-08   5.754 2.44e-08 ***
Closing_Year                               4.925e-02  1.635e-02   3.011  0.00286 **
Type_Lead_InvestorInstitutional Financial  9.420e-02  9.522e-02   0.989  0.32344
Type_Lead_InvestorInstitutional Strategic  3.361e-01  1.636e-01   2.055  0.04091 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5967 on 260 degrees of freedom
Multiple R-squared: 0.7075, Adjusted R-squared: 0.7018
F-statistic: 125.8 on 5 and 260 DF, p-value: < 2.2e-16

The next model is identical to the previous one, except that for the predictor Type_Lead_Investor we pooled the groups Acc/Inc/PA and Institutional Financial into the new group Non-strategic investors.

12 Suggestion for further research: by expanding Revenue’s data availability, we expect this predictor to be a significant one (see chapter 7)

Model: basic.model + TLI (pooled)
Residuals:
      Min       1Q   Median       3Q      Max
 -1.53913 -0.43311 -0.01314  0.32959  1.89674

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -7.760e+01  3.205e+01  -2.421  0.01615 *
lAmount_raised                             5.910e-01  3.591e-02  16.458  < 2e-16 ***
lPrev_raised                               9.771e-08  1.489e-08   6.560 2.89e-10 ***
Closing_Year                               4.197e-02  1.595e-02   2.631  0.00903 **
Type_Lead_InvestorInstitutional Strategic  3.099e-01  1.406e-01   2.204  0.02839 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5947 on 260 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.7201, Adjusted R-squared: 0.7158
F-statistic: 167.2 on 4 and 260 DF, p-value: < 2.2e-16

Figure 9.5: For the model “basic.model + TLI (pooled)”, on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.

We remove the outliers 268 and 278: the outlier analysis shows that their Bonferroni-adjusted p-values stand below the 0.05 threshold (Figure 9.5).
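The Bonferroni-adjusted outlier test mentioned above is based on externally studentized residuals. A Python sketch on synthetic data follows; for brevity it uses a normal approximation to the t tail instead of the exact t distribution, and the planted outlier is illustrative:

```python
import math
import numpy as np

def bonferroni_outliers(X, y, alpha=0.05):
    """Flag points whose externally studentized residual has a Bonferroni-adjusted
    p-value below alpha (normal approximation to the t tail)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = float(e @ e) / (n - p)
    # Leave-one-out residual variance, so each point is judged by a fit without it.
    s2_loo = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_loo * (1 - h))
    pvals = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])  # two-sided
    return np.where(np.minimum(pvals * n, 1.0) < alpha)[0]

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(60), rng.normal(size=60)])
y = X[:, 1] + rng.normal(scale=0.3, size=60)
y[5] += 4.0  # plant one gross outlier
out = bonferroni_outliers(X, y)
print(out)
```

Multiplying each p-value by n is the Bonferroni correction: it guards against flagging a point as an outlier merely because n residuals were inspected.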

The ANOVA test, internal measures, cross-validation and bootstrap tests do not find a significant difference between these two models (non-pooled and pooled). The adjusted R2 and predicted R2 are slightly higher for the second model, because of the reduced number of predictors. In any case, we obtained confirmation that TLI is indeed a relevant predictor, as suggested by the automated models, so we include it in our [basic.model], which now has the following formula:

lPre_valuation ~ lAmount_raised + lPrev_raised + Closing_Year + TLI

9.2.1.2.2 Stage

Our correlation analysis reports the importance both of this variable and of its interaction term with Type_Lead_Investor. The [regsub.best] model suggests pooling Stage, by merging the Early-stage and First Clients groups, while the [regsub.cv.best] model does not consider this predictor. We start with the individual variable and compare performances when Type_Lead_Investor and Stage are pooled or not.

Model: basic.model + Stage (non-pooled)
Residuals:
      Min       1Q   Median       3Q      Max
 -1.26874 -0.27839  0.00519  0.31994  1.30212

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                 -1.557e+02  5.564e+01  -2.798  0.00615 **
lAmount_raised               3.475e-01  7.378e-02   4.710 7.83e-06 ***
lPrev_raised                 1.106e-07  2.170e-08   5.097 1.59e-06 ***
Closing_Year                 8.235e-02  2.778e-02   2.965  0.00377 **
TLI_Institutional Financial -4.243e-02  1.232e-01  -0.344  0.73129
TLI_Institutional Strategic  3.032e-01  1.510e-01   2.007  0.04737 *
StageFirst Clients           2.794e-02  1.209e-01   0.231  0.81767
StageLater-stage             2.281e-01  1.375e-01   1.659  0.10017
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4869 on 102 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7018, Adjusted R-squared: 0.6813
F-statistic: 34.3 on 7 and 102 DF, p-value: < 2.2e-16

By introducing the predictor Stage, we lose 158 degrees of freedom, because of its large proportion of NAs. TLI_Inst.Financial and Stage are not significant at the 5% level.

Model: basic.model + Stage (pooled)
Residuals:
      Min       1Q   Median       3Q      Max
 -1.29992 -0.27916  0.00431  0.33149  1.27445

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -1.513e+02  4.743e+01  -3.190  0.00188 **
lAmount_raised                             3.432e-01  7.235e-02   4.744 6.70e-06 ***
lPrev_raised                               1.109e-07  2.137e-08   5.191 1.04e-06 ***
Closing_Year                               8.021e-02  2.373e-02   3.379  0.00102 **
Type_Lead_InvestorInstitutional Strategic  3.299e-01  1.273e-01   2.592  0.01093 *
StageLater-stage                           2.156e-01  1.125e-01   1.917  0.05793 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4827 on 104 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7012, Adjusted R-squared: 0.6868
F-statistic: 48.81 on 5 and 104 DF, p-value: < 2.2e-16

By pooling the variables’ levels, we gain 2 degrees of freedom, and all coefficients are now significant at the 5% level. The ANOVA test does not find any significant difference between the two models. So, in figures 9.6 and 9.7 we compare their performances in more detail (TLI = Type_Lead_Investor).

All measures drastically improved by adding these two variables, compared to [basic.model without TLI]. Moreover, all metrics show that the pooled model performs slightly better, even if the difference appears to be very small.

[Bar chart comparing PRESS, BIC and AIC across three models: basic.model without TLI; basic.model + TLI + Stage; basic.model + TLI (pooled) + Stage (pooled)]

Figure 9.6: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model.

9.2.1.2.3 Interaction effect Stage:TLI

The results obtained in the previous models stress the importance of the predictors Stage and Type_Lead_Investor (TLI). In the previous correlation analysis (paragraph 4.4) we also found clues about the impact of their interaction on lPre_valuation. Let us now test it.

Model: basic.model + TLI * Stage
Residuals:
     Min      1Q  Median      3Q     Max
 -1.1698 -0.2697  0.0467  0.2376  1.4641

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            8.634e+00  9.569e-01   9.023 1.48e-14 ***
lAmount_raised                         4.449e-01  6.982e-02   6.372 5.92e-09 ***
lPrev_raised                           9.225e-08  2.158e-08   4.275 4.41e-05 ***
TLI_Inst.Financial                     1.983e-01  1.703e-01   1.164 0.247064
TLI_Inst.Strategic                     1.329e+00  2.410e-01   5.515 2.79e-07 ***
StageFirst Clients                     3.333e-01  1.592e-01   2.093 0.038884 *
StageLater-stage                       7.385e-01  1.971e-01   3.747 0.000301 ***
TLI_Inst.Financial:StageFirst Clients -1.025e-02  2.367e-01  -0.043 0.965541
TLI_Inst.Strategic:StageFirst Clients -1.265e+00  3.266e-01  -3.872 0.000194 ***
TLI_Inst.Financial:StageLater-stage   -3.297e-01  2.482e-01  -1.329 0.187035
TLI_Inst.Strategic:StageLater-stage   -1.360e+00  3.318e-01  -4.098 8.54e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4654 on 99 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7356, Adjusted R-squared: 0.7089
F-statistic: 27.54 on 10 and 99 DF, p-value: < 2.2e-16

Model: basic.model + TLI * Stage (pooled)
Residuals:
      Min       1Q   Median       3Q      Max
 -1.27517 -0.32262  0.02451  0.32777  1.23082

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                          9.001e+00  9.678e-01   9.300 2.46e-15 ***
lAmount_raised                       4.349e-01  6.911e-02   6.293 7.52e-09 ***
lPrev_raised                         1.150e-07  2.212e-08   5.199 1.01e-06 ***
TLI_Inst.Strategic                   5.445e-01  1.670e-01   3.261   0.0015 **
StageLater-stage                     3.770e-01  1.270e-01   2.970   0.0037 **
TLI_Inst.Strategic:StageLater-stage -4.897e-01  2.696e-01  -1.816   0.0722 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5006 on 104 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.6786, Adjusted R-squared: 0.6631
F-statistic: 43.92 on 5 and 104 DF, p-value: < 2.2e-16

[Bar chart "Pooled Vs Original groups" comparing RMSE, MAE, R2 and prediction error rate (internal measures, k-fold 5.5, LOOCV, validation set, bootstrap) for the models TLI + Stage and TLI + Stage (pooled)]

Figure 9.7: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to Standard error, which in literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds, and we repeat cross validation five times. The final model error is taken as the mean error from the number of repeats.

By comparing the non-pooled models with and without the interaction term, ANOVA finds a significant difference:

Model 1: lPre_valuation ~ lAmount_raised + lPrev_raised + Closing_Year + TLI + Stage
Model 2: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage
  Res.Df    RSS Df Sum of Sq  Pr(>Chi)
1    102 24.181
2     99 21.444  3    2.7374  0.005489 **

If we then compare the pooled and non-pooled models, both with the interaction terms, the difference is again significant:

  Res.Df    RSS Df Sum of Sq   Pr(>Chi)
1     99 21.444
2    104 26.064 -5     -4.62  0.0007017 ***
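The first of these comparisons can be reproduced from the reported residual sums of squares and degrees of freedom. The thesis reports a chi-square p-value; the sketch below computes the analogous F statistic of the extra-sum-of-squares test, which tells the same story:

```python
def partial_f(rss_small, rss_big, df_small, df_big):
    """F statistic for comparing nested OLS models (extra sum of squares test)."""
    return ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)

# Values taken from the ANOVA table above: RSS 24.181 on 102 df without the
# interaction, RSS 21.444 on 99 df with it.
f = partial_f(24.181, 21.444, 102, 99)
print(round(f, 2))
```

An F value around 4.2 on (3, 99) degrees of freedom is well beyond the 5% critical value, consistent with the significant chi-square p-value reported above.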

In figures 9.8 – 9.9 we compare performances more in detail.

From these metrics, the best model always turns out to be [basic.model + TLI*Stage] with non-pooled groups. The only exception is BIC, which is minimised by [basic.model + TLI + Stage]. Nonetheless, as all the other measures favour [basic.model + TLI*Stage, original], we can state that this is the best model so far. Note that the prediction error rate, measured through the validation set, is only 2.1% for this model, and around 3.1% when measured by the other methods.

Now that we have tested the effect of all predictors selected by the automated models, we want to investigate the effect of including two further variables in our best model so far, [basic.model + TLI*Stage, non-pooled]: Industry and Revenue.

9.2.1.2.4 Industry

During the correlation analysis (paragraph 4.4.2.2), we showed that, even after pooling some minor groups of the categorical variable Industry together, their means are not significantly different from one another. So, we presumed that adding this explanatory variable to [basic.model + TLI*Stage] would not be useful. Indeed, automated selection methods excluded Industry from their best regression models.

[Bar chart "Variations to basic.model" comparing PRESS, BIC and AIC for the models basic.model + TLI + Stage, TLI*Stage, and TLI*Stage (pooled)]

Figure 9.8: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model.

Model: basic.model + TLI * Stage, non-pooled + Industry
Residuals:
      Min       1Q   Median       3Q      Max
 -1.16817 -0.27212  0.05026  0.23742  1.31437

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            8.896e+00  9.889e-01   8.996 2.31e-14 ***
lAmount_raised                         4.176e-01  7.154e-02   5.838 7.33e-08 ***
lPrev_raised                           9.445e-08  2.135e-08   4.424 2.58e-05 ***
TLI_Inst.Financial                     9.192e-02  1.742e-01   0.528 0.598942
TLI_Inst.Strategic                     1.448e+00  2.516e-01   5.755 1.05e-07 ***
StageFirst Clients                     4.259e-01  1.680e-01   2.535 0.012890 *
StageLater-stage                       8.008e-01  2.101e-01   3.812 0.000245 ***
IndustryFintech                        2.346e-01  2.600e-01   0.902 0.369302
IndustryBiotech                        7.638e-01  3.670e-01   2.081 0.040087 *
IndustryMedtech/Healthcare             2.503e-01  1.502e-01   1.666 0.098914 .
IndustryICT                            1.205e-02  1.210e-01   0.100 0.920888
TLI_Inst.Financial:StageFirst Clients  1.098e-01  2.416e-01   0.455 0.650420
TLI_Inst.Strategic:StageFirst Clients -1.382e+00  3.309e-01  -4.176 6.59e-05 ***
TLI_Inst.Financial:StageLater-stage   -1.787e-01  2.577e-01  -0.693 0.489766
TLI_Inst.Strategic:StageLater-stage   -1.408e+00  3.658e-01  -3.849 0.000215 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.459 on 95 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7532, Adjusted R-squared: 0.7169
F-statistic: 20.71 on 14 and 95 DF, p-value: < 2.2e-16


[Bar chart "Variations to basic.model without TLI" comparing RMSE, MAE, R2 and prediction error rate (internal measures, k-fold 5.5, LOOCV, validation set, bootstrap) for the models TLI + Stage, TLI*Stage original, and TLI*Stage pooled]

Figure 9.9: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to Standard error, which in literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds, and we repeat cross validation five times. The final model error is taken as the mean error from the number of repeats.


[Bar chart "Variations to basic.model" comparing RMSE, MAE, R2 and prediction error rate (internal measures, k-fold 5.5, LOOCV, validation set, bootstrap) for the models TLI*Stage and TLI*Stage + Industry]

Figure 9.10: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to Standard error, which in literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds, and we repeat cross validation five times. The final model error is taken as the mean error from the number of repeats.


[Bar chart "Variations to basic.model" comparing PRESS, BIC and AIC for the models TLI*Stage and TLI*Stage + Industry]

Figure 9.11: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model.

Instead, by setting Others as the reference group, all coefficients are influential and almost significant. The least significant is ICT which, by contrast, revealed the clearest growing trend in the correlation analysis. The adjusted R2 improves, but the ANOVA test indicates no significant difference between the models:

Model 1: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage
Model 2: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage + Industry

  Res.Df    RSS Df Sum of Sq Pr(>Chi)
1     99 21.444
2     95 20.011  4    1.4329   0.1467

Comparing the other metrics gives discordant results: adding Industry generally improves the internal measures but worsens the performance tests via CV and bootstrap (see the results in Figures 9.10 - 9.11). All this suggests that, by adding the predictor Industry to our best model, we would move towards an overfitting problem. So, we decide to definitively exclude it.
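The reference-group choice discussed above corresponds to the level that is dropped when a factor is dummy-coded: every other level's coefficient is then measured relative to it. A Python sketch of this encoding, with hypothetical Industry values:

```python
import numpy as np

def dummy_code(values, reference):
    """One-hot encode a categorical variable, dropping the chosen reference level
    (the convention R's lm() follows with factor contrasts)."""
    levels = [lev for lev in sorted(set(values)) if lev != reference]
    return levels, np.array([[int(v == lev) for lev in levels] for v in values])

# Hypothetical Industry column; "Others" as the reference group, as in the text.
levels, D = dummy_code(["ICT", "Others", "Biotech", "ICT", "Fintech"], reference="Others")
print(levels)
print(D)
```

A row of all zeros denotes the reference level, so the intercept absorbs the Others group and each dummy coefficient reads as a contrast against it.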

9.2.1.2.5 Revenue

During our correlation analysis (Chapter 4), we showed that, even after pooling some minor groups of Revenue together, their means are not significantly different from one another. So, we presumed that adding this explanatory variable to [basic.model] would not be useful. Indeed, automated selection methods excluded Revenue from their best regression models.

Model: TLI*Stage, non-pooled + Revenue
Residuals:
      Min       1Q   Median       3Q      Max
 -1.15306 -0.24997  0.04723  0.26149  1.47081

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            8.622e+00  9.970e-01   8.649 1.27e-13 ***
lAmount_raised                         4.453e-01  7.238e-02   6.152 1.81e-08 ***
lPrev_raised                           8.569e-08  2.280e-08   3.759 0.000295 ***
TLI_Institutional Financial            2.127e-01  1.738e-01   1.224 0.224009
TLI_Institutional Strategic            1.359e+00  2.470e-01   5.504 3.14e-07 ***
StageFirst Clients                     3.667e-01  1.958e-01   1.873 0.064099 .
StageLater-stage                       7.109e-01  2.495e-01   2.850 0.005365 **
Revenue50k - 1M                       -2.617e-02  1.397e-01  -0.187 0.851768
Revenue> 1M                            1.266e-01  2.266e-01   0.558 0.577859
TLI_Inst.Financial:StageFirst Clients -1.864e-02  2.457e-01  -0.076 0.939691
TLI_Inst.Strategic:StageFirst Clients -1.287e+00  3.338e-01  -3.855 0.000211 ***
TLI_Inst.Financial:StageLater-stage   -3.119e-01  2.554e-01  -1.221 0.224988
TLI_Inst.Strategic:StageLater-stage   -1.383e+00  3.506e-01  -3.944 0.000153 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4722 on 95 degrees of freedom
  (172 observations deleted due to missingness)
Multiple R-squared: 0.7235, Adjusted R-squared: 0.6886
F-statistic: 20.71 on 12 and 95 DF, p-value: < 2.2e-16

The Revenue coefficients are influential but not significant. Both the internal measures and the performance tests via CV and bootstrap give worse results than [basic.model + TLI*Stage, non-pooled]. Therefore, we definitively exclude the variable Revenue from the predictors of our best model.

9.2.2 Results

Based on the tests conducted above, the result of the third manual variable selection is to exclude Industry, Revenue, and Closing_Year_factor from the predictors of our best model, while including:

• lAmount_raised
• lPrev_raised
• Closing_Year
• Type_Lead_Investor
• Stage

and potentially the interaction term Type_Lead_Investor:Stage. This conclusion is in accordance with the automated model selection, but it does not additionally pool groups together, and it suggests an interaction term between Stage and TLI.

9.3 INTERACTION TERMS

After our third variable selection, we finally ask whether we omitted some important interaction terms from our best model. To investigate this, we run automated models on a new data frame, including only the five predictors selected in the third manual variable selection, plus all their possible interactions. The results of Stepwise regression and Best Subset Selection follow.

9.3.1 Stepwise regression

By applying Stepwise regression, with direction = ”both”, we obtain the following optimal model, which we call [A]:

Call: lm(lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised)

Residuals:
      Min       1Q   Median       3Q      Max
 -1.56505 -0.27575  0.04329  0.28700  1.04449

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -2.917e+02  5.857e+01  -4.980 2.63e-06 ***
Closing_Year                              1.495e-01  2.923e-02   5.117 1.49e-06 ***
TLI_Institutional Financial               6.188e+01  9.606e+01   0.644 0.520912
TLI_Institutional Strategic               5.421e+02  1.520e+02   3.567 0.000555 ***
lPrev_raised                              7.204e-05  3.024e-05   2.383 0.019062 *
lAmount_raised                            3.980e-01  6.418e-02   6.201 1.24e-08 ***
Closing_Year:TLI_Institutional Financial -3.077e-02  4.765e-02  -0.646 0.519913
Closing_Year:TLI_Institutional Strategic -2.688e-01  7.538e-02  -3.565 0.000557 ***
Closing_Year:lPrev_raised                -3.565e-08  1.499e-08  -2.379 0.019240 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4351 on 101 degrees of freedom
Multiple R-squared: 0.7642, Adjusted R-squared: 0.7455
F-statistic: 40.91 on 8 and 101 DF, p-value: < 2.2e-16
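Stepwise regression greedily adds (and, with direction = "both", also removes) one predictor at a time according to an information criterion. The thesis uses R's AIC-based stepwise procedure; the Python sketch below is a simplified, forward-only version on synthetic data, included to illustrate the mechanics:

```python
import numpy as np

def aic(X, y, cols):
    """AIC (up to an additive constant) of OLS with intercept on the given columns."""
    n = len(y)
    Xc = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    rss = float(np.sum((y - Xc @ beta) ** 2))
    return n * np.log(rss / n) + 2 * Xc.shape[1]

def forward_stepwise(X, y):
    """Add whichever predictor lowers AIC most; stop when no addition helps."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(X, y, selected)  # start from the intercept-only model
    while remaining:
        cand = {j: aic(X, y, selected + [j]) for j in remaining}
        j = min(cand, key=cand.get)
        if cand[j] >= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = cand[j]
    return sorted(selected)

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)
sel = forward_stepwise(X, y)
print(sel)
```

AIC's penalty of 2 per parameter is what stops the greedy search from absorbing noise predictors; swapping in BIC's log(n) penalty would make the selection stricter.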

9.3.2 Best Subset Selection

The results of Best Subset Selection (Figure 9.12) favour models with between five and seven predictors: lPrev_raised, lAmount_raised, Closing_Year:TLI, Closing_Year:lAmount_raised, Closing_Year:lPrev_raised, Inst.Strategic, and Stage:lAmount_raised. We call this model [B].

Figure 9.12: The Figure shows three graphs to analyse the results of the Best Subset Selection method, applied via the software R. On the abscissa is represented the number of variables in the model, while on the ordinate is shown the resulting value of the following metrics: adjusted R squared, Cp, and BIC. The optimal value for each metric is coloured in red.

See the comparison of these two best models, [A] and [B], in paragraph 6.5.

9.4 FURTHER METHODS USED TO SELECT THE BEST MODEL

We did not limit our best model selection to the methods in this chapter: we also experimented with some non-linear regression and other approaches. As they did not directly add information in explaining our dependent variable lPre_valuation, for reasons of brevity we do not show their results in this report as we did for the others. Nevertheless, we want to mention that no significant improvement could be achieved by applying and experimenting with the following methods (Rao, 2014; James et al., 2013), via the software R:

− GLM (Generalized Linear Models)
− Curve Fitting using Polynomial Terms
− Fractional exponents (a small but significant improvement was provided; however, residual plots showed the model moving further from meeting the MLR assumptions)
− Spline
− Log function of predictors
− Loess regression
− Kernel regression

We compared the performance of the models created by applying these methods with the performance of [A], and we concluded that [A] remains the best model for the goal of our research. The application of GLM regression, in particular, precisely confirmed [A] as the best model, which is a positive robustness check.