
Count Data in Finance∗

Jonathan Cohn University of Texas-Austin

Zack Liu University of Houston

Malcolm Wardlaw University of Georgia

March 2021

Abstract

This paper examines the use of count data-based outcome variables such as corporate patents in empirical corporate finance research. We demonstrate that the common practice of regressing the log of one plus the count on covariates ("LOG1PLUS" regression) produces biased and inconsistent estimates of objects of interest and lacks meaningful interpretation. Poisson regressions have simple interpretations and produce unbiased and consistent estimates under standard exogeneity assumptions, though they lose efficiency if the count data is overdispersed. Replicating several recent papers on corporate patenting, we find that LOG1PLUS and Poisson regressions frequently produce meaningfully different estimates and that bias in LOG1PLUS regressions is likely large.

∗Jonathan Cohn: [email protected], (512) 232-6827. Zack Liu: [email protected], (713) 743-4764. Malcolm Wardlaw: [email protected], (706) 204-9295. We would like to thank Jason Abrevaya, Andres Almazan, John Griffin, Travis Johnson, Sam Krueger, Aaron Pancost, James Scott, Sheridan Titman, Jeff Wooldridge, and participants in the Virtual Finance Seminar and seminar at the University of Texas at Austin for valuable feedback.

A growing number of papers in empirical corporate finance study outcome variables that are inherently count-based. For example, 44 papers published in "top three" finance journals in recent years estimate the effects of various forces on a company's patent and/or patent citation counts. A key challenge in working with count data is that count variables, being bounded below by zero, often exhibit strong right-skewness. To address concerns about efficiency and outlier risk, researchers often log-transform highly-skewed dependent variables before estimating linear regressions. However, count data sets often contain many zero values, and the logarithm of zero is undefined. The most commonly used approach in finance to addressing this complication is to add a constant - typically 1 - to the count before log-transforming it. We refer to OLS regression of the log of 1 plus a count variable on covariates as "LOG1PLUS" regression. Of the 44 papers referenced above, 25 use LOG1PLUS regression as their primary econometric approach, and 23 use it exclusively.

Despite its widespread use, little work has been done to examine the properties of estimates based on LOG1PLUS regression and whether these properties provide a reasonable and accurate test of underlying economic hypotheses. In this paper, we analyze the LOG1PLUS approach as well as alternative approaches. We formalize the often unspoken assumptions behind different regression models, conduct simulations to explore the statistical properties of the estimates they produce, and compare these estimates using replicated data sets from existing papers. We illustrate how OLS regressions using log-transformed outcomes can produce biased and incorrectly signed estimates of economic relationships and provide guidance for future research in finance-related applications involving zero-bounded count data.

How does one interpret estimates from a LOG1PLUS regression? A standard log-levels regression coefficient has a simple interpretation in terms of a semi-elasticity - the percentage change in the outcome variable associated with a one unit change in the explanatory variable. While one might imagine that LOG1PLUS regression estimates have the same interpretation because the added constant is invariant to the covariates, this intuition is wrong.

The semi-elasticity of a variable and the semi-elasticity of the sum of a constant and that variable are not equivalent, nor can one easily be transformed into the other.

In univariate regressions, the addition of the constant biases LOG1PLUS regression estimates towards zero. They may therefore be seen as representing lower bound estimates of semi-elasticities. The effect of the bias is more complex in a multivariate regression setting. We show in simulations that this bias can be large and can produce estimates with the wrong sign, even when there is no sampling error. Thus, when interpreting LOG1PLUS regression output, a researcher might incorrectly conclude that a policy variable has a particular directional effect on the count outcome when it actually has no effect or even the opposite effect.

The addition of the constant is not the only source of estimation bias in LOG1PLUS regression. In the context of trade model regressions, Silva and Tenreyro (2006) show that the log-transformation of an outcome variable can produce biased regression estimates if the error in the original (i.e., untransformed) variable is heteroskedastic, as is likely in most applications. The nonlinear nature of the log transformation translates a correlation between a covariate and the variance of the error in the original variable into a relationship between the covariate and the mean of the implied error in the logged variable. We show that the same bias exists in LOG1PLUS regressions. Simulations suggest that this bias can also be large and can cause the expected value of estimated coefficients to have the wrong sign. We show that positive (negative) correlation between the variance of the error and a covariate results in a downward (upward) biased coefficient.

One alternative to estimating an OLS regression in general is to treat the outcome variable as a count process and estimate a count regression model. Among count models, the Poisson model, which connects the outcome with a linear function of covariates through an exponential link function, has two unique and useful features. First, its coefficients are interpretable as semi-elasticity estimates (Wooldridge, 2010, p. 726). They also have fairly

simple interpretations in terms of a linear conditional expectation function (CEF). Second, the Poisson model admits separable fixed effects - effectively a prerequisite for use in corporate finance applications.1 While Poisson model estimates lose efficiency if the model's conditional mean-variance equality restriction is not satisfied in the data, they remain unbiased and consistent as long as the standard conditional mean independence assumption holds. While computational constraints may have been a practical issue for estimating fixed effects Poisson models in the past, the PPMLHDFE module for Stata implements a pseudo-maximum likelihood approach based on Correia et al. (2020) that allows for speedy convergence of even high-dimensional fixed effects Poisson regressions.

More flexible count regression models such as the negative binomial model or zero-inflated models relax the conditional mean-variance restriction and are therefore likely to be more efficient in many settings. Negative binomial regression allows for overdispersion, where the conditional variance exceeds the conditional mean. Zero-inflated models (e.g., zero-inflated Poisson and zero-inflated negative binomial) estimate intensive and extensive margins separately, allowing for excessive zero values in the distribution of the outcome. However, none of these models admit separable fixed effects. While one can include group dummy variables when estimating these models, the lack of separability results in biased estimates due to the incidental parameters problem (Lancaster, 2000).

Another alternative approach is to model the count outcome as a rate and estimate linear regressions where the rate is the dependent variable. Doing so requires the availability of a suitable "exposure" variable that captures the level of activity creating a baseline exposure to the outcome. For example, the number of employees at an establishment is a natural exposure variable for the number of workplace injuries at the establishment in a given period of time (Cohn and Wardlaw, 2016; Cohn et al., 2020). Scaling the count outcome by the exposure

1Poisson fixed effects are multiplicatively separable, while fixed effects in linear models are additively separable.

variable results in a rate of events per unit of exposure (e.g., workplace injuries per employee). OLS regression with a rate dependent variable produces estimates with a simple linear CEF interpretation. Our simulations suggest that rate regression is more efficient than Poisson regression when overdispersion is moderate. Unfortunately, a suitable exposure variable is not available in many settings.

After considering the econometric properties of various estimators when working with count data, we replicate data sets from five papers in the innovation literature, all of which use patent-based dependent variables. We are able to approximately replicate the main results of all five papers using the same regression models as the papers themselves. However, we find that the sign and/or order of magnitude of estimates from LOG1PLUS and Poisson regressions using the same data set frequently differ. Since Poisson regressions produce unbiased estimates of semi-elasticities under standard exogeneity conditions, this difference suggests that the bias in LOG1PLUS regression estimates may be large in practice.

To further understand what drives differences in Poisson and LOG1PLUS estimates, we decompose the observed differences from four of our replication exercises into three parts. Two of these parts correspond to the two sources of bias in LOG1PLUS regressions - the addition of the constant and heteroskedasticity in the count process. We find that the former drives substantial differences in all four cases, causing signs to disagree in two of the four. The latter drives a majority of the difference in one of the four cases and substantial differences in another. Overall, both sources of bias appear quantitatively important.2

2The third part of the decomposition accounts for the fact that Poisson regression with group fixed effects necessarily excludes units of observation belonging to a group that never patents in the data, resulting in differences in sample size. The effect of this sample difference is small in all four cases.

1 Econometrics

This section examines regression models involving outcome variables based on count data. Regression analysis using count-based outcomes presents two significant challenges. First, count data is non-negative by definition and often exhibits substantial distributional right-skewness, with many values at or near zero and a long right tail of high values. Second, many count datasets, including those most commonly analyzed in finance applications, have large numbers of zero-valued observations - often in excess of 50% of total observations. As an illustration of these distributional features, Figure 1 presents a histogram of firm-year observations of number of patents granted. We top-code the data at 100 to make the figure easier to display. Patent counts are currently the most common count data used in finance applications.

[Insert Figure 1]

One approach to working with count-based outcome variables is to treat them like traditional outcomes and estimate standard OLS regressions of the form:

y = xβ + ε, (1)

where ε is assumed to be mean zero. The identifying assumption in this regression is that E[ε|x] = 0 - that is, that the error is uncorrelated with each of the covariates in x. A significant drawback of this approach when working with count-based outcomes is that skewness of y decreases estimation efficiency and raises concerns about outlier risk. One common solution is to de-skew y by log-transforming it prior to estimation. Because the log transformation is concave, this transformation generally produces a variable with a distribution that is closer to normal.

5 With a log-transformed outcome variable, the regression equation becomes:

log(y) = xβ + ε. (2)

We refer to the OLS estimation of (2) as "LOLS regression." An immediate drawback of the LOLS model is that the logarithm of zero is undefined. Thus, estimating LOLS regressions requires excluding observations with zero-valued outcomes. The resulting sample shrinkage raises concerns not only about efficiency but also about generality, since it allows for estimation of only the intensive margin. The most common approach in finance to addressing the zero-value problem is to add an arbitrary constant - typically 1 - to the outcome before log-transforming. Doing so ensures that the transformation is defined for all possible values of y. When the constant is 1, the resulting regression equation is:

log(1 + y) = xβ + ε. (3)

We refer to OLS estimation of (3) as LOG1PLUS regression. We focus on the case where the constant added before log-transformation is one since this is the most common case. However, any positive constant will allow for a log transformation while preserving observations with zero-valued counts. We refer more generally to OLS regression where the dependent variable is log(c + y), for c > 0, as LOGcPLUS regression. While the LOG1PLUS model is the most common regression model in finance applications involving count-based outcome variables, its interpretation and econometric properties are not well understood. We explore these next. The discussion applies to the more general LOGcPLUS regression as well.
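As a concrete illustration of the estimators just defined, the following minimal sketch fits LOG1PLUS and LOLS regressions to simulated count data using Python's statsmodels. The data generating process and coefficient values are hypothetical, chosen only for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(1000, 2)))          # intercept plus two covariates
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0, -0.1])))  # simulated counts, many zeros

# LOG1PLUS regression: OLS of log(1 + y) on X; defined even when y = 0.
log1plus = sm.OLS(np.log1p(y), X).fit()

# LOLS regression: OLS of log(y) on X; zero-valued outcomes must be dropped.
pos = y > 0
lols = sm.OLS(np.log(y[pos]), X[pos]).fit()

print(log1plus.params, lols.params)
```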

1.1 LOG1PLUS regression

As long as E[ε|x] = 0, estimation of a LOG1PLUS regression by OLS will result in an unbiased estimate of the CEF of the log of 1 plus y, E[log(1 + y)|x]. However, the researcher is not generally interested in the impact of the covariates x on the expectation of an arbitrary log transformation of y. The question is whether one can infer the CEF of y from estimates of a LOG1PLUS regression. The answer is no. One might speculate that the CEF of y is E[y|x] = e^{xβ} - 1. However, in fact, E[y|x] = e^{xβ}E[e^ε] - 1, and by Jensen's inequality, E[e^ε] ≠ e^{E[ε]} = 1.

Note that this issue holds for the LOLS model as well. As a result, we generally interpret estimates from the LOLS model in terms of semi-elasticities. The regression coefficient in the LOLS regression (2) estimates the semi-elasticity of y with respect to x_j:

β_j = (∂E[y|x]/∂x_j) × (1/E[y|x]).

It might be tempting to assume that the LOG1PLUS model has a similar semi-elasticity interpretation. However, the regression coefficient in the LOG1PLUS regression (3) estimates the semi-elasticity of 1 + y with respect to x_j:

β_j = (∂E[y|x]/∂x_j) × (1/(1 + E[y|x])). (4)

It is unlikely that this quantity is the object of interest to the researcher. Thus, the LOG1PLUS regression coefficient does not have a natural interpretation in terms of either a CEF or semi-elasticity. Nor can its relationship with the semi-elasticity of y be inferred. The relationship between the two semi-elasticities is:

β_{j,LOLS} = ((1 + E[y|x])/E[y|x]) × β_{j,LOG1PLUS}. (5)

Inferring β_{j,LOLS} - the semi-elasticity of y with respect to x_j - would require knowledge of E[y|x], which, as already noted, cannot be inferred from the coefficients in a LOG1PLUS regression.

If one treats β_{j,LOG1PLUS} as an estimate of the semi-elasticity of y with respect to x_j, this estimate will be biased. Note that E[y|x] > 0 for any non-degenerate conditional distribution of y since y can only take non-negative values. Thus, 0 < E[y|x]/(1 + E[y|x]) < 1. From (5), the β_{j,LOG1PLUS} coefficient then will have the same sign as the semi-elasticity of y with respect to x_j but will be biased towards zero. For example, if E[y|x] = 1, the LOG1PLUS coefficient is only half the size of the semi-elasticity of y. Estimates of a univariate LOG1PLUS regression may still be useful because they provide a lower bound for the semi-elasticity of y with respect to x_j.

Unfortunately, the problem is more complicated in a multivariate regression model. As we demonstrate later in simulations, if explanatory variables in a multivariate regression are correlated with each other, then addition of 1 to y before log-transformation can cause the sign of β_{j,LOG1PLUS} to differ from the sign of the semi-elasticity of y with respect to x_j. Thus, estimates of a multivariate version of the LOG1PLUS regression need not even have the correct sign in expectation.

The log-transformation of y produces a second source of potential bias. While the log-transformation may seem innocuous, Silva and Tenreyro (2006) show that log-transforming a dependent variable in general can result in biased regression estimates if the errors in y are heteroskedastic, as is often the case. It is well-understood that heteroskedasticity in y requires correcting standard errors from an OLS regression of y on covariates. However, heteroskedasticity can also cause biased point estimates when the dependent variable is a nonlinear transformation of y. This problem arises because a nonlinear transformation translates any relationship between the variance of the error and a covariate into a relationship between the mean of the error and the covariate. As an example, suppose that ε is lognormally distributed with mean 1 and variance f(x_j). Under the log-transform, the OLS error term is now normally distributed:

log(ε) ∼ N( log(1/√(f(x_j) + 1)), log(1 + f(x_j)) ).

If f(x_j) is not equal to a constant, then E[ε|x] ≠ 0, and OLS estimates will generally be biased and inconsistent. As we show later in simulations, this bias can be large and can even cause estimates to have the wrong sign. If f′(x_j) > 0 generally, then the estimates will be biased towards zero. If f′(x_j) < 0 generally, then the estimates will be biased away from zero.

We illustrate the direction of the bias with two simple examples. Suppose that y = α + βx + ε, with corr(ε, x) = 0. Further, suppose that α = 1 and β = 0, so that the model simplifies to y = 1 + ε. In addition, suppose that ε ∈ {ε−, ε+}, with prob(ε = ε−) = prob(ε = ε+) = 0.5, and that the range of x is [0, 1]. In the first example, we assume that ε− = −x and ε+ = x. In this case, var(ε) = x², so corr(var(ε), x) > 0. In the second example, we assume that ε− = x − 1 and ε+ = 1 − x. In this case, var(ε) = (1 − x)², so corr(var(ε), x) < 0. Figure 2 plots E[y|x] and E[log(1 + y)|x] for both of these examples.

[Insert Figure 2]

Naturally, since β = 0, E[y|x] is invariant to x in both cases. However, the figure shows that E[log(1 + y)|x] decreases with x in the first example, where corr(x, var(ε)) > 0, and increases with x in the second example, where corr(x, var(ε)) < 0, even though E[y|x] is not correlated with x in either example. As a result, a regression of log(1 + y) on x would produce a negative slope coefficient in the first example and a positive slope coefficient in the second example, even though corr(y, x) = 0 by assumption.
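The two examples are easy to verify numerically. The sketch below (ours, assuming exactly the two error structures defined above) computes E[log(1 + y)|x] at a few values of x and exhibits the spurious negative and positive slopes:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 6)

# Example 1: ε = ±x, so y ∈ {1 - x, 1 + x} and var(ε) = x² rises with x.
e_log1p_1 = 0.5 * (np.log(2 + x) + np.log(2 - x))

# Example 2: ε = ±(1 - x), so y ∈ {x, 2 - x} and var(ε) = (1 - x)² falls with x.
e_log1p_2 = 0.5 * (np.log(1 + x) + np.log(3 - x))

print(e_log1p_1)  # decreasing in x: spurious negative slope for log(1 + y)
print(e_log1p_2)  # increasing in x: spurious positive slope for log(1 + y)
```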

1.2 Poisson regression

We next consider Poisson regression as an alternative when working with count-based outcomes. Poisson regression is a generalized linear model and assumes that the dependent variable follows a Poisson distribution with a conditional mean that depends on covariates. One advantage of Poisson regression is that it explicitly models an underlying count process. However, this advantage may be secondary if the only objective is to estimate a CEF, without taking a stand on the underlying data generating process. Even if this is the case, Poisson regression has several advantages that make it the most practical way to estimate regressions involving count-based outcome variables in many settings. Poisson regression assumes that the conditional expectation of the outcome variable takes the form:

E[y|x] = e^{xβ} (6)

or, equivalently, log(E[y|x]) = xβ. Note that this expression is not the same as the conditional expectation implied by an LOLS regression, E[log(y)|x] = xβ, since E[log(y)|x] ≠ log(E[y|x]) by Jensen's inequality. Poisson and LOLS regression both yield coefficients that are interpretable as semi-elasticities of y with respect to x. However, unlike LOLS and LOG1PLUS regression, the only requirement for unbiased and consistent estimation of a Poisson regression is that the error term in y be uncorrelated with the covariates. It is worth noting that, also unlike LOLS and LOG1PLUS regression, Poisson regression produces estimates with simple (conditional) interpretations in terms of marginal effects:

∂E[y|x]/∂x_j = β_j e^{xβ}. (7)

Poisson regression imposes a restriction on the conditional variance of the dependent variable. Specifically, it imposes the condition E[y|x] = var(y|x). A common critique of Poisson regression is that this condition is unlikely to be satisfied in practice. If this condition is not satisfied, Poisson regression loses efficiency. However, violation of the conditional mean-variance equality restriction in practice does not result in any estimation bias. That is, Poisson regression estimates continue to be unbiased and consistent estimates of the semi-elasticity of y with respect to x_j, even if var(y|x) ≠ E[y|x], as long as the errors are uncorrelated with the covariates.

One useful feature of the Poisson model is that it admits separable fixed effects. That is, the researcher can specify that each unit of observation (e.g., firm) have a different baseline outcome arrival rate. Let α_i be the fixed effect associated with firm i. Including this fixed effect, the Poisson model conditional expectation becomes:

E[y|x] = e^{α_i + xβ} = e^{α_i}e^{xβ}. (8)

Observe that, while the fixed effects in a linear model are additive, they are multiplicative in a Poisson model. The general approach to estimating Poisson regressions is based on pseudo maximum likelihood maximization using the first-order condition (Gourieroux et al., 1984):

Σ_i [y_i − e^{x_iβ}] x_{ij} = 0. (9)

The resulting Pseudo Poisson Maximum Likelihood (PPML) estimator has been shown to perform well for a wide range of distributional assumptions, including those where the data present a large number of zero values (Silva and Tenreyro, 2011). In fact, as noted by Silva and Tenreyro (2006), the PPML estimator is valid even if the outcome variable is not discrete. Historically, estimation using PPML maximization with large numbers of fixed effects presented a computational challenge. However, recent innovations in sparse matrix reduction methods have made the estimation of fixed effects Poisson models via PPML maximization fast and reliable. The PPMLHDFE module for Stata based on Correia et al. (2020) allows for speedy convergence of even high-dimensional fixed effects Poisson regression models.

One notable characteristic of the Poisson regression model with group fixed effects is that the usable sample is restricted to groups in which at least one observation has a non-zero value for the dependent variable. For example, in the context of patent data, estimation of a Poisson model with firm fixed effects uses only observations belonging to firms that patent at least once during the sample period. It is unclear that this sample restriction is a limitation rather than a feature of the Poisson model. For example, some firms may lack the capacity to patent - that is, E[y|x] = 0 in all periods for some firms. It is unclear that the researcher should retain these firms when estimating patent regressions, since their presence will bias estimates of the effect of covariates on patenting activity for firms with the ability to patent towards zero. See Correia et al. (2019) for further discussion of this issue.

Finally, there may be circumstances in which the universal baseline arrival rate is a function of an observable "exposure" variable. For example, in their analyses of annual establishment-level workplace injuries, Cohn and Wardlaw (2016) and Cohn et al. (2020) identify an establishment's average number of employees and total hours worked as natural exposure variables. The exposure variable enters the Poisson regression equation as a covariate with coefficient constrained to 1. When an exposure is specified, the regression coefficients become estimates of rate semi-elasticities - e.g., the percent change in workplace injuries per employee associated with a one-unit change in a covariate.
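To make the mechanics concrete, the sketch below fits a fixed effects Poisson regression with an exposure variable using Python's statsmodels GLM. The firm panel and all parameter values are hypothetical; for panels with many fixed effects, the Stata module PPMLHDFE discussed above is the standard tool, and the explicit-dummy approach here is practical only at small scale:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_firms, n_years = 50, 10
df = pd.DataFrame({
    "firm": np.repeat(np.arange(n_firms), n_years),
    "x": rng.normal(size=n_firms * n_years),
    "employees": rng.integers(10, 1000, size=n_firms * n_years),
})
alpha = rng.normal(size=n_firms)[df["firm"]]  # multiplicative firm fixed effects
df["injuries"] = rng.poisson(df["employees"] * np.exp(alpha + 0.2 * df["x"]))

# Firm fixed effects enter as dummy variables in this small example.
X = pd.get_dummies(df[["x", "firm"]], columns=["firm"], drop_first=True, dtype=float)
X = sm.add_constant(X)

# `exposure` enters with a coefficient constrained to 1, so the coefficient
# on x is a semi-elasticity of the injury rate per employee.
fit = sm.GLM(df["injuries"], X, family=sm.families.Poisson(),
             exposure=df["employees"]).fit(cov_type="HC1")
print(fit.params["x"])
```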

1.3 Other count models

Several alternatives to Poisson regression model the dependent variable as a count process but relax the conditional mean-variance equality condition. These include the negative binomial, zero-inflated Poisson, and zero-inflated negative binomial models. The negative binomial model explicitly models the variance as a separate gamma process, where the α_NB parameter determines the degree to which the data is overdispersed (var(y|x) > E[y|x]). The higher the value of α_NB, the greater the degree of overdispersion. As α_NB → 0⁺, overdispersion disappears, and the negative binomial distribution converges to the Poisson distribution. Negative binomial regression produces an estimate of α_NB, which is informative about the degree of overdispersion in the data.

Many count data sets exhibit what appears to be an excessive number of zero values relative to a Poisson or negative binomial distribution. This excess of zeroes can cause the conditional variance of the distribution to deviate from the conditional mean. Zero-inflated Poisson and negative binomial models explicitly account for an excessive number of zeros ("zero inflation") by modeling a separate process that determines whether an observation is exposed to the underlying count process. Of course, estimating such a model requires an understanding of the process determining exposure. These alternative count models are likely to be more efficient than Poisson models if the conditional mean-variance equality restriction does not hold in the data.

Unfortunately, none of these alternative count models admit separable fixed effects. In principle, one could include group dummy variables in the regression to approximate fixed effects. However, the inclusion of such dummies gives rise to an incidental parameters problem that can cause estimates to be biased and inconsistent (Lancaster, 2000).3 Since controlling for firm and time fixed effects is standard in corporate finance applications, the inability of alternative count models to accommodate fixed effects limits their usefulness in the field. The same concern applies to truncated models such as the Tobit model.

3Estimates converge to the true coefficient values as T increases but not as N increases.

1.4 OLS rate regression model

The LOLS, LOG1PLUS, and Poisson models all decrease skewness in count outcome variables through concave (logarithmic) transformation of the data. An alternative approach is to scale the outcome variable by an appropriate exposure variable, transforming it into a rate, and then estimate a linear regression:

y / exposure = xβ + ε. (10)

Scaling may not mitigate skewness due to coding errors or peculiarities of the data set. However, it does mitigate skewness due to differences in scale, especially if scale itself has a skewed distribution, as is typically the case in corporate finance. For example, a large company may patent at a rate that is an order of magnitude greater than a small company. If the large company is an order of magnitude larger than the small company, then the patenting rates of the two companies will be approximately the same.

Unfortunately, in many finance applications, including those involving patent-based outcomes, an appropriate exposure variable does not exist. For example, total assets and total sales, which are common measures of firm scale, are probably poor approximations of the overall exposure of firms to patenting activity. While still noisy, research and development expenditures may more closely approximate this exposure. However, treating patents per dollar of R&D spending as the outcome changes the interpretation of regression coefficients. These coefficients represent estimates of the effect of covariates on the patenting efficiency associated with each dollar spent on R&D rather than the effect of the covariates on overall patenting activity. If a researcher's objective is to test a theory that a policy affects patenting in part by inducing research activity, then scaling by R&D will lead to uninformative estimates.

If an appropriate exposure variable is available, then scaling and estimating a linear rate regression may be preferable to estimating a Poisson regression. Specifically, linear rate regression may be more efficient since, unlike Poisson regression, it does not impose any restriction on the conditional variance of the count outcome. In the next section, we conduct simulations to assess the statistical properties of the estimators discussed in this section. We compare linear rate and Poisson regressions in these simulations.
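A minimal sketch of an OLS rate regression, under the assumption that a valid exposure variable is observed (here, hypothetical employee counts, continuing the workplace-injury example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
employees = rng.integers(10, 1000, size=500)        # exposure variable
x = rng.normal(size=500)
injuries = rng.poisson(employees * np.exp(-2.0 + 0.2 * x))

rate = injuries / employees                         # events per unit of exposure
X = sm.add_constant(x)
fit = sm.OLS(rate, X).fit(cov_type="HC1")           # linear CEF for the rate
print(fit.params)
```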

2 Simulations

This section presents simulations that further illustrate the econometric properties of different estimators when working with count-based outcome variables. We first examine the potential magnitude of the bias in LOG1PLUS regression due to the addition of the constant and heteroskedasticity in the underlying count. We then examine the efficiency characteristics of three unbiased estimators.

2.1 The effect of adding the constant in LOG1PLUS regression

As we showed in Section 1, the addition of the constant in a univariate LOG1PLUS regression causes estimates of semi-elasticities of y to be biased towards zero. We now show with a simple simulation that the addition of a constant in a multivariate LOG1PLUS regression can cause a regression coefficient to have the wrong sign, even in the absence of sampling error. Note that our objective here is to demonstrate this possibility and not to make a more general statement about when LOG1PLUS regression produces estimates with the wrong sign.

We simulate a set of observations (x1, x2, y), with y = e^{β1x1 + β2x2}. We set β1 = 1 and β2 = -0.1. For each observation, we draw the value of x1 from a standard normal distribution. We then set x2 equal to x1 if x1 is positive and 0 if x1 is negative. Our objective in generating x2 in this way is to make x2 a nonlinear function of x1. There is nothing special about the kinked nature of the function we choose. Other nonlinear relationships yield the same insight.4 We do not include an error term in y in this simulation. Incorporating a homoskedastic error does not change the main insight from this simulation. We consider the case of heteroskedasticity in the next simulation.

We generate 5,000 observations. We then estimate Poisson, LOLS, and LOG1PLUS regressions of y on x1 and x2. Because we do not include noise in y, the regression estimates are all perfectly precise. We therefore only simulate a single data set in this simulation exercise. Table 1 presents the results.

[Insert Table 1]
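For reference, a sketch reproducing this simulation in Python (the random seed and implementation details are ours; the design follows the text):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1 = rng.normal(size=5000)
x2 = np.where(x1 > 0, x1, 0.0)             # x2 is a kinked function of x1
y = np.exp(1.0 * x1 - 0.1 * x2)            # true semi-elasticities: 1 and -0.1, no noise

X = sm.add_constant(np.column_stack([x1, x2]))
poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # PPML: y need not be integer
lols = sm.OLS(np.log(y), X).fit()
log1plus = sm.OLS(np.log1p(y), X).fit()
print(poisson.params, lols.params, log1plus.params)
```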

Both the Poisson and LOLS regressions produce coefficients of 1 on x1 and -0.1 on x2. That is, they both recover the true model coefficients. That they do so is not surprising, since E[y|x] = e^{xβ} for both models. In contrast, LOG1PLUS regression produces regression coefficients on x1 and x2 of 0.272 and 0.374, respectively. The addition of the constant to y in the LOG1PLUS regression along with the (nonlinear) relationship between x1 and x2 causes the coefficients on both variables to differ from the true model coefficients. In fact, the coefficient on x2 has the wrong sign. Thus, a researcher estimating a LOG1PLUS regression would conclude that x2 has a positive effect on y, while the true effect is negative. Figure 3 sheds light on why LOG1PLUS regression fails to recover the true coefficients.

Intuitively, log(y) is linear in both x1 and x2 by construction, so both LOLS and Poisson regressions produce unbiased estimates. However, log(1 + y) is linear in neither variable. As

4In some scenarios, a kinked relationship is a reasonable approximation to a real relationship. For example, suppose that x1 is the size of a firm measured by total assets and x2 is the number of stock analysts covering the firm. In reality, the number of analysts covering a firm generally increases with firm size, though a fairly large set of smaller firms has no analyst coverage. If there is a minimum size threshold beyond which brokerages do not assign analysts to firms, then the relationship between number of analysts and firm size will be approximately kink shaped.

a result, the LOG1PLUS regression is misspecified. Because (i) the regression is attempting to fit a line to a nonlinear relationship between y and x1 and (ii) the relationship between x1 and x2 is nonlinear, the coefficient on x2 picks up part of the relationship between y and x1.

The converse is true as well: The coefficient on x1 picks up part of the relationship between y and x2. As a result, the coefficients on x1 and x2 both deviate from the true values.

[Insert Figure 3]

The same issue would exist with Poisson and LOLS regression if log(y) were not linearly related to the covariates. However, researchers at least may have valid theoretical reasons for positing a linear relationship between log(y) and covariates. For example, a regression motivated by a Cobb-Douglas production function or a gravity equation will naturally have such a relationship. In contrast, it is hard to imagine scenarios where a theoretical justification exists for assuming that log(1 + y) is linearly related to covariates.

2.2 Log OLS Regressions and Heteroskedasticity

In our second simulation, we demonstrate how heteroskedasticity in the error term of the data generating process can affect not only the standard errors of the estimates in a log-transformed OLS regression but also the coefficients. Prior papers have demonstrated that heteroskedasticity can create estimation bias in regressions with logged dependent variables (Manning and Mullahy, 2001; Silva and Tenreyro, 2006). We show that the bias can actually cause estimates to have the wrong sign.

We simulate a set of observations (x1, x2, y), where y = e^{β1x1 + β2x2}η. We write y as a function of a multiplicative error for convenience, though we can recast it as an additive error term. We set β1 = β2 = 0.05. For each observation, we draw x1 and x2 from independent standard normal distributions. We draw η from an independent lognormal distribution with a mean of 1 and variance given by:

Var(η) = 1 if x1 ≤ 0, and Var(η) = 0.25 if x1 > 0.

Thus, the errors exhibit heteroskedasticity, with larger variance for below-median values of x1 and smaller variance for above-median values of x1. The error is not related to x2. We generate a simulated data set of 1,000 observations using this data generating process. We then estimate Poisson and LOLS regressions of y on x1 and x2. We repeat this process 1,000 times and compute the mean coefficients and standard errors. Table 2 reports these results.

[Insert Table 2]
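A sketch of one iteration of this simulation (our implementation; the lognormal error is parameterized so that E[η] = 1 and Var(η) matches the piecewise variance above, using σ² = log(1 + v) and μ = −σ²/2):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x1, x2 = rng.normal(size=(2, 1000))
v = np.where(x1 <= 0, 1.0, 0.25)                 # heteroskedastic error variance
s2 = np.log1p(v)
eta = np.exp(rng.normal(-s2 / 2, np.sqrt(s2)))   # lognormal: E[η] = 1, Var(η) = v
y = np.exp(0.05 * x1 + 0.05 * x2) * eta

X = sm.add_constant(np.column_stack([x1, x2]))
poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
lols = sm.OLS(np.log(y), X).fit()                # biased: E[log η | x1] varies with x1
print(poisson.params, lols.params)
```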

The mean coefficients on x1 and x2 from Poisson regressions are both approximately 0.05. That is, the Poisson regression recovers the true model coefficients, on average, despite the presence of heteroskedasticity. The mean coefficient on x2 in the LOLS regression is also approximately equal to the true coefficient of 0.05. When the variance of the error is not related to a covariate, LOLS regression recovers the true coefficient on that covariate. In contrast, the mean coefficient on x1 in the LOLS regression is -0.044. Not only does heteroskedasticity cause this coefficient to differ from the true coefficient of 0.05, but it actually causes it to have the wrong sign on average.

While prior papers show that heteroskedasticity can induce bias in regressions with log-transformed dependent variables, they generally do not explore whether this bias can cause estimates to have the wrong sign. Indeed, in applications such as gravity models of trade

(Silva and Tenreyro, 2006), a negative coefficient would not be sensible. However, in corporate finance applications, we often lack strong priors on the sign of a relationship. It is not obvious a priori that heteroskedasticity can cause estimates involving logged dependent variables to have the wrong sign, since the resulting bias could be proportional rather than additive. Note that we estimate LOLS rather than LOG1PLUS regressions here because LOG1PLUS regression estimates can be biased by both the addition of the constant and the effect of heteroskedasticity. LOLS regressions therefore make the effect of heteroskedasticity more transparent. However, heteroskedasticity in the error in y causes the same problems in LOG1PLUS regressions. Indeed, it can cause bias in regressions with any nonlinear transformation of y as the dependent variable.

2.3 Efficiency of three unbiased estimators

In our final set of simulations, we explore the efficiency of OLS regression using the raw count as the dependent variable, Poisson regression, and OLS rate regression. These three approaches produce unbiased estimators under standard exogeneity assumptions. We simulate data using a negative binomial data generating process. This approach allows us to introduce deviations from the conditional mean-variance equality restriction of Poisson regression and explore the associated efficiency consequences.

We simulate a set of observations (x1, x2, y). The conditional mean of each observation is:

µ = e^{β1x1 + x2}, (11)

where x1 is drawn from a standard normal distribution, x2 is drawn from a normal distribution with a mean of 0 and a standard deviation of 2, and x1 and x2 are independent of each other. Since x2 has a coefficient of one, we can think of e^{x2} as an exposure variable that captures the underlying exposure of a given observation to the arrival process.

We consider four values of β1: -0.3, -0.1, 0.1, and 0.3. We also consider three values of the overdispersion parameter α_NB: 0.001, 3, and 8. As noted previously, the negative binomial distribution converges to the Poisson distribution as α_NB → 0. Therefore, the case where α_NB = 0.001 is approximately the case where the data is generated by a Poisson model. The case where α_NB = 3 approximates the distribution of firm-year corporate patent count data. The case where α_NB = 8 represents extreme overdispersion.

For each given α_NB and β1, we generate a simulated data set of 1,000 observations. After generating the data, we estimate three regression models: OLS regression where the dependent variable is y, Poisson regression, and OLS rate regression where the dependent variable is y/e^{x2}. We repeat each simulation (i.e., the simulation for a given α_NB and β1) 2,500 times.

Our primary objective in this simulation exercise is to assess the power of each regression to reject the null hypothesis – that β1 = 0 – in the direction of the true value of β1. Therefore, for each combination of α_NB, β1, and regression model, we compute the percentage of the 2,500 simulated data sets in which the coefficient on x1 is statistically significant at the 5% level and has the same sign as the true value of β1. Table 3 reports these percentages.

[Insert Table 3]
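For reference, the sketch below implements one cell of this simulation grid (one (α_NB, β1) pair, the Poisson regression only, and fewer repetitions than in the text). It relies on the standard negative binomial (n, p) parameterization, n = 1/α_NB and p = n/(n + µ), which yields mean µ and variance µ + α_NB·µ²:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
beta1, a_nb, n_sims, hits = 0.1, 3.0, 200, 0

for _ in range(n_sims):
    x1 = rng.normal(size=1000)
    x2 = rng.normal(scale=2.0, size=1000)
    mu = np.exp(beta1 * x1 + x2)
    n = 1.0 / a_nb
    y = rng.negative_binomial(n, n / (n + mu))      # overdispersed counts
    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC1")
    if fit.pvalues[1] < 0.05 and np.sign(fit.params[1]) == np.sign(beta1):
        hits += 1

print(hits / n_sims)  # share of simulations rejecting beta1 = 0 in the right direction
```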

Panels A, B, and C report results for the cases α_NB = 0.001, α_NB = 3, and α_NB = 8, respectively. OLS regression where y is the dependent variable exhibits the least power in all scenarios, rejecting at much lower rates than the other two regression models. This lack of power is not surprising given the skewed distribution of y, which, as we have already noted, is why researchers often resort to log-transforming the dependent variable before estimating OLS regressions.

Poisson regression exhibits more power than OLS rate regression when the data is approximately Poisson distributed (Panel A). However, OLS rate regression exhibits more power than Poisson regression with a moderately high degree of overdispersion, especially for large values of β1 (Panel B). When overdispersion is extreme (Panel C), Poisson regression exhibits more power for smaller values of β1, but OLS rate regression exhibits much more power for larger values of β1.

It is important to note that in these simulations e^{x2} perfectly captures the baseline exposure of an observation to the arrival process underlying the count. While a true exposure variable like this is available in some cases, it is not in most cases, including the case of patent data, which represents the most commonly used count data in finance. In these cases, a noisy measure of exposure might be available. For example, one might treat a firm's total assets or property, plant, and equipment (PPE) as an exposure variable. However, the connection between a firm's asset base and its exposure to the arrival of patentable products and ideas is loose. In settings with more explanatory variables or when the exposure variable is noisy, Poisson regressions may perform better.

3 Replications and Decomposition

In this section, we replicate the main results from five published papers that use patents and/or patent citations as their primary outcome variables. We then compare estimates from LOG1PLUS, LOLS, and Poisson regressions using each replicated data set. We also examine how estimates from LOGcPLUS regressions change as we change the constant c added to the count before log-transformation. Finally, we decompose differences between LOG1PLUS and Poisson regression estimates into differences driven by bias in LOG1PLUS regression due to the addition of the constant, bias in LOG1PLUS regression due to heteroskedasticity in the count, and differences in samples due to the exclusion of units of observation that

never patent in the Poisson regressions.

3.1 Replications

We replicate papers by Hirshleifer, Low, and Teoh (2012), He and Tian (2013), Fang, Tian, and Tice (2014), Acharya, Baghai, and Subramanian (2014), and Amore, Schneider, and Žaldokas (2013). We choose these five papers because they are easy to replicate with publicly-available data sets. Collectively, these five papers have 3,261 Google Scholar citations as of this writing. The first three papers rely on LOG1PLUS regression, the fourth primarily on LOLS regression, and the fifth on Poisson regression. The main patent data sets that finance researchers use are the NBER patent database, the HBS patent database, and the KPSS patent database. In replicating each paper, we follow the data preparation outlined in the paper, including any adjustments for patent truncation (Dass, Nanda, and Xiao, 2017). For each paper, we tabulate the original main result from the paper and results from our replications using LOG1PLUS, LOLS, and Poisson specifications. We are able to approximately replicate the main results from all five papers in terms of coefficient signs, magnitudes, and significance levels. Additionally, our sample sizes (untabulated) roughly line up with those presented in the papers.

Table 4 presents the analysis for the Hirshleifer, Low, and Teoh (2012) paper. We replicate Table V column (1) from this paper. The main explanatory variable is Confident CEO (Options), an indicator variable equal to one if a firm's CEO holds options that are at least 67% in the money and zero otherwise. The coefficient on Confident CEO (Options) from the LOG1PLUS regression that the paper reports is positive and statistically significant at the ten percent level. The LOG1PLUS estimate from our replication is also positive and statistically significant, and it is close in magnitude to the estimate in the paper. Our LOLS and Poisson estimates are positive and statistically significant as well, though they are approximately twice as large as the estimate from the LOG1PLUS regression. The

coefficients on the control variables are also larger in the LOLS and Poisson regressions than in the LOG1PLUS regression.

[Insert Table 4]

Table 5 presents the analysis for the He and Tian (2013) paper. We replicate Table 2 column (4) from this paper. The main explanatory variable is lnCoverage, which is the natural log of the number of stock analysts covering a firm in a given year. The coefficient on lnCoverage from the LOG1PLUS regression that the paper reports is negative and statistically significant at the one percent level. Our replication of this LOG1PLUS regression also yields a negative coefficient on lnCoverage with statistical significance at the one percent level, and a magnitude about half the size of the estimate in the paper. In contrast, LOLS and Poisson regression both yield positive coefficients on lnCoverage. These estimates are about the same magnitude (in absolute value) as our LOG1PLUS estimate, and the LOLS estimate is statistically significant at the ten percent level. Some of the control variables have the same sign across all three replication specifications, while others, such as the coefficient on LnAge, differ in sign.

[Insert Table 5]

Table 6 presents the analysis for the Fang, Tian, and Tice (2014) paper. We replicate Table 2 column (1) from this paper. The main explanatory variable is ILLIQ, which is the natural logarithm of annual relative effective spread (the absolute value of the difference between the execution price and the midpoint of the prevailing bid-ask quote), divided by the midpoint of the prevailing bid-ask quote. The coefficient on ILLIQ from the LOG1PLUS

regression that the paper reports is positive and statistically significant at the one percent level. Our replication of this LOG1PLUS regression also yields a positive coefficient on ILLIQ that is similar in magnitude and also statistically significant at the one percent level. In contrast, LOLS regression yields a coefficient that is positive but an order of magnitude smaller, and Poisson regression yields a negative coefficient.

[Insert Table 6]

Table 7 presents the analysis for the Amore, Schneider, and Žaldokas (2013) paper. We replicate Table 3 column (4) from this paper. The main explanatory variable is Interstate deregulation, an indicator variable equal to one if a firm is headquartered in a state that has passed an interstate banking deregulation and zero otherwise. The coefficient on Interstate deregulation from the Poisson regression that the paper reports is positive and statistically significant at the one percent level. Our replication of this Poisson regression also yields a positive and statistically significant coefficient on Interstate deregulation that is similar in magnitude. LOLS and LOG1PLUS regression both yield coefficients that are much smaller in magnitude and are statistically insignificant.

[Insert Table 7]

Finally, Table 8 presents the analysis for the Acharya, Baghai, and Subramanian (2014) paper. We replicate Table 2 column (1) from this paper. The main explanatory variables are Good faith, Public policy, and Implied contract, each an indicator variable for whether a firm’s headquarters state has a particular class of law that protects workers from wrongful discharge. Because this paper estimates LOLS regressions, the sample is constrained to

firm-years with a positive number of patents. The coefficients on all three explanatory variables from the LOLS regression that the paper reports are positive, and two of the three are statistically significant. Our replication of this LOLS regression yields similar results, as does the LOG1PLUS regression. The Poisson regression coefficients for Good faith and Implied contract are also positive, though the coefficient on Public policy is negative. None of the Poisson coefficients are statistically significant.

[Insert Table 8]

Of the seven main explanatory variables across the five papers (the final paper has three), LOG1PLUS and Poisson regression yield coefficients with the same sign for only four. Of these four, the Poisson coefficient is at least 68% larger than the LOG1PLUS coefficient for three. Thus, in many cases, Poisson regression would lead to substantively different conclusions than LOG1PLUS regression.

3.2 Sensitivity of LOGcPLUS estimate to choice of constant

Next, we demonstrate that changing the constant added to the patent count before log- transforming generally changes the point estimates substantially. As we note in Section 1, the common practice of adding 1 is arbitrary. Table 9 presents the coefficients on the main explanatory variables for the first four papers that we replicate, where we estimate LOGcPLUS regressions for the following five values of c: 0.01, 0.1, 0.5, 1, and 10.5

[Insert Table 9]

5We do not include the Acharya, Baghai, and Subramanian (2014) paper in this analysis because it has multiple explanatory variables of interest.

The coefficients consistently shrink in magnitude as the size of the constant added before log-transformation increases. Note that this effect is mechanical, as adding a larger constant compresses the dependent variable because of the concave nature of the log transformation.

Formally, ∂/∂c [∂log(c + y)/∂y] = -1/(c + y)² < 0.
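This mechanical shrinkage is easy to verify on simulated data. The sketch below loops over the same five values of c; the data generating process is hypothetical, not one of the replication samples:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=5000)
y = rng.poisson(np.exp(0.5 * x))     # simulated counts with semi-elasticity 0.5
X = sm.add_constant(x)

for c in [0.01, 0.1, 0.5, 1.0, 10.0]:
    b = sm.OLS(np.log(c + y), X).fit().params[1]
    print(f"c = {c:>5}: coefficient = {b:.3f}")   # shrinks as c grows
```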

3.3 Decomposing differences in LOG1PLUS and Poisson coefficients

The results in Tables 4 through 8 suggest that LOG1PLUS and Poisson regression estimates based on the same data set can differ substantially in both magnitude and sign. These estimates could differ for three reasons. First, the addition of the constant makes LOG1PLUS regression coefficients biased. Second, the samples being used in estimation differ because firms that never patent during the sample period are necessarily excluded from the sample used for Poisson regression. Third, heteroskedasticity in the underlying count also biases LOG1PLUS regression estimates.

We now decompose the differences between the LOG1PLUS and Poisson coefficients of interest in Tables 4 through 7 into three components reflecting the possible causes of these differences. We do so by estimating two additional regressions. First, we fit a Poisson regression to compute the predicted values of y, which we label ŷ. We then estimate a LOG1PLUS regression where we substitute ŷ for y and restrict the sample to observations included in the Poisson regression. The difference between these estimates and the Poisson regression estimates represents the effect of changing the regression model, holding fixed the sample and removing the effects of heteroskedasticity (by removing the noise completely). This difference then isolates the bias due to the addition of the constant in the LOG1PLUS model.6

Second, we expand the sample to include observations for firms that never patent during the sample period, which Poisson regression drops, and re-estimate the LOG1PLUS regression, again using ŷ rather than y as the dependent variable.7 The difference between the LOG1PLUS estimates where ŷ is the dependent variable using the Poisson sample and the full sample isolates the effect of the differences in sample. Third, we compare LOG1PLUS regression estimates where ŷ is the dependent variable to the actual LOG1PLUS regression estimates, where y is the dependent variable, using the full sample to estimate both. If the errors were homoskedastic, these two regressions should yield the same estimates. Thus, the difference between the two captures the effects of bias due to heteroskedasticity.

Table 10 reports the coefficient estimates when we apply the procedures described above, where each panel reports results for the replication from a different paper. Based on comparison of the second and third columns, the effect of sample differences appears to be small in all four cases. Based on a comparison of the first and second columns, bias due to the addition of the constant in LOG1PLUS regression appears to drive some of the difference between the LOG1PLUS and Poisson estimates in all four cases. In fact, in the two cases where the signs of the LOG1PLUS and Poisson estimates disagree, this bias alone appears sufficient to cause the sign difference. Comparing the third and fourth columns, bias due to heteroskedasticity appears to cause substantial differences between LOG1PLUS and Poisson estimates in some cases but not others. The fact that the two sources of bias appear to explain most if not all of the differences between the LOG1PLUS and Poisson estimates raises serious concerns about relying on LOG1PLUS regressions.

[Insert Table 10]
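A simplified sketch of this decomposition, assuming a single covariate and no fixed effects (so that Poisson predictions extend trivially to the full sample); the DataFrame `df` and its columns y, x, and ever_patents are hypothetical placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def decompose(df: pd.DataFrame) -> dict:
    sub = df[df["ever_patents"]]                 # sample the Poisson model would use
    Xs = sm.add_constant(sub[["x"]])
    Xf = sm.add_constant(df[["x"]])

    pois = sm.GLM(sub["y"], Xs, family=sm.families.Poisson()).fit()

    # Step 1: LOG1PLUS on noiseless fitted values, Poisson sample.
    # Difference from the Poisson coefficient isolates the constant-added bias.
    b1 = sm.OLS(np.log1p(pois.predict(Xs)), Xs).fit().params["x"]

    # Step 2: same regression on the full sample. Difference from b1
    # isolates the effect of dropping never-patenting firms.
    b2 = sm.OLS(np.log1p(pois.predict(Xf)), Xf).fit().params["x"]

    # Step 3: actual LOG1PLUS regression on y. Difference from b2
    # isolates the bias due to heteroskedasticity in the count.
    b3 = sm.OLS(np.log1p(df["y"]), Xf).fit().params["x"]

    return {"poisson": pois.params["x"], "step1": b1, "step2": b2, "log1plus": b3}
```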

6Note that LOLS regression (no constant added) using ŷ as the dependent variable would produce coefficients identical to those from Poisson regression if there were no zero-valued observations.

7We assume that these additional observations have the same year fixed effects as the Poisson regression sample and have a firm fixed effect of 0, noting that they do not patent during the sample period.

4 Conclusion

This paper highlights the issues surrounding model choice when working with outcome variables based on count data, the use of which is increasingly common in corporate finance. Our analysis suggests that researchers should rely primarily on either Poisson regression or, if a suitable exposure variable is available and the researcher is concerned about overdispersion, OLS rate regressions when working with such data. Poisson regression produces unbiased and consistent estimates as long as the error in the count is uncorrelated with covariates, admits separable fixed effects, and can now be estimated quickly even with high-dimensional fixed effects using the Stata module PPMLHDFE. In contrast, commonly-used OLS regressions where the dependent variable is the log of 1 plus the count are subject to multiple sources of bias, even if errors in the count are uncorrelated with covariates. Our replications of data sets in five published papers using patent data suggest that this bias produces substantially different inferences from Poisson regression estimates.

References

Acharya, V. V., R. P. Baghai, and K. V. Subramanian (2014). Wrongful discharge laws and innovation. The Review of Financial Studies 27 (1), 301–346.

Amore, M. D., C. Schneider, and A. Žaldokas (2013). Credit supply and corporate innovation. Journal of Financial Economics 109 (3), 835–855.

Cohn, J. B., N. Nestoriak, and M. Wardlaw (2020). Private equity buyouts and workplace safety. Review of Financial Studies, forthcoming.

Cohn, J. B. and M. I. Wardlaw (2016). Financing constraints and workplace safety. The Journal of Finance 71 (5), 2017–2058.

Correia, S., P. Guimarães, and T. Zylkin (2019, August). Verifying the existence of maximum likelihood estimates for generalized linear models. arXiv:1903.01633 [econ].

Correia, S., P. Guimarães, and T. Zylkin (2020). Fast Poisson estimation with high-dimensional fixed effects. The Stata Journal 20 (1), 95–115.

Dass, N., V. Nanda, and S. C. Xiao (2017). Truncation bias corrections in patent data: Implications for recent research on innovation. Journal of Corporate Finance 44, 353–374.

Fang, V. W., X. Tian, and S. Tice (2014). Does stock liquidity enhance or impede firm innovation? The Journal of Finance 69 (5), 2085–2125.

Gourieroux, C., A. Monfort, and A. Trognon (1984). Pseudo maximum likelihood methods: Theory. Econometrica 52 (3), 681–700.

He, J. J. and X. Tian (2013). The dark side of analyst coverage: The case of innovation. Journal of Financial Economics 109 (3), 856–878.

Hirshleifer, D., A. Low, and S. H. Teoh (2012). Are overconfident CEOs better innovators? The Journal of Finance 67 (4), 1457–1498.

Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics 95 (2), 391–413.

Manning, W. G. and J. Mullahy (2001, July). Estimating log models: to transform or not to transform? Journal of Health Economics 20 (4), 461–494.

Silva, J. M. C. S. and S. Tenreyro (2006, November). The log of gravity. The Review of Economics and Statistics 88 (4), 641–658.

Silva, J. M. C. S. and S. Tenreyro (2011, August). Further simulation evidence on the performance of the Poisson pseudo-maximum likelihood estimator. Economics Letters 112 (2), 220–222.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Figure 1: Histogram of firm-year patents granted

This figure presents a histogram of the number of patents granted by firm-year. We top-code the count at 100 to make the figure easier to read. Hence, the right-most bar represents the percent of firm-year observations with 100 or more patents.

[Figure: histogram; x-axis: # patents granted (0–100); y-axis: percent of firm-year observations (0–80)]

Figure 2: Heteroskedastic errors and the conditional expectation of log(1 + y)

This figure presents two examples of correlation between the variance of the error in y and x. In both examples, y = 1 + ε and ε ∈ {ε−, ε+}, with prob(ε = ε−) = prob(ε = ε+) = 0.5, and the range of x is [0, 1]. In the first example (subfigure a), ε− = −x and ε+ = x. In the second example (subfigure b), ε− = x − 1 and ε+ = 1 − x. In each subfigure, the three sets of green points for each x value represent the value of y when ε = ε+ (top point), the value of y when ε = ε− (bottom point), and E[y|x] (middle point). The three sets of blue points for each x value represent the value of log(1 + y) when ε = ε+ (top point), the value of log(1 + y) when ε = ε− (bottom point), and E[log(1 + y)|x] (middle point).

(a) corr(var(η), x) > 0    (b) corr(var(η), x) < 0

[Two scatter plots; horizontal axis x from 0.2 to 1.2, vertical axis from 0 to 2.0, with E[y|x] and E[log(1+y)|x] marked in each panel.]
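A quick numeric check of the figure's logic (our own illustration, not taken from the paper's code): in the first example, E[y|x] is flat in x while E[log(1+y)|x] = 0.5·log(4 − x²) declines in x, so the log transformation manufactures a negative slope from a flat conditional mean.

import numpy as np

x = np.linspace(0.1, 0.9, 5)
Ey   = 0.5 * ((1 + x) + (1 - x))               # conditional mean of y: 1 everywhere
Elog = 0.5 * (np.log(2 + x) + np.log(2 - x))   # = 0.5*log(4 - x**2), decreasing in x
print(np.round(Ey, 3), np.round(Elog, 3))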

Figure 3: Simulated Log(y) and Log(1+y)

This figure plots the data from the first simulation. In this simulation, we generate x1 and x2 as standard normal random variables and define y = exp(x1 − 0.1·x2). We then plot log(y) and log(1 + y). The figure illustrates how log(1 + y) converges to 0, and hence differs significantly from log(y), as y approaches 0.
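A minimal sketch of this construction (ours, under the assumptions stated above):

import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(5000), rng.standard_normal(5000)
y = np.exp(x1 - 0.1 * x2)
gap = np.log1p(y) - np.log(y)   # always positive, and large only when y is small
print(gap[y > 5].mean())        # near zero: log(1+y) tracks log(y) for large y
print(gap[y < 0.05].mean())     # large: log(1+y) is pulled toward 0 as y -> 0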

Table 1: Constant Added Simulation

This table presents results from simulations that show the bias that adding a constant in LOG1PLUS regression can produce in a multivariate setting. We simulate a set of observations (x1, x2, y), where

y = e^{β1 x1 + β2 x2}.

We set β1 = 1 and β2 = −0.1. For each observation, we draw the value of x1 from a standard normal distribution. We then set x2 equal to x1 if x1 is positive and 0 if x1 is negative. We generate 5,000 observations. We then estimate Poisson, LOLS, and LOG1PLUS regressions of y on x1 and x2. Because we do not include noise in y, the regression estimates are all perfectly precise, so there are no standard errors to report. A code sketch of this simulation follows the table.

              (1)        (2)       (3)
              Poisson    LOLS      LOG1PLUS
x1            1.000      1.000     0.272
x2            -0.100     -0.100    0.374
Constant      0.000      0.000     0.627
Observations  5,000      5,000     5,000
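A minimal sketch of this simulation (ours; the numbers match the table only approximately, since they depend on the random draw):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.standard_normal(5000)
x2 = np.where(x1 > 0, x1, 0.0)             # x2 = x1 if positive, 0 otherwise
y = np.exp(1.0 * x1 - 0.1 * x2)            # beta1 = 1, beta2 = -0.1, no noise
X = sm.add_constant(np.column_stack([x1, x2]))

lols = sm.OLS(np.log(y), X).fit()          # recovers (0, 1, -0.1) exactly
log1p = sm.OLS(np.log1p(y), X).fit()       # biased: wrong magnitudes and signs
# statsmodels warns about non-integer y here; the PML estimates remain valid
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(lols.params, log1p.params, pois.params)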

Table 2: Simulated OLS and Poisson Models with Heteroskedastic Errors

This table reports results from simulations in which we introduce heteroskedasticity into the outcome variable. We simulate a set of observations (x1, x2, y), where

y = e^{β1 x1 + β2 x2} η.

We set β1 = β2 = 0.05. For each observation, we draw x1 and x2 from independent standard normal distributions. We draw η from an independent lognormal distribution with a mean of 1 and variance given by:

Var(η) = 0.25 if x1 ≤ 0, and Var(η) = 1 if x1 > 0.

The error is unrelated to x2. We generate a simulated data set of 1,000 observations using this data generating process. We then estimate regressions of y on x1 and x2. We repeat this process 1,000 times and compute the mean coefficients and standard errors. The first column reports the true values of the x1 and x2 coefficients. The second column reports results from Poisson (PPML) regression. The third column reports results from log-levels OLS (LOLS) regression. A code sketch of this simulation follows the table.

              True    PPML     LOLS
Avg coef β1   .05     .049     -.044
Avg SE β1             .025     .0215
Avg coef β2   .05     .051     .049
Avg SE β2             .025     .025
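A sketch of this design (ours, with fewer replications than the table for brevity). The key mechanic is that E[log η | x1] = −0.5·log(1 + Var(η | x1)) shifts with x1 even though E[η | x1] = 1, so the log-linear model inherits a bias that the Poisson model does not:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
est = []
for _ in range(200):                       # the paper uses 1,000 replications
    x1, x2 = rng.standard_normal(1000), rng.standard_normal(1000)
    v = np.where(x1 > 0, 1.0, 0.25)        # Var(eta | x1)
    s2 = np.log1p(v)                       # lognormal sigma^2 giving E[eta] = 1
    eta = np.exp(np.sqrt(s2) * rng.standard_normal(1000) - s2 / 2)
    y = np.exp(0.05 * x1 + 0.05 * x2) * eta
    X = sm.add_constant(np.column_stack([x1, x2]))
    pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    lols = sm.OLS(np.log(y), X).fit()
    est.append([pois.params[1], lols.params[1]])
print(np.mean(est, axis=0))                # Poisson near .05; log-OLS near -.044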

Table 3: Simulated OLS, Poisson, and Rate Regression Rejection Rates

This table presents results from a series of simulations. We simulate a set of observations (x1, x2, y). The conditional mean of each observation is µ = e^{β1 x1 + x2}, where x1 is drawn from a standard normal distribution, x2 is drawn from a normal distribution with a mean of 0 and a standard deviation of 2, and x1 and x2 are independent of each other. For each simulation, we generate a simulated data set of 1,000 observations. After generating the data, we estimate three regression models: OLS regression where the dependent variable is y, Poisson regression, and OLS rate regression, where the dependent variable is y/e^{x2}. We repeat each simulation 2,500 times. We then compute the percentage of the 2,500 simulated data sets in which the coefficient on x1 is statistically significant at the 5% level and has the same sign as the true value of β1 for each regression model. Panels A, B, and C report results where we set the negative binomial overdispersion parameter αNB to 0.001, 3, and 8, respectively. For each value of αNB, we simulate data for four values of β1: -0.3, -0.1, 0.1, and 0.3. A code sketch of this experiment follows the table.

Panel A: αNB = 0.001
                          β1 = -.3   β1 = -.1   β1 = .1   β1 = .3
Count OLS rejection rate  81.3%      18.9%      19.0%     82.8%
Poisson rejection rate    100%       100%       100%      100%
Rate OLS rejection rate   100%       62.0%      63.0%     100%

Panel B: αNB = 3
                          β1 = -.3   β1 = -.1   β1 = .1   β1 = .3
Count OLS rejection rate  29.2%      5.3%       5.4%      29.1%
Poisson rejection rate    54.0%      17.5%      17.0%     52.9%
Rate OLS rejection rate   90.8%      21.0%      21.8%     91.5%

Panel C: αNB = 8
                          β1 = -.3   β1 = -.1   β1 = .1   β1 = .3
Count OLS rejection rate  15.4%      4.1%       3.9%      15.1%
Poisson rejection rate    36.2%      16.5%      16.1%     37.5%
Rate OLS rejection rate   70.8%      12.7%      13.4%     70.1%
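A sketch of one cell of this experiment (ours; a smaller replication count, and we treat exp(x2) as the known exposure in the rate regression):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def reject_rates(b1=0.1, a_nb=3.0, reps=300, n=1000):
    hits = np.zeros(3)
    for _ in range(reps):
        x1, x2 = rng.standard_normal(n), rng.normal(0, 2, n)
        mu = np.exp(b1 * x1 + x2)
        # negative binomial as a Poisson-gamma mixture: Var(y) = mu + a_nb*mu^2
        g = rng.gamma(1.0 / a_nb, a_nb, n)
        y = rng.poisson(mu * g)
        X = sm.add_constant(np.column_stack([x1, x2]))
        fits = [
            sm.OLS(y, X).fit(cov_type="HC1"),                         # count OLS
            sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC1"),
            sm.OLS(y / np.exp(x2), sm.add_constant(x1)).fit(cov_type="HC1"),
        ]
        for i, f in enumerate(fits):
            b, se = f.params[1], f.bse[1]
            hits[i] += (abs(b / se) > 1.96) and (np.sign(b) == np.sign(b1))
    return hits / reps   # count-OLS, Poisson, and rate-OLS rejection rates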

Table 4: Replication: Hirshleifer, Low, and Teoh (2012)

This table presents a series of regressions based on the regression specification in Table V column (1) of Hirshleifer, Low, and Teoh (2012). The unit of observation is a firm-year. The outcome variable is the number of patents a firm generates in a given year. The first column reproduces the results from the original paper, which estimates a LOG1PLUS regression. The final three columns present results from LOG1PLUS, LOLS, and Poisson regressions based on our attempt to replicate the original data set. T-statistics based on standard errors clustered at the firm level are presented in parentheses below each coefficient. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% level, respectively, based on a two-tailed t-test.

                          Actual      Replication  Replication  Replication
                          LOG1PLUS    LOG1PLUS     LOLS         Poisson
Confident CEO (Options)   0.093*      0.110**      0.196***     0.185*
                          (1.93)      (2.23)       (2.79)       (1.79)
Log(sales)                0.732***    0.446***     0.617***     0.921***
                          (16.23)     (16.88)      (19.20)      (11.79)
Log(PPE/Emp)              0.244***    0.169***     0.301***     0.390***
                          (4.76)      (4.74)       (4.59)       (2.92)
Fixed effects             Ind, year   Ind, year    Ind, year    Ind, year
Observations              8,939       12,168       5,575        11,983
Adjusted R2               0.494       0.482        0.479

Table 5: Replication: He and Tian (2013)

This table presents a series of regressions based on the regression specification in Table 2 column (4) of He and Tian (2013). The unit of observation is a firm-year. The outcome variable is the number of patents a firm generates in a given year. The first column reproduces the results from the original paper, which estimates a LOG1PLUS regression. The final three columns present results from LOG1PLUS, LOLS, and Poisson regressions based on our attempt to replicate the original data set. Standard errors clustered at the firm level are presented in parentheses below each coefficient. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% level, respectively, based on a two-tailed t-test.

              Actual      Replication  Replication  Replication
              LOG1PLUS    LOG1PLUS     LOLS         Poisson
lnCoverage    -0.053***   -0.026***    0.036*       0.026
              (0.016)     (0.010)      (0.020)      (0.031)
lnAssets      0.050**     0.079***     0.107***     0.093
              (0.020)     (0.022)      (0.039)      (0.062)
RDAssets      0.100**     0.405***     0.305        0.246
              (0.048)     (0.128)      (0.204)      (0.462)
lnAge         0.180**     0.352***     0.057        -0.215*
              (0.072)     (0.046)      (0.070)      (0.111)
ROA           0.693***    0.239***     0.035        0.204
              (0.200)     (0.059)      (0.112)      (0.276)
PPEAssets     0.330***    0.455***     0.790***     0.901**
              (0.105)     (0.135)      (0.244)      (0.358)
Leverage      -0.324***   -0.346***    -0.294**     -0.369**
              (0.067)     (0.069)      (0.119)      (0.179)
CapexAssets   -0.051      0.063        -0.221       -0.115
              (0.113)     (0.171)      (0.325)      (0.487)
TobinQ        0.019***    0.029***     0.012        0.009
              (0.005)     (0.005)      (0.007)      (0.010)
KZIndex       -0.001**    -0.001       -0.001       -0.002
              (0.000)     (0.001)      (0.001)      (0.002)
HIndex        0.226       0.504        -0.241       -1.786**
              (0.163)     (0.318)      (0.507)      (0.768)
HIndex2       -0.128      -0.132       0.423        1.659**
              (0.139)     (0.264)      (0.448)      (0.774)
Fixed effects Firm, year  Firm, year   Firm, year   Firm, year
Observations  25,860      27,064       8,263        15,857
R2            0.8333      0.730        0.869

Table 6: Replication: Fang, Tian, and Tice (2014)

This table presents a series of regressions based on the regression specification in Table 2 column (1) of Fang, Tian, and Tice (2014). The unit of observation is a firm-year. The outcome variable is the number of patents a firm generates in a given year. The first column reproduces the results from the original paper, which estimates a LOG1PLUS regression. The final three columns present results from LOG1PLUS, LOLS, and Poisson regressions based on our attempt to replicate the original data set. Standard errors clustered at the firm level are presented in parentheses below each coefficient. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% level, respectively, based on a two-tailed t-test.

              Actual      Replication  Replication  Replication
              LOG1PLUS    LOG1PLUS     LOLS         Poisson
ILLIQt        0.141***    0.137***     0.014        -0.075
              (0.020)     (0.020)      (0.071)      (0.057)
LNMVt         0.160***    0.149***     0.343***     0.165***
              (0.018)     (0.017)      (0.057)      (0.054)
RDTAt         0.283***    0.316***     0.560**      0.948***
              (0.089)     (0.091)      (0.236)      (0.345)
ROAt          -0.032      0.033        -0.266*      -0.563*
              (0.068)     (0.028)      (0.158)      (0.307)
PPETAt        0.287***    0.052*       0.130        -0.072
              (0.094)     (0.031)      (0.195)      (0.246)
LEVt          -0.256***   -0.226***    0.064        0.399
              (0.075)     (0.065)      (0.214)      (0.281)
CAPEXTAt      0.175       0.235***     0.600        0.396
              (0.119)     (0.085)      (0.520)      (0.574)
HINDEXt       0.106       0.098        0.082        -0.300
              (0.086)     (0.083)      (0.281)      (0.418)
HINDEXt2      -0.112      -0.094       0.191        0.589
              (0.150)     (0.141)      (0.477)      (0.873)
Qt            -0.006      0.001        -0.027***    -0.013
              (0.007)     (0.003)      (0.008)      (0.009)
KZINDEXt      -0.000*     0.001*       0.000        0.004
              (0.000)     (0.000)      (0.008)      (0.011)
LNAGEt        0.168***    0.267***     0.252*       0.438**
              (0.035)     (0.050)      (0.151)      (0.209)
Fixed effects Firm, year  Firm, year   Firm, year   Firm, year
Observations  39,469      39,000       8,205        15,970
Adjusted R2   0.839       0.809        0.817

Table 7: Replication: Amore, Schneider, and Žaldokas (2013)

This table presents a series of regressions based on the regression specification in Table 3 column (4) of Amore, Schneider, and Žaldokas (2013). The unit of observation is a firm-year. The outcome variable is the number of patents a firm generates in a given year. The first column reproduces the results from the original paper, which estimates a Poisson regression. The final three columns present results from LOG1PLUS, LOLS, and Poisson regressions based on our attempt to replicate the original data set. Standard errors clustered at the firm level are presented in parentheses below each coefficient. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% level, respectively, based on a two-tailed t-test.

                         Actual      Replication  Replication  Replication
                         Poisson     LOG1PLUS     LOLS         Poisson
Interstate deregulation  0.1188***   0.0245       0.0278       0.1002**
                         (0.0397)    (0.0241)     (0.0393)     (0.0401)
Ln (sales)               0.5360***   0.1615***    0.2645***    0.6741***
                         (0.0901)    (0.0234)     (0.0398)     (0.0845)
Ln (K/L)                 0.1969**    0.0148       0.0047       0.2734***
                         (0.0789)    (0.0211)     (0.0442)     (0.0900)
Ln (R&D stock)           0.3264***   0.0918***    0.1518***    0.2124***
                         (0.1196)    (0.0309)     (0.0584)     (0.0164)
Fixed effects            Firm, year  Firm, year   Firm, year   Firm, year
Observations             18,066      18,424       18,424       14,920
R2                                   0.877        0.805

Table 8: Replication: Acharya, Baghai, and Subramanian (2014)

This table presents a series of regressions based on a regression specification in Acharya, Baghai, and Subramanian (2014). The unit of observation is a firm-year. The outcome variable is the number of patents a firm generates in a given year. The first column reproduces the results from the original paper, which estimates a LOLS regression. The final three columns present results from LOG1PLUS, LOLS, and Poisson regressions based on our attempt to replicate the original data set. Standard errors clustered at the firm level are presented in parentheses below each coefficient. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% level, respectively, based on a two-tailed t-test.

                  Actual      Replication  Replication  Replication
                  LOLS        LOG1PLUS     LOLS         Poisson
Good faith        0.124**     0.126*       0.164**      0.178
                  (0.051)     (0.065)      (0.079)      (0.160)
Public policy     0.082       0.080        0.103        -0.034
                  (0.056)     (0.057)      (0.069)      (0.243)
Implied contract  0.095**     0.075*       0.096*       0.210
                  (0.044)     (0.044)      (0.054)      (0.131)
Fixed effects     Firm, year  Firm, year   Firm, year   Firm, year
Observations      104,504     105,696      105,696      105,696
Adjusted R2       0.157       0.161        0.162

Table 9: LOGcPLUS Regressions with Different Constants

This table presents results from LOGcPLUS regressions using the replicated data sets analyzed in Tables 4 through 7. Each of Panels A through D provides the analysis for one paper. Each column corresponds to a different constant c added to the number of patents before log-transforming to compute the dependent variable. Standard errors and t-statistics are presented in parentheses below each coefficient. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% level, respectively, based on a two-tailed t-test. A code sketch of this exercise follows the table.

Panel A: Hirshleifer, Low, and Teoh (2012)
Constant           0.01      0.1       0.5       1          10
Overconfident CEO  0.216**   0.161**   0.125**   0.110**    0.064**
                   (0.109)   (0.077)   (0.057)   (0.049)    (0.026)
                   (1.99)    (2.09)    (2.18)    (2.23)     (2.43)
Observations       12,168    12,168    12,168    12,168     12,168
R2                 0.467     0.484     0.489     0.485      0.442
Controls, FEs      Yes       Yes       Yes       Yes        Yes

Panel B: He and Tian (2013)
Constant           0.01      0.1       0.5       1          10
lnCoverage         -0.063**  -0.044**  -0.031**  -0.026***  -0.011**
                   (0.027)   (0.018)   (0.012)   (0.010)    (0.005)
                   (-2.34)   (-2.48)   (-2.58)   (-2.59)    (-2.51)
Observations       27,064    27,064    27,064    27,064     27,064
R2                 0.677     0.705     0.725     0.730      0.726
Controls, FEs      Yes       Yes       Yes       Yes        Yes

Panel C: Amore, Schneider, and Žaldokas (2013)
Constant           0.01      0.1       0.5       1          10
Interstate dereg   0.023     0.028     0.027     0.025      0.013
                   (0.060)   (0.039)   (0.028)   (0.024)    (0.012)
                   (0.39)    (0.71)    (0.96)    (1.02)     (1.14)
Observations       18,424    18,424    18,424    18,424     18,424
R2                 0.746     0.805     0.857     0.877      0.914
Controls, FEs      Yes       Yes       Yes       Yes        Yes

Panel D: Fang, Tian, and Tice (2014)
Constant           0.01      0.1       0.5       1          10
ILLIQt             0.219***  0.178***  0.149***  0.137***   0.095***
                   (0.039)   (0.029)   (0.022)   (0.020)    (0.013)
                   (5.66)    (6.24)    (6.72)    (6.92)     (7.36)
Observations       39,000    39,000    39,000    39,000     39,000
R2                 0.785     0.812     0.831     0.838      0.854
Controls, FEs      Yes       Yes       Yes       Yes        Yes
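A sketch of this exercise (ours; df, patents, and x are hypothetical placeholders, and controls and fixed effects are omitted for brevity):

import numpy as np
import statsmodels.formula.api as smf

def logcplus_sweep(df, constants=(0.01, 0.1, 0.5, 1.0, 10.0)):
    results = {}
    for c in constants:
        # re-estimate the same specification with log(c + count) as the outcome
        d = df.assign(log_c_y=np.log(c + df["patents"]))
        results[c] = smf.ols("log_c_y ~ x", data=d).fit().params["x"]
    return results   # the coefficient typically shrinks toward zero as c grows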

Table 10: Poisson and LOG1PLUS Decomposition

This table decomposes the differences between the Poisson and LOG1PLUS regression estimates reported in Tables 4 through 7. Each of Panels A through D provides the decomposition for one paper. The first column reproduces the Poisson estimates from the corresponding table. The second column presents estimates from a LOG1PLUS regression in which we replace the dependent variable y with the fitted value ŷ from the Poisson regression in the first column and limit the sample to the sample used for the Poisson regression. The third column presents estimates from the same LOG1PLUS regression as the second column, but using the full sample. The fourth column reproduces the LOG1PLUS regression (using the actual value of y) from the corresponding table. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% level, respectively, based on a two-tailed t-test. A code sketch of the decomposition follows the table.

Panel A: Hirshleifer, Low, and Teoh (2012)
Model              Poisson   LOG1PLUS (ŷ)   LOG1PLUS (ŷ)   LOG1PLUS
Sample             Poisson   Poisson        Full           Full
Overconfident CEO  0.185     0.085          0.085          0.110
Observations       11,983    11,983         12,168         12,168
Controls, FEs      Yes       Yes            Yes            Yes

Panel B: He and Tian (2013)
Model              Poisson   LOG1PLUS (ŷ)   LOG1PLUS (ŷ)   LOG1PLUS
Sample             Poisson   Poisson        Full           Full
lnCoverage         0.026     -0.014         -0.036         -0.026
Observations       15,857    15,857         27,058         27,064
Controls, FEs      Yes       Yes            Yes            Yes

Panel C: Fang, Tian, and Tice (2014)
Model              Poisson   LOG1PLUS (ŷ)   LOG1PLUS (ŷ)   LOG1PLUS
Sample             Poisson   Poisson        Full           Full
ILLIQt             -0.075    0.028          0.024          0.137
Observations       15,970    15,970         39,000         39,000
Controls, FEs      Yes       Yes            Yes            Yes

Panel D: Amore, Schneider, and Žaldokas (2013)
Model              Poisson   LOG1PLUS (ŷ)   LOG1PLUS (ŷ)   LOG1PLUS
Sample             Poisson   Poisson        Full           Full
Interstate dereg   0.100     0.050          0.047          0.025
Observations       14,920    14,920         18,424         18,424
Controls, FEs      Yes       Yes            Yes            Yes
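A sketch of the decomposition (ours; hypothetical column names, controls and fixed effects omitted, and we assume no missing values so fitted values align with rows). Comparing columns (1) and (2) isolates the effect of the log(1 + y) transformation, (2) and (3) the effect of sample composition, and (3) and (4) the contribution of variation in y around its conditional mean:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def decompose(full: pd.DataFrame, pois_sample: pd.DataFrame):
    # (1) Poisson on its own estimation sample
    pois = smf.glm("patents ~ x", data=pois_sample,
                   family=sm.families.Poisson()).fit()
    b1 = pois.params["x"]
    # (2) LOG1PLUS with y replaced by the Poisson fitted values, same sample:
    # any gap relative to (1) reflects the transformation alone
    s = pois_sample.assign(yhat=np.asarray(pois.fittedvalues))
    b2 = smf.ols("np.log1p(yhat) ~ x", data=s).fit().params["x"]
    # (3) the same regression on the full sample, predicting yhat out of sample
    f = full.assign(yhat=np.asarray(pois.predict(full)))
    b3 = smf.ols("np.log1p(yhat) ~ x", data=f).fit().params["x"]
    # (4) the ordinary LOG1PLUS regression on actual counts
    b4 = smf.ols("np.log1p(patents) ~ x", data=full).fit().params["x"]
    return b1, b2, b3, b4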
