Count Data in Finance∗

Count Data in Finance∗ Jonathan Cohn University of Texas-Austin Zack Liu University of Houston Malcolm Wardlaw University of Georgia March 2021 Abstract This paper examines the use of count data-based outcome variables such as corporate patents in empirical corporate finance research. We demonstrate that the com- mon practice of regressing the log of one plus the count on covariates (“LOG1PLUS” regression) produces biased and inconsistent estimates of objects of interest and lacks meaningful interpretation. Poisson regressions have simple interpretations and produce unbiased and consistent estimates under standard exogeneity assumptions, though they lose efficiency if the count data is overdispersed. Replicating several recent papers on corporate patenting, we find that LOG1PLUS and Poisson regressions frequently produce meaningfully different estimates and that bias in LOG1PLUS regressions is likely large. ∗Jonathan Cohn: [email protected], (512) 232-6827. Zack Liu: [email protected], (713) 743-4764. Malcolm Wardlaw: [email protected], (706) 204-9295. We would like to thank Jason Abrevaya, Andres Almazan, John Griffin, Travis Johnson, Sam Krueger, Aaron Pancost, James Scott, Sheridan Titman, Jeff Wooldridge, and participants in the Virtual Finance Seminar and seminar at the University of Texas at Austin for valuable feedback. Count Data in Finance Abstract This paper examines the use of count data-based outcome variables such as corporate patents in empirical corporate finance research. We demonstrate that the com- mon practice of regressing the log of one plus the count on covariates (“LOG1PLUS” regression) produces biased and inconsistent estimates of objects of interest and lacks meaningful interpretation. Poisson regressions have simple interpretations and produce unbiased and consistent estimates under standard exogeneity assumptions, though they lose efficiency if the count data is overdispersed. Replicating several recent papers on corporate patenting, we find that LOG1PLUS and Poisson regressions frequently produce meaningfully different estimates and that bias in LOG1PLUS regressions is likely large. A growing number of papers in empirical corporate finance study outcome variables that are inherently count-based. For example, 44 papers published in “top three” finance journals in recent years estimate the effects of various forces on a company’s patent and/or patent citation counts. A key challenge in working with count data is that count variables, being bounded below by zero, often exhibit strong right-skewness. To address concerns about efficiency and outlier risk, researchers often log-transform highly-skewed dependent variables before estimating linear regressions. However, count data sets often contain many zero values, and the logarithm of zero is undefined. The most commonly used approach in finance to addressing this complication is to add a constant - typically 1 - to the count before log-transforming it. We refer to linear regression of the log of 1 plus a count variable on covariates as “LOG1PLUS” regression. Of the 44 papers referenced above, 25 use LOG1PLUS regression as their primary econometric approach, and 23 use it exclusively. Despite its widespread use, little work has been done to examine the properties of estimates based on LOG1PLUS regression and whether these properties provide a reasonable and accurate test of underlying economic hypotheses. In this paper, we analyze the LOG1PLUS approach as well as alternative approaches. We formalize the often unspoken assumptions behind different regression models, conduct simulations to explore the statistical properties of the estimates they produce, and compare these estimates using replicated data sets from existing papers. We illustrate how OLS regressions using log-transformed outcomes can produce biased and incorrectly signed estimates of economic relationships and provide guidance for future research in finance-related applications involving zero-bounded count data. How does one interpret estimates from a LOG1PLUS regression? A standard log-levels regression coefficient has a simple interpretation in terms of a semi-elasticity - the percentage change in the outcome variable associated with a one unit change in the explanatory variable. While one might imagine that LOG1PLUS regression estimates have the same interpretation because the added constant is invariant to the covariates, this intuition is wrong. The semi- 1 elasticity of a variable and the semi-elasticity of the sum of a constant and variable are not equivalent, nor can one easily be transformed into the other. In univariate regressions, the addition of the constant biases LOG1PLUS regression estimates towards zero. They may therefore be seen as representing lower bound estimates of semi-elasticities. The effect of the bias is more complex in a multivariate regression setting. We show in simulations that this bias can be large and can produce estimates with the wrong sign, even when there is no sampling error. Thus, when interpreting LOG1PLUS regression output, a researcher might incorrectly conclude that a policy variable has a particular di- rectional effect on the count outcome when it actually has no effect or even the opposite effect. The addition of the constant is not the only source of estimation bias in LOG1PLUS regression. In the context of trade model regressions, Silva and Tenreyro (2006) show that the log-transformation of an outcome variable can produce biased regression estimates if the error in the original (i.e., untransformed) variable is heteroskedastic, as is likely in most applications. The nonlinear nature of the log transformation translates a correlation between a covariate and the variance of the error in the original variable into a relationship with the mean of the implied error in the logged variable. We show that the same bias exists in LOG1PLUS regressions. Simulations suggest that this bias can also be large and can cause the expected value of estimated coefficients to have the wrong sign. We show that positive (negative) correlation between the variance of the error and a covariate results in a downward (upward) biased coefficient. One alternative to estimating an OLS regression in general is to treat the outcome variable as a count process and estimate a count regression model. Among count models, the Poisson model, which connects the outcome with a linear function of covariates through an exponential link function, has two unique and useful features. First, its coefficients are interpretable as semi-elasticity estimates (Wooldridge, 2010, p. 726). They also have fairly 2 simple interpretations in terms of a linear conditional expectation function (CEF). Second, the Poisson model admits separable fixed effects - effectively a prerequisite for use in corporate finance applications.1 While Poisson model estimates lose efficiency if the model’s conditional mean-variance equality restriction is not satisfied in the data, they remain unbiased and consistent as long as the standard conditional mean independence assumption holds. While computational constraints may have been a practical issue for estimating fixed effects Poisson models in the past, the PPMLHDFE module for Stata implements a pseudo-maximum likelihood approach based on (Correia et al., 2020) that allows for speedy convergence of even high-dimensional fixed effects Poisson regressions. More flexible count regression models such as the negative binomial model or zero-inflated models relax the conditional mean-variance restriction and are therefore likely to be more efficient in many settings. Negative binomial regression allows for overdispersion, where the conditional variance exceeds the conditional mean. Zero-inflated models (e.g., zero- inflate Poisson and zero-inflated negative binomial) estimate intensive and extensive margins separately, allowing for excessive zero values in the distribution of the outcome. However, none of these models admit separable fixed effects. While one can include group dummy variables when estimating these models, the lack of separability results in biased estimates due to the incidental parameters problem (Lancaster, 2000). Another alternative approach is to model the count outcome as a rate and estimate linear regressions where the rate is the dependent variable. Doing so requires the availability of a suitable “exposure” variable that captures the level of activity creating a baseline exposure to the outcome. For example, the number of employees at an establishment is a natural exposure variable for the number of workplace injuries at the establishment in a given period of time (Cohn and Wardlaw, 2016; Cohn et al., 2020). Scaling the count outcome by the exposure 1Poisson fixed effects are multiplicatively separable, while fixed effects in linear models are additively separable. 3 variable results in a rate of events per unit of exposure (e.g., workplace injuries per employee). OLS regression with a rate dependent variable produces estimates with a simple linear CEF interpretation. Our simulations suggest that rate regression is more efficient than Poisson regression when overdispersion is moderate. Unfortunately, a suitable exposure variable is not available in many settings. After considering the econometric properties of various estimators when working with count data, we replicate data sets from five papers in the innovation literature, all of which use patent-based dependent variables. We are able to approximately replicate the main results of all five papers using

Load more