Comparing to Apples: Price Premiums of Club over Open Varieties

Modhurima Dey Amin Department of Agricultural and Applied Economics, Texas Tech University

Syed Badruddoza Department of Agricultural and Applied Economics, Texas Tech University

Gregory Astill Economic Research Service, U.S. Department of Agriculture

Jill J. McCluskey School of Economic Sciences, State University

Working paper: May 2021.

Abstract We find price premiums for club apples and their determinants using a 2008-17 retail scanner data set. We combine the scanner data with exclusive scientific data on apple taste and appearance, and then apply machine learning to predict the price premiums. The premiums average approximately 5.19% with an annual growth rate of 0.002%, which is not statistically different from zero. A vector autoregressive framework is used to find the determinants of the premiums, under the specification that the premiums and club apple sales are endogenous. The determinants of club apple premiums are club apple sales, open apple sales, wage, electricity, and marketing cost. We find evidence that club apple sales Granger-cause club apple premiums but not the other way around.

Keywords: Club apples, Open apples, Premiums, Machine Learning, VAR, United States

JEL Codes: Q11.

The findings and conclusions in this report are those of the authors and should not be construed to represent any official USDA or U.S. Government determination or policy. This research was supported in part by the U.S. Department of Agriculture, Economic Research Service. The views expressed here cannot be attributed to IRI.

1 Introduction

An apple variety that meets both producers’ and consumers’ expectations may provide a profitable opportunity for growers and hence supports the incentive to innovate. However, large-scale distribution of a variety without a central marketing plan can saturate the market and drive the price down, which hurts its innovators, investors, and producers. In the past, some excellent apple varieties that were openly released to growers could not expand in the market for lack of a unified marketing strategy. Many apple varieties are now restricted from open distribution via licensing in order to maintain quality, quantity, novelty, and profitability. These varieties are known as ‘managed’, ‘club’, or ‘exclusive’ varieties, as opposed to conventional ‘open’ varieties. Club varieties are marketed by an organization that obtains an exclusive license on a patent held by a university or private breeder, so that they appear only in a limited number of orchards. In general, growers of club apples pay a licensing fee and/or royalty that covers the cost of innovation, marketing, and trademark protection. There are about 50 club apples in the U.S. market; examples include EverCrisp and Sonya.

Restricting the distribution of an improved variety is important for the apple industry to promote innovation and generate revenue for growers. On the supply side, apple producers need to know how profitable it is to join or maintain a club, and whether that profitability has grown or shrunk over the past years. On the demand side, consumers should theoretically prefer club apples and be willing to pay more for them because club apples tend to improve on open varieties in terms of taste and other desirable features. Therefore, consumers need to know whether club apples are becoming cheaper or more expensive as more club varieties are introduced. Most importantly, both producers and consumers might be interested to know the determinants of price premiums.
Identifying the predictors helps producers learn which factors to focus on, and helps consumers understand what makes the price of club apples higher than that of open apples.

The current study derives monthly price premiums of club apple varieties, forecasts future premiums, and finds the determinants of the premiums. We use an exclusive data set that describes apple characteristics in detail with phenotype and genotype information, and we complement it with retail scanner data for 2008-17 to generate monthly price premiums. Given the large dimension of the data, we use eXtreme Gradient Boosting (XGBoost), a machine learning model, for improved data-driven prediction and pattern detection. We also compare the derived premiums with those of a hedonic pricing model under a Box-Cox specification for robustness. We then use a vector autoregressive (VAR) framework that treats club premiums and sales as endogenous to find their determinants. The exogenous determinants include various producer price indices, number of clubs, working population, sales of open apples, and so on. Finally, we forecast the price premiums and sales using the coefficients from the VAR model and check the response of club premiums to artificial shocks in the market for apples.

The superior predictive accuracy of the machine learning model allows us to predict the future growth of club apples and the market. A growing premium for club apples indicates that consumers value the improvements in apple attributes, and that restrictive production and marketing provide growers and innovators with a margin. A falling premium for club apples, on the other hand, means increased competition in improving varieties. In that case, the margins of club apples will converge to those of open varieties. This creates a disincentive to innovate, although consumers will have more varieties to choose from at relatively lower prices. Our research also examines whether club varieties are managed well enough to withstand a sales shock, or to recover within a certain number of months. This piece of information can be particularly interesting to the growers in a club.

The specific objectives of this project are (1) to explore whether and to what extent changes in the U.S. apple industry have impacted the magnitude and dynamics of monthly price premiums for club apples over open apples, (2) to use the series of price premiums to quantify the long-term dynamics of club apples and measure their market growth over time, (3) to identify the major predictors of club apple premiums and their importance, (4) to forecast the premiums to understand the future structure of the U.S. apple industry, that is, whether it converges to a balanced equilibrium of club and open apple varieties or one type dominates the other, and (5) to create an artificial shock to the apple industry, e.g., a sales shock, to observe its immediate and long-term impact on prices and premiums.

A large number of studies look into the market for apples in the United States. Most of them find the willingness to pay for different apple attributes in a sensory panel setting. However, studies that use retail scanner data to find valuations of attributes are rare; and to the best of our knowledge, no study has been conducted on the price premiums for club apples. Moreover, this is the first study on apples that combines machine learning with an appropriate structural model to analyze high-dimensional data. As the number of apple growers’ clubs increases in the U.S. market, analyzing the premiums is now more important than ever. The current study addresses a critical policy aspect for the U.S. apple industry in general.
If club apples are more profitable, taste better, and protect intellectual property, then will they replace open apples altogether? On the other hand, if the premiums go down as the number of clubs increases, does the market for club apples converge to the market for open apples? Our model forecasts the short-run and long-run scenarios to gain insights into these concerns.

Data

We combine four sources of data for the analysis. First, we collect price, quantity, packaging, and market characteristics from the IRI retail scanner data, InfoScan, available from 2008 to 2017 (inclusive) through USDA’s Economic Research Service (ERS). Second, we use exclusive data from the Breeder’s Toolbox (2020) that contain taste and appearance attributes of various apple varieties. The third and fourth sources are, respectively, the National Agricultural Statistics Service (USDA, 2019) and the Bureau of Labor Statistics (BLS, 2019), which provide the determinants of price premiums, such as the labor force (working population) index and producer price indices for wage, rent, interest, advertisement, electricity, etc.

IRI is a market research firm that collects weekly retail sales data across the United States. The data contain quantities sold, prices, and product characteristics, including the brand name, organic status, nutrition information, and retailer characteristics, such as retailer location and type. Levin et al. (2018) calculate that IRI covers about 15% of stores and 20% of sales reported in the Economic Census. This is because the data mainly include sales revenue from non-food products and select grocery stores with annual sales greater than or equal to $2 million. However, Levin et al. (2018) find that IRI accounts for 51% of retail food sales in the United States. On average, the data record over 132 million observations each week from approximately 41 thousand retail stores and 18 thousand retail marketing areas. Each transacted product can be identified via its Universal Product Code (UPC). We define contiguous U.S. states as markets, yielding 49 markets including Washington D.C.

We primarily extracted quantities, prices, and apple types by UPC for all perishable products that are identified as “apples” (excluding pineapples) using Structured Query Language (SQL) programs. Restricting our search to perishable products avoided including processed apples, applesauce, and all other packaged/processed apple items. However, whole fresh apples sold in bags are included. The UPC-level data were then aggregated by apple variety, and market shares and mean prices of each variety were derived. The IRI data included about 131 varieties of apples over ten years (2008-17). One caveat of analyzing perishable products is that their package size is often missing, since they are not sold in fixed packages (hence called “random weight” items). Retailers provide quantities for random-weight items as pounds or number of items (counts). Counts are converted to weights by IRI using a standard conversion factor for each apple variety (Muth et al. 2016, p. 31).

The apple attributes are extracted from the Breeder’s Toolbox (2020), which contains phenotype and genotype data for various fruits. Phenotype refers to the sum of an organism’s observable characteristics, and genotype is the corresponding combination of genetic features for a specific gene.
Apple attributes were generated through sensory panels and instrumental (machine-measured) testing conducted in 2010, 2011, and 2012. The testing was conducted at Cornell University, the University of Minnesota, and Washington State University. Two trained sensory panelists at each location rated the apples on a 5-point Likert scale, where 1 is the lowest and 5 is the highest sensory intensity. The sensory panel data are more readily interpretable in a non-technical way than the instrumental data. The data include 11 variables: acidity, sweetness, firmness, crispness, juiciness, aroma, weight (lbs.), mass (grams), skin color, overcolor, and grand color. Scientific instrument evaluations were conducted using different measurement tools (e.g., a penetrometer). The phenotyping (i.e., information on observable apple characteristics) was repeated in 2011 and 2012. A detailed description of this data set is documented by Evans et al. (2010), Evans et al. (2012), and Schmitz et al. (2013).

Table 1 shows the descriptive statistics of the variables used in estimating price premiums. About 2% of the purchases were club apples and 16% were organic. Apple taste features come from the scientific study and are placed on a Likert scale from one to five, where one means the lowest and five means the highest. Other variables are mostly dummy variables. Two less familiar variables are drug and mass merchandise. Drug stores are prescription-based pharmacies that generate at least 20% of total sales from general merchandise. Mass merchandisers are large department stores that sell general merchandise and an assortment of grocery products. Total volume is not a predictor, but is used to create volume weights across stores. Table 2 lists the variables that are used as determinants of price premiums.
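The aggregation from UPC-level records to variety-level mean prices and market shares described in this section can be illustrated with a minimal Python sketch. The records, the variety names used as examples, and the choice to weight prices and shares by pounds sold are assumptions made for illustration; the paper does not spell out the exact weighting, and this is not the authors' SQL code.

```python
from collections import defaultdict

# Hypothetical UPC-level records: (variety, pounds sold, dollar sales).
records = [
    ("Gala", 120.0, 150.0),
    ("Gala", 80.0, 110.0),
    ("Fuji", 100.0, 140.0),
    ("EverCrisp", 20.0, 50.0),
]

def aggregate_by_variety(rows):
    """Return {variety: (volume-weighted price per pound, share of total pounds)}."""
    pounds = defaultdict(float)
    dollars = defaultdict(float)
    for variety, lbs, usd in rows:
        pounds[variety] += lbs
        dollars[variety] += usd
    total_lbs = sum(pounds.values())
    return {v: (dollars[v] / pounds[v], pounds[v] / total_lbs) for v in pounds}

summary = aggregate_by_variety(records)
for variety, (price, share) in summary.items():
    print(f"{variety}: ${price:.2f}/lb, share {share:.1%}")
```

Dividing total dollars by total pounds gives a price that is automatically weighted by volume, which is one plausible reading of the volume weights mentioned above.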


Table 1. Descriptive Statistics (2008-17)

| Variable | Description | Mean | Std. Dev. |
|---|---|---|---|
| Price | Price of apples per pound | 1.588 | 1.298 |
| Club | =1 if apple variety is club, 0 otherwise | 0.021 | 0.104 |
| Organic | =1 if organic, 0 otherwise | 0.163 | 0.370 |
| Acidity | Acidic taste. 1: no acid, 2: slightly acidic, 3: medium, 4: acidic, 5: very acidic | 2.510 | 0.442 |
| Crispiness | First bite, acoustic sensation that is detected by the ear during the fracturing of crisp foods. 1: not crisp, 2: slightly crisp, 3: medium, 4: fairly crisp, 5: crisp | 3.024 | 0.337 |
| Firmness | Texture (hardness), force required to completely bite through sample placed between molars. 1: soft, 2: fairly soft, 3: medium, 4: firm, 5: very firm (hard) | 3.095 | 0.407 |
| Sweetness | Sweet taste. 1: no sweetness, 2: slightly sweet, 3: medium, 4: sweet, 5: very sweet | 2.788 | 0.263 |
| Juiciness | Amount of juice released after chewing with molars. 1: very dry, 2: dry, 3: medium, 4: juicy, 5: very juicy | 2.624 | 0.298 |
| Sliced | =1 if sliced, 0 otherwise | 0.094 | 0.292 |
| Side | =1 if packed with dip, 0 otherwise | 0.021 | 0.144 |
| Large | =1 if package mentions large, 0 otherwise | 0.102 | 0.303 |
| Small | =1 if package mentions small, 0 otherwise | 0.312 | 0.463 |
| Bulk | =1 if bulk amount, 0 otherwise | 0.001 | 0.026 |
| Fixed weights | =1 if package cannot be altered, 0 otherwise | 0.463 | 0.499 |
| Convenience | =1 if convenience store, 0 otherwise | 0.011 | 0.104 |
| Dollar | =1 if dollar store, 0 otherwise | 0.001 | 0.034 |
| Drug | =1 if prescription-based pharmacies, 0 otherwise | 0.017 | 0.130 |
| Grocery | =1 if grocery, 0 otherwise | 0.762 | 0.426 |
| Mass merchandise | =1 if large grocery with general merchandise, 0 otherwise | 0.140 | 0.347 |
| Total volume | The average volume of purchase in ounces | 45.476 | 16.746 |

Obs. 23,837,449

Table 2. Predictors of Price Premiums

| Variable | Description | Mean | Std. Dev. |
|---|---|---|---|
| Premium | Estimated price premiums of club apples (%) | 5.191 | 2.131 |
| Club sales | Sales of club apples (mil. $) | 1.770 | 1.670 |
| Open sales | Sales of open apples (mil. $) | 39.500 | 16.000 |
| Shoppers | U.S. labor force index (population, age 15-64) | 101.718 | 1.453 |
| Wage | PPI for wages | 105.718 | 7.612 |
| Fuel | PPI for diesel | 79.195 | 20.486 |
| Electricity | PPI for electricity | 105.420 | 4.871 |
| Advertisement | PPI for advertisements | 103.575 | 3.465 |
| Marketing | PPI for marketing costs | 118.161 | 14.210 |
| Interest | PPI for interest rates | 98.760 | 5.897 |
| Rent | PPI for land rents | 106.180 | 15.749 |

Obs. 120 months

Note: PPI stands for Producer Price Index where 2011=100.
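The note that the indices are expressed with 2011=100 can be made concrete with a short Python sketch of rebasing. The raw index values are made up, and using a single annual value as the base is a simplifying assumption; with monthly data one would more plausibly divide by the 2011 average.

```python
def rebase_index(series, base_year):
    """Rescale an index so that the base-year value equals 100."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}

# Made-up raw index values, for illustration only.
raw_wage_index = {2010: 182.0, 2011: 185.0, 2012: 190.5}
wage_2011 = rebase_index(raw_wage_index, base_year=2011)
print(wage_2011)
```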

Model

We generate the club premiums using a modern version of Rosen’s (1974) hedonic price equation in the first stage. The estimated premiums form an endogenous relationship with the sales of club apples, and both depend on other factors. In the second stage, we apply parametric and nonparametric models to derive the relationship. Both stages are described below.

Estimating the price premiums

We follow Badruddoza (2020) to show that a large number of apple and market characteristics are required to precisely estimate the price premiums. Assume there exist $j = 1, \dots, J$ observations of fresh apple purchases in each of $t = 1, \dots, T$ periods. The product characteristics are divided into two components $\{C_j, X_j\}$, where $C_j = 1$ if the product is a club variety and zero otherwise, and $X_j$ is a vector of all other remaining attributes. Following Rosen (1974), the hedonic price equation of product $j$ for a single market $\bar{M}$ at time $\bar{t}$ is

$$P^{EQ}_{j,\bar{t}} = p\left(C_j, X_j \mid D = \bar{D},\ S = \bar{S},\ M = \bar{M},\ t = \bar{t}\right), \qquad (1)$$

where EQ stands for the equilibrium, and $D$ and $S$ are demand and supply shifters. We assume concave utility, availability of many indivisible differentiated products, and no arbitrage. The price premium of a club variety as a percent of the open variety price is

$$\pi_{j,\bar{t}} = 100 \times \frac{P^{EQ}_{j,\bar{t}}\left(C_j = 1, X_j, \bar{D}, \bar{S}, \bar{M}, \bar{t}\right) - P^{EQ}_{j,\bar{t}}\left(C_j = 0, X_j, \bar{D}, \bar{S}, \bar{M}, \bar{t}\right)}{P^{EQ}_{j,\bar{t}}\left(C_j = 0, X_j, \bar{D}, \bar{S}, \bar{M}, \bar{t}\right)}, \qquad (2)$$

where $\pi_{j,\bar{t}}$ is the estimated price premium for product $j$ at time $\bar{t}$ in a single market $\bar{M}$. Let market characteristics be a draw from some distribution $F_M(M \mid \bar{D}, \bar{S}, \bar{t})$ and let $F(\pi, M \mid \bar{D}, \bar{S}, \bar{t})$ represent the joint distribution of the price premium for product $j$ and the market characteristics. The expected price premium is an integral over all the market characteristics at time $t$,

$$E(\pi_t) = \int_{\pi} \int_{M} \pi_t \, dF(\pi \mid M, \bar{D}, \bar{S}) \, dF_M(M \mid \bar{D}, \bar{S}). \qquad (3)$$

We obtain a $T$-length vector of premiums, one for each point in time. It is possible to replace $F_M(M \mid \bar{D}, \bar{S}, \bar{t})$ with an empirical density if we can observe the market characteristics. Equation (3) suggests that the price equation is identified if the time and market characteristics space is broken down to the point where the “club” feature is the only source of variation. Essentially, one needs to estimate the following equation for each $t$,

$$p_j = f\left(C_j, X_j', M_j'; \beta\right), \qquad (4)$$

where $\beta$ is a vector of parameters. However, empirical identification of hedonic prices is challenging because of vulnerability to functional form misspecification and endogeneity concerns (Ekeland, Heckman, and Nesheim, 2004; Kuminoff, Parmeter, and Pope, 2010). Therefore, one cannot expect to derive a causal association or unravel the utility, demand, or supply-side parameters. Predictive analysis is feasible but requires highly disaggregated data and a large set of product and market attributes to be controlled. We address the issue in three ways: (1) using IRI store-scanner data with Breeder’s Toolbox data that contain many features of apples and markets, (2) applying machine learning to account for a large number of predictors¹, and (3) conducting a Box-Cox (1964) transformation on the response variable, e.g., $p^{(\lambda)} = (p^{\lambda} - 1)/\lambda$, to address functional form misspecification. We argue that the main task here is to predict the price with and without the club status, instead of deriving a causal inference. Hence, we focus more on the predictive power of the model in this stage. We estimate the price equation separately for each month by controlling for product, store, market, and spatial features. Simulations conducted by Kuminoff, Parmeter, and Pope (2010) provide evidence that a simple and flexible model (e.g., linear Box-Cox) with spatial fixed effects offers the best prediction for the hedonic price model.
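The counterfactual logic of Eq. (2), combined with the Box-Cox transformation, can be sketched in a few lines of Python. The fitted model here is a stand-in with made-up coefficients on the Box-Cox scale, and the function names, the value of lambda, and the two example products are all hypothetical; the paper's premiums come from monthly Box-Cox regressions and XGBoost, not from this toy predictor.

```python
import math

def box_cox(p, lam):
    """Box-Cox transform of a positive price p; the log is the lam -> 0 limit."""
    if lam == 0:
        return math.log(p)
    return (p ** lam - 1.0) / lam

def inv_box_cox(y, lam):
    """Invert box_cox to return to the price scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)

def predict_price(club, sweetness, organic, lam=0.5):
    """Hypothetical fitted model, linear on the Box-Cox scale (made-up coefficients)."""
    y = 0.20 + 0.05 * club + 0.10 * sweetness + 0.15 * organic
    return inv_box_cox(y, lam)

def club_premium(products, lam=0.5):
    """Average percent premium: switch C_j from 0 to 1 while holding X_j fixed."""
    premiums = []
    for x in products:
        p1 = predict_price(1, x["sweetness"], x["organic"], lam)
        p0 = predict_price(0, x["sweetness"], x["organic"], lam)
        premiums.append(100.0 * (p1 - p0) / p0)
    return sum(premiums) / len(premiums)

products = [{"sweetness": 2.8, "organic": 0}, {"sweetness": 3.1, "organic": 1}]
print(round(club_premium(products), 2))
```

The same pattern applies to any predictor: score each observation twice, once with the club indicator set to one and once set to zero, and average the percent difference.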

¹ A description of the machine learning model, eXtreme Gradient Boosting (XGB), is placed in the appendix.

Finding the Determinants of Price Premiums

The average club apple premium is a function of market characteristics (Eq. 3). Let $w$ denote variables endogenous to the price premiums, $z$ the observed exogenous variables, and $\eta$ the unobserved or latent factors. Then the mean price premium forms an implicit function,

$$\Phi\left(E(\pi_t), w_t, z_t, \eta_t\right) = 0, \qquad (5)$$

where $\Phi$ represents an unknown function. We use this intuition to analyze the effect of several nationally aggregated variables on club premiums. Nationally aggregated variables are likely to exhibit endogeneity, heteroscedasticity, and autocorrelation. An empirical specification of equation (5) is the following:

$$\begin{bmatrix} E(\pi_t) \\ w_t \end{bmatrix} = \Omega L \begin{bmatrix} E(\pi_t) \\ w_t \end{bmatrix} + \Psi \left[ z_t \right] + u_t \quad \forall j \in J, \qquad (7)$$

where $L$ is the lag operator with an unknown length. The estimated premium $E(\pi_t)$ is endogenous to the total club apple sales $w_t$, and both depend on the exogenous determinants $z_t$, which can be correlated with their past values; $u_t$ stands for the unobserved, autocorrelated terms. We apply a vector autoregressive model in which $\Psi$ and $\Omega$ are estimable parameters (e.g., Watson and Engle, 1983; Stock and Watson, 1989).

Monthly price premiums estimated from the price equation create a time series of 120 months (January 2008 to December 2017). Estimating equation (7) has two challenges: autocorrelation and selecting the optimal lag length. We use a modified Dickey-Fuller (DF-GLS) test for a unit root (Elliott, Rothenberg, and Stock, 1996); and sequential t, Schwarz information, and modified Akaike information criteria for the optimal lag length (Ng and Perron, 1995; Ng and Perron, 2001).

We parsimoniously select the determinants $z_t$ based on economic intuition and the literature (e.g., Jaenicke and Carlson, 2015; Badruddoza, 2020). The variables include sales of open varieties, the U.S. labor force index, and producer price indices for wages, diesel, feed grains, electricity, advertising, and retail marketing (Table 2). The labor force index is added to control for the number of shoppers who generally make purchasing decisions. Input costs, e.g., wage, feed, fuel, and electricity, and marketing costs, e.g., advertisement and retail marketing, are factors that may affect production costs. The diesel index was chosen because it is common across products and widely used in agriculture and transportation. The VAR estimates are used to test Granger causality and to compute impulse response functions (see Granger, 1969; Lutkepohl, 2005).
The objective is to obtain insights into the endogenous relationship of $E(\pi_t)$ and $w_t$, and their determinants.
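The Granger causality idea used with the VAR can be illustrated with a small Python sketch. It assumes NumPy is available, uses simulated series in which "sales" drive "premium" by construction, and omits the exogenous regressors $z_t$ that the paper includes. It compares restricted and unrestricted OLS fits of one VAR equation with an F statistic, which is one common way to run the test and not necessarily the authors' exact procedure.

```python
import numpy as np

def lagged_design(y, x, p, include_x=True):
    """Rows t = p..n-1: constant, p lags of y and (optionally) p lags of x."""
    n = len(y)
    cols = [np.ones(n - p)]
    for k in range(1, p + 1):
        cols.append(y[p - k:n - k])
    if include_x:
        for k in range(1, p + 1):
            cols.append(x[p - k:n - k])
    return np.column_stack(cols)

def ssr(design, target):
    """Sum of squared OLS residuals."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    return float(resid @ resid)

def granger_f(y, x, p):
    """F statistic for H0: lags of x do not help predict y."""
    target = y[p:]
    full = lagged_design(y, x, p, include_x=True)
    restricted = lagged_design(y, x, p, include_x=False)
    ssr_u, ssr_r = ssr(full, target), ssr(restricted, target)
    dof = len(target) - full.shape[1]
    return ((ssr_r - ssr_u) / p) / (ssr_u / dof)

# Simulated data (not the paper's): lagged sales enter the premium equation,
# but lagged premiums do not enter the sales equation.
rng = np.random.default_rng(0)
n = 300
sales = np.zeros(n)
premium = np.zeros(n)
for t in range(1, n):
    sales[t] = 0.5 * sales[t - 1] + rng.normal()
    premium[t] = 0.3 * premium[t - 1] + 0.8 * sales[t - 1] + rng.normal()

f_sales_to_premium = granger_f(premium, sales, p=2)
f_premium_to_sales = granger_f(sales, premium, p=2)
print(f_sales_to_premium, f_premium_to_sales)
```

A large F statistic in one direction and a small one in the other is the pattern that would be read as one-way Granger causality.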

Results

Table 3 presents the results from using a linear regression model for Eq. 4. Monthly premiums, however, were actually estimated using separate Box-Cox regressions and XGB². The correlation between linear predictions and predictions from XGB is about 0.834. Nevertheless, Table 3 gives an idea of the association between each characteristic and apple price. For example, because the dependent variable is the natural log of price, the coefficients of 0.218 for club apples and 0.293 for organic apples correspond to roughly 24% and 34% higher prices, respectively. All apple taste features have positive coefficients except for acidity. A little processing, such as slicing or adding snacks to apples, also contributes positively to price. Compared to grocery stores, convenience, drug, mass merchandise, and dollar stores charge more for apples.

The monthly regressions generate 120 price premiums for club apples from January 2008 to December 2017. Figure 1 shows the estimated price premiums. The premiums vary between zero and 10% without showing any pattern over time. A quick regression of price premiums on time gives a coefficient of 0.0029, which is not statistically different from zero (p-value = 0.605).

² See appendix for a discussion on XGB algorithm.
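The "quick regression of price premiums on time" is an ordinary least squares fit on a linear time index. A minimal Python sketch is shown here with a made-up series of 120 monthly premiums that oscillate around 5% without drift; the function name and the data are illustrative and do not reproduce the paper's estimate.

```python
import math

def trend_regression(y):
    """OLS of y on a constant and a time index 1..n; returns (slope, std_error, t_stat)."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    resid = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se = math.sqrt(sigma2 / sxx)
    t_stat = slope / se if se > 0 else float("inf")
    return slope, se, t_stat

# Made-up monthly premiums (percent) that oscillate around 5 with no drift.
premiums = [5.0 + (1.5 if m % 2 == 0 else -1.5) for m in range(120)]
slope, se, t_stat = trend_regression(premiums)
print(slope, se, t_stat)
```

A t statistic well inside the usual critical values, as for this series, is what "not statistically different from zero" means for the slope.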

Figure 1. Estimated price premiums for club apples by year. Source: Authors’ calculation using XGB on Eq. 4.

We use the VAR model to find the determinants of premiums and sales of club apples. We use the augmented Dickey-Fuller approach to test for a unit root. Both series have a unit root in their level form, but not in their first difference form. The optimal lag length for the VAR model is determined using several penalty criteria (see Table 4). We parsimoniously choose a lag length of five months for the VAR model.

Table 5 presents the results from the VAR model. The coefficients on the lagged values do not have a straightforward interpretation in the VAR model. However, club apple premiums are positively associated with greater costs of production such as wage, electricity, and marketing costs, but negatively associated with open apple sales, indicating substitution between club apples and open apples. The second column of Table 5 shows the determinants of club apple sales. A one percent increase in open apple sales leads to a 0.62% decrease in club apple sales. Among other variables, a greater wage decreases club apple sales.
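Why the series enter the VAR in first differences can be illustrated with a short Python sketch. It is not a Dickey-Fuller test; it only shows, on a simulated random walk, that the level series is highly persistent while its first difference is not. The seed, series length, and function names are arbitrary choices for illustration.

```python
import random

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

def first_difference(series):
    """Return the series of period-to-period changes."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

# Simulated random walk: each level equals the previous level plus noise.
random.seed(0)
level = [0.0]
for _ in range(499):
    level.append(level[-1] + random.gauss(0.0, 1.0))
diff = first_difference(level)
print(lag1_autocorr(level), lag1_autocorr(diff))
```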

Table 3. Predicting Price of Apples (2008-17)

| Dependent variable: ln(price per pound) | Coefficient | Robust S.E. |
|---|---|---|
| Club | 0.218*** | 0.003 |
| Organic | 0.293*** | 0.001 |
| Acidity | -0.181*** | 0.002 |
| Crispiness | 0.005** | 0.002 |
| Firmness | 0.018*** | 0.001 |
| Sweetness | 0.014*** | 0.002 |
| Juiciness | 0.464*** | 0.003 |
| Sliced | 0.842*** | 0.002 |
| Side | 0.958*** | 0.003 |
| Large | 0.705*** | 0.003 |
| Small | 0.487*** | 0.004 |
| Bulk | 0.470*** | 0.067 |
| Fixed weights | -0.326*** | 0.002 |
| Convenience | 0.795*** | 0.005 |
| Dollar | 1.021*** | 0.004 |
| Drug | 0.370*** | 0.003 |
| Mass merchandise | 0.827*** | 0.003 |
| Obs. | 22,130,511 | |
| R-squared | 0.246 | |
| F-statistic | 55,793.47*** | |

Note: *** p<0.01, ** p<0.05, * p<0.1
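Because the dependent variable in Table 3 is ln(price per pound), a dummy-variable coefficient $\beta$ implies a percent price difference of $100 \times (e^{\beta} - 1)$. The following Python sketch applies this standard conversion to three coefficients copied from Table 3; the helper name is hypothetical.

```python
import math

def log_coef_to_percent(beta):
    """Percent price change implied by a dummy coefficient in a ln(price) regression."""
    return 100.0 * (math.exp(beta) - 1.0)

# Coefficients copied from Table 3.
for name, beta in [("Club", 0.218), ("Organic", 0.293), ("Fixed weights", -0.326)]:
    print(f"{name}: {log_coef_to_percent(beta):.1f}%")
```

For small coefficients the conversion is close to $100 \times \beta$, but the gap widens as the coefficient grows.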

Table 4. Lag Selection Criteria for VAR Model

| lag | LL | LR | df | p | FPE | AIC | HQIC | SBIC |
|---|---|---|---|---|---|---|---|---|
| 0 | -611.385 | | | | 244.724 | 11.175 | 11.3708* | 11.6577* |
| 1 | -611.045 | 0.68089 | 4 | 0.954 | 261.277 | 11.2397 | 11.4748 | 11.819 |
| 2 | -603.725 | 14.64 | 4 | 0.006 | 246.601 | 11.181 | 11.4552 | 11.8568 |
| 3 | -598.382 | 10.686 | 4 | 0.030 | 241.116 | 11.1572 | 11.4706 | 11.9296 |
| 4 | -595.535 | 5.6926 | 4 | 0.223 | 246.494 | 11.1776 | 11.5302 | 12.0465 |
| 5 | -587.46 | 16.15* | 4 | 0.003 | 229.815* | 11.1055* | 11.4973 | 12.0709 |
| 6 | -586.477 | 1.9664 | 4 | 0.742 | 243.034 | 11.1589 | 11.5898 | 12.2209 |

Note: The number of lags to be used in the VAR model was decided based on several metrics: log likelihood (LL), likelihood ratio (LR), final prediction error (FPE), Akaike’s information criterion (AIC), Hannan-Quinn information criterion (HQIC), and Schwarz’s Bayesian information criterion (SBIC). An asterisk (*) indicates the optimal number of lags based on the respective criterion.
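How the starred entries in Table 4 arise can be checked with a few lines of Python: each information criterion selects the lag with the smallest value, and the LR statistic at lag k is twice the gain in log likelihood over lag k-1. The numbers are copied from Table 4; the variable and function names are illustrative.

```python
# Values copied from Table 4, indexed by lag 0..6.
criteria = {
    "FPE":  [244.724, 261.277, 246.601, 241.116, 246.494, 229.815, 243.034],
    "AIC":  [11.175, 11.2397, 11.181, 11.1572, 11.1776, 11.1055, 11.1589],
    "HQIC": [11.3708, 11.4748, 11.4552, 11.4706, 11.5302, 11.4973, 11.5898],
    "SBIC": [11.6577, 11.819, 11.8568, 11.9296, 12.0465, 12.0709, 12.2209],
}
log_lik = [-611.385, -611.045, -603.725, -598.382, -595.535, -587.46, -586.477]

def best_lag(values):
    """Lag with the smallest criterion value."""
    return min(range(len(values)), key=lambda lag: values[lag])

def lr_statistic(lag):
    """Likelihood ratio statistic comparing a VAR with `lag` lags to one with lag - 1."""
    return 2.0 * (log_lik[lag] - log_lik[lag - 1])

selected = {name: best_lag(vals) for name, vals in criteria.items()}
print(selected)
print([round(lr_statistic(k), 3) for k in range(1, 7)])
```

FPE and AIC favor five lags while HQIC and SBIC, which penalize extra parameters more heavily, favor zero; the paper's choice of five lags follows the former pair and the LR test.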

Table 5. Determinants of Price Premiums from VAR Model

| Variables | Δ club apple premium | Δ ln(club apple sales) |
|---|---|---|
| Lag 1 of Δ club apple premium | -0.437*** (0.0900) | -0.0589** (0.0287) |
| Lag 2 of Δ club apple premium | -0.203** (0.0995) | -0.0420 (0.0317) |
| Lag 3 of Δ club apple premium | -0.412*** (0.0931) | -0.0240 (0.0297) |
| Lag 4 of Δ club apple premium | -0.146 (0.0978) | -0.0111 (0.0312) |
| Lag 5 of Δ club apple premium | -0.0593 (0.0891) | -0.0182 (0.0284) |
| Lag 1 of Δ club apple sales | -0.191 (0.289) | 0.00608 (0.0921) |
| Lag 2 of Δ club apple sales | -0.455 (0.283) | -0.211** (0.0903) |
| Lag 3 of Δ club apple sales | -0.430 (0.280) | -0.123 (0.0894) |
| Lag 4 of Δ club apple sales | 0.217 (0.297) | -0.153 (0.0947) |
| Lag 5 of Δ club apple sales | -0.513* (0.281) | -0.239*** (0.0895) |
| Δ open apple sales | -1.207*** (0.007) | -0.620** (0.241) |
| Δ shoppers | -1.014 (0.832) | 0.0527 (0.265) |
| Δ PPI for wage | 0.284*** (0.002) | -0.203*** (0.0626) |
| Δ PPI for fuel (diesel) | 0.0575 (0.0480) | -0.00189 (0.0153) |
| Δ PPI for electricity | 0.171** (0.066) | -0.00360 (0.0338) |
| Δ PPI for advertisement | 0.700 (0.676) | 0.157 (0.215) |
| Δ PPI for marketing | 0.0762** (0.0355) | -3.72e-05 (0.0113) |
| Δ PPI for interest rates | -0.00558 (0.112) | -0.0155 (0.0356) |
| Δ PPI for rent | -0.0575 (0.101) | 0.0374 (0.0322) |
| Constant | 0.158 (0.189) | 0.103* (0.0602) |
| Observations | 114 | 114 |

Note: Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1. PPI stands for Producer Price Index where 2011=100.


We then compute Granger causality tests to examine the impact of club premiums on club apple sales and of sales on premiums. The null hypotheses are (1) club apple sales do not Granger-cause club apple premiums, and (2) premiums do not Granger-cause club apple sales. We reject the first hypothesis at the 5% level of significance, but fail to reject the second. That is, club apple sales Granger-cause club apple premiums, but the reverse is not true. To obtain more insight, we present the impulse response function in Figure 2. A one percent impulse was artificially created on club apple sales. We find that the club premium increases by up to 0.1 percent and then gradually returns to its initial position within 20 months.

Figure 2. Response of club apple premiums to an artificial impulse on club apple sales.
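The mechanics behind an impulse response function can be sketched in Python for the simplest case, a VAR(1) without exogenous variables, where the response at horizon h is the coefficient matrix applied h times to the initial shock. The coefficient matrix is hypothetical and is not taken from Table 5; the paper's model has five lags and exogenous regressors, so this is only an illustration of the idea.

```python
def mat_vec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]

def impulse_response(a, shock, horizon):
    """Responses of a VAR(1) y_t = A y_{t-1} + e_t to a one-time shock e_0."""
    path = [list(shock)]
    for _ in range(horizon):
        path.append(mat_vec(a, path[-1]))
    return path

# Hypothetical coefficients with variable ordering [premium, sales].
A = [[0.5, 0.1],
     [0.0, 0.6]]
path = impulse_response(A, shock=[0.0, 1.0], horizon=20)
premium_response = [state[0] for state in path]
print([round(v, 4) for v in premium_response[:6]])
```

With a stable coefficient matrix the response rises for a few periods and then decays toward zero, which is the qualitative shape described for the club premium.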

To summarize, the premium for club apple varieties is affected by their own sales, sales of open apple varieties, and the costs of labor, electricity, and marketing. The two apple types are substitutes rather than complements.

Conclusion

The study attempts to provide some insights regarding club and open apple growers’ performance, market structure, and the future of club apple premiums. The study will encourage policy discussions on restrictive versus nonrestrictive production and marketing behavior in specialty crops.

References

Badruddoza, S. 2020. Long-term dynamics of U.S. organic milk, eggs, and yogurt premiums. Impact of Social Influence, Organic, and Plant-Based Milks on the U.S. Dairy Market. Doctoral dissertation. School of Economic Sciences. Washington State University.

Breeder’s Toolbox. 2020. NSF Plant Genome Program and Washington Tree Fruit Research Commission. Available at https://app.bioinfo.wsu.edu/breeders_toolbox. [Accessed March 19, 2020].

Bureau of Labor Statistics (BLS). 2019. BLS Beta Labs Data Finder 1.1. February 1. Accessed February 1, 2019. https://beta.bls.gov/dataQuery/find?fq=survey:%5bpc%5d&s=popularity:D&q=advertising.

Ekeland, I., Heckman, J.J. and Nesheim, L. 2004. Identification and estimation of hedonic models. Journal of Political Economy, 112 (S1), S60-S109.

Elliott, G., Rothenberg, T.J. and Stock, J.H. 1996. Efficient tests for an autoregressive unit root. Econometrica, 64, 813-836.

Evans, K., L. Brutcher, B. Konishi, and B. Barritt, 2010. Correlation of sensory analysis with physical textural data from a computerized penetrometer in the Washington State University apple breeding program. HortTechnology 20(6): 1026-1029.

Evans, K., Y. Guan, J. Luby, M. Clark, C. Schmitz, S. Brown, B. Orcheski, C. Peace, E. Van De Weg, and A. Iezzoni, 2011. Large-scale standardized phenotyping of apple in RosBREED. Acta Horticulturae 945: 233-238.

Granger, C. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37, 424-438.

Jaenicke, E.C. and Carlson, A.C. 2015. Estimating and investigating organic premiums for retail-level food products. Agribusiness, 31 (4), 453-471.

Kuminoff, N.V., Parmeter, C.F. and Pope, J.C. 2010. Which hedonic models can we trust to recover the marginal willingness to pay for environmental amenities? Journal of Environmental Economics and Management, 60 (3), 145-160.

Levin, D., Noriega, D., Dicken, C., Okrent, A.M., Harding, M. and Lovenheim, M. 2018. Examining Food Store Scanner Data. Washington D.C.: US Department of Agriculture.

Lutkepohl, H. 2005. New Introduction to Multiple Time Series Analysis. New York: Springer.

Muth, M.K., Sweitzer, M., Brown, D., Capogrossi, K., Karns, S.A., Levin, D., Okrent, A., Siegel, P. and Zhen, C. 2016. Understanding IRI household-based and store-based scanner data. May TB-1942, Economic Research Service, Washington D.C.: U.S. Department of Agriculture.

Ng, S. and Perron, P. 1995. Unit root tests in ARMA models with data-dependent methods for the selection of the truncation lag. Journal of the American Statistical Association, 90 (429), 268-281.

Ng, S. and Perron, P. 2001. Lag length selection and the construction of unit root tests with good size and power. Econometrica, 69(6), 1519-1554.

Rosen, S. 1974. Hedonic prices and implicit markets: product differentiation in pure competition. Journal of Political Economy, 82 (1), 34-55.

Schmitz, C. A., M. D. Clark, J. J. Luby, J. M. Bradeen, Y. Guan, K. Evans, B. Orcheski, S. Brown, S. Verma, and C. Peace, 2013. Fruit texture phenotypes of the RosBREED US apple reference germplasm set. HortScience 48(3): 296-303.

Stock, J.H. and Watson, M.W. 1989. New indexes of coincident and leading economic indicators. NBER Macroeconomics Annual, 4, 351-394.

US Department of Agriculture. 2019b. National Agricultural Statistics Service. Quick Stats. https://quickstats.nass.usda.gov/.

Watson, M.W. and Engle, R.F. 1983. Alternative algorithms for the estimation of dynamic factor, mimic and varying coefficient regression models. Journal of Econometrics, 23 (3), 385-400.

Appendix

Formal XGBoost algorithm and hyperparameter grids

Given a data set $\mathcal{D}$, a loss function $L$, a learning rate $\eta$, and the number of terminal nodes $T$, let the predictive model be $y_i = f(x_i)$, where we initialize with $\hat{f}^{(0)}(x) = \hat{\theta}_0 = \arg\min_{\theta} \sum_{i=1}^{n} L(y_i, \theta)$. The gradient tree boosting steps are, for $m = 1, \dots, M$:

(1) Compute $\hat{g}_m(x_i) = \left[ \partial L(y_i, f(x_i)) / \partial f(x_i) \right]_{f(x) = \hat{f}^{(m-1)}(x)}$.
(2) Determine the tree structure $\{\hat{R}_{jm}\}_{j=1}^{T}$ such that splits maximize $\text{Gain} = 0.5 \left[ \frac{G_L^2}{n_L} + \frac{G_R^2}{n_R} - \frac{G_{jm}^2}{n_{jm}} \right]$, where $L$ and $R$ are left and right of the split.
(3) Derive the leaf weights $\hat{w}_{jm} = \arg\min_{w_j} \sum_{i \in \hat{R}_{jm}} L\left(y_i, \hat{f}^{(m-1)}(x_i) + w_j\right)$.
(4) Calculate $\hat{f}_m(x) = \eta \sum_{j=1}^{T} \hat{w}_{jm} I(x_i \in \hat{R}_{jm})$, where $I$ is an indicator function.
(5) Calculate $\hat{f}^{(m)}(x) = \hat{f}^{(m-1)}(x) + \hat{f}_m(x)$.

Aggregate by average: $\hat{f}(x) = \hat{f}^{(M)}(x) = \sum_{m=0}^{M} \hat{f}_m(x)$.

The grids for the XGBoost parameters are:
Number of boosting iterations: 100, …, 500.
Maximum depth of a tree: 1, …, 5.
Learning rate $\eta$: 0.3, …, 0.7.
Predictor subsample rates: 0.6, …, 0.8.
Row subsample rates: 0.8, …, 0.9.
Total number of replications: 375 for each month (=120 separate XGB runs for 120 months).
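Steps (1) to (5) above can be sketched in plain Python for the simplest setting: squared-error loss $L = 0.5 (y - f)^2$, one predictor, and depth-one trees (two terminal nodes). This is a didactic sketch of gradient tree boosting under those assumptions, not the XGBoost library and not the configuration used in the paper; the data, function names, and hyperparameter values are made up.

```python
def best_split(x, grad):
    """Pick the threshold maximizing Gain = 0.5*(G_L^2/n_L + G_R^2/n_R - G^2/n)."""
    n, g_total = len(x), sum(grad)
    best_gain, best_thr = 0.0, None
    for thr in sorted(set(x))[:-1]:
        left = [g for xi, g in zip(x, grad) if xi <= thr]
        right = [g for xi, g in zip(x, grad) if xi > thr]
        gain = 0.5 * (sum(left) ** 2 / len(left)
                      + sum(right) ** 2 / len(right)
                      - g_total ** 2 / n)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr

def boost(x, y, rounds=50, eta=0.3):
    """Gradient boosting with depth-1 trees and squared-error loss."""
    f0 = sum(y) / len(y)                                   # initial constant prediction
    pred = [f0] * len(y)
    stumps = []
    for _ in range(rounds):
        grad = [p - yi for p, yi in zip(pred, y)]          # step (1): dL/df
        thr = best_split(x, grad)                          # step (2): tree structure
        if thr is None:
            break
        left = [g for xi, g in zip(x, grad) if xi <= thr]
        right = [g for xi, g in zip(x, grad) if xi > thr]
        w_left = -sum(left) / len(left)                    # step (3): optimal leaf weights
        w_right = -sum(right) / len(right)
        stumps.append((thr, eta * w_left, eta * w_right))  # step (4): shrink by eta
        pred = [p + (eta * w_left if xi <= thr else eta * w_right)
                for p, xi in zip(pred, x)]                 # step (5): update the model
    return f0, stumps

def predict(model, xi):
    """Sum the initial constant and all shrunken tree outputs."""
    f0, stumps = model
    return f0 + sum(wl if xi <= thr else wr for thr, wl, wr in stumps)

def mse(model, x, y):
    return sum((predict(model, xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = boost(x, y)
baseline = (sum(y) / len(y), [])
print(mse(baseline, x, y), mse(model, x, y))
```

For squared-error loss the optimal leaf weight in step (3) is the negative mean gradient in the leaf, which is why no numerical optimization appears in the sketch; other losses would need a different step (3).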
