Large-Scale Diet Tracking Data Reveal Disparate Impacts of Food Environment
Total Page:16
File Type:pdf, Size:1020Kb
Large-Scale Diet Tracking Data Reveal Disparate Impacts of Food Environment Tim Althoff,1∗ Hamed Nilforoshan2, Jenna Hua3, Jure Leskovec2;4 1Allen School of Computer Science & Engineering, University of Washington 2Department of Computer Science, Stanford University 3Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine 4Chan Zuckerberg Biohub, San Francisco, CA ∗To whom correspondence should be addressed; E-mail: [email protected] An unhealthy diet is a key risk factor for chronic diseases including obesity, diabetes, and heart disease1, 2. Limited access to healthy food options may contribute to unhealthy diets3, 4. Studying diets is challenging, typically restricted to small sample sizes, single locations, and non-uniform design across studies, and has led to mixed results on the impact of the food environment5–21. Here we leverage smartphones to track diet health and weight status in a country-wide observational study of 1,164,926 U.S. participants and 2.3 billion food entries to study the independent contri- butions of fast food and grocery access, income and education on these outcomes. This study con- stitutes the largest nationwide study examining the impact of the food environment on diet to date, with 300 times more participants and 4 times more person years of tracking than the Framingham Heart Study. We find that higher access to grocery stores, lower access to fast food, higher income and education are independently associated with higher consumption of fresh fruits and vegetables, lower consumption of fast food and soda, and lower likelihood of being overweight/obese, but that these associations vary significantly across predominantly Black, Hispanic, and White zip codes. For instance, within predominantly Black zip codes we find that high income is associated with a decrease in healthful food consumption patterns across fresh fruits and vegetables (6.5%) and fast food (5.5%). Further, high grocery access has a significantly larger association with increased fruit and vegetable consumption in predominantly Hispanic (7.4% increase) and Black (10.2% in- crease) zip codes in contrast to predominantly White zip codes (1.7% increase). Policy targeted at improving food access, income and education may increase healthy eating, but interventions may need to be targeted to specific subpopulations for optimal effectiveness. 1 1 Introduction Unhealthy diet is the leading risk factor for disability and mortality globally1. Emerging evidence suggests that the built and food environment, behavioral, and socioeconomic cues and triggers significantly affect diet5. Prior studies of the food environment and diet have led to mixed results5–21, and very few used nationally representative samples. These mixed results are poten- tially due to methodological limitations of small sample size, differences in geographic contexts, study population, and non-uniform measurements of both the food environment and diet across studies. Therefore, research with larger sample size and using improved and consistent methods and measurements is needed7, 22, 23. With ever increasing smartphone ownership in the U.S.24 and the availability of immense geospatial data, there are now unprecedented opportunities to combine various data on individual diets, population characteristics (gender, race and ethnicity), socioeco- nomic status (income and education), as well as food environment at large scale. Interrogation of these rich data resources to examine geographical and other forms of heterogeneity in the effect of food environments on health could lead to the development and implementation of cost-effective interventions25. Here, we leverage large-scale smartphone-based food journals and combine sev- eral Internet data sources to quantify the independent impact of food (grocery and fast food) access, income and education on food consumption and weight status of 1,164,926 subjects across 9,822 U.S. zip codes. This study constitutes the largest nationwide study examining the impact of the food environment on diet to date, with 300 times more participants and 4 times more person years of tracking than the Framingham Heart Study26. 2 2 Methods 2.1 Study Design and Population We conducted a United States countrywide cross-sectional study of participants’ self-reported food intake and body-mass index (BMI) in relation to demo- graphic (education, ethnicity), socioeconomic (income), and food environment factors (grocery store and fast food access) captured on zip code level. Overall, this cross-sectional matching-based study analyzed 2.3 billion food intake logs from U.S. smartphone participants over seven years across 9,822 zip codes (U.S. has total of 41,685 zip codes). Participants were participants of the MyFitnessPal app, a free application for tracking caloric intake. We analyzed anonymized, retrospective data collected during a 7-year observation period between 2010 and 2016 that were aggregated to the zip code level. Comparing our study population to nationally representative survey data, we found that our study population had signifi- cant overlap with the U.S. national population in terms of population demographics, education and weight status (Body Mass Index; BMI), but that it was skewed towards women and higher income (Supplementary Table e1). Our matching-based statistical methodology controls for observed biases between comparison groups in terms of income, education, grocery access, and fast food access (Section 2.4). Data handling and analysis was conducted in accordance with the guidelines of the Stanford University Institutional Review Board. 2.2 Study Data We compute outcome measures of food consumption and weight status from 2.3 billion food intake logs by 1,164,926 U.S. participants of the MyFitnessPal (MFP) smartphone application to quantify food consumption across 9,822 zip codes. During the observation period from January 1, 2010 to November 15, 2016, the average participant logged 9.30 entries into their digital food journal per day. The average participant used the app for 197 days. All participants in this sample used the app for at least 10 days. We classified the 2.3 billion food intake entries into three categories of public health interest, fresh fruits and vegetables (F&V), fast food, and sugary non-diet soda, and excluded them from analysis if they did not match these categories. For more details see Supplementary Information. We intentionally use a cross-sectional rather than longitudinal study design, since fine-grained and large-scale temporal data on changes in the food environment were not available. We obtained data on demographic and socioeconomic factors from CensusReporter27. Specif- ically, for each zip code in our data set we obtained median family income, fraction of population 3 with college education (Bachelor’s degree or higher), and fraction of population that is White (not including Hispanic), Black, or Hispanic from the 2010-14 American Community Survey’s census tract estimates27. While data were available only on zip code level, previous studies have shown that area-level income measures are meaningful for health outcomes and describe unique socioe- conomic inequities.28 Grocery store access was defined as the fraction of population that is more than 0.5 miles away from a grocery store following the food desert status definitions from the USDA Food Ac- cess Research Atlas29. Contrary to the USDA definition, we found evidence that even in rural zip- codes, the fraction of population greater than 0.5 miles away from grocery stores has the strongest association with food consumption (compared to 10 and 20 miles away), and thus we used 0.5 miles as the threshold in both rural and urban zipcodes (Supplemental Methods: Details on food environment measures). We measured fast food access through the fraction of restaurants that are fast food restaurants within a sample from Yelp, querying the nearest 1000 businesses from the zip code’s center, up to a maximum radius of 40 km (25 miles). See Supplementary Information for details and validation of these objective food environment measures. We will release all data aggregated at a zipcode level in order to enable validation, follow-up research, and use by policy makers. 2.3 Reproducing state-of-the-art measures using population-scale digital food logs To investi- gate the applicability of population-scale digital food logs to study the impact of food environment, income and education on food consumption, we measured the correlation between our smartphone app-based measures and state-of-the-art measures of food consumption including the Behavioral Risk Factor Surveillance System (BRFSS), based on representative surveys of over 350,000 adults in the United States 30, 31, and the Nielsen Homescan data 32, which is a nationally representative panel survey of the grocery purchases of 169,000 unique households across the United States, based on UPC records of all consumer packaged goods participants purchased from any outlet. (Figure 3). We compare against BRFSS rather than National Health and Nutrition Examination Survey (NHANES), since BRFSS is significantly larger than NHANES, it is remotely administered matching our study, and it has much better geographical coverage than NHANES and geographical comparisons are central to our study. Comparing our data to BRFSS on county level, we found strong correlations between the 4 amount of fresh fruits and vegetables (F&V) consumed (Figure 3a, R=0.63, p < 10−5) and body mass index (Figure 3b, R=0.78, p < 10−5). Comparing to USDA purchase data from the Nielsen Homescan Panel Survey we found that our app-based food logs were very highly correlated with previously published results (Figure 3c, R=0.88, p < 0:01) and that the absolute differences be- tween food deserts and non-food deserts were stronger in the MFP data compared to Nielsen pur- chase data. See Supplementary Information for more details. These results demonstrate convergent validity and suggest that population-scale digital food logs can reproduce the basic dynamics of traditional, state-of-the-art measures, and they can do so at massive scale and comparatively low cost.