<<

Downloaded by guest on October 2, 2021 ASCV2incidence SARS-CoV-2 We herd . toward the progress for of results months in our immunity. first of reported state implications which the which the each counts, in explore to especially in extent true large, day underestimated the was each have estimate and cases on We rate COVID test States. fatality positive United new infection on the of the model estimate number the results to the calibrate the proceed use to and We random- states counts Ohio. from two and data administered these Indiana with tests from in inference surveys of our the testing anchor number to sample We the due day. each and deaths on on cases, data confirmed incorporate virus, likeli- that includes components model on hood Our based with parameter. is model reproductive It (SIR) sources. time-varying data Susceptible–Infected–Removed available discrete-time main com- a the by of viral several estimate bining a to develop framework We Bayesian strategies. mitigation simple popula- of the of effectiveness of the spread estimates and the virus Reliable understanding for delays. necessary are big the prevalence and tion with space, come and overestimate time can in greatly results sparse substan- are only to source, the infections unbiased tend surveys, putatively prevalence lag rates random deaths positivity Representative prevalence. and test infections, while underestimates tially, For of largely reporting. number cases confirmed delayed the of and number have biases the all example, but including population, the drawbacks, Yiannoutsos the about T. in Constantin major information and infections giving Bassett SARS-CoV-2 Mary data of by of reviewed number 2021); sources 17, multiple February review are for There (sent 2021 16, June Raftery, E. Adrian by Contributed a surveys Irons J. random Nicholas and tests, cases, confirmed deaths, from infections SARS-CoV-2 Estimating bansaitclypicpe siae ftenme fnew of PNAS number the of estimates principled statistically obtain 4). (3, estimates Ohio for surveys and like- testing Indiana likelihood in random-sample Poisson normal conducted from a a seroprevalence and with and of viral this counts of spread combine death the We for population. simulate lihood a to in used disease widely model compartmental modified epidemiological a a testing model, use and We (SIR) cases, (2). Project Susceptible–Infected–Removed confirmed relies Tracking COVID, model COVID that the to Our by due model data. reported deaths the Bayesian in on simple delays data and on a biases using the for infections accounts of number the the of reflection accurate (1). more pandemic a the the of provide of course and due estimate an count deaths as death Reported problematic true less time. comparable considered over not are or COVID are to counties, and or rate states infection between overall do visits the room estimate viral emergency not overestimate and to rates tend population. Hospitalization rates the positivity prevalence. test in token, infections same the of to By tend to number early that due true counts bias the the case selection underestimate yield in with infections, mild pronounced conjunction and in particularly asymptomatic pandemic, was the of which days tests, viral to S data COVID States United eateto ttsis nvriyo ahntn etl,W 89;and 98195; WA Seattle, Washington, of University Statistics, of Department ihteedt,w ne h neto aaiyrt IR and (IFR) rate fatality infection the infer we data, these With to relevant data of sources main the of several combine We h rert fifcini h ouain ako access of Lack population. the obscure in that infection biases of with rate fraught true are the data test ARS-CoV-2 01Vl 1 o 1e2103272118 31 No. 118 Vol. 2021 a | ooaiu infections coronavirus n dinE Raftery E. Adrian and | aeinestimation Bayesian a,b,1 | b eateto oilg,Uiest fWsigo,Sate A98195 WA Seattle, Washington, of University Sociology, of Department 1 ulse uy2,2021. 26, July Published doi:10.1073/pnas.2103272118/-/DCSupplemental at online information supporting contains article This See under distributed BY).y is (CC C.T.Y., article access and open University; This Harvard interest.y competing Health, no declare Public authors The of University.y Indiana School Health, Public Chan of School H. Fairbanks T. M.B., Reviewers: analyzed paper. research, y the performed wrote research, and designed data, A.E.R. and N.J.I. contributions: Author a also conducted was and has results (6) Connecticut present survey seroprevalence also York. representative we New statewide estimates, order and our In Connecticut of model. for accuracy probabilistic two the our Ohio, assess into to and incorporate sur- we Indiana testing that random-sample for veys representative results statewide with detailed states present we Here, Results esti- equations. be than differential model rather on to simplify based model, model model SIR to continuous-time discrete-time the Finally, a a constant. in use fixed parameter we a implementation, a than as rather testing, IFR mated, preferential the for model include well statistical and as a incorporate we develop restrictions, Furthermore, data, of restrictions. testing the loosening to fluc- adherence and for in account Whereas tightening as to the time ways. prelockdown in in vary change to tuation significant measures to it allow distancing parameter we in social contact postlockdown, and SIR of differs the effect allowing the it by model although al. et (5), Johndrow al. et representative conducted not for surveys. have counts testing that infection states and of IFR majority the a vast each down the pin as in to administered us testing tests allows This of preferential state. number model for cumulative a the accounts build of Ohio. that to function and states cases Indiana these confirmed from in results for 2020 our March leverage since then We day each on infections owo orsodnemyb drse.Eal [email protected]. Email: addressed. be may correspondence whom To ntdSae a enifce so al ac 2021, at March immunity the early herd of from of far 20% as was point. that about country infected the only infections been that suggesting so, had of Even States 60% United unreported. approximately gone that have find These prevalence We states. unbiased potentially US estimates. provide which all Ohio, Indi- in and from surveys ana to infections testing random due of representative esti- include to number uncertain data sources economy. true data remains multiple the the spread combine mate We of necessi- virus’ data. test sectors has the in biases and of which extent schools pandemic, The 600,000 of COVID-19 over shutdowns the Nationwide, tated in States. mil- United died 33 the have over in infected people has lion SARS-CoV-2 coronavirus novel The Significance u ipeBysa oe ae nprto rmJohndrow from inspiration takes model Bayesian simple Our online o eae otn uha Commentaries.y as such content related for https://doi.org/10.1073/pnas.2103272118 raieCmosAtiuinLcne4.0 License Attribution Commons Creative . y https://www.pnas.org/lookup/suppl/ y | f9 of 1

SOCIAL SCIENCES STATISTICS included in a nonrandom seroprevalence study of 10 sites across is the ratio of estimated cumulative infections to cumulative the country (7). New York has the highest number of reported confirmed cases. Fig. 1 displays the viral prevalence, the cumu- deaths due to COVID, and there is a body of literature studying lative , and the reproductive number r(t) = βt /γ on the spread of the disease in the state, including refs. 8–10. The each day. estimates from these studies provide a basis of comparison for By the time that the first confirmed case was reported in Indi- our results. ana on March 6, 2020, there had likely been more than 800 We also present aggregated estimates for the entire United infections in the state (95% interval 483 to 1,384). We estimate States. SI Appendix, Table S1 includes estimates of the IFR and that as of May 1, 2020, there were 274,000 cumulative infections the cumulative incidence (i.e., the percent of the state’s popula- (95% interval 230,000 to 327,000), compared to 18,630 confirmed tion having been infected) and undercount factor for all 50 states cases by that date. This yields a cumulative incidence of 4.1% and the District of Columbia (DC) as of March 7, 2021, the last (3.4 to 4.8) and an undercount factor of 14.7 (12.4 to 17.6). This day reported by the COVID Tracking Project. Plots for the 50 estimate is comparable to others in the literature for that period states and DC are also shown in SI Appendix. We have created an (5, 7, 12). Between March 16 and March 19, 2020, the state’s online dashboard where updated results can be found, including Governor Eric Holcomb ordered a stop to indoor dining, estimated daily infections, the IFR, and the reproductive number declared a state of emergency, and closed schools; on March 23, r(t) in each state (11). 2020, he issued a stay-at-home order. According to our model, the first wave of infections reached its peak about 2 wk later in Indiana. We estimate an IFR of 0.84% (95% interval 0.70 to 1.00) early April 2020. in Indiana. We estimate the cumulative incidence of COVID- 19 in the state at 19.7% (16.5 to 23.7) as of January 1, 2021, Ohio. We estimate an IFR of 0.83% (95% interval 0.68 to 1.03) and 20.9% (17.5 to 25.1) as of January 15, 2021. By March in Ohio. As of March 7, 2021, the cumulative incidence in the 7, 2021, cumulative incidence had increased to 22.9% (19.2 to state was 19.5% (15.9 to 23.7), and the cumulative undercount 27.6), nearly a quarter of the state, or about 1.5 million infec- factor was 2.3 (1.9 to 2.9). tions. There have been 2.3 (1.9 to 2.8) infections for every Ohio Governor Mike Dewine declared a state of emergency confirmed case in the state through this date. This suggests that on March 9, 2020, and the state’s first stay-at-home order took a large majority of infections in the course of the pandemic effect on March 23, 2020. In mid-April, the governor declared have gone unreported, although Fig. 1 shows that undercounting that businesses could begin to reopen on May 1. Fig. 2 shows was most pronounced early on and has improved substantially that the first wave of infections, which picked up in March 2020 over time. and likely peaked by late April 2020, did not die out, but, rather, Fig. 1 exhibits posterior estimates of new infections on each leveled out to a sustained spread through the summer of 2020. day, νt , as well as the cumulative undercount factor, which The posterior median of the reproductive number r(t) in

Fig. 1. Posterior median and middle 95% intervals for daily new infections, cumulative incidence, reproductive number r(t), and cumulative undercount in Indiana from March 2020 to March 2021. In Upper Left, deaths divided by the posterior median IFR are plotted in gray for comparison. Pop., population.

2 of 9 | PNAS Irons and Raftery https://doi.org/10.1073/pnas.2103272118 Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 siaigSR-o- netosfo ets ofimdcss et,adrno surveys random and tests, cases, (14.7 confirmed deaths, from 18.6% infections SARS-CoV-2 2021, Estimating 7, Raftery and March Irons of As state. York New for 1.42) our in York. data New of did source possi- we a that the as reason, note raising survey this Connecticut we 7.8%, analysis. For the at bias. al., include nonresponse low et not significant was Mahajan esti- period of rate our of bility the response While for those lower. survey significantly 6.0) with the is sero- to which disagree a 29, 2.0 mates reported July interval (6) to random- (90% 10 al. a June 4.0% in et comparison, Mahajan of By survey, prevalence 11.1). test to blood (7.2 sample 8.9% from to result increased the as used. with well test well as antibody agrees bias, model. the our selection estimate of by their specificity period affected sam- Nevertheless, and been convenience the purposes, have a sensitivity in clinical may on imperfect state for it relied collected the so study specimens in and blood Their residual 6.5) seroprevalence 3. of to a May ple to 3.6 estimated 26 COVID. interval (7) April from (95% al. recovered 4.9% et had of Havers population state’s comparison, the In of 7.1) to to an (12.9 to leading 2.5). 15.9% to infected, 2021, (1.6 been 2.0 7, had of March factor population undercount state’s of the As of Connecticut. 19.9) in 1.70) of to wave second Connecticut. mid- the through as fall. April the thereafter in early began increased from infections one and around September hovered state the In 2021. March to 2020 March from Ohio 2. Fig. yJl ,22,oretmt ftercvrdpopulation recovered the of estimate our 2020, 5, July By (4.6 5.7% 2020, 26, April of as model, our to According otro einadmdl 5 nevl o al e netos uuaieicdne erdcienumber reproductive incidence, cumulative infections, new daily for intervals 95% middle and median Posterior eetmt nIRo .2 9%itra .7to 0.87 interval (95% 1.12% of IFR an estimate We eetmt nIRo .7 9%itra 1.10 interval (95% 1.37% of IFR an estimate We etsdvddb h otro einIRaepotdi ryfrcmaio.Pp,population. Pop., comparison. for gray in plotted are IFR median posterior the by divided deaths Left, Upper o ahdyfo h I Eq. SIR the from day cal- each parameter we for curve, contact infection in effective the summarized an of are trajectory culated results sampled prevalence The each viral day. For of 3. each estimates Fig. on obtain States to United states the the in of all from ries States. United that visits care COVID-19. at routine to for unrelated NYC scheduled from Hospital in Sinai collected Mount plasma seroprevalence at residual patients sampled 20% Stadlbauer randomly measured on of based who that time (9), matches of 20% number al. or This et infections, population. million 1.7 city’s yields the in period occurring that state the during in NYC by deaths COVID number confirmed been that of had Multiplying fraction people, the coronavirus. million novel 2.2 the about with or infected 14.7), to during 9.0 of interval NYC those (95% of with that consistent are to results similar the our al. for been et and IFR Yang have 2020, the of to expect spring whole we the such, a COVID the As as confirmed through 2021. state state all of January the of of 85% in week 57% deaths) than first and probable more including city for (not the deaths accounted in period testing, deaths this available COVID Depart- (10), Health on NYC data based to ment According of 2020, data. York IFR 6, mobility New and in an June mortality, wave first estimated through the (NYC) (8) for 1.77) City al. to 1.04 et interval (95% Yang 1.39% However, literature. undercount the date. an that yielding through infected, 2.8) to been (1.7 had 2.1 of state factor the of 23.9) to eetmt htb ue6 15 ftesaespopulation state’s the of 11.5% 6, June by that estimate We in York New in IFR the of estimates other no of know We esme otro ape fteSRtrajecto- SIR the of samples posterior summed We 1. https://doi.org/10.1073/pnas.2103272118 r (t ,adcmltv necutin undercount cumulative and ), β t o h niecountry entire the for PNAS | f9 of 3

SOCIAL SCIENCES STATISTICS Fig. 3. Aggregated estimates of new infections, cumulative incidence, reproductive number r(t), and cumulative undercount for the United States from March 2020 to March 2021. In Upper Left, deaths (in thousands) divided by 0.0068 and shifted back 23 days are plotted in gray for comparison. Pop., population.

As of March 7, 2021, we estimate that 19.7% of the US pop- 2. Immunity is conferred upon becoming fully vaccinated. Data ulation, or about 65 million people, had been infected with tracking the number of people in the United States fully vacci- SARS-CoV-2. This suggests that the United States was far from nated on each day are available from Our World in Data (14). reaching and that it was unlikely to do so from Beyond April 16, 2021, we assume that the number fully vac- infections alone in the short term while state and local govern- cinated on each day follows the linear trend it had exhibited ments continued to implement lockdowns and other mitigations. so far until reaching 2 million per day (Fig. 4D). After that, we Up to that date, we estimate that 1 out of every 2.3 infections in assume that the number fully vaccinated per day remains at 2 the United States had been confirmed via testing. This implies million. that approximately 60% of all infections in the country had gone 3. After January 6, the reproductive number r(t) follows an unreported. autoregressive AR(1) model with mean estimated from the In Fig. 3, Upper Left, which exhibits estimates of new infec- sampled posterior trajectories of r(t) through January 6. tions on each day in the United States, we plot reported COVID deaths per 1,000 population shifted back 23 days (which is the The first point merits further discussion. Our projections that mean of the time-to-death distribution τ). In the plot, we divide follow are particularly sensitive to this assumption. It may turn deaths per 1,000 by 0.0068. This is the point estimate of IFR out that individuals who have been vaccinated or previously reported by Meyerowitz-Katz and Merone (13) in their meta- infected are still susceptible to new variants of the virus that analysis of 24 IFR estimates from a wide range of countries are cropping up and will continue to spread. It is also possi- published between February and June 2020. The two curves have ble that the natural immunity conferred by asymptomatic and a substantial overlap, suggesting that the IFR implied by our esti- mild infections that elicited minimal immune response, which mates of true infections in the United States is consistent with constitute a large portion of the total, will not last long enough their findings. to prevent widespread reinfection in the next few months. In either case, if Assumption 1 is violated, then we may experi- Implications for Herd Immunity. To illustrate the potential use of ence further waves of infection and delayed progress toward herd our method, we conducted a simulation to assess the implications immunity. of our results for herd immunity in the United States. We project Assumption 3 requires that the reproductive number oscil- the SIR model for the United States forward from January 6, lates around its mean, which is approximately 1.1 based on our 2021, and incorporate vaccine administration into the dynamics. estimates of r(t) through January 6, 2021. This assumption is We make the following strong assumptions: borne out by the plots of r(t) in Figs. 1–3 and for many other states in SI Appendix. A possible explanation for this trend is 1. Recovered individuals are immune to the virus, i.e., reinfec- the public and governmental response to deviations of r(t) from tion does not occur. one. As r(t) exceeds one and cases rise, lockdowns and other

4 of 9 | PNAS Irons and Raftery https://doi.org/10.1073/pnas.2103272118 Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 siaigSR-o- netosfo ets ofimdcss et,adrno surveys random and tests, cases, confirmed deaths, from infections SARS-CoV-2 Estimating Raftery and as Irons reckoning) our (by infections million 52 and deaths COVID 60% that be at will immunity that less. cumulative or Note that suggests incidence. model cumulative our point, projected interquartile the the as beginning of obtained infections, are range numbers new will These million we 7. 31 January that been to from project million have we 18 will another there, this population incur getting At the In valid. through suppressed. the are spread effectively assumptions of in our virus’ 1/100th day if the about per 2021, point, 5,000, infections June below by new fall peak, of winter likely sim- number our would on the country Based that the 4. Fig. find in we plotted ulation, are day fully each or immu- infected on previously cumulative vaccinated) population SIR the and of modified infections percentage the (the New nity under above. forward described trajectory model infection US the of of trajectories sampled the the from of parameters model variance AR(1) and autocorrelation the AR in estimate also stochastic uncertainty The future short-term. captures the model beyond much projection too for increase can used variance if the but estimation, for that well works instead in assumed described we as walk, data, random have reopen, we to which allowed for be may restaurants, and causing bars as such con- from nesses, data as to the virus; for implemented the fit be best tain of may line The interventions population. (14). Pop., Data nonpharmaceutical red. in in World plotted Our is by reported) reported is as number States positive United a which the on in day day first each (the on 2021 vaccinated 14, fully January newly people of number the use We black. in (C plotted deaths COVID and (B), i.4. Fig. optti nprpcie hr eeaot3000confirmed 360,000 about were there perspective, in this put To distribution posterior the from samples 40,000 the project We (A–C r (t ,cmltv muiy(ia niec n ulvaccinations) full and incidence (viral immunity cumulative (A), infections new of distributions predictive the for intervals credible 95% The ) ) oices.We tigtemdli ieperiods time in model the fitting When increase. to r (t ) e okTimes York New nteUie ttspoetdotfo aur 01truhJl 01 In 2021. July through 2021 January from out projected States United the in ) rp eo n n ae wnl,busi- dwindle, cases and one below drops hsprocess This Methods. and Materials ctepo of Scatterplot (D) Project. Tracking COVID the by reported day last the 2021, 7, March after deaths for (15) data r (t ) oeacrtl.We accurately. more r (t ). r (t ) olw a follows iyt h iu neryMrh22 rmifcin ln.This immu- alone. aggressive infections herd an from and reaching 2021 mitigation March from continued large early that far a suggests in still virus find that the was we to indicate so, States nity results Even United unreported. Our the go day. that infections the each COVID and of on IFR majority the state number of US reproductive estimates each and present prevalence We viral surveys date. testing time-varying random to highest-quality out Indiana the carried are in which conducted Ohio, statewide and as surveys well as point-prevalence virus, representative the of tracking model data Bayesian time-series readily simple available incorporating a SARS-CoV-2 developed of rate dynamics have the transmission We disease. previ- the the of on of impact interventions the nonpharmaceutical of ous assessments reliable strategies, need mitigation and policymakers policy effective implement and craft To projections of modifications the Discussion plausible that to find sensitive 3. We very and 2 data. not Assumptions projections are the our here with from given well deaths which 4C up of 16, Fig. distribution matches April projections. predictive and the our 7 that with January between consistent country is the in deaths data COVID to the According by 2021. reported 16, occurring to April 245,000 186,000 and to 7 to additional January 173,000 lead between with the would deaths, 0.68%, infections COVID new of more million 270,000 IFR 31 an to Assuming million 18 2021. 6, January of e okTimes York New h -a oigaeaeo OI etsis deaths COVID of average moving 7-day the C, 1) hr ee2400COVID 204,000 were there (15), https://doi.org/10.1073/pnas.2103272118 lodemonstrates also PNAS r | (t f9 of 5 ) in

SOCIAL SCIENCES STATISTICS effort are necessary to surpass the herd-immunity threshold with- on day t, to vary over time. This accounts for variation in exposure due to out incurring many more deaths due to the disease. This work implementation or loosening of social distancing and other policy measures demonstrates the value of random-sample testing in response to over time. We model βt as a random walk with step size σ estimated from 2 −1 this and future . the data, βt+1 ∼ Normal(βt , σ ). We assume that γ , the average length in By incorporating testing and case data aggregated over any days of the , is determined by the disease and is therefore constant over time. period of time, our additive model for positive tests in Eq. 2 allows us to avoid using data at the daily level, which can be Likelihood on Deaths. Let τ = {τ0, τ1, ... , τm} denote the distribution of very unreliable. For example, the reported cumulative number time to death for those infected individuals who die from the disease, i.e., τs of tests administered in a state may not be updated for up to is the probability of death s days after infection, conditional on death occur- 2 wk at a time, or it may decrease from one day to the next as ring. Similar to Johndrow et al. (5), who calibrated τ by matching quantiles data are deduplicated upon further review. The latter scenario of a negative binomial distribution to case data from China (20, 21), we frequently occurs with reported cases as well. Working with data assume that τ follows a NegativeBinomial(α, 1/(β + 1)) distribution with at the daily level generally requires using some kind of moving parameters α = 21, β = 1.1, and we truncate the distribution at the 99th average, which washes out stochasticity in the data and leads percentile, or m = 40 days, to rule out extremely delayed deaths. We denote to oversmoothing inconsistent with the high overdispersion of by Dt the reported deaths due to COVID on day t, which we obtain from the COVID Tracking Project (2). We link the daily new infection counts ν = (νt )t SARS-CoV-2 transmission (16). ind. to reported deaths via the likelihood D ∼ Poisson IFR Pt ν τ . Our inference relies on daily reported deaths due to COVID in t k=1 k t−k each state, as opposed to excess deaths. Because of the possibil- Representative Random Prevalence Surveys. To pin down the IFR, we add ity of death misclassification, excess-death data represent a mix likelihood components incorporating the Indiana and Ohio prevalence sur- of confirmed COVID deaths and deaths from other causes. Nev- vey data (3, 4). Active viral prevalence in Indiana in the period April 25 to ˆ ertheless, relying on reported deaths is a potential source of bias, 29, 2020, was estimated as θv = 1.74%. We model this quantity using a nor-   as they are affected by the accuracy of cause-of-death determi- mal approximation to the binomial distribution, θˆ ∼ Normal θ , θv (1−θv ) , v v nv nations. Their numbers can fall significantly below excess-death PT2 where θv = ( It )/N(T1 − T2) is the average viral prevalence between counts and may undershoot the true number of deaths due to the t=T1 disease (1). Ascertainment of COVID deaths may vary between days T1 = April 25 and T2 = April 29. Here, nv = 3, 605 is the number of viral tests administered. Similarly, the estimated seroprevalence in the states, with the cumulative excess-death count since the start of   testing period, θˆ = 1.09%, is modeled as θˆ ∼ Normal θ , θs(1−θs) , where the pandemic exceeding reported COVID deaths by upwards s s s ns PT2 θs = Rt /N(T1 − T2) and ns = 3518. These results come from the first of 50% in some states, according to a New York Times analysis t=T1 of Centers for Disease Control and Prevention (CDC) mortal- phase of the Indiana prevalence survey described in Menachemi et al. (3). ity data (17). Consequently, our results may underestimate viral The sampled population consisted of all noninstitutionalized Indiana res- incidence in those states. idents aged ≥ 12 years listed on state tax returns, including filers and The CDC estimated a total of 83 million infections in the dependents. Stratified random sampling was conducted by using Indiana’s United States through December 2020 (18), which is substan- 10 public health preparedness districts as sampling strata, and 15,495 partic- ipants were contacted by the state health department. Of those contacted, tially larger than our estimate of 50 million infections in that 3,658, or 23.6%, agreed to participate in the study. While low, this response period. Their numbers are based on the work of Reese et al. rate is not far from the survey industry average of 30% (22, 23). Menachemi (19), who infer COVID incidence in the United States using et al. (3) note that respondents might have been subject to response bias, a multiplier model to account for underdetection in the num- which could have resulted in underestimates or overestimates. To adjust for ber of confirmed cases. Beyond the limitations of our study differences in nonresponse between groups, data were weighted for age, discussed above, there are a few possible explanations for the race, and Hispanic ethnicity. Participants were tested for active infection via difference in our estimates. Reese at al. (19) base their esti- RT-PCR and past infection via antibody test between April 25 and April 29, mates on nationally reported laboratory-confirmed cases, which 2020. The RT-PCR tests used had high, but imperfect, sensitivity, and the anti- do not constitute a probabilistic sample of the population. To this body tests had high, but imperfect, specificity. The former could have caused false-negative results and the latter false-positive results. In a follow-up point, the authors remark that “. . .some infections, such as those paper published in PNAS, Yiannoutsos et al. (24) conducted a Bayesian anal- among healthcare workers or from outbreaks in congregate res- ysis of the Indiana survey data to address uncertainty in the results related idential settings, may be more likely to be tested and nationally to imperfect testing and difference in prevalence among subgroups char- reported compared with the general population, and could over- acterized by ethnicity, race, and age. Due to very low response rates—less estimate nonhospitalized cases and infections.” Furthermore, than 8% in the second and third phases—we do not include data from the the multiplier in their model relies on documented rates of test subsequent phases of the Indiana study in our analysis. administration and care-seeking among symptomatic COVID The likelihood for the prevalence survey data from Ohio is analogous. patients. Reese et al. note that data on rates of test adminis- The survey design was a stratified two-stage cluster sample, with strata tration in this group are limited, especially at the local level. defined by eight administrative regions in the state of Ohio. Within each region, 30 census tracts were randomly selected with probability propor- As such, Reese et al. do not account for geographic variation tional to total population. Within each tract, households were randomly in testing, which is a potential source of bias. sampled, and one adult within each household was randomly selected to participate in the study. Of those contacted, 727, or 18.5%, agreed to partic- Materials and Methods ipate. Between July 9 and July 28, 2020, participants were tested for active SIR Model. We first define our discrete-time SIR model for infections in each and past infection via RT-PCR and antibody test. Due to the low response state. Let St denote the number of susceptible people in the population rate and imperfect diagnostic tests used in the study, the same caveats on day t, It the number of infections, and Rt the number removed. The described above for the Indiana survey apply. Kline et al. (4) conducted a number removed includes those who have died of the disease and those Bayesian analysis of the seroprevalence survey data to address the uncer- who have recovered and are assumed immune for the rest of the period of tainty in the results associated with nonresponse and imperfect testing. As ˆ our study. With N denoting the state population, these quantities evolve in reported in ref. 4, the estimated seroprevalence in the state was θs = 1.3% time according to the equations in the period July 9 to 28, with a sample size of ns = 667. Results from the PCR tests in the same study were reported in a press conference on October  − = − βt , St+1 St N It St 1 available on YouTube (ref. (25), minute 22). The viral prevalence in that  β ˆ t period is estimated as θv = 0.9% with sample size nv = 727. To the best of It+1 − It = N It St − γIt , [1]  our knowledge, these numbers have not yet been published. Rt+1 − Rt =γIt .

Note that νt = St−1 − St is the number of new infections on day t. We allow Modeling Preferential Testing. As shown in Figs. 1 and 2, the undercount P the parameters βt , interpreted as the mean number of contacts per person curve (It + Rt )/( k≤t Ck) has a common shape in Indiana and Ohio. Here,

6 of 9 | PNAS Irons and Raftery https://doi.org/10.1073/pnas.2103272118 Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 ierywt h oa ubro et diitrd Here, administered. relationship, tests this state’s of in the number variance total as the the grows that with fraction linearly and this up that ramps assume capacity We test tests. positive of number siaigSR-o- netosfo ets ofimdcss et,adrno surveys random and tests, cases, confirmed deaths, from infections SARS-CoV-2 Estimating Raftery and Irons population. Pop., red. in plotted are curves smoothing not do we example, For to model. extent testing the rates preferential is positivity our This test capacity. in model test used of state’s is the terms it such, of which in As estimate tested. whether an people units, as unique understood different or specimens, using dif- test Moreover, tests encounters, tests. test total (PCR) viral report as this states well states, as ferent across tests methods antigen test-reporting include in may day variation number on to state Due (2). the Project in ing results test total of that so n ho (Lower Ohio. and 5. Fig. day to up tested population the of fraction parameters the Here, 5. Fig. similar; are data: states test two the each the for on model for state following lines the the these to in of led slopes administered This the tests that of square and number the day cumulative against plotted the reciprocal when of the linear that root approximately day found is on We undercount (2). positive state the Project the a Tracking of in COVID with test the amplification by people reported acid unique as nucleic approved as other defined or cases, PCR probable and confirmed I t and R φ (Upper t t r h I aaeeso day on parameters SIR the are steoealfato fifcin htapa ntecumulative the in appear that infections of fraction overall the is otro einad9%cndnebnsfrtecmltv ersinfnto nEq. in function regression cumulative the for bands confidence 95% and median Posterior ) φ oiietsso ahdypotdaanttepseirma ftemria ersinfnto nEq. in function regression marginal the of mean posterior the against plotted day each on tests Positive ) t = X k =1 φ φ t t s C and k C P t ∼ /T k t η =1 N Normal t t nec day. each on r rprinlt h qaero fthe of root square the to proportional are T k ,  φ t t η srpre yteCVDTrack- COVID the by reported as , (I t 2 t t and , = t + , η R 2 t ), P C η t k t t 2 =1 N stettlnme of number total the is  . T k , T t stenumber the is η T t 2 t grows , sbest is [2] t , neednl as independently dvddb h parameter the by Eqs. (divided 3] [ in and [2] defined in relationships means normal the the to demonstrates refer 5 We Fig. the samples cases. and posterior data on resulting deaths incidence the only With cumulative used state. of we each in initially surveys is, random-sample That above. described cases independently. periods these in counts consecutive nonoverlapping into where tests state, in each and in oscillations cases Rather, weekly weekends. combine on as we reporting well reduced as to tests, due numbers and these cases of reporting inconsistent to that Noting oe.Cmbl ta.(6 nrdcdabnma ieiodo cases, on likelihood binomial a introduced Binomial(T (26) 5, al. testing. et Fig. cumulative Campbell of in posed. function CIs a as widening [2] in The variance low. the of very growth were the exhibit cases and testing when slope 5, comparable Fig. a respectively. reveals functions, regression marginal and cumulative the t day on tions expectation, in Hence, − oarv ttedsrbto nEq. in distribution the at arrive To efis ttemdli nin n howtottelklho on likelihood the without Ohio and Indiana in model the fit first We due [3] day each on likelihood the use not do we model, the fitting In ubro te oesfrcs n etdt aebe pro- been have data test and case for models other of number A 1, I t L −1 sa es ee oacutfrweedefcs n oe the model and effects, weekend for account to seven least at is + C t 1 , t ν R t ind t ∼ , t − −1 = ν . t (1 Normal n mle rcino h uuaieicdneo day on incidence cumulative the of fraction smaller a and , (I . t − + φ I t R /N t C t · ) t  ν − I ) a edcmoe safato ftenwinfec- new the of fraction a as decomposed be can t α φ t + φ + (I t ,where ), (I t o nin n hoatrabifiiilperiod initial brief a after Ohio and Indiana for R −1 2 (φ t t + lte gis uuaiecssi Indiana in cases cumulative against plotted t nec a,w rie ttelikelihood the at arrived we day, each on + − R t R ) φ t − I −1 t ecnmdltecsso ahday each on cases the model can we 2, t −1 https://doi.org/10.1073/pnas.2103272118 /N φ ,w a rt h enof mean the write can we ), )(I t −1 steifcinrt nday on rate infection the is t −1 (I oal siae scatterplot estimated Locally 3. t −1 + R + t −1 R t ). −1 ), η 2 T N PNAS t dyperiods, L-day  . 2 | C and t Upper Lower t f9 of 7 and , as φ C as ) t 3 [3] ∼ .

SOCIAL SCIENCES STATISTICS α > 0 is a parameter representing the degree of preferential testing. Assum- symptom onset (35–42) and that patients can be highly infectious several ing the infection rate is small, a binomial expansion of the test positivity rate days before symptom onset (43). α yields the approximation 1 − (1 − It /N) ≈ αIt /N. An application of Bayes’ We assumed that the removal rate γ is determined by the disease and so rule to the latter model shows that α = P(tested|infected)/P(tested). This does not vary between states. Therefore, after fitting the model to the data model has some limitations in the context of our study. Firstly, the degree of for Indiana, we used the posterior distribution of γ for Indiana as the prior preferential testing α is likely to decrease as testing increases, and it is not distribution of γ for Ohio. We then used the posterior distribution from obvious how one might parametrize α = αt to account for this. Secondly, Ohio as the prior distribution for the remaining states, each of which we the model is not additive, as the test positivity relies on the active infection modeled independently. The prior distributions of the remaining parame- rate. As a result, it is not well suited to handling state-level testing data, ters are diffuse independent uniform priors. Their exact forms are provided which can be unreliable on the daily level. in SI Appendix. To estimate φ, we used the same process as described for γ. Youyang Gu (27) and Peter Ellis (28) proposed similar models to cor- rect case counts using test positivity rates. They take the form νt = Implementation. We built the model in R and fit it with the RStan software k Ct [m · (Ct /Tt ) + b], where m > 0, k ∈ [0, 1], b ≥ 0 are parameters. Benatia et package, which implements the No-U-Turn-Sampler for Bayesian inference al. (29) also estimate population prevalence on day t by the number of (44–46). For each state, we ran four chains in parallel for 20,000 steps each, positive tests on day t scaled by a multiplicative factor depending on the with the first 10,000 as burn-in, to obtain 40,000 samples from the posterior number of tests administered on day t as a fraction of the state popula- distribution of the model parameters. Code to fit the model is available at tion. These models are susceptible to the same issues as that of Campbell the GitHub repository (47). et al. (26). They rely on daily test positivity rates, which are reported incon- sistently across states (30). And as Youyang Gu (27) notes, the parameters Data Cleaning. In certain states, the COVID Tracking Project data report a estimated at one point in time do not carry over to other time periods. negative number of cases, tests, or deaths on some days, often due to record Furthermore, by assuming that new infections are a function only of cases deduplication or changes in data reporting by the state government. If a and tests on that day, these models ignore the lag between infections and negative number of cases or tests is reported, we address this by setting their confirmation via testing. They also presume that there are no new that datum to zero and distributing the negative number over all previous infections on days in which no positive tests are reported. Our likelihood on days proportional to the number of cases or tests reported on those days. cases in Eq. 3 allows for new infections to be reflected in case counts at a If a negative number of deaths is reported, we set that datum to zero and later date. subtract the negative number from the deaths reported on the previous Note that our model does not take into account imperfect testing. Mod- day. If this results in a negative number of deaths on the previous day, we eling imperfect testing is complicated by the inconsistent test-reporting continue this procedure until all counts are nonnegative. methods across states described above, which obscure the true num- The COVID Tracking Project also notes days when state governments ber of PCR tests administered in a state on each day. Given that esti- report a backlog of cases or deaths, which usually results in a large spike mated active infection rates are generally low (< 5%) at any given time, in the data on that day. We address this by setting that datum to the aver- imperfect test specificity (i.e., the proportion of true negative results) age of the number of cases or deaths reported on the day before and the is a greater potential source of bias in case counts Ct than sensitiv- day after and distributing the excess number of cases or deaths over all pre- ity. False positives resulting from imperfect specificity would increase Ct . vious days proportional to the number of cases or deaths reported on those We note, however, that the molecular RT-PCR assays widely deployed days. to test for the presence of viral RNA are shown to have near perfect specificity (31–34). Data Availability. Code to fit the model is available at the GitHub repository (https://github.com/njirons/covidest) (47). An online dashboard Prior Specification. Lastly, we specify prior distributions for the model displaying our results is available (https://rsc.stat.washington.edu/covid- −1 parameters {IFR, β1, σ, γ ,(S1, I1), φ, η}. We used a weakly informative dashboard) (11). Previously published data were used for this work (Covid Uniform(0, 0.03) prior distribution for the IFR in each state. For Indiana, Tracking Project; https://covidtracking.com/) (48). we used a truncated normal prior for the mean infectious period, γ−1 ∼ 2 Normal[5.5,11.5](8.5, 1.5 ). This is motivated by clinical data, which show that ACKNOWLEDGMENTS. This research was supported by NIH Grant R01 most infected individuals remain infectious no longer than 10 days after HD-070936.

1. National Academies of Sciences, Engineering, and Medicine, Evaluating Data Types: 13. G. Meyerowitz-Katz, L. Merone, A systematic review and meta-analysis of published A Guide for Decision Makers Using Data to Understand the Extent and Spread of research data on COVID-19 infection fatality rates. Int. J. Infect. Dis. 101, 138–148 COVID-19 (National Academies Press, Washington, DC, 2020). (2020). 2. The COVID Tracking Project Team, The Covid Tracking Project (2021). https:// 14. M. Roser, H. Ritchie, E. Ortiz-Ospina, J. Hasell, Coronavirus pandemic (COVID-19). Our covidtracking.com/. Accessed 28 April 2021. World in Data (2021). https://ourworldindata.org/coronavirus. Accessed 16 April 2021. 3. N. Menachemi et al., Population point prevalence of SARS-CoV-2 infection based on 15. The New York Times COVID-19 Data Team, Coronavirus (Covid-19) data in the a statewide random sample—Indiana, April 25–29, 2020. CDC Morbid. Mortal. Wkly. United States. The New York Times (2021). https://github.com/nytimes/covid-19-data. Rep. 69, 960–964 (2020). Accessed 16 April 2021. 4. D. Kline et al., Estimating seroprevalence of SARS-CoV-2 in Ohio: A Bayesian multi- 16. Endo A., Abbott S., Kucharski A. J., Funk S.; Center for the Mathematical Modeling level poststratification approach with multiple diagnostic tests. Proc. Natl. Acad. Sci. of Infectious Diseases COVID-19 Working Group, Estimating the overdispersion in U.S.A. 118, e2023947118 (2021). COVID-19 transmission using outbreak sizes outside China. Wellcome Open Res. 5, 5. J. Johndrow, P. Ball, M. Gargiulo, K. Lum, Estimating the number of SARS-CoV-2 infec- 67 (2020). tions and the impact of mitigation policies in the United States. Harvard Data Sci. Rev., 17. J. Katz, D. Lu, M. Sanger-Katz, 574,000 more U.S. deaths than normal since 10.1162/99608f92.7679a1ed (2020). Covid-19 struck. The New York Times, 24 March 2021. https://www.nytimes.com/ 6. S. Mahajan et al., Seroprevalence of SARS-CoV-2-specific IgG antibodies among adults interactive/2021/01/14/us/covid-19-death-toll.html. Accessed 7 April 2021. living in Connecticut: Post-infection prevalence (PIP) study. Am. J. Med. 134, 526– 18. Centers for Disease Control and Prevention, Estimated disease burden of COVID- 534.e11 (2021). 19 (2021). https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/burden.html. 7. F. P. Havers et al., Seroprevalence of antibodies to SARS-CoV-2 in 10 sites in Accessed 7 April 2021. the United States, March 23-May 12, 2020. JAMA Intern. Med. 180, 1576–1586 19. H. Reese et al., Estimated incidence of coronavirus disease 2019 (COVID-19) illness and (2020). hospitalization—United States, February–September 2020. Clinical Infectious Dis. 11, 8. W. Yang et al., Estimating the infection-fatality risk of SARS-CoV-2 in New York City e1010–e1017 (2020). during the spring 2020 pandemic wave: A model-based analysis. Lancet Infect. Dis. 20. F. Zhou et al., Clinical course and risk factors for mortality of adult inpatients with 21, 203–212 (2020). COVID-19 in Wuhan, China: A retrospective cohort study. Lancet 395, 1054–1062 9. D. Stadlbauer et al., Repeated cross-sectional sero-monitoring of SARS-CoV-2 in New (2020). York City. Nature 590, 146–150 (2021). 21. S. A. Lauer et al., The of coronavirus disease 2019 (COVID-19) from 10. New York City Health Department, NYC Coronavirus Disease 2019 (COVID-19) publicly reported confirmed cases: Estimation and application. Ann. Intern. Med. 172, Data (2021). https://www1.nyc.gov/site/doh/covid/covid-19-data.page. Accessed 16 577–582 (2020). February 2021. 22. J. Dixon, C. Tucker, “Survey nonresponse” in Handbook of Survey Research, 11. N. J. Irons, A. E. Raftery, COVID Infections in the United States. University of P. V. Marsden, J. D. Wright, Eds. (Emerald, Bingley, UK, ed. 2, 2010), pp. 593–630. Washington Statistics’ RStudio Connect Server. https://rsc.stat.washington.edu/covid- 23. A. L. Holbrook, J. A. Krosnick, A. Pfent, “The causes and consequences of response dashboard. Deposited: 15 March 2021. rates in surveys by the news media and government contractor survey research firms” 12. S. L. Wu et al., Substantial underestimation of SARS-CoV-2 infection in the United in Advances in Telephone Survey Methodology, J. M. Lepkowski, Ed. (Wiley, Hoboken, States. Nat. Commun. 11, 4507 (2020). NJ, 2007), pp. 499–678.

8 of 9 | PNAS Irons and Raftery https://doi.org/10.1073/pnas.2103272118 Estimating SARS-CoV-2 infections from deaths, confirmed cases, tests, and random surveys Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 3 .Xee l,Cmaio fdfeetsmlsfr21 oe ooaiu eeto by detection coronavirus novel 2019 for samples different of Comparison al., et Xie C. 33. for RT-PCR B real-time B. initial 32. and CT between performance Diagnostic al., et He L. J. 31. Rivera, M. J. Lewis, Kissane, E. J. 30. Godefroy, R. Benatia, Statistics Range D. Free 29. incidence,” disease and rates positivity “Test Ellis, P. 28. 4 .Y ta. uniaiedtcinadvrlla nlsso ASCV2i infected in SARS-CoV-2 of analysis load viral and detection Quantitative al., et Yu F. 34. estimat- in testing preferential for Gu, Y. adjustment 27. Bayesian al., et Campbell H. 26. 5 etr o ies oto n Prevention, and Control Disease for Centers 35. update DeWine—COVID-19 Mike Governor “Ohio SARS- Channel, of Ohio estimation The Bayesian 25. Menachemi, N. Halverson, K. P. Yiannoutsos, T. C. 24. siaigSR-o- netosfo ets ofimdcss et,adrno surveys random and tests, cases, confirmed deaths, from infections SARS-CoV-2 Estimating Raftery and Irons uli cdapicto tests. amplification acid nucleic COVID-19. for tests diagnostic of accuracy Contr. the Infect. of meta-analysis with review Wuhan, outside patients (COVID-19) disease coronavirus China. 2019 suspected clinically (2020). [Preprint] 2021. February medRxiv 16 Accessed com/analysis-updates/test-positivity-in-the-us-is-a-mess. approach, model selection 2021). February sample 16 (Accessed https://doi.org/10.1101/2020.04.20.20072942 A States: United 2021. 16 February Accessed http://freerangestats.info/blog/2020/05/09/covid-population-incidence. 2021. February 16 Accessed rate fatality tion cvhpdrto-slto.tl cesd1 eray2021. COVID-19 February 16 with Accessed ncov/hcp/duration-isolation.html. adults for cautions patients. https://arxiv.org/ (2020). [Preprint] arXiv rate. 2021). February fatality 16 (Accessed infection abs/2005.08459 COVID-19 the ing https://www.youtube.com/watch?v=oVCSOlyJ16k. 2021. February 2020). 16 recording, Accessed (video 2020” 1, o- rvlnei nin yrno testing. random by (2021). Indiana e2013906118 in prevalence CoV-2 gr .M ah,R .Vlea .F or,F .Tnn .Pnaoo Systematic Pontarolo, R. Tonin, S. F. Cobre, F. A. Vilhena, O. R. Fachi, M. M. oger, ¨ epr Med. Respir. siaigtu netos ipehuitct esr mle infec- implied measure to heuristic simple A infections: true Estimating ln net Dis. Infect. Clin. 12 (2021). 21–29 49, 22) https://covid19-projections.com/estimating-true-infections/. (2021). 090(2020). 105980 168, etpstvt nteU samess a is US the in positivity Test 9–9 (2020). 793–798 71, n.J net Dis. Infect. J. Int. 22) https://www.cdc.gov/coronavirus/2019- (2020). siaigCVD1 rvlnei the in prevalence COVID-19 Estimating 6–6 (2020). 264–267 93, uaino slto n pre- and isolation of Duration rc al cd c.U.S.A. Sci. Acad. Natl. Proc. 22) https://covidtracking. (2020). | October (2020). m J. Am. 118, 8 h OI rcigPoetTa,Dt onod h OI rcigProject. Tracking COVID The download. Data Team, Project Tracking COVID The https://github.com/njirons/covidest. 48. GitHub. njirons/covidest. Raftery, E. A. Irons, J. N. 47. sampler. No-U-Turn The Gelman, A. Hoffman, D. M. 46. Team, Development Stan 45. COVID-19. Team, of transmissibility Core and shedding R viral in 44. dynamics Temporal al., et He X. 43. W R. 36. 2 .vnKme,D a eVje,P raj .Hamn,M aes .Ob tal., et Okba N. Lamers, M. Haagmans, B. Fraaij, P. Vijver, de van D. Kampen, van J. Prevention, 42. and Control Disease for Centers Korea 41. Longitudinal al., et Gahm G. Janich, A. Young, M. Sexton, viro- N. and Gallichote, immunological E. Clinical, Quicke, al., K. et Tan 40. X. Lin, Predicting H. al., Liu, et Z. Garnett Xiong, L. Q. Peng, Alexander, J. D. al., Lu, Strong, J. et E. J. Jacobs 39. Funk, R. D. J. Durst, James, K. A. Bullard, Kimball, J. A. 38. Reddy, C. S. Hatfield, M. K. Arons, M. M. 37. tp:/oitakn.o/aadwla.Acse 8Arl2021. April 28 Accessed https://covidtracking.com/data/download. 2021. May 5 Deposited (2014). 2021. February 16 Accessed r-project.org/web/packages/rstan/vignettes/rstan.html. 2020). Austria, Vienna, Computing, Statistical for Foundation (R Med. Nat. hdigo netosvrsi optlzdptet ihcrnvrsdisease-2019 coronavirus with determinants. patients key hospitalized and Duration in (COVID-19): virus infectious of Shedding 2021. February 16 Accessed cases a30402000000&bid=0030. re-positive of analysis and 2021). [Preprint] February medRxiv 16 skilled analysis. (Accessed Colorado https://doi.org/10.1101/2020.06.08.20125989 sequence (2020). five and in virologic staff Epidemiologic, asymptomatic facilities: nursing among RNA SARS-CoV-2 for surveillance by SARS-CoV-2 for re-positive test that RT-PCR. patients COVID-19 of characterization logical samples. diagnostic from SARS-CoV-2 infectious facility. nursing skilled Med. a J. in Engl. transmission N. and infections SARS-CoV-2 Presymptomatic COVID-2019. with (2020). patients 469 hospitalized of assessment Virological le,V .Cra,W ugms .Simir .Zne .A M A. M. Zange, S. Seilmaier, M. Guggemos, W. Corman, M. V. olfel, ¨ EBioMedicine 7–7 (2020). 672–675 26, 0129 (2020). 2081–2090 382, :ALnug n niomn o ttsia Computing Statistical for Environment and Language A R: 090(2020). 102960 59, Sa:TeRitraet Stan to interface R The RStan: 22) https://www.cdc.go.kr/board/board.es?mid= (2020). https://doi.org/10.1073/pnas.2103272118 a.Commun. Nat. ln net Dis. Infect. Clin. .Mc.Lan Res. Learn. Mach. J. idnsfo investigation from Findings 6 (2021). 267 12, 6326 (2020). 2663–2666 71, 22) https://cran. (2020). Nature PNAS 1593–1623 15, le tal., et uller ¨ 465– 581, | f9 of 9

SOCIAL SCIENCES STATISTICS