Respondent-Driven Sampling As Markov Chain Monte Carlo
Total Page:16
File Type:pdf, Size:1020Kb
RESPONDENT-DRIVEN SAMPLING AS MARKOV CHAIN MONTE CARLO SHARAD GOEL AND MATTHEW J. SALGANIK Abstract. Respondent-driven sampling (RDS) is a recently in- troduced, and now widely used, technique for estimating disease prevalence in hidden populations. RDS data are collected through a snowball mechanism, in which current sample members recruit future sample members. In this paper we present respondent- driven sampling as Markov chain Monte Carlo (MCMC) impor- tance sampling, and we examine the effects of community struc- ture and the recruitment procedure on the variance of RDS esti- mates. Past work has assumed that the variance of RDS estimates is primarily affected by segregation between healthy and infected individuals. We examine an illustrative model to show that this is not necessarily the case, and that bottlenecks anywhere in the networks can substantially affect estimates. We also show that variance is inflated by a common design feature in which sample members are encouraged to recruit multiple future sample mem- bers. The paper concludes with suggestions for implementing and evaluating respondent-driven sampling studies. 1. Introduction The Joint United Nations Program on HIV/AIDS (UNAIDS) es- timates that there are between 30 and 35 million people living with HIV/AIDS worldwide, and that between 2 and 4 million people were newly infected in 2007. In most countries outside of sub-Saharan Africa, these infections are concentrated in three subpopulations: men who have sex with men, injection drug users, and sex workers and their sexual partners [1]. Consequently, there is general consensus among epidemiologists that better data about disease prevalence and risk be- haviors within these key subpopulations are critical for understanding and controlling the spread of the disease [2{5]. Date: March 21, 2009. Key words and phrases. hard-to-reach populations, hidden populations, HIV surveillance, importance sampling, Markov chain Monte Carlo, respondent-driven sampling, social networks, spectral gap. 1 2 SHARAD GOEL AND MATTHEW J. SALGANIK Unfortunately, because these subpopulations lack appropriate sam- pling frames, are relatively small, and their members often desire to remain anonymous, they are difficult to study with standard sampling methods. For this reason they are often called \hidden" or \hard-to- reach." A variety of sampling approaches have been tried to study these hidden populations, but in many cases they produce estimates of unknown bias and variance [5; 6]. The resulting uncertainty about key subpopulations has complicated public health efforts to evaluate prevention programs and allocate resources effectively. Respondent-driven sampling (RDS) is a new approach for sampling from hidden populations that is rapidly gaining in popularity: A recent review identified more than 120 RDS studies worldwide [7], including populations as diverse as men who have sex with men in Uganda [8], sex workers in Vietnam [9], and injection drug users in the former So- viet Union [10]. Furthermore, the U.S. Centers for Disease Control and Prevention (CDC) recently selected RDS for a 25-city study of injection drug users that is part of the National HIV Behavioral Surveillance Sys- tem [11]. Because CDC decisions often influence global public health standards, RDS is likely to become increasingly common in the study of hidden populations. RDS data are collected through a snowball mechanism, in which cur- rent sample members recruit future sample members.1 An RDS study begins by recruiting a small number of people in the target population to serve as seeds. After participating, the seeds are asked|and often provided financial incentive|to recruit other people that they know in the target population. The sampling continues in this way with current sample members recruiting the next wave of sample members until the desired sample size is reached.2 The process results in recruitment net- works like the one in Figure 1 from a study of drug users in New York City; the sample began with eight seeds and grew to include 618 people in 13 weeks [27]. Under certain strong assumptions described later in the paper, these RDS data can then be used to produce asymptotically unbiased estimates about the hidden population (e.g., estimates of the proportion of drug users in New York who have HIV). 1This type of sampling is also sometimes called chain-referral, random-walk, or link-tracing [12{19], and can be considered a form of adaptive sampling [20; 21]. 2Although it is beyond the scope of this paper, it is critical to note that there are serious logistical and ethical complications involved in running an RDS study [22{ 26]. To give just one example, in studies of injection drug users, it is common for non-drug users to attempt to participate in the study in order to earn money [22; 26]. RDS AS MCMC 3 Figure 1. Recruitment networks from a study of drug users in New York City. The eight seeds are larger than the others nodes and all nodes are color coded by race/ethnicity. This figure was originally published in [27]. Despite the widespread use of RDS and its potential to address im- portant public health questions, the statistical foundations of RDS re- main poorly understood. This paper presents respondent-driven sam- pling as Markov chain Monte Carlo importance sampling, and analyzes how RDS estimates are affected by both the community structure of the hidden population and the recruitment procedure. We show that the variance of the RDS estimator is increased by: (1) \bottlenecks" between different groups in the hidden population, and (2) a study design in which participants recruit multiple individuals. Our paper is organized as follows. In Section 2 we show that RDS sampling and estimation can be viewed as Markov chain Monte Carlo (MCMC) importance sampling. While MCMC algorithms are typically computer-driven, a novel feature of RDS is that state transitions consist of individuals physically recruiting others in the hidden population. In Section 3 we analyze a particular, illustrative network model in detail. This example shows that the structure of the hidden population's social network can significantly impact both the bias and variance of RDS es- timates, a phenomenon that is well understood in the statistical MCMC community but that has been overlooked in the RDS literature. Impor- tantly, bottlenecks in any part of the network may affect RDS estimates of quantities that are not directly related to the source of the bottle- neck. For example, a bottleneck between racial groups may degrade 4 SHARAD GOEL AND MATTHEW J. SALGANIK RDS estimates of gender composition. This suggests that the variance of RDS estimates is likely larger than previously believed. In Section 4 we explore the effect of multiple recruitment on variance, an issue that is important for RDS, but has not been considered previously and typ- ically does not arise in MCMC applications. We show that \thick," as opposed to \thin," recruitment chains increase the statistical de- pendence between samples, and consequently worsen RDS estimates. Section 5 summarizes the results and concludes with recommendations for users of RDS. We have relegated most proofs and technical details to Appendices A and B. Appendix C reviews conductance, a formal measure of bottlenecks in networks. 2. Respondent-Driven Sampling as MCMC Respondent-driven sampling [28{30] is a form of snowball sampling often used to estimate the proportion of a population with a specific characteristic. Although in this paper we talk about estimating the proportion p of infected individuals, we could more generally be es- timating the occurrence of any characteristic or behavior. Here we review Markov chain Monte Carlo importance sampling, and make the connection to RDS precise. Markov Chain Monte Carlo. Markov chain Monte Carlo was pop- ularized by the introduction of the Metropolis algorithm [31], and has been applied extensively in a variety of fields, including physics, chem- istry, biology and statistics. MCMC has also been the subject of several book-length treatments [32{35]. Behind all MCMC methods is a Markov chain on a state space V . In the context of RDS, V is the population from which we sample (e.g., drug injectors in New York City). We confine ourselves to the case where V is a finite population of size N, and so identify the chain with a kernel K(vi; vj) that gives the probability of transition from state vi to state vj: X K(vi; vj) ≥ 0 K(vi; vj) = 1: vj 2V In terms of RDS, K(vi; vj) is the probability that any individual vi recruits an individual vj. The chain is irreducible if for every pair of points vi; vj there is positive probability of eventually reaching vj starting from vi. Under this assumption, there is a unique distribution π : V ! R|called the stationary distribution|satisfying X π(vi)K(vi; vj) = π(vj): vi2V RDS AS MCMC 5 That is, if X0;X1;X2;::: is a realization of the chain with X0 ∼ π, then Xi ∼ π for i ≥ 0. Consequently, by starting the chain in equi- librium, the walk can be used to generate dependent samples from the distribution π. Importance Sampling. As shown above, a chain-referral sampling method can be used to draw dependent samples from the population V with distribution π: P(Xi = vj) = π(vj): That is, on each draw individual vj has probability π(vj) of being cho- sen. Then for any function f : V ! R, the sample mean n−1 1 X (2.1) f(X ) n i i=0 gives an unbiased estimate not of the population mean, but of Eπf = PN i=1 f(vj)π(vj). That is, because units are selected with unequal prob- ability, the sample mean is not a consistent estimator of the population mean. As is common in the survey sampling literature [36], the idea behind importance sampling [37] is that the weighted sample mean n−1 1 X f(Xi) (2.2) n N · π(X ) i=0 i produces an unbiased estimate of the population mean µf of f since N f(Xi) X f(vi) = π(v ) Eπ N · π(X ) N · π(v ) i i i=1 i N 1 X = f(v ): N i i=1 In particular, if D ⊆ V is the subset of infected individuals, then (2.2) can be used to estimate the disease prevalence p = jDj=N = µf by setting f(vi) = 1 if vi 2 D and f(vi) = 0 otherwise.