Data integration methods for studying animal population dynamics
by Audrey Béliveau
M.Sc., Université de Montréal, 2012
B.Sc., Université de Montréal, 2010
Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in the Department of Statistics and Actuarial Science Faculty of Science
© Audrey Béliveau 2015
SIMON FRASER UNIVERSITY
Fall 2015
All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.
Approval
Name: Audrey Béliveau
Degree: Doctor of Philosophy (Statistics)
Title: Data integration methods for studying animal population dynamics
Examining Committee: Chair: Gary Parker Professor
Richard Lockhart Senior Supervisor Professor
Carl Schwarz Co-Supervisor Professor
Steven Thompson Supervisor Professor
Rick Routledge Internal Examiner Professor
Paul Conn External Examiner Research Mathematical Statistician National Marine Mammal Laboratory NOAA/NMFS Alaska Fisheries Science Center
Date Defended: 22 December 2015
Abstract
In this thesis, we develop new data integration methods to better understand animal population dynamics. In a first project, we study the problem of integrating aerial and access data from aerial-access creel surveys to estimate angling effort, catch and harvest. We propose new estimation methods, study their statistical properties theoretically and conduct a simulation study to compare their performance. We apply our methods to data from an annual Kootenay Lake (Canada) survey.
In a second project, we present a new Bayesian modeling approach to integrate capture-recapture data with other sources of data without relying on the usual independence assumption. We use a simulation study to compare, under various scenarios, our approach with the usual approach of simply multiplying likelihoods. In the simulation study, the Monte Carlo RMSEs and expected posterior standard deviations obtained with our approach are always smaller than or equal to those obtained with the usual approach of simply multiplying likelihoods. Finally, we compare the performance of the two approaches using real data from a colony of Greater horseshoe bats (Rhinolophus ferrumequinum) in the Valais, Switzerland.
In a third project, we develop an explicit integrated population model to integrate capture-recapture survey data, dead recovery survey data and snorkel survey data to better understand the movement from the ocean to spawning grounds of Chinook salmon (Oncorhynchus tshawytscha) on the West Coast of Vancouver Island, Canada. In addition to providing spawning escapement estimates, the model provides estimates of stream residence time and snorkel survey observer efficiency, which are crucial but currently lacking for the use of the area-under-the-curve method currently used to estimate escapement on the West Coast of Vancouver Island.
Keywords: Aerial-access; Capture-recapture; Creel surveys; Independence assumption; Integrated population modeling; Oncorhynchus tshawytscha
Acknowledgements
First and foremost, I am very grateful to my supervisors Richard Lockhart and Carl Schwarz for their time, advice, financial support and the collaboration opportunities offered throughout my doctoral program. I would like to thank my collaborators: Steve Arndt for providing insight on the creel survey data; Roger Pradel for hosting me at the CEFE and introducing me to integrated population modeling; Michael Schaub and Raphaël Arlettaz for providing the bats data and insight; and finally Roger Dunlop for hosting me during the 2014 Burman River survey and for the numerous discussions that have followed. I can say without a doubt that those PhD years were the best of my life so far, for the most part thanks to the incredibly friendly atmosphere in the Department and the amazing people I met there. I would like to thank Derek Bingham for hosting me in his lab and providing access to computing resources. I am also grateful to Gary Parker for his support in a wide array of instances. To my fellow graduate students and friends Ararat, Biljana, Elena, Huijing, Mike, Ofir, Oksana, Ruth, Shirin, Zheng and many others, thank you for cheering up my days and for the many dinners, concerts, tennis matches and more! A very special mention goes to Shirin and Ofir for their support in difficult times. I would like to say a big thank you to all my dancing friends and teammates for all the fun times that helped maintain a good balance in my life. I am also thankful to David Haziza for always believing in me! Finally, I gratefully acknowledge the financial support from the Natural Sciences and Engineering Research Council of Canada.
Table of Contents
Approval
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Adjusting for undercoverage of access-points in creel surveys with fewer overflights
  2.1 Introduction
  2.2 Sampling Protocol
  2.3 Statistical Methods
    2.3.1 Inference Framework
    2.3.2 Study of the Bias
    2.3.3 Study of the Variance
    2.3.4 Optimal Allocation
    2.3.5 Stratification
  2.4 Simulation Study
  2.5 Application
  2.6 Discussion
3 Explicit integrated population modeling: escaping the conventional assumption of independence
  3.1 Introduction
  3.2 Background and notation
    3.2.1 Capture-recapture survey
    3.2.2 Population count survey
    3.2.3 Integrated population modeling via likelihood multiplication
  3.3 Integrated population modeling based on the true joint likelihood
    3.3.1 Capture-recapture and count data
    3.3.2 Model variations
  3.4 Simulation Study
  3.5 Application
  3.6 Discussion
4 Integrated population modeling of Chinook salmon (Oncorhynchus tshawytscha) migration on the West Coast of Vancouver Island
  4.1 Introduction
  4.2 Sampling Protocol
  4.3 Notation
  4.4 A Jolly-Seber approach to estimate escapement
  4.5 Integrated population modeling
  4.6 Analysis of the 2012 data
    4.6.1 Assessment of the integrated population model
  4.7 Discussion
Bibliography
Appendix A Supplementary materials for Chapter 2
  A.1 First-order Taylor expansions
  A.2 Assumptions, propositions and proofs for the study of $\mathrm{Err}_{\mathrm{party}}$
    A.2.1 Assumptions
    A.2.2 Study of $\mathrm{Err}_{\mathrm{party}}$ for the estimators $\hat{C}_R$ and $\hat{C}_{DE}$
    A.2.3 Study of $\mathrm{Err}_{\mathrm{party}}$ for the estimator $\hat{C}_1$
    A.2.4 Study of $\mathrm{Err}_{\mathrm{party}}$ for the estimator $\hat{C}_2$
  A.3 Proof of the Optimal Allocation
  A.4 Monte Carlo measures
  A.5 Figures
Appendix B Supplementary materials for Chapter 3
  B.1 Monte Carlo measures used in the simulation study
  B.2 Plots of the results of the simulation study
  B.3 Bats data analysis
Appendix C Supplementary materials for Chapter 4
  C.1 Analysis of the 2012 capture-recapture data using the software MARK
List of Tables
Table 2.1 Values of $\alpha_i$ and $\beta_i$ for the variance formulas
Table 2.2 Parameter values used to generate the data for the simulation study.
Table 2.3 Monte Carlo measures for the simulation with $\mu_b = 130$. Numbers are expressed in %.
Table 2.4 Allocation of sample size in the 2010-2011 Kootenay Lake Creel Survey.
Table 2.5 Optimal values of $n_o/n_g$ for each month and day type combination for the number of rainbow trout kept. Note that we do not present results for the double expansion estimator because in that case the optimal allocation is $n_o = n_g$.
Table 2.6 Seasonal combined estimates (Est) of total number of rainbow trout kept along with approximate 95% confidence intervals (Low, Upp). The last column is computed as a separate total estimate over the three seasons.
Table 3.1 Changes in the population size per state over time for a study with $K = 3$ periods. The table follows the timeline in Figure 3.1. Starting in the upper left corner of the table, the population is comprised of $N_1$ unmarked individuals at the beginning of period 1. Then, the count survey occurs (which does not affect the state nor size of the population). Then, $B_1$ births occur resulting in $N_1 + B_1$ unmarked individuals in the population. Then, $C_1$ individuals are captured, marked and released which leaves $N_1 + B_1 - C_1$ unmarked individuals in the population. Then, $D^u_1$ unmarked individuals die and $D^m_{11}$ marked individuals die. When period 2 begins, there are respectively $N_1 + B_1 - C_1 - D^u_1$ and $C_1 - D^m_{11}$ unmarked and marked individuals in the population. The table goes on like this until the study is finished. Note: C & R is used to abbreviate “captures and recaptures”.
Table 3.2 Monte Carlo measures comparing the performance of the true joint likelihood approach ($L$) and the composite likelihood approach ($L_c$) in the simulation study, across scenarios and parameters. Each Monte Carlo measure is based on 250 simulated datasets.
Table 3.3 Monte Carlo estimates of $P(W_L \le W_{L_c})$, where $W$ stands for either the absolute error (AE), the standard deviation of the posterior sample (SD) or the length of the 95% HPD credible interval (LCI). Each Monte Carlo measure is based on 250 simulated datasets.
Table 4.1 Notation for the data collected at Burman River. The subscript $s$ can take the values $m$ (males) and $f$ (females).
Table 4.2 Notation for the parameters used in the Jolly-Seber model and/or the integrated population model. The subscript $s$ can take the values $m$ (males) and $f$ (females).
Table 4.3 Variables used in the Jolly-Seber model, categorized based on their role in the model.
Table 4.4 Formulas used to compute quantities of interest for the Jolly-Seber model or the integrated population model. Residence time and alive population size in the stream cannot be estimated from the Jolly-Seber model. Notes: (1) Sums are defined as zero when backwards; (2) The use of $d - 0.5$ in the mean stopover time calculation is based on the assumption that within a day, the movement of fish upstream to the spawning grounds is distributed uniformly over the day; (3) The latent variables $N^m_{i,j,s}$ and $A^m_{i,j,s}$ are defined as 0 when $i$ is not a capture-recapture day; (4) The time unit is days.
Table 4.5 Variables used in the integrated population model, categorized based on their role in the model.
Table 4.6 Escapement estimates obtained from the Jolly-Seber model and the integrated population model. The formulas used to calculate escapement are given in Table 4.4. CI denotes credible intervals.
Table 4.7 Integrated population modeling marginal estimates and credible intervals of observer efficiency in the snorkel survey, based on the fish visibility covariate.
List of Figures
Figure 2.1 Boxplots of relative bias due to the partial interview of parties for 100 population replicates with varying mean number of boats per day. Left column: scenario (A); right column: scenario (B). The first to fourth rows relate, in order, to the estimators $\hat{C}_1$, $\hat{C}_2$, $\hat{C}_{DE}$ and $\hat{C}_R$.
Figure 2.2 Kootenay Lake and the creel survey access points. Riondel/Crawford Bay and Boswell/Kuskanook ramps were combined for field monitoring and data analysis. Map provided by A. Waterhouse, Ministry of Forests, Lands, and Natural Resource Operations.
Figure 2.3 Monthly estimates of total number of rainbow trout kept along with approximate 95% confidence intervals. The top and bottom graphs represent weekends and weekdays respectively. The estimators (2.1) to (2.4) are represented respectively by the following symbols: triangle, circle, x mark and square.
Figure 3.1 Timeline of events of the animal population study. The symbols “C”, “B” and “CR” stand for count survey, births and capture-recapture, respectively. Note that the time between the count survey, the births and the capture-recapture survey in each period is negligible.
Figure 3.2 Marginal posterior distributions (smoothed) obtained from analyzing the bats data. The plain line represents the true joint likelihood method while the dashed line represents the composite likelihood method.
Figure 4.1 Map of Burman River on the West Coast of Vancouver Island, Canada
Figure 4.2 Schematic representation of Chinook salmon migration at Burman River, as assumed by the integrated population model. The arrows denote transitions while boxes denote states.
Figure 4.3 Timeline when surveys were performed in 2012. Each occurrence is denoted by a symbol “×”. Adjacent symbols correspond to consecutive days.
Figure 4.4 Summary time series of the 2012 data.
Figure 4.5 Daily discharge measured at Gold River over the 2012 migration period. Although discharge data is not available at Burman River, the data at nearby Gold River are thought to be a good proxy for Burman River. The first big freshet occurred on October 14th.
Figure 4.6 Estimates of the population size in the pool obtained using the Jolly-Seber model based on the formula in Table 4.4. Each estimate is represented along with a 95% HPD credible interval.
Figure 4.7 Stopover time estimates obtained using the Jolly-Seber model based on the formula in Table 4.4. Each estimate is represented along with a 95% HPD credible interval.
Figure 4.8 Estimates of the population size in the tagging pool obtained using the integrated population modeling approach based on the formula in Table 4.4. Each estimate is represented along with a 95% HPD credible interval.
Figure 4.9 Stopover time estimates obtained using the integrated population modeling approach based on the formula in Table 4.4. Each estimate is represented along with a 95% HPD credible interval.
Figure 4.10 Residence time estimates obtained using the integrated population modeling approach based on the formula in Table 4.4. Each estimate is represented along with a 95% HPD credible interval.
Figure 4.11 Estimates of alive population size in the spawning area obtained using the integrated population modeling approach based on the formula in Table 4.4. Each estimate is represented along with a 95% HPD credible interval.
Figure 4.12 Bayesian p-values for the assessment of the capture-recapture component of the integrated population model, using discrepancy $D_1$.
Figure 4.13 Bayesian p-values for the assessment of the snorkel survey component of the integrated population model, using discrepancy $D_2$.
Chapter 1
Introduction
The study of animal population dynamics is important for the management and conservation of animal populations. A variety of surveys can be used for that purpose: capture-recapture surveys, population counts, newborn counts, dead recoveries, telemetry surveys, creel surveys, etc. When a population is studied using more than one type of survey, the integration of all the data in a single statistical analysis can be very challenging. It is currently an area of active research and will be the main topic of this work.
Chapters 2, 3 and 4 form the core of this thesis. They are self-sufficient in the sense that they can be read in any order, they each contain an introduction and the notation is not shared between chapters.
In Chapter 2, we propose new statistical methods to integrate the data from aerial-access creel surveys in order to estimate angling effort, catch and harvest in recreational fisheries. Aerial-access creel surveys rely on two components: (1) a ground component in which fishing parties returning from their trips are interviewed at some access-points of the fishery; (2) an aerial component in which an instantaneous count of the number of fishing parties is conducted. It is common practice to sample fewer aerial survey days than ground survey days. This is thought by practitioners to reduce the cost of the survey, but there is a lack of sound statistical methodology for this case. In Chapter 2, we propose various estimation methods to handle this situation and evaluate their asymptotic properties from a design-based perspective (see Lohr, 2009). The performance of the proposed estimators is studied empirically using a simulation study with varying sampling scenarios. Another aspect that we study in this work is the optimal allocation of the effort between the ground and the aerial portion of the survey, for given costs and budget, for which we derive formulas using the Lagrange multipliers method. Finally, we apply our methods to data from an annual Kootenay Lake (Canada) survey.
Capture-recapture surveys are periodic surveys that take place on a series of capture occasions. On each occasion, a survey crew captures animals from a population. When an animal is captured for the first time, it is marked with a unique identification number and released back into the population, and the identification number is recorded. When an animal is recaptured, its identification number is recorded and it is released back into the population. The data collected by capture-recapture can be used to estimate the survival probability of the marked animals between capture occasions. In Chapter 3, we develop new statistical methods for the integrated population modeling of capture-recapture data with other types of data, such as population counts and dead recoveries. Typically, integrated population models rely on the assumption that the datasets are independent so that a joint likelihood is easily formed as a product of likelihoods (Schaub and Abadi, 2011). In our work, we develop a new capture-recapture Bayesian model that takes into account the dependency between datasets. A key aspect of the model is that it uses latent variables that keep track of all population gains (e.g. births) and losses (e.g. deaths) in the unmarked population and the marked cohorts over time. A simulation study compares, under various scenarios, our approach with the common likelihood multiplication approach. Finally, we compare the performance of the two approaches using a real dataset comprised of capture-recapture data, count data and newborn count data on a colony of Greater horseshoe bats (Rhinolophus ferrumequinum) in the Valais, Switzerland.
In Chapter 4, we develop a Bayesian integrated population model to study the return of Chinook salmon (Oncorhynchus tshawytscha) from the ocean to the spawning grounds in Burman River, on the west coast of Vancouver Island, Canada. Chinook salmon on the west coast of Vancouver Island return to their natal stream in the fall after reaching maturity to spawn and die. When entering Burman River, fish stop for at least some time at a stopover pool, where a capture-recapture survey takes place, then move upstream where they spawn and die. The upstream portion of the river is surveyed periodically by snorkelers who count the number of marked fish and the total number of fish seen (alive). Carcass surveys also take place periodically, during which marked and unmarked carcasses are picked up. Our integrated population model integrates the capture-recapture data, carcass data and snorkel data all in a single analysis. This is, to our knowledge, the first use of explicit integrated population modeling applied to salmon migration. Our explicit integrated population model uses latent variables to follow explicitly the movement and state of fish throughout the migration. In this work, we also apply a Bayesian version of the Jolly-Seber model (Schwarz and Arnason, 1996) to the capture-recapture data alone and compare estimates between the integrated method and the Jolly-Seber method.
Chapter 2
Adjusting for undercoverage of access-points in creel surveys with fewer overflights
The work in this chapter underwent a peer-review process for publication in Biometrics, a journal of the International Biometric Society published by Wiley. The paper is currently available in Early View on Wiley Online Library, see Béliveau et al. (2015).
The 2010-2011 Kootenay Lake creel survey was conducted with financial support of the Fish and Wildlife Compensation Program on behalf of its program partners BC Hydro, the Province of BC, Fisheries and Oceans Canada, First Nations and the public. Access interviews and overflight boat count data were collected by Redfish Consulting Ltd. (Nelson, British Columbia).
2.1 Introduction
Sustainability of recreational fisheries relies on well-advised management decisions. To inform those decisions, fishery agencies conduct creel surveys. Many characteristics of a fishery can be of interest, including total catch (number of fish released or kept), total harvest (number of fish kept), or total fishing effort (number of fishing days or hours) over a period of time. The data collection for creel surveys can be of two types: off-site (mail, telephone, door-to-door, logbooks) or on-site (Pollock, Jones and Brown, 1994). In this work, we focus on on-site surveys, which are conducted at the water body location during fishing hours. A common type of on-site survey is the access-point survey: it is a ground survey, which relies on survey agents intercepting and interviewing angling parties immediately upon their return from a fishing trip. The survey agents can be posted, for example, at public boat ramps, piers or marinas. If a list of all access-points of the water body can be constructed, and access-points are selected randomly (each with strictly positive probability), an unbiased estimate of, e.g., total catch can be obtained for each survey day. However, in many practical situations, this option is impossible because some access-points may be private (for example private docks or piers) or some parties may use unregulated sites. Consequently, if these cases represent a significant proportion of the parties and/or if these cases differ significantly in their variables of interest from parties that use the covered access-points, then standard estimation methods will have a substantial undercoverage bias (Lohr, 2009).
To address this problem, it is typically assumed in creel surveys that parties that use uncovered access-points do not differ in the variables of interest (catch, harvest, fishing effort, etc.) from parties that use the covered access-points. For instance, they should not be more or less experienced anglers. Still, this assumption is not sufficient for the estimation of totals because the number of parties using uncovered access-points is unknown and typically not negligible. This last piece of information is deduced using aerial surveys. Aerial surveys can be conducted, for example, using aircraft overflights or well-suited viewpoints from which an instantaneous count of the number of fishing parties at a time of the day is obtained. Ideally, aerial surveys should be scheduled at random times of the day (Pollock et al., 1994) but environmental conditions (e.g. inclement weather, daylight hours, airport delay) can make it hard for survey agents to respect the planned schedules. With this in mind, Dauk and Schwarz (2001) proposed estimation methods for the case when the aerial survey is conducted at a convenient time of the day, typically around the peak of fishing activity. The use of deterministic aerial survey times is justified if parties’ choice of access point is not related to their fishing schedule.
In this work, we focus on multi-day surveys for which we wish to estimate totals of variables of interest over multiple days, for example, the weekends of August. Statistical methodology is currently available when ground and overflight surveys are conducted on the same set of days, chosen at random among the days under study (Dauk and Schwarz, 2001). In practice, it is also common that aerial surveys are carried out on only a random sample of the ground survey days. This is thought by fisheries managers to be more economical because flights are costly, and the biological (fish size, age, species) and angler data provided by ground sampling are highly valuable for management. Rather surprisingly, there is a lack of statistical methodology for this type of aerial-access creel survey.
The motivating application for the work in this chapter is the annual creel survey on Kootenay Lake, British Columbia. Estimates of catch (per species and overall), harvest (per species and overall) and fishing effort are required at the monthly level, separately for weekdays and weekends/holidays. In each stratum (e.g. weekends of August), the sampling of days follows a two-phase design: in the first phase, a simple random sample (srs) of days is selected to conduct the ground portion of the survey; in the second phase, a simple random sample of days for the overflight survey is selected from the days when access surveys are done. The access-points to be surveyed are selected deterministically to maximize the proportion of anglers that are interviewed. Also, in practice, because of inclement weather, mechanical breakdown or other reasons, some of the scheduled overflights might not be carried out. In this work, we assume that all scheduled overflights are conducted or that any missed overflights are missed at random.
In Section 2.2, the sampling protocol is described in detail and the notation is introduced. Then a variety of estimation methods are provided in Section 2.3 along with their design-based asymptotic properties and a strategy for optimal allocation of resources between the ground and overflight components. In Section 2.4, a simulation study investigates the performance of the estimators. Finally, in Section 2.5, the methods are applied to the 2010-2011 Kootenay Lake creel survey data.
2.2 Sampling Protocol
Consider a population $U$ of size $N$ days. On every day $i \in U$, a set $V_i$ of size $M_i$ parties (or boats) fishes on that day on the water body of interest. For every party $j \in V_i$ on days $i \in U$, the variable of interest is $c_{ij}$. It may represent, for example: the number of fish caught, the number of fish kept or the number of rod-hours. For every party $j \in V_i$ on days $i \in U$, an indicator variable, $I_{ij}$, indicates whether party $j$ returns to one of the ground survey access points. In addition, for every day $i \in U$, if an overflight could be conducted, it would be conducted at time $t_i$ (we make the usual assumption that overflights are instantaneous). Then, for every party $j \in V_i$ on days $i \in U$, an indicator variable, $\delta_{ij}(t_i)$, indicates whether party $j$ is fishing at time $t_i$. For the rest of the chapter, we drop the dependence on $t_i$ in $\delta_{ij}(t_i)$ for ease of notation.
In the first phase, a simple random sample $s_g \subset U$ of size $n_g$ days is selected to conduct the ground surveys. On every sampled day $i \in s_g$, the parties that return to the surveyed access-points are interviewed (i.e. the parties for which $I_{ij} = 1$): their corresponding variables $c_{ij}$ as well as the start and end times of their fishing trip are collected. On every day $i \in s_g$, the total of the variable of interest over the interviewed parties, $C_i \equiv \sum_{j \in V_i} c_{ij} I_{ij}$, can be computed from the data.
In the second phase, a simple random sample $s_o \subset s_g$ of size $n_o$ days is selected. Overflight surveys are conducted on those days and, for every day $i \in s_o$, the number of active boats at time $t_i$, $A_{oi} \equiv \sum_{j \in V_i} \delta_{ij}$, is recorded. Thus, for every day $i \in s_o$, one can deduce the value of $\delta_{ij}$ for each party interviewed during the ground survey on that day using the start and end times of their fishing trip. Then, one can compute the number of interviewed parties that are fishing at time $t_i$, $A_{gi} \equiv \sum_{j \in V_i} \delta_{ij} I_{ij}$.
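The day-level quantities $C_i$, $A_{oi}$ and $A_{gi}$ defined above can be illustrated with a minimal Python sketch. The `Party` record, the `day_summaries` function, the decimal-hour time scale and the toy data are hypothetical and are not part of the survey protocol. In particular, the sketch computes $A_{oi}$ from complete party-level data for illustration only; in the survey, $A_{oi}$ comes from the overflight count, since $\delta_{ij}$ is not observed for parties that are not interviewed.

```python
from dataclasses import dataclass

@dataclass
class Party:
    catch: float        # c_ij, the variable of interest
    start: float        # trip start time (decimal hours, hypothetical scale)
    end: float          # trip end time (decimal hours)
    interviewed: bool   # I_ij = 1 if the party returns to a surveyed access point

def day_summaries(parties, t_i):
    """Return (C_i, A_oi, A_gi) for one day with an overflight at time t_i."""
    # delta_ij = 1 if party j is fishing at the overflight time t_i
    active = [p.end >= t_i >= p.start for p in parties]
    C_i = sum(p.catch for p in parties if p.interviewed)
    A_oi = sum(active)  # in practice: the instantaneous overflight count
    A_gi = sum(a for a, p in zip(active, parties) if p.interviewed)
    return C_i, A_oi, A_gi

# Toy day with four parties, one of which uses an uncovered access point.
parties = [
    Party(catch=2, start=8.0, end=12.0, interviewed=True),
    Party(catch=0, start=9.0, end=15.0, interviewed=False),
    Party(catch=5, start=13.0, end=17.0, interviewed=True),
    Party(catch=1, start=10.0, end=14.0, interviewed=True),
]
print(day_summaries(parties, t_i=11.0))  # prints (8, 3, 2)
```

Here three parties are on the water at 11:00, two of which are later interviewed, while the interviewed catch total is 8.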
2.3 Statistical Methods
The goal is to estimate the sum of a variable of interest over all angling parties during the study period: $C^* = \sum_{i \in U} C_i^*$, where $C_i^* = \sum_{j \in V_i} c_{ij}$ is the sum over all angling parties active on day $i$ of the variable of interest. In this section, we propose a number of estimators for $C^*$. We start by suggesting two intuitive estimators:
$$\hat{C}_1 = \frac{N}{n_g} \sum_{i \in s_g} C_i \, \frac{\sum_{i \in s_o} A_{oi}}{\sum_{i \in s_o} A_{gi}} \tag{2.1}$$
and
$$\hat{C}_2 = \frac{N}{n_g} \sum_{i \in s_g} C_i \, \frac{1}{n_o} \sum_{i \in s_o} \frac{A_{oi}}{A_{gi}}. \tag{2.2}$$
The general idea behind these two estimators is to calculate an estimate of the total of the variable of interest at the surveyed access-points and expand it to all access-points using an inflation factor computed as a ratio of the $A_{oi}$’s and $A_{gi}$’s. The difference between $\hat{C}_1$ and $\hat{C}_2$ lies in computing the ratio involving the $A_{oi}$’s and $A_{gi}$’s. As a third estimator, we suggest
$$\hat{C}_{DE} = \frac{N}{n_o} \sum_{i \in s_o} C_i \frac{A_{oi}}{A_{gi}}, \tag{2.3}$$
which uses only information from days when both access and aerial components are available. Setting $y_i = C_i A_{oi}/A_{gi}$, this estimator is a double expansion estimator (see, e.g., Särndal, Swensson and Wretman (1992), p. 348), where $y_i$ can be seen as a proxy for $C_i^*$. The double expansion estimator is a generalization of the (single-phase) expansion estimator (also called Horvitz-Thompson estimator) to two-phase designs. It is simply a weighted sum of the $y_i$’s computed from the aerial survey days’ data, where the weights correspond to the inverse probability of inclusion in the sample, $N/n_o$. The estimator is design-unbiased but does not integrate auxiliary information; namely the information collected on ground survey days that do not have an overflight. Hence, we propose to use that auxiliary information in a two-phase ratio estimator (see again Särndal et al., p. 359):
$$\hat{C}_R = \frac{N}{n_o} \sum_{i \in s_o} y_i \, \frac{\frac{1}{n_g} \sum_{j \in s_g} C_j}{\frac{1}{n_o} \sum_{j \in s_o} C_j}. \tag{2.4}$$
Ratio estimators are asymptotically design-unbiased and have improved design-efficiency over expansion estimators when $y_i$ is approximately proportional to $C_i$. These four estimators are consistent in the sense that if we sample all days and interview all fishing parties every day, then the estimators are equal to the true total catch $C^*$.
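As a minimal sketch of how estimators (2.1) to (2.4) could be computed from day-level data, the Python function below takes the interviewed totals on the ground survey days and the corresponding values on the overflight days. The function name `estimators`, its argument names and the toy numbers are hypothetical; the sketch assumes the overflight days are a subset of the ground days and that every $A_{gi}$ is positive.

```python
def estimators(N, C_g, C_o, A_o, A_g):
    """Point estimates (2.1)-(2.4) of the total C*.

    N   : number of days in the population U
    C_g : interviewed totals C_i for the ground survey days (s_g)
    C_o : interviewed totals C_i for the overflight days (s_o, a subset of s_g)
    A_o : overflight counts A_oi for the days in s_o
    A_g : interviewed active counts A_gi for the days in s_o (assumed positive)
    """
    n_g, n_o = len(C_g), len(C_o)
    ground_total = N / n_g * sum(C_g)                 # expansion of interviewed totals
    c1 = ground_total * sum(A_o) / sum(A_g)           # (2.1): ratio of sums
    c2 = ground_total * sum(a / g for a, g in zip(A_o, A_g)) / n_o  # (2.2): mean of ratios
    y = [c * a / g for c, a, g in zip(C_o, A_o, A_g)]  # y_i = C_i * A_oi / A_gi
    c_de = N / n_o * sum(y)                           # (2.3): double expansion
    c_r = c_de * (sum(C_g) / n_g) / (sum(C_o) / n_o)  # (2.4): two-phase ratio
    return {"C1": c1, "C2": c2, "CDE": c_de, "CR": c_r}

# Toy stratum: N = 30 days, 4 ground days, overflights on the first 2 of them.
result = estimators(N=30, C_g=[10, 20, 30, 40], C_o=[10, 20], A_o=[6, 10], A_g=[3, 4])
for name, value in result.items():
    print(name, round(value, 1))
```

Passing day-level summaries rather than party-level records keeps the sketch close to the formulas; the four estimates differ only in how the inflation factor and the ground information are combined.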
Before describing the inference framework in which we study the proposed estimators, let us make some general remarks. First, note that if $C_i$ is constant across days (all $i \in U$), then $\hat{C}_2 = \hat{C}_R = \hat{C}_{DE}$. Second, note that if $A_{oi}/A_{gi}$ is constant across days (all $i \in U$), then $\hat{C}_1 = \hat{C}_2 = \hat{C}_R$. Therefore, if the total catch per day and the proportion of interviewed parties are similar across days, all estimators are expected to be roughly equivalent. However, the first condition seems very unlikely to be satisfied in practice because daily environmental conditions (such as weather) could significantly affect the number of fishing parties and the success of the parties. Also, regarding the second condition, there can be, for example, a greater use of non-sampled access points on good weather days in summer, which tends to decrease the proportion of interviewed parties.
2.3.1 Inference Framework
Throughout this chapter, we use the generic notation Cˆ to denote an estimator of C∗. The ∗ ∗ total error of an estimator Cˆ is Cˆ − C = (Cˆ − C˜) + (C˜ − C ) ≡ Errday(Cˆ) + Errparty(Cˆ), where the first term, Errday(Cˆ), is the error due to the sampling of days while the second term, Errparty(Cˆ), is the error due to the partial interview of fishing parties. Besides, C˜ denotes the estimator one would have used in the case of a census of ground and overflight days. For example, if the estimation strategy is to use the estimator CˆR, the estimator used 1 P N Cj ˜ N P j∈U P in the presence of a census of days would be C = N i∈U yi 1 P = i∈U yi. N Cj j∈U First, we assume that there is a superpopulation model, m, that randomly generates, for each day i ∈ U, the number Mi of fishing parties. In addition, it generates, for each party j on day i: variables of interest, cij’s; fishing status at time ti, δij’s; and indicators of return to one of the surveyed access-points, Iij’s. Then, following the established sampling design, a two-phase sample of days is randomly selected by the survey practitioner. Inference can be made following different approaches depending on the sources of randomness one is willing to take into account for inference. In this chapter, we adopt the design-based mode of inference, that considers only the randomness coming from the design. For example, unbiasedness under the design-based approach means that on average, over all the possible samples of days, the total error is null. Another type of inference that we do not pursue in this work would be joint design and model-based inference. Although we are doing design- based inference, we make use of the superpopulation model mentioned in the beginning of this paragraph. The purpose of that model will be to give guidance concerning the design-based biases of our estimates.
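From a design-based perspective, the contribution $\mathrm{Err}_{day}(\hat{C})$ to the total error is random (design-dependent) while the contribution $\mathrm{Err}_{party}(\hat{C})$ is a fixed quantity, because $\tilde{C}$ and $C^*$ do not depend on the sample of days. As a consequence, $\mathrm{Err}_{party}(\hat{C})$ contributes to the bias of $\hat{C}$ but not to its variance:

$$\mathrm{Bias}_p(\hat{C}) \equiv E_p(\hat{C} - C^*) = E_p\left\{\mathrm{Err}_{day}(\hat{C}) + \mathrm{Err}_{party}(\hat{C})\right\} = E_p\left\{\mathrm{Err}_{day}(\hat{C})\right\} + \mathrm{Err}_{party}(\hat{C}) \tag{2.5}$$

and

$$\mathrm{Var}_p(\hat{C}) = \mathrm{Var}_p(\hat{C} - C^*) = \mathrm{Var}_p\left\{\mathrm{Err}_{day}(\hat{C}) + \mathrm{Err}_{party}(\hat{C})\right\} = \mathrm{Var}_p\left\{\mathrm{Err}_{day}(\hat{C})\right\},$$

where $E_p(\cdot)$ and $\mathrm{Var}_p(\cdot)$ denote respectively the expectation and the variance under the sampling plan.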
2.3.2 Study of the Bias

From equation (2.5), two terms contribute to the bias: $E_p\{\mathrm{Err}_{day}(\hat{C})\}$ and $\mathrm{Err}_{party}(\hat{C})$.

To begin, we focus on the first term. In the case of the double-expansion estimator $\hat{C}_{DE}$, we have

$$E_p\left\{\mathrm{Err}_{day}(\hat{C}_{DE})\right\} = E_p\left(\frac{N}{n_o}\sum_{i\in s_o} y_i - \sum_{i\in U} y_i\right) = 0.$$

This result follows from classical survey sampling theory for two-phase designs (see, e.g., Lohr (2009), p. 473). The other estimators are smooth non-linear functions of estimated totals that can be linearized using a first-order Taylor series in the traditional finite population asymptotic framework of Isaki and Fuller (1982). The Taylor series expansions of the estimators are given in Appendix A.1. Consequently, $E_p\{\mathrm{Err}_{day}(\hat{C})\}$ is negligible relative to the true total catch $C^*$ when $n_o$ is large enough.

We now focus on the second term of (2.5). In general, this term is not negligible but we are interested in finding situations in which it is. Note that if all fishing parties were interviewed on the sampled days (all access-points are known, accessible and surveyed), we would have $\mathrm{Err}_{party}(\hat{C}) = 0$ for all four estimators. Therefore, sampling as many access-points as possible helps in reducing the bias of the estimators.

Now, we study the quantity $\mathrm{Err}_{party}(\hat{C})/C^*$ in an asymptotic framework consisting of a sequence of superpopulation models, $\{m_\eta\}_{\eta=1}^{\infty}$. For any superpopulation model $m_\eta$, the number of fishing parties on each day $i$ in the population of days $U$ is denoted $M_{\eta i}$ and tends to infinity in probability, as $\eta \to \infty$. The subscript $\eta$ is dropped for ease of notation.
Note that CˆDE and CˆR have the same value of C˜ and therefore, the same value of
Errparty(Cˆ) so they can be studied at the same time. Because neither the access-points nor the time of the overflight were selected randomly, it is necessary to assume that the super- population model is such that parties generated on a given day have the same probability of being interviewed (this probability should not depend on e.g. fishing period or ability).
More formally, we assume that on any day $i \in U$, the random variables

$$I_{ij} \mid (M_i, c_{i1}, \ldots, c_{iM_i}, \delta_{i1}, \ldots, \delta_{iM_i}), \quad j = 1, \ldots, M_i,$$

are i.i.d. Bernoulli($p_i$), where $p_i$ is a non-zero probability. This is an important assumption whose validity must be gauged by fisheries scientists prior to the survey. In addition, we use two assumptions that are mainly technical and are normally satisfied. See Appendix A.2.1 for the assumptions.
It can be shown that, under the assumptions previously described, $\mathrm{Err}_{party}(\hat{C}_{DE})$ and $\mathrm{Err}_{party}(\hat{C}_R)$ are negligible relative to the target $C^*$ when the number of fishing parties on each day is large enough. See Appendix A.2.2 for a proof of this assertion. On the other hand, the estimators $\hat{C}_1$ and $\hat{C}_2$ require additional assumptions in order for the error due to the partial interview of parties to be negligible. In the case of $\hat{C}_2$, $p_i$ should be approximately constant across days $i \in U$ (see Appendix A.2.4 for a proof of this assertion). Regarding $\hat{C}_1$, we found two cases that provide negligible error:

1. $p_i$ is approximately constant across days $i \in U$, or

2. $C_i^*/M_i$ (the average catch per boat) and $A_{oi}/M_i$ (the proportion of parties fishing at the time of the aerial count) are approximately constant across days $i \in U$.
See Appendix A.2.3 for a proof of this assertion. To sum up, we found that the estimators
CˆR and CˆDE are those that require the weakest assumptions to get negligible error due to partial interview of parties.
2.3.3 Study of the Variance
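The variance of $\hat{C}$ can be expressed using the usual two-phase decomposition of the variance:

$$\mathrm{Var}_p(\hat{C}) = \mathrm{Var}_1 E_2(\hat{C} \mid s_g) + E_1 \mathrm{Var}_2(\hat{C} \mid s_g),$$

where $E_1(\cdot)$ and $\mathrm{Var}_1(\cdot)$ denote respectively the first-phase expectation and variance and $E_2(\cdot \mid s_g)$ and $\mathrm{Var}_2(\cdot \mid s_g)$ denote respectively the second-phase expectation and variance, conditional on the first-phase sample $s_g$.

In the case of the double expansion estimator, we have

$$\mathrm{Var}_1 E_2(\hat{C} \mid s_g) = N^2\left(1 - \frac{n_g}{N}\right)\frac{S_y^2}{n_g} \quad\text{and}\quad E_1 \mathrm{Var}_2(\hat{C} \mid s_g) = N^2\left(1 - \frac{n_o}{n_g}\right)\frac{S_y^2}{n_o},$$

where $S_y^2 = \frac{1}{N-1}\sum_{i\in U}(y_i - \bar{y}_U)^2$ and $\bar{y}_U = \frac{1}{N}\sum_{i\in U} y_i$. The total variance is therefore:

$$\mathrm{Var}_p(\hat{C}) = N^2\left(1 - \frac{n_o}{N}\right)\frac{S_y^2}{n_o}. \tag{2.6}$$

The asymptotic variances of the remaining three estimators can be obtained from their first-order Taylor expansions. Here, the large sample properties refer again to the finite population framework. We use $\mathrm{AV}_p(\cdot)$ to denote asymptotic design-variance. For all three remaining estimators, the asymptotic variance can be written in the form

$$\mathrm{AV}_p(\hat{C}) = N^2\left(1 - \frac{n_g}{N}\right)\frac{S_\alpha^2}{n_g} + N^2\left(1 - \frac{n_o}{n_g}\right)\frac{S_\beta^2}{n_o}, \tag{2.7}$$

where $S_\alpha^2$ and $S_\beta^2$ are measures of dispersion defined analogously to $S_y^2$. The values of $\alpha_i$ and $\beta_i$ associated with each estimator can be found in Table 2.1.

$$\begin{array}{lll}
\hat{C} & \alpha_i & \beta_i \\
\hline
\hat{C}_1 & \dfrac{\bar{A}_{oU}}{\bar{A}_{gU}} C_i + \dfrac{\bar{C}_U}{\bar{A}_{gU}} A_{oi} - \dfrac{\bar{C}_U \bar{A}_{oU}}{\bar{A}_{gU}^2} A_{gi} & \dfrac{\bar{C}_U}{\bar{A}_{gU}} A_{oi} - \dfrac{\bar{C}_U \bar{A}_{oU}}{\bar{A}_{gU}^2} A_{gi} \\
\hat{C}_2 & C_i \bar{R}_U + R_i \bar{C}_U, \quad R_i = \dfrac{A_{oi}}{A_{gi}} & R_i \bar{C}_U \\
\hat{C}_R & y_i & y_i - \dfrac{\bar{y}_U}{\bar{C}_U} C_i
\end{array}$$

Table 2.1: Values of $\alpha_i$ and $\beta_i$ for the variance formulas

Note that the variance formulas do not rely upon the assumptions in Section 2.3.2, which are only relevant to the study of bias. Theoretical comparisons between variances seem ambitious for most estimators. However, it can be seen that $\hat{C}_R$ will be more efficient than the double expansion estimator if its corresponding value of $S_\beta^2$ is smaller than $S_y^2$. Since $S_\beta^2 = S_y^2 - 2\frac{\bar{y}_U}{\bar{C}_U} S_{yC} + \frac{\bar{y}_U^2}{\bar{C}_U^2} S_C^2$ for $\hat{C}_R$, where $S_{yC}$ and $S_C^2$ denote the population covariance between $y$ and $C$ and the population variance of $C$, this is equivalent to the population correlation coefficient between $y$ and $C$ being sufficiently large, that is, greater than $\frac{1}{2}\frac{\mathrm{CV}(C)}{\mathrm{CV}(y)}$, where CV stands for the population coefficient of variation.

Variance estimators can be obtained from (2.7) by replacing all $S^2$ quantities by their equivalent at the sample level. For example, an estimator of the variance of $\hat{C}_R$ would be

$$\widehat{\mathrm{Var}}_p(\hat{C}_R) = N^2\left(1 - \frac{n_g}{N}\right)\frac{s_\alpha^2}{n_g} + N^2\left(1 - \frac{n_o}{n_g}\right)\frac{s_\beta^2}{n_o},$$

where $s_\alpha^2 = \frac{1}{n_o - 1}\sum_{i\in s_o}(y_i - \bar{y}_{s_o})^2$ and $s_\beta^2 = \frac{1}{n_o - 1}\sum_{i\in s_o}(\hat{\beta}_i - \bar{\hat{\beta}}_{s_o})^2$, with $\bar{y}_{s_o} = \frac{1}{n_o}\sum_{i\in s_o} y_i$, $\bar{\hat{\beta}}_{s_o} = \frac{1}{n_o}\sum_{i\in s_o}\hat{\beta}_i$ and $\hat{\beta}_i = y_i - \frac{\bar{y}_{s_o}}{\bar{C}_{s_o}} C_i$. A confidence interval of approximate level $1 - \alpha$ can be obtained as $\hat{C} \pm t_{n_g - 1,\,1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}_p(\hat{C})}$.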
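As a rough illustration of the variance estimator just described, the following Python sketch computes $\hat{C}_R$ and its estimated variance for a single stratum. The function and argument names are hypothetical, the form used for $\hat{C}_R$ is the single-stratum case of the combined ratio estimator of Section 2.3.5, and the t quantile is supplied by the caller rather than computed.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    """Sample variance with divisor n - 1 (needs at least two values)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def ratio_estimator(N, C_ground, y_over, C_over):
    """Return (estimate, variance estimate) for the ratio estimator C_R.

    N        -- number of days in the population U
    C_ground -- C_i on the n_g ground-survey days (first-phase sample s_g)
    y_over   -- y_i on the n_o overflight days (second-phase sample s_o)
    C_over   -- C_i on those same n_o overflight days
    """
    n_g, n_o = len(C_ground), len(y_over)
    y_bar, C_bar_o = mean(y_over), mean(C_over)
    estimate = N * y_bar * mean(C_ground) / C_bar_o

    # beta_hat_i = y_i - (ybar_so / Cbar_so) * C_i, computed on s_o
    beta_hat = [y - (y_bar / C_bar_o) * c for y, c in zip(y_over, C_over)]
    s2_alpha = sample_variance(y_over)   # alpha_i = y_i for C_R
    s2_beta = sample_variance(beta_hat)

    var_hat = (N ** 2 * (1 - n_g / N) * s2_alpha / n_g
               + N ** 2 * (1 - n_o / n_g) * s2_beta / n_o)
    return estimate, var_hat


def confidence_interval(estimate, var_hat, t_crit):
    """t_crit is the t quantile with n_g - 1 df, supplied by the caller."""
    half_width = t_crit * sqrt(var_hat)
    return estimate - half_width, estimate + half_width


# Made-up numbers: 22 days, 4 ground days, 2 of which also had an overflight.
est, var = ratio_estimator(22, [10, 20, 30, 40], [20, 40], [10, 20])
```

In this made-up example $y_i$ is exactly proportional to $C_i$ on the overflight days, so every $\hat{\beta}_i$ is zero and only the first-phase term contributes to the variance estimate.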
2.3.4 Optimal Allocation
Suppose that a budget B is allocated to the survey and that each overflight has a cost of κo and each ground survey has a cost of κg. In the case of CˆDE, the allocation that minimizes the variance (2.6) subject to the constraint noκo + ngκg ≤ B is obviously no = ng = B/(κo + κg) because the information collected on ground survey days that don’t have an aerial survey is not used to compute CˆDE. For the other estimators, the allocation that minimizes the asymptotic variance (2.7) subject to the constraint noκo + ngκg ≤ B is found using the method of Lagrange multipliers; see Appendix A.3. We obtain:
$$n_g = \frac{B}{\kappa_g}\left(1 + \sqrt{\frac{\kappa_o}{\kappa_g}\,\frac{S_\beta^2}{S_\alpha^2 - S_\beta^2}}\right)^{-1}$$

and

$$n_o = n_g\sqrt{\frac{\kappa_g}{\kappa_o}\,\frac{S_\beta^2}{S_\alpha^2 - S_\beta^2}}.$$

Although the quantities $S_\alpha^2$ and $S_\beta^2$ are unknown, they can be approximated using previous years' survey data. We now make two practical remarks. First, the optimal allocation formulas require that $S_\alpha^2 - S_\beta^2 > 0$. If this is not the case, then the optimal allocation is necessarily $n_o = n_g = \frac{B}{\kappa_o + \kappa_g}$. Second, the optimal allocation formulas can lead to allocations that do not satisfy $n_o \le n_g \le N$. In that case, the optimal solution is found on the boundary, which means that either $n_o = n_g < N$ or $n_o < n_g = N$. In the former case, the optimal allocation is $n_o = n_g = \frac{B}{\kappa_o + \kappa_g}$ and in the latter, $n_o = \frac{B - N\kappa_g}{\kappa_o}$ and $n_g = N$. It suffices to compute both allocations along with their variance and choose the allocation with the smallest variance.

The optimal fraction of overflight days to ground days depends on two ratios. As the ratio of activity costs of an aerial to a ground survey increases, fewer overflight days should be performed. As $S_\alpha^2$ increases relative to $S_\beta^2$, the optimal allocation also favors ground surveys.
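The allocation rule and its boundary cases can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the sample sizes are returned as real numbers (rounding is left to the practitioner), $S_\beta^2$ is assumed to be strictly positive, and capping the equal allocation at $N$ is a convenience added here rather than something taken from the text.

```python
from math import sqrt


def asymptotic_variance(N, n_g, n_o, S2_alpha, S2_beta):
    """Asymptotic variance (2.7) for given sample sizes."""
    return (N ** 2 * (1 - n_g / N) * S2_alpha / n_g
            + N ** 2 * (1 - n_o / n_g) * S2_beta / n_o)


def optimal_allocation(B, kappa_o, kappa_g, S2_alpha, S2_beta, N):
    """Return real-valued (n_g, n_o) minimizing (2.7) under the budget B."""
    # Boundary n_o = n_g; the cap at N is an assumption of this sketch.
    equal = min(B / (kappa_o + kappa_g), N)
    if S2_alpha - S2_beta <= 0:
        return equal, equal

    r = S2_beta / (S2_alpha - S2_beta)
    n_g = (B / kappa_g) / (1 + sqrt(kappa_o / kappa_g * r))
    n_o = n_g * sqrt(kappa_g / kappa_o * r)
    if n_o <= n_g <= N:
        return n_g, n_o                       # interior solution

    # Otherwise compare the two boundary allocations described in the text.
    candidates = [(equal, equal)]
    n_o_full = (B - N * kappa_g) / kappa_o    # boundary n_g = N
    if 0 < n_o_full <= N:
        candidates.append((N, n_o_full))
    return min(
        candidates,
        key=lambda a: asymptotic_variance(N, a[0], a[1], S2_alpha, S2_beta),
    )


# Hypothetical budget and dispersion values; the unit costs are illustrative.
n_g, n_o = optimal_allocation(B=20000, kappa_o=1200, kappa_g=1600,
                              S2_alpha=400.0, S2_beta=100.0, N=22)
```

With these illustrative inputs the interior solution applies and the ratio $n_o/n_g$ is 2/3; when $S_\beta^2$ is close to $S_\alpha^2$ the function falls back to the boundary $n_o = n_g$.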
2.3.5 Stratification
Aerial-access creel surveys, such as the Kootenay Lake survey, may also be conducted using a stratified two-phase design. Stratification occurs at the population level, that is, the study period is divided into strata and two-phase srs/srs samples are selected independently in each stratum. When such a design is used, it is usually desirable to estimate the total of variables of interest over larger periods of time such as seasons or years. In this section, we modify our notation by including stratum indices. The stratum populations are denoted by $U_1, \ldots, U_H$, and $U$ now represents the overall population $U = \bigcup_{h=1}^{H} U_h$. As well, the stratum first-phase and second-phase samples are denoted respectively by $s_{g1}, \ldots, s_{gH}$ and $s_{o1}, \ldots, s_{oH}$ and are of size $n_{g1}, \ldots, n_{gH}$ and $n_{o1}, \ldots, n_{oH}$. We are interested in estimating the total $C^* = \sum_{h=1}^{H} C_h^*$, where $C_h^*$ is the total of the variable of interest in stratum $h$.

In the case of the double expansion estimator, one can obtain a stratified estimator and variance estimator by simply summing the stratum estimators and variance estimators. For the two-phase nonlinear estimators (2.1), (2.2) and (2.4) there are typically two ways to combine the information across strata: separate ratio estimators and combined ratio estimators (see Lohr (2009), p. 144).
The separate estimator, which we denote $\hat{C}_s$, is obtained by summing the estimators computed within each stratum: $\hat{C}_s = \sum_{h=1}^{H} \hat{C}_h$, where $\hat{C}_h$ represents one of the estimators (2.1), (2.2) or (2.4) computed within stratum $h$. The variance of the separate estimator is the sum of the stratum variances, and therefore a variance estimator is simply obtained by summing the variance estimates within each stratum.
For the combined estimators, we present only the case of estimator $\hat{C}_R$ for the sake of simplicity, but the other combined estimators can be obtained analogously. The combined ratio estimator is given by

$$\hat{C}_{Rc} = \left(\sum_{h=1}^{H}\frac{N_h}{n_{oh}}\sum_{i\in s_{oh}} y_{hi}\right)\frac{\sum_{h=1}^{H}\frac{N_h}{n_{gh}}\sum_{j\in s_{gh}} C_{hj}}{\sum_{h=1}^{H}\frac{N_h}{n_{oh}}\sum_{j\in s_{oh}} C_{hj}}$$

and the variance estimator is

$$\widehat{\mathrm{Var}}_p(\hat{C}_{Rc}) = \sum_{h=1}^{H}\left\{N_h^2\left(1 - \frac{n_{gh}}{N_h}\right)\frac{s_{\alpha h}^2}{n_{gh}} + N_h^2\left(1 - \frac{n_{oh}}{n_{gh}}\right)\frac{s_{\hat{\beta} h}^2}{n_{oh}}\right\},$$

where $\alpha_{hi} = y_{hi}$ and

$$\hat{\beta}_{hi} = y_{hi} - \frac{\sum_{l=1}^{H}\frac{N_l}{n_{ol}}\sum_{j\in s_{ol}} y_{lj}}{\sum_{l=1}^{H}\frac{N_l}{n_{ol}}\sum_{j\in s_{ol}} C_{lj}}\, C_{hi}.$$

A known fact about the separate ratio estimator is that it sums up the separate biases while the standard error generally decreases relative to the total of interest (Lohr (2009), p. 145). As a result, the bias-to-SE ratio increases. If the separate biases are negligible, the use of the separate ratio estimators is appropriate; otherwise the combined estimators are preferable.
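A minimal Python sketch of the combined ratio estimator and its variance estimator is given below. The dictionary layout and function names are hypothetical choices made for this illustration, and the code assumes at least two overflight days in every stratum so that the sample variances exist.

```python
def _sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def combined_ratio(strata):
    """strata: list of dicts with keys
         "N"        -- number of days N_h in stratum h
         "C_ground" -- C_hj on the n_gh ground-survey days
         "y_over"   -- y_hi on the n_oh overflight days
         "C_over"   -- C_hi on those same overflight days
    Returns (estimate, variance estimate); needs n_oh >= 2 in every stratum.
    """
    def expanded_total(key):
        return sum(s["N"] / len(s[key]) * sum(s[key]) for s in strata)

    y_tot = expanded_total("y_over")
    C_tot_ground = expanded_total("C_ground")
    C_tot_over = expanded_total("C_over")
    estimate = y_tot * C_tot_ground / C_tot_over

    ratio = y_tot / C_tot_over            # pooled ratio used in beta_hat_hi
    var_hat = 0.0
    for s in strata:
        N_h, n_g, n_o = s["N"], len(s["C_ground"]), len(s["y_over"])
        beta_hat = [y - ratio * c for y, c in zip(s["y_over"], s["C_over"])]
        var_hat += (
            N_h ** 2 * (1 - n_g / N_h) * _sample_variance(s["y_over"]) / n_g
            + N_h ** 2 * (1 - n_o / n_g) * _sample_variance(beta_hat) / n_o
        )
    return estimate, var_hat


# Two made-up strata (not survey data).
strata = [
    {"N": 10, "C_ground": [10, 20, 30, 40], "y_over": [20, 40], "C_over": [10, 20]},
    {"N": 5, "C_ground": [5, 15], "y_over": [10, 30], "C_over": [5, 15]},
]
est_c, var_c = combined_ratio(strata)
```

Note that the ratio in $\hat{\beta}_{hi}$ is pooled across strata, which is what distinguishes the combined variance estimator from a sum of within-stratum ratio variance estimators.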
2.4 Simulation Study
The first part of the simulation study was designed to study the bias due to the partial interview of parties under different scenarios. We considered one scenario (A) where we use the same model to generate the fishing parties on every day and another scenario (B) where there are three different types of days (e.g. based on weather conditions) and a different generating model for the parties for each day type. Hence scenario (A) makes the strong assumption that $p_i$ is the same for all days $i$ in $U$, and in that case we expect $\mathrm{Err}_{party}(\hat{C})/C^*$ to be asymptotically equal to zero for all four estimators based on the results in Section 2.3.2. Scenario (B) allows $p_i$ to vary across days, and in this case it was asserted in Section 2.3.2 that $\mathrm{Err}_{party}(\hat{C})/C^*$ is asymptotically equal to zero for the estimators $\hat{C}_{DE}$ and $\hat{C}_R$. For each scenario, and for increasing mean number of boats (fishing parties) per day, $\mu_b = 50$, 100, 250 and 500, we generated 100 populations of size $N = 22$ days (corresponding roughly to weekdays in a month). For every day $i$ in $U = \{1, \ldots, 22\}$, we proceeded in the following way:
• In the case of scenario (B), generate the day type: type (i) with probability 0.3, type (ii) with probability 0.4 or type (iii) with probability 0.3.

• Generate the number of fishing parties: $M_i \sim \mathrm{Poisson}(\mu_b)$.

• Generate $M_i$ fishing parties with catch, activity indicator at time of overflight and access-point landing indicator using the following distributions:

$c_{ij} \sim (\mathrm{Poisson}(\nu_i - 1) + 1) \times \mathrm{Poisson}(\lambda_i)$;

$\delta_{ij} \sim \mathrm{Bernoulli}(\phi_i)$;

$I_{ij} \sim \mathrm{Bernoulli}(p_i)$.

The parameters used for data generation are listed in Table 2.2.

Parameter                                                     Scenario (A)   Scenario (B)
Mean number of fishermen per party, $\nu_i$                   2              2
Mean catch per fisherman, $\lambda_i$                         0.3            0.8 (i), 0.5 (ii), 0.2 (iii)
Probability of fishing at time of overflight, $\phi_i$        0.5            0.6 (i), 0.7 (ii), 0.8 (iii)
Probability of returning to an access-point, $p_i$            0.66           0.5 (i), 0.7 (ii), 0.8 (iii)

Table 2.2: Parameter values used to generate the data for the simulation study.
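The generation steps above can be sketched in Python using only the standard library. This is an illustrative reading of the algorithm and not the code used for the thesis: the function names are hypothetical, the catch is taken literally as the product of the two Poisson-based draws, and the per-day aggregates ($C_i^*$ as the catch of all parties, $C_i$ as the catch of interviewed parties, $A_{oi}$ as the number of parties fishing at overflight time, $A_{gi}$ as the interviewed parties among those) follow the definitions as understood from the earlier part of the chapter.

```python
import math
import random

# (probability of day type, parameters) pairs taken from Table 2.2.
SCENARIO_A = [(1.0, {"nu": 2, "lam": 0.3, "phi": 0.5, "p": 0.66})]
SCENARIO_B = [
    (0.3, {"nu": 2, "lam": 0.8, "phi": 0.6, "p": 0.5}),   # type (i)
    (0.4, {"nu": 2, "lam": 0.5, "phi": 0.7, "p": 0.7}),   # type (ii)
    (0.3, {"nu": 2, "lam": 0.2, "phi": 0.8, "p": 0.8}),   # type (iii)
]


def poisson(rng, mean):
    """Poisson draw by Knuth's multiplication method (fine for means <= 500)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def generate_day(rng, mu_b, scenario):
    weights = [w for w, _ in scenario]
    par = rng.choices([p for _, p in scenario], weights=weights)[0]
    M = poisson(rng, mu_b)
    day = {"M": M, "C_star": 0, "C": 0, "A_o": 0, "A_g": 0}
    for _ in range(M):
        catch = (poisson(rng, par["nu"] - 1) + 1) * poisson(rng, par["lam"])
        fishing = int(rng.random() < par["phi"])       # delta_ij
        interviewed = rng.random() < par["p"]          # I_ij
        day["C_star"] += catch
        day["A_o"] += fishing
        if interviewed:
            day["C"] += catch
            day["A_g"] += fishing
    return day


def generate_population(seed, mu_b, scenario, N=22):
    rng = random.Random(seed)
    return [generate_day(rng, mu_b, scenario) for _ in range(N)]


population = generate_population(seed=1, mu_b=50, scenario=SCENARIO_B)
```

Seeding a dedicated `random.Random` instance keeps each generated population reproducible, which is convenient when 100 populations are generated per configuration.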
Then, for each population that was generated, we computed the relative bias due to the partial interview of parties for estimators (2.1) to (2.4) as $\mathrm{RB}_{party}(\hat{C}) = \mathrm{Err}_{party}(\hat{C})/C^*$. Note that the probability distributions used to generate the data were chosen arbitrarily, but an accurate match between model and reality is not required here in order to study the design-based properties of our estimators.

Figure 2.1 shows the simulation results. In the case of scenario (A), the four estimators have similar behaviors, i.e. the relative biases are distributed closer around zero as the mean number of fishing parties per day gets larger.
In the case of scenario (B), the relative biases of CˆDE and CˆR are centered around zero while the other estimators exhibit a systematic bias that does not diminish as the number of fishing parties increases.
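The relative bias due to the partial interview of parties can be computed for one population of days along the following lines. This is a sketch under assumptions: the function name and input layout are hypothetical, $y_i$ is taken to be $C_i A_{oi}/A_{gi}$, and the census values used for $\hat{C}_1$ and $\hat{C}_2$ are inferred from the linearization variables in Table 2.1, so they should be checked against equations (2.1) and (2.2). Every day is assumed to have $A_{gi} > 0$.

```python
def relative_bias_party(days):
    """days: list of (C_star_i, C_i, A_oi, A_gi) for every day i in U.

    Returns RB_party = (C_tilde - C*) / C* for each estimator, where C_tilde
    is the value the estimator takes under a census of days.
    """
    N = len(days)
    C_star = sum(d[0] for d in days)
    C_bar = sum(d[1] for d in days) / N
    A_o_bar = sum(d[2] for d in days) / N
    A_g_bar = sum(d[3] for d in days) / N
    R_bar = sum(d[2] / d[3] for d in days) / N
    y_total = sum(d[1] * d[2] / d[3] for d in days)

    census = {
        "C1": N * C_bar * A_o_bar / A_g_bar,   # inferred form, see lead-in
        "C2": N * C_bar * R_bar,               # inferred form, see lead-in
        "CDE": y_total,
        "CR": y_total,
    }
    return {name: (value - C_star) / C_star for name, value in census.items()}


# Two made-up days: (C_star_i, C_i, A_oi, A_gi).
rb = relative_bias_party([(100, 50, 40, 20), (80, 60, 30, 24)])
```

Because $\hat{C}_{DE}$ and $\hat{C}_R$ share the same census value, they always receive the same relative bias here, in line with the remark in Section 2.3.2.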
Figure 2.1: Boxplots of relative bias due to the partial interview of parties for 100 population replicates with varying mean number of boats per day. Left column: scenario (A); right column: scenario (B). The first to fourth rows relate, in order, to the estimators $\hat{C}_1$, $\hat{C}_2$, $\hat{C}_{DE}$ and $\hat{C}_R$.

The second part of the simulation study was designed to compare the estimators in terms of accuracy and confidence interval coverage. In order to do so, we generated, for each of the two scenarios, a single population with $\mu_b = 130$ using the algorithm already described. For both scenarios we generated, from the population, $K = 50{,}000$ two-phase srs/srs samples of size $n_g$ and $n_o$. We varied the first-phase sampling fraction by investigating the cases $n_g = 4$, $n_o = 2$ and $n_g = 10$, $n_o = 5$. For each replicated sample, we computed the estimators of total catch (2.1) to (2.4). Note that given the small sample and population sizes, we could have generated all the possible samples rather than simulating a large number of samples, but our code would not have been as generally usable for larger population and/or sample sizes. With $K = 50{,}000$ the two approaches give similar results. We summarize the results using the following Monte Carlo measures: the relative bias due to the sampling of days (RBdaysMC), the relative root mean squared error (RRMSEMC), the coverage probability of a 95% confidence interval (CPMC) and the bias ratio (BRMC). The formulas used for the calculations are given explicitly in Appendix A.4.

The simulation results are displayed in Table 2.3. For both scenarios (A) and (B), the
RRMSEMC of the double expansion estimator is larger than that of Cˆ1, Cˆ2 and CˆR. In scenario (A), the coverage probability is close to 95% for all estimators. The bias ratio is also relatively low (Cochran, 1977, p.14) which explains why the confidence intervals have proper coverage. The coverages are consistently slightly over 95% which we think is a consequence of the t distribution being an approximation for the distribution of Cˆ. In the case of scenario (B), when the first-phase sampling fraction is smaller, the bias ratios remain fairly low and all estimators have coverage probability close to 95%. However, when the
first-phase sampling fraction is larger, the biases of Cˆ1 and Cˆ2 become important relative to the standard error and affect the coverage probability negatively. Hence, the estimator CˆR is preferable in this study because it has the smallest RRMSEMC along with 95% coverage probability for the confidence interval.
2.5 Application
An aerial-access creel survey was conducted on Kootenay Lake, British Columbia, from December 2010 through November 2011. The study period was stratified by month and by day status: weekday or weekend. Statutory holidays were also defined as weekends. In each stratum, a simple random sample of days was selected to conduct ground surveys. Within each of these samples, a simple random sample of days was selected to conduct overflights. The allocation of sample sizes for the study is displayed in Table 2.4. The number of samples per month was adjusted seasonally to increase the intensity during months when fishing effort was expected to be higher (based on previous data). Unsafe weather conditions also resulted in cancellation of some flights but we assume here that the simple random sample assumption is valid.
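Scenario   ng   no   Estimator          RBparty   RBdaysMC   BRMC   RRMSEMC   CPMC
(A)         4    2   $\hat{C}_1$              2          0     14        13     96
(A)         4    2   $\hat{C}_2$              2          0     17        13     96
(A)         4    2   $\hat{C}_{DE}$           2          0      9        18     97
(A)         4    2   $\hat{C}_R$              2          0     15        13     96
(A)        10    5   $\hat{C}_1$              2          0     23         7     96
(A)        10    5   $\hat{C}_2$              2          0     32         7     96
(A)        10    5   $\hat{C}_{DE}$           2          0     17        11     97
(A)        10    5   $\hat{C}_R$              2          0     27         7     96
(B)         4    2   $\hat{C}_1$            -12          3    -36        25     95
(B)         4    2   $\hat{C}_2$             -7          1    -23        25     96
(B)         4    2   $\hat{C}_{DE}$          -2          0     -5        35     98
(B)         4    2   $\hat{C}_R$             -2         -1    -12        26     97
(B)        10    5   $\hat{C}_1$            -12          1    -89        16     83
(B)        10    5   $\hat{C}_2$             -7          0    -47        14     91
(B)        10    5   $\hat{C}_{DE}$          -2          0     -8        21     97
(B)        10    5   $\hat{C}_R$             -2          0    -15        14     95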
Table 2.3: Monte Carlo measures for the simulation with µb = 130. Numbers are expressed in %.
The survey also recorded data on shore anglers but we focus on boat anglers only. There were fifteen derby days during the study period. During those days, a fishing derby was organized on Kootenay Lake, with entry fees and substantial prize money ($100s or $1000) for the largest fish. Derbies are organized mostly by local businesses (or a community group). For sake of simplicity we chose to exclude derby days for this analysis. Estimates of total catch on these days could have been obtained separately and then added to our total estimates. Because the sampling of derby days is independent from the sampling on other days, bias and variances add up. Hence a variance estimate for the total over derby and non derby days altogether could be obtained by summing the variance estimates obtained for derby and non derby days respectively.

The ground portion of the survey was located at the following access points: Balfour, Boswell, Kuskanook, Kaslo, Riondel, Crawford Bay and Woodbury; see map in Figure 2.2. During the sampled days, angling parties returning to those access points were interviewed to determine the number of fish kept and released from each species, the start and the end time of the angling trip, and other variables. The aerial survey was conducted around noon, which is the peak daily activity. The number of boats showing fishing activity was counted once as the airplane flew out and again on the return flight. We compute the quantity $A_o$ as the average of the inbound and outbound counts. We compute $A_g$ as the average of the number of parties fishing at the inbound overflight midtime and the outbound overflight midtime. For example, if the overflight takes place from 12 pm to 1 pm on the way out and 1 pm to 2 pm on the way in, then $A_g$ is the average of the number of interviewed parties fishing at 12:30 and 1:30. This can introduce a bias in the estimates that we assume to be negligible. Finding the most suitable way of computing the quantities $A_o$ and $A_g$ for aerial surveys that span considerable time remains an open question. This is a significant consideration in this lake that can take between 45 minutes and 1.5 hours to fly in one direction.

Figure 2.2: Kootenay Lake and the creel survey access points. Riondel/Crawford Bay and Boswell/Kuskanook ramps were combined for field monitoring and data analysis. Map provided by A. Waterhouse, Ministry of Forests, Lands, and Natural Resource Operations.

Period       Weekdays          Weekends
(yyyy-mm)    N    ng   no      N    ng   no
2010-12      22   2    1       9    2    1
2011-01      21   1    1       10   2    1
2011-02      20   1    0       8    2    0
2011-03      23   1    1       8    2    1
2011-04      21   2    1       6    2    1
2011-05      22   3    2       6    4    3
2011-06      22   3    2       8    4    2
2011-07      21   3    2       10   4    2
2011-08      23   3    2       8    3    2
2011-09      22   3    2       8    2    2
2011-10      20   3    1       5    2    0
2011-11      22   3    1       5    2    2

Table 2.4: Allocation of sample size in the 2010-2011 Kootenay Lake Creel Survey.

In this work, we present the results for the variable number of rainbow trout kept. Plots of the data in Appendix A.5 give insight on the proportion of parties interviewed, the total catch and the number of fishing parties per day, respectively.

Variance estimates cannot be obtained in strata that have zero or one aerial survey. For this reason, we present stratum estimates for the months of May to September only; see
Figure 2.3 (those total estimates do not include derby days). The estimators Cˆ1, Cˆ2 and
CˆR produce similar results in each stratum. This may not be the case in other scenarios or studies. Furthermore, the confidence intervals in some strata are very wide, whereas they are significantly shorter in other strata. This is explained by the small sample sizes in some strata producing quite variable estimates. Rather surprisingly, the estimator CˆDE has a much smaller confidence interval in some strata, especially in August. An inspection of the data reveals that, in those cases, the values of yi for the no = 2 sampled overflight days turn out to be very close, thus leading to a small variance estimate for CˆDE.
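The rule described earlier in this section for computing $A_o$ and $A_g$ on a given day can be sketched as follows. The function names and the representation of trips as (start, end) times in decimal hours are hypothetical, and treating a party as fishing at any time between the reported start and end of its trip is an assumption of this sketch.

```python
def aerial_count(outbound_count, inbound_count):
    """A_o: average of the boat counts from the two legs of the overflight."""
    return (outbound_count + inbound_count) / 2


def ground_count(trips, midtimes):
    """A_g: average, over the overflight midtimes, of the number of
    interviewed parties whose trip covers that midtime.

    trips    -- list of (start, end) times in decimal hours
    midtimes -- midtime of each leg of the overflight, in decimal hours
    """
    counts = [
        sum(1 for start, end in trips if start <= t <= end)
        for t in midtimes
    ]
    return sum(counts) / len(counts)


# Overflight from 12:00 to 13:00 out and 13:00 to 14:00 back,
# so the midtimes are 12:30 and 13:30 (made-up interview data).
trips = [(8.0, 12.75), (11.0, 15.0), (13.0, 16.0), (13.25, 17.0)]
A_g = ground_count(trips, midtimes=[12.5, 13.5])
A_o = aerial_count(outbound_count=10, inbound_count=14)
```

In this example two interviewed parties were fishing at 12:30 and three at 13:30, so the ground count is their average.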
Figure 2.3: Monthly estimates of total number of rainbow trout kept along with approximate 95% confidence intervals. The top and bottom graphs represent weekends and weekdays respectively. The estimators (2.1) to (2.4) are represented respectively by the following symbols: triangle, circle, x mark and square.
The optimal allocation was computed for the months of May to September, separately for weekends and weekdays. The cost of an overflight on Kootenay Lake is approximately $1,200 whereas the daily cost of an access-point survey is approximately $1,600. From the results in Table 2.5, we cannot conclude generally that conducting fewer overflights than ground surveys is the best strategy. In particular, the results suggest that estimation for June to September weekdays would be more efficient with an equal number of overflight and ground survey days. Note that these results apply only to the variable “number of rainbow trout kept”, so optimal allocation should also be investigated for other key variables of the study before a decision on sample size allocation is made.
             Weekdays                                    Weekends
Month        $\hat{C}_1$   $\hat{C}_2$   $\hat{C}_R$     $\hat{C}_1$   $\hat{C}_2$   $\hat{C}_R$
May          0.14          0.14          0.08            0.60          0.65          0.69
June         1.00          1.00          1.00            0.87          0.87          0.87
July         1.00          1.00          1.00            0.69          0.69          0.66
August       1.00          1.00          1.00            1.00          1.00          1.00
September    1.00          1.00          1.00            0.68          0.67          0.60
Table 2.5: Optimal values of no/ng for each month and day type combination for the number of rainbow trout kept. Note that we do not present results for the double expansion estimator because in that case the optimal allocation is no = ng.
In order to illustrate the methods described in Section 2.3.5, we produced estimates at the seasonal level (see Table 2.6). The year was divided into three seasons: winter (December to March), shoulder (April, May, October, November) and summer (June to September). We chose to compute combined estimates rather than separate estimates in order to prevent the bias from becoming important relative to the standard error. However, when no is equal to zero or one in some stratum, the combined variance estimators cannot be computed. But because stratification is expected to enhance the efficiency of estimators (provided that the strata are sufficiently homogeneous, which should be satisfied here), one can pool some strata and pretend the data were obtained from a two-phase srs/srs sample (without stratification) for computing the variance estimate. This variance estimate is expected to overestimate the variance and provide confidence intervals with coverage probability greater than 1 − α. For the winter analysis, we pooled all weekdays together and all weekends together. For the shoulder analysis, we pooled April and May weekdays, October and November weekdays, and similarly for weekends. No pooling was necessary for summer. We also computed a total estimate over the whole survey period by summing the seasonal estimates and their variance estimates (separate estimator strategy). We observe that estimates are highest in the summer season and lowest during winter. The confidence intervals are also narrower (relative to the estimate values) than those associated with monthly estimation in Figure 2.3.
                  Shoulder Season          Summer Season        Winter Season        Total
                  (Apr, May, Oct, Nov)     (June to Sept)       (Dec to Mar)         (Dec to Nov)
Estimator         Est    Low    Upp        Est    Low    Upp    Est    Low    Upp    Est    Low    Upp
$\hat{C}_1$       1672   1265   2079       3341   2872   3809   874    576    1172   5887   4767   7006
$\hat{C}_2$       1639   1231   2048       3526   2895   4156   887    612    1162   6052   4796   7308
$\hat{C}_{DE}$    2274   1840   2708       4027   3357   4698   1312   865    1758   7613   6136   9090
$\hat{C}_R$       1715   1368   2062       3616   2992   4239   873    565    1181   6203   4983   7424
Table 2.6: Seasonal combined estimates (Est) of total number of rainbow trout kept along with approximate 95% confidence intervals (Low,Upp). The last column is computed as a separate total estimate over the three seasons.
2.6 Discussion
In this chapter, we have provided estimation strategies to be used for aerial-access creel surveys with overflights occurring only on a subset of access survey days. The estimators
CˆR and CˆDE were shown to be the most suitable in terms of bias. However, the bias may be substantial if one cannot assume that parties fishing on a given day are generated from a model which gives to each party the same probability of being interviewed. Simulation results have shown that, when the first-phase sampling fraction is small, the bias of Cˆ1 and Cˆ2 can be negligible relative to standard error and thus does not affect the coverage of confidence intervals. We have applied our methods to the 2010-2011 Kootenay Lake creel survey for one variable of interest of the survey: the number of rainbow trout kept. Although conducting fewer overflights than ground surveys is thought to be more economical by the fisheries managers, our optimal allocation results suggest that this might not be true for a number of months/day type combinations for the estimation of totals, as many allocations lie on the boundary no = ng. However, the biological data obtained from the ground surveys is quite valuable for fisheries scientists. For example, changes in fish size and age composition are often used to evaluate population responses to management decisions such as changed daily catch limits. These variables do not require aerial surveys (the purpose of aerial surveys is to be able to estimate total effort and catch) but may suggest more ground survey effort to adequately describe their trends. A decision about optimal allocation for future years thus needs to balance the relative importance of the different quantities of interest of the survey.
It also remains to determine the best way to compute the quantities Ag when the overflight is not quite instantaneous, as in the Kootenay Lake survey where the average flight time one way is one hour. Another topic of interest is to investigate the appropriateness of the assumption that all scheduled overflights are conducted or that those missed occurred at random. Missing overflights are often due to weather conditions which can be related to the variable of interest such as catch and fishing effort. Ignoring the non response in this case could possibly lead to biased estimates.
Finally, the methods presented in this chapter can be applied in contexts other than fisheries; for instance, to estimate the attendance at a multi-day street festival. In this case, the access survey can consist of posting interviewers at some access locations and collecting arrival and departure times. The aerial survey can be replaced by a ground count of people at the peak attendance time of the day. Large areas can be covered by partitioning the total area into smaller sections and assigning a surveyor to each of them.
Chapter 3
Explicit integrated population modeling: escaping the conventional assumption of independence
3.1 Introduction
Monitoring changes in population size and structure (age, sex) provides valuable insight for effective management of animal populations. A common way to gain insight into population dynamics is to capture and mark cohorts of individuals with a unique identifier, followed by recaptures and/or resightings and/or dead recoveries of the marked animals. Capture-recapture, mark-resight, mark-recovery or mark-recapture-recovery surveys may be supplemented by other types of surveys on the same population such as periodic counts of individuals (all, adults, females, unmarked, etc.) or nests. Those counts are typically subject to observational error.

When multiple surveys are used to study a single population, the data can be analyzed separately by survey. However, a joint analysis of the data via integrated population modeling is often preferred because it can provide more precise estimates and/or permit the estimation of parameters that cannot be estimated using separate analyses. For instance, capture-recapture data and population count data alone do not permit the estimation of a fecundity rate but an integrated population model that combines both datasets does. For a recent review of publications where integrated population modeling has been used with bird and mammal populations, see Schaub and Abadi (2011).

Currently, integrated population models are typically formulated by multiplying the likelihoods of the various datasets. In some circumstances, this approach, while approximate, is indeed very good, for example if the different surveys are conducted on sub-populations
which do not share many individuals in common (nearly independent datasets), but have common demographic parameters. Simulation studies have been conducted to compare the estimates obtained when multiplying the likelihoods for both dependent and independent datasets; see Besbeas, Borysiewicz and Morgan (2008) and Abadi et al. (2010). However, an important approach has not been compared in these empirical studies, that is, the analysis of dependent datasets using the true joint likelihood. We pursue this idea in this chapter. In parallel with our work, there has recently been a growing interest in integrated population modeling methods that do not rely on an independence assumption; see, e.g., Chandler and Clark (2014) for a solution based on data augmentation.

To simplify the presentation of our methodology, we focus, for most of this chapter, on the case of a population studied using two surveys: a capture-recapture survey and a population count survey. In Section 3.2, we give some background and notation. In Section 3.3, we develop the model based on the true joint likelihood and we further explain how our model can be modified to accommodate a variety of situations (not only capture-recapture and population count data). In Section 3.4, we present the results of a simulation study. Finally, in Section 3.5, we apply our methodology to data from a colony of Greater horseshoe bats (Rhinolophus ferrumequinum) in Switzerland.
3.2 Background and notation
3.2.1 Capture-recapture survey
The data collection process of capture-recapture entails sending a survey crew into the field on a series of capture occasions. When an animal is captured for the first time, it is marked with a unique tag and released in its environment so that it can be identified if recaptured at a further capture event. When marked individuals are recaptured at a further capture event, their identity is recorded. Suppose that there are K capture occasions. The capture-recapture data can be sum- marized into an m-array1 with K − 1 lines and K columns:
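\[
\begin{array}{ccccc}
M_{12} & M_{13} & \cdots & M_{1K} & Z_1 \\
       & M_{23} & \cdots & M_{2K} & Z_2 \\
       &        & \ddots & \vdots & \vdots \\
       &        &        & M_{K-1,K} & Z_{K-1}
\end{array}
\]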
The first $K - 1$ columns of the m-array form an upper-triangular array that we denote by $M$, with the lines indexed by $i = 1, \ldots, K - 1$ and the columns indexed by $j = i + 1, \ldots, K$. The elements $M_{ij}$ represent the number of individuals released on occasion $i$ (after being either captured for the first time or recaptured) that are alive and next recaptured on occasion $j$. The last column of the m-array is denoted by $Z$ and is indexed by $1, \ldots, K - 1$, with $Z_i$ representing the number of individuals released on occasion $i$ that are never recaptured.

¹ The term m-array is commonly used in capture-recapture studies for the summary of the capture-recapture data derived from individual capture histories.

The capture-recapture data can be modeled using a Cormack-Jolly-Seber model, which conditions on the number of animals released at each occasion. The lines of the m-array are modeled using independent multinomial distributions conditional on the number of releases:
\[
[M_{i,i+1}, \ldots, M_{i,K}, Z_i \mid R_i = r_i] \overset{\text{indep}}{\sim} \text{Multinomial}(r_i, q_i), \quad \text{for } i = 1, \ldots, K - 1, \tag{3.1}
\]

where $R_i = \sum_{l=i+1}^{K} M_{il} + Z_i$ is the number of individuals released at time $i$ and

\[
q_i = \left( q_{(i,i+1)}, \ldots, q_{(i,K)}, 1 - \sum_{l=1}^{K-i} q_{(i,i+l)} \right)^{\top}
\]

is a vector of size $K - i + 1$, where $q_{ij}$ (written $q_{(i,j)}$ in the vector above) represents the probability that a marked individual survives² from occasion $i$ to occasion $j$ and is not recaptured until occasion $j$. Let $\phi = (\phi_1, \ldots, \phi_{K-1})^{\top}$, with $\phi_j$ representing the individual's probability of apparent survival from occasion $j$ to $j + 1$, and $p = (p_2, \ldots, p_K)^{\top}$, with $p_j$ representing the individual's probability of recapture at occasion $j$. Then, the $q_{ij}$'s can be expressed in terms of $\phi$ and $p$. For example, $q_{46} = \phi_4 (1 - p_5) \phi_5 p_6$ is the probability that a marked individual survives from occasion 4 to occasion 6 and is not recaptured until occasion 6.

The Cormack-Jolly-Seber model relies on a number of assumptions:
i. Survival is independent between individuals and does not depend on individual characteristics (sex, age, etc.)
ii. Capture is independent between individuals and does not depend on individual characteristics (sex, age, etc.)
iii. No temporary emigration (permanent emigration is confounded with death)
iv. No tag loss, no recording errors and marking does not affect the future behavior of an individual
v. Capture occasions are instantaneous
For inference, a conditional likelihood, L(φ, p|M,Z), is formed simply as the product of the K − 1 multinomial densities in (3.1).
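As an illustration of how the cell probabilities and the conditional likelihood fit together, the following is a minimal Python sketch. It assumes the standard Cormack-Jolly-Seber form $q_{ij} = \phi_i \cdots \phi_{j-1} (1 - p_{i+1}) \cdots (1 - p_{j-1}) p_j$, of which the example $q_{46}$ above is a special case. The function names and the list/dict layout of the inputs are hypothetical choices made for this sketch, not part of the original analysis.

```python
from math import lgamma, log


def cjs_cell_probs(phi, p):
    """Cell probabilities of the m-array under a Cormack-Jolly-Seber model.

    phi[j-1] holds phi_j (j = 1..K-1) and p[j-2] holds p_j (j = 2..K).
    Returns a dict mapping i to [q_(i,i+1), ..., q_(i,K), P(never recaptured)].
    """
    K = len(phi) + 1
    q = {}
    for i in range(1, K):
        row = []
        alive_unseen = 1.0  # alive since release at i and not yet recaptured
        for j in range(i + 1, K + 1):
            alive_unseen *= phi[j - 2]           # survive from occasion j-1 to j
            row.append(alive_unseen * p[j - 2])  # next recapture occurs at j
            alive_unseen *= 1.0 - p[j - 2]       # missed at occasion j
        row.append(1.0 - sum(row))               # cell corresponding to Z_i
        q[i] = row
    return q


def cjs_log_likelihood(phi, p, M, Z):
    """Log of the product of the K-1 multinomial densities in (3.1).

    M[i] lists (M_{i,i+1}, ..., M_{i,K}) and Z[i] is Z_i, for i = 1..K-1.
    """
    total = 0.0
    for i, probs in cjs_cell_probs(phi, p).items():
        counts = list(M[i]) + [Z[i]]
        total += lgamma(sum(counts) + 1)         # log(r_i!)
        for n, pr in zip(counts, probs):
            if n > 0:
                total += n * log(pr) - lgamma(n + 1)
    return total
```

Maximizing this function over $(\phi, p)$, or combining it with priors, would be handled by a separate optimizer or sampler; the sketch only evaluates the likelihood.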
² Apparent survival is used because permanent emigration from the study area is indistinguishable from death. Unless explicitly stated otherwise, survival in this chapter is always apparent survival.
3.2.2 Population count survey
A count survey of the population provides information on the relative changes in population size over time. Suppose that a population is studied using $K$ population counts equally spaced in time. Let the count data be denoted by a vector $Y$ of size $K$, with $Y_i$ being the number of individuals counted on occasion $i$. Note that the counts $Y_i$ are typically imperfect because they are subject to observational error.

The count data are typically modeled using a state-space model (Buckland et al., 2004). State-space models involve a state process and an observation process. In this case, the state process is the latent process that governs the changes in population size between counts. Let $N = (N_1, \ldots, N_K)^{\top}$, where $N_j$ is the population size at the time of the $j$th population count.
The state process is defined by specifying a distribution for $N_j$ conditional on $N_{j-1}$. We assume that the birth process is instantaneous and that births occur right after population counts. A simple model for $N_j$, $j = 2, \ldots, K$, that accounts for births and survival could be

\[
N_j \mid N_{j-1}, B_{j-1} \sim \text{Binomial}(N_{j-1} + B_{j-1}, \phi_{j-1}), \quad \text{for } j = 2, \ldots, K, \tag{3.2}
\]

with the number of births right after the $j$th count defined as

\[
B_j \mid N_j \sim \text{Poisson}(N_j f_j / 2), \quad \text{for } j = 1, \ldots, K - 1. \tag{3.3}
\]
The division by 2 reflects the assumed 50/50 sex ratio, so that $f_j$ can be interpreted as a fecundity per female. For the sake of simplicity, this model assumes that juvenile and adult survival probabilities are the same, although this is often not true in real populations. This state-space model relies on a number of assumptions:
i. Females start reproducing at one year of age
ii. The expected sex ratio of newborns is 50%
iii. Survival is independent between individuals and does not depend on individual characteristics (sex, age, etc.)
iv. No immigration and no temporary emigration (permanent emigration is confounded with death).
In addition to the state process, an observation process describes the population count data, $Y$, conditional on $N$. In practice, a normal distribution is often used as an approximation:

\[
Y_j \mid N_j \overset{\text{indep}}{\sim} \text{Normal}(N_j, \sigma^2), \quad \text{for } j = 1, \ldots, K.
\]

Note that if larger counts are thought to be less precise than smaller ones, a log-normal distribution can be used instead:

\[
\log(Y_j) \mid N_j \overset{\text{indep}}{\sim} \text{Normal}(\log(N_j), \sigma^2), \quad \text{for } j = 1, \ldots, K.
\]
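To make the generative structure of the state-space model concrete, the following Python sketch simulates the state process (3.2)-(3.3) and the normal observation model. It is only an illustration: the function names, the argument layout and the simple standard-library samplers are assumptions made for this sketch (they are adequate for small populations, not for production use).

```python
import math
import random


def rbinom(rng, n, prob):
    """Binomial draw as a sum of n Bernoulli trials (fine for small n)."""
    return sum(rng.random() < prob for _ in range(n))


def rpois(rng, lam):
    """Poisson draw by Knuth's multiplication method (fine for moderate lam)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(n1, phi, f, sigma, seed=0):
    """Simulate (N, B, Y) from (3.2), (3.3) and the normal observation model.

    phi[j-1] holds phi_j and f[j-1] holds f_j, for j = 1..K-1.
    """
    rng = random.Random(seed)
    N, B = [n1], []
    for j in range(len(phi)):
        B.append(rpois(rng, N[j] * f[j] / 2))       # births right after the count, eq. (3.3)
        N.append(rbinom(rng, N[j] + B[j], phi[j]))  # survivors to the next count, eq. (3.2)
    Y = [rng.gauss(n, sigma) for n in N]            # imperfect counts around the true sizes
    return N, B, Y
```

For example, `simulate_counts(50, [0.8] * 4, [0.6] * 4, 5.0)` produces one trajectory with $K = 5$ counts. Replacing the last step by a draw on the log scale would give the log-normal variant.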
The likelihood of the count survey data is thus
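\[
L(\phi, f, \sigma^2, N_1 \mid Y) = \sum_{(N^*, B) \in \Omega} P(Y_1 \mid N_1) \prod_{j=2}^{K} \left[ P(Y_j \mid N_j) \, P(N_j \mid B_{j-1}, N_{j-1}) \, P(B_{j-1} \mid N_{j-1}) \right],
\]

where $N^* = (N_2, \ldots, N_K)^{\top}$ and $\Omega$ is the set of all possible values for $(N^*, B)$.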
3.2.3 Integrated population modeling via likelihood multiplication
In the ecological literature, integrated population models have typically been obtained by multiplying the likelihoods of the separate datasets. In the case of a population studied with both a capture-recapture and a count survey, the following pseudo-likelihood is constructed by multiplying the capture-recapture likelihood and the population count likelihood:
\[
L^c(\phi, f, p, N_1 \mid Y, M, Z) = L(\phi, p \mid M, Z) \, L(\phi, f, N_1, \sigma^2 \mid Y). \tag{3.4}
\]
Note that the capture-recapture likelihood and the population count survey likelihood have a parameter $\phi$ in common, which represents survival both between capture occasions and between counts. Therefore, this approach assumes that the $j$th capture occasion and the $j$th count occur at about the same time for all $j$.

The capture-recapture data and the count data are not independent when both surveys are conducted on a single population (or overlapping populations). In the literature so far, the term independence assumption has been used to describe the likelihood (3.4). The use of this likelihood is attractive in practice because of its simplicity and because it uses a reduced number of parameters. This likelihood multiplication approach is reminiscent of the naive Bayes approach (Koller and Friedman, 2009).

Equation (3.4) is not the true joint likelihood but rather a composite likelihood (Varin et al., 2011). Hence, it provides unbiased estimating equations. However, treating it as the true joint likelihood for inference leads to incorrect variance estimates and hence to confidence intervals that do not have the targeted confidence level. Surprisingly, this characteristic of the composite likelihood seems to have been overlooked in the integrated population modeling literature so far. In particular, the simulation studies of Besbeas, Borysiewicz and Morgan (2008) and Abadi et al. (2010) investigate the frequentist properties of the estimators but do not investigate the properties of the variance estimators and confidence/credible intervals. This will be addressed in Section 3.4.
3.3 Integrated population modeling based on the true joint likelihood
3.3.1 Capture-recapture and count data
Suppose, as in Section 3.2, that we have capture-recapture data $(M, Z)$ and count data $Y$ and that assumptions i.-v. and i.-iv. from Sections 3.2.1 and 3.2.2, respectively, are met. In order to formulate an explicit model, we have to take into account the order in which the surveys and the demographic gains and losses occur in the population. For the sake of illustration, we assume that the events follow the timeline represented in Figure 3.1.
Figure 3.1: Timeline of events of the animal population study. The symbols “C”, “B” and “CR” stand for count survey, births and capture-recapture, respectively. Note that the time between the count survey, the births and the capture-recapture survey in each period is negligible.
The formulation of an explicit integrated population model based on the true joint likelihood can be achieved using a Bayesian model (Koller and Friedman, 2009). The key to formulating the true joint likelihood is to introduce a set of latent variables so that when combined with the capture-recapture data, one can deduce, at any point in time, the state of the population, that is
• the number of unmarked animals in the population
• the number of marked animals remaining (alive and not recaptured) in each released cohort.
A set of variables that is appropriate is
• $N_1$, the population size at the beginning of the study
• $D^u$, a vector of length $K - 1$, where $D^u_j$ represents the number of unmarked animals that died in period $j$
• $D^m$, an upper triangular array indexed by $i = 1, \ldots, K - 1$ and $j = i, \ldots, K - 1$, where $D^m_{ij}$ represents the number of marked animals released for the last time in period $i$ that died in period $j$
• $B$, a vector of length $K$, where $B_j$ represents the number of births³ in period $j$.
In order to be able to follow the transition of individuals from an unmarked state to a marked state, it is convenient to reparametrize the capture-recapture data as $(M, C)$ rather than $(M, Z)$, where $C$ is a vector of length $K - 1$ and $C_j$ represents the number of individuals captured for the first time (i.e. marked) in period $j$. The relationship between $(M, C)$ and $(M, Z)$ is one-to-one; they contain the same information. The quantity $C_j$ can be computed from the capture-recapture data $M$ and $Z$ as the number of individuals released at period $j$ minus the number of individuals recaptured at period $j$, that is:

\[
C_j = Z_j + \sum_{l=j+1}^{K} M_{jl} - \sum_{k=1}^{j-1} M_{kj}, \quad \text{for } 2 \le j \le K - 1,
\]

with $C_1 = Z_1 + \sum_{l=2}^{K} M_{1l}$.

To show that our parametrization $\{N_1, D^u, D^m, B, M, C\}$ allows us to track the state of the population at any point in time, we constructed Table 3.1, which illustrates the case of $K = 3$ periods. Each line of the table shows the distribution of the population across states at a given time. Each column of the table follows the change in population size, over time, per state. We added a column to the right of the table to keep track of the total population, because this column will be useful for modeling the count data.
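The conversion from $(M, Z)$ to $(M, C)$ given by the formula for $C_j$ can be sketched in a few lines of Python. The function name and the dictionary layout (keys `(i, j)` for $M_{ij}$ and `i` for $Z_i$) are hypothetical choices for this illustration, and the numbers in the usage note are made up.

```python
def first_captures(M, Z):
    """Number of first captures C_j computed from the m-array.

    M[(i, j)] holds M_ij for 1 <= i < j <= K and Z[i] holds Z_i for i = 1..K-1.
    Returns a dict {j: C_j} for j = 1..K-1.
    """
    K = len(Z) + 1
    C = {}
    for j in range(1, K):
        # Individuals released at period j: R_j = Z_j + sum over l > j of M_jl
        released = Z[j] + sum(M[(j, l)] for l in range(j + 1, K + 1))
        # Previously marked individuals recaptured at period j: sum over k < j of M_kj
        recaptured = sum(M[(k, j)] for k in range(1, j))
        C[j] = released - recaptured
    return C
```

With $K = 3$ and the made-up values `M = {(1, 2): 3, (1, 3): 1, (2, 3): 4}` and `Z = {1: 6, 2: 5}`, the function returns `{1: 10, 2: 6}`: ten animals are marked in period 1, and of the nine released in period 2, three were recaptures.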
³ Births is the term generally used to represent any source of new animals in the study area. In general, the new animals do not have to be juveniles.
29 Timeline Number of unmarked Number of marked Number of marked Number of marked Total number of of events individuals individuals last individuals last individuals last individuals released during released during released during period 1 period 2 period 3 N1 N1 Count N1 N1 Births N1 + B1 N1 + B1 Period 1 Captures N1 + B1 − C1 C1 N1 + B1 Deaths u m u m N1 + B1 − C1 − D1 C1 − D11 N1 + B1 − D1 − D11 Count u m u m N1 + B1 − C1 − D1 C1 − D11 N1 + B1 − D1 − D11 Births 2 2 P u m P u m N1 + Bj − C1 − D1 C1 − D11 N1 + Bj − D1 − D11 j=1 j=1 Period 2 C&R
30 2 2 2 P P u m P u m N1 + Bj − Cj − D1 C1 − D11 − M12 C2 + M12 N1 + Bj − D1 − D11 j=1 j=1 j=1 Deaths 2 2 2 2 2 2 2 2 P P P u P m m P P u P P m N1 + Bj − Cj − Dj C1 − D1j − M12 C2 + M12 − D22 N1 + Bj − Dj − Dij j=1 j=1 j=1 j=1 j=1 j=1 i=1 j=i Count 2 2 2 2 2 2 2 2 P P P u P m m P P u P P m N1 + Bj − Cj − Dj C1 − D1j − M12 C2 + M12 − D22 N1 + Bj − Dj − Dij j=1 j=1 j=1 j=1 j=1 j=1 i=1 j=i Births 3 2 2 2 3 2 2 2 P P P u P m m P P u P P m N1 + Bj − Cj − Dj C1 − D1j − M12 C2 + M12 − D22 N1 + Bj − Dj − Dij
Period 3 j=1 j=1 j=1 j=1 j=1 j=1 i=1 j=i C&R 3 3 2 2 3 2 3 2 2 2 P P P u P m P m P P P u P P m N1 + Bj − Cj − Dj C1 − D1j − M1j C2 + M12 − D22 − M23 C3 + Mi3 N1 + Bj − Dj − Dij j=1 j=1 j=1 j=1 j=2 i=1 j=1 j=1 i=1 j=i Deaths
Table 3.1: Changes in the population size per state over time for a study with $K = 3$ periods. The table follows the timeline in Figure 3.1. Starting in the upper left corner of the table, the population consists of $N_1$ unmarked individuals at the beginning of period 1. Then, the count survey occurs (which affects neither the state nor the size of the population). Then, $B_1$ births occur, resulting in $N_1 + B_1$ unmarked individuals in the population. Then, $C_1$ individuals are captured, marked and released, which leaves $N_1 + B_1 - C_1$ unmarked individuals in the population. Then, $D^u_1$ unmarked individuals die and $D^m_{11}$ marked individuals die. When period 2 begins, there are respectively $N_1 + B_1 - C_1 - D^u_1$ and $C_1 - D^m_{11}$ unmarked and marked individuals in the population. The table goes on like this until the study is finished. Note: C&R is used to abbreviate “captures and recaptures”.

Next, we exploit Table 3.1 to define conditional distributions for the data $M$, $C$ and $Y$ as well as for the latent random variables $B$, $D^u$ and $D^m$ (we do not model $N_1$ because this parameter is at the top of the hierarchy). For ease of notation, we do not specify the variables that the distributions are conditioned upon. Also, sums whose lower limit exceeds their upper limit are defined as zero, and undefined variables (e.g. $M_{11}$) are defined as zero. The parts of the equations that are highlighted are derived from Table 3.1.
• $C_j \sim \text{Binomial}(N^u_j + B_j, \xi_j)$ for $j = 1, \ldots, K - 1$, where $N^u_j = N_1 + \sum_{l=1}^{j-1} (B_l - C_l - D^u_l)$ is the number of unmarked individuals at the beginning of period $j$

• $M_{ij} \sim \text{Binomial}\left( R_i - \sum_{l=i}^{j-1} D^m_{il} - \sum_{l=i+1}^{j-1} M_{il}, \, p_j \right)$ for $i = 1, \ldots, K - 1$ and $j = i + 1, \ldots, K$, where $R_i = C_i + \sum_{k=1}^{i-1} M_{ki}$ is the number of released individuals in period $i$