Parametric, Bootstrap, and Jackknife Variance Estimators for the K-Nearest Neighbors Technique with Illustrations Using Forest Inventory and Satellite Image Data
Remote Sensing of Environment 115 (2011) 3165–3174

Ronald E. McRoberts a,⁎, Steen Magnussen b, Erkki O. Tomppo c, Gherardo Chirici d
a Northern Research Station, U.S. Forest Service, Saint Paul, Minnesota, USA
b Pacific Forestry Centre, Canadian Forest Service, Vancouver, British Columbia, Canada
c Finnish Forest Research Institute, Vantaa, Finland
d University of Molise, Isernia, Italy
⁎ Corresponding author. Tel.: +1 651 649 5174; fax: +1 651 649 5140. E-mail address: [email protected] (R.E. McRoberts).
0034-4257/$ – see front matter. Published by Elsevier Inc. doi:10.1016/j.rse.2011.07.002

Article history: Received 13 May 2011; Received in revised form 6 June 2011; Accepted 7 July 2011; Available online 27 August 2011
Keywords: Model-based inference; Cluster sampling

Abstract

Nearest neighbors techniques have been shown to be useful for estimating forest attributes, particularly when used with forest inventory and satellite image data. Published reports of positive results have been truly international in scope. However, for these techniques to be more useful, they must be able to contribute to scientific inference which, for sample-based methods, requires estimates of uncertainty in the form of variances or standard errors. Several parametric approaches to estimating uncertainty for nearest neighbors techniques have been proposed, but they are complex and computationally intensive. For this study, two resampling estimators, the bootstrap and the jackknife, were investigated and compared to a parametric estimator for estimating uncertainty using the k-Nearest Neighbors (k-NN) technique with forest inventory and Landsat data from Finland, Italy, and the USA. The technical objectives of the study were threefold: (1) to evaluate the assumptions underlying a parametric approach to estimating k-NN variances; (2) to assess the utility of the bootstrap and jackknife methods with respect to the quality of variance estimates, ease of implementation, and computational intensity; and (3) to investigate adaptation of resampling methods to accommodate cluster sampling. The general conclusions were that support was provided for the assumptions underlying the parametric approach, the parametric and resampling estimators produced comparable variance estimates, care must be taken to ensure that bootstrap resampling mimics the original sampling, and the bootstrap procedure is a viable approach to variance estimation for nearest neighbor techniques that use very small numbers of neighbors to calculate predictions.
Published by Elsevier Inc.

1. Introduction

Nearest neighbors techniques have emerged as a useful and popular approach for forest inventory mapping and areal estimation, particularly when used with satellite imagery as ancillary data. Nearest neighbors techniques are multivariate, non-parametric approaches to estimation based on similarity in a space of ancillary variables between a population unit for which an estimate is required and population units for which observations are available. Applications have been reported for a large number of countries in Europe, North and South America, Asia, and Africa (Fig. 1) (McRoberts et al., 2010). A bibliography of nearest neighbors papers is available at: http://blue.for.msu.edu/NAFIS/biblio.html.

Fig. 1. Nearest neighbor applications have been reported for countries depicted in gray.

Recent nearest neighbors investigations have shifted from simple descriptions of applications to more foundational work on efficiency and inference. McRoberts (2009a) reported diagnostic tools for use with univariate continuous response variables. Finley et al. (2006) and Finley and McRoberts (2008) investigated enhanced search algorithms for identifying nearest neighbors. Tomppo and Halme (2004), Tomppo et al. (2009), and McRoberts (2009b) used a genetic algorithm approach to optimize the weights for ancillary variables in the distance metric. Magnussen et al. (2010b) developed a calibration technique that improves predictions when nearest neighbors are relatively distant.

Despite these advances, the full potential of nearest neighbors techniques cannot be realized unless they can be used to construct valid statistical inferences. For probability-based inference, McRoberts et al. (2002) illustrated use of nearest neighbors techniques to support stratified estimation, and Baffetta et al. (2011, 2009) described use of nearest neighbors techniques with the model-assisted difference estimator (Särndal et al., 1992). For model-based inference, Magnussen et al. (2009) developed an estimator for mean square error, and Magnussen et al. (2010a) reported a balanced repeated replications (BRR) estimator of variance. McRoberts et al. (2007) derived a parametric nearest neighbors variance estimator for areal means from the conceptual assumptions underlying k-NN estimation. However, the latter estimator is complex, computationally intensive, and is based on assumptions that have not been closely investigated.
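The abstract names three ingredients: a k-NN estimate of an areal mean, and jackknife and bootstrap estimates of its variance. The following is a minimal Python sketch of generic textbook versions of these, under stated assumptions: Euclidean distance on unweighted ancillary variables, equal weights for the k neighbors, a delete-one jackknife, and a bootstrap that resamples whole clusters when cluster labels are supplied, so that resampling mimics a clustered sample. All function and parameter names (`knn_predict`, `knn_area_mean`, `jackknife_variance`, `bootstrap_variance`, `clusters`, `n_boot`) are hypothetical, the data are synthetic, and the sketch should not be read as the estimators developed in this paper.

```python
import math
import random
from statistics import mean


def knn_predict(x, ref_X, ref_y, k=5):
    """Mean response of the k reference units nearest to x (Euclidean distance)."""
    order = sorted(range(len(ref_X)), key=lambda i: math.dist(x, ref_X[i]))
    return mean(ref_y[i] for i in order[:k])


def knn_area_mean(target_X, ref_X, ref_y, k=5):
    """Areal mean estimate: average of k-NN predictions over the target units."""
    return mean(knn_predict(x, ref_X, ref_y, k) for x in target_X)


def jackknife_variance(target_X, ref_X, ref_y, k=5):
    """Delete-one jackknife: drop each reference unit in turn and re-estimate."""
    n = len(ref_y)
    reps = [
        knn_area_mean(target_X, ref_X[:i] + ref_X[i + 1:], ref_y[:i] + ref_y[i + 1:], k)
        for i in range(n)
    ]
    centre = mean(reps)
    return (n - 1) / n * sum((r - centre) ** 2 for r in reps)


def bootstrap_variance(target_X, ref_X, ref_y, k=5, n_boot=200, clusters=None, seed=0):
    """Bootstrap variance; whole clusters are resampled when labels are given."""
    rng = random.Random(seed)
    if clusters is None:
        clusters = list(range(len(ref_y)))  # assumption: each unit is its own cluster
    groups = {}
    for idx, label in enumerate(clusters):
        groups.setdefault(label, []).append(idx)
    labels = list(groups)
    reps = []
    for _ in range(n_boot):
        # Draw clusters with replacement and keep every unit of each drawn cluster.
        # Duplicated units may then appear more than once among the k neighbors.
        drawn = [i for lab in rng.choices(labels, k=len(labels)) for i in groups[lab]]
        reps.append(
            knn_area_mean(target_X, [ref_X[i] for i in drawn], [ref_y[i] for i in drawn], k)
        )
    centre = mean(reps)
    return sum((r - centre) ** 2 for r in reps) / (n_boot - 1)


# Synthetic illustration only: 40 reference units in 10 clusters of 4, 25 target units.
demo = random.Random(1)
ref_X = [(demo.random(), demo.random()) for _ in range(40)]
ref_y = [10 * a + 5 * b + demo.gauss(0, 1) for a, b in ref_X]
target_X = [(demo.random(), demo.random()) for _ in range(25)]
cluster_ids = [i // 4 for i in range(40)]

print("areal mean:", knn_area_mean(target_X, ref_X, ref_y, k=3))
print("jackknife variance:", jackknife_variance(target_X, ref_X, ref_y, k=3))
print("cluster bootstrap variance:",
      bootstrap_variance(target_X, ref_X, ref_y, k=3, clusters=cluster_ids))
```

Passing `clusters=None` gives the ordinary unit-level bootstrap. Comparing that result with the cluster-level result on the same data is one simple way to see how much the resampling unit matters.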
Resampling procedures are particularly well-suited for complex and non-parametric model applications and for applications requiring assumptions whose validity is difficult to assess. The jackknife resampling procedure was first proposed by Quenouille (1949) for bias reduction and by Tukey (1958) for variance estimation. The jackknife procedure produces estimates of properties of statistical estimators by sequentially deleting observations from the original sample and then re-calculating estimates using the reduced samples. The bootstrap resampling procedure was invented by Efron (1979, 1981, 1982) and further improved by Efron and Tibshirani (1994). With bootstrapping, properties of an estimator, such as its variance, are estimated via repeated sampling with replacement from an approximating distribution such as the empirical distribution of the sample observations.

For survey applications, of which forest inventory is an example, resampling methods have generated moderate interest. Shao (1996) reviewed resampling methods for sample survey applications and noted that among these methods BRR, the jackknife, and the bootstrap are the most popular. Rao (2007) briefly reviewed resampling methods for survey applications and reported use of the jackknife and bootstrap methods for small area estimation using a linear model. Chambers and Dorfman (2003) describe how a design-based bootstrap can be applied to model-based sample survey inference.

Literature on the use of resampling methods in conjunction with nearest neighbors techniques is sparse. For classification applications, Steele and Patterson (2000) proposed a weighted nearest neighbors technique based on resampling ideas, Chen and Shao (2001) reported use of jackknife methods to impute missing values for estimation of a design-based mean, and Shao and Sitter (1996) used bootstrapping in conjunction with imputation for missing data. Magnussen et al. (2010a) reported a BRR application using nearest neighbors techniques with forest inventory and satellite image data. Although the latter estimator performed well, sample sizes for which it can be applied are limited. In addition, Field and Welsh (2007, page 389) reported that for clustered data, the bootstrap estimator produces more reliable estimates of standard errors than the BRR estimator. Nothdurft et al. (2009) reported using bootstrap procedures to estimate variances of estimates of forest variables obtained using nearest neighbors techniques but did not elaborate on implementation of the resampling procedures. In summary, so few reports have been published on resampling variance estimators for use with nearest neighbors techniques that no consensus has emerged regarding their general applicability or utility. In addition, adaptation of resampling methods to accommodate cluster sampling, a feature of many forest inventory programs, has rarely been addressed.

The overall objective of the study was to compare parametric, bootstrap, and jackknife methods for estimating the variances of estimates of small area means of forest attributes obtained using the k-Nearest Neighbors (k-NN) technique. The investigations focused on three particular technical objectives: (1) to evaluate the assumptions underlying a parametric approach to estimating k-NN variances; (2) to assess the utility of the bootstrap and jackknife methods with respect to the quality of variance estimates, ease of implementation, and computational intensity; and (3) to investigate adaptation of resampling methods to accommodate cluster sampling. The investigations were based on Landsat Thematic Mapper (TM) imagery and forest inventory plot data from Finland, Italy, and the United States of America (USA).

2. Data

Four datasets, one each for Finland and Italy and two for the USA, as described below were used for the study.

2.1. North Karelia, Finland

The study area is a portion of the North Karelia forestry center in eastern Finland. Landsat 7 ETM+ data for rows 16 and 17 of path 186 were obtained for June 2000. Raw spectral data for the seven Landsat 7 ETM+ bands were used. Within the study area, an 8-km×8-km area of