Population Size Estimation and Linkage Errors: the Multiple Lists Case

Total Page:16

File Type:pdf, Size:1020Kb

Population Size Estimation and Linkage Errors: the Multiple Lists Case

POPULATION SIZE ESTIMATION AND LINKAGE ERRORS: THE MULTIPLE LISTS CASE Tiziana Tuoto ([email protected])1, Loredana Di Consiglio2

Keywords: probabilistically linked data, linkage errors, multiple system estimation, log- linear model

1. INTRODUCTION

The integration and combination of external sources with traditional statistical surveys data is a pressing challenge for National Statistical Institutes. Micro-level integration of sources is a standard practice, generally performed by means of record linkage techniques. However, as a matter of fact, the linkage process is not completely error free and statisticians shall take linkage errors into account in the subsequent analyses performed on integrated data [1]. Linkage errors appear particularly relevant when the target is to measure the size of a population (partially) enumerated in different lists [2]. In the context of population size estimation, a widespread method is based on the capture- recapture model and multi-capture when dealing with more than two lists [3,4,5]. The standard capture-recapture model assumes no linkage errors, nevertheless, few contributions [6, 2] have already dealt with this issue. In the context of multiple recaptures, this study explores adjustments for linkage errors in population size estimators with multiple lists, extending [7] that considered missing links only to deal with erroneous links also and applying a more realistic error model.

2. ERROR MODEL AND ERROR MATRIX

2.1. Capture-recapture model

Let us consider the case of three captures (lists). If no errors occur, we can view the data classified in the following incomplete 23 table:

Table 1. True table for cell counts, without matching errors

list 1 present absent list 2 list 2 list 3 present absent present absent

present n111 n110 n011 n010

absent n101 n100 n001 n000 where, according to the standard formalization, nijk is the cell count. Similarly, rijk is the associated cell probability, with i,j,k=1,0. Then the table is incomplete due to the fact that n000 is not observable.

The effect of matching errors “perturbs” observations in table 1 in two directions: the number of observations may increase for some cells and decrease for others, and especially the total number of different individuals observed in the three lists may change, subject to the constraint that the total number of observations in each list, n1++ n+1+, n++1, remain the same. Consequently, there may be changes in some cell

1 Istat, Italian National Statistical Institute, Italy 2 Eurostat, European Commission, Luxemburg probabilities, again subject to the constraint that the probability of being captured in a list is unchanged.

The observed counts and probabilities, subject to matching errors, are reported in table 2.

Table 2. Observed table for cell counts list 1 present absent list 2 list 2 absen list 3 present absent present t

present o111 o110 o011 o010

absent o101 o100 o001 o000

Where o={oijk, i,j,k=1,2} are the observed counts and p={pijk, i,j,k=1,2} are their corresponding probabilities.

2.2. Error model with missing and false links

Following [7], we consider the possible transitions from the real configuration n to the observed o. We assume that the transitions occur in function of the probability of missing a true match and probability of false match as well, taking into account the nature of the link process in the two phases (first linkage of list 1 and 2 and second linkage with list 3).

To this purpose, let be the probability of missing a linkage in the first linkage and be the probability of missing a linkage in the second linkage; moreover let be the probability of false link in step1 and let be the probability of false link in step 2.

First, we study the effect of linkage errors introducing the decompositions caused by the probability of missing a true match only, but differently from [7], we aim at taking into account the realistic situation of link process in the two phases. Then, for example, the possible alternative transitions from the real to the observed ones, are:

with probability

, with probability , with probability

, with probability and finally

, , with probability

A similar reasoning for the other cells combinations allows to determine a transition matrix, not reported here for room limitations.

The transition matrix can further be extended to include the false linkage errors also. As before we assume different linkage errors for the first phase and the second phase. Additionally, we assume that whenever in the same phase a true link is missed, a false link cannot occur, because this event occurs when at least two errors are made in the same phase (that is, records are incorrectly linked and the correct link is missed), so it has a negligible probability of occurrence as in [7] and in [2]. Under the previous hypotheses, and treating at the same time transitions produced by the false and missing linkage errors, we obtain the following transition matrix

Table 3. Error matrix from real to observed cells with missing and false links

111 110 101 100 011 010 001

- -

- -

) - - -

) - -

3. ESTIMATION OF POPULATION SIZE

The previous transition matrices can be applied to observed data in order to provide estimates of the cell probabilities of the real not-observable table: p=Mr and multiplying by N one may obtain the expected cell count o=Mn .

To estimate the population size N, as suggested in [7], we first compute the Maximum Likelihood estimates of the parameters from the conditional likelihood associated with observed cell count p given n. Once the conditional maximum likelihood estimates of the log-linear model are obtained, we use the loglinear model specified for the not-observed real values to compute the conditional maximum likelihood estimates of the expected cell counts , including the one of the missing cell. Thus, .

3.1. An application

In this application, we consider the three-list data also used in [7], reporting counts from three sources: the 1990 U.S. census, the corresponding post enumeration survey PES, the administrative list supplement (ALS) for the PES sampling strata 11 in St. Louis. In table 4 the three-sample data are shown.

Table 4. Three-sample data for stratum 11, St. Louis, 1990 U.S. Census Census = list 1 present absent PES = list 2 PES = list 2 ALS = list 3 present absent present absent Present 300 51 53 180 Absent 187 166 76 -

We also consider matching error as in [7], where the Matching Error Study [8] was used in order to evaluate both the probability of missing a link in linkage procedure between Census and PES, and the missing linkage probability of linkage involving ALS, in absence of better quantitative information, under the assumption of no errors in the rematch. The results of the Matching Error Study for 1990 U.S. Census in St. Louis stratum are reported in table 5 [7]

Table 5. St. Louis Rematch Study Rematch Classification Original match classification Matched Not matched Unresolved Total Matched 2667 7 8 2682 Not matched 9 427 30 466 Unresolved 0 7 20 27 Total 2676 441 58 3175

Ignoring the unresolved cases, [7] claimed that the probability of missing a true links can be estimated as = 9/(2667+9) = 0.3363%. Following the same reasoning we evaluate the probability of false link, = 7/(7+2667) = 0.2618%.

[7] examined various log-linear model with different dependency structure in order to better fit data in table 4, and claimed that the [CP][PA] fit the data much better and produced an estimate . Applying their correction for missing links only, [7] estimated With the error matrix specified in table 3, we obtain . This value falls into both the confidence intervals of the previous estimates.

Next step of the work requires the evaluation of the standard error of the adjusted estimates, complicated in the case of estimated linkage errors. Moreover, the sensitivity of the estimates with respect to accuracy of linkage error evaluation should be analyzed.

REFERENCES

[1] Chambers, R. (2009). Regression Analysis Of Probability-Linked Data. Official Statistics Research Series 4

[2] Di Consiglio L. and Tuoto T. (2015) Coverage Evaluation on Probabilistically Linked Data, Journal of Official Statistics, Vol. 31, No. 3, 2015, pp. 415–429

[3] Cormack R. M. (1989) Log-linear models for capture-recapture. Biometrics, 45 [4] Fienberg, S.E. (1972). “The Multiple Recapture Census for Closed Populations and Incomplete 2k Contingency Tables.” Biometrika 59: 409–439 [5] IWGDMF - International Working Group for Disease Monitoring and Forecasting (1995). Capture-recapture and multiple-record systems estimation I: History and theoretical development. American Journal of Epidemiology, 142:1047-1058. [6] Lee, A.J., G.A.F. Seber, J.K. Holden, and J.T. Huakau. 2001. “Capture-Recapture, Epidemiology, and List Mismatches: Several Lists.” Biometrics 57: 707–713. [7] Ding, Y. and Fienberg, S.E. (1994). Multiple sample estimation of population and census undercount in the presence of matching error. In Proceedings of 1994 Annual research conference and CASIC technologies Interchange, Bureau of Census, US [8] Murly M.H., Dajani A., Biemer P. (1989) The Matching Error Study for the 1988 Dress Rehearsal Proceedings of the Section on Survey Research Methods, ASA, 704-709

Recommended publications