
STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1199—1202

FROM THE EDITOR

This issue contains twelve articles devoted to small area estimation, methods of estimation and other topics, one book review, three reports, an announcement of a conference and acknowledgements of the referees of Volume 7.

The following three articles are devoted to small area estimation:

1. Attempts at Applying Small Area Estimation Methods in Agricultural Sample Surveys in Poland (by Dorota Bartosińska from Poland). The paper is devoted to the application of small area estimation (SAE) methods in agricultural sample surveys in Poland, using the last Census of Agriculture as an auxiliary source of data. The author briefly describes agricultural sample surveys in Poland and the sources of additional data which may be used in SAE methods in agriculture. To obtain more precise estimates of agricultural characteristics by county (poviat) from agricultural sample surveys, empirical and hierarchical Bayes estimators are used together with auxiliary information from the last census of agriculture. Two regression models are considered: an area-level model and a unit-level model. The unit-level approach required matching of particular farms in the agricultural sample surveys to the census of agriculture. The precision of the model-based estimates is significantly higher than that of the direct estimates. The ecological effect, which can cause estimation results to differ depending on the type of regression model, is also discussed.

2. Estimation of the Mean Squared Error of Model-based Estimators (by Wojciech Rabiega from Poland). The author stresses that model-based methods of small area estimation have received a lot of attention because they make specific allowance for local variation through complex error structures in models that link small areas. Efficient indirect estimators can be obtained under the assumed models, and the models can be validated from the sample data. Stable area-specific measures of variability associated with the estimates can be obtained, unlike the overall measures for synthetic and composite estimators.

3. On Accuracy of EBLUP under Random Regression Coefficient Model (by Tomasz Żądło from Poland). The author analyzes the accuracy of the empirical best linear unbiased predictor (EBLUP) of the domain total (see Royall, 1976), assuming a random coefficient superpopulation model which is a special case of the general linear mixed model. To estimate the mean square error (MSE) of the EBLUP he uses the results obtained by Datta and Lahiri (2000) for the predictor proposed by Henderson (1950) and adapts them to the predictor proposed by Royall (1976). In a simulation study based on real data on Polish farms from one region, he considers the decrease in accuracy of the EBLUP compared with the best linear unbiased predictor (BLUP) due to the estimation of unknown variance components. He also compares the MSE of the EBLUP with the MSEs of two other predictors which are BLUPs but under different models.

The following papers are devoted to methods of estimation:

4. Estimation of a Population Mean Using Different Imputation Methods (by M. S. Ahmed from Oman, Omar Al-Titi and Walid Abu-Dayyeh from Saudi Arabia, and Ziad Al-Rawi from Jordan). Several methods of imputation are suggested and the corresponding estimators of the population mean are considered. The bias and mean square error of each estimator are derived up to the first order. The estimators are then compared with each other and with other well-known estimators using their biases and mean square errors. It turns out that some of the new estimators are more efficient than the well-known ones. A real data example is used for illustration.

5. A Modified Regression Estimator of a Population Mean Under General Sampling Design (by Vyas Dubey from India). The author studies a generalized estimator of a population mean under any sampling design, which utilizes knowledge of the population mean and variance of an auxiliary variable. Properties of the proposed estimator are discussed and an optimum class of estimators is obtained. The proposed class of estimators is also discussed under probability proportional to size sampling. Results are supported by numerical examples.

6. Design-Based Horvitz-Thompson Variance Estimation: π-Weighted Ratio Type Estimator (by P.A. Patel and R.D. Chaudhari from India). In this article, motivated by the ratio method of estimation, a π-weighted ratio type estimator of the Horvitz-Thompson variance is suggested and is shown to be asymptotically design unbiased and consistent. An empirical study is conducted to assess its performance; several summary statistics, such as the percentage relative bias, the relative efficiency and the empirical coverage rate of the resulting confidence intervals, are computed and presented.

7. Post-Stratification in a Two-Way Deeply Stratified Population (by D. Shukla, Manish Trivedi and G. N. Singh). This paper presents an estimation strategy for the population mean of a two-way r x s deeply stratified population using the technique of post-stratification. The size of each stratum and the frame are both assumed unknown. The information known is the proportion of row and column size totals of the two-way deep stratification. A new estimator is proposed and its optimum properties are examined along with a comparison of efficiencies. An approximate expression for the mean square error (MSE) is derived for this set-up.

8. An Efficient Variant of the Product and Ratio Estimators in Stratified Random Sampling (by Housila P. Singh and Gajendra K. Vishwakarma from India). This paper introduces two classes of estimators of the population mean of the study variable using an auxiliary variable in stratified random sampling. The biases and variances of the proposed estimators are derived under a large sample approximation. Estimators based on "estimated optimum values" are also investigated. An empirical study is carried out to demonstrate the performance of the suggested estimators relative to others.

9. Estimation of Mean with Known Coefficient of Variation of an Auxiliary Variable in Two Phase Sampling (by Lakshmi N. Upadhyaya, Housila P. Singh and Ritesh Tailor from India). The authors discuss the possibility of obtaining efficient estimators of the population mean of the variable y under investigation by means of two phase sampling and with the help of two auxiliary variables, x (main auxiliary variable) and z (second auxiliary variable). Asymptotic expressions for the bias and mean squared error (MSE) of the proposed estimator are obtained. The asymptotic optimum estimator (AOE) in the family is identified together with its approximate MSE formula. A numerical illustration is given to support the study.

The third part of this issue, under the title Other Articles, contains three articles devoted to different topics:

10. Methodology and Empirical Results of the Time Use Surveys in Poland (by Ilona Błaszczak-Przybycińska from Poland). The paper presents the methodology and selected results of time use surveys in Poland, which have a long tradition in this country. Several time use surveys were conducted in the 1950s and 1960s. The first nationwide survey was carried out in 1968/1969. Nationwide time use surveys have been performed by the Central Statistical Office four times. The last one was organized in 2003/2004 within the framework of the harmonized European time use surveys.

11. Labour Flows Into and Out of Polish Agriculture: A Micro-Level Approach (by Hilary Ingham and Mike Ingham from the UK). Notwithstanding Poland's admission to the EU, agricultural restructuring and sustainable rural development remain major transition challenges confronting the country. Achieving these joint goals will necessitate major labour flows from farming into other occupations and sectors. This paper employs a multinomial logit model on Labour Force Survey data to analyse mobility in the agricultural labour market. Its major finding is that of a largely stagnant pool of farm workers, with small flows into and out of it that are insufficient to bring about the requisite change without explicit, perhaps radical, policy intervention.

12. Changes in Competitiveness and Labour Market Developments: A Comparative Analysis of Poland, Hungary and The Czech Republic (by Eugeniusz Kwiatkowski and Paweł Gajewski from Poland). The paper attempts to analyse the links between domestic and external competitiveness and labour market developments in manufacturing industry branches in three new member states of the European Union: Poland, Hungary and the Czech Republic. It was mostly the industries with deteriorating competitiveness that reduced employment. The industries in which competitiveness improved showed no clear pattern with regard to employment changes.

The Book Review part contains a review of an interesting book by Sarjinder Singh entitled Thinking Statistically: Elephants Go to School (prepared by M. Kozak from Poland).

The Reports section contains three reports on:

a) The 9th Conference on Probability Theory and Mathematical Statistics, Vilnius, Lithuania, June 25–30, 2006 (prepared by D. Krapavickaitė and A. Plikusas);
b) XXVI European Meeting of Statisticians, Toruń, Poland, July 24–28, 2006 (prepared by J. Białek);
c) The 4th Conference on Sampling Methods in Social and Economic Surveys, Katowice, Poland, September 12–13, 2006 (prepared by W. Gamrot and J. Wywiał).

The Announcement concerns the Second Baltic-Nordic Conference on Survey Sampling in Kuusamo, Finland, 2–7 June 2007, with a pre-course in Helsinki, 31 May–1 June 2007.

The issue concludes with the Acknowledgements of referees of Volume 7.

Jan Kordos
The Editor


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp.1203—1218

ATTEMPTS AT APPLYING SMALL AREA ESTIMATION METHODS IN AGRICULTURAL SAMPLE SURVEYS IN POLAND¹

Dorota Bartosińska²

ABSTRACT

The paper is devoted to applying small area estimation methods in agricultural sample surveys in Poland, using the last Census of Agriculture as an auxiliary source of data. The author briefly describes agricultural sample surveys in Poland with respect to small area estimation. Sources of additional data (mainly the census of agriculture), the estimation methods applied and the results of estimation for small areas are presented. To obtain more precise estimates of agricultural characteristics by county (poviat) from agricultural sample surveys, empirical and hierarchical Bayes estimators are used together with auxiliary information from the last census of agriculture. Two regression models are considered: an area-level model and a unit-level model. The unit-level approach required matching of particular farms in the agricultural sample surveys with the Census of Agriculture. The precision of the model-based estimates is significantly higher than that of the direct estimates. The ecological effect, which can cause estimation results to differ depending on the type of regression model, is also discussed.

Keywords: small area estimation, model-based estimation, Bayes estimation, agricultural sample survey, census of agriculture, ecological effect

1. Introduction

Small area estimation methods are applied in the USA and Canada in censuses of population, in surveys of household incomes, unemployment, poverty and enterprises, as well as in agricultural surveys (Gambino, Dick, 2000; Harvey, 2000; Marker, 2001; Schaible, 1996; Rao, 2002). For instance, these methods are used to improve the precision of estimates of agricultural characteristics by county in the USA. Research conducted in 1979 tested synthetic estimators and model-based estimators that combined current and historical information (gross time trends) to make county estimates of agricultural commodities (Ford, Bond, Carter, 1983). Administrative data, the census of agriculture and remote sensing technology are used as sources of auxiliary data in small area estimation. Mid-altitude aerial photography was the first remote sensing product used to construct area sampling frames. Satellite images are used in model-based estimation of crop acreage for counties and states (Schaible, 1996). Since 1997 small area estimation methods have also been used to estimate coverage errors for states in censuses of agriculture (Eklund, 1998). Further research concerned prediction of county crop areas using survey and satellite data (Battese, Harter, Fuller, 1988), county estimates of wheat production (Stasny, Goel, Rumsey, 1991) and model-based methods for county estimation of crop yields (Bellow, 2003).

Some experiments on the application of small area estimation methods in agriculture were also made in India, Spain, Turkey and Poland. In India attempts were made to use synthetic-ratio and composite estimators for crop acreage estimation in small areas (Tikkiwal, Ghiya, 2004). Researchers in Spain tried to estimate the total area occupied by olive trees using small area models combining satellite images and land use maps (Militino, Ugarte, Goicoa, 2004). Estimation of crop acreage for small areas was tested in Turkey using satellite images (Gebizliodlu, Dedep, Toprak, 1996).

In Poland, stratification and oversampling have been used to improve the precision of estimates of some agricultural characteristics by region (Kordos and Kursa, 1997; Kursa and Lednicki, 2006). The first experiments on the application of small area estimation methods in agricultural sample surveys using census of agriculture data as auxiliary information were carried out by Kordos and Paradysz (2000). They concerned the estimation of livestock in 1999 by region, and of livestock and crop acreage in 1998 by county, using the 1996 census of agriculture as auxiliary information. Direct and empirical Bayes estimates were compared. The second attempt was undertaken by Bartosińska (2005), and some results of that study are presented below.

¹ This is an extended version of the paper presented at the 4th Conference on Sampling Methods in Social and Economic Surveys, Katowice, Poland, September 12–13, 2006.
² Maria Curie-Skłodowska University, Lublin, Poland.


2. A brief description of agricultural sample surveys carried out by the Central Statistical Office in Poland with respect to small area estimation

The following agricultural sample surveys are carried out by the Central Statistical Office (GUS¹) in Poland:

• survey of land use, crop acreage and livestock inventory, called the June Agricultural Survey (JAS);
• survey of livestock and production of cattle, sheep and poultry;
• survey of livestock and production of pigs;
• survey of main crops' yield.

A very detailed description of the methodology of Polish agricultural sample surveys is given in Kordos and Kursa (1997) and Kursa and Lednicki (2006).

An empirical study was made on data from the JAS surveys carried out by the GUS in 1998 and 2001, that is, between the two censuses of agriculture of 1996 and 2002. In all Polish agricultural sample surveys, estimates are obtained only for the country and for regions, using unbiased direct estimators based solely on results from the sample. In this study an attempt was made to estimate parameters by county.

The data were available for the Lublin region, whose population, according to the Census of Agriculture 1996, amounted to about 300 thousand farms, i.e. 10% of the farm population in Poland (it is an agricultural region). The sample selected for the JAS 1998 comprised about 10 thousand farms from the Lublin region, about 3.2% of the population. In JAS 2001 the sample for the Lublin region comprised 5437 farms (about 1.7% of the population). The data from both surveys were grouped by county within the Lublin region.

The direct estimator gives county-level estimates with low precision. For financial reasons it was impossible to increase the sample size in order to obtain better precision of estimates by county. That is why alternative estimation methods are sought which also use data from other sources to borrow strength and improve estimation precision in sample surveys.

3. The Census of Agriculture and other potential sources of auxiliary data for small area estimation

Censuses of agriculture can be a source of data for small areas for at least four reasons:

1. Censuses, as full-scale surveys, provide information on all the population units. They make it possible to process the collected information for any cross-section.
2. In all agricultural sample surveys the same concepts, definitions and classifications are used as in the preceding census of agriculture.
3. The results of censuses of agriculture are the basis for building sampling frames in all agricultural sample surveys carried out after a census.
4. There is usually a significant correlation for most characteristics between a census of agriculture and agricultural sample surveys carried out in the following years.

So far the results of censuses of agriculture have been used in Poland as sampling frames in agricultural sample surveys and also in stratification and estimation (Pawłowska, 1969; Lednicki, 1979; Kordos, Kursa, 1997). In this study the results of the Census of Agriculture carried out by the GUS in 1996 (CA 1996) were used as a source of auxiliary data in small area estimation.

In future, registers, administrative data and remote sensing technology can also be used as potential sources of auxiliary data in small area estimation, if they are available to official statisticians.

In Poland the Statistical Register of Agricultural and Forest Holdings was built on the basis of the results of the Census of Agriculture carried out in 2002 by the GUS. It contains address data of farms and some agricultural characteristics such as agricultural area, agricultural land, forest area and forest land. It is possible to match particular farms in the register with farms in agricultural surveys carried out after 2002.

The GUS is also working on the use of administrative data in official statistics. By virtue of the law, the GUS has access to individual data from the Integrated Administration and Control System (IACS). This system was built by the Agency of Restructuring and Modernization of Agriculture after Poland's accession to the European Union, and it contains information about all farmers who applied for direct payments within the Common Agricultural Policy. The statistical register and the sampling frames in agricultural surveys are updated with IACS data.

Some experiments with using remote sensing technology in official agricultural statistics have been made in Poland. The GUS, with the assistance of specialists from the USA, tested an area frame for agricultural statistics (Skow, Wanke, 1994). By combining information derived from satellite images with data from CA 2002, a classification of agricultural production space by commune in Poland was made (Ciolkosz et al., 2004).

To combine data from different sources, times or areas, models are constructed and simulation methods are used in order to find proper estimation methods for small areas.

¹ GUS: an acronym for the Central Statistical Office of Poland.


4. The applied small area estimation methods

To obtain more precise county-level estimates in agricultural sample surveys, the Bayes approach was used, which draws on known information beyond the sample. This methodology was described in detail by Rao (2003) and also by Kubacki (2006). The empirical Bayes estimator of the total of the variable of interest $Y$ for the $d$th small area is given by (Kordos, Paradysz, 2000):

$$
y_{d,EB} = \frac{d^2(y_d)}{d^2(y_d) + d^2(y_{d,SYN,R})}\, y_{d,SYN,R} + \frac{d^2(y_{d,SYN,R})}{d^2(y_d) + d^2(y_{d,SYN,R})}\, y_d \tag{1}
$$

where:

$y_d$: direct estimator of the total of $Y$ for the $d$th small area,
$d^2(y_d)$: variance of the direct estimator of the total of $Y$ for the $d$th small area,
$y_{d,SYN,R}$: regression estimator of the total of $Y$ for the $d$th small area,
$d^2(y_{d,SYN,R})$: variance of the regression estimator of the total of $Y$ for the $d$th small area.

The direct estimator for small areas can have a large variance because of the small sample size in such areas. It serves as a component of the composite estimator and as a benchmark against which other estimators can be compared. The direct estimator of the total of the variable of interest $Y$ for the $d$th small area is given by:

$$
y_d = \sum_{i=1}^{n_d} \frac{y_{di}}{\pi_{di}} \tag{2}
$$

where:

$y_{di}$: value of the variable of interest $Y$ for the $i$th unit in the $d$th small area,
$\pi_{di}$: inclusion probability for the $i$th unit in the $d$th small area.

The regression estimator was obtained from a regression model combining data from the JAS and the CA. The dependent variable came from the JAS and the independent variables from the CA. Two regression models were considered: an area-level model and a unit-level model. In the area-level approach the county totals of the variable of interest and of the auxiliary variables were used. In the unit-level approach individual farm-level data were used.

The area-level regression model is (Kordos, Paradysz, 2000):

$$
y_d = X_d^T \beta_{area} + u + e_d, \qquad d = 1, 2, \ldots, D \tag{3}
$$

where:

$y_d$: estimate of the total of the variable of interest $Y$ for the $d$th small area,
$X_d = [X_{dj}]$: matrix of the totals of auxiliary variables for the $d$th small area,
$\beta_{area} = [\beta_j]$: vector of $k$ area-level regression parameters,
$u$: model-based random variable,
$e_d$: design-based random variable for the $d$th small area.

The unit-level regression model is (Rao, 2003):

$$
y_{di} = x_{di}^T \beta + u_d + e_{di} \tag{4}
$$

where:

$x_{di} = [x_{dij}]$: vector of $k$ auxiliary variables for the $i$th unit in the $d$th small area,
$\beta = [\beta_j]$: vector of $k$ unit-level regression parameters,
$u_d$: model-based random variable for the $d$th small area,
$e_{di}$: design-based random variable for the $i$th unit in the $d$th small area.
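As an illustration only, the direct estimator (2) and the empirical Bayes combination (1) can be sketched in a few lines of Python. The function names and all numbers below are hypothetical and do not come from the JAS or CA data; the sketch assumes that the two variances are already available.

```python
from typing import Sequence


def direct_total(values: Sequence[float], incl_probs: Sequence[float]) -> float:
    """Formula (2): sum of y_di / pi_di over the sampled units of one area."""
    if len(values) != len(incl_probs):
        raise ValueError("values and incl_probs must have the same length")
    return sum(y / p for y, p in zip(values, incl_probs))


def eb_total(y_direct: float, var_direct: float,
             y_syn: float, var_syn: float) -> float:
    """Formula (1): each estimator is weighted by the other's variance."""
    total_var = var_direct + var_syn
    weight_direct = var_syn / total_var   # weight of the direct estimator
    weight_syn = var_direct / total_var   # weight of the regression estimator
    return weight_direct * y_direct + weight_syn * y_syn


# Hypothetical area with three sampled farms (not real survey data).
y_d = direct_total([2.0, 5.0, 3.0], [0.5, 0.25, 0.125])
y_eb = eb_total(y_d, var_direct=9.0, y_syn=40.0, var_syn=1.0)
print(y_d, y_eb)
```

In this toy case the direct estimate is much less precise than the regression estimate, so the combined value is pulled strongly towards the regression estimate.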

5. Matching of farms in the JAS with the CA 1996

The unit-level regression model required matching of particular farms in the agricultural sample surveys with the Census of Agriculture. The databases were matched on the basis of the numbers of the region, county, commune, kind of commune, census district and farm within a census district. The matching was very successful: data for almost all the sampled farms in the Lublin region were matched. In JAS 1998, data were matched for 9995 farms in the Lublin region, that is 99.8% of the farms selected for the sample. In JAS 2001, data were matched for 5417 farms, that is 99.6% of the farms selected for the sample in the Lublin region.
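A minimal sketch of this kind of record linkage on a composite key is given below. The field names and the toy records are assumptions made for illustration; the actual layout of the JAS and CA files is not described in the paper.

```python
# Hypothetical field names standing in for the six identifiers listed above.
KEY_FIELDS = ("region", "county", "commune", "commune_kind",
              "census_district", "farm_no")


def match_farms(survey, census):
    """Pair each survey record with the census record sharing its composite key.

    Returns the list of (survey_record, census_record) pairs and the match
    rate as a percentage of the survey records.
    """
    census_by_key = {tuple(rec[f] for f in KEY_FIELDS): rec for rec in census}
    matched = []
    for rec in survey:
        key = tuple(rec[f] for f in KEY_FIELDS)
        if key in census_by_key:
            matched.append((rec, census_by_key[key]))
    rate = 100.0 * len(matched) / len(survey) if survey else 0.0
    return matched, rate


def _farm(farm_no, cows):
    # Toy record: all identifiers except the farm number are held fixed.
    return {"region": 6, "county": 1, "commune": 2, "commune_kind": 2,
            "census_district": 15, "farm_no": farm_no, "cows": cows}


census_records = [_farm(1, 4), _farm(2, 0), _farm(3, 7)]
survey_records = [_farm(1, 5), _farm(3, 6), _farm(9, 2)]  # farm 9 has no census record
pairs, match_rate = match_farms(survey_records, census_records)
print(len(pairs), round(match_rate, 1))
```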

6. Estimation results

The totals of four agricultural characteristics were estimated by county in the Lublin region: number of cows, number of pigs, crop acreage of sugar beet and crop acreage of rape. These characteristics differ from each other in variation and frequency of occurrence. The precision of county estimates obtained with the direct estimator, which is based only on the sample survey results, is measured by the coefficient of variation (CV) of the estimator and is given in Table 1. CVs ranged between 4.4 and 45.0% for the two characteristics with smaller variation (numbers of cows and pigs), and between 5.8 and 99.6% for the two characteristics with larger variation (crop acreage of sugar beet and rape).


Table 1. Minimum, average and maximum coefficients of variation of direct estimates by county in the Lublin region in JAS 1998 and 2001 (in percent)

Variable of interest          Year    Min     Average   Max
Number of cows                1998     4.4    11.8      17.9
Number of cows                2001     7.7    14.7      25.5
Number of pigs                1998     7.1    14.1      28.0
Number of pigs                2001    10.8    19.2      45.0
Crop acreage of sugar beet    1998     5.8    29.8      72.3
Crop acreage of sugar beet    2001    12.3    37.4      99.0
Crop acreage of rape          1998    10.2    33.8      96.4
Crop acreage of rape          2001     9.7    40.4      99.6

Source: own calculations based on data from the GUS.

To improve the precision of county-level estimates from the survey, data from the Census of Agriculture were used. The total area of agricultural land, the cattle stock and the variable of interest from the CA were considered as auxiliary variables.

6.1. Empirical Bayes estimates (using area-level regression model)

The first analysis was performed on area-level data (the totals by county). The strongest positive correlations were between the variable of interest from the JAS and the same variable from the CA. The correlation coefficients exceeded 0.6 and were statistically significant (p < 0.05). Thus, the values recorded in the preceding census were significantly related to the values of the variable of interest in sample surveys carried out after that census.

Area-level regression models were fitted for all the variables of interest. The JAS variables were used as dependent variables, and the corresponding variables of interest from the CA were included as independent variables. Other potential independent variables were either weakly correlated with the variable of interest or strongly correlated with other independent variables, and so they had to be removed from the regression models. For example, for the number of cows in JAS 1998 the area-level regression model was as follows (t statistics in parentheses):

$$
\hat{y}_{d98} = \underset{(0.3)}{0.020} + \underset{(25.7)}{1.046}\, X_{d96}, \qquad R^2 = 0.974
$$

where:

$y_{d98}$: direct estimate of the number of cows for the $d$th county from the JAS 1998 (in thousands of items),
$X_{d96}$: the number of cows for the $d$th county according to the CA 1996 (in thousands of items).

The above area-level regression models were used in empirical Bayes estimation. Figure 1 presents the coefficients of variation of the direct estimates and of the EB estimates for one example variable: the number of pigs in JAS 2001.

Figure 1. Coefficients of variation of direct estimates and empirical Bayes estimates (using area-level regression model) for number of pigs by county in the Lublin region in JAS 2001

[Figure: CV (percent), on a scale from 0 to 50, plotted against the number of county (1 to 20); two series: Direct and EB.]

Source: based on the author's own calculations
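As a rough illustration of how an area-level model of this kind is fitted and then used for prediction, the sketch below implements ordinary least squares for one covariate in plain Python. The county totals in `x_ca` and `y_jas` are invented for the example; only the coefficients 0.020 and 1.046 are taken from the model for the number of cows reported in the text.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical county totals (thousands): census values and survey estimates.
x_ca = [10.0, 20.0, 30.0, 40.0]
y_jas = [10.5, 21.0, 31.5, 42.0]
intercept, slope, r2 = fit_line(x_ca, y_jas)


def predict_cows_1998(cows_1996: float) -> float:
    """Prediction from the reported model: 0.020 + 1.046 * X_d96 (thousands)."""
    return 0.020 + 1.046 * cows_1996


# A hypothetical county with 20 thousand cows in the CA 1996.
prediction = predict_cows_1998(20.0)
print(round(slope, 3), round(prediction, 2))
```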

With the EB approach (using the area-level regression model), precision improved for the totals of all the variables of interest in JAS 1998 and 2001 for all counties in the Lublin region. Table 2 presents the minimum, average and maximum CVs of the empirical Bayes estimates for counties in the Lublin region. It also shows by how many percentage points on average the estimation precision improved in comparison with direct estimation ($CV_{Direct} - CV_{EB}$).


Table 2. Minimum, average and maximum coefficients of variation of empirical Bayes estimates (using area-level regression model) for counties in the Lublin region in JAS 1998 and 2001

Variable of interest          Year    Min (%)   Average (%)   Max (%)   Improvement of average precision (% points)
Number of cows                1998     1.6        2.8           7.1       9.0
Number of cows                2001     3.8        5.4           9.1       9.3
Number of pigs                1998     2.4        4.4           7.3       9.7
Number of pigs                2001     5.7        7.9          11.7      11.3
Crop acreage of sugar beet    1998     3.6       25.5          73.0       4.3
Crop acreage of sugar beet    2001     5.6       21.8          77.0      15.6
Crop acreage of rape          1998    10.2       27.6          83.8       6.2
Crop acreage of rape          2001     9.4       40.4          96.6      21.8

Source: own calculations based on data from the GUS.
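The two precision measures used in the tables can be written out as a short sketch. It assumes the usual definition of the coefficient of variation of an estimator (standard error divided by the estimate, in percent); the function names are illustrative, and the only figures taken from the paper are the average CVs for the number of cows in JAS 1998 (11.8% in Table 1 and 2.8% in Table 2).

```python
import math


def cv_percent(estimate: float, variance: float) -> float:
    """Coefficient of variation of an estimator, in percent."""
    return 100.0 * math.sqrt(variance) / estimate


def improvement_points(cv_direct: float, cv_model: float) -> float:
    """Gain in precision, in percentage points: direct CV minus model-based CV."""
    return cv_direct - cv_model


# Hypothetical estimate of 200 with variance 400, i.e. a standard error of 20.
example_cv = cv_percent(200.0, 400.0)

# Average CVs for the number of cows in JAS 1998 (Tables 1 and 2).
gain_cows_1998 = improvement_points(11.8, 2.8)
print(example_cv, round(gain_cows_1998, 1))
```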

When area-level covariates from CA 1996 were used, the CVs of the EB estimates of the number of cows and the number of pigs did not exceed 12% for any county in the Lublin region. They were significantly lower than the CVs of the direct estimates. In JAS 1998, after the EB estimator and area-level covariates from the CA were used, the average CV of the county estimates fell by 9.0 percentage points for the number of cows and by 9.7 percentage points for the number of pigs. In JAS 2001 the corresponding improvements were 9.3 and 11.3 percentage points.

The precision of the EB estimates was higher for variables with a higher frequency of occurrence and lower variation. Cows and pigs were kept on every second farm in the Lublin region, whereas sugar beet was cultivated on every tenth farm and rape on every hundredth. The CVs of the EB estimates of the numbers of cows and pigs were lower than those of the crop acreage of sugar beet and rape. The EB estimates of the crop acreage of these two plants were not always precise (the CVs of the EB estimates reached a maximum of 96.6% for counties).

6.2. Empirical Bayes estimates (using unit-level regression model)

In order to improve estimation precision by county, unit-level regression models were also used. The model was fitted to the matched unit-level data for particular farms from the JAS and the CA. The observations were weighted by the reciprocal of the inclusion probability from the sample survey. Unit-level data for all the variables of interest from the JAS were positively and most strongly correlated with their counterparts from the CA, and these counterparts were taken as independent variables in the unit-level regression models. For example, the unit-level regression model for the number of cows in JAS 1998 was as follows (t statistics in parentheses):

$$
\hat{y}_{di98} = \underset{(85.4)}{0.0002} + \underset{(688.8)}{0.883}\, x_{di96}, \qquad R^2 = 0.618
$$

where:

$y_{di98}$: number of cows for the $i$th farm in the $d$th county according to JAS 1998 (in thousands of items),
$x_{di96}$: number of cows for the $i$th farm in the $d$th county according to CA 1996 (in thousands of items).

The CVs of the EB estimates (using the regression model combining unit-level data from the JAS and the CA) for the counties in the Lublin region in JAS 1998 and 2001 were smaller than the CVs of the direct estimates. Table 3 presents the minimum, average and maximum CVs of the EB estimates (using the unit-level regression model). The table also shows by how many percentage points the average estimation precision improved after using the EB approach (with the unit-level regression model) in comparison with direct estimation.

Table 3. Minimum, average and maximum coefficients of variation of empirical Bayes estimates (using unit-level regression model) for counties in the Lublin region in JAS 1998 and 2001

Variable of interest          Year    Min (%)   Average (%)   Max (%)   Improvement of average precision (% points)
Number of cows                1998     1.4        1.8           2.6      10.0
Number of cows                2001     3.3        4.6           7.3      10.1
Number of pigs                1998     1.1        2.2           3.4      13.4
Number of pigs                2001     2.9        5.5           8.3      16.7
Crop acreage of sugar beet    1998     1.3       17.4          59.2      12.4
Crop acreage of sugar beet    2001     1.6       17.5          76.2      19.9
Crop acreage of rape          1998     3.3        9.8          19.3      24.0
Crop acreage of rape          2001     8.7       14.1          22.1      26.3

Source: own calculations based on data from the GUS.

When the unit-level covariate from CA 1996 was used, the CVs of the EB estimates of the number of cows for the counties in the Lublin region were on average 1.8% in 1998 and 4.6% in 2001 (for the number of pigs, 2.2% and 5.5% respectively). In comparison with direct estimation, the average estimation precision after using the EB approach (with unit-level regression) improved by between 10 percentage points for the number of cows and 26.3 percentage points for the crop acreage of rape. However, the relative improvement in precision (in relation to the average coefficient of variation of the direct estimates) was smaller for the crop acreage of rape than for the number of cows in the same year.
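The weighting of observations by the reciprocal of the inclusion probability, as used for the unit-level models in this subsection, can be sketched as a weighted least squares fit with one covariate. This is an illustrative sketch only: the function name and the farm-level numbers are hypothetical, and the software actually used for fitting is not stated in the paper.

```python
def weighted_fit(x, y, weights):
    """Weighted least squares for y = intercept + slope * x."""
    total_w = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total_w
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total_w
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical matched farms: census value x, survey value y, inclusion probability.
x_census = [0.0, 1.0, 2.0, 4.0]
y_survey = [0.5, 1.4, 2.3, 4.1]          # lies exactly on y = 0.5 + 0.9 x
incl_probs = [0.5, 0.25, 0.25, 0.125]

# Each observation is weighted by the reciprocal of its inclusion probability.
design_weights = [1.0 / p for p in incl_probs]
wls_intercept, wls_slope = weighted_fit(x_census, y_survey, design_weights)
print(round(wls_intercept, 3), round(wls_slope, 3))
```

Because the toy data lie exactly on a line, the weights do not change the fitted coefficients here; with real survey data, farms with small inclusion probabilities would carry more weight in the fit.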


6.3. Hierarchical Bayes estimates (using area-level regression model)

Area-level regression models were also used in the hierarchical Bayes (HB) estimator. The theoretical models were specified in the language of the WinBUGS software (Kubacki, 2006). Normal distributions were assumed for the regression parameters of the models; these were the starting point for simulation by the MCMC method. The initial values for the simulation were selected automatically by the software. Ten thousand iterations were performed. As a result, HB estimates were obtained together with estimates of their variances.

To obtain HB estimates of the totals of the variables of interest for counties, the same area-level regression models were used as for the empirical Bayes estimates. Table 4 presents the minimum, average and maximum coefficients of variation of the HB estimates (using area covariates from the CA) for the counties in the Lublin region. The table also shows by how many percentage points the average estimation precision improved after using this method in comparison with direct estimation and with empirical Bayes estimation.

For all the counties in the Lublin region, the precision of estimation improved as a result of using hierarchical Bayes estimation with area covariates from the CA 1996. Compared with direct estimation, the average estimation precision rose by over 9 percentage points. In comparison with EB estimation (using area covariates from the census), estimation precision improved on average by 0.4 percentage points or more, depending on the variable.

Table 4. Minimum, average and maximum coefficients of variation of hierarchical Bayes estimates (using area-level regression model) for counties in the Lublin region in JAS 1998 and 2001

                                                                Improvement of average precision
                                                                (% points) compared to:
Variable of interest          Year   Min (%)   Average (%)   Max (%)   Direct   EB (area-level data)
Number of cows                1998     1.8        2.4          5.0       9.4        0.4
                              2001     2.8        3.7          7.8      11.0        1.7
Number of pigs                1998     2.3        3.1          5.5      11.0        1.3
                              2001     3.8        5.2          8.3      14.0        2.7
Crop acreage of sugar beet    1998     0.5        1.1          3.3      28.7       24.4
                              2001     4.7       16.7         83.7      20.7        5.1
Crop acreage of rape          1998     5.5        7.9          9.9      25.9       19.7
                              2001     6.8       15.5         68.9      24.9        3.1

Source: own calculations based on data from the GUS.


7. The ecological effect in small area estimation

The empirical Bayes estimator combining area-level data from JAS and CA gave different estimation results by county than the same estimator combining unit-level data from the same sources. Empirical Bayes estimates using the area-level regression model were usually less precise than EB estimates using the unit-level regression model. Both EB estimates and synthetic regression estimates differed depending on the model type. In Figure 2, synthetic regression estimates of the totals derived in two different ways are compared, taking the number of pigs for counties in JAS 1998 as an example. The unit covariate prediction is plotted on the horizontal axis (abscissa) and the area covariate prediction on the vertical axis (ordinate). The two sets of estimates, although highly correlated (correlation coefficient 0.99), differ considerably from each other. If the estimates were equal, the points on the chart would lie on a straight line through the origin with a 45 degree slope.

Figure 2. Prediction of number of pigs in JAS 1998 for counties in the Lublin region (using the data from the CA 1996) by regression model type (in thousands of items)

[Scatter plot: area covariate prediction (vertical axis, 0 to 220) against unit covariate prediction (horizontal axis, 0 to 220). Fitted line: Area covariate prediction = -13.7496 + 1.1699 * Unit covariate prediction; R2 = 0.9806]

Source: based on the author’s own calculations


Both sets of estimates were compared to the direct ones, and mean absolute percentage errors (MAPE) were calculated. Synthetic area-level regression estimates of the number of pigs for counties in 1998 differed from direct estimates by ±10.0% on average. Synthetic unit-level regression estimates differed from direct estimates by ±19.3% on average. The area-level regression model thus produces small area estimates which are closer to the direct estimates than those from the unit-level model. This is caused by the so-called ecological effect or ecological fallacy, i.e. the fact that regression parameters and correlation coefficients can be quite different for unit-level data and for data aggregated at area level (Heady et al., 1999). A comparison of slope parameters and correlation coefficients for all variables of interest is presented in Table 5.
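The MAPE comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name `mape`, the choice of the direct estimate as the denominator, and the county figures are assumptions made for the example, not values from JAS or CA.

```python
import numpy as np

def mape(estimates, direct):
    """Mean absolute percentage error of `estimates` relative to `direct`.

    Assumption: the direct estimate is used as the denominator.
    """
    estimates = np.asarray(estimates, dtype=float)
    direct = np.asarray(direct, dtype=float)
    return 100.0 * np.mean(np.abs(estimates - direct) / np.abs(direct))

# Hypothetical county totals (thousands of pigs), invented for illustration.
direct = np.array([100.0, 50.0, 200.0, 80.0])
synthetic = np.array([110.0, 45.0, 190.0, 88.0])

print(mape(synthetic, direct))  # average relative deviation in percent
```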

Table 5. Comparison of slope parameters and correlation coefficients for area-level data and unit-level data (between JAS and CA)

                                      Slope parameters            Correlation coefficients
Variable of interest          Year    Area-level   Unit-level     Area-level   Unit-level
                                      data         data           data         data
Number of cows                1998    1.046        0.883          0.989        0.786
                              2001    0.826        0.900          0.936        0.685
Number of pigs                1998    1.222        0.902          0.982        0.754
                              2001    0.879        0.824          0.891        0.648
Crop acreage of sugar beet    1998    0.919        0.800          0.973        0.747
                              2001    0.603        0.674          0.947        0.565
Crop acreage of rape          1998    1.518        1.051          0.672        0.425
                              2001    1.140        0.386          0.875        0.313

Source: own calculations based on data from the GUS.

All slope parameters in the area-level regression models differed (were larger or smaller) from the slope parameters in the corresponding unit-level regression models, whereas the correlation coefficients calculated for area-level data were always higher than for the corresponding unit-level data. If a regression model is fitted on unit-level data and area-level data are used in prediction, the ecological effect occurs. To avoid the ecological effect, Holt proposed the unit-and-area-level regression model, given by (EUROSTAT, 2000):

$$y_{id} = \alpha + (x_{id} - \bar{x}_d)^T \beta_{unit} + (\bar{X}_d - \bar{X})^T \beta_{area} + u_d + e_{id}, \qquad (5)$$

where:

$\beta_{unit}$ is the matrix of unit-level regression parameters (within areas),

$\beta_{area}$ is the matrix of regression parameters between area means.


This model combines unit-level regression parameters and area-level ones.

The $\beta_{unit}$ parameters in equation (5) relate to within-area variation and they differ from the $\beta_{unit}$ parameters in equation (4), which relate to total variation. The $x_{id}$ values in equation (5) are centred on the local mean, which implies that the systematic difference between areas is accounted for by the $\beta_{area}$ parameters (EUROSTAT, 2000). The unit-and-area-level regression model can be fitted only when it is possible to match the same units in a survey and in an auxiliary data source. A unit-and-area-level regression of the number of pigs in JAS 1998 on the number of pigs in CA 1996 was fitted ($b_{unit} = 0.892$; $b_{area} = 1.482$). Synthetic unit-and-area-level regression estimates of the number of pigs for counties in 1998 differed from direct estimates by ±12.4% on average, about 6.9 percentage points less than the synthetic unit-level regression estimates. In small area estimation, the unit-and-area-level regression model therefore gave better results than the unit-level regression model.
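To make the separation of within-area and between-area slopes concrete, the following Python sketch fits the fixed part of a model of the form (5) by ordinary least squares with a single covariate. It is a simplified illustration under stated assumptions: the area effect $u_d$ is ignored, sample means stand in for the area and overall means, and the function name `fit_unit_and_area` and the data are hypothetical (noise-free and constructed so that the true slopes are 0.9 within areas and 1.5 between areas).

```python
import numpy as np

def fit_unit_and_area(y, x, area):
    """OLS fit of y = alpha + (x - xbar_d) * b_unit + (xbar_d - xbar) * b_area.

    Simplification: the random area effect u_d is not modelled.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    _, idx = np.unique(np.asarray(area), return_inverse=True)
    area_means = np.bincount(idx, weights=x) / np.bincount(idx)
    xbar_d = area_means[idx]           # local mean attached to each unit
    within = x - xbar_d                # deviation from the local mean
    between = xbar_d - x.mean()        # deviation of the local mean from the overall mean
    design = np.column_stack([np.ones_like(x), within, between])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef                        # alpha, b_unit, b_area

# Hypothetical data: three areas with three matched units each.
area = np.repeat([0, 1, 2], 3)
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0, 11.0, 15.0])
xbar_d = np.array([2.0, 6.0, 12.0])[area]
y = 5.0 + 0.9 * (x - xbar_d) + 1.5 * (xbar_d - x.mean())

alpha, b_unit, b_area = fit_unit_and_area(y, x, area)
pooled_slope = np.polyfit(x, y, 1)[0]  # single slope that mixes both sources of variation
print(alpha, b_unit, b_area, pooled_slope)
```

In this constructed example the pooled slope falls between the within-area and between-area slopes, which is the kind of discrepancy the ecological effect refers to.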

8. Conclusions

The presented study suggests the following conclusions:
1. The application of small area estimation methods using the census of agriculture as an auxiliary data source significantly improves the precision of estimates by county in comparison with direct estimates.
2. There is a strong positive correlation between the same variables from the census of agriculture and from agricultural sample surveys carried out after the census.
3. It is possible to match particular farms from the sample survey with the last census of agriculture.
4. It is possible to obtain quite different results from fitting the unit-level regression model compared to the area-level regression model due to the ecological effect.
5. It is advisable to use the unit-and-area-level regression model for small area estimation in agricultural sample surveys.
6. Estimation results for variables with a low frequency of occurrence were not satisfactory.
The study undertaken is only a first step in improving the precision of estimates by county obtained from agricultural sample surveys using census of agriculture data. Further research is needed to test other small area estimation methods, such as GREG or EBLUP, as well as the use of other potential sources of auxiliary data. Finally, the problem of the ecological effect and its influence on estimation results also requires further investigation.


REFERENCES

BARTOSIŃSKA, D. (2005), Small Area Estimation Methods in Agricultural Sample Surveys (using the Data of the Census of Agriculture), Warsaw School of Economics, (mimeo of doctoral dissertation in Polish).
BATTESE, G.E., HARTER, R.M., FULLER, W.A. (1988), An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data, Journal of American Statistical Association, Vol. 83, No. 401, pp. 28–36.
BELLOW, M.E. (2003), Comparison of Model Based Methods for County Level Estimation of Crop Yields, FCSM Conference Papers, pp. 39–44.
CIOŁKOSZ, A., KULIKOWSKI, R., FILIPIAK, K., BIELECKA, E. (2004), The Agricultural Production Space in Poland at the Beginning of 21st Century and its Characterization, Statistics in Transition, Vol. 6, No. 6, pp. 899–912.
EKLUND, B. (1998), Small Area Estimation of Coverage Error for the 1997 Census of Agriculture. In: Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 335–338.
EUROSTAT (2000), Development and Evaluation of a Practical System of Model-based Small Area Estimation, Vol. I, Research Report.
FORD, B., BOND, D., CARTER, N. (1983), Research into Small Area Estimation at the U.S. Department of Agriculture. In: Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 232–237.
GAMBINO, J., DICK, P. (2000), Small Area Estimation Practice at Statistics Canada, Statistics in Transition, Vol. 4, No. 4, pp. 597–610.
GEBIZLIODLU, O.L., DEDEP, H., TOPRAK, A.O. (1996), Small Area Estimation in Turkey.
HARVEY, J. (2000), Small Area Population Estimation Using Satellite Imagery, Statistics in Transition, Vol. 4, No. 4, pp. 611–633.
HEADY, P., CLARKE, P., BROWN, G., D’AMORE, A., MITCHELL, B. (1999), Small Area Estimates Derived from Surveys: ONS Central Research and Development Program, Riga.
KORDOS, J., KURSA, L. (1997), Agricultural Sample Surveys in Poland, Statistics in Transition, Vol. 3, No. 1, June 1997, pp. 75–108.
KORDOS, J., PARADYSZ, J. (2000), Some Experiments in Small Area Estimation in Poland, Statistics in Transition, Vol. 4, No. 4, pp. 679–697.
KUBACKI, J. (2006), Remarks on Using the Polish LFS Data and SAE Methods for Unemployment Estimation by County, Statistics in Transition, Vol. 7, No. 4, pp. 901–916.
KURSA, L., LEDNICKI, B. (2006), The Agricultural Sample Surveys in Poland in Transition Period, Statistics in Transition, Vol. 7, No. 5, pp. 981–1008.
LEDNICKI, B. (1979), Using of Sampling in Quarterly Surveys of Livestock Inventory. In: Sample surveys methodology in the Central Statistical Office, BWS, Vol. 29, GUS, Warszawa, pp. 35–38 (in Polish).
MARKER, D.A. (2001), Producing small area estimates from national surveys: method for minimizing use of indirect estimates, Survey Methodology, December 2001, Vol. 27, No. 2, pp. 183–188.
MILITINO, A.F., UGARTE, M.D., GOICOA, T. (2004), Using Small Area Models to Estimate the Total Area Occupied by Olive Trees.
PAWŁOWSKA, J. (1969), Efficiency Measurement of Sample Designs and Estimation Methods in Livestock Surveys. In: Application of Mathematical Methods in Statistics, BWS, Vol. 7, GUS, Warszawa (in Polish).
RAO, J.N.K. (2002), Small Area Estimation with Applications to Agriculture. In: Proceedings of the Conference on Agricultural and Environmental Statistical Application in Rome, Vol. III, pp. 555–564.
RAO, J.N.K. (2003), Small Area Estimation, John Wiley & Sons, New Jersey.
SCHAIBLE, W.L. (1996), Indirect Estimators in U.S. Federal Programs, Lecture Notes in Statistics, No. 108, New York, Springer-Verlag.
SKOW, D.M., WANKE, H. (1994), Testing an Area Frame for Agricultural Statistics in Poland, Statistics in Transition, Vol. 1, No. 6, pp. 797–810.
STASNY, E.A., GOEL, P.K., RUMSEY, D.J. (1991), County Estimates of Wheat Production, Survey Methodology, Vol. 17, No. 2, pp. 211–225.
TIKKIWAL, G.C., GHIYA, A. (2004), A Generalized Class of Composite Estimators with Application to Crop Acreage Estimation for Small Domains, Statistics in Transition, Vol. 6, No. 5, pp. 697–711.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1219—1228

THE ESTIMATION OF THE MEAN SQUARED ERROR OF MODEL-BASED ESTIMATORS

Wojciech Rabiega1

ABSTRACT

Model-based methods of small area estimation have received a lot of attention because they make specific allowance for local variation through complex error structures in the models that link the small areas. Efficient indirect estimators can be obtained under the assumed models. Models can be validated from the sample data. Stable area-specific measures of variability associated with the estimates can be obtained, unlike the overall measures for synthetic and composite estimators.

Keywords: small area statistics, best linear unbiased prediction, empirical Bayes estimator, jackknife method.

1. Best Linear Unbiased Prediction (BLUP) estimator

Small area estimation has become a topic of growing importance in recent years. Traditional survey methods fail to provide reliable estimates due to small sample sizes. This makes it necessary to use model-based estimators for small area estimation. These procedures are based on a random effects model. In this article, two model-based methods are introduced: empirical best linear unbiased prediction (EBLUP) and empirical Bayes (EB). Suppose there are $m$ local areas labeled $i = 1,\dots,m$. The model has two components:

(a) The direct estimator $\hat{\theta}_i$ of the local area parameters $\theta_i$. We assume:

$$\hat{\theta}_i = \theta_i + e_i, \quad i = 1,\dots,m; \qquad (1.1)$$

1 Warsaw School of Economics Warsaw, Poland; E-mail address: [email protected]

1220 W. Rabiega: The estimation of the mean…

where the sampling errors $e_i$ are assumed to be independent across areas $i$ with mean 0 and variances $V_i$. For simplicity, it is assumed that the $V_i$ are known. Suppose also that auxiliary data $x_i = [x_{i1},\dots,x_{ip}]^T$ are available for the $i$-th local area, and assume:

(b) A linking model that relates the $\theta_i$'s to area-level variables $x_i$:

$$\theta_i = x_i^T b + u_i = x_{i1} b_1 + \dots + x_{ip} b_p + u_i; \qquad (1.2)$$

where $b$ is the vector of unknown regression coefficients, while $u_i$ are random small area effects, $u_i \sim (0, \sigma_u^2)$. The parameter $\sigma_u^2$ is a measure of homogeneity of the areas after accounting for the covariates $x_i$. For simplicity, it may be assumed that:

$$e_i \mid \theta_i \sim N(0, V_i), \qquad u_i \sim N(0, \sigma_u^2).$$

Furthermore, we assume that $u_i$ and $e_i$ are stochastically independent. We denote $X = [x_1,\dots,x_m]^T$, where $x_i = [x_{i1},\dots,x_{ip}]^T$ for $i = 1,\dots,m$. Throughout, we assume that $r(X) = p < m$. From (1.1) and (1.2) we have

$$\hat{\theta}_i = x_i^T b + u_i + e_i, \quad i = 1,\dots,m; \qquad (1.3)$$

or equivalently

$$\hat{\Theta} = X b + u + e, \qquad (1.4)$$

where $\hat{\Theta} = [\hat{\theta}_1,\dots,\hat{\theta}_m]^T$, while $u$ and $e$ are $m$-dimensional random vectors. Model (1.3) has the form of a linear mixed effects model with fixed effects $b$, random small area effects $u_i$ and sampling errors $e_i$.

First assume $\sigma_u^2$ is known. Then the best linear unbiased predictor of $\theta_i$ is given by (Rao, 2000, 2003; Prasad, Rao, 1990):

$$\tilde{\theta}_i^H = \tilde{\theta}_i^H(\sigma_u^2) = x_i^T \tilde{b} + \chi_i (\hat{\theta}_i - x_i^T \tilde{b}) = \qquad (1.5)$$

$$= \chi_i \hat{\theta}_i + (1 - \chi_i) x_i^T \tilde{b}, \qquad (1.6)$$

$$\chi_i = \frac{\sigma_u^2}{\sigma_u^2 + V_i}; \qquad (1.7)$$

where $\tilde{b}$ is the weighted least squares estimator of $b$ with weights $1/(\sigma_u^2 + V_i)$, obtained by regressing $\hat{\theta}_i$ on $x_i$:

$$\tilde{b} = (X^T D X)^{-1} X^T D \hat{\Theta} = \qquad (1.8)$$

$$= \left( \sum_{i=1}^m \chi_i x_i x_i^T \right)^{-1} \left( \sum_{i=1}^m \chi_i x_i \hat{\theta}_i \right), \qquad (1.9)$$

$$D = \mathrm{Diag}\left( (\sigma_u^2 + V_1)^{-1},\dots,(\sigma_u^2 + V_m)^{-1} \right), \qquad (1.10)$$

while $\hat{\Theta} = [\hat{\theta}_1,\dots,\hat{\theta}_m]^T$.

Even without normality, the estimator given in (1.6) is the best linear unbiased predictor of $\theta_i$. The BLUP estimator is a weighted combination of the direct estimator $\hat{\theta}_i$ and the regression synthetic estimator $x_i^T \tilde{b}$ with weights $\chi_i$ and $(1 - \chi_i)$ respectively. The BLUP estimator gives more weight to $\hat{\theta}_i$ when the sampling variance $V_i$ is small (or $\sigma_u^2$ is large) and moves towards the regression synthetic estimator as $V_i$ increases (or $\sigma_u^2$ decreases).
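The BLUP with known model variance can be sketched in Python as follows. This is an illustrative sketch rather than the author's implementation: the function name `blup` and the numbers in the example are assumptions, and the model variance is treated as known, as in (1.5) to (1.10).

```python
import numpy as np

def blup(theta_hat, X, V, sigma_u2):
    """BLUP of the area parameters for a known model variance sigma_u2."""
    theta_hat, X, V = (np.asarray(a, dtype=float) for a in (theta_hat, X, V))
    d = 1.0 / (sigma_u2 + V)                  # diagonal of D, eq. (1.10)
    XtD = X.T * d                             # X^T D without forming D explicitly
    b_tilde = np.linalg.solve(XtD @ X, XtD @ theta_hat)   # WLS estimator, eq. (1.8)
    chi = sigma_u2 * d                        # shrinkage weights, eq. (1.7)
    synthetic = X @ b_tilde                   # regression synthetic estimator
    return chi * theta_hat + (1.0 - chi) * synthetic, chi, b_tilde  # eq. (1.6)

# Hypothetical example: three areas, intercept-only model, equal sampling variances.
theta_hat = np.array([1.0, 2.0, 3.0])
X = np.ones((3, 1))
V = np.array([1.0, 1.0, 1.0])

estimates, chi, b_tilde = blup(theta_hat, X, V, sigma_u2=1.0)
print(estimates)  # each direct estimate is pulled halfway towards the common mean
```

With equal model and sampling variances the weights are 0.5, so every direct estimate moves halfway towards the synthetic value.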

1.1. MSE of BLUP.

Theorem 1. The mean squared error of the BLUP estimator $\tilde{\theta}_i^H(\sigma_u^2)$ is given by:

$$MSE(\tilde{\theta}_i^H(\sigma_u^2)) = g_{1i}(\sigma_u^2) + g_{2i}(\sigma_u^2), \qquad (1.11)$$

where

$$g_{1i}(\sigma_u^2) = \chi_i V_i, \qquad (1.12)$$

$$g_{2i}(\sigma_u^2) = (1 - \chi_i)^2 x_i^T (X^T D X)^{-1} x_i, \qquad (1.13)$$

or equivalently

$$g_{2i}(\sigma_u^2) = (1 - \chi_i)^2 x_i^T \left( \sum_{j=1}^m \frac{x_j x_j^T}{\sigma_u^2 + V_j} \right)^{-1} x_i. \qquad (1.14)$$

Proof. The BLUP estimator $\tilde{\theta}_i^H$ may be expressed as:

$$\tilde{\theta}_i^H(\sigma_u^2) = \tilde{\theta}_i^H(b, \sigma_u^2) + d_i^T (\tilde{b} - b), \qquad (1.15)$$

where $\tilde{\theta}_i^H(b, \sigma_u^2)$ is the BLUP estimator when $b$ is known:

$$\tilde{\theta}_i^H(b, \sigma_u^2) = x_i^T b + \chi_i (\hat{\theta}_i - x_i^T b) \qquad (1.16)$$

and

$$d_i^T = (1 - \chi_i) x_i^T. \qquad (1.17)$$

It follows that $\tilde{\theta}_i^H(b, \sigma_u^2) - \theta_i$ and $d_i^T(\tilde{b} - b)$ are uncorrelated (see: Rao, 2000), therefore:

$$MSE(\tilde{\theta}_i^H(\sigma_u^2)) = MSE(\tilde{\theta}_i^H(b, \sigma_u^2)) + D^2(d_i^T(\tilde{b} - b)). \qquad (1.18)$$

For $MSE(\tilde{\theta}_i^H(b, \sigma_u^2))$ we have:

$$MSE(\tilde{\theta}_i^H(b, \sigma_u^2)) = E\left( x_i^T b + \chi_i(\hat{\theta}_i - x_i^T b) - \theta_i \right)^2 = E\left( \theta_i - u_i + \chi_i(u_i + e_i) - \theta_i \right)^2 =$$

$$= E\left( \chi_i^2 e_i^2 + 2\chi_i(\chi_i - 1) e_i u_i + (\chi_i - 1)^2 u_i^2 \right) = \chi_i^2 V_i + (\chi_i - 1)^2 \sigma_u^2 = \chi_i V_i = g_{1i}(\sigma_u^2).$$

To calculate the variance $D^2(d_i^T(\tilde{b} - b))$ we first calculate the expectation $E(x_i^T \tilde{b})$:

$$E(x_i^T \tilde{b}) = E\left( x_i^T (X^T D X)^{-1} X^T D \hat{\Theta} \right) = E\left( x_i^T (X^T D X)^{-1} X^T D (Xb + u + e) \right) =$$

$$= E\left( x_i^T b + x_i^T (X^T D X)^{-1} X^T D (u + e) \right) = x_i^T b.$$

For any random variable $W$ we have $D^2(W - EW) = D^2 W$, therefore:

$$D^2(d_i^T(\tilde{b} - b)) = (1 - \chi_i)^2 D^2(x_i^T \tilde{b} - x_i^T b) = (1 - \chi_i)^2 D^2(x_i^T \tilde{b}).$$

Next, we calculate the difference:

$$x_i^T \tilde{b} - E(x_i^T \tilde{b}) = x_i^T \tilde{b} - x_i^T b = x_i^T\left( (X^T D X)^{-1} X^T D \hat{\Theta} - b \right) =$$

$$= x_i^T\left( (X^T D X)^{-1} X^T D (Xb + u + e) - b \right) = x_i^T (X^T D X)^{-1} X^T D (u + e). \qquad (1.19)$$

So the variance $D^2(d_i^T(\tilde{b} - b))$ is equal to:

$$D^2(d_i^T(\tilde{b} - b)) = (1 - \chi_i)^2 E\left( x_i^T (X^T D X)^{-1} X^T D (u + e)(u + e)^T D X (X^T D X)^{-1} x_i \right) =$$

$$= (1 - \chi_i)^2 x_i^T (X^T D X)^{-1} X^T D \, D^2(u + e) \, D X (X^T D X)^{-1} x_i.$$

Because the random components $u$ and $e$ are stochastically independent:

$$D^2(u + e) = D^2(u) + D^2(e) = \mathrm{Diag}(\sigma_u^2 + V_1,\dots,\sigma_u^2 + V_m) = D^{-1};$$

so for the variance $D^2(d_i^T(\tilde{b} - b))$ we have:

$$D^2(d_i^T(\tilde{b} - b)) = (1 - \chi_i)^2 x_i^T (X^T D X)^{-1} x_i = g_{2i}(\sigma_u^2). \qquad (1.20)$$

The results (1.12) and (1.13) do not require distributional assumptions on the random errors $u_i$ and $e_i$. The leading term $g_{1i}(\sigma_u^2)$ is of order $O(1)$ whereas $g_{2i}(\sigma_u^2)$ is of lower order $O(m^{-1})$ for a large number of sampled small areas $m$. The leading term shows that the MSE of the BLUP estimator can be substantially smaller than the MSE of the direct estimator under the assumed model (1.3) when the weight $\chi_i$ is small, that is, when the model variance $\sigma_u^2$ is small relative to the sampling variance $V_i$. The success of small area estimation, therefore, largely depends on getting good auxiliary data $\{x_i\}$ that lead to a small model variance ($\sigma_u^2$) relative to the sampling variance ($V_i$).
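The two MSE components of Theorem 1 can be evaluated numerically. The Python sketch below is a minimal illustration with a hypothetical function name (`blup_mse_terms`) and made-up inputs; it computes $g_{1i}$ and $g_{2i}$ from (1.12) and (1.13) and compares their sum with the sampling variance $V_i$ of the direct estimator.

```python
import numpy as np

def blup_mse_terms(X, V, sigma_u2):
    """Return g1 and g2 from (1.12)-(1.13) for every area."""
    X, V = np.asarray(X, dtype=float), np.asarray(V, dtype=float)
    d = 1.0 / (sigma_u2 + V)                       # diagonal of D
    chi = sigma_u2 * d
    g1 = chi * V                                   # eq. (1.12)
    XtDX_inv = np.linalg.inv((X.T * d) @ X)
    # Quadratic form x_i^T (X^T D X)^{-1} x_i for every row x_i of X.
    quad = np.einsum("ij,jk,ik->i", X, XtDX_inv, X)
    g2 = (1.0 - chi) ** 2 * quad                   # eq. (1.13)
    return g1, g2

# Hypothetical example: four areas, intercept-only model.
X = np.ones((4, 1))
V = np.ones(4)
g1, g2 = blup_mse_terms(X, V, sigma_u2=1.0)
print(g1 + g2, V)  # MSE of the BLUP next to the MSE of the direct estimator
```

In this example the BLUP has MSE 0.625 in every area, against a direct-estimator MSE of 1.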

2. Empirical Best Linear Unbiased Prediction (EBLUP) Estimator

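In practice the model variance $\sigma_u^2$ is unknown. We denote by $\hat{\sigma}_u^2$ an estimator of $\sigma_u^2$ and:

$$\hat{D} = \mathrm{Diag}\left( (\hat{\sigma}_u^2 + V_1)^{-1},\dots,(\hat{\sigma}_u^2 + V_m)^{-1} \right), \qquad (2.1)$$

$$\hat{b} = (X^T \hat{D} X)^{-1} X^T \hat{D} \hat{\Theta}, \qquad (2.2)$$

$$\hat{\chi}_i = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + V_i}. \qquad (2.3)$$

Then the EBLUP estimator of $\theta_i$ is given by:

$$\hat{\theta}_i^H = \tilde{\theta}_i^H(\hat{b}, \hat{\sigma}_u^2) = \hat{\chi}_i \hat{\theta}_i + (1 - \hat{\chi}_i) x_i^T \hat{b}, \quad i = 1,\dots,m. \qquad (2.4)$$

The $\hat{\theta}_i^H$ are unbiased estimators of $\theta_i$ provided (see: Jiang et al., 2002):

(1) $u_i$ and $e_i$ are symmetrically distributed about 0,
(2) $\hat{\sigma}_u^2$ is an even function of the $\hat{\theta}_i$,
(3) $\hat{\sigma}_u^2$ remains invariant when $\hat{\theta}_i$ is replaced by $(\hat{\theta}_i - x_i^T b)$ for all $b$.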


2.1. Estimation of $\sigma_u^2$

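There are many choices of $\hat{\sigma}_u^2$. Prasad and Rao (1990; see also Kackar, Harville, 1984) provided an explicit estimator for $\sigma_u^2$:

$$\hat{\sigma}_u^2 = \max\{\tilde{\sigma}_u^2, 0\}, \qquad (2.5)$$

$$\tilde{\sigma}_u^2 = \frac{1}{m - p} \left( \sum_{i=1}^m (\hat{\theta}_i - x_i^T b^*)^2 - \sum_{i=1}^m V_i \left( 1 - x_i^T (X^T X)^{-1} x_i \right) \right); \qquad (2.6)$$

where $b^*$ is the ordinary least squares estimator of $b$:

$$b^* = (X^T X)^{-1} X^T \hat{\Theta} = \left( \sum_{i=1}^m x_i x_i^T \right)^{-1} \left( \sum_{i=1}^m x_i \hat{\theta}_i \right). \qquad (2.7)$$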
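A moment estimator of the model variance and the resulting EBLUP can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's code: the function names and data are hypothetical, and the sampling-variance correction is written as $V_i(1 - x_i^T (X^T X)^{-1} x_i)$, the form under which the expectation of the estimator equals $\sigma_u^2$ before truncation at zero.

```python
import numpy as np

def prasad_rao_sigma_u2(theta_hat, X, V):
    """Truncated moment estimator of sigma_u^2 based on OLS residuals."""
    theta_hat, X, V = (np.asarray(a, dtype=float) for a in (theta_hat, X, V))
    m, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b_star = XtX_inv @ X.T @ theta_hat                  # OLS estimator, eq. (2.7)
    resid = theta_hat - X @ b_star
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)         # x_i^T (X^T X)^{-1} x_i
    sigma_tilde = (np.sum(resid ** 2) - np.sum(V * (1.0 - h))) / (m - p)
    return max(sigma_tilde, 0.0)                        # truncation, eq. (2.5)

def eblup(theta_hat, X, V):
    """EBLUP (2.4): the BLUP formula with sigma_u^2 replaced by its estimate."""
    theta_hat, X, V = (np.asarray(a, dtype=float) for a in (theta_hat, X, V))
    s2 = prasad_rao_sigma_u2(theta_hat, X, V)
    d = 1.0 / (s2 + V)                                  # diagonal of D-hat, eq. (2.1)
    b_hat = np.linalg.solve((X.T * d) @ X, (X.T * d) @ theta_hat)  # eq. (2.2)
    chi_hat = s2 * d                                    # eq. (2.3)
    return chi_hat * theta_hat + (1.0 - chi_hat) * (X @ b_hat)

# Hypothetical example: four areas, intercept-only model, unit sampling variances.
theta_hat = np.array([0.0, 2.0, 4.0, 6.0])
X = np.ones((4, 1))
V = np.ones(4)

s2 = prasad_rao_sigma_u2(theta_hat, X, V)
print(s2, eblup(theta_hat, X, V))
```

When the direct estimates vary less than the sampling variances alone would explain, the estimator is truncated at zero and the EBLUP reduces to the regression synthetic estimator.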

2.2. MSE of EBLUP.

An estimator of the MSE of the EBLUP estimator can be obtained from (1.11) for the BLUP estimator $\tilde{\theta}_i^H(\sigma_u^2)$ by substituting $\hat{\sigma}_u^2$ for $\sigma_u^2$. But this leads to significant underestimation of the true MSE because the effect of estimating $\sigma_u^2$ is ignored. Prasad and Rao (1990) obtained an approximately unbiased estimator of the MSE of the EBLUP estimator, assuming normality of $u_i$ and $e_i$. The error in the EBLUP estimator may be decomposed as:

$$\tilde{\theta}_i^H(\hat{\sigma}_u^2) - \theta_i = \left[ \tilde{\theta}_i^H(\sigma_u^2) - \theta_i \right] + \left[ \tilde{\theta}_i^H(\hat{\sigma}_u^2) - \tilde{\theta}_i^H(\sigma_u^2) \right]. \qquad (2.7)$$

Therefore,

$$MSE\left[ \tilde{\theta}_i^H(\hat{\sigma}_u^2) \right] = MSE\left[ \tilde{\theta}_i^H(\sigma_u^2) \right] + E\left[ \tilde{\theta}_i^H(\hat{\sigma}_u^2) - \tilde{\theta}_i^H(\sigma_u^2) \right]^2 + 2\,\mathrm{cov}\left( \tilde{\theta}_i^H(\sigma_u^2) - \theta_i,\ \tilde{\theta}_i^H(\hat{\sigma}_u^2) - \tilde{\theta}_i^H(\sigma_u^2) \right). \qquad (2.8)$$

Under normality of the random effects $u_i$ and $e_i$, the cross-product term in (2.8) is zero, provided $\hat{\sigma}_u^2$ is translation invariant (see [5]). So the MSE of the EBLUP estimator (for the estimator $\hat{\sigma}_u^2$ given in (2.5)) is equal to:

$$MSE\left[ \tilde{\theta}_i^H(\hat{\sigma}_u^2) \right] = MSE\left[ \tilde{\theta}_i^H(\sigma_u^2) \right] + E\left[ \tilde{\theta}_i^H(\hat{\sigma}_u^2) - \tilde{\theta}_i^H(\sigma_u^2) \right]^2 = g_{1i}(\sigma_u^2) + g_{2i}(\sigma_u^2) + g_{3i}(\sigma_u^2), \qquad (2.9)$$

where $g_{1i}(\sigma_u^2)$ and $g_{2i}(\sigma_u^2)$ are given in (1.12) and (1.13), and $g_{3i}(\sigma_u^2)$ is given by:

$$g_{3i}(\sigma_u^2) \approx \frac{2 V_i^2}{m^2 (\sigma_u^2 + V_i)^3} \sum_{j=1}^m (\sigma_u^2 + V_j)^2. \qquad (2.10)$$
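The three-term approximation to the MSE of the EBLUP can be sketched numerically. In the Python illustration below, the function name `eblup_mse_approx` and the inputs are assumptions made for the example; the terms are evaluated at a given value of the model variance, whereas in practice an estimate would be plugged in.

```python
import numpy as np

def eblup_mse_approx(X, V, sigma_u2):
    """Approximate MSE of the EBLUP as g1 + g2 + g3, eqs. (2.9)-(2.10)."""
    X, V = np.asarray(X, dtype=float), np.asarray(V, dtype=float)
    m = X.shape[0]
    total = sigma_u2 + V                               # sigma_u^2 + V_i
    chi = sigma_u2 / total
    g1 = chi * V                                       # eq. (1.12)
    XtDX_inv = np.linalg.inv((X.T / total) @ X)
    g2 = (1.0 - chi) ** 2 * np.einsum("ij,jk,ik->i", X, XtDX_inv, X)  # eq. (1.13)
    g3 = 2.0 * V ** 2 / (m ** 2 * total ** 3) * np.sum(total ** 2)    # eq. (2.10)
    return g1 + g2 + g3, (g1, g2, g3)

# Hypothetical example: four areas, intercept-only model.
X = np.ones((4, 1))
V = np.ones(4)
mse, (g1, g2, g3) = eblup_mse_approx(X, V, sigma_u2=1.0)
print(mse, g3)  # g3 is the extra term due to estimating the model variance
```

With only four areas the term $g_{3i}$ is not negligible here (0.25 against $g_{1i} = 0.5$), which illustrates why ignoring it understates the MSE.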

3. Empirical Bayes (EB) estimator

Assuming normality (of $u_i$ and $e_i$), the basic area level model (1.1) and (1.2) may be expressed as a two-stage hierarchical model:

(1) $\hat{\theta}_i \mid \theta_i \sim N(\theta_i, V_i)$,
(2) $\theta_i \sim N(x_i^T b, \sigma_u^2)$.

Then, the "optimal" estimator of the realized value of $\theta_i$ is given by the conditional expectation of $\theta_i$ given $\hat{\theta}_i$, $b$, $\sigma_u^2$:

$$E(\theta_i \mid \hat{\theta}_i, b, \sigma_u^2) = \hat{\theta}_i^B = \hat{\theta}_i^B(b, \sigma_u^2) = \chi_i \hat{\theta}_i + (1 - \chi_i) x_i^T b. \qquad (3.1)$$

Result (3.1) follows from the posterior distribution of $\theta_i$ given $\hat{\theta}_i$, $b$, $\sigma_u^2$:

$$\theta_i \mid \hat{\theta}_i, b, \sigma_u^2 \sim N\left( \hat{\theta}_i^B,\ g_{1i}(\sigma_u^2) = \chi_i V_i \right). \qquad (3.2)$$

The estimator $\hat{\theta}_i^B$ is the "Bayes" estimator under squared error loss and is optimal in the sense that its MSE, $MSE(\hat{\theta}_i^B) = E(\hat{\theta}_i^B - \theta_i)^2$, is smaller than the MSE of any other estimator of $\theta_i$, linear or nonlinear in the $\hat{\theta}_i$'s (model-unbiasedness of the estimators is not required).

The Bayes estimator $\hat{\theta}_i^B$ depends on the model parameters $b$ and $\sigma_u^2$, which are estimated from the marginal distribution:

$$\hat{\theta}_i \sim N(x_i^T b, \sigma_u^2 + V_i)$$

using REML (restricted maximum likelihood). Denoting the estimators as $\hat{b}$ and $\hat{\sigma}_u^2$, we obtain the EB estimator of $\theta_i$ from $\hat{\theta}_i^B$ by substituting $\hat{b}$ for $b$ and $\hat{\sigma}_u^2$ for $\sigma_u^2$:

$$\hat{\theta}_i^{EB} = \hat{\theta}_i^B(\hat{b}, \hat{\sigma}_u^2) = \hat{\chi}_i \hat{\theta}_i + (1 - \hat{\chi}_i) x_i^T \hat{b}. \qquad (3.3)$$

The EB estimator $\hat{\theta}_i^{EB}$ is identical to the EBLUP estimator given by (2.4), but the EB approach is applicable generally for any joint distribution of $\hat{\theta}_i$ and $\theta_i$. Note that $\hat{\theta}_i^{EB}$ is also the mean of the estimated posterior density $f(\theta_i \mid \hat{\theta}_i, \hat{b}, \hat{\sigma}_u^2)$ of $\theta_i$, namely $N(\hat{\theta}_i^{EB}, \hat{\chi}_i V_i)$. It should be noted that the EB approach is essentially frequentist, because it uses only the sampling model and the linking model, which can be validated from the data; no prior distributions on the model parameters ($b$ and $\sigma_u^2$) are involved, unlike in the HB approach.

3.1. MSE estimation.

The result given by (2.9) on the MSE of the EBLUP estimator $\hat{\theta}_i^H$ is applicable to the EB estimator $\hat{\theta}_i^{EB}$ because $\hat{\theta}_i^H$ and $\hat{\theta}_i^{EB}$ are identical under normality. The MSE estimators are nearly unbiased in the sense of having bias of lower order than $1/m$ for large $m$.

3.1.1. Jackknife method.

Jiang, Lahiri and Wan (2002) proposed a jackknife method of estimating the MSE of EB estimators. We express the expectation $E(\hat{\theta}_i^{EB} - \theta_i)^2$ as $E_{\hat{\theta}} E_{\theta \mid \hat{\theta}}$ and $(\hat{\theta}_i^{EB} - \theta_i)^2$ as:

$$(\hat{\theta}_i^{EB} + \hat{\theta}_i^B - \hat{\theta}_i^B - \theta_i)^2 = (\hat{\theta}_i^{EB} - \hat{\theta}_i^B)^2 + (\hat{\theta}_i^B - \theta_i)^2 + 2(\hat{\theta}_i^{EB} - \hat{\theta}_i^B)(\hat{\theta}_i^B - \theta_i), \qquad (3.4)$$

where $E_{\theta \mid \hat{\theta}}$ is the expectation over the conditional distribution of $\theta$ given $\hat{\theta}$, and $E_{\hat{\theta}}$ is the expectation over the marginal distribution of $\hat{\theta}$. Further,

$$E_{\theta \mid \hat{\theta}}\left[ (\hat{\theta}_i^{EB} - \hat{\theta}_i^B)(\hat{\theta}_i^B - \theta_i) \right] = (\hat{\theta}_i^{EB} - \hat{\theta}_i^B)\, E_{\theta \mid \hat{\theta}}(\hat{\theta}_i^B - \theta_i) = 0.$$

Note the following decomposition of $MSE(\hat{\theta}_i^{EB})$:

$$MSE(\hat{\theta}_i^{EB}) = E(\hat{\theta}_i^{EB} - \hat{\theta}_i^B)^2 + E(\hat{\theta}_i^B - \theta_i)^2 = \qquad (3.5)$$

$$= E(\hat{\theta}_i^{EB} - \hat{\theta}_i^B)^2 + g_{1i}(\sigma_u^2) =: M_{2i} + M_{1i},$$

where the expectation is over the joint distribution of $(\theta, \hat{\theta})$ for $i = 1,\dots,m$.

We write the EB estimator of $\theta_i$ given by

$$\hat{\theta}_i^{EB} = \hat{\theta}_i^B(\hat{b}, \hat{\sigma}_u^2), \qquad (3.7)$$

as

$$\hat{\theta}_i^{EB} = k(\hat{\theta}_i, \hat{\varphi}), \qquad (3.8)$$

where $\varphi = (b, \sigma_u^2)$ denotes the model parameters $b$ and $\sigma_u^2$. The jackknife steps are then as follows:

(1) calculate $\hat{\varphi}(l)$, the estimators of $\varphi$ when the $l$-th area data $(\hat{\theta}_l, x_l)$ are deleted; this calculation is done for each $l = 1,\dots,m$; next calculate $m$ estimators of $\theta_i$:

$$\hat{\theta}_i^{EB}(l) = k(\hat{\theta}_i, \hat{\varphi}(l))$$

(2) calculate:

$$\hat{M}_{2i} = \frac{m-1}{m} \sum_{l=1}^m \left( \hat{\theta}_i^{EB}(l) - \hat{\theta}_i^{EB} \right)^2$$

(3) calculate:

$$\hat{M}_{1i} = g_{1i}(\hat{\sigma}_u^2) - \frac{m-1}{m} \sum_{l=1}^m \left( g_{1i}(\hat{\sigma}_u^2(l)) - g_{1i}(\hat{\sigma}_u^2) \right)$$

(4) calculate the jackknife estimator of $MSE(\hat{\theta}_i^{EB})$ as:

$$\widehat{MSE}(\hat{\theta}_i^{EB}) = \hat{M}_{1i} + \hat{M}_{2i}. \qquad (3.9)$$

Note that $\hat{M}_{1i}$ estimates the MSE when $\varphi$ is known and $\hat{M}_{2i}$ estimates the extra variability in the MSE due to estimating the model parameters $\varphi$. The jackknife estimator of the MSE is nearly unbiased in the sense of having bias of lower order than $1/m$ for large $m$.
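The jackknife steps (1) to (4) can be sketched in Python as below. This is a simplified illustration, not the procedure used in the paper: a truncated moment estimator of the model variance and weighted least squares for $b$ are swapped in for REML, and all function names and data are hypothetical.

```python
import numpy as np

def fit(theta_hat, X, V):
    """Estimate (b, sigma_u^2); a moment estimator stands in for REML here."""
    m, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = theta_hat - X @ (XtX_inv @ X.T @ theta_hat)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = max((resid @ resid - np.sum(V * (1.0 - h))) / (m - p), 0.0)
    w = 1.0 / (s2 + V)
    b = np.linalg.solve((X.T * w) @ X, (X.T * w) @ theta_hat)
    return b, s2

def k(theta_hat, X, V, b, s2):
    """EB estimator as a function of the data and the parameters, eq. (3.8)."""
    chi = s2 / (s2 + V)
    return chi * theta_hat + (1.0 - chi) * (X @ b)

def jackknife_mse(theta_hat, X, V):
    theta_hat, X, V = (np.asarray(a, dtype=float) for a in (theta_hat, X, V))
    m = len(theta_hat)
    b, s2 = fit(theta_hat, X, V)
    eb = k(theta_hat, X, V, b, s2)
    g1 = s2 / (s2 + V) * V
    sum_g1 = np.zeros(m)
    sum_eb = np.zeros(m)
    for l in range(m):
        keep = np.arange(m) != l                        # step (1): delete area l
        b_l, s2_l = fit(theta_hat[keep], X[keep], V[keep])
        sum_g1 += s2_l / (s2_l + V) * V - g1
        sum_eb += (k(theta_hat, X, V, b_l, s2_l) - eb) ** 2
    M2 = (m - 1) / m * sum_eb                           # step (2)
    M1 = g1 - (m - 1) / m * sum_g1                      # step (3)
    return M1 + M2, M1, M2                              # step (4), eq. (3.9)

# Hypothetical data: six areas, intercept plus one covariate.
theta_hat = np.array([1.0, 3.0, 2.5, 6.0, 4.0, 5.5])
X = np.column_stack([np.ones(6), np.arange(1.0, 7.0)])
V = np.array([0.5, 0.4, 0.6, 0.5, 0.3, 0.7])

mse, M1, M2 = jackknife_mse(theta_hat, X, V)
print(mse)
```

Each leave-one-out fit re-estimates both parameters, so the cost grows with the number of areas; for the small $m$ typical of county-level applications this is rarely a concern.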

REFERENCES

DAS K., JIANG J., RAO J. N. K. (2004), Mean squared error of empirical predictor, Annals of Statistics, 32, 818–840.
JIANG J., LAHIRI P., WAN S.-M. (2002), A unified jackknife theory for empirical best prediction with M-estimation, Annals of Statistics, 30, 1782–1810.
KACKAR R. N., HARVILLE D. A. (1984), Approximations for Standard Errors of Estimators of Fixed and Random Effects in Mixed Linear Models, Journal of the American Statistical Association, vol. 79, 853–862.
PRASAD N. G. N., RAO J. N. K. (1990), The Estimation of the Mean Squared Error of Small Area Estimators, Journal of the American Statistical Association, vol. 85, 163–171.
RAO J. N. K. (2000), Statistical methodology for indirect estimations in small areas.
RAO J. N. K. (2003), Small Area Estimation, Wiley Series in Survey Methodology.
RAO J. N. K., GHOSH M. (1994), Small Area Estimation: An Appraisal, vol. 9.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1229—1245

ON ACCURACY OF EBLUP UNDER RANDOM REGRESSION COEFFICIENT MODEL1

Tomasz Żądło2

ABSTRACT

The paper analyzes the accuracy of the empirical best linear unbiased predictor (EBLUP) of the domain total (see Royall, 1976) assuming a random coefficient superpopulation model, which is a special case of the general linear mixed model. To estimate the mean square error (MSE) of the EBLUP we use the results obtained by Datta and Lahiri (2000) for the predictor proposed by Henderson (1950) and adapt them for the predictor proposed by Royall (1976). In a simulation study we use real data on Polish farms from the Dąbrowa Tarnowska region to examine the decrease in accuracy of the EBLUP compared with the best linear unbiased predictor (BLUP) due to the estimation of unknown variance components. Moreover, we compare the MSE of the EBLUP with the MSEs of two other predictors which are BLUPs but under different models (hence they are studied in the simulation in the case of model misspecification).

Key words: small area estimation, empirical best linear unbiased predictors, general mixed linear model

1. General superpopulation models

In the first section basic notation and three superpopulation models are introduced: the general linear model, the general linear mixed model and the general linear mixed model with block-diagonal variance-covariance matrix. The finite population $\Omega$ consists of $N$ units, each of which has a value of a target variable $y$ associated with it. The population vector of $y$'s is $\mathbf{y} = [y_1, y_2, \dots, y_N]^T$ and it is treated as the realization of a random vector

1 This is extended version of the paper presented at the Conference on Sampling Methods in Economic and Social Surveys, 11—12 September 2006, Katowice, Poland. 2 Department of Statistics, University of Economics in Katowice, Bogucicka 14, 40-226 Katowice, Poland, e-mail [email protected].

1230 T. Żądło: On accuracy of EBLUP under random…

$\mathbf{Y} = [Y_1, Y_2, \dots, Y_N]^T$. The joint distribution of $\mathbf{Y}$ is denoted by $\xi$. From the population of $N$ units, a sample $s$ of $n$ units is selected, and the $y$ values of the sample units are observed. For any sample $s$ we can reorder the population vector $\mathbf{y}$ so that the first $n$ elements are those in the sample: $\mathbf{y} = [\mathbf{y}_s^T, \mathbf{y}_r^T]^T$, where $\mathbf{y}_s$ is the $n$-dimensional vector of observed values and $\mathbf{y}_r$ is the $N_r$-dimensional vector of unobserved values, where $N_r = N - n$. The set of unsampled elements is denoted by $\Omega_r = \Omega - s$. Hence, the vector $\mathbf{Y}$ can be reordered as follows: $\mathbf{Y} = [\mathbf{Y}_s^T, \mathbf{Y}_r^T]^T$. The population is divided into $D$ domains $\Omega_d$, each of size $N_d$ ($d = 1,\dots,D$). The set of sampled elements which belong to the $d$-th domain, denoted by $s_d = \Omega_d \cap s$, consists of $n_d$ elements (where $n_d$ may be random). Let us introduce additional notation: $\Omega_{rd} = \Omega_d - s_d$ and $N_{rd} = N_d - n_d$. To stress that some notation is introduced for the domain of interest we add a star to the subscript $d$; for example, the domain of interest is denoted by $\Omega_{d^*}$ and its size by $N_{d^*}$. Let us introduce the general linear model. We assume that:

$$\begin{cases} E_\xi(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta} \\ D_\xi^2(\mathbf{Y}) = \mathbf{V} \end{cases}, \qquad (1)$$

where $\mathbf{X}$ is an $N \times p$ matrix of values of $p$ auxiliary variables, $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters and $\mathbf{V}$ is a positive definite variance-covariance matrix depending on some variance parameters $\boldsymbol{\delta} = [\delta_1,\dots,\delta_q]^T$. We assume that all values of auxiliary variables are known for each unit in the population. If the population elements are rearranged so that the first $n$ elements of $\mathbf{Y}$ are those in the sample, and the first $n$ rows of $\mathbf{X}$ are for units in the sample, then $\mathbf{X}$ and $\mathbf{V}$ can be expressed as

$$\mathbf{X} = \begin{bmatrix} \mathbf{X}_s \\ \mathbf{X}_r \end{bmatrix}, \qquad \mathbf{V} = \begin{bmatrix} \mathbf{V}_{ss} & \mathbf{V}_{sr} \\ \mathbf{V}_{rs} & \mathbf{V}_{rr} \end{bmatrix},$$

where $\mathbf{X}_s$ is $n \times p$, $\mathbf{X}_r$ is $N_r \times p$, $\mathbf{V}_{ss}$ is $n \times n$, $\mathbf{V}_{rr}$ is $N_r \times N_r$, $\mathbf{V}_{sr}$ is $n \times N_r$ and $\mathbf{V}_{rs} = \mathbf{V}_{sr}^T$. We assume that $\mathbf{V}$ is positive definite. The general linear mixed model with the following assumption is a special case of the general linear model with the assumption (1):


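$$\begin{cases} \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{v} + \mathbf{e} \\ E_\xi(\mathbf{e}) = \mathbf{0} \\ E_\xi(\mathbf{v}) = \mathbf{0} \\ D_\xi^2 \begin{bmatrix} \mathbf{v} \\ \mathbf{e} \end{bmatrix} = \begin{bmatrix} \mathbf{G} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} \end{bmatrix} \end{cases}, \qquad (2)$$

where $\mathbf{Z}$ is a known $N \times h$ matrix, and the random vectors $\mathbf{v}$ and $\mathbf{e}$ are $h \times 1$ and $N \times 1$ respectively. If the population elements are rearranged so that the first $n$ elements of $\mathbf{Y}$ are those in the sample, and the first $n$ rows of $\mathbf{Z}$ are for units in the sample, then $\mathbf{e}$, $\mathbf{Z}$ and $\mathbf{R}$ can be expressed as:

$$\mathbf{e} = \begin{bmatrix} \mathbf{e}_s \\ \mathbf{e}_r \end{bmatrix}, \qquad \mathbf{Z} = \begin{bmatrix} \mathbf{Z}_s \\ \mathbf{Z}_r \end{bmatrix}, \qquad \mathbf{R} = \begin{bmatrix} \mathbf{R}_{ss} & \mathbf{R}_{sr} \\ \mathbf{R}_{rs} & \mathbf{R}_{rr} \end{bmatrix},$$

where $\mathbf{e}_s$ is $n \times 1$, $\mathbf{e}_r$ is $N_r \times 1$, $\mathbf{Z}_s$ is $n \times h$, $\mathbf{Z}_r$ is $N_r \times h$, $\mathbf{R}_{ss}$ is $n \times n$, $\mathbf{R}_{rr}$ is $N_r \times N_r$, $\mathbf{R}_{sr}$ is $n \times N_r$ and $\mathbf{R}_{rs} = \mathbf{R}_{sr}^T$. Under (2) we can express the variance-covariance matrix of $\mathbf{Y}$ as

$$D_\xi^2(\mathbf{Y}) = \mathbf{V} = \mathbf{R} + \mathbf{Z}\mathbf{G}\mathbf{Z}^T \qquad (3)$$

and the variance-covariance matrix of $\mathbf{Y}_s$ as

$$D_\xi^2(\mathbf{Y}_s) = \mathbf{V}_{ss} = \mathbf{R}_{ss} + \mathbf{Z}_s \mathbf{G} \mathbf{Z}_s^T.$$

We assume that the matrices $\mathbf{R}$ and $\mathbf{G}$ (and hence the matrix $\mathbf{V}$) depend on some variance parameters $\boldsymbol{\delta} = [\delta_1,\dots,\delta_q]^T$.

Let us introduce a special case of the model (2) assuming independence of the random variables for elements of the population which belong to different domains. This model is considered inter alia by Datta and Lahiri (2000), Prasad and Rao (1990), Rao (2003) pp. 107–108. Let us introduce the following notation. Let $\mathbf{A}_d$ be a column vector and $\mathbf{B}_d$ a square matrix. Hence, $\mathrm{col}_{1 \le d \le D}(\mathbf{A}_d) = [\mathbf{A}_1^T \dots \mathbf{A}_d^T \dots \mathbf{A}_D^T]^T$ is a column vector and

$$\mathrm{diag}_{1 \le d \le D}(\mathbf{B}_d) = \begin{bmatrix} \mathbf{B}_1 & \dots & \mathbf{0} & \dots & \mathbf{0} \\ \vdots & & \vdots & & \vdots \\ \mathbf{0} & \dots & \mathbf{B}_d & \dots & \mathbf{0} \\ \vdots & & \vdots & & \vdots \\ \mathbf{0} & \dots & \mathbf{0} & \dots & \mathbf{B}_D \end{bmatrix}$$

is a block diagonal matrix. In the considered superpopulation model we use the following assumptions: $\mathbf{Y} = \mathrm{col}_{1 \le d \le D}(\mathbf{Y}_d)$, $\mathbf{X} = \mathrm{col}_{1 \le d \le D}(\mathbf{X}_d)$, $\mathbf{Z} = \mathrm{diag}_{1 \le d \le D}(\mathbf{Z}_d)$, $\mathbf{v} = \mathrm{col}_{1 \le d \le D}(\mathbf{v}_d)$, $\mathbf{e} = \mathrm{col}_{1 \le d \le D}(\mathbf{e}_d)$, where the matrices $\mathbf{Y}_d$, $\mathbf{X}_d$, $\mathbf{Z}_d$, $\mathbf{v}_d$, $\mathbf{e}_d$ are $N_d \times 1$, $N_d \times p$, $N_d \times h_d$, $h_d \times 1$, $N_d \times 1$ respectively and $d = 1,\dots,D$, where $D$ is the number of domains. Moreover, $\sum_{d=1}^D N_d = N$, $\sum_{d=1}^D n_d = n$ and $\sum_{d=1}^D h_d = h$. Let $\mathbf{R} = \mathrm{diag}_{1 \le d \le D}(\mathbf{R}_d)$, $\mathbf{G} = \mathrm{diag}_{1 \le d \le D}(\mathbf{G}_d)$, where the matrices $\mathbf{R}_d$ and $\mathbf{G}_d$ are $N_d \times N_d$ and $h_d \times h_d$ respectively. Hence, the variance-covariance matrix may be written as follows: $\mathbf{V} = \mathrm{diag}_{1 \le d \le D}(\mathbf{V}_d)$, where $\mathbf{V}_d = \mathbf{R}_d + \mathbf{Z}_d \mathbf{G}_d \mathbf{Z}_d^T$ is $N_d \times N_d$.

Finally, the general linear mixed model with the block-diagonal variance-covariance matrix is the model with the assumption of independence of random variables for population elements which belong to different domains and

$$\begin{cases} \mathbf{Y}_d = \mathbf{X}_d \boldsymbol{\beta} + \mathbf{Z}_d \mathbf{v}_d + \mathbf{e}_d \\ E_\xi(\mathbf{e}_d) = \mathbf{0} \\ E_\xi(\mathbf{v}_d) = \mathbf{0} \\ D_\xi^2 \begin{bmatrix} \mathbf{v}_d \\ \mathbf{e}_d \end{bmatrix} = \begin{bmatrix} \mathbf{G}_d & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_d \end{bmatrix} \end{cases} \qquad (4)$$

for each $d$ ($d = 1,\dots,D$).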
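The block-diagonal structure can be checked numerically. The Python sketch below, with a hypothetical helper `block_diag` and made-up two-domain matrices, builds $\mathbf{V} = \mathbf{R} + \mathbf{Z}\mathbf{G}\mathbf{Z}^T$ from the stacked matrices and compares it with the matrix assembled domain by domain from $\mathbf{V}_d = \mathbf{R}_d + \mathbf{Z}_d \mathbf{G}_d \mathbf{Z}_d^T$.

```python
import numpy as np

def block_diag(blocks):
    """Place the given blocks along the diagonal of a zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

# Hypothetical two-domain population: N_1 = 2, N_2 = 3, one random effect per domain.
Z_blocks = [np.array([[1.0], [2.0]]), np.array([[1.0], [1.0], [3.0]])]
G_blocks = [np.array([[0.5]]), np.array([[0.5]])]
R_blocks = [np.eye(2), np.eye(3)]

Z, G, R = block_diag(Z_blocks), block_diag(G_blocks), block_diag(R_blocks)
V_direct = R + Z @ G @ Z.T                       # V from the stacked matrices
V_blocks = block_diag([Rd + Zd @ Gd @ Zd.T       # V_d computed domain by domain
                       for Rd, Zd, Gd in zip(R_blocks, Z_blocks, G_blocks)])
print(np.allclose(V_direct, V_blocks))
```

Working with the domain blocks avoids forming and inverting the full $N \times N$ matrix, which matters when the population is large.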

2. Special cases of superpopulation model

In this section we introduce three special cases of the superpopulation models introduced in the first section. Superpopulation model I. We assume that (Prasad and Rao, 1990):

$$Y_{id} = \beta_d x_{id} + e_{id} = \beta x_{id} + v_d x_{id} + e_{id} \qquad (i = 1, \ldots, N;\ d = 1, \ldots, D) \tag{5}$$

where $\beta_d = \beta + v_d$,

$$v_d \overset{iid}{\sim} (0, \sigma_v^2), \tag{6}$$

$$e_{id} \overset{iid}{\sim} (0, \sigma_e^2), \tag{7}$$

and $v_d$ and $e_{id}$ are independent. It is often also assumed that they are normally distributed (in our case the normality assumption will be needed to derive the MSE and its estimator). The model (5) is called the random regression coefficient model (Prasad and Rao, 1990) and it is a special case of the general linear mixed model given by (2), where $X$ is the $N \times 1$ vector of values of the auxiliary variable in the population, $\beta$ is a scalar, $v = [v_1 \ldots v_d \ldots v_D]^T$, $Z = \mathrm{diag}_{1 \le d \le D}(x_{N_d})$, and $x_{N_d}$ is the vector of $N_d$ values of the auxiliary variable for the domain $\Omega_d$ ($d = 1, \ldots, D$). What is more, the model (5) is also a special case of the general linear mixed model with the block-diagonal variance-covariance matrix given by (4). Note that the equation (3) under (5) simplifies to:

$$D^2_\xi(Y) = \mathrm{diag}_{1 \le d \le D}\left(\sigma_e^2 I_{N_d} + \sigma_v^2 x_{N_d} x_{N_d}^T\right) \tag{8}$$

where $I_m$ is the $m \times m$ identity matrix. From (5) and (8) we obtain:

$$E_\xi(Y_{id}) = \beta x_{id}, \qquad Cov_\xi(Y_{id}, Y_{i'd'}) = \begin{cases} \sigma_e^2 + \sigma_v^2 x_{id}^2 & \text{for } i = i',\ d = d' \\ \sigma_v^2 x_{id} x_{i'd} & \text{for } i \ne i',\ d = d' \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

Hence, from (9), the correlation coefficient between random variables in the $d$-th domain equals:

$$\rho(Y_{id}, Y_{i'd}) = \frac{Cov_\xi(Y_{id}, Y_{i'd})}{D_\xi(Y_{id})\, D_\xi(Y_{i'd})} = \frac{\sigma_v^2 x_{id} x_{i'd}}{\sqrt{\sigma_e^2 + \sigma_v^2 x_{id}^2}\,\sqrt{\sigma_e^2 + \sigma_v^2 x_{i'd}^2}} \tag{10}$$

Superpopulation model II. Let us introduce a special case of the superpopulation model I given by (9). Let us assume that $\sigma_v^2 = 0$; then $\rho(Y_{id}, Y_{i'd'}) = 0$ and we obtain the following special case of (9):

$$E_\xi(Y_{id}) = \beta x_{id}, \qquad Cov_\xi(Y_{id}, Y_{i'd'}) = \begin{cases} \sigma_e^2 & \text{for } i = i',\ d = d' \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

Before we introduce another superpopulation model we note that in the assumption (5) $\beta_d$ is random.
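The covariance structure (8)-(10) of model I can be checked with a short computation. This is only a sketch with arbitrary illustrative values of $x$, $\sigma_e^2$ and $\sigma_v^2$; the function names `model_I_cov` and `model_I_corr` are hypothetical and not part of the paper.

```python
import numpy as np

def model_I_cov(x, sigma2_e, sigma2_v):
    """Covariance block of one domain under (8): sigma_e^2 I + sigma_v^2 x x^T."""
    x = np.asarray(x, dtype=float)
    return sigma2_e * np.eye(x.size) + sigma2_v * np.outer(x, x)

def model_I_corr(x_i, x_j, sigma2_e, sigma2_v):
    """Within-domain correlation of two different units, equation (10)."""
    num = sigma2_v * x_i * x_j
    den = np.sqrt(sigma2_e + sigma2_v * x_i ** 2) * np.sqrt(sigma2_e + sigma2_v * x_j ** 2)
    return num / den

x = np.array([1.0, 2.0, 4.0])             # illustrative auxiliary values in one domain
V = model_I_cov(x, sigma2_e=1.5, sigma2_v=0.25)

# Correlation matrix implied by V, to compare with the closed form (10)
sd = np.sqrt(np.diag(V))
corr_from_V = V / np.outer(sd, sd)
rho_12 = model_I_corr(x[0], x[1], 1.5, 0.25)

# Model II is the special case sigma_v^2 = 0: the correlation vanishes
rho_model_II = model_I_corr(x[0], x[1], 1.5, 0.0)
```

Note that under model I the correlation grows with the values of the auxiliary variable, so it is not constant within a domain.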

Superpopulation model III. Let us assume that the random variables $Y_{id}$ ($i = 1, \ldots, N$; $d = 1, \ldots, D$) are independent and

$$E_\xi(Y_{id}) = \beta_d x_{id}, \qquad D^2_\xi(Y_{id}) = \sigma_{ed}^2. \tag{12}$$

In this case $\beta_d$ is fixed, unlike in (5) where it is random. In this model we assume that the random variables $Y_{id}$ ($i = 1, \ldots, N$; $d = 1, \ldots, D$) are uncorrelated, similarly to the superpopulation model II (given by (11)) but unlike in the superpopulation model I (given by (5) or (9)). What is more, in the superpopulation models I and II the variance $\sigma_e^2$ is the same for all population elements, but in the superpopulation model III the variance (denoted in this case by $\sigma_{ed}^2$, $d = 1, \ldots, D$) may vary between domains.

3. BLUPs and their MSEs

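In this section we present a theorem which gives the formulae of the BLU predictor and its MSE, together with their special cases for the superpopulation models presented in section 2.

Theorem 1. (Royall, 1976). Assume that the population data obey the general linear model (see the equation (1)). Among the linear, model-unbiased predictors $\hat{\theta} = g_s^T Y_s$ of a linear combination of random variables $\theta = \gamma^T Y$ (where $\gamma = [\gamma_s^T, \gamma_r^T]^T$) the MSE is minimized by:

$$\hat{\theta}_{BLU} = \gamma_s^T Y_s + \gamma_r^T \left[ X_r \hat{\beta} + V_{rs} V_{ss}^{-1} \left( Y_s - X_s \hat{\beta} \right) \right], \tag{13}$$

where

$$\hat{\beta} = \left( X_s^T V_{ss}^{-1} X_s \right)^{-1} X_s^T V_{ss}^{-1} Y_s. \tag{14}$$

The MSE of $\hat{\theta}_{BLU}$ is given by

$$MSE_\xi(\hat{\theta}_{BLU}) = Var_\xi(\hat{\theta}_{BLU} - \theta) = g_1(\delta) + g_2(\delta), \tag{15}$$

where

$$g_1(\delta) = \gamma_r^T \left( V_{rr} - V_{rs} V_{ss}^{-1} V_{sr} \right) \gamma_r, \tag{16}$$

$$g_2(\delta) = \gamma_r^T \left( X_r - V_{rs} V_{ss}^{-1} X_s \right) \left( X_s^T V_{ss}^{-1} X_s \right)^{-1} \left( X_r - V_{rs} V_{ss}^{-1} X_s \right)^T \gamma_r. \tag{17}$$

The proof of the theorem is presented in detail, for example, by Valliant, Dorfman, Royall (2000), pp. 29–30. In the paper we consider the problem of prediction of the domain total, hence the $i$-th element of the vector $\gamma$ is given by:

$$\gamma_i = \begin{cases} 1 & \text{for } i \in \Omega_{d^*} \\ 0 & \text{otherwise} \end{cases} \tag{18}$$

BLUP and its MSE for the superpopulation model I. To derive a special case of the BLU predictor (13) for the superpopulation model (5) we note that under (5):

$$V_{ss} = \mathrm{diag}_{1 \le d \le D}\left( \sigma_e^2 I_{n_d} + \sigma_v^2 x_{n_d} x_{n_d}^T \right) \tag{19}$$

$$V_{rs} = \sigma_v^2\, \mathrm{diag}_{1 \le d \le D}\left( x_{N_{rd}} x_{n_d}^T \right) \tag{20}$$

$$V_{ss}^{-1} = \sigma_e^{-2}\, \mathrm{diag}_{1 \le d \le D}\left( I_{n_d} - b_d^{-1} \sigma_v^2 x_{n_d} x_{n_d}^T \right) \tag{21}$$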


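$$\left( X_s^T V_{ss}^{-1} X_s \right)^{-1} X_s^T V_{ss}^{-1} Y_s = \left( \sum_{d=1}^{D} b_d^{-1} \sum_{i \in s_d} x_i Y_i \right) \left( \sum_{d=1}^{D} b_d^{-1} \sum_{i \in s_d} x_i^2 \right)^{-1} \tag{22}$$

where

$$b_d = \sigma_e^2 + \sigma_v^2 \sum_{i \in s_d} x_i^2. \tag{23}$$

Hence, the BLU predictor (13) of the domain total under the superpopulation model (5) simplifies to:

$$\hat{\theta}_{BLU} = \sum_{i \in s_{d^*}} Y_i + \hat{\beta} \sum_{i \in \Omega_{rd^*}} x_i + b_{d^*}^{-1} \sigma_v^2 \left( \sum_{i \in \Omega_{rd^*}} x_i \right) \left( \sum_{i \in s_{d^*}} x_i^2 \right) \left( \hat{\beta}_{d^*} - \hat{\beta} \right) \tag{24}$$

where

$$\hat{\beta} = \left( \sum_{d=1}^{D} b_d^{-1} \sum_{i \in s_d} x_i Y_i \right) \left( \sum_{d=1}^{D} b_d^{-1} \sum_{i \in s_d} x_i^2 \right)^{-1} \tag{25}$$

$$\hat{\beta}_{d^*} = \left( \sum_{i \in s_{d^*}} x_i Y_i \right) \left( \sum_{i \in s_{d^*}} x_i^2 \right)^{-1} \tag{26}$$

To derive (15) under the superpopulation model (5) we note that under (5):

$$\gamma_r^T V_{rs} V_{ss}^{-1} = \sigma_v^2 b_{d^*}^{-1} \left( \sum_{i \in \Omega_{rd^*}} x_i \right) \left( \gamma_s * x_n \right)^T \tag{27}$$

$$\gamma_r^T V_{rs} V_{ss}^{-1} V_{sr} \gamma_r = \sigma_v^4 b_{d^*}^{-1} \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \left( \sum_{i \in s_{d^*}} x_i^2 \right) \tag{28}$$

$$\gamma_r^T V_{rr} \gamma_r = N_{rd^*} \sigma_e^2 + \sigma_v^2 \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \tag{29}$$

$$\gamma_r^T V_{rs} V_{ss}^{-1} X_s = \sigma_v^2 b_{d^*}^{-1} \left( \sum_{i \in \Omega_{rd^*}} x_i \right) \left( \sum_{i \in s_{d^*}} x_i^2 \right) \tag{30}$$

$$\gamma_r^T \left( X_r - V_{rs} V_{ss}^{-1} X_s \right) = \sigma_e^2 b_{d^*}^{-1} \left( \sum_{i \in \Omega_{rd^*}} x_i \right) \tag{31}$$

where $x_n$ is the $n \times 1$ vector of values of the auxiliary variable for sampled elements of the population and the symbol $*$ denotes the Hadamard product (see e.g. Magnus & Neudecker, 1988, pp. 45–46). Hence, the MSE of the BLUP given by (15) under the superpopulation model (5) simplifies to: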


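$$MSE_\xi(\hat{\theta}_{BLU}) = Var_\xi(\hat{\theta}_{BLU} - \theta) = g_1(\delta) + g_2(\delta), \tag{32}$$

where

$$g_1(\delta) = N_{rd^*} \sigma_e^2 + \sigma_e^2 \sigma_v^2 b_{d^*}^{-1} \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2, \tag{33}$$

$$g_2(\delta) = \left( \sum_{d=1}^{D} b_d^{-1} \sum_{i \in s_d} x_i^2 \right)^{-1} \left( \sigma_e^2 b_{d^*}^{-1} \sum_{i \in \Omega_{rd^*}} x_i \right)^2. \tag{34}$$

Importantly, both the BLUP given by (24) and its MSE given by (32) depend on the parameters $\delta = [\sigma_e^2, \sigma_v^2]^T$, which are unknown in practice.

BLUP and its MSE for the superpopulation model II. Under (11) the equations (24) and (32) simplify to:

$$\hat{\theta}_{BLU} = \sum_{i \in s_{d^*}} Y_i + \tilde{\beta} \sum_{i \in \Omega_{rd^*}} x_i, \tag{35}$$

$$MSE_\xi(\hat{\theta}_{BLU}) = \sigma_e^2 N_{rd^*} + \sigma_e^2 \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \left( \sum_{i \in s} x_i^2 \right)^{-1} \tag{36}$$

where $\tilde{\beta} = \left( \sum_{i \in s} x_i^2 \right)^{-1} \sum_{i \in s} x_i Y_i$. In this case only the MSE given by (36) depends on a parameter that is unknown in practice, namely $\sigma_e^2$. Note that the predictor (35) is $\xi$-unbiased for the superpopulation model I.

BLUP and its MSE for the superpopulation model III. Under (12) the equations (24) and (32) simplify to:

$$\hat{\theta}_{BLU} = \sum_{i \in s_{d^*}} Y_i + \hat{\beta}_{d^*} \sum_{i \in \Omega_{rd^*}} x_i \tag{37}$$

$$MSE_\xi(\hat{\theta}_{BLU}) = \sigma_{ed^*}^2 N_{rd^*} + \sigma_{ed^*}^2 \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \left( \sum_{i \in s_{d^*}} x_i^2 \right)^{-1} \tag{38}$$

where $\hat{\beta}_{d^*}$ is given by (26). Similarly to the previous case, only the MSE (given by (38)) depends on a parameter that is unknown in practice, namely $\sigma_{ed^*}^2$. Note that the predictor (37) is $\xi$-unbiased for the superpopulation model I.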
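The agreement between the general matrix formulae (13)-(17) and the closed forms (24), (33) and (34) under model I can be verified numerically. The sketch below uses a small artificial population with equal domain sizes and arbitrary parameter values; all of these numbers are assumptions made for illustration and are unrelated to the data analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_d, n_d = 4, 6, 3                       # hypothetical: equal domain and sample sizes
sigma2_e, sigma2_v, beta = 1.0, 0.3, 2.0    # illustrative parameter values
d_star = 1                                  # domain of interest (0-based index)

dom = np.repeat(np.arange(D), N_d)          # domain label of each population unit
x = rng.uniform(1.0, 3.0, size=D * N_d)
in_s = np.tile(np.arange(N_d) < n_d, D)     # first n_d units of each domain are sampled

# One realization of Y from model (5)
v = rng.normal(0.0, np.sqrt(sigma2_v), size=D)
Y = (beta + v[dom]) * x + rng.normal(0.0, np.sqrt(sigma2_e), size=D * N_d)

# Full covariance matrix (8) and its sampled / non-sampled partition
same = dom[:, None] == dom[None, :]
V = sigma2_e * np.eye(D * N_d) + sigma2_v * np.outer(x, x) * same
Vss, Vrs, Vrr = V[np.ix_(in_s, in_s)], V[np.ix_(~in_s, in_s)], V[np.ix_(~in_s, ~in_s)]
Xs, Xr, Ys = x[in_s][:, None], x[~in_s][:, None], Y[in_s]
gamma = (dom == d_star).astype(float)       # equation (18)
g_s, g_r = gamma[in_s], gamma[~in_s]

# General formulae (13)-(17)
Vss_inv = np.linalg.inv(Vss)
A = Xs.T @ Vss_inv @ Xs
beta_hat = np.linalg.solve(A, Xs.T @ Vss_inv @ Ys)
theta_general = g_s @ Ys + g_r @ (Xr @ beta_hat + Vrs @ Vss_inv @ (Ys - Xs @ beta_hat))
g1_general = g_r @ (Vrr - Vrs @ Vss_inv @ Vrs.T) @ g_r
K = Xr - Vrs @ Vss_inv @ Xs
g2_general = g_r @ K @ np.linalg.inv(A) @ K.T @ g_r

# Closed forms (23)-(26), (33), (34)
sx2 = np.array([np.sum(x[in_s & (dom == d)] ** 2) for d in range(D)])
sxy = np.array([np.sum((x * Y)[in_s & (dom == d)]) for d in range(D)])
b = sigma2_e + sigma2_v * sx2
beta_hat_cf = np.sum(sxy / b) / np.sum(sx2 / b)
beta_hat_dstar = sxy[d_star] / sx2[d_star]
xr_sum = np.sum(x[~in_s & (dom == d_star)])
N_r = np.sum(~in_s & (dom == d_star))
theta_cf = (np.sum(Y[in_s & (dom == d_star)]) + beta_hat_cf * xr_sum
            + sigma2_v / b[d_star] * xr_sum * sx2[d_star] * (beta_hat_dstar - beta_hat_cf))
g1_cf = N_r * sigma2_e + sigma2_e * sigma2_v / b[d_star] * xr_sum ** 2
g2_cf = (sigma2_e / b[d_star] * xr_sum) ** 2 / np.sum(sx2 / b)
```

The closed forms avoid inverting the $n \times n$ matrix $V_{ss}$, which matters when the sample is large.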


4. EBLUPs, their MSEs and estimators of MSEs

At the beginning of this section we note that the BLUPs for the superpopulation models II and III do not depend on parameters that are unknown in practice. In these cases we need only unbiased estimators of $\sigma_e^2$ and $\sigma_{ed^*}^2$ to obtain unbiased estimators of the MSEs given by (36) and (38) respectively (which are linear functions of these parameters). Unbiased estimators of $\sigma_e^2$ and $\sigma_{ed^*}^2$ for the superpopulation models II and III respectively are given by:

$$\hat{\sigma}_e^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( Y_i - x_i \tilde{\beta} \right)^2 \tag{39}$$

$$\hat{\sigma}_{ed^*}^2 = \frac{1}{n_{d^*} - 1} \sum_{i=1}^{n_{d^*}} \left( Y_i - x_i \hat{\beta}_{d^*} \right)^2 \tag{40}$$

Let us discuss the problem of prediction of the domain total under the superpopulation model I. The BLU predictor (24) depends on the variance parameters $\delta = [\sigma_e^2, \sigma_v^2]^T$, which are unknown in practical applications. Replacing $\delta$ by an estimator $\hat{\delta}$, we obtain a two-stage predictor called the empirical best linear unbiased predictor (the EBLU predictor). It is denoted by $\hat{\theta}_{EBLU}$ and it remains unbiased if (i) $E(\hat{\theta}_{EBLU})$ is finite; (ii) $\hat{\delta}$ is any even, translation-invariant estimator of $\delta$, that is $\hat{\delta}(Y_s) = \hat{\delta}(-Y_s)$ and $\hat{\delta}(Y_s - X_s b) = \hat{\delta}(Y_s)$ for all $Y_s$ and $b$; (iii) the distributions of $v$ and $e$ are both symmetric around 0 (not necessarily normal). This problem is discussed for Royall's predictors by Żądło (2004) and for Henderson's predictors by Kackar and Harville (1981). We should stress that many standard procedures for estimating $\delta$ (including maximum likelihood and restricted maximum likelihood) yield even, translation-invariant estimators (Kackar and Harville, 1981).

To obtain the MSE of the EBLUP for our case we adopt the results of Datta and Lahiri (2000) for Henderson's EBLUP. Under the general linear mixed model with the block-diagonal variance-covariance matrix we assume that $D$ is large and we neglect all terms of order $o(D^{-1})$. What is more, the normality of random components and the following regularity conditions are assumed:

a) the elements of $X_s$ and $Z_s$ are uniformly bounded such that $X_s^T V_{ss}^{-1} X_s = [O(D)]_{p \times p}$,

b) $\sup_{d \ge 1} n_d < \infty$ and $\sup_{d \ge 1} h_d < \infty$,

c) $X_r^T \gamma_r - X_s^T V_{ss}^{-1} V_{sr} \gamma_r = [O(1)]_{p \times 1}$,

d) $\frac{\partial}{\partial \delta_k} X_s^T V_{ss}^{-1} V_{sr} \gamma_r = [O(1)]_{p \times 1}$ for $k = 1, \ldots, q$,

e) $R_{sd}(\delta) = \sum_{j=0}^{q} \delta_j C_{dj} C_{dj}^T$ and $G_d(\delta) = \sum_{j=0}^{q} \delta_j F_{dj} F_{dj}^T$, where $\delta_0 = 1$, and $C_{dj}$ and $F_{dj}$ ($d = 1, \ldots, D$; $j = 0, \ldots, q$) are known matrices of order $n_d \times h_d$ and $h_d \times h_d$ respectively. The elements of the matrices $C_{dj}$ and $F_{dj}$ are uniformly bounded known constants such that $R_{sd}$ and $G_d$ ($d = 1, \ldots, D$) are all positive definite matrices. (In special cases, some of $C_{dj}$ and $F_{dj}$ may be null matrices.)

f) $\hat{\delta}$ is an estimator of $\delta$ which satisfies (i) $\hat{\delta} - \delta = O_p(D^{-0.5})$, (ii) $\hat{\delta} - \hat{\delta}^{ML} = O_p(D^{-1})$, (iii) $\hat{\delta}(Y_s) = \hat{\delta}(-Y_s)$, (iv) $\hat{\delta}(Y_s - X_s b) = \hat{\delta}(Y_s)$ for any $b$ and all $Y_s$, where $\hat{\delta}^{ML}$ is the maximum likelihood (ML) estimator of $\delta$.

Conditions a), b), e) and f) are assumed by Datta and Lahiri (2000), who discussed the MSE of the EBLUP of the form of the BLUP proposed by Henderson (1950). Conditions c) and d) may be treated as modifications of the assumptions c) and d) proposed by Datta and Lahiri (2000).

Under these assumptions, and replacing $m^T G Z_s^T V_{ss}^{-1}$ in the proof presented by Datta and Lahiri (2000) by $\gamma_r^T V_{rs} V_{ss}^{-1}$, we obtain that the MSE of Royall's EBLUP (i.e. the MSE of the predictor (13) where $\delta$ is replaced by its estimator $\hat{\delta}$), in the case when $\hat{\delta}$ is the maximum likelihood (ML) or the restricted maximum likelihood (REML) estimator, is given by:

$$MSE_\xi(\hat{\theta}_{EBLU}(\hat{\delta})) = g_1(\delta) + g_2(\delta) + g_3^*(\delta) + o(D^{-1}) \tag{41}$$

where $g_1(\delta)$, $g_2(\delta)$ are given by (16) and (17) respectively and

$$g_3^*(\delta) = tr\left( \frac{\partial c^T}{\partial \delta} V_{ss} \left( \frac{\partial c^T}{\partial \delta} \right)^T I_\delta^{-1} \right), \tag{42}$$

$$c^T = \gamma_r^T V_{rs} V_{ss}^{-1}, \tag{43}$$

$$\frac{\partial c^T}{\partial \delta} = \mathrm{col}_{1 \le k \le q} \frac{\partial c^T}{\partial \delta_k} = \mathrm{col}_{1 \le k \le q} \frac{\partial \gamma_r^T V_{rs} V_{ss}^{-1}}{\partial \delta_k}, \tag{44}$$

$$\mathrm{col}_{1 \le k \le q}\, a_k^T = \begin{bmatrix} a_1 & \ldots & a_q \end{bmatrix}^T, \tag{45}$$

$$I_\delta = \left( -E_\xi \left[ \frac{\partial^2 l}{\partial \delta_i \partial \delta_j} \right] \right)_{q \times q}, \tag{46}$$

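and $l$ is the log-likelihood assuming a multivariate normal distribution of the random variables $Y_1, \ldots, Y_N$. Note that the $ij$-th element of $I_\delta$ may be derived using the following equation (e.g. Rao, 2003, p. 100):

$$I_{ij} = \frac{1}{2} tr\left( V_{ss}^{-1} \frac{\partial V_{ss}}{\partial \delta_i} V_{ss}^{-1} \frac{\partial V_{ss}}{\partial \delta_j} \right) \tag{47}$$

Now we adopt the MSE estimator presented by Datta and Lahiri (2000) for our case. It is given by:

$$\widehat{MSE}_\xi\left( \hat{\theta}_{EBLU}(\hat{\delta}) \right) = g_1(\hat{\delta}) + g_2(\hat{\delta}) + 2 g_3^*(\hat{\delta}) - B_{\hat{\delta}}^T(\hat{\delta}) \frac{\partial g_1(\hat{\delta})}{\partial \delta}, \tag{48}$$

where $g_1(\hat{\delta})$, $g_2(\hat{\delta})$, $g_3^*(\hat{\delta})$ are given by (16), (17) and (42) respectively with $\delta$ replaced by the ML or the REML estimator $\hat{\delta}$,

$$B_{\hat{\delta}}(\delta) = E_\xi(\hat{\delta}) - \delta, \tag{49}$$

and $B_{\hat{\delta}}(\hat{\delta})$ is given by (49) where $\delta$ is replaced by the estimator $\hat{\delta}$. It is important to note that the estimator (48) is approximately unbiased in the sense that:

$$E_\xi\left( \widehat{MSE}_\xi\left( \hat{\theta}_{EBLU}(\hat{\delta}) \right) \right) = MSE_\xi\left( \hat{\theta}_{EBLU}(\hat{\delta}) \right) + o(D^{-1}) \tag{50}$$

If we use the REML estimator $\hat{\delta}$ then

$$B_{\hat{\delta}_{REML}}(\delta) = o(D^{-1}) \tag{51}$$

and we obtain from (48) the following estimator, which also fulfils (50):

$$\widehat{MSE}_\xi\left( \hat{\theta}_{EBLU}(\hat{\delta}) \right) = g_1(\hat{\delta}) + g_2(\hat{\delta}) + 2 g_3^*(\hat{\delta}) \tag{52}$$

To derive the special case of the MSE estimator (52) under (5) for the REML estimator $\hat{\delta}$ of $\delta$ we obtain:

$$\frac{\partial \gamma_r^T V_{rs} V_{ss}^{-1}}{\partial \sigma_e^2} = -\left( \sum_{i \in \Omega_{rd^*}} x_i \right) \sigma_v^2 b_{d^*}^{-2} \left( \gamma_s * x_n \right)^T \tag{53}$$

$$\frac{\partial \gamma_r^T V_{rs} V_{ss}^{-1}}{\partial \sigma_v^2} = \left( \sum_{i \in \Omega_{rd^*}} x_i \right) \sigma_e^2 b_{d^*}^{-2} \left( \gamma_s * x_n \right)^T \tag{54}$$

and hence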

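$$\frac{\partial c^T}{\partial \delta} = \begin{bmatrix} -\left( \sum_{i \in \Omega_{rd^*}} x_i \right) \sigma_v^2 b_{d^*}^{-2} \left( \gamma_s * x_n \right)^T \\ \left( \sum_{i \in \Omega_{rd^*}} x_i \right) \sigma_e^2 b_{d^*}^{-2} \left( \gamma_s * x_n \right)^T \end{bmatrix} \tag{55}$$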


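$$\frac{\partial c^T}{\partial \delta} V_{ss} \left( \frac{\partial c^T}{\partial \delta} \right)^T = \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \left( \sum_{i \in s_{d^*}} x_i^2 \right) b_{d^*}^{-3} \begin{bmatrix} \sigma_v^4 & -\sigma_e^2 \sigma_v^2 \\ -\sigma_e^2 \sigma_v^2 & \sigma_e^4 \end{bmatrix} \tag{56}$$

Then we derive the elements of $I_\delta^{-1}$ using (47) and assuming a normal distribution of the random components. It is given by:

$$I_\delta^{-1} = \begin{bmatrix} I_{vv} & I_{ve} \\ I_{ve} & I_{ee} \end{bmatrix} \tag{57}$$

where

$$I_{vv} = 2 b^{-1} \sum_{d=1}^{D} b_d^{-2} \left( \sum_{i \in s_d} x_i^2 \right)^2 \tag{58}$$

$$I_{ve} = -2 b^{-1} \sum_{d=1}^{D} b_d^{-2} \left( \sum_{i \in s_d} x_i^2 \right) \tag{59}$$

$$I_{ee} = 2 b^{-1} \sum_{d=1}^{D} \left( (n_d - 1) \sigma_e^{-4} + b_d^{-2} \right) \tag{60}$$

$$b = \left( \sum_{d=1}^{D} \left( (n_d - 1) \sigma_e^{-4} + b_d^{-2} \right) \right) \left( \sum_{d=1}^{D} b_d^{-2} \left( \sum_{i \in s_d} x_i^2 \right)^2 \right) - \left( \sum_{d=1}^{D} b_d^{-2} \sum_{i \in s_d} x_i^2 \right)^2 \tag{61}$$

Using (57) and (56) for the model (5) we obtain:

$$g_3^*(\delta) = tr\left( \frac{\partial c^T}{\partial \delta} V_{ss} \left( \frac{\partial c^T}{\partial \delta} \right)^T I_\delta^{-1} \right) = b_{d^*}^{-3} \left( \sum_{i \in s_{d^*}} x_i^2 \right) \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \left( \sigma_v^4 I_{vv} - 2 \sigma_e^2 \sigma_v^2 I_{ve} + \sigma_e^4 I_{ee} \right) \tag{62}$$

where $I_{vv}$, $I_{ve}$, $I_{ee}$ are given by (58), (59) and (60) respectively. Finally, using (41), (33), (34) and (62), the MSE of the EBLUP for the model (5), when $\delta = [\sigma_e^2, \sigma_v^2]^T$ is estimated by $\hat{\delta} = [\hat{\sigma}_e^2, \hat{\sigma}_v^2]^T$ using the restricted maximum likelihood (REML) method, is given by:

$$\begin{aligned} MSE_\xi(\hat{\theta}_{EBLU}(\hat{\delta})) &= g_1(\delta) + g_2(\delta) + g_3^*(\delta) + o(D^{-1}) \\ &= N_{rd^*} \sigma_e^2 + \sigma_e^2 \sigma_v^2 b_{d^*}^{-1} \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 + \left( \sum_{d=1}^{D} b_d^{-1} \sum_{i \in s_d} x_i^2 \right)^{-1} \left( \sigma_e^2 b_{d^*}^{-1} \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \\ &\quad + b_{d^*}^{-3} \left( \sum_{i \in s_{d^*}} x_i^2 \right) \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \left( \sigma_v^4 I_{vv} - 2 \sigma_e^2 \sigma_v^2 I_{ve} + \sigma_e^4 I_{ee} \right) + o(D^{-1}) \end{aligned} \tag{63}$$

From (52), the estimator of (63) is given by

$$\begin{aligned} \widehat{MSE}_\xi(\hat{\theta}_{EBLU}(\hat{\delta})) &= g_1(\hat{\delta}) + g_2(\hat{\delta}) + 2 g_3^*(\hat{\delta}) \\ &= N_{rd^*} \hat{\sigma}_e^2 + \hat{\sigma}_e^2 \hat{\sigma}_v^2 \hat{b}_{d^*}^{-1} \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 + \left( \sum_{d=1}^{D} \hat{b}_d^{-1} \sum_{i \in s_d} x_i^2 \right)^{-1} \left( \hat{\sigma}_e^2 \hat{b}_{d^*}^{-1} \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \\ &\quad + 2 \hat{b}_{d^*}^{-3} \left( \sum_{i \in s_{d^*}} x_i^2 \right) \left( \sum_{i \in \Omega_{rd^*}} x_i \right)^2 \left( \hat{\sigma}_v^4 \hat{I}_{vv} - 2 \hat{\sigma}_e^2 \hat{\sigma}_v^2 \hat{I}_{ve} + \hat{\sigma}_e^4 \hat{I}_{ee} \right) \end{aligned} \tag{64}$$

where $\hat{\sigma}_e^2$ and $\hat{\sigma}_v^2$ are the restricted maximum likelihood (REML) estimators, $\hat{b}_d = \hat{\sigma}_e^2 + \hat{\sigma}_v^2 \sum_{i \in s_d} x_i^2$, and $\hat{I}_{vv}$, $\hat{I}_{ve}$, $\hat{I}_{ee}$ are given by (58), (59) and (60) respectively, where $\delta = [\sigma_e^2, \sigma_v^2]^T$ is replaced by the REML estimator $\hat{\delta} = [\hat{\sigma}_e^2, \hat{\sigma}_v^2]^T$.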
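The closed forms (57)-(62) can be compared with a direct numerical evaluation of (47) and (42). The following sketch assumes three hypothetical sampled domains, arbitrary variance parameters and an arbitrary value of $\sum_{i \in \Omega_{rd^*}} x_i$; it only illustrates the algebra and is not the code used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2_e, sigma2_v = 1.2, 0.4                 # illustrative variance parameters
n_sizes = [3, 4, 5]                           # hypothetical sample sizes n_d
xs = [rng.uniform(1.0, 3.0, size=m) for m in n_sizes]

# Information matrix from (47), summed over independent domain blocks.
# Parameter order: delta = (sigma_e^2, sigma_v^2).
J = np.zeros((2, 2))
for x in xs:
    V_inv = np.linalg.inv(sigma2_e * np.eye(x.size) + sigma2_v * np.outer(x, x))
    dV = [np.eye(x.size), np.outer(x, x)]     # dV/d sigma_e^2 and dV/d sigma_v^2
    for i in range(2):
        for j in range(2):
            J[i, j] += 0.5 * np.trace(V_inv @ dV[i] @ V_inv @ dV[j])
J_inv = np.linalg.inv(J)

# Closed forms (58)-(61)
sx2 = np.array([np.sum(x ** 2) for x in xs])
n = np.array(n_sizes)
b_d = sigma2_e + sigma2_v * sx2
S_ee = np.sum((n - 1) / sigma2_e ** 2 + b_d ** -2.0)
S_vv = np.sum(b_d ** -2.0 * sx2 ** 2)
S_ev = np.sum(b_d ** -2.0 * sx2)
b = S_ee * S_vv - S_ev ** 2                   # (61)
I_vv, I_ve, I_ee = 2.0 / b * S_vv, -2.0 / b * S_ev, 2.0 / b * S_ee
I_closed = np.array([[I_vv, I_ve], [I_ve, I_ee]])   # (57)

# g3* for domain d*: general trace form (42) with (55) versus closed form (62).
# The derivative (55) is zero outside domain d*, so only that block of V_ss is needed.
d_star, xr_sum = 0, 7.5                       # xr_sum: hypothetical sum of x over non-sampled units
x_star = xs[d_star]
V_star = sigma2_e * np.eye(x_star.size) + sigma2_v * np.outer(x_star, x_star)
dc = xr_sum * b_d[d_star] ** -2.0 * np.vstack([-sigma2_v * x_star, sigma2_e * x_star])
g3_general = np.trace(dc @ V_star @ dc.T @ J_inv)
g3_closed = (b_d[d_star] ** -3.0 * sx2[d_star] * xr_sum ** 2
             * (sigma2_v ** 2 * I_vv - 2 * sigma2_e * sigma2_v * I_ve + sigma2_e ** 2 * I_ee))
```

In this parameter ordering the element written $I_{vv}$ in (57) occupies the first diagonal position of $I_\delta^{-1}$, which is why it multiplies $\sigma_v^4$ in (62).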

5. Simulation study

In this section we present the results of a Monte Carlo simulation study prepared in the R language (R Development Core Team, 2005). We analyze agricultural data on 8624 farms from the Dąbrowa Tarnowska region in Poland obtained in 1996. The region is divided into D=79 villages and towns treated as domains of sizes between 20 and 610 farms. We draw one simple random sample without replacement of 862 farms from the population of 8624 farms, which gives one division of the vector of the auxiliary variable (total area in 100 square meters) into $X_s$ and $X_r$. Realizations of the random sample sizes in domains are between 2 and 66 farms, which means that the direct predictor (37) yields an estimate of the total for each of the domains. We generate 5 000 sets of values of the variable of interest (sowing area in 100 square meters) for both the sampled and the unsampled part of the population, based on the superpopulation model (5) with $\sigma_e^2$ and $\sigma_v^2$ obtained from the entire population data and assuming normality of random components. We study the accuracy of the following predictors in the simulation study:
• the predictor (24) assuming that $\sigma_e^2$ and $\sigma_v^2$ are known, which is the BLUP under model (5) (it will be denoted by BLUP),
• the predictor (24) where $\sigma_e^2$ and $\sigma_v^2$ are replaced by their estimates (based on the sample data using REML), which is the EBLUP under model (5) (EBLUP),
• the indirect predictor (35) (IP),
• the direct predictor (37) (DP).
We study the accuracy of the predictors IP and DP under (5) to check their accuracy in the case of model misspecification (the IP and DP are BLUPs under models which do not fulfil (5)).


The purpose of the Monte Carlo analysis is to compare the accuracy of the EBLUP and the BLUP under real data (i.e. we want to check the loss of accuracy due to the estimation of $\sigma_e^2$ and $\sigma_v^2$). What is more, we would like to compare the accuracy of the EBLUP and the predictors IP and DP (i.e. we want to compare the higher MSE of the EBLUP due to the estimation of $\sigma_e^2$ and $\sigma_v^2$ with the MSE of the predictors IP and DP, which are not BLUPs under (5)).

We introduce the notation used in the simulation study. The relative bias of the predictor (in %) is the Monte Carlo estimate of $100\, E_\xi(\hat{\theta} - \theta) \left( E_\xi(\theta) \right)^{-1}$. The relative root of prediction variance (in %) is the Monte Carlo estimate of $100 \sqrt{Var_\xi(\hat{\theta} - \theta)} \left( E_\xi(\theta) \right)^{-1}$. The relative root mean square error (the relative RMSE) (in %) is the Monte Carlo estimate of $100 \sqrt{MSE_\xi(\hat{\theta})} \left( E_\xi(\theta) \right)^{-1} = 100 \sqrt{E_\xi(\hat{\theta} - \theta)^2} \left( E_\xi(\theta) \right)^{-1}$. The relative bias of the MSE estimator is the Monte Carlo estimate of $100 \left( E_\xi\left( \widehat{MSE}_\xi(\hat{\theta}) \right) - MSE_\xi(\hat{\theta}) \right) \left( MSE_\xi(\hat{\theta}) \right)^{-1}$, with $\widehat{MSE}_\xi(\hat{\theta})$ given:
• for the EBLUP by (64),
• for the predictor IP by (36) where $\sigma_e^2$ is replaced by $\hat{\sigma}_e^2$ given by (39),
• for the predictor DP by (38) where $\sigma_{ed}^2$ is replaced by $\hat{\sigma}_{ed}^2$ given by (40).
Notice that for the BLUP given by (24) (assuming that $\sigma_e^2$ and $\sigma_v^2$ are known) the MSE given by (32) is known (as long as $\sigma_e^2$ and $\sigma_v^2$ are known). For this reason we do not present results for the MSE estimator of the BLUP in the following table.


Table 1. Simulation results for 79 domains.

Measure of accuracy           Statistic    IP           DP           EBLUP        BLUP

relative bias                 minimum      -0.83202*    -0.88732*    -0.33761*    -0.32020*
of predictor (in %)           Q1           -0.21136*    -0.06509*    -0.09054*    -0.07908*
                              Me           -0.01207*    -0.00471*    -0.01626*    -0.01331*
                              mean         -0.01527*    -0.01071*    -0.01403*    -0.01478*
                              Q3            0.19800*     0.06504*     0.05936*     0.05354*
                              maximum       0.68989*     0.48910*     0.42672*     0.42699*

relative root of              minimum      16.80910      2.46537      2.45817      2.45736
prediction variance (in %)    Q1           19.36090      6.34450      6.11996      6.11038
                              Me           20.20160      8.03350      7.55442      7.54589
                              mean         20.08900     10.05590      8.62303      8.60106
                              Q3           20.89410     12.56950     11.23930     11.19920
                              maximum      22.02110     35.50720     18.96420     18.88390

relative RMSE (in %)          minimum      16.80940      2.46550      2.45831      2.45750
                              Q1           19.36130      6.34551      6.12093      6.11152
                              Me           20.20340      8.03351      7.55444      7.54642
                              mean         20.09150     10.05690      8.62391      8.60193
                              Q3           20.90710     12.56990     11.24100     11.20100
                              maximum      22.02240     35.51830     18.96660     18.88650

MSE(.)/MSE(BLUP)              minimum       1.33729      1.00652      1.00026      1
                              Q1            3.57672      1.06904      1.00210      1
                              Me            6.67045      1.12414      1.00330      1
                              mean          8.47238      1.26475      1.00423      1
                              Q3           10.49050      1.29274      1.00669      1
                              maximum      53.64280      3.53672      1.01113      1

relative bias of              minimum     -99.29800     -7.78448     -3.59443
MSE estimator (in %)          Q1          -97.32850     -1.25828     -1.21019
                              Me          -96.34490     -0.04516     -0.09572
                              mean        -94.97650     -0.11036     -0.04884
                              Q3          -94.03880      1.18389      0.85638
                              maximum     -76.98030      5.67985      3.96789

* real value equals 0 (model-unbiased predictors under superpopulation model I)
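The accuracy measures reported in Table 1 are Monte Carlo estimates of the quantities defined before the table. The sketch below shows one way to compute them from arrays of simulated predictions and simulated true domain totals; the function names and the toy numbers are hypothetical, and the study itself was carried out in R.

```python
import numpy as np

def accuracy_measures(theta_hat, theta):
    """Relative bias, relative root of prediction variance and relative RMSE (all in %)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta = np.asarray(theta, dtype=float)
    err = theta_hat - theta                  # prediction errors over replications
    scale = theta.mean()                     # Monte Carlo estimate of E(theta)
    rel_bias = 100.0 * err.mean() / scale
    rel_root_var = 100.0 * np.sqrt(err.var()) / scale
    rel_rmse = 100.0 * np.sqrt(np.mean(err ** 2)) / scale
    return rel_bias, rel_root_var, rel_rmse

def relative_bias_of_mse_estimator(mse_estimates, theta_hat, theta):
    """Relative bias (in %) of an MSE estimator against the Monte Carlo MSE."""
    err = np.asarray(theta_hat, dtype=float) - np.asarray(theta, dtype=float)
    mse_mc = np.mean(err ** 2)
    return 100.0 * (np.mean(mse_estimates) - mse_mc) / mse_mc

# Toy replications for one domain (illustrative numbers only)
theta = np.array([100.0, 100.0, 100.0, 100.0])
theta_hat = np.array([98.0, 102.0, 101.0, 103.0])
rb, rrv, rrmse = accuracy_measures(theta_hat, theta)
rb_mse = relative_bias_of_mse_estimator([4.5, 4.5], theta_hat, theta)
```

With these definitions the squared relative RMSE equals the sum of the squared relative bias and the squared relative root of prediction variance, which explains why the second and third panels of Table 1 are nearly identical when the bias is close to zero.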

Consider the simulation results presented in Table 1. In the Monte Carlo analysis the values of the relative bias, the relative root of prediction variance and the other statistics are obtained for four predictors in each of the 79 domains. Table 1 presents the minimum, first quartile (Q1), median (Me), mean, third quartile (Q3) and maximum of the results obtained for the 79 domains. Notice that the increase of the MSE due to the estimation of $\sigma_e^2$ and $\sigma_v^2$ (the difference between the MSE of the BLUP and the MSE of the EBLUP) for the considered real data is not high. Analyzing the values of the ratio of the MSE of the EBLUP to the MSE of the BLUP, we note that its maximum value equals 1.01113, which means that the MSE of the EBLUP exceeds the MSE of the BLUP by no more than 1.113% in all 79 domains. What is more, the EBLUP has a smaller MSE than the predictors IP and DP, which are not functions of unknown parameters but are not BLUPs under the considered mixed model. It means that in our case the loss of accuracy due to the estimation of variance components is smaller than the loss of accuracy due to the model misspecification. Importantly, the absolute value of the relative bias of the proposed estimator of the MSE of the EBLUP is not high: it does not exceed 3.96789%. The MSE estimators of the predictors IP and DP are not unbiased because they estimate the MSE under a different superpopulation model, and hence here they are used in the case of model misspecification.

REFERENCES

DATTA, G. S., LAHIRI, P. (2000), A unified measure of uncertainty of estimated best linear unbiased predictors in small area estimation problems, Statistica Sinica, 10, 613–627.

HENDERSON, C. R. (1950), Estimation of genetic parameters (Abstract), Annals of Mathematical Statistics, 21, 309–310.

KACKAR, R. N., HARVILLE, D. A. (1981), Unbiasedness of two-stage estimation and prediction procedures for mixed linear models, Communications in Statistics, Series A, 10, 1249–1261.

MAGNUS, J. R., NEUDECKER, H. (1988), Matrix differential calculus with applications in statistics and econometrics, New York: John Wiley & Sons.

PRASAD, N. G. N., RAO, J. N. K. (1990), The estimation of the mean squared error of small area estimators, Journal of the American Statistical Association, 85, 163–171.

R Development Core Team (2005), R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org.

RAO, J. N. K. (2003), Small area estimation. John Wiley & Sons, New York.

ROYALL, R. M. (1976), The linear least squares prediction approach to two-stage sampling. Journal of the American Statistical Association, 71, 657–664.

VALLIANT, R., DORFMAN, A. H., ROYALL, R. M. (2000), Finite population sampling and inference. A prediction approach, John Wiley & Sons, New York.

ŻĄDŁO, T. (2004), On unbiasedness of some EBLU predictor. In: Proceedings in Computational Statistics 2004, Antoch J. (Ed.), Physica-Verlag, Heidelberg-New York, 2019–2026.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No.6, pp. 1247—1264

ESTIMATION OF A POPULATION MEAN USING DIFFERENT IMPUTATION METHODS

M. S. Ahmed1, Omar Al-Titi2, Ziad Al-Rawi3 and Walid Abu-Dayyeh4

ABSTRACT

Several methods of imputation are suggested and their corresponding estimators of the population mean are considered. The bias and mean square error of each of the estimators are derived to the first order of approximation. These estimators are then compared with each other and with other well-known estimators using their biases and mean square errors. It turns out that some of the new estimators are more efficient than the well-known estimators. A real data example is used for illustration.

Key words: Estimation of mean, missing data, imputation, bias, mean squared error.

1. Introduction

Missing data is a common problem in sample surveys and imputation is frequently used to substitute values for missing data. Statisticians have recognized for some time that failure to account for the stochastic nature of incompleteness, in the form of missingness of data, can spoil inference. A natural question is what one needs to assume in order to justify ignoring the incompleteness mechanism. Rubin (1976) addressed three concepts: missing at random (MAR), observed at random (OAR) and parameter distinctness (PD). In the present investigation we implicitly assume that the data are missing completely at random (MCAR). Let $\bar{Y} = N^{-1} \sum_{i=1}^{N} y_i$ be the mean of the finite population under consideration. A Simple Random Sample Without Replacement (SRSWOR), $s$, of size $n$ is drawn from $\Omega = \{1, 2, \ldots, N\}$ to estimate the population mean $\bar{Y}$. Let $r$ be the number of responding units out of the sampled $n$

1 Sultan Qaboos University, Oman 2 King Fahd University of Petroleum & Minerals, Saudi Arabia 3 Yarmouk University, Jordan 4 King Fahd University of Petroleum & Minerals, Saudi Arabia.

units, and let the set of responding units be denoted by $R$ and that of non-responding units by $R^c$. For every unit $i \in R$, the value $y_i$ is observed. However, for the units $i \in R^c$, the $y_i$ values are missing and imputed values are to be derived. We assume that imputation is carried out with the aid of a quantitative auxiliary variable $x$ such that the value of $x$ for unit $i$ is $x_i$, known and positive for every $i \in s$. In other words, the data $x_s = \{ x_i : i \in s \}$ are known.

2. Some Available Methods of Imputation and Estimators

Some classical methods of imputation which are commonly used are as follows:

(1) Ratio method of imputation

The data after imputation take the form

$$y_{\bullet i} = \begin{cases} y_i & \text{if } i \in R \\ \hat{b} x_i & \text{if } i \in R^c \end{cases} \tag{2.1}$$

where $\hat{b} = \sum_{i \in R} y_i \big/ \sum_{i \in R} x_i$. Under this method of imputation, the point estimator of the population mean is given by

$$\bar{y}_s = \frac{1}{n} \sum_{i \in s} y_{\bullet i} \tag{2.2}$$

which can be written as:

$$\bar{y}_{RAT} = \bar{y}_r \frac{\bar{x}_n}{\bar{x}_r} \tag{2.3}$$

where $\bar{x}_n = n^{-1} \sum_{i \in s} x_i$, $\bar{x}_r = r^{-1} \sum_{i \in R} x_i$ and $\bar{y}_r = r^{-1} \sum_{i \in R} y_i$.

(2) Mean method of imputation

The data after imputation take the form

$$y_{\bullet i} = \begin{cases} y_i & \text{if } i \in R \\ \bar{y}_r & \text{if } i \in R^c \end{cases} \tag{2.4}$$

Under this method, the point estimator of the population mean $\bar{Y}$ is

$$\bar{y}_m = \frac{1}{r} \sum_{i \in R} y_i = \bar{y}_r \tag{2.5}$$

(3) Compromised method of imputation

Singh and Horn (2000) proposed a compromised imputation procedure where the data take the form

$$y_{\bullet i} = \begin{cases} \alpha \dfrac{n}{r} y_i + (1 - \alpha) \hat{b} x_i & \text{if } i \in R \\ (1 - \alpha) \hat{b} x_i & \text{if } i \in R^c \end{cases} \tag{2.6}$$

where $\alpha$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Here, we are also using information from imputed values for the responding units in addition to the non-responding units. The point estimator of the population mean $\bar{Y}$ under the compromised method of imputation is:

$$\bar{y}_{COMP} = \alpha \bar{y}_r + (1 - \alpha) \bar{y}_r \frac{\bar{x}_n}{\bar{x}_r} \tag{2.7}$$

In the following section, we suggest some methods of imputation and consider their corresponding estimators of the population mean.
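The three classical schemes can be sketched in a few lines. In the example below the function `impute`, the toy data and the value of $\alpha$ are hypothetical; the code fills in the data according to (2.1), (2.4) and (2.6) and confirms that the mean of the completed data reproduces the estimators (2.3), (2.5) and (2.7).

```python
import numpy as np

def impute(y, x, responded, method, alpha=0.5):
    """Return the completed data vector under one of the classical schemes."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    R = np.asarray(responded, dtype=bool)
    n, r = x.size, R.sum()
    b_hat = y[R].sum() / x[R].sum()          # ratio of respondent totals
    out = np.empty(n)
    if method == "ratio":                    # equation (2.1)
        out[R] = y[R]
        out[~R] = b_hat * x[~R]
    elif method == "mean":                   # equation (2.4)
        out[R] = y[R]
        out[~R] = y[R].mean()
    elif method == "compromised":            # equation (2.6)
        out[R] = alpha * n / r * y[R] + (1 - alpha) * b_hat * x[R]
        out[~R] = (1 - alpha) * b_hat * x[~R]
    else:
        raise ValueError(f"unknown method: {method}")
    return out

# Toy sample: y is missing (nan) for two of the five sampled units
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([5.0, 9.0, np.nan, 17.0, np.nan])
R = ~np.isnan(y)
y_r, x_r, x_n = y[R].mean(), x[R].mean(), x.mean()

est_ratio = impute(y, x, R, "ratio").mean()
est_mean = impute(y, x, R, "mean").mean()
est_comp = impute(y, x, R, "compromised", alpha=0.3).mean()
```

Setting `alpha=1` in the compromised scheme recovers the mean estimator and `alpha=0` recovers the ratio estimator.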

3. Some Proposed Methods of Imputation and their Estimators

In what follows, $y_{ji}$ denotes the $i$-th available observation for the $j$-th imputation method and $\bar{X}$ denotes the population mean of the auxiliary variable. We suggest the following imputation methods:

(1)
$$y_{1i} = \begin{cases} y_i & \text{if } i \in R \\ \dfrac{1}{n - r} \left[ n \bar{y}_r \left( \dfrac{\bar{X}}{\bar{x}_n} \right)^{\beta_1} - r \bar{y}_r \right] & \text{if } i \in R^c \end{cases} \tag{3.1}$$

where $\beta_1$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of $\bar{Y}$ is

$$t_1 = \bar{y}_r \left( \frac{\bar{X}}{\bar{x}_n} \right)^{\beta_1} \tag{3.2}$$

(2)
$$y_{2i} = \begin{cases} y_i & \text{if } i \in R \\ \dfrac{1}{n - r} \left[ n \bar{y}_r \left( \dfrac{\bar{x}_n}{\bar{x}_r} \right)^{\beta_2} - r \bar{y}_r \right] & \text{if } i \in R^c \end{cases} \tag{3.3}$$

where $\beta_2$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_2 = \bar{y}_r \left( \frac{\bar{x}_n}{\bar{x}_r} \right)^{\beta_2} \tag{3.4}$$

(3)
$$y_{3i} = \begin{cases} y_i & \text{if } i \in R \\ \dfrac{1}{n - r} \left[ n \bar{y}_r \left( \dfrac{\bar{X}}{\bar{x}_r} \right)^{\beta_3} - r \bar{y}_r \right] & \text{if } i \in R^c \end{cases} \tag{3.5}$$

where $\beta_3$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_3 = \bar{y}_r \left( \frac{\bar{X}}{\bar{x}_r} \right)^{\beta_3} \tag{3.6}$$

When $\beta_3 = 1$, then:

$$t_{Ratio} = \bar{y}_r \frac{\bar{X}}{\bar{x}_r} \tag{3.7}$$

and when $\beta_3 = -1$:

$$t_{Product} = \bar{y}_r \frac{\bar{x}_r}{\bar{X}} \tag{3.8}$$

The latter is a natural analogue of the ratio estimator, called the product estimator, used when an auxiliary variate $x$ has a negative correlation with $y$, where $x$ and $y$ are variates that take only positive values (Cochran, 1977).

(4)
$$y_{4i} = \begin{cases} y_i & \text{if } i \in R \\ \bar{y}_r + b_1 (x_i - \bar{x}_r) & \text{if } i \in R^c \end{cases} \tag{3.9}$$

where $b_1$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_4 = \bar{y}_r + b_1 (\bar{x}_n - \bar{x}_r) \tag{3.10}$$

(5)
$$y_{5i} = \begin{cases} y_i & \text{if } i \in R \\ \bar{y}_r + \dfrac{n b_2}{n - r} (\bar{X} - \bar{x}_n) & \text{if } i \in R^c \end{cases} \tag{3.11}$$

where $b_2$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_5 = \bar{y}_r + b_2 (\bar{X} - \bar{x}_n) \tag{3.12}$$

(6)
$$y_{6i} = \begin{cases} y_i & \text{if } i \in R \\ \bar{y}_r + \dfrac{n b_3}{n - r} (\bar{X} - \bar{x}_r) & \text{if } i \in R^c \end{cases} \tag{3.13}$$

where $b_3$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_6 = \bar{y}_r + b_3 (\bar{X} - \bar{x}_r) \tag{3.14}$$

(7)
$$y_{7i} = \begin{cases} y_i & \text{if } i \in R \\ \bar{y}_r + \dfrac{n k_1}{n - r} (\bar{X} - \bar{x}_n) + k_2 (x_i - \bar{x}_r) & \text{if } i \in R^c \end{cases} \tag{3.15}$$

where $k_1$ and $k_2$ are suitably chosen constants, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_7 = \bar{y}_r + k_1 (\bar{X} - \bar{x}_n) + k_2 (\bar{x}_n - \bar{x}_r) \tag{3.16}$$

(8)
$$y_{8i} = \begin{cases} y_i & \text{if } i \in R \\ \dfrac{\bar{y}_r \left( x_i + \dfrac{r}{n - r} \bar{x}_r \right)}{\theta_1 \bar{x}_r + (1 - \theta_1) \bar{x}_n} - \dfrac{r}{n - r} \bar{y}_r & \text{if } i \in R^c \end{cases} \tag{3.17}$$

where $\theta_1$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_8 = \frac{\bar{y}_r \bar{x}_n}{\theta_1 \bar{x}_r + (1 - \theta_1) \bar{x}_n} \tag{3.18}$$

(9)
$$y_{9i} = \begin{cases} y_i & \text{if } i \in R \\ \dfrac{1}{n - r} \left[ \dfrac{n \bar{y}_r \bar{X}}{\theta_2 \bar{x}_n + (1 - \theta_2) \bar{X}} - r \bar{y}_r \right] & \text{if } i \in R^c \end{cases} \tag{3.19}$$

where $\theta_2$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_9 = \frac{\bar{y}_r \bar{X}}{\theta_2 \bar{x}_n + (1 - \theta_2) \bar{X}} \tag{3.20}$$

(10)
$$y_{10i} = \begin{cases} y_i & \text{if } i \in R \\ \dfrac{1}{n - r} \left[ \dfrac{n \bar{y}_r \bar{X}}{\theta_3 \bar{x}_r + (1 - \theta_3) \bar{X}} - r \bar{y}_r \right] & \text{if } i \in R^c \end{cases} \tag{3.21}$$

where $\theta_3$ is a suitably chosen constant, such that the variance of the resultant estimator is minimum. Under this method, the point estimator of the population mean $\bar{Y}$ is

$$t_{10} = \frac{\bar{y}_r \bar{X}}{\theta_3 \bar{x}_r + (1 - \theta_3) \bar{X}} \tag{3.22}$$

Here we have considered the design-based approach to compare the proposed strategy with the existing strategies. In the next section, we give the biases and mean squared errors of the proposed estimators to the first order of approximation.
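Each proposed imputation rule is constructed so that the mean of the completed data equals the corresponding estimator. The sketch below checks this for three representative cases, (3.1), (3.9) and (3.17); the toy data, the assumed population mean `X_bar` and the constants `beta1`, `b1`, `theta1` are arbitrary illustrative values rather than optimal choices.

```python
import numpy as np

# Toy sample: y is missing (nan) for the non-responding units
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([5.0, 9.0, np.nan, 17.0, np.nan])
R = ~np.isnan(y)
X_bar = 5.5                                  # hypothetical known population mean of x
n, r = x.size, int(R.sum())
y_r, x_r, x_n = y[R].mean(), x[R].mean(), x.mean()

def completed_mean(fill):
    """Mean of the sample after replacing missing y values by `fill`."""
    data = y.copy()
    data[~R] = fill
    return data.mean()

beta1, b1, theta1 = 0.8, 1.5, 0.4            # arbitrary constants for illustration

# Method (1): equations (3.1) and (3.2)
fill_1 = (n * y_r * (X_bar / x_n) ** beta1 - r * y_r) / (n - r)
t1 = y_r * (X_bar / x_n) ** beta1

# Method (4): equations (3.9) and (3.10)
fill_4 = y_r + b1 * (x[~R] - x_r)
t4 = y_r + b1 * (x_n - x_r)

# Method (8): equations (3.17) and (3.18)
denom = theta1 * x_r + (1 - theta1) * x_n
fill_8 = y_r * (x[~R] + r / (n - r) * x_r) / denom - r / (n - r) * y_r
t8 = y_r * x_n / denom
```

In methods (1) and (8) the imputed values depend on the sample only through aggregates or through $x_i$, so the same check applies unchanged to the remaining methods.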

4. Properties of the Proposed Estimators

Let $\varepsilon = \dfrac{\bar{y}_r}{\bar{Y}} - 1$, $\delta = \dfrac{\bar{x}_r}{\bar{X}} - 1$ and $\eta = \dfrac{\bar{x}_n}{\bar{X}} - 1$. Then using the concept of two-phase sampling following Rao and Sitter (1995) and the mechanism of MCAR, for given $r$ and $n$, we have $E(\varepsilon) = E(\delta) = E(\eta) = 0$ and

$$E(\varepsilon^2) = \left( \frac{1}{r} - \frac{1}{N} \right) C_y^2, \quad E(\delta^2) = \left( \frac{1}{r} - \frac{1}{N} \right) C_x^2, \quad E(\varepsilon \delta) = \left( \frac{1}{r} - \frac{1}{N} \right) \rho\, C_y C_x,$$

$$E(\eta^2) = \left( \frac{1}{n} - \frac{1}{N} \right) C_x^2, \quad E(\delta \eta) = \left( \frac{1}{n} - \frac{1}{N} \right) C_x^2, \quad E(\varepsilon \eta) = \left( \frac{1}{n} - \frac{1}{N} \right) \rho\, C_y C_x,$$

where $C_y^2 = S_y^2 / \bar{Y}^2$, $C_x^2 = S_x^2 / \bar{X}^2$, $\rho = S_{xy} / (S_x S_y)$ and $S_y^2$, $S_x^2$ and $S_{xy}$ have their usual meanings. Also, define

$$s_{xy}^* = (r - 1)^{-1} \sum_{i=1}^{r} (y_i - \bar{y}_r)(x_i - \bar{x}_r), \quad s_x^{*2} = (r - 1)^{-1} \sum_{i=1}^{r} (x_i - \bar{x}_r)^2, \quad s_y^{*2} = (r - 1)^{-1} \sum_{i=1}^{r} (y_i - \bar{y}_r)^2,$$

$$\hat{\rho} = \frac{s_{xy}^*}{s_x^* s_y^*}, \quad c_y^2 = \frac{s_y^{*2}}{\bar{y}_r^2}, \quad c_x^2 = \frac{s_x^{*2}}{\bar{x}_r^2}.$$

The biases and mean squared errors of the proposed estimators are given below. The proofs of all of these results are similar and therefore we will prove only one of them.

Theorem 4.1:

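(1) The bias of $t_1$ is given by:

$$B(t_1) \approx \left(\frac{1}{n}-\frac{1}{N}\right)\bar{Y}\left[\frac{\beta_1(\beta_1+1)}{2}C_x^2 - \beta_1\rho C_y C_x\right] \tag{4.1}$$

(2) The mean squared error of $t_1$ is given by:

$$M(t_1) \approx \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2 + \beta_1^2\left(\frac{1}{n}-\frac{1}{N}\right)C_x^2 - 2\beta_1\left(\frac{1}{n}-\frac{1}{N}\right)\rho C_y C_x\right] \tag{4.2}$$

(3) The minimum mean squared error of $t_1$ is given by:

$$M(t_1)_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2} \tag{4.3}$$

for the optimum value of $\beta_1$, which is given by

$$\beta_1 = \rho\frac{C_y}{C_x} \tag{4.4}$$

Proof: (1) The estimator $t_1$ in terms of $\varepsilon$, $\delta$ and $\eta$ can be written as

$$t_1 = \bar{Y}(1+\varepsilon)(1+\eta)^{-\beta_1} = \bar{Y}(1+\varepsilon)\left[1 - \beta_1\eta + \frac{\beta_1(\beta_1+1)}{2}\eta^2 + \dots\right] \approx \bar{Y}\left\{1 + \varepsilon - \beta_1\eta - \beta_1\varepsilon\eta + \frac{\beta_1(\beta_1+1)}{2}\eta^2\right\}$$

Then the bias of $t_1$ is given by

$$B(t_1) = E(t_1-\bar{Y}) \approx \bar{Y}E\left[\varepsilon - \beta_1\eta - \beta_1\varepsilon\eta + \frac{\beta_1(\beta_1+1)}{2}\eta^2\right]$$

$$= \bar{Y}\left[-\beta_1 E(\varepsilon\eta) + \frac{\beta_1(\beta_1+1)}{2}E(\eta^2)\right]$$

$$= \bar{Y}\left[-\beta_1\left(\frac{1}{n}-\frac{1}{N}\right)\rho C_y C_x + \frac{\beta_1(\beta_1+1)}{2}\left(\frac{1}{n}-\frac{1}{N}\right)C_x^2\right]$$

$$= \left(\frac{1}{n}-\frac{1}{N}\right)\bar{Y}\left[\frac{\beta_1(\beta_1+1)}{2}C_x^2 - \beta_1\rho C_y C_x\right]$$

(2) The mean squared error of $t_1$ is given by

$$M(t_1) = E(t_1-\bar{Y})^2 \approx \bar{Y}^2E(\varepsilon-\beta_1\eta)^2 = \bar{Y}^2E(\varepsilon^2 - 2\beta_1\varepsilon\eta + \beta_1^2\eta^2) = \bar{Y}^2\left[E(\varepsilon^2) - 2\beta_1E(\varepsilon\eta) + \beta_1^2E(\eta^2)\right]$$

$$= \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2 - 2\beta_1\left(\frac{1}{n}-\frac{1}{N}\right)\rho C_y C_x + \beta_1^2\left(\frac{1}{n}-\frac{1}{N}\right)C_x^2\right]$$

$$= \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2 + \beta_1^2\left(\frac{1}{n}-\frac{1}{N}\right)C_x^2 - 2\beta_1\left(\frac{1}{n}-\frac{1}{N}\right)\rho C_y C_x\right]$$

(3) To find the minimum mean squared error, we find the optimum value of $\beta_1$ by differentiating equation (4.2) with respect to $\beta_1$ and equating the derivative to zero, which gives

$$\beta_1 = \rho\frac{C_y}{C_x}$$

On substituting the optimum value of $\beta_1$ in equation (4.2), we get:

$$M(t_1)_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2}$$

Remark 4.1:

(1) For the optimum value of $\beta_1$ the bias becomes

$$B(t_1) \approx \left(\frac{1}{n}-\frac{1}{N}\right)\bar{Y}\left[\frac{\rho C_y(C_x-\rho C_y)}{2}\right] \tag{4.5}$$

(2) We can find the estimate of $\beta_1$ from the sample information, that is

$$\hat{\beta}_1 = \hat{\rho}\frac{c_y}{c_x} \tag{4.6}$$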
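As a numerical illustration of Theorem 4.1, the following minimal Python sketch evaluates the first-order MSE (4.2) over a grid of $\beta_1$ values and compares the grid minimiser with the closed form (4.4). It borrows the population values reported in the empirical study of Section 6; the respondent count r = 180 is inferred from "sample of size 200, drop 20 units", and the names `mse_t1`, `beta_opt` and `beta_grid` are illustrative, not the authors'.

```python
import math

# Population values quoted in the empirical study (Section 6).
N, n, r = 8306, 200, 180            # r = 200 - 20 respondents (inferred)
S2y, S2x, Sxy = 338006.0, 862017.0, 281892.0
Ybar, Xbar = 253.75, 343.316

Cy = math.sqrt(S2y) / Ybar          # coefficient of variation of y
Cx = math.sqrt(S2x) / Xbar          # coefficient of variation of x
rho = Sxy / math.sqrt(S2x * S2y)    # population correlation


def mse_t1(beta):
    """First-order MSE of t1 as a function of beta_1, equation (4.2)."""
    a = 1 / r - 1 / N
    b = 1 / n - 1 / N
    return Ybar ** 2 * (a * Cy ** 2 + beta ** 2 * b * Cx ** 2
                        - 2 * beta * b * rho * Cy * Cx)


beta_opt = rho * Cy / Cx                                           # (4.4)
mse_min = (1 / r - 1 / N) * S2y - (1 / n - 1 / N) * Sxy ** 2 / S2x  # (4.3)

# Crude grid search over [0, 1] as an independent check of the optimum.
grid = [i / 1000 for i in range(1001)]
beta_grid = min(grid, key=mse_t1)

print(beta_opt, beta_grid, mse_t1(beta_opt), mse_min)
```

Because (4.2) is quadratic in $\beta_1$, the grid minimiser should sit within one grid step of the closed-form optimum, and substituting (4.4) into (4.2) should reproduce (4.3) up to rounding.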


Theorem 4.2:

(1) The bias of $t_2$ is given by

$$B(t_2) \approx \left(\frac{1}{r}-\frac{1}{n}\right)\bar{Y}\left[\frac{\beta_2(\beta_2+1)}{2}C_x^2 - \beta_2\rho C_y C_x\right] \tag{4.7}$$

(2) The mean squared error of the proposed estimator $t_2$ is given by

$$M(t_2) \approx \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2 + \beta_2^2\left(\frac{1}{r}-\frac{1}{n}\right)C_x^2 - 2\beta_2\left(\frac{1}{r}-\frac{1}{n}\right)\rho C_y C_x\right] \tag{4.8}$$

(3) The minimum mean squared error of the proposed estimator $t_2$ is given by

$$M(t_2)_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2} \tag{4.9}$$

for the optimum value of $\beta_2$ given by

$$\beta_2 = \rho\frac{C_y}{C_x} \tag{4.10}$$

On substituting the optimum value of $\beta_2$ in equation (4.8), we get:

$$M(t_2)_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2}$$

Remark 4.2:

(1) For the optimum value of $\beta_2$ the bias becomes

$$B(t_2) \approx \left(\frac{1}{r}-\frac{1}{n}\right)\bar{Y}\left[\frac{\rho C_y(C_x-\rho C_y)}{2}\right] \tag{4.11}$$

(2) We can find the estimator of $\beta_2$ from the sample information, that is

$$\hat{\beta}_2 = \hat{\rho}\frac{c_y}{c_x} \tag{4.12}$$

Theorem 4.3:

(1) The bias of the proposed estimator $t_3$ is given by

$$B(t_3) \approx \left(\frac{1}{r}-\frac{1}{N}\right)\bar{Y}\left[\frac{\beta_3(\beta_3+1)}{2}C_x^2 - \beta_3\rho C_y C_x\right] \tag{4.13}$$

(2) The mean squared error of the proposed estimator $t_3$ is given by

$$M(t_3) \approx \left(\frac{1}{r}-\frac{1}{N}\right)\bar{Y}^2\left(C_y^2 + \beta_3^2 C_x^2 - 2\beta_3\rho C_y C_x\right) \tag{4.14}$$

(3) The minimum mean squared error of the proposed estimator $t_3$ is given by

$$M(t_3)_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2(1-\rho^2) \tag{4.15}$$

for the optimum value of $\beta_3$ given by

$$\beta_3 = \rho\frac{C_y}{C_x} \tag{4.16}$$

Remark 4.3:

(1) For the optimum value of $\beta_3$ the bias becomes

$$B(t_3) \approx \left(\frac{1}{r}-\frac{1}{N}\right)\bar{Y}\left[\frac{\rho C_y(C_x-\rho C_y)}{2}\right] \tag{4.17}$$

(2) We can find the estimator of $\beta_3$ from the sample information, that is

$$\hat{\beta}_3 = \hat{\rho}\frac{c_y}{c_x} \tag{4.18}$$

Theorem 4.4:

(1) The bias of the proposed estimator $t_4$ is given by

$$B(t_4) = 0 \tag{4.19}$$

(2) The variance of the proposed estimator $t_4$ is given by

$$V(t_4) = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 + b_1^2\left(\frac{1}{r}-\frac{1}{n}\right)S_x^2 - 2b_1\left(\frac{1}{r}-\frac{1}{n}\right)S_{xy} \tag{4.20}$$

(3) The minimum variance of the proposed estimator $t_4$ is given by

$$V(t_4)_{\min} = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2} \tag{4.21}$$

for the optimum value of $b_1$ given by

$$b_1 = \frac{S_{xy}}{S_x^2} \tag{4.22}$$

Remark 4.4:

Sometimes the value of $b_1$ is not known. In this situation, we replace it by its estimate. The optimum value is $b_1 = S_{xy}/S_x^2$, and its estimator is $\hat{b}_1 = s_{xy}^{*}/s_x^{*2}$. Cochran (1977) proved that the first-order bias of $\hat{b}_1$ is approximately zero. Using $\hat{b}_1$ instead of $b_1$, i.e.

$$t_4^{*} = \bar{y}_r + \hat{b}_1(\bar{x}_n - \bar{x}_r),$$

we have $B(t_4^{*}) \approx B(t_4) = 0$ and $V(t_4^{*}) \approx V(t_4)$.

Theorem 4.5:

(1) The bias of the proposed estimator $t_5$ is given by

$$B(t_5) = 0 \tag{4.23}$$

(2) The variance of the proposed estimator $t_5$ is given by

$$V(t_5) = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 + b_2^2\left(\frac{1}{n}-\frac{1}{N}\right)S_x^2 - 2b_2\left(\frac{1}{n}-\frac{1}{N}\right)S_{xy} \tag{4.24}$$

(3) The minimum variance of the proposed estimator $t_5$ is given by

$$V(t_5)_{\min} = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2} \tag{4.25}$$

for the optimum value of $b_2$ given by

$$b_2 = \frac{S_{xy}}{S_x^2} \tag{4.26}$$

Remark 4.5:

Sometimes the value of $b_2$ is unknown. In this situation, we replace it by its estimate. The optimum value is $b_2 = S_{xy}/S_x^2$, and its estimator is $\hat{b}_2 = s_{xy}^{*}/s_x^{*2}$. Cochran (1977) proved that the first-order bias of $\hat{b}_2$ is approximately zero. Using $\hat{b}_2$ instead of $b_2$, i.e.

$$t_5^{*} = \bar{y}_r + \hat{b}_2(\bar{X} - \bar{x}_n),$$

we have $B(t_5^{*}) \approx B(t_5) = 0$ and $V(t_5^{*}) \approx V(t_5)$.


Theorem 4.6:

(1) The bias of the proposed estimator $t_6$ is given by

$$B(t_6) = 0 \tag{4.27}$$

(2) The variance of the proposed estimator $t_6$ is given by

$$V(t_6) = \left(\frac{1}{r}-\frac{1}{N}\right)\left(S_y^2 + b_3^2 S_x^2 - 2b_3 S_{xy}\right) \tag{4.28}$$

(3) The minimum variance of the proposed estimator $t_6$ is given by

$$V(t_6)_{\min} = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2(1-\rho^2) \tag{4.29}$$

for the optimum value of $b_3$ given by

$$b_3 = \frac{S_{xy}}{S_x^2} \tag{4.30}$$

Remark 4.6:

Sometimes the value of $b_3$ is unknown. In this situation, we replace it by its estimate. The optimum value is $b_3 = S_{xy}/S_x^2$, and its estimator is $\hat{b}_3 = s_{xy}^{*}/s_x^{*2}$. Cochran (1977) proved that the first-order bias of $\hat{b}_3$ is approximately zero. Using $\hat{b}_3$ instead of $b_3$, i.e.

$$t_6^{*} = \bar{y}_r + \hat{b}_3(\bar{X} - \bar{x}_r),$$

we have $B(t_6^{*}) \approx B(t_6) = 0$ and $V(t_6^{*}) \approx V(t_6)$.

Theorem 4.7:

(1) The bias of the proposed estimator $t_7$ is given by

$$B(t_7) = 0 \tag{4.31}$$

(2) The variance of the proposed estimator $t_7$ is given by

$$V(t_7) = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - 2S_{xy}\left[k_1\left(\frac{1}{n}-\frac{1}{N}\right) + k_2\left(\frac{1}{r}-\frac{1}{n}\right)\right] + S_x^2\left[k_1^2\left(\frac{1}{n}-\frac{1}{N}\right) + k_2^2\left(\frac{1}{r}-\frac{1}{n}\right)\right] \tag{4.32}$$

(3) The minimum variance of the proposed estimator $t_7$ is given by

$$V(t_7)_{\min} = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2(1-\rho^2) \tag{4.33}$$

for the optimum values of $k_1$ and $k_2$ given by

$$k_1 = k_2 = \frac{S_{xy}}{S_x^2} \tag{4.34}$$

Theorem 4.8:

(1) The bias of the proposed estimator $t_8$ is given by

$$B(t_8) \approx \left(\frac{1}{r}-\frac{1}{n}\right)\bar{Y}\left(\theta_1^2 C_x^2 - \theta_1\rho C_y C_x\right) \tag{4.39}$$

(2) The mean squared error of the proposed estimator $t_8$ is given by

$$M(t_8) \approx \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2 + \theta_1^2\left(\frac{1}{r}-\frac{1}{n}\right)C_x^2 - 2\theta_1\left(\frac{1}{r}-\frac{1}{n}\right)\rho C_y C_x\right] \tag{4.40}$$

(3) The minimum mean squared error of the proposed estimator $t_8$ is given by

$$M(t_8)_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2} \tag{4.41}$$

for the optimum value of $\theta_1$ given by

$$\theta_1 = \rho\frac{C_y}{C_x} \tag{4.42}$$

Remark 4.7:

(1) For the optimum value of $\theta_1$ the bias becomes

$$B(t_8) \approx 0 \tag{4.43}$$

(2) We can find the estimate of $\theta_1$ from the sample information, that is

$$\hat{\theta}_1 = \hat{\rho}\frac{c_y}{c_x} \tag{4.44}$$

Theorem 4.9:

(1) The bias of the proposed estimator $t_9$ is given by

$$B(t_9) \approx \left(\frac{1}{n}-\frac{1}{N}\right)\bar{Y}\left(\theta_2^2 C_x^2 - \theta_2\rho C_y C_x\right) \tag{4.45}$$

(2) The mean squared error of the proposed estimator $t_9$ is given by

$$M(t_9) \approx \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2 + \theta_2^2\left(\frac{1}{n}-\frac{1}{N}\right)C_x^2 - 2\theta_2\left(\frac{1}{n}-\frac{1}{N}\right)\rho C_y C_x\right] \tag{4.46}$$

(3) The minimum mean squared error of the proposed estimator $t_9$ is given by

$$M(t_9)_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2} \tag{4.47}$$

for the optimum value of $\theta_2$ given by

$$\theta_2 = \rho\frac{C_y}{C_x} \tag{4.48}$$

Remark 4.8:

(1) For the optimum value of $\theta_2$ the bias becomes

$$B(t_9) \approx 0 \tag{4.49}$$

(2) We can find the estimator of $\theta_2$ from the sample information, that is

$$\hat{\theta}_2 = \hat{\rho}\frac{c_y}{c_x} \tag{4.50}$$

Theorem 4.10:

(1) The bias of the proposed estimator $t_{10}$ is given by

$$B(t_{10}) \approx \left(\frac{1}{r}-\frac{1}{N}\right)\bar{Y}\left(\theta_3^2 C_x^2 - \theta_3\rho C_y C_x\right) \tag{4.51}$$

(2) The mean squared error of the proposed estimator $t_{10}$ is given by

$$M(t_{10}) \approx \left(\frac{1}{r}-\frac{1}{N}\right)\bar{Y}^2\left(C_y^2 + \theta_3^2 C_x^2 - 2\theta_3\rho C_y C_x\right) \tag{4.52}$$

(3) The minimum mean squared error of the proposed estimator $t_{10}$ is given by

$$M(t_{10})_{\min} \approx \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2(1-\rho^2) \tag{4.53}$$

for the optimum value of $\theta_3$ given by

$$\theta_3 = \rho\frac{C_y}{C_x} \tag{4.54}$$

Remark 4.9:

(1) For the optimum value of $\theta_3$ the bias becomes

$$B(t_{10}) \approx 0 \tag{4.55}$$

(2) We can find the estimate of $\theta_3$ from the sample information, that is

$$\hat{\theta}_3 = \hat{\rho}\frac{c_y}{c_x} \tag{4.56}$$

5. Comparisons of the estimators

In this section, we compare the proposed estimators with some known estimators. The comparisons are in terms of the bias and the mean squared error up to terms of first order. Singh and Horn (2000) have shown that the estimator obtained from the compromised imputation method is better than the estimators obtained from the ratio and mean methods of imputation. From the formulas for the mean squared errors above we conclude the following:

(1) The estimators $t_2$, $t_4$ and $t_8$ have the same minimum mean squared error, which equals the minimum mean squared error of the compromised estimator $\bar{y}_{COMP}$. But $B(t_4) = 0$ and $B(t_8) \approx 0$ for the optimum value of $\theta_1$. Thus the 4th and 8th methods are better than the compromised method, and the 4th is better than the other three methods.

(2) The estimators $t_1$, $t_5$ and $t_9$ have the same minimum mean squared error, but $B(t_5) = 0$ and $B(t_9) \approx 0$ for the optimum value of $\theta_2$. Thus the 5th method is better than the other two methods.

(3) For comparing $t_1$, $t_5$ and $t_9$ with $t_2$, $t_4$ and $t_8$, we have:

$$D = M(t_4)_{\min} - M(t_5)_{\min} = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2} - \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 + \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2} = \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2} - \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2}$$

Thus $t_1$, $t_5$ and $t_9$ are better than $t_2$, $t_4$ and $t_8$ if $D$ is positive, i.e. if

$$\left(\frac{1}{n}-\frac{1}{N}\right) > \left(\frac{1}{r}-\frac{1}{n}\right).$$

(4) The estimators $t_3$, $t_6$, $t_7$ and $t_{10}$ have the same minimum mean squared error, but $B(t_6) = B(t_7) = 0$ and $B(t_{10}) \approx 0$ for the optimum value of $\theta_3$. Thus $t_6$ and $t_7$ are better than the others.

(5) For comparing $t_3$, $t_6$, $t_7$ and $t_{10}$ with $t_2$, $t_4$ and $t_8$, we notice that:

$$D = M(t_8)_{\min} - M(t_{10})_{\min} = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2} - \left(\frac{1}{r}-\frac{1}{N}\right)\left(S_y^2 - \frac{S_{xy}^2}{S_x^2}\right) = \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2}$$

which is always positive. Thus the 3rd, 6th, 7th and 10th methods are better than the 2nd, 4th and 8th methods.

(6) For comparing $t_1$, $t_5$ and $t_9$ with $t_3$, $t_6$, $t_7$ and $t_{10}$, we notice that:

$$D = M(t_5)_{\min} - M(t_6)_{\min} = \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2 - \left(\frac{1}{n}-\frac{1}{N}\right)\frac{S_{xy}^2}{S_x^2} - \left(\frac{1}{r}-\frac{1}{N}\right)\left(S_y^2 - \frac{S_{xy}^2}{S_x^2}\right) = \left(\frac{1}{r}-\frac{1}{n}\right)\frac{S_{xy}^2}{S_x^2}$$

which is a positive quantity. Thus the 3rd, 6th, 7th and 10th methods are better than the 1st, 5th and 9th methods.
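The ordering derived in points (1) to (6) can be checked numerically. The sketch below is only an illustration: it plugs the population values quoted in Section 6 into the three minimum-MSE expressions (4.9), (4.3) and (4.15), assuming n = 200 and r = 180 as in the simulation design; the variable names are hypothetical.

```python
# Population values quoted in Section 6; r = 200 - 20 is inferred from the
# simulation design described there.
N, n, r = 8306, 200, 180
S2y, S2x, Sxy = 338006.0, 862017.0, 281892.0

A = Sxy ** 2 / S2x                  # S_xy^2 / S_x^2, common to all groups
base = (1 / r - 1 / N) * S2y        # first-order MSE of the respondent mean

mse_t2_t4_t8 = base - (1 / r - 1 / n) * A        # (4.9), (4.21), (4.41)
mse_t1_t5_t9 = base - (1 / n - 1 / N) * A        # (4.3), (4.25), (4.47)
mse_t3_t6_t7_t10 = base - (1 / r - 1 / N) * A    # (4.15): (1/r - 1/N) S_y^2 (1 - rho^2)

# Differences D from points (3), (5) and (6).
D3 = mse_t2_t4_t8 - mse_t1_t5_t9
D5 = mse_t2_t4_t8 - mse_t3_t6_t7_t10
D6 = mse_t1_t5_t9 - mse_t3_t6_t7_t10

print(mse_t2_t4_t8, mse_t1_t5_t9, mse_t3_t6_t7_t10)
print(D3, D5, D6)
```

With these inputs all three differences are positive, so the first-order theory ranks the groups in the same order as the comparisons above.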

6. Empirical Study

The estimators are compared using a real data set taken from the Department of Statistics (Jordan), Healthcare Utilization and Expenditure Survey, 2000. The population under study contains 8306 households. We consider the variables y and x, where y is the expenditure of the household and x is the income of the household. Throughout this study, calculations are based on a simulation of 50,000 repeated samples drawn without replacement. We compute the bias and the MSE for all estimators. The following values were obtained:

$$S_y^2 = 338006, \quad S_x^2 = 862017, \quad S_{xy} = 281892, \quad \bar{Y} = 253.75, \quad \bar{X} = 343.316,$$

$$C_y = 2.29116, \quad C_x = 2.70436, \quad \alpha = 0.557559, \quad \rho = 0.522231,$$

$$\beta_1 = \beta_2 = \beta_3 = 0.442441, \quad \theta_1 = \theta_2 = \theta_3 = 0.442441, \quad b_1 = b_2 = b_3 = k_1 = k_2 = 0.327014.$$

In Table 1, we give the MSE, bias, and efficiency for the estimators under consideration with respect to the estimator $\bar{y}_m$, where the efficiency of an estimator $\hat{y}$ with respect to $\bar{y}_m$ is defined as

$$e(\hat{y}) = \frac{MSE(\bar{y}_m)}{MSE(\hat{y})},$$

where $MSE(\hat{y})$ is the mean squared error of $\hat{y}$.


The following steps summarize the simulation procedure used to find the bias and MSE of an estimator $\hat{y}$:

Step 1. We select a sample according to SRSWOR of size 200 from the real data set of 8306 units.

Step 2. We drop 20 units randomly from each sample corresponding to the study variable y.

Step 3. We impute the dropped units with the available methods and with the proposed methods.

Step 4. We repeat steps (1), (2) and (3) 50000 times. Thus, we obtain 50000 values for $\hat{y}$, say $\hat{y}_1, \dots, \hat{y}_{50000}$.

Step 5. The bias of $\hat{y}$ is obtained by $B(\hat{y}) = \frac{1}{50000}\sum_{i=1}^{50000}\hat{y}_i - \bar{Y}$.

Step 6. The MSE of $\hat{y}$ is obtained by $MSE(\hat{y}) = \frac{1}{50000}\sum_{i=1}^{50000}(\hat{y}_i - \bar{Y})^2$.

In Table 1, we give the MSE, bias, and efficiency for the estimators under consideration.
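The steps above can be sketched in Python as follows. This is a reduced illustration under stated assumptions, not the authors' program: the Jordanian survey data are not reproduced here, so a synthetic population is generated; only two estimators are shown (the respondent mean $\bar{y}_r$, which is what mean imputation yields, and the regression-type estimator $t_6^{*} = \bar{y}_r + \hat{b}_3(\bar{X} - \bar{x}_r)$ from Remark 4.6); and the population size and number of replications are cut down to keep the run short. The function name `simulate` and all parameter values are hypothetical.

```python
import random


def simulate(N=2000, n=200, r=180, reps=500, seed=1):
    """Monte Carlo bias and MSE for two estimators under SRSWOR with random nonresponse."""
    rng = random.Random(seed)
    # Synthetic population (assumption): y is linear in x plus noise.
    x = [rng.gauss(50, 10) for _ in range(N)]
    y = [5 + 0.8 * xi + rng.gauss(0, 5) for xi in x]
    Ybar = sum(y) / N
    Xbar = sum(x) / N

    estimates = {"mean": [], "t6": []}
    for _ in range(reps):                         # Step 4: repeat
        sample = rng.sample(range(N), n)          # Step 1: SRSWOR of size n
        resp = rng.sample(sample, r)              # Step 2: drop n - r units at random
        yr = sum(y[i] for i in resp) / r
        xr = sum(x[i] for i in resp) / r
        sxy = sum((y[i] - yr) * (x[i] - xr) for i in resp) / (r - 1)
        sxx = sum((x[i] - xr) ** 2 for i in resp) / (r - 1)
        b_hat = sxy / sxx                         # estimate of S_xy / S_x^2
        # Step 3: point estimators after imputation.
        estimates["mean"].append(yr)
        estimates["t6"].append(yr + b_hat * (Xbar - xr))

    summary = {}
    for name, vals in estimates.items():
        bias = sum(vals) / reps - Ybar                    # Step 5
        mse = sum((v - Ybar) ** 2 for v in vals) / reps   # Step 6
        summary[name] = (bias, mse)
    return summary


results = simulate()
for name, (bias, mse) in results.items():
    print(name, bias, mse)
```

Because the synthetic x and y are strongly correlated, the regression-type estimator should show a visibly smaller simulated MSE than the respondent mean, in line with (4.29).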

Table 1: Bias and MSE for estimators under study based on simulation

Estimator   MSE       Efficiency   Bias
ym          1832.61   1            0.116162
yRAT        1872.22   0.97884      0.759740
yCOMP       1780.10   1.02950      0.400907
t1          1311.57   1.39726      1.123540
t2          1776.25   1.03173      0.271958
t3          1255.44   1.45974      1.273090
t4          1782.50   1.02811      0.158712
t5          1383.25   1.32486      0.042048
t6          1331.10   1.37676      0.084599
t7          1331.10   1.37676      0.084599
t8          1772.71   1.03379      0.144854
t9          1299.29   1.41047      0.044114
t10         1242.37   1.47509      0.081079
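The efficiency column of Table 1 follows directly from the MSE column through $e(\hat{y}) = MSE(\bar{y}_m)/MSE(\hat{y})$. The short sketch below recomputes it from the tabulated MSE values; the dictionary keys are just labels chosen for this illustration.

```python
# Simulated MSE values copied from Table 1.
mse = {
    "ym": 1832.61, "yRAT": 1872.22, "yCOMP": 1780.10,
    "t1": 1311.57, "t2": 1776.25, "t3": 1255.44, "t4": 1782.50,
    "t5": 1383.25, "t6": 1331.10, "t7": 1331.10, "t8": 1772.71,
    "t9": 1299.29, "t10": 1242.37,
}

# Efficiency relative to the mean-imputation estimator ym.
efficiency = {name: mse["ym"] / value for name, value in mse.items()}

for name, eff in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name:6s} {eff:.5f}")
```

Sorting by efficiency makes it easy to see which estimators fall above or below the compromised method in the simulation.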


Based on these results, we notice that all proposed estimators are more efficient than the mean and ratio methods, and all except t4 are more efficient than the compromised method. Moreover, the proposed estimators t1, t3, t6, t7, t9 and t10 are markedly more efficient than the others. The small differences between the efficiencies of (t3, t10) and (t6, t7) are due to simulation error and the small sample sizes, since the theoretical results are asymptotic.

Therefore, we recommend the use of t9 or t10, because they combine high efficiency with negligible bias.

REFERENCES

COCHRAN, W. G. (1977). Sampling Techniques. New York: Wiley.

RAO, J. N. K. AND SITTER, R. R. (1995). Variance estimation under two phase sampling with application to imputation for missing data. Biometrika, 82, 453-460.

RUBIN, D. B. (1976). Inference and missing data. Biometrika, 63, 581-593.

SINGH, S. AND HORN, S. (2000). Compromised imputation in survey sampling. Metrika, 51, 266-276.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1265—1275

A MODIFIED REGRESSION ESTIMATOR OF A POPULATION MEAN UNDER GENERAL SAMPLING DESIGN

Vyas Dubey1

ABSTRACT

A generalized estimator of a population mean under any sampling design, which utilizes the knowledge of a population mean and variance of auxiliary variable has been studied. Properties of proposed estimator have been discussed and an optimum class of estimators has been obtained. The proposed class of estimators has been discussed in probability proportional to size sampling. Results are supported by numerical examples.

Key words: Auxiliary variable, Bias, Mean square error (MSE), Simple random sampling without replacement (SRSWOR), Probability proportional to size (PPS) sampling, Relative efficiency.

1. Introduction

The problem of estimating the population mean of a variable using auxiliary information has received considerable attention in sampling from finite populations. Estimators which utilize the population mean of an auxiliary variable are known as ratio, product and linear regression estimators. Among these, the regression estimator has been proved theoretically to be the most efficient. Let $U = \{1, 2, \dots, i, \dots, N\}$ be a finite population of size N, and let y be a real-valued function defined on U taking the value $y_i$ on unit i of U $(1 \le i \le N)$. Let x be an auxiliary variable, correlated with y, taking the value $x_i$ on unit i of U $(1 \le i \le N)$, which is available in advance. Let $s = (1, 2, \dots, j, \dots, n)$ be a sample taken from U under any sampling design. Let $\hat{\bar{Y}}$ and $\hat{\bar{X}}$ be unbiased estimators of the population means $\bar{Y}$ and $\bar{X}$ of the variables y and x, respectively. Sarndal et al. (1992) discussed the generalised regression estimator of $\bar{Y}$ as

1 Pt. Ravishankar Shukla University, Raipur, INDIA; e-mail:[email protected]

1266 V. Dubey; A modified regression estimator…

$$\hat{\bar{Y}}_{rg} = \hat{\bar{Y}} + \hat{\beta}(\bar{X} - \hat{\bar{X}}) \tag{1.1}$$

where $\hat{\beta}$ is a sample estimate of $\beta$, the regression coefficient of $\hat{\bar{Y}}$ on $\hat{\bar{X}}$. Using the knowledge of the population variance $S_x^2 = (N-1)^{-1}\sum_{i=1}^{N}(x_i - \bar{X})^2$ of the auxiliary variable x, Srivastava and Jhajj (1981) proposed a class of estimators in simple random sampling as

$$e_1 = \bar{y}\,t(u, v) \tag{1.2}$$

where $\bar{y}$ and $\bar{x}$ are the sample means of y and x, $s_x^2 = (n-1)^{-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$ is an unbiased estimator of $S_x^2$, $u = \bar{x}/\bar{X}$, $v = s_x^2/S_x^2$, and $t(u, v)$ is a function of u and v which satisfies certain regularity conditions. Further, Das and Tripathi (1981) considered

$$e_2 = \bar{y} + \alpha_1(\bar{X} - \bar{x}) + \alpha_2(S_x^2 - s_x^2) \tag{1.3}$$

where $\alpha_1$ and $\alpha_2$ are suitably chosen statistics. The estimators $e_1$ and $e_2$ are equally efficient up to the first order of approximation, while they are more efficient than $\hat{\bar{Y}}_{rg}$ for skewed populations.

We note that the estimator of $S_x^2$ in sampling schemes like PPS, systematic and stratified sampling becomes complicated. Therefore, in Section 2, an effort has been made to develop an estimator under a unified approach, which is valid for all sampling designs and correlation patterns.

2. Proposed class of estimators

Let d = D(U, S, P) be a family of sampling designs, where S = {s} is the collection of all possible samples generated from the population U through a method of randomization leading to the probability P(s) of selecting a sample $s \in S$, $\sum_{s \in S} P(s) = 1$, and P = {P(s)} is a family of probability measures. Let $\pi_j = \sum_{s \in S(j)} P(s)$ be the probability of including a specified unit $j \in U$ in the sample s (first order inclusion probability), where S(j) is the set of all those samples which include the specified unit j. Let $\hat{\bar{Y}}$ and $\hat{\mu}'_r(x)$, r = 1, 2, be unbiased estimators of $\bar{Y}$ and $\mu'_r(x) = \frac{1}{N}\sum_{i=1}^{N} x_i^r$, respectively, under the above sampling design. In case the population moments $\mu'_1(x)$ and $\mu'_2(x)$ of the auxiliary character x are known, we define a general class of estimators of the population mean $\bar{Y}$ as

$$\hat{\bar{Y}}_g = \hat{\bar{Y}} + \lambda_1\{\mu'_1(x) - \hat{\mu}'_1(x)\} + \lambda_2\{\mu'_2(x) - \hat{\mu}'_2(x)\} \tag{2.1}$$

where $\lambda_1$ and $\lambda_2$ are suitably chosen statistics such that their expected values exist. It may be noted that unbiased estimators $\hat{\bar{Y}}$, $\hat{\mu}'_r(x)$, r = 1, 2, always exist, whatever be the adopted sampling design. For example, the Horvitz-Thompson (1952) type estimators

$$\hat{\bar{Y}} = \frac{1}{N}\sum_{j \in s}\frac{y_j}{\pi_j} \quad \text{and} \quad \hat{\mu}'_r(x) = \frac{1}{N}\sum_{j \in s}\frac{x_j^r}{\pi_j}, \quad r = 1, 2,$$

with $\pi_j > 0$ for all j = 1, 2, …, N, are unbiased estimators for $\bar{Y}$ and $\mu'_r(x)$, where the sum $\sum_{j \in s}$ extends over distinct units in s. More generally, the estimators

$$\hat{\bar{Y}} = \sum_{j \in s}\beta(j, s)\,y_j \quad \text{and} \quad \hat{\mu}'_r(x) = \sum_{j \in s}\beta(j, s)\,x_j^r, \quad r = 1, 2,$$

with $\sum_{s \in S}\beta(j, s)\,p(s) = \frac{1}{N}$ for all j = 1, 2, …, N, are Godambe (1955) unbiased estimators for $\bar{Y}$ and $\mu'_r(x)$, r = 1, 2, whatever be the adopted sampling design.
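To make the Horvitz-Thompson type estimators above concrete, here is a small Python sketch that checks their design unbiasedness by enumerating every possible sample. It is only an illustration under assumptions: the design is simple random sampling without replacement of size 2 (so every $\pi_j = n/N$), the four (x, y) pairs are those of the artificial Population II in Section 7, and the helper names `ht_mean` and `ht_raw_moment` are hypothetical.

```python
from itertools import combinations


def ht_mean(values, pis, N):
    """Horvitz-Thompson type estimator of a population mean: (1/N) * sum(v_j / pi_j)."""
    return sum(v / p for v, p in zip(values, pis)) / N


def ht_raw_moment(xs, pis, N, r):
    """Horvitz-Thompson type estimator of the r-th raw moment mu'_r(x)."""
    return ht_mean([xj ** r for xj in xs], pis, N)


# Artificial population (values of x and y from Population II in Section 7).
x_pop = [10, 18, 21, 25]
y_pop = [7, 15, 10, 16]
N, n = 4, 2
pi = n / N                      # SRSWOR inclusion probability (assumed design)

# Enumerate all equally likely samples and average the estimators.
samples = list(combinations(range(N), n))
avg_y = sum(ht_mean([y_pop[j] for j in s], [pi] * n, N) for s in samples) / len(samples)
avg_m2 = sum(ht_raw_moment([x_pop[j] for j in s], [pi] * n, N, 2) for s in samples) / len(samples)

print(avg_y, sum(y_pop) / N)
print(avg_m2, sum(v ** 2 for v in x_pop) / N)
```

Averaging over all samples reproduces the population mean of y and the second raw moment of x, which is what unbiasedness under the design means here.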

3. Properties of proposed class of estimators

For any random variable z, let $\delta_z = z - E(z)$, and write

$$V_{abcde}(\hat{\bar{Y}}, \hat{\mu}'_1(x), \hat{\mu}'_2(x), \lambda_1, \lambda_2) = E\left[(\delta_{\hat{\bar{Y}}})^a(\delta_{\hat{\mu}'_1(x)})^b(\delta_{\hat{\mu}'_2(x)})^c(\delta_{\lambda_1})^d(\delta_{\lambda_2})^e\right] \tag{3.1}$$

To obtain bias and MSE expressions suitable for use in practice, and to reveal certain nice properties of the estimators in $\hat{\bar{Y}}_g$, we confine our discussion to fixed sample size designs $F_s(n)$ only, where n denotes the sample size. If $\lambda_i = K_i$, i = 1, 2, are constants, the estimator $\hat{\bar{Y}}_g$ is unbiased with variance

$$V(\hat{\bar{Y}}_g) = V(\hat{\bar{Y}}) + K_1^2 V(\hat{\mu}'_1(x)) + K_2^2 V(\hat{\mu}'_2(x)) - 2K_1 Cov(\hat{\bar{Y}}, \hat{\mu}'_1(x)) - 2K_2 Cov(\hat{\bar{Y}}, \hat{\mu}'_2(x)) + 2K_1K_2 Cov(\hat{\mu}'_1(x), \hat{\mu}'_2(x)) \tag{3.2}$$

Let the statistic $\lambda_i$ be chosen such that either $E(\lambda_i)$ does not depend on n or

$$E(\lambda_i) = \lambda_{0i} + O(n^{-s_i}), \quad s_i > 0, \quad i = 1, 2, \tag{3.3}$$

where the $\lambda_{0i}$ do not depend upon n, and

$$V_{abcde}(\hat{\bar{Y}}, \hat{\mu}'_1(x), \hat{\mu}'_2(x), \lambda_1, \lambda_2) \le O(n^{-(a+b+c+d+e)/2}) \tag{3.4}$$

The exact expressions for the bias and MSE of the proposed class of estimators under any sampling design are, respectively, given by

$$B(\hat{\bar{Y}}_g) = -(V_{01010} + V_{00101}) \tag{3.5}$$

and

$$M(\hat{\bar{Y}}_g) = V_{20000} + \{E(\lambda_1)\}^2V_{02000} + \{E(\lambda_2)\}^2V_{00200} - 2E(\lambda_1)V_{11000} - 2E(\lambda_2)V_{10100} + 2E(\lambda_1)E(\lambda_2)V_{01100} + M_1 + M_2 \tag{3.6}$$

where

$$M_1 = 2E(\lambda_1)\{V_{02010} + V_{01110}\} + 2E(\lambda_2)\{V_{01110} + V_{00201}\} - 2V_{11010} - 2V_{10101}$$

$$M_2 = V_{02020} + V_{00202} + 2V_{01111}$$

In some cases, these exact expressions may not be of much interest from a practical point of view. It has to be noted that, to reduce the absolute bias of $\hat{\bar{Y}}_g$, the statistics $\lambda_i$ may be chosen such that $\lambda_i$ is uncorrelated with $\hat{\mu}'_i(x)$, i = 1, 2, or such that if $\lambda_i$ is positively correlated with $\hat{\mu}'_i(x)$ then $\lambda_j$, $j \ne i$, is negatively correlated with $\hat{\mu}'_j(x)$, $j \ne i$.

In case $E(\lambda_i) = \lambda_{0i}$ does not depend upon the sample size, the bias of $\hat{\bar{Y}}_g$ is zero and $M(\hat{\bar{Y}}_g)$ is of order $n^{-1}$. Let us write

$$E(\lambda_i) = \lambda_{0i} + \lambda_{0i}^{*} \tag{3.7}$$

where the term $\lambda_{0i}^{*}$ is of the order $n^{-s_i}$ $(s_i > 0)$. Using (3.7), it follows from (3.6) that

$$M_0(\hat{\bar{Y}}_g) = V_{20000} + \lambda_{01}^2V_{02000} + \lambda_{02}^2V_{00200} - 2\lambda_{01}V_{11000} - 2\lambda_{02}V_{10100} + 2\lambda_{01}\lambda_{02}V_{01100} \tag{3.8}$$

is of order $n^{-1}$, which is the same as $V(\hat{\bar{Y}}_g)$ with $\lambda_{0i} = K_i$, i = 1, 2. It may be remarked that for many sampling procedures, including SRSWOR, PPSWR and some πps schemes, $M_0(\hat{\bar{Y}}_g)$ is a good approximation of $M(\hat{\bar{Y}}_g)$, not only for large samples but even for moderate sample sizes, especially when (3.3) and (3.4) are satisfied, in which case $M_0(\hat{\bar{Y}}_g)$ is $O(n^{-1})$ and the remaining terms of $M(\hat{\bar{Y}}_g)$ in (3.6) are $O(n^{-(1+s)})$, where $s = \min(s_1, s_2, 1)$.

4. An optimum sub-class of proposed class

The values of $\lambda_{01}$ and $\lambda_{02}$ which minimize $M_0(\hat{\bar{Y}}_g)$ are given by

$$\lambda_{010} = \frac{V_{00200}V_{11000} - V_{10100}V_{01100}}{V_{02000}V_{00200} - V_{01100}^2} \tag{4.1}$$

$$\lambda_{020} = \frac{V_{02000}V_{10100} - V_{11000}V_{01100}}{V_{02000}V_{00200} - V_{01100}^2} \tag{4.2}$$

respectively. For such a choice of $\lambda_{0i}$, (3.8) reduces to

$$M_0(\hat{\bar{Y}}_g)_{opt.} = V(\hat{\bar{Y}})\left[1 - \rho_g^2 - \frac{\{K_{12g}(y, x) - \rho_g\sqrt{\beta_{1g}(x)}\}^2}{\beta_{2g}(x) - \beta_{1g}(x)}\right] \tag{4.3}$$

where

$$\rho_g = \frac{V_{11000}}{\sqrt{V_{20000}V_{02000}}}; \quad \beta_{2g}(x) = \frac{V_{00200}}{V_{02000}^2}; \quad \beta_{1g}(x) = \frac{V_{01100}^2}{V_{02000}^3}; \quad K_{12g}(y, x) = \frac{V_{10100}}{V_{02000}\sqrt{V_{20000}}}.$$

Thus we have obtained, for large samples, an optimum sub-class $\hat{\bar{Y}}_{0g}$ of the class of estimators $\hat{\bar{Y}}_g$, given by

$$\hat{\bar{Y}}_{0g} = \hat{\bar{Y}} + \tilde{\lambda}_1\{\mu'_1(x) - \hat{\mu}'_1(x)\} + \tilde{\lambda}_2\{\mu'_2(x) - \hat{\mu}'_2(x)\} \tag{4.4}$$

with

$$E(\tilde{\lambda}_i) = \lambda_{0i0} + \lambda_{0i}^{*} \tag{4.5}$$

It is noted that up to $O(n^{-1})$ all the estimators in $\hat{\bar{Y}}_{0g}$ have the same MSE but may have different bias. Among the $\lambda_i$ for which (4.5) holds, we choose those for which the absolute bias of $\hat{\bar{Y}}_{0g}$ is least. The estimators in $\hat{\bar{Y}}_{0g}$ may be made unbiased, and the exact expressions for their MSE may be obtained, by imposing certain restrictions on (i) the choice of sampling design, (ii) the choice of the estimators $\hat{\bar{Y}}$, $\hat{\mu}'_1(x)$, $\hat{\mu}'_2(x)$, $\lambda_1$, $\lambda_2$ and (iii) the nature of the population under consideration.

5. Gain in efficiency of optimum estimators

The MSE of the generalized regression estimator $\hat{\bar{Y}}_{rg}$ is

$$M(\hat{\bar{Y}}_{rg}) = V(\hat{\bar{Y}})(1 - \rho_g^2) \tag{5.1}$$

Comparing (4.3) and (5.1), we have

$$M_0(\hat{\bar{Y}}_g)_{opt.} < M(\hat{\bar{Y}}_{rg}) \tag{5.2}$$

Thus the optimum sub-class of estimators is more efficient than $\hat{\bar{Y}}_{rg}$ under any sampling design.


6. SPECIAL CASE: Probability Proportional To Size With Replacement

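If the information on an auxiliary variable z, which is expected to be highly correlated with the variable y, is available for every unit i = 1, 2, …, N of a population U = (1, 2, …, N), the sampling may be done with probability proportional to the size of the variable z and with replacement. In this case $\pi_j = np_j$, where $p_j = z_j / \sum_{j=1}^{N} z_j$. The values of $\hat{\bar{Y}}$, $\hat{\mu}'_1(x)$, $\hat{\mu}'_2(x)$ reduce to

$$\bar{y}_{pps} = \frac{1}{n}\sum_{j=1}^{n}\frac{y_j}{Np_j}, \quad \bar{x}_{pps} = \frac{1}{n}\sum_{j=1}^{n}\frac{x_j}{Np_j} \quad \text{and} \quad m'_2(x)_{pps} = \frac{1}{n}\sum_{j=1}^{n}\frac{x_j^2}{Np_j}$$

respectively. The estimator $\hat{\bar{Y}}_g$ turns out as

$$\bar{y}_{gp} = \bar{y}_{pps} + \lambda_{1p}(\bar{X} - \bar{x}_{pps}) + \lambda_{2p}(\mu'_2(x) - m'_2(x)_{pps}) \tag{6.1}$$

where $\lambda_{1p}$ and $\lambda_{2p}$ are suitably chosen statistics such that their expectations exist. Let

$$\sigma_{yp}^2 = \sum_{i=1}^{N}\left(\frac{y_i}{Np_i} - \bar{Y}\right)^2p_i, \quad \sigma_{xp}^2 = \sum_{i=1}^{N}\left(\frac{x_i}{Np_i} - \bar{X}\right)^2p_i, \quad \sigma_{up}^2 = \sum_{i=1}^{N}\left(\frac{x_i^2}{Np_i} - \mu'_2(x)\right)^2p_i,$$

$$\sigma_{yxp} = \sum_{i=1}^{N}\left(\frac{y_i}{Np_i} - \bar{Y}\right)\left(\frac{x_i}{Np_i} - \bar{X}\right)p_i, \quad \sigma_{yup} = \sum_{i=1}^{N}\left(\frac{y_i}{Np_i} - \bar{Y}\right)\left(\frac{x_i^2}{Np_i} - \mu'_2(x)\right)p_i,$$

$$\sigma_{xup} = \sum_{i=1}^{N}\left(\frac{x_i}{Np_i} - \bar{X}\right)\left(\frac{x_i^2}{Np_i} - \mu'_2(x)\right)p_i, \quad \beta_p = \frac{\sigma_{yxp}}{\sigma_{xp}^2}.$$

The estimator $\hat{\bar{Y}}_{rg}$ reduces to the Tripathi (1969) type linear regression estimator of a population mean $\bar{Y}$, defined by

$$\bar{y}_{lp} = \bar{y}_{pps} + \beta_p(\bar{X} - \bar{x}_{pps}) \tag{6.2}$$

The optimum values $\lambda_{010}$ and $\lambda_{020}$ which minimize $M_0(\hat{\bar{Y}}_g)$ under the PPSWR sampling scheme are given by

$$\lambda_{01p} = \frac{\sigma_{up}^2\sigma_{yxp} - \sigma_{yup}\sigma_{xup}}{\sigma_{xp}^2\sigma_{up}^2 - \sigma_{xup}^2} \tag{6.3}$$

$$\lambda_{02p} = \frac{\sigma_{xp}^2\sigma_{yup} - \sigma_{yxp}\sigma_{xup}}{\sigma_{xp}^2\sigma_{up}^2 - \sigma_{xup}^2} \tag{6.4}$$

respectively. The minimum MSE of $\bar{y}_{gp}$ up to $O(n^{-1})$ is given by

$$M_0(\bar{y}_{gp})_{opt.} = V(\bar{y}_{pps})\left[1 - \rho_p^2 - \frac{\{K_{12p}(y, x) - \rho_p\sqrt{\beta_{1p}(x)}\}^2}{\beta_{2p}(x) - \beta_{1p}(x)}\right] \tag{6.5}$$

where

$$K_{12p}(y, x) = \frac{\sigma_{yup}}{\sigma_{yp}\sigma_{xp}^2}; \quad \beta_{1p}(x) = \frac{\sigma_{xup}^2}{\sigma_{xp}^6}; \quad \beta_{2p}(x) = \frac{\sigma_{up}^2}{\sigma_{xp}^4}; \quad \rho_p = \frac{\sigma_{yxp}}{\sigma_{yp}\sigma_{xp}}; \quad V(\bar{y}_{pps}) = \frac{\sigma_{yp}^2}{n}.$$

Tripathi (1969) has shown that the variance of $\bar{y}_{lp}$ is

$$V(\bar{y}_{lp}) = V(\bar{y}_{pps})(1 - \rho_p^2) \tag{6.6}$$

Comparing (6.5) and (6.6), it can be seen that

$$M_0(\bar{y}_{gp})_{opt.} < V(\bar{y}_{lp}) \tag{6.7}$$

Estimates of $\lambda_{01p}$ and $\lambda_{02p}$ are given by

$$\hat{\lambda}_{01p} = \frac{s_{up}^2s_{yxp} - s_{yup}s_{xup}}{s_{xp}^2s_{up}^2 - s_{xup}^2} \tag{6.8}$$

$$\hat{\lambda}_{02p} = \frac{s_{xp}^2s_{yup} - s_{yxp}s_{xup}}{s_{xp}^2s_{up}^2 - s_{xup}^2} \tag{6.9}$$

where

$$s_{xp}^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(\frac{x_j}{Np_j} - \bar{x}_{pps}\right)^2, \quad s_{up}^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(\frac{x_j^2}{Np_j} - m'_2(x)_{pps}\right)^2,$$

$$s_{yxp} = \frac{1}{n-1}\sum_{j=1}^{n}\left(\frac{y_j}{Np_j} - \bar{y}_{pps}\right)\left(\frac{x_j}{Np_j} - \bar{x}_{pps}\right), \quad s_{yup} = \frac{1}{n-1}\sum_{j=1}^{n}\left(\frac{y_j}{Np_j} - \bar{y}_{pps}\right)\left(\frac{x_j^2}{Np_j} - m'_2(x)_{pps}\right),$$

$$s_{xup} = \frac{1}{n-1}\sum_{j=1}^{n}\left(\frac{x_j}{Np_j} - \bar{x}_{pps}\right)\left(\frac{x_j^2}{Np_j} - m'_2(x)_{pps}\right)$$

are unbiased estimates of $\sigma_{xp}^2$, $\sigma_{up}^2$, $\sigma_{yxp}$, $\sigma_{yup}$ and $\sigma_{xup}$, respectively.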

7. NUMERICAL EXAMPLE

Population I: Consider the data from Gupta and Rao (1997), which relate to the population of 16 districts of West Bengal, India. Let z, x, and y be the populations of the districts in 1951, 1961 and 1971, respectively, with z measuring the size of the units. For these data we have

$$\bar{Y} = 2777505.938; \quad \bar{X} = 2182.875; \quad \sigma_{yp}^2 = 1.554998 \times 10^{11}; \quad \sigma_{xp}^2 = 39744.2344; \quad \sigma_{up}^2 = 1.6616 \times 10^{13};$$

$$\sigma_{yxp} = 71153288.10; \quad \sigma_{yup} = 5.6460 \times 10^{11}; \quad \sigma_{xup} = 179054687; \quad \rho_p = 0.90509;$$

$$K_{12p}(y, x) = 36.0248; \quad \beta_{1p}(x) = 510.680; \quad \beta_{2p}(x) = 10519.09$$

Population II: This is an artificial data set with N = 4; the values of y and x and their respective selection probabilities are as follows:

Pi    xi   yi
0.1   10   7
0.2   18   15
0.3   21   10
0.4   25   16

For these data

$$\bar{Y} = 12.00, \quad \bar{X} = 18.50; \quad \sigma_{yp}^2 = 17.77; \quad \sigma_{xp}^2 = 11.03; \quad \sigma_{up}^2 = 1850.78; \quad \sigma_{yxp} = 12.38; \quad \sigma_{yup} = -32.50; \quad \sigma_{xup} = -72.97;$$

$$K_{12p}(y, x) = -0.6989; \quad \beta_{1p}(x) = 0.0544; \quad \beta_{2p}(x) = 15.2092$$

The relative efficiencies (R.E.) of the various estimators with respect to $\bar{y}_{pps}$, defined by $[\{V(\bar{y}_{pps}) / M(\cdot)\} \times 100]$, are given in Table 1.

Table 1. Relative efficiency of Estimators

Estimator   R.E. (Population-I)   R.E. (Population-II)
y pps       100.00                100.00
ylp         553.07                457.04
ygp         638.64                843.21

Table 1 shows that the proposed estimator ygp is considerably more efficient than the existing estimators.
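The Population I column of Table 1 can be approximately recomputed from the moments listed above. The following sketch is an illustration rather than the author's computation: it assumes that under PPSWR every second-order moment in (3.8) equals the corresponding σ quantity divided by n (consistent with $V(\bar{y}_{pps}) = \sigma_{yp}^2/n$), so that n cancels in the relative efficiencies. The function name `scaled_m0` and the variable names are hypothetical.

```python
# Population I moments as listed in Section 7.
s2_y = 1.554998e11      # sigma^2_yp
s2_x = 39744.2344       # sigma^2_xp
s2_u = 1.6616e13        # sigma^2_up
s_yx = 71153288.10      # sigma_yxp
s_yu = 5.6460e11        # sigma_yup
s_xu = 179054687.0      # sigma_xup

det = s2_x * s2_u - s_xu ** 2
lam1 = (s2_u * s_yx - s_yu * s_xu) / det     # optimum lambda_01p, equation (6.3)
lam2 = (s2_x * s_yu - s_yx * s_xu) / det     # optimum lambda_02p, equation (6.4)


def scaled_m0(l1, l2):
    """n times the first-order MSE (3.8), with each V replaced by sigma/n (assumption)."""
    return (s2_y + l1 ** 2 * s2_x + l2 ** 2 * s2_u
            - 2 * l1 * s_yx - 2 * l2 * s_yu + 2 * l1 * l2 * s_xu)


rho2 = s_yx ** 2 / (s2_y * s2_x)             # rho_p squared
re_pps = 100.0                               # reference estimator
re_lp = 100.0 / (1.0 - rho2)                 # from (6.6)
re_gp = 100.0 * s2_y / scaled_m0(lam1, lam2)  # from (6.5), evaluated via (3.8)

print(re_pps, re_lp, re_gp)
```

Evaluating (3.8) directly at the optimum avoids the sign ambiguity of the square root in (6.5) when $\sigma_{xup}$ is negative, as it is for Population II.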

REFERENCES

DAS, A. K. AND TRIPATHI, T. P. (1981): A class of sampling strategies for population means using information on mean and variance of an auxiliary character, Proceeding of Indian Statistical Institute, Golden Jubilee International Conference on Statistics: Applications and new directions, 174-181.

GODAMBE, V. P. (1955): A unified theory of sampling from finite populations, Jour. Roy. Stat. Soc., 17, 269-271.

GUPTA, B. K. AND RAO, T. J. (1997): Stratified PPS Sampling and allocation of sample size, Jour. Ind. Soc. Ag. Statistics, 50(2), 199-208.

HORVITZ, D. G. AND THOMPSON, D. J. (1952): A generalization of sampling without replacement from a finite universe, Jour. Amer. Stat. Assoc., 47, 663-685.

SARNDAL, C. E., SWENSSON, B. AND WRETMAN, J. H. (1992): Model Assisted Survey Sampling, Springer-Verlag, New York.

SRIVASTAVA, S. K. AND JHAJJ, H. S. (1981): A class of estimators of the population mean in survey sampling using auxiliary information, Biometrika, 68(1), 341-343.

SUKHATME, P. V. AND SUKHATME, B. V. (1992): Sampling theory of surveys with applications, Iowa State University Press, Ames, Iowa, U.S.A.

TRIPATHI, T. P. (1969): A regression-type estimator in sampling with PPS and with replacement. Aust. Jour. Stat., 11, 140-148.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1277—1293

DESIGN-BASED HORVITZ-THOMPSON VARIANCE ESTIMATION: π-WEIGHTED RATIO TYPE ESTIMATOR

P.A. Patel1 and R.D. Chaudhari2

ABSTRACT

In this article, motivated by the ratio method of estimation, a π-weighted ratio type estimator for Horvitz-Thompson variance is suggested and is shown to be asymptotically design unbiased and consistent. An empirical study is conducted to compare its performance. To assess the performances, several important summary statistics such as the percentage relative bias, the relative efficiency, and the empirical coverage rate of the resultant confidence intervals are computed and presented.

Key words : Auxiliary information, variance estimation, design-based estimation, simulation

1. Introduction

This article considers an important problem in survey sampling viz. estimation of the variance of the Horvitz-Thompson (HT) estimator of a finite population total. The customary variance estimators are the Horvitz-Thompson and Yates-Grundy estimators. The Yates-Grundy variance estimator (vYG ) is generally considered superior to the Horvitz-Thompson variance estimator (v HT ) because of fewer negative estimates and smaller sampling variance (Royall and Cumberland (1981), Rao and Singh (1973)). However, vYG requires fixed sample size, whereas v HT does not. This restriction of vYG to fixed sample size

1 Department of Statistics, Sardar Patel University, Vallabh Vidyanagar-388 120, Gujarat (India), E-mail: [email protected] 2 Statistics Department, V.P. & R.P.T.P. Science College, Vallabh Vidyanagar-388 120, Gujarat (India), E-mail: [email protected]

1278 P. A. Patel, R. D. Chaudhari: Design-Based…

design eliminated this variance estimator from consideration in many survey applications (see Stehman and Overton (1994)). There are various approaches to the variance estimation problem. The traditional one, based on the probability distribution generated by the sampling design, is well summarized in Wolter (1985) and Särndal et al. (1992). In the model-based approach the argument is conditional on the sample and makes reference to an assumed model. A fully developed approach to model-based variance estimation originated with Royall and coworkers, e.g. Royall (1970), Royall and Cumberland (1981). Fuller (1970) proposed a regression estimator of the variance of the HT estimator of the population total using as x variables the quantities $(\pi_i\pi_j - \pi_{ij})$ and $(\pi_i\pi_j - \pi_{ij})(i - j)^2$. Isaki (1983) considered estimation of the variance of the HT estimator and suggested various estimators for it. Kuk (1989), Stehman and Overton (1994), and Wolter (1985) described Horvitz-Thompson variance estimation for ordered variable probability systematic sampling. Berger (1998) proposed Horvitz-Thompson variance estimation under a list-sequential scheme for unequal probability sampling. Singh et al. (1999), following Deville and Särndal (1992), suggested a calibrated variance estimator of the HT estimator. In this article, we present a design-based estimator of the HT variance and compare it with the standard ones.

Consider the finite population of N units $U = \{1, \dots, i, \dots, N\}$. We wish to estimate the total $Y = \sum_{i \in U} y_i$, where $y_i$ is the value of a survey variable y for the ith unit. Suppose that a sample s of units is selected according to a sampling design p(s) with positive inclusion probabilities $\pi_i$ and $\pi_{ij}$.

The Horvitz-Thompson and Yates-Grundy expressions for the variance of the HT estimator $\hat{Y}_{HT} = \sum_{i \in s} y_i/\pi_i$ are

$$V_{HT}(\hat{Y}_{HT}) = \sum_{U}\left(\frac{1}{\pi_i} - 1\right)y_i^2 + \sum\sum_{U}\left(\frac{\pi_{ij}}{\pi_i\pi_j} - 1\right)y_iy_j$$

and

$$V_{YG}(\hat{Y}_{HT}) = -\frac{1}{2}\sum\sum_{U}(\pi_{ij} - \pi_i\pi_j)\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$$

Here, $\sum_{A}$ and $\sum\sum_{A}$ will be used for $\sum_{i \in A}$ and $\sum\sum_{i \ne j \in A}$, for an arbitrary set $A \subseteq U$. The customary design-based estimators of $V_{HT}$ are

$$v_{HT}(\hat{Y}_{HT}) = \sum_{s}\left(\frac{1}{\pi_i} - 1\right)\frac{y_i^2}{\pi_i} + \sum\sum_{s}\left(\frac{\pi_{ij}}{\pi_i\pi_j} - 1\right)\frac{y_iy_j}{\pi_{ij}}$$

and

$$v_{YG}(\hat{Y}_{HT}) = -\frac{1}{2}\sum\sum_{s}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}}\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2.$$

We also include the following estimator for comparison. Recently Singh et al. (1999) have proposed a high level calibration approach for variance estimation. To estimate the YG variance their proposed estimator is

$$v_{CYG}(\hat{Y}_{HT}) = v_{YGY} + \hat{\gamma}_{CYG}[V_{YGX} - v_{YGX}]$$

where

$$\hat{\gamma}_{CYG} = \frac{\sum\sum_{s}\dfrac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}}\left(\dfrac{y_i}{\pi_i} - \dfrac{y_j}{\pi_j}\right)^2\left(\dfrac{x_i}{\pi_i} - \dfrac{x_j}{\pi_j}\right)^2}{\sum\sum_{s}\dfrac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}}\left(\dfrac{x_i}{\pi_i} - \dfrac{x_j}{\pi_j}\right)^4}$$

It should be noted that this estimator is not appropriate for πps sampling (i.e. $\pi_i \propto x_i$), since $v_{YG}(\hat{X}_{HT})$ vanishes and $\hat{\gamma}_{CYG}$ is undefined. Therefore, this estimator, in contrast to the HT variance estimator, requires an additional variable over and above the one used to define the inclusion probabilities. Also, it takes negative values for some samples.
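As a quick check on the two customary estimators $v_{HT}$ and $v_{YG}$, the sketch below evaluates both for a simple random sample without replacement, where $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/(N(N-1))$. Under that design both should coincide with the textbook expression $N^2(1 - n/N)s^2/n$. The sample values, N, n and the function names are invented for the illustration.

```python
from itertools import permutations


def v_ht(y, pi, pij):
    """Horvitz-Thompson variance estimator for the sampled values y."""
    idx = range(len(y))
    single = sum((1 / pi[i] - 1) * y[i] ** 2 / pi[i] for i in idx)
    double = sum((pij[i][j] / (pi[i] * pi[j]) - 1) * y[i] * y[j] / pij[i][j]
                 for i, j in permutations(idx, 2))
    return single + double


def v_yg(y, pi, pij):
    """Yates-Grundy variance estimator (fixed sample size designs)."""
    idx = range(len(y))
    return -0.5 * sum((pij[i][j] - pi[i] * pi[j]) / pij[i][j]
                      * (y[i] / pi[i] - y[j] / pi[j]) ** 2
                      for i, j in permutations(idx, 2))


# Hypothetical SRSWOR sample of n = 4 units from N = 20.
N, n = 20, 4
y = [3.0, 7.0, 8.0, 12.0]
pi = [n / N] * n
p2 = n * (n - 1) / (N * (N - 1))
pij = [[p2] * n for _ in range(n)]      # diagonal entries are never used

ybar = sum(y) / n
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
textbook = N ** 2 * (1 - n / N) * s2 / n

print(v_ht(y, pi, pij), v_yg(y, pi, pij), textbook)
```

For unequal probability designs the two estimators generally differ, which is where the comparison discussed in the text becomes relevant.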

2. A π -Weighted Ratio Type Estimator

The ordinary ratio estimator for the population total is robust against extreme values in the individual ratios $y_i/x_i$ (Rao (1978)). The $\pi$-weighted ratio estimator (see Särndal et al. (1992)) has a number of points in its favour. The form is simple and easy to accept on intuitive grounds alone. Over the history of survey sampling, most of the considerable experience gathered with this estimator points to excellent performance under a variety of conditions. Motivated by this, we define below a ratio-type $\pi$-weighted estimator of $V_{HT}$. A ratio-type adjustment to each of the terms of $v_{HT}$ yields the following estimator


$$v_{\pi WR}(\hat Y_{HT}) = \begin{cases} \left(\sum_s \phi_{ii}\dfrac{y_i^2}{\pi_i}\right)\left(\dfrac{\sum_U \phi_{ii} x_i^2}{\sum_s \phi_{ii}\dfrac{x_i^2}{\pi_i}}\right) & \text{if } \pi_{ij} = \pi_i\pi_j \ \forall\, i \neq j \in U \\[3ex] \left(\sum_s \phi_{ii}\dfrac{y_i^2}{\pi_i}\right)\left(\dfrac{\sum_U \phi_{ii} x_i^2}{\sum_s \phi_{ii}\dfrac{x_i^2}{\pi_i}}\right) + \left(\sum\sum_s \phi_{ij}\dfrac{y_i y_j}{\pi_{ij}}\right)\left(\dfrac{\sum\sum_U \phi_{ij} x_i x_j}{\sum\sum_s \phi_{ij}\dfrac{x_i x_j}{\pi_{ij}}}\right) & \text{if } \pi_{ij} \neq \pi_i\pi_j \end{cases} \qquad (1)$$

where, for $i \neq j = 1,2,\dots,N$, $\phi_{ii} = (1/\pi_i) - 1$ and $\phi_{ij} = (\pi_{ij}/(\pi_i\pi_j)) - 1$. Since the ratio estimator is optimal under the ratio model, we believe that this estimator will also work well under this model. Note that the design-based variance estimators $v_{YG}$ and $v_{HT}$ use the x variable only at the sampling stage, whereas the suggested estimator also uses the x variable at the estimation stage.

Remark. The estimator $v_{\pi WR}$ is general in that it applies to both fixed sample size and non-fixed (random) sample size designs.
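A minimal numerical sketch of estimator (1) is given below. The function name, the argument layout (population-level `x`, `pi` and an N x N matrix `pij`, plus the indices of the sampled units) and the use of a numerical tolerance to detect the case $\pi_{ij} = \pi_i\pi_j$ are assumptions introduced for illustration only.

```python
import numpy as np

def v_pi_wr(y_s, sample, x, pi, pij):
    """Ratio-type pi-weighted variance estimator, following equation (1).

    y_s    : y values of the sampled units (same order as `sample`)
    sample : indices of sampled units within the population
    x, pi  : auxiliary values and inclusion probabilities for all N units
    pij    : N x N joint inclusion probabilities (diagonal ignored)
    """
    y_s = np.asarray(y_s, dtype=float)
    x, pi, pij = (np.asarray(a, dtype=float) for a in (x, pi, pij))
    s = np.asarray(sample)
    N = len(x)

    phi_ii = 1.0 / pi - 1.0
    term1 = (np.sum(phi_ii[s] * y_s**2 / pi[s])
             * np.sum(phi_ii * x**2)
             / np.sum(phi_ii[s] * x[s]**2 / pi[s]))

    off = ~np.eye(N, dtype=bool)
    phi_ij = np.zeros((N, N))
    phi_ij[off] = pij[off] / np.outer(pi, pi)[off] - 1.0
    if np.allclose(phi_ij, 0.0):
        # pi_ij = pi_i * pi_j for all i != j (e.g. Poisson sampling)
        return float(term1)

    xx = np.outer(x, x)
    pop_cross = np.sum(phi_ij * xx)            # diagonal of phi_ij is zero
    sub = np.ix_(s, s)
    off_s = ~np.eye(len(s), dtype=bool)
    w = phi_ij[sub][off_s] / pij[sub][off_s]
    num = np.sum(w * np.outer(y_s, y_s)[off_s])
    den = np.sum(w * xx[sub][off_s])
    return float(term1 + num * pop_cross / den)
```

One property that follows directly from the form of (1): when $y_i = \beta x_i$ exactly, both ratio adjustments cancel the sample quantities and the estimator reproduces $V_{HT}(\hat Y_{HT})$ for every sample.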

3. Asymptotic Design Unbiasedness and Consistency of v π WR

The main result of this section gives important properties of $v_{\pi WR}$.

Definition: With reference to the sequence of populations and sampling designs, an estimator $\hat\theta_t$ is asymptotically unbiased for $\theta_t$ if

$$\lim_{t\to\infty} [E(\hat\theta_t) - \theta_t] = 0$$

and is consistent if, for any fixed $\varepsilon > 0$,

$$\lim_{t\to\infty} \Pr[\,|\hat\theta_t - \theta_t| > \varepsilon\,] = 0 .$$

Theorem 3.1. The estimator $v_{\pi WR}$ is asymptotically design consistent (ADC) and asymptotically design unbiased (ADU).

Proof: The proof is given in Appendix A.


4. Simulation Study

In this section, the empirical comparison is conducted on artificial populations simulated according to some superpopulation models and on natural populations. The following strategies are compared.

Strategy   Design                                     Estimators
1)         Sunter's Sampling (Sunter, 1977a, 1977b)   v_YG, v_πWR
2)         Midzuno Sampling (Midzuno, 1952)           v_YG, v_πWR, v_CYG
3)         Poisson Sampling (Särndal et al. 1992)     v_HT, v_πWR

4.1 Monte Carlo Simulation Based on Artificial Populations

A finite population of size N = 400 was created. The characteristics x and y for the ith unit were generated using the model

$$y_i = \beta x_i + x_i^{\gamma/2}\varepsilon_i, \quad i = 1,2,\dots,N \qquad (2)$$

for specified values of $\beta$, $\gamma$, g, h and $\sigma_\varepsilon^2$, where $\varepsilon_i \sim N(0,\sigma_\varepsilon^2)$ independent of $x_i \sim G(g,h)$. Thus the mean, variance and coefficient of variation of $x_i$ are given by $\mu_x = gh$, $\sigma_x^2 = gh^2$ and $C_x = \sigma_x/\mu_x = g^{-1/2}$. Further, the mean of $y_i$ is $\mu_y = \beta\mu_x$, the variance of $y_i$ is $\sigma_y^2 = \beta^2\sigma_x^2 + \sigma_\varepsilon^2 E(x_i^\gamma)$, and the correlation coefficient $\mathrm{Corr}(x_i, y_i) = \rho = \beta\sigma_x/\sigma_y$ varies depending on the choice of $\gamma$. Here $\gamma$ = 0, 1 and 2 were considered, so that $E(x_i^\gamma) = 1$, $\mu_x$ and $\mu_x^2 + \sigma_x^2$; for each of these cases $\sigma_\varepsilon^2$ and $\sigma_x^2$ were then chosen to match various values of $\rho$ and $C_x$. The three $(\beta, \mu_x, \gamma)$ combinations were: (a) $\beta = 1, \mu_x = 100, \gamma = 0$; (b) $\beta = 1, \mu_x = 100, \gamma = 1$; and (c) $\beta = 1, \mu_x = 100, \gamma = 2$. For each of (a) - (c) and each $\rho$ and $C_x$ combination, the finite population was created and a sample of size n = 30 was drawn using a specific sampling design. The variance estimators were computed from each sample. This process

was repeated M = 10,000 times. The performance of the different variance estimators was measured and compared in terms of relative bias in percentage (RB), relative efficiency (RE) and empirical coverage rate (ECR). The simulated values of RB, RE and ECR for a particular variance estimator v were computed as

$$RB(v) = 100 \times \frac{\bar v - V}{V}$$

where $\bar v$ is obtained computationally as

$$\bar v = \frac{1}{M}\sum_{j=1}^{M} v^{(j)} .$$

The relative efficiency of v is given by

$$RE(v) = \frac{MSE(v_{YG})}{MSE(v)} \quad \text{(for fixed-size design)}$$

or

$$RE(v) = \frac{MSE(v_{HT})}{MSE(v)} \quad \text{(for random-size design)}$$

where

$$MSE(v) = \frac{1}{M-1}\sum_{j=1}^{M}\left(v^{(j)} - V\right)^2 .$$

The empirical coverage rate is

$$ECR(v) = \frac{1}{M}\sum_{j=1}^{M} I\left[\,Y \in \hat Y \pm Z_{\alpha/2}\sqrt{v}\,\right]$$

where $I[\cdot]$ is the indicator function, $\hat Y$ denotes the value of the HT-estimator given a specific sample and $Z_{\alpha/2}$ denotes a tabular value from the standard normal distribution or t-distribution.

Tables 1 and 2 report the relative efficiencies of Strategy 1 and Strategy 2 respectively. The RBs and the ECRs of these strategies are presented in Appendix B. The interesting observations from Tables 1 and 2 are:

1. If the relation between y and x is linear and passes through the origin, then with strong auxiliary information ($\rho$ high) the gain from using $v_{\pi WR}$ can be substantial compared to $v_{YG}$ in all cases (except one) for both schemes.
2. $v_{\pi WR}$ has reasonable RB, less than 2%, for $C_x = 0.33$ and $C_x = 0.75$ when $\gamma$ = 0 and 1. On the other hand, if $C_x = 1.5$, $v_{\pi WR}$ has slightly larger RB, varying from 1% - 10% under Midzuno sampling. However, the reverse is observed under Sunter's sampling.
3. The ECRs of $v_{\pi WR}$ and $v_{YG}$ are close to their nominal 95%.
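The simulation set-up described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, and the way $\sigma_\varepsilon^2$ is backed out from $\rho$ (by solving $\rho = \beta\sigma_x/\sigma_y$ together with the stated expression for $\sigma_y^2$) is an assumption about how "chosen to match" was carried out.

```python
import numpy as np

def simulate_population(N, beta, gamma, mu_x, c_x, rho, rng):
    """Generate (x, y) from model (2) for gamma in {0, 1, 2}."""
    g = 1.0 / c_x**2              # shape, since C_x = g**(-1/2)
    h = mu_x / g                  # scale, since mu_x = g * h
    sigma_x2 = g * h**2
    ex_gamma = {0: 1.0, 1: mu_x, 2: mu_x**2 + sigma_x2}[gamma]
    # Assumption: solve rho = beta*sigma_x/sigma_y for sigma_eps^2.
    sigma_y2 = beta**2 * sigma_x2 / rho**2
    sigma_e2 = (sigma_y2 - beta**2 * sigma_x2) / ex_gamma
    x = rng.gamma(shape=g, scale=h, size=N)
    eps = rng.normal(0.0, np.sqrt(sigma_e2), size=N)
    y = beta * x + x**(gamma / 2.0) * eps
    return x, y

def summarize(v_draws, y_hat_draws, V, Y, z=1.96):
    """Return (RB in percent, MSE, ECR) over M Monte Carlo replicates."""
    v = np.asarray(v_draws, dtype=float)
    y_hat = np.asarray(y_hat_draws, dtype=float)
    M = len(v)
    rb = 100.0 * (v.mean() - V) / V
    mse = np.sum((v - V)**2) / (M - 1)
    # Negative variance estimates give a zero-width interval here.
    half_width = z * np.sqrt(np.maximum(v, 0.0))
    ecr = np.mean(np.abs(y_hat - Y) <= half_width)
    return float(rb), float(mse), float(ecr)
```

The relative efficiency of an estimator is then the ratio of the reference estimator's MSE to its own MSE, both computed with `summarize` on the same replicates.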

Table 1. RE Under Sunter’s Sampling

γ   Estimator   ρ     Cx = 1.5   Cx = 0.75   Cx = 0.33
0   v_YG        0.9   1.000      1.000       1.000
                0.8   1.000      1.000       1.000
                0.7   1.000      1.000       1.000
    v_πWR       0.9   1.112      7.296       1.111
                0.8   1.126      2.206       1.198
                0.7   1.154      3.145       1.118
1   v_YG        0.9   1.000      1.000       1.000
                0.8   1.000      1.000       1.000
                0.7   1.000      1.000       1.000
    v_πWR       0.9   1.138      9.209       1.883
                0.8   1.121      14.529      1.230
                0.7   1.222      6.789       1.267
2   v_YG        0.9   1.000      1.000       1.000
                0.8   1.000      1.000       1.000
                0.7   1.000      1.000       1.000
    v_πWR       0.9   1.119      41.667      2.827
                0.8   1.136      14.742      1.265
                0.7   1.069      14.096      1.259


Table 2. RE Under Midzuno's Sampling

γ   Estimator   ρ     Cx = 1.5   Cx = 0.75   Cx = 0.33
0   v_YG        0.9   1.000      1.000       1.000
                0.8   1.000      1.000       1.000
                0.7   1.000      1.000       1.000
    v_πWR       0.9   5.530      59.998      71.570
                0.8   1.903      24.503      29.899
                0.7   1.641      19.263      15.401
    v_CYG       0.9   5.755      63.601      72.075
                0.8   2.092      28.704      27.242
                0.7   2.408      17.937      15.828
1   v_YG        0.9   1.000      1.000       1.000
                0.8   1.000      1.000       1.000
                0.7   1.000      1.000       1.000
    v_πWR       0.9   5.197      11.925      13.983
                0.8   2.480      4.868       7.264
                0.7   0.559      4.171       4.117
    v_CYG       0.9   5.141      12.553      15.501
                0.8   2.523      6.282       8.843
                0.7   0.663      3.355       4.227
2   v_YG        0.9   1.000      1.000       1.000
                0.8   1.000      1.000       1.000
                0.7   1.000      1.000       1.000
    v_πWR       0.9   1.648      2.996       7.146
                0.8   1.618      2.405       2.321
                0.7   2.549      1.514       2.735
    v_CYG       0.9   1.129      3.611       7.379
                0.8   1.210      2.111       2.302
                0.7   4.088      1.396       2.597


4.2. Simulation Based on the Empirical Data Sets

As in the last subsection, from each population (listed in Appendix C) a sample of size n = 30 was drawn using a specific design. The performance of the estimators was measured in terms of the RB, the RE and the ECR. Table 3 reports the simulated values of these summary statistics. Some noteworthy results in Table 3 are:

Sunter's Sampling
1. If the variance structure is strongly heteroscedastic (e.g. populations 3, 4, 6), the estimator v_πWR has large RB but is much more efficient than v_YG.
2. v_πWR has slightly better ECR than v_YG.

Midzuno's Sampling
1. The absolute values of the RBs of all three estimators are in a reasonable range (1% - 7%).
2. v_πWR and v_CYG perform consistently better (in terms of RE) than v_YG.
3. The confidence intervals associated with v_πWR and v_CYG have coverage probabilities closer to the nominal level, while that associated with v_YG has lower coverage probability.

Poisson Sampling
1. The absolute values of the RBs of v_HT and v_πWR are all less than 0.5%.
2. v_πWR is substantially more efficient and has better ECR than v_HT.


Table 3. RB, RE and ECR Under Various Sampling Schemes

Summary     Popl.   Sunter's Sampling     Midzuno's Sampling             Poisson Sampling
statistic   No.     v_YG      v_πWR       v_YG      v_πWR     v_CYG      v_HT      v_πWR
RB          1       3.087     1.680       0.859     -0.531    2.304      -0.383    0.097
            2       3.201     2.588       -0.888    3.236     3.494      -0.395    0.051
            3       0.056     0.131       -4.682    -0.777    -1.871     -0.396    -0.196
            4       1.297     1.637       -6.825    -3.823    -4.875     -0.456    -0.278
            5       3.974     6.367       -5.330    -2.687    -3.461     -0.167    0.184
            6       -8.601    15.122      -4.435    2.064     1.637      -0.279    -0.037
RE          1       1.000     1.092       48.835    43.427    46.315     1.000     1.332
            2       1.000     1.044       49.530    22.199    26.788     1.000     4.513
            3       1.000     1.325       24.394    23.383    21.628     1.000     4.433
            4       1.000     1.270       49.154    20.712    22.453     1.000     4.155
            5       1.000     1.003       59.473    22.434    22.554     1.000     7.795
            6       1.000     1.819       48.172    6.871     6.602      1.000     47.053
ECR         1       0.934     0.933       0.905     0.911     0.914      0.930     0.939
            2       0.939     0.939       0.913     0.945     0.939      0.939     0.950
            3       0.942     0.942       0.937     0.945     0.948      0.940     0.950
            4       0.939     0.939       0.891     0.943     0.939      0.935     0.944
            5       0.935     0.939       0.883     0.952     0.950      0.938     0.948
            6       0.932     0.959       0.893     0.958     0.958      0.938     0.950


5. Conclusion

Based on the empirical study and the theoretical discussion we arrive at the following conclusions.
1. Throughout the simulation it was observed that v_πWR is non-negative definite (n.n.d.).
2. If the relation between y_i and x_i is linear and passes through the origin, and the correlation between them is moderate to high, then use v_πWR. This estimator seems (on the basis of the work presented here) to be broadly useful for a variety of populations.
3. Implementation of the suggested estimator, v_πWR, requires complete auxiliary information, i.e., values of the x variable for the entire finite population.
4. The strategy (Midzuno Sampling, v_πWR) has performed better than the strategy (Sunter's Sampling, v_πWR).
5. The estimator v_πWR turned out to be consistently more efficient than the standard estimators for both types of sampling designs (fixed-size and random-size sampling designs).

Appendix A:

[Proof of Theorem 3.1] To see that $v_{\pi WR}$ will be ADC and ADU as $n \to \infty$ we let

$$v_\beta = \frac{\sum_s \beta_{ii} y_i^2}{\sum_s \beta_{ii} x_i^2}\sum_U \phi_{ii} x_i^2 + \frac{\sum\sum_s \beta_{ij} y_i y_j}{\sum\sum_s \beta_{ij} x_i x_j}\sum\sum_U \phi_{ij} x_i x_j$$

where the $\beta$'s are constants, and adopt the following formulation used by Brewer (1979) in his asymptotic analysis. The original population of N units is reproduced (t − 1) times, yielding t identical populations of N units each. From each of the t populations, a sample is selected using the same sampling design for each one. The t populations are then aggregated to an overall population of size $N_t = Nt$ whose population total for the y-variable is $Y_t = Yt$, and the t samples are aggregated to an overall sample of $n_t = nt$ units. t is then allowed to tend to $\infty$. Since the steps in our proof are parallel to those of Chaudhuri and Stenger (1992) [Sec. 4.1.3, pp. 105-110], we omit some of the details for brevity. Imitating the asymptotic analysis of Brewer, we now have by Slutsky's theorem

$$\operatorname*{plim}_{t\to\infty} \frac{1}{t} v_\beta = \frac{\sum_U \pi_i\beta_{ii} y_i^2}{\sum_U \pi_i\beta_{ii} x_i^2}\sum_U \phi_{ii} x_i^2 + \frac{\sum\sum_U \pi_{ij}\beta_{ij} y_i y_j}{\sum\sum_U \pi_{ij}\beta_{ij} x_i x_j}\sum\sum_U \phi_{ij} x_i x_j .$$

Hence, $v_\beta$ may be called consistent for $V = V_{HT}$ if

$$\operatorname*{plim}_{t\to\infty}\left[\frac{1}{t} v_{\beta t} - V_t\right] = 0 \quad \forall\ Y = (y_1,\dots,y_N) \in R^N .$$

By equating the coefficients of $y_i^2$ and $y_i y_j$ on both sides of the above equation we obtain

$$\beta_{ii} = \alpha_1\frac{\phi_{ii}}{\pi_i} \quad\text{and}\quad \beta_{ij} = \alpha_2\frac{\phi_{ij}}{\pi_{ij}}$$

where

$$\alpha_1 = \frac{\sum_U \pi_i\beta_{ii} x_i^2}{\sum_U \phi_{ii} x_i^2} \quad\text{and}\quad \alpha_2 = \frac{\sum\sum_U \pi_{ij}\beta_{ij} x_i x_j}{\sum\sum_U \phi_{ij} x_i x_j} .$$

Inserting these $\beta$'s in $v_\beta$ we get an ADC estimator as

$$v^* = \frac{\sum_s \phi_{ii} y_i^2/\pi_i}{\sum_s \phi_{ii} x_i^2/\pi_i}\sum_U \phi_{ii} x_i^2 + \frac{\sum\sum_s \phi_{ij} y_i y_j/\pi_{ij}}{\sum\sum_s \phi_{ij} x_i x_j/\pi_{ij}}\sum\sum_U \phi_{ij} x_i x_j .$$

Next, since $v^*$ is consistent, we obtain

$$\lim_{t\to\infty} E_p\left[\frac{1}{t} v_{\beta t} - V_t\right] = 0 \quad \forall\ Y \in R^N .$$

So $v^*$ is asymptotically design unbiased.


Appendix B:

Table 4. RB and ECR Under Sunter's Sampling

                      RB                                  ECR
γ   Estimator   ρ     Cx = 1.5   Cx = 0.75   Cx = 0.33    Cx = 1.5   Cx = 0.75   Cx = 0.33
0   v_YG        0.9   -1.696     135.548     -9.599       0.943      0.971       0.955
                0.8   1.049      20.819      9.009        0.944      0.960       0.954
                0.7   4.169      55.562      0.926        0.942      0.965       0.956
    v_πWR       0.9   -3.181     19.577      -9.766       0.944      0.832       0.960
                0.8   -0.931     -17.072     5.931        0.945      0.933       0.959
                0.7   1.257      -1.161      -1.481       0.941      0.921       0.959
1   v_YG        0.9   1.162      188.137     5.035        0.942      0.959       0.945
                0.8   2.026      125.083     2.361        0.941      0.955       0.947
                0.7   2.520      60.063      0.197        0.942      0.946       0.947
    v_πWR       0.9   -0.048     45.228      4.917        0.941      0.883       0.944
                0.8   1.097      38.307      3.301        0.939      0.888       0.950
                0.7   0.941      17.609      -1.340       0.942      0.912       0.943
2   v_YG        0.9   -0.007     31.067      1.928        0.941      0.949       0.948
                0.8   1.301      58.450      -0.307       0.942      0.948       0.940
                0.7   1.099      36.391      0.686        0.941      0.944       0.941
    v_πWR       0.9   -0.268     7.951       3.876        0.940      0.921       0.947
                0.8   0.373      15.163      0.290        0.940      0.916       0.941
                0.7   -0.165     8.497       0.491        0.937      0.911       0.940


Table 5. RB and ECR Under Midzuno's Sampling

                      RB                                  ECR
γ   Estimator   ρ     Cx = 1.5   Cx = 0.75   Cx = 0.33    Cx = 1.5   Cx = 0.75   Cx = 0.33
0   v_YG        0.9   -0.256     -0.822      -0.305       0.923      0.932       0.927
                0.8   -0.390     0.834       -0.651       0.930      0.931       0.924
                0.7   0.742      -0.348      0.214        0.932      0.924       0.933
    v_πWR       0.9   3.926      0.487       0.056        0.953      0.955       0.953
                0.8   4.864      1.627       0.670        0.947      0.952       0.953
                0.7   6.756      0.248       0.987        0.949      0.956       0.952
    v_CYG       0.9   4.191      0.349       -0.240       0.955      0.954       0.952
                0.8   2.909      1.630       0.569        0.950      0.952       0.952
                0.7   2.537      -0.045      0.923        0.950      0.954       0.952
1   v_YG        0.9   0.590      -0.211      -0.093       0.910      0.923       0.930
                0.8   0.488      -0.350      -0.073       0.918      0.924       0.926
                0.7   0.674      0.547       0.218        0.931      0.913       0.930
    v_πWR       0.9   1.129      1.729       0.092        0.946      0.947       0.950
                0.8   2.844      1.849       1.744        0.937      0.952       0.946
                0.7   8.362      2.031       0.390        0.933      0.942       0.946
    v_CYG       0.9   1.950      2.756       0.319        0.947      0.950       0.950
                0.8   -2.011     1.528       2.373        0.928      0.952       0.949
                0.7   6.551      2.978       0.627        0.936      0.940       0.947
2   v_YG        0.9   0.361      -0.656      0.099        0.898      0.904       0.914
                0.8   -0.458     0.481       0.923        0.889      0.907       0.931
                0.7   0.946      -2.636      -0.003       0.868      0.903       0.919
    v_πWR       0.9   5.195      -3.057      1.581        0.934      0.942       0.950
                0.8   1.891      -0.940      3.694        0.919      0.934       0.946
                0.7   -10.347    -1.063      1.400        0.908      0.920       0.940
    v_CYG       0.9   8.853      -4.355      2.381        0.932      0.940       0.951
                0.8   5.776      -0.207      6.644        0.918      0.932       0.951
                0.7   -16.614    -0.029      2.464        0.891      0.918       0.942


Appendix C:

Table 6. Study Population

Popl. No.   Source                    X                                                          Y                                                          N     CV(x)   CV(y)   ρxy
1           Särndal et al. (1992)     CS82: Number of conservative seats in municipal council   RMT × 10^-4: Revenue from the 1985 municipal taxation      281   0.519   1.058   0.657
2           Chambers et al. (1986)    Area assigned for sugarcane                                Gross value of sugarcane farms                             338   0.591   0.610   0.902
3           Valliant et al. (2000)    Number of beds                                             Number of patients discharged                              393   0.776   0.724   0.910
4           Valliant et al. (2000)    Adult female population, 1960                              Breast cancer mortality, 1950-69 (white female)            301   1.221   1.279   0.967
5           Valliant et al. (2000)    Number of households, 1960                                 Population, excluding residents of group quarters, 1960    304   1.302   1.380   0.982
6           Valliant et al. (2000)    Number of households, 1960                                 Population, excluding residents of group quarters, 1960    304   1.302   1.239   0.998

A scatter plot of each of the populations 3 to 6 reveals that a linear model $y_i = \beta x_i + \varepsilon_i$ with $V_m(y_i) = \sigma^2 x_i$ might be appropriate and that the relationship between y and x is strong. Population 2 seems to obey the same model with $V_m = \sigma^2 x_i^{1/2}$.


REFERENCES

BERGER, Y.G. (1998). Variance estimation using list sequential scheme for unequal probability sampling. Journal of Official Statistics, 14(3), 315-323.
BREWER, K. (1979). of robust sampling design for large scale surveys. Journal of American Statistical Association, 74(398), 911-915.
CHAMBERS, R. L. and DUNSTAN, R. (1986). Estimating distribution function from survey data. Biometrika, 73, 597-604.
CHAUDHURI, A. and STENGER, H. (1992). Survey Sampling Theory and Methods. Marcel Dekker, Inc., N.Y.
DEVILLE, J.C. and SÄRNDAL, C.E. (1992). Calibration using auxiliary information. Journal of American Statistical Association, 78, 117-123.
FULLER, W.A. (1970). Sampling with random stratum boundaries. Journal of Royal Statistical Society, B, 32, 209-226.
ISAKI, C.T. (1983). Variance estimation using auxiliary information. Journal of the American Statistical Association, 78(381), 117-123.
KUK, A.Y.C. (1989). Double bootstrap estimation of variance under systematic sampling with probability proportional to size. Journal of Statistical Computation and Simulation, 31, 78-82.
MIDZUNO, H. (1952). On the sampling system with probabilities proportional to sum of sizes. Annals of Institute of Statistical Mathematics, 3, 99-107.
RAO, J.N.K. (1978). Sampling designs involving unequal probabilities of selection and robust estimation of a finite population total. Contributions to Survey Sampling and Applied Statistics, Edited by H.A. David, American Press, 69-87.
RAO, J.N.K. and SINGH, M.P. (1973). On the choice of estimator in survey sampling. Australian Journal of Statistics, 15(2), 95-104.
ROYALL, R.M. (1970). On finite population sampling theory under certain linear regression models. Biometrika, 57, 377-387.
ROYALL, R.M. and CUMBERLAND, W.G. (1981). An empirical study of the ratio estimator and estimators of its variance. Journal of American Statistical Association, 76, 66-77.
SÄRNDAL, C.E., SWENSSON, B. and WRETMAN, J.H. (1992). Model Assisted Survey Sampling. Springer Verlag, New York.


SINGH, S., HORN, S., CHOWDHURY, S. and YU, S. (1999). Calibration of the estimators of variance. Australian and New Zealand Journal of Statistics, 41, 199-212.
STEHMAN, S.V. and OVERTON, W.S. (1994). Comparison of variance estimators of the Horvitz-Thompson estimator for randomized variable probability systematic sampling. Journal of American Statistical Association, 89, 30-43.
SUNTER, A. B. (1977a). Response burden, sample rotation, and classification renewal in economic surveys. International Statistical Review, 45, 209-222.
SUNTER, A. B. (1977b). List sequential sampling with equal or unequal probabilities without replacement. Applied Statistics, 26, 261-268.
VALLIANT, R., DORFMAN, A. H. and ROYALL, R. M. (2000). Finite Population Sampling and Inference. John Wiley and Sons, Inc., New York.
WOLTER, K.M. (1985). Introduction to Variance Estimation. Springer Verlag, New York.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1295—1310

POST-STRATIFICATION IN A TWO-WAY DEEPLY STRATIFIED POPULATION

D. Shukla1, Manish Trivedi2 and G. N. Singh3

ABSTRACT

This paper presents an estimation strategy for the population mean of a two-way r x s deeply stratified population using the technique of post-stratification. The size and the frame of each stratum are both assumed unknown. The only information known is the proportions given by the row and column size totals of the two-way deep stratification. An estimator is proposed and its optimum properties are examined, along with a comparison of efficiencies. An approximate expression for the mean square error (MSE) is derived for this set-up.

Key words: Post-stratification, SRSWOR, Deep-stratification.

1. Introduction

When a population is stratified according to r levels of one attribute and s levels of another, it constitutes the two-way r x s deep-stratification set-up studied by Bryant (1955). Usually the population is stratified according to one criterion, but sometimes two alternative criteria are available, as when villages are classified by the size of agricultural area and by the value of agricultural production in the previous year. Which criterion is to be preferred then depends on their relative merit and importance in light of the characteristics under study. If both are strong enough, a solution is to use two-way stratification with r classes of one criterion and s of the other, with a sample of size at least r x s required to estimate the population mean. Bryant, Hartley and Jessen (1960) proposed a technique to estimate the population mean when the sample is not large enough to provide an allocation to each stratum of a two-way stratification set-up.

1 Department of Mathematics and Statistics, Dr. Hari Singh Gour University, Sagar – 470 003, INDIA. e-mail: [email protected] 2 Department of Applied Mathematics, Birla Institute of Technology, Ranchi, Jharkhand, INDIA 3 Department of Applied Mathematics, Indian School of Mines, Dhanbad – 826 004, INDIA, e-mail: [email protected]


Selecting a random sample by stratified sampling requires knowledge of the complete frame of each stratum. When the size of a stratum is known but its frame is absent, estimation is difficult, which motivates the post-stratification technique for estimating the population mean. According to Sukhatme [(1984), p. 134], post-stratification with a large sample is almost as precise as stratified sampling with proportional allocation. Some useful contributions on post-stratification are by Holt and Smith (1979), Jagers, Oden and Trulsson (1985), Jagers (1986) and Smith (1991). Suppose that the stratum-wise frames and sizes are both unavailable, but the row and column size totals are known as proportions of the total. This paper considers how to utilize this information in estimating the population mean.

2. Notations

Consider an r x s deeply stratified population of size N; let $Y_{ijk}$ be the kth value in the (i, j)th stratum, of size $N_{ij}$, of a variable Y under study (i = 1, 2, …, r; j = 1, 2, …, s; k = 1, 2, …, $N_{ij}$). Let a large sample of size n be drawn by SRSWOR and post-stratified into $n_{ij}$ units $\left(\sum_{i=1}^{r}\sum_{j=1}^{s} n_{ij} = n\right)$ coming from $N_{ij}$ $\left(\sum_{i=1}^{r}\sum_{j=1}^{s} N_{ij} = N\right)$. The symbols $\bar Y_{ij}$ and $S_{ij}^2$, $\bar Y$ and $S^2$ are population means and population mean squares, whereas $\bar y_{ij}$ and $\bar y$ are sample means based on $n_{ij}$ and n units. Moreover $N_{i.} \left(= \sum_{j=1}^{s} N_{ij}\right)$, $N_{.j} \left(= \sum_{i=1}^{r} N_{ij}\right)$, $n_{i.} \left(= \sum_{j=1}^{s} n_{ij}\right)$ and $n_{.j} \left(= \sum_{i=1}^{r} n_{ij}\right)$ are row and column size totals, while $\bar Y_i$, $\bar Y_j$, $\bar y_i$, $\bar y_j$ are population and sample means based on them.

2.1. Some Results

Let $W_{ij} = N_{ij}/N$, $W_{i.} = N_{i.}/N$, $W_{.j} = N_{.j}/N$, $p_{ij} = n_{ij}/n$, $p_{i.} = n_{i.}/n$ and $p_{.j} = n_{.j}/n$. Since the sample size n is large enough with respect to the r x s stratification, we assume:

$$p_{ij} = W_{ij}(1 + \varepsilon_{ij}), \quad p_{i'j} = W_{i'j}(1 + \varepsilon_{i'j}), \quad p_{ij'} = W_{ij'}(1 + \varepsilon_{ij'}) \qquad (2.1.1)$$

where $E[\varepsilon_{ij}] = E[\varepsilon_{i'j}] = E[\varepsilon_{ij'}] = 0$; $i \neq i'$, $j \neq j'$, and

$$E[p_{ij}] = W_{ij}, \quad E[p_{i.}] = W_{i.}, \quad E[p_{.j}] = W_{.j} \qquad (2.1.2)$$

Moreover,

$$E[\varepsilon_{ij}^2] = \frac{V[p_{ij}]}{W_{ij}^2} = \left(\frac{1}{W_{ij}^2}\right)\left[\frac{(N-n)W_{ij}(1-W_{ij})}{(N-1)n}\right] \qquad (2.1.3)$$

$$E[\varepsilon_{i'j}^2] = \frac{V[p_{i'j}]}{W_{i'j}^2} = \left(\frac{1}{W_{i'j}^2}\right)\left[\frac{(N-n)W_{i'j}(1-W_{i'j})}{(N-1)n}\right] \qquad (2.1.4)$$

$$E[\varepsilon_{ij'}^2] = \frac{V[p_{ij'}]}{W_{ij'}^2} = \left(\frac{1}{W_{ij'}^2}\right)\left[\frac{(N-n)W_{ij'}(1-W_{ij'})}{(N-1)n}\right] \qquad (2.1.5)$$

$$E[\varepsilon_{i'j}\varepsilon_{ij'}] = \frac{Cov[p_{i'j}, p_{ij'}]}{W_{i'j}W_{ij'}} = -\left\{\frac{1}{W_{i'j}W_{ij'}}\right\}\left[\frac{(N-n)W_{ij'}W_{i'j}}{(N-1)n}\right] \qquad (2.1.6)$$

$$E[\varepsilon_{i'j}\varepsilon_{ij}] = \frac{Cov[p_{i'j}, p_{ij}]}{W_{i'j}W_{ij}} = -\left\{\frac{1}{W_{i'j}W_{ij}}\right\}\left[\frac{(N-n)W_{ij}W_{i'j}}{(N-1)n}\right] \qquad (2.1.7)$$

$$E[\varepsilon_{ij}\varepsilon_{ij'}] = \frac{Cov[p_{ij}, p_{ij'}]}{W_{ij}W_{ij'}} = -\left\{\frac{1}{W_{ij}W_{ij'}}\right\}\left[\frac{(N-n)W_{ij'}W_{ij}}{(N-1)n}\right] \qquad (2.1.8)$$

$$V[p_{i.}] = \left[\frac{(N-n)W_{i.}(1-W_{i.})}{(N-1)n}\right] \qquad (2.1.9)$$

$$V[p_{.j}] = \left[\frac{(N-n)W_{.j}(1-W_{.j})}{(N-1)n}\right] \qquad (2.1.10)$$

Theorem 2.1 Using (2.1.1) and ignoring terms of higher order, approximate results, for $i \neq i'$ and $j \neq j'$, are

$$\text{(a)}\quad E\left[\frac{p_{ij'}}{p_{ij}}\right] = \frac{W_{ij'}}{W_{ij}}\left[1 + \frac{Var(p_{ij})}{W_{ij}^2} - \frac{Cov(p_{ij}, p_{ij'})}{W_{ij}W_{ij'}}\right]$$

$$\text{(b)}\quad E\left[\frac{p_{ij'}^2}{p_{ij}}\right] = \frac{W_{ij'}^2}{W_{ij}}\left[1 + \frac{Var(p_{ij})}{W_{ij}^2} + \frac{Var(p_{ij'})}{W_{ij'}^2} - \frac{2\,Cov(p_{ij}, p_{ij'})}{W_{ij}W_{ij'}}\right]$$

$$\text{(c)}\quad E\left[\frac{p_{ij'}p_{i'j}}{p_{ij}}\right] = \frac{W_{ij'}W_{i'j}}{W_{ij}}\left[1 + \frac{Var(p_{ij})}{W_{ij}^2} + \frac{Cov(p_{i'j}, p_{ij'})}{W_{i'j}W_{ij'}} - \frac{Cov(p_{ij}, p_{i'j})}{W_{ij}W_{i'j}} - \frac{Cov(p_{ij}, p_{ij'})}{W_{ij}W_{ij'}}\right]$$

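Proof:

(a)
$$E\left[\frac{p_{ij'}}{p_{ij}}\right] = \frac{W_{ij'}}{W_{ij}} E\left[(1 + \varepsilon_{ij'})(1 + \varepsilon_{ij})^{-1}\right] = \frac{W_{ij'}}{W_{ij}} E\left[1 + \varepsilon_{ij'} - \varepsilon_{ij} - \varepsilon_{ij}\varepsilon_{ij'} + \varepsilon_{ij}^2 + \varepsilon_{ij}^2\varepsilon_{ij'} - \dots\right]$$
using the series expansion. Ignoring all higher order terms $[(\varepsilon_{ij})^u(\varepsilon_{ij'})^v]$ for (u + v) > 2,
$$= \frac{W_{ij'}}{W_{ij}}\left[1 + E(\varepsilon_{ij}^2) - E(\varepsilon_{ij}\varepsilon_{ij'})\right] = \frac{W_{ij'}}{W_{ij}}\left[1 + \frac{Var(p_{ij})}{W_{ij}^2} - \frac{Cov(p_{ij}, p_{ij'})}{W_{ij}W_{ij'}}\right]$$

(b)
$$E\left[\frac{p_{ij'}^2}{p_{ij}}\right] = \frac{W_{ij'}^2}{W_{ij}} E\left[(1 + \varepsilon_{ij'})^2(1 + \varepsilon_{ij})^{-1}\right] = \frac{W_{ij'}^2}{W_{ij}} E\left[1 + \varepsilon_{ij'}^2 + 2\varepsilon_{ij'} - \varepsilon_{ij} - 2\varepsilon_{ij}\varepsilon_{ij'} + \varepsilon_{ij}^2 - \dots\right]$$
$$= \frac{W_{ij'}^2}{W_{ij}}\left[1 + E(\varepsilon_{ij}^2) + E(\varepsilon_{ij'}^2) - 2E(\varepsilon_{ij}\varepsilon_{ij'})\right] = \frac{W_{ij'}^2}{W_{ij}}\left[1 + \frac{Var(p_{ij})}{W_{ij}^2} + \frac{Var(p_{ij'})}{W_{ij'}^2} - \frac{2\,Cov(p_{ij}, p_{ij'})}{W_{ij}W_{ij'}}\right]$$

(c)
$$E\left[\frac{p_{i'j}p_{ij'}}{p_{ij}}\right] = \frac{W_{i'j}W_{ij'}}{W_{ij}} E\left[(1 + \varepsilon_{i'j})(1 + \varepsilon_{ij'})(1 + \varepsilon_{ij})^{-1}\right]$$
$$= \frac{W_{i'j}W_{ij'}}{W_{ij}} E\left[1 + \varepsilon_{i'j} + \varepsilon_{ij'} + \varepsilon_{i'j}\varepsilon_{ij'} - \varepsilon_{ij} - \varepsilon_{ij}\varepsilon_{i'j} - \varepsilon_{ij}\varepsilon_{ij'} + \varepsilon_{ij}^2 - \dots\right]$$
$$= \frac{W_{i'j}W_{ij'}}{W_{ij}}\left[1 + E(\varepsilon_{ij}^2) + E(\varepsilon_{i'j}\varepsilon_{ij'}) - E(\varepsilon_{ij}\varepsilon_{i'j}) - E(\varepsilon_{ij}\varepsilon_{ij'})\right]$$
$$= \frac{W_{i'j}W_{ij'}}{W_{ij}}\left[1 + \frac{Var(p_{ij})}{W_{ij}^2} + \frac{Cov(p_{i'j}, p_{ij'})}{W_{i'j}W_{ij'}} - \frac{Cov(p_{ij}, p_{i'j})}{W_{ij}W_{i'j}} - \frac{Cov(p_{ij}, p_{ij'})}{W_{ij}W_{ij'}}\right]$$
obtained by ignoring terms $[(\varepsilon_{ij})^u(\varepsilon_{ij'})^v(\varepsilon_{i'j})^t]$ for (u + v + t) > 2.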
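As a small numerical illustration of Theorem 2.1(a), the approximation can be evaluated directly from (2.1.3) and (2.1.8). The helper names below are hypothetical and the numbers in the usage note are made up for illustration.

```python
def fpc_factor(N, n):
    """Common factor (N - n) / ((N - 1) n) appearing in (2.1.3)-(2.1.10)."""
    return (N - n) / ((N - 1) * n)

def var_p(W, N, n):
    """Variance of a sample proportion under SRSWOR, as in (2.1.3)."""
    return fpc_factor(N, n) * W * (1.0 - W)

def cov_p(W1, W2, N, n):
    """Covariance of two cell proportions under SRSWOR, as in (2.1.8)."""
    return -fpc_factor(N, n) * W1 * W2

def approx_ratio_mean(W_ij, W_ijp, N, n):
    """Second-order approximation to E[p_ij' / p_ij], Theorem 2.1(a)."""
    return (W_ijp / W_ij) * (
        1.0
        + var_p(W_ij, N, n) / W_ij**2
        - cov_p(W_ij, W_ijp, N, n) / (W_ij * W_ijp)
    )
```

Substituting the variance and covariance shows that the bracket in (a) simplifies to $1 + c/W_{ij}$ with $c = (N-n)/((N-1)n)$, so for example N = 1000, n = 100, $W_{ij} = 0.2$, $W_{ij'} = 0.3$ gives a value slightly above the naive ratio 1.5.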

2.2. Symbols Used

$$A_{ij(j')} = E\left[\frac{p_{ij'}}{p_{ij}}\right], \quad B_{ij(j')} = E\left[\frac{p_{ij'}^2}{p_{ij}}\right], \quad C_{ij(i'j')} = E\left[\frac{p_{i'j}p_{ij'}}{p_{ij}}\right], \quad D_{ij} = E\left[\frac{1}{n_{ij}}\right]$$

$$F_{i.} = E\left[\frac{p_{i.}^2}{N_{ij}}\right] = \left(\frac{1}{N_{ij}}\right)\left\{\frac{(N-n)W_{i.}(1-W_{i.})}{(N-1)n} + W_{i.}^2\right\}$$

$$F_{.j} = E\left[\frac{p_{.j}^2}{N_{ij}}\right] = \left(\frac{1}{N_{ij}}\right)\left\{\frac{(N-n)W_{.j}(1-W_{.j})}{(N-1)n} + W_{.j}^2\right\}$$

$$F_{ij} = E\left[\frac{p_{i.}p_{.j}}{N_{ij}}\right] = \left(\frac{1}{N_{ij}}\right)\left\{Cov(p_{i.}, p_{.j}) + E(p_{i.})E(p_{.j})\right\}$$

$$M_i = \sum_{j=1}^{s} \bar Y_{ij}, \quad M_j = \sum_{i=1}^{r} \bar Y_{ij}$$

3. Estimation Strategy

To recall, we assume the following:
(a) A two-way set-up of an r x s stratified population of size N exists;
(b) The frame of the N units is available (with respect to any non-stratifying variable), but not the size and frame of each stratum for the criteria of stratification;
(c) The sample size n is sufficiently large to give a good representation of the r x s cells;
(d) Although $W_{ij}$ is unknown, information about $W_{i.}$ and $W_{.j}$ is available.

To estimate $\bar Y$, a proposed class is

$$\phi(\bar y) = \sum_{i=1}^{r}\sum_{j=1}^{s} \phi_1(p_{i.}, p_{.j}, W_{i.}, W_{.j}, \alpha)\,\phi_2(\bar y_{ij}) \qquad (3.1)$$

where $\phi_1(p_{i.}, p_{.j}, W_{i.}, W_{.j}, \alpha)$ is a function of sample and population proportions with $\alpha$ a constant, $\alpha \in (0,1)$, and $\phi_2(\bar y_{ij})$ is a function of sample values.


In particular, we consider

$$\phi_1(p_{i.}, p_{.j}, W_{i.}, W_{.j}, \alpha) = \left[\left(\frac{\alpha}{2}\right)\{p_{i.} + W_{i.}\} + \left(\frac{1-\alpha}{2}\right)\{p_{.j} + W_{.j}\}\right] = W_{\alpha ij}$$

and $\phi_2(\bar y_{ij}) = \bar y_{ij}$, the mean of the (i, j)th cell of the sample, (3.2)

in order to generate an estimator of the class $\phi(\bar y) = \bar y_{gps}$.

3.1. Justification

1) The usual post-stratified and sample mean estimators for the r x s set-up, with known $W_{ij}$, are

$$\bar y_{ps} = \sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}\bar y_{ij} \qquad (3.3)$$

$$\bar y = \left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j=1}^{s} n_{ij}\bar y_{ij}, \qquad v(\bar y) = \left\{\left(\frac{1}{n}\right) - \left(\frac{1}{N}\right)\right\} S^2 .$$

When $W_{ij}$ is unknown, the estimator (3.3) cannot be computed and the sample mean is the only option left.

2) Information on $W_{i.}$ and $W_{.j}$ is commonly known. For example, a university has a proportion p1 of meritorious students, p2 of average students and p3 of below-average students (p1 + p2 + p3 = 1), while the same university has a proportion p4 of students from poor families, p5 from the middle class and p6 from the rich class (p4 + p5 + p6 = 1), making a 3 x 3 deep stratification. This does not provide the details of $W_{ij}$ needed to use (3.3).

3) Consider a telephone directory of N phone-holders in a city. From past records or other sources it is known that 40% of telephone-owners are businessmen, 30% servicemen, 20% private doctors and 10% others, whereas 40% are post-graduates, 30% graduates, 20% secondary-educated and 10% matric-educated. In this two-way stratification nothing is known about the stratum sizes $N_{ij}$.

4) The specified choice of $\phi_1(p_{i.}, p_{.j}, W_{i.}, W_{.j}, \alpha) = W_{\alpha ij}$ and $\phi_2(\bar y_{ij}) = \bar y_{ij}$ is in accordance with Agrawal and Panda (1993). A proper utilization of $W_{i.}$ and $W_{.j}$ is required for efficient estimation.
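The estimator generated by (3.1) and (3.2) can be sketched in a few lines. The function name and argument layout (r x s arrays of cell means and cell counts, plus the known marginal proportions) are assumptions made for illustration; the weights follow the definition of $W_{\alpha ij}$ in (3.2).

```python
import numpy as np

def y_gps(cell_means, cell_counts, W_row, W_col, alpha):
    """Post-stratified estimator built from (3.1)-(3.2).

    cell_means  : r x s array of sample cell means (ybar_ij)
    cell_counts : r x s array of post-stratified sample counts (n_ij)
    W_row, W_col: known row and column proportions (W_i., W_.j)
    alpha       : constant weighting row versus column information
    """
    ybar = np.asarray(cell_means, dtype=float)
    n_ij = np.asarray(cell_counts, dtype=float)
    W_row = np.asarray(W_row, dtype=float)
    W_col = np.asarray(W_col, dtype=float)

    n = n_ij.sum()
    p_row = n_ij.sum(axis=1) / n      # p_i.
    p_col = n_ij.sum(axis=0) / n      # p_.j

    # W_alpha_ij = (alpha/2)(p_i. + W_i.) + ((1 - alpha)/2)(p_.j + W_.j)
    w = ((alpha / 2.0) * (p_row + W_row)[:, None]
         + ((1.0 - alpha) / 2.0) * (p_col + W_col)[None, :])
    return float(np.sum(w * ybar))
```

Setting `alpha` to 1, 0 or 1/2 gives the versions based on row proportions only, column proportions only, or the average of the two.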


Remark 3.1 We denote, at $\alpha = 1$, $\bar y_{gps} = (\bar y_{gps})_1$; at $\alpha = 0$, $\bar y_{gps} = (\bar y_{gps})_0$; and at $\alpha = 1/2$, $\bar y_{gps} = (\bar y_{gps})_{1/2}$, as estimators based purely on the row proportions, the column proportions, and the average of the two, respectively.

Theorem 3.1: The estimator y gps is biased for Y .

Proof: Denote E[(.) |nij] as a conditional expectation given nij.

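$$E(\bar y_{gps}) = E\left[E\left[(\bar y_{gps}) \mid n_{ij}\right]\right] = E\left[\left\{\sum_{i=1}^{r}\sum_{j=1}^{s} W_{\alpha ij} E(\bar y_{ij})\right\} \Big|\, n_{ij}\right] = \sum_{i=1}^{r}\sum_{j=1}^{s} E(W_{\alpha ij})\,\bar Y_{ij}$$

$$= \sum_{i=1}^{r}\sum_{j=1}^{s}\left[\alpha\left(\frac{N_{i.}}{N}\right) + (1-\alpha)\left(\frac{N_{.j}}{N}\right)\right]\bar Y_{ij}$$

$$= \bar Y + [\alpha V_1 + (1-\alpha)V_2]$$

$$= \bar Y + [Bias(\bar y_{gps})]$$

where $V_1 = \sum_{i=1}^{r}\sum_{j=1}^{s}\sum_{j' \neq j} W_{ij}\bar Y_{ij'}$ and $V_2 = \sum_{i=1}^{r}\sum_{i' \neq i}\sum_{j=1}^{s} W_{ij}\bar Y_{i'j}$.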
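The bias expression can be checked numerically on a small table. In this sketch the function names are hypothetical; $V_1$ and $V_2$ are computed directly from their triple-sum definitions and compared with the expectation $\sum\sum[\alpha W_{i.} + (1-\alpha)W_{.j}]\bar Y_{ij}$.

```python
import numpy as np

def bias_terms(W, Ybar):
    """V1 and V2 from their triple-sum definitions (W, Ybar are r x s)."""
    W = np.asarray(W, dtype=float)
    Ybar = np.asarray(Ybar, dtype=float)
    r, s = W.shape
    V1 = sum(W[i, j] * Ybar[i, jp]
             for i in range(r) for j in range(s)
             for jp in range(s) if jp != j)
    V2 = sum(W[i, j] * Ybar[ip, j]
             for i in range(r) for ip in range(r)
             for j in range(s) if ip != i)
    return float(V1), float(V2)

def expected_y_gps(W, Ybar, alpha):
    """Sum over cells of [alpha W_i. + (1 - alpha) W_.j] * Ybar_ij."""
    W = np.asarray(W, dtype=float)
    Ybar = np.asarray(Ybar, dtype=float)
    weights = (alpha * W.sum(axis=1)[:, None]
               + (1.0 - alpha) * W.sum(axis=0)[None, :])
    return float(np.sum(weights * Ybar))
```

For any table of cell proportions `W` and cell means `Ybar`, `expected_y_gps(W, Ybar, alpha)` minus the population mean `np.sum(W * Ybar)` should equal `alpha * V1 + (1 - alpha) * V2`.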

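Theorem 3.2

An approximate expression for the mean square error of $\bar y_{gps}$ is

$$MSE(\bar y_{gps}) = (1/4)\left[U_1 + \alpha^2(R_1 + R_2 + 4V_1^2) + (1-\alpha)^2(S_1 + S_2 + 4V_2^2) + 2\alpha(1-\alpha)(T_1 + T_2 + 4V_1V_2)\right] \qquad (3.2.1)$$

where

$$U_1 = \left[\left\{\frac{2(r+s-1)}{n}\right\} - 3\left\{\frac{(r+s)-1}{N}\right\}\right]\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2$$

$$R_1 = \sum_{i=1}^{r}\sum_{j \neq j'=1}^{s}\left(\frac{1}{n}\right)B_{ij(j')}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{i.}^2 D_{ij}S_{ij}^2 + \sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s}\left(\frac{2}{n}\right)C_{ij(i',j')}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} W_{i.}A_{ij(j')}S_{ij}^2$$
$$\qquad - \sum_{i=1}^{r}\sum_{j=1}^{s} F_{i.}S_{ij}^2 - \left(\frac{3}{N}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s}\left\{\frac{W_{ij'}^2 S_{ij}^2}{W_{ij}}\right\} - \left(\frac{6}{N}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s}\frac{W_{i'j}W_{ij'}}{W_{ij}}S_{ij}^2 + \left(\frac{2}{n} - \frac{3}{N}\right)\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2$$

$$R_2 = \left\{\frac{(N-n)}{(N-1)n}\right\}\left[\sum_{i=1}^{r} W_{i.}(1-W_{i.})M_i^2 - \sum_{i=1}^{r}\sum_{i' \neq i}^{r} M_i M_{i'} W_{i.}W_{i'.}\right]$$

$$S_1 = \sum_{i=1}^{r}\sum_{j \neq j'=1}^{s}\left(\frac{1}{n}\right)B_{ij(j')}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{.j}^2 D_{ij}S_{ij}^2 + \sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s}\left(\frac{2}{n}\right)C_{ij(i',j')}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} W_{.j}A_{ij(j')}S_{ij}^2$$
$$\qquad - \sum_{i=1}^{r}\sum_{j=1}^{s} F_{.j}S_{ij}^2 - \left(\frac{3}{N}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s}\left\{\frac{W_{ij'}^2 S_{ij}^2}{W_{ij}}\right\} - \left(\frac{6}{N}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s}\left(\frac{W_{i'j}W_{ij'}}{W_{ij}}\right)S_{ij}^2 + \left(\frac{2}{n} - \frac{3}{N}\right)\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2$$

$$S_2 = \left\{\frac{(N-n)}{(N-1)n}\right\}\left[\sum_{j=1}^{s} W_{.j}(1-W_{.j})M_j^2 - \sum_{j=1}^{s}\sum_{j' \neq j}^{s} M_j M_{j'} W_{.j}W_{.j'}\right]$$

$$T_1 = \sum_{i=1}^{r}\sum_{j=1}^{s} W_{i.}W_{.j}D_{ij}S_{ij}^2 + \sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s}\left(\frac{1}{n}\right)C_{ij(i',j')}S_{ij}^2 - \sum_{i=1}^{r}\sum_{j=1}^{s} F_{ij}S_{ij}^2$$
$$\qquad + \left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s}(W_{i.} + W_{.j})A_{ij(j')}S_{ij}^2 - \left(\frac{3}{N}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s}\left(\frac{W_{i'j}W_{ij'}}{W_{ij}}\right)S_{ij}^2$$

$$T_2 = -\left\{\frac{(N-n)}{(N-1)n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{i.}W_{.j}M_i M_j$$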

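Proof:

$$MSE[\bar y_{gps}] = V(\bar y_{gps}) + [Bias(\bar y_{gps})]^2$$

where

$$V(\bar y_{gps}) = E[V(\bar y_{gps}) \mid n_{ij}] + V[E(\bar y_{gps}) \mid n_{ij}] \qquad (3.2.1)$$

The symbol $V[(\cdot)/n_{ij}]$ denotes the conditional variance given $n_{ij}$. Picking the first component of (3.2.1),

$$E[V(\bar y_{gps})/n_{ij}] = E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{n_{ij}}\right)W_{\alpha ij}^2 S_{ij}^2\right] - E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{N_{ij}}\right)W_{\alpha ij}^2 S_{ij}^2\right] \qquad (3.2.2)$$

The first term of (3.2.2) is

$$E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{n_{ij}}\right)W_{\alpha ij}^2 S_{ij}^2\right] = E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{n_{ij}}\right)\left\{\left(\frac{\alpha^2}{4}\right)[p_{i.} + W_{i.}]^2 + \left(\frac{(1-\alpha)^2}{4}\right)(p_{.j} + W_{.j})^2 + \left\{\frac{\alpha(1-\alpha)}{2}\right\}(p_{i.} + W_{i.})(p_{.j} + W_{.j})\right\}S_{ij}^2\right]$$

Some derived results, labelled a1, a2, a3, are given below.

$$a_1:\ E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{n_{ij}}\right)\{p_{i.} + W_{i.}\}^2 S_{ij}^2\right] = \left\{\frac{2(r+s-1)}{n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2 + \left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} B_{ij(j')}S_{ij}^2 + \left\{\frac{2}{n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2$$
$$\qquad + \left(\frac{2}{n}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s} C_{ij(i',j')}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{i.}^2 D_{ij}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} W_{i.}A_{ij(j')}S_{ij}^2$$

$$a_2:\ E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{n_{ij}}\right)\{p_{.j} + W_{.j}\}^2 S_{ij}^2\right] = \left\{\frac{2(r+s-1)}{n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2 + \left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} B_{ij(j')}S_{ij}^2 + \left\{\frac{2}{n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2$$
$$\qquad + \left(\frac{2}{n}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s} C_{ij(i',j')}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{.j}^2 D_{ij}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} W_{.j}A_{ij(j')}S_{ij}^2$$

$$a_3:\ E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{n_{ij}}\right)\{p_{i.} + W_{i.}\}\{p_{.j} + W_{.j}\}S_{ij}^2\right] = \left\{\frac{2(r+s-1)}{n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{i.}W_{.j}D_{ij}S_{ij}^2$$
$$\qquad + \left(\frac{1}{n}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s} C_{ij(i',j')}S_{ij}^2 + \left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s}(W_{i.} + W_{.j})A_{ij(j')}S_{ij}^2$$

The above results and Theorem 2.1 are used, in combination with $\alpha$ and the other terms, to obtain the expression

$$E\left[\sum_{i=1}^{r}\sum_{j=1}^{s}\left(\frac{1}{n_{ij}}\right)W_{\alpha ij}^2 S_{ij}^2\right] = \left\{\frac{2(r+s-1)}{4n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2$$
$$\qquad + \left(\frac{\alpha^2}{4}\right)\left[\left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} B_{ij(j')}S_{ij}^2 + \left\{\frac{2}{n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s} C_{ij(i',j')}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{i.}^2 D_{ij}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} W_{i.}A_{ij(j')}S_{ij}^2\right]$$
$$\qquad + \left(\frac{(1-\alpha)^2}{4}\right)\left[\left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} B_{ij(j')}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s} C_{ij(i',j')}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{.j}^2 D_{ij}S_{ij}^2 + \left\{\frac{2}{n}\right\}\sum_{i=1}^{r}\sum_{j=1}^{s} W_{ij}S_{ij}^2 + \left(\frac{2}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s} W_{.j}A_{ij(j')}S_{ij}^2\right]$$
$$\qquad + \left(\frac{\alpha(1-\alpha)}{2}\right)\left[\left(\frac{1}{n}\right)\sum_{i \neq i'=1}^{r}\sum_{j \neq j'=1}^{s} C_{ij(i',j')}S_{ij}^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} W_{i.}W_{.j}D_{ij}S_{ij}^2 + \left(\frac{1}{n}\right)\sum_{i=1}^{r}\sum_{j \neq j'=1}^{s}(W_{i.} + W_{.j})A_{ij(j')}S_{ij}^2\right] \qquad (3.2.3)$$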


We have more results for a4, a5, a6 ⎡ r s ⎛ 1 ⎞ ⎤ 3 r s 1 r s r s ⎜ ⎟ 2 2 ⎧ ()+ − ⎫ 2 2 a 4 : E⎢ {}()pi + Wi Sij ⎥ = ⎨ ⎬ WijSij + FiSij ∑ ∑ ⎜ N ⎟ N ∑ ∑ ∑ ∑ ⎣⎢ i=1 j=1 ⎝ ij ⎠ ⎦⎥ ⎩ ⎭ i=1 j=1 i=1 j=1 r s r s 2 r s ⎧ 3 ⎫ 2 ⎛ 3 ⎞ Wij' 2 ⎛ 6 ⎞ Wi′jWij′ 2 + ⎨ ⎬∑ ∑ WijSij + ⎜ ⎟∑ ∑ Sij + ⎜ ⎟ ∑ ∑ Sij ⎩N⎭ i=1 j=1 ⎝ N ⎠ i=1 j≠ j′=1 Wij ⎝ N ⎠i≠i′=1 j≠ j′=1 Wij ⎡ r s ⎛ ⎞ ⎤ r s r s ⎜ 1 ⎟ 2 2 ⎧3()r + s −1⎫ 2 2 a 5 : E⎢ {}()pi + Wi Sij ⎥ = ⎨ ⎬ WijSij + FjSij ∑ ∑ ⎜ N ⎟ N ∑ ∑ ∑ ∑ ⎣⎢ i=1 j=1 ⎝ ij ⎠ ⎦⎥ ⎩ ⎭ i=1 j=1 i=1 j=1 r s r s 2 r s ⎧ 3 ⎫ 2 ⎛ 3 ⎞ Wij′ 2 ⎛ 6 ⎞ Wi′jWij′ 2 + ⎨ ⎬∑ ∑ WijSij + ⎜ ⎟∑ ∑ Sij + ⎜ ⎟ ∑ ∑ Sij ⎩N⎭ i=1 j=1 ⎝ N ⎠ i=1 j≠ j′=1 Wij ⎝ N ⎠i≠i′=1 j≠ j′=1 Wij ⎡ r s ⎛ ⎞ ⎤ r s ⎜ 1 ⎟ 2 2 ⎧3()r + s −1⎫ 2 a 6 : E⎢ {}()pi. + Wi. + ()p.j + W.j Sij ⎥ = ⎨ ⎬ WijSij ∑ ∑ ⎜ N ⎟ N ∑ ∑ ⎣⎢ i=1 j=1 ⎝ ij ⎠ ⎦⎥ ⎩ ⎭ i=1 j=1 r s r s 2 ⎧ 3 ⎫ Wi′jWij′ 2 + ∑ ∑ FijSij + ⎨ ⎬ ∑ ∑ Sij i=1 j=1 ⎩N⎭i≠i′=1 j≠ j′=1 Wij

Theorem 2.1 is used to obtain a4, a5, and a6 and ultimately ⎡ r s ⎛ ⎞ ⎤ r s ⎛ 2 ⎞⎡ r s r s ⎜ 1 ⎟ 2 2 ⎛ 15 ⎞ 2 α 2 ⎛ 3 ⎞ 2 E⎢ WαijSij ⎥ = ⎜ ⎟ WijSij + ⎜ ⎟⎢ FiSij + ⎜ ⎟ WijSij ∑ ∑ ⎜ N ⎟ 4N ∑ ∑ ⎜ 4 ⎟ ∑ ∑ N ∑ ∑ ⎣⎢ i=1 j=1 ⎝ ij ⎠ ⎦⎥ ⎝ ⎠ i=1 j=1 ⎝ ⎠⎣ i=1 j=1 ⎝ ⎠ i=1 j=1 r s 2 r s 2 r s ⎛ 3 ⎞ Wij′ ⎛ 6 ⎞ Wi′jWij′ ⎤ ⎛ ()1− α ⎞⎡ + S2 + S2 + ⎜ ⎟ F S2 ⎜ ⎟∑ ∑ ij ⎜ ⎟ ∑ ∑ ij ⎥ ⎜ ⎟⎢∑ ∑ .j ij ⎝ N ⎠ i=1 j≠ j′=1 Wij ⎝ N ⎠i≠i′=1 j≠ j′=1 Wij ⎦⎥ ⎝ 4 ⎠⎣ i=1 j=1 r s r s 2 r s ⎤ ⎧ 3 ⎫ 2 ⎛ 3 ⎞ Wij′ 2 ⎛ 6 ⎞ Wi′jWij′ 2 + ⎨ ⎬∑ ∑ WijSij + ⎜ ⎟∑ ∑ Sij + ⎜ ⎟ ∑ ∑ Sij ⎥ + ⎩N⎭ i=1 j=1 ⎝ N ⎠ i=1 j≠ j′=1 Wij ⎝ N ⎠i≠i′=1 j≠ j′=1 Wij ⎦⎥ ⎡ r s r s ⎛α()1− α ⎞ 2 ⎧ 3 ⎫ Wi′jWij′ 2 ⎤ + ⎜ ⎟⎢ FijSij + ⎨ ⎬ Sij ⎥ (3.2.4) 4 ∑ ∑ N ∑∑ W ⎝ ⎠⎣⎢ i=1 j=1 ⎩ ⎭i≠=i'=1j≠ j' 1 ij ⎦⎥ Adding (3.2.3) and (3.2.4) ⎛ 1 ⎞ ⎛ α 2 ⎞ ⎧(1− α)2 ⎫ ⎧α(1− α)⎫ ⎜ ⎟ (3.2.5) E []V()y gps /n ij = ⎜ ⎟ U1 + ⎜ ⎟ R1 + ⎨ ⎬S1 + ⎨ ⎬T1 ⎝ 4 ⎠ ⎝ 4 ⎠ ⎩ 4 ⎭ ⎩ 2 ⎭ Picking second component of (3.2.1) ⎡ r s ⎤ V[]E()y gps | n ij = V⎢∑ ∑ Wαij Yij ⎥ ⎣ i=1 j=1 ⎦ ⎛ α 2 ⎞⎡ r r r ⎤ ⎜ ⎟ 2 = ⎜ ⎟⎢∑ M i V()pi + ∑ ∑ M i M i′Cov (pi pi′ )⎥ ⎝ 4 ⎠⎣ i=1 i=1 i'≠i=1 ⎦


⎛ (1-α )2 ⎞⎡ s r r ⎤ + ⎜ ⎟ M 2 V p + M M Cov p p ⎜ ⎟⎢∑ j ().j ∑ ∑ j j′ (.j .j′ )⎥ ⎝ 4 ⎠⎣ j=1 j=1 j≠ j′=1 ⎦ α()1-α ⎡ r r ⎤ + ⎢∑ ∑ M i M jCov()pi.p.j ⎥ 2 ⎣ i=1 j=1 ⎦ ()N − n ⎡⎛ α 2 ⎞⎧ r r r ⎫ ⎜ ⎟ 2 = ⎢⎜ ⎟⎨∑ M i Wi. ()1− Wi. − ∑∑M i M i′ Wi.Wi′. ⎬ ()N −1 n ⎣⎝ 4 ⎠⎩ i=1 i==1 i′≠i 1 ⎭ ⎛ (1- α)2 ⎞⎧ r r r ⎫ + ⎜ ⎟ M 2 W 1− W − M M W W ⎜ ⎟⎨∑ j .j ().j ∑ ∑ j j′ .j .j′ ⎬ ⎝ 4 ⎠⎩ j=1 j=1 j≠ j′=1 ⎭ α()1- α ⎧ r s ⎫⎤ + ⎨∑ ∑ M i M jWi.W.j ⎬⎥ 4 ⎩ i=1 j=1 ⎭⎦⎥ ⎛ α 2 ⎞ ⎧()1− α 2 ⎫ ⎧α(1− α)⎫ ⎜ ⎟ (3.2.6) = ⎜ ⎟R 2 + ⎨ ⎬S2 + ⎨ ⎬T2 ⎝ 4 ⎠ ⎩ 4 ⎭ ⎩ 2 ⎭ Addition of all components of (3.2.1) along with bias term completes the proof.

Remark 3.2

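At $\alpha = 1$:
$$MSE\left[(\bar{y}_{gps})_{1}\right] = \frac{1}{4}\left\{U_1 + (R_1 + R_2 + 4V_1^2)\right\}$$

At $\alpha = 0$:
$$MSE\left[(\bar{y}_{gps})_{0}\right] = \frac{1}{4}\left\{U_1 + (S_1 + S_2 + 4V_2^2)\right\}$$

At $\alpha = 1/2$:
$$MSE\left[(\bar{y}_{gps})_{1/2}\right] = \frac{1}{16}\left\{4U_1 + (R_1 + R_2 + 4V_1^2) + (S_1 + S_2 + 4V_2^2) + 2(T_1 + T_2 + 4V_1V_2)\right\}$$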

4. Optimum Estimator

Differentiating the MSE with respect to $\alpha$ and equating to zero provides
$$\alpha_{opt} = \frac{(S_1 + S_2 + 4V_2^2) - (T_1 + T_2 + 4V_1V_2)}{(R_1 + R_2 + 4V_1^2) + (S_1 + S_2 + 4V_2^2) - 2(T_1 + T_2 + 4V_1V_2)}$$
An optimal estimator of $\bar{Y}$ is $(\bar{y}_{gps})_{opt}$ with optimum m.s.e.


$$MSE\left[(\bar{y}_{gps})_{opt}\right] = \frac{1}{4}\left[U_1 + \left\{\frac{(R_1 + R_2 + 4V_1^2)(S_1 + S_2 + 4V_2^2) - (T_1 + T_2 + 4V_1V_2)^2}{(R_1 + R_2 + 4V_1^2) + (S_1 + S_2 + 4V_2^2) - 2(T_1 + T_2 + 4V_1V_2)}\right\}\right]$$
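The optimum choice of α can be checked numerically. The sketch below is only illustrative: it assumes the MSE is the quadratic in α that reproduces the three special cases of Remark 3.2, and it uses the shorthand names `R`, `S` and `T` (not in the original) for the bracketed sums R1 + R2 + 4V1², S1 + S2 + 4V2² and T1 + T2 + 4V1V2.

```python
def mse_gps(alpha, U1, R, S, T):
    """MSE as a quadratic in alpha (assumed general form).

    R, S, T are shorthand for R1 + R2 + 4*V1**2, S1 + S2 + 4*V2**2
    and T1 + T2 + 4*V1*V2; they are illustrative names only.
    """
    return 0.25 * (U1 + alpha**2 * R + (1 - alpha)**2 * S
                   + 2 * alpha * (1 - alpha) * T)


def alpha_opt(R, S, T):
    """Value of alpha obtained by setting the derivative of the MSE to zero."""
    denom = R + S - 2 * T
    if denom == 0:
        raise ValueError("R + S - 2T must be non-zero")
    return (S - T) / denom


def mse_opt(U1, R, S, T):
    """Closed-form MSE at alpha_opt."""
    return 0.25 * (U1 + (R * S - T**2) / (R + S - 2 * T))


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the formulas.
    U1, R, S, T = 1.0, 4.0, 3.0, 1.0
    a = alpha_opt(R, S, T)
    print(a, mse_gps(a, U1, R, S, T), mse_opt(U1, R, S, T))
```

With these hypothetical inputs the closed-form optimum MSE coincides with the quadratic evaluated at `alpha_opt`, which is a quick consistency check on the two displayed formulas.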

Remark 4.1

1) $(\bar{y}_{gps})_{1}$ is efficient over $(\bar{y}_{gps})_{0}$ if
$$(R_1 + R_2 + 4V_1^2) \le (S_1 + S_2 + 4V_2^2)$$

2) $(\bar{y}_{gps})_{1}$ is efficient over $(\bar{y}_{gps})_{1/2}$ if
$$(R_1 + R_2 + 4V_1^2) \le \frac{1}{3}\left[(S_1 + S_2 + 4V_2^2) + 2(T_1 + T_2 + 4V_1V_2)\right]$$

3) $(\bar{y}_{gps})_{0}$ is efficient over $(\bar{y}_{gps})_{1/2}$ if
$$(S_1 + S_2 + 4V_2^2) \le \frac{1}{3}\left[(R_1 + R_2 + 4V_1^2) + 2(T_1 + T_2 + 4V_1V_2)\right]$$

5. Numerical Illustrations

We consider four populations of sizes N = 400, 435, 650 and 490. Random samples of sizes 160, 174, 260 and 196, respectively, are drawn from these populations by SRSWOR and post-stratified according to 2 x 2, 2 x 3, 3 x 3 and 3 x 3 classifications. The population parameters and the sample counts n_ij are given in the tables below:

Table 5.1. (for data set I)

Attribute B (rows) by Attribute A (columns); N = stratum size, n = sample count, Ȳ = mean, W = proportion, S² = variance.

Cell (B, A)            | N   | n   | Ȳ       | W      | S²
LOW, Low (11)          | 90  | 36  | 71.72   | 0.225  | 1713.79
LOW, High (12)         | 95  | 38  | 221.74  | 0.2375 | 1872.94
LOW, Total (1.)        | 185 | 74  | 113.86  | 0.462  | -
HIGH, Low (21)         | 105 | 42  | 378.54  | 0.2625 | 1912.33
HIGH, High (22)        | 110 | 44  | 529.57  | 0.275  | 1704.66
HIGH, Total (2.)       | 215 | 86  | 455.811 | 0.5375 | -
Total, Low (.1)        | 195 | 78  | 236.93  | 0.4875 | -
Total, High (.2)       | 205 | 82  | 386.57  | 0.512  | -
Total, Total           | 400 | 160 | 313.798 | -      | 2082.96


Table 5.2. (for data set II)

Attribute B (rows) by Attribute A (columns); N = stratum size, n = sample count, Ȳ = mean, W = proportion, S² = variance.

Cell (B, A)            | N   | n   | Ȳ        | W      | S²
LOW, Low (11)          | 71  | 28  | 48.7464  | 0.1632 | 857.187
LOW, Medium (12)       | 65  | 26  | 147.6923 | 0.1494 | 791.2476
LOW, High (13)         | 68  | 27  | 247.4264 | 0.1563 | 876.9092
LOW, Total (1.)        | 204 | 81  | 146.499  | 0.4689 | -
HIGH, Low (21)         | 77  | 31  | 346.8831 | 0.1770 | 866.6208
HIGH, Medium (22)      | 74  | 30  | 431.9054 | 0.1701 | 964.5512
HIGH, High (23)        | 80  | 32  | 549.525  | 0.1839 | 829.1359
HIGH, Total (2.)       | 231 | 93  | 444.29   | 0.5311 | -
Total, Low (.1)        | 148 | 59  | 203.858  | 0.3402 | -
Total, Medium (.2)     | 139 | 56  | 298.999  | 0.3195 | -
Total, High (.3)       | 148 | 59  | 410.7229 | 0.3402 | -
Total, Total           | 435 | 174 | 3104.636 | -      | -

Table 5.3. (for data set III)

Attribute B (rows) by Attribute A (columns); N = stratum size, n = sample count, Ȳ = mean, W = proportion, S² = variance.

Cell (B, A)            | N   | n   | Ȳ         | W       | S²
LOW, Low (11)          | 71  | 28  | 48.7464   | 0.10923 | 857.187
LOW, Medium (12)       | 65  | 26  | 147.6923  | 0.1     | 791.2476
LOW, High (13)         | 68  | 27  | 247.4264  | 0.10461 | 876.9092
LOW, Total (1.)        | 205 | 81  | 146.499   | 0.3138  | -
MEDIUM, Low (21)       | 77  | 31  | 346.8831  | 0.11846 | 866.6208
MEDIUM, Medium (22)    | 74  | 30  | 431.9054  | 0.11384 | 964.5512
MEDIUM, High (23)      | 80  | 32  | 549.525   | 0.12307 | 829.1359
MEDIUM, Total (2.)     | 231 | 93  | 444.29    | 0.3553  | -
HIGH, Low (31)         | 73  | 29  | 654.315   | 0.1123  | 787.885
HIGH, Medium (32)      | 70  | 28  | 737.957   | 0.10769 | 1044.759
HIGH, High (33)        | 72  | 29  | 846.597   | 0.11076 | 780.469
HIGH, Total (3.)       | 235 | 86  | 745.939   | 0.3307  | -
Total, Low (.1)        | 221 | 88  | 352.6515  | 0.3399  | -
Total, Medium (.2)     | 209 | 84  | 446.01912 | 0.32153 | -
Total, High (.3)       | 220 | 88  | 553.3727  | 0.3384  | -
Total, Total           | 650 | 260 | 450.828   | -       | -


Table 5.4. (for data set IV)

Attribute B (rows) by Attribute A (columns); N = stratum size, n = sample count, Ȳ = mean, W = proportion, S² = variance.

Cell (B, A)            | N   | n   | Ȳ        | W       | S²
LOW, Low (11)          | 56  | 22  | 37.232   | 0.1143  | 504.1068
LOW, Medium (12)       | 50  | 20  | 113.02   | 0.102   | 434.947
LOW, High (13)         | 52  | 21  | 188.8846 | 0.1061  | 505.163
LOW, Total (1.)        | 185 | 63  | 111.1265 | 0.3224  | -
MEDIUM, Low (21)       | 48  | 19  | 267.3333 | 0.09796 | 500.926
MEDIUM, Medium (22)    | 62  | 25  | 321.258  | 0.1265  | 768.06
MEDIUM, High (23)      | 58  | 23  | 413.776  | 0.11836 | 466.716
MEDIUM, Total (2.)     | 168 | 67  | 337.792  | 0.34282 | -
HIGH, Low (31)         | 54  | 22  | 483.037  | 0.1102  | 564.56
HIGH, Medium (32)      | 60  | 24  | 553.666  | 0.12245 | 529.17
HIGH, High (33)        | 50  | 20  | 625.000  | 0.102   | 562.53
HIGH, Total (3.)       | 164 | 66  | 552.1583 | 0.33465 | -
Total, Low (.1)        | 158 | 63  | 259.4999 | 0.32246 | -
Total, Medium (.2)     | 172 | 69  | 3341.796 | 0.35095 | -
Total, High (.3)       | 160 | 64  | 406.694  | 0.32646 | -
Total, Total           | 490 | 196 | 336.451  | -       | -


(a) Bias and M.S.E. of $\bar{y}_{gps}$

Estimator                      | Quantity | Data set I | Data set II | Data set III | Data set IV
(ȳ_gps)_1                      | MSE      | 81.2278    | 96.4741     | 188.4567     | 89.29
(ȳ_gps)_1                      | Bias     | 3.1003     | 6.0884      | 9.0127       | 6.7285
(ȳ_gps)_0                      | MSE      | 78.0872    | 81.3172     | 89.2869      | 75.0267
(ȳ_gps)_0                      | Bias     | 2.90674    | 2.8628      | 8.8626       | 6.5578
(ȳ_gps)_1/2                    | MSE      | 9.8229     | 6.4751      | 3.0038       | 5.416
(ȳ_gps)_1/2                    | Bias     | 3.0039     | 4.4756      | 8.9374       | 6.4432
(ȳ_gps)_opt with α_opt         | MSE      | 9.8141     | 5.2971      | 2.7869       | 1.655
(ȳ_gps)_opt with α_opt         | Bias     | 2.2463     | 0.0455      | 8.1721       | 5.7055
(ȳ_gps)_opt with α_opt         | Opt. α   | 0.4943     | 0.4827      | 0.4773       | 0.4767

(b) It is not worthwhile to compare $\bar{y}_{gps}$ with the usual post-stratified estimator, since the $W_{ij}$'s are assumed unknown. The estimator $(\bar{y}_{gps})_{1/2}$ appears more efficient than $(\bar{y}_{gps})_{0}$ and $(\bar{y}_{gps})_{1}$ for all four data sets, possibly because the value α = 1/2 is very close to its optimum choice.

(c) The estimator $\bar{y}_{gps}$ makes it possible to estimate $\bar{Y}$ in an r x s set-up even without knowledge of the $W_{ij}$ and stratum frames. It makes effective use of the row and column proportions $W_{i.}$ and $W_{.j}$.

(d) The estimator is found most efficient at the optimal selection of α=0.4943 for set-I, α=0.4872 for set-II, α=0.4773 for set-III and α=0.4767 for set-IV.

(e) On the basis of the data considered here, one can think of choosing α near 0.5, which means taking roughly fifty percent of the row sum of proportions [(ni./n)+(Ni./N)] and fifty percent of the corresponding column sum. This gives a quick and easy choice of α. Thus, for the proposed estimator, an easy near-optimum choice is α=1/2 or close to it.


REFERENCES

AGARWAL, M. C. AND PANDA, K. B. (1993): An efficient estimator in post-stratification, Metron, Vol. 5, 3-4, 179-187.
BRYANT, E. C. (1955): An Analysis of some two-way stratifications (unpublished Ph.D. Thesis). Ames, Iowa, Iowa State University Library.
BRYANT, E. C., HARTLEY, H. O. AND JESSEN, R. J. (1960): Design and estimation in two-way stratification, Jour. Amer. Stat. Asso., 55, 105-124.
HOLT, D. AND SMITH, T. M. F. (1979): Post-Stratification, J.R. Stat. Soc. A, 142, 33-36.
JAGERS, P., ODEN, A. AND TRULSSON, L. (1985): Post-Stratification and ratio estimation, Int. Stat. Rev., 53, 221-238.
JAGERS, P. (1986): Post-Stratification against bias in sampling, Int. Stat. Rev., 54, 159-167.
SMITH, T. M. F. (1991): Post-Stratification, The Statistician, 40, 315-323.
SUKHATME, P. V., SUKHATME, B. V., SUKHATME, S. AND ASOK, C. (1984): Sampling Theory of Surveys with Applications, Iowa State University Press, Indian Society of Agricultural Statistics Publication, New Delhi.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No.6, pp. 1311—1325

AN EFFICIENT VARIANT OF THE PRODUCT AND RATIO ESTIMATORS IN STRATIFIED RANDOM SAMPLING

Housila P. Singh1 and Gajendra K. Vishwakarma1

ABSTRACT

This paper introduces two classes of estimators of population mean of the study variable using auxiliary variable in stratified random sampling. The biases and variances of the proposed estimators have been derived under large sample approximation. Asymptotic optimum estimators (AOEs) are identified with their approximate variance formulae. Estimators based on “estimated optimum values” are also investigated. It has been shown to the first degree of approximation that the variances of the estimators based on estimated optimum values are same as that of optimum estimators. An empirical study is carried out to demonstrate the performances of the suggested estimators over others.

Keywords: Auxiliary variable, Study variable, Population mean, Bias, Variance.

1. Introduction

Consider a finite population $U = (U_1, U_2, U_3, \ldots, U_N)$ of size N and let y and x, respectively, be the study and auxiliary variables on each unit $U_j$ $(j = 1, 2, 3, \ldots, N)$ of the population U. Let the population be divided into L strata with the h-th stratum containing $N_h$ units, $h = 1, 2, 3, \ldots, L$, so that $\sum_{h=1}^{L} N_h = N$. Suppose that a simple random sample of size $n_h$ is drawn without replacement (SRSWOR) from the h-th stratum such that $\sum_{h=1}^{L} n_h = n$. Let

1 School of Studies in Statistics, Vikram University, Ujjain -456010, M.P., India, e-mail: [email protected] and [email protected]


$(y_{hi}, x_{hi})$ denote the observed values of y and x on the i-th unit of the h-th stratum $(i = 1, 2, 3, \ldots, N_h;\ h = 1, 2, 3, \ldots, L)$. Moreover, let us denote by
$$\bar{y}_h = \sum_{i=1}^{n_h} y_{hi}/n_h,\quad \bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h,\quad \bar{x}_h = \sum_{i=1}^{n_h} x_{hi}/n_h,\quad \bar{x}_{st} = \sum_{h=1}^{L} W_h \bar{x}_h$$
and
$$\bar{Y}_h = \sum_{i=1}^{N_h} y_{hi}/N_h,\quad \bar{Y} = \sum_{h=1}^{L} W_h \bar{Y}_h,\quad \bar{X}_h = \sum_{i=1}^{N_h} x_{hi}/N_h,\quad \bar{X} = \sum_{h=1}^{L} W_h \bar{X}_h$$
the sample and population means of y and x, where $W_h = N_h/N$.

In order to have a survey estimate of the population mean $\bar{Y}$ of the main variable y, assuming the knowledge of the population mean $\bar{X}_h$ of the h-th stratum $(h = 1, 2, 3, \ldots, L)$ of the auxiliary variable x, we mention the following well-known estimators.

The separate ratio estimator
$$\bar{y}_{Rs} = \sum_{h=1}^{L} W_h \hat{R}_h \bar{X}_h, \quad \bar{x}_h \neq 0 \tag{1.1}$$
where $\hat{R}_h = \bar{y}_h/\bar{x}_h$ is the estimate of the ratio $R_h = \bar{Y}_h/\bar{X}_h$, $\bar{X}_h \neq 0$, of the h-th stratum in the population. This estimator is only efficient if the variables are strongly positively correlated.

The separate product estimator
$$\bar{y}_{Ps} = \sum_{h=1}^{L} W_h \frac{\hat{P}_h}{\bar{X}_h} \tag{1.2}$$
where $\hat{P}_h = \bar{y}_h \bar{x}_h$ is the estimate of the product $P_h = \bar{Y}_h \bar{X}_h$ of the means of the h-th stratum in the population. This estimator will often be used if the two variables are supposed to be strongly negatively correlated.

When the population mean $\bar{X}$ of the auxiliary variate x is known, Hansen, Hurwitz, and Gurney (1946) suggested a "combined ratio estimator"
$$\bar{y}_{Rc} = \bar{y}_{st}\left(\frac{\bar{X}}{\bar{x}_{st}}\right) \tag{1.3}$$
The "combined product estimator" for $\bar{Y}$ is defined by

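$$\bar{y}_{Pc} = \bar{y}_{st}\left(\frac{\bar{x}_{st}}{\bar{X}}\right) \tag{1.4}$$
To the first degree of approximation, the biases and variances of $\bar{y}_{Rs}$, $\bar{y}_{Ps}$, $\bar{y}_{Rc}$ and $\bar{y}_{Pc}$ are respectively given by
$$B(\bar{y}_{Rs}) = \sum_{h=1}^{L} W_h \bar{Y}_h \left(\frac{1}{n_h} - \frac{1}{N_h}\right) C_{hx}^2 (1 - K_h) \tag{1.5}$$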


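$$V(\bar{y}_{Rs}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \bar{Y}_h^2 \left[C_{hy}^2 + C_{hx}^2 (1 - 2K_h)\right] \tag{1.6}$$
$$B(\bar{y}_{Ps}) = \sum_{h=1}^{L} W_h \bar{Y}_h \left(\frac{1}{n_h} - \frac{1}{N_h}\right) C_{hx}^2 K_h \tag{1.7}$$
$$V(\bar{y}_{Ps}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \bar{Y}_h^2 \left[C_{hy}^2 + C_{hx}^2 (1 + 2K_h)\right] \tag{1.8}$$
$$B(\bar{y}_{Rc}) = \frac{1}{\bar{X}} \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \left(R S_{hx}^2 - S_{hxy}\right) \tag{1.9}$$
$$V(\bar{y}_{Rc}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \left(S_{hy}^2 + R^2 S_{hx}^2 - 2R S_{hxy}\right) \tag{1.10}$$
$$B(\bar{y}_{Pc}) = \frac{1}{\bar{X}} \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hxy} \tag{1.11}$$
$$V(\bar{y}_{Pc}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \left(S_{hy}^2 + R^2 S_{hx}^2 + 2R S_{hxy}\right) \tag{1.12}$$
where
$$C_{hx}^2 = \frac{S_{hx}^2}{\bar{X}_h^2},\quad C_{hy}^2 = \frac{S_{hy}^2}{\bar{Y}_h^2},\quad R = \frac{\bar{Y}}{\bar{X}},\quad K_h = \rho_{hxy}\frac{C_{hy}}{C_{hx}},$$
$$S_{hy}^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^2,\quad S_{hx}^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} (x_{hi} - \bar{X}_h)^2,$$
$$S_{hxy} = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)(x_{hi} - \bar{X}_h) \quad\text{and}\quad \rho_{hxy} = \frac{S_{hxy}}{S_{hx} S_{hy}}.$$

In stratified random sampling, the total sample size is $n = \sum_{h=1}^{L} n_h$. If an equivalent simple random sample of size n were selected without replacement directly from the population of size N, the variance of the mean per unit $\left(\bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j\right)$ and the variance of the usual ratio estimator $\bar{y}_R = \bar{y}\,\bar{X}/\bar{x}$, to the first degree of approximation, are respectively given by
$$V(\bar{y}) = \left(\frac{1}{n} - \frac{1}{N}\right) \bar{Y}^2 C_y^2 \tag{1.13}$$
and
$$V(\bar{y}_R) = \left(\frac{1}{n} - \frac{1}{N}\right) \bar{Y}^2 \left[C_y^2 + C_x^2 (1 - 2K)\right] \tag{1.14}$$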


where
$$C_x^2 = \frac{S_x^2}{\bar{X}^2},\quad C_y^2 = \frac{S_y^2}{\bar{Y}^2},\quad K = \rho\frac{C_y}{C_x},\quad S_y^2 = \frac{1}{N - 1}\sum_{h=1}^{L}\sum_{i=1}^{N_h} (y_{hi} - \bar{Y})^2,$$
$$S_x^2 = \frac{1}{N - 1}\sum_{h=1}^{L}\sum_{i=1}^{N_h} (x_{hi} - \bar{X})^2,\quad S_{xy} = \frac{1}{N - 1}\sum_{h=1}^{L}\sum_{i=1}^{N_h} (y_{hi} - \bar{Y})(x_{hi} - \bar{X}) \quad\text{and}\quad \rho = \frac{S_{xy}}{S_x S_y}.$$

In this paper, motivated by Sahai (1979, p. 34), we suggest modified 'separate product and ratio estimators' and 'combined product and ratio estimators' and study their properties. Numerical examples are given in support of the present study.

2. Modified Separate Product and Ratio Estimators

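Following Sahai (1979) we define a class of estimators of $\bar{Y}$ as
$$\bar{y}_{ms} = \sum_{h=1}^{L} W_h \bar{y}_h \frac{(\bar{x}_h + \theta_h \bar{X}_h)}{(\bar{X}_h + \theta_h \bar{x}_h)} \tag{2.1}$$
where the $\theta_h$'s are suitably chosen constants. For $\theta_h = 1$, $\bar{y}_{ms} = \bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h$ and, for $\theta_h = 0$, $\bar{y}_{ms} = \bar{y}_{Ps} = \sum_{h=1}^{L} W_h \bar{y}_h \bar{x}_h/\bar{X}_h$. Moreover, if $\theta_h$ is very large, $\bar{y}_{ms}$ is almost the same as $\bar{y}_{Rs} = \sum_{h=1}^{L} W_h \bar{y}_h \bar{X}_h/\bar{x}_h$.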
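The three special cases of the class (2.1) are easy to confirm numerically. The following is a minimal sketch; the function name and the stratum values in the usage example are hypothetical and serve only to illustrate the formula.

```python
def modified_separate_estimator(W, y_bar, x_bar, X_bar, theta):
    """Evaluate the class of estimators (2.1).

    All arguments are per-stratum sequences of equal length:
    stratum weights W_h, sample means of y and x, population
    means of x, and the constants theta_h.
    """
    total = 0.0
    for w, y, x, X, t in zip(W, y_bar, x_bar, X_bar, theta):
        total += w * y * (x + t * X) / (X + t * x)
    return total


if __name__ == "__main__":
    # Hypothetical two-stratum example.
    W = [0.4, 0.6]
    y_bar = [10.0, 20.0]
    x_bar = [4.0, 5.0]
    X_bar = [5.0, 4.0]
    for t in (1.0, 0.0, 1e9):
        print(t, modified_separate_estimator(W, y_bar, x_bar, X_bar, [t, t]))
```

Setting every theta_h to 1 returns the stratified mean, 0 returns the separate product estimator, and a very large value approaches the separate ratio estimator.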

In practice [Murthy (1967), p.370] the variability of $\bar{x}_h$ is usually less than that of $\bar{y}_h$. If $C(\bar{x}_h)$ denotes the coefficient of variation of $\bar{x}_h$, and likewise $C(\bar{y}_h)$ that of $\bar{y}_h$, then
$$C^2(\bar{x}_h) = \left(\frac{1}{n_h} - \frac{1}{N_h}\right) C_{hx}^2 \quad\text{and}\quad C^2(\bar{y}_h) = \left(\frac{1}{n_h} - \frac{1}{N_h}\right) C_{hy}^2,$$
where $C_{hx} = S_{hx}/\bar{X}_h$ and $C_{hy} = S_{hy}/\bar{Y}_h$ are the coefficients of variation of the h-th stratum in the population for the two variables.

It follows that if $C_{hx} = a_h C_{hy}$, we have
$$C(\bar{x}_h) = a_h C(\bar{y}_h); \quad 0 < a_h \le 1 \tag{2.2}$$
Without serious loss of generality, we assume that the observations on y and x are all non-negative, so that the sample and population means are all positive.

Sometimes a good guess of the value of $K_h = \rho_{hxy} C_{hy}/C_{hx}$ is available from a pilot sample, past data, experience (long association with the experimental material) or otherwise. In other practical situations the value of $K_h$ may be known or guessed to be in a certain interval $K_{h(1)} \le K_h \le K_{h(2)}$, which is more realistic than a specific guess about $K_h$. Such information may be utilized to provide estimators more efficient than the separate ratio, product and stratified random sample mean ($\bar{y}_{st}$) estimators; for instance, see Sahai (1979) and Sahai and Ray (1980).

2.1. Bias and Variance of the estimator yms

Applying the standard techniques we evaluate the first-degree approximation (up to the terms of order $n^{-1}$) to the variance / mean square error of the suggested estimator.

Let $\bar{y}_h = \bar{Y}_h (1 + e_{0h})$ and $\bar{x}_h = \bar{X}_h (1 + e_{1h})$ so that $E(e_{0h}) = E(e_{1h}) = 0$,
$$E(e_{0h}^2) = \left(\frac{1}{n_h} - \frac{1}{N_h}\right) C_{hy}^2,\quad E(e_{1h}^2) = \left(\frac{1}{n_h} - \frac{1}{N_h}\right) C_{hx}^2 \quad\text{and}\quad E(e_{0h} e_{1h}) = \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \rho_{hxy} C_{hx} C_{hy}.$$
Now we have
$$\bar{y}_{ms} = \sum_{h=1}^{L} W_h \bar{Y}_h (1 + e_{0h}) \left\{1 + \frac{e_{1h}}{(1 + \theta_h)}\right\} \left\{1 + \frac{\theta_h e_{1h}}{(1 + \theta_h)}\right\}^{-1}$$
or
$$\bar{y}_{ms} - \bar{Y} = \sum_{h=1}^{L} W_h \bar{Y}_h \left[(1 + e_{0h}) \left\{1 + \frac{e_{1h}}{(1 + \theta_h)}\right\} \left\{1 + \frac{\theta_h e_{1h}}{(1 + \theta_h)}\right\}^{-1} - 1\right]$$
Suppose $\left|\frac{\theta_h e_{1h}}{(1 + \theta_h)}\right| < 1$ so that $\left\{1 + \frac{\theta_h e_{1h}}{(1 + \theta_h)}\right\}^{-1}$ is expandable. Therefore, to the first degree of approximation, we obtain the bias and variance of $\bar{y}_{ms}$ respectively as
$$B(\bar{y}_{ms}) = \sum_{h=1}^{L} W_h \bar{Y}_h E\left[\frac{(1 - \theta_h)}{(1 + \theta_h)} e_{0h} e_{1h} - \theta_h \frac{(1 - \theta_h)}{(1 + \theta_h)^2} e_{1h}^2\right]$$
$$= \sum_{h=1}^{L} W_h \bar{Y}_h \left(\frac{1}{n_h} - \frac{1}{N_h}\right) D_h \left[K_h - \frac{(1 - D_h)}{2}\right] C_{hx}^2 \tag{2.3}$$
$$V(\bar{y}_{ms}) = \sum_{h=1}^{L} W_h^2 \bar{Y}_h^2 E(e_{0h} + D_h e_{1h})^2$$
$$= \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \bar{Y}_h^2 \left[C_{hy}^2 + D_h C_{hx}^2 (D_h + 2K_h)\right] \tag{2.4}$$
where $D_h = \frac{(1 - \theta_h)}{(1 + \theta_h)}$. From (1.5), (1.7) and (2.3) it follows that

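(i) $|B(\bar{y}_{ms})| < |B(\bar{y}_{Rs})|$ if
$$\left|D_h \left\{K_h - \frac{(1 - D_h)}{2}\right\}\right| < |1 - K_h| \tag{2.5}$$

(ii) $|B(\bar{y}_{ms})| < |B(\bar{y}_{Ps})|$ if
$$\left|D_h \left\{K_h - \frac{(1 - D_h)}{2}\right\}\right| < |K_h| \tag{2.6}$$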

The variance of $\bar{y}_{ms}$ at (2.4) is minimum when
$$D_h = -K_h = D_{h0} \text{ (say)} \tag{2.7}$$
Thus the resulting minimum variance of $\bar{y}_{ms}$ is given by
$$\min V(\bar{y}_{ms}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hy}^2 (1 - \rho_{hxy}^2) \tag{2.8}$$
which equals the approximate variance of the separate regression estimator
$$\bar{y}_{ls} = \sum_{h=1}^{L} W_h \left[\bar{y}_h + \hat{\beta}_{hxy} (\bar{X}_h - \bar{x}_h)\right],$$
where $\hat{\beta}_{hxy}$ is the sample regression coefficient of y on x in the h-th stratum.

Now, we express the variance of $\bar{y}_{ms}$ as
$$V(\bar{y}_{ms}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \bar{Y}_h^2 \left[C_{hy}^2 + D_h (D_h - 2D_{h0}) C_{hx}^2\right] \tag{2.9}$$
It follows from (2.9) that $V(\bar{y}_{ms}) < V(\bar{y}_{st})$ if $D_h^2 - 2D_h D_{h0} < 0$, i.e. if
$$|D_h - D_{h0}| < |D_{h0}| \tag{2.10}$$
From (1.6) and (2.9) it is seen that $V(\bar{y}_{ms}) < V(\bar{y}_{Rs})$ if $D_h^2 - 2D_h D_{h0} < 1 + 2D_{h0}$ $(D_{h0} = -K_h)$, i.e. if $D_h^2 - 2D_h D_{h0} + D_{h0}^2 < 1 + 2D_{h0} + D_{h0}^2$, i.e. if
$$|D_h - D_{h0}| < |1 + D_{h0}| \tag{2.11}$$
where the correlation coefficient $\rho_{hxy}$ is positive (i.e. $\rho_{hxy} > 0$) and hence $-D_{h0}$ is positive (i.e. $-D_{h0} > 0$) as $D_{h0} = -K_h$.

Further, from (1.8) and (2.9) we have $V(\bar{y}_{ms}) < V(\bar{y}_{Ps})$ if $D_h^2 - 2D_h D_{h0} < 1 - 2D_{h0}$, i.e. if $D_h^2 - 2D_h D_{h0} + D_{h0}^2 < 1 - 2D_{h0} + D_{h0}^2$, i.e. if
$$|D_h - D_{h0}| < |1 - D_{h0}| \tag{2.12}$$
where the correlation coefficient $\rho_{hxy}$ is negative (i.e. $\rho_{hxy} < 0$) and hence $-D_{h0}$ is negative (i.e. $-D_{h0} < 0$) as $D_{h0} = -K_h$. Thus we have established the following theorem.

Theorem 2.1- The proposed estimator $\bar{y}_{ms}$ is better than $\bar{y}_{st}$, $\bar{y}_{Rs}$, and $\bar{y}_{Ps}$ if the inequalities

(i) $|D_h - D_{h0}| < |D_{h0}|$

(ii) $|D_h - D_{h0}| < |1 + D_{h0}|$, $(\rho_{hxy}, -D_{h0} > 0)$

(iii) $|D_h - D_{h0}| < |1 - D_{h0}|$, $(\rho_{hxy}, -D_{h0} < 0)$

respectively hold good.

It is to be noted that the estimator $\bar{y}_{ms}$ attains the minimum variance at (2.7) only when the exact optimum value of $D_h$ (i.e. $D_{h0}$) is known. Quite often, in practical situations, the coefficients of variation ($C_{hy}$ and $C_{hx}$) are more or less known. The value of the correlation coefficient $\rho_{hxy}$ may also be known from experience, so that in many practical situations it may be reasonable to guess $D_{h0}$. However, there may be some situations where $D_{h0}$ (i.e. $K_h$) is unknown to the experimenter. In such situations we estimate $K_h$ from the sample as
$$\hat{K}_h = \hat{\rho}_{hxy}\,\hat{C}_{hy}/\hat{C}_{hx} \;\Rightarrow\; \hat{D}_{h0} = -\hat{K}_h \tag{2.13}$$
where
$$\hat{C}_{hy} = \frac{s_{hy}}{\bar{y}_h},\quad \hat{C}_{hx} = \frac{s_{hx}}{\bar{x}_h},\quad s_{hy}^2 = \frac{1}{(n_h - 1)}\sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2,\quad s_{hx}^2 = \frac{1}{(n_h - 1)}\sum_{i=1}^{n_h} (x_{hi} - \bar{x}_h)^2$$
and
$$s_{hxy} = \frac{1}{(n_h - 1)}\sum_{i=1}^{n_h} (x_{hi} - \bar{x}_h)(y_{hi} - \bar{y}_h).$$
Thus the resulting estimator based on the estimated optimum is given by
$$\bar{y}_{ms}^{*} = \sum_{h=1}^{L} W_h \bar{y}_h \frac{\{(1 + \hat{D}_{h0})\bar{x}_h + (1 - \hat{D}_{h0})\bar{X}_h\}}{\{(1 + \hat{D}_{h0})\bar{X}_h + (1 - \hat{D}_{h0})\bar{x}_h\}} = \sum_{h=1}^{L} W_h \bar{y}_h \frac{\{(1 - \hat{K}_h)\bar{x}_h + (1 + \hat{K}_h)\bar{X}_h\}}{\{(1 - \hat{K}_h)\bar{X}_h + (1 + \hat{K}_h)\bar{x}_h\}} \tag{2.14}$$
To obtain the variance of $\bar{y}_{ms}^{*}$ to the first degree of approximation, we write $\hat{K}_h = K_h (1 + e_{2h})$ such that $E(e_{2h}) = O(n_h^{-1})$. Expressing (2.14) in terms of the e's we have
$$\bar{y}_{ms}^{*} = \sum_{h=1}^{L} W_h \bar{Y}_h (1 + e_{0h}) \frac{[\{1 - K_h (1 + e_{2h})\}\bar{X}_h (1 + e_{1h}) + \{1 + K_h (1 + e_{2h})\}\bar{X}_h]}{[\{1 - K_h (1 + e_{2h})\}\bar{X}_h + \{1 + K_h (1 + e_{2h})\}\bar{X}_h (1 + e_{1h})]}$$
or
$$\bar{y}_{ms}^{*} - \bar{Y} \cong \sum_{h=1}^{L} W_h \bar{Y}_h (e_{0h} - K_h e_{1h}) \tag{2.15}$$
Squaring both sides of (2.15) and then taking expectations, we get the variance of $\bar{y}_{ms}^{*}$ to the first degree of approximation as
$$V(\bar{y}_{ms}^{*}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hy}^2 (1 - \rho_{hxy}^2) = \min V(\bar{y}_{ms}) \tag{2.16}$$
Thus we have established the following theorem.

Theorem 2.2- To the first degree of approximation, the variance of the estimator $\bar{y}_{ms}^{*}$ "based on estimated optimum value" is the same as the minimum variance of the estimator $\bar{y}_{ms}$.

Now we have
$$V(\bar{y}_{st}) - V(\bar{y}_{ms}^{*}) \text{ (or } \min V(\bar{y}_{ms})) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hy}^2 \rho_{hxy}^2 \ge 0 \tag{2.17}$$
$$V(\bar{y}_{Rs}) - V(\bar{y}_{ms}^{*}) \text{ (or } \min V(\bar{y}_{ms})) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \bar{Y}_h^2 C_{hx}^2 (1 - K_h)^2 \ge 0 \tag{2.18}$$
$$V(\bar{y}_{Ps}) - V(\bar{y}_{ms}^{*}) \text{ (or } \min V(\bar{y}_{ms})) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) \bar{Y}_h^2 C_{hx}^2 (1 + K_h)^2 \ge 0 \tag{2.19}$$


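It follows from (2.17), (2.18) and (2.19) that
$$V(\bar{y}_{ms}^{*}) \text{ (or } \min V(\bar{y}_{ms})) \le V(\bar{y}_{st}) \tag{2.20}$$
$$V(\bar{y}_{ms}^{*}) \text{ (or } \min V(\bar{y}_{ms})) \le V(\bar{y}_{Rs}) \tag{2.21}$$
and
$$V(\bar{y}_{ms}^{*}) \text{ (or } \min V(\bar{y}_{ms})) \le V(\bar{y}_{Ps}) \tag{2.22}$$
Thus from (2.20), (2.21) and (2.22) we state the following theorem.

Theorem 2.3- The estimator $\bar{y}_{ms}^{*}$ (or the optimum estimator in $\bar{y}_{ms}$) is more efficient than $\bar{y}_{st}$, $\bar{y}_{Rs}$ and $\bar{y}_{Ps}$.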
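The estimator based on the estimated optimum value, (2.13) and (2.14), can be computed directly from stratum samples. The sketch below is illustrative only: the function names are hypothetical, and it assumes the usual sample correlation $\hat{\rho}_{hxy} = s_{hxy}/(s_{hx}s_{hy})$, under which $\hat{K}_h$ simplifies to $(s_{hxy}/s_{hx}^2)(\bar{x}_h/\bar{y}_h)$.

```python
from statistics import mean, variance


def k_hat(x, y):
    """Estimate K_h from one stratum sample (assumes rho_hat = s_xy / (s_x * s_y))."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
    # rho_hat * C_hy_hat / C_hx_hat reduces to (s_xy / s_x^2) * (x_bar / y_bar)
    return (s_xy / variance(x)) * (x_bar / y_bar)


def y_ms_star(W, samples, X_bar):
    """Estimator (2.14); samples is a list of (x_values, y_values) per stratum."""
    total = 0.0
    for w, (x, y), X in zip(W, samples, X_bar):
        k = k_hat(x, y)
        x_bar, y_bar = mean(x), mean(y)
        total += w * y_bar * ((1 - k) * x_bar + (1 + k) * X) / ((1 - k) * X + (1 + k) * x_bar)
    return total


if __name__ == "__main__":
    # Hypothetical single-stratum sample with y exactly proportional to x.
    print(y_ms_star([1.0], [([1, 2, 3, 4], [2, 4, 6, 8])], [3.0]))
```

When y is exactly proportional to x within a stratum, the estimated K_h equals 1 and the stratum term reduces to the ratio estimator, as the usage example shows.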

3. The Suggested “Modified Combined Product and Ratio Estimators”

Motivated by Sahai (1979, p. 34) we suggest the "modified combined product and ratio estimators" for estimating the population mean $\bar{Y}$ as
$$\bar{y}_{mc} = \bar{y}_{st} \frac{(\bar{x}_{st} + \theta \bar{X})}{(\bar{X} + \theta \bar{x}_{st})}, \tag{3.1}$$
where $\theta$ is a suitably chosen scalar. It is worth mentioning that for $\theta = 1$ the estimator $\bar{y}_{mc} = \bar{y}_{st}$, and that, for $\theta = 0$, $\bar{y}_{mc}$ reduces to the "combined product estimator" $\bar{y}_{Pc}$.

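In order to obtain the bias and variance of the estimator $\bar{y}_{mc}$, we write
$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h = \bar{Y}(1 + e_0),\quad \bar{x}_{st} = \sum_{h=1}^{L} W_h \bar{x}_h = \bar{X}(1 + e_1)$$
so that $E(e_0) = E(e_1) = 0$,
$$E(e_0^2) = \frac{1}{\bar{Y}^2}\sum_{h=1}^{L}\left(\frac{1}{n_h} - \frac{1}{N_h}\right) W_h^2 S_{hy}^2,\quad E(e_1^2) = \frac{1}{\bar{X}^2}\sum_{h=1}^{L}\left(\frac{1}{n_h} - \frac{1}{N_h}\right) W_h^2 S_{hx}^2$$
and
$$E(e_0 e_1) = \frac{1}{\bar{X}\bar{Y}}\sum_{h=1}^{L}\left(\frac{1}{n_h} - \frac{1}{N_h}\right) W_h^2 S_{hxy} = \frac{1}{\bar{X}\bar{Y}}\sum_{h=1}^{L}\left(\frac{1}{n_h} - \frac{1}{N_h}\right) W_h^2 \rho_{hxy} S_{hx} S_{hy}.$$
Now expressing $\bar{y}_{mc}$ at (3.1) in terms of the e's we have
$$\bar{y}_{mc} = \bar{Y}(1 + e_0)\frac{(1 + \theta + e_1)}{\{1 + \theta(1 + e_1)\}}$$
or
$$\bar{y}_{mc} - \bar{Y} = \bar{Y}\left[(1 + e_0)\left\{1 + \frac{e_1}{(1 + \theta)}\right\}\left\{1 + \frac{\theta e_1}{(1 + \theta)}\right\}^{-1} - 1\right]$$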


Suppose that $\left|\frac{\theta e_1}{(1 + \theta)}\right| < 1$ so that $\left\{1 + \frac{\theta e_1}{(1 + \theta)}\right\}^{-1}$ is expandable. Therefore, to the first degree of approximation, the bias and variance of $\bar{y}_{mc}$ are respectively given by
$$B(\bar{y}_{mc}) = \bar{Y} E\left[\frac{(1 - \theta)}{(1 + \theta)} e_0 e_1 - \frac{\theta(1 - \theta)}{(1 + \theta)^2} e_1^2\right] = \bar{Y} E\left[D e_0 e_1 - \frac{D(1 - D)}{2} e_1^2\right]$$
$$= \frac{D}{\bar{X}}\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right)\left\{S_{hxy} - \frac{(1 - D)}{2} R S_{hx}^2\right\} \tag{3.2}$$
where $D = \frac{(1 - \theta)}{(1 + \theta)}$.

To have an idea as to how rapidly the bias diminishes with the size of the sample, we will assume that $n_h$ is proportional to $N_h$ (i.e. $n_h \propto N_h$) and that $S_{hx}/\bar{X}$, $S_{hy}/\bar{Y}$ and $\rho_{hxy}$ are the same over all strata, say $C_x$, $C_y$ and $\rho$ respectively. The relative bias (RB) in $\bar{y}_{mc}$ is then seen to be
$$RB(\bar{y}_{mc}) = \frac{B(\bar{y}_{mc})}{\bar{Y}} = \left(\frac{1}{n} - \frac{1}{N}\right) D \left\{\rho C_x C_y - \frac{(1 - D)}{2} C_x^2\right\} \tag{3.3}$$
It follows that even when the size of the sample within each stratum is small, $\bar{y}_{mc}$ can give a satisfactory estimate of the population mean $\bar{Y}$ provided the total sample size n is sufficiently large.
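The dependence of the relative bias (3.3) on the total sample size can be seen with a few evaluations. This is a minimal sketch; the function name and the parameter values in the usage example are hypothetical.

```python
def relative_bias_ymc(n, N, D, rho, C_x, C_y):
    """Relative bias (3.3) under proportional allocation and common C_x, C_y, rho."""
    return (1.0 / n - 1.0 / N) * D * (rho * C_x * C_y - 0.5 * (1.0 - D) * C_x**2)


if __name__ == "__main__":
    # Hypothetical population of N = 1000 with D = -0.5.
    for n in (20, 50, 200, 500):
        print(n, relative_bias_ymc(n, 1000, -0.5, 0.8, 0.3, 0.4))
```

The factor (1/n - 1/N) depends only on the total sample size, so the magnitude of the relative bias shrinks as n grows regardless of how small the individual stratum samples are.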

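The variance of $\bar{y}_{mc}$, to the first degree of approximation, is given by
$$V(\bar{y}_{mc}) = E(\bar{y}_{mc} - \bar{Y})^2 = \bar{Y}^2 E(e_0 + D e_1)^2$$
$$= \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right)\left\{S_{hy}^2 + D^2 R^2 S_{hx}^2 + 2 D R S_{hxy}\right\} \tag{3.4}$$
which is minimized for
$$D = -\frac{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hxy}}{R \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hx}^2} = D_0 \text{ (say)} \tag{3.5}$$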


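Thus the resulting minimum variance of $\bar{y}_{mc}$ is given by
$$\min V(\bar{y}_{mc}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hy}^2 - \frac{\left\{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hxy}\right\}^2}{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hx}^2} = V(\bar{y}_{st})(1 - \rho_{st}^2) \tag{3.6}$$
which is equal to the approximate variance of the combined regression estimator
$$\bar{y}_{lc} = \left[\bar{y}_{st} + \hat{\beta}_c (\bar{X} - \bar{x}_{st})\right],$$
where
$$\hat{\beta}_c = \frac{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) s_{hxy}}{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) s_{hx}^2},\quad s_{hxy} = \frac{1}{(n_h - 1)}\sum_{i=1}^{n_h} (x_{hi} - \bar{x}_h)(y_{hi} - \bar{y}_h) \quad\text{and}\quad s_{hx}^2 = \frac{1}{(n_h - 1)}\sum_{i=1}^{n_h} (x_{hi} - \bar{x}_h)^2.$$
The optimum choice (3.5) of the constant D involves unknown parameters, which can be guessed quite accurately through a pilot sample survey, past data, or experience gathered in due course of time.

When the optimum value $D_0$ of D is not known, we replace it by its estimate
$$\hat{D}_0 = -\hat{\beta}/\hat{R}_{st} \tag{3.7}$$
where
$$\hat{\beta} = \frac{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) s_{hxy}}{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) s_{hx}^2} \quad\text{and}\quad \hat{R}_{st} = \frac{\bar{y}_{st}}{\bar{x}_{st}} \tag{3.8}$$
Thus the resulting "modified combined product and ratio estimator" for $\bar{Y}$ is
$$\bar{y}_{mc}^{*} = \bar{y}_{st}\frac{\{(1 + \hat{D}_0)\bar{x}_{st} + (1 - \hat{D}_0)\bar{X}\}}{\{(1 + \hat{D}_0)\bar{X} + (1 - \hat{D}_0)\bar{x}_{st}\}} \tag{3.9}$$
To obtain the variance of $\bar{y}_{mc}^{*}$ we write $\hat{D}_0 = D_0 (1 + e_3)$ such that $E(e_3) = O(n^{-1})$. Expressing (3.9) in terms of the e's we have
$$\bar{y}_{mc}^{*} - \bar{Y} = \bar{Y}\left[(1 + e_0)\left\{1 + \frac{(1 + D_0)}{2} e_1 + D_0\frac{e_1 e_3}{2}\right\}\left\{1 + \frac{(1 - D_0)}{2} e_1 - D_0\frac{e_1 e_3}{2}\right\}^{-1} - 1\right]$$
or
$$(\bar{y}_{mc}^{*} - \bar{Y}) \cong \bar{Y}(e_0 + D_0 e_1) \tag{3.10}$$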


Squaring both sides of (3.10) and then taking expectations we get the variance of $\bar{y}_{mc}^{*}$ to the first degree of approximation as
$$V(\bar{y}_{mc}^{*}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hy}^2 - \frac{\left\{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hxy}\right\}^2}{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hx}^2} = \min V(\bar{y}_{mc}) \tag{3.11}$$
Thus we have established the following theorem.

Theorem 3.1- To the first degree of approximation, the variance of the estimator $\bar{y}_{mc}^{*}$ at (3.9) based on the "estimated optimum value" is the same as the minimum variance of the estimator $\bar{y}_{mc}$ at (3.1).
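The steps (3.7) to (3.9) can be sketched in code as follows. The function name and the inputs in the usage example are hypothetical; the per-stratum sample moments are taken as given.

```python
def y_mc_star(W, n, N, y_bar, x_bar, s_xy, s2_x, X_pop):
    """Estimator (3.9) built from per-stratum sample summaries.

    W, n, N, y_bar, x_bar, s_xy, s2_x are per-stratum sequences;
    X_pop is the known population mean of x.
    """
    f = [1.0 / nh - 1.0 / Nh for nh, Nh in zip(n, N)]
    num = sum(w**2 * fh * c for w, fh, c in zip(W, f, s_xy))
    den = sum(w**2 * fh * v for w, fh, v in zip(W, f, s2_x))
    beta_hat = num / den                               # (3.8)
    y_st = sum(w * y for w, y in zip(W, y_bar))
    x_st = sum(w * x for w, x in zip(W, x_bar))
    d0_hat = -beta_hat / (y_st / x_st)                 # (3.7)
    return y_st * ((1 + d0_hat) * x_st + (1 - d0_hat) * X_pop) / (
        (1 + d0_hat) * X_pop + (1 - d0_hat) * x_st)


if __name__ == "__main__":
    # Hypothetical single-stratum summaries.
    print(y_mc_star([1.0], [10], [100], [20.0], [4.0], [10.0], [2.0], 5.0))
```

Two limiting cases are a useful sanity check: zero sample covariance gives an estimated D0 of 0 and returns the stratified mean, while a regression slope equal to the estimated ratio gives -1 and returns the combined ratio estimator.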

4. Efficiency Comparisons

It is well known under stratified random sampling that
$$V(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hy}^2 \tag{4.1}$$
From (1.10), (1.12), (3.11) and (4.1) we have
$$V(\bar{y}_{st}) - V(\bar{y}_{mc}^{*}) \text{ (or } \min V(\bar{y}_{mc})) = \frac{\left\{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hxy}\right\}^2}{\sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hx}^2} \ge 0 \tag{4.2}$$
$$V(\bar{y}_{Rc}) - V(\bar{y}_{mc}^{*}) \text{ (or } \min V(\bar{y}_{mc})) = \frac{(RB - C)^2}{B} \ge 0 \tag{4.3}$$
and
$$V(\bar{y}_{Pc}) - V(\bar{y}_{mc}^{*}) \text{ (or } \min V(\bar{y}_{mc})) = \frac{(RB + C)^2}{B} \ge 0 \tag{4.4}$$
where
$$B = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hx}^2 \quad\text{and}\quad C = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{hxy}.$$
Thus from (4.2), (4.3) and (4.4) we have the following inequalities:
$$V(\bar{y}_{mc}^{*}) \text{ (or } \min V(\bar{y}_{mc})) \le V(\bar{y}_{st}) \tag{4.5}$$
$$V(\bar{y}_{mc}^{*}) \text{ (or } \min V(\bar{y}_{mc})) \le V(\bar{y}_{Rc}) \tag{4.6}$$
and
$$V(\bar{y}_{mc}^{*}) \text{ (or } \min V(\bar{y}_{mc})) \le V(\bar{y}_{Pc}) \tag{4.7}$$
From (4.5), (4.6) and (4.7) the following theorem can be proved.

Theorem 4.1- The suggested estimator $\bar{y}_{mc}^{*}$ based on the estimated "optimum value" (or the estimator $\bar{y}_{mc}$ at its optimum condition) is better than the usual estimator $\bar{y}_{st}$, the combined ratio estimator $\bar{y}_{Rc}$ and the combined product estimator $\bar{y}_{Pc}$.

Further, from (3.4) and (4.1) it is observed that $V(\bar{y}_{mc}) < V(\bar{y}_{st})$ if $D^2 - 2D_0 D + D_0^2 < D_0^2$, so in order that the modified "combined product and ratio estimator" $\bar{y}_{mc}$ should be preferred to $\bar{y}_{st}$,
$$|D - D_0| < |D_0| \tag{4.8}$$
should hold, where $D_0$ is the same as defined in (3.5).

From (1.10) and (3.4) it follows that $\bar{y}_{mc}$ is more efficient than $\bar{y}_{Rc}$ if $D^2 - 2D_0 D + D_0^2 < 1 + 2D_0 + D_0^2$, i.e. if
$$|D - D_0| < |1 + D_0| \tag{4.9}$$
We also note from (1.12) and (3.4) that the estimator $\bar{y}_{mc}$ is superior to the combined product estimator $\bar{y}_{Pc}$ if
$$|D - D_0| < |1 - D_0| \tag{4.10}$$
For details the reader is referred to Sahai (1979).
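Condition (4.8) can be checked against the variance expression (3.4) directly. The sketch below is illustrative: `A` is a shorthand introduced here for V(ȳst), `B` and `C` are the weighted sums defined after (4.4), and the numeric values are hypothetical.

```python
def var_ymc(D, R, A, B, C):
    """Variance (3.4) written with A = V(y_st) and the sums B, C from Section 4."""
    return A + D**2 * R**2 * B + 2.0 * D * R * C


def d_opt(R, B, C):
    """Optimum D from (3.5)."""
    return -C / (R * B)


if __name__ == "__main__":
    # Hypothetical values of the weighted sums and of R.
    A, B, C, R = 10.0, 2.0, 3.0, 2.0
    D0 = d_opt(R, B, C)
    print(D0, var_ymc(D0, R, A, B, C), A - C**2 / B)
    for D in (-0.5, 0.5):
        # (4.8): the estimator beats y_st exactly when |D - D0| < |D0|
        print(D, abs(D - D0) < abs(D0), var_ymc(D, R, A, B, C) < A)
```

At D0 the variance equals A - C²/B, which matches (3.6), and for each trial D the inequality (4.8) agrees with the direct comparison of variances.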

5. Empirical Study

To illustrate the performance of the different estimators $\bar{y}_{st}$, $\bar{y}_{Rs}$, $\bar{y}_{Rc}$, $\bar{y}_{ms}^{*}$ and $\bar{y}_{mc}^{*}$ over $\bar{y}_{st}$, we have considered the natural data given in Singh and Chaudhary (1986, p.162). The data were collected in a pilot survey for estimating the extent of cultivation and production of fresh fruits in three districts of Uttar Pradesh in the year 1976-1977.


Stratum number (h) | Total no. of villages (N_h) | Total area in ha (X_h) | No. of villages in sample (n_h) | Area under orchards in ha (x_h) | Total no. of trees (y_h)
1 | 985 | 11253 | 6 | 10.63, 9.90, 1.45, 3.38, 5.17, 10.35 | 747, 719, 78, 201, 311, 448
2 | 2196 | 25115 | 8 | 14.66, 2.61, 4.35, 9.87, 2.42, 5.60, 4.70, 36.75 | 580, 103, 316, 739, 196, 235, 212, 1646
3 | 1020 | 18870 | 11 | 11.60, 5.29, 7.94, 7.29, 8.00, 1.20, 11.50, 7.96, 23.15, 1.70, 2.01 | 488, 277, 374, 491, 499, 50, 455, 47, 879, 115, 115

The calculations are shown in the table below.

Stratum | W_h | (1/n_h − 1/N_h) | x̄_h | ȳ_h | R̂_h | W_h x̄_h | W_h ȳ_h | S²_hx | S²_hy | S_hxy
1 | 0.2345 | 0.16565 | 6.81 | 417.33 | 61.28 | 1.60 | 97.86 | 15.97 | 74775.47 | 1007.05
2 | 0.5227 | 0.12454 | 10.12 | 503.38 | 49.74 | 5.26 | 263.12 | 132.66 | 259113.40 | 5643.81
3 | 0.2428 | 0.08992 | 7.97 | 340.00 | 42.66 | 1.94 | 82.55 | 38.44 | 65885.60 | 1404.71
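The stratum summaries can be recomputed from the sample values listed for the pilot survey. The sketch below does this for the first two strata; the helper names are hypothetical, and the variances and covariance are computed with divisor n_h − 1.

```python
from statistics import mean, variance

# Stratum sizes N_h and sample values (x = area under orchards, y = number of trees)
N_h = {1: 985, 2: 2196, 3: 1020}
samples = {
    1: ([10.63, 9.90, 1.45, 3.38, 5.17, 10.35],
        [747, 719, 78, 201, 311, 448]),
    2: ([14.66, 2.61, 4.35, 9.87, 2.42, 5.60, 4.70, 36.75],
        [580, 103, 316, 739, 196, 235, 212, 1646]),
}


def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def stratum_summary(h):
    """Weight, sampling factor, means and second moments for stratum h."""
    x, y = samples[h]
    n = len(x)
    return {
        "W": N_h[h] / sum(N_h.values()),
        "f": 1.0 / n - 1.0 / N_h[h],
        "x_mean": mean(x),
        "y_mean": mean(y),
        "s2_x": variance(x),
        "s2_y": variance(y),
        "s_xy": sample_cov(x, y),
    }


if __name__ == "__main__":
    for h in samples:
        print(h, stratum_summary(h))
```

Rounded to the precision of the table, these values match the tabulated entries for strata 1 and 2.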

We have computed the percent relative efficiency of $\bar{y}_{st}$, $\bar{y}_{Rs}$, $\bar{y}_{Rc}$, $\bar{y}_{ms}^{*}$ and $\bar{y}_{mc}^{*}$ with respect to $\bar{y}_{st}$ and presented the results in Table 5.1.

Table 5.1. Percent relative efficiencies of the various estimators of the population mean $\bar{Y}$ with respect to the stratified random sample mean $\bar{y}_{st}$

Estimator      | ȳ_st   | ȳ_Rs    | ȳ_Rc    | ȳ*_ms   | ȳ*_mc
PRE (. , ȳ_st) | 100.00 | 1190.09 | 1547.10 | 1188.66 | 1406.95

Table 5.1 clearly indicates that the proposed estimators $\bar{y}_{ms}^{*}$ and $\bar{y}_{mc}^{*}$ are better than $\bar{y}_{st}$, $\bar{y}_{Rs}$ and $\bar{y}_{Rc}$. It is also noted that the proposed estimator $\bar{y}_{ms}^{*}$ has larger efficiency than $\bar{y}_{mc}^{*}$. Thus $\bar{y}_{ms}^{*}$ is recommended for use in practice.

Acknowledgement

Authors are thankful to the two learned referees and the editor Prof. J. Kordos for their valuable suggestions regarding improvement of the paper.

REFERENCES

HANSEN, M.H., HURWITZ, W.N. AND GURNEY, M. (1946), Problem and methods of the sample survey of business. Jour. Amer. Statist. Association, 41, 174-189.
MURTHY, M.N. (1967), Sampling theory and methods. Statistical Publishing Society, Calcutta, India.
SAHAI, A. (1979), An efficient variant of the product and ratio estimators. Stat. Neerl, 33, 27-35.
SAHAI, A. AND RAY, S.K. (1980), An efficient estimator using auxiliary informations. Metrika, 27, 271-275.
SINGH, D. AND CHAUDHARY, F.S. (1986), Theory and analysis of sample survey designs. Wiley Eastern Limited, New Delhi.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1327—1344

ESTIMATION OF MEAN WITH KNOWN COEFFICIENT OF VARIATION OF AN AUXILIARY VARIABLE IN TWO PHASE SAMPLING

Lakshmi N. Upadhyaya1, Housila P. Singh2 and Ritesh Tailor2

ABSTRACT

This paper discusses the possibility of obtaining efficient estimators of the population mean of the study variable y by means of two-phase sampling with the help of two auxiliary variables, x (the main auxiliary variable) and z (the second auxiliary variable). The proposed estimators are basically a concatenation of two ratio-type estimators (one needed to estimate the mean of the study variable y and the other for the main auxiliary variable x) as described in Chand (1975). Following Sen (1978), we introduce a modification of Chand's estimator that uses only the knowledge of the coefficient of variation of the second auxiliary variable z instead of its population mean. Asymptotic expressions for the bias and mean squared error (MSE) of the proposed estimator are obtained. The asymptotically optimum estimator (AOE) in the family is identified together with its approximate MSE formula. Since the AOE depends on unknown parameters, we further define an estimator of the population mean based on estimated optimum values. It is shown, to the first degree of approximation, that the MSE of the estimator based on "estimated optimum values" is the same as that of the optimum estimator. A numerical illustration is given in support of the present study.

Keywords: Chain ratio-type estimator, Study variate, auxiliary variate, bias and mean squared error.

1. Introduction and Estimator

Consider a finite population $U = (U_1, U_2, \ldots, U_N)$ of $N$ units. Let $y$ and $x$ be the study and auxiliary variates, taking values $y_i$ and $x_i$ respectively for the $i$-th unit $U_i$ $(i = 1, 2, \ldots, N)$. The problem of estimating the population mean $\bar{Y}$ of $y$ when the population mean $\bar{X}$ of $x$ is known has been dealt with at great length in the literature. However, in certain practical situations where no information is available on the population mean $\bar{X}$ of $x$, we seek to estimate $\bar{Y}$ from a sample $s$ obtained through a two-phase selection. Allowing a simple random sampling without replacement (SRSWOR) scheme at each phase, the two-phase (or double) sampling scheme is as follows:

(1) The first-phase sample $s^{*}$ ($s^{*} \subset U$) of fixed size $n$ is drawn to observe only $x$ in order to furnish a good estimate of $\bar{X}$.
(2) Given $s^{*}$, the second-phase sample $s$ ($s \subset s^{*}$) of fixed size $m$ is drawn to observe $y$ only.

1 Department of Applied Mathematics, Indian School of Mines, Dhanbad-826004, Jharkhand, India.
2 School of Studies in Statistics, Vikram University, Ujjain-456010, M.P., India.

Let $\bar{x} = \sum_{i \in s} x_i / m$, $\bar{y} = \sum_{i \in s} y_i / m$ and $\bar{x}^{*} = \sum_{i \in s^{*}} x_i / n$. Then the usual two-phase sampling ratio estimator is defined by

$$\bar{y}_{Rd} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right) \tag{1.1}$$

It is well known that $\bar{y}_{Rd}$ will estimate $\bar{Y}$ to $O(n^{-1})$ more precisely than $\bar{y}$ if $\rho_{yx} > (1/2)\,C_x/C_y$, where $\rho_{yx}$ is the correlation coefficient between $x$ and $y$, and $C_x$, $C_y$ are the coefficients of variation of $x$ and $y$ respectively.

Suppose that $\bar{Z}$, the population mean of another variable $z$ that is closely related to $x$ but, compared to $x$, only remotely related to $y$, is available (i.e. $\rho_{yx} > \rho_{yz}$, $\rho_{yz}$ being the correlation coefficient between $y$ and $z$). This type of situation has been briefly discussed by, among others, Chand (1975), Kiregyera (1980, 84) and Srivenkataramana and Tracy (1989). Then the ratio estimator

$$\bar{x}_{Rd} = \bar{x}^{*}\left(\frac{\bar{Z}}{\bar{z}^{*}}\right)$$

will estimate $\bar{X}$ to $O(n^{-1})$ more precisely than $\bar{x}^{*}$ if $\rho_{xz} > (1/2)\,C_z/C_x$, where $\bar{z}^{*} = (1/n)\sum_{j \in s^{*}} z_j$ is the sample mean of $z$ based on the preliminary large sample $s^{*}$ of size $n$. Accordingly, Chand (1975) proposed a chain ratio-type estimator for $\bar{Y}$ as

$$\bar{y}_{Rd}^{(c)} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)\left(\frac{\bar{Z}}{\bar{z}^{*}}\right) \tag{1.2}$$


This estimator has been further generalized by various authors, including Kiregyera (1980, 84), Mukerjee et al (1987), Srivastava et al (1989), Upadhyaya et al (1990), Singh and Singh (1991) and Singh et al (1994).

It is of interest to mention that the population coefficient of variation is usually fairly stable over time and over characteristics of a similar nature, and its value may be known (or may be made known); see, for instance, Murthy (1967, pp. 96-97). Govindarajulu and Sahai (1972) pointed out that in many life-science and biological experiments the observations follow a normal distribution with known coefficient of variation. For illustration, in clinical laboratory experiments the routine procedures are repeated often enough that the coefficient of variation is known for practical purposes. The coefficient of variation for the linear dimensions of anatomical elements of mammals lies between 4% and 10%, with 5-6% being good average values. Further, the coefficient of variation of the rate of gain in weight of pigs over a certain growing period in a nutritional experiment tends to be a constant lying between 8% and 12% [see Govindarajulu and Sahai (1972, p. 1)]. Gleser and Healy (1976) mentioned that it is not uncommon, particularly in the physical and biological sciences, for a scatter plot of cell means $\bar{z}_i$ versus cell standard deviations $s_{z_i}$ $(i = 1, 2, \ldots, m)$ in an analysis of variance to yield evidence that the population cell standard deviations $\sigma_{z_i}$ are proportional to the population cell means $\mu_{z_i}$. That is,

$$\sigma_{z_i} = C_z\,\mu_{z_i} \tag{1.3}$$

where $C_z$ is a constant commonly known as the coefficient of variation. Searls (1964), Khan (1968) and Sen (1978) were also of the opinion that a simple type of a priori information usually available to the experimenter is the coefficient of variation, particularly in the biological fields, through long association with the experimental material. Thus, we assume that the value of the coefficient of variation $C_z$ is quite accurately known in many practical situations. Singh et al (1973), Sen (1978), Upadhyaya and Singh (1984) and Searls and Intarapanich (1990) have further argued that for many agricultural and biological populations the statistical practitioner has information, either collected in allied studies or suggested by the physical nature of the material, regarding the shape parameters $\beta_1(z)$ and $\beta_2(z)$ of the population sampled, in addition to the coefficient of variation.

Assuming that the coefficient of variation $C_y$ of $y$ is known, Sen (1978) suggested a class of estimators for $\bar{Y}$ as

$$\hat{\bar{Y}}_s = \left\{(1 - \alpha)\,\bar{y} + \alpha\,\bar{y}\left(\hat{C}_y / C_y\right)\right\} \tag{1.4}$$

where $\alpha$ is a suitably chosen constant and $\hat{C}_y$ is a consistent estimate of $C_y$ based on $m$ observations.


Let $C_z$, the coefficient of variation of $z$, be known. Select the first-phase sample $s^{*}$ of size $n$ to observe $x$ and $z$, so as to furnish an estimate of $\bar{X}$ of the type proposed by Sen (1978):

$$\hat{\bar{X}}_s = \left\{(1 - w)\,\bar{x}^{*} + w\,\bar{x}^{*}\left(\hat{C}_z^{*} / C_z\right)\right\} \tag{1.5}$$

where $w$ is a suitably chosen constant and $\hat{C}_z^{*}$ is a consistent estimate of $C_z$ based on $n$ observations.

Replacing $\bar{x}^{*}$ by $\hat{\bar{X}}_s$ in (1.1), we get the following class of chain ratio-type estimators of $\bar{Y}$:

$$\bar{y}_{Rd}^{(1)} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)\left[\frac{(1 - w)\,C_z + w\,\hat{C}_z^{*}}{C_z}\right] \tag{1.6}$$

which is further generalized as

$$\bar{y}_{Rd}^{(g)} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)^{\alpha}\left[\frac{(1 - w)\,C_z + w\,\hat{C}_z^{*}}{C_z}\right] \tag{1.7}$$

where $\alpha$ is a suitably chosen constant; see, for instance, Srivastava (1967, 1970).

The estimators $\bar{y}_{Rd}^{(1)}$ and $\bar{y}_{Rd}^{(g)}$ are useful in situations where the population coefficient of variation $C_z$ is known and the population mean $\bar{Z}$ of $z$ is not known.
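The estimators (1.1), (1.2) and (1.7) can be sketched numerically as follows. This is only an illustrative sketch: the function names and the toy data are hypothetical, and the estimate of $C_z$ from the first-phase sample is assumed to be the sample standard deviation (divisor $n - 1$) over the sample mean, which the text does not fix.

```python
from statistics import mean, stdev

def ratio_two_phase(y_s, x_s, x_star):
    """Two-phase ratio estimator (1.1): ybar * (xbar* / xbar)."""
    return mean(y_s) * mean(x_star) / mean(x_s)

def chand_chain(y_s, x_s, x_star, z_star, Z_bar):
    """Chand's chain ratio-type estimator (1.2); requires the known mean Z_bar."""
    return ratio_two_phase(y_s, x_s, x_star) * Z_bar / mean(z_star)

def generalized_chain(y_s, x_s, x_star, z_star, C_z, alpha=1.0, w=0.0):
    """Estimator (1.7); alpha = 1 gives (1.6). Requires only the known C_z."""
    # Assumption: C_z is estimated from the first-phase sample as s*_z / zbar*.
    C_z_hat = stdev(z_star) / mean(z_star)
    ratio = (mean(x_star) / mean(x_s)) ** alpha
    return mean(y_s) * ratio * ((1.0 - w) * C_z + w * C_z_hat) / C_z

# Hypothetical toy data: the second-phase units are a subset of the first phase.
x_star = [1.0, 3.0, 2.0, 6.0]   # x observed on the first-phase sample (n = 4)
z_star = [2.0, 4.0, 4.0, 6.0]   # z observed on the first-phase sample
x_s = [1.0, 3.0]                # x for the second-phase units (m = 2)
y_s = [2.0, 4.0]                # y observed on the second-phase sample

print(ratio_two_phase(y_s, x_s, x_star))
print(chand_chain(y_s, x_s, x_star, z_star, Z_bar=5.0))
print(generalized_chain(y_s, x_s, x_star, z_star, C_z=0.4, alpha=1.0, w=0.5))
```

With $w = 0$ and $\alpha = 1$ the generalized estimator reduces to the ordinary two-phase ratio estimator, which is a convenient sanity check.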

2. Properties of the Estimator $\bar{y}_{Rd}^{(g)}$

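We assume that the population size $N$ is large compared to the sample sizes $n$ and $m$, so that the sampling fractions $n/N$ and $m/N$ can be ignored. We write

$$\mu_{rst} = (1/N)\sum_{j=1}^{N} (x_j - \bar{X})^{r}\,(y_j - \bar{Y})^{s}\,(z_j - \bar{Z})^{t},$$

$(r, s, t)$ being non-negative integers, and

$$C_y^2 = \frac{S_y^2}{\bar{Y}^2} = \frac{\mu_{020}}{\bar{Y}^2}, \quad C_x^2 = \frac{S_x^2}{\bar{X}^2} = \frac{\mu_{200}}{\bar{X}^2}, \quad C_z^2 = \frac{S_z^2}{\bar{Z}^2} = \frac{\mu_{002}}{\bar{Z}^2},$$

$$\rho_{yx} = \mu_{110}/(\mu_{200}\,\mu_{020})^{1/2}, \quad \rho_{xz} = \mu_{101}/(\mu_{200}\,\mu_{002})^{1/2}, \quad \rho_{yz} = \mu_{011}/(\mu_{020}\,\mu_{002})^{1/2},$$

$$\lambda = \mu_{012}/(\bar{Y} S_z^2), \quad \beta_1(z) = \mu_{003}^2/(S_z^2)^3, \quad \beta_2(z) = \mu_{004}/(S_z^2)^2, \quad \lambda^{*} = \mu_{102}/(\bar{X} S_z^2).$$

Let

$$e_o = (\bar{y} - \bar{Y})/\bar{Y}, \quad e_1 = (\bar{x} - \bar{X})/\bar{X}, \quad e_1^{*} = (\bar{x}^{*} - \bar{X})/\bar{X}, \quad e_2^{*} = (\bar{z}^{*} - \bar{Z})/\bar{Z}, \quad e_3^{*} = (s_z^{*2} - S_z^2)/S_z^2.$$

Then we have

$$E(e_o) = E(e_1) = E(e_1^{*}) = E(e_2^{*}) = E(e_3^{*}) = 0,$$

$$E(e_o^2) = (1/m)\,C_y^2, \quad E(e_1^2) = (1/m)\,C_x^2, \quad E(e_1^{*2}) = (1/n)\,C_x^2, \quad E(e_2^{*2}) = (1/n)\,C_z^2,$$

$$E(e_o e_1) = (1/m)\,\rho_{yx} C_y C_x, \quad E(e_o e_1^{*}) = (1/n)\,\rho_{yx} C_y C_x, \quad E(e_o e_2^{*}) = (1/n)\,\rho_{yz} C_y C_z,$$

$$E(e_1 e_1^{*}) = (1/n)\,C_x^2, \quad E(e_1 e_2^{*}) = E(e_1^{*} e_2^{*}) = (1/n)\,\rho_{xz} C_x C_z,$$

and, to the first order of approximation,

$$E(e_3^{*2}) = (1/n)\,(\beta_2(z) - 1), \quad E(e_o e_3^{*}) = (1/n)\,\lambda,$$

$$E(e_1 e_3^{*}) = E(e_1^{*} e_3^{*}) = (1/n)\,\lambda^{*}, \quad E(e_2^{*} e_3^{*}) = (1/n)\,\sqrt{\beta_1(z)}\;C_z.$$

Now, expressing the estimator $\bar{y}_{Rd}^{(g)}$ at (1.7) in terms of the $e$'s, we have

$$\bar{y}_{Rd}^{(g)} = \bar{Y}\,(1 + e_o)\,(1 + e_1)^{-\alpha}\,(1 + e_1^{*})^{\alpha}\left[(1 - w) + w\,(1 + e_3^{*})^{1/2}\,(1 + e_2^{*})^{-1}\right] \tag{2.1}$$

Assuming $\left|\dfrac{\bar{x} - \bar{X}}{\bar{X}}\right| < 1$, $\left|\dfrac{\bar{z}^{*} - \bar{Z}}{\bar{Z}}\right| < 1$ and $\left|\dfrac{s_z^{*2} - S_z^2}{S_z^2}\right| < 1$, the bias and MSE of $\bar{y}_{Rd}^{(g)}$, to the first degree of approximation, are respectively given by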


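$$B\left(\bar{y}_{Rd}^{(g)}\right) = \bar{Y}\left[\left(\frac{1}{m} - \frac{1}{n}\right)\left(\frac{\alpha}{2}\right)\left\{(\alpha + 1)\,C_x^2 - 2\rho_{yx} C_y C_x\right\} + \left(\frac{w}{8n}\right)\left\{4\left(\lambda - C_z\sqrt{\beta_1(z)} + 2C_z^2 - 2\rho_{yz} C_y C_z\right) - (\beta_2(z) - 1)\right\}\right] \tag{2.2}$$

and

$$\mathrm{MSE}\left(\bar{y}_{Rd}^{(g)}\right) = \bar{Y}^2\left[(1/m)\,C_y^2 + \left(\frac{1}{m} - \frac{1}{n}\right)\alpha\left(\alpha C_x^2 - 2\rho_{yx} C_y C_x\right) + \left(\frac{w}{4n}\right)\left\{\left(\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right) w + 4\left(\lambda - 2\rho_{yz} C_y C_z\right)\right\}\right] \tag{2.3}$$

where $\Delta(z) = [\beta_2(z) - \beta_1(z) - 1]$.

The $\mathrm{MSE}(\bar{y}_{Rd}^{(g)})$ at (2.3) is minimized for

$$w = \frac{2\left(2\rho_{yz} C_y C_z - \lambda\right)}{\left[\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right]} = w_o \text{ (say)}, \qquad \alpha = \alpha_o = K = \rho_{yx}\,C_y/C_x. \tag{2.4}$$

Substitution of (2.4) in (1.7) yields the 'optimum estimator' as

$$\bar{y}_{Rd}^{(opt)} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)^{\alpha_o}\left\{1 + w_o\left(\frac{\hat{C}_z^{*} - C_z}{C_z}\right)\right\} \tag{2.5}$$

Putting (2.4) in (2.3), we get the minimum MSE of $\bar{y}_{Rd}^{(g)}$ [or the MSE of the 'optimum estimator' $\bar{y}_{Rd}^{(opt)}$] as

$$\min.\mathrm{MSE}\left(\bar{y}_{Rd}^{(g)}\right) = \mathrm{MSE}(\bar{y}_{ld}) - \frac{\bar{Y}^2\left(\lambda - 2\rho_{yz} C_y C_z\right)^2}{n\left\{\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right\}} = \mathrm{MSE}\left(\bar{y}_{Rd}^{(opt)}\right) \tag{2.6}$$

where

$$\mathrm{MSE}(\bar{y}_{ld}) = S_y^2\left[\frac{1}{m}\left(1 - \rho_{yx}^2\right) + \frac{1}{n}\,\rho_{yx}^2\right] \tag{2.7}$$

is the MSE of the two-phase sampling regression estimator

$$\bar{y}_{ld} = \bar{y} + \hat{\beta}_{yx}\left(\bar{x}^{*} - \bar{x}\right), \tag{2.8}$$

where $\hat{\beta}_{yx} = s_{yx}/s_x^2$, $s_{yx} = \sum_{i \in s}(y_i - \bar{y})(x_i - \bar{x})/(m - 1)$ and $s_x^2 = \sum_{i \in s}(x_i - \bar{x})^2/(m - 1)$.

For the purpose of comparison, we write the variance of $\bar{y}$ and the mean squared errors of $\bar{y}_{Rd}$ and $\bar{y}_{Rd}^{(c)}$, to the first degree of approximation, as

$$\mathrm{Var}(\bar{y}) = \frac{1}{m}\,S_y^2 = \frac{1}{m}\,\bar{Y}^2 C_y^2 \tag{2.9}$$

$$\mathrm{MSE}(\bar{y}_{Rd}) = \bar{Y}^2\left[\frac{1}{m}\,C_y^2 + \left(\frac{1}{m} - \frac{1}{n}\right) C_x^2\,(1 - 2K)\right] \tag{2.10}$$

and

$$\mathrm{MSE}\left(\bar{y}_{Rd}^{(c)}\right) = \mathrm{MSE}(\bar{y}_{Rd}) + \frac{\bar{Y}^2}{n}\left(C_z^2 - 2\rho_{yz} C_y C_z\right) \tag{2.11}$$

Now from (2.6), (2.7), (2.9), (2.10) and (2.11), we have

$$\mathrm{Var}(\bar{y}) - \mathrm{MSE}\left(\bar{y}_{Rd}^{(opt)}\right) = \bar{Y}^2\left[\left(\frac{1}{m} - \frac{1}{n}\right) C_y^2\,\rho_{yx}^2 + \frac{A}{n}\right] \ge 0 \tag{2.12}$$

$$\mathrm{MSE}(\bar{y}_{Rd}) - \mathrm{MSE}\left(\bar{y}_{Rd}^{(opt)}\right) = \bar{Y}^2\left[\left(\frac{1}{m} - \frac{1}{n}\right) C_x^2\,(1 - K)^2 + \frac{A}{n}\right] \ge 0 \tag{2.13}$$

$$\mathrm{MSE}(\bar{y}_{ld}) - \mathrm{MSE}\left(\bar{y}_{Rd}^{(opt)}\right) = \bar{Y}^2\,\frac{A}{n} \tag{2.14}$$

$$\mathrm{MSE}\left(\bar{y}_{Rd}^{(c)}\right) - \mathrm{MSE}\left(\bar{y}_{Rd}^{(opt)}\right) = \bar{Y}^2\left[\left(\frac{1}{m} - \frac{1}{n}\right) C_x^2\,(1 - K)^2 + \frac{A}{n} + \frac{1}{n}\left(C_z^2 - 2\rho_{yz} C_y C_z\right)\right] \ge 0 \quad \text{provided } \rho_{yz}\,(C_y/C_z) \le \frac{1}{2} \tag{2.15}$$

where $A = \left(\lambda - 2\rho_{yz} C_y C_z\right)^2 \big/ \left[\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right]$.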


From the expressions (2.12) to (2.15) it follows that the proposed estimator $\bar{y}_{Rd}^{(g)}$ under optimum conditions (i.e. $\bar{y}_{Rd}^{(opt)}$) is always better than $\bar{y}$, $\bar{y}_{Rd}$ and $\bar{y}_{ld}$. We also note that the estimator $\bar{y}_{Rd}^{(opt)}$ is always better than $\bar{y}_{Rd}^{(c)}$ if the condition $\rho_{yz}\,(C_y/C_z) \le \frac{1}{2}$ holds.

Suppose we want to estimate the population mean $\bar{Y}$ when $(x, y, z)$ follows a trivariate normal distribution. Then the 'optimum estimator' $\bar{y}_{Rd}^{(opt)}$ takes the form

$$\bar{y}_{Rd}^{(opt)} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)^{\alpha_o}\left[1 + \frac{2\rho_{yz} C_y C_z}{\left(1 + 2C_z^2\right)}\left(\frac{\hat{C}_z^{*} - C_z}{C_z}\right)\right] \tag{2.16}$$

with mean squared error

$$\mathrm{MSE}\left(\bar{y}_{Rd}^{(opt)}\right) = \mathrm{MSE}(\bar{y}_{ld}) - \frac{2\,\bar{Y}^2 \rho_{yz}^2 C_y^2 C_z^2}{n\left(1 + 2C_z^2\right)} \tag{2.17}$$

It follows from (2.16) that, in the case of a trivariate normal population, prior knowledge of $C_x$, $C_y$, $C_z$, $\rho_{yx}$ and $\rho_{yz}$ alone is sufficient to use the optimum estimator $\bar{y}_{Rd}^{(opt)}$ in practice. As mentioned in Section 1, prior knowledge of $C_x$, $C_y$, $C_z$, $\rho_{yx}$ and $\rho_{yz}$ can be obtained quite accurately either from past data or from experience gathered over time; see also Murthy (1967, pp. 96-99) and Reddy (1978).
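As a numerical check of this section, the following sketch evaluates the first-order MSE (2.3), the optimum constants (2.4) and the minimum MSE (2.6). It is an illustration rather than the authors' computation: the function names are hypothetical, $\bar{Y}$ is set to 1 so that all MSEs are relative, and the skewness term is entered as $\sqrt{\beta_1(z)}$, the form under which (2.4) agrees with (3.1b). The parameter values are those listed for Population I in Section 5.

```python
import math

def H(p):
    """Delta(z) + (2 C_z - sqrt(beta1))^2, with Delta(z) = beta2 - beta1 - 1."""
    return (p["beta2"] - p["beta1"] - 1.0) + (2.0 * p["C_z"] - math.sqrt(p["beta1"])) ** 2

def g(p):
    """The recurring term lambda - 2 rho_yz C_y C_z."""
    return p["lam"] - 2.0 * p["rho_yz"] * p["C_y"] * p["C_z"]

def mse_general(alpha, w, m, n, p, Y_bar=1.0):
    """First-order MSE (2.3) of the generalized estimator (1.7)."""
    ratio_part = (1 / m - 1 / n) * alpha * (
        alpha * p["C_x"] ** 2 - 2 * p["rho_yx"] * p["C_y"] * p["C_x"])
    w_part = (w / (4 * n)) * (H(p) * w + 4 * g(p))
    return Y_bar ** 2 * (p["C_y"] ** 2 / m + ratio_part + w_part)

def optimum_constants(p):
    """Optimum (alpha_o, w_o) from (2.4)."""
    return p["rho_yx"] * p["C_y"] / p["C_x"], -2.0 * g(p) / H(p)

def min_mse(m, n, p, Y_bar=1.0):
    """Minimum MSE (2.6): regression-estimator MSE (2.7) minus Ybar^2 A / n."""
    rho2 = p["rho_yx"] ** 2
    mse_ld = Y_bar ** 2 * p["C_y"] ** 2 * ((1 - rho2) / m + rho2 / n)
    return mse_ld - Y_bar ** 2 * (g(p) ** 2 / H(p)) / n

# Population I parameters as given in Section 5.
pop1 = dict(C_y=0.744310, C_x=0.599037, C_z=0.756439, rho_yx=0.816875,
            rho_yz=0.929922, beta1=1.206088, beta2=3.281589, lam=0.622752)

alpha_o, w_o = optimum_constants(pop1)
print(alpha_o, w_o)
print(mse_general(alpha_o, w_o, 5, 10, pop1), min_mse(5, 10, pop1))

# Trivariate normal case: beta1 = 0, beta2 = 3, lambda = 0, as used for (2.16).
normal = dict(pop1, beta1=0.0, beta2=3.0, lam=0.0)
print(optimum_constants(normal)[1])
```

Evaluating (2.3) at the optimum constants should return the same value as (2.6), and moving either constant away from its optimum should increase the MSE when $m < n$.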

3. Estimator Based on Estimated Optimum Values

If no prior information on the unknown population parameters involved in the optimum values $(\alpha_o, w_o)$ of $(\alpha, w)$ at (2.4) is available, then it is advisable to estimate these parameters from the sample data at hand. The optimum values $(\alpha_o, w_o)$ of $(\alpha, w)$ may be written as

$$\alpha_o = \frac{\mu_{110}\,\bar{X}}{\mu_{200}\,\bar{Y}} \tag{3.1a}$$

$$w_o = \frac{2\left(\dfrac{2\mu_{011}}{\bar{Y}\bar{Z}} - \dfrac{\mu_{012}}{\bar{Y}\mu_{002}}\right)}{\left[4\left(\dfrac{\mu_{002}}{\bar{Z}^2} - \dfrac{\mu_{003}}{\bar{Z}\mu_{002}}\right) + \dfrac{\mu_{004}}{\mu_{002}^2} - 1\right]} \tag{3.1b}$$

The consistent estimates of $(\alpha_o, w_o)$ are given by

$$\hat{\alpha}_o = \frac{\hat{\mu}_{110}\,\bar{x}}{\hat{\mu}_{200}\,\bar{y}} \tag{3.2a}$$

$$\hat{w}_o = \frac{2\left(\dfrac{2\hat{\mu}_{011}}{\bar{y}\,\bar{z}} - \dfrac{\hat{\mu}_{012}}{\bar{y}\,\hat{\mu}_{002}}\right)}{\left[4\left(\dfrac{\hat{\mu}_{002}}{\bar{z}^2} - \dfrac{\hat{\mu}_{003}}{\bar{z}\,\hat{\mu}_{002}}\right) + \dfrac{\hat{\mu}_{004}}{\hat{\mu}_{002}^2} - 1\right]} \tag{3.2b}$$

where $\bar{z} = \frac{1}{m}\sum_{i \in s} z_i$ and $\hat{\mu}_{rst} = (1/m)\sum_{i=1}^{m}(x_i - \bar{x})^{r}(y_i - \bar{y})^{s}(z_i - \bar{z})^{t}$, with $r, s, t = 0, 1, 2, 3, 4$.

Then the estimator based on 'estimated optimum values' is defined by

$$\bar{y}_{Rd(est)} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)^{\hat{\alpha}_o}\left[1 + \hat{w}_o\left(\frac{\hat{C}_z^{*} - C_z}{C_z}\right)\right] \tag{3.3}$$

To obtain the MSE of $\bar{y}_{Rd(est)}$, to the first degree of approximation, we write

$$\hat{\alpha}_o = \alpha_o\,(1 + e_3) \quad \text{and} \quad \hat{w}_o = w_o\,(1 + e_4)$$

such that

$$E(e_3) = O(m^{-1}) \quad \text{and} \quad E(e_4) = O(m^{-1}) \tag{3.4}$$

Now, expressing $\bar{y}_{Rd(est)}$ in terms of the $e$'s, we have

$$\bar{y}_{Rd(est)} = \bar{Y}\,(1 + e_o)\,\frac{(1 + e_1^{*})^{\alpha_o(1 + e_3)}}{(1 + e_1)^{\alpha_o(1 + e_3)}}\left[1 + w_o\,(1 + e_4)\left\{\frac{(1 + e_3^{*})^{1/2}}{(1 + e_2^{*})} - 1\right\}\right] \tag{3.5}$$

which may be expressed as


$$\bar{y}_{Rd(est)} = \bar{Y}\left[1 + e_o + \alpha_o\,(e_1^{*} - e_1) + \frac{w_o}{2}\,(e_3^{*} - 2e_2^{*})\right] + O(e^2)$$

or

$$\left(\bar{y}_{Rd(est)} - \bar{Y}\right) = \bar{Y}\left[e_o + \alpha_o\,(e_1^{*} - e_1) + \frac{w_o}{2}\,(e_3^{*} - 2e_2^{*})\right] + O(e^2) \tag{3.6}$$

Squaring both sides of (3.6) and then taking expectations, we get the MSE of $\bar{y}_{Rd(est)}$, to the first degree of approximation, as

$$\mathrm{MSE}\left(\bar{y}_{Rd(est)}\right) = \mathrm{MSE}(\bar{y}_{ld}) - \frac{\bar{Y}^2\left(\lambda - 2\rho_{yz} C_y C_z\right)^2}{n\left[\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right]} = \min.\mathrm{MSE}\left(\bar{y}_{Rd}^{(g)}\right) = \mathrm{MSE}\left(\bar{y}_{Rd}^{(opt)}\right) \tag{3.7}$$

Thus we have proved the following theorem.

Theorem 3.1.

If the parameters involved in the 'optimum values $(\alpha_o, w_o)$' of the constants $(\alpha, w)$ at (2.4) are replaced by their consistent estimators, the resulting estimator $\bar{y}_{Rd(est)}$ has the same mean squared error, up to terms of order $n^{-1}$, as that of the 'optimum estimator' $\bar{y}_{Rd}^{(opt)}$ of the class $\bar{y}_{Rd}^{(g)}$.

Remark 3.1. Putting $\alpha = 1$ in (2.2) and (2.3), we get the bias and MSE of $\bar{y}_{Rd}^{(1)}$ as

$$B\left(\bar{y}_{Rd}^{(1)}\right) = B(\bar{y}_{Rd}) + \left(\frac{\bar{Y}}{8n}\right) w\left[4\left(\lambda - C_z\sqrt{\beta_1(z)} + 2C_z^2 - 2\rho_{yz} C_y C_z\right) - (\beta_2(z) - 1)\right] \tag{3.8}$$

and

$$\mathrm{MSE}\left(\bar{y}_{Rd}^{(1)}\right) = \mathrm{MSE}(\bar{y}_{Rd}) + \left(\frac{\bar{Y}^2}{4n}\right) w\left[w\left\{\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right\} + 4\left(\lambda - 2\rho_{yz} C_y C_z\right)\right] \tag{3.9}$$

where

$$B(\bar{y}_{Rd}) = \left(\frac{1}{m} - \frac{1}{n}\right)\bar{Y}\left(C_x^2 - \rho_{yx} C_y C_x\right) \tag{3.10}$$

is the bias of $\bar{y}_{Rd}$, to the first degree of approximation, and $\mathrm{MSE}(\bar{y}_{Rd})$ is given at (2.10).

The $\mathrm{MSE}(\bar{y}_{Rd}^{(1)})$ at (3.9) is minimized when

$$w = w_o \tag{3.11}$$

where $w_o$ is given by (2.4). Thus the resulting minimum MSE of $\bar{y}_{Rd}^{(1)}$ is given by

$$\min.\mathrm{MSE}\left(\bar{y}_{Rd}^{(1)}\right) = \mathrm{MSE}(\bar{y}_{Rd}) - \left(\frac{\bar{Y}^2}{n}\right) A \tag{3.12}$$

where $A = \dfrac{\left(\lambda - 2\rho_{yz} C_y C_z\right)^2}{\left\{\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right\}}$.

From (2.6) and (3.12) we have

$$\min.\mathrm{MSE}\left(\bar{y}_{Rd}^{(1)}\right) - \min.\mathrm{MSE}\left(\bar{y}_{Rd}^{(g)}\right) = \left(\frac{1}{m} - \frac{1}{n}\right)\bar{Y}^2 C_x^2\,(1 - K)^2 \ge 0 \tag{3.13}$$

which shows that the proposed estimator $\bar{y}_{Rd}^{(g)}$ is more efficient than $\bar{y}_{Rd}^{(1)}$ at their optimum conditions.

Further, it follows from (3.9) that $\mathrm{MSE}(\bar{y}_{Rd}^{(1)}) < \mathrm{MSE}(\bar{y}_{Rd})$ if

$$\text{either } 0 < w < 2w_o,\ \lambda < 2\rho_{yz} C_y C_z \quad \text{or} \quad 2w_o < w < 0,\ \lambda > 2\rho_{yz} C_y C_z \tag{3.14}$$

where $w_o$ is the same as given in (2.4) [or (3.11)].

We note that if the conditions $\rho_{yx} > (1/2)\,C_x/C_y$ and (3.14) hold, then the estimator $\bar{y}_{Rd}^{(1)}$ is more efficient than both the estimators $\bar{y}$ and $\bar{y}_{Rd}$.

The estimator based on the estimated optimum value $\hat{w}_o$ is defined by

$$\bar{y}_{Rd(est)}^{(1)} = \bar{y}\left(\frac{\bar{x}^{*}}{\bar{x}}\right)\left[1 + \hat{w}_o\left(\frac{\hat{C}_z^{*} - C_z}{C_z}\right)\right] \tag{3.15}$$


It can be shown, to the first degree of approximation, that

$$\mathrm{MSE}\left(\bar{y}_{Rd(est)}^{(1)}\right) = \min.\mathrm{MSE}\left(\bar{y}_{Rd}^{(1)}\right) = \mathrm{MSE}(\bar{y}_{Rd}) - \left(\frac{\bar{Y}^2}{n}\right) A \tag{3.16}$$
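The estimated optimum values (3.2a) and (3.2b) and the estimator (3.3) can be sketched as below. The function names and the toy data are hypothetical, and $\hat{C}_z^{*}$ is again assumed to be the first-phase sample standard deviation (divisor $n - 1$) over the first-phase mean; the sample moments use the divisor $m$ as in the definition of $\hat{\mu}_{rst}$.

```python
from statistics import mean, stdev

def mu_hat(x, y, z, r, s, t):
    """Sample central product moment with divisor m, as defined after (3.2b)."""
    xb, yb, zb = mean(x), mean(y), mean(z)
    return sum((xi - xb) ** r * (yi - yb) ** s * (zi - zb) ** t
               for xi, yi, zi in zip(x, y, z)) / len(x)

def estimated_optimum(x, y, z):
    """Return (alpha_hat_o, w_hat_o) from (3.2a) and (3.2b)."""
    xb, yb, zb = mean(x), mean(y), mean(z)
    m002 = mu_hat(x, y, z, 0, 0, 2)
    alpha_hat = mu_hat(x, y, z, 1, 1, 0) * xb / (mu_hat(x, y, z, 2, 0, 0) * yb)
    num = 2 * (2 * mu_hat(x, y, z, 0, 1, 1) / (yb * zb)
               - mu_hat(x, y, z, 0, 1, 2) / (yb * m002))
    den = (4 * (m002 / zb ** 2 - mu_hat(x, y, z, 0, 0, 3) / (zb * m002))
           + mu_hat(x, y, z, 0, 0, 4) / m002 ** 2 - 1)
    return alpha_hat, num / den

def y_rd_est(x, y, z, x_star, z_star, C_z):
    """Estimator (3.3) built from the estimated optimum values."""
    alpha_hat, w_hat = estimated_optimum(x, y, z)
    C_z_hat = stdev(z_star) / mean(z_star)   # assumed form of the estimate of C_z
    return (mean(y) * (mean(x_star) / mean(x)) ** alpha_hat
            * (1 + w_hat * (C_z_hat - C_z) / C_z))

# Hypothetical second-phase data in which y is exactly proportional to x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
z = [1.0, 2.0, 3.0, 4.0, 5.0]
print(estimated_optimum(x, y, z))
```

For these toy data the estimate of $\alpha_o$ equals 1, as expected when $y$ is exactly proportional to $x$, which gives a quick check of the moment formulas.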

4. Optimum Sample Sizes Under A Linear Cost Structure

Let C1, C2 and C3 be the costs for collecting information on the study character y, the auxiliary variate x and another auxiliary character z respectively. The total cost Co of the survey can be expressed in the linear form

$$C_o = m\,C_1 + (C_2 + C_3)\,n. \tag{4.1}$$

The MSE expression in (2.3) can be expressed as

$$\mathrm{MSE}\left(\bar{y}_{Rd}^{(g)}\right) = \frac{M_1}{m} + \frac{M_2}{n} \tag{4.2}$$

where

$$M_1 = \left[C_y^2 + \alpha^2 C_x^2 - 2\alpha\rho_{yx} C_y C_x\right]\bar{Y}^2$$

and

$$M_2 = \left[\frac{w^2}{4}\,H(z) + w\left(\lambda - 2\rho_{yz} C_y C_z\right) + \left(2\alpha\rho_{yx} C_y C_x - \alpha^2 C_x^2\right)\right]\bar{Y}^2$$

where $H(z) = \left[\Delta(z) + \left(2C_z - \sqrt{\beta_1(z)}\right)^2\right]$.

We can determine $m$ and $n$ so that they minimize the MSE of $\bar{y}_{Rd}^{(g)}$ under the cost structure (4.1). For this purpose we consider the function

$$\varphi = \frac{M_1}{m} + \frac{M_2}{n} + \mu\left\{m C_1 + n\,(C_2 + C_3) - C_o\right\} \tag{4.3}$$

where $\mu$ is the Lagrangian multiplier. Differentiating (4.3) with respect to $m$ and $n$ and equating to zero, we get

$$m\sqrt{\frac{C_1}{M_1}} = n\sqrt{\frac{(C_2 + C_3)}{M_2}} = \frac{C_o}{\left\{\sqrt{M_1 C_1} + \sqrt{(C_2 + C_3)\,M_2}\right\}} \tag{4.4}$$

This yields the optimum values of $m$ and $n$ as


$$m_{opt} = \frac{C_o\sqrt{M_1/C_1}}{\left\{\sqrt{M_1 C_1} + \sqrt{(C_2 + C_3)\,M_2}\right\}}, \qquad n_{opt} = \frac{C_o\sqrt{M_2/(C_2 + C_3)}}{\left\{\sqrt{M_1 C_1} + \sqrt{(C_2 + C_3)\,M_2}\right\}} \tag{4.5}$$

Thus the resulting minimum MSE of $\bar{y}_{Rd}^{(g)}$ is given by

$$\min.\mathrm{MSE}\left(\bar{y}_{Rd}^{(g)}\right) = \left[\sqrt{M_1 C_1} + \sqrt{(C_2 + C_3)\,M_2}\right]^2 \big/ C_o \tag{4.6}$$

In order to evaluate the relative efficiency (RE) of the estimator $\bar{y}_{Rd}^{(g)}$ with respect to the sample mean $\bar{y}$, we have

$$\min.\mathrm{Var}(\bar{y}) = \left(\frac{C_1}{C_o}\right)\bar{Y}^2 C_y^2 \tag{4.7}$$

since the effective sample size is $C_o/C_1$. Thus the relative efficiency of $\bar{y}_{Rd}^{(g)}$ with respect to the sample mean $\bar{y}$ is given by

$$\mathrm{RE}\left(\bar{y}_{Rd}^{(g)}, \bar{y}\right) = C_y^2\left[\sqrt{M_1^{*}} + \sqrt{\left(\frac{C_2 + C_3}{C_1}\right) M_2^{*}}\right]^{-2}$$

which is greater than unity if

$$\frac{C_2 + C_3}{C_1} < \left(\frac{C_y - \sqrt{M_1^{*}}}{\sqrt{M_2^{*}}}\right)^2 \tag{4.8}$$

where $M_1^{*} = M_1/\bar{Y}^2$ and $M_2^{*} = M_2/\bar{Y}^2$.

If $(\alpha, w)$ in (1.7) are chosen to be the optimum values $(\alpha_o, w_o)$ in (2.4), then (4.8) reduces to:


$$\frac{(C_2 + C_3)}{C_1} < \left[\frac{C_y - \sqrt{M_{10}^{*}}}{\sqrt{\left(C_y^2 - M_{10}^{*} - A\right)}}\right]^2 \tag{4.9}$$

where $M_{10}^{*} = C_y^2\left(1 - \rho_{yx}^2\right)$.

If we assume that $(x, y, z)$ follows a trivariate normal distribution, then (4.9) reduces to:

$$\frac{(C_2 + C_3)}{C_1} < \left[\frac{C_y - \sqrt{M_{10}^{*}}}{\sqrt{\left(C_y^2 - M_{10}^{*} - A^{*}\right)}}\right]^2 \tag{4.10}$$

where $A^{*} = \dfrac{2\rho_{yz}^2 C_y^2 C_z^2}{\left(1 + 2C_z^2\right)}$.
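The allocation in (4.5) and the resulting minimum MSE (4.6) can be illustrated with a short sketch. The function name and the numerical inputs are hypothetical; the sketch assumes $M_1 > 0$ and $M_2 > 0$ and ignores both the rounding of $m$ and $n$ to integers and the requirement $m \le n$.

```python
import math

def optimum_sizes(M1, M2, C1, C2, C3, Co):
    """Return (m_opt, n_opt, minimum MSE) from (4.5) and (4.6).

    Assumes M1 > 0 and M2 > 0; sample sizes are returned as real numbers.
    """
    D = math.sqrt(M1 * C1) + math.sqrt((C2 + C3) * M2)
    m_opt = Co * math.sqrt(M1 / C1) / D
    n_opt = Co * math.sqrt(M2 / (C2 + C3)) / D
    return m_opt, n_opt, D ** 2 / Co

# Hypothetical inputs: observing y costs 16 units per unit, x and z cost 0.5 each.
M1, M2 = 4.0, 1.0
C1, C2, C3, Co = 16.0, 0.5, 0.5, 90.0

m_opt, n_opt, best = optimum_sizes(M1, M2, C1, C2, C3, Co)
print(m_opt, n_opt, best)
print(m_opt * C1 + n_opt * (C2 + C3))   # total cost implied by the allocation
print(M1 / m_opt + M2 / n_opt)          # MSE (4.2) evaluated at the allocation
```

The last two lines check that the allocation exhausts the budget $C_o$ in (4.1) and that (4.2) evaluated at the optimum sizes agrees with (4.6).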

5. Empirical Study

To demonstrate the performance of the constructed estimators $\bar{y}_{Rd(est)}$ and $\bar{y}_{Rd(est)}^{(1)}$ over the usual unbiased estimator $\bar{y}$, the two-phase sampling ratio estimator $\bar{y}_{Rd}$ and the two-phase sampling regression estimator $\bar{y}_{ld}$, we consider two natural population data sets. The populations are described below.

Population I. Source: Sukhatme and Sukhatme (1977), p. 185, vil. 1-34
y : Area under wheat in 1937
x : Cultivated area in 1931
z : Area under wheat in 1937

$C_y = 0.744310$, $\rho_{yx} = 0.816875$, $\beta_2(z) = 3.281589$,
$C_x = 0.599037$, $\rho_{yz} = 0.929922$, $\beta_1(z) = 1.206088$,
$C_z = 0.756439$, $\lambda = 0.622752$.
For illustration we take $m = 5$, $n = 10$.

Population II. Source: Murthy (1967), p. 127, vil. 1-35
y : Cultivated area (in acres)
x : Area in square miles
z : Number of persons

$C_y = 0.640331$, $\rho_{yx} = 0.913818$, $\beta_2(z) = 2.299934$,
$C_x = 0.626490$, $\rho_{yz} = 0.786010$, $\beta_1(z) = 0.235919$,
$C_z = 0.619070$, $\lambda = 0.127482$.
For illustration we take $m = 5$, $n = 10$.

The percent relative efficiencies (PREs) of the estimators $\bar{y}$, $\bar{y}_{Rd}$, $\bar{y}_{ld}$, $\bar{y}_{Rd(est)}^{(1)}$ and $\bar{y}_{Rd(est)}$ with respect to the usual unbiased estimator $\bar{y}$ have been computed and compiled in Table 5.1.

Table 5.1. Percent relative efficiency of different estimators of $\bar{Y}$ with respect to $\bar{y}$

| Estimator | PRE of $(\cdot)$ w.r.t. $\bar{y}$, Population I | PRE of $(\cdot)$ w.r.t. $\bar{y}$, Population II |
|---|---|---|
| $\bar{y}$ | 100.000 | 100.000 |
| $\bar{y}_{Rd}$ | 150.053 | 171.071 |
| $\bar{y}_{ld}$ | 150.069 | 171.683 |
| $\bar{y}_{Rd(est)}^{(1)}$ | 186.524 | 244.668 |
| $\bar{y}_{Rd(est)}$ | 186.549 | 245.922 |

Table 5.1 clearly indicates that the suggested estimators $\bar{y}_{Rd(est)}^{(1)}$ and $\bar{y}_{Rd(est)}$ are more efficient than $\bar{y}$, $\bar{y}_{Rd}$ and $\bar{y}_{ld}$, with a substantial gain in efficiency. It is further observed that the estimator $\bar{y}_{Rd(est)}$ is more efficient than $\bar{y}_{Rd(est)}^{(1)}$, but only with a very marginal gain. The reason is that there is a high correlation between the study variate $y$ and the auxiliary variate $x$, and the value $K = \rho_{yx}\,C_y/C_x$ is 1.015 in Population I and 0.934 in Population II (i.e. approximately unity, $K \cong 1$). Thus the proposed estimators $\bar{y}_{Rd(est)}^{(1)}$ and $\bar{y}_{Rd(est)}$ are to be preferred in practice when the population coefficient of variation $C_z$ of the second auxiliary variate $z$ is known without error and the population mean $\bar{Z}$ of $z$ is not known. We also note that if the value of $K$ is not approximately equal to 1, the estimator $\bar{y}_{Rd(est)}$ will perform much better than $\bar{y}_{Rd(est)}^{(1)}$.
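The Population I column of Table 5.1 can be recomputed from the first-order formulas (2.6), (2.7), (2.9), (2.10) and (3.12) with the parameter values listed above. The sketch below is an illustration with hypothetical names; it takes the skewness term as $\sqrt{\beta_1(z)}$ and works with relative MSEs, since $\bar{Y}^2$ cancels in every ratio.

```python
import math

# Population I parameters as listed in Section 5.
pop1 = dict(C_y=0.744310, C_x=0.599037, C_z=0.756439, rho_yx=0.816875,
            rho_yz=0.929922, beta1=1.206088, beta2=3.281589, lam=0.622752)

def pre_table(p, m, n):
    """Percent relative efficiencies with respect to the sample mean."""
    var_y = p["C_y"] ** 2 / m                                    # (2.9)
    K = p["rho_yx"] * p["C_y"] / p["C_x"]
    mse_rd = var_y + (1 / m - 1 / n) * p["C_x"] ** 2 * (1 - 2 * K)   # (2.10)
    rho2 = p["rho_yx"] ** 2
    mse_ld = p["C_y"] ** 2 * ((1 - rho2) / m + rho2 / n)         # (2.7)
    H = (p["beta2"] - p["beta1"] - 1) + (2 * p["C_z"] - math.sqrt(p["beta1"])) ** 2
    A = (p["lam"] - 2 * p["rho_yz"] * p["C_y"] * p["C_z"]) ** 2 / H
    mse_1 = mse_rd - A / n                                       # (3.12)
    mse_g = mse_ld - A / n                                       # (2.6)
    mses = {"y": var_y, "y_Rd": mse_rd, "y_ld": mse_ld,
            "y_Rd(est)^(1)": mse_1, "y_Rd(est)": mse_g}
    return {name: 100 * var_y / value for name, value in mses.items()}

pre = pre_table(pop1, m=5, n=10)
for name, value in pre.items():
    print(f"{name:15s} {value:8.3f}")
```

Under these assumptions the computed values agree with the Population I column of Table 5.1 up to rounding in the last decimal place.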

Acknowledgement

The authors are thankful to the referees and the editor, Professor J. Kordos, for their valuable suggestions for improving the paper.

REFERENCES

CHAND, L. (1975): Some ratio-type estimators based on two or more auxiliary variables. Unpublished Ph.D. dissertation, Iowa State University, USA.
GLESER, J. AND HEALY, J. D. (1976): Estimating the mean of a normal distribution with known coefficient of variation. Jour. Amer. Stat. Assoc., 71, 977–981.
GOVINDARAJULU, Z. AND SAHAI, H. (1972): Estimation of a normal distribution with known coefficient of variation. Statist. Appl. Res., JUSE, (A), 91, 85–98.
KIREGYERA, B. (1980): A chain ratio-type estimator in finite population double sampling using two auxiliary variables. Metrika, 27, 217–223.
KIREGYERA, B. (1984): Regression-type estimators using two auxiliary variables and the model of double sampling from finite populations. Metrika, 31, 215–226.
KHAN, R. A. (1968): A note on estimating the mean of a normal distribution with known coefficient of variation. Jour. Amer. Stat. Assoc., 63, 1039–1041.
MUKERJEE, R., RAO, T. J. AND VIJAYAN, K. (1987): Regression type estimators using multiple auxiliary information. Aust. Jour. Statist., 29, 244–254.
MURTHY, M. N. (1967): Sampling Theory and Methods. Statistical Publishing Society, Calcutta, India.
REDDY, V. N. (1978): A study on the use of prior knowledge on certain population parameters in estimation. Sankhya, C, 40, 29–37.
SEN, A. R. (1978): Estimation of the population mean when the coefficient of variation is known. Commun. Statist. Theor. Meth., A7, 657–672.
SEARLS, D. T. (1964): The utilization of a known coefficient of variation in estimation procedure. J. Amer. Statist. Assoc., 59, 1225–26.
SEARLS, D. T. AND INTARAPANICH, P. (1990): A note on an estimator for the variance that utilizes the kurtosis. The American Statistician, 44(4), 295–296.
SINGH, J. PANDEY, B. N. AND HIRANO, K. (1973): On the utilization of a known coefficient of kurtosis in the estimation procedure of variance. Ann. Inst. Stat. Math., 25, 51–55.
SINGH, V. K. AND SINGH, G. N. (1991): Chain-type regression estimators with two auxiliary variables under double sampling scheme. Metron, 49, 279–289.
SINGH, V. K., SINGH, HARI P., SINGH, HOUSILA P., AND SHUKLA, D. (1994): A general class of chain estimators for ratio and product of two means of a finite population. Commun. Statist. Theor. Meth., 23(5), 1341–1355.
SRIVASTAVA, S. K. (1967): An estimator using auxiliary information in sample surveys. Cal. Stat. Assoc. Bull., 16, 121–132.
SRIVASTAVA, S. K. (1970): A two-phase sampling estimator in sample surveys. Aust. Jour. Statist., 12, 23–27.
SRIVASTAVA, S. R., SRIVASTAVA, S. R. AND KHARE, B. B. (1989): Chain ratio-type estimator for ratio of two population means using auxiliary characters. Commun. Statist. Theor. Meth., 18, 3917–3926.
SRIVENKATARAMANA, T. AND TRACY, D. S. (1989): Two phase sampling for selection with probability proportional to size in sample surveys. Biometrika, 76, 818–821.
SUKHATME, P. V. AND SUKHATME B. V. (1970): Sampling Theory of Surveys with applications. Asia Publishing House, India.
UPADHYAYA, L. N. AND SINGH, H. P. (1984): On the estimation of the population mean with known coefficient of variation. Biom. J., 26, 915–922.
UPADHYAYA, L. N., KUSHWAHA, K. S. AND SINGH, H. P. (1990): A modified chain ratio-type estimator in two-phase sampling using multi-auxiliary information. Metron, 48, 381–393.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1345—1360

METHODOLOGY AND EMPIRICAL RESULTS OF THE TIME USE SURVEYS IN POLAND

Ilona Błaszczak-Przybycińska1

ABSTRACT

The paper presents the methodology and selected results of time use surveys in Poland, which have a long tradition in this country. Several time use surveys were conducted in the 1950s and 1960s. The first nationwide survey was carried out in 1968/1969. Nationwide time use surveys have been performed by the Central Statistical Office four times. The last one was organized in 2003/2004, within the framework of the harmonized European time use surveys.

Keywords: Time use survey, Time budget survey, Time budget research, International Association for Time Use Research, Social statistics.

1. Introduction

Time use surveys in Poland have a long tradition. Several time use surveys were conducted in the 1950s and 1960s, but their samples did not cover all socio-economic groups and regions of the country. Nationwide time use sample surveys have been carried out by the Central Statistical Office four times. The first one was conducted in 1969. The sample for this survey covered only employees of the state sector outside agriculture and forestry. The next two surveys were carried out in 1976 and 1984. The methodology of these two surveys was similar, though some changes, among others in the classification of activities, were introduced in comparison with the 1969 survey. The political and economic changes that started in 1989 resulted in the transition to a market economy and in the process of integration with the European Union. Poland became a member of the EU on 1 May 2004. The changes which took place in East European countries also had an impact on the methodology of statistical surveys.

1 The Warsaw School of Economics, Institute of Statistics and Demography.


The Central Statistical Office made efforts to adjust the methodology of Polish statistical surveys to the standards of Eurostat, the statistical office of the European Community. A broader transformation of Polish social statistics started in 1989, although attempts to harmonize the household surveys with European standards were made as early as the beginning of the 1980s (Kordos, 1998). The aim was for them to constitute a part of a harmonized system of social statistics providing data comparable across the EU countries.

International time use conferences organized by the Central Statistical Office fostered the methodological and organizational changes in the Polish time use surveys. The conference in 1988 was devoted to the theory and practice of time use models (Central Statistical Office, 1990). In 1995 an International Conference on Methodological Issues of Time Use Surveys was organized by Eurostat, the International Association for Time Use Research (IATUR) and the Central Statistical Office. The objective of the conference was to report and explore recent and emerging developments in the field of time use measurement and research, with particular emphasis on, among others, recent and pending methodological developments in sampling design, data collection methodology, estimation and analysis methods, and the integration of time use and related surveys. Emphasis was also put on the methodological dimension of the pending European national time use surveys as well as of recent and pending Central and Eastern European national time use surveys (Harvey, 1995; Rydenstam, 1995).

In 1996 the Central Statistical Office carried out a pilot time use study based on Eurostat proposals. Testing the new methodology was the main objective. The pilot study allowed empirical verification of the household and respondent questionnaires and of the activity diaries, and an assessment of the organizational and methodological solutions applied in the survey (Central Statistical Office, 1998). The pilot survey preceded the newest time use survey, carried out in 2003/2004. Twenty years separated this survey from the previous one, which was performed in 1984. The last survey was carried out according to Eurostat proposals within the framework of the harmonized European time use surveys.

2. Methodology of time use surveys in Poland

The first nationwide time budget survey: 1968–1969

The first nationwide time use survey started in 1968. The methodology was based on principles accepted by the European Coordination Centre for Research and Documentation in Social Sciences in Vienna. This institute harmonized the international comparative surveys organized in 1966, which were initiated by UNESCO. The survey methodology, the classification of activities, and the methods of calculation and presentation of results established for the time use survey conducted within the framework of the UNESCO programme were later adopted in the 1969 time use survey in Poland.


The time use survey was conducted from 15 September 1968 to 30 June 1969. It lasted 290 days, and the vacation period from 1 July to 14 September 1969 was excluded from the survey (Kordos, 1988). A complete 24-hour rotation was applied: each day a different group of persons was surveyed. The sample was based on the frame of households participating in the household budget survey. It covered persons aged 18 years and over who were members of households in which at least one person was employed in the state sector outside agriculture and forestry. The sample size was 13200 persons and 3466 households. The refusal rate was 6 percent (Adamczuk, 1990).

The second survey: 1975–1976

The second nationwide time use survey was performed in 1975–1976. It lasted 112 days, i.e. four consecutive weeks (28 days) in each quarter of the year (6 April to 3 May 1975, 1 to 28 September 1975, 3 to 30 November 1975 and 1 to 28 February 1976). As in the previous survey, a complete 24-hour rotation was applied. The survey covered households selected for the household budget survey. The sample comprised 21812 persons aged 18 years and over and 9966 households. Four socio-economic groups were included: employees, employees-farmers, farmers, and retirees and invalid-pensioners. The share of persons who refused to participate in the survey was 5.7 percent.

The third survey: 1984

In 1984 the sample for the time use survey was again drawn from the sample of households which took part in the household budget survey, where a two-stage sampling scheme was used. Households were selected for each day of 1984. The structure of days of the week in the sample was consistent with the structure of days of the week for the whole year. The structure of days was as follows: weekdays (Monday to Friday) 73 percent, Saturdays 13.3 percent, Sundays 13.7 percent (Central Statistical Office, 1985). The sample covered 360 days because six holidays were omitted. A 24-hour rotation was applied: new individuals were sampled each day. Each person kept a diary for one day and night. The number of individuals as well as their structure according to the source of income was the same each day. As in the 1975–1976 survey, four household types were specified: employees, employees-farmers, farmers, and retirees and invalid-pensioners. The sample comprised 45087 individuals aged 18 and over and 21600 households. The share of individuals who refused to take part in the survey was 5.4 percent.

Pilot study: 1996

The last time use survey, carried out in 2003–2004, was preceded by a pilot study performed in 1996. The sample comprised one thousand households with 2484 individuals aged 10 and over. As in previous surveys, the sample was selected from the households which participated in the household budget survey. Although the sample size was not comparable with that of the full surveys, the pilot study made it possible to adjust the new methodology to the Eurostat recommendations and to organize the full survey better (Central Statistical Office, 1996).

3. Sample design in the 2003/2004 survey

The most recent time budget survey, carried out in 2003-2004, introduced several changes in methods and scope. In contrast to the previous surveys, the sample was drawn independently of the household budget survey. A two-stage sampling scheme was used. Census areas, or combined census areas where the minimum number of dwellings was not ensured, were the first-stage sampling units and dwellings were the second-stage sampling units. The sampling frame was provided by the BREC 99 system, a set of statistical regions and census areas designed for national census purposes but also used for sample surveys. Days were sampled for each dwelling. All days from 1st June 2003 to 31st May 2004 were represented (366 days). In contrast to the previous surveys, in which respondents kept only one 24-hour diary, each person recorded the activities performed during two days: a weekday and a weekend day.

A new classification of households was used in the 2003/2004 survey. Six household types were distinguished according to the main source of income: employees, employees-farmers, farmers, the self-employed, retirees and invalid-pensioners, and those living on unearned sources other than invalid pensions and retirement pensions. The age limit was also changed: in previous surveys respondents were aged 18 years and over, whereas in the 2003-2004 survey they were aged 15 years and over. Hence adults had to be singled out from this population for any comparative analyses over time with earlier data. The sample comprised 20264 persons aged 15 and over and 10256 households. The share of individuals who refused to participate in the survey (19.6 percent) was much higher than in previous surveys. This resulted from the fact that the sampling was independent of the household budget survey.

4. Activities registration and categories

In all time budget surveys respondents registered their activities themselves, but the method of registration differed. In the 2003-2004 survey activities were coded by trained persons, while in the previous surveys respondents coded the activities themselves. In the most recent survey activities were registered in diaries in 10-minute intervals, whereas in previous surveys 15-minute intervals were applied. The shorter interval made it possible to capture short activities more fully than before.


The list of activities specified in surveys was systematically broadened. In the first survey (1968/1969) 46 detailed activities were divided into four groups. In the second survey (1975/1976) there were 49 activities. In the third one (1984) the number of activities was 53. In the second and the third surveys activities were grouped into seven categories. In the last survey there was a list of more than 200 activities within ten groups (Table 1).

Table 1. Activity groups in time use surveys

Main categories of activities in the time use surveys

| 1968/1969 | 1975/1976 and 1984 | 2003/2004 |
|---|---|---|
| 1. Employment | 1. Personal care | 1. Personal care |
| 2. Personal care | 2. Employment | 2. Employment |
| 3. Obligatory activities | 3. Travel | 3. Study |
| 4. Leisure | 4. Outdoor activities | 4. Household and family care |
| | 5. Household and family care | 5. Volunteer work and meetings |
| | 6. Study | 6. Social life and entertainment |
| | 7. Leisure | 7. Sports and outdoor activities |
| | | 8. Hobbies and games |
| | | 9. Mass media |
| | | 10. Travel and unspecified time use |

Source: author's compilation (GUS, 1970, p. 14; 1987, pp. 116-119; 2005, pp. 275-281).

Activities connected with personal care and employment were specified in all time use surveys. Studying, household and family care, and travel, which were specified in 2003-2004, were also analyzed in the previous two surveys. Groups of activities did not always comprise the same categories. For instance, time spent shopping, which was classified among household and family care activities in the most recent time use survey, was classified among outdoor activities in 1975/1976 and 1984. The group of leisure activities was significantly modified in the 2003/2004 survey. New activity groups were specified: social life and entertainment, sports and outdoor activities, and hobbies and games. Volunteer work, which was included in the group of leisure activities in 1976 and 1984, emerged as a separate group in 2003/2004.


5. Estimated parameters

Three indicators were calculated from the results of the 2003/2004 survey:

• the average activity duration (calculated per person participating in the survey):

$$\tilde{y}_{az} = \frac{\sum_{i=1}^{n_z} y_{aiz}}{n_z}, \tag{5.1}$$

where:
$y_{aiz}$ – duration of activity "a" on day "z" of the week for the i-th person,
$n_z$ – number of persons who filled in a diary on day "z" of the week;

• the average time of activity performance (calculated per person performing the activity):

$$\bar{y}_{az} = \frac{\sum_{i=1}^{n_z} y_{aiz}}{n_{az}}, \tag{5.2}$$

where:
$n_{az}$ – number of persons performing activity "a" on day "z" of the week;

• the percentage of persons performing the activity:

$$w_{az} = \frac{n_{az}}{n_z}. \tag{5.3}$$

The methods of calculating mean values differed between the surveys analyzed. In the surveys performed in 1969, 1976 and 1984 the means were calculated from individual data without weights, as the same number of respondents was sampled for each day of the calendar year. Moreover, each respondent filled in only one diary. In 2003/2004 each respondent completed two diaries, one on a weekday and another on a weekend day, so the number of weekend diaries was the same as the number of weekday diaries. It was therefore necessary to calculate the averages separately for the different days of the week and to weight them accordingly. A weight of 5/7 was used for weekdays and 2/7 for weekend days. The standard errors were estimated using the balanced half-sample replication method (Central Statistical Office, 2005). Data obtained from the survey were generalized to the entire population, broken down by sex, age, level of education, socio-economic group, type of residence, family type, occupational activity and day of the week.
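As an illustration of formulas (5.1)-(5.3) and of the 5/7 and 2/7 weighting, the following minimal Python sketch computes the three indicators for one activity from a list of diary durations. The diary values, the function names and the decision to apply the weights to the per-surveyed-person mean are assumptions made for the example only; they are not taken from the survey documentation.

```python
from typing import Sequence, Tuple


def indicators(durations: Sequence[float]) -> Tuple[float, float, float]:
    """Indicators for one activity on one type of day.

    durations: minutes recorded by each surveyed person (0 = activity not performed).
    Returns (mean per surveyed person, mean per performer, share performing).
    """
    n_z = len(durations)                        # persons who filled in a diary
    n_az = sum(1 for y in durations if y > 0)   # persons performing the activity
    total = sum(durations)
    per_surveyed = total / n_z                      # formula (5.1)
    per_performer = total / n_az if n_az else 0.0   # formula (5.2)
    share = n_az / n_z                              # formula (5.3)
    return per_surveyed, per_performer, share


def weekly_average(weekday_value: float, weekend_value: float) -> float:
    """Combine day-type averages with weights 5/7 (weekdays) and 2/7 (weekend days)."""
    return (5 * weekday_value + 2 * weekend_value) / 7


# Hypothetical diaries (minutes of paid work) for five respondents.
weekday_diaries = [0, 480, 510, 0, 450]
weekend_diaries = [0, 0, 300, 0, 0]

wd = indicators(weekday_diaries)
we = indicators(weekend_diaries)
avg_day = weekly_average(wd[0], we[0])

print("weekday indicators:", wd)
print("weekend indicators:", we)
print("weighted average per surveyed person (minutes):", round(avg_day, 1))
```

Note that, by construction, the mean per surveyed person equals the mean per performer multiplied by the share of persons performing the activity.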


6. Empirical results of Polish time budget surveys

An exact comparison of the results of the time use surveys conducted in Poland is not possible because of differences in methodology: in sampling, survey organization, the way information was collected and the classification of activities. A long-term comparison of the time budgets of the total population of Poland is also not possible because the 1969 survey did not include people working in agriculture and forestry. However, some tendencies in time use in Poland can be pointed out. Comparisons can be made only for those empirical results that are based on data for persons aged 18 and over in all surveys. In addition, only those activities whose classifications were comparable can be analyzed across all surveys.

Selected results of the four time use surveys for persons aged 18 years and over are presented in Table 2. The categories presented comprise similar detailed activities. The average time (per person performing the activity) spent on occupational work, studying, leisure and travel was higher for men than for women. Among the groups of activities presented in Table 2 there was only one, housework, where the average time was higher for women in all surveys. In 1975/1976 and in 1984 the average time spent on housework by women was about three hours longer than the average time for men. In the most recent survey the difference diminished to two hours. This can be partly explained by the higher share of women in the labour market, but it also reflects the tendency towards an increasing share of men in domestic work.
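The housework gap discussed above can be checked directly against the figures in Table 2. The short Python sketch below converts the "hours.minutes" notation used in the tables into minutes and computes the difference between women and men for each survey; the helper name `to_minutes` is an assumption introduced for the example, while the input values are copied from the Housework rows of Table 2.

```python
def to_minutes(hm: str) -> int:
    """Convert the tables' 'hours.minutes' notation (e.g. '5.09') into minutes."""
    hours, minutes = hm.split(".")
    return int(hours) * 60 + int(minutes)


# Housework, average time per person performing the activity (Table 2):
# survey -> (males, females). Kept as strings so that '5.00' is not read as a decimal.
housework = {
    "1968/1969": ("1.42", "3.42"),
    "1975/1976": ("2.08", "5.00"),
    "1984": ("2.10", "5.09"),
    "2003/2004": ("2.40", "4.38"),
}

gap = {
    survey: to_minutes(females) - to_minutes(males)
    for survey, (males, females) in housework.items()
}

for survey, minutes in gap.items():
    print(f"{survey}: women spent {minutes} minutes more than men")
```

With these inputs the gap is 172 and 179 minutes for 1975/1976 and 1984, and 118 minutes for 2003/2004, in line with the "about three hours" and "two hours" quoted in the text.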


Table 2. Average time of activity performance in Polish time use surveys (age 18 years and over)

Average duration (in hours and minutes)

| Activities | Categories | 1968/1969 | 1975/1976 | 1984 | 2003/2004 |
|---|---|---|---|---|---|
| Personal care | Total | 10.11 | 10.03 | 10.15 | 11.02 |
| | Males | 10.16 | 9.59 | 10.11 | 10.51 |
| | Females | 10.05 | 10.06 | 10.18 | 11.12 |
| Employment (outside agriculture) | Total | 8.16 | 7.41 | 7.44 | 7.21 |
| | Males | 8.15 | 7.57 | 8.00 | 7.36 |
| | Females | 7.55 | 7.22 | 7.25 | 7.02 |
| Study | Total | 3.15 | 5.55 | 5.09 | 4.39 |
| | Males | 3.28 | 6.01 | 5.23 | 4.48 |
| | Females | 3.03 | 5.49 | 4.58 | 4.31 |
| Housework | Total | 3.05 | 4.06 | 4.07 | 3.45 |
| | Males | 1.42 | 2.08 | 2.10 | 2.40 |
| | Females | 3.42 | 5.00 | 5.09 | 4.38 |
| Leisure | Total | 4.35 | 4.25 | 4.42 | 4.17* |
| | Males | 4.52 | 4.57 | 5.22 | 4.32* |
| | Females | 3.30 | 3.59 | 4.09 | 4.02* |
| Travels | Total | 0.32 | 1.15 | 1.12 | 1.27 |
| | Males | 0.33 | 1.20 | 1.17 | 1.38 |
| | Females | 0.32 | 1.10 | 1.08 | 1.18 |

* values calculated on the basis of the percentage of people performing the activity in the 1996 pilot study.
Source: Słaby, 1998, p. 749 and own calculations based on GUS, 2005, pp. 200-202.


Table 3. Frequency of activities performance in Polish time use surveys (age 18 years and over)

Frequency (in percent)

| Activities | Categories | 1968/1969 | 1975/1976 | 1984 | 2003/2004 |
|---|---|---|---|---|---|
| Personal care | Total | 100.0 | 100.0 | 100.0 | 100.0 |
| | Males | 100.0 | 100.0 | 100.0 | 100.0 |
| | Females | 100.0 | 100.0 | 100.0 | 100.0 |
| Employment (outside agriculture) | Total | 79.9 | 39.3 | 35.2 | 24.1 |
| | Males | 81.5 | 49.2 | 44.3 | 28.8 |
| | Females | 78.1 | 31.7 | 27.9 | 19.9 |
| Study | Total | 8.9 | 6.9 | 3.2 | 8.3 |
| | Males | 9.8 | 7.4 | 3.1 | 8.1 |
| | Females | 7.9 | 6.6 | 3.3 | 8.5 |
| Housework | Total | 76.6 | 78.5 | 81.9 | 92.3 |
| | Males | 58.0 | 56.8 | 63.7 | 86.4 |
| | Females | 95.5 | 95.2 | 96.6 | 97.6 |
| Leisure | Total | 89.9 | 92.1 | 95.6 | 97.7* |
| | Males | 95.7 | 96.1 | 97.7 | 97.7* |
| | Females | 83.8 | 90.7 | 94.0 | 97.6* |
| Travels | Total | 75.5 | 68.8 | 61.2 | 87.5 |
| | Males | 77.4 | 75.3 | 68.2 | 88.9 |
| | Females | 73.6 | 63.7 | 55.5 | 86.3 |

* percentage according to the 1996 pilot study.
Source: Słaby, 1998, p. 749; GUS, 2005, pp. 203-204.

Interesting changes can be noted in the structure of people who studied. In the two earlier surveys, i.e. in 1969 and 1976, the share of people who studied was higher among men. In 1984 and in 2003/2004 the situation was the opposite: the share of people who studied was higher among women (Table 3). Between the last two surveys the percentage of people who studied increased while the average time spent studying decreased. The most recent survey also revealed an increase in the average time spent on travel as well as in the percentage of persons who spent time on this activity.

Average time per day by group of activity and the percentage of persons performing the activity in the 2003/2004 survey are presented in Tables 4 and 5. Average time devoted to personal care was about 11 hours. Time devoted to household and family care was 3 hours and 21 minutes per person participating in the survey and 3 hours and 39 minutes per person performing the activity. The average time devoted to this activity per surveyed person was twice as long for women as for men.

Table 4. Average time per day according to the group of activity in the 2003/2004 survey in hours and minutes (age 15 years and over)

Columns 3-5: average duration (per surveyed person). Columns 6-8: average time (per person performing the activity).

| No | Activity group | Total | Males | Females | Total | Males | Females |
|---|---|---|---|---|---|---|---|
| 1. | Personal care | 11.03 | 10.53 | 11.12 | 11.03 | 10.53 | 11.12 |
| | of which: sleeping | 8.37 | 8.30 | 8.43 | 8.37 | 8.30 | 8.43 |
| | of which: eating | 1.33 | 1.33 | 1.33 | 1.33 | 1.33 | 1.34 |
| 2. | Employment | 2.41 | 3.31 | 1.55 | 7.07 | 7.39 | 6.23 |
| 3. | Studying | 0.33 | 0.34 | 0.32 | 5.01 | 5.10 | 4.53 |
| 4. | Household and family care | 3.21 | 2.13 | 4.22 | 3.39 | 2.36 | 4.30 |
| | of which: shopping | 0.24 | 0.20 | 0.27 | 0.50 | 0.52 | 0.48 |
| | of which: childcare | 0.24 | 0.14 | 0.33 | 1.59 | 1.28 | 2.17 |
| 5. | Volunteer work and meetings | 0.28 | 0.25 | 0.30 | 1.35 | 1.43 | 1.30 |
| | of which: help to other households | 0.13 | 0.14 | 0.13 | 2.03 | 2.19 | 1.51 |
| | of which: religious activities | 0.13 | 0.09 | 0.17 | 1.06 | 1.02 | 1.07 |
| 6. | Social life and entertainment | 1.10 | 1.10 | 1.10 | 1.35 | 1.38 | 1.33 |
| 7. | Sports and outdoor activities | 0.23 | 0.28 | 0.18 | 1.28 | 1.41 | 1.15 |
| 8. | Hobbies and games | 0.16 | 0.22 | 0.10 | 1.27 | 1.42 | 1.07 |
| 9. | Mass media | 2.50 | 3.03 | 2.39 | 3.01 | 3.14 | 2.48 |
| | of which: television | 2.17 | 2.31 | 2.04 | 2.32 | 2.47 | 2.18 |
| 10. | Travel and unspecified time use | 1.22 | 1.32 | 1.13 | 1.31 | 1.41 | 1.22 |

Source: author's compilation (GUS, 2005, pp. 129-131).


Table 5. The percentage of persons performing the activity in the 2003/2004 survey (age 15 years and over)

| No | Activity group | Males: weekday | Males: Saturday | Males: Sunday | Females: weekday | Females: Saturday | Females: Sunday |
|---|---|---|---|---|---|---|---|
| 1. | Personal care | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| | of which: sleeping | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| | of which: eating | 99.6 | 99.8 | 99.7 | 99.7 | 99.7 | 99.9 |
| 2. | Employment | 54.2 | 35.0 | 20.9 | 36.2 | 19.3 | 12.8 |
| 3. | Studying | 12.3 | 8.0 | 8.7 | 12.1 | 8.9 | 8.6 |
| 4. | Household and family care | 85.9 | 89.0 | 81.5 | 97.2 | 97.9 | 95.8 |
| | of which: shopping | 41.9 | 41.9 | 16.8 | 66.0 | 60.1 | 15.2 |
| | of which: childcare | 15.0 | 15.2 | 17.0 | 24.9 | 21.9 | 22.1 |
| 5. | Volunteer work and meetings | 16.8 | 23.3 | 56.8 | 25.2 | 30.0 | 69.9 |
| | of which: help to other households | 10.5 | 13.4 | 6.9 | 13.3 | 12.2 | 10.7 |
| | of which: religious activities | 6.4 | 10.0 | 52.7 | 15.7 | 19.9 | 65.5 |
| 6. | Social life and entertainment | 69.1 | 76.6 | 79.1 | 72.9 | 78.3 | 80.7 |
| 7. | Sports and outdoor activities | 24.0 | 30.5 | 43.5 | 20.3 | 25.0 | 37.9 |
| 8. | Hobbies and games | 20.5 | 23.4 | 26.6 | 14.8 | 15.7 | 18.7 |
| 9. | Mass media | 93.8 | 93.2 | 96.5 | 94.0 | 92.9 | 96.2 |
| | of which: television | 89.8 | 89.0 | 94.0 | 89.1 | 88.0 | 92.7 |
| 10. | Travel and unspecified time use | 92.1 | 89.8 | 90.3 | 89.9 | 86.5 | 89.2 |

Source: author's compilation (GUS, 2005, pp. 195-197).

Persons who were employed devoted 7 hours and 7 minutes per day to work, and those who studied devoted 5 hours per day to studying. Both the average time of work and the average time of study were higher for men than for women.


Among leisure activities, mass media were the most popular. The share of persons performing this activity was 94 percent and the average time was 3 hours. The respective figures for "watching television and video" were 90 percent and 2 hours and 32 minutes. All leisure activities (i.e. social life and entertainment, sports and outdoor activities, hobbies and games, mass media) lasted longer for men. Travel, the last group of activities specified, had a relatively high average time of activity performance, about one and a half hours.

The results of the 2003-2004 time use survey in Poland can be set alongside recent European time use survey (TUS) data (Table 6). In general, the average time per day spent on different activities did not differ much from the averages for the other European countries. However, for two groups of activities the values calculated for Poland differed significantly from those calculated for other countries. The first was "volunteer work and meetings", for which the average time in Poland was 28 minutes, at least twice as long as in any other country. The reason lies in the relatively long time devoted to informal help to other households and to religious activities, which is typical for Poles. The opposite relation could be observed for the group "sports, outdoor activities, hobbies and games". The average time per person in Poland was less than 40 minutes, while in other countries it was close to one hour or more. This probably had its source in different lifestyle patterns.

Table 6. Average time per day spent on different activities in European countries (per person participating in the survey)

Average activity duration (age groups in bracket) Slovenia France Finland Poland UK Estonia Hungary Norway Belgium Denmark Sweden (10 years Activity group (15 and (10 and (15 and (8 and (10 and (15-84 (10-79 (12-95) (16-74) (20-84) and over) over) over) over) over) years) years) over) in hours and minutes
Sleeping 8.34 7.53 9.03 8.38 8.08 8.37 8.40 8.49 8.43 8.36 8.15
Meals and personal care 2.40 2.48 3.01 2.03 2.22 2.27 2.10 2.11 2.24 2.09 1.54
Employment 2.02 3.27 2.34 2.35 3.17 2.41 2.40 2.22 2.33 2.42 2.59
Studying 0.43 0.37 0.30 0.35 0.16 0.33 0.40 0.30 0.30 0.44 0.36
Household work and family care 3.12 2.51 3.17 2.52 3.10 3.21 2.53 3.32 3.39 3.26 2.42
Volunteer work 0.09 0.13 0.13 0.14 0.12 0.28 0.10 0.14 0.10 0.08 0.09
Social life and entertainment 1.02 1.21 0.55 1.04 1.10 1.10 1.07 0.50 0.51 1.09 2.02
Sports, outdoor activities, hobbies and games 1.13 1.00 0.53 1.23 1.20 0.39 1.06 1.04 0.56 1.26 1.18
Television and video 2.18 1.58 2.07 2.16 1.53 2.17 2.26 2.27 2.44 2.01 1.57
Radio, music and reading 0.39 0.34 0.29 0.58 0.43 0.33 0.33 0.49 0.29 0.31 0.44
Travel 1.23 1.15 0.55 1.08 1.23 1.11 1.24 1.07 0.56 1.06 1.15

Source: Aliaga and Winqvist, 2003, p. 4; GUS, 2005, pp. 129-131.

7. Value of household work on the basis of the 2003/2004 survey

The value of household work in Poland in 2004 was calculated on the basis of the 2003/2004 time use survey results, using the market price method. Services within household production were grouped into four main categories according to Eurostat proposals (Eurostat, 1999). The categories were distinguished according to the principal household functions: accommodation, providing meals, clothing care, and care for children and adults. Volunteer work in the scope of domestic services was valued separately. The weekly value of household work for individuals was calculated by multiplying the average weekly duration of the activity (calculated separately for specified groups of persons and weighted according to the day of the week) by the average hourly remuneration for professions corresponding to the domestic services. The results of the estimation are presented in Table 7. In 2004 the total value of household services in Poland was 284.539 billion PLZ, or 30.9 percent of GDP in 2004. The greatest value was calculated for the group of services connected with providing meals.
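The valuation step described above (weekly hours multiplied by the hourly remuneration of a corresponding profession) can be sketched in a few lines of Python. This is only an illustration of the arithmetic under stated assumptions: the class name, the daily hours and the hourly wages are hypothetical and are not the figures used in the study, and the weekly duration is assumed to be built from five weekdays and two weekend days.

```python
from dataclasses import dataclass


@dataclass
class DomesticService:
    name: str
    weekday_hours: float   # hypothetical average hours per weekday
    weekend_hours: float   # hypothetical average hours per weekend day
    hourly_wage: float     # hypothetical wage of the corresponding profession

    def weekly_hours(self) -> float:
        # Day-of-week weighting: five weekdays and two weekend days per week.
        return 5 * self.weekday_hours + 2 * self.weekend_hours

    def weekly_value(self) -> float:
        # Market price method: time input valued at the market wage.
        return self.weekly_hours() * self.hourly_wage


# The four service categories named in the text, with made-up inputs.
services = [
    DomesticService("accommodation", 0.5, 1.0, 8.0),
    DomesticService("providing meals", 1.0, 1.5, 10.0),
    DomesticService("clothing care", 0.2, 0.5, 7.0),
    DomesticService("care for children and adults", 0.4, 0.5, 9.0),
]

for s in services:
    print(f"{s.name}: {s.weekly_hours():.1f} h/week, value {s.weekly_value():.2f}")

weekly_total = sum(s.weekly_value() for s in services)
print(f"total weekly value per person: {weekly_total:.2f}")
```

In the study itself such per-person values were computed separately for groups of persons and then aggregated to the national level; that aggregation is not reproduced here.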

Table 7. Estimation of housework in Poland in 2004, calculated with the monthly remuneration inflation rates (in millions PLZ)

| No | Group of domestic services | Total | Males total | Males working | Males non-working | Females total | Females working | Females non-working |
|---|---|---|---|---|---|---|---|---|
| 1. | Accommodation | 54 978 | 29 211 | 13 976 | 15 235 | 25 767 | 9 022 | 16 745 |
| 2. | Providing meals | 127 993 | 37 827 | 16 405 | 21 422 | 90 166 | 32 133 | 58 033 |
| 3. | Clothing care | 15 045 | 2 357 | 1 050 | 1 307 | 12 688 | 4 748 | 7 940 |
| 4. | Children and adults care | 67 733 | 21 560 | 14 985 | 6 575 | 46 173 | 19 387 | 26 786 |
| 5. | Volunteer work | 18 780 | 9 035 | 3 520 | 5 515 | 9 745 | 2 215 | 7 530 |
| 6. | Groups 1-4 jointly | 265 749 | 90 955 | 46 416 | 44 539 | 174 794 | 65 290 | 109 504 |
| 7. | Groups 1-5 jointly | 284 539 | 99 990 | 49 936 | 50 054 | 184 539 | 67 505 | 117 034 |

Source: Błaszczak-Przybycińska, 2005, p. 98.


8. Conclusions

Changes in the structure of time use are rather slow, so time use surveys, which are expensive and burdensome for respondents, are carried out rarely. According to Eurostat they should be carried out every 5-10 years, but in the statistical practice of many European countries the interval has been much longer. It is not possible to fix a methodology for time use surveys over such a long period, because of continuing social and economic development as well as progress in statistical methods. Constant changes in survey organization and methodology, however, make comparative analysis difficult, and empirical results of time use surveys should be used with caution for comparisons over time.

The survey must be adjusted to changes occurring in society. Sections connected with volunteer activity, bringing up children and care for the elderly should be substantially extended. Better information in these fields can help social policy makers take the right decisions. It is expected that time use surveys will help to answer many questions connected with social and economic problems, such as the estimation of the unregistered economy. Some changes in the classification of activities in time use surveys are expected in connection with the estimation of household production within satellite accounts. They would allow a better indirect estimation of the part of household production that is not registered in national accounts.

REFERENCES

ADAMCZUK L. (1990), Experience and Perspectives of Time Budget Surveys in Poland, in: Time Budget Surveys, GUS, PTS, vol. 37, Warsaw, pp. 22-30 (in Polish).
ALIAGA CH., WINQVIST K. (2003), How Women and Men Spend Their Time, Statistics in Focus, Theme 3-12/2003, Eurostat.
BŁASZCZAK-PRZYBYCIŃSKA I. (2005), Estimation of Housework on the Basis of Time Use Survey Data, in: GUS, Time Use Survey, 1st July 2003 – 31st May 2004, Statistical Studies and Analyses, Warsaw (in Polish).
GUS (1970), Time Use Survey of Employees in 1969, Polish Statistics, no 77 (in Polish).
GUS (1978), Time Budget of Inhabitants of Poland, Polish Statistics, Warsaw (in Polish).
GUS (1985), Time Budget of Inhabitants of Poland in 1984, part 1&2, Statistical Studies, Warsaw (in Polish).
GUS (1987), Analysis of Time Budget of Inhabitants of Poland in 1976 and 1984, Warsaw (in Polish).
GUS (1990), Time Budget Survey, Statistical News Library, vol. 37 (in Polish).
GUS (1998), Time Use Survey 1996, Statistical Studies and Analyses, Warsaw (in Polish).
GUS (2005), Time Use Survey 1st July 2003 – 31st May 2004, Statistical Studies and Analyses, Warsaw.
EUROSTAT (1999), Proposal for a Satellite Account of Household Production, Eurostat Working Papers, 9/1999/A4/11, Luxembourg.
GERSHUNY J. (1995), Draft proposal for the methodology of the European Time Use Survey, Statistics in Transition, vol. 2, no 4, pp. 517-527.
GERSHUNY J. (1995), Time budget research in Europe, Statistics in Transition, vol. 2, no 4, pp. 529-551.
HARVEY A. (1995), Emerging Needs for Time use Data, Statistics in Transition, vol. 2, no 4, pp. 513-515.
KORDOS J. (1985), Towards an Integrated System of Household Surveys in Poland, Bulletin of the International Statistical Institute, vol. 51, Book 2, Amsterdam, pp. 13-18.
KORDOS J. (1988), Time use surveys in Poland, Statistical Journal of the United Nations, Economic Commission for Europe, vol. 5, no 2, pp. 159-168.
KORDOS J. (1998), Social Statistics in Poland and its Harmonization with the European Union Standards, Statistics in Transition, vol. 3, no 4, pp. 617-639.
RYDENSTAM K. (1995), The Harmonized European Time Use Surveys, Statistics in Transition, vol. 2, no 4, pp. 553-581.
SŁABY T. (1998), Conclusions from the Polish Time Use Pilot Survey 1996, Statistics in Transition, vol. 3, no 4, GUS, Warsaw, pp. 743-756.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1361—1385

LABOUR FLOWS INTO AND OUT OF POLISH AGRICULTURE: A MICRO-LEVEL APPROACH

Hilary Ingham & Mike Ingham1

ABSTRACT

Notwithstanding its admission to the EU, agricultural restructuring and sustainable rural development remain as major transition challenges confronting Poland. Achieving these joint goals will necessitate major labour flows from farming into other occupations and sectors. This paper employs a multinomial logit model on Labour Force Survey data to analyse mobility in the agricultural labour market. Its major finding is that of a largely stagnant pool of farm workers into and out of which are small flows that are insufficient to bring about the requisite change without explicit, perhaps radical policy intervention.

Key words: Agriculture, Restructuring, Labour Market Transitions, Mobility

1. Introduction

The agricultural sector is probably the most idiosyncratic feature of the Polish economy. Small, semi-subsistence farms remained in private ownership under state socialism and have survived largely intact into the twenty-first century. While more than eight hundred thousand agricultural jobs have been lost since the start of the transition process, the decline is due purely to the collapse of the state farms. In contrast, employment in private sector farming has been static since the onset of economic reform (Ingham and Ingham, 2004). Thus Poland currently has in excess of four million individuals operating almost two million holdings, more than half of which occupy less than five hectares of land (GUS, 2003: 147 & 370).2

1 Hilary Ingham is in the Department of Economics, Lancaster University, UK, and Mike Ingham is Principal, Ansdell Consulting, UK.
2 From 2002, GUS began publishing two figures for the number employed in private farming. The new one is an estimate of the total excluding workers on subsistence and semi-subsistence farms and is approximately half of the headline figure reported in the text. Without similar adjustments for under-employed and hidden unemployed workers in other sectors of the economy, it is far from evident that this is a legitimate exercise.

1362 H. Ingham, M. Ingham: Labour flows and…

However, while jobs in agriculture accounted for almost 28 per cent of total employment in 2002, the sector contributed less than three per cent to GDP (GUS, op. cit: 147 & 584). What is more, private sector agricultural workers are amongst the nation’s poorest employees, receiving only two-thirds of the national average wage (GUS, op. cit.: 176). With the land devoted to farming concentrated in space, the sector represents a serious threat to the balanced development of the country. Furthermore, the per capita income of farming households was only eighty-six per cent of the national average (GUS, op. cit.: 203-4). In addition, the sector imposes a severe strain on the state budget, mainly as a result of the generous farmers’ retirement scheme (KRUS), over ninety per cent of which is financed from central funds (Krzyzanowska et al., 2002). In theory, a certain degree of equalisation across sectors and space in the years following 1989 might have been expected through the relocation of labour to higher paying jobs and of capital to exploit low wages. However, this has evidently not occurred to the extent that might have been foreseen. Moreover, Poland has received substantial funding from international agencies such as the European Union (EU), the World Bank, the EBRD and US Aid over the last decade, significant amounts of which have been earmarked for agriculture. The latter monies have been channelled into education and training, food quality and the marketing of agricultural produce. Once again though, the overall impact of the schemes has been limited and the sector continues to retard the modernisation of the economy. In the usual case, the failure of both the market and external assistance to stimulate change is attributed to impediments to mobility including poor infrastructure in the rural areas, imperfections in the housing market and the dearth of suitably skilled labour in farming communities (e.g. ILO, 1999). 
Poland has now entered the EU, but with an economic structure radically different from those of its partners. As of 2003, agriculture accounted for only 4.1 per cent of total employment in the then fifteen member states (Eurostat, 2004) and no other country that entered in 2004 has such a high concentration of jobs in farming. This raises difficult issues in two key areas of EU policy. First, economic and social cohesion is a primary goal espoused formally in the Amsterdam Treaty (EC, 1997), but one that proved elusive even in the context of the EU-15 (Ingham et al., 2002). Second, rural development is now officially the second pillar of the Common Agricultural Policy (CAP) and one that stresses the need for diversified employment structures and environmentally friendly farming practices. The clear imperative is therefore that Poland’s economy becomes less dependent on agriculture, with a significant proportion of the rural workforce transferring to higher value added activities. While fiscal transfers will flow from Community programmes to assist in this process, many will be contingent on matching domestic finance. This, however, creates problems for Poland’s obligations under the Stability Pact and for its preparations to enter the euro-zone.

The dilemma posed by this conflict of objectives reinforces the need for effective policy targeting. In this context, however, the profile of net job flows typically available from official statistics has severe limitations. In order to cast further light on the potential for restructuring Polish agriculture, this paper uses micro-data to examine the pattern of individual gross worker flows into and out of the sector. The identification of those characteristics associated with successful moves into non-agricultural employment made possible by the analysis could provide useful information for determining where limited resources might most effectively be channelled. Likewise, modelling flows into the sector might allow further insight into the nature of the agricultural sector’s role as a ‘buffer zone’, absorbing workers displaced from other sectors of the economy (Orlowski, 2002). In particular, it could provide a guide to the design of policies for the redeployment of those workers in industries that are uncompetitive within the European arena who are most at risk of slipping into hidden unemployment on farms.

The next section covers the preliminaries, including a brief explanation of the methodology used in the paper and a discussion of the survey data employed to examine the labour market transitions that occupy the remainder of the work. The third section highlights the movements between different labour market states that occurred during the sample period and also describes the characteristics of the individuals involved. This overview is formalised in the fourth section, which reports the results of the multinomial regressions used to identify the factors that are important determinants of movements into and out of agriculture. The practical implications of the results are highlighted in section five, which focuses on the agricultural exit probabilities of workers with particular characteristic vectors. The paper concludes with a summary discussion and certain policy recommendations.

2. Transition Rates and the Labour Force Survey

This section provides the building blocks for the analysis to follow. It first describes the transition matrix to be studied and then summarises the data to be analysed.

2.1 Transition Rates

The work identifies four mutually exclusive, exhaustive labour market states: working in agriculture (EA), working in a non-agricultural sector (E), unemployed (U) and economically inactive (N). The transition probabilities associated with movement between these states are based on the standard Markovian process described by Toikka (1976), which describes labour market flows between t0 and t1 in the following manner:


$$
\begin{pmatrix}
EA_{t_0}EA_{t_1} & EA_{t_0}E_{t_1} & EA_{t_0}U_{t_1} & EA_{t_0}N_{t_1} \\
E_{t_0}EA_{t_1} & E_{t_0}E_{t_1} & E_{t_0}U_{t_1} & E_{t_0}N_{t_1} \\
U_{t_0}EA_{t_1} & U_{t_0}E_{t_1} & U_{t_0}U_{t_1} & U_{t_0}N_{t_1} \\
N_{t_0}EA_{t_1} & N_{t_0}E_{t_1} & N_{t_0}U_{t_1} & N_{t_0}N_{t_1}
\end{pmatrix}
$$

where each cell in the matrix represents the number of people moving from one state to another. In the case of outflows, the probability of making any transition is given by the number of individuals in the flow divided by the number in the state of origin. For example, $EA_{t_0}E_{t_1}/EA_{t_0} = ea_{t_0}e_{t_1}$ is the probability of moving from a job in agriculture to a job in another sector between t0 and t1, giving a transition probability matrix:

$$
\begin{pmatrix}
ea_{t_0}ea_{t_1} & ea_{t_0}e_{t_1} & ea_{t_0}u_{t_1} & ea_{t_0}n_{t_1} \\
e_{t_0}ea_{t_1} & e_{t_0}e_{t_1} & e_{t_0}u_{t_1} & e_{t_0}n_{t_1} \\
u_{t_0}ea_{t_1} & u_{t_0}e_{t_1} & u_{t_0}u_{t_1} & u_{t_0}n_{t_1} \\
n_{t_0}ea_{t_1} & n_{t_0}e_{t_1} & n_{t_0}u_{t_1} & n_{t_0}n_{t_1}
\end{pmatrix}
$$

These probabilities represent one of the subjects of analysis in what follows. In this framework, the possible ‘outcomes’ (labour market transitions) remain the same from trial to trial, are finite in number, and have probabilities that depend only on the outcome of the previous trial. When inflows are under scrutiny, the foregoing approach must be modified slightly. In particular, the numbers in the destination state at t1 form the denominator of each probability. As such, it is the columns rather than the rows of the matrix that sum to unity.
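The two normalisations described above can be illustrated with a minimal Python sketch. The counts are hypothetical and chosen only for illustration, and the function names are not taken from the paper. Outflow probabilities divide each cell by its row total (the stock in the origin state at t0), whereas inflow probabilities divide by the column total (the stock in the destination state at t1).

```python
# Hypothetical counts of people moving between labour market states.
# Rows: state at t0; columns: state at t1; order EA, E, U, N.
STATES = ["EA", "E", "U", "N"]

counts = [
    [900, 30, 20, 50],
    [10, 1800, 90, 100],
    [15, 120, 250, 115],
    [20, 60, 40, 1880],
]


def outflow_probabilities(flows):
    """Divide each cell by its row total, so that every row sums to one."""
    return [[cell / sum(row) for cell in row] for row in flows]


def inflow_probabilities(flows):
    """Divide each cell by its column total, so that every column sums to one."""
    column_totals = [sum(column) for column in zip(*flows)]
    return [
        [cell / column_totals[j] for j, cell in enumerate(row)]
        for row in flows
    ]


outflow = outflow_probabilities(counts)
inflow = inflow_probabilities(counts)

for state, row in zip(STATES, outflow):
    print(state, [round(p, 4) for p in row])
```

The same matrix of counts therefore yields two different probability tables, depending on whether the question concerns where people went or where they came from.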

2.2 The Polish Labour Force Survey

Poland has conducted a quarterly Labour Force Survey (LFS) since May 1992. Between May 1992 and February 1999 the Survey was conducted during a reference week that included the fifteenth day of the middle month of the quarter. The next survey was not until QIV 1999 and since then interviewing has taken place on a continuous basis, with one-thirteenth of the sample of dwellings being surveyed in each week of the quarter. Its design is similar to those conducted in European countries and it samples in excess of fifty thousand people aged 15 or more. The sample remained fixed for the first four surveys but, since the second quarter of 1993, it has been selected via a rotation system, which is divided into four rotation groups known as e-samples. In any given quarter, the LFS consists of two e-samples introduced in the previous period, one new one and one introduced one year previously. This means that each e-sample is included in the survey for two quarters, discarded for two and then returned for two more quarters. Subsequently, the e-sample is not used again.

The sampling procedure adopted generates both a quarterly and an annual panel, with the focus of this paper being on the latter. Two reasons underlie this choice. First, yearly panels are more suitable when people change their labour market status infrequently. Second, the use of a quarterly panel to investigate flows into and out of agricultural employment introduces seasonal influences into the data. For example, there were almost two hundred and fifty thousand fewer workers on private agricultural holdings in the rural areas of Poland in November than in August 1998 (GUS, 2002: 60).

On the other hand, yearly panels are susceptible to round tripping, since individuals who leave their origin state but return to it again within the year are recorded as non-movers. Using the constant sample available for the first four surveys, Góra and Lehmann (1995) were able to estimate the bias this introduced into the data. Their results indicated significant round tripping by the unemployed: almost one-quarter of those who were in this origin state and who exited it at some point during the year re-entered unemployment by the end of the twelve-month period. However, they found no evidence of significant round tripping by those in other labour market states.

The period analysed in the current instance runs from February 1998 to February 1999, the last produced prior to the introduction of continuous sampling. In the former LFS, 54.4 thousand individuals living in 21.7 thousand households were interviewed and the annual panel produced 25,208 usable responses, implying an attrition rate of less than five per cent (GUS, 1999).
In terms of labour market status, two points of definition are central when interpreting the results of the analysis. The first is that an individual is enumerated as being in employment according to the standard International Labour Organisation convention, which means that they are considered to be employed if they either worked for at least one hour during the reference week or if they formally held a job even if they did not work. This definition differs from that adopted by the European Community Household Panel (the base survey for its LFS), which only classifies individuals as employed if they work a minimum of 15 hours (Eurostat, 1999). Second, the survey records an individual as being employed in agriculture if this is the sector in which they hold their ‘primary’ job, which is the job from which they derive the largest part of their income. Adopting this rule gave Poland an agricultural workforce of 2.9 million in February 1998 (GUS, 1998: 20), implying that the country has 1.7 million farmers for whom agriculture is either a secondary source of employment or a ‘hobby’ (GUS, 1999a).


3. Labour market transitions

As of February 1998, the panel to be analysed exhibited an activity rate of 55.4 per cent, which compares with the full survey figure of 57.1 per cent (GUS, 2002: 21). This implies that those individuals who are out of the labour force are slightly over-represented. The figures for the weight of agriculture also differ a little, with 21.8 per cent of total employment in the panel being in farming compared with 19.0 per cent overall (ibid.: 98). The panel and aggregate unemployment rates were, however, similar: 11.4 and 11.1 per cent, respectively (ibid.: 21).

Of course, the prospects for agricultural restructuring and the associated reallocation of workers to other sectors of the economy depend in large part on the prevailing macroeconomic climate. In this regard, it might be noted that the annual average LFS unemployment rate reached its lowest recorded level of 10.6 per cent in 1998. However, in subsequent years, there was a significant deterioration, with figures of 13.9, 16.1, 18.3, 19.9 and 19.7 per cent being recorded in the years 1999-2003, respectively. Furthermore, this evolution occurred in spite of an economic activity rate that declined continuously throughout. The prospects of moving out of agriculture might therefore have been better during the sample period than at any other time during Poland’s current epoch.

The gross flows presented in Table 1 show the probability of an individual being in a particular labour market state in 1999, contingent upon their status in 1998. Over the period in question, the recorded status of the majority of individuals did not change, with approximately ninety per cent of the employed, either in agriculture or elsewhere, and the economically inactive in 1998 being in the same state in 1999. The unemployed were the most mobile individuals, with almost half changing their labour market status over the period. It should be noted, however, that seventeen per cent of those without work in 1998 subsequently left the labour force. Although there are differences, these aggregate findings are broadly in line with those reported in Góra and Lehmann (op. cit.) for Poland and with the results for 1980s Britain found by Wadsworth (1989), but they differ significantly from the findings of Bellmann et al. (1995) for the East German labour market. The latter authors found considerably higher transition probabilities, although their period of analysis coincided with a major shake-out of labour, primarily from the state-owned industries, and therefore the difference in the results is unsurprising.


Table 1. Transition probabilities: Outflows

| Status at t0 \ Status at t1 | EA | E | U | N | Stock at t0 |
|---|---|---|---|---|---|
| EA | 0.9118 | 0.0245 | 0.0137 | 0.0500 | 2,698 |
| E | 0.0039 | 0.9244 | 0.0331 | 0.0386 | 9,681 |
| U | 0.0295 | 0.2599 | 0.5411 | 0.1695 | 1,593 |
| N | 0.0072 | 0.0326 | 0.0229 | 0.9373 | 11,236 |

Note: The elements in this table represent the probability that a member of any origin state i moved to terminal stock j. Subject to rounding errors each row of the table sums to 1.

Source: Labour Force Survey, February 1998 and 1999, own calculations.
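As a rough cross-check, the panel rates quoted at the start of this section can be recovered from the stocks at t0 in Table 1. The sketch below assumes the conventional definitions (activity rate = labour force over all respondents, unemployment rate = unemployed over labour force, agricultural share = agricultural over total employment); the paper does not spell these formulas out, and the variable names are illustrative only.

```python
# Stocks at t0 (February 1998) as reported in Table 1.
stock_t0 = {"EA": 2698, "E": 9681, "U": 1593, "N": 11236}

total = sum(stock_t0.values())                    # all panel respondents
employed = stock_t0["EA"] + stock_t0["E"]         # agricultural + other employment
labour_force = employed + stock_t0["U"]           # employed + unemployed

# Assumed standard definitions, not formulas quoted from the paper.
activity_rate = labour_force / total
unemployment_rate = stock_t0["U"] / labour_force
agriculture_share = stock_t0["EA"] / employed

print(f"Panel size:         {total}")
print(f"Activity rate:      {activity_rate:.1%}")
print(f"Unemployment rate:  {unemployment_rate:.1%}")
print(f"Agriculture share:  {agriculture_share:.1%}")
```

Under these assumptions the stocks sum to the 25,208 usable responses and reproduce the 55.4, 11.4 and 21.8 per cent figures reported for the panel.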

Labour flows out of agriculture are given in the first row of the Table and they reveal that more than ninety per cent of those individuals in the sample who were working in the sector at the start of the period were still there one year later. In comparison with the work of Góra and Lehmann (op. cit.), who found that approximately eighty-three per cent of farm workers in the two panels they analysed did not change their labour market status, this suggests that mobility out of the sector actually declined during the nineteen-nineties. Less than two and one-half per cent of agricultural workers in the current panel succeeded in securing employment in another sector of the economy, while five per cent withdrew from the labour force and just over one per cent became unemployed. The latter finding will be driven, at least in part, by the unemployment benefit regulations prevailing under the provisions of the 1994 Act on Employment and Counteract[ing] Unemployment. These determined that any individual who either owned agricultural real estate or was working on a family holding in excess of two hectares, albeit without receiving an explicit wage, was ineligible for unemployment benefit (GUS, 1999b).

In contrast, Bellmann et al. (op. cit.) found that forty-five per cent of agricultural workers in the former East Germany left the sector during 1990-91 and, of these, approximately half found jobs elsewhere, twenty-seven per cent left the labour force, eighteen per cent became unemployed and approximately six per cent joined a government-funded programme. The magnitude of this exodus is explained by the collapse of the state farms that dominated agricultural production. The same fate also befell Poland’s state sector (Ingham and Ingham op. cit.), but its overall significance was greatly reduced because of the importance of private sector farming.¹

The inflow probabilities for the current sample, where these are conditioned on status in t1, are given in Table 2 and they again reflect low levels of labour market mobility. As with outflows, the unemployed are the most mobile group; twenty-two per cent had been in non-agricultural employment one year earlier, while seventeen per cent had been economically inactive. Of those who were working in agriculture in February 1999, three per cent were previously economically inactive, almost two per cent were unemployed and one and one-half per cent were in employment in another sector of the economy. The latter finding is somewhat at variance with the popular notion that agriculture was absorbing excess labour which was being discarded by other sectors of the economy at the end of the nineteen-nineties.

1 Poland’s first Labour Force Survey was not conducted until May 1992, by which time most state farms had already collapsed, so it is not possible to produce directly comparable results.

Table 2. Transition probabilities: Inflows

| Status at t0 \ Status at t1 | EA | E | U | N | Stock at t1 |
|---|---|---|---|---|---|
| EA | 0.9368 | 0.0067 | 0.0251 | 0.0119 | 2,626 |
| E | 0.0145 | 0.9136 | 0.2168 | 0.0331 | 9,795 |
| U | 0.0179 | 0.0423 | 0.5840 | 0.0239 | 1,476 |
| N | 0.0308 | 0.0374 | 0.1741 | 0.9311 | 11,311 |

Note: The elements in this table represent the probability that a member of the terminal stock in any state j originated in stock i. Subject to rounding errors each column sums to 1. Source: Labour Force Survey, February 1998 and 1999, own calculations.
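The link between the outflow and inflow tables can be sketched in a few lines of Python. The sketch multiplies each row of the Table 1 probabilities by its stock at t0 to recover approximate flows, sums the columns to obtain stocks at t1, and then renormalises by column. Because the published probabilities are rounded to four decimals, the results only approximate Table 2; the variable names are illustrative and not taken from the paper.

```python
STATES = ["EA", "E", "U", "N"]

# Outflow probabilities and stocks at t0 as reported in Table 1.
outflow = [
    [0.9118, 0.0245, 0.0137, 0.0500],
    [0.0039, 0.9244, 0.0331, 0.0386],
    [0.0295, 0.2599, 0.5411, 0.1695],
    [0.0072, 0.0326, 0.0229, 0.9373],
]
stock_t0 = [2698, 9681, 1593, 11236]

# Approximate flows: probability of the move times the stock in the origin state.
flows = [[p * stock_t0[i] for p in row] for i, row in enumerate(outflow)]

# Stock in each destination state at t1 is the corresponding column total.
stock_t1 = [sum(column) for column in zip(*flows)]

# Inflow probabilities: share of each destination stock coming from each origin.
inflow = [
    [flows[i][j] / stock_t1[j] for j in range(len(STATES))]
    for i in range(len(STATES))
]

for state, stock in zip(STATES, stock_t1):
    print(f"Approximate stock in {state} at t1: {stock:.0f}")
```

Up to rounding, the derived stocks and column-normalised probabilities agree with the figures in Table 2.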

4. Modelling labour market transitions

The rate of exit from Polish agriculture is too slow to satisfy the evident need for the modernisation of the country’s rural economy, and the sheer numbers involved mean that there will be no simple short run remedy. Nevertheless, this section seeks to specify and test a formal multinomial logit model of the factors influencing the probability that an individual will undergo a particular labour market transition. The transitions of interest are those into and out of agriculture from or to the other three labour market states identified here. To the extent that systematic relationships are apparent, they may serve to inform the policy design process. Individuals still recorded as working in agriculture in 1999, having been similarly enumerated in 1998, form the base group.


4.1 The general multinomial logit

The underlying logit model is:

$$
\Pr(Y = j \mid x_i) = \frac{e^{\beta_j' x_i}}{\sum_{k=0}^{J} e^{\beta_k' x_i}}, \qquad j, k = 0, 1, \ldots, J
$$

where j = 0, 1, …, J represents the possible labour market transitions, $x_i$ is a vector of relevant independent variables measured at t0 and $\beta_j$ is the unknown parameter vector. However, the model is indeterminate in this most general form because defining $\beta_j^*$ as $\beta_j + q$, for any vector q, and then re-computing the probabilities yields an identical set of results (Greene, 2003: 721). Common practice therefore invokes the normalisation that $\beta_0 = 0$ and the probabilities become:

$$
\Pr(Y = j \mid x_i) = \frac{e^{\beta_j' x_i}}{1 + \sum_{k=1}^{J} e^{\beta_k' x_i}}
$$

for j = 1, 2, …, J and $\beta_0 = 0$.

The log-likelihood for the sample is found by deriving, both for each individual i and for each of the J+1 possible transitions, the variable $d_{ij}$ which takes the value 1 if transition j is made by a particular individual and 0 if it is not. Since any individual can only be observed to make one of the possible transitions, only one of the $d_{ij}$’s will be 1 for each observation in the sample. This gives a log-likelihood function:

$$
\ln L = \sum_{i=1}^{n} \sum_{j=0}^{J} d_{ij} \ln \Pr(Y_i = j)
$$

from which the parameter estimates are generated using an iterative maximum likelihood procedure. Interpretation of the coefficients in the multinomial regression is not straightforward and recourse is often made to the marginal effects of the characteristics on the probabilities. These are normally calculated at the mean values of the regressors and are given by:

$$
\frac{\partial P_j}{\partial x_i} = P_j \left[ \beta_j - \sum_{k=0}^{J} P_k \beta_k \right] = P_j \left[ \beta_j - \bar{\beta} \right]
$$
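The normalised probabilities, the log-likelihood and the marginal effects can be sketched as follows. This is a minimal illustration under the $\beta_0 = 0$ normalisation, with hypothetical coefficient vectors and regressors; it is not the estimation routine used in the paper, and all function names are assumptions made for the example.

```python
import math


def mnl_probabilities(betas, x):
    """Probabilities for outcomes 0..J; `betas` holds the J vectors for
    outcomes 1..J, while the base outcome 0 has beta_0 = 0."""
    scores = [0.0] + [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    shift = max(scores)                       # guards against overflow
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(betas, observations, outcomes):
    """Sum of ln Pr(Y_i = j) over individuals, j being the observed outcome."""
    return sum(
        math.log(mnl_probabilities(betas, x)[j])
        for x, j in zip(observations, outcomes)
    )


def marginal_effects(betas, x):
    """P_j * (beta_j - sum_k P_k beta_k), one vector per outcome j = 0..J."""
    probs = mnl_probabilities(betas, x)
    all_betas = [[0.0] * len(x)] + list(betas)
    beta_bar = [
        sum(p * beta[m] for p, beta in zip(probs, all_betas))
        for m in range(len(x))
    ]
    return [
        [p * (beta[m] - beta_bar[m]) for m in range(len(x))]
        for p, beta in zip(probs, all_betas)
    ]


# Hypothetical example: J = 3 transitions, a constant and one regressor.
betas = [[-1.0, 0.5], [-2.0, 0.1], [0.3, -0.4]]
x = [1.0, 2.0]
probs = mnl_probabilities(betas, x)
effects = marginal_effects(betas, x)
sample_ll = log_likelihood(betas, [[1.0, 2.0], [1.0, 0.5]], [1, 0])
```

Because the probabilities must sum to one, the marginal effects of any regressor sum to zero across the J+1 outcomes, which is one way of seeing why the sign of an individual coefficient need not match the sign of its marginal effect.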


However, the current exogenous variable set contains mainly categorical elements for which such measures are meaningless. For example, a one per cent increase in self-employment is not possible; an individual either does, or does not, work on their own account. Also, unlike the results for a standard regression model, for any particular $x_i$, $\partial P_j / \partial x_i$ will not necessarily have the same sign as $\beta_{jk}$ in the multinomial logit because every sub-vector of β enters every marginal effect, both through the probabilities and through the weighted average. An alternative approach is to interpret the results in the light of the J log-odds ratios:

$$
\ln\left(\frac{P_{ij}}{P_{ik}}\right) = x_i'(\beta_j - \beta_k)
$$

which equals $x_i'\beta_j$ if k = 0. If this odds ratio is specified in levels, as opposed to its natural logarithm, the model becomes a multiplicative one, with terms $e^{x_i'\beta_j}$. This means that $e^{\beta_j}$ is the factor¹ by which the odds change when the ith variable increases by one unit. If $\beta_j$ is positive this factor will be greater than one and if $\beta_j$ is negative it will be less than one. These values will be reported along with the parameter estimates in the applications to follow.
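A short sketch may help to fix the odds-factor interpretation. It exponentiates coefficients of the kind reported in Table 3 and, following the footnote on interaction terms, sums the coefficients of a dummy and its interaction before exponentiating, which is equivalent to multiplying the separate factors. The helper name is hypothetical and, because the published coefficients are rounded to three decimals, the results match the tabulated Exp(β) values only approximately.

```python
import math


def odds_factor(*coefficients):
    """Factor by which the odds (relative to the base transition) change
    when every variable attached to the given coefficients rises by one."""
    return math.exp(sum(coefficients))


# Vocational education, EA→E equation of Table 3 (coefficient 0.948).
vocational = odds_factor(0.948)

# Female dummy alone, EA→U equation (women aged 45 or over).
female_45_plus = odds_factor(1.227)

# Female plus the Female*aged < 45 interaction, EA→U equation: the two
# coefficients are added, i.e. the two exponentials are multiplied.
female_under_45 = odds_factor(1.227, -2.289)

print(f"Vocational education: {vocational:.3f}")
print(f"Female, 45 or over:   {female_45_plus:.3f}")
print(f"Female, under 45:     {female_under_45:.3f}")
```

For the interacted case the relevant factor is below one even though the Female coefficient on its own is positive, which is why the product of the affected exponentials, rather than a single Exp(β), has to be read off.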

4.2 Model Specification

Most of the exogenous variables included in the initial specification of the model are self-explanatory, with precise definitions provided in the Data Appendix, although some require elaboration. The first are the employment status measures, for which three dummy variables are included in the empirical specification. The first is Self-employed, which identifies individuals working on their own account, and the second is Employed, which identifies persons working for a public or private employer and receiving remuneration. In addition, an interaction term State employee is included that identifies those employees who work in the state sector. The base group is composed of unpaid family workers, defined in the LFS as people working without pay in an economic enterprise operated by a related person living in the same household. In the panel utilised, approximately twenty per cent of the sample working in agriculture in 1998 were in this category.²

1 This holds only if the ith variable does not enter any interaction terms; where it does, the product of the affected exponentials is required.

2 The true cost of the workers concerned, who worked an average of 26 hours per week, can hardly be assumed to be zero.


The regional indicators (Tiers 1–4) are designed to account for differing economic conditions across regions. Intuitively, the spatial indicator would be a set of regional dummy variables, but as Poland had 49 voivodships at the time the panel was first observed, and to which the locational measures relate, some degree of aggregation was necessary. One possibility would have been to gather the regions into predetermined categories, such as ‘heavily industrial’, ‘diversified’, ‘agricultural’ etc., in line with previous work by Góra and Lehmann (op. cit.) and Scarpetta and Huber (1995). However, this procedure is open to more or less subjective assignments and the alternative adopted here was to use cluster analysis to group the voivodships according to a number of major economic indicators, which are presented in the Data Appendix along with the resulting grouping. The technique adopted (SAS FASTCLUS) is a non-hierarchical procedure that produced clusters of regions such that the similarity within and the dissimilarity between the groups was maximised, as described more fully in Ingham and Ingham (2002).

The analysis produced an optimal solution of four clusters. Of these, the first (Tier 1) had only a single member, Warszawskie, the region that included the capital city. A second small cluster (Tier 2) identified four more voivodships (Gdańskie, Katowickie, Krakowskie and Poznańskie) that housed major cities. The remaining regions were approximately evenly divided between the other two clusters, of which Tier 4 voivodships had noticeably lower GDPs per capita than those in Tier 3 (GUS/US, 1999), were more agricultural and were located more in the east and the south east of the country. In addition, the voivodship clusters are also incorporated into the Peripheral variable, which measures the straight-line distance from the capital of the voivodship in which the individual lived to the capital of their nearest Tier 1 or Tier 2 voivodship.
The inclusion of this variable was designed to capture the fact that even if an individual lived in a region where labour market opportunities were poor, proximity to one of the more advanced voivodships might have been expected to increase their opportunity set.

5. Results

The multinomial logit results for outflows from and inflows to agriculture are presented in Tables 3 and 4, respectively. Taking those who were employed in farming at both the initial and the terminal observation points as the base group leaves the three transitions reported there. Two columns are presented for each state change: the first set of coefficient values are the change in the log odds associated with a one-unit change in the independent variable to which they are attached, while the second are the terms $e^{\beta_i}$, representing the factor by which the odds change when the ith variable increases by one unit. The actual specifications reported in the tables represent the best fitting alternatives following the exclusion of poorly determined variables to reduce the number of null cells in order that parameter estimates could be obtained.


Two statistics are used to test for parameter significance. The first is the Wald test that is applied to the parameter estimates in each equation individually. However, this statistic has a tendency to fail to reject the null hypothesis when coefficient values are ‘large’ (Hauck and Donner, 1977). The alternative is a likelihood ratio test based on the difference in the value of the model’s likelihood function when each variable is removed in turn, a test that examines the significance, or otherwise, of the parameter estimates for the model as a whole, not just for those in individual equations. With ‘large’ samples the two tests are equivalent (Rao, 1973).
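The mechanics of the two tests can be sketched briefly. The example assumes that the bracketed figures in Table 3 are Wald statistics with one degree of freedom, which it compares with the standard chi-square critical values of 3.841 (5%) and 2.706 (10%); it also forms a likelihood ratio statistic as the difference between two reported −2 log likelihood values. Function names are illustrative, and no p-values or degrees of freedom for the likelihood ratio test are asserted here.

```python
# Critical values of the chi-square distribution with one degree of freedom.
CHI2_1_CRITICAL = {"5%": 3.841, "10%": 2.706}


def wald_statistic(coefficient, standard_error):
    """Square of the ratio of a coefficient to its standard error."""
    return (coefficient / standard_error) ** 2


def significance_level(statistic):
    """Strictest of the two levels at which a 1-df statistic rejects the null."""
    if statistic >= CHI2_1_CRITICAL["5%"]:
        return "5%"
    if statistic >= CHI2_1_CRITICAL["10%"]:
        return "10%"
    return "not significant"


def likelihood_ratio_statistic(minus2ll_restricted, minus2ll_full):
    """Difference in -2 log likelihood between restricted and full models."""
    return minus2ll_restricted - minus2ll_full


# Bracketed values taken from the EA→E and EA→N columns of Table 3.
examples = {
    "Constant (EA→E)": 5.78,
    "Unemployment rate (EA→E)": 2.78,
    "Constant (EA→N)": 0.27,
}
levels = {name: significance_level(value) for name, value in examples.items()}

# Intercept-only versus final outflow model, as reported under Table 3.
lr_outflows = likelihood_ratio_statistic(2000.198, 1679.909)
```

The Wald comparison works equation by equation, whereas the likelihood ratio statistic refers to the fit of the model as a whole.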

5.1 Outflows

The multinomial regression results for outflows are presented in Table 3, with each of the three pairs of columns relating to one of the possible transitions out of farming. The model correctly predicts over ninety per cent of observations and the likelihood ratio test rejects the joint hypothesis that all of the β coefficients are equal to zero. In addition, the Nagelkerke pseudo R² statistic indicates that the model explains approximately twenty per cent of the variation in the outcome variable.¹

1 The Nagelkerke R² is a modification of the Cox and Snell R² and is preferable as a diagnostic as the latter measure can never equal one. The final statistic, McFadden’s R², is the proportion of the kernel of the log likelihood explained by the model.


Table 3. Multinomial estimates of outflows from agriculture

| Specification | EA→E | Exp(β) | EA→U | Exp(β) | EA→N | Exp(β) |
|---|---|---|---|---|---|---|
| Constantᵃ | -4.396ᵇ (5.78) | - | -10.815ᵇ (12.76) | - | -0.485 (0.27) | - |
| Ageᵃ | 0.179 (2.91) | 1.196 | 0.125 (0.74) | 1.134 | -0.086ᵇ (6.31) | 0.917 |
| Age squaredᵃ | -0.003ᵇ (4.47) | 0.997 | -0.003 (1.70) | 0.997 | 0.001ᵇ (10.40) | 1.001 |
| Peripheralᵃ | 0.005 (0.13) | 1.005 | 0.114ᵇ (9.15) | 1.121 | 0.009 (1.02) | 1.009 |
| Peripheral squaredᵃ | 0.000 (0.42) | 1.000 | -0.001ᵇ (8.29) | 0.999 | 0.000 (1.73) | 1.000 |
| Unemployment rateᵃ | -0.054ᶜ (2.78) | 0.947 | 0.073 (2.11) | 1.076 | -0.008 (0.11) | 0.992 |
| Femaleᵃ | -0.818 (1.08) | 0.441 | 1.227ᶜ (3.32) | 3.412 | 0.549ᵇ (6.70) | 1.731 |
| Female*aged < 45ᵃ | 0.509 (0.39) | 1.664 | -2.289ᵇ (7.41) | 0.101 | -0.891ᵇ (6.16) | 0.410 |
| Employeeᵃ | 1.031ᵇ (4.97) | 2.803 | 2.542ᵇ (14.33) | 12.710 | 0.567 (2.20) | 1.762 |
| Self employedᵃ | -0.652ᶜ (3.36) | 0.521 | -0.862 (1.64) | 0.422 | -0.607ᵇ (8.21) | 0.545 |
| State employeeᵃ | -1.344ᵇ (3.87) | 0.261 | -0.207 (0.19) | 0.813 | -2.377ᵇ (5.08) | 0.093 |
| Vocational educationᵃ | 0.948ᵇ (8.24) | 2.579 | 0.994ᵇ (5.11) | 2.701 | -0.156 (0.45) | 0.855 |
| Ruralᵃ | -1.206ᵇ (10.97) | 0.299 | -0.010 (0.00) | 0.990 | -1.068ᵇ (12.19) | 0.344 |
| Tier 4 region | 0.738ᵇ (5.12) | 2.092 | 0.105 (0.06) | 1.111 | 0.158 (0.48) | 1.171 |

N: 2698
Pseudo R²: Cox & Snell 0.112; Nagelkerke 0.209; McFadden 0.155
Correct predictions: 91%
-2 Log likelihood: Intercept only 2000.198; Final model 1679.909

Notes:
1. The model parameter significance tests are based on the change in the value of -2 log likelihood if the effect is removed from the final model. The ‘a’ superscript indicates that the null hypothesis is rejected at the 5% level.
2. The individual parameter significance tests for each of the β vectors are based on the Wald statistic which is equal to the square of the ratio of a coefficient to its standard error for variables with a single degree of freedom; the ‘b’ superscript indicates that the null hypothesis is rejected at the 5% level. Coefficients with a ‘c’ superscript are significant at the 10% level.


The first column of the Table contains the results for those individuals who managed to secure employment outside agriculture. The results for age and sex are what might be expected; men were more likely to make this transition than women, while older women were least likely of all to move to other employment. In fact, the probability of finding non-agricultural employment declined at an increasing rate beyond the age of thirty. The location variables included in the final model indicate, surprisingly, that living in a Tier 4 region enhanced an individual’s chance of moving into alternative employment, whereas residence in a rural area retarded it. Also, as the distance of region of residence from a Tier 1 or Tier 2 area increased, so did the probability of gaining other work, although the effect was weak. However, the chances of such an exit declined as the prevailing unemployment rate increased. Similarly, the self-employed and those still employed by the state were less likely to secure alternative employment than were paid employees in the private sector. Finally, vocational education acted as a significant positive determinant of the probability of successfully leaving agriculture to work elsewhere. Indeed, such individuals were two and one-half times more likely to do so than others. It should be noted that it was not possible to include a full set of educational dummies because of the four-way split for the independent variable. This resulted in a large number of zero cells caused, in part, by the scarcity of individuals with higher and post-secondary education working in farming. Although the overall chances were very low, older women, people in paid private sector employment, those with vocational education and those living up to fifty-seven miles from a more advanced voivodship were more likely than others to flow into unemployment. 
In the latter two cases, it is tempting to surmise that such people were more confident that full-time job search would yield positive returns. In the case of those with vocational education, this conjecture might appear to be supported by their relatively high chance of moving directly into other employment. At the same time, higher jobless rates and residence in a Tier 4 region were associated with higher unemployment inflows from farming. Conversely, young women, those living in rural areas, the self-employed and those working in the state sector were less likely to become unemployed. In addition, the coefficients on the age variables indicate that the likelihood of moving from farming to the unemployment pool declined beyond the age of twenty and was negligible for those over fifty.

The probabilities of flowing from agriculture into inactivity were greatest for private sector employees and older women. The latter of these findings is perhaps a little surprising in view of the distribution of childcare responsibilities over the life cycle. However, it would be remiss to discount the possibility that discriminatory forces in the labour market were also at work. Those in Tier 4 and peripheral regions also exhibited higher propensities to leave farming and move out of the labour market, while the coefficients on the age variables indicate that this transition became increasingly likely beyond the age of forty-three. On the other hand, the self-employed, state employees, those living in rural areas and those with vocational education all faced lower probabilities of moving out of the labour market than others.
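The ages and distance quoted in this subsection appear to correspond to the turning points of the quadratic terms in Table 3. For a log-odds contribution of the form β₁z + β₂z², the stationary point lies at z = −β₁/(2β₂). The sketch below applies that formula to the rounded published coefficients, so the results are approximate, and it refers to log odds relative to the base group rather than to probabilities directly. The helper name is hypothetical.

```python
def turning_point(linear, quadratic):
    """Stationary point of linear*z + quadratic*z**2, i.e. -linear / (2*quadratic)."""
    return -linear / (2.0 * quadratic)


# (linear, quadratic) coefficient pairs as printed in Table 3.
table3_pairs = {
    "Age, EA→E": (0.179, -0.003),
    "Age, EA→U": (0.125, -0.003),
    "Age, EA→N": (-0.086, 0.001),
    "Peripheral (miles), EA→U": (0.114, -0.001),
}

turning_points = {
    label: turning_point(linear, quadratic)
    for label, (linear, quadratic) in table3_pairs.items()
}

for label, value in turning_points.items():
    print(f"{label}: {value:.1f}")
```

The results are close to the thresholds of thirty, twenty and forty-three years and fifty-seven miles mentioned in the text, although the three-decimal rounding of the squared terms means that small differences are to be expected.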

5.2 Inflows

Table 4. Multinomial estimates of inflows to agriculture

| Specification | E→EA | Exp(β) | U→EA | Exp(β) | N→EA | Exp(β) |
|---|---|---|---|---|---|---|
| Constantᵃ | -3.940ᵇ (4.80) | - | -3.015 (3.25) | - | 3.359ᵇ (16.40) | - |
| Ageᵃ | 0.017 (0.04) | 1.017 | 0.116 (1.52) | 1.123 | -0.259ᵇ (43.35) | 0.772 |
| Age squaredᵃ | 0.000 (0.20) | 1.000 | -0.002 (2.92) | 0.998 | 0.003ᵇ (35.85) | 1.003 |
| Femaleᵃ | -0.823 (1.65) | 0.439 | -0.823 (1.08) | 0.439 | 1.110ᵇ (13.70) | 3.034 |
| Female*aged < 45 | -0.758 (0.84) | 0.469 | 0.619 (0.53) | 1.856 | -0.802ᵇ (4.00) | 0.449 |
| Marriedᵃ | 0.492 (1.12) | 1.635 | -0.793ᵇ (4.98) | 0.452 | -0.584ᵇ (4.18) | 0.557 |
| Vocational educationᵃ | 1.350ᵇ (9.36) | 3.856 | -0.222 (0.50) | 0.801 | -0.278 (1.00) | 0.757 |
| Ruralᵃ | -0.857 (2.94) | 0.425 | -1.363ᵇ (10.94) | 0.256 | -1.001ᵇ (6.60) | 0.367 |
| Tier 3 regionᵈ | -0.692 (2.63) | 0.501 | 0.536 (2.96) | 1.710 | -0.173 (0.33) | 0.841 |

N: 2626
Pseudo R²: Cox & Snell 0.070; Nagelkerke 0.154; McFadden 0.120
Correct predictions: 94%
-2 Log likelihood: Intercept only 1171.954; Final model 981.965

Notes:
1. The model parameter significance tests are based on the change in the value of -2 log likelihood if the effect is removed from the final model. The ‘a’ superscript indicates that the null hypothesis is rejected at the 5% level; and the ‘d’ superscript indicates rejection at the 10% level.
2. The individual parameter significance tests for each of the β vectors are based on the Wald statistic which is equal to the square of the ratio of a coefficient to its standard error for variables with a single degree of freedom; the ‘b’ superscript indicates that the null hypothesis is rejected at the 5% level. Coefficients with a ‘c’ superscript are significant at the 10% level.


When applied to inflows to agriculture, the model worked less well and rather more variables had to be omitted in order for precise parameter estimates to be obtained. While the ensuing specification correctly predicts ninety-four per cent of the observations, the Nagelkerke R2 is only 0.15 in this case. The results for the best fitting parsimonious form in which all but one coefficient achieves some degree of system wide significance are presented in Table 4. Looking first at those moving into farming from other employment, the most striking result relates to those with vocational education who were almost four times more likely to have moved in this direction than others. Taken in conjunction with the earlier results on outflows from agriculture, this result suggests that those with such schooling histories represent a relatively mobile group in a rather stagnant labour market. Such job switchers were also more likely to be older and married. On the other hand, they were less likely to be female, particularly young women, and less likely to be resident in a rural area or in a Tier 3 region. Incomers from the ranks of the unemployed were more prevalent in Tier 3 regions and were likely to have been below the age of twenty-nine. Although the flow was slight, the latter result is symptomatic of the difficulties confronting attempts to engineer the orderly restructuring of Poland’s farming sector. Women were unlikely to have followed this route, but any who did so are shown to have been young. Married people, those with vocational education and those from rural areas had low probabilities of moving from unemployment into farming. Entrants to the agricultural sector from inactivity were most likely to be female and over the age of forty-five. This is consistent with the finding that the likelihood of moving into the labour market to enter farming at first declined with age but began to increase again beyond the age of forty-three. 
The results also indicate that such movers were less likely to be married, to be resident in rural areas or Tier 3 regions, or to be in possession of vocational education.

6. Outflow Simulations

Even if one accepts the narrow perspective of the size of Poland’s agriculture sector imposed by LFS conventions, it is clear that it must shed large volumes of labour if genuine economic modernisation and convergence to the old core of Europe are to be achieved. Social cohesion and the prevention of yet further increases in the country’s already large economic dependency rate would imply that this exodus should come about principally through flows into alternative employment. This section therefore focuses on the probabilities that individuals possessing particular vectors of characteristics had of flowing into other jobs within the sample analysed. A selection of the results obtained from conducting this exercise, based on the foregoing outflow model, is presented in Table 5. In the interests of brevity, the findings from a similar exercise undertaken for counter-flows into farming have been suppressed.


Table 5. Predicted probabilities for outflows from agriculture

| Characteristics | Age 25 | Age 35 | Age 45 | Age 55 |
|---|---|---|---|---|
| Employed, male | 0.2785 | 0.2765 | 0.1719 | 0.0583 |
| Self-employed, male | 0.0669 | 0.0663 | 0.0371 | 0.0114 |
| Employed, male, rural | 0.1036 | 0.1027 | 0.0585 | 0.0182 |
| Self-employed, male, rural | 0.0210 | 0.0208 | 0.0114 | 0.0034 |
| Employed, male, rural, T4 | 0.1947 | 0.1931 | 0.1151 | 0.0372 |
| Self-employed, male, rural, T4 | 0.0430 | 0.0426 | 0.0236 | 0.0071 |
| Employed, male, rural, T4, vocational education | 0.3841 | 0.3818 | 0.2512 | 0.0910 |
| Self-employed, male, rural, T4, vocational education | 0.1039 | 0.1029 | 0.0587 | 0.0183 |
| Employed, female < 45, rural, T4, vocational education | 0.3147 | 0.3119 | n.a. | n.a. |
| Self-employed, female < 45, rural, T4, vocational education | 0.0786 | 0.0777 | n.a. | n.a. |
| Employed, female ≥ 45, rural, T4, vocational education | n.a. | n.a. | 0.1290 | 0.0423 |
| Self-employed, female ≥ 45, rural, T4, vocational education | n.a. | n.a. | 0.0268 | 0.0081 |

To keep the presentation manageable, the cases presented in Table 5 are restricted to those employed in the private sector as the remaining state sector farms accounted for only four per cent of all agricultural workers. Nevertheless, it will be recalled that such workers have much reduced chances of finding other work. Also, because the impact of their variation on outflows to alternative employment is relatively small, the voivodship unemployment rate is fixed at the median February 1998 registration figure of 12.7 per cent (GUS, 1998a) and the Peripheral variable has been set at one hundred miles.

For the vast majority of farmers, the chances of making a successful transition out of agriculture into other jobs are very low. Not one of the forty-eight individual characteristic vectors depicted in the table yields a greater than fifty per cent chance of doing so, which is the level that Greene (2003: 684-685) regards as the minimum for a predicted exit. At the other end of the scale, three of the individual types featured had chances of finding other work lying below one per cent, while half had a less than one-in-ten chance of successfully doing so. In line with the results reported above, thirty-five year olds exhibited success rates which are very similar to those faced by someone ten years younger. However, by the age of forty-five, the chances had declined considerably. This finding is of some concern, given that the ten year age band 35-44 was both median and modal for agricultural workers at the time the panel was observed (GUS, 1998). By extension, even under the most favourable circumstances, a fifty-five year old has no more than a nine per cent chance of obtaining a job outside agriculture. Under less auspicious scenarios, this figure falls almost to zero.

Those who were paid employees on private farms comprised just one-tenth of the agricultural workforce (ibid.) but, as might be expected, they were much more likely to find alternative employment than those who owned their farm. This of course is a reflection of the widely recognised agricultural problem confronting Poland: small, fragmented landholdings whose owners, for a variety of reasons, resist rationalisation and reform (Ingham and Ingham, 2004; UN, 2000). Farming is an inherently rural activity, but seven per cent of the sector’s workers were resident in urban areas at the time of the survey under study (GUS, op. cit.). Whether employees or working on their own account, this minority had by far the greater chance of moving to alternative employment.

One rather unexpected spatial finding highlighted in Table 5 is that, all else equal, agricultural workers in Tier 4 regions actually enjoyed an enhanced probability of moving to alternative employment over the life of the panel. This result might be rationalised in three ways. The first is quite simply that the numbers making the transition were small and that those territories, based on the widest employment count (GUS, 1999a), accounted for almost two-thirds of the total agricultural workforce. The second is that the economic structure of these regions remained sufficiently insular and backward that low skill job vacancies, albeit limited in recessionary times, persisted. The third is that workers in some of these areas benefited from Warsaw and Kraków contiguity effects that were imprecisely captured by the Peripheral measure.
Both of these major urban voivodships experienced more or less full employment at the time of the panel, a situation that, particularly in the case of the former, had characterised almost all of the post-1989 epoch. On the other hand, the onset of the Russian crisis in the middle of 1998, with its evident impacts on the markets of eastern Poland, increases the uncertainty regarding what the variable is capturing. Another potentially surprising finding is that those farmers with vocational education, where, it will be recalled, no distinction is made in this paper between basic and secondary, experienced a large increase in their chances of securing alternative employment. Accounts of the inadequacy of vocational training in socialist countries are legion (e.g. Krajewska, 1995), geared as it was to fulfilling the short-term skills needs of an inefficient system centred on outmoded technology. However, in the current context, sight should not be lost of the findings of the February 1998 LFS that only three per cent of the agricultural workforce had university or general secondary education, forty-six per cent had vocational schooling, while the remainder had no more than primary education. Even taking those in work in rural areas as a whole only yielded figures of seven per cent, sixty-one per cent and thirty-two per cent, respectively (GUS, 1998).


Vocational education in a rural milieu could therefore possibly represent a positive signal. Furthermore, lacking as their skills might have been, such individuals were likely to have been over-qualified for much of the work available on farms and, in order to use their talents more fully, might have been the more active alternative job seekers. The final point to note from the table is the difference between the sexes. In the case of a comparison between males and females below the age of forty-five, the former have a small, but nonetheless real, advantage in terms of successfully entering alternative employment. However, the disadvantage is much more significant for older women.
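The summary statements made about Table 5 can be checked directly against the tabulated probabilities. The following is a minimal Python sketch, assuming the printed values are exact; the names `TABLE_5` and `summarise` are illustrative and not part of the original analysis, and `None` stands for the cells reported as n.a.

```python
# Predicted probabilities of moving to a non-agricultural job, as printed in
# Table 5. Columns are ages 25, 35, 45 and 55; None marks an "n.a." cell.
TABLE_5 = {
    "Employed, male": (0.2785, 0.2765, 0.1719, 0.0583),
    "Self-employed, male": (0.0669, 0.0663, 0.0371, 0.0114),
    "Employed, male, rural": (0.1036, 0.1027, 0.0585, 0.0182),
    "Self-employed, male, rural": (0.0210, 0.0208, 0.0114, 0.0034),
    "Employed, male, rural, T4": (0.1947, 0.1931, 0.1151, 0.0372),
    "Self-employed, male, rural, T4": (0.0430, 0.0426, 0.0236, 0.0071),
    "Employed, male, rural, T4, vocational": (0.3841, 0.3818, 0.2512, 0.0910),
    "Self-employed, male, rural, T4, vocational": (0.1039, 0.1029, 0.0587, 0.0183),
    "Employed, female < 45, rural, T4, vocational": (0.3147, 0.3119, None, None),
    "Self-employed, female < 45, rural, T4, vocational": (0.0786, 0.0777, None, None),
    "Employed, female >= 45, rural, T4, vocational": (None, None, 0.1290, 0.0423),
    "Self-employed, female >= 45, rural, T4, vocational": (None, None, 0.0268, 0.0081),
}


def summarise(table, exit_threshold=0.5):
    """Count cells against the thresholds discussed in the text."""
    cells = [p for row in table.values() for p in row]
    known = [p for p in cells if p is not None]
    return {
        "cells": len(cells),
        "populated_cells": len(known),
        # A predicted exit requires a probability above the 0.5 threshold.
        "predicted_exits": sum(p > exit_threshold for p in known),
        "below_1_percent": sum(p < 0.01 for p in known),
        "below_10_percent": sum(p < 0.10 for p in known),
        # Last column holds the probabilities at age 55.
        "max_at_55": max(row[3] for row in table.values() if row[3] is not None),
    }


summary = summarise(TABLE_5)
print(summary)
```

On the printed figures, no cell exceeds the fifty per cent threshold, three cells lie below one per cent, and twenty-four of the forty-eight cells (forty of which are populated) lie below ten per cent. The largest value in the age 55 column is 0.0910.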

7. Concluding discussion

The magnitude and nature of the agricultural sector in Poland continues to differentiate the nation from both other recent accession countries and from longer established EU member states. The European Employment Strategy stresses the need for the transference of labour out of agriculture into the service sector, while European rural policy is centred on the existence of multifunctional rural areas with diversified employment opportunities. Market forces alone have failed to achieve this in the case of Poland; labour has not withdrawn from agriculture and migrated to urban areas in search of better job opportunities and capital has not flowed into the rural areas in order to exploit low wages.

In the short term, this situation seems unlikely to change. First, migration to urban areas is not a viable option when such localities suffer from both chronic housing shortages (Juraś and Marzał, 1998) and high levels of unemployment. Second, the dearth of human and physical capital in the rural regions retards both indigenous and foreign investment. Third, neither farmers, nor the rural population at large, are prepared to sell land since not only do they believe that land prices will increase inside the EU, they also fear the threat of unemployment (Kolarska-Bobińska, 2002).

The findings of the analysis in this paper serve merely to confirm these beliefs. In particular, the LFS panel sample utilised indicated that, between February 1998 and February 1999, there was a net reduction of employment in the sector of just 2.5 per cent. If that rate of net exit continued into the future, it would take twenty-five years to halve the number working in the sector and forty years to reduce it by two-thirds. The latter situation would leave Poland with approximately six per cent of its workforce in agriculture according to LFS conventions, which would be more comparable to the position within the old EU-15. In comparison, Ireland and Spain both halved the percentage of their employees working in agriculture between 1988 and 2000, while Greece and Portugal both managed a ten-percentage point reduction (Eurostat, 1993 and 2001).
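The order of magnitude of these horizons can be illustrated with a short calculation. The Python sketch below assumes that the 2.5 per cent net exit rate compounds annually on the remaining workforce. That is an assumption of this illustration, since the text does not state how its rounded horizons were derived, and the function name `years_to_reach` is hypothetical.

```python
import math


def years_to_reach(remaining_share, annual_net_exit_rate):
    """Years until employment falls to `remaining_share` of its initial level.

    Assumes a constant net exit rate that compounds annually, i.e.
    E_t = E_0 * (1 - rate) ** t, which is an assumption of this sketch.
    """
    if not 0.0 < remaining_share < 1.0:
        raise ValueError("remaining_share must lie strictly between 0 and 1")
    if not 0.0 < annual_net_exit_rate < 1.0:
        raise ValueError("annual_net_exit_rate must lie strictly between 0 and 1")
    return math.log(remaining_share) / math.log(1.0 - annual_net_exit_rate)


NET_EXIT_RATE = 0.025  # net reduction observed between February 1998 and 1999

years_to_halve = years_to_reach(0.5, NET_EXIT_RATE)
years_to_cut_two_thirds = years_to_reach(1.0 / 3.0, NET_EXIT_RATE)

print(f"halve the workforce:  {years_to_halve:.1f} years")
print(f"reduce by two-thirds: {years_to_cut_two_thirds:.1f} years")
```

Under this compounding assumption the horizons come out at roughly 27 and 43 years, the same order of magnitude as the rounded figures quoted above. Either way, adjustment at the observed rate would take a generation or more.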


Nevertheless, the analysis did provide some insights into those characteristics associated with agricultural labour market flows. Unsurprisingly, exits from agriculture to other jobs are negatively related to age, at least for those over thirty. Unfortunately, nearly half of all Poland’s farmers are still over the age of forty-five and more than two-thirds are over thirty-five (GUS, 2004). Farmers without land fare relatively well on this score, but they represent a small proportion of the total agricultural workforce. Less benignly, the panel also revealed that inflows to agriculture still persist, particularly among women and those aged over forty from amongst the ranks of the previously inactive. However, there are at least compensating outflows by the same groups, although these again reflect exchanges between farming and non-participation whereas genuine and sustainable rural development will involve both agricultural restructuring and a reduction in the dependency rate (Ingham and Ingham, 2003).

One of the more notable findings of the analysis was that those with vocational education have, in relative terms at least, a high exit rate from farming, albeit both into other jobs and into unemployment. Unfortunately, only one-fifth of the country’s farmers have this level of schooling or higher (GUS, op. cit.). The record of pre-accession assistance from the EU is therefore regrettable in this regard. In particular, the SAPARD (Special Accession Programme for Agriculture and Rural Development) programme that ran from 2000 to 2004 was biased towards agriculture as opposed to rural development. Furthermore, in spite of the fact that only two per cent of students in higher education establishments come from the rural areas and that 58 per cent of farmers have neither secondary school education nor any formal agriculture training (Ingham and Ingham, 2004), less than five per cent of these funds were devoted to education programmes. This is notwithstanding the fact that the evidence suggests that both human and physical capital investments are necessary to promote economic growth (Kilkenny, 1998). Even Polish analysts were disappointed that the Commission declined to support programmes to educate rural youths under the SAPARD initiative (FAPA/SAEPR, 2000).

In addition, the compromise reached over the level and distribution of post-accession agricultural subsidies also threatens to retard restructuring in the sector. First, under standard EU rules, these monies would have gone to a select group of producers. However, in the short term, the CAP monies received from Brussels are to be allocated according to a simplified system, meaning that all farmers with plots in excess of 0.3 hectares will receive funding. This approach provides incentives for a greater number of people to stay in farming. Second, Poland negotiated a compromise with the Commission that allows additional CAP funds, originally earmarked for rural development, to be transferred to agricultural subsidies. Again, this measure simply reinforces the attachment of farmers to their land and impedes the general development of the rural areas. Third, all semi-subsistence farms will be entitled to a flat-rate grant of €750 if they are able to produce a business plan. All of these concessions serve to impede restructuring and to exacerbate inequalities in the rural areas between those who own land and those who do not; occurrences that ironically the Commission was once so keen to avoid (CEC, 2002). In addition, Poland successfully argued its case for land sales to foreigners to be banned for up to seven years post-accession. Even to be able to lease agricultural land, foreigners will need to prove that they have lived in Poland and worked in agriculture.

Overall, it seems unlikely that accession will bring about rapid change in Poland’s rural areas. If this is to be achieved, the Commission must strengthen its resolve for radical changes in CAP funding. If it does not, the hope expressed in the EES that acceding member states will promote the steady movement of labour from agriculture and manufacturing to the service sector will remain well founded in principle, but optimistic in practice (Ingham and Ingham, op. cit.).

Acknowledgements

The authors acknowledge DFID financial support for research project R8097 from which this paper is derived. They would also like to thank the Polish Central Statistical Office for providing Labour Force Survey data and the editor and anonymous referees of this journal for their helpful comments. The authors alone are responsible for any remaining deficiencies.

REFERENCES

BELLMANN, L., S. ESTRIN AND H. LEHMANN (1995), ‘The Eastern German Labor Market in Transition: Gross Flow Estimates from Panel Data’, Journal of Comparative Economics, Vol. 20, No. 2, pp. 139–70.
CEC (Commission of the European Communities) (2002), ‘Enlargement and Agriculture: Successfully Integrating the New Member States into the CAP’, Issues Paper, SEC (2002) 95 final, January.
EC (European Commission) (1997), Treaty of Amsterdam, OJ C 3440, 10/11, Brussels.
EUROSTAT (2004), ‘European Labour Force Survey: Principal Results 2003’, Statistics in Focus, Theme 3–14/2204.
EUROSTAT (2001), ‘European Social Statistics – Labour Force Survey Results 2000’.
EUROSTAT (1993), ‘Labour Force Survey 1983–1991’.
FAPA/SAEPR (Foundation of Assistance Programmes for Agriculture, Agricultural Policy Analysis Unit) (2000), ‘Stereotypes in the European Union Concerning Polish Agriculture’, Warsaw.
GÓRA, M. AND H. LEHMANN (1995), ‘How Divergent is Labour Market Adjustment in Poland?’, in: OECD, pp. 126–63.
GREENE, W.H. (2003), Econometric Analysis, 5th edition, Upper Saddle River, New Jersey: Pearson Education Inc.
GUS (Główny Urząd Statystyczny) (2004), Labour Force Survey in Poland: III Quarter 2003, Warsaw: GUS.
GUS (2003), Statistical Yearbook 2003, Warsaw: GUS.
GUS (2002), Labour Force Survey in Poland in the Years 1992–2001, Warsaw: GUS.
GUS (1999), Labour Force Survey in Poland, November 1998, Warsaw: GUS.
GUS (1999a), Employment in National Economy in 1998, (in Polish) Warsaw: GUS.
GUS (1999b), Registered Unemployment in Poland I–IV Quarter 1998, Warsaw: GUS.
GUS (1998), Labour Force Survey in Poland: February 1998, Warsaw: GUS.
GUS (1998a), Registered Unemployment in Poland: I Quarter 1998, Warsaw: GUS.
GUS (1995), Employment in National Economy in 1994, (in Polish) Warsaw: GUS.
GUS/US (Główny Urząd Statystyczny/Urząd Statystyczny) (1999), Gross Domestic Product by Voivodships for years 1995–1997, Katowice.
HAUCK, W. AND A. DONNER (1977), ‘Wald’s Test as Applied to Hypotheses in Logit Analyses’, Journal of the American Statistical Association, Vol. 72, Issue 360, pp. 851–53.
ILO (1999), ‘Studies on the Global Dimension of Globalization: Poland’, Task Force on Country Studies on Globalization, Geneva.
INGHAM, H. AND M. INGHAM (2004), ‘How Big is the Problem of Polish Agriculture?’, Europe-Asia Studies, Vol. 56, No. 2, pp. 213–234.
INGHAM, M. AND H. INGHAM (2003), ‘Enlargement and the European Employment Strategy’, Industrial Relations Journal, Vol. 35, No. 5, pp. 379–96.
INGHAM, H. AND M. INGHAM (2002), ‘EU Expansion and the Urban-Rural Dichotomy’, SURDAR Working Paper No. 1.
INGHAM, M., H. INGHAM AND R. MCQUAID (2002), ‘Regional Development and EU Enlargement’, in: H. Ingham and M. Ingham (eds), pp. 162–87.
JURAŚ, J. AND T. MARZAŁ (1998), ‘Housing Market and Policy in Poland’, in: Domański, R. (ed.), Emerging Spatial and Regional Structures of an Economy in Transition, Warsaw: Polish Academy of Sciences.
KILKENNY, M. (1998), ‘Transport Costs and Rural Development’, Journal of Regional Science, Vol. 38, No. 2, pp. 293–312.
KOLARSKA-BOBIŃSKA, L. (ed.) (2002), ‘Poland’s rural dwellers on European integration: opinions, knowledge and information’, Instytut Spraw Publicznych, Warsaw.
KRAJEWSKA, A. (1995), ‘Education in Poland’, Eastern European Economics, Vol. 33, No. 4, pp. 38–54.
KRZYZANOWSKA, Z., K. ROMANOWSKA AND W. PISKORZ (2002), ‘Country Report of National Experience in Reforms and Adjustment: Poland’, Agricultural Policy Analysis Unit, Warsaw.
OECD (1995), The Regional Dimension of Unemployment in Poland, Paris: OECD/CCET.
ORLOWSKI, W. (2002), ‘Poland’s accession to the European Union and prospects of restructuring the agriculture and rural areas’, ZBSE Research Bulletin, Vol. 11, No. 2, pp. 5–24.
RAO, C.R. (1973), Linear Statistical Inference and its Applications, 2nd ed., New York: John Wiley and Sons.
SCARPETTA, S. AND P. HUBER (1995), ‘Regional Economic Structures and Unemployment in Central and Eastern Europe: An Attempt to Identify Common Patterns’, in: OECD, pp. 206–33.
TOIKKA, R.S. (1976), ‘A Markovian Model of Labor Market Decisions by Workers’, American Economic Review, Vol. 66, Issue 5, pp. 821–34.
WADSWORTH, J. (1989), ‘Labour Market Transitions in England and Wales: Evidence from the Labour Force Survey’, Working Paper No. 1164, Centre for Labour Economics, London School of Economics.

Data Appendix

Covariates:


Age                  Age in years
Distance             Straight-line distance in miles from the nearest Tier 1 or Tier 2 voivodship (see below)
Unemployment rate    The unemployment rate, in November 1998, in the voivodship where the individual was residing (GUS, 1999b)

Binary factors:

Female                  1 if female, 0 otherwise
Female*aged < 45        1 if female and < 45, 0 otherwise; proxies for women with family responsibilities as LFS gives no information on number of children
Married                 1 if married, 0 otherwise
Employee                1 if a paid employee, 0 otherwise
Self employed           1 if self employed, 0 otherwise
State                   1 if employed in state sector, 0 otherwise
Rural                   1 if living in a rural area, 0 otherwise
Vocational education    1 if highest educational attainment is vocational education, 0 otherwise

Unless stated otherwise, all the data above was derived from the February 1998 Labour Force Survey.

Tier 3 region           1 if an individual resided in a Tier 3 voivodship, 0 otherwise (see below)
Tier 4 region           1 if an individual resided in a Tier 4 voivodship, 0 otherwise (see below)

Voivodship clusters:

Indicators used:
• Employment share in services, relative to Poland’s average, at end 1998 (GUS, 1999a).
• Employment share in industry, relative to Poland’s average, at end 1998 (GUS, 1999a).
• Change in total employment, relative to Poland’s average, 1994-98 (GUS, 1995, 1999).
• Value added per capita, relative to Poland’s average, 1997 (1998 data was published on the new voivodships) (GUS/US, 1999).

Tier 1: Warszawskie

Tier 2: Gdańskie, Katowickie, Krakowskie, Poznańskie

Tier 3: Bielskie, Bydgoskie, Częstochowskie, Elbląskie, Gorzowskie, Wielkopolskie, Jeleniogórskie, Kaliskie, Koszalińskie, Legnickie, Leszczyńskie, Łódzkie, Olsztyńskie, Opolskie, Pilskie, Płockie, Słupskie, Szczecińskie, Toruńskie, Wałbrzyskie, Wrocławskie, Zielonogórskie

Tier 4: Bialskopodlaskie, Białostockie, Chełmskie, Ciechanowskie, Kielce, Konińskie, Krośnieńskie, Lubelskie, Łomżyńskie, Nowosądeckie, Ostrołęckie, Piotrkowskie, Przemyskie, Radomskie, Rzeszowskie, Siedleckie, Sieradzkie, Skierniewickie, Suwalskie, Tarnobrzeskie, Tarnowskie, Włocławskie, Zamojskie


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1387—1405

CHANGES IN COMPETITIVENESS AND LABOUR MARKET DEVELOPMENTS: A COMPARATIVE ANALYSIS OF POLAND, HUNGARY AND THE CZECH REPUBLIC¹

Eugeniusz Kwiatkowski² and Paweł Gajewski³

ABSTRACT

The paper attempts to analyse the links between both domestic and external competitiveness and labour market developments in manufacturing industry branches in the three new member states of the European Union – Poland, Hungary and the Czech Republic.

Generally, we can conclude that the research carried out so far only partially confirms the hypothesis of a positive effect of rising competitiveness on labour market developments. It was mostly the industries with deteriorating competitiveness that reduced employment. The industries in which competitiveness improved showed no clear pattern with regard to employment changes. We believe that the possible factors behind this situation include the restructuring and modernisation processes, which the industrial branches of the transition economies have experienced with differing intensity and range.

The complexity of the problem researched is confirmed by the econometric analyses performed. They suggest no unequivocal pattern of dependency between industrial competitiveness and labour markets across countries and industries.

Key words: competitiveness, labour market, transition.

¹ The paper has been prepared under WP4 of the project “Changes in Industrial Competitiveness as a Factor of Integration: Identifying Challenges of the Enlarged Single European Market”, Contract No. HPSE-CT-2002-00148. The views presented in the paper are those of the authors alone and do not necessarily represent the view of the European Commission.
² Professor, Institute of Economics, University of Lodz, Poland; [email protected]
³ Research Assistant, Institute of Economics, University of Lodz, [email protected]


1. Introduction

The main goal of the analysis is to discuss the impact of changes in competitiveness on labour market developments in the manufacturing industries of three countries: the Czech Republic, Hungary and Poland. For this purpose, industry has been divided into two- and three-digit level sections in order to identify sections of improving and worsening competitiveness. Two competitiveness indicators were used to assess changes in competitiveness. In the subsequent parts of the paper we present the theoretical hypotheses, the common research methodology, the statistical data used, the econometric model adopted for the statistical analysis and the empirical results for the three countries.

The paper is based on three analytical papers devoted to the labour market impacts of changes in competitiveness in Hungary, the Czech Republic and Poland. The paper on Hungary was prepared by Sandor Buzas (Buzas, 2005), the paper on the Czech Republic by Lenka Filipova, Jaromir Gottvald and Milan Simek (Filipova, Gottvald, Simek, 2005), whilst the Polish one was prepared by Pawel Gajewski, Pawel Kaczorowski and Tomasz Tokarski (Gajewski, Kaczorowski, Tokarski, 2005).

The structure of the paper is as follows. The next part briefly discusses the relations between competitiveness and employment in the light of economic theory and formulates the hypotheses. The third part is devoted to the main methodological issues of the research, regarding both the competitiveness measures and the theoretical assumptions underlying the econometric analyses. The fourth section presents a note on the statistical data used. The fifth section contains the empirical part of the paper, focusing on a comparative analysis of research results in the Czech Republic, Hungary and Poland. Finally, the sixth section sums up and draws the main conclusions from the research carried out by the national research teams.

2. Economic competitiveness and employment — theoretical hypotheses

The notion of competitiveness encompasses a range of meanings. It can relate both to the microeconomic aspect, linked with the economic efficiency of enterprises and their market position, and to the macroeconomic aspect, which is revealed in the performance of industries in international trade and in the level of development and prosperity. Another issue of great importance is an improvement of competitiveness. The latter should be understood as an improvement (increase) of the measures and indices representing competitiveness. These indices will be specified further on.

Considering the mechanisms of relations between improving economic competitiveness and labour market performance, we should refer to two fundamental mechanisms exposed in the mainstream economic theories: demand-oriented and supply-oriented.

The demand-oriented mechanism was specified and emphasised by Keynesian economics. An improvement in enterprise (or industry) competitiveness is reflected in an increase of market demand (domestic or external) for the goods manufactured by them. This in turn should enhance production growth, which would finally translate into rising demand for labour in these enterprises or industries.

The supply-oriented mechanism refers to the neoclassical economic tradition. The relation between competitiveness and labour market performance can be traced in the following way. An improvement in competitiveness leads to higher efficiency and profitability of production, which leads to better opportunities of development in industries or firms. The higher production potential translates into growth of demand for labour. This is a typical long-run supply-oriented mechanism.

Considering both of the mechanisms mentioned, a hypothesis can be formulated that an improvement in competitiveness should contribute to an increase of the demand for labour, and thus an increase of employment. This positive influence of competitiveness on labour demand is expected to be particularly pronounced in the long run. In the short run there might be various opposing factors, which could undermine the hypothesis expressed above. These factors are particularly likely to appear in the transition economies. The level of competitiveness there was low at the beginning of the transition period and resulted from some characteristic phenomena occurring in the centrally-planned economies, such as low efficiency, outdated technology and hidden unemployment.
The pressure for improving competitiveness during the period of transition is linked with higher inflow of foreign investments, introducing modern labour-saving technologies, reducing hidden unemployment, increasing labour productivity and large-scale enterprise and industrial restructuring. In this situation, an improvement in competitiveness need not necessarily translate into the growth of employment, especially in the short-run.

3. Methodological issues

In order to analyse the impact of changes in competitiveness on labour market developments, a common methodology was employed to assure the comparability of outcomes. It consists of a common theoretical model and two indices measuring the competitiveness of an industry.


3.1. Competitiveness indices

The two competitiveness indices employed in the research were suggested by the CASE¹. These indices are calculated as:
• A share of domestic production in total domestic demand (CCA index), and
• A share of exports from Poland/Hungary/Czech Republic in total internal exports of the European Union (CCC index)².

The CCA index (a share of domestic production in total domestic consumption) is given by the following formula:

$$CCA_{it} = \frac{Y_{it}}{Y_{it} - Ex_{it} + Im_{it}}$$

where:
$CCA_{it}$ – value of competitiveness indicator CCA of i-th branch in year t;
$Y_{it}$ – volume of domestic production sold in i-th branch in year t;
$Ex_{it}$ – volume of export of i-th branch in year t;
$Im_{it}$ – volume of import of i-th branch in year t.

An increase of the CCA reveals relative improvement of the domestic production against import, which means a rising competitiveness of the industry.

A CCC index was introduced on the recommendation of the CASE. In fact, it is a modification of the CCB index and is defined as the share of export from an investigated country in total internal exports of the European Union, which can be written as:

$$CCC_{it} = \frac{Ex^{UE}_{it}}{IEx^{UE}_{it}}$$

where:
$CCC_{it}$ – value of competitiveness indicator CCC of i-th branch in year t;
$Ex^{UE}_{it}$ – volume of export of i-th branch in year t in a given country to the European Union;
$IEx^{UE}_{it}$ – volume of total internal exports in the European Union comprising products manufactured by i-th industry in year t.

An increase of the CCC index would indicate more favourable assessment of goods brought from the given country, thus rising competitiveness. In other words, the CCC evaluates the ability of an industry to compete in external markets.
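The two indices are simple ratios, and a short Python sketch may make their behaviour concrete. The function names and all numbers below are hypothetical and serve only to illustrate the definitions given above. They are not taken from the country reports.

```python
def cca_index(output, exports, imports):
    """CCA: share of domestic production in total domestic demand.

    Domestic demand is approximated, as in the definition above, by
    production sold minus exports plus imports.
    """
    domestic_demand = output - exports + imports
    if domestic_demand <= 0:
        raise ValueError("domestic demand must be positive")
    return output / domestic_demand


def ccc_index(exports_to_eu, total_internal_eu_exports):
    """CCC: share of a country's exports to the EU in total internal EU exports."""
    if total_internal_eu_exports <= 0:
        raise ValueError("total internal EU exports must be positive")
    return exports_to_eu / total_internal_eu_exports


# Hypothetical branch observed in two years (arbitrary currency units).
cca_before = cca_index(output=100.0, exports=30.0, imports=50.0)  # 100 / 120
cca_after = cca_index(output=120.0, exports=40.0, imports=40.0)   # 120 / 120

# A rise in CCA means domestic production gained ground against imports.
delta_cca = cca_after - cca_before

ccc = ccc_index(exports_to_eu=2.0, total_internal_eu_exports=400.0)

print(f"CCA before: {cca_before:.3f}, after: {cca_after:.3f}, change: {delta_cca:+.3f}")
print(f"CCC: {ccc:.4f}")
```

The change `delta_cca` corresponds to the first-differenced competitiveness term that enters the estimated employment equations.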

¹ Center for Social and Economic Research.
² CCC index was introduced following poor results regarding CCB index, defined as a share of export from a given country in the total EU demand.


3.2. Econometric model

The statistical analyses are carried out on the basis of the theoretical model presented in the separate paper by Tomasz Tokarski (Tokarski, 2003). The following type of equation, emerging from the Harrod-Domar model, was taken for estimations:

$$\frac{\dot{E}}{E} = -g + \gamma\dot{\theta} + \varphi\frac{\dot{Y}}{Y} \qquad (1)$$

where:
$E$ stands for employment (approximated by number of employees) and $\dot{E}/E$ stands for the growth rate of employment;
$Y$ stands for output and $\dot{Y}/Y$ is the growth rate of output;
$g > 0$ is a rate of technical progress, which is not directly linked to competitiveness changes in the economy (but is an effect of e.g. learning by doing);
$\varphi \in (0; 1)$ is the ceteris paribus elasticity of employment with respect to output;
$\gamma \in \mathbb{R}$ is the growth rate of labour demand resulting from an increase of the competitiveness indicator by $\dot{\theta} = 1$.

If, however, rising competitiveness lifts up productivity of capital and labour, then, also on grounds of the Leontief-type production function, the $\gamma$ coefficient can be negative. Equation (1) in a ready-to-use form can be written as follows:

$$\Delta\ln E_t = \alpha_1 + \alpha_2\Delta\ln Y_t + \alpha_3\Delta CCA_t\,(\text{or } CCC_t) - \alpha_4\frac{I_t}{Y_t} + \varepsilon_t \qquad (2)$$

For purposes of estimation with the use of pooled cross-section and time-series data, the fixed effect method was employed. The national teams, according to their preliminary results and other specifics, estimated the basic equation in several versions (but not necessarily in all of those given below). In all cases, the dependent variable was an approximant of the growth rate of employment, $\Delta\ln E_t$.

• Equation with constant slope coefficients (semi-elasticities of employment with respect to CCA and CCC) but intercepts diversified across industries:

$$\Delta\ln E_{it} = \alpha_1 + \alpha_2 D_{2i} + \dots + \alpha_n D_{ni} + \beta_2\Delta\ln Y_{it} + \beta_3\Delta CCA_{it}\,(\text{or } CCC_{it}) - \beta_4\frac{I_{it}}{Y_{it}} + \varepsilon_{it} \qquad (3)$$

• Equation with constant slope coefficients (semi-elasticities of employment with respect to CCA and CCC) but intercepts diversified across time:

$$\Delta\ln E_{it} = \gamma_0 + \gamma_2 D_{t0} + \dots + \gamma_4 D_{tm} + \beta_2\Delta\ln Y_{it} + \beta_3\Delta CCA_{it}\,(\text{or } CCC_{it}) - \beta_4\frac{I_{it}}{Y_{it}} + \varepsilon_{it} \qquad (4)$$

• Equation with constant slope coefficients (semi-elasticities of employment with respect to CCA and CCC) and intercepts diversified across both industries and time:

$$\Delta\ln E_{it} = \alpha_1 + \alpha_2 D_{2i} + \dots + \alpha_n D_{ni} + \gamma_0 + \gamma_2 D_{t0} + \dots + \gamma_4 D_{tm} + \beta_2\Delta\ln Y_{it} + \beta_3\Delta CCA_{it}\,(\text{or } CCC_{it}) - \beta_4\frac{I_{it}}{Y_{it}} + \varepsilon_{it} \qquad (5)$$

• Equation with common constant but diversified slope coefficients:

$$\Delta\ln E_{it} = \alpha_1 + \varphi_1 D_{2i}\Delta CCA_{it}\,(\text{or } CCC_{it}) + \dots + \varphi_2 D_{ni}\Delta CCA_{it}\,(\text{or } CCC_{it}) + \beta_2\Delta\ln Y_{it} + \beta_3\Delta CCA_{it}\,(\text{or } CCC_{it}) - \beta_4\frac{I_{it}}{Y_{it}} + \varepsilon_{it} \qquad (6)$$
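As an illustration of the fixed effect method applied to equation (3), the following Python sketch estimates the common slopes by the within transformation, that is, by demeaning every variable inside each industry before running least squares. This yields the same slope estimates as including the industry dummies explicitly. The panel is synthetic and noise-free, all names are hypothetical, and reading I/Y as an investment rate is an assumption based on the data notes of the Polish team. None of this reproduces the national teams' actual estimation code.

```python
from collections import defaultdict


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def within_estimator(rows):
    """Fixed-effects slopes: demean y and x within each industry, then run OLS.

    rows is a list of (industry, y, [x1, x2, ...]) observations.
    """
    groups = defaultdict(list)
    for industry, y, xs in rows:
        groups[industry].append((y, xs))
    k = len(rows[0][2])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for obs in groups.values():
        n = len(obs)
        y_mean = sum(y for y, _ in obs) / n
        x_mean = [sum(xs[j] for _, xs in obs) / n for j in range(k)]
        for y, xs in obs:
            dy = y - y_mean
            dx = [xs[j] - x_mean[j] for j in range(k)]
            for i in range(k):
                xty[i] += dx[i] * dy
                for j in range(k):
                    xtx[i][j] += dx[i] * dx[j]
    return solve_linear(xtx, xty)


# Synthetic, noise-free panel: every number below is made up for illustration.
TRUE_SLOPES = (0.5, 0.2, -0.1)  # on dlnY, dCCA and I/Y (so beta_4 = 0.1)
INTERCEPTS = {"A": 0.01, "B": -0.02}  # industry-specific intercepts
REGRESSORS = {
    "A": [(0.02, 0.01, 0.10), (0.05, -0.02, 0.12), (-0.01, 0.00, 0.08), (0.03, 0.03, 0.15)],
    "B": [(0.00, -0.01, 0.20), (0.04, 0.02, 0.18), (0.06, 0.01, 0.25), (-0.03, 0.00, 0.22)],
}
panel = [
    (ind, INTERCEPTS[ind] + sum(b * x for b, x in zip(TRUE_SLOPES, xs)), list(xs))
    for ind, obs in REGRESSORS.items()
    for xs in obs
]

slopes = within_estimator(panel)
print([round(s, 6) for s in slopes])
```

Because the synthetic data contain no error term, the estimated slopes coincide with `TRUE_SLOPES` up to rounding error, whatever the industry intercepts are. With real data a statistical package would normally be used, not least to obtain standard errors.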

4. Statistical data

The availability of the statistical data used for estimations differs across the analysed countries. Table 1 shows the periods for which competitiveness indicators have been calculated and used for estimations. It therefore reflects not only the availability of data on the components of the competitiveness indices, but also the availability of data on all variables considered in the statistical analyses.

Table 1. Periods and number of branches used in the research by country teams

Country           Branches         Years
Czech Republic    NACE 2 digits    1997–2003
                  NACE 3 digits    1997–2003 (93 sections)
Hungary           NACE 2 digits    1998–2003 (22 sections)
                  NACE 3 digits    1999–2003 (93 sections)
Poland            NACE 3 digits    1995–2003 (91 sections)


While performing the research, the national teams encountered various data problems. The most important of those reported are listed below.

Czech Republic
• Instead of the turnover of each 3-digit CPA product group, the value of production revenues in the related NACE group was used, because CPA (product) data were not available at such a detailed level of classification (3-digit NACE data were available).
• The converse held for foreign trade statistics: import/export statistics relate only to CPA (products), not to NACE (producers).
• In the structural analyses based on 3-digit levels of classification, visible and invisible mistakes appeared in individual sectors that were impossible to eliminate. In the case of “share indicators”, the visible mistakes were all figures over 100% and all negative values. The industries concerned were deleted from the database at the 3-digit level and not analysed further. The 2-digit database was then recalculated from this “corrected” database. The scale of the problem is considerable, as almost half of the NACE 3-digit database was deleted.

Hungary
• In the NACE 3-digit breakdown, data from 1998 onwards are not compatible with data from earlier years. Moreover, the NACE and CPA databases are reported not to be compatible either.
• In the structural analysis based on 3-digit levels of classification, visible and invisible mistakes appeared in individual sectors that were impossible to eliminate. In the case of “share indicators”, the visible mistakes were all figures over 100% and all negative values. Contrary to the Czech approach, however, those industries were not deleted, since it is unclear whether the figures reflect mistakes or data re-export problems. This issue is subject to further research and should be resolved in the next stage of work.
• The time series are short, especially for data on NACE 3-digit sections, which may have affected the reliability of the results of the statistical analyses.

Poland
• No deflators for investment outlays were available in the NACE 3-digit breakdown. The deflators used were estimates, and their reliability is very hard to evaluate. The investment rates had to be calculated as investment shares in revenues.
• Revenues were deflated with the NACE 2-digit deflator in the absence of more detailed (NACE 3-digit) deflators.
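The screening rule for “share indicators” described above (values over 100% or below zero count as visible mistakes) can be sketched in a few lines. This is only an illustration: the branch codes and values are invented, and the function name `is_visible_mistake` is hypothetical rather than taken from the country reports.

```python
# Hypothetical share indicators (in percent) for a few NACE 3-digit branches.
# The values are made up for illustration; they do not come from the country reports.
shares = {
    "151": 42.0,
    "176": 113.5,  # over 100%: a "visible mistake"
    "191": -4.2,   # negative: a "visible mistake"
    "267": 87.3,
}


def is_visible_mistake(share_percent):
    """A share is only plausible if it lies between 0 and 100 percent."""
    return share_percent < 0.0 or share_percent > 100.0


# Czech-style treatment: drop the flagged branches before recalculating aggregates.
kept = {code: s for code, s in shares.items() if not is_visible_mistake(s)}

# Hungarian-style treatment: keep all branches, but record which ones are suspect.
flagged = sorted(code for code, s in shares.items() if is_visible_mistake(s))

print(kept)
print(flagged)
```

The two treatments differ only in what happens after flagging, which is why the Czech and Hungarian 3-digit databases are not directly comparable in coverage.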

5. Comparison of empirical research

5.1. Descriptive analysis

The most impressive growth in employment across all three countries (over eight-fold) was seen in Hungarian NACE 267 (Cutting, shaping and finishing of ornamental and building stone). A substantial increase in employment in this industry was also recorded in Poland (by 80%). Another well-performing industry was NACE 323 (Manufacture of television and radio receivers…), which recorded the second-highest growth of employment in the Czech Republic (by 130%) and took third position in this ranking among Hungarian industries (an increase of 326%). In Poland, employment increased most in NACE 296 (Manufacture of weapons and ammunition), mainly owing to a jump of almost 200% in 2001.

Table 2. NACE 3-digit branches with the highest reported increase in employment in Hungary (1998–2003), the Czech Republic (1997–2003) and Poland (1995–2003)

Best 6 branches (NACE 3-digit)

Hungary                  Czech Republic           Poland
(1998–2003, 1998=1)      (1997–2003, 1997=1)      (1995–2003, 1995=1)
NACE    Change           NACE    Change           NACE    Change
267     8.28             300     4.74             296     3.52
314     4.32             323     2.30             267     1.80
323     4.26             372     2.15             343     1.76
312     2.21             343     2.06             252     1.76
365     2.07             174     2.03             282     1.65
354     2.03             313     1.66             281     1.64

Source: own elaborations based upon the country reports.
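The tables report changes as indices with the base year set to 1, while the text quotes percentage changes. A minimal sketch of the conversion, using values from Table 2 (the function name `percent_change` is only illustrative):

```python
def percent_change(index_value):
    """Convert an index with base year = 1 into a percentage change."""
    return (index_value - 1.0) * 100.0


# Index values taken from Table 2.
table2 = {
    ("Hungary", "267"): 8.28,
    ("Hungary", "323"): 4.26,
    ("Czech Republic", "323"): 2.30,
    ("Poland", "267"): 1.80,
}

for (country, nace), idx in table2.items():
    print(f"{country} NACE {nace}: index {idx:.2f} -> {percent_change(idx):+.0f}%")
```

An index of 2.30 thus corresponds to the 130% increase quoted for Czech NACE 323, and an index below 1 (for example 0.40) corresponds to a decline (here 60%).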

Table 3. NACE 3-digit branches with the highest reported decline in employment in Hungary (1998–2003), the Czech Republic (1997–2003) and Poland (1995–2003)

Worst 6 branches (NACE 3-digit)

Hungary                  Czech Republic           Poland
(1998–2003, 1998=1)      (1997–2003, 1997=1)      (1995–2003, 1995=1)
NACE    Change           NACE    Change           NACE    Change
173     0.30             176     0.44             271     0.35
284     0.25             193     0.40             191     0.32
176     0.25             273     0.40             293     0.3
355     0.21             183     0.35             172     0.28
181     0.16             191     0.25             247     0.23
191     0.14             355     0.23             363     0.18

Source: own elaborations based upon the country reports.

An analysis of the industries that saw a substantial decline in employment shows that the light industry branches (NACE 171 to 193) were among those most severely affected in all three countries, especially in Hungary and the Czech Republic. NACE 191 (Tanning and dressing of leather) saw a sharp decrease in employment in all three countries, as did NACE 176 (knitted and crocheted fabrics)¹. The apparent exception to the general deterioration in the textile industry is NACE 174 (made-up textile articles) in the Czech Republic (see table 2).

¹ Employment in Poland in this section went down by more than 50% between 1995 and 2003.

Table 4. NACE 3-digit branches of highest growth in domestic competitiveness in Hungary (1998–2003), the Czech Republic (1997–2003) and Poland (1995–2003)

      Hungary (1998–2003, 1998=1)           Czech Republic (1997–2003, 1997=1)    Poland (1995–2003, 1995=1)
No.   NACE   CCA change   Employment change NACE   CCA change   Employment change NACE   CCA change   Employment change
1     192    18.02        0.70              247    2.55         1.49              263    1.57         1.23
2     267    6.28         8.28              268    1.81         1.03              296    1.56         3.52
3     262    3.81         1.07              343    1.75         2.06              342    1.32         0.77
4     322    3.03         0.40              322    1.45         0.81              221    1.15         1.44
5     274    2.81         0.79              315    1.39         0.76              205    1.13         1.62
6     312    2.37         2.21              331    1.30         1.28              267    1.11         1.80

Source: own elaborations based upon the country reports.

Table 4 allows some preliminary conclusions regarding the relation between domestic competitiveness and employment. In the Polish case, only one of the six industries with the highest improvement in competitiveness (NACE 342, bodies for motor vehicles) recorded a fall in the level of employment. There are two such industries in the Czech Republic and three in Hungary, which suggests that the theoretical hypothesis formulated in section 2 has not been fully confirmed. The most severe employment reductions occurred in Hungarian NACE 322 (television and radio transmitters). A more general conclusion can be drawn: industries manufacturing highly processed goods in particular often (although not always) decreased employment despite improving domestic competitiveness.
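The counts quoted above can be re-derived from Table 4 by listing, for each country, the branches whose employment index is below 1. The sketch below uses the table values as printed; the data structure and the function name are only illustrative.

```python
# (NACE, CCA change, employment change) for the six branches listed per country in Table 4.
table4 = {
    "Hungary": [("192", 18.02, 0.70), ("267", 6.28, 8.28), ("262", 3.81, 1.07),
                ("322", 3.03, 0.40), ("274", 2.81, 0.79), ("312", 2.37, 2.21)],
    "Czech Republic": [("247", 2.55, 1.49), ("268", 1.81, 1.03), ("343", 1.75, 2.06),
                       ("322", 1.45, 0.81), ("315", 1.39, 0.76), ("331", 1.30, 1.28)],
    "Poland": [("263", 1.57, 1.23), ("296", 1.56, 3.52), ("342", 1.32, 0.77),
               ("221", 1.15, 1.44), ("205", 1.13, 1.62), ("267", 1.11, 1.80)],
}


def branches_with_falling_employment(rows):
    """Return the NACE codes whose employment index is below the base value of 1."""
    return [nace for nace, _cca, employment in rows if employment < 1.0]


falling = {country: branches_with_falling_employment(rows)
           for country, rows in table4.items()}
for country, codes in falling.items():
    print(f"{country}: {len(codes)} of 6 branches reduced employment ({', '.join(codes)})")
```

All eighteen branches have a CCA index above 1, so every branch returned by the function combines improving domestic competitiveness with falling employment.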

Table 5. NACE 3-digit branches of highest decline in domestic competitiveness in Hungary (1998–2003), the Czech Republic (1997–2003) and Poland (1995–2003)

      Hungary (1998–2003, 1998=1)           Czech Republic (1997–2003, 1997=1)    Poland (1995–2003, 1995=1)
No.   NACE   CCA change   Employment change NACE   CCA change   Employment change NACE   CCA change   Employment change
1     176    0.09         0.25              292    0.43         0.90              322    0.32         0.54
2     287    0.09         0.63              297    0.41         0.77              191    0.26         0.32
3     293    0.08         0.71              286    0.37         0.71              341    0.20         0.44
4     366    0.03         0.91              175    0.23         0.61              192    0.19         0.46
5     152    0.02         0.63              272    0.12         0.76              321    0.11         0.37
6     294    0.01         0.99              191    0.04         0.25              351    0.11         0.63

Source: own elaborations based upon the country reports.

All of the industries in which the highest decline in domestic competitiveness was observed also experienced reductions in employment (see table 5). This is in line with the hypothesis formulated in section 2. The deterioration in competitiveness in the domestic market had various causes. On the one hand, table 5 contains industries which were going through major problems; this is probably the case for most of the sections listed. On the other hand, table 5 also contains some industries which reoriented their policy towards expansion into foreign markets. The most evident example is Hungarian NACE 152 (processing and preserving of fish and fish products), which almost vanished from the domestic market to become one of the Central European leaders in the dynamics of expansion into the EU markets (see table 6). This does not mean, however, that withdrawing from the domestic market was a necessary condition for competing effectively in external markets. Examples of industries that belong to the group of top expanding industries in both the domestic and the external markets are Hungarian NACE 322 (television and radio transmitters) and Polish NACE 263 (ceramic tiles and flags). Interestingly, the former reconciled expansion with employment reductions of as much as 60%.

The industries which in general performed well in external markets in the researched countries were NACE 157 (prepared animal feeds), 245 (soap and detergents, toilet preparations, etc.) and 353 (manufacture of aircraft; see table 6). These industries did not see outstanding changes in the level of employment over the analysed years. Table 6 does not provide simple answers to the question of the relation between external competitiveness and employment changes. Only in the case of Hungary does the correlation seem to be strong and negative. The industries which confirm our hypothesis of a positive dependency between competitiveness and employment are Polish NACE 157 (manufacture of prepared animal feeds) and 263 (ceramic tiles and flags), as well as Czech NACE 222 (printing, etc.), 343 (manufacture of parts and accessories for motor vehicles) and 316 (other electrical equipment). The only Hungarian section from table 6 which is in line with our hypothesis is NACE 245 (soap and detergents, toilet preparations, etc.). The remaining industries exhibited a negative relation between external competitiveness and employment.

Table 6. NACE 3-digit branches of highest growth in external competitiveness in Hungary (1998–2003), the Czech Republic (1997–2003) and Poland (1995–2003)

      Hungary (1998–2003, 1998=1)           Czech Republic (1997–2003, 1997=1)    Poland (1995–2003, 1995=1)
No.   NACE   CCC change   Employment change NACE   CCC change   Employment change NACE   CCC change   Employment change
1     322    30.12        0.40              222    4.19         1.28              157    33.50        1.21
2     152    24.39        0.63              353    4.09         0.76              314    23.29        0.79
3     353    15.45        0.97              362    3.70         0.63              334    11.86        0.72*
4     265    7.56         0.59              245    3.35         0.92              263    10.17        1.23
5     157    7.18         0.88              343    3.32         2.06              243    9.61         0.86
6     245    6.16         1.08              316    3.27         1.63              352    8.71         0.54

Source: own elaborations based upon the country reports.
* Change between 1995 and 1999, as no data for subsequent years were available.
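The values in Table 6 allow a rough check of the direction of the relation between the CCC change and the employment change among the six listed branches of each country. The sketch below computes a plain Pearson correlation coefficient. It is not the authors' calculation, and with only six hand-picked branches per country the result is indicative at best.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (CCC change, employment change) pairs as printed in Table 6.
table6 = {
    "Hungary": [(30.12, 0.40), (24.39, 0.63), (15.45, 0.97),
                (7.56, 0.59), (7.18, 0.88), (6.16, 1.08)],
    "Czech Republic": [(4.19, 1.28), (4.09, 0.76), (3.70, 0.63),
                       (3.35, 0.92), (3.32, 2.06), (3.27, 1.63)],
    "Poland": [(33.50, 1.21), (23.29, 0.79), (11.86, 0.72),
               (10.17, 1.23), (9.61, 0.86), (8.71, 0.54)],
}

correlations = {}
for country, pairs in table6.items():
    ccc = [c for c, _ in pairs]
    employment = [e for _, e in pairs]
    correlations[country] = pearson(ccc, employment)
    print(f"{country}: r = {correlations[country]:.2f}")
```

For the Hungarian branches the coefficient comes out clearly negative, which agrees in sign with the remark that the relation seems negative in the case of Hungary.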

Table 7. NACE 3-digit branches of highest decline in external competitiveness in Hungary (1998–2003), the Czech Republic (1997–2003) and Poland (1995–2003)

      Hungary (1998–2003, 1998=1)           Czech Republic (1997–2003, 1997=1)    Poland (1995–2003, 1995=1)
No.   NACE   CCC change   Employment change NACE   CCC change   Employment change NACE   CCC change   Employment change
1     181    0.65         0.16              244    0.62         1.36              321    0.63         0.37
2     191    0.58         0.14              242    0.55         1.52              182    0.62         0.55
3     263    0.57         0.86              191    0.51         0.25              244    0.56         0.98
4     264    0.46         0.81              351    0.43         1.82              351    0.36         0.63
5     242    0.41         0.95              265    0.28         0.52              265    0.12         0.40
6     243    0.33         0.95              335    0.21         0.68              296    0.11         3.52

Source: own elaborations based upon the country reports.

Table 7 lists the industries which recorded the highest decline in external competitiveness. Taking all three countries into consideration, a group of industries which underperformed in the EU markets can be identified: NACE 191 (Tanning and dressing of leather), which also experienced serious problems in internal markets (see table 5), 242 (pesticides and other agro-chemical products), 244 (pharmaceuticals, etc.) and 265 (cement, lime and plaster). The majority of industries in table 7 saw reductions in employment. This confirms our theoretical hypothesis. The high growth of employment in Polish NACE 296 (weapons and ammunition) can be explained by the fact that this industry reoriented its activities towards domestic expansion, and did so successfully, as can be seen in table 4.

5.2. Statistical analyses

The descriptive analysis did not provide unequivocal answers to the most important question of whether general links exist between competitiveness and employment in the analysed industrial branches. Our theoretical hypothesis has been only partially confirmed. In addition, econometric analyses based on equations (2)-(6) were undertaken. The conclusions from the statistical analyses performed by the national teams are as follows.
• In the Czech case, no regular correlation was found between either competitiveness index on the one side and employment, output or the investment rate on the other. This was confirmed by the econometric analysis, which found no general dependency at the NACE 3-digit level between changes in any competitiveness indicator and employment. Only a few industries exhibited a significant relation between competitiveness and employment. The LSDV estimations confirmed a positive, yet weak, influence of CCA on the level of employment. The CCC index proved insignificant regardless of the level of data aggregation.
• The Hungarian results show a lack of dependency between either competitiveness measure and employment. Somewhat better results were obtained with the pooled data regressions at the 2-digit level. Regardless of the variant, the pattern was that the CCA index was significant (or nearly significant) and negative, and the CCC index insignificant. An attempt to confirm these relations at the NACE 3-digit level was not successful, however: neither CCA nor CCC appeared significant in any of these estimations. The output coefficient, on the other hand, consistently confirmed the expected dependency between employment and output, whilst the investment rate usually turned out to be insignificant.
• In the Polish case, both the descriptive and the statistical analyses of the impact of these measures on labour demand in particular branches lead to the following conclusions. Firstly, the value of the CCA indicator had a fairly strong and positive impact on employment in the Polish industrial branches. Secondly, the dependency between the indicator reflecting expansion into foreign markets (CCC) and the volume of employment seems to have been negative in the analysed period, although this conclusion is weak. Expansion into the external markets of the European Union was linked with a decline in employment in many branches. This is probably due to the deep restructuring processes undertaken in these branches in order to compete effectively in the EU markets. Thirdly, the estimated parameter values show that changes in the indicators had a much smaller impact on employment than the volume of production.
• In the national papers, no strong correlation was found between the domestic and external competitiveness indices of the analysed branches.
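The LSDV (least squares dummy variable) estimations mentioned above fit a separate intercept for each branch. The sketch below shows the idea on deliberately simple synthetic data with a single regressor; it does not reproduce equations (2)-(6), and the branch labels, variable roles and numbers are all hypothetical. It relies on the standard result that demeaning the variables within each group gives the same slope as ordinary least squares with one dummy per group.

```python
def lsdv_slope(panel):
    """Slope of y on x with a separate intercept (dummy) for each group.

    `panel` maps a group label to a list of (x, y) observations. The slope is
    computed from within-group deviations, which is equivalent to OLS with
    one dummy variable per group.
    """
    numerator = 0.0
    denominator = 0.0
    for observations in panel.values():
        x_bar = sum(x for x, _ in observations) / len(observations)
        y_bar = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            numerator += (x - x_bar) * (y - y_bar)
            denominator += (x - x_bar) ** 2
    return numerator / denominator


def pooled_slope(panel):
    """Slope of y on x from pooled OLS with a single common intercept."""
    return lsdv_slope({"all": [obs for group in panel.values() for obs in group]})


# Synthetic panel: y = branch intercept + 0.5 * x, with intercepts 10 and 2.
# x could stand for a competitiveness index and y for (log) employment.
panel = {
    "branch A": [(1.0, 10.5), (2.0, 11.0), (3.0, 11.5)],
    "branch B": [(4.0, 4.0), (5.0, 4.5), (6.0, 5.0)],
}

print("LSDV slope:  ", lsdv_slope(panel))
print("Pooled slope:", pooled_slope(panel))
```

In this constructed example the pooled regression yields a negative slope although the relation within each branch is positive, which illustrates why results can differ between pooled and LSDV estimations and between levels of aggregation.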

6. Concluding remarks

The main goal of the three country studies was to answer the question of the dependency between competitiveness and employment in industrial branches of the economy. Having compared and analysed the main findings reported by the Czech, Hungarian and Polish teams, the following key conclusions can be drawn.
• The most unequivocal results were achieved for Poland. Both the descriptive and the econometric analyses show that growth in the domestic competitiveness of a branch is most commonly accompanied by an increase in employment. By contrast, in order to compete effectively in foreign markets, industrial enterprises often tend to reduce employment. However, branches which recorded a deterioration in external competitiveness reduced employment, confirming our hypothesis.
• The Czech results are more ambiguous and their interpretation is less straightforward. The econometric analysis suggests that domestic competitiveness is positively correlated with the level of employment. The correlation is, however, very low. The branches ranked at the top for improvement in domestic competitiveness do not reveal this regularity. Moreover, no significant dependency was found between external competitiveness and employment in Czech industry.
• Perhaps the biggest problems were reported by the Hungarian side. Neither the descriptive nor the econometric analysis permits unequivocal conclusions about the character of the impact of an industry's competitiveness on its level of employment. However, at the NACE 2-digit level, some negative dependency was found between domestic competitiveness and employment. No clear evidence of an influence of external competitiveness on labour markets was found.

The research confirms in many cases the hypothesis of a positive dependency between competitiveness and employment in the analysed branches. This is especially true of the industries in which the competitiveness index showed a declining tendency. In these branches (of declining domestic and external competitiveness) employment usually decreased, which confirms our theoretical hypothesis. On the other hand, in industries where either domestic or external competitiveness was improving, we recorded both increasing tendencies in employment (which is in line with our hypothesis) and decreasing employment trends (which contradicts it). The latter cases, which undermine the hypothesis, can be explained by the processes of deep restructuring and modernisation in the domestic and external markets. The pressure of competition led to the rationalisation of employment in those industries, so that improving competitiveness was accompanied by declines in employment.

ANNEX

NACE 3-digit classification of branches analysed in the paper.

NACE  Name
151   Production, processing and preserving of meat and meat products
152   Processing and preserving of fish and fish products
153   Processing and preserving of fruit and vegetables
154   Manufacture of vegetable and animal oils and fats
155   Manufacture of dairy products
156   Manufacture of grain mill products, starches and starch products
157   Manufacture of prepared animal feeds
158   Manufacture of other food products
159   Manufacture of beverages
160   Manufacture of tobacco products
171   Preparation and spinning of textile fibres
172   Textile weaving
173   Finishing of textiles
174   Manufacture of made-up textile articles, except apparel
175   Manufacture of other textiles
176   Manufacture of knitted and crocheted fabrics
177   Manufacture of knitted and crocheted articles
181   Manufacture of leather clothes
182   Manufacture of other wearing apparel and accessories
183   Dressing and dyeing of fur; manufacture of articles of fur
191   Tanning and dressing of leather
192   Manufacture of luggage, handbags and the like, saddlery and harness
193   Manufacture of footwear
201   Sawmilling and planing of wood; impregnation of wood
202   Manufacture of veneer sheets; manufacture of plywood, laminboard, particle board, fibre board and other panels and boards
203   Manufacture of builders' carpentry and joinery
204   Manufacture of wooden containers
205   Manufacture of other products of wood; manufacture of articles of cork, straw and plaiting materials
211   Manufacture of pulp, paper and paperboard
212   Manufacture of articles of paper and paperboard
221   Publishing
222   Printing and service activities related to printing

231   Manufacture of coke oven products
232   Manufacture of refined petroleum products
233   Processing of nuclear fuel
241   Manufacture of basic chemicals
242   Manufacture of pesticides and other agro-chemical products
243   Manufacture of paints, varnishes and similar coatings, printing ink and mastics
244   Manufacture of pharmaceuticals, medicinal chemicals and botanical products
245   Manufacture of soap and detergents, cleaning and polishing preparations, perfumes and toilet preparations
246   Manufacture of other chemical products
247   Manufacture of man-made fibres
251   Manufacture of rubber products
252   Manufacture of plastic products
261   Manufacture of glass and glass products
262   Manufacture of non-refractory ceramic goods other than for construction purposes; manufacture of refractory ceramic products
263   Manufacture of ceramic tiles and flags
264   Manufacture of bricks, tiles and construction products, in baked clay
265   Manufacture of cement, lime and plaster
266   Manufacture of articles of concrete, plaster and cement
267   Cutting, shaping and finishing of ornamental and building stone
268   Manufacture of other non-metallic mineral products
271   Manufacture of basic iron and steel and of ferro-alloys
272   Manufacture of tubes
273   Other first processing of iron and steel
274   Manufacture of basic precious and non-ferrous metals
275   Casting of metals
281   Manufacture of structural metal products
282   Manufacture of tanks, reservoirs and containers of metal; manufacture of central heating radiators and boilers
283   Manufacture of steam generators, except central heating hot water boilers
284   Forging, pressing, stamping and roll forming of metal; powder metallurgy
285   Treatment and coating of metals; general mechanical engineering
286   Manufacture of cutlery, tools and general hardware
287   Manufacture of other fabricated metal products
291   Manufacture of machinery for the production and use of mechanical power, except aircraft, vehicle and cycle engines
292   Manufacture of other general purpose machinery

293   Manufacture of agricultural and forestry machinery
294   Manufacture of machine tools
295   Manufacture of other special purpose machinery
296   Manufacture of weapons and ammunition
297   Manufacture of domestic appliances n.e.c.
300   Manufacture of office machinery and computers
311   Manufacture of electric motors, generators and transformers
312   Manufacture of electricity distribution and control apparatus
313   Manufacture of insulated wire and cable
314   Manufacture of accumulators, primary cells and primary batteries
315   Manufacture of lighting equipment and electric lamps
316   Manufacture of electrical equipment n.e.c.
321   Manufacture of electronic valves and tubes and other electronic components
322   Manufacture of television and radio transmitters and apparatus for line telephony and line telegraphy
323   Manufacture of television and radio receivers, sound or video recording or reproducing apparatus and associated goods
331   Manufacture of medical and surgical equipment and orthopaedic appliances
332   Manufacture of instruments and appliances for measuring, checking, testing, navigating and other purposes, except industrial process control equipment
333   Manufacture of industrial process control equipment
334   Manufacture of optical instruments and photographic equipment
335   Manufacture of watches and clocks
341   Manufacture of motor vehicles
342   Manufacture of bodies (coachwork) for motor vehicles; manufacture of trailers and semi-trailers
343   Manufacture of parts and accessories for motor vehicles and their engines
351   Building and repairing of ships and boats
352   Manufacture of railway and tramway locomotives and rolling stock
353   Manufacture of aircraft and spacecraft
354   Manufacture of motorcycles and bicycles
355   Manufacture of other transport equipment n.e.c.
361   Manufacture of furniture
362   Manufacture of jewellery and related articles
363   Manufacture of musical instruments
364   Manufacture of sports goods
365   Manufacture of games and toys
366   Miscellaneous manufacturing n.e.c.

REFERENCES

BUZAS S. (2005), Impact of changes in competitiveness on labour market developments, country report prepared for the WP4 project “Changes in Industrial Competitiveness as a Factor of Integration: Identifying Challenges of the Enlarged Single European Market”.

FILIPOVA L., GOTTVALD J., SIMEK M. (2005), Impact of changes in competitiveness on labour market developments, country report prepared for the WP4 project “Changes in Industrial Competitiveness as a Factor of Integration: Identifying Challenges of the Enlarged Single European Market”.

GAJEWSKI P., KACZOROWSKI P., TOKARSKI T. (2005), Impact of changes in competitiveness on labour market developments in the Polish industry over the years 1995–2003, country report prepared for the WP4 project “Changes in Industrial Competitiveness as a Factor of Integration: Identifying Challenges of the Enlarged Single European Market”.

TOKARSKI T. (2003), Competitiveness indicators and the labour market (Theses for discussion), Jagiellonian University, Cracow.

STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1407—1409

Book Review

Thinking Statistically: Elephants Go to School by Sarjinder Singh, Kendall/Hunt Publishing Company, 2006, 651 pp., ISBN 978-0-7575-3738-7, Soft Cover, Price $57.70

Thinking Statistically: Elephants Go to School is an excellent book for those who teach and learn statistics. Probably its greatest advantage is that it explains, in a humorous and understandable way, things that many find boring. If you want to study statistics, if you want to teach statistics, or if you just want to have some fun, you should read this book. You won’t regret it! In chapter 1, “Basic concepts,” the author introduces the basic concepts of statistics, including the definition of statistics, random sample, parameter and statistic, bias, and different sampling designs. In addition, in just a few sentences, the author points out several issues that are very important in statistical philosophy, such as that statistics can be misused or misinterpreted, and that there is no need to test universal truths (e.g., the statement “Elephants are bigger than rats”). In chapter 2, “Statistical studies,” the reader is introduced to the basic notions of experimental design, including completely randomized and randomized block designs, and of survival analysis. In chapter 3, “Graphical presentation,” the author explains the differences between qualitative and quantitative random variables, and introduces the use of various graphs for both types of variables. Chapter 4, “Numerical presentation,” is an exposition of measures of central tendency (mean, median and mode) and of variability (range, mean absolute deviation, variance and standard deviation, and coefficient of variation). In addition, the empirical rule, relative standing and standardization, and the box-plot are presented. One can learn how farmer Bob counted rats in his field and why he did this, how to compare a monkey’s jumps with an elephant’s diet, and why a lion shouted on seeing its face in a mirror. The author tells the story of the failure of the rats’ plan to put a bell on a cat’s neck, and then explains Tchebysheff’s rule.
In chapter 5, “Touching probability,” basic probability is introduced. After reading this chapter, the reader should be familiar with the notions of experiment and event, various diagrams that facilitate the understanding of probability, the definition of probability, marginal and conditional probability, and the classical laws of probability. The reader may also learn that “hitting birds is not fun,” but “hitting balloons is fun.” The author explains “how to simulate a DINOSAUR” and “how you can win a Monkey by mixing three different colors.” In chapter 6, “Discrete distributions,” the reader learns about discrete random variables and discrete probability distributions. In particular, the author focuses on the binomial and Poisson distributions. In chapter 7, “Continuous distributions,” the uniform and normal distributions are introduced. The z-score and its use are also explained; in addition, the reader may learn how to generate a normal distribution. Chapter 8, “Sampling distribution,” is an introduction to, inter alia, the sampling distributions of the sample mean and proportion, and their point and interval estimation for small and large sample sizes. Estimators, both point and interval, of the difference between two means are also presented. The author confirms some common knowledge, such as that Indian food is spicy and that good grades bring confidence. In chapter 9, “The idea of hypotheses testing,” the reader may learn about one of the most meaningful statistical concepts. The author explains why hypothesis testing is necessary, the difference between one-tailed and two-tailed tests, type I and type II errors, the significance level, the rejection of a hypothesis, etc. The reader also learns about tests for a single proportion, mean, and variance; for the difference between two proportions, means, and variances; and for a contingency table. Basic concepts of analysis of variance for a completely randomized block design are also introduced. In the last chapter, chapter 10, “Analyzing bivariate data,” bivariate data are described and correlation and regression analyses are presented. The author shows that experience is required to choose the right statistical tool. For example, the height of a child could be 25 feet in 100 years if predicted with a linear model. Thus, a linear model could be the wrong choice for predicting the height of a child, in the same way as an axe is the wrong tool for cutting an iron rod. At the end, selected statistical tables, important formulae, a bibliography and a subject index are given in appendixes. The book is aimed at students outside statistics and mathematics. It is full of examples (each illustrating both a particular statistical problem and the author’s humour), exercises, stories and pictures. All of them make the book not only useful at the beginning stage of statistical learning, but also easy and pleasant to read. There are numerous hilarious, joyful LUDIs (Let Us Do It) at the end of each chapter, such as “Home cooked food is the best” or “Good dreams are signs of good health.” The only thing the book lacks, in my eyes, is an explanation of the difference between finite and infinite populations. I think that students should learn about this difference at the very beginning of their statistical education. However, most basic books do not introduce this problem and most teachers of statistics omit this topic at the first levels of statistical education; instead, they show their students the difference between small and large populations.

In summary, the book should be useful for students of basic courses of statistics, and for teachers of such courses. The way its contents are presented makes reading pleasant and learning easy.

Contents

1. Basic Concepts. 2. Statistical Studies. 3. Graphical Representation. 4. Numerical Representation. 5. Touching Probability. 6. Discrete Distributions. 7. Continuous Distributions. 8. Sampling Distributions. 9. The Idea of Hypotheses Testing. 10. Analyzing Bivariate Data.

Appendixes: Useful Statistical Tables. Important Formulae. Bibliography. Handy Subject Index.

Prepared by Marcin Kozak, Department of Biometry, Warsaw Agricultural University, Poland

STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1411—1417

THE 9th VILNIUS CONFERENCE ON PROBABILITY THEORY AND MATHEMATICAL STATISTICS

Danutė Krapavickaitė¹, Aleksandras Plikusas²

ABSTRACT

The paper presents a short history of the Vilnius conferences on Probability Theory and Mathematical Statistics and gives information on the program of the 9th conference. Some words are dedicated to the memory of Professor Vytautas Statulevičius, who was the main organizer of all previous conferences.

Key words: conference, probability, statistics.

1. Introduction

The 9th Vilnius Conference on Probability Theory and Mathematical Statistics took place on June 25–30, 2006. It was organized by the Institute of Mathematics and Informatics and Vilnius Gediminas Technical University. The previous conferences were held in 1973, 1977, 1981, 1985, 1989, 1993, 1998 [PLIKUSAS, 1999] and 2002. This conference was dedicated to the main initiator and organizer of all previous Vilnius conferences, Professor Vytautas Statulevičius (1929–2003).

2. A short history of Vilnius conferences

For more than three decades the Vilnius conferences on Probability Theory and Mathematical Statistics have provided an important venue for our science. These conferences helped the Vilnius school of probability theory to gain worldwide recognition, stimulated international cooperation among Lithuanian mathematicians and made the name of the country known around the world.

1 Institute of Mathematics and Informatics, Akademijos 4, LT-08663, Lithuania; Vilnius Gediminas Technical University, Lithuania
2 Institute of Mathematics and Informatics, Akademijos 4, LT-08663, Lithuania

In 1970–1990, the Vilnius conferences offered an opportunity for scientists from the East (the former Soviet Union) and the West to meet, the West being understood in a wide sense that includes democratic countries from all over the world, for example Japan and the USA. The growing number of participants showed the growing popularity of the conferences. There were only about 200 participants at the 1st conference, and more than 600 in 1985 and in 1989. Many world-famous scientists were among the participants: A. N. Kolmogorov (Russia), A. V. Skorokhod (Ukraine), R. M. Dudley (USA), X. Fernique, P. Meyer (France), C. R. Rao (India), P. Hall (Australia), K. Urbanik (Poland), and many others. After the restoration of our independence there were three conferences (1993, 1998, 2002). All of them were supported by the government of Lithuania, which helped us to raise the organizational level of the conferences as well as to maintain their popularity. The chairman of the Organizing Committee of all the previous conferences was Vytautas Statulevičius. One of the plenary lectures this year was devoted to his memory. It was delivered by his former PhD student, Professor Vygantas Paulauskas.

3. Professor Vytautas Statulevičius

Vytautas Statulevičius was born on November 27, 1929 in Eastern Lithuania, the Utena district, in the village of Bikūnai, in a family of small farmers [KUBILIUS, 2006]. After he left primary school in 1941, his further studies were rather complicated. Lithuania was occupied from the east, later from the west, and once more from the east. Vytautas studied at an agricultural school. However, he felt a calling to science. He prepared himself independently and in 1947 passed the secondary school examinations. In 1949, he graduated from the Preliminary Courses at Vilnius University with excellent marks and received his school-leaving certificate. With the help of Jonas Kubilius, then a postgraduate student at Leningrad University and later a professor of mathematics, academician and rector of Vilnius University, Vytautas Statulevičius was accepted as a student of mathematics at Vilnius University. He justified the trust put in him, became active in a students’ scientific circle, and defended his graduation thesis in 1954. Recommended by Jonas Kubilius once more, Statulevičius was accepted as a postgraduate student at Leningrad University under the supervision of Y. Linnik. In 1956, Statulevičius was awarded the 2nd degree prize of Leningrad University for his work Local limit theorems for non-homogeneous Markov chains, and in 1959 he defended his PhD thesis on the same topic. In 1957, V. Statulevičius started working at Vilnius University and at the Mathematics Section of the Institute of Physics and Mathematics of the Lithuanian Academy of Sciences. Several times as a young researcher he visited the V. Steklov Mathematical Institute in Moscow and Moscow University, and worked close to A. Kolmogorov, using the opportunity to expand his horizons together with other young scientists who later became famous. He defended his doctor habilius thesis at Vilnius University in 1967.

His contribution to probability theory consists of new and original results in the field of limit theorems for independent and weakly dependent random variables. Together with his colleague L. Saulis he published a monograph [STATULEVIČIUS, SAULIS, 1991]. V. Statulevičius was awarded the scientific prizes of Lithuania and the USSR several times.

From 1958 V. Statulevičius headed the Mathematics section of the Institute of Physics and Mathematics. Later he served as deputy director and director (1966–1995) of this institute, which was renamed the Institute of Mathematics and Informatics. From 1980 he was head of the department of probability theory, and from 1995 chairman of the Senate of the Institute. Vytautas Statulevičius was a supervisor of many students and researchers. Under his supervision 35 mathematicians defended PhD theses, and 7 of them afterwards defended doctor habilius dissertations. Vytautas Statulevičius was a member of the International Statistical Institute, a member of the statistical associations of Europe and the USA, editor of the journal “Probability Theory and Applications”, and a member of the editorial board of the “Lithuanian Mathematical Journal”. As an invited professor he gave lectures at the universities of Moscow, St. Petersburg, California, Ottawa, Bielefeld, Rome, and others, and took part in international congresses and meetings. Scientists in Vilnius and Moscow decided to organize international conferences in Vilnius, and Statulevičius was one of the principal organizers. Statulevičius was involved not only in scientific and administrative work, but he also took a very active part in public life.
In 1989, after the electoral system for the Supreme Soviet of the USSR changed and deputies were no longer nominated, he was elected a People’s Deputy, and contributed a great deal to the efforts that led to the restoration of Lithuania’s independence in 1990. He was fond of nature, fishing and hunting, and fought for the preservation of the environment. Above all, he was a very friendly and warm man to everybody with whom he had contact. Vytautas Statulevičius started organizing the 9th Vilnius conference. His colleagues and students carried this work through to the end. The volume of his selected papers [STATULEVIČIUS, 2006] was published before the opening of the conference.

4. The work of the 9th conference

The Organizing committee of the conference was chaired by Prof. Vygantas Paulauskas. The chairman of the Programme committee was Prof. Peter Jagers. The Programme committee consisted of scientists from 15 countries. 279 participants from 40 countries came to the conference. Among them there were 63 participants from Lithuania, 22 from Russia, 22 from the USA, 19 from France, 16 from Ukraine, 13 from Germany, 10 from the UK, 9 from Poland, and so on.

There were 4 plenary talks:
• Olav Kallenberg (Auburn, USA). Some problems of local hitting, scaling, and conditioning.
• Vygantas Paulauskas (Vilnius, Lithuania). Statulevičius – teacher, scientist, organizer – a man of great talent.
• Alain-Sol Sznitman (Zürich, Switzerland). On the Disconnection of Discrete Cylinders.
• Nanny Wermuth (Gothenburg, Sweden). Deciding on structure in data.

The programme consisted of 4 sections divided into a number of sessions and 1 seminar.

Section I. Limit theorems
Sessions:
• Limit theorems (1). Organizer F. Götze, invited speakers: E. Bolthausen, D. Mason, H. Matzinger.
• Limit theorems (2). Organizer M. Csörgő, invited speakers: E. Csáki, A. Földes, P. Révész, S. Csörgő.
• Limit theorems (3). Organizer V. Bentkus, invited speakers: V. Tarieladze, V. Bentkus.
• Limit theorems (4). Organizer L. Chen, invited speakers: F. Götze, G. Reinert, Qi-Man Shao.
• Probabilistic . Organizer J. Kubilius, invited speakers: Hsien-Kuei Hwang, I. Katai, J. Šiaulys.

Section II. Random process
Sessions:
• Stochastic equations. Organizer M. Röckner, invited speaker B. Maslowski.
• Random processes. Organizer B. Grigelionis, invited speakers: D. Applebaum, E. Valkeila, A. Dorogovtsev, A. Tempelman.
• Stochastic analysis. Organizer D. Elworthy, invited speakers: Y. LeJan, D. Nualart, N. O’Connell.
• Stable processes. Organizer G. Samorodnitsky, invited speakers: J. Nolan, V. Pipiras, J. Rosinski, G. Samorodnitsky.

Section III. Statistical inference
Sessions:
• Long memory. Organizer P. Soulier, invited speakers: C. Hurvich, A. Philippe.
• Adaptive and nonparametric methods. Organizers I. Ibragimov, R. Khasminski, invited speakers: Y. Golubev, O. Lepski, A. Tsybakov.
• Reliability and survival analysis. Organizer V. Bagdonavičius, invited speakers: D. Dabrowska, J. Fine, U. Jensen.
• Empirical processes. Organizer R. Dudley, invited speakers: V. Koltchinskii, S. Mendelson.
• Robust statistics. Organizer E. Ronchetti, invited speakers: A. Christman, C. Croux, V. Czellar.
• Survey sampling. Organizer G. Kulldorff, invited speakers: J. Rao, A. Scott, M. Thompson.

Section IV. Applications
Sessions:
• Econometrics. Organizer M. Deistler, invited speakers: R. Dahlhaus, M. Lippi, B. Pötscher.
• ARCH modeling. Organizer T. Mikosch, invited speakers: R. Leipus, A. Lindner, C. Starica.
• Statistics in life sciences. Organizer A. Frigessi, invited speakers: E. Ferkingstad, B. Blankertz, P. Jagers.
• Environmental statistics. Organizer P. Switzer, invited speakers: S. Roberts, J. Robins, J. Zidek.
• Finance. Organizer H. Föllmer, invited speakers: A. Cherny, M. Davis, W. Schachermayer.

The seminar on percolation and disordered systems was organized by V. Sidoravičius.

Let us describe the largest sessions and the new one in more detail.

4.1. Limit theorems

The sessions on limit theorems have traditionally been the main and largest part of the programme since the 1st Vilnius conference. This year the topic was again the most popular and representative, not only in the number of contributors, but also in the names of the world-famous researchers involved, first of all the organizers of the sessions: V. Bentkus (Lithuania), L. Chen (Singapore), M. Csörgő (Canada), F. Götze (Germany). The range of topics was wide: limit theorems in finite dimensional and functional (Hilbert, Banach, Hölder) spaces, limit theorems for independent and weakly dependent random variables, random processes and martingales. Quite a lot of presentations were devoted to Stein’s method. Charles Stein (USA) presented it at the 6th Berkeley symposium in 1970. At first it was aimed at obtaining the rate of convergence in the central limit theorem for weakly dependent variables. Later the method was developed in many directions and for many purposes. Stein’s method was discussed in the following talks:
• G. Reinert (UK). Application of Stein’s method to simulation.
• Q.-M. Shao (China). Stein’s method and Cramér type large deviations.
• V. Bentkus (Lithuania). On inequalities for tail probabilities of martingales.


4.2. Survey sampling

Thanks to the nine-year co-operation of statisticians from the Baltic and Nordic countries in the field of survey sampling theory and methodology, a survey sampling session was included in the section on Statistical Inference for the first time in the history of the Vilnius conferences. It was organized by Professor Gunnar Kulldorff of Umeå University. Two invited lecturers were from Canada. With the aim of expanding the horizons of the Baltic survey statisticians, who mainly work on the estimation of parameters, J. N. K. Rao delivered a lecture “Some new methods for the analysis of complex survey data”. M. Thompson in her lecture “Modeling transitions with survey data” surveyed the methods for modeling transitions under the conditions of misclassification and interval censoring. The methods were illustrated by some estimates from complex longitudinal health surveys. The third invited lecturer, A. Scott from New Zealand, spoke on fitting regression models to data from two-phase sampling, when the response and some covariates are measured for all the units at the first phase, but the remaining covariates are measured only for the second-phase sub-sample of units. The choice of units included in the sub-sample can depend on anything measured at the first phase. The invited lectures were followed by 13 contributed talks by statisticians from the Baltic and Nordic countries, Poland and Italy. Their topics included two-phase sampling and other sampling design problems; model-dependent and model-assisted estimation and calibration; small area estimation; and properties of the distributions of finite population statistics.

5. Concluding remarks

Many researchers, young and experienced, came to Lithuania to present their new results, to learn from one another, and to meet colleagues living far away. They had opportunities to discuss common problems and to enjoy the beauty of our renewed old city of Vilnius, as well as pieces of ancient history. We are happy that they respect the tradition of the Lithuanian probabilistic school. The selected papers of the conference will be published in “Acta Applicandae Mathematicae”. The programme of the conference shows that probabilists and statisticians from all over the world are working hard together to maintain the high authority of science in society.


Acknowledgements

We are thankful to our colleagues Vygantas Paulauskas and Jonas Sunklodas for the information presented.

REFERENCES

KUBILIUS J. (2006). Vytautas Statulevičius. A brief biographical outline. In: Statulevičius V. Selected mathematical papers. Institute of Mathematics and Informatics, Vilnius, 11–14.

PLIKUSAS A. (1999). The 7th Vilnius Conference on Probability Theory and Mathematical Statistics and 22nd European Meeting of Statisticians. Statistics in Transition, 4(2), 303–308.

STATULEVIČIUS V., SAULIS L. (1991). Limit Theorems for Large Deviations. Kluwer, Dordrecht. 232 p.

STATULEVIČIUS V. (2006). Selected mathematical papers. Eds. V. Bentkus et al. Institute of Mathematics and Informatics, Vilnius. 652 p.


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1419—1428 REPORT

XXVI European Meeting of Statisticians Toruń, Poland, July 24—28, 2006

The 26th European Meeting of Statisticians was held in Toruń, Poland, from 24 to 28 July 2006. The conference was organized under the auspices of the Bernoulli Society for Probability and Mathematical Statistics. Nicolaus Copernicus University was the main organizing institution of the conference. The Scientific Program Committee was chaired by Herold Dehling. Adam Jakubowski was the Chairman of the Local Organising Committee. During the conference there were 24 Invited Paper Sessions and a Special Invited Session in memory of Alexander Nagaev, a member of the Scientific Program Committee for the EMS 2006, who died on 10 February 2005. Ildar A. Ibragimov, Vygantas Paulauskas, Victoria Steblovskaya and Yuri Davydov presented papers in this special invited session. At the Contributed Paper Sessions 113 papers were presented, including 38 by Polish speakers, and at the Poster Sessions there were 37 presentations, including 19 by Polish authors. There were over 280 participants at the conference.

The Scientific Program Committee appointed the following speakers for the main talks:
• Opening Lecture: Gareth Roberts (UK), Exact simulation and inference for diffusions,
• Forum Lecture: Søren Johansen (Denmark), A survey of cointegration analysis,
• Closing Lecture: Friedrich Götze (Germany), Asymptotic statistics and geometry of numbers,

and also the following Special Speakers:
• Chris Glasbey (UK), A statistical approach to image warping,
• Marie Huskova (Czech Republic), Recent results in change point analysis,
• Thomas Mikosch (Denmark), Heavy tail modelling,
• Brian Ripley (UK), Software for statistical development,
• Tomasz Rolski (Poland), The theory of fluid queues,
• Donatas Surgailis (Lithuania), Estimation of nonstationary long memory.

The rest of this report presents the papers delivered at the Invited Paper Sessions.

1420 XXVI European Meeting of Statisticians…

At the Empirical Process Techniques for Dependent Data session the following papers were presented: Walter Philipp (USA), Entropy conditions for subsequences of random variables with applications to empirical processes, Wei Biao Wu (USA), Empirical processes of causal stationary sequences. The first author introduced new entropy concepts measuring the size of a given class of increasing sequences of positive integers. The paper by Wei Biao Wu discussed sample path properties of empirical distribution functions of causal stationary processes. The author presented weak convergence to Gaussian processes and multiple Wiener-Itô integrals.

At the Statistics in Geophysics session the following papers were presented: Adrian Raftery (USA), Probabilistic weather forecasting, Paul Switzer (USA), Statistical investigations of general circulation models for the study of climate, Hans Wackernagel (France), Geostatistical assimilation of geophysical data. These papers dealt with complex global geophysical models of the atmosphere and its ocean interface used to represent the temporal dynamics of spatially distributed atmospheric variables. The assimilation of spatial data into numerical models brings about a number of statistical problems that fall naturally into the realm of geostatistics. All the presented models are very important for climate forecasting. In particular, the paper by Adrian Raftery considered probabilistic forecasting of a single weather quantity and proposed a principled statistical method for postprocessing ensembles based on Bayesian Model Averaging (BMA). The talks addressed ongoing research themes in the application of geostatistics to the assimilation of observations into geophysical numerical models.

At the Recent Developments in Time Series session the following papers were presented: Katarzyna Blinowska (Poland), Directed transfer function and its application for study of brain activity, Michael Eichler (Netherlands), Graphical models and causal inference for multivariate time series, Qiwei Yao (UK), Modeling multivariate volatilities by common factors: an innovation expansion approach. In her paper Katarzyna Blinowska discussed the problem of information processing in the human brain. As was shown, it may be approached by the study of time series: electroencephalograms (EEG) and local field potentials (LFP). The next paper, by Michael Eichler, presented graphical models as an important tool for analyzing multivariate data. The author introduced partial correlation graphs and Granger causality graphs for representing the dependences among multiple time series. Finally, Qiwei Yao proposed modelling multivariate volatilities by common factors. Those unobservable common factors were identified by expanding the innovation space step by step, thereby solving a high-dimensional optimization problem through many low-dimensional sub-problems.

At the Network Tomography session the following papers were presented: Gang Liang (USA), A partial measurement approach to network traffic matrix estimation, George Michailidis (USA), Network monitoring through active tomography techniques, Cun-Hui Zhang (USA), Some network tomography and species problems. The first author proposed a novel approach to estimating traffic matrices that incorporates lightweight Origin-Destination (OD) flow measurements coupled with a computationally lightweight algorithm for generating the OD estimates. It is worth mentioning that the author used a heuristic based on intuition derived from game theory. The papers by Cun-Hui Zhang and George Michailidis discussed active network tomography, monitoring and species problems, such as network anomalies. The methodology was illustrated on real and simulated network data.

At the Lifetime Data Analysis session the following papers were presented: Philippe Broet (France), Modeling gene expression changes related to early and late relapse in curable disease, Mei-Ling Lee (USA), Threshold regression models, George Whitmore (Canada), Modeling low birth weights using threshold regression: results for U.S. birth data. The paper by Philippe Broet discussed a simple and efficient method for identifying gene expression changes that characterize early and late recurrence for untreated patients. The papers by Mei-Ling Lee and George Whitmore presented threshold regression models for birth and survival data.

At the Non/Semiparametrics session the following papers were presented: Yannick Baraud (France), Estimating the intensity of a random measure, Wolfgang Polonik (USA), Multidimensional mode hunting, Ingrid van Keilegom (Belgium), Estimation of a semiparametric transformation model. The paper by Yannick Baraud presented a new estimator of the intensity of a random measure. The author proved that this estimator nearly outperforms all the estimators based on the candidate partitions. It was shown that the estimator satisfies an oracle-type inequality with respect to a Hellinger-type distance.
The next author, Ingrid van Keilegom, proposed a consistent estimator for transformation parameters in semiparametric models. She discussed the problem of finding the optimal transformation into the space of models with a predetermined regression structure such as additive or multiplicative separability. Finally, Wolfgang Polonik proposed a new method for locating modal regions in a multivariate data set without pre-specifying their total number.

At the Model Selection in Function Estimation session the following papers were presented: Gerda Claeskens (Belgium), Information criteria for model selection: from fully focussed to blind, Fabrice Gamboa (France), Estimation in a shifted regression model, Winfried Stute (Germany), Model diagnosis for parametric regression in high dimensional spaces. In the first paper the focused information criterion was developed to select the best model for a given estimator. The author demonstrated how to address this problem via weighting methods. The next paper, by Fabrice Gamboa, presented the problem of finding an estimator in a shifted regression model. The proposed estimators were shown to be convergent and asymptotically Gaussian. Finally, Winfried Stute proposed and studied diagnostic tools to check the validity of a parametric regression model.

At the Spatio-Temporal Models session the following papers were presented: Richard Chandler (UK), Space-time modeling using independence and generalized estimating equations, Monica Chiogna and Carlo Gaetan (Italy), Spatio-temporal modeling of epidemiological processes, Eva B. Vedel Jensen (Denmark), Spatio-temporal point processes – with a view to modeling in neuroscience. Richard Chandler explored the use of independence estimating equations (IEEs) as well as generalized estimating equations (GEEs) in a space-time setting. The paper demonstrated the properties of GEEs and illustrated the ideas with climatological examples. The other papers discussed spatio-temporal processes with a view to modeling in neuroscience and epidemiology, using empirical data.

At the Particle Filtering session the following papers were presented: Fredrik Gustafsson (Sweden), Marginalization issues in particle filtering, Jaroslav Krystul (The Netherlands), Interacting Particle System approach for estimating rare events in large scale stochastic hybrid systems, Anastasia Papavasiliou (UK), Particle filters and uniform convergence. The first talk surveyed important model structures and discussed practical issues. Their importance was illustrated by several positioning and target tracking applications, which were solved using the marginalised particle filter. The paper by Jaroslav Krystul studied the problem of estimating small probabilities for stochastic hybrid processes using sequential Monte Carlo simulation based on the Interacting Particle System (IPS). The last speaker, Anastasia Papavasiliou, presented some results regarding the asymptotic stability of the optimal filter and the uniform convergence of particle filters for a particular type of non-ergodic systems.

At the Weakly Dependent Limit Theorems session the following papers were presented: Jerome Dedecker (France), Rates of convergence in the central limit theorem for the minimal L1 distance, Gabriel Lang (France), Weak dependence, definitions and examples, Michael H. Neumann (Germany), Probability and moment inequalities for sums of weakly dependent random variables. The paper by Jerome Dedecker discussed the rates of convergence in the central limit theorem and gave conditions for reaching the optimal rate, especially for random variables having finite third moments. As was shown, these conditions are satisfied for some non-irreducible Markov chains as well as for some dynamical systems. The next author, Gabriel Lang, presented definitions, theorems and empirical examples concerning weak dependence. The presentation was an introduction to the forthcoming book on weak dependence. Finally, the paper by Michael Neumann presented recently developed exponential and moment inequalities for sums of weakly dependent random variables, which are useful tools with many applications in statistics.


At the Random Matrices session the following papers were presented: Vyacheslav Girko (Ukraine), Theory of random matrices and general statistical analysis of random arrays. Twenty years later, Mylene Maida (France), Spherical integrals in the finite rank setting, Jack Silverstein (USA), Eigenvalues of large sample covariance matrices of spiked population models. The paper by Vyacheslav Girko presented a new estimator of General Statistical Analysis. The author developed the analysis under the assumption that the number of parameters can increase with the number of observations, or that the dimension of the observation vector is comparable in magnitude with the sample size. He did not require the observations to have a normal distribution. The paper by Mylene Maida discussed the role of spherical integrals in the framework of several matrix models. The author investigated their asymptotics in the case of finite ranks. Finally, Jack Silverstein presented results on the extreme eigenvalues of a subclass of matrices of sample covariance type.

At the Spatial Statistics session the following papers were presented: Montserrat Fuentes (USA), Methods to approximate a spatial likelihood, Chris Holmes (UK), Modeling of spatial Gaussian processes via spectral decompositions, Rasmus Waagepetersen (Denmark), An estimating function approach to inference for inhomogeneous Neyman-Scott processes. The first paper presented a version of Whittle’s approximation to the Gaussian log likelihood for regular spatial lattices with missing values and for irregularly spaced datasets. The next author, Chris Holmes, described a general framework for constructing non-stationary covariance functions for use in modelling spatial Gaussian processes. He considered two possible decompositions of a stationary covariance function, namely the Fourier and Karhunen-Loève expansions.
The paper by Rasmus Waagepetersen was concerned with inference for a certain class of inhomogeneous Neyman-Scott point processes depending on spatial covariates. This paper was motivated and illustrated by applications based on data from a tropical rain forest plot.

At the Recent Developments in Extremes session the following papers were presented: Laurens De Haan (The Netherlands), On spatial extremes and applications, M. Ivette Gomes (Portugal), Reduced bias tail and extremal index estimation, Manuel Scotto (Portugal), Extremes of integer valued moving average sequences. The first paper discussed specific models for the application of spatial extremes. In the next one Ivette Gomes talked about the estimation of both the tail index and the extremal index through the use of bias reduction techniques. The author used the Generalized Jackknife methodology together with sub-sampling techniques. The aim of the last paper, by Manuel Scotto, was to analyze the extremal properties of integer-valued moving average sequences as discrete analogues of conventional moving averages, with scalar multiplication replaced by binomial thinning.

At the Data Mining session the following papers were presented: Paola Cerchiello (Italy), Statistical methods for classification of unknown authors, Jorge Muruzabal (Spain), Mining general cooperative patterns with evolutionary algorithms. The paper by Paola Cerchiello was devoted to the application of statistical methods for the classification of unknown authors. She proposed a new method based on the combined use of a decision tree and the Kruskal-Wallis test. She evaluated the power of the corresponding multiple test procedure and compared it with other procedures. In the next presentation Jorge Muruzabal described and illustrated an evolutionary algorithm that combines ideas from various learning paradigms and is used to address problems in a classification setting. He compared the new algorithm with rule ensemble learning methods.

At the Randomness in Dynamical Systems session the following papers were presented: Mikhail Gordin (Russia), Limit theorems for dynamical systems: useful structures and tools, Dalibor Volny (France), Martingale approximations of stationary processes, Benjamin Weiss (Israel), Universal prediction for ergodic processes. The paper by Mikhail Gordin discussed certain constructions of Markov operators, in particular by means of homoclinic structures. The author also considered other probabilistic applications of these structures. The next author, Dalibor Volny, considered martingale approximations as one of the most robust methods for the study of the CLT. He discussed the dependence of the approximation on the choice of the filtration, unconditional limit theorems, and exactness of approximation. Benjamin Weiss surveyed some recent work on the problem of universal prediction for ergodic processes.

At the Financial Time Series session the following papers were presented: Peter Brockwell (USA), Continuous-time nonlinear models in finance, Ronald A. Gallant (USA), A statistical inquiry into plausibility of Epstein-Zin-Weil utility, Alexander Lindner (Germany), A continuous time GARCH process of higher order.
Peter Brockwell discussed several families of continuous-time ARMA processes (non-linear autoregressions, stochastic volatility) and GARCH models. The next speaker, Ronald Gallant, used purely statistical methods to answer the question: “Is it plausible that the pricing kernel can be represented as the intertemporal marginal rate of substitution of a representative agent in an endowment economy whose preferences are determined by Epstein-Zin-Weil utility?” The last paper, by Alexander Lindner, introduced a continuous time GARCH model driven by a single Lévy process. The author investigated the autocorrelation of the squared increments of this process. All these papers met with great interest.

At the Machine Learning session the following papers were presented: Gilles Blanchard (Germany), Different generalization error bounds for Support Vector Machine, Pascal Massart (France), A nonasymptotic Wilks phenomenon, Alexandre Tsybakov (France), Aggregation by mirror averaging. The first speaker, Gilles Blanchard, compared different inequalities for bounding the generalization error of the “Support Vector Machine” classification algorithm. The paper by Pascal Massart presented results based on a refinement of Talagrand’s inequality for empirical processes in the context of bounded regression or classification. Finally, Alexandre Tsybakov studied the problem of convex or model selection type aggregation: for a given collection of different estimators or classifiers he constructed a new estimator or classifier which is nearly as good as their best convex combination with respect to a given risk criterion.

At the MCMC Applications session the following papers were presented: Rob Deardon (UK), Modeling the spatio-temporal dynamics of the UK 2001 foot-and-mouth epidemic, John Forster (UK), MCMC for Bayesian ordinal data modeling, John Stephenson (UK), Inferring 3-dimensional geological thermal histories using semi-automative RJMCMC. It is worth mentioning that Rob Deardon presented a newly developed methodology for modeling the spatio-temporal dynamics of infectious disease epidemics. His model quantified the probability of infection of each susceptible individual in a population. This methodology was applied to data from the UK 2001 foot-and-mouth epidemic. The paper by John Forster considered how to compare different conditional independence specifications for ordinal categorical data by calculating a posterior distribution over classes of graphical models. The last author, John Stephenson, presented a new approach for modeling geological thermal histories from fission track data in 2D and 3D. He implemented the approach via an adaptation of Bayesian Partition Modeling using semi-automative reversible jump Markov process. Examples of the methodology in practice were given with both synthetic data and empirical data from Namibia.

At the Random Graphs and Algorithms session the following papers were presented: Mark Jerrum (UK), Sampling perfect matchings, contingency tables and related structures, Michał Karoński (Poland), Random Intersection Graphs, Yuri Pavlov (Russia), The limit distributions of the certain characteristics of the web random graphs.
The paper by Mark Jerrum discussed whether there exists an efficient algorithm for sampling a perfect matching uniformly at random from a given bipartite graph. Next, Michał Karoński explored a model of random intersection graphs in which the vertices are the focus. The paper by Yuri Pavlov also studied random graphs, with N independent vertices and a distribution with a Pareto tail. Studies of the past few years have shown that such models play an important role in adequately describing Internet topology. The author pointed out that the limit distribution of the size of the giant component is normal.

At the Statistics in Genome Sciences session the following papers were presented: Matti Pirinen (Finland), “Estimating genealogies from marker data: a Bayesian approach”; Bernard Prum (France), “Continuous and discrete HMM for genome analysis”; Natalie P. Thorne (UK), “Investigating Illumina’s Bead Arrays: Statistical analyses of data from highly replicated, randomly arranged, bead-based microarrays”. In the first presentation, Matti Pirinen presented a probabilistic method for genealogy reconstruction. Starting with a group of genotyped individuals from a population isolate, he explored the state space of their possible ancestral histories under a Bayesian model. Bernard Prum talked about continuous and discrete Hidden Markov Models for genome analysis. He considered new models in which the length of the memory may depend on the context, such as parsimonious Markov models. Finally, Natalie Thorne presented her work on the statistical analysis of bead-level gene expression data.

At the Empirical Processes Applications session the following papers were presented: Michael Kosorok (USA), “Inference under right censoring for transformation models with a change-point based on a covariate threshold”; Nicolas Vayatis (France), “From classification to ranking: new challenges for statistical learning theory”; Marten Wegkamp (USA), “Classification with reject option”. First, Michael Kosorok considered linear transformation models applied to right-censored survival data with a change-point in the regression coefficient based on a covariate threshold. He established consistency and weak convergence of the nonparametric maximum likelihood estimators. The author also developed Monte Carlo methods of inference for the model parameters. The paper by Nicolas Vayatis discussed the use of classification algorithms such as boosting and SVM for ranking and scoring problems. He applied the presented approach to problems such as credit risk screening and information retrieval. The last talk, by Marten Wegkamp, studied binary classification with a reject option. In particular, he studied empirical risk minimization with a generalized hinge loss.

At the Stein/Chen Method session the following papers were presented: Andrew Barbour (Switzerland), “Stein’s method and total variation approximation”; Gesine Reinert (UK), “Chi-square approximations with Stein’s method”. The first paper presented a generalized version of the mean estimator of the Wiener sheet. The author discussed the properties of the new estimator and compared it with previous proposals.
The paper by Gesine Reinert set up Stein’s method for chi-square approximations, improving on existing bounds on the solution of the Stein equation for the Gamma distribution. The presented method was applied to derive an explicit bound on the distance to chi-square for Pearson’s chi-square statistic.

At the Graphical and Algebraic Models for Multivariate session the following papers were presented: Mathias Drton (USA), “Binary Models for Marginal Independence”; Seth Sullivant (USA), “Sequential importance sampling for multiway tables”. Mathias Drton presented a new model class (binary models) providing a framework for modelling marginal independences in contingency tables. He showed that in many respects the resulting models are dual to graphical log-linear models. The paper by Seth Sullivant described an importance sampling algorithm for generating random tables with fixed margins. The author discussed how basic algebraic properties of the associated toric ideal can be used to deduce useful information about the performance of the algorithm.

At the last session, Small Sample Problems, the following papers were presented: Barry C. Arnold (USA), “Parameter estimation in certain conditionally specified models”; Erhard Cramer (Germany), “Recent developments in progressive censoring”. The paper by Barry Arnold provided effective methods of parameter estimation in conditionally specified models. The author talked about the complications in inference for such models that arise because the appropriate normalizing constant is not available in analytic form. The next speaker, Erhard Cramer, discussed some approaches to assessing a censoring scheme. Moreover, he considered some extensions of progressive censoring to the non-i.i.d. case.

More than 110 contributed papers were also presented during the Conference. They were divided into 40 Contributed Sessions:
• Limit Theorems (3 sessions)
• Estimation (4 sessions)
• Learning and Classification (2 sessions)
• Extremes (2 sessions)
• Empirical Process Techniques for Dependent Data (1 session)
• Statistics in Geophysics (1 session)
• Recent Developments in Time Series (1 session)
• Markov Chain Monte Carlo (1 session)
• Regression Analysis (2 sessions)
• Network Tomography (1 session)
• Distribution Theory (1 session)
• Nonparametric Methods (1 session)
• Survival Models (1 session)
• Hypothesis Testing (2 sessions)
• Time Series (4 sessions)
• Statistics in Finance (2 sessions)
• Stochastic Geometry (1 session)
• Branching Processes (1 session)
• Random Fields (1 session)
• Markovian models in engineering (1 session)
• Statistics of stochastic processes (1 session)
• Statistics in Genetics (1 session)
• Multivariate Analysis (1 session)
• Copulas (1 session)
• Stochastic Processes (1 session)
• Decision Problems (1 session)
• Ergodicity, Entropy (1 session)

Moreover, a Special Invited Session on the History of Polish Statistics was included in the Meeting. The following papers were presented:
• Czesław Domański and Tomasz Kozdraj, Development of Polish statistical thought,
• Józef Pociecha, Development of multivariate classification methods in Poland, and
• Józef Oleński, Functions of official statistics in the information infrastructure of the state.

The papers presented during this session discussed the important role of Polish statisticians in the development of world statistics and probability.

Prepared by Jacek Białek, University of Łódź, Poland


STATISTICS IN TRANSITION, December 2006 Vol. 7, No.6, pp. 1429—1434 REPORT

THE FOURTH CONFERENCE ON SAMPLING METHODS IN ECONOMIC AND SOCIAL SURVEYS

ON THE 25TH ANNIVERSARY OF THE DEATH OF PROFESSOR ZBIGNIEW PAWŁOWSKI

September 11—12, 2006, Katowice, Poland

The conference was organised by the Department of Statistics at the University of Economics in Katowice, the Department of Statistical Methods at the University of Łódź and the Polish Statistical Association. It took place at the University of Economics in Katowice and brought together Polish and foreign statisticians active in the field of survey sampling. It was organised in response to the growing need for efficient and reliable data collection procedures, with the general objective of creating opportunities to present the latest achievements and exchange experience on practical applications of survey sampling methods.

The conference was dedicated to the memory of Professor Zbigniew Pawłowski on the 25th anniversary of his death. One of its special goals was to commemorate his scientific and didactic achievements. To this end, a special session was held with two talks devoted to the Professor. Janusz Wywiał presented the biography of Professor Pawłowski with a detailed account of his outstanding scientific works in several fields, including econometrics, prediction theory and sampling methods, as well as his didactic achievements, including several pioneering textbooks that are still an inspiration for contemporary Polish statisticians in their research. The talk of Jan Kordos focused on the Professor’s unique personality and on personal reminiscences of a long-lasting scientific co-operation with him.

During the proceedings, two invited lectures were held. Both were associated with issues that dominate sampling theory nowadays: estimation for small domains and the use of auxiliary information to construct estimates of population parameters. The first one, titled “Designing Surveys for Small-Area Estimation”, was presented by Nicholas T. Longford from Pompeu Fabra University in Barcelona. The lecture concentrated on the study of populations in the case of limited resources, with special emphasis on small area estimation problems. An approach based on inferential priorities specifying the relative importance of precision in estimating small area targets was discussed. The second invited lecture, titled “Estimation of the Ratio Using Auxiliary Variables”, was presented by Danutė Krapavickaitė and Aleksandras Plikusas from Vilnius University. It was devoted to calibration estimation of finite population parameters. The central issue was the estimation of the ratio of two totals of study variables using auxiliary information. Four alternative estimators of this parameter were discussed and their properties were compared. Some generalizations and estimators of other parameters were also considered.

Several contributed papers were also presented. Most of them were dedicated to various issues in small area estimation. The paper “Synthetic and Composite Estimation under a Superpopulation Model”, presented by Jacek Wesołowski, was devoted to mathematically precise formulas for synthetic and composite estimators of the small area mean and their MSEs. The Best Linear Unbiased Estimators (BLUE) and Best Linear Unbiased Predictors (BLUP) were considered and an optimal strategy was discussed. Results for the special case of symmetric designs were also presented. The paper “Approximating variance for measures of income inequality: An Approach based on the theory of M-estimators and U-Statistics” by Wojciech Niemiro and Robert Wieczorkowski dealt with variance assessment for several income inequality indices. A new approach to this problem was proposed. The results of a simulation study supporting the analytical findings were presented. An application of the proposed approach to confidence intervals was also discussed. Another paper on small area estimation, titled “Similarity of small areas in indirect estimation”, was presented by Krystyna Pruska.
A measure of similarity between domains, based on the number of runs in the sequence of observations representing joint domains, was introduced, and a method for selecting groups of similar domains was developed. It was then proposed to base the estimation process on the identified groups of similar domains. The precision of six synthetic estimators (three of them based on the proposed approach) was compared using computer simulation. Dorota Bartosińska, in her paper “An attempt of using small area estimation methods in agricultural sampling surveys in Poland”, considered the possibilities of improving the precision of small area parameter estimates using data from the previous population census. Several estimators were considered, including direct, empirical Bayes and hierarchical Bayes estimators. A significant improvement in precision was observed. In another paper on small area estimation, titled “On Accuracy of EBLUP under Random Regression Coefficient Model”, Tomasz Żądło analysed the accuracy of the empirical best linear unbiased predictor of a domain total under a superpopulation model with random regression coefficients, which is a special case of the general linear mixed model (GLMM). The simulation study was based on real data on Polish farms from the Dąbrowa Tarnowska region obtained in the agricultural census.

Some papers dealt with other issues, not necessarily limited to small area estimation. The paper of Jan Kordos entitled “Are Errors Really Normally Distributed?” was devoted to sampling and non-sampling errors. He stressed that the central limit theorem justifies treating sampling errors as normally distributed, whereas non-sampling errors usually are not normally distributed. He pointed out that in some books and articles both sampling and non-sampling errors are treated as normally distributed, and demonstrated why this is incorrect, with special attention to non-sampling errors. In the paper of Jan Kowalski, “Recurrence in Sampling on Successive Occasions — Cascade Patterns”, the BLUE estimation of the population mean based on all the past information obtained by sampling on more than two occasions was discussed. Limitations of the approach based on finding linear recurrence relations involving estimators on successive occasions were identified. A certain regular class of sample rotation schemes called cascade patterns was introduced. Janusz Wywiał presented a paper “Simulation Analysis of Accuracy Estimation of Population Mean on the Basis of the Strategy Dependent on Sampling Design Proportionate to the Order Statistic of an Auxiliary Variable”, devoted to estimation strategies involving sampling designs dependent on the sample quantile of an auxiliary variable. Three strategies were proposed and their properties were compared by means of computer simulation. The optimization of sample sizes in sequential stratified sampling with respect to cost and variance was considered in the paper “Estimation of the Vector of Means in Sequential Stratified Sampling” by Marcin Skibicki. Sample allocation and selection procedures, estimators and their properties were discussed.
The talk of Andrzej Iwasiewicz titled “Reliability of methods of acquiring and generating binary information” was devoted to the relations between the true state of the object being studied and the binary information describing it. Possible discrepancies between them, taking the form of misclassification errors, and their consequences were discussed. Conclusions concerning the properties of estimators were formulated. The paper “Extending Sudden Death Testing on Non-Weibull Populations” by Ryszard Motyka and Antoni Drapella dealt with the N. L. Johnson method for shortening survival tests so as to produce non-censored data within an acceptable time interval. Its extension to non-Weibull distributions was proposed. Wojciech Gamrot, in his paper “On Asymptotic Properties of Standard Deviation Estimates under Double Sampling for Nonresponse”, discussed the estimation of the finite population standard deviation when the sample data is incomplete. Two estimators of this parameter based on two-phase samples were proposed. Their MSEs and second-order biases were derived.

Wojciech Gamrot


Zbigniew Pawłowski’s selected scientific works

Professor Zbigniew Maria Pawłowski lived from 1930 to 1981. His scientific career began at the Central School of Planning and Statistics (at present the Warsaw School of Economics) in Warsaw, where he first studied and then began working as a deputy assistant in the Statistics Department. There he obtained his doctoral degree in 1957 and, in 1962, a postdoctoral degree (then referred to as that of an assistant professor of economics). From 1962 he continued his career at the Higher School of Economics (later renamed the University of Economics) in Katowice, where he took the position of head of the Department of Statistics. In the year 1957 he received the title of associate professor and in 1972 that of full professor (professor ordinarius).

Among Polish econometricians, mathematicians and statisticians, the prevailing opinion is that Professor Zbigniew Pawłowski was one of the pioneers of econometrics in Poland. He was very active not only in research but also in organisational work. He was a co-organiser of numerous scientific conferences, including the well-known conference of the departments of econometrics of the Katowice, Kraków and Wrocław schools of economics. The conference has been organised every year since then. Z. Pawłowski cooperated with, among others, the Central Statistical Office in Warsaw, the International Institute for Applied Systems Analysis in Laxenburg and the Netherlands Economic Institute in Rotterdam. He was a member of the editorial staff of several publishing houses.

Many of his personal scientific successes have become a permanent part of the methods of statistical inference. They have inspired several young research workers. He taught many future scientists and was an eminent scientist himself. Here we present some selected examples of his achievements. Many of Z. Pawłowski’s academic handbooks had several editions and some were translated into Russian, Hungarian and German.
His knowledge was far from superficial, as he dealt with many detailed questions of econometrics and statistics. That was reflected in his academic lectures as well as in many scientific articles and monographs.

Many of Z. Pawłowski’s works concern mathematical statistics. One of the most interesting proposals among them, described in the article “A non-parametric statistical test of the hypothesis on several autocorrelation coefficients” (in Polish, Przegląd Statystyczny 1974, pp. 189–209), is a statistical test for verifying the hypothesis on the occurrence of autocorrelation in a time series. It is worth underlining that this test may be used for a simultaneous verification of the hypothesis about the occurrence of autocorrelation of the first order or of higher orders. In the paper “The power of some tests of normality for a large sample” (in Polish, Przegląd Statystyczny 1959, pp. 141–150) he proposed some test statistics to verify the hypothesis of normality based on Geary’s well-known theorem on the independence of the sample mean and the sample variance. In this article he also did something rare: he estimated the power of a goodness-of-fit test by means of analytic formulae. It is well known that this is usually done by means of computer simulation.

Z. Pawłowski also realised the need to develop inference based on non-simple samples, which resulted in a handbook on survey sampling, Introduction to Survey Sampling (in Polish, PWN Warsaw 1972). In the light of the latest trends in statistics, we can notice that some problems of econometrics can be adapted to predicting the characteristics of finite populations, for example total tax revenues or total agricultural output. A model-based approach is widely used in the branch of survey sampling referred to as small area estimation. The achievements in the field of econometrics, including those of Z. Pawłowski, are therefore very useful in survey sampling.

Z. Pawłowski devoted many scientific works to constructing, estimating and using econometric models in practice. He considered, for example, the use of econometric models in production management. He dealt with this problem in his book Econometric Analysis of Production Process (in Polish, PWN, 1971 and 1976). Moreover, he discussed the practical uses of econometric models of macroeconomic phenomena when he was heading a team which built one of the first models of the Polish economy. The results of the team’s research were published in a collective work edited by him, Econometric Model of Polish Economy (in Polish, PWN Warsaw 1968). In his work A Demo-econometric Model of Poland and its Application to Counterfactual Simulation (IIASA Laxenburg 1980) Zbigniew Pawłowski also drew attention to the necessity of taking demographic variables into consideration when constructing macroeconomic models of an economy.

Z. Pawłowski’s achievements concerning forecasting are considerable.
Besides the well-known classical rules, he promoted a rule that places the forecast in the vicinity of the mode (dominant) of the forecast variable. He suggested that this rule should be especially useful in short-term forecasting. He proposed interesting ideas for making so-called optimistic and pessimistic forecasts; these depend on favourable or unfavourable (to the development of the phenomenon described by the variable being explained) arrangements of the values of explanatory variables in the model on which the predictor is based. Almost all of his ideas in this field can be found in the following monographs: Econometric Forecasting (in Polish, PWN Warsaw 1973) and Introduction to the Theory of Prediction (in Polish, PWN Warsaw 1982).

In particular, Z. Pawłowski proposed a method of predicting so-called change points of a time series in the article “Prediction by means of control charts” (in Polish, Przegląd Statystyczny 1969). He introduced a definition of the so-called flexibility of a predictor. This is important from the point of view of someone who is choosing a predictor to forecast, for example, a time series characterised by an unstable trend. He also considered the forecasting horizon, among others in his work “On the Concept of Horizon of Prediction” (Systems Science 1979, pp. 81–90). Z. Pawłowski discussed at length the problem of so-called alternative forecasts, among others in his works The Use of Alternative Predictions in Long-Term (IIASA, Laxenburg 1978) and “Contribution to the theory of alternative predictions” (Oeconomica Polona, 1977, pp. 381–400).

Z. Pawłowski highly appreciated the role of an ex-post analysis of prediction errors, especially as a means of selecting a better method of forecasting different phenomena in subsequent periods. This was discussed in his article “On the use of ex-post information in econometric prediction” (in Contributed Papers, 40th Session of the International Statistical Institute, Warsaw 1975, pp. 656–660). He formulated an interesting question of how to determine admissible values of the explanatory variables of an economic model in such a way that the value of the explained variable exceeds a required level. He called this discriminatory prediction and described it, among others, in his work “Discriminatory prediction and its relation to optimum control of economic systems” (Control and Cybernetics 1979, pp. 55–66). In the article entitled “An analysis of a sequence of forecasts” (in Polish, Ekonomista 1974, pp. 847–874) Z. Pawłowski analysed the so-called consistent forecast. The problem may be described simply as making a forecast which is the intersection of interval forecasts obtained by different methods. In this context, he also analysed so-called additional forecasts, i.e. those which are made successively as the period they refer to comes nearer.

It seems that each of the problems sketched above is still topical for statistical and econometric studies. Professor Pawłowski was a teacher for many of us and his scientific output is certainly worth looking into.

In the text above we have used Professor Z. Pawłowski’s biography and a detailed description of his scientific career, which can be found in an article by A.S. Barczak (Przegląd Statystyczny XXIX 1982, in Polish), in the papers by A.S. Barczak, J. Kordos and Z. Hellwig (Wiadomości Statystyczne nr 10, 1981, in Polish) and in the book edited by S. Kwiatkowski (Wojewódzki Urząd Statystyczny w Łodzi – Polskie Towarzystwo Statystyczne, Łódź, in Polish). Zbigniew Pawłowski’s scientific works are listed in the article by A.S. Barczak (Przegląd Statystyczny XXX 1983, in Polish).

Prepared by Janusz Wywiał


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1435—1438

ANNOUNCEMENT

Second Baltic-Nordic Conference on Survey Sampling

Pre-Course in Helsinki: 31 May — 1 June 2007 Conference in Kuusamo, Finland: 2—7 June 2007

Location and Conference Venue

The conference takes place in Kuusamo, which is located about 800 km north-east of Helsinki. The conference venue is Holiday Club Kuusamon Tropiikki, Finland. The hotel offers not only excellent conference facilities but also exciting options for free-time wellness and fitness activities. Hotel services include free spa services (included in room prices). The hotel provides a warm oasis in the middle of the beautiful northern Finnish landscape.

Aims and Scope

BaNoCoSS-2007, the Second Baltic-Nordic Conference on Survey Sampling, is a scientific conference presenting current developments in: (i) design and analysis of complex surveys, and (ii) use of auxiliary information in survey sampling, with applications to empirical research and statistics production. The conference aims to provide a platform for discussion and exchange of ideas for a variety of people. These include, for example, statisticians, researchers and other experts from universities, national statistical institutes, research institutes and other governmental bodies, and private enterprises, dealing with survey research methodology, empirical research and statistics production. University students in statistics and related disciplines are an important target group of the conference. The First Baltic-Nordic Conference on Survey Sampling was held in 2002 in Ammarnäs, Sweden. BaNoCoSS-2007 is organized by the University of Helsinki, the Baltic-Nordic Network in Survey Sampling, Statistics Finland and the Finnish Statistical Society. Sponsors include the International Association of Survey Statisticians (IASS) and the Academy of Finland.

Keynote Speakers

• Prof. Harvey Goldstein, University of Bristol, UK
• Prof. Carl-Erik Särndal, University of Montreal, Canada
• Dr Jean-Claude Deville, INSEE/ENSAI Ecole nationale de la statistique et de l'analyse de l'information, France


Additional keynote speakers will be announced. Each keynote speaker will give four lectures. Additional speakers will be invited to give lectures on timely topics in survey sampling and survey methodology.

Call for Papers

The programme will cover survey sampling in a wide sense. The programme consists of sessions of invited papers, contributed papers and posters. Participants are encouraged to submit contributed papers or posters. If you wish to present a paper or poster, please submit a one-page abstract by email using address [email protected] before 16 April, 2007.

Topics of contributed papers or posters include, for example: Business survey methodology, Calibration techniques, Combining data from surveys and registers, Design and analysis of complex surveys, Design and estimation strategies using auxiliary information, Edit and imputation techniques, Estimation for domains and small areas, Estimation in the presence of nonresponse, Internet and web surveys, Longitudinal and panel surveys, Measurement errors in surveys, Methods for international comparison, Multilevel and hierarchical modelling, Non-parametric methods in survey analysis, Questionnaire development and testing, Sample surveys in special fields, Software for survey sampling and analysis, Statistical disclosure control, Survey data mining, Variance estimation, Weighting strategies.

Announcement of paper acceptance will be given by 30 April 2007. The conference language is English.

Important Dates

• 31 March 2007: Early registration ends (reduced registration fee applies)
• 1 April–2 June 2007: Late registration (higher registration fee applies)
• 16 April 2007: Deadline for submission of titles and abstracts of contributed papers and posters
• 30 April 2007: Announcement of paper acceptance
• 31 May–1 June 2007: Short Course on Multilevel Modelling (Helsinki, Finland)
• 2–7 June 2007: BaNoCoSS Conference (Kuusamo, Finland)

Scientific Committee

• Timo Alanko, Statistics Finland, Helsinki
• Signe Bāliņa, University of Latvia, Riga
• Jan Bjørnstad, Statistics Norway, Oslo
• Dan Hedlin, Statistics Sweden, Stockholm
• Annica Isaksson, Statistics Sweden, Stockholm
• Danutė Krapavickaitė, Institute of Mathematics and Informatics, Vilnius
• Gunnar Kulldorff, University of Umeå
• Seppo Laaksonen, University of Helsinki
• Jānis Lapiņš, Bank of Latvia, Riga
• Risto Lehtonen (Chair), University of Helsinki
• Peter Linde, Statistics Denmark
• Aleksandras Plikusas, Institute of Mathematics and Informatics, Vilnius
• Lauri Tarkkonen, University of Helsinki
• Daniel Thorburn, University of Stockholm
• Imbi Traat, University of Tartu
• Jan Wretman, University of Stockholm

Tentative Program

Saturday 2 June 2007
• Afternoon: Arrival in Kuusamo; Registration
• 18.00–19.30: Welcome Reception

Sunday 3 June 2007
• 9.00–13.30: Opening; Conference Sessions
• 13.30–: Excursion to Oulanka National Park

Monday 4 June 2007
• 9.00–17.00: Conference Sessions

Tuesday 5 June 2007
• 9.00–17.00: Conference Sessions
• 18.00–: Excursion to Ruka

Wednesday 6 June 2007
• 9.00–17.00: Conference Sessions
• 18.30–: Conference Dinner

Thursday 7 June 2007
• 9.00–13.30: Conference Sessions; Closing; Departure

Pre-Course on Multilevel Modelling

A two-day pre-course, Multilevel modelling in social, behavioral and economic research, will be organized on Thursday and Friday, 31 May–1 June 2007. Venue: University of Helsinki, the Kumpula Campus. Main lecturer: Prof. Harvey Goldstein (University of Bristol). Participation is free of charge.

Registration Information

Registration protocol is available at the conference website. On-line registration is recommended.

Preliminary Registration Fees
• Regular participants: EUR 220 by 31 March 2007, EUR 250 after 31 March 2007.
• Full-time students: EUR 120.
• Accompanying persons and children 15+ years: EUR 80 by 31 March 2007, EUR 100 after 31 March 2007.
• Children 7–15 years: EUR 50.

The fees include sessions, conference material, refreshments, social program and conference dinner.


Grants for partial financing of participation of students (possibilities to be announced).

Accommodation Information

The Conference Secretariat has made an early reservation for accommodation at Holiday Club Kuusamon Tropiikki for the conference participants, at special rates of EUR 74 per night for a single room and EUR 46 per night per person for a double room. Instructions on booking procedures are available at the conference website.

Travel Information

Most international flights to Finland land at Helsinki international airport. Detailed travel information to Kuusamo is available at the conference website.

Organizing Committee

• Kari Djerf, Statistics Finland, Helsinki
• Tarja Hämäläinen, University of Helsinki
• Seppo Laaksonen, University of Helsinki
• Risto Lehtonen (Chair), University of Helsinki
• Lauri Tarkkonen, University of Helsinki
• Maria Valaste (Secretary), University of Helsinki
• Kimmo Vehkalahti, University of Helsinki

Contact Addresses

Postal Address: Conference Secretariat, P.O. Box 54 (Unioninkatu 37), FI-00014 University of Helsinki, Finland.
Fax: +358-9-191 24872.
Email: [email protected]
Web site: http://www.mathstat.helsinki.fi/msm/banocoss/

Prepared by Risto Lehtonen, University of Helsinki


STATISTICS IN TRANSITION, December 2006 Vol. 7, No. 6, pp. 1439—1441

ACKNOWLEDGEMENTS

Referees of Volume 7

The Editorial Board wishes to thank the following referees, who have generously given their time and skills to Statistics in Transition during the period from January 2005 to December 2006.

Timo Alanko, Statistics Finland, Finland
Czesław Bracha, Warsaw School of Economics, Poland
Ray Chambers, University of Southampton, UK
Stig Danielsson, Linköping University, Sweden
Kari Djerf, Statistics Finland, Finland
Czesław Domański, University of Łódź, Poland
Sławomir Dorosiewicz, Warsaw School of Economics, Poland
Stefano Falorsi, ISTAT, Rome, Italy
Ewa Frątczak, Warsaw School of Economics, Poland
Wojciech Gamrot, Academy of Economics, Katowice, Poland
Elżbieta Gołata, University of Economics, Poznań, Poland
Marek Góra, Warsaw School of Economics, Poland
Marek Gruszczyński, Warsaw School of Economics, Poland
Dan Hedlin, Statistics Sweden, Sweden
Johan Heldal, Statistics Norway, Oslo, Norway
Montserrat Herrador, National Statistical Institute of Spain (INE), Spain
Anders Holmberg, Statistics Sweden, Sweden
Krzysztof Jajuga, Wrocław University of Economics, Poland
Alina Jędrzejczak, University of Łódź, Poland
Sven-Erik Johansson, Karolinska Medical University, Sweden
Graham Kalton, WESTAT, Inc., Washington, USA
Irena Kasperowicz-Ruka, Warsaw School of Economics, Poland
Jan Kordos, Warsaw School of Economics, Poland
Irena Kotowska, Warsaw School of Economics, Poland
Jerzy Korzeniowski, University of Łódź, Poland
Marek Kozak, Warsaw Agricultural University, Poland
Liliana Kursa, Central Statistical Office, Poland
Seppo Laaksonen, Statistics Finland
Janis Lapins, Bank of Latvia, Latvia
Risto Lehtonen, University of Helsinki, Finland
Małgorzata Misztal, University of Łódź, Poland
Marek Męczarski, Warsaw School of Economics, Poland
Nick Longford, University of Pompeu Fabra, Barcelona, Spain
Domingo Morales, Universidad Miguel Hernandez de Elche, Spain
Mikko Myrskylä, Statistics Finland, Finland
Wojciech Niemiro, Warsaw University, Poland
Kari Nissinen, Statistics Finland, Finland
Lucyna Nowak, Central Statistical Office, Poland
Jerzy Nowakowski, Warsaw School of Economics, Poland
Paul Ollila, Statistics Finland, Finland
Tomasz Panek, Warsaw School of Economics, Poland
Jan Paradysz, University of Economics, Poznań, Poland
Dariusz Parys, University of Łódź, Poland
Dorota Pekasiewicz, University of Łódź, Poland
Richard Platek, Formerly Statistics Canada, Canada
Jarosław Podgórski, Warsaw School of Economics, Poland
Waldemar Popiński, Central Statistical Office, Poland
Krystyna Pruska, University of Łódź, Poland
J. N. K. Rao, Carleton University, Canada
M. P. Singh, Statistics Canada, Canada
Teresa Słaby, Warsaw School of Economics, Poland
Honorata Sosnowska, Warsaw School of Economics, Poland
Kaja Sõstra, University of Tartu, Estonia
Czesław Stępniak, M. Curie-Skłodowska University, Lublin, Poland
Adam Szulc, Warsaw School of Economics, Poland
Mirosław Szreder, University of Gdańsk, Poland
Daniel Thorburn, Stockholm University, Sweden
Imbi Traat, Tartu University, Estonia
Ari Veijanen, Statistics Finland, Finland
M.R. Verma, Umiam (Barapani), Meghalaya, India
Vijay Verma, Consultant in Survey Methodology, India
Henryka Wanke, Central Statistical Office, Poland
Jacek Wesołowski, Warsaw University of Technology, Poland
Robert Wieczorkowski, Central Statistical Office, Poland
Janusz Witkowski, Central Statistical Office, Poland
Janusz Wywiał, Academy of Economics, Katowice, Poland
Jan Wretman, Stockholm University, Sweden
Aleksander Zeliaś, Cracow University of Economics, Poland
Li Chun Zhang, Statistics Norway, Oslo, Norway
Agnieszka Zgierska, Central Statistical Office, Poland