University of Klagenfurt Department of Mathematics

Sixth Austrian, Hungarian, Italian and Slovenian Meeting of Young Statisticians

PROCEEDINGS

Friday, Oct. 5th 2001 - Sunday, Oct. 7th 2001
Ossiach, Carinthia, Austria

Scientific Program Committee

Anuška Ferligoj, University of Ljubljana, Slovenia; Dario Gregori, University of Trieste, Italy; Rose-Gerd Koboltschnig, University of Klagenfurt, Austria; Tamás Rudas, Eötvös Loránd University, Budapest, Hungary

Organizing Committee

Albrecht Gebhardt, Rose-Gerd Koboltschnig, Jürgen Pilz, Beate Simma

With the contribution of

University of Klagenfurt, Forschungsförderungskommission, Kärntner Universitätsbund, SAS Austria

Contents

Linear Mixed Models in Efficiency Analysis: Evidence from Validation Procedures 5 Luca Grassetti

L∞ regression and mixture index of fit 33 András Low

Respondent selection within the household - A modification of the Kish grid 51 Renata Nemeth

Network Effects on the Dynamics of Slovenian Corporate Ownership Network 65 Marko Pahor

Asymptotically Efficient Generalised Regression Estimators 77 G. E. Montanari and M. G. Ranalli

Evaluation of quality cancer care using administrative data: an example of breast, rectum and lung cancer 92 Rosalba Rosato

Parallel Computing in Spatial Statistics 100 Nadja Samonig

Non-additive probability 116 Damjan Škulj

Analysis and visualization of 2-mode networks 131 Matjaž Zaveršnik, Vladimir Batagelj and Andrej Mrvar

Preface

Rose-Gerd Koboltschnig
Department of Mathematics, University of Klagenfurt, Austria

This meeting was set up in 1996 by the University of Ljubljana, Slovenia, and the Technical University of Graz, Austria. In the following years Hungary and Italy joined the group of countries participating in the initiative. The meeting is dedicated to young statisticians actively doing research in theoretical and applied statistics who are willing to present their work in English. At the same time, it also aims to improve the exchange of cultural and scientific knowledge among all participants. For these reasons the program has been designed to allow plenty of time for discussions, exchange of information and for making scientific contacts. As a rule, three young statisticians from every participating country are invited to give an insight into their research activities.

It has become a tradition since the Fifth Meeting of Young Statisticians to publish the talks given at the meeting in a proceedings volume. We are glad to continue this tradition: enclosed please find the written versions¹ of the talks given at the Sixth Austrian, Hungarian, Italian and Slovenian Meeting of Young Statisticians, which was held in Ossiach, Carinthia, Austria, hosted by the Department of Applied Statistics at the University of Klagenfurt, Austria. The present volume again covers a wide range of topics, from mathematical to applied and computational statistics, thus giving a good overview of the work at the universities of the participating countries.

It is a great pleasure to express our sincere thanks to all participants of the meeting in Ossiach, in particular to all speakers and discussants, who actively contributed to the success of our meeting!

On behalf of the Organizing Committee,

A. Gebhardt R.-G. Koboltschnig

J. Pilz B. Simma

¹ Online versions at http://www-stat.uni-klu.ac.at/Tagungen/Ossiach/proceedings.html

Linear Mixed Models in Efficiency Analysis: Evidence from Validation Procedure

Luca Grassetti
Department of Statistics, University of Udine, Italy

Abstract

The studies in the field of efficiency estimation that precede this paper are based on the classical literature on production and cost frontier models (see Aigner et al., 1977). All these studies are based on data aggregated at the level of the observation unit (e.g. hospitals). What we noticed is that, very often, the observed structures are organised hierarchically. From this point of view we can affirm that the data, too, have an underlying hierarchical structure. In this work we estimate some models based on disaggregated data about health service provision. In the estimation we use a vector of ward-level inputs (beds) and a matrix of aggregated inputs (staff, equipment and offered services). As output we use three different variables: the number of inpatient days, the number of inpatients and the financed funds.

The aim of this article is to justify the use of hierarchical models in hospital efficiency estimation, giving a quantification of the aggregation bias problem. In this direction we first compared the results of different estimation methods (with particular attention to the aggregated versus disaggregated comparison) and then adapted two different validation methods to the estimation models used in previous work: jackknife and simulation methods.

1 Introduction

The aim of our previous studies (Gori, Grassetti and Rossi, 2001) has been to identify a model to study the "productive" attitude (technical efficiency) of the health structures of the Italian region Lombardia. In particular, we tried to borrow some methods from production analysis based on frontier building (see Aigner and Chu, 1968). The two main methods adopted in the literature are parametric and non-parametric; in particular we focused on stochastic frontier models and DEA approaches. Given the particular type of production treated in this study, the use of these kinds of methods cannot be the solution to the problem of analysing the hospitals' efficiency. Our intention is to help decision makers with their evaluation, trying to identify some abnormal observations. Looking at a hospital as a simple service supplier, we can postulate that: "For a given amount of input all hospitals (supposing stability in other possible variables) can produce the same amount of output." The application of this simple (economic) rule to the health sector is probably the main source of discussion and disagreement: the solution may be found by using the results in a concrete way, that is, by considering them as an indicator of the need for more investigation in cases far from the average. Naturally, studies in this direction can be found in the literature (see Table 1). Many of these are based on cost function estimations, but in our case costs are not available at the aggregation level we used in the analysis.

Table 1: Some references about efficiency measurement in Health Sector

Authors | Publ. Year | Approach | Data
Banker R.D., R.F. Conrad and R.P. Strauss | 1986 | DEA and Translog Function | North Carolina Hospitals
Byrnes P. and V. Valdmanis | 1994 | DEA cost frontier | 123 U.S. Hospitals
Burgess Jr. J.F. and P.W. Wilson | 1996 | DEA | 137 U.S. Hospitals
Eakin B.K. and T.J. Kniesner | 1988 | Parametric cost function | 331 U.S. Hospitals between years 1975-76
Färe R., S. Grosskopf, B. Lindgren and P. Roos | 1995 | DEA, Malmquist Indexes | 17 Swedish Hospitals between years 1970 and 1985
Ferrier G.D. and V. Valdmanis | 1996 | DEA | 360 U.S. Hospitals
Grosskopf S., D. Margaritis and V. Valdmanis | 1996 | Distance Functions | New York and Californian Hospitals
Grosskopf S. and V. Valdmanis | 1987 | Nonparametric | 82 Californian Hospitals
Ozcan Y.A. and M.J. McCue | 1996 | DEA | 19 U.S. Hospitals
Valdmanis V. | 1992 | Stochastic DEA | U.S. Hospitals
Zuckerman, Hadley and Iezzoni | 1994 | Stochastic Parametric Cost Frontier | 1600 U.S. Hospitals from a stratified random sample

What we will determine using frontier analysis is a simple ranking of observed units. Obviously results change using alternative methods so we will present three alternatives:

• stochastic parametric approach on data aggregated at hospital level;

• DEA method, always on aggregated data;

• hierarchical model specification on data at ward level with some inputs shared within the hospitals.

In this paper we want to develop arguments about the dataset used (section 2), the specification of the methods (section 3) and some application results (section 4). Finally we will try to quantify the possible bias due to aggregation (of quantities originating from a hierarchical structure) with resampling methods.

2 Data

In our empirical analysis we used a set of observations from the population of Region Lombardia hospitals. Almost all the structures are present in our dataset, but errors in encoding, missing data and problems connected to the compatibility of data sources caused the loss of some observed units. After having reduced these problems, we obtained a set of 178 hospitals. All the data required for the analysis are available only for a two-year period (1997-1998). After a preliminary study we focused our attention on the year 1997 (in the 1998 data we found too many suspicious observations). The particularity of this set is the hierarchical structure that characterises hospitals: the output, provided by each ward present in the hospitals, is recorded separately, as is the input "number of beds". This means that the datasets used in standard statistical analyses, which perform the analysis at hospital level, are obtained from an aggregation procedure. This procedure frequently introduces biases in the datasets and, at the same time, causes a loss of knowledge about the data generating process (DGP). In the dataset we can identify six different types of typical Italian structures (the numbers in brackets are the counts of each type):

• Private and Public Clinics (54)

• Hospitals (88)

• Local Health Agencies (20)

• Classified Hospitals (6)

• Private Research Institutes (12)

• Public Research Institutes (5)

In order to show some structural differences between these units we can analyse the variable “number of beds”. In Table 2 we summarize descriptive statistics about each group.

Table 2: Descriptive statistics for variable “number of beds”

Structure Type | N. of Hospitals | Mean | Stand. Dev. | Min. | Max.
Private and Public Clinics | 54 | 163.31 | 93.15 | 30 | 420
Classified Hospitals | 6 | 259.83 | 103.78 | 100 | 378
Public Research Institutes | 12 | 217.08 | 253.19 | 40 | 993
Hospitals | 88 | 372.75 | 374.79 | 31 | 2142
Private Research Institutes | 5 | 589 | 490.71 | 80 | 1318
Local Health Agencies | 20 | 186.5 | 148.72 | 32 | 533

In our analysis we distinguish between three kinds of outputs:

• inpatient number (nd);

• inpatient days (ggd);

• financed funds (fin).

Hypotheses about which output best suits the model specification may cause never-ending discussion. Looking at the problem from a productive point of view, we can say that inputs are used proportionally to the presence of inpatients (days). But it is also possible to consider every inpatient as a source of work (number of cases). Another consideration is that, given the "new" funds assignment method (based on the DRG - disease related group - definition), we could consider the analysis of the weighted number of inpatients (fin) to perform the efficiency analysis. Therefore in the next paragraphs we will consider all these possibilities.

All considered outputs are available in disaggregated form: each inpatient is recorded in the archives. For practical purposes we will aggregate these records first at ward level and then at hospital level.

Input data have been collected by searching two different archives. The first consequence of this situation is that some data are recorded at ward level and some others at hospital level only. The inputs that we find at hospital level only, and that are shared between the wards of the same hospital, are:

• hospital staff (originally subdivided in different categories) - pers;

• equipments - TOTApp;

• offered services - serv.

The variable "equipment" could be defined as the number of machines used for clinical exams. In a similar way, we defined services as the sum of dummies indicating the presence of emergency services, transfusion centres, etc. On the other hand, at ward level we can find the number of beds (an annual average) - pl - which is also the variable most correlated with output in the dataset (both for disaggregated and aggregated data).

For an exploratory analysis of the correlations between the variables included in our studies, we can refer to the graphics in Figure 1. The values summarised are logarithms of the observations, and what we can easily identify is a substantial linear correlation between the transformed explanatory variables.

The graphics highlight a characteristic of our dataset: multicollinearity. This problem, common in productivity analysis, is connected with the consideration of size-related variables. In fact, all the variables included in our study are structural (this implies a direct proportionality to the dimensions of the hospitals). This problem, surely influential in the aggregated studies, is almost completely negligible when considering the original disaggregated dataset (1478 observations): first of all because the large size of the dataset reduces the effect of the presence of correlated variables, and secondly because the original data present less multicollinearity.

In the disaggregated studies we have also considered the effects that ward type could have on productivity. To this end we have identified five clusters of wards in order to generate a dummy variable for each of them.

2.1 On data structure

The particularity of these data is the well defined hierarchical organisation of hospitals, not only from a bureaucratic point of view but also regarding the practical definition of the services provided (see Figure 2). Many explanatory variables are collected at hospital level. This is logical from a productive point of view because, for example, doctors are not tied to a single ward: the use of these resources, which are not assigned to a specific ward, is probably spread across the wards of a hospital. With this type of data we cannot easily analyse single wards, but using a two-level mixed effects model (multilevel model) we can hypothesise a composite structure for the error term, identifying and separating the hospital effect from the residual biases (in terms of the error or a transformation of it). In particular, we suppose that by aggregating these data we would lose important information and compute biased estimates. The structure of the data is shown in Figure 2, where:

• y_ij represents the output at ward level (level 1);

• x_ij represents a vector of inputs at ward level (level 1);

• and z_i represents a vector of inputs at hospital level (level 2).

The index i goes from 1 to I (the number of hospitals) and j goes from 1 to n_i (the number of wards in the i-th hospital).

3 Models and methods

Considering a hospital from a productive point of view, we can borrow some well-known econometric models to estimate the so-called technical efficiency measures. Before looking at the applications, we give a short review of the methods used in efficiency estimation.

3.1 Frontier theory

In economic theory it is common knowledge that the efficiency of a productive process can be summarised by: "A subject is efficient when, given an amount of input, it produces the highest possible quantity of output, or it produces a given amount of output with the minimum quantity of input." Based on Koopmans' definition of efficiency and on Shephard distance functions, Farrell (1957) proposed efficiency measures for both the input- and output-oriented cases. These are defined as:

$$E_x^T(y, x) = \min\{\lambda : \lambda x \in X(y)\} = \frac{1}{D_x(y, x)},$$

the input-oriented efficiency measure, and:

$$E_y^T(y, x) = \max\{\theta : \theta y \in Y(x)\} = \frac{1}{D_y(y, x)},$$

the output-oriented efficiency measure. Both concepts are visualized in Figure 3. In this picture we can also note that the given definitions can easily be written as $\lambda_{\min} = \|x'\| / \|x\|$ for the input-oriented case and $\theta_{\max} = \|y\| / \|y'\|$ for the output-oriented case. A clearer example of these quantities is seen in Figure 4. The curve traced in the Euclidean plane is the production frontier of the output y obtained using the input x. The yellow area represents the whole set of production possibilities (Ψ) and point A is an inefficient observed unit. B and C are the efficient counterparts of point A, and the distances between these points and A are the efficiency measures. In this article all efficiency measures can be traced back to these concepts.
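As a purely illustrative numerical example (the numbers are hypothetical, not taken from the dataset): suppose a hospital produces y = 80 inpatient days while the frontier allows y' = 100 from the same inputs, and that the same output could be obtained from x' = 60 beds instead of the observed x = 75. Then

$$\theta_{\max} = \frac{\|y\|}{\|y'\|} = \frac{80}{100} = 0.8, \qquad \lambda_{\min} = \frac{\|x'\|}{\|x\|} = \frac{60}{75} = 0.8,$$

i.e. the unit attains 80% of the frontier output and, in this example, could produce its output with 80% of the inputs it actually uses.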

3.2 Models for data at hospital level

This approach (referred to as the classical approach) considers the estimation of a function based on the observation of decision making units (hospitals in our analysis). The fitted values of this function are the so-called efficient frontier subset. The estimation of efficiency measures is obtained by comparing the observed values with this subset (calculating distances from it).

Two different approaches have been proposed to study data on the service-providing process: parametric and nonparametric. Both have advantages and drawbacks, as we can see in Table 3 (where we can also notice a substantial trade-off between advantages and drawbacks in both cases).

Table 3: Advantages and drawbacks of classical approaches

 | Parametric | Nonparametric
Advantages | Interpretability of results; known statistical properties | Flexibility; robustness with respect to model choice
Drawbacks | Subjectivity in model choice; problematic definition of residuals | Interpretability of results; unknown statistical properties

The results of both methods still lack inference on the efficiency estimates. The method we present in paragraph 3.3 instead permits the construction of confidence intervals (in particular for the second-level error component). In the next subsections we introduce the different model specifications, with particular regard to the ones used in the analysis.

3.2.1 Parametric approach

In this section we expose the basic line of frontier specification and we present the theoretical models applied in our work. Given $Y \in \mathbb{R}_+$ and $X \in \mathbb{R}_+^p$, we can obtain a set Ψ by specifying the mathematical functional dependency between the variables, so that f(x, β) defines the data generating process. Several different specifications are available for the DGP. Certainly the most commonly used is the Cobb-Douglas function (well known in the economic context, see Meeusen and van den Broeck, 1977). Continuing to consider only one input and one output (the generalisation comes easily), we define:

$$y_i = A x_i^{\beta},$$

which becomes tractable by applying the logarithmic transformation:

$$\log(y_i) = \alpha + \beta' \log(x_i).$$

As we saw in Table 3, one of the main problems in model fitting is the error specification. Econometric theory suggests two approaches: deterministic and stochastic (see Greene, 1993). Assuming that observational data is subject to noise, we can focus on stochastic methods and write:

$$\log(y_i) = \alpha + \beta' \log(x_i) + \varepsilon_i,$$

where ε_i is a composed error term:

$$\varepsilon_i = e_i - u_i,$$

with e_i ∼ N(0, σ_e²) and u_i following a positive (one-sided) distribution with parameters (μ, σ_u²). This error takes care of both statistical noise (e_i) and efficiency-related terms (u_i). The estimation of the parameters of the distribution of u_i is the principal problem to solve in order to compute efficiency estimates. The methods used to perform this estimation are various. A solution developed using the method of moments implies the estimation of OLS results on a modified model and, in a second phase, the estimation of parameter corrections derived from the computation of empirical moments (see Aigner et al., 1977).
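A minimal sketch of such a moment-based (COLS-type) correction in R (the software used later in the paper), under the standard half-normal assumption for u; the data frame `hosp` is an assumption made here for illustration, the variable names follow section 2, and the exact variant used by the authors may differ.

```r
## Moment-based (COLS-type) correction after OLS for a Cobb-Douglas frontier,
## assuming e ~ N(0, sigma_e^2) and u half-normal (Aigner et al., 1977).
ols <- lm(log(ggd) ~ log(pers) + log(pl) + log(TOTApp) + log(serv), data = hosp)

r  <- residuals(ols)
m2 <- mean((r - mean(r))^2)                     # second central moment of residuals
m3 <- mean((r - mean(r))^3)                     # third central moment (negative if a frontier exists)

ratio    <- m3 / (sqrt(2 / pi) * (1 - 4 / pi))  # positive when residuals are left-skewed
sigma_u  <- if (ratio > 0) ratio^(1/3) else NA  # NA signals the "wrong" skewness
sigma_e2 <- m2 - (1 - 2 / pi) * sigma_u^2       # variance of the noise component

## Shift the OLS intercept upwards so that the fitted function becomes a frontier
alpha_frontier <- coef(ols)[1] + sigma_u * sqrt(2 / pi)

## A simple efficiency-style ranking from the residuals (best unit = 1)
eff <- exp(r - max(r))
```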

Another way to estimate u_i is the use of maximum likelihood methods (see Greene, 1980). This approach is supported by several statistical software packages. In the first part of our analysis we used "Frontier", a Fortran-based program that permits the specification of one-sided errors with a truncated normal distribution (see Coelli, 1996b). The results we will see in section 4 have been obtained using a particular library of the R statistical software (sn - skew-normal). This routine permits us to carry out repeated estimation by writing loops of commands (a few command lines) for bootstrap and jackknife analyses (see Efron and Tibshirani, 1993). In these analyses we decided to fit the skew-normal/normal regression without restrictions on the skewness parameter (according to frontier theory this must be positive). We chose this approach in order to consider the real distribution of the error components. Obviously, for the efficiency analysis we selected only the positively skewed distributions (estimated by the R-Statistics procedure).

3.2.2 Nonparametric approach

The nonparametric approach is completely different. In this case we do not need to specify an "a priori" functional form for the production hull. Several different specifications for the frontier construction have been proposed in the literature (see Banker et al., 1984). We first make a distinction between the FDH (Free Disposal Hull) and DEA (Data Envelopment Analysis) approaches. The first one is based on the simple hypothesis of "free disposability" of inputs and outputs; introducing DEA, we have to consider an additional hypothesis: the convexity of the frontier hull. In the next pages we consider this second method, because we need a frontier comparable with a stochastic one. The construction of the DEA frontier is based on the mathematical interpolation of frontier points (peers) in order to obtain the best theoretically possible observation set (X, Y). As in the previous approach, the objective is to compare these observations with the real ones to compute the efficiency rates (see Figure 5). The main problem of this kind of analysis is the deterministic character of the efficiency results. In order to avoid this problem, Simar and Wilson (see Simar and Wilson, 1998, 1999) propose using bootstrap procedures to make inference on efficiency measures. In this analysis some observations are also subject to a secondary inefficiency source, identified as the presence of "slacks" (e.g. point A in Figure 5). The point A' in fact lies on the frontier line but, given the free disposability rule, is less efficient than point C (one of the peers). In short, the observed unit A used too many inputs to produce the given outputs, and the input mix is not optimal either. The formulation of the problem could be:

$$\hat{\Psi}_{DEA} = \left\{ (x, y) \in \mathbb{R}_+^{p+q} : y \le \sum_{i=1}^{n} \gamma_i y_i, \; x \ge \sum_{i=1}^{n} \gamma_i x_i, \; \sum_{i=1}^{n} \gamma_i = 1, \; \gamma_i \ge 0 \;\; \forall i = 1, \dots, n \right\}.$$

This means that all possible n-dimensional observations are bounded by the defined convex hull. Deciding to use a DEA approach, we can specify different hypotheses on the returns-to-scale status: CRS (constant returns to scale), VRS (variable returns to scale) and some intermediate models (see Banker and Thrall, 1992) are available in all the specific software packages. To obtain results comparable with the parametric ones we used the CRS-DEA approach (in accordance with the returns-to-scale evidence obtained from the parametric analysis). The formulation of the efficiency estimation is:

$$\max_{\mu, v} \; \mu' y_i,$$

under the constraints

$$v' x_i = 1,$$
$$\mu' y_j - v' x_j \le 0, \quad j = 1, 2, \dots, n,$$
$$\mu, v \ge 0.$$

We can observe that also in this case, what we finally estimate is an "easy" computation of the distance between the hypothetical frontier and the observed values. Starting from this consideration we can look at the DEA method as a combination of the usual indexes and the parametric approach. From the first point of view, we can identify the efficiency measures as a complex index (derived from the ratio between the optimal output mix and the optimal input mix). On the other hand, we can speak of total flexibility of the "underlying model".
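A minimal sketch of the multiplier-form program above for a single unit, written in R with the lpSolve package (an assumption made here for illustration; the paper itself used the Fortran program DEAP). X is an n × p input matrix, Y an n × q output matrix, and i0 indexes the unit under evaluation.

```r
library(lpSolve)

## CRS efficiency of unit i0 in multiplier form:
##   max mu'y_i0  s.t.  v'x_i0 = 1,  mu'y_j - v'x_j <= 0 for all j,  mu, v >= 0
dea_crs <- function(X, Y, i0) {
  n <- nrow(X); p <- ncol(X); q <- ncol(Y)
  obj <- c(Y[i0, ], rep(0, p))                 # decision variables are (mu, v)
  A   <- rbind(c(rep(0, q), X[i0, ]),          # v'x_i0 = 1
               cbind(Y, -X))                   # mu'y_j - v'x_j <= 0
  ## lpSolve keeps all decision variables nonnegative by default (mu, v >= 0)
  lp("max", obj, A, c("=", rep("<=", n)), c(1, rep(0, n)))$objval
}

## Efficiency scores for all units (a score of 1 means "on the frontier")
scores <- sapply(seq_len(nrow(X)), function(i) dea_crs(X, Y, i))
```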

3.3 Models for data at ward level: hierarchical models

In this paragraph we introduce a relatively new model for efficiency analysis. Observing service provider sectors, in particular the public field, we often encounter hierarchical production structures. In these cases it is not unusual to find some inputs at different levels of the hierarchical organization, shared between the production units at the lower level. The naïve use of this information is the construction of aggregated quantities to obtain a set of observations at the highest level. This operation, obviously, introduces the problem of aggregation bias. In this study we use this information in order to obtain more robust estimates; moreover, we avoid the risk of aggregation bias. The specification of a model that takes into account the two-level structure and the variables at different levels involves the use of mixed effects models. The complete model could be written as:

$$y_{ij} = (\beta_0 + \beta_1 x_{ij} + \beta_2 z_i) + (u_{0i} + u_{1i} x_{ij} + e_{ij}),$$

where the first part of the equation is the deterministic part of the model. The stochastic errors comprise the second-level components (u_0i and u_1i) and the first-level one (e_ij). This model definition permits us to set some parameters as stochastic in order to lower the loss of efficiency due to the estimation of distinct parameters for all units of interest.

In our analysis we are interested in computing the different effects of the hospitals. Borrowing from some well-known results of aggregated model theory (MOLS or COLS models), we decided to let the intercept be stochastic in order to find evidence of differences in production level. The model we finally considered is:

$$y_{ij} = (\beta_0 + \beta_1 x_{ij} + \beta_2 z_i) + (u_{0i} + e_{ij}),$$

where u_0i ∼ iid N(0, σ_u²) and e_ij | u_0i ∼ iid N(0, σ_e²) (note that, despite this definition, the composed error does not change when summing or subtracting the u_i term). In our study the objective is to obtain a frontier regression. This could be done by specifying a one-sided error component at both the first and the second level of our structure. A useful simplification is to assume only the second-level error to be one-sided; considering that our aim is to find the efficiency level of the macro structure, we can accept this last proposal. Unfortunately, the software available for model estimation permits us to specify only normally distributed error components. To overcome the unavailability of procedures that directly permit us to assume one-sided second-level errors in mixed effects models, we first used the Frontier panel-data procedure (see Coelli, 1996b), deciding to treat different ward observations as different time observations. We treated our data as an unbalanced panel, estimating a time-invariant mixed effects model (efficiency is not related to ward type). The panel-data model, described in Coelli (1996b), may be expressed as:

$$Y_{it} = x_{it}\beta + (e_{it} - u_{it}),$$

where x_it and β define the fixed-effects part of the previous model, e_it | u_it ∼ iid N(0, σ_e²) and u_it ∼ iid |N(m_it, σ_u²)|, where m_it = z_it δ.

Setting m_it = m, the model becomes similar to the multilevel specification presented above, with the time dimension playing the role of the first level (wards instead of time periods) and the hospitals as the second level. The estimation methods used in R-lme and in Frontier are both based on maximum likelihood (iterative estimation of the coefficients), and consequently the differences we will see in paragraph 4.2 depend on the error specification. In fact, with this trick we are able to define one-sided errors in a hierarchical structure. Another way to treat this kind of data is to estimate the Normal-Normal mixed effects model and apply a MOLS transformation to the residuals (which translates the estimated effects and the regression function until it becomes the frontier of the data). In practice, the random effects calculated by the "R-lme" procedure (the R-Statistics "linear mixed effects" library) could be transformed by the function:

$$\mathrm{Eff}_i = f(u_i) = \exp(\max_i(u_i) - u_i),$$

or could be used directly to identify rankings of the observed units. In the next paragraphs the results of both procedures will be compared, concluding that the simplification in the R-lme procedure, due to the consideration of two-sided instead of one-sided errors, does not have substantial effects on the results of our analysis, especially for the purpose of producing rankings and identifying "abnormal" observations (hospitals).
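A minimal sketch of the random-intercept specification in R with the nlme library (the "R-lme" routine mentioned in the text refers to the lme function). The ward-level data frame `wards` and its grouping factor `hospital` are assumptions made for illustration, the other variable names follow section 2, and the ward-type dummies (repclu) and their interactions are omitted for brevity.

```r
library(nlme)

## Two-level model: ward-level beds (pl) plus hospital-level inputs,
## with a random intercept u_0i for each hospital.
fit <- lme(log(ggd) ~ log(pl) + log(pers) + log(TOTApp) + log(serv),
           random = ~ 1 | hospital, data = wards, method = "ML")

u <- ranef(fit)[, 1]          # estimated hospital effects u_0i

## Transformation used in the text: the "best" hospital gets 1,
## larger values indicate a lower estimated efficiency level.
eff <- exp(max(u) - u)

## Ranking of hospitals by their estimated effect
ranking <- order(u, decreasing = TRUE)
```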

4 Some results

In the next paragraphs we will show the results of various approaches proposed in section 3. First we will examine classical results and then disaggregated data analyses based on mixed (multilevel) models. In the third part of this section, we will compare the obtained results, and finally we will try to justify the use of multilevel model building.

4.1 Aggregated results (data at hospital level)

4.1.1 Parametric

Using the models presented in paragraph 3.2 we obtained efficiency scores for every hospital with both approaches (parametric and nonparametric). The results of the parametric analysis permit us to study the values of the coefficients in order to find evidence of the assumed correlations between dependent and independent variables. In Table 4 we summarise the results of the three different output specifications using the following model:

$$\log(y) = \beta_0 + \beta_1 \log(pers) + \beta_2 \log(pl) + \beta_3 \log(TOTApp) + \beta_4 \log(serv) + e - u,$$

where e is the vector of random errors and u is the efficiency-related vector. In our analysis we assumed that the u are one-sided with a truncated normal theoretical form (the most flexible). For the estimation we used the econometric software Frontier. In the last part of this article we will consider a simplification of this model, in which the one-sided errors are taken to be half normal.

Table 4: Coefficient values for the different output specifications

Variable in the model | Inpatient Days | Inpatient Number | Financed Funds
Intercept | 5.427 (0.113) | 3.585 (0.185) | 11.394 (0.177)
log(pers) | 0.156 (0.059) | 0.419 (0.071) | 0.293 (0.079)
log(pl) | 0.966 (0.056) | 0.490 (0.078) | 0.754 (0.082)
log(TOTApp) | -0.104 (0.027) | 0.132 (0.036) | 0.056 (0.040)
log(serv) | -0.116 (0.034) | 0.021 (0.047) | -0.033 (0.055)
σ | 0.555 (0.096) | 1.033 (0.292) | 0.489 (0.091)
γ | 0.980 (0.008) | 0.974 (0.012) | 0.879 (0.038)
μ | -1.475 (0.297) | -2.006 (0.707) | -1.312 (0.342)

These results show estimated coefficients values and their standard errors. The last three coefficients are related to error components specification. In particular:

$$\sigma^2 = \sigma_v^2 + \sigma_u^2, \qquad \gamma = \sigma_u^2 / \sigma^2,$$

while μ is simply the mean of the truncated normal distribution.

Looking at the results of the Inpatient Days model, we can interpret the coefficient values to verify the relations between dependent and independent variables and to calculate the returns to scale of the underlying technology. Logically, the evidence we collected seems to support a strong positive correlation between inpatient days and beds (or better, their logarithmic values). With this particular functional specification, the coefficients are also the estimated output elasticities with respect to single input variations. The estimates of the services and equipment coefficients are also quite interesting: these are both negative, and we can interpret this condition by hypothesising that the presence of a suitable structure lowers waiting times and consequently also inpatient days. The aim of Table 4 is not to compare the estimated coefficients but only to identify the best-fitting model. We also tried to build different models, but the results were always worse than those shown.

After having identified the best-fitting model, we can return to the primary objective of this study, the identification of the hospital effect on the production level. At this point a doubt could be raised: which of the three possible "outputs" should be used in the efficiency evaluation? A reasonable solution could be to consider inpatient days (because it is the most correlated with the inputs available for the analysis). Notwithstanding this consideration (and also the empirical evidence promoting the inpatient days model as the best fitting), the existing financing method (based on prospective DRG payments) could promote the use of the number of cases as output. This is because hospitals, developing opportunistic behaviour, seem to optimise hospitalisations to obtain a larger number of cases instead of optimising treatment. Along with these two possibilities, we can consider a proxy for the real usage of the inputs considered: the weighted sum of cases derived from the financed funds. From this point of view, the number of admissions is weighted by costs and duration; this, theoretically, reduces problems related to DRG differences in severity and in the effective usage of inputs.

Obviously, different output specifications lead to different efficiency estimation results. Using the rankings obtained from the different models we can easily find efficient and inefficient units. We are obliged to adopt this approach in order to consider all possible output alternatives. This descriptive analysis can be found in Appendix A. We will now go on to briefly discuss the nonparametric application and the disaggregated data models.

4.1.2 Nonparametric

In this subsection we present the specification used in the efficiency estimates in order to obtain results comparable to the parametric ones. We will not show the efficiency rates, which will be summarised in section 4.3 (in comparison with the parametric model results for aggregated and disaggregated data). To obtain the DEA efficiency measures we used another econometric Fortran software package: DEAP (see Coelli, 1996a). With this software we are able to estimate results for alternative returns-to-scale hypotheses (see section 3.2.2). In our analysis we used the DEA-CRS (constant returns to scale) specification because of the evidence collected with the parametric analysis. The output-oriented approach is used in order to obtain the same efficiency estimates as the parametric analysis (both measures are expressed as observed outputs divided by optimal outputs).

DEA results are well known in the efficiency analysis of industrial production. In that context, the results of a nonparametric analysis permit us to determine the optimal combinations of outputs/inputs, and the efficiency measures and related outputs of the estimation procedure (e.g. slacks, see 3.2.2) could be used directly for policy decisions. But, given the deterministic character of these results, we cannot say anything about significance levels. For example, from an output point of view, efficiency measures express possible increases in production given an input level, and slacks indicate possible further increases (see paragraph 3.2.2; in Figure 5 a possible reduction of X2 is suggested). In our context, an interpretation of this kind is clearly impossible. The efficiency results have been used to produce rankings of the observed units. To this aim, a serious drawback of these procedures is the lack of any statistical properties. In order to partially face this problem, we constructed a comparison between parametric and nonparametric efficiency (see section 4.3 and Banker et al. (1986)).

4.2 Disaggregated results (data at ward level)

As explained in the theoretical paragraphs of this article, all the preceding results are obtained by aggregating at hospital level data collected at ward or inpatient level. This operation produces some biases; for an example, look ahead to section 4.3.1. In order to avoid this problem, we specified a multilevel model trying to take advantage of the data structure.

Table 5: Coefficients estimated using inpatient days as dependent variable

Variables | Frontier Estimates | s.e. | R-lme Estimates | s.e.
Intercept | 1.364 | 0.199 | 2.203 | 0.502
ln(pers) | 0.295 | 0.083 | 0.427 | 0.094
ln(pl) | 1.023 | 0.021 | 1.022 | 0.022
ln(TOTApp) | -0.116 | 0.052 | -0.178 | 0.062
ln(serv) | 0.082 | 0.055 | 0.146 | 0.075
repclu2 | 0.110 | 0.204 | 0.399 | 0.495
repclu3 | 0.549 | 0.232 | 1.414 | 0.564
repclu4 | 0.588 | 0.204 | 1.563 | 0.498
repclu5 | 1.059 | 0.197 | 2.721 | 0.481
repclu2 × ln(pers) | 0.073 | 0.068 | 0.052 | 0.072
repclu3 × ln(pers) | -0.045 | 0.075 | -0.066 | 0.080
repclu4 × ln(pers) | -0.045 | 0.068 | -0.073 | 0.072
repclu5 × ln(pers) | -0.184 | 0.066 | -0.223 | 0.070
σ_u² | 0.122 | 0.020
σ_e² | 0.421 | 0.016

In Table 5 we show the results of both the R-lme procedure and the Frontier estimation (obtained with the trick described in paragraph 3.3). Both procedures produce estimates of the variance components. These results show evidence of a substantial similarity in the coefficient results (considering the radical difference in model specification). With respect to the aggregated case we introduced into the model some dummy variables (repclu) identifying homogeneous wards, and their interactions with the most relevant input at hospital level, the staff (pers). Not all coefficients are statistically significant, but in some cases there is evidence of a different elasticity of the output with respect to this particular input. The introduction of such parameters at ward level allows us to differentiate the input-output relation between wards (or better, groups of them). This may be of some relevance if we think that the distribution of outputs between wards may not be homogeneous.

The efficiency estimates turn out to be highly correlated; in fact, we found a rank correlation of about 90 percent. The simplification due to the Normal-Normal composed error hypothesis is therefore not so influential (keeping in mind that the results are not to be considered as deterministic estimations). In Figure 6 we can see both efficiency series, noticing the substantial coherence between the two results. This evidence will be useful in the last part of this article; in fact, we will discuss repeated-sample methods of validation that require the use of the R-Statistics program instead of Frontier (and accordingly, the use of the Normal-Normal error specification). The results obtained with the two approaches are sufficiently similar in the coefficient values, and even more so in their interpretation, justifying the use of the R-lme procedure for the following results.

Using the hierarchical information with the multilevel specification, we can also compute confidence intervals for the u_i (in particular using the statistical software MLwiN). In Figure 7 we summarise the results for one of the outputs, Inpatient Days, considered in the following analysis (the estimates correspond perfectly using both the R-lme and MLwiN procedures). These results are useful to show that the estimated effects are highly significant. The confidence intervals are calculated by multiplying the standard errors by 1.96. This permits us to distinguish effects that are really different from 0, but not to identify differences between hospitals (see Goldstein and Healy, 1995). In the literature, serious doubts always accompany efficiency estimation (see Jensen, 2000). With this approach it is possible to verify whether the effects are actually different from the mean. In our analysis we tried to take advantage of these results when we compared the "efficiency" rates.

4.3 Results comparison

Finally we compared the discussed approaches, considering the models with "Inpatient Days" as the dependent variable (the one that produces the better results). The principal comparison we examine is between the rankings calculated from the efficiency results of the three methods. In the multilevel approach we consider the efficiency calculated by the R-lme procedure (under the hypothesis of a Normal-Normal composed error). A further possible comparison is between the coefficient values or, even better, between the explanatory variables' effects (positive or negative). In Figure 8 we have plotted the three scatterplots of the efficiency values; below the graphics we have reported the rank correlations (Spearman).

Observing these plots and the related rank correlations, we can conclude that the efficiency results obtained from aggregated data can be considered comparable, and both are very different from the disaggregated results. This could be interpreted as evidence of the presence of aggregation bias or as symptomatic evidence of model misspecification. Similar results have been obtained using efficiency estimates from a multilevel model for the data without the introduction of repclu and its interactions with pers.

4.3.1 Choosing the best approach

In this paragraph we illustrate the results of two well-known methods of validation based on repeated-sampling theory: the jackknife and bootstrap simulation approaches. To develop these procedures we used the statistical software R-Statistics. Our aim is:

• to first demonstrate, with the "leave one out" approach (see Efron and Gong, 1983), that the estimates of the models presented in the preceding paragraphs are stable (again showing that no observation could be considered as an outlier);

• second, with bootstrap simulations, try to justify the preference for multilevel model building, which we have theoretically promoted, verifying that aggregated results suffer from biases due to data manipulation.

Both methods could be summarised in these four steps (repeated sampling theory):

1. identify the data generating process (or an approximation of it), or suppose it is known a priori;

2. use this hypothesis to reproduce some data samples;

3. re-estimate models with every generated data-set;

4. and finally, obtain inference from these results.

4.3.2 The Jackknife cross-validation

In this section we present the results of a simple application of the jackknife cross-validation procedure, also called "leave one out". The procedure, which we implemented with a simple loop of commands in the R environment, is composed of three steps:

• estimation of the specified model using the complete dataset;

• estimation of the model parameters (θ̂_(i)) using the dataset obtained from the original one by leaving out the i-th unit (hospital);

• calculation of the corrected coefficient estimates and the related confidence intervals.

Formally, we can write:

$$\hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)},$$

where θ̂_(·) is the arithmetic mean of the jackknife estimated parameters (θ̂_(i)) and n is the number of hospitals. From this statistic we can compute the jackknife bias component (β̂_j):

$$\hat{\beta}_j = (n - 1)(\hat{\theta}_{(\cdot)} - \hat{\theta}).$$

Finally, using these results and the jackknife estimated standard error:

$$\hat{\sigma}_j = \frac{n - 1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2,$$

we obtain the unbiased estimated values of the coefficients:

$$\hat{\theta}_j = \hat{\theta} - \hat{\beta}_j,$$

and the upper and lower confidence boundaries:

$$\hat{\theta}_U = \hat{\theta}_j + \hat{\sigma}_j z_{\alpha/2} \quad \text{and} \quad \hat{\theta}_L = \hat{\theta}_j - \hat{\sigma}_j z_{\alpha/2}.$$

These simple steps can be applied to both parametric procedures shown in the previous sections. A simplification must be made for the aggregated models only: instead of the truncated normal specification for the one-sided error components, we must consider a half normal theoretical distribution (because the library used assumes Half Normal-Normal composed errors only). The results of this procedure confirm that the estimated coefficients are not affected by biases due to outlier observations. Referring to the procedure shown in Efron and Gong (1983) and summarised in this paragraph, we do not obtain significant differences between the unbiased coefficients and our starting results. Moreover, the confidence intervals showed that no single observation causes relevant effects on the estimation (see Figure 9). The same procedure could be used with the efficiency estimates in order to obtain a sort of confidence interval for the parametric results on aggregated data.
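A minimal sketch of the leave-one-out loop in R, using the illustrative `hosp` data frame from before and the formulas above; it is shown here with a plain OLS fit, whereas the paper applies the same scheme to the frontier model with half normal errors.

```r
## Jackknife ("leave one out") for the coefficients of the aggregated model
fit_all <- lm(log(ggd) ~ log(pers) + log(pl) + log(TOTApp) + log(serv), data = hosp)
theta   <- coef(fit_all)
n       <- nrow(hosp)

theta_i <- t(sapply(seq_len(n), function(i)
  coef(update(fit_all, data = hosp[-i, ]))))        # refit without hospital i

theta_dot <- colMeans(theta_i)                      # jackknife mean
bias_j    <- (n - 1) * (theta_dot - theta)          # jackknife bias component
se_j      <- sqrt((n - 1) / n *
                  colSums(sweep(theta_i, 2, theta_dot)^2))  # jackknife std. error

theta_j <- theta - bias_j                           # corrected coefficient estimates
ci      <- rbind(lower = theta_j - 1.96 * se_j,     # confidence boundaries
                 upper = theta_j + 1.96 * se_j)
```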

4.3.3 Bootstrap Simulation

The aim of this part of the article is to obtain evidence about the biases in the hospital effect estimates due to aggregation. In order to confirm our suspicions we arranged the following five-step re-sampling plan:

1. calculate the fitted values for the given model: ŷ_ij = X_ij^T β̂;

2. simulate the error components, considering the estimated structure:

   u_i ∼ iid N(0, σ̂_u²) and e_ij | u_i ∼ iid N(0, σ̂_e²);

3. with the simulated values obtain the disaggregated output:

   y_ij = ŷ_ij + u_i + e_ij;

4. repeat the coefficient and efficiency (error) estimation in both the aggregated and disaggregated cases: β̂, û_i;

5. calculate, for both approaches, the correlations between the newly estimated errors (û_i) and the simulated ones (u_i); a sketch of one replication is given below.
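A minimal sketch of one replication of this plan in R, assuming `fit` is the random-intercept lme model from the sketch in paragraph 3.3 and `wards` the same illustrative ward-level data frame; the full study repeats this many times (and also with half normal hospital effects) and collects the rank correlations summarised in Figure 10.

```r
library(nlme)

## One parametric simulation replication of the aggregation-bias check.
## s_u, s_e: estimated standard deviations of the hospital effect and of the
## ward-level error, read off the fitted model (e.g. from VarCorr(fit)).
simulate_once <- function(fit, wards, s_u, s_e) {
  hosp_id <- factor(wards$hospital)
  u <- rnorm(nlevels(hosp_id), 0, s_u)             # simulated hospital effects
  names(u) <- levels(hosp_id)
  e <- rnorm(nrow(wards), 0, s_e)                  # simulated ward-level errors

  wards$y_sim <- fitted(fit, level = 0) + u[as.character(hosp_id)] + e

  ## Re-estimate the multilevel model on the simulated data
  refit <- lme(y_sim ~ log(pl) + log(pers) + log(TOTApp) + log(serv),
               random = ~ 1 | hospital, data = wards, method = "ML")
  u_hat <- ranef(refit)[, 1]

  ## Rank correlation between simulated and re-estimated hospital effects;
  ## the aggregated-model comparison proceeds analogously on hospital-level data.
  cor(u[rownames(ranef(refit))], u_hat, method = "spearman")
}
```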

We developed this procedure in R, like the preceding one (jackknife), with a simple loop of commands. The results demonstrate that the simplifying hypothesis regarding the composed error structure causes less bias than aggregation: in fact, the effects estimated with the R-lme procedure are more correlated with the simulated components than the aggregated ones. In order to assess these results we examined the rank correlation between the errors. This correlation is summarised in Figure 10 (bold curves), where the two Gaussian curves approximate the histograms of the correlation rates. It seems clear that the disaggregated analysis estimates the simulated effects better than the classical approaches.

This simulation respects the disaggregated estimates of the error structure, but an important starting point of our analysis is that the second-level errors (related to hospitals) must be one-sided. In the simulation, nothing forces us to respect the estimated structure of the errors. In order to confirm that the multilevel model with Normal-Normal errors estimates the hospital effects more effectively than the aggregated model (which uses a Normal-Half Normal specification), a Normal-Half Normal simulation has also been developed. We repeated the simulations and the respective estimations, again obtaining results confirming that, notwithstanding the simplification due to the R-lme procedure, the disaggregated approach estimates the effects better (see Figure 10, thin curves).

5 Conclusions

From the estimation results obtained by the methods explained in section 3, we have observed that the hospitals' efficiency estimates obtained by the multilevel approach on disaggregated data are drastically different from those obtained with classical methods. Interpreting this as a good result, which confirms our doubt about the limits of considering aggregated data, we focused on the validation of the estimated coefficients and on the search for evidence confirming that the disaggregated models estimate the hospitals' effects better than the aggregated ones.

Finally, the multilevel model passes the validation procedure (jackknife), and our simulation shows strongly "biased" estimates of efficiency in the aggregated analysis. This again confirms that the mixed linear model respects the hypothesised data structure and lowers the bias in the effect estimation.

After having justified the methods used, we have interpreted the estimated efficiency measures by building a multidimensional ranking (that considers the three different output specifications). As already mentioned, with the efficiency results we are not able to decide an intervention policy, but we can identify some suspicious observations. From an economic point of view, for the multilevel models it is also possible to obtain estimates of the elasticity function (see Appendix A).

5.1 Ongoing Work

After having studied this particular case, we can certainly claim that better results could be reached with a mixed linear model that considers a one-sided error component. Currently no procedure has been developed to address this problem. The next step in our research could be the development of an R or S-Plus routine that considers this case. Another possible way is to study the nonparametric estimation of the error components, hoping that the resulting estimates respect the hypothesised skewness.

References

D.J. Aigner and S.F. Chu. On estimating the industry production function. American Economic Review, 58:826–839, 1968.

D.J. Aigner, C.A.K. Lovell, and P. Schmidt. Formulation and estimation of stochastic frontier models. Journal of Econometrics, 6:21–37, 1977.

R.D. Banker, A. Charnes, and W.W. Cooper. Some models for estimating technical and scale inefficiencies in data envelopment analysis. Management Science, 30(9):1078–1092, 1984.

R.D. Banker, R.F. Conrad, and R.P. Strauss. A comparative application of data envelopment analysis and translog methods: an illustrative study of hospital production. Management Science, 32(1):30–44, 1986.

R.D. Banker and R.M. Thrall. Estimation of returns to scale using data envelopment analysis. European Journal of Operational Research, 62:74–84, 1992.

G.E. Battese and T.J. Coelli. Prediction of firm-level technical efficiencies with a generalized frontier production function and panel data. Journal of Econometrics, 38:387–399, 1988.

A. Charnes, W.W. Cooper, and E. Rhodes. Measuring the efficiency of decision making units. European Journal of Operational Research, 2:429–444, 1978.

T.J. Coelli. A guide to deap version 2.1: a data envelopment analysis (computer) program. CEPA Working Paper, 08 1996a.

T.J. Coelli. A guide to frontier version 4.1: a computer program for stochastic frontier production and cost function estimation. CEPA Working Paper, 08 1996b.

B. Efron and G. Gong. A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1):36–48, February 1983.

B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.

M.J. Farrell. The measurement of productive efficiency. Journal of the Royal Statistical Society, Series A, 120(7):63–80, 1957.

H. Goldstein and M.J.R. Healy. The graphical presentation of a collection of means. Journal of the Royal Statistical Society, Series A, 158:175–177, 1995.

W.H. Greene. Maximum likelihood estimation of econometric frontier functions. Journal of Econometrics, 13:27–56, 1980.

W.H. Greene. The econometric approach to efficiency analysis. In Fried H.O., Lovell C.A.K., and Schmidt S.S., editors, The Measurement of Productive Efficiency. 1993.

J.J. Hox. Applied Multilevel Analysis. TT-Publikaties, Amsterdam, 1995.

U. Jensen. Is it efficient to analyse efficiency rankings? Empirical Economics, 25(2): 189–208, 2000.

W. Meeusen and J. van den Broeck. Efficiency estimation from Cobb-Douglas production functions with composed error. International Economic Review, 18:435–444, 1977.

D. Reifschneider and R. Stevenson. Systematic departures from the frontier: a framework for the analysis of firm inefficiency. International Economic Review, 32(3):715–723, 1991.

L. Simar and P.W. Wilson. Sensitivity analysis of efficiency scores: How to bootstrap in nonparametric frontier models. Management Science, 44(1):49–61, 1998.

L. Simar and P.W. Wilson. Statistical inference in nonparametric frontier models: the state of the art. Institut de Statistique Discussion Paper (UCL), No. 9904, 1999.

Figure 1: Aggregated data scatter plots (pairwise panels of the log-transformed variables Staff, Beds, Equipments and Services)

Figure 2: Data Organisation

Figure 3: Examples of input oriented and output oriented isoquants

Figure 4: Univariate example of frontier production

Figure 5: Examples of DEA isoquant in a two input, one output technology

Figure 6: Dispersion plot of efficiency estimated with the Frontier and R-lme routines (R efficiencies vs. Frontier efficiencies)

Figure 7: MLwiN second level errors confidence intervals

Figure 8: Scatterplot of efficiency levels and rank correlations

Figure 9: Jackknife unbiased estimate and confidence intervals (corrected estimate of the Beds coefficient with jackknife and original estimated boundaries)

Figure 10: Rank correlation distribution under both developed hypotheses (simulated distributions of the rank correlation for the multilevel and aggregated model results, under Norm + Norm and Norm + HalfNorm simulations)

L∞ regression and mixture index of fit

A. Low ELTE Institute of Sociology, Hungary

1 The idea of the mixture index of fit

The mixture index of fit was first suggested by Rudas et al. (1994) in the context of the analysis of contingency tables. The original definition is the following:

Definition 1.

$$\pi^*(P) = \inf\{\pi : P = (1 - \pi)F + \pi Q, \; F \in M, \; Q \text{ unrestricted}\},$$

where P, F and Q denote cell probabilities of two-way contingency tables of the same size and M is a model. This gives a new index, π*, the mixing weight of a special unstructured two-point mixture, which represents the fraction intrinsically outside model M. The other name of the mixture index of fit, the pi-star (π*) index, comes from this notation.

For a probability distribution P and a statistical model M, a measure of goodness of fit can be obtained by representing P as a mixture of a distribution from M and of an arbitrary distribution. The mixture index of goodness of fit is the maximum possible weight of a distribution from M in this mixture representation. For a given model, this index measures model misfit (or lack of fit) as the minimum possible portion of the population outside the specified model.
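To make the definition concrete: for a fixed F, a valid Q exists exactly when (1 - π)F does not exceed P in any cell, so the smallest feasible π for that F is 1 - min_c P_c/F_c, and π* is the infimum of this quantity over F in M. The following crude illustrative sketch applies this to a 2×2 table under the independence model (written in R only to keep a single language for the sketches in this volume; the observed table is hypothetical).

```r
## pi* of a 2x2 table P under the independence model, by grid search:
## for a fixed independence table Fm, the smallest feasible pi is 1 - min(P/Fm).
pi_star_indep <- function(P, grid = seq(0.01, 0.99, by = 0.01)) {
  best <- 1
  for (r in grid) for (c in grid) {
    Fm   <- outer(c(r, 1 - r), c(c, 1 - c))   # candidate independence distribution
    best <- min(best, max(0, 1 - min(P / Fm)))
  }
  best
}

P <- matrix(c(0.3, 0.2, 0.1, 0.4), nrow = 2)  # hypothetical observed cell probabilities
pi_star_indep(P)
```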

2 The mixture index of fit and discrete variables

The mixture index of fit, or π* approach, was applied to the analysis of social mobility tables (see Clogg et al., 1995), and to the problem of differential item functioning in educational research (see Rudas and Zwick, 1997). These papers discussed the relative advantages of π* as an index of fit and compared the conclusions that could be drawn based on the mixture index of fit to those that could be achieved by using other methods of assessing the fit of a model. The advantages of using π* as an index of fit include straightforward interpretations and the lack of dependence on the sample size in the usual sense. Furthermore, π* can be used similarly for both census and survey data, and this approach provides a new interpretation of widely used statistical quantities, such as the correlation coefficient, and offers a sensitive tool for outlier detection in several setups in multivariate analysis.

3 The mixture index of fit and regression

In the general case an n-dimensional regression subspace can only be fitted exactly to n points, that is, it can describe n points perfectly. In this case the idea of the mixture index of fit cannot be applied directly, because an n-dimensional subspace can be fitted to any n-element subset of the observed points. The observed distribution therefore cannot be divided into two arbitrary parts, only into one part with exactly n elements and another part containing the rest, and none of these solutions leads to a description of the observed distribution.

For the decomposition of the observed distribution into two parts (one of them based on the model and the other one unrestricted) we have to give up the perfect fit of the model, and therefore a new error term e must be introduced. The distribution F derived from the model M can only be fitted to the distribution P together with this error term. The residual of the distribution P beyond the fixed error term becomes the unrestricted distribution Q. All the Nπ-element and N(1 - π)-element subsets of the N-element distribution P must be investigated, because the mixture index of fit minimises only the proportion of the Q part (π). The most likely decomposition will be the one which has the smallest error term describing the distribution F with its N(1 - π) elements.

The error term e can be introduced in a straightforward way with L∞ norm regression.

4 The basic idea of L∞ regression

The estimator can be expressed as a low-dimensional linear program: find (m, b, e) minimizing e and satisfying the linear constraints -e ≤ m x_i + b - y_i ≤ e. The number of variables to be found is just d + 1; therefore, the program can be solved in time linear in the number of data points (but exponential in the dimension) using low-dimensional linear programming techniques.

One approach to the L∞ norm is the so-called maximin approach: we search for a regression line which minimizes the distance of the farthest points from the line. This is exactly the statement I made previously. It refers to the points, pairs of values of the dependent and independent variables, which I would like to cover with a band. This band should cover all points and should be as narrow as possible, measured vertically. The centre line of this zone will naturally be the sought line m x_i + b, and the vertical width of the zone will be exactly equal to twice the error.

Obviously, in the general two-dimensional case there are two points on one edge of the band and one point on the other edge. The two points are neighbouring points of the convex hull of the data, that is to say, this edge of the band is one of the border lines of the convex hull; this determines the slope of the regression line. The third point, on the other side, is also one of the points of the convex hull (Figure 1). Its distance from the line gives the magnitude of the regression error. This error and the intercept of the border line determine the intercept of the regression line (Figure 2). I call the three defining points of the regression line "black sheep points", as they are the farthest of all, that is to say, the odd ones out.
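A minimal sketch of this estimator in R. Rather than calling a linear-programming routine, it exploits the geometric fact noted above that the optimal slope is determined by two of the "black sheep" points, so enumerating all pairs of points as slope candidates (a brute force that is fine for small samples) recovers the L∞ line; a production implementation would solve the linear program directly.

```r
## L-infinity (Chebyshev) regression line y ~ m*x + b:
## minimise e subject to -e <= m*x_i + b - y_i <= e for all i.
linf_line <- function(x, y) {
  best <- list(e = Inf)
  n <- length(x)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (x[i] == x[j]) next                    # slope undefined for this pair
    m <- (y[j] - y[i]) / (x[j] - x[i])        # candidate slope from two support points
    r <- y - m * x
    b <- (max(r) + min(r)) / 2                # centre the band vertically
    e <- (max(r) - min(r)) / 2                # half-width of the covering band
    if (e < best$e) best <- list(m = m, b = b, e = e)
  }
  best
}

## Hypothetical data with uniform errors; e is the maximal vertical deviation
set.seed(1)
x <- runif(30); y <- 2 * x + 1 + runif(30, -0.2, 0.2)
linf_line(x, y)
```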

L∞ regression gives the maximum likelihood estimate in the case of a uniform error distribution. As I have mentioned earlier, the regression line is the centre line of the narrowest band that covers all points. Narrowing the band means shrinking the support of the uniform error distribution; if this support is minimal, the value of the likelihood function is maximal. In this way L∞ regression gives the maximum likelihood estimate under the supposition of a uniform error distribution. The supposition, common in the social sciences, that the distribution of the error part follows a normal distribution is rarely well established: on the one hand, errors may only exist in a given range due to external restrictions; on the other hand, their empirical distribution often does not have the shape of a bell curve. Therefore L∞ regression is a promising possibility for social scientists.

The three black sheep points can only come from the points of the convex hull, that is, the parameters of the regression line are defined only by extremities. The position of the internal points and their frequency is entirely irrelevant from this point of view. This is not a hopeful fact for social scientists at all, because they often have to work with data coming from respondents' own reports in surveys, which in many cases are loaded with misreporting.

5 The mixture index of fit and the linear regression with error term

The approach of the mixture index of fit can also be applied to ordinary least squares linear regression. Take all the examined points and the line fitted to them; the R2 value tells how well the line fits. It is also possible to compute the vertical distance of each observed point from the regression line and thus to find the worst-fitting, farthest point. Let us leave this point out. To the remaining points a new regression line belongs, which fits more accurately than the previous one. The process can be repeated until only two points and one perfectly fitting regression line remain. Trimming, of course, should not be exaggerated just to improve the fit beyond all limits. The goal is to find an optimum, much as in factor analysis one looks for an optimum between the number of factors used and the proportion of variance explained. In this approach the examined points can be imagined as a cabbage: its regression stalk is tightly covered with good-quality leaves, while the outer leaves are more withered and form looser layers, and we would like to peel the cabbage. Not only L∞ norm regression but also the mixture index of fit is well suited to finding extreme points; this gave the initial idea to apply them together. Beyond this similarity, the idea of the mixture index of fit can help to eliminate the main disadvantage of L∞ norm regression, namely that only a few special points on the convex hull define the regression line. Following the idea of the mixture index of fit, we search for the subset of the observed points that can be covered with the narrowest band among all subsets with the same number of points.
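A minimal sketch of this peeling procedure (my illustration, assuming NumPy; the data and the number of omitted points are made up):

```python
# A sketch of the trimming idea: fit ordinary least squares, drop the point
# with the largest absolute residual, refit, and repeat.
import numpy as np

def trimmed_ls(x, y, n_omit):
    """Return slope, intercept and max residual after 0..n_omit omissions."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    path = []
    for _ in range(n_omit + 1):
        slope, intercept = np.polyfit(x, y, deg=1)      # least squares line
        resid = np.abs(y - (slope * x + intercept))
        path.append((slope, intercept, resid.max()))
        worst = resid.argmax()                          # farthest point
        x, y = np.delete(x, worst), np.delete(y, worst)
    return path

# made-up data for illustration
rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 0.5 * x + 30 + rng.normal(0, 1, 20)
y[7] += 8                                               # one "black sheep"
for k, (a, b, d) in enumerate(trimmed_ls(x, y, 5)):
    print(f"{k} omitted: slope={a:.3f} intercept={b:.3f} max|resid|={d:.3f}")
```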

Predictably, as the proportion π of the unrestricted part increases, the bandwidth (twice the error term) decreases. It seems natural, as in the L2 norm case, to start with all points, leave the unsuitable points out one by one, and finally arrive at the last two points. It is also clear that at each step one of the three black sheep points should be omitted. Unfortunately, this kind of sequential omission would have required reconsidering many cases: when the second of the three black sheep points is left out, the first one may have to be taken back. This omit-and-take-back procedure is dubious, which is why I chose another way.

6 The algorithm

Let us take a projection direction and project the points onto it, say onto the y axis. Sort the projected values; the differences between this sorted vector and the same vector shifted by n positions give the widths of the windows that contain exactly n consecutive projected points. In this way it is easy to find, for a given direction, the narrowest window covering a subset of n points. The more projection directions we take, the better the estimate we obtain. The estimates for different values of n are only needed to define candidate subsets; it is then straightforward to compute the regression line of a given subset and the maximum deviation between its points and the line. The result is very similar to the least squares case. This application of L∞ norm regression gives quite a good picture not only of the points of the convex hull but also of their internal arrangement.
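My reading of this projection step, as a hypothetical sketch: sort the projections, subtract the sorted vector from itself shifted by n positions, and take the window with the smallest width; the data and the chosen direction below are made up.

```python
# A sketch of the projection step: for a given direction, find which subset
# of n points has the narrowest projection (my interpretation of the text).
import numpy as np

def narrowest_subset(points, direction, n):
    """Indices of the n points whose projections onto `direction` span the
    smallest interval."""
    direction = np.asarray(direction, float)
    direction = direction / np.linalg.norm(direction)
    proj = np.asarray(points, float) @ direction
    order = np.argsort(proj)
    sorted_proj = proj[order]
    # width of every window of n consecutive sorted projections
    widths = sorted_proj[n - 1:] - sorted_proj[:len(proj) - n + 1]
    start = widths.argmin()
    return order[start:start + n]

# made-up example: 63 points, keep the 40 with the narrowest vertical spread
rng = np.random.default_rng(1)
pts = np.column_stack([np.arange(63.0),
                       0.3 * np.arange(63.0) + rng.normal(0, 2, 63)])
subset = narrowest_subset(pts, direction=[0.0, 1.0], n=40)
print(len(subset), "points selected")
```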

7 An application

The data are the proportions of voters for the two largest Hungarian political parties over the last two years, according to the surveys of the four biggest public opinion polling and market research companies (Figure 3 and Figure 4). The following example is not a full time-series analysis, so I made the following assumptions. The support of the political parties is more or less constant, so over this time period it can be approximated well by a linear trend. This stability can be disturbed by minor political events; when pollsters measure in such a period, their measurements are loaded with systematic distortion. The sample of each survey was either good or bad: when a sample appears to be wrong, we skip the point estimate obtained from it rather than accept, as a result, the intersection of the confidence interval of the distorted measurement with the confidence intervals of the other measurements. Let us not exaggerate the purification, however! The bandwidth decreased the most for the two parties, in the range of 5-10 percentage points, when 3 to 4 and 4 to 5 points were omitted from the observed 63 points. These linear regressions and their parameters are shown in Figures 7-10. A unique solution is not always available: the following two figures have the same number of omitted points (27) and the same bandwidth (5.1667), yet the parameters of the two regressions are different (slope 0.5833 and 1, intercept 31.6667 and 25.4167, respectively). A further difficulty is that the bandwidth is a relative value: cases on different scales cannot be compared directly.

References

C.C. Clogg, T. Rudas, and L. Xi. A new index of structure for the analysis of models for mobility tables and other cross-classification. In P. Marsden, editor, Sociological Methodology, pages 197–222. Blackwell, Oxford, 1995.

T. Rudas, C.C. Clogg, and B.G. Lindsay. A new index of fit based on mixture methods for the analysis of contingency tables. Journal of the Royal Statistical Society, Series B, 56:623–639, 1994.

T. Rudas and R. Zwick. Estimating the importance of differential item functioning. Journal of Educational and Behavioral Statistics, 22:31–45, 1997.

Figure 1: 16 random points and the 3 defining points of the L∞ regression line

Figure 2: 16 random points, the L∞ regression line and the error term

Figure 3: The proportion of Fidesz-MPP voters among those who can choose a party (series: Tarki, Szonda-Ipsos, Median, Gallup)

Figure 4: The proportion of MSzP voters among those who can choose a party (series: Tarki, Szonda-Ipsos, Median, Gallup)

Figure 5: The parameters of the regression line for Fidesz-MPP voters (a: slope, b: intercept, d: bandwidth, i.e. double the error term) as a function of the number of omitted points

Figure 6: The parameters of the regression line for MSzP voters (a: slope, b: intercept, d: bandwidth, i.e. double the error term) as a function of the number of omitted points

Figure 7: Regression of Fidesz-MPP with 3 omitted points (a=0.571429, b=30.1429, d=10.5714)

Figure 8: Regression of Fidesz-MPP with 4 omitted points (a=0.571429, b=30.4286, d=10)

Figure 9: Regression of MSzP with 4 omitted points (a=-0.166667, b=47.0833, d=8.5)

Figure 10: Regression of MSzP with 5 omitted points (a=-0.142857, b=46.5714, d=8)

Figure 11: Solution one (Fidesz-MPP)

Figure 12: Solution two (Fidesz-MPP)

Respondent selection within the household - A modification of the Kish grid

Renata Nemeth

Hungary

Abstract: The problem of drawing a person from a household often occurs at the final stage of a survey design, e.g. in telephone surveys, after we contacted the households using random digit dialing. The Kish grid gives an algorithm for this random selection. We found that, contrary to the widely held opinion, the grid is not capable of providing representativeness by gender and age. This misconception stems from the fact that when the Kish grid was developed in the 1950's, both randomness and representativeness could be achieved using the method, due to the household structure of the USA. We show that this does not hold for today's Hungary. Finally, we suggest a modification of the Kish grid that is more appropriate for selecting a representative sample.

1 Introduction

Information on characteristics of populations is constantly needed nowadays. For reasons relating to time-limit and costs, this information is often obtained by use of sample surveys. A sample survey may be defined as a study involving a subset (sample) selected from a larger population. Characteristics of interest are measured on each of the sampled individuals. From this information, extrapolations can be made concerning the entire population. The validity and reliability of these extrapolations depend on how the sample was chosen.

2 Reasons for sampling households

It is often best to draw the sample in two stages. These are designs in which primary sampling units are selected at the first stage, and secondary sampling units are selected at the second stage within each unit selected previously. The sampling designs considered in this paper are those in which households are selected first, and then one adult member of each selected household is chosen. When does the need for two-stage sampling arise, rather than selecting the respondents directly from the population? Lists of adults, from which the sample can be taken, are often not available. For example, the electoral register is usually a good quality database of addresses, but a poor quality database of individual adults. The register has many errors because of non-registration and population mobility. In practice, the register is used to construct a sample of flats or households, and the sample of adults is obtained at a second stage in some other way. Another method involving respondent selection within the household is called area sampling. It is used when the target population is located in a geographical region, such as a city. A frame for studying the population of a city may in the first stage consist of a list of districts, followed by a list of streets, followed by a list of blocks, then a list of households. And again, at the final stage, a sample of respondents is obtained from the sample of households. The problem of translating a sample of households into a sample of adult persons often arises in telephone surveys as well. The households are usually contacted by random digit dialing. There is no need for selection if the respondent is uniquely defined, e.g. as the head of the household. Suppose the household contains more than one member of the desired population. One may decide to include in the sample every member within the household. This may be a statistically inefficient procedure, unless one of these two conditions holds:

• There is seldom more than one member of the population in the household.

• The intra-class correlation within the household of the variables measured is of negligible size. Otherwise, the distribution is characterized by some homogeneity. Usually, the homogeneity of households is greater than in the case when individuals are assigned to them at random. Since homogeneity within sample clusters increases the variance of estimates, the sampler wants to reduce it in this case by selecting only one member per household (see Kish, 1965).

These conditions generally do not hold in surveys. Hence there is a need for a selection procedure that translates a sample of households into a sample of the adult population. It is desired to make not more than one interview in every household. On the other hand, an interview in every sample household is desired in order to avoid futile calls on households without interviews. Finally, the procedure should be easy to apply and to check. The simplest procedure we could apply is uncontrolled selection, in which the interview is conducted with whoever opens the door or answers the phone. A serious problem comes up in this case: the resulting sample will be made up of those persons who are more likely to be available at the time the interviewers call or who are most willing to be interviewed. Experience to date shows that such samples are dominated by women and older adults.

3 The Kish grid

The Kish grid gives such a procedure of selection. The expression "Kish grid" comes from the name of Leslie Kish, the Hungarian-born American statistician, who was one of the world's leading experts on survey sampling. When creating the grid, Kish intended to select persons within the household with equal probability. At the same time, the use of the grid can be checked easily, in contrast to, e.g., a decision depending on the toss of a coin.

When applying the Kish grid, the interviewer at the first step uses a simple procedure for ordering the members of the household. A cover sheet is assigned to each sample household. It contains a form for listing the adult occupants (see Table 1), and a table of selection (see Table 2).

Table 1: Form for listing the adult occupants (see Kish, 1965)

Relationship      Sex   Age   Adult No.   Selection
Head              M           2
Wife              F     40    5
Head's father     M           1
Son               M           3
Daughter          F           6
Wife's aunt       F     44    4           X

The interviewer lists each adult on one of the lines of the form. Each is identified in the first column by his or her relationship to the head of the household. In the next two columns, the interviewer records the sex and, if needed, the age of each adult. Then the interviewer assigns a serial number to each adult: first, the males are numbered in order of decreasing age, followed by the females in the same order. Then the interviewer consults the selection table. This table tells him the number of the adult to be interviewed. In the example, there are six adults in the household, and selection table D tells him to select adult number 4 (see Table 2).

Table 2: One of the eight selection tables (see Kish, 1965)

Selection table D
If the number of adults
in household is:          Select adult numbered:
1                         1
2                         2
3                         2
4                         3
5                         4
6 or more                 4

Selection table D is only one of the 8 types (see Table 3). One of the 8 tables (A to F) is printed on each cover sheet. The cover sheets are prepared so as to contain the 8 types of selection tables in the correct proportions, e.g. table A is assigned to one-sixth of the sample addresses. The aim is to achieve equal selection probabilities within the household without having to print many more forms. Table 4 shows the selection probabilities. It can be seen that the chances of selection are exact for all adults in households with 1, 2, 3, 4 and 6 adults. As numbers above six are disallowed, some adults are not represented at all. Moreover, adult number five is overrepresented in households with five adults.

Table 3: Summary of eight selection tables (see Kish, 1965)

Proportion of      Table     If the number of adults in household is:
assigned tables    number    1   2   3   4   5   6 or more
                             Select adult numbered:
1/6                A         1   1   1   1   1   1
1/12               B1        1   1   1   1   2   2
1/12               B2        1   1   1   2   2   2
1/6                C         1   1   2   2   3   3
1/6                D         1   2   2   3   4   4
1/12               E1        1   2   3   3   3   5
1/12               E2        1   2   3   4   5   5
1/6                F         1   2   3   4   5   6

Table 4: Summary of selection probabilities

                  If the number of adults in household is:
Adult numbered    1     2     3     4     5     6 or more
1                 1     1/2   1/3   1/4   1/6   1/6
2                       1/2   1/3   1/4   1/6   1/6
3                             1/3   1/4   1/4   1/6
4                                   1/4   1/6   1/6
5                                         1/4   1/6
6                                               1/6
7 or more                                       0

It may be noted that the procedure has been modified several times by many researchers; Kish himself suggests modifying the tables for special purposes. In paper-and-pencil interviews the interviewer uses the grid as described above. In computer-assisted telephone or personal interviews, the tables are randomly assigned to the households by the computer in their prescribed proportions. Researchers stick to ordering the persons by sex and age, even though they have the technical background for generating random numbers; using random numbers, it would be possible to select a person at random from the set of previously identified adults. Although nobody expresses it explicitly, they consider the sample obtained with the original Kish grid to be representative by sex and age. This representativity would be much less expected if the applied procedure were, e.g., to identify the adults by first name and then select one of them by generating a random number.
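As an illustration of how the grid is applied (a sketch of mine, not part of the original paper), the eight selection tables of Table 3 can be encoded and used as follows; the household size and the random assignment of cover sheets are hypothetical.

```python
# A sketch of Kish-grid selection: each cover sheet carries one of the eight
# tables below; given the number of listed adults, the table names the serial
# number of the adult to interview.  Entries and proportions follow Table 3.
import random

# table -> selections for household sizes 1, 2, 3, 4, 5, 6+
KISH_TABLES = {
    "A":  (1, 1, 1, 1, 1, 1),
    "B1": (1, 1, 1, 1, 2, 2),
    "B2": (1, 1, 1, 2, 2, 2),
    "C":  (1, 1, 2, 2, 3, 3),
    "D":  (1, 2, 2, 3, 4, 4),
    "E1": (1, 2, 3, 3, 3, 5),
    "E2": (1, 2, 3, 4, 5, 5),
    "F":  (1, 2, 3, 4, 5, 6),
}
# relative frequencies of the cover sheets (in twelfths): A, C, D, F twice each
WEIGHTS = {"A": 2, "B1": 1, "B2": 1, "C": 2, "D": 2, "E1": 1, "E2": 1, "F": 2}

def kish_select(n_adults, table):
    """Serial number of the adult to interview in a household of n_adults."""
    return KISH_TABLES[table][min(n_adults, 6) - 1]

# hypothetical use: assign a table to a sampled household, then select
household_adults = 6                     # e.g. the household of Table 1
table = random.choices(list(WEIGHTS), weights=list(WEIGHTS.values()))[0]
print("table", table, "-> interview adult number",
      kish_select(household_adults, table))
```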

4 ”Representativity”

We mentioned that representativity is a desirable property of a sample. It refers to the similarity between the sample and the population in some characteristics of interest. Why is it desirable to reproduce the distribution of certain population characteristics in the sample? Suppose there is a high positive correlation between the characteristic to be estimated and a different one. The more representative the sample is with respect to the latter, the more reliable the estimate of the former will be. (The reliability of an estimator is evaluated on the basis of its variance.) It is standard practice to evaluate a sample by its representativity in order to support the validity of the extrapolations or estimations. We reviewed the accessible literature about samples obtained by using the Kish grid. When evaluating the representativity of their samples, Hungarian researchers often refer to the undersampling of males and the overrepresentation of elderly people (see ISSP Család II, 1994; Táblaképek az egészségről, 2000; Egészségi Állapotfelvétel, 1994). According to the comments, this deviation stems from problems that occur when putting the interview into practice, e.g. males are undersampled because they are more difficult to find at home and are less willing to participate. Later we will give some theoretical evidence that explains the representation problems without relying on these assumptions. It is important to mention that, according to Kish's own words, he used the variables sex and age only for ordering the household members. He did not aim explicitly to reproduce the sex and age distributions. At the same time, however, he expected the sample to be representative. In the first article published about the grid, Kish checked the distribution of the respondents and explained the males' underrepresentation by the practical problems mentioned above: they are more difficult to find at home, etc. (Kish, 1949). Although he emphasized that the grid is for random selection within the household, he was the first not to make a clear distinction between randomness and representativity.

4.1 The cause of non-representativity - assumption

When households are selected with equal probabilities, and the selection probabilities within the household are equal, then the chance of selection of a single adult becomes inversely proportional to the number of adults in the household. Hence the overall selection probabilities are not equal. If the selection probability is a function of household size, and household size is not independent of the members' demographic characteristics, then the sampling design itself is a source of representation problems. The sample would not be representative even if we could obtain a perfectly random household sample and a 100 percent response rate. Kish found that the samples obtained by using the grid showed close agreement with the population data on important demographic characteristics. Kish developed the grid in the USA of the 1950's. He emphasized the relatively low variance of the selection probabilities, which was due to the high concentration within a small range of household sizes: over 70 percent of households contained two adults (see Table 5).

Table 5: Household structure, 1957, USA (see Kish, 1965)

Number of adults in the household   1      2      3     4     5     6 or more
Proportion                          14.6   73.0   9.0   2.8   0.4   0.2

Our results so far show that representativity is a function of the current household structure, and that the grid's performance depends on where and when it is used. It is worth comparing the current Hungarian household structure with the one observed by Kish. In today's Hungary, 26 percent of households are one-person households, almost twice the proportion examined by Kish. This difference alone is substantial enough to raise the question whether the grid can be accepted without modification.

4.2 The cause of non-representativity - proof

To put the assumptions into a concrete form, we determined the exact connection between the grid's performance and the population household structure. The required information on the current Hungarian population is not available; that is why we worked with a sample from a large national household survey1. The data contain information on each member of the sample households, so we could use it as a population for further sampling. In the following we will refer to it as the "pseudopopulation". Table 6 shows the age and sex distribution in the pseudopopulation.

Table 6: Pseudopopulation, age and sex distribution (n=4188)

                    Sex
Age        Male     Female    Total
18-39      19.79    20.51     40.31
40-59      14.66    17.60     32.26
60+        10.94    16.50     27.44
Total      45.39    54.61     100.00

We tested the use of the grid on this pseudopopulation, examining the age and sex distributions of the resulting samples. The expected sex and age proportions of the sample can be formulated as follows. Let p_kl denote the selection probability of adult l living in a household of size k (k = 1, ..., 6, l = 1, ..., k), given that the household has already been selected. As we choose households with equal probabilities, the chance of choosing a household of size k equals the proportion of such households; let H_k denote this value. The expected joint sex and age-group distribution can be given by a 3 × 2 matrix, denoted by a: a[11] is the proportion of young males, a[21] is the proportion of middle-aged males, etc., and a[32] is the proportion of elderly females.

1The TÁRKI Social Research Databank made the database of the "Hungarian Household Panel IV" available for us.

We need information on the households' composition: after selecting person number l within a household of size k, what is the probability of his or her being male or female, and of being young, middle-aged or elderly? Let a_kl be a 3 × 2 matrix (k = 1, ..., 6, l = 1, ..., k). In the same way as above, a_kl[11] denotes the proportion of young males among the persons numbered l living in a household of size k, a_kl[21] is the proportion of middle-aged males, etc. The expected joint age and sex distribution is a function of the other parameters, see Equation 1. H_k, p_kl and a_kl are known input parameters: H_k and a_kl come from the information about the pseudopopulation, while p_kl is given by the selection tables.

a[ij] = \sum_{k=1}^{6} H_k \Bigl( \sum_{l=1}^{k} p_{kl}\, a_{kl}[ij] \Bigr), \qquad i = 1, 2, 3, \; j = 1, 2 \qquad (1)

Substituting the known parameters, we obtain the expected distribution shown in Table 7 (a small computational sketch of this substitution is given after the table).

Table 7: Expected sample, age and sex distribution

                    Sex
Age        Male     Female    Total
18-39      17.27    19.47     36.74
40-59      12.92    17.04     29.96
60+        11.87    21.43     33.30
Total      42.06    57.94     100.00
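A small computational sketch of Equation 1 (all inputs below are illustrative placeholders, not the pseudopopulation data):

```python
# Sketch of Equation 1: expected sex/age distribution of the sample,
# a[i,j] = sum_k H_k * sum_{l<=k} p_kl * a_kl[i,j].
import numpy as np

K = 6                                                  # household sizes 1..6+
H = np.array([0.26, 0.33, 0.20, 0.13, 0.05, 0.03])     # placeholder H_k

# p[k-1, l-1] = selection prob. of adult l in a household of size k
# (these are the original Kish-grid probabilities of Table 4)
p = np.zeros((K, K))
kish_probs = {1: [1], 2: [1/2, 1/2], 3: [1/3, 1/3, 1/3],
              4: [1/4] * 4, 5: [1/6, 1/6, 1/4, 1/6, 1/4], 6: [1/6] * 6}
for k, probs in kish_probs.items():
    p[k - 1, :k] = probs

# a_kl[k-1, l-1] is a 3x2 composition matrix (rows: age groups, cols: sex);
# here filled with a flat placeholder composition.
a_kl = np.full((K, K, 3, 2), 1 / 6)

a = np.zeros((3, 2))
for k in range(K):
    for l in range(k + 1):
        a += H[k] * p[k, l] * a_kl[k, l]
print(np.round(100 * a, 2))                            # expected percentages
```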

It is seen that the expected sample differs from the population in the sex and age distributions. Firstly, elderly people are oversampled, especially women. (It is worth mentioning that in the current population of Hungary a great proportion of the one-person households consists of an older female occupant, and a quarter of all households are one-person households, so it can be concluded that it is more likely to select an elderly female in this way than by simple random sampling.) Secondly, males appear to be underrepresented. Our experiences are similar to those obtained from real surveys.

5 Modification of the Kish grid

In this section we present a modification of the Kish grid. We intended to obtain a representative, or at least a more representative, expected sample. We modified the grid by modifying the selection tables. This modification method is not unprecedented in the literature: Kish himself already suggested modifying the tables when needed. All the sampling features are fixed, i.e. the following conditions hold:

• each household has the same chance of selection

• one and only one interview per household is made

• the selection tables are based on a list of the household members

• this ordering is made by sex and age

• the population to be surveyed is the previously mentioned pseudopopulation

• 12 selection tables are used (obviously, the more tables are used, the finer the probabilities that can be achieved, and thus the closer the agreement between the sample and the population; this is why we limited the number of tables)

• the same rules are applied to households with 6 or more members.

The problem is to construct selection tables that yield a sample in close agreement with the pseudopopulation data. The task can be simplified: instead of determining the tables themselves, it is enough to determine the selection probabilities. Our aim is to obtain a representative expected sample, i.e. one as close as possible to the distribution given by Table 6. Let A denote the 3 × 2 matrix describing the joint sex and age distribution in the pseudopopulation; A[11] equals the proportion of young males, etc. Using the notation of Equation 1, the problem is as follows: the H_k's and a_kl's are given parameters, and a is to be determined as a function of the p_kl's so as to reproduce A. Equation 2 is to be solved,

\sum_{i=1,2,3;\; j=1,2} \bigl| a[ij] - A[ij] \bigr| = 0, \qquad (2)

with constraints

\sum_{j=1}^{i} p_{ij} = 1 \;\; \forall i, \qquad p_{ij} > 0 \;\; \forall i, j, \qquad p_{ij} = \frac{k_{ij}}{12} \;\; \forall i, j, \text{ where } k_{ij} \in \mathbb{N}. \qquad (3)

The constraints make the solution meet the conditions: one and only one person per household is needed, and 12 tables are used, which means that the probabilities are multiples of 1/12. The model is a nonlinear equation with inequality and integer constraints. We applied the Solver software of Microsoft Excel to solve the equation. The problem has no solution. This raises the question whether there would be a solution if the limitation on the number of tables did not hold. Using more tables has a practical disadvantage: it implies increasing costs. Moreover, the sample size itself is an upper limit on the number of tables. Apart from this, the theoretical problem is worth considering. In this case the integer constraint is omitted from (3). The problem does not have a solution in this way either: it is impossible to obtain a perfectly representative sample. Let us modify the problem instead: which selection tables yield a sample that is closest to the pseudopopulation? A distance function is needed, and we look for the solution that minimizes it. We used two functions according to two different approaches. The first one is similar to the Pearson chi-square; Equation 4 shows the function f to be minimized,

f(a) := \sum_{i=1,2,3;\; j=1,2} \frac{(a[ij] - A[ij])^2}{A[ij]}. \qquad (4)

The idea of the other distance function comes from weighting, which is widely used in survey statistics. Weights are generally used to improve the precision of the estimates. Poststratification is a weighting method that produces a sample in which each stratum is represented in its proper proportion. In our case, strata are defined as the 6 cells of the sex by age-group cross-table. The poststratification weight for a given person in a given stratum is defined as the proportion of the population stratum divided by the proportion of the sample stratum. The disadvantage of poststratification is that in some cases it increases the variance of the estimate; the increase in variance is a monotonic function of the sum of the squared weights. This suggests the following approach: find the selection tables that yield a sample with the minimal sum of squared poststratification weights. Equation 5 shows the function g to be minimized,

g(a) := \frac{1}{n} \sum_{k=1}^{n} w_k^2 = \sum_{i=1,2,3;\; j=1,2} \frac{A[ij]^2}{a[ij]}, \qquad (5)

where n is the sample size. As mentioned when finding the solution of the equation with absolute values, the constraints can be specified in two different ways. If they include the integer constraint, the use of 12 tables is assumed; otherwise the number of tables is not limited and the selection probabilities can be any real numbers between zero and one. Combining the two dimensions, four problems appear: find the minimum of function f or g, with or without the integer constraint. A model in which the objective function or any of the constraints is not a linear function of the variables is called a nonlinear programming (NLP) problem; in our case, inequality and integer constraints are added to the model. Weierstrass's theorem states that a real-valued continuous function on a closed bounded set attains a maximum and a minimum value. In our case the conditions of Weierstrass's theorem are met, but the determination of the minimum is not a simple mathematical problem: apart from special cases, nonlinear optimization problems can only be solved numerically. We applied the Solver software of Microsoft Excel to find the minima. Tables 8-12 contain the results.
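The paper solved these problems with Microsoft Excel's Solver. As a rough equivalent, the continuous version (without the integer constraint) could be set up with SciPy as sketched below; H, A and the composition matrices are placeholders, not the pseudopopulation values.

```python
# Sketch of the "closest expected sample" problem without the integer
# constraint: choose within-household selection probabilities p_kl to
# minimise f(a) = sum (a[ij]-A[ij])^2 / A[ij], subject to sum_l p_kl = 1
# and p_kl >= 0.01.  All inputs are illustrative placeholders.
import numpy as np
from scipy.optimize import minimize

K = 6
H = np.array([0.26, 0.33, 0.20, 0.13, 0.05, 0.03])        # placeholder H_k
A = np.array([[0.20, 0.21], [0.15, 0.17], [0.11, 0.16]])  # placeholder target
rng = np.random.default_rng(2)
a_kl = rng.dirichlet(np.ones(6), size=(K, K)).reshape(K, K, 3, 2)  # placeholder

def unpack(theta):
    """theta holds p_kl for k = 2..6, l = 1..k (size-1 households are fixed)."""
    p, pos = [np.array([1.0])], 0
    for k in range(2, K + 1):
        p.append(theta[pos:pos + k]); pos += k
    return p

def expected(theta):
    p = unpack(theta)
    a = np.zeros((3, 2))
    for k in range(K):
        for l in range(k + 1):
            a += H[k] * p[k][l] * a_kl[k, l]
    return a

def f(theta):
    a = expected(theta)
    return np.sum((a - A) ** 2 / A)

n_par = sum(range(2, K + 1))                              # 2+3+4+5+6 = 20
cons = [{"type": "eq",
         "fun": (lambda theta, k=k: unpack(theta)[k - 1].sum() - 1.0)}
        for k in range(2, K + 1)]
x0 = np.concatenate([np.full(k, 1.0 / k) for k in range(2, K + 1)])
res = minimize(f, x0, method="SLSQP", constraints=cons,
               bounds=[(0.01, 1.0)] * n_par)
print("optimal f:", res.fun)
```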

We can observe some expected trends in all four cases. For example, p21 ≈ 2/3, which counteracts the male underrepresentation found when using the Kish grid (p21 is the selection probability of the first adult in a two-person household, and the first one tends to be male because of the ordering procedure). Comparing the optimal sex and age-group distributions (matrix a) with the one belonging to the Kish grid shows that we managed to improve the agreement with the population data for young people and for females, while other cells show some change for the worse.

The four solutions do not differ much from each other, either regarding a or regarding the p_ij's. This means that it is not worth using more than 12 tables.

Table 8: Optimization - results

Original Kish grid

Function value when substituting the Kish grid:   f: 0.021573301   g: 1.018901199

Matrix a (expected sex/age %):
  17.27   19.47
  12.92   17.04
  11.87   21.43

p_ij's:
  p21 1/2   p31 1/3   p41 1/4   p51 1/6   p61 1/6
  p22 1/2   p32 1/3   p42 1/4   p52 1/6   p62 1/6
            p33 1/3   p43 1/4   p53 1/4   p63 1/6
                      p44 1/4   p54 1/6   p64 1/6
                                p55 1/4   p65 1/6
                                          p66 1/6

Moreover, the value of function g at the optimum of function f is very close to the true optimum of g, and vice versa; that is, the optimal tables are close to each other whether measured by f or by g. We can say that the optimal methods perform well from both points of view. Table 13 presents the modified selection tables obtained with function f under the integer constraint. The main results of our work are as follows.

• The samples obtained in Hungary by using the Kish tables differ from the population in sex and age-group distributions. We proved that this is caused by the sampling method and not by practical problems. The literature does not prove this connection, nor does it normally mention it.

• The grid is successfully modifiable if our aim is to adjust the sample to the popula- tion.

• The problem we treated is of international significance. The trends observed in Hungarian household structure are global trends. Size of households is currently decreasing, the proportion of the single persons is on the rise.

There are further problems to be considered. We limited our analysis to representativeness by sex and age. It would be useful to take into account the distributions of other characteristics when using the grid. At the same time, the distributions of other characteristics need checking when using the modified tables: obviously, improving the sex and age adjustment does not mean that the sample agrees with the population with respect to other variables. The change in selection probabilities implied by the modification needs further consideration: the variability of the probabilities can result in an increase of the estimation variance.

Table 9: Optimization of function f

Constraints:
  sum_{j=1..i} p_ij = 1 for all i;   p_ij > 0 for all i, j;   p_ij = k_ij/12 for all i, j, where k_ij is an integer

Optimal value: 0.013938589

Matrix a (expected sex/age %):
  18.56   19.94
  12.81   16.16
  13.11   19.42

p_ij's:
  p21 2/3   p31 2/12   p41 1/12   p51 1/12   p61 1/12
  p22 1/3   p32 1/12   p42 3/12   p52 8/12   p62 7/12
            p33 9/12   p43 1/12   p53 1/12   p63 1/12
                       p44 7/12   p54 1/12   p64 1/12
                                  p55 1/12   p65 1/12
                                             p66 1/12

The optimal solution was derived from the pseudopopulation. It would be worth extending the study with the real population as a starting point, in order to support the generalization of the results.

Table 10: Optimization of function f

Constraints:
  sum_{j=1..i} p_ij = 1 for all i;   p_ij > 1/100 for all i, j

Optimal value: 0.013216638

Matrix a (expected sex/age %):
  18.75   20.15
  12.93   15.84
  13.07   19.26

p_ij's:
  p21 0.6539   p31 0.2351   p41 0.0100   p51 0.0100   p61 0.0100
  p22 0.3461   p32 0.0100   p42 0.3728   p52 0.8877   p62 0.9500
               p33 0.7549   p43 0.0100   p53 0.0100   p63 0.0100
                            p44 0.6072   p54 0.0823   p64 0.0100
                                         p55 0.0100   p65 0.0100
                                                      p66 0.0100

Table 11: Optimization of function g

Constraints:
  sum_{j=1..i} p_ij = 1 for all i;   p_ij > 0 for all i, j;   p_ij = k_ij/12 for all i, j, where k_ij is an integer

Optimal value: 1.012869186

Matrix a (expected sex/age %):
  18.54   19.60
  13.14   16.07
  13.23   19.42

p_ij's:
  p21 2/3   p31 2/12   p41 2/12   p51 1/12   p61 1/12
  p22 1/3   p32 1/12   p42 3/12   p52 7/12   p62 7/12
            p33 9/12   p43 1/12   p53 1/12   p63 1/12
                       p44 6/12   p54 2/12   p64 1/12
                                  p55 1/12   p65 1/12
                                             p66 1/12

Table 12: Optimization of function g

Constraints:
  sum_{j=1..i} p_ij = 1 for all i;   p_ij > 1/100 for all i, j

Optimal value: 1.012269154

Matrix a (expected sex/age %):
  18.63   19.93
  13.11   15.91
  13.23   19.19

p_ij's:
  p21 0.6574   p31 0.2543   p41 0.0270   p51 0.0100   p61 0.0100
  p22 0.3426   p32 0.0100   p42 0.3866   p52 0.5594   p62 0.7638
               p33 0.7357   p43 0.0100   p53 0.0100   p63 0.1962
                            p44 0.5744   p54 0.4106   p64 0.0100
                                         p55 0.0100   p65 0.0100
                                                      p66 0.0100

Table 13: The modified selection tables

Proportion of      Table     If the number of adults in household is:
assigned tables    number    1   2   3   4   5   6 or more
                             Select adult numbered:
1/12               1         1   1   1   1   1   1
1/12               2         1   1   1   2   2   2
1/12               3         1   1   2   2   2   2
1/12               4         1   1   3   2   2   2
1/12               5         1   1   3   3   2   2
1/12               6         1   1   3   4   2   2
1/12               7         1   1   3   4   2   2
1/12               8         1   1   3   4   2   2
1/12               9         1   2   3   4   2   3
1/12               10        1   2   3   4   3   4
1/12               11        1   2   3   4   4   5
1/12               12        1   2   3   4   5   6

References

D. Binson, J.A. Canchola, and J.A. Catania. Random selection in a telephone survey: A comparison of the kish, next-birthday, and last-birthday methods. Journal of Official Statistics, 16:53–59, 2000.

R.M. Groves, P.P. Biemer, L.E. Lyberg, J.T. Massey, W.L. Nicholls, and J. Waksberg, editors. Telephone Survey Methodology. John Wiley and Sons, Inc., New York, 1988.

ISSP. ISSP 1994, Család II. TÁRKI, 1994. http://www.tarki.hu.

J.M. Kennedy. A comparison of telephone survey respondent selection procedures, 1993. Available at http://www.indiana.edu/csr/aapor93.html.

L. Kish. A procedure for objective respondent selection within the household. Journal of the American Statistical Association, pages 380–387, 1949.

L. Kish. Survey Sampling. John Wiley and Sons, Inc., New York, 1965.

Egészségi Állapotfelvétel 1994 - Életmód, kockázati tényezők. Központi Statisztikai Hivatal.

Magyar Statisztikai Évkönyv 1993. Központi Statisztikai Hivatal, Budapest, 1994.

Mikrocenzus, 1996. A népesség és a lakások jellemzői. Központi Statisztikai Hivatal, Budapest, 1996a.

Mikrocenzus, 1996. Főbb eredmények. Központi Statisztikai Hivatal, Budapest, 1996b.

Magyar Statisztikai Évkönyv 1999. Központi Statisztikai Hivatal, Budapest, 2000.

P.J. Lavrakas. Telephone survey methods: Sampling, selection and supervision. Applied Social Research Methods Series, 7, 1993.

P.S. Levy and S. Lemeshow. Sampling of Populations. John Wiley and Sons, Inc., New York, 1999.

Táblaképek az egészségről - A veresegyházi példa. MTA Szociológiai Kutatóintézet - Fekete Sas Kiadó, 2000.

R.W. Oldendick, G.G. Bishop, S.B. Sorenson, and A.J. Tuchfarber. A comparison of the kish and last birthday methods of respondent selection in telephone surveys. Journal of Official Statistics, 4:307–318, 1988.

T. Piazza. Respondent selection for CATI/CAPI (equivalent to the Kish method). Available at http://srcweb.berkeley.edu:4229/res/rsel.html.

Network Effects on the Dynamics of Slovenian Corporate Ownership Network

Marko Pahor University of Ljubljana, Faculty of Economics, Slovenia

1 Introduction

Social networks represent relations (e.g. friendship, esteem, control) between actors (e.g. individuals, companies). Social network analysis is a set of mathematical and statistical techniques that deal with relational data (see e.g. Scott (1999) or Wasserman and Faust (1994)). Complete networks are considered in this paper, i.e. a set of n actors and a relation defined by an n × n matrix X = (x_ij), where x_ij represents the directed relation from actor i to actor j (i, j = 1, . . . , n). Traditionally one deals with a cross-sectional network, i.e. the observation of a network at one point in time; the data may still be gathered over a longer time period (e.g. a year), but the resulting network is always a single one. A wide array of techniques has been developed to describe and understand the structure of social networks and the role of actors within them. In this paper we deal with longitudinal data on an entire network. The available data are a time series x(t), t ∈ (t1, . . . , t8), of social networks for a constant set of actors. The observation times are ordered, i.e. t1 < t2 < . . . < t8. The purpose of the statistical analysis is to obtain insight into the evolution of the network, where the initial state x(t1) is taken as given. Markov chain Monte Carlo methods developed by Snijders (1996) are used to test a stochastic actor-oriented model, with which we try to understand the drivers of change in the network. Although primarily developed in sociology, social network analysis has found many applications in economics and business. When the actors in a network are economic entities (companies, countries, etc.) we talk about economic networks. Following the division of economics into microeconomics and macroeconomics, we can identify two main levels of economic networks. In micro-level networks the typical actors are companies, parts of companies, departments, etc.; the relations of interest are typically ownership relations (companies owning shares in each other), governance and control, financial relations, production chain relations, and so on. On this level corporate networks are of special interest: a corporate network is a network that describes relations between companies. Macro-level networks typically involve countries, regions, sectors or industries, and the relations examined are trade, investments, financial flows, etc. In this paper we start looking at the reasons that drive Slovenian companies (financial and non-financial) into activity in the ownership network. Only effects that deal with the network structure are examined, and we merely control for whether a company is financial or non-financial and for its relative size. Following this introduction, the corporate network in Slovenia is described first.

Next we present the techniques used to examine the data, followed by the description of data used. After the presentation of empirical results we will conclude.

2 Corporate network in Slovenia

In the 1990s Slovenia, like other post-socialist countries, underwent the process of transition. Before the transition one could not speak of a corporate network in the traditional meaning of the word (see e.g. Mizruchi, 1982), since in Slovenia, as in other parts of the former Yugoslavia, companies were mostly "socially owned" and self-managed. The "socially owned, self-managed firm" was a construct specific to the form of market-planned economy in place in the former Yugoslavia. Companies were essentially owned by no one (or everyone, whichever one prefers) and formally controlled by their employees, so there could not be any ownership or governance network. Some sort of economic network between companies still existed. There was of course the supply-chain network, and there also existed a network formed by banks and companies: companies owned the banks that serviced their financial needs. This network was politically influenced and locally based, and it mostly disappeared with the rehabilitation of the banking system in the mid-1990s. In the privatization process, corporations also appeared in the Slovenian economy. Former social capital took on new owners as prescribed by the privatization law. Each company chose the privatization model that best suited it, whereby it could give more weight to either internal or external ownership1. After the formal process of privatization was finished, trading in shares began. Today, stocks are traded in the organized market of the Ljubljana Stock Exchange (LSE), where mostly smaller investors trade; institutional investors (i.e. privatization and other funds) trade mostly over the counter in large package deals. The beginning of trading opened up the possibility for other companies to become owners, to acquire companies and merge with them, and thereby to form a corporate network. As described in Pahor et al. (2000), financial institutions play a central role in the Slovenian corporate network. Only one large group of companies has so far formed around a core of financial institutions. Companies in this group have developed cross-ownership and a hierarchical structure of control. The forming of groups and of structures within them is common in traditional economies, but one usually observes several such groups. By offering a complete range of financial services, the financial institutions tend to reduce uncertainty for the companies participating in the cluster.

1The 1992 Privatization Law allocated 20 percent of a firm's shares to insiders (workers), 20 percent to the Development Fund that auctioned the shares off to investment funds, 10 percent to the National Pension Fund, and 10 percent to the Restitution Fund. In addition, in each enterprise the workers' council or the board of directors (if one existed) was empowered to allocate the remaining 40 percent of a company's shares for sale to insiders (workers) or outsiders (through a public tender). Based on the decision on the allocation of this remaining 40 percent of shares, firms can be classified as privatized to insiders (the internal method) or to outsiders (the external method).

3 Methodology used

The methodology used to establish the network effects on the dynamics of the corporate ownership network in Slovenia is the stochastic actor-oriented model developed by Snijders (2001). Longitudinal social network data are a complex data structure, requiring complex methods of analysis for a satisfactory treatment. Holland and Leinhardt (1977) already proposed continuous-time Markov chains as a model for longitudinal social networks. In a continuous-time model, time is assumed to flow on continuously; although observations are available only at the discrete time points t1 to tM, between the observations the network is assumed to change unobserved at random moments as time progresses. We have n actors and a relation defined by the sociomatrix, an n × n matrix X = (x_ij) where x_ij represents the directed dichotomous relation from actor i to actor j (i, j = 1, . . . , n). Network evolution, defined as the change in X between the time points t_m, is a function of structural (or network) effects, explanatory actor variables and explanatory dyadic variables. We speak of actor-oriented models, so the main assumption is that each actor controls its outgoing relations, which are represented by the rows of the sociomatrix. In the model, actors act independently of each other, one at a time, and have perfect knowledge of the current network. They are struck with "myopia", meaning that they only try to improve their present situation and do not follow any long-term strategy. If we briefly consider the viability of these assumptions for the corporate ownership network in Slovenia, we can see that they are reasonably well met. Full information about the network is available to anyone (as we will see later in the description of the data); there is only the question of costs, as the data are available to some actors at lower cost than to others. Actors (companies) are mostly independent; there is no network-wide coordination among them, although some group-wide coordination may be present. Given short enough time intervals, the assumption of actors acting one at a time is well met. The only violated assumption is that of myopic behaviour: at least some actors have a strategy that they follow. Each time an actor acts, a "mini-step" occurs, and this happens at stochastic times (with rate function λ). In a mini-step, actor i changes its relation with one actor j in order to maximize its objective function f.

3.1 Formal definition The objective function is denoted by

f_i(\theta, x), \quad i = 1, \ldots, n, \; x \in X,

and indicates the degree of satisfaction for actor i inherent in the relational situation represented by x.2 This function depends on a parameter vector θ. In the simple model specification of this section, the parameter of the statistical model is θ = (β, ρ), where ρ = (ρ1, . . . , ρ_{M−1}) is the vector of change rates during the time periods from t_m to t_{m+1} (m = 1, . . . , M − 1). At random times actor i has the opportunity to change its relation with some actor j, which is selected so as to maximize the objective function plus a random element,

f_i(\beta, x(i \to j)) + U_i(t, x, j),

where U_i is a random variable indicating the part of the actor's preference not captured by the systematic component f_i. The probability that actor i chooses actor j for changing the relation x_ij is given by the multinomial logit expression

p_{ij}(\theta, x) = \frac{\exp(f(i, j))}{\sum_{h=1, h \neq i}^{n} \exp(f(i, h))}, \qquad (i \neq j).

X follows a continuous-time Markov chain, of which only discrete-time observations are visible. Changes occur one at a time according to the intensity matrix, represented by the functions

q_{ij}(x) = \lim_{dt \to 0} \frac{P\{X(t + dt) = x(i \to j) \mid X(t) = x\}}{dt}.

Computer simulation of the process follows this algorithm:

1. Set t = 0 and x = X(0).
2. Generate S according to the negative exponential distribution with mean 1/λ_+(θ, x), where λ_+(θ, x) = Σ_i λ_i(θ, x).
3. Select i using probabilities λ_i(θ, x)/λ_+(θ, x).
4. Select j using probabilities p_ij(θ, x).
5. Set t = t + S and x = x(i → j).
6. Go to step 2 (unless a stopping criterion is satisfied).

2Snijders (2001)
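A compact sketch of this simulation scheme (my illustration, with only density and reciprocity effects, a common rate for all actors, and invented parameter values):

```python
# Sketch of the actor-oriented simulation loop: exponential waiting times,
# pick an actor i, then pick the change x(i -> j) by a multinomial logit on
# the objective function.  Only density and reciprocity effects are used.
import numpy as np

rng = np.random.default_rng(3)
n = 20
beta_density, beta_recip = -2.0, 1.5     # invented parameter values
lam = 1.0                                # common rate lambda_i

def objective(x, i):
    # density effect beta1 * x_{i+}  +  reciprocity effect beta2 * sum_j x_ij x_ji
    return beta_density * x[i].sum() + beta_recip * (x[i] * x[:, i]).sum()

def toggle(x, i, j):
    y = x.copy(); y[i, j] = 1 - y[i, j]; return y

x = np.zeros((n, n), dtype=int)
t, t_end = 0.0, 5.0
while True:
    t += rng.exponential(1.0 / (n * lam))    # step 2: waiting time
    if t > t_end:
        break
    i = rng.integers(n)                      # step 3: lambda_i equal for all i
    gains = np.array([objective(toggle(x, i, j), i) if j != i else -np.inf
                      for j in range(n)])    # step 4: evaluate each mini-step
    p = np.exp(gains - gains.max()); p[i] = 0.0; p /= p.sum()
    j = rng.choice(n, p=p)
    x = toggle(x, i, j)                      # step 5: carry out the change
print("simulated ties at t_end:", x.sum())
```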

The objective function reflects the structural properties of the network as well as actor-dependent covariates and dyadic covariates. It can be defined as

f_i(\beta, x) = \sum_{k=1}^{L} \beta_k s_{ik}(x),

where the β_k are statistical parameters indicating the strength of the effect s_ik, controlling for all other effects in the model, and the s_ik(x) are relevant functions of the digraph that are supposed to play a role in its evolution. The s_ik(x) can be defined at will, with an infinite number of possibilities; here are some examples of possible network effects, as listed by Snijders (2001), with a rationale for their use in a corporate ownership network:

1. Density effect, defined by the out-degree,

   s_{i1}(x) = x_{i+} = \sum_j x_{ij}.

   This effect reflects the general tendency to increase the number of links. In an ownership network it captures the general tendency of a rational investor to place excess funds in a well-diversified portfolio.

2. Reciprocity effect, defined by the number of reciprocated relations,

   s_{i2}(x) = \sum_j x_{ij} x_{ji}.

   This shows the tendency of actors to reciprocate relations: if this effect is important and there is an inbound relation from j to i, then i will tend to establish the relation from i to j. In an ownership network the importance of this effect could have two rationales: either companies are forming groups of mutually interconnected firms, or reciprocation is a defence measure against takeovers.

3. Popularity of alter effect, defined by the sum of the in-degrees of the others to whom i is related,

   s_{i3}(x) = \sum_j x_{ij} x_{+j} = \sum_j x_{ij} \sum_h x_{hj}.

   In an ownership network one would expect companies with higher in-degrees to be larger and/or more profitable.

4. Activity of alter effect, defined by the sum of the out-degrees of the others,

   s_{i4}(x) = \sum_j x_{ij} x_{j+} = \sum_j x_{ij} \sum_h x_{jh}.

   If this effect is important in an ownership network, companies are trying to achieve indirect links to more companies.

5. Transitivity effect, defined by the number of transitive patterns in i's relations (ordered pairs of actors (j, h) to both of whom i is related, while j is also related to h),

   s_{i5}(x) = \sum_{j,h} x_{ij} x_{jh} x_{ih}.

   If an indirect relation exists, actors tend to form also a direct relation. In a corporate ownership network this would mean consolidating positions within the network.

Many other network effects can be defined and tested; however, only the effects described above were tested in this paper. Besides the network effects, actor-dependent covariate effects were also tested. These effects are as follows:

6. Covariate-related popularity, defined by the sum of the covariate over all actors to whom i has a relation,

   s_{i8}(x) = \sum_j x_{ij} \nu_j.

7. Covariate-related activity, defined by i's out-degree weighted by its own covariate value,

   s_{i9}(x) = \nu_i x_{i+}.

8. Covariate-related dissimilarity, defined by the sum of the absolute covariate differences between i and all the others to whom it is related,

   s_{i10}(x) = \sum_j x_{ij} \, |\nu_i - \nu_j|.

The rationale for the covariate effects depends on the covariates chosen, so it is not given at this point. The parameters of the model are estimated using the method of moments (Bowman and Shenton, 1985). A statistic Z = (Z_1, . . . , Z_K) is used, for which θ is determined as the solution of the K-dimensional moment equation

E_{\hat\theta} Z = z,

where z is the observed outcome, and the moment estimate \hat\theta is defined as the parameter value for which the expected value of the statistic equals the observed value. Z is a vector of statistics relevant for the estimation of the objective function. C_m is the relevant statistic for ρ_m (the rate of change) and is defined as

C_m = \sum_{i,j=1,\; i \neq j}^{n} \bigl| X_{ij}(t_{m+1}) - X_{ij}(t_m) \bigr|.

S_{mk}, on the other hand, is the relevant statistic for β_k and is defined as

S_{mk} = \sum_{i=1}^{n} s_{ik}\bigl(X(t_{m+1})\bigr).

This statistic has an intuitive appeal: if β_k is larger, the actors strive more strongly for a high value of s_ik, so the more important that effect is. Z can thus be written as

Z = (C_1, \ldots, C_{M-1}, S_{m1}, \ldots, S_{m,L}).

E_{\hat\theta} Z cannot be calculated explicitly; instead, a Robbins-Monro iterative process is used to estimate the parameters. The algorithm is embedded in the SIENA program. The iteration step of the Robbins-Monro process is

\hat\theta_{N+1} = \hat\theta_N - a_N D^{-1} (z_N - z),

where D is a matrix that links Z and θ, z_N is a simulated value of Z, and a_N is a constant with a_N → 0. The estimation algorithm has three phases and can be summarized as follows (a toy illustration of the Robbins-Monro update is given after the list):

1. preliminary estimation for defining matrix D

2. estimation phase, where aN gets smaller between subphases

3. check the convergence and calculate standard errors
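To illustrate the Robbins-Monro update in isolation, here is a toy one-dimensional sketch (not the SIENA implementation): the simulated statistic is Poisson with mean θ, so solving E_θ Z = z should drive θ towards the observed z.

```python
# Toy Robbins-Monro update theta <- theta - a_N * D^{-1} (z_N - z):
# the "simulator" draws Z ~ Poisson(theta), whose expectation is theta,
# so the method-of-moments solution of E_theta Z = z is simply theta = z.
import numpy as np

rng = np.random.default_rng(4)
z_obs = 3.0                       # observed statistic
theta = 1.0                       # starting value
D = 1.0                           # derivative of E_theta Z with respect to theta

for N in range(1, 501):
    a_N = 1.0 / N                 # step sizes a_N -> 0
    z_N = rng.poisson(theta)      # simulate the statistic at the current theta
    theta = max(theta - a_N * (z_N - z_obs) / D, 0.01)
print("estimate:", theta)         # should be close to z_obs = 3.0
```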

4 Data

The data used in this paper were acquired from the Central Brokerage Clearing House (CBCH), which by today has records of all shareholders of all Slovenian companies with share capital; this covers more than 900 companies. The CBCH has its foundations in the Law on the dematerialization of securities, which was passed in 1997. In the same year the CBCH began functioning, and the share companies in Slovenia had until the end of 1999 to transfer their shareholder records to the CBCH, to be kept in dematerialized form. The number of companies in the database gradually increases, as is evident from Table 1. In this paper we used eight time observations from January 1998 to July 2001, half a year apart. The dates were selected in order to capture the time just after the financial reports are filed and most of the shareholders' assemblies are finished. Both events can have short-term effects on the ownership structure, appearing just before the event and disappearing soon after (see Cirman and Konič, 2000), and we wanted to avoid these errors. Table 1 displays the number of companies in the database.

Table 1: Number of companies in the database

Period            Number of companies
Beginning 1998    360
Mid 1998          428
Beginning 1999    493
Mid 1999          587
Beginning 2000    788
Mid 2000          857
Beginning 2001    865
Mid 2001          856

Source: CBCH database

Due to the computational demands of the estimation process described above, and the consequent limitations embedded in the estimation software, we could not use all available data but only a smaller sample. We began the sampling procedure by identifying the 200 largest non-financial companies and the 50 largest banks and other financial institutions. When defining the size of the companies, different criteria were used for financial and non-financial companies: total revenue defined the size of non-financial companies, whereas total assets were deemed an appropriate measure of the size of financial companies. As we only have data for publicly held (share) companies, and some of the largest financial companies are still owned by the state or their data were unavailable for other reasons, the

final sample consisted of 152 of the largest Slovenian non-financial companies and 26 financial institutions. Essentially we had 8 × 178 egocentric networks with mutually overlapping egos and alters, which allowed us to construct a sequence of eight relations, i.e. a dynamic network. The relation at each observed point in time is a directed dichotomous relation meaning "is among the 30 largest owners of the company in the defined time period". Besides that, two covariates were defined. One is a dummy variable indicating the type of company, with value 1 if the company is a financial company and 0 otherwise. The other is a size indicator; because of software limitations, size is defined relatively, as the rounded percentage of the largest company's size, separately for financials and non-financials. If we compare the eight networks in the sequence3 (Table 2), we can see that the network is gaining in density, while neither the in-degree nor the out-degree centralization shows a systematic movement in either direction. We can, however, notice that the number of strong components is slowly diminishing and the largest strong component is gradually growing, which would indicate increasing cross-ownership in the network. (A short sketch of how these descriptives can be computed is given after Table 2.)

Table 2: Network properties in each time period

Period           Density   In-degree        Vertices with   Out-degree       Largest      Strong       Largest strong
                           centralization   in-degree 0     centralization   out-degree   components   component
Beginning 1998   0.0037    0.07             138             0.041            8            2            4
Mid 1998         0.0054    0.074            127             0.092            17           4            5
Beginning 1999   0.0076    0.078            117             0.141            26           3            9
Mid 1999         0.0091    0.059            92              0.168            31           3            28
Beginning 2000   0.0139    0.071            58              0.203            38           2            61
Mid 2000         0.0151    0.064            40              0.190            36           2            71
Beginning 2001   0.0154    0.076            36              0.196            37           2            68
Mid 2001         0.0161    0.155            36              0.189            36           1            71

Source: CBCH database
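A sketch of how the descriptives in Table 2 can be computed for one observed network, assuming the networkx package and an adjacency matrix in a NumPy array (random placeholder data here); the centralization formula and the counting of only multi-vertex strong components are my reading of the table's columns.

```python
# Sketch of the descriptives reported in Table 2 for one observed network.
import numpy as np
import networkx as nx

rng = np.random.default_rng(5)
n = 178
adj = (rng.random((n, n)) < 0.01).astype(int)          # placeholder 0/1 matrix
np.fill_diagonal(adj, 0)
G = nx.from_numpy_array(adj, create_using=nx.DiGraph)

def degree_centralization(degrees, n):
    """Freeman-type centralization: sum of (max - d_i) divided by (n-1)^2,
    one common normalisation for directed in-/out-degree."""
    d = np.array(degrees, dtype=float)
    return (d.max() - d).sum() / ((n - 1) ** 2)

density = nx.density(G)
in_centr = degree_centralization([d for _, d in G.in_degree()], n)
out_centr = degree_centralization([d for _, d in G.out_degree()], n)
components = list(nx.strongly_connected_components(G))
nontrivial = [c for c in components if len(c) > 1]     # components with > 1 vertex
print(f"density={density:.4f}  in-centralization={in_centr:.3f}  "
      f"out-centralization={out_centr:.3f}  "
      f"strong components={len(nontrivial)}  largest={max(map(len, components))}")
```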

If we compare the changes, which are summarized in Table 3, we can see that in the first two years (four periods) there is a lot of missing data, which is due to the slow transfer of the shareholder records to the CBCH, evident from Table 1. So, looking back at Table 2, only the last four periods are relevant, the changes in the first four being largely due to the decreasing amount of missing data. We can observe a slight increase in in-degree centralization, a decrease in out-degree centralization, and the continued increase of

3For the definition of network properties see Scott (1999) or Wasserman and Faust (1994).

cross-ownership among the largest Slovenian companies.

Table 3: Number of changes between subsequent observations

Observation times              No ties in       Created   Dissolved   Tie in both   Missing
                               either period    ties      ties        periods
Beginning 1998 to Mid 1998     12410            52        17          88            18939 (60%)
Mid 1998 to Beginning 1999     11499            61        16          106           19824 (63%)
Beginning 1999 to Mid 1999     12891            34        33          140           18408 (58%)
Mid 1999 to Beginning 2000     18129            54        43          182           13098 (42%)
Beginning 2000 to Mid 2000     29104            70        61          324           1947 (6%)
Mid 2000 to Beginning 2001     31018            85        75          328           0 (0%)
Beginning 2001 to Mid 2001     31039            54        51          362           0 (0%)

Source: CBCH database, SIENA output

Table 3 shows that the amount of change in the network between observations is not great, and that the number of created ties is systematically higher than the number of dissolved ties, confirming the idea of increasing cross-ownership among the largest Slovenian companies. The question now is what drives these changes.

5 Results

We applied the model described in Section 3 of this paper to the data described above; here we present the results of the model estimation. The estimated model (Table 4) showed good convergence. The rate of change does not differ much between periods; it is, however, somewhat lower in the last period. First, we can notice that there is apparently no significant effect of company size on the rate (nor on the other parameters, which was tested but is not reported here). Financial companies are, as expected, more active than non-financial ones. The effect of financial dissimilarity is significant and positive, as financials tend to own more non-financial companies; it is, however, relatively low, which might be due to a lot of cross-ownership between non-financial companies.

Table 4: Results of the model estimation on the sequence of eight observations

Effect                        coeff      s.e.     t
Rate parameter period 1       1.3191     0.2367   5.57
Rate parameter period 2       1.5723     0.2518   6.24
Rate parameter period 3       1.1277     0.1861   6.06
Rate parameter period 4       1.2311     0.1513   8.13
Rate parameter period 5       1.0485     0.0903   11.62
Rate parameter period 6       1.3455     0.1095   12.29
Rate parameter period 7       0.7548     0.0813   9.28
l: effect size on rate        -0.0043    0.0173   -0.25
f: density (out-degree)       -3.6509    0.1364   -26.7
f: reciprocity                1.5176     0.1611   9.42
f: transitivity               0.4640     0.0840   5.52
f: popularity of alter        17.3937    4.0856   4.26
f: activity of alter          -0.2507    1.6727   -0.15
f: fin popularity of alter    0.3768     0.1724   2.19
f: fin activity of ego        1.6970     0.2233   7.60
f: fin dissimilarity          0.3878     0.1358   2.86

Source: CBCH database, SIENA output

The popularity of alter effect is positive and very strong. We assume that the popularity of an alter depends on the size and profitability of the company, so the size effect might be non-significant because of multicollinearity with alter popularity. Companies also tend to reciprocate relations, and there is a weak but significant transitivity effect. This further supports the increasing cross-ownership between companies. The activity of alter effect is not significant, showing that companies are not interested in indirect ownership. Because of the amount of missing data in the first two years, we decided to re-estimate the model on just the last four observations, where the amount of missing data is drastically reduced. Table 5 summarizes the results. The estimation on the last (stable) four periods gives very similar results to the estimation on the whole sequence. The only real difference is that the financial dissimilarity effect now becomes non-significant, which might indicate that non-financial companies own other non-financials and vice versa.

6 Conclusions

Corporate ownership networks have a long tradition in developed economies; they are, however, new to transitional economies like Slovenia, where these networks developed only recently as a result of institutional changes. The resulting network did not develop naturally and is therefore changing rapidly. Techniques like the stochastic actor-oriented models give us an insight into the factors that influence the rate and the direction of the changes in the network.

Table 5: Results of the model estimation on the sequence of last four observations

Effect                        coeff      s.e.     t
Rate parameter period 1       1.0814     0.1061   10.19
Rate parameter period 2       1.3718     0.1246   11.01
Rate parameter period 3       0.7695     0.0872   8.83
l: effect size on rate        -0.0036    0.0151   -0.24
f: density (out-degree)       -3.9678    0.1553   -25.55
f: reciprocity                1.5373     0.1777   8.65
f: transitivity               0.4208     0.1033   4.07
f: popularity of alter        18.3403    5.7378   3.20
f: activity of alter          -0.3730    1.7033   -0.22
f: fin popularity of alter    0.5847     0.2222   2.63

Source: CBCH database, SIENA output

We estimated the model on a sequence of eight corporate ownership networks defined on the sample of the 178 largest financial and non-financial Slovenian companies.

The results show that there are weak or no differences between financial and non-financial companies, except that the financial companies are changing their portfolios more rapidly than the non-financial companies do. This would indicate that, contrary to some popular belief, most of the financial investments of non-financial companies are strategic and as such of a long-term nature (see Cirman and Konič, 2000; Gregorič et al., 2000). On the other hand it could also support another popular belief, namely that the portfolios of Slovenian financial companies are too dispersed and that their activity reflects the consolidation of these portfolios.

The importance of the transitivity and reciprocity effects shows the tightening of the connections between Slovenian companies, which could be a defensive measure or merely a way of reducing uncertainty and therefore reducing risk (see Podolny, 1994).

In the future we plan to test for other significant effects. The most interesting findings of this paper are, first, that the size of the company apparently has no significant effect on the dynamics of ownership change and, second, that non-financial companies seem to be important and active actors in the corporate ownership network. The former is to a certain extent explained by the fact that alter popularity, which is strongly influenced by company size (see Pahor et al., 2000), has a very strong effect on the probability of a link being present and, through multicollinearity, eliminates the effect of size. The latter is, however, more peculiar and unexpected. We plan to concentrate our future research on defining the factors that drive non-financial companies to involve themselves in financial activities they are not specialized in, and possibly to examine the effects of these activities on the companies' performance.

References

A. Cirman and M. Konič. Značilnosti delitve denarnega toka v slovenskih podjetjih. In J. Prašnikar, editor, Internacionalizacija slovenskih podjetij. Finance, Ljubljana, 2000.

A. Gregorič, J. Prašnikar, and I. Ribnikar. Corporate governance in transitional economies: The case of Slovenia. Economic and Business Review, 2000.

P. Holland and S. Leinhardt. A dynamic model for social networks. Journal of Mathematical Sociology, 5:5–20, 1977.

M. Mizruchi. The American Corporate Network 1904-1974. Sage, London, 1982.

M. Pahor, A. Ferligoj, and J. Prašnikar. Omrežje slovenskih podjetij na podlagi lastniških povezav in povezav nadzornih svetov. In J. Prašnikar, editor, Internacionalizacija slovenskih podjetij. Finance, Ljubljana, 2000.

J. M. Podolny. Market uncertainty and the social character of economic exchange. Administrative Science Quarterly, 39:458–483, 1994.

J. Scott. Social Network Analysis: A Handbook. Sage, 1999.

T.A.B. Snijders. Stochastic actor-oriented models for network change. Journal of Mathematical Sociology, 21:149–172, 1996.

T.A.B. Snijders. The statistical evaluation of social network dynamics. Sociological Methodology, 2001.

S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, 1994.

Asymptotically Efficient Generalised Regression Estimators

G. E. Montanari and M. G. Ranalli1

Abstract: In this paper we introduce an enlarged class of generalised regression estimators of a finite population mean that includes the optimal estimator as a special case. Theoretical results show that the latter can be seen as a generalised regression estimator based on a suitable superpopulation linear regression model. Then, an estimation procedure able to merge the large sample efficiency of the optimal estimator with the greater stability of the generalised regression estimator for samples of moderate size is proposed. A simulation study provides empirical evidence of the quoted theory.
Keywords: survey sampling; regression estimator; superpopulation model

1 Introduction

Regression estimation is a powerful technique for estimating finite population means or totals of survey variables when the population means or totals of a set of auxiliary variables are known. Two well-known types of regression estimators have recently been studied in the literature, namely the Generalised Regression Estimator (GRE) and the Optimal Estimator (OPE). Until now, they have been studied separately (for a short review see Montanari, 1998). In this paper we explore connections between the two types of regression estimators and establish conditions under which a GRE is exactly or asymptotically equivalent to the OPE. To this end, an Extended Generalised Regression Estimator (EGRE) is proposed to allow regression estimation based on superpopulation models with a non-diagonal variance matrix. Theoretical results show that the OPE can be seen as an EGRE which incorporates the auxiliary variables used at the sampling design stage. Thus, the OPE uses a greater number of auxiliary variables than the GRE and, as a consequence, it may be less stable than the GRE when the sample size is not large. We then propose a new estimation procedure which allows for a more efficient use of the auxiliary information coming both from the sampling scheme and from alternative sources. The core of the proposal consists in a compromise between the large sample efficiency of the OPE and the superior stability of the GRE for smaller sample sizes. The new strategy is shown through simulation studies to be robust over several different superpopulation models and on average more efficient than the OPE or the GRE alone.

2 Background and preliminary definitions

Consider a finite population $U = \{u_1, u_2, \ldots, u_N\}$, where the $i$-th unit is represented by its label $i$. Let $Y_i$ and $x_i = (X_{1i}, X_{2i}, \ldots, X_{qi})'$ be the values of the survey variable $y$ and of a $q$-dimensional auxiliary variable $x$ associated with unit $i$.

1Department of Statistical Sciences, University of Perugia, Via A. Pascoli - 06100 Perugia - Italy. E-mail: [email protected]

The population mean vector $\bar{x} = \sum_{i=1}^{N} x_i / N$ of $x$ is assumed known, e.g. from administrative registers or census data. The unknown $y$ variable mean, $\bar{Y} = \sum_{i=1}^{N} Y_i / N$, has to be estimated by means of a sample $s$ of size $n$ drawn from $U$ according to a probabilistic sampling design and taking into account the knowledge of $x$. We assume here that $x_i$ is known for units in the sample but not in the population and that, for consistency with external sources, the estimator of $\bar{Y}$ to be adopted must take the known values contained in $\bar{x}$ when applied to the auxiliary variables. So, in our perspective the vector $\bar{x}$ is taken as given, as well as the sampling design. In this paper we are going to discuss neither the choice of subsets of elements of $x$ for estimation purposes nor the design of the sampling scheme.

Let $\hat{\bar{Y}} = \sum_{i=1}^{n} Y_i / N\pi_i$ and $\hat{\bar{x}} = \sum_{i=1}^{n} x_i / N\pi_i$ be the design-unbiased Horvitz-Thompson estimators of $\bar{Y}$ and $\bar{x}$, respectively, where $\pi_i$ $(i = 1, \ldots, N)$ is the first order inclusion probability of the sampling design. The most common way of taking into account the knowledge of the auxiliary variable population means is to adopt the regression estimator $\hat{\bar{Y}}_R = \hat{\bar{Y}} + \hat{\beta}'(\bar{x} - \hat{\bar{x}})$, where $\hat{\beta}$ is a vector of regression coefficients, given by some function of the sample data $\{(Y_i, x_i);\ i = 1, \ldots, n\}$. The class of regression estimators contains well known estimators, such as ratio and product estimators, ratio and product estimators with linearly transformed auxiliary variables, post-stratified and regression estimators. So, the main issue the statistician has to deal with is the definition of $\hat{\beta}$.

Let us denote by $W$ the $N \times N$-matrix whose $ij$-th entry is $(\pi_{ij} - \pi_i\pi_j)/N^2\pi_i\pi_j$, where $\pi_{ij}$ $(i, j = 1, \ldots, N)$ is the second order inclusion probability of the sampling design, and $\pi_{ii} = \pi_i$. Assembling the values of $y$ and $x$ into an $N$-vector $Y$ and an $N \times q$-matrix $X$ having $x_i'$ on its $i$-th row, the variances of $\hat{\bar{Y}}$ and $\hat{\bar{x}}$ and the covariance between them are given by $V(\hat{\bar{Y}}) = Y'WY$, $V(\hat{\bar{x}}) = X'WX$ and $C(\hat{\bar{Y}}, \hat{\bar{x}}) = X'WY$, respectively. In the sequel, we assume that the sampling design and the population are such that non-linear estimators converge in probability to their first order Taylor linear approximations when the sample size and the population size approach infinity. We will term as "asymptotic" any property that depends upon this convergence.
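As a purely numerical illustration of the quantities just defined (this is not code from the paper), the Horvitz-Thompson estimators and a generic regression estimator can be computed as follows, assuming the first order inclusion probabilities and the population mean vector of $x$ are supplied:

```python
# Minimal sketch: Horvitz-Thompson estimators and a generic regression estimator.
import numpy as np

def ht_mean(y_s, pi_s, N):
    """Horvitz-Thompson estimator of the population mean of y."""
    return np.sum(y_s / pi_s) / N

def regression_estimator(y_s, X_s, pi_s, N, xbar_N, beta_hat):
    """Ybar_HT + beta_hat' (xbar_N - xbar_HT), for a given coefficient vector."""
    ybar_ht = ht_mean(y_s, pi_s, N)
    xbar_ht = np.sum(X_s / pi_s[:, None], axis=0) / N
    return ybar_ht + beta_hat @ (xbar_N - xbar_ht)
```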

3 The Optimal Estimator

This type of regression estimator is obtained from the difference estimator, i.e. $\tilde{\bar{Y}}_0 = \hat{\bar{Y}} + B'(\bar{x} - \hat{\bar{x}})$, where $B$ is a vector of constants; $\tilde{\bar{Y}}_0$ is unbiased for $\bar{Y}$ and has a variance given by $V(\hat{\bar{Y}}) + B'V(\hat{\bar{x}})B - 2B'C(\hat{\bar{Y}}, \hat{\bar{x}})$. The latter is minimised by taking $B = (X'WX)^{-1}X'WY$ (Montanari, 1987). When $X'WX$ is singular and its rank is $q_0 < q$, to define $B$ one or more entries of $x$, hence of $\bar{x}$, have to be dropped in such a way as to obtain a $q_0 \times q_0$ non-singular variance matrix $X'WX$. The optimum value of $B$ can be estimated in many ways. For our purposes, if we take Horvitz-Thompson estimators of the variances and covariances between unbiased estimators, under mild conditions on the sampling design a consistent estimator of $B$ is given by $\hat{B} = (X_s'W_{ss}X_s)^{-1}X_s'W_{ss}Y_s$, where $W_{ss} = \{(\pi_{ij} - \pi_i\pi_j)/N^2\pi_i\pi_j\pi_{ij}\}_{i,j=1,\ldots,n}$, $X_s = \{x_i'\}_{i=1,\ldots,n}$, $Y_s = \{Y_i\}_{i=1,\ldots,n}$. Then, replacing $B$ by $\hat{B}$ in $\tilde{\bar{Y}}_0$, we get $\hat{\bar{Y}}_0 = \hat{\bar{Y}} + \hat{B}'(\bar{x} - \hat{\bar{x}})$,

which has been called the Optimal Estimator by Rao (1994). This estimator shares in large samples the properties of $\tilde{\bar{Y}}_0$, the latter being the first order Taylor linear approximation of the former (Montanari, 1987). So, $\hat{\bar{Y}}_0$ is asymptotically design unbiased, and has the minimum asymptotic variance among all regression estimators based on the same auxiliary information $x$.

Other properties of the OPE are: 1) the means of the auxiliary variables estimated through $\hat{\bar{Y}}_0$ equal the corresponding population means, i.e. $\hat{\bar{x}}_0 = \bar{x}$; 2) when there is more than one survey variable, $\hat{\bar{Y}}_0$ can be expressed as a simple weighted estimator with the same weights applying to all variables (Montanari, 1998); 3) $\hat{\bar{Y}}_0$ gives valid conditional inferences (see Rao, 1997). Note that the asymptotic optimality of the OPE is a strictly design based property.

The main drawback of $\hat{\bar{Y}}_0$ is that it is complex to compute and may be unstable in finite size samples, because it requires estimating sampling variances and covariances (Casady and Valliant, 1993). However, if an adequate number of degrees of freedom $g$ is available for estimating $B$, the problem can be overcome. For example, for standard complex multistage sampling designs having with-replacement sampling at the first stage, $g$ can be roughly taken as the number of sample first stage clusters minus the number of strata (Lehtonen and Pahkinen (1995), p. 181; for more elaboration on this topic see Eltinge and Jang (1996)). A stable estimated $B$ can be expected when $g$ is large enough relative to the dimension $q$ of the auxiliary variable $x$.
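A hedged sketch of the estimated optimal coefficient $\hat{B}$ defined above is given below; the matrix of second order inclusion probabilities for the sampled pairs is assumed to be available, which is exactly the practical difficulty discussed in the text.

```python
# Sketch: B_hat = (X_s' W_ss X_s)^{-1} X_s' W_ss Y_s, with W_ss as defined above.
import numpy as np

def ope_coefficients(y_s, X_s, pi_s, pi2_s, N):
    """pi2_s: n x n matrix of second order inclusion probabilities pi_ij."""
    outer = np.outer(pi_s, pi_s)
    W_ss = (pi2_s - outer) / (N**2 * outer * pi2_s)
    XtW = X_s.T @ W_ss
    return np.linalg.solve(XtW @ X_s, XtW @ y_s)
```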

4 The Generalised Regression Estimator

Most popular estimators of a finite population mean or total belong to the class of generalised regression estimators. Such a class is described in Särndal and Wretman (1992), chap. 6, and in Estevao et al. (1995). A GRE is based on an underlying superpopulation linear regression model relating the survey variable to the auxiliary variables whose population means are known. Consider the model $E_m(Y) = X\beta$, $V_m(Y) = \sigma^2\Sigma$, where $\Sigma = \mathrm{diag}\{\eta_i\}_{i=1,\ldots,N}$ is a known matrix. Note that $E_m$, $V_m$ and $C_m$ denote the expected value, variance and covariance with respect to the model. Let $\hat{\beta}_N = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$ be the census weighted least-squares regression estimator of $\beta$. Then, replacing $\hat{\beta}_N$ by the consistent estimator $\hat{\beta}_n = (X_s'\Sigma_{ss}^{-1}X_s)^{-1}X_s'\Sigma_{ss}^{-1}Y_s$, where $\Sigma_{ss}^{-1} = \mathrm{diag}\{1/\eta_i\pi_i\}_{i=1,\ldots,n}$, the corresponding GRE is defined to be $\hat{\bar{Y}}_G = \hat{\bar{Y}} + \hat{\beta}_n'(\bar{x} - \hat{\bar{x}})$. In the sequel we will call the model upon which the GRE is based the "working model".

The large sample properties of $\hat{\bar{Y}}_G$ can be established by means of its first order Taylor linear approximation $\tilde{\bar{Y}}_G = \hat{\bar{Y}} + \hat{\beta}_N'(\bar{x} - \hat{\bar{x}})$ (Särndal, Swensson and Wretman, 1992; p. 235). In particular, $\hat{\bar{Y}}_G$ is asymptotically design unbiased, and when the working model holds true it has the minimum expected asymptotic design variance with respect to the model, i.e. for any other design unbiased or approximately design unbiased estimator $\hat{\bar{Y}}^*$ of $\bar{Y}$, $E_m V(\tilde{\bar{Y}}_G) \le E_m V(\tilde{\bar{Y}}^*)$, for all $\beta$ (Wright, 1983). Montanari (1998) proved that, when the working model holds true, asymptotically $V(\tilde{\bar{Y}}_G) = V(\tilde{\bar{Y}}_0)$. On the contrary, if the model is wrongly specified, the value of the asymptotic variance of $\hat{\bar{Y}}_G$ may be

considerably higher than that of $\hat{\bar{Y}}_0$ based on the same auxiliary information $x$. This event is not uncommon, as the specification of the model is limited by the availability of the population means of auxiliary variables.

Other properties of the GRE are: 1) the means of the auxiliary variables estimated through $\hat{\bar{Y}}_G$ equal the corresponding population means, i.e. $\hat{\bar{x}}_G = \bar{x}$; 2) when there is more than one survey variable, $\hat{\bar{Y}}_G$ can be expressed as a simple weighted estimator with the same weights applying to all variables; 3) $\hat{\bar{Y}}_G$ gives valid conditional inferences provided that the model holds true. Note that the asymptotic optimality of the GRE requires the model to be true, and concerns the average asymptotic variance over the finite populations that can be generated under the model. Hence, the GRE efficiency is vulnerable to model mis-specification. On the other side, the GRE is more robust to sample fluctuations than the OPE in that $\hat{\beta}_n$ is a function of estimated population totals. In fact, the entries of $X_s'\Sigma_{ss}^{-1}X_s$ and $X_s'\Sigma_{ss}^{-1}Y_s$ are Horvitz-Thompson estimators of totals, whereas those of $X_s'W_{ss}X_s$ and $X_s'W_{ss}Y_s$ are Horvitz-Thompson estimators of variances and covariances of $\hat{\bar{Y}}$ and $\hat{\bar{x}}$.
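For comparison, the GRE coefficient under a diagonal working-model variance only needs the weights $1/(\eta_i\pi_i)$; a minimal sketch, with hypothetical argument names:

```python
# Sketch: GRE coefficient for Sigma = diag{eta_i}; weights are 1/(eta_i * pi_i).
import numpy as np

def gre_coefficients(y_s, X_s, pi_s, eta_s):
    w = 1.0 / (eta_s * pi_s)              # design- and model-based weights
    XtW = X_s.T * w                       # equivalent to X_s' diag(w)
    return np.linalg.solve(XtW @ X_s, XtW @ y_s)
```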

5 An Extended class of GRE’s

In this section we explore the relationships between GRE's and the OPE. To this end, let us enlarge the GRE class to non-diagonal variance matrices of the working model. Consider the model $E_m(Y) = X\beta$ and $V_m(Y) = \sigma^2\Sigma$, where $\Sigma = \{v_{ij}\}_{i,j=1,\ldots,N}$ is now a positive definite symmetric $N \times N$-matrix. Let $\Sigma_{ss}^{-1}$ be the symmetric matrix that has $v^{ij}/\pi_{ij}$ as its $ij$-th entry, where $i, j = 1, \ldots, n$ and $v^{ij}$ is the $ij$-th entry of $\Sigma^{-1}$. Then, provided that the matrix $(X_s'\Sigma_{ss}^{-1}X_s)^{-1}$ exists for all samples $s$, the corresponding Extended Generalised Regression Estimator (EGRE) can be written $\hat{\bar{Y}}_{EG} = \hat{\bar{Y}} + \hat{\beta}_n'(\bar{x} - \hat{\bar{x}})$, where $\hat{\beta}_n = (X_s'\Sigma_{ss}^{-1}X_s)^{-1}X_s'\Sigma_{ss}^{-1}Y_s$. Observe that the entries of $X_s'\Sigma_{ss}^{-1}X_s$ and $X_s'\Sigma_{ss}^{-1}Y_s$ are design unbiased estimators of the corresponding entries of $X'\Sigma^{-1}X$ and $X'\Sigma^{-1}Y$, respectively, provided that $v^{ij} \ne 0$ implies $\pi_{ij} \ne 0$. Thus, under mild conditions on the second order inclusion probabilities, $\hat{\beta}_n$ converges in probability to $\hat{\beta}_N = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$, which is the census weighted least-squares regression estimator of $\beta$. Obviously, when $\Sigma$ is a diagonal matrix, the EGRE, $\hat{\bar{Y}}_{EG}$, reduces to the customary GRE, $\hat{\bar{Y}}_G$.

Note that, when $W$ is non-singular, the OPE belongs to the EGRE class defined above, setting $\Sigma = W^{-1}$. However, generally $W$ is only non-negative definite, being singular for many sampling designs. But, even in such a case, there are EGRE's that are asymptotically equivalent to the OPE, as we show in the next section.
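Keeping the same conventions as the two sketches above (illustrative only), the EGRE coefficient differs just in the weight matrix, whose $ij$-th entry is $v^{ij}/\pi_{ij}$:

```python
# Sketch: EGRE coefficient with a general (non-diagonal) working-model variance.
import numpy as np

def egre_coefficients(y_s, X_s, Sigma_inv_s, pi2_s):
    """Sigma_inv_s: n x n block of Sigma^{-1} for the sampled units."""
    Sigma_ss_inv = Sigma_inv_s / pi2_s    # ij-th entry: v^{ij} / pi_ij
    XtS = X_s.T @ Sigma_ss_inv
    return np.linalg.solve(XtS @ X_s, XtS @ y_s)
```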

6 Connecting EGRE’s and OPE’s

Let us call a Design Balanced Variable (DBV) any non-null auxiliary variable $z$ whose mean is estimated without error by the Horvitz-Thompson estimator, i.e. $\hat{\bar{Z}} = \sum_{i=1}^{n} Z_i/N\pi_i = \bar{Z}$. This type of variable plays a fundamental role in establishing conditions under which the

OPE is equivalent to an EGRE. In fact, assembling the population values $Z_i$ of a DBV into the $N$-vector $Z$, we have the following theorems.

Theorem 1. A variable $z$ is a DBV if and only if the vector $Z$ belongs to the subspace orthogonal to that spanned by the columns of $W$.

Proof. If $z$ is a DBV, it follows that $Z'WZ = 0$ and $C'WZ = 0$ for any vector $C$, since the covariance between the unbiased estimators of the means of a DBV and of any other variable is identically zero. Hence, $WZ = 0$. On the other hand, if $WZ = 0$ holds, then $Z'WZ = 0$, i.e. $z$ is a DBV.

From Theorem 1 it follows that the subspace spanned by the DBV's has dimension $N - r(W)$, where $r(\cdot)$ denotes the rank of a matrix. So, $r(W) = N - t$, with $t \ge 0$, implies that there are $t$ linearly independent DBV's. Now, let $Z$ be an $N \times t$-matrix containing $t$ linearly independent DBV's and assume that $X$ does not contain any DBV. Then, we have the following theorem.

Theorem 2. Consider the working model $E_m(Y) = (Z\ X)\beta$ and $V_m(Y) = \sigma^2\Sigma$, and the matrix $W$ corresponding to the sampling design in use. If $\Sigma$ is a variance matrix for which there exists a scalar $\alpha$ such that $\Sigma^{-1} - \Sigma^{-1}Z(Z'\Sigma^{-1}Z)^{-1}Z'\Sigma^{-1} = \alpha W$, then $\tilde{\bar{Y}}_{EG} = \tilde{\bar{Y}}_0$, i.e. the EGRE based on the assumed working model is asymptotically equivalent to the OPE based on the same auxiliary information $x$.

Proof. To prove the result it is sufficient to rewrite $(Z\ X)\beta = Z\beta_z + X\beta_x$, where $\beta_z$ and $\beta_x$ are the vectors of regression coefficients for the DBV's and the auxiliary variables, respectively. Then, after some algebra, the census weighted least squares estimator of $\beta_x$ is given by $\hat{\beta}_{xN} = (X'AX)^{-1}X'AY$, where $A = \Sigma^{-1} - \Sigma^{-1}Z(Z'\Sigma^{-1}Z)^{-1}Z'\Sigma^{-1}$. Since $r(\Sigma) = N$, by well known matrix algebra results we have $r(A) = N - r(Z) = N - t$. Thus, if there exists a scalar $\alpha$ such that $\alpha W = A$, it follows that $\hat{\beta}_{xN} = B$, hence $\tilde{\bar{Y}}_{EG} = \tilde{\bar{Y}}_0$, as $\tilde{\bar{Y}}_{EG} - \tilde{\bar{Y}}_0 = (\hat{\beta}_{xN} - B)'(\bar{x} - \hat{\bar{x}})$.

Theorem 2 gives a sufficient condition for the asymptotic equality between an EGRE and the OPE that uses the same auxiliary variables $x$: besides the auxiliary variables that are not DBV's, the working model should include a number $N - r(W)$ of linearly independent DBV's, and the variance matrix $\sigma^2\Sigma$ should be set so that $A$ is a matrix proportional to $W$ ($A \propto W$). The outcome is an asymptotic minimum variance estimator, irrespective of the goodness of the working model, given the amount of auxiliary information $x$.

Generally speaking, with finite sample size, the OPE is approximately equal to an EGRE based on a working model that includes the effect of any existing DBV's. Unfortunately, the theorem does not provide guidelines for determining the structure of the matrix $\Sigma$ and the DBV's that correspond to the sampling design in use. In the next section, we will examine a number of case studies where solutions are available. The next theorem assures the finite size sample identity between an EGRE and the OPE.

Theorem 3. If $\Sigma^{-1} - \Sigma^{-1}Z(Z'\Sigma^{-1}Z)^{-1}Z'\Sigma^{-1} \propto W$ implies $\Sigma_{ss}^{-1} - \Sigma_{ss}^{-1}Z_s(Z_s'\Sigma_{ss}^{-1}Z_s)^{-1}Z_s'\Sigma_{ss}^{-1} \propto W_{ss}$ for all possible samples $s$, then $\hat{\bar{Y}}_{EG} = \hat{\bar{Y}}_0$.

Proof. To prove the result it is sufficient to write $(Z_s\ X_s)\beta = Z_s\beta_z + X_s\beta_x$ and to note that the sample weighted least squares estimator of $\beta_x$ with respect to the weight matrix $\Sigma_{ss}^{-1}$ is given by $\hat{\beta}_x = (X_s'A_{ss}X_s)^{-1}X_s'A_{ss}Y_s$, where $A_{ss} = \Sigma_{ss}^{-1} - \Sigma_{ss}^{-1}Z_s(Z_s'\Sigma_{ss}^{-1}Z_s)^{-1}Z_s'\Sigma_{ss}^{-1}$. Hence, when $A_{ss} \propto W_{ss}$, we have $\hat{\beta}_x = \hat{B}$ and the result follows from $\hat{\bar{Y}}_{EG} - \hat{\bar{Y}}_0 = (\hat{\beta}_x - \hat{B})'(\bar{x} - \hat{\bar{x}})$.

7 Examples of equivalences between EGRE’s and OPE’s

The starting point for deriving working models under which the corresponding EGRE is asymptotically equivalent to the OPE that uses the same auxiliary variables $x$ is the structure of the matrix $W$. When $r(W) = N$ there is no DBV. So, setting $\Sigma \propto W^{-1}$, the corresponding EGRE is asymptotically equal to the OPE. Furthermore, because of Theorem 3, the EGRE is the OPE as well. As an example, consider Poisson sampling with size measures $a_i$, $i = 1, \ldots, N$. Let $\pi_i = na_i/A$ be the inclusion probabilities of the population units, where $n$ is the expected sample size and $A$ is the total of the $a_i$. In this case $W = \mathrm{diag}\{(\pi_i^{-1} - 1)/N^2\}$. Setting $\Sigma = \mathrm{diag}\{a_i/(A - na_i)\}$, the GRE is the OPE as well. This form of the variance matrix was also obtained by Särndal (1996) by minimising the asymptotic variance of a GRE for Poisson sampling.

Next we give examples where DBV's exist. As a rule of thumb, potential DBV's are variables proportional to the first order inclusion probabilities within subpopulations from which fixed size samples are selected. This is documented by the following examples.

Example 1. Consider simple random sampling of $n$ units. In this case $r(W) = N - 1$ and any vector proportional to the unit vector $\mathbf{1}$ is a DBV. Setting $Z = \mathbf{1}$ and $\Sigma = I$, where $I$ is the identity matrix, then $A \propto W$. Furthermore, because of Theorem 3, a GRE based on a homoscedastic linear regression model with an intercept term is an OPE as well.
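The condition of Theorem 2 for this example can be checked numerically; the following small sketch (not part of the paper) builds $W$ for simple random sampling and verifies that it is proportional to $A = I - \mathbf{1}\mathbf{1}'/N$ obtained from $\Sigma = I$ and $Z = \mathbf{1}$.

```python
# Numerical check of Example 1: for SRS, W is proportional to A = I - 11'/N.
import numpy as np

N, n = 8, 3
pi = np.full(N, n / N)                                   # first order inclusion probabilities
pi2 = np.full((N, N), n * (n - 1) / (N * (N - 1)))       # second order, i != j
np.fill_diagonal(pi2, n / N)                             # pi_ii = pi_i
W = (pi2 - np.outer(pi, pi)) / (N**2 * np.outer(pi, pi))

A = np.eye(N) - np.ones((N, N)) / N                      # Sigma = I, Z = 1
ratio = W[A != 0] / A[A != 0]
print(np.allclose(ratio, ratio[0]))                      # True: A is proportional to W
```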

Example 2. Consider stratified simple random sampling and denote by $hi$ the $i$-th unit within the $h$-th stratum, where $hi$ ranges over the pairs $11, 12, \ldots, 1N_1, 21, 22, \ldots, 2N_2, \ldots, H1, H2, \ldots, HN_H$. Let $n_h$ be the sample size within stratum $h$. In this case $r(W) = N - H$ and the indicator variables of stratum membership of population units are a set of linearly independent DBV's. Thus, let $z_l$ be the vector of the values of the $l$-th stratum membership indicator variable, whose entries are $Z_{lhi} = 1$ when $h = l$, and $Z_{lhi} = 0$ otherwise. Define $Z = [z_1, z_2, \ldots, z_H]$. Then, setting $\Sigma = \mathrm{diag}\{\nu_{hi}\}$, where $\nu_{hi} = [n_h(N_h - 1)]/[N_h(N_h - n_h)]$, it follows that $A \propto W$. So, when the working model includes the stratum membership indicator variables of population units and the variance matrix is specified as above, the GRE is asymptotically equal to the OPE. Note that this form of the variance matrix was also obtained by Särndal (1996) by minimising the asymptotic variance of a GRE based on a working model with an intercept term for each stratum. When $n_h$ is constant across strata, because of Theorem 3, the GRE is identically equal to the OPE.

Example 3. Consider stratified two stage random sampling and let us denote by $hij$ the $j$-th elementary unit within the $i$-th Primary Sampling Unit (PSU) of the $h$-th stratum, with $h = 1, \ldots, H$; $i = 1, \ldots, N_h$; $j = 1, \ldots, M_{hi}$. Using simple random sampling without replacement in both stages, $n_h$ PSU's are selected from each stratum and $m_{hi}$ elementary units are drawn from each selected PSU. In this case $r(W) = N - H$ and, for each value of $l = 1, \ldots, H$, the vector $z_l$, whose entries are $Z_{lhij} = M_h/N_hM_{hi}$ when $h = l$ and $Z_{lhij} = 0$ otherwise, where $M_h = \sum_{i=1}^{N_h} M_{hi}$, is a DBV. Thus, the matrix $Z = [z_1, z_2, \ldots, z_H]$ contains $H$ linearly independent DBV's. Inserting $Z$ into the working model and setting $\Sigma = \mathrm{diag}\{\Sigma_{hi}\}$, where $\Sigma_{hi} = [a_{hi}I + b_{hi}\mathbf{1}\mathbf{1}']^{-1}$ is the $M_{hi} \times M_{hi}$-matrix with

$$a_{hi} = \frac{N_h}{n_h}\cdot\frac{M_{hi}}{m_{hi}}\cdot\frac{M_{hi}-m_{hi}}{M_{hi}-1} \qquad \text{and} \qquad b_{hi} = \frac{N_h}{n_h}\cdot\frac{M_{hi}}{m_{hi}}\cdot\frac{m_{hi}-1}{M_{hi}-1} - \frac{N_h}{n_h}\cdot\frac{n_h-1}{N_h-1},$$

it follows from Theorem 2 that $A \propto W$ and the EGRE is asymptotically equal to the OPE. Furthermore, when $n_h$ is constant across strata, because of Theorem 3, the EGRE is equal to the OPE as well. Note that the structure of $\Sigma$ is that of an equal correlation model within PSU's.

Now, two examples involving unequal probability sampling are presented. However, for simplicity, we assume with-replacement sampling. In such a case, first and second order inclusion probabilities must be replaced by $\varphi_i = E(\delta_i)$ and $\varphi_{ij} = E(\delta_i\delta_j)$, where $\delta_i$ is the random variable defined as the number of times the $i$-th unit has been selected in the sample $s$. As a consequence, Horvitz-Thompson estimators have to be replaced by Hansen-Hurvitz analogues and, in the matrix $W$, the inclusion probabilities $\pi_i$ and $\pi_{ij}$ are replaced by $\varphi_i$ and $\varphi_{ij}$.

Example 4. Consider a with-replacement unequal probability sampling design of fixed size $n$ and with selection probabilities $P_i$, $i = 1, 2, \ldots, N$. The sampling variance of the unbiased Hansen-Hurvitz estimator is given by $Y'WY$, where $W = n^{-1}N^{-2}(P^{-1} - \mathbf{1}\mathbf{1}')$ and $P = \mathrm{diag}\{P_i\}$. In this case, $r(W) = N - 1$ and the variable $Z_i = NP_i$ is a DBV. Inserting the latter into the working model and setting $\Sigma = \mathrm{diag}\{nP_i\}$, the GRE is asymptotically equal to the OPE. Furthermore, because of Theorem 3, the GRE is the OPE as well.

Example 5. Consider a stratified two stage sampling as in Example 3, but now at the first stage $n_h$ PSU's are selected from each stratum $h$ using a with-replacement unequal probability scheme with selection probabilities $P_{hi}$. In this case, $r(W) = N - H$. The matrix $Z = [z_1, z_2, \ldots, z_H]$, where $z_l$ is the vector whose entries are $Z_{lhij} = M_hP_{hi}/M_{hi}$ when $h = l$, and $Z_{lhij} = 0$ otherwise, contains $H$ linearly independent DBV's. Inserting $Z$ into the working model and setting $\Sigma = \mathrm{diag}\{\Sigma_{hi}\}$, where $\Sigma_{hi} = [a_{hi}I + b_{hi}\mathbf{1}\mathbf{1}']^{-1}$ is the $M_{hi} \times M_{hi}$-matrix with

$$a_{hi} = \frac{1}{n_hP_{hi}}\cdot\frac{M_{hi}}{m_{hi}}\cdot\frac{M_{hi}-m_{hi}}{M_{hi}-1} \qquad \text{and} \qquad b_{hi} = \frac{1}{n_hP_{hi}}\cdot\frac{M_{hi}}{m_{hi}}\cdot\frac{m_{hi}-1}{M_{hi}-1},$$

the EGRE is asymptotically equal to the OPE. As before, when $n_h$ is constant across strata, the EGRE is identically equal to the OPE.

Note that with-replacement results are often used for approximating without-replacement results.

8 Estimation strategies

The examples examined in the previous section illustrate that, given an amount of auxiliary information $x$, the OPE is generally equivalent to the EGRE based on a working model which includes the maximum number of linearly independent DBV's and assumes a variance matrix that reflects the structure of the first and second order inclusion probabilities. Thus, since the OPE allows a better fit of the data, this explains its asymptotic superiority. However, in finite size samples, the OPE is exposed to instability due to a possibly inadequate number of residual degrees of freedom available for estimating all the parameters. In particular, this concern may be relevant in stratified designs with a few observations per stratum, or in multistage sampling designs with a few PSU's per stratum.

The analysis and the examples presented above suggest the following options for estimating a population mean taking into account a given amount of auxiliary information $x$. The first option consists in using the OPE directly. This choice assures the maximum asymptotic efficiency. However, if the stability of that estimator is of concern, as when the total sample size is not large enough or the number of strata is high compared to the number of observations, and we are confident enough in a more parsimonious working model, a second option is to use the EGRE based on it. A further, intermediate option, which we recommend when a reliable model is unavailable, consists in specifying a working model with a variance matrix set according to Theorem 2 and a suitably reduced number of DBV's. In particular, in the case of stratified samples, this may be accomplished by introducing into the model DBV's corresponding to superstrata obtained by collapsing the original strata. Collapsing should be performed so that, within each superstratum, strata effects can be considered negligible. For each superstratum, a DBV is obtained by adding the DBV's of the collapsed strata (see the sketch below). By reducing the number of DBV's inserted into the model, we accept a smaller level of asymptotic efficiency to better control the finite size sample variance of the estimator. The latter option has been implemented in the following simulation study.
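A minimal sketch of the collapsing device just described (illustrative; the grouping of consecutive strata into equal-sized blocks is an assumption): the DBV matrix of the superstrata is obtained by summing blocks of columns of the stratum indicator matrix.

```python
# Sketch: build superstratum DBV's by summing the indicator columns of the
# original strata; here consecutive strata are collapsed in groups of equal size.
import numpy as np

def collapse_dbvs(Z: np.ndarray, group_size: int) -> np.ndarray:
    """Z: N x H matrix of stratum indicators (H must be divisible by group_size)."""
    N, H = Z.shape
    return Z.reshape(N, H // group_size, group_size).sum(axis=2)

# Example: 20 strata of 3 units collapsed into 5 superstrata of 4 strata each.
Z = np.kron(np.eye(20), np.ones((3, 1)))      # 60 x 20 stratum indicator matrix
D4 = collapse_dbvs(Z, group_size=4)           # 60 x 5 superstratum DBV's
```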

9 An empirical study

The theory developed in the previous sections is asymptotic, being based on first order approximations. In this section we report the results of an empirical study carried out to test the theory in the presence of finite size samples. In particular, we refer to Example 2 above.

A finite population of 1200 units partitioned into 20 strata of equal size was considered. The values of an auxiliary variable $x$ were generated from a Chi-squared random variable with 8 degrees of freedom. They were assigned to the strata in two ways. In the first one they were randomly assigned to the strata, to simulate a stratification based on other characters independent of $x$ (STRATIFICATION 1). In the second case, the values of $x$ were first ordered; then the 60 smallest values were assigned to the first stratum, the subsequent 60 smallest values to the second stratum and so forth, to simulate a stratification based on classes of $x$ values (STRATIFICATION 2). Given the values of $x$, six populations of $y$ values were generated, according to the following models:

P1: $Y_{hi} = 100(1 + 0.15\varepsilon_{hi})$ (total independence between $y$, $x$ and $h$);
P2: $Y_{hi} = (10 + 3h)(1 + 0.15\varepsilon_{hi})$ (dependence between $y$ and $h$);
P3: $Y_{hi} = 3X_{hi}(1 + 0.15\varepsilon_{hi})$ (linear dependence between $y$ and $x$);
P4: $Y_{hi} = (12 + 4h + 5X_{hi})(1 + 0.15\varepsilon_{hi})$ (linear dependence between $y$, $x$ and $h$);
P5: $Y_{hi} = X_{hi}^2(1 + 0.15\varepsilon_{hi})$ (quadratic dependence between $y$ and $x$);
P6: $Y_{hi} = 20h\sqrt{X_{hi}}(1 + 0.15\varepsilon_{hi})$ (non-linear dependence between $y$, $x$ and $h$);
where $Y_{hi}$ and $X_{hi}$ are the values of $y$ and $x$ in the $i$-th unit ($i = 1, \ldots, 60$) of the $h$-th stratum ($h = 1, \ldots, 20$) and the $\varepsilon_{hi}$'s are independent observations from a standard normal distribution. All models are heteroscedastic, with variances proportional to the squared expected values.

For each stratification and population, 10,000 proportional stratified random samples of size 40 (two units per stratum), 80 (four units per stratum) and 240 (twelve units per stratum) were selected. For this sampling scheme, the DBV's are the indicator variables of stratum membership of population units. The variance matrix that corresponds to the optimal estimator is the identity matrix, $n_h$ and $N_h$ being constant across strata (see Example 2). For each sample, assuming $\bar{X}$ known, the following estimators were computed:

$\bar{y}$ and $\bar{x}$, i.e. the sample means of $y$ and $x$ (Horvitz-Thompson estimators);

$\hat{\bar{Y}}_{G1}$, i.e. the GRE based on the model $E_m(Y_{hi}) = \beta X_{hi}$, $V_m(Y_{hi}) = X_{hi}\sigma^2$ (combined ratio estimator);

$\hat{\bar{Y}}_{G2}$, i.e. the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $V_m(Y_{hi}) = X_{hi}^2\sigma^2$;

$\hat{\bar{Y}}_{G3}$, i.e. the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $V_m(Y_{hi}) = \sigma^2$ (adopting the variance matrix that corresponds to the OPE);

$\hat{\bar{Y}}_{G4}$, i.e. the GRE based on the model $E_m(Y_{hi}) = \sum_{l=1}^{5}\beta_l D_{4lhi} + \beta_6 X_{hi}$, $V_m(Y_{hi}) = \sigma^2$, where $D_{4l}$ is the variable obtained by adding the DBV's of the four strata $4(l-1)+1$, $4(l-1)+2$, $4(l-1)+3$, $4(l-1)+4$, for $l = 1, 2, 3, 4, 5$;

$\hat{\bar{Y}}_{G5}$, i.e. the GRE based on the model $E_m(Y_{hi}) = \sum_{l=1}^{10}\beta_l D_{2lhi} + \beta_{11} X_{hi}$, $V_m(Y_{hi}) = \sigma^2$, where $D_{2l}$ is the variable obtained by adding the DBV's of the two original strata $2(l-1)+1$ and $2(l-1)+2$, for $l = 1, \ldots, 10$;

$\hat{\bar{Y}}_0$, i.e. the OPE that uses $\bar{X}$;

$\tilde{\bar{Y}}_0$, i.e. the first order linear approximation of the OPE.

Estimators $\hat{\bar{Y}}_{G3}$, $\hat{\bar{Y}}_{G4}$ and $\hat{\bar{Y}}_{G5}$ are based on working models that assume the variance matrix corresponding to the OPE, but include a reduced number of DBV's, according to the collapsing strata technique (one DBV for $\hat{\bar{Y}}_{G3}$, five for $\hat{\bar{Y}}_{G4}$ and ten for $\hat{\bar{Y}}_{G5}$). Note that $\hat{\bar{Y}}_0$ corresponds to a working model with 21 parameters (twenty DBV's and one auxiliary variable), and instability can be expected, in particular for sample size $n = 40$.
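The following rough sketch (assumptions: equal-sized strata, proportional allocation, variable names of our own) shows how populations such as P3 and P4 and the stratified samples of the study could be generated; it is not the authors' simulation code.

```python
# Sketch: generate x from a chi-square(8), build populations P3 and P4, and draw
# a proportional stratified simple random sample (two units per stratum).
import numpy as np

rng = np.random.default_rng(0)
H, Nh = 20, 60                                     # 20 strata of 60 units each
x = rng.chisquare(df=8, size=H * Nh)
h = np.repeat(np.arange(1, H + 1), Nh)             # STRATIFICATION 1: assignment unrelated to x
eps = rng.standard_normal(H * Nh)

y_P3 = 3 * x * (1 + 0.15 * eps)                    # linear dependence on x
y_P4 = (12 + 4 * h + 5 * x) * (1 + 0.15 * eps)     # dependence on x and h

def stratified_sample(nh):
    """Select nh units per stratum by simple random sampling without replacement."""
    idx = np.arange(H * Nh)
    return np.concatenate([rng.choice(idx[h == s], size=nh, replace=False)
                           for s in range(1, H + 1)])

s = stratified_sample(nh=2)                        # total sample size n = 40
print(y_P3[s].mean(), y_P4[s].mean())              # stratified (Horvitz-Thompson) sample means
```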

The mean and the mean squared error across the 10,000 selected samples were computed for each estimator. Tables 1 and 2 report the scaled mean squared errors, with that of $\tilde{\bar{Y}}_0$ set equal to 100. Biases are not reported, since they were negligible in all cases.

First, note that $\tilde{\bar{Y}}_0$ is a gauge of the best we can expect from the OPE, as it shows its asymptotic behaviour; it is the most efficient among the computed estimators. On the other hand, the OPE is vulnerable to sampling fluctuations, in particular with STRATIFICATION 2 and $n = 40$. In fact, in the worst case, the variance of $\hat{\bar{Y}}_0$ is 47.9% higher than that of $\tilde{\bar{Y}}_0$ (Table 2, population P5). With this stratification the OPE performance is generally worse than under STRATIFICATION 1. Most likely, the reason is the great diversity of stratum variances, due to the asymmetric distribution of $x$, coupled with the equal allocation of the sample; thus, in the strata with higher variances, the estimates of the latter which are required by the OPE are much more erratic (recall that for $n = 40$ there are two units per stratum). Furthermore, STRATIFICATION 2 is based on $x$, and most of the information on $y$ provided by $x$ is captured by the stratification; thus, the use of $x$ at the estimation stage may be redundant. In fact, the sample mean is a fairly efficient estimator apart from population P5. In this respect, note that the sample mean is equivalent to a GRE based on a working model that uses the DBV's, i.e. the stratum membership indicators. However, the sample mean does not take the known value of $\bar{X}$ when applied to the auxiliary variable, as all the other estimators do.

Estimators $\bar{y}$, $\hat{\bar{Y}}_{G1}$ and $\hat{\bar{Y}}_{G3}$ are almost as efficient as $\tilde{\bar{Y}}_0$ when the model upon which they are based holds true, but they are quite inefficient in the presence of model failures, in particular with STRATIFICATION 1. For instance, when the variable $y$ does not depend on $x$ and $h$ (population P1), or depends only on $h$ (population P2), the sample mean does well; but when $y$ depends on $x$ and the latter is not used for stratification, the sample mean suffers from not using any auxiliary information as the regression estimators do. The combined ratio estimator $\hat{\bar{Y}}_{G1}$ works well only when the values of $y$ are on average proportional to the values of $x$ (population P3), whereas estimators $\hat{\bar{Y}}_{G2}$ and $\hat{\bar{Y}}_{G3}$ are inefficient when the relation between $y$ and $x$ is not linear. Observe that $\hat{\bar{Y}}_{G3}$ is almost always more efficient than $\hat{\bar{Y}}_{G2}$, even when the populations are heteroscedastic.

Finally, estimators $\hat{\bar{Y}}_{G4}$ and $\hat{\bar{Y}}_{G5}$, based on the collapsed strata technique, are almost always more efficient than the OPE with sample sizes 40 and 80, because of the reduced number of DBV's inserted into the working model. Furthermore, when they are not the most efficient among all estimators, their scaled mean squared errors are not substantially higher than that of the best estimator for each population. Thus, the collapsed strata technique seems a useful device to identify a stable approximate OPE which turns out to be fairly efficiency-robust with respect to model failures. In fact, $\hat{\bar{Y}}_{G4}$ and $\hat{\bar{Y}}_{G5}$ have the lowest averaged mean squared error across populations (see the three rows at the bottom of the tables). This is particularly valuable in cases where there is uncertainty about a proper working model.

10 Final remarks

Generally speaking, the OPE is approximately equal to an EGRE based on a working model that includes the effect of any existing DBV's. Unfortunately, Theorem 2 does not provide guidelines for determining the structure of the matrix $\Sigma$ and the DBV's that correspond to the sampling design in use. However, for common designs, easy solutions can be found. So, given an amount of auxiliary information $x$, the OPE is approximately or exactly equal to an EGRE based on a working model which includes the maximum number of linearly independent DBV's and assumes a variance matrix that reflects the structure of the first and second order inclusion probabilities. Hence the OPE allows a better fit of the data, and this explains its asymptotic superiority. However, in finite size samples, the OPE is exposed to instability due to a possibly inadequate number of residual degrees of freedom available for fitting the model. In particular, this concern may be relevant in stratified designs with a few observations per stratum, or in multistage sampling designs with a few PSU's per stratum.

The above analysis suggests the following quasi-optimal estimation strategy. When a reliable model is lacking, use an EGRE based on a working model with a variance matrix set according to Theorem 2 and with a suitably reduced number of DBV's, so as to achieve a sufficient number of residual degrees of freedom to fit the model. In particular, in the case of stratified samples, this may be accomplished by introducing into the model DBV's corresponding to superstrata obtained by collapsing the original strata. Collapsing should be performed so that, within each superstratum, strata effects can be considered negligible. For each superstratum, a DBV is obtained by adding the DBV's of the collapsed strata. By reducing the number of DBV's inserted into the model, we accept a smaller level of asymptotic efficiency to better control the finite size sampling variance of the estimator.

A related topic of interest is that of variance estimation. The theory presented in the previous sections was developed using the Horvitz-Thompson estimator of the variances and covariances required to estimate $B$ for the optimal estimator. But alternative ways for variance estimation are available, such as the Yates-Grundy formula or the use of resampling methods. Thus, the issue of the best variance estimation procedure to be used to better estimate $B$ is an open question. Variance estimators are also required to estimate the standard errors of regression estimators. In this respect, the usual way is to estimate the variance of the first order linear approximations of regression estimators, replacing the unknown regression coefficients with their sample estimates. The resulting estimator is usually somewhat negatively biased, especially with a higher number of auxiliary variables in the working model. That is also true for the OPE, which uses the maximum number of auxiliary variables. Thus, the properties of variance estimators affect the coverage of confidence intervals and further research is needed to explore this issue in order to single out a better strategy for interval estimation. So far we have restricted the analysis to the use of DBV's, since the vector $\bar{x}$ was taken as fixed. But clearly, this point is part of the wider problem of selecting a subset of the available auxiliary variables for estimation purposes.
For example, if the true model is linear, a quadratic term inserted into the working model, although it would asymptotically result in a more efficient estimator of the population mean, may in finite size samples give a less efficient estimator. The issue of selecting the best subset of the available auxiliary

variables for estimating a population mean is still an open research problem, but the topic is beyond the purposes of this paper.

Table 1: Scaled mean squared error of estimators for STRATIFICATION 1 [MSE($\tilde{\bar{Y}}_0$) = 100]

Population                                     n     y        Y_G1      Y_G2    Y_G3    Y_G4    Y_G5    Y_0
P1: Y_hi = 100(1 + 0.15 e_hi)                  40    100.0    1304.6    119.4   103.0   104.0   104.5   107.9
                                               80    100.0    1177.2    109.1   101.3   101.4   101.6   101.9
                                               240   100.1    1217.1    104.6   100.1   100.2   100.2   100.3
P2: Y_hi = (10 + 3h)(1 + 0.15 e_hi)            40    100.2    1234.0    421.4   236.7   106.6   106.6   109.8
                                               80    100.2    1128.8    364.2   215.7   102.9   102.6   102.9
                                               240   100.2    1103.8    333.8   210.9   101.1   100.6   100.6
P3: Y_hi = 3 X_hi (1 + 0.15 e_hi)              40    982.6    100.0     101.3   103.4   105.3   105.8   108.8
                                               80    924.2    100.2     100.7   102.0   102.6   102.9   103.3
                                               240   924.8    101.0     101.4   100.6   100.8   100.8   101.0
P4: Y_hi = (12 + 4h + 5 X_hi)(1 + 0.15 e_hi)   40    445.0    279.5     157.7   121.9   105.0   105.7   110.0
                                               80    422.4    271.0     148.9   119.0   102.2   102.2   102.7
                                               240   405.1    275.3     150.0   119.8   101.0   100.8   100.9
P5: Y_hi = X_hi^2 (1 + 0.15 e_hi)              40    834.5    277.5     208.8   105.7   106.2   106.9   109.4
                                               80    776.1    266.0     208.8   104.1   104.2   104.5   104.9
                                               240   772.2    256.5     205.2   99.6    99.9    100.0   100.1
P6: Y_hi = 20 h sqrt(X_hi)(1 + 0.15 e_hi)      40    293.65   229.6.7   236.1   205.8   110.9   110.5   113.2
                                               80    267.8    234.7     226.3   200.0   104.3   104.0   104.2
                                               240   265.5    236.0     220.9   195.1   101.5   100.7   100.8
Average                                        40    459.3    570.9     207.5   146.1   106.3   106.7   109.9
                                               80    431.8    529.7     193.0   140.4   102.9   103.0   103.3
                                               240   428.0    531.6     186.0   137.7   100.8   100.5   100.6

y: the sample mean of $y$ (Horvitz-Thompson estimator);
Y_G1: the GRE based on the model $E_m(Y_{hi}) = \beta X_{hi}$, $V_m(Y_{hi}) = X_{hi}\sigma^2$ (combined ratio estimator);
Y_G2: the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $V_m(Y_{hi}) = X_{hi}^2\sigma^2$;
Y_G3: the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $V_m(Y_{hi}) = \sigma^2$ (adopting the variance matrix that corresponds to the OPE);
Y_G4: the GRE based on the model $E_m(Y_{hi}) = \sum_{l=1}^{5}\beta_l D_{4lhi} + \beta_6 X_{hi}$, $V_m(Y_{hi}) = \sigma^2$;
Y_G5: the GRE based on the model $E_m(Y_{hi}) = \sum_{l=1}^{10}\beta_l D_{2lhi} + \beta_{11} X_{hi}$, $V_m(Y_{hi}) = \sigma^2$;
Y_0: the OPE that uses $\bar{X}$;
Y~_0: the first order linear approximation of the OPE.

Table 2: Scaled mean squared error of estimators for STRATIFICATION 2 [MSE($\tilde{\bar{Y}}_0$) = 100]

Population                                     n     y       Y_G1    Y_G2    Y_G3    Y_G4    Y_G5    Y_0
P1: Y_hi = 100(1 + 0.15 e_hi)                  40    100.0   118.4   100.4   99.9    99.9    100.3   109.9
                                               80    100.0   118.7   100.1   100.0   99.9    100.1   103.0
                                               240   99.9    118.7   100.1   99.9    99.9    100.0   100.4
P2: Y_hi = (10 + 3h)(1 + 0.15 e_hi)            40    100.2   121.3   115.6   106.9   102.6   102.5   121.6
                                               80    100.3   121.6   115.8   107.5   103.0   102.4   106.0
                                               240   99.9    119.7   113.8   105.9   101.6   100.6   101.1
P3: Y_hi = 3 X_hi (1 + 0.15 e_hi)              40    108.8   100.4   100.4   100.1   100.4   101.7   131.0
                                               80    106.9   101.0   101.1   100.8   101.2   101.8   108.6
                                               240   109.9   100.4   100.4   100.3   100.4   100.3   103.2
P4: Y_hi = (12 + 4h + 5 X_hi)(1 + 0.15 e_hi)   40    103.9   103.3   102.6   101.0   100.7   101.8   127.1
                                               80    102.6   104.8   103.9   102.1   101.3   101.5   107.6
                                               240   105.3   102.7   102.0   100.7   100.3   100.5   102.6
P5: Y_hi = X_hi^2 (1 + 0.15 e_hi)              40    164.9   128.5   129.5   110.4   101.2   102.6   147.9
                                               80    157.4   124.3   125.5   108.3   100.5   101.7   114.5
                                               240   176.5   134.1   136.0   113.1   101.4   101.5   106.1
P6: Y_hi = 20 h sqrt(X_hi)(1 + 0.15 e_hi)      40    101.2   105.7   107.5   107.1   104.0   104.1   133.3
                                               80    101.1   105.7   107.3   107.4   104.0   103.0   109.8
                                               240   102.0   104.7   106.1   106.5   102.9   101.2   102.1
Average                                        40    113.2   112.9   109.3   104.2   101.5   102.2   128.5
                                               80    111.4   112.7   109.0   104.4   101.7   101.8   108.3
                                               240   115.6   113.4   109.7   104.4   101.1   100.7   102.6

y: the sample mean of $y$ (Horvitz-Thompson estimator);
Y_G1: the GRE based on the model $E_m(Y_{hi}) = \beta X_{hi}$, $V_m(Y_{hi}) = X_{hi}\sigma^2$ (combined ratio estimator);
Y_G2: the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $V_m(Y_{hi}) = X_{hi}^2\sigma^2$;
Y_G3: the GRE based on the model $E_m(Y_{hi}) = \beta_1 + \beta_2 X_{hi}$, $V_m(Y_{hi}) = \sigma^2$ (adopting the variance matrix that corresponds to the OPE);
Y_G4: the GRE based on the model $E_m(Y_{hi}) = \sum_{l=1}^{5}\beta_l D_{4lhi} + \beta_6 X_{hi}$, $V_m(Y_{hi}) = \sigma^2$;
Y_G5: the GRE based on the model $E_m(Y_{hi}) = \sum_{l=1}^{10}\beta_l D_{2lhi} + \beta_{11} X_{hi}$, $V_m(Y_{hi}) = \sigma^2$;
Y_0: the OPE that uses $\bar{X}$;
Y~_0: the first order linear approximation of the OPE.

References

R.J. Casady and R. Valliant. Conditional properties of post-stratified estimators under normal theory. Survey Methodology, 19:183–192, 1993.

J. L. Eltinge and D.S. Jang. Stability measures for variance component estimators under a stratified multistage design. Survey Methodology, 22:157–165, 1996.

V. Estevao, M.A. Hidiroglou, and C.E. Särndal. Methodological principles for a generalised estimation system at Statistics Canada. Journal of Official Statistics, 11:181–204, 1995.

R. Lehtonen and E.J. Pahkinen. Practical Methods for Survey Design. J. Wiley, New York, 1995.

G.E. Montanari. Post-sampling efficient QR-prediction in large-scale surveys. International Statistical Review, 55:191–202, 1987.

G.E. Montanari. On regression estimation of finite population mean. Survey Methodology, 24:69–77, 1998.

J.N.K. Rao. Estimating totals and distribution functions using auxiliary information at the estimation stage. Journal of Official Statistics, (10):153–165, 1994.

J.N.K. Rao. Developments in sample survey theory: an appraisal. Canadian Journal of Statistics, 25:1–21, 1997.

C.E. Särndal. Efficient estimators with simple variance in unequal probability sampling. Journal of the American Statistical Association, 91:1289–1300, 1996.

C.E. Särndal, B. Swensson, and J.H. Wretman. Model Assisted Survey Sampling. Springer-Verlag, New York, 1992.

R.L. Wright. Finite population sampling with multivariate auxiliary information. Journal of the American Statistical Association, 78:879–884, 1983.

Evaluation of quality cancer care using administrative data: an example of breast, rectum and lung cancer

Rosalba Rosato(1, 2), Giovannino Ciccone(1), Elena Gelormino(1), Eva Pagano(1), Franco Merletti(1)

1. Unit of Cancer Epidemiology, ASO S. Giovanni Battista Turin and CPO Piedmont

2. Department of Statistics, University of Florence

1 Introduction

This study is an attempt to use an administrative tool to estimate the quality of care in treating cancer. The regional health information system collects data through a personal form, named Scheda di Dimissione Ospedaliera (SDO), for every inpatient case, filled in at discharge. In recent years the quality of the data has improved thanks to the growing interest of administrators, clinicians, and policymakers. Cancer care in the Piedmont region is organised as a network of services. In order to elaborate local clinical guidelines for the three most frequent cancers (breast, rectum and lung cancer), data from the regional health information system (SDO) have been used. For every type of cancer some indicators have been proposed in order to estimate the implementation of the main recommendations of the guidelines.

2 Methods

All patients resident in Piedmont with a hospital admission during the period from 1997 to 1999 for breast, rectum or lung cancer have been included in the analyses. The main diagnosis on the SDO form is coded with an international classification system, ICD9-CM. Every inpatient episode of care is classified through a DRG (Diagnosis Related Group) code. Table 1 describes the inclusion criteria adopted in the study.

8214 female patients with a main diagnosis of breast cancer (ICD9-CM: 174.0-174.9) and surgical DRGs (257-260) have been included in the analyses. Patients have been divided into two groups according to the type of treatment: patients with a mastectomy (ICD9-CM: 85.33-85.48, N=3851) and patients with a less invasive treatment (ICD9-CM: 85.20-85.25, N=4363).

2829 patients with a main diagnosis of rectum cancer (ICD9-CM: 154.0-154.1, 154.8) and surgical DRGs 146-147 (rectal resection, respectively with and without complications or comorbidity) or DRGs 148-149 (major small and large bowel procedures with and without complications or comorbidity) have been included in the analyses. Patients have been divided into two groups according to the presence of an abdominoperineal resection (APR) (ICD9-CM: 485, N=477) or the conservation of the anal sphincter (N=2392).

Table 1: Patients’ inclusion criteria

                              BREAST           RECTUM                 LUNG
Period of admission           1997 - 1999      1997 - 1999            1997 - 1999
Residence                     Piedmont         Piedmont               Piedmont
Region of admission           Piedmont         Piedmont               All
Sex                           F                M+F                    M+F
Main diagnosis (ICD9-CM)      174.0 - 174.9    154.0 - 154.1, 154.8   162.0 - 162.9
DRGs                          257 - 260        146 - 149              -
Type of treatment (ICD9-CM):
  Reference group             85.33 - 85.48    485                    321 - 329
  Control group               85.20 - 85.25    Other procedures       Medical DRGs

The 10875 patients with lung cancer (ICD9-CM main diagnosis: 162.0-162.9) have been classified according to the type of treatment: surgical, with lung resection (ICD9-CM: 321-329), or medical (any medical DRG).

Each patient has been identified through the tax identification number (when available on the form), or by a unique key formed by some demographic data - last name, date of birth, city of birth and sex - in order to exclude repeated admissions of the same subject. When a patient has been admitted several times during the period, he/she has been classified according to the most invasive modality of treatment; in case of more than one admission with the same severity, the first in time has been included in the analyses. For lung cancer we have also included medical treatments in the analyses, and in order to identify the first admission occurring during 1997, the 1996 database has been used.

The analyses on rectum and breast cancer have considered only admissions that occurred in the Piedmont region, because one of the most interesting variables related to the type of surgery is the hospital workload, and it was available only for Piedmont hospitals. For lung cancer, 7.5% of the sample was admitted outside Piedmont. As the relationship between kind of treatment and migration is of great interest, and the activity level of hospitals is not relevant here, admissions in other regions have also been included in the analyses.

By using the codes of the secondary diagnoses reported on the form, patients have been classified on the basis of the presence of metastases, according to the Disease Staging classification.
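A hypothetical sketch of the record linkage and of the "most invasive admission" rule described above is given below; the column names and the numeric invasiveness code are invented for illustration and do not come from the SDO specification.

```python
# Sketch (assumed column names): deduplicate admissions and keep, per patient,
# the most invasive admission, breaking ties by the earliest admission date.
import pandas as pd

def select_index_admission(sdo: pd.DataFrame) -> pd.DataFrame:
    demo_key = (sdo["last_name"] + "|" + sdo["birth_date"] + "|"
                + sdo["birth_city"] + "|" + sdo["sex"])
    sdo = sdo.assign(patient_key=sdo["tax_code"].fillna(demo_key))
    # 'invasiveness' is an assumed numeric code (e.g. 2 = radical, 1 = conservative).
    sdo = sdo.sort_values(["invasiveness", "admission_date"], ascending=[False, True])
    return sdo.drop_duplicates(subset="patient_key", keep="first")
```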

3 Statistical analyses

Using the information available on the form, some factors potentially associated with the type of treatment have been selected. The personal characteristics used in the analyses are: city of residence, sex, age, level of education and stage of disease. As admission characteristics, hospital workload and year

have been considered.

Residence has been included as a proxy of accessibility to services: on the basis of preliminary analyses, the patients have been classified as residents of Turin or of the rest of the region. The hospital workload per year has been used as an indicator of the specialisation and experience of the hospital. As the level of education had a high number of missing data, a missing category has been created in order to include all patients in the analysis.

The association between the variables and the outcome (type of treatment) has been estimated using a logistic regression model, through odds ratios (OR) and their 95% confidence intervals (95% CI). For breast and rectum cancer, the ORs estimate the probability of receiving a radical surgical treatment relative to a conservative one; for lung cancer, the estimated probability relates to surgical treatment relative to medical treatment.
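A hedged illustration of the model used for the odds ratios and 95% confidence intervals follows; the data frame and variable names are synthetic stand-ins, not the study data.

```python
# Sketch: logistic regression ORs with statsmodels on synthetic illustrative data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "radical_surgery": rng.integers(0, 2, n),                        # 1 = mastectomy / APR
    "age_class": rng.choice(["<50", "50-59", "60-69", "70-79", ">=80"], n),
    "residence": rng.choice(["Turin", "Other"], n),
    "hospital_volume": rng.choice([">100", "50-100", "<50"], n),
})
fit = smf.logit("radical_surgery ~ C(age_class) + C(residence) + C(hospital_volume)",
                data=df).fit(disp=0)
odds_ratios = np.exp(fit.params)            # odds ratios
ci = np.exp(fit.conf_int())                 # 95% confidence intervals for the ORs
print(pd.concat([odds_ratios, ci], axis=1))
```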

4 Results

4.1 Breast cancer

Table 2 shows the results for breast cancer. Age is strongly associated with the type of surgical treatment, with an increase of mastectomy from 35% among women under 50 to over 60% among women aged 80 or more.

Women with a low level of education have about 50% mastectomies, while the others have about 39%. This difference disappears when controlling for the other characteristics, in particular for age. Residents of Turin have a lower percentage of mastectomy (39% vs 49%), and this difference remains after controlling for the other characteristics. A stratified analysis has been carried out and the trend does not vary.

The number of surgical treatments per year performed by the hospital is strongly associated with the outcome. Hospitals with a low workload (under 100 cases per year) have a higher probability of performing a mastectomy.

The presence of secondary diagnoses of metastases (both in lymph nodes and in other organs), used in order to define a severity stage, is associated with a greater probability of mastectomy. However, the percentage of women who have metastasis codes on the form is lower than expected (30 - 35%).

During the period there has been a progressive reduction in the number of mastectomies, from 50.6% in 1997 to 43% in 1999.

4.2 Rectum cancer

Table 3 shows the results of the analyses on rectum cancer. There is no clear effect of age, even if after adjustment for the other variables a tendency towards a more radical surgical treatment appears with older age. Moreover, no differences by sex or level of education have been observed. As for breast cancer patients, residents of Turin seem to have a lower probability of receiving a more invasive treatment.

Table 2: Analysis of factors related to type of surgery for breast cancer (N=8214)

                                  Nr. patients   % Mastectomy   OR     95% CI
AGE
  < 50                            1596           34.96          1      -
  50 - 59                         1737           40.82          1.28   1.10 - 1.48
  60 - 69                         2255           44.35          1.40   1.21 - 1.63
  70 - 79                         1791           58.68          2.65   2.11 - 2.88
  ≥ 80                            835            62.63          2.85   2.36 - 3.45
EDUCATIONAL LEVEL
  Secondary or more               1121           39.34          1      -
  Intermediate                    1768           38.80          0.92   0.78 - 1.08
  Primary or nothing              4024           51.19          1.07   0.92 - 1.25
  Missing data                    1301           50.27          0.92   0.77 - 1.10
CITY OF RESIDENCE
  Turin                           1964           39.05          1      -
  Other cities                    6250           49.18          1.19   1.06 - 1.33
NR. CASES/YEAR (NR. INSTITUTES)
  > 100 (N=7)                     3728           37.63          1      -
  50 - 100 (N=7)                  1651           59.12          2.26   1.99 - 2.57
  < 50 (N=60)                     2835           51.57          1.60   1.43 - 1.77
DISEASE STAGING
  Without metastasis              7807           45.95          1      -
  With metastasis                 407            62.41          2.23   1.80 - 2.78
YEAR OF ADMISSION
  1997                            2666           50.56          1      -
  1998                            2664           47.26          0.86   0.77 - 0.96
  1999                            2884           42.79          0.70   0.62 - 0.78

The hospital workload seems associated with the type of treatment. The probability of abdominoperineal resection increases by about 1.6 times (95% CI 1.19 - 2.07) for those cases operated on in hospitals with fewer than 25 surgical treatments per year. The effect of disease stage is not clear: while the crude comparison shows a lower percentage of APR in patients with advanced cancer, this difference disappears after adjustment. The analysis shows a decreasing temporal trend of abdominoperineal resections.

4.3 Lung cancer

Table 4 shows the factors that influence the type of treatment supplied to patients with a main diagnosis of lung cancer. The probability of a surgical treatment is roughly constant (around 16-17%) up to the age of 70 years, while it falls above this age (7.5%). There is no difference related to sex. The effect of the level of education seems rather strong: there is a low probability of receiving a surgical treatment for patients with a lower level of education (OR=0.66, 95% CI 0.58-0.77).

Table 3: Analysis of factors related to type of surgery for rectum cancer (N=2829)

                                  Nr. patients   % Permanent colostomy   OR     95% CI
AGE
  < 60                            575            16.00                   1      -
  60 - 69                         842            16.03                   1.04   0.79 - 1.37
  70 - 79                         948            18.78                   1.24   0.95 - 1.63
  ≥ 80                            464            15.52                   1.51   1.11 - 2.05
SEX
  Female                          1175           16.60                   1      -
  Male                            1654           17.05                   0.96   0.80 - 1.16
EDUCATIONAL LEVEL
  Secondary or more               357            14.57                   1      -
  Intermediate                    410            16.34                   0.96   0.69 - 1.42
  Primary or nothing              1584           16.73                   0.98   0.73 - 1.33
  Missing data                    478            19.46                   0.98   0.69 - 1.39
CITY OF RESIDENCE
  Turin                           670            13.88                   1      -
  Other cities                    2159           17.79                   1.24   0.97 - 1.59
NR. CASES/YEAR (NR. INSTITUTES)
  > 50 (N=2)                      609            13.46                   1      -
  25 - 50 (N=7)                   694            14.27                   1.22   0.90 - 1.65
  < 25 (N=58)                     1526           19.40                   1.57   1.19 - 2.07
DISEASE STAGING
  Without metastasis              2268           17.90                   1      -
  With metastasis                 561            12.66                   1.00   0.79 - 1.26
YEAR OF ADMISSION
  1997                            934            19.27                   1      -
  1998                            920            16.41                   0.84   0.67 - 1.04
  1999                            975            14.97                   0.83   0.67 - 1.03

There is no effect of residence on the type of treatment. The presence of metastases is strongly associated with the outcome, and it strongly reduces the percentage of surgical treatments (from 16.2% to 2.7%).

Hospital mortality has also been evaluated for all three cancers. As expected, it was markedly different for the three cancers: 0.1% for breast cancer, 4.3% for rectum and 11.7% for lung. Only for lung cancer was hospital mortality associated with the outcome: 98.7% of those who died had had a medical admission.

Table 4: Analysis of factors related to type of treatment for lung cancer (N=10875)

                                 Nr. PATIENTS   % SURGERY      OR     IC 95%
AGE
  < 60                           2158           16.31          1      -
  60 - 69                        3828           17.16          1.09   0.94 - 1.27
  ≥ 70                           4889           7.49           0.41   0.35 - 0.48
SEX
  Female                         2034           12.29          1      -
  Male                           8841           12.72          0.98   0.84 - 1.14
EDUCATIONAL LEVEL
  Intermediate or more           2630           16.39          1      -
  Primary or nothing             5498           10.53          0.66   0.58 - 0.77
  Missing data                   2747           13.29          0.88   0.75 - 1.02
CITY OF RESIDENCE
  Turin                          2338           13.56          1      -
  Other cities                   8537           12.39          0.87   0.76 - 1.03
DISEASE STAGING
  Without metastasis             8138           16.17          1      -
  With metastasis                2844           2.67           0.13   0.10 - 0.17
YEAR OF ADMISSION
  1997                           4022           11.46          1      -
  1998                           3571           12.91          1.21   1.05 - 1.40
  1999                           3282           13.80          1.40   1.21 - 1.61
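The adjusted odds ratios in Tables 2-4 are of the kind produced by a multivariable logistic regression on the discharge records. The estimation method is not restated here, so the following R sketch is only an illustration of how such ORs and 95% confidence intervals could be obtained; the data frame rectum and its variable names are hypothetical.

## Hypothetical sketch: adjusted odds ratios for permanent colostomy
## from a multivariable logistic regression on discharge records.
fit <- glm(colostomy ~ age_class + sex + education + residence +
             hospital_volume + metastasis + year,
           family = binomial, data = rectum)
## Adjusted ORs with 95% confidence intervals
exp(cbind(OR = coef(fit), confint.default(fit)))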

5 Discussion

The information system that collects data through the SDO is a valid instrument for discussing aspects of quality of care, even if any result must be considered with caution. Incomplete compilation of the form can introduce distortions in the results. For example, the proportion of secondary diagnosis codes is rather low (31.3% for breast cancer, 61.5% for rectum and 60.4% for lung cancer), which leads to an underestimation of disease staging.

The world-wide literature on the surgical treatment of breast cancer supports the validity of less aggressive surgery for stage I and II (TNM) cancer, and supports the use of post-surgical therapies in order to improve patients' quality of life. In this study it has not been possible to analyse post-surgical therapies. Some international guidelines (Mirsky 2001, SIGN 1998, NHMRC 2000) propose the type of surgery, mastectomy versus other interventions, as an outcome to be monitored.

The literature also supports the validity of less aggressive surgical treatments jointly with adjuvant therapies for rectum surgery; again, in this study they have not been evaluated. For lung cancer we analysed surgical treatment versus medical treatment. The regional

health service seems not to have enough resources for thoracic surgery, and consequently migration towards neighbouring regions is important.

The residence variable is an important factor related to the type of treatment. Both breast and rectal cancer patients who live in Turin receive a less invasive treatment compared to the other patients, probably due to the higher accessibility of services and to favourable economic, cultural, social and organisational factors.

The association found between hospital workload and type of treatment was not expected. This phenomenon is particularly interesting for rectum cancer: for breast cancer many studies have already evidenced the correlation, while for colorectal cancer only some recent studies confirm the association (Marush 2001; Hodgson 2001) and others disagree (Simunovic 2000).

Mastectomies and abdominoperineal resections increase with age and severity of disease. Regarding the severity of illness, we classified patients with or without metastases (both in lymph nodes and in other organs). The number of secondary diagnoses notified on the form is low, and we have not been able to classify patients by level of disease staging; the number of patients with metastases is also underestimated because of defects in filling in the forms. The breast and lung cancer analyses have evidenced a high risk of invasive treatment for patients with a more advanced cancer: a more advanced stage clearly corresponds to a more radical surgical treatment. In rectum cancer, disease staging seems not to have any influence on the surgical choices, but this result may be modified by other variables.

These are preliminary analyses aimed at studying quality assessment in cancer care using administrative data. Using the regional information system of admissions, cross-sectional considerations can be made on the quality of the services offered to the regional cancer population. To define a better system of quality indicators, the regional health information system should be integrated with other information systems - for instance outpatient records - to evaluate other outcomes such as relapses and long-term survival.

References

B. Begg et al. Impact of hospital volume on operative mortality for major cancer surgery. JAMA, 280(20):1747–1751, 1998a.

C. J. Mettlin et al. A comparison of breast, colorectal, lung, and prostate cancers reported to the National Cancer Data Base and the Surveillance, Epidemiology and End Results program. Cancer, 79:2052–2061, 1997.

D. Mirsky et al. Surgical management of early stage invasive breast cancer (Stage I and II). In CCOPGI Guidelines. 2001a.

D.C. Hodgson et al. Impact of patient and provider characteristics on the treatment and outcomes of colorectal cancer. J Natl Cancer Inst., 93:501–515, 2001b.

F. Marusch et al. Hospital caseload and the results achieved in patients with rectal cancer. Br J Surg, 88:1397–1402, 2001c.

M. Dorval et al. Type of mastectomy and quality of life for long term breast carcinoma survivors. Cancer, 83:2130–2138, 1998b.

M. Simunovic et al. Hospital procedure volume and teaching status do not influence treatment and outcome measures of rectal cancer surgery in a large general population. J Gastrointest Surg., 4:324–330, 2000.

I. Iezzoni. Assessing quality using administrative data. Annals of Internal Medicine, 127(8 (2)), 1998.

J. Green et al. In search of America's best hospital. JAMA, 277(14):1152–1155, 1997.

Clinical practice guidelines for the management of early breast cancer: second stage consultation. NHMRC National Breast Cancer Centre, 2000.

Improving outcomes in lung cancer. NHS Executive, 1998.

Tumori del colon retto. Linee guida clinico-organizzative. Regione Piemonte, in press.

Guidelines for the management of colorectal cancer. Royal College of Surgeons, 1996.

Colorectal cancer. Scottish Intercollegiate Guidelines Network, 1997.

Breast Cancer in Women. Scottish Intercollegiate Guidelines Network, 1998a.

Management of lung cancer. Scottish Intercollegiate Guidelines Network, 1998b.

Parallel Computing in Spatial Statistics

Nadja Samonig
University of Klagenfurt, Austria

Abstract: The object of this article is the parallelization of kriging, an estimation method widely used in geostatistics. This is achieved by dividing a large problem into many small tasks. First, a FORTRAN subroutine for kriging prediction was modified so that it collects several kriging systems into a slightly bigger one (established by so-called "tiles") in order to gain computational speed. Then the algorithm was modified so that the small systems, the tiles, can overlap; this is necessary to circumvent border effects. The task of prediction within these tiles is well suited for parallelization. Finally, we achieve a large saving of time by distributing the tiles as small computational tasks to different computers. The implementation is based on PVM (Parallel Virtual Machine) and provided as an R library.

1 Introduction

Large-scale, computationally demanding problems can be handled by parallel processing. This approach solves one large problem by dividing it into many small independent subtasks, which are then handled in parallel by the nodes of a computing cluster. Kriging prediction, one of the central methods of geostatistics, is modified so that one big kriging system is divided into small ones in order to boost computational speed. A further modification collects neighbouring prediction points into so-called tiles, which can overlap, similar to moving-window strategies. Some tile points are then computed more than once, with the effect that the prediction is more accurate, especially at the tile borders. We will show that the computation of predictions at these tile points is well suited for parallelization. We achieve a large saving of time by distributing small computational tasks to different computers. Finally, parallel kriging was implemented successfully.

2 Spatial Statistics

The central concepts of spatial statistics are regionalized variables $z(x)$ with $x \in D \subseteq \mathbb{R}^d$, which are treated as realizations of an underlying random function $Z(x)$. This random function $Z(x)$ is called second-order stationary if it fulfills the conditions
$$E[Z(x+h)] = E[Z(x)], \qquad \mathrm{cov}[Z(x+h), Z(x)] = C(h), \quad h \in \mathbb{R}^d,$$
that is, it has a constant mean $E[Z(x)] = m$ and its covariance function
$$C(h) = E[Z(x) \cdot Z(x+h)] - m^2$$

is translation invariant. A weaker form of stationarity, intrinsic stationarity, is defined by the conditions
$$E[Z(x+h) - Z(x)] = m(h) = 0, \qquad \mathrm{var}[Z(x+h) - Z(x)] = 2\gamma(h),$$

where $\gamma(h)$ is the semivariogram, defined by $\gamma(h) = \frac{1}{2} E\left[(Z(x+h) - Z(x))^2\right]$.
Ordinary kriging is a method for estimating the value of $z(x)$ at a point $x$ of the region $D$ yielding minimal prediction variance. It uses the linear estimator
$$\hat{Z}(x_0) = \sum_{\alpha=1}^{n} w_\alpha Z(x_\alpha).$$
The condition $\sum_{\alpha=1}^{n} w_\alpha = 1$ ensures the unbiasedness of the estimation. The variance of this estimator is given by
$$\sigma_E^2 = \mathrm{var}[\hat{Z}(x_0) - Z(x_0)] = C(0) + w^T K w - 2 c_0^T w.$$
Minimizing $\sigma_E^2$ results in solving the ordinary kriging system
$$\begin{pmatrix} K & 1 \\ 1^T & 0 \end{pmatrix} \begin{pmatrix} w \\ \mu \end{pmatrix} = \begin{pmatrix} c_0 \\ 1 \end{pmatrix} \qquad (1)$$
where $K_{i,j} = C(x_i - x_j)$ and $c_{0,j} = C(x_0 - x_j)$. If now the trend function $m$ is allowed to be a linear function of some parameter vector $\theta$:

$$m(x) = \theta^T f(x), \qquad \theta \in \mathbb{R}^p,$$
the above defined estimator becomes the universal kriging estimator. The unbiasedness condition changes to
$$F^T w = f(x_0).$$
Minimizing the estimation variance

$$\sigma_E^2 = 2 w^T \gamma_0 - w^T \Gamma w$$
is now equivalent to solving the universal kriging system
$$\begin{pmatrix} \Gamma & F \\ F^T & 0 \end{pmatrix} \begin{pmatrix} w \\ -\mu \end{pmatrix} = \begin{pmatrix} \gamma_0 \\ f(x_0) \end{pmatrix} \qquad (2)$$
where $\Gamma_{i,j} = \gamma(x_i - x_j)$ and $\gamma_{0,j} = \gamma(x_0 - x_j)$. Due to the link $\gamma(h) = C(0) - C(h)$ between semivariogram and covariance function this is equivalent to solving
$$\begin{pmatrix} C & F \\ F^T & 0 \end{pmatrix} \begin{pmatrix} w \\ \mu \end{pmatrix} = \begin{pmatrix} c_0 \\ f(x_0) \end{pmatrix}. \qquad (3)$$

For prediction at location $x_0$ usually not the whole dataset $\{(x_i, z(x_i)) \mid i = 1, \ldots, n\}$ is needed, because locations $x_i$ far from $x_0$ have little influence and their kriging coefficients $w_i$ would be downweighted towards zero anyway. So it suffices to take into account only points within a region "near" $x_0$, the so-called search neighborhood.
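The following R sketch illustrates how the ordinary kriging system (1) is set up and solved for a single prediction point within a search neighbourhood. It is not the rgeostat/pvmkrige implementation; the exponential covariance model, its parameters and the toy data are assumptions made for this example.

## Minimal ordinary kriging for a single prediction point x0 (system (1)).
## The exponential covariance model and its parameters are illustrative.
cov_fun <- function(h, sill = 1, range = 300) sill * exp(-h / range)

krige_point <- function(x0, xy, z, r = 500) {
  d0  <- sqrt(rowSums((xy - matrix(x0, nrow(xy), 2, byrow = TRUE))^2))
  nb  <- which(d0 <= r)                 # search neighbourhood S
  K   <- cov_fun(as.matrix(dist(xy[nb, ])))
  c0  <- cov_fun(d0[nb])
  n   <- length(nb)
  A   <- rbind(cbind(K, 1), c(rep(1, n), 0))   # bordered kriging matrix
  sol <- solve(A, c(c0, 1))
  sum(sol[1:n] * z[nb])                 # kriging prediction at x0
}

## Usage on a toy data set
set.seed(1)
xy <- cbind(runif(50, 0, 1000), runif(50, 0, 1000))
z  <- sin(xy[, 1] / 200) + rnorm(50, sd = 0.1)
krige_point(c(500, 500), xy, z)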

3 Kriging on Tiled Grids

Usually kriging prediction is performed on regular rectangular grids $\{x_{0i} \mid i = 1, \ldots, m\}$. The parameters of these grids (grid spacing in x and y direction) control the quality of the estimated map. For that reason fine grids are preferred, but obviously the computational burden increases with the number of points in the grid. For each grid point $x_{0i}$ we have to solve the system of linear equations (3). These kriging systems are set up by establishing individual search neighborhoods $S_i = \{x_j \mid \|x_j - x_{0i}\| \le r\}$ and calculating the components $C_{S_i}$, $F_{S_i}$, $c_{0i}$ and $f(x_{0i})$ of the system matrices.
The new idea is that, instead of solving each kriging system separately, for the set of points $x_{0i}$ we take the union of all search neighborhoods $S_i$ and form an overall search neighborhood $\cup_i S_i$. Now we can solve the resulting universal kriging systems simultaneously, using numerical methods for solving systems of linear equations with multiple right-hand sides. We use the covariance matrix $C_{\cup_i S_i}$, the design matrix $F_{\cup_i S_i}$ and the vectors $c_{0i}$ and $f(x_{0i})$ to establish the following kriging system
$$\begin{pmatrix} C_{\cup_i S_i} & F_{\cup_i S_i} \\ F_{\cup_i S_i}^T & 0 \end{pmatrix} \begin{pmatrix} w_1 & \cdots & w_m \\ \mu_1 & \cdots & \mu_m \end{pmatrix} = \begin{pmatrix} c_{01} & \cdots & c_{0m} \\ f(x_{01}) & \cdots & f(x_{0m}) \end{pmatrix} \qquad (4)$$
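The gain from combining neighbourhoods comes from solving one bordered system with several right-hand sides, as in (4). The sketch below illustrates this for the ordinary kriging case (F reduced to a column of ones), reusing cov_fun and the toy data xy, z from the previous sketch; the tile and the search radius are again illustrative.

## Solve the kriging systems of all prediction points of one tile at once
## (system (4), ordinary kriging case).  x0s is a matrix of prediction points.
krige_tile <- function(x0s, xy, z, r = 500) {
  ## union of the individual search neighbourhoods
  nb <- which(apply(xy, 1, function(p)
          any(sqrt(colSums((t(x0s) - p)^2)) <= r)))
  n  <- length(nb)
  K  <- cov_fun(as.matrix(dist(xy[nb, ])))
  A  <- rbind(cbind(K, 1), c(rep(1, n), 0))
  ## one column of right-hand sides per prediction point
  C0 <- apply(x0s, 1, function(p)
          cov_fun(sqrt(rowSums((xy[nb, ] - matrix(p, n, 2, byrow = TRUE))^2))))
  W  <- solve(A, rbind(C0, 1))          # all weight vectors in one step
  colSums(W[1:n, , drop = FALSE] * z[nb])
}

## Predict a small 3 x 3 tile of grid points
tile <- as.matrix(expand.grid(x = c(400, 500, 600), y = c(400, 500, 600)))
krige_tile(tile, xy, z)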

3.1 Tiled Grid and Overlapping Tiled Grid

In practice we will not combine the search neighborhoods of all grid points. It only makes sense to combine those search neighborhoods belonging to prediction points $x_{0i}$ which are themselves neighbors of each other. In this case the individual search neighborhoods overlap to a great extent, which means that $|\cup_i S_i| < \sum_i |S_i|$; this is important because the matrix size of the kriging system depends on the number of points in the search neighborhood. So we will be able to solve all systems simultaneously in one step, while the matrix dimensions will only be slightly greater than in the case of individual solving. Those sets of prediction points for simultaneous kriging prediction will be called "tiles", because we choose them as rectangular subsets of our starting overall grid. We describe a tiled grid by a tuple

$$\langle X = \{x_{ij} \mid i = 1, \ldots, n,\ j = 1, \ldots, m\},\ T = \{t_{kl} \subset X \mid k = 1, \ldots, r,\ l = 1, \ldots, s\},\ d_x, d_y, t_x, t_y \rangle$$

In this notation dx and dy are the grid spacings in x and y direction. tx and ty are the tiling parameters for choosing the tx × ty rectangular subsets tkl of the grid X. Furthermore an overlapping tiled grid is given by

$$\langle X = \{x_{ij} \mid i = 1, \ldots, n,\ j = 1, \ldots, m\},\ T = \{t_{kl} \subset X \mid k = 1, \ldots, r,\ l = 1, \ldots, s\},\ d_x, d_y, t_x, t_y, o_x, o_y \rangle$$

In addition to the definition of a tiled grid, the tiles now overlap to an extent of ox resp. oy points in x and y direction. So it is possible that a point xij belongs to more than one tile tkl.
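A minimal sketch of how the index sets tkl of an (overlapping) tiled grid can be generated from the parameters n, m, tx, ty, ox, oy; the function name and the handling of the last row and column of tiles are choices made only for this illustration (assuming ox < tx and oy < ty).

## Split the index set of an nx x ny grid into rectangular tiles of tx x ty
## points that overlap by ox (oy) points in x (y) direction.
## Returns a list of index matrices (one element per tile t_kl).
make_tiles <- function(nx, ny, tx, ty, ox = 0, oy = 0) {
  starts <- function(n, t, o) unique(pmin(seq(1, n, by = t - o), n - t + 1))
  tiles  <- list()
  for (i in starts(nx, tx, ox)) for (j in starts(ny, ty, oy))
    tiles[[length(tiles) + 1]] <-
      as.matrix(expand.grid(ix = i:(i + tx - 1), iy = j:(j + ty - 1)))
  tiles
}

## 100 x 100 grid, 10 x 10 tiles, overlap of 2 points in each direction
tiles <- make_tiles(100, 100, 10, 10, 2, 2)
length(tiles)          # number of tiles
dim(tiles[[1]])        # each tile holds tx * ty grid indices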

Figure 1: Overlapping search neighborhoods S1 and S2 of two prediction points x1, x2 and their union S1_2

Figure 2: Grid points divided into tiles

Figure 3: Grid points divided into overlapping tiles

By modifying the grid parameters dx and dy, the tile sizes tx and ty and the extent of overlapping ox and oy it is now possible to tune the output quality (influenced by grid spacing and overlap extent) and the computational performance (tile sizes). The overlap parameters especially help to avoid a "checkerboard" effect in the output (compare figures 7, 8 and 9), which is caused by border effects introduced around the single tiles within the grid. By enlarging the number of overlapping points we calculate some estimates twice, four times or even more often and are able to average these results. Finally we have to find a compromise between output quality (which rises with a finer grid, i.e. smaller dx, dy), avoiding border effects (by enlarging ox, oy) and time saving (large tile sizes tx, ty).

4 Universal PVM Kriging

We follow an earlier work (see Tschofenig, 2001), which proves the possibility of a successful combination of R and PVM with a simple reference implementation written in C. We took this as a starting point and developed it further into an extension library1 for R2, a free and extensible implementation of the S language. Figure 4 gives a short overview of how parallel tile kriging works. Within an interactive R session the estimation is started. The session connects to a server process and transmits the data and grid parameters. The server process prepares the tiles and feeds them to the cluster nodes of a parallel virtual machine. It also collects the results delivered back by the PVM worker processes, combines them back into the overall grid and sends the grid back to the R session, where it can be displayed. Figure 5 shows the send-receive interaction between the server and client processes as displayed by XPVM3. The green bars show process activity, the red links between them indicate data and parameter transfers.
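The actual implementation uses PVM. As a hedged stand-in, the same idea - treating each tile as an independent task, distributing the tasks over worker processes, and averaging the estimates of grid points that lie in several overlapping tiles - can be sketched with R's parallel package, reusing the toy functions and data from the sketches above. This is not the pvmkrige code, only an illustration of the workflow in Figure 4.

## Stand-in for the PVM server/worker scheme: tiles are distributed as
## independent tasks over a cluster of R workers (parallel package instead
## of PVM).  krige_tile, make_tiles, cov_fun, xy and z come from the
## illustrative sketches above.
library(parallel)

grid_x <- seq(0, 1000, length.out = 100)
grid_y <- seq(0, 1000, length.out = 100)
tiles  <- make_tiles(length(grid_x), length(grid_y), 10, 10, 2, 2)

cl <- makeCluster(4)                                 # the "cluster nodes"
clusterExport(cl, c("krige_tile", "cov_fun", "grid_x", "grid_y", "xy", "z"))

tile_results <- parLapply(cl, tiles, function(idx) {
  x0s <- cbind(grid_x[idx[, "ix"]], grid_y[idx[, "iy"]])
  cbind(idx, pred = krige_tile(x0s, xy, z))          # one task per tile
})
stopCluster(cl)

## Combine the tiles back into the overall grid, averaging the estimates of
## grid points that belong to more than one (overlapping) tile.
all  <- do.call(rbind, tile_results)
pred <- matrix(NA, length(grid_x), length(grid_y))
cnt  <- matrix(0,  length(grid_x), length(grid_y))
for (k in seq_len(nrow(all))) {
  i <- all[k, "ix"]; j <- all[k, "iy"]
  pred[i, j] <- sum(pred[i, j], all[k, "pred"], na.rm = TRUE)
  cnt[i, j]  <- cnt[i, j] + 1
}
pred <- pred / cnt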

5 Example

The data set used in the example is a collection of zinc measurements as a groundwater quality variable. It is distributed with the R library sgeostat and has its origin as an example data set for the gstat software by E.J. Pebesma4. Figure 6 compares the computational speed of the three kriging functions contained in the R libraries rgeostat and pvmkrige applied to this data set: krige.grid, krige.tiles and krige.tilesov. In order to get a representative sample, the computation time for each of the three functions was measured five times and then the mean was taken. In the first graphic the function krige.tiles uses tile size 5, which means that every subrectangle contains 5 × 5 points. The function krige.tilesov uses tile size 5 × 5, too. This function has the modification that we can select how many points lying in the tiles (in this example 2 and 3) should overlap in order to yield more exact results.

1pvmkrige, available at ftp://ftp-stat.uni-klu.ac.at/pub/R/contrib 2http://www.r-project.org 3a graphical tool written in tcl/tk, distributed with the PVM software 4http://www.frw.uva.nl/~pebesma/gstat/ 107

Figure 4: Parallelizing the tiles (schematic: the R session, the SERVER process and the PVM CLIENTs A, B and C)

Figure 5: XPVM: Communication between Server and Clients

We can see that the function krige.tiles is much faster than krige.grid, but the computational exactness decreases at the tile borders for krige.tiles (compare figures 7, 8 and 9). Therefore the function krige.tilesov was built, which computes more exactly because of the averaging at the borders of overlapping tiles. However, increasing the number of overlapping points also means greater computational effort. We can see this characteristic in all four graphics of figure 6, which shows the same computation times for different tile sizes. Figure 7 shows the map produced by classical ordinary kriging, as produced by the function krige.grid from the R library rgeostat. The output of tiled grid kriging is shown in figure 8. Finally, figure 9 shows the output of tiled grid kriging with overlapping.
We also tried different sizes of the PVM cluster. Figure 10 shows these effects. It can be seen that it makes no sense to increase the size of the cluster to arbitrarily high numbers. This depends of course on the hardware used and the quality of the network links between the cluster nodes. In our setup we used up to eight DEC alpha machines with different specifications and a mixed 10/100 Mbit network between them.

Figure 6: Computational burden of krige.grid, krige.tiles and krige.tilesov (time in seconds against number of grid points; panels for tile sizes 5, 10, 20 and 30 with overlapping points 2,3 / 2,4,6,7 / 2,4,8,12,14 / 4,8,12,16,20)

Figure 7: Plot of maas.kgrid, where nx=ny=100

Figure 8: Plot of maas.ktiles, where nx=ny=100, itx=ity=10

Figure 9: Plot of maas.ktilesov, where nx=ny=100, itx=ity=10, nxper=nyper=2

Figure 10: Computational effort with different sets of computers (2, 5 and 8 computers; time in seconds against number of grid points; panels for tile size 10 with 2 and 6 overlapping points and tile size 30 with 8 and 16 overlapping points)

References

W.S. Cleveland. Visualizing Data. AT & T Bell Laboratories, 1993.

N.A.C. Cressie. Statistics for Spatial Data. Wiley, 1993.

P. Diggle. Statistical Analysis of Spatial Point Patterns. Academic Press, 1983.

A. Geist et al. PVM: Parallel Virtual Machine - A Users Guide and Tutorial for Networked Parallel Computing. second edition. The MIT Press, Cambridge, Massachusetts, 1995.

N.L. Hjort and H. Omre. Topics in spatial statistics. Scandinavian Journal of Statistics, 1994.

R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 1996.

E.H. Isaaks and R.M. Srivastava. Applied Geostatistics. Oxford University Press, 1989.

A. Krause. Einführung in S und S-Plus. Springer, 1997.

G. Matheron. Geostatistical Case Studies. Reidel, 1987.

R.A. Olea. Geostatistics for Engineers and Earth Scientists. Kluwer Academic Publish- ers, 1999.

H. Tschofenig. Anwendung der parallelen, virtuellen Maschine (PVM) für eine Geostatistikanwendung. Universität Klagenfurt, 2001.

W.N. Venables and Brian D. Ripley. Modern Applied Statistics with S-Plus. Springer, 1994.

H. Wackernagel. Multivariate Geostatistics - An Introduction with Applications. Springer, 1995.

Non-additive probability

Damjan Škulj

Faculty of Social Sciences, University of Ljubljana, Slovenia

1 Introduction

Non-additive measure theory has made significant progress in recent years and has been used intensively in many fields of applied mathematics. The basic idea is to replace ordinary additive measures by more general set functions. In the case of probability measures, the non-additive measures are called non-additive probabilities. They turn out to be a very flexible tool for expressing subjective probabilities. By the term subjective probabilities we denote the probabilities that decision makers assign to events when the true probabilities are unknown. In fact, even if the true probabilities are known, sometimes better results can be achieved by using non-additive probabilities.
An important motivation for using non-additive probabilities is their use in expected utility models. The expected utility model proposed by von Neumann and Morgenstern was a widely used tool for solving decision-theoretical problems for decades, though it has its limitations. Later, Savage improved it significantly by including subjective probabilities. However, the probabilities used in his model remained additive. To make expected utility models more flexible, additive subjective probabilities were later replaced by non-additive probabilities, called capacities.
Capacities used in expected utility models prove to be a very flexible tool to model different kinds of behavior. Most decision makers, for example, overestimate small and underestimate large probabilities. Further, most decision makers prefer decisions where more information is available to decisions with less available information. The latter behavior is known as uncertainty aversion and turns out to be impossible to express within an additive model. Risk aversion, on the other hand, can be expressed in an additive model by transforming the utility function. However, attitudes towards wealth cannot then be separated from attitudes towards risk, and this can sometimes be a difficulty.
Many results and concepts that belong to additive measure or probability theory have natural generalizations to the non-additive theory. Integration with respect to non-additive measures can, for example, be done by replacing the Lebesgue integral with the Choquet integral. However, there are also several results of the additive theory that have not yet been generalized to the non-additive case. One reason is that there sometimes is no straightforward generalization, or that the obvious generalization gives only trivial or meaningless results.
An important, not yet completely solved problem is to define a product of capacities. In this paper a step towards one of the possible solutions is presented. A concept related to products of additive measures is the multiple integral. However, I personally think that its generalization to non-additive measures does not provide the desired results, at least since

there is no theorem like Fubini's for non-additive measures in general. Due to its generality, non-additive theory opens several new questions and concepts. In this paper convexity and concavity of capacities are considered, the problem whether an arbitrary capacity is a difference of two convex capacities is stated, and a partial solution is provided.

2 Motivation

2.1 Expected utility model

We will now briefly describe the von Neumann - Morgenstern expected utility model and show how the use of non-additive probabilities can be helpful. For more information the interested reader is referred to Fishburn (1970).
We start with a non-empty set of outcomes, which we denote by X. Further, there is a decision maker who chooses among the outcomes. The outcomes cannot be chosen arbitrarily; instead, there exists a set of rules which determine the probabilities of the outcomes following the decisions that the decision maker chooses from some set of possible decisions. We will call these rules lotteries. For simplicity we will assume that the set of outcomes is finite and that it contains everything that the decision maker can either win or lose. So, for example, winning a car in a lottery means winning the car and losing the money paid for the lottery tickets. Formally, we will write a lottery as the distribution

$$l = \langle p_1, x_1; p_2, x_2; \ldots; p_r, x_r \rangle. \qquad (1)$$

Now, if the agent chooses lottery l, he will obtain outcome xi with probability pi and these are the only possibilities. Therefore,

$$p_i \ge 0 \text{ for } i = 1, \ldots, r \quad \text{and} \quad \sum_{i=1}^{r} p_i = 1. \qquad (2)$$
The set of all lotteries carries the structure of an affine convex set. Given lotteries $l = \langle p_1, x_1; \ldots; p_r, x_r \rangle$, $l' = \langle p_1', x_1; \ldots; p_r', x_r \rangle$ and $\alpha \in [0,1]$, their affine combination $\alpha l + (1-\alpha) l'$ is $\langle \alpha p_1 + (1-\alpha) p_1', x_1; \ldots; \alpha p_r + (1-\alpha) p_r', x_r \rangle$ and is again a lottery.
The expected utility model assumes the existence of a preference relation $\succsim$ on the set of all lotteries such that, for any two lotteries $l$ and $l'$, $l \succsim l'$ holds if the agent finds lottery $l$ at least as desirable as lottery $l'$. We denote the strict part of the relation $\succsim$ by $\succ$: $l \succ l'$ if $l \succsim l'$ but not $l' \succsim l$. In order to apply the expected utility model to a preference relation, we must require the preference relation to satisfy some properties.

(i) The preference relation $\succsim$ is a nontrivial weak order, that is, complete: for every two lotteries $l$ and $l'$ at least one of the relations $l \succsim l'$ or $l' \succsim l$ holds; and transitive: if $l \succsim l'$ and $l' \succsim l''$, then $l \succsim l''$. Nontriviality denotes that there are at least two lotteries $l$ and $l'$ such that $l \succ l'$ holds.

(ii) Independence: for any triple of lotteries $l$, $l'$ and $m$ and real $\alpha \in (0,1)$, if $l \succ l'$ then $\alpha l + (1-\alpha) m \succ \alpha l' + (1-\alpha) m$ holds.

(iii) Continuity or Archimedean axiom: for a triple of lotteries $l \succ m \succ l'$ there exist real $\alpha, \beta \in (0,1)$ such that $\alpha l + (1-\alpha) l' \succ m$ and $m \succ \beta l + (1-\beta) l'$.

We are now going to give a sufficient condition for a preference relation $\succsim$ to have an expected utility representation, that is, for the existence of an affine real function $U$ on the set of all lotteries such that for any two lotteries $l$ and $l'$, $l \succsim l' \iff U(l) \ge U(l')$. Moreover, it turns out that the representation function $U$ only depends on the utilities of single outcomes. Let $l_x$ denote the lottery that yields outcome $x$ with probability 1. Next, we can define a real function $u\colon X \to \mathbb{R}$ with $u(x) := U(l_x)$ and call it the utility function. We say the preference relation has an expected utility representation if for any two lotteries $l$ and $l'$ the following holds:

$$l \succsim l' \iff \sum_{i=1}^{n} p_i u(x_i) \ge \sum_{i=1}^{n'} p_i' u(x_i'). \qquad (3)$$
The representation function $U$ is in this case the mathematical expectation of the utility function:

$$U(l) = \sum_{i=1}^{n} p_i u(x_i).$$
The following theorem tells us whether a preference relation has an expected utility representation.

Theorem 1. Let $\succsim$ be a preference relation on the set of lotteries over a set of outcomes $X$. Then $\succsim$ satisfies properties (i), (ii) and (iii) if and only if there exists a utility function $u\colon X \to \mathbb{R}$ such that (3) holds for all lotteries $l$ and $l'$. The proof can be found in Fishburn (1970).

2.2 Assigning probability, Ellsberg paradox

In the previous section we described the structure of expected utility models. They consist of outcomes and their probabilities, and we assumed that the probabilities used were precise. Real-world problems are not that simple, and the difficulty often lies in computing the probabilities of events. With the exception of some idealized cases, probabilities can almost never be estimated exactly. The precision of the estimates differs substantially from case to case. Nevertheless, the expected utility model makes no difference among them; however, as we shall see, it should.
Suppose that we have k events, that exactly one of them is going to happen, and that we have no further information about their probabilities. To assign probabilities in such a case we usually follow two rules. The first rule says that all assigned probabilities should be equal. The second rule says that the sum of all probabilities should be 1. In this case

the assigned probability of each event must be equal to 1/k, according to these rules. If probabilities are assigned in such a way, the expected utility model does not carry any information about the precision of the assigned probabilities. However, there is much empirical evidence showing that a substantial proportion of decision makers distinguish between precise and imprecise probability estimates.
The following example is known as the Ellsberg paradox. See also Schmeidler (1989); Gilboa (1987). A person is shown two urns, A and B, each containing 100 balls of red and black colour. Urn A contains 50 black and 50 red balls, while there is no additional information about urn B. One ball is drawn at random from each urn. The person is offered a bet on the colour of the ball chosen from either urn. Possible bets are therefore 'the ball drawn from urn A is black' or 'the ball drawn from urn A is red'. Denote these bets by $A_b$ and $A_r$, and similarly for B we have bets $B_b$ and $B_r$. Winning the bet, the person receives 100 €. It has been observed empirically that most subjects prefer any bet on urn A to the bets on urn B: $A_b \sim A_r \succ B_b \sim B_r$.
In the expected utility setting we have the following situation. The set of outcomes $X$ consists of 100 € and 0 €. Denote the four lotteries by $A_b = \langle 100, p_1; 0, q_1 \rangle$, $A_r = \langle 100, p_2; 0, q_2 \rangle$, $B_b = \langle 100, p_3; 0, q_3 \rangle$ and $B_r = \langle 100, p_4; 0, q_4 \rangle$. Since the set of outcomes consists of only two elements, the utility function can be chosen arbitrarily, say $u(100) = 100$ and $u(0) = 0$. In this example we will therefore not distinguish between the value in euros of an outcome and its utility. It thus only remains to assign probabilities to the events.

For each $i$ the expected utility of the corresponding lottery is $p_i \cdot 100 + q_i \cdot 0 = p_i \cdot 100$. The relation $A_b \sim A_r$ implies $100 p_1 = 100 p_2$. Since $q_1 = p_2$ and $q_2 = p_1$, we get $p_i = q_i = 1/2$ for $i = 1, 2$. Using the relation $B_b \sim B_r$, $p_i = q_i = 1/2$ would similarly be obtained for $i = 3, 4$. The expected utilities of all bets thus amount to 50, but this contradicts the relation $A_r \succ B_r$, which was observed empirically. Clearly, an additive expected utility model cannot explain this situation.
A way to avoid this problem is to allow the probabilities to be non-additive. If we put $p_i = q_i = 3/7$ for $i = 3, 4$ in the previous example, the expected utilities of the bets $B_b$ and $B_r$ would amount to about 43, which would agree with the observed preferences. The difference $1 - (3/7 + 3/7) = 1/7$ can be interpreted as a penalty due to the absence of information. The amount of such a penalty depends on the decision maker and can be different for different people.
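As a small numerical illustration of the computation above (with the utility values from the text), the additive assignment gives all four bets the expected utility 50, while the non-additive assignment 3/7 lowers the value of the bets on urn B:

## Ellsberg example: with u(100) = 100 and u(0) = 0 the expected utility of
## a two-outcome bet is just p * 100.
eu <- function(p) p * 100

eu(1/2)            # bets on urn A (and the additive assignment for urn B): 50
eu(3/7)            # non-additive assignment v(Bb) = v(Br) = 3/7: about 42.9
1 - (3/7 + 3/7)    # probability mass withheld: the penalty for missing information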

3 Non-additive probabilities

Let $S$ be a non-empty set and $A$ a family of its subsets such that $\emptyset$ and $S$ belong to it, and let $v$ be a monotonic real-valued function on $A$ such that $v(\emptyset) = 0$ and $v(S) = 1$. Monotonicity here means that for any $A, B$ in $A$, $A \subset B$ implies $v(A) \le v(B)$. Such a function $v$ is called a non-additive probability or a (normalized) capacity. A non-additive probability thus in general violates the requirement $v(A \cup B) + v(A \cap B) = v(A) + v(B)$, which holds for additive probabilities. In general, capacities do not need to be normalized, which means that $v(S) = 1$ does not hold in general. However, when capacities are used as non-additive probabilities, normalization is assumed.

3.1 Integration with respect to non-additive probabilities

Non-additive probabilities, despite being much more general, inherit some important properties of classical additive probabilities. One of the most important concepts closely related to additive measures is integration, and it has a natural generalization to non-additive measure theory. The integral with respect to non-additive measures was introduced by Choquet and is therefore now known as the Choquet integral. If the capacity used to compute the Choquet integral is additive, the Choquet integral coincides with the additive Lebesgue integral. Take a function $f\colon S \to \mathbb{R}$ and define its decreasing distribution function with respect to $v$ by

$$G_{v,f}(x) = v(f \ge x). \qquad (4)$$

The function defined by (4) may not exist for every real $x$. In order to assure its existence, the sets $\{s \in S \mid f(s) \ge x\}$ must belong to $A$ for every real $x$. But this turns out not to be a serious obstacle, since non-additive probabilities are much easier to extend than additive ones. From now on we thus suppose that the decreasing distribution functions of all functions considered are defined for all real numbers. In this setting we define the Choquet integral by

$$\int f \, dv = \int_{-\infty}^{0} \left( G_{v,f}(x) - v(S) \right) dx + \int_{0}^{\infty} G_{v,f}(x) \, dx. \qquad (5)$$

For a more general definition see Denneberg (1997). The Choquet integral behaves similarly to the ordinary Lebesgue integral, but unlike it, the Choquet integral is in general not additive: $\int (f+g)\,dv \ne \int f\,dv + \int g\,dv$ in general. Though, if $f$ and $g$ are in a certain relation, it is additive. Let the functions $f, g\colon S \to \mathbb{R}$ be such that for each $s$ and $t$ in $S$, $f(s) > f(t)$ implies $g(s) \ge g(t)$. Then we say that $f$ and $g$ are comonotonic, or that they induce similar orderings on the set $S$. Given a real function $f$ on $S$, we can define the relation $\ge_f$ on $S$ by taking $s \ge_f t$ if $f(s) \ge f(t)$. If $f$ and $g$ are comonotonic, then the orderings they induce are similar in the sense that for no $s$ and $t$ can $s >_f t$ and $t >_g s$ hold simultaneously. It turns out that the Choquet integral is comonotonically additive; that is, if $f$ and $g$ are comonotonic then $\int (f+g)\,dv = \int f\,dv + \int g\,dv$. For more information about comonotonic additivity see Denneberg (1997) or Schmeidler (1986).
We will illustrate this by an example. Take $S = \{1,2,3\}$ and $A = \{1,2\}$, $B = \{1\}$, $C = \{2,3\}$. The non-additive probability $v$ is defined on the whole power set of $S$ by $v(\{1\}) = v(\{2\}) = v(\{3\}) = 1/5$, $v(\{1,2\}) = v(\{2,3\}) = v(\{1,3\}) = 4/5$ and $v(\{1,2,3\}) = 1$. Denote the characteristic functions of the sets $A$, $B$, $C$ and $S$ by $f = 1_A$, $g = 1_B$, $h = 1_C$ and

$i = 1_S$. We will need the following decreasing distribution functions (the index $v$ is omitted):
$$G_f(x) = G_h(x) = \begin{cases} 4/5 & x \le 1 \\ 0 & x > 1 \end{cases} \qquad G_g(x) = \begin{cases} 1/5 & x \le 1 \\ 0 & x > 1 \end{cases}$$
$$G_i(x) = \begin{cases} 1 & x \le 1 \\ 0 & x > 1 \end{cases} \qquad G_{f+h}(x) = \begin{cases} 1 & x \le 1 \\ 1/5 & 1 < x \le 2 \\ 0 & x > 2 \end{cases} \qquad G_{f+g}(x) = \begin{cases} 4/5 & x \le 1 \\ 1/5 & 1 < x \le 2 \\ 0 & x > 2 \end{cases}$$

Using definition (5) we compute $\int f\,dv + \int h\,dv = 4/5 + 4/5 = 8/5$, but $\int (f+h)\,dv = 1 + 1/5 = 6/5$. The inequality is a consequence of the non-comonotonicity of $f$ and $h$: for example, $f(1) = 1 > 0 = f(3)$ while $h(1) = 0 < 1 = h(3)$. On the other side, $\int f\,dv + \int g\,dv = 4/5 + 1/5 = 1 = \int (f+g)\,dv$, and this equality remains no matter how $v$ is changed. In general, the characteristic functions of subsets $A, B \subseteq S$ are comonotonic if and only if $A \subseteq B$ or $B \subseteq A$ holds.
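For a finite S the Choquet integral (5) can be computed directly by sorting the values of f. The following R sketch is one way to do this for nonnegative functions and reproduces the numbers of the example; the function and the cardinality-based encoding of v are illustrative constructions, not taken from the paper.

## Discrete Choquet integral of a nonnegative function f on a finite set S
## with respect to a capacity v, given as a function of subsets of S.
choquet <- function(f, v, S = seq_along(f)) {
  ord  <- order(f, decreasing = TRUE)        # s_(1), ..., s_(n)
  vals <- c(f[ord], 0)                       # f_(1) >= ... >= f_(n), f_(n+1) = 0
  sum(sapply(seq_along(ord), function(i)
    (vals[i] - vals[i + 1]) * v(S[ord[1:i]])))
}

## Capacity of the example: singletons 1/5, pairs 4/5, the whole set 1.
v <- function(A) c(0, 1/5, 4/5, 1)[length(unique(A)) + 1]

f <- c(1, 1, 0)   # indicator of A = {1, 2}
g <- c(1, 0, 0)   # indicator of B = {1}
h <- c(0, 1, 1)   # indicator of C = {2, 3}

choquet(f, v) + choquet(h, v)   # 4/5 + 4/5 = 1.6
choquet(f + h, v)               # 1 + 1/5  = 1.2  (f and h are not comonotonic)
choquet(f, v) + choquet(g, v)   # 4/5 + 1/5 = 1
choquet(f + g, v)               # = 1            (f and g are comonotonic)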

4 Expected utility without additivity

In order to apply non-additive probabilities to the expected utility model we will use a framework similar to the one proposed by Chateauneuf, which is a simplified version of Schmeidler's framework (see Chateauneuf (1991); Schmeidler (1989)). Let $S$ be a nonempty set of states of the world and $A$ a family of subsets of $S$, called events, with $\emptyset, S \in A$. Further, let $X$ be a set of outcomes. An $A$-measurable mapping $f\colon S \to X$ is called an act on $S$. Let $F$ be the set of all acts on $S$ and $\succsim$ a preference relation on it. A utility function $u\colon X \to \mathbb{R}$ and a capacity $v\colon A \to \mathbb{R}_+$ represent the relation $\succsim$ if and only if, for each two acts $f$ and $g$, the following equivalence holds:
$$f \succsim g \iff \int_S u(f)\,dv \ge \int_S u(g)\,dv. \qquad (6)$$
For the preference relation we assume the following properties.

(i) $\succsim$ is a non-trivial weak order.
Since the preference relation $\succsim$ is complete, it induces a weak order on the set of outcomes. Namely, for each two outcomes $x, y$ we can define $x \succsim y$ if and only if the same relation holds for the constant acts that map the whole of $S$ to $x$ or $y$ respectively. Having defined the weak order on $X$, we can extend the definition of comonotonicity to the set of acts $F$: we say that two acts $f$ and $g$ are comonotonic if and only if for all $s, t \in S$, $f(s) \succ f(t)$ implies $g(s) \succsim g(t)$.

The most often criticized axiom of the classical expected utility theory is the following:

(ii) Independence: for all $f, g, h \in F$ and for $\alpha \in (0,1)$, $f \succ g$ implies $\alpha f + (1-\alpha) h \succ \alpha g + (1-\alpha) h$.

We replace it with obviously less restrictive

(iii) Comonotonic independence: for all pairwise comonotonic acts $f, g, h \in F$ and for $\alpha \in (0,1)$, $f \succ g$ implies $\alpha f + (1-\alpha) h \succ \alpha g + (1-\alpha) h$.

The reason that the comonotonic independence axiom is used in place of the classical independence axiom is that it is much easier to compare two acts if they are comonotonic than if they are not. The comonotonic independence axiom can be weakened further (see Chateauneuf (1991)) in order to make the theory more convincing. However, it turns out that the weakened version is in fact equivalent to the one used here. The next two assumptions are natural:

(iv) Continuity: for all $f, g, h \in F$ with $f \succ g$ and $g \succ h$, there are $\alpha, \beta \in (0,1)$ such that $\alpha f + (1-\alpha) h \succ g$ and $g \succ \beta f + (1-\beta) h$.

(v) Monotonicity: for all $f, g \in F$: if $f(s) \succsim g(s)$ for all $s \in S$ then $f \succsim g$.
The representation theorem follows.

Theorem 2. Suppose that the preference relation $\succsim$ on $F$ satisfies (i) weak order, (iii) comonotonic independence, (iv) continuity and (v) monotonicity. Then there exist a unique capacity $v$ on $A$ and a utility function $u$ on $X$ that represent $\succsim$; that is, for all $f, g \in F$
$$f \succsim g \iff \int_S u \circ f \, dv \ge \int_S u \circ g \, dv.$$
Conversely, if there exist $u$ and $v$ as above, $u$ non-constant, then the preference relation they induce on $F$ satisfies (i), (iii), (iv) and (v). Finally, the utility function $u$ is unique up to positive linear transformations.
Remark. Replacing comonotonic independence (iii) with independence (ii) in the last theorem is equivalent to replacing the arbitrary capacity $v$ with an additive probability.
The proof of Theorem 2 can be found in Chateauneuf (1991) or Schmeidler (1989).
Let us now recall the Ellsberg example and state it in the modified expected utility setting. The set of states of the world contains the four possible states that may occur in the described experiment: $S = \{\{A_r, B_r\}, \{A_r, B_b\}, \{A_b, B_r\}, \{A_b, B_b\}\}$, where $A_b$ denotes, for example, that the ball chosen from urn A is black. We will use the same labels for events; so, for example, $A_r$ will denote the subset $\{\{A_r, B_r\}, \{A_r, B_b\}\} \subset S$. The acts among which the decision maker can choose are $f_{A_r}, f_{A_b}, f_{B_r}, f_{B_b}$. Now we will find a capacity $v$ and a utility function $u$ that represent the relation described in subsection 2.2. With the same argumentation as there we let $u = \mathrm{id}_{\mathbb{R}}$. By taking $v(A_r) = v(A_b) = 1/2$ and $v(B_r) = v(B_b) = 3/7$, for example, we would have the expected utilities $\int f_{A_r}\,dv = \int f_{A_b}\,dv = 50$ and $\int f_{B_r}\,dv = \int f_{B_b}\,dv \approx 43$, which represent the observed relations.

4.1 Convexity and concavity

Another important role in expected utility theory is played by attitudes towards risk and uncertainty. The difference between risk and uncertainty is that risk is related to decisions made with known probabilities of events, while uncertainty relates to decisions with unknown probabilities. Riskier acts are acts with higher dispersion. There is also a similar characterization of uncertainty, though we will not go into the details here; for that, the reader is referred to Chateauneuf (1991). We say that a decision maker is risk averse if he prefers less risky acts and, on the other hand, risk seeking if he prefers riskier acts. It turns out that a preference relation $\succsim$ shows aversion towards risk if its representing capacity is convex or supermodular, that is, if the following is true for each two events $A, B \in A$:

v(A ∩ B) + v(A ∪ B) ≥ v(A) + v(B) (7)

Another characterization of a risk averse preference relation is that it satisfies the following rule: for each two acts $f, g \in F$ with $f \succsim g$ and $\alpha \in [0,1]$, $\alpha f + (1-\alpha) g \succsim g$ holds. Conversely, a preference relation shows risk seeking behavior if the representing capacity is concave (submodular), or if for $g \succsim f$ and $\alpha \in [0,1]$, $g \succsim \alpha f + (1-\alpha) g$ holds. A capacity is submodular if it satisfies inequality (7) with the relation $\ge$ replaced by $\le$. Similarly, in the case of unknown true probabilities of events, a convex capacity represents uncertainty aversion, while uncertainty inclination is represented through a concave capacity.
A special family of non-additive probabilities are the distorted probabilities. Let $\mu\colon A \to [0,1]$ be a probability measure on a σ-algebra $A$ and $f\colon [0,1] \to [0,1]$ an increasing function with $f(0) = 0$ and $f(1) = 1$. The composite $f \circ \mu$ is obviously a capacity, and it turns out that its convexity or concavity depends on the convexity or concavity of the function $f$: it is convex (concave) if and only if $f$ is. A straightforward proof can be found in Denneberg (1997).
Now let $f$ be a twice differentiable function. Then it can be written as a sum of a convex and a concave real function, say $f = f' + f''$ where $f'$ is convex and $f''$ concave. However, they might not both be increasing. But if the second derivative of $f$ is bounded, then there exists a linear function $g$ such that $h' := f' + g$ and $h'' := g - f''$ are both convex increasing functions and $h' - h'' = f$. We then have $v = (h' - h'') \circ \mu = h' \circ \mu - h'' \circ \mu = v' - v''$, where $v'$ and $v''$ are both convex capacities.
A natural question arises whether a capacity that is not necessarily a distorted probability can be written as a difference of two convex capacities. A complete answer is not yet known to me; however, some sufficient conditions are known and will be presented in the following section.
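For a capacity given on the full power set of a finite S, condition (7) can be checked by brute force. The R sketch below does this for the capacity of the example in Section 3.1, which turns out to be neither convex nor concave; the helper functions are ad hoc constructions for this illustration only.

## Check the convexity (supermodularity) condition (7) for a capacity on the
## power set of a finite S; subsets are represented as plain vectors.
subsets <- function(S) {
  inc <- expand.grid(rep(list(c(FALSE, TRUE)), length(S)))
  lapply(seq_len(nrow(inc)), function(i) S[unlist(inc[i, ])])
}

is_convex <- function(v, S) {
  all(sapply(subsets(S), function(A) sapply(subsets(S), function(B)
    v(union(A, B)) + v(intersect(A, B)) >= v(A) + v(B) - 1e-12)))
}

## Capacity of the Choquet example (neither convex nor concave):
v <- function(A) c(0, 1/5, 4/5, 1)[length(unique(A)) + 1]
is_convex(v, 1:3)                       # FALSE
is_convex(function(A) -v(A), 1:3)       # FALSE: v is not concave either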

5 A decomposition of capacities

This section provides some sufficient conditions for a capacity to be equal to a difference of two convex capacities. If the positivity assumption for a capacity is omitted, a difference of two convex capacities is at the same time a sum of a convex and a concave capacity.

First we will look for the smallest convex capacity $v^+$ above $v$. To do this, we need to define a property of capacities, similar to convexity, called superadditivity. A capacity $v$ is superadditive if and only if, for any sets $A, B$ with $A \cap B = \emptyset$, $v(A \cup B) \ge v(A) + v(B)$ holds. Obviously, every convex capacity is superadditive, while the converse does not hold. Subadditivity is defined analogously.

Proposition 1. Let $v\colon A \to \mathbb{R}$ be a capacity such that
$$\sup \sum_i v(A_i) < \infty, \qquad (8)$$
where the supremum is taken over all finite families of disjoint subsets $A_i \subset S$. Then the capacity
$$v^+(A) := \sup \sum_i v(A_i), \qquad (9)$$
where the supremum is again taken over all finite families of disjoint subsets $A_i \subset A$, is superadditive and $v^+ \ge v$ holds. Moreover, if $u \ge v$ is another convex capacity then $v^+ \le u$ holds.

Proof. Superadditivity of v+ is obvious. Let u be another superadditive capacity and u ≥ v. We have

$$u(A) = \sup_{A_i \subset A} \sum_i u(A_i) \ge \sup_{A_i \subset A} \sum_i v(A_i) = v^+(A).$$

The condition (8) on capacity v is obviously also a necessary condition for existence of a superadditive capacity greater than it.

Lemma 1. If v : A → R is a subadditive capacity then v+ is additive. Proof. Let A and B be disjoint sets. We have

$$v^+(A \cup B) = \sup_{\substack{A_i \subset A \\ B_i \subset B}} \sum_i v(A_i \cup B_i) \le \sup_{\substack{A_i \subset A \\ B_i \subset B}} \sum_i \left( v(A_i) + v(B_i) \right) \le \sup_{A_i \subset A} \sum_i v(A_i) + \sup_{B_i \subset B} \sum_i v(B_i) = v^+(A) + v^+(B),$$
which combined with Proposition 1 implies the additivity of $v^+$.
If superadditivity is replaced with subadditivity we get the following proposition.

Proposition 2. Let v : A → R be a capacity. Define capacity

$$v^-(A) := \inf_{\cup_i A_i \supseteq A} \sum_i v(A_i),$$

where the infimum is taken over all finite families of sets $A_i$ such that $A \subseteq \cup_i A_i$. Then $v^-$ has the properties

(i) v− is submodular,

(ii) v− ≤ v,

(iii) if v is superadditive, then v− is additive.

In what follows we will also need the following, somewhat technical lemma.

Lemma 2.

(i) Let $x_1 \le x_2 \le x_3 \le x_4$ be positive real numbers such that $x_4 - x_3 = x_2 - x_1 =: y$, and let $a > 1$ be a real number. Then
$$x_4^a + x_1^a - x_3^a - x_2^a \ge a y \left( x_3^{a-1} - x_2^{a-1} + \tfrac{1}{2}(a-1) x_3^{a-2} y \right).$$

(ii) Let $x > y$ be nonnegative real numbers and $a > 1$. Then
$$x^a - y^a \ge (x - y) x^{a-1}.$$

Proof. (i) Using Taylor's formula we get the equalities
$$x_4^a - x_3^a = a(x_4 - x_3) x_3^{a-1} + \tfrac{1}{2} a(a-1)(x_4 - x_3)^2 \xi_1^{a-2}, \qquad (10)$$
$$x_1^a - x_2^a = a(x_1 - x_2) x_2^{a-1} + \tfrac{1}{2} a(a-1)(x_1 - x_2)^2 \xi_2^{a-2}, \qquad (11)$$
where $\xi_1 \in [x_3, x_4]$ and $\xi_2 \in [x_1, x_2]$. Obviously then $\xi_1^{a-2} \ge x_3^{a-2}$ or $\xi_2^{a-2} \ge x_3^{a-2}$. Summing (10) and (11) we get
$$x_4^a + x_1^a - x_2^a - x_3^a = a y \left( x_3^{a-1} - x_2^{a-1} + \tfrac{1}{2}(a-1) y (\xi_1^{a-2} + \xi_2^{a-2}) \right) \ge a y \left( x_3^{a-1} - x_2^{a-1} + \tfrac{1}{2}(a-1) x_3^{a-2} y \right).$$

(ii) Convexity of the power function $x^a$ implies
$$\frac{x^a - y^a}{x - y} \ge \frac{x^a - 0}{x - 0} = x^{a-1}.$$

Multiplying the last inequality by x − y the desired result is obtained.

To prove the main theorem, we will also need the following lemma.

Lemma 3. Let $u\colon A \to \mathbb{R}$ be an additive capacity, $A, B \in A$, $u(A) \ge u(B)$ and $a > 1$. Then
$$u(A \cup B)^a + u(A \cap B)^a - u(A)^a - u(B)^a \ge a\, u(B - A)\, u(A - B)\, u(A)^{a-2} \left( 1 \wedge \frac{a-1}{2} \right).$$
($\wedge$ in the last inequality denotes the minimum of the left and right side.)

Proof. Using Lemma 2 and u(A) − u(B) = u(A − B) − u(B − A) we get

$$\begin{aligned}
u(A \cup B)^a + u(A \cap B)^a - u(A)^a - u(B)^a
&= (u(A) + u(B-A))^a + (u(B) - u(B-A))^a - u(A)^a - u(B)^a \\
&\ge a\, u(B-A) \left( u(A)^{a-1} - u(B)^{a-1} + \tfrac{1}{2}(a-1) u(A)^{a-2} u(B-A) \right) \\
&\ge a\, u(B-A) \left( (u(A-B) - u(B-A)) u(A)^{a-2} + \tfrac{1}{2}(a-1) u(B-A) u(A)^{a-2} \right) \\
&\ge a\, u(B-A)\, u(A-B)\, u(A)^{a-2} \left( 1 \wedge \frac{a-1}{2} \right).
\end{aligned}$$

The main theorem of this section follows.

Theorem 3. Let $v\colon A \to \mathbb{R}$ be a capacity with the following properties.

(i) $\sup \sum_i v(A_i) < \infty$, where the supremum is taken over finite families of disjoint sets $A_i \subset S$.

(ii) There is a constant C, such that Cv−(A) ≥ v(A) for all A ⊂ S.

(iii) There exist constants $K$ and $a > 1$ such that for all subsets $A, B$ of $S$, $v(A \cup B) + v(A \cap B) - v(A) - v(B) \le K\, v(B-A)\, v(A-B)\, v(B)^a$.

Then there exists a convex capacity µ+, such that µ+ ≥ v and µ+ − v is also a convex capacity.

The last theorem implies that a capacity with the required properties can be expressed as a difference of two convex capacities, or as a sum of a convex and a concave capacity, although not necessarily both positive.

Proof. First define the function $f\colon \mathbb{R}_+ \to \mathbb{R}_+$ by
$$f(x) = \frac{K C^a x^a}{a \left( 1 \wedge (a-1)/2 \right)} + x,$$

which is convex and greater than $\mathrm{id}_{\mathbb{R}_+}$. Further denote $u := (v^+)^-$, which is additive and satisfies $u \ge v^-$. The desired capacity is now $\mu^+ := f \circ u$, which is obviously supermodular and greater than $v$. It remains to prove that $\mu^+ - v$ is also supermodular. Take arbitrary sets $A$ and $B$, labelled so that $u(A) \ge u(B)$. Since $u$ is additive,

$$\mu^+(A \cup B) + \mu^+(A \cap B) - \mu^+(A) - \mu^+(B) = \frac{K C^a}{a(1 \wedge (a-1)/2)} \left[ u(A \cup B)^a + u(A \cap B)^a - u(A)^a - u(B)^a \right]$$

holds. We thus have

$$\begin{aligned}
\mu^+(A \cup B) + \mu^+(A \cap B) &- \mu^+(A) - \mu^+(B)
= \frac{K C^a}{a(1 \wedge (a-1)/2)} \left[ u(A \cup B)^a + u(A \cap B)^a - u(A)^a - u(B)^a \right] \\
&\ge \frac{K C^a}{a(1 \wedge (a-1)/2)} \, a\, u(A-B)\, u(B-A)\, u(A)^{a-2} (1 \wedge (a-1)/2) \\
&\ge \frac{K C^a}{a(1 \wedge (a-1)/2)} \, a (1 \wedge (a-1)/2)\, \frac{v(A-B)}{C} \frac{v(B-A)}{C} \left( \frac{v(A)}{C} \right)^{a-2} \\
&= K\, v(A-B)\, v(B-A)\, v(A)^{a-2} \ge v(A \cup B) + v(A \cap B) - v(A) - v(B),
\end{aligned}$$
which implies $(\mu^+ - v)(A \cup B) + (\mu^+ - v)(A \cap B) - (\mu^+ - v)(A) - (\mu^+ - v)(B) \ge 0$, and this is the desired result.

Corollary 1. Let v : A → R be a capacity. If there exists an additive capacity µ: A → R such that

(i) µ(S) < ∞

(ii) There is a constant c0 so that c0µ(A) ≥ v(A) for all A ∈ A

(iii) There exist constants K and a > 1 such that for all subsets A, B of S, v(A ∪ B) + v(A ∩ B) − v(A) − v(B) ≤ Kµ(B − A)µ(A − B)µ(B)a

Then there exists a supermodular capacity $\mu^+$ such that $\mu^+ \ge v$ and $\mu^+ - v$ is also a supermodular capacity.

Proof. Define $u = v + \mu$, which obviously satisfies requirement (i) of Theorem 3. We will show that it also satisfies (ii) of Theorem 3: obviously $u^- \ge \mu^-$, and therefore $(c_0 + 1) u^- \ge u$. Since $u(A \cup B) + u(A \cap B) - u(A) - u(B) = v(A \cup B) + v(A \cap B) - v(A) - v(B)$, requirement (iii) of the theorem is also immediately verified. By Theorem 3 we now have supermodular capacities $\nu_1$ and $\nu_2$ such that $u = \nu_1 - \nu_2$. Further we have $\nu_1 \ge u \ge \mu$, therefore $\nu_1 - \mu \ge 0$, and since $\nu_1 - \mu - \nu_2 = v$ is increasing and $\nu_2$ is increasing, it follows that $\nu_1 - \mu$ is also increasing and therefore satisfies the requirements for $\mu^+$.

6 Products of capacities

Another important concept in measure and probability theory is the multiplication of (probability) measures. In the additive theory, products of measures are defined as follows. Let $\mu$ and $\nu$ be additive probabilities on some σ-algebras $\mathcal{S}$ and $\mathcal{T}$ on probability spaces $S$ and $T$. The product measure on the product space $S \times T$ is uniquely defined, when it exists, by requiring

$$(\mu \times \nu)(A \times B) = \mu(A) \cdot \nu(B), \quad \text{for all } A \in \mathcal{S} \text{ and } B \in \mathcal{T}. \qquad (12)$$

To define the product of more general capacities we therefore require that a product capacity should satisfy at least the above property. Formally, we will say that a product of capacities $u\colon A \to \mathbb{R}$ and $v\colon B \to \mathbb{R}$ is any capacity $w\colon \mathcal{S} \to \mathbb{R}$ which satisfies

$$w(A \times B) = u(A)\, v(B) \qquad (13)$$
for all $A \in A$ and $B \in B$. If $A$ and $B$ are families of subsets of the sets $S$ and $T$ respectively, then $\mathcal{S}$ is a family of subsets of $S \times T$ which contains all sets $A \times B$ where $A \in A$ and $B \in B$. Unfortunately, the set of all capacities that fit this requirement is very large and therefore not of much use. The smallest and the largest such capacity can easily be found. The smallest product is

$$w_{\min}(C) = \inf_{A \times B \subseteq C} u(A) \cdot v(B),$$
and similarly we can find the largest one. However, such bounds are not very satisfactory, since most of the structure of the original capacities is not preserved. To obtain better results, some additional restrictions on product capacities must be imposed. Several approaches have been proposed in recent years (see Hendon et al. (1996); Denneberg (2000)); however, most of them require discrete capacities. One of the possible ways to define a product for a family of non-discrete capacities is described here.
Let $v\colon A \to \mathbb{R}$ and $\mu\colon A \to \mathbb{R}$ be a capacity and an additive measure on a σ-algebra $A$. We will say that the capacity $v$ is increasing with respect to $\mu$ if for all $t \in \mathbb{R}$ and $A \subset B$ the following holds: if $v(A) > t\,\mu(A)$ then $v(B) > t\,\mu(B)$.
Let $\mu$ be a measure on a σ-algebra $A$ and $A'$ a subset of $A$ such that, for all $A \subset B$, $A \in A'$ implies $B \in A'$. Now define the measure $\mu'$ on $A$ by
$$\mu'(A) = \begin{cases} \mu(A) & \text{if } A \in A' \\ 0 & \text{otherwise.} \end{cases}$$

Obviously $\mu'$ is increasing with respect to $\mu$. We will call such a measure a cut measure. There is a natural way to define the product of cut measures, based on the product of the corresponding additive measures. Let $\mu'$ and $\nu'$ be cut measures on the algebras $A$ and $B$. We will search for a product of $\mu'$ and $\nu'$ among all cut measures $\lambda'$ that satisfy (13). Since the product is not uniquely determined, even by this restriction, we give a lower and an upper bound, $\lambda_0$ and $\lambda^0$ respectively, by
$$\lambda_0(C) = \begin{cases} (\mu \times \nu)(C) & \text{if there exist } A \in A' \text{ and } B \in B' \text{ such that } A \times B \subseteq C \\ 0 & \text{otherwise} \end{cases}$$

$$\lambda^0(C) = \begin{cases} 0 & \text{if there exist } A \in A'^{\,c} \text{ and } B \in B'^{\,c} \text{ such that } A^c \times B^c \subseteq C^c \\ (\mu \times \nu)(C) & \text{otherwise,} \end{cases}$$

Let $v$ be some capacity, increasing with respect to an additive measure $\mu$. We can assume that $v(S) = \mu(S)$. For each positive integer $n$ and $i \le n$ define

$$A_i^n := \left\{ A \in A \colon v(A) \ge \frac{i}{n} \mu(A) \right\}$$
and
$$\mu_i^n(A) := \begin{cases} \frac{1}{n} \mu(A) & \text{if } A \in A_i^n \\ 0 & \text{otherwise.} \end{cases}$$

Such $\mu_i^n$ are all cut measures, and the following holds:
$$\left| \sum_{i=1}^{n} \mu_i^n(A) - v(A) \right| \le \frac{1}{n} \mu(A).$$

If we define
$$\mu_n(A) = \sum_{i=1}^{n} \mu_i^n(A) \qquad (14)$$
for all integers $n$, the sequence $\mu_n$ converges to $v$ on $A$.
A possible way to define a product of increasing capacities is through sums of cut measures. Let $u$ and $v$ be capacities, increasing with respect to some additive measures $\mu$ and $\nu$. Now take $\mu_m$ and $\nu_n$ as in formula (14) and define their product by

$$\mu_m \times \nu_n = \sum_{i=1}^{m} \sum_{j=1}^{n} u_i^m \times v_j^n.$$

The idea is now to define the product $u \times v$ as the limit of $\mu_m \times \nu_n$ as $m$ and $n$ grow beyond all bounds. However, this construction still needs to be carefully examined.
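For a finite S the approximation (14) can be made concrete. The R sketch below uses the counting measure and an illustrative capacity that is increasing with respect to it (both chosen only for this example, not taken from the paper) and checks that the error of the approximation shrinks like mu(A)/n.

## Sketch of the cut-measure approximation (14) on a finite S = {1, 2, 3}:
## mu is the counting measure and vcap(A) = mu(A)^2 / mu(S) is a capacity
## increasing with respect to mu, with vcap(S) = mu(S).
mu   <- function(A) length(unique(A))
vcap <- function(A) mu(A)^2 / 3

## mu_n(A) = sum_i mu_i^n(A), where mu_i^n(A) = mu(A)/n iff vcap(A) >= (i/n) mu(A)
mu_n <- function(A, n) {
  if (mu(A) == 0) return(0)
  k <- sum(vcap(A) >= (1:n) / n * mu(A))   # number of cut measures keeping A
  k / n * mu(A)
}

## The error vcap(A) - mu_n(A) is nonnegative and bounded by mu(A)/n:
A <- c(1, 2)
sapply(c(5, 50, 500), function(n) vcap(A) - mu_n(A, n))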

7 Concluding remarks

Being a relatively new field of mathematics, non-additive measure theory leaves several problems open to further research. Some of them, including products of non-additive measures, are necessary for the further development of the theory. Conditional mathematical expectation and independence of random variables, for instance, crucially depend on a proper definition of the product of non-additive probabilities. The approach opened in this paper might give some useful results in the future and is therefore being studied intensively now.
Another topic opened in this paper is the decomposition of arbitrary capacities, which are often too complex for some purposes, into simpler components with nicer properties. The results presented here might be further generalized and still need to be applied.

References

A. Chateauneuf. On the use of capacities in modeling uncertainty aversion and risk aver- sion. Journal of Mathematical Economics, 20:343–369, 1991.

D. Denneberg. Non-additive measure and integral. Kluwer Academic Publishers, Dor- drecht, 1997.

D. Denneberg. Totally monotone core and products of monotone measures. International Journal of Approximate Reasoning, 24:273–281, 2000.

P.C. Fishburn. Utility theory for decision making. Wiley, New York, 1970.

I. Gilboa. Expected utility with purely subjective non-additive probabilities. Journal of Mathematical Economics, 16:65–88, 1987.

E. Hendon, H. J. Jacobsen, B. Sloth, and T. Tranæs. The product of capacities and belief functions. Mathematical Social Sciences, 32:95–108, 1996.

D. Schmeidler. Integral representation without additivity. Proceedings of the American Mathematical Society, 97:255–261, 1986.

D. Schmeidler. Subjective probability and expected utility without additivity. Economet- rica, 57:571–587, 1989.

Q. Zhang. Some properties of the variations of non-additive set functions I. Fuzzy Sets and Systems, 118:529–538, 2001.

Analysis and visualization of 2-mode networks

Matjaž Zaveršnik, Vladimir Batagelj and Andrej Mrvar
University of Ljubljana

Abstract: 2-mode data consist of a data matrix A over two sets U (rows) and V (columns). One or both sets can be (very) large. Such data can also be viewed as a (directed) bipartite network. We can analyze a 2-mode network directly or transform it into a 1-mode network over U or V. In the paper we present different approaches to the analysis and visualization of 2-mode networks and illustrate them on real-life examples. These approaches are supported in the program Pajek. Keywords: 2-mode networks, hubs and authorities, normalizations.

1 2-mode networks

A 2-mode network is a structure (U, V, R, w), where U and V are disjoint sets of vertices (nodes, units), R ⊆ U × V is a relation and w: R → ℝ is a weight. If no weight is defined we can assume a constant weight w(u, v) = 1 for all (u, v) ∈ R. A 2-mode network can also be viewed as an ordinary (1-mode) network on the vertex set U + V, divided into two sets U and V, where the arcs can only go from U to V (a bipartite directed graph).
Some examples of 2-mode networks are: (authors, papers, cites the paper), (authors, papers, is cited in the paper), (authors, papers, is the (co)author of the paper), (people, events, was present at), (people, institutions, is member of), (customers, products/services, consumption), (articles, shopping lists, is on the list), (delegates, proposals, voting results).
Most of the techniques for analyzing ordinary (1-mode) networks cannot be directly applied to 2-mode networks without modifications or a change of meaning (see Borgatti and Everett (1997)). We have two possibilities for analyzing a 2-mode network:

• using adapted or special techniques, which are usually (but not always) very complicated,

• changing the 2-mode network into two 1-mode networks (U, RRᵀ) and (V, RᵀR), which can be analyzed separately with standard techniques.

2 Hubs and authorities

As an example of an adapted technique we present "hubs and authorities". In large networks we often want to determine which vertices are the most important. There are several ways to define importance. Usually we compute some kind of weight on the vertices and say that a vertex with a higher weight is more important.

In directed networks we usually associate two weights with each vertex v: the authority weight xv and the hub weight yv. If a vertex points to many vertices with large authority weight, then it should receive a large hub weight. If a vertex is pointed to by many vertices with large hub weight, then it should receive a large authority weight. A vertex is a good hub if it points to many good authorities, and it is a good authority if it is pointed to by many good hubs. The algorithm for determining authority and hub weights consists of a loop in which we iteratively compute hub/authority weights as normalized sums of the authority/hub weights of the neighbors:

    x = normalize(random starting vector)
    repeat
        y = normalize(A x)
        x = normalize(A^T y)
    until the largest change in components of x and y is < epsilon

where A is the network matrix:
$$A_{uv} = \begin{cases} w(u,v) & (u,v) \in R \\ 0 & \text{otherwise.} \end{cases}$$
Let (U, V, R, w) be a 2-mode network. Because each arc can only go from U to V, vertices in U cannot be good authorities and vertices in V cannot be good hubs. So we associate only one weight (importance) with each vertex. Let xu be the importance of vertex u ∈ U, yv the importance of vertex v ∈ V and A the adjacency matrix, where rows and columns correspond to vertices from U and V respectively (auv = 1 if uRv and 0 otherwise). The algorithm for determining the importance of each vertex remains unchanged; the only difference is in the interpretation of the matrix A and the vectors x and y. See section 5 for the results of the described algorithm on the network of bibliographic data.
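A direct transcription of this iteration into R might look as follows; the convergence threshold, the starting vector and the toy author-by-paper matrix are assumptions of the sketch, not part of the Pajek implementation.

## Hub and authority weights for a 2-mode network, following the iteration
## above: A is the |U| x |V| matrix with A[u, v] = w(u, v) for (u, v) in R.
## Returns the importance of the vertices in U and in V.
hubs_authorities <- function(A, eps = 1e-10, max_iter = 1000) {
  normalize <- function(z) z / sqrt(sum(z^2))
  x <- normalize(runif(ncol(A)))           # importance of vertices in V
  y <- rep(0, nrow(A))
  for (iter in seq_len(max_iter)) {
    y1 <- normalize(as.vector(A %*% x))          # importance of vertices in U
    x1 <- normalize(as.vector(t(A) %*% y1))
    done <- max(abs(x1 - x), abs(y1 - y)) < eps
    x <- x1; y <- y1
    if (done) break
  }
  list(U = y, V = x)
}

## Toy author-by-paper incidence matrix (authors in rows, papers in columns)
A <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 0, 1), nrow = 3, byrow = TRUE,
            dimnames = list(c("a1", "a2", "a3"), c("p1", "p2", "p3")))
hubs_authorities(A)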

3 Transforming 2-mode network into 1-mode networks

Another way to analyze a 2-mode network is to transform it into two ordinary (1-mode) networks (U, RR^T, w_1) and (V, R^T R, w_2), which can be analyzed separately using standard techniques. The problem is that if the maximal degree ∆ on the set U (or V) is large, the obtained ordinary network contains large cliques; if the vertex set is small, this is not a problem (see Figure 1). From 2-mode networks over large sets U and/or V we can obtain smaller ordinary networks by first partitioning U and/or V and shrinking the clusters. There are also different ways to define the matrix multiplication – we can select different semirings. Instead of the usual operations (plus, times) we can also use (or, and) for w = 1, (max, min), (min, +), ... In this way we obtain different ordinary networks.
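
A minimal sketch of these transformations (Python/NumPy; the incidence matrix and the hand-written (max, min) routine are invented for illustration):

    import numpy as np

    # Binary incidence matrix of a small, made-up 2-mode network:
    # rows = vertices of U, columns = vertices of V, R[u, v] = 1 iff (u, v) ∈ R.
    R = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 1]])

    # (+, *) semiring: entry (i, j) counts the common neighbours of i and j.
    W_U = R @ R.T          # 1-mode network on U
    W_V = R.T @ R          # 1-mode network on V

    # (or, and) semiring for w = 1: is there at least one common neighbour?
    B_U = (R @ R.T) > 0

    # (max, min) semiring, written out explicitly.
    def max_min_product(P, Q):
        out = np.zeros((P.shape[0], Q.shape[1]))
        for i in range(P.shape[0]):
            for j in range(Q.shape[1]):
                out[i, j] = np.max(np.minimum(P[i, :], Q[:, j]))
        return out

    W_U_maxmin = max_min_product(R, R.T)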

4 Example: Davis Southern Club Women

Davis Southern Club Women is a 'classical' example of a (small) 2-mode network. The network has 32 vertices (18 women and 14 events) and 93 arcs represented by a binary 18 × 14 matrix (see Figure 2). Arcs represent the observed attendance of 18 Southern women at 14 social events; there is an arc from woman u to event v iff u attended the social event v. These data were collected by Davis et al. in the 1930s (see Breiger (1974) and Davis et al. (1941)). Because the network is small, we can visualize it as a bipartite graph (see Figure 3). We placed the events on the left and the women on the right side of the picture. We also tried to identify groups of women who attended similar events and groups of events attended by the same women. Figure 4 presents the same network. To obtain this picture we computed both ordinary networks (the network of women and the network of events) and drew each of them using the first two eigenvectors. The obtained coordinates were used to draw the original 2-mode network. Finally, we manually displaced some vertices slightly in order to obtain a more readable picture.
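
A rough sketch of such an eigenvector-based layout (NumPy; this only approximates the drawing procedure, whose exact matrices and post-processing are not spelled out here):

    import numpy as np

    def two_eigenvector_layout(W):
        """x, y coordinates from the two leading eigenvectors of a
        symmetric 1-mode matrix W."""
        vals, vecs = np.linalg.eigh(W)     # eigenvalues in ascending order
        return vecs[:, -1], vecs[:, -2]    # leading two eigenvectors as coordinates

    # With A the 18 x 14 women-by-events incidence matrix (not reproduced here):
    #   xw, yw = two_eigenvector_layout(A @ A.T)    # layout of the women
    #   xe, ye = two_eigenvector_layout(A.T @ A)    # layout of the events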

5 Example: References

In this example we have a 2-mode network on 674 vertices (314 authors and 360 papers) with 613 arcs. It was produced from the bibliography of the book by Imrich and Klavžar (2000). The result is an author-by-paper 2-mode network, where author u is linked to paper v iff u is a (co)author of the paper v. A network of this size can still be visualized in the usual way, but because of the larger number of vertices any additional data (labels) should be omitted to keep the picture readable (see Figure 5). There exist special graphical formats and viewers which enable us to zoom in and out of the picture. One such format is SVG (Scalable Vector Graphics). The latest graphical programs already contain viewers or even editors for SVG (a viewer is also included in Internet Explorer 6.0). Figure 6 presents a zoom-in into the network with labels. It is easy to see that a pair of vertices u, v ∈ U (or V) belong to the same (weakly) connected component of (U, RR^T, w_1) (or (V, R^T R, w_2)) iff they belong to the same (weakly) connected component of (U, V, R, w). The ordinary network on authors is also called a collaboration network, since there is an edge between two authors iff they wrote a joint paper. Figure 7 presents the largest connected component of the collaboration network (86 vertices). We applied the hubs and authorities algorithm to the largest weak component of the original 2-mode network and obtained the following results. The most important authors, ordered by decreasing importance weight, are Klavžar (0.6621643), Imrich (0.4118873), Mulder (0.1989839), Mohar (0.1730697) and Gutman (0.1446318), while the most important papers are imkl-99b (0.4073044), klgu-95 (0.3135055), klmu-98 (0.3018555), klmu-99 (0.2755221) and klgu-97 (0.2581323). For the largest connected component of the collaboration network the results of the hubs and authorities algorithm are slightly different. The most important authors are Klavžar (0.5883468), Imrich (0.4223956), Mohar (0.3885042), Gutman (0.3416242) and Pisanski (0.2379134).

An author's degree in the original 2-mode network is the number of his papers cited in the book, while his degree in the collaboration network is the number of his coauthors. Authors with many papers are Imrich (26), Klavžar (22), Bandelt (12), Mulder (11), Mohar (10), Lovasz (10) and Zhu (10), while authors with many coauthors are Hell (12), Klavžar (11), Imrich (10), Bandelt (10), Lovasz (8) and Pisanski (7). Another measure used to single out the most important vertices is betweenness centrality. A vertex is central if it lies on many shortest paths between other pairs of vertices. Authors with high betweenness centrality are often a kind of middleman between other authors: Imrich (0.4218487), Hell (0.4007937), Klavžar (0.3668067), Pultr (0.3528478), Bandelt (0.3003735) and Lovasz (0.2431373).
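
Outside Pajek, such summaries can be reproduced with, for example, NetworkX; the edge list below is a toy fragment rather than the real collaboration network, and the normalization conventions need not match the values reported above:

    import networkx as nx

    # A few collaboration edges only (invented fragment; the real network
    # has 314 authors).
    G = nx.Graph()
    G.add_edges_from([
        ("Imrich", "Klavzar"), ("Klavzar", "Mohar"),
        ("Imrich", "Pisanski"), ("Mohar", "Pisanski"),
        ("Klavzar", "Gutman"),
    ])

    degrees = dict(G.degree())                              # number of coauthors
    between = nx.betweenness_centrality(G, normalized=True)
    print(sorted(between.items(), key=lambda kv: -kv[1]))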

6 Example: Slovenian journals and magazines

Over 100,000 people in Slovenia were asked which Slovenian magazines and journals they read (survey conducted in 1999 and 2000; source: CATI Center Ljubljana). They listed 124 different magazines and journals. The collected raw data can be represented as a 2-mode network:

              Delo   Sl.novice   ...
    Reader1     x        x       ...
    Reader2     x                ...
    Reader3              x       ...
    ...

From this 2-mode reader/journal network we generated the corresponding ordinary network on journals using the (+, ·) semiring:

• the value of an undirected edge linking two journals counts the number of readers of both journals.

• the loop on a journal counts the number of all readers that read this journal.

The obtained matrix (A):

                 Delo   Dnevnik   Sl.novice   ...
    Delo        20714      3219        4214   ...
    Dnevnik                15992       3642   ...
    Sl.novice                         31997   ...
    ...
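
A sketch of how this matrix arises from the raw data (NumPy; the reader-by-journal matrix B below is invented and far smaller than the survey data):

    import numpy as np

    # B[r, j] = 1 iff reader r reads journal j.
    B = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 1, 1]])

    # (+, *) semiring: A[i, j] = number of readers of both journals i and j,
    # and A[i, i] = total number of readers of journal i.
    A = B.T @ B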

The ordinary network on readers would be huge (more than 100,000 vertices), containing large cliques (the readers of a particular journal). Because of the huge differences in the numbers of readers of different journals it is very hard to draw any conclusions about the relationships between journals from the raw data. First we have to normalize the network – all normalized values lie in the interval [0, 1]. There exist several types of normalizations:

\[
\mathrm{Geo}_{ij} = \frac{a_{ij}}{\sqrt{a_{ii}\,a_{jj}}} \qquad
\mathrm{Input}_{ij} = \frac{a_{ij}}{a_{jj}} \qquad
\mathrm{Output}_{ij} = \frac{a_{ij}}{a_{ii}}
\]

\[
\mathrm{Row}_{ij} = \frac{a_{ij}}{\sum_j a_{ij}} \qquad
\mathrm{Col}_{ij} = \frac{a_{ij}}{\sum_i a_{ij}} \qquad
\mathrm{Min}_{ij} = \frac{a_{ij}}{\min(a_{ii}, a_{jj})} \qquad
\mathrm{Max}_{ij} = \frac{a_{ij}}{\max(a_{ii}, a_{jj})}
\]

\[
\mathrm{MinDir}_{ij} = \begin{cases} \dfrac{a_{ij}}{a_{ii}} & a_{ii} \le a_{jj} \\ 0 & \text{otherwise} \end{cases}
\qquad
\mathrm{MaxDir}_{ij} = \begin{cases} \dfrac{a_{ij}}{a_{jj}} & a_{ii} \le a_{jj} \\ 0 & \text{otherwise} \end{cases}
\]

In Slovenia there are several publishing houses (Delo, Dnevnik, Večer, ...), which publish several magazines and journals. For example, Delo publishes Delo, Nedelo, Delničar, Vikend magazin, Ona, Moj mikro, Jana, Lady, ...; Dnevnik publishes Dnevnik, Nedeljski dnevnik, Hopla, Pilot, ...; while Večer publishes Večer, Naš dom, 7D, TV večer, ... There are also several different types of journals and magazines: daily papers (Delo, Dnevnik, Večer, Slovenske novice), regional papers (Novi tednik, Ptujski tednik, Vestnik Murska Sobota), business (Delničar, Finance, Gospodarski vestnik, Kapital, Manager, Obrtnik, Podjetnik), agricultural (Kmečki glas, Slovenske brazde, Kmetovalec), juvenile magazines (Ciciban, Pil, Firbec, Smrklja, Cool), general interest (7D, Nedeljski dnevnik, Nedelo, Radar), computers (Moj Mikro, Monitor, Win.ini), women's magazines (Jana, Lady, Naša žena, Anja, Viva, Otrok in družina, Moj malček, Mama), religion (Družina, Ognjišče), current affairs (Mag, Mladina), home (Naš dom, Jana Ambient, Moj mali svet), relaxation (Kih, Razvedrilo, Salomonov ugankar, Antena, Hopla, Lady, Stop), mysticism (Misteriji, Aura), supplements (Delo in dom, Pilot, Vikend magazin, Ona).

6.1 Geo normalization

The Geo normalization of an element of the network matrix is obtained by dividing it by the geometric mean of the corresponding diagonal elements. In our case this normalization is a measure of association (correlation) between journals (see Figure 8).

6.2 MinDir normalization

The MinDir normalization does two things:

• it converts the undirected network into a directed one – an arc points from the journal with fewer readers to the journal with more readers;

• the value of an arc corresponds to the percentage of readers of the first journal who also read the second one.

In our case this normalization measures the dependence between journals (see Figure 9). In both figures the size of a vertex corresponds to the number of readers. The colors indicate weakly connected components. The thickness of a line represents the normalized value of the link. Only the lines with values above a chosen threshold are displayed.
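
As a sketch, the Geo and MinDir normalizations of a matrix A with the journal totals on the diagonal could be computed as follows (NumPy; dropping the loops in MinDir is an assumption made here for display purposes):

    import numpy as np

    def geo_normalize(A):
        """Divide each entry by the geometric mean of the two
        corresponding diagonal entries."""
        d = np.sqrt(np.diag(A).astype(float))
        return A / np.outer(d, d)

    def mindir_normalize(A):
        """Keep only arcs from the journal with fewer readers to the one
        with at least as many, valued a_ij / a_ii; loops are dropped."""
        d = np.diag(A).astype(float)
        out = np.zeros(A.shape)
        for i in range(A.shape[0]):
            for j in range(A.shape[1]):
                if i != j and d[i] <= d[j]:
                    out[i, j] = A[i, j] / d[i]
        return out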

Figure 1: (U, V, R), (V, R^T R) and (U, RR^T)

Figure 2: Davis data matrix

Figure 3: Davis – Bipartite network

Figure 4: Davis – First two eigenvectors on U and V, combined, manual editing

Figure 5: References – 2-mode network: papers (darker) and their authors (lighter)

Figure 6: References – Zoom in, with labels

Figure 7: References – Collaboration network

Figure 8: Geo normalization

Figure 9: MinDir normalization

References

V. Batagelj and A. Mrvar. Pajek - program for large network analysis, 1996–2002. URL http://vlado.fmf.uni-lj.si/pub/networks/pajek/.

S. P. Borgatti and M. Everett. Network analysis of 2-mode data. Social Networks, 19:243–269, 1997.

R. Breiger. The duality of persons and groups. Social Forces, 53:181–190, 1974.

A. Davis et al. Deep South. University of Chicago Press, Chicago, 1941.

W. Imrich and S. Klavžar. Product graphs: structure and recognition. John Wiley & Sons, New York, USA, 2000.

J. Scott. Social Network Analysis: A Handbook. Sage Publications, London, second edition, 2000.

S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, 1994.