
Quality Assurance and Quality Control in Multinational and Multicultural Surveys

Gelaye Worku HaileMichael

Master's thesis, 30 higher education credits (Masteruppsats 30 högskolepoäng), spring 2013

Submitted in partial fulfillment of the requirements for the Master's in Official Statistics

Supervisor: Prof. Lars Lyberg
Department of Statistics, Faculty of Social Science, Stockholm University
May 2013

Abstract

Maintaining high quality is difficult in any survey, and performing quality assurance and quality control in multicultural and multinational surveys is even more complicated. This thesis tries to determine a minimum required quality assurance (QA) and quality control (QC) framework for multicultural and multinational (3M) surveys. To achieve this goal we start by explaining what makes multicultural and multinational surveys so special, followed by some examples of such surveys. The survey lifecycle for high quality surveys from the currently available guidelines for best practice in cross-cultural surveys (CCSG) is then introduced. Furthermore, we investigate the efforts made by some selected surveys to achieve survey quality. Before concluding, we state what lessons can be learned from the selected surveys concerning QA and QC by going through what is in place and what is not. Finally, following the guidelines by the Comparative Survey Design and Implementation (CSDI) group and the survey quality experience from the selected surveys, we define a basic minimum QA and QC framework needed for a quality multicultural and/or multinational survey.

Key words: 3M surveys; survey quality; translation; design; pretesting; data collection; quality assurance; quality control.

Acknowledgment

I would like to express my greatest gratitude to the people who have helped and supported me. I am grateful to my advisor Prof. Lars Lyberg for his continuous support, patience, motivation, and immense knowledge. A special thanks goes to my friends, Fredrik Holmér, Hamelmal Mesganaw, Meseret Haile and others, for all their help, motivation and appreciation. I also wish to thank my family for their unlimited support and for inspiring and encouraging me; without them I would not have made it this far. And finally to God who made all things possible.

Contents

Introduction
1. What is so special about multinational and multicultural surveys?
2. Examples of Multinational, Multicultural and Multiregional (3M) surveys
   2.1 Global surveys
   2.2 Regional surveys
3. Survey lifecycle for a high quality survey
4. Quality assurance and quality control in a 3M setting
5. Selected Surveys
   5.1 European Social Survey (ESS)
   5.2 Trends in International Mathematics and Science Study (TIMSS)
   5.3 World Values Survey (WVS)
   5.4 Program for the International Assessment of Adult Competencies (PIAAC)
   5.5 International Adult Literacy Survey (IALS)
   5.6 Survey of Health, Ageing and Retirement in Europe (SHARE)
   5.7 International Social Survey Program (ISSP)
   5.8 European Working Conditions Survey (EWCS)
   5.9 Comparative Study of Electoral System (CSES)
   5.10 Gallup World Poll (GWP)
   5.11 Eurobarometer (EB)
6. The QC and QA system in selected surveys
7. A basic QA and QC framework
8. Conclusion
References
Appendix

Introduction

Multicultural, multinational, and multiregional (3M) surveys started more than 40 years ago and have since grown and become very useful. Despite this development, there is still little literature on how to obtain high-quality comparable data in such surveys. Moreover, the more languages, cultures and nations that are included, the more complicated it becomes to design the survey instruments, to actually implement the survey, and to control the survey quality (Pennell et al., 2010). The main goal of this thesis is to contribute to a basic quality assurance (QA) and quality control (QC) framework for multinational and multicultural (cross-national) surveys. Throughout this thesis the term "cross-national survey" follows the definition by Lynn et al. (2006), which is "all types of surveys where efforts are made to achieve comparability across countries".

The steps that lead to the goal of this thesis are to first state what makes cross-national surveys so special, referring to the currently available guidelines (the CSDI guidelines) and to subject matter expertise, for example Kish (1994), Lynn, Japec, and Lyberg (2006), Harkness et al. (2010), Smith (2010), and Tortora et al. (2010). We then follow up with some examples of global and regional 3M surveys. After that the survey lifecycle for 3M surveys is introduced. Important references studied include Smith (2010), Pennell et al. (2010), Harkness et al. (2010), Lyberg and Biemer (2003), Lyberg et al. (2006), Juran and Gryna (1980), Lyberg and Stukel (2010), Lyberg and Biemer (2008), Kreuter et al. (2010), Couper and Lyberg (2005), Couper (1998), and Vehovar (2007). After describing QA and QC in 3M settings using available 3M books, papers, and quality standards, we study the efforts made by a few selected surveys to achieve comparative survey quality, and see what is in place and what is not, and what can be complemented or criticized based on the survey lifecycle. Finally we conclude by proposing a basic QA and QC framework for multinational and multicultural surveys based on the Cross-Cultural Survey Guidelines, 3M books and papers, experiences from relatively successful cross-national surveys, and various quality standards.


In Section 1 we examine what makes multinational and multicultural surveys special. In Section 2 the types of 3M surveys are described with examples. Section 3 covers the survey lifecycle for a high quality survey. In Section 4 quality assurance and quality control in 3M surveys are discussed. Section 5 presents the selected surveys, and in Section 6 we discuss the QA and QC systems in those surveys. Finally, Section 7 proposes the basic QA and QC needs, followed by the conclusion in Section 8.


1. What is so special about multinational and multicultural surveys?

A survey is called multinational/multicultural when it concerns comparisons between different nationalities, cultures, or populations, within a country or across countries. What makes it special is the presence of different nationalities, languages, and cultures in a multipopulation setting. A multipopulation survey, as Kish (1994) described it, can include periodic surveys, multidomain designs, multinational survey designs, cumulated and combined samples, and controlled observations. As Kish explains, the classical single-population design needs to be extended to include multiple populations, which is very difficult to design and control. A lot of effort is also needed to achieve comparability across or within countries. Smith (2010) lists some specifics for 3M surveys: the design, response rate, question development, translation, data collection, data processing and cleaning, and file documentation and distribution all get more complicated. Let us discuss a few of the aspects that make it hard to obtain comparable, high-quality survey data in 3M surveys:

- Study, organizational and operational structure: In cross-national surveys, different survey organizations with diverse organizational structures and experience will be involved during all or different phases of the survey lifecycle in the different participating countries. This could create comparability problems. When it comes to operational structure, trying to standardize across countries while localizing items requires special care compared with regular surveys. For example, differences in legal aspects, financial resources, and survey methodology might be at hand. These differences could become big obstacles to achieving comparability (Pennell et al., 2010).

- Questionnaire design: Keeping in mind all the issues in general questionnaire design, a multinational and multicultural questionnaire has to be designed in such a way that the questions can provide comparable data across the populations in the sample. The design becomes more complex since it has to take into account populations with different cultures, worldviews, understandings, literacy levels, experiences and languages. For example, a study conducted in Ethiopia by the World Health Organization in the mid-1980s used questions with a Western concept to measure depression. For the question "Is your appetite poor?" the Amharic interpretation of appetite was availability of food. Obviously the Amharic interpretation of the word is very different from what is usually meant. Therefore, when a questionnaire is designed, the questions need to have uniform meaning and understanding across countries and also across modes, since different countries may have different preferences when it comes to the data collection mode, as is the case for the European Social Survey (ESS) (Harkness et al., 2010).

- Translation and adaptation: When the populations under study have different languages and cultures, translation and adaptation become a serious challenge. (Adaptation is the process of modifying questions or a questionnaire to create a new one according to the culture and language of the target population.) It is not just a matter of translating the source questionnaire (the questionnaire that is going to be translated into different languages, or a blueprint of the questionnaire) into the target language. There also has to be a way to integrate cultural aspects, in order to achieve equivalence of meaning between the questions in the source questionnaire and their translated versions (Harkness, Villar, and Edwards, 2010). Some examples of how complicated translation and adaptation can be when different cultures and languages are involved: in the Gallup World Poll (GWP) survey in Senegal the question "How many children are there in the household?" could not be asked. Instead it had to be translated according to the Senegalese culture to "How many little bits of God's wood are in the household?". In parts of Asia the GWP had difficulties translating "stress at work", since the phrase is commonly used to describe all kinds of conditions ranging from hectic to hard work (Tortora, Srinivasan, and Esipova, 2010).

- Sampling design: Arriving at an effective sample size per nation, or distributing the sample within a nation, is not an easy task in cross-country comparative surveys. Having a uniform sampling design across all countries is almost impossible and not even desirable: the best possible sampling design for one country could be the worst design for another. This leads to differences in design effects depending on the different sampling methods used.


The design effect is a measure of precision and is most commonly calculated using the following formula (http://www.ccsg.isr.umich.edu/sampling.cfm; Lohr, 1999, p. 239):

$$\mathrm{deff}(\text{plan, statistic}) = \frac{\operatorname{Var}_{\text{plan}}(\text{statistic})}{\operatorname{Var}_{\text{SRS}}(\text{statistic})}$$

In multinational surveys a practical sampling design often used is a stratified multistage design that includes cluster sampling to select elements at the initial stage or at some other point in the multistage design. Also, when there is a need to increase sample sizes or when there is no available frame of individuals, disproportionate sampling of population elements can be used. This creates a need to employ weighting to estimate descriptive statistics. When complex sampling methods are used, precision is gained or lost: for instance, stratification tends to increase precision, whereas cluster sampling tends to decrease it. Thus, relative to simple random sampling, the combined effect of stratification, clustering, and weighting on precision is measured by the design effect as follows (Heeringa and O'Muircheartaigh, 2010):

$$\mathrm{deff}(\hat{\theta}) = \frac{\operatorname{Var}_{\text{complex}}(\hat{\theta})}{\operatorname{Var}_{\text{SRS}}(\hat{\theta})}$$

where $\mathrm{deff}(\hat{\theta})$ is the design effect for the sample estimate $\hat{\theta}$; $\operatorname{Var}_{\text{complex}}(\hat{\theta})$ is the complex sample design variance of $\hat{\theta}$; and $\operatorname{Var}_{\text{SRS}}(\hat{\theta})$ is the simple random sample variance of $\hat{\theta}$.


Figure 1: Complex sampling design effects on standard errors and prevalence estimation.

Source: Heeringa and O'Muircheartaigh (2010), p. 262.

As shown in the figure above, the curve represents the estimated standard error of a proportion under SRS as a function of the sample size. The figure also shows that for the same sample size, stratification gives a smaller standard error than SRS. On the other hand, for equal sample sizes, clustering sample elements and using weights for unbiased estimation give a larger standard error than SRS. The design effect is not only a measure of precision; it can also be used to calculate the effective sample size, which is the number of SRS cases required to achieve the same precision as the actual complex sample design:

$$n_{\text{eff}} = \frac{n}{\mathrm{deff}(\hat{\theta})}$$

where $n_{\text{eff}}$ is the effective sample size, $n$ is the actual sample size selected using the complex sample design, and $\mathrm{deff}(\hat{\theta})$ is the design effect for the sample estimate $\hat{\theta}$. In general, what is shown here is that it is also important to determine the required effective sample size in order to increase comparability across countries.
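To make the two formulas concrete, the following minimal sketch (not part of any of the surveys discussed here; the variance figures are invented for illustration) computes a design effect and the corresponding effective sample size:

    # Minimal sketch: design effect and effective sample size.
    # The variance values are invented for illustration only.

    def design_effect(var_complex, var_srs):
        # deff = complex-design variance / SRS variance of the same estimate
        return var_complex / var_srs

    def effective_sample_size(n_actual, deff):
        # number of SRS cases giving the same precision as the complex design
        return n_actual / deff

    var_complex = 0.00036   # hypothetical variance under the complex design
    var_srs = 0.00024       # hypothetical variance under SRS with the same n
    deff = design_effect(var_complex, var_srs)      # 1.5
    n_eff = effective_sample_size(3000, deff)       # 2000.0
    print(f"deff = {deff:.2f}, effective sample size = {n_eff:.0f}")

With these invented numbers, a complex sample of 3,000 respondents carries the same precision as a simple random sample of 2,000.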


- Data collection: When different cultures and nations are involved in a survey, there are differences in the meaning of the questions, the sensitivity of the topics, the understanding of the questions, interviewer-respondent interaction patterns, literacy levels, computer access, and telephone coverage, among other things. Therefore the data collection mode must take all these aspects into consideration when it is applied in cross-country surveys, so that comparability is not affected. In addition, even if the mode takes all of the above into consideration, there is the issue of the sampling frame. It is impossible to use the same kind of sampling frame across all participating countries. This might force different countries to use different modes of data collection, and that has the potential of creating additional errors. Therefore many try to encourage the use of one mode, say face-to-face, if possible (Guidelines for Best Practice in Cross-Cultural Surveys, 2011).

- Interviewers: Unlike in regular surveys, in multinational and multilingual surveys the interviewer is sometimes required to know more than one language, to increase the response rate. It is hard to find experienced interviewers who speak the languages required in some countries, and if that can be accomplished there usually are budget issues.

- Equivalence: The other major issue in 3M surveys is the question of equivalent measures. Equivalence is a multidimensional concept in 3M surveys. Johnson (1998) lists fifty-two types of equivalence identified in a literature review, among them cultural equivalence, text equivalence, translation equivalence, vocabulary equivalence, linguistic equivalence, grammatical-syntactical equivalence, and functional equivalence. Most of these types of equivalence differ only slightly from each other. Furthermore, equivalence can be divided into two fundamental domains of 3M equivalence, namely procedural equivalence and interpretive equivalence. In general, the main objective of a 3M survey is to achieve comparability or equivalence across countries or multiple populations, and multiple methodologies must be implemented to ensure cross-national equivalence.


In general it is hard to get high-quality survey data in any survey, and trying to achieve it across languages, nations and cultures is even more difficult. Multicultural and multinational surveys are complicated and error-prone, and guidelines that help in getting high-quality comparative survey data have not been available until recently. The Cross-Cultural Survey Guidelines (CCSG) are a great step forward. As stated by Lynn, Japec and Lyberg (2006): "Cross-national surveys can be considered to have an extra layer of survey design, in addition to the aspects that must be considered for any survey carried out in a single country".

The main objective of this thesis is to go through the survey lifecycle according to the currently available Cross-Cultural Survey Guidelines and study how quality in such surveys can be obtained, and then how these guidelines are applied in a few selected multinational and cross-cultural surveys. Finally, by combining the quality assurance and quality control methods suggested in the CCSG with the practical quality assurance and quality control applied in the cross-national surveys, we arrive at some minimum requirements of quality assurance and quality control for these kinds of surveys.

2. Examples of Multinational, Multicultural and Multiregional (3M) surveys

Let us start with the major classification of 3M surveys according to Smith (2010). Broadly, there is a division between global and regional surveys.

2.1 Global surveys

a. General social science collaborations include, for example, the International Social Survey Programme (ISSP). The main goal of the ISSP is to combine important social science research topics with cross-national, cross-cultural, and cross-language study perspectives, and also to design questionnaires that can be used across different cross-national surveys (http://www.issp.org/).

b. General-population studies on specialized topics include, for example, the World Health Survey (WHS), a one-time survey developed by the World Health Organization (WHO) to compile comprehensive baseline information on current health systems for policy purposes, and on the health of populations in general (http://www.who.int/healthinfo/survey/en/).

c. Special-population studies include, for example, the Trends in International Mathematics and Science Study (TIMSS), which assesses the mathematics and science achievement of 4th and 8th grade students and compares the results of more than 60 participating countries (http://timss.bc.edu/).

2.2 Regional surveys

d. General-topic surveys include, for example, the European Social Survey (ESS), which is designed to assess changes in the European social, political and moral climate using results from about 30 participating countries. Another ESS objective is to improve methods of cross-national attitude measurement (http://www.europeansocialsurvey.org/).

e. Special-population surveys include, for example, the Survey of Health, Ageing and Retirement in Europe (SHARE), which was developed in 2004 with the objective of providing valuable information on how ageing affects individuals in various cultural settings. It also provides rich data for the scientific community and any interested individuals to further study how ageing changes the quality of life in Europe (http://www.share-project.org/).

There are also some 3M surveys that are difficult to classify into the above broad categories. Examples of such surveys are global polls performed by commercial companies like the Gallup group, allied associations of commercial firms like Globescan, and harmonization projects under the leadership of Eurostat that aim to produce more comparable results for studies within the European Statistical System (ESS).


3. Survey lifecycle for a high quality survey

As mentioned, the literature on how to perform 3M surveys aiming for high-quality comparable data is scarce, and the CCSG is a great step forward. The CCSG covers all aspects of the survey lifecycle, starting with the study structure and ending with data dissemination. It also treats two major overarching survey aspects, namely survey quality and ethical considerations. Surveys do not always go through the complete lifecycle; some need only focus on part of the cycle. This is the case for the European Statistical System (ESS; please note that ESS also denotes the European Social Survey), which is not involved in the primary data collection for official statistics in Europe. Its major concerns are data processing, data harmonization and data dissemination. Of course, survey quality and ethical considerations are parts of the lifecycle that all survey organizations must execute. There might also be some survey process integration or iteration. When that happens, it is not necessary to follow the lifecycle step by step, but it is important to address all the elements of the survey lifecycle to get high-quality comparable survey data (Guidelines for Best Practice in Cross-Cultural Surveys, 2011).

Figure 2. The Survey Lifecycle

Source: http://www.ccsg.isr.umich.edu/intro.cfm


In what follows, we give a short description of the various lifecycle elements.

Study, organizational, and operational structure: The first step of the lifecycle is where the main objective of the comparative study is stated. The coordinating center specifies the operational structure of the survey for each participating country, for instance what needs to be standardized across the countries and what can or must be localized. When conducting cross-national surveys, the organizational structure has a major effect on how a study is designed and put into action (Pennell et al., 2010).

Tenders, bids, and contracts: Here the major concern is to prepare a format, offer study specifications within a prescribed time and budget for each participating country, and run a fair bidding process. After the bidding, the participating countries sign the contract. All these steps, from tender to bidding and finally signing the contract, should be documented. Sometimes the funder(s) and survey organizations participate together with the central coordinating center. In some cases the central coordinating center signs contracts with the party responsible for the data collection. When that happens there will be two tenders and contracts: one between the funder and the central coordinating center, and the other between the central coordinating center and the data collecting organization.

Sample design: The guidelines emphasize that "one important prerequisite for a comparative survey is that all samples are full probability samples from comparable target populations". It is also very important at this stage of the survey lifecycle to prepare a potential sampling frame and to determine the sample size needed to meet the required precision.

Questionnaire design: The questions need to be developed in such a way that they can be used across cultures and nations. As mentioned before, any questionnaire comes in two versions: the source questionnaire and the target questionnaire. The main goal of the guidelines in this part is to minimize measurement error by maximizing the comparability of the resulting questionnaire. The most frequent approach used by successful multicultural and multinational surveys is to re-use questions that have already been used in other surveys and that can still be used in the current survey. Adapting questions is also a good strategy for achieving comparability (Harkness, Edwards et al., 2010).


Adaptation of survey instruments: Adaptation means that questions are modified. This can happen during the translation stage or the pretesting stage. The main reason for adaptation is to fit the survey instruments to the chosen mode, the target language, specific cultural aspects, or a new population and location. There are many types of adaptation, including system-driven adaptation (different units of measurement need to be adapted accordingly, for instance Fahrenheit and Celsius, yards and meters), adaptation to improve or guide comprehension (for instance, making the questions clearer by supplementing them with locally understandable wording), adaptation to improve conceptual coverage (adding question components for a given location for accuracy of information), adaptation related to cultural discourse norms, adaptation to account for cultural sensibilities, adaptation of design components or characteristics (by adding some visual representations), adaptation related to lexicon and grammar, and adaptation to maintain or reduce the level of difficulty.

Translation: The translation procedure is extremely important in multilingual surveys; a bad translation can ruin comparability. There are different types of translation, but the CCSG promotes what is called team translation. Some commonly used translation methods are not recommended, for different reasons:

I) Machine translation: So-called "Google translation" reduces human involvement in the complex translation process and should be avoided.

II) Do-it-yourself ad hoc translation: This is translation performed by someone who can supposedly speak, write and read the language. But just because someone speaks the language and has had short translation training does not make the complex translation process any easier; it has to be done by professional translators. It is not uncommon that survey materials are translated by "someone's brother who spent six years in the country." Translatology is a science, and that fact should be respected.

III) Unwritten translation: This method is also called "on site" translation; the translation is done during the interview, or "on the fly". This type of translation should be avoided, but if it has to be done, extensive training and briefing should be given first.


IV) Back translation: This is when the target questionnaire is translated back into the source language (source-target-source). There are then two source versions to compare in order to find out whether there are any problems in the target questionnaire. To mention a few of the problems with back translation: it is costly, the process takes time to finalize, and there is no direct check of the target questionnaire itself. Figure 3 shows how back translation is performed.

Figure 3. The back translation approach: the original source language version is translated into the target language, and the target version is then translated back into the source language; the back-translated version is compared with the original.

Source: CSDI 2013 conference.

The currently most recommended translation method for 3M surveys is team translation, in which a group of professional translators and survey specialists work together to translate, review, adjudicate, and pretest. After pretesting, the translated version is finalized and ready for use, following good documentation of the whole process. This translation process is also called TRAPD (Translation, Review, Adjudication, Pretesting, and Documentation) (Harkness, Villar, and Edwards, 2010).

Instrument technical design: This stage focuses on how well the questionnaire design fits the mode, how to reduce measurement error, how to control context effects and interviewer effects, and how well the questionnaire takes respondent burden into account.

Interviewer recruitment, selection, and training: Interviewers play a crucial role, and the guidelines provide strategies on how to minimize interviewer effects on sampling error, nonresponse error, measurement error and processing error, while controlling costs and interviewer efficiency.

Pretesting: This is an important part of the survey lifecycle. It is here we find out whether the survey instruments are able to collect the intended data, how good the chosen mode is at collecting comparable data, and much more. In 3M surveys pretesting is more extensive than in most other surveys, since it involves testing how accurate the translation is, whether the adapted items collect the intended information, whether the questions are applicable in each culture of the participating countries, and so on. Compared to single-population surveys there are more things to test and retest before the actual study can be implemented.

Data collection: Collecting comparable data in 3M surveys is a complex task, starting with finding the right mode for all participating countries. Not having the right mode will affect the survey estimates and the survey cost, and makes it hard to manage the different aspects of the survey lifecycle. Current practice in many 3M surveys is to try to standardize the mode across locations (Pennell et al., 2010). The aim of the guidelines is to suggest how to collect cross-cultural data so that a specified level of precision and comparability is achieved.

Data harmonization: There are two types of harmonization: input and output. Input harmonization is about using strict common standards (concepts, definitions, classifications, technical requirements and so on) across countries to achieve comparability. Output harmonization deals with the statistical output, while the participating countries can use their own standards to collect the data. The objective of the guidelines is to ensure that the harmonizing organization has chosen the technique that best fits the source material and the survey's intended goal.

Data processing and statistical adjustment: This stage includes capturing, coding, and editing of the survey data. All data processing activities should be considered when the mode of data collection is decided.

Data dissemination: After the data are processed they are disseminated to users by the statistics producers. Data dissemination involves many processes, such as making sure that confidentiality is secured, that data are securely preserved for future use, that multilingual documentation exists when applicable, and that data are standardized and harmonized.

Survey quality: In 3M surveys assessing survey quality is a very difficult matter, especially when there is adaptation of survey instruments, translation of questionnaires and data harmonization. The guidelines present a quality framework to assess 3M survey data quality. The framework has three parts: total survey error, fitness for intended use, and survey process quality.


I) Total Survey Error (TSE): According to the guidelines, the TSE framework defines quality as the estimation and reduction of the mean squared error (MSE), where

$$\mathrm{MSE} = \mathrm{Variance} + \mathrm{Bias}^2$$
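The decomposition MSE = variance + bias² can be checked numerically. The following sketch (illustrative only; the population, sample size and bias are invented) simulates repeated samples from a deliberately biased estimator and compares the empirical MSE with variance plus squared bias:

    import numpy as np

    # Illustrative check of MSE = variance + bias^2 for a deliberately
    # biased estimator of a population mean. All numbers are invented.
    rng = np.random.default_rng(42)
    true_mean = 10.0
    estimates = np.array([
        rng.normal(true_mean, 2.0, size=50).mean() + 0.3  # systematic bias 0.3
        for _ in range(100_000)
    ])

    bias = estimates.mean() - true_mean
    variance = estimates.var()
    mse = np.mean((estimates - true_mean) ** 2)
    print(f"variance + bias^2 = {variance + bias**2:.4f}, MSE = {mse:.4f}")

The two printed numbers agree, reflecting the identity above.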

The MSE can be further divided into error components, where variances and biases can have different relative importance, as illustrated in the table below.

Table 1. Risk of variable error and systematic error by major error source

MSE component            Risk of variable error (variance)   Risk of systematic error (bias)
Specification error      Low                                 High
Frame error              Low                                 High
Nonresponse error        Low                                 High
Measurement error        High                                High
Data processing error    High                                High
Sampling error           High                                Low

Source: Lyberg and Biemer (2003), p. 59.

Let us define the MSE components briefly, according to Biemer and Lyberg (2003).

- Specification error: This error occurs when there is a mismatch between the survey questions and the research questions of the survey; for instance, the wrong parameter is being estimated.

- Frame error: This error includes coverage errors (missing units, duplications, or extraneous units), classification errors, contact errors (addresses that are incomplete or incorrect) and so on.

- Nonresponse error: There are three types of this error. I) Item nonresponse occurs when a particular question or a set of questions in a questionnaire is not answered, because the respondent forgets to answer, does not understand the question, or does not want to reveal the information requested. II) Unit nonresponse occurs when a sample unit does not participate in the survey, usually due to refusal or failure to contact the respondent. III) Incomplete response occurs when incomplete answers are given to open-ended questions; for instance, if an answer about occupation is not adequate, coding can be difficult. Nonresponse bias can be expressed as a function of the amount of nonresponse and the difference between respondents and nonrespondents, as Groves (2006) expresses it:

$$\bar{y}_r - \bar{y}_n = \frac{m}{n}\left(\bar{y}_r - \bar{y}_m\right)$$

where $\bar{y}_n$ is the sample mean, $\bar{y}_r$ is the respondent mean, $\bar{y}_m$ is the nonrespondent mean, and $m/n$ is the nonresponse rate.

To claim that there is no nonresponse bias, either the nonresponse rate has to be zero or there must be no difference between respondents and nonrespondents on the statistic of interest (Couper and de Leeuw, 2003).
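As a worked example with invented numbers: suppose the nonresponse rate is $m/n = 0.3$, the respondent mean is 5.0, and the (usually unobserved) nonrespondent mean is 4.0. Then

$$\bar{y}_r - \bar{y}_n = \frac{m}{n}\left(\bar{y}_r - \bar{y}_m\right) = 0.3 \times (5.0 - 4.0) = 0.3,$$

so the respondent mean overstates the full-sample mean by 0.3 even though 70% of the sample responded.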

- Measurement error: These more damaging errors are caused by respondents, interviewers, the survey questionnaire (ambiguous questions, confusing instructions, misleading terms, and so on), and the mode of administration.

- Data processing error: These are errors that occur during data processing, for instance during data editing, data entry, coding, weighting, and tabulation.

- Sampling error: This is the error that is due to selecting a sample instead of observing the entire population. Sampling error can be further subdivided into selection error and estimation error. Estimation error can be reduced by increasing the sample size, whereas computing design weights can help adjust for selection error (Lyberg and Biemer, 2003).

TSE takes into consideration how questions actually measure the construct of interest, through measurement error, processing error, and construct validity (specification error), where construct validity is the degree to which a test measures what it claims, or purports, to measure (Brown, 1996). TSE also measures how representative the sample is, through coverage error, sampling error, nonresponse error and adjustment error. Together these error sources form a framework that helps measure accuracy. What TSE misses is a user perspective on quality concepts and the quality frameworks used in official statistics (Groves and Lyberg, 2010).

II) Fitness for intended use: Survey quality is not just an acceptable MSE. Quality has other dimensions as well (Lyberg, 2012). The relationship between survey errors and survey quality is displayed in the figure below. MSE, or accuracy, is just one of a number of quality dimensions.

Figure 4. Fitness for intended use (quality dimensions) and total survey error (accuracy dimension)

Source: http://www.ccsg.isr.umich.edu/quality.cfm

III) Survey process quality: When we want to control and improve quality we can define it as a three-level concept (Lyberg and Biemer, 2008):

- Organizational quality
- Process quality
- Product quality

More discussion of this aspect of quality will be given in Section 4.

Ethical considerations in a survey: Ethics includes aspects such as protecting the privacy and confidentiality of the respondent, the right to participate in the study of one's own free will, and the right to access the data. For instance, there is the EU Code of Ethics for Socio-Economic Research, the Declaration on Professional Ethics of the International Statistical Institute (ISI), the NCES Standard 4-2 on maintaining confidentiality, the RESPECT code of practice, which is a voluntary code covering the conduct of socio-economic research in Europe (http://www.respectproject.org/code/), and more. Ethical consideration is more of a legal issue than a quality issue; therefore we will not go into detail about it.

4. Quality assurance and quality control in a 3M setting

Despite being complex and difficult to implement, cross-national and cross-cultural surveys have become increasingly important and are conducted by various organizations such as the World Bank, the OECD, the World Health Organization, the United Nations, Eurostat, Gallup and many other big and small, global and regional research organizations. But the question is how good the survey data quality is, and how successful the surveys are in producing the required high-quality comparable data within time and cost constraints. All organizations should have a program for quality assurance that delivers product characteristics in accordance with user and client demands (Biemer and Lyberg, 2003). As mentioned earlier, the complexity associated with conducting cross-national surveys, and the way problems tend to be magnified in such a setting, indicate that quality checks are needed. Unfortunately the literature that can help improve the whole process of quality assurance and quality control for cross-cultural surveys is sparse (Lynn, Japec, and Lyberg, 2006).

Before turning to the whole process of quality assurance and quality control, let us start with the definition of quality itself.

Quality: The term quality has many definitions depending on the field of study. It has been called value (Feigenbaum, 1951), defect avoidance (Crosby, 1979), excellence (Peters and Waterman, 1982), and fitness for use (Juran and Gryna, 1980). Here we will stick to fitness for use. Fitness for use means that the survey data are expected to be as accurate as required for the anticipated purposes. When we take this to the cross-national level, not only accuracy but also comparability among all the different data sets from all the different places and countries must be taken into account. This means that in comparative surveys, accuracy at the national level is not enough. The data from different countries and regions must be comparable to accomplish the intended purposes, given reasonable cost and time constraints (Juran and Gryna, 1980). As mentioned, other quality dimensions also need to be fulfilled. Quality means delivering all the product characteristics on which the user and the producer have agreed, and to accomplish this the survey process needs two things: I) procedures and activities to assure that the required quality can be achieved, called quality assurance, and II) procedures to control that these quality assurance standards and other requirements actually have been met, called quality control (Lyberg and Biemer, 2008).

But quality assurance and quality control in cross-national surveys are complicated to perform and not easily grasped, for the following reasons. To achieve equivalence, country-specific items should be identified and taken into consideration. Issues concerning definitions, question development, adaptation and translation have to be handled so that results are comparable across all the participating countries. Financial and methodological resources also differ from country to country. There could be conflicts between national and international interests, and the error structures might differ between participating countries. Now let us look in a little more detail at what quality assurance and quality control mean.

Quality Assurance (QA)

QA is a set of activities to ensure that the product fits the intended purpose and meets user/client expectations. In other words, it is a well-planned set of activities that an organization uses to make sure the required process and product quality is met. In general, quality assurance is about managing quality so as to satisfy the users as well as the producers (Lyberg and Biemer, 2008). In a more cross-cultural context, quality assurance comprises all the activities and methods put into the survey to ensure errors as small as possible and cross-cultural comparability as good as possible, given constraints. Some of the quality assurance components, as stated in Lyberg and Stukel (2010), are:

- pretesting of questionnaires,
- interviewer training,
- probability sampling design,
- call scheduling algorithms,
- formulas for calculating base weights,
- documentation systems, and
- user communication strategies and channels.

All these activities are examples of what a producer can do to deliver a good product. But quality assurance alone is not enough to make survey data comparable. There must also be quality control, to verify that all the quality assurance activities worked as intended.

Quality Control (QC)

QC is a way of making sure all the quality assurance activities are performed effectively to achieve the required product quality. It is a set of activities to ensure that all quality requirements are met. Some examples of quality control activities are:

- recording of nonresponse rates and nonresponse distributions across subgroups,
- editing,
- monitoring of costs and customer reactions,
- interviewer monitoring, and
- verification of coding.

There are different types of quality control tools, ranging from the simplest checklist to more advanced methods such as the collection and analysis of paradata. Quality control can be performed in all stages of the survey lifecycle, including after the survey is terminated, for evaluation and review purposes.


Table 2. Three-tiered framework for assuring and controlling quality

Product level. Main stakeholders: user, client. Assurance and control instruments: product aspects, service level agreements. Measures and indicators: frameworks, compliance to specifications, estimates of mean squared error, user surveys.

Process level. Main stakeholders: survey designer. Assurance and control instruments: process variables, current best methods, standards, paradata. Measures and indicators: control charts and analysis of variation, checklists, other verification analysis.

Organization level. Main stakeholders: agency, firms, owner, society. Assurance and control instruments: business excellence models, codes of practice, standards, reviews, audits, self-assessments. Measures and indicators: scores, identification of strong and weak points, follow-up of improvement activities.

Source: Lyberg and Biemer (2008).

Comments on the quality levels:

Product quality: This is evaluated by client/user satisfaction and compliance with specifications. To assure a quality product and avoid unnecessary variation, there should be a set of standards and requirements to follow. When conducting a cross-cultural survey, some survey aspects need flexible conditions, like sampling steps and adaptations of questions (Lyberg and Stukel, 2010). Even if the designer and the producer of the survey are the ones who can decide certain aspects of product quality, it is necessary to also include the client(s)/user(s) in this assessment. As stated in Table 2, depending on the type of survey conducted, the product quality can be assessed using the estimated MSE combined with user views, i.e.,

$$\mathrm{MSE} = V_{SE} + V_{DPE} + V_{ME} + \left(B_{SE} + B_{FE} + B_{NSE} + B_{ME} + B_{DPE}\right)^2$$

where the $V$ terms denote variances due to sampling error, data processing error, and measurement error, and the $B$ terms denote biases due to specification error, frame error, nonresponse error, measurement error, and data processing error.

Process quality: The word process in this context denotes a set of activities that leads to the intended goal. Here both QA and QC are needed in order to achieve comparability. If process quality is well monitored, it will help spot the main sources of poor quality. The most common and simplest tool for process quality control is the checklist, and the most advanced method is the use of paradata. The term paradata was coined by Couper (1998) for automatic data about the survey process. Later, Kreuter, Couper and Lyberg (2010) defined it broadly as follows:

"Paradata are automatic data collected about the survey data collection process captured during computer assisted data collection, and include call records, interviewer observations, time stamps, keystroke data, travel and expense information, and other data captured during the process."

Paradata help to create process indicators, to monitor in real time, to provide data that suggest changes, to decrease the TSE, and to investigate nonresponse error and measurement error (Kreuter, 2013). But paradata need not be confined to data collection; any data about the survey process can be seen as paradata. Finally, it is generally accepted that good process quality is necessary to achieve good product quality (Lyberg and Biemer, 2008).
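To make the control-chart idea concrete, here is a minimal sketch (with hypothetical call-record paradata; it is not an actual ESS or CCSG tool) that monitors interviewer-level refusal rates and flags interviewers above a three-sigma upper control limit on a p-chart:

    import math

    # Hypothetical call-record paradata: (interviewer id, contacts, refusals).
    paradata = [("A", 120, 18), ("B", 95, 14), ("C", 110, 35), ("D", 130, 17)]

    total_contacts = sum(c for _, c, _ in paradata)
    p_bar = sum(r for _, _, r in paradata) / total_contacts  # overall refusal rate

    for interviewer, contacts, refusals in paradata:
        p = refusals / contacts
        sigma = math.sqrt(p_bar * (1 - p_bar) / contacts)  # p-chart standard error
        ucl = p_bar + 3 * sigma                            # upper control limit
        flag = "investigate" if p > ucl else "ok"
        print(f"interviewer {interviewer}: refusal rate {p:.2f} "
              f"(UCL {ucl:.2f}) -> {flag}")

In this invented example, interviewer C's refusal rate falls above the control limit and would be singled out for follow-up, which is exactly the kind of real-time monitoring the paradata literature describes.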

Organizational quality: On this quality level, quality assurance plays a major role. For instance, it is important to select a survey organization that is capable of conducting a cross-cultural survey. It is a challenging process to find organizations in the participating countries that have enough experience to meet the requirements; the experience of IALS in 1994 is a good example (Carey, 2000). There are also standards to follow to make sure that the process quality will not affect the product quality in a negative direction, standards like those of the International Organization for Standardization (ISO), such as the quality management system ISO 9000 and its family, ISO 9001, ISO 9004, and ISO 19011. ISO 20252 is used to help harmonize national standards and to promote international quality standards for market, opinion and social research. In general these standards help organizations provide quality products, increase clients'/users' confidence in the organization and the product, and increase productivity (Lyberg and Stukel, 2010).


Nowadays there are different business excellence models in use to assess organizational quality. The best-known business excellence models are:

- the ISO quality management system (ISO 9001),
- the European Foundation for Quality Management (EFQM) Excellence Model, and
- the Malcolm Baldrige Award.

These models can be used to check whether the organization is able to satisfy stated and implied needs within given time and cost, and they can also help define areas that the organization has to take into consideration in terms of improvement. When conducting cross-national surveys, there has to be a well-organized central coordination and monitoring structure to help with planning, with developing specifications for quality control, and with coordination and support. Not having strong central coordination, monitoring and support could lead to poor comparability. For example, the European Social Survey (ESS) has a Central Coordinating Team (CCT). This team is guided by leading social scientists from different organizations and supported by a Scientific Advisory Board. In addition there is a Methods Group that advises on methodological issues (www.europeansocialsurvey.org).

In general, this section has shown that quality in 3M surveys is not only about getting accurate data; it is also about comparability between the different participating countries. Having quality comparable data means having data that are fit for the intended use. As for the three quality levels: in order to have good product quality there has to be good process quality, and without organizational quality there is no good process quality. In the next section we will see how QA and QC are treated in currently more or less ongoing 3M surveys.

5. Selected Surveys

For the purpose of this thesis, eleven different cross-national and cross-cultural surveys were chosen. Most of them are known for their extensive cross-national publication of survey results. Some are known for being the oldest in the business, and there are of course also the leading social science research organizations and regular providers of comparative survey data. The selected surveys will be discussed in this section, where the contents of the available methodological and technical reports are contrasted with the Cross-Cultural Survey Guidelines. The selected surveys are:

- European Social Survey (ESS)
- Trends in International Mathematics and Science Study (TIMSS)
- World Values Survey (WVS)
- Program for the International Assessment of Adult Competencies (PIAAC)
- International Adult Literacy Survey (IALS)
- Survey of Health, Ageing and Retirement in Europe (SHARE)
- International Social Survey Programme (ISSP)
- European Working Conditions Survey (EWCS)
- Comparative Study of Electoral System (CSES)
- Gallup World Poll (GWP)
- Eurobarometer (EB)

5.1. European Social Survey (ESS)

The ESS was established in 2001. It is designed to chart and explain the relationship between Europe's changing institutions and the attitudes, beliefs and behavioral patterns of its population. In addition, it is an ESS objective to improve methods of cross-national attitude measurement. The survey has so far covered more than 30 European countries (Fitzgerald and Jowell, 2010).

- Study, Organizational, and Operational Structure: The ESS has good documentation of all the survey rounds performed so far. The documentation explains the overall and country-specific structures, as well as features of the study designs. According to the ESS round 6 specifications for participating countries, the survey organizations in each country have to be experienced in conducting probability-based survey data collection of the highest standard by means of face-to-face interviewing. The organizations should be willing to change or adapt the methods they normally use in order to provide cross-nationally comparable data. Participating countries are required to use strict probability sampling at all stages, and also to use the TRAPD translation method (Fitzgerald and Jowell, 2010). The organizational structure of the ESS is shown in Figure 5 below.

Figure 5: ESS organizational structure. The Central Coordinating Team is advised by the Scientific Advisory Board, the Funders' Forum and the Methods Group, supported by Question Design Teams, Specialist Advisory Groups, a Sampling Panel and a Translation Taskforce, and works with the National Coordinators and Survey Institutes of each participating country.

Source: http://www.europeansocialsurvey.org

In this organizational structure, the Scientific Advisory Board and the Methods Group advise the project director team, called the Core Scientific Team (CST). The National Coordinators and Survey Institutes in each participating country are responsible for national activities related to the ESS. The Central Coordinating Team (CCT) developed the survey design, the questionnaire, and strict guidelines on probability sampling, translation, response rates and fieldwork documentation. The CCT gets support from a number of advisory and consultative groups, such as:

- Centre for Comparative Social Surveys, City University London, UK (Coordinator)
- Norwegian Social Science Data Services (NSD), Norway


- GESIS-Leibniz Institute for the Social Sciences, Germany
- The Netherlands Institute for Social Research/SCP, Netherlands
- Universitat Pompeu Fabra, Spain
- University of Leuven, Belgium
- University of Ljubljana, Slovenia

Survey Quality: To pursue optimal comparability the ESS sets a series of minimum target standards for participating countries to meet, for instance maintaining a minimum response rate of 70% with a maximum noncontact rate of 3%. The ESS is also primarily based on an input harmonization model, where the data collection is designed from scratch and all countries are expected to use a uniform method for obtaining optimal comparability (Stoop, Billiet et al., 2010). The core questionnaire is designed using input from the National Coordinators (NCs) on both cultural and intellectual aspects, and benefits from multidisciplinary specialist review. The ESS uses the five-step translation process TRAPD (translation, review, adjudication, pretesting, and documentation) (Harkness, 2003). As for the ESS minimum response rate target of 70%, not all participating countries reach it, as the chart in Figure 6, given by Stoop, Matsuo et al. (2010), shows.


Figure 6: ESS response rates in countries participating in Rounds 1-4 (European Social Survey, 2010).

Source: Stoop, Matsuo, Koch and Billiet (2010), Section on Survey Research Methods, JSM 2010.

Only one country, Poland (PL), was able to reach that level in the first four rounds of the ESS. For most countries 70% is an ambitious target. Some countries have improved over time, such as France (FR), Switzerland (CH), and Spain (ES). Some countries get worse round after round, such as Sweden (SE), the Netherlands (NL), Hungary (HU), and Slovenia (SI). But the standard deviation of the response rates decreases from round 1 (11.3) to round 2 (8.8) and round 3 (7.7) (Stoop, Billiet, and Vehovar, 2010). A minimum response rate of 70% does not guarantee a low nonresponse bias; as Groves (2006) points out, there is no linear relationship between response rates and nonresponse bias across surveys. But it still helps to have a fixed goal for the nonresponse rate. The different methods that have been used by the ESS to assess and adjust for nonresponse bias are:

- Population level information and post-stratification
- Interviewer observation
- Paradata from refusal conversion
- Core information on nonresponse
- Auxiliary data


To give a general idea of how these nonresponse adjustment methods are used:

Population level information and post-stratification: This method concerns how population level information drawn from national statistics can be used for post-stratification weighting (Vehovar, 2007). Vehovar's findings for the first round of the ESS show only a small difference between population data and ESS data on sex and age in most countries, but a larger difference when it comes to education. Vehovar also found that the higher the response rate, the lower the average absolute standardized bias (the difference between estimates in the unweighted and the weighted sample) of the 45 selected items, which mostly cover media use, social trust, political interest, well-being, economic morality, family and life-work balance, socio-demographic profile, and human values.

Figure 7: The absolute average standardised bias in relation to the response rate of the country samples.

Source: Vehovar (2007), p. 352.

Figure 7 shows the relationship between the average absolute standardized bias (vertical) and the country response rate (horizontal). The bigger bubbles are the countries with more items with standardised bias > 1.96. The standardised bias, computed under the assumption of simple random sampling, compares the nonresponse bias with the sampling error. The figure also shows the importance of aiming for high response rates; in other words, there is a negative correlation between the bias estimates and the response rate. Vehovar (2007) found a correlation of -0.29. Countries such as Germany (DE), Spain (ES), Finland (FI), Poland (PL), Portugal (PT), and Slovenia (SI), which have no items with absolute standardised bias > 1.96, support this argument, whereas a country such as Estonia (EE), with a high response rate of 79.3% but 18 items with absolute standardised bias > 1.96, goes against it. On the other hand, Greece (GR), with a response rate of nearly 78.8%, has a small average absolute standardised bias, although five of its items have standardised bias > 1.96. Finally, Iceland (IS) leads with 37 out of 45 items biased by nonresponse, with absolute standardised bias > 1.96. In conclusion, weighting is very important, but the weights need to be computed with great care, since otherwise they may inflate the variance.

Interviewer observation: In the ESS, basic information on neighborhoods and dwellings is collected by the interviewers themselves. These data are probably more closely related to household characteristics and core variables than population-level data, and in that sense more useful for adjusting for nonresponse. For instance, the physical state of the buildings showed a negative relationship with the education level of the target persons.

Paradata from refusal conversion: In general, paradata are collected for various purposes, namely to calculate response rates according to accepted standards (American Association for Public Opinion Research, 2009), to identify reasons for nonresponse (Groves and Heeringa, 2006), to explore whether nonresponse bias is probable, and to adjust for bias (Stoop, Billiet and Vehovar, 2010). In particular, paradata from refusal conversion are used to see how different reasons for refusals can have different impacts on nonresponse bias (Stoop, 2007).

Core information on nonresponse: This is direct information on nonresponse, gathered using two methods: the doorstep approach and the follow-up survey. In the doorstep approach, refusals are asked to answer at least a small number of questions. In the follow-up survey approach, refusals and noncontacts are surveyed subsequent to data collection in the main study. Both will help increase the response rate and gather more auxiliary information to adjust for nonrespondents (Stoop, 2004).

Auxiliary data: To adjust for nonresponse bias it is necessary to have auxiliary data, including paradata, and data that are related to response behavior and to the target survey data (Bethlehem, 2002).

Another survey quality issue in the ESS is the use of different modes of data collection in different participating countries. The ESS really wants all data collection to be face-to-face, but depending on local survey infrastructures some countries want to consider paper-and-pencil mode or computer-assisted interviewing. Some countries also use combinations of modes; we will look at the details in the data collection section later on. The ESS has initiated research on the effects of mixed mode data collection starting in 2003, since mixed mode designs result in special kinds of measurement error. One of the findings of the research on mixed mode effects by Martin and Lynn (2011) indicates that there is good reason to be careful when comparing data from face-to-face interviews with data from other collection modes. The ESS still wants face-to-face interviews to be used in all participating countries. One of the reasons is that changing to mixed mode data collection could bring inconsistencies into the time series; for instance, it could make it difficult to identify the nature of change or stability over time.

When it comes to questionnaire translation, the ESS asks all participating countries to translate the questionnaire into all minority languages spoken by at least 5% of the population. For high translation quality and to decrease measurement error, the ESS uses the current best translation method (TRAPD), followed by verification by the national coordinators and then by the ESS central coordinating teams. To check the translated questions' reliability and validity the ESS uses the Survey Quality Predictor (SQP) program.


As for documentation, each step of the translation process is well documented for further QA and QC purposes, as shown in Figure 8 (Dorer, 2011). Figure 8: ESS translation process: TRAPD + verification + SQP

[Flowchart: Source instrument → Translation 1 / Translation 2 → Review → Adjudication (each step documented) → Translation verification, possibly a 2nd review/adjudication → SQP coding (if needed) → Pretest → Target instrument]

Source: Dorer and Martens (2013), CSDI Workshop 2013.

As shown in Figure 8, to achieve high translation quality the ESS translation process uses the current best translation method (TRAPD): the source questionnaire is translated by two independent translators, followed by one reviewer and one adjudicator at the national level. Translation verification is then performed by the ESS translation team and cApStAn (a linguistic quality control organization). Before pretesting and finalizing the target questionnaire, the translated version goes through a final translation check using SQP coding. When it comes to the sampling design, ESS strictly uses probability sampling in all participating countries. To begin with, countries need to sign an agreement to use only probability sampling in order to be part of the study. ESS also specifies a minimum effective sample size to help decrease sampling error and design effects (http://www.europeansocialsurvey.org).


Also, as part of the ESS quality control, the national coordinators are required to perform back-checks on 5% or 10% of the respondents to verify that all the data collection instruments were used according to the specifications or standards. A back-check is a quality control method carried out after data collection by taking samples of the responses to the questionnaire and checking whether the data were collected according to the standards. Sometimes it involves performing the interview (face-to-face or telephone) again, so-called reinterviewing (ESS Round 6, 2011).

- Ethical Considerations: All national teams are obliged to adhere to the Declaration on Professional Ethics of the International Statistical Institute (ISI), in addition to any other codes binding on participating countries. The ESS also strongly forbids any national data to be released or published before the official release date given by the ESS data archive.
- Tenders, Bids and Contracts: ESS's central coordination is mainly funded by the European Commission (EC). Additional funds are available from the UK Economic and Social Research Council and the European Science Foundation (ESF), but the national surveys and the national coordinator of each participating country have to find their own funding.
- Sampling Design: The target population comprises all non-institutionalized persons aged 15+ regardless of nationality, citizenship or language. Quota sampling is not permitted at any stage, and neither is substitution of non-responding households (http://www.europeansocialsurvey.org). ESS states in its specifications that every participating country must achieve an effective sample size of at least 1,500 (800 for countries with populations of less than 2 million). To reach this minimum in each participating country, design effects must be taken into consideration (Gabler, Häder, and Lynn, 2006). Of the 22 countries in the first round of ESS, only 3 used unclustered, equal-probability designs; the rest used a model-based approach to predict the design effect.


This predicted design effect, deff, is the product of the design effects due to unequal inclusion probabilities and clustering (Gabler, Häder, and Lahiri, 1999):

$$\mathit{deff} = \mathit{deff}_p \cdot \mathit{deff}_c$$

where $\mathit{deff}_p$ is the design effect due to different selection probabilities and $\mathit{deff}_c$ is the design effect due to clustering. The clustering component is

$$\mathit{deff}_c = 1 + (\bar{b} - 1)\rho$$

where $\bar{b}$ is the average number of respondents per cluster and $\rho$ is the intra-cluster correlation.

The design effect due to clustering, $\mathit{deff}_c$, can be predicted using information from previous surveys or by examining the nature of the clustering units. How to predict it is beyond the scope of this thesis, but more information is found in Kish (1994) and Gabler, Häder and Lynn (2005). The design effect due to different selection probabilities can be calculated as follows:

$$\mathit{deff}_p = m \, \frac{\sum_{i=1}^{K} m_i w_i^2}{\left(\sum_{i=1}^{K} m_i w_i\right)^2}$$

where $m_i$ is the number of respondents in the $i$-th selection probability class, $w_i$ is the weight for each respondent in the $i$-th selection probability class, and $m = \sum_{i=1}^{K} m_i$ is the total number of respondents. Therefore the predicted design effect can be rewritten as:

$$\mathit{deff} = m \, \frac{\sum_{i=1}^{K} m_i w_i^2}{\left(\sum_{i=1}^{K} m_i w_i\right)^2}\left[1 + (\bar{b} - 1)\rho\right]$$

Thus, to achieve the required minimum effective sample size, $n_{\mathrm{eff}} \geq 1500$, there needs to be an adequate net sample size $m$, with $m \geq n_{\mathrm{eff}} \cdot \mathit{deff}$. In addition we need to consider the estimated response rate, $\hat{r}$, and the estimated proportion of ineligibles in the sampling frame, $\hat{q}$ (sample members that are not members of the target population).


Taking all this into consideration gives the approximate gross sample size needed to achieve the minimum effective sample size (Ganninger, 2006):

$$m_{\mathrm{gross}} = \frac{n_{\mathrm{eff}} \cdot \mathit{deff}}{\hat{r}\,(1 - \hat{q})}$$

Let us take as an example three countries from the ESS Round 2 technical report to see how the above-mentioned estimates of design effects and sample sizes help in reaching the required minimum effective sample size, so that comparability across participating countries can be increased.

Table 3: Design effects and estimated sample sizes for three countries

Country    Frame                              deff_p  deff_c  Deff   Anticipated         Net     Gross   Effective
                                                                     response rate (%)   sample  sample  sample size
Turkey     Households: voters' registries     1.20    1.23    1.48   50                  2,220   5,500   1,500
UK (GB)    Addresses: Postcode Address File   1.31    1.22    1.60   70                  2,340   3,912   1,463
UK (NI)                                                              72
Ukraine    Area based                         1.11    1.19    1.32   65                  1,980   3,050   1,500

Source: ESS Technical Report, Round 2, Chapter II, The Samples. GB = Great Britain; NI = Northern Ireland.

This table clearly shows how using the design effect as described helps countries maintain the minimum effective achieved sample size of 1,500 (or 800 in countries with populations of less than 2 million). Depending on the type of sampling frame a country has access to, the probability sampling design can differ (Kish, 1994): "Sampling design may be chosen flexibly and there is no need for similarity of sample designs. Flexibility of choice is particularly advisable for multinational comparisons, because the sampling resources differ greatly between countries. All this flexibility assumes probability selection methods: known probabilities of selection for all population elements." Generally, the goal of ESS is to design and implement a feasible and comparable sampling strategy in every participating country.
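To make the arithmetic in Table 3 concrete, the following minimal sketch (illustrative code, not part of any ESS toolchain; function names are ours) reproduces the Ukraine row: the predicted design effect is the product of the two components, the net sample size follows from the minimum effective sample size, and the gross size allows for nonresponse (here with no allowance for ineligibles):

```python
# Predicted design effect and the sample sizes it implies,
# illustrated with the Ukraine row of Table 3.

def predicted_deff(deff_p: float, deff_c: float) -> float:
    """deff = deff_p * deff_c (Gabler, Haeder and Lahiri, 1999)."""
    return deff_p * deff_c

def required_sizes(n_eff: int, deff: float, resp_rate: float,
                   inelig_rate: float = 0.0) -> tuple[float, float]:
    """Net sample size m >= n_eff * deff, then the gross size allowing
    for nonresponse and ineligible frame units (Ganninger, 2006)."""
    m_net = n_eff * deff
    m_gross = m_net / (resp_rate * (1.0 - inelig_rate))
    return m_net, m_gross

deff = predicted_deff(1.11, 1.19)            # Ukraine: deff ~ 1.32
m_net, m_gross = required_sizes(1500, deff, 0.65)
print(round(m_net), round(m_gross))          # ~1981 and ~3048, matching the
                                             # 1,980 / 3,050 figures in Table 3
```

Turkey's gross sample of 5,500 exceeds what the response rate alone implies, which is consistent with an additional allowance for ineligible frame units.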


- Questionnaire design: The source questionnaire is developed in British English and translated into other languages. Except for some country-specific questions, almost all questions are closed-ended and administered in the same format. Since ESS was mainly designed to monitor changing attitudes and values across Europe, the source questionnaire encompasses three different modules: core modules, rotating modules and supplementary sections.
- Translation: As mentioned, ESS uses TRAPD to ensure that the different language versions of the source questionnaire are functionally comparable. The TRAPD or team approach focuses on the final translation process. To obtain a good quality translated questionnaire there has to be a well designed source questionnaire to begin with: "Achieving optimal translations begins at the design stage" (Smith, 2004b). Furthermore, prior to the fieldwork, translation verification is done by an external agency. Pretests take place in all participating countries to identify problems, using the translation and verification follow-up form (TVFF). In the ESS 5th round an advanced translation method was implemented with the participation of two national teams, one from the Swiss foundation for research in social sciences, Lausanne, and the Polish team from the Centre of Sociological Research, Institute of Philosophy and Sociology, Polish Academy of Sciences, Warsaw (Dorer, 2011). In this advanced translation method the national translation teams have experienced survey researchers as reviewers and adjudicators, and one trained and experienced translator. This is the currently recommended way of forming translation teams (Behr, 2009; Braun and Harkness, 2005). To increase the coverage rate, ESS asks all the participating countries to translate the source questionnaire into all languages spoken by at least 5% of the population in the country.
- Interviewer recruitment and training: ESS has appointed the national coordinators and the central coordinating team to recruit, train and supervise the national fieldwork agencies and to develop specifications and checklists that they can use. The interviewers get training on interviewing techniques based on the best practices described by the American Association for Public Opinion Research (AAPOR). Training interviewers carefully on interviewing techniques and the subject matter of the survey is an important ESS interview specification. Interviewers also get specific training on face-to-face interviewing and on how to collect data fit for processing. This is well described in Loosveldt (2008).
- Pretesting: After source questionnaire translation and verification, countries are strongly recommended to select a demographically balanced sample of around 50 people each to pretest the translated questionnaire (ESS Round 6, 2011).
- Data collection: The main data collection mode is face-to-face interviewing, but depending on the funding, sampling frame, response rate, and mode preferences, some countries use paper-and-pencil (PAPI) or computer-assisted (CAPI) interviewing, and some even use telephone or web-based modes (European Social Survey, 2011). Using mixed modes has its own pros and cons. The advantages are the possibility of increasing the response rate and decreasing the cost of the expensive face-to-face mode; on the other hand, mixed mode data collection has the potential of increasing measurement error (Dillman, Hox, and de Leeuw, 2008; Biemer and Lyberg, 2003). Despite the ESS effort to keep data collection single mode, countries continue using the modes that fit their needs. Figure 9 shows the experience of using alternative or mixed modes in the different participating countries.

Figure 9: Experience of ESS fieldwork agencies with different modes in ESS. The chart shows, for each participating country (UK, Italy, Spain, Russia, France, Cyprus, Sweden, Poland, Finland, Ireland, Ukraine, Iceland, Norway, Hungary, Denmark, Germany, Slovenia, Portugal, Switzerland, Czech Republic, Slovak Republic), the percentage of fieldwork agencies using face-to-face only, telephone interviews only, postal/self-completion only, web/internet only, mixed mode, or other modes. Source: Eva and Widdop (2007), ESRA Conference.


The experience of mixed modes varies between agencies, but there is still no empirical evidence supporting mixed mode data collection, or showing that any other single mode is better than face-to-face for a long, large-scale population survey like ESS.
- Data Harmonization: The participating countries check, clean and document their data sets before sending them to the ESS for harmonization and dissemination. In this way the ESS forces the participating countries to make their data conform to acceptable quality standards. After revising the already checked data from participating countries, ESS checks each national dataset against the standards before integrating it into the central dataset for international comparison. ESS supports a policy of free and easy access to its integrated dataset for the scientific community.
- Data dissemination: The Norwegian Social Science Data Services (NSD), which is the ESS Data Archive, merges all national datasets including all documentation, performs some de-identification to protect confidentiality, and then disseminates the results to the scientific community free of charge.

5.2. Trends in International Mathematics and Science Study (TIMSS)
TIMSS is a project of the International Association for the Evaluation of Educational Achievement (IEA). It evaluates achievements in mathematics and science of 4th and 8th grade students and compares the results from more than 60 participating countries. It started in 1994 and is conducted every four years. TIMSS survey results help countries to monitor and evaluate their mathematics and science teaching across time and grades (Mullis, Martin et al., 2009).

- Study Structure: Each country chooses a TIMSS National Research Coordinator (NRC), who is responsible for executing the survey according to TIMSS methods and procedures in his/her country, as well as for working together with the international project staff to make sure that the study answers the participating countries' concerns. The target population for the TIMSS study consists of students in their fourth/eighth year of schooling, counting Level 1 as the first year, as standardized in the International Standard Classification of Education developed by UNESCO (Williams et al., 2009).


- Survey Quality: To check the sampling procedure, data are weighted in accordance with the sampling design. The base weight is the inverse of the probability of selecting the targeted students from the population. Depending on the sampling design, however, the basic sampling weight for each student is adjusted using the product of three different types of weights, namely the school weight, the classroom weight, and the student weight. A detailed description of these weights is given in the data processing section. In order to select efficient and accurate samples, TIMSS usually uses a stratified multistage cluster sampling technique, in which computing standard errors to quantify variation becomes complicated. Therefore TIMSS uses the jackknife repeated replication (JRR) technique to get unbiased estimates of the standard errors, totals and percentages (Gonzalez and Foy, 2000).

Each participating country is responsible for data quality control, using the TIMSS international study center's manual for quality control monitoring (QCM). This manual covers issues such as preliminary activities of the test administrator, test administration activities, summary observations, interviews with the school coordinator and test administrator, and so on. In addition, to get comparable data, TIMSS has standardized the test, and to make sure that all participating countries follow the standards, the national research coordinators get intensive training and are expected to train their own countries' assessment administrators. TIMSS also produces variable codebooks and data entry software for participating countries so they can check their data before sending them to the database (Johansone, 2007). TIMSS furthermore uses the Windows Data Entry Manager software (WinDEM) as an additional quality control instrument; it helps identify inconsistent identification codes and invalid codes (Johansone and Neuschmidt, 2007). The IEA Data Processing and Research Center (DPC) has developed an online data collection system for the curriculum questionnaire and the survey activities questionnaire. The curriculum questionnaire is developed in one language, English, which means there is no translation or adaptation process and no quality problems caused by them (Johansone and Neuschmidt, 2007). All these measures are believed to help ensure accuracy and consistency of the international data.
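Since JRR is central to how TIMSS quantifies sampling variation, a minimal sketch may help fix ideas. The code below is illustrative only, with simulated data; the pairing of schools into "zones" and the doubling/zeroing of weights follow the general JRR scheme rather than TIMSS's exact implementation:

```python
import numpy as np

# Jackknife repeated replication (JRR) sketch for a clustered sample:
# schools are paired into zones; in each replicate one school of a pair
# gets double weight and its partner zero weight, and the variance is
# accumulated from squared deviations of the replicate estimates.
rng = np.random.default_rng(0)
n_zones = 15
scores = rng.normal(500, 50, size=(n_zones, 2))   # toy mean score per school
weights = np.full((n_zones, 2), 1.0)

def estimate(w):
    return np.average(scores, weights=w)          # weighted overall mean

theta_full = estimate(weights)
var = 0.0
for z in range(n_zones):
    w = weights.copy()
    w[z, 0], w[z, 1] = 2.0 * weights[z, 0], 0.0   # double one, drop the other
    var += (estimate(w) - theta_full) ** 2

print(f"mean = {theta_full:.1f}, JRR s.e. = {np.sqrt(var):.2f}")
```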


- Ethical consideration: TIMSS follows certain steps according to the National Center for Education Statistics (NCES) Standard 4-2, Maintaining Confidentiality (NCES, 2002) (Williams et al., 2009). A few of the steps taken to maintain confidentiality are: all employees with access to the data signed affidavits of data confidentiality; questionnaires were sealed by students after completion; and names of students, teachers and schools were removed by field staff from the assessment booklets, among other measures.
- Sampling Design: TIMSS's target population is all fourth and eighth graders in each participating country, and two-stage sampling is used. The first stage is a probability sample of schools (public, private or vocational), and in the second stage one or two intact 4th or 8th grade classes are sampled. Most of the participating countries follow this two-stage sampling strategy, but a few countries use other methods. For example, the Russian Federation adds one more stage by first sampling regions and then following the two-stage design. Singapore also added one more stage by sub-sampling students within classes. There were also countries, like Morocco (for the eighth grade) and Mongolia, that used classroom sampling procedures completely different from TIMSS's sampling standard. As a result, both countries' classroom sampling procedures were not approved by the International Study Center in the 2007 TIMSS assessment. Moroccan schools with incorrectly sampled classrooms were eliminated from the sample, and Mongolia's irregular sampling and poor documentation of the sampling operation in the field led to the country's data being summarized in an appendix of the international reports (Joncas, 2007).
- Questionnaire Design: TIMSS contextual questionnaires cover five broad areas. As stated on the webpage, the areas are: curriculum; schools; teachers and their preparation; classroom activities and characteristics; and students. These broad information areas are covered by four different questionnaires, called the Student Questionnaire, Teacher Questionnaire, School Questionnaire and Curriculum Questionnaire (Erberber, Arora, and Preuschoff, 2011). At the beginning of every new cycle the previous cycle's questionnaires are revised. The basic background does not change much, but some things need revision; for instance, new countries may be joining the study (http://timss.bc.edu/timss2003i/). After TIMSS has reviewed and revised the draft


of the questionnaires, the NRCs and the questionnaire development committee review the questionnaires again and make them ready for a field test. After the test, the final questions to be included in the study are selected by the NRCs and the committee (Erberber, Arora, and Preuschoff, 2011).
- Translation and Adaptation: The source questionnaire is developed in English, and the translated versions are checked before the field test and also before data collection. A big effort is made, from explaining the guidelines and the translation to checking that the final layout corresponds to the international versions. Finally, the test results for the questionnaires are checked statistically to see if the items are comparable across countries (TIMSS 1999, Technical Report). But no matter how thoroughly the translation process is done, it is difficult to make the translation exactly match the source documents (Greenfield, 1997). A good example is the significant translation errors discovered in TIMSS 1995 in the Mexican and Spanish language translations, with error types such as style, format, grammar, semantics (e.g., use of false cognates), information, and so on (Solano-Flores et al., 2006). The current TIMSS also uses back translation to check the quality of translation. Back translation produces a second source-language version, obtained by translating the target text back into the source language, which is compared with the original to check quality; however, it is currently not the most favored method, because it causes loops and confusion (Mohler, 2005). Furthermore, "on the process of back translation there are no clear theories, techniques or findings that go with the linguistic field" (Harkness, 2003).
- Pretesting: The TIMSS background questionnaires (the curriculum, school, teacher and student questionnaires) are pretested in every round of the assessment, even if most of the questions are re-used in every assessment. In TIMSS 2007, however, since the curriculum questionnaire was administered online using only one language (English) in all participating countries and had a relatively small number of respondents (around 60 NRCs), it was not field tested. The other questionnaires were field tested in 31 participating countries for Grade 4 and 45 participating countries for Grade 8 (Erberber, Arora and Preuschoff, 2007). In addition, participating countries that had translated/adapted the questionnaires were asked to field test them. As a result of the field tests in TIMSS 2007 some errors were discovered, and each country was asked to check the translation quality before the test instruments were finalized (Johansone and Malak, 2007; Mullis and Martin, 2010).


- Data collection: The main data collection mode was paper-and-pencil and/or classroom self-administration until TIMSS 2007. Starting in 2007, curriculum questionnaires and survey activities questionnaires were for the first time fielded online. When the need arises, monitors not only check that the test is conducted according to the manuals but also conduct interviews (Johansone and Neuschmidt, 2007).
- Data processing: It has been mentioned that TIMSS uses a two-stage stratified cluster sampling method. As a result there are different probability selection methods for students, which leads to a rather complicated sampling weight for adjusting for nonresponse (of schools, classrooms, and students). Just to see how complicated it can get, let us see how the basic weight is adjusted by factors that account for nonresponding schools, classrooms, and students in TIMSS 2007 (Joncas, 2007). School weights are used to compensate for any sampled schools that did not participate and were not replaced; in other words, they are used to adjust for nonresponding schools. This weight is the product of the basic (first stage) school weight and the school nonparticipation adjustment.

The basic school weight represents the inverse of the probability of a school being selected in the first stage of the sampling process, including replacement schools:

$$BW_i^{sc} = \frac{M}{n \cdot m_i}$$

where $n$ is the number of sampled schools, $m_i$ is the measure of size for school $i$, and $M = \sum_i m_i$ is the total measure of size.

The school-level participation adjustment for nonparticipation, excluding ineligible schools, is:

$$A^{sc} = \frac{n_p + n_{r1} + n_{r2} + n_{nr}}{n_p + n_{r1} + n_{r2}}$$

where $n_p$ is the number of participating sampled schools, $n_{r1}$ and $n_{r2}$ are the numbers of first and second replacement schools, respectively, and $n_{nr}$ is the number of nonresponding schools.

Then finally the first stage weight for the school, corrected for nonparticipating schools, is:

$$FW_i^{sc} = A^{sc} \cdot BW_i^{sc}$$


The classroom weight helps adjust for classrooms in a school that do not participate or where the participation rate of the students in a classroom is less than 50%. The basic second stage weight for all sampled classrooms is calculated as follows:

$$BW_i^{cl} = \frac{C_i}{c_i}$$

where $C_i$ is the total number of classrooms in school $i$ and $c_i$ is the number of sampled classrooms; according to the TIMSS standard, $c_i$ is 1 or 2 classrooms per school, but some countries include all classrooms in the sampled school, so it could be 3 or more. There is also a classroom nonparticipation adjustment, which adjusts the classroom-level weight for nonparticipating classrooms and for classrooms where the student participation rate is below 50%:

$$A_i^{cl} = \frac{c_i}{c_i^{p}}$$

where $c_i$ is the number of sampled classrooms in school $i$ and $c_i^{p}$ is the number of participating classrooms in school $i$. The final second stage weight for all sampled classrooms in the school is:

$$FW_i^{cl} = BW_i^{cl} \cdot A_i^{cl}$$

The student weight adjusts for sampled students who did not participate in the test. In this third stage the weight calculation gets a bit more complicated; interested readers are referred to Joncas (2007). For the purpose of this thesis only the final weight is stated:

$$FW_{ij}^{st} = A_{ij}^{st} \cdot BW_{ij}^{st}$$

where $A_{ij}^{st}$ is the student nonparticipation adjustment for classroom $j$ in school $i$, and $BW_{ij}^{st}$ is the basic third stage weight for classroom $j$ in school $i$.

To summarize the QA and QC process of TIMSS, let us start with the sampling design. TIMSS has specified the study population clearly and set standards for countries to use multistage probability sampling. When the sampling procedure used by a country differs from the stated standard, the procedure is not approved; a good example is TIMSS 2007 in Morocco and Mongolia. TIMSS puts a lot of effort into assuring a quality translation and uses statistics from the field results to see if the items are comparable across countries. Nevertheless, TIMSS's translation method is back translation, which is a less favored method both in the 3M guidelines and in most other current literature. When it comes to pretesting, TIMSS field tests all the questionnaires even though most of the questions are adapted from the previous assessment. To adjust for nonresponse, TIMSS uses a very complicated weighting process. Generally, TIMSS seems to follow some of the major QA and QC approaches, as briefly summarized here, but not much detailed documentation is publicly available.
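To make the three weighting stages described above concrete, here is a minimal sketch with invented numbers (not official TIMSS code; the function names and toy values are ours) showing how the stage weights multiply into a student's final weight:

```python
# Toy illustration of TIMSS-style three-stage weighting.

def school_weight(M, n, m_i, n_p, n_r1, n_r2, n_nr):
    bw = M / (n * m_i)                             # basic first stage weight
    adj = (n_p + n_r1 + n_r2 + n_nr) / (n_p + n_r1 + n_r2)
    return bw * adj                                # corrected school weight

def classroom_weight(C_i, c_i, c_part):
    return (C_i / c_i) * (c_i / c_part)            # basic x nonparticipation

def student_weight(students_sampled, students_part):
    # simplest case: equal within-class probabilities, adjusted for absences
    return students_sampled / students_part

fw_sc = school_weight(M=100_000, n=150, m_i=400,
                      n_p=140, n_r1=5, n_r2=2, n_nr=3)
fw_cl = classroom_weight(C_i=4, c_i=2, c_part=2)
fw_st = student_weight(students_sampled=28, students_part=25)
print(fw_sc * fw_cl * fw_st)   # overall weight for each participating student
```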

5.3. World Values Survey (WVS)
WVS started in 1981, in collaboration with the European Values Study (EVS), to study changes in social and political life as a result of changing values, using the input of a worldwide network of social scientists. WVS also analyzes the influence of global cultural change on economic development, creativity, quality of life and democracy. The participating countries span from the very poor to the very rich and cover all main cultural zones. So far WVS has covered almost 90% of the world's population in its five Waves (1981-2007). The results are used by governments around the world, by international organizations and institutions, and in general by anyone who would like to study the change of values in the world over time (http://www.worldvaluessurvey.org/).

- Study, Organizational, and Operational Structure: WVS has a central body called the World Values Survey Association that coordinates the network of the world's social scientists (http://www.worldvaluessurvey.org/).
- Survey quality: There is a large variation in response rates between the participating countries. For instance, Slovakia had a 95% response rate in the fourth Wave and the UK had 80%, whereas countries such as Spain and the Netherlands had 24% and 39%, respectively. A few countries did not provide any information on response rates at all. This big difference in response rates creates a potential for nonresponse bias. In any case, there is no information on how the low response rates are taken care of, whether weighting has been used to adjust for nonresponse, or how this big range of response rates between countries is managed. As a result comparability is compromised (de Weerd et al., 2005).


There is also no clear information on the type of translation method used, except that some countries use back translation for control purposes. No best translation method is stated to be in use, which has a large effect on the data quality.
- Sampling design: The target population of WVS is adults aged 18 and above, with no upper age limit. In countries like Sweden, Iceland, Romania, and Slovenia, however, the top age limit has been set to 75 or 80, which could cause incomparability of the survey data. Every participating country has its own sampling procedure depending on the type of registers in use, but looking at the previous Waves of WVS, multistage probability sampling is mostly used (Inglehart et al., 2004).
- Questionnaire design: The source questionnaire is designed in English, with questions contributed by social scientists from all the participating countries. After the overall design of the questionnaire, it is translated into various languages. To gather as much information as possible, the questionnaire has been modified from time to time since the start of the survey, by including more meaningful questions and excluding the less useful ones (Inglehart et al., 2004). The source questionnaire, also called the core questionnaire, is designed in such a way that countries can add their own country-specific questions. However, some countries did not even fully use the core questionnaire (did not ask all its questions), and in some, items from the core questionnaire were replaced or recoded, which leads to measurement error and incomparability (Weerd, Gemmeke, Rigter, and Rij, 2005).
- Translation: As a standard, WVS recommends that the questionnaire be translated into the different languages and then checked for accuracy using back translation. But there is no clear translation method stated, except that some references mention that a few participating countries have used the back-translation method (Weerd et al., 2005). All in all, no best translation method is stated to be in use, which is a basic requirement for comparable data.
- Data collection: The main mode of administration is face-to-face interviewing, except in very remote areas where telephone interviewing is necessary. Using the fixed rules and procedures from the central team of WVS, social scientists from academic institutions in the participating countries act as Principal Investigators, responsible for conducting the survey (Inglehart et al., 2004).


In general, when we try to summarize QA and QC in WVS there is not much to say, since there is no documentation on many of the survey aspects. For instance, there is no information on how the nonresponse issue is taken care of, even though there are large differences between the response rates achieved in countries. Again, not much is stated about the translation method used or whether any of the survey instruments were pretested. One other big quality problem of WVS is that some countries replaced or recoded items from the core questionnaire in the process of adding country-specific questions. Finally, WVS has a long way to go in providing documentation concerning survey methodology and techniques and survey quality aspects.

5.4. Program for the International Assessment of Adult Competencies (PIAAC)
The Program for the International Assessment of Adult Competencies is a program under the guidance of the Organization for Economic Cooperation and Development (OECD), managed by a group of international organizations from Europe and North America headed by the Educational Testing Service (ETS) in Princeton, New Jersey. It builds on previous experiences of the International Adult Literacy Survey (IALS). The main objective of PIAAC is to study and contrast the fundamental skills and competencies of adults around the world. Its main use is to compare the relationships between individuals' educational backgrounds, their skills and work experience, and their use of information and communications technology. PIAAC's questionnaires and assessment tools are designed to maximize the validity of cross-cultural, cross-national and cross-language studies. The first field test of PIAAC started in 2010, followed by the main study in 2011, and a final report is expected in 2013. In this first cycle 28 countries are participating (Montalvan, 2009, and http://nces.ed.gov/surveys/piaac/).

- Study, Organizational, and Operational Structure: The main purposes of the study are to measure adult competencies in different countries, to assess whether these competencies have any effect on economic growth and overall social performance among the participating countries as well as on individual success, and to measure the performance of education and training systems in generating the required competencies. PIAAC is guided by a Board of Participating Countries (BPC), supported by a Technical Advisory Group (TAG), and managed at the international level by the participating OECD countries.


Each participating country's national authorities are responsible for drawing a sample of adults, translating questionnaires and survey instruments, data processing, and assigning a national coordinator to supervise the execution of the survey (PIAAC Implementation Aspects, 2008). As a summary, Figure 10 shows the objective of PIAAC.

Figure 10: Objective of PIAAC. PIAAC aims to provide an assessment of adult literacy in the information age in the form of an integrated measure of broad literacy competencies: literacy, numeracy, reading component skills, and problem solving in technology-rich environments. Source: Thorn (2008): PIAAC: what is it and what will it tell us?

- Survey Quality: Since there are 28 participating countries, numerous languages and cultures are involved. All participating countries should follow the quality assurance guidelines developed by the OECD group. Examples of items are:
- to maintain high quality translation, countries are recommended to use the current best translation method (TRAPD)
- countries should have at least a 70% response rate
- auxiliary variables should be used for choosing nonresponse adjustment variables
- incentives should be used to increase response rates.
Also, every participating country should hire a national quality control monitor (NQCM), who works full time to monitor the field test from the start to the finish of the data collection. PIAAC's Technical Advisory Group (TAG) has approved the quality control plan proposed by Westat (a research corporation recognized as one of the best research and statistical survey organizations in the world) to trade off quality and cost within the constraints of the survey (PIAAC Implementation Aspects, 2008). But despite all the

quality guidelines and quality plans, as long as there are different modes of data collection (CAPI and PAPI), measurement error is anticipated, and there could also be errors due to coding and processing of the open-ended questions (PIAAC Technical Standards and Guidelines, 2009). Since the final report of the assessment is not out yet, there is not much to say about the data quality except about nonresponse. Figure 11 shows the response rates currently available for 23 countries.

Figure 11: PIAAC response rates in countries participating in round 1. The chart shows response rates (0-80%) for Australia, Austria, Belgium, Canada, Cyprus, the Czech Republic, Denmark, Estonia, Finland, Germany, Ireland, Italy, Japan, Korea, the Netherlands, Norway, Poland, the Slovak Republic, Spain, Sweden, the USA, and the UK (England and Northern Ireland).

Source: PIAAC TAG group.

Evidently, requiring a minimum response rate of 70% seems like a good idea. Except for Sweden and Spain, with 45% and 48% response rates respectively, all countries reached more than 50%, but the maximum response rate in this first round of PIAAC is still the 75% obtained in Korea. The rates for France and Russia are still pending.

- Ethical consideration: PIAAC has a list of standards and guidelines on how to perform the assessment in a fashion that gives respondents brief knowledge about the study in advance and lets them freely decide whether or not to participate. It also informs them about all aspects of the ethical considerations and the confidentiality procedure. For instance, according to the PIAAC Technical Standards and Guidelines (2009), staff should be trained on the importance of ethics and scientific rigor in research involving human subjects, and all researchers must respect the free will and privacy of the respondents.


- Sample design: The target population is mainly people living in private households, but resident foreigners and migrants can be included, though only if it is possible to include them without changing the survey instruments. The suggested age span goes from 16 to 64 years, but it can be changed according to the participating countries' preferences. As a PIAAC sampling standard, each participating country is advised to use stratified multistage clustered area sampling in order to get maximum precision. Nevertheless, depending on the type of registers in use or the geographical setting, the sampling method may differ, as long as it is a probability sampling method (PIAAC Technical Standards and Guidelines, 2009).
- Questionnaire design: The questionnaire is designed in English as a paper-and-pencil (PAPI) and a computer-assisted (CAPI) mode instrument. The questionnaire contains core task questions, main task questions, background questions (BQ) and country-specific questions. The background questionnaire is designed to identify what kinds of skills participants usually use, either at their job or in their private life, how they obtained these skills, and how these skills are distributed within society. In general the BQ contains scales that measure a psychometric property, namely the use of skills both at work and in everyday life. Also, when applicable, country-specific questions are included in the BQ (http://nces.ed.gov/surveys/piaac/questionnaire.asp). The aim of the core task questions is to assess participants' skills in information use and communication technology, constructing knowledge, communication with other people, and more. The country-specific questions are a limited number of questions, with no impact on the core questions, included by a participating country after approval from the central office.
- Translation: Each participating country is responsible for translation and adaptation of the assessment materials. PIAAC's standards and guidelines recommend the currently most used translation method, TRAPD, with double translation by two independent translators and a third translator carrying out reconciliation. In addition, some of the world's most renowned translation experts, such as Dr. Dorothée Behr, participate in assessing the quality of the translation (Behr, 2012).
- Interviewer recruitment and training: Interviewers are recruited and trained by the national project managers (NPM) and national survey institutes. At the completion of the training,


the NPM and national survey institutes are responsible for finalizing the training report, which is prepared by Westat. After their training, interviewers get the chance to practice during the field test (Montalvan, 2009).
- Pretesting: PIAAC has set standards, guidelines, and recommendations that the participating countries should follow when performing the field test, including the requirements that every participating country must conduct a field test before the main study and that the minimum required sample size is 1,500 from the target population. The pretesting covers the draft questionnaire, the translated and adapted questionnaires and mode of data collection, the sampling process, interviewer administration, data capture, data processing, and data delivery. At the end of the pilot, countries are expected to report their field test findings.
- Data collection: The data are collected using three different methods. Respondents answer the questions from their homes via laptop computers, some are interviewed via computer-assisted personal interviewing (CAPI), and for those who have difficulty with CAPI, paper-and-pencil data collection (PAPI) is used. Also, when the sampling frame is random digit dialing (RDD), the data collection mode will be face-to-face interviewing. Generally, depending on what kind of registers the participating country has, the data collection method can vary, as long as the final data are comparable.

In conclusion, PIAAC has a good infrastructure when it comes to QA and QC, starting with the organization of the central coordinating team and various other groups to facilitate the overall survey quality. PIAAC uses more or less the same nonresponse handling strategy as ESS, and TRAPD is used for the translation process. Probability sampling is a must, as is field testing of the survey instruments. All in all, to provide quality survey data PIAAC is more or less following the ESS track. But since the data from the first wave have not been released yet, it is hard to see the effect of these well structured standards.

5.5. International Adult Literacy Survey (IALS)
The International Adult Literacy Survey (IALS) was the first large-scale comparative survey of literacy, established in 1994 by the OECD in collaboration with Statistics Canada and some U.S. national statistical research agencies. The first cycle of IALS aimed at understanding the distribution of adult literacy within the participating


countries and what should be done to improve literacy. The first IALS was undertaken by nine governments and three intergovernmental organizations (Kirsch and Murray, 1998).

- Study, Organizational, and Operational Structure: The first International Adult Literacy Survey was carried out with two main purposes: 1) to develop scales (psychometric aspects) that can be used to compare the literacy performance of adults with wide ranges of abilities, and 2) using the first goal, to compare literacy skills of adults across cultures, languages and countries. The standardized element of the IALS assessment across all the participating countries was to draw a probability sample of the general population between ages 16 and 65. The samples were recommended to be sufficiently large to give 3,000 complete cases. Regardless of this requirement, the total number of complete cases differed from country to country; some countries were unable or unwilling to provide the sufficient number of completed cases, while some succeeded in gaining even more than 3,000 (Hamilton and Barton, 2000).
- Survey Quality: The French authorities questioned the quality of the survey results in terms of the appropriateness of the study instruments, the sampling procedures, and the population estimates. The French authorities were not pleased with the quality report from Statistics Canada and the other consortium members, so in October 1995 France's assessment results were withdrawn. After this incident IALS developed a quality framework for participating countries to use when evaluating data quality. Despite all the efforts to minimize survey errors, IALS had a large sampling error, since most participating countries could not afford large samples. In the first IALS cycle there were three countries (France, Germany, and Switzerland) that used nonprobability sampling at one stage of their sampling process. As stated in the first IALS technical report, this particular incident did not generate a significant bias in the survey estimates, but it is also stated that the quality assurance procedures for the sampling design in these three countries were not satisfactory (Darcovich, 1998). Still, based on the first IALS technical report, there is not much evidence of any considerable nonresponse bias. But the response rate differed from country to country; for instance, Poland led with a 75% response rate while the Netherlands got only 45%. When conducting comparative surveys like IALS, having a high quality translation is very important. However, as Guerin-Pace and


Blum (2000) stated, no precise accuracy criterion was defined and/or back-translation was not done to check the translation quality. Even if back-translation is not the best method, according to Goldstein and Wood (1989) it is an essential procedure for checking translation quality. IALS is subject to nonsampling errors like any other survey, and their magnitude can in fact be greater than in other surveys. For instance, there are deviations of different kinds from prescribed data collection procedures, and errors of logic. There are scoring errors, especially with the open-ended tasks. Also, since the participating countries are responsible for data capture, data processing and coding, IALS is more vulnerable to the above-mentioned errors than other surveys. The main reason is that data are collected and processed by the participating countries independently, which means that countries might be unwilling, or lack sufficient resources or technical expertise, to comply with requirements. Thus the comparability and quality of the resulting estimates are reduced (Kirsch and Murray, 1998).
- Ethical consideration: IALS states in the Survey Operations Division Policy section of its technical reports that respondents are under no legal obligation to participate, and that all information with the potential of exposing the respondents should be kept confidential.
- Sampling Design: According to the sampling procedure of IALS, each participating country is required to use high quality probability sampling of the non-institutionalized, civilian, adult population aged 16 to 65 years, with the smallest possible exclusion percentage. The required sample size was moved down from 3,000 to 1,000 so that all participating countries would pass the minimum sample size requirement. Even before the start of the pilot study, participating countries are asked to send their sampling plans to Statistics Canada to make sure they meet the international criteria (Darcovich, 1998).
- Questionnaire Design: There are three types of questionnaires used in the IALS assessment: a background questionnaire (BQ), a set of core literacy questions/tasks for screening respondents with very limited literacy skills, and a booklet filled with literacy tasks mostly containing open-ended questions. The BQ is used to collect respondents' demographic characteristics, educational experiences, labor market experiences, and literacy-related activities. Countries have the privilege to add questions to the background


questionnaire, but are at the same time advised to limit the respondent burden, and have to send any questions that they want to add to the central office for review and approval before the pretesting begins (Darcovich, 1998).
- Translation/adaptation: All countries are expected to translate all survey instruments (task booklets, task administration guides, scoring rubrics, and the BQ). In this process some adaptation can take place when necessary. A panel of linguists together with Statistics Canada reviewed and finalized each country's translated and adapted questionnaires. Nevertheless, no preferred translation technique was stated, nor a recommended translation quality check (Guerin-Pace and Blum, 2000).
- Interviewer recruitment and training: Each IALS participating country was responsible for recruiting and training interviewers using the IALS interviewer training manual. The countries decided how many interviewers to hire, the interviewers' salaries and the length of the fieldwork period (Murray, Kirsch, and Jenkins, 1998).
- Pretesting: After the approval by the panel of linguists and Statistics Canada of all the translated and adapted questionnaires, each participating country was required to conduct a pilot survey.
- Data collection: IALS collected the required information using the background questionnaire and assessment booklets. The background questionnaire was administered by trained interviewers, while the core literacy tasks and the main assessment booklets were completed by the respondent. Most of the questions in the booklets are open-ended questions that measure the three literacy domains: prose, document, and quantitative literacy skills. Each participating country was given an administration manual, survey instruments, data processing procedures, and many more guidelines on every aspect of the survey to follow. Despite all the manuals and guidelines, the participating countries implemented their own data collection methods to varying extents. As stated in the first IALS technical report, different countries tried different methods to increase response rates. For instance, Sweden and Germany used incentives to reduce nonresponse, and three countries (Germany, Switzerland, and Poland) changed the doorstep introduction of the assessment booklet from "an assessment of the respondent's literacy skills" to "a review of the quality of published documents". There were also differences in the interviewers' payment that probably made the response rate differ from


country to country. Each participating country used different methods to weight their samples (Darcovich, 1998; Darcovich and Murray, 1998).
- Data Harmonization/dissemination: After the quality check of the processed data from the participating countries, all the datasets were compiled without any changes into the international data package and made available to users. But some datasets can only be accessed through the data-owning countries; for example, the dataset of Australia can be accessed only through the Australian Bureau of Statistics, for confidentiality reasons (http://hdl.handle.net/10573/41351).

Finally, as a short summary of QA and QC in IALS, France's withdrawal from the first IALS was a wake-up call that led to the development of a quality framework for all participating countries. Nevertheless, there were several quality issues: a large sampling error, the use of nonprobability sampling at some stage of the sampling procedure, the lack of a good method for checking translation quality, and the use of different modes of data collection with no stated effort made to handle the mode effect. These are a few of the QA and QC problems in IALS; a few more are stated towards the end of this thesis.

5.6. Survey of Health, Ageing and Retirement in Europe (SHARE)
The Survey of Health, Ageing and Retirement in Europe was launched in 2004 in eleven European countries. SHARE's first Wave was based on the model of the English Longitudinal Study of Ageing and the U.S. Health and Retirement Study. In the second Wave of SHARE some modifications were made so that the study became longitudinal, two new health measurements were added, and an "end of life interview" was initiated to help follow the participants' lives until their death. In Wave 3, SHARE appears as a project called SHARELIFE, with the goal of constructing comprehensive life histories. In the third Wave a new questionnaire with five study areas was used: children, partners, accommodation, work, and health (Börsch-Supan and Jürges, 2005, and http://www.share-project.org/).

- Study, Organizational, and Operational Structure: The main objective of SHARE is to provide enough information for researchers to study how the quality of life is influenced by ageing, both individually and for societies. SHARE has standardized the survey structure for each participating country. Countries must use the centrally developed CAPI


to collect the survey data. This helps each country to use exactly the same metadata and routing. On the other hand, some things need to be localized, such as the sampling frame, sampling design, and language (Börsch-Supan, Hank et al., 2010).
- Survey quality: There is not much information about the survey quality, but as in any other household survey SHARE has both item and unit nonresponse issues. For instance, in Wave 4 of SHARE-Germany there was considerable item nonresponse for the income-related questions (Blom and Korbmacher, 2011); see Table 4.

Table 4: Mean expected income response rate in Wave 4 of SHARE-Germany

Expected income response rate (%)
Mean                    69.7
Standard deviation      19.3
N (codeable answers)     158

Source: Blom and Korbmacher (2011).

This result is only valid for the codeable answers, which had a response rate of 83%. There was also a small amount of item nonresponse among the answers that were not codeable. To decrease unit nonresponse, SHARE countries use different kinds of incentives, except for Denmark, where incentives are not allowed. In addition, weighting methods are used to handle unit nonresponse. SHARE recommends imputation to replace missing data. Imputation is a process of replacing missing values with constructed values; for instance, the observed values of other, similar respondents can be used instead of the missing values. Two types of imputation are used in SHARE. Single imputation imputes a single value for each missing one, to obtain a complete dataset. The other type is multiple imputation, which helps trace the distribution of possible values, conditional on all the available sample information, even if it does not give the best point estimate (de Luca and Lipps, 2005, and Klevmarken et al., 2005). These are the methods SHARE uses to handle nonresponse.
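As an illustration of the multiple imputation idea, the following minimal sketch (simulated income data, not SHARE's actual imputation model) creates several imputed datasets with a simple hot-deck style donor draw and combines the results with Rubin's rules:

```python
import numpy as np

# Multiple imputation sketch with Rubin's rules for combining estimates.
rng = np.random.default_rng(1)
income = rng.lognormal(10, 0.5, size=500)
missing = rng.random(500) < 0.2              # ~20% item nonresponse
income[missing] = np.nan

M = 5                                        # number of imputed datasets
estimates, variances = [], []
observed = income[~np.isnan(income)]
for _ in range(M):
    filled = income.copy()
    # hot-deck style draw: sample donors from the observed values
    filled[np.isnan(filled)] = rng.choice(observed, size=missing.sum())
    estimates.append(filled.mean())
    variances.append(filled.var(ddof=1) / len(filled))

# Rubin's rules: combine the M point estimates and variances
q_bar = np.mean(estimates)                   # combined point estimate
w_bar = np.mean(variances)                   # within-imputation variance
b = np.var(estimates, ddof=1)                # between-imputation variance
total_var = w_bar + (1 + 1 / M) * b
print(f"mean income: {q_bar:.0f} (s.e. {np.sqrt(total_var):.0f})")
```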

- Tenders, Bids and Contracts: The primary funder of SHARE is the European Commission, but additional funding is provided by the US National Institute on Aging. There is also other funding from the participating countries' research organizations to facilitate the SHARE data collection. For instance, data collection in Austria was funded


by the Austrian Science Fund, in Belgium by the Belgian Science Policy Office, and in Switzerland by BBW/OFES/UFES (Börsch-Supan, Hank et al., 2010).
- Sampling design: The target population for SHARE is the non-institutionalized population aged 50+. Most of the countries use stratification by age. In some countries a multistage sampling design is used, depending on what kind of registers they use. For instance, Germany's and the Netherlands' registers of individuals are administered at the regional level, so in these cases two-stage or multistage sampling methods are used. In Austria, Greece and Switzerland, where there was no way of using population registers, telephone directories were used as sampling frames, followed by pre-screening for eligible sample participants. Under these circumstances it is hard to use simple random sampling; instead complicated multistage sampling designs were used.
- Questionnaire design: The main questionnaire for all three Waves of SHARE was developed in English and, after being piloted and reviewed, translated into the languages of the participating countries. Before the final design of the questionnaire, it had to pass four stages of extensive piloting and pretesting. The baseline Wave questionnaire includes 20 modules on health, socio-economics and social networks. In the second Wave of SHARE there was some adaptation regarding the country-specific questions. Since there were changes in some participating countries during SHARE's Wave 3, for instance institutional changes and health care reforms, a new questionnaire was designed, structured so that the questions form a Life History Calendar (LHC) (Börsch-Supan, Hank et al., 2010, and http://www.share-project.org/).
- Adaptation: In SHARE Wave 2, since the survey became longitudinal, the survey software was adapted to fit the longitudinal format. Also, because of policy changes in some of the participating countries, country-specific parts of the questionnaire were adapted. When Israel joined the assessment in 2005 and 2006 there was a complex adaptation of the SHARE survey instruments in order to include the three major groups of people living in Israel (Jewish, Arab, and immigrants from the former Soviet Union) (http://www.icpsr.umich.edu/icpsrweb/NACDA/studies/22160/detail, and http://www.share-project.org/).


- Translation: Even though each participating country is responsible for questionnaire translation, SHARE's central coordinators support the translation process, starting with providing guidelines and ensuring cross-national comparability of the translated questionnaires. The translation guidelines include the current best translation practice, the team translation method (TRAPD). There are also professional translators who review a sample of the first draft translations. In addition, SHARE provides the Language Management Utility (LMU) software to every participating country, so that they can check their versions of the translated instrument against other countries' translations and against the main instrument (Harkness, 2005).
- Pretesting: The draft questionnaire for Wave 1 was piloted in the UK in 2002, with the help of the National Centre for Social Research. The questionnaire was revised after the first pilot and sent for a second pilot, performed in each of the SHARE participating countries using 75 respondents per country. For the third round, after the revision based on all the SHARE countries' pilot results, the full questionnaire was pretested in 2004 using a probability sample of 100 respondents from each participating country. Finally, an extensive analysis of the pilot and pretest results was conducted before the final design. After this fourth stage of piloting and pretesting the source questionnaire was finalized. Then comes the translation: each SHARE country translates the source questionnaire using the translation guidelines provided by the central coordinators, professional translators review samples of the draft translation, and the translated questionnaire is itself piloted and pretested before being finalized (Börsch-Supan and Kemperman, 2005).
- Interviewer recruitment and training: SHARE contracts the world's oldest and largest academic survey research institution, the Survey Research Center (SRC) at the University of Michigan, to develop the interviewer training manual. SHARE makes sure the interviewer training manual and documentation will ensure high data quality. To facilitate decentralized training in the participating countries, SRC created a training program for use by country-level trainers and provided training for the trainers, called train-the-trainer (TTT). During the preparation of the SHARE pilot survey the interviewers first get a complete general interviewer training. SRC collaborates with CentERdata of the Netherlands on the computer-based tutorial on the SHARE case


management system (CMS), and with the Mannheim Research Institute for the Economics of Aging (MEA) on training evaluation and on protocol development and implementation. The training for both the pilot and the main survey of SHARE is short, mostly two days (Alcser and Benson, 2005).
- Data collection: The main data collection mode is CAPI. In order to avoid country-specific CAPI programming errors and problems, and also to save time and money, every participating country uses the same software package for administering the questionnaire (de Luca and Lipps, 2005).
- Data dissemination: SHARE data use is subject to European and national data privacy laws. The survey data is disseminated freely to the entire scientific community, but a signed declaration of confidentiality from the European Commission is needed to protect confidentiality.

To summarize briefly: like any survey, SHARE suffers from both item and unit nonresponse, and the following methods are used to handle it: imputation, incentives, and weighting (the first and last are sketched below). For translation, SHARE uses the team translation method in addition to the Language Management Utility (LMU) software. There is also a coordinating team that supports and follows the survey process. To handle mode effects, SHARE allows the use of only one software package during the CAPI data collection. All in all, SHARE is well organized and puts a lot of effort, including the use of paradata, into increasing the quality of the survey data.
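The two statistical tools just mentioned can be sketched as follows. This is a minimal illustration, not SHARE's actual procedure; all field names and adjustment classes are hypothetical:

```python
# A minimal sketch (not SHARE's production code): weighting-class adjustment
# for unit nonresponse and hot-deck imputation for item nonresponse.
# All field names and adjustment classes are hypothetical.
import random
from collections import defaultdict

def nonresponse_adjust(sample):
    """Inflate respondents' base weights by the inverse response rate
    within each adjustment class; return the responding units."""
    by_class = defaultdict(list)
    for unit in sample:
        by_class[unit["cls"]].append(unit)
    for units in by_class.values():
        rate = sum(u["responded"] for u in units) / len(units)
        for u in units:
            if u["responded"]:
                u["weight"] = u["base_weight"] / rate
    return [u for u in sample if u["responded"]]

def hot_deck(respondents, item, cls="cls"):
    """Fill a missing item value with a randomly chosen donor value
    from a respondent in the same class."""
    donors = defaultdict(list)
    for r in respondents:
        if r.get(item) is not None:
            donors[r[cls]].append(r[item])
    for r in respondents:
        if r.get(item) is None and donors[r[cls]]:
            r[item] = random.choice(donors[r[cls]])
    return respondents
```

A real production system would, of course, use model-based imputation and much richer auxiliary information; the sketch only illustrates the logic of the two adjustments.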

5.7. International Social Survey Program (ISSP)
ISSP was established in 1984. Initially there were six participating countries and four general social survey institutions as founding members: the National Opinion Research Center (NORC) at the University of Chicago, the Centre for Survey Research and Methodology (ZUMA) in Mannheim, Germany, the National Centre for Social Science in London, and the Research School of Social Sciences, Australian National University, Canberra. ISSP is a collaborative survey program with members from all over the world and an annual module on a topic important for social science research. Currently ISSP has more than 46 members and 28 participating countries (Skjåk, 2010).


- Study, organizational, and operational structure: The study focuses on different social issues, such as social inequality, national identity, religion, and the role of government. ISSP established a methodological committee in 1993 to handle the increasing diversity and complexity of the ISSP; the committee also gets help from a Method Working Group. The main goal of ISSP is to combine important social science research topics with cross-national, cross-cultural and cross-language study perspectives. In addition, ISSP should design questions that are relevant in different countries (Usher, 2000).
- Survey quality: ISSP has established a special committee called the ISSP Methodological Group. The main duty of the methodological committee is to handle nonresponse, background variables, questionnaire design, translation and the mode of data collection. According to the monitoring report of 2007, out of 46 countries that answered the study monitoring questions on translation pretesting, 14 did not evaluate or pretest the translated questionnaire, and countries like Finland, Germany and Sweden reported translation problems in the questionnaire. As for sampling techniques, apart from different lower and upper cut-off age limits (Finland's lower and upper age limits are 15 and 80, respectively, while Norway starts at age 19 and ends at age 74), all countries used probability sampling methods to select their respondents. Even though telephone interviews are not permitted in ISSP, in the 2007 ISSP about 19% of the interviews in the U.S. were collected by telephone. To increase the response rate ISSP uses incentives, mixed mode, and call-backs. In the 2007 ISSP, 20 out of 34 participating countries applied post-stratification to adjust for remaining nonresponse bias (Scholz and Heller, 2009).
- Tenders, bids and contracts: There is no central funding for ISSP members or participating countries. All member countries have to cover their own costs (Usher, 2000, and Skjåk, 2010).
- Sampling design: The sampling design requirement in ISSP is to draw a nationally representative random sample of the adult population. Each participating country should clearly state what kind of sampling procedure is used in the survey (Scholz and Heller, 2009, and Skjåk, 2010).


- Questionnaire design: The source questionnaire is in English and is translated into different languages. Before the questionnaire is designed, topics are nominated and voted on at an annual general meeting, and a drafting group is then elected by the general meeting from member countries representing different regions and cultures. Any new questionnaire is piloted prior to the final discussion, at which members vote on individual questions. In general, ISSP has made it its main goal to design a questionnaire that is relevant in all participating countries and maintains equivalence in translation (Skjåk, 2010).
- Translation: According to the ISSP guidelines, back translation is recommended as a quality control method and every participating country is responsible for translating the source questionnaire. The ISSP translation group is, however, trying to make the translation process interactive, with the questionnaire developers and the actual translators working together on problems, feedback, and decisions for a better quality translation. The group also ensures the continuity of this interactive translation process (Skjåk, 2010, and Harkness et al., 2010).
- Pretest: Part of the drafted questionnaire is piloted before the drafting group presents the final draft questionnaire. After the draft source questionnaire is finalized it is pretested again. The translated target questionnaires are also piloted, to ensure translation quality and to make sure that the question order, question format, and response categories are the same as in the source questionnaire. The data collection mode also gets pretested before the actual assessment (Scholz and Heller, 2009, and Skjåk, 2010).
- Data collection: The mode of administration in ISSP is self-completion questionnaires, but in a few countries face-to-face interviews are used because of infrastructure and illiteracy issues (Skjåk, 2010). In some countries mixed mode is used to increase the response rate, for instance in Flanders, Germany, and Great Britain, where the background questionnaire is administered face-to-face while the main ISSP assessment is collected by self-completion. In Austria and South Korea an introductory telephone call was used prior to the main assessment. Some countries use advance letters, and some, such as Sweden, conducted their surveys by mail. Most of the participating countries used incentives to reduce nonresponse; some used incentives for both interviewers and respondents to increase the response rate (Scholz and Heller,


2009). To elaborate more on the methods used by participating countries to increase response, let us look at Table 5.

Table 5: Procedures used to increase response in ISSP

Call-Backs*                     90%
Left Letters, Booklets, etc.*   59%
Intro Letter/Booklet*           55%
Interviewer Bonuses*            52%
Intro Telephone Call*           45%
Uses Converters*                35%
Incentives to Respondents*      24%

*Applies only to face-to-face surveys; n = 29-38 member countries. Source: Survey Research Methods (2007), http://w4.ub.uni-knstanz.de/srm, vol. 1, no. 1, pp. 45-4.

Table 5 shows results obtained from the ISSP Nonresponse Questionnaire prepared by the ISSP Nonresponse Committee. In this 2005 ISSP assessment, 29 member countries used face-to-face interviews and 9 used mail surveys. As the table shows, call-backs are used in almost all countries, 90%. Of these, around 58% used call-backs both for refusal conversion and to contact respondents, while 42% used them only for noncontacts. 55% of the countries sent an introduction letter and booklets prior to the face-to-face interview. Of the interviewer bonuses, 32% were bonuses for meeting a target number of complete cases, 25% for taking on difficult assignments, 14% for converting refusals, and 22% for some other reason. All in all, incentives are the least favored method and call-backs the most favored (Smith, 2007). Most of the participating countries used different weighting methods to adjust for unit nonresponse; some, such as Australia, Cyprus, the Czech Republic and Germany, did not use any weighting to adjust for nonresponse in ISSP 2005 (ISSP, 2005). The important point here is that when there is no strong central authority that prescribes certain requirements, one has to find out what is going on in the different countries. One can do that by using a questionnaire.
- Data harmonization and dissemination: Starting in 1997, harmonizing the national data into an internationally comparative data set and disseminating it to the scientific community was done by the Zentralarchiv (Zentralarchiv für Empirische Sozialforschung at the University of Cologne) in cooperation with ASEP (Análisis Sociológicos, Económicos y


Políticos). The harmonized international data sets are given to the participating countries' data archives by the Zentralarchiv. Clients can access the data sets from the national data archives, according to the agreements within the international network of data archives, via CD-ROM, diskettes, or File Transfer Protocol (Usher, 2000).

Summarizing ISSP briefly: it has methodological groups to handle major survey quality problems, and it uses different methods to handle nonresponse, for instance weighting (in most countries), call-backs, incentives, and mixed mode. All survey materials go through a field test before the main assessment. Back translation is still in use. ISSP has developed good guidelines, but past reports show that the countries have a tendency not to follow them, and there is not much information on how that is handled. All in all, this shows us that having a good standard is not enough; one must also make sure the specifications are justified and adhered to.

5.8. European Working Conditions Survey (EWCS)
The European Working Conditions Survey (EWCS) is a vehicle for qualitative and quantitative research on different aspects of working conditions in the European Union. EWCS was started by the European Foundation for the Improvement of Living and Working Conditions (Eurofound) in 1991. Since then it has been conducted every five years, with the aim of measuring working conditions across Europe, analyzing the associations among different features of working conditions, monitoring trends over time, identifying groups at risk, issues of concern and signs of progress, and contributing to European policy development (http://www.eurofound.europa.eu/ewco/surveys/).

- Survey quality: As clearly stated in the EWCS technical reports, it is important for EWCS to provide high quality data. Each stage of the study is well documented and each detail is thoroughly followed up in order to detect any faults. Quality controls performed by EWCS (5th EWCS Technical Report) include the following:
- In order to preserve the trend, most of the questions have not been changed since the first Wave, and before the source questionnaire is finalized it is pretested and piloted by Gallup in collaboration with Eurofound and national institutes.


- Every translated questionnaire goes through several layers of the translation process, followed by an extra quality check in which the new and trend questions are proofread by EWCS. To verify the translation further, it is followed by pretest interviews, and the results of the pilot interviews are evaluated and reported to Eurofound for approval.
- Any sampling method used must be approved by Gallup.
- To minimize bias due to coverage, nonresponse, and unequal selection probabilities, two types of weighting were used: design weighting and model-based post-stratification.

In general, questionnaire development, translation and verification go through a multi-layered process that includes pretesting at each and every step. In addition, the questionnaire is translated by independent translators, after which back translation is used by the national survey organizations. The validation process, including translation validation by experts, proofreading, pretesting, and revision based on the results of the pilot, is handled by the national correspondents of the European Working Conditions Observatory (EWCO) for the EU27 countries and Norway (5th EWCS Quality Assurance Report). Looking at the quality measures stated above, and according to the EWCS quality assessment report by Road (2010), the overall survey implementation and quality of EWCS is satisfactory, but, as in most cross-national surveys, EWCS needs to work more on nonresponse. The average response rate for the 5th Wave was 44.2%, which is lower than the 47% of the 4th Wave. As a standard, each participating country is expected to achieve a minimum response rate of 50%. Compared to other surveys like ESS, ISSP, and SHARE, this required response rate is quite low. As is clearly seen in Figure 12, Latvia is in the lead with a 73.5% response rate, whereas Spain, at 31.3%, has the lowest. In general the response rate decreased by 3% in the 5th Wave compared with the 4th Wave. This could happen for different reasons; changes in sampling procedures and data collection mode could be one of them. As Road (2010) states, the sampling frame has a major effect on the response rate. For instance, there was a 7% decrease (on average) in the response rate for the 22 countries that


kept on using random route (a random-walk procedure for selecting households when no population register is available), whereas the 5 countries that started using registers show a 6% average increase in their response rates.

Figure 12. Response rate for the 4th and 5th Waves of EWCS.

Source: Road (2010), Quality Assessment of the 5th European Working Conditions Survey.


To increase the response rate after a failed initial visit, a minimum of three re-visits or re-calls were made. In some countries (for instance Denmark, the Netherlands, Poland, Slovenia, Sweden and Norway) an advance letter is sent prior to the main assessment to increase participation. In Sweden and Norway, in addition to the advance letters, telephone contact is made before the face-to-face interview (5th EWCS Quality Assurance Report).
- Sample design: The target population is non-institutionalized persons aged 15 years and above (some countries sample from age 16, depending on the labor force definition) who are employed at the time of the survey. EWCS uses a register as a sampling frame to select households, but for countries with no register a random route is used. EWCS's samples of eligible individuals are selected using probability sampling procedures. As mentioned, a uniform sampling method is not necessary in a cross-national and cross-cultural survey, so whatever sampling design a participating country follows, it has to be evaluated jointly by Gallup and Eurofound for approval. For the 5th EWCS most countries' target sample size was 1000, but for some a higher target was set, to get a better representation of the working groups and/or at the request of the national government (for instance, 2000 in Germany, 1500 in Italy, 3000 in France, and so on) (Road, 2010).
- Questionnaire design: EWCS's questionnaire is developed by Eurofound and most of its questions have not been changed since the first Wave, in order to preserve the trend. The source questionnaire is in English. The first source questionnaire is pretested in several ways before it is finalized, to make sure that it gives a valid measurement of the perceptions surveyed. The final version of the source questionnaire, after review of the pilot, is called the master questionnaire.
- Translation: The master questionnaire is translated into the national languages using a "5-phase translation process". The master questionnaire is first translated into two independent target questionnaires, from which a draft version is produced. A reconciled version is then developed, followed by a back translation. Finally, a translation check is performed, either by the European Working Conditions Observatory (EWCO) or by experts selected by Gallup and national representatives. For the 5th EWCS Wave, for instance, the translated versions of the questionnaire were validated by EWCO national correspondents in 27 EU countries and Norway, and also by experts


from Albania, Croatia, Macedonia, Kosovo, Montenegro and Turkey. The main goal of including the national representatives in this validation process is to attain functionally equivalent questionnaires. This means that the translated questionnaires have to be able to measure the same concepts in all the participating countries, i.e., respondents in different countries must have the same understanding of the questions.
- Interviewer recruitment and training: As a requirement, Eurofound asks each EWCS participating country to hire interviewers with at least one year of interviewing experience and a minimum of three face-to-face interviews performed. Before the main study there is a pre-interview check, and all interviewers get a half-day training by the national project managers using the EWCS training manual.
- Pretesting: As mentioned above, EWCS pretests every survey instrument prior to the main study. The main objective of pretesting is to detect major problems associated with the survey questionnaire, such as questions that do not read naturally, incorrect filters, or questions that respondents are unable to understand and interpret. Figure 13 shows how intensive and complex the design and pretesting are in EWCS.

Figure 13. Questionnaire development and pre-test

Source: 5th European Working Conditions Survey, 2010. Technical Report.


The figure shows the pretesting steps that the source questionnaire goes through before it is approved by Eurofound.

- Data collection: The data collection mode of EWCS is face-to-face for both employers and employees, carried out as CAPI in some countries and PAPI in others. In a few countries mixed mode is used; for instance, Sweden and Norway make telephone contacts prior to the face-to-face interviews.
- Data processing: The national partners are responsible for processing their data using a detailed guideline. Data processing includes data entry, coding, editing, and weighting. Data collected in CAPI mode is entered directly, but for the other modes the local agencies must follow the data entry template provided by Gallup. To improve the coding process, Gallup has developed documents for both interviewers and coders; the documentation includes the ISCO88 coding manual and coding instructions, plus NACE code descriptions and definitions for the two major variables, occupation and economic activity. As for editing, in CAPI mode the data editing is done along with the data collection, but for face-to-face PAPI there are different levels of editing: first by supervisors in the field, then during data entry using software, and finally by Gallup. To remove bias due to coverage, nonresponse, and unequal selection, two types of weighting were used: design weighting and model-based post-stratification weighting (Road, 2010). The other basic task performed in this process is the design effect analysis. Design effects need to be estimated to determine the effective sample size and thereby the accuracy. In the case of EWCS the design effect was used to estimate the effective sample size at the national level.
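A minimal sketch of how the two weighting steps named above fit together, with made-up inclusion probabilities and population totals (this is an illustration, not the EWCS production procedure):

```python
# A minimal sketch: design weights (inverse inclusion probabilities)
# followed by a post-stratification factor that aligns the weighted
# sample with known population totals. All numbers are made up.
from collections import defaultdict

def design_weights(sample):
    for unit in sample:
        unit["w"] = 1.0 / unit["incl_prob"]  # inverse inclusion probability
    return sample

def post_stratify(sample, pop_totals, key="stratum"):
    """Scale weights so each post-stratum sums to its population total."""
    weighted = defaultdict(float)
    for unit in sample:
        weighted[unit[key]] += unit["w"]
    for unit in sample:
        unit["w"] *= pop_totals[unit[key]] / weighted[unit[key]]
    return sample

# Hypothetical example: two strata with known population totals.
sample = design_weights([
    {"incl_prob": 0.001, "stratum": "urban"},
    {"incl_prob": 0.002, "stratum": "rural"},
    {"incl_prob": 0.001, "stratum": "rural"},
])
post_stratify(sample, {"urban": 600_000, "rural": 400_000})
```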


The design effect in EWCS is formulated using two assumptions: I) unequal selection probabilities are used, depending on the number of eligible persons in the household; and II) nonresponse is unequal across the population, which is corrected using post-stratification weights.
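The exact formulation is given in the 5th EWCS Technical Report. A standard expression consistent with these assumptions is Kish's design effect due to unequal weighting, whose model-based justification is given by Gabler, Häder, and Lahiri (1999):

\[
\mathrm{deff}_w = \frac{n \sum_{i=1}^{n} w_i^{2}}{\left( \sum_{i=1}^{n} w_i \right)^{2}},
\]

where \( w_i \) is the final weight of respondent \( i \) and \( n \) is the net sample size. For clustered designs this is commonly multiplied by a clustering component \( \mathrm{deff}_c = 1 + (\bar{b} - 1)\rho \), where \( \bar{b} \) is the average cluster size and \( \rho \) the intraclass correlation (cf. Gabler, Häder, and Lynn, 2006).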

After the design effect is calculated, the effective sample size can be calculated as

\[
n_{\mathrm{eff}} = \frac{n_{\mathrm{net}}}{\mathrm{deff}},
\]

where \( n_{\mathrm{net}} \) is the achieved (net) sample size.

For more detail on both design effects and effective sample sizes, see the 5th EWCS Technical Report. As reported in the EWCS 5th Wave quality assessment report, the design effects varied between 1.25 and 1.5, with the exceptions of Sweden, Montenegro and Kosovo, at 1.63, 1.51 and 1.51, respectively. These design effect values are used to calculate the effective sample size. The smaller the design effect, the larger the effective sample size and the better the precision; for example, a net sample of 1000 with a design effect of 1.5 has an effective sample size of about 667 (Road, 2010). The summary of QA and QC in EWCS shows that a lot of effort goes into quality translation (the 5-phase translation process is proof of that), intensive pretesting, and questionnaire design. EWCS also uses different methods to increase response and handle nonresponse, for instance a minimum of three re-visits or re-calls per case, weighting, and a minimum required response rate of 50%, and every survey material is pretested. All in all, EWCS applies detailed QA and QC in every phase of the survey, has rich documentation, and the overall survey quality is satisfactory.


5.9. Comparative Study of Electoral Systems (CSES)
CSES was established in 1995, building on the oldest longitudinal survey, the American National Election Studies (NES). It is a joint program of cross-national research among election studies around the world; in other words, CSES is an international collaboration among national election studies. CSES is designed to help scholars perform cross-national and/or cross-level analyses in the field of electoral public opinion. The extent of cross-national collaboration in every stage of the study makes CSES unique compared with other comparative post-electoral studies. All the survey instruments and study designs are developed by the CSES Planning Committee, which includes leading researchers of electoral politics from around the world. In general, the main objective of CSES is to promote international collaboration among national election studies. So far CSES has had three Modules: Module 1, which ran from 1996 to 2001 under the title "Performance of the System"; Module 2, from 2001 to 2006, titled "Representation and Accountability"; and Module 3, from 2006 to 2011, titled "Electoral Choices". All the data are freely available on the CSES website (http://www.cses.org).

- Study structure: CSES is a collective effort of scholars from all over the world. The main funders of CSES are the American National Science Foundation, GESIS, and the University of Michigan. As with any comparative survey, CSES's main objective is to provide high quality comparative data. For this purpose CSES included in its initial planning committee people with experience from comparative surveys such as the European Election Studies (EES), the International Social Survey Programme (ISSP), the Latinobarometer, and the World Values Survey (WVS). This planning committee is responsible for developing the guidelines and methodology for the project (Howell, 2010).
- Survey quality: In order to ensure comparability, CSES's data collection methodology, design and quality guidelines are prepared by members with comparative survey experience from surveys such as EES, ISSP, WVS, and the Latinobarometer. Before the questionnaire design is finalized, the questionnaire goes through several rounds of revision and pretesting in at least three different nations. Concerning questionnaire translation, CSES advises the participants to use modern translation methods and to check the quality of the translation using back translation or other, more modern methods.


After the data collection, CSES follows three major steps to ensure comparable, high quality data (Howell, 2010):

I) Survey data cleaning. In this first step the CSES Secretariat cleans the national data. The cleaning focuses on interviewer errors; coding errors (checking that identical code sets are used and that scales are administered correctly); weighting errors; translation errors (using back translation); outliers and unusual distributions; and sampling errors.
II) Dataset vetting. Following the survey data cleaning and the preparation of the macro data (aggregated or summarized data), the CSES Secretariat starts looking for answers and explanations of the root causes of the problems discovered during the cleaning step. After cleaning the data according to the responses from the collaborators, the Secretariat sends the cleaned survey data back to the respective collaborators for vetting (a process of examining, evaluating and/or reviewing the data for final feedback).
III) Cross-national cleaning. This is the final stage before releasing the data, in which the response distributions are compared cross-nationally to detect any outliers.
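The Secretariat's actual programs are not published in the sources used here; a minimal sketch of the kind of check described in step III, with hypothetical data and an arbitrary flagging threshold, could look as follows:

```python
# A minimal sketch (not the CSES Secretariat's code): compare a question's
# response distribution across countries and flag countries that deviate
# strongly from the pooled distribution. Data and cutoff are hypothetical.
from collections import Counter

def response_shares(codes):
    n = len(codes)
    return {code: cnt / n for code, cnt in Counter(codes).items()}

def flag_outlier_countries(data_by_country, threshold=0.35):
    """Flag countries where the share of any response code differs from
    the pooled share by more than `threshold` (an arbitrary demo cutoff)."""
    pooled = response_shares(
        [c for codes in data_by_country.values() for c in codes])
    flagged = []
    for country, codes in data_by_country.items():
        shares = response_shares(codes)
        if any(abs(shares.get(code, 0.0) - share) > threshold
               for code, share in pooled.items()):
            flagged.append(country)
    return flagged

print(flag_outlier_countries({
    "SE": [1, 2, 2, 3, 1], "DE": [1, 2, 3, 3, 2], "XX": [5, 5, 5, 5, 5],
}))  # -> ['XX']
```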

In general CSES puts much emphasis on survey quality. Howell (2010) states: "attention to quality and comparability is maintained at every stage of the CSES project: study design and planning, selection of collaborators and countries, questionnaire design, data collection, survey data cleaning, macro data preparation, dataset vetting, cross-national data cleaning, documentation and post-release."

- Tenders, bids, and contracts: Social scientists who have a strong interest in and strong ties to academic research, and who are capable of running high quality national surveys, are selected to participate as collaborators in the CSES project.


- Sample design: All election studies must use probability sampling in all stages to get a representative sample of the national population. Quota and replacement samples are not allowed. The minimum required number of respondents is 1000.
- Questionnaire design: The planning committee spends much time coming up with potential themes for any new module. After the theme is approved by the collaborators, the questionnaire design process starts. Most of the questions are adapted from previous national studies.
- Translation: Every country is required to use modern and sophisticated translation methods and also to cooperate on translations into shared languages. Earlier CSES used back translation to check the quality of the translation, but nowadays this is done using TRAPD.
- Data collection: There are three different data collection modes: face-to-face interview, mail questionnaire, and telephone interview (http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03808). The preferred mode among CSES collaborators is face-to-face; 70% of the election studies in Modules 1 and 2 were conducted face-to-face.
- Data processing: First, the collaborators are required to clean their national data. After that the CSES Secretariat does some further cleaning, evaluates the data and produces the documentation. Examples of data cleaning activities are back translation, checking that the code sets used are identical and that scales were managed correctly, making sure weighted variables do not have unusual values, and checking the sample quality and the impact of nonresponse.

All in all, as Howell (2010) states, the main focus of CSES is to maintain quality and comparability. This sounds good, of course, but summarizing the CSES survey quality work briefly: starting with translation, there is no uniform translation method; the CSES guidelines suggest the use of modern translation methods, and some countries use TRAPD and back translation as a quality check. There is no clear information on how nonresponse is handled. Multiple modes are allowed for data collection, but there are no signs of efforts to handle mode effects in CSES. More importantly, there is no documentation on whether this well-designed standard is actually applied by the national surveys. Therefore there is still much work to do before the goal can be reached.


5.10. Gallup World Poll (GWP)
GWP was started in 2005 by Gallup, with the aim of conducting country-by-country studies to measure the well-being of people all over the world. GWP is believed to be the largest ongoing multinational survey. In its first round GWP surveyed 120 countries, representing 95% of the world population aged 15 and above. GWP is funded by Gallup alone (Tortora, Srinivasan, and Esipova, 2010).

- Study structure: GWP started with two main objectives, namely to quantify the current state of well-being of those living in each participating country and to collect additional data of importance in each of six regions of the world. Even though the ultimate goal of GWP is to include every country in the survey, there are a few exclusion criteria, such as countries with very small populations, countries where the national government does not support the survey, and countries with safety issues.
- Survey quality: One of the challenges faced by GWP is the coverage problem. In some countries it is impossible to get complete coverage. For instance, in Sub-Saharan Africa, where more than 70% of the population lives in rural areas with poor road and transportation systems, it is very costly to perform face-to-face interviews. There are also some countries, or parts of countries, that are not safe: some parts of Angola still have landmines and were therefore not included in the study, and the same goes for Northern Uganda in the 2006 and 2007 GWP. There are also areas that are simply hard to reach; in the 2006 GWP in Madagascar, two PSUs had to be removed, since the only way to reach them was by boat. Generally the coverage problem in GWP is in the range of 3% to 5%, and the solution used is replacing the PSUs. GWP also has a strong quality control scheme for interviewers and completed questionnaires. Supervisors not only perform spot checks on at least 5% of completed interviews; they also observe the interview process on the spot for at least two interviews per interviewer, and conduct back-checks on respondent selection, completion of interviews and verification of some answers (for a minimum of 25% of the responses). Finally, all the filled-in questionnaires go through coherence and completeness verification, although there is no information on how this verification process is performed. In Tajikistan, for example, 200 interviews had to be redone, since they failed to pass the


quality control. In the 2006 GWP the item nonresponse rate on one specific question, asking about average monthly household income, was 31% out of 140,000 respondents (Tortora, Srinivasan, and Esipova, 2010), but there is no documentation available on how this item nonresponse was taken care of in the 2006 GWP.
- Sampling design: The target population is persons aged 15 and older. Three different types of sampling frames were in use: telephone number frames, area frames based on census enumeration, and area frames not based on census enumeration. Depending on the sampling frame, different probability sampling methods were applied. For instance, if the sampling frame is telephone numbers, the list-assisted sampling method (a method that uses a telephone frame and a directory listing to produce a simple random sample; a sketch follows at the end of this subsection) is used, and when area sampling is used, stratified multistage random sampling is applied.
- Questionnaire design: Questionnaires in GWP consist of two parts. The first part consists of core questions, which are the same for every country, and the second part consists of regional questions. The regional or country-specific questions are added or excluded as appropriate; for example, in countries like Pakistan, Bangladesh, and Kazakhstan questions about homosexuals are removed from the regional questionnaire. In some cases the national government wants to revise the questionnaire; in China, Laos, Saudi Arabia and Mauritania the government revised and deleted some questions. Therefore, in order to simplify questionnaire design in terms of what to include or not, GWP divided the participating countries by region, such as Sub-Saharan Africa, predominantly Muslim countries, and so on. Comparison is thus done across participating countries using the core questionnaire, since the regional questions vary from region to region.
- Translation: As a standard, Gallup asks countries to use two independent translations, independent back translation (an independent translator and back-translator), and adjudication (a verifier checks whether the two independently translated target questionnaires match the source questionnaire). But there is no information on how the participating countries are handling this.
- Data collection: There are two main data collection modes, face-to-face interviews and Random-Digit-Dial (RDD) telephone interviews, the latter when there is sufficient telephone


coverage, which in this survey means that at least 80% of the population in a specific country owns a telephone.

In general, from what we have seen, GWP mostly faces coverage problems and item nonresponse; a little has been said about the solution to the coverage problem, but nothing has been stated about, for instance, how the 31% item nonresponse in the 2006 GWP was handled. GWP is also one of the surveys with multiple modes of data collection, but unlike ESS it has shown no effort to handle mode effects so that they will not affect the comparability of the data. As a guideline GWP has stated a detailed and good translation process, but how the participating countries apply it is not known. Finally, there is not much documentation on the QA and QC of the survey, and not much information even on the problems that are clearly visible. Therefore more documentation is needed before concluding anything.
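Gallup's actual sampler is not documented in the sources used here; a minimal sketch of the list-assisted idea mentioned under sampling design, with fabricated numbers, could look like this: random digits are generated only within 100-number banks that contain at least one directory-listed number, and a simple random sample is drawn from those banks.

```python
# A minimal sketch of list-assisted RDD (not Gallup's production sampler):
# restrict random-digit generation to 100-number banks that contain at
# least one directory-listed number. All telephone numbers are fabricated.
import random

def listed_banks(directory_numbers):
    """A 100-bank is identified by the first 8 digits of a 10-digit number."""
    return sorted({num[:8] for num in directory_numbers})

def list_assisted_sample(directory_numbers, n, seed=1):
    rng = random.Random(seed)
    banks = listed_banks(directory_numbers)
    sample = set()
    while len(sample) < n:
        bank = rng.choice(banks)                         # pick a listed bank
        sample.add(bank + f"{rng.randint(0, 99):02d}")   # random last 2 digits
    return sorted(sample)

print(list_assisted_sample(["5551230010", "5551239987", "5559870001"], 4))
```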

5.11. Eurobarometer (EB)
EB was established in 1970 by the European Commission to assess public attitudes towards European unification and quality of life. In addition, EB surveys a wide range of special topics, such as health, technology, and the environment. There are four different survey series in EB (http://www.gesis.org/eurobarometer-data-service/survey-series/):
- Standard and Special EB: This series is called the longest-running regular program. The standard topics focus on peoples' attitudes towards European unification, institutions and policies. Special topics include the environment, technology, health and family issues.
- Central and Eastern EB: This series aims at measuring public opinion on Europe and the European Union in Central and Eastern Europe. Annually 20 countries from the region are surveyed. This series is also known for monitoring the economic and political transition in Eastern Europe.
- The Candidate Countries EB: This series was started in 2001 in 13 candidate countries to the European Union. The questions are comparable to the standard series.


- The Flash EB: This is a small-scale survey conducted in all European Union member states. It includes questions from the special topic EB series and is also called the ad-hoc and focus group surveys.

- Translation: In EB, the translation process involves both humans (translators, researchers and field managers) and machines. It also involves three different translation methods: TRAPD, TBRA (translation, back translation, revision, and adaptation), and TBRAP (translation, back translation, revision, adaptation, and pretesting). Figure 14 shows how all of this is coordinated (Schwarzer, 2012).


Figure 14. Translation process of Eurobarometer

[Figure 14 is a flowchart of the EB translation workflow involving the partner institutes, the TNS research and translation teams, and the TNS translators: two separate translators translate the finalized master questionnaire; the main translator analyzes both versions and decides on the final one; a first technical checking follows; the local institute revises; the revised file is prepared for back translation; the back translation is validated and sent for BT revision; a second technical checking is made, changes are made in the questionnaire if necessary, and the field version is approved.]

Source: Schwarzer (2012), DASISH Quantitative Workshop.

- Data collection: Face-to-face interviews are used for the Standard and Special topic EB, but there is no information on what data collection mode is used for the other EB series.

The major problem of EB is a shortage of available documentation. It has a large geographical coverage, covering up to 34 countries in Europe. EB tries to control


quality through good communication, translation, fieldwork monitoring, various quality controls and a good reporting system, but there is no information available to substantiate these points. Therefore there is not much to say about the overall QA and QC in EB.

So far we have seen what the 3M survey guidelines suggest for comparable, high quality survey data, and how the selected 3M surveys conduct their surveys. In the next section we summarize the QA and QC in the selected surveys.

6. The QA and QC system in selected surveys
Going through all the selected surveys, we have learned that all of them try to maximize functional equivalence and achieve comparability through decisions regarding the sample, response rates, the questionnaire, translation, data collection, data processing and documentation. But not all surveys are successful. In this section we will look at what quality measures are in place and what are not, and what can be commended or criticized according to the survey lifecycle. In most of these surveys there are good quality guidelines and process requirements, but the question is how well they are put into action in the participating countries. For instance, in the 1994 IALS survey, despite the given instructions and guidelines, one country told its respondents that IALS is a test rather than a "real" survey, another country failed to calculate the base weights correctly, and one country offered incentives to respondents, which was not allowed according to the survey instructions (Lyberg and Stukel, 2010). In the 2007 ISSP, 14 out of 46 countries did not evaluate/pretest the translated questionnaire, which is a "must do" task in the ISSP guidelines. Also, even though ISSP does not support the telephone mode, in the 2007 ISSP the U.S. collected 19% of the data by telephone (Scholz and Heller, 2009). These problems bring us to the first step in the survey lifecycle: the study, organizational, and operational structure. The surveys need some kind of central coordinating team that closely monitors the survey activities. A good example of this is the European Social Survey (ESS). It has such a team, consisting of different

advisory and consulting groups that coordinate the implementation of processes throughout the survey (http://www.europeansocialsurvey.org). Also PIAAC, TIMSS and IALS have coordinating teams (Lyberg and Stukel, 2010). It seems as if the existence of such a group is a key to good quality. When different languages and cultures are involved in a survey, translation and adaptation become major steps in ensuring that equivalence is maintained between the source questionnaire and the different target questionnaires. Having said that, if we look back at the surveys selected for this thesis, only some of them (ESS, PIAAC, SHARE, EWCS, and ISSP in some of the participating countries) use the current best translation method, which integrates review, pretesting, and documentation (TRAPD). Some surveys, such as WVS, ISSP, and TIMSS, use back translation (BT) to check the quality of translations. However, according to Mohler (2005) and others, BT is not the most favorable method. Among the reasons are that BT creates a loop and as a result consumes time, and that the target questionnaire itself receives little revision, since BT produces two texts in the same (source) language and conclusions about the target questionnaire are drawn from these two texts. Sometimes translation errors can be crucial; for instance, there was a significant translation error in TIMSS-1995 (Solano-Flores, Niño, and Escudero, 2006). Surveys like WVS do not provide much information on what translation methods are in use, except that some participating countries use BT (de Weerd, Gemmeke et al., 2005). As for data collection mode, unlike CSES, which uses multiple modes (face-to-face, telephone, mail), GWP (telephone and face-to-face), IALS (face-to-face and CAPI), and ISSP (mail and face-to-face), the rest (ESS, SHARE, TIMSS, WVS, and EWCS) state in their guidelines that participating countries should use a single mode in order to eliminate mode effects. But some choose different modes or mixed mode depending on funding, the sampling frame and time constraints (Smith, 2010). The 19% of telephone interviews in the 2007 ISSP in the U.S. is a good example. Not having a uniform data collection mode across participating countries can create measurement error, but so far only ESS has shown a serious effort to handle mode effects. If we look at more than 40 years of successful multinational and multicultural surveys' use of sampling methods, we will find that probability sampling methods are

widely employed (Heeringa and O'Muircheartaigh, 2010). For instance, ESS has a clearly specified target population, a minimum required sample size (1000) and effective sample size (1500), and, most importantly, countries must agree to use strict probability sampling methods in order to be part of the final assessment and data release. Besides, there is a strong central coordinating team that works with each national team to ensure that the participants' sampling methods are in accordance with the requirements (http://europeansocialsurvey.org). Also, as reported in TIMSS 2007 for Morocco and Mongolia, action will be taken if the sampling process differs from what is stated in the TIMSS sampling requirements (Jancas, 2007). In general, all the selected surveys use probability sampling as the standard. Depending on the sampling frame, countries have the freedom to choose different probability sampling techniques. However, the countries do not always follow the standard or guideline. There was an incident in the first cycle of IALS where three countries (France, Germany, and Switzerland) used nonprobability sampling in some stage of their sampling process, which is a potential cause of estimation bias (Darcovich, 1998). Questionnaires have to be designed to maximize comparability and minimize measurement error. Harkness, Edwards et al. (2010) mention that reusing and adapting previously used questions in a new questionnaire is a wise thing to do. Mostly, surveys combine the three approaches (reusing, adaptation, and new questions) to increase comparability. For example, every new cycle of TIMSS reuses earlier assessment questionnaires with minor revisions (Erberber, Arora, and Preuschoff, 2011). The same goes for WVS (de Weerd, Gemmeke, Rigter, and Rij, 2005). SHARE also adapted questions in its second Wave, with some changes to country-specific items (Börsh-Supan, Hank et al., 2010). Pretesting is a major step in both single-population and comparative surveys. Every questionnaire should be tested before use, even if all the questions are reused or adapted from previously pretested questionnaires (Harkness, Edwards et al., 2010). Looking at the documentation of the surveys selected for this thesis, all pretest the source questionnaire as well as the translated questionnaires in at least one or two of the participating countries, except WVS, for which there is no documentation on


pretesting at all; the only information for WVS is that de Weerd et al. (2005) mention that not all countries pretest the translated questionnaires. In general, as has been mentioned again and again in this thesis, it is difficult to come up with high quality and comparable survey data, and, depending on the survey topics, resources, and the magnitude of the survey, different surveys concentrate on controlling different specific error sources. For instance, ESS, SHARE, and TIMSS focus on handling mode effects, while in surveys like CSES, ISSP and GWP mode effects are not a main concern even though they use multiple modes. ESS and PIAAC try to control response rate variation by setting a minimum required response rate (70%) in an attempt to minimize nonresponse bias. SHARE and ISSP try to examine the effect of nonresponse bias on data quality and comparability. Also, from what we have seen so far, there is a long way to go concerning QA and QC in 3M surveys. Most of the 3M surveys have good standards and guidelines but are unable to get them executed; the above-mentioned examples from the 1994 IALS and the 2007 ISSP illustrate that problem. A survey like WVS, covering 90% of the population of the world and being one of the oldest, has no documentation on any QA and QC. Therefore there is a big gap to fill in QA and QC in 3M surveys. We can start by looking at what has been done in model 3M surveys such as ESS. As Smith (2010) says: "ESS with its extensive program of methodological research, coordinated data-collection standards, and study monitoring, represents the best current practice in comparative survey research and serves as a model for other collaborations to emulate"

In addition to what is stated above, another aspect that makes ESS a model is the survey's ability to conduct experiments on issues that affect the quality of the survey, for instance mixed-mode design, improving question quality, improving survey response, attitudinal indicators, and more. ESS also gives training on key aspects of the survey lifecycle from a comparative perspective. Therefore we advise 3M surveys to follow in the footsteps of ESS for better 3M survey outputs.


Using the points made in the previous sections, a basic minimum set of QA and QC needs will be discussed in the next section.

7. A basic QA and QC framework
So far we have tried to see what is required to obtain high quality and comparable data according to the currently available cross-cultural survey guidelines, and also how these requirements are applied in the selected surveys. When studying the selected multinational and multicultural surveys, it is obvious how difficult it is to apply the guidelines completely. Depending on the survey topic, funds, the size of the survey, and other factors, the examined surveys tend to focus on somewhat different issues. This need to be selective is understandable when we realize what is required to obtain high quality and comparable data. In this final section of the thesis, using the CSDI guidelines, the survey lifecycle, and the real-life experiences of the selected surveys, we try to come up with a minimum required QA and QC framework for multinational and multicultural surveys. We believe the following is needed most:
I) Central coordinating team
II) Probability sampling
III) Questionnaire design
IV) Translation
V) Interviewer selection and training
VI) Pretesting
VII) Statistical adjustment

We now comment on each of them.
Central Coordinating Team (CCT): This team should include people from different countries and institutions who contribute to the overall study design. The team specifies which survey processes will be standardized across countries and which will be localized. The team also follows up and makes sure that the survey data quality meets the specified quality standards, gives training when needed, justifies requirements, investigates why countries fail to perform in line with the standards, and comes up with solutions (Guidelines for Best Practice in Cross-Cultural Surveys, 2011).


As Lyberg and Stukel (2010) state: "the real quality problem, however, is that many international surveys still have very primitive and weak infrastructures with very little central coordination and monitoring resulting in poor comparability." Surveys spend time and resources on developing survey instruments (questionnaires, guidelines, standards and methodologies) that need to be applied throughout the assessment by the participating countries, but so far only a few 3M surveys make sure these measures are put to use. ESS, WFS, WMS, and PIAAC are good examples of surveys with an effective central coordinating team that ensures their studies are done according to the prepared survey instruments.

Probability sampling: This is one of the specifications that cannot be compromised. Participating countries can use different sampling designs, depending on the sampling frame, as long as the design is a probability sampling design (a minimal sketch of such a design and its base weights follows the list below). The large and old comparative surveys, such as ISSP, WFS, and also ESS, have shown that probability sampling is applicable in large-scale cross-national surveys. Using a probability sampling design makes it possible to quantify the sampling error, gives all units a nonzero chance of selection, helps decrease coverage error, and gives a better representation of the population being studied (Lohr, 2008). Looking back at the selected surveys, almost all participating countries apply it. In ESS, participating countries have to sign an agreement to use probability sampling in order to be part of the study (http://www.europeansocialsurvey.org).

Questionnaire design: It is important to select questions that can be used for comparative research and that generate minimal measurement error. Since different cultures and nations, or multiple populations, are involved, the questionnaire design should preferably follow the basic approach suggested by Harkness, Edwards et al. (2010). This approach goes as follows:

I) Re-use questions that have been used in other surveys: this is a very common strategy and it saves time and money, but the questions need to be pretested even if they have been tested before.
II) Adapt questions from other studies to suit the new research topic or the new population under study. Adapted questions also need to be pretested.
III) New design: the questionnaire can be designed from scratch.


IV) Approaches I-III can sometimes be combined.
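As announced above, here is a minimal sketch of what a strict probability design can look like in practice: stratified simple random sampling in which every unit has a known, nonzero inclusion probability and hence a computable base weight. The frame, strata and sample sizes are hypothetical.

```python
# A minimal sketch of a strict probability design: stratified simple
# random sampling with known inclusion probabilities and base weights.
# The frame, strata and allocations are hypothetical.
import random

def stratified_srs(frame, n_by_stratum, seed=2013):
    rng = random.Random(seed)
    sample = []
    for stratum, n_h in n_by_stratum.items():
        units = [u for u in frame if u["stratum"] == stratum]
        for u in rng.sample(units, n_h):
            pi = n_h / len(units)  # known, nonzero inclusion probability
            sample.append({**u, "incl_prob": pi, "base_weight": 1 / pi})
    return sample

frame = [{"id": i, "stratum": "50-64" if i % 2 else "65+"} for i in range(1000)]
print(len(stratified_srs(frame, {"50-64": 30, "65+": 20})))  # -> 50
```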

Translation: When more than one language is involved in the survey, a good translation of the survey instrument is a must. There are many translation methods, but for survey work only one is highly recommended as the current best: team translation (TRAPD). TRAPD involves a team of professional translators performing translation, review, adjudication, pretesting, and documentation (Harkness, Villar, and Edwards, 2010). Model surveys, such as ESS, PIAAC, SHARE, EWCS, and ISSP, use the TRAPD translation method. This does not mean that just using TRAPD will solve all problems; however, using non-recommended translation methods decreases the possibility of collecting comparable data, or data fit for the intended use.

Interviewer selection and training: When face-to-face or telephone interview modes are used, interviewers play a major role in minimizing coverage error, measurement error, and processing error, and in maximizing response. Therefore special attention should be given to hiring experienced interviewers and giving them both general interviewer training and study-specific training. Carefully selected interviewers, trained according to the guidelines, will help increase survey quality by minimizing interviewer effects.

Pretesting: The guidelines call this step "a major step of a survey". Without pretesting there is no way of knowing whether the survey instrument will give the required output. The source questionnaire needs to be tested, even if all the questions are taken from a previously pretested survey (Harkness, Edwards et al., 2010). Target questionnaires also need to be pretested after translation or adaptation, to ensure measurement equivalence and to check whether the selected mode is suitable for the survey topic and/or the questionnaire.

Statistical adjustment: Data processing involves coding, data capture (data entry), editing and weighting adjustments. Statistical adjustment basically means correcting for item nonresponse bias (through imputation) and for unit nonresponse bias and coverage bias. In most multinational surveys, national survey organizations are responsible for processing their data before sending them on for data harmonization. If we look back at the selected surveys, the national partners are responsible for processing their respective data. But not all participating countries are able to process their data to the fullest, because some lack the expertise or the funding. That means there is a chance that data cleaning of various kinds is not performed according to the guidelines, and as a result

the survey data might be of poor quality or even incomparable. Therefore 3M surveys should put a lot of effort into assessing and controlling the quality of the data processing. In the next section a general conclusion about the minimum QA and QC needs in 3M surveys is stated, along with the next steps for QA and QC.

8. Conclusion
It has been mentioned again and again how complicated and difficult it can be to perform QA and QC measures in 3M surveys, and, as a result, how difficult, sometimes even close to impossible, it is to get high quality comparable survey data. In this thesis the elements required for obtaining quality comparable data according to the guidelines and other available literature have been presented. We then examined how the currently more or less ongoing 3M surveys handle QA and QC, but for most of the surveys the search was not very rewarding. Since the objective of this thesis is to come up with the minimum QA and QC requirements, the following, drawn from the 3M guidelines and the experience of the 3M surveys, are the most important QA and QC needs. Starting with the QA needs:

- Central teams: There is a need for a central body that follows the quality of the survey from the start to the final result, and that helps out national surveys when the need arises.
- Good translation: This is a key step in QA and QC when conducting 3M surveys. Bad translation leads to incomparable data.
- Pretesting: To get the required output, the survey materials need to be tested at least once.
- Formulas: For comparable data there has to be a clear, shared understanding of the formulas, such as how to formulate the design effect, the weights, the effective sample size, and so on.
- Courses: Some countries need help even to reach a minimum standard. Some need competence development and enhanced interviewer training.


Some minimum QC needs:

- Monitoring: If possible, it is always good to monitor every survey aspect (the whole survey lifecycle). If that is not possible, the major survey aspects, such as translation and adaptation, pretesting, the sampling process, the mode of data collection and the data processing, need to be monitored.
- Paradata analysis: To get good product quality we need good process quality, and the current best method to control process quality is the collection and analysis of paradata (see the sketch after this list).
- Adherence to specifications: Sometimes national surveys do not understand why certain specifications are given. Therefore all specifications need to be justified, and their adherence verified.
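As referenced in the paradata item above, the following is a minimal sketch, with invented call-record fields, of what paradata-based process control can look like: per-interviewer cooperation rates are computed from contact outcomes, and unusually low rates are flagged for follow-up. Real paradata systems are, of course, far richer.

```python
# A minimal sketch of paradata-based process control (hypothetical fields):
# compute per-interviewer cooperation rates and flag low performers.
from collections import defaultdict

def cooperation_rates(call_records):
    tallies = defaultdict(lambda: [0, 0])  # interviewer -> [completes, contacts]
    for rec in call_records:
        if rec["outcome"] in ("complete", "refusal"):
            tallies[rec["interviewer"]][1] += 1
            if rec["outcome"] == "complete":
                tallies[rec["interviewer"]][0] += 1
    return {ivw: c / n for ivw, (c, n) in tallies.items() if n}

def flag_low(rates, floor=0.5):
    """Flag interviewers below an arbitrary demo floor for follow-up."""
    return [ivw for ivw, r in rates.items() if r < floor]

records = [
    {"interviewer": "A", "outcome": "complete"},
    {"interviewer": "A", "outcome": "refusal"},
    {"interviewer": "B", "outcome": "refusal"},
    {"interviewer": "B", "outcome": "refusal"},
]
print(flag_low(cooperation_rates(records)))  # -> ['B']
```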

Finally, the next steps in improving QA and QC are to vigorously promote CSDI, the relevant literature, and the CCSG guidelines. One should also assess the capacity of the national surveys to reach at least the minimum standard before the main study is launched. Performance monitoring is one big, necessary step in QA and QC; there can be deviations, but the reasons need to be checked so that lessons can be learned. There also has to be organizational quality. A good infrastructure is therefore needed for better survey quality; good examples are ESS and PIAAC.


References

5th European Working Conditions Survey (2010). Quality Assurance Report. Working document for the European Foundation for the Improvement of Living and Working Conditions. Prepared by Gallup Europe.
5th European Working Conditions Survey (2010). Technical Report. Working document for the European Foundation for the Improvement of Living and Working Conditions. Gallup Europe.
Alcser, K. H., and Benson, G. (2005). The SHARE Train-the-Trainer Program. In Axel Börsh-Supan and Hendrik Jürges (Eds.), The Survey of Health, Aging, and Retirement in Europe - Methodology, 70-74. Mannheim Research Institute for the Economics of Aging (MEA).
Behr, D. (2009). Translationswissenschaft und international vergleichende Umfrageforschung: Qualitätssicherung bei Fragebogenübersetzungen als Gegenstand einer Prozessanalyse. Bonn: GESIS. [Translation Research and Cross-National Survey Research: Quality Assurance in Questionnaire Translation from the Perspective of Translation Process Research.]
Behr, D. (2012). The challenging task of questionnaire translation in cross-cultural survey research. IAB-Colloquium zur Arbeitsmarkt- und Berufsforschung, Nürnberg, 26.01.2012.
Bethlehem, J. (2002). Weighting nonresponse adjustments based on auxiliary information. In R. Groves, D. Dillman, J. Eltinge and R. Little (Eds.), Survey Nonresponse, 275-287. New York: Wiley.
Biemer, P. P., and Lyberg, L. E. (2003). Introduction to Survey Quality. Hoboken, NJ: John Wiley & Sons.
Blom, A. G., and Korbmacher, J. M. (2011). Measuring Interviewer Effects in SHARE Germany. SHARE Working Paper (03-2011). University of Mannheim.
Braun, M., and Harkness, J. (2005). Text and Context: Challenges to Comparability in Survey Questions. In J. H. P. Hoffmeyer-Zlotnik and J. Harkness (Eds.), Methodological Aspects in Cross-National Research. ZUMA-Nachrichten Spezial No. 11. Mannheim: Zentrum für Umfragen, Methoden und Analysen, 95-107.
Brown, J. D. (1996). Testing in Language Programs. Prentice Hall Regents.
Börsch-Supan, A., Hank, K., Jürges, H., and Schröder, M. (2010). Longitudinal Data Collection in Continental Europe: Experiences from the Survey of Health, Ageing, and Retirement in Europe (SHARE). In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B-E. Pennell and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 28, 507-514. Hoboken, NJ: John Wiley & Sons.
Börsh-Supan, A., and Jürges, H. (Eds.) (2005). The Survey of Health, Ageing and Retirement in Europe - Methodology. Mannheim: MEA.
Börsh-Supan, A., and Kemperman, M. L. (2005). The SHARE Development Process. In Axel Börsh-Supan and Hendrik Jürges (Eds.), The Survey of Health, Aging, and Retirement in Europe - Methodology, 7-11. Mannheim Research Institute for the Economics of Aging (MEA).
Carey, S. (Ed.) (2000). Measuring Adult Literacy: The International Adult Literacy Survey in the European Context. Office for National Statistics, UK.


COM/DELSA/EDU/PIAAC (2008). Implementation Aspects of National Centers for the Programme for the International Assessment of Adult Competencies (PIAAC). PIAAC International Contractor, Educational Testing Service (ETS).
Couper, M. (1998). Measuring Survey Quality in a CASIC Environment. Paper presented at the Joint Statistical Meetings of the American Statistical Association, Dallas.
Couper, M. P., and de Leeuw, E. D. (2003). Nonresponse in cross-cultural and cross-national surveys. In J. A. Harkness, F. J. R. van de Vijver, and P. P. Mohler (Eds.), Cross-Cultural Survey Methods, 157-177. New York: Wiley.
Couper, M., and Lyberg, L. (2005). The Use of Paradata in Survey Research. Invited paper presented at the 55th ISI Session, Sydney, April.
Crosby, P. B. (1979). Quality Is Free: The Art of Making Quality Certain. New York: McGraw-Hill.
Darcovich, N., Binkley, M., Cohen, J., Myrberg, M., and Persson, S. (1998). Chapter 4: Nonresponse Bias. In T. S. Murray, I. S. Kirsch, and L. B. Jenkins (Eds.), Adult Literacy in OECD Countries: Technical Report on the First International Adult Literacy Survey, 55-72. National Center for Education Statistics, Office of Educational Research and Improvement, NCES 98-053.
Darcovich, N. (1998). Chapter 3: Survey Response and Weighting. In T. S. Murray, I. S. Kirsch, and L. B. Jenkins (Eds.), Adult Literacy in OECD Countries: Technical Report on the First International Adult Literacy Survey, 41-54. National Center for Education Statistics, Office of Educational Research and Improvement, NCES 98-053.
Darcovich, N., and Murray, T. S. (1998). Chapter 5: Data Collection and Processing. In T. S. Murray, I. S. Kirsch, and L. B. Jenkins (Eds.), Adult Literacy in OECD Countries: Technical Report on the First International Adult Literacy Survey, 13-22. National Center for Education Statistics, Office of Educational Research and Improvement, NCES 98-053.
de Luca, G., and Lipps, O. (2005). Fieldwork and Survey Management in SHARE. In A. Börsch-Supan and H. Jürges (Eds.), The Survey of Health, Ageing and Retirement in Europe - Methodology, 75-81. Mannheim Research Institute for the Economics of Aging (MEA).
de Weerd, M., Gemmeke, M., Rigter, J., and van Rij, C. (2005). Indicators for Monitoring Active Citizenship and Citizenship Education - Final Report. Research report for the European Commission/DG EAC.
Dillman, D. A., Hox, J. J., and de Leeuw, E. D. (2008). Mixed-mode surveys: When and why. In E. D. de Leeuw, J. J. Hox, and D. A. Dillman (Eds.), International Handbook of Survey Methodology, 299-316. New York/London: Lawrence Erlbaum Associates.
Dorer, B. (2011). Advance translation in the 5th round of the European Social Survey (ESS). FORS Working Paper Series, Paper 2011-4. Lausanne: FORS.
Erberber, E., Arora, A., and Preuschoff, C. (2007). Developing the TIMSS 2007 Background Questionnaires. In J. F. Olson, M. O. Martin, and I. V. S. Mullis (Eds.), TIMSS 2007 Technical Report. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Eurofound (2012). 5th European Working Conditions Survey. Luxembourg: Publications Office of the European Union.


European Social Survey (2011). Round 6 Specifications for Participating Countries. London: Centre for Comparative Social Surveys, City University London.
Eva, G., and Widdop, S. (2007). Mixed mode data collection in Europe. ESRA Conference, Prague.
Feigenbaum, A. V. (1951). Quality Control: Principles, Practice and Administration. New York: McGraw-Hill.
Fitzgerald, R., and Jowell, R. (2010). Measurement Equivalence in Comparative Surveys: The European Social Survey (ESS) - From Design to Implementation and Beyond. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 485-495. Hoboken, NJ: John Wiley & Sons.
Gabler, S., Häder, S., and Lynn, P. (2005). Design Effects for Multiple Design Samples. Working Papers of the Institute for Social and Economic Research, Paper 2005-12. Colchester: University of Essex. http://www.iser.essex.ac.uk/pubs/workpaps/.
Gabler, S., Häder, S., and Lahiri, P. (1999). A model based justification of Kish's formula for design effects for weighting and clustering. Survey Methodology, 25(1), 105-106.
Gabler, S., Häder, S., and Lynn, P. (2006). Design effects for multiple design samples. Survey Methodology, 32(1), 115-120.
Ganninger, M. (2006). Estimation of design effects for ESS round II. Unpublished manuscript. Mannheim: GESIS.
Goldstein, H., and Wood, R. (1989). Five decades of item response modelling. British Journal of Mathematical and Statistical Psychology, 42, 139-167.
Gonzalez, E. J., and Foy, P. (2000). Estimation of Sampling Variance. In M. O. Martin, K. D. Gregory, and S. E. Stemler (Eds.), TIMSS 1999 Technical Report, 203-224. International Study Center, Lynch School of Education, Boston College.
Greenfield, P. M. (1997). You can't take it with you: Why ability assessments don't cross cultures. American Psychologist, 52(10), 1115-1124.
Groves, R. M. (2006). Nonresponse rates and nonresponse bias in household surveys. Public Opinion Quarterly, 70(5), 646-675.
Groves, R. M., and Heeringa, S. (2006). Responsive design for household surveys: Tools for actively controlling survey errors and costs. Journal of the Royal Statistical Society, Series A, 169(3), 439-457.
Groves, R. M., and Lyberg, L. (2010). Total Survey Error: Past, Present, and Future. Public Opinion Quarterly, 74(5), 839-879.
Guerin-Pace, F., and Blum, A. (2000). The Comparative Illusion: The International Adult Literacy Survey. Population: An English Selection, 12, 225-246. Institut National d'Études Démographiques. http://www.jstor.org/stable/3030249.
Guidelines for Best Practice in Cross-Cultural Surveys (2011). http://www.ccsg.isr.umich.edu.
Harkness, J. A., Villar, A., and Edwards, B. (2010). Translation, Adaptation, and Design. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 117-141. Hoboken, NJ: John Wiley & Sons.
Harkness, J. A., Edwards, B., Hansen, S. E., Miller, D. R., and Villar, A. (2010). Designing Questionnaires for Multipopulation Research. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 33-57. Hoboken, NJ: John Wiley & Sons.
Harkness, J. A. (2003). Questionnaire translation. In J. A. Harkness, F. van de Vijver, and P. Ph. Mohler (Eds.), Cross-Cultural Survey Methods, 35-56. Hoboken, NJ: John Wiley & Sons.
Harkness, J. (2005). SHARE Translation Procedures and Translation Assessment. In A. Börsch-Supan and H. Jürges (Eds.), The Survey of Health, Ageing and Retirement in Europe - Methodology, 24-27. Mannheim Research Institute for the Economics of Aging (MEA).
Heeringa, S. G., and O'Muircheartaigh, C. (2010). Sample Design for Cross-Cultural and Cross-National Survey Programs. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 251-266. Hoboken, NJ: John Wiley & Sons.
Howell, D. A. (2010). Enhancing Quality and Comparability in the Comparative Study of Electoral Systems (CSES). In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 525-534. Hoboken, NJ: John Wiley & Sons.
Inglehart, R., Basañez, M., Diez-Medrano, J., Halman, L., and Luijkx, R. (2004). Human Beliefs and Values: A Cross-Cultural Sourcebook Based on the 1999-2002 Values Surveys. Mexico City: Siglo XXI Editores.
ISSP (2005). ISSP 2005 Study Description Form. http://www.share-project.org/.
Johansone, I. (2007). Quality Assurance in the TIMSS 2007 Data Collection. In J. F. Olson, M. O. Martin, and I. V. S. Mullis (Eds.), TIMSS 2007 Technical Report. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Johansone, I., and Malak, B. (2007). Translation and National Adaptations of the TIMSS 2007 Assessment and Questionnaires. In J. F. Olson, M. O. Martin, and I. V. S. Mullis (Eds.), TIMSS 2007 Technical Report. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Johansone, I., and Neuschmidt, O. (2007). TIMSS 2007 Survey Operations Procedures. In J. F. Olson, M. O. Martin, and I. V. S. Mullis (Eds.), TIMSS 2007 Technical Report. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Johnson, T. (1998). Approaches to Equivalence in Cross-Cultural and Cross-National Surveys. ZUMA-Nachrichten Spezial No. 3: Cross-Cultural Survey Equivalence, 1-40.
Joncas, M. (2007). TIMSS 2007 Sampling Weights and Participation Rates. In J. F. Olson, M. O. Martin, and I. V. S. Mullis (Eds.), TIMSS 2007 Technical Report. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Joncas, M. (2007). TIMSS 2007 Sample Design. In J. F. Olson, M. O. Martin, and I. V. S. Mullis (Eds.), TIMSS 2007 Technical Report. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.


Juran, J. M., and Gryna, F. M. (1980). Quality Planning and Analysis: From Product Development Through Use. New York: McGraw-Hill.
Kirsch, I. S., and Murray, T. S. (1998). Chapter 1: Introduction. In T. S. Murray, I. S. Kirsch, and L. B. Jenkins (Eds.), Adult Literacy in OECD Countries: Technical Report on the First International Adult Literacy Survey, 13-22. National Center for Education Statistics, Office of Educational Research and Improvement, NCES 98-053.
Kish, L. (1994). Multipopulation survey designs: Five types with seven shared aspects. International Statistical Review, 62, 167-186.
Klevmarken, A., Hesselius, P., and Swensson, B. (2005). The SHARE Sampling Procedures and Calibrated Design Weights. In A. Börsch-Supan and H. Jürges (Eds.), The Survey of Health, Ageing and Retirement in Europe - Methodology, 28-69. Mannheim Research Institute for the Economics of Aging (MEA).
Kreuter, F., Couper, M., and Lyberg, L. (2010). The Use of Paradata to Monitor and Manage Survey Data Collection. Section on Survey Research Methods, JSM 2010.
Kreuter, F. (2013). Paradata for Nonresponse Error Investigation. Stockholm University.
Lohr, S. L. (2008). Coverage and Sampling. In E. D. de Leeuw, J. J. Hox, and D. A. Dillman (Eds.), International Handbook of Survey Methodology, 97-112. New York: Routledge/Taylor & Francis, European Association of Methodology (EAM) Methodology Series.
Lohr, S. L. (1999). Sampling: Design and Analysis. Pacific Grove, CA: Duxbury Press, Brooks/Cole Thomson Learning.
Loosveldt, G. (2008). Interviewer effects in nonresponse rates. Nonresponse Workshop, Ljubljana, 15-17 September 2008.
Lyberg, L. (2012). Survey Quality. Survey Methodology, 38(2), 107-130. Statistics Canada, Catalogue No. 12-001-X.
Lyberg, L., and Stukel, D. M. (2010). Quality Assurance and Quality Control in Cross-National Comparative Studies. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 227-250. Hoboken, NJ: John Wiley & Sons.
Lyberg, L. E., and Biemer, P. P. (2008). Quality Assurance and Quality Control in Surveys. In E. D. de Leeuw, J. J. Hox, and D. A. Dillman (Eds.), International Handbook of Survey Methodology, 421-441. New York/London: Lawrence Erlbaum Associates, Taylor & Francis Group.
Lynn, P., Japec, L., and Lyberg, L. (2006). What's So Special About Cross-National Surveys? In J. A. Harkness (Ed.), Conducting Cross-National and Cross-Cultural Surveys: Papers from the 2005 Meeting of the International Workshop on Comparative Survey Design and Implementation (CSDI), 7-19.
Martin, P., and Lynn, P. (2011). The effect of mixed mode survey designs on simple and complex analyses. Centre for Comparative Social Surveys Working Paper Series, Paper No. 04. City University London.
Montalvan, P. (2009). Programme for the International Assessment of Adult Competencies (PIAAC): Overview, Project Status/Schedule, Quality Control of Survey Operations on a Multinational Survey. Westat, Sixth International CSDI Workshop.
Mullis, I. V. S., and Martin, M. O. (2010). Assessment Methods in IEA's TIMSS and PIRLS International Assessments of Mathematics, Science, and Reading. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 515-524. Hoboken, NJ: John Wiley & Sons.
Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., and Preuschoff, C. (2009). TIMSS 2011 Assessment Frameworks. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College; International Association for the Evaluation of Educational Achievement (IEA), Amsterdam, the Netherlands.
Murray, T. S., Kirsch, I. S., and Jenkins, L. B. (Eds.) (1998). Adult Literacy in OECD Countries: Technical Report on the First International Adult Literacy Survey. National Center for Education Statistics, Office of Educational Research and Improvement, NCES 98-053.
Norris, P. (2007). The Globalization of Comparative Public Opinion Research. Harvard University.
Pennell, B. E., Harkness, J. A., Levenstein, R., and Quaglia, M. (2010). Challenges in Cross-National Data Collection. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 269-298. Hoboken, NJ: John Wiley & Sons.
Pennell, B. E., Harkness, J. A., Braun, M., Edwards, B., Johnson, T. P., Lyberg, L., Mohler, P. P., and Smith, T. W. (2010). Comparative Survey Methodology. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 3-16. Hoboken, NJ: John Wiley & Sons.
Peters, T. J., and Waterman, R. H. (1982). In Search of Excellence: Lessons from America's Best-Run Companies. New York: Harper & Row.
PIAAC Technical Standards and Guidelines, Second Draft (March 2009). Meeting of the National Project Managers, 23-27 March 2009, Barcelona, Spain. PIAAC-NPM(2009_03_03)PIAAC_NSDPR.doc.
Road, W. (2010). Quality Assessment of the 5th European Working Conditions Survey. http://www.eurofound.europa.eu.
Inglehart, R., et al. (2004). World Values Surveys and European Values Surveys 1999-2001: User Guide and Codebook. First ICPSR Version, May 2004.
Scholz, E., and Heller, M. (2009). ISSP Study Monitoring 2007. GESIS Technical Reports 2009/5. Mannheim: GESIS. Retrieved March 26, 2010, from http://www.gesis.org/fileadmin/upload/forschung/publikationen/gesis_reihen/gesis_methodenberichte/2009/TechnicalReport_09-5.pdf.
Schwarzer, S. (2012). The development of multi-language questionnaires: multi-stage checks and semi-automatization. DASISH Quantitative Workshop, Mannheim.
Skjåk, K. K. (2010). The International Social Survey Programme: Annual Cross-National Social Surveys Since 1985. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 497-506. Hoboken, NJ: John Wiley & Sons.
Smith, T. W. (2004b). Developing and evaluating cross-national survey instruments. In S. Presser et al. (Eds.), Methods for Testing and Evaluating Survey Questionnaires, 431-452. Hoboken, NJ: John Wiley & Sons.


Smith, T. W. (2007). Survey Nonresponse Procedures in Cross-National Perspective: The 2005 ISSP Nonresponse Survey. Survey Research Methods, 1(1), 45-54. NORC, University of Chicago. http://w4.ub.uni-konstanz.de/srm.
Smith, T. W. (2010). The Globalization of Survey Research. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 477-484. Hoboken, NJ: John Wiley & Sons.
Solano-Flores, G., Contreras-Niño, L. A., and Escudero, E. B. (2006). Translation and adaptation of tests: Lessons learned and recommendations for countries participating in TIMSS, PISA and other international comparisons. Revista Electrónica de Investigación Educativa, 8(2). http://redie.uabc.mx/vol8no2/contents-solano2.html.
Stoop, I. (2004). Surveying Nonrespondents. Field Methods, 16, 23-54.
Stoop, I. (2005). The Hunt for the Last Respondent: Nonresponse in Sample Surveys. The Hague: Social and Cultural Planning Office.
Stoop, I. (2007). No time, too busy: Time strain and survey cooperation. In G. Loosveldt, M. Swyngedouw, and B. Cambré (Eds.), Measuring Meaningful Data in Social Research, 301-314. Leuven: Acco.
Stoop, I., Matsuo, H., Koch, A., and Billiet, J. (2010). Paradata in the European Social Survey: Studying Nonresponse and Adjusting for Bias. Section on Survey Research Methods, JSM.
Stoop, I., Billiet, J., Koch, A., and Fitzgerald, R. (2010). Improving Survey Response: Lessons Learned from the European Social Survey. Chichester: Wiley.
Tortora, R. D., Srinivasan, R., and Esipova, N. (2010). The Gallup World Poll. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, and T. W. Smith (Eds.), Survey Methods in Multinational, Multicultural and Multiregional Contexts, 535-543. Hoboken, NJ: John Wiley & Sons.
Usher, R. (2000). The International Social Survey Programme (ISSP). Schmollers Jahrbuch, 663-672. Berlin: Duncker & Humblot.
Vehovar, V. (2007). Nonresponse Bias in the European Social Survey. In G. Loosveldt, M. Swyngedouw, and B. Cambré (Eds.), Measuring Meaningful Data in Social Research, 335-356. Leuven: Acco.
Williams, T., Ferraro, D., Roey, S., Brenwald, S., Kastberg, D., Jocelyn, L., Smith, C., and Stearns, P. (2009). TIMSS 2007 U.S. Technical Report and User Guide (NCES 2009-012). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.


Appendix: Internet Links

Guidelines for Best Practice in Cross-Cultural Surveys (CCSG):
http://www.ccsg.isr.umich.edu/qnrdev.cfm
http://www.ccsg.isr.umich.edu/sampling.cfm
http://www.ccsg.isr.umich.edu/translation.cfm
http://www.ccsg.isr.umich.edu/pretesting.cfm
http://www.ccsg.isr.umich.edu/dataproc.cfm
The International Workshop on Comparative Survey Design and Implementation (CSDI): http://csdiworkshop.org/
Programme for the International Assessment of Adult Competencies (PIAAC): http://www.piaac.sk/about/ and http://www.nier.go.jp/English/departments/piaac.html
Adult Literacy and Lifeskills Survey (ALL), National Center for Education Statistics: http://nces.ed.gov/surveys/all/index.asp
OECD, Statistics Canada (2011). Literacy for Life: Further Results from the Adult Literacy and Life Skills Survey. OECD Publishing. http://dx.doi.org/9789264091269-en
4228.0 - Adult Literacy and Life Skills Survey, Summary Results, Australia, 2006 (Reissue), Australian Bureau of Statistics: http://www.abs.gov.au/AUSSTATS/[email protected]/Lookup/4228.0Explanatory%20Notes12006%20(Reissue)?OpenDocument
British Columbia Research Libraries' Data Services, International Adult Literacy Survey (IALS) [International file, 1994-1996-1998], 1998: http://hdl.handle.net/10573/41351
European Social Survey Round 5 (2010): http://ess.nsd.uib.no/ess/round5/
Trends in International Mathematics and Science Study (TIMSS): http://timss.bc.edu/
Survey of Health, Ageing and Retirement in Europe (SHARE): http://www.share-project.org/
International Social Survey Programme (ISSP), general information: http://www.issp.org/
International Social Survey Programme (ISSP), ISSP Methodological Research: http://www.issp.org/page.php?pageId=215
Multi-Stage Development of the SHARE questionnaire: http://www.share-project.org/data-access-documentation/questionnaires/questionnaire-Wave-1.html
International Social Survey Programme (ISSP), Working Principles (amended April 2012): www.issp.org/.../WP_FINAL_9_2012_.pdf
European Statistical System, National Statistical Institute: http://www.nsi.bg/pageen.php?P=187&SP=318
European Statistical System, Central Statistics Office: http://www.cso.ie/en/aboutus/organisation/europeanstatisticalsystem/
International Social Survey Programme (ISSP), History of the ISSP: http://www.issp.org/page.php?pageId=216
European Working Conditions Survey: http://www.eurofound.europa.eu/ewco/surveys/index.htm
CSES methodology: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03808
Comparative Study of Electoral Systems (CSES): http://www.cses.org


Eurobarometer: http://www.gesis.org/eurobarometer-data-service/survey-series/
Eurobarometer, the EP and the expectations of European citizens: http://www.europarl.europa.eu/aboutparliament/en/00191b53ff/Eurobarometer.html
RESPECT project: http://www.respectproject.org/code/
