

Minor Thesis

The under-reporting of children under five in population surveys of Sierra Leone

Wageningen, February 2014
Wendy Smit – 890327778080 – YRM 80334
Supervisor: dr. F. Cobben, Chair Group Research Methodology

Abstract

In Sierra Leone, data collection efforts such as surveys and censuses have noted an under-reporting of children under five. Age groups in population figures would be expected to move upward over the years. In population pyramids from several data collections conducted in Sierra Leone, however, the age group of children under five is clearly smaller than the age group directly above it, and this holds for successive data collections as well. The cohort thus appears to grow over time, which can only be explained by an under-reporting of children under five or by extreme immigration of children aged 5-9 years. Because the pattern appears both in surveys and in the 2004 population census, it is unlikely to be caused by sampling errors. This research investigates the other possible causes of the under-reporting of children under five in Sierra Leone. The research is conducted within the scope of the Total Survey Error (TSE), one of the most widely used paradigms in survey methodology. The TSE framework distinguishes different sources of error. To apply the TSE framework, a quality profile was chosen as the method. The goal of a quality profile is to enhance the understanding of the statistics produced from a survey and to improve the quality of the survey: it reports all potential sources of error, describing both known and suspected errors. This research produced quality profiles of three datasets available from three survey organizations, namely the Sierra Leone Integrated Household Survey 2003/04 and 2011 and the Sierra Leone Demographic and Health Survey. The most notable finding concerns a similarity between the surveys: all three used the Sierra Leone Population Census as the sampling frame. This frame excluded collective housing such as orphanages and boarding schools, which might be a cause of the under-reporting of children under five, as might a cultural phenomenon called the "big houses". This research pays particular attention to nonresponse by performing multiple logistic regressions; these do not identify nonresponse error as a proper cause of the under-reporting. The hypotheses and recommendations for further research are therefore mainly in the direction of measurement errors.

Preface

In October 2013 I started this research in the context of the master Health and Society at Wageningen University. During this thesis I investigated different possible errors that might contribute to the under-reporting of children under five in surveys, especially in Sierra Leone. Besides investigating the different possible errors, I learned more about dealing with errors in a quantitative way; I did this in order to develop my statistical skills and refresh the statistics learned in previous years. At the beginning of this thesis I was triggered by the mystery of the "missing" children: probably they are not really missing, but they are not visible in the provided data. The fact that the World Bank and some other stakeholders could not explain the under-reporting made me curious. There were multiple aspects involved in this research. There are many differences between the "Western" world and Sub-Saharan Sierra Leone, which made me very enthusiastic to learn more about the methodological aspects as well as the more cultural aspects of the possible causes of the nonresponse. Choosing this project for my minor thesis broadened my view on research a bit further: I learned about very new topics and about doing research in a more preparatory manner, as this research focuses on formulating hypotheses for follow-up research. The core idea behind this minor thesis is that I am a social scientist in training with an interest in methodological research questions.

Acknowledgement

This research project was very educational and informative for me as a scientist in training, but besides that it was also a huge challenge to complete the research next to other commitments. I have experienced a lot of great support, and therefore I would like to give thanks. First I would like to thank dr. F. Cobben for providing me the opportunity to conduct this research and for the instructive advice I received, as well as for the great supervision and the feedback provided during the writing of this thesis. Secondly I want to thank mr. P. Tamas for his enthusiasm about this research topic and his availability for answering my questions when needed. I also want to thank dr. H. Tobi for sharing her experience and knowledge about logistic regressions and multinomial regressions. Furthermore I want to thank Mr. Ruillin, Ms. Himelein and Ms. Foster for helping me better understand the Sierra Leonean culture and for answering my questions quickly and accurately. Lastly I want to thank all the other persons who in some way made it possible for me to conduct this research project.

Wendy Smit
February 23rd, 2014


Table of Contents

Abstract
Preface
Acknowledgement
1. Introduction
   1.1 Sierra Leone
   1.2 Missing Children and effects on measures
      1.2.1 Mortality measures
      1.2.2 Birth measures
   1.3 Population measures
      1.3.1 Possible errors in population measures
   1.4 Relevance and Background of this research
   1.5 Aim and Research Questions
   1.6 Systematic outline of the research
2. Theoretical Framework
   2.1 Total Survey Error-Framework
   2.2 Sampling errors
   2.3 Non-sampling errors
      2.3.1 Observation errors
      2.3.2 Non-observation errors
3. Datasets
   3.1 Sierra Leone Demographic and Health Survey 2008
      Survey sample design
      Data collection
   3.2 Sierra Leone Integrated Household Survey 2003/04 and 2011
      Survey sample design
      Data collection
4. Method
   4.1 Sampling error
   4.2 Non-sampling errors
      4.2.1 Observation errors
      4.2.2 Non-observation errors
5. Results
   5.1 SLDHS
      5.1.1 Observation errors
      5.1.2 Non-observation errors
   5.2 SLIHS
      5.2.1 Observation errors
      5.2.2 Non-observation errors
6. Conclusion and Discussion
   6.1 Conclusion
   6.2 Further research
      6.2.1 Hypotheses
      6.2.2 Recommendations
   6.3 Discussion
      6.3.1 Results and the used theoretical framework
      6.3.3 Strengths and limitations of this research
References
Appendices
Appendix A
   A1.1 Binary regressions single variables versus the fast respondents
   A1.2 Binary regressions single variables versus the less fast respondents
   A.2 Logistic regression "Fast Respondents"
   A.3 Logistic regression "Less Fast Respondents"
   A.4 Logistic regression "Fast Respondents"
   A.5 Logistic regression "Less Fast Respondents"
Appendix B
   B1 SPSS outputs of final logistic regression model, "fast respondents"
   B2 SPSS outputs of final logistic regression model, "less fast respondents"

1. Introduction

The word "research" was already used in the French language in 1577, where it was related to "going seeking" (Jawed, 2012). Research, and especially scientific research, is becoming more and more important. The publication of results is most useful when the results are valid and reliable. Validity, and more precisely internal validity, indicates the validity of the research conclusions. Measurement validity, part of internal validity, indicates to what extent an instrument measures what it is supposed to measure. Reliability says more about the consistency of a measure and whether it minimizes random error; in other words, the measure should be repeatable (Bowling, Ebrahim, & ebrary Inc., 2005). In Sierra Leone, multiple household data collection efforts, such as surveys and censuses, have noted substantial under-reporting of children under five (Himelein, 2013). This under-reporting is visible in the results of different data collections. Figures 1 and 2 show the population pyramids of the Sierra Leone Integrated Household Survey (SLIHS) of 2003/04 and 2011. Comparing figure 1 with figure 2, it appears that mainly among the very young population the distribution is incorrect, especially when comparing the under-five age category of 2003/04 with the 5-9 and 10-14 age categories of the 2011 pyramid. It is striking that particularly in the 2011 pyramid the 5-9 and 10-14 groups are bigger than the under-five category: the children under five of the 2003/04 survey should have moved up in the population pyramid of 2011.

Figure 1 Population Pyramid SLIHS 2003/04 (Himelein, 2013)
Figure 2 Population Pyramid SLIHS 2011 (Himelein, 2013)

This incorrectness might indicate a low validity of the measurement. It is informative to know whether this pattern is only visible in the SLIHS or whether it occurs in more surveys. The Multiple Indicator Cluster Survey (MICS) is another population survey that collected data in Sierra Leone; it was carried out in 2005 and 2010. Comparing the two population pyramids, the same pattern is visible between the MICS 2005 and the MICS 2010 (shown in figures 3 and 4).

Figure 3 Population Pyramid MICS 2005 (Himelein, 2013)
Figure 4 Population Pyramid MICS 2010 (Himelein, 2013)

The surveys (SLIHS and MICS) were both administered by Statistics Sierra Leone (SSL) with technical assistance from different donors and international organizations (Himelein, 2013). In both cases the under-five age category is underrepresented, especially compared to the 5-9 age group. Looking at figure 3, the 0-4 age group in 2005 is around 7% of the population for males and for females. For 2010 one would expect this group to move up into the 5-9 age group. The 5-9 age group in 2010 is, however, bigger than the under-five group of 2005: in 2010 it is situated around 8% of the population. The cohort is growing, which can only be explained by an under-reporting of children under five or by extreme immigration of children aged 5-9 years. The under-reporting of children under five in the different surveys raises questions about its possible causes. One possible cause could be the use of a sample in the surveys. To check this point, information is needed about the whole population, such as information from a census: censuses do not use samples but collect data from the whole population. The 2004 Census of Sierra Leone shows the same pattern as

the previous surveys. Looking at figure 5, it is visible that the age group of children under five is smaller than the 5-9 age group. The fact that the same pattern is visible in the population pyramid of the 2004 census makes it implausible that the under-reporting of children under five can be attributed to the use of a sample within the surveys.

Figure 5 Population pyramid, Census 2004 Sierra Leone
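The cohort reasoning above can be expressed as a simple consistency check. The percentages used below are approximate values read from the MICS pyramids in figures 3 and 4, included for illustration only:

```python
def cohort_check(share_young_t0: float, share_next_t1: float) -> str:
    """Compare a cohort's population share with the next age group's
    share one survey round (five years) later. In a closed population
    the later share should be equal or smaller (due to mortality)."""
    if share_next_t1 > share_young_t0:
        return "inconsistent"  # cohort appears to have grown
    return "consistent"

# Approximate shares: 0-4 group in MICS 2005 (~7%) versus the
# 5-9 group in MICS 2010 (~8%).
print(cohort_check(7.0, 8.0))  # -> "inconsistent"
```

An "inconsistent" outcome means the cohort seems to have grown, which, as argued above, points to under-reporting of under-fives or extreme in-migration of 5-9-year-olds.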

This introduction briefly elaborates on the country where the under-reporting is observed, Sierra Leone; the effects and risks of incorrect data on other measurements; surveys and censuses; and the relevance and background of the problem that is central to this minor thesis. This information results in the aim, research question and research outline, which are described at the end of this chapter.

1.1 Sierra Leone

Sierra Leone is a relatively small country situated at the west coast of the African continent. European explorers reached Sierra Leone during the 15th century. The country's history was eventful; in 1961 it achieved independence from Great Britain. In March 1991 a civil war evolved in the country (Worldatlas, 2013), which lasted until 2002. During that war tens of thousands of people died and two million people, around one third of the population, were displaced (Wikipedia, 2013; Worldatlas, 2013).

The population of Sierra Leone was estimated at 5.979 million people in 2012 (TheWorldDataBank, 2013). The results of the previous censuses indicate that the annual population growth rate for the period between 1985 and 2004 was 1.8%; for the period between 1974 and 1985 it was 2.3%. The capital city of Sierra Leone is Freetown, with a population of 941,000 (UNdata, 2013). In Sierra Leone statutory law prohibits polygamy, but it is nonetheless authorized and practiced: about 43% of women aged 15-49 are in polygamous marriages. In Sierra Leone women are considered everlasting minors (FAWE, 2007).

1.2 Missing Children and effects on measures

When subgroups of the population are under-reported, the use of the incorrect data might have substantial implications for sectors like health and education. These sectors often use population numbers in calculations to set policies. When outcomes for subgroups, such as mortality or birth measures, are calculated from under-reported numbers, the outcomes are compromised, and this might have consequences for policy.

1.2.1 Mortality measures

Sierra Leone has the highest infant mortality worldwide. Infant mortality is the number of infants who die before reaching one year of age. In 2012 the infant mortality was 117 per 1,000 live births (TheWorldBankGroup, 2013b). The infant mortality rate (IMR) is calculated in the following way:

IMR = (number of infant deaths / number of live births) × 1,000

To calculate the IMR, the number of live births should be known as well as the number of infant deaths in one year (Indrayan, 2012). The under-5 mortality rate (U5MR) of Sierra Leone is also the highest in the world: 182 per 1,000 newborn babies die before reaching the age of five (TheWorldBankGroup, 2013c). The under-5 mortality rate is calculated in the following way:

U5MR = (number of deaths of children under five / number of live births) × 1,000

To calculate the U5MR, the deaths of children under five as well as the live births should be known for the calculated year (Indrayan, 2012). When the numbers from the surveys or census of Sierra Leone are used, the measurement outcomes are incorrect due to the under-reporting. A logical line of reasoning is: when the live births are reported wrongly while the deaths are reported correctly, the U5MR and IMR come out too high, which is in fact an overestimation.
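Since both rates share the same structure, the overestimation effect can be sketched with made-up counts; the numbers below are illustrative, not Sierra Leone's actual vital statistics:

```python
def rate_per_1000_births(deaths: int, live_births: int) -> float:
    """IMR or U5MR: deaths per 1,000 live births in a year."""
    return deaths / live_births * 1000

deaths_under_one = 117   # hypothetical infant deaths in one year
true_births = 1000       # actual live births
reported_births = 900    # live births after 10% under-reporting

true_imr = rate_per_1000_births(deaths_under_one, true_births)        # 117.0
reported_imr = rate_per_1000_births(deaths_under_one, reported_births)  # 130.0

# A too-small denominator inflates the rate: the reported IMR
# overestimates the true IMR.
assert reported_imr > true_imr
```

The same mechanism applies to the U5MR, since only the numerator (deaths of children under five) differs.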

1.2.2 Birth measures

To measure the number of live births during one year the Crude Birth Rate (CBR) can be used. The CBR indicates the number of live births during one year per 1,000 people of the population estimated at midyear (TheWorldBankGroup, 2013a). The formula of the CBR is:

CBR = (total number of live births / total population) × 1,000 (United-Nations, 2009)

The CBR is called crude because it is calculated in a rather rough way; it does not take the sex differences of a population into account (Rosenberg, 2013). In 2011 the CBR of Sierra Leone was 38 (per 1,000); for comparison, in that same year the CBR of the Netherlands was 11 (TheWorldBankGroup, 2013a). As for the IMR and U5MR, the CBR calculation might be based on incorrect numbers of live births or of the total population. This results in wrong estimates of the rate and might have consequences for the decisions based on those numbers. Following the same reasoning as for the mortality rates, under-reported births will here result in figures that are too low, an underestimation.
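The opposite direction of bias for the CBR can be sketched the same way, again with hypothetical counts:

```python
def crude_birth_rate(live_births: int, midyear_population: int) -> float:
    """CBR: live births per 1,000 people of the midyear population."""
    return live_births / midyear_population * 1000

population = 1_000_000   # hypothetical midyear population
true_births = 38_000     # actual live births
reported_births = 34_000 # some births missing from the data

# Missing births sit in the numerator, so the CBR comes out too low
# (underestimation), the mirror image of the mortality-rate case.
assert crude_birth_rate(reported_births, population) < \
       crude_birth_rate(true_births, population)
```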

1.3 Population measures

There are two ways to measure a population: by using sample surveys and by using censuses (Australian-Bureau-of-Statistics, 2013). Both have the goal of collecting information on a group of persons, but because of their different characteristics they are used in different circumstances (Australian-Bureau-of-Statistics, 2013). Sample surveys are described first in the following section, then censuses; at the end the differences between the two are described.

Sample surveys

A sample survey is a method of collecting information from a sample of a specific population (Bowling et al., 2005; Cobben, 2009). The sample is used to make inferences about the population as a whole. For an appropriate sample some conditions should be met, such as the condition that the sample is randomly selected (Cobben, 2009). Roughly, two types of variables play a role during survey sampling: survey variables and auxiliary variables. Survey variables are the variables of interest that are measured during the survey. Auxiliary data is generally known before the sampling and can be used to create a sampling design. Auxiliary data can also be used to adjust the survey data for nonresponse, using weighting or imputation techniques (Cobben, 2009).

There is also a form of auxiliary data that is collected during the data collection period. This data is called paradata or process data. Process data contains, in accordance with its name, information about the data collection process, for example information about contact attempts or reasons for not participating in the survey (Cobben, 2009).

Censuses

A census is a complete enumeration of the population: it collects information on the entire population (Australian-Bureau-of-Statistics, 2013). A census can also be seen as an inventory of the whole population. Because it is a complete count, it provides detailed information on e.g. the size of the population, its age structure and socio-economic characteristics (Anguilla, 2001). Censuses are used when accurate information is wanted about the whole population. Countries often perform censuses every 10 years (Anguilla, 2001; Australian-Bureau-of-Statistics, 2013), although not all countries execute censuses.

Differences Census and Sample Survey

The most notable difference between a census and a sample survey is the use of the entire population versus the use of a sample. The advantage of using a sample is mostly a reduction of costs, both in monetary terms and in staff requirements. Because a sample survey tends to be smaller than a census, less time is needed, and the time saved enables interviewers to ask more detailed questions. The results of sample surveys are also often available more quickly than the information of a census (Australian-Bureau-of-Statistics, 2013). The advantage of a census over a sample survey is the availability of information about small areas or sub-populations. In addition, the estimates are not subject to sampling error because no sample is used (Australian-Bureau-of-Statistics, 2013).

1.3.1 Possible errors in population measures

Many different systematic and random errors can disturb the distribution of survey estimates or influence the survey outcomes. The errors can be investigated separately, and together they form an estimate of the total error that influenced the survey. This error is called the Total Survey Error (TSE) (US_Government, 2001f). The TSE is a concept that tries to describe this variety of errors. It is a rather holistic approach: attention is given to the entire survey design as well as to the interaction between errors (Groves & Lyberg, 2010). The TSE is used as the framework of this research to deal with the different errors that can influence a survey. Chapter two describes the TSE extensively.

1.4 Relevance and Background of this research

Figures 1 to 5 show that the observed data in the population research of Sierra Leone draws attention: between the different measurement years there is a gap in the number of children under five. This under-reporting, or at least under-representation, of children under five is strange, and it needs to be investigated what could possibly have caused it. It is important to know why the representation of children under five is incorrect, so that it can be prevented the next time or corrected afterwards. This research addresses the gap in knowledge about the possible sources of error that could cause the under-reporting of children under five in Sierra Leone. The study is done mainly in a quantitative way, using the available survey data, and, where needed and possible within the given research time, in a qualitative way. It can be seen as a preparatory study in which the quantitative (and qualitative) analysis is done to create possible hypotheses for the causes of the under-reporting of children under five in Sierra Leone.

1.5 Aim and Research Questions The aim of this research is to find possible causes for the under-reporting of children under five in Sierra Leone by figuring out which components of the TSE can explain the under-reporting of the children under five in population research of Sierra Leone. To achieve this aim the main research question is formulated in the following way:

“Which components of the Total Survey Error can explain the under-reporting of the children under five in population research of Sierra Leone?”

1.6 Systematic outline of the research

In order to achieve the overall aim of this minor thesis some steps were needed. Figure 6 gives a schematic representation of these steps. The theoretical framework together with the survey information of Sierra Leone formed the context from which the study was conducted. The quantitative and qualitative study led to the results of this research, and the results made it possible to form hypotheses.


[Figure 6 schematic: the Theoretical Framework and the Survey Information Sierra Leone feed into the Method; the quantitative and qualitative study produce the Results, which lead to the Hypotheses]

Figure 6 Schematic outline of the research

2. Theoretical Framework

The Total Survey Error (TSE) forms the framework within which this thesis was conducted. The TSE describes all possible sources of survey error; it is briefly explained in this chapter.

2.1 Total Survey Error-Framework

Sample surveys are subject to error; the total survey error (TSE) is one of the most dominant paradigms for errors in survey methodology. This framework includes many possible sources of survey error (Cobben, 2009; Groves & Lyberg, 2010). The TSE combines the different separate sources of error into one overall or "Total Survey Error" (US_Government, 2001f). The separate sources of error are all the different forms of error that may arise in the design, collection, processing and analysis of the survey data. TSE is related to the term survey accuracy, because it is defined as the deviation of a survey estimate from the underlying true parameter value (Biemer, 2011). In the end the TSE is a concept that tries to explain the statistical properties of survey estimates by taking into account a variety of error sources (Groves & Lyberg, 2010; US_Government, 2001f). The ultimate result of all these errors is the discrepancy between the survey estimate and the true value in the target population; this discrepancy is the Total Survey Error (Cobben, 2009). The term total survey error is, however, not very well defined: different researchers include different components of error in the framework. The term does indicate that attention is needed for the entire set of components in a survey design, ranging from identifying the population to the last part of the survey research, the summarizing of the data (Groves & Lyberg, 2010). The TSE framework used in this research is shown in figure 7; its components are explained in the following sections.


Figure 7 Total Survey Error

2.2 Sampling errors

Sampling is important for surveys because it is usually impossible to get information from the entire population; a sample is taken to get information from a part of the population instead. The sampling error refers to the variability that arises by chance because a sample is surveyed instead of the whole population (US_Government, 2001e). Sampling error arises when a sample is not representative of the population from which it was drawn. It is an error that can be avoided when the entire population is asked, so that no sample is used (Cobben, 2009). Roughly there are two sources of sampling error: selection errors and estimation errors. The following sections briefly describe the selection error, followed by an explanation of the estimation error.

Selection-error The selection error is a systematic sampling error. The bias exists due to a flaw in the selection process. Generally this error arises when the actual inclusion probabilities differ from the true values in the population. This is for instance the case when a unit has multiple entries in the sampling frame or is duplicated (Cobben, 2009). To avoid selection errors, a thorough investigation of the sampling frame is needed.

Estimation-error

The estimation error is a random sampling error. It describes the error that arises from the simple fact that a variable has to be estimated from a sample instead of the entire population (Cobben, 2009; Dutter, 2003). Estimation errors arise when a random selection procedure is used: every time a new sample is selected, this leads to a different value of the estimator. This error can be controlled through the design of the sample, for instance by increasing the sample size.
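The effect of sample size on the estimation error can be illustrated with a small simulation on synthetic data; the population parameters and sample sizes below are arbitrary, chosen only to make the point visible:

```python
import random

random.seed(1)

# Synthetic "population" with mean ~50 and standard deviation ~10.
population = [random.gauss(50, 10) for _ in range(100_000)]

def spread_of_sample_means(n: int, repeats: int = 200) -> float:
    """Standard deviation of the sample mean over repeated samples:
    a direct measure of the random estimation error."""
    means = [sum(random.sample(population, n)) / n for _ in range(repeats)]
    avg = sum(means) / repeats
    return (sum((m - avg) ** 2 for m in means) / repeats) ** 0.5

# Each new sample yields a different estimate of the mean; a larger
# sample keeps the repeated estimates much closer together.
assert spread_of_sample_means(1000) < spread_of_sample_means(25)
```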

2.3 Non-sampling errors Non-sampling errors are mainly associated with the data-collection and data processing procedures. The non-sampling errors do not vanish when the entire population is measured instead of the sample (Cobben, 2009). These can result from both observation and non-observation errors. The following section describes first the observation errors and then the non-observation errors.

2.3.1 Observation errors

Observation errors occur because things are wrongly observed. This is either because units are observed that should not have been (overcoverage), because of the observations themselves (measurement error), or because of errors made during the processing of the data (processing errors) (US_Government, 2001a, 2001b, 2001d).

Overcoverage-error

Overcoverage refers to the situation where sample units are observed that do not belong to the target population. This might be due to units or sample elements that are duplicated in the sampling frame, or to elements that do not belong to the population but do have an entry in the sampling frame (Bethlehem & Biffignandi, 2012); the latter are also called erroneous inclusions. Measuring units outside the target population results in overcoverage (Cobben, 2009). Overcoverage is a matter of concern because erroneously included or excluded units may differ in some respects from the true population (US_Government, 2001a). To measure overcoverage errors, the sampling frame should be known as well as information about the target population.

Measurement error

Measurement error is related to the observation of variables in surveys and arises during data collection (Banda, 2003; US_Government, 2001b). Measurement errors mostly stem from the content of the survey, such as the definitions

of survey objectives, the transformation of the objectives into usable questions, the obtaining of answers and the recording of the answers. Measurement error arises when the concept measured by the survey differs from the concept one wants to measure (Cobben, 2009; US_Government, 2001b). Below, three different sources are described:

• Design, "the way of getting the requested information": measurement error as a result of the questionnaire design is often caused by the design, content, wording and length of the questionnaire (Banda, 2003; US_Government, 2001b).

• Respondent, "the recipient of the request for information": respondents can introduce errors, for example because of not understanding the question(s); this misunderstanding can lead to incorrect or careless answers. Respondents might also feel social pressure to give incorrect answers, and memory lapses can be a source of wrong answers as well (Banda, 2003).

• Interviewer, "the deliverer of the questionnaire": measurement error as a result of the interviewer may be caused by inadequate instruction of the field staff, language difficulties, unclear instructions, or interviewers using their own judgment when carrying out the fieldwork. When interviewers are a source of error this is mainly due to inadequate training of the field workers (Banda, 2003). During face-to-face interviews the presence of the interviewer may introduce error, and the presence of other household members may influence the answers as well; this is even more the case when dealing with sensitive questions (US_Government, 2001b).

Before the data collection starts, some measurement errors can be prevented, for example by cognitive testing, during which respondents provide information about their interpretation of the items of a questionnaire. Cognitive testing was first used to obtain insight into the respondents' thoughts (Kasprzyk, 2005). During the data collection, measurement error can be limited by, for example, re-interview studies, in which the same unit is measured twice, and by behaviour coding, which collects data to evaluate the performance of the interviewer and checks whether the questions are asked correctly. After the data collection the survey results can be compared with an external source, in so-called record check studies. This external source is usually

assumed to contain the true values; however, the external source itself is often not free of measurement errors. Another difficulty is the matching of the respondents of the survey with the external source (Kasprzyk, 2005; US_Government, 2001b). In the end, measurement error may arise in survey answers for different reasons, and it is most of the time difficult to quantify (US_Government, 2001b).

Processing-error

Processing errors comprise errors made during editing, coding, data entry, programming, etc. (Banda, 2003; US_Government, 2001d). A survey contains many processing steps, each of which can result in errors. Nowadays there are opportunities to minimize this error with new technologies such as computer-assisted interview techniques or the scanning of outcomes, although this may introduce other errors as a result of data transfer or translation: for example, when documents are scanned, a written number seven can be read as a number one (US_Government, 2001d). During the survey process the following processing errors may arise: pre-edit coding errors, editing errors, imputation errors and data entry errors. The last type occurs during the process of transferring the collected data to an electronic medium. This transferring needs keying, and double keying minimizes the key-entry error: two independent keyers key the same document and in the end the two datasets are compared (US_Government, 2001d).
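Double keying amounts to a record-by-record comparison of the two keyed datasets. A minimal sketch, with hypothetical document identifiers and keyed values:

```python
# Two operators key the same three documents independently.
keyer_a = {"doc1": "34", "doc2": "17", "doc3": "71"}
keyer_b = {"doc1": "34", "doc2": "11", "doc3": "71"}  # "7" misread as "1"

# Any disagreement is flagged so the source document can be rechecked;
# only doc2 differs here.
mismatches = sorted(doc for doc in keyer_a if keyer_a[doc] != keyer_b[doc])
print(mismatches)  # ['doc2']
```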

2.3.2 Non-observation errors

Non-observation errors occur because things are not observed. This is either because units are not included in the sample and so could not have been observed (undercoverage error), or because not all information could be obtained from the respondents (nonresponse error). These errors are explained in the following sections (US_Government, 2001a, 2001c).

Undercoverage error When units are not in the frame although they should have been, they cannot be selected for the survey. Their selection probability is zero, and as a result undercoverage arises. Most surveys have primary sampling units that are composed of clusters of geographic units. Households can be omitted from the frame when the demarcation is done improperly. A sample unit cannot be observed when its selection probability is zero (Banda, 2003; Cobben, 2009; US_Government, 2001e).

Coverage errors are difficult to estimate because information external to the sample and the sampling frame is required (Banda, 2003; US_Government, 2001e). The target population is the set of elements about which information is wanted. Coverage error is a matter of concern because units that are omitted or erroneously included might differ from those included in the survey. This difference might lead to biased conclusions about the target population. The existence of coverage error can only be measured with a reference outside the source. When comparing a survey with such an external reference, sampling errors of the external source must also be taken into account. Comparing a survey with other data sources is most informative when it is possible to identify the same subgroups (US_Government, 2001e).
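When an external reference is available for the same subgroups, such a comparison can be as simple as contrasting the under-five share in the survey with the share in the reference. The sketch below uses invented counts; the age-group labels and figures are not from the Sierra Leone data.

```python
# Illustrative comparison of the under-five share in a survey with an external
# reference distribution; all counts are made up.

def share_under_five(age_counts):
    """Fraction of listed persons aged 0-4 in an {age_group: count} table."""
    total = sum(age_counts.values())
    return age_counts.get("0-4", 0) / total

survey = {"0-4": 120, "5-9": 180, "10-14": 150}
reference = {"0-4": 170, "5-9": 160, "10-14": 150}

diff = share_under_five(survey) - share_under_five(reference)
print(diff)  # a clearly negative difference suggests under-reporting of under-fives
```

In practice the sampling error of both sources would have to be taken into account before interpreting such a difference.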

Nonresponse error Nonresponse reflects an unsuccessful attempt to obtain the desired information from the respondents (US_Government, 2001c). There are two main types of nonresponse: unit nonresponse and item nonresponse. Unit nonresponse indicates that there is no information at all from a certain sample unit. Item nonresponse indicates that the information of some units is incomplete; some items are not answered or are answered incompletely (Banda, 2003; Cobben, 2009; US_Government, 2001c). Unit nonresponse can have causes like refusals or failure to contact the units. Item nonresponse can be caused, for example, by the sensitivity of items, embarrassment or refusals to answer specific items. In the end the nonrespondents are the sample units from whom it was impossible to get (all) information. Most of the time the nonrespondents are not evenly spread across the sample but concentrated among subgroups. Those subgroups often deviate from the selected sample. When this is the case, nonresponse introduces a bias in the survey results (Banda, 2003; US_Government, 2001c). A good frame, follow-up of nonresponding units, and interviewer training, selection and supervision are methods which can reduce nonresponse as much as possible (Banda, 2003). It is possible to calculate a nonresponse rate. This is however only an indicator of the potential bias in a survey estimate due to nonresponse. The actual degree of nonresponse bias is influenced by the difference between respondents and nonrespondents on the survey variables of interest (US_Government, 2001c). The measurement of response rates is done in a straightforward way by dividing the number of responding units by the number of eligible units (US_Government, 2001c).
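The straightforward calculation mentioned above, responding units divided by eligible units, can be sketched as follows. The regional counts are invented for illustration and are not taken from the surveys discussed in this thesis.

```python
# Minimal sketch of an unweighted unit response rate, overall and per subgroup.

def response_rate(responding, eligible):
    return responding / eligible

# hypothetical counts per region
eligible = {"Eastern": 1800, "Northern": 2400, "Southern": 1900, "Western": 1500}
responding = {"Eastern": 1746, "Northern": 2352, "Southern": 1862, "Western": 1380}

overall = response_rate(sum(responding.values()), sum(eligible.values()))
by_region = {r: round(response_rate(responding[r], eligible[r]), 3) for r in eligible}
print(overall, by_region)
```

Reporting the rate separately per subgroup, as in `by_region`, is what makes it possible to spot concentrations of nonresponse.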

It is possible to make assumptions about nonrespondents by studying characteristics known for the respondents as well as the nonrespondents. When the distribution of these variables is the same for respondents and nonrespondents, the concern about bias is limited. When the distributions differ, this may provide insight into the characteristics measured in the survey (US_Government, 2001c). Auxiliary data are data that are available for every unit in the population. These variables are used to improve the sampling plan or to make estimates of variables of interest (Cohen, 2008). Auxiliary data are mostly used to adjust for nonresponse. Demographic characteristics measured during the survey or during the listing of the households are used for nonresponse adjustments because they contain information about individuals and subgroups (Bethlehem, Cobben, & Schouten, 2011; Olsen, 2013). One of the most important correction techniques for nonresponse is weighting. This technique assigns a weight to every responding unit of the survey. Estimates of population characteristics are obtained by processing the weighted observations instead of the observations themselves. The weighting is often based on available auxiliary information. This information is measured during the survey and provides information about the population distribution. Using that population distribution the weights can be assigned (Bethlehem & Schouten, 2004). Another method to correct for nonresponse is imputation. Imputation is a method to fill in missing data. Imputation also uses auxiliary variables that are preferably related to the variable where the nonresponse occurs. With imputation methods the missing items or units get imputed values based on the available characteristics (Durrant, 2005). When reporting on a survey it is important to mention unit and item nonresponse as important errors that may occur.
The overall response rates should be provided. For important subgroups the response rate should be reported separately. When weighting or imputation procedures were applied to the survey respondents to compensate for nonresponse, these procedures should be described in the report (US_Government, 2001c). There are two basic approaches to deal with nonresponse. The first is to prevent nonresponse from happening in the field. The second is to deal with the possible bias introduced by nonresponse when analysing the data (Bethlehem et al., 2011). Informing the respondents about the survey prior to the data collection, for example by a letter or phone call, can reduce nonresponse. Nonresponse is also reduced

by incentives, follow-up techniques in which renewed attempts are made to contact the nonrespondents, and interview guidelines (US_Government, 2001c, 2001f).
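The weighting adjustment described earlier in this section can be sketched as a simple weighting-class adjustment: within each class defined by auxiliary data, the design weights of the respondents are inflated by the inverse of the class response rate. The class variable (urban/rural), the design weights and the response pattern below are invented for illustration and are not taken from the Sierra Leone surveys.

```python
# Hedged sketch of a weighting-class nonresponse adjustment.

def adjusted_weights(units):
    """units: list of dicts with keys 'class', 'design_weight', 'responded'."""
    # class-level response rates: [responding, eligible] per class
    rates = {}
    for u in units:
        totals = rates.setdefault(u["class"], [0, 0])
        totals[1] += 1
        if u["responded"]:
            totals[0] += 1
    # inflate each respondent's design weight by the inverse class response rate
    return [
        u["design_weight"] / (rates[u["class"]][0] / rates[u["class"]][1])
        for u in units if u["responded"]
    ]

units = [
    {"class": "urban", "design_weight": 100.0, "responded": True},
    {"class": "urban", "design_weight": 100.0, "responded": False},
    {"class": "rural", "design_weight": 150.0, "responded": True},
    {"class": "rural", "design_weight": 150.0, "responded": True},
]
print(adjusted_weights(units))  # urban weight doubles: class response rate is 1/2
```

The adjustment removes bias only to the extent that respondents and nonrespondents within a class are similar on the survey variables of interest.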

3. Datasets Three surveys from two different organizations were available as data sources for this minor thesis research. These were the Sierra Leone Demographic and Health Survey (SLDHS) 2008 and the Sierra Leone Integrated Household Survey (SLIHS) from 2003/04 and 2011. First the SLDHS is shortly described and thereafter the SLIHS. More detailed information in relation to the aim of this minor thesis is given in chapter 5, Results.

3.1 Sierra Leone Demographic and Health Survey 2008 The project Monitoring and Evaluation to Assess and Use Results Demographic and Health Surveys, abbreviated MEASURE DHS, provides technical assistance to more than 260 surveys in over 90 countries. Data are collected from survey questionnaires (the DHS Survey Questionnaire, the AIS Survey Questionnaire, the SPA Questionnaires, other quantitative surveys and qualitative research), biomarker testing and geographic locations. So-called model questionnaires are developed as standard questionnaires by the MEASURE DHS organization. The main approach of the MEASURE DHS program is to collect comparable data across different countries. Within the standard questionnaires countries can add questions of special interest, and questions can be deleted from the model if they are irrelevant in a particular country (MEASURE_DHS, 2013). The MEASURE DHS program performed a survey in Sierra Leone in 2008, the Sierra Leone Demographic and Health Survey 2008. Statistics Sierra Leone (SSL) carried out this survey in collaboration with the Ministry of Health and Sanitation. The Government of Sierra Leone provided assistance in terms of funding, staff time, office space and logistical support. Additional funding for this project was provided by the U.S. Agency for International Development, the United Nations Population Fund, the United Nations Development Programme, the United Nations Children's Fund (UNICEF), the Department for International Development and the World Bank (SLDHS 2008).

Survey sample design The SLDHS sample was designed to provide information about different indicators for the country as a whole. This was done by dividing the country into four regions, which in turn consist of districts. The regions were formed as follows (SSL, 2009): • Eastern: Kailahun, Kenema, Kono districts

• Northern: Bombali, Kambia, Koinadugu, Port Loko, Tonkolili districts • Southern: Bo, Bonthe, Moyamba, Pujehun districts • Western: Western Area Urban and Western Area Rural districts The SLDHS sampled households in two stages. First, 353 clusters were selected from a list of enumeration areas of the sampling frame, the Sierra Leone Population Census 2004. A complete household listing was then carried out in each selected cluster. From each selected cluster, 22 households were systematically selected from the listing to participate in the survey (SSL, 2009). The population of Sierra Leone is the target population of the SLDHS household survey. People eligible for the individual surveys were women aged between 15 and 49 and men aged between 15 and 59. The aim of the SLDHS is to provide information about the population and health issues in Sierra Leone (SSL, 2009).
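The two-stage selection described above can be sketched as follows. This is a simplified illustration: clusters are drawn here by simple random selection, which stands in for whatever cluster-selection scheme the actual survey used, and the EA identifiers, frame size and listing size are invented.

```python
# Sketch of a two-stage design: select clusters (enumeration areas) at stage
# one, then systematically select 22 households from each cluster's listing.
import random

def systematic_sample(listed_households, n):
    """Select n units from a listing with a fixed-interval (systematic) sample."""
    interval = len(listed_households) / n
    start = random.random() * interval           # random start in [0, interval)
    return [listed_households[int(start + i * interval)] for i in range(n)]

random.seed(1)
enumeration_areas = [f"EA-{i:03d}" for i in range(2000)]   # hypothetical frame
clusters = random.sample(enumeration_areas, 353)           # stage 1: 353 clusters
listing = [f"{clusters[0]}-hh{j}" for j in range(180)]     # stage 2: one EA's listing
households = systematic_sample(listing, 22)
print(len(clusters), len(households))
```

In the real design the household listing of each cluster is compiled in the field after the cluster has been selected, which is why the listing step sits between the two selection stages.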

Data collection The SLDHS used three types of questionnaires for the 2008 survey in Sierra Leone: the so-called Household Questionnaire, Women's Questionnaire and Men's Questionnaire. The questionnaires are based on the model questionnaires of the DHS for countries with low levels of contraceptive use. Statistics Sierra Leone, in collaboration with other stakeholders, took some measures to adapt the questionnaires to the situation in Sierra Leone (SSL, 2009). The goal of the Household Questionnaire was to list all the usual members and visitors of the selected households. This is where the total household listing was done and where the children under five should also be listed. Basic information was collected on each listed person, including age, sex, education and the relationship to the head of the household. With the Household Questionnaire eligible men and women were identified for the individual interviews. For the women's survey every sampled household was selected, and for the men's survey every second sampled household was selected before the start of the data collection. In the households selected for the men's survey, the height and weight of women aged 15-49 and children under five were also measured (SSL, 2009). During the data collection period there were 24 teams to carry out the surveys. Each team consisted of one supervisor, one field editor, one biomarker, two female

interviewers and one male interviewer. The data processing began shortly after the start of the data collection. Data processing personnel were responsible for the processing of the data. This team consisted of two supervisors, five office editors, 15 data entry editors, 23 data entry operators and four secondary editors (SSL, 2009).

3.2 Sierra Leone Integrated Household Survey 2003/04 and 2011 The Sierra Leone Integrated Household Survey (SLIHS) was first implemented between March 2003 and April 2004. A second SLIHS was performed between January and December 2011 (TheWorldBank, 2013). The Government of Sierra Leone wanted to prepare a Poverty Reduction Strategy Paper (PRSP) and did not want to use outdated or inadequate poverty data for this paper. Therefore, in 2003 it commissioned Statistics Sierra Leone to conduct an integrated household survey that would also provide data about poverty in preparation of the PRSP (Rogers, 2004). Statistics Sierra Leone conducted the SLIHS. The SLIHS was conducted and analysed with funding from the Department for International Development. The World Bank provided technical assistance during the survey periods (SSL, 2007). The sample size of the 2003/04 survey was 3,714 and that of the 2011 survey 6,727 (TheWorldBank, 2013).

Survey sample design The sampling frame for the SLIHS 2003/04 was based on the 1985 population census of Sierra Leone; for the SLIHS 2011 it was also based on the 1985 census, with the required adjustments. For the 2003/04 survey the census sampling frame was used and the data were weighted to match the 2004 census data, so that they portrayed the current status of the country. A systematic sample was selected from 2,553 listed census enumeration areas. This list was ordered by region, district and rural or urban living area. The sample was stratified by rural and urban living areas (Rogers, 2004). The enumeration areas of the census (with information about households, population and urban/rural specification) were used as primary sampling units or so-called clusters (SSL, 2011).
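Ordering the EA list by region, district and urban/rural before taking a systematic sample is a form of implicit stratification: the fixed selection interval spreads the sample roughly proportionally over those groupings. The sketch below illustrates this with an entirely invented frame; the group labels, frame size and sample size are not from the SLIHS.

```python
# Illustration of implicit stratification: a systematic sample from a sorted
# frame automatically covers every region roughly in proportion.

def systematic_indices(population_size, sample_size, start=0.5):
    """Indices of a systematic sample with a fixed fractional start point."""
    interval = population_size / sample_size
    return [int(start + i * interval) for i in range(sample_size)]

# hypothetical frame records: (region, district, urban_or_rural, ea_id)
frame = sorted(
    (r, d, u, f"EA-{r}{d}{u}{k}")
    for r in "NSEW" for d in "123" for u in ("U", "R") for k in range(20)
)
sample = [frame[i] for i in systematic_indices(len(frame), 48)]
regions = {rec[0] for rec in sample}
print(len(frame), len(sample), sorted(regions))  # all four regions are represented
```

A simple random sample of the same size could, by chance, miss or over-represent a region; the sorted systematic sample cannot.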

Data collection Different questionnaires were used during the 2003/04 and 2011 surveys. They collected information about 11 different but somewhat related topics in both the 2003/04 and 2011 surveys. The 2003/04 survey used five questionnaires. The 2011 survey covered the same topics but incorporated them in two questionnaires. Table 1 gives a summary of the different questionnaires and their content. It shows that despite

the use of different questionnaires the content remains roughly the same. In the SLIHS 2003/04 the age and composition of the household was asked in Part A, the household questionnaire. In the SLIHS 2011 survey the age and composition of the household was also asked in Part A, the household roster and household characteristics questionnaire (SSL, 2003, 2007, 2011). The information asked in Part A also includes the information on the children under five.

SLIHS 2003/04
Part A Household Questionnaire (household roster, education, health, employment, migration and housing)
Part B Household / agricultural Questionnaire (agriculture, average monthly household income, household expenditure, non-farm enterprises, income transfers and miscellaneous income & expenditure, and credit, assets and savings)
Part C Price Questionnaire (prices that households face in practice, especially relevant for Sierra Leone where prices vary across regions)
Part D Community Questionnaire (location and quality of nearby health facilities and schools, condition of local infrastructure like roads, sources of fuel and water, availability of electricity, means of communication and agricultural conditions)
Part E Anthropometrics (information collected in order to build policies for socio-economic development)
SLIHS 2011
Part A Household roster and household characteristics questionnaire (household roster, education, health, employment and time use, migration, housing, ownership of durable goods, crime and security, subjective poverty and effects of conflict)
Part B Household consumption expenditure and income questionnaire (agriculture, household expenditure, credit and savings, income transfers and miscellaneous income and expenditure, and non-farm enterprises)

Table 1 SLIHS survey parts and contents The emphasis of the SLIHS data analysis was on consumption expenditure, to get information on poverty indicators over a reasonable period. This information was needed for the PRSP (Rogers, 2004). Both SLIHS surveys used interview teams. These teams consisted of one supervisor and six interviewers. The interviewers conducted interviews with households on a daily basis. Each selected household was visited seven times by the interviewers, with a minimum of three days between the interviews. The task of the interviewer was to conduct at least five interviews every day (SSL, 2011). The Part A questionnaires had to be conducted first because during that part respondents were also identified for the other parts of the interviews (SSL, 2003, 2011).

4. Method There are different methods to use the total survey error as a tool to estimate possible errors. One way of using the TSE is to make quality profiles (US_Government, 2001f). The goal of a quality profile is to enhance data users' understanding of the limitations of the statistics produced from the survey and to improve the quality of the statistics. A quality profile informs about all the potential sources of error in a survey. It describes what is known or suspected to be a potential error. The content of a typical quality profile covers topics like coverage, nonresponse, measurement error, data processing and estimation. A quality profile includes quantitative as well as qualitative results. In a quality profile the processes are closely examined, and it provides information about the processes which have to be improved. As can be imagined, a quality profile covers a large number of potential sources of error because it focuses on the total survey error instead of on one single part of the TSE (US_Government, 2001f). For this research, the quality profile is chosen as a method to approach and learn about the different possible errors in the three survey datasets that can explain the under-reporting of children under five in Sierra Leone's surveys. The available information on the surveys is used to construct quality profiles by looking separately for potential sources of errors in the data and by identifying similarities and differences between the surveys. The components of the TSE are approached individually. The following sections of this chapter explain the method used to approach the errors of the TSE. Only the errors approached during this research are explained.

4.1 Sampling error The census is a full enumeration of the population and as such cannot suffer from sampling error. Since the census shows the under-reporting of children under five as well (see section 1, Introduction), the under-reporting cannot be explained by sampling errors. For this reason, sampling errors are not regarded in the quality profile used to determine which components of the TSE are possible causes of the under-reporting of children under five.

4.2 Non-sampling errors

4.2.1 Observation errors The following section describes the method used to approach measurement and processing errors. Overcoverage errors are not included in this study and therefore

they are not treated in this method section. The information on the survey framework and target population required to approach overcoverage errors was not available.

Measurement error To find out whether measurement error is a possible source of the under-reporting of children under five, this research searched the available information for techniques to limit measurement errors. The different surveys were checked to see if there is a possibility that measurement error(s) occurred that could explain the under-reporting of children under five during these surveys. As stated in chapter 2, section 2.2.2 "Measurement error", there are different ways to check for those errors and methods to avoid measurement errors during surveys (Kasprzyk, 2005). The approach to measurement error is based on the information stated in the survey reports. This can only lead to the techniques described by the survey organizations. To determine measurement error in a more quantitative way, benchmark information at the individual level is needed. Either the assumption needs to be made that the benchmark information is free of measurement error, or statements can only be made about differences in measurements between the two sources. A usual suspect for providing this benchmark information is the census. However, as mentioned before, the census suffers from exactly the same problems that could be caused by measurement error. Since there are no other sources available, the analysis of measurement error in this thesis is restricted to the study of the available reports.

Processing error Processing errors are assessed for the different steps of the survey process: data collection, editing, coding and data entry. In addition, the persons or teams involved in the different aspects of the survey processes are identified. Hence the assessment of possible processing errors is based on the study of the available survey information.

4.2.2 Non-observation errors The following sections describe the method used to approach undercoverage and nonresponse errors.

Undercoverage error Approaching undercoverage errors is often a challenging assignment, mainly because it concerns units that were omitted or that did not have a probability of being selected for the

survey. Looking for undercoverage errors involves critically evaluating the exclusion criteria of sampling frames and searching for other possible sources of undercoverage. This research approached undercoverage error in two ways: first by searching the available data of the surveys for indicators of undercoverage, and secondly by consulting a fellow researcher who is more aware of the characteristics of Sierra Leonean households. In the available information special attention is given to the definition of the frame and the target population. Furthermore, information about the development of the frame, changes of the frame over time, updates of births and deaths and other important changes of the study population has been examined for the surveys. This is done because it is assumed that the presence of these descriptions indicates that the survey designers made an effort to avoid undercoverage errors.

Nonresponse error As explained in chapter 2 there are two main types of nonresponse, namely item and unit nonresponse. In this research item nonresponse is not regarded. There are two approaches to deal with unit nonresponse: prevent it from happening in the field, or adjust for nonresponse bias afterwards (Bethlehem et al., 2011). The datasets are studied for these two approaches. The reports of the surveys are first studied to discover what has been done to deal with nonresponse. This should give information about the approaches used to deal with nonresponse in the surveys, in a preventive manner or by adjusting for nonresponse afterwards. Secondly, the available survey data were used for a quantitative analysis. When information is known for the nonrespondents as well as the respondents, assumptions can be made. For these surveys there was however no information about the nonrespondents. The SLDHS survey did, however, collect usable information about the contact attempts. The SLDHS used a total of three contact attempts for people who did not respond immediately; when a household had not responded after the third day it was regarded as a nonrespondent (SSL, 2009). To obtain insight into a possible nonresponse bias, a logistic regression is performed for the respondents after one visit, and for the respondents after more visits. Before performing the logistic regressions, bivariate analyses were done to select variables for the logistic regression models. The idea behind the logistic regression is that the 'late' respondents (after more visits) are more similar to the remaining nonrespondents than the 'fast' respondents (after one visit). Finding a pattern in the differences between respondents and nonrespondents

after consecutive contact attempts could be an indication of a nonresponse bias. Thus, based on the information about the contact attempts in the SLDHS survey, two logistic regressions were done to see whether there are differences in the composition of the response after the first visit or after more visits, and whether a pattern is visible. For the SLIHS this information was unfortunately not available.

Logistic Regression To explore the effect or influence of one or more independent variables on a dependent variable, a regression analysis is a logical choice. The outcome variable is the main difference between a logistic regression model and a linear regression model. In a logistic regression the outcome variable is binary or dichotomous (Hosmer & Lemeshow, 1989; Sieben & Linssen, 2009). When the dependent variable is a binary random variable, the distribution of Y reduces to the values 0 and 1. The responses may be represented as one of two possible values, e.g. yes (1) or no (0) (Ott & Longnecker, 2010). A simple linear regression is then no longer appropriate. For binary variables a logistic regression is the solution (Sieben & Linssen, 2009). In that case the probability varies between zero and one, while the odds vary between zero and infinity (Ott & Longnecker, 2010).
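As a small numeric illustration of the point above: the logistic function keeps the predicted probability between 0 and 1 for any value of the linear predictor, while the corresponding odds range from zero towards infinity.

```python
# The logistic function maps any linear predictor x = b0 + b1*z to a
# probability in (0, 1); the odds p / (1 - p) then range over (0, infinity).
import math

def logistic(x):
    """P(Y=1) under a logistic model with linear predictor x."""
    return 1 / (1 + math.exp(-x))

def odds(p):
    return p / (1 - p)

for x in (-4, 0, 4):
    p = logistic(x)
    print(x, round(p, 3), round(odds(p), 3))
# the probabilities stay in (0, 1); the odds grow without bound as x increases
```

This bounded behaviour is exactly what a simple linear regression lacks for a 0/1 outcome: its fitted values can fall outside the unit interval.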

Logistic Regression in the SLDHS survey By using the information about the response collected in the different contact attempts it might be possible to draw some conclusions about the nonrespondents. The strength of the analysis depends on the type of information that is available to include in the models. Therefore, information that is related to nonresponse as well as information related to the under-reporting of children under five is needed. So what information is related to the under-reporting of children? For example, information about giving birth and risk factors for high fertility rates. Worldwide, one in five girls gives birth before the age of 18. In the poorest regions of the world this number rises to one in three girls. These adolescent births are more likely to occur among poor, less educated and rural populations. Girls are often pressured to marry and bear children early. These girls have limited education and employment prospects. Education is a major protective factor; birth rates are higher for low-educated women compared to women with secondary or tertiary education. Besides these education and employment factors, marriage is also an important indicator for pregnancies (WHO, 2012). The under-reporting of children under five does not have to be related to adolescent births, so information on risk factors for high adult fertility rates is important as well. It is known

that there are some factors influencing higher fertility rates, not only for adolescent births (Sinding, 2009; Walter, 2011). There are some cultural factors, but age of marriage is a factor, as well as education opportunities, family planning services and government policies. Generally, people in urban areas tend to have fewer children compared to people in rural areas. The same holds for women who are more educated compared to less educated women / couples (Walter, 2011). Improved living standards such as urbanization, industrialization and rising opportunities for non-agrarian employment, improved educational levels and better health lead to lower fertility levels (Sinding, 2009). For this thesis the following variables were used as indicators for having children:
• Wealth index (poorest, poorer, middle, richer, richest)
• Highest educational level (no education / preschool, primary, secondary, higher)
• Education in single years
• Current marital status (never married, married, widowed, divorced)
• Region (north, east, south, west)
• Place of residence (Freetown, small city, town, countryside)
• Selection for male interview / number of eligible children for height and weight
The last one is included because that group includes children under five who are measured for their height and weight. The households that were selected for the male interview were also selected for these measurements (height and weight) of the women and their children. So this is not really an indicator for having children, but it might relate to having children because the children should be measured, and this was something notable that was already found. The other variables were chosen because of the abovementioned relation with having children. The variables are used to perform a logistic regression together with the fastness of responding.
These variables are used for the first logistic regression; the second regression included all the available background information in addition to the variables in relation to having children. The other background variables are:
• Sex of head of household
• Age of head of household
• Local council
• District
• Number of household members
• Number of de jure members
• Number of de facto members
• Number of children under five
• Type of place of residence (rural or urban)
• Field supervisor
• Field editor
• Office editor
• Total adults measured
• Number of eligible women in household
• Number of eligible men in household
• Interviewer ID
• Keyer ID
These variables are, together with the first-mentioned variables, compared with the fastness of responding. The fastness of responding is one of the items measured in the SLDHS questionnaire: the SLDHS tracked how many visits were needed to contact the respondents. Because of the lack of information on the nonrespondents, this information too is known only for the respondents. There were three visiting days. For the 'fast respondents' model, the households responding on day one were categorized as fast respondents and the households responding on day two or three as "proxy-nonrespondents". For the 'late respondents' model, the households responding on day one or two were seen as late respondents and the households responding on day three as "proxy-nonrespondents". Tables 2, 3 and 4 show the distribution of respondents before and after the recoding for the fast and late respondents. It is visible that the response rates are already very high after the first visiting day, and this remains so after the recoding.
For the logistic regression, the dependent variable is binary. For the first logistic regression, all respondents after one visit were coded as '1' and the respondents after two or three visits as '0'. For the second logistic regression, respondents after one or two visits were coded as '1' and the respondents after three visits as '0'. Hence, it needs to be noted that the data in the two logistic models overlap (respondents after one visit are respondents for both the first and the second logistic regression) and that a logistic regression as described and applied here does not deal with this overlap. A multinomial model would be more appropriate. In addition, there is nesting in the data because there are different levels involved. For instance, supervisors supervise multiple interviewers, who interview multiple households. To deal with this nesting, a multilevel model would be appropriate. For this multilevel model there is however not enough information to perform such a regression. For the multinomial regression the required information is available, but in this minor thesis a multinomial model was not performed, mainly due to time constraints; it remains an informative next step in investigating nonresponse errors. The binary logistic regression is computed in accordance with the following assumptions:
• Binary logistic regression requires the dependent variable to be nominal and dichotomous
• Binary logistic regression assumes that P(Y=1) is the probability of the event occurring; the variables should be coded in that way
• The error terms need to be independent: logistic regression requires each observation to be independent
• Logistic regression requires the independent variables to be linearly related to the log-odds
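The dependent-variable coding described above can be sketched as follows; the visit counts in the example are invented, not taken from the SLDHS data.

```python
# Coding of the two binary outcomes: model 1 contrasts day-1 respondents (1)
# with day-2/3 respondents (0); model 2 contrasts day-1/2 respondents (1)
# with day-3 respondents (0). The two codings overlap by construction.

def code_fast(visits_needed):
    """Model 1: 1 = responded at the first visit, 0 = later."""
    return 1 if visits_needed == 1 else 0

def code_late(visits_needed):
    """Model 2: 1 = responded at the first or second visit, 0 = third visit."""
    return 1 if visits_needed <= 2 else 0

visits = [1, 1, 2, 3, 1, 2]          # hypothetical visits needed per household
y_model1 = [code_fast(v) for v in visits]
y_model2 = [code_late(v) for v in visits]
print(y_model1)  # [1, 1, 0, 0, 1, 0]
print(y_model2)  # [1, 1, 1, 0, 1, 1]
```

Every household coded 1 in model 1 is also coded 1 in model 2, which is exactly the overlap noted above that a multinomial model would handle more cleanly.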

5. Results The results are separated by the two surveys used in this minor thesis and are discussed separately. First the SLDHS survey is discussed; each possible source of error of the TSE is described shortly. Thereafter the results of the SLIHS are described in the same way.

5.1 SLDHS The available information for the SLDHS survey consisted of the SPSS datasets from the 2008 survey, with the corresponding survey report and survey questionnaires. The results are based on these data and documents. As stated in chapter 4, Method, only the non-sampling errors are studied. The following part describes first the observation errors and then the non-observation errors. Only the results of the errors that yielded informative findings about the under-reporting of children under five are described.

5.1.1 Observation errors The following sections cover the results for the errors that are part of the observation errors. First the results for measurement error and then the results for the processing errors are described.

5.1.1.1 Measurement error

The most notable findings for the measurement error are described in the sections Design, Respondent and Interviewer. The measurement errors are not studied in an explorative way because, as stated in chapter 4, this was not possible. The findings about measurement errors are based on the available information about the SLDHS survey.

Design

Possible errors in relation to the response format were not found, although it might be possible that the format introduced some measurement errors. The interviewers had to ask both open and closed questions, and these different forms may introduce different kinds of errors. In relation to the under-reporting, such errors were not found at first sight for the SLDHS survey. More extensive research is needed to say more about the measurement errors related to the design of the questionnaire.

Respondent

One of the possible measurement errors found in relation to the respondent was the language of the questionnaire. In Sierra Leone the main language is English, and the different tribes have their own dialects or languages. Given that there are many local languages in Sierra Leone and that English is widely spoken, it was decided not to translate the SLDHS questionnaire into other languages (SSL, 2009). When the interpretation or understanding of questions is difficult, this can lead to wrong answers. Another indication of respondent measurement errors is the memory or attention needed to answer the survey questions. Data on the number of children ever born reflect the number of births over the past 30 years. This recall error might be higher for older women than for younger women (SSL, 2009). It might, however, play a smaller role here because the under-reporting concerns children under five, which does not rely very heavily on the memory of the women. Some other errors are likely to occur as well, for example social desirability, fatigue, learning, and guessing. There are no indications that these errors are more likely in the SLDHS than in other surveys.

Interviewer

During the SLDHS three types of questionnaires were used: the Household Questionnaire, the Women's Questionnaire, and the Men's Questionnaire. For the training, 200 qualified candidates were recruited to serve as supervisors, field editors, interviewers, biomarker technicians and quality control personnel. High-calibre personnel were recruited nationwide to ensure appropriate language and cultural diversity. Senior staff from the Statistics Sierra Leone office trained the candidates with support from UNFPA, UNICEF, the Ministry of Health and Sanitation and ICF Macro. The training consisted of lectures, demonstrations, practice interviews and examinations, and during the final week the participants had two days of field practice. This intense training reduces the risk of interviewer measurement errors. The translation and wording of questions might have introduced some errors. The interviewers were, however, trained to break down and translate the questions; each interviewer was even provided with a list of short translations to standardize the translation, and this way of translating was part of the training (SSL, 2009). Despite this partial standardization, some differences in wording might still arise (resulting in a wording effect, which is part of the item/content-related measurement errors). When there is a

difference in language between the interviewer and the respondent, this might result in false answers because of misunderstanding. Relating this to the children under five, it is possible that the wording of a question went wrong, resulting in a misinterpretation. In an extreme example, a respondent might interpret a question about their children under five as being about lost children under five (this is only an illustrative example, not based on facts). Besides the interviewers, the other persons involved during the data collection were supervisors, field editors, biomarker technicians and quality control personnel. Introducing these additional people is intended to reduce errors, but their presence might introduce errors as well. These persons are all in some way involved in the administration or collection of the survey information. The roles of these control staff members are not clearly defined; the reports only state that they were there. Whether the presence of these additional persons could have caused the under-reporting is something that needs to be investigated further.

5.1.1.2 Processing error

In total 24 teams carried out the data collection. Each team consisted of one supervisor, one field editor, one biomarker technician, two female interviewers and one male interviewer. The senior DHS technical staff visited these teams regularly to review the work and monitor the data quality. Shortly after the start of the fieldwork the data processing started. During the fieldwork the completed questionnaires were returned on a regular basis to the headquarters in Freetown, where they were entered and edited by the data-processing personnel (SSL, 2009). The distance between the interviewers and the data-processing staff might introduce an error. The survey design tries to avoid this through the regular visits of the senior DHS staff. Despite these visits, the fact remains that the data-processing personnel only process the data, so when flaws exist in the data they probably do not know. As an extra verification the data were entered twice (SSL, 2009). This verification minimizes typing errors and errors due to misreading. Because the data processing was done simultaneously with the data collection, the data processors were able to advise the field teams when errors were detected during the data entry. Whether errors were actually detected cannot be found in the reports.
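As an illustration of how double data entry catches keying errors, the two entry passes can be compared field by field; the household IDs and field names below are hypothetical, not the actual SLDHS data layout:

```python
# Hypothetical sketch of double-entry verification: the same questionnaires
# are keyed twice, and any field where the two passes disagree is flagged
# for re-checking against the paper questionnaire.
entry_pass_1 = {"HH001": {"children_under_5": 2, "hh_size": 6},
                "HH002": {"children_under_5": 0, "hh_size": 3}}
entry_pass_2 = {"HH001": {"children_under_5": 2, "hh_size": 6},
                "HH002": {"children_under_5": 1, "hh_size": 3}}

# Collect every (household, field) pair where the two passes disagree.
mismatches = [(hh, field)
              for hh, record in entry_pass_1.items()
              for field, value in record.items()
              if entry_pass_2[hh][field] != value]
```

Only flagged fields need manual resolution, which is why double entry mainly protects against typing and misreading errors rather than errors already present on the paper forms.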

5.1.2 Non-observation errors

The following parts describe first the results for the undercoverage error and then those for the nonresponse error.

5.1.2.1 Undercoverage error

To get an idea of the possible undercoverage in the SLDHS survey, the exclusion criteria and possible cultural causes have been examined. The SLDHS used the census data for the selection frame. This frame already excluded people living in collective housing units such as hotels, hospitals, work camps and prisons (SSL, 2009). Collective housing units might also cover orphanages. To learn more about this, a researcher with more expertise on this subject was contacted. The expert confirmed that the census indeed excluded institutions like orphanages and also boarding schools, but noted that there are not that many boarding schools or orphanages in Sierra Leone (Foster, 2013). Even though there are not many orphanages or boarding schools, the young children living in one of them are excluded from the frame and have no chance of selection, so they are undercovered. The same holds for children in hospitals. At least some of the children might be found in those collective housings, but those are probably also children belonging to other age categories. The expert also described another phenomenon, namely the “big houses”. These exist among the Mende, the main ethnic group in the south of Sierra Leone. Big houses are houses where women sleep, together with their young children, when they are currently not having intercourse with their husband because they are pregnant, nursing or too old. This phenomenon is in contrast with the census definition of households, which defines a household as the people eating out of one pot (SSL, 2011). And it is in even starker contrast with the SLDHS definition of household members: a woman is, for example, eligible for the survey when she is aged between 15-49 and slept in the selected household the night before the survey (SSL, 2009).
The big houses are a matter of concern; when these houses are excluded from the survey, this might cause under-reporting of children under five. However, if the big houses were not taken into account in the surveys, many women would be under-reported as well.

5.1.2.2 Nonresponse error – survey information

The nonresponse numbers for the SLDHS survey come from Statistics Sierra Leone (2009). For the survey sample 7,758 households were selected. Of those, 7,461 were found during the fieldwork, and 7,284 were successfully interviewed. The response rate was 98%, which is very high, leaving little room for nonresponse bias. During the household questionnaire interview the household members were listed for the individual questionnaires. For the women's questionnaires the response rate was 94% and for the men's questionnaires 93%. See table 2 for the response rates. The overall response rate was lower in the urban areas than in the rural areas, which was more visible for the men's questionnaires. The most noted reasons for nonresponse were the failure to find individuals at home despite repeated visits and refusal to participate (SSL, 2009). The high response rate can partially be explained by the fact that the SLDHS used three call-backs scheduled over different days and times to get the best response rate (Ruilin, 2013)1. So a household is contacted three times before it is treated as a nonrespondent. Table 2 presents the response rates as stated in the survey report.
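The reported household response rate can be reproduced from the counts above, assuming (as DHS-style reports commonly do) that it is computed over the households found/occupied rather than over all selected households:

```python
# Figures from the SLDHS 2008 report (SSL, 2009).
selected = 7758      # households selected for the sample
found = 7461         # households found/occupied during fieldwork
interviewed = 7284   # households successfully interviewed

# Assumed convention: response rate = interviewed / found households.
household_response_rate = interviewed / found
print(round(100 * household_response_rate))  # prints 98
```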

Table 2 response rates of the SLDHS 2008 survey (SSL, 2008)

Measurements of height and weight were obtained for all children under age six living in half of the households selected for the SLDHS sample; these are the same selected households as for the male survey. Although data were collected for all children under the age of six, the analysis is limited to children under age five. Valid height and weight measurements were obtained for 82% of the 3,378 children under age five, a slightly lower figure than the other response percentages. It is assumed that measurements were missing for 8% of the children because the child was not present, the parents refused, or the child was ill. For 9% of the children the measured height or weight values were implausibly high or low, and an additional 2% lacked data on age in months (SSL, 2009). The fact that interviewers had to try to contact the respondents three times can be useful to investigate nonresponse bias, because this information is included in the survey dataset. There is, however, no data available about the nonrespondents. It might be possible that the nonrespondents differ from the respondents and have more children under five than the respondents. Ideally the information of the nonrespondents would be compared to that of the respondents, but this is not possible. By using a logistic regression it is attempted to draw conclusions about the respondents by using the different contact attempts: the fast respondents (responding on day 1), the less fast respondents (responding on day one or two) and finally the nonrespondents. Using the information of the different contact attempts makes this possible.

1 Personal information from mr. Ruilin.
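The grouping into fast respondents, less fast respondents and proxy-nonrespondents described above can be sketched as follows; the contact-attempt values are hypothetical, but mirror the kind of process data stored in the survey dataset:

```python
# Hypothetical per-household contact-attempt numbers (1 = responded on the
# first visit, 2 = second visit, 3 = third visit).
attempt = [1, 3, 2, 1, 1, 2, 3]

# Model 1: fast respondents (day 1) vs proxy-nonrespondents after visit 1.
fast = [1 if a == 1 else 0 for a in attempt]

# Model 2: respondents on day 1 or 2 vs proxy-nonrespondents after visit 2.
late = [1 if a <= 2 else 0 for a in attempt]
```

These 0/1 indicators then serve as the dependent variables of the two logistic regressions.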

Table 5 shows the number of children under five reported after the different contact attempts (in percentages). The number of children under five does not seem strongly related to the day of visit on which the response is realised; there is no visible pattern that, for example, more children were reported after more visits. This table does not raise concerns immediately. A more detailed analysis is needed to discard nonresponse error as a possible explanation for the under-reporting of the children under five. A logistic regression is performed on the set of respondents at each consecutive contact attempt to see whether a pattern emerges that indicates a possible nonresponse bias.

Number of children    Contact attempt 1    Contact attempt 2    Contact attempt 3
0                     39,8%                40,5%                40,4%
1                     30,5%                35,0%                34,9%
2                     20,7%                15,7%                18,7%
3                     6,6%                 6,5%                 5,4%
4                     1,8%                 1,8%                 0%
5                     0,4%                 0,4%                 0,6%
6                     0,1%                 0%                   0%
7                     0,1%                 0,2%                 0%
8                     0%                   0%                   0%
9                     0%                   0%                   0%
Percentage:           100%                 100%                 100%

Table 5 percentages of contact attempts in relation to the number of children under five per household

Logistic Regression in the SLDHS survey

The data from the household questionnaire are used for the logistic analysis. The response was obtained at the household level, so the information in the model is at the household level. The information for the head of the household is used; for instance, educational level is the educational level of the head of the household. As stated in chapter 4, prior to the logistic regressions a bivariate analysis was performed with the variables likely to be included, to get an idea of the data used. No abnormalities were found, so the first logistic regression was performed with the variables strongly related to the pregnancies of women (wealth, educational level, education in single years, marital status, region, place of residence and the number of children eligible for height and weight measurement / selection for the male interview). With these variables two separate logistic regressions were performed to see whether a pattern could be detected in the relation between these variables and the speed of response (and eventually nonresponse). In the first logistic regression the respondents responding on the first visit were coded as the "fast respondents", versus respondents on the second or third visit, who were regarded as "proxy-nonrespondents" after the first visit. In the second logistic regression the respondents responding on the first or second visit were coded as the "late respondents", versus respondents on the third visit, who were regarded as "proxy-nonrespondents" after two visits. The logistic regressions were performed with a forward stepwise (likelihood ratio) method. The variables related to the under-reporting of children under five as described earlier were put in the model. With this forward method the variable that contributes the most is entered first, then the variable contributing second most, and so on, until there are no contributing variables left.
This makes it possible for all the variables to be included in the model. Appendix A contains the full SPSS outputs of these logistic regressions. Both regressions with only the variables related to the under-reporting of children under five resulted in very low values for the Nagelkerke R2. The SPSS output of these first two logistic regression models is summarized in table 6, which shows that the Nagelkerke R2 is relatively low for the fast respondents as well as for the less fast respondents.

Model                   Variables included in model    Nagelkerke R2
Fast respondents        Wealth index                   0,075
                        Highest education
                        Region
                        Place of residence
Less fast respondents   Education in years             0,093
                        Marital status
                        Region
                        Place of residence

Table 6 summarized SPSS output of two logistic regression models with variables included related to the under-reporting of children under five

After these models, in which only the variables related to risk factors for a high total fertility rate were included, the question arose whether other factors had an influence as well. This was tested by running the same logistic regressions, but now with the other background variables included. These models were built with the same strategy. The results from these models are also stated in Appendix A. As visible in table 7, the output of this forward stepwise (likelihood ratio) method again differed between the "fast respondents" and the "late respondents".

Model                   Variables included in model    Nagelkerke R2
Fast respondents        Marital status                 0,387
                        Region
                        Interviewer ID
                        Field editor
                        Number of women
Less fast respondents   Education in years             0,352
                        Marital status
                        Keyer ID
                        Field editor

Table 7 summarized SPSS output of two logistic regression models with all background variables included

The different variables in the models for the fast and the less fast respondents again imply that there is no pattern in the relationship between the speed of response and the available variables, because the trend does not continue. The Nagelkerke R2 is relatively high for these two models, indicating a strong relationship between the variables and the speed of responding. The final model is, however, somewhat strange. Quite logically, it indicates that marital status and education (of the head of the household) are related to the speed of responding; as explained in the method section, marriage and education are indicators of a higher fertility rate, and thus of a higher chance that someone is at home. But it also indicates that the field editor and the keyer ID relate to the speed of responding. It is interesting to investigate why they relate to the speed of responding and whether this could have caused a bias in the number of reported children. If doing so, the nesting of the data should also be dealt with in the analysis. The Nagelkerke pseudo R2 is higher when the model is not restricted to the variables related to the under-reporting of children under five. This indicates that something else influences whether respondents respond on day one or on day two or three; those other influencing variables need further investigation. To confirm this thought, the variables found in the final models for both the fast respondents and the late respondents were fitted into one model. Instead of a forward stepwise selection method, now all the variables that were significant in either the first or the second logistic regression model were entered into the two final models. This was done to be able to compare the two models directly and as such to detect a possible pattern indicating a nonresponse bias. The logistic regression was performed with all the significant variables; the SPSS outputs are stated in Appendix B. This resulted, however, in some strange Wald values of zero, probably due to the nesting within the variables keyer ID, interviewer ID and field editor. Therefore the model was fitted again without these three variables. The outcomes are summarized in table 8.

                                          Model 1: fast respondents         Model 2: less fast respondents
Variables                                 Wald     df  sig    Exp(B)        Wald     df  sig    Exp(B)
Education – Baseline 0 years              23,669   5   ,000*                16,651   5   ,005*
Education – Primary school                8,220    1   ,004*  1,567         3,149    1   ,076*  1,871
Education – Junior secondary school       ,045     1   ,832   1,034         ,231     1   ,631   1,170
Education – Secondary education           3,924    1   ,048*  ,765          10,507   1   ,001*  ,471
Education – Higher education              6,920    1   ,009*  ,653          ,503     1   ,478   ,782
Education – Educated over 16 years        3,000    1   ,083*  ,642          ,069     1   ,793   ,855
Region – Baseline Eastern                 157,788  3   ,000*                54,374   3   ,000*
Region – Northern                         117,426  1   ,000*  4,049         34,565   1   ,000*  6,902
Region – Southern                         2,519    1   ,112   1,166         ,157     1   ,692   1,075
Region – Western                          57,544   1   ,000*  2,609         22,0376  1   ,000*  3,628
Marital status – Baseline married         7,400    3   ,060*                12,183   3   ,007*
Marital status – Never married            ,810     1   ,368   1,205         ,006     1   ,938   1,032
Marital status – Widowed                  1,566    1   ,211   1,213         1,897    1   ,168   1,674
Marital status – Divorced                 4,281    1   ,039*  ,691          9,149    1   ,002*  ,414
Number of women – Baseline zero           5,673    4   ,225                 2,839    4   ,585
Number of women – One                     1,655    1   ,198   1,142         ,071     1   ,789   ,944
Number of women – Two                     ,156     1   ,693   ,950          ,212     1   ,645   ,880
Number of women – Three                   ,054     1   ,816   1,054         2,177    1   ,140   ,561
Number of women – Four or more            2,504    1   ,114   2,110         ,676     1   ,411   ,599
Nagelkerke R square                       0,061                             0,072
Chi-square (df=15)                        210,696      ,000*                99,457       ,000*

Table 8 summary of the two final logistic regression models; significant results at the 10% level are marked with *

Model statistics and interpretation

Table 8 states the Exp(B), also called the odds ratio, the Wald test, the Chi2 and the Nagelkerke R square. Exp(B) is the label SPSS gives to the odds ratio; it indicates the change in odds resulting from a one-unit change in the predictor in the logistic regression. A value above one indicates that as the predictor increases, the odds of the outcome occurring increase as well. So in this case an Exp(B) above one indicates a higher probability of being interviewed on the first day of visit compared to the baseline category (Field, 2009). The Wald statistic tests which variables make a significant contribution to the model; preferably this statistic is significant (Sieben & Linssen, 2009). In table 8 a few of the variables are significant at the 10% level; these are marked. The categories of the variables are measured in comparison with the baseline category. It is visible that the number of significant variables decreases for the respondents on day one and two. This implies that there is no pattern in the relationship between the speed of the response (and finally nonresponse) and a possible bias in the under-reporting of the under-five. For example, for the variable education, in the first model the categories primary school, secondary education, higher education and educated over 16 years were significant for the fast respondents, meaning that those had a significantly different chance of being interviewed on the first day compared to the baseline category of no education. The overall variable education is significant as well, as the marked baseline row in table 8 shows. In the second model fewer categories are significant: only the primary school and secondary education categories differed significantly from the baseline category of no education in the chance of being interviewed on the first or second day.
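The relation between Exp(B) and the underlying logistic coefficient B can be illustrated with a value from table 8; the coefficient itself is not reported there, but it follows directly from the odds ratio:

```python
import math

# The Exp(B) column is e**B, the odds ratio per one-unit change in the
# predictor. The reported Exp(B) of 4,049 for Region - Northern (model 1)
# corresponds to a logistic coefficient B of about ln(4.049) ~ 1.40:
# the odds of responding on the first visit are roughly four times the
# odds in the baseline region Eastern.
B = math.log(4.049)
odds_ratio = math.exp(B)
```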
The Chi2 test indicates whether the estimated model fits the data. It compares the fit (-2 log-likelihood) of the estimated model with that of the model containing only the constant; the difference between these two is the Chi2. When this measure is significant, it indicates that the estimated model fits better than the model without those variables. Both models have a significant Chi2, indicating that the models fit better with the variables than without. This means there is a relation between the speed of responding and the variables, although this relation does not become stronger for the later respondents.
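This likelihood-ratio construction can be illustrated numerically. The log-likelihood values below are hypothetical (chosen so the statistic equals the 210,696 reported for model 1), and the 5% critical value for df = 15 is taken from a chi-square table:

```python
# Sketch of the likelihood-ratio Chi2 reported by SPSS: twice the
# difference between the log-likelihood of the fitted model and that
# of the constant-only model.
ll_null = -2100.0     # constant-only model (hypothetical)
ll_model = -1994.652  # fitted model (hypothetical)

chi_square = 2 * (ll_model - ll_null)

# With df = 15 the 5% critical value is about 25.0 (chi-square table),
# so a statistic of this size is clearly significant.
significant = chi_square > 25.0
```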


The Nagelkerke R square approximates the proportion of variance in the dependent variable that can be explained by the other variables; it indicates the strength of the relationship between the variables (Field, 2009). For these two models the Nagelkerke R square is relatively high, indicating a relatively strong relationship.
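The Nagelkerke R square can be computed from the log-likelihoods of the null and fitted models; the sketch below uses the standard formula (Cox & Snell R2 rescaled to a maximum of 1) with hypothetical input values:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's pseudo R^2: the Cox & Snell R^2 rescaled so that its
    maximum attainable value is 1."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell

# Hypothetical log-likelihoods and sample size, for illustration only.
r2 = nagelkerke_r2(ll_null=-2100.0, ll_model=-1994.652, n=7284)
```

Because the rescaling divides by a value below 1, the Nagelkerke R square is always at least as large as the Cox & Snell R square it is derived from.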

5.2 SLIHS

For the SLIHS the SPSS datasets, the survey report and the survey questionnaires were available, both for the SLIHS 2003/04 and for the SLIHS 2011. Where required, the errors are described separately for the two SLIHS surveys. For the SLIHS, too, the sampling errors are not studied. Only errors that yielded informative findings about the under-reporting of children under five are described.

5.2.1 Observation errors

This section first covers the results for the measurement error and thereafter those for the processing error.

5.2.1.1 Measurement error

The SLIHS used different data collection teams, each consisting of one supervisor, six interviewers and one driver. The supervisor's responsibility was to oversee, monitor and correct the work of the interviewers when necessary. The described relation between the interviewer and supervisor is that the interviewer must always follow the work allocated by the supervisor. Another task of the supervisor is to check the work of the interviewer. This is done in the following ways (SSL, 2003, 2011): • The supervisor checks all the questionnaires in detail after the data collection to see whether the data are collected in a sufficient and accurate way

• When an error is found the supervisor will check the questionnaire and make sure that the interviewer returns to the household to recollect the missing or wrong data

• Households that were already visited by the interviewer are revisited in order to re-ask a few questions from the survey, to assess the reliability of the data collected by the interviewer

• Every week the supervisor attends one or two interviews conducted by the interviewers to assess the interviewer's method of asking questions. The interviewer is not informed about this in advance


• Every day the supervisor interacts with the team to discuss the interview work

When a problem arises, the interviewer should inform the supervisor immediately (SSL, 2003, 2011). The tasks of the supervisor reduce the risk of possible measurement errors to a large extent. The presence of the supervisor during some of the interviews might, however, also result in more socially desirable answers by the respondent: the presence of two survey staff members might be more imposing and create pressure to give 'good' answers.

Design

Similar to the SLDHS survey, the questionnaire contained open and closed questions, which both have their own possible errors. The design, wording and length of the questionnaire are not studied during this research.

Respondent

The same possible errors as for the SLDHS may arise for this survey. There might be some language gaps between the interviewer and the respondents, as well as memory difficulties with questions that require recalling the past.

Interviewer

All the interviewers were provided with a handbook for field interviewers, containing all the procedures for the interviews. The interviewers should follow the points of the manual and of the handbook. The same possible errors as for the SLDHS may arise for this survey; language gaps between the interviewer and the respondents may occur. The handbook states that the interviewer must read the questions as they are written in the questionnaire. After each interview the interviewer must check whether all sections and questions are covered, before the questionnaires are given to the supervisor. The supervisor checks the questionnaire again for missing information. When information is missing it should be collected again; it is forbidden to simply fill in the gaps (SSL, 2003, 2011). There is a lot of interaction between the interviewer and the supervisor, most of which is designed to minimize measurement error. The checking of data, the reinterviewing, the attending of interviews and the discussions between the supervisor and the interviewer will all help to minimize measurement error.


5.2.1.2 Processing error

The survey questionnaires of the SLIHS were divided in two parts, Part A and Part B, in order to collect data about the households as well as about poverty. The questionnaires were almost entirely pre-coded, which minimizes processing error. Microcomputers were installed for quick data entry and a software programme was designed to check the data for inconsistencies. These steps, designed to recognize errors during the processing of the data, give no indication of errors during the processing of the survey data.

5.2.2 Non-observation errors

The following sections describe first the undercoverage error and then the nonresponse error.

5.2.2.1 Undercoverage error

For the SLIHS 2003/04 and 2011 similar possible errors arise as for the SLDHS 2008. Both used the census to compose the survey frame. As stated in the section on the SLDHS, this might cause undercoverage problems, mainly because the census excluded people living in collective housing units such as hotels, hospitals, work camps and prisons. Therefore the orphanages and boarding schools are a problem here as well, and the phenomenon of the big houses might cause the same issue for the SLIHS surveys as for the SLDHS survey.

5.2.2.2 Nonresponse error

The definition of nonresponse and the way to deal with it are described in the interviewer handbook, which lists the following five possibilities for nonresponse (SSL, 2003, 2011):
1. The interviewer cannot find the residence
2. Nobody is at home or the household has moved elsewhere
3. The head of the household written in the questionnaire has moved away and another household has moved in
4. The head of the household is not at home for some reason
5. The household cannot meet with the interviewer after the first visit because the respondent died or moved away
In cases 1 to 3 the interviewer should ask the supervisor what to do. In case 4 the interviewer should interview one member, selected by the rest of the household members, to answer in place of the head of the household. In case


of number 5, a replacement should be sought by the supervisor (SSL, 2003, 2011). The extent of nonresponse and the way it was dealt with remain uncertain for the SLIHS surveys; this might be a matter of concern. The exact nonresponse numbers for the SLIHS surveys were not stated in the survey report, probably because this report focused more on the poverty part of the survey. If the response was as high as in the SLDHS, there is little room for variation among the nonrespondents. There was no information about the contact attempts or the nonrespondents of the SLIHS surveys; the nonresponse rate and related information are something that needs more attention for this survey. For the SLIHS no usable process information, such as the contact attempts, was collected during the data collection. There is also no information about the nonrespondents, so the results rely only on the available information. For now there are no indications of errors resulting in the under-reporting of children under five.


6. Conclusion and Discussion

This chapter covers the conclusion as well as the discussion of this minor thesis. These two parts are framed together because of the connection between them. First the conclusion is given, in which the main research question is answered; thereafter the hypotheses resulting from this research are given, followed by the discussion. The discussion section starts with a brief overview of the results in relation to the existing scientific literature. Thereafter recommendations for further research and the strengths and limitations of this research are discussed.

6.1 Conclusion

First the main research question is answered, initially in a more global way and then in relation to the sampling and nonsampling errors of the TSE. As stated in the introduction of this minor thesis, the central problem of this thesis is the under-reporting of children under five in Sierra Leone. The focus was especially on the gap in knowledge about the possible sources of error that could have influenced this under-reporting. This resulted in the practical aim of figuring out which components of the TSE could explain the under-reporting. This aim was translated into the following research question:

“Which components of the TSE can explain the under-reporting of the children under five in population research in Sierra Leone?”

This minor thesis was set up as a preparatory research to systematically evaluate the possible survey errors that could explain the under-reporting and to formulate possible hypotheses for the reasons behind the under-reporting of children under five in the population research of Sierra Leone. These hypotheses are described in section 4.2.1 Hypotheses. The main trigger to start this research was the visible under-reporting of children under five in the population pyramids shown in figures 1 to 5. This under-reporting could not be explained by a number of stakeholders, including Statistics Sierra Leone, the World Bank country team, the Minister of Health and non-governmental organizations working in Sierra Leone (Himelein, 2013). The use of the TSE as a research tool, applied to three datasets, was the main strategy to approach the survey errors and answer the main research question. The TSE is used in a comprehensive manner, where first the sampling errors were


investigated and thereafter the non-sampling errors. The three datasets came from two different survey organizations (SLIHS and SLDHS). These different datasets were used mainly to find common causes and thereby support stronger conclusions. The similarities between the datasets were the use of the census as the sampling frame and the composition of the interview teams. All three questionnaires used both open and closed questions. For all three datasets the dominant cause appears to lie in measurement error. The following sections explain why.

Sampling errors

The presence of the same under-reporting of children under five in the Sierra Leone population census of 2004 was the reason for discarding sampling errors as a possible explanation for the under-reporting of children under five. As stated in the results section, the census is a full enumeration of the population and can therefore not suffer from sampling errors. Because the under-reporting is also visible in the census, this type of error can be eliminated as a possible source.

Non-sampling errors

The parts of the nonsampling error component of the TSE considered in this research are: undercoverage error, nonresponse error and measurement error.

Undercoverage error

Both the SLIHS and the SLDHS used the frame of the Sierra Leone population census for the sampling procedure: either the 1985 census frame adjusted with the 2004 census, or the 2004 census frame alone. These census frames excluded collective housing units such as hospitals, hotels, prisons and work camps. Because of this exclusion, orphanages and boarding schools, where children might stay, are not included in the surveys. There is another, more cultural phenomenon in Sierra Leone that might contribute to undercoverage error, namely the so-called “big houses”. In these houses women live with their young children during periods when they cannot have intercourse with their husbands. When these houses are not considered part of the household, the people living there, including the children, are not covered.
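The mechanism described above can be illustrated with a small simulation. This is a minimal sketch with invented numbers (population sizes and age distributions are assumptions, not Sierra Leonean data): when collective housing units with a relatively high share of young children are excluded from the frame, the under-five share estimated from the frame is biased downward.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: ages of people in ordinary households versus in
# collective housing units (orphanages, boarding schools), where the share
# of young children is assumed to be much higher. All numbers are invented
# for illustration only.
household_ages = rng.integers(0, 80, 95_000)   # roughly uniform ages 0-79
collective_ages = rng.integers(0, 15, 5_000)   # mostly children, ages 0-14

def under_five_share(ages):
    """Proportion of the given population aged under five."""
    return float(np.mean(ages < 5))

true_share = under_five_share(np.concatenate([household_ages, collective_ages]))
# A frame built only from ordinary households excludes collective housing:
frame_share = under_five_share(household_ages)

print(f"true under-5 share:  {true_share:.3f}")
print(f"frame under-5 share: {frame_share:.3f}")  # lower: undercoverage
```

Even with only 5% of people living in collective housing, the frame-based share falls visibly below the true share, which mirrors the dip seen at the bottom of the population pyramids.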

Nonresponse error

The response rates for the surveys are relatively high, leaving little room for nonresponse bias. During the SLDHS the height and weight of children were measured in every second (randomly selected) household. The response rate for this


measurement is, at 82%, lower than the other response rates (total household response rate 97%, total women response rate 94% and total men response rate 92%). The explanation for this lower response rate remains unclear. An attempt was made to characterise the nonrespondents by performing logistic regressions in which information from the SLDHS dataset was related to the speed of responding. The results were rather unexpected. In one of the first regressions, selection for the male interview was related to the speed of responding. This can be explained by the fact that in the households selected for this interview the height and weight of women and young children were also measured; this may have required an appointment, which would explain the correlation with the speed of responding. It remains interesting, because it might still relate to the under-reporting of children under five.

Measurement error

This error was not fully unravelled during this thesis research; it could not be analysed extensively because the required benchmark data were missing. However, the fact that the other errors do not fully explain the under-reporting of children under five suggests that the explanation may be found in measurement error. Measurement error is one of the most damaging sources of error in most surveys (Biemer, 2011).

6.2 Further research

The following section first covers the main outcome of this research, namely the hypotheses formulated as a result of this minor thesis. Thereafter, recommendations for further research are discussed.

6.2.1 Hypotheses

The findings of this minor thesis, based on the results and conclusions described above, lead to the formulation of the following hypotheses:

• The census excluded some facilities that are important for children under five. A number of children under five may stay in collective housing units such as orphanages or boarding schools, and the exclusion of these units may result in under-reporting of children under five.
• There is a cultural phenomenon in Sierra Leone called the “big houses”. In these houses women stay because they cannot have intercourse with their husband.


These women stay there with their (young) children. According to the definition of a household member, a person should have slept in the house the night before the survey; this definition excludes these women and the children staying with them. This exclusion could result in under-reporting of children under five.
• During the SLDHS the children are also measured, which takes extra time for the interviewer; this might be a reason for choosing households with a small number of children. The households selected for the male interview, every second household of the survey, are also selected for the measurement of the children's height and weight. Since the response rate for this measurement is lower than the other response rates, it is possible that children who were not measured were also not counted as household members, resulting in under-reporting of children under five. This needs further research.

These three hypotheses cover three different aspects of the TSE: undercoverage error for the exclusions from the sampling frame and the “big houses”, sampling errors for the exact way the census frame was used, and measurement error for the extra effort. The three hypotheses require different methods of investigation. One method could be to reinterview a small sample of respondents in Sierra Leone. During that data collection one could look for the “big houses”, select the sample with a more appropriate method than the census sampling frame, and also measure the children to see whether this extra effort is indeed experienced by the households. The hypotheses and results indicate that further research is necessary; the following section gives some recommendations.

6.2.2 Recommendations

It was concluded that measurement error, nonresponse error and undercoverage error might (partly) cause the under-reporting of children under five in Sierra Leone. These results should be discussed with the experts involved in the surveys, and they can be tested more intensively in follow-up research. Measurement error, however, was only touched upon briefly in this research. More research is needed to assess the influence of measurement errors on the surveys. Measurement error can arise from the respondent, the interviewer, or the survey design /


questionnaire. Researching the measurement error is an important and logical continuation of this project. Experts could, for example, examine the design of the questionnaire. Besides the questionnaire design, the setting in which the data are collected needs to be investigated as well. Respondents might experience some questions as sensitive, and other people besides the respondent were sometimes present during the interviews (SSL, 2003, 2009, 2011). It is important to find out whether this had any consequences for measurement error: information obtained without other household members present is generally more accurate (Biemer, 2011). This possible source of error has to be investigated further.

Undercoverage is common in large-scale surveys in African countries. The sampling often involves multiple stages, such as the selection of area units, listing, and the selection of households, and coverage error can arise at every stage. To save money, old frames from other surveys or research are sometimes reused, but this may lead to serious errors. Undercoverage is often the result of incomplete frames: the failure to include elements that belong in the sample (Banda, 2003). The SLIHS partly used the census frame of 1985; this potential undercoverage must be investigated further before it can be fully excluded as a cause. The three surveys also all involved multiple sampling stages.

Another important follow-up to this minor thesis could address the interaction between the different sources of error in the TSE. This interaction was not investigated here, but it might be informative; the under-reporting of children under five could even result from a combination of sources. Lastly, cultural aspects specific to Sierra Leone, or to comparable countries, have to be investigated.
The “big houses” are a phenomenon not present in West-European countries, but these houses might be part of the source of the under-reporting of children under five; therefore this phenomenon has to be investigated. Besides this phenomenon, the high response rates need attention as well. Similar high response rates are found in other population measurements in Sierra Leone. The Multiple Indicator Cluster Survey (MICS) also obtained high response rates, and the MICS shows the same under-reporting of children under five, as visible in figures 3 and 4. The response rates for the MICS were for 2000 about 97%, for 2005 about 99.3% and for 2010


about 95% (The-World-Bank, 2011a, 2011b, 2013). The nonresponse errors can be investigated further using more suitable statistical techniques. The under-reporting needs more research, not only in relation to the components of the TSE but also in relation to:

• The eligibility age for the survey, which for women is between 15 and 49. It might be worth investigating whether these age limits should be expanded, since women under 15 are often fertile and might have children under five.
• The definition of a household. Households are defined differently for the census and for the surveys, and it is important that children under five are included in the definition. When a household is defined as everybody who eats out of the same pot, children under five who are breastfed might not be considered part of the household.
• The definition of a child: what do respondents consider to be a child? And what about households with special characteristics, such as child-only households due to HIV/AIDS and malaria, or grandparents raising their grandchildren because the parents cannot care for them or have died.

For further research it remains important to note that the under-reporting is only present among children under five. This makes it more plausible that the under-reporting results from errors occurring during the measurement, or from the way a household or a child is defined.
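The measurement-related hypothesis above can be given a rough order of magnitude with back-of-the-envelope arithmetic. Only the 82% measurement response rate comes from the survey; the reported count below is a purely hypothetical number for illustration. If every unmeasured child were dropped from the household roster, the reported count would be about 82% of the true count.

```python
# Worst-case undercount if all children in nonmeasured households were
# dropped from the roster. The reported count is hypothetical.
measurement_response_rate = 0.82
under_five_reported = 1_000_000  # hypothetical, for illustration only

true_count = under_five_reported / measurement_response_rate
worst_case_missed = true_count - under_five_reported
print(round(worst_case_missed))  # roughly 220,000, about 18% of the true count
```

An undercount of this relative size would be large enough to produce a visible dip in the bottom bar of a population pyramid, which is why this hypothesis deserves further testing.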

6.3 Discussion

The discussion is divided into two sections. First the results are related to the theoretical framework and the scientific literature; then the strengths and limitations of this research are briefly described.

6.3.1 Results and the used theoretical framework

This minor thesis aimed to investigate which components of the TSE can explain the under-reporting of children under five in Sierra Leone. It is, however, important to consider aspects that might be missing from the TSE framework. According to Groves and Lyberg (2010), the user orientation towards quality is largely absent, the user here being the user of the survey. This could be incorporated in follow-up research, in which the interviewers, as users of the survey, are asked for their


opinion of the questionnaire. The interviewers can be asked specifically about the questions related to children under five, which might give important information. Another point of attention regarding the TSE is the lack of routine measurement and frame content. There are no standardized methods for applying the TSE framework and measuring all the different errors, which makes comparisons between different TSE applications difficult. This is a problem, but when everything is well supported qualitatively or well described, the difficulty largely disappears. In the end a variety of frameworks are used, all implemented in different ways (Groves & Lyberg, 2010), which makes almost every TSE application different. The goal of research into errors is to avoid them and to control error to the extent that the remaining errors are tolerable (Biemer, 2011). In the surveys in Sierra Leone a clear under-reporting of children under five has been visible for some time. This error is no longer tolerable, because policy decisions rely on these numbers.

6.3.2 Strengths and limitations of this research

The quality profile method used in the analysis of this minor thesis is a useful way to gather information about specific sources of error and to indicate where important gaps of knowledge are (Biemer, 2011). Some of the errors could, however, only be approached in a rather subjective way. Based on the three datasets, the main focus of this research was to determine whether components of the TSE could be discarded as a cause of the under-reporting of children under five. This resulted in a (restricted) quantitative analysis of the nonresponse error. Unfortunately, other error components, in particular measurement errors, could only be described in a very limited way; this should be repeated in a more constructive and systematic manner. Another limitation of this research, already stated in the recommendations section, is the focus on the individual errors rather than on the interactions between them, which might be very informative. A strength of this research is that relevant aspects were found even in the errors that could only be approached subjectively, so follow-up research can focus fully on those errors, given that the nonresponse error has already been studied quite extensively, especially for the SLDHS survey.


References

Anguilla. (2001, 6 May 2002). Census information Retrieved November, 2013, from http://www.gov.ai/statistics/census/information.htm

Australian-Bureau-of-Statistics. (2013). Basic survey design Retrieved November, 2013, from http://www.nss.gov.au/nss/home.nsf/SurveyDesignDoc/20EDCFE8B204BE1FCA2571AB002479A2?OpenDocument

Banda, J. P. (2003). Nonsampling errors in surveys.

Bethlehem, J., & Biffignandi, S. (2012). Handbook of Web Surveys. Hoboken, New Jersey: John Wiley & Sons.

Bethlehem, J., Cobben, F., & Schouten, B. (2011). Handbook of Nonresponse in Household Surveys: Wiley.

Bethlehem, J., & Schouten, B. (2004). Nonresponse Adjustment in Household Surveys. Voorburg/Heerlen: CBS.

Biemer, P. P. (2011). Total Survey Error: Design, Implementation, and Evaluation. American Association for Public Opinion Research. doi: 10.1093/poq/nfq058

Bowling, A., Ebrahim, S., & ebrary Inc. (2005). Handbook of health research methods investigation, measurement and analysis (pp. xii, 625 p.). Retrieved from http://site.ebrary.com/lib/yale/Doc?id=10161329

Cobben, F. (2009). Nonresponse in Sample Surveys: Methods for Analysis and Adjustment. Universiteit van Amsterdam, Den Haag.

Cohen, M. P. (2008). Auxiliary Variable. In SageResearchmethods (Ed.), Encyclopedia of Survey Research Methods.

Durrant, G. B. (2005). Imputation Methods for Handling Item-Nonresponse in the Social Sciences: A Methodological Review. Retrieved from http://eprints.ncrm.ac.uk/86/1/MethodsReviewPaperNCRM-002.pdf

Dutter, R. (2003). Estimation Error, Estimation Variance Retrieved November, 2013, from http://www.statistik.tuwien.ac.at/public/dutt/vorles/geost_03/node93.html

FAWE. (2007). Sierra Leone: Women’s rights protection instruments ratified by Sierra Leone Retrieved February, 2014, from http://www.africa4womensrights.org//public/Dossier_of_Claims/SierraLeoneENG.pdf

Field, A. (2009). Discovering Statistics Using SPSS (Vol. Third). London: Sage.


Foster, L. (2013, 06-11-2013). [Re: update on underreporting of under-5 in SL].

Groves, R. M., & Lyberg, L. (2010). Total Survey Error: Past, Present and Future. Public Opinion Quarterly, 74(5), 30.

Himelein, K. (2013). Missing children: A Mixed Methods Analysis of Under-5 Under- reporting in Sierra Leone. Research Committee Grant Proposal.

Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression: John Wiley & Sons.

Indrayan, A. (2012). Indicators of child mortality. measures of mortality. Retrieved from http://www.medicalbiostatistics.com/childmortality.pdf

Jawed, F. (2012). Importance of good quality research. 62(1), 2. Retrieved from http://www.jpma.org.pk/PdfDownload/3214.pdf

Kasprzyk, D. (2005). Measurement error in household surveys: sources and measurement Household Sample Surveys in Developing and Transition Countries: United Nations.

MEASURE_DHS. (2013). Questionnaires and Modules Retrieved October 7th, 2013, from http://www.measuredhs.com/What-We-Do/Questionnaires.cfm

Olsen, K. (2013). Paradata for Nonresponse Adjustment. The Annals of the American Academy of Political and Social Science, 645(1), 28. doi: 10.1177/0002716212459475

Ott, L., & Longnecker, M. (2010). Logistic Regression An introduction to statistical methods and data analysis (6th ed., pp. xv, 1273 p.). Belmont, CA: Brooks/Cole Cengage Learning.

Rogers, S. A. T. (2004). Poverty Measurement in a Post-Conflict Scenario: Evidence From the Sierra Leone Integrated Household Survey 2003/2004 (pp. 14). Freetown: Statistics Sierra Leone.

Rosenberg, M. (2013). Crude Birth Rate Retrieved December, 2013, from http://geography.about.com/od/populationgeography/a/cbrcdr.htm

Ruilin, R. (2013, 11-10-2013). [RE: information from the 2008 DHS survey in sierra leone].

Sieben, I., & Linssen, L. (2009). Logistieke Regressie Analyse: een handleiding. In R. University (Ed.), (pp. 19).

Sinding, S. W. (2009). Population, poverty and economic development. Philos Trans R Soc Lond B Biol Sci, 364(1532), 3023-3030. doi: 10.1098/rstb.2009.0145

SSL. (2003). Sierra Leone Integrated Household Survey (SLIHS). In S. S. Leone (Ed.), Handbook for field interviewers: Statistics Sierra Leone.


SSL. (2007). Sierra Leone Integrated Household Survey (SLIHS) 2003/04. In S. S. Leone (Ed.): Statistics Sierra Leone Government of Sierra Leone.

SSL. (2009). Sierra Leone Demographic and Health Survey 2008. In S. S. Leone (Ed.). Calverton, Maryland USA: MEASURE DHS.

SSL. (2011). Sierra Leone Integrated Household Survey (SLIHS). In S. S. Leone (Ed.). Freetown: Statistics Sierra Leone.

The-World-Bank. (2011a, Sept 26, 2013). Sierra Leone - Multiple Indicator Cluster Survey 2000 Retrieved February, 2014, from http://microdata.worldbank.org/index.php/catalog/712/study-description#page=sampling&tab=study-desc

The-World-Bank. (2011b, Sept 26, 2013). Sierra Leone - Multiple Indicator Cluster Survey 2005 Retrieved February, 2014, from http://microdata.worldbank.org/index.php/catalog/50/study-description#page=sampling&tab=study-desc

The-World-Bank. (2013, Sept 26, 2013). Sierra Leone - Multiple Indicator Cluster Survey 2010 Retrieved February, 2014, from http://microdata.worldbank.org/index.php/catalog/1311/study-description#page=sampling&tab=study-desc

TheWorldBank. (2013). A poverty profile for Sierra Leone: The World Bank, Poverty Reduction & Economic Management Unit. African Region, Statistics Sierra Leone.

TheWorldBankGroup. (2013a). Birth Rate, crude (per 1,000 people) Retrieved November, 2013, from http://data.worldbank.org/indicator/SP.DYN.CBRT.IN

TheWorldBankGroup. (2013b). Mortality rate, infant (per 1,000 live births) Retrieved December, 2013, from http://data.worldbank.org/indicator/SP.DYN.IMRT.IN?order=wbapi_data_value_2012+wbapi_data_value+wbapi_data_value-last&sort=desc

TheWorldBankGroup. (2013c). Mortality rate, under-5 (per 1,000 live births) Retrieved December, 2013, from http://data.worldbank.org/indicator/SH.DYN.MORT/countries?order=wbapi_data_value_2012+wbapi_data_value+wbapi_data_value-last&sort=desc

TheWorldDataBank. (2013). Sierra Leone Retrieved 2013, November, from http://www.worldbank.org/en/country/sierraleone

UNdata. (2013). Sierra Leone Retrieved November, 2013, from http://data.un.org/CountryProfile.aspx?crName=SIERRA%20LEONE

United-Nations. (2009). World Fertility Data 2008 Retrieved November, 2013, from http://www.un.org/esa/population/publications/WFD%202008/Metadata/CBR .html


US_Government. (2001a). Measuring and Reporting Sources of Error in Surveys: Coverage error. US government report.

US_Government. (2001b). Measuring and Reporting Sources of Error in Surveys: Measurement error. US government report.

US_Government. (2001c). Measuring and Reporting Sources of Error in Surveys: Nonresponse error. [Government report]. US government report, 19.

US_Government. (2001d). Measuring and Reporting Sources of Error in Surveys: Processing error.

US_Government. (2001e). Measuring and Reporting Sources of Error in Surveys: Sampling error. US government report.

US_Government. (2001f). Measuring and Reporting Sources of Error in Surveys: Total Survey Error. US government report, 139-152.

Walter, T. (2011). What Factors Affect Birth Rates and Fertility Rates? Retrieved December, 2013

WHO. (2012). Adolescent pregnancy. Factsheet N364 Retrieved January, 2013, from http://www.who.int/mediacentre/factsheets/fs364/en/index.html

Wikipedia. (2013, 14 March 2013). Sierra Leoonse Burgeroorlog [Sierra Leonean Civil War] Retrieved November, 2013, from http://nl.wikipedia.org/wiki/Sierra_Leoonse_Burgeroorlog

Worldatlas. (2013). Sierra Leone: Description Retrieved November, 2013, from http://www.worldatlas.com/webimage/countrys/africa/sl.htm


Appendices


Appendix A

A1.1 Binary regressions, single variables versus the fast respondents (respondents on day one; proxy-nonrespondents, respondents on day two or three)
A1.2 Binary regressions, single variables versus the less fast respondents (respondents on day one and two; proxy-nonrespondents, respondents on day three)
A2 Outputs logistic regression “Fast respondents”, only variables related to children under five
A3 Outputs logistic regressions “Less fast respondents”, only variables related to children under five
A4 Output logistic regression “Fast respondents”, all background variables
A5 Output logistic regression “Less fast respondents”, all background variables


A1.1 Binary regressions single variables versus the fast respondents (respondents on day one, proxy-nonrespondents, respondents on day two or three)

Variable | Chi2 | df | sig. | Nagelkerke | Wald | df | sig.
Wealth index | 20,379 | 4 | ,000 | ,006 | 19,843 | 4 | ,001
Highest educational level | 8,542 | 3 | ,036 | ,002 | 8,528 | 3 | ,036
Education in years | 26,625 | 18 | ,086 | ,008 | 25,311 | 18 | ,117
Education groups (years) | 11,434 | 5 | ,043 | ,003 | 11,175 | 5 | ,048
Marital status | 9,642 | 3 | ,022 | ,003 | 10,386 | 3 | ,016
Region | 177,842 | 3 | ,000 | ,050 | 155,367 | 3 | ,000
Place of residence | 110,463 | 3 | ,000 | ,031 | 121,192 | 3 | ,000
Number of eligible children height and weight | 10,838 | 7 | ,146 | ,003 | 4,531 | 7 | ,717
Selection male interview | 11,124 | 1 | ,001 | ,003 | 11,055 | 1 | ,001
Sex head of household | 2,451 | 1 | ,117 | ,001 | 2,500 | 1 | ,114
Age head of household | 96,805 | 80 | ,097 | ,028 | 84,634 | 80 | ,340
Local council | 93,890 | 1 | ,000 | ,027 | 106,794 | 1 | ,000
District | 674,446 | 13 | ,000 | ,184 | 646,302 | 13 | ,000
Number of household members | 11,134 | 21 | ,960 | ,003 | 10,616 | 21 | ,970
Household members groups | 4,649 | 3 | ,199 | ,001 | 4,577 | 3 | ,206
Number of de jure members | 16,376 | 21 | ,748 | ,005 | 13,120 | 21 | ,904
Number of de facto members | 28,827 | 22 | ,150 | ,008 | 24,878 | 22 | ,303
Number of children under five | 13,819 | 9 | ,129 | ,004 | 11,365 | 9 | ,251
Interviewer ID | 1392,211 | 126 | ,000 | ,363 | 922,832 | 126 | ,000
Keyer ID | 228,795 | 38 | ,000 | ,064 | 177,839 | 38 | ,000
Field supervisor | 1063,443 | 27 | ,000 | ,295 | 901,305 | 27 | ,000
Field editor | 1060,462 | 24 | ,000 | ,288 | 911,887 | 24 | ,000
Office editor | 168,392 | 12 | ,000 | ,048 | 140,359 | 12 | ,000
Total adults measured | 27,209 | 11 | ,004 | ,008 | 28,213 | 11 | ,003
Number of eligible women in household | 6,175 | 7 | ,520 | ,002 | 3,735 | 7 | ,810
Number of eligible men in household | 9,501 | 6 | ,147 | ,003 | 9,179 | 6 | ,164
Type place of residence | 13,787 | 1 | ,000 | ,004 | 13,883 | 1 | ,000


A1.2 Binary regressions single variables versus the less fast respondents (respondents on day one and two, proxy-nonrespondents, respondents on day three)

Variable | Chi2 | df | sig. | Nagelkerke | Wald | df | sig.
Wealth index | 5,883 | 4 | ,208 | ,004 | 5,704 | 4 | ,222
Highest educational level | 4,528 | 3 | ,210 | ,003 | 4,374 | 3 | ,224
Education in years | 31,490 | 18 | ,025 | ,022 | 25,402 | 18 | ,114
Education groups (years) | 10,184 | 5 | ,070 | ,007 | 10,936 | 5 | ,053
Marital status | 12,676 | 3 | ,005 | ,009 | 14,624 | 3 | ,002
Region | 77,679 | 3 | ,000 | ,054 | 54,813 | 3 | ,000
Place of residence | 28,263 | 3 | ,000 | ,020 | 28,781 | 3 | ,000
Number of eligible children height and weight | 6,066 | 7 | ,532 | ,004 | 3,964 | 7 | ,784
Selection male interview | 2,968 | 1 | ,085 | ,002 | 2,942 | 1 | ,086
Sex head of household | ,655 | 1 | ,418 | ,000 | ,671 | 1 | ,413
Age head of household | 76,935 | 80 | ,576 | ,054 | 53,704 | 80 | ,990
Local council | 19,264 | 1 | ,000 | ,014 | 22,544 | 1 | ,000
District | 283,947 | 13 | ,000 | ,196 | 234,112 | 13 | ,000
Number of household members | 15,791 | 21 | ,781 | ,011 | 10,370 | 21 | ,974
Household members groups | 3,641 | 3 | ,303 | ,003 | 3,791 | 3 | ,285
Number of de jure members | 15,157 | 21 | ,815 | ,011 | 9,240 | 21 | ,987
Number of de facto members | 30,599 | 22 | ,105 | ,021 | 21,505 | 22 | ,490
Number of children under five | 8,052 | 9 | ,529 | ,006 | 1,481 | 9 | ,997
Interviewer ID | 481,768 | 126 | ,000 | ,328 | 201,256 | 126 | ,000
Keyer ID | 113,090 | 38 | ,000 | ,079 | 86,780 | 38 | ,000
Field supervisor | 361,372 | 27 | ,000 | ,254 | 217,102 | 27 | ,000
Field editor | 365,900 | 24 | ,000 | ,255 | 225,998 | 24 | ,000
Office editor | 78,132 | 12 | ,000 | ,055 | 91,989 | 12 | ,000
Total adults measured | 10,450 | 11 | ,490 | ,007 | 10,647 | 11 | ,473
Number of eligible women in household | 3,666 | 7 | ,817 | ,003 | 2,665 | 7 | ,914
Number of eligible men in household | 4,082 | 6 | ,666 | ,003 | 4,267 | 6 | ,641
Type place of residence | ,068 | 1 | ,794 | ,000 | ,068 | 1 | ,794


A.2 Logistic regression “Fast Respondents”, variables related to children under five

First category is the contrast category. Method: Forward, likelihood.

• Wealth index (poorest, poorer, middle, richer, richest)
• Highest educational level (no education / preschool, primary, secondary, higher)
• Education in single years
• Current marital status (never married, married, widowed, divorced)
• Region (north, east, south, west)
• Place of residence (Freetown, small city, town, countryside)
• Selection for male interview / number of eligible children for height and weight

Case Processing Summary
Unweighted cases (a) | N | Percent
Selected cases: included in analysis | 7000 | 96,1
Selected cases: missing cases | 284 | 3,9
Selected cases: total | 7284 | 100,0
Unselected cases | 0 | ,0
Total | 7284 | 100,0
a. If weight is in effect, see classification table for the total number of cases.

Omnibus Tests of Model Coefficients
| Chi-square | df | Sig.
Step 1: Step | 169,771 | 3 | ,000
Step 1: Block | 169,771 | 3 | ,000
Step 1: Model | 169,771 | 3 | ,000
Step 2: Step | 60,241 | 3 | ,000
Step 2: Block | 230,012 | 6 | ,000
Step 2: Model | 230,012 | 6 | ,000
Step 3: Step | 14,663 | 3 | ,002
Step 3: Block | 244,675 | 9 | ,000
Step 3: Model | 244,675 | 9 | ,000
Step 4: Step | 9,726 | 4 | ,045
Step 4: Block | 254,401 | 13 | ,000
Step 4: Model | 254,401 | 13 | ,000


Model Summary
Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square
1 | 4363,787 (a) | ,024 | ,050
2 | 4303,546 (a) | ,032 | ,068
3 | 4288,883 (a) | ,034 | ,072
4 | 4279,157 (a) | ,036 | ,075
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than ,001.

Hosmer and Lemeshow Test
Step | Chi-square | df | Sig.
1 | ,000 | 2 | 1,000
2 | 6,601 | 6 | ,359
3 | 3,393 | 7 | ,846
4 | 4,853 | 8 | ,773

Model if Term Removed
Variable | Model Log Likelihood | Change in -2 Log Likelihood | df | Sig. of the Change
Step 1: Region | -2266,779 | 169,771 | 3 | ,000
Step 2: Region | -2210,216 | 116,887 | 3 | ,000
Step 2: PlaceofResidence | -2181,894 | 60,241 | 3 | ,000
Step 3: HighestEducation | -2151,773 | 14,663 | 3 | ,002
Step 3: Region | -2204,784 | 120,684 | 3 | ,000
Step 3: PlaceofResidence | -2172,562 | 56,240 | 3 | ,000
Step 4: WealthIndex | -2144,442 | 9,726 | 4 | ,045
Step 4: HighestEducation | -2146,458 | 13,759 | 3 | ,003
Step 4: Region | -2199,076 | 118,995 | 3 | ,000
Step 4: PlaceofResidence | -2151,946 | 24,735 | 3 | ,000


A.3 Logistic regression “Less Fast Respondents”, variables related to children under five

First category is the contrast category. Method: Forward, likelihood.

• Wealth index (poorest, poorer, middle, richer, richest)
• Highest educational level (no education / preschool, primary, secondary, higher)
• Education in single years
• Current marital status (never married, married, widowed, divorced)
• Region (north, east, south, west)
• Place of residence (Freetown, small city, town, countryside)
• Selection for male interview / number of eligible children for height and weight

Case Processing Summary
Unweighted cases (a) | N | Percent
Selected cases: included in analysis | 7000 | 96,1
Selected cases: missing cases | 284 | 3,9
Selected cases: total | 7284 | 100,0
Unselected cases | 0 | ,0
Total | 7284 | 100,0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original value | Internal value
"Nonrespondents", responding on day 3 | 0
Less fast respondent, responding on day 1 or 2 | 1


Omnibus Tests of Model Coefficients
| Chi-square | df | Sig.
Step 1: Step | 68,528 | 3 | ,000
Step 1: Block | 68,528 | 3 | ,000
Step 1: Model | 68,528 | 3 | ,000
Step 2: Step | 40,929 | 18 | ,002
Step 2: Block | 109,456 | 21 | ,000
Step 2: Model | 109,456 | 21 | ,000
Step 3: Step | 9,868 | 3 | ,020
Step 3: Block | 119,324 | 24 | ,000
Step 3: Model | 119,324 | 24 | ,000
Step 4: Step | 7,665 | 3 | ,053
Step 4: Block | 126,989 | 27 | ,000
Step 4: Model | 126,989 | 27 | ,000

Model Summary
Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square
1 | 1441,857 (a) | ,010 | ,050
2 | 1400,928 (b) | ,016 | ,080
3 | 1391,061 (b) | ,017 | ,087
4 | 1383,396 (b) | ,018 | ,093
a. Estimation terminated at iteration number 8 because parameter estimates changed by less than ,001.
b. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Hosmer and Lemeshow Test
Step | Chi-square | df | Sig.
1 | ,000 | 2 | 1,000
2 | 1,085 | 6 | ,982
3 | 1,027 | 6 | ,985
4 | 4,306 | 7 | ,744


Model if Term Removed
Variable | Model Log Likelihood | Change in -2 Log Likelihood | df | Sig. of the Change
Step 1: Region | -755,192 | 68,528 | 3 | ,000
Step 2: EducationInYears | -720,929 | 40,929 | 18 | ,002
Step 2: Region | -738,738 | 76,547 | 3 | ,000
Step 3: EducationInYears | -715,602 | 40,144 | 18 | ,002
Step 3: MaritalStatus | -700,464 | 9,868 | 3 | ,020
Step 3: Region | -732,807 | 74,554 | 3 | ,000
Step 4: EducationInYears | -710,725 | 38,054 | 18 | ,004
Step 4: MaritalStatus | -696,591 | 9,787 | 3 | ,020
Step 4: Region | -716,550 | 49,705 | 3 | ,000
Step 4: PlaceofResidence | -695,530 | 7,665 | 3 | ,053


A.4 Logistic regression “Fast Respondents”, all background variables

All background information. First category is the contrast category. Method: Forward, likelihood.

Case Processing Summary
Unweighted cases (a) | N | Percent
Selected cases: included in analysis | 6545 | 89,9
Selected cases: missing cases | 739 | 10,1
Selected cases: total | 7284 | 100,0
Unselected cases | 0 | ,0
Total | 7284 | 100,0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original value | Internal value
"Non-respondent", responding on day 2 or 3 | 0
Fast respondent, responding on day 1 | 1

Omnibus Tests of Model Coefficients
| Chi-square | df | Sig.
Step 1: Step | 995,385 | 23 | ,000
Step 1: Block | 995,385 | 23 | ,000
Step 1: Model | 995,385 | 23 | ,000
Step 2: Step | 311,100 | 121 | ,000
Step 2: Block | 1306,485 | 144 | ,000
Step 2: Model | 1306,485 | 144 | ,000
Step 3: Step | 13,299 | 7 | ,065
Step 3: Block | 1319,784 | 151 | ,000
Step 3: Model | 1319,784 | 151 | ,000
Step 4: Step | 22,170 | 3 | ,000
Step 4: Block | 1341,954 | 154 | ,000
Step 4: Model | 1341,954 | 154 | ,000
Step 5: Step | 4,582 | 3 | ,205
Step 5: Block | 1346,536 | 157 | ,000
Step 5: Model | 1346,536 | 157 | ,000


Model Summary
Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square
1 | 3297,182 (a) | ,141 | ,293
2 | 2986,082 (a) | ,181 | ,376
3 | 2972,783 (a) | ,183 | ,380
4 | 2950,613 (a) | ,185 | ,385
5 | 2946,031 (a) | ,186 | ,387
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Hosmer and Lemeshow Test
Step | Chi-square | df | Sig.
1 | ,000 | 7 | 1,000
2 | ,110 | 8 | 1,000
3 | 3,023 | 8 | ,933
4 | 1,737 | 8 | ,988
5 | 3,610 | 8 | ,890

Model if Term Removed
                                Model Log    Change in -2      df   Sig. of the
         Variable               Likelihood   Log Likelihood         Change
Step 1   FieldEditor            -2146,283       995,385        23     ,000
Step 2   InterviewerID          -1648,591       311,100       121     ,000
         FieldEditor            -1510,919        35,755        23     ,044
Step 3   InterviewerID          -1641,963       311,144       121     ,000
         FieldEditor            -1504,334        35,885        23     ,042
         NofWomen               -1493,041        13,299         7     ,065
Step 4   Region                 -1486,391        22,170         3     ,000
         InterviewerID          -1637,148       323,684       121     ,000
         FieldEditor            -1498,004        45,396        23     ,004
         NofWomen               -1481,462        12,312         7     ,091
Step 5   MaritalStatus          -1475,306         4,582         3     ,205
         Region                 -1484,179        22,327         3     ,000
         InterviewerID          -1634,040       322,049       121     ,000
         FieldEditor            -1495,711        45,390        23     ,004
         NofWomen               -1479,256        12,480         7     ,086


A.5 Logistic regression “Less Fast Respondents”, all background variables

All background variables are included; the first category of each variable is the contrast category. Method: Forward (Likelihood Ratio).

Case Processing Summary
Unweighted Cases a                        N      Percent
Selected Cases   Included in Analysis   6545      89,9
                 Missing Cases           739      10,1
                 Total                  7284     100,0
Unselected Cases                           0        ,0
Total                                   7284     100,0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value                                       Internal Value
"Nonrespondents", responding on day 3                      0
Less fast respondent, responding on day 1 or 2             1

Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step        341,325    23    ,000
         Block       341,325    23    ,000
         Model       341,325    23    ,000
Step 2   Step         88,595    37    ,000
         Block       429,920    60    ,000
         Model       429,920    60    ,000
Step 3   Step         37,513    18    ,004
         Block       467,433    78    ,000
         Model       467,433    78    ,000
Step 4   Step          9,801     3    ,020
         Block       477,235    81    ,000
         Model       477,235    81    ,000


Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1           1117,870a             ,051                  ,254
2           1029,275a             ,064                  ,318
3            991,762a             ,069                  ,345
4            981,960a             ,070                  ,352
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1          ,000      7   1,000
2        11,088      8    ,197
3         5,910      8    ,657
4         6,783      8    ,560

Model if Term Removed
                                Model Log    Change in -2      df   Sig. of the
         Variable               Likelihood   Log Likelihood         Change
Step 1   FieldEditor             -729,597       341,325        23     ,000
Step 2   KeyerID                 -558,935        88,595        37     ,000
         FieldEditor             -673,341       317,408        23     ,000
Step 3   EducationInYears        -514,637        37,513        18     ,004
         KeyerID                 -536,676        81,590        37     ,000
         FieldEditor             -658,380       324,999        23     ,000
Step 4   EducationInYears        -510,292        38,623        18     ,003
         MaritalStatus           -495,881         9,801         3     ,020
         KeyerID                 -531,120        80,279        37     ,000
         FieldEditor             -652,483       323,005        23     ,000


Appendix B

B1 SPSS outputs of final logistic regression model, “fast respondents”
B2 SPSS outputs of final logistic regression model, “less fast respondents”


B1 SPSS outputs of final logistic regression model, “fast respondents”

Variables included in this logistic regression model (enter method):
• Education
• Region
• Marital status
• Number of women

Case Processing Summary
Unweighted Cases a                        N      Percent
Selected Cases   Included in Analysis   7099      97,5
                 Missing Cases           185       2,5
                 Total                  7284     100,0
Unselected Cases                           0        ,0
Total                                   7284     100,0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value                                    Internal Value
"non-respondent", responding on day 2 or 3              0
Fast respondent, responding on day 1                    1

Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step        210,696    15    ,000
         Block       210,696    15    ,000
         Model       210,696    15    ,000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1           4392,075a             ,029                  ,061
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than ,001.
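The two pseudo-R² columns are directly related: Cox & Snell R² = 1 − exp(−χ²/n), where χ² is the model chi-square from the Omnibus test, and Nagelkerke R² rescales it by its maximum attainable value, 1 − exp(−(−2LL₀)/n). A quick numerical check against the Model Summary above:

```python
import math

# Values taken from the SPSS output above:
# model chi-square 210,696 (Omnibus test), -2LL(model) = 4392,075, n = 7099
n = 7099
chi2_model = 210.696
neg2ll_model = 4392.075
neg2ll_null = neg2ll_model + chi2_model        # -2 log likelihood of the null model

cox_snell = 1 - math.exp(-chi2_model / n)
nagelkerke = cox_snell / (1 - math.exp(-neg2ll_null / n))
print(f"{cox_snell:.3f} {nagelkerke:.3f}")     # prints 0.029 0.061
```

Both values reproduce the ,029 and ,061 reported by SPSS.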


Variables in the Equation
                                          B      S.E.     Wald    df   Sig.   Exp(B)
Step 1a   Education_groups                              23,669     5   ,000
          Education_groups(1)           ,450     ,157    8,220     1   ,004    1,569
          Education_groups(2)           ,033     ,153     ,045     1   ,832    1,033
          Education_groups(3)          -,268     ,135    3,924     1   ,048     ,765
          Education_groups(4)          -,427     ,162    6,920     1   ,009     ,652
          Education_groups(5)          -,444     ,256    3,000     1   ,083     ,641
          Region                                       157,788     3   ,000
          Region(1)                    1,399     ,129  117,426     1   ,000    4,050
          Region(2)                     ,153     ,096    2,519     1   ,112    1,165
          Region(3)                     ,960     ,127   57,544     1   ,000    2,611
          Marital_Status_married1                        7,400     3   ,060
          Marital_Status_married1(1)    ,187     ,208     ,810     1   ,368    1,205
          Marital_Status_married1(2)    ,193     ,154    1,566     1   ,211    1,213
          Marital_Status_married1(3)   -,370     ,179    4,281     1   ,039     ,691
          NofWomenfourormore                             5,673     4   ,225
          NofWomenfourormore(1)         ,133     ,103    1,655     1   ,198    1,142
          NofWomenfourormore(2)        -,051     ,130     ,156     1   ,693     ,950
          NofWomenfourormore(3)         ,053     ,227     ,054     1   ,816    1,054
          NofWomenfourormore(4)         ,747     ,472    2,504     1   ,114    2,110
          Constant                     1,641     ,107  234,825     1   ,000    5,160
a. Variable(s) entered on step 1: Education_groups, Region, Marital_Status_married1, NofWomenfourormore.
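In this table, Exp(B) is simply the exponential of the coefficient B, i.e. the odds ratio relative to the contrast category. A quick check in Python (differences in the last digit can occur because B is itself rounded to three decimals):

```python
import math

# Exp(B) is the odds ratio e^B; reproduce two entries from the table above
for b, exp_b in [(1.641, 5.160), (-0.370, 0.691)]:
    print(f"B = {b:6.3f}  ->  Exp(B) = {math.exp(b):.3f} (table: {exp_b:.3f})")
```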


B2 SPSS outputs of final logistic regression model, “less fast respondents”

Variables included in this logistic regression model (enter method):
• Education
• Region
• Marital status
• Number of women

Case Processing Summary
Unweighted Cases a                        N      Percent
Selected Cases   Included in Analysis   7099      97,5
                 Missing Cases           185       2,5
                 Total                  7284     100,0
Unselected Cases                           0        ,0
Total                                   7284     100,0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value                                       Internal Value
"Nonrespondents", responding on day 3                      0
Less fast respondent, responding on day 1 or 2             1

Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step         99,457    15    ,000
         Block        99,457    15    ,000
         Model        99,457    15    ,000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1           1415,416a             ,014                  ,072
a. Estimation terminated at iteration number 8 because parameter estimates changed by less than ,001.


Variables in the Equation
                                          B      S.E.     Wald    df   Sig.   Exp(B)
Step 1a   Education_groups                              16,651     5   ,005
          Education_groups(1)           ,626     ,353    3,149     1   ,076    1,871
          Education_groups(2)           ,157     ,326     ,231     1   ,631    1,170
          Education_groups(3)          -,753     ,232   10,507     1   ,001     ,471
          Education_groups(4)          -,246     ,346     ,503     1   ,478     ,782
          Education_groups(5)          -,157     ,598     ,069     1   ,793     ,855
          Region                                        54,374     3   ,000
          Region(1)                    1,932     ,329   34,565     1   ,000    6,902
          Region(2)                     ,072     ,182     ,157     1   ,692    1,075
          Region(3)                    1,289     ,275   22,037     1   ,000    3,628
          Marital_Status_married1                       12,183     3   ,007
          Marital_Status_married1(1)    ,032     ,408     ,006     1   ,938    1,032
          Marital_Status_married1(2)    ,515     ,374    1,897     1   ,168    1,674
          Marital_Status_married1(3)   -,881     ,291    9,149     1   ,002     ,414
          NofWomenfourormore                             2,839     4   ,585
          NofWomenfourormore(1)        -,058     ,216     ,071     1   ,789     ,944
          NofWomenfourormore(2)        -,128     ,277     ,212     1   ,645     ,880
          NofWomenfourormore(3)        -,578     ,392    2,177     1   ,140     ,561
          NofWomenfourormore(4)        -,513     ,624     ,676     1   ,411     ,599
          Constant                     3,399     ,221  235,711     1   ,000   29,922
a. Variable(s) entered on step 1: Education_groups, Region, Marital_Status_married1, NofWomenfourormore.
