
Statistics deals with the collection, presentation, analysis and interpretation of data. Insurance (of people and property), which now dominates many aspects of our lives, utilises statistical methodology. Social scientists, psychologists, pollsters, medical researchers, governments and many others use statistical methodology to study the behaviour of populations.

You need to know the following statistical terms.

A variable is a characteristic of interest in each element of the sample or population. For example, we may be interested in the age of each of the seven dwarfs.

An observation is the value of a variable for one particular element of the sample or population, for example, the age of the dwarf called Bashful (= 619 years).

A data set is all the observations of a particular variable for the elements of the sample, for example, a complete list of the ages of the seven dwarfs: {685, 702, 498, 539, 402, 685, 619}.

Collecting data

Census

The population is the complete set of data under consideration. For example, a population may be all the females in Ireland between the ages of 12 and 18, all the sixth year students in your school or the number of red cars in Ireland. A census is a collection of data relating to a population. A list of every item in a population is called a frame.

Sample

A sample is a small part of the population that is selected. A random sample is a sample in which every member of the population has an equal chance of being selected. Data gathered from a sample are called statistics. Conclusions drawn from a sample can then be applied to the whole population (this is called statistical inference). However, it is very important that the sample chosen is representative of that population to avoid bias.

Bias

Bias (unfairness) is anything that distorts the data so that they will not give a representative sample. Bias can occur in sampling due to:

1. Failing to identify the correct population.
2. A sample size that is too small or using a sample that is not representative.
3. Careless or dishonest answers to questions.
4. Using questions that are misleading or ambiguous.
5. Failure to respond to a survey.
6. Errors in recording the data, for example, recording 23 as 32.
7. The data can go out of date, for example, conclusions drawn from an opinion poll can change over a period of time.

Reasons for using samples:

1. They are fast and cheap.
2. Sampling is essential when the sampling units are destroyed (called destructive sampling). For example, we cannot test the lifetimes of every light bulb manufactured until they fail.
3. Quality of information gained is more manageable and better controlled, leading to better accuracy. (More time and money can be spent on the sample.)
4. It is often very difficult to gather data on a whole population.
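The ideas of a frame, a random sample and a statistic can be sketched in a few lines of Python. This is only an illustration: the frame of 500 students, their randomly generated ages and the sample size of 50 are all assumptions made up for the example, not data from the text.

```python
import random

random.seed(1)  # fixed seed so that the sketch gives the same result each run

# Hypothetical frame: one (name, age) record for every member of the population.
frame = [(f"student_{i}", random.randint(12, 18)) for i in range(1, 501)]

# Simple random sample: every member of the frame has an equal chance
# of being selected, and nobody is selected twice.
sample = random.sample(frame, k=50)

# The sample mean is a statistic; it is used to estimate the population mean.
population_mean = sum(age for _, age in frame) / len(frame)
sample_mean = sum(age for _, age in sample) / len(sample)

print(f"Population mean age: {population_mean:.2f}")
print(f"Sample mean age:     {sample_mean:.2f}")
```

With a representative sample, the two means should be close. A much smaller sample (try k=3) will often give a poorer estimate, which is one of the sources of bias listed above.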

Sample survey

A survey collects data (information). A sample survey is a survey that collects data from a sample of the population, usually using a questionnaire. Questionnaires are well-designed forms that are used to conduct sample surveys.

The main survey methods are:

• Personal interview: People are asked questions directly. This is regularly used in market research.
• Telephone survey: Often used for a personal interview.
• Postal survey: A survey is sent to someone's address.
• Online questionnaires: People fill out the questionnaire online.

Advantages and disadvantages of surveys are as follows.

Personal interview (face to face)
Advantages:
• High response rate.
• Can ask many questions.
• Can ask more personal questions.
Disadvantages:
• Can be expensive.
• Interviewer can influence response.

Telephone survey
Advantages:
• High response rate.
• Can ask many questions.
• Can ask more personal questions.
Disadvantages:
• Can be expensive.
• Interviewer can influence response.
• Easier to tell lies.

Postal survey
Advantages:
• Relatively cheap.
• Can ask many questions.
• Can ask more personal questions.
Disadvantages:
• Poor response rate.
• Partly completed.
• Limited in the type of data collected.
• No way of clarifying any questions.

Online questionnaires
Advantages:
• Cheap and fast to collect large volumes of data.
• More flexible design.
• Ease of editing.
• Can be sent directly to a database such as Microsoft Excel.
• No interviewer bias.
• Anonymity.
• No geographical problems.
Disadvantages:
• Limited to those with access to an online computer. This leads to sample bias.
• Technical problems (crashes, freezes).
• Protecting privacy is an ethical issue.

Other methods for collecting data

An experiment is a controlled study in which the researcher attempts to understand cause-and-effect relationships. This method of collecting data is very popular with drug companies testing a new drug.

Observational studies

Studies in which data are obtained by making observations are called observational studies. The data are collected by counting, measuring or noting things that happen. For example, a traffic survey might be done in this way to reveal the number of vehicles passing over a bridge. Important factors are place, time of day and the amount of time spent collecting the data. Observational studies can be laborious and time consuming.

Designed experiments

Studies in which data are obtained by an experiment are called designed experiments. The data are collected by counting or measuring, e.g. throwing a die or tossing two coins a number of times and recording the results. The key things to remember are that the experiment must be repeated a number of times and that the experiment must be capable of being repeated by other people. Also, in an experiment, we can measure the effects, if any, that result from changes in the situation.

Data capture is the process by which data are transferred from a paper copy, e.g. a questionnaire, to an electronic file, usually on a computer.

Descriptive statistics is the use of graphs, charts, tables, various measurements and calculations to organise and summarise information (data). It attempts to focus our information and reduce it to a manageable size.
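As a small illustration of descriptive statistics, the sketch below summarises the ages of the seven dwarfs from the data set given earlier. The choice of summary measures (mean, median, mode and range) is an assumption made for the example; the text only speaks of "various measurements and calculations".

```python
import statistics

# The seven dwarfs' ages, taken from the data set example earlier in the section.
ages = [685, 702, 498, 539, 402, 685, 619]

# Reduce the seven observations to a handful of summary values.
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "range": max(ages) - min(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Five numbers now stand in for the whole list, which is the sense in which descriptive statistics reduces information to a manageable size.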

Inferential statistics is when a portion or a sample of a population is studied and conclusions are reached about the entire population. The table below gives examples of sampling a particular population and how it might be carried out.

Population: 30 million actual voters in a UK general election.
Sample: An exit poll of 16,500 actual voters who were asked who they voted for.

Population: All owners of dogs.
Sample: A telephone survey of 1,000 dog owners.

Population: All prison inmates.
Sample: A criminal justice study of 200 prison inmates.

Population: The CEOs of all private companies.
Sample: Results from surveys sent to 120 CEOs of private companies.

Population: Legal aliens living in Ireland.
Sample: A sociological study conducted by a university researcher of 250 legal aliens.

Population: Adult children of drug addicts.
Sample: A psychological study of 75 such individuals.
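The exit poll example can be used to sketch how an inference is made. In the code below, only the sample size (16,500) and the population size (30 million) come from the table; the party names and the counts for each party are invented for illustration.

```python
# Hypothetical exit-poll tally. The counts are made up; they add up to the
# 16,500 voters mentioned in the table.
poll_counts = {"Party A": 6_600, "Party B": 5_775, "Other": 4_125}

sample_size = sum(poll_counts.values())
population_size = 30_000_000  # actual voters in the general election

projection = {}
for party, count in poll_counts.items():
    share = count / sample_size  # a statistic calculated from the sample
    # Inference: assume the whole population votes in the same proportions.
    projection[party] = round(share * population_size)
    print(f"{party}: {share:.0%} of the sample, about {projection[party]:,} votes projected")
```

The projection is only as good as the sample: if the 16,500 voters are not representative of all voters, the conclusions about the population will be biased.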

Data are either quantitative or qualitative:

Quantitative data: data which can be counted or measured.
Qualitative data: data which cannot be measured.

We divide quantitative data into:
(i) Discrete data
(ii) Continuous data

We divide qualitative data into:
(i) Categorical data
(ii) Ordinal data

Discrete and continuous data

Data which can only have certain individual values are called discrete data. Discrete variables usually result from counting.

Discrete variable: Number of students in a class of 30 with blood type A.
Possible values for the variable: 0, 1, 2, ..., 30

Discrete variable: The number of times a coin is tossed before a tail appears.
Possible values for the variable: 1, 2, 3, ... This has no upper limit since a tail might never appear!

Data which are measured on some scale and can take any value on that scale are called continuous data. A continuous variable usually results from making a measurement of some type.

Continuous variable: The amount of time (t), in hours, spent on one day by certain individuals on Facebook.
Possible values for the variable: All real numbers between 0 and 24 inclusive, i.e. 0 ≤ t ≤ 24.

Qualitative data

Categorical and ordinal data

Data which fit into a group or category are called categorical data. If the categories have an obvious order, the data are said to be ordinal.

Categorical variable: Restaurant service.
Possible ordinal level data values associated with the variable: Poor, good, excellent.

Categorical variable: Pain level.
Possible ordinal level data values associated with the variable: None, low, moderate, severe.

Categorical variable: County of residence (Wicklow, Dublin, Cork, Sligo, etc.).
Possible ordinal level data values associated with the variable: No ordinal values in this case.

Categorical variable: Blood type (O, A, B, AB).
Possible ordinal level data values associated with the variable: No ordinal values here.

A class test in English poetry consists of 20 questions. The resulting score reflects the work rate and aptitude in English poetry of the test candidate. How could the score be reported? What are the possible values for the scores? Describe the variable in terms of discrete, continuous, categorical, ordinal.

Solution: This is an interesting question and solution. It illustrates different answers, yet all are correct. It shows how important it is for the examination candidate to be creative.

The score reported would likely be the number or percentage of correct answers. In other words, the number of correct answers would give a whole number score from 0 to 20, presuming equal credit per question (0 incorrect, 1 correct). Similarly, the percentage correct would yield a score from 0 to 100 in steps of five. Here the scores would be discrete. The position is complicated if each question does not have equal credit.

However, if the teacher considered not only the answer but the reasoning process used to arrive at the answers and assigned partial credit for each problem, the score could be any real number between 0 and 20 or any real number between 0 and 100 per cent, i.e. the data would be continuous.

Alternatively, the teacher might allocate a grade of A, B, C, D, E, F or NG to each candidate based on a scale deemed suitable at the time. In this case, the data would be ordinal.
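The discrete and ordinal ways of reporting the score can be listed out with a short Python sketch. The grade boundaries used here (85, 70, 55, 40, 25 and 10 per cent) are an assumption chosen for illustration; the text only says that the scale is one "deemed suitable at the time".

```python
QUESTIONS = 20

# Equal credit, one mark per question: whole-number scores from 0 to 20.
raw_scores = list(range(QUESTIONS + 1))

# The same scores as percentages: 0 to 100 in steps of 5 (still discrete).
percent_scores = [100 * score // QUESTIONS for score in raw_scores]

# Hypothetical grade bands: (lowest percentage for the grade, grade).
GRADE_BANDS = [(85, "A"), (70, "B"), (55, "C"), (40, "D"), (25, "E"), (10, "F"), (0, "NG")]


def letter_grade(percent):
    """Convert a percentage into an ordinal grade using the assumed bands."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    for lowest, grade in GRADE_BANDS:
        if percent >= lowest:
            return grade


print(raw_scores)
print(percent_scores)
print([letter_grade(p) for p in percent_scores])
```

The first two lists are discrete data (only certain separate values are possible), while the list of grades is ordinal data because the categories have an obvious order.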

Univariate data

When one item of information is collected from each member of a group, the data collected are called univariate data. Examples of univariate data include:
(i) Height in centimetres.
(ii) Eye colour.
(iii) Number of siblings.

Bivariate data

When two items of information are collected, e.g. a person's height and weight, this is called bivariate data (or paired data). Examples of bivariate data include:
(i) Starting salary and years of education.
(ii) Hair colour and gender.
(iii) The amount of milk and number of eggs required to make scrambled eggs.
(iv) The number of people in a house and the number of rooms in a house.

Notes:
Example (i) is bivariate continuous data.
Example (ii) is bivariate categorical data.
Example (iii) is bivariate data in which the first variable is continuous and the second is discrete.
Example (iv) is bivariate discrete data.

Exercise 1.1

1. Classify each of the following as either inferential statistics or descriptive statistics.
(i) Eleven per cent of the packets of crisps sampled by a quality inspector are found to be below the labelled weight. Based on this finding, the filling machine is adjusted to increase the amount of fill.
(ii) The Irish Times gives a full page of numerical quantities concerning stocks listed in Dublin, New York and London as well as conversion rates for the euro (€) against the major world currencies.
(iii) Based on a survey of 140 prisoners by the Department of Justice, a magazine reports that 92% of prisoners did not attend third-level education.

2. A data set lists apartments and villas available for tourists to rent. Information provided includes the weekly rent, whether or not electricity is included free of charge, whether or not pets are allowed, the number of rooms and the distance to the beach.
(i) Describe the elements in the data set.
(ii) Give the number of variables and specify if each variable is categorical, discrete or continuous.

3. Make up your own survey with at least six questions. Include at least two categorical variables, at least one discrete variable and at least two continuous variables. State which variables are categorical, discrete and continuous. Give reasons for your answers.

4. You are planning a survey to collect information about the study habits of secondary students. Describe one categorical variable, one discrete variable and one continuous variable that you might measure for each student. Give the units of measurement where relevant.

5. Identify the sample and the population in each of the following scenarios.
(i) Eight hundred individuals who watch soccer on TV are selected and information concerning their income level, age, place of residence and so forth is recorded.
(ii) Two hundred and ten athletes at a major track and field event are selected to give a blood sample to check for substance abuse.
(iii) In order to study the response times for emergency fire brigade calls in Cork city and county, 40 emergency fire brigade calls in progress are selected randomly over a three-month period and the response times are recorded.

6. For each of the following, state how many observations are in the data set and state the variable.
(i) In a sociological study involving 40 households, the number of children per household attending primary school was recorded for each household.
(ii) In a school with 660 students, the number of hours spent studying per student was recorded. The minimum was zero hours studying and the maximum was 25 hours studying.
(iii) A survey was mailed to 6,000 households and one question asked for the number of pets per household. Two thousand eight hundred and five of the surveys were completed and returned.

7. Classify the variables in parts (i), (ii) and (iii) of question 6 as continuous or discrete.

8. Garda checkpoints are used to identify and arrest motorists driving over the legal blood alcohol limit. Let x represent the number of motorists stopped before the first drunk driver is identified.
(i) What are the possible values for x?
(ii) Classify x as discrete or continuous.

Surveys

Designing a questionnaire

A questionnaire is a set of questions used to obtain data from a population. Anyone who answers a questionnaire is called a respondent.

Always have a clear aim for your survey and ask questions in a logical order.

The questionnaire should:

• Be clear about who is to complete it.
• Be as brief as possible.
• Start with simple questions.
• Be able to be answered quickly.
• Be clear how the answers are to be recorded.
• Be clear where the answers are to be recorded.

The questions should:

• Be short and use simple language.
• Not be leading in any way, as this can influence the answer.
• Provide tick boxes.
• Not cause embarrassment or offend.
• Be clear about what is asked.
• Be relevant to the survey.
• Allow a 'yes' or 'no' answer, a number or a response from a choice of answers.
• Not be open-ended, which might produce long or rambling answers that are difficult to analyse.

Question: Gender: Male [ ] Female [ ]
Comment: Good clear question.

Question: How old are you?
Comment: Personal question, as people may be embarrassed to give their age. No indication of accuracy.

A better question would be:
Question: Which is your age group, in years? Under 18 [ ] 18-40 [ ] 41-60 [ ] Over 60 [ ]
Comment: Only one response required. No gaps and no overlapping of boxes.

Question: You prefer to go out on Saturdays, don't you?
Comment: A leading question. It forces an opinion on the person being surveyed.

A better question is:
Question: On which day do you prefer to go out? Mon [ ] Tue [ ] Wed [ ] Thu [ ] Fri [ ] Sat [ ] Sun [ ]
Comment: A much better question. Respondents have a choice. Better accuracy for the survey.

Question: How much TV do you watch on a school weeknight? A lot [ ] A bit [ ] Very little [ ]
Comment: This question is too vague.

A better question is:
Question: How many hours of TV, to the nearest hour, do you watch on a school weeknight? 0 [ ] 1 [ ] 2 [ ] 3 [ ] 4 or more [ ]
Comment: This is more precise. Better accuracy for the survey.
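The comment "No gaps and no overlapping of boxes" can be checked mechanically. The sketch below tests the age-group boxes from the table against every whole-number age from 0 to 120, and compares them with a badly designed set of boxes that is made up for this illustration. Ages are assumed to be given in whole years.

```python
# The age-group boxes from the better question, written as tests on an age.
BOXES = {
    "Under 18": lambda age: age < 18,
    "18-40": lambda age: 18 <= age <= 40,
    "41-60": lambda age: 41 <= age <= 60,
    "Over 60": lambda age: age > 60,
}

# A hypothetical poor design: ages 18 and 40 fit two boxes, and 60 fits none.
BAD_BOXES = {
    "18 or under": lambda age: age <= 18,
    "18-40": lambda age: 18 <= age <= 40,
    "40-59": lambda age: 40 <= age <= 59,
    "Over 60": lambda age: age > 60,
}


def matching_boxes(age, boxes):
    """Return the labels of every box that a respondent of this age could tick."""
    return [label for label, fits in boxes.items() if fits(age)]


def problem_ages(boxes, ages=range(0, 121)):
    """Return the ages that fit no box (a gap) or more than one box (an overlap)."""
    return [age for age in ages if len(matching_boxes(age, boxes)) != 1]


print(problem_ages(BOXES))
print(problem_ages(BAD_BOXES))
```

A well-designed set of boxes gives an empty list of problem ages, so every respondent has exactly one box to tick.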