Data Management and Analysis Group
A Threshold Guide to Readying Data for Analysis in SPSS: Quantitative analysis and Elizabeth David, the English National Pupil Dataset, a hidden variable, the Julian calendar and more

[Cover figure: 2007 home-school distance, pupils aged 10 in 2006, by key stage 2 writing test result quintiles. Vertical axis: average distance in miles (0 to 3). Categories: highest quintile, next to highest quintile, intermediate quintile, next to lowest quintile, lowest quintile, no mark/zero. Bar values as extracted: 2.56, 2.03, 1.70, 1.81, 1.56, 1.63.]

DMAG Education Guide 2009-1
September 2009
D.C. Ewens
ISSN 1479-7879

David Ewens
Data Management and Analysis Group
Greater London Authority
2nd Floor City Hall
The Queen’s Walk
London SE1 2AA
Tel: 020 7983 4656
Email: [email protected]

The document is GLA copyright

Acknowledgements

In its pre-Windows days, SPSS could be difficult for the newcomer to get to grips with, particularly if there were no workshops available and the would-be user had no background in Fortran. I remain appreciative of the help Pauline Booker, of the then School of Social Sciences at the Polytechnic of Central London, gave me during my earliest days with the software. A number of other people have made useful suggestions since, including Sean Hayes during his time with Hammersmith and Fulham’s Education Department, Rachel Leeser of the Social Exclusion Team in the GLA’s Data Management and Analysis Group (DMAG), and Richard Cameron of DMAG’s National Census Team. The Guide is not a mainstream DMAG research and statistics report and is, to that extent, a somewhat heretical departure from the Group norm.
However, the Guide is prompted by research necessity rather than whimsy (or even heresy) and I am appreciative of the time John Hollis and Rob Lewis in DMAG allowed for it to be written.

Further support

Advice on the next steps that might be taken by those new to readying data for analysis in SPSS is given in the concluding Section of the Guide. Regrettably, the pressure of research and statistical analysis at City Hall means that I cannot undertake to respond to requests for further advice or support.

Contents

1. Introduction
1.1 Why a DMAG Education Guide to organising data in SPSS?
1.2 A day in the life of a threshold Guide
1.3 Computing Capacity
1.4 Labelling datasets to achieve descriptive clarity
1.5 Adding ecological and other variables
1.6 Creating derived variables for analytical purposes
1.7 Missing data and data quality
2. Broader issues
2.1 The means do not justify the ends – data confidentiality
2.2 Understanding the data
3. The world after oven ready datasets. Multiple files and creating labels
4. Taking text (.txt) files into SPSS – File/Read text data
5. Using Frequency Tables to check for missing data
6. Inserting a new variable, setting its character and using the ‘Compute’ facility, selecting and deleting records
7. Converting a string variable to a numeric variable, coding data as they stand, using the SPSS ‘Missing’ column to identify several missing value codes, and the Missing Values module
8. Conditional ‘If’ statements and recoding data into a different variable – level of SEN support. Checking recoded data with Crosstabs
9. Using a ‘Compute if’ statement with an added conditional ‘or’ to create a numeric equivalent of a string variable
10. Working with string data. Upcase, Ltrim, substrings, concat, Recoding into the Same Variables and moving variables within datasets
11. Pre-amble to merging files. Using an existing data dictionary from another file
12. Merging datasets. The order of events in using external lookup tables
13. Using Autorecode to create large lookup tables. Creating a new case, case value and value label
14. Using large lookup tables. Checking for duplicate records and running out of disc space
15. Merging pupil datasets from different years – missing unique identifiers and a hidden variable
16. Using a ‘live’ database as a lookup table. Risks and triangulating with other datasets to reduce the risk of error
17. Using the Tables facility to check data. Too many values, long string variables, and overloading Tables
18. Organising data in Tables for other users, and using ‘Split file’ ‘Select Data’ and ‘Layer Table’ to minimise the risk of overload
19. Working with dates
20. Inward pupil mobility. Re-basing dates, calculating age, by-passing rounding numbers up, and bringing together work with substrings, Ltrim, ‘Select if’ and ‘Filter out unselected cases’
21. Calculating straight line distance between two points using northings and eastings
22. Aggregating data. Grouping information for different cases
23. Grouping variables in sets to reduce the time needed to locate variables in large datasets
24. Grouping values in a derived variable for analytical purposes – Visual Bander
25. Running the same procedure more than once. Syntax files
26. Conclusion
Appendix. Variable list, merged 2002 to 2005 London Pupil Dataset

1. Introduction and key issues

1.1 Why a DMAG Education guide to readying data for analysis in SPSS?

The Guide provides an introduction to a number of SPSS procedures that are useful in organising and readying data for analysis. This is a largely, though not entirely, neglected element in the literature on SPSS. It is not a general introduction to SPSS, or an introduction to statistical analysis in SPSS; a large number of introductory guides to the latter are already available. Indeed it assumes that the reader has experience of statistical analysis in SPSS, and is at home with its menus, file naming procedures and so on.

That said, the Greater London Authority’s Data Management and Analysis Group (DMAG) exists to help provide the GLA with its evidence base, and output is mainly in the form of published Research Briefings. A threshold guide to readying data for analysis in SPSS is not a Research Briefing. Since there are no plaudits for DMAG’s Education Team in writing such a guide, and since there is a risk that producing a computing guide may give a misleading impression of what the Team does, why write one?

The short answer is not so much that there is a title missing from the bookshelves (the researcher’s ‘gap in the literature’ reason for writing) but rather that an introductory, threshold guide is needed. I will illustrate that point with ‘a real world’ example.

A grant was available for work in the DMAG Education Team on the impact of multiple social disadvantage and educational attainment. This was to be part of a ‘making the case for London’ project with, potentially, material benefits for schools. The grant was to meet the cost of a researcher who would work on the first stage of the project, which involved readying data in large datasets for analysis in SPSS, using procedures of the type introduced in the Guide. This was an essential, and time-consuming, first step in the overall project. The time frame for the project meant that there was no time for training someone in the skills needed; the researcher needed to have those skills at the outset.

While this skills gap was, to say the least, inconvenient, it may not be realistic to expect formal and informal research and research training in universities to be geared towards those SPSS procedures which are useful in readying data for analysis. Taught courses in statistical analysis may use SPSS for the computations involved but, unsurprisingly, will focus on statistical analysis. Likewise, the very large number of texts written on SPSS is, far more often than not, aimed at those taking university courses in statistics and, again unsurprisingly, focuses on statistical analysis. Additionally, datasets used in higher education can be small, and are often purpose built and coded for a taught course or to deal with a comparatively narrow set of research questions. The issues raised in this Guide may not be particularly widespread in universities. Market research has also been a growth area for SPSS, but private sector companies are not widely noted for writing ‘how to’ guides for their competitors.

Unfortunately (or not) data in the world beyond the lecture theatre and seminar room are often not ready prepared for whatever analysis an individual researcher may have in mind, and that is part of the reason why the SPSS procedures discussed in the Guide can be so useful.

Part of the issue involves the use of secondary datasets. Some extremely important secondary datasets already exist, and a number of Web links to agencies dealing with, in this instance, longitudinal data are given in the concluding section to the Guide. There are at least two types of pre-existing ‘secondary’ datasets. Researchers can deposit, with UK Data Archives, datasets previously constructed for their own research, and a presentation entitled ‘Using and Archiving Research Data’ is available at the time of writing at www.esds.ac.uk/news/eventsdocs/birkbecknov08.ppt

Other data have been collected either to meet administrative need in an organisation, such as the payroll section of a private company, or by the