Why Use Databases for Data Entry

Module I2 Session 6

Why use databases for data entry?

Learning Objectives

At the end of this session students will be able to

• Explain the advantages and disadvantages of using a database for data entry
• Compare these advantages and disadvantages with the data entry tools explored in the previous sessions.

Activities

Activity 1

The first activity of this session is to read the interview transcribed below and extract the information you need to answer the following question:

What are the advantages of using a database for data entry?

Activity 2

Compile a table that presents a comparison of advantages and disadvantages of doing data entry using a Spreadsheet, CS-Pro and a database such as Access.

SADC Course in Statistics Module I2 Session 6

An interview with a professional data manager with experience in managing survey data

Can you give some examples of data entry software you might consider using?

It would really depend on how much data I had and how complex the data structures were. For example, for very small, flat files, i.e. files in which all the data are at the same level (e.g. household level or plot level), it is possible to use spreadsheets such as MS-Excel for data entry, and many researchers do this. Excel has the advantage of being in widespread use and easy to learn at a basic level. However, its ease of use can also be its downfall, as it can be used without much thought going into the structure of the data and it is far too easy to make mistakes.

Some sort of database package is better for any data beyond a few rows and columns. With packages such as Epi-Info and CS-Pro, both of which have data entry facilities, you need to first create a “data dictionary”. This is a good thing as it makes you think about exactly how many variables you will need and what they should be called. You can also create data entry forms and customise these to resemble your data collection sheets and questionnaires to some extent.

MS-Access is the application I generally use, particularly for data from large surveys where the data are at two or three different levels, e.g. household data and individual data. As with Epi-Info and CS-Pro, you need to set up the “tables” before you can enter data. The data entry forms that you can create are incredibly flexible.

In what way is Access better than Epi-Info or CS-Pro?

Access is basically much more flexible. In the other packages although you can customise the data entry forms to some extent and set some consistency and range checks, there is much more you can do with Access. Behind the data entry forms in Access you can create numerous “event procedures”. These are sections of VBA code that run in response to various “events” happening on the form. The “event” may be entering some data into a field or clicking on a command button. You can use these procedures to check for consistency or for setting up an automatic skip and fill system, etc.

In an Access database everything is stored in a single file. With CS-Pro, and I think this is also the case with Epi-Info, you end up with different files for the data dictionaries (one for each level of data from what I can work out), different files for the forms, etc. Just trying out a very simple example in CS-Pro I ended up with six different files, all with different extensions, some of which clash with other applications on my PC. This makes it very messy when you want to pass your application to someone else to use, and it is very easy to forget or lose one or more of the files. With Access it’s just a case of dealing with a single file, which is much easier. There are different “objects” within the Access database file – e.g. tables (which store the data), queries (which ask questions of the data), forms (used for data entry and editing) and reports (for summarising data) – but there is just one single file (with the extension .mdb), so only one file to back up and/or pass to colleagues.

It is often the case, particularly in surveys, that you have “sub-tables” on the questionnaire or data collection sheet. In Section B of the TIP questionnaire, for example, question 8 asks about gardens owned by members of the household. Each household can have several gardens, so effectively the “garden” data are another level. In Access you can easily include “sub-forms” on the data entry forms to accommodate these sub-tables. I feel this should be possible in Epi-Info and CS-Pro, and it may be, but so far I haven’t managed to do it. In CS-Pro I got the impression I would have to enter these data using a separate form, which is not very convenient and could result in these data being missed out. Of course it may be that I just haven’t worked out how to do this in CS-Pro, but I know it is relatively straightforward in Access.

Another feature that Access has but which I haven’t yet found in the other packages is “option groups”. For example question 1 asks for the sex of the respondent and on the questionnaire has Male 1 Female 2. In Access you can easily put an option group on the form which is linked to the SEX variable and has a check box for Male and a check box for Female.

What is the “most important” reason for using a relational database package?

Well, in my view it is the ability to create relationships or links between different levels of the data. Of course when we talk about relationships here we do not mean in the statistical sense – at this stage we are not interested in things such as whether the level of income is related to the health of family members, for instance. Here we are talking about being able to link, for example, data on individuals with data about the household. This is done by having the unique identifier for the household level appearing in the individual level as well. Of course we can do this in Excel – we can have data at the household level on one worksheet and data at the individual level on another worksheet, with the household ID column in both worksheets. However, in Excel there is no straightforward way to ensure that the household ID is unique at the household level, or to make sure that every household ID mentioned at the individual level actually exists. In Access, on the other hand, you can set Household ID to be the “primary key” at the household level. The primary key must be unique for each record in the table. You can also set up a relationship between the household table and the individuals table and enforce the relationship so that you cannot enter an individual into a household that does not exist in the database. In Access this is referred to as “referential integrity” and it basically helps to ensure the integrity of the data.
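The primary key and referential integrity ideas described above can be sketched outside Access as well. The example below is only an illustration, using Python's built-in sqlite3 module in place of Access; the table names, column names and values (household, individual, hhid, memberno) are made up for the purpose.

```python
import sqlite3

# SQLite stands in for Access here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys once this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Household level: hhid is the primary key, so it must be unique.
conn.execute("CREATE TABLE household (hhid INTEGER PRIMARY KEY, village TEXT)")
# Individual level: hhid must match an existing household record.
conn.execute(
    "CREATE TABLE individual ("
    " hhid INTEGER NOT NULL REFERENCES household(hhid),"
    " memberno INTEGER NOT NULL,"
    " sex INTEGER,"
    " PRIMARY KEY (hhid, memberno))"
)


def try_insert(sql, params):
    """Run one insert; return 'ok' or the reason the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


results = [
    try_insert("INSERT INTO household VALUES (?, ?)", (101, "A")),
    try_insert("INSERT INTO household VALUES (?, ?)", (101, "B")),  # duplicate household ID
    try_insert("INSERT INTO individual VALUES (?, ?, ?)", (101, 1, 2)),
    try_insert("INSERT INTO individual VALUES (?, ?, ?)", (999, 1, 1)),  # no such household
]
for result in results:
    print(result)
```

The second and fourth inserts are refused by the database itself, which is the kind of protection a pair of worksheets does not give you by default.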

What steps would you follow to create a data entry system in Access?

1. The first step should always be to work out exactly how many levels of data you have and what variables you need at each level. Give names to your variables; I always suggest using a maximum of 8 characters for variable or field names. This is so that the data can later be extracted to almost any statistical package for analysis. Make sure you don’t have any duplicate variable names, even at different levels – in one database I worked on we had a variable called SEX at two different levels and this led to a lot of confusion later on when we came to merge and aggregate the data.

2. At the same time, work out the data types for each variable – e.g. text, numeric, date, etc. With numeric variables, do you need to allow for decimals, or is it a numeric code, in which case you can restrict entry to a limited set of values? For these coded variables I tend to use “lookup” fields, in which the field is linked to a small table of codes and values. That way you have all the information in the database to interpret the codes.

3. Then you create the tables – there tends to be one table for each level of data, although for very large surveys there can be lots of tables at the same level, since there is a limit of 255 variables in a single Access table. For large surveys I’ve often created a different table for each section of the questionnaire and these tables are linked with a “one-to-one” relationship – i.e. one record in one table links to exactly one record in the other table. The link is generally via the primary key field.

4. After you create the tables you next create the relationships and thus the structure of the database itself. Where possible you should set referential integrity on the relationships.

5. The next step is to create the data entry forms. As I’ve already said, these are very flexible in Access. This is probably one of the more “creative” aspects of Access, although as much as possible it is best to make your form resemble your data collection sheet or questionnaire. Designing the layout of the form is done first, and you can add extra text and dividing lines where necessary. In the past I have created multi-lingual databases with sets of forms in different languages. The forms are linked to the tables in the database and that’s where the data are stored, but the labels on the forms can be in any language. With large questionnaires you often reach the size limit of data entry forms, but using command buttons and code you can link from one form to the next easily enough, ensuring that you are keeping to the same record as you move between forms. This would be like turning the page on the paper questionnaire.

6. Once the layout of the forms is complete, the next stage is to check that the flow through the form is correct. In Access the “tab order”, which is the order in which you would normally go through the variables on the form, by default matches the order in which they were added to the form. This might result in you jumping all over the place, so this needs to be checked and if necessary the tab order can be adjusted.

7. I would then go through and put in any automatic skips. For example, in Section B of the TIP questionnaire question 5a asks “Are you the head of the household?” Question 5b then asks “If not, are you the acting head?” Of course if the answer to 5a is “Yes” we would want to skip over 5b. This we can do in Access using an event procedure (a piece of VBA code) that checks the value that was entered in 5a and if necessary skips over question 5b. If you have assigned a code for “Not applicable”, this code could automatically be inserted into 5b. This saves time during data entry and helps maintain completeness and integrity of the data.

8. We can also include some consistency checks on the values entered. For example, question 7 asks “How much land is your household cultivating this season?” and question 10 asks “How much land does your household have in total?” You would expect the value given in question 10 to be at least as large as that given in question 7, so you can include an event procedure to check this and to give a customised message if there is a discrepancy.

9. Finally, I would tend to build a “front-end” in the form of a “Main menu” form with command buttons leading to the data entry forms. It is easy to get the main menu to open as soon as you open the database.
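The automatic skip and the consistency check in steps 7 and 8 can be sketched as follows. In Access these would be VBA event procedures attached to the form; the version below is plain Python, written only to show the logic. The answer codes, the “Not applicable” code, the field names and the name of the question that follows 5b are all assumptions made for the illustration.

```python
# Hypothetical codes: the interview does not say which values were actually used.
YES, NO = 1, 2
NOT_APPLICABLE = 9


def after_q5a(record):
    """Mimic an event that runs once 5a is entered.

    Fills 5b automatically when it does not apply and returns the name of
    the field the cursor should move to next.
    """
    if record["q5a"] == YES:
        record["q5b"] = NOT_APPLICABLE  # automatic fill
        return "q6"  # skip over 5b ("q6" is an assumed name for the next question)
    return "q5b"


def check_land(record):
    """Return a message if cultivated land (q7) exceeds total land (q10), else None."""
    q7, q10 = record.get("q7"), record.get("q10")
    if q7 is None or q10 is None:
        return None  # nothing to compare yet; a missing value is not treated as zero
    if q7 > q10:
        return f"Q7 ({q7}) is larger than Q10 ({q10}): please check the questionnaire."
    return None


head = {"q5a": YES}
print(after_q5a(head), head)
print(check_land({"q7": 2.0, "q10": 1.5}))
```

Keeping each check in its own small function mirrors the way one event procedure is attached to one field on the form.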

What other useful features does Access have?

The reports can be very useful, not only as a way of summarising the data but also as a way to pre-print additional data entry questionnaires. I’ve recently been working on a database for a study in which there is an initial household interview where data are collected about the household, including a roster of household members. Then there will be regular visits to the household throughout the study, when questions will be asked about the household members. To save having to write out the names of the household members each time – which can lead to lots of errors – I have included a set of reports to print out the visit forms including the names of the household members. This is working really well.
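The idea of pre-printing visit forms from the roster can be illustrated with a few lines of Python. This is not how an Access report is built; it is only a sketch of the principle that names are taken from the stored roster and never retyped. The roster contents, the layout and the “Present today?” question are invented for the example.

```python
def visit_form(hhid, members, visit_no):
    """Build the text of one visit form with member names already filled in."""
    lines = [f"Household {hhid}   Visit {visit_no}", ""]
    for number, name in members:
        # Names come from the stored roster, so nobody has to write them out again.
        lines.append(f"{number:>2}  {name:<20} Present today? ____")
    return "\n".join(lines)


# Made-up roster keyed by household ID: (member number, name) pairs.
roster = {101: [(1, "Member A"), (2, "Member B")]}
form = visit_form(101, roster[101], visit_no=3)
print(form)
```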

Are there any disadvantages to using Access?

Yes, there are several.

• I guess an important one is the cost – both Epi-Info and CS-Pro are free, so cost is not a concern with them.

• Access also has a very steep learning curve. Having said that, though, I believe the most important thing to understand about data is its structure, i.e. the levels of data and how the different levels are related. In my view this is an important concept to understand even if you are just using a spreadsheet package. Access (like other database packages) tends to make you think more about the structure of your data before you start, and this in my view is a good thing. I do accept, though, that Access is not the most straightforward package to learn to use – it’s taken me 12 years to get to my current level of understanding and I’m still learning new things!

• Access database files tend to expand exponentially in size! For some very large surveys I have created databases which are 40 or 50MB just in their design – that’s before any data are entered! Access tends to eat disk space in much the same way as I would eat chocolate! This does cause problems if you need to send the database to others, as the files are often much too large to attach to an email, for example.

• There is no inbuilt double-data entry checking procedure in Access. This is where Epi-Info comes in particularly handy, as it includes a “Data Compare” facility which will compare tables from Access databases – it does tend to be a bit slow on very large tables but it does work if you give it long enough, so it is very useful.

• Some people might find it a disadvantage that Access does not have facilities for analysis. I know Epi-Info and CS-Pro include some statistical functions and some people must find these useful. In Access you can do some simple summaries in queries and reports but nothing beyond that. However, in my view this is not a disadvantage, as I’ve a strong belief in using the correct package for each task and a suspicion of packages that claim to do everything – I’m not trying to downplay Epi-Info and CS-Pro with this comment, as I believe these are good packages in their own way, especially given that they are freely available – but Access is primarily a database management system and therefore managing data is what it does best.

• I was a little concerned when I heard that in Access 2007 you can create fields that hold more than one value. To me this goes against the basic data management principle of “one item per cell”. I haven’t fully investigated this yet but I think it may be something that researchers should be careful of. On the other hand, it does look as though Microsoft have finally removed the default value of zero on numeric fields – I always used to find this frustrating. If you missed removing the default values you could end up with lots of false zeros in your datasets, and as all researchers know a missing value should never be treated as zero!
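To show what a double-data entry check involves, here is a minimal sketch in Python. It is not Epi-Info's “Data Compare” facility and makes no claim about how that tool works; it simply compares two independent keyings of the same records and lists the differences. The field names and values are made up.

```python
def compare_entries(first, second, key="hhid"):
    """List differences between two independent keyings of the same records."""
    first_by_key = {row[key]: row for row in first}
    second_by_key = {row[key]: row for row in second}
    problems = []
    for k in sorted(first_by_key.keys() | second_by_key.keys()):
        a, b = first_by_key.get(k), second_by_key.get(k)
        if a is None or b is None:
            problems.append((k, "record missing from one entry", None, None))
            continue
        # Compare field by field; a missing value (None) only matches another None.
        for field in sorted(a.keys() | b.keys()):
            if a.get(field) != b.get(field):
                problems.append((k, field, a.get(field), b.get(field)))
    return problems


# Two made-up keyings of the same questionnaires.
entry1 = [{"hhid": 101, "sex": 1, "land": 2.5}, {"hhid": 102, "sex": 2, "land": None}]
entry2 = [{"hhid": 101, "sex": 1, "land": 2.0}, {"hhid": 103, "sex": 2, "land": 1.0}]
for problem in compare_entries(entry1, entry2):
    print(problem)
```

Each reported difference would then be resolved by going back to the paper questionnaire.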

In conclusion…

I thoroughly enjoy using Access but if you have only a small dataset and not a great deal of time then you might be better off using Epi-Info or CS-Pro (especially if you have a limited budget). If you do opt for Access be prepared for a steep learning curve – when you manage to do something clever with it, it is very rewarding.
