Data Introduction

M. Benno Blumenthal and John del Corral 5 June 2008

Contents

1 Introduction
2 Ingest
  2.1 Uploading Excel File
3 Adding Additional Metadata
  3.1 Independent Variables: Time and Locale
    3.1.1 Time
    3.1.2 Locale
  3.2 Dependent Variables
4 Diagnosing Common Data Problems
  4.1 Self Consistency
  4.2 Geographical Consistency
    4.2.1 Geolocation
    4.2.2 use as grid
    4.2.3 add variable
    4.2.4 Locale Consistency Check
5 Summary

Abstract

This is a short description of how to convert numbers in an Excel file and a shapefile into a dataset that can be used for temporal and spatial analysis.

1 Introduction

We are interested in analyzing fields temporally and spatially. We start with Excel tables of data (possibly never analyzed) and shapefiles. First, we have the basic technical problem of reading the Excel table and the shapefiles into our analysis program. Second, we need to add structure and descriptions to augment the information contained in the files. We then need to do a preliminary analysis of the data to find and remove flaws that prevent it from being accurate information about the fields and their spatial and temporal structure.

Figure 1: Ethiopia Malaria Data

Figure 2: Screen asking for file to upload

2 Ingest

• locate Excel file or shapefile
• scan and display table structure
• edit/enhance data descriptions

Given the location of the Excel file, the software scans the file and presents the structure of the table so that edits can be made and additional information can be provided. Edits are mostly necessary if the column headings are not simple, in which case simple names should be introduced for each column. Additional information can be added at this point, including units and extended descriptions. Similarly, given the location of the shapefiles, the software scans the files and presents the basic information.

2.1 Uploading Excel File

Upload Excel File

Figure 2 shows the screen asking for a file to upload. The Browse button lets one locate the file on one's local machine. The Upload button transmits the file to the Data Library.

Response to Excel File upload

Figure 3 shows the response to an Excel file upload: the table structure that has been extracted from the Excel file.

Figure 3: Screen with results of Excel file upload

Figure 4: Screen to edit metadata

The Add Metadata button then lets one proceed to the next screen, which allows adding additional information to the dataset.

3 Adding Additional Metadata

Figure 4 shows the edit metadata screen. This screen allows adding additional information. The top of the page allows adding a description of the new dataset. Information for each column can be added as well: ensuring that the columns are properly recognized as dates or numeric values is particularly useful.

3.1 Independent Variables: Time and Locale

While ultimately we would like to consider the data as a function of time and locale, most likely the Excel file has information about time and some kind of spatial entity (such as district or state).

3.1.1 Time

The best-case scenario is that one of the columns is clearly time, i.e. formatted as an Excel-standard date. We then simply designate the column as an independent variable, and the software extracts a sorted list of values as our time axis.

Precisely describing time

• when speaking about time, we usually imply a start and an end: January 2008 implies the entire month, whereas 1 January 2008 implies the entire day.
• when analyzing data, we usually need to know the interval corresponding to each data point.
• for the sorts of data we have (days to weeks to months to years), some intervals can be precisely specified in Excel, some cannot.

Example: weekly sea surface temperature data.

Specified Time

• specify the time dependence
• specify column interpretation and time connection

The more likely scenario is that one or more of the columns corresponds to time, but the computer needs to be instructed on how to interpret the columns and make the connection to time. In this case we describe the time dependence of the data (start, step, end) and thus construct the independent variable for time. Later we can provide the pattern that connects the time to the appropriate columns so that we can select the appropriate date. This can be necessary even when one of the columns is cleanly time. For example, if weekly data are marked by the start of the week, we still need to clarify the time information to interpret the data properly.
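As a rough illustration of describing time by (start, step, end), the following Python sketch builds a time axis in which every point carries an explicit interval. It is not the Data Library's implementation; the function names `monthly_axis` and `weekly_axis` are hypothetical, and the monthly range (January 1993 to December 2005) is borrowed from the example in Section 3.2.

```python
from datetime import date, timedelta


def monthly_axis(start, end):
    """Return one (interval_start, interval_end) pair per month.

    `start` and `end` are dates inside the first and last month.
    Each interval_end is exclusive: it is the first day of the next month.
    """
    intervals = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        if month == 12:
            next_year, next_month = year + 1, 1
        else:
            next_year, next_month = year, month + 1
        intervals.append((date(year, month, 1), date(next_year, next_month, 1)))
        year, month = next_year, next_month
    return intervals


def weekly_axis(first_week_start, n_weeks):
    """Return n_weeks consecutive 7-day intervals, labelled by week start."""
    one_week = timedelta(weeks=1)
    return [
        (first_week_start + i * one_week, first_week_start + (i + 1) * one_week)
        for i in range(n_weeks)
    ]


# Monthly axis for Jan 1993 through Dec 2005 (13 years of months).
axis = monthly_axis(date(1993, 1, 1), date(2005, 12, 1))
```

Keeping both ends of each interval makes explicit what a label such as "January 2008" or "week starting 7 January" actually covers, which is the information the analysis needs later.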

3.1.2 Locale

The best-case scenario is that one of the columns in the Excel file is clearly an identifier for a spatial entity, i.e. each identifier is unique and corresponds to a value in the corresponding column in the shapefile. We then simply designate the column as an independent variable, and the software extracts a sorted list of values as our entity axis.

Specified Locale

• specify the locale id (choice guided by the shapefile)
• specify column interpretation and locale connection

The more likely scenario is that the entries in the column are not completely self-consistent, nor are they consistent with the shapefile. It is also possible that multiple columns (state and district name, for example) are required to identify the spatial entity. In this case, it is more likely that the shapefile is properly constructed and labelled, and its columns can be used as the spatial dimension, while later a pattern is provided that connects the spatial dimension to the appropriate columns of the Excel file to select the proper locale. This will not alleviate the need for editing for consistency.

3.2 Dependent Variables

Dependent variables are the data to be analyzed, and can be most simply provided as columns in the Excel file. In that case the Add Metadata screen lists the columns, and additional descriptive information can be added.

Variables from multiple columns

Sometimes variables are stored as multiple columns in the file (most frequently a different month in each column), in which case a pattern can be provided which properly combines the columns. The screen provides an opportunity to give such a pattern and to remove the corresponding combined columns from the dataset.

For example, consider the sample dataset 1993-2005 Highlands incidence. On ingest, we find 26 columns: 12 for monthly cases, 12 for monthly incidence, year, and district. So in this case there is a clear column for locale, but no such column for time. We therefore define the time variable as going from Jan 1993 to Dec 2005 in steps of one month. We then translate the column names to English so that we can easily generate the month names from the time with our current software. The pattern that corresponds to incidence is then

select incid_%b[T] as incid from data_central_highl_incidence WHERE district='%s[district]' AND year=%Y[T]

where T is our time variable, which we have used to generate both the year and the month part of the column name.
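To make the role of the placeholders concrete, here is a minimal Python sketch that expands the pattern for one time value and one district. It only illustrates how such a pattern could be filled in and is not the Data Library's own code: the function `expand_pattern`, the fixed English month list, and the choice of AMBOSITRA as the district are assumptions made for this example, and only the three placeholders that appear in the pattern are handled.

```python
import re
from datetime import date

# English month abbreviations, fixed here so the result does not depend
# on the locale of the machine running the sketch.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

PATTERN = ("select incid_%b[T] as incid from data_central_highl_incidence "
           "WHERE district='%s[district]' AND year=%Y[T]")


def expand_pattern(pattern, T, district):
    """Replace %b[T], %Y[T] and %s[district] with concrete values."""
    def substitute(match):
        code, name = match.group(1), match.group(2)
        if name == "T" and code == "b":
            return MONTH_ABBR[T.month - 1]   # month part of the column name
        if name == "T" and code == "Y":
            return f"{T.year:04d}"           # value for the year column
        if name == "district" and code == "s":
            return district                  # value for the district column
        raise ValueError(f"unsupported placeholder: {match.group(0)}")

    return re.sub(r"%([A-Za-z])\[(\w+)\]", substitute, pattern)


query = expand_pattern(PATTERN, date(1993, 1, 1), "AMBOSITRA")
```

For January 1993 the month placeholder selects the column incid_Jan and the year placeholder selects the row with year 1993, so one (time, district) pair maps to exactly one cell of the original table.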

4 Diagnosing Common Data Problems

Excel files created by hand over long periods of time that have never been analyzed are unlikely to have perfectly self-consistent labels for locale, and may not match the shapefile labels, either. But many of these problems can be easily detected and corrected once a preliminary version of the data has been ingested.

4.1 Self Consistency

Sample Initial Data Overview

For example, consider the 1993-2005 Madagascar Highlands incidence. We have named this dataset to make it readily accessible in the Data Library:

(link)
home .ciph .Madagascar .malaria .original

Figure 5 shows a plot that gives an overview of the entire original dataset. It shows incidence as color, where position gives district and time. White indicates missing data, and it is clear that Ambohimasoa and Ambohimahasoa are disjoint: a given time has data for one district or the other. This strongly indicates that they are in fact one and the same place, and that a different name was used at different data-entry times. Consulting with the data producer verified that this was indeed the case.
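The visual check described above can also be approximated programmatically. The sketch below, which is an illustration rather than a Data Library function, flags pairs of districts whose non-missing time points never overlap. The function name `disjoint_pairs` is hypothetical and the coverage values are invented purely to show the idea; they are not taken from the Madagascar dataset.

```python
from itertools import combinations


def disjoint_pairs(coverage):
    """Return pairs of names whose sets of non-missing times do not overlap.

    `coverage` maps a district name to the set of time labels for which
    that district has data. Districts with no data at all are skipped.
    """
    pairs = []
    for a, b in combinations(sorted(coverage), 2):
        if coverage[a] and coverage[b] and coverage[a].isdisjoint(coverage[b]):
            pairs.append((a, b))
    return pairs


# Invented coverage, for illustration only.
coverage = {
    "AMBOHIMASOA": {"1993-01", "1993-02", "1993-03"},
    "AMBOHIMAHASOA": {"1993-04", "1993-05"},
    "AMBOSITRA": {"1993-01", "1993-02", "1993-03", "1993-04", "1993-05"},
}

suspects = disjoint_pairs(coverage)
```

A disjoint pair is only a hint that two labels may refer to the same place; as in the text, the data producer still has to confirm it.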

Sample Revised Data Overview

We give the revised data a new name:

(link)
home .ciph .Madagascar .malaria .cleaned

Figure 6 shows the corresponding plot for the revised dataset. While there is still some missing data, the gaps are smaller and less systematic.

Figure 5: Sample Initial Data Overview (incidence shown as color on a scale from 0 to 40; vertical axis: district; horizontal axis: Time, Jan 1994 to Jan 2006)

4.2 Geographical Consistency

4.2.1 Geolocation

At this point we could analyze the dataset, since it is now self-consistent. However, we will also want to look at incidence as a function of location. To keep all the tedious but necessary bits together, we check the geolocation next. To geolocate the data, we first look through the library for the appropriate information. Currently most such information is kept under SOURCES .Features. A quick scan through the available holdings reveals SOURCES .Features .Political .Madagascar .Districts, which has some promise.

(link)
home .ciph .Madagascar .malaria .cleaned
SOURCES .Features .Political .Madagascar .Districts

4.2.2 use as grid

One of the available variables is District geometry, the_geom for short, which holds the shape of each district. In this dataset the shapes are given as a function of codefiv, a numeric code for each district, i.e. we have the_geom[codefiv]. Another variable, FIV name (nomfiv), would correspond better to our district names; it is also a function of codefiv. We, of course, want to call it district rather than nomfiv. We can use the function use_as_grid to do this:

Figure 6: Sample Revised Data Overview (incidence shown as color on a scale from 0 to 40; vertical axis: district; horizontal axis: Time, Jan 1994 to Jan 2006)

(link)
home .ciph .Madagascar .malaria .cleaned
SOURCES .Features .Political .Madagascar .Districts
nomfiv /district use_as_grid

4.2.3 add variable

As a final step, let us put the_geom into the local dataset using the function add_variable so that we can conveniently reference it:

(link)
home .ciph .Madagascar .malaria .cleaned
SOURCES .Features .Political .Madagascar .Districts
nomfiv /district use_as_grid
.the_geom add_variable

4.2.4 Locale Consistency Check

While nomfiv seemed a reasonable choice, let us check to make sure we are not missing anything. To compare the districts in the incidence data with the districts in the_geom, use the function SAMPLE_MISSING to remove the districts available in the_geom from the districts in the incidence, adding the following to what we had before:

Malaria District Name    Feature District Name
Antananarivo-Nord        ANTANANARIVO-AVARADRANO
Antananarivo-Sud         ANTANANARIVO-ATSIMONDRANO
Antsirabe Rural          ANTSIRABE II
Antsirabe Urban          ANTSIRABE I
Fenoarivobe              FENOARIVO-AFOVOANY
Fianarantsoa Rural       FIANARANTSOA II
Fianarantsoa Urban       FIANARANTSOA I

Table 1: Feature Name Corrections

(link) incid the_geom[district]SAMPLE_MISSING

Ideally, we would have nothing left. Unfortunately, that is not the case; we are left with

(ANTANANARIVO-NORD) (ANTANANARIVO-SUD) (ANTSIRABE RURAL) (ANTSIRABE URBAN) (FENOARIVOBE) (FIANARANTSOA RURAL) (FIANARANTSOA URBAN)

A second consultation with the data producer gave us the set of corresponding names in the Features dataset (Table 1). While we are choosing to keep the Features versions rather than the Malaria versions (the Features version is certain to have only one name per district, for example), in this particular case it is not clear which are better. Certainly having district versions labelled 'I' and 'II' is not particularly informative. We store the corrected version of the dataset as geolocated,

(link) home .ciph .Madagascar .malaria .geolocated

and recheck this version of the dataset by adding

(link)

incid the_geom[district]SAMPLE_MISSING

In this version we have incorporated the district shapes into the dataset definition. Now we have incidence and number of cases as a function of time and locale.
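The consistency check in this section amounts to a set difference followed by a renaming step. The Python sketch below reproduces that logic outside the Data Library; it is not the implementation of SAMPLE_MISSING. The function name `names_without_shape` is hypothetical, the corrections are those of Table 1, and the two district lists are small subsets chosen for illustration (AMBOSITRA and BETAFO stand in for the districts that already matched).

```python
# Corrections from Table 1: Malaria district name -> Feature district name.
CORRECTIONS = {
    "ANTANANARIVO-NORD": "ANTANANARIVO-AVARADRANO",
    "ANTANANARIVO-SUD": "ANTANANARIVO-ATSIMONDRANO",
    "ANTSIRABE RURAL": "ANTSIRABE II",
    "ANTSIRABE URBAN": "ANTSIRABE I",
    "FENOARIVOBE": "FENOARIVO-AFOVOANY",
    "FIANARANTSOA RURAL": "FIANARANTSOA II",
    "FIANARANTSOA URBAN": "FIANARANTSOA I",
}


def names_without_shape(data_names, feature_names):
    """Return the data names that have no matching feature name.

    Matching ignores case, since the two sources capitalize differently.
    """
    known = {name.upper() for name in feature_names}
    return sorted(name for name in data_names if name.upper() not in known)


# Subsets for illustration: two districts that already match, plus the
# seven that do not.
malaria_districts = ["AMBOSITRA", "BETAFO"] + list(CORRECTIONS)
feature_districts = ["AMBOSITRA", "BETAFO"] + list(CORRECTIONS.values())

# First check: the seven uncorrected names are left over.
left_over = names_without_shape(malaria_districts, feature_districts)

# Apply the corrections and check again: nothing should remain.
geolocated = [CORRECTIONS.get(name, name) for name in malaria_districts]
left_over_after = names_without_shape(geolocated, feature_districts)
```

An empty result on the second check corresponds to the ideal outcome described above: every district in the incidence data has a shape.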

5 Summary

Precisely describing the time and locale of the data

• simplifies subsequent analysis
• allows more sophisticated functions to be applied
• allows ready comparison with other datasets
