
Data Introduction

M. Benno Blumenthal and John del Corral

5 June 2008

International Research Institute for Climate and Society

Outline

Introduction

Ingest
    Uploading Excel File

Adding Additional Metadata
    Independent Variables: Time and Locale
    Dependent Variables

Diagnosing Common Data Problems
    Self Consistency
    Geographical Consistency

Summary


Abstract

This is a short description of how to convert numbers in an Excel file and a shapefile into a dataset that can be used for temporal and spatial analysis.

Goals

- Want to analyze fields temporally and spatially

- Starting with Excel tables (possibly never analyzed) and shapefiles

Plan of Action

1. Read Excel tables and shapefiles
2. Augment the information in the files
3. Run a preliminary analysis to find and remove flaws


Figure: Ethiopia Malaria Data

Ingest

- Locate the Excel file or shapefile

- Scan and display the table structure

- Edit/enhance the data descriptions

Upload Excel File

Figure: Screen asking for file to upload

The Browse button locates the file on the local machine. The Upload button transmits the file to the Data Library.

Response to Excel File Upload

Figure: Screen with results of Excel file upload

The Add Metadata button then leads to the next screen, where additional information can be added to the dataset.

Adding Additional Metadata

Figure: Screen to edit metadata

Screen to Edit Metadata

This screen allows additional information to be added. The top of the page accepts a description of the new dataset. Information for each column can be added as well; ensuring that the columns are properly recognized as dates or numeric values is particularly useful.
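Why column recognition matters can be seen in a small Python sketch that guesses a column's type from its cell values. The function name and the ISO date format are assumptions made purely for illustration; the recognition rules actually used by the Data Library are not described here.

```python
from datetime import datetime

def classify(values):
    """Guess whether a column of strings holds numbers, dates, or plain text."""
    def all_parse(parser):
        # True only if every value is accepted by the parser.
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numeric"
    # Assumption: dates are written as YYYY-MM-DD.
    if all_parse(lambda value: datetime.strptime(value, "%Y-%m-%d")):
        return "date"
    return "text"

print(classify(["12", "7.5"]))
print(classify(["1993-01-01", "1993-02-01"]))
print(classify(["BETAFO", "12"]))
```

A single stray label in an otherwise numeric column is enough to make the whole column fall back to text, which is the kind of problem worth catching on this screen.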

Independent Variables: Time and Locale

While ultimately we would like to consider the data as a function of time and locale, most likely the Excel file has information about time and some kind of spatial entity (such as district or state).

Time

In the best case, one of the columns is clearly time, i.e. formatted as a standard Excel date. We then simply designate the column as an independent variable, and the software extracts a sorted list of values as our time axis.
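The axis extraction can be pictured with a short Python sketch. The sample rows and the function name are hypothetical, and this is not the Data Library's own implementation; it only shows that the axis is the sorted set of distinct values in the designated column.

```python
from datetime import date

# Hypothetical rows as they might come out of a spreadsheet: (date, district, cases).
rows = [
    (date(1993, 3, 1), "BETAFO", 12),
    (date(1993, 1, 1), "BETAFO", 7),
    (date(1993, 1, 1), "FANDRIANA", 4),
    (date(1993, 2, 1), "FANDRIANA", 9),
]

def extract_axis(rows, column_index):
    """Return the sorted, de-duplicated values of one column."""
    return sorted({row[column_index] for row in rows})

time_axis = extract_axis(rows, 0)
for t in time_axis:
    print(t.isoformat())
```

The same function applied to column 1 would give a sorted list of district labels.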

Precisely Describing Time

- When speaking about time, we usually imply a start and an end, e.g. January 2008 implies the entire month, whereas 1 January 2008 implies the entire day.

- When analyzing data, we usually need to know the interval corresponding to each data point.

- For the sorts of data we have (days to weeks to months to years), some intervals can be precisely specified in Excel and some cannot. Example: weekly sea surface temperature data.
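One way to make the implied interval explicit is to store a start and an end for each label. The Python sketch below is only an illustration: the function names are hypothetical, and treating the end as exclusive is an assumption, not something the slides specify.

```python
from datetime import date, timedelta

def month_interval(year, month):
    """Half-open interval [start, end) covering a whole calendar month."""
    start = date(year, month, 1)
    # The end is the first day of the following month (rolling over in December).
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

def day_interval(year, month, day):
    """Half-open interval [start, end) covering a single day."""
    start = date(year, month, day)
    return start, start + timedelta(days=1)

def week_interval(start):
    """Half-open interval covering the seven days that begin on `start`."""
    return start, start + timedelta(days=7)

january_2008 = month_interval(2008, 1)
first_of_january = day_interval(2008, 1, 1)
print(january_2008)
print(first_of_january)
```

A weekly value needs both its start date and its length to be recoverable, which a single date cell does not record on its own.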

Specified Time

- Specify the time dependence

- Specify the column interpretation and time connection

Locale

In the best case, one of the columns in the Excel file is clearly an identifier for a spatial entity, i.e. each identifier is unique and corresponds to a value in the matching column of the shapefile. We then simply designate the column as an independent variable, and the software extracts a sorted list of values as our entity axis.

Specified Locale

- Specify the locale id (choice guided by the shapefile)

- Specify the column interpretation and locale connection

Dependent Variables

Dependent variables are the data to be analyzed, and can be most simply provided as columns in the Excel file. In that case the Add Metadata screen lists the columns, and additional descriptive information can be added.

Variables from Multiple Columns

Sometimes a variable is stored as multiple columns in the file (most frequently with a different month in each column), in which case a pattern can be provided that properly combines the columns. The screen provides an opportunity to give such a pattern and to remove the corresponding combined columns from the dataset.

For example, consider the sample dataset 1993-2005 Highlands incidence. On ingest, we find 26 columns: 12 for monthly cases, 12 for monthly incidence, plus year and district. In this case there is a clear column for locale, but no such column for time, so we define the time variable as running from Jan 1993 to Dec 2005 in steps of one month. We then translate the column names to English so that we can easily generate the month names from the time with our current software. The pattern that corresponds to incidence is then

select incid_%b[T] as incid from data_central_highl_incidence WHERE district='%s[district]' AND year=%Y[T]

where T is our time variable and we have used it to generate both the year and the month-part of the column name.
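To see what the pattern expands to, the following Python sketch imitates the substitution for a single time step and district. It is an illustration only, not the Data Library's pattern engine: the function names are hypothetical, and the month abbreviations are written out explicitly on the assumption that %b yields English three-letter names.

```python
from datetime import date

# English month abbreviations, assumed to match what %b produces.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def monthly_axis(first_year, last_year):
    """First day of every month from January of first_year to December of last_year."""
    return [date(y, m, 1)
            for y in range(first_year, last_year + 1)
            for m in range(1, 13)]

def expand_pattern(t, district):
    """Fill in the incidence pattern for one time step and one district.

    Plain string substitution, for illustration only.
    """
    column = "incid_" + MONTH_ABBR[t.month - 1]
    return (
        f"select {column} as incid from data_central_highl_incidence "
        f"WHERE district='{district}' AND year={t.year}"
    )

T = monthly_axis(1993, 2005)
print(len(T))
print(expand_pattern(T[0], "BETAFO"))
```

Each of the 13 x 12 = 156 time steps selects one monthly column from the row for its year, which is how twelve columns per year become a single variable along T.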

Diagnosing Common Data Problems

Excel files created by hand over long periods of time that have never been analyzed are unlikely to have perfectly self-consistent labels for locale, and may not match the shapefile labels, either. But many of these problems can be easily detected and corrected once a preliminary version of the data has been ingested.

Sample Initial Data Overview

For example, consider the 1993-2005 Madagascar Highlands incidence. We have named this dataset to make it readily accessible in the Data Library,

(link)

home .ciph .Madagascar .malaria .original


(link)

Figure: Sample Initial Data Overview (color: incid, scale 0 to 40; vertical axis: district, with both AMBOHIMASOA and AMBOHIMAHASOA among the labels; horizontal axis: Time)

The figure shows incidence as color, with position giving district and time. White indicates missing data, and it is clear that Ambohimasoa and Ambohimahasoa are disjoint: at any given time there is data for one district or the other, never both. This strongly indicates that they are in fact one and the same place, and that a different name was used at different data-entry times. Consulting with the data producer verified that this was the case.
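The same visual test can be expressed as a small Python sketch that looks for pairs of labels whose data never overlap in time. The coverage values below are invented for illustration (years stand in for time steps), and the function name is hypothetical.

```python
from itertools import combinations

# Hypothetical coverage: for each district label, the time steps that have data.
coverage = {
    "AMBOHIMASOA": {1993, 1994, 1995},
    "AMBOHIMAHASOA": {1996, 1997, 1998},
    "BETAFO": {1993, 1994, 1995, 1996, 1997, 1998},
}

def disjoint_pairs(coverage):
    """Return label pairs that both have data but never at the same time step."""
    return [
        (a, b)
        for a, b in combinations(sorted(coverage), 2)
        if coverage[a] and coverage[b] and not (coverage[a] & coverage[b])
    ]

candidates = disjoint_pairs(coverage)
print(candidates)
```

A disjoint pair is only a candidate for a duplicated name; as in the case above, the data producer still has to confirm it.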

Sample Revised Data Overview

We give the revised data a new name

(link)

home .ciph .Madagascar .malaria .cleaned

While there is still some missing data, the gaps are smaller and less systematic.


Figure: Sample Revised Data Overview (color: incid, scale 0 to 40; vertical axis: district; horizontal axis: Time)

Geolocation

At this point we could analyze the dataset, since it is now self-consistent. However, we will also want to look at incidence as a function of location. To keep all the tedious but necessary bits together, we check the geolocation next. To geolocate the data, we first look through the library for the appropriate information. Currently most such information is kept under SOURCES .Features. A quick scan through the available holdings reveals SOURCES .Features .Political .Madagascar .Districts, which has some promise.

(link)

home .ciph .Madagascar .malaria .cleaned SOURCES .Features .Political .Madagascar .Districts

use_as_grid

One of the available variables is District geometry, the_geom for short, which holds the shape of each district. In this dataset the shapes are given as a function of codefiv, a numeric code for each district, i.e. we have the_geom[codefiv]. Another variable, FIV name (nomfiv), corresponds better to our district names; it is also a function of codefiv. We, of course, want to call it district rather than nomfiv. We can use the function use_as_grid to do this:

(link)

home .ciph .Madagascar .malaria .cleaned SOURCES .Features .Political .Madagascar .Districts nomfiv /district use_as_grid

add_variable

As a final step, let's put the_geom into the local dataset using the function add_variable, so that we can conveniently reference it:

(link)

home .ciph .Madagascar .malaria .cleaned SOURCES .Features .Political .Madagascar .Districts nomfiv /district use_as_grid .the_geom add_variable

Locale Consistency Check

While nomfiv seemed a reasonable choice, let us check to make sure we are not missing anything. To compare the districts of the incidence with the districts of the_geom, use the function SAMPLE_MISSING to remove the districts available in the_geom from the districts in the incidence, adding the following to what we had before:

(link)

incid the_geom[district]SAMPLE_MISSING


Ideally, we would have nothing left. Unfortunately, that is not the case; we are left with

(ANTANANARIVO-NORD)
(ANTANANARIVO-SUD)
(ANTSIRABE RURAL)
(ANTSIRABE URBAN)
(FENOARIVOBE)
(FIANARANTSOA RURAL)
(FIANARANTSOA URBAN)
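The idea behind this check can be imitated with a plain set difference. The Python sketch below is only an analogy and says nothing about how SAMPLE_MISSING is implemented; both label lists are shortened samples and the function name is hypothetical.

```python
# Shortened, illustrative label lists (not the full datasets).
incidence_districts = ["AMBALAVAO", "ANTSIRABE RURAL", "ANTSIRABE URBAN",
                       "BETAFO", "FENOARIVOBE"]
feature_districts = ["AMBALAVAO", "ANTSIRABE I", "ANTSIRABE II",
                     "BETAFO", "FENOARIVO-AFOVOANY"]

def unmatched(data_labels, feature_labels):
    """Labels used by the data that have no counterpart among the feature labels."""
    known = set(feature_labels)
    return sorted(label for label in set(data_labels) if label not in known)

leftover = unmatched(incidence_districts, feature_districts)
print(leftover)
```

An empty result would mean every district in the data can be given a shape.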

A second consultation with the data producer gave us the set of corresponding names in the Features dataset:

Malaria District Name    Feature District Name
Antananarivo-Nord        ANTANANARIVO-AVARADRANO
Antananarivo-Sud         ANTANANARIVO-ATSIMONDRANO
Antsirabe Rural          ANTSIRABE II
Antsirabe Urban          ANTSIRABE I
Fenoarivobe              FENOARIVO-AFOVOANY
Fianarantsoa Rural       FIANARANTSOA II
Fianarantsoa Urban       FIANARANTSOA I

Table: Feature Name Corrections
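Applying the corrections amounts to a simple lookup. The Python sketch below uses the pairs from the table; the function name and the case and whitespace normalization are assumptions added for illustration, not a description of how the dataset was actually revised.

```python
# Corrections from the table above: malaria label -> feature label.
CORRECTIONS = {
    "ANTANANARIVO-NORD": "ANTANANARIVO-AVARADRANO",
    "ANTANANARIVO-SUD": "ANTANANARIVO-ATSIMONDRANO",
    "ANTSIRABE RURAL": "ANTSIRABE II",
    "ANTSIRABE URBAN": "ANTSIRABE I",
    "FENOARIVOBE": "FENOARIVO-AFOVOANY",
    "FIANARANTSOA RURAL": "FIANARANTSOA II",
    "FIANARANTSOA URBAN": "FIANARANTSOA I",
}

def to_feature_name(label):
    """Map a malaria district label to its feature label.

    Labels without a correction pass through unchanged (after normalization).
    """
    # Assumption: case and repeated whitespace are not significant.
    key = " ".join(label.upper().split())
    return CORRECTIONS.get(key, key)

print(to_feature_name("Antsirabe Rural"))
print(to_feature_name("BETAFO"))
```

Keeping the corrections in one table makes it easy to rerun the consistency check after each change.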


While we are choosing to keep the Features versions rather than the Malaria versions (the Features version is certain to have only one name per district, for example), in this particular case it is not clear which are better. Certainly, having district versions labelled 'I' and 'II' is not particularly informative.


We store the corrected version of the dataset as geolocated,

(link)

home .ciph .Madagascar .malaria .geolocated

and recheck this version of the dataset by adding

(link)

incid the_geom[district]SAMPLE_MISSING

In this version we have incorporated the district shapes into the dataset definition. Now we have incidence and number of cases as a function of time and locale.

Summary

Precisely describing the time and locale of the data

- simplifies subsequent analysis

- allows more sophisticated functions to be applied

- allows ready comparison with other datasets
