Data Introduction

M. Benno Blumenthal and John del Corral
5 June 2008

Contents
1 Introduction
2 Ingest
2.1 Uploading Excel File
3 Adding Additional Metadata
3.1 Independent Variables: Time and Locale
3.1.1 Time
3.1.2 Locale
3.2 Dependent Variables
4 Diagnosing Common Data Problems
4.1 Self Consistency
4.2 Geographical Consistency
4.2.1 Geolocation
4.2.2 use as grid
4.2.3 add variable
4.2.4 Locale Consistency Check
5 Summary

Abstract
This is a short description of how to convert numbers in an Excel file and a shapefile into a dataset that can be used for temporal and spatial analysis.

1 Introduction
We are interested in analyzing fields temporally and spatially. We are starting with Excel tables of data (possibly never analyzed) and shapefiles. First, we have the basic technical problem of reading the Excel table and the shapefiles into our analysis program. Second, we need to add structure and descriptions to augment the information contained in the files. Finally, we need to do a preliminary analysis of the data to find and remove flaws that prevent it from being accurate information about the fields and their spatial and temporal structure.

Figure 1: Ethiopia Malaria Data
Figure 2: Screen asking for file to upload

2 Ingest
• locate Excel file or shapefile
• scan and display table structure
• edit/enhance data descriptions

Given the location of the Excel file, the software scans the file and presents the structure of the table so that edits can be made and additional information can be provided. Edits are mostly necessary if the column headings are not simple, in which case simple names should be introduced for each column. Additional information can be added at this point, including units and extended descriptions. Similarly, given the location of the shapefiles, the software scans the files and presents the basic information.
2.1 Uploading Excel File

Upload Excel File
Figure 2 shows the screen asking for a file to upload. The Browse button lets one locate the file on one’s local machine. The Upload button transmits the file to the data library.

Response to Excel File upload
Figure 3 shows the response to an Excel file upload: the table structure that the software has extracted from the Excel file.

Figure 3: Screen with results of excel file upload
Figure 4: Screen to edit metadata

The Add Metadata button then lets one proceed to the next screen, which allows adding additional information to the dataset.

3 Adding Additional Metadata
Figure 4 shows the edit metadata screen. This screen allows adding additional information. The top of the page allows adding a description of the new dataset. Information for each column can be added as well; ensuring that the columns are properly recognized as dates or numeric values is particularly useful.

3.1 Independent Variables: Time and Locale
While ultimately we would like to consider the data as a function of time and locale, most likely the Excel file has information about time and some kind of spatial entity (such as district or state).

3.1.1 Time
The best case scenario is that one of the columns is clearly time, i.e. formatted as an Excel-standard date. We then simply designate the column as an independent variable, and the software extracts a sorted list of values as our time axis.

Precisely describing time
• When speaking about time, we usually imply a start and an end: January 2008 implies the entire month, whereas 1 January 2008 implies the entire day.
• When analyzing data, we usually need to know the interval corresponding to each data point.
• For the sorts of data we have (days to weeks to months to years), some intervals can be precisely specified in Excel and some cannot. Example: weekly sea surface temperature data.
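The point about intervals can be made concrete with a short Python sketch. It is only an illustration, not the Data Library's implementation: the function names are hypothetical, intervals are represented as half-open [start, end) pairs, and the weekly case assumes each value is labelled by the first day of its week.

```python
from datetime import date, timedelta
from typing import Tuple

Interval = Tuple[date, date]  # half-open: start included, end excluded


def month_interval(year: int, month: int) -> Interval:
    """Interval implied by a label such as 'January 2008'."""
    start = date(year, month, 1)
    # The interval ends at the first day of the next month (December rolls over).
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def day_interval(day: date) -> Interval:
    """Interval implied by a label such as '1 January 2008'."""
    return day, day + timedelta(days=1)


def week_interval(week_start: date) -> Interval:
    """Interval for a weekly value, ASSUMING the label is the week's first day."""
    return week_start, week_start + timedelta(days=7)
```

A single Excel date cell can hold the start of such an interval but not its length, which is why the interval has to be described separately for weekly or monthly data.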
Specified Time
• specify the time dependence
• specify column interpretation and time connection

The more likely scenario is that one or more of the columns corresponds to time, but the computer needs to be instructed on how to interpret the columns and make the connection to time. In this case we describe the time dependence of the data (start, step, end) and thus construct the independent variable for time. Later we can provide the pattern that connects the time to the appropriate columns so that we can select the appropriate date. This can be necessary even when one of the columns is cleanly time. For example, if weekly data is marked by the start of the week, we still need to clarify the time information to properly interpret the data.

3.1.2 Locale
The best case scenario is that one of the columns in the Excel file is clearly an identifier for a spatial entity, i.e. each identifier is unique and corresponds to a value in the corresponding column in the shapefile. We then simply designate the column as an independent variable, and the software extracts a sorted list of values as our entity axis.

Specified Locale
• specify the locale id (choice guided by the shapefile)
• specify column interpretation and locale connection

The more likely scenario is that the entries in the column are not completely self-consistent, nor are they consistent with the shapefile. It is also possible that multiple columns (state and district name, for example) are required to identify the spatial entity. In this case, it is more likely that the shapefile is properly constructed and labelled, and its columns can be used as the spatial dimension, while later a pattern is provided that connects the spatial dimension to the appropriate columns of the Excel file to select the proper locale. This will not alleviate the need for editing for consistency.
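The two independent variables described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software's actual code: `monthly_axis` and `locale_axis` are hypothetical names, the time step is assumed to be a whole number of months, and stripping surrounding blanks from identifiers is an added assumption.

```python
from datetime import date
from typing import Iterable, List


def monthly_axis(start: date, end: date, step_months: int = 1) -> List[date]:
    """Time axis from (start, step, end): first days of months, end inclusive."""
    axis = []
    # Count months since year 0 so stepping across year boundaries is simple.
    index = start.year * 12 + (start.month - 1)
    last = end.year * 12 + (end.month - 1)
    while index <= last:
        axis.append(date(index // 12, index % 12 + 1, 1))
        index += step_months
    return axis


def locale_axis(identifiers: Iterable[str]) -> List[str]:
    """Entity axis: sorted list of distinct, non-empty identifiers."""
    return sorted({name.strip() for name in identifiers if name and name.strip()})


# Jan 1993 to Dec 2005 in steps of one month gives 13 * 12 = 156 points.
time_axis = monthly_axis(date(1993, 1, 1), date(2005, 12, 1))
print(len(time_axis), time_axis[0], time_axis[-1])
```

Building the locale axis from the shapefile's column rather than the Excel column would use the same function; only the source of the identifiers changes.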
3.2 Dependent Variables
Dependent variables are the data to be analyzed, and can be most simply provided as columns in the Excel file. In that case the Add Metadata screen lists the columns, and additional descriptive information can be added.

Variables from multiple columns
Sometimes variables are stored as multiple columns in the file (most frequently a different month in each column), in which case a pattern can be provided which properly combines the columns. The screen provides an opportunity to give such a pattern and to remove the corresponding combined columns from the dataset.

For example, consider the sample dataset 1993-2005 Madagascar Highlands incidence. On ingest, we find 26 columns: 12 for monthly cases, 12 for monthly incidence, year, and district. So in this case there is a clear column for locale, but no such column for time. We therefore define the time variable as going from Jan 1993 to Dec 2005 in steps of one month. We then translate the column names to English so that we can easily generate the month names from the time with our current software. The pattern that corresponds to incidence is then

    select incid_%b[T] as incid from data_central_highl_incidence WHERE district='%s[district]' AND year=%Y[T]

where T is our time variable, which we have used to generate both the year and the month part of the column name.

4 Diagnosing Common Data Problems
Excel files created by hand over long periods of time that have never been analyzed are unlikely to have perfectly self-consistent labels for locale, and may not match the shapefile labels, either. But many of these problems can be easily detected and corrected once a preliminary version of the data has been ingested.

4.1 Self Consistency

Sample Initial Data Overview
For example, consider the 1993-2005 Madagascar Highlands incidence.
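For this dataset, it may help to see how the incidence pattern from Section 3.2 expands for one time step and one district. The Python sketch below is an illustration only and does not reproduce the Data Library's own pattern engine: the `expand` function, the token regular expression, and the capitalised English month abbreviations (`Jan`, `Feb`, ...) in the column names are all assumptions.

```python
import re
from datetime import date

# English month abbreviations, hard-coded so the result does not depend on
# the system locale (ASSUMED to match the translated column names).
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

PATTERN = ("select incid_%b[T] as incid from data_central_highl_incidence "
           "WHERE district='%s[district]' AND year=%Y[T]")

# A token is a format code (%b, %Y or %s) followed by a variable in brackets.
TOKEN = re.compile(r"%([bYs])\[(\w+)\]")


def expand(pattern: str, values: dict) -> str:
    """Replace each %code[name] token using the value bound to name."""
    def substitute(match):
        code, name = match.group(1), match.group(2)
        value = values[name]
        if code == "b":          # abbreviated month name of a date
            return MONTH_ABBR[value.month - 1]
        if code == "Y":          # four-digit year of a date
            return str(value.year)
        return str(value)        # %s: plain string

    return TOKEN.sub(substitute, pattern)


query = expand(PATTERN, {"T": date(1993, 3, 1), "district": "BETAFO"})
print(query)
# select incid_Mar as incid from data_central_highl_incidence WHERE district='BETAFO' AND year=1993
```

Looping this expansion over every point of the time axis and every district yields one value per (time, district) pair, which is how twelve monthly columns collapse into a single incidence variable.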
We have named this dataset to make it readily accessible in the Data Library: (link) home .ciph .Madagascar .malaria .original

Figure 5 shows a plot that gives an overview of the entire original dataset. It shows incidence as color, with position giving district and time. White indicates missing data, and it is clear that Ambohimasoa and Ambohimahasoa are disjoint: at any given time there is data for one district or the other. This strongly indicates that they are in fact one and the same place, and that a different name was used at different data-entry times. Consulting with the data producer verified that this was in fact the case.

Sample Revised Data Overview
We give the revised data a new name: (link) home .ciph .Madagascar .malaria .cleaned

Figure 6 shows the corresponding plot for the revised dataset. While there is still some missing data, the gaps are smaller and less systematic.

Figure 5: Sample Initial Data Overview (incidence shown as color; axes: district and Time)

4.2 Geographical Consistency

4.2.1 Geolocation
At this point we could analyze the dataset, since it is now self-consistent. On the other hand, we will want to look at incidence as a function of location. To keep all the tedious but necessary bits together, we will check the geolocation next. To geo-locate the data, we first look through the library for the appropriate information. Currently most such information is kept under SOURCES .Features.
A quick scan through the available holdings reveals SOURCES .Features .Political .Madagascar .Districts, which has some promise.
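The two kinds of check discussed in this section, self consistency and consistency with the shapefile labels, can also be sketched outside the Data Library. The Python below is a rough illustration, not the procedure used by the authors (who inspected a plot): the function names, the similarity threshold of 0.85, and the toy coverage values in the usage lines are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def disjoint_lookalikes(coverage: Dict[str, Set[str]],
                        threshold: float = 0.85) -> List[Tuple[str, str]]:
    """Pairs of similar names whose data never overlap in time.

    coverage maps each district name to the set of time labels with data.
    Such pairs are candidates for being one place entered under two spellings.
    """
    pairs = []
    for a, b in combinations(sorted(coverage), 2):
        similar = SequenceMatcher(None, a, b).ratio() >= threshold
        if similar and coverage[a].isdisjoint(coverage[b]):
            pairs.append((a, b))
    return pairs


def unmatched(data_names: Iterable[str],
              shapefile_names: Iterable[str]) -> List[str]:
    """Names used in the data that have no counterpart in the shapefile."""
    return sorted(set(data_names) - set(shapefile_names))


# Toy values for illustration only; they are NOT taken from the real dataset.
toy_coverage = {
    "AMBOHIMASOA": {"1993-01", "1993-02"},
    "AMBOHIMAHASOA": {"1993-03"},
    "BETAFO": {"1993-01", "1993-03"},
}
print(disjoint_lookalikes(toy_coverage))
print(unmatched(toy_coverage, ["AMBOHIMAHASOA", "BETAFO"]))
```

Any pair or name reported this way is only a candidate; as in the Ambohimahasoa case, the correction should be confirmed with the data producer before the dataset is edited.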
