DATA CLEANING WITH OPEN REFINE

Dr. Sainge N. Moses

Date: February 26th to March 1st

The BID Program is funded by the European Union

OpenRefine, previously called Google Refine, is a useful tool for cleaning messy data. This short tutorial focuses on four main aspects:
● File loading and project creation
● Using text facets
● Clustering text facets
● Exporting

1. FILE LOADING AND PROJECT CREATION

Data can be loaded from various sources: TSV, CSV, other separator-delimited (*SV) files, Excel (.xls and .xlsx), JSON, XML and RDF. File loading is closely followed by project creation. Proceed as follows:
● Open OpenRefine (formerly Google Refine).
● Click on Create Project.
● Click on Choose.
● Select the file.
● Click on Next.

● A parsing options menu will appear. Be sure to leave the options as shown in the picture.

● On the top right you can rename your project, then click Create Project and you will be ready to work!

2. USING TEXT FACETS

Faceting gives a big-picture overview of the data and lets us filter down to just the subset of rows that we want to view or change in bulk. It facilitates the use and analysis of data and can be applied to cells containing text, numbers or dates. This presentation deals with text facets only.

Faceting and mass editing

● Go to the column COUNTRY, click on the column menu and follow the route to Text facet (Facet > Text facet), as shown below.
● On the left, a window with the name of the column will appear; that is the facet.
● Click on count to sort by count, then click on name to sort alphabetically.

● Fix spelling mistakes by placing the cursor over the value in the facet window and clicking on edit; then correct the text in the text box and click on apply to save.
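For readers who want to see what a text facet and a mass edit amount to outside the tool, here is a minimal Python sketch. It is not how OpenRefine is implemented; the sample list and the `mass_edit` helper are assumptions made up for illustration, using spellings mentioned in this tutorial.

```python
from collections import Counter

# Made-up sample standing in for the COUNTRY column (an assumption).
country = [
    "Ivory Coast", "Cote divoire", "Ivory Coast",
    "Cote Coast", "Ivory Coast", "Cote divoire",
]

# A text facet is essentially a count of each distinct value in the column.
facet = Counter(country)

# "Sort by count": most frequent first, ties broken alphabetically.
by_count = sorted(facet.items(), key=lambda item: (-item[1], item[0]))

# "Sort by name": alphabetical order of the values.
by_name = sorted(facet.items())


def mass_edit(values, old, new):
    """Hypothetical helper: replace every cell equal to `old` with `new`,
    which is what 'edit' followed by 'apply' does for one facet value."""
    return [new if value == old else value for value in values]


cleaned = mass_edit(country, "Cote divoire", "Ivory Coast")
cleaned_facet = Counter(cleaned)

print(by_count)
print(by_name)
print(cleaned_facet)
```

Sorting by count puts the dominant spelling at the top and the rare variants at the bottom, which is usually where the spelling mistakes are found.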

3. CLUSTERING TEXT FACETS

The cluster feature helps you find groups of different cell values that might be alternative representations of the same thing. For example, in the column COUNTRY, the values Cote divoire, Cote Coast, Cote d;voire, cote d'Ivoire, Cote D'ivoire, Cote d'voire, Cote D'Voorie, etc. all refer to the same country, whose correct spelling in English is “Ivory Coast”.

This is where the clustering function comes in handy. To use it, proceed as follows:
● Go to COUNTRY, then in the column menu click Text facet.
● Click on Cluster.
● Select the groups you want to merge by ticking Merge?
● Go to New cell value and give the correct name for the repeating records.
● Click on Merge Selected & Close.
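The idea behind clustering can be sketched in a few lines of Python. The sketch below is a simplified take on key-collision clustering with a fingerprint-style key; it is an approximation written for this tutorial, not OpenRefine's actual code, and the function names `fingerprint`, `cluster` and `merge` are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified key: trim, lowercase, drop accents and punctuation,
    then sort the unique whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values that share a key; keep groups of two or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def merge(values, selected, new_value):
    """Replace every value in the selected clusters with the new cell value."""
    selected = set(selected)
    return [new_value if value in selected else value for value in values]


# Spellings taken from the COUNTRY example in this tutorial.
country = [
    "Cote divoire", "cote d'Ivoire", "Cote D'ivoire",
    "Cote d;voire", "Cote d'voire", "Cote Coast",
]

clusters = cluster(country)
merged = merge(country, clusters[0] + clusters[1], "Ivory Coast")

print(clusters)
print(merged)
```

With this simplified key, "Cote Coast" shares a key with nothing else and stays unclustered. Values like that still need a manual edit in the facet or a different clustering method.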

4. EXPORTING

Once you are done editing your file, you can export your project as follows:
● Click on Export and select Custom tabular exporter…
● In the Content tab, choose the columns that you want to export. If you select Ignore facets and filters and export all rows, all facets and filters will be ignored; this is useful if you forget to clear them before exporting.
● Go to the Download tab and select the separator that you prefer. Don’t modify the other options unless you need to.
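The two choices made in the exporter, which columns to keep and which separator to use, can be illustrated with Python's standard csv module. This is only a sketch of the resulting file, not of OpenRefine itself; the `export_rows` helper, the sample rows and the column names ID and NOTES are assumptions made up for the example.

```python
import csv
import io


def export_rows(rows, columns, delimiter=","):
    """Hypothetical helper: write only the chosen columns, using the chosen separator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter=delimiter,
        extrasaction="ignore",  # silently skip columns that were not selected
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up cleaned rows (an assumption for illustration).
rows = [
    {"ID": "1", "COUNTRY": "Ivory Coast", "NOTES": "checked"},
    {"ID": "2", "COUNTRY": "Ivory Coast", "NOTES": "merged"},
]

# Export two of the three columns as tab-separated text.
tsv_text = export_rows(rows, ["ID", "COUNTRY"], delimiter="\t")
print(tsv_text)
```

Choosing a tab as the separator avoids clashes when cell values themselves contain commas.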

QUESTIONS!!!

USEFUL LINKS AND REFERENCES
● Documentation: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users
● Resources list for OpenRefine: https://github.com/OpenRefine/OpenRefine/wiki/External-Resource
● Name validation tutorial: https://docs.google.com/document/d/1tkDRXlYhmassYAk5T4v5oac5prF0jAiSMr_JEGTvhRo/edit
● Higher taxonomy tutorial: https://docs.google.com/document/d/1XZ_pM9gIdQzHzl8wfUCVea-52yub5T_3tc-snBgPRa0/edit