DATA CLEANING WITH OPEN REFINE
Dr. Sainge N. Moses
Date: February 26th to March 1st
The BID Program is funded by the European Union
OpenRefine, previously called Google Refine, is a useful tool for cleaning messy data. This short tutorial focuses on four main aspects:
● File loading and project creation
● Using text facets
● Clustering text facets
● Exporting
1. FILE LOADING AND PROJECT CREATION
Data can be loaded from various sources: TSV, CSV, other separator-delimited values (*SV), Excel (.xls and .xlsx), JSON, XML, RDF as XML, and data from Google Docs. File loading is immediately followed by project creation, using the following steps:
● Open OpenRefine (formerly Google Refine).
● Click on Create Project.
● Click on Choose Files.
● Select the file.
● Click on Next.
A parsing options menu will appear. Be sure to leave the options as shown in the picture:
On the top right you can rename your project, then click Create Project and you will be ready to work!
2. USING TEXT FACETS
Faceting gives a big-picture overview of the data and lets you filter down to just the subset of rows that you want to view or change in bulk. It makes the data easier to use and analyse, and it works on cells containing text, numbers or dates. This tutorial deals with text facets only.
Faceting and mass editing
● Go to the column COUNTRY, open the column menu, and follow the route to Text facet as shown below:
● A window with the name of the column will appear on the left; that is the facet:
● Click on count to sort by count, then click on name to sort alphabetically.
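
For intuition, a text facet can be thought of as a count of how many rows hold each distinct value. The sketch below is not OpenRefine's implementation; it is a minimal Python illustration, and the sample values and the helper name `text_facet` are hypothetical.

```python
from collections import Counter

# Hypothetical sample of a COUNTRY column (not taken from the tutorial's dataset).
country_column = ["Cameroon", "Benin", "Cameroon", "Camerooon", "Benin", "Cameroon"]

def text_facet(values):
    """Count how many rows hold each distinct value, as a text facet displays."""
    return Counter(values)

facet = text_facet(country_column)

# "Sort by count": most frequent value first, ties broken alphabetically.
by_count = sorted(facet.items(), key=lambda item: (-item[1], item[0]))

# "Sort by name": alphabetical order of the values.
by_name = sorted(facet.items(), key=lambda item: item[0])

for value, count in by_count:
    print(f"{value}\t{count}")
```

Values with a very low count, such as the misspelled one here, are often the ones worth inspecting first.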
Fix spelling mistakes by placing the cursor over the value in the facet window and clicking on edit; correct the text in the text box, then click on Apply to save. The correction is applied to every row that holds that value.
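
Editing a value in the facet amounts to replacing every exact occurrence of that value in the column. A minimal Python sketch of the idea, with a hypothetical helper `mass_edit` and made-up sample data:

```python
def mass_edit(values, old, new):
    """Return a copy of the column with every exact match of `old` replaced by `new`."""
    return [new if value == old else value for value in values]

# Hypothetical sample column containing one repeated misspelling.
country_column = ["Cameroon", "Camerooon", "Benin", "Camerooon"]

fixed = mass_edit(country_column, "Camerooon", "Cameroon")
print(fixed)
```

Only exact matches are replaced, so values that differ by case or spacing would need their own edit or the clustering feature.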
3. CLUSTERING TEXT FACETS
The cluster feature helps you find groups of different cell values that might be alternative representations of the same thing. For example, in the column COUNTRY, the values Cote divoire, Cote Coast, Cote d;voire, cote d'Ivoire, Cote D'ivoire, Cote d'voire, Cote D'Voorie, etc. all refer to the same country, whose correct English name is “Ivory Coast”.
This is where the clustering function comes in handy. To use it, proceed as follows:
● Go to COUNTRY, then in the column menu click Text facet.
● Click on Cluster.
● Select the groups you want to merge by ticking Merge?
● Go to New cell value and enter the correct name for the repeating records.
● Click on Merge Selected & Close.
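
To give a feel for what happens when you click Cluster, the sketch below groups values by a normalized key, in the spirit of OpenRefine's key collision method with a fingerprint key. It is a simplified approximation written for this tutorial, not OpenRefine's actual code; the function names `fingerprint`, `find_clusters` and `merge_selected` are hypothetical, and the sample values are the spelling variants mentioned above.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Simplified key: trim, lowercase, drop accents and punctuation, sort unique tokens."""
    text = value.strip().lower()
    # Decompose accented letters and discard the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation such as apostrophes and semicolons.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def find_clusters(values):
    """Group distinct values sharing a fingerprint; keep groups with more than one member."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(members) for members in groups.values() if len(members) > 1]

def merge_selected(values, members, new_value):
    """Replace every value that belongs to the selected clusters with the new cell value."""
    members = set(members)
    return [new_value if value in members else value for value in values]

values = ["Cote divoire", "cote d'Ivoire", "Cote D'ivoire",
          "Cote d;voire", "Cote d'voire", "Cote Coast"]

clusters = find_clusters(values)
for group in clusters:
    print(group)

# Merge every clustered value into the chosen spelling.
clustered = {value for group in clusters for value in group}
merged = merge_selected(values, clustered, "Ivory Coast")
print(merged)
```

In this sketch "Cote Coast" is not caught, because its key differs from all the others. Values like that can be picked up by trying another clustering method in the Cluster dialog, such as nearest neighbor, or by editing them by hand in the facet.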
4. EXPORTING
Once you’re done editing your file, you can export your project as follows:
● Click on Export and select Custom tabular exporter…
● In the Content tab, choose the columns that you want to export. If you select “Ignore facets and filters and export all rows”, all facets and filters will be ignored; this is useful if you forget to clear them before exporting.
● Go to the Download tab and select the separator that you prefer. Don’t modify the other options unless you need to.
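
The two choices made in the exporter, which columns to keep and which separator to use, can be illustrated with a short Python sketch. This is only an analogy built on Python's standard csv module, not a description of how OpenRefine writes files; the helper `export_rows`, the column names and the rows are hypothetical.

```python
import csv
import io

def export_rows(rows, columns, separator=","):
    """Write only the chosen columns, using the chosen separator, and return the text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter=separator,
        extrasaction="ignore",  # silently skip columns that were not selected
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical cleaned rows.
rows = [
    {"ID": "1", "COUNTRY": "Cameroon", "NOTES": "checked"},
    {"ID": "2", "COUNTRY": "Benin", "NOTES": ""},
]

# Export two of the three columns as tab-separated values.
tsv_text = export_rows(rows, ["ID", "COUNTRY"], separator="\t")
print(tsv_text)
```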
QUESTIONS!!!
USEFUL LINKS AND REFERENCES
● Documentation: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users
● Resources list for OpenRefine: https://github.com/OpenRefine/OpenRefine/wiki/External-Resource
● Name validation tutorial: https://docs.google.com/document/d/1tkDRXlYhmassYAk5T4v5oac5prF0jAiSMr_JEGTvhRo/edit
● Higher taxonomy tutorial: https://docs.google.com/document/d/1XZ_pM9gIdQzHzl8wfUCVea-52yub5T_3tc-snBgPRa0/edit