Introduction to OpenRefine

2018 Bibliometrics and Research Assessment Symposium
Candace Norton, MLS
© Library Carpentry: "OpenRefine Lessons for Librarians."

Housekeeping
• This session will be recorded and distributed after editing and captioning.
• Slides will be available following the Symposium.
• Questions are welcome throughout the presentation; please use the microphone.
• Due to the setup of the breakout room, this is a lecture-style course and will not be hands-on.

Course Objectives
• Explain what OpenRefine does and why to use it.
• Recognize when using OpenRefine is appropriate.
• Understand how to import, edit, transform, and export data using OpenRefine.

Introduction to OpenRefine
• OpenRefine is a powerful tool for working with messy data
• OpenRefine can clean data, transform data from one format to another, and extend data via web services and external data
• OpenRefine is available for download online at http://openrefine.org/download.html

Why use OpenRefine?
• Get an overview of a data set
• Resolve inconsistencies in a data set
  – standardizing date formatting
• Split data up into more granular parts
  – splitting up cells with multiple authors into separate cells
• Match local data up to other data sets
  – matching local subjects against the Library of Congress Subject Headings
• Enhance a data set with data from other sources

Common Scenarios: Dates
Data as entered     Desired data
1st January 2014    2014-01-01
01/01/2014          2014-01-01
Jan 1 2014          2014-01-01
2014-01-01          2014-01-01

Common Scenarios: Names
Data as entered     Desired data
London              London
London]             London
London,]            London
london              London

Common Scenarios: Combined Data
Address in single field:
1. University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom
2. University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom

Split into separate fields:
   Institution             Library name                    Address 1        Town/City    Country         Postcode
1. University of Wales     Llyfrgell Thomas Parry Library  Llanbadarn Fawr  Aberystwyth  United Kingdom  SY23 3AS
2. University of Aberdeen  Queen Mother Library            Meston Walk      Aberdeen     United Kingdom  AB24 3UE

Importing Data into OpenRefine
• Create a project by importing data:
  – TSV
  – Excel (.xls and .xlsx)
  – CSV
  – XML
  – *SV
  – RDF as XML
  – JSON
  – Google Data documents
• Support for other formats can be added with OpenRefine extensions
• Typically a publication dataset from a bibliographic database in CSV format

Create a Project
• Click “Create Project” in the left menu bar
• Select “Get data from this computer”
• Browse for the file
• Click “Next >>” to select import parameters
• Preview the import
• Name the project using the text field in the upper right corner
• Click the “Create Project” button

Layout of OpenRefine
• Displays data in a tabular format
• Each row will usually represent a ‘record’ in the data
• Each column represents a type of information or variable
• Only displays a limited number of rows of data at one time
• Most options to work with data are accessed from drop-down menus at the top of the data columns

Rows and Records
• OpenRefine has two modes of viewing data:
  – Rows
  – Records
• Default view is Rows mode
  – Each row represents a single record in the data set
• Records mode
  – OpenRefine can link together multiple rows as belonging to the same Record

Working with Columns
• Re-order columns by clicking the drop-down menu at the top of the first column, labelled ‘All’
• Choose Edit columns -> Re-order / remove columns
• Drag and drop column names to re-order the columns, or remove columns completely if they are not needed

Sorting Data
• Sort data by clicking on the drop-down menu for the relevant column and choosing ‘Sort’
• Once the data is sorted, a new ‘Sort’ drop-down menu will display
• Sorts performed in OpenRefine are temporary
• Sort on multiple columns at the same time by adding another sorted column

Splitting Cells
To split multi-valued cells (like author names or addresses) into their own
cells, use the Split multi-valued cells function:
• Click the drop-down menu at the top of the Author column
• Choose Edit cells -> Split multi-valued cells
• In the prompt, type the ( | ) symbol and click OK
• Note that the rows are still numbered sequentially
• Click the Records option to change to Records mode
• Note how the numbering has changed, indicating that several rows are related to the same record

Joining Cells
• Click the drop-down menu at the top of the Author column
• Choose Edit cells -> Join multi-valued cells
• In the prompt, type the ( | ) symbol
  – This specifies the delimiter character for OpenRefine to use to join the values together
• Click OK to join the Authors cells back together

Joining Cells, continued
• A common workflow with multi-valued cells is
  – split multi-valued cells into individual cells
  – modify/refine/clean individual cells
  – join multi-valued cells back together
• After joining cells together, the Rows and Records counts will be the same, since no column contains split cells any more
• Click both the Rows and Records options and observe that the numbers are equal

Clustering
• The Cluster function groups together similar but inconsistent values in a given column and permits merging these inconsistent values into a single chosen value
• This is very effective where there is data with minor variations in data values, e.g.
names of people, organizations, places, classification terms

Clustering, continued
• ‘Clusters’ are created automatically according to an algorithm
  – Link for more information on Clustering Algorithms
• For each cluster, there is the option of ‘merging’ the values together
• To use the ‘Cluster’ function, click on the Edit cells menu option of the relevant column and choose Cluster and edit

Clustering to clean author data
• Split out the author names into individual cells using Edit cells -> Split multi-valued cells, using the pipe ( | ) character as the separator
• Choose Edit cells -> Cluster and edit from the ‘author’ column
• Using the key collision method with the fingerprint keying function, work through the clusters of values, merging them to a single value where appropriate
• Try changing the clustering method being used

Transformations
• Transformations are ways of manipulating data in columns when basic sorting and faceting are not enough
• Allow users to programmatically edit data
• Normally written in a special language called GREL (General Refine Expression Language)
  – Similar to Excel functions but focused on text manipulation rather than numeric functions
  – Full documentation for GREL is available online

Transformation Examples
• Splitting data that is in a single column into multiple columns
  – Splitting an address into multiple parts
• Standardizing the format of data in a column without changing the values
  – Removing punctuation or standardizing a date format
• Extracting a particular type of data from a longer text string
  – Finding ISBNs in a bibliographic citation

Writing Transformations
• Select the column to transform and choose ‘Edit cells -> Transform’ from the drop-down menu
• A new screen will display with a place to write a transformation (the ‘Expression’ box) along with a preview window to see the effect of the transformation on 10 rows of data
• The transformation typed into the ‘Expression’ box has to be a valid GREL expression
  – The
word ‘value’ by itself is the simplest expression
  – It simply means display the value that is currently in the column and make no change

GREL Functions
• A GREL function is called by passing it a value of some kind (a text string, a date, a number, etc.)
• Some GREL functions take additional parameters or options which control how the function works
• GREL supports two types of syntax:
  – value.function(options)
  – function(value, options)

Common Transformations
Common Transformation                 GREL expression       Action
To Uppercase                          value.toUppercase()   Converts the current value to uppercase
To Lowercase                          value.toLowercase()   Converts the current value to lowercase
To Titlecase                          value.toTitlecase()   Converts the current value to titlecase (i.e. each word starts with an uppercase character and all other characters are converted to lowercase)
Trim leading and trailing whitespace  value.trim()          Removes any ‘whitespace’ characters (e.g. spaces, tabs) from the start or end of the current value

Sample Workflow: Author Affiliations
• Split multi-valued cells
• Remove author names
  – value.replace(/\[\D+\]/,"")
  – expression: \[\D+\]
• Remove additional affiliation information
  – value.replace(/\,.+/,"")
  – expression: \,.+
• Remove leading and trailing whitespace
• Create facet and continue as normal

Faceting
• A ‘Facet’ groups all the values that appear in a column, allows for filtering the data by these values, and then editing values across many records at the same time
• OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory
• Why use faceting?
  – Can help get an overview of the data in a project
  – Can help bring more consistency to the data

Using ‘Text Facet’
• To create a text facet for a column, click on the drop-down menu at the top of the column, and choose Facet -> Text Facet
  – The facet will appear in the left side panel
  – Will consist of a list of values used in that column of data
  – Can filter the data by clicking on one of the headings
  – Can Include multiple values at one time or Invert the filter to show all values that do not match the selected values
  – Can make minor edits to all selected values with Edit

More on Facets
As well as
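Worked example: the "Sample Workflow: Author Affiliations" steps (split, remove author names, remove the rest of the affiliation, trim, facet) can be approximated outside OpenRefine. The Python sketch below is only an illustration: it reuses the two regular expressions from the slide, but the sample cell, the function name clean_affiliation, and the use of collections.Counter as a stand-in for a text facet are assumptions, not part of the presentation.

```python
import re
from collections import Counter

# Hypothetical multi-valued cell: three affiliations separated by "|".
cell = (
    "[Smith, J; Jones, A] Univ Pittsburgh, Dept Med, Pittsburgh, PA USA|"
    "[Lee, K] Univ Aberdeen, Queen Mother Lib, Aberdeen, Scotland|"
    "[Jones, A] Univ Pittsburgh, Sch Nursing, Pittsburgh, PA USA"
)


def clean_affiliation(value: str) -> str:
    """Mimic the three transformation steps from the sample workflow."""
    value = re.sub(r"\[\D+\]", "", value)  # like value.replace(/\[\D+\]/,"")
    value = re.sub(r",.+", "", value)      # like value.replace(/\,.+/,"")
    return value.strip()                   # like value.trim()


# Step 1: split the multi-valued cell on the pipe delimiter.
affiliations = [clean_affiliation(v) for v in cell.split("|")]

# Final step: a rough equivalent of a text facet (distinct values with counts).
facet = Counter(affiliations)

for name, count in facet.most_common():
    print(name, count)
```

Because \D+ matches any run of non-digit characters, the first pattern assumes the bracketed author list contains no digits; values that break that assumption would need a different expression.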

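Worked example: the clustering slides mention the key collision method with the fingerprint keying function. The Python sketch below shows the general idea under stated assumptions: it approximates a fingerprint by trimming, lowercasing, dropping accents and ASCII punctuation, then sorting and de-duplicating the words. It is not OpenRefine's actual implementation, and the function names fingerprint and cluster are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key for a text value (illustrative only)."""
    value = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Remove ASCII punctuation such as "]" and ",".
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, sort, and re-join the tokens.
    return " ".join(sorted(set(value.split())))


def cluster(values):
    """Group values whose keys collide; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {key: vs for key, vs in groups.items() if len(set(vs)) > 1}


# Values from the "Common Scenarios: Names" slide, plus one unrelated value.
names = ["London", "London]", "London,]", "london", "Aberdeen"]
clusters = cluster(names)
print(clusters)
```

All four London variants share one key, so they form a single cluster that could be merged to a chosen value; word order is also ignored, so "Norton, Candace" and "Candace Norton" would collide in the same way.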