Introduction to OpenRefine

2018 Bibliometrics and Research Assessment Symposium Candace Norton, MLS © Library Carpentry: "OpenRefine Lessons for Librarians." Housekeeping

• This session will be recorded and distributed after editing and captioning. • Slides will be available following the Symposium. • Questions are welcome throughout the presentation; please use the microphone. • Due to the set up of the breakout room, this is a lecture style course and will not be hands on. Course Objectives

• Explain what OpenRefine does and why to use it. • Recognize when using OpenRefine is appropriate. • Understand how to import, edit, transform, and export data using OpenRefine. Introduction to OpenRefine

• OpenRefine is a powerful tool for working with messy data • OpenRefine can clean data, transform data from one format to another, and can extend data via web services and external data • OpenRefine is available for download online at http://openrefine.org/download.html Why use OpenRefine?

• Get an overview of a data set • Resolve inconsistencies in a data set – standardizing date formatting • Helps split data up into more granular parts – splitting up cells with multiple authors into separate cells • Match local data up to other data sets – matching local subjects against the Library of Congress Subject Headings • Enhance a data set with data from other sources Common Scenarios: Dates

Data as entered Desired data 1st January 2014 2014-01-01 01/01/2014 2014-01-01 Jan 1 2014 2014-01-01 2014-01-01 2014-01-01 Common Scenarios: Names

Data as entered Desired data London London London] London London,] London london London Common Scenarios: Combined Data

Address in single Library field Institution name Address 1 Town/City Country Postcode University of Wales, University of Llyfrgell Llanbadarn Aberystwyth United SY23 3AS Llyfrgell Thomas Parry Wales Thomas Fawr Kingdom Library, Llanbadarn Parry Fawr, ABERYSTWYTH, Library Ceredigion, SY23 3AS, United Kingdom

University of Aberdeen, University of Queen Meston Walk Aberdeen United AB24 3UE Queen Mother Library, Abderdeen Mother Kingdom Meston Walk, Library ABERDEEN, AB24 3UE, United Kingdom Importing Data into OpenRefine

• Create a project by importing data:

• TSV • Excel (.xls and .xlsx) • CSV • XML • *SV • RDF as XML • JSON • Data documents • Support for other formats can be added with OpenRefine extensions • Typically a publication dataset from a bibliographic database in CSV format Create a Project

• Click “Create Project” in the left menu bar • Select “Get data from this computer” • Browse for the file • Click “Next >>” to select import parameters • Preview the import • Name the project using the text field in the upper right corner • Click “Create Project” button Layout of OpenRefine

• Displays data in a tabular format • Each row will usually represent a ‘record’ in the data • Each column represents a type of information or variable • Only displays a limited number of rows of data at one time • Most options to work with data are accessed from drop down menus at the top of the data columns Rows and Records

• OpenRefine has two modes of viewing data: – Rows – Records • Default view is Rows mode – Each row represents a single record in the data set • Records mode – OpenRefine can link together multiple rows as belonging to the same Record Working with Columns

• Re-order columns by clicking the drop-down menu at the top of the first column labelled ‘All’ • Choose Edit Columns->Re-order/remove columns • Drag and drop column names to re-order the columns, or remove columns completely if they are not needed Sorting Data

• Sort data by clicking on the drop-down menu for the relevant column, and click ‘Sort’ • Once the data is sorted, a new ‘Sort’ drop-down menu will display • Sorts performed in OpenRefine are temporary • Sort on multiple columns at the same time by adding another sorted column Splitting Cells

To split the multi-valued cells (like author names or addresses) into their own cells, use a Split multi- valued cells function: •Click the dropdown menu at the top of the Author column •Choose Edit cells->Split multi-valued cells •In the prompt type the ( | ) symbol and click OK •Note that the rows are still numbered sequentially •Click the Records option to change to Records mode •Note how the numbering has changed - indicating that several rows are related to the same record Joining Cells

•Click the dropdown menu at the top of the Author column •Choose Edit cells->Join multi-valued cells •In the prompt type the ( | ) symbol •Here we are specifying the delimiter character for OpenRefine to use to join the values together. •Click OK to join the Authors cells back together Joining Cells, continued

• A common workflow with multi-valued cells is – split multi-valued cells into individual cells – modify/refine/clean individual cells – join multi-valued cells back together

• After joining cells together, Rows and Records values will now be the same since there are no more split columns • Click both the Rows and Records options and observe how the numbers are equal Clustering

• The Cluster function groups together similar but inconsistent values in a given column and permits merging these inconsistent values into a single chosen value • This is very effective where there is data with minor variations in data values, e.g. names of people, organizations, places, classification terms Clustering, continued

• ‘Clusters’ are created automatically according to an algorithm – Link for more information on Clustering Algorithms • For each cluster, there is the option of ‘merging’ the values together • To use the ‘Cluster’ function, click on the Edit Cells menu option of the relevant column and choose Cluster and Edit Clustering to clean author data

• Split out the author names into individual cells using Edit cells -> Split multi-valued cells, using the pipe ( | ) character as the separator • Choose Edit cells -> Cluster and edit from the ‘author’ column • Using the key collision method with the fingerprint Keying Function, work through the clusters of values, merging them to a single value where appropriate • Try changing the clustering method being used Transformations

• Transformations are ways of manipulating data in columns when basic sorting and faceting are not enough • Allows users to programmatically edit data • Normally written in a special language called GREL (General Refine Expression Language) – Similar to Excel functions but focused on text manipulation rather than numeric functions – Full documentation for GREL is available online Transformation Examples

• Splitting data that is in a single column into multiple columns – Splitting an address into multiple parts • Standardizing the format of data in a column without changing the values – Removing punctuation or standardizing a date format • Extracting a particular type of data from a longer text string – Finding ISBNs in a bibliographic citation Writing Transformations

• Select the column to transform and choose ‘Edit cells->Transform’ from the drop down menu • A new screen will display with a place to write a transformation (the ‘Expression’ box) along with a preview window to see the effect of the transformation on 10 rows of data • The transformation typed into the ‘Expression’ box has to be a valid GREL expression – The word ‘value’ by itself is the simplest expression – It simply means display the value that is currently in the column and make no change. GREL Functions

• GREL functions are written by giving a value of some kind (a text string, a date, a number, etc.) to a GREL function • Some GREL functions take additional parameters or options which control how the function works • GREL supports two types of syntax: – value.function(options) – function(value, options) Common Transformations

Common Transformation Action GREL expression To Uppercase Converts the current value to value.toUppercase() uppercase To Lowercase Converts the current value to value.toLowercase() lowercase To Titlecase Converts the current value to value.toTitlecase() titlecase (i.e. each word starts with an uppercase character and all other characters are converted to lowercase) Trim leading and Removes any ‘whitespace’ value.trim() trailing characters (e.g. spaces, tabs) from whitespace the start or end of the current value Sample Workflow: Author Affiliations

• Split multi-valued cells • Remove author names – value.replace(/\[\D+\]/,"") – expression: \[\D+\] • Remove additional affiliation information – value.replace(/\,.+/,"") – expression: \,.+ • Remove leading and trailing whitespace • Create facet and continue as normal Faceting

• A ‘Facet’ groups all the values that appear in a column, allows for filtering the data by these values, and then editing values across many records at the same time • OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory • Why use faceting? – Can help get an overview of the data in a project – Can help bring more consistency to the data Using ‘Text Facet’

• To create a text facet for a column, click on the drop down menu at the top of the column, and choose Facet -> Text Facet – The facet will appear in the left side panel – Will consist of a list of values used in that column of data – Can filter the data by clicking on one of the headings – Can Include multiple values at one time or Invert the filter to show all values that do not match the selected values – Can make minor edits to all selected values with Edit More on Facets

As well as ‘Text facets’ OpenRefine also supports a range of other types of facet. These include: • Numeric facets • Timeline facets (for dates) • Scatterplot facets • Custom facets – Word facet – Duplicates facet – Text length facet – Facet by blank Filters

• As with Facets, Filter options appear in the left hand panel in OpenRefine • Any operations carried out while data is filtered will only apply to the filtered data currently on display • Faceting and filtering look very similar: – Faceting gives an overview description of all of the data that is currently selected – Filtering allows the selection of a subset of data for analysis Using ‘Text Filter’

• ‘Text Filter’ looks for a particular piece of text appearing in a column – Click the drop down menu at the top of the column to filter then choose ‘text filter’ – Enter the text to filter by, and the ‘text filter’ will only display rows that contain the specified text – Regular expressions can also be used to filter data Working with Arrays

• An ‘Array’ is a list of values, represented in Refine by the use of square brackets containing a list of values surrounded by quotation marks and separated by commas. – [“Monday”,”Tuesday”,”Wednesday”,”Thursday”,”Friday”, ”Saturday”,”Sunday”] • Arrays can be sorted, de-duplicated, and manipulated in other ways in GREL expressions, but cannot appear directly in an OpenRefine cell. • Arrays in OpenRefine are usually the result of a transformation. Arrays Example

• A cell has the value: “Monday,Tuesday,Wednesday,Thursday,Friday,Sa turday,Sunday” • Transform into an array using the split function – value.split(“,”) • Creates an array: [“Monday”,”Tuesday”,”Wednesday”,”Thursday”,”Fr iday”,”Saturday”,”Sunday”] Undo and Redo

• The Undo/Redo panel lists all the steps taken so far • To undo steps, click on the last step to preserve in the list; this will automatically undo all the changes made since that step • The remaining steps will continue to show in the list but greyed out, and can be reapplied by clicking on the last step to apply • If ‘undo’ is used on a set of steps and then new transformations are performed, the greyed out steps will disappear and the option to ‘redo’ these steps is no longer available Saving Steps

• To save a set of steps to be re-applied later or to use in a different project, click the ‘Extract’ button • ‘Extract’ gives the option to select the steps to save, and copy the transformations included in those steps in JSON format • ‘Apply’ will apply the saved or copied steps Exporting data from OpenRefine

• Export options are available through the Export button at the top right of the OpenRefine interface • Export formats include HTML, Excel, CSV, TSV • Custom export options are available, including exporting specific fields, adding a header or footer, and specifying the exact format Demo OpenRefine Course Development Resources

• Library Carpentry: "OpenRefine Lessons for Librarians.“ June 2016, http://data- lessons.github.io/library-openrefine/. • Verborgh, R. Using OpenRefine. Packt Publishing; 2013. Additional Resources

• Openrefine.org • Openrefine.github.com • librarycarpentry.github.io/lc-open-refine/ • librarycarpentry.github.io/lc-open-refine/13- looking-up-data/index.html • github.com/OpenRefine/OpenRefine/wiki/Gener al-Refine-Expression-Language • github.com/OpenRefine/OpenRefine/wiki/Cluster ing-In-Depth Questions?

Email: [email protected] Visit: www.nihlibrary.nih.gov/bibliometrics