Skip to content

Latest commit

 

History

History
126 lines (105 loc) · 5.06 KB

4-demo.md

File metadata and controls

126 lines (105 loc) · 5.06 KB
layout title nav
default
4-Demo
true

Demo: Digital Mitford Data

In this demo we are going to play with a data set extracted from Digital Mitford. The original data was a TEI encoded index of names used on the site. The node was extracted into a TSV file for simplicity.

Download histperson.tsv

The data fields are

  • xml id
  • gender
  • name
  • occupation
  • viaf id

Navigating OpenRefine

Creating a project

  • check character encoding, options
  • Refine never over writes your original data, it creates a copy!
  • information is not sent over internet

Manipulating columns

  • renaming and removing columns
  • changing column order
  • collapsing columns

Exporting a project or a data set

  • OpenRefine project
  • many formats!
  • templating
  • export is always a new copy of data, never alters original!

Automating tasks

  • Undo/Redo copy Extract to txt file (use text editor, not Word)
  • create new project with original file
  • Undo/Redo paste saved extract into Apply

Exploring and Cleaning Data

Cleaning the simple stuff

  • a lot of options in the drop-down menus
  • get rid of white space
    • one column at a time: Edit cells > Common transforms
    • entire sheet: All > Transform > value.trim() and value.replace(/\s+/,' ')
  • Edit cells > Transform... is very powerful
  • GREL = General Refine Expression Language
  • GREL documentation and recipes are available on the OpenRefine wiki.

Sample GREL Recipes

  • Remove duplicate comma separated entries in a cell
    • value.split(", ").uniques().join(", ")
  • Replace string in cells
    • value.replace("+", "")
    • value.replace("~"", "").replace(",", "").replace("-", "")
  • Clean-up character encoding problems
    • value.unescape("url")
  • Convert number with text to number
    • toNumber(value.replace(" million", ""))*1000000

Splitting, faceting, and clustering

  • multi-valued fields can be a barrier to data cleaning
    • Edit cells > Split multi-valued cells...
    • record view vs. row view
    • Facet > Text facet
    • manual cleaning and clustering
    • Edit cells > Join multi-valued cells

Enhancing with Data from Other Sources

Reconciling from other data sources

  • Vocabulary reconciliation is a process where automated systems use terms from unstandardized metadata to search controlled vocabularies and return URIs.
  • OpenRefine has built in tools to reconcile data with Wikidata
  • Other data services can be added

OpenRefine's Wikidata Service

  • Reconciling the names
    • Reconcile > Start reconciling...
    • choose Wikidata Reconciliation for OpenRefine (en)
    • choose human and Auto-match candidates with high confidence
    • matches some automatically, but often requires some manual review
  • OpenRefine 2.8 added querying and extracting tools
    • Select matched from the judgement facet
    • Edit column > Add columns from reconciled values
    • Add country of citizenship, occupation, place of birth, place of death, place of burial, and VIAF ID
    • Add Property filter to include date of birth and date of death
  • Review the results of your reconciliation
    • date of birth > Facet > Timeline facet
  • Extract the Wikidata id
    • Edit column > Add column based on this column...
    • name column: wikidata_id
    • cell.recon.match.id

Adding more data based on extracted dataset

  • Geographic Coordinates for places
    • place of birth > Edit column > Add columns from reconciled values
    • Add coordinate location

Compare the collected and extracted VIAF ids

  • Clean up collected VIAF ids
    • Facet > Customized facets > Facet by blank
    • On rows view, choose false to select rows with viaf ids
    • Filter - regex: ^(?!http://).+
    • Transform cells to remove final "/" temporarily - value.replace(//$/, "")
    • Transform cells to apply final "/" because that is what VIAF expects - value+"/"
  • Make extracted VIAF id numbers into VIAF URLs
  • Build a facet to compare the two columns
    • value == cells["viaf id"].value

Other data services

Exporting your cleaned and expanded dataset

  • Export project
  • TSV, CSV, HTML table, Excel (two flavors), ODF spreadsheet
  • Templating - Create XML exports!