RLS: Roadmap #1

westurner · 2014-10-22T10:36:42Z

ENH: Linked Datasets (RDF)

This is very much a meta issue.
There are a number of bare links here.
They are for documentation.

Use Case

So I:

retrieved some data
- from somewhere
- about a certain #topic
performed analysis
- with certain transformations and aggregations
- with certain versions of certain tools
- confirmed/rejected a [null] hypothesis

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html


Series (1D)
- index
- data
  - NumPy datatypes
DataFrame (2D)
- index
- column(s)
  - NumPy datatypes
Panel (3D)
Panel4D (4D)

Read or parse a data format into a DataSet:

pandas.read_*
- read_clipboard
- read_csv
- read_excel
- read_fwf
- read_gbq
- read_hdf
- read_html
- read_json
- read_msgpack
- read_pickle
- read_sql
- read_stata
- read_table
pandas.HDFStore
- https://pandas.pydata.org/docs/dev/io.html#hdf5-pytables

Add metadata:

Add RDF metadata (RDFa, JSONLD)

Save or serialize a DataSet into a data format:

pandas.DataFrame.to_*
- to_csv
- to_dict
- to_excel
- to_gbq
- to_html
- to_latex
- to_panel
- to_period
- to_records
- to_sparse
- to_sql
- to_stata
- to_string
- to_timestamp
- to_wide
to_ RDF
to_ CSVW
to_ HTML + RDFa
to_ JSONLD
- create a JSONLD @context
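
A minimal sketch of the "create a JSONLD @context" step, assuming the rows are plain dicts such as those returned by DataFrame.to_dict(orient="records"). The vocabulary URI, column names, and values are made up for illustration; this is not an existing pandas method.

```python
import json

# Rows as plain dicts, e.g. what DataFrame.to_dict(orient="records") returns.
# The column names and values are invented for this example.
records = [
    {"label": "a", "value": 1},
    {"label": "b", "value": 2},
]

# Hypothetical vocabulary: every column name becomes a term under @vocab,
# and "value" is additionally typed as an xsd:integer.
context = {
    "@vocab": "http://example.org/vocab#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "value": {"@id": "http://example.org/vocab#value", "@type": "xsd:integer"},
}

# One JSON-LD document: the context plus the rows as a graph of nodes.
document = {"@context": context, "@graph": records}
jsonld_text = json.dumps(document, indent=2, sort_keys=True)
print(jsonld_text)
```

The same @context could be shared by many serialized DataSets, which is one reason to keep it separate from the rows.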

Share or publish a serialized DataSet with the internet:

Email Attachment (Table in a PDF)
- opendatahandbook.org
- project-open-data.github.io
FTP, SFTP, RSYNC, NFS
HTML web upload form with metadata form fields
CLI tool
Version Control: Git, Hg, Svn
- challenge: 'large' files ("binary blobs") in VCS systems
HTTP API: Object Storage (~LDP)
- GET/POST /container/filename.csv # [.json|.xml|.xls|.rdf|.html]
- challenge: indexing metadata from a separate document / named graph
  - GET/POST to /container/filename.csv
Push to CKAN
Host DataSet metadata
- python -m SimpleHTTPServer 8088 (Python 2; on Python 3: python -m http.server 8088)
- e.g. http://datasets.schema-labs.appspot.com/about indexes http://schema.org/Dataset items

Implementation

What changes would be needed for Pandas core to support this workflow?

.meta schema
to_rdf for Series, DataFrames, Panels, and Panel4Ds
read_rdf for Series, DataFrames, Panels, and Panel4Ds
~@datastep process decorators
~DataSet
~DataCatalog of precomputed aggregations/views/slices.
Units support (.meta?)

`.meta` schema

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
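
To make the "naive RDF" point concrete, here is a rough sketch that writes a table as N-Triples, taking each predicate URI from a .meta-style dict. The meta layout follows the .meta['columns'][colname]['schema'] idea above; the function names, rows, and example.org URI are assumptions, and every value is emitted as a plain string literal (no datatypes, no non-ASCII escaping).

```python
def escape(text):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(rows, meta, base):
    """Emit one triple per cell: <base + row key> <column schema URI> "value" ."""
    lines = []
    for key, row in rows.items():
        for column, value in row.items():
            predicate = meta["columns"][column]["schema"]
            lines.append('<%s%s> <%s> "%s" .'
                         % (base, key, predicate, escape(str(value))))
    return "\n".join(lines)


# Hypothetical columnar metadata: one schema URI per column.
meta = {
    "columns": {
        "name": {"schema": "http://schema.org/name"},
        "height": {"schema": "http://example.org/vocab#height"},
    }
}

# Row key -> {column: value}; invented values.
rows = {"r1": {"name": "a", "height": 1.5}}

ntriples = to_ntriples(rows, meta, base="http://example.com/datasets/#")
print(ntriples)
```

Without a shared schema, two people serializing the same table this way would pick different predicates, which is the interoperability problem described above.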

Ontology Resources

CSV2RDF (`csvw:`)

https://en.wikipedia.org/wiki/Comma-separated_values

https://tools.ietf.org/html/rfc4180

W3C PROV (`prov:`)

schema.org (`schema:`)

http://schema.org
http://www.w3.org/wiki/WebSchemas
http://schema.rdfs.org/
https://schema.org/docs/full.html :
- schema:Dataset -- A body of structured information describing some topic(s) of interest.
  - [schema:Thing, schema:CreativeWork]
  - distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
  - spatial, temporal
  - catalog -- A data catalog which contains a dataset (DataCatalog)
- schema:DataCatalog -- collection of Datasets
  - [schema:Thing, schema:CreativeWork]
  - dataset -- A dataset contained in a catalog. (Dataset)
- schema:DataDownload -- A dataset in downloadable form.
  - [schema:Thing, schema:CreativeWork]
  - contentSize
  - contentUrl
  - uploadDate

W3C RDF Data Cube (`qb:`)

http://www.w3.org/TR/vocab-data-cube/
http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary#The_history_of_Data_Cube.2C_SDMX-RDF_and_SCOVO
http://www.w3.org/TR/vocab-data-cube/#vocab-reference :
- qb:DataSet -- a collection of observations, possibly organized into various slices, conforming to some common dimensional structure
  - qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values.
- qb:Observation -- a single observation in the cube, may have one or more associated measured values.
  - qb:dataSet -- data set of which this observation is a part.
- qb:ObservationGroup -- a, possibly arbitrary, group of observations.
  - qb:observation -- an observation contained within this slice of the data set.
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values, component properties on the Slice.
- [Components, Properties, Dimensions, Attributes, Measures]
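
One way to read the Data Cube terms against a 2D table: every cell becomes a qb:Observation attached to a qb:DataSet. The sketch below does only that; the ex: property names and the table are invented, and a real cube would also need a qb:DataStructureDefinition declaring which properties are dimensions and measures.

```python
QB = "http://purl.org/linked-data/cube#"
EX = "http://example.org/vocab#"  # hypothetical vocabulary for this example

# Row label -> {column label: value}; invented values.
table = {
    "2013": {"a": 1, "b": 2},
    "2014": {"a": 3, "b": 4},
}


def table_to_observations(table, dataset_uri):
    """Turn each cell into a dict shaped like an expanded JSON-LD node."""
    observations = []
    for row_label, row in table.items():
        for column_label, value in row.items():
            observations.append({
                "@type": QB + "Observation",
                QB + "dataSet": {"@id": dataset_uri},
                EX + "rowKey": row_label,        # acts as one dimension
                EX + "columnKey": column_label,  # acts as another dimension
                EX + "value": value,             # acts as the measure
            })
    return observations


observations = table_to_observations(table, "http://example.com/datasets/#demo")
print(len(observations))
```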

`to_rdf`

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:

output fmt
JSON-LD: compaction


`read_rdf`

http://pandas.pydata.org/pandas-docs/dev/remote_data.html

Series.read_rdf()
DataFrame.read_rdf()
Panel.read_rdf()
Panel4D.read_rdf()

Arguments to read_rdf would need to describe which dimensions of data to
read into 1D/2D/3D/4D form.
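
A sketch of what such arguments might look like for the 2D case, assuming the RDF has already been parsed into one flat dict per observation. The function name and the index/columns/value parameters are hypothetical, not a proposed final signature.

```python
from collections import defaultdict


def observations_to_table(observations, index, columns, value):
    """Pivot flat observations into {index value: {column value: measure}}."""
    table = defaultdict(dict)
    for obs in observations:
        table[obs[index]][obs[columns]] = obs[value]
    return dict(table)


# Invented observations, one dict per (year, series) pair.
observations = [
    {"year": 2013, "series": "a", "measure": 1},
    {"year": 2013, "series": "b", "measure": 2},
    {"year": 2014, "series": "a", "measure": 3},
]

table = observations_to_table(
    observations, index="year", columns="series", value="measure"
)
print(table)
```

The nested dict can then be handed to pandas.DataFrame.from_dict(table, orient="index"); cells with no observation (here 2014/"b") would come out as missing values.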

@datastep / PROV

Objective: Additive journal of transformations
Link to source script(s) URIs
Decorator for annotating data transformations with metadata.
Generate PROV metadata for data transformations

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
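
A rough sketch of what a @datastep decorator could do: append one journal entry per call, with field names loosely following prov:startedAtTime and prov:endedAtTime. The decorator name comes from this issue; the journal structure, the arguments, and the script URI are assumptions.

```python
import functools
from datetime import datetime, timezone

JOURNAL = []  # additive journal: entries are only ever appended


def datastep(description, source=None):
    """Record each call to the decorated transformation in JOURNAL."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            JOURNAL.append({
                "activity": func.__name__,
                "description": description,
                "source": source,  # URI of the script that defines the step
                "startedAtTime": started,
                "endedAtTime": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@datastep("Drop missing values", source="http://example.com/scripts/clean.py")
def drop_missing(values):
    return [v for v in values if v is not None]


cleaned = drop_missing([1, None, 3])
print(cleaned, JOURNAL[-1]["activity"])
```

Serializing JOURNAL as PROV (e.g. as JSON-LD) would be a separate step on top of this.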

DataCatalog

A collection of Datasets.

DataCatalog = dict(that=df1, this=df1.groupby().apply(), also_this=df2)
- 'this is an aggregation of that'
  - 'this' has a URI
  - 'that' has a URI
What if there is no metadata for df2?
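
One possible answer, sketched with plain dicts standing in for the catalog entries: give each described DataSet a URI, link 'this' to 'that' with prov:wasDerivedFrom, and have the catalog report entries that carry no metadata. The URIs and the helper name are hypothetical.

```python
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"
BASE = "http://example.com/datasets/#"  # placeholder base URI

# Metadata per catalog entry; the DataFrames themselves are left out here.
catalog = {
    "that": {"@id": BASE + "that"},
    "this": {"@id": BASE + "this", PROV_DERIVED: {"@id": BASE + "that"}},
    "also_this": {},  # df2: nothing is known about it
}


def without_metadata(catalog):
    """Names of entries that have no URI and so cannot be linked to."""
    return sorted(name for name, entry in catalog.items() if "@id" not in entry)


undescribed = without_metadata(catalog)
print(undescribed)
```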

Linked Data Primer

Linked Data Abstractions

Graphs are represented as triples of (s,p,o)
Subject, Predicate, Object
Queries are patterns with ?references
- graph.triples((None, None, None))
- SELECT ?s ?p ?o WHERE { ?s ?p ?o }
subjects are linked to objects by predicates
- subjects and predicates are identified by URI 'keys'
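
The pattern idea can be shown without any RDF library. In this toy sketch a graph is just a list of (s, p, o) tuples and None plays the role of a ?variable, in the spirit of the graph.triples((None, None, None)) call above; the data and the ex: names are invented.

```python
def triples(graph, pattern):
    """Yield every triple matching the pattern; None matches anything."""
    for triple in graph:
        if all(want is None or want == got
               for want, got in zip(pattern, triple)):
            yield triple


# Invented data; a real graph would use full URIs as subjects and predicates.
graph = [
    ("ex:df1", "ex:about", "ex:topic"),
    ("ex:df2", "ex:about", "ex:topic"),
    ("ex:df2", "ex:derivedFrom", "ex:df1"),
]

everything = list(triples(graph, (None, None, None)))
about_df2 = list(triples(graph, ("ex:df2", None, None)))
print(len(everything), len(about_df2))
```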

URIs and URLs

a URI is like a URL
usually, we expect URLs to be 'dereferenceable' HTTP URIs
- HTTP GET http://en.wikipedia.org/
a URI may start with a different URI prefix
- urn:
- uuid:
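
A quick standard-library check of the prefixes involved. Note that the uuid module renders a UUID as a urn:uuid: URI; the urn:example: identifier below is a placeholder.

```python
import uuid
from urllib.parse import urlparse

uris = [
    "http://en.wikipedia.org/",  # dereferenceable with HTTP GET
    "urn:example:dataset-1",     # placeholder URN, not dereferenceable
    uuid.UUID("12345678-1234-5678-1234-567812345678").urn,
]

# The scheme is the part before the first colon.
schemes = [urlparse(uri).scheme for uri in uris]
print(list(zip(uris, schemes)))
```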

SQL and Linked Data

there exist standard mappings for whole SQL tablesets
- rdb2rdf
- similar to application scaffolding
- ACL support adds complexity
Virtuoso supports SQL and RDF and SPARQL
- standard mappings
- Virtuoso powers http://dbpedia.org/
  - dbpedia.org has a high degree of centrality
    - http://lod-cloud.net/
rdflib-sqlalchemy maps RDF onto SQL tables
- fairly inefficiently, when compared to native triplestores

Named Graphs

Quads: (g, s, p, o)
g: sometimes called the 'context' of a triple
Metadata about GRAPH ?g
Multiple named graphs in one file: TriX, TriG

Linked Data Formats

Choosing Schema

XSD, RDF, RDFS, DCTERMS
Which schema is most popular?
Which schema is a best fit for the data?
Which schema will search engines index for us?
What do the queries look like?
Years Later... What is OWL?
Why would we start with RDFS now?

Linked Data Process, Provenance, and Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

where and how was it downloaded? (digital sense)
how was it collected? (process control sense)

Datasets have structure:

Tabular, Hierarchical
1D, 2D, 3D, 4D
Graph-based
- Chains
- Flows
Schema
5 ★ Open Data

http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data


westurner · 2014-10-24T08:38:34Z

https://github.com/mhausenblas/omnidator

https://github.com/mhausenblas/schema-org-rdf/blob/master/tools/schema-gateway/schema_org_processor.py

westurner · 2016-06-20T10:54:53Z

Added:

to_ CSVW

westurner · 2017-02-07T05:47:38Z

Is tracking columnar metadata across merges easier with Series.meta (than with DataFrame.meta.columns[name].meta)?

westurner changed the title ~~ENH: Linked Data (from pandas #3402)~~ ENH: Linked Datasets (from pandas #3402) Oct 22, 2014

westurner added the ENH label Oct 22, 2014

westurner modified the milestone: 0.1 Oct 22, 2014

westurner mentioned this issue Oct 22, 2014

ENH: Linked Datasets (RDF) pandas-dev/pandas#3402

Closed

34 tasks

westurner changed the title ~~ENH: Linked Datasets (from pandas #3402)~~ RLS: Roadmap Oct 24, 2014

westurner added RLS and removed ENH labels Oct 24, 2014


