RLS: Roadmap #1

Open
34 tasks
westurner opened this issue Oct 22, 2014 · 3 comments

westurner commented Oct 22, 2014

ENH: Linked Datasets (RDF)

  • This is very much a meta issue.
  • There are a number of bare links here.
  • They are for documentation

(original: pandas-dev/pandas#3402)

Use Case

So I:

  • retrieved some data
    • from somewhere
    • about a certain #topic
  • performed analysis
    • with certain transformations and aggregations
    • with certain versions of certain tools
    • confirmed/rejected a [null] hypothesis

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html


  • Series (1D)
    • index
    • data
      • NumPy datatypes
  • DataFrame (2D)
    • index
    • column(s)
      • NumPy datatypes
  • Panel (3D)
  • Panel4D (4D)

Read or parse a data format into a DataSet:

Add metadata:

  • Add RDF metadata (RDFa, JSONLD)

Save or serialize a DataSet into a data format:

  • pandas.DataFrame.
    • to_csv
    • to_dict
    • to_excel
    • to_gbq
    • to_html
    • to_latex
    • to_panel
    • to_period
    • to_records
    • to_sparse
    • to_sql
    • to_stata
    • to_string
    • to_timestamp
    • to_wide
  • to_ RDF
  • to_ CSVW
  • to_ HTML + RDFa
  • to_ JSONLD

Share or publish a serialized DataSet with the internet:

Implementation

What changes would be needed for Pandas core to support this workflow?

  • .meta schema
  • to_rdf for Series, DataFrames, Panels, and Panel4Ds
  • read_rdf for Series, DataFrames, Panels, and Panel4Ds
  • ~@datastep process decorators
  • ~DataSet
  • ~DataCatalog of precomputed aggregations/views/slices.
  • Units support (.meta?)

.meta schema

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
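
As a rough illustration of the `.meta['columns'][colname]['schema']` idea, here is a minimal sketch using a plain dict. Everything in it is an assumption: pandas has no `.meta` attribute, and the `column_schema` helper, the `datatype` key, and the `http://example.com/terms#` URIs are hypothetical names chosen for the example.

```python
import json

# Hypothetical layout: pandas has no .meta attribute, so this is a plain dict
# that could travel alongside a DataFrame.
meta = {
    "@context": {
        "schema": "http://schema.org/",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
    },
    "columns": {
        "temperature": {
            "schema": {
                "@id": "http://example.com/terms#temperature",  # placeholder URI
                "datatype": "xsd:decimal",
            },
        },
        "station": {
            "schema": {
                "@id": "http://example.com/terms#station",  # placeholder URI
                "datatype": "xsd:string",
            },
        },
    },
}


def column_schema(meta, colname):
    """Return the schema entry for one column, or None if it is undocumented."""
    return meta.get("columns", {}).get(colname, {}).get("schema")


print(json.dumps(column_schema(meta, "temperature"), indent=2))
```

Returning `None` for an undocumented column keeps the lookup usable on tables that carry only partial metadata.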

Ontology Resources
CSV2RDF (csvw:)

https://en.wikipedia.org/wiki/Comma-separated_values

W3C PROV (prov:)
schema.org (schema:)
  • http://schema.org
  • http://www.w3.org/wiki/WebSchemas
  • http://schema.rdfs.org/
  • https://schema.org/docs/full.html :
    • schema:Dataset -- A body of structured information describing some topic(s) of interest.
      • [schema:Thing, schema:CreativeWork]
      • distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
      • spatial, temporal
      • catalog -- A data catalog which contains a dataset (DataCatalog)
    • schema:DataCatalog -- collection of Datasets
      • [schema:Thing, schema:CreativeWork]
      • dataset -- A dataset contained in a catalog. (Dataset)
    • schema:DataDownload -- A dataset in downloadable form.
      • [schema:Thing, schema:CreativeWork]
      • contentSize
      • contentURL
      • uploadDate
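
A sketch of how the schema.org terms listed above might be combined into a JSON-LD description of one dataset. The `dataset_jsonld` helper and the example URL are hypothetical. Note one assumption: the property is spelled `contentUrl` here, following the current schema.org vocabulary, rather than `contentURL` as written in the list.

```python
import json


def dataset_jsonld(name, description, download_url, encoding_format):
    """Build a schema.org Dataset with a single DataDownload distribution."""
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": encoding_format,
        },
    }


doc = dataset_jsonld(
    name="Station temperatures",
    description="Daily temperature readings per station.",
    download_url="http://example.com/datasets/temperatures.csv",  # placeholder
    encoding_format="text/csv",
)
print(json.dumps(doc, indent=2))
```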
W3C RDF Data Cube (qb:)

to_rdf

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:

  • output fmt
  • JSON-LD: compaction


  • Series.meta
  • Series.to_rdf()
  • DataFrame.meta
  • DataFrame.to_rdf()
  • Panel.meta
  • Panel.to_rdf()
  • Panel4D.meta
  • Panel4D.to_rdf()

read_rdf

http://pandas.pydata.org/pandas-docs/dev/remote_data.html

  • Series.read_rdf()
  • DataFrame.read_rdf()
  • Panel.read_rdf()
  • Panel4D.read_rdf()

Arguments to read_rdf would need to describe which dimensions of data to
read into 1D/2D/3D/4D form.
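
For the 2D case, one possible shape for those arguments is a mapping from predicate URIs to column names, with subjects becoming the index. The sketch below is only that, a sketch: `triples_to_table` and its `columns` argument are hypothetical, and the result is a nested dict rather than a DataFrame.

```python
def triples_to_table(triples, columns):
    """Pivot (s, p, o) triples into {subject: {column_name: object}}.

    `columns` maps predicate URI -> column name; triples whose predicate
    is not listed are skipped.
    """
    table = {}
    for s, p, o in triples:
        if p in columns:
            table.setdefault(s, {})[columns[p]] = o
    return table


triples = [
    ("http://example.com/d#0", "http://example.com/t#station", "A"),
    ("http://example.com/d#0", "http://example.com/t#temperature", 21.5),
    ("http://example.com/d#0", "http://example.com/t#ignored", "x"),
    ("http://example.com/d#1", "http://example.com/t#station", "B"),
]
table = triples_to_table(
    triples,
    columns={
        "http://example.com/t#station": "station",
        "http://example.com/t#temperature": "temperature",
    },
)
print(table)
```

Subjects that lack a value for some column simply have no entry for it, which would correspond to a missing value once loaded into a table.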

@datastep / PROV

  • Objective: Additive journal of transformations
  • Link to source script(s) URIs
  • Decorator for annotating data transformations with metadata.
  • Generate PROV metadata for data transformations
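
A minimal sketch of what a `@datastep` decorator could look like, assuming an in-memory, append-only journal. The decorator, the `JOURNAL` list, and the record keys are all hypothetical; `startedAtTime` and `endedAtTime` are named after the corresponding PROV terms, but the records are plain dicts, not PROV serializations.

```python
import functools
from datetime import datetime, timezone

JOURNAL = []  # additive journal: entries are only ever appended


def datastep(description, source_uri=None):
    """Annotate a transformation and record each call in JOURNAL."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            JOURNAL.append({
                "activity": func.__name__,
                "description": description,
                "source": source_uri,  # link to the source script, if known
                "startedAtTime": started,
                "endedAtTime": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@datastep("Drop rows with missing temperature",
          source_uri="http://example.com/scripts/clean.py")  # placeholder URI
def drop_missing(rows):
    return [r for r in rows if r.get("temperature") is not None]


cleaned = drop_missing([{"temperature": 21.5}, {"temperature": None}])
print(JOURNAL[0]["activity"], len(cleaned))
```

A journal entry is only written when the wrapped function returns, so a step that raises leaves no record in this sketch.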

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)

DataCatalog

A collection of Datasets.

  • DataCatalog = {that=df1, this=df1.groupby().apply(), also_this=df2}
    • 'this is an aggregation of that'
      • 'this' has a URI
      • 'that' has a URI
  • What if there is no metadata for df2?
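
The catalog above can be sketched as a dict of entries that record a URI and a derivation link. The entry keys (`uri`, `derived_from`, `how`) and both helpers are hypothetical, and the URIs are placeholders; the sketch assumes derivation links never form a cycle.

```python
catalog = {
    "that": {
        "uri": "http://example.com/datasets/#that",
        "derived_from": None,
    },
    "this": {
        "uri": "http://example.com/datasets/#this",
        "derived_from": "that",  # 'this is an aggregation of that'
        "how": "groupby + apply",
    },
    "also_this": {  # a dataset with no metadata yet
        "uri": None,
        "derived_from": None,
    },
}


def lineage(catalog, key):
    """Follow derived_from links from key back to its original source."""
    chain = []
    while key is not None:
        chain.append(key)
        key = catalog[key].get("derived_from")
    return chain


def undocumented(catalog):
    """List the entries that have no URI."""
    return sorted(k for k, v in catalog.items() if not v.get("uri"))


print(lineage(catalog, "this"))
print(undocumented(catalog))
```

Listing undocumented entries is one way to surface the "no metadata for df2" case instead of silently publishing it.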

Units support

RDF Datatypes

JSON-LD RDF

Linked Data Primer

Linked Data Abstractions

  • Graphs are represented as triples of (s,p,o)
  • Subject, Predicate, Object
  • Queries are patterns with ?references
    • graph.triples((None, None, None))
    • SELECT ?s ?p ?o WHERE { ?s ?p ?o }
  • subjects are linked to objects by predicates
    • subjects and predicates are identified by URI 'keys'
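
The pattern-matching idea can be sketched without any RDF library. Here `None` plays the role of a `?variable`, mirroring the `graph.triples((None, None, None))` call shown above; the `match` function and the example URIs are hypothetical.

```python
def match(graph, pattern):
    """Yield the (s, p, o) triples that fit the pattern; None is a wildcard."""
    for triple in graph:
        if all(want is None or want == have
               for want, have in zip(pattern, triple)):
            yield triple


graph = [
    ("http://example.com/d#0", "http://example.com/t#station", "A"),
    ("http://example.com/d#0", "http://example.com/t#temperature", 21.5),
    ("http://example.com/d#1", "http://example.com/t#station", "B"),
]

# Every triple, like SELECT ?s ?p ?o WHERE { ?s ?p ?o }
everything = list(match(graph, (None, None, None)))

# Only the station of each subject
stations = list(match(graph, (None, "http://example.com/t#station", None)))
print(len(everything), len(stations))
```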

URIs and URLs

  • a URI is like a URL
  • usually, we expect URLs to be 'dereferenceable' HTTP URIs
  • a URI may start with a different URI prefix
    • urn:
    • uuid:

SQL and Linked Data

  • there exist standard mappings for whole SQL tablesets
    • rdb2rdf
    • similar to application scaffolding
    • ACL support adds complexity
  • virtuoso supports SQL and RDF and SPARQL
  • rdflib-sqlalchemy maps RDF onto SQL tables
    • fairly inefficiently, when compared to native triplestores

Named Graphs

  • Quads: (g, s, p, o)
  • g: sometimes called the 'context' of a triple
  • Metadata about GRAPH ?g
  • Multiple named graphs in one file: TriX, TriG
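
A small sketch of the quad idea: each statement carries the name of the graph it belongs to, so a flat list of quads can be split back into per-graph triples. The `group_by_graph` helper and the graph names are hypothetical.

```python
from collections import defaultdict


def group_by_graph(quads):
    """Split (g, s, p, o) quads into {graph_name: [(s, p, o), ...]}."""
    graphs = defaultdict(list)
    for g, s, p, o in quads:
        graphs[g].append((s, p, o))
    return dict(graphs)


quads = [
    ("http://example.com/graphs/raw", "http://example.com/d#0",
     "http://example.com/t#temperature", 21.5),
    ("http://example.com/graphs/raw", "http://example.com/d#1",
     "http://example.com/t#temperature", 19.0),
    ("http://example.com/graphs/summary", "http://example.com/d#mean",
     "http://example.com/t#temperature", 20.25),
]
graphs = group_by_graph(quads)
print(sorted(graphs))
```

Because each graph has its own name, metadata such as provenance can then be attached to the graph as a whole rather than to every triple.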

Linked Data Formats

  • NTriples
  • RDF/XML
    • TriX
  • Turtle, N3
    • TriG
  • JSON-LD

Choosing Schema

  • XSD, RDF, RDFS, DCTERMS
  • Which schema is most popular?
  • Which schema is a best fit for the data?
  • Which schema will search engines index for us?
  • What do the queries look like?
  • Years Later... What is OWL?
  • Why would we start with RDFS now?

Linked Data Process, Provenance, and Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

  • where and how was it downloaded? (digital sense)
  • how was it collected? (process control sense)

Datasets have structure:

  • Tabular, Hierarchical
  • 1D, 2D, 3D, 4D
  • Graph-based
    • Chains
    • Flows
  • Schema

5 ★ Open Data

http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data

@westurner westurner changed the title ENH: Linked Data (from pandas #3402) ENH: Linked Datasets (from pandas #3402) Oct 22, 2014
@westurner westurner added the ENH label Oct 22, 2014
@westurner westurner modified the milestone: 0.1 Oct 22, 2014
@westurner westurner changed the title ENH: Linked Datasets (from pandas #3402) RLS: Roadmap Oct 24, 2014
@westurner westurner added RLS and removed ENH labels Oct 24, 2014
@westurner

Added:

  • to_ CSVW

@westurner

  • Is tracking columnar metadata across merges easier with Series.meta (than with DataFrame.meta.columns[name].meta)?
