☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.
ENH: Linked Datasets (RDF)
(original: pandas-dev/pandas#3402)
Use Case
So I:
and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.
User Story
As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").
Status Quo: Pandas IO
http://pandas.pydata.org/pandas-docs/dev/io.html
Read or parse a data format into a DataSet:
pandas.read_*
read_clipboard
read_csv
read_excel
read_fwf
read_gbq
read_hdf
read_html
read_json
read_msgpack
read_pickle
read_sql
read_stata
read_table
pandas.HDFStore
Add metadata:
Save or serialize a DataSet into a data format:
pandas.DataFrame.to_*
to_csv
to_dict
to_excel
to_gbq
to_html
to_latex
to_panel
to_period
to_records
to_sparse
to_sql
to_stata
to_string
to_timestamp
to_wide
Share or publish a serialized DataSet with the internet:
GET/POST /container/filename.csv  # [.json|.xml|.xls|.rdf|.html]
python -m SimpleHTTPServer 8088  # Python 2; on Python 3: python -m http.server 8088
Implementation
What changes would be needed for Pandas core to support this workflow?
.meta schema
to_rdf for Series, DataFrames, Panels, and Panel4Ds
read_rdf for Series, DataFrames, Panels, and Panel4Ds
@datastep process decorators
DataSet
DataCatalog of precomputed aggregations/views/slices
Units support (.meta?)
.meta schema
It's easy enough to serialize a dict and a table to naive RDF.
For interoperability, it would be helpful to standardize on a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.
There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
Ontology Resources
RDFS (rdfs:)
OWL (owl:)
CSV2RDF (csvw:)
https://en.wikipedia.org/wiki/Comma-separated_values
W3C PROV (prov:)
schema.org (schema:)
W3C RDF Data Cube (qb:)
to_rdf
http://pandas.pydata.org/pandas-docs/dev/io.html
Arguments:
fmt
.
Series.meta
Series.to_rdf()
DataFrame.meta
DataFrame.to_rdf()
Panel.meta
Panel.to_rdf()
Panel4D.meta
Panel4D.to_rdf()
read_rdf
http://pandas.pydata.org/pandas-docs/dev/remote_data.html
Series.read_rdf()
DataFrame.read_rdf()
Panel.read_rdf()
Panel4D.read_rdf()
Arguments to read_rdf would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
@datastep / PROV
Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
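A @datastep decorator could be as small as a wrapper that records what ran, when, and on what kind of input. This is only a sketch: `PROVENANCE_LOG` and the record keys are assumptions, loosely named after PROV terms, and a real version would emit prov: triples instead of dicts.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # hypothetical in-memory store of step records


def datastep(func):
    """Record a minimal provenance entry each time the wrapped step runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,  # roughly a prov:Activity
            "startedAtTime": started.isoformat(),
            "endedAtTime": datetime.now(timezone.utc).isoformat(),
            "used": [type(a).__name__ for a in args],  # input types only
        })
        return result
    return wrapper


@datastep
def normalize(values):
    total = sum(values)
    return [v / total for v in values]


shares = normalize([1, 1, 2])
```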
DataCatalog
A collection of Datasets.
DataCatalog = dict(that=df1, this=df1.groupby(...).apply(...), also_this=df2)
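As a runnable illustration of the catalog idea, a plain dict of named DataFrames and precomputed aggregations is enough to start with. The column names and values below are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"region": ["north", "north", "south"], "sales": [10, 20, 5]})
df2 = pd.DataFrame({"region": ["north", "south"], "manager": ["Ada", "Grace"]})

# A catalog maps a name to a dataset or to a view derived from one.
data_catalog = {
    "that": df1,
    "this": df1.groupby("region")["sales"].sum(),  # precomputed aggregation
    "also_this": df2,
}
```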
Units support
Series.meta
DataFrame.column.meta
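Until something like .meta exists, per-column units can be approximated with the `attrs` dict that recent pandas versions attach to a DataFrame (documented as experimental). The nested "columns" layout mirrors the .meta['columns'][colname] idea from this issue; `unit_of` and the unit strings are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"height": [1.62, 1.80], "mass": [55.0, 72.5]})

# Store units per column, mirroring the proposed .meta['columns'][colname] layout.
df.attrs["columns"] = {
    "height": {"unit": "m"},
    "mass": {"unit": "kg"},
}


def unit_of(frame, column):
    """Hypothetical accessor: return the recorded unit, or None if absent."""
    return frame.attrs.get("columns", {}).get(column, {}).get("unit")
```

Whether such metadata survives operations like merge or groupby is exactly the kind of question a .meta schema would have to settle.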
RDF Datatypes
from rdflib.namespace import XSD, RDF, RDFS
from rdflib import URIRef, Literal
JSON-LD RDF
Linked Data Primer
Linked Data Abstractions
graph.triples((None, None, None))
SELECT ?s ?p ?o WHERE { ?s ?p ?o }
URIs and URLs
urn:
uuid:
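Python's standard uuid module can mint urn:uuid: identifiers directly, either random or derived deterministically from a URL. The example.org URL is a placeholder.

```python
import uuid

# Random identifier for a dataset that has no public URL yet.
dataset_id = uuid.uuid4()
dataset_urn = dataset_id.urn  # a string starting with "urn:uuid:"

# Deterministic identifier derived from a URL (name-based, SHA-1).
source_url = "http://example.org/container/filename.csv"
stable_urn = uuid.uuid5(uuid.NAMESPACE_URL, source_url).urn
```

The name-based form gives the same URN every time for the same URL, which helps when several people describe the same published file.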
SQL and Linked Data
Named Graphs
GRAPH ?g
Linked Data Formats
Choosing Schema
Linked Data Process, Provenance, and Schema
DataSets have [implicit] URIs:
Shared or published DataSets have URLs:
DataSets are about certain things:
DataSets are derived from somewhere, somehow:
Datasets have structure:
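The five statements above could be captured in one JSON-LD style description. This is a sketch, not a settled schema: the choice of schema:url, schema:about, prov:wasDerivedFrom, and csvw:tableSchema is an assumption about which terms fit best, and every identifier and URL below is a placeholder.

```python
import json

dataset_description = {
    "@context": {
        "schema": "http://schema.org/",
        "prov": "http://www.w3.org/ns/prov#",
        "csvw": "http://www.w3.org/ns/csvw#",
    },
    # DataSets have URIs (placeholder UUID).
    "@id": "urn:uuid:123e4567-e89b-12d3-a456-426614174000",
    "@type": "schema:Dataset",
    # Shared or published DataSets have URLs.
    "schema:url": "http://example.org/container/filename.csv",
    # DataSets are about certain things.
    "schema:about": {"@id": "http://example.org/topics/population"},
    # DataSets are derived from somewhere, somehow.
    "prov:wasDerivedFrom": {"@id": "http://example.org/container/raw.csv"},
    # DataSets have structure.
    "csvw:tableSchema": {
        "csvw:column": [
            {"csvw:name": "city", "csvw:datatype": "string"},
            {"csvw:name": "population", "csvw:datatype": "integer"},
        ]
    },
}

serialized = json.dumps(dataset_description, indent=2)
```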
5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
https://en.wikipedia.org/wiki/Linked_Data