Creating multiple URIs from multi-valued cells #127

Open
Robsteranium opened this issue May 11, 2020 · 4 comments

@Robsteranium
Contributor

It would be useful to be able to say, for example, that a component is associated with multiple codelists.

:age a qb:DimensionProperty;
  qb:codeList :age-in-years, :age-in-bands .

To specify this it would be convenient to have a multi-valued cell in the codelists column of the components pipeline:

Label, Component Type, Codelist
Age, Dimension, http://example.com/age-in-years;http://example.com/age-in-bands

The W3C tabular data model allows multi-valued cells via the separator column annotation (i.e. ";" in the above example).

This works fine for outputting, e.g., multiple string values, but not for URIs. In this case we want to output a URI, so we're using a valueUrl annotation of "{+codelist}". This causes csv2rdf to treat the cell's string value as a URI (without escaping its content). The problem arises because the multiple values of the cell are passed to a single expansion of the URI template as (presumably) a list, rather than each being passed independently for expansion.

Thus, if we modify the schema to add a separator as follows:

      {
        "name": "codelist",
        "titles": "codelist",
        "propertyUrl": "qb:codeList",
        "separator": ";",
        "valueUrl": "{+codelist}"
      }

and pass the above example CSV as input, we get a single URI with a comma in it:

:age a qb:DimensionProperty;
  qb:codeList <http://example.com/age-in-years,http://example.com/age-in-bands> .

As per the URI Template specification, RFC 6570:

Multiple variables and list values have their values joined with "," if there is no predefined joining mechanism for the operator.
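To make the joining behaviour concrete, here is a minimal Python sketch of how a "{+var}" expansion treats a list value. It is not csv2rdf's implementation; the function name `expand_reserved` is hypothetical, and the encoding is simplified (existing percent-encoded triplets are not preserved).

```python
from urllib.parse import quote

# Reserved punctuation that a "{+var}" expansion leaves unescaped
# (letters, digits and "-._~" are never escaped by quote()).
# Simplification: existing pct-encoded triplets are not preserved.
RESERVED = ":/?#[]@!$&'()*+,;="


def expand_reserved(values):
    """Expand a list value roughly as "{+var}" would: encode each member,
    then join the members with ","."""
    return ",".join(quote(v, safe=RESERVED) for v in values)


# The cell from the example CSV, already split on the ";" separator.
cell_values = "http://example.com/age-in-years;http://example.com/age-in-bands".split(";")
single_uri = expand_reserved(cell_values)
print(single_uri)
```

Because the whole list goes through one expansion, the two codelist URIs come out as a single comma-joined string, which matches the output reported above.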

The tabular data model example 13 demonstrates why you might want this behaviour.

It's not immediately obvious to me how to resolve this. There may be some way we can specify that the cell values need to get their own URIs, either through the csvw metadata, or with the right URI template.

If it's not possible, we may need to change the implementation of csv2rdf (e.g. by introducing a new annotation that declares an alternate processing method) or perhaps create a new implementation of the template expander to support a different syntax.
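As a rough illustration of what an alternate processing method might do, the Python sketch below splits the cell first and expands each value on its own, yielding one triple per value. Everything here is an assumption for discussion: the function names, the subject URI `http://example.com/age`, and the N-Triples output are not part of csv2rdf or the CSVW specifications.

```python
from urllib.parse import quote

# Characters a "{+var}" expansion would leave unescaped (simplified).
RESERVED = ":/?#[]@!$&'()*+,;="
# The Data Cube property behind the qb:codeList prefixed name.
QB_CODE_LIST = "http://purl.org/linked-data/cube#codeList"


def expand_each(cell, separator=";"):
    """Hypothetical per-value processing: split the cell, then expand
    every non-empty value independently."""
    parts = [p.strip() for p in cell.split(separator)]
    return [quote(p, safe=RESERVED) for p in parts if p]


def triples(subject, predicate, cell, separator=";"):
    """Emit one N-Triples line per expanded value."""
    return [
        f"<{subject}> <{predicate}> <{uri}> ."
        for uri in expand_each(cell, separator)
    ]


lines = triples(
    "http://example.com/age",  # assumed subject URI for the :age component
    QB_CODE_LIST,
    "http://example.com/age-in-years;http://example.com/age-in-bands",
)
for line in lines:
    print(line)
```

The only difference from the current behaviour is the order of operations: split, then expand, instead of expanding the list as one value.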

As a work-around, an alternative would be to duplicate the row specifying the component, changing only the codelist, but this is of course undesirable:

Label, Component Type, Codelist
Age, Dimension, http://example.com/age-in-years
Age, Dimension, http://example.com/age-in-bands
@Robsteranium Robsteranium changed the title Multi-valued cells Creating multiple-URIs from multi-valued cells May 11, 2020
@Robsteranium Robsteranium changed the title Creating multiple-URIs from multi-valued cells Creating multiple URIs from multi-valued cells May 11, 2020
@RickMoynihan
Member

RickMoynihan commented May 11, 2020

Just checking I understand the issue...

So, broadly, the issue is that we want a 1-to-many link between dimension and codeList. The W3C spec allows csv2rdf to parse multiple values out of a cell, but it always assumes all those values will be put into a single output value (a URI in this case).

Is this right?

> As a work-around, an alternative would be to duplicate the row specifying the component, changing only the codelist, but this is of course undesirable:

Can you explain why the workaround is insufficient? It makes sense to me. Is it because we currently assume there's a single row per component? Is it just some theoretical purity around tidiness (1 row per component) that we're breaking, or is it fundamentally a problem?

@Robsteranium
Contributor Author

Your understanding is correct @RickMoynihan, yes. At least that's what I've seen so far. There may be another way to configure it such that the multiple values lead to multiple object-URIs.

Indeed, the workaround would mean the input was no longer a tidy one component per row. Instead it would be one component-codelist pair per row. Practically, this means that updates to rows would need to be coordinated. Typically what will happen is that someone will update e.g. the description field in one row only, and we'll end up with a component that has both the old and the new description.
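If the duplicated-row workaround were adopted anyway, a check along the following lines could flag rows that have drifted apart. This is only a sketch: the Description column, its values, and the `conflicting_fields` helper are invented for illustration and are not part of the pipeline.

```python
import csv
import io
from collections import defaultdict

# Hypothetical components table using the duplicated-row workaround.
# The Description column and its values are made up for this example.
SAMPLE = """Label,Component Type,Codelist,Description
Age,Dimension,http://example.com/age-in-years,Age of the person (updated)
Age,Dimension,http://example.com/age-in-bands,Age of the person
"""


def conflicting_fields(csv_text, key="Label", varying="Codelist"):
    """Return {key value: {field: set of values}} for fields that differ
    between rows sharing the same key, ignoring the column that is
    expected to vary."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field, value in row.items():
            if field not in (key, varying):
                seen[row[key]][field].add(value)
    return {
        k: {f: vals for f, vals in fields.items() if len(vals) > 1}
        for k, fields in seen.items()
        if any(len(vals) > 1 for vals in fields.values())
    }


conflicts = conflicting_fields(SAMPLE)
print(conflicts)
```

Here the Age component is reported because its two rows disagree on the description, which is exactly the uncoordinated update described above.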

@Robsteranium
Contributor Author

The CLARIAH Cow project demonstrates another way to resolve this: under their re-interpretation of the specification, a datatype of xsd:anyURI is converted into a URI reference instead of a typed literal, so you could use a schema like:

      {
        "name": "codelist",
        "titles": "codelist",
        "propertyUrl": "qb:codeList",
        "datatype": "xsd:anyURI",
        "separator": ";"
      }

@RickMoynihan
Member

RickMoynihan commented Jun 3, 2020

I feel that their interpretation is technically incorrect, though, as an xsd:anyURI value is a typed literal rather than an RDF resource, AFAICR.
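For anyone following along, the distinction can be seen in how the same string serialises in N-Triples. The Python sketch below is illustrative only: the helper names are hypothetical and no escaping of special characters is attempted.

```python
XSD_ANY_URI = "http://www.w3.org/2001/XMLSchema#anyURI"


def as_typed_literal(value):
    """Serialise the value as a literal with datatype xsd:anyURI."""
    return f'"{value}"^^<{XSD_ANY_URI}>'


def as_iri(value):
    """Serialise the value as an IRI, i.e. a reference to an RDF resource."""
    return f"<{value}>"


codelist = "http://example.com/age-in-years"
literal_term = as_typed_literal(codelist)
iri_term = as_iri(codelist)
print(literal_term)
print(iri_term)
```

The first form is a literal that happens to contain a URI string; only the second can be followed as a link to the codelist resource, which is what qb:codeList is meant to point at.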
