EMLdataset2SQL #1

Open
gastil opened this issue Oct 22, 2019 · 0 comments

I suggest adding another tool, EMLdataset2SQL, that would compose the SQL to load data from an EML dataset into a relational database. Generically it would target any SQL database, but Postgres in particular.

I suggest these DDL items:

  • CREATE TABLE for each entity
  • attributeName for column names
  • data type derived from storageType if present, otherwise from the measurementScale domain
  • enumerations could go into a CHECK constraint (just codes) or even a parent table.
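A minimal sketch of the proposed DDL generation, assuming attribute metadata has already been parsed out of the EML attributeList. The TYPE_MAP and the dict shape here are illustrative assumptions, not anything defined by EML:

```python
# Hypothetical sketch: compose a CREATE TABLE statement from attribute
# metadata assumed to be already extracted from EML. TYPE_MAP is an
# assumed storageType-to-Postgres mapping, not a standard.
TYPE_MAP = {
    "float": "double precision",
    "integer": "integer",
    "string": "text",
    "date": "date",
}

def create_table_sql(entity_name, attributes):
    """attributes: dicts with 'name', 'storage_type', and optional 'codes'."""
    cols = []
    for att in attributes:
        col = f'"{att["name"]}" {TYPE_MAP.get(att["storage_type"], "text")}'
        codes = att.get("codes")
        if codes:
            # enumeratedDomain codes become a CHECK constraint on the column
            allowed = ", ".join(f"'{c}'" for c in codes)
            col += f' CHECK ("{att["name"]}" IN ({allowed}))'
        cols.append(col)
    return f'CREATE TABLE "{entity_name}" (\n  ' + ",\n  ".join(cols) + "\n);"

print(create_table_sql("algae_cover", [
    {"name": "site", "storage_type": "string", "codes": ["LTER1", "LTER2"]},
    {"name": "percent_cover", "storage_type": "float"},
]))
```

The alternative mentioned above, a parent table of codes, would replace the CHECK with a foreign key to a small lookup table.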

And this DML item:

  • bulk upload with COPY for each table

The DDL portion can generate a CREATE TABLE statement from EML, and this could also serve the Quality Engine's data-load check. However, the column definitions could use much tighter data types. The entityName could be converted into a suitable table name by substituting an underscore for each non-alphanumeric character.
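The entityName-to-table-name substitution might look like this (a sketch; collapsing runs of separators into a single underscore and lowercasing are extra niceties beyond the plain substitution described):

```python
import re

def table_name(entity_name):
    # Replace each run of non-alphanumeric characters with a single
    # underscore, trim stray underscores, and lowercase for Postgres.
    return re.sub(r"[^0-9A-Za-z]+", "_", entity_name).strip("_").lower()

print(table_name("MCR LTER: Algae Cover (2019)"))  # mcr_lter_algae_cover_2019
```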

The EML <constraint> element is not a commonly used pattern in LTER datasets. However, some EML datasets do include constraints, for example:

knb-lter-mcr.10
knb-lter-mcr.1034
knb-lter-mcr.1037
knb-lter-mcr.1038
knb-lter-mcr.1039
knb-lter-mcr.12
knb-lter-mcr.13
knb-lter-mcr.19
knb-lter-mcr.2
knb-lter-mcr.2002
knb-lter-mcr.2003
knb-lter-mcr.2004
knb-lter-mcr.2006
knb-lter-mcr.21
knb-lter-mcr.3
knb-lter-mcr.4
knb-lter-mcr.4001
knb-lter-mcr.4003
knb-lter-mcr.4005
knb-lter-mcr.5003
knb-lter-mcr.5004
knb-lter-mcr.5005
knb-lter-mcr.6
knb-lter-mcr.6001
knb-lter-mcr.7
knb-lter-mcr.8

When present, the EML <constraint> element could be used to create primary keys, or even foreign keys between tables.
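Assuming key information has already been extracted from the EML <constraint> elements (the plain-list arguments below stand in for that parsed structure and are an assumption), the keys could be emitted as ALTER TABLE statements:

```python
# Sketch: emit key constraints as ALTER TABLE statements, given column
# lists assumed to be parsed from EML <constraint> elements.
def primary_key_sql(table, columns):
    cols = ", ".join(f'"{c}"' for c in columns)
    return f'ALTER TABLE "{table}" ADD PRIMARY KEY ({cols});'

def foreign_key_sql(table, columns, ref_table, ref_columns):
    cols = ", ".join(f'"{c}"' for c in columns)
    refs = ", ".join(f'"{c}"' for c in ref_columns)
    return (f'ALTER TABLE "{table}" ADD FOREIGN KEY ({cols}) '
            f'REFERENCES "{ref_table}" ({refs});')

print(primary_key_sql("algae_cover", ["site", "sample_date"]))
print(foreign_key_sql("algae_cover", ["site"], "sites", ["site_code"]))
```

Emitting the keys after the CREATE TABLE and COPY steps also keeps the bulk load fast, since the indexes are built once at the end.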

Bulk data load with COPY is much faster than issuing an INSERT statement for each row of data, as the Quality Engine does. There, the per-row INSERT serves to check each row; a bulk COPY is all or nothing.
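The COPY command for each table could be composed similarly. Streaming FROM STDIN (via psql's \copy, or a driver's copy API such as psycopg2's copy_expert) avoids needing server-side file access; the option defaults below are assumptions to adjust per entity's physical format in the EML:

```python
# Sketch: compose a client-side bulk-load COPY command for one table.
def copy_sql(table, delimiter=",", header=True):
    # COPY ... FROM STDIN streams client-side data into the table in bulk;
    # it succeeds or fails as a whole, unlike per-row INSERTs.
    opts = ["FORMAT csv", f"DELIMITER '{delimiter}'"]
    if header:
        opts.append("HEADER true")
    return f'COPY "{table}" FROM STDIN WITH ({", ".join(opts)});'

print(copy_sql("algae_cover"))
# COPY "algae_cover" FROM STDIN WITH (FORMAT csv, DELIMITER ',', HEADER true);
```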
