---
title: Metadata
reference-location: margin
citation-location: margin
---
First steps: we should focus first on:
- tables in the DB
- bucket in S3
- set up automation for that (write to the DB)
- see what we do when pull_metadata() is writing to temp/disk
The MDA tracks a variety of metadata on high-traffic tables. Metadata is information about the data that we are sourcing and/or creating, which may include:
- Where the data came from (source)
- When the data was acquired
- What the table names mean
- What the field names mean
- Any information about how the data was created/generated/derived
- and more...
Metadata should always be created for:
- All active source data sets (data stored in database after ETL, not necessarily raw source data).
- All product data sets (data sets exposed in a tool, or MDA-derived data sets used across projects, e.g. the Climate Resiliency data)
- Data produced for a specific (terminal) project does not necessarily require metadata; this is left to the discretion of the analyst.
Additionally, if you plan to expose the data set (schema + table within the PostgreSQL RDS instance) that you are working on to anyone else in the organization, you need to add Metadata.
Metadata should be removed when the corresponding source data is removed from the database.
Data is typically either source data or data generated for specific products or projects. The MDA Data Engineer has primary responsibility for ensuring that source metadata exists. For project- or product-specific data, the project or product owner has ultimate responsibility for creating metadata.
For instructions on creating and accessing MDA metadata, visit the cori_db wiki.
we have:
- tables in the DB (data-prod / data?); tables can support products and/or analysis, or be temporary
- a schema (multiple tables) that stores that information (the metadata)
- an R package (cori.db) that helps update the metadata
- someone needs to fill in a CSV and/or write the descriptions for the metadata to be complete
- metadata can be a dependency of other tools
- it has a "visualizer": CORI Explorer, which depends on the metadata functions
List of specific tables:
# [1] "acs_metadata" "field_metadata"
# [3] "source_metadata" "product_metadata"
# [5] "table_metadata" "pipeline_diagnostics"
A YAML file (inst/params/package_params.yml) defines the structure of 3 tables:
- source_metadata
- field_metadata
- table_metadata

i.e. only those 3 tables are supported by cori.db.
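A quick way to check what the package actually expects is to read that YAML from the installed package. A minimal sketch, assuming the file ships under inst/params/ (so it installs to params/) and the yaml package is available:

```r
# Sketch: inspect the table structures cori.db expects, straight from the
# package YAML.
library(yaml)

params_path <- system.file("params", "package_params.yml", package = "cori.db")
params <- yaml::read_yaml(params_path)

# Should list the three supported tables:
# source_metadata, field_metadata, table_metadata
names(params)
```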
- create_metadata_tables() is the first function to run for a new data set. It has the YAML as a dependency and can be run with or without a df. By default it writes 3 .csv files; if a df is provided, it returns one of the CSVs with the column names already filled in (see the workflow sketch after this list).
- update_metadata() takes con and needs field_meta, table_meta, and source_meta to exist as valid data frames, or it exits. The columns of the 3 data frames must match the YAML, or it stops. If the data is already in the metadata tables, it asks the user whether they want to override it (potential improvement: this will break in non-interactive sessions). The process is repeated for the 3 tables; glue produces multiple queries, hence the need to Vectorize, and then it deletes and appends (times 3; see the glue sketch after this list).
- read_db() (using pull_metadata) will by default check whether the table is present in the 3 metadata tables and save a CSV for each of them with the selected rows.
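Putting the three functions together, the intended workflow looks roughly like the sketch below. Argument names and CSV file names are assumptions based on the notes above, not confirmed signatures:

```r
# Hedged sketch of the metadata workflow; exact signatures may differ.
library(cori.db)
# con: an open DBI connection (see the listing sketch above)

# 1. Generate template CSVs for a new data set. With a df, the column
#    names come pre-filled in the field-level CSV.
new_data <- read.csv("my_new_dataset.csv")   # placeholder data set
create_metadata_tables(new_data)

# 2. After filling in the CSVs, load them back as the data frames that
#    update_metadata() looks for (file names here are assumptions).
field_meta  <- read.csv("field_metadata.csv")
table_meta  <- read.csv("table_metadata.csv")
source_meta <- read.csv("source_metadata.csv")
update_metadata(con)   # prompts before overriding existing rows

# 3. Reading a table back also pulls its metadata (via pull_metadata)
#    and saves a CSV per metadata table with the matching rows.
dat <- read_db(con, "my_table")   # argument names are assumptions
```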
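The glue point in the update_metadata() notes is the usual vectorization wrinkle: interpolating a vector with glue_sql() yields one statement per element, so each DELETE has to be executed individually before the new rows are appended. The notes mention Vectorize; a plain loop over the queries achieves the same effect in this illustrative sketch (table and column names are hypothetical):

```r
# Illustrative delete-then-append step, as described for update_metadata().
library(DBI)
library(glue)

delete_then_append <- function(con, meta_table, new_rows) {
  # glue_sql() over a vector returns one DELETE statement per element
  queries <- glue::glue_sql(
    "DELETE FROM {`meta_table`} WHERE table_name = {new_rows$table_name}",
    .con = con
  )
  # execute each statement, then append the fresh rows
  for (q in queries) DBI::dbExecute(con, q)
  DBI::dbAppendTable(con, meta_table, new_rows)
}
```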
acs_metadata is updated here: https://github.com/ruralinnovation/data-acs/blob/master/99_update_metadata.R. It uses the update_metadata() function described above together with DBI::dbWriteTable(con, "acs_metadata", codebook, overwrite = TRUE). Be careful: update_metadata() is commented out in the main branch.
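For reference, the overwrite step from that script looks like this (the codebook data frame is built earlier in the script; see the link above):

```r
# From data-acs/99_update_metadata.R: replace acs_metadata wholesale with
# the freshly built codebook data frame.
DBI::dbWriteTable(con, "acs_metadata", codebook, overwrite = TRUE)

# update_metadata(...)  # commented out on the main branch; be careful
```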
- metadata in MVP state
- update our tools in production: BEAD / CH
- cori.utils check
- side note: unsure if dplyr::add_rows should be added here