CUREd+ metadata tool: generates a list of all the columns in every table in the database.

CUREd-Plus/curedcolumns

CUREd+ metadata generator

The CUREd+ metadata generator tool generates a list of all the columns in every table in the database.

The data in the target bucket must be arranged in the following directory structure: <data_set_id>/<table_id>/data/**/*.parquet
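To make the expected layout concrete, the following is a minimal sketch of how an object key could be split into its data set and table identifiers. The names `TableKey` and `parse_key` are hypothetical and are not part of curedcolumns; the sketch only assumes the directory structure stated above.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class TableKey(NamedTuple):
    """Identifiers recovered from an object key (hypothetical helper type)."""
    data_set_id: str
    table_id: str


def parse_key(key: str) -> Optional[TableKey]:
    """Return the identifiers for a key, or None if it does not fit the layout.

    Assumed layout: <data_set_id>/<table_id>/data/**/*.parquet
    """
    parts = PurePosixPath(key).parts
    # Need at least: data set, table, the literal "data" folder, and a file name
    if len(parts) < 4 or parts[2] != "data" or not parts[-1].endswith(".parquet"):
        return None
    return TableKey(data_set_id=parts[0], table_id=parts[1])
```

Keys nested in sub-folders below data/ are accepted, while keys outside the data/ folder or without the .parquet extension are rejected.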

The tool produces CSV output (written to the screen unless an output file is specified) with the following columns:

  • data_set_id
  • table_id
  • column_name
  • data_type
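As an illustration of how the output might be consumed, here is a minimal sketch that groups column names by table using Python's standard csv module. It assumes the output starts with a header row containing the four column names listed above; the function name `columns_by_table` and the sample rows are hypothetical and are not real CUREd+ data.

```python
import csv
import io
from collections import defaultdict


def columns_by_table(csv_text: str, delimiter: str = ",") -> dict:
    """Group column names by (data_set_id, table_id).

    Assumes a header row: data_set_id, table_id, column_name, data_type.
    """
    tables = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        tables[(row["data_set_id"], row["table_id"])].append(row["column_name"])
    return dict(tables)


# Hypothetical sample output, for illustration only
sample = (
    "data_set_id,table_id,column_name,data_type\n"
    "ds1,patients,patient_id,int64\n"
    "ds1,patients,admitted,timestamp\n"
    "ds1,visits,visit_id,int64\n"
)
print(columns_by_table(sample))
```

If a different separator was chosen when generating the file, pass the same character as the `delimiter` argument.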

Installation

Ensure Python is installed. (See this tutorial.)

Install the AWS command-line interface (CLI), then configure your access key using the aws configure command.

Install this package using the Python package manager:

pip install curedcolumns

Usage

For basic usage, specify the AWS CLI profile and the name of the bucket you want to inspect:

curedcolumns --profile $AWS_PROFILE $AWS_BUCKET --output $OUTPUT_FILE

If the profile does not exist yet, create it using the aws configure command:

aws configure --profile $AWS_PROFILE

To view the command line options:

$ curedcolumns --help 
usage: curedcolumns [-h] [-v] [--version] [-l LOGLEVEL] [--prefix PREFIX] [--no-sign-request] [--profile PROFILE] [-d DELIMITER] [-o OUTPUT] [-f] bucket

List all the field names for all the data sets in a bucket on AWS S3 object storage and display the metadata in CSV format. This assumes a folder structure in this layout: <data_set_id>/<table_id>/data/*.parquet

positional arguments:
  bucket                S3 bucket location URI

options:
  -h, --help            show this help message and exit
  -v, --verbose
  --version             Show the version number of this tool
  -l LOGLEVEL, --loglevel LOGLEVEL
  --prefix PREFIX       Limits the response to keys that begin with the specified prefix.
  --no-sign-request
  --profile PROFILE     AWS profile to use
  -d DELIMITER, --delimiter DELIMITER
                        Column separator character
  -o OUTPUT, --output OUTPUT
                        Output file path. Default: screen
  -f, --force           Overwrite output file if it already exists
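For scripted use, the options above can be assembled into a command line from Python. This is a minimal sketch, not part of curedcolumns: the helper names `build_command` and `run_curedcolumns` are hypothetical, and only flags shown in the help text are used. It assumes the curedcolumns executable is on the PATH.

```python
import subprocess
from typing import List, Optional


def build_command(
    bucket: str,
    profile: Optional[str] = None,
    output: Optional[str] = None,
    force: bool = False,
) -> List[str]:
    """Build the argument list for a curedcolumns invocation."""
    command = ["curedcolumns"]
    if profile:
        command += ["--profile", profile]
    if output:
        command += ["--output", output]
    if force:
        # Overwrite the output file if it already exists
        command.append("--force")
    # The bucket is the single positional argument
    command.append(bucket)
    return command


def run_curedcolumns(bucket: str, **options) -> None:
    """Run the tool and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(bucket, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with bucket names or file paths that contain spaces.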

Example

To list the columns in a bucket using the AWS CLI profile named "clean":

curedcolumns --profile clean s3://my_bucket.aws.com

Development

See CONTRIBUTING.md.
