The CUREd+ metadata generator tool lists every column of every table in the data sets stored in an AWS S3 bucket.
The data in the target bucket must be arranged in the following directory structure: <data_set_id>/<table_id>/data/**/*.parquet
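To make the expected layout concrete, here is a minimal sketch that checks whether an object key fits the structure above and extracts the two identifiers from it. It is an illustration only, not the tool's own code: the names `KEY_PATTERN`, `TableKey` and `parse_key` are hypothetical, and the sketch assumes that any depth of sub-directories is allowed beneath `data/`.

```python
import re
from typing import NamedTuple, Optional

# Assumed key layout, taken from the structure described above:
#   <data_set_id>/<table_id>/data/**/*.parquet
KEY_PATTERN = re.compile(
    r"^(?P<data_set_id>[^/]+)/(?P<table_id>[^/]+)/data/(?:.+/)?[^/]+\.parquet$"
)


class TableKey(NamedTuple):
    """Identifiers recovered from an object key (hypothetical helper type)."""

    data_set_id: str
    table_id: str


def parse_key(key: str) -> Optional[TableKey]:
    """Return the data set and table IDs if the key fits the layout, else None."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    return TableKey(match["data_set_id"], match["table_id"])
```

For example, `parse_key("my_data_set/patients/data/year=2020/part-0.parquet")` yields the pair `my_data_set` and `patients`, while a key outside a `data/` directory yields `None`.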
This script will generate a CSV file with the following columns:
data_set_id
table_id
column_name
data_type
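Once the CSV file exists, it can be loaded with the Python standard library. The sketch below groups the columns by table. It assumes the file is comma-delimited and starts with a header row naming the four columns above, which this document does not confirm; the sample rows and data type names are made up for illustration, and `columns_by_table` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the four-column layout listed above.
# The header row and the data type names are assumptions.
SAMPLE = """data_set_id,table_id,column_name,data_type
my_data_set,patients,patient_id,int64
my_data_set,patients,birth_date,date32
my_data_set,visits,visit_id,int64
"""


def columns_by_table(handle):
    """Map (data_set_id, table_id) to a list of (column_name, data_type)."""
    tables = defaultdict(list)
    for row in csv.DictReader(handle):
        table = (row["data_set_id"], row["table_id"])
        tables[table].append((row["column_name"], row["data_type"]))
    return dict(tables)


tables = columns_by_table(io.StringIO(SAMPLE))
for (data_set_id, table_id), columns in tables.items():
    print(data_set_id, table_id, [name for name, _ in columns])
```

To read a real output file, pass an open file handle instead of the `io.StringIO` sample, and set the `delimiter` argument of `csv.DictReader` if the file was written with a different column separator.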
Ensure Python is installed. (See this tutorial.)
Install the AWS command-line interface (CLI).
Configure your access key using the aws configure command.
Install this package using the Python package manager:
pip install curedcolumns
For basic usage, specify the AWS CLI profile and the name of the bucket you want to inspect.
curedcolumns --profile $AWS_PROFILE $AWS_BUCKET --output $OUTPUT_FILE
Create an AWS profile using the aws configure command:
aws configure --profile $AWS_PROFILE
To view the command line options:
$ curedcolumns --help
usage: curedcolumns [-h] [-v] [--version] [-l LOGLEVEL] [--prefix PREFIX] [--no-sign-request] [--profile PROFILE] [-d DELIMITER] [-o OUTPUT] [-f] bucket
List all the field names for all the data sets in a bucket on AWS S3 object storage and display the metadata in CSV format. This assumes a folder structure in this layout: <data_set_id>/<table_id>/data/*.parquet
positional arguments:
  bucket                S3 bucket location URI
options:
  -h, --help            show this help message and exit
  -v, --verbose
  --version             Show the version number of this tool
  -l LOGLEVEL, --loglevel LOGLEVEL
  --prefix PREFIX       Limits the response to keys that begin with the specified prefix.
  --no-sign-request
  --profile PROFILE     AWS profile to use
  -d DELIMITER, --delimiter DELIMITER
                        Column separator character
  -o OUTPUT, --output OUTPUT
                        Output file path. Default: screen
  -f, --force           Overwrite output file if it already exists
Use the AWS CLI profile named "clean":
curedcolumns --profile clean s3://my_bucket.aws.com
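The same command can be driven from a Python script. The following is a rough sketch rather than part of the package: `build_command` and `run` are hypothetical helpers that assemble an argument list using only the flags shown in the help text above, and they assume the `curedcolumns` executable is on the PATH.

```python
import subprocess
from typing import List, Optional


def build_command(
    bucket: str,
    profile: Optional[str] = None,
    prefix: Optional[str] = None,
    output: Optional[str] = None,
    force: bool = False,
) -> List[str]:
    """Assemble a curedcolumns argument list from the documented options."""
    command = ["curedcolumns"]
    if profile:
        command += ["--profile", profile]
    if prefix:
        command += ["--prefix", prefix]
    if output:
        command += ["--output", output]
    if force:
        command.append("--force")
    command.append(bucket)  # the positional bucket argument goes last
    return command


def run(bucket: str, **options) -> None:
    """Run the tool and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(bucket, **options), check=True)
```

Calling `run("s3://my_bucket.aws.com", profile="clean")` would then be equivalent to the command above. Passing the arguments as a list avoids shell quoting problems with bucket names and file paths.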
See CONTRIBUTING.md.