Data Lakes & Data Integration

1. Build the Repo

Install the Requirements

Install the necessary packages using the requirements file found in the build folder:

pip install -r build/requirements.txt

Download the Data

Download the dataset (googleai/pfam-seed-random-split) from Kaggle.

Note: The dataset can also be downloaded with the Kaggle API, but this requires Kaggle API credentials, which may make the process longer. To use the Kaggle API, follow these steps:

A - Ensure you have the Kaggle CLI installed:

pip install kaggle

B - Authenticate with Kaggle by placing your kaggle.json file (containing your API credentials) in the ~/.kaggle/ directory.

C - Use the following command to download the dataset:

kaggle datasets download googleai/pfam-seed-random-split
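
Before running the download command, it can help to confirm that the credentials file from step B is in place. The following is a minimal sketch, not part of the repository: the function name check_kaggle_credentials is hypothetical, and it assumes the file holds the "username" and "key" entries found in a Kaggle API token.

```python
import json
from pathlib import Path

# Default location described in step B.
DEFAULT_PATH = Path.home() / ".kaggle" / "kaggle.json"


def check_kaggle_credentials(path: Path = DEFAULT_PATH) -> bool:
    """Return True if the file exists, parses as JSON and holds both entries.

    Hypothetical helper for illustration; it only inspects the file and
    never contacts Kaggle.
    """
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    # Assumption: a valid token stores non-empty "username" and "key" values.
    return (
        isinstance(creds, dict)
        and bool(creds.get("username"))
        and bool(creds.get("key"))
    )


if __name__ == "__main__":
    print("Kaggle credentials found:", check_kaggle_credentials())
```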

Organize the Data

Move the contents of the dataset (train, dev, test, random_split) to a data/bronze/ folder.
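
The move can also be scripted. This is a minimal sketch under stated assumptions: the function organize and its download_dir argument are hypothetical, and it assumes the four folders sit directly inside the directory where the archive was extracted.

```python
import shutil
from pathlib import Path
from typing import List

# Folder names listed in the step above.
DATASET_FOLDERS = ("train", "dev", "test", "random_split")


def organize(download_dir: Path, bronze_dir: Path = Path("data/bronze")) -> List[Path]:
    """Move each dataset folder found in download_dir into bronze_dir.

    Hypothetical helper: folders that are missing, or already present in
    bronze_dir, are skipped rather than overwritten.
    """
    bronze_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in DATASET_FOLDERS:
        src = download_dir / name
        dst = bronze_dir / name
        if src.exists() and not dst.exists():
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved


if __name__ == "__main__":
    # Assumption: the archive was extracted into the current directory.
    for path in organize(Path(".")):
        print("moved", path)
```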

Unpack the Data

Unpack the data using the unpack_data.py script found in the build folder.

python build/unpack_data.py --input_dir data/bronze/ --output_file data/bronze/combined_data.csv
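
To give an idea of what unpacking involves, here is a rough, dependency-free sketch of combining files into a single CSV. It is not the code of build/unpack_data.py: the function combine_shards is hypothetical, and it assumes each of the train, dev and test folders contains CSV shards that share the same header row.

```python
import csv
from pathlib import Path
from typing import Iterable


def combine_shards(input_dir: Path, output_file: Path,
                   splits: Iterable[str] = ("train", "dev", "test")) -> int:
    """Concatenate every shard under input_dir/<split>/ into one CSV.

    Returns the number of data rows written. Assumption: every shard is a
    CSV file whose first line is the same header.
    """
    rows_written = 0
    header = None
    with output_file.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for split in splits:
            split_dir = input_dir / split
            if not split_dir.is_dir():
                continue
            # Sort the shards so the output order is reproducible.
            for shard in sorted(p for p in split_dir.iterdir() if p.is_file()):
                with shard.open(newline="", encoding="utf-8") as f:
                    reader = csv.reader(f)
                    shard_header = next(reader, None)
                    if shard_header is None:
                        continue  # empty shard
                    if header is None:
                        header = shard_header
                        writer.writerow(header)  # write the header once
                    elif shard_header != header:
                        raise ValueError(f"Unexpected header in {shard}")
                    for row in reader:
                        writer.writerow(row)
                        rows_written += 1
    return rows_written


if __name__ == "__main__":
    n = combine_shards(Path("data/bronze"), Path("data/bronze/combined_data.csv"))
    print(f"wrote {n} rows")
```

Because only the split sub-folders are read, the output file can live in data/bronze/ without being picked up as an input.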

2. Data Analysis

The data_analysis.ipynb notebook provides a quick analysis to help you understand the data. Your goal should be to understand the data and why the transformations suggested in src/preprocess.py are needed.

3. Data Pre-processing

The data must be preprocessed to be staged from the bronze layer to the silver layer. Your preprocessing script should drop rows with missing values (if any), encode labels, split the data into train/dev/test sets, drop unneeded columns, and save class weights for training.

python src/preprocess.py --data_file data/bronze/combined_data.csv --output_dir data/silver/
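
The preprocessing steps listed above can be sketched in plain Python as follows. This is only an illustration, not the content of src/preprocess.py: the function preprocess, the 80/10/10 split, the "balanced" weighting formula and the column names in the usage example (sequence, family_accession) are all assumptions, and a real script would more likely rely on the packages from build/requirements.txt and write its outputs to data/silver/.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def preprocess(rows: List[Row], label_col: str, keep_cols: List[str],
               seed: int = 0, fractions: Tuple[float, float] = (0.8, 0.1)):
    """Return (splits, label_to_id, class_weights) for a list of row dicts."""
    # 1. Drop rows with a missing value in any column we rely on.
    needed = keep_cols + [label_col]
    clean = [r for r in rows if all(r.get(c) not in (None, "") for c in needed)]

    # 2. Encode labels as integers, sorted so the mapping is reproducible.
    labels = sorted({r[label_col] for r in clean})
    label_to_id = {lab: i for i, lab in enumerate(labels)}

    # 3. Drop every column except keep_cols, and add the encoded label.
    encoded = [
        {**{c: r[c] for c in keep_cols}, "label": label_to_id[r[label_col]]}
        for r in clean
    ]

    # 4. Shuffle, then split into train/dev/test (test takes the remainder).
    random.Random(seed).shuffle(encoded)
    n_train = int(len(encoded) * fractions[0])
    n_dev = int(len(encoded) * fractions[1])
    splits = {
        "train": encoded[:n_train],
        "dev": encoded[n_train:n_train + n_dev],
        "test": encoded[n_train + n_dev:],
    }

    # 5. "Balanced" class weights from the training set:
    #    n_samples / (n_classes * class_count).
    counts = Counter(r["label"] for r in splits["train"])
    n = len(splits["train"])
    class_weights = {lab: n / (len(counts) * c) for lab, c in counts.items()}
    return splits, label_to_id, class_weights


if __name__ == "__main__":
    # Toy rows with assumed column names, for demonstration only.
    toy = [{"sequence": f"SEQ{i}", "family_accession": "A" if i % 3 else "B"}
           for i in range(20)]
    toy.append({"sequence": None, "family_accession": "A"})  # gets dropped
    splits, label_to_id, weights = preprocess(
        toy, label_col="family_accession", keep_cols=["sequence"])
    print({name: len(part) for name, part in splits.items()}, label_to_id)
```

Computing the weights from the training split only keeps the dev and test sets from influencing training. A plain shuffle does not guarantee that every class appears in each split, so a stratified split may be preferable when some classes are rare.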
