BenioffOceanInitiative/cccYoloTraining

 
 

Collection of all our training data prep code

If working on the VM, make sure to activate the yolo environment

conda activate yolo

If you're working on a different machine, you'll need to create a new environment and install the requirements

conda create -n <env_name> 
conda activate <env_name>
cd <path_to_this_project>
pip install -r requirements.txt

The data directory

The contents of the data directory are ignored by version control, so it's just a place to store whatever you want locally without worrying about it getting pushed. Currently the original annotation file is in there.

Basic workflow for dataset creation

  • Create an images directory, and put all of your images there.

    • If your image directory contains nested folders, use the flatten_image_dirs.sh script to flatten the directory (see "A note on Image sets").
  • Create an annotations directory, and put all of your annotations there.

  • Make sure all the images have the same extension (see "File extensions").

Order of operations

The order of operations is important. If you are not using the cli, follow this order when creating a new dataset.

  • Create the images and annotations directories if they don't exist
  • Create an experiments directory if it doesn't exist
  • Add your images to the images directory
  • Add your annotations to the annotations directory
  • Flatten the images directory if needed
  • Ensure all the images have the same extension (see "File extensions")
  • Make sure the annotations are 0 indexed (see "Data Augmentation Scripts")
  • Resize the images if needed
  • Remove any classes if needed
  • Perform any augmentations
  • Convert to yolo format
  • Split the dataset into train, val, and test sets

Checklist

This is a checklist of things to do when creating a new dataset. If you're using the cli, you can skip most of these steps, but it's good to know what's going on under the hood.

  • Are ALL the images in the images directory?
  • Is the image directory flattened?
  • Are the annotations in the annotations directory?
  • Are the annotations 0 indexed?
  • Is there a filename key in the annotations of the annotation file?
  • Are the file paths in the annotations flattened?
  • Do all the images have the same extension?
  • Did you resize the images?
  • Did you remove any classes?
  • Did you perform any augmentations?
  • Did you convert to yolo format?
  • Is there a data.yaml file in the converted data directory with the correct number of classes and the class names list?
  • Did you split the dataset into train, val, and test sets?
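
Several of the checks above can be automated. The following is a minimal sketch and is not part of this repository: the function name audit_dataset is hypothetical, and it assumes a standard COCO JSON layout in which the top-level categories entries carry an id and the images entries carry a file_name. It only covers the directory, extension, indexing, and path checks.

```python
import json
from pathlib import Path


def audit_dataset(image_dir, annotation_file):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    image_dir = Path(image_dir)

    # Is the image directory flattened? (no sub-directories)
    entries = list(image_dir.iterdir())
    if any(entry.is_dir() for entry in entries):
        problems.append("image directory contains nested folders")

    # Do all the images have the same extension? (case-sensitive: .JPG != .jpg)
    extensions = {entry.suffix for entry in entries if entry.is_file()}
    if len(extensions) > 1:
        problems.append(f"mixed file extensions: {sorted(extensions)}")

    with open(annotation_file) as f:
        coco = json.load(f)

    # Are the annotations 0 indexed, with no gaps in the category ids?
    category_ids = sorted(cat["id"] for cat in coco.get("categories", []))
    if category_ids != list(range(len(category_ids))):
        problems.append(f"category ids are not 0..N-1: {category_ids}")

    # Are the file paths in the annotations flattened?
    for image in coco.get("images", []):
        name = image.get("file_name", "")
        if "/" in name or "\\" in name:
            problems.append(f"nested path in annotations: {name}")
            break  # one example is enough to flag the issue

    return problems
```

Calling audit_dataset("images", "annotations/<annotation_file>.json") and printing the result gives a quick pass/fail summary before moving on to resizing and augmentation.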

cli.py is a command-line interface for the dataset_builder class. Once you have your images and annotations in the correct directories, you can use the cli to create a new project or add to an existing project. The simplest approach is to run python cli.py in the terminal; you will be guided through generating your dataset.

The Dataset Builder Class

The dataset_builder.py file contains the dataset_builder class, which is the main class for creating and modifying datasets. There is too much to cover here, but the main things to know are:

  • The dataset_builder class is initialized with a name, a coco json annotation file, and a directory of images.
  • You can combine other datasets
  • Remove classes
  • Augment images
  • Convert annotations to yolo format
  • Split the dataset into train, test, and val sets
  • The class creates and maintains a config file for the project. This is where the state of the dataset is stored, and it is used to pick up where you left off or to make changes to the dataset. Once you have a dataset, you can use the cli to add to it or modify it. Be careful when manually editing the config file; it's best to let the cli handle it.

What the cli does

  • Really, the cli is just an interface for the dataset_builder class, and is a bit easier to use than the class itself, since it ensures the correct order of operations.

  • Given the name of an existing project, the config file is loaded and the dataset_builder is initialized with the project's config

  • Given a new project name, creates a new project directory under the experiments directory with the name you provide

  • Creates and maintains a config file for the project

  • The config file:

    • The config file is like the state management and history of your project: it keeps track of the images and annotations you've added to the dataset and the data augmentation operations you've performed, and allows you to pick up where you left off or make changes to the dataset.
    • The config file is a json file, and is created in the project directory when you create a new project.
    • You can manually edit the config file BEFORE you run the cli, but it's best to let the cli handle it.

A note on Image sets

Sometimes the datasets exported from CVAT have nested folders, but the scripts here need all the images in one directory. Use the flatten_image_dirs.sh script to quickly flatten a folder of images that contains nested folders. Edit the script so that it points to the correct directory, then run it:

./flatten_image_dirs.sh

If you get a permissions error, run

chmod +x flatten_image_dirs.sh
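
If you would rather do this from Python, the same idea can be sketched as follows. This is not the contents of flatten_image_dirs.sh; the function name flatten_dir and the way name collisions are handled (prefixing the file name with its folder path) are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def flatten_dir(root):
    """Move every file in nested folders of `root` directly into `root`."""
    root = Path(root)
    moved = []
    # Snapshot the nested files first, since the tree changes while we move things.
    nested_files = [p for p in root.rglob("*") if p.is_file() and p.parent != root]
    for path in nested_files:
        target = root / path.name
        if target.exists():
            # Assumption: avoid overwriting by prefixing with the relative folder path.
            prefix = "_".join(path.relative_to(root).parts[:-1])
            target = root / f"{prefix}_{path.name}"
        shutil.move(str(path), str(target))
        moved.append(target.name)
    # Remove the now-empty sub-directories, deepest first.
    for folder in sorted((p for p in root.rglob("*") if p.is_dir()), reverse=True):
        if not any(folder.iterdir()):
            folder.rmdir()
    return moved
```

Note that any file renamed to avoid a collision no longer matches its entry in the annotation file, so check the returned names against your annotations.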

File extensions

Another small hiccup: ensure all the images have the same extension. Some of our older datasets have a capitalized .JPG; modify and use the change_file_extension.sh script for this. The script is also a handy template for making other changes to file names, so keep it in mind.

./change_file_extension.sh

Same deal with the permissions

chmod +x change_file_extension.sh
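
For reference, a Python version of the same rename could look like the sketch below. It is not a copy of change_file_extension.sh; the function name normalize_extensions and its defaults (.JPG to .jpg) are assumptions based on the case described above.

```python
from pathlib import Path


def normalize_extensions(image_dir, old_ext=".JPG", new_ext=".jpg"):
    """Rename every file ending in `old_ext` so that it ends in `new_ext`."""
    renamed = []
    for path in sorted(Path(image_dir).iterdir()):
        if not (path.is_file() and path.suffix == old_ext):
            continue
        target = path.with_suffix(new_ext)
        # Skip rather than overwrite if a different file already has the target name.
        if target.exists() and not target.samefile(path):
            continue
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

If the annotation file stores the old file names, it needs the same change, otherwise the images and annotations will no longer match.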

Copying the dataset to the VM from a bucket

gsutil cp gs://<bucket_name>/<file_name>.zip .

Be mindful of disk usage

Once you have a dataset and you've performed your training, move the dataset off the VM. It's okay to keep the main set of images, since the cli copies the files into your experiment directory, but the VM has limited disk space. We have storage buckets in GCP, so use them. To move a dataset to a bucket, zip the whole dataset directory, upload it to a bucket with gsutil, then delete the dataset from the VM. You can always download it again later. To use gsutil:

zip the dataset directory

zip -r <new_name> <data_dir>

copy the zip file to a bucket

gsutil cp <file_name>.zip gs://<bucket_name>

remove the dataset from the VM

rm -r <data_dir>

Data Augmentation Scripts

There are a host of data augmentation scripts in the utility_scripts directory. If you want to perform single operations on a dataset, look there. Here's a list of some of the available scripts. Always check the script for defaults and usage! There is no undo button!

coco2yolo.py

  • Converts COCO annotations to YOLO format
  • Usage: python coco2yolo.py --dir <directory to save annotations> --annotation_file <json annotations file>
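
To make the conversion concrete: a COCO box is stored as [x_min, y_min, width, height] in pixels, while a YOLO label line is the class id followed by the box center and size, each divided by the image dimensions. The sketch below shows only that arithmetic; the function names are hypothetical and this is not the code in coco2yolo.py.

```python
def coco_to_yolo_bbox(bbox, image_width, image_height):
    """Convert a COCO [x_min, y_min, w, h] pixel box to normalized YOLO (xc, yc, w, h)."""
    x_min, y_min, box_w, box_h = bbox
    x_center = (x_min + box_w / 2) / image_width
    y_center = (y_min + box_h / 2) / image_height
    return (x_center, y_center, box_w / image_width, box_h / image_height)


def yolo_label_line(category_id, bbox, image_width, image_height):
    """Format one line of a YOLO label file: '<class> <xc> <yc> <w> <h>'."""
    values = coco_to_yolo_bbox(bbox, image_width, image_height)
    return f"{category_id} " + " ".join(f"{v:.6f}" for v in values)


# A 320x240 box whose top-left corner is at (160, 120) in a 640x480 image
# is centered in the image and covers half of each dimension.
print(yolo_label_line(0, [160, 120, 320, 240], 640, 480))
# prints: 0 0.500000 0.500000 0.500000 0.500000
```

Because the class id is written straight into the label file, this is also why the annotations need to be 0 indexed before converting.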

add_filename_key.py

  • Adds a filename key to the annotations in a coco json file

auto_map_classes.py

  • Given a coco json annotation file, remaps the class_ids to be 0 indexed and fixes missing class_ids
  • Returns a dictionary of class_ids and class_names
  • Outputs a data.yaml file for use with yolov5 and v8, containing the class name list and the number of classes
  • Usage: python auto_map_classes.py --annotations_path <path to annotations> --write <boolean to write the class dictionary to a file>
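
The remapping idea can be sketched as follows. This is an illustration under assumptions, not the script's implementation: the function name remap_category_ids is hypothetical, it assumes the standard COCO categories and annotations lists, and it closes gaps by keeping the existing id order.

```python
def remap_category_ids(coco):
    """Renumber category ids to 0..N-1 in place and return the old -> new mapping."""
    old_ids = sorted(cat["id"] for cat in coco["categories"])
    mapping = {old: new for new, old in enumerate(old_ids)}
    for cat in coco["categories"]:
        cat["id"] = mapping[cat["id"]]
    for ann in coco["annotations"]:
        # Raises KeyError if an annotation refers to a category that is not listed.
        ann["category_id"] = mapping[ann["category_id"]]
    return mapping


coco = {
    "categories": [{"id": 1, "name": "a"}, {"id": 3, "name": "b"}, {"id": 4, "name": "c"}],
    "annotations": [{"category_id": 3}, {"category_id": 4}, {"category_id": 1}],
}
print(remap_category_ids(coco))  # prints: {1: 0, 3: 1, 4: 2}
```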

remove_classes.py

  • Removes all instances of specified classes from a COCO JSON annotation file
  • Usage: python remove_classes.py --annotations_path <path to annotations> --output_file <path to output annotations> --classes <list of classes to remove, separated by spaces: 1 2 3 etc.>
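
As a rough picture of what removing classes means for a COCO file, here is a minimal sketch. The function name remove_classes_from_coco is hypothetical and the code is an assumption about the general technique, not the contents of remove_classes.py.

```python
def remove_classes_from_coco(coco, class_ids):
    """Return a new COCO dict without the given category ids or their annotations."""
    remove = set(class_ids)
    return {
        **coco,  # keep images and any other top-level keys unchanged
        "categories": [c for c in coco["categories"] if c["id"] not in remove],
        "annotations": [a for a in coco["annotations"] if a["category_id"] not in remove],
    }
```

Removing a class from the middle of the list leaves a gap in the ids. Whether remove_classes.py re-indexes the remaining ids is not stated here, so check the output before converting to yolo format.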

augment_classes.py

  • Augments images containing the given category IDs and adds the specified number of annotations to the new augmented images.
  • Please see the script for defaults and usage
  • Usage: python augment_classes.py --annotations_path <path_to_annotations> --image_folder <path_to_original_images> --output_folder <path_to_new_augmented_images> --augment_categories <string list of ids and the number of augmentations to add to each id, e.g. '0,2 1,3'>

combine_datasets.py

  • Given two json annotation files, combines them into a single annotation file with consistent class_ids.
  • Usage: python combine_datasets.py --annotations1 <path_to_annotations1> --annotations2 <path_to_annotations2>

split_data.py

  • Splits a dataset into train, val and test sets
  • Outputs files ready for YOLO training under a datasets directory in the specified output directory
  • Also outputs bar graphs of class distribution in each set
  • Usage: python split_data.py --file_dir <path_to_yolo_labels_directory> --output_dir <path_to_output_dir> --images <path_to_images> --image_type <jpg/png>
  • Check the script for defaults
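
The core of a train/val/test split is a seeded shuffle followed by slicing. The sketch below illustrates that step only; the function name split_items and the 70/20/10 ratios are assumptions, not the defaults of split_data.py.

```python
import random


def split_items(items, train=0.7, val=0.2, seed=0):
    """Shuffle deterministically and slice into train, val, and test lists."""
    items = sorted(items)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],  # whatever remains
    }
```

Splitting on image names (rather than on label lines) keeps every annotation of an image in the same set.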

resize.py

  • Resizes all images in a directory to a specified size, default is 640 width
  • Usage: python resize.py --image_dir <path_to_images> --output_dir <path_to_output_dir> --target_width <width>

Uploading the Data to the VM

If you didn't use the cli.py to generate your dataset, make sure to check some things before you proceed.

Make sure that you have a data.yaml file in your data directory with the correct number of classes and the class names list, use the example_data.yaml as a template. The cli and dataset_builder class will create this file for you, but if you're doing it manually, make sure it's correct.

The order of the classes in the class name list matters: the index of each class name needs to correspond to the category_id in the annotations. For example, if the first class in the list is "person", then the category_id for all the person annotations needs to be 0. The second class in the list needs to have the category_id 1, and so on.

The paths to the directories in the data.yaml need to be checked as well. If you attempt to train and receive an error about the path to the train, val, or test directory, check the path in the data.yaml file and make sure it is correct.
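
To show how the class order and the paths fit together, here is a small sketch that builds the text of a data.yaml from a COCO categories list. The function name build_data_yaml, the class names, and the directory paths are placeholders; the keys train, val, test, nc, and names follow the usual yolov5 dataset file, but example_data.yaml in this repository remains the template to follow.

```python
def build_data_yaml(categories, train_dir, val_dir, test_dir):
    """Return data.yaml text whose names list is ordered by category id."""
    ordered = sorted(categories, key=lambda c: c["id"])
    ids = [c["id"] for c in ordered]
    if ids != list(range(len(ids))):
        raise ValueError(f"category ids must be 0..N-1, got {ids}")
    names = [c["name"] for c in ordered]
    lines = [
        f"train: {train_dir}",
        f"val: {val_dir}",
        f"test: {test_dir}",
        f"nc: {len(names)}",
        # Simple quoting: assumes class names contain no apostrophes.
        "names: [" + ", ".join(f"'{n}'" for n in names) + "]",
    ]
    return "\n".join(lines) + "\n"


# Placeholder categories, deliberately out of order to show the sorting.
categories = [{"id": 1, "name": "bottle"}, {"id": 0, "name": "person"}]
print(build_data_yaml(categories, "<train_dir>", "<val_dir>", "<test_dir>"))
```

In the output, "person" comes first because its category_id is 0, matching the rule described above.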

Uploading the dataset

If you're training on a different machine, it's best to zip your dataset.

  • Go ahead and zip your new data directory
zip -r <new_name> <data_dir>

Uploading to a bucket and downloading to the VM

  • Upload the zip file to a bucket in GCP (you can use the web interface or gsutil)

  • with gsutil

gsutil cp <file_name>.zip gs://<bucket_name>
  • On the VM, download the zip file using gsutil
gsutil cp gs://<bucket_name>/<file_name>.zip .
  • Unzip the file
unzip <file_name>.zip
  • You should now have all the data unzipped and ready to train

Training yolov5

The VM is already set up with a virtual environment and the yolov5 project. Once the dataset is uploaded and you are logged into the machine, navigate to the yolov5 directory

cd yolov5

And activate the virtual environment

conda activate yolo

You can now train with your new dataset

python train.py --weights yolov5s.pt --epochs <num_epochs> --data <your_data_dir>/data.yaml

See the Ultralytics yolov5 GitHub repository for more information on training.

Misc. scripts

There are a number of utility scripts in the utility_scripts directory to help with data auditing and other things

  • count_annotations.py

    • Counts the number of annotations and repeated images in a coco json file. Good for checking whether an augmentation created duplicate images
  • count_images.py

    • Counts the number of images in a directory
  • count_img_in_ann.py

    • Counts the number of images in a json annotation file
  • count_labels.py

    • Counts the number of label files in a directory
  • draw_boxes.py

    • Draws bounding boxes on an image, given a label file and the image
  • format_json.py

    • Makes a json file more readable
  • labels_per_class.py

    • Outputs a graph of the number of labels per class in an annotation file
  • get_class_dict.py

    • Outputs a dictionary of class names and their corresponding category_id from a json annotation file
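
For a quick audit without hunting for the right script, the counts described above can be approximated in a few lines. This is a sketch under assumptions (standard COCO images and annotations lists, images identified by file_name); the function name summarize_coco is hypothetical and it does not reproduce any of the scripts listed.

```python
from collections import Counter


def summarize_coco(coco):
    """Count images, annotations, annotations per class, and repeated file names."""
    per_class = Counter(ann["category_id"] for ann in coco["annotations"])
    name_counts = Counter(img["file_name"] for img in coco["images"])
    duplicates = sorted(name for name, count in name_counts.items() if count > 1)
    return {
        "num_images": len(coco["images"]),
        "num_annotations": len(coco["annotations"]),
        "annotations_per_class": dict(per_class),
        "duplicate_file_names": duplicates,
    }
```

Load the annotation file with json.load and pass the resulting dictionary to the function; a non-empty duplicate_file_names list is worth investigating after augmentation.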
