Implementing pipeline parallelism does not have to be hard. This repo provides a simple code example for fine-tuning the LLaMA 3.2-1B Instruct model, splitting it across multiple GPUs to address memory limitations during training.
- Pipeline Parallelism: Splits the model across multiple GPUs to address large model memory requirements.
- LoRA Integration: Trains only low-rank adapter weights, which reduces memory usage and helps prevent overfitting (see the sketch after this list).
- DeepSpeed Optimization: I leveraged DeepSpeed for distributed training and efficient memory management.
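Below is a minimal sketch of how LoRA adapters can be attached to the loaded base model with the `peft` library. The rank, alpha, dropout, and target modules are illustrative assumptions, not necessarily the values used in this repo:

```python
from peft import LoraConfig, get_peft_model, TaskType

# Illustrative LoRA hyperparameters (assumed; tune for your setup).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections
)

# `model` is the already-loaded base model (see the loading sketch further below).
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```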
- Python 3.8+
- PyTorch
- Transformers
- Datasets
- DeepSpeed
pip install -r requirements.txt
- Model and Tokenizer Loading: You can use any model, but I used the LLaMA 3.2-1B Instruct model (see the loading sketch after this list).
- Dataset Preparation: Utilizes a custom `SummarizationDataset` class for preprocessing the XSum dataset; you may change this file for your custom dataset.
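For reference, a minimal loading sketch with Hugging Face `transformers`; the checkpoint id below is the public LLaMA 3.2-1B Instruct repo and assumes you have been granted access to it:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LLaMA tokenizers ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```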
Contains the `SummarizationDataset` class for preparing and preprocessing the XSum dataset (a condensed sketch follows the list below).
- Data Preprocessing: Tokenizes the input and output text with padding and truncation.
- Instruction Prefix: Adds an instruction prompt to each document.
- Attention Masks and Labels: Builds attention masks and labels so that padding tokens and special token positions are masked out of the loss calculation.
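A condensed sketch of what a dataset class along these lines might look like. The instruction wording, column names (`document`, `summary`), and max length are assumptions; the repo's actual `SummarizationDataset` may differ:

```python
import torch
from torch.utils.data import Dataset

class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=1024):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        # Instruction prefix prepended to every document (assumed wording).
        prompt = "Summarize the following article:\n" + example["document"] + "\nSummary: "
        full_text = prompt + example["summary"]

        encoded = self.tokenizer(
            full_text,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        input_ids = encoded["input_ids"].squeeze(0)
        attention_mask = encoded["attention_mask"].squeeze(0)

        # Labels: mask padding and the prompt span so they do not contribute to the loss.
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        prompt_len = len(self.tokenizer(prompt)["input_ids"])
        labels[:prompt_len] = -100

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

An instance could then be built from a split of the XSum dataset (e.g. loaded via the `datasets` library) together with the tokenizer loaded above.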
DeepSpeed configuration for optimizing training (an example config is sketched after the list below).
- Zero Redundancy Optimizer (ZeRO) Stage 3: The most memory-efficient ZeRO stage. It partitions all three major consumers of training memory (model parameters, gradients, and optimizer states) across GPUs.
- FP16 Training: Uses dynamic loss scaling (automatically adjusting the scale up or down) to maintain numerical stability during training, especially for gradients.
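For reference, a sketch of a matching DeepSpeed config expressed as a Python dict (the same structure can be saved as a JSON file). The `"auto"` values are resolved by the Hugging Face Trainer integration, and the exact settings in this repo may differ:

```python
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,                                    # shard params, grads, optimizer states
        "overlap_comm": True,                          # overlap communication with compute
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
}
```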
- Prepare Data and Model: Ensure the dataset and model are correctly set up using the provided scripts.
- Run Training (a sketch of the training script's core follows these steps):
  `deepspeed --num_gpus=4 train.py`
- Save Model: The fine-tuned model and tokenizer will be saved in the `./final_model` directory.
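A rough sketch of what the core of `train.py` might look like when wiring everything together with the Hugging Face `Trainer`. It assumes `model`, `tokenizer`, `train_dataset`, and `eval_dataset` were created as in the earlier sketches; the paths and hyperparameters are illustrative:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # checkpoints
    logging_dir="./logs",              # training logs
    num_train_epochs=3,                # assumed value
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    fp16=True,
    deepspeed="ds_config.json",        # assumed config file name (or pass the dict above)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

# Persist the fine-tuned model and tokenizer.
trainer.save_model("./final_model")
tokenizer.save_pretrained("./final_model")
```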
- Training Logs: Stored in the `./logs` directory.
- Model Checkpoints: Saved in the `./results` directory.
- Final Model: Saved in the `./final_model` directory.
- The model is split across GPUs to handle large memory requirements. Adjust `per_device_train_batch_size` and `per_device_eval_batch_size` as needed based on GPU availability.