
[dev-infctx][batch 4] Torch.compile + deepspeed_3 support + typo/notebook/readme changes #45

Open · wants to merge 63 commits into base: dev-infctx

Commits (63)
6b168c6
Added support for lr_final
PicoCreator Jun 25, 2023
32ae35d
Merge branch 'dev-infctx-lr-final-v2' into dev-infctx-lr-final
PicoCreator Jun 29, 2023
84e159e
Merge pull request #7 from PicoCreator/dev-infctx-lr-final
PicoCreator Jun 29, 2023
30f091a
Merge pull request #8 from PicoCreator/dev-infctx-lr-final-v2
PicoCreator Jun 29, 2023
e2cd3ac
Introduction of bptt_learning
PicoCreator Jul 4, 2023
bdf4eae
Revert "Introduction of bptt_learning"
PicoCreator Jul 4, 2023
66b8703
bptt_training param support
PicoCreator Jul 4, 2023
43821f6
Added example bptt_learning config
PicoCreator Jul 4, 2023
086dca4
Clarified bptt_learning_range for multi-gpu setup
PicoCreator Jul 4, 2023
7de40cb
reorder for readability
PicoCreator Jul 4, 2023
de2b570
Fixing links
PicoCreator Jul 4, 2023
6f30583
Merge branch 'dev-infctx' into dev-infctx-bptt-trainer
Blealtan Jul 5, 2023
f9830df
Update model.py
PicoCreator Jul 5, 2023
fa885f0
multi-gpu support done by syncing up the exact same set of manual_bac…
PicoCreator Jul 5, 2023
6365ac6
Fixed the import to `import gc, math`
PicoCreator Jul 5, 2023
4502923
Added fabric sync skip, if optimal conditions are met
PicoCreator Jul 5, 2023
d9945f0
Fixed error messages for bptt_learning_range > 1, which can hang with…
PicoCreator Jul 5, 2023
a879e9c
tweaks to reduce the number of backward pass by 1
PicoCreator Jul 5, 2023
c5de65a
Fixed loss calculation for multiple segment + multi gpu
PicoCreator Jul 6, 2023
221cc86
better wandb logging across multiple GPUs
PicoCreator Jul 6, 2023
58d03fe
Merge pull request #11 from PicoCreator/dev-infctx
PicoCreator Jul 7, 2023
3122f0a
disabled JIT
PicoCreator Jul 7, 2023
00b8a1b
Setting up baseline notebook (reorganizing all notebooks)
PicoCreator Jul 7, 2023
69be751
dryrun config tweak
PicoCreator Jul 7, 2023
31caa05
Added aggressive cuda cache clear option
PicoCreator Jul 7, 2023
ee31578
rename to `substep_cuda_cache_clear`
PicoCreator Jul 7, 2023
6886abc
tweak baseline setup
PicoCreator Jul 7, 2023
c1cd8d3
tweak baseline title
PicoCreator Jul 7, 2023
ad93467
substep_logging mode
PicoCreator Jul 7, 2023
856ec70
Adding torch compile / JIT flags
PicoCreator Jul 7, 2023
5a5bc5a
updated baseline pytorch
PicoCreator Jul 7, 2023
e350c60
WIP torch compile notebooks
PicoCreator Jul 7, 2023
87755ce
WIP notebook organizing
PicoCreator Jul 8, 2023
31a58bb
WIP torch compile perf notebook
PicoCreator Jul 8, 2023
b86268c
torch compile perf logs
PicoCreator Jul 8, 2023
d5f8c89
bptt validation
PicoCreator Jul 8, 2023
7231a07
WIP torch compile optimization
PicoCreator Jul 8, 2023
add115a
setup update
PicoCreator Jul 8, 2023
00b54f5
bptt_validation notebook (cleanup of various config files)
PicoCreator Jul 8, 2023
0e251ae
dropped infctx-validation notebook (in favour of bptt), added gitigno…
PicoCreator Jul 8, 2023
6365106
matmul-precision notebook cleanup
PicoCreator Jul 8, 2023
358e102
torch compile benchmark v1
PicoCreator Jul 8, 2023
6cba59f
better perf, without eager torch.compile
PicoCreator Jul 8, 2023
a68bae6
perf log
PicoCreator Jul 8, 2023
3e83a18
more experiments in torch.compile settings
PicoCreator Jul 8, 2023
0529e3c
optimized torch compile max
PicoCreator Jul 8, 2023
c839b23
fix a dataset column typo
PicoCreator Jul 8, 2023
229ba12
Fixing the separator typo
PicoCreator Jul 8, 2023
68538dd
fix config typo
PicoCreator Jul 8, 2023
e3fc407
typo fix
PicoCreator Jul 8, 2023
154f949
optimizing the TCompileMax options
PicoCreator Jul 8, 2023
5ca4ff7
more TCompileMax tuning
PicoCreator Jul 8, 2023
6f96df5
Finalizing the torch compile tune
PicoCreator Jul 9, 2023
f597ed1
dropped matmul precision file
PicoCreator Jul 9, 2023
b672e28
WIP setup h100
PicoCreator Jul 9, 2023
e3de0df
deepspeed 2 & 3 validation runs
PicoCreator Jul 9, 2023
9e0d7b2
updating deepspeed 2 & 3 benchmark
PicoCreator Jul 10, 2023
4476e3b
tweaking notebook details
PicoCreator Jul 10, 2023
f91baae
tweak
PicoCreator Jul 10, 2023
1894e7f
Added link to HF explanation of deepspeed
PicoCreator Jul 10, 2023
e6d203b
optimizing main forward pass BlockStateList
PicoCreator Jul 10, 2023
456cf0a
Updating deepspeed 2 & 3 perf table
PicoCreator Jul 11, 2023
b71b10d
Merge pull request #13 from PicoCreator/dev-infctx-torch-compile
PicoCreator Jul 11, 2023
4 changes: 4 additions & 0 deletions .gitignore
@@ -153,6 +153,10 @@ datapath/
checkpoint/
node_modules/

# We do capture the notebook generated .log files
# as they are meant to be read as reference
!notebook/**/*.log

# Ignore generated lightning logs and config files
*/lightning_logs/
*/config.yaml
23 changes: 18 additions & 5 deletions README.md
@@ -31,20 +31,33 @@ The following features are not yet supported (that may exist in [blinks original
## Environment setup

The following venv setup uses conda; modify it for your use case as needed
```
```bash
# ninja-build is required for the new trainer
sudo apt-get install ninja-build

# Virtual env, with python 3.11
# Update conda & its package listings
conda update conda

# Virtual env, with python 3.10
# python 3.11 has issues with torch.compile / h100s
# and if you want to use 3.11, you will need to do a nightly build install
conda create -n rwkv-infctx python=3.11 pip
conda activate rwkv-infctx

# Install pytorch
conda install -y pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Install pytorch (>=2.0.1)
conda install -y pytorch==2.0.1 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Currently for torch.compile + 3.11 to work, on some platforms, you will need the nightly build
# if so you may need to try the following instead
# ---
# conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia

# Verify your pytorch version
python -c "import torch; print(torch.__version__)"

# We use python -m pip, instead of pip directly, as it resolves issues with venv not loading the right pip
python -m pip install datasets transformers
python -m pip install lightning==2.0.2 deepspeed==0.9.3
python -m pip install lightning==2.0.4 deepspeed==0.9.5
python -m pip install ninja numexpr jsonargparse 'jsonargparse[signatures]'
python -m pip install lm-dataformat ftfy sentencepiece tokenizers wandb
```
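If torch.compile support on your platform is in doubt (as noted above for python 3.11 / h100s), a quick sanity check like the following can confirm it works. This is only an illustrative sketch, not part of the repo's setup instructions:

```python
# Minimal torch.compile sanity check (illustrative sketch, not from the repo).
import torch

def f(x):
    return torch.sin(x) + torch.cos(x)

compiled_f = torch.compile(f)  # requires pytorch >= 2.0
x = torch.randn(8)
# The compiled function should match the eager result on its first call.
print(torch.allclose(f(x), compiled_f(x), atol=1e-5))
```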
72 changes: 56 additions & 16 deletions RWKV-v4neo/config-example.yaml
@@ -2,8 +2,9 @@
seed_everything: true
trainer:
# Configure the number of GPUs available on your machine
# auto means it will automatically detect and use all GPUs
accelerator: gpu
devices: 1
devices: auto
num_nodes: 1

#
@@ -24,8 +25,7 @@ trainer:
# For more details see:
# https://lightning.ai/docs/pytorch/stable/advanced/model_parallel.html#deepspeed-zero-stage-2
#
#!FIXME: currently only deepspeed_stage_1 is supported, due to that deepspeed cannot handle repeated backward hook.
strategy: deepspeed_stage_1
strategy: deepspeed_stage_2_offload

# Floating point precision for the model, because RWKV is built FOR bf16
# you should pretty much never change this setting
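For reference, a minimal sketch of what these strategy and precision keys correspond to when constructing a Lightning 2.x Trainer directly; the argument values below simply mirror this example config and are not code from the PR:

```python
# Hedged sketch: the YAML strategy/precision keys map onto Lightning Trainer arguments.
from lightning.pytorch import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices="auto",
    strategy="deepspeed_stage_2_offload",  # requires deepspeed to be installed
    precision="bf16-mixed",                # RWKV is built for bf16
)
```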
@@ -128,13 +128,17 @@ trainer:
# Number of datasamples to train for each step, a data sample is considered
# a "substep" in wandb logs, and a "step" is tracked as "trainer/global_step"
#
# This decides the number of datasample, to learn together from, before backproping
# any weight changes at the end of the batch.
# This decides the number of data samples * the number of GPU devices to learn together from,
# before backpropagating any weight changes at the end of the batch.
#
# Recommended to be a big enough number (like 128/256) where it prevents the training
# loss from flucuating in the process. But not too big of a number where the increased
# `1 trainer/global_step = accumulate_grad_batches * number of GPU devices * number of nodes`
#
# Recommended to be a large enough number (like 128/256) for finetuning, where it prevents the
# training loss from fluctuating in the process. But not so large that the increased
# GPU vRAM usage will cause the training to crash.
#
# For foundation model training, a low accumulate_grad_batches like 8/12/16 is recommended.
#
# You are also recommended to configure this to a large enough number to fully utilize
# your GPU processing time %, and avoid idle time for the GPU between batches
accumulate_grad_batches: 256
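As a worked example of the global_step formula above (the GPU and node counts here are assumed for illustration, not taken from this config):

```python
# 1 trainer/global_step = accumulate_grad_batches * number of GPU devices * number of nodes
accumulate_grad_batches = 256
gpu_devices = 4   # assumed value for illustration
num_nodes = 1     # assumed value for illustration

data_samples_per_global_step = accumulate_grad_batches * gpu_devices * num_nodes
print(data_samples_per_global_step)  # 1024 data samples learned from per global step
```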
@@ -191,14 +195,6 @@ model:
# without eating up too much vram by keeping the training context length
# to a reasonable number suitable to the current GPU setup
ctx_len: 2048
# Data samples would be cut down to the respective max ctx_len_cutoffs
# values if its larger then ctx_len. If the data sample is larger then
# the largest len_cutoff, the remaining data will be discarded
ctx_len_cutoffs: [8192, 16384, 32768, 65536]
# Experimental settings, number of tokens to skip in the data sample
# prefix, for the respective cutoff length. Used to speed up the process
ctx_len_warmup_steps: [0, 0, 0, 0]

# Learning rate of the training process
lr_init: 1.0e-04

@@ -209,6 +205,50 @@ model:
adam_eps: 1.0e-08
weight_decay: 0.01

# Back Propagation Through Time, used to work around training of large context lengths
# beyond what can be supported by the current GPU vram architecture
#
# This is not 1:1 equivalent to the same training process done with full vram,
# as the training process is split into multiple segments, part by part,
# with limited learning carried over from each segment.
bptt_learning: true

# Segmented range to perform backprop learning on
# 1 means to apply only to the last segment
# -1 means to apply to all segments
#
# For multi-gpu training, this must be set to 1 due to a known issue,
# otherwise an exception will be thrown
bptt_learning_range: -1

# Limits the bptt learning only to the "current" chunk
# being learned within the learning range. While this reduces the effectiveness
# of bptt, it also further reduces vram requirements.
#
# This is also known as tbptt (Truncated Back Propagation through time)
bptt_truncated_learning: false
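To make the segmented / truncated backprop idea above concrete, here is a minimal, self-contained sketch of truncated BPTT on a toy RNN. It only illustrates the general technique; it is not the trainer's actual implementation (which operates on RWKV block states), and the names and sizes are assumptions:

```python
# Toy truncated-BPTT loop: learn from one long sequence in fixed-size segments,
# detaching the recurrent state between segments to cap vram usage.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 8)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(1, 64, 8)       # one long data sample
target = torch.randn(1, 64, 8)
segment_len = 16                # analogous to training the context in chunks

hidden = None
opt.zero_grad()
for start in range(0, x.size(1), segment_len):
    seg = x[:, start:start + segment_len]
    out, hidden = rnn(seg, hidden)
    loss = nn.functional.mse_loss(head(out), target[:, start:start + segment_len])
    loss.backward()             # backprop within this segment only
    hidden = hidden.detach()    # truncate the graph so memory does not grow with sequence length
opt.step()                      # single weight update for the whole sample
```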

# Aggressively clear the cuda cache between each data sample.
# This causes a performance penalty, but reduces the vram pressure
#
# This is useful for mitigating the following memory pressure warning
# `1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance...`
substep_cuda_cache_clear: false
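As a rough illustration of what such an aggressive clear amounts to (assuming it boils down to an empty_cache call between data samples, which is an assumption rather than a statement about the trainer's internals):

```python
# Hedged sketch: free cached allocator blocks between data samples,
# trading throughput for lower vram pressure.
import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()
```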

# Experimental cutoff settings
# ---
# Data samples would be cut down to the respective max ctx_len_cutoffs
# values if it is larger than ctx_len. If the data sample is larger than
# the largest len_cutoff, the remaining data will be discarded
#
# Leave it as a blank array to disable the feature
ctx_len_cutoffs: []
# Experimental settings, number of tokens to skip in the data sample
# prefix, for the respective cutoff length. Used to speed up the process
#
# Leave it as a blank array to disable the feature
ctx_len_warmup_steps: []

# torch.set_float32_matmul_precision, used to optimize operations with tensor cores
# this should be set to null for non cuda core GPUs
torch_set_float32_matmul_precision: 'high'
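This key mirrors PyTorch's own setting; a minimal sketch of the underlying call:

```python
# Equivalent PyTorch call for the config key above.
import torch

torch.set_float32_matmul_precision('high')  # allows TF32-style fast matmuls on tensor-core GPUs
```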
@@ -296,7 +336,7 @@ data:
# multi_column_keys: ['instruction', 'input', 'output']
# multi_column_prefix: ['Instruction:\n', 'Input:\n', 'Output:\n']
# multi_column_masking: [false, true, false]
# multi_column_seperator: '\n\n'
# multi_column_separator: '\n\n'
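For a sense of how these multi-column settings combine, here is a plain-text sketch of the assembly order; the real implementation in RWKV-v4neo/src/data.py works on tokenized encodings and attention masks rather than strings, and the record below is invented for illustration:

```python
# Rough illustration of multi_column_keys / multi_column_prefix / multi_column_separator.
record = {"instruction": "Summarize the text.", "input": "RWKV is an RNN ...", "output": "An RNN-based LM."}
keys = ["instruction", "input", "output"]
prefixes = ["Instruction:\n", "Input:\n", "Output:\n"]
separator = "\n\n"

parts = []
for key, prefix in zip(keys, prefixes):
    value = record.get(key)
    if value:                      # columns that are missing or empty are skipped
        parts.append(prefix + value)
print(separator.join(parts))
```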

# If processing prompt/completion jsonl pairs, the prompt is masked by default
# use this flag to disable this default behaviour
16 changes: 8 additions & 8 deletions RWKV-v4neo/src/data.py
@@ -59,7 +59,7 @@ def prepare_data_static(**kargs):
# Tokenized encodings for multi column keys
multi_column_enabled = len(multi_column_keys) > 0
multi_column_prefix_encodings = []
multi_column_seperator_encodings = None
multi_column_separator_encodings = None

# Process the multi column settings
if multi_column_enabled:
@@ -69,9 +69,9 @@
# Tokenize the multi column strings
for i in range(len(multi_column_keys)):
multi_column_prefix_encodings.append(tokenizer(multi_column_prefix[i]))
# Tokenize the multi column seperator
# Tokenize the multi column separator
if multi_column_separator is not None and len(multi_column_separator) > 0:
multi_column_seperator_encodings = tokenizer(multi_column_separator)
multi_column_separator_encodings = tokenizer(multi_column_separator)

# Maps the dataset record to the tokenized result
# handles a wide variety of format according to the data configuration
@@ -112,11 +112,11 @@ def map_tokenizer(x):
for i in range(len(multi_column_keys)):
# And process the column if it has data
if multi_column_keys[i] in x and x[multi_column_keys[i]] is not None and len(x[multi_column_keys[i]]) > 0:
# Add the seperator if this is not the first item
if not is_first_item and multi_column_seperator_encodings is not None:
input_ids += multi_column_seperator_encodings['input_ids']
token_type_ids += multi_column_seperator_encodings['token_type_ids']
attention_mask += multi_column_seperator_encodings['attention_mask']
# Add the separator if this is not the first item
if not is_first_item and multi_column_separator_encodings is not None:
input_ids += multi_column_separator_encodings['input_ids']
token_type_ids += multi_column_separator_encodings['token_type_ids']
attention_mask += multi_column_separator_encodings['attention_mask']

# Add the prefix
input_ids += multi_column_prefix_encodings[i]['input_ids']