Initial release
Pre-release
## What's Changed
- Add python project configs by @oelachqar in #1
- Add repo skeleton by @oelachqar in #2
- Export lema entrypoint scripts by @oelachqar in #3
- Update static type checking config by @oelachqar in #5
- Add example jupyter / colab notebook by @oelachqar in #4
- Refactor config parsing to use omegaconf by @oelachqar in #6
- Updating documentation (Dev Environment Setup) by @kaisopos in #7
- Add tests and vscode config by @oelachqar in #8
- Added DPOTrainer example to repo, as well as cuda device cleanup to training loop by @jgreer013 in #9
- Adding torch as top-level module dependency by @optas in #10
- Add configs for specific hardware requirements by @jgreer013 in #11
- Sort pre-commit hooks lexicographically by @xrdaukar in #12
- Add logging config by @oelachqar in #13
- Lema inference by @xrdaukar in #14
- Panos dev by @optas in #16
- Add job launcher by @oelachqar in #15
- Making split of data a flexible variable by @optas in #17
- Configure max file size in precommit hooks by @xrdaukar in #18
- Minor bugfix and documentation update by @oelachqar in #19
- adding pynvml to train env by @kaisopos in #20
- Panos dev by @optas in #22
- Augmenting Types for training hyperparams by @optas in #23
- Train refactoring (config file visibility) + a few minor changes by @kaisopos in #21
- Minimal test for train function by @xrdaukar in #25
- Fix leftover '_torch_dtype' in 'ModelParams' by @xrdaukar in #26
- Update GPU types list in the default SkyPilot config by @xrdaukar in #27
- Add a missing lema-infer command under [project.scripts] by @xrdaukar in #28
- add basic pytests for evaluate and infer by @xrdaukar in #29
- Update README and pyproject.toml by @wizeng23 in #30
- A helper function to print info about available CUDA devices by @xrdaukar in #31
- Update SkyPilot config to start using torchrun by @xrdaukar in #32
- Support basic single-node, multi-gpu training by @xrdaukar in #33
- Run all precommit hooks on the repo by @xrdaukar in #35
- Add experimental code for llama cpp inference by @jgreer013 in #37
- Create skeleton of STYLE_GUIDE.md by @xrdaukar in #36
- Adding support for training custom models (for now just a dummy model). by @kaisopos in #38
- Fix custom model name in test_train.py by @xrdaukar in #39
- Configure pyright (static type checker) and resolve existing type errors to make it pass by @xrdaukar in #41
- fix trailing whitespace warning in STYLE_GUIDE.md by @xrdaukar in #43
- Configure initial GitHub Actions workflow to run pre-commits and tests by @xrdaukar in #44
- A variety of proposed extensions to finetune a chat-based model (starting with Zephyr) by @optas in #34
- Fix syntax error in ultrachat by @xrdaukar in #48
- Create initial version of CONTRIBUTING.md by @xrdaukar in #46
- Reduce the number of training steps from 5 to 3 to make test_train.py faster by @xrdaukar in #49
- Adding registry for custom models. by @kaisopos in #42
- Add config and streaming args to DataParams by @wizeng23 in #47
- Update Pre-review Tests to only run on pull_request by @xrdaukar in #50
- Add training flags to compute tokens-based stats by @xrdaukar in #51
- reduce test training steps in another test which I missed before by @xrdaukar in #53
- Rename var names of *Params classes by @wizeng23 in #52
- Make some NVIDIA-specific dependencies optional by @xrdaukar in #54
- fix trl version as 0.8.6 by @xrdaukar in #56
- Remove reference to torch.cuda.clock_rate by @xrdaukar in #57
- Update inference to support non-interactive batch mode. by @kaisopos in #58
- Update README.md to include Linux/WSL specific instructions by @xrdaukar in #59
- Minor formatting improvements in README.md by @xrdaukar in #60
- Minor: Updating Lora Params by @optas in #55
- Support dataset packing by @wizeng23 in #63
- Disallow relative imports in LeMa by @xrdaukar in #65
- Add text_col param that's required for SFTTrainer by @wizeng23 in #66
- Refactor common config parsing logic (YAML, arg_list) into a common util by @xrdaukar in #68
- Standardize test naming convention by @wizeng23 in #69
- Adding support for a hardcoded evaluation with MMLU. by @kaisopos in #67
- Minor changes to the default configs/skypilot/sky.yaml config by @xrdaukar in #71
- Prototype to pass `config.model.model_max_length` to Trainers by @xrdaukar in #70
- [Inference] Remove the prepended prompts from model responses. by @kaisopos in #73
- Add a util to print versioning info by @xrdaukar in #74
- Switch to tempfile.TemporaryDirectory() in test_train.py by @xrdaukar in #75
- Update docstring verbs to descriptive form by @wizeng23 in #76
- Add sample accelerate and fsdp configs by @xrdaukar in #77
- Refactor code to get device rank and world size into a helper function by @xrdaukar in #79
- Add a simple util to print model summary e.g., layer names, architecture summary by @xrdaukar in #80
- Freeze numpy to pre 2.0 version by @xrdaukar in #81
- Adding inference support for next logit probability. by @kaisopos in #78
- Create FSDP configs for Phi3 by @xrdaukar in #82
- Auto-format pyproject.toml with "Even Better TOML" by @xrdaukar in #83
- Minor cleanup updates to SkyPilot configs by @xrdaukar in #84
- Mixed Precision Training, Flash-Attention-2, Print-trainable-params by @optas in #85
- Update README.md to include basic instructions for multi-GPU training (DDP, FSDP) by @xrdaukar in #86
- Start using $SKYPILOT_NUM_GPUS_PER_NODE in SkyPilot config by @xrdaukar in #90
- Add configs for FineWeb Llama2 pretraining by @wizeng23 in #89
- Quantization by @optas in #87
- Update the default SkyPilot config to print more debug/context info by @xrdaukar in #92
- Add license by @oelachqar in #93
- Initial version of SkyPilot config for multi-node training (num_nodes: N) by @xrdaukar in #94
- MMLU eval refactor. by @kaisopos in #88
- Remove comparison between LOCAL_RANK and RANK by @xrdaukar in #96
- Handling the loading of peft adapters and other minor issues (e.g., adding more logging parameters) by @optas in #91
- Update configs/skypilot/sky_llama2b.yaml to start using sky_init.sh by @xrdaukar in #97
- Add bool param to resume training from the last known checkpoint (if exists) by @xrdaukar in #99
- Inference: save/restore probabilities to/from file. by @kaisopos in #98
- Add support for dataset mixtures during training by @taenin in #95
- Add train, test, and validation splits to the LeMa config. by @taenin in #101
- nanoGPT (GPT2) pretraining recipe by @wizeng23 in #103
- Minor: Updates on Zephyr-Config by @optas in #106
- Update pre-commit config by @oelachqar in #108
- Add integration tests that verify all configs load properly. by @taenin in #102
- Handling Gradient Checkpointing by @optas in #107
- Update skypilot/sky_gpt2.yaml to include an example how to mount GCS dir by @xrdaukar in #111
- Rename dataset_params.dataset_config to dataset_params.subset by @oelachqar in #109
- Refactor SFT dataset preprocessing by @oelachqar in #112
- Support shuffling and random seeds for dataset sampling by @taenin in #113
- Split types file into module by @oelachqar in #114
- Add GCP deps to `lema[cloud]` by @xrdaukar in #117
- Add llama3-instruct jinja template by @jgreer013 in #118
- Update sky_init.sh to print current dir by @xrdaukar in #120
- Add prompt response sft preprocessor factory for aya dataset by @jgreer013 in #121
- Add configs for chatqa model by @oelachqar in #110
- Saving inference probs in `parquet` format. by @kaisopos in #115
- Refactor model registry by @oelachqar in #122
- Define BaseTrainer abstraction by @xrdaukar in #116
- Add a registry for metric functions that we can run during training. by @taenin in #126
- Update training_params.py so HF trainer uses num_train_epochs by @optas in #125
- Add native PyTorch model training by @oelachqar in #123
- [Quick fix] Handle pynvml being misconfigured by @taenin in #128
- Enable DP for inference by @kaisopos in #100
- Add configs for training llama3-8b with aya finetune by @jgreer013 in #130
- Update HF save_model() to only save on master replica by @xrdaukar in #131
- Pipe MetricsFunction from our config to train.py by @taenin in #129
- Fixing broken eval. by @kaisopos in #132
- Minor updates in SkyPilot docstrings by @xrdaukar in #133
- Fix bug with DP evaluation by @oelachqar in #134
- [MMLU custom eval] removing hardcoded subject, samples, num-shots. by @kaisopos in #135
- Add an initial config for async evaluations by @taenin in #137
- Add a new top level command: evaluate_async by @taenin in #138
- Minor bug fix in writing evaluations by @taenin in #140
- Support full GPT2 run by @wizeng23 in #141
- Upload sample configs for running async evals on GPT2 by @taenin in #139
- Apply `torch.distributed.barrier()` in save_model by @xrdaukar in #136
- Create an experimental util to generate pre-tokenized datasets (Parquet files) with `token_ids` column by @xrdaukar in #144
- Created a new dataset class with async loading & tokenization by @jgreer013 in #142
- Remove private debug dir from configs/skypilot/sky_gpt2.yaml by @xrdaukar in #145
- Define dataloader_num_workers and dataloader_prefetch_factor params by @xrdaukar in #146
- [Evaluations] Integration with LM Evaluation Harness by @kaisopos in #143
- Support model compilation by @wizeng23 in #147
- Multiple cleanup changes in configs/skypilot/sky_gpt2.yaml by @xrdaukar in #148
- Update SkyPilot training configs to include `run_name` by @xrdaukar in #149
- Update async eval to properly parse eval configs by @taenin in #150
- Zephyr Configs [full-model, skypilot] by @optas in #152
- Disable model.compile in gpt2 config by @xrdaukar in #154
- Update sky_init.sh to print task id and cluster info by @xrdaukar in #156
- [bug] Include jinja templates in build by @oelachqar in #158
- Add basic scaffolding for torch profiler around training loop by @xrdaukar in #157
- [Minor] Adding `attn_implementation` arg in LM Harness. by @kaisopos in #160
- Update Trainer.save_model to start using the public HF save_model() method (except for PEFT) by @xrdaukar in #161
- Update the vanilla eval config for gpt2 to run hellaswag evals. by @taenin in #165
- Add Dataset base class & API by @oelachqar in #151
- Add experimental notebook to run Nvidia's ChatRAG-Bench evaluation by @oelachqar in #166
- Update ChatQA training configs by @oelachqar in #159
- Update async dataset class to support pre-tokenized datasets by @oelachqar in #162
- Create a launcher script for Polaris jobs (ALCF) by @taenin in #164
- Update pre-tokenized column name to be `input_ids` in `tokenize_dataset` tool by @xrdaukar in #167
- Replacing `EvaluationConfig`'s `DataParams` with `DatasetSplitParams` by @kaisopos in #168
- Submit config to create Custom IAM role for SkyPilot Service Accounts on GCP by @xrdaukar in #169
- Remove GCP project reference by @xrdaukar in #172
- Make sure output training dir exists by @xrdaukar in #171
- Improve launcher usability via command line arguments. by @taenin in #170
- Add a source directory to the Polaris launcher and clean up rsync copies. by @taenin in #173
- Introduce LEMA_RUN_NAME env var to SkyPilot training configs by @xrdaukar in #174
- Minor changes: 1. Remove hardcoded HF_TOKEN 2. Log effective training config by @xrdaukar in #175
- Tweak default params in gpt2 scripts by @xrdaukar in #177
- LM Harness optimizations by @kaisopos in #176
- No longer ignore .git in Polaris. Needed for venv. by @taenin in #179
- A hack for running jobs on Polaris. by @taenin in #180
- [Polaris] Move venv creation from worker to launcher. by @taenin in #181
- Update README.md to include `sky launch - 10 ...` example by @xrdaukar in #182
- [Evaluations] Adding support for HuggingFace's leaderboard v1 benchmarks by @kaisopos in #183
- Llama 3 Aya Fine-Tuning Updates by @jgreer013 in #163
- Remove logger propagation by @wizeng23 in #185
- [Evaluations] HF leaderboard v1 configs by @kaisopos in #186
- Move logging.py to utils by @wizeng23 in #187
- Create the Jobs config for the lema launcher. by @taenin in #188
- Initial abstract base classes for the lema launcher. by @taenin in #189
- Added mfu calculation and tests by @jgreer013 in #190
- Introduce two new training params: save_model and save_epoch by @xrdaukar in #191
- Update FineWeb ablation model configs by @xrdaukar in #196
- Added MFU telemetry by @jgreer013 in #193
- Update Polaris script by @wizeng23 in #192
- Rename `training.save_model` param to `training.save_final_model` for clarity by @xrdaukar in #197
- Support disabling dropout by @wizeng23 in #184
- Update actual mfu calculation by @jgreer013 in #199
- Implement a client for talking to SkyPilot. by @taenin in #201
- Fixed miscalculation of second step start time by @jgreer013 in #202
- Update `ablation-model-fineweb-v1` config to start using grad checkpointing by @xrdaukar in #198
- Add distributed operations by @oelachqar in #194
- Add pre-commit hooks for credential scanning + new checks by @oelachqar in #195
- Sample job for multi-node training by @xrdaukar in #203
- Update Polaris multi-node launcher by @xrdaukar in #204
- Multi-node config improvements for llama2b model (`HuggingFaceFW/ablation-model-fineweb-v1`) by @xrdaukar in #205
- Minor updates to Polaris launcher script by @xrdaukar in #206
- Update Lema FSDP configs by @xrdaukar in #207
- [tiny] add default formatter for markdown by @oelachqar in #210
- Preparations for Lema custom pre-training loop by @oelachqar in #208
- Update MFU callback to support Lema trainer by @oelachqar in #209
- Configure llama2b model to use FSDP HYBRID_SHARD by @xrdaukar in #213
- Implement a Cluster resource manager around Sky Pilot. by @taenin in #214
- Add utils to setup distributed training by @oelachqar in #211
- Add example notebook to train NanoGPT model with Lema by @oelachqar in #212
- [tiny] update sky pilot ssh config by @oelachqar in #215
- Implement a Cloud resource manager around Sky Pilot by @taenin in #216
- Sanitize run name by @xrdaukar in #217
- Use "cluster_name" instead of "name" in the Sky client. by @taenin in #218
- Minor logging improvements in Polaris sample job scripts by @xrdaukar in #219
- Update shell scripts to point to local dataset by @jgreer013 in #221
- Support FSDP on Polaris using accelerate by @xrdaukar in #220
- Add telemetry manager by @oelachqar in #222
- Switch to the latest transformers=4.43.1 by @xrdaukar in #223
- Re-enable model compilation for llama2b model by @xrdaukar in #224
- Increase llama2b batch size from 2 to 3 by @xrdaukar in #225
- Add makefile with common local commands by @oelachqar in #227
- Add DeepSpeed config for Llama2b by @wizeng23 in #228
- MFU Improvements for Llama 2B on Polaris by @jgreer013 in #229
- FSDP config updates by @xrdaukar in #231
- Rename accelerate configs to be in line with other configs by @wizeng23 in #232
- [tiny] Update logger format to include rank, pid and threadname by @oelachqar in #235
- Set model.config.use_cache = False by @xrdaukar in #233
- Experimental training loop for pre-training by @oelachqar in #230
- Disable gradient checkpointing in SkyPilot llama2b config by @xrdaukar in #236
- Implement a client for communicating with Polaris via python. by @taenin in #234
- Add SkyPilot config for `experimental/pretokenize/tokenize_dataset.py` by @xrdaukar in #237
- Update Fabric.run() calls to use the "warn" flag. by @taenin in #239
- Update `pretokenize` tool to support input datasets by @xrdaukar in #238
- Add optimizers builder function by @oelachqar in #240
- Add a "put" method in the Polaris client for writing remote files. by @taenin in #242
- Add deepspeed (DS) config to support hierarchical partitioning by @wizeng23 in #244
- Add support for uploading MFU in wandb by @jgreer013 in #245
- Create a Polaris Cluster class consuming the polaris client by @taenin in #246
- Add initial docker image by @oelachqar in #241
- Fix a string in the Polaris Cluster tests. by @taenin in #249
- Set training loop random seeds by @oelachqar in #248
- Fix bug with Polaris multi-node script by @wizeng23 in #247
- Add torchfix listing target by @oelachqar in #250
- Add training state classes by @oelachqar in #251
- Save and restore telemetry state during training by @oelachqar in #252
- Configure file logging by @oelachqar in #254
- Create a Polaris Cloud class consuming the polaris client by @taenin in #253
- Define a registry for cloud builders. by @taenin in #255
- Add logging to tensor board, wandb in custom training loop by @oelachqar in #256
- Add a `get_all` utility method to the LeMa Registry by @taenin in #257
- Update the BaseCloud `up_cluster` definition to return a job status. by @taenin in #258
- Create a launcher class for the LeMa Launcher. by @taenin in #261
- Add script to benchmark datasets and data loader params by @oelachqar in #260
- [Follow-up] data loader benchmarking script by @oelachqar in #262
- Create DDP configs for `accelerate` by @xrdaukar in #259
- Switch from nightly to stable version of SkyPilot by @xrdaukar in #264
- Make all tests green by @xrdaukar in #265
- Set `dataloader_pin_memory=True` to be intentional by @xrdaukar in #266
- Move `torch_profiler_utils` from `lema.utils` to `lema.perfomance` by @xrdaukar in #267
- Add BaseIterableDataset, refactor DataLoader to use DataPipes by @oelachqar in #263
- Add a `dataset_kwargs` attribute, tests by @oelachqar in #268
- Use stateful dataloader by @oelachqar in #269
- Update the polaris client / cluster to work e2e by @taenin in #270
- Update package structure for the launcher by @taenin in #273
- [tiny] Register debug datasets by @oelachqar in #272
- Update several of our launcher base fields to use strings instead of ints. by @taenin in #274
- Configure data loader sampling strategy for map-style datasets by @oelachqar in #271
- Ensure we CD into the working DIR before submitting polaris jobs. by @taenin in #276
- Compute the number of dataloader workers per node by @xrdaukar in #277
- Introduce BaseTokenizer alias by @xrdaukar in #280
- Cache get_device_rank_info by @xrdaukar in #279
- Adding initial scripts for running polaris jobs. by @taenin in #275
- Update the polaris client to automatically set execute permissions for copied files. by @taenin in #286
- Deprecate building models data parallel by @oelachqar in #282
- Switch to using safetensors when saving models by @oelachqar in #281
- Add ability to validate configs and params after init by @oelachqar in #285
- Some updates to Polaris launcher script by @xrdaukar in #287
- Upgrade to latest TRL version, remove numpy version condition by @oelachqar in #283
- Add learning rate builder function by @oelachqar in #284
- Remove patchwork as a dep. by @taenin in #290
- Set up initial demo launcher jobs for GCP. by @taenin in #288
- [tiny] cleanup pyproject.toml dependencies by @oelachqar in #292
- Make dataset data backend attribute read-only by @oelachqar in #291
- Optimize Github actions by @oelachqar in #289
- Misc minor changes by @xrdaukar in #293
- [tiny] Update GitHub action cache version by @oelachqar in #295
- Rename 'NodeParams' -> 'JobResources' by @taenin in #296
- Disable compilation for DDP `accelerate launch` config by @xrdaukar in #297
- Export top level launcher functions and instantiate a default launcher. by @taenin in #298
- Prevent HF version bump by @taenin in #300
- Add dtype/mixed precision configs to Lema trainer by @wizeng23 in #278
- Create a notebook tutorial for running remote training. by @taenin in #299
- Increase the default value of `ProfilerParams.row_limit` from 20 to 50 by @xrdaukar in #304
- Mini guide on using basic lema functionality by @oelachqar in #303
- Compute MFU based off HF `total_flos` (alternative way to compute MFU) by @xrdaukar in #301
- Support GPT2 training with Lema trainer by @wizeng23 in #302
- Add a client for running local jobs via the launcher. by @taenin in #305
- Add a local cluster for running local jobs. by @taenin in #306
- Support llama2b with lema trainer by @wizeng23 in #308
- Add a convenience method for listing all registered clouds. by @taenin in #310
- [ALCF] Reverse Polaris GPU order to match CPU/GPU affinities by @xrdaukar in #307
- Create a local cloud for the LeMa launcher. by @taenin in #309
- Remove some leftover occurrences of `builtin_` prefix in HF MFU callback by @xrdaukar in #312
- Clean up mixed precision params by @wizeng23 in #311
- Add finetuning tutorial by @oelachqar in #313
- Fix interpolation when loading lema configs. by @taenin in #314
- [bugfix] GPU workers not waiting for global leader to save final checkpoint by @oelachqar in #315
- Add simple benchmark script for distributed operations by @oelachqar in #316
- Add a 'done' field to the LeMa job status object. by @taenin in #317
- Fix a small typo in Lema README by @xrdaukar in #318
- Add pytorch profiler (`-p`) option to `multinode_example_worker.sh` script by @xrdaukar in #319
- Create a simpler tutorial for running jobs. by @taenin in #320
- Minor cleanups in Lema training loop by @xrdaukar in #322
- Remove unbalanced call to `barrier()` in `HuggingFaceTrainer.save_model` by @xrdaukar in #323
- Create a tutorial for custom clouds. by @taenin in #321
- Add support for logging stdout and stderr for Local runs. by @taenin in #324
- Fix nanoGPT notebook by @wizeng23 in #325
- Add more pytorch profiler instrumentations in Lema training loop by @xrdaukar in #327
- Add training param: `dataloader_main_process_only` by @xrdaukar in #326
- fix synchronization issues in LEMA training loop by @xrdaukar in #328
- Update LEMA training loop to count tokens on CPU by @xrdaukar in #330
- Update README.md by @taenin in #331
- Add various improvements to Lema trainer by @wizeng23 in #329
- Add PyTorch profiler annotation for each step/micro-step by @xrdaukar in #333
- Enable `HfMfuTrainerCallback` if supported by @xrdaukar in #332
- Add support for PyTorch profiling schedule by @xrdaukar in #334
- Set up Sphinx-based doc generation for LeMa by @taenin in #335
- Fix dataclass strings to be parsable by our docs generator. by @taenin in #337
- Update ProfilerStepCallback to add `microstep` profiler annotations by @xrdaukar in #338
- Add `include_alternative_mfu_metrics` param to control if HF MFU is enabled by @xrdaukar in #336
- Minor doc formatting updates. by @taenin in #340
- Add 8-bit Adam optimizer to Lema trainer by @wizeng23 in #339
- Enable gradient scaling for fp16 mixed-precision training by @wizeng23 in #342
- Add a link to our documentation via the readme. by @taenin in #344
- Disable weight decay for layernorm/biases in Lema trainer by @wizeng23 in #341
- Polaris: Enable NCCL debug logging at WARNING level by @xrdaukar in #347
- Add a new notebook for getting started. by @taenin in #345
- Create `TelemetryCallback` by @xrdaukar in #343
- Various improvements for our autogenerated docs by @taenin in #349
- Polaris: update sample `tail` command to use `-n200` by @xrdaukar in #348
- Fix a minor bug in `TelemetryCallback.on_train_end` by @xrdaukar in #350
- Update LEMA training loop to log wandb url by @xrdaukar in #351
- Update model dtype for DeepSpeed to make it work with SkyPilot and Polaris by @xrdaukar in #352
- Enable the launcher via the CLI by @taenin in #353
- Update Polaris init script to print nodelist by @xrdaukar in #354
- Minor logging updates in Polaris scripts by @xrdaukar in #355
- Define `ddp1gpu` Polaris mode: Spawn 1 `torchrun` process per GPU (4 `torchrun`-s per node) by @xrdaukar in #356
- Add a helper util to query GPU temperatures by @xrdaukar in #359
- Add Llama 8B config by @wizeng23 in #358
- Add another `barrier()` call before train() by @xrdaukar in #360
- Add Llama70B FSDP config by @wizeng23 in #361
- Minor improvements in logging and instrumentations in `train.py` by @xrdaukar in #362
- Refactor our core directory to logically organize our classes. by @taenin in #357
- Basic plumbing for GPU temperature telemetry by @xrdaukar in #363
- Minor update to Llama70B by @wizeng23 in #365
- Reorder model compilation and DDP/FSDP wrapping by @xrdaukar in #364
- Mini tutorial for Llama3.1-70b inference on Polaris. by @taenin in #367
- jgreer013/vllm-inference by @jgreer013 in #366
- Fix interpolation when using the launcher CLI for various sky configs. by @taenin in #369
- Add Llama8B Lora config for GCP/Polaris by @wizeng23 in #368
- Add vllm parallel inference to improve throughput by @jgreer013 in #370
- Set `TOKENIZERS_PARALLELISM: false` for llama8b model by @xrdaukar in #371
- Disable MFU computation for PEFT by @xrdaukar in #372
- Add `empty_device_cache_steps` param and configure it for Llama8b model by @xrdaukar in #373
- Add `TelemetryCallback.include_timer_metrics` param: `False` by default by @xrdaukar in #378
- Update llama8b GCP launcher script to allow Spot VMs by @xrdaukar in #380
- Minimal Llama8B LoRA eval config by @xrdaukar in #376
- Add Llama 8b SFT config by @wizeng23 in #379
- Move common NCCL variables initialization into `polaris_init.sh` by @xrdaukar in #377
- Minor tuning of llama8b configs by @xrdaukar in #382
- Update eval script to use `Meta-Llama-3.1-8B-Instruct` model version by @xrdaukar in #381
- Initial notebook for llama 8b LoRa tuning. by @taenin in #374
- Update SkyPilot GCP script to download the right model version by @xrdaukar in #385
- Clean up Sky configs by @wizeng23 in #383
- Update main makefile to generate docs by @oelachqar in #386
- Add docs-serve makefile command by @oelachqar in #387
- Fix missing new line at the end of `Makefile` by @xrdaukar in #390
- Raise `NOT_IMPLEMENTED` if `adapter_model` is configured for `LM_HARNESS` eval by @xrdaukar in #391
- Update Llama8B LoRA eval script to use built-in LEMA evaluator by @xrdaukar in #389
- Add Llama 70b lora config by @wizeng23 in #388
- Enable markdown docs by @oelachqar in #394
- Check ignored docstring rules by @oelachqar in #395
- Remove special case for saving PEFT models by @xrdaukar in #384
- Move shared code into polaris_init by @wizeng23 in #392
- Update Llama notebook to include 8B SFT by @wizeng23 in #393
- Update sample commands to point to the preemptable queue by @taenin in #396
- Update lm_harness to support LoRA adapters by @jgreer013 in #397
- Fix FSDP model initialization by @wizeng23 in #398
- Add vscode launch config for accelerate distributed training by @oelachqar in #400
- Update trainer save model by @oelachqar in #399
- Increase from 2 to 3 nodes for Llama 70B Lora by @wizeng23 in #402
- Add param to customize NCCL timeout by @oelachqar in #401
- Add docs and gpu install targets by @oelachqar in #403
- Significant improvements for the Polaris launcher by @taenin in #404
- Ensure that jobs are queued on existing clusters when users call UP by @taenin in #406
- Autostop sky clusters after 30 min of no activity by @taenin in #407
- Add support for triton kernels from Liger Kernel by @oelachqar in #405
- Add support for including notebooks in the docs by @oelachqar in #408
- Update sphinx comments to docstrings by @oelachqar in #411
- Add missing docstrings to TrainingParams by @oelachqar in #409
- Capped model max length for Llama tuning by @wizeng23 in #413
- Fix a deadlock in the Polaris launcher for users with 500+ jobs. by @taenin in #412
- Script to run inference with Llama/GPT judges. by @kaisopos in #414
- Add missing docstrings to top-level configs by @oelachqar in #410
- [tiny] sphinx conf update by @oelachqar in #416
- Improve launcher polling by running tasks in a subprocess. by @taenin in #417
- Add missing package docstrings by @oelachqar in #415
- [tiny] Enable D104 rule by @oelachqar in #419
- Fix bug with 70B Lora by @wizeng23 in #421
- Update the CLI to look for open SSH tunnels as a way of preserving Polaris state by @taenin in #418
- Update the polaris launcher to always update the lema installation on job creation. by @taenin in #422
- Cleanup doc RSTs by @oelachqar in #420
- Add sphinx api doc template for packages by @oelachqar in #425
- Add automatically generated apidoc RSTs by @oelachqar in #424
- [tiny] Move apidocs into their own folder by @oelachqar in #426
- Add docs-rebuild command to Makefile by @oelachqar in #427
- Refresh markdown docs by @oelachqar in #429
- Reorganize our test structure by @taenin in #431
- Add Llama 70B SFT config by @wizeng23 in #428
- Script to generate judge prompts. by @kaisopos in #423
- [tiny] Breakdown main Readme into multiple docs by @oelachqar in #430
- Update main readme file by @oelachqar in #432
- Add GitHub badges, readme typos by @oelachqar in #434
- Fix markdown lint errors by @oelachqar in #433
- Update documentation index by @oelachqar in #436
- [tiny] Only log to console on global leader by @wizeng23 in #435
- Tune sphinx config by @oelachqar in #437
- Enable Liger for Llama 8B SFT by @wizeng23 in #439
- Updated Parallel Inference job by @jgreer013 in #438
- Add a mkdir to polaris init. by @taenin in #440
- [tiny] Fix lema loop performance gap by @oelachqar in #441
- [tiny] update trainer benchmark script and minor updates by @oelachqar in #443
- Add Llama 8B eval script by @wizeng23 in #442
- Add dataset remote code param by @oelachqar in #445
- [docs] Update format + add missing docs to data_params.py by @oelachqar in #444
- Update Polaris Llama8b eval script to enable data-parallel evals for LM_HARNESS by @xrdaukar in #446
- Copy changes from PR-446 into Polaris launcher config by @xrdaukar in #448
- Copy changes from PR-446 into GCP launcher config by @xrdaukar in #449
- Minor fixes in llama8B eval scripts by @xrdaukar in #450
- Add Llama 70B eval script by @wizeng23 in #447
- [bugfix] add is_using_accelerate_fsdp util by @oelachqar in #453
- [tiny] Fix inference notebook by @wizeng23 in #451
- Simplify record_function annotation in LEMA training loop by @xrdaukar in #454
- [tiny] enable ruff format on save with notebooks by @oelachqar in #455
- [tiny] Add missing default value to hf_trainer by @oelachqar in #458
- Judge inference script for Polaris by @kaisopos in #452
- Add the base classes for inference. Pull out logic from `infer` to a native text inference engine. by @taenin in #456
- Telemetry improvements for tracking GPU temperature and in general by @xrdaukar in #457
- Add integration tests for native inference (not using the CLI). by @taenin in #460
- Update README.md by @mkoukoumidis in #462
- Update README to make installation steps more prominent by @taenin in #464
- Fix several broken links and update installation instructions by @taenin in #465
- Update inference to pass the generation config to inference engines. by @taenin in #466
- Update README.md by @taenin in #467
- Fixed issue with metadata extraction failure by @jgreer013 in #469
- Add fsdp support to lema loop by @oelachqar in #463
- Combine telemetry from all ranks by @xrdaukar in #468
- Add sample for full fine-tuned and LoRA-tuned model inference using vLLM by @wizeng23 in #470
- Update chat_template_builder by @oelachqar in #472
- Removed duplicate task_done call by @jgreer013 in #473
- Add flag to enable experimental torch data pipes processing pipeline by @oelachqar in #474
- Vision-language datasets & fine-tuning MVP by @oelachqar in #459
- Rebuild docs, add multi-modal tutorial by @oelachqar in #475
- Add test coverage target, update pyproject.toml metadata by @oelachqar in #476
- Create a local inference engine for vLLM by @taenin in #471
- Add llava chat template, QoL improvement to multimodal testing script by @oelachqar in #478
- [Polaris Judge Inference] Adjusting script for Llama 70B quantized by @kaisopos in #461
- Add example for running inference using vLLM on GCP, single-node multi-gpu by @oelachqar in #479
- [tiny] Remove deepspeed from required dependencies by @oelachqar in #482
- Update train path to save meta-info as files under `telemetry` sub-dir by @xrdaukar in #480
- Add inference engine apply_chat_template helper, update example notebook by @oelachqar in #481
- Update arg names for vLLM inference job by @wizeng23 in #477
- Remove device_map for model init from config by @wizeng23 in #484
- Add `log_model_summary` callback by @xrdaukar in #485
- Small typo fix in the vllm notebook by @taenin in #483
- Cleanup FSDP wrap class auto guesser by @oelachqar in #486
- Add missing documentation for model_params by @oelachqar in #487
- Add callback builder function by @oelachqar in #490
- Minor fixes in DISTRIBUTED_TRAINING.md by @xrdaukar in #488
- Switch to using official UV action with dependency caching by @oelachqar in #491
- Introduce `BaseTrainerCallback` alias by @xrdaukar in #492
- Add documentation to peft_params by @oelachqar in #493
- Update `TelemetryCallback` to save final metrics to JSON by @xrdaukar in #494
- Increase the rsync timeout from 40s to 300s by @taenin in #495
- [tiny] fix missing import by @oelachqar in #497
- Rename build_dataset -> build_dataset_mixture by @oelachqar in #498
- Define a simple callback to detect NaN/INF-s during training by @xrdaukar in #496
- Replace `pip install flash-attn` with `.[gpu]` target by @wizeng23 in #502
- Add simpler builder for single dataset use cases by @oelachqar in #499
- Use HF's built-in gradient checkpointing argument by @wizeng23 in #500
- [Draft] Example changes to support 70B single-node inference by @jgreer013 in #503
- Various updates to Llama 2b configs by @wizeng23 in #489
- Add Llama 2B FSDP config by @wizeng23 in #505
- Update `TelemetryCallback` to write JSON with GPU temperature summary by @xrdaukar in #501
- Rename src/lema to src/oumi by @wizeng23 in #506
- OpenAI Chat Engine - Custom servers by @taenin in #504
- Rename configs/lema to configs/oumi by @wizeng23 in #507
- Rename all relevant lema references in codebase by @wizeng23 in #508
- Re-generate Sphinx docs by @wizeng23 in #509
- Update conf.py by @taenin in #510
- Rename remaining lema references in `docs/` by @wizeng23 in #511
- Update final lema references by @wizeng23 in #512
- Update dev setup guide by @wizeng23 in #513
- Update TOTAL_NUM_GPUS compare commands in SkyPilot configs by @xrdaukar in #514
- [Minor] Issues arose by "newcomer" exploration [1/K] by @optas in #518
- Freeze `lm-eval` and `torch` versions as a workaround for OPE-390 by @xrdaukar in #516
- Multiple updates to Llama 2B by @wizeng23 in #515
- Rename OUMI to Oumi by @wizeng23 in #520
- Add llama.cpp Inference Engine by @oelachqar in #524
- Rename website references to oumi.ai by @wizeng23 in #522
- Add anthropic inference engine by @oelachqar in #523
- Update name typo by @oelachqar in #526
- Add a batch inference job runnable via the Oumi Launcher by @taenin in #527
- Auto-format `pyproject` and `pre-commit` configs by @xrdaukar in #530
- Update Makefile by @taenin in #529
- Fix failing tests after a new install. by @taenin in #531
- Fix a small bug in `infer_interactive()`: only prints the first character by @xrdaukar in #532
- Boosting User-friendliness by @optas in #521
- [tiny] add override from typing_extensions by @oelachqar in #534
- Create CODE_OF_CONDUCT.md by @taenin in #536
- Add conversation helper methods by @oelachqar in #535
- [tiny] cleanup multimodal benchmark script by @oelachqar in #537
- Auto-format shell scripts under `scripts` by @xrdaukar in #539
- Add builder function for data collators by @oelachqar in #538
- Make tokenizer optional by @oelachqar in #540
- Add an optional `-t` flag to scripts/polaris/jobs/llama2b_pt_worker.sh by @xrdaukar in #541
- Fix initial issues found by `shellcheck` by @xrdaukar in #542
- [tiny] fix small typo by @oelachqar in #544
- Minor changes in `scripts/benchmarks/minimal_multimodal_training.py` by @xrdaukar in #543
- [tiny] Add util to get install folder root dir by @oelachqar in #545
- [tiny] Add fp paged_adam optimizer option by @oelachqar in #547
- [tiny] Allow conversation metadata to contain values other than str by @oelachqar in #546
- Switch from Flash Attention 2 to PyTorch SDPA by @wizeng23 in #533
- Use `local_rank` to query GPU temperature by @xrdaukar in #550
- Fix a bug for handling stopped sky clusters in the oumi launcher. by @taenin in #549
- Remove flash attention validation check by @wizeng23 in #551
- Add support for AWS and Azure jobs in Oumi by @taenin in #552
- Pass `split` param to `datasets.load_dataset()` by @xrdaukar in #553
- Implement Judge API MVP by @oelachqar in #548
- Log dataset info: shape, columns, other metainfo by @xrdaukar in #555
- Update experimental pretokenize_dataset tool by @xrdaukar in #554
- Various improvements to Llama eval scripts by @wizeng23 in #556
- Add a couple of `gc.collect()` calls by @xrdaukar in #560
- [tiny] Fix Makefile setup command by @wizeng23 in #561
- Support datasets generated by `dataset.save_to_disk()` by @xrdaukar in #559
- Add support for LoRA adapters in vLLM inference engine by @wizeng23 in #562
- Updates in `VisionLanguageCollator` and in `coco_captions` by @xrdaukar in #563
- Update DEV_SETUP.md with Windows instructions by @taenin in #566
- Make the remote inference engine runnable in jupyter notebooks. by @taenin in #565
- Configure freeze_layer map in `minimal_multimodal_training.py` by @xrdaukar in #569
- Clean up legacy evaluate_oumi code paths by @taenin in #568
- Update model builder to use `default_chat_template` if available by @xrdaukar in #571
- Add package build and deployment workflow to google artifact registry by @oelachqar in #570
## New Contributors
- @oelachqar made their first contribution in #1
- @kaisopos made their first contribution in #7
- @jgreer013 made their first contribution in #9
- @optas made their first contribution in #10
- @xrdaukar made their first contribution in #12
- @wizeng23 made their first contribution in #30
- @taenin made their first contribution in #95
- @mkoukoumidis made their first contribution in #462
**Full Changelog**: https://github.com/oumi-ai/oumi/commits/v0.1-alpha