
Initial release

Pre-release
@oelachqar released this 02 Oct 22:03 · 767 commits to main since this release · d14a4db

What's Changed

  • Add python project configs by @oelachqar in #1
  • Add repo skeleton by @oelachqar in #2
  • Export lema entrypoint scripts by @oelachqar in #3
  • Update static type checking config by @oelachqar in #5
  • Add example jupyter / colab notebook by @oelachqar in #4
  • Refactor config parsing to use omegaconf by @oelachqar in #6
  • Updating documentation (Dev Environment Setup) by @kaisopos in #7
  • Add tests and vscode config by @oelachqar in #8
  • Added DPOTrainer example to repo, as well as cuda device cleanup to training loop by @jgreer013 in #9
  • Adding torch as top-level module dependency by @optas in #10
  • Add configs for specific hardware requirements by @jgreer013 in #11
  • Sort pre-commit hooks lexicographically by @xrdaukar in #12
  • Add logging config by @oelachqar in #13
  • Lema inference by @xrdaukar in #14
  • Panos dev by @optas in #16
  • Add job launcher by @oelachqar in #15
  • Making split of data a flexible variable by @optas in #17
  • Configure max file size in precommit hooks by @xrdaukar in #18
  • Minor bugfix and documentation update by @oelachqar in #19
  • adding pynvml to train env by @kaisopos in #20
  • Panos dev by @optas in #22
  • Augmenting Types for training hyperparams by @optas in #23
  • Train refactoring (config file visibility) + a few minor changes by @kaisopos in #21
  • Minimal test for train function by @xrdaukar in #25
  • Fix leftover '_torch_dtype' in 'ModelParams' by @xrdaukar in #26
  • Update GPU types list in the default SkyPilot config by @xrdaukar in #27
  • Add a missing lema-infer command under [project.scripts] by @xrdaukar in #28
  • add basic pytests for evaluate and infer by @xrdaukar in #29
  • Update README and pyproject.toml by @wizeng23 in #30
  • A helper function to print info about available CUDA devices by @xrdaukar in #31
  • Update SkyPilot config to start using torchrun by @xrdaukar in #32
  • Support basic single-node, multi-gpu training by @xrdaukar in #33
  • Run all precommit hooks on the repo by @xrdaukar in #35
  • Add experimental code for llama cpp inference by @jgreer013 in #37
  • Create skeleton of STYLE_GUIDE.md by @xrdaukar in #36
  • Adding support for training custom models (for now just a dummy model). by @kaisopos in #38
  • Fix custom model name in test_train.py by @xrdaukar in #39
  • Configure pyright (static type checker) and resolve existing type errors to make it pass by @xrdaukar in #41
  • fix trailing whitespace warning in STYLE_GUIDE.md by @xrdaukar in #43
  • Configure initial GitHub Actions workflow to run pre-commits and tests by @xrdaukar in #44
  • A variety of proposed extensions to finetune a chat-based model (starting with Zephyr) by @optas in #34
  • Fix syntax error in ultrachat by @xrdaukar in #48
  • Create initial version of CONTRIBUTING.md by @xrdaukar in #46
  • Reduce the number of training steps from 5 to 3 to make test_train.py faster by @xrdaukar in #49
  • Adding registry for custom models. by @kaisopos in #42
  • Add config and streaming args to DataParams by @wizeng23 in #47
  • Update Pre-review Tests to only run on pull_request by @xrdaukar in #50
  • Add training flags to compute token-based stats by @xrdaukar in #51
  • reduce test training steps in another test which I missed before by @xrdaukar in #53
  • Rename var names of *Params classes by @wizeng23 in #52
  • Make some NVIDIA-specific dependencies optional by @xrdaukar in #54
  • fix trl version as 0.8.6 by @xrdaukar in #56
  • Remove reference to torch.cuda.clock_rate by @xrdaukar in #57
  • Update inference to support non-interactive batch mode. by @kaisopos in #58
  • Update README.md to include Linux/WSL specific instructions by @xrdaukar in #59
  • Minor formatting improvements in README.md by @xrdaukar in #60
  • Minor: Updating Lora Params by @optas in #55
  • Support dataset packing by @wizeng23 in #63
  • Disallow relative imports in LeMa by @xrdaukar in #65
  • Add text_col param that's required for SFTTrainer by @wizeng23 in #66
  • Refactor common config parsing logic (YAML, arg_list) into a common util by @xrdaukar in #68
  • Standardize test naming convention by @wizeng23 in #69
  • Adding support for a hardcoded evaluation with MMLU. by @kaisopos in #67
  • Minor changes to the default configs/skypilot/sky.yaml config by @xrdaukar in #71
  • Prototype to pass config.model.model_max_length to Trainers by @xrdaukar in #70
  • [Inference] Remove the prepended prompts from model responses. by @kaisopos in #73
  • Add a util to print versioning info by @xrdaukar in #74
  • Switch to tempfile.TemporaryDirectory() in test_train.py by @xrdaukar in #75
  • Update docstring verbs to descriptive form by @wizeng23 in #76
  • Add sample accelerate and fsdp configs by @xrdaukar in #77
  • Refactor code to get device rank and world size into a helper function by @xrdaukar in #79
  • Add a simple util to print model summary e.g., layer names, architecture summary by @xrdaukar in #80
  • Freeze numpy to pre 2.0 version by @xrdaukar in #81
  • Adding inference support for next logit probability. by @kaisopos in #78
  • Create FSDP configs for Phi3 by @xrdaukar in #82
  • Auto-format pyproject.toml with "Even Better TOML" by @xrdaukar in #83
  • Minor cleanup updates to SkyPilot configs by @xrdaukar in #84
  • Mixed Precision Training, Flash-Attention-2, Print-trainable-params by @optas in #85
  • Update README.md to include basic instructions for multi-GPU training (DDP, FSDP) by @xrdaukar in #86
  • Start using $SKYPILOT_NUM_GPUS_PER_NODE in SkyPilot config by @xrdaukar in #90
  • Add configs for FineWeb Llama2 pretraining by @wizeng23 in #89
  • Quantization by @optas in #87
  • Update the default SkyPilot config to print more debug/context info by @xrdaukar in #92
  • Add license by @oelachqar in #93
  • Initial version of SkyPilot config for multi-node training (num_nodes: N) by @xrdaukar in #94
  • MMLU eval refactor. by @kaisopos in #88
  • Remove comparison between LOCAL_RANK and RANK by @xrdaukar in #96
  • Handling the loading of peft adapters and other minor issues (e.g., adding more logging parameters) by @optas in #91
  • Update configs/skypilot/sky_llama2b.yaml to start using sky_init.sh by @xrdaukar in #97
  • Add bool param to resume training from the last known checkpoint (if exists) by @xrdaukar in #99
  • Inference: save/restore probabilities to/from file. by @kaisopos in #98
  • Add support for dataset mixtures during training by @taenin in #95
  • Add train, test, and validation splits to the LeMa config. by @taenin in #101
  • nanoGPT (GPT2) pretraining recipe by @wizeng23 in #103
  • Minor: Updates on Zephyr-Config by @optas in #106
  • Update pre-commit config by @oelachqar in #108
  • Add integration tests that verify all configs load properly. by @taenin in #102
  • Handling Gradient Checkpointing by @optas in #107
  • Update skypilot/sky_gpt2.yaml to include an example how to mount GCS dir by @xrdaukar in #111
  • Rename dataset_params.dataset_config to dataset_params.subset by @oelachqar in #109
  • Refactor SFT dataset preprocessing by @oelachqar in #112
  • Support shuffling and random seeds for dataset sampling by @taenin in #113
  • Split types file into module by @oelachqar in #114
  • Add GCP deps to lema[cloud] by @xrdaukar in #117
  • Add llama3-instruct jinja template by @jgreer013 in #118
  • Update sky_init.sh to print current dir by @xrdaukar in #120
  • Add prompt response sft preprocessor factory for aya dataset by @jgreer013 in #121
  • Add configs for chatqa model by @oelachqar in #110
  • Saving inference probs in parquet format. by @kaisopos in #115
  • Refactor model registry by @oelachqar in #122
  • Define BaseTrainer abstraction by @xrdaukar in #116
  • Add a registry for metric functions that we can run during training. by @taenin in #126
  • Update training_params.py so HF trainer uses num_train_epochs by @optas in #125
  • Add native PyTorch model training by @oelachqar in #123
  • [Quick fix] Handle pynvml being misconfigured by @taenin in #128
  • Enable DP for inference by @kaisopos in #100
  • Add configs for training llama3-8b with aya finetune by @jgreer013 in #130
  • Update HF save_model() to only save on master replica by @xrdaukar in #131
  • Pipe MetricsFunction from our config to train.py by @taenin in #129
  • Fixing broken eval. by @kaisopos in #132
  • Minor updates in SkyPilot docstrings by @xrdaukar in #133
  • Fix bug with DP evaluation by @oelachqar in #134
  • [MMLU custom eval] removing hardcoded subject, samples, num-shots. by @kaisopos in #135
  • Add an initial config for async evaluations by @taenin in #137
  • Add a new top level command: evaluate_async by @taenin in #138
  • Minor bug fix in writing evaluations by @taenin in #140
  • Support full GPT2 run by @wizeng23 in #141
  • Upload sample configs for running async evals on GPT2 by @taenin in #139
  • Apply torch.distributed.barrier() in save_model by @xrdaukar in #136
  • Create an experimental util to generate pre-tokenized datasets (Parquet files) with token_ids column by @xrdaukar in #144
  • Created a new dataset class with async loading & tokenization by @jgreer013 in #142
  • Remove private debug dir from configs/skypilot/sky_gpt2.yaml by @xrdaukar in #145
  • Define dataloader_num_workers and dataloader_prefetch_factor params by @xrdaukar in #146
  • [Evaluations] Integration with LM Evaluation Harness by @kaisopos in #143
  • Support model compilation by @wizeng23 in #147
  • Multiple cleanup changes in configs/skypilot/sky_gpt2.yaml by @xrdaukar in #148
  • Update SkyPilot training configs to include run_name by @xrdaukar in #149
  • Update async eval to properly parse eval configs by @taenin in #150
  • Zephyr Configs [full-model, skypilot] by @optas in #152
  • Disable model.compile in gpt2 config by @xrdaukar in #154
  • Update sky_init.sh to print task id and cluster info by @xrdaukar in #156
  • [bug] Include jinja templates in build by @oelachqar in #158
  • Add basic scaffolding for torch profiler around training loop by @xrdaukar in #157
  • [Minor] Adding attn_implementation arg in LM Harness. by @kaisopos in #160
  • Update Trainer.save_model to start using the public HF save_model() method (except for PEFT) by @xrdaukar in #161
  • Update the vanilla eval config for gpt2 to run hellaswag evals. by @taenin in #165
  • Add Dataset base class & API by @oelachqar in #151
  • Add experimental notebook to run Nvidia's ChatRAG-Bench evaluation by @oelachqar in #166
  • Update ChatQA training configs by @oelachqar in #159
  • Update async dataset class to support pre-tokenized datasets by @oelachqar in #162
  • Create a launcher script for Polaris jobs (ALCF) by @taenin in #164
  • Update pre-tokenized column name to be input_ids in tokenize_dataset tool by @xrdaukar in #167
  • Replacing EvaluationConfig's DataParams with DatasetSplitParams by @kaisopos in #168
  • Submit config to create Custom IAM role for SkyPilot Service Accounts on GCP by @xrdaukar in #169
  • Remove GCP project reference by @xrdaukar in #172
  • Make sure output training dir exists by @xrdaukar in #171
  • Improve launcher usability via command line arguments. by @taenin in #170
  • Add a source directory to the Polaris launcher and clean up rsync copies. by @taenin in #173
  • Introduce LEMA_RUN_NAME env var to SkyPilot training configs by @xrdaukar in #174
  • Minor changes: 1. Remove hardcoded HF_TOKEN 2. Log effective training config by @xrdaukar in #175
  • Tweak default params in gpt2 scripts by @xrdaukar in #177
  • LM Harness optimizations by @kaisopos in #176
  • No longer ignore .git in Polaris. Needed for venv. by @taenin in #179
  • A hack for running jobs on Polaris. by @taenin in #180
  • [Polaris] Move venv creation from worker to launcher. by @taenin in #181
  • Update README.md to include sky launch - 10 ... example by @xrdaukar in #182
  • [Evaluations] Adding support for HuggingFace's leaderboard v1 benchmarks by @kaisopos in #183
  • Llama 3 Aya Fine-Tuning Updates by @jgreer013 in #163
  • Remove logger propagation by @wizeng23 in #185
  • [Evaluations] HF leaderboard v1 configs by @kaisopos in #186
  • Move logging.py to utils by @wizeng23 in #187
  • Create the Jobs config for the lema launcher. by @taenin in #188
  • Initial abstract base classes for the lema launcher. by @taenin in #189
  • Added mfu calculation and tests by @jgreer013 in #190
  • Introduce two new training params: save_model and save_epoch by @xrdaukar in #191
  • Update FineWeb ablation model configs by @xrdaukar in #196
  • Added MFU telemetry by @jgreer013 in #193
  • Update Polaris script by @wizeng23 in #192
  • Rename training.save_model param to training.save_final_model for clarity by @xrdaukar in #197
  • Support disabling dropout by @wizeng23 in #184
  • Update actual mfu calculation by @jgreer013 in #199
  • Implement a client for talking to SkyPilot. by @taenin in #201
  • Fixed miscalculation of second step start time by @jgreer013 in #202
  • Update ablation-model-fineweb-v1 config to start using grad checkpointing by @xrdaukar in #198
  • Add distributed operations by @oelachqar in #194
  • Add pre-commit hooks for credential scanning + new checks by @oelachqar in #195
  • Sample job for multi-node training by @xrdaukar in #203
  • Update Polaris multi-node launcher by @xrdaukar in #204
  • Multi-node config improvements for llama2b model (HuggingFaceFW/ablation-model-fineweb-v1) by @xrdaukar in #205
  • Minor updates to Polaris launcher script by @xrdaukar in #206
  • Update Lema FSDP configs by @xrdaukar in #207
  • [tiny] add default formatter for markdown by @oelachqar in #210
  • Preparations for Lema custom pre-training loop by @oelachqar in #208
  • Update MFU callback to support Lema trainer by @oelachqar in #209
  • Configure llama2b model to use FSDP HYBRID_SHARD by @xrdaukar in #213
  • Implement a Cluster resource manager around Sky Pilot. by @taenin in #214
  • Add utils to setup distributed training by @oelachqar in #211
  • Add example notebook to train NanoGPT model with Lema by @oelachqar in #212
  • [tiny] update sky pilot ssh config by @oelachqar in #215
  • Implement a Cloud resource manager around Sky Pilot by @taenin in #216
  • Sanitize run name by @xrdaukar in #217
  • Use "cluster_name" instead of "name" in the Sky client. by @taenin in #218
  • Minor logging improvements in Polaris sample job scripts by @xrdaukar in #219
  • Update shell scripts to point to local dataset by @jgreer013 in #221
  • Support FSDP on Polaris using accelerate by @xrdaukar in #220
  • Add telemetry manager by @oelachqar in #222
  • Switch to the latest transformers=4.43.1 by @xrdaukar in #223
  • Re-enable model compilation for llama2b model by @xrdaukar in #224
  • Increase llama2b batch size from 2 to 3 by @xrdaukar in #225
  • Add makefile with common local commands by @oelachqar in #227
  • Add DeepSpeed config for Llama2b by @wizeng23 in #228
  • MFU Improvements for Llama 2B on Polaris by @jgreer013 in #229
  • FSDP config updates by @xrdaukar in #231
  • Rename accelerate configs to be in line with other configs by @wizeng23 in #232
  • [tiny] Update logger format to include rank, pid and threadname by @oelachqar in #235
  • Set model.config.use_cache = False by @xrdaukar in #233
  • Experimental training loop for pre-training by @oelachqar in #230
  • Disable gradient checkpointing in SkyPilot llama2b config by @xrdaukar in #236
  • Implement a client for communicating with Polaris via python. by @taenin in #234
  • Add SkyPilot config for experimental/pretokenize/tokenize_dataset.py by @xrdaukar in #237
  • Update Fabric.run() calls to use the "warn" flag. by @taenin in #239
  • Update pretokenize tool to support input datasets by @xrdaukar in #238
  • Add optimizers builder function by @oelachqar in #240
  • Add a "put" method in the Polaris client for writing remote files. by @taenin in #242
  • Add deepspeed (DS) config to support hierarchical partitioning by @wizeng23 in #244
  • Add support for uploading MFU in wandb by @jgreer013 in #245
  • Create a Polaris Cluster class consuming the polaris client by @taenin in #246
  • Add initial docker image by @oelachqar in #241
  • Fix a string in the Polaris Cluster tests. by @taenin in #249
  • Set training loop random seeds by @oelachqar in #248
  • Fix bug with Polaris multi-node script by @wizeng23 in #247
  • Add torchfix linting target by @oelachqar in #250
  • Add training state classes by @oelachqar in #251
  • Save and restore telemetry state during training by @oelachqar in #252
  • Configure file logging by @oelachqar in #254
  • Create a Polaris Cloud class consuming the polaris client by @taenin in #253
  • Define a registry for cloud builders. by @taenin in #255
  • Add logging to tensor board, wandb in custom training loop by @oelachqar in #256
  • Add a get_all utility method to the LeMa Registry by @taenin in #257
  • Update the BaseCloud up_cluster definition to return a job status. by @taenin in #258
  • Create a launcher class for the LeMa Launcher. by @taenin in #261
  • Add script to benchmark datasets and data loader params by @oelachqar in #260
  • [Follow-up] data loader benchmarking script by @oelachqar in #262
  • Create DDP configs for accelerate by @xrdaukar in #259
  • Switch from nightly to stable version of SkyPilot by @xrdaukar in #264
  • Make all tests green by @xrdaukar in #265
  • Set dataloader_pin_memory=True to be intentional by @xrdaukar in #266
  • Move torch_profiler_utils from lema.utils to lema.performance by @xrdaukar in #267
  • Add BaseIterableDataset, refactor DataLoader to use DataPipes by @oelachqar in #263
  • Add a dataset_kwargs attribute, tests by @oelachqar in #268
  • Use stateful dataloader by @oelachqar in #269
  • Update the polaris client / cluster to work e2e by @taenin in #270
  • Update package structure for the launcher by @taenin in #273
  • [tiny] Register debug datasets by @oelachqar in #272
  • Update several of our launcher base fields to use strings instead of ints. by @taenin in #274
  • Configure data loader sampling strategy for map-style datasets by @oelachqar in #271
  • Ensure we CD into the working DIR before submitting polaris jobs. by @taenin in #276
  • Compute the number of dataloader workers per node by @xrdaukar in #277
  • Introduce BaseTokenizer alias by @xrdaukar in #280
  • Cache get_device_rank_info by @xrdaukar in #279
  • Adding initial scripts for running polaris jobs. by @taenin in #275
  • Update the polaris client to automatically set execute permissions for copied files. by @taenin in #286
  • Deprecate building models data parallel by @oelachqar in #282
  • Switch to using safetensors when saving models by @oelachqar in #281
  • Add ability to validate configs and params after init by @oelachqar in #285
  • Some updates to Polaris launcher script by @xrdaukar in #287
  • Upgrade to latest TRL version, remove numpy version condition by @oelachqar in #283
  • Add learning rate builder function by @oelachqar in #284
  • Remove patchwork as a dep. by @taenin in #290
  • Set up initial demo launcher jobs for GCP. by @taenin in #288
  • [tiny] cleanup pyproject.toml dependencies by @oelachqar in #292
  • Make dataset data backend attribute read-only by @oelachqar in #291
  • Optimize Github actions by @oelachqar in #289
  • Misc minor changes by @xrdaukar in #293
  • [tiny] Update GitHub action cache version by @oelachqar in #295
  • Rename 'NodeParams' -> 'JobResources' by @taenin in #296
  • Disable compilation for DDP accelerate launch config by @xrdaukar in #297
  • Export top level launcher functions and instantiate a default launcher. by @taenin in #298
  • Prevent HF version bump by @taenin in #300
  • Add dtype/mixed precision configs to Lema trainer by @wizeng23 in #278
  • Create a notebook tutorial for running remote training. by @taenin in #299
  • Increase the default value of ProfilerParams.row_limit from 20 to 50 by @xrdaukar in #304
  • Mini guide on using basic lema functionality by @oelachqar in #303
  • Compute MFU based on HF total_flos (alternative way to compute MFU) by @xrdaukar in #301
  • Support GPT2 training with Lema trainer by @wizeng23 in #302
  • Add a client for running local jobs via the launcher. by @taenin in #305
  • Add a local cluster for running local jobs. by @taenin in #306
  • Support llama2b with lema trainer by @wizeng23 in #308
  • Add a convenience method for listing all registered clouds. by @taenin in #310
  • [ALCF] Reverse Polaris GPU order to match CPU/GPU affinities by @xrdaukar in #307
  • Create a local cloud for the LeMa launcher. by @taenin in #309
  • Remove some leftover occurrences of builtin_ prefix in HF MFU callback by @xrdaukar in #312
  • Clean up mixed precision params by @wizeng23 in #311
  • Add finetuning tutorial by @oelachqar in #313
  • Fix interpolation when loading lema configs. by @taenin in #314
  • [bugfix] GPU workers not waiting for global leader to save final checkpoint by @oelachqar in #315
  • Add simple benchmark script for distributed operations by @oelachqar in #316
  • Add a 'done' field to the LeMa job status object. by @taenin in #317
  • Fix a small typo in Lema README by @xrdaukar in #318
  • Add pytorch profiler (-p) option to multinode_example_worker.sh script by @xrdaukar in #319
  • Create a simpler tutorial for running jobs. by @taenin in #320
  • Minor cleanups in Lema training loop by @xrdaukar in #322
  • Remove unbalanced call to barrier() in HuggingFaceTrainer.save_model by @xrdaukar in #323
  • Create a tutorial for custom clouds. by @taenin in #321
  • Add support for logging stdout and stderr for Local runs. by @taenin in #324
  • Fix nanoGPT notebook by @wizeng23 in #325
  • Add more pytorch profiler instrumentations in Lema training loop by @xrdaukar in #327
  • Add training param: dataloader_main_process_only by @xrdaukar in #326
  • fix synchronization issues in LEMA training loop by @xrdaukar in #328
  • Update LEMA training loop to count tokens on CPU by @xrdaukar in #330
  • Update README.md by @taenin in #331
  • Add various improvements to Lema trainer by @wizeng23 in #329
  • Add PyTorch profiler annotation for each step/micro-step by @xrdaukar in #333
  • Enable HfMfuTrainerCallback if supported by @xrdaukar in #332
  • Add support for PyTorch profiling schedule by @xrdaukar in #334
  • Set up Sphinx-based doc generation for LeMa by @taenin in #335
  • Fix dataclass strings to be parsable by our docs generator. by @taenin in #337
  • Update ProfilerStepCallback to add microstep profiler annotations by @xrdaukar in #338
  • Add include_alternative_mfu_metrics param to control if HF MFU is enabled by @xrdaukar in #336
  • Minor doc formatting updates. by @taenin in #340
  • Add 8-bit Adam optimizer to Lema trainer by @wizeng23 in #339
  • Enable gradient scaling for fp16 mixed-precision training by @wizeng23 in #342
  • Add a link to our documentation via the readme. by @taenin in #344
  • Disable weight decay for layernorm/biases in Lema trainer by @wizeng23 in #341
  • Polaris: Enable NCCL debug logging at WARNING level by @xrdaukar in #347
  • Add a new notebook for getting started. by @taenin in #345
  • Create TelemetryCallback by @xrdaukar in #343
  • Various improvements for our autogenerated docs by @taenin in #349
  • Polaris: update sample tail command to use -n200 by @xrdaukar in #348
  • Fix a minor bug in TelemetryCallback.on_train_end by @xrdaukar in #350
  • Update LEMA training loop to log wandb url by @xrdaukar in #351
  • Update model dtype for DeepSpeed to make it work with SkyPilot and Polaris by @xrdaukar in #352
  • Enable the launcher via the CLI by @taenin in #353
  • Update Polaris init script to print nodelist by @xrdaukar in #354
  • Minor logging updates in Polaris scripts by @xrdaukar in #355
  • Define ddp1gpu Polaris mode: Spawn 1 torchrun process per GPU (4 torchrun processes per node) by @xrdaukar in #356
  • Add a helper util to query GPU temperatures by @xrdaukar in #359
  • Add Llama 8B config by @wizeng23 in #358
  • Add another barrier() call before train() by @xrdaukar in #360
  • Add Llama70B FSDP config by @wizeng23 in #361
  • Minor improvements in logging and instrumentations in train.py by @xrdaukar in #362
  • Refactor our core directory to logically organize our classes. by @taenin in #357
  • Basic plumbing for GPU temperature telemetry by @xrdaukar in #363
  • Minor update to Llama70B by @wizeng23 in #365
  • Reorder model compilation and DDP/FSDP wrapping by @xrdaukar in #364
  • Mini tutorial for Llama3.1-70b inference on Polaris. by @taenin in #367
  • jgreer013/vllm-inference by @jgreer013 in #366
  • Fix interpolation when using the launcher CLI for various sky configs. by @taenin in #369
  • Add Llama8B Lora config for GCP/Polaris by @wizeng23 in #368
  • Add vllm parallel inference to improve throughput by @jgreer013 in #370
  • Set TOKENIZERS_PARALLELISM: false for llama8b model by @xrdaukar in #371
  • Disable MFU computation for PEFT by @xrdaukar in #372
  • Add empty_device_cache_steps param and configure it for Llama8b model by @xrdaukar in #373
  • Add TelemetryCallback.include_timer_metrics param: False by default by @xrdaukar in #378
  • Update llama8b GCP launcher script to allow Spot VMs by @xrdaukar in #380
  • Minimal Llama8B LoRA eval config by @xrdaukar in #376
  • Add Llama 8b SFT config by @wizeng23 in #379
  • Move common NCCL variables initialization into polaris_init.sh by @xrdaukar in #377
  • Minor tuning of llama8b configs by @xrdaukar in #382
  • Update eval script to use Meta-Llama-3.1-8B-Instruct model version by @xrdaukar in #381
  • Initial notebook for llama 8b LoRa tuning. by @taenin in #374
  • Update SkyPilot GCP script to download the right model version by @xrdaukar in #385
  • Clean up Sky configs by @wizeng23 in #383
  • Update main makefile to generate docs by @oelachqar in #386
  • Add docs-serve makefile command by @oelachqar in #387
  • Fix missing new line at the end of Makefile by @xrdaukar in #390
  • Raise NOT_IMPLEMENTED if adapter_model is configured for LM_HARNESS eval by @xrdaukar in #391
  • Update Llama8B LoRA eval script to use built-in LEMA evaluator by @xrdaukar in #389
  • Add Llama 70b lora config by @wizeng23 in #388
  • Enable markdown docs by @oelachqar in #394
  • Check ignored docstring rules by @oelachqar in #395
  • Remove special case for saving PEFT models by @xrdaukar in #384
  • Move shared code into polaris_init by @wizeng23 in #392
  • Update Llama notebook to include 8B SFT by @wizeng23 in #393
  • Update sample commands to point to the preemptable queue by @taenin in #396
  • Update lm_harness to support LoRA adapters by @jgreer013 in #397
  • Fix FSDP model initialization by @wizeng23 in #398
  • Add vscode launch config for accelerate distributed training by @oelachqar in #400
  • Update trainer save model by @oelachqar in #399
  • Increase from 2 to 3 nodes for Llama 70B Lora by @wizeng23 in #402
  • Add param to customize NCCL timeout by @oelachqar in #401
  • Add docs and gpu install targets by @oelachqar in #403
  • Significant improvements for the Polaris launcher by @taenin in #404
  • Ensure that jobs are queued on existing clusters when users call UP by @taenin in #406
  • Autostop sky clusters after 30 min of no activity by @taenin in #407
  • Add support for triton kernels from Liger Kernel by @oelachqar in #405
  • Add support for including notebooks in the docs by @oelachqar in #408
  • Update sphinx comments to docstrings by @oelachqar in #411
  • Add missing docstrings to TrainingParams by @oelachqar in #409
  • Capped model max length for Llama tuning by @wizeng23 in #413
  • Fix a deadlock in the Polaris launcher for users with 500+ jobs. by @taenin in #412
  • Script to run inference with Llama/GPT judges. by @kaisopos in #414
  • Add missing docstrings to top-level configs by @oelachqar in #410
  • [tiny] sphinx conf update by @oelachqar in #416
  • Improve launcher polling by running tasks in a subprocess. by @taenin in #417
  • Add missing package docstrings by @oelachqar in #415
  • [tiny] Enable D104 rule by @oelachqar in #419
  • Fix bug with 70B Lora by @wizeng23 in #421
  • Update the CLI to look for open SSH tunnels as a way of preserving Polaris state by @taenin in #418
  • Update the polaris launcher to always update the lema installation on job creation. by @taenin in #422
  • Cleanup doc RSTs by @oelachqar in #420
  • Add sphinx api doc template for packages by @oelachqar in #425
  • Add automatically generated apidoc RSTs by @oelachqar in #424
  • [tiny] Move apidocs into their own folder by @oelachqar in #426
  • Add docs-rebuild command to Makefile by @oelachqar in #427
  • Refresh markdown docs by @oelachqar in #429
  • Reorganize our test structure by @taenin in #431
  • Add Llama 70B SFT config by @wizeng23 in #428
  • Script to generate judge prompts. by @kaisopos in #423
  • [tiny] Breakdown main Readme into multiple docs by @oelachqar in #430
  • Update main readme file by @oelachqar in #432
  • Add GitHub badges, readme typos by @oelachqar in #434
  • Fix markdown lint errors by @oelachqar in #433
  • Update documentation index by @oelachqar in #436
  • [tiny] Only log to console on global leader by @wizeng23 in #435
  • Tune sphinx config by @oelachqar in #437
  • Enable Liger for Llama 8B SFT by @wizeng23 in #439
  • Updated Parallel Inference job by @jgreer013 in #438
  • Add a mkdir to polaris init. by @taenin in #440
  • [tiny] Fix lema loop performance gap by @oelachqar in #441
  • [tiny] update trainer benchmark script and minor updates by @oelachqar in #443
  • Add Llama 8B eval script by @wizeng23 in #442
  • Add dataset remote code param by @oelachqar in #445
  • [docs] Update format + add missing docs to data_params.py by @oelachqar in #444
  • Update Polaris Llama8b eval script to enable data-parallel evals for LM_HARNESS by @xrdaukar in #446
  • Copy changes from PR-446 into Polaris launcher config by @xrdaukar in #448
  • Copy changes from PR-446 into GCP launcher config by @xrdaukar in #449
  • Minor fixes in llama8B eval scripts by @xrdaukar in #450
  • Add Llama 70B eval script by @wizeng23 in #447
  • [bugfix] add is_using_accelerate_fsdp util by @oelachqar in #453
  • [tiny] Fix inference notebook by @wizeng23 in #451
  • Simplify record_function annotation in LEMA training loop by @xrdaukar in #454
  • [tiny] enable ruff format on save with notebooks by @oelachqar in #455
  • [tiny] Add missing default value to hf_trainer by @oelachqar in #458
  • Judge inference script for Polaris by @kaisopos in #452
  • Add the base classes for inference. Pull out logic from infer to a native text inference engine. by @taenin in #456
  • Telemetry improvements for tracking GPU temperature and in general by @xrdaukar in #457
  • Add integration tests for native inference (not using the CLI). by @taenin in #460
  • Update README.md by @mkoukoumidis in #462
  • Update README to make installation steps more prominent by @taenin in #464
  • Fix several broken links and update installation instructions by @taenin in #465
  • Update inference to pass the generation config to inference engines. by @taenin in #466
  • Update README.md by @taenin in #467
  • Fixed issue with metadata extraction failure by @jgreer013 in #469
  • Add fsdp support to lema loop by @oelachqar in #463
  • Combine telemetry from all ranks by @xrdaukar in #468
  • Add sample for full fine-tuned and LoRA-tuned model inference using vLLM by @wizeng23 in #470
  • Update chat_template_builder by @oelachqar in #472
  • Removed duplicate task_done call by @jgreer013 in #473
  • Add flag to enable experimental torch data pipes processing pipeline by @oelachqar in #474
  • Vision-language datasets & fine-tuning MVP by @oelachqar in #459
  • Rebuild docs, add multi-modal tutorial by @oelachqar in #475
  • Add test coverage target, update pyproject.toml metadata by @oelachqar in #476
  • Create a local inference engine for vLLM by @taenin in #471
  • Add llava chat template, QoL improvement to multimodal testing script by @oelachqar in #478
  • [Polaris Judge Inference] Adjusting script for Llama 70B quantized by @kaisopos in #461
  • Add example for running inference using vLLM on GCP, single-node multi-gpu by @oelachqar in #479
  • [tiny] Remove deepspeed from required dependencies by @oelachqar in #482
  • Update train path to save meta-info as files under telemetry sub-dir by @xrdaukar in #480
  • Add inference engine apply_chat_template helper, update example notebook by @oelachqar in #481
  • Update arg names for vLLM inference job by @wizeng23 in #477
  • Remove device_map for model init from config by @wizeng23 in #484
  • Add log_model_summary call back by @xrdaukar in #485
  • Small typo fix in the vllm notebook by @taenin in #483
  • Cleanup FSDP wrap class auto guesser by @oelachqar in #486
  • Add missing documentation for model_params by @oelachqar in #487
  • Add callback builder function by @oelachqar in #490
  • Minor fixes in DISTRIBUTED_TRAINING.md by @xrdaukar in #488
  • Switch to using official UV action with dependency caching by @oelachqar in #491
  • Introduce BaseTrainerCallback alias by @xrdaukar in #492
  • Add documentation to peft_params by @oelachqar in #493
  • Update TelemetryCallback to save final metrics to JSON by @xrdaukar in #494
  • Increase the rsync timeout from 40s to 300s by @taenin in #495
  • [tiny] fix missing import by @oelachqar in #497
  • Rename build_dataset -> build_dataset_mixture by @oelachqar in #498
  • Define a simple callback to detect NaN/INF-s during training by @xrdaukar in #496
  • Replace pip install flash-attn with .[gpu] target by @wizeng23 in #502
  • Add simpler builder for single dataset use cases by @oelachqar in #499
  • Use HF's built-in gradient checkpointing argument by @wizeng23 in #500
  • [Draft] Example changes to support 70B single-node inference by @jgreer013 in #503
  • Various updates to Llama 2b configs by @wizeng23 in #489
  • Add Llama 2B FSDP config by @wizeng23 in #505
  • Update TelemetryCallback to write JSON with GPU temperature summary by @xrdaukar in #501
  • Rename src/lema to src/oumi by @wizeng23 in #506
  • OpenAI Chat Engine - Custom servers by @taenin in #504
  • Rename configs/lema to configs/oumi by @wizeng23 in #507
  • Rename all relevant lema references in codebase by @wizeng23 in #508
  • Re-generate Sphinx docs by @wizeng23 in #509
  • Update conf.py by @taenin in #510
  • Rename remaining lema references in docs/ by @wizeng23 in #511
  • Update final lema references by @wizeng23 in #512
  • Update dev setup guide by @wizeng23 in #513
  • Update TOTAL_NUM_GPUS compare commands in SkyPilot configs by @xrdaukar in #514
  • [Minor] Issues raised by "newcomer" exploration [1/K] by @optas in #518
  • Freeze lm-eval and torch versions as a workaround for OPE-390 by @xrdaukar in #516
  • 1. Write wandb telemetry 2. Reorder training callbacks by @xrdaukar in #519
  • Multiple updates to Llama 2B by @wizeng23 in #515
  • Rename OUMI to Oumi by @wizeng23 in #520
  • Add llama.cpp Inference Engine by @oelachqar in #524
  • Rename website references to oumi.ai by @wizeng23 in #522
  • Add anthropic inference engine by @oelachqar in #523
  • Update name typo by @oelachqar in #526
  • Add a batch inference job runnable via the Oumi Launcher by @taenin in #527
  • Auto-format pyproject and pre-commit configs by @xrdaukar in #530
  • Update Makefile by @taenin in #529
  • Fix failing tests after a new install. by @taenin in #531
  • Fix a small bug in infer_interactive(): only prints the first character by @xrdaukar in #532
  • Boosting User-friendliness by @optas in #521
  • [tiny] add override from typing_extensions by @oelachqar in #534
  • Create CODE_OF_CONDUCT.md by @taenin in #536
  • Add conversation helper methods by @oelachqar in #535
  • [tiny] cleanup multimodal benchmark script by @oelachqar in #537
  • Auto-format shell scripts under scripts by @xrdaukar in #539
  • Add builder function for data collators by @oelachqar in #538
  • Make tokenizer optional by @oelachqar in #540
  • Add an optional -t flag to scripts/polaris/jobs/llama2b_pt_worker.sh by @xrdaukar in #541
  • Fix initial issues found by shellcheck by @xrdaukar in #542
  • [tiny] fix small typo by @oelachqar in #544
  • Minor changes in scripts/benchmarks/minimal_multimodal_training.py by @xrdaukar in #543
  • [tiny] Add util to get install folder root dir by @oelachqar in #545
  • [tiny] Add fp paged_adam optimizer option by @oelachqar in #547
  • [tiny] Allow conversation metadata to contain values other than str by @oelachqar in #546
  • Switch from Flash Attention 2 to PyTorch SDPA by @wizeng23 in #533
  • Use local_rank to query GPU temperature by @xrdaukar in #550
  • Fix a bug for handling stopped sky clusters in the oumi launcher. by @taenin in #549
  • Remove flash attention validation check by @wizeng23 in #551
  • Add support for AWS and Azure jobs in Oumi by @taenin in #552
  • Pass split param to datasets.load_dataset() by @xrdaukar in #553
  • Implement Judge API MVP by @oelachqar in #548
  • Log dataset info: shape, columns, other metainfo by @xrdaukar in #555
  • Update experimental pretokenize_dataset tool by @xrdaukar in #554
  • Various improvements to Llama eval scripts by @wizeng23 in #556
  • Add a couple of gc.collect() calls by @xrdaukar in #560
  • [tiny] Fix Makefile setup command by @wizeng23 in #561
  • Support datasets generated by dataset.save_to_disk() by @xrdaukar in #559
  • Add support for LoRA adapters in vLLM inference engine by @wizeng23 in #562
  • Updates in VisionLanguageCollator and in coco_captions by @xrdaukar in #563
  • Update DEV_SETUP.md with Windows instructions by @taenin in #566
  • Make the remote inference engine runnable in jupyter notebooks. by @taenin in #565
  • Configure freeze_layer map in minimal_multimodal_training.py by @xrdaukar in #569
  • Clean up legacy evaluate_oumi code paths by @taenin in #568
  • Update model builder to use default_chat_template if available by @xrdaukar in #571
  • Add package build and deployment workflow to google artifact registry by @oelachqar in #570

New Contributors

  • @oelachqar made their first contribution in #1
  • @kaisopos made their first contribution in #7
  • @jgreer013 made their first contribution in #9
  • @optas made their first contribution in #10
  • @xrdaukar made their first contribution in #12
  • @wizeng23 made their first contribution in #30
  • @taenin made their first contribution in #95
  • @mkoukoumidis made their first contribution in #462

Full Changelog: https://github.com/oumi-ai/oumi/commits/v0.1-alpha