
Initial release

Pre-release
@oelachqar released this 02 Oct 22:03 · 767 commits to main since this release · d14a4db

What's Changed

  • Add python project configs by @oelachqar in #1
  • Add repo skeleton by @oelachqar in #2
  • Export lema entrypoint scripts by @oelachqar in #3
  • Update static type checking config by @oelachqar in #5
  • Add example jupyter / colab notebook by @oelachqar in #4
  • Refactor config parsing to use omegaconf by @oelachqar in #6
  • Updating documentation (Dev Environment Setup) by @kaisopos in #7
  • Add tests and vscode config by @oelachqar in #8
  • Added DPOTrainer example to repo, as well as cuda device cleanup to training loop by @jgreer013 in #9
  • Adding torch as top-level module dependency by @optas in #10
  • Add configs for specific hardware requirements by @jgreer013 in #11
  • Sort pre-commit hooks lexicographically by @xrdaukar in #12
  • Add logging config by @oelachqar in #13
  • Lema inference by @xrdaukar in #14
  • Panos dev by @optas in #16
  • Add job launcher by @oelachqar in #15
  • Making split of data a flexible variable by @optas in #17
  • Configure max file size in precommit hooks by @xrdaukar in #18
  • Minor bugfix and documentation update by @oelachqar in #19
  • adding pynvml to train env by @kaisopos in #20
  • Panos dev by @optas in #22
  • Augmenting Types for training hyperparams by @optas in #23
  • Train refactoring (config file visibility) + a few minor changes by @kaisopos in #21
  • Minimal test for train function by @xrdaukar in #25
  • Fix leftover '_torch_dtype' in 'ModelParams' by @xrdaukar in #26
  • Update GPU types list in the default SkyPilot config by @xrdaukar in #27
  • Add a missing lema-infer command under [project.scripts] by @xrdaukar in #28
  • add basic pytests for evaluate and infer by @xrdaukar in #29
  • Update README and pyproject.toml by @wizeng23 in #30
  • A helper function to print info about available CUDA devices by @xrdaukar in #31
  • Update SkyPilot config to start using torchrun by @xrdaukar in #32
  • Support basic single-node, multi-gpu training by @xrdaukar in #33
  • Run all precommit hooks on the repo by @xrdaukar in #35
  • Add experimental code for llama cpp inference by @jgreer013 in #37
  • Create skeleton of STYLE_GUIDE.md by @xrdaukar in #36
  • Adding support for training custom models (for now just a dummy model). by @kaisopos in #38
  • Fix custom model name in test_train.py by @xrdaukar in #39
  • Configure pyright (static type checker) and resolve existing type errors to make it pass by @xrdaukar in #41
  • fix trailing whitespace warning in STYLE_GUIDE.md by @xrdaukar in #43
  • Configure initial GitHub Actions workflow to run pre-commits and tests by @xrdaukar in #44
  • A variety of proposed extensions to finetune a chat-based model (starting with Zephyr) by @optas in #34
  • Fix syntax error in ultrachat by @xrdaukar in #48
  • Create initial version of CONTRIBUTING.md by @xrdaukar in #46
  • Reduce the number of training steps from 5 to 3 to make test_train.py faster by @xrdaukar in #49
  • Adding registry for custom models. by @kaisopos in #42
  • Add config and streaming args to DataParams by @wizeng23 in #47
  • Update Pre-review Tests to only run on pull_request by @xrdaukar in #50
  • Add training flags to compute token-based stats by @xrdaukar in #51
  • reduce test training steps in another test which I missed before by @xrdaukar in #53
  • Rename var names of *Params classes by @wizeng23 in #52
  • Make some NVIDIA-specific dependencies optional by @xrdaukar in #54
  • fix trl version as 0.8.6 by @xrdaukar in #56
  • Remove reference to torch.cuda.clock_rate by @xrdaukar in #57
  • Update inference to support non-interactive batch mode. by @kaisopos in #58
  • Update README.md to include Linux/WSL specific instructions by @xrdaukar in #59
  • Minor formatting improvements in README.md by @xrdaukar in #60
  • Minor: Updating Lora Params by @optas in #55
  • Support dataset packing by @wizeng23 in #63
  • Disallow relative imports in LeMa by @xrdaukar in #65
  • Add text_col param that's required for SFTTrainer by @wizeng23 in #66
  • Refactor common config parsing logic (YAML, arg_list) into a common util by @xrdaukar in #68
  • Standardize test naming convention by @wizeng23 in #69
  • Adding support for a hardcoded evaluation with MMLU. by @kaisopos in #67
  • Minor changes to the default configs/skypilot/sky.yaml config by @xrdaukar in #71
  • Prototype to pass config.model.model_max_length to Trainers by @xrdaukar in #70
  • [Inference] Remove the prepended prompts from model responses. by @kaisopos in #73
  • Add a util to print versioning info by @xrdaukar in #74
  • Switch to tempfile.TemporaryDirectory() in test_train.py by @xrdaukar in #75
  • Update docstring verbs to descriptive form by @wizeng23 in #76
  • Add sample accelerate and fsdp configs by @xrdaukar in #77
  • Refactor code to get device rank and world size into a helper function by @xrdaukar in #79
  • Add a simple util to print model summary e.g., layer names, architecture summary by @xrdaukar in #80
  • Freeze numpy to pre 2.0 version by @xrdaukar in #81
  • Adding inference support for next logit probability. by @kaisopos in #78
  • Create FSDP configs for Phi3 by @xrdaukar in #82
  • Auto-format pyproject.toml with "Even Better TOML" by @xrdaukar in #83
  • Minor cleanup updates to SkyPilot configs by @xrdaukar in #84
  • Mixed Precision Training, Flash-Attention-2, Print-trainable-params by @optas in #85
  • Update README.md to include basic instructions for multi-GPU training (DDP, FSDP) by @xrdaukar in #86
  • Start using $SKYPILOT_NUM_GPUS_PER_NODE in SkyPilot config by @xrdaukar in #90
  • Add configs for FineWeb Llama2 pretraining by @wizeng23 in #89
  • Quantization by @optas in #87
  • Update the default SkyPilot config to print more debug/context info by @xrdaukar in #92
  • Add license by @oelachqar in #93
  • Initial version of SkyPilot config for multi-node training (num_nodes: N) by @xrdaukar in #94
  • MMLU eval refactor. by @kaisopos in #88
  • Remove comparison between LOCAL_RANK and RANK by @xrdaukar in #96
  • Handling the loading of peft adapters and other minor issues (e.g., adding more logging parameters) by @optas in #91
  • Update configs/skypilot/sky_llama2b.yaml to start using sky_init.sh by @xrdaukar in #97
  • Add bool param to resume training from the last known checkpoint (if exists) by @xrdaukar in #99
  • Inference: save/restore probabilities to/from file. by @kaisopos in #98
  • Add support for dataset mixtures during training by @taenin in #95
  • Add train, test, and validation splits to the LeMa config. by @taenin in #101
  • nanoGPT (GPT2) pretraining recipe by @wizeng23 in #103
  • Minor: Updates on Zephyr-Config by @optas in #106
  • Update pre-commit config by @oelachqar in #108
  • Add integration tests that verify all configs load properly. by @taenin in #102
  • Handling Gradient Checkpointing by @optas in #107
  • Update skypilot/sky_gpt2.yaml to include an example how to mount GCS dir by @xrdaukar in #111
  • Rename dataset_params.dataset_config to dataset_params.subset by @oelachqar in #109
  • Refactor SFT dataset preprocessing by @oelachqar in #112
  • Support shuffling and random seeds for dataset sampling by @taenin in #113
  • Split types file into module by @oelachqar in #114
  • Add GCP deps to lema[cloud] by @xrdaukar in #117
  • Add llama3-instruct jinja template by @jgreer013 in #118
  • Update sky_init.sh to print current dir by @xrdaukar in #120
  • Add prompt response sft preprocessor factory for aya dataset by @jgreer013 in #121
  • Add configs for chatqa model by @oelachqar in #110
  • Saving inference probs in parquet format. by @kaisopos in #115
  • Refactor model registry by @oelachqar in #122
  • Define BaseTrainer abstraction by @xrdaukar in #116
  • Add a registry for metric functions that we can run during training. by @taenin in #126
  • Update training_params.py so HF trainer uses num_train_epochs by @optas in #125
  • Add native PyTorch model training by @oelachqar in #123
  • [Quick fix] Handle pynvml being misconfigured by @taenin in #128
  • Enable DP for inference by @kaisopos in #100
  • Add configs for training llama3-8b with aya finetune by @jgreer013 in #130
  • Update HF save_model() to only save on master replica by @xrdaukar in #131
  • Pipe MetricsFunction from our config to train.py by @taenin in #129
  • Fixing broken eval. by @kaisopos in #132
  • Minor updates in SkyPilot docstrings by @xrdaukar in #133
  • Fix bug with DP evaluation by @oelachqar in #134
  • [MMLU custom eval] removing hardcoded subject, samples, num-shots. by @kaisopos in #135
  • Add an initial config for async evaluations by @taenin in #137
  • Add a new top level command: evaluate_async by @taenin in #138
  • Minor bug fix in writing evaluations by @taenin in #140
  • Support full GPT2 run by @wizeng23 in #141
  • Upload sample configs for running async evals on GPT2 by @taenin in #139
  • Apply torch.distributed.barrier() in save_model by @xrdaukar in #136
  • Create an experimental util to generate pre-tokenized datasets (Parquet files) with token_ids column by @xrdaukar in #144
  • Created a new dataset class with async loading & tokenization by @jgreer013 in #142
  • Remove private debug dir from configs/skypilot/sky_gpt2.yaml by @xrdaukar in #145
  • Define dataloader_num_workers and dataloader_prefetch_factor params by @xrdaukar in #146
  • [Evaluations] Integration with LM Evaluation Harness by @kaisopos in #143
  • Support model compilation by @wizeng23 in #147
  • Multiple cleanup changes in configs/skypilot/sky_gpt2.yaml by @xrdaukar in #148
  • Update SkyPilot training configs to include run_name by @xrdaukar in #149
  • Update async eval to properly parse eval configs by @taenin in #150
  • Zephyr Configs [full-model, skypilot] by @optas in #152
  • Disable model.compile in gpt2 config by @xrdaukar in #154
  • Update sky_init.sh to print task id and cluster info by @xrdaukar in #156
  • [bug] Include jinja templates in build by @oelachqar in #158
  • Add basic scaffolding for torch profiler around training loop by @xrdaukar in #157
  • [Minor] Adding attn_implementation arg in LM Harness. by @kaisopos in #160
  • Update Trainer.save_model to start using the public HF save_model() method (except for PEFT) by @xrdaukar in #161
  • Update the vanilla eval config for gpt2 to run hellaswag evals. by @taenin in #165
  • Add Dataset base class & API by @oelachqar in #151
  • Add experimental notebook to run Nvidia's ChatRAG-Bench evaluation by @oelachqar in #166
  • Update ChatQA training configs by @oelachqar in #159
  • Update async dataset class to support pre-tokenized datasets by @oelachqar in #162
  • Create a launcher script for Polaris jobs (ALCF) by @taenin in #164
  • Update pre-tokenized column name to be input_ids in tokenize_dataset tool by @xrdaukar in #167
  • Replacing EvaluationConfig's DataParams with DatasetSplitParams by @kaisopos in #168
  • Submit config to create Custom IAM role for SkyPilot Service Accounts on GCP by @xrdaukar in #169
  • Remove GCP project reference by @xrdaukar in #172
  • Make sure output training dir exists by @xrdaukar in #171
  • Improve launcher usability via command line arguments. by @taenin in #170
  • Add a source directory to the Polaris launcher and clean up rsync copies. by @taenin in #173
  • Introduce LEMA_RUN_NAME env var to SkyPilot training configs by @xrdaukar in #174
  • Minor changes: 1. Remove hardcoded HF_TOKEN 2. Log effective training config by @xrdaukar in #175
  • Tweak default params in gpt2 scripts by @xrdaukar in #177
  • LM Harness optimizations by @kaisopos in #176
  • No longer ignore .git in Polaris. Needed for venv. by @taenin in #179
  • A hack for running jobs on Polaris. by @taenin in #180
  • [Polaris] Move venv creation from worker to launcher. by @taenin in #181
  • Update README.md to include sky launch - 10 ... example by @xrdaukar in #182
  • [Evaluations] Adding support for HuggingFace's leaderboard v1 benchmarks by @kaisopos in #183
  • Llama 3 Aya Fine-Tuning Updates by @jgreer013 in #163
  • Remove logger propagation by @wizeng23 in #185
  • [Evaluations] HF leaderboard v1 configs by @kaisopos in #186
  • Move logging.py to utils by @wizeng23 in #187
  • Create the Jobs config for the lema launcher. by @taenin in #188
  • Initial abstract base classes for the lema launcher. by @taenin in #189
  • Added mfu calculation and tests by @jgreer013 in #190
  • Introduce two new training params: save_model and save_epoch by @xrdaukar in #191
  • Update FineWeb ablation model configs by @xrdaukar in #196
  • Added MFU telemetry by @jgreer013 in #193
  • Update Polaris script by @wizeng23 in #192
  • Rename training.save_model param to training.save_final_model for clarity by @xrdaukar in #197
  • Support disabling dropout by @wizeng23 in #184
  • Update actual mfu calculation by @jgreer013 in #199
  • Implement a client for talking to SkyPilot. by @taenin in #201
  • Fixed miscalculation of second step start time by @jgreer013 in #202
  • Update ablation-model-fineweb-v1 config to start using grad checkpointing by @xrdaukar in #198
  • Add distributed operations by @oelachqar in #194
  • Add pre-commit hooks for credential scanning + new checks by @oelachqar in #195
  • Sample job for multi-node training by @xrdaukar in #203
  • Update Polaris multi-node launcher by @xrdaukar in #204
  • Multi-node config improvements for llama2b model (HuggingFaceFW/ablation-model-fineweb-v1) by @xrdaukar in #205
  • Minor updates to Polaris launcher script by @xrdaukar in #206
  • Update Lema FSDP configs by @xrdaukar in #207
  • [tiny] add default formatter for markdown by @oelachqar in #210
  • Preparations for Lema custom pre-training loop by @oelachqar in #208
  • Update MFU callback to support Lema trainer by @oelachqar in #209
  • Configure llama2b model to use FSDP HYBRID_SHARD by @xrdaukar in #213
  • Implement a Cluster resource manager around Sky Pilot. by @taenin in #214
  • Add utils to setup distributed training by @oelachqar in #211
  • Add example notebook to train NanoGPT model with Lema by @oelachqar in #212
  • [tiny] update sky pilot ssh config by @oelachqar in #215
  • Implement a Cloud resource manager around Sky Pilot by @taenin in #216
  • Sanitize run name by @xrdaukar in #217
  • Use "cluster_name" instead of "name" in the Sky client. by @taenin in #218
  • Minor logging improvements in Polaris sample job scripts by @xrdaukar in #219
  • Update shell scripts to point to local dataset by @jgreer013 in #221
  • Support FSDP on Polaris using accelerate by @xrdaukar in #220
  • Add telemetry manager by @oelachqar in #222
  • Switch to the latest transformers=4.43.1 by @xrdaukar in #223
  • Re-enable model compilation for llama2b model by @xrdaukar in #224
  • Increase llama2b batch size from 2 to 3 by @xrdaukar in #225
  • Add makefile with common local commands by @oelachqar in #227
  • Add DeepSpeed config for Llama2b by @wizeng23 in #228
  • MFU Improvements for Llama 2B on Polaris by @jgreer013 in #229
  • FSDP config updates by @xrdaukar in #231
  • Rename accelerate configs to be in line with other configs by @wizeng23 in #232
  • [tiny] Update logger format to include rank, pid and threadname by @oelachqar in #235
  • Set model.config.use_cache = False by @xrdaukar in #233
  • Experimental training loop for pre-training by @oelachqar in #230
  • Disable gradient checkpointing in SkyPilot llama2b config by @xrdaukar in #236
  • Implement a client for communicating with Polaris via python. by @taenin in #234
  • Add SkyPilot config for experimental/pretokenize/tokenize_dataset.py by @xrdaukar in #237
  • Update Fabric.run() calls to use the "warn" flag. by @taenin in #239
  • Update pretokenize tool to support input datasets by @xrdaukar in #238
  • Add optimizers builder function by @oelachqar in #240
  • Add a "put" method in the Polaris client for writing remote files. by @taenin in #242
  • Add deepspeed (DS) config to support hierarchical partitioning by @wizeng23 in #244
  • Add support for uploading MFU in wandb by @jgreer013 in #245
  • Create a Polaris Cluster class consuming the polaris client by @taenin in #246
  • Add initial docker image by @oelachqar in #241
  • Fix a string in the Polaris Cluster tests. by @taenin in #249
  • Set training loop random seeds by @oelachqar in #248
  • Fix bug with Polaris multi-node script by @wizeng23 in #247
  • Add torchfix linting target by @oelachqar in #250
  • Add training state classes by @oelachqar in #251
  • Save and restore telemetry state during training by @oelachqar in #252
  • Configure file logging by @oelachqar in #254
  • Create a Polaris Cloud class consuming the polaris client by @taenin in #253
  • Define a registry for cloud builders. by @taenin in #255
  • Add logging to tensor board, wandb in custom training loop by @oelachqar in #256
  • Add a get_all utility method to the LeMa Registry by @taenin in #257
  • Update the BaseCloud up_cluster definition to return a job status. by @taenin in #258
  • Create a launcher class for the LeMa Launcher. by @taenin in #261
  • Add script to benchmark datasets and data loader params by @oelachqar in #260
  • [Follow-up] data loader benchmarking script by @oelachqar in #262
  • Create DDP configs for accelerate by @xrdaukar in #259
  • Switch from nightly to stable version of SkyPilot by @xrdaukar in #264
  • Make all tests green by @xrdaukar in #265
  • Set dataloader_pin_memory=True to be intentional by @xrdaukar in #266
  • Move torch_profiler_utils from lema.utils to lema.performance by @xrdaukar in #267
  • Add BaseIterableDataset, refactor DataLoader to use DataPipes by @oelachqar in #263
  • Add a dataset_kwargs attribute, tests by @oelachqar in #268
  • Use stateful dataloader by @oelachqar in #269
  • Update the polaris client / cluster to work e2e by @taenin in #270
  • Update package structure for the launcher by @taenin in #273
  • [tiny] Register debug datasets by @oelachqar in #272
  • Update several of our launcher base fields to use strings instead of ints. by @taenin in #274
  • Configure data loader sampling strategy for map-style datasets by @oelachqar in #271
  • Ensure we CD into the working DIR before submitting polaris jobs. by @taenin in #276
  • Compute the number of dataloader workers per node by @xrdaukar in #277
  • Introduce BaseTokenizer alias by @xrdaukar in #280
  • Cache get_device_rank_info by @xrdaukar in #279
  • Adding initial scripts for running polaris jobs. by @taenin in #275
  • Update the polaris client to automatically set execute permissions for copied files. by @taenin in #286
  • Deprecate building models data parallel by @oelachqar in #282
  • Switch to using safetensors when saving models by @oelachqar in #281
  • Add ability to validate configs and params after init by @oelachqar in #285
  • Some updates to Polaris launcher script by @xrdaukar in #287
  • Upgrade to latest TRL version, remove numpy version condition by @oelachqar in #283
  • Add learning rate builder function by @oelachqar in #284
  • Remove patchwork as a dep. by @taenin in #290
  • Set up initial demo launcher jobs for GCP. by @taenin in #288
  • [tiny] cleanup pyproject.toml dependencies by @oelachqar in #292
  • Make dataset data backend attribute read-only by @oelachqar in #291
  • Optimize Github actions by @oelachqar in #289
  • Misc minor changes by @xrdaukar in #293
  • [tiny] Update GitHub action cache version by @oelachqar in #295
  • Rename 'NodeParams' -> 'JobResources' by @taenin in #296
  • Disable compilation for DDP accelerate launch config by @xrdaukar in #297
  • Export top level launcher functions and instantiate a default launcher. by @taenin in #298
  • Prevent HF version bump by @taenin in #300
  • Add dtype/mixed precision configs to Lema trainer by @wizeng23 in #278
  • Create a notebook tutorial for running remote training. by @taenin in #299
  • Increase the default value of ProfilerParams.row_limit from 20 to 50 by @xrdaukar in #304
  • Mini guide on using basic lema functionality by @oelachqar in #303
  • Compute MFU based on HF total_flos (alternative way to compute MFU) by @xrdaukar in #301
  • Support GPT2 training with Lema trainer by @wizeng23 in #302
  • Add a client for running local jobs via the launcher. by @taenin in #305
  • Add a local cluster for running local jobs. by @taenin in #306
  • Support llama2b with lema trainer by @wizeng23 in #308
  • Add a convenience method for listing all registered clouds. by @taenin in #310
  • [ALCF] Reverse Polaris GPU order to match CPU/GPU affinities by @xrdaukar in #307
  • Create a local cloud for the LeMa launcher. by @taenin in #309
  • Remove some leftover occurrences of builtin_ prefix in HF MFU callback by @xrdaukar in #312
  • Clean up mixed precision params by @wizeng23 in #311
  • Add finetuning tutorial by @oelachqar in #313
  • Fix interpolation when loading lema configs. by @taenin in #314
  • [bugfix] GPU workers not waiting for global leader to save final checkpoint by @oelachqar in #315
  • Add simple benchmark script for distributed operations by @oelachqar in #316
  • Add a 'done' field to the LeMa job status object. by @taenin in #317
  • Fix a small typo in Lema README by @xrdaukar in #318
  • Add pytorch profiler (-p) option to multinode_example_worker.sh script by @xrdaukar in #319
  • Create a simpler tutorial for running jobs. by @taenin in #320
  • Minor cleanups in Lema training loop by @xrdaukar in #322
  • Remove unbalanced call to barrier() in HuggingFaceTrainer.save_model by @xrdaukar in #323
  • Create a tutorial for custom clouds. by @taenin in #321
  • Add support for logging stdout and stderr for Local runs. by @taenin in #324
  • Fix nanoGPT notebook by @wizeng23 in #325
  • Add more pytorch profiler instrumentations in Lema training loop by @xrdaukar in #327
  • Add training param: dataloader_main_process_only by @xrdaukar in #326
  • fix synchronization issues in LEMA training loop by @xrdaukar in #328
  • Update LEMA training loop to count tokens on CPU by @xrdaukar in #330
  • Update README.md by @taenin in #331
  • Add various improvements to Lema trainer by @wizeng23 in #329
  • Add PyTorch profiler annotation for each step/micro-step by @xrdaukar in #333
  • Enable HfMfuTrainerCallback if supported by @xrdaukar in #332
  • Add support for PyTorch profiling schedule by @xrdaukar in #334
  • Set up Sphinx-based doc generation for LeMa by @taenin in #335
  • Fix dataclass strings to be parsable by our docs generator. by @taenin in #337
  • Update ProfilerStepCallback to add microstep profiler annotations by @xrdaukar in #338
  • Add include_alternative_mfu_metrics param to control if HF MFU is enabled by @xrdaukar in #336
  • Minor doc formatting updates. by @taenin in #340
  • Add 8-bit Adam optimizer to Lema trainer by @wizeng23 in #339
  • Enable gradient scaling for fp16 mixed-precision training by @wizeng23 in #342
  • Add a link to our documentation via the readme. by @taenin in #344
  • Disable weight decay for layernorm/biases in Lema trainer by @wizeng23 in #341
  • Polaris: Enable NCCL debug logging at WARNING level by @xrdaukar in #347
  • Add a new notebook for getting started. by @taenin in #345
  • Create TelemetryCallback by @xrdaukar in #343
  • Various improvements for our autogenerated docs by @taenin in #349
  • Polaris: update sample tail command to use -n200 by @xrdaukar in #348
  • Fix a minor bug in TelemetryCallback.on_train_end by @xrdaukar in #350
  • Update LEMA training loop to log wandb url by @xrdaukar in #351
  • Update model dtype for DeepSpeed to make it work with SkyPilot and Polaris by @xrdaukar in #352
  • Enable the launcher via the CLI by @taenin in #353
  • Update Polaris init script to print nodelist by @xrdaukar in #354
  • Minor logging updates in Polaris scripts by @xrdaukar in #355
  • Define ddp1gpu Polaris mode: Spawn 1 torchrun process per GPU (4 torchrun processes per node) by @xrdaukar in #356
  • Add a helper util to query GPU temperatures by @xrdaukar in #359
  • Add Llama 8B config by @wizeng23 in #358
  • Add another barrier() call before train() by @xrdaukar in #360
  • Add Llama70B FSDP config by @wizeng23 in #361
  • Minor improvements in logging and instrumentations in train.py by @xrdaukar in #362
  • Refactor our core directory to logically organize our classes. by @taenin in #357
  • Basic plumbing for GPU temperature telemetry by @xrdaukar in #363
  • Minor update to Llama70B by @wizeng23 in #365
  • Reorder model compilation and DDP/FSDP wrapping by @xrdaukar in #364
  • Mini tutorial for Llama3.1-70b inference on Polaris. by @taenin in #367
  • jgreer013/vllm-inference by @jgreer013 in #366
  • Fix interpolation when using the launcher CLI for various sky configs. by @taenin in #369
  • Add Llama8B Lora config for GCP/Polaris by @wizeng23 in #368
  • Add vllm parallel inference to improve throughput by @jgreer013 in #370
  • Set TOKENIZERS_PARALLELISM: false for llama8b model by @xrdaukar in #371
  • Disable MFU computation for PEFT by @xrdaukar in #372
  • Add empty_device_cache_steps param and configure it for Llama8b model by @xrdaukar in #373
  • Add TelemetryCallback.include_timer_metrics param: False by default by @xrdaukar in #378
  • Update llama8b GCP launcher script to allow Spot VMs by @xrdaukar in #380
  • Minimal Llama8B LoRA eval config by @xrdaukar in #376
  • Add Llama 8b SFT config by @wizeng23 in #379
  • Move common NCCL variables initialization into polaris_init.sh by @xrdaukar in #377
  • Minor tuning of llama8b configs by @xrdaukar in #382
  • Update eval script to use Meta-Llama-3.1-8B-Instruct model version by @xrdaukar in #381
  • Initial notebook for llama 8b LoRa tuning. by @taenin in #374
  • Update SkyPilot GCP script to download the right model version by @xrdaukar in #385
  • Clean up Sky configs by @wizeng23 in #383
  • Update main makefile to generate docs by @oelachqar in #386
  • Add docs-serve makefile command by @oelachqar in #387
  • Fix missing new line at the end of Makefile by @xrdaukar in #390
  • Raise NOT_IMPLEMENTED if adapter_model is configured for LM_HARNESS eval by @xrdaukar in #391
  • Update Llama8B LoRA eval script to use built-in LEMA evaluator by @xrdaukar in #389
  • Add Llama 70b lora config by @wizeng23 in #388
  • Enable markdown docs by @oelachqar in #394
  • Check ignored docstring rules by @oelachqar in #395
  • Remove special case for saving PEFT models by @xrdaukar in #384
  • Move shared code into polaris_init by @wizeng23 in #392
  • Update Llama notebook to include 8B SFT by @wizeng23 in #393
  • Update sample commands to point to the preemptable queue by @taenin in #396
  • Update lm_harness to support LoRA adapters by @jgreer013 in #397
  • Fix FSDP model initialization by @wizeng23 in #398
  • Add vscode launch config for accelerate distributed training by @oelachqar in #400
  • Update trainer save model by @oelachqar in #399
  • Increase from 2 to 3 nodes for Llama 70B Lora by @wizeng23 in #402
  • Add param to customize NCCL timeout by @oelachqar in #401
  • Add docs and gpu install targets by @oelachqar in #403
  • Significant improvements for the Polaris launcher by @taenin in #404
  • Ensure that jobs are queued on existing clusters when users call UP by @taenin in #406
  • Autostop sky clusters after 30 min of no activity by @taenin in #407
  • Add support for triton kernels from Liger Kernel by @oelachqar in #405
  • Add support for including notebooks in the docs by @oelachqar in #408
  • Update sphinx comments to docstrings by @oelachqar in #411
  • Add missing docstrings to TrainingParams by @oelachqar in #409
  • Capped model max length for Llama tuning by @wizeng23 in #413
  • Fix a deadlock in the Polaris launcher for users with 500+ jobs. by @taenin in #412
  • Script to run inference with Llama/GPT judges. by @kaisopos in #414
  • Add missing docstrings to top-level configs by @oelachqar in #410
  • [tiny] sphinx conf update by @oelachqar in #416
  • Improve launcher polling by running tasks in a subprocess. by @taenin in #417
  • Add missing package docstrings by @oelachqar in #415
  • [tiny] Enable D104 rule by @oelachqar in #419
  • Fix bug with 70B Lora by @wizeng23 in #421
  • Update the CLI to look for open SSH tunnels as a way of preserving Polaris state by @taenin in #418
  • Update the polaris launcher to always update the lema installation on job creation. by @taenin in #422
  • Cleanup doc RSTs by @oelachqar in #420
  • Add sphinx api doc template for packages by @oelachqar in #425
  • Add automatically generated apidoc RSTs by @oelachqar in #424
  • [tiny] Move apidocs into their own folder by @oelachqar in #426
  • Add docs-rebuild command to Makefile by @oelachqar in #427
  • Refresh markdown docs by @oelachqar in #429
  • Reorganize our test structure by @taenin in #431
  • Add Llama 70B SFT config by @wizeng23 in #428
  • Script to generate judge prompts. by @kaisopos in #423
  • [tiny] Breakdown main Readme into multiple docs by @oelachqar in #430
  • Update main readme file by @oelachqar in #432
  • Add GitHub badges, readme typos by @oelachqar in #434
  • Fix markdown lint errors by @oelachqar in #433
  • Update documentation index by @oelachqar in #436
  • [tiny] Only log to console on global leader by @wizeng23 in #435
  • Tune sphinx config by @oelachqar in #437
  • Enable Liger for Llama 8B SFT by @wizeng23 in #439
  • Updated Parallel Inference job by @jgreer013 in #438
  • Add a mkdir to polaris init. by @taenin in #440
  • [tiny] Fix lema loop performance gap by @oelachqar in #441
  • [tiny] update trainer benchmark script and minor updates by @oelachqar in #443
  • Add Llama 8B eval script by @wizeng23 in #442
  • Add dataset remote code param by @oelachqar in #445
  • [docs] Update format + add missing docs to data_params.py by @oelachqar in #444
  • Update Polaris Llama8b eval script to enable data-parallel evals for LM_HARNESS by @xrdaukar in #446
  • Copy changes from PR-446 into Polaris launcher config by @xrdaukar in #448
  • Copy changes from PR-446 into GCP launcher config by @xrdaukar in #449
  • Minor fixes in llama8B eval scripts by @xrdaukar in #450
  • Add Llama 70B eval script by @wizeng23 in #447
  • [bugfix] add is_using_accelerate_fsdp util by @oelachqar in #453
  • [tiny] Fix inference notebook by @wizeng23 in #451
  • Simplify record_function annotation in LEMA training loop by @xrdaukar in #454
  • [tiny] enable ruff format on save with notebooks by @oelachqar in #455
  • [tiny] Add missing default value to hf_trainer by @oelachqar in #458
  • Judge inference script for Polaris by @kaisopos in #452
  • Add the base classes for inference. Pull out logic from infer to a native text inference engine. by @taenin in #456
  • Telemetry improvements for tracking GPU temperature and in general by @xrdaukar in #457
  • Add integration tests for native inference (not using the CLI). by @taenin in #460
  • Update README.md by @mkoukoumidis in #462
  • Update README to make installation steps more prominent by @taenin in #464
  • Fix several broken links and update installation instructions by @taenin in #465
  • Update inference to pass the generation config to inference engines. by @taenin in #466
  • Update README.md by @taenin in #467
  • Fixed issue with metadata extraction failure by @jgreer013 in #469
  • Add fsdp support to lema loop by @oelachqar in #463
  • Combine telemetry from all ranks by @xrdaukar in #468
  • Add sample for full fine-tuned and LoRA-tuned model inference using vLLM by @wizeng23 in #470
  • Update chat_template_builder by @oelachqar in #472
  • Removed duplicate task_done call by @jgreer013 in #473
  • Add flag to enable experimental torch data pipes processing pipeline by @oelachqar in #474
  • Vision-language datasets & fine-tuning MVP by @oelachqar in #459
  • Rebuild docs, add multi-modal tutorial by @oelachqar in #475
  • Add test coverage target, update pyproject.toml metadata by @oelachqar in #476
  • Create a local inference engine for vLLM by @taenin in #471
  • Add llava chat template, QoL improvement to multimodal testing script by @oelachqar in #478
  • [Polaris Judge Inference] Adjusting script for Llama 70B quantized by @kaisopos in #461
  • Add example for running inference using vLLM on GCP, single-node multi-gpu by @oelachqar in #479
  • [tiny] Remove deepspeed from required dependencies by @oelachqar in #482
  • Update train path to save meta-info as files under telemetry sub-dir by @xrdaukar in #480
  • Add inference engine apply_chat_template helper, update example notebook by @oelachqar in #481
  • Update arg names for vLLM inference job by @wizeng23 in #477
  • Remove device_map for model init from config by @wizeng23 in #484
  • Add log_model_summary call back by @xrdaukar in #485
  • Small typo fix in the vllm notebook by @taenin in #483
  • Cleanup FSDP wrap class auto guesser by @oelachqar in #486
  • Add missing documentation for model_params by @oelachqar in #487
  • Add callback builder function by @oelachqar in #490
  • Minor fixes in DISTRIBUTED_TRAINING.md by @xrdaukar in #488
  • Switch to using official UV action with dependency caching by @oelachqar in #491
  • Introduce BaseTrainerCallback alias by @xrdaukar in #492
  • Add documentation to peft_params by @oelachqar in #493
  • Update TelemetryCallback to save final metrics to JSON by @xrdaukar in #494
  • Increase the rsync timeout from 40s to 300s by @taenin in #495
  • [tiny] fix missing import by @oelachqar in #497
  • Rename build_dataset -> build_dataset_mixture by @oelachqar in #498
  • Define a simple callback to detect NaN/INF-s during training by @xrdaukar in #496
  • Replace pip install flash-attn with .[gpu] target by @wizeng23 in #502
  • Add simpler builder for single dataset use cases by @oelachqar in #499
  • Use HF's built-in gradient checkpointing argument by @wizeng23 in #500
  • [Draft] Example changes to support 70B single-node inference by @jgreer013 in #503
  • Various updates to Llama 2b configs by @wizeng23 in #489
  • Add Llama 2B FSDP config by @wizeng23 in #505
  • Update TelemetryCallback to write JSON with GPU temperature summary by @xrdaukar in #501
  • Rename src/lema to src/oumi by @wizeng23 in #506
  • OpenAI Chat Engine - Custom servers by @taenin in #504
  • Rename configs/lema to configs/oumi by @wizeng23 in #507
  • Rename all relevant lema references in codebase by @wizeng23 in #508
  • Re-generate Sphinx docs by @wizeng23 in #509
  • Update conf.py by @taenin in #510
  • Rename remaining lema references in docs/ by @wizeng23 in #511
  • Update final lema references by @wizeng23 in #512
  • Update dev setup guide by @wizeng23 in #513
  • Update TOTAL_NUM_GPUS compare commands in SkyPilot configs by @xrdaukar in #514
  • [Minor] Issues raised by "newcomer" exploration [1/K] by @optas in #518
  • Freeze lm-eval and torch versions as a workaround for OPE-390 by @xrdaukar in #516
  • 1. Write wandb telemetry 2. Reorder training callbacks by @xrdaukar in #519
  • Multiple updates to Llama 2B by @wizeng23 in #515
  • Rename OUMI to Oumi by @wizeng23 in #520
  • Add llama.cpp Inference Engine by @oelachqar in #524
  • Rename website references to oumi.ai by @wizeng23 in #522
  • Add anthropic inference engine by @oelachqar in #523
  • Update name typo by @oelachqar in #526
  • Add a batch inference job runnable via the Oumi Launcher by @taenin in #527
  • Auto-format pyproject and pre-commit configs by @xrdaukar in #530
  • Update Makefile by @taenin in #529
  • Fix failing tests after a new install. by @taenin in #531
  • Fix a small bug in infer_interactive(): only prints the first character by @xrdaukar in #532
  • Boosting User-friendliness by @optas in #521
  • [tiny] add override from typing_extensions by @oelachqar in #534
  • Create CODE_OF_CONDUCT.md by @taenin in #536
  • Add conversation helper methods by @oelachqar in #535
  • [tiny] cleanup multimodal benchmark script by @oelachqar in #537
  • Auto-format shell scripts under scripts by @xrdaukar in #539
  • Add builder function for data collators by @oelachqar in #538
  • Make tokenizer optional by @oelachqar in #540
  • Add an optional -t flag to scripts/polaris/jobs/llama2b_pt_worker.sh by @xrdaukar in #541
  • Fix initial issues found by shellcheck by @xrdaukar in #542
  • [tiny] fix small typo by @oelachqar in #544
  • Minor changes in scripts/benchmarks/minimal_multimodal_training.py by @xrdaukar in #543
  • [tiny] Add util to get install folder root dir by @oelachqar in #545
  • [tiny] Add fp paged_adam optimizer option by @oelachqar in #547
  • [tiny] Allow conversation metadata to contain values other than str by @oelachqar in #546
  • Switch from Flash Attention 2 to PyTorch SDPA by @wizeng23 in #533
  • Use local_rank to query GPU temperature by @xrdaukar in #550
  • Fix a bug for handling stopped sky clusters in the oumi launcher. by @taenin in #549
  • Remove flash attention validation check by @wizeng23 in #551
  • Add support for AWS and Azure jobs in Oumi by @taenin in #552
  • Pass split param to datasets.load_dataset() by @xrdaukar in #553
  • Implement Judge API MVP by @oelachqar in #548
  • Log dataset info: shape, columns, other metainfo by @xrdaukar in #555
  • Update experimental pretokenize_dataset tool by @xrdaukar in #554
  • Various improvements to Llama eval scripts by @wizeng23 in #556
  • Add a couple of gc.collect() calls by @xrdaukar in #560
  • [tiny] Fix Makefile setup command by @wizeng23 in #561
  • Support datasets generated by dataset.save_to_disk() by @xrdaukar in #559
  • Add support for LoRA adapters in vLLM inference engine by @wizeng23 in #562
  • Updates in VisionLanguageCollator and in coco_captions by @xrdaukar in #563
  • Update DEV_SETUP.md with Windows instructions by @taenin in #566
  • Make the remote inference engine runnable in jupyter notebooks. by @taenin in #565
  • Configure freeze_layer map in minimal_multimodal_training.py by @xrdaukar in #569
  • Clean up legacy evaluate_oumi code paths by @taenin in #568
  • Update model builder to use default_chat_template if available by @xrdaukar in #571
  • Add package build and deployment workflow to google artifact registry by @oelachqar in #570

New Contributors

  • @oelachqar made their first contribution in #1
  • @kaisopos made their first contribution in #7
  • @jgreer013 made their first contribution in #9
  • @optas made their first contribution in #10
  • @xrdaukar made their first contribution in #12
  • @wizeng23 made their first contribution in #30
  • @taenin made their first contribution in #95
  • @mkoukoumidis made their first contribution in #462

Full Changelog: https://github.com/oumi-ai/oumi/commits/v0.1-alpha