Remove offline training, refactor `train.py` and logging/checkpointing #670

aliberts · 2025-01-31T19:29:13Z

What this does

⚠️ Removes the online training part from the train.py script: online training will be handled by the training scripts from [WIP] Fix SAC and port HIL SERL #644
As a consequence, .offline and .online are removed from TrainPipelineConfig. To set the number of offline training steps, simply use --steps:

python lerobot/scripts/train.py \
- --offline.steps=200000
+ --steps=200000

Adds wandb_utils.py and turns Logger into WandBLogger to remove responsibilities from this class so that it only manages wandb stuff.
Replaces training_state serialization from torch.save/load with safetensors save_file/load_file. We shouldn't use torch.load() for this, and in fact it breaks in torch 2.6, where weights_only=True is the default.

/checkpoints/005000
  ├── pretrained_model
- └── training_state.pth
+ └── training_state
+     ├── optimizer_param_groups.json
+     ├── optimizer_state.safetensors
+     ├── rng_state.safetensors
+     ├── scheduler_state.json
+     └── training_step.json
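
For illustration, the JSON files in this layout can be written and read back with the standard library alone. The sketch below is not the code from train_utils.py: the helper names `save_training_step` and `load_training_step` and the `{"step": ...}` schema are assumptions, and the tensor-valued files (optimizer_state.safetensors, rng_state.safetensors) are left out.

```python
import json
import tempfile
from pathlib import Path

TRAINING_STATE_DIR = "training_state"      # directory name from the layout above
TRAINING_STEP_FILE = "training_step.json"  # file name from the layout above


def save_training_step(step: int, checkpoint_dir: Path) -> Path:
    """Write the current step as JSON (hypothetical helper, assumed schema)."""
    state_dir = checkpoint_dir / TRAINING_STATE_DIR
    state_dir.mkdir(parents=True, exist_ok=True)
    path = state_dir / TRAINING_STEP_FILE
    path.write_text(json.dumps({"step": step}))
    return path


def load_training_step(checkpoint_dir: Path) -> int:
    """Read the step back; this is plain JSON parsing, no unpickling involved."""
    path = checkpoint_dir / TRAINING_STATE_DIR / TRAINING_STEP_FILE
    return int(json.loads(path.read_text())["step"])


def roundtrip_demo() -> int:
    """Save and reload a step inside a temporary checkpoint directory."""
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint_dir = Path(tmp) / "checkpoints" / "005000"
        save_training_step(5000, checkpoint_dir)
        return load_training_step(checkpoint_dir)
```

Keeping scalar state in JSON means those files can be inspected by hand and loaded without executing any pickled code, which is the concern with torch.load() raised above.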

Adds train_utils.py to handle training checkpoints logic (including training state).
Cleans up functions related to rng and groups them together in random_utils.py.
Saves the checkpoint before eval during training rather than after (safer in case eval crashes).
Fixes logging, where each displayed value was only the last one measured instead of the average over the steps since the previous logging step.
Changes the output format of the policies' main forward() for clarity. It now returns a tuple[Tensor, dict | None] instead of just a dict, the first element being the loss:

- output_dict = policy.forward(batch)
- loss = output_dict["loss"]
+ loss, output_dict = policy.forward(batch)
loss.backward()
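
The logging fix listed above (showing the average since the previous logging step rather than the last measured value) can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `AverageMeter` class, the `log_averages` function and the `log_freq` parameter are hypothetical names.

```python
class AverageMeter:
    """Accumulates values between two logging steps (hypothetical helper)."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    @property
    def avg(self) -> float:
        # Guard against division by zero when nothing was recorded yet.
        return self.total / self.count if self.count else 0.0


def log_averages(losses: list[float], log_freq: int) -> list[float]:
    """Return the value that would be displayed at each logging step."""
    meter = AverageMeter()
    logged = []
    for step, loss in enumerate(losses, start=1):
        meter.update(loss)
        if step % log_freq == 0:
            logged.append(meter.avg)  # average over the window, not the last value
            meter.reset()             # start a fresh window for the next log
    return logged
```

With losses `[1.0, 3.0, 2.0, 4.0]` and `log_freq=2`, this reports `[2.0, 3.0]`, whereas reporting only the last measurement would show `[3.0, 4.0]`.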

How it was tested

Adds the following tests:

tests/test_schedulers.py
tests/test_optimizers.py
tests/test_train_utils.py
tests/test_random_utils.py
tests/test_io_utils.py

How to checkout & try? (for the reviewer)

Examples:

pytest -v \
    tests/test_schedulers.py \
    tests/test_optimizers.py \
    tests/test_train_utils.py \
    tests/test_random_utils.py \
    tests/test_io_utils.py


Cadene

Beautiful

Could you remove all appearances of ema?
They were added by default

Cadene · 2025-02-10T13:34:12Z

lerobot/common/policies/diffusion/modeling_diffusion.py

@@ -153,7 +153,7 @@ def forward(self, batch: dict[str, Tensor]) -> dict[str, Tensor]:
            )
        batch = self.normalize_targets(batch)
        loss = self.diffusion.compute_loss(batch)
-        return {"loss": loss}
+        return loss, None


Suggested change

-        return loss, None
+        # no output_dict so returning None
+        return loss, None
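
Because the second element of the tuple may be None, callers need to check it before reading extra outputs. A minimal sketch of that pattern, using a stand-in policy that works on plain floats instead of tensors; `DummyPolicy`, `collect_metrics` and the `"errors"` batch key are hypothetical and not part of the codebase:

```python
from typing import Any, Optional


class DummyPolicy:
    """Stand-in for a policy; real policies return a Tensor as the loss."""

    def __init__(self, with_outputs: bool) -> None:
        self.with_outputs = with_outputs

    def forward(
        self, batch: dict[str, Any]
    ) -> tuple[float, Optional[dict[str, float]]]:
        errors = batch["errors"]
        loss = sum(errors) / len(errors)
        # Some policies have nothing extra to report and return None.
        output_dict = {"max_error": max(errors)} if self.with_outputs else None
        return loss, output_dict


def collect_metrics(policy: DummyPolicy, batch: dict[str, Any]) -> dict[str, float]:
    """Unpack the (loss, output_dict) tuple and merge any extra outputs."""
    loss, output_dict = policy.forward(batch)
    metrics = {"loss": loss}
    if output_dict is not None:
        metrics.update(output_dict)
    return metrics
```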

aliberts added 4 commits January 31, 2025 20:22

Add random_utils

1b3123f

Update training_state serialization to safetensors

14e8a7f

Move functions to random_utils

700e08a

Add tests

ed5f38e

aliberts changed the title ~~Update safetensors `training_state~~ Update training_state serialization to safetensors Jan 31, 2025

aliberts changed the title ~~Update training_state serialization to safetensors~~ Refactor Logger Feb 4, 2025

aliberts added 7 commits February 8, 2025 13:21

Tuplify policy.forward() outputs

17dd853

Split Logger responsibilities, use safetensors for checkpoints

3327e70

Refactor train.py, remove online part

2cbf61d

Update integration tests

0320fed

Update examples

b73872e

Update test_record_and_replay_and_policy

799813a

Merge remote-tracking branch 'origin/main' into user/aliberts/2025_01…

a4be431

…_31_safetensors_training_state

aliberts changed the title ~~Refactor Logger~~ Refactor train.py and logging/checkpointing Feb 8, 2025

Simplify MetricsTracker and add test_logging_utils.py

6169747

aliberts changed the title ~~Refactor train.py and logging/checkpointing~~ Remove offline training, refactor train.py and logging/checkpointing Feb 8, 2025

aliberts added the 🔄 Refactor label Feb 8, 2025

aliberts added 13 commits February 8, 2025 16:16

Fix draccus version

80ab9a5

Simplify train_utils

18faaad

Update docs

326afd7

Add copyrights

5e7d083

Update save_checkpoint

af4bfc8

Add test_train_utils

81fe1d5

Add deserialize_json_into_object and testing

93754ee

Update optimizer deserialization with proper typing

8eb8301

Update scheduler deserialization with proper typing

c3a40a2

Add test_optimizers

a8e3336

Add test_scheduler

f831d8b

Fix poetry relax

7fa6817

Add fixtures

780fbf5

aliberts added 2 commits February 8, 2025 22:36

Fix tests

8b528a5

Nit docstring

0023b08

aliberts requested a review from Cadene February 8, 2025 21:48

aliberts marked this pull request as ready for review February 8, 2025 21:48

Cadene approved these changes Feb 10, 2025
