Replies: 1 comment 1 reply
-
So it turns out this was sort of a dumb question: it works when using the same hardware. However, it would be really useful to be able to render or use a policy on different hardware. I have changed the sharding file from cuda to cpu, but I still get the PRNG key error.
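One thing that may be worth trying before editing sharding metadata by hand: JAX reads the `JAX_PLATFORMS` environment variable at startup, so forcing the whole process onto the CPU backend before any arrays are created sometimes lets a GPU-trained checkpoint load on a CPU-only machine. The script name and flag below are hypothetical placeholders for whatever does the rendering:

```shell
# Force JAX onto the CPU backend for this process only.
# render_policy.py and --checkpoint are illustrative names, not a real CLI.
JAX_PLATFORMS=cpu python render_policy.py --checkpoint /path/to/ckpt
```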
-
So I'm having issues with rendering the output episode after training on GPU clusters, and therefore I'm trying to separate the steps. Furthermore, I would like to load a previously trained checkpoint and either continue training from it or deploy it in a real-time environment.
However, I fail to do all three of the above, since I can't seem to load and then rebuild the policy correctly between files.
Perhaps a notebook doing just that would be great. The locomotion notebook does load a checkpoint, but I can't seem to reproduce that between separate files.
Perhaps the issue is also related to me trying to restore on a different device from the one used for training?
One simple example I tried to use for restoring looks like this:
However, I keep getting errors related to the reset function in the ReachbotGetup class and the PRNG keys. (The reset function is exactly the same as in the getup task for the Go1; ReachbotGetup is also the same as Go1Getup with some values and reward functions changed.)
Traceback (most recent call last):
Also, it appears there is a provision in the train function to directly return the policy if num_timesteps=0, which hints at the use case of just restoring a policy, but I can't figure out how to do it. Help would be greatly appreciated.
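For reference, the general pattern I'm after (train and save in one file, rebuild the policy in another) can be sketched with plain Python. This is only an illustrative sketch, not the brax/playground API: `make_policy`, the affine "network", and the file names are all stand-ins so the example stays self-contained.

```python
import pickle

def save_checkpoint(path, params):
    # In the training script: dump the trained parameter pytree to disk.
    with open(path, "wb") as f:
        pickle.dump(params, f)

def load_checkpoint(path):
    # In the rendering/deployment script: read the parameters back.
    with open(path, "rb") as f:
        return pickle.load(f)

def make_policy(params):
    # Stand-in for the real network factory: here the "policy" is just
    # an affine map act = w * obs + b, so no ML library is needed.
    w, b = params["w"], params["b"]
    return lambda obs: w * obs + b

# Simulate the two-file workflow in one process.
save_checkpoint("/tmp/ckpt.pkl", {"w": 2.0, "b": 1.0})
policy = make_policy(load_checkpoint("/tmp/ckpt.pkl"))
print(policy(3.0))  # prints 7.0
```

The key point of the pattern is that the checkpoint stores only parameters; the second file must independently rebuild the same network structure before it can apply them.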