Bad performance on experiment reproduction #51

Open
JasonLiu324 opened this issue Sep 29, 2024 · 2 comments

Comments

JasonLiu324 commented Sep 29, 2024

Hi, I have successfully run the whole project and tested it on several Isaac Gym tasks, such as FrankaCabinet and Humanoid, but the experiment results are not as good as I expected. What might be the reason?

My workstation environment is:
Ubuntu 22.04
RTX 4080 GPU with 12 GB of VRAM
16 GB of RAM

And the command lines I have used are:
python eureka.py env=FrankaCabinet sample=5 iteration=5 model_name=gpt-4
python eureka.py env=Anymal sample=5 iteration=5 model_name=gpt-4

The final success rate is only approximately 0.1. Could this be related to the number of samples? My workstation can only run 5 samples in parallel due to limited GPU memory.
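
For reference, Eureka keeps the best-performing reward candidate out of the N samples it generates per iteration, so a smaller sample budget shrinks the pool the reflection step can choose from. Below is a minimal back-of-the-envelope sketch of that effect, assuming each generated candidate is independently "usable" with some probability p (the p=0.15 used here is purely illustrative, not measured from this repo):

```python
# Illustration only: if each sampled reward function is independently
# "usable" with probability p, the chance that at least one of the N
# candidates in an iteration is usable is 1 - (1 - p)**N.
def chance_of_usable_candidate(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# N=5 matches the commands above; N=16 is the sample count I believe the
# paper/README uses by default. p=0.15 is an arbitrary illustrative value.
for n in (5, 16):
    print(f"N={n:2d}: P(>=1 usable candidate) = {chance_of_usable_candidate(0.15, n):.2f}")
```

Under these assumed numbers that works out to roughly 0.56 for N=5 versus roughly 0.93 for N=16, so a small sample budget alone could plausibly account for weaker results, independent of any bug.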

JasonLiu324 (Author) commented:

And the weird thing is that the reward reflection is almost identical across iterations of the run:
Iteration 0: User Content:
We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

Iteration 1: User Content:
We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

Iteration 2: User Content:
We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

Iteration 3: User Content:
We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

The values are exactly the same across all four iterations, so I think something must be wrong with the training process.
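
One quick way to tell whether the identical reflections come from stale feedback being reused (rather than four genuinely identical training runs) is to hash the per-iteration log/feedback text that the run writes to disk. A minimal sketch, assuming a hypothetical layout where each iteration's feedback ends up as a text file somewhere under an outputs/ directory (the directory and file pattern below are placeholders, not the project's actual paths):

```python
# Sketch: hash each iteration's saved feedback/log text to check whether the
# files are byte-identical (suggesting stale feedback is being reused) or
# merely look similar. Paths and the *.txt pattern are placeholders.
import hashlib
from pathlib import Path

output_dir = Path("outputs")  # placeholder for the run's output directory
for log_file in sorted(output_dir.rglob("*.txt")):
    digest = hashlib.sha256(log_file.read_bytes()).hexdigest()[:12]
    print(f"{digest}  {log_file}")
```

Byte-identical digests across iterations would point at the feedback-collection step (for example, the same training log being read every iteration), whereas differing files that still quote identical values would point at how the reflection prompt is assembled.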

Abc123bit commented:

Atom light value from 0 to 1
