While checking the result statistics on https://llm.aidatatools.com/ I have always missed an indication of whether the model was loaded completely into the GPU or ran in a mixed CPU/GPU environment.
Implementing such a check should be low-hanging fruit, because ollama keeps the last model loaded for a while after the benchmark request issued here has finished:
run_benchmark.py, line 75:
```
result = subprocess.run([ollamabin, 'run', model_name, one_prompt['prompt'], '--verbose'], capture_output=True, text=True, check=True, encoding='utf-8')
```
If you add another call right after it, `subprocess.run([ollamabin, 'ps'], capture_output=True, text=True, check=True, encoding='utf-8')`, you can still capture the utilization.
e.g.:
```
$ ollama ps
NAME          ID              SIZE      PROCESSOR          UNTIL
qwen2:1.5b    f6daf2b25194    1.8 GB    100% GPU           4 minutes from now

$ ollama ps
NAME          ID              SIZE      PROCESSOR          UNTIL
llama3:70b    786f3184aec0    41 GB     79%/21% CPU/GPU    4 minutes from now
```
According to the ollama documentation, it is possible to have several models loaded at the same time, so you should expect `ollama ps` to report several rows in the future. Filtering the `ollama ps` output by `model_name` (see the sketch below) should make this future-proof.
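A minimal sketch of how this could look, reusing the `ollamabin` and `model_name` names from run_benchmark.py; the helper function and the column-slicing parse are only a suggestion and assume the column layout shown above:

```python
import subprocess

def get_processor_split(ollamabin: str, model_name: str) -> str | None:
    """Return the PROCESSOR column reported by `ollama ps` for the given
    model (e.g. '100% GPU' or '79%/21% CPU/GPU'), or None if the model is
    not currently loaded."""
    result = subprocess.run(
        [ollamabin, 'ps'],
        capture_output=True, text=True, check=True, encoding='utf-8'
    )
    lines = result.stdout.strip().splitlines()
    if len(lines) < 2:
        return None
    # The SIZE, PROCESSOR and UNTIL values contain spaces themselves, so
    # slice each row at the column positions taken from the header line.
    header = lines[0]
    proc_start = header.index('PROCESSOR')
    until_start = header.index('UNTIL')
    for row in lines[1:]:
        # Filter on the model name so additional loaded models are ignored.
        if row.split()[0] == model_name:
            return row[proc_start:until_start].strip()
    return None
```

The returned string could then be stored next to the eval-rate numbers and submitted together with the other benchmark statistics.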
Finally, it would be great to see this CPU/GPU distribution on your results pages.