Fixes to start up DeepSeekR1 #219

Open
windwardly opened this issue Mar 1, 2025 · 8 comments

@windwardly

Problem
Running DeepSeekR1 in tt-studio

Steps

  1. I used tt-inference-server to set up a tt_studio_persistent_volume for DeepSeek-R1-Distill-Llama-70B. The weights were downloaded; everything looked good.

  2. I then ran startup.sh, but the tt-studio web page had no DeepSeek option on the Home Screen.

Expected behavior
I expected to be able to select and deploy DeepSeekR1.

A Remedy
I guessed at making and filling an entry in tt-studio/app/api/shared_config/model_config.py (please correct):

    ModelImpl(
        hf_model_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        image_name="ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-dev-ubuntu-22.04-amd64",
        image_tag="0.0.4-v0.56.0-rc39-3429acf14e46",
        device_configurations={DeviceConfigurations.N300x4},
        docker_config=base_docker_config(),
        service_route="/v1/chat/completions",
        setup_type=SetupTypes.TT_INFERENCE_SERVER,
    ),

The Docker image could now be selected, but the deploy failed.

I ran some tests, starting the Docker image by hand, to see where it was failing.

It failed on T3K permissions.

The directory tt_studio_persistent_volume/volume_id_tt-metal-DeepSeek-R1-Distill-Llama-70B-v0.0.1/model_weights/DeepSeek-R1-Distill-Llama-70B/T3K was owned by root.

Aside: running find /home/container_app_user -user root also lists the ~/cache_root/huggingface hierarchy. It doesn't seem to cause a problem.

I switched them all to user ID/group ID 1000 and restarted the Docker container manually, and it converted the DeepSeek weights into the T3K directory.
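For anyone who wants to script the ownership check, here is a minimal sketch in Python. The helper names `paths_owned_by` and `chown_tree` are made up for illustration and are not part of tt-studio or tt-inference-server. The target IDs of 1000 come from the description above, and actually changing ownership normally requires root.

```python
import os


def paths_owned_by(root: str, uid: int) -> list[str]:
    """List root and everything below it that is owned by the given uid.

    Roughly equivalent to `find <root> -user <uid>`. Symlinks are inspected
    themselves (lstat), not followed.
    """
    found = []
    if os.lstat(root).st_uid == uid:
        found.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid == uid:
                found.append(path)
    return found


def chown_tree(root: str, uid: int, gid: int, from_uid: int = 0) -> list[str]:
    """Hand every path owned by from_uid (root by default) to uid:gid.

    Hypothetical helper; usually needs to run as root.
    """
    changed = paths_owned_by(root, from_uid)
    for path in changed:
        os.chown(path, uid, gid, follow_symlinks=False)
    return changed


# Example (assumed path; adjust to your persistent volume):
#   print(paths_owned_by("tt_studio_persistent_volume", 0))
#   chown_tree("tt_studio_persistent_volume", 1000, 1000)
```

Listing first and changing second makes it easy to review what would be touched before running the change as root.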

I shut down the manual image, and started up tt-studio.
The DeepSeek-R1-Distill-Llama-70B deployed and runs. Yay!

==

Additional cleanup: all files under /home/container_app_user were set to executable.

In the container: /usr/local/bin/docker-entrypoint.sh
In the tt-inference-server repo: docker-entrypoint.sh

I think the chmod -R 2775 "$var_dir" there should operate only on directories, something like:

find "$var_dir" -type d -print0 | xargs -0 chmod 2775
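The same directories-only idea can be sketched in Python, in case it is easier to test than the shell pipeline. The function name `chmod_dirs_only` is an illustrative assumption, and this is not the code in docker-entrypoint.sh; it only mirrors what the find command above does.

```python
import os


def chmod_dirs_only(root: str, mode: int = 0o2775) -> list[str]:
    """Apply mode to root and every directory below it, leaving files alone.

    os.walk yields each real directory once and does not descend into
    symlinked directories, which matches `find <root> -type d`.
    """
    changed = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        os.chmod(dirpath, mode)
        changed.append(dirpath)
    return changed
```

Regular files keep whatever mode they already had, so nothing under the tree becomes executable as a side effect.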

@windwardly
Author

Aside: when done using DeepSeekR1, I went to Deployments and pressed 'Delete', and the QuietBox rebooted. I haven't dug into that yet.

@anirudTT
Contributor

anirudTT commented Mar 3, 2025

@windwardly: Thanks for opening this issue and trying out the DeepSeek model via tt-inference-server and tt-studio on a Tenstorrent device!

As you correctly noticed, tt-studio does not yet have support (or a dropdown entry) for the DeepSeek model implementation, so your fix of adding that entry to tt-studio/app/api/shared_config/model_config.py was absolutely correct.

Usually, if something goes wrong during setup, the persistent storage directory for the model (e.g., tt_studio_persistent_volume/volume_id_tt-metal-DeepSeek-R1-Distill-Llama-70B-v0.0.1/model_weights/DeepSeek-R1-Distill-Llama-70B/T3K) ends up owned by root instead of by both root and the user. The remedy is to use the chown command, as described here, to fix the permission issues.

It's amazing that you eventually got it running! 💯

We're aiming to prioritize support for this model and then actively test this flow ourselves—thanks again!

@anirudTT
Contributor

anirudTT commented Mar 3, 2025

> Aside: when done using DeepSeekR1, I went to Deployments, I pressed 'Delete' and the QuietBox rebooted. I haven't dug into that yet.

Any time a model is deleted from the deployments page/table, a tt-smi reset is run on the Tenstorrent board; this is by design and is triggered automatically as long as the Tenstorrent devices are mounted to the backend container.

anirudTT added the community label (Community-driven issues and suggestions.) Mar 3, 2025

@windwardly
Author

> Any time a model is deleted from the deployments page/table, a tt-smi reset is run on the Tenstorrent board; this is by design and is triggered automatically as long as the Tenstorrent devices are mounted to the backend container.

It's more than the board getting reset: the whole machine rebooted, losing state. (Well, vim keeps edits on disk, but that means I need to find them and vim -r them.) I also lost all my layered screen sessions that created separate workspaces; nothing other than time was lost.

@anirudTT
Contributor

anirudTT commented Mar 3, 2025

> > Any time a model is deleted from the deployments page/table, a tt-smi reset is run on the Tenstorrent board; this is by design and is triggered automatically as long as the Tenstorrent devices are mounted to the backend container.
>
> It's more than the board getting reset: the whole machine rebooted, losing state. (Well, vim keeps edits on disk, but that means I need to find them and vim -r them.) I also lost all my layered screen sessions that created separate workspaces; nothing other than time was lost.

Oh, that's new. We have never observed this before. Just so I understand, you deleted the model, and the whole system rebooted?

@windwardly
Author

> Oh, that's new. We have never observed this before. Just so I understand, you deleted the model, and the whole system rebooted?

Yes.

A snippet from /var/log/syslog :

[..normal operations pruned, ending with ..]
Mar  2 20:17:01 ub-22-04 CRON[20060]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  2 21:17:01 ub-22-04 CRON[40040]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

<then nothing else is recorded until I clicked delete 1/2 hour later ...>

Mar  2 21:50:47 ub-22-04 systemd[1]: docker-9d7a11243854b40082f3d380c0ba079b44fe53ed2d6f1f35ea9cfd0e24e87e11.scope: Deactivated successfully.
Mar  2 21:50:47 ub-22-04 systemd[1]: docker-9d7a11243854b40082f3d380c0ba079b44fe53ed2d6f1f35ea9cfd0e24e87e11.scope: Consumed 25min 8.700s CPU time.
Mar  2 21:50:47 ub-22-04 dockerd[1280]: time="2025-03-02T21:50:47.873131458-05:00" level=info msg="ignoring event" container=9d7a11243854b40082f3d380c0ba079b44fe53ed2d6f1f35ea9cfd0e24e87e11 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar  2 21:50:47 ub-22-04 containerd[1193]: time="2025-03-02T21:50:47.873272956-05:00" level=info msg="shim disconnected" id=9d7a11243854b40082f3d380c0ba079b44fe53ed2d6f1f35ea9cfd0e24e87e11 namespace=moby
Mar  2 21:50:47 ub-22-04 containerd[1193]: time="2025-03-02T21:50:47.873324105-05:00" level=warning msg="cleaning up after shim disconnected" id=9d7a11243854b40082f3d380c0ba079b44fe53ed2d6f1f35ea9cfd0e24e87e11 namespace=moby
Mar  2 21:50:47 ub-22-04 containerd[1193]: time="2025-03-02T21:50:47.873332025-05:00" level=info msg="cleaning up dead shim" namespace=moby
Mar  2 21:50:47 ub-22-04 kernel: [11581.238033] br-c5f962b3c557: port 4(vethf3ca477) entered disabled state
Mar  2 21:50:47 ub-22-04 kernel: [11581.238168] vethdc7d2e0: renamed from eth0
Mar  2 21:50:47 ub-22-04 systemd-networkd[1129]: vethf3ca477: Lost carrier
Mar  2 21:50:48 ub-22-04 networkd-dispatcher[1160]: WARNING:Unknown index 13 seen, reloading interface list
Mar  2 21:50:48 ub-22-04 systemd-udevd[46839]: Using default interface naming scheme 'v249'.
Mar  2 21:50:48 ub-22-04 systemd-networkd[1129]: vethf3ca477: Link DOWN
Mar  2 21:50:48 ub-22-04 kernel: [11581.339136] br-c5f962b3c557: port 4(vethf3ca477) entered disabled state
Mar  2 21:50:48 ub-22-04 kernel: [11581.340265] device vethf3ca477 left promiscuous mode
Mar  2 21:50:48 ub-22-04 kernel: [11581.340275] br-c5f962b3c557: port 4(vethf3ca477) entered disabled state
Mar  2 21:50:48 ub-22-04 networkd-dispatcher[1160]: ERROR:Unknown interface index 13 seen even after reload
Mar  2 21:50:48 ub-22-04 networkd-dispatcher[1160]: WARNING:Unknown index 13 seen, reloading interface list
Mar  2 21:50:48 ub-22-04 networkd-dispatcher[1160]: ERROR:Unknown interface index 13 seen even after reload
Mar  2 21:50:48 ub-22-04 systemd[1]: run-docker-netns-c8fa425100d7.mount: Deactivated successfully.
Mar  2 21:50:48 ub-22-04 systemd[1]: var-lib-docker-overlay2-9c38d49e53869e74f7a5085bc30ff616df8033fd6a73c619232ce4314b5fc2d7-merged.mount: Deactivated successfully.

<then, an almost 4 minute gap - in which system restarted ...>

Mar  2 21:54:36 ub-22-04 systemd-modules-load[830]: Inserted module 'msr'
Mar  2 21:54:36 ub-22-04 systemd-pstore[838]: PStore dmesg-erst-7477410317123715074 moved to /var/lib/systemd/pstore/7477410317123/dmesg-erst-7477410317123715074
Mar  2 21:54:36 ub-22-04 systemd-pstore[838]: PStore dmesg-erst-7477410317123715073 moved to /var/lib/systemd/pstore/7477410317123/dmesg-erst-7477410317123715073
Mar  2 21:54:36 ub-22-04 multipathd[848]: --------start up--------
Mar  2 21:54:36 ub-22-04 multipathd[848]: read /etc/multipath.conf
Mar  2 21:54:36 ub-22-04 kernel: [    0.000000] Linux version 5.15.0-133-generic (buildd@lcy02-amd64-040) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #144-Ubuntu SMP Fri Feb 7 20:47:38 UTC 2025 (Ubuntu 5.15.0-133.144-generic 5.15.173)
Mar  2 21:54:36 ub-22-04 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-133-generic root=UUID=691f9853-af33-4c00-b53c-4d2e7e28214b ro iommu=pt
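To spot silent gaps like the one above without scrolling through the log, a small Python sketch along these lines may help. The function name `find_gaps`, the two-minute threshold, and the `year` parameter are assumptions for illustration; the year has to be supplied because traditional syslog timestamps do not include one.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(minutes=2), year=2025):
    """Return (before, after) timestamp pairs where the log went quiet.

    Expects lines that start with a timestamp like 'Mar  2 21:50:48'.
    Lines that do not start with one (blank lines, notes) are skipped.
    """
    gaps = []
    prev = None
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if prev is not None and stamp - prev >= min_gap:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps


# Example (reading the whole file; large logs may warrant streaming instead):
#   with open("/var/log/syslog", errors="replace") as fh:
#       for before, after in find_gaps(fh):
#           print(before, "->", after, after - before)
```

A gap alone does not prove a reboot (the first two gaps in the snippet are just idle time between cron runs), so it is worth checking whether the line after the gap belongs to early boot, such as the kernel's `[    0.000000]` messages.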

@windwardly
Author

> Oh, that's new. We have never observed this before. Just so I understand, you deleted the model, and the whole system rebooted?

FYI update: Yes, and the reboot is repeatable. I'll work on tracking it down later.

@windwardly
Author

The reboot happens independently of tt-studio and Docker containers. It's triggered by running tt-smi -r <file>. I'll keep reducing the case, and if I find something, I will file a report under that repo. Removing the reboot as a concern related to tt-studio.
