📝 overhaul of the documentation, now 4.5x bigger (better?) #144
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
BTW, just for reference: we also now link to the optimum-tpu docs from other places.
The goal is to increase visibility of the docs.
Thanks for the huge work! Some general comments:
- I would prefer to avoid repetition: having information repeated in several places can be confusing and harder to maintain. E.g.: docker arguments, TGI args.
- You specify version numbers; I think it would be best if we could generate those, otherwise they will be a burden to maintain.
- Try to keep titles and the toc tree in sync.
- There is a bit of repetition between the tutorials and how-tos. Maybe you can rationalize that.
- The conceptual guides should be more focused on optimum-tpu IMO, what do you think?
Thanks a lot, @tengomucho, for all the feedback. It took me a while to integrate everything 😅, and there is still some stuff where I am not sure exactly what you mean. Since there are a lot of comments already, please mark them as resolved if you believe they do not require further action on my part, so I can keep better track of everything left. If they don't require any further action from me but you left a comment you would like me to see, feel free to leave them open. I believe I implemented most of the changes, and this should be ready for another review.
I also added a "More todos (to keep track of everything)" section to the original PR description about the next tasks related to this PR, to avoid adding more changes to this PR and keep it from becoming even more monstrous 👹
I will answer this first and then re-do the review.
I think this is debatable. We can repeat things a couple of times, but we should not show exactly the same command in many pages! E.g.: the
Great! 👍
Any solution works for me, I trust your choice
I watched the talk and the page. I am ok to go all in to structure the doc this way, but I still believe we should not talk about the same subject in a tutorial and in a howto guide. That is what I mean: I am ok if you create a tutorial on training for, say, llama, but then we might not need to have the same instructions in the howto guide. In this perspective I would think we should move the gemma and llama examples to the tutorials, don't you think?
@tengomucho I think I have addressed all the comments now!
I like it a lot more! Please fix the broken links on the main page as discussed, and then this LGTM!
- Peak compute per Pod: Total computing power when multiple chips work together. These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.

The HBM (High Bandwidth Memory) capacity per chip is 16 GB for v5e and v5p, and 32 GB for v6e. So a v5e-8 (v5litepod-8) has 16 GB × 8 = 128 GB of HBM memory.
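The arithmetic above can be sketched as a quick shell check (the per-chip figures and the v5e-8 example come from the comment; the helper function is just illustrative):

```shell
# Sketch of the HBM arithmetic: per-chip HBM is 16 GB on v5e/v5p and
# 32 GB on v6e; total HBM for a slice scales with the chip count.
hbm_total_gb() {
  per_chip_gb=$1
  num_chips=$2
  echo $((per_chip_gb * num_chips))
}

TOTAL=$(hbm_total_gb 16 8)   # v5e-8 (v5litepod-8): 16 GB per chip, 8 chips
echo "v5e-8 total HBM: ${TOTAL} GB"
```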
full st... oh, whatever
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode

Those are needed to run a TPU container so that the container can properly access the TPU hardware.

- `--shm-size 16GB`: Increases the default shared memory allocation.
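Put together, the flags discussed above could look like this in a full invocation (the image name is a placeholder, not from the PR; the command is only printed here, not executed):

```shell
# Sketch: a docker run line combining the TPU-related flags above.
# IMAGE is a placeholder; substitute the actual optimum-tpu image.
IMAGE="<your-optimum-tpu-image>"
DOCKER_CMD="docker run --rm --privileged --net host --shm-size 16GB ${IMAGE}"
echo "${DOCKER_CMD}"   # printed rather than executed in this sketch
```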
❤️
Note: there are some failures in the doc build.
There was a typo that caused the build to fail; this is fixed now. Also, all the links between pages have been checked to prevent redirecting to the wrong pages (added a todo here to make this part of the CI). I think everything should be okay now. Let me know if anything else is missing.
LGTM
@pagezyhf Let me know if you see any changes or potential improvements for the docs. I can also do separate PRs if you think some pages are missing and need to be added.
What does this PR do?
This is a complete overhaul of the documentation:
What is missing (could be added):
More todos (to keep track of everything)
`markdown-link-check docs/source/**/*.mdx`
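The link-check todo above could be wired up file-by-file, roughly like this (only the `markdown-link-check` CLI name and the `docs/source` path come from the todo; the function name and loop shape are assumptions, and the real call is left as an echoed dry run):

```shell
# Dry-run sketch of the link-check todo: print the check command for
# each docs page. Replace the echo with a real markdown-link-check
# invocation once the tool is installed (e.g. via npm).
check_docs_links() {
  dir="${1:-docs/source}"
  find "$dir" -name '*.mdx' 2>/dev/null | while read -r f; do
    echo "markdown-link-check $f"
  done
}

check_docs_links
```

Running it per file (rather than one big glob) makes it easier to see which page a broken link lives in.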
New Files Added
docs/scripts/auto-generate-examples.py
docs/scripts/examples_list.yml
docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx
docs/source/conceptual_guides/tpu_hardware_support.mdx
docs/source/contributing.mdx
docs/source/howto/advanced-tgi-serving.mdx
docs/source/howto/deploy_instance_on_ie.mdx
docs/source/howto/installation_inside_a_container.mdx
docs/source/installation.mdx
docs/source/optimum_container.mdx
docs/source/reference/fsdp_v2.mdx
docs/source/reference/tgi_advanced_options.mdx
docs/source/tutorials/inference_on_tpu.mdx
docs/source/tutorials/tpu_setup.mdx
docs/source/tutorials/training_on_tpu.mdx
Modified Files
docs/source/howto/training.mdx
docs/source/index.mdx
docs/source/supported-architectures.mdx