📝 overhaul of the documentation, now 4.5x bigger (better?) #144

baptistecolle · 2025-01-15T13:03:55Z

What does this PR do?

This is a complete overhaul of the documentation:

We went from 1686 to 7565 words (4.5x bigger)
We auto-generate documentation for our examples
- This is auto-generated from a .ipynb notebook using a custom script and the doc-builder converter feature https://moon-ci-docs.huggingface.co/docs/optimum-tpu/pr_144/en/howto/gemma_tuning
- So when we add new notebook examples they will also be in the docs 😁
New formatting and organization of the docs to make it easier to follow
Added new tutorials, how-to, conceptual guides, and references following the diataxis method
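
To make the notebook-to-docs step above more concrete, here is a minimal sketch of turning notebook cells into Markdown that could be written to an .mdx page. It is not the script added in this PR (docs/scripts/auto-generate-examples.py) and it does not use the doc-builder converter. The function name `notebook_to_markdown` and the in-memory sample notebook are hypothetical, and the sketch relies only on the public .ipynb JSON layout (`cells`, `cell_type`, `source`).

```python
import json

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences


def notebook_to_markdown(notebook: dict, code_lang: str = "python") -> str:
    """Render the cells of a parsed .ipynb document as Markdown text.

    Hypothetical helper for illustration; not the PR's actual script.
    """
    parts = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        text = text.strip("\n")
        if not text:
            continue  # skip empty cells
        if cell.get("cell_type") == "markdown":
            parts.append(text)
        elif cell.get("cell_type") == "code":
            parts.append(f"{FENCE}{code_lang}\n{text}\n{FENCE}")
        # Other cell types (for example raw cells) are ignored in this sketch.
    return "\n\n".join(parts) + "\n"


# A tiny in-memory notebook standing in for a real .ipynb file.
sample_notebook = json.loads(
    """
    {
      "cells": [
        {"cell_type": "markdown", "source": ["# Fine-tuning example\\n", "Intro text."]},
        {"cell_type": "code", "source": ["print('hello TPU')"]},
        {"cell_type": "code", "source": []}
      ]
    }
    """
)

markdown = notebook_to_markdown(sample_notebook)
print(markdown)
```

In a real pipeline the dictionary would come from `json.load` on each notebook listed in a config file, and the returned string would be written next to the other pages under docs/source.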

What is missing (could be added):

I think more examples would be nice, showing more diverse use cases
I believe FAQ and glossary would be nice to add, but this PR is big enough already
Guide and examples with Google Colab Pro as you can launch a v5e-1 TPU from there, so a one-click example would be nice
An example using GCE VM on Colab via GCP Marketplace
More diagrams and figures showing the internal workings of optimum-tpu in more detail would be interesting
A how-to guide on adding new models for new contributors
Docs for GKE are in the works but not published yet, as there are some blockers: https://github.com/huggingface/optimum-tpu/blob/doc-deploy-gke/docs/source/howto/deploy-gke.md
The current preview docs for GKE are for CLI only. A GUI guide would be interesting too

More todos (to keep track of everything)

Move the current training tutorial to howto and make a training tutorial where all dependencies are installed from scratch
make sure us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 is uploaded so it can be used. (also check for other GKE images)
make the optimum-tpu ghcr.io latest point to tgi and not tgi-ie
make the container version in the docs dynamic so that the docs always point to the latest version
inject duplicate content dynamically (to make maintenance easier)
assert max-batch-prefill-tokens ≤ max-input-length * max_batch_size
refactor the name of the files (.mdx) so they match their title better
look at Kaggle, as they also support TPU, so adding an example with Kaggle would be nice
add links from other Hugging Face projects to optimum-tpu
- https://huggingface.co/docs/transformers/perf_train_tpu_tf
- https://huggingface.co/docs/accelerate/en/concept_guides/training_tpu
- https://huggingface.co/docs/accelerate/en/basic_tutorials/tpu
- Look at other projects such as TRL and PEFT to see if we can add a page or a link to optimum-tpu there
Add example on diffusers with TPU
Add `markdown-link-check docs/source/**/*.mdx` to CI
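
One of the todos above asks for a check that max-batch-prefill-tokens ≤ max-input-length * max_batch_size. A minimal sketch of such a check could look like the following. The function name `check_prefill_budget` is hypothetical, the parameter names simply mirror the wording of the todo, and the numbers in the usage lines are illustrative only; this is not code from optimum-tpu or TGI.

```python
def check_prefill_budget(
    max_batch_prefill_tokens: int,
    max_input_length: int,
    max_batch_size: int,
) -> None:
    """Raise ValueError if the prefill budget exceeds what a full batch can hold.

    Hypothetical helper; the parameter names follow the todo item above.
    """
    values = {
        "max_batch_prefill_tokens": max_batch_prefill_tokens,
        "max_input_length": max_input_length,
        "max_batch_size": max_batch_size,
    }
    for name, value in values.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")

    # A batch can contain at most max_batch_size inputs of max_input_length tokens.
    limit = max_input_length * max_batch_size
    if max_batch_prefill_tokens > limit:
        raise ValueError(
            f"max_batch_prefill_tokens ({max_batch_prefill_tokens}) exceeds "
            f"max_input_length * max_batch_size ({limit})"
        )


# Illustrative values only.
check_prefill_budget(4096, max_input_length=1024, max_batch_size=4)  # passes

try:
    check_prefill_budget(8192, max_input_length=1024, max_batch_size=4)
except ValueError as err:
    print(f"rejected: {err}")
```

Failing early with a clear message keeps a misconfigured launch from surfacing later as a harder-to-diagnose runtime error.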

New Files Added

docs/scripts/auto-generate-examples.py
docs/scripts/examples_list.yml
docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx
docs/source/conceptual_guides/tpu_hardware_support.mdx
docs/source/contributing.mdx
docs/source/howto/advanced-tgi-serving.mdx
docs/source/howto/deploy_instance_on_ie.mdx
docs/source/howto/installation_inside_a_container.mdx
docs/source/installation.mdx
docs/source/optimum_container.mdx
docs/source/reference/fsdp_v2.mdx
docs/source/reference/tgi_advanced_options.mdx
docs/source/tutorials/inference_on_tpu.mdx
docs/source/tutorials/tpu_setup.mdx
docs/source/tutorials/training_on_tpu.mdx

Modified Files

docs/source/howto/training.mdx
docs/source/index.mdx
docs/source/supported-architectures.mdx

HuggingFaceDocBuilderDev · 2025-01-15T13:26:31Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

baptistecolle · 2025-01-15T13:55:08Z

BTW just for reference. We also now link to the optimum-tpu docs from:

TGI docs: PR review in progress: https://huggingface.co/docs/text-generation-inference/en/installation_tpu
HF Google Cloud docs: https://huggingface.co/docs/google-cloud/en/tpu

The goal is to increase the visibility of the docs

tengomucho

Thanks for the huge work! Some general comments

I would prefer to avoid repetition: having information repeated in several places can be confusing and it is harder to maintain. E.g.: docker arguments, TGI args
you specify version numbers, I think it would be best if we could generate that, otherwise it will be a burden to maintain
Try to keep titles and toc tree in sync
There is a bit of repetition between the tutorials and howtos. Maybe you can rationalize that.
the conceptual guides should be more focused on optimum-tpu IMO, what do you think?

docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx

docs/source/conceptual_guides/tpu_hardware_support.mdx

docs/source/tutorials/tpu_setup.mdx

docs/source/tutorials/training_on_tpu.mdx

docs/source/tutorials/inference_on_tpu.mdx

docs/source/howto/advanced-tgi-serving.mdx

baptistecolle · 2025-01-16T12:00:23Z

Thanks a lot, @tengomucho, for all the feedback. It took me a while to integrate everything 😅 and there are still a few points where I am not sure exactly what you mean.

As there are a lot of comments already, please mark them as resolved if you believe they do not require further action on my part, so I can keep better track of everything that is left. If they don't require any further action on my part but you left a comment that you would like me to see, feel free to leave them open.

I believe I implemented most of the changes, and this should be ready for another review.


1. I would prefer to keep information close to where it is needed, so that people don't have to jump around in the docs to find everything. That's why there is some duplication.
   I see your point. Let me try to find a way to inject that content into each page to prevent duplication and simplify maintenance. Do you mind if I do that in another PR? I did not see that option in the HF doc-builder docs, so I wanted to ask the team.
   I prefer to keep duplicate information because I believe keeping related information close to where it will be used is important, even if that means some duplication. This offers a better user experience, and users may arrive at the documentation from any page, so the order in which it is read is not linear.

2. Yes. Do you mind if I do that in another PR, where I also look at 1) and at how to inject everything dynamically?
3. Great feedback. I modified that, and they are now in sync.
   Question for you: I would like to rename a lot of the files in the docs/source folder (refactoring) because the file names are not very descriptive or well structured. I believe it is best done in another PR, or do you think it is better to do it here as a last commit after approval? If I do it here, I will mess up the git diff, which makes it more difficult to review, right?
4. Sure.
   First of all, I recommend this talk on docs from one of the maintainers of Django (Daniele Procida):
   https://www.youtube.com/watch?v=t4vKPhjcMZg (also look up other resources by Daniele Procida).
   He talks about https://diataxis.fr/, a way to structure docs.
   Let me, however, answer your question directly: why is there an overlap between tutorials and how-tos? Tutorials offer a way for the user to accomplish a task (we don't talk about tradeoffs, the different options, and so on; we only give a recipe to follow). How-tos explain the rationale behind each action and why we do these steps (keeping with the cooking analogy: why do we bake a cake at a given temperature, for example?). They overlap, but how-tos go a step beyond the tutorial and explain how serving, for example, actually works on optimum-tpu, not only how to achieve it. As for optimum-tpu, I don't know if the current pages are perfect in that regard, but the goal is for them to encapsulate that philosophy. I hope this answers your question.
5. It links to 4) too.
   I answered that in a comment above.

the conceptual guides should be more focused on optimum-tpu IMO, what do you think?

I don't think so. For example, in transformers they explain the different styles of attention: https://huggingface.co/docs/transformers/en/attention. For me, one of the goals of conceptual guides is to explain the concepts that the end user needs in order to use and understand our library, optimum-tpu.

baptistecolle · 2025-01-16T13:50:19Z

I also added a "More todos (to keep track of everything)" section to the original PR description about the next tasks to do related to this PR, to prevent having more changes in this PR and keep it from being even more monstrous 👹

tengomucho · 2025-01-16T15:25:51Z

I will answer this first and then re-do the review.

1. I would prefer to keep information close to where it is needed, so that people don't have to jump around in the docs to find everything. That's why there is some duplication.
   I see your point. Let me try to find a way to inject that content into each page to prevent duplication and simplify maintenance. Do you mind if I do that in another PR? I did not see that option in the HF doc-builder docs, so I wanted to ask the team.
   I prefer to keep duplicate information because I believe keeping related information close to where it will be used is important, even if that means some duplication. This offers a better user experience, and users may arrive at the documentation from any page, so the order in which it is read is not linear.

I think this is debatable. We can repeat things a couple of times, but we should not show exactly the same command in many pages! E.g.: the docker run command is shown 4 times, every time to start the same model. Maybe we can launch it with a different model, or something like that...

2. Yes. Do you mind if I do that in another PR, where I also look at 1) and at how to inject everything dynamically?

Great! 👍

3. Great feedback. I modified that, and they are now in sync.
   Question for you: I would like to rename a lot of the files in the docs/source folder (refactoring) because the file names are not very descriptive or well structured. I believe it is best done in another PR, or do you think it is better to do it here as a last commit after approval? If I do it here, I will mess up the git diff, which makes it more difficult to review, right?

Any solution works for me, I trust your choice

4. Sure.
   First of all, I recommend this talk on docs from one of the maintainers of Django (Daniele Procida):
   https://www.youtube.com/watch?v=t4vKPhjcMZg (also look up other resources by Daniele Procida).
   He talks about https://diataxis.fr/, a way to structure docs.
   Let me, however, answer your question directly: why is there an overlap between tutorials and how-tos? Tutorials offer a way for the user to accomplish a task (we don't talk about tradeoffs, the different options, and so on; we only give a recipe to follow). How-tos explain the rationale behind each action and why we do these steps (keeping with the cooking analogy: why do we bake a cake at a given temperature, for example?). They overlap, but how-tos go a step beyond the tutorial and explain how serving, for example, actually works on optimum-tpu, not only how to achieve it. As for optimum-tpu, I don't know if the current pages are perfect in that regard, but the goal is for them to encapsulate that philosophy. I hope this answers your question.

I watched the talk and read the page. I am ok to go all in and structure the docs this way, but I still believe we should not cover the same subject in both a tutorial and a how-to guide. That is what I mean: I am ok if you create a tutorial on training for, say, llama, but then we might not need the same instructions in the how-to guide. From this perspective, I think we should move the gemma and llama examples to the tutorials, don't you think?
Also, there are now two buttons on the main page, linking to tutorials and how-tos (but not to references or conceptual guides), and they do not work. Please fix them.

5. it links to 4) too
   I answered that in a comment above
the conceptual guides should be more focused on optimum-tpu IMO, what do you think?
I don't think so. For example, in transformers they explain the different styles of attention: https://huggingface.co/docs/transformers/en/attention. For me, one of the goals of conceptual guides is to explain the concepts that the end user needs in order to use and understand our library, optimum-tpu.

docs/source/conceptual_guides/tpu_hardware_support.mdx

docs/source/howto/installation_inside_a_container.mdx

docs/source/howto/training.mdx

docs/source/tutorials/training_on_tpu.mdx

docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx

docs/source/howto/advanced-tgi-serving.mdx

docs/source/tutorials/training_on_tpu.mdx

baptistecolle · 2025-01-17T15:49:37Z

@tengomucho I think I have addressed all the comments now!

I removed those parts that were indeed not actionable.

## Tips
We recommend fine-tuning using bf16 on TPU, as those operations are often extremely fast while keeping the training stable and precise enough.

## Troubleshooting

Common issues and solutions:

1. Out of Memory (OOM):
   - Enable gradient checkpointing
   - Reduce batch size
   - Use a smaller sequence length

2. Training Speed:
   - Ensure proper batch size optimization
   - Monitor TPU device utilization
   - Check for communication bottlenecks

I removed the page about the training container, which is not live yet.
I redid the training page. It is now a fully self-contained example, without Docker, of how to train on TPU with optimum, so please take a look at https://moon-ci-docs.huggingface.co/docs/optimum-tpu/pr_144/en/tutorials/training_on_tpu

tengomucho · 2025-01-17T16:01:18Z

I like it a lot more! Please fix the broken links on the main page as discussed and then this LGTM!

tengomucho · 2025-01-17T16:01:55Z

docs/source/conceptual_guides/tpu_hardware_support.mdx

- Peak compute per Pod: Total computing power when multiple chips work together. These indicate performance at scale for large training or serving jobs.
-
-The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.
+The HBM (High Bandwidth Memory) capacity per chip is 16GB for v5e, v5p and 32GB for v6e. So a v5e-8 (v5litepod-8), has 16GB*8=128GB of HBM memory


full st... oh, whatever
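
The HBM arithmetic in the added line above (16GB*8=128GB for a v5e-8) can be written as a small helper. This is only an illustrative sketch: the function name `total_hbm_gb` and the lookup table are hypothetical, and the table only contains the v5e and v6e per-chip figures quoted in that line.

```python
# Per-chip HBM capacity in GB, as quoted in the diff above (v5e and v6e only).
HBM_GB_PER_CHIP = {"v5e": 16, "v6e": 32}


def total_hbm_gb(tpu_version: str, num_chips: int) -> int:
    """Return the combined HBM of a slice: per-chip capacity times chip count."""
    if tpu_version not in HBM_GB_PER_CHIP:
        raise KeyError(f"no HBM figure recorded for {tpu_version!r}")
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return HBM_GB_PER_CHIP[tpu_version] * num_chips


print(total_hbm_gb("v5e", 8))  # 16 GB * 8 chips = 128
```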

tengomucho · 2025-01-17T16:02:12Z

docs/source/howto/advanced-tgi-serving.mdx

- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
-Those are needed to run a TPU container so that the container can properly access the TPU hardware
+- `--shm-size 16GB`: Increase default shared memory allocation.


tengomucho

Note: there are some failures in the doc build.

baptistecolle · 2025-01-20T08:15:29Z

There was a typo that caused the build to fail. This is fixed now. Also, all the links between pages have been checked to prevent redirecting to the wrong pages (I added a todo to make this part of the CI). I think everything should be okay now. Let me know if anything else is missing.

tengomucho

LGTM

baptistecolle · 2025-01-20T09:28:48Z

@pagezyhf Let me know if you see any changes or potential improvements with the docs. I can also do separate PRs if you think some pages are missing and need to be added

baptistecolle force-pushed the improve-documentation branch from 840e7aa to fc2dae5 Compare January 15, 2025 13:05

baptistecolle changed the title ~~📝 overhaul of the documentation, now 4.5 bigger (better?)~~ 📝 overhaul of the documentation, now 4.5x bigger (better?) Jan 15, 2025

baptistecolle force-pushed the improve-documentation branch 2 times, most recently from 2ea95f2 to 1be1304 Compare January 15, 2025 13:13

feat(docs): overhaul of the documentation

30da685

baptistecolle force-pushed the improve-documentation branch from 1be1304 to 30da685 Compare January 15, 2025 13:19

wip(ci): fix ci for the auto-generated docs

c16495a

baptistecolle marked this pull request as ready for review January 15, 2025 13:50

baptistecolle requested a review from tengomucho January 15, 2025 13:50

pagezyhf self-requested a review January 15, 2025 15:00

tengomucho reviewed Jan 15, 2025


tengomucho reviewed Jan 16, 2025


docs: address PR feedback of tengomucho

3aebf6c

fix(docs): add closing </Tip> prevent build error

c58933c

baptistecolle requested a review from tengomucho January 16, 2025 12:07

baptistecolle added 2 commits January 16, 2025 14:11

docs: add curl cmd to docs and remove useless docker section

bf1c3cc

docs: fix _toctree consistency

43516b8

tengomucho reviewed Jan 16, 2025


huggingface deleted a comment from tengomucho Jan 17, 2025

docs: adress the new PR review feedbacks

d77901a

baptistecolle requested a review from tengomucho January 17, 2025 15:49

tengomucho reviewed Jan 17, 2025


fix(docs): fix all the broken links

a338ce7

tengomucho reviewed Jan 17, 2025


fix(docs): fix broken index page

c0ca01b

baptistecolle requested a review from tengomucho January 20, 2025 08:13

tengomucho approved these changes Jan 20, 2025


baptistecolle merged commit 40aa326 into main Jan 21, 2025
2 checks passed
