📝 overhaul of the documentation, now 4.5x bigger (better?) #144

Merged
merged 9 commits into from
Jan 21, 2025

baptistecolle
Collaborator

@baptistecolle baptistecolle commented Jan 15, 2025

What does this PR do?

This is a complete overhaul of the documentation:

  • We went from 1,686 to 7,565 words (about 4.5x bigger)
  • We auto-generate documentation for our examples
  • New formatting and organization of the docs to make it easier to follow
  • Added new tutorials, how-to, conceptual guides, and references following the diataxis method

What is missing (could be added):

  • I think more examples would be nice, showing more diverse use cases
  • I believe FAQ and glossary would be nice to add, but this PR is big enough already
  • A guide and examples for Google Colab Pro: you can launch a v5e-1 TPU from there, so a one-click example would be nice
  • An example using GCE VM on Colab via GCP Marketplace
  • More diagrams and figures of the inner workings of optimum-tpu would be interesting, to give more detail
  • A how-to guide on adding new models for new contributors
  • Docs for GKE are in the works but not published yet, as there are some blockers: https://github.com/huggingface/optimum-tpu/blob/doc-deploy-gke/docs/source/howto/deploy-gke.md
  • The current preview docs for GKE cover the CLI only. A GUI guide would be interesting too

More todos (to keep track of everything)

  • Move the current training tutorial to howto and make a training tutorial where all dependencies are installed from scratch
  • make sure us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 is uploaded so it can be used. (also check for other GKE images)
  • make the optimum-tpu ghcr.io latest point to tgi and not tgi-ie
  • make the container version in the docs dynamic so that the docs always point to the latest version
  • inject duplicate content dynamically (to make maintenance easier)
  • assert max-batch-prefill-tokens ≤ max-input-length * max_batch_size
  • refactor the name of the files (.mdx) so they match their title better
  • look at Kaggle, as they also support TPU, so adding an example with Kaggle would be nice
  • add links from other huggingface projects to optimum-tpu
  • Add example on diffusers with TPU
  • Add to CI markdown-link-check docs/source/**/*.mdx
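
One of the todos above asks for a check that max-batch-prefill-tokens does not exceed max-input-length * max_batch_size. A minimal sketch of such a check is shown here. The function name `prefill_budget_ok` is hypothetical and not part of optimum-tpu or TGI, and where the check would actually be wired in is left open.

```python
def prefill_budget_ok(max_batch_prefill_tokens: int,
                      max_input_length: int,
                      max_batch_size: int) -> bool:
    """Return True when the prefill token budget fits within a full batch.

    Hypothetical helper: it only encodes the inequality from the todo list,
    max_batch_prefill_tokens <= max_input_length * max_batch_size.
    """
    return max_batch_prefill_tokens <= max_input_length * max_batch_size


# Example values chosen for illustration only.
assert prefill_budget_ok(4096, 1024, 4)       # 4096 <= 1024 * 4
assert not prefill_budget_ok(8192, 1024, 4)   # 8192 >  1024 * 4
```

A boolean helper keeps the rule easy to reuse; a caller could raise an error or log a warning, depending on how strict the launcher should be.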

New Files Added

  • docs/scripts/auto-generate-examples.py
  • docs/scripts/examples_list.yml
  • docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx
  • docs/source/conceptual_guides/tpu_hardware_support.mdx
  • docs/source/contributing.mdx
  • docs/source/howto/advanced-tgi-serving.mdx
  • docs/source/howto/deploy_instance_on_ie.mdx
  • docs/source/howto/installation_inside_a_container.mdx
  • docs/source/installation.mdx
  • docs/source/optimum_container.mdx
  • docs/source/reference/fsdp_v2.mdx
  • docs/source/reference/tgi_advanced_options.mdx
  • docs/source/tutorials/inference_on_tpu.mdx
  • docs/source/tutorials/tpu_setup.mdx
  • docs/source/tutorials/training_on_tpu.mdx

Modified Files

  • docs/source/howto/training.mdx
  • docs/source/index.mdx
  • docs/source/supported-architectures.mdx

@baptistecolle baptistecolle changed the title from "📝 overhaul of the documentation, now 4.5 bigger (better?)" to "📝 overhaul of the documentation, now 4.5x bigger (better?)" Jan 15, 2025
@baptistecolle baptistecolle force-pushed the improve-documentation branch 2 times, most recently from 2ea95f2 to 1be1304 Compare January 15, 2025 13:13
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@baptistecolle baptistecolle marked this pull request as ready for review January 15, 2025 13:50
@baptistecolle
Collaborator Author

baptistecolle commented Jan 15, 2025

BTW just for reference. We also now link to the optimum-tpu docs from:

The goal is to increase visibility of the doc

@pagezyhf pagezyhf self-requested a review January 15, 2025 15:00
Collaborator

@tengomucho tengomucho left a comment

Thanks for the huge work! Some general comments

  • I would prefer to avoid repetition: having information repeated in several places can be confusing and it is harder to maintain. E.g.: docker arguments, TGI args
  • you specify version numbers, I think it would be best if we could generate that, otherwise it will be a burden to maintain
  • Try to keep titles and toc tree in sync
  • There is a bit of repetition between the tutorials and howtos. Maybe you can rationalize that.
  • the conceptual guides should be more focused on optimum-tpu IMO, what do you think?

docs/source/conceptual_guides/tpu_hardware_support.mdx (outdated, resolved)
docs/source/conceptual_guides/tpu_hardware_support.mdx (outdated, resolved)
docs/source/conceptual_guides/tpu_hardware_support.mdx (outdated, resolved)
docs/source/tutorials/tpu_setup.mdx (outdated, resolved)
docs/source/tutorials/tpu_setup.mdx (outdated, resolved)
docs/source/tutorials/training_on_tpu.mdx (outdated, resolved)
docs/source/tutorials/training_on_tpu.mdx (outdated, resolved)
docs/source/tutorials/inference_on_tpu.mdx (outdated, resolved)
@baptistecolle
Collaborator Author

baptistecolle commented Jan 16, 2025

Thanks a lot, @tengomucho, for all the feedback. It took me a while to integrate everything 😅, and there are still a few points where I am not sure exactly what you mean.

As there are a lot of comments already, please mark them as resolved if you believe they do not require further action on my part, so I can keep better track of what is left. If a comment needs no further action from me but you would still like me to see it, feel free to leave it open.

I believe I implemented most of the changes, and this should be ready for another review.

Thanks for the huge work! Some general comments

  1. I would prefer to avoid repetition: having information repeated in several places can be confusing and it is harder to maintain. E.g.: docker arguments, TGI args
  2. you specify version numbers, I think it would be best if we could generate that, otherwise it will be a burden to maintain
  3. Try to keep titles and toc tree in sync
  4. There is a bit of repetition between the tutorials and howtos. Maybe you can rationalize that.
  5. the conceptual guides should be more focused on optimum-tpu IMO, what do you think?
  1. I would prefer to keep information close to where it is needed, so that people don't have to jump around in the docs to find everything. That's why there is some duplication

I see your point. Let me try to find a way to inject that into each page to prevent duplication and simplify maintenance. Do you mind if I do that in another PR? I did not see that option in the HF doc-builder docs, so I wanted to ask the team.
I prefer to keep duplicate information because I believe it is important to keep related information close to where it will be used, even if that means some duplication. This offers a better user experience, and users may arrive at the documentation from any single page, so the order in which it is read is not linear.

  2. Yes. Do you mind if I do that in another PR, where I also look at 1) and how to inject everything dynamically?

  3. Great feedback. I modified that, and they are now in sync.
    Question for you: I would like to rename a lot of the files in the docs/source folder (refactoring), because the current file names are not very descriptive or well structured. I believe this is best done in another PR. Or do you think it is better to do it here, as a last commit after approval? If I do it here, I will mess up the whole git diff and make it more difficult to review, right?

  4. Sure
    First of all, I recommend this talk on docs by one of the maintainers of Django (Daniele Procida):
    https://www.youtube.com/watch?v=t4vKPhjcMZg (also look up other resources from Daniele Procida).
    He talks about https://diataxis.fr/, a way to structure docs.
    Let me, however, answer your question directly. Why is there an overlap between tutorials and how-tos? Tutorials offer a way for the user to accomplish a task (we don't talk about tradeoffs, the different options, and so on; it is only a recipe to follow). How-tos explain the rationale behind each action and why we do these steps (keeping with the cooking analogy: why do we bake a cake at a given temperature, for example?). They overlap, but how-tos go a step beyond the tutorial and explain how serving, for example, actually works on optimum-tpu, not just how to achieve it. As for optimum-tpu, I don't know if the current pages are perfect in that regard, but the goal is for them to embody that philosophy. I hope this answers your question.

  5. It links to 4) too
    I answered that in a comment above

the conceptual guides should be more focused on optimum-tpu IMO, what do you think?

I don't think so. For example, the transformers docs explain the different styles of attention: https://huggingface.co/docs/transformers/en/attention. For me, one of the goals of conceptual guides is to explain the concepts that end users need in order to use and understand our library, optimum-tpu.

@baptistecolle
Collaborator Author

baptistecolle commented Jan 16, 2025

I also added a "More todos (to keep track of everything)" section to the original PR description about the next tasks to do related to this PR, to prevent having more changes in this PR and keep it from being even more monstrous 👹

@tengomucho
Collaborator

I will answer this first and then re-do the review.

1. I would prefer to keep information close to where it is needed, so that people don't have to jump around in the docs to find everything. That's why there is some duplication

I see your point. Let me try to find a way to inject that into each page to prevent duplication and simplify maintenance. Do you mind if I do that in another PR? I did not see that option in the HF doc-builder docs, so I wanted to ask the team.
I prefer to keep duplicate information because I believe it is important to keep related information close to where it will be used, even if that means some duplication. This offers a better user experience, and users may arrive at the documentation from any single page, so the order in which it is read is not linear.

I think this is debatable. We can repeat things a couple of times, but we should not show exactly the same command in many pages! E.g.: the docker run command is shown 4 times, every time to start the same model. Maybe we can launch it with a different model, or something like that...

2. Yes. Do you mind if I do that in another PR, where I also look at 1) and how to inject everything dynamically?

Great! 👍

3. Great feedback. I modified that, and they are now in sync.
   Question for you: I would like to rename a lot of the files in the docs/source folder (refactoring), because the current file names are not very descriptive or well structured. I believe this is best done in another PR. Or do you think it is better to do it here, as a last commit after approval? If I do it here, I will mess up the whole git diff and make it more difficult to review, right?

Any solution works for me, I trust your choice

4. Sure
   First of all, I recommend this talk on docs by one of the maintainers of Django (Daniele Procida):
   https://www.youtube.com/watch?v=t4vKPhjcMZg (also look up other resources from Daniele Procida).
   He talks about https://diataxis.fr/, a way to structure docs.
   Let me, however, answer your question directly. Why is there an overlap between tutorials and how-tos? Tutorials offer a way for the user to accomplish a task (we don't talk about tradeoffs, the different options, and so on; it is only a recipe to follow). How-tos explain the rationale behind each action and why we do these steps (keeping with the cooking analogy: why do we bake a cake at a given temperature, for example?). They overlap, but how-tos go a step beyond the tutorial and explain how serving, for example, actually works on optimum-tpu, not just how to achieve it. As for optimum-tpu, I don't know if the current pages are perfect in that regard, but the goal is for them to embody that philosophy. I hope this answers your question.

I watched the talk and read the page. I am ok to go all in on structuring the docs this way, but I still believe we should not cover the same subject in both a tutorial and a howto guide. That is what I mean: I am ok if you create a tutorial on training for, say, llama, but then we might not need to have the same instructions in the howto guide. From this perspective, I think we should move the gemma and llama examples to the tutorials, don't you think?
Also, there are now two buttons on the main page, linking to tutorials and how-tos (but not to references or conceptual guides), and they do not work. Please fix them.

5. it links to 4) too
   I answered that in a comment above

the conceptual guides should be more focused on optimum-tpu IMO, what do you think?
I don't think so. For example, the transformers docs explain the different styles of attention: https://huggingface.co/docs/transformers/en/attention. For me, one of the goals of conceptual guides is to explain the concepts that end users need in order to use and understand our library, optimum-tpu.

docs/source/howto/installation_inside_a_container.mdx (outdated, resolved)
docs/source/howto/installation_inside_a_container.mdx (outdated, resolved)
docs/source/howto/installation_inside_a_container.mdx (outdated, resolved)
docs/source/howto/training.mdx (outdated, resolved)
docs/source/tutorials/training_on_tpu.mdx (outdated, resolved)
docs/source/tutorials/training_on_tpu.mdx (outdated, resolved)
docs/source/howto/advanced-tgi-serving.mdx (outdated, resolved)
docs/source/tutorials/training_on_tpu.mdx (outdated, resolved)
@huggingface huggingface deleted a comment from tengomucho Jan 17, 2025
@baptistecolle
Collaborator Author

@tengomucho I think I have addressed all the comments now!

  1. I removed those parts that were indeed not actionable.
## Tips
We recommend fine-tuning using bf16 on TPU, as those operations are often extremely fast while keeping the training stable and precise enough.
## Troubleshooting

Common issues and solutions:

1. Out of Memory (OOM):
   - Enable gradient checkpointing
   - Reduce batch size
   - Use a smaller sequence length

2. Training Speed:
   - Ensure proper batch size optimization
   - Monitor TPU device utilization
   - Check for communication bottlenecks
  2. I removed the page about the training container, which is not live yet
  3. I redid the training page. It is now a fully self-contained example, without Docker, of how to train on TPU with optimum, so please look at https://moon-ci-docs.huggingface.co/docs/optimum-tpu/pr_144/en/tutorials/training_on_tpu

@tengomucho
Collaborator

I like it a lot more! Please fix the broken links on the main page as discussed and then this LGTM!

- Peak compute per Pod: Total computing power when multiple chips work together. These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.
The HBM (High Bandwidth Memory) capacity per chip is 16GB for v5e, v5p and 32GB for v6e. So a v5e-8 (v5litepod-8), has 16GB*8=128GB of HBM memory
Collaborator

full st... oh, whatever
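
The HBM arithmetic in the quoted snippet (16GB*8=128GB for a v5e-8) can be sketched as a small lookup. This is only an illustration: the names `HBM_GB_PER_CHIP` and `total_hbm_gb` are hypothetical, and the table is limited to the v5e and v6e figures quoted above.

```python
# Per-chip HBM capacity in GB, as quoted in the snippet above (v5e and v6e only).
HBM_GB_PER_CHIP = {"v5e": 16, "v6e": 32}


def total_hbm_gb(generation: str, num_chips: int) -> int:
    """Total HBM across all chips of a slice, in GB."""
    return HBM_GB_PER_CHIP[generation] * num_chips


# A v5e-8 (v5litepod-8) has 8 chips with 16 GB each.
print(total_hbm_gb("v5e", 8))  # 128
```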

- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
Those are needed to run a TPU container so that the container can properly access the TPU hardware
- `--shm-size 16GB`: Increase default shared memory allocation.
Collaborator

❤️
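
To show how the three container flags from the snippet above fit together, here is a rough sketch that assembles a `docker run` command line. The function name `docker_run_command` and the `IMAGE_PLACEHOLDER` value are assumptions for illustration; the real image name and any further arguments should come from the optimum-tpu docs.

```python
import shlex
from typing import List


def docker_run_command(image: str, shm_size: str = "16GB") -> List[str]:
    """Build a `docker run` argument list with the TPU-related flags.

    Hypothetical helper: only the three flags quoted above are included.
    """
    return [
        "docker", "run",
        "--privileged",          # required for TPU access
        "--net", "host",         # use host network mode
        "--shm-size", shm_size,  # increase shared memory allocation
        image,
    ]


# IMAGE_PLACEHOLDER stands in for a real container image name.
cmd = docker_run_command("IMAGE_PLACEHOLDER")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting issues if it is later passed to `subprocess.run`.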

Collaborator

@tengomucho tengomucho left a comment

Note that there are some failures in the doc build.

@baptistecolle
Collaborator Author

baptistecolle commented Jan 20, 2025

There was a typo that caused the build to fail. This is fixed now. Also, all the links between pages have been checked to prevent redirecting to the wrong pages (added a todo here to make this part of the CI). I think everything should be okay now. Let me know if anything else is missing.

Collaborator

@tengomucho tengomucho left a comment

LGTM

@baptistecolle
Collaborator Author

@pagezyhf Let me know if you see any needed changes or potential improvements to the docs. I can also open separate PRs if you think some pages are missing and need to be added.

@baptistecolle baptistecolle merged commit 40aa326 into main Jan 21, 2025
2 checks passed
3 participants