Adds Nemo Curator K8s example #40
Merged
terrykong force-pushed the k8s-new branch 3 times, most recently from 1d17133 to 2ee16e8 on April 23, 2024 22:03
instructions for how to run a Curator workload on k8s Signed-off-by: Terry Kong <terryk@nvidia.com>
ryantwolf approved these changes on Apr 23, 2024
Thanks for the guide!
nicoleeeluo pushed a commit to nicoleeeluo/NeMo-Curator that referenced this pull request on May 20, 2024
* [K8s]: Adds a helper script to create a dask cluster on k8s and includes instructions for how to run a Curator workload on k8s
* black formatting
* big_english -> my_dataset
* 24.01 -> 24.03 default container
* Add help kwarg to all flags
* Clarify why venv is needed
* fix precommit failures

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
ryantwolf added a commit that referenced this pull request on May 24, 2024
* Init commit for tutorial notebook

  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix metadata inference with pandas and dask (#35)
  * Fix metadata inference with pandas and dask
  * Fix datatypes for task decontamination
  * Use targeted import

  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Disable PyTorch Compile Multiprocessing (#34)
  * Move tokenizer import
  * Reduce inductor threads
  * Change env int to string
  * Change location of env var
  * Add comment linking issue

  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Improve speed of AddId module (#36)
  * Add fast id method
  * Add type conversion
  * Fix off by one errors in tests

  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Make GPU dependencies optional (#27)
  * Move GPU imports and make them optional
  * Move gpu dependencies to a separate install
  * Remove unused import
  * Switch to placeholder import that raises on usage
  * Remove deprecated utils usage
  * Add cuML attribution
  * Safe import tests, improve install instruction, update gha workflow
  * Fix pytests due to loc bug
  * update install instructions
  * Raise on non module-not-found errors, update logging
  * Update logging to not change root logger

  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix failing GPU tests with latest pandas bump (#41)

  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Adds Nemo Curator K8s example (#40)
  * [K8s]: Adds a helper script to create a dask cluster on k8s and includes instructions for how to run a Curator workload on k8s
  * black formatting
  * big_english -> my_dataset
  * 24.01 -> 24.03 default container
  * Add help kwarg to all flags
  * Clarify why venv is needed
  * fix precommit failures

  Signed-off-by: Terry Kong <terryk@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Move common dedup utils and remove unused code (#42)
  * Refactor common utils and remove unused code
  * More cleanup
  * More updates/shuffling
  * Move gpu_dedup scripts into subfolder
  * Remove gpu_deduplication subfolder
  * Add readme to fuzzy dedup scripts section
  * Fix typo and relative links
  * Remove legacy script entrypoints
  * Remove legacy scripts and add init file
  * Update GpuDeduplication.rst

  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix lang id example (#37)
  * Fix lang id example
  * Add classifier unit tests
  * Add test for failure
  * Remove failure test

  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Add dataset blending tool (#32)
  * Add initial dataset blending function
  * Add blend unit tests
  * Add self parameter
  * Fix return type of blend dataset
  * Fix blending tests
  * Change assert statement for very uneven blend
  * Fix key error
  * Add proper proportion blending test
  * Add four dataset blend and clarify docs
  * Add shuffle module
  * Add blend example and tests
  * Fix random method name
  * Wrap return type in DocumentDataset
  * Save result of column drop
  * Change equality check for shuffle tests
  * Fix expected order after shuffle
  * Add more documents to shuffle test
  * Add assert statement
  * Add within partition shuffle
  * Refactor add rand column for shuffle
  * Fix filename tests
  * Add determinism handling for shuffle
  * Change numpy random function
  * Fix tests with new random method
  * Remove length call from blending
  * Improve scaling of blending function
  * Fix blend tests
  * Add blending script
  * Add additional file paths call
  * Add documentation
  * Reformat docs
  * Remove backticks
  * Add context manager for shuffle tests
  * Add better deterministic shuffle path
  * Update documentation and reset index

  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* High level fuzzy duplicates module (#46)
  * Initial pass at fuzzy dedup api
  * Update deprecated shuffle arg
  * dask_cuda gpu only import
  * Move fuzzy_dedup imports to optional
  * more tests
  * Move FuzzyDeDupConfig to its own class
  * Add example script and config file, fix typo
  * Remove slurm examples for gpu dedup
  * Add config module
  * Rename FuzzyDeDupConfig and minhash_length to FuzzyDuplicatesConfig, num_hashes
  * Add comments and update example
  * Write to same format as input in fuzzy dedup example

  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix indexing in PII Modifier (#55)
  * Fix pii index issue
  * Add sequential wrapper
  * Fix pii tests

  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Disable string conversion globally (#56)

  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix issue #43 (empty files creation) and improve reading/writing speed (#57)

  This commit fixes issue #43 (empty files created when invoking the reshard_jsonl method in nemo_curator.utils.file_utils.py) by double-checking the size of each file after it is generated and deleting those with size zero. In addition, there is no need to parse the content of each line into a JSON object, since the lines should already be in JSON format. Removing that extra parsing gives a significant speedup in the execution of this method.

  Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* [Tutorials] Add a tutorial for PEFT data curation (#45)

  This PR adds a new tutorial to demonstrate data curation for PEFT use-cases.

  Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Only import PII constants during Curator import (#61)
  * Move PII constants to a separate file that does not import presidio/spacy and other GPU dependencies
  * Add comment around import, move constant import to global scope

  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Deleting links

  Signed-off-by: Nicoel Luo <nluo@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

  Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
  Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fixed typo. Update content to latest NeMo Curator version. Added fuzzy deduplication wrapper example

  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fixing Style

  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Updating container version

  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fixing style

  Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update get_client() according to latest version; Update log path for map_bucket section

  Signed-off-by: Nicole Luo <nluo@nvidia.com>

---------

Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: Nicoel Luo <nluo@nvidia.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Co-authored-by: Ryan Wolf <rywolf@nvidia.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Co-authored-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
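The #57 entry in the commit message above describes two ideas: copying JSONL lines as raw text instead of parsing each one, and deleting any zero-size output file after writing. The sketch below illustrates both under stated assumptions. It is not NeMo Curator's `reshard_jsonl`; the function name `reshard_raw`, the shard file naming, and the eager shard rollover (which is just one way an empty trailing file can arise) are hypothetical and chosen only for illustration.

```python
import os


def reshard_raw(input_paths, output_dir, lines_per_shard):
    """Hypothetical sketch: re-split JSONL files into shards of a fixed line count.

    Lines are copied as raw text (no json.loads/json.dumps round trip), and
    any shard that ends up with size zero is deleted afterwards.
    """
    os.makedirs(output_dir, exist_ok=True)
    shard_paths = []

    def open_shard():
        # Hypothetical naming scheme: shard_00000.jsonl, shard_00001.jsonl, ...
        path = os.path.join(output_dir, f"shard_{len(shard_paths):05d}.jsonl")
        shard_paths.append(path)
        return open(path, "w", encoding="utf-8")

    out = open_shard()
    lines_in_shard = 0
    for input_path in input_paths:
        with open(input_path, "r", encoding="utf-8") as src:
            for line in src:
                # Raw copy: each line is assumed to already be valid JSON.
                out.write(line if line.endswith("\n") else line + "\n")
                lines_in_shard += 1
                if lines_in_shard == lines_per_shard:
                    # Eager rollover: the next shard is opened immediately,
                    # so it stays empty if no further input lines arrive.
                    out.close()
                    out = open_shard()
                    lines_in_shard = 0
    out.close()

    # Double-check file sizes after writing and drop zero-size shards.
    kept = []
    for path in shard_paths:
        if os.path.getsize(path) == 0:
            os.remove(path)
        else:
            kept.append(path)
    return kept
```

Calling `reshard_raw(["in.jsonl"], "out", 2)` on a four-line input would fill two shards, open a third that never receives a line, and then remove that third file in the cleanup pass.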
Duplicate of #39