Move common dedup utils and remove unused code #42
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Found one bug. I think it looks pretty good otherwise.
nemo_curator/scripts/fuzzy_deduplication/legacy/verify_all_pairs_jaccard.py
Looks good to me (as long as we address Ryan's changes). Thanks for the code cleanup and documentation.
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
1a676df to 8865b82
All my concerns were addressed. Thanks!
* Refactor common utils and remove unused code
* More cleanup
* More updates/shuffling
* Move gpu_dedup scripts into subfolder
* Remove gpu_deduplication subfolder
* Add readme to fuzzy dedup scripts section
* Fix typo and relative links
* Remove legacy script entrypoints
* Remove legacy scripts and add init file
* Update GpuDeduplication.rst
---------
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Init commit for tutorial notebook
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix metadata inference with pandas and dask (#35)
  * Fix metadata inference with pandas and dask
  * Fix datatypes for task decontamination
  * Use targeted import
  ---------
  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Disable PyTorch Compile Multiprocessing (#34)
  * Move tokenizer import
  * Reduce inductor threads
  * Change env int to string
  * Change location of env var
  * Add comment linking issue
  ---------
  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Improve speed of AddId module (#36)
  * Add fast id method
  * Add type conversion
  * Fix off by one errors in tests
  ---------
  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Make GPU dependencies optional (#27)
  * Move GPU imports and make them optional
  * Move gpu dependencies to a separate install
  * Remove unused import
  * Switch to placeholder import that raises on usage
  * Remove deprecated utils usage
  * Add cuML attribution
  * Safe import tests, improve install instruction, update gha workflow
  * Fix pytests due to loc bug
  * update install instructions
  * Raise on non module-not-found errors, update logging
  * Update logging to not change root logger
  ---------
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix failing GPU tests with latest pandas bump (#41)
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Adds Nemo Curator K8s example (#40)
  * [K8s]: Adds a helper script to create a dask cluster on k8s and includes instructions for how to run a Curator workload on k8s
  * black formatting
  * big_english -> my_dataset
  * 24.01 -> 24.03 default container
  * Add help kwarg to all flags
  * Clarify why venv is needed
  * fix precommit failures
  ---------
  Signed-off-by: Terry Kong <terryk@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Move common dedup utils and remove unused code (#42)
  * Refactor common utils and remove unused code
  * More cleanup
  * More updates/shuffling
  * Move gpu_dedup scripts into subfolder
  * Remove gpu_deduplication subfolder
  * Add readme to fuzzy dedup scripts section
  * Fix typo and relative links
  * Remove legacy script entrypoints
  * Remove legacy scripts and add init file
  * Update GpuDeduplication.rst
  ---------
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix lang id example (#37)
  * Fix lang id example
  * Add classifier unit tests
  * Add test for failure
  * Remove failure test
  ---------
  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Add dataset blending tool (#32)
  * Add initial dataset blending function
  * Add blend unit tests
  * Add self parameter
  * Fix return type of blend dataset
  * Fix blending tests
  * Change assert statement for very uneven blend
  * Fix key error
  * Add proper proportion blending test
  * Add four dataset blend and clarify docs
  * Add shuffle module
  * Add blend example and tests
  * Fix random method name
  * Wrap return type in DocumentDataset
  * Save result of column drop
  * Change equality check for shuffle tests
  * Fix expected order after shuffle
  * Add more documents to shuffle test
  * Add assert statement
  * Add within partition shuffle
  * Refactor add rand column for shuffle
  * Fix filename tests
  * Add determinism handling for shuffle
  * Change numpy random function
  * Fix tests with new random method
  * Remove length call from blending
  * Improve scaling of blending function
  * Fix blend tests
  * Add blending script
  * Add additional file paths call
  * Add documentation
  * Reformat docs
  * Remove backticks
  * Add context manager for shuffle tests
  * Add better deterministic shuffle path
  * Update documentation and reset index
  ---------
  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* High level fuzzy duplicates module (#46)
  * Initial pass at fuzzy dedup api
  * Update deprecated shuffle arg
  * dask_cuda gpu only import
  * Move fuzzy_dedup imports to optional
  * more tests
  * Move FuzzyDeDupConfig to its own class
  * Add example script and config file, fix typo
  * Remove slurm examples for gpu dedup
  * Add config module
  * Rename FuzzyDeDupConfig and minhash_length to FuzzyDuplicatesConfig, num_hashes
  * Add comments and update example
  * Write to same format as input in fuzzy dedup example
  ---------
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix indexing in PII Modifier (#55)
  * Fix pii index issue
  * Add sequential wrapper
  * Fix pii tests
  ---------
  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Disable string conversion globally (#56)
  Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fix issue #43 (empty files creation) and improve reading/writing speed (#57)
  This commit fixes issue #43 (empty files created when invoking the reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the size of the files after they are generated and deleting those with size zero. In addition, I noticed that there is no need to parse the content of each line into a JSON object, since the lines should already be in JSON format. Removing that extra parsing gives a significant speed-up in the execution of this method.
  Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* [Tutorials] Add a tutorial for PEFT data curation (#45)
  This PR adds a new tutorial to demonstrate data curation for PEFT use-cases.
  Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Only import PII constants during Curator import (#61)
  * Move PII constants to a separate file that does not import presidio/spacy and other GPU dependencies
  * Add comment around import, move constant import to global scope
  ---------
  Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Deleting links
  Signed-off-by: Nicoel Luo <nluo@nvidia.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
  Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
  Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fixed typo. Update content to latest NeMo Curator version. Added fuzzy deduplication wrapper example
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fixing Style
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Updating container version
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Fixing style
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
* Update get_client() according to latest version; Update log path for map_bucket section
  Signed-off-by: Nicole Luo <nluo@nvidia.com>
---------
Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: Nicoel Luo <nluo@nvidia.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Co-authored-by: Ryan Wolf <rywolf@nvidia.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Co-authored-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
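For readers following the #57 entry in the commit message above, the two ideas it describes (copying each JSONL line through without parsing it, and deleting zero-size output files after writing) might look roughly like the sketch below. This is not the actual reshard_jsonl code from nemo_curator; the function name reshard_lines, its parameters, and the shard naming scheme are all hypothetical and chosen only for illustration.

```python
import os


def reshard_lines(input_files, output_dir, lines_per_file, prefix="shard"):
    """Hypothetical helper: copy JSONL lines verbatim into fixed-size shards."""
    os.makedirs(output_dir, exist_ok=True)
    written = []

    def open_shard(idx):
        path = os.path.join(output_dir, f"{prefix}_{idx:05d}.jsonl")
        written.append(path)
        return open(path, "w", encoding="utf-8")

    shard_idx, line_count = 0, 0
    out = open_shard(shard_idx)
    try:
        for in_path in input_files:
            with open(in_path, "r", encoding="utf-8") as src:
                for line in src:
                    # Each line is assumed to already be valid JSON, so it is
                    # written as-is (no json.loads / json.dumps round trip).
                    out.write(line if line.endswith("\n") else line + "\n")
                    line_count += 1
                    if line_count == lines_per_file:
                        # Eager rollover: a deliberately simple strategy that
                        # can leave a trailing empty shard behind.
                        out.close()
                        shard_idx += 1
                        line_count = 0
                        out = open_shard(shard_idx)
    finally:
        out.close()

    # Post-check: delete any shard that ended up with size zero.
    kept = []
    for path in written:
        if os.path.getsize(path) == 0:
            os.remove(path)
        else:
            kept.append(path)
    return kept
```

With four input lines and lines_per_file=2, the eager rollover opens a third, empty shard, and the size check at the end removes it, so only two files remain.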
This PR removes the nemo_curator/gpu_deduplication folder in favor of using all code from either the fuzzy_dedup module or fuzzy_dedup_utils. (A few methods were left behind.) It also moves the fuzzy dedup scripts into a new subfolder with a readme on the order of execution and example usage.
It adds a caution to the gpu_deduplication slurm example currently in examples, which will be removed in a follow-up and replaced by a Python-only API example.