Update get_all_files_paths_under examples to include keep_extensions #450

Open
wants to merge 4 commits into base: main
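The pattern applied throughout this PR is to replace manual suffix filtering with the helper's ``keep_extensions`` argument. A minimal before/after sketch, assuming ``keep_extensions`` simply filters the returned paths by file suffix (the directory name is taken from the docs excerpts below):

    from nemo_curator.utils.file_utils import get_all_files_paths_under

    # Before: list every file under the directory, then filter by suffix by hand
    files = get_all_files_paths_under("books_dataset/")
    files = [f for f in files if f.endswith(".jsonl")]

    # After: ask the helper to keep only .jsonl files
    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")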
16 changes: 8 additions & 8 deletions docs/user-guide/distributeddataclassification.rst
@@ -61,7 +61,7 @@ Let's see how ``DomainClassifier`` works in a small excerpt taken from ``example

from nemo_curator.classifiers import DomainClassifier

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
@@ -83,7 +83,7 @@ Using the ``MultilingualDomainClassifier`` is very similar to using the ``Domain

from nemo_curator.classifiers import MultilingualDomainClassifier

files = get_all_files_paths_under("japanese_books_dataset/")
files = get_all_files_paths_under("japanese_books_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

multilingual_domain_classifier = MultilingualDomainClassifier(
@@ -106,7 +106,7 @@ Here's an example of how to use the ``QualityClassifier``:

from nemo_curator.classifiers import QualityClassifier

files = get_all_files_paths_under("web_documents/")
files = get_all_files_paths_under("web_documents/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

quality_classifier = QualityClassifier(filter_by=["High", "Medium"])
@@ -134,7 +134,7 @@ NeMo Curator provides an easy way to annotate and filter your data using the saf

.. code-block:: python

files = get_all_files_paths_under("unsafe_documents/")
files = get_all_files_paths_under("unsafe_documents/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

token = "hf_1234" # Replace with your user access token
@@ -181,7 +181,7 @@ Here is a small example of how to use the ``InstructionDataGuardClassifier``:

# The model expects instruction-response style text data. For example:
# "Instruction: {instruction}. Input: {input_}. Response: {response}."
files = get_all_files_paths_under("instruction_input_response_dataset/")
files = get_all_files_paths_under("instruction_input_response_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

token = "hf_1234" # Replace with your user access token
@@ -210,7 +210,7 @@ To use the FineWeb Educational Content Classifier, you can follow this example:

from nemo_curator.classifiers import FineWebEduClassifier

files = get_all_files_paths_under("web_documents/")
files = get_all_files_paths_under("web_documents/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

edu_classifier = FineWebEduClassifier(
@@ -247,7 +247,7 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex

from nemo_curator.classifiers import ContentTypeClassifier

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

content_type_classifier = ContentTypeClassifier(filter_by=["Blogs", "News"])
@@ -269,7 +269,7 @@ Here's an example of how to use the ``PromptTaskComplexityClassifier``:

from nemo_curator.classifiers import PromptTaskComplexityClassifier

files = get_all_files_paths_under("my_dataset/")
files = get_all_files_paths_under("my_dataset/", keep_extensions="jsonl")
input_dataset = DocumentDataset.read_json(files, backend="cudf")

classifier = PromptTaskComplexityClassifier()
4 changes: 2 additions & 2 deletions docs/user-guide/documentdataset.rst
@@ -43,7 +43,7 @@ You could read, filter the dataset, and write it using the following methods
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.filters import WordCountFilter

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
books = DocumentDataset.read_json(files, add_filename=True)

filter_step = nc.ScoreFilter(
@@ -58,7 +58,7 @@ You could read, filter the dataset, and write it using the following methods

Let's walk through this code line by line.

* ``files = get_all_files_paths_under("books_dataset/")`` This retrieves a list of all files in the given directory.
* ``files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")`` This retrieves a list of all files in the given directory, then filters the list to include only files ending with ".jsonl".
In our case, this is equivalent to writing

.. code-block:: python
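The bullet above describes a retrieve-then-filter behavior. For illustration only, a plain-Python sketch of that behavior, under the assumption of a flat directory and simple suffix matching:

    import os

    # List the directory contents and keep only the .jsonl files
    files = [
        os.path.join("books_dataset", f)
        for f in os.listdir("books_dataset")
        if f.endswith(".jsonl")
    ]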
2 changes: 1 addition & 1 deletion docs/user-guide/qualityfiltering.rst
@@ -35,7 +35,7 @@ Let's examine this small example:
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.filters import WordCountFilter

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
books = DocumentDataset.read_json(files, add_filename=True)

filter_step = nc.ScoreFilter(
2 changes: 1 addition & 1 deletion docs/user-guide/sparkother.rst
@@ -91,4 +91,4 @@ The following code snippet demonstrates how to read output from a Spark DataFram
stories_dataset = DocumentDataset.read_parquet(processed_files, backend="pandas")

It is worth noting that Spark typically creates checksum and other marker files, which can vary by Spark distribution,
so it is advisable to ignore them when reading data into a NeMo Curator ``DocumentDataset``.
so it is advisable to ignore them when reading data into a NeMo Curator ``DocumentDataset``.
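One way to follow that advice when collecting the Spark output paths is to filter by extension, as the rest of this PR does. A sketch only, assuming the marker files (for example ``_SUCCESS`` and ``*.crc``) do not share the data files' ``.parquet`` extension; ``spark_output_dir/`` is a placeholder path:

    from nemo_curator.utils.file_utils import get_all_files_paths_under

    # Keep only the Parquet data files; checksum and other marker files are skipped
    processed_files = get_all_files_paths_under("spark_output_dir/", keep_extensions="parquet")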
2 changes: 1 addition & 1 deletion docs/user-guide/taskdecontamination.rst
@@ -28,7 +28,7 @@ Let's examine this small example:
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.tasks import Winogrande, Squad, TriviaQA

files = get_all_files_paths_under("books_dataset/")
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
books = DocumentDataset.read_json(files, add_filename=True)

downstream_tasks = [
2 changes: 1 addition & 1 deletion examples/classifier_filtering.py
@@ -27,7 +27,7 @@


def load_dataset(input_data_dir):
files = list(get_all_files_paths_under(input_data_dir))
files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
dataset = DocumentDataset(raw_data)

3 changes: 1 addition & 2 deletions examples/exact_deduplication.py
@@ -17,8 +17,7 @@

from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client, read_data, write_to_disk
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.utils.distributed_utils import get_client, write_to_disk
from nemo_curator.utils.script_utils import ArgumentHelper


2 changes: 1 addition & 1 deletion examples/identify_languages.py
@@ -26,7 +26,7 @@


def load_dataset(input_data_dir):
files = list(get_all_files_paths_under(input_data_dir))
files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
dataset = DocumentDataset(raw_data)

2 changes: 1 addition & 1 deletion examples/task_decontamination.py
@@ -44,7 +44,7 @@


def load_dataset(input_data_dir):
files = list(get_all_files_paths_under(input_data_dir))
files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
dataset = DocumentDataset(raw_data)

5 changes: 3 additions & 2 deletions nemo_curator/scripts/find_exact_duplicates.py
@@ -55,8 +55,9 @@ def main(args):
if num_files is not None and num_files <= 0:
logger.info(f"Processed {num_files}... quitting")
break
files = get_all_files_paths_under(root=data_path, recurse_subdirectories=False)
files = [f for f in files if f.endswith(".jsonl")]
files = get_all_files_paths_under(
root=data_path, recurse_subdirectories=False, keep_extensions="jsonl"
)
df = read_data(
files[:num_files] if num_files else files,
file_type="jsonl",
5 changes: 3 additions & 2 deletions nemo_curator/scripts/fuzzy_deduplication/compute_minhashes.py
@@ -70,8 +70,9 @@ def main(args):
print(f"Processed {args.num_files}... quitting")
break

files = get_all_files_paths_under(root=data_path, recurse_subdirectories=False)
files = [f for f in files if f.endswith(".jsonl")]
files = get_all_files_paths_under(
root=data_path, recurse_subdirectories=False, keep_extensions="jsonl"
)
df = read_data(
files[:num_files] if num_files else files,
file_type="jsonl",
4 changes: 3 additions & 1 deletion nemo_curator/scripts/prepare_fasttext_training_data.py
@@ -32,7 +32,9 @@ def sample_rows(df, n, seed):
def main(args):
client = get_client(**ArgumentHelper.parse_client_args(args))
# Get local path
files = list(get_all_files_paths_under(args.input_data_dir))
files = list(
get_all_files_paths_under(args.input_data_dir, keep_extensions="jsonl")
)
raw_data = read_data(files, file_type="jsonl", backend="pandas")
dataset = DocumentDataset(raw_data)
text_field = args.input_json_field
2 changes: 1 addition & 1 deletion nemo_curator/utils/file_utils.py
@@ -452,7 +452,7 @@ def reshard_jsonl(
# Output file size in bytes
blocksize = parse_str_of_num_bytes(output_file_size)

input_files = list(get_all_files_paths_under(input_dir))
input_files = list(get_all_files_paths_under(input_dir, keep_extensions="jsonl"))

# Read in the dask bag
b = db.read_text(input_files, blocksize=blocksize)
1 change: 0 additions & 1 deletion tests/test_read_data.py
@@ -9,7 +9,6 @@
read_data_blocksize,
read_data_files_per_partition,
)
from nemo_curator.utils.file_utils import get_all_files_paths_under

NUM_FILES = 5
NUM_RECORDS = 100
@@ -1459,8 +1459,9 @@
}
],
"source": [
"files = get_all_files_paths_under(root=input_data_dir, recurse_subdirectories=False)\n",
"files = [f for f in files if f.endswith(\".jsonl\")]\n",
"files = get_all_files_paths_under(\n",
" root=input_data_dir, recurse_subdirectories=False, keep_extensions=\"jsonl\"\n",
")\n",
"df = read_data(\n",
" files,\n",
" file_type=\"jsonl\",\n",
5 changes: 3 additions & 2 deletions tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
@@ -1140,8 +1140,9 @@
"print(f\"Computing minhashes for {minhash_data_path}\")\n",
"\n",
"# Load data. Only the [minhash_id_field, text_field] columns are needed\n",
"files = get_all_files_paths_under(root=minhash_data_path, recurse_subdirectories=False)\n",
"files = [f for f in files if f.endswith(\".jsonl\")]\n",
"files = get_all_files_paths_under(\n",
" root=minhash_data_path, recurse_subdirectories=False, keep_extensions=\"jsonl\"\n",
")\n",
"df = read_data(\n",
" files,\n",
" file_type=\"jsonl\",\n",
6 changes: 3 additions & 3 deletions tutorials/tinystories/main.py
@@ -176,9 +176,9 @@ def run_curation_pipeline(args: Any, jsonl_dir: str) -> None:
client = get_client(**ArgumentHelper.parse_client_args(args))
print(f"Running curation pipeline on '{jsonl_dir}'...")
files = [
fp
for fp in get_all_files_paths_under(jsonl_dir, recurse_subdirectories=False)
if fp.endswith(".jsonl")
]
files = get_all_files_paths_under(
jsonl_dir, recurse_subdirectories=False, keep_extensions="jsonl"
)
print("Reading the data...")
orig_dataset = DocumentDataset.read_json(files, add_filename=True)
3 changes: 1 addition & 2 deletions tutorials/zyda2-tutorial/1_fuzzy_dedup/0_minhash.py
@@ -13,8 +13,7 @@


def read_folder(input_folder, columns=["nemo_id", "text"]):
data_paths = get_all_files_paths_under(input_folder)
data_paths = [f for f in data_paths if f.endswith(".parquet")]
data_paths = get_all_files_paths_under(input_folder, keep_extensions="parquet")
data_paths.sort()
logging.info(f"Number of files being read: {len(data_paths)}")
text_ddf = dask_cudf.read_parquet(