modelscope · yxdyc · Aug 2, 2024 · Mar 21, 2024 · Mar 21, 2024 · Mar 26, 2024
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,5 @@
 
 # data & resources
-models/
 outputs/
 assets/
 

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -39,6 +39,7 @@ exclude: |
     docs/.*|
     tests/.*|
     demos/.*|
-    tools/mm_eval/inception_metrics.*|
+    tools/mm_eval/inception_metrics/.*|
+    thirdparty/easy_animate/.*|
     .*\.md
   )$
diff --git a/README.md b/README.md
@@ -46,7 +46,7 @@ In this new version, we support more features for **multimodal data (including v
 - [2024-02-20] We have actively maintained an *awesome list of LLM-Data*, welcome to [visit](docs/awesome_llm_data.md) and contribute!
 - [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
 - [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
-- [2024-01-05] We release **Data-Juicer v0.1.3** now! 
+- [2024-01-05] We release **Data-Juicer v0.1.3** now!
 In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
 Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
 - [2023-10-13] Our first data-centric LLM competition begins! Please
@@ -94,8 +94,8 @@ Table of Contents
   dedicated [toolkits](#documentation), designed to
   function independently of specific multimodal LLM datasets and processing pipelines.
 
-- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration 
-  through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model, 
+- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration
+  through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model,
   visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
   ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)
 
@@ -194,11 +194,11 @@ The dependency options are listed below:
 pip install py-data-juicer
 ```
 
-- **Note**: 
+- **Note**:
   - only the basic APIs in `data_juicer` and two basic tools
     (data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
     and complete functions, we recommend you install `data_juicer` [from source](#from-source).
-  - The release versions from pypi have a certain lag compared to the latest version from source. 
+  - The release versions from pypi have a certain lag compared to the latest version from source.
     So if you want to follow the latest functions of `data_juicer`, we recommend you install [from source](#from-source).
 
 ### Using Docker
@@ -215,7 +215,7 @@ pip install py-data-juicer
     ```shell
     docker build -t datajuicer/data-juicer:<version_tag> .
     ```
-  
+
   - The format of `<version_tag>` is like `v0.2.0`, which is the same as release version tag.
 
 ### Installation check
@@ -413,20 +413,20 @@ docker exec -it <container_id> bash
 Data-Juicer is released under Apache License 2.0.
 
 ## Contributing
-We are in a rapidly developing field and greatly welcome contributions of new 
-features, bug fixes and better documentations. Please refer to 
+We are in a rapidly developing field and greatly welcome contributions of new
+features, bug fixes and better documentations. Please refer to
 [How-to Guide for Developers](docs/DeveloperGuide.md).
 
 If you have any questions, please join our [discussion groups](README.md).
 
 ## Acknowledgement
 Data-Juicer is used across various LLM products and research initiatives,
-including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for 
-financial analysis, and Zhiwen for reading assistant, as well as the Alibaba 
+including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for
+financial analysis, and Zhiwen for reading assistant, as well as the Alibaba
 Cloud's platform for AI (PAI).
 We look forward to more of your experience, suggestions and discussions for collaboration!
 
-Data-Juicer thanks and refers to several community projects, such as 
+Data-Juicer thanks and refers to several community projects, such as
 [Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam),  [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....
 
 

diff --git a/README_ZH.md b/README_ZH.md
@@ -193,7 +193,7 @@ pip install py-data-juicer
     ```shell
     docker build -t datajuicer/data-juicer:<version_tag> .
     ```
-  
+
   - `<version_tag>`的格式类似于`v0.2.0`，与发布（Release）的版本号相同。
 
 ### 安装校验

diff --git a/configs/config_all.yaml b/configs/config_all.yaml
@@ -49,7 +49,6 @@ data_probe_algo: 'uniform'                                  # sampling algorithm
 data_probe_ratio: 1.0                                       # the sampling ratio to the original dataset size. It's 1.0 in default. Only used for dataset sampling.
 hpo_config: null                                            # path to a configuration file when using auto-HPO tool.
 
-
 # process schedule: a list of several process operators with their arguments
 process:
   # Mapper ops. Most of these ops need no arguments.
@@ -496,13 +495,22 @@ process:
       ignore_non_character: false                             # whether to ignore non-alphabet characters, including whitespaces, digits, and punctuations
 
   # Selector ops
-  - topk_specified_field_selector:                          # selector to select top samples based on the sorted specified field
-      field_key: ''                                           # the target keys corresponding to multi-level field information need to be separated by '.'
-      top_ratio:                                              # ratio of selected top samples
-      topk:                                                   # number of selected top sample
-      reverse: True                                           # determine the sorting rule, if reverse=True, then sort in descending order
   - frequency_specified_field_selector:                     # selector to select samples based on the sorted frequency of specified field value
       field_key: ''                                           # the target keys corresponding to multi-level field information need to be separated by '.'
       top_ratio:                                              # ratio of selected top specified field value
       topk:                                                   # number of selected top specified field value
       reverse: True                                           # determine the sorting rule, if reverse=True, then sort in descending order
+  - random_selector:                                        # selector to randomly select samples
+      select_ratio:                                           # the ratio to be sampled
+      select_num:                                             # the number to be sampled
+  - range_specified_field_selector:                         # selector to select a range of samples based on the sorted specified field value from smallest to largest.
+      field_key: ''                                           # the target keys corresponding to multi-level field information need to be separated by '.'
+      lower_percentile:                                       # the lower bound of the percentile to be sampled
+      upper_percentile:                                       # the upper bound of the percentile to be sampled
+      lower_rank:                                             # the lower bound of the rank to be sampled
+      upper_rank:                                             # the upper bound of the rank to be sampled
+  - topk_specified_field_selector:                          # selector to select top samples based on the sorted specified field
+      field_key: ''                                           # the target keys corresponding to multi-level field information need to be separated by '.'
+      top_ratio:                                              # ratio of selected top samples
+      topk:                                                   # number of selected top samples
+      reverse: True                                           # determine the sorting rule, if reverse=True, then sort in descending order
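Review note: the `range_specified_field_selector` entry added to `configs/config_all.yaml` sorts samples by a field value from smallest to largest and keeps a percentile range. The sketch below illustrates that idea on plain Python dicts. It is not the Data-Juicer implementation: the helper names `get_nested` and `select_percentile_range` are hypothetical, the rank-based bounds (`lower_rank`, `upper_rank`) are omitted, and the rounding of percentile bounds to indices is an assumption.

```python
from typing import Any, Dict, List


def get_nested(sample: Dict[str, Any], field_key: str) -> Any:
    """Follow a dot-separated key such as '__dj__stats__.lang_score'."""
    value: Any = sample
    for part in field_key.split('.'):
        value = value[part]
    return value


def select_percentile_range(samples: List[Dict[str, Any]],
                            field_key: str,
                            lower_percentile: float = 0.0,
                            upper_percentile: float = 1.0) -> List[Dict[str, Any]]:
    """Sort ascending by the field and keep the slice between two percentiles.

    Assumption: percentiles are fractions in [0, 1] and are truncated to
    list indices; the real operator may round differently.
    """
    ordered = sorted(samples, key=lambda s: get_nested(s, field_key))
    lower_index = int(lower_percentile * len(ordered))
    upper_index = int(upper_percentile * len(ordered))
    return ordered[lower_index:upper_index]


if __name__ == '__main__':
    scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.95]
    data = [{'__dj__stats__': {'lang_score': s}} for s in scores]
    top_third = select_percentile_range(
        data, '__dj__stats__.lang_score', 0.667, 1.0)
    print([s['__dj__stats__']['lang_score'] for s in top_third])
```

With `lower_percentile: 0.667` and `upper_percentile: 1.000`, as used in `1_single_op_pipline.yaml`, the sketch keeps roughly the top third of samples by the chosen statistic.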
diff --git a/configs/data_juicer_recipes/README.md b/configs/data_juicer_recipes/README.md
@@ -41,6 +41,7 @@ We use simple 3-σ rule to set the hyperparameters for ops in each recipe.
 | subset                    |       #samples before       | #samples after | keep ratio | config link                          | data link                                                                                                                                                                                                                                                                                 | source        |
 |---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
 | LLaVA pretrain (LCS-558k) |          558,128          |   500,380    |   89.65%   | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer)                                        | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
+| Data-Juicer-T2V |          1,217,346          |   147,176    |   12.09%   | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
 
 ### Evaluation Results
 - LLaVA pretrain (LCS-558k): models **pretrained with refined dataset** and fine-tuned with the original instruct dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.

diff --git a/configs/data_juicer_recipes/README_ZH.md b/configs/data_juicer_recipes/README_ZH.md
@@ -41,6 +41,7 @@
 | 数据子集                    |      完善前的样本数目       | 完善后的样本数目 | 样本保留率 | 配置链接                          | 数据链接                                                                                                                                                                                                                                                                                 | 来源            |
 |---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
 | LLaVA pretrain (LCS-558k) |          558,128          |   500,380    |   89.65%   | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer)                                        | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
+| Data-Juicer-T2V |          1,217,346          |   147,176    |   12.09%   | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
 
 ### 评测结果
 - LLaVA pretrain (LCS-558k): 使用**完善后的预训练数据集**预训练并使用原始的指令数据集微调后的模型在12个评测集上有10个超过了基线模型LLaVA-1.5-13B。
@@ -57,4 +58,4 @@
 - 仅视频：根据视频性质提高数据集质量
 - 文本-视频：根据文本和视频间的对齐提高数据集质量
 用户可以基于这个菜谱开始他们的视频数据集处理流程。
-- 
+-
diff --git a/configs/demo/bench/1_single_op_pipline.yaml b/configs/demo/bench/1_single_op_pipline.yaml
@@ -0,0 +1,68 @@
+# Sandbox config example
+
+# global parameters
+project_name: 'demo-bench'
+experiment_name: 'single_op_language_score'              # for wandb tracer name
+work_dir: './outputs/demo-bench'                         # the default output dir for meta logging
+
+# configs for each job, the jobs will be executed according to the order in the list
+probe_job_configs:
+  # get statistics value for each sample and get the distribution analysis for given percentiles
+  - hook: 'ProbeViaAnalyzerHook'
+    meta_name: 'analysis_ori_data'
+    dj_configs:
+      project_name: 'demo-bench'
+      dataset_path: './demos/data/demo-dataset-videos.jsonl'  # path to your dataset directory or file
+      percentiles: [0.333, 0.667]                              # percentiles to analyze the dataset distribution
+      export_path: './outputs/demo-bench/demo-dataset-with-language-score.jsonl'
+      export_original_dataset: true                            # must be true to keep statistics values with dataset
+      process:
+        - language_id_score_filter:
+            lang: 'zh'
+            min_score: 0.8
+    extra_configs:
+
+refine_recipe_job_configs:
+
+execution_job_configs:
+  # sample the splits with low/middle/high statistics values
+  - hook: 'ProcessDataHook'
+    meta_name:
+    dj_configs:
+      project_name: 'demo-bench'
+      dataset_path: './outputs/demo-bench/demo-dataset-with-language-score.jsonl' # output dataset of probe jobs
+      export_path: './outputs/demo-bench/demo-dataset-with-high-language-score.jsonl'
+      process:
+        - range_specified_field_selector:
+            field_key: '__dj__stats__.lang_score'     # keys of a multi-level field are separated by '.'. '__dj__stats__' is the default location for storing stats in Data-Juicer, and 'lang_score' is the stat produced by language_id_score_filter.
+            lower_percentile: 0.667
+            upper_percentile: 1.000
+    extra_configs:
+  # randomly sample a fixed number of instances from the dataset
+  - hook: 'ProcessDataHook'
+    meta_name:
+    dj_configs:
+      project_name: 'demo-bench'
+      dataset_path: './outputs/demo-bench/demo-dataset-with-high-language-score.jsonl' # output dataset of the previous execution job
+      export_path: './outputs/demo-bench/demo-dataset-for-train.jsonl'
+      process:
+        - random_selector:
+            select_num: 16
+    extra_configs:
+  # train model
+  - hook: 'TrainModelHook'
+    meta_name:
+    dj_configs:
+    extra_configs: './configs/demo/bench/model_train.yaml'
+  # infer model
+  - hook: 'InferModelHook'
+    meta_name:
+    dj_configs:
+    extra_configs: './configs/demo/bench/model_infer.yaml'
+
+evaluation_job_configs:
+  # vbench evaluation
+  - hook: 'EvaluateDataHook'
+    meta_name: 'vbench_eval'
+    dj_configs:
+    extra_configs: './configs/demo/bench/vbench_eval.yaml'
diff --git a/configs/demo/bench/2_multi_op_pipline.yaml b/configs/demo/bench/2_multi_op_pipline.yaml
@@ -0,0 +1,58 @@
+# Sandbox config example
+
+# global parameters
+project_name: 'demo-bench'
+experiment_name: 'single_op_language_score'              # for wandb tracer name
+work_dir: './outputs/demo-bench'                         # the default output dir for meta logging
+
+# configs for each job, the jobs will be executed according to the order in the list
+probe_job_configs:
+
+refine_recipe_job_configs:
+
+execution_job_configs:
+  - hook: 'ProcessDataHook'
+    meta_name:
+    dj_configs:
+      project_name: 'demo-bench'
+      dataset_path: './demos/data/demo-dataset-videos.jsonl'  # path to your dataset directory or file
+      export_path: './outputs/demo-bench/demo-dataset-with-multi-op-stats.jsonl'
+      export_original_dataset: true                            # must be true to keep statistics values with dataset
+      process:
+        # select samples with high language score
+        - language_id_score_filter:
+            lang:
+            min_score: 0.7206037306785583     # this value can be observed in the analysis result of the probe job in the single-op experiment
+        # select samples with middle video duration
+        - video_duration_filter:
+            min_duration: 19.315000   # this value can be observed in the analysis result of the probe job in the single-op experiment
+            max_duration: 32.045000   # this value can be observed in the analysis result of the probe job in the single-op experiment
+
+    extra_configs:
+  - hook: 'ProcessDataHook'
+    meta_name:
+    dj_configs:
+      project_name: 'demo-bench'
+      dataset_path: './outputs/demo-bench/demo-dataset-with-multi-op-stats.jsonl'
+      export_path: './outputs/demo-bench/demo-dataset-for-train.jsonl'
+      process:
+        - random_selector:
+            select_num: 16
+    extra_configs:
+  # train model
+  - hook: 'TrainModelHook'
+    meta_name:
+    dj_configs:
+    extra_configs: './configs/demo/bench/model_train.yaml'
+  # infer model
+  - hook: 'InferModelHook'
+    meta_name:
+    dj_configs:
+    extra_configs: './configs/demo/bench/model_infer.yaml'
+
+evaluation_job_configs:
+  # vbench evaluation
+  - hook: 'EvaluateDataHook'
+    meta_name: 'vbench_eval'
+    dj_configs:
+    extra_configs: './configs/demo/bench/vbench_eval.yaml'
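Review note: the two `ProcessDataHook` jobs in `2_multi_op_pipline.yaml` first filter by language score and video duration, then randomly keep 16 samples for training. A minimal sketch of that data flow on in-memory dicts follows. It does not call Data-Juicer: `passes_filters` and `build_training_pool` are hypothetical helpers, the `duration` stats key is an assumed name (only `lang_score` is named in this PR), and the fixed seed is added here for reproducibility.

```python
import random
from typing import Any, Dict, List

# Thresholds copied from 2_multi_op_pipline.yaml.
MIN_LANG_SCORE = 0.7206037306785583
MIN_DURATION = 19.315
MAX_DURATION = 32.045


def passes_filters(sample: Dict[str, Any]) -> bool:
    """Keep samples with a high language score and a mid-range duration."""
    stats = sample['__dj__stats__']
    return (stats['lang_score'] >= MIN_LANG_SCORE
            and MIN_DURATION <= stats['duration'] <= MAX_DURATION)


def build_training_pool(samples: List[Dict[str, Any]],
                        select_num: int = 16,
                        seed: int = 0) -> List[Dict[str, Any]]:
    """Apply both filters, then draw up to select_num samples at random."""
    kept = [s for s in samples if passes_filters(s)]
    rng = random.Random(seed)
    # If fewer than select_num samples survive, return all of them.
    return rng.sample(kept, min(select_num, len(kept)))
```

Filtering before sampling means the 16 training samples are always drawn from the pool that satisfies both thresholds, which matches the order of the jobs in the config.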