sandbox bench experiment workflow #364

Merged: 76 commits, Aug 2, 2024
Commits
28f43d5
FVD and ISV for video eval
BeachWang Mar 21, 2024
dbc1e0a
Merge branch 'main' of github.com:alibaba/data-juicer into dev/fvd_eval
BeachWang Mar 21, 2024
40e69e7
restore tools init
BeachWang Mar 26, 2024
8744767
restore tools init
BeachWang Mar 26, 2024
6ae9b1b
pre-commit done
BeachWang Mar 26, 2024
173f0ce
add FID KID IS PR and PRV metrics
BeachWang Mar 29, 2024
809775f
add KVD metric
BeachWang Mar 29, 2024
381cf1c
fix doc
BeachWang Apr 2, 2024
535748c
allow relative path
BeachWang Apr 3, 2024
f2cbadb
fix sample 50000 image
BeachWang Apr 7, 2024
4915deb
Merge branch 'main' of github.com:alibaba/data-juicer into dev/fvd_eval
BeachWang May 7, 2024
4ec9efe
fvd sandbox
BeachWang May 8, 2024
84a5290
fvd sandbox test done
BeachWang May 8, 2024
c11e045
precommit done
BeachWang May 8, 2024
9a36d7d
easyanimate train and infer in sandbox
BeachWang May 13, 2024
68d83cb
merge main
BeachWang May 13, 2024
280907e
divide dataset pipline
BeachWang May 15, 2024
95ef2b7
merge fix_sandbox_pipline
BeachWang May 16, 2024
0a897c1
fix data num for each partition
BeachWang May 16, 2024
87051ce
Merge branch 'dev/fix_sandbox_pipline' of github.com:alibaba/data-jui…
BeachWang May 16, 2024
651e946
pre-commit done
BeachWang May 16, 2024
f6364d4
Merge branch 'dev/fix_sandbox_pipline' of github.com:alibaba/data-jui…
BeachWang May 16, 2024
a7e8f79
test sandbox for videos done
BeachWang May 17, 2024
bfe8091
fix executor
BeachWang May 17, 2024
d41a01a
fix executor
BeachWang May 17, 2024
fd4bea0
check datalen
BeachWang May 17, 2024
e4d3ecd
sort data for partition
BeachWang May 17, 2024
c935adc
sort data for partition
BeachWang May 17, 2024
846af68
fix video_aspect_ratio_filter
BeachWang May 20, 2024
a4af5dd
Merge branch 'dev/fix_video_filter_for_bench' into dev/easyanimate_fo…
BeachWang May 20, 2024
e31127f
fix video_aspect_ratio_filter
BeachWang May 20, 2024
cef83d5
tensor stats to float
BeachWang May 23, 2024
67c9db8
precommit done
BeachWang May 23, 2024
00e4a49
Merge branch 'dev/fix_video_filter_for_bench' into dev/easyanimate_fo…
BeachWang May 23, 2024
828647c
fix words num filter
BeachWang May 23, 2024
ac75a53
pre-commit done
BeachWang May 23, 2024
f27273d
Merge branch 'dev/fix_video_filter_for_bench' into dev/easyanimate_fo…
BeachWang May 23, 2024
db205f3
add seed for train and infer
BeachWang May 31, 2024
3de4ff6
add seed for easyanimate
BeachWang May 31, 2024
0ebac03
merge
BeachWang May 31, 2024
288e416
sandbox rebuild v1
BeachWang Jun 13, 2024
6738f5d
fix empty frames
BeachWang Jun 14, 2024
97bac93
switch
BeachWang Jun 17, 2024
a77113a
fix conflict
BeachWang Jun 19, 2024
3c68d12
fix hpo 3sigma
BeachWang Jun 20, 2024
9ac2f40
Merge branch 'main' of github.com:alibaba/data-juicer into dev/rebuil…
BeachWang Jun 20, 2024
79658c1
after pre-commit
BeachWang Jun 20, 2024
727ccdd
sandbox readme zh
BeachWang Jun 20, 2024
b6662a2
finish doc
BeachWang Jun 20, 2024
498b985
remove training limit
BeachWang Jul 15, 2024
1d8d4de
other_configs -> extra_configs
BeachWang Jul 15, 2024
39553dc
other_configs -> extra_configs
BeachWang Jul 15, 2024
c6173fc
res_name -> meta_name
BeachWang Jul 15, 2024
3305764
hooker -> hook
BeachWang Jul 15, 2024
3ef86c7
analyze -> analyse
BeachWang Jul 15, 2024
b79739d
after pre-commit
BeachWang Jul 15, 2024
5fbabcd
analyse -> analyze
BeachWang Jul 16, 2024
f32a234
Merge branch 'main' of github.com:alibaba/data-juicer into dev/rebuil…
BeachWang Jul 16, 2024
739edfe
merge easyanimate_for_sandbox
BeachWang Jul 16, 2024
7caa923
analyser.py -> analyzer.py
BeachWang Jul 16, 2024
0b9b1b9
analyser.py -> analyzer.py
BeachWang Jul 16, 2024
c8fcdbd
analyser.py -> analyzer.py
BeachWang Jul 16, 2024
0e6f651
regist -> register, DICT -> MAPPING
BeachWang Jul 16, 2024
24a4d65
Merge branch 'dev/rebuild_sandbox' into dev/dj_bench_demo
BeachWang Jul 16, 2024
44d9159
range_specified_field_selector
BeachWang Jul 17, 2024
fb66fdc
pipline test done
BeachWang Jul 23, 2024
e6f14d6
dataset in readme
BeachWang Jul 23, 2024
51f24a5
conflict solved
BeachWang Jul 23, 2024
cedd22c
update readme
BeachWang Jul 24, 2024
c5a015e
pre-commit done
BeachWang Jul 24, 2024
2a5cd09
rm experiment name in dj
BeachWang Jul 24, 2024
9e96feb
add init dataset
BeachWang Jul 25, 2024
8727c67
Merge branch 'main' of github.com:alibaba/data-juicer into dev/dj_ben…
BeachWang Jul 30, 2024
110f109
fix auto_evaluation_helm readme
BeachWang Jul 31, 2024
ec81af5
remove easyanimate code
BeachWang Aug 1, 2024
69e3444
shorten diff
BeachWang Aug 1, 2024
1 change: 0 additions & 1 deletion .gitignore
@@ -1,6 +1,5 @@

# data & resources
models/
outputs/
assets/

3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -39,6 +39,7 @@ exclude: |
docs/.*|
tests/.*|
demos/.*|
tools/mm_eval/inception_metrics.*|
tools/mm_eval/inception_metrics/.*|
thirdparty/easy_animate/.*|
.*\.md
)$
22 changes: 11 additions & 11 deletions README.md
@@ -46,7 +46,7 @@ In this new version, we support more features for **multimodal data (including v
- [2024-02-20] We have actively maintained an *awesome list of LLM-Data*, welcome to [visit](docs/awesome_llm_data.md) and contribute!
- [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
- [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
- [2024-01-05] We release **Data-Juicer v0.1.3** now!
- [2024-01-05] We release **Data-Juicer v0.1.3** now!
In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
- [2023-10-13] Our first data-centric LLM competition begins! Please
@@ -94,8 +94,8 @@ Table of Contents
dedicated [toolkits](#documentation), designed to
function independently of specific multimodal LLM datasets and processing pipelines.

- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration
through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model,
- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration
through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model,
visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

@@ -194,11 +194,11 @@ The dependency options are listed below:
pip install py-data-juicer
```

- **Note**:
- **Note**:
- only the basic APIs in `data_juicer` and two basic tools
(data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
and complete functions, we recommend you install `data_juicer` [from source](#from-source).
- The release versions from pypi have a certain lag compared to the latest version from source.
- The release versions from pypi have a certain lag compared to the latest version from source.
So if you want to follow the latest functions of `data_juicer`, we recommend you install [from source](#from-source).

### Using Docker
@@ -215,7 +215,7 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, which is the same as release version tag.

### Installation check
@@ -413,20 +413,20 @@ docker exec -it <container_id> bash
Data-Juicer is released under Apache License 2.0.

## Contributing
We are in a rapidly developing field and greatly welcome contributions of new
features, bug fixes and better documentations. Please refer to
We are in a rapidly developing field and greatly welcome contributions of new
features, bug fixes and better documentations. Please refer to
[How-to Guide for Developers](docs/DeveloperGuide.md).

If you have any questions, please join our [discussion groups](README.md).

## Acknowledgement
Data-Juicer is used across various LLM products and research initiatives,
including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for
financial analysis, and Zhiwen for reading assistant, as well as the Alibaba
including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for
financial analysis, and Zhiwen for reading assistant, as well as the Alibaba
Cloud's platform for AI (PAI).
We look forward to more of your experience, suggestions and discussions for collaboration!

Data-Juicer thanks and refers to several community projects, such as
Data-Juicer thanks and refers to several community projects, such as
[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....


2 changes: 1 addition & 1 deletion README_ZH.md
@@ -193,7 +193,7 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, which is the same as the release version tag.

### Installation check
20 changes: 14 additions & 6 deletions configs/config_all.yaml
@@ -49,7 +49,6 @@ data_probe_algo: 'uniform' # sampling algorithm
data_probe_ratio: 1.0 # the sampling ratio to the original dataset size. It's 1.0 in default. Only used for dataset sampling.
hpo_config: null # path to a configuration file when using auto-HPO tool.


# process schedule: a list of several process operators with their arguments
process:
# Mapper ops. Most of these ops need no arguments.
@@ -496,13 +495,22 @@ process:
ignore_non_character: false # whether to ignore non-alphabet characters, including whitespaces, digits, and punctuations

# Selector ops
- topk_specified_field_selector: # selector to select top samples based on the sorted specified field
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top samples
topk: # number of selected top samples
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
- frequency_specified_field_selector: # selector to select samples based on the sorted frequency of specified field value
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top specified field value
topk: # number of selected top specified field value
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
- random_selector: # selector to randomly select samples
select_ratio: # the ratio to be sampled
select_num: # the number to be sampled
- range_specified_field_selector: # selector to select a range of samples based on the sorted specified field value from smallest to largest.
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
lower_percentile: # the lower bound of the percentile to be sampled
upper_percentile: # the upper bound of the percentile to be sampled
lower_rank: # the lower bound of the rank to be sampled
upper_rank: # the upper bound of the rank to be sampled
- topk_specified_field_selector: # selector to select top samples based on the sorted specified field
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top samples
topk: # number of selected top samples
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
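
For reviewers, the new `range_specified_field_selector` options can be read as "sort by the field, then keep a slice of the sorted samples". The sketch below is only an illustration under stated assumptions, not the operator's actual implementation: the helper names `get_nested` and `select_range` are hypothetical, percentiles are assumed to be fractions in [0, 1] converted to ranks by truncation, and when both percentile and rank bounds are given the narrower bounds are assumed to win.

```python
from typing import Any, Dict, List, Optional


def get_nested(sample: Dict[str, Any], field_key: str) -> Any:
    """Follow a '.'-separated key such as '__dj__stats__.lang_score'."""
    value: Any = sample
    for key in field_key.split("."):
        value = value[key]
    return value


def select_range(
    samples: List[Dict[str, Any]],
    field_key: str,
    lower_percentile: Optional[float] = None,
    upper_percentile: Optional[float] = None,
    lower_rank: Optional[int] = None,
    upper_rank: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Sort ascending by the field and keep positions in [lower, upper)."""
    ordered = sorted(samples, key=lambda s: get_nested(s, field_key))
    n = len(ordered)
    lower, upper = 0, n
    # Assumption: percentiles are fractions of the dataset size, truncated to a rank.
    if lower_percentile is not None:
        lower = int(lower_percentile * n)
    if upper_percentile is not None:
        upper = int(upper_percentile * n)
    # Assumption: if rank bounds are also given, the narrower bounds apply.
    if lower_rank is not None:
        lower = max(lower, lower_rank)
    if upper_rank is not None:
        upper = min(upper, upper_rank)
    # Clamp so the slice is always valid.
    lower = max(0, min(lower, n))
    upper = max(lower, min(upper, n))
    return ordered[lower:upper]


# Nine toy samples with lang_score 0.1 .. 0.9.
samples = [{"__dj__stats__": {"lang_score": s / 10}} for s in range(1, 10)]
high = select_range(
    samples,
    "__dj__stats__.lang_score",
    lower_percentile=0.667,
    upper_percentile=1.0,
)
```

With the demo values `lower_percentile: 0.667` and `upper_percentile: 1.000`, this keeps roughly the top third of samples by the chosen statistic.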
1 change: 1 addition & 0 deletions configs/data_juicer_recipes/README.md
@@ -41,6 +41,7 @@ We use simple 3-σ rule to set the hyperparameters for ops in each recipe.
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with refined dataset** and fine-tuned with the original instruct dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
3 changes: 2 additions & 1 deletion configs/data_juicer_recipes/README_ZH.md
@@ -41,6 +41,7 @@
| Subset | #samples before | #samples after | Keep ratio | Config link | Data link | Source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

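### Evaluation Results
- LLaVA pretrain (LCS-558k): models pretrained with the **refined pretraining dataset** and fine-tuned with the original instruction dataset outperform the baseline LLaVA-1.5-13B on 10 out of 12 benchmarks.
@@ -57,4 +58,4 @@
- Video-only: improve dataset quality according to video properties
- Text-video: improve dataset quality according to the alignment between text and video
Users can start their video dataset processing workflows based on this recipe.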
-
-
68 changes: 68 additions & 0 deletions configs/demo/bench/1_single_op_pipline.yaml
@@ -0,0 +1,68 @@
# Sandbox config example

# global parameters
project_name: 'demo-bench'
experiment_name: 'single_op_language_score' # for wandb tracer name
work_dir: './outputs/demo-bench' # the default output dir for meta logging

# configs for each job, the jobs will be executed according to the order in the list
probe_job_configs:
  # get statistics value for each sample and get the distribution analysis for given percentiles
  - hook: 'ProbeViaAnalyzerHook'
    meta_name: 'analysis_ori_data'
    dj_configs:
      project_name: 'demo-bench'
      dataset_path: './demos/data/demo-dataset-videos.jsonl' # path to your dataset directory or file
      percentiles: [0.333, 0.667] # percentiles to analyze the dataset distribution
      export_path: './outputs/demo-bench/demo-dataset-with-language-score.jsonl'
      export_original_dataset: true # must be true to keep statistics values with dataset
      process:
        - language_id_score_filter:
            lang: 'zh'
            min_score: 0.8
    extra_configs:

refine_recipe_job_configs:

execution_job_configs:
  # select one of the low/middle/high splits by statistics value (here: the high split)
  - hook: 'ProcessDataHook'
    meta_name:
    dj_configs:
      project_name: 'demo-bench'
      dataset_path: './outputs/demo-bench/demo-dataset-with-language-score.jsonl' # output dataset of probe jobs
      export_path: './outputs/demo-bench/demo-dataset-with-high-language-score.jsonl'
      process:
        - range_specified_field_selector:
            field_key: '__dj__stats__.lang_score' # the target keys corresponding to multi-level field information need to be separated by '.'. '__dj__stats__' is the default location for storing stats in Data-Juicer, and 'lang_score' is the stat produced by the language_id_score_filter.
            lower_percentile: 0.667
            upper_percentile: 1.000
    extra_configs:
  # randomly sample a fixed number of instances from the dataset
  - hook: 'ProcessDataHook'
    meta_name:
    dj_configs:
      project_name: 'demo-bench'
      dataset_path: './outputs/demo-bench/demo-dataset-with-high-language-score.jsonl' # output dataset of the previous job
      export_path: './outputs/demo-bench/demo-dataset-for-train.jsonl'
      process:
        - random_selector:
            select_num: 16
    extra_configs:
  # train model
  - hook: 'TrainModelHook'
    meta_name:
    dj_configs:
    extra_configs: './configs/demo/bench/model_train.yaml'
  # infer model
  - hook: 'InferModelHook'
    meta_name:
    dj_configs:
    extra_configs: './configs/demo/bench/model_infer.yaml'

evaluation_job_configs:
  # vbench evaluation
  - hook: 'EvaluateDataHook'
    meta_name: 'vbench_eval'
    dj_configs:
    extra_configs: './configs/demo/bench/vbench_eval.yaml'
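
The config comment says the jobs run in the order they are listed. A minimal sketch of that ordering is shown here for reviewers. It is not the sandbox executor itself: `plan_jobs` and `STAGES` are hypothetical names, the stage order is assumed from the order of the top-level keys in this file, and the config is written as a plain dict (as a YAML loader would return it, with empty keys parsed to `None`).

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed stage order, taken from the order of the top-level keys in the config.
STAGES = [
    "probe_job_configs",
    "refine_recipe_job_configs",
    "execution_job_configs",
    "evaluation_job_configs",
]


def plan_jobs(config: Dict[str, Optional[List[Dict[str, Any]]]]) -> List[Tuple[str, str]]:
    """Return (stage, hook) pairs in the order they would be executed."""
    plan: List[Tuple[str, str]] = []
    for stage in STAGES:
        # An empty YAML key (e.g. `refine_recipe_job_configs:`) loads as None.
        for job in config.get(stage) or []:
            plan.append((stage, job["hook"]))
    return plan


# Dict form of the single-op demo config, reduced to the fields used here.
config = {
    "probe_job_configs": [
        {"hook": "ProbeViaAnalyzerHook", "meta_name": "analysis_ori_data"},
    ],
    "refine_recipe_job_configs": None,
    "execution_job_configs": [
        {"hook": "ProcessDataHook"},
        {"hook": "ProcessDataHook"},
        {"hook": "TrainModelHook"},
        {"hook": "InferModelHook"},
    ],
    "evaluation_job_configs": [
        {"hook": "EvaluateDataHook", "meta_name": "vbench_eval"},
    ],
}

for stage, hook in plan_jobs(config):
    print(f"{stage}: {hook}")
```

Read this way, the single-op experiment is: probe the data, select the high-score split, subsample, train, infer, then evaluate.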
58 changes: 58 additions & 0 deletions configs/demo/bench/2_multi_op_pipline.yaml
@@ -0,0 +1,58 @@
# Sandbox config example

# global parameters
project_name: 'demo-bench'
experiment_name: 'single_op_language_score' # for wandb tracer name
work_dir: './outputs/demo-bench' # the default output dir for meta logging

# configs for each job, the jobs will be executed according to the order in the list
probe_job_configs:

refine_recipe_job_configs:

execution_job_configs:
  - hook: 'ProcessDataHook'
    meta_name:
    dj_configs:
      project_name: 'demo-bench'
      dataset_path: './demos/data/demo-dataset-videos.jsonl' # path to your dataset directory or file
      export_path: './outputs/demo-bench/demo-dataset-with-multi-op-stats.jsonl'
      export_original_dataset: true # must be true to keep statistics values with dataset
      process:
        # select samples with high language score
        - language_id_score_filter:
            lang:
            min_score: 0.7206037306785583 # this value can be observed in the analysis result of the probe job in the single-op experiments
        # select samples with middle video duration
        - video_duration_filter:
            min_duration: 19.315000 # this value can be observed in the analysis result of the probe job in the single-op experiments
            max_duration: 32.045000 # this value can be observed in the analysis result of the probe job in the single-op experiments

    extra_configs:
  - hook: 'ProcessDataHook'
    meta_name:
    dj_configs:
      project_name: 'demo-bench'
      dataset_path: './outputs/demo-bench/demo-dataset-with-multi-op-stats.jsonl'
      export_path: './outputs/demo-bench/demo-dataset-for-train.jsonl'
      process:
        - random_selector:
            select_num: 16
    extra_configs:
  # train model
  - hook: 'TrainModelHook'
    meta_name:
    dj_configs:
    extra_configs: './configs/demo/bench/model_train.yaml'
  # infer model
  - hook: 'InferModelHook'
    meta_name:
    dj_configs:
    extra_configs: './configs/demo/bench/model_infer.yaml'

evaluation_job_configs:
  # vbench evaluation
  - hook: 'EvaluateDataHook'
    meta_name: 'vbench_eval'
    dj_configs:
    extra_configs: './configs/demo/bench/vbench_eval.yaml'
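
The multi-op recipe chains two thresholds taken from the single-op probe results and then subsamples a fixed number of instances. A rough sketch of that data flow follows, under loud assumptions: `passes_filters` and `run_pipeline` are hypothetical helpers, the stat keys `lang_score` and `video_duration` are illustrative scalar fields (the real operators may store stats differently), and the seeded `random.Random` stands in for whatever sampling `random_selector` actually performs.

```python
import random
from typing import Dict, List

# Thresholds copied from the config above.
MIN_LANG_SCORE = 0.7206037306785583
MIN_DURATION = 19.315
MAX_DURATION = 32.045


def passes_filters(stats: Dict[str, float]) -> bool:
    """A sample is kept only if it satisfies both filters."""
    return (
        stats["lang_score"] >= MIN_LANG_SCORE
        and MIN_DURATION <= stats["video_duration"] <= MAX_DURATION
    )


def run_pipeline(samples: List[Dict[str, float]], select_num: int, seed: int = 0) -> List[Dict[str, float]]:
    """Apply both filters, then randomly keep at most `select_num` samples."""
    kept = [s for s in samples if passes_filters(s)]
    if len(kept) <= select_num:
        return kept
    # Seeded generator so the sketch is reproducible.
    return random.Random(seed).sample(kept, select_num)


# Toy data: 40 samples that pass both filters and 10 whose videos are too short.
samples = [{"lang_score": 0.9, "video_duration": 25.0} for _ in range(40)]
samples += [{"lang_score": 0.9, "video_duration": 5.0} for _ in range(10)]
selected = run_pipeline(samples, select_num=16)
```

Because the filters are applied jointly, a sample must be in both the high language-score range and the middle duration range to reach the training set.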