diff --git a/README.md b/README.md
index 3374e5ea0..841bc091e 100644
--- a/README.md
+++ b/README.md
@@ -33,12 +33,20 @@ This project is being actively updated and maintained, and we will periodically
If you find Data-Juicer useful for your research or development, please kindly
cite our [work](#references).
+You are welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below with WeChat) for discussion.
+
+
+
----
## News
--  [2023-10-13] Our first data-centric LLM competition begins! Please
- visit the competition's official websites, **FT-Data Ranker** ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
+-  [2024-01-05] We release **Data-Juicer v0.1.3** now!
+This version supports **more Python versions** (3.7-3.10) and **multimodal** dataset [conversion](tools/multimodal/README.md) and [processing](docs/Operators.md) (currently text, images, and audio; more modalities will be supported in the future).
+In addition, our paper has been updated to [v3](https://arxiv.org/abs/2309.02033).
+
+- [2023-10-13] Our first data-centric LLM competition begins! Please
+ visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
- [2023-10-8] We update our paper to the 2nd version and release the corresponding version 0.1.2 of Data-Juicer!
@@ -98,7 +106,7 @@ Table of Contents
## Prerequisites
-- Recommend Python==3.8
+- Recommended: Python>=3.7,<=3.10
- gcc >= 5 (at least C++14 support)
## Installation
@@ -330,7 +338,7 @@ We are in a rapidly developing field and greatly welcome contributions of new
features, bug fixes and better documentations. Please refer to
[How-to Guide for Developers](docs/DeveloperGuide.md).
-Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion.
+If you have any questions, please join our [discussion groups](README.md).
## Acknowledgement
Data-Juicer is used across various LLM products and research initiatives,
diff --git a/README_ZH.md b/README_ZH.md
index b4d25681b..496f8e1a1 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -31,12 +31,20 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM
如果Data-Juicer对您的研发有帮助,请引用我们的[工作](#参考文献) 。
+欢迎加入我们的[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) ,[钉钉群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) ,或微信群(扫描下方二维码加入)进行讨论。
+
+
+
----
## 新消息
--  [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了!
- 请访问大赛官网,**FT-Data Ranker**([1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ,了解更多信息。
+-  [2024-01-05] 现在,我们发布了 **Data-Juicer v0.1.3** 版本!
+在这个新版本中,我们支持了**更多Python版本**(3.7-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)。
+此外,我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033) 。
+
+- [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了!
+ 请访问大赛官网,FT-Data Ranker([1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ,了解更多信息。
- [2023-10-8] 我们的论文更新至第二版,并发布了对应的Data-Juicer v0.1.2版本!
@@ -86,7 +94,7 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM
## 前置条件
-* 推荐 Python==3.8
+* 推荐 Python>=3.7,<=3.10
* gcc >= 5 (at least C++14 support)
## 安装
@@ -309,7 +317,7 @@ Data-Juicer 在 Apache License 2.0 协议下发布。
大模型是一个高速发展的领域,我们非常欢迎贡献新功能、修复漏洞以及文档改善。请参考[开发者指南](docs/DeveloperGuide_ZH.md)。
-欢迎加入我们的[Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), 或[DingDing群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 。
+如果您有任何问题,欢迎加入我们的[讨论群](README_ZH.md) 。
## 致谢
diff --git a/data_juicer/__init__.py b/data_juicer/__init__.py
index 10939f01b..8ce9b3623 100644
--- a/data_juicer/__init__.py
+++ b/data_juicer/__init__.py
@@ -1 +1 @@
-__version__ = '0.1.2'
+__version__ = '0.1.3'
diff --git a/environments/minimal_requires.txt b/environments/minimal_requires.txt
index 1202407a8..79cbc429d 100644
--- a/environments/minimal_requires.txt
+++ b/environments/minimal_requires.txt
@@ -7,6 +7,7 @@ tabulate
tqdm
jsonargparse[signatures]
matplotlib
+seaborn
emoji==2.2.0
regex
requests
diff --git a/tools/multimodal/README.md b/tools/multimodal/README.md
index d4559f689..33f2ddcb4 100644
--- a/tools/multimodal/README.md
+++ b/tools/multimodal/README.md
@@ -5,8 +5,62 @@ This folder contains some scripts and tools for multimodal datasets before and a
## Dataset Format Conversion
Due to large format diversity among different multimodal datasets and works,
-Data-Juicer propose a novel intermediate format for multimodal dataset and
-provided several dataset format conversion tools for some popular multimodal
+Data-Juicer proposes a novel intermediate text-based interleaved data format for multimodal datasets, which
+is based on chunk-wise formats such as the MMC4 dataset.
+
+In the Data-Juicer format, a multimodal sample or document is built around a text
+that consists of several text chunks. Each chunk is a semantic unit, and all the
+multimodal information in a chunk should describe the same thing and be aligned
+with each other.
+
+Below is an example of a multimodal sample in Data-Juicer format.
+- It includes 4 chunks split by the special token `<|__dj__eoc|>`.
+- In addition to text, there are 3 other modalities: images, audios, and videos.
+They are stored on disk, and their paths are
+listed in the corresponding first-level fields of the sample.
+- Other modalities are represented as special tokens in the text (e.g. `<__dj__image>` for an image).
+The special tokens of each modality correspond to the paths of that modality in order of appearance
+(e.g. the two image tokens in the third chunk refer to the images of antarctica_map and europe_map, respectively).
+- A single chunk can contain multiple modalities and multiple special tokens,
+and they are semantically aligned with each other and with the text in the chunk.
+Special tokens can appear at any position in a chunk (usually before or after the text).
+- Unlike text-only samples, the stats computed for other modalities in a multimodal sample
+can be a list of stats, one for each item in the list of multimodal data (e.g. image_widths in this sample).
+
+```python
+{
+ "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+ "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+ "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+ "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+ "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+ "Most of Antarctica is covered by the Antarctic ice sheet, "
+ "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+ "images": [
+ "path/to/the/image/of/antarctica_snowfield",
+ "path/to/the/image/of/antarctica_map",
+ "path/to/the/image/of/europe_map"
+ ],
+ "audios": [
+ "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+ ],
+ "videos": [
+ "path/to/the/video/of/remote_sensing_view_of_antarctica"
+ ],
+ "meta": {
+ "src": "customized",
+ "version": "0.1",
+ "author": "xxx"
+ },
+ "stats": {
+ "lang": "en",
+ "image_widths": [224, 336, 512],
+ ...
+ }
+}
+```
+
+Based on this format, Data-Juicer provides several dataset format conversion tools for some popular multimodal
works.
These tools consist of two types:
@@ -15,11 +69,11 @@ These tools consist of two types:
For now, dataset formats that are supported by Data-Juicer are listed in the following table.
-| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
-|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
-| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
-| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| Format | Type | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
+|------------|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| LLaVA-like | image-text | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| MMC4-like | image-text | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
+| WavCaps-like | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
For all tools, you can run the following command to find out the usage of them:
diff --git a/tools/multimodal/README_ZH.md b/tools/multimodal/README_ZH.md
index 55671e09b..996bdbb54 100644
--- a/tools/multimodal/README_ZH.md
+++ b/tools/multimodal/README_ZH.md
@@ -4,7 +4,57 @@
## 数据集格式转换
-由于不同多模态数据集和工作之间的数据集格式差异较大,Data-Juicer 提出了一种新颖的多模态数据集中间格式,并为一些流行的多模态工作提供了若干数据集格式转换工具。
+由于不同多模态数据集和工作之间的数据集格式差异较大, Data-Juicer 提出了一种新颖的、中间的、
+基于文本的、交替的多模态数据格式,主要基于一些按块(chunk)组织的格式,如MMC4数据集格式。
+
+在 Data-Juicer 的格式中,一个多模态样本或者文档基于一段文本组织,其由若干个文本块组成。
+每个文本块是一个语义单元,单个文本块中包括的所有多模态信息都应该在谈论同样的事情,并且它们彼此语义上是对齐的。
+
+下面是一个 Data-Juicer 格式的多模态样本示例。
+- 它包括4个文本块,它们由特殊token `<|__dj__eoc|>` 分割开。
+- 除了文本,这个样本还包括3种其他模态:图像(images),音频(audios),视频(videos)。
+它们保存在硬盘上,而它们的硬盘路径列举在了样本中对应的一级字段的列表里。
+- 在文本中,其他模态被表示为了特殊token(例如,图像 -- `<__dj__image>`)。
+每种模态的特殊token所表示的数据按照它们在文本中出现的顺序对应到列表中的路径上。
+(例如,第3个文本块中的2个图像token分别对应了图像路径列表中的antarctica_map图像和europe_map图像)
+- 在单个文本块中,可以有多种模态的数据以及多个模态特殊token,它们彼此是语义上对齐的,而且它们与该文本块中的文本也是语义对齐的。
+这些模态特殊token在文本块中可以处于任意位置(通常处于文本前或者文本后)。
+- 不同于纯文本样本,对于多模态样本来说,为其他模态计算的stats可能为针对多模态数据列表的一个stats列表(如例子中的image_widths)。
+
+```python
+{
+ "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+ "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+ "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+ "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+ "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+ "Most of Antarctica is covered by the Antarctic ice sheet, "
+ "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+ "images": [
+ "path/to/the/image/of/antarctica_snowfield",
+ "path/to/the/image/of/antarctica_map",
+ "path/to/the/image/of/europe_map"
+ ],
+ "audios": [
+ "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+ ],
+ "videos": [
+ "path/to/the/video/of/remote_sensing_view_of_antarctica"
+ ],
+ "meta": {
+ "src": "customized",
+ "version": "0.1",
+ "author": "xxx"
+ },
+ "stats": {
+ "lang": "en",
+ "image_widths": [224, 336, 512],
+ ...
+ }
+}
+```
+
+根据这个格式,Data-Juicer 为一些流行的多模态工作提供了若干数据集格式转换工具。
这些工具分为两种类型:
- 其他格式到 Data-Juicer 格式的转换:这些工具在 `source_format_to_data_juicer_format` 目录中。它们可以帮助将其他格式的数据集转换为 Data-Juicer 格式的目标数据集。
@@ -12,11 +62,11 @@
目前,Data-Juicer 支持的数据集格式在下面表格中列出。
-| 格式 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 |
-|----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
-| 类LLaVA格式 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| 类MMC4格式 | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [格式描述](https://github.com/allenai/mmc4#documents) |
-| 类WavCaps格式 | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| 格式 | 类型 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 |
+|----------|-------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
+| 类LLaVA格式 | 图像-文本 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| 类MMC4格式 | 图像-文本 | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [格式描述](https://github.com/allenai/mmc4#documents) |
+| 类WavCaps格式 | 音频-文本 | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
对于所有工具,您可以运行以下命令来了解它们的详细用法: