
Awesome Text2X Resources


This is an open collection of state-of-the-art (SOTA), novel Text to X (X can be everything) methods (papers, codes and datasets), intended to keep pace with the anticipated surge of research in the coming months.

⭐ If you find this repository useful for your research or work, please consider starring it.

💗 Continual improvements are being made to this repository. If you come across any relevant papers that should be included, please don't hesitate to submit a pull request (PR) or open an issue. Additional resources such as blog posts and videos are also welcome.

✉️ For any additions or suggestions, feel free to contribute or contact hyqale1024@gmail.com.

🔥 News

  • 2024.12.21 - adjusted the layouts of several sections. Happy Winter Solstice ⚪🥣
  • 2024.04.05 - adjusted the layout and added accepted-paper and arXiv lists to each section.

Table of Contents

Update Logs

2025 Update Logs:
  • 2025.01.23 - updated the status of several "ICLR 2025" papers to accepted, congrats to all 🎉
  • 2025.01.09 - updated the layout.
Previous 2024 Update Logs:
  • 2024.09.26 - updated the status of several "NeurIPS 2024" papers to accepted, congrats to all 🎉
  • 2024.09.03 - added a new section, 'text to model'.
  • 2024.07.02 - updated the status of several "ECCV 2024" papers to accepted, congrats to all 🎉
  • 2024.06.30 - added a new section, 'text to video'.
  • 2024.06.21 - added a hot topic, AIGC 4D Generation, to the Survey and Awesome Repos section.
  • 2024.06.17 - an awesome repo for CVPR 2024 Link 👍🏻
  • 2024.04.05 - an awesome repo for CVPR 2024 on 3DGS and NeRF Link 👍🏻
  • 2024.03.25 - added a new survey paper on 3D GS to the "Survey and Awesome Repos -- Topic 1: 3D Gaussian Splatting" section.
  • 2024.03.12 - added a new section, "Dynamic Gaussian Splatting", covering Neural Deformable 3D Gaussians, 4D Gaussians, and Dynamic 3D Gaussians.
  • 2024.03.11 - CVPR 2024 Accepted Papers Link
  • updated several papers accepted by CVPR 2024! Congratulations 🎉

Text to 4D

(Also, Image/Video to 4D)

💡 4D ArXiv Papers

1. AR4D: Autoregressive 4D Generation from Monocular Videos

Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian (University of Science and Technology of China, Microsoft Research Asia)

Abstract Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on its previous frame's representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
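
Since the abstract contrasts AR4D with Score Distillation Sampling, a minimal generic sketch of the SDS gradient may help readers unfamiliar with it. This is a toy PyTorch illustration of the standard SDS update used by many text-to-3D/4D methods, not AR4D's pipeline (which avoids SDS); toy_eps_model is a hypothetical stand-in for a pretrained text-conditioned diffusion model.

import torch

# Toy stand-in for a pretrained text-conditioned noise predictor eps_phi(x_t, t, y).
def toy_eps_model(x_noisy, t, text_emb):
    return x_noisy * 0.1 + text_emb.mean() * 0.01  # placeholder, NOT a real diffusion model

def sds_grad(rendered, text_emb, alphas_cumprod):
    """Standard Score Distillation Sampling gradient w.r.t. a rendered image."""
    t = torch.randint(20, 980, (1,))
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(rendered)
    x_noisy = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_pred = toy_eps_model(x_noisy, t, text_emb)
    w = 1 - a_t                                    # one common weighting choice
    return w * (eps_pred - noise)                  # gradient of L_SDS w.r.t. the rendering

# Usage: backpropagate the SDS gradient from the rendering into the scene parameters.
rendered = torch.randn(1, 3, 64, 64, requires_grad=True)
text_emb = torch.randn(77, 768)
alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
rendered.backward(gradient=sds_grad(rendered, text_emb, alphas_cumprod))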

2. GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li

(The Chinese University of Hong Kong, Centre for Perceptual and Interactive Intelligence, Avolution AI)

Abstract 4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production.
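
The core idea above (lifting dense 3D point tracks into per-frame Gaussians whose renderings guide the DiT) can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the authors' implementation; an actual 3DGS rasterizer would be needed to render the resulting guidance video.

import numpy as np

def tracks_to_gaussian_frames(tracks_xyz, colors, scale=0.01):
    """tracks_xyz: (T, N, 3) dense 3D point tracks; colors: (N, 3) per-point RGB.
    Returns per-frame Gaussian parameter dicts, i.e. a 'pseudo 4D Gaussian field'."""
    frames = []
    for pts in tracks_xyz:                      # one set of Gaussians per video frame
        frames.append({
            "means": pts,                       # Gaussian centers follow the tracked points
            "scales": np.full((len(pts), 3), scale),
            "rotations": np.tile([1.0, 0.0, 0.0, 0.0], (len(pts), 1)),  # identity quaternions
            "opacities": np.ones((len(pts), 1)),
            "colors": colors,
        })
    return frames

# Toy usage: 8 frames of 1000 drifting points. Rendering each frame's Gaussians under
# a new camera (with any 3DGS renderer) would yield the guidance video for the DiT.
T, N = 8, 1000
tracks = np.cumsum(0.01 * np.random.randn(T, N, 3), axis=0) + np.random.rand(1, N, 3)
colors = np.random.rand(N, 3)
gaussian_frames = tracks_to_gaussian_frames(tracks, colors)
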
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | AR4D: Autoregressive 4D Generation from Monocular Videos | 3 Jan 2025 | Link | -- | Link
2025 | GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking | 5 Jan 2025 | Link | Link | Link
ArXiv Papers References
%arxiv papers

@misc{zhu2025ar4dautoregressive4dgeneration,
      title={AR4D: Autoregressive 4D Generation from Monocular Videos}, 
      author={Hanxin Zhu and Tianyu He and Xiqian Yu and Junliang Guo and Zhibo Chen and Jiang Bian},
      year={2025},
      eprint={2501.01722},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.01722}, 
}

@article{bian2025gsdit,
  title={GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking},
  author={Bian, Weikang and Huang, Zhaoyang and Shi, Xiaoyu and Li, Yijin and Wang, Fu-Yun and Li, Hongsheng},
  journal={arXiv preprint arXiv:2501.02690},
  year={2025}
}


Previous Papers

Year 2023

In 2023, tasks classified as text/image to 4D and video to 4D generally involve producing four-dimensional data from text/image or video input. For more details, please check the 2023 4D Papers, including 6 accepted papers and 3 arXiv papers.

Year 2024

For more details, please check the 2024 4D Papers, including 15 accepted papers and 19 arXiv papers.


Text to Video

💡 T2V ArXiv Papers

1. TransPixar: Advancing Text-to-Video Generation with Transparency

Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen

(HKUST(GZ), HKUST, Adobe Research)

Abstract Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
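
TransPixar's alpha extension relies on LoRA-style fine-tuning of a pretrained DiT. Below is a minimal, generic LoRA wrapper for reference; it is not the paper's code, and the layer sizes, rank, and the comment about where it is applied are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained linear layer with a low-rank trainable update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # keep the original RGB weights frozen
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# In a DiT, such wrappers would typically replace attention projections so that extra
# alpha-specific tokens can be learned without disturbing the pretrained RGB pathway.
proj = LoRALinear(nn.Linear(1024, 1024))
tokens = torch.randn(2, 256, 1024)              # e.g. concatenated RGB + alpha tokens
out = proj(tokens)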

2. Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov

(Snap Inc., UC Merced, CMU)

Abstract Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.

3. BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Weixi Feng, Chao Liu, Sifei Liu, William Yang Wang, Arash Vahdat, Weili Nie (UC Santa Barbara, NVIDIA)

Abstract Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.
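
The blob video representation described above is, at a high level, a per-frame set of object-level ellipse parameters paired with a text embedding. A hypothetical container (field names and shapes are assumptions, not BlobGEN-Vid's actual data structures) might look like:

from dataclasses import dataclass
import numpy as np

@dataclass
class Blob:
    """One object 'blob' in one frame: a 2D ellipse plus an object-level text embedding."""
    frame_idx: int
    center_xy: np.ndarray     # (2,) normalized image coordinates
    scale_xy: np.ndarray      # (2,) ellipse half-axes
    angle: float              # ellipse orientation in radians
    text_emb: np.ndarray      # embedding of the object-level caption

# A "blob video" is then a list of blobs per frame; the diffusion model is conditioned
# on these layouts (e.g. via masked 3D attention) to control motion and appearance.
blob = Blob(frame_idx=0,
            center_xy=np.array([0.4, 0.6]),
            scale_xy=np.array([0.2, 0.1]),
            angle=0.3,
            text_emb=np.random.randn(768))
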
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | TransPixar: Advancing Text-to-Video Generation with Transparency | 6 Jan 2025 | Link | Link | Link
2025 | Multi-subject Open-set Personalization in Video Generation | 10 Jan 2025 | Link | -- | Link
2025 | BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations | 13 Jan 2025 | Link | -- | Link
ArXiv Papers References
%arxiv papers

@misc{wang2025transpixar,
     title={TransPixar: Advancing Text-to-Video Generation with Transparency}, 
     author={Luozhou Wang and Yijun Li and Zhifei Chen and Jui-Hsien Wang and Zhifei Zhang and He Zhang and Zhe Lin and Yingcong Chen},
     year={2025},
     eprint={2501.03006},
     archivePrefix={arXiv},
     primaryClass={cs.CV},
     url={https://arxiv.org/abs/2501.03006}, 
}

@misc{chen2025multisubjectopensetpersonalizationvideo,
      title={Multi-subject Open-set Personalization in Video Generation}, 
      author={Tsai-Shien Chen and Aliaksandr Siarohin and Willi Menapace and Yuwei Fang and Kwot Sin Lee and Ivan Skorokhodov and Kfir Aberman and Jun-Yan Zhu and Ming-Hsuan Yang and Sergey Tulyakov},
      year={2025},
      eprint={2501.06187},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.06187}, 
}

@article{feng2025blobgen,
  title={BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations},
  author={Feng, Weixi and Liu, Chao and Liu, Sifei and Wang, William Yang and Vahdat, Arash and Nie, Weili},
  journal={arXiv preprint arXiv:2501.07647},
  year={2025}
}

Video Other Additional Info

Previous Papers

Year 2024

For more details, please check the 2024 T2V Papers, including 10 accepted papers and 17 arXiv papers.

  • OSS video generation models: Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence.
  • Survey: The Dawn of Video Generation: Preliminary Explorations with SORA-like Models, arXiv, Project Page, GitHub Repo

📚 Dataset Works

1. VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li

(Fudan University, Shanghai Academy of AI for Science)

Abstract The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.
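
The coarse-to-fine curation strategy described above is, at its core, a cascade of increasingly expensive filters. The sketch below is a hypothetical illustration of that pattern only; the scoring functions, metadata fields, and thresholds are placeholders, not VidGen-1M's actual pipeline.

# Hypothetical coarse-to-fine curation cascade; scorers and thresholds are placeholders.
def cheap_quality_ok(clip):       return clip.get("resolution", 0) >= 720
def temporally_consistent(clip):  return clip.get("scene_cuts", 1) == 0
def caption_ok(clip):             return len(clip.get("caption", "")) > 40

def curate(clips):
    # Coarse stage: cheap metadata filters remove most low-quality clips.
    coarse = [c for c in clips if cheap_quality_ok(c)]
    # Fine stage: more expensive per-clip checks (consistency, captions) on the survivors.
    return [c for c in coarse if temporally_consistent(c) and caption_ok(c)]

clips = [{"resolution": 1080, "scene_cuts": 0, "caption": "A dog running across a sunny park lawn chasing a ball."},
         {"resolution": 480,  "scene_cuts": 2, "caption": "clip"}]
print(len(curate(clips)))   # -> 1
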
Year | Title | ArXiv Time | Paper | Code | Project Page
2024 | VidGen-1M: A Large-Scale Dataset for Text-to-video Generation | 5 Aug 2024 | Link | Link | Link
References
%arxiv papers

@article{tan2024vidgen,
  title={VidGen-1M: A Large-Scale Dataset for Text-to-video Generation},
  author={Tan, Zhiyu and Yang, Xiaomeng and Qin, Luozheng and Li, Hao},
  journal={arXiv preprint arXiv:2408.02629},
  year={2024}
}



Text to Scene

💡 3D Scene ArXiv Papers

1. LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation

Yang Zhou, Zongjin He, Qixuan Li, Chao Wang (Shanghai University)

Abstract Recently, the field of text-guided 3D scene generation has garnered significant attention. High-quality generation that aligns with physical realism and high controllability is crucial for practical 3D scene applications. However, existing methods face fundamental limitations: (i) difficulty capturing complex relationships between multiple objects described in the text, (ii) inability to generate physically plausible scene layouts, and (iii) lack of controllability and extensibility in compositional scenes. In this paper, we introduce LayoutDreamer, a framework that leverages 3D Gaussian Splatting (3DGS) to facilitate high-quality, physically consistent compositional scene generation guided by text. Specifically, given a text prompt, we convert it into a directed scene graph and adaptively adjust the density and layout of the initial compositional 3D Gaussians. Subsequently, dynamic camera adjustments are made based on the training focal point to ensure entity-level generation quality. Finally, by extracting directed dependencies from the scene graph, we tailor physical and layout energy to ensure both realism and flexibility. Comprehensive experiments demonstrate that LayoutDreamer outperforms other compositional scene generation quality and semantic alignment methods. Specifically, it achieves state-of-the-art (SOTA) performance in the multiple objects generation metric of T3Bench.
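
A very rough sketch of the first step described above (turning a directed scene graph from the prompt into an initial compositional layout, which would then seed per-entity 3D Gaussians). The relation rules and spacing here are illustrative assumptions, not LayoutDreamer's actual algorithm or energy terms.

import numpy as np

# Hypothetical directed scene graph: (subject, relation, object) triples parsed from the prompt.
scene_graph = [("mug", "on", "table"), ("lamp", "next_to", "mug")]

def init_layout(graph, spacing=0.3):
    """Assign each entity an initial 3D position from simple relation rules (toy)."""
    pos = {}
    for subj, rel, obj in graph:
        base = pos.setdefault(obj, np.zeros(3))
        if rel == "on":
            pos[subj] = base + np.array([0.0, 0.0, spacing])      # place above
        elif rel == "next_to":
            pos[subj] = base + np.array([spacing, 0.0, 0.0])      # place beside
        else:
            pos[subj] = base + np.random.randn(3) * spacing
    return pos

# Each entity would then be initialized as its own set of 3D Gaussians around pos[name]
# and refined with physical / layout energy terms during optimization.
print(init_layout(scene_graph))
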
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation | 4 Feb 2025 | Link | -- | --
ArXiv Papers References
%arxiv papers

@article{zhou2025layoutdreamer,
  title={LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation},
  author={Zhou, Yang and He, Zongjin and Li, Qixuan and Wang, Chao},
  journal={arXiv preprint arXiv:2502.01949},
  year={2025}
}

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 3D Scene Papers, including 17 accepted papers and 13 arXiv papers.


Text to Human Motion

💡 Motion ArXiv Papers

1. MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh

(Singapore University of Technology and Design, LightSpeed Studios)

Abstract Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion.
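
MotionLab's core learning signal is a rectified flow between source and target motion. The following is a minimal, generic rectified-flow training step in PyTorch, not the MotionLab code; the toy MLP (MotionLab uses a MotionFlow Transformer), feature dimensions, and the way conditions are concatenated are assumptions.

import torch
import torch.nn as nn

# Toy velocity network v_theta(x_t, t, condition).
dim, cond_dim = 64, 32
v_net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-4)

def rectified_flow_step(source_motion, target_motion, cond):
    """One training step: regress the constant velocity from source to target motion."""
    b = source_motion.shape[0]
    t = torch.rand(b, 1)                                   # interpolation time in [0, 1]
    x_t = (1 - t) * source_motion + t * target_motion      # linear path between the two motions
    v_target = target_motion - source_motion               # rectified-flow velocity target
    v_pred = v_net(torch.cat([x_t, cond, t], dim=-1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random "motions" (batches of flattened pose features) and conditions.
src, tgt, cond = torch.randn(8, dim), torch.randn(8, dim), torch.randn(8, cond_dim)
print(rectified_flow_step(src, tgt, cond))
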
Year | Title | ArXiv Time | Paper | Code | Project Page
2025 | MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm | 6 Feb 2025 | Link | Link | Link
ArXiv Papers References
%arxiv papers

@article{guo2025motionlab,
  title={MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm},
  author={Guo, Ziyan and Hu, Zeyu and Zhao, Na and Soh, De Wen},
  journal={arXiv preprint arXiv:2502.02358},
  year={2025}
}


Motion Other Additional Info

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 Text to Human Motion Papers, including 30 accepted papers and 14 arXiv papers.

📚 Dataset Works

Datasets

Motion | Info | URL | Others
AIST | AIST Dance Motion Dataset | Link | --
AIST++ | AIST++ Dance Motion Dataset | Link | dance video database with SMPL annotations
AMASS | optical marker-based motion capture datasets | Link | --

Additional Info

AMASS

AMASS is a large database of human motion unifying different optical marker-based motion capture datasets by representing them within a common framework and parameterization. AMASS is readily useful for animation, visualization, and generating training data for deep learning.
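
For reference, AMASS sequences are distributed as .npz files; a minimal loading sketch is below. The keys shown ('poses', 'betas', 'trans', and the framerate field) are the usual AMASS fields, but the file path is a placeholder and key names and pose dimensions vary between releases and sub-datasets, so verify against the data you actually download.

import numpy as np

# Placeholder path: point it at any downloaded AMASS sequence file.
seq = np.load("ACCAD/Female1Walking_c3d/B3_-_walk1_stageii.npz")

poses = seq["poses"]            # (T, D) per-frame body pose in axis-angle; D depends on the body-model release
betas = seq["betas"]            # shape coefficients of the subject
trans = seq["trans"]            # (T, 3) global root translation per frame
fps_key = "mocap_framerate" if "mocap_framerate" in seq else "mocap_frame_rate"
fps = float(seq[fps_key])

print(poses.shape, trans.shape, fps)
# These parameters can be fed to a SMPL/SMPL-X body model to recover a mesh per frame.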

Survey


Text to 3D Human

🎉 Human Accepted Papers

Year | Title | Venue | Paper | Code | Project Page
2022 | AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars | SIGGRAPH 2022 (Journal Track) | Link | Link | Link
2023 | AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control | ICCV 2023 | Link | Link | Link
2023 | DreamWaltz: Make a Scene with Complex 3D Animatable Avatars | NeurIPS 2023 | Link | Link | Link
2023 | DreamHuman: Animatable 3D Avatars from Text | NeurIPS 2023 (Spotlight) | Link | -- | Link
2023 | TeCH: Text-guided Reconstruction of Lifelike Clothed Humans | 3DV 2024 | Link | Link | Link
2023 | TADA! Text to Animatable Digital Avatars | 3DV 2024 | Link | Link | Link
2023 | AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose | AAAI 2024 | Link | Link | Link
2023 | HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting | CVPR 2024 | Link | Link | Link
2023 | HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation | CVPR 2024 | Link | Link | Link
2024 | En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data | CVPR 2024 | Link | Link | Link
2024 | HeadArtist: Text-conditioned 3D Head Generation with Self Score Distillation | SIGGRAPH 2024 | Link | Link | Link
2024 | HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting | ECCV 2024 | Link | Link | Link
2024 | Instant 3D Human Avatar Generation using Image Diffusion Models | ECCV 2024 | Link | -- | Link
2024 | Disentangled Clothed Avatar Generation from Text Descriptions | ECCV 2024 | Link | Link | Link
2024 | MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space | ECCV 2024 | Link | -- | Link
Accepted Papers References
%accepted papers

@article{hong2022avatarclip,
    title={AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars},
    author={Hong, Fangzhou and Zhang, Mingyuan and Pan, Liang and Cai, Zhongang and Yang, Lei and Liu, Ziwei},
    journal={ACM Transactions on Graphics (TOG)},
    volume={41},
    number={4},
    pages={1--19},
    year={2022},
    publisher={ACM New York, NY, USA}
}

@article{jiang2023avatarcraft,
  title={AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control},
  author={Jiang, Ruixiang and Wang, Can and Zhang, Jingbo and Chai, Menglei and He, Mingming and Chen, Dongdong and Liao, Jing},
  journal={arXiv preprint arXiv:2303.17606},
  year={2023}
}

@inproceedings{huang2023dreamwaltz,
  title={{DreamWaltz: Make a Scene with Complex 3D Animatable Avatars}},
  author={Yukun Huang and Jianan Wang and Ailing Zeng and He Cao and Xianbiao Qi and Yukai Shi and Zheng-Jun Zha and Lei Zhang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

@article{kolotouros2023dreamhuman,
  title={DreamHuman: Animatable 3D Avatars from Text},
  author={Kolotouros, Nikos and Alldieck, Thiemo and Zanfir, Andrei and Bazavan, Eduard Gabriel and Fieraru, Mihai and Sminchisescu, Cristian},
  booktitle={NeurIPS},
  year={2023}
}

@inproceedings{huang2024tech,
  title={{TeCH: Text-guided Reconstruction of Lifelike Clothed Humans}},
  author={Huang, Yangyi and Yi, Hongwei and Xiu, Yuliang and Liao, Tingting and Tang, Jiaxiang and Cai, Deng and Thies, Justus},
  booktitle={International Conference on 3D Vision (3DV)},
  year={2024}
}

@article{liao2023tada,
title={TADA! Text to Animatable Digital Avatars},
author={Liao, Tingting and Yi, Hongwei and Xiu, Yuliang and Tang, Jiaxiang and Huang, Yangyi and Thies, Justus and Black, Michael J},
journal={ArXiv},
month={Aug}, 
year={2023} 
}

@article{zhang2023avatarverse,
  title={Avatarverse: High-quality \& stable 3d avatar creation from text and pose},
  author={Zhang, Huichao and Chen, Bowen and Yang, Hao and Qu, Liao and Wang, Xu and Chen, Li and Long, Chao and Zhu, Feida and Du, Kang and Zheng, Min},
  journal={arXiv preprint arXiv:2308.03610},
  year={2023}
}

@article{liu2023humangaussian,
    title={HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting},
    author={Liu, Xian and Zhan, Xiaohang and Tang, Jiaxiang and Shan, Ying and Zeng, Gang and Lin, Dahua and Liu, Xihui and Liu, Ziwei},
    journal={arXiv preprint arXiv:2311.17061},
    year={2023}
}

@misc{huang2023humannorm,
title={Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation},
author={Huang, Xin and Shao, Ruizhi and Zhang, Qi and Zhang, Hongwen and Feng, Ying and Liu, Yebin and Wang, Qing},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2024}
}

@inproceedings{men2024en3d,
  title={En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data},
  author={Men, Yifang and Lei, Biwen and Yao, Yuan and Cui, Miaomiao and Lian, Zhouhui and Xie, Xuansong},
  journal={arXiv preprint arXiv:2401.01173},
  website={https://menyifang.github.io/projects/En3D/index.html},
  year={2024}
}

@article{liu2023HeadArtist,
  author = {Hongyu Liu and Xuan Wang and Ziyu Wan and Yujun Shen and Yibing Song and Jing Liao and Qifeng Chen},
  title = {HeadArtist: Text-conditioned 3D Head Generation with Self Score Distillation},
  journal = {arXiv:2312.07539},
  year = {2023},
}

@article{zhou2024headstudio,
  author = {Zhenglin Zhou and Fan Ma and Hehe Fan and Yi Yang},
  title = {HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting},
  journal={arXiv preprint arXiv:2402.06149},
  year={2024}
}

@inproceedings{kolotouros2024avatarpopup,
  author    = {Kolotouros, Nikos and Alldieck, Thiemo and Corona, Enric and Bazavan, Eduard Gabriel and Sminchisescu, Cristian},
  title     = {Instant 3D Human Avatar Generation using Image Diffusion Models},
  booktitle   = {European Conference on Computer Vision (ECCV)},
  year      = {2024},
}

@misc{wang2023disentangled,
      title={Disentangled Clothed Avatar Generation from Text Descriptions}, 
      author={Jionghao Wang and Yuan Liu and Zhiyang Dou and Zhengming Yu and Yongqing Liang and Xin Li and Wenping Wang and Rong Xie and Li Song},
      year={2023},
      eprint={2312.05295},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@inproceedings{comas2024magicmirror,
  title={MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space},
  author={Comas-Massagu{\'e}, Armand and Qiu, Di and Chai, Menglei and B{\"u}hler, Marcel and Raj, Amit and Gao, Ruiqi and Xu, Qiangeng and Matthews, Mark and Gotardo, Paulo and Orts-Escolano, Sergio and others},
  booktitle={European Conference on Computer Vision},
  pages={178--196},
  year={2024},
  organization={Springer}
}

💡 Human ArXiv Papers

1. Make-A-Character: High Quality Text-to-3D Character Generation within Minutes

Jianqiang Ren, Chao He, Lin Liu, Jiahao Chen, Yutong Wang, Yafei Song, Jianfang Li, Tangli Xue, Siqi Hu, Tao Chen, Kunkun Zheng, Jianjing Xiang, Liefeng Bo

(Institute for Intelligent Computing, Alibaba Group)

Abstract There is a growing demand for customized and expressive 3D characters with the emergence of AI agents and Metaverse, but creating 3D characters using traditional computer graphics tools is a complex and time-consuming task. To address these challenges, we propose a user-friendly framework named Make-A-Character (Mach) to create lifelike 3D avatars from text descriptions. The framework leverages the power of large language and vision models for textual intention understanding and intermediate image generation, followed by a series of human-oriented visual perception and 3D generation modules. Our system offers an intuitive approach for users to craft controllable, realistic, fully-realized 3D characters that meet their expectations within 2 minutes, while also enabling easy integration with existing CG pipeline for dynamic expressiveness.

2. InstructHumans: Editing Animated 3D Human Textures with Instructions (text to 3d human texture editing)

Jiayin Zhu, Linlin Yang, Angela Yao

(National University of Singapore, Communication University of China)

Abstract We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively using such scores is harmful to editing as they destroy consistency with the source avatar. Instead, we propose an alternate SDS for Editing (SDS-E) that selectively incorporates subterms of SDS across diffusion timesteps. We further enhance SDS-E with spatial smoothness regularization and gradient-based viewpoint sampling to achieve high-quality edits with sharp and high-fidelity detailing. InstructHumans significantly outperforms existing 3D editing methods, consistent with the initial avatar while faithful to the textual instructions.

3. HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model

Yi Wang, Jian Ma, Ruizhi Shao, Qiao Feng, Yu-kun Lai, Kun Li

(Tianjin University, Changzhou Institute of Technology, Cardiff University)

Abstract This paper aims to generate physically-layered 3D humans from text prompts. Existing methods either generate 3D clothed humans as a whole or support only tight and simple clothing generation, which limits their applications to virtual try-on and part-level editing. To achieve physically-layered 3D human generation with reusable and complex clothing, we propose a novel layer-wise dressed human representation based on a physically-decoupled diffusion model. Specifically, to achieve layer-wise clothing generation, we propose a dual-representation decoupling framework for generating clothing decoupled from the human body, in conjunction with an innovative multi-layer fusion volume rendering method. To match the clothing with different body shapes, we propose an SMPL-driven implicit field deformation network that enables the free transfer and reuse of clothing. Extensive experiments demonstrate that our approach not only achieves state-of-the-art layered 3D human generation with complex clothing but also supports virtual try-on and layered human animation.

4. DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors

Thomas Hanwen Zhu, Ruining Li, Tomas Jakab

(University of Oxford, Carnegie Mellon University)

Abstract We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.

5. AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction

Lingteng Qiu, Shenhao Zhu, Qi Zuo, Xiaodong Gu, Yuan Dong, Junfei Zhang, Chao Xu, Zhe Li, Weihao Yuan, Liefeng Bo, Guanying Chen, Zilong Dong

(Alibaba Group, Sun Yat-sen University, Nanjing University, Huazhong University of Science and Technology)

Abstract Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction of inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based video generation model to generate multi-view canonical pose images and normal maps, pretraining on a large-scale video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.

6. MixedGaussianAvatar: Realistically and Geometrically Accurate Head Avatar via Mixed 2D-3D Gaussian Splatting

Peng Chen, Xiaobao Wei, Qingpo Wuwu, Xinyi Wang, Xingyu Xiao, Ming Lu

(Institute of Software Chinese Academy of Sciences, University of Chinese Academy of Sciences, Intel Labs China, Tsinghua University, Nankai University, Peking University)

Abstract Reconstructing high-fidelity 3D head avatars is crucial in various applications such as virtual reality. The pioneering methods reconstruct realistic head avatars with Neural Radiance Fields (NeRF), which have been limited by training and rendering speed. Recent methods based on 3D Gaussian Splatting (3DGS) significantly improve the efficiency of training and rendering. However, the surface inconsistency of 3DGS results in subpar geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy at the expense of rendering fidelity. To leverage the benefits of both 2DGS and 3DGS, we propose a novel method named MixedGaussianAvatar for realistically and geometrically accurate head avatar reconstruction. Our main idea is to utilize 2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model and connect additional 3D Gaussians to those 2D Gaussians where the rendering quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. These 2D-3D Gaussians can then be animated using FLAME parameters. We further introduce a progressive training strategy that first trains the 2D Gaussians and then fine-tunes the mixed 2D-3D Gaussians. We demonstrate the superiority of MixedGaussianAvatar through comprehensive experiments.

Year | Title | ArXiv Time | Paper | Code | Project Page
2023 | Make-A-Character: High Quality Text-to-3D Character Generation within Minutes | 24 Dec 2023 | Link | Link | Link
2024 | InstructHumans: Editing Animated 3D Human Textures with Instructions | 5 Apr 2024 | Link | Link | Link
2024 | HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model | 21 Aug 2024 | Link | -- | --
2024 | DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors | 12 Sep 2024 | Link | Link | Link
2024 | AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction | 3 Dec 2024 | Link | Link | Link
2024 | MixedGaussianAvatar: Realistically and Geometrically Accurate Head Avatar via Mixed 2D-3D Gaussian Splatting | 6 Dec 2024 | Link | Link | Link
ArXiv Papers References
%arxiv papers

@article{ren2023makeacharacter,
      title={Make-A-Character: High Quality Text-to-3D Character Generation within Minutes},
      author={Jianqiang Ren and Chao He and Lin Liu and Jiahao Chen and Yutong Wang and Yafei Song and Jianfang Li and Tangli Xue and Siqi Hu and Tao Chen and Kunkun Zheng and Jianjing Xiang and Liefeng Bo},
      year={2023},
      journal = {arXiv preprint arXiv:2312.15430}
}

@article{zhu2024InstructHumans,
         author={Zhu, Jiayin and Yang, Linlin and Yao, Angela},
         title={InstructHumans: Editing Animated 3D Human Textures with Instructions},
         journal={arXiv preprint arXiv:2404.04037},
         year={2024}
}

@misc{wang2024humancoserlayered3dhuman,
      title={HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model}, 
      author={Yi Wang and Jian Ma and Ruizhi Shao and Qiao Feng and Yu-kun Lai and Kun Li},
      year={2024},
      eprint={2408.11357},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.11357}, 
}

@article{zhu2024dreamhoi,
  title   = {{DreamHOI}: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors},
  author  = {Thomas Hanwen Zhu and Ruining Li and Tomas Jakab},
  journal = {arXiv preprint arXiv:2409.08278},
  year    = {2024}
}

@article{qiu2024anigs,
  title={AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction},
  author={Qiu, Lingteng and Zhu, Shenhao and Zuo, Qi and Gu, Xiaodong and Dong, Yuan and Zhang, Junfei and Xu, Chao and Li, Zhe and Yuan, Weihao and Bo, Liefeng and others},
  journal={arXiv preprint arXiv:2412.02684},
  year={2024}
}

@misc{chen2024mixedgaussianavatarrealisticallygeometricallyaccurate,
      title={MixedGaussianAvatar: Realistically and Geometrically Accurate Head Avatar via Mixed 2D-3D Gaussian Splatting}, 
      author={Peng Chen and Xiaobao Wei and Qingpo Wuwu and Xinyi Wang and Xingyu Xiao and Ming Lu},
      year={2024},
      eprint={2412.04955},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.04955}, 
}

Additional Info

Survey and Awesome Repos

Survey

Awesome Repos

Pretrained Models
Pretrained Models (human body) | Info | URL
SMPL | SMPL model (SMPL weights) | Link
SMPL-X | SMPL-X model (SMPL-X weights) | Link
human_body_prior | VPoser model (SMPL weights) | Link
SMPL

SMPL is an easy-to-use, realistic model of the human body that is useful for animation and computer vision.

  • version 1.0.0 for Python 2.7 (female/male, 10 shape PCs)
  • version 1.1.0 for Python 2.7 (female/male/neutral, 300 shape PCs)
  • UV map in OBJ format
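A minimal sketch of loading and posing the model with the open-source smplx Python package (pip install smplx); the model_path below is a placeholder for wherever you put the downloaded SMPL weights, and the snippet assumes the package's standard smplx.create API.

import torch
import smplx   # pip install smplx; the SMPL weights must be downloaded separately

# model_path is a placeholder directory containing the downloaded SMPL .pkl files.
model = smplx.create(model_path="models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)            # 10 shape PCs, as in the v1.0.0 release
body_pose = torch.zeros(1, 69)        # 23 body joints x 3 axis-angle parameters
output = model(betas=betas, body_pose=body_pose, return_verts=True)

print(output.vertices.shape)          # (1, 6890, 3) SMPL mesh vertices in the rest pose
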
SMPL-X

SMPL-X extends SMPL with fully articulated hands and facial expressions (55 joints, 10,475 vertices).


Related Resources

Text to 'other tasks'

(Here, other tasks include CAD, model, music, etc.)

Text to CAD

  • 2024 | CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM | arXiv 7 Nov 2024 | Paper | Code | Project Page
  • 2024 | Text2CAD: Generating Sequential CAD Designs from Beginner-to-Expert Level Text Prompts | NeurIPS 2024 Spotlight | Paper | Project Page

Text to Music

Text to Model

  • 2024 | Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization | arXiv 23 May 2024 | Paper

Survey and Awesome Repos

🔥 Topic 1: 3D Gaussian Splatting

Survey

Awesome Repos

🔥 Topic 2: AIGC 3D

Survey

Awesome Repos

Benchmarks

🔥 Topic 3: LLM 3D

Awesome Repos

3D Human

🔥 Topic 4: AIGC 4D

Awesome Repos

Dynamic Gaussian Splatting
Neural Deformable 3D Gaussians

(CVPR 2024) Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction Paper Code Page

(CVPR 2024) 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering Paper Code Page

(CVPR 2024) SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes Paper Code Page

(CVPR 2024, Highlight) 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos Paper Code Page

4D Gaussians

(ArXiv 2024.02.07) 4D Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes Paper

(ICLR 2024) Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting Paper Code Page

Dynamic 3D Gaussians

(CVPR 2024) Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle Paper Page

(3DV 2024) Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis Paper Code Page


License

Awesome Text2X Resources is released under the MIT license.
