Improve model card for ThinkSound with metadata, detailed content, and links

#1
by nielsr HF Staff - opened
Files changed (1)
README.md +153 -11
README.md CHANGED
@@ -1,21 +1,163 @@
  ---
  license: apache-2.0
  ---

- This repository contains the weights of [ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing](https://arxiv.org/abs/2506.21448).
- Project Page: https://thinksound-project.github.io/.
- If you find our work useful, please cite our paper:

  ```bibtex
  @misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
- title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
- author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
- year={2025},
- eprint={2506.21448},
- archivePrefix={arXiv},
- primaryClass={eess.AS},
- url={https://arxiv.org/abs/2506.21448},
  }
- ```

  ---
  license: apache-2.0
+ pipeline_tag: other
+ tags:
+ - audio-generation
+ - video-to-audio
+ - multimodal
+ - chain-of-thought
+ - audio-editing
+ - foley-generation
  ---

+ # 🎶 ThinkSound

+ This repository contains the weights of [ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing](https://huggingface.co/papers/2506.21448).

+ Project Page: https://thinksound-project.github.io/
+
+ Code: https://github.com/liuhuadai/ThinkSound
+
+ <p align="center">
+ If you find this project useful, a star ⭐ on GitHub would be greatly appreciated!
+ </p>
+
+ <p align="center">
+ <a href="https://huggingface.co/papers/2506.21448">
+ <img src="https://img.shields.io/badge/Paper-2506.21448-b31b1b.svg" alt="Paper"/>
+ </a>
+ &nbsp;
+ <a href="https://thinksound-project.github.io/">
+ <img src="https://img.shields.io/badge/Online%20Demo-🌐-blue" alt="Online Demo"/>
+ </a>
+ &nbsp;
+ <a href="https://huggingface.co/spaces/FunAudioLLM/ThinkSound">
+ <img src="https://img.shields.io/badge/HuggingFace-Spaces-orange?logo=huggingface" alt="Hugging Face Space"/>
+ </a>
+ &nbsp;
+ <a href="https://modelscope.cn/studios/iic/ThinkSound">
+ <img src="https://img.shields.io/badge/ModelScope-Online%20Demo-green" alt="ModelScope"/>
+ </a>
+ </p>
+
+ ---
+
+ **ThinkSound** is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.
+
+ This is the PyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio inputs, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).
+
+ ![Teaser](https://github.com/liuhuadai/ThinkSound/blob/main/assets/figs/fig1_teaser.png?raw=true)
+
+ ---
+
+ ## Abstract
+ While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. Like professionals in the creative industries, such generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels in the out-of-distribution Movie Gen Audio benchmark.
+
+ ---
+
+ ## 📰 News
+ - **2025.07** &nbsp; 🔥 Online demo on [Hugging Face Spaces](https://huggingface.co/spaces/FunAudioLLM/ThinkSound) and [ModelScope](https://modelscope.cn/studios/iic/ThinkSound) for an interactive experience!
+ - **2025.07** &nbsp; 🔥 Released inference scripts and the web interface.
+ - **2025.06** &nbsp; 🔥 [ThinkSound paper](https://arxiv.org/pdf/2506.21448) released on arXiv!
+ - **2025.06** &nbsp; 🔥 [Online Demo](http://thinksound-project.github.io/) is live. Try it now!
+
+ ---
+
+ ## 🚀 Features
+
+ - **Any2Audio**: Generate audio from arbitrary modalities — video, text, audio, or their combinations.
+ - **Video-to-Audio SOTA**: Achieves state-of-the-art results on multiple V2A benchmarks.
+ - **CoT-Driven Reasoning**: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
+ - **Interactive Object-centric Editing**: Refine or edit specific sound events by clicking on visual objects or using text instructions.
+ - **Unified Framework**: One foundation model supports generation, editing, and interactive workflows.
+
+ ---
+
+ ## ✨ Method Overview
+
+ ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:
+
+ 1. **Foley Generation:** Generate foundational, semantically and temporally aligned soundscapes from video.
+ 2. **Object-Centric Refinement:** Refine or add sounds for user-specified objects via clicks or regions in the video.
+ 3. **Targeted Audio Editing:** Modify the generated audio using high-level natural language instructions.
+
+ ![ThinkSound Overview](https://github.com/liuhuadai/ThinkSound/blob/main/assets/figs/fig3_model.png?raw=true)
+
+ ---
+
+ ## ⚡ Quick Start
+
+ **Environment Preparation:**
+ ```bash
+ git clone https://github.com/liuhuadai/ThinkSound.git
+ cd ThinkSound
+ pip install -r requirements.txt
+ conda install -y -c conda-forge 'ffmpeg<7'
+ # Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into ckpts/
+ # (the weights are also available from https://www.modelscope.cn/models/iic/ThinkSound)
+ git lfs install
+ git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
+ ```
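+
+ As an alternative to cloning the weights repo, a recent `huggingface_hub` release can fetch the checkpoints directly into `ckpts/` (a minimal sketch; adjust the target directory to wherever your setup expects the weights):
+
+ ```bash
+ # Requires the huggingface_hub CLI: pip install -U huggingface_hub
+ # Downloads all files from the model repo into ckpts/
+ huggingface-cli download liuhuadai/ThinkSound --local-dir ckpts
+ ```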
+
+ **Make the demo script executable:**
+ ```bash
+ chmod +x scripts/demo.sh
+ ```
+
+ **Run the script:**
+ ```bash
+ ./scripts/demo.sh <video_path> <title> <CoT description>
+ ```
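+
+ An illustrative invocation (the video path and both text arguments are placeholders, not assets shipped with the repo):
+
+ ```bash
+ # <title> is a short caption for the clip; <CoT description> is the
+ # step-by-step reasoning text that conditions the generated audio.
+ ./scripts/demo.sh examples/beach.mp4 "Waves on a beach" "Gentle waves roll onto the shore, with distant seagull calls overhead."
+ ```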
+
+ ### Web Interface Usage
+
+ For an interactive experience, launch the Gradio web interface:
+
+ ```bash
+ python app.py
+ ```
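+
+ Gradio serves locally by default (typically at http://127.0.0.1:7860). Assuming `app.py` relies on standard Gradio launch behavior rather than hard-coding a host and port, both can be overridden through Gradio's environment variables:
+
+ ```bash
+ # Standard Gradio environment variables, not ThinkSound-specific;
+ # they take effect only if app.py does not pass explicit values to launch().
+ GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7861 python app.py
+ ```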
+
+ ---
+
+ ## 📝 TODO
+
+ - ☐ Release training scripts for ThinkSound models
+ - ☐ Open-source AudioCoT dataset and automated pipeline
+ - ☐ Provide detailed documentation and API reference
+ - ☐ Add support for additional modalities and downstream tasks
+
+ ---
+
+ ## 📄 License
+
+ This project is released under the [Apache 2.0 License](https://github.com/liuhuadai/ThinkSound/blob/main/LICENSE).
+
+ > **Note:**
+ > The code, models, and dataset are **for research and educational purposes only**.
+ > **Commercial use is NOT permitted.**
+ >
+ > For commercial licensing, please contact the authors.
+
+ ---
+
+ ## 📖 Citation
+
+ If you find ThinkSound useful in your research or work, please cite our paper:

  ```bibtex
  @misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
+ title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
+ author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
+ year={2025},
+ eprint={2506.21448},
+ archivePrefix={arXiv},
+ primaryClass={eess.AS},
+ url={https://arxiv.org/abs/2506.21448},
  }
+ ```
+
+ ---
+
+ ## 📬 Contact
+
+ ✨ Feel free to [open an issue](https://github.com/liuhuadai/ThinkSound/issues) or contact us via email ([[email protected]](mailto:[email protected])) if you have any questions or suggestions!