Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

Published Online: 2026-01-24

Published Print: 2026-03

Authors

Sun, Yasheng https://orcid.org/0000-0002-0589-4424

Xu, Zhiliang

Zhou, Hang

Guan, Jiazhi

Yang, Quanwei

Wang, Kaisiyuan

Liang, Borong

Li, Yingying

Feng, Haocheng

Wang, Jingdong

Liu, Ziwei

Hideki, Koike

License Information

Text and Data Mining valid from 2026-01-24

Version of Record valid from 2026-01-24

More Information

Article History

Received: 11 March 2025

Accepted: 9 January 2026

First Online: 24 January 2026

Declarations

The performance of our model is hindered by unstable backgrounds. Moreover, our model cannot handle crossed fingers, which might cause difficulties in the 3D representations. Extending co-speech gesture video generation to diverse, real-world scenes remains a challenging open problem. Larger-scale pre-trained DiT models might be able to tackle these difficulties.

Our method could be used to generate fabricated talks, raising concerns about misuse. To mitigate this risk, we are committed to strictly controlling the distribution of our models and the generated content, limiting access to research purposes only.

Document is current