Sun, Yasheng https://orcid.org/0000-0002-0589-4424
Xu, Zhiliang
Zhou, Hang
Guan, Jiazhi
Yang, Quanwei
Wang, Kaisiyuan
Liang, Borong
Li, Yingying
Feng, Haocheng
Wang, Jingdong
Liu, Ziwei
Koike, Hideki
Article History
Received: 11 March 2025
Accepted: 9 January 2026
First Online: 24 January 2026
Declarations
: The performance of our model degrades when the background is unstable. Moreover, our model cannot handle cases with crossed fingers, which may pose difficulties for the 3D representations. Extending co-speech gesture video generation to diverse, real-world scenes remains a challenging open problem. Larger-scale pre-trained DiT models might be able to address these difficulties.
: Our method could be used to generate fabricated talks, which raises concerns about misuse. To mitigate this risk, we are committed to strictly controlling the distribution of our models and the generated content, limiting access to research purposes only.