Movie dubbing is the task of synthesizing speech from a script conditioned on video scenes. It requires accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. Existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and they exhibit suboptimal lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and a dubbing model, built on a multimodal large language model (MLLM), designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations and demonstrate its high quality. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms state-of-the-art (SOTA) methods in audio quality, lip sync, timbre transfer, and instruction following.
{"messages": [
{"role": "text", "content": "哎呀,将军,将军,不可连累老夫啊!大丈夫生居天地之间,岂能郁郁久居人下!"},
{"role": "token", "content": "xxx/zh/三国演义/07/tokens/07_00_23_51_30_spk12.npy"},
{"role": "vocal", "content": "xxx/zh/三国演义/07/vocals/07_00_23_51_30_spk12.wav"},
{"role": "instrumental", "content": "xxx/zh/三国演义/07/instrumental/07_00_23_51_30_spk12.wav"},
{"role": "video", "content": "xxx/zh/三国演义/07/clipped/07_00_23_51_30_spk12.mp4"},
{"role": "face", "content": "xxx/zh/三国演义/07/embs_video/07_00_23_51_30_spk12.pkl"},
{"role": "embswav", "content": "xxx/zh/三国演义/07/embs_wav/07_00_23_51_30_spk12.pkl"},
{"role": "dialogue", "content": [
{"start": 0.0, "duration": 4.0, "spk": "1", "gender": "男", "age": "中年", "timbre": "低沉、苍老、颤抖"},
{"start": 5.74, "duration": 2.63, "spk": "2", "gender": "男", "age": "青年", "timbre": "洪亮、有力、激昂"},
{"start": 8.89, "duration": 2.15, "spk": "2", "gender": "男", "age": "青年", "timbre": "高亢、有力、果断"}]},
{"role": "clue", "content": "两名角色对话,第一位中年男性情绪紧张,略带颤抖和哀求,表达对被牵连的恐惧。第二位青年男性语调变得激昂坚定,铿锵有力,充满对尊严和自由的强烈渴望。整体展现出从畏惧到反抗的情感转变。"},
{"role": "emotion", "content": "紧张 0.9"}
],
"utt": "sanguoyanyi_07_00_23_51_30_spk12",
"type": "对话",
"source": "zh",
"task": "VTTS",
"text_length": 36,
"clue_length": 89,
"speech_length": 277
}
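The record above can be read with a few lines of Python. The following is a minimal sketch, not part of the official toolkit: the helper names (`index_by_role`, `speaker_turns`, `parse_emotion`) are hypothetical, the record is a trimmed copy of the sample, and two things are assumed rather than documented here: that `start` and `duration` are in seconds, and that the `emotion` string is a label followed by a numeric score.

```python
import json

# Trimmed copy of the sample record above (only a few roles kept).
RECORD_JSON = """
{"messages": [
  {"role": "video", "content": "xxx/zh/三国演义/07/clipped/07_00_23_51_30_spk12.mp4"},
  {"role": "dialogue", "content": [
    {"start": 0.0, "duration": 4.0, "spk": "1"},
    {"start": 5.74, "duration": 2.63, "spk": "2"},
    {"start": 8.89, "duration": 2.15, "spk": "2"}]},
  {"role": "emotion", "content": "紧张 0.9"}
 ],
 "utt": "sanguoyanyi_07_00_23_51_30_spk12",
 "type": "对话",
 "task": "VTTS"}
"""


def index_by_role(record):
    """Map each message role to its content (roles are unique in the sample)."""
    return {message["role"]: message["content"] for message in record["messages"]}


def speaker_turns(fields):
    """Return (speaker, start, end) tuples, where end = start + duration.

    Assumption: both values are expressed in seconds.
    """
    return [
        (seg["spk"], seg["start"], round(seg["start"] + seg["duration"], 2))
        for seg in fields["dialogue"]
    ]


def parse_emotion(value):
    """Split an emotion string into (label, score).

    Assumption: the format is '<label> <number>', as in the sample.
    """
    label, score = value.rsplit(" ", 1)
    return label, float(score)


record = json.loads(RECORD_JSON)
fields = index_by_role(record)
turns = speaker_turns(fields)
emotion = parse_emotion(fields["emotion"])

for spk, start, end in turns:
    print(f"speaker {spk}: {start:.2f} to {end:.2f}")
print(record["utt"], emotion)
```

Indexing by role keeps the lookup independent of message order, which helps if some records omit optional roles such as `instrumental` or `clue`.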
| File | Size | License | Download |
| --- | --- | --- | --- |
| hongloumeng.7z.001 | 5.00 GB | CC-BY-NC 4.0 | hongloumeng.7z.001 |
| hongloumeng.7z.002 | 5.00 GB | CC-BY-NC 4.0 | hongloumeng.7z.002 |
| hongloumeng.7z.003 | 3.65 GB | CC-BY-NC 4.0 | hongloumeng.7z.003 |
| hongloumeng.md5 | 159 B | CC-BY-NC 4.0 | MD5 File |
| downton_abbey.7z.001 | 5.00 GB | CC-BY-NC 4.0 | downton_abbey.7z.001 |
| downton_abbey.7z.002 | 5.00 GB | CC-BY-NC 4.0 | downton_abbey.7z.002 |
| downton_abbey.7z.003 | 5.00 GB | CC-BY-NC 4.0 | downton_abbey.7z.003 |
| downton_abbey.7z.004 | 5.00 GB | CC-BY-NC 4.0 | downton_abbey.7z.004 |
| downton_abbey.7z.005 | 5.00 GB | CC-BY-NC 4.0 | downton_abbey.7z.005 |
| downton_abbey.7z.006 | 5.00 GB | CC-BY-NC 4.0 | downton_abbey.7z.006 |
| downton_abbey.7z.007 | 5.00 GB | CC-BY-NC 4.0 | downton_abbey.7z.007 |
| downton_abbey.7z.008 | 665 MB | CC-BY-NC 4.0 | downton_abbey.7z.008 |
| downton_abbey.md5 | 439 B | CC-BY-NC 4.0 | MD5 File |
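Because each archive is split into multi-gigabyte volumes, it is worth checking the downloads against the accompanying `.md5` file before extracting. The sketch below is one way to do that, under an assumption that is not confirmed here: that each `.md5` file uses the common `md5sum` layout of one `<hash>  <filename>` pair per line. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large volumes are never fully loaded."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Check every '<md5>  <filename>' line of a manifest.

    Returns {filename: True/False}; a missing file counts as False.
    Files are looked up next to the manifest itself.
    """
    manifest_path = Path(manifest_path)
    results = {}
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum prefixes '*' in binary mode
        target = manifest_path.parent / name
        results[name] = target.is_file() and md5_of_file(target) == expected.lower()
    return results
```

For example, `verify_manifest("hongloumeng.md5")` should map all three `hongloumeng.7z.00N` volumes to `True` when the downloads are intact. With 7-Zip installed, pointing the extractor at the first volume (`7z x hongloumeng.7z.001`) then unpacks the whole set, provided all volumes sit in the same directory.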
@misc{liu2026funcineforgeunifieddatasettoolkit,
title={FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes},
author={Jiaxuan Liu and Yang Xiang and Han Zhao and Xiangang Li and Zhenhua Ling},
year={2026},
eprint={2601.14777},
archivePrefix={arXiv},
primaryClass={cs.CV},
}