Given a text prompt describing N entities, 3DTrajMaster (a) generates the desired video with entity motions that conform to the input entity-wise 3D pose sequences.
Specifically, training proceeds in two phases.
First, a domain adaptor is used to mitigate the degradation in video quality caused by the training videos.
Second, an object injector module is inserted after the 2D spatial self-attention layer to
integrate the paired entity prompts and 3D trajectories.
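As a rough illustration of where such an injector could sit, here is a minimal PyTorch sketch of a DiT block; the module names, the layer layout, and the injector interface are assumptions for illustration, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class DiTBlockWithInjector(nn.Module):
    """Hypothetical DiT block: the object injector sits directly after
    the 2D spatial self-attention layer."""

    def __init__(self, dim: int, num_heads: int, injector: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.injector = injector  # fuses entity prompts + 3D trajectories (see part b)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) video latent tokens; cond: entity-trajectory embedding
        h = self.norm1(x)
        x = x + self.spatial_attn(h, h, h)[0]  # 2D spatial self-attention
        x = self.injector(x, cond)             # object injection point
        x = x + self.mlp(self.norm2(x))        # remaining layers of the block
        return x
```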
(b) Details of the object injection process.
The entities are projected into latent embeddings through the text encoder. The paired pose
sequences are projected by a learnable pose encoder and then fused with the entity embeddings
to form entity-trajectory correspondences.
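A minimal sketch of this encoding-and-fusion step, assuming a flat per-frame pose vector (e.g., translation plus a rotation parameterization); the pose_dim value, the MLP encoder, and the additive fusion are all assumptions:

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Hypothetical learnable pose encoder: maps each entity's per-frame
    3D pose (pose_dim values, e.g., translation + rotation) into the
    same embedding space as the entity text embeddings."""

    def __init__(self, pose_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(pose_dim, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (B, N_entities, T_frames, pose_dim) -> (..., embed_dim)
        return self.proj(poses)

# Pair each entity embedding with its own trajectory so the correspondence
# is explicit (additive fusion is an assumption for illustration).
B, N, T, pose_dim, embed_dim = 2, 3, 16, 12, 1024
entity_emb = torch.randn(B, N, embed_dim)  # from the text encoder
pose_emb = PoseEncoder(pose_dim, embed_dim)(torch.randn(B, N, T, pose_dim))
cond = entity_emb.unsqueeze(2) + pose_emb  # (B, N, T, embed_dim)
cond = cond.flatten(1, 2)                  # (B, N*T, embed_dim) condition tokens
```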
This condition embedding is concatenated with the video latent and fed into a gated
self-attention layer for motion fusion.
Finally, the modified latent is passed on to the remaining layers of the DiT block.
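To make the fusion step concrete, here is a hedged sketch of such a gated self-attention layer: the condition tokens are concatenated with the video latent tokens and jointly attended over, then only the video tokens are kept, scaled by a learnable gate (the zero-initialized tanh gate is an assumption in the style of GLIGEN-like designs):

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Hypothetical gated self-attention for motion fusion: video latent
    tokens and condition tokens are concatenated and jointly attended;
    only the video tokens are kept, scaled by a learnable gate."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: starts as identity

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) video latent tokens; cond: (B, M, dim) condition tokens
        h = self.norm(torch.cat([x, cond], dim=1))   # concat along token axis
        out = self.attn(h, h, h)[0][:, : x.size(1)]  # keep only the video tokens
        return x + torch.tanh(self.gate) * out       # gated residual fusion
```

The zero-initialized gate means the block initially acts as an identity mapping, so injecting the new condition pathway does not disturb the pretrained DiT at the start of training.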