Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

arXiv 2024

1The Chinese University of Hong Kong 2Kuaishou Technology 3Zhejiang University
Intern at KwaiVGI, Kuaishou Technology; Corresponding Author



3DTrajMaster controls the motion of one or multiple entities in 3D space via entity-specific 3D trajectories for text-to-video (T2V) generation. It has the following features:

  • 6 Degrees of Freedom (6 DoF): control of 3D entity location and orientation
  • Diverse Entities: humans, animals, robots, cars, and even abstract entities such as fire and breeze
  • Diverse Backgrounds: city, forest, desert, gym, sunset beach, glacier, hall, night city, etc.
  • Complex 3D Trajectories: 3D occlusion, rotating in place, 180°/continuous 90° turns, etc.
  • Fine-grained Entity Prompts: edit human hair, clothing, gender, figure size, accessories, etc.
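The page does not specify the exact on-disk format of an entity-wise 6-DoF pose sequence, but conceptually each entity pairs a text prompt with a per-frame location and orientation. A minimal illustrative sketch (the field names, the Euler-angle parameterization, and the `turn_in_place` helper are assumptions, not the paper's actual format):

```python
def turn_in_place(n_frames, total_deg=180.0):
    """Hypothetical helper: a trajectory that rotates an entity in place.

    Each pose is 6-DoF: a 3D location [x, y, z] plus an orientation,
    written here as [yaw, pitch, roll] in degrees (illustrative only;
    the paper's exact rotation parameterization may differ).
    """
    return [
        {"location": [0.0, 0.0, 0.0],
         "orientation": [total_deg * i / (n_frames - 1), 0.0, 0.0]}
        for i in range(n_frames)
    ]

# Two entities, each with its own fine-grained prompt and 3D trajectory.
entities = [
    {"prompt": "a red-haired woman in a denim jacket",
     "trajectory": turn_in_place(16, total_deg=180.0)},
    {"prompt": "a silver robot",
     "trajectory": turn_in_place(16, total_deg=90.0)},
]
```

The point of the structure is the pairing: each trajectory belongs to exactly one entity, which is what lets the model control multiple entities independently in the same scene.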

Control Entity Motion with Diverse Entities


Control Entity Motion with Diverse Backgrounds


Control Multi-Entity Motion with Complex 3D Trajectories

(We provide 20 trajectory cases with one, two, and three entities.)

Control Entity Motion with Fine-grained Entity Prompt Editing


Comparison with Tora and Direct-a-Video


Comparison with TC4D



Method

Given a text prompt describing N entities, 3DTrajMaster (a) generates a video whose entity motions conform to the input entity-wise pose sequences. Training involves two phases: first, a domain adaptor mitigates the negative impact of the training videos on visual quality; then, an object injector module is inserted after the 2D spatial self-attention layer to integrate paired entity prompts and 3D trajectories. (b) In the object injection process, the entity prompts are projected into latent embeddings by the text encoder, while the paired pose sequences are projected by a learnable pose encoder and fused with the entity embeddings to form entity-trajectory correspondences. The resulting condition embedding is concatenated with the video latent and fed into a gated self-attention layer for motion fusion. Finally, the modified latent is passed back to the remaining layers of the DiT block.
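The object-injection flow above can be sketched shape-by-shape. This is a minimal numpy illustration, not the actual implementation: the dimensions, the two-layer MLP standing in for the pose encoder, the random stand-ins for the text encoder and video latent, and the single-head attention are all assumptions; only the overall dataflow (pose encoding, entity-trajectory fusion, concatenation with the video latent, gated self-attention) follows the description.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64        # channel dim (illustrative)
N_ENT = 2     # number of entities
T_POSE = 8    # pose timesteps per entity
L_VID = 16    # video latent tokens

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP standing in for the learnable pose encoder."""
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# (1) Entity prompts -> latent embeddings (stand-in for the text encoder).
entity_emb = rng.standard_normal((N_ENT, D))

# (2) Per-entity 6-DoF pose sequences, e.g. [x, y, z, yaw, pitch, roll],
#     projected by the pose encoder.
poses = rng.standard_normal((N_ENT, T_POSE, 6))
w1, b1 = rng.standard_normal((6, D)), np.zeros(D)
w2, b2 = rng.standard_normal((D, D)), np.zeros(D)
pose_emb = mlp(poses, w1, b1, w2, b2)          # (N_ENT, T_POSE, D)

# (3) Fuse each entity embedding with its own pose tokens, forming
#     entity-trajectory correspondences, then flatten to condition tokens.
cond = pose_emb + entity_emb[:, None, :]        # (N_ENT, T_POSE, D)
cond = cond.reshape(-1, D)                      # (N_ENT * T_POSE, D)

# (4) Concatenate with the video latent and run one self-attention pass.
video_latent = rng.standard_normal((L_VID, D))
tokens = np.concatenate([video_latent, cond], axis=0)
attn = softmax(tokens @ tokens.T / np.sqrt(D)) @ tokens

# (5) Gated residual: a tanh gate initialized at zero lets training start
#     from the pretrained behavior and gradually admit the condition.
alpha = 0.0
out = video_latent + np.tanh(alpha) * attn[:L_VID]
```

With the gate at zero the module is an identity on the video latent, which is the usual reason a gated self-attention layer can be inserted into a pretrained DiT block without disrupting it.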

BibTeX

@article{fu20243dtrajmaster,
  author  = {Fu, Xiao and Liu, Xian and Wang, Xintao and Peng, Sida and Xia, Menghan and Shi, Xiaoyu and Yuan, Ziyang and Wan, Pengfei and Zhang, Di and Lin, Dahua},
  title   = {3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation},
  journal = {arXiv preprint arXiv:2412.07759},
  year    = {2024}
}