Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

ARXIV 2025

¹The Chinese University of Hong Kong  ²Kuaishou Technology  ³Zhejiang University
Intern at KwaiVGI, Kuaishou Technology · Corresponding Author




RoboMaster synthesizes realistic robotic manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory that describes the motion of both the robotic arm and the manipulated object across decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.
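For concreteness, here is a minimal sketch of how these four inputs might be assembled in Python. The array shapes, the `RoboMasterPipeline` name, and the commented-out call are illustrative assumptions, not the released interface.

```python
import numpy as np

# The four conditioning inputs described above (shapes are illustrative).
initial_frame = np.zeros((480, 640, 3), dtype=np.uint8)   # first RGB frame of the scene
prompt = "pick up the apple and place it on the plate"
object_mask = np.zeros((480, 640), dtype=bool)             # user-defined mask of the manipulated object
trajectory = [(320, 400), (325, 380), (330, 360), (380, 330), (420, 300)]  # shared arm/object waypoints
interaction_span = (1, 4)   # waypoint indices during which the arm actually holds the object

# Hypothetical pipeline call (RoboMasterPipeline is an assumed name, not the released API):
# video = RoboMasterPipeline.from_pretrained(...)(initial_frame, prompt, object_mask,
#                                                 trajectory, interaction_span)
```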


Core: Decompose Interaction (Ours) vs Decompose Objects (Previous, e.g. Tora)


Unlike Tora, which decomposes objects and uses separate trajectories to model the motion of the robot arm and the manipulated object, we decompose the interaction phase and unify their joint motion into a single collaborative trajectory with fine-grained object awareness. This integration alleviates the feature-fusion issue in overlapping regions (note the missing apple in Tora's result) and improves visual quality.
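As a rough illustration of the difference, the sketch below splits one shared trajectory into the three sub-interaction phases and records which entity drives each segment, instead of keeping a separate trajectory per object. The function and field names are our own, not taken from the released code.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]

def decompose_interaction(traj: List[Waypoint], start: int, end: int) -> dict:
    """Split a single collaborative trajectory into sub-interaction phases.

    The arm alone follows the pre-/post-interaction segments, and the grasped
    object follows the interaction segment, so one trajectory covers both
    entities without separate per-object tracks.
    """
    return {
        "pre-interaction":  {"driver": "robot_arm",          "waypoints": traj[:start]},
        "interaction":      {"driver": "manipulated_object", "waypoints": traj[start:end]},
        "post-interaction": {"driver": "robot_arm",          "waypoints": traj[end:]},
    }

phases = decompose_interaction(
    traj=[(320, 400), (325, 380), (330, 360), (380, 330), (420, 300)],
    start=1, end=4,
)
```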


Robotic Manipulation on Diverse Out-of-Domain Objects


Robotic Manipulation with Diverse Skills


Long Video Generation in Auto-Regressive Manner


Comparison with Baselines (Tora, DragAnything, and IRASim)




Ablation Study



Method

Given an input image and a prompt, RoboMaster generates the desired robotic manipulation video under the collaborative trajectory design. It first encodes the object masks of the robotic arm and the manipulated object (obtained either from Grounded-SAM or from a user-defined brush mask) with awareness of appearance and shape, producing object latents that maintain identity consistency throughout the video. To precisely model the manipulation process, the controlled trajectory is decomposed into three sub-interaction phases: pre-interaction, interaction, and post-interaction. Each phase is associated with object-specific latents: the robotic arm latents in the pre- and post-interaction phases, and the manipulated object latents in the interaction phase. The resulting collaborative trajectory latent is then injected through plug-and-play motion injectors, enabling the model to reason about video dynamics during generation.
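A minimal PyTorch sketch of this phase-to-latent association is shown below. The tensor sizes, the way latents are obtained, and the additive linear injector are illustrative assumptions; only the association of arm/object latents with the trajectory phases follows the description above.

```python
import torch

T, D = 16, 64                               # trajectory length, object-latent width (illustrative)
traj = torch.randn(T, 2)                    # (x, y) waypoint per timestep
arm_latent = torch.randn(D)                 # appearance/shape latent of the robotic arm
obj_latent = torch.randn(D)                 # appearance/shape latent of the manipulated object
start, end = 5, 11                          # interaction-phase boundaries

# Associate each timestep with the latent of the entity driving the motion:
# the arm in pre-/post-interaction, the manipulated object during interaction.
object_latents = arm_latent.repeat(T, 1)
object_latents[start:end] = obj_latent

# Collaborative trajectory latent: waypoint plus object identity per timestep.
collab_traj_latent = torch.cat([traj, object_latents], dim=-1)   # (T, 2 + D)

# A plug-and-play motion injector then fuses this latent with the video features
# (modeled here, purely for illustration, as an additive linear projection).
video_feat = torch.randn(T, 128)
injector = torch.nn.Linear(2 + D, 128)
video_feat = video_feat + injector(collab_traj_latent)
```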