Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

ARXIV 2025

¹The Chinese University of Hong Kong  ²Kuaishou Technology  ³Zhejiang University
Intern at KwaiVGI, Kuaishou Technology · Corresponding Author




RoboMaster synthesizes realistic robotic manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory that describes the motion of both the robotic arm and the manipulated object across decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.
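For concreteness, here is a minimal sketch of how these four inputs might be assembled in Python. The array shapes, the `RoboMasterPipeline` name, and the commented-out call are illustrative assumptions, not the released interface.

```python
import numpy as np

# The four conditioning inputs described above (shapes are illustrative).
initial_frame = np.zeros((480, 640, 3), dtype=np.uint8)   # first RGB frame of the scene
prompt = "pick up the apple and place it on the plate"
object_mask = np.zeros((480, 640), dtype=bool)             # user-defined mask of the manipulated object
trajectory = [(320, 400), (325, 380), (330, 360), (380, 330), (420, 300)]  # shared arm/object waypoints
interaction_span = (1, 4)   # waypoint indices during which the arm actually holds the object

# Hypothetical pipeline call (RoboMasterPipeline is an assumed name, not the released API):
# video = RoboMasterPipeline.from_pretrained(...)(initial_frame, prompt, object_mask,
#                                                 trajectory, interaction_span)
```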


Core: Decompose Interaction (Ours) vs Decompose Objects (Previous, e.g. Tora)


Unlike Tora, which decomposes objects and uses separate trajectories to model the motion of the robot arm and the manipulated object, we decompose the interaction phase and unify their joint motion into a single collaborative trajectory with fine-grained object awareness. This integration alleviates the feature-fusion issue in overlapping regions (note the missing apple in Tora's result) and improves visual quality.
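As a rough illustration of the difference, the sketch below splits one shared trajectory into the three sub-interaction phases and records which entity drives each segment, instead of keeping a separate trajectory per object. The function and field names are our own, not taken from the released code.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]

def decompose_interaction(traj: List[Waypoint], start: int, end: int) -> dict:
    """Split a single collaborative trajectory into sub-interaction phases.

    The arm alone follows the pre-/post-interaction segments, and the grasped
    object follows the interaction segment, so one trajectory covers both
    entities without separate per-object tracks.
    """
    return {
        "pre-interaction":  {"driver": "robot_arm",          "waypoints": traj[:start]},
        "interaction":      {"driver": "manipulated_object", "waypoints": traj[start:end]},
        "post-interaction": {"driver": "robot_arm",          "waypoints": traj[end:]},
    }

phases = decompose_interaction(
    traj=[(320, 400), (325, 380), (330, 360), (380, 330), (420, 300)],
    start=1, end=4,
)
```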


Robotic Manipulation on Diverse Out-of-Domain Objects


Robotic Manipulation with Diverse Skills


Long Video Generation in Auto-Regressive Manner


Comparison with Baselines (Tora, DragAnything, and IRASim)




Ablation Study



Method

Given an input image and a prompt, RoboMaster generates the desired robotic manipulation video under the collaborative trajectory design. It first encodes the object masks of the robotic arm and the manipulated object (obtained either from Grounded-SAM or from a user-defined brush mask) with awareness of appearance and shape, producing object latents that maintain identity consistency throughout the video. To precisely model the manipulation process, the controlled trajectory is decomposed into three sub-interaction phases: pre-interaction, interaction, and post-interaction. Each phase is associated with object-specific latents: the robotic arm latents in the pre- and post-interaction phases, and the manipulated object latents in the interaction phase. The resulting collaborative trajectory latent is then injected through plug-and-play motion injectors, enabling the model to reason about video dynamics during generation.
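A minimal PyTorch sketch of this phase-to-latent association is shown below. The tensor sizes, the way latents are obtained, and the additive linear injector are illustrative assumptions; only the association of arm/object latents with the trajectory phases follows the description above.

```python
import torch

T, D = 16, 64                               # trajectory length, object-latent width (illustrative)
traj = torch.randn(T, 2)                    # (x, y) waypoint per timestep
arm_latent = torch.randn(D)                 # appearance/shape latent of the robotic arm
obj_latent = torch.randn(D)                 # appearance/shape latent of the manipulated object
start, end = 5, 11                          # interaction-phase boundaries

# Associate each timestep with the latent of the entity driving the motion:
# the arm in pre-/post-interaction, the manipulated object during interaction.
object_latents = arm_latent.repeat(T, 1)
object_latents[start:end] = obj_latent

# Collaborative trajectory latent: waypoint plus object identity per timestep.
collab_traj_latent = torch.cat([traj, object_latents], dim=-1)   # (T, 2 + D)

# A plug-and-play motion injector then fuses this latent with the video features
# (modeled here, purely for illustration, as an additive linear projection).
video_feat = torch.randn(T, 128)
injector = torch.nn.Linear(2 + D, 128)
video_feat = video_feat + injector(collab_traj_latent)
```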