CVPR 2026
Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba–Transformer–Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on the real-world datasets HD-EPIC, Ego-Exo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and generalizing strongly to unseen scenes.
EgoFlow formulates trajectory synthesis as a continuous transport process using flow matching, learning deterministic velocity fields that map noise to realistic 6DoF trajectories. Multimodal conditioning fuses scene point clouds (PointNet++), fixture layouts (self-attention over oriented bounding boxes), trajectory history, CLIP-encoded text/category embeddings, and goal pose into a unified representation.
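The flow-matching objective described above can be sketched as a simple training step: interpolate between a noise sample and a ground-truth trajectory, and regress the corresponding velocity. The network, dimensions, and conditioning size below are illustrative assumptions, not the actual EgoFlow model or encoders.

```python
# Minimal flow-matching training step for trajectory synthesis (sketch).
# The tiny MLP stands in for the hybrid Mamba-Transformer-Perceiver model;
# T, D, and COND are assumed sizes, not EgoFlow's real configuration.
import torch
import torch.nn as nn

T, D = 16, 9      # horizon and per-step pose dimension (assumed)
COND = 32         # fused multimodal conditioning embedding size (assumed)

# Hypothetical velocity-field network v(x_t, t, cond).
net = nn.Sequential(nn.Linear(T * D + 1 + COND, 128), nn.ReLU(),
                    nn.Linear(128, T * D))

def flow_matching_loss(x1, cond):
    """Rectified-flow style objective: sample x_t on the straight path
    between noise x0 and data x1, and regress the velocity x1 - x0."""
    x0 = torch.randn_like(x1)              # noise endpoint
    t = torch.rand(x1.shape[0], 1)         # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1             # linear interpolation
    target_v = x1 - x0                     # deterministic target velocity
    pred_v = net(torch.cat([xt, t, cond], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

batch = torch.randn(8, T * D)              # flattened 6DoF trajectories
cond = torch.randn(8, COND)                # fused conditioning vector
loss = flow_matching_loss(batch, cond)
loss.backward()
```

At inference, sampling reduces to integrating the learned velocity field from noise to data along the same time interval.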
A hybrid Mamba–Transformer–Perceiver architecture processes this in three stages: bidirectional Mamba layers for efficient temporal encoding, Perceiver-style cross-attention for multimodal reasoning, and a final Mamba refinement stage. At inference, gradient-guided sampling refines the predicted velocity via differentiable physical costs—SDF-based collision avoidance, rotational consistency, and translational smoothness—enforcing physical plausibility without requiring constraint labels during training.
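Gradient-guided sampling of the kind described above can be illustrated with a toy Euler integrator: at each step, the velocity is corrected by the gradient of a differentiable physical cost. Here a translational-smoothness penalty stands in for EgoFlow's full set of SDF, rotational, and translational terms, and the placeholder velocity field and guidance weight are assumptions.

```python
# Gradient-guided Euler sampling sketch. The smoothness cost is one stand-in
# for EgoFlow's differentiable physical costs; guide_w is an assumed weight.
import torch

def smoothness_cost(x):
    """Penalize large second differences along a (T, 3) trajectory."""
    acc = x[2:] - 2 * x[1:-1] + x[:-2]
    return (acc ** 2).sum()

def velocity(x, t):
    # Placeholder for the trained velocity field: here it simply pulls the
    # state toward the origin so the sampler has something to integrate.
    return -x

def guided_sample(x0, steps=20, guide_w=0.1):
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        # Guidance: steer the velocity down the gradient of the cost.
        xg = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(smoothness_cost(xg), xg)
        x = x + dt * (v - guide_w * grad)
    return x.detach()

traj = guided_sample(torch.randn(16, 3))
```

Because the correction happens only at inference, no constraint labels are needed during training, matching the description above.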
We compare EgoFlow against baselines on HD-EPIC (realistic kitchens) and HOT3D (zero-shot cross-dataset). Our method generates smoother, more physically plausible trajectories while significantly reducing collisions.
Comparison of EgoFlow against baselines on HD-EPIC kitchen sequences. Green: history, colored: predictions.
HD-EPIC Qualitative Results. EgoFlow generates plausible trajectories that take natural and smooth paths to the target pose, unlike baselines which often deviate or collide.
HOT3D Zero-Shot Results. Trained on Ego-Exo4D, tested on HOT3D without fine-tuning. EgoFlow produces geometrically coherent 6DoF trajectories in unseen environments.
Flow matching produces smoother, more coherent trajectories compared to diffusion-based generation.
HD-EPIC
| Model | ADE ↓ | FDE ↓ | Fréchet ↓ | Geodesic ↓ | Coll. ↓ |
|---|---|---|---|---|---|
| GIMO | 0.285 | 0.509 | 0.210 | 0.725 | 23.5% |
| CHOIS | 0.471 | 0.755 | 0.262 | 1.255 | 18.7% |
| Egoscaler | 1.330 | 1.494 | 0.315 | 1.614 | 35.8% |
| EgoFlow | 0.279 | 0.102 | 0.197 | 1.141 | 2.5% |
HOT3D (Zero-Shot)
| Model | ADE ↓ | FDE ↓ | Geodesic ↓ |
|---|---|---|---|
| GIMO | 0.299 | 0.436 | 2.06 |
| CHOIS | 0.513 | 0.571 | 2.46 |
| Egoscaler | 0.351 | 0.540 | 0.856 |
| EgoFlow | 0.265 | 0.027 | 1.49 |
HD-EPIC provides only sparse object annotations. We reconstruct continuous 6DoF trajectories by tracking the manipulating hand as a rigid-body proxy using Project Aria's Machine Perception Services (MPS). Below we show the reconstructed 3D object positions projected back onto egocentric frames for verification.
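The rigid-body proxy idea can be sketched in a few lines: under a rigid grasp, the hand-to-object offset is constant, so a single sparse object annotation can be propagated through the dense hand-pose track. The function and transform names below are illustrative, not the HD-EPIC or MPS API.

```python
# Rigid-body proxy trajectory reconstruction (sketch).
# Assumes a rigid grasp: the hand-to-object transform is fixed over time.
import numpy as np

def object_trajectory(hand_poses, obj_pose_at_grasp):
    """Propagate a sparse object annotation through dense hand poses.

    hand_poses: (T, 4, 4) world-from-hand transforms from hand tracking.
    obj_pose_at_grasp: (4, 4) world-from-object pose at the first frame.
    Returns (T, 4, 4) world-from-object poses.
    """
    # Constant hand-to-object offset, estimated at grasp onset.
    T_hand_obj = np.linalg.inv(hand_poses[0]) @ obj_pose_at_grasp
    return np.stack([T_wh @ T_hand_obj for T_wh in hand_poses])
```

The reconstructed poses can then be projected into the egocentric frames with the device calibration for visual verification.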
Pan trajectory reconstruction
Chopping board trajectory reconstruction