CVPR 2026

EgoFlow: Gradient-Guided Flow Matching
for Egocentric 6DoF Object Motion Generation

¹TU München  ·  ²MCML  ·  ³MBZUAI  ·  ⁴ETH Zürich
EgoFlow generates physically valid 6DoF trajectories from egocentric video.
TL;DR: EgoFlow learns to generate 6DoF object trajectories from egocentric video using flow matching with a hybrid Mamba-Transformer backbone. Gradient-guided sampling enforces physics at test time, cutting collisions by 79%.
79% collision reduction from gradient guidance
2.5% collision rate on HD-EPIC
20 Euler steps at inference
Zero-shot generalization to unseen HOT3D

Method

EgoFlow architecture: hybrid Mamba-Transformer-Perceiver with multimodal conditioning and gradient-guided flow matching.

Flow matching learns velocity fields from noise to trajectories. Bidirectional Mamba encodes temporal dynamics, Perceiver cross-attention fuses multimodal context, and gradient-guided sampling adds collision avoidance, rotational consistency, and velocity smoothness at test time.
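As a rough illustration of the sampling loop described above, the sketch below runs 20 Euler steps and, after each step, nudges the sample down the gradient of a cost. Everything in it is an assumption made for illustration: `toy_velocity` stands in for the learned network, the cost covers only one spherical obstacle plus a finite-difference velocity-smoothness term on 3D positions (rotations are left out), and the names `guidance_grad`, `sample`, and `guide_step` are hypothetical. It is not the EgoFlow implementation.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    On a straight noise-to-data path, the velocity that reaches `target`
    at t = 1 is (target - x) / (1 - t).
    """
    return (target - x) / max(1.0 - t, 1e-3)

def guidance_grad(x, obstacle, radius, w_coll=1.0, w_smooth=0.1):
    """Gradient of an assumed cost: sphere penetration + velocity smoothness."""
    # Collision term: 0.5 * sum(max(0, radius - dist)^2) for one sphere.
    diff = x - obstacle                                   # (T, 3)
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)   # (T, 1)
    pen = np.maximum(radius - dist, 0.0)                  # penetration depth
    g_coll = -pen * diff / np.maximum(dist, 1e-8)
    # Smoothness term: 0.5 * sum ||x[i+2] - 2 x[i+1] + x[i]||^2,
    # i.e. the change in finite-difference velocity between frames.
    acc = x[2:] - 2.0 * x[1:-1] + x[:-2]                  # (T-2, 3)
    g_smooth = np.zeros_like(x)
    g_smooth[:-2] += acc
    g_smooth[1:-1] -= 2.0 * acc
    g_smooth[2:] += acc
    return w_coll * g_coll + w_smooth * g_smooth

def sample(x0, target, obstacle, radius, steps=20, guide_step=0.0):
    """Euler integration of the flow ODE with an optional cost-descent step."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * toy_velocity(x, t, target)           # Euler step
        if guide_step > 0.0:
            # Gradient guidance: move the sample downhill on the cost.
            x = x - guide_step * guidance_grad(x, obstacle, radius)
    return x

# Toy scene: a straight 16-frame path that clips a sphere at the origin.
rng = np.random.default_rng(0)
T = 16
target = np.stack(
    [np.linspace(-1.0, 1.0, T), np.full(T, 0.15), np.zeros(T)], axis=-1
)
obstacle = np.zeros(3)
radius = 0.3
x0 = rng.standard_normal((T, 3))                          # Gaussian noise start

plain = sample(x0, target, obstacle, radius)
guided = sample(x0, target, obstacle, radius, guide_step=0.5)

def min_clearance(x):
    """Smallest distance from any frame to the obstacle centre."""
    return float(np.linalg.norm(x - obstacle, axis=-1).min())

print("min clearance without guidance:", min_clearance(plain))
print("min clearance with guidance:   ", min_clearance(guided))
```

Because the toy field lands exactly on the target at the last Euler step, only the final correction is visible in this example. A learned velocity field would not snap back like this, so the corrections would accumulate across steps.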

Qualitative Results

HD-EPIC Baseline Comparisons

Task Prompt: pick up the kitchen towel that is on the countertop. throw the paper towel into the trash bin.

GIMO (left) vs. EgoFlow (right). GIMO fails to reach the target due to missing gaze information.


HOT3D Zero-Shot Generalization

HOT3D qualitative

HOT3D (zero-shot). Trained on Ego-Exo4D, tested on unseen HOT3D scenes without fine-tuning.

Flow Matching vs. Diffusion

Flow matching produces smoother, more coherent trajectories than diffusion-based generation.

HD-EPIC Data Reconstruction

Pan trajectory reconstruction

Chopping board trajectory

ADT Data Reconstruction Verification

We validate our trajectory reconstruction algorithm on Aria Digital Twin, which has dense ground-truth annotations.

Green apple tracking (ADT)

Book tracking (ADT)

Quantitative Comparison

HD-EPIC

Model      ADE↓   FDE↓   Fréchet↓  Geo↓   Coll↓
GIMO       0.285  0.509  0.210     0.725  23.5%
CHOIS      0.471  0.755  0.262     1.255  18.7%
Egoscaler  1.330  1.494  0.315     1.614  35.8%
EgoFlow    0.279  0.102  0.197     1.141  2.5%

HOT3D (Zero-Shot)

Model      ADE↓   FDE↓   GD↓
GIMO       0.299  0.436  2.06
CHOIS      0.513  0.571  2.46
Egoscaler  0.351  0.540  0.856
EgoFlow    0.265  0.027  1.49
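
For readers unfamiliar with the translation metrics in the tables, the sketch below computes ADE, FDE, and a discrete Fréchet distance using their common textbook definitions. This is an assumption: the page does not state the exact variants or units used in the evaluation, and the Geo/GD and Coll columns are not covered here. The function names are hypothetical.

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error: mean per-frame Euclidean distance."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last frame."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines (dynamic programming)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise distances
    n, m = d.shape
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            # Best way to reach (i, j), then account for the current pair.
            best_prev = min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1])
            ca[i, j] = max(best_prev, d[i, j])
    return float(ca[-1, -1])

# Toy check: a prediction that matches the ground truth except at the end.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 3.0, 0.0]])

print("ADE:", ade(pred, gt))
print("FDE:", fde(pred, gt))
print("Fréchet:", discrete_frechet(pred, gt))
```

In this toy case the error is concentrated in the final frame, so FDE and the Fréchet distance are both larger than ADE, which averages the same error over all three frames.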

BibTeX

@inproceedings{saroha2026egoflow,
  title     = {EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation},
  author    = {Saroha, Abhishek and Zeng, Huajian and Zuo, Xingxing and Cremers, Daniel and Wang, Xi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}