EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation

Method

EgoFlow architecture: hybrid Mamba-Transformer-Perceiver with multimodal conditioning and gradient-guided flow matching.

Flow matching learns a velocity field that transports noise samples to object trajectories. A bidirectional Mamba encodes temporal dynamics, Perceiver cross-attention fuses the multimodal context, and gradient-guided sampling steers generation toward collision avoidance, rotational consistency, and velocity smoothness at test time.
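To make the sampling idea concrete, here is a minimal sketch of gradient-guided Euler integration of a flow-matching ODE. It is an illustration under stated assumptions, not the EgoFlow implementation: `toy_velocity` is a hypothetical analytic stand-in for the learned network, `smoothness_grad` is a single simple guidance cost (the collision and rotation terms are omitted), and `guidance_scale` and `steps` are arbitrary values chosen for the example.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to x_1 is (x_1 - x_t) / (1 - t).
    """
    return (target - x) / max(1.0 - t, 1e-3)

def smoothness_grad(x):
    """Gradient of 0.5 * sum_i ||x[i+1] - x[i]||^2 with respect to x."""
    d = np.diff(x, axis=0)   # (T-1, 3) differences between consecutive points
    g = np.zeros_like(x)
    g[:-1] -= d              # each point is pulled toward its successor
    g[1:] += d               # and toward its predecessor
    return g

def guided_sample(target, steps=50, guidance_scale=0.1, seed=0):
    """Integrate from noise (t=0) to a trajectory (t=1) with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)   # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Subtract the cost gradient so each step also descends the cost.
        v = toy_velocity(x, t, target) - guidance_scale * smoothness_grad(x)
        x = x + dt * v
    return x

# Assumed toy target: 16 positions on a straight line in 3D.
waypoints = np.linspace([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], 16)
trajectory = guided_sample(waypoints)
```

Because the guidance term only modifies the velocity at sampling time, the same pattern would apply to any differentiable cost without retraining the velocity model.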

Qualitative Results

HD-EPIC Baseline Comparisons

Task Prompt: pick up the kitchen towel that is on the countertop. throw the paper towel into the trash bin.

GIMO (left) vs. EgoFlow (right). GIMO fails to reach the target due to missing gaze information.

Task Prompt: pick up the kitchen towel that is on the countertop. throw the paper towel into the trash bin.

GIMO (left) vs. EgoFlow (right). GIMO fails to reach the target.

HOT3D Zero-Shot Generalization

HOT3D (zero-shot). Trained on Ego-Exo4D, tested on unseen HOT3D scenes without fine-tuning.

Flow Matching vs. Diffusion

Flow matching produces smoother, more coherent trajectories than diffusion-based generation.

HD-EPIC Data Reconstruction

Pan trajectory reconstruction

Chopping board trajectory

ADT Data Reconstruction Verification

We validate our trajectory reconstruction algorithm on Aria Digital Twin (ADT), which provides dense ground-truth annotations.

Green apple tracking (ADT)

Book tracking (ADT)

Quantitative Comparison

HD-EPIC

Model	ADE↓	FDE↓	Fréchet↓	Geo↓	Coll↓
GIMO	0.285	0.509	0.210	0.725	23.5%
CHOIS	0.471	0.755	0.262	1.255	18.7%
EgoScaler	1.330	1.494	0.315	1.614	35.8%
EgoFlow	0.279	0.102	0.197	1.141	2.5%
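For readers unfamiliar with the displacement metrics, the sketch below computes ADE (average displacement error) and FDE (final displacement error) using their standard definitions: the mean per-step Euclidean distance and the distance at the last step. The function name `ade_fde` and the toy arrays are illustrative assumptions; the exact units and any normalization used for the table are not stated here.

```python
import numpy as np

def ade_fde(pred, gt):
    """Return (ADE, FDE) for two position trajectories of shape (T, 3)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dists = np.linalg.norm(pred - gt, axis=-1)   # per-step Euclidean error
    return float(dists.mean()), float(dists[-1])

# Toy example: per-step errors are 0, 5, and 1.
gt = np.zeros((3, 3))
pred = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 1.0]])
ade, fde = ade_fde(pred, gt)   # ade = 2.0, fde = 1.0
```

A low FDE with a higher ADE, as in this toy case, means the trajectory ends near the target even though it deviates along the way.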

HOT3D (Zero-Shot)

Model	ADE↓	FDE↓	GD↓
GIMO	0.299	0.436	2.06
CHOIS	0.513	0.571	2.46
EgoScaler	0.351	0.540	0.856
EgoFlow	0.265	0.027	1.49
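The GD column is not defined on this page. Assuming it denotes the geodesic distance between predicted and ground-truth rotations, a common orientation error for 6DoF motion, it could be computed as in this sketch. The helper `rot_z` and the function name `geodesic_distance` are hypothetical and exist only for the example.

```python
import numpy as np

def geodesic_distance(R_pred, R_gt):
    """Angle in radians of the relative rotation between two 3x3 rotations."""
    R_rel = R_pred.T @ R_gt
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against round-off pushing the value outside [-1, 1].
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def rot_z(angle):
    """Rotation by `angle` radians about the z-axis (helper for the example)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two rotations about the same axis differ by the difference of their angles.
error = geodesic_distance(rot_z(0.3), rot_z(1.0))   # about 0.7 rad
```

For a full trajectory, this per-frame angle would typically be averaged over time steps, although the aggregation used for the table is an assumption here as well.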

BibTeX

@inproceedings{saroha2026egoflow,
  title     = {EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation},
  author    = {Saroha, Abhishek and Zeng, Huajian and Zuo, Xingxing and Cremers, Daniel and Wang, Xi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}