Flow matching learns a velocity field that transports noise samples to trajectories. A bidirectional Mamba encodes temporal dynamics, Perceiver cross-attention fuses multimodal context, and gradient-guided sampling enforces collision avoidance, rotational consistency, and velocity smoothness at test time.
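To make the sampling idea concrete, here is a minimal pure-Python sketch of flow matching with gradient guidance. It is not the EgoFlow implementation. It assumes the common straight-line path between noise and data, treats a trajectory as a flat list of scalars, and uses a toy smoothness cost in place of the collision, rotation, and velocity terms. All function names are hypothetical, and `oracle_velocity` stands in for the trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def smoothness_cost(x):
    """Toy guidance cost: sum of squared differences between neighbours."""
    return sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))


def smoothness_grad(x):
    """Analytic gradient of smoothness_cost with respect to x."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        g[i] -= 2.0 * d
        g[i + 1] += 2.0 * d
    return g


def oracle_velocity(target):
    """Stand-in for a trained network (assumption): the exact conditional
    velocity toward one known target, (x1 - x_t) / (1 - t)."""
    return lambda x, t: [(b - a) / (1.0 - t) for a, b in zip(x, target)]


def sample(velocity_fn, x0, steps=50, guidance_scale=0.0,
           cost_grad=smoothness_grad):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1.

    When guidance_scale > 0, each step is nudged down the cost gradient,
    which is the test-time guidance idea in its simplest form.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        if guidance_scale:
            g = cost_grad(x)
            v = [vi - guidance_scale * gi for vi, gi in zip(v, g)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Calling `sample(oracle_velocity(target), x0)` with `guidance_scale=0.0` integrates the velocity field alone. A positive `guidance_scale` trades some fidelity to the target for a lower cost. In practice the guidance cost would be differentiated automatically rather than by hand.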
Task Prompt: pick up the kitchen towel that is on the countertop. throw the paper towel into the trash bin.
GIMO (left) vs. EgoFlow (right). GIMO fails to reach the target due to missing gaze information.
HOT3D (zero-shot). Trained on Ego-Exo4D, tested on unseen HOT3D scenes without fine-tuning.
Flow matching produces smoother, more coherent trajectories than diffusion-based generation.
Pan trajectory reconstruction
Chopping board trajectory
We validate our trajectory reconstruction algorithm on Aria Digital Twin (ADT), which provides dense ground-truth annotations.
Green apple tracking (ADT)
Book tracking (ADT)
| Model | ADE↓ | FDE↓ | Fréchet↓ | Geo↓ | Coll↓ |
|---|---|---|---|---|---|
| GIMO | 0.285 | 0.509 | 0.210 | 0.725 | 23.5% |
| CHOIS | 0.471 | 0.755 | 0.262 | 1.255 | 18.7% |
| EgoScaler | 1.330 | 1.494 | 0.315 | 1.614 | 35.8% |
| EgoFlow | 0.279 | 0.102 | 0.197 | 1.141 | 2.5% |

| Model | ADE↓ | FDE↓ | GD↓ |
|---|---|---|---|
| GIMO | 0.299 | 0.436 | 2.06 |
| CHOIS | 0.513 | 0.571 | 2.46 |
| EgoScaler | 0.351 | 0.540 | 0.856 |
| EgoFlow | 0.265 | 0.027 | 1.49 |
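
For readers unfamiliar with the ADE, FDE, and Fréchet columns, the sketch below shows how these metrics are commonly computed for a predicted and a ground-truth trajectory. It assumes each trajectory is a list of 3D positions, that the two are time-aligned for ADE and FDE, and that Fréchet means the discrete Fréchet distance. The evaluation code behind the tables may differ, for example in units or normalization.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over all steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last step."""
    return math.dist(pred[-1], gt[-1])


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines (dynamic programming).

    ca[i][j] is the smallest achievable maximum pairwise distance when
    walking both curves monotonically from their starts to p[i] and q[j].
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]
```

ADE rewards staying close along the whole path, FDE isolates whether the endpoint is reached, and the Fréchet distance compares overall shape without requiring matched timestamps.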