VLD: Vision-Language Distance for Goal-Conditioned Navigation
Introduced a scalable vision-language distance learning framework for goal-conditioned navigation that decouples perception from control. Trained the distance model on roughly 3,000 hours of video and plugged it into independently trained RL policies as a drop-in replacement for privileged simulator distances, outperforming prior distance models (ViNT, VIP) and achieving strong sim-to-real transfer in real-robot deployments.
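The summary above does not show VLD's actual interface, so the sketch below only illustrates the general pattern it describes: a learned observation-to-goal distance model substituted for the privileged simulator distance as the reward signal of an independently trained RL policy. All names here (DistanceModel, shaped_reward, the toy encoder architecture) are hypothetical stand-ins, not the published model.

```python
import torch
import torch.nn as nn

class DistanceModel(nn.Module):
    """Hypothetical stand-in for a learned distance model: maps an
    observation image and a goal image to a scalar estimate of the
    remaining distance-to-goal."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Toy encoders; the real system would use pretrained
        # vision-language backbones trained on the video corpus.
        self.obs_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.goal_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.obs_encoder(obs), self.goal_encoder(goal)], dim=-1)
        return self.head(z).squeeze(-1)  # predicted distance-to-goal, shape (B,)

def shaped_reward(dist_model: DistanceModel,
                  obs: torch.Tensor,
                  next_obs: torch.Tensor,
                  goal: torch.Tensor) -> torch.Tensor:
    """Reward shaping with the learned distance in place of the privileged
    simulator distance: reward = decrease in predicted distance-to-goal
    after one step, so the policy never touches simulator state."""
    with torch.no_grad():  # the distance model is frozen w.r.t. the policy
        d_before = dist_model(obs, goal)
        d_after = dist_model(next_obs, goal)
    return d_before - d_after  # positive when the agent moved closer

# Toy usage: batched image observations and image goals.
dist_model = DistanceModel()
obs = torch.rand(4, 3, 64, 64)
next_obs = torch.rand(4, 3, 64, 64)
goal = torch.rand(4, 3, 64, 64)
reward = shaped_reward(dist_model, obs, next_obs, goal)  # shape (4,)
```

Because the distance model is trained separately and queried only through this scalar interface, it can be swapped in for the simulator's ground-truth distance without retraining the policy architecture, which is what makes it a drop-in replacement.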