Discrete vision-language navigation (VLN) systems face a critical planning-execution gap: the planner predicts a chunk of actions, while the agent executes them step by step and cannot react to obstacles that appear during execution. The problem is amplified in modern Chain-of-Thought VLN systems, where a single inference takes several seconds. We address this with a hierarchical architecture that inserts a learned low-level controller between a frozen VLM planner (FantasyVLN) and the Habitat simulator. The controller operates at every action step: given depth observations and the current VLM command, it decides whether to execute the command, override it, or hold position. Our compact CNN policy (144K parameters) is trained with Proximal Policy Optimization (PPO) using offline replay that pre-records VLM action sequences, eliminating VLM inference overhead during single-GPU training. The reward combines an intent-following bonus, an intent-miss penalty, a collision penalty, and proximity shaping. On the HM3D minival split, across 100 episodes, our controller achieves 22% collision-free navigation with 99% task completion and an 86.6% VLM intent-following rate, versus 6% collision-free navigation and a 100% stop rate for raw VLM execution. An ablation study confirms that both the collision penalty and the proximity-shaping term are essential: removing the collision penalty increases collisions 96-fold, while removing proximity shaping produces over-conservative behavior that refuses 71% of forward commands. Our modular design provides per-step reactive safety for any discrete VLN planner.
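
To make the four-term reward concrete, the sketch below shows one plausible way the terms could combine per step. All coefficient values, names (`StepInfo`, `controller_reward`, `safe_dist`), and the proximity-shaping form are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of the controller reward described above.
# Coefficients, field names, and thresholds are assumed, not from the paper.

from dataclasses import dataclass


@dataclass
class StepInfo:
    followed_intent: bool      # controller executed the VLM's commanded action
    missed_intent: bool        # controller overrode or held against the command
    collided: bool             # simulator reported a collision this step
    min_obstacle_dist: float   # closest obstacle distance from depth, in meters


def controller_reward(step: StepInfo,
                      w_intent: float = 0.1,
                      w_miss: float = 0.05,
                      w_collision: float = 1.0,
                      w_proximity: float = 0.02,
                      safe_dist: float = 0.5) -> float:
    """Combine intent-following, intent-miss, collision, and proximity terms."""
    reward = 0.0
    if step.followed_intent:
        reward += w_intent       # intent-following bonus
    if step.missed_intent:
        reward -= w_miss         # intent-miss penalty
    if step.collided:
        reward -= w_collision    # collision penalty
    # Proximity shaping: penalize being inside a safety margin, scaled by
    # how deep into the margin the agent is.
    if step.min_obstacle_dist < safe_dist:
        reward -= w_proximity * (safe_dist - step.min_obstacle_dist) / safe_dist
    return reward
```

Under this assumed form, dropping `w_collision` leaves nothing discouraging contact (consistent with the 96-fold collision increase in the ablation), while dropping the proximity term makes any near-obstacle forward motion look as costly as holding position, which would push the policy toward the over-conservative refusals described above.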

