TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin

AI summary

Embedding topological structure directly into a VLM, with no vision-to-text conversion, enables robust global navigation and path correction and outperforms methods built on larger proprietary models.

Vision-Language Navigation; Topology-Aware Reasoning; Large Vision-Language Models; Global Action Planning; Spatial Reasoning; End-to-End Navigation

Problem

Existing large-model-based VLN methods lose crucial visual-spatial information by converting observations to text or lack explicit global memory, limiting their ability to reason over topological structures and backtrack effectively.

Approach

TagaVLM is an end-to-end framework that embeds an online topological map directly into a VLM backbone using an Interleaved Navigation Prompt for node alignment and Spatial Topology Aware Residual Attention to inject edge distances into self-attention layers.
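
To make the idea of injecting edge distances into self-attention concrete, here is a minimal sketch of attention with an additive distance bias. It is an illustration under stated assumptions, not the authors' STAR-Att implementation: the summary does not give the exact form of the bias, so the negative linear penalty, the `scale` parameter, the single attention head, and the one-token-per-node layout are all hypothetical choices.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_biased_attention(q, k, v, dist, scale=1.0):
    """Single-head attention with an additive pairwise-distance bias.

    q, k, v: lists of equal-length vectors, one per topological node token
             (one token per node is an assumption made for this sketch).
    dist:    dist[i][j] is the edge distance between node i and node j.
    scale:   hypothetical weight on the bias; scale=0 gives plain attention.
    Returns (outputs, attention_weights).
    """
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        logits = [
            # Standard scaled dot product, minus an assumed distance penalty.
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) - scale * dist[i][j]
            for j, kj in enumerate(k)
        ]
        weights.append(softmax(logits))
    outputs = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return outputs, weights
```

Because the bias is added to the logits rather than replacing them, setting it to zero leaves the pretrained attention pattern untouched, which is one plausible reading of how a residual design can preserve pretrained knowledge.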

Key results

State-of-the-art performance among large-model-based methods on the R2R benchmark (SR: 51.09%, SPL: 47.18 in unseen environments)
Outperforms prior large-model-based methods by 3.39% in SR and 9.08 in SPL
Architecturally injected topological priors in a 0.5B VLM yield competitive results
Enables effective global action reasoning and path backtracking during navigation
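
For readers unfamiliar with the two metrics quoted above, the sketch below computes Success Rate (SR) and Success weighted by Path Length (SPL) using their standard definitions. The episode field names are hypothetical, and the 3 m success radius is the usual R2R convention rather than something stated in this summary.

```python
def sr_and_spl(episodes, success_radius=3.0):
    """Compute (SR, SPL) as fractions in [0, 1].

    episodes: list of dicts with hypothetical keys, all in metres:
      'dist_to_goal'    distance from the stopping point to the goal
      'path_length'     length of the path the agent actually took
      'shortest_length' length of the shortest path from start to goal
    """
    if not episodes:
        raise ValueError("need at least one episode")
    successes = 0
    spl_sum = 0.0
    for ep in episodes:
        if ep["dist_to_goal"] < success_radius:
            successes += 1
            # A successful episode is discounted by how much longer the
            # taken path was than the shortest one.
            spl_sum += ep["shortest_length"] / max(
                ep["path_length"], ep["shortest_length"]
            )
    n = len(episodes)
    return successes / n, spl_sum / n


# Usage: results are typically reported multiplied by 100.
sr, spl = sr_and_spl([
    {"dist_to_goal": 1.0, "path_length": 12.0, "shortest_length": 10.0},
    {"dist_to_goal": 6.5, "path_length": 15.0, "shortest_length": 9.0},
])
```

SPL can never exceed SR, since each successful episode contributes at most 1 and each failed episode contributes 0. This is why a method that backtracks often may gain SR while paying for the extra distance in SPL.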

Why it matters

Demonstrates that targeted architectural inductive biases outperform brute-force scaling for embodied spatial reasoning, guiding the development of efficient, robust autonomous navigation systems.

Abstract

Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM’s self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code can be found on our project page: https://apex-bjut.github.io/Taga-VLM/.
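
The global action reasoning and path correction mentioned in the abstract can be pictured with a small online topological map. The sketch below is a hypothetical illustration, not the released code: the class name `TopoMap`, its methods, and the string node ids are all assumptions. It shows only the bookkeeping, where every observed but unvisited node is a candidate action and reaching a distant candidate means routing back through already visited nodes. Choosing among the candidates is the model's job and is out of scope here.

```python
from collections import deque


class TopoMap:
    """Hypothetical online topological map built while navigating."""

    def __init__(self):
        self.edges = {}       # node id -> set of neighbouring node ids
        self.visited = set()  # nodes the agent has actually stood on

    def observe(self, current, neighbours):
        """Record the current node and the navigable nodes seen from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for nb in neighbours:
            self.edges[current].add(nb)
            self.edges.setdefault(nb, set()).add(current)

    def global_candidates(self):
        """All observed-but-unvisited nodes, not just the local neighbours."""
        return sorted(n for n in self.edges if n not in self.visited)

    def path_to(self, start, goal):
        """Fewest-hop route from start to goal through visited nodes (BFS)."""
        prev = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            if node not in self.visited:
                continue  # only expand through nodes the agent has explored
            for nb in sorted(self.edges[node]):
                if nb not in prev:
                    prev[nb] = node
                    queue.append(nb)
        return None
```

With a purely local action space, an agent standing at a dead end could only step to an adjacent node. With the global candidate set, it can select any frontier node seen earlier and follow the returned route back to it.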

Index terms

Vision-Based Navigation; Autonomous Vehicle Navigation; Deep Learning for Visual Perception