CAHIR: Co-Attentive Hierarchical Image Representations for Visual Place Recognition

Guohao Peng, Heshan Li, Yifeng Huang, Jun Zhang, Mingxing Wen, Singh Rahul, Danwei Wang

Abstract

Robust visual place recognition (VPR) against significant appearance changes is crucial for the life-long operation of mobile robots. Focusing on this task, we propose a Co-Attentive Hierarchical Image Representations (CAHIR) framework for VPR, which unifies attention-sharing global and local descriptor generation into one encoding pipeline. The hierarchical descriptors are applied to a coarse-to-fine VPR system with global retrieval and local geometric verification. To explore high-quality local matches between task-relevant visual elements, a cross-attention mutual enhancement layer is introduced to strengthen the information interaction between the local descriptors. Through the proposed selective matching distillation, the mutual enhancement layer can learn from state-of-the-art local matchers in a distillation manner. After weighted cross-matching of the enhanced local descriptors, geometric verification is applied to evaluate the spatial consistency of the compared image pair. Experiments show CAHIR outperforms the existing global and local representations for VPR in terms of performance and efficiency. Quantitatively, it achieves state-of-the-art results on three city-scale benchmark datasets. Qualitatively, CAHIR proves to attach great importance to task-relevant visual elements and excels at finding local correspondences that are discriminative to the VPR task.
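The coarse-to-fine flow described in the abstract (global retrieval, then re-ranking by weighted cross-matching of local descriptors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the dictionary layout (`global`, `local`, `weights`), and the use of mutual nearest neighbours as the cross-matching rule are assumptions made for illustration, and the learned encoder, the cross-attention mutual enhancement layer, and the geometric verification step are all left out.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_retrieve(query_global, db_globals, k):
    """Coarse stage: indices of the k database images closest to the query."""
    sims = l2_normalize(db_globals) @ l2_normalize(query_global)
    return np.argsort(-sims)[:k]

def weighted_mutual_match_score(q_desc, q_w, d_desc, d_w):
    """Fine stage (assumed rule): sum of weighted similarities over mutual nearest neighbours."""
    sim = l2_normalize(q_desc) @ l2_normalize(d_desc).T  # shape (Nq, Nd)
    nn_q = sim.argmax(axis=1)  # best database descriptor for each query descriptor
    nn_d = sim.argmax(axis=0)  # best query descriptor for each database descriptor
    q_idx = np.arange(sim.shape[0])
    mutual = nn_d[nn_q] == q_idx  # keep only pairs that choose each other
    matched_q = q_idx[mutual]
    matched_d = nn_q[mutual]
    return float(np.sum(q_w[matched_q] * d_w[matched_d] * sim[matched_q, matched_d]))

def coarse_to_fine(query, database, k=5):
    """Retrieve k candidates by global descriptor, then re-rank them by local match score."""
    db_globals = np.stack([d["global"] for d in database])
    candidates = global_retrieve(query["global"], db_globals, k)
    scores = [
        weighted_mutual_match_score(
            query["local"], query["weights"],
            database[i]["local"], database[i]["weights"],
        )
        for i in candidates
    ]
    order = np.argsort(-np.asarray(scores))
    return [int(candidates[j]) for j in order]
```

In a full system, the candidates surviving the fine stage would additionally be checked for spatial consistency, for example by fitting a geometric model to the matched keypoints with RANSAC; that step is omitted here to keep the sketch self-contained.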

Index terms

Recognition; Deep Learning for Visual Perception; Localization