EVLOD: Ensemble Vision-Language Open-Vocabulary Detection for Construction Site Object Recognition

Yongdong Wang, Runze Xiao, Jun Younes Louhi Kasahara, Shota Chikushi, Keiji Nagatani, Atsushi Yamashita, Hajime Asama

Abstract

The construction industry faces severe labor shortages, driving the need for robotic automation solutions. However, effective deployment of construction robots requires robust environmental perception capabilities, particularly accurate identification of diverse objects in complex, dynamic construction environments. Closed-set object detection methods are limited to predefined categories, proving inadequate for the highly varied object types encountered on construction sites. This paper introduces EVLOD (Ensemble Vision-Language Open-vocabulary Detection), an ensemble framework that integrates multiple state-of-the-art vision-language models to enable open-vocabulary object detection in construction scenarios. EVLOD employs a voting-based fusion strategy that combines predictions from GroundingDINO and GroundingDINO-CLIP detectors, utilizing their complementary strengths while mitigating individual model weaknesses. The ensemble approach incorporates confidence voting, object name voting, and bounding box voting to produce reliable detections with reduced false positives. Evaluated on a comprehensive dataset of 825 Unmanned Aerial Vehicle (UAV)-captured construction images with 5,020 annotated objects, EVLOD achieves an Average Precision (AP) of 0.49 at an Intersection over Union (IoU) threshold of 0.5, representing a 36.1% improvement over the best-performing baseline. The method effectively reduces detection noise from 5,495 to 3,232 detections. Qualitative analysis reveals primary limitations in detecting small-scale objects and low-contrast elements.
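
To make the voting-based fusion idea concrete, the following is a minimal sketch of how predictions from two detectors could be combined. It is not the implementation used in EVLOD: the abstract does not specify the voting rules, so the `Detection` type, the `fuse` function, the greedy IoU matching, the confidence-weighted label and box votes, and the thresholds `iou_thr` and `score_thr` are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    """Hypothetical container for a single detector output."""
    box: Box
    label: str
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(dets_a: List[Detection], dets_b: List[Detection],
         iou_thr: float = 0.5, score_thr: float = 0.3) -> List[Detection]:
    """Assumed fusion rule: keep only detections both models agree on."""
    fused: List[Detection] = []
    used = set()  # indices of dets_b already matched
    for da in dets_a:
        # Bounding box vote: find the best-overlapping unmatched box in dets_b.
        best_j, best_iou = -1, 0.0
        for j, db in enumerate(dets_b):
            if j in used:
                continue
            overlap = iou(da.box, db.box)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_iou < iou_thr:
            continue  # no spatial agreement, treat as noise
        db = dets_b[best_j]
        used.add(best_j)

        # Confidence vote: average the two scores and apply a threshold.
        total = da.score + db.score
        score = total / 2.0
        if score < score_thr or total <= 0:
            continue

        # Object name vote: each model votes for its label, weighted by score.
        votes = {}
        for d in (da, db):
            votes[d.label] = votes.get(d.label, 0.0) + d.score
        label = max(votes, key=votes.get)

        # Fused box: confidence-weighted average of the matched coordinates.
        box = tuple((da.score * p + db.score * q) / total
                    for p, q in zip(da.box, db.box))
        fused.append(Detection(box, label, score))
    return fused
```

Under these assumptions, a detection reported by only one model is discarded, which illustrates how an ensemble can lower the total detection count and suppress false positives, at the cost of missing objects that only one detector finds.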

Index terms

Robotics, Machine Learning, Automation