Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators

Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar

AI summary

Combining a small number of real-world trials with large-scale simulation via prediction-powered inference reduces hardware evaluation costs by 20–25% while maintaining statistically valid confidence intervals for robot policy performance.

Robot policy evaluation, prediction-powered inference, simulation-to-real gap, confidence intervals, scalable evaluation, robotics benchmarks

Problem

Real-world robot policy evaluation is resource-intensive and often lacks statistical rigor, while relying solely on simulation introduces bias due to the simulation-to-real gap, preventing trustworthy inferences about real-world performance.

Approach

SureSim pairs a small set of real-world evaluations with large-scale simulation trials to estimate and correct simulation bias, then applies non-asymptotic mean estimation algorithms to generate finite-sample valid confidence intervals for real-world policy performance.
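
The bias-correction idea can be sketched in a few lines. The following is a minimal Python illustration, not the paper's algorithm. It assumes binary success outcomes in [0, 1] and forms the standard prediction-powered mean estimate: the simulation mean plus the average real-minus-simulation gap on paired trials. For the non-asymptotic mean estimation step it swaps in a plain Hoeffding bound combined with a union bound. The function name `ppi_mean_interval` and its arguments are hypothetical. Because Hoeffding ignores variance, this sketch only shows how finite-sample validity is retained; it is not expected to reproduce the reported savings, which would need a tighter mean-estimation routine.

```python
import math
import random


def ppi_mean_interval(real_paired, sim_paired, sim_extra, delta=0.1):
    """Prediction-powered estimate of mean real-world success with a
    Hoeffding-based (1 - delta) confidence interval.

    real_paired: real outcomes in [0, 1] on n paired trials
    sim_paired:  simulation outcomes in [0, 1] on the same n trials
    sim_extra:   simulation outcomes in [0, 1] on N additional trials
    All names are illustrative assumptions, not the authors' API.
    """
    n, big_n = len(real_paired), len(sim_extra)
    if n == 0 or big_n == 0 or len(sim_paired) != n:
        raise ValueError("need n >= 1 paired trials and N >= 1 simulation trials")

    # Large-scale simulation mean (possibly biased by the sim-to-real gap).
    sim_mean = sum(sim_extra) / big_n
    # Rectifier: average real-minus-simulation gap on the paired trials.
    rectifier = sum(y - f for y, f in zip(real_paired, sim_paired)) / n
    estimate = sim_mean + rectifier

    # Two-sided Hoeffding bounds, each given failure budget delta / 2.
    # Simulation outcomes have range 1; paired differences have range 2.
    log_term = math.log(4.0 / delta)
    half_sim = math.sqrt(log_term / (2.0 * big_n))
    half_rect = 2.0 * math.sqrt(log_term / (2.0 * n))
    half = half_sim + half_rect

    # Success rates live in [0, 1], so clip the interval to that range.
    return estimate, max(0.0, estimate - half), min(1.0, estimate + half)


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic data, for illustration only

    def sim_trial():
        return 1 if rng.random() < 0.7 else 0

    def real_given_sim(f):
        # Real outcome agrees with the simulated one 90% of the time.
        return f if rng.random() < 0.9 else 1 - f

    sim_paired = [sim_trial() for _ in range(200)]
    real_paired = [real_given_sim(f) for f in sim_paired]
    sim_extra = [sim_trial() for _ in range(20000)]
    print(ppi_mean_interval(real_paired, sim_paired, sim_extra, delta=0.1))
```

The union bound splits the failure probability evenly between the simulation mean and the rectifier; other splits are equally valid and may give a narrower interval when the two sample sizes differ greatly.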

Key results

Introduces SureSim, a framework for finite-sample valid confidence intervals on real-world policy performance using paired real and simulation data.
Demonstrates a 20–25% reduction in required real-world hardware trials while achieving comparable statistical bounds.
Validates the approach on both a diffusion policy and a multi-task fine-tuned π0 foundation model across diverse object and environment distributions.
Analyzes method sensitivity to varying real-simulation correlation, identifying conditions where simulation augmentation provides optimal benefits.
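
The role of real-simulation correlation can be made concrete with the textbook variance calculation for a prediction-powered mean estimate. This is a standard identity, not a result quoted from the paper, and the notation is assumed for illustration: $Y$ is the real outcome, $F$ the simulation outcome on the same object and initial condition, $\tilde{F}$ a simulation outcome from an additional trial, $n$ the number of paired trials, $N$ the number of additional simulation-only trials (independent of the paired ones), and $\rho$ the correlation between $Y$ and $F$.

```latex
\hat{\theta}_{\mathrm{PP}}
  = \frac{1}{N}\sum_{i=1}^{N} \tilde{F}_i
  + \frac{1}{n}\sum_{j=1}^{n} \left( Y_j - F_j \right),
\qquad
\operatorname{Var}\!\left(\hat{\theta}_{\mathrm{PP}}\right)
  = \frac{\sigma_F^2}{N}
  + \frac{\sigma_Y^2 + \sigma_F^2 - 2\rho\,\sigma_Y \sigma_F}{n}.
```

As $N$ grows the first term vanishes, and the estimate has lower variance than the real-only mean (variance $\sigma_Y^2 / n$) exactly when $\rho > \sigma_F / (2\sigma_Y)$. A weakly correlated simulator can therefore widen the interval rather than narrow it. Finite-sample intervals would be expected to follow the same qualitative trend, though not this exact formula.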

Why it matters

Provides roboticists and researchers with a statistically rigorous, cost-effective evaluation protocol that bridges the simulation-to-real gap without sacrificing reliability.

Abstract

Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are often evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned π0 on a joint distribution of objects and initial conditions, and find that our approach saves over 20–25% of hardware evaluation effort to achieve similar bounds on policy performance. Supplementary notes and videos can be found at https://suresim-robot-eval.github.io.

Index terms

Probability and Statistical Methods; Performance Evaluation and Benchmarking