Demystifying the paradox of importance sampling with an estimated history-dependent behavior policy in off-policy evaluation

Zhou, H., Hanna, J. P., Zhu, J., Yang, Y. & Shi, C. (2025). Demystifying the paradox of importance sampling with an estimated history-dependent behavior policy in off-policy evaluation. In Proceedings of the 42nd International Conference on Machine Learning. ACM Press.

This paper studies off-policy evaluation (OPE) in reinforcement learning, with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of why the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, we show that the variance decreases consistently as the estimated behavior policy conditions on a longer history. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator, and the marginalized IS estimator, with the behavior policy estimated either parametrically or nonparametrically.
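
For concreteness, the ordinary IS estimator referred to above can be sketched as a per-trajectory weighted return; the notation below (evaluation policy \(\pi_e\), estimated behavior policy \(\widehat{\pi}_b\) conditioning on the history \(H_t\), horizon \(T\), discount factor \(\gamma\), \(n\) trajectories) is a standard-notation assumption and may differ from the paper's exact formulation.

\[
\widehat{v}_{\mathrm{IS}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left( \prod_{t=0}^{T-1}
      \frac{\pi_e\bigl(A_t^{(i)} \mid S_t^{(i)}\bigr)}
           {\widehat{\pi}_b\bigl(A_t^{(i)} \mid H_t^{(i)}\bigr)} \right)
    \sum_{t=0}^{T-1} \gamma^{t} R_t^{(i)},
  \qquad
  H_t^{(i)} = \bigl(S_0^{(i)}, A_0^{(i)}, \dots, S_t^{(i)}\bigr).
\]

Conditioning the estimated behavior policy \(\widehat{\pi}_b\) on the full history \(H_t^{(i)}\), rather than only on the current state \(S_t^{(i)}\), is the choice whose trade-off the abstract describes: lower asymptotic variance at the price of larger finite-sample bias.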

Accepted Version
Creative Commons: Attribution 4.0
