An instrumental variable approach to confounded off-policy evaluation
Off-policy evaluation (OPE) aims to estimate the return of a target policy from pre-collected observational data generated by a potentially different behavior policy. In many cases, unmeasured variables confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded sequential decision making. As in single-stage decision making, we show that an IV enables us to correctly identify the target policy’s value in infinite-horizon settings. Furthermore, we propose several policy value estimators and illustrate their effectiveness through extensive simulations and a real data analysis from a world-leading short-video platform.
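
The single-stage case mentioned in the abstract can be illustrated with a minimal sketch. It is not one of the paper's estimators: it is the classical Wald (ratio) IV estimator on simulated data, shown only to make concrete how an instrument can remove bias from an unmeasured confounder. The data-generating process, the function names (`simulate`, `naive_effect`, `wald_effect`), and every numeric constant are assumptions made for illustration.

```python
import random


def simulate(n, seed=0):
    """Hypothetical single-stage data: instrument z, action a, reward r.

    The confounder u affects both the action and the reward but is
    not recorded, mimicking an unmeasured confounder.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)             # unmeasured confounder
        z = 1 if rng.random() < 0.5 else 0  # instrument, independent of u
        # The action probability depends on the instrument and on u.
        p = 0.2 + 0.5 * z + (0.2 if u > 0 else 0.0)
        a = 1 if rng.random() < p else 0
        # Assumed true effect of the action on the reward is 1.0;
        # z influences r only through a.
        r = 1.0 * a + 2.0 * u + rng.gauss(0.0, 1.0)
        data.append((z, a, r))
    return data


def mean(xs):
    return sum(xs) / len(xs)


def naive_effect(data):
    """Difference in mean reward between actions; biased by u."""
    r1 = mean([r for _, a, r in data if a == 1])
    r0 = mean([r for _, a, r in data if a == 0])
    return r1 - r0


def wald_effect(data):
    """Wald IV estimator: reward contrast over action contrast across z."""
    r_z1 = mean([r for z, _, r in data if z == 1])
    r_z0 = mean([r for z, _, r in data if z == 0])
    a_z1 = mean([a for z, a, _ in data if z == 1])
    a_z0 = mean([a for z, a, _ in data if z == 0])
    return (r_z1 - r_z0) / (a_z1 - a_z0)


if __name__ == "__main__":
    data = simulate(200_000, seed=0)
    print("naive estimate:", round(naive_effect(data), 3))
    print("Wald IV estimate:", round(wald_effect(data), 3))
```

Under these assumptions the naive contrast overstates the effect because the confounder raises both the chance of taking the action and the reward, whereas the Wald ratio relies only on variation induced by the instrument. The sequential, infinite-horizon setting treated in the paper requires more than this one-step ratio.
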
| Item Type | Article |
|---|---|
| Departments | Statistics |
| Date Deposited | 23 May 2024 14:33 |
| URI | https://researchonline.lse.ac.uk/id/eprint/123599 |
| Version | Accepted Version |