An instrumental variable approach to confounded off-policy evaluation
Off-policy evaluation (OPE) aims to estimate the return of a target policy using pre-collected observational data generated by a potentially different behavior policy. In many cases, there exist unmeasured variables that confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded sequential decision making. We show that, as in single-stage decision making, an IV enables us to correctly identify the target policy’s value in infinite-horizon settings. Furthermore, we propose a number of policy value estimators and illustrate their effectiveness through extensive simulations and real data analysis from a world-leading short-video platform.
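For readers unfamiliar with the single-stage case the abstract refers to, the following is a minimal sketch of how an instrument can recover an action's effect on the reward when a confounder is unmeasured. It is not the paper's estimator and does not cover the sequential setting. It uses the standard Wald ratio for a binary instrument, and the data-generating model, its coefficients, and all function names are assumptions made purely for illustration.

```python
import random

TRUE_EFFECT = 2.0  # assumed constant effect of the action on the reward


def simulate(n, seed=0):
    """Draw (z, a, r) triples from a toy confounded single-stage model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)             # unmeasured confounder
        z = 1 if rng.random() < 0.5 else 0  # binary instrument, independent of u
        # The action depends on both the instrument and the confounder.
        a = 1 if 2.0 * z - 1.0 + u + rng.gauss(0.0, 1.0) > 0 else 0
        # The reward depends on the action and the confounder, not on z directly.
        r = TRUE_EFFECT * a + 3.0 * u + rng.gauss(0.0, 1.0)
        data.append((z, a, r))
    return data


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_estimate(data):
    """Difference in mean reward between actions; biased when u is unobserved."""
    treated = mean(r for _, a, r in data if a == 1)
    untreated = mean(r for _, a, r in data if a == 0)
    return treated - untreated


def wald_estimate(data):
    """Ratio of the instrument's effect on the reward to its effect on the action."""
    reward_gap = (mean(r for z, _, r in data if z == 1)
                  - mean(r for z, _, r in data if z == 0))
    action_gap = (mean(a for z, a, _ in data if z == 1)
                  - mean(a for z, a, _ in data if z == 0))
    return reward_gap / action_gap


if __name__ == "__main__":
    sample = simulate(100_000)
    print("naive:", round(naive_estimate(sample), 3))
    print("wald: ", round(wald_estimate(sample), 3))
```

In this toy model the naive comparison overstates the effect because the confounder raises both the chance of taking the action and the reward, whereas the Wald ratio uses only variation induced by the instrument. The ratio relies on the usual IV assumptions holding by construction here: the instrument is independent of the confounder, shifts the action, and affects the reward only through the action.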
| Item Type | Article |
|---|---|
| Copyright holders | © 2023 The Author(s) |
| Departments | LSE > Academic Departments > Statistics |
| Date Deposited | 23 May 2024 |
| Acceptance Date | 14 Apr 2023 |
| URI | https://researchonline.lse.ac.uk/id/eprint/123599 |
Explore Further
- https://www.lse.ac.uk/statistics/people/chengchun-shi (Author)
- https://www.scopus.com/pages/publications/85172436136 (Scopus publication)
- https://proceedings.mlr.press/v202/xu23x.html
- https://proceedings.mlr.press/ (Official URL)