Combinatorial bandits for maximum value reward function under value-index feedback
We investigate the combinatorial multi-armed bandit problem where an action is to select $k$ arms from a set of base arms, and its reward is the maximum of the sample values of these $k$ arms, under a weak feedback structure that only returns the value and index of the arm with the maximum value. This novel feedback structure is much weaker than the semi-bandit feedback previously studied and is only slightly stronger than the full-bandit feedback, and thus it presents a new challenge for the online learning task. We propose an algorithm and derive a regret bound for instances where arm outcomes follow distributions with finite supports. Our algorithm introduces a novel concept of biased arm replacement to address the weak feedback challenge, and it achieves a distribution-dependent regret bound of $O((k/\Delta)\log(T))$ and a distribution-independent regret bound of $\tilde{O}(\sqrt{T})$, where $\Delta$ is the reward gap and $T$ is the time horizon. Notably, our regret bound is comparable to the bounds obtained under the more informative semi-bandit feedback. We demonstrate the effectiveness of our algorithm through experimental results.
| Item Type | Conference or Workshop Item (Paper) |
|---|---|
| Departments | Statistics |
| Date Deposited | 19 Jun 2024 10:39 |
| URI | https://researchonline.lse.ac.uk/id/eprint/123919 |
Explore Further
- https://www.lse.ac.uk/statistics/people/milan-vojnovic (Author)
- https://iclr.cc/ (Organisation)
-
picture_as_pdf -
subject - Published Version