|
NeuroCOLT
Technical Report NC-TR-95-001
Worst-Case
Analysis of the Bandit Problem
Peter
Auer
Technische Universiät Graz
Klosterwiesgasse 32/2
A-8010 Graz, Austria.
Nicolò
Cesa-Bianchi
DSI, University of Milan
Via Comelico 39,
I-20135 Milano, Italy.
Abstract
The multi-armed bandit is a classical problem in the area of
sequential decision theory and has been studied under a variety of
statistical assumptions. In this work we investigate the bandit problem
from a purely worst-case standpoint. We present a randomized algorithm
with an expected total reward of $G-O(G^{4/5}K^{6/5})$ (disregarding
log factors), where $K$ is the number of arms and $G$ is the (unknown)
total reward of the best arm. Our analysis holds with no assumptions
whatsoever on the way rewards are generated, other than being independent
of the algorithm's randomization. Our results can also be interpreted
as a novel extension of the on-line prediction model, an intensively
studied framework in learning theory.
Download Compressed
Postscript
|