|
NeuroCOLT
Technical Report NC-TR-00-084
Finite-time
Analysis of the
Multi-armed Bandit Problem
Peter Auer
Nicolo' Cesa-Bianchi
Paul Fischer
Abstract
Reinforcement learning policies face the exploration versus exploitation
dilemma, i.e.\ the search for a balance between exploring the environment
to find profitable actions while taking the empirically best action
as often as possible. A popular measure of a policy's success in addressing
this dilemma is the regret, that is the loss due to the fact that
the globally optimal policy is not followed all the times. One of
the simplest examples of the exploration/exploitation dilemma is the
multi-armed bandit problem. Lai and Robbins were the first ones to
show that the regret for this problem has to grow at least logarithmically
in the number of plays. Since then, policies which asymptotically
achieve this regret have been devised by Lai and Robbins and many
others. In this work we show that the optimal logarithmic regret is
also achievable uniformly over time, with simple and efficient policies,
and for all reward distributions with bounded support.
Download
Compressed Postscript
|