|
NeuroCOLT |
Neural Networks and Computational Learning Theory |
|
|
Jonathan
Baxter Abstract Formally, reinforcement learning problems can be modelled as Markov Decision Processes (MDP's). For small enough state spaces, the techniques of dynamic programming enable an optimal policy to be found by solving Bellman's equations for the value function, and then using the value function to generate a policy by choosing in each state the action leading to the state with the highest expected value. However, for many problems the state space is far too large to find the optimal value function, hence the emphasis in reinforcement learning has been on finding approximatitions to the value function within a restricted class such as neural networks. While there have been a number of empirical successes with this approach, it suffers from a lack of fundamental theoretical gurantees on the performance of the policy generated by the approximate value function. This talk describes an alternative approach to reinforcement learning, which we call direct reinforcement learning. We consider policies that are defined by parameters -- they might define approximate value functions that are used to generate a policy by some form of look-ahead, or they might directly parameterize the agent's policy. Rather than attempting to find accurate value estimates and then use these to generate a policy, we instead directly adjust the parameters to improve the average reward. Specifically, we compute the gradient of the average reward with respect to the parameters, and then use gradient ascent to generate a new set of parameters with increased average reward. The talk will present an algorithm for computing accurate approximations to the gradient of the average reward from a single sample path. We use these estimates in a conjugate-gradient ascent algorithm that uses a novel line-search routine, relying solely on gradient estimates. In a variety of domains, including a communications network call admission problem and a simple dynamic system control problem, this algorithm rapidly finds optimal or near-optimal solutions. |