The environment is typically formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this context are closely related to dynamic programming techniques. State transition probabilities and reward probabilities in the MDP are typically stochastic but stationary over the course of the problem.
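To make the link to dynamic programming concrete, here is a minimal sketch of value iteration on a small finite-state MDP with stationary, stochastic transitions. The two-state MDP, the table `P`, the action names, and the parameters `gamma` and `tol` are all hypothetical and chosen only for illustration. Unlike a reinforcement learning agent, this procedure assumes the transition and reward probabilities are known in advance.

```python
# Hypothetical two-state MDP (illustrative assumption, not from the article).
# P[state][action] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 0, 0.0)]},
}


def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until values settle."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Expected return of each action, then take the best one.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            return V


V = value_iteration(P)
```

Because the probabilities in `P` do not change over time, the backup converges to a fixed set of state values.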
Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor are suboptimal actions explicitly corrected. Further, there is a focus on online performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
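One common way to balance exploration and exploitation is an epsilon-greedy rule, sketched below. This is one illustrative strategy, not one the article prescribes. The function name `epsilon_greedy`, the list `estimates` of current value estimates per action, and the parameter `epsilon` are assumptions introduced for the example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index from a list of estimated action values.

    With probability epsilon, explore by choosing uniformly at random;
    otherwise exploit by choosing the action with the highest estimate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


rng = random.Random(0)  # seeded only so runs are repeatable
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
explore_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0, rng=rng)
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between trade one off against the other.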
Formally, the basic reinforcement learning model consists of a set of environment states S, a set of actions A, and a set of scalar rewards.
At each time t, the agent perceives its state s_{t}∈S and the set of possible actions A(s_{t}). It chooses an action a∈A(s_{t}) and receives from the environment the new state s_{t+1} and a reward r_{t+1}. Based on these interactions, the reinforcement learning agent must develop a policy π:S→A which maximizes the quantity r_{0}+r_{1}+...+r_{n} for MDPs which have a terminal state, or the quantity Σ_{t}γ^{t}r_{t} for MDPs without terminal states (where γ is some "future reward" discounting factor between 0.0 and 1.0).
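The interaction loop and the two objective quantities above can be sketched as follows. The four-state chain environment, the functions `actions`, `step`, and `run_episode`, and the fixed `policy` table are hypothetical, introduced only to illustrate the loop. With `gamma` set to 1.0, `discounted_return` reduces to the plain sum used for MDPs with a terminal state.

```python
TERMINAL = 3  # hypothetical chain of states 0..3; state 3 ends the episode


def actions(state):
    """A(s): the set of actions available in a state (the same everywhere here)."""
    return ["left", "right"]


def step(state, action):
    """Environment response: returns the new state and the reward."""
    if action == "right":
        next_state = min(state + 1, TERMINAL)
    else:
        next_state = max(state - 1, 0)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def run_episode(policy, start=0, max_steps=20):
    """Follow a policy (a mapping from state to action) and collect the rewards."""
    state, rewards = start, []
    for _ in range(max_steps):
        if state == TERMINAL:
            break
        action = policy[state]
        assert action in actions(state)
        state, reward = step(state, action)
        rewards.append(reward)
    return rewards


def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


policy = {0: "right", 1: "right", 2: "right"}
rewards = run_episode(policy)
undiscounted = discounted_return(rewards)        # plain sum of rewards
discounted = discounted_return(rewards, 0.9)     # later rewards count less
```

Here the policy is fixed by hand; a learning agent would instead adjust it based on the rewards it observes.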
Reinforcement learning applies particularly well to problems where long-term reward can be had at the expense of short-term reward. It has been applied successfully to various problems, including robot control, elevator scheduling, and backgammon.
Leslie Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (1996) pp. 237–285. (CiteSeer reference (http://citeseer.nj.nec.com/kaelbling96reinforcement))
Richard Sutton and Andrew Barto. Reinforcement Learning. MIT Press, 1998. (available online (http://wwwanw.cs.umass.edu/~rich/book/thebook))
