Bellman equation

From Wikipedia, the free encyclopedia

A Bellman equation, also known as an optimality equation or a dynamic programming equation, is a recursive equation that arises in dynamic programming, an approach developed by Richard Bellman.

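In reinforcement learning, a Bellman equation is a recursion for expected rewards. Write $R(s)$ for the reward received in state $s$, $\gamma$ for the discount factor, and $P(s'|s,a)$ for the probability of moving to state $s'$ after taking action $a$ in state $s$. Then, for example, the expected discounted reward for starting in a particular state $s$ and following some fixed policy $\pi$ satisfies the Bellman equation: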

$V^\pi(s)= R(s) + \gamma \sum_{s'} P(s'|s,\pi(s)) V^\pi(s')\,$

while the equation for the optimal policy is referred to as the Bellman optimality equation:

$V^*(s)= R(s) + \max_a \gamma \sum_{s'} P(s'|s,a) V^*(s')\,$

the difference being that rather than taking the action prescribed by some policy $\pi$, we take the action that gives the best expected return.
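Both recursions can be solved numerically by treating them as fixed-point iterations. The following Python is a minimal sketch, assuming finite state and action sets and a discount factor below 1. The two-state problem, the function names `backup`, `evaluate_policy` and `value_iteration`, and the nested-list layout `P[s][a][s2]` are illustrative assumptions, not part of any standard library.

```python
def backup(R, P, V, gamma, s, a):
    """One-step lookahead: R(s) + gamma * sum over s' of P(s'|s,a) * V(s')."""
    return R[s] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def evaluate_policy(R, P, policy, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman equation for a fixed policy until V stops changing."""
    V = [0.0] * len(R)
    for _ in range(max_iter):
        V_new = [backup(R, P, V, gamma, s, policy[s]) for s in range(len(R))]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


def value_iteration(R, P, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman optimality equation; return V* and a greedy policy."""
    n_states, n_actions = len(R), len(P[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = [max(backup(R, P, V, gamma, s, a) for a in range(n_actions))
                 for s in range(n_states)]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    # The greedy policy picks, in each state, the action with the best lookahead.
    policy = [max(range(n_actions), key=lambda a: backup(R, P, V, gamma, s, a))
              for s in range(n_states)]
    return V, policy


# Hypothetical two-state, two-action problem (made up for illustration).
R = [0.0, 1.0]                     # reward for being in state 0 or state 1
P = [
    [[1.0, 0.0], [0.2, 0.8]],      # state 0: action 0 stays; action 1 reaches state 1 w.p. 0.8
    [[1.0, 0.0], [0.0, 1.0]],      # state 1: action 0 returns to state 0; action 1 stays
]
gamma = 0.9

V_fixed = evaluate_policy(R, P, [0, 0], gamma)   # policy that always takes action 0
V_star, pi_star = value_iteration(R, P, gamma)
```

In this toy problem the fixed policy yields values of about 0 and 1, while the optimal values are about 8.78 and 10, reached by taking action 1 in both states. The only difference between the two functions is the `max` over actions, mirroring the difference between the two equations.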

Principle of optimality

The Bellman equation expresses an infinite-horizon maximization problem in recursive form. The problem

$\max_{ \left \{ x_{t+1} \right \}_{t=0}^{\infty} } \sum_{t=0}^{\infty} \beta^t F(x_t,x_{t+1}) =V(x_0)$

such that

$\begin{matrix} x_{t+1} \in \Gamma (x_t), & t = 0, 1, 2, \ldots \\ x_0 \in X & \text{given} \end{matrix}$

can be written as:

$V(x) = \max_{y \in \Gamma (x) } [F(x,y) + \beta V(y)], \quad \forall x \in X.$

Here the choice $y \in \Gamma (x)$ depends on the state $x$, and $y(x)$ is the policy function.

This equivalence is called the principle of optimality.^[1] In words, the principle asserts that if the policy function is optimal for the infinite-horizon problem, then whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from that first decision (as expressed by the Bellman equation). The principle of optimality is related to the concept of optimal substructure: problems that exhibit optimal substructure can often be solved with dynamic programming.
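The principle can be checked numerically on a small example. The sketch below is a hedged illustration, not a definitive implementation: it assumes a finite state space and uses a made-up "cake-eating" problem in which $x$ is the amount of cake left, $\Gamma(x) = \{0, \ldots, x\}$, $F(x,y) = \sqrt{x-y}$ and $\beta = 0.9$. The names `solve_bellman`, `feasible` and `discounted_sum` are hypothetical helpers introduced here.

```python
import math


def solve_bellman(states, feasible, F, beta, tol=1e-10, max_iter=10_000):
    """Iterate V(x) <- max over y in feasible(x) of [F(x, y) + beta * V(y)]."""
    V = {x: 0.0 for x in states}
    for _ in range(max_iter):
        V_new = {x: max(F(x, y) + beta * V[y] for y in feasible(x)) for x in states}
        change = max(abs(V_new[x] - V[x]) for x in states)
        V = V_new
        if change < tol:
            break
    # Policy function y(x): the maximizing choice in each state.
    policy = {x: max(feasible(x), key=lambda y: F(x, y) + beta * V[y]) for x in states}
    return V, policy


def discounted_sum(x0, policy, F, beta, horizon=500):
    """Truncated sum of beta^t * F(x_t, x_{t+1}) along the path the policy generates."""
    total, x = 0.0, x0
    for t in range(horizon):
        y = policy[x]
        total += beta ** t * F(x, y)
        x = y
    return total


# Assumed toy problem: x units of cake remain, y units are carried forward.
states = range(6)

def feasible(x):
    return range(x + 1)            # Gamma(x) = {0, 1, ..., x}

def F(x, y):
    return math.sqrt(x - y)        # payoff from eating x - y units now

beta = 0.9

V, policy = solve_bellman(states, feasible, F, beta)

x0 = 5
x1 = policy[x0]                                  # state after the first decision
whole = discounted_sum(x0, policy, F, beta)      # value of the full plan from x0
tail = discounted_sum(x1, policy, F, beta)       # value of the remaining plan from x1
```

Here `whole` agrees with `V[x0]` and `tail` agrees with `V[x1]` up to numerical tolerance: the decisions that remain after the first step are themselves optimal for the state that the first step produces, which is what the Bellman equation encodes.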