Temporal difference model

From Wikipedia, the free encyclopedia

The temporal difference (TD) model is a real-time model of classical conditioning. The primary idea behind the TD model is that the prediction at each time step is calculated as the discounted sum of all future reinforcements.

Math

Let \lambda_t be the reinforcement at time step t, and let \bar V_t be the correct prediction, namely the discounted sum of all future reinforcements:

\bar V_t = \sum_{i=0}^{\infty} \gamma^i \lambda_{t+i+1}
0 \le \gamma < 1

Splitting off the first term of the sum gives a recursive form:

\bar V_t = \lambda_{t+1} + \gamma \sum_{i=0}^{\infty} \gamma^i \lambda_{t+i+2}
\bar V_t = \lambda_{t+1} + \gamma \bar V_{t+1}
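
As a rough numerical check (not part of the original article), the following Python sketch computes \bar V_t for a short, made-up reinforcement sequence both directly from the discounted sum and through the recursion above; the sequence, discount factor, and function names are illustrative assumptions.

# Illustrative sketch: the ideal prediction as a discounted sum vs. its recursive form.
# lam[k] holds the assumed reinforcement lambda_{k+1}, so bar-V_t = sum_i gamma^i * lam[t+i].

GAMMA = 0.9                              # discount factor, 0 <= gamma < 1
lam = [0.0, 0.0, 1.0, 0.0, 0.5]          # made-up reinforcement sequence

def v_bar_sum(t):
    """Direct form: bar-V_t = sum_{i>=0} gamma^i * lambda_{t+i+1}, truncated at the sequence end."""
    return sum(GAMMA ** i * lam[t + i] for i in range(len(lam) - t))

def v_bar_recursive(t):
    """Recursive form: bar-V_t = lambda_{t+1} + gamma * bar-V_{t+1}."""
    if t >= len(lam):
        return 0.0
    return lam[t] + GAMMA * v_bar_recursive(t + 1)

for t in range(len(lam)):
    assert abs(v_bar_sum(t) - v_bar_recursive(t)) < 1e-12    # both forms agree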

The effective reinforcement at time step t is then the difference between the recursive estimate of the ideal prediction, \lambda_{t+1} + \gamma \bar V_{t+1}, and the current prediction \bar V_t:

R_t = \lambda_{t+1} + \gamma \bar V_{t+1} - \bar V_{t}
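
In code, this error term might be computed as follows (a minimal sketch; the function and argument names are assumptions, not from the article):

def td_error(lam_next, v_bar_next, v_bar_now, gamma):
    """R_t = lambda_{t+1} + gamma * bar-V_{t+1} - bar-V_t (illustrative helper)."""
    return lam_next + gamma * v_bar_next - v_bar_now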

Substituting this reinforcement term into the Sutton–Barto model yields the temporal difference model:

\Delta V_i = \beta (\lambda_{t+1} + \gamma \bar V_{t+1} - \bar V_{t}) \alpha_i \bar X_i
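
A minimal simulation sketch of this update rule is given below; the two-stimulus trial structure, the exponentially decaying stimulus trace \bar X_i, the rectified prediction, and all parameter values are illustrative assumptions rather than details taken from the article.

import numpy as np

# Sketch of the TD-model update over repeated trials (assumed setup):
# two conditioned stimuli, an exponentially decaying stimulus trace x_bar,
# and a single reinforcement (US) delivered at a fixed time step.

gamma, beta = 0.95, 0.1          # discount factor and learning rate (assumed values)
alpha = np.array([0.5, 0.5])     # per-stimulus salience alpha_i (assumed values)
V = np.zeros(2)                  # associative strengths V_i

T = 10
X = np.zeros((T, 2))             # stimulus presence over one trial
X[2:6, 0] = 1.0                  # CS1 present at steps 2-5
X[4:6, 1] = 1.0                  # CS2 present at steps 4-5
lam = np.zeros(T + 1)
lam[6] = 1.0                     # reinforcement (US) delivered at step 6

for trial in range(200):
    x_bar = np.zeros(2)                               # stimulus trace, reset each trial
    for t in range(T):
        x_bar = 0.8 * x_bar + X[t]                    # assumed trace dynamics
        v_t = max(0.0, float(V @ X[t]))               # current prediction, rectified at zero
        v_next = max(0.0, float(V @ X[t + 1])) if t + 1 < T else 0.0
        delta = lam[t + 1] + gamma * v_next - v_t     # TD error R_t
        V += beta * delta * alpha * x_bar             # Delta V_i = beta * R_t * alpha_i * X-bar_i

print(V)                         # learned associative strengths after training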

References

Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks. Cambridge, MA: MIT Press.