[Illinois]: Temporal Difference, Iterative Dynamic Programming, and Least Mean Squares

By Bara Saadah¹, Nahil Sobh¹, AbderRahman N Sobh¹, Jessica S Johnson¹

1. University of Illinois at Urbana-Champaign

This tool updates state values using the Temporal Difference Algorithm.

Version 1.1b - published on 06 Aug 2014

doi:10.4231/D3K649T6C

This tool simulates the responses of certain neurons that are part of the brain's "reward" system in decision making. These neurons appear to be involved in learning the expected values of sensory events and behavioral options.

In decision making, one may observe the state of a system and act on that system. By observing the effects of these actions, one can learn which actions are more likely to lead to desirable or undesirable outcomes. Temporal-difference learning is a parsimonious method for learning the expected values of various landmarks as steps toward a desired destination. You can use temporal-difference learning to adjust the weights in a neural network, or to adjust the state value estimates of a more abstract learning agent.

The simulation uses a Markov process: a stochastic process in which the next state depends only on the current state, and not on any earlier states in the sequence. The least mean squares (LMS) algorithm updates the state value estimates (v_j) after each sequence by computing a running average of the reinforcement to go (r_tg) from each state. This allows state values to be estimated without knowing the state transition probabilities beforehand, and the estimates converge to the correct state values over many training epochs. The LMS algorithm is (Eq. 11.4):

$$ v_j(c+1) = v_j(c) + \frac{r_{tg} - v_j(c)}{c+1} $$

where c is the number of times the estimate for state j has already been updated.

The temporal-difference (TD) algorithm essentially solves the dynamic programming problem through updates on each state transition, but without knowing the state transition probabilities, by trying to match the value estimate of each state to that of its successor. It combines the strengths of the other two algorithms: like iterative dynamic programming, it updates each estimate from the estimate of the successor state, and like LMS, it learns from sampled sequences rather than from known transition probabilities. The basic TD algorithm is (Eq. 11.5):

$$ v_j(c+1) = v_j(c) + a\left[\, r + v_k(c) - v_j(c) \,\right] $$

where state k is the successor of state j, r is the reinforcement received on the transition from j to k, and a is the learning rate.
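The three approaches named in the title can be compared on a toy problem. What follows is a minimal Python sketch under stated assumptions, not the tool's own code: the three-state chain in `TRANSITIONS`, the helper names `step`, `iterative_dp`, `lms`, and `td`, the fixed start state, the learning rate `a=0.01`, and the undiscounted rewards with a terminal value of 0 are all hypothetical choices. Iterative dynamic programming uses the transition probabilities directly, while LMS (Eq. 11.4) and TD (Eq. 11.5) learn only from sampled sequences.

```python
import random

# Hypothetical three-state Markov chain, chosen only for illustration.
# Each entry is (probability, next_state, reward); next_state None means the
# sequence has ended (terminal state, whose value is taken to be 0).
TRANSITIONS = {
    0: [(0.5, 1, 0.0), (0.5, 2, 0.0)],
    1: [(1.0, None, 1.0)],
    2: [(0.5, None, 0.0), (0.5, 1, 0.0)],
}
STATES = sorted(TRANSITIONS)
START = 0  # every sequence begins in state 0 (assumption)


def step(state, rng):
    """Sample (next_state, reward) from the transition list of `state`."""
    u = rng.random()
    cumulative = 0.0
    for prob, nxt, reward in TRANSITIONS[state]:
        cumulative += prob
        if u < cumulative:
            return nxt, reward
    return TRANSITIONS[state][-1][1:]  # guard against floating-point round-off


def iterative_dp(tol=1e-10):
    """Iterative dynamic programming: sweeps the states in place using the
    known transition probabilities until the largest change falls below tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new = sum(p * (r + (v[k] if k is not None else 0.0))
                      for p, k, r in TRANSITIONS[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def lms(episodes, rng):
    """Least mean squares (Eq. 11.4): after each sequence, fold the reward
    to go from every visited state into that state's running average."""
    v = {s: 0.0 for s in STATES}
    count = {s: 0 for s in STATES}
    for _ in range(episodes):
        trajectory, state = [], START
        while state is not None:  # generate one complete sequence
            nxt, reward = step(state, rng)
            trajectory.append((state, reward))
            state = nxt
        r_tg = 0.0
        for s, reward in reversed(trajectory):  # accumulate reward to go
            r_tg += reward
            v[s] += (r_tg - v[s]) / (count[s] + 1)
            count[s] += 1
    return v


def td(episodes, rng, a=0.01):
    """Basic temporal difference (Eq. 11.5): after every transition j -> k,
    move v[j] toward r + v[k] by a fraction a (the learning rate)."""
    v = {s: 0.0 for s in STATES}
    for _ in range(episodes):
        state = START
        while state is not None:
            nxt, reward = step(state, rng)
            v_next = v[nxt] if nxt is not None else 0.0  # terminal value is 0
            v[state] += a * (reward + v_next - v[state])
            state = nxt
    return v


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print("DP: ", iterative_dp())
    print("LMS:", lms(20000, rng))
    print("TD: ", td(20000, rng))
```

For this hypothetical chain the exact values are v_0 = 0.75, v_1 = 1 and v_2 = 0.5. Because TD here uses a constant learning rate, its estimates keep fluctuating slightly around these values, whereas the LMS running average settles as the visit counts grow.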

Sponsored by

NanoBio Node, University of Illinois at Urbana-Champaign


Anastasio, Thomas J. Tutorial on Neural Systems Modeling. Sunderland: Sinauer Associates, 2010. Print.

Cite this work

Researchers should cite this work as follows:

  • Bara Saadah; Nahil Sobh; AbderRahman N Sobh; Jessica S Johnson (2014), "[Illinois]: Temporal Difference, Iterative Dynamic Programming, and Least Mean Squares," http://nanohub.org/resources/tempdiff. (DOI: 10.4231/D3K649T6C).
