[Illinois]: Temporal Difference, Iterative Dynamic Programming, and Least Mean Squares
This tool updates state values using the Temporal Difference Algorithm.
This tool simulates the responses of certain neurons in the "reward" system of the brain during decision making. These neurons appear to be involved in learning the expected values of sensory events and behavioral options. In decision making, one observes the state of a system and makes decisions about it through actions. By observing the effects of those actions, one can learn which actions are more likely to lead to desirable or undesirable outcomes. Temporal difference learning is a parsimonious method for learning the expected values of intermediate landmarks as steps toward a desired destination. You can use temporal-difference learning to adjust the weights in a neural network, or to adjust the state value estimates of a more abstract learning agent.
The simulation uses a Markov process, a stochastic process in which the current state depends only on the previous state, not on any earlier states in the sequence. The least mean squares algorithm updates the state value estimates (v_j) after each sequence by computing a running average of the reinforcements to go (r_tg). This allows state values to be estimated without knowing the state transition probabilities beforehand, and the estimates converge to the correct state values after many training epochs. The least mean squares update is (eq 11.4) v(c+1) = v(c) + [r − v(c)]/(c + 1), where r is the reinforcement to go and c counts the updates made so far.
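The running-average update in eq 11.4 can be sketched as follows. This is an illustrative helper, not the tool's own code; the sequence format (a list of visited states paired with the reinforcement to go) is an assumption for the sketch.

```python
def lms_state_values(sequences, num_states):
    """Estimate state values as running averages of reinforcement to go.

    sequences: list of (visited_states, r_tg) pairs, where r_tg is the
    reinforcement to go credited to every state visited in that sequence
    (a simplifying assumption for this sketch).
    """
    v = [0.0] * num_states      # state value estimates
    count = [0] * num_states    # number of updates per state (c)
    for states, r_tg in sequences:
        for j in states:
            count[j] += 1
            # eq 11.4: v(c+1) = v(c) + [r - v(c)] / (c + 1)
            v[j] += (r_tg - v[j]) / count[j]
    return v
```

Because the step size shrinks as 1/(c + 1), each estimate is exactly the average of the reinforcements to go observed so far, so no transition probabilities are needed.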
The temporal difference algorithm essentially solves the dynamic programming problem through updates on each state transition, again without knowing the state transition probabilities, by trying to match the value estimate of each state to that of its successor. It combines the best of the previous two algorithms. The basic temporal difference update is (eq 11.5) v(c+1) = v(c) + aΔv, where Δv = [r + v_next(c)] − v(c) is the temporal difference error and a is the learning rate.
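A minimal sketch of the temporal difference update on a transition, applied to a hypothetical three-state chain (the chain, the learning rate, and the terminal reinforcement are illustrative assumptions, not part of the tool):

```python
def td_update(v_j, v_next, r, a):
    """One temporal-difference update for the current state's estimate.

    Moves v_j toward the reinforcement received plus the successor's
    estimate: v <- v + a * ((r + v_next) - v), the eq 11.5 form with
    delta-v = (r + v_next) - v_j.
    """
    return v_j + a * ((r + v_next) - v_j)

# Demo: chain 0 -> 1 -> 2 -> terminal, reinforcement 1 on the last step.
v = [0.0, 0.0, 0.0]
a = 0.1  # learning rate (assumed value)
for epoch in range(2000):
    for j in range(3):
        v_next = v[j + 1] if j < 2 else 0.0  # terminal state has value 0
        r = 1.0 if j == 2 else 0.0
        v[j] = td_update(v[j], v_next, r, a)
```

Each state's estimate chases that of its successor, so the terminal reinforcement propagates backward through the chain and all three values approach 1 without any transition probabilities being supplied.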
Researchers should cite this work as follows: