Talk:Reinforcement learning


Is it R = \sum \limits_{t} \gamma\, r_t, or R = \sum \limits_{t} \gamma_{t}\, r_t, or R = \sum \limits_{t} \gamma^{t} r_t ?

Answer: It is R = \sum \limits_{t=0}^{\infty} \gamma^{t} r_t
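For concreteness, here is a tiny Python sketch of that sum truncated to a finite horizon; the reward sequence and the value gamma = 0.9 are made up purely for illustration:

 def discounted_return(rewards, gamma=0.9):
     """Compute R = sum_t gamma^t * r_t for a finite sequence of rewards."""
     return sum((gamma ** t) * r for t, r in enumerate(rewards))

 # Example: a reward of 1 at each of 5 steps with gamma = 0.9
 print(discounted_return([1.0] * 5))  # 1 + 0.9 + 0.81 + 0.729 + 0.6561 = 4.0951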

Policies

What exactly is a policy? The Sutton-Barto book is very vague on this point, and so is this article. In both cases the word is used without much explanation.

According to both the book and the article, a policy is a mapping from states to action probabilities. Fine. But this is not elaborated upon. What does a policy look like? I infer that it must be a table (2-D array), indexed by state and action, and containing probabilities, say pij for the i-th state and j-th action, each pij being a transition probability for the MDP. If so, what is its relation to the values derived from rewards? I.e. where exactly do the probabilities pij come from? How does one generate a policy table starting from values?

Sorry if I appear stupid, but I've been studying the book and I find it very difficult to comprehend, even though the maths is very simple (almost too simple). Or maybe it's in there somewhere but I've missed it?

--84.9.83.127 09:36, 18 November 2006 (UTC)

A policy is indeed a mapping from states to action probabilities, usually written π. So we can write π: S × A → [0,1], meaning that π gives the probability of taking a given action a in state s. It doesn't have to be a table; it is just a function. If S and A are both discrete it can easily be written as a table, but if either is continuous then another representation is needed. For instance, if S is the interval [0,10], we can place a number of radial basis functions over that interval (say 11 of them, one centred at 0, one at 1, one at 2, and so on). Call their activations r0, ..., r10. The policy is then a function of those activations and the action, π: r0 × ... × r10 × A → [0,1], which we can no longer write as a table. A small sketch of the two representations is below.
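To make the table-versus-function distinction concrete, here is a rough Python sketch; the states, actions, basis-function centres, and weights are all made up for illustration:

 import numpy as np

 # Tabular policy: discrete states and actions, probabilities stored directly.
 policy_table = {
     "s0": {"left": 0.8, "right": 0.2},
     "s1": {"left": 0.1, "right": 0.9},
 }

 # Parametric policy: continuous state in [0, 10], 11 radial basis functions.
 centers = np.arange(11.0)       # one basis function centred at 0, 1, ..., 10
 weights = np.zeros((11, 2))     # one weight per (basis function, action)

 def pi(s, a):
     """Probability of action a (0 or 1) in continuous state s, via softmax over RBF features."""
     features = np.exp(-0.5 * (s - centers) ** 2)   # RBF activations
     prefs = features @ weights                     # one preference per action
     probs = np.exp(prefs - prefs.max())
     probs /= probs.sum()
     return probs[a]

With all weights zero, pi returns 0.5 for either action; learning would adjust the weights so that better actions get higher probability.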
How the policy relates to the values depends on the particular solution method being used for the RL problem. In an actor-critic architecture, the policy is the set of state-action values together with a rule for selecting an action (softmax, for instance, or simply choosing the action with the highest value), and the state-action values are updated according to the state values and the error signal. In a Q-learning agent, the policy and the values are essentially the same thing; more precisely, the policy is a function of the values, given by the action-selection mechanism. That is also where the probabilities p_ij in your question come from: they are produced by applying the action-selection rule to the learned values.
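A rough Python sketch of that last point, with made-up action values for a single state (in Q-learning these would be learned from rewards):

 import numpy as np

 q_values = np.array([1.0, 2.0, 0.5])  # illustrative values for one state

 def softmax_policy(q, temperature=1.0):
     """Turn action values into action probabilities (softmax / Boltzmann selection)."""
     prefs = q / temperature
     probs = np.exp(prefs - prefs.max())
     return probs / probs.sum()

 def greedy_policy(q):
     """Put all probability on the highest-valued action."""
     probs = np.zeros_like(q)
     probs[np.argmax(q)] = 1.0
     return probs

 print(softmax_policy(q_values))  # approximately [0.23, 0.63, 0.14]
 print(greedy_policy(q_values))   # [0., 1., 0.]

Either rule gives you a row of the policy table for that state; doing this for every state gives the whole table.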
For the most part, when you're just learning reinforcement learning theory, the use of policies may not be particularly clear. At least, in my own case, I didn't understand the focus on policies until I read Sutton, Precup, and Singh (1999) on options [1], at which point policies became crystal clear.
Hope that answers your question. digfarenough (talk) 19:25, 4 March 2007 (UTC)
Thanks. But your reply raises more questions for me, which I need to try and find answers to! --84.9.75.142 22:41, 16 March 2007 (UTC) (formerly 84.9.83.127)
Feel free to ask further questions on my talk page. I'm certainly no expert on reinforcement learning, but I've written one paper on it and have written a large number of simulations of RL-related things, so I at least know the basics. digfarenough (talk) 01:09, 17 March 2007 (UTC)