Jun 05

Q-learning [Watkins, 1989] is one of the most popular reinforcement learning methods. One of the advantages of Q-learning is its ability to compare the expected utility of the available actions without requiring a model of the environment.

The core of Q-learning is the following update rule:

Q_{t+1}(a, s) = (1-\alpha_{t}) Q_{t}(a, s) + \alpha_{t} [r_{t}(s) + \gamma \max_{a'} Q_{t}(a', s')]

Where:

  • Q_{t}(a,s) is the Q-value at time t of taking action a in state s.
  • s' is the next state reached after taking action a in state s, and a' ranges over the actions available in s'.
  • r_{t}(s) is the reward received at time t.
  • \alpha_{t} is the learning rate, with 0 ≤ \alpha_{t} ≤ 1. It determines how much weight new information gets relative to the current estimate. If \alpha is 0, the agent learns nothing. If \alpha is 1, only the new information is used and the old estimate is discarded.
  • \gamma is the discount factor. It lies in the range [0, 1] and weights near-term reinforcement more heavily than distant future reinforcement. The closer \gamma is to 1, the greater the weight of future reinforcement.
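As a minimal sketch, the update rule can be written as a single Python function. The function name `q_update`, the dictionary keyed by (action, state) pairs, and the state and action labels are all assumptions made for illustration. A next state with no available actions is treated as terminal and contributes 0.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, next_actions, alpha, gamma):
    """Apply one Q-learning update to the table q and return the new value."""
    # Best value reachable from the next state; 0.0 if it has no actions (terminal).
    best_next = max((q[(a, next_state)] for a in next_actions), default=0.0)
    # New information: immediate reward plus discounted best future value.
    target = reward + gamma * best_next
    # Blend the old estimate and the new information using the learning rate.
    q[(action, state)] = (1 - alpha) * q[(action, state)] + alpha * target
    return q[(action, state)]

q = defaultdict(float)        # unseen (action, state) pairs start at 0.0
q[("left", "s1")] = 2.0       # hypothetical existing estimates for the next state
q[("right", "s1")] = 5.0
new_value = q_update(q, "s0", "right", reward=1.0, next_state="s1",
                     next_actions=["left", "right"], alpha=0.5, gamma=0.9)
print(new_value)
```

Here the new information is 1 + 0.9 × 5 = 5.5, and with \alpha = 0.5 the old estimate of 0 moves halfway towards it.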

So what does the equation mean? Assume \alpha=1 and \gamma=1; the equation then becomes:

Q_{t+1}(a, s) = r_{t}(s) + \max_{a'} Q_{t}(a', s')

It is now easy to see that the Q-value of the state-action pair (a,s) equals the reward for taking action a plus the maximum Q-value of the next state (over all next actions). This recursive structure is the one used in dynamic programming, and under suitable conditions (every state-action pair is visited repeatedly and the learning rate decays appropriately) the Q-values converge to the optimal ones.
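To make the simplified rule concrete, here is a small sketch on a made-up problem: a corridor of four cells where the rightmost cell is the goal and every move costs a reward of -1. The corridor, the names `step` and `sweep`, and the reward scheme are all assumptions for illustration. Note that this sweeps over every state-action pair using known dynamics, so it is closer to value iteration than to an agent learning from experience; it only shows how the recursion settles.

```python
GOAL = 3                      # terminal state of a hypothetical 4-cell corridor
ACTIONS = ("left", "right")

def step(state, action):
    """Deterministic toy dynamics: move one cell, pay a reward of -1 per move."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    return next_state, -1

def sweep(q):
    """Apply Q(a, s) = r + max_a' Q(a', s') once to every pair; report any change."""
    changed = False
    for state in range(GOAL):
        for action in ACTIONS:
            next_state, reward = step(state, action)
            # The goal is terminal, so nothing more can be collected from it.
            best_next = 0 if next_state == GOAL else max(q[(a, next_state)] for a in ACTIONS)
            new_value = reward + best_next
            if new_value != q[(action, state)]:
                q[(action, state)] = new_value
                changed = True
    return changed

q = {(a, s): 0 for s in range(GOAL) for a in ACTIONS}
while sweep(q):               # repeat until no Q-value changes
    pass

for s in range(GOAL):
    print(s, {a: q[(a, s)] for a in ACTIONS})
```

Once the values stop changing, the Q-value of moving right from a cell is minus the number of moves left to the goal, and moving left is always worse or equal, so choosing the action with the largest Q-value leads straight to the goal.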

When the discount factor is less than 1, rewards count for less the further in the future they arrive, so the total discounted reward from time t is given by:

R_{t}=r_{t}+\gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^n r_{t+n} + \dots
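The sum above can be computed for a finite list of rewards with a few lines of Python. This is only a sketch: the function name `discounted_return` and the sample rewards are made up for illustration, and the infinite sum is cut off at the end of the list.

```python
def discounted_return(rewards, gamma):
    """Return r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step is R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 1.0, 1.0, 1.0]                  # hypothetical reward of 1 at each step
print(discounted_return(rewards, gamma=1.0))    # 4.0: no discounting
print(discounted_return(rewards, gamma=0.5))    # 1.875 = 1 + 0.5 + 0.25 + 0.125
```

With \gamma = 0.5 the fourth reward contributes only 0.125 to the total, which is what "reducing over time" means in practice.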

The Java applet below is a very good illustration of Q-learning (thanks to Vander B. Frank):

For details of how the applet works, see Vander B. Frank's document, available through this PDF.

Coming soon: how Q-learning is implemented to improve the dribbling skill in RoboCup 2D Simulation (my MSc project).



Bibliography

1. Wikipedia: Q-learning [http://en.wikipedia.org/wiki/Q-learning].

2. Vander B. Frank: Q-learning. IRIDIA, Université Libre de Bruxelles. 7, 2003. [PDF]

3. Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England

Even God makes mistakes, so please let me know about any mistakes I have made.
