Q-learning [Watkins, 1989] is one of the most popular reinforcement learning methods. One of the advantages of Q-learning is its ability to compare the expected utility of the available actions without requiring a model of the environment.
The core of Q-learning is the following update rule:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right]$$

Where:
$Q_t(s_t, a_t)$ is the Q-value at time $t$ for state $s_t$ and action $a_t$.
$r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
$\alpha$ is the learning rate. The learning rate determines how strongly new information overrides old information. If $\alpha$ is 0, the agent does not learn anything. If $\alpha$ is 1, only the new information is considered and all old information is discarded.
$\gamma$ is the discount factor. It lies in the range [0, 1] and is used to weight near-term reinforcement more heavily than distant future reinforcement. The closer $\gamma$ is to 1, the greater the weight of future reinforcement.
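To make the roles of the learning rate and the discount factor concrete, here is a minimal Python sketch of a single Q-learning update. The function names `q_update` and `updated_value`, the state and action labels, and the numeric values are all made up for illustration; they do not come from any particular library or from the applet discussed later.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, next_actions, alpha, gamma):
    """Apply one Q-learning update to Q in place and return the new Q-value."""
    # Best Q-value reachable from the next state (0.0 if it offers no actions).
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    # Target = immediate reward + discounted best future value.
    target = reward + gamma * best_next
    # Move the old estimate towards the target by a fraction alpha.
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


def updated_value(alpha):
    """Hypothetical example: old estimate 1.0, reward 1.0, best next value 4.0."""
    Q = defaultdict(float)
    Q[("s0", "right")] = 1.0
    Q[("s1", "left")] = 2.0
    Q[("s1", "right")] = 4.0
    return q_update(Q, "s0", "right", 1.0, "s1", ["left", "right"],
                    alpha=alpha, gamma=0.9)


for alpha in (0.0, 0.5, 1.0):
    print(alpha, updated_value(alpha))
```

With a learning rate of 0 the old estimate of 1.0 is left unchanged. With a learning rate of 1 it is replaced entirely by the target (about 4.6 here). A learning rate of 0.5 lands halfway between the two.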
So what does the equation mean? Assume $\alpha = 1$ and $\gamma = 1$; the equation then becomes:

$$Q_{t+1}(s_t, a_t) = r_{t+1} + \max_{a} Q_t(s_{t+1}, a)$$

It is now easy to see that the Q-value of the state-action pair ($s_t$, $a_t$) equals the reward of action $a_t$ plus the maximum Q-value of the next state (over all next actions). The learning method is essentially a dynamic programming algorithm that gives the optimal Q-value for each state-action pair.
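The simplified case (learning rate 1, discount factor 1) can be tried out on a toy problem. The sketch below assumes a hypothetical four-cell corridor in which every step costs -1 and cell 3 is the goal. For brevity it sweeps over all state-action pairs instead of following an agent, which is a simplification of how Q-learning is normally run.

```python
GOAL = 3                      # hypothetical terminal cell
ACTIONS = ("left", "right")


def step(state, action):
    """Deterministic corridor: move one cell, paying a reward of -1 per step."""
    next_state = max(state - 1, 0) if action == "left" else state + 1
    return next_state, -1.0


def best_value(Q, state):
    """Maximum Q-value over all actions in a state; the goal is worth 0."""
    if state == GOAL:
        return 0.0
    return max(Q[(state, a)] for a in ACTIONS)


# Start with every Q-value at zero.
Q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}

# With alpha = 1 and gamma = 1 the update is simply Q = reward + max next Q.
for _ in range(20):
    for s in range(GOAL):
        for a in ACTIONS:
            next_state, reward = step(s, a)
            Q[(s, a)] = reward + best_value(Q, next_state)

for key in sorted(Q):
    print(key, Q[key])
```

After a few sweeps the values stop changing. Each Q-value is then minus the number of steps needed to reach the goal when starting with that action, so moving right from cell 2 is worth -1 and moving right from cell 0 is worth -3.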
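When discounting is enabled ($\gamma < 1$), rewards are worth less the further in the future they arrive, and hence the total reward at time $t$ is given by:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$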
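As a small worked example of the discounted total reward, the following sketch sums a finite reward sequence. The helper name `discounted_return` and the reward values are assumptions chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step is: reward + gamma * (return of what follows).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]                       # three future rewards of 1
print(discounted_return(rewards, gamma=0.5))    # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=1.0))    # no discounting: 3.0
```

The same three rewards are worth 1.75 with a discount factor of 0.5 but 3.0 without discounting, which shows how a smaller discount factor shrinks the contribution of later rewards.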
The Java applet below is a very good illustration of Q-learning (thanks to Vander B. Frank):
For details on how the applet works, please see Vander B. Frank's document (reference 2 in the bibliography).
Coming soon: how Q-learning is implemented to improve the dribbling skill in RoboCup 2D Simulation (my MSc project).
Bibliography
1. Wikipedia: Q-learning [http://en.wikipedia.org/wiki/Q-learning].
2. Vander B. Frank: Q-learning. IRIDIA, Université Libre de Bruxelles. 7, 2003. [PDF]
3. Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England
Even God makes mistakes, so please let me know about any mistakes I have made.