Deep Reinforcement Learning with Double Q-learning
What is Double DQN and why is it used?
To understand this problem, remember how we calculate the TD target: the max operator uses the same values both to select and to evaluate an action. This makes it likely to select overestimated values, resulting in overoptimistic value estimates. The solution is to decouple the action selection from the target Q-value generation by using two networks when we compute the Q target. We:
- Use our DQN network to select the best action to take for the next state (the action with the highest Q-value).
- Use our Target network to calculate the target Q-value of taking that action at the next state.
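The two steps above can be sketched as a small function. This is a minimal illustration, not the full training loop: the function names and array shapes are assumptions, and the Q-values are passed in as plain arrays rather than produced by real networks.

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Compute Double DQN targets for a batch of transitions.

    q_online_next: (batch, n_actions) Q-values of the online DQN at the next state s'
    q_target_next: (batch, n_actions) Q-values of the target network at s'
    rewards, dones: (batch,) arrays; dones is 1.0 for terminal transitions
    """
    # Step 1: the online DQN selects the best next action (highest Q-value).
    best_actions = np.argmax(q_online_next, axis=1)
    # Step 2: the target network evaluates that selected action.
    evaluated_q = q_target_next[np.arange(len(best_actions)), best_actions]
    # TD target: r + gamma * Q_target(s', argmax_a Q_online(s', a)),
    # with no bootstrap term for terminal states.
    return rewards + gamma * (1.0 - dones) * evaluated_q
```

Note that if the same array were used for both `q_online_next` and `q_target_next`, this would collapse back to the standard DQN target, where the max both selects and evaluates.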
Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and achieve more stable learning.
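In symbols, the difference between the two targets can be written as follows (using the common notation of \(\theta\) for the online network's parameters and \(\theta^-\) for the target network's, which the text does not introduce explicitly):

```latex
% Standard DQN target: one network both selects and evaluates the action
y^{\text{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-)

% Double DQN target: the online network selects, the target network evaluates
y^{\text{DoubleDQN}} = r + \gamma \, Q\!\big(s',\ \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\big)
```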
The idea behind Double Q-learning (which predates DQN) was introduced by Hado van Hasselt in the paper Double Q-learning.