Playing Atari with Deep Reinforcement Learning

Give the pseudocode for Deep Q-learning with Experience Replay.


Initialize replay memory $D$ to capacity $N$
Initialize action-value function $Q$ with random weights $\theta$
Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
For episode $= 1, M$ do
  Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
  For $t = 1, T$ do
    With probability $\epsilon$ select a random action $a_t$,
    otherwise select $a_t = \operatorname{argmax}_a Q(\phi(s_t), a; \theta)$
    Execute action $a_t$ in the emulator and observe reward $r_t$ and image $x_{t+1}$
    Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
    Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
    Sample a random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
    Set $y_j = \begin{cases} r_j & \text{if episode terminates at step } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-) & \text{otherwise}\end{cases}$
    Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the network parameters $\theta$
    Every $C$ steps reset $\hat{Q} = Q$
  End for
End for
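The target computation in the pseudocode above can be sketched in plain Python. This is an illustrative sketch, not the paper's code: `td_target` and its parameters are hypothetical names, and `q_target` stands in for the frozen target network $\hat{Q}$.

```python
def td_target(r_j, phi_j1, terminal, q_target, actions, gamma=0.99):
    """Compute the DQN target y_j for one sampled transition.

    q_target: any callable (state, action) -> value, standing in for
    the frozen target network Q-hat with parameters theta^-.
    """
    if terminal:
        # Episode ended at step j+1: no future reward to bootstrap from
        return r_j
    # Bootstrap with the best action value under the target network
    return r_j + gamma * max(q_target(phi_j1, a) for a in actions)
```

In practice this is computed for a whole minibatch at once, and the squared error $(y_j - Q(\phi_j, a_j; \theta))^2$ is minimized by gradient descent on $\theta$ only, with $\theta^-$ held fixed between the periodic resets.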

Why do we create a replay memory in Deep Q-learning?


Experience replay in Deep Q-Learning serves two purposes:

  1. Make more efficient use of experiences during training. In online RL, the agent typically interacts with the environment, collects an experience (state, action, reward, and next state), learns from it (updates the neural network), and then discards it. This is wasteful.

Experience replay lets us use training experiences more efficiently: a replay buffer stores experience samples that can be reused during training, so the agent can learn from the same experiences multiple times.

  2. Avoid forgetting previous experiences and reduce the correlation between experiences. By sampling experiences at random, we break the correlation in the observation sequences and prevent the action values from oscillating or diverging catastrophically.
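A minimal sketch of such a replay buffer in Python, using only the standard library (class and method names are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transitions once full
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling (without replacement) breaks the
        # temporal correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

At each environment step the agent calls `store`, and at each learning step it calls `sample` to draw a decorrelated minibatch for the gradient update.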

See also: https://huggingface.co/deep-rl-course/unit3/deep-q-algorithm

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian.