Deep Q-Learning with PyTorch


Download this notebook


Author: Oliver Mai


As presented in this YouTube video by Phil Tabor

Gym Environment: LunarLander-v2

This environment is inspired by a subfield of optimal control: rocket trajectory optimization. Sadly, the documentation is a bit lacking, but we will briefly go over the features of this environment.
In “LunarLander-v2” the agent (or human player) controls a spacecraft that has to be landed on a planetary surface. The lander can only move in a 2D plane (note: this environment requires the 2D physics engine “Box2D”, which can be installed via pip install -e '.[box2d]'). There is a (marked) flat landing pad at coordinates (0, 0), and the surrounding surface is a randomly generated polygon. The lander's coordinates are also the first two values of the state vector. There are four discrete actions available:

Index   Action
0       do nothing
1       fire left engine
2       fire main engine
3       fire right engine

While there is also a continuous version of this environment (“LunarLanderContinuous-v2”), according to Pontryagin’s maximum principle it is optimal to either fire an engine at full throttle or to turn it off, so the discrete version does just fine.
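
As a rough sketch of why (a simplified argument, assuming dynamics and running cost that are at most affine in the thrust $u \in [0, u_{\max}]$): the Hamiltonian of the maximum principle,

$$H(x, u, \lambda) = \lambda^\top\big(f(x) + B(x)\,u\big) + \ell(x, u),$$

is then affine in $u$, so its maximizer over the admissible interval lies at one of the endpoints $0$ or $u_{\max}$ (except on singular arcs where the coefficient of $u$ vanishes). This bang-bang structure is exactly what the discrete action set provides.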
How exactly rewards are given is not fully specified (the curious may find the corresponding source code helpful), but the available information is as follows:

  • simply moving from the top to the landing platform yields between 100 and 140 points
  • negative reward is given if the agent moves away from the platform
  • an episode ends when the lander crashes or comes to rest (yielding -100 and +100 points respectively)
  • each leg of the spacecraft which touches the ground gives +10 points
  • firing the main engine is -0.3 points per frame and firing the side engines is -0.03 points each frame
  • landing outside the landing pad is possible
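
These properties are easy to check directly from the environment; a minimal sketch (assuming Gym ≥ 0.26, whose API the rest of this notebook relies on as well):

import gym

env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,) -> eight-dimensional state vector
print(env.action_space)             # Discrete(4) -> the four actions from the table above

observation, info = env.reset(seed=0)
print(observation[:2])              # the first two entries are the lander's (x, y) coordinates
env.close()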

With that out of the way, let's move on to the implementation.

Deep Q-learning with Replay Memory

First off, we restate the Deep Q-learning algorithm as presented in the accompanying script:

  1. Initialize replay memory capacity
  2. Initialize neural network with random weights
  3. For each episode:
    1. Initialize the starting state
    2. For each time step:
      1. Select an action
        • via exploration or exploitation
      2. Execute selected action in an emulator
      3. Observe the reward and the next state
      4. Store the experience (state, action, reward, new state) in replay memory
      5. Sample random batch from replay memory
      6. Optionally: preprocess the states from the batch
      7. Pass batch of states to policy network
      8. Calculate loss between output Q-values and target Q-values
        • requires a second pass through the network for the next states (or an additional target network)
      9. Gradient descent updates weights in the policy network to minimize loss
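
For step 8, the target Q-values come from the Bellman equation: for a sampled transition $(s, a, r, s')$ the target is

$$y = r + \gamma \max_{a'} Q(s', a';\, \theta),$$

with $y = r$ if $s'$ is terminal, and the loss is the mean squared error between $y$ and the current estimate $Q(s, a;\, \theta)$. This is exactly what the learn method below will compute.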

We begin by importing the needed packages. Since the LunarLander-v2 environment yields observations of dimension eight, no convolutional layers are needed and only linear layers are used.

import torch as T # base PyTorch package
import torch.nn as nn # used to handle the layers of the neural network
import torch.nn.functional as F # for the ReLU activation function
import torch.optim as optim # for the Adam optimizer

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import gym

seed = 42
np.random.seed(seed)
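# optionally (an addition, not in the original notebook) also seed PyTorch so that
# the network's random weight initialization is reproducible
T.manual_seed(seed)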

We are going to create two classes: an agent class and a neural network class. We model it this way because the deep Q-network is part of the agent's decision making but distinct from the agent itself: the agent has additional functionality, such as learning and the replay memory, while the deep Q-network simply takes an observation as input and returns the agent's estimate of the action values.
A convention when working with PyTorch is that classes which extend the functionality of the base neural network derive from nn.Module, which gives access to a number of features, such as the network parameters for the optimizer and the backpropagation machinery.

class DeepQNetwork(nn.Module):
    def __init__(self, learning_rate, input_dims, fc1_dims, fc2_dims, n_actions):
        super(DeepQNetwork, self).__init__() # calls the constructor for the base class
        # save all needed variables in the class
        self.input_dims = input_dims # input dimensions
        self.fc1_dims = fc1_dims # dimension of first layer
        self.fc2_dims = fc2_dims # dimension of second layer
        self.n_actions = n_actions #number of available actions
                                   # strictly speaking it does not need to be an attribute, but it can be useful
        
        
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims) # first layer of NN, *[.] unpacks a list so the
                                                              # dimension of the observation may be 2D (here 1D)
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims) # second layer
        self.fc3 = nn.Linear(self.fc2_dims, self.n_actions) # output layer maps to the actions

        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate) # set Adam optimizer
        self.loss = nn.MSELoss() # Mean Squared Error loss for the loss function
        self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu') # set computing device:
                                                                             # use CUDA if available, otherwise the CPU
        self.to(self.device) # sends the network to the device; one of the main reasons for deriving from nn.Module

    def forward(self, state):
        # handles forward propagation; layer chaining and activation functions have to be applied explicitly
        x = F.relu(self.fc1(state)) # pass the state into the first fully connected layer and activate with ReLU
        x = F.relu(self.fc2(x)) # pass the output to the second fully connected layer, again with ReLU
        actions = self.fc3(x) # final output layer without activation, so we get the raw action-value estimates

        return actions
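
As a quick sanity check of the network class (a minimal sketch; the all-zero observation and the layer sizes are arbitrary choices for illustration):

net = DeepQNetwork(learning_rate=0.001, input_dims=[8], fc1_dims=256, fc2_dims=256, n_actions=4)
dummy_state = T.zeros((1, 8)).to(net.device)  # a batch containing a single all-zero observation
print(net.forward(dummy_state).shape)         # torch.Size([1, 4]): one value estimate per action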
        
    

This concludes the neural network class. The main functionality, however, lives in the agent class, which does not derive from any superclass:

class Agent():
    def __init__(self, gamma, epsilon, lr, input_dims, batch_size, n_actions,
            max_mem_size=100000, eps_end=0.05, eps_dec=5e-4):
        self.gamma = gamma # discount factor, weights future rewards
        self.epsilon = epsilon # epsilon-greedy parameter for managing explore-exploit dilemma
        self.eps_min = eps_end # finite minimum value to keep exploring 
        self.eps_dec = eps_dec # here we linearly decrement epsilon by eps_dec for each time step
        self.lr = lr # learning rate for the deep neural network
        self.n_actions = n_actions # number of available actions
        self.input_dims = input_dims # dimension of the state observation 
        self.action_space = [i for i in range(n_actions)] # lists integer representation of the actions
        self.mem_size = max_mem_size # size of replay memory
        self.batch_size = batch_size # batch size for learning from replay memory
        self.mem_cntr = 0 # keep track of the first available memory

        # set policy neural network
        self.Q_eval = DeepQNetwork(lr, n_actions=n_actions, input_dims=input_dims,
                                    fc1_dims=256, fc2_dims=256)

        # the replay memory is stored in named NumPy arrays; PyTorch is fairly strict about data types
        self.state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32) # tracks s_t
        self.new_state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32) # tracks s_t+1
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32) # tracks a_t
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32) # tracks r_t
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.bool_) # tracks whether s_t+1 is terminal

    def store_transition(self, state, action, reward, state_, terminal):
        # interface function that stores a transition to replay memory
        index = self.mem_cntr % self.mem_size # position of the first unoccupied memory
        # store appropriate values
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.reward_memory[index] = reward
        self.action_memory[index] = action
        self.terminal_memory[index] = terminal

        self.mem_cntr += 1 # keep track of stored memories

    def choose_action(self, observation):
        # epsilon-greedy strategy
        if np.random.random() > self.epsilon:
            # take the greedy action
            state = T.tensor(np.array([observation]), dtype=T.float32).to(self.Q_eval.device) # convert the observation
                                                                   # to a PyTorch tensor and send it to our device
            actions = self.Q_eval.forward(state) # pass the state to the neural network
            action = T.argmax(actions).item() # get the index of the action with the maximum value for this state
                                              # (.item() converts the returned tensor to a plain Python integer)
        else:
            # take a random action
            action = np.random.choice(self.action_space)

        return action

    def learn(self):
        # At the beginning the memory is all zeros, so we act without learning until at least one full batch of
        # transitions has been stored. (The learn function is called on every time step.)
        if self.mem_cntr < self.batch_size:
            return

        # in PyTorch the gradients accumulated by the optimizer have to be zeroed explicitly
        self.Q_eval.optimizer.zero_grad()
        
        # number of stored transitions that can be sampled from (the memory may not be full yet)
        max_mem = min(self.mem_cntr, self.mem_size)
        
        # sample a batch of memory indices without replacement to avoid duplicates
        batch = np.random.choice(max_mem, self.batch_size, replace=False)
        # keep track of the batches for slicing 
        batch_index = np.arange(self.batch_size, dtype=np.int32)

        # convert needed batches to tensors
        state_batch = T.tensor(self.state_memory[batch]).to(self.Q_eval.device)
        new_state_batch = T.tensor(self.new_state_memory[batch]).to(self.Q_eval.device)
        action_batch = self.action_memory[batch]
        reward_batch = T.tensor(self.reward_memory[batch]).to(self.Q_eval.device)
        terminal_batch = T.tensor(self.terminal_memory[batch]).to(self.Q_eval.device)

        # sample correct slices for loss function
        q_eval = self.Q_eval.forward(state_batch)[batch_index, action_batch]
        # evaluate on new states for target value in loss function
        q_next = self.Q_eval.forward(new_state_batch)
        # set target value for terminal states to 0.0
        q_next[terminal_batch] = 0.0
        
        # compute the target value (T.max returns a tuple of values and indices)
        q_target = reward_batch + self.gamma*T.max(q_next,dim=1)[0]

        # set loss function
        loss = self.Q_eval.loss(q_target, q_eval).to(self.Q_eval.device)
        # backpropagation for parameter optimization 
        loss.backward()
        # perform one optimizer step
        self.Q_eval.optimizer.step()

        # decrement epsilon linearly but not below minimum value
        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > self.eps_min \
                       else self.eps_min
        
    def save_model(self, path="data/pytorch_dpq_model.pt"):
        # save the Pytorch model
        T.save(self.Q_eval.state_dict(), path)
    
    def load_model(self, path="data/pytorch_dpq_model.pt"):
        # load a Pytorch model
        model = DeepQNetwork(self.lr, n_actions=self.n_actions, input_dims=self.input_dims, 
                             fc1_dims=256, fc2_dims=256)
        model.load_state_dict(T.load(path, map_location=model.device))
        model.eval()
        self.Q_eval = model
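
A short side note on the replay memory: since the write index is mem_cntr % mem_size, the memory acts as a circular buffer and the oldest transitions are overwritten once it is full. A quick illustration with a deliberately tiny buffer (the numbers are arbitrary):

demo = Agent(gamma=0.99, epsilon=1.0, lr=0.001, input_dims=[8], batch_size=4,
             n_actions=4, max_mem_size=5)
for t in range(7):
    demo.store_transition(np.full(8, t, dtype=np.float32), 0, 0.0,
                          np.zeros(8, dtype=np.float32), False)
print(demo.state_memory[:, 0])  # [5. 6. 2. 3. 4.]: transitions 0 and 1 have been overwritten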

This concludes the main part of handling the neural network during the learning procedure. The actual learning loop now looks pretty close to the pseudocode presented above:

env = gym.make("LunarLander-v2")
agent = Agent(gamma=0.99, epsilon=1.0, batch_size=64, n_actions=4,
              eps_end=0.005, eps_dec=3e-4, input_dims=[8], lr=0.001)
scores, eps_history = [], []

n_games = 1000 # this takes some time, consider loading the model instead

for i in range(n_games):
    score = 0
    done = False
    observation, info = env.reset(seed=seed+i)
    
    while not done:
        action = agent.choose_action(observation)
        new_observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated # the episode also ends when it is truncated by the time limit
        score += reward
        agent.store_transition(observation, action, reward, new_observation, terminated)
        
        agent.learn()
        observation = new_observation
    
    scores.append(score)
    eps_history.append(agent.epsilon)
    
    avg_score = np.mean(scores[-25:])
    
    print(f"Episode: {i}, Score: {score:.2f}, Average Score: {avg_score:.2f}, epsilon: {agent.epsilon:.3f}")
    
    if avg_score >= 195.0:
        break
    


Episode: 0, Score: -353.60, Average Score: -353.60, epsilon: 0.984
Episode: 1, Score: -89.18, Average Score: -221.39, epsilon: 0.962
Episode: 2, Score: -268.35, Average Score: -237.04, epsilon: 0.924
Episode: 3, Score: -177.44, Average Score: -222.14, epsilon: 0.892
Episode: 4, Score: -84.22, Average Score: -194.56, epsilon: 0.873
Episode: 5, Score: -333.54, Average Score: -217.72, epsilon: 0.828
Episode: 6, Score: -101.07, Average Score: -201.06, epsilon: 0.795
Episode: 7, Score: -85.50, Average Score: -186.61, epsilon: 0.756
Episode: 8, Score: -97.74, Average Score: -176.74, epsilon: 0.729
Episode: 9, Score: -128.65, Average Score: -171.93, epsilon: 0.687
Episode: 10, Score: -68.24, Average Score: -162.50, epsilon: 0.662
Episode: 11, Score: -118.17, Average Score: -158.81, epsilon: 0.624
Episode: 12, Score: -110.78, Average Score: -155.11, epsilon: 0.593
Episode: 13, Score: -175.19, Average Score: -156.55, epsilon: 0.564
Episode: 14, Score: -74.79, Average Score: -151.10, epsilon: 0.538
Episode: 15, Score: -80.72, Average Score: -146.70, epsilon: 0.475
Episode: 16, Score: -73.20, Average Score: -142.38, epsilon: 0.445
Episode: 17, Score: -31.38, Average Score: -136.21, epsilon: 0.359
Episode: 18, Score: -229.01, Average Score: -141.09, epsilon: 0.269
Episode: 19, Score: -278.46, Average Score: -147.96, epsilon: 0.179
Episode: 20, Score: 231.71, Average Score: -129.88, epsilon: 0.011
Episode: 21, Score: -62.71, Average Score: -126.83, epsilon: 0.005
Episode: 22, Score: -24.60, Average Score: -122.39, epsilon: 0.005
Episode: 23, Score: -135.73, Average Score: -122.94, epsilon: 0.005
Episode: 24, Score: -305.65, Average Score: -130.25, epsilon: 0.005
Episode: 25, Score: -144.92, Average Score: -121.90, epsilon: 0.005
Episode: 26, Score: -317.30, Average Score: -131.03, epsilon: 0.005
Episode: 27, Score: 76.37, Average Score: -117.24, epsilon: 0.005
Episode: 28, Score: -1246.18, Average Score: -159.99, epsilon: 0.005
Episode: 29, Score: -998.12, Average Score: -196.54, epsilon: 0.005
Episode: 30, Score: -1452.98, Average Score: -241.32, epsilon: 0.005
Episode: 31, Score: -272.61, Average Score: -248.18, epsilon: 0.005
Episode: 32, Score: -218.52, Average Score: -253.50, epsilon: 0.005
Episode: 33, Score: -387.17, Average Score: -265.08, epsilon: 0.005
Episode: 34, Score: -566.19, Average Score: -282.58, epsilon: 0.005
Episode: 35, Score: -258.92, Average Score: -290.21, epsilon: 0.005
Episode: 36, Score: -4600.70, Average Score: -469.51, epsilon: 0.005
Episode: 37, Score: -1522.48, Average Score: -525.98, epsilon: 0.005
Episode: 38, Score: -493.16, Average Score: -538.70, epsilon: 0.005
Episode: 39, Score: -316.52, Average Score: -548.37, epsilon: 0.005
Episode: 40, Score: 58.24, Average Score: -542.81, epsilon: 0.005
Episode: 41, Score: 28.95, Average Score: -538.72, epsilon: 0.005
Episode: 42, Score: -599.71, Average Score: -561.46, epsilon: 0.005
Episode: 43, Score: -189.55, Average Score: -559.88, epsilon: 0.005
Episode: 44, Score: -331.91, Average Score: -562.02, epsilon: 0.005
Episode: 45, Score: -98.31, Average Score: -575.22, epsilon: 0.005
Episode: 46, Score: -12.14, Average Score: -573.19, epsilon: 0.005
Episode: 47, Score: 14.46, Average Score: -571.63, epsilon: 0.005
Episode: 48, Score: 141.98, Average Score: -560.52, epsilon: 0.005
Episode: 49, Score: 89.02, Average Score: -544.74, epsilon: 0.005
Episode: 50, Score: -233.71, Average Score: -548.29, epsilon: 0.005
Episode: 51, Score: -1186.69, Average Score: -583.06, epsilon: 0.005
Episode: 52, Score: 214.11, Average Score: -577.55, epsilon: 0.005
Episode: 53, Score: 176.13, Average Score: -520.66, epsilon: 0.005
Episode: 54, Score: -15.86, Average Score: -481.37, epsilon: 0.005
Episode: 55, Score: 205.67, Average Score: -415.02, epsilon: 0.005
Episode: 56, Score: 183.83, Average Score: -396.77, epsilon: 0.005
Episode: 57, Score: 222.54, Average Score: -379.12, epsilon: 0.005
Episode: 58, Score: 221.08, Average Score: -354.79, epsilon: 0.005
Episode: 59, Score: 20.13, Average Score: -331.34, epsilon: 0.005
Episode: 60, Score: -48.82, Average Score: -322.94, epsilon: 0.005
Episode: 61, Score: 62.86, Average Score: -136.39, epsilon: 0.005
Episode: 62, Score: 16.09, Average Score: -74.85, epsilon: 0.005
Episode: 63, Score: 229.43, Average Score: -45.95, epsilon: 0.005
Episode: 64, Score: 267.28, Average Score: -22.60, epsilon: 0.005
Episode: 65, Score: 272.53, Average Score: -14.02, epsilon: 0.005
Episode: 66, Score: -303.45, Average Score: -27.32, epsilon: 0.005
Episode: 67, Score: 231.08, Average Score: 5.91, epsilon: 0.005
Episode: 68, Score: 75.12, Average Score: 16.50, epsilon: 0.005
Episode: 69, Score: 27.60, Average Score: 30.88, epsilon: 0.005
Episode: 70, Score: 24.04, Average Score: 35.77, epsilon: 0.005
Episode: 71, Score: 253.64, Average Score: 46.40, epsilon: 0.005
Episode: 72, Score: 231.01, Average Score: 55.07, epsilon: 0.005
Episode: 73, Score: 211.18, Average Score: 57.83, epsilon: 0.005
Episode: 74, Score: -58.08, Average Score: 51.95, epsilon: 0.005
Episode: 75, Score: -77.80, Average Score: 58.19, epsilon: 0.005
Episode: 76, Score: -6.50, Average Score: 105.39, epsilon: 0.005
Episode: 77, Score: 34.98, Average Score: 98.23, epsilon: 0.005
Episode: 78, Score: 191.85, Average Score: 98.86, epsilon: 0.005
Episode: 79, Score: -195.03, Average Score: 91.69, epsilon: 0.005
Episode: 80, Score: 242.65, Average Score: 93.17, epsilon: 0.005
Episode: 81, Score: -227.59, Average Score: 76.71, epsilon: 0.005
Episode: 82, Score: 65.91, Average Score: 70.45, epsilon: 0.005
Episode: 83, Score: 245.21, Average Score: 71.41, epsilon: 0.005
Episode: 84, Score: 45.07, Average Score: 72.41, epsilon: 0.005
Episode: 85, Score: -80.40, Average Score: 71.15, epsilon: 0.005
Episode: 86, Score: 247.49, Average Score: 78.53, epsilon: 0.005
Episode: 87, Score: 265.83, Average Score: 88.52, epsilon: 0.005
Episode: 88, Score: -283.82, Average Score: 67.99, epsilon: 0.005
Episode: 89, Score: 225.82, Average Score: 66.33, epsilon: 0.005
Episode: 90, Score: 258.58, Average Score: 65.78, epsilon: 0.005
Episode: 91, Score: 23.72, Average Score: 78.86, epsilon: 0.005
Episode: 92, Score: 59.80, Average Score: 72.01, epsilon: 0.005
Episode: 93, Score: 246.94, Average Score: 78.88, epsilon: 0.005
Episode: 94, Score: 257.66, Average Score: 88.09, epsilon: 0.005
Episode: 95, Score: 194.61, Average Score: 94.91, epsilon: 0.005
Episode: 96, Score: 256.79, Average Score: 95.03, epsilon: 0.005
Episode: 97, Score: 220.79, Average Score: 94.63, epsilon: 0.005
Episode: 98, Score: 254.20, Average Score: 96.35, epsilon: 0.005
Episode: 99, Score: 217.97, Average Score: 107.39, epsilon: 0.005
Episode: 100, Score: 210.48, Average Score: 118.92, epsilon: 0.005
Episode: 101, Score: 201.28, Average Score: 127.23, epsilon: 0.005
Episode: 102, Score: 231.45, Average Score: 135.09, epsilon: 0.005
Episode: 103, Score: 249.38, Average Score: 137.39, epsilon: 0.005
Episode: 104, Score: 258.86, Average Score: 155.55, epsilon: 0.005
Episode: 105, Score: 211.89, Average Score: 154.32, epsilon: 0.005
Episode: 106, Score: 277.04, Average Score: 174.50, epsilon: 0.005
Episode: 107, Score: 17.98, Average Score: 172.58, epsilon: 0.005
Episode: 108, Score: 15.99, Average Score: 163.41, epsilon: 0.005
Episode: 109, Score: 231.85, Average Score: 170.89, epsilon: 0.005
Episode: 110, Score: -207.45, Average Score: 165.80, epsilon: 0.005
Episode: 111, Score: 185.93, Average Score: 163.34, epsilon: 0.005
Episode: 112, Score: 195.31, Average Score: 160.52, epsilon: 0.005
Episode: 113, Score: 277.58, Average Score: 182.98, epsilon: 0.005
Episode: 114, Score: 240.07, Average Score: 183.55, epsilon: 0.005
Episode: 115, Score: 234.77, Average Score: 182.59, epsilon: 0.005
Episode: 116, Score: 238.36, Average Score: 191.18, epsilon: 0.005
Episode: 117, Score: 255.11, Average Score: 198.99, epsilon: 0.005

We now take a look at the performance during learning:

def plotLearning(x, scores, epsilons, filename, lines=None):
    fig=plt.figure()
    ax=fig.add_subplot(111, label="1")
    ax2=fig.add_subplot(111, label="2", frame_on=False)

    ax.plot(x, epsilons, color="C0")
    ax.set_xlabel("Game", color="C0")
    ax.set_ylabel("Epsilon", color="C0")
    ax.tick_params(axis='x', colors="C0")
    ax.tick_params(axis='y', colors="C0")

    N = len(scores)
    running_avg = np.empty(N)
    for t in range(N):
        running_avg[t] = np.mean(scores[max(0, t-20):(t+1)])

    ax2.scatter(x, running_avg, color="C1")
    #ax2.xaxis.tick_top()
    ax2.axes.get_xaxis().set_visible(False)
    ax2.yaxis.tick_right()
    #ax2.set_xlabel('x label 2', color="C1")
    ax2.set_ylabel('Score', color="C1")
    #ax2.xaxis.set_label_position('top')
    ax2.yaxis.set_label_position('right')
    #ax2.tick_params(axis='x', colors="C1")
    ax2.tick_params(axis='y', colors="C1")

    if lines is not None:
        for line in lines:
            plt.axvline(x=line)

    plt.savefig(filename)
    plt.show()

x = [i+1 for i in range(len(scores))]
filename = "diagrams/lunar_lander.png"
plotLearning(x, scores, eps_history, filename)

[Figure: learning curve showing epsilon per game (left axis) and the running average score (right axis)]

agent.save_model(path="data/pytorch_dpq_model.pt")

Finally, we can take a look at how the trained agent navigates the environment:

# agent = Agent(gamma=0.99, epsilon=1.0, batch_size=64, n_actions=4,
#               eps_end=0.01, input_dims=[8], lr=0.001)
# agent.load_model()
env = gym.make("LunarLander-v2", render_mode="rgb_array_list")
observation, info = env.reset(seed=seed)
for _ in range(1000):
    action = agent.choose_action(observation) # select an action with the trained policy (epsilon is still at
                                               # eps_min, so a tiny fraction of actions remain random)
    observation, reward, finished, truncated, info = env.step(action) # take the selected action in the environment
    if finished or truncated:
        break

frames = env.render()
env.close()
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, writers
fig, ax = plt.subplots()
img = ax.imshow(frames[0])
ax.axis("off")
def animate(frame_num):
    ax.set_title(f"Step: {frame_num}")
    img.set_data(frames[frame_num])
    return img

anim = FuncAnimation(fig, animate, frames=len(frames), interval=10)
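
To preview the animation inline in a Jupyter notebook (assuming IPython is available), it can also be rendered as HTML/JavaScript:

from IPython.display import HTML
HTML(anim.to_jshtml())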

[Figure: first frame of the rendered LunarLander episode]

Writer = writers["ffmpeg"]
writer = Writer(fps=15, bitrate=1800)
anim.save("diagrams/LunarLander_Video_1.mp4", writer=writer)
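
If ffmpeg is not available, saving a GIF via Pillow is a possible fallback (a sketch, assuming the pillow package is installed):

from matplotlib.animation import PillowWriter
anim.save("diagrams/LunarLander_Video_1.gif", writer=PillowWriter(fps=15))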