Deep Q-Learning with PyTorch
Author: Oliver Mai
As presented in this YouTube video by Phil Tabor
Gym Environment: LunarLander-v2
This environment is inspired by a subfield of Optimal Control: rocket trajectory optimization. Sadly the official documentation is a bit sparse, so we briefly go over the main features of this environment.
In “LunarLander-v2” the agent (or human player) controls a spacecraft that is supposed to be landed on a planetary surface. The lander can only move in a 2D plane (note: this environment requires the 2D physics engine “Box2D”, which can be installed via pip install -e '.[box2d]' from within a clone of the gym repository). There is a (marked) flat landing pad at coordinates (0,0), and the surrounding surface is a randomly generated polygon. The coordinates of the spacecraft are also the first two values of the state vector. Four discrete actions are available:
| Index | Action |
| --- | --- |
| 0 | do nothing |
| 1 | fire left engine |
| 2 | fire main engine |
| 3 | fire right engine |
While there is also a continuous version of this environment (“LunarLanderContinuous-v2”), according to Pontryagin’s maximum principle it is optimal to either fire an engine at full throttle or turn it off, so this discrete version does fine.
How rewards are given isn’t entirely specified in the documentation (the curious may find the corresponding source code helpful), but the information given is as follows:
- simply moving from the top to the landing platform yields between 100 and 140 points
- negative reward is given if the agent moves away from the platform
- an episode ends when the lander crashes or comes to rest (yielding -100 and +100 points respectively)
- each leg of the spacecraft which touches the ground gives +10 points
- firing the main engine is -0.3 points per frame and firing the side engines is -0.03 points each frame
- landing outside the landing pad is possible
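To get a feel for the environment, it can help to inspect the observation and action spaces directly (a minimal sketch, assuming gym and Box2D are installed):

import gym

env = gym.make("LunarLander-v2")
print(env.observation_space)  # Box with 8 entries: positions, velocities, angle, angular velocity, leg contacts
print(env.action_space)       # Discrete(4): the four actions from the table above
env.close()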
With that out of the way, let’s move on to the implementation.
Deep Q-learning with Replay Memory
First off, we restate the Deep Q-learning algorithm as presented in the accompanying script:
- Initialize replay memory capacity
- Initialize neural network with random weights
- For each episode:
- Initialize the starting state
- For each time step:
- Select an action
- via exploration or exploitation
- Execute selected action in an emulator
- Observe the reward and the next state
- Store the experience (state, action, reward, new state) in replay memory
- Sample random batch from replay memory
- Optionally: preprocess states from batch
- Pass batch of states to policy network
- Calculate loss between output Q-values and target Q-values
- requires a second pass through the network for the next state (or an additional target network); see the formula below
- Gradient descent updates weights in the policy network to minimize loss
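For a sampled transition $(s, a, r, s')$, the “target Q-value” mentioned above is the usual one-step bootstrap estimate

$$y = r + \gamma \, \max_{a'} Q(s', a'; \theta),$$

with $y = r$ if $s'$ is terminal; the loss is then the mean squared error between $y$ and $Q(s, a; \theta)$. Note that the implementation below evaluates this target with the same network that is being trained; many DQN variants instead use a separate, periodically updated target network for additional stability.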
We begin by importing the needed packages. Since the LunarLander-v2 environment yields observations of dimension eight, no convolutional layers are needed and only linear layers are used.
import torch as T # base Pytorch package
import torch.nn as nn # used to handle layers of the neural network
import torch.nn.functional as F # for the ReLU activation function
import torch.optim as optim # for the Adam Optimizer
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import gym
seed = 42
np.random.seed(seed)
We are going to create two classes: an agent class and a neural network class. We model it this way because the deep Q-network is part of the agent’s decision making but distinct from the agent itself: the agent has additional functionality, such as learning and replay memory, while the deep Q-network simply takes an observation as input and returns the agent’s estimates of the action values.
A convention when working with PyTorch is that classes extending the functionality of the base neural network derive from nn.Module, which gives access to a number of features, such as the network parameters for the optimizer and the backpropagation machinery.
class DeepQNetwork(nn.Module):
def __init__(self, learning_rate, input_dims, fc1_dims, fc2_dims, n_actions):
super(DeepQNetwork, self).__init__() # calls the constructor for the base class
# save all needed variables in the class
self.input_dims = input_dims # input dimensions
self.fc1_dims = fc1_dims # dimension of first layer
self.fc2_dims = fc2_dims # dimension of second layer
        self.n_actions = n_actions # number of available actions
        # strictly does not need to be an attribute, but can be useful
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims) # first layer of the NN; *[.] unpacks the
        # input_dims list (which here describes a 1D observation)
self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims) # second layer
self.fc3 = nn.Linear(self.fc2_dims, self.n_actions) # output layer maps to the actions
self.optimizer = optim.Adam(self.parameters(), lr=learning_rate) # set Adam optimizer
self.loss = nn.MSELoss() # Mean Squared Error loss for the loss function
self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu') # set computing device
        # use cuda if available, otherwise the CPU
        self.to(self.device) # sends the network to the device; one of the main reasons for deriving from nn.Module
def forward(self, state):
        # handles forward propagation; activation functions must be applied explicitly here
        x = F.relu(self.fc1(state)) # pass the state into the first fully connected layer, activated with ReLU
        x = F.relu(self.fc2(x)) # pass the output to the second fully connected layer, again with ReLU
        actions = self.fc3(x) # final output layer, no activation so we get the raw action-value estimates
return actions
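As a quick sanity check, one can instantiate the network and pass a dummy observation through it (a minimal sketch; the dimensions match the LunarLander setup used later):

net = DeepQNetwork(learning_rate=0.001, input_dims=[8], fc1_dims=256, fc2_dims=256, n_actions=4)
dummy_state = T.zeros((1, 8)).to(net.device)  # batch containing a single all-zero observation
print(net.forward(dummy_state).shape)         # torch.Size([1, 4]): one Q-value estimate per action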
This concludes the neural network class; the main functionality, however, is found in the agent class, which does not derive from any superclass:
class Agent():
def __init__(self, gamma, epsilon, lr, input_dims, batch_size, n_actions,
max_mem_size=100000, eps_end=0.05, eps_dec=5e-4):
self.gamma = gamma # discount factor, weights future rewards
self.epsilon = epsilon # epsilon-greedy parameter for managing explore-exploit dilemma
self.eps_min = eps_end # finite minimum value to keep exploring
self.eps_dec = eps_dec # here we linearly decrement epsilon by eps_dec for each time step
self.lr = lr # learning rate for the deep neural network
self.n_actions = n_actions # number of available actions
self.input_dims = input_dims # dimension of the state observation
self.action_space = [i for i in range(n_actions)] # lists integer representation of the actions
self.mem_size = max_mem_size # size of replay memory
self.batch_size = batch_size # batch size for learning from replay memory
self.mem_cntr = 0 # keep track of the first available memory
# set policy neural network
self.Q_eval = DeepQNetwork(lr, n_actions=n_actions, input_dims=input_dims,
fc1_dims=256, fc2_dims=256)
        # store replay memory as separate numpy arrays; PyTorch is strict about data types, hence the explicit dtypes
self.state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32) # tracks s_t
self.new_state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32) # tracks s_t+1
self.action_memory = np.zeros(self.mem_size, dtype=np.int32) # tracks a_t
self.reward_memory = np.zeros(self.mem_size, dtype=np.float32) # tracks r_t
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.bool_) # tracks whether t+1 is terminal or not
def store_transition(self, state, action, reward, state_, terminal):
# interface function that stores a transition to replay memory
index = self.mem_cntr % self.mem_size # position of the first unoccupied memory
# store appropriate values
self.state_memory[index] = state
self.new_state_memory[index] = state_
self.reward_memory[index] = reward
self.action_memory[index] = action
self.terminal_memory[index] = terminal
self.mem_cntr += 1 # keep track of stored memories
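    # Note on the ring buffer above: once mem_cntr exceeds mem_size, the modulo wraps the index
    # back to 0, so the oldest transitions are silently overwritten by the newest ones.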
def choose_action(self, observation):
# epsilon-greedy strategy
if np.random.random() > self.epsilon:
# take the greedy action
            state = T.tensor(np.array([observation]), dtype=T.float32).to(self.Q_eval.device)
            # convert the observation to a PyTorch tensor (wrapping it in np.array first avoids a
            # slow list-of-ndarrays conversion) and send it to our device
actions = self.Q_eval.forward(state) # pass the state to the neural network
            action = T.argmax(actions).item() # get index of the action with maximum value for this state
            # (.item() converts the returned single-element tensor to a plain Python int)
else:
# take a random action
action = np.random.choice(self.action_space)
return action
def learn(self):
        # In the beginning the memory is all zeros, so we act without learning until at least one
        # full batch of transitions has been stored. (The learn function is called on every iteration.)
if self.mem_cntr < self.batch_size:
return
        # In PyTorch the accumulated gradients must be zeroed out before each optimization step
self.Q_eval.optimizer.zero_grad()
        # number of occupied memory slots (the memory may not be full yet)
max_mem = min(self.mem_cntr, self.mem_size)
        # sample a batch of memory indices without replacement to avoid duplicates
batch = np.random.choice(max_mem, self.batch_size, replace=False)
        # index array along the batch dimension, needed for slicing below
batch_index = np.arange(self.batch_size, dtype=np.int32)
# convert needed batches to tensors
state_batch = T.tensor(self.state_memory[batch]).to(self.Q_eval.device)
new_state_batch = T.tensor(self.new_state_memory[batch]).to(self.Q_eval.device)
action_batch = self.action_memory[batch]
reward_batch = T.tensor(self.reward_memory[batch]).to(self.Q_eval.device)
terminal_batch = T.tensor(self.terminal_memory[batch]).to(self.Q_eval.device)
        # Q-values of the actions that were actually taken in the sampled states
q_eval = self.Q_eval.forward(state_batch)[batch_index, action_batch]
# evaluate on new states for target value in loss function
q_next = self.Q_eval.forward(new_state_batch)
# set target value for terminal states to 0.0
q_next[terminal_batch] = 0.0
        # compute the bootstrapped target value (T.max returns a tuple of values and indices)
        q_target = reward_batch + self.gamma * T.max(q_next, dim=1)[0]
        # compute the MSE loss between target and current estimate
        loss = self.Q_eval.loss(q_target, q_eval).to(self.Q_eval.device)
# backpropagation for parameter optimization
loss.backward()
# perform one optimizer step
self.Q_eval.optimizer.step()
# decrement epsilon linearly but not below minimum value
self.epsilon = self.epsilon - self.eps_dec if self.epsilon > self.eps_min \
else self.eps_min
def save_model(self, path="data/pytorch_dpq_model.pt"):
# save the Pytorch model
T.save(self.Q_eval.state_dict(), path)
def load_model(self, path="data/pytorch_dpq_model.pt"):
# load a Pytorch model
model = DeepQNetwork(self.lr, n_actions=self.n_actions, input_dims=self.input_dims,
fc1_dims=256, fc2_dims=256)
model.load_state_dict(T.load(path))
model.eval()
self.Q_eval = model
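One consequence of decrementing epsilon once per learn() call is that the length of the exploration phase is easy to compute up front (a quick back-of-the-envelope sketch with the hyperparameters of the training run below):

epsilon, eps_min, eps_dec = 1.0, 0.005, 3e-4  # values used in the training run below
steps_to_min = (epsilon - eps_min) / eps_dec
print(steps_to_min)  # ~3317 learning steps until the agent acts (almost) fully greedily

This matches the printed log below, where epsilon reaches its minimum after roughly twenty episodes.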
This concludes the main part of handling the neural network during the learning procedure. The actual learning loop now looks pretty close to the presented pseudocode:
env = gym.make("LunarLander-v2")
agent = Agent(gamma=0.99, epsilon=1.0, batch_size=64, n_actions=4,
eps_end=0.005,eps_dec=3e-4, input_dims=[8], lr=0.001)
scores, eps_history = [], []
n_games = 1000 # this takes some time, consider loading the model instead
for i in range(n_games):
    score = 0
    done = False
    observation, info = env.reset(seed=seed+i)
    while not done:
        action = agent.choose_action(observation)
        new_observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # an episode can end by termination (crash/rest) or truncation (time limit)
        score += reward
        agent.store_transition(observation, action, reward, new_observation, terminated)
        agent.learn()
        observation = new_observation
scores.append(score)
eps_history.append(agent.epsilon)
avg_score = np.mean(scores[-25:])
print(f"Episode: {i}, Score: {score:.2f}, Average Score: {avg_score:.2f}, epsilon: {agent.epsilon:.3f}")
    if avg_score >= 195.0:
break
Episode: 0, Score: -353.60, Average Score: -353.60, epsilon: 0.984
Episode: 1, Score: -89.18, Average Score: -221.39, epsilon: 0.962
Episode: 2, Score: -268.35, Average Score: -237.04, epsilon: 0.924
Episode: 3, Score: -177.44, Average Score: -222.14, epsilon: 0.892
Episode: 4, Score: -84.22, Average Score: -194.56, epsilon: 0.873
Episode: 5, Score: -333.54, Average Score: -217.72, epsilon: 0.828
Episode: 6, Score: -101.07, Average Score: -201.06, epsilon: 0.795
Episode: 7, Score: -85.50, Average Score: -186.61, epsilon: 0.756
Episode: 8, Score: -97.74, Average Score: -176.74, epsilon: 0.729
Episode: 9, Score: -128.65, Average Score: -171.93, epsilon: 0.687
Episode: 10, Score: -68.24, Average Score: -162.50, epsilon: 0.662
Episode: 11, Score: -118.17, Average Score: -158.81, epsilon: 0.624
Episode: 12, Score: -110.78, Average Score: -155.11, epsilon: 0.593
Episode: 13, Score: -175.19, Average Score: -156.55, epsilon: 0.564
Episode: 14, Score: -74.79, Average Score: -151.10, epsilon: 0.538
Episode: 15, Score: -80.72, Average Score: -146.70, epsilon: 0.475
Episode: 16, Score: -73.20, Average Score: -142.38, epsilon: 0.445
Episode: 17, Score: -31.38, Average Score: -136.21, epsilon: 0.359
Episode: 18, Score: -229.01, Average Score: -141.09, epsilon: 0.269
Episode: 19, Score: -278.46, Average Score: -147.96, epsilon: 0.179
Episode: 20, Score: 231.71, Average Score: -129.88, epsilon: 0.011
Episode: 21, Score: -62.71, Average Score: -126.83, epsilon: 0.005
Episode: 22, Score: -24.60, Average Score: -122.39, epsilon: 0.005
Episode: 23, Score: -135.73, Average Score: -122.94, epsilon: 0.005
Episode: 24, Score: -305.65, Average Score: -130.25, epsilon: 0.005
Episode: 25, Score: -144.92, Average Score: -121.90, epsilon: 0.005
Episode: 26, Score: -317.30, Average Score: -131.03, epsilon: 0.005
Episode: 27, Score: 76.37, Average Score: -117.24, epsilon: 0.005
Episode: 28, Score: -1246.18, Average Score: -159.99, epsilon: 0.005
Episode: 29, Score: -998.12, Average Score: -196.54, epsilon: 0.005
Episode: 30, Score: -1452.98, Average Score: -241.32, epsilon: 0.005
Episode: 31, Score: -272.61, Average Score: -248.18, epsilon: 0.005
Episode: 32, Score: -218.52, Average Score: -253.50, epsilon: 0.005
Episode: 33, Score: -387.17, Average Score: -265.08, epsilon: 0.005
Episode: 34, Score: -566.19, Average Score: -282.58, epsilon: 0.005
Episode: 35, Score: -258.92, Average Score: -290.21, epsilon: 0.005
Episode: 36, Score: -4600.70, Average Score: -469.51, epsilon: 0.005
Episode: 37, Score: -1522.48, Average Score: -525.98, epsilon: 0.005
Episode: 38, Score: -493.16, Average Score: -538.70, epsilon: 0.005
Episode: 39, Score: -316.52, Average Score: -548.37, epsilon: 0.005
Episode: 40, Score: 58.24, Average Score: -542.81, epsilon: 0.005
Episode: 41, Score: 28.95, Average Score: -538.72, epsilon: 0.005
Episode: 42, Score: -599.71, Average Score: -561.46, epsilon: 0.005
Episode: 43, Score: -189.55, Average Score: -559.88, epsilon: 0.005
Episode: 44, Score: -331.91, Average Score: -562.02, epsilon: 0.005
Episode: 45, Score: -98.31, Average Score: -575.22, epsilon: 0.005
Episode: 46, Score: -12.14, Average Score: -573.19, epsilon: 0.005
Episode: 47, Score: 14.46, Average Score: -571.63, epsilon: 0.005
Episode: 48, Score: 141.98, Average Score: -560.52, epsilon: 0.005
Episode: 49, Score: 89.02, Average Score: -544.74, epsilon: 0.005
Episode: 50, Score: -233.71, Average Score: -548.29, epsilon: 0.005
Episode: 51, Score: -1186.69, Average Score: -583.06, epsilon: 0.005
Episode: 52, Score: 214.11, Average Score: -577.55, epsilon: 0.005
Episode: 53, Score: 176.13, Average Score: -520.66, epsilon: 0.005
Episode: 54, Score: -15.86, Average Score: -481.37, epsilon: 0.005
Episode: 55, Score: 205.67, Average Score: -415.02, epsilon: 0.005
Episode: 56, Score: 183.83, Average Score: -396.77, epsilon: 0.005
Episode: 57, Score: 222.54, Average Score: -379.12, epsilon: 0.005
Episode: 58, Score: 221.08, Average Score: -354.79, epsilon: 0.005
Episode: 59, Score: 20.13, Average Score: -331.34, epsilon: 0.005
Episode: 60, Score: -48.82, Average Score: -322.94, epsilon: 0.005
Episode: 61, Score: 62.86, Average Score: -136.39, epsilon: 0.005
Episode: 62, Score: 16.09, Average Score: -74.85, epsilon: 0.005
Episode: 63, Score: 229.43, Average Score: -45.95, epsilon: 0.005
Episode: 64, Score: 267.28, Average Score: -22.60, epsilon: 0.005
Episode: 65, Score: 272.53, Average Score: -14.02, epsilon: 0.005
Episode: 66, Score: -303.45, Average Score: -27.32, epsilon: 0.005
Episode: 67, Score: 231.08, Average Score: 5.91, epsilon: 0.005
Episode: 68, Score: 75.12, Average Score: 16.50, epsilon: 0.005
Episode: 69, Score: 27.60, Average Score: 30.88, epsilon: 0.005
Episode: 70, Score: 24.04, Average Score: 35.77, epsilon: 0.005
Episode: 71, Score: 253.64, Average Score: 46.40, epsilon: 0.005
Episode: 72, Score: 231.01, Average Score: 55.07, epsilon: 0.005
Episode: 73, Score: 211.18, Average Score: 57.83, epsilon: 0.005
Episode: 74, Score: -58.08, Average Score: 51.95, epsilon: 0.005
Episode: 75, Score: -77.80, Average Score: 58.19, epsilon: 0.005
Episode: 76, Score: -6.50, Average Score: 105.39, epsilon: 0.005
Episode: 77, Score: 34.98, Average Score: 98.23, epsilon: 0.005
Episode: 78, Score: 191.85, Average Score: 98.86, epsilon: 0.005
Episode: 79, Score: -195.03, Average Score: 91.69, epsilon: 0.005
Episode: 80, Score: 242.65, Average Score: 93.17, epsilon: 0.005
Episode: 81, Score: -227.59, Average Score: 76.71, epsilon: 0.005
Episode: 82, Score: 65.91, Average Score: 70.45, epsilon: 0.005
Episode: 83, Score: 245.21, Average Score: 71.41, epsilon: 0.005
Episode: 84, Score: 45.07, Average Score: 72.41, epsilon: 0.005
Episode: 85, Score: -80.40, Average Score: 71.15, epsilon: 0.005
Episode: 86, Score: 247.49, Average Score: 78.53, epsilon: 0.005
Episode: 87, Score: 265.83, Average Score: 88.52, epsilon: 0.005
Episode: 88, Score: -283.82, Average Score: 67.99, epsilon: 0.005
Episode: 89, Score: 225.82, Average Score: 66.33, epsilon: 0.005
Episode: 90, Score: 258.58, Average Score: 65.78, epsilon: 0.005
Episode: 91, Score: 23.72, Average Score: 78.86, epsilon: 0.005
Episode: 92, Score: 59.80, Average Score: 72.01, epsilon: 0.005
Episode: 93, Score: 246.94, Average Score: 78.88, epsilon: 0.005
Episode: 94, Score: 257.66, Average Score: 88.09, epsilon: 0.005
Episode: 95, Score: 194.61, Average Score: 94.91, epsilon: 0.005
Episode: 96, Score: 256.79, Average Score: 95.03, epsilon: 0.005
Episode: 97, Score: 220.79, Average Score: 94.63, epsilon: 0.005
Episode: 98, Score: 254.20, Average Score: 96.35, epsilon: 0.005
Episode: 99, Score: 217.97, Average Score: 107.39, epsilon: 0.005
Episode: 100, Score: 210.48, Average Score: 118.92, epsilon: 0.005
Episode: 101, Score: 201.28, Average Score: 127.23, epsilon: 0.005
Episode: 102, Score: 231.45, Average Score: 135.09, epsilon: 0.005
Episode: 103, Score: 249.38, Average Score: 137.39, epsilon: 0.005
Episode: 104, Score: 258.86, Average Score: 155.55, epsilon: 0.005
Episode: 105, Score: 211.89, Average Score: 154.32, epsilon: 0.005
Episode: 106, Score: 277.04, Average Score: 174.50, epsilon: 0.005
Episode: 107, Score: 17.98, Average Score: 172.58, epsilon: 0.005
Episode: 108, Score: 15.99, Average Score: 163.41, epsilon: 0.005
Episode: 109, Score: 231.85, Average Score: 170.89, epsilon: 0.005
Episode: 110, Score: -207.45, Average Score: 165.80, epsilon: 0.005
Episode: 111, Score: 185.93, Average Score: 163.34, epsilon: 0.005
Episode: 112, Score: 195.31, Average Score: 160.52, epsilon: 0.005
Episode: 113, Score: 277.58, Average Score: 182.98, epsilon: 0.005
Episode: 114, Score: 240.07, Average Score: 183.55, epsilon: 0.005
Episode: 115, Score: 234.77, Average Score: 182.59, epsilon: 0.005
Episode: 116, Score: 238.36, Average Score: 191.18, epsilon: 0.005
Episode: 117, Score: 255.11, Average Score: 198.99, epsilon: 0.005
We now take a look at the performance during learning:
def plotLearning(x, scores, epsilons, filename, lines=None):
    fig = plt.figure()
    ax = fig.add_subplot(111, label="1")
    ax2 = fig.add_subplot(111, label="2", frame_on=False)
    # epsilon on the left axis
    ax.plot(x, epsilons, color="C0")
    ax.set_xlabel("Game", color="C0")
    ax.set_ylabel("Epsilon", color="C0")
    ax.tick_params(axis='x', colors="C0")
    ax.tick_params(axis='y', colors="C0")
    # running average over the last (up to) 21 scores on the right axis
    N = len(scores)
    running_avg = np.empty(N)
    for t in range(N):
        running_avg[t] = np.mean(scores[max(0, t-20):(t+1)])
    ax2.scatter(x, running_avg, color="C1")
    ax2.axes.get_xaxis().set_visible(False)
    ax2.yaxis.tick_right()
    ax2.set_ylabel('Score', color="C1")
    ax2.yaxis.set_label_position('right')
    ax2.tick_params(axis='y', colors="C1")
    if lines is not None:
        for line in lines:
            plt.axvline(x=line)
    plt.savefig(filename)
    plt.show()
x = [i+1 for i in range(len(scores))]
filename = "diagrams/lunar_lander.png"
plotLearning(x, scores, eps_history, filename)
agent.save_model(path="data/pytorch_dpq_model.pt")
Finally, we can take a look at how the trained agent navigates the environment. With render_mode="rgb_array_list", env.render() returns the list of frames recorded since the last reset, which we turn into a short video below:
# agent = Agent(gamma=0.99, epsilon=1.0, batch_size=64, n_actions=4,
# eps_end=0.01, input_dims=[8], lr=0.001)
# agent.load_model()
env = gym.make("LunarLander-v2", render_mode="rgb_array_list")
observation, info = env.reset(seed=seed)
for _ in range(1000):
    action = agent.choose_action(observation) # select an action via the (now almost fully greedy) policy
    observation, reward, finished, truncated, info = env.step(action) # take the selected action
if finished or truncated:
break
frames = env.render()
env.close()
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, writers
fig, ax = plt.subplots()
img = ax.imshow(frames[0])
ax.axis("off")
def animate(frame_num):
ax.set_title(f"Step: {frame_num}")
img.set_data(frames[frame_num])
return img
anim = FuncAnimation(fig, animate, frames=len(frames), interval=10)
Writer = writers["ffmpeg"]
writer = Writer(fps=15, bitrate=1800)
anim.save("diagrams/LunarLander_Video_1.mp4", writer=writer)
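To preview the result inside the notebook instead of opening the saved file, matplotlib animations can also be embedded directly (assuming ffmpeg is available; place this as the last line of a cell):

from IPython.display import HTML
HTML(anim.to_html5_video())  # embeds the animation as an HTML5 video in the notebook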