Overview

Figure: A2C agents interacting with the environment.

Figure: D4PG agents interacting with the environment.

This report discusses two algorithms, Advantage Actor-Critic (A2C) and Distributed Distributional Deep Deterministic Policy Gradients (D4PG), in the context of a continuous control reinforcement learning task. The agent is a double-jointed arm that is rewarded for keeping its hand inside a moving target location. The environment provides a 33-dimensional state vector, and the agent responds with a 4-dimensional continuous action whose components are bounded in [-1, 1].

The agent’s goal is to achieve an average score of 30 over a window of 100 episodes.

The code for the two algorithms can be found here: https://github.com/tyranitar/continuous-control.

Results

Figure: Average score across 20 parallel agents for A2C.

Figure: Average score across 20 parallel agents for D4PG.

For both A2C and D4PG, 20 parallel agents were used to collect trajectories. The goal was to achieve an average score of 30 over a window of 100 episodes across all 20 agents.

Given this goal, A2C solved the environment in 144 episodes. D4PG solved it faster, in 112 episodes, although its learning was less stable.
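
To make the solved criterion concrete, here is a minimal sketch of the check, assuming per-episode scores that have already been averaged over the 20 agents (the function and variable names are hypothetical, not taken from the repository):

```python
from collections import deque

import numpy as np


def is_solved(episode_scores, window=100, target=30.0):
    """episode_scores: one score per episode, each already averaged over the 20 agents."""
    recent = deque(episode_scores[-window:], maxlen=window)
    # Solved once the rolling 100-episode average reaches the target score of 30.
    return len(recent) == window and np.mean(recent) >= target
```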

This report breaks down the A2C and D4PG algorithms and compares them in terms of convergence behavior and learning speed.

Implementation

Architectures

A2C actor:
1. Linear 33 → 128
2. ReLU
3. Linear 128 → 64
4. ReLU
5. Linear 64 → 4
6. Tanh
7. Sample with standard deviation

A2C critic:
1. Linear 33 → 128
2. ReLU
3. Linear 128 → 64
4. ReLU
5. Linear 64 → 1

D4PG actor:
1. Linear 33 → 400
2. Batch norm
3. ReLU
4. Linear 400 → 300
5. Batch norm
6. ReLU
7. Linear 300 → 4
8. Tanh

D4PG critic:
1. Linear 33 → 400
2. Batch norm
3. ReLU
4. Concatenate actions
5. Linear 404 → 300
6. Batch norm
7. ReLU
8. Linear 300 → 51
9. Softmax
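
The layer listings above translate directly into network modules. Below is a minimal PyTorch sketch, assuming the repository uses PyTorch; the class names are placeholders, and the choice of a learned, state-independent standard deviation for the A2C actor is an assumption, since the listing only says "sample with standard dev".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class A2CActor(nn.Module):
    """A2C actor: tanh-squashed mean, then sample with a standard deviation."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, action_size)
        # Assumption: learned, state-independent log std, one entry per action dimension.
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        mean = torch.tanh(self.fc3(x))
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum(dim=-1)


class A2CCritic(nn.Module):
    """A2C critic: scalar state-value estimate."""
    def __init__(self, state_size=33):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


class D4PGActor(nn.Module):
    """D4PG actor: deterministic policy with batch-normalized hidden layers."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 400)
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400, 300)
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.bn2(self.fc2(x)))
        return torch.tanh(self.fc3(x))  # actions bounded in [-1, 1]


class D4PGCritic(nn.Module):
    """D4PG critic: categorical distribution over 51 return atoms."""
    def __init__(self, state_size=33, action_size=4, num_atoms=51):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 400)
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400 + action_size, 300)  # 404 → 300 after concatenating actions
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, num_atoms)

    def forward(self, state, action):
        x = F.relu(self.bn1(self.fc1(state)))
        x = torch.cat([x, action], dim=-1)  # concatenate actions after the first layer
        x = F.relu(self.bn2(self.fc2(x)))
        return F.softmax(self.fc3(x), dim=-1)  # probabilities over the 51 return atoms
```

Note that the D4PG critic outputs a categorical distribution over 51 return atoms rather than a single Q-value, which is the distributional part of D4PG.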

Hyperparameters

| Hyperparameter | A2C | D4PG |
| --- | --- | --- |
| Actor learning rate | 0.0001 | 0.001 |
| Critic learning rate | 0.0001 | 0.001 |
| Discount factor | 0.99 | 0.99 |
| Rollout length | 5 | 5 |
| Gradient clip | 1 | 1 |
| GAE lambda | 0.99 | n/a |
| Target soft update rate | n/a | 0.001 |
| Replay buffer size | n/a | 1e6 |
| Batch size | n/a | 64 |

For both the A2C actor and critic, the RMSProp optimizer was used. For both the D4PG actor and critic, the Adam optimizer was used.
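
Continuing the network sketch above, the optimizer setup would look roughly like this (the instance names are hypothetical; only the optimizer choices and learning rates come from the table and text above):

```python
import torch.optim as optim

# Networks from the architecture sketch above (hypothetical class names).
a2c_actor, a2c_critic = A2CActor(), A2CCritic()
d4pg_actor, d4pg_critic = D4PGActor(), D4PGCritic()

# A2C: RMSProp with a learning rate of 1e-4 for both actor and critic.
a2c_actor_opt = optim.RMSprop(a2c_actor.parameters(), lr=1e-4)
a2c_critic_opt = optim.RMSprop(a2c_critic.parameters(), lr=1e-4)

# D4PG: Adam with a learning rate of 1e-3 for both actor and critic.
d4pg_actor_opt = optim.Adam(d4pg_actor.parameters(), lr=1e-3)
d4pg_critic_opt = optim.Adam(d4pg_critic.parameters(), lr=1e-3)
```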

A2C algorithm