*Figure: A2C agents interacting with the environment.*

*Figure: D4PG agents interacting with the environment.*
This report discusses two algorithms, Advantage Actor-Critic (A2C) and Distributed Distributional Deep Deterministic Policy Gradients (D4PG), in the context of a continuous control reinforcement learning task. The agent is a double-jointed arm that is rewarded for keeping its hand inside a moving target location. The environment specifications are:
- 33 continuous-valued states corresponding to the position, rotation, velocity, and angular velocity of the arm.
- 4 continuous-valued actions corresponding to the torques applied to the two joints, each in the range [-1, 1].
- A reward of 0.04 for each time step that the agent has its hand in the target location, and a reward of 0 otherwise.

The agent’s goal is to achieve an average score of 30 over a window of 100 episodes.
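As a rough illustration of this interface (a sketch only; the environment wrapper and the placeholder names below are not taken from the report), the shapes involved look like this:

```python
import numpy as np

# Shapes implied by the specification above. The environment object itself
# is omitted; `hand_in_target` is a placeholder for the condition the
# environment evaluates when assigning rewards.
state = np.zeros(33)                              # position, rotation, velocity, angular velocity
action = np.clip(np.random.randn(4), -1.0, 1.0)   # torques for the two joints, clipped to [-1, 1]

hand_in_target = True                             # placeholder; decided by the environment
reward = 0.04 if hand_in_target else 0.0          # per-step reward described above
```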
The code for the two algorithms can be found here: https://github.com/tyranitar/continuous-control.
*Figure: Average score across 20 parallel agents for A2C.*

*Figure: Average score across 20 parallel agents for D4PG.*
For both A2C and D4PG, 20 parallel agents were used to collect trajectories. The goal was to achieve an average score of 30 over a window of 100 episodes across all 20 agents.
Given this goal, A2C solved the environment in 144 episodes. D4PG solved it faster, in 112 episodes, although its learning was less stable.
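A minimal sketch of this solve criterion follows; the `run_episode` stub is hypothetical and simply stands in for one episode of environment interaction:

```python
import numpy as np
from collections import deque

def run_episode():
    """Stand-in for one training episode with 20 parallel agents.
    In the real loop this would return each agent's undiscounted return."""
    return np.random.uniform(0.0, 40.0, size=20)

scores_window = deque(maxlen=100)               # rolling window of the last 100 episode scores
for episode in range(1, 1001):
    agent_returns = run_episode()               # shape (20,): one return per agent
    scores_window.append(agent_returns.mean())  # episode score = mean over the 20 agents
    if len(scores_window) == 100 and np.mean(scores_window) >= 30.0:
        print(f"Solved in {episode} episodes")
        break
```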
The remainder of this report breaks down the A2C and D4PG algorithms and compares them in terms of convergence behavior and learning speed.
|  | A2C | D4PG |
|---|---|---|
| Actor | 1. Linear 33 → 128 | |
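To make the architecture table concrete, here is a minimal PyTorch sketch of such an actor. Only the first layer (Linear 33 → 128) comes from the table; the hidden activation, the output layer, and the tanh squashing to [-1, 1] are assumptions, and the actual networks in the linked repository may differ.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of an actor for the 33-dimensional state and 4-dimensional action."""

    def __init__(self, state_size=33, action_size=4, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),   # Linear 33 -> 128 (from the table)
            nn.ReLU(),                            # assumed hidden activation
            nn.Linear(hidden_size, action_size),  # assumed output layer
            nn.Tanh(),                            # keeps actions in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)
```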
|  | A2C | D4PG |
|---|---|---|
| Common | Actor learning rate = 0.0001<br>Critic learning rate = 0.0001<br>Discount factor = 0.99<br>Rollout length = 5<br>Gradient clip = 1 | Actor learning rate = 0.001<br>Critic learning rate = 0.001<br>Discount factor = 0.99<br>Rollout length = 5<br>Gradient clip = 1 |
| Different | GAE lambda = 0.99 | Target soft update rate = 0.001<br>Replay buffer size = 1e6<br>Batch size = 64 |
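Two of the algorithm-specific hyperparameters above map directly onto small pieces of code: GAE with lambda = 0.99 for A2C, and the target soft update rate of 0.001 for D4PG. The sketch below shows one common way to implement both; the exact code in the linked repository may differ.

```python
import numpy as np

def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.99):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, values, dones: arrays of shape (T, num_agents); dones are 0/1 flags.
    next_value: bootstrapped value for the state after the rollout, shape (num_agents,).
    """
    T = len(rewards)
    advantages = np.zeros_like(rewards)
    gae = np.zeros_like(next_value)
    for t in reversed(range(T)):
        bootstrap = next_value if t == T - 1 else values[t + 1]
        mask = 1.0 - dones[t]                                       # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * bootstrap * mask - values[t]   # TD residual
        gae = delta + gamma * lam * mask * gae                      # exponentially weighted sum of residuals
        advantages[t] = gae
    return advantages

def soft_update(target_net, online_net, tau=0.001):
    """Polyak-average a target network (torch.nn.Module) toward the online network."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)
```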
For both the A2C actor and critic, the RMSProp optimizer was used. For both the D4PG actor and critic, the Adam optimizer was used.
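A sketch of that optimizer setup, with learning rates taken from the hyperparameter table (the networks here are simple placeholders for the actual actor and critic modules):

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder networks standing in for the actual actor/critic modules.
a2c_actor, a2c_critic = nn.Linear(33, 4), nn.Linear(33, 1)
d4pg_actor, d4pg_critic = nn.Linear(33, 4), nn.Linear(33, 1)

# A2C: RMSProp for both networks, learning rate 0.0001.
a2c_actor_opt = optim.RMSprop(a2c_actor.parameters(), lr=1e-4)
a2c_critic_opt = optim.RMSprop(a2c_critic.parameters(), lr=1e-4)

# D4PG: Adam for both networks, learning rate 0.001.
d4pg_actor_opt = optim.Adam(d4pg_actor.parameters(), lr=1e-3)
d4pg_critic_opt = optim.Adam(d4pg_critic.parameters(), lr=1e-3)

# During training, gradients would also be clipped to norm 1
# (e.g. nn.utils.clip_grad_norm_(a2c_actor.parameters(), 1)) before each optimizer step.
```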