Overview

Figure: A2C agents interacting with the environment.

Figure: D4PG agents interacting with the environment.

This report discusses two algorithms, Advantage Actor-Critic (A2C) and Distributed Distributional Deep Deterministic Policy Gradients (D4PG), in the context of a continuous control reinforcement learning task. The agent is a double-jointed arm that is rewarded for keeping its hand inside a moving target location. The environment provides a 33-dimensional state vector, and the agent responds with a 4-dimensional continuous action whose components are bounded in [-1, 1].

The agent’s goal is to achieve an average score of 30 over a window of 100 episodes.

The code for the two algorithms can be found here: https://github.com/tyranitar/continuous-control.

Results

Figure: Average score across 20 parallel agents for A2C.

Figure: Average score across 20 parallel agents for D4PG.

For both A2C and D4PG, 20 parallel agents were used to collect trajectories. The goal was to achieve an average score of 30 over a window of 100 episodes across all 20 agents.

Given this goal, A2C solved the environment in 144 episodes. D4PG solved it faster, in 112 episodes, although its learning was less stable.
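
To make the solved criterion concrete, here is a minimal sketch of the check, assuming per-episode scores that have already been averaged over the 20 agents (the function and variable names are hypothetical, not taken from the repository):

```python
from collections import deque

import numpy as np


def is_solved(episode_scores, window=100, target=30.0):
    """episode_scores: one score per episode, each already averaged over the 20 agents."""
    recent = deque(episode_scores[-window:], maxlen=window)
    # Solved once the rolling 100-episode average reaches the target score of 30.
    return len(recent) == window and np.mean(recent) >= target
```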

This report breaks down the A2C and D4PG algorithms and compares them in terms of convergence behavior and learning speed.

Implementation

Architectures

A2C actor:
1. Linear 33 → 128
2. ReLU
3. Linear 128 → 64
4. ReLU
5. Linear 64 → 4
6. Tanh
7. Sample with standard deviation

A2C critic:
1. Linear 33 → 128
2. ReLU
3. Linear 128 → 64
4. ReLU
5. Linear 64 → 1

D4PG actor:
1. Linear 33 → 400
2. Batch norm
3. ReLU
4. Linear 400 → 300
5. Batch norm
6. ReLU
7. Linear 300 → 4
8. Tanh

D4PG critic:
1. Linear 33 → 400
2. Batch norm
3. ReLU
4. Concatenate actions
5. Linear 404 → 300
6. Batch norm
7. ReLU
8. Linear 300 → 51
9. Softmax
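
The layer listings above translate directly into network modules. Below is a minimal PyTorch sketch, assuming the repository uses PyTorch; the class names are placeholders, and the choice of a learned, state-independent standard deviation for the A2C actor is an assumption, since the listing only says "sample with standard dev".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class A2CActor(nn.Module):
    """A2C actor: tanh-squashed mean, then sample with a standard deviation."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, action_size)
        # Assumption: learned, state-independent log std, one entry per action dimension.
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        mean = torch.tanh(self.fc3(x))
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum(dim=-1)


class A2CCritic(nn.Module):
    """A2C critic: scalar state-value estimate."""
    def __init__(self, state_size=33):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


class D4PGActor(nn.Module):
    """D4PG actor: deterministic policy with batch-normalized hidden layers."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 400)
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400, 300)
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.bn2(self.fc2(x)))
        return torch.tanh(self.fc3(x))  # actions bounded in [-1, 1]


class D4PGCritic(nn.Module):
    """D4PG critic: categorical distribution over 51 return atoms."""
    def __init__(self, state_size=33, action_size=4, num_atoms=51):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 400)
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400 + action_size, 300)  # 404 → 300 after concatenating actions
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, num_atoms)

    def forward(self, state, action):
        x = F.relu(self.bn1(self.fc1(state)))
        x = torch.cat([x, action], dim=-1)  # concatenate actions after the first layer
        x = F.relu(self.bn2(self.fc2(x)))
        return F.softmax(self.fc3(x), dim=-1)  # probabilities over the 51 return atoms
```

Note that the D4PG critic outputs a categorical distribution over 51 return atoms rather than a single Q-value, which is the distributional part of D4PG.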

Hyperparameters

| Hyperparameter | A2C | D4PG |
| --- | --- | --- |
| Actor learning rate | 0.0001 | 0.001 |
| Critic learning rate | 0.0001 | 0.001 |
| Discount factor | 0.99 | 0.99 |
| Rollout length | 5 | 5 |
| Gradient clip | 1 | 1 |
| GAE lambda | 0.99 | n/a |
| Target soft update rate | n/a | 0.001 |
| Replay buffer size | n/a | 1e6 |
| Batch size | n/a | 64 |

For both the A2C actor and critic, the RMSProp optimizer was used. For both the D4PG actor and critic, the Adam optimizer was used.
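
Continuing the network sketch above, the optimizer setup would look roughly like this (the instance names are hypothetical; only the optimizer choices and learning rates come from the table and text above):

```python
import torch.optim as optim

# Networks from the architecture sketch above (hypothetical class names).
a2c_actor, a2c_critic = A2CActor(), A2CCritic()
d4pg_actor, d4pg_critic = D4PGActor(), D4PGCritic()

# A2C: RMSProp with a learning rate of 1e-4 for both actor and critic.
a2c_actor_opt = optim.RMSprop(a2c_actor.parameters(), lr=1e-4)
a2c_critic_opt = optim.RMSprop(a2c_critic.parameters(), lr=1e-4)

# D4PG: Adam with a learning rate of 1e-3 for both actor and critic.
d4pg_actor_opt = optim.Adam(d4pg_actor.parameters(), lr=1e-3)
d4pg_critic_opt = optim.Adam(d4pg_critic.parameters(), lr=1e-3)
```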

A2C algorithm