Overview

D4PG agent playing tennis against itself.

This report discusses the application of policy gradient algorithms in the context of a multi-agent reinforcement learning (MARL) environment. The environment consists of two agents on a tennis court, each playing to maximize its own score. Specifically, the environment provides:

- A reward of +0.1 to an agent each time it hits the ball over the net, and a reward of -0.01 when it lets the ball hit the ground or hits it out of bounds.
- A continuous observation for each agent, describing the position and velocity of the ball and that agent's racket.
- Two continuous actions per agent, corresponding to moving toward or away from the net, and jumping.

Each episode ends when the ball flies out of bounds or hits the ground. The episode score is the maximum of the two scores achieved by the agents. Finally, the environment is considered solved when the average episode score over a window of 100 episodes reaches 0.5.
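As a concrete illustration, here is a minimal sketch of how the episode score and the solve check might be computed. The function and variable names are hypothetical and not taken from the repository:

```python
from collections import deque

import numpy as np

scores_window = deque(maxlen=100)  # rolling window of the last 100 episode scores


def record_episode(agent_scores, scores_window):
    """agent_scores: per-agent cumulative rewards for one episode, e.g. [0.1, 0.0]."""
    episode_score = np.max(agent_scores)  # episode score = max over the two agents
    scores_window.append(episode_score)
    # Solved once the 100-episode rolling average reaches 0.5.
    solved = len(scores_window) == 100 and np.mean(scores_window) >= 0.5
    return episode_score, solved
```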

The solution code can be found here: https://github.com/tyranitar/tennis

Results

Scores and average scores as the agent learns.

Using Distributed Distributional Deep Deterministic Policy Gradients (D4PG), an actor-critic method, the agent solved the environment in 698 episodes.
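The distinguishing feature of D4PG is its distributional critic: instead of a scalar Q-value, the critic outputs a categorical distribution over returns (51 atoms in this project), and the training target is the Bellman-updated distribution projected back onto the fixed support. The sketch below shows the standard categorical projection step under assumed support bounds `v_min`/`v_max` and discount `gamma`; these values, and the function and argument names, are assumptions for illustration and may differ from the repository's implementation:

```python
import torch


def project_distribution(next_probs, rewards, dones, gamma=0.99,
                         v_min=-1.0, v_max=1.0, n_atoms=51):
    """Project the Bellman target distribution onto the fixed support.

    next_probs: (batch, n_atoms) softmax output of the target critic
    rewards, dones: (batch, 1) tensors
    """
    delta_z = (v_max - v_min) / (n_atoms - 1)
    support = torch.linspace(v_min, v_max, n_atoms)  # fixed return atoms

    # Bellman-update every atom of the support, then clamp to the valid range.
    tz = (rewards + gamma * (1.0 - dones) * support.unsqueeze(0)).clamp(v_min, v_max)
    b = (tz - v_min) / delta_z            # fractional index of each target atom
    lower = b.floor().long()
    upper = b.ceil().long()

    # When b lands exactly on an atom, lower == upper and both interpolation
    # weights below would be zero; shift one neighbour so no mass is lost.
    lower[(upper > 0) & (lower == upper)] -= 1
    upper[(lower < n_atoms - 1) & (lower == upper)] += 1

    # Split each target atom's probability mass between its two neighbours.
    projected = torch.zeros_like(next_probs)
    projected.scatter_add_(1, lower, next_probs * (upper.float() - b))
    projected.scatter_add_(1, upper, next_probs * (b - lower.float()))
    return projected
```

The critic is then typically trained with a cross-entropy loss between this projected target and its predicted distribution, while the actor is updated to maximize the expected return under the critic's distribution.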

Implementation

Model architecture

Actor:
1. Convolution 1D (3, 6) → (8, 6)
2. Batch normalization
3. ReLU
4. Linear 48 → 400
5. Batch normalization
6. ReLU
7. Linear 400 → 300
8. Batch normalization
9. ReLU
10. Linear 300 → 2
11. Tanh

Critic:
1. Convolution 1D (3, 6) → (8, 6)
2. Batch normalization
3. ReLU
4. Concatenate actions
5. Linear 50 → 400
6. Batch normalization
7. ReLU
8. Linear 400 → 300
9. Batch normalization
10. ReLU
11. Linear 300 → 51
12. Softmax
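As a rough guide, networks with these layer shapes might be written in PyTorch as follows. Only the in/out sizes above are taken from the lists; the class names and the Conv1D kernel size (assumed to be 1, which keeps the sequence length at 6 so that 8 × 6 = 48 features remain) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a (3, 6) stacked observation to 2 actions in [-1, 1]."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(3, 8, kernel_size=1)  # (3, 6) -> (8, 6)
        self.bn_conv = nn.BatchNorm1d(8)
        self.fc1 = nn.Linear(48, 400)
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400, 300)
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, 2)

    def forward(self, state):
        x = F.relu(self.bn_conv(self.conv(state)))
        x = x.flatten(start_dim=1)                  # 8 * 6 = 48 features
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        return torch.tanh(self.fc3(x))


class Critic(nn.Module):
    """Maps a (3, 6) observation plus 2 actions to a 51-atom return distribution."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(3, 8, kernel_size=1)
        self.bn_conv = nn.BatchNorm1d(8)
        self.fc1 = nn.Linear(48 + 2, 400)           # 48 state features + 2 actions = 50
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400, 300)
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, 51)

    def forward(self, state, action):
        x = F.relu(self.bn_conv(self.conv(state)))
        x = torch.cat([x.flatten(start_dim=1), action], dim=1)
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        return F.softmax(self.fc3(x), dim=1)        # probabilities over 51 return atoms
```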

Hyperparameters

For both the actor and the critic, the Adam optimizer was used.
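For illustration, the optimizers might be constructed roughly as follows; the learning rates shown are placeholders, not values taken from the repository:

```python
import torch.optim as optim

actor, critic = Actor(), Critic()   # networks from the sketch above

# Hypothetical learning rates; the actual values are defined in the repository.
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-4)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)
```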