Overview

D4PG agent playing tennis against itself.

This report discusses the application of policy gradient algorithms in the context of a multi-agent reinforcement learning (MARL) environment. The environment consists of two agents on a tennis court, each playing to maximize its own score. Specifically, the environment provides:

- A reward of +0.1 to an agent each time it hits the ball over the net, and a reward of -0.01 when it lets the ball hit the ground or hits it out of bounds.
- A continuous observation for each agent, describing the position and velocity of the ball and that agent's racket.
- Two continuous actions per agent, corresponding to moving toward or away from the net, and jumping.

Each episode ends when the ball flies out of bounds or hits the ground. The episode score is the maximum of the two scores achieved by the agents. Finally, the environment is considered solved when the average episode score over a window of 100 episodes reaches 0.5.
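As a concrete illustration, here is a minimal sketch of how the episode score and the solve check might be computed. The function and variable names are hypothetical and not taken from the repository:

```python
from collections import deque

import numpy as np

scores_window = deque(maxlen=100)  # rolling window of the last 100 episode scores


def record_episode(agent_scores, scores_window):
    """agent_scores: per-agent cumulative rewards for one episode, e.g. [0.1, 0.0]."""
    episode_score = np.max(agent_scores)  # episode score = max over the two agents
    scores_window.append(episode_score)
    # Solved once the 100-episode rolling average reaches 0.5.
    solved = len(scores_window) == 100 and np.mean(scores_window) >= 0.5
    return episode_score, solved
```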

The solution code can be found here: https://github.com/tyranitar/tennis

Results

Scores and average scores as the agent learns.

Using Distributed Distributional Deep Deterministic Policy Gradients (D4PG), an actor-critic method, the agent solved the environment in 698 episodes.
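The distinguishing feature of D4PG is its distributional critic: instead of a scalar Q-value, the critic outputs a categorical distribution over returns (51 atoms in this project), and the training target is the Bellman-updated distribution projected back onto the fixed support. The sketch below shows the standard categorical projection step under assumed support bounds `v_min`/`v_max` and discount `gamma`; these values, and the function and argument names, are assumptions for illustration and may differ from the repository's implementation:

```python
import torch


def project_distribution(next_probs, rewards, dones, gamma=0.99,
                         v_min=-1.0, v_max=1.0, n_atoms=51):
    """Project the Bellman target distribution onto the fixed support.

    next_probs: (batch, n_atoms) softmax output of the target critic
    rewards, dones: (batch, 1) tensors
    """
    delta_z = (v_max - v_min) / (n_atoms - 1)
    support = torch.linspace(v_min, v_max, n_atoms)  # fixed return atoms

    # Bellman-update every atom of the support, then clamp to the valid range.
    tz = (rewards + gamma * (1.0 - dones) * support.unsqueeze(0)).clamp(v_min, v_max)
    b = (tz - v_min) / delta_z            # fractional index of each target atom
    lower = b.floor().long()
    upper = b.ceil().long()

    # When b lands exactly on an atom, lower == upper and both interpolation
    # weights below would be zero; shift one neighbour so no mass is lost.
    lower[(upper > 0) & (lower == upper)] -= 1
    upper[(lower < n_atoms - 1) & (lower == upper)] += 1

    # Split each target atom's probability mass between its two neighbours.
    projected = torch.zeros_like(next_probs)
    projected.scatter_add_(1, lower, next_probs * (upper.float() - b))
    projected.scatter_add_(1, upper, next_probs * (b - lower.float()))
    return projected
```

The critic is then typically trained with a cross-entropy loss between this projected target and its predicted distribution, while the actor is updated to maximize the expected return under the critic's distribution.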

Implementation

Model architecture

Actor:
1. Convolution 1D (3, 6) → (8, 6)
2. Batch normalization
3. ReLU
4. Linear 48 → 400
5. Batch normalization
6. ReLU
7. Linear 400 → 300
8. Batch normalization
9. ReLU
10. Linear 300 → 2
11. Tanh

Critic:
1. Convolution 1D (3, 6) → (8, 6)
2. Batch normalization
3. ReLU
4. Concatenate actions
5. Linear 50 → 400
6. Batch normalization
7. ReLU
8. Linear 400 → 300
9. Batch normalization
10. ReLU
11. Linear 300 → 51
12. Softmax
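As a rough guide, networks with these layer shapes might be written in PyTorch as follows. Only the in/out sizes above are taken from the lists; the class names and the Conv1D kernel size (assumed to be 1, which keeps the sequence length at 6 so that 8 × 6 = 48 features remain) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a (3, 6) stacked observation to 2 actions in [-1, 1]."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(3, 8, kernel_size=1)  # (3, 6) -> (8, 6)
        self.bn_conv = nn.BatchNorm1d(8)
        self.fc1 = nn.Linear(48, 400)
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400, 300)
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, 2)

    def forward(self, state):
        x = F.relu(self.bn_conv(self.conv(state)))
        x = x.flatten(start_dim=1)                  # 8 * 6 = 48 features
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        return torch.tanh(self.fc3(x))


class Critic(nn.Module):
    """Maps a (3, 6) observation plus 2 actions to a 51-atom return distribution."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(3, 8, kernel_size=1)
        self.bn_conv = nn.BatchNorm1d(8)
        self.fc1 = nn.Linear(48 + 2, 400)           # 48 state features + 2 actions = 50
        self.bn1 = nn.BatchNorm1d(400)
        self.fc2 = nn.Linear(400, 300)
        self.bn2 = nn.BatchNorm1d(300)
        self.fc3 = nn.Linear(300, 51)

    def forward(self, state, action):
        x = F.relu(self.bn_conv(self.conv(state)))
        x = torch.cat([x.flatten(start_dim=1), action], dim=1)
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        return F.softmax(self.fc3(x), dim=1)        # probabilities over 51 return atoms
```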

Hyperparameters

For both the actor and the critic, the Adam optimizer was used.
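For illustration, the optimizers might be constructed roughly as follows; the learning rates shown are placeholders, not values taken from the repository:

```python
import torch.optim as optim

actor, critic = Actor(), Critic()   # networks from the sketch above

# Hypothetical learning rates; the actual values are defined in the repository.
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-4)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)
```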