D4PG agent playing tennis against itself.
This report discusses the application of policy gradient algorithms in the context of a multi-agent reinforcement learning (MARL) environment. The environment consists of two agents on a tennis court, each one playing to maximize its own score. Specifically, the environment provides:

- 3 state vectors, each of size 8, for a total of 24 state variables per agent per time step. The state variables correspond to the agent's horizontal and vertical position and velocity, as well as the ball's position.
- 2 action variables per agent, corresponding to the agent's movement away from or towards the net, and jumping.
- A reward of +0.1 when the agent hits the ball over the net, and a reward of -0.01 when the agent hits the ball out of bounds or lets it hit the ground on its side of the court.

Each episode ends when the ball flies out of bounds or hits the ground. The episode score is the maximum of the two scores achieved by the agents. Finally, the environment is considered solved when the average episode score over a window of 100 episodes reaches 0.5.
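To make the scoring rules concrete, the sketch below shows one training episode and the solve check. The `env` and `agents` objects are hypothetical stand-ins rather than the actual interfaces used in the linked repository: `env` is assumed to return per-agent observations, rewards, and done flags, `agents.act` is assumed to return one 2-dimensional action per agent, and learning updates are omitted.

```python
import numpy as np
from collections import deque

def run_episode(env, agents):
    """Roll out one episode; the episode score is the better of the two agents' returns."""
    states = env.reset()                       # assumed shape (2, 24): one observation per agent
    returns = np.zeros(2)                      # cumulative reward for each agent
    while True:
        actions = agents.act(states)           # assumed shape (2, 2): two continuous actions each
        states, rewards, dones = env.step(actions)
        returns += rewards                     # +0.1 per ball hit over the net, -0.01 for a miss
        if np.any(dones):                      # ball hit the ground or flew out of bounds
            return np.max(returns)

def train(env, agents, target=0.5, window=100, max_episodes=2000):
    """Train until the average score over a 100-episode window reaches the solve threshold."""
    recent = deque(maxlen=window)
    for episode in range(1, max_episodes + 1):
        recent.append(run_episode(env, agents))
        if len(recent) == window and np.mean(recent) >= target:
            return episode                     # environment considered solved at this episode
    return None
```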
The solution code can be found here: https://github.com/tyranitar/tennis
Scores and average scores as the agent learns.
Using the actor-critic method Distributed Distributional Deep Deterministic Policy Gradients (D4PG), the agent solved the environment in 698 episodes.
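D4PG replaces the scalar Q-value of DDPG with a categorical distribution over returns: the critic is trained with a cross-entropy loss against a Bellman-projected target distribution, and the actor is updated to maximize the expected value of the critic's distribution. The sketch below illustrates one such update step; the number of atoms, the value bounds, and the assumption that the critics output softmax probabilities over the atoms are illustrative choices, not the exact settings used for this result.

```python
import torch

N_ATOMS, V_MIN, V_MAX = 51, 0.0, 2.0            # assumed support for the return distribution
support = torch.linspace(V_MIN, V_MAX, N_ATOMS)
delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

def project(target_probs, rewards, dones, gamma):
    """Project the Bellman-shifted target distribution back onto the fixed support.

    target_probs: (B, N_ATOMS) probabilities from the target critic;
    rewards, dones: (B, 1) tensors for the sampled transitions.
    """
    tz = (rewards + gamma * (1 - dones) * support).clamp(V_MIN, V_MAX)
    b = (tz - V_MIN) / delta_z                   # fractional atom indices, shape (B, N_ATOMS)
    lower, upper = b.floor().long(), b.ceil().long()
    # When b lands exactly on an atom, lower == upper; nudge so no probability mass is lost.
    lower[(upper > 0) & (lower == upper)] -= 1
    upper[(lower < N_ATOMS - 1) & (lower == upper)] += 1
    projected = torch.zeros_like(target_probs)
    projected.scatter_add_(1, lower, target_probs * (upper.float() - b))
    projected.scatter_add_(1, upper, target_probs * (b - lower.float()))
    return projected

def d4pg_losses(actor, critic, target_actor, target_critic,
                states, actions, rewards, dones, next_states, gamma=0.99):
    """Critic and actor losses for one D4PG update (critics output atom probabilities)."""
    with torch.no_grad():
        next_probs = target_critic(next_states, target_actor(next_states))
        target = project(next_probs, rewards, dones, gamma)
    log_probs = torch.log(critic(states, actions) + 1e-8)
    critic_loss = -(target * log_probs).sum(dim=1).mean()       # cross-entropy to the target
    expected_q = (critic(states, actor(states)) * support).sum(dim=1)
    actor_loss = -expected_q.mean()                             # ascend the expected return
    return critic_loss, actor_loss
```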
The network architectures were:

| Actor | Critic |
|---|---|
| 1. Convolution 1D (3, 6) → (8, 6) | |
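As a sketch of how the first row of the table might translate into code, the 24-dimensional observation can be reshaped into 3 stacked frames of 8 variables and passed through a 1-D convolution with 8 output channels, which yields the (8, 6) output shape shown. The kernel size and the fully connected layers that follow are assumptions, since the rest of the table is not reproduced here.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of an actor whose first layer is the Conv1D row from the table above.

    The 24-dim state is treated as 3 stacked frames of 8 variables (3 input channels);
    the kernel size and the fully connected layers are assumptions.
    """
    def __init__(self, action_size=2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=3)  # -> (8, 6)
        self.fc1 = nn.Linear(8 * 6, 64)
        self.fc2 = nn.Linear(64, action_size)

    def forward(self, state):                    # state: (batch, 24)
        x = state.view(-1, 3, 8)                 # split into 3 stacked 8-dim frames
        x = torch.relu(self.conv(x))
        x = torch.relu(self.fc1(x.flatten(start_dim=1)))
        return torch.tanh(self.fc2(x))           # 2 continuous actions in [-1, 1]
```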
For both the actor and the critic, the Adam optimizer was used with a learning rate of 0.001. The other hyperparameter values used were 0.001 and 0.1.
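For reference, a typical setup for the pieces described above might look like the following. The 0.001 learning rate comes from the report; the soft-update step and its rate `tau` are assumptions about how the remaining hyperparameters could be used, not values confirmed by the source.

```python
import torch
import torch.optim as optim

def make_optimizers(actor: torch.nn.Module, critic: torch.nn.Module, lr: float = 1e-3):
    """Adam for both networks, using the 0.001 learning rate reported above."""
    return optim.Adam(actor.parameters(), lr=lr), optim.Adam(critic.parameters(), lr=lr)

def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 1e-3):
    """Blend the learned weights into the target network (tau here is an assumed value)."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```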