PPO

The Proximal Policy Optimization (PPO) algorithm combines ideas from A2C (multiple workers) and TRPO (a trust region to improve the actor).

The main idea is that after an update, the new policy should not be too far from the old policy. To achieve this, PPO uses clipping to avoid overly large updates.
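The clipping idea can be illustrated with a minimal sketch of the clipped surrogate loss from the PPO paper. It is written in plain Python so it runs without dependencies; it is not this library's code, and the function name `clipped_surrogate_loss` and its arguments are hypothetical. A real implementation would apply the same operations to tensors.

```python
import math


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """Mean clipped surrogate loss (to be minimised) over a batch.

    Hypothetical helper for illustration only, not part of the library.
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the new and the old policy
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_range, 1 + clip_range]
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the pessimistic (smaller) objective, negated to obtain a loss
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)
```

With a positive advantage, a ratio above `1 + clip_range` earns no extra reward, so the update has no incentive to move the policy further from the old one.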

Note

This PPO implementation contains several modifications to the original algorithm that are not documented by OpenAI: advantages are normalized, and the value function can also be clipped.
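The two modifications can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's code: the helper names, the `eps` constant and the use of the population standard deviation are assumptions, and the value clipping follows the variant where the new prediction is kept within `clip_range_vf` of the old one before the squared error is taken.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Rescale advantages to zero mean and (approximately) unit variance."""
    mean = sum(advantages) / len(advantages)
    variance = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(variance)
    # eps (assumed value) guards against division by zero
    return [(a - mean) / (std + eps) for a in advantages]


def clipped_value_loss(values, old_values, returns, clip_range_vf=None):
    """Mean squared value error, optionally with clipped value predictions."""
    losses = []
    for v, v_old, ret in zip(values, old_values, returns):
        if clip_range_vf is not None:
            # Limit how far the new value estimate may move from the old one
            delta = max(-clip_range_vf, min(v - v_old, clip_range_vf))
            v = v_old + delta
        losses.append((ret - v) ** 2)
    return sum(losses) / len(losses)
```

Because the clipping threshold is expressed in the units of the value estimates, a sensible `clip_range_vf` depends on how rewards are scaled.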

Can I use?

  • Recurrent policies: ❌

  • Multi processing: ✔️

  • Gym spaces:

    Space          Action  Observation
    Discrete       ✔️      ✔️
    Box            ✔️      ✔️
    MultiDiscrete  ✔️      ✔️
    MultiBinary    ✔️      ✔️

Example

Train a PPO agent on CartPole-v1 using 8 environments.

import gym

import torch
from torch import distributions
from torch import nn

import pytorch_lightning as pl

from lightning_baselines3.common.vec_env import make_vec_env, SubprocVecEnv
from lightning_baselines3.on_policy_models import PPO


class Model(PPO):
    def __init__(self, **kwargs):
        # **kwargs will pass our arguments on to PPO
        super(Model, self).__init__(**kwargs)

        self.actor = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, self.action_space.n),
            nn.Softmax(dim=1))

        self.critic = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1))

        self.save_hyperparameters()

    # This is for training the model
    # Returns the distribution and the corresponding value
    def forward(self, x):
        out = self.actor(x)
        dist = distributions.Categorical(probs=out)
        return dist, self.critic(x).flatten()

    # This is for inference and evaluation of our model, returns the action
    def predict(self, x, deterministic=True):
        out = self.actor(x)
        if deterministic:
            out = torch.max(out, dim=1)[1]
        else:
            out = distributions.Categorical(probs=out).sample()
        return out.cpu().numpy()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=3e-4)
        return optimizer


if __name__ == '__main__':
    env = make_vec_env('CartPole-v1', n_envs=8, vec_env_cls=SubprocVecEnv)
    eval_env = gym.make('CartPole-v1')
    model = Model(env=env, eval_env=eval_env)

    trainer = pl.Trainer(max_epochs=5, gradient_clip_val=0.5)
    trainer.fit(model)

    model.evaluate(num_eval_episodes=10, render=True)

Results

Atari Games

Coming soon

How to replicate the results?

Coming soon

Parameters

class lightning_baselines3.on_policy_models.ppo.PPO(env, eval_env, buffer_length=2048, num_rollouts=1, batch_size=64, epochs_per_rollout=10, num_eval_episodes=10, gamma=0.99, gae_lambda=0.95, clip_range=0.2, clip_range_vf=None, target_kl=None, value_coef=0.5, entropy_coef=0.0, use_sde=False, sde_sample_freq=-1, verbose=0, seed=None)

Proximal Policy Optimization algorithm (PPO) (clip version)

Paper: https://arxiv.org/abs/1707.06347

Code: This implementation borrows code from OpenAI Spinning Up (https://github.com/openai/spinningup/), pytorch-a2c-ppo-acktr-gail (https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail), and Stable Baselines 3 (https://github.com/DLR-RM/stable-baselines3)

Introduction to PPO: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Parameters
  • env (Union[Env, VecEnv, str]) – The environment to learn from (if registered in Gym, can be a str)

  • eval_env (Union[Env, VecEnv, str]) – The environment to evaluate on; must not be vectorised/parallelised (if registered in Gym, can be a str; can be None when loading trained models)

  • buffer_length (int) – Length of the buffer and the number of steps to run for each environment per update

  • num_rollouts (int) – Number of rollouts to do per PyTorch Lightning epoch. This does not affect the training dynamics, only how often the model is evaluated, since evaluation happens at the end of each Lightning epoch

  • batch_size (int) – Minibatch size for each gradient update

  • epochs_per_rollout (int) – Number of epochs to optimise the loss for

  • num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch

  • gamma (float) – Discount factor

  • gae_lambda (float) – Factor for the trade-off between bias and variance in the Generalized Advantage Estimator. Equivalent to the classic advantage when set to 1.

  • clip_range (float) – Clipping parameter; it can be a function of the current progress remaining (from 1 to 0).

  • clip_range_vf (Optional[float]) – Clipping parameter for the value function; it can be a function of the current progress remaining (from 1 to 0). This is a parameter specific to the OpenAI implementation. If None is passed (default), no clipping will be done on the value function. IMPORTANT: this clipping depends on the reward scaling.

  • target_kl (Optional[float]) – Limit on the KL divergence between updates, because clipping alone is not enough to prevent large updates; see issue #213 (cf. https://github.com/hill-a/stable-baselines/issues/213). By default, there is no limit on the KL divergence.

  • value_coef (float) – Value function coefficient for the loss calculation

  • entropy_coef (float) – Entropy coefficient for the loss calculation

  • use_sde (bool) – Whether to use generalized State Dependent Exploration (gSDE) instead of action noise exploration

  • sde_sample_freq (int) – Sample a new noise matrix every n steps when using gSDE. Default: -1 (only sample at the beginning of the rollout)

  • verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug

  • seed (Optional[int]) – Seed for the pseudo random generators
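To make the roles of gamma and gae_lambda concrete, here is a minimal plain-Python sketch of Generalized Advantage Estimation over a single rollout. It is an illustration under stated assumptions, not the library's buffer code: the function name `compute_gae`, its argument layout, and the convention that `dones[t]` marks an episode ending after step `t` are all hypothetical.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout of a single environment.

    Hypothetical helper; dones[t] is assumed True when the episode ended
    after step t, and last_value is the value estimate of the state that
    follows the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the estimate of the next one
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets are the advantages added back onto the value estimates
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With gae_lambda=1 the advantage reduces to the discounted return minus the value estimate, and with gae_lambda=0 it reduces to the one-step TD error, which is the bias versus variance trade-off described above.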

forward(x)

Runs both the actor and critic networks

Parameters

x (Tensor) – The input observations

Return type

Tuple[Distribution, Tensor]

Returns

The action distribution from the actor and the value estimates from the critic

training_step(batch, batch_idx)

Specifies the update step for PPO. Override this if you wish to modify the PPO algorithm.
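As a rough guide to what a PPO update step usually combines, the sketch below shows how value_coef, entropy_coef and target_kl commonly enter the computation. It is an assumption-laden illustration in plain Python, not the body of this library's training_step: the helper names are hypothetical, and the simple KL estimate and the direct comparison against target_kl are assumptions.

```python
def ppo_total_loss(policy_loss, value_loss, entropy, value_coef=0.5, entropy_coef=0.0):
    """Combine the loss terms; higher entropy is rewarded, so it is subtracted."""
    return policy_loss + value_coef * value_loss - entropy_coef * entropy


def exceeds_target_kl(old_log_probs, new_log_probs, target_kl=None):
    """Return True if a simple KL estimate is above target_kl (assumed rule)."""
    if target_kl is None:
        # No limit on the KL divergence by default
        return False
    # Simple sample-based estimate: mean of log(old) - log(new)
    approx_kl = sum(o - n for o, n in zip(old_log_probs, new_log_probs)) / len(old_log_probs)
    return approx_kl > target_kl
```

A custom update step could skip or stop further gradient updates on the current rollout once such a check returns True.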