PPO

The Proximal Policy Optimization algorithm combines ideas from A2C (multiple workers) and TRPO (a trust region to improve the actor).

The main idea is that after an update, the new policy should not be too far from the old policy. To achieve this, PPO uses clipping to avoid overly large updates.
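The clipping idea can be illustrated with a minimal plain-Python sketch of the per-sample clipped objective from the PPO paper. The function name `clipped_surrogate` is hypothetical and is not part of lightning_baselines3; the library itself works on batched tensors, so this only shows the arithmetic.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO clipped objective (a quantity to maximise).

    ratio      -- pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage  -- estimated advantage of that action
    clip_range -- how far the ratio may move away from 1 before it is clipped
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum gives a pessimistic bound: the policy gains nothing
    # from pushing the ratio outside [1 - clip_range, 1 + clip_range].
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    print(clipped_surrogate(1.5, 1.0))   # ratio too large, positive advantage
    print(clipped_surrogate(0.5, -1.0))  # ratio too small, negative advantage
    print(clipped_surrogate(1.1, 1.0))   # ratio inside the clip range
```

With `clip_range=0.2`, a ratio of 1.5 on a positive advantage is capped at 1.2, while a ratio of 1.1 is left untouched. A ratio of 0.5 on a positive advantage is also left unclipped, because the unclipped term is already the smaller of the two.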

Note

This PPO implementation contains several modifications to the original algorithm that are not documented by OpenAI: advantages are normalized, and the value function can also be clipped.
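As a rough illustration of these two modifications, here is a plain-Python sketch. Both function names are hypothetical, and the details are assumptions rather than this library's code: the population standard deviation is used for normalization, and the value clipping follows the form used in Stable Baselines 3, where the new prediction is kept within `clip_range_vf` of the old one before the squared error is computed.

```python
from statistics import fmean, pstdev


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to roughly zero mean and unit variance."""
    mean = fmean(advantages)
    std = pstdev(advantages)
    # eps guards against division by zero when all advantages are equal
    return [(a - mean) / (std + eps) for a in advantages]


def clipped_value_loss(new_value, old_value, target_return, clip_range_vf):
    """Squared error against the return, with the value update clipped."""
    # Limit how far the new prediction may move away from the old prediction
    delta = max(-clip_range_vf, min(new_value - old_value, clip_range_vf))
    clipped_value = old_value + delta
    return (target_return - clipped_value) ** 2


if __name__ == "__main__":
    print(normalize_advantages([1.0, 2.0, 3.0]))
    print(clipped_value_loss(new_value=2.0, old_value=1.0,
                             target_return=3.0, clip_range_vf=0.2))
```

Because the clip is expressed in the same units as the returns, a sensible `clip_range_vf` depends on how the rewards are scaled.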

Notes

Original paper: https://arxiv.org/abs/1707.06347
Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Can I use?

Recurrent policies: ❌
Multi processing: ✔️
Gym spaces:

Space	Action	Observation
Discrete	✔️	✔️
Box	✔️	✔️
MultiDiscrete	✔️	✔️
MultiBinary	✔️	✔️

Example

Train a PPO agent on CartPole-v1 using 8 environments.

import gym

import torch
from torch import distributions
from torch import nn

import pytorch_lightning as pl

from lightning_baselines3.common.vec_env import make_vec_env, SubprocVecEnv
from lightning_baselines3.on_policy_models import PPO


class Model(PPO):
    def __init__(self, **kwargs):
        # **kwargs will pass our arguments on to PPO
        super(Model, self).__init__(**kwargs)

        self.actor = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, self.action_space.n),
            nn.Softmax(dim=1))

        self.critic = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1))

        self.save_hyperparameters()

    # This is for training the model
    # Returns the distribution and the corresponding value
    def forward(self, x):
        out = self.actor(x)
        dist = distributions.Categorical(probs=out)
        return dist, self.critic(x).flatten()

    # This is for inference and evaluation of our model, returns the action
    def predict(self, x, deterministic=True):
        out = self.actor(x)
        if deterministic:
            out = torch.max(out, dim=1)[1]
        else:
            out = distributions.Categorical(probs=out).sample()
        return out.cpu().numpy()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=3e-4)
        return optimizer


if __name__ == '__main__':
    env = make_vec_env('CartPole-v1', n_envs=8, vec_env_cls=SubprocVecEnv)
    eval_env = gym.make('CartPole-v1')
    model = Model(env=env, eval_env=eval_env)

    trainer = pl.Trainer(max_epochs=5, gradient_clip_val=0.5)
    trainer.fit(model)

    model.evaluate(num_eval_episodes=10, render=True)

Results

Atari Games

Coming soon

How to replicate the results?

Coming soon

Parameters

class lightning_baselines3.on_policy_models.ppo.PPO(env, eval_env, buffer_length=2048, num_rollouts=1, batch_size=64, epochs_per_rollout=10, num_eval_episodes=10, gamma=0.99, gae_lambda=0.95, clip_range=0.2, clip_range_vf=None, target_kl=None, value_coef=0.5, entropy_coef=0.0, use_sde=False, sde_sample_freq=-1, verbose=0, seed=None)
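To get a feel for how the default sizes in this signature interact, the following back-of-the-envelope sketch counts the gradient updates made per rollout. It assumes that the rollout is flattened across environments and split into full minibatches, as in Stable Baselines 3; the helper name `updates_per_rollout` is hypothetical, and `n_envs=8` is taken from the example above.

```python
def updates_per_rollout(buffer_length=2048, n_envs=8, batch_size=64,
                        epochs_per_rollout=10):
    """Return (samples, minibatches per epoch, gradient updates) per rollout."""
    # Each environment contributes buffer_length steps to the rollout
    samples = buffer_length * n_envs
    # Assumption: only full minibatches are counted
    minibatches = samples // batch_size
    return samples, minibatches, minibatches * epochs_per_rollout


if __name__ == "__main__":
    print(updates_per_rollout())  # (16384, 256, 2560)
```

Under these assumptions, increasing the number of environments increases the data collected per rollout and therefore the number of gradient updates between rollouts, unless `batch_size` is scaled up as well.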

Proximal Policy Optimization algorithm (PPO) (clip version)

Paper: https://arxiv.org/abs/1707.06347
Code: This implementation borrows code from OpenAI Spinning Up (https://github.com/openai/spinningup/), pytorch-a2c-ppo-acktr-gail (https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail), and Stable Baselines 3 (https://github.com/DLR-RM/stable-baselines3).

Introduction to PPO: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Parameters

env (Union[Env, VecEnv, str]) – The environment to learn from (can be a str if the environment is registered in Gym)
eval_env (Union[Env, VecEnv, str]) – The environment to evaluate on; it must not be vectorised/parallelised (can be a str if the environment is registered in Gym, and can be None when loading trained models)
buffer_length (int) – Length of the buffer, which is also the number of steps to run for each environment per update
num_rollouts (int) – Number of rollouts to do per PyTorch Lightning epoch. This does not affect any training dynamic, just how often we evaluate the model since evaluation happens at the end of each Lightning epoch
batch_size (int) – Minibatch size for each gradient update
epochs_per_rollout (int) – Number of epochs to optimise the loss for
num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch
gamma (float) – Discount factor
gae_lambda (float) – Factor trading off bias against variance in the Generalized Advantage Estimator. Equivalent to the classic advantage when set to 1.
clip_range (float) – Clipping parameter; it can be a function of the current progress remaining (from 1 to 0).
clip_range_vf (Optional[float]) – Clipping parameter for the value function; it can be a function of the current progress remaining (from 1 to 0). This parameter is specific to the OpenAI implementation. If None is passed (default), no clipping is done on the value function. IMPORTANT: this clipping depends on the reward scaling.
target_kl (Optional[float]) – Limit on the KL divergence between updates, because clipping alone is not enough to prevent large updates; see issue #213 (cf. https://github.com/hill-a/stable-baselines/issues/213). By default, there is no limit on the KL divergence.
value_coef (float) – Value function coefficient for the loss calculation
entropy_coef (float) – Entropy coefficient for the loss calculation
use_sde (bool) – Whether to use generalized State Dependent Exploration (gSDE) instead of action noise exploration
sde_sample_freq (int) – Sample a new noise matrix every n steps when using gSDE. Default: -1 (only sample at the beginning of the rollout)
verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug
seed (Optional[int]) – Seed for the pseudo random generators
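The roles of gamma and gae_lambda in the list above can be sketched with a plain-Python version of Generalized Advantage Estimation. This is a minimal illustration of the standard recursion, not this library's code: the function name `compute_gae` is hypothetical, and it assumes that `dones[t]` marks an episode that ended after step `t` and that `last_value` is the critic's estimate for the state following the final step.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one environment's rollout."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the advantage of the step after it
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Targets for the critic: advantage plus the value baseline
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
                           dones=[False, False, True], last_value=0.0,
                           gamma=1.0, gae_lambda=1.0)
    print(adv, ret)
```

Setting `gae_lambda=1` sums all future TD errors, which gives the discounted return minus the value baseline (high variance, low bias). Setting it to 0 keeps only the one-step TD error (low variance, more bias).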

forward(x)

Runs both the actor and critic networks

Parameters: x (Tensor) – The input observations
Return type: Tuple[Distribution, Tensor]
Returns: The action distribution from the actor and the value estimates from the critic

training_step(batch, batch_idx)

Specifies the update step for PPO. Override this if you wish to modify the PPO algorithm.