PPO¶
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses a trust region to improve the actor).
The main idea is that after an update, the new policy should not be too far from the old policy. To achieve this, PPO uses clipping to avoid overly large updates.
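The clipping idea can be sketched per sample in plain Python. This is a minimal illustration of the clipped surrogate objective from the PPO paper, not the code this library runs: the function name `clipped_surrogate` and its arguments are hypothetical, and a real implementation would operate on batched tensors and minimise the negated mean.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
    """Per-sample PPO-clip objective (a quantity to maximise).

    Hypothetical helper for illustration only; not part of the library API.
    """
    # Probability ratio between the new and the old policy for the taken action
    ratio = math.exp(log_prob_new - log_prob_old)
    # Keep the ratio inside [1 - clip_range, 1 + clip_range]
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Pessimistic bound: take the smaller of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)


# A ratio of 1.5 with a positive advantage is capped at 1 + clip_range
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

With a positive advantage the objective stops growing once the ratio exceeds `1 + clip_range`, so there is no incentive to push the new policy further from the old one. With a negative advantage the unclipped term is kept whenever it is the more pessimistic of the two.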
Note
PPO contains several modifications from the original algorithm not documented by OpenAI: advantages are normalized and the value function can also be clipped.
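The two modifications mentioned in the note can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the helper names are hypothetical, the population standard deviation and the `eps` constant are arbitrary choices, and the value clipping shown is one common formulation (clip the change in the prediction relative to the old value estimate, then regress the clipped prediction towards the return).

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and roughly unit variance.

    Hypothetical helper; the choice of std estimator and eps is an assumption.
    """
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


def clip_value(value_new, value_old, clip_range_vf):
    """Limit how far the new value prediction may move from the old one."""
    delta = max(-clip_range_vf, min(value_new - value_old, clip_range_vf))
    return value_old + delta


def clipped_value_loss(value_new, value_old, target, clip_range_vf):
    """Squared error between the return target and the clipped prediction."""
    return (target - clip_value(value_new, value_old, clip_range_vf)) ** 2


print(normalize_advantages([1.0, 2.0, 3.0]))
print(clip_value(2.0, 1.0, clip_range_vf=0.2))
```

Because the clip range is expressed in the same units as the value estimates, a suitable `clip_range_vf` depends on the scale of the rewards.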
Notes¶
Original paper: https://arxiv.org/abs/1707.06347
Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html
Can I use?¶
Recurrent policies: ❌
Multi processing: ✔️
Gym spaces:
| Space | Action | Observation |
|---|---|---|
| Discrete | ✔️ | ✔️ |
| Box | ✔️ | ✔️ |
| MultiDiscrete | ✔️ | ✔️ |
| MultiBinary | ✔️ | ✔️ |
Example¶
Train a PPO agent on CartPole-v1 using 8 environments.
import gym
import torch
from torch import distributions
from torch import nn
import pytorch_lightning as pl

from lightning_baselines3.common.vec_env import make_vec_env, SubprocVecEnv
from lightning_baselines3.on_policy_models import PPO


class Model(PPO):
    def __init__(self, **kwargs):
        # **kwargs will pass our arguments on to PPO
        super(Model, self).__init__(**kwargs)

        self.actor = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, self.action_space.n),
            nn.Softmax(dim=1))

        self.critic = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1))

        self.save_hyperparameters()

    # This is for training the model
    # Returns the distribution and the corresponding value
    def forward(self, x):
        out = self.actor(x)
        dist = distributions.Categorical(probs=out)
        return dist, self.critic(x).flatten()

    # This is for inference and evaluation of our model, returns the action
    def predict(self, x, deterministic=True):
        out = self.actor(x)
        if deterministic:
            out = torch.max(out, dim=1)[1]
        else:
            out = distributions.Categorical(probs=out).sample()
        return out.cpu().numpy()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=3e-4)
        return optimizer


if __name__ == '__main__':
    env = make_vec_env('CartPole-v1', n_envs=8, vec_env_cls=SubprocVecEnv)
    eval_env = gym.make('CartPole-v1')
    model = Model(env=env, eval_env=eval_env)

    trainer = pl.Trainer(max_epochs=5, gradient_clip_val=0.5)
    trainer.fit(model)

    model.evaluate(num_eval_episodes=10, render=True)
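To get a feel for the scale of this example, the arithmetic below estimates how much data one rollout collects and how many gradient steps follow it. It uses `n_envs=8` from the example together with the defaults from the class signature (`buffer_length=2048`, `batch_size=64`, `epochs_per_rollout=10`). The helper `updates_per_rollout` is hypothetical, and the calculation assumes that every collected sample is visited once per epoch and that a final partial minibatch is kept; the library's actual batching may differ.

```python
import math


def updates_per_rollout(buffer_length, n_envs, batch_size, epochs_per_rollout):
    """Estimate samples collected and gradient steps taken for one rollout.

    Hypothetical helper for illustration; assumes one pass over all samples
    per epoch, with a trailing partial minibatch counted as a step.
    """
    samples = buffer_length * n_envs            # steps per env times number of envs
    minibatches = math.ceil(samples / batch_size)
    return samples, minibatches * epochs_per_rollout


samples, grad_steps = updates_per_rollout(2048, 8, 64, 10)
print(samples, grad_steps)  # prints "16384 2560"
```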
Parameters¶
- class lightning_baselines3.on_policy_models.ppo.PPO(env, eval_env, buffer_length=2048, num_rollouts=1, batch_size=64, epochs_per_rollout=10, num_eval_episodes=10, gamma=0.99, gae_lambda=0.95, clip_range=0.2, clip_range_vf=None, target_kl=None, value_coef=0.5, entropy_coef=0.0, use_sde=False, sde_sample_freq=-1, verbose=0, seed=None)
Proximal Policy Optimization algorithm (PPO) (clip version)
Paper: https://arxiv.org/abs/1707.06347
Code: This implementation borrows code from OpenAI Spinning Up (https://github.com/openai/spinningup/), pytorch-a2c-ppo-acktr-gail (https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail), and Stable Baselines 3 (https://github.com/DLR-RM/stable-baselines3).
Introduction to PPO: https://spinningup.openai.com/en/latest/algorithms/ppo.html
- Parameters
  - env (Union[Env, VecEnv, str]) – The environment to learn from (can be a str if the environment is registered in Gym).
  - eval_env (Union[Env, VecEnv, str]) – The environment to evaluate on; must not be vectorised/parallelised (can be a str if registered in Gym; can be None when loading trained models).
  - buffer_length (int) – Length of the buffer, i.e. the number of steps to run for each environment per update.
  - num_rollouts (int) – Number of rollouts to do per PyTorch Lightning epoch. This does not affect the training dynamics, only how often the model is evaluated, since evaluation happens at the end of each Lightning epoch.
  - batch_size (int) – Minibatch size for each gradient update.
  - epochs_per_rollout (int) – Number of epochs to optimise the loss for.
  - num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch.
  - gamma (float) – Discount factor.
  - gae_lambda (float) – Factor for the trade-off of bias vs. variance in the Generalized Advantage Estimator. Equivalent to the classic advantage when set to 1.
  - clip_range (float) – Clipping parameter; it can be a function of the current progress remaining (from 1 to 0).
  - clip_range_vf (Optional[float]) – Clipping parameter for the value function; it can be a function of the current progress remaining (from 1 to 0). This is a parameter specific to the OpenAI implementation. If None is passed (default), no clipping will be done on the value function. IMPORTANT: this clipping depends on the reward scaling.
  - target_kl (Optional[float]) – Limit the KL divergence between updates, because the clipping alone is not enough to prevent large updates; see issue #213 (https://github.com/hill-a/stable-baselines/issues/213). By default, there is no limit on the KL divergence.
  - value_coef (float) – Value function coefficient for the loss calculation.
  - entropy_coef (float) – Entropy coefficient for the loss calculation.
  - use_sde (bool) – Whether to use generalized State Dependent Exploration (gSDE) instead of action noise exploration.
  - sde_sample_freq (int) – Sample a new noise matrix every n steps when using gSDE. Default: -1 (only sample at the beginning of the rollout).
  - verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug.
  - seed (Optional[int]) – Seed for the pseudo-random generators.
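To make the roles of `gamma` and `gae_lambda` concrete, here is a minimal sketch of Generalized Advantage Estimation over a single trajectory in plain Python. The function `compute_gae` and its argument conventions are assumptions made for illustration (in particular, `dones[t]` is taken to mean that the episode ended with step `t`); the library's own buffer code may be organised differently.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, gae_lambda=0.95):
    """Compute GAE advantages and value targets for one trajectory.

    Hypothetical helper for illustration only.
    rewards[t], values[t]: reward received and value estimate at step t
    last_value: value estimate of the state reached after the final step
    dones[t]: True if the episode terminated with step t
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        non_terminal = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        # Exponentially weighted sum of TD errors
        gae = delta + gamma * gae_lambda * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    last_value=0.0,
    dones=[False, False, True],
    gamma=1.0,
    gae_lambda=1.0,
)
print(adv)  # prints "[3.0, 2.0, 1.0]"
```

With `gae_lambda=1` the advantage reduces to the discounted return minus the value baseline (lower bias, higher variance), while `gae_lambda=0` keeps only the one-step TD error (higher bias, lower variance).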