TD3¶
Twin Delayed DDPG (TD3): Addressing Function Approximation Error in Actor-Critic Methods.
TD3 is a direct successor of DDPG and improves on it with three major tricks: clipped double Q-learning, delayed policy updates and target policy smoothing. We recommend reading OpenAI's Spinning Up guide on TD3 to learn more about them.
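As a rough illustration of how two of these tricks shape the critics' learning target, the sketch below computes a TD3-style target in plain PyTorch. It is not this library's internal code: the function name `td3_target` and its arguments are hypothetical, and the default noise values simply mirror the `target_policy_noise` and `target_noise_clip` defaults in the class signature further down. The third trick, delayed policy updates, concerns how often the actor is trained rather than the target itself.

```python
import torch


def td3_target(reward, done, next_obs, actor_target, critic_target1, critic_target2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Hypothetical helper: TD3 learning target for a batch of transitions."""
    with torch.no_grad():
        next_action = actor_target(next_obs)
        # Target policy smoothing: perturb the target action with clipped Gaussian noise
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)
        # Clipped double Q-learning: use the smaller of the two target critics' estimates
        q1 = critic_target1(next_obs, next_action)
        q2 = critic_target2(next_obs, next_action)
        next_q = torch.min(q1, q2)
        # Terminal transitions (done == 1) do not bootstrap from the next state
        return reward + gamma * (1.0 - done) * next_q
```

Here `reward` and `done` are assumed to be float tensors of shape `(batch, 1)`, and the three callables are assumed to behave like the `forward_actor_target` and `forward_critic_target*` hooks described in this page.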
Notes¶
Original paper: https://arxiv.org/pdf/1802.09477.pdf
OpenAI Spinning Up guide for TD3: https://spinningup.openai.com/en/latest/algorithms/td3.html
Original Implementation: https://github.com/sfujim/TD3
Note
The original TD3 paper uses a Tanh-activated output layer, and this example does the same.
By default, TD3 expects actions bounded in [-1, 1], but this can be changed by setting squashed_actions=False.
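To make the [-1, 1] convention concrete, here is a minimal sketch of the usual linear rescaling from a squashed action to an environment's action bounds. The helper name `unscale_action` and the example bounds are assumptions for illustration; the library's own handling may differ in detail.

```python
import numpy as np


def unscale_action(squashed, low, high):
    """Hypothetical helper: map actions linearly from [-1, 1] to [low, high]."""
    squashed = np.asarray(squashed, dtype=np.float64)
    return low + 0.5 * (squashed + 1.0) * (high - low)


# Made-up bounds for a two-dimensional Box action space
low = np.array([0.0, -2.0])
high = np.array([1.0, 2.0])
print(unscale_action([-1.0, 1.0], low, high))  # lower bound of dim 0, upper bound of dim 1
print(unscale_action([0.0, 0.0], low, high))   # midpoint of each range
```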
Can I use?¶
Recurrent policies: ❌
Multi processing: ❌
Gym spaces:
| Space | Action | Observation |
|---|---|---|
| Discrete | ❌ | ✔️ |
| Box | ✔️ | ✔️ |
| MultiDiscrete | ❌ | ✔️ |
| MultiBinary | ❌ | ✔️ |
Example¶
import copy

import torch
from torch import nn
import pytorch_lightning as pl

from lightning_baselines3.off_policy_models import TD3
from lightning_baselines3.common.utils import polyak_update


class Model(TD3):
    def __init__(self, *args, **kwargs):
        super(Model, self).__init__(*args, **kwargs)

        # Note: The output layer of the actor must be Tanh activated
        self.actor = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 256),
            nn.Tanh(),
            nn.Linear(256, 256),
            nn.Tanh(),
            nn.Linear(256, self.action_space.shape[0]),
            nn.Tanh())

        # Both critics take the concatenated observation and action as input
        in_dim = self.observation_space.shape[0] + self.action_space.shape[0]
        self.critic1 = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.Tanh(),
            nn.Linear(256, 256),
            nn.Tanh(),
            nn.Linear(256, 1))
        self.critic2 = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.Tanh(),
            nn.Linear(256, 256),
            nn.Tanh(),
            nn.Linear(256, 1))

        # Target networks start as exact copies of the online networks
        self.actor_target = copy.deepcopy(self.actor)
        self.critic_target1 = copy.deepcopy(self.critic1)
        self.critic_target2 = copy.deepcopy(self.critic2)

        self.save_hyperparameters()

    def forward_actor(self, x):
        return self.actor(x)

    def forward_actor_target(self, x):
        return self.actor_target(x)

    def forward_critic1(self, obs, action):
        return self.critic1(torch.cat([obs, action], dim=1))

    def forward_critic2(self, obs, action):
        return self.critic2(torch.cat([obs, action], dim=1))

    def forward_critic_target1(self, obs, action):
        return self.critic_target1(torch.cat([obs, action], dim=1))

    def forward_critic_target2(self, obs, action):
        return self.critic_target2(torch.cat([obs, action], dim=1))

    def update_targets(self):
        # Move each target network a small step (tau) towards its online network
        polyak_update(
            self.actor.parameters(),
            self.actor_target.parameters(),
            tau=0.005)
        polyak_update(
            self.critic1.parameters(),
            self.critic_target1.parameters(),
            tau=0.005)
        polyak_update(
            self.critic2.parameters(),
            self.critic_target2.parameters(),
            tau=0.005)

    def predict(self, x, deterministic=True):
        out = self.actor(x)
        if not deterministic:
            # Add Gaussian exploration noise, then keep actions within [-1, 1]
            out = out + torch.randn_like(out) * 0.1
            out = torch.clamp(out, -1, 1)
        return out.cpu().numpy()

    def configure_optimizers(self):
        opt_actor = torch.optim.Adam(self.actor.parameters(), lr=1e-3)
        opt_critic = torch.optim.Adam(
            list(self.critic1.parameters()) + list(self.critic2.parameters()),
            lr=1e-3)
        # The critic optimizer is returned first, followed by the actor optimizer
        return opt_critic, opt_actor


if __name__ == '__main__':
    model = Model(
        env='LunarLanderContinuous-v2',
        eval_env='LunarLanderContinuous-v2',
        warmup_length=10000)

    trainer = pl.Trainer(max_epochs=30, gradient_clip_val=0.5)
    trainer.fit(model)

    model.evaluate(num_eval_episodes=10, render=True)
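The `update_targets` method above delegates to `polyak_update`. As a rough guide to what a soft (Polyak) update does, the sketch below implements the standard rule `target = (1 - tau) * target + tau * online` in plain PyTorch. The function `soft_update` is a hypothetical stand-in, not the library's helper, whose implementation may differ.

```python
import torch
from torch import nn


def soft_update(params, target_params, tau):
    """Hypothetical stand-in for a Polyak update: blend online weights into the targets."""
    with torch.no_grad():
        for param, target_param in zip(params, target_params):
            # target <- (1 - tau) * target + tau * online
            target_param.mul_(1.0 - tau).add_(param, alpha=tau)


online = nn.Linear(2, 1)
target = nn.Linear(2, 1)
nn.init.constant_(online.weight, 1.0)
nn.init.constant_(target.weight, 0.0)

soft_update(online.parameters(), target.parameters(), tau=0.005)
print(target.weight)  # each weight has moved 0.5% of the way from 0 towards 1
```

With a small `tau` such as the 0.005 used in the example, the target networks trail the online networks slowly, which keeps the learning targets comparatively stable.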
Parameters¶
- class lightning_baselines3.off_policy_models.td3.TD3(env, eval_env, batch_size=128, buffer_length=1000000, warmup_length=100, train_freq=-1, episodes_per_rollout=1, num_rollouts=10, gradient_steps=-1, policy_delay=2, target_policy_noise=0.2, target_noise_clip=0.5, num_eval_episodes=10, gamma=0.99, squashed_actions=True, verbose=0, seed=None)[source]¶
Twin Delayed DDPG (TD3): Addressing Function Approximation Error in Actor-Critic Methods.
Original implementation: https://github.com/sfujim/TD3
Paper: https://arxiv.org/abs/1802.09477
Introduction to TD3: https://spinningup.openai.com/en/latest/algorithms/td3.html
- Parameters
env (Union[Env,VecEnv,str]) – The environment to learn from. If registered in Gym, can be str. Can be None for loading trained models
eval_env (Union[Env,VecEnv,str]) – The environment to evaluate on; must not be parallelised. If registered in Gym, can be str. Can be None for loading trained models
batch_size (int) – Minibatch size for each gradient update
buffer_length (int) – Length of the replay buffer
warmup_length (int) – How many steps to collect transitions for before learning starts
train_freq (int) – Update the model every train_freq steps. Set to -1 to disable.
episodes_per_rollout (int) – Update the model every episodes_per_rollout episodes. Note that this cannot be used at the same time as train_freq. Set to -1 to disable.
num_rollouts (int) – Number of rollouts to do per PyTorch Lightning epoch. This does not affect the training dynamics, only how often the model is evaluated, since evaluation happens at the end of each Lightning epoch
gradient_steps (int) – How many gradient steps to do after each rollout
policy_delay (int) – The policy and target networks are only updated once every policy_delay training steps. The Q-values are updated policy_delay times more often (on every training step).
target_policy_noise (float) – Standard deviation of the Gaussian noise added to the target policy (smoothing noise)
target_noise_clip (float) – Limit for the absolute value of the target policy smoothing noise
num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch
squashed_actions (bool) – Whether the actions are squashed between [-1, 1] and need to be unsquashed
gamma (float) – The discount factor
verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug (default: 0)
seed (Optional[int]) – Seed for the pseudo-random generators
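To illustrate the `policy_delay` description above, this small sketch counts how often each network would be updated over a run of gradient steps. It is a toy schedule under the assumption that the actor and targets are updated on every `policy_delay`-th step; the function `update_schedule` is hypothetical, and the exact step on which the library triggers the delayed update may differ.

```python
def update_schedule(gradient_steps, policy_delay=2):
    """Hypothetical helper: count critic updates vs. delayed actor/target updates."""
    critic_updates = 0
    actor_updates = 0
    for step in range(1, gradient_steps + 1):
        critic_updates += 1          # the critics are trained on every gradient step
        if step % policy_delay == 0:
            actor_updates += 1       # the actor and target networks are updated less often
    return critic_updates, actor_updates


print(update_schedule(10, policy_delay=2))  # (10, 5)
```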
- configure_optimizer()[source]¶
Function to set up the optimizers. The first optimizer should be for the critics and the second for the actor. Override this function with your own.
- Return type
Tuple[Optimizer,Optimizer]
- Returns
The critic optimiser, followed by the actor optimiser
- forward_actor(obs)[source]¶
Runs the actor network. Override this function with your own.
- Parameters
obs (Tensor) – The input observations
- Return type
Tensor
- Returns
The deterministic action of the actor
- forward_actor_target(obs)[source]¶
Runs the target actor network. Override this function with your own.
- Parameters
obs (Tensor) – The input observations
- Return type
Tensor
- Returns
The deterministic action of the target actor
- forward_critic1(obs, action)[source]¶
Runs the first critic network. Override this function with your own.
- Parameters
obs (Tensor) – The input observations
action (Tensor) – The input actions
- Return type
Tensor
- Returns
The output Q values of the critic network
- forward_critic2(obs, action)[source]¶
Runs the second critic network. Override this function with your own.
- Parameters
obs (Tensor) – The input observations
action (Tensor) – The input actions
- Return type
Tensor
- Returns
The output Q values of the critic network
- forward_critic_target1(obs, action)[source]¶
Runs the first target critic network. Override this function with your own.
- Parameters
obs (Tensor) – The input observations
action (Tensor) – The input actions
- Return type
Tensor
- Returns
The output Q values of the target critic network
- forward_critic_target2(obs, action)[source]¶
Runs the second target critic network. Override this function with your own.
- Parameters
obs (Tensor) – The input observations
action (Tensor) – The input actions
- Return type
Tensor
- Returns
The output Q values of the target critic network
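The forward hooks above are enough to express both TD3 losses. The sketch below shows one plausible way to combine them, assuming a learning target has already been computed from the target networks. The names `critic_loss` and `actor_loss` and their arguments are hypothetical, and this is not taken from the library's training step.

```python
import torch
import torch.nn.functional as F


def critic_loss(forward_critic1, forward_critic2, obs, action, target):
    """Hypothetical helper: both critics regress onto the same learning target."""
    q1 = forward_critic1(obs, action)
    q2 = forward_critic2(obs, action)
    return F.mse_loss(q1, target) + F.mse_loss(q2, target)


def actor_loss(forward_actor, forward_critic1, obs):
    """Hypothetical helper: the actor is trained to maximise the first critic's Q value."""
    return -forward_critic1(obs, forward_actor(obs)).mean()
```

In this sketch the critic loss would be minimised with the critic optimiser on every gradient step, while the actor loss would be minimised with the actor optimiser only on the delayed steps.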