TD3¶

Twin Delayed DDPG (TD3): Addressing Function Approximation Error in Actor-Critic Methods.

TD3 is a direct successor of DDPG and improves on it with three main tricks: clipped double Q-learning, delayed policy updates and target policy smoothing. We recommend reading the OpenAI Spinning Up guide on TD3 to learn more about them.
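As a rough illustration of two of these tricks, the sketch below computes a TD3 critic target in PyTorch using target policy smoothing and clipped double Q-learning. The function name `td3_target` and its arguments are assumptions made for this example, not part of the lightning_baselines3 API; the default noise values mirror the `target_policy_noise` and `target_noise_clip` defaults documented on this page, and actions are assumed to be bounded in [-1, 1].

```python
import torch


def td3_target(actor_target, critic_target1, critic_target2,
               next_obs, reward, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Hypothetical helper: bootstrapped Q target for the critics."""
    with torch.no_grad():
        next_action = actor_target(next_obs)
        # Target policy smoothing: add clipped Gaussian noise to the target action
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)
        # Clipped double Q-learning: use the smaller of the two target estimates
        critic_in = torch.cat([next_obs, next_action], dim=1)
        next_q = torch.min(critic_target1(critic_in), critic_target2(critic_in))
        # reward and done are expected to have shape (batch, 1)
        return reward + gamma * (1.0 - done) * next_q
```

Taking the minimum of the two target critics counteracts the overestimation bias that a single critic tends to accumulate, while the clipped noise discourages the critics from fitting sharp peaks in the Q function.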

Notes¶

Original paper: https://arxiv.org/pdf/1802.09477.pdf
OpenAI Spinning Up guide for TD3: https://spinningup.openai.com/en/latest/algorithms/td3.html
Original Implementation: https://github.com/sfujim/TD3

Note

The original TD3 paper uses a Tanh-activated output layer for the actor, and this example does the same. By default, TD3 expects actions bounded in [-1, 1], but this can be changed by setting squashed_actions=False.
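To make the squashing convention concrete, the sketch below shows one common way to map a Tanh-bounded action in [-1, 1] onto an environment's own action bounds. The helper name `unsquash_action` and the linear rescaling are assumptions for illustration; the library's internal handling may differ.

```python
def unsquash_action(action, low, high):
    """Linearly rescale an action from [-1, 1] to [low, high] (illustrative helper)."""
    return low + 0.5 * (action + 1.0) * (high - low)


# A Tanh output of 0.0 lands in the middle of the action range
print(unsquash_action(0.0, low=-2.0, high=2.0))   # 0.0
# A Tanh output of 1.0 lands on the upper bound
print(unsquash_action(1.0, low=0.0, high=10.0))   # 10.0
```

Because the function only uses arithmetic operators, it also works element-wise on NumPy arrays or tensors of actions.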

Can I use?¶

Recurrent policies: ❌
Multi processing: ❌
Gym spaces:

Space	Action	Observation
Discrete	❌	✔️
Box	✔️	✔️
MultiDiscrete	❌	✔️
MultiBinary	❌	✔️

Example¶

import copy

import torch
from torch import nn

import pytorch_lightning as pl

from lightning_baselines3.off_policy_models import TD3
from lightning_baselines3.common.utils import polyak_update


class Model(TD3):
    def __init__(self, *args, **kwargs):
        super(Model, self).__init__(*args, **kwargs)

        # Note: The output layer of the actor must be Tanh activated
        self.actor = nn.Sequential(
            nn.Linear(self.observation_space.shape[0], 256),
            nn.Tanh(),
            nn.Linear(256, 256),
            nn.Tanh(),
            nn.Linear(256, self.action_space.shape[0]),
            nn.Tanh())

        in_dim = self.observation_space.shape[0] + self.action_space.shape[0]
        self.critic1 = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.Tanh(),
            nn.Linear(256, 256),
            nn.Tanh(),
            nn.Linear(256, 1))

        self.critic2 = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.Tanh(),
            nn.Linear(256, 256),
            nn.Tanh(),
            nn.Linear(256, 1))

        self.actor_target = copy.deepcopy(self.actor)
        self.critic_target1 = copy.deepcopy(self.critic1)
        self.critic_target2 = copy.deepcopy(self.critic2)

        self.save_hyperparameters()

    def forward_actor(self, x):
        return self.actor(x)

    def forward_actor_target(self, x):
        return self.actor_target(x)

    def forward_critic1(self, obs, action):
        return self.critic1(torch.cat([obs, action], dim=1))

    def forward_critic2(self, obs, action):
        return self.critic2(torch.cat([obs, action], dim=1))

    def forward_critic_target1(self, obs, action):
        return self.critic_target1(torch.cat([obs, action], dim=1))

    def forward_critic_target2(self, obs, action):
        return self.critic_target2(torch.cat([obs, action], dim=1))

    def update_targets(self):
        polyak_update(
            self.actor.parameters(),
            self.actor_target.parameters(),
            tau=0.005)
        polyak_update(
            self.critic1.parameters(),
            self.critic_target1.parameters(),
            tau=0.005)
        polyak_update(
            self.critic2.parameters(),
            self.critic_target2.parameters(),
            tau=0.005)

    def predict(self, x, deterministic=True):
        out = self.actor(x)
        if not deterministic:
            out = out + torch.randn_like(out) * 0.1
        out = torch.clamp(out, -1, 1)
        # Detach first: .numpy() fails on tensors that still require grad
        return out.detach().cpu().numpy()

    def configure_optimizers(self):
        opt_actor = torch.optim.Adam(self.actor.parameters(), lr=1e-3)
        opt_critic = torch.optim.Adam(
            list(self.critic1.parameters()) + list(self.critic2.parameters()),
            lr=1e-3)
        return opt_critic, opt_actor


if __name__ == '__main__':
    model = Model(
        env='LunarLanderContinuous-v2',
        eval_env='LunarLanderContinuous-v2',
        warmup_length=10000)

    trainer = pl.Trainer(max_epochs=30, gradient_clip_val=0.5)
    trainer.fit(model)

    model.evaluate(num_eval_episodes=10, render=True)

Results¶

Atari Games¶

Coming soon

How to replicate the results?¶

Coming soon

Parameters¶

class lightning_baselines3.off_policy_models.td3.TD3(env, eval_env, batch_size=128, buffer_length=1000000, warmup_length=100, train_freq=-1, episodes_per_rollout=1, num_rollouts=10, gradient_steps=-1, policy_delay=2, target_policy_noise=0.2, target_noise_clip=0.5, num_eval_episodes=10, gamma=0.99, squashed_actions=True, verbose=0, seed=None)[source]¶

Twin Delayed DDPG (TD3): Addressing Function Approximation Error in Actor-Critic Methods.

Original implementation: https://github.com/sfujim/TD3
Paper: https://arxiv.org/abs/1802.09477
Introduction to TD3: https://spinningup.openai.com/en/latest/algorithms/td3.html

Parameters

env (Union[Env, VecEnv, str]) – The environment to learn from. If registered in Gym, can be str. Can be None for loading trained models
eval_env (Union[Env, VecEnv, str]) – The environment to evaluate on; it must not be parallelised. If registered in Gym, can be str. Can be None for loading trained models
batch_size (int) – Minibatch size for each gradient update
buffer_length (int) – Length of the replay buffer
warmup_length (int) – Number of steps for which to collect transitions before learning starts
train_freq (int) – Update the model every train_freq steps. Set to -1 to disable.
episodes_per_rollout (int) – Update the model every episodes_per_rollout episodes. Note that this cannot be used at the same time as train_freq. Set to -1 to disable.
num_rollouts (int) – Number of rollouts to do per PyTorch Lightning epoch. This does not affect any training dynamic, just how often we evaluate the model since evaluation happens at the end of each Lightning epoch
gradient_steps (int) – How many gradient steps to do after each rollout
policy_delay (int) – The policy and target networks are updated only once every policy_delay training steps, whereas the Q-value networks are updated on every training step (that is, policy_delay times as often).
target_policy_noise (float) – Standard deviation of Gaussian noise added to target policy (smoothing noise)
target_noise_clip (float) – Limit for absolute value of target policy smoothing noise.
num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch
squashed_actions (bool) – Whether the actions are squashed between [-1, 1] and need to be unsquashed
gamma (float) – the discount factor
verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug (default: 0)
seed (Optional[int]) – Seed for the pseudo random generators
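
The effect of `policy_delay` can be sketched with a plain counter. The loop below is a simplified illustration under the assumption that the critics are updated on every gradient step and the actor and target networks on every `policy_delay`-th step; the function name `update_schedule` is hypothetical and the library's exact step counting may differ.

```python
def update_schedule(num_steps, policy_delay=2):
    """Return (step, updates_actor) pairs for a run of gradient steps."""
    schedule = []
    for step in range(1, num_steps + 1):
        # Critics are updated on every step; the actor and the target
        # networks only on every policy_delay-th step.
        updates_actor = step % policy_delay == 0
        schedule.append((step, updates_actor))
    return schedule


for step, updates_actor in update_schedule(4):
    print(step, "critics + actor + targets" if updates_actor else "critics only")
```

With the default `policy_delay=2`, the actor therefore receives half as many updates as the critics.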

configure_optimizer()[source]¶

Function to set up the optimizers. The first optimizer should be for the critics and the second for the actor. Override this function with your own.

Return type: Tuple[Optimizer, Optimizer]
Returns: The critic optimiser, followed by the actor optimiser

forward_actor(obs)[source]¶

Runs the actor network. Override this function with your own.

Parameters: obs (Tensor) – The input observations
Return type: Tensor
Returns: The deterministic action of the actor

forward_actor_target(obs)[source]¶

Runs the target actor network. Override this function with your own.

Parameters: obs (Tensor) – The input observations
Return type: Tensor
Returns: The deterministic action of the target actor

forward_critic1(obs, action)[source]¶

Runs the first critic network. Override this function with your own.

Parameters

obs (Tensor) – The input observations
action (Tensor) – The input actions

Return type

Tensor

Returns

The output Q values of the critic network

forward_critic2(obs, action)[source]¶

Runs the second critic network. Override this function with your own.

Parameters

obs (Tensor) – The input observations
action (Tensor) – The input actions

Return type

Tensor

Returns

The output Q values of the critic network

forward_critic_target1(obs, action)[source]¶

Runs the first target critic network. Override this function with your own.

Parameters

obs (Tensor) – The input observations
action (Tensor) – The input actions

Return type

Tensor

Returns

The output Q values of the critic network

forward_critic_target2(obs, action)[source]¶

Runs the second target critic network. Override this function with your own.

Parameters

obs (Tensor) – The input observations
action (Tensor) – The input actions

Return type

Tensor

Returns

The output Q values of the critic network

training_step(batch, batch_idx, optimizer_idx)[source]¶

Specifies the update step for TD3. Override this if you wish to modify the TD3 algorithm.
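
As a rough sketch of what such an update step involves, the functions below compute the two TD3 losses in PyTorch. Their names and signatures are assumptions made for this example and do not reproduce the library's `training_step`; `target_q` stands for a precomputed bootstrapped target of shape (batch, 1).

```python
import torch
import torch.nn.functional as F


def critic_loss(critic1, critic2, obs, action, target_q):
    """Both critics regress towards the same target Q value."""
    critic_in = torch.cat([obs, action], dim=1)
    return F.mse_loss(critic1(critic_in), target_q) + F.mse_loss(critic2(critic_in), target_q)


def actor_loss(actor, critic1, obs):
    """The actor maximises the first critic's estimate of its own actions."""
    critic_in = torch.cat([obs, actor(obs)], dim=1)
    return -critic1(critic_in).mean()
```

In a two-optimizer setup like the one in the example, the critic loss would be paired with the critic optimizer and the actor loss with the actor optimizer.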

update_targets()[source]¶

Function to update the target networks periodically. Override this function with your own.

Return type: None
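
For reference, a soft (Polyak) target update of the kind performed by `polyak_update` in the example can be sketched as follows. The function `soft_update` is a stand-in written for this illustration, not the library's implementation.

```python
import torch


def soft_update(params, target_params, tau=0.005):
    """Move each target parameter a fraction tau towards its online counterpart."""
    with torch.no_grad():
        for param, target_param in zip(params, target_params):
            # target <- (1 - tau) * target + tau * param
            target_param.mul_(1.0 - tau).add_(param, alpha=tau)
```

A small `tau` such as 0.005 makes the target networks trail the online networks slowly, which keeps the bootstrapped Q targets stable.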