Base RL Class
Common interface for all the RL algorithms
Abstract base classes for RL algorithms.
- class lightning_baselines3.common.base_model.BaseModel(env, eval_env, num_eval_episodes=10, verbose=0, support_multi_env=False, seed=None, use_sde=False)
The base of RL algorithms
- Parameters
env (Union[Env, VecEnv, str]) – The environment to learn from (if registered in Gym, can be str. Can be None for loading trained models)
eval_env (Union[Env, VecEnv, str]) – The environment to evaluate on, must not be vectorised/parallelised (if registered in Gym, can be str. Can be None for loading trained models)
num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch
verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug
support_multi_env (bool) – Whether the algorithm supports training with multiple environments in parallel
seed (Optional[int]) – Seed for the pseudo random generators
use_sde (bool) – Whether to use generalized State Dependent Exploration (gSDE)
- evaluate(num_eval_episodes, deterministic=True, render=False, record=False, record_fn=None)
Evaluate the model with eval_env
- Parameters
num_eval_episodes (int) – Number of episodes to evaluate for
deterministic (bool) – Whether to evaluate deterministically
render (bool) – Whether to render while evaluating
record (bool) – Whether to record while evaluating
record_fn (Optional[str]) – File to record the environment to if we are recording
- Return type
Tuple[List[float], List[int]]
- Returns
A list of total episode rewards and a list of episode lengths
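To illustrate the shape of the return value, here is a minimal sketch of an evaluation loop. It is not the library's implementation: `CountdownEnv`, `run_evaluation`, and the `policy` callable are hypothetical stand-ins, and the toy environment is assumed to follow the classic Gym `reset`/`step` convention.

```python
from typing import Callable, List, Tuple


class CountdownEnv:
    """Hypothetical toy environment: every episode lasts `horizon` steps with reward 1.0 per step."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool, dict]:
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


def run_evaluation(
    env, policy: Callable[[int], int], num_eval_episodes: int
) -> Tuple[List[float], List[int]]:
    """Run full episodes and collect one total reward and one length per episode."""
    episode_rewards: List[float] = []
    episode_lengths: List[int] = []
    for _ in range(num_eval_episodes):
        obs, done = env.reset(), False
        total, length = 0.0, 0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
            length += 1
        episode_rewards.append(total)
        episode_lengths.append(length)
    return episode_rewards, episode_lengths


# The policy here always picks action 0; a real model would call its predict function.
rewards, lengths = run_evaluation(CountdownEnv(horizon=5), policy=lambda obs: 0, num_eval_episodes=3)
```

Both lists have one entry per evaluated episode, so they can be averaged or logged independently.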
- predict(obs, deterministic=False)
Override this function with the predict function of your own model
- Parameters
obs (Union[Tuple, Dict[str, Any], ndarray, int]) – The input observations
deterministic (bool) – Whether to predict deterministically
- Return type
ndarray
- Returns
The chosen actions
- sample_action(obs, deterministic=False)
Samples an action from the environment or from our model
- Parameters
obs (ndarray) – The input observation
deterministic (bool) – Whether we are sampling deterministically.
- Return type
Tuple[ndarray, ndarray]
- Returns
The action to step with, and the action to store in our buffer
- save_hyperparameters(frame=None, exclude=['env', 'eval_env'])
Utility function to save the hyperparameters of the model. This function behaves identically to LightningModule.save_hyperparameters, but excludes the Gym environments by default. See https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html#lightningmodule-hyperparameters for more details.
Base Off-Policy Class
The base RL class for Off-Policy algorithms (e.g. SAC/TD3)
- class lightning_baselines3.off_policy_models.off_policy_model.OffPolicyModel(env, eval_env, batch_size=256, buffer_length=1000000, warmup_length=100, train_freq=-1, episodes_per_rollout=-1, num_rollouts=1, gradient_steps=1, num_eval_episodes=10, gamma=0.99, squashed_actions=False, use_sde=False, sde_sample_freq=-1, use_sde_at_warmup=False, verbose=0, seed=None)
The base for Off-Policy algorithms (ex: SAC/TD3)
- Parameters
env (Union[Env, VecEnv, str]) – The environment to learn from (if registered in Gym, can be str. Can be None for loading trained models)
eval_env (Union[Env, VecEnv, str]) – The environment to evaluate on, must not be vectorised/parallelised (if registered in Gym, can be str. Can be None for loading trained models)
batch_size (int) – Minibatch size for each gradient update
buffer_length (int) – Length of the replay buffer
warmup_length (int) – How many steps of the model to collect transitions for before learning starts
train_freq (int) – Update the model every train_freq steps. Set to -1 to disable.
episodes_per_rollout (int) – Update the model every episodes_per_rollout episodes. Note that this cannot be used at the same time as train_freq. Set to -1 to disable.
num_rollouts (int) – Number of rollouts to do per PyTorch Lightning epoch. This does not affect the training dynamics, only how often the model is evaluated, since evaluation happens at the end of each Lightning epoch.
gradient_steps (int) – How many gradient steps to do after each rollout
num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch
gamma (float) – The discount factor
squashed_actions (bool) – Whether the actions are squashed between [-1, 1] and need to be unsquashed
use_sde (bool) – Whether to use generalized State Dependent Exploration (gSDE)
sde_sample_freq (int) – Sample a new noise matrix every n steps when using gSDE. Default: -1 (only sample at the beginning of the rollout)
use_sde_at_warmup (bool) – Whether to use gSDE instead of uniform sampling during the warm up phase (before learning starts)
verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug (default: 0)
seed (Optional[int]) – Seed for the pseudo random generators
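The interaction between train_freq and episodes_per_rollout can be sketched as a rollout-termination check. `rollout_finished` is a hypothetical helper, not part of the library, and it assumes that exactly one of the two options must be enabled (the documentation above only states that they cannot both be used).

```python
def rollout_finished(
    steps: int,
    episodes: int,
    train_freq: int = -1,
    episodes_per_rollout: int = -1,
) -> bool:
    """Return True once the current rollout has collected enough data for an update."""
    step_mode = train_freq > 0
    episode_mode = episodes_per_rollout > 0
    # Assumption: exactly one of the two schedules is active.
    if step_mode == episode_mode:
        raise ValueError("Enable exactly one of train_freq and episodes_per_rollout")
    if step_mode:
        return steps >= train_freq
    return episodes >= episodes_per_rollout
```

With train_freq=4 the rollout ends after four environment steps regardless of episode boundaries, while episodes_per_rollout=2 ends it only after two complete episodes.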
- sample_action(obs, deterministic=False)
Samples an action from the environment or from our model
- Parameters
obs (ndarray) – The input observation
deterministic (bool) – Whether we are sampling deterministically.
- Return type
Tuple[ndarray, ndarray]
- Returns
The action to step with, and the action to store in our buffer
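One plausible reason for returning two actions is the squashed_actions option: the buffer keeps the action in [-1, 1] while the environment receives it rescaled to its own bounds. The sketch below illustrates that idea under this assumption, together with uniform sampling during warm-up. `unsquash`, `sample_action_sketch`, and all of their parameters are hypothetical and use plain lists rather than ndarrays; the library's actual logic may differ.

```python
import random
from typing import Callable, List, Tuple


def unsquash(action: List[float], low: List[float], high: List[float]) -> List[float]:
    """Map each component linearly from [-1, 1] to [low, high]."""
    return [l + 0.5 * (a + 1.0) * (h - l) for a, l, h in zip(action, low, high)]


def sample_action_sketch(
    obs,
    policy: Callable[[object], List[float]],
    low: List[float],
    high: List[float],
    in_warmup: bool,
    rng: random.Random,
) -> Tuple[List[float], List[float]]:
    """Return (action to step with, action to store in the buffer)."""
    if in_warmup:
        # Warm-up phase: ignore the policy and sample uniformly in squashed space.
        buffer_action = [rng.uniform(-1.0, 1.0) for _ in low]
    else:
        # Assumption: the policy already outputs actions squashed to [-1, 1].
        buffer_action = policy(obs)
    env_action = unsquash(buffer_action, low, high)
    return env_action, buffer_action


rng = random.Random(0)
env_action, buffer_action = sample_action_sketch(
    obs=None, policy=lambda o: [0.0], low=[0.0], high=[10.0], in_warmup=True, rng=rng
)
```

Storing the squashed action keeps the replay buffer consistent with what the policy network outputs, while the environment still sees actions in its native range.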
Base On-Policy Class
The base RL class for On-Policy algorithms (e.g. A2C/PPO)
- class lightning_baselines3.on_policy_models.on_policy_model.OnPolicyModel(env, eval_env, buffer_length, num_rollouts, batch_size, epochs_per_rollout, num_eval_episodes=10, gamma=0.99, gae_lambda=0.95, use_sde=False, sde_sample_freq=-1, verbose=0, seed=None)
The base for On-Policy algorithms (ex: A2C/PPO).
- Parameters
env (Union[Env, VecEnv, str]) – The environment to learn from (if registered in Gym, can be str)
eval_env (Union[Env, VecEnv, str]) – The environment to evaluate on, must not be vectorised/parallelised (if registered in Gym, can be str. Can be None for loading trained models)
buffer_length (int) – Length of the buffer and the number of steps to run for each environment per update
num_rollouts (int) – Number of rollouts to do per PyTorch Lightning epoch. This does not affect the training dynamics, only how often the model is evaluated, since evaluation happens at the end of each Lightning epoch.
batch_size (int) – Minibatch size for each gradient update
epochs_per_rollout (int) – Number of epochs to optimise the loss for
num_eval_episodes (int) – The number of episodes to evaluate for at the end of a PyTorch Lightning epoch
gamma (float) – Discount factor
gae_lambda (float) – Factor for the trade-off of bias vs variance in the Generalized Advantage Estimator. Equivalent to the classic advantage when set to 1.
use_sde (bool) – Whether to use generalized State Dependent Exploration (gSDE) instead of action noise exploration
sde_sample_freq (int) – Sample a new noise matrix every n steps when using gSDE. Default: -1 (only sample at the beginning of the rollout)
verbose (int) – The verbosity level: 0 none, 1 training information, 2 debug
seed (Optional[int]) – Seed for the pseudo random generators
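The roles of gamma and gae_lambda can be illustrated with a minimal sketch of Generalized Advantage Estimation over a single rollout. This is the standard GAE recursion written with plain Python lists, not the library's own code; `compute_gae` and its argument names are hypothetical, and `dones[t]` is assumed to mark that the episode ended after step t.

```python
from typing import List


def compute_gae(
    rewards: List[float],
    values: List[float],
    last_value: float,
    dones: List[bool],
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Compute GAE advantages by iterating backwards through the rollout."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # value estimate of the state after the final step
    for t in reversed(range(len(rewards))):
        non_terminal = 0.0 if dones[t] else 1.0
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        # Exponentially weighted sum of TD errors, cut at episode boundaries.
        gae = delta + gamma * gae_lambda * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With gae_lambda=1 each advantage reduces to the discounted return minus the value estimate (the classic advantage), while gae_lambda=0 leaves only the one-step TD error, which has lower variance but more bias.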
- collect_rollouts()
Collect rollouts and put them into the RolloutBuffer
- Return type
RolloutBufferSamples