agent.agents package

Submodules

agent.interfaces.partials.agents module

agent.agents.ppo_agent module

class agent.agents.ppo_agent.PPOAgent(*args, **kwargs)[source]

Bases: agent.interfaces.partials.agents.torch_agents.actor_critic_agent.ActorCriticAgent

PPO, the Proximal Policy Optimization method.

See __defaults__ for the default parameters.
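
For orientation, the sketch below shows the clipped surrogate objective commonly optimised by PPO implementations of this kind; the function name, tensor arguments and the clip ratio of 0.2 are illustrative assumptions, not attributes of PPOAgent.

    import torch

    def clipped_surrogate_loss(new_log_probs: torch.Tensor,
                               old_log_probs: torch.Tensor,
                               advantages: torch.Tensor,
                               clip_epsilon: float = 0.2) -> torch.Tensor:
        # Probability ratio between the updated policy and the behaviour policy.
        ratio = torch.exp(new_log_probs - old_log_probs)
        # Unclipped and clipped surrogate terms.
        surrogate = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
        # Maximise the pessimistic bound, i.e. minimise its negation.
        return -torch.min(surrogate, clipped).mean()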

back_trace_advantages(transitions)[source]
evaluate(batch, _last_value_estimate, discrete=False, **kwargs)[source]
evaluate2(*, states, actions, log_probs, returns, advantage, **kwargs)[source]
evaluate3(batch, discrete=False, **kwargs)[source]
inner_ppo_update(states, actions, log_probs, returns, advantages)[source]
kl_target_stop(old_log_probs, new_log_probs, kl_target=0.03, beta_max=20, beta_min=0.05)[source]

TRPO-style penalised surrogate (TensorFlow reference):

    negloss = -tf.reduce_mean(self.advantages_ph * tf.exp(self.logp - self.prev_logp))
    negloss += tf.reduce_mean(self.beta_ph * self.kl_divergence)
    negloss += tf.reduce_mean(self.ksi_ph * tf.square(tf.maximum(0.0, self.kl_divergence - 2 * self.kl_target)))

    self.ksi = 10

Adaptive kl_target = 0.01 or kl_target = 0.03.

Parameters
  • kl_target

  • beta_max

  • beta_min

  • old_log_probs

  • new_log_probs

Returns
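
Below is a minimal sketch of the adaptive KL penalty rule suggested by the parameters above, assuming the usual PPO/TRPO recipe: the penalty coefficient beta grows when the measured KL divergence overshoots kl_target and shrinks when it undershoots, clamped to [beta_min, beta_max], and further update epochs on the current batch stop early when the divergence is too large. The 1.5 threshold, the doubling/halving factors and the returned tuple are assumptions, not the documented behaviour of kl_target_stop.

    import torch

    def adaptive_kl_stop(old_log_probs: torch.Tensor,
                         new_log_probs: torch.Tensor,
                         beta: float,
                         kl_target: float = 0.03,
                         beta_max: float = 20.0,
                         beta_min: float = 0.05):
        # Approximate KL(old || new) from log-probabilities of the sampled actions.
        approx_kl = (old_log_probs - new_log_probs).mean().item()
        # Grow the penalty when the policy moved too far, shrink it when it barely moved.
        if approx_kl > 1.5 * kl_target:
            beta = min(beta * 2.0, beta_max)
        elif approx_kl < kl_target / 1.5:
            beta = max(beta * 0.5, beta_min)
        # Tell the caller to stop further update epochs on this batch if KL is too large.
        stop_early = approx_kl > 1.5 * kl_target
        return beta, stop_early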

static ppo_mini_batch_iter(mini_batch_size: int, states: Any, actions: Any, log_probs: Any, returns: Any, advantage: Any) → iter[source]
react()[source]
take_n_steps(initial_state, environment, n=100, render=False, render_frequency=100)[source]
update_models(*, stat_writer=None, **kwargs) → None[source]
update_targets(*args, **kwargs) → None[source]
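
The static ppo_mini_batch_iter above can be pictured with the hypothetical stand-in below, which shuffles the rollout once and yields aligned slices of the given tensors; the tensor shapes, the shuffle and the tuple layout are assumptions. A typical training loop would consume one such iterator per optimisation epoch, feeding each yielded tuple to an update step such as inner_ppo_update.

    import torch

    def mini_batch_iter(mini_batch_size: int,
                        states: torch.Tensor,
                        actions: torch.Tensor,
                        log_probs: torch.Tensor,
                        returns: torch.Tensor,
                        advantage: torch.Tensor):
        batch_size = states.shape[0]
        # Shuffle once so every transition appears in exactly one mini-batch per pass.
        indices = torch.randperm(batch_size)
        for start in range(0, batch_size, mini_batch_size):
            idx = indices[start:start + mini_batch_size]
            yield (states[idx], actions[idx], log_probs[idx],
                   returns[idx], advantage[idx])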
agent.agents.ppo_agent.ppo_test(rollouts=None, skip=True)[source]

Module contents