Learners
TD Learner
mutable struct ExpectedSarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
    policy::AbstractPolicy
end
Expected SARSA learner with learning rate α, discount factor γ, Q-values params and eligibility traces.
The Q-values are updated according to $Q(a, s) ← Q(a, s) + α δ e(a, s)$, where $δ = r + γ \sum_{a'} \pi(a', s') Q(a', s') - Q(a, s)$ with next state $s'$, probability $\pi(a', s')$ of choosing action $a'$ in next state $s'$, and eligibility trace $e(a, s)$ (see NoTraces, ReplacingTraces and AccumulatingTraces).
ExpectedSarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
              tracekind = ReplacingTraces, initvalue = Inf64,
              unseenvalue = 0.,
              policy = VeryOptimisticEpsilonGreedyPolicy(.1))
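For illustration, here is a minimal Julia sketch of the expected SARSA TD error, not the package implementation; the function name expectedsarsadelta and the probability table probs are assumptions, with Q and probs indexed like params above (actions in rows, states in columns).

# Sketch: TD error that weighs each next-state Q-value by the policy's
# probability of choosing that action in s′.
function expectedsarsadelta(Q, probs, γ, r, s, a, s′)
    r + γ * sum(probs[a′, s′] * Q[a′, s′] for a′ in 1:size(Q, 1)) - Q[a, s]
end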
mutable struct QLearning <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
end
Q-learning learner with learning rate α, discount factor γ, Q-values params and eligibility traces.
The Q-values are updated "off-policy" according to $Q(a, s) ← α δ e(a, s)$ where $δ = r + γ \max_{a'} Q(a', s') - Q(a, s)$ with next state $s'$ and $e(a, s)$ is the eligibility trace (see NoTraces
, ReplacingTraces
and AccumulatingTraces
).
QLearning(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
          tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)
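A hedged sketch of this off-policy TD error; the name qlearningdelta is an assumption, and the fallback from Inf64 to unseenvalue follows the section on initial values below.

# Sketch: greedy bootstrap over the next state's Q-values, replacing values
# of novel actions (still at initvalue = Inf64) by unseenvalue.
function qlearningdelta(Q, γ, unseenvalue, r, s, a, s′)
    vmax = maximum(q -> isinf(q) ? unseenvalue : q, view(Q, :, s′))
    r + γ * vmax - Q[a, s]
end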
mutable struct Sarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
end
SARSA learner with learning rate α, discount factor γ, Q-values params and eligibility traces.
The Q-values are updated "on-policy" according to $Q(a, s) ← α δ e(a, s)$ where $δ = r + γ Q(a', s') - Q(a, s)$ with next state $s'$, next action $a'$ and $e(a, s)$ is the eligibility trace (see NoTraces
, ReplacingTraces
and AccumulatingTraces
).
Sarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
      tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)
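The trace-weighted update itself looks the same for all three TD learners; below is a minimal sketch of one on-policy SARSA step (sarsastep! is an assumed name, not package API).

# Sketch: δ uses the actually chosen next action a′, then every Q-value
# moves in proportion to its eligibility trace.
function sarsastep!(Q, e, α, γ, r, s, a, s′, a′)
    δ = r + γ * Q[a′, s′] - Q[a, s]
    Q .+= α * δ .* e
end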
struct AccumulatingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
end
Decaying traces with factor γλ. Traces are updated according to $e(a, s) ← 1 + e(a, s)$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, in which case the trace is set to 0 (for computational efficiency).
AccumulatingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
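A hedged sketch of one such trace step (accumulatetrace! is an assumed name, not package API):

# Sketch: decay all traces by γλ, truncate tiny ones for efficiency, and
# accumulate at the current action-state pair, as in the rule above.
function accumulatetrace!(e, γλ, a, s, minimaltracevalue)
    current = e[a, s]
    e .*= γλ
    e[abs.(e) .< minimaltracevalue] .= 0.0
    e[a, s] = 1 + current
end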
struct NoTraces <: AbstractTraces
end
No eligibility traces, i.e. $e(a, s) = 1$ for the current action $a$ and state $s$, and zero otherwise.
struct ReplacingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
end
Decaying traces with factor γλ. Traces are updated according to $e(a, s) ← 1$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, in which case the trace is set to 0 (for computational efficiency).
ReplacingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
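The replacing variant differs only at the current pair, which is reset to 1 instead of accumulated; a sketch with the assumed name replacetrace!:

# Sketch: same decay and truncation as for accumulating traces, but the
# current pair's trace is replaced by 1 rather than incremented.
function replacetrace!(e, γλ, a, s, minimaltracevalue)
    e .*= γλ
    e[abs.(e) .< minimaltracevalue] .= 0.0
    e[a, s] = 1.0
end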
Initial values, novel actions and unseen values
For TD-error dependent methods, the exploration-exploitation trade-off depends on initvalue and unseenvalue. To distinguish actions that were never chosen before, i.e. novel actions, the default initial Q-value (field params) is initvalue = Inf64. In a state with novel actions, the policy determines how to deal with them. To compute the TD error, unseenvalue is used for states with novel actions. One way to achieve aggressively exploratory behavior is to ensure that unseenvalue (or initvalue) is larger than the largest possible Q-value.
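A small sketch of this mechanism; the helpers isnovel and bootstrap are assumptions, not package API:

na, ns = 4, 10
Q = fill(Inf64, na, ns)            # initvalue = Inf64 marks every action as novel
isnovel(Q, a, s) = isinf(Q[a, s])  # an action never chosen in s still holds Inf64
# When bootstrapping from s′, novel actions contribute unseenvalue instead of Inf64:
bootstrap(Q, s′, unseenvalue) = maximum(q -> isinf(q) ? unseenvalue : q, view(Q, :, s′))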
Policy Gradient Learner
mutable struct Critic <: AbstractBiasCorrector
    α::Float64
    V::Array{Float64, 1}
end
Critic(; α = .1, ns = 10, initvalue = 0.)
struct NoBiasCorrector <: AbstractBiasCorrector
end
mutable struct PolicyGradientBackward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    traces::AccumulatingTraces
    biascorrector::AbstractBiasCorrector
end
Policy gradient learning in the backward view. The parameters are updated according to $params[a, s] += α * r_{eff} * e[a, s]$, where $r_{eff} = r$ for NoBiasCorrector, $r_{eff} = r - rmean$ for RewardLowpassFilterBiasCorrector, and $e[a, s]$ is the eligibility trace.
PolicyGradientBackward(; ns = 10, na = 4, α = .1, γ = .9,
                       tracekind = AccumulatingTraces, initvalue = Inf64,
                       biascorrector = NoBiasCorrector())
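A minimal sketch of this parameter update (pgstep! is an assumed name; rmean is 0 for NoBiasCorrector):

# Sketch: scale the effective reward by the eligibility traces.
function pgstep!(params, e, α, r, rmean = 0.0)
    reff = r - rmean
    params .+= α * reff .* e
end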
mutable struct PolicyGradientForward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    biascorrector::AbstractBiasCorrector
end
mutable struct RewardLowpassFilterBiasCorrector <: AbstractBiasCorrector
    γ::Float64
    rmean::Float64
end
Filters the reward with factor γ and uses the effective reward $(r - rmean)$ to update the parameters.
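The exact filter is not spelled out here; one choice consistent with the description is an exponential moving average, sketched below as an assumption:

# Sketch: low-pass filter the reward with factor γ and return the updated
# running mean together with the effective reward.
function filterreward(rmean, γ, r)
    rmean = γ * rmean + (1 - γ) * r
    rmean, r - rmean
end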
ActorCriticPolicyGradient(; nsteps = 1, γ = .9, ns = 10, na = 4,
                          α = .1, αcritic = .1, initvalue = Inf64)
EpisodicReinforce(; kwargs...) = EpisodicLearner(PolicyGradientForward(; kwargs...))
N-step Learner
struct EpisodicLearner <: AbstractMultistepLearner
    learner::AbstractReinforcementLearner
end
struct NstepLearner <: AbstractReinforcementLearner
    nsteps::Int64
    learner::AbstractReinforcementLearner
end
NstepLearner(; nsteps = 10, learner = Sarsa, kwargs...) =
    NstepLearner(nsteps, learner(; kwargs...))
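Possible usage, assuming the keyword constructor above; the remaining keywords are forwarded to the wrapped learner's constructor:

# Wrap a SARSA learner so that updates use 4-step returns; α and γ are
# passed on to the inner Sarsa constructor.
learner = NstepLearner(nsteps = 4, learner = Sarsa, α = .1, γ = .9)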
mutable struct MonteCarlo <: AbstractReinforcementLearner
    Nsa::Array{Int64, 2}
    γ::Float64
    Q::Array{Float64, 2}
end
Estimate Q values by averaging over returns.
MonteCarlo(; ns = 10, na = 4, γ = .9)
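Averaging over returns can be done incrementally with the visit counts Nsa; a hedged sketch follows (mcupdate! is an assumed name, G the observed return):

# Sketch: nudge Q[a, s] toward the return G by 1/N, which keeps Q[a, s]
# equal to the mean of all returns observed for (a, s) so far.
function mcupdate!(Q, Nsa, a, s, G)
    Nsa[a, s] += 1
    Q[a, s] += (G - Q[a, s]) / Nsa[a, s]
end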
Model Based Learner
mutable struct SmallBackups <: AbstractReinforcementLearner
    γ::Float64
    maxcount::UInt64
    minpriority::Float64
    counter::Int64
    Q::Array{Float64, 2}
    Qprev::Array{Float64, 2}
    V::Array{Float64, 1}
    Nsa::Array{Int64, 2}
    Ns1a0s0::Array{Dict{Tuple{Int64, Int64}, Int64}, 1}
    queue::PriorityQueue
end
maxcount defines the maximal number of backups per action; minpriority is the smallest priority still added to the queue.
SmallBackups(; ns = 10, na = 4, γ = .9, initvalue = Inf64, maxcount = 3, minpriority = 1e-8)
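The fields suggest prioritized sweeping: value changes are propagated backwards through the learned model via the priority queue, at most maxcount backups per action, and only changes larger than minpriority are queued. A hedged sketch of the queue handling only (all names here are assumptions and the model update is omitted):

using DataStructures: PriorityQueue, enqueue!, dequeue!

# Sketch: process up to maxcount prioritized backups; queue predecessors
# whose potential value change exceeds minpriority.
function sweep!(ΔV, pred, queue, γ, maxcount, minpriority)
    for _ in 1:maxcount
        isempty(queue) && break
        s = dequeue!(queue)                # state with the most urgent change
        for (s0, p) in pred[s]             # predecessors s0 reaching s with probability p
            prio = γ * p * abs(ΔV[s])
            if prio > minpriority && !haskey(queue, s0)
                enqueue!(queue, s0, -prio) # negate: PriorityQueue pops the smallest key
            end
        end
        ΔV[s] = 0.0                        # change at s has been propagated
    end
end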