Learners

TD Learner

mutable struct ExpectedSarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
    policy::AbstractPolicy
end

Expected Sarsa Learner with learning rate α, discount factor γ, Q-values params and eligibility traces.

The Q-values are updated according to $Q(a, s) ← Q(a, s) + α δ e(a, s)$, where $δ = r + γ \sum_{a'} \pi(a', s') Q(a', s') - Q(a, s)$ with next state $s'$, probability $\pi(a', s')$ of choosing action $a'$ in next state $s'$, and eligibility trace $e(a, s)$ (see NoTraces, ReplacingTraces and AccumulatingTraces).

ExpectedSarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8, 
                tracekind = ReplacingTraces, initvalue = Inf64,
                unseenvalue = 0.,
                policy = VeryOptimisticEpsilonGreedyPolicy(.1))

See also Initial values, novel actions and unseen values.
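
With the package loaded, a learner with the keyword interface documented above might be constructed as follows (the numeric values are arbitrary; only keywords and the policy type from the signature above are used):

learner = ExpectedSarsa(ns = 25, na = 4, α = .05, γ = .99, λ = .8,
                        tracekind = AccumulatingTraces,
                        policy = VeryOptimisticEpsilonGreedyPolicy(.05))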

mutable struct QLearning <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
end

QLearner with learning rate α, discount factor γ, Q-values params and eligibility traces.

The Q-values are updated "off-policy" according to $Q(a, s) ← α δ e(a, s)$ where $δ = r + γ \max_{a'} Q(a', s') - Q(a, s)$ with next state $s'$ and $e(a, s)$ is the eligibility trace (see NoTraces, ReplacingTraces and AccumulatingTraces).

QLearning(; ns = 10, na = 4, α = .1, γ = .9, λ = .8, 
            tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)

See also Initial values, novel actions and unseen values.
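
As an illustration of the off-policy update rule above, here is a minimal standalone sketch (not the package implementation; it assumes a Q-table Q[a, s] and a trace matrix e[a, s] of the same shape):

# Standalone sketch of the off-policy update: TD error with a max over next actions,
# applied to all Q-table entries weighted by the eligibility traces.
function qlearning_update!(Q, e, s, a, r, s′; α = .1, γ = .9)
    δ = r + γ * maximum(Q[:, s′]) - Q[a, s]
    Q .+= α * δ .* e
    return δ
end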

mutable struct Sarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
end

Sarsa Learner with learning rate α, discount factor γ, Q-values params and eligibility traces.

The Q-values are updated "on-policy" according to $Q(a, s) ← α δ e(a, s)$ where $δ = r + γ Q(a', s') - Q(a, s)$ with next state $s'$, next action $a'$ and $e(a, s)$ is the eligibility trace (see NoTraces, ReplacingTraces and AccumulatingTraces).

Sarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8, 
        tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)

See also Initial values, novel actions and unseen values.
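
A corresponding standalone sketch of the on-policy Sarsa update, which additionally needs the next action a′ (same assumptions about the array layout as in the sketch above):

# Standalone sketch of the on-policy update with sampled next action a′.
function sarsa_update!(Q, e, s, a, r, s′, a′; α = .1, γ = .9)
    δ = r + γ * Q[a′, s′] - Q[a, s]
    Q .+= α * δ .* e
    return δ
end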

struct AccumulatingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
end

Decaying traces with factor γλ.

Traces are updated according to $e(a, s) ← 1 + e(a, s)$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, in which case the trace is set to 0 (for computational efficiency).

AccumulatingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
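
For illustration, a standalone sketch of this trace update on a trace matrix e[a, s] (the layout is an assumption, not the package internals):

# Decay all traces, accumulate the trace of the current pair, and clip tiny traces.
function accumulate_trace!(e, a, s; γλ = .72, minimaltracevalue = 1e-12)
    eas = e[a, s]
    e .*= γλ                          # e(a, s) ← γλ e(a, s) for all other pairs
    e[a, s] = 1 + eas                 # e(a, s) ← 1 + e(a, s) for the current pair
    e[e .< minimaltracevalue] .= 0.   # drop negligible traces for efficiency
    return e
end
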
struct NoTraces <: AbstractTraces end

No eligibility traces, i.e. $e(a, s) = 1$ for current action $a$ and state $s$ and zero otherwise.

struct ReplacingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
end

Decaying traces with factor γλ.

Traces are updated according to $e(a, s) ← 1$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, in which case the trace is set to 0 (for computational efficiency).

ReplacingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
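
The replacing variant differs from the accumulating one only in how the current pair is treated; a sketch under the same assumptions as above:

# Decay all traces, clip tiny ones, then replace (rather than accumulate) the current trace.
function replace_trace!(e, a, s; γλ = .72, minimaltracevalue = 1e-12)
    e .*= γλ
    e[e .< minimaltracevalue] .= 0.
    e[a, s] = 1.
    return e
end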

Initial values, novel actions and unseen values

For TD-error dependent methods, the exploration-exploitation trade-off depends on the initvalue and the unseenvalue. To distinguish actions that were never chosen before, i.e. novel actions, the default initial Q-value (field params) is initvalue = Inf64. In a state with novel actions, the policy determines how to deal with them. To compute the TD error, the unseenvalue is used for states with novel actions. One way to achieve aggressively exploratory behavior is to ensure that unseenvalue (or initvalue) is larger than the largest possible Q-value.
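
The following sketch shows how this scheme can enter the computation of a TD target; the helper is hypothetical and only illustrates replacing the Inf64 marker by unseenvalue:

# Hypothetical helper: novel actions are marked by Q-values of Inf64 and are
# treated as having value `unseenvalue` when forming the target,
# e.g. δ = r + γ * futurevalue(Q[:, s′], unseenvalue) - Q[a, s].
futurevalue(Qs′, unseenvalue) = maximum(q == Inf64 ? unseenvalue : q for q in Qs′)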

Policy Gradient Learner

mutable struct Critic <: AbstractBiasCorrector
    α::Float64
    V::Array{Float64, 1}
end

Critic(; α = .1, ns = 10, initvalue = 0.)

struct NoBiasCorrector <: AbstractBiasCorrector end

mutable struct PolicyGradientBackward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    traces::AccumulatingTraces
    biascorrector::AbstractBiasCorrector
end

Policy gradient learning in the backward view.

The parameters are updated according to $params[a, s] += α * r_{eff} * e[a, s]$ where $r_{eff} = r$ for NoBiasCorrector, $r_{eff} = r - rmean$ for RewardLowpassFilterBiasCorrector and e[a, s] is the eligibility trace.

PolicyGradientBackward(; ns = 10, na = 4, α = .1, γ = .9, 
               tracekind = AccumulatingTraces, initvalue = Inf64,
               biascorrector = NoBiasCorrector())
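
A standalone sketch of the parameter update described above, with the effective reward passed in directly (rmean = 0 corresponds to NoBiasCorrector; this is not the package code):

# params[a, s] += α * r_eff * e[a, s], applied to all entries via the trace matrix.
function policygradient_update!(params, e, r; α = .1, rmean = 0.)
    params .+= α * (r - rmean) .* e
    return params
end
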
mutable struct PolicyGradientForward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    biascorrector::AbstractBiasCorrector
end

mutable struct RewardLowpassFilterBiasCorrector <: AbstractBiasCorrector
    γ::Float64
    rmean::Float64
end

Filters the reward with factor γ and uses effective reward (r - rmean) to update the parameters.
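
One plausible form of such a filter is an exponential moving average (an assumption for illustration; the exact filter used by the package is not documented here):

# Running estimate of the mean reward; r - rmean is then used as the effective reward.
lowpassfilter(rmean, r; γ = .9) = γ * rmean + (1 - γ) * r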

ActorCriticPolicyGradient(; nsteps = 1, γ = .9, ns = 10, na = 4, 
                            α = .1, αcritic = .1, initvalue = Inf64)
EpisodicReinforce(; kwargs...) = EpisodicLearner(PolicyGradientForward(; kwargs...))

N-step Learner

struct EpisodicLearner <: AbstractMultistepLearner
    learner::AbstractReinforcementLearner
end

struct NstepLearner <: AbstractReinforcementLearner
    nsteps::Int64
    learner::AbstractReinforcementLearner
end

NstepLearner(; nsteps = 10, learner = Sarsa, kwargs...) = 
    NstepLearner(nsteps, learner(; kwargs...))
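
Per this convenience constructor, keyword arguments other than nsteps and learner are forwarded to the wrapped learner's constructor; for example (values arbitrary):

learner = NstepLearner(nsteps = 4, learner = QLearning, ns = 20, na = 3, γ = .95)
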
mutable struct MonteCarlo <: AbstractReinforcementLearner
    Nsa::Array{Int64, 2}
    γ::Float64
    Q::Array{Float64, 2}
end

Estimate Q values by averaging over returns.

MonteCarlo(; ns = 10, na = 4, γ = .9)
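
A standalone sketch of averaging returns with an incremental mean, given a full return G observed for the pair (a, s) (the incremental form is one way to implement the averaging, not necessarily the package's):

# Count the visit and move Q[a, s] towards the running average of observed returns.
function montecarlo_update!(Q, Nsa, s, a, G)
    Nsa[a, s] += 1
    Q[a, s] += (G - Q[a, s]) / Nsa[a, s]
    return Q[a, s]
end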

Model Based Learner

mutable struct SmallBackups <: AbstractReinforcementLearner
    γ::Float64
    maxcount::UInt64
    minpriority::Float64
    counter::Int64
    Q::Array{Float64, 2}
    Qprev::Array{Float64, 2}
    V::Array{Float64, 1}
    Nsa::Array{Int64, 2}
    Ns1a0s0::Array{Dict{Tuple{Int64, Int64}, Int64}, 1}
    queue::PriorityQueue
end

See Harm van Seijen and Rich Sutton, "Planning by Prioritized Sweeping with Small Backups", Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):361-369, 2013.

maxcount defines the maximal number of backups per action; minpriority is the smallest priority still added to the queue.


SmallBackups(; ns = 10, na = 4, γ = .9, initvalue = Inf64, maxcount = 3, minpriority = 1e-8)
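
With the package loaded, a model-based learner with the keywords documented above might be constructed as follows (values arbitrary; a larger maxcount allows more backups per action):

learner = SmallBackups(ns = 50, na = 4, γ = .95, maxcount = 10, minpriority = 1e-8)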
