Learners
TD Learner
```julia
mutable struct ExpectedSarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
    policy::AbstractPolicy
```

Expected Sarsa learner with learning rate α, discount factor γ, Q-values params and eligibility traces.

The Q-values are updated according to $Q(a, s) ← Q(a, s) + α δ e(a, s)$ where $δ = r + γ \sum_{a'} \pi(a', s') Q(a', s') - Q(a, s)$ with next state $s'$, probability $\pi(a', s')$ of choosing action $a'$ in next state $s'$, and eligibility trace $e(a, s)$ (see NoTraces, ReplacingTraces and AccumulatingTraces).

```julia
ExpectedSarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
                tracekind = ReplacingTraces, initvalue = Inf64,
                unseenvalue = 0.,
                policy = VeryOptimisticEpsilonGreedyPolicy(.1))
```
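The expected target averages the next-state Q-values under the policy. A minimal sketch of the TD-error computation, assuming the Q-matrix (the params field) has shape na × ns and is indexed `Q[a, s]`, with the next-state action probabilities given by the policy; `expectedsarsadelta` is an illustrative helper, not part of the package:

```julia
# Sketch: Expected Sarsa TD error for one transition (s, a, r, s′).
# `Q` is the na × ns matrix of Q-values; `probs` holds the policy's
# action probabilities π(·, s′) in the next state.
function expectedsarsadelta(Q::Matrix{Float64}, probs::Vector{Float64},
                            s::Int, a::Int, r::Float64, s′::Int, γ::Float64)
    target = r + γ * sum(probs .* Q[:, s′])  # expectation over next actions
    target - Q[a, s]                         # δ
end
```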
```julia
mutable struct QLearning <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
```

QLearner with learning rate α, discount factor γ, Q-values params and eligibility traces.

The Q-values are updated "off-policy" according to $Q(a, s) ← Q(a, s) + α δ e(a, s)$ where $δ = r + γ \max_{a'} Q(a', s') - Q(a, s)$ with next state $s'$ and eligibility trace $e(a, s)$ (see NoTraces, ReplacingTraces and AccumulatingTraces).

```julia
QLearning(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
            tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)
```
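For comparison, a sketch of the off-policy Q-learning TD error under the same indexing convention (`Q[a, s]`, assumed na × ns); `qlearningdelta` is an illustrative helper:

```julia
# Sketch: Q-learning TD error for one transition (s, a, r, s′);
# the bootstrap target uses the greedy value of the next state.
qlearningdelta(Q, s, a, r, s′, γ) = r + γ * maximum(Q[:, s′]) - Q[a, s]
```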
```julia
mutable struct Sarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
```

Sarsa Learner with learning rate α, discount factor γ, Q-values params and eligibility traces.

The Q-values are updated "on-policy" according to $Q(a, s) ← Q(a, s) + α δ e(a, s)$ where $δ = r + γ Q(a', s') - Q(a, s)$ with next state $s'$, next action $a'$, and eligibility trace $e(a, s)$ (see NoTraces, ReplacingTraces and AccumulatingTraces).

```julia
Sarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
        tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)
```
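All three TD learners then apply the same trace-weighted correction to every action-state pair. A minimal sketch, assuming the trace matrix `e` has the same na × ns shape as `Q`; `tdupdate!` is an illustrative helper:

```julia
# Sketch: TD update with eligibility traces,
# Q(a, s) ← Q(a, s) + α δ e(a, s) for all pairs at once.
tdupdate!(Q, e, δ, α) = (Q .+= α * δ .* e; Q)
```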
```julia
struct AccumulatingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
```

Decaying traces with factor γλ.

Traces are updated according to $e(a, s) ← 1 + e(a, s)$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, where the trace is set to 0 (for computational efficiency).

```julia
AccumulatingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
```

```julia
struct NoTraces <: AbstractTraces
```

No eligibility traces, i.e. $e(a, s) = 1$ for the current action $a$ and state $s$ and zero otherwise.
```julia
struct ReplacingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
```

Decaying traces with factor γλ.

Traces are updated according to $e(a, s) ← 1$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, where the trace is set to 0 (for computational efficiency).

```julia
ReplacingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
```
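The two decaying trace kinds differ only in how the current pair is treated. A minimal sketch of both updates, assuming an na × ns trace matrix; `updatetrace!` is an illustrative helper and the package's exact order of decay and truncation may differ:

```julia
# Sketch: decay all other traces by γλ, truncate tiny values, and set the
# trace of the current pair (a, s) by accumulation (1 + e) or replacement (1).
function updatetrace!(e::Matrix{Float64}, a::Int, s::Int,
                      γλ::Float64, minimaltracevalue::Float64;
                      accumulating::Bool = false)
    current = e[a, s]
    e .*= γλ
    e[e .< minimaltracevalue] .= 0.
    e[a, s] = accumulating ? 1 + current : 1.
    e
end
```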
Initial values, novel actions and unseen values

For TD-error dependent methods, the exploration-exploitation trade-off depends on the initvalue and the unseenvalue. To distinguish actions that were never chosen before, i.e. novel actions, the default initial Q-value (field params) is initvalue = Inf64. In a state with novel actions, the policy determines how to deal with them. To compute the TD-error, the unseenvalue is used for states with novel actions. One way to achieve aggressively exploratory behavior is to ensure that unseenvalue (or initvalue) is larger than the largest possible Q-value.
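A minimal sketch of how a bootstrap target can treat novel actions, assuming Inf marks never-chosen actions; `nextstatevalue` is illustrative, not the package's internal code:

```julia
# Sketch: greedy value of the next state when some actions are novel.
# Novel actions carry Q = Inf (the default initvalue); their value is
# replaced by `unseenvalue` before taking the maximum.
function nextstatevalue(Q::Matrix{Float64}, s′::Int, unseenvalue::Float64)
    maximum(isinf(q) ? unseenvalue : q for q in Q[:, s′])
end
```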
Policy Gradient Learner
```julia
mutable struct Critic <: AbstractBiasCorrector
    α::Float64
    V::Array{Float64, 1}
```

```julia
Critic(; α = .1, ns = 10, initvalue = 0.)
```

```julia
struct NoBiasCorrector <: AbstractBiasCorrector
```
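The Critic keeps state values V that can serve as a baseline for the policy gradient. The TD(0)-style rule in `criticupdate!` below is an assumption about how such a critic is commonly trained, not necessarily the package's exact rule:

```julia
# Sketch: TD(0) update of the critic's state values for one transition.
function criticupdate!(V::Vector{Float64}, s::Int, r::Float64, s′::Int,
                       α::Float64, γ::Float64)
    V[s] += α * (r + γ * V[s′] - V[s])  # move V(s) toward the bootstrap target
    V
end
```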
```julia
mutable struct PolicyGradientBackward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    traces::AccumulatingTraces
    biascorrector::AbstractBiasCorrector
```

Policy gradient learning in the backward view.

The parameters are updated according to $params[a, s] += α * r_{eff} * e[a, s]$ where $r_{eff} = r$ for NoBiasCorrector, $r_{eff} = r - rmean$ for RewardLowpassFilterBiasCorrector and $e[a, s]$ is the eligibility trace.

```julia
PolicyGradientBackward(; ns = 10, na = 4, α = .1, γ = .9,
                         tracekind = AccumulatingTraces, initvalue = Inf64,
                         biascorrector = NoBiasCorrector())
```
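A minimal sketch of this backward-view step applied to all parameters at once, assuming an na × ns trace matrix and a precomputed effective reward; `pgbackwardstep!` is an illustrative helper:

```julia
# Sketch: backward-view policy gradient step,
# params[a, s] += α * r_eff * e[a, s] for all pairs.
pgbackwardstep!(params, e, r_eff, α) = (params .+= α * r_eff .* e; params)
```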
```julia
mutable struct PolicyGradientForward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    biascorrector::AbstractBiasCorrector
```

```julia
mutable struct RewardLowpassFilterBiasCorrector <: AbstractBiasCorrector
    γ::Float64
    rmean::Float64
```

Filters the reward with factor γ and uses the effective reward (r - rmean) to update the parameters.
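A minimal sketch of this bias correction; the exponential-filter form below is an assumption consistent with "filters the reward with factor γ", not necessarily the package's exact formula, and `RewardFilterSketch`/`effectivereward!` are illustrative names:

```julia
# Sketch: low-pass filter of the reward (assumed exponential form) and the
# resulting effective reward used in the parameter update.
mutable struct RewardFilterSketch
    γ::Float64
    rmean::Float64
end

function effectivereward!(f::RewardFilterSketch, r::Float64)
    f.rmean = f.γ * f.rmean + (1 - f.γ) * r  # running reward mean
    r - f.rmean                              # effective reward r - rmean
end
```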
```julia
ActorCriticPolicyGradient(; nsteps = 1, γ = .9, ns = 10, na = 4,
                            α = .1, αcritic = .1, initvalue = Inf64)
```

```julia
EpisodicReinforce(; kwargs...) = EpisodicLearner(PolicyGradientForward(; kwargs...))
```
N-step Learner

```julia
struct EpisodicLearner <: AbstractMultistepLearner
    learner::AbstractReinforcementLearner
```

```julia
struct NstepLearner <: AbstractReinforcementLearner
    nsteps::Int64
    learner::AbstractReinforcementLearner
```

```julia
NstepLearner(; nsteps = 10, learner = Sarsa, kwargs...) =
    NstepLearner(nsteps, learner(; kwargs...))
```
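For example, a 5-step Sarsa learner for an environment with 20 states and 4 actions can be constructed as follows; keyword arguments other than nsteps and learner are forwarded to the wrapped learner's constructor:

```julia
learner = NstepLearner(nsteps = 5, learner = Sarsa, ns = 20, na = 4)
```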
```julia
mutable struct MonteCarlo <: AbstractReinforcementLearner
    Nsa::Array{Int64, 2}
    γ::Float64
    Q::Array{Float64, 2}
```

Estimate Q values by averaging over returns.

```julia
MonteCarlo(; ns = 10, na = 4, γ = .9)
```
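A minimal sketch of the averaging update for one observed return G after taking action a in state s, assuming Nsa counts visits per action-state pair; `montecarloupdate!` is an illustrative helper:

```julia
# Sketch: incremental average of returns, so that Q[a, s] equals the mean
# of all returns observed for the pair (a, s).
function montecarloupdate!(Q::Matrix{Float64}, Nsa::Matrix{Int64},
                           a::Int, s::Int, G::Float64)
    Nsa[a, s] += 1
    Q[a, s] += (G - Q[a, s]) / Nsa[a, s]
    Q
end
```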
Model Based Learner

```julia
mutable struct SmallBackups <: AbstractReinforcementLearner
    γ::Float64
    maxcount::UInt64
    minpriority::Float64
    counter::Int64
    Q::Array{Float64, 2}
    Qprev::Array{Float64, 2}
    V::Array{Float64, 1}
    Nsa::Array{Int64, 2}
    Ns1a0s0::Array{Dict{Tuple{Int64, Int64}, Int64}, 1}
    queue::PriorityQueue
```

maxcount defines the maximal number of backups per action, minpriority is the smallest priority still added to the queue.
```julia
SmallBackups(; ns = 10, na = 4, γ = .9, initvalue = Inf64, maxcount = 3, minpriority = 1e-8)
```
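A minimal sketch of the prioritized-backup idea behind this learner, assuming it works like prioritized sweeping: when a state's value changes noticeably, its predecessors are queued with a priority given by the size of the change, and at most maxcount backups are processed per action. The helpers `predecessors` and `backup!` and the exact rule below are illustrative assumptions, not the package's implementation:

```julia
using DataStructures  # PriorityQueue

# Sketch: process up to `maxcount` prioritized backups. `queue` is assumed to
# dequeue the largest priority first (e.g. built with Base.Order.Reverse);
# `predecessors[s]` maps (a0, s0) pairs that led to s to their transition
# counts (cf. the Ns1a0s0 field); `backup!(a0, s0)` is a hypothetical function
# that re-estimates Q[a0, s0] from the learned model and returns |ΔV(s0)|.
function processqueue!(queue::PriorityQueue{Int, Float64}, predecessors,
                       backup!, maxcount::Int, minpriority::Float64)
    count = 0
    while !isempty(queue) && count < maxcount
        s = dequeue!(queue)
        for ((a0, s0), _) in predecessors[s]
            Δ = backup!(a0, s0)
            if Δ > minpriority
                queue[s0] = haskey(queue, s0) ? max(Δ, queue[s0]) : Δ
            end
        end
        count += 1
    end
end
```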