Learners
TD Learner
mutable struct ExpectedSarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
    policy::AbstractPolicy
end
Expected SARSA learner with learning rate α, discount factor γ, Q-values params and eligibility traces.
The Q-values are updated according to $Q(a, s) ← Q(a, s) + α δ e(a, s)$, where $δ = r + γ \sum_{a'} \pi(a', s') Q(a', s') - Q(a, s)$ with next state $s'$, probability $\pi(a', s')$ of choosing action $a'$ in next state $s'$, and eligibility trace $e(a, s)$ (see NoTraces, ReplacingTraces and AccumulatingTraces).
ExpectedSarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
              tracekind = ReplacingTraces, initvalue = Inf64,
              unseenvalue = 0.,
              policy = VeryOptimisticEpsilonGreedyPolicy(.1))
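For illustration, here is a minimal Julia sketch of the expected SARSA TD error, not the package implementation; the function name expectedsarsadelta and the probability table probs are assumptions, with Q and probs indexed like params above (actions in rows, states in columns).

# Sketch: TD error that weighs each next-state Q-value by the policy's
# probability of choosing that action in s′.
function expectedsarsadelta(Q, probs, γ, r, s, a, s′)
    r + γ * sum(probs[a′, s′] * Q[a′, s′] for a′ in 1:size(Q, 1)) - Q[a, s]
end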
mutable struct QLearning <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
end
Q-learning learner with learning rate α, discount factor γ, Q-values params and eligibility traces.
The Q-values are updated "off-policy" according to $Q(a, s) ← α δ e(a, s)$ where $δ = r + γ \max_{a'} Q(a', s') - Q(a, s)$ with next state $s'$ and $e(a, s)$ is the eligibility trace (see NoTraces
, ReplacingTraces
and AccumulatingTraces
).
QLearning(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
          tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)
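A hedged sketch of this off-policy TD error; the name qlearningdelta is an assumption, and the fallback from Inf64 to unseenvalue follows the section on initial values below.

# Sketch: greedy bootstrap over the next state's Q-values, replacing values
# of novel actions (still at initvalue = Inf64) by unseenvalue.
function qlearningdelta(Q, γ, unseenvalue, r, s, a, s′)
    vmax = maximum(q -> isinf(q) ? unseenvalue : q, view(Q, :, s′))
    r + γ * vmax - Q[a, s]
end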
mutable struct Sarsa <: AbstractTDLearner
    α::Float64
    γ::Float64
    unseenvalue::Float64
    params::Array{Float64, 2}
    traces::AbstractTraces
end
SARSA learner with learning rate α, discount factor γ, Q-values params and eligibility traces.
The Q-values are updated "on-policy" according to $Q(a, s) ← α δ e(a, s)$ where $δ = r + γ Q(a', s') - Q(a, s)$ with next state $s'$, next action $a'$ and $e(a, s)$ is the eligibility trace (see NoTraces
, ReplacingTraces
and AccumulatingTraces
).
Sarsa(; ns = 10, na = 4, α = .1, γ = .9, λ = .8,
      tracekind = ReplacingTraces, initvalue = Inf64, unseenvalue = 0.)
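The trace-weighted update itself looks the same for all three TD learners; below is a minimal sketch of one on-policy SARSA step (sarsastep! is an assumed name, not package API).

# Sketch: δ uses the actually chosen next action a′, then every Q-value
# moves in proportion to its eligibility trace.
function sarsastep!(Q, e, α, γ, r, s, a, s′, a′)
    δ = r + γ * Q[a′, s′] - Q[a, s]
    Q .+= α * δ .* e
end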
struct AccumulatingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
end
Decaying traces with factor γλ. Traces are updated according to $e(a, s) ← 1 + e(a, s)$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, in which case the trace is set to 0 (for computational efficiency).
AccumulatingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
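A hedged sketch of one such trace step (accumulatetrace! is an assumed name, not package API):

# Sketch: decay all traces by γλ, truncate tiny ones for efficiency, and
# accumulate at the current action-state pair, as in the rule above.
function accumulatetrace!(e, γλ, a, s, minimaltracevalue)
    current = e[a, s]
    e .*= γλ
    e[abs.(e) .< minimaltracevalue] .= 0.0
    e[a, s] = 1 + current
end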
struct NoTraces <: AbstractTraces
end
No eligibility traces, i.e. $e(a, s) = 1$ for the current action $a$ and state $s$, and zero otherwise.
struct ReplacingTraces <: AbstractTraces
    λ::Float64
    γλ::Float64
    trace::Array{Float64, 2}
    minimaltracevalue::Float64
end
Decaying traces with factor γλ. Traces are updated according to $e(a, s) ← 1$ for the current action-state pair and $e(a, s) ← γλ e(a, s)$ for all other pairs, unless $e(a, s) <$ minimaltracevalue, in which case the trace is set to 0 (for computational efficiency).
ReplacingTraces(ns, na, λ::Float64, γ::Float64; minimaltracevalue = 1e-12)
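The replacing variant differs only at the current pair, which is reset to 1 instead of accumulated; a sketch with the assumed name replacetrace!:

# Sketch: same decay and truncation as for accumulating traces, but the
# current pair's trace is replaced by 1 rather than incremented.
function replacetrace!(e, γλ, a, s, minimaltracevalue)
    e .*= γλ
    e[abs.(e) .< minimaltracevalue] .= 0.0
    e[a, s] = 1.0
end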
Initial values, novel actions and unseen values
For TD-error dependent methods, the exploration-exploitation trade-off depends on initvalue and unseenvalue. To distinguish actions that were never chosen before, i.e. novel actions, the default initial Q-value (field params) is initvalue = Inf64. In a state with novel actions, the policy determines how to deal with them. To compute the TD error, unseenvalue is used for states with novel actions. One way to achieve aggressively exploratory behavior is to ensure that unseenvalue (or initvalue) is larger than the largest possible Q-value.
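A small sketch of this mechanism; the helpers isnovel and bootstrap are assumptions, not package API:

na, ns = 4, 10
Q = fill(Inf64, na, ns)            # initvalue = Inf64 marks every action as novel
isnovel(Q, a, s) = isinf(Q[a, s])  # an action never chosen in s still holds Inf64
# When bootstrapping from s′, novel actions contribute unseenvalue instead of Inf64:
bootstrap(Q, s′, unseenvalue) = maximum(q -> isinf(q) ? unseenvalue : q, view(Q, :, s′))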
Policy Gradient Learner
mutable struct Critic <: AbstractBiasCorrector
    α::Float64
    V::Array{Float64, 1}
end
Critic(; α = .1, ns = 10, initvalue = 0.)
struct NoBiasCorrector <: AbstractBiasCorrector
end
mutable struct PolicyGradientBackward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    traces::AccumulatingTraces
    biascorrector::AbstractBiasCorrector
end
Policy gradient learning in the backward view. The parameters are updated according to $params[a, s] += α * r_{eff} * e[a, s]$, where $r_{eff} = r$ for NoBiasCorrector, $r_{eff} = r - rmean$ for RewardLowpassFilterBiasCorrector, and $e[a, s]$ is the eligibility trace.
PolicyGradientBackward(; ns = 10, na = 4, α = .1, γ = .9,
                       tracekind = AccumulatingTraces, initvalue = Inf64,
                       biascorrector = NoBiasCorrector())
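A minimal sketch of this parameter update (pgstep! is an assumed name; rmean is 0 for NoBiasCorrector):

# Sketch: scale the effective reward by the eligibility traces.
function pgstep!(params, e, α, r, rmean = 0.0)
    reff = r - rmean
    params .+= α * reff .* e
end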
mutable struct PolicyGradientForward <: AbstractPolicyGradient
    α::Float64
    γ::Float64
    params::Array{Float64, 2}
    biascorrector::AbstractBiasCorrector
end
mutable struct RewardLowpassFilterBiasCorrector <: AbstractBiasCorrector
    γ::Float64
    rmean::Float64
end
Filters the reward with factor γ and uses the effective reward $(r - rmean)$ to update the parameters.
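The exact filter is not spelled out here; one choice consistent with the description is an exponential moving average, sketched below as an assumption:

# Sketch: low-pass filter the reward with factor γ and return the updated
# running mean together with the effective reward.
function filterreward(rmean, γ, r)
    rmean = γ * rmean + (1 - γ) * r
    rmean, r - rmean
end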
ActorCriticPolicyGradient(; nsteps = 1, γ = .9, ns = 10, na = 4,
                          α = .1, αcritic = .1, initvalue = Inf64)
EpisodicReinforce(; kwargs...) = EpisodicLearner(PolicyGradientForward(; kwargs...))
N-step Learner
struct EpisodicLearner <: AbstractMultistepLearner
    learner::AbstractReinforcementLearner
end
struct NstepLearner <: AbstractReinforcementLearner
    nsteps::Int64
    learner::AbstractReinforcementLearner
end
NstepLearner(; nsteps = 10, learner = Sarsa, kwargs...) =
    NstepLearner(nsteps, learner(; kwargs...))
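Possible usage, assuming the keyword constructor above; the remaining keywords are forwarded to the wrapped learner's constructor:

# Wrap a SARSA learner so that updates use 4-step returns; α and γ are
# passed on to the inner Sarsa constructor.
learner = NstepLearner(nsteps = 4, learner = Sarsa, α = .1, γ = .9)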
mutable struct MonteCarlo <: AbstractReinforcementLearner
    Nsa::Array{Int64, 2}
    γ::Float64
    Q::Array{Float64, 2}
end
Estimate Q values by averaging over returns.
MonteCarlo(; ns = 10, na = 4, γ = .9)
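Averaging over returns can be done incrementally with the visit counts Nsa; a hedged sketch follows (mcupdate! is an assumed name, G the observed return):

# Sketch: nudge Q[a, s] toward the return G by 1/N, which keeps Q[a, s]
# equal to the mean of all returns observed for (a, s) so far.
function mcupdate!(Q, Nsa, a, s, G)
    Nsa[a, s] += 1
    Q[a, s] += (G - Q[a, s]) / Nsa[a, s]
end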
Model Based Learner
mutable struct SmallBackups <: AbstractReinforcementLearner
    γ::Float64
    maxcount::UInt64
    minpriority::Float64
    counter::Int64
    Q::Array{Float64, 2}
    Qprev::Array{Float64, 2}
    V::Array{Float64, 1}
    Nsa::Array{Int64, 2}
    Ns1a0s0::Array{Dict{Tuple{Int64, Int64}, Int64}, 1}
    queue::PriorityQueue
end
maxcount defines the maximal number of backups per action; minpriority is the smallest priority still added to the queue.
SmallBackups(; ns = 10, na = 4, γ = .9, initvalue = Inf64, maxcount = 3, minpriority = 1e-8)
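The fields suggest prioritized sweeping: value changes are propagated backwards through the learned model via the priority queue, at most maxcount backups per action, and only changes larger than minpriority are queued. A hedged sketch of the queue handling only (all names here are assumptions and the model update is omitted):

using DataStructures: PriorityQueue, enqueue!, dequeue!

# Sketch: process up to maxcount prioritized backups; queue predecessors
# whose potential value change exceeds minpriority.
function sweep!(ΔV, pred, queue, γ, maxcount, minpriority)
    for _ in 1:maxcount
        isempty(queue) && break
        s = dequeue!(queue)                # state with the most urgent change
        for (s0, p) in pred[s]             # predecessors s0 reaching s with probability p
            prio = γ * p * abs(ΔV[s])
            if prio > minpriority && !haskey(queue, s0)
                enqueue!(queue, s0, -prio) # negate: PriorityQueue pops the smallest key
            end
        end
        ΔV[s] = 0.0                        # change at s has been propagated
    end
end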