LaplacianExpectationMaximization
This package implements the Expectation-Maximization (EM) algorithm with a Laplacian approximation for the E-step, as described e.g. here, for finding the maximum likelihood estimate of the parameter distribution of a model.
Assume some data $\mathcal D = \{\boldsymbol x_1, \ldots, \boldsymbol x_n\}$ was generated by sampling independently from the conditional density $\boldsymbol x_i\sim p(\boldsymbol x|\boldsymbol \theta_i)$, where the parameters $\boldsymbol \theta_i$ are themselves sampled from a multivariate normal distribution with diagonal covariance, $\boldsymbol \theta_i\sim\mathcal N(\boldsymbol \theta; \boldsymbol \mu, \boldsymbol\sigma)$. This package allows finding the maximum likelihood estimates
\[\begin{align*} \boldsymbol \mu^*, \boldsymbol \sigma^* = \arg\max_{\boldsymbol \mu, \boldsymbol\sigma}p(\mathcal D | \boldsymbol\mu, \boldsymbol \sigma) &= \arg\max_{\boldsymbol \mu, \boldsymbol\sigma}\sum_{i=1}^n\log(p(\boldsymbol x_i|\boldsymbol\mu, \boldsymbol\sigma))\\ &= \arg\max_{\boldsymbol \mu, \boldsymbol\sigma}\sum_{i=1}^n\log\left(\int p(\boldsymbol x_i|\boldsymbol \theta_i)\mathcal N(\boldsymbol\theta_i;\boldsymbol\mu, \boldsymbol\sigma)d\boldsymbol\theta_i\right) \end{align*}\]
using the EM algorithm with Laplacian approximation for the E-step.
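Concretely, the E-step replaces each intractable integral by a Laplace approximation around the posterior mode. Writing $\boldsymbol\theta_i^*$ for the maximizer of $\log p(\boldsymbol x_i|\boldsymbol\theta_i) + \log\mathcal N(\boldsymbol\theta_i;\boldsymbol\mu,\boldsymbol\sigma)$, $\boldsymbol H_i$ for the negative Hessian of this log joint density at $\boldsymbol\theta_i^*$, and $d$ for the number of parameters, the standard approximation (sketched here for orientation; the package may differ in implementation details) reads
\[\log\int p(\boldsymbol x_i|\boldsymbol \theta_i)\mathcal N(\boldsymbol\theta_i;\boldsymbol\mu, \boldsymbol\sigma)d\boldsymbol\theta_i \approx \log p(\boldsymbol x_i|\boldsymbol \theta_i^*) + \log\mathcal N(\boldsymbol\theta_i^*;\boldsymbol\mu, \boldsymbol\sigma) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det\boldsymbol H_i.\]
The M-step then updates $\boldsymbol\mu$ and $\boldsymbol\sigma$ given the per-sample modes and curvatures.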
Example
Model Definition
For a sequence of binary values $y = (y_1, \ldots, y_T)$ we define a habituating biased coin model with probability $P(y) = \prod_{t=1}^TP(y_t|w_{t-1})$, where $P(y_t = 1|w_{t-1}) = \sigma(w_{t-1})$ with $\sigma(w) = 1/(1 + \exp(-w))$, and the weight evolves as $w_t = w_{t-1} + \eta\,(y_t - \sigma(w_{t-1}))$; $w_0$ and $\eta$ are the parameters of the model.
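To make the update rule concrete, here is a minimal stand-alone sketch of the generative process (the helper `simulate_coin` is hypothetical and not part of the package):

```julia
using Random

sigmoid(w) = 1 / (1 + exp(-w))

# Hypothetical helper: sample a sequence y₁, …, y_T from the
# habituating biased coin with initial weight w0 and learning rate η.
function simulate_coin(rng, w0, eta, T)
    w = w0
    y = Vector{Bool}(undef, T)
    for t in 1:T
        y[t] = rand(rng) ≤ sigmoid(w)    # P(yₜ = 1) = σ(w_{t-1})
        w += eta * (y[t] - sigmoid(w))   # weight update with old w
    end
    return y
end

y = simulate_coin(Xoshiro(1), 0.3, 0.1, 30)
```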
We define the model HabituatingBiasedCoin with state variable w and extend the functions parameters, initialize!, logp and sample from LaplacianExpectationMaximization.
# import LaplacianExpectationMaximization: parameters, initialize!, logp, sample # if you want to avoid writing LaplacianExpectationMaximization.logp etc.
using LaplacianExpectationMaximization
using ConcreteStructs, Distributions
@concrete struct HabituatingBiasedCoin
w
end
HabituatingBiasedCoin() = HabituatingBiasedCoin(Base.RefValue(0.))
function LaplacianExpectationMaximization.initialize!(m::HabituatingBiasedCoin, parameters)
m.w[] = parameters.w₀
end
LaplacianExpectationMaximization.parameters(::HabituatingBiasedCoin) = (; w₀ = 0., η = 0.)
sigmoid(w) = 1/(1 + exp(-w))
function LaplacianExpectationMaximization.logp(data, m::HabituatingBiasedCoin, parameters)
LaplacianExpectationMaximization.initialize!(m, parameters)
η = parameters.η
logp = 0.
for yₜ in data
ρ = sigmoid(m.w[])
logp += logpdf(Bernoulli(ρ), yₜ)
m.w[] += η * (yₜ - ρ)
end
logp
end
function LaplacianExpectationMaximization.sample(rng, ::Any, m::HabituatingBiasedCoin, ::Any)
rand(rng) ≤ sigmoid(m.w[])
end
Generating data
Let us generate 5 sequences of 30 steps with this model.
import LaplacianExpectationMaximization: simulate
model = HabituatingBiasedCoin()
params = (; w₀ = .3, η = .1)
data = [simulate(model, params, n_steps = 30).data for _ in 1:5]
5-element Vector{Vector{Bool}}:
[1, 1, 1, 1, 0, 1, 1, 0, 0, 0 … 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0 … 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
[1, 1, 0, 1, 0, 0, 1, 0, 1, 0 … 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[0, 1, 0, 1, 0, 1, 1, 1, 1, 1 … 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
[0, 1, 0, 0, 0, 1, 1, 1, 1, 1 … 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
Fitting a single model
First we check whether gradients are properly computed for our model.
import LaplacianExpectationMaximization: gradient_logp
gradient_logp(data[1], model, params)
ComponentVector{Float64}(w₀ = 0.9610782958980797, η = -2.401660106463604)
If this fails, it is recommended to check that logp does not allocate, e.g. with
using BenchmarkTools
@benchmark LaplacianExpectationMaximization.logp($(data[1]), $model, $params)
BenchmarkTools.Trial: 10000 samples with 115 evaluations per sample.
Range (min … max): 761.591 ns … 2.503 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 765.157 ns ┊ GC (median): 0.00%
Time (mean ± σ): 774.782 ns ± 46.559 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃██▄ ▁ ▂▃▃▂ ▂
███████▆▄▁▅██▆▆▅▅▄▄▄▅▄▄▄▄▅▁▄▇▆▃▅▅▆▅▄▅▃▄▄▄▃▆██████▇▆▅▄▃▅▆▅▅▄▆ █
762 ns Histogram: log(frequency) by time 861 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
We also check whether Hessians are properly computed.
import LaplacianExpectationMaximization: hessian_logp
hessian_logp(data[1], model, params)
2×2 Matrix{Float64}:
-3.87557 -4.89937
This may fail if the model is too restrictive in its type parameters.
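Hessians are typically computed by differentiating logp with dual-number types, so any field that is hard-coded to Float64 will raise a MethodError. A minimal sketch of the difference (these struct names are illustrative, not part of the package):

```julia
# Too restrictive: the state can only ever hold Float64,
# so automatic differentiation through the state fails.
struct RigidCoin
    w::Base.RefValue{Float64}
end

# Flexible: the field type is a free parameter (this is what
# ConcreteStructs' @concrete generates), so the state can also
# hold dual numbers during gradient and Hessian computation.
struct FlexibleCoin{T}
    w::T
end
```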
If everything works fine we run the optimizer:
import LaplacianExpectationMaximization: maximize_logp
result = maximize_logp(data[1], model)
(logp = -12.76219565547764, parameters = (w₀ = 1.6648438049017638, η = -1.9651029283407517))
To inspect the state of the fitted model we can run
import LaplacianExpectationMaximization: logp_tracked
logp_tracked(data[1], model, result.parameters).history
31-element Vector{Main.HabituatingBiasedCoin{Base.RefValue{Float64}}}:
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.6648438049017638))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.6648438049017638))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.3521706875057324))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.9483111491633505))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.3996060890565891))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.38919995243191885))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.4045245958593576))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.38196018597617865))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-1.5499108360473421))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-1.2058229903492155))
⋮
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.9489947774148706))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.4005600467635616))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.5773073250435834))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.2409265698217917))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.8002036135883328))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.19105714518351946))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.6979170600090752))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.04496404848676605))
Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-1.0496015375564856))
We can also fit with some fixed parameters.
result = maximize_logp(data[1], model, fixed = (; η = 0.))
result.parameters
ComponentVector{Float64}(w₀ = 0.5465437049581998, η = 0.0)
or with coupled parameters
result = maximize_logp(data[1], model, coupled = [(:w₀, :η)])
result.parameters
ComponentVector{Float64}(w₀ = 0.12515548674803242, η = 0.12515548674803242)
Fitting a population model
Now we fit all data samples with approximate EM, assuming a diagonal normal prior over the parameters.
import LaplacianExpectationMaximization: PopulationModel
pop_model1 = PopulationModel(model)
result1 = maximize_logp(data, pop_model1)
result1.population_parameters
ComponentVector{Float64}(w₀ = 0.0, η = 0.0, population_parameters = (μ = (w₀ = 0.0, η = 0.0), σ = (w₀ = 1.0, η = 1.0)))
Let us compare this to a model where all samples are assumed to be generated from the same parameters, i.e. the variance of the normal prior is zero.
pop_model2 = PopulationModel(model, shared = (:w₀, :η))
result2 = maximize_logp(data, pop_model2)
result2.population_parameters
ComponentVector{Float64}(w₀ = 0.26073599350821536, η = 0.06490691760064525, population_parameters = (μ = Float64[], σ = Float64[]))
To compare the models we look at the approximate BIC
import LaplacianExpectationMaximization: BIC_int
(model1 = BIC_int(data, pop_model1, result1.population_parameters, repetitions = 1),
model2 = BIC_int(data, pop_model2, result2.population_parameters))
(model1 = [231.98077524301564], model2 = [214.7168487282321])
We see that the second model, without prior variance, has the lower BIC. This is not surprising, given that the data was generated with identical parameters.
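For orientation when reading these numbers: in its standard form, with $k$ free parameters, $n$ observations, and maximized likelihood $\hat L$, the Bayesian information criterion is
\[\mathrm{BIC} = k\log n - 2\log\hat L,\]
so lower values indicate a better trade-off between fit and complexity. BIC_int computes an approximate, integrated variant of this quantity.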