LaplacianExpectationMaximization

This package implements the Expectation-Maximization (EM) algorithm with a Laplacian approximation for the E-step (as described, e.g., here) to find the maximum likelihood estimate of the parameter distribution of a model.

Assume some data $\mathcal D = \{\boldsymbol x_1, \ldots, \boldsymbol x_n\}$ was generated by sampling independently from the conditional density $\boldsymbol x_i\sim p(\boldsymbol x|\boldsymbol \theta_i)$, where the $\boldsymbol \theta_i$ are parameters sampled from a multivariate normal distribution with diagonal covariance, $\boldsymbol \theta_i\sim\mathcal N(\boldsymbol \theta; \boldsymbol \mu, \boldsymbol\sigma)$. This package allows one to find the maximum likelihood estimates

\[\begin{align*} \boldsymbol \mu^*, \boldsymbol \sigma^* = \arg\max_{\boldsymbol \mu, \boldsymbol\sigma}p(\mathcal D | \boldsymbol\mu, \boldsymbol \sigma) &= \arg\max_{\boldsymbol \mu, \boldsymbol\sigma}\sum_{i=1}^n\log(p(\boldsymbol x_i|\boldsymbol\mu, \boldsymbol\sigma))\\ &= \arg\max_{\boldsymbol \mu, \boldsymbol\sigma}\sum_{i=1}^n\log\left(\int p(\boldsymbol x_i|\boldsymbol \theta_i)\mathcal N(\boldsymbol\theta_i;\boldsymbol\mu, \boldsymbol\sigma)d\boldsymbol\theta_i\right) \end{align*}\]

using the EM algorithm with Laplacian approximation for the E-step.
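
The integral over $\boldsymbol\theta_i$ is typically intractable. In the E-step, the posterior over each $\boldsymbol\theta_i$ is therefore replaced by a Gaussian centered at its mode. The following is a sketch of this standard Laplace approximation; the symbols $\hat{\boldsymbol\theta}_i$ and $\boldsymbol H_i$ are our notation and the package's implementation details may differ.

\[\begin{align*} \hat{\boldsymbol\theta}_i &= \arg\max_{\boldsymbol\theta_i}\left[\log p(\boldsymbol x_i|\boldsymbol\theta_i) + \log\mathcal N(\boldsymbol\theta_i;\boldsymbol\mu, \boldsymbol\sigma)\right]\\ p(\boldsymbol\theta_i|\boldsymbol x_i, \boldsymbol\mu, \boldsymbol\sigma) &\approx \mathcal N(\boldsymbol\theta_i; \hat{\boldsymbol\theta}_i, \boldsymbol H_i^{-1}), \qquad \boldsymbol H_i = -\nabla^2_{\boldsymbol\theta_i}\left[\log p(\boldsymbol x_i|\boldsymbol\theta_i) + \log\mathcal N(\boldsymbol\theta_i;\boldsymbol\mu, \boldsymbol\sigma)\right]\Big|_{\boldsymbol\theta_i = \hat{\boldsymbol\theta}_i} \end{align*}\]

The M-step then updates $\boldsymbol\mu$ and $\boldsymbol\sigma$ using these Gaussian approximations. This is also why gradients and Hessians of logp play a role in the example below.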

Example

Model Definition

For a sequence of binary values $y = (y_1, \ldots, y_T)$ we define a habituating biased coin model with probability $P(y) = \prod_{t=1}^T P(y_t|w_{t-1})$, where $P(y_t = 1|w_{t-1}) = \sigma(w_{t-1})$ and the state evolves as $w_t = w_{t-1} + \eta (y_t - \sigma(w_{t-1}))$. Here $w_0$ and $\eta$ are the parameters of the model and $\sigma(w) = 1/(1 + \exp(-w))$ is the logistic sigmoid.

We define the model HabituatingBiasedCoin with state variable w and extend the functions parameters, initialize!, logp and sample from LaplacianExpectationMaximization.

# import LaplacianExpectationMaximization: parameters, initialize!, logp, sample # if you want to avoid writing LaplacianExpectationMaximization.logp etc.
using LaplacianExpectationMaximization
using ConcreteStructs, Distributions

@concrete struct HabituatingBiasedCoin
    w
end
HabituatingBiasedCoin() = HabituatingBiasedCoin(Base.RefValue(0.))

# reset the model state to the initial weight w₀ before evaluating a new sequence
function LaplacianExpectationMaximization.initialize!(m::HabituatingBiasedCoin, parameters)
    m.w[] = parameters.w₀
end

# free parameters of the model together with their default values
LaplacianExpectationMaximization.parameters(::HabituatingBiasedCoin) = (; w₀ = 0., η = 0.)

sigmoid(w) = 1/(1 + exp(-w))

# log-likelihood of a binary sequence under the habituating biased coin
function LaplacianExpectationMaximization.logp(data, m::HabituatingBiasedCoin, parameters)
    LaplacianExpectationMaximization.initialize!(m, parameters)
    η = parameters.η
    logp = 0.
    for yₜ in data
        ρ = sigmoid(m.w[])
        logp += logpdf(Bernoulli(ρ), yₜ)
        m.w[] += η * (yₜ - ρ)
    end
    logp
end

# draw one binary observation from the current state of the model
function LaplacianExpectationMaximization.sample(rng, ::Any, m::HabituatingBiasedCoin, ::Any)
    rand(rng) ≤ sigmoid(m.w[])
end
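
With these definitions in place, a quick sanity check is to evaluate logp on a short hand-written sequence (the sequence and parameter values below are arbitrary illustrations):

m = HabituatingBiasedCoin()
# returns the log-likelihood of the four observations under w₀ = 0 and η = 0.1
LaplacianExpectationMaximization.logp([true, false, true, true], m, (; w₀ = 0.0, η = 0.1))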

Generating data

Let us generate 5 sequences of 30 steps with this model.

import LaplacianExpectationMaximization: simulate

model = HabituatingBiasedCoin()
params = (; w₀ = .3, η = .1)
data = [simulate(model, params, n_steps = 30).data for _ in 1:5]
5-element Vector{Vector{Bool}}:
 [1, 1, 1, 1, 0, 1, 1, 0, 0, 0  …  1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0  …  0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
 [1, 1, 0, 1, 0, 0, 1, 0, 1, 0  …  0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
 [0, 1, 0, 1, 0, 1, 1, 1, 1, 1  …  1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
 [0, 1, 0, 0, 0, 1, 1, 1, 1, 1  …  0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

Fitting a single model

First we check whether gradients are properly computed for our model.

import LaplacianExpectationMaximization: gradient_logp
gradient_logp(data[1], model, params)
ComponentVector{Float64}(w₀ = 0.9610782958980797, η = -2.401660106463604)

If this fails, it is recommended to check that logp does not allocate, e.g. with

using BenchmarkTools
@benchmark LaplacianExpectationMaximization.logp($(data[1]), $model, $params)
BenchmarkTools.Trial: 10000 samples with 115 evaluations per sample.
 Range (min … max):  761.591 ns … 2.503 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     765.157 ns               GC (median):    0.00%
 Time  (mean ± σ):   774.782 ns ± 46.559 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃█▄  ▁                                    ▂▃▃▂              ▂
  ██████▆▁▅██▆▆▅▅▄▄▄▅▄▄▄▄▅▁▄▇▆▃▅▅▆▅▄▅▃▄▄▄▃▆██████▇▆▅▄▃▅▆▅▅▄▆ █
  762 ns        Histogram: log(frequency) by time       861 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

We also check if Hessians are properly computed.

import LaplacianExpectationMaximization: hessian_logp
hessian_logp(data[1], model, params)
2×2 Matrix{Float64}:
 -3.87557  -4.89937
 -4.89937  -0.217054

This may fail if the model is too restrictive in its type parameters.
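
Hessian computation typically relies on automatic differentiation, which needs to propagate its own number types through the model state. A state field hard-coded to Float64 can therefore cause failures, whereas a free type parameter (as provided by @concrete struct above) does not. The struct names below are hypothetical illustrations:

# may fail: the state can only ever hold Float64
struct RestrictiveCoin
    w::Base.RefValue{Float64}
end

# works: the field type is left open, as with the @concrete struct above
struct FlexibleCoin{T}
    w::T
end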

If everything works fine, we run the optimizer:

import LaplacianExpectationMaximization: maximize_logp
result = maximize_logp(data[1], model)
(logp = -12.76219565547764, parameters = (w₀ = 1.6648438049017638, η = -1.9651029283407517))

To inspect the state of the fitted model we can run

import LaplacianExpectationMaximization: logp_tracked
logp_tracked(data[1], model, result.parameters).history
31-element Vector{Main.HabituatingBiasedCoin{Base.RefValue{Float64}}}:
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.6648438049017638))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.6648438049017638))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.3521706875057324))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.9483111491633505))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.3996060890565891))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.38919995243191885))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.4045245958593576))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.38196018597617865))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-1.5499108360473421))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-1.2058229903492155))
 ⋮
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.9489947774148706))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.4005600467635616))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.5773073250435834))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(1.2409265698217917))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.8002036135883328))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(0.19105714518351946))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.6979170600090752))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-0.04496404848676605))
 Main.HabituatingBiasedCoin{Base.RefValue{Float64}}(Base.RefValue{Float64}(-1.0496015375564856))
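
Each entry of history is a snapshot of the model along the sequence. To extract the trajectory of the internal weight as plain numbers one can write, for example:

history = logp_tracked(data[1], model, result.parameters).history
w_trajectory = [m.w[] for m in history]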

We can also fit the model with some of the parameters held fixed.

result = maximize_logp(data[1], model, fixed = (; η = 0.))
result.parameters
ComponentVector{Float64}(w₀ = 0.5465437049581998, η = 0.0)

or with coupled parameters, i.e. parameters constrained to share the same value

result = maximize_logp(data[1], model, coupled = [(:w₀, :η)])
result.parameters
ComponentVector{Float64}(w₀ = 0.12515548674803242, η = 0.12515548674803242)

Fitting a population model

Now we fit all data samples with approximate EM, assuming a diagonal normal prior over the parameters.

import LaplacianExpectationMaximization: PopulationModel
pop_model1 = PopulationModel(model)
result1 = maximize_logp(data, pop_model1)
result1.population_parameters
ComponentVector{Float64}(w₀ = 0.0, η = 0.0, population_parameters = (μ = (w₀ = 0.0, η = 0.0), σ = (w₀ = 1.0, η = 1.0)))

Let us compare this to a model where all samples are assumed to be generated from the same parameters, i.e. the variance of the normal prior is zero.

pop_model2 = PopulationModel(model, shared = (:w₀, :η))
result2 = maximize_logp(data, pop_model2)
result2.population_parameters
ComponentVector{Float64}(w₀ = 0.26073599350821536, η = 0.06490691760064525, population_parameters = (μ = Float64[], σ = Float64[]))

To compare the models we look at the approximate BIC

import LaplacianExpectationMaximization: BIC_int
(model1 = BIC_int(data, pop_model1, result1.population_parameters, repetitions = 1),
 model2 = BIC_int(data, pop_model2, result2.population_parameters))
(model1 = [231.98077524301564], model2 = [214.7168487282321])

We see that the second model, with zero prior variance, has the lower BIC. This is not surprising, given that all sequences were generated with the same parameters.
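
For reference (this is the textbook definition, not necessarily the exact quantity computed by BIC_int), the Bayesian information criterion of a model with $k$ free parameters and maximized likelihood $\hat L$ on $n$ observations is

\[\mathrm{BIC} = k\log(n) - 2\log \hat L\]

and lower values indicate a better trade-off between goodness of fit and model complexity.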