Constructing Networks
MLPGradientFlow.Net — Type
Net(; layers, input, target, weights = nothing,
    bias_adapt_input = true, derivs = 2, copy_input = true, verbosity = 1,
    Din = size(input, 1) - last(first(layers))*(1-bias_adapt_input))
layers # ((num_neurons_layer1, activation_function_layer1, has_bias_layer1),
# (num_neurons_layer2, activation_function_layer2, has_bias_layer2),
# ...)
input # Dᵢₙ × N matrix
target # Dₒᵤₜ × N matrix
weights # nothing or an array of length N
bias_adapt_input = true # adds a row of 1s to the input
derivs = 2 # allocate memory for derivs derivatives (0, 1, 2)
copy_input = true # copy the input when creating the net
Example
julia> input = randn(2, 100)
julia> target = randn(1, 100)
julia> net = Net(; layers = ((10, softplus, true), (1, identity, true)),
                 input, target);
MLPGradientFlow.TeacherNet — Type
TeacherNet(; p = nothing, kwargs...)
Creates a network with parameters p attached. If p == nothing, random_params are generated. A TeacherNet is a callable object that returns the target given some input. Keyword arguments kwargs are passed to Net.
Example
julia> teacher = TeacherNet(; layers = ((8, softplus, true), (1, identity, true)),
                            Din = 3);
julia> input = randn(3, 10^4);
julia> target = teacher(input);
julia> new_input = randn(3, 10^3);
julia> new_target = teacher(new_input);
MLPGradientFlow.NetI — Type
NetI(teacher, student; T = eltype(student.input),
     g1 = _stride_arrayize(NormalIntegral(d = 1)),
     g2 = _stride_arrayize(NormalIntegral(d = 2)))
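The constructor combines a teacher and a student network; the Gauss-Hermite integrators g1 and g2 suggest that the loss is evaluated for standard normal input. A hedged sketch, assuming the student is an ordinary Net whose input and target serve only to fix its architecture:
julia> teacher = TeacherNet(; layers = ((4, softplus, true), (1, identity, true)), Din = 2);
julia> input = randn(2, 100);
julia> student = Net(; layers = ((6, softplus, true), (1, identity, true)),
                     input, target = teacher(input));
julia> neti = NetI(teacher, student);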
MLPGradientFlow.gauss_hermite_net — Function
gauss_hermite_net(target_function, net::Net; kwargs...)
Create from net a network with input points and weights obtained from NormalIntegral, to which kwargs are passed. The target_function is used to compute the target values. Note that in more than 2 input dimensions the number of points is excessively large (with default settings for NormalIntegral, more than a million points are generated in 3 dimensions).
Example
julia> net = gauss_hermite_net(x -> reshape(x[1, :] .^ 2, 1, :),
                               Net(layers = ((5, softplus, false),
                                             (1, identity, true)), Din = 2))
Loss and its Derivatives
MLPGradientFlow.loss — Function
loss(net, x, input = net.input, target = net.target;
     verbosity = 1, losstype = MSE(), weights = net.weights, maxnorm = Inf,
     merge = nothing)
Compute the loss of net at parameter value x.
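For instance, reusing the net from the construction example above and a random parameter vector (a minimal sketch; random_params is documented under Utilities below):
julia> x = random_params(net);
julia> loss(net, x)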
MLPGradientFlow.gradient — Function
gradient(net, x; input = net.input, target = net.target, kwargs...)
Compute the gradient of net at parameter value x. See loss for kwargs.
MLPGradientFlow.hessian — Function
hessian(net, x; input = net.input, target = net.target, kwargs...)
Compute the hessian of net at parameter value x. See loss for kwargs.
MLPGradientFlow.hessian_spectrum — Function
hessian_spectrum(net, x; kwargs...)
Compute the spectrum of the hessian of net at x. Keyword arguments are passed to hessian.
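Continuing the sketch above, the derivatives can be evaluated at the same parameter value:
julia> g = gradient(net, x);   # gradient with respect to the parameters
julia> H = hessian(net, x);    # hessian at x
julia> hessian_spectrum(net, x)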
Training
MLPGradientFlow.train — Function
train(net, x0; kwargs...)
Train net from initial point x0.
Keyword arguments:
maxnorm = Inf # constant maxnorm in the regularizer R(x) of the loss formula below
batchsize = nothing # use the full data set in each step when `nothing`
alg = alg_default(p) # ODE solver: KenCarp58() for length(p) ≤ 64, RK4() otherwise
maxT = 1e10 # upper integration limit
save_everystep = true # return a trajectory and loss curve
n_samples_trajectory = 100 # number of samples of the trajectory
abstol = 1e-6 # absolute tolerance of the ODE solver
reltol = 1e-3 # relative tolerance of the ODE solver
maxtime_ode = 3*60 # maximum amount of time in seconds for the ODE solver
maxiterations_ode = 10^6 # maximum iterations of ODE solver
maxiterations_optim = 10^5 # maximum iterations of optimizer
min_gnorm = 1e-15 # stop if (the regularized) gradient ∞-norm is below min_gnorm
patience = 10^6 # number of steps without decrease of the loss until convergence is declared
tauinv = nothing # nothing, a scalar or a ComponentArray of shape `x0` with inverse time scales
minloss = 2e-32 # stop if MSE loss is below minloss
maxtime_optim = 2*60 # maximum amount of time in seconds for the Newton optimization
optim_solver = optim_solver_default(p) # optimizer: NewtonTrustRegion() for length(p) ≤ 32, :LD_SLSQP for length(p) ≤ 1000, BFGS() otherwise
verbosity = 1 # verbosity level; higher values produce more output
show_progress = true # show progress
progress_interval = 5 # show progress every x seconds
result = :dict # change to result = :raw for more detailed results
exclude = String[] # dictionary keys to exclude from the results dictionary
include = nothing # dictionary keys to include (everything if nothing)
Runs training dynamics on net from initial point x0. If an Optimisers method, e.g. Adam() or Descent(), is chosen for alg, this is the time-discrete dynamics xₜ = xₜ₋₁ - tauinv * ∇ loss(net, xₜ₋₁), where loss(net, x) = sum((net(x) - net.target).^2) + R(x) with R(x) = 1/3 * (weightnorm(x) - maxnorm)^3 * I(weightnorm(x) > maxnorm) * n_samples(net). For ODE solvers, this is the continuous ordinary differential equation ẋ = -tauinv * ∇ loss(net, x), which is integrated from x(t = 0) = x0 to x(t = maxT). Note that maxT refers to the time of the differential equation, whereas maxtime_ode (and maxtime_optim) refer to the amount of wall-clock time given to the computer to run the algorithms.
If maxiterations_optim > 0, the result of this dynamics is passed to a (second-order) optimizer to accurately find the nearest minimum.
All gradient norms (min_gnorm, and the return values gnorm and gnorm_regularized) are measured in the infinity norm.
train(net, x0::Tuple; num_workers = min(length(x0), Sys.CPU_THREADS),
kwargs...)
Train from multiple initial points in parallel. kwargs are passed to the train function for each initial point.
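A minimal sketch of both call forms, reusing the net from above (the keyword values are illustrative only):
julia> x0 = random_params(net);
julia> res = train(net, x0; maxT = 1e3, maxtime_ode = 60);
julia> results = train(net, (random_params(net), random_params(net)));  # two initial points in parallel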
Utilities
MLPGradientFlow.random_params — Function
random_params(net; kwargs...)
random_params(rng, net; distr_fn = glorot_normal)
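A minimal usage sketch, assuming the returned object is a parameter vector compatible with net; an explicit RNG can be passed for reproducibility:
julia> using Random
julia> x = random_params(net);
julia> x_reproducible = random_params(Xoshiro(42), net);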
MLPGradientFlow.params — Function
params((w₁, b₁), (w₂, b₂), ...)
Where wᵢ is a weight matrix and bᵢ is a bias vector or nothing (None in python).
params(layers::AbstractDict)
Converts parameters in dictionary form to ComponentArray form.
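A minimal sketch of hand-built parameters for a 2-4-1 network (the dimensions are illustrative):
julia> w1, b1 = randn(4, 2), randn(4);
julia> w2 = randn(1, 4);
julia> p = params((w1, b1), (w2, nothing));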
MLPGradientFlow.params2dict — Function
params2dict(p)
Convert a ComponentArray parameter to a dictionary.
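Continuing the sketch above, converting back and forth between the two representations:
julia> d = params2dict(p);
julia> p2 = params(d);   # back to ComponentArray form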
MLPGradientFlow.NormalIntegral — Type
NormalIntegral(; N = 600, d = 1, prune = true, threshold = 1e-18)
Callable struct for Gauss-Hermite integration. Uses FastGaussQuadrature.jl.
Example
julia> integrator = NormalIntegral(d = 1);
julia> integrator(x -> 1.) # normalization of standard normal
0.9999999999999998
julia> integrator(identity) # mean of standard normal
7.578393534606704e-19
julia> integrator(x -> x^2) # variance of standard normal
0.9999999999999969
julia> integrator2d = NormalIntegral(d = 2);
julia> integrator2d(x -> cos(x[1] + x[2])) # integrate some 2D function for x[1] and x[2] iid standard normal
0.3678794411714446
julia> integrator2d(x -> cos(x[1] + x[2]), .5) # integrate with correlation(x[1], x[2]) = 0.5
0.22313016014843348
MLPGradientFlow.trajectory_distance — Function
trajectory_distance(res1, res2)
trajectory_distance(trajectory, reference_trajectory)
Searches for the closest points of trajectory in reference_trajectory and returns the distances, the time points of the reference_trajectory and the indices in the reference_trajectory.
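A hedged sketch, assuming res1 and res2 are results returned by train and that the three return values arrive as a tuple:
julia> res1 = train(net, random_params(net));
julia> res2 = train(net, random_params(net));
julia> dists, ts, idxs = trajectory_distance(res1, res2);  # distances, reference times, reference indices (assumed order)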
MLPGradientFlow.LinearSubspace — Type
LinearSubspace(ref, v1, v2)
Construct a 2D linear subspace from the point ref in directions v1 and v2. See also subspace_minloss.
MLPGradientFlow.to_local_coords — Function
to_local_coords(ls::LinearSubspace, p)
Project point p to the linear subspace ls.
MLPGradientFlow.subspace_minloss — Function
subspace_minloss(net, ref, v1, v2, a1, a2)
Minimize loss in the subspace orthogonal to v1 and v2 with the point in the 2D subspace fixed to ref + a1 * v1 + a2 * v2.
subspace_minloss(net, ls::LinearSubspace, a1, a2)
Minimize loss in the subspace orthogonal to ls with the point in ls fixed to ls.ref + a1 * ls.v1 + a2 * ls.v2.
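A hedged sketch of the intended loss-landscape workflow (the direction vectors and offsets are illustrative):
julia> ref = random_params(net);
julia> v1, v2 = random_params(net), random_params(net);
julia> ls = LinearSubspace(ref, v1, v2);
julia> to_local_coords(ls, ref)             # local coordinates of the reference point
julia> subspace_minloss(net, ls, 0.5, -0.5)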
MLPGradientFlow.grow_net — Function
grow_net(net)
Add one neuron to the hidden layer. Works only for networks with a single hidden layer.
MLPGradientFlow.shrink_net — Function
shrink_net(net)
Remove one neuron from the hidden layer. Works only for networks with a single hidden layer.
MLPGradientFlow.split_neuron — Function
split_neuron(p, i, γ, j = i+1)
Duplicate hidden neuron i with mixing ratio γ and insert the new neuron at position j. Works only for the parameters p of a network with a single hidden layer.
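A hedged sketch for a single-hidden-layer net (assuming grow_net and shrink_net return a new net, and split_neuron returns new parameters):
julia> larger  = grow_net(net);              # one more hidden neuron
julia> smaller = shrink_net(net);            # one fewer hidden neuron
julia> p = random_params(net);
julia> p_split = split_neuron(p, 1, 0.5);    # split neuron 1 with mixing ratio 0.5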
MLPGradientFlow.cosine_similarity — Function
cosine_similarity(x, y) = x'*y/(norm(x)*norm(y))
Saving
MLPGradientFlow.pickle — Function
pickle(filename, result; exclude = String[])
Save the result of training in filename in torch.pickle format. See also result2dict.
MLPGradientFlow.unpickle — Function
unpickle(filename)
Load results saved with pickle.
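A minimal round-trip sketch (the file name is illustrative):
julia> res = train(net, random_params(net));
julia> pickle("result.pt", res)
julia> res_loaded = unpickle("result.pt");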