Constructing Networks

MLPGradientFlow.Net - Type
Net(; layers, input, target, weights = nothing,
      bias_adapt_input = true, derivs = 2, copy_input = true, verbosity = 1,
      Din = size(input, 1) - last(first(layers))*(1-bias_adapt_input))

layers # ((num_neurons_layer1, activation_function_layer1, has_bias_layer1),
       #  (num_neurons_layer2, activation_function_layer2, has_bias_layer2),
       #   ...)
input  # Dᵢₙ × N matrix
target # Dₒᵤₜ × N matrix
weights # nothing or an array of length N
bias_adapt_input = true # adds a row of 1s to the input
derivs = 2              # allocate memory for up to derivs derivatives (0, 1, 2)
copy_input = true       # copy the input when creating the net

Example

julia> input = randn(2, 100)
julia> target = randn(1, 100)
julia> net = Net(; layers = ((10, softplus, true), (1, identity, true)),
                   input, target);
MLPGradientFlow.TeacherNet - Type
TeacherNet(; p = nothing, kwargs...)

Creates a network with parameters p attached. If p == nothing, random parameters are generated with random_params. A TeacherNet is a callable object that returns the target for a given input. Keyword arguments kwargs are passed to Net.

Example

julia> teacher = TeacherNet(; layers = ((8, softplus, true), (1, identity, true)),
                              Din = 3);

julia> input = randn(3, 10^4);

julia> target = teacher(input);

julia> new_input = randn(3, 10^3);

julia> new_target = teacher(new_input);
MLPGradientFlow.NetI - Type
NetI(teacher, student; T = eltype(student.input),
     g1 = _stride_arrayize(NormalIntegral(d = 1)),
     g2 = _stride_arrayize(NormalIntegral(d = 2)))
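
The defaults suggest that NetI pairs a teacher and a student network and evaluates integrals over standard normal input with NormalIntegral. The following construction is a hedged sketch; the layer configuration and the use of a TeacherNet as the teacher argument are assumptions.

julia> teacher = TeacherNet(; layers = ((4, softplus, false), (1, identity, false)), Din = 2);

julia> input = randn(2, 100);   # placeholder data for the student (assumption)

julia> student = Net(; layers = ((6, softplus, false), (1, identity, false)),
                       input, target = teacher(input));

julia> neti = NetI(teacher, student);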
MLPGradientFlow.gauss_hermite_net - Function
gauss_hermite_net(target_function, net::Net; kwargs...)

Create from net a network whose input points and weights are obtained from NormalIntegral (to which kwargs are passed). The target_function is used to compute the target values. Note that in more than 2 input dimensions the number of points grows very quickly: with default settings for NormalIntegral, more than a million points are generated in 3 dimensions.

Example

julia> net = gauss_hermite_net(x -> reshape(x[1, :] .^ 2, 1, :),
                               Net(layers = ((5, softplus, false),
                                             (1, identity, true)), Din = 2))

Loss and its Derivatives

MLPGradientFlow.loss - Function
loss(net, x, input = net.input, target = net.target;
     verbosity = 1, losstype = MSE(), weights = net.weights, maxnorm = Inf,
     merge = nothing)

Compute the loss of net at parameter value x.

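For illustration, a minimal call might look as follows; this is a hedged sketch that builds the parameter vector with params (documented under Utilities below).

julia> input = randn(2, 100); target = randn(1, 100);

julia> net = Net(; layers = ((10, softplus, true), (1, identity, true)), input, target);

julia> x = params((randn(10, 2), randn(10)), (randn(1, 10), randn(1)));  # 2-10-1 parameters

julia> loss(net, x)
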
MLPGradientFlow.gradient - Function
gradient(net, x; input = net.input, target = net.target, kwargs...)

Compute the gradient of net at parameter value x. See loss for kwargs.

MLPGradientFlow.hessian - Function
hessian(net, x; input = net.input, target = net.target, kwargs...)

Compute hessian of net at parameter value x. See loss for kwargs.

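Continuing the hedged loss sketch above, gradient and hessian are called the same way.

julia> g = gradient(net, x);   # derivative of the loss with respect to the parameters x

julia> H = hessian(net, x);    # second derivatives with respect to x
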

Training

MLPGradientFlow.train - Function
train(net, x0; kwargs...)

Train net from initial point x0.

Keyword arguments:

maxnorm = Inf                  # constant maxnorm in the regularizer R(x) of the loss (see below)

batchsize = nothing            # use the full data set in each step when nothing

alg = alg_default(x0)          # ODE solver: KenCarp58() for length(x0) ≤ 64, RK4() otherwise
maxT = 1e10                    # upper integration limit
save_everystep = true          # return a trajectory and loss curve
n_samples_trajectory = 100     # number of samples of the trajectory
abstol = 1e-6                  # absolute tolerance of the ODE solver
reltol = 1e-3                  # relative tolerance of the ODE solver
maxtime_ode = 3*60             # maximum amount of time in seconds for the ODE solver
maxiterations_ode = 10^6       # maximum iterations of ODE solver

maxiterations_optim = 10^5     # maximum iterations of optimizer
min_gnorm = 1e-15              # stop if (the regularized) gradient ∞-norm is below min_gnorm
patience = 10^6                # number of steps without a decrease of the loss before training counts as converged
tauinv = nothing               # nothing, a scalar or a ComponentArray of shape `x0` with inverse time scales
minloss = 2e-32                # stop if MSE loss is below minloss
maxtime_optim = 2*60           # maximum amount of time in seconds for the Newton optimization
optim_solver = optim_solver_default(x0) # optimizer: NewtonTrustRegion() for length(x0) ≤ 32, :LD_SLSQP for length(x0) ≤ 1000, BFGS() otherwise

verbosity = 1                  # increase verbosity for more output
show_progress = true           # show progress
progress_interval = 5          # show progress every x seconds
result = :dict                 # change to result = :raw for more detailed results
exclude = String[]             # dictionary keys to exclude from the results dictionary
include = nothing              # dictionary keys to include (everything if nothing)

Runs training dynamics on net from initial point x0. If an Optimisers method, e.g. Adam() or Descent(), is chosen for alg, this is the time-discrete dynamics xₜ = xₜ₋₁ - tauinv * ∇loss(net, xₜ₋₁), where loss(net, x) = sum((net(x) - net.target).^2) + R(x) with R(x) = 1/3 * (weightnorm(x) - maxnorm)^3 * I(weightnorm(x) > maxnorm) * n_samples(net). For ODE solvers, this is the continuous ordinary differential equation ẋ = -tauinv * ∇loss(net, x), which is integrated from x(t = 0) = x0 to x(t = maxT).

Note that maxT refers to the time of the differential equation, whereas maxtime_ode (and maxtime_optim) refer to the wall-clock time the computer is given to run the algorithms.

If maxiterations_optim > 0, the result of this dynamics is passed to a (second-order) optimizer to accurately locate the nearest minimum.

All gradient norms (min_gnorm, and return values gnorm and gnorm_regularized) are measured in the infinity norm.

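A hedged end-to-end sketch; the architecture, the random data, and the time limits are illustrative.

julia> input = randn(2, 100); target = randn(1, 100);

julia> net = Net(; layers = ((10, softplus, true), (1, identity, true)), input, target);

julia> x0 = params((randn(10, 2), randn(10)), (randn(1, 10), randn(1)));

julia> res = train(net, x0; maxtime_ode = 60, maxtime_optim = 60);

julia> keys(res)   # inspect the entries of the result dictionary
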
train(net, x0::Tuple; num_workers = min(length(x0), Sys.CPU_THREADS),
                      kwargs...)

Train from multiple initial points in parallel. kwargs are passed to the train function for each initial point.

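A hedged sketch with four random initial points, reusing net from the sketch above.

julia> x0s = ntuple(_ -> params((randn(10, 2), randn(10)), (randn(1, 10), randn(1))), 4);

julia> results = train(net, x0s; maxtime_ode = 60);
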

Utilities

MLPGradientFlow.params - Function
params((w₁, b₁), (w₂, b₂), ...)

Where wᵢ is a weight matrix and bᵢ is a bias vector or nothing (None in Python).

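For illustration, parameters of a 2-10-1 network without output bias could be built like this (a hedged sketch).

julia> w1, b1 = randn(10, 2), randn(10);

julia> w2, b2 = randn(1, 10), nothing;   # no bias in the output layer

julia> p = params((w1, b1), (w2, b2));
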
params(layers::AbstractDict)

Converts parameters in dictionary form to ComponentArray form.

MLPGradientFlow.NormalIntegral - Type
NormalIntegral(; N = 600, d = 1, prune = true, threshold = 1e-18)

Callable struct for Gauss-Hermite integration. Uses FastGaussQuadrature.jl.

Example

julia> integrator = NormalIntegral(d = 1);

julia> integrator(x -> 1.) # normalization of standard normal
0.9999999999999998

julia> integrator(identity) # mean of standard normal
7.578393534606704e-19

julia> integrator(x -> x^2) # variance of standard normal
0.9999999999999969

julia> integrator2d = NormalIntegral(d = 2);

julia> integrator2d(x -> cos(x[1] + x[2])) # integrate some 2D function for x[1] and x[2] iid standard normal
0.3678794411714446

julia> integrator2d(x -> cos(x[1] + x[2]), .5) # integrate with correlation(x[1], x[2]) = 0.5
0.22313016014843348
MLPGradientFlow.trajectory_distance - Function
trajectory_distance(res1, res2)
trajectory_distance(trajectory, reference_trajectory)

Searches for the closest points of trajectory in reference_trajectory and returns the distances, time points of the reference_trajectory and the indices in reference_trajectory.

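A hedged usage sketch, assuming res1 and res2 are results returned by train with save_everystep = true; x01 and x02 are hypothetical initial points, and the three-part destructuring follows the description above but is an assumption.

julia> res1 = train(net, x01; maxtime_ode = 60);

julia> res2 = train(net, x02; maxtime_ode = 60);

julia> dists, times, idxs = trajectory_distance(res1, res2);  # assumed return: distances, reference times, indices
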
MLPGradientFlow.subspace_minloss - Function
subspace_minloss(net, ref, v1, v2, a1, a2)

Minimize loss in the subspace orthogonal to v1 and v2 with the point in the 2D subspace fixed to ref + a1 * v1 + a2 * v2.

subspace_minloss(net, ls::LinearSubspace, a1, a2)

Minimize loss in the subspace orthogonal to ls with the point in ls fixed to ls.ref + a1 * ls.v1 + a2 * ls.v2.

MLPGradientFlow.split_neuron - Function
split_neuron(p, i, γ, j = i+1)

Duplicate hidden neuron i with mixing ratio γ and insert the new neuron at position j. Works only for the parameters p of a network with a single hidden layer.

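A hedged sketch for the parameters of a 2-3-1 network; how γ is applied to the duplicated neuron's weights is not spelled out above.

julia> p = params((randn(3, 2), randn(3)), (randn(1, 3), randn(1)));   # single hidden layer with 3 neurons

julia> p2 = split_neuron(p, 1, 0.5);   # duplicate hidden neuron 1 with mixing ratio γ = 0.5
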

Saving

MLPGradientFlow.pickle - Function
pickle(filename, result; exclude = String[])

Save the result of training to filename in torch.pickle format. See also result2dict.

MLPGradientFlow.unpickle - Function
unpickle(filename)

Loads results saved with pickle (see above).

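A hedged round-trip sketch; net and x0 are as in the Training section and the filename is arbitrary.

julia> res = train(net, x0; maxtime_ode = 60);

julia> pickle("result.pt", res);

julia> res2 = unpickle("result.pt");
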