\[ \newcommand{\argmin}{\operatorname{argmin}} \newcommand{\sign}{\operatorname{sign}} \newcommand{\diag}[1]{\operatorname{diag}(#1)} \newcommand{\prox}[2]{\operatorname{prox}_{#1}\left(#2\right)} \]
In this document, we outline the generic Proximal approaches and algorithms used in the BayesMAP 1 examples.
The basic problem is given by \[ \begin{gather*} \argmin_x \left\{ F(x) = f(x) + \lambda \sum_{j=1}^M \phi(x_j) \right\} \end{gather*} \] where \(F(x)\) is the objective function, \(f(x)\) the loss, and \(\phi(x_j)\) the component-wise penalty.
In the code examples, these global functions are
loss.fun = NULL
obj.fun = NULL
obj.fun.grad = NULL
where obj.fun.grad
is used for testing and corresponds to the gradient of the objective function.
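As a concrete illustration (ours, not taken from the examples), for a least-squares loss with an \(\ell_1\) penalty these globals could be filled in as follows, using an env list with elements A, y, and lambda in the conventions adopted later in this document:

loss.fun = function(x, env) {
  ## squared-error loss f(x) = 1/2 ||A x - y||^2
  return(0.5 * sum((env$A %*% x - env$y)^2))
}
obj.fun = function(x, env) {
  ## full objective F(x) = f(x) + lambda * sum_j |x_j|
  return(loss.fun(x, env) + env$lambda * sum(abs(x)))
}
obj.fun.grad = function(x, env) {
  ## a (sub)gradient of F(x): grad f(x) plus lambda * sign(x)
  return(t(env$A) %*% (env$A %*% x - env$y) + env$lambda * sign(x))
}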
Throughout the BayesMAP examples we often use the Proximal Gradient, or Forward-Backward, algorithm, so we briefly describe it in the following.
We start by defining a quadratic form \(Q(x,z)\) such that \[ \begin{align*} Q(x,z) &=\frac{1}{2} x^T \Lambda(z) x - \eta(z)^T x + \mu(z) \\ f(x) &= \min_z Q(x,z), \forall x \end{align*} \] Then we restrict ourselves to considering only \(\Lambda(z) \succ 0\), \(\eta(z) \ne 0\), and \(\mu(z)\) that allow for the following form \[ \begin{align*} Q(x,z) = \frac{1}{2} \|x - \xi(z)\|^2_{\Lambda(z)} + \psi(z) \end{align*} \] so that \(\xi(z) = \Lambda(z)^{-1} \eta(z)\) is the \(\argmin_x\) of \(Q(x,z)\).
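For example, for a least-squares loss \(f(x) = \frac{1}{2}\|Ax - y\|^2\) this representation is degenerate in \(z\): \[ \begin{align*} Q(x,z) &= \frac{1}{2} x^T (A^T A) x - (A^T y)^T x + \frac{1}{2}\|y\|^2 \end{align*} \] with \(\Lambda(z) = A^T A\), \(\eta(z) = A^T y\), and \(\xi(z) = (A^T A)^{-1} A^T y\), the ordinary least-squares solution, for every \(z\).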
Together with the penalty \(\phi_b(x)\) (the component-wise penalty above, subscripted by its parameter \(b\)), we have that \[ \begin{align*} F(x) &= \min_z \left\{ Q(x,z) + \lambda\phi_b(x) \right\} \\ &= \min_z \left\{\frac{1}{2} \|x - \xi(z)\|^2_{\Lambda(z)} + \lambda\phi_b(x) + \psi(z) \right\} \end{align*} \] so the \(\argmin_x\) that we want ends up being \[ \begin{gather*} x^* = \prox{\lambda \Lambda^{-1} \phi_b}{\xi(z)} \\ \text{where } \prox{A h}{z} = \argmin_u \left\{ \frac{1}{2} \|u - z\|^2_A + h(u) \right\} \end{gather*} \]
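For example, with an identity weight matrix and the \(\ell_1\) penalty \(\phi(u) = |u|\), the prox reduces to component-wise soft-thresholding, \(\prox{\tau \phi}{z} = \sign(z)\max(|z| - \tau, 0)\). A hypothetical implementation in the fprox.fun interface used later (where the second argument is the already-scaled threshold):

soft.threshold.prox = function(x, flambda, env) {
  ## component-wise soft-thresholding: the prox of flambda * |.|
  return(sign(x) * pmax(abs(x) - flambda, 0))
}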
Using these properties, we can apply the following proximal algorithm: \[ \begin{align*} w^{(i)} &= \prox{\lambda \alpha^{(i)} \phi_b}{x^{(i)} - \alpha^{(i)} \xi(x^{(i)})} \\ x^{(i+1)} &= x^{(i)} + \gamma^{(i)} (w^{(i)} - x^{(i)}) \end{align*} \] where \(\alpha^{(i)}\) and \(\gamma^{(i)}\) are step sizes yet to be determined.
For convex, lower semi-continuous functions \(f(x)\) and \(\phi(x)\), convergence results are standard in the proximal setting. When \(\phi(x)\), and/or \(f(x)\), is not convex, the standard tools in convergence proofs (e.g. majorization/minorization, the generalized Pythagorean theorem, etc.) won't work.
Instead, for semialgebraic/subanalytic functions it may be possible to prove convergence via a Łojasiewicz property combined with a Proximal Point method. From there, an argument about composition with a non-expansive operator (possibly the gradient descent part) could provide a proof. See 2.
Also, for quasi-convex functions \(g(x)\) with non-empty solution sets (at least for their \(\operatorname{prox}\) operators), it is possible to show that a Proximal Point algorithm is Fejér monotone with respect to the closed convex set \[
S = \left\{ x \in \operatorname{dom} g : g(x) \le \inf_i g(x^{(i)})\right\}
\] and that the sequence converges to the critical points of \(g(x)\).
See 3 and 4 for more details on that approach.
The following are some methods used for determining \(\alpha^{(i)}\).
In some cases, we can easily derive a viable step size \(\alpha^{(i)}\). For instance, when a function \(f(x)\) has a Lipschitz continuous gradient, i.e. \[ \begin{align*} \| \nabla f(x) - \nabla f(z) \| &\le L_f \|x-z\| \end{align*} \] we can use the quadratic upper bound suggested by a second-order Taylor expansion, \[ \begin{align*} Q(x,z) &= f(z) + \nabla f(z)^T (x-z) + \frac{L}{2} \|x-z\|^2 \end{align*} \] for \(L \ge L_f\). Furthermore, this form of \(Q(x,z)\) has the following useful properties \[ \begin{align*} f(x) &\le Q(x,z), \forall z \\ f(x) &= Q(x,x) \end{align*} \] which yields a majorization algorithm. Now, we can set \(\alpha^{(i)} = 1/L\). The same approach works for other majorizers with a quadratic term in \(x\).
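For instance, for the least-squares loss \(f(x) = \frac{1}{2}\|Ax - y\|^2\) the gradient is \(A^T(Ax - y)\) and \(L_f = \|A\|_2^2\), the squared spectral norm of \(A\); a hypothetical helper (not part of bayes-map-tools.r) is:

lip.const = function(A) {
  ## L_f for f(x) = 1/2 ||A x - y||^2: the largest eigenvalue of A^T A,
  ## i.e. the squared spectral norm of A
  return(norm(A, type = "2")^2)
}
## the corresponding constant step size is alpha = 1 / lip.const(A)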
The proximal algorithm mentioned above simplifies (with \(\gamma^{(i)} = 1\)) to \[ \begin{align*} x^{(i+1)} &= \prox{\frac{\lambda}{L} \phi_b}{x^{(i)} -\frac{1}{L} \nabla f(x^{(i)})} \end{align*} \] which is the classic Forward-Backward/Proximal Gradient algorithm.
We can perform a backtracking line search when viable \(\alpha^{(i)}\) step sizes aren't known in advance. The line search finds the largest step size, along a geometrically shrinking grid, that satisfies an Armijo-type sufficient-decrease condition, given in code by:
#'
#' Rough implementation of an Armijo line-search to be used as a step.fun
#' in prox.grad.
#'
#' @param x.last current location
#' @param x.grad gradient step
#' @param falpha starting step size
#' @param fprox.fun proximal operator
#' @param loss.fun the problem loss function
#' @param env problem environment. Must contain a beta discount factor.
#'
line.search.step = function(x.last, x.grad, falpha, fprox.fun, loss.fun, env) {
  alpha.run = falpha
  z = NULL
  repeat {
    ## forward (gradient) step, then backward (prox) step
    x.step = x.last - alpha.run * x.grad
    z = fprox.fun(x.step, alpha.run * env$lambda, env)
    ## Armijo-type sufficient-decrease condition for the prox-gradient step
    if (loss.fun(z, env) <=
        loss.fun(x.last, env) + crossprod(x.grad, z - x.last)
        + crossprod(z - x.last) / (2 * alpha.run))
      break
    ## shrink the step size geometrically and try again
    alpha.run = env$beta * alpha.run
  }
  return(list('z' = z, 'alpha.run' = alpha.run))
}
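For comparison, a constant-step step.fun with the same interface might look like the sketch below; const.step.update is provided by bayes-map-tools.r, so this is only an assumption about its behavior, not its actual source:

const.step.update = function(x.last, x.grad, falpha, fprox.fun, loss.fun, env) {
  ## a single forward-backward step with a fixed step size (e.g. 1/L)
  x.step = x.last - falpha * x.grad
  z = fprox.fun(x.step, falpha * env$lambda, env)
  return(list('z' = z, 'alpha.run' = falpha))
}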
Generic constant and accelerated (Nesterov) step routines are given by default, and a generalized optimization loop takes these functions (and others) as arguments:
#'
#' Nesterov sequence function to be used as seq.update.fun
#' in prox.grad.
#'
#' @param x current location
#' @param k iteration number
#' @param x.last previous location
#'
nest.seq.update = function(x, k, x.last, ...) {
  ## Nesterov momentum/extrapolation step
  return(x + (k / (k + 3)) * (x - x.last))
}
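When no acceleration is desired, a trivial identity update with the same signature suffices; this ident.seq.update helper is hypothetical (the actual default in bayes-map-tools.r is not shown here):

ident.seq.update = function(x, k, x.last, ...) {
  ## no momentum: recovers the plain proximal gradient iteration
  return(x)
}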
#'
#'
#' This is a generic optimization loop that uses all the passed functions
#' during its evaluation. It is meant to be plug-and-play for a variety
#' of Prox Gradient-like algorithms.
#' @param algo a list containing the following items:
#' xi.fun, prox.fun, step.fun, seq.update.fun, obj.fun, loss.fun, env, desc. Where
#' xi.fun is the gradient/descent step/direction function.
#' step.fun is a function that determines the step size before the prox, e.g. line-search.
#' seq.update.fun is a function that determines the sequence update after the prox, e.g. Nesterov.
#' desc is a string description of the algorithm being implemented.
#' env holds other parameters that get passed to all internal functions. It must contain the
#' following elements:
#' M number of covariates
#' N number of observations
#' A covariate matrix
#' y observations
#' lambda size for prox step
#' alpha size for xi step
#' beta line search discount factor
#' x.true true coefficient vector (used to track SSE)
#' @param max.iters run up to this many iterates (strict)
#' @param all.iters run all max.iters
#' @param propagate propagate last values to max.iters
#' @param show.iters print per-iteration SSE and objective values
#'
prox.grad = function(algo,
                     x0 = matrix(0, algo$env$M, 1),
                     max.iters = 5000,
                     all.iters = FALSE,
                     propagate = FALSE,
                     show.iters = FALSE) {
  x.run = x0
  x.last = x.run
  y.last = rep(Inf, algo$env$M)
  objval.last = Inf
  objval.diff = Inf
  iters = 0
  pg.objEval = c()
  pg.sse = c()
  alpha.run = algo$env$alpha
  if (show.iters)
    cat(sprintf('%3s\t%10s\t%10s\n', 'iter', 'sse', 'objective'))
  while (iters < max.iters && (all.iters || objval.diff > 1e-5)) {
    k = iters + 1
    ## sequence update (e.g. Nesterov extrapolation) before the step
    y.last = algo$seq.update.fun(x.run, k, x.last, algo$env)
    y.grad = algo$xi.fun(y.last, algo$env)
    ## forward-backward step (possibly with line search)
    step.res = algo$step.fun(y.last, y.grad, alpha.run,
                             algo$prox.fun, algo$loss.fun, algo$env)
    alpha.run = step.res$alpha.run
    x.last = x.run
    x.run = step.res$z
    objval.now = algo$obj.fun(x.run, algo$env)
    objval.diff = abs(objval.now - objval.last)
    objval.last = objval.now
    pg.objEval = c(pg.objEval, objval.last)
    ## track SSE against the known true coefficients
    sse.last = crossprod(algo$env$x.true - x.run)
    pg.sse = c(pg.sse, sse.last)
    if (show.iters)
      cat(sprintf('%3d\t%10.4f\t%10.4f\n', iters, sse.last, objval.last))
    iters = iters + 1
  }
  if (propagate && iters < max.iters) {
    ## pad the traces out to max.iters with the last values
    pg.objEval = c(pg.objEval, rep(tail(pg.objEval, 1), max.iters - length(pg.objEval)))
    pg.sse = c(pg.sse, rep(tail(pg.sse, 1), max.iters - length(pg.sse)))
  }
  return(list(x.last = x.run,
              obj.vals = pg.objEval,
              sse.vals = pg.sse,
              type = algo$desc,
              alpha = alpha.run))
}
Beyond these defaults, every implemented example defines a \(\xi(z)\) and a \(\operatorname{prox}\) function that are passed to the generic optimization loop. For example, the logit loss with the double-Pareto prox:
mm.xi = function(x, env, fm = env$m, fA = env$A, fy = env$y) {
  ## gradient of the binomial logit loss: A^T (m * sigmoid(A x) - y),
  ## where m holds the per-observation trial counts
  return(t(fA) %*% (fm * plogis(fA %*% x) - fy))
}
double.pareto.prox = function(x, flambda, env, fb = env$b) {
  ## component-wise double-Pareto proximal operator with parameter b
  d = pmax(fb * abs(x) - flambda, 0)
  return(sign(x) / 2 * (abs(x) - fb + sqrt((fb - abs(x))^2 + 4 * d)))
}
mm.prox.grad.lip = substitute(prox.grad(algo = list(xi.fun = mm.xi,
                                                    prox.fun = double.pareto.prox,
                                                    step.fun = const.step.update,
                                                    desc = "mm.lip")))
To use these methods in an example file, simply source the bayes-map-tools.r
file.