\[ \newcommand{\argmin}{\operatorname{argmin}} \newcommand{\sign}{\operatorname{sign}} \newcommand{\diag}[1]{\operatorname{diag}(#1)} \newcommand{\prox}[2]{\operatorname{prox}_{#1}\left(#2\right)} \]
In this document, we outline the generic Proximal approaches and algorithms used in the BayesMAP 1 examples.
The basic problem is given by \[ \begin{gather*} \argmin_x \left\{ F(x) = f(x) + \lambda \sum_{j=1}^M \phi(x_j) \right\} \end{gather*} \] where \(F(x)\) is the objective function, \(f(x)\) the loss, and \(\phi(x_j)\) the component-wise penalty.
In the code examples, these global functions are
loss.fun = NULL
obj.fun = NULL
obj.fun.grad = NULL
where obj.fun.grad
is used for testing and corresponds to the gradient of the objective function.
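As a concrete illustration (ours, not taken from the examples), for a least-squares loss with an \(\ell_1\) penalty these globals could be filled in as follows, using an env list with elements A, y, and lambda in the conventions adopted later in this document:

loss.fun = function(x, env) {
  ## squared-error loss f(x) = 1/2 ||A x - y||^2
  return(0.5 * sum((env$A %*% x - env$y)^2))
}
obj.fun = function(x, env) {
  ## full objective F(x) = f(x) + lambda * sum_j |x_j|
  return(loss.fun(x, env) + env$lambda * sum(abs(x)))
}
obj.fun.grad = function(x, env) {
  ## a (sub)gradient of F(x): grad f(x) plus lambda * sign(x)
  return(t(env$A) %*% (env$A %*% x - env$y) + env$lambda * sign(x))
}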
Throughout the BayesMAP examples we often use the Proximal Gradient, or Forward-Backward, algorithm, so we briefly describe it in the following.
We start by defining a quadratic form \(Q(x,z)\) such that \[ \begin{align*} Q(x,z) &=\frac{1}{2} x^T \Lambda(z) x - \eta(z)^T x + \mu(z) \\ f(x) &= \min_z Q(x,z), \forall x \end{align*} \] Then we restrict ourselves to considering only \(\Lambda(z) \succ 0\), \(\eta(z) \ne 0\), and \(\mu(z)\) that allow for the following form \[ \begin{align*} Q(x,z) = \frac{1}{2} \|x - \xi(z)\|^2_{\Lambda(z)} + \psi(z) \end{align*} \] so that \(\xi(z) = \Lambda(z)^{-1} \eta(z)\) is the \(\argmin_x\) of \(Q(x,z)\).
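For example, for a least-squares loss \(f(x) = \frac{1}{2}\|Ax - y\|^2\) this representation is degenerate in \(z\): \[ \begin{align*} Q(x,z) &= \frac{1}{2} x^T (A^T A) x - (A^T y)^T x + \frac{1}{2}\|y\|^2 \end{align*} \] with \(\Lambda(z) = A^T A\), \(\eta(z) = A^T y\), and \(\xi(z) = (A^T A)^{-1} A^T y\), the ordinary least-squares solution, for every \(z\).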
Together with the penalty \(\phi_b(x)\) (the component-wise penalty above, subscripted by its parameter \(b\)), we have that \[ \begin{align*} F(x) &= \min_z \left\{ Q(x,z) + \lambda\phi_b(x) \right\} \\ &= \min_z \left\{\frac{1}{2} \|x - \xi(z)\|^2_{\Lambda(z)} + \lambda\phi_b(x) + \psi(z) \right\} \end{align*} \] so the \(\argmin_x\) that we want ends up being \[ \begin{gather*} x^* = \prox{\lambda \Lambda^{-1} \phi_b}{\xi(z)} \\ \text{where } \prox{A h}{z} = \argmin_u \left\{ \frac{1}{2} \|u - z\|^2_A + h(u) \right\} \end{gather*} \]
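For example, with an identity weight matrix and the \(\ell_1\) penalty \(\phi(u) = |u|\), the prox reduces to component-wise soft-thresholding, \(\prox{\tau \phi}{z} = \sign(z)\max(|z| - \tau, 0)\). A hypothetical implementation in the fprox.fun interface used later (where the second argument is the already-scaled threshold):

soft.threshold.prox = function(x, flambda, env) {
  ## component-wise soft-thresholding: the prox of flambda * |.|
  return(sign(x) * pmax(abs(x) - flambda, 0))
}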
Using these properties, we can apply the following proximal algorithm: \[ \begin{align*} w^{(i)} &= \prox{\lambda \alpha^{(i)} \phi_b}{x^{(i)} - \alpha^{(i)} \xi(x^{(i)})} \\ x^{(i+1)} &= x^{(i)} + \gamma^{(i)} (w^{(i)} - x^{(i)}) \end{align*} \] where \(\alpha^{(i)}\) and \(\gamma^{(i)}\) are step sizes yet to be determined.
For convex, lower semi-continuous functions \(f(x)\) and \(\phi(x)\), convergence results are standard in the proximal setting. When \(\phi(x)\), and/or \(f(x)\), is not convex, the standard tools in convergence proofs (e.g. majorization/minorization, the generalized Pythagorean theorem, etc.) won't work.
Instead, for semialgebraic/subanalytic functions it may be possible to prove convergence via a Łojasiewicz property combined with a Proximal Point method. From there, an argument about composition with a non-expansive operator (possibly the gradient descent part) could provide a proof. See 2.
Also, for quasi-convex functions \(g(x)\) with non-empty solution sets (at least for their \(\operatorname{prox}\) operators), it is possible to show that a Proximal Point algorithm is Fejér monotone with respect to the closed convex set \[
S = \left\{ x \in \operatorname{dom} g : g(x) \le \inf_i g(x^{(i)})\right\}
\] and that the sequence converges to the critical points of \(g(x)\).
See 3 and 4 for more details on that approach.
The following are some methods used for determining \(\alpha^{(i)}\).
In some cases, we can easily derive a viable step size \(\alpha^{(i)}\). For instance, when a function \(f(x)\) has a Lipschitz continuous gradient, i.e. \[ \begin{align*} \| \nabla f(x) - \nabla f(z) \| &\le L_f \|x-z\| \end{align*} \] we can use the quadratic upper bound suggested by a second-order Taylor expansion, \[ \begin{align*} Q(x,z) &= f(z) + \nabla f(z)^T (x-z) + \frac{L}{2} \|x-z\|^2 \end{align*} \] for \(L \ge L_f\). Furthermore, this form of \(Q(x,z)\) has the following useful properties \[ \begin{align*} f(x) &\le Q(x,z), \forall z \\ f(x) &= Q(x,x) \end{align*} \] which yields a majorization algorithm. Now, we can set \(\alpha^{(i)} = 1/L\). The same approach works for other majorizers with a quadratic term in \(x\).
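For instance, for the least-squares loss \(f(x) = \frac{1}{2}\|Ax - y\|^2\) the gradient is \(A^T(Ax - y)\) and \(L_f = \|A\|_2^2\), the squared spectral norm of \(A\); a hypothetical helper (not part of bayes-map-tools.r) is:

lip.const = function(A) {
  ## L_f for f(x) = 1/2 ||A x - y||^2: the largest eigenvalue of A^T A,
  ## i.e. the squared spectral norm of A
  return(norm(A, type = "2")^2)
}
## the corresponding constant step size is alpha = 1 / lip.const(A)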
The proximal algorithm mentioned above simplifies (with \(\gamma^{(i)} = 1\)) to \[ \begin{align*} x^{(i+1)} &= \prox{\frac{\lambda}{L} \phi_b}{x^{(i)} -\frac{1}{L} \nabla f(x^{(i)})} \end{align*} \] which is the classic Forward-Backward/Proximal Gradient algorithm.
We can perform a backtracking line search when viable \(\alpha^{(i)}\) step sizes aren't known in advance. The line search finds the largest step size, along a geometrically shrinking grid, that satisfies an Armijo-type sufficient-decrease condition, given in code by:
#'
#' Rough implementation of an Armijo line-search to be used as a step.fun
#' in prox.grad.
#'
#' @param x.last current location
#' @param x.grad gradient step
#' @param falpha starting step size
#' @param fprox.fun proximal operator
#' @param loss.fun the problem loss function
#' @param env problem environment. Must contain a beta discount factor.
#'
line.search.step = function(x.last, x.grad, falpha, fprox.fun, loss.fun, env) {
  alpha.run = falpha
  z = NULL
  repeat {
    ## forward (gradient) step, then backward (prox) step
    x.step = x.last - alpha.run * x.grad
    z = fprox.fun(x.step, alpha.run * env$lambda, env)
    ## Armijo-type sufficient-decrease condition for the prox-gradient step
    if (loss.fun(z, env) <=
        loss.fun(x.last, env) + crossprod(x.grad, z - x.last)
        + crossprod(z - x.last) / (2 * alpha.run))
      break
    ## shrink the step size geometrically and try again
    alpha.run = env$beta * alpha.run
  }
  return(list('z' = z, 'alpha.run' = alpha.run))
}
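For comparison, a constant-step step.fun with the same interface might look like the sketch below; const.step.update is provided by bayes-map-tools.r, so this is only an assumption about its behavior, not its actual source:

const.step.update = function(x.last, x.grad, falpha, fprox.fun, loss.fun, env) {
  ## a single forward-backward step with a fixed step size (e.g. 1/L)
  x.step = x.last - falpha * x.grad
  z = fprox.fun(x.step, falpha * env$lambda, env)
  return(list('z' = z, 'alpha.run' = falpha))
}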
Generic constant and accelerated (Nesterov) step routines are given by default, and a generalized optimization loop takes these functions (and others) as arguments:
#'
#' Nesterov sequence function to be used as seq.update.fun
#' in prox.grad.
#'
#' @param x current location
#' @param k iteration number
#' @param x.last previous location
#'
nest.seq.update = function(x, k, x.last, ...) {
  ## Nesterov momentum/extrapolation step
  return(x + (k / (k + 3)) * (x - x.last))
}
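When no acceleration is desired, a trivial identity update with the same signature suffices; this ident.seq.update helper is hypothetical (the actual default in bayes-map-tools.r is not shown here):

ident.seq.update = function(x, k, x.last, ...) {
  ## no momentum: recovers the plain proximal gradient iteration
  return(x)
}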
#'
#'
#' This is a generic optimization loop that uses all the passed functions
#' during its evaluation. It is meant to be plug-and-play for a variety
#' of Prox Gradient-like algorithms.
#' @param algo a list containing the following items:
#' xi.fun, prox.fun, step.fun, seq.update.fun, obj.fun, loss.fun, env, desc. Where
#' xi.fun is the gradient/descent step/direction function.
#' step.fun is a function that determines the step size before the prox, e.g. line-search.
#' seq.update.fun is a function that determines the sequence update after the prox, e.g. Nesterov.
#' desc is a string description of the algorithm being implemented.
#' env holds other parameters that get passed to all internal functions. It must contain the
#' following elements:
#' M number of covariates
#' N number of observations
#' A covariate matrix
#' y observations
#' lambda size for prox step
#' alpha size for xi step
#' beta line search discount factor
#' x.true true coefficient vector (used to track SSE)
#' @param max.iters run up to this many iterates (strict)
#' @param all.iters run all max.iters
#' @param propagate propagate last values to max.iters
#' @param show.iters print per-iteration SSE and objective values
#'
prox.grad = function(algo,
                     x0 = matrix(0, algo$env$M, 1),
                     max.iters = 5000,
                     all.iters = FALSE,
                     propagate = FALSE,
                     show.iters = FALSE) {
  x.run = x0
  x.last = x.run
  y.last = rep(Inf, algo$env$M)
  objval.last = Inf
  objval.diff = Inf
  iters = 0
  pg.objEval = c()
  pg.sse = c()
  alpha.run = algo$env$alpha
  if (show.iters)
    cat(sprintf('%3s\t%10s\t%10s\n', 'iter', 'sse', 'objective'))
  while (iters < max.iters && (all.iters || objval.diff > 1e-5)) {
    k = iters + 1
    ## sequence update (e.g. Nesterov extrapolation) before the step
    y.last = algo$seq.update.fun(x.run, k, x.last, algo$env)
    y.grad = algo$xi.fun(y.last, algo$env)
    ## forward-backward step (possibly with line search)
    step.res = algo$step.fun(y.last, y.grad, alpha.run,
                             algo$prox.fun, algo$loss.fun, algo$env)
    alpha.run = step.res$alpha.run
    x.last = x.run
    x.run = step.res$z
    objval.now = algo$obj.fun(x.run, algo$env)
    objval.diff = abs(objval.now - objval.last)
    objval.last = objval.now
    pg.objEval = c(pg.objEval, objval.last)
    ## track SSE against the known true coefficients
    sse.last = crossprod(algo$env$x.true - x.run)
    pg.sse = c(pg.sse, sse.last)
    if (show.iters)
      cat(sprintf('%3d\t%10.4f\t%10.4f\n', iters, sse.last, objval.last))
    iters = iters + 1
  }
  if (propagate && iters < max.iters) {
    ## pad the traces out to max.iters with the last values
    pg.objEval = c(pg.objEval, rep(tail(pg.objEval, 1), max.iters - length(pg.objEval)))
    pg.sse = c(pg.sse, rep(tail(pg.sse, 1), max.iters - length(pg.sse)))
  }
  return(list(x.last = x.run,
              obj.vals = pg.objEval,
              sse.vals = pg.sse,
              type = algo$desc,
              alpha = alpha.run))
}
Beyond these defaults, every implemented example defines a \(\xi(z)\) and a \(\operatorname{prox}\) function that are passed to the generic optimization loop. For example, the logit loss with the double-Pareto prox:
mm.xi = function(x, env, fm = env$m, fA = env$A, fy = env$y) {
  ## gradient of the binomial logit loss: A^T (m * sigmoid(A x) - y),
  ## where m holds the per-observation trial counts
  return(t(fA) %*% (fm * plogis(fA %*% x) - fy))
}
double.pareto.prox = function(x, flambda, env, fb = env$b) {
  ## component-wise double-Pareto proximal operator with parameter b
  d = pmax(fb * abs(x) - flambda, 0)
  return(sign(x) / 2 * (abs(x) - fb + sqrt((fb - abs(x))^2 + 4 * d)))
}
mm.prox.grad.lip = substitute(prox.grad(algo = list(xi.fun = mm.xi,
                                                    prox.fun = double.pareto.prox,
                                                    step.fun = const.step.update,
                                                    desc = "mm.lip")))
To use these methods in an example file, simply source the bayes-map-tools.r
file.