\[ \newcommand{\argmin}{\operatorname{argmin}} \newcommand{\sign}{\operatorname{sign}} \newcommand{\diag}[1]{\operatorname{diag}(#1)} \newcommand{\prox}[2]{\operatorname{prox}_{#1}\left(#2\right)} \]

## Loading required package: knitr

Introduction

In this document, we outline the generic Proximal approaches and algorithms used in the BayesMAP 1 examples.

The basic problem is given by \[ \begin{gather*} \argmin_x \left\{ F(x) = f(x) + \lambda \sum_{j=0}^M \phi(x_j) \right\} \end{gather*} \] where \(F(x)\) is the objective function, \(f(x)\) the loss and \(\phi(x_j)\) the component-wise penalty.

In the code examples, these global functions are

loss.fun = NULL
obj.fun = NULL
obj.fun.grad = NULL

where obj.fun.grad is used for testing and corresponds to the gradient of the objective function.

Methods

Throughout the BayesMAP examples we often use the Proximal Gradient, or Forward-Backward, algorithm, so we briefly describe it in the following.

Proximal Gradient

We start by defining a quadratic form \(Q(x,z)\) such that \[ \begin{align*} Q(x,z) &=\frac{1}{2} x^T \Lambda(z) x - \eta(z)^T x + \mu(z) \\ f(x) &= \min_z Q(x,z), \forall x \end{align*} \] Then we restrict ourselves to considering only \(\Lambda(z) \succ 0\), \(\eta(z)^T \ne 0\), \(\mu(z)\) that allow for the following form \[ \begin{align*} Q(x,z) = \frac{1}{2} \|x - \xi(z)\|^2_{\Lambda(z)} + \psi(z) \end{align*} \] so that \(\xi(z) = \Lambda^{-1} \eta \) is the \(\argmin\) of \(Q(x,z)\) in \(x\).

Together with \(\phi_b(x)\), we have that \[ \begin{align*} F(x) &= \min_z \left\{ Q(x,z) + \lambda\phi_b(x) \right\} \\ &= \min_z \left\{\frac{1}{2} \|x - \xi(z)\|^2_{\Lambda(z)} + \lambda\phi_b(x) + \psi(z) \right\} \end{align*} \] so the \(\argmin_x\) that we want ends up being \[ \begin{gather*} x^* = \prox{\lambda \Lambda^{-1} \phi_b}{\xi(z)} \\ \text{where } \prox{A h}{z} = \argmin_u \left\{ \frac{1}{2} \|u - z\|^2_A + h(u) \right\} \end{gather*} \]

Using these properties, we can apply the following proximal algorithm: \[ \begin{align*} w^{(i)} &= \prox{\lambda \alpha^{(i)} \phi_b}{x^{(i)} - \alpha^{(i)} \xi(x^{(i)})} \\ x^{(i+1)} &= x^{(i)} + \gamma^{(i)} (w^{(i)} - x^{(i)}) \end{align*} \] where \(\alpha^{(i)}, \gamma^{(i)}\) are yet undetermined step sizes.

Convergence

For convex, lower semi-continuous functions \(f(x)\) and \(\phi(x)\), convergence is generally given in the proximal setting… When \(\phi(x)\), and/or \(f(x)\), is not convex, the standard tools in proofs for convergence (e.g. majorizing/minorizing, generalized pythagorean theorem, etc.) won’t work.

Instead, for semialgebra/subanalytic functions it may be possible to prove convergence via a Lojasiewicz property involved with a Proximal Point method. From there, an argument about composition with a non-expansive operator (possibly the gradient descent part) could provide a proof. See 2.

Also, quasi-convex functions with non-empty solution sets, for at least their \(\operatorname{prox}\) operators, it’s possible to show that a Proximal Point algorithm can be Fejér monotonic with respect to the closed convex set \[ S = \left\{ x \in \operatorname{g(x)} : g(x) \le \inf_i g(x^{(i)})\right\} \] and show that the sequence converges to the critical points of \(g(x)\).
See 3 and 4 for more details on that approach.

Step Sizes

The following are some methods used for determining \(\alpha^{(i)}\).

Majorization/Lipschitz Continuity

In some cases, we can easily derive a viable step size \(\alpha^{(i)}\). For instance, when a function \(f(x)\) has Lipschitz continuous gradient, i.e. \[ \begin{align*} \| \nabla f(x) - \nabla f(z) \| &\le L_f \|x-z\| \end{align*} \] we can simply use a second order Taylor expansion as follows \[ \begin{align*} Q(x,z) &= f(z) + \nabla f(z)^T (x-z) + \frac{L}{2} \|x-z\|^2 \end{align*} \] for \(L \ge L_f\). Furthermore, this form of \(Q(x,z)\) has the following useful properties \[ \begin{align*} f(x) &\le Q(x,z), \forall z \\ f(x) &= Q(x,x) \end{align*} \] which yields a majorization algorithm. Now, we can set \(\alpha^{(i)} = 1/L\). The same can work for other majorizers with a quadratic \(x\) term, as well.

The proximal algorithm mentioned above simplifies (with \(\gamma^{(i)} = 1\)) to \[ \begin{align*} x^{(i+1)} &= \prox{\frac{\lambda}{L} \phi_b}{x^{(i)} -\frac{1}{L} \nabla f(x^{(i)})} \end{align*} \] which is the classic Forward-Backward/Proximal Gradient algorithm.

Implementation

Generic constant and accelerated (Nesterov) step routines are given by default, and a generalized optimization loop takes these functions (and others) as arguments:

#'
#' Nesterov sequence function to be used as seq.update.fun
#' in prox.grad.
#'
#' @param x current location
#' @param fk iteration number
#' @param fx.last previous location
#'
nest.seq.update = function(x, k, x.last, ...) {
  return(x + (k/(k+3))*(x - x.last))
}
#'
#'
#' This is a generic optimization loop that uses all the passed functions
#' during it's evaluation.  It's meant to be plug-and-play for a variety
#' of Prox Gradient-like algorithms.
#' @param algo a list containing the following items: 
#'  xi.fun, prox.fun, step.fun, obj.fun, loss.fun, desc.  Where
#'  xi.fun is the gradient/descent step/direction function.
#'  step.fun is a function that determines the step size before the prox, e.g. line-search.
#'  seq.update.fun is a function that determines the step size after the prox, e.g. Nesterov.
#'  desc is a string description of the algorithm being implemented.
#' @param env other parameters that get passed to all internal functions.  this must contain the
#'  following elements:
#'  M number of covariates 
#'  N number of observations 
#'  A covariate matrix 
#'  y observations 
#'  lambda size for prox step
#'  alpha size for xi step
#'  beta line search discount factor 
#' @param max.iters run up to this many iterates (strict)
#' @param all.iters run all max.iters
#' @param propagate propagate last values to max.iters
#'
prox.grad = function(algo,
                     x0 = matrix(0, algo$env$N, 1),
                     max.iters=5000, 
                     all.iters=FALSE,
                     propagate=FALSE,
                     show.iters=FALSE) {
  x.run = x0
  x.last = x.run 
  y.last = rep(Inf, algo$env$N) 
  objval.last = Inf
  objval.diff = Inf
  iters = 0
  pg.objEval = c()
  pg.sse = c()
  alpha.run = algo$env$alpha
  if(show.iters)
    cat(sprintf('%3s\t%10s\t%10s\n', 'iter', 'sse', 'objective'))
  while(iters < max.iters && (all.iters || objval.diff > 1e-5)) {
    k = iters + 1
    y.last = algo$seq.update.fun(x.run, k, x.last, algo$env) 
    y.grad = algo$xi.fun(y.last, algo$env)

    step.res = algo$step.fun(y.last, y.grad, alpha.run, 
                             algo$prox.fun, algo$loss.fun, algo$env)

    alpha.run = step.res$alpha.run 
    x.last = x.run
    x.run = step.res$z

    objval.now = algo$obj.fun(x.run, algo$env)
    objval.diff = abs(objval.now - objval.last)
    objval.last = objval.now 

    pg.objEval = c(pg.objEval, objval.last) 
    sse.last = crossprod(algo$env$x.true - x.run)
    pg.sse = c(pg.sse, sse.last) 
    if(show.iters)
      cat(sprintf('%3d\t%10.4f\t%10.4f\n', iters, sse.last, objval.last))
    iters = iters+1
  }
  if (propagate && iters < max.iters) {
    pg.objEval = c(pg.objEval, rep(tail(pg.objEval,1), max.iters-length(pg.objEval)))
    pg.sse = c(pg.sse, rep(tail(pg.sse,1), max.iters-length(pg.sse)))

  }
  return(list(x.last = x.run,
              obj.vals = pg.objEval, 
              sse.vals = pg.sse, 
              type=algo$desc, 
              alpha=alpha.run))
}

Otherwise, every implemented example defines a \(\xi(z)\) and \(\operatorname{prox}\) function that gets passed to the generic optimization loop. For example, the Logit with double-Pareto prox:

mm.xi = function(x, fm = m, fA = A, fy = y) {
  return(t(fA) %*% (fm * plogis(fA %*% x) - fy))
}

double.pareto.prox = function(x, flambda = lambda, fb = b) {
  d = pmax(fb * abs(x) - flambda, 0)
  return(sign(x)/2 * (abs(x) - fb + sqrt((fb - abs(x))^2 + 4*d)))
}

mm.prox.grad.lip = substitute(prox.grad(xi.fun=mm.xi, 
                                        prox.fun=double.pareto.prox,
                                        step.fun=const.step.update, 
                                        desc="mm.lip"))

Usage

To use these methods in an example file, simply source the bayes-map-tools.r file.


  1. “BayesMAP” Walter Dempsey, Nicholas Polson, Brandon Willard

  2. “On the convergence of the proximal algorithm for nonsmooth functions involving analytic features” Attouch and Bolte link

  3. “Fejér Monotonicity in Convex Optimization” Combettes link

  4. “Convergence of the Proximal Point Method for Quasiconvex Minimization” E.A Papa Quiroz, P. Roberto Oliveira