Javascript required
Skip to content Skip to sidebar Skip to footer

Find the Solution U R Theta of the Problem

Numerical Optimization

IN A Few Useful Things to Know about Machine Learning, Pedro Domingos put up a relation: $\color{aqua}{LEARNING}$ = $\color{green}{REPRESENTATION}$ + $\color{blue}{EVALUATION}$ + $\color{red}{OPTIMIZATION}$.

  • Representation as the core of the note is the general (mathematical) model that computer can handle.
  • Evaluation is criteria. An evaluation function (also called objective function, cost function or scoring function) is needed to distinguish good classifiers from bad ones.
  • Optimization is aimed to find the parameters that optimizes the evaluation function, i.e. $$ \arg\min_{\theta\in \Theta} f(\theta)={\theta^{\ast}|f(\theta^{\ast})=\min f(\theta)},\text{or} \ \quad\arg\max_{\theta\in \Theta}f(\theta)={\theta^{\ast}|f(\theta^{\ast})=\max f(\theta)}. $$

The objective function to be minimized is also called cost function.

Evaluation is always attached with optimization; the evaluation which cannot be optimized is not a good evaluation in machine learning.

  • 15 095 - Machine Learning Under a Modern Optimization Lens
  • https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
  • http://www.cs.cmu.edu/~pradeepr/convexopt/
  • An interactive tutorial to numerical optimization
  • Patrick Louis' RECENT CONFERENCE TALKS on optimization
  • Northwestern University Open Text Book on Process Optimization
  • Introductory course on non-smooth optimisation
  • CS4540: Simple Algorithms
  • CS 798: Convexity and Optimization
  • Survival Guide for Students of Optimization, Dianne P. O'Leary, September 2017
  • https://nlopt.readthedocs.io/en/latest/
  • https://cp2019.a4cp.org/
  • https://minoa-itn.fau.de/
  • https://gekko.readthedocs.io/en/latest/
  • http://opt-ml.org/schedule.html
  • https://statisticaloddsandends.wordpress.com/
  • https://tydlin.github.io/
  • https://faculty.ucmerced.edu/mcarreira-perpinan/papers.html
  • https://www.nsf.gov/awardsearch/showAward?AWD_ID=1423515

The proof of convergence or complexity is often based on the convex cases where the objective function as well as the constrained set is convex, i.e., $$t x+(1-t)y\in\Theta,\ f(t x+(1-t)y)\leq t f(x)+(1-t)f(y),\ \quad t\in [0,1], \quad \forall x, y\in\Theta.$$

And this optimization is called convex optimization. By the way, the name numerical optimization means the variables or parameters to estimate are in numeric format, which is far from performance optimization in concept.

First optimal condition is a necessary condition for the unconstrainted optimziation problems if the objective function is differentiable: if $\nabla f$ exists, and $x^{\ast}\in\arg\min_{x} f(x)$, we have $\nabla f(x^{\ast})=0$.


Wotao Yin wrote a summary on First-order methods and operator splitting for optimization:

First-order methods are described and analyzed with gradients or subgradients, while second-order methods use second-order derivatives or their approximations.

During the 70s–90s the majority of the optimization community focused on second-order methods since they are more efficient for those problems that have the sizes of interest at that time. Beginning around fifteen years ago, however, the demand to solve ever larger problems started growing very quickly. Many large problems are further complicated by non-differentiable functions and constraints. Because simple first-order and classic second-order methods are ineffective or infeasible for these problems, operator splitting methods regained their popularity.

Operators are used to develop algorithms and analyze them for a wide spectrum of problems including optimization problems, variational inequalities, and differential equations. Operator splitting is a set of ideas that generate algorithms through decomposing a problem that is too difficult as a whole into two or more smaller and simpler subproblems. During the decomposition, complicated structures like non-differentiable functions, constraint sets, and distributed problem structures end up in different subproblems and thus can be handled elegantly. We believe ideas from operator splitting provide the most effective means to handle such complicated structures for computational problem sizes of modern interest.

  • The world of optimization
  • Gradient flow and gradient descent
  • Accelerated gradient descent
  • Accelerated gradient flow
  • Stochastic gradient flow
  • Aristotle vs. Newton
  • Advanced non-smooth optimization
  • The 2014 International Workshop on Signal Processing, Optimization, and Control (SPOC 2014)

Gradient Descent and More

Each iteration of a line search method computes a search direction $p^{k}$ and then decides how far to move along that direction. The iteration is given by

$$ x^{k+1}=x^{k}+\alpha_{k}p^{k}\tag{Line search} $$

where the positive scalar $\alpha_{k}$ is called the step length. The success of a line search method depends on effective choices of both the direction $p^{k}$ and the step length $\alpha_{k}$.

$\color{lime}{Note}$: we use the notation $x^{k}$ and $\alpha_k$ to represent the $k$th iteration of the vector variables $x$ and $k$ th step length, respectively. Most line search algorithms require $p^k$ to be a descent direction — one for which $\left< {p^k},\nabla f_k \right> < 0$ — because this property guarantees that the function $f$ can be reduced along this direction, where $\nabla f_k$ is the gradient of objective function $f$ at the $k$th iteration point $x_k$ i.e. $\nabla f_k=\nabla f(x^{k})$.


Gradient descent and its variants are to find the local solution of the unconstrained optimization problem:

$$ \min_{x} f(x) $$

where $x\in \mathbb{R}^{n}$.

It is not difficult to observe that $$f(x) \approx f(x^k)+(x - x^k)^{T}\nabla f(x^{k}) + \frac{s_k}{2}{|x-x^k|}_2^2$$ by Taylor expansion of $f$ near the point $x^{k}$ for some constant $s_k$.

Let $x^{k+1}=\arg\min_{x} f(x^k) + (x - x^k)^{T}\nabla f(x^{k}) + \frac{s_k}{2}{|x-x^k|}2^2$, we will obtain $x^{k+1} = x^{k}-\frac{1}{s_k}{\nabla}{x} f(x^k)$. However, the constant $s_k$ is difficult to estimate.

And the general gradient descent methods are given by

$$ x^{k+1}=x^{k}-\alpha_{k}{\nabla}_{x} f(x^k) $$

where $x^{k}$ is the $k$th iterative result, $\alpha_{k}\in{\alpha|f(x^{k+1})< f(x^{k})}$ and particularly $\alpha_{k}=\arg\min_{\alpha}f(x^{k}-\alpha\nabla_{x}f(x^{k}))$ so that $f(x^{k+1})=\min_{\alpha} f(x^k - \alpha\nabla_x f(x^k))$.

$$ x^{k+1}=\fbox{$x^{k}$}-\alpha_{k}{\nabla}{x} f(x^k) = \fbox{$x^{k-1}-\alpha{k-1}\nabla f(x_{k-1})$}-\alpha_{k}{\nabla}{x} f(x^k)\ = x^1-\sum{n=0}^{k}\alpha_n{\nabla}_{x} f(x^n) $$

  • http://59.80.44.100/www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf
  • http://wiki.fast.ai/index.php/Gradient_Descent

There are many ways to choose some proper step or learning rate sequence ${\alpha_k}$.


The first-order Taylor approximation of $f(x + v)$ around ${x}$ is $$f(x+v)\approx \hat{f}(x+v)=f(x)+\nabla_x f(x)^T v.$$

The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of ${f}$ at ${x}$ in the direction ${v}$. It gives the approximate change in ${f}$ for a small step ${v}$. The step ${v}$ is a descent direction if the directional derivative is negative. Since the directional derivative $\nabla_x f(x)^T v$ is linear in ${v}$, it can be made as negative as we like by taking ${v}$ large (provided ${v}$ is a descent direction, i.e., $\nabla_x f(x)^T v< 0$). To make the question sensible we have to limit the size of ${v}$, or normalize by the length of ${v}$.

We define a normalized steepest descent direction (with respect to the norm $|\cdot |$ in $\mathbb{R}^n$) as

$$\Delta x_{nsd}=\arg\min_{v}{f(x)^T v\mid |v|=1}.$$

It is also convenient to consider a steepest descent step $\Delta x_{sd}$ that is unnormalized, by scaling the normalized steepest descent direction in a particular way:

$$\Delta x_{sd}={|\nabla f(x)|}{\ast}\Delta x{nsd}$$

where ${| \cdot |}_{\ast}$ denotes the dual norm.

(We say 'a' steepest descent direction because there can be multiple minimizers.)


Algorithm Steepest descent method

  • given a starting point $x \in domf$.
  • repeat
    1. Compute steepest descent direction $\Delta x_{sd}$.
    2. Line search. Choose ${t}$ via backtracking or exact line search. 3. Update. $x := x + t \Delta x_{sd}$.
  • until stopping criterion is satisfied.

If the variable ${x}$ is restricted in some bounded domain, i.e., $x\in D$, the steepest gradient descemt methods can be modified to conditional gradient descent method or Frank-Wolfe algorithm.


Algorithm Frank–Wolfe algorithm

  • given a starting point $x \in domf$.
  • repeat
    1. Find $s^k$: $s^k =\arg\min_{v}{f(x^{k})^T v\mid v\in D }$.
    2. Set $t=\frac{2}{k+2}$ or $t=\arg\min{f(x^k+t(s^k - x^k))\mid t\in [0, 1]}$
    3. Update. $x^{k+1}= x^k + t(s^k-x^k), k\leftarrow k+1$.
  • until stopping criterion is satisfied.

  • 梯度下降法和最速下降法的细微差别
  • An Introduction to Conditional Gradient
  • Frank–Wolfe algorithm
  • Greedy Algorithms, Frank-Wolfe and Friends - A modern perspective
  • https://sites.google.com/site/nips13greedyfrankwolfe/
  • Revisiting Frank-Wolfe
  • Nonsmooth Frank-Wolfe algorithm and Coresets

Some variants of gradient descent methods are not line search method. For example, the heavy ball methods or momentum methods:

$$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}) $$

where the momentum coefficient $\rho_k\in[0,1]$ generally and the step length $\alpha_k$ cannot be determined by line search. We can unfold the recursion defining momentum gradient descent and write: $$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}\underbrace{(x^k-x^{k-1})}{\text{denoted as $\Delta{k}$ }}\ \Delta_{k+1}= -\alpha_k\nabla_{x}f(x^k)+ \rho_{k}\Delta_{k}\ \ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\sum_{i=1}^{k-1}{ - \prod_{j=0}^{i}\rho_{k-j} }\alpha_{k-j-1}\nabla f(\underbrace{\color{red}{x}^{k-i}}_{\triangle})+\rho_1\Delta_1. $$

Nesterov accelerated gradient method at the $k$th step is given by:

$$ x^{k}=y^{k}-\alpha_{k+1}\nabla_{x}f(y^k) \qquad \text{Descent} \ y^{k+1}=x^{k}+\rho_{k}(x^{k}-x^{k-1}) \qquad \text{Momentum} $$

where the momentum coefficient $\rho_k\in[0,1]$ generally.

$$ x^{k}=y^{k}-\alpha_{k+1}\nabla_{x}f(y^k) \ y^{k}=x^{k-1}+\rho_{k-1}(x^{k-1}-x^{k-2})\ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(y^k)+\sum_{i=1}^{k-1}{ -\prod_{j=0}^{i}\rho_{k-j} }\alpha_{k-j-1}\nabla f(\underbrace{\color{green}{y}^{k-i}}_{\triangle})+\rho_1\Delta_1. $$

They are called as inertial gradient methods or accelerated gradient methods. And there are some different forms.

Inventor of Nesterov accelerated Gradient
  • ORF523: Nesterov's Accelerated Gradient Descent
  • Nesterov's Accelerated Gradient Descent for Smooth and Strongly Convex Optimization
  • Revisiting Nesterov's Acceleration
  • A short proof for Nesterov's momentum
  • Nemirovski's acceleration

  • https://zhuanlan.zhihu.com/p/41263068
  • https://zhuanlan.zhihu.com/p/35692553
  • https://zhuanlan.zhihu.com/p/35323828
  • On Gradient-Based Optimization: Accelerated, Asynchronous, Distributed and Stochastic
  • Nesterov Accelerated Gradient and Momentum
  • Why Momentum Really Works
  • http://www.optimization-online.org/DB_FILE/2018/11/6938.pdf
  • https://www.mat.univie.ac.at/~neum/glopt/mss/MasAi02.pdf
  • https://www.cs.cmu.edu/~ggordon/10725-F12/slides/09-acceleration.pdf
  • WHAT IS GRADIENT DESCENT IN MACHINE LEARNING?
  • https://www.fromthegenesis.com/gradient-descent-part1/
  • https://www.fromthegenesis.com/gradient-descent-part-2/
  • Deep Learning From Scratch IV: Gradient Descent and Backpropagation
  • https://ee227c.github.io/notes/ee227c-lecture08.pdf
  • Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm

Conjugate Gradient Methods

It is from the methods to solve the Symmetric positive definite linear systems $Ax=b, A=A^T\succ 0 \in\mathbb{R}^{n\times n}$. $x^{\ast}=A^{-1}b$ is the solution and $x^{\ast}$ minimizes (convex function) $f(x)=\frac{1}{2}x^T Ax - b^T x$ thus $\nabla f(x) = Ax − b$ is gradient of $f$. $r = b − Ax=-\nabla f(x) = -A(x-x^{\ast})$ is called the residual at $x$.

The solution is to make the search directions-orthogonal instead of orthogonal. Two vectors $u$ and $v$ are $A$ -orthogonal, or conjugate, i.e., $$u^TAv=0.$$

Krylov subspace $\cal K_k=\operatorname{span}{b, Ab, \cdots, A^{k-1}b}$ and we define the Krylov sequence $$ x^{k} =\arg\min_{x\in\cal K_k} f(x)=\arg\min_{x\in\cal K}{|x-x^{\ast}|}_A^2$$

the CG algorithm (among others) generates the Krylov sequence.

  • $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right)$ but $|r|$ can increase.
  • $x^{(n)}=x^{\star}\left(i.e., x^{\star} \in \mathcal{K}{n}\right)$ even when $\mathcal{K}{n} \neq \mathbf{R}^{n}$
  • $x^{(k)}=p_{k}(A) b,$ where $p_{k}$ is a polynomial with $\operatorname{deg} p_{k}<k$.

There is a two-term recurrence $$\fbox{$x^{k+1} = x^k + \alpha_k (x^k-x^{\ast}) + \beta_k (x^{k}-x^{k-1})$}$$

$$ {x :=0, r :=b, \rho_{0} :={|r|}^2 } \ {\text { for } k=1, \ldots, N_{\text {max }}} \ {\text { quit if } \sqrt{\rho_{k}} \leq \epsilon {|b|}{2} \text {or} |r| \leq \epsilon{|b|}{2}} \ {w := Ap} \ {\alpha :=\frac{\rho_{k}}{w^{T} p}} \ {x := r+\alpha p} \ {r := r-\alpha w} \ {\rho_k := {|r|}^2} $$

Preconditioned conjugate gradient

with preconditioner $M \approx A^{-1}$ (hopefully) $$ \begin{array}{l} {x :=0, r :=b-A x_{0}, p :=r z :=M r, \rho_{1} :=r^{T} z} \ {\text { for } k=1, \ldots, N_{\text {max }}} \ {\begin{array}{l}{\text { quit if } \sqrt{\rho_{k}} \leq \epsilon {|b|}{2} \text {or} |r| \leq \epsilon{|b|}{2}} \ {w :=\frac{\rho_{k}}{w^{T} p}} \ {\alpha :=\frac{\rho_{k}}{w^{T} p}} \ {x :=r+\alpha p} \ {r :=r-\alpha w} \ {z :=M r} \ \rho_{k+1} :=z^{T} r\ {p := z+\frac{\rho_{k+1}}{\rho_{k}} p}\end{array}} \end{array} $$

The gradient descent methods transforms the multiply optimization to univariate optimization.

Here is an outline of the nonlinear CG method:

  • $d^{0}=r^{0}=-f^{\prime}(x^0)$
  • Find $\alpha^{i}$ that minimizes $f(x^{i}+\alpha^{i} d^{i})$,
  • $x^{i+1}=x^{i} + \alpha_{i} d^{i}$
  • $r^{i+1}=-f^{\prime}(x^{i+1})$
  • $\beta_{i+1}=\frac{\left<r^{i+1}, r^{i+1}\right>}{\left<r^{i}, r^{i}\right>}$ or $\beta_{i+1}=\max{\frac{\left<r^{i+1}, r^{i+1}-r^{i}\right>}{\left<r^{i}, r^{i}\right>}, 0}$
  • $d^{i+1}=r^{i+1}+\beta_{i+1}d^{i}$
  • https://wiki.seg.org/wiki/The_conjugate_gradient_method
  • https://stanford.edu/class/ee364b/lectures/conj_grad_slides.pdf
  • An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
  • http://www.cs.umd.edu/users/oleary/cggg/historyindex.html
  • A BRIEF INTRODUCTION TO THE CONJUGATE GRADIENT METHOD

Variable Metric Methods

Newton's Methods

NEWTON'S METHOD and QUASI-NEWTON METHODS are classified to variable metric methods.

It is also to find the solution of unconstrained optimization problems, i.e. $$\min f(x)$$ where $x\in \mathbb{R}^{n}$. Specially, the objective function $f(x)$ is convex so that the local minima is global minima.


If ${x^{\star}}$ is the extrema of the cost function $f(x)$, it is necessary that $\nabla f(x^{\star}) = 0$. So if we can find all the solution of the equation system $\nabla f(x) = 0$, it helps us to find the solution to the optimization problem $\arg\min_{x\in\mathbb{R}^n} f(x)$.

Newton's method is one of the fixed-point methods to solve nonlinear equation system.

It is given by $$ x^{k+1}=\arg\min_{x}f(x^k)+(x-x^k)^T \nabla_x f(x^k)+\frac{1}{2\alpha_{k+1}}(x-x^k)^T H(x^k)(x-x^k) \=x^{k}-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x},{f(x^{k})} $$ where $H^{-1}(x^{k})$ is inverse of the Hessian matrix of the function $f(x)$ at the point $x^{k}$. It is called Newton–Raphson algorithm in statistics. Especially when the log-likelihood function $\ell(\theta)$ is well-behaved, a natural candidate for finding the MLE is the Newton–Raphson algorithm with quadratic convergence rate.

$$ x^{k+1}=x^{k}-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x},{f(x^{k})}\ = x^{k-1}-\alpha_{k}H^{-1}(x^{k-1})\nabla_{x},{f(x^{k-1})}-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x},{f(x^{k})} \ =x^{1}-\sum_{n=1}^{k} \alpha_{n+1}H^{-1}(x^{n})\nabla_{x},{f(x^{n})} $$

  • https://brilliant.org/wiki/newton-raphson-method/
  • https://www.shodor.org/unchem/math/newton/
  • http://fourier.eng.hmc.edu/e176/lectures/NM/node21.html
  • https://www.cup.uni-muenchen.de/ch/compchem/geom/nr.html
  • http://web.stanford.edu/class/cme304/docs/newton-type-methods.pdfs

The Fisher Scoring Algorithm

In maximum likelihood estimation, the objective function is the log-likelihood function, i.e. $$ \ell(\theta)=\sum_{i=1}^{n}\log{P(x_i|\theta)} $$ where $P(x_i|\theta)$ is the probability of realization $X_i=x_i$ with the unknown parameter $\theta$. However, when the sample random variable ${X_i}_{i=1}^{n}$ are not observed or realized, it is best to replace negative Hessian matrix (i.e. -$\frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta^{T}}$) of the likelihood function with the observed information matrix: $$ J(\theta)=\mathbb{E}({\color{red}{-1}}\frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta^{T}}) =\text{$\color{red}{-}$}\int\frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta^{T}}f(x_1, \cdots, x_n|\theta)\mathrm{d}x_1\cdots\mathrm{d}x_n $$ where $f(x_1, \cdots, x_n|\theta)$ is the joint probability density function of $X_1, \cdots, X_n$ with unknown parameter $\theta$.

And the Fisher scoring algorithm is given by $$ \theta^{k+1}=\theta^{k}+\alpha_{k}J^{-1}(\theta^{k})\nabla_{\theta} \ell(\theta^{k}) $$ where $J^{-1}(\theta^{k})$ is the inverse of observed information matrix at the point $\theta^{k}$.

See http://www.stats.ox.ac.uk/~steffen/teaching/bs2HT9/scoring.pdf or https://wiseodd.github.io/techblog/2018/03/11/fisher-information/.

Fisher scoring algorithm is regarded as an example of Natural Gradient Descent in information geometry as shown in https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/ and https://www.zhihu.com/question/266846405.

Quasi-Newton Methods

Quasi-Newton methods, like steepest descent, require only the gradient of the objective function to be supplied at each iterate. By measuring the changes in gradients, they construct a model of the objective function that is good enough to produce super-linear convergence. The improvement over steepest descent is dramatic, especially on difficult problems. Moreover, since second derivatives are not required, quasi-Newton methods are sometimes more efficient than Newton's method.[^11]

In optimization, quasi-Newton methods (a special case of variable-metric methods) are algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on Newton's method to find the stationary point of a function, where the gradient is 0. In quasi-Newton methods the Hessian matrix does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian. One of the chief advantages of quasi-Newton methods over Newton's method is that the Hessian matrix (or, in the case of quasi-Newton methods, its approximation) $\mathbf{B}$ does not need to be inverted. The Hessian approximation $\mathbf{B}$ is chosen to satisfy $$ \nabla f(x^{k+1})=\nabla f(x^{k})+B(x^{k+1}-x^{k}), $$ which is called the $\fbox{secant equation}$ (the Taylor series of the gradient itself). In more than one dimension B is underdetermined. In one dimension, solving for B and applying the Newton's step with the updated value is equivalent to the secant method. The various quasi-Newton methods differ in their choice of the solution to the secant equation (in one dimension, all the variants are equivalent).

The unknown $x_{k}$ is updated applying the Newton's step calculated using the current approximate Hessian matrix $B_{k}$:

  1. ${\displaystyle \Delta x_{k}=-\alpha_{k} B_{k}^{-1}\nabla f(x_{k})}$, with $\alpha$ chosen to satisfy the Wolfe conditions;
  2. $x_{k+1}=x_{k}+\Delta x_{k}$;
  3. The gradient computed at the new point $\nabla f(x_{k+1})$, and $y_{k}=\nabla f(x_{k+1})-\nabla f(x_{k})$ is used to update the approximate Hessian $B_{k+1}$, or directly its inverse $H_{k+1} = B_{k+1}^{-1}$ using the Sherman–Morrison formula.

A key property of the BFGS and DFP updates is that if $B_{k}$ is positive-definite, and ${\alpha}{k}$ is chosen to satisfy the Wolfe conditions, then $B{k+1}$ is also positive-definite.

For example,

Method $\displaystyle B_{k+1}=$ $H_{k+1}=B_{k+1}^{-1}=$
DFP $(I-\frac{y_k \Delta x_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}) B_k (I-\frac{ \Delta x_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}) + \frac{y_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}}} \Delta x_k$ $H_k +\frac{\Delta x_k \Delta x_k^T}{\Delta x_k^T y_k} -\frac{H_k y_ky_k^T H_k}{y_K^T H_K y_k}$
BFGS $B_k + \frac{y_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}}\Delta x_k} - \frac{B_k\Delta x_k(B_k\Delta x_k)^T}{\Delta x_k B_k \Delta x_k}$ $(I-\frac{ \Delta x_k^{\mathrm{T}} y_k}{ y_k^{\mathrm{T}} \Delta x_k}) H_k (I-\frac{y_k \Delta x_k^{\mathrm{T}}}{ y_k^{\mathrm{T}} \Delta x_k}) + \frac{\Delta x_k \Delta x_k^T}{y_k^T \Delta x_k}$
SR1 $B_{k} + \frac{(y_{k} - B_{k},\Delta x_{k} )(y_{k} - B_{k},\Delta x_{k})^{\mathrm{T}} }{(y_{k} - B_{k},\Delta x_{k})^{\mathrm{T} },\Delta x_{k}}$ $H_{k} + \frac{(\Delta x_{k}-H_{k}y_{k}) (\Delta x_{k} -H_{k} y_{k})^{\mathrm{T}} }{(\Delta x_{k}-H_{k}y_{k})^{\mathrm {T} }y_{k}}$

  • Quasi-Newton methods in Wikipedia page
  • http://59.80.44.98/www.seas.ucla.edu/~vandenbe/236C/lectures/qnewton.pdf
  • http://fa.bianp.net/teaching/2018/eecs227at/quasi_newton.html

The Barzilai-Borwein method

Consider the gradient iteration form $$ x^{k+1}=x^{k}-\alpha_k \nabla f(x^k) $$

which can be written as $$ x^{k+1}=x^{k}- D_k \nabla f(x^k) $$ where $D_k = \alpha_k I$. In order to make the matrix $D_k$ have quasi-Newton property, we compute $\alpha_k$ such that $$ \min |s_{k-1}-D_k y_{k-1}| $$ which yields

$$ \alpha_k =\frac{\left<s_{k-1},y_{k-1}\right>}{\left<y_{k-1},y_{k-1}\right>}\tag 1 $$

where $s_{k-1}= x_k-x_{k-1}, y_{k-1}=\nabla f(x^k)-\nabla f(x^{k-1})$.

By symmetry, we may minimize $|D_k^{-1}s_{k-1}- y_{k-1}|$ with respect to $\alpha_k$ and get

$$ \alpha_k = \frac{\left<s_{k-1}, s_{k-1}\right>}{\left<s_{k-1}, y_{k-1}\right>}.\tag 2 $$

In short, the iteration formula of Barzilai-Borwein method is given by

$$ x^{k+1}=x^{k}-\alpha_k \nabla f(x^k) $$ where $\alpha_k$ is determined by (1) or (2).

It is easy to see that in this method no matrix computations and no line searches (except $k = 0$) are required.

  • https://mp.weixin.qq.com/s/G9HH29b2-VBnk_Sqze8pDg
  • http://www.math.ucla.edu/~wotaoyin/math273a/slides/
  • http://bicmr.pku.edu.cn/~wenzw/courses/WenyuSun_YaxiangYuan_BB.pdf
  • https://www.math.lsu.edu/~hozhang/papers/cbb.pdf

L-BFGS

The BFGS quasi-newton approximation has the benefit of not requiring us to be able to analytically compute the Hessian of a function. However, we still must maintain a history of the $s_n$ and $y_n$ vectors for each iteration.

  • https://www.wikiwand.com/en/Limited-memory_BFGS
  • On the limited memory BFGS method for large scale optimization
  • Numerical Optimization: Understanding L-BFGS
  • Unconstrained optimization: L-BFGS and CG

L-BFGS algorithm builds and refines quadratic model of a function being optimized. Algorithm stores last M value/gradient pairs and uses them to build positive definite Hessian approximation. This approximate Hessian matrix is used to make quasi-Newton step. If quasi-Newton step does not lead to sufficient decrease of the value/gradient, we make line search along direction of this step.

Essential feature of the algorithm is positive definiteness of the approximate Hessian. Independently of function curvature (positive or negative) we will always get SPD matrix and quasi-Newton direction will always be descent direction.

Another essential property is that only last $M$ function/gradient pairs are used, where M is moderate number smaller than problem size N, often as small as 3-10. It gives us very cheap iterations, which cost just O(N·M) operations.

  • Optimizing Neural Networks with Kronecker-factored Approximate Curvature

Gauss–Newton Method

A form of regression where the objective function is the sum of squares of nonlinear functions: $$f(x)=\frac{1}{2}\sum_{j=1}^{m}(r_j(x))^2=\frac{1}{2}\sum_{j=1}^{m}{|r(x)|}^2$$ where The j-th component of the m-vector $r(x)$ is the residual $r_j(x)=(\phi(x_j;t_j)-y_j)^2$.

The Gauss-Newton Method generalizes Newton's method for multiple dimensions with approximate Hessian matrix $\nabla^2 f_k \approx J_k^T J_k$.

The basic steps that the software will perform (note that the following steps are for a single iteration):

  • Make an initial guess $x^0$ for $x$,
  • Make a guess for $k = 1$,
  • Create a vector $f^k$ with elements $f_i(x^k)$,
  • Create a Jacobian matrix for $J_k$
  • Solve ($J^T_k J_k p_k = -J^T_k f_k$).
  • This gives you the probabilities $p$ for all $k$.
  • Find $s$. $F(x^k + s p_k)$ should satisfy the Wolfe conditions (these prove that step-lengths exist).
  • Set $x^{k+1} = x^k + sp^k$.
  • Repeat Steps 1 to 7 until convergence.
  • https://www8.cs.umu.se/kurser/5DA001/
  • http://iacs-courses.seas.harvard.edu/courses/am205/fall14/slides/am205_lec06.pdf
  • Applications of the Gauss-Newton Method
  • Gauss-Newton / Levenberg-Marquardt Optimization
  • Nonlinear Least-Squares Problems with the Gauss-Newton and Levenberg-Marquardt Methods
  • Gauss-Newton algorithm for nonlinear models

  • Wikipedia page on Newton Method
  • Newton-Raphson Visualization (1D)
  • Newton-Raphson Visualization (2D)
  • Using Gradient Descent for Optimization and Learning
  • ftp://lsec.cc.ac.cn/pub/home/yyx/papers/siamjna1987.pdf

Natural Gradient Descent

The generalized natural gradient is the direction of steepest ascent in Riemannian space, which is invariant to parametrization, and is defined: $$\nabla S\propto\lim_{\epsilon\to 0} \arg\max_{d:B(P_{\theta}, P_{\theta+d})=\epsilon} S(\theta+d, y)$$

where $B(P_{\theta}, P_{\theta+d})$ is the Bregman divergence of the probabilities $P_{\theta}, P_{\theta + d}$ and $S$ is a loss function on the probability $P_{\theta + d}$ and $y$ such as log-likelihood function.

Natural gradient descent is to solve the optimization problem $\min_{\theta} L(\theta)$ by $$ \theta^{(t+1)}=\theta^{(t+1)}-\alpha_{(t)}F^{-1}(\theta^{(t)})\nabla_{\theta}L(\theta^{(t)}) $$ where $F^{-1}(\theta^{(t)})$ is the inverse of Remiann metric at the point $\theta^{(t)}$. And Fisher scoring algorithm is a typical application of Natural Gradient Descent to statistics.
Natural gradient descent for manifolds corresponding to exponential families can be implemented as a first-order method through mirror descent.

Originator of Information Geometry: Shunichi Amari
  • Natural gradient descent and mirror descent
  • Online Natural Gradient as a Kalman Filter
  • New insights and perspectives on the natural gradient method
  • https://www.zhihu.com/question/266846405
  • 2016 PKU Mini-Course: Information Geometry
  • Information Geometry and Natural Gradients
  • Tutorial on Information Geometry and Algebraic Statistics
  • True Asymptotic Natural Gradient Optimization
  • Shun-ichi Amari's CV in RIKEN
  • Energetic Natural Gradient Descent
  • Natural gradients and K-FAC
  • http://www.divergence-methods.org/
  • http://www.deeplearningpatterns.com/doku.php?id=natural_gradient_descent
  • NATURAL GRADIENTS AND STOCHASTIC VARIATIONAL INFERENCE
  • 谈谈优化算法 - 郑思座的文章 - 知乎
  • Accelerating Natural Gradient with Higher-Order Invariance

Higher-Order Derivatives Methods

Higher-order methods, such as Newton, quasi-Newton and adaptive gradient descent methods, are extensively used in many scientific and engineering domains. At least in theory, these methods possess several nice features: they exploit local curvature information to mitigate the effects of ill-conditioning, they avoid or diminish the need for hyper-parameter tuning, and they have enough concurrency to take advantage of distributed computing environments. Researchers have even developed stochastic versions of higher-order methods, that feature speed and scalability by incorporating curvature information in an economical and judicious manner. However, often higher-order methods are "undervalued."

The key observation, which underlies all results of this paper, is that an appropriately regularized Taylor approximation of convex function is a convex multivariate polynomial. This is indeed a very natural property since this regularized approximation usually belongs to the epigraph of convex function. Thus, the auxiliary optimization problem in the high-order (or tensor) methods becomes generally solvable by many powerful methods of Convex Optimization.

The most well-known third order optimization method may be Halley's method(Tangent Hyperbolas Method).

  • Higher-Order Derivatives in Computational Systems Engineering Problem Solving
  • SECOND-ORDER OPTIMIZATION FOR NEURAL NETWORKS
  • On the use of higher order derivatives in optimization using Lagrange's expansion
  • Sparsity in Higher Order Methods in Optimization*
  • HoORaYs: High-order Optimization of Rating Distance for Recommender Systems
  • https://akyrillidis.github.io/2019/08/05/WorkshopNeurIPS.html
  • ftp://file.viasm.org/Web/TienAnPham-16/Preprint_1669.pdf
  • Implementable Tensor Methods in Unconstrained Convex Optimization
  • Reachability of optimal convergence rate estimates for high-order numerical convex optimization methods
  • The global rate of convergence for optimal tensor methods in smooth convex optimization
  • Optimization for Tensor Models
  • Tensor Methods for Large, Sparse Unconstrained Optimization
  • Beyond First Order Methods in ML
  • UNIFIED ACCELERATION OF HIGH-ORDER ALGORITHMS UNDER HOLDER CONTINUITY AND UNIFORM CONVEXITY
  • Lower Bounds for Higher-Order Convex Optimization
  • https://file.scirp.org/Html/9-7403428_73034.htm
  • On the Geometry of Halley's Method
  • https://archive.lib.msu.edu/crcmath/math/math/h/h030.htm
  • A METHOD FOR OBTAINING THIRD-ORDER ITERATIVE FORMULAS
  • Some New Variants of Chebyshev-Halley Methods Free from Second Derivative

Trust Region Methods

Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region. If a step is not acceptable, they reduce the size of the region and find a new minimizer. In general, the direction of the step changes whenever the size of the trust region is altered.

$$ \min_{p\in\mathbb{R}^n} m_k(p)=f_k+ g_k^T p+\frac{1}{2}p^T B_k p, , , s.t. |p|\leq {\Delta}_k, $$ where $f_k=f(x^k), g_k = \nabla f(x^k)$;$B_k$ is some symmetric matrix as an approximation to the Hessian $\nabla^2 f(x^k + tp)$; $\Delta_k>0$ is the trust-region radius. In most of our discussions, we define $|\cdot|$ to be the Euclidean norm. Thus, the trust-region approach requires us to solve a sequence of subproblems in which the objective function and constraint are both quadratic. Generally, we set $B_k = \nabla^2 f(x^k)$.

  • https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods
  • Concise Complexity Analyses for Trust-Region Methods
  • https://www.nmr-relax.com/manual/Trust_region_methods.html
  • http://lsec.cc.ac.cn/~yyx/worklist.html

Expectation Maximization Algorithm

Expectation-Maximization algorithm, popularly known as the EM algorithm has become a standard piece in the statistician's repertoire. It is used in incomplete-data problems or latent-variable problems such as Gaussian mixture model in maximum likelihood estimation. The basic principle behind the EM is that instead of performing a complicated optimization, one augments the observed data with latent data to perform a series of simple optimizations.

It is really popular for Bayesian statistician.

Let $\ell(\theta|Y_{obs})\stackrel{\triangle}=\log{L(\theta | Y_{obs})}$ denote the log-likelihood function of observed datum $Y_{obs}$. We augment the observed data $Y_{obs}$ with latent variables $Z$ so that both the complete-data log-likelihood $\ell(\theta | Y_{obs}, Z)$ and the conditional predictive distribution $f(z|Y_{obs}, \theta)$ are available.

Each iteration of the EM algorithm consists of an expectation step (E-step) and a maximization step (M-step) Specifically, let $\theta^{(t)}$ be the current best guess at the MLE $\hat\theta$. The E-step is to compute the Q function defined by $$ \begin{align} Q(\theta|\theta^{(t)}) & \stackrel{\triangle}= \mathbb{E}(\ell(\theta | Y_{obs}, Z) |Y_{obs}, \theta^{(t)}) \ &= \int_{Z}\ell(\theta | Y_{obs}, Z)\times f(z | Y_{obs}, \theta^{(t)})\mathrm{d}z, \end{align} $$

and the M-step is to maximize Q with respect to $\theta$ to obtain

$$ \theta^{(t+1)}=\arg\max_{\theta} Q(\theta|\theta^{(t)}). $$

  • BIOS731: Advanced Statistical Computing
  • http://cs229.stanford.edu/notes/cs229-notes8.pdf
  • EM算法存在的意义是什么? - 史博的回答 - 知乎
Diagram of EM algorithm

Generalized EM Algorithm

Each iteration of the generalized EM algorithm consists of an expectation step (E-step) and a maximization step (M-step) Specifically, let $\theta^{(t)}$ be the current best guess at the MLE $\hat\theta$. The E-step is to compute the Q function defined by $$ Q(\theta|\theta^{(t)}) = \mathbb{E}[ \ell(\theta|Y_{obs}, Z)|Y_{obs},\theta^{(t)} ] \ = \int_{Z}\ell(\theta|Y_{obs}, Z)\times f(z|Y_{obs}, \theta^{(t)})\mathrm{d}z, $$ and the another step is to find $\theta$ that satisfies $Q(\theta^{t+1}|\theta^{t})>Q(\theta^{t}|\theta^{t})$, i.e. $$ \theta^{(t+1)}\in {\hat{\theta}|Q(\hat{\theta}|\theta^{(t)} \geq Q(\theta|\theta^{(t)}) }. $$

It is not to maximize the conditional expectation.

See more on the book The EM Algorithm and Extensions, 2nd Edition by Geoffrey McLachlan , Thriyambakam Krishna.

  • The MM Algorithm by Kenneth Lange
  • MM Optimization Algorithms
  • Examples of MM Algorithms
  • Majorization-Minimization Algorithms in Signal Processing, Communications, and Machine Learning
  • AN ASSEMBLY AND DECOMPOSITION APPROACH FOR CONSTRUCTING SEPARABLE MINORIZING FUNCTIONS IN A CLASS OF MM ALGORITHMS
  • Generalized Majorization-Minimization

Projected Gradient Method and More

We will focus on projected gradient descent and its some non-Euclidean generalization in order to solve some simply constrained optimization problems.

If not specified, all these methods are aimed to solve convex optimization problem with explicit constraints, i.e. $$ \arg\min_{x\in\mathbb{S}}f(x) $$

where $f$ is convex and differentiable, $\mathbb{S}\subset\mathbb{R}^n$ is convex. The optimal condition for this constrained optimization problem is that any feasible direction is not descent direction: if $x^{\star}\in\mathbb{S}$ is the solution to the problem, we can assert that variational inequality holds: $$\forall x \in \mathbb{S}, \left<\nabla f(x^{\star}),x-x^{\star} \right> \geq 0.$$

And it is the optimal condition of constrained optimization problem.

We say the gradient $\nabla f$ of the convex function $f$ is a monotone operator because $$\left<x-y, \nabla f(x)-\nabla f(y)\right>\geq 0. $$

Projected Gradient Descent

Projected gradient descent has two steps: $$ z^{k+1} = x^{k}-\alpha_k\nabla_x f(x^{k}) \qquad \text{Descent}\ x^{k+1} = Proj_{\mathbb{S}}(z^{k+1})=\arg\min_{x\in \mathbb{S}}|x-z^{k+1}|^{2} \qquad\text{Projection} $$

or in the compact form $$ x^{k+1} = \arg\min_{x\in \mathbb{S}}|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))|^{2} \= \arg\min_{x}{|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))|^{2} +\delta_{\mathbb{S}}(x) } $$

where $\delta_{\mathbb{S}}$ is the indictor function of the set $\mathbb{S}$ $$ h(x) =\delta_{\mathbb{S}}(x)= \begin{cases} 0, & \text{if} \quad x \in \mathbb{S};\ \infty, & \text{otherwise}. \end{cases} $$

Each projection is an optimization so that the iterative points satisfy the optimal conditions, which also restricts the projection method into the case where the projection is available or simple to compute. And it is natural to search better Descent step or Projection step.

The following links are recommended if you are interested in more theoretical proof:

  • http://maths.nju.edu.cn/~hebma/slides/03C.pdf
  • http://maths.nju.edu.cn/~hebma/slides/00.pdf

For the non-differentiable but convex function such as the absolute value function $f(x)=|x|$, we therefore consider sub-gradients in place of the gradient, where sub-gradient $\phi$ at the point $\hat{x}$ is defined as the elements in the domain of convex function ${f}$ (denoted as $D_f$), satisfying

$$ \left<\phi, x-\hat{x}\right>\leq f(x) -f(\hat{x}),\forall x \in D_f. $$

Mirror descent

Mirror descent can be regarded as the non-Euclidean generalization via replacing the $\ell_2$ norm or Euclidean distance in projected gradient descent by Bregman divergence.

Bregman divergence is induced by convex smooth function ${h}$:

$$ B_h(x,y) = h(x) - h(y)-\left<\nabla h(y),x-y\right> $$

where $\left<\cdot,\cdot\right>$ is inner product and it also denoted as $D_h$.

The function ${h}$ is usually required to be strongly convex. And if the convex function ${h}$ is not differentiable, one element of the sub-gradient ${\partial h(y)}$ may replace the gradient $\nabla h(y)$.

It is convex in $x$ and $\frac{\partial B_h(x, y)}{\partial x} = \nabla h(x) - \nabla h(y)$.

Especially, when ${h}$ is quadratic function, the Bregman divergence induced by $h$ is $$ B_h(x, y)=x^2-y^2-\left<2y,x-y\right>=x^2+y^2-2xy=(x-y)^2 $$ i.e. the square of Euclidean distance.

A wonderful introduction to Bregman divergence is Meet the Bregman Divergences by Mark Reid at http://mark.reid.name/blog/meet-the-bregman-divergences.html.

The Bregman projection onto a convex set $C\subset \mathbb{R}^n$ given by $$ y^{\prime}= \arg\min_{x\in C} B(x,y) $$ is unique.

A generalized Pythagorean theorem holds: for convex $C\subset \mathbb{R}^n$ and for all $x\in C$ and $y\in \mathbb{R}^n$ we have $$B(x,y)\geq B(x,y^{\prime}) + B(y^{\prime},y)$$ where $y^{\prime}$ is the Bregman projection of ${y}$, and equality holds when the convex set C defining the projection $y^{\prime}$ is affine.


It is given in the projection form:

$$ z^{k+1} = x^{k}-\alpha_k\nabla_x f(x^{k}) \qquad \text{Gradient descent};\ x^{k+1} = \arg\min_{x\in\mathbb{S}}B_h(x,z^{k+1}) \qquad\text{Bregman projection}. $$

In another compact form, mirror gradient can be described in the proximal form: $$ x^{k+1} = \arg\min_{x\in\mathbb{S}} { f(x^k) + \left<g^k, x-x^k\right> + \frac{1}{\alpha_k} B_h(x,x^k)}\tag {1} $$ with $g^k=\nabla f(x^k)$.

Note that the next iteration point $x^{k+1}$ only depends the current iteration point $x^{k}$ no more previous iteration points.

By the optimal conditions of equation (1), the original "mirror" form of mirror gradient method is described as $$ \nabla h(x^{k+1}) = \nabla h(x^k) - \alpha_k \nabla f(x^k), \ = \nabla h(x^1) - \sum_{n=1}^{k}\alpha_n \nabla f(x^n) , x\in \mathbb{S}, $$ where the convex function ${h}$ induces the Bregman divergence.

One special method is called entropic mirror descent(Exponential Gradient Descent) when the Bregman divergence induced by $e^x$ and the constraint set $\mathbb{S}\subset\mathbb{R}^n$ is simplex, i.e. $\sum_{i=1}^{n}x_i =1, \forall x_i \geq 0$.

Entropic descent method at step ${k}$ is given as follows:

$$ {x_{i}^{k+1} = \frac{x_i^{k}\exp(-\alpha \nabla {f(x^k)}{i})}{\sum{j=1}^{n} x_j^{k}\exp(-\alpha \nabla {f(x^k)}{j})}}, i=1,2,\dots, n. $$ it is obvious that entropic decscent methods are in the coordinate-wise update formula. Whast is more , it can be rewritten as $$x^{k+1}=\frac{x^{1}\exp(\sum{n=1}^{k}-\alpha \nabla f(x^n))}{\prod_{n=1}^{k}\left<x^n, \exp(-\alpha \nabla f(x^n))\right>}\propto x^{1}\exp(\sum_{n=1}^{k}-\alpha \nabla f(x^n)).$$

Multiplicative Weights Update is closely related with entropic descent method. See more on the following link list.

  • The Divergence Methods Web Site (under construction)
  • Bregman Divergence and Mirror Descent, Xinhua Zhang(张歆华)
  • CS 294 / Stat 260, Fall 2014: Learning in Sequential Decision Problems
  • ELE522: Large-Scale Optimization for Data Science , Yuxin Chen, Princeton University, Fall 2019
  • Mirror Descent and the Multiplicative Weight Update Method, CS 435, 201, Nisheeth Vishnoi
  • Mirror descent: 统一框架下的first order methods
  • ORF523: Mirror Descent, part I/II
  • ORF523: Mirror Descent, part II/II
  • Thibaut Lienart: MIrror Descent
  • Sinkhorn Algorithm as a Special Case of Stochastic Mirror Descent
  • https://web.stanford.edu/class/cs229t/2017/Lectures/mirror-descent.pdf
  • https://www.cs.ubc.ca/labs/lci/mlrg/slides/mirrorMultiLevel.pdf
  • https://konstmish.github.io/

Proximal Gradient Method

Recall the projected gradient method,
it converts the constrained problem $\arg\min_{x}{f(x)\mid x\in \mathbb{S}}$ into unconstrained (also uncontinuous) problem $\arg\min_{x}{f(x)+\delta_{\mathbb{S}}(x)}$ literally and it follows the following iteration: $$ x^{k+1} = \arg\min_{x}{|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))|^{2} +\delta_{\mathbb{S}}(x) } $$ where the function $\delta_{\mathbb{S}}(x)$ is obviously non-differentiable, defined as follows $$ h(x) =\delta_{\mathbb{S}}(x)= \begin{cases} 0, & \text{if} \quad x \in \mathbb{S};\ \infty, & \text{otherwise}. \end{cases} $$

If the function $\delta_{\mathbb{S}}(x)$ is replaced by general non-differential while convex function $\mathbf{h}(x)$, the proximal mapping (or prox-operator) of a convex function $\mathbf{h}$ is defined as $$ prox_h(x)=\arg\min_{u}{\mathbf{h}(u)+\frac{1}{2} {|x-u|}_2^2} $$

Unconstrained problem with cost function split in two components: $$minimize \qquad f(x) = g(x)+\mathbf{h}(x)$$

  • ${g}$ is convex, differentiable;
  • $\mathbf{h}$ is closed, convex, possibly non-differentiable while $prox_h(x)$ is inexpensive.

Proximal gradient algorithm $$x^{k}=prox_{t_k h} {x^{k-1}-t_k\nabla g(x^{k-1})}$$

$t_k > 0$ is step size, constant or determined by line search.

$$x^+ = prox_{th}(x-t\nabla g(x))$$ from the definition of proximal mapping: $$ x^+ = \arg\min_{u}( h(u)+\frac{1}{2}{|u-(x-t\nabla g(x))|}2^2 ) \= \arg\min{u}( h(u) + g(x) + \nabla g(x)^T(u-x) +\frac{1}{2t} {|u-x|}2^2 ) \= \arg\min{u}( h(u) + \nabla g(x)^T(u-x) +\frac{1}{2t} {|u-x|}_2^2 ) $$

$x^{+}$ minimizes $h(u)$ plus a simple quadratic local model of $g(u)$ around ${x}$.

And projected gradient method is a special case of proximal gradient method.

And it is natural to consider a more general algorithm by replacing the squared Euclidean distance in definition of proximal mapping with a Bregman distance: $$ \arg\min_{u}{h(u)+\frac{1}{2} {|x-u|}2^2}\to \arg\min{u}{h(u)+ B(x,u)}. $$ so that the primary proximal gradient methods are modified to the Bregman version, which is called as Bregman proximal gradient method.

  • A collection of proximity operators implemented in Matlab and Python.
  • L. Vandenberghe ECE236C (Spring 2019): 4. Proximal gradient method
  • https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture18.pdf
  • Accelerated Bregman Proximal Gradient Methods for Relatively Smooth Convex Optimization
  • A unified framework for Bregman proximal methods: subgradient, gradient, and accelerated gradient schemes
  • Proximal Algorithms, N. Parikh and S. Boyd
  • For more (non exhaustive) informations regarding the proximity operator and the associated proximal algorithms

Proximal and Projected Newton Methods

A fundamental difference between gradient descent and Newton's method was that the latter also iteratively minimized quadratic approximations, but these used the local Hessian of the function in question.

There are many such methods (Ellipsoid method, AdaGrad, ... ) with projection carried out in the $H_k$ metric to solve $\min_{x\in X}f(x)$:

  1. get subgradient $g^{k}\in\partial f(x^K)$
  2. update (diagonal) metric $H_k$
  3. update $x^{k+1}=\operatorname{Proj}_{X}^{H_K}{|x^k - H_k^{-1}g^k|}_2^2$

where $\operatorname{Proj}{X}^{H_K}(y)=\arg\min{x\in X}{|x-y|}{H_k}^2$ and ${|x-y|}{H_K}^2=(x-y)^TH_k(x-y)$.

  • PROJECTED NEWTON METHODS FOR OPTIMIZATION PROBLEMS WITH SIMPLE CONSTRAINTS*
  • Projected Newton Methods
  • Mirror Descent and Variable Metric Methods

Proximal Newton method can be also applied to such optimization problem: $$ y^{k} = prox_{H_{k-1}}(x^{k-1}-H_{k-1}^{-1}\nabla g(x^{k-1}))\ x^{k} = x^{k-1}+t_{k}(y^{k}-x^{k-1}). $$

  • Proximal and Projected Newton Methods, Ryan Tibshirani, Convex Optimization 10-725/36-725

Douglas–Rachford method

$$\min_{x}{f(x)+g(x)}$$ where $f$ and $g$ are closed convex functions. Douglas–Rachford iteration: start at any $y^0$ and repeat for $k = 0, 1, \cdots,$ $$ x^{k+1} = prox_f(y^k)\ y^{k+1} = y^k + prox_g(2x^{k+1} − y^k) − x^{k+1} $$

  • useful when $f$ and $g$ have inexpensive prox-operators;
  • $x^k$ converges to a solution of $0 \in \partial f (x) + \partial g(x)$ (if a solution exists);
  • not symmetric in $f$ and $g$.

The iteration can be written as fixed-point iteration $$y^{k+1} = F(y^k)$$ where $F(y)=y + prox_g(2prox_f(y) − y) − prox_f(y)$.

  • http://www.seas.ucla.edu/~vandenbe/236C/lectures/dr.pdf

Proximal point algorithms and its Beyond

Now, let us consider the simple convex optimization $$\min_{x}{\theta(x)+f(x)\mid x \in\mathcal X},\tag{2.1}$$ where $\theta(x)$ and $f(x)$ are convex but $\theta(x)$ is not necessary smooth, $\mathcal X$ is a closed convex set. For solving (2.1), the k-th iteration of the proximal point algorithm (abbreviated to PPA) begins with a given $x^k$, offers the new iterate $x^{k+1}$ via the recursion $$x^{k+1}=\arg\min_{x}{\theta(x)+f(x)+\frac{r}{2}{|x-x^k|}_2^2\mid x \in\mathcal X}.\tag{2.2}$$

Since $x^{k+1}$ is the optimal solution of (2.2), it follows from optimal condition that $$\theta(x)-\theta(x^{k+1})+(x-x^{k+1})^{T}{ \nabla f(x^{k+1}) + r( f(x^{k+1}) - f(x^k)) } \quad\forall x\in\mathcal X.$$

It is proved that The sequence ${x^k}$ generated by PPA is Fejer **monotone**, i.e., $|x^{k+1}-x^{\ast}|^2\leq |x^{k}-x^{\ast}|^2-|x^{k+1}-x^{k}|^2$.

  • http://maths.nju.edu.cn/~hebma/Talk/VI-PPA.pdf
  • https://lostella.github.io/software/
  • The proximal point method revisited

$\color{quad}{Note}$: the materials in this section is taken from the lectures of Bingshneg He.

Prediction-Correction Methods

The optimal condition of the linearly constrained convex optimization is characterized as a mixed monotone variational inequality:

$$w^{\ast}\in\Omega, \theta(u)-\theta(u^{\ast})+(w - w^{\ast})^{T} \nabla F(w^{\ast}) \quad\forall w\in\mathcal \Omega.$$


[Prediction Step.] : With given $v^k$, find a vector $\tilde{w}^k\in \Omega$ such that $$\theta(u)-\theta(\tilde u^{k})+(w - \tilde w^{k})^{T} \nabla F(w^{\ast})\geq (v-\tilde v^{k})^T Q (v-\tilde v^{k}),\quad w\in\Omega$$ where the matrix $Q$ is not necessary symmetric, but $Q^T+Q$ is positive definite.

[Correction Step.] : The new iterate $v^{k+1}$ by $$v^{k+1} =v^{k}- \alpha M(v^{k}-\tilde v^{k+1}).$$


Convergence Conditions: For the matrices $Q$ and $M$ above, there is a positive definite matrix $H$ such that $$HM = Q. \tag{2.20a}$$ Moreover, the matrix $$G = Q^T + Q − \alpha M^T H M \tag{2.20b}$$ is positive semi-definite.

  • https://www.seas.upenn.edu/~spater/assets/papers/conference/c_2019_paternain_et_al.pdf
  • http://export.arxiv.org/pdf/1709.05850
  • https://www.hindawi.com/journals/aaa/2013/845459/
  • https://arxiv.org/pdf/1709.05850.pdf

Customized PPA

Now, let us consider the simple convex optimization $$\min_{u}{\theta(u)\mid Au=b, u \in\mathcal U}.\tag{2.3}$$

The related variational inequality of the saddle point of the Lagrangian function is $$w^{\ast}\in \Omega, \theta(u) - \theta(u^{\ast})+(w-w^{\ast})^T F(w^{\ast})\geq 0 \quad\forall w\in\Omega$$ where $w=(u, \lambda)^T, F(w)=(-A^T\lambda, Au-b)^T$.

For given $v^k = w^k = (u^k, \lambda^k)$, the predictor is given by $$ \begin{aligned} \tilde{u}^{k} &=\arg \min \left{L\left(u, \lambda^{k}\right)+\frac{r}{2}\left|u-u^{k}\right|^{2} | u \in \mathcal{U}\right}, \ \tilde{\lambda}^{k} &=\arg \max \left{L\left(\left[2 \tilde{u}^{k}-u^{k}\right], \lambda\right)-\frac{s}{2}\left|\lambda-\lambda^{k}\right|^{2}\right}. \end{aligned} $$

  • https://link.springer.com/article/10.1007/s10589-013-9616-x
  • https://core.ac.uk/display/23878067
  • http://www.optimization-online.org/DB_FILE/2017/03/5922.pdf

$\color{aqua}{Note}$: the projection from a point $x^0$ into a subset $C\subset\mathbb{R}^n$ is defined in proximal operator as

$$ x^+ =\arg\min_{x}{\delta_C(x)+\frac{1}{2}{|x-x^0|}_2^2} $$

while it can also written in the following form: $$ x^{+}=\arg\min_{x}{\frac{1}{1_C(x)}\cdot {|x-x^0|}_2^2} $$

where $$ 1_C(x)= \begin{cases} 1, & \text{if} \quad x \in C,\ 0, & \text{otherwise}. \end{cases} $$ How we can generalize this form into the proximal form? And what is the difference with the original addition proximal operator?

If $x^0$ is in the set ${C}$, the projection can be rewritten in proximal operator: $$ x^{+}=\arg\min_{x}{\exp[\delta_C(x)]\cdot \frac{1}{2}{|x-x^0|}_2^2}. $$

How we can generalize the function $\delta_{C}$ to more general convex function? What are the advantages of this generalization? As likelihood and log-likelihood, we can transfer the product of some functions to sum by taking logarithm.

$$ x^{+} = \arg\min_{x}{\exp[\delta_C(x)]\cdot \frac{1}{2}{|x-x^0|}2^2} \ = \arg\min{x}\log{\exp[\delta_C(x)]\cdot \frac{1}{2}{|x-x^0|}2^2} \= \arg\min{x} {\delta_C(x)+\log({|x-x^0|}_2^2)}. $$

Penalty/Barrier Function Methods

In projected gradient method, it converts the constrained problem into an unsmooth problem.


constrained problem unconstrained problem desired properties
$\arg\min_{x}{f(x)\mid x\in \mathbb{S}}$ $\arg\min_{x}{f(x)+\delta_{\mathbb{S}}(x)}$ unsmooth, inexpensive
$\arg\min_{x}{f(x)\mid x\in \mathbb{S}}$ $\arg\min_{x}{f(x) + g_{\mathbb{S}}(x)}$ smooth, inexpensive
$\arg\min_{x}{f(x)\mid g(x)\in \mathbb{S}}$ $\arg\min_{x}{f(x)+\fbox{?}}$ smooth, inexpensive

In constrained optimization, a field of mathematics, a barrier function is a continuous function whose value on a point increases to infinity as the point approaches the boundary of the feasible region of an optimization problem.

Recall the indicator function of constrained set in projected methods: $$ \delta_{\mathbb{S}}(x)= \begin{cases} 0, & \text{if} \quad x \in \mathbb{S};\ \infty, & \text{otherwise}, \end{cases} $$ there is a dramatic leap between the boundary of $\mathbb S$ and the outer.

The basic idea of the barrier method is to approximate the indicator function by some functions ${f}$ satisfying:

  • convex and differentiable;
  • $f(x)< \infty$ if $x\in\mathbb S$;
  • for every point constraint on the boundary $f(x)\to\infty$ as $x\to\partial\mathbb S$.

For example, Barrier Method in Convex Optimization: Fall 2018 pushs the inequality constraints in the problem in a smooth way, reducing the problem to an equality-constrained problem. Consider now the following minimization problem: $$ \min_{x} f(x) \ \text{subject to} \quad h_n \leq 0, n=1,2,\cdots, N \ Ax=b $$ where $f, h_1, h_2,\cdots, h_N$ are assumed to be convex and twice differential functions in $\mathbb R^p$. The log-barrier function is defined as $$\phi(x)=-\sum_{n=1}^{N}\log(-h_n(x)).$$

When we ignore equality constraints, the problem above can be written by incorporating the inequality with the identity function, which in turn can be approximated by the log-barrier functions as follows $$\min_{x} f(x)+\sum_{n=1}^{N}\mathbb{I}{h_n(x)\leq 0}\approx \min{x} f(x)+\frac{1}{t}\phi(x)$$

for $t > 0$ being a large number.

A penalty method replaces a constrained optimization problem by a series of unconstrained problems whose solutions ideally converge to the solution of the original constrained problem. The unconstrained problems are formed by adding a term, called a penalty function, to the objective function that consists of a penalty parameter multiplied by a measure of violation of the constraints. The measure of violation is nonzero when the constraints are violated and is zero in the region where constraints are not violated.

In the context of the equality-constrained problem $$\min_{x} f(x), ,, s.t. c_n(x)=0, n = 1,2,\cdots, N,$$

the quadratic penalty function $Q(x;\mu)$ for this formulation is $$Q(x;\mu)=f(x)+\frac{\mu}{2}\sum_{n=1}^{N}c_n^2(x)$$ where $\mu>0$ is the penalty parameter. By driving $\mu\to\infty$, we penalize the constraint violations with increasing severity.

Quadratic Penalty Method

  • Given $\mu_0 > 0$, a nonnegative sequence ${\tau_k}$ with $\tau_k\to 0$, and a starting point $x^s_0$;
  • for $k = 0, 1, 2, . . .$:
    • Find an approximate minimizer $x^k=\arg\min_{x}Q(x; \mu_k)$, starting at $x^s_k$ and terminating when ${|\nabla_x Q(x;\mu_k)|}_2\leq \tau_k$;
    • if final convergence test satisfied,
      • stop with approximate solution $x^k$;
    • end if
    • Choose new penalty parameter $\mu_{k+1} > \mu_{k}$;
    • Choose new starting point $x^s_{k+1}$
  • end (for)
  • Line Search Procedures for the Logarithmic Barrier Function
  • Penalty method
  • http://s-mat-pcs.oulu.fi/~keba/Optimointi/OP_penalty_engl.pdf
  • https://www.me.utexas.edu/~jensen/ORMM/supplements/units/nlp_methods/const_opt.pdf
  • http://users.jyu.fi/~jhaka/opt/TIES483_constrained_indirect.pdf

Path Following Methods

The main ideas of path following by predictor–corrector and piecewise-linear methods, and their application in the direction of homotopy methods and nonlinear eigenvalue problems are reviewed. Further new applications to areas such as polynomial systems of equations, linear eigenvalue problems, interior methods for linear programming, parametric programming and complex bifurcation are surveyed. Complexity issues and available software are also discussed.

  • Continuation and path following
  • https://nisheethvishnoi.files.wordpress.com/2018/05/lecture71.pdf
  • http://www.stat.cmu.edu/~ryantibs/convexopt-S15/lectures/16-primal-dual.pdf
  • L. Vandenberghe EE236C (Spring 2016): 17. Path-following methods

Lagrange Multipliers and Duality

It is to solve the constrained optimization problem $$\arg\min_{x}f(x), \quad s.t.\quad g(x)=b.$$

The barrier or penalty function methods are to add some terms to $f(x)+\Omega(g(x)-b)$ converting constrained problems into unconstrained problems by introducing an artificial penalty for violating the constraint. For example, $$P(x, \lambda)= f(x) + \lambda {|g(x)-b|}_2^2$$ where the penalty function $\Omega(x) = {|x|}_2^2$, $\lambda\in\mathbb{R}^{+}$. We can regard it as a surrogate loss technique. Although the penalty function is convex and differentiable, it is more difficult than directly optimizing $f(x)$ when the constrain is complicated.

Lagrange Multipliers and Generalized Lagrange Function

The penalty function methods do not take the optimal conditions into consideration although it works. If $x^{\star}$ is in the solution set of the optimization problem above, it is obvious that $Ax^{\star}=b$ and $L(x^{\star}, \lambda)=f(x^{\star})$ where $$L(x, \lambda)= f(x) + \lambda^T(Ax-b).$$

In another direction, we want to prove that $x^{\star}$ is the optima of the optimization problem if $$ L(x^{\star}, {\lambda}^{\star})=\quad \min_{x}[\max_{\lambda} L(x, \lambda)]. $$

By the definition, $L(x^{\star}, {\lambda}^{\star})\geq L(x^{\star}, \lambda)=f(x^{\star})+\lambda^T(Ax^{\star}-b)$, which implies that $\lambda^T(Ax^{\star}-b)=0$ i.e., $Ax^{\star}=b$. And $L(x^{\star}, {\lambda}^{\star}) = f(x^{\star})\leq L(x, {\lambda}^{\star})\forall x$ if $Ax=b$ thus $x^{\star}$ is the solution to the primary problem.

It is dual problem is in the following form: $$\max_{\lambda}[\min_{x} L(x, \lambda)].$$

Note that $$ \min_{x} L(x, \lambda)\leq L(x,\lambda)\leq \max_{\lambda}L(x,\lambda) $$ implies $$\ \max_{\lambda}[\min_{x} L(x, \lambda)]\leq \max_{\lambda} L(x, \lambda)\leq \min_{x}[\max_{\lambda} L(x, \lambda)]. $$

And note that necessary condition of extrema is that the gradients are equal to 0s: $$ \frac{\partial L(x,\lambda)}{\partial x}= \frac{\partial f(x)}{\partial x}+ A\lambda = 0 \ \frac{\partial L(x,\lambda)}{\partial \lambda} = Ax-b=0 $$

Dual Ascent takes advantages of this properties:

  1. $x^{k+1}=\arg\min_{x} L(x,\lambda)$;
  2. ${\lambda}^{k+1}= {\lambda}^{k}+\alpha_k(Ax^{k+1}-b)$.

Entropy minimization algorithms: $$ x^{k+1}\in \arg\min_{x}{f(x)+\frac{1}{c_k} \sum_{i=1}^{n} x^{i}(\ln(\frac{x_i}{x^{k}_{i}})-1)}. $$

  • Duality
  • https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf
  • https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture7.pdf
  • https://www.svm-tutorial.com/2016/09/duality-lagrange-multipliers/
  • https://www.cs.jhu.edu/~svitlana/papers/non_refereed/optimization_1.pdf
  • Constrained Optimization and Lagrange Multiplier Methods
  • https://zhuanlan.zhihu.com/p/50823110
  • The proximal augmented Lagrangian method for nonsmooth composite optimization

Exponential Augmented Lagrangian Method

A special case for the convex problem $$ \text{minimize}\quad f(x),\ s.t. g_1(x) \leq 0, g_2(x) \leq 0, \cdots, g_r(x) \leq 0, x\in X $$

is the exponential augmented Lagrangean method.

It consists of unconstrained minimizations:

$$ x^{k}\in \arg\min_{x\in X}{f(x)+\frac{1}{c_k} \sum_{j=1}^{r} {\mu}^{k}{j}\exp(c_k g_j(x)) $$ followed by the multiplier iterations $$ {\mu}^{k+1}{j} = {\mu}^{k}_{j}\exp(c_k g_j(x^k)). $$

If the constraints are more complex, KKT theorem may be necessary.

  • An exponential augmented Lagrangian method with second order convergence
  • On the convergence of the exponential multiplier method for convex programming

Generalized Lagrangian function

Minimize $f(x)$, subject to $g_i(x) \ge 0, i = 1,\cdots, m$. Here, $x \in\Omega$, and $\Omega$ is a subset of the Euclidean space $E$. We assume that $f(x)$ and $g_i(x)$ are twice continuously differentiable. A generalized Lagrangian function is defined as $$L(x, \sigma) = f(x) - G[g(x), \sigma]. $$

  • A generalized Lagrangian function and multiplier method
  • Generalized Lagrange Function and Generalized Weak Saddle Points for a Class of Multiobjective Fractional Optimal Control Problems
  • https://blog.csdn.net/shayashi/article/details/82529816
  • https://suzyahyah.github.io/calculus/optimization/2018/04/07/Lagrange-Multiplier.html
  • Lagrange Multipliers without Permanent Scarring

KKT condition

Given general problem $$ \begin{aligned} \min_{x \in \mathbb{R}^{n}} f(x) & \ \text { subject to } & h_{i}(x) \leq 0, \quad i=1, \ldots m \ & \ell_{j}(x)=0, \quad j=1, \ldots r \end{aligned} $$

We defined the Lagrangian: $$L(x, u, v) = f(x) +\sum_{i=1}^{m}u_ih_i(x)+\sum_{j=1}^r v_j \ell_j.$$

The Karush-Kuhn-Tucker conditions or KKT conditions are: $$\begin{aligned} &\bullet\quad 0 \in \partial f(x)+\sum_{i=1}^{m} u_{i} \partial h_{i}(x)+\sum_{j=1}^{T} v_{j} \partial \ell_{j}(x) &\text{(stationarity)} \ &\bullet\quad u_{i} \cdot h_{i}(x)=0 \text { for all } i &\text{ (complementary slackness) } \ &\bullet\quad h_{i}(x) \leq 0, \ell_{j}(x)=0 \text { for all } i, j &\text{(primal feasibility) } \ &\bullet\quad u_{i} \geq 0 \text{ for all } i &\text{ (dual feasibility) } \end{aligned}$$

I learnt this theorem in functional analysis at graduate level course.

  • https://mitmgmtfaculty.mit.edu/rfreund/educationalactivities/
  • Nonlinear Programming by Robert M. Freund
  • https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf

Conjugate Duality

Consider the standard form convex optimization problem in the absence of data uncertainty $$\min_{x} f(x)\tag{P}$$

where $f$ is a proper lower semi-continuous convex functions.

This problem can be embedded into a family of parameterized problems $$\min_{x} g(x, y)\tag{$P_y$}$$ where the function $g(x, y)$ satisfies $g(x, 0) = f(x)$.

  • https://www.zhihu.com/question/58584814/answer/823769937
  • Conjugate Duality and Optimization
  • LECTURE 5: CONJUGATE DUALITY
  • Duality Theory of Constrained Optimization by Robert M. Freund
  • Gauge optimization, duality, and applications
  • Robust Conjugate Duality for Convex Optimization under Uncertainty with Application to Data Classification
  • https://mitmgmtfaculty.mit.edu/rfreund/educationalactivities/
  • http://www.mit.edu/~mitter/publications/6_conjugate_convex_IS.pdf

Splitting Methods

In following context we will talk the optimization problem with linear constraints: $$ \arg\min_{x} f(x) \ s.t. Ax = b $$ where $f(x)$ is always convex.

ADMM

Alternating direction method of multipliers is called ADMM shortly. It is aimed to solve the following convex optimization problem: $$ \min F(x,y) {=f(x)+g(y)} \ \text{subject to }\quad Ax+By =b $$ where $f(x)$ and $g(y)$ is convex; ${A}$ and ${B}$ are matrices.

Define the augmented Lagrangian:

$$ L_{\beta}(x, y)=f(x)+g(y) - \lambda^{T}(Ax + By -b)+ \frac{\beta}{2}{|Ax + By - b|}_{2}^{2}. $$

  • Augmented Lagrangian method

Augmented Lagrange Method at step $k$ is described as following:

  1. $(x^{k+1}, y^{k+1})=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y,\lambda^{\color{aqua}{k}});$
  2. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

ADMM at step $t$ is described as following:

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

The convergence proof of ADMM in convex optimization can be reduced to verifying the stability of a dynamical system or based on the optimal condition variational inequality like On the O(1/t) convergence rate of alternating direction method.

Linearized ADMM

Note that the $x$ subproblem in ADMM $$ \begin{align} \arg\min_{x}L_{\beta}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}})\ =\arg\min_{x}{f(x)+g(y^k)+{\lambda^k}^{T}(Ax+By^{k}-b)+\frac{\beta}{2}{|Ax+By^{k}-b|}{2}^{2}} \ =\arg\min{x}f(x)+\frac{\beta}{2}{|Ax+By^{k}-b-\frac{1}{\beta}\lambda|}_{2}^{2}\tag 1 \end{align} $$

However, the solution of the subproblem (1) does not have the closed form solution because of the general structure of the matrix ${A}$. In this case, we linearize the quadratic term of $$\frac{\beta}{2}{|Ax+By^{k}-b-\frac{1}{\beta}\lambda^k|}_{2}^{2}$$

at $x^k$ and add a proximal term $\frac{r}{2}{|x-x^k|}2^2$ to the objective function. In another word, we solve the following ${x}$ subproblem if ignoring the constant term of the objective function: $$ \min{x}f(x)+\beta(Ax)^T(A x^k + B y^k - b-\frac{1}{\lambda^k}) + \frac{r}{2}{|x - x^k|}_2^2. $$

  1. $x^{k+1}=\arg\min_{x\in \mathbf{X}} f(x) + \beta(A x)^T (A x^k + B y^k - b -\frac{1}{\lambda^k})+ \frac{r}{2}{|x - x^k|}_2^2$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}( x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}} )$,
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}} - b).$

For given $\beta > 0$, choose ${r}$ such that the matrix $rI_{1}-\beta A^TA$ is definitely positive, i.e., $$rI_{1}-\beta A^TA\geq 0.$$

We can also linearize the ${y}$ subproblem:

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^k, \lambda^k)$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \beta (By)^{T}(Ax^{\color{red}{k+1}} + B y^k - b -\frac{1}{\beta}\lambda^k) + \frac{r}{2}|y - y^k|^2$,
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

For given $\beta > 0$, choose ${r}$ such that the matrix $rI_{2}-\beta B^T B$ is definitely positive, i.e., $$rI_{2}-\beta B^T B\geq 0.$$


Taking $\mu\in(0, 1)$ (usually $\mu=0.9$), the Symmetric ADMM is described as

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}})$,
  2. $\lambda^{k + \frac{1}{2}} = \lambda^{k} - \mu\beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k}}-b)$,
  3. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k+\frac{1}{2}}})$,
  4. $\lambda^{k+1} = \lambda^{\color{red}{k+\frac{1}{2}}} - \mu\beta (A x^{\color{red}{k+1}} + B y^{\color{red}{k+1}}-b)$.
  • http://www.optimization-online.org/DB_FILE/2015/05/4925.pdf

$\color{aqua}{\text{Thanks to Professor He Bingsheng who taught me those.}}$[^9]

He Bingsheng

  • Some recent advances in the linearized ALM, ADMM and Beyond Relax the crucial parameter requirements
  • Bing-Sheng He

One of the particular ADMM is also called Split Bregman methods. And Bregman ADMM replace the quadratic penalty function with Bregman divergence: $$ L_{\beta}^{\phi}(x, y)=f(x)+g(y) - \lambda^{T}(Ax + By - b) + \frac{\beta}{2} B_{\phi}(b- Ax, By). $$

where $B_{\phi}$ is the Bregman divergence induced by the convex function $\phi$.

BADMM

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{\phi}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{\phi}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
  • Bregman Alternating Direction Method of Multipliers
  • https://www.swmath.org/software/20288

Relaxed, Inertial and Fast ADMM

$$\min_{x}f(x)+g(Ax)$$ The penalty parameter is $\rho > 0$ and the relaxation parameter is $\alpha\in (0, 2)$. Standard ADMM is recovered with $\alpha = 1$.

Family of relaxed A-ADMM algorithms for the above problem

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{|Ax- y^k+ \lambda^k|}^2$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{|\alpha Ax^{\color{red}{k+1} }+(1-\alpha_k)y^k-y+\lambda^k|}^2$,
  3. $\lambda^{k+1} = \lambda^{k} +\alpha Ax^{\color{red}{k+1}}+(1-\alpha_k)y^k-z^{\color{red}{k+1}}$.

Family of relaxed ADMM algorithms for the above problem The damping constant is $r \geq 3$.

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{|Ax-\hat y^k+\lambda^k|}^2$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{|\alpha Ax^{\color{red}{k+1} }+(1-\alpha_k)\hat y^k-y+\lambda^k|}^2$,
  3. $\lambda^{k+1} = \hat\lambda^{k} +\alpha Ax^{\color{red}{k+1}}+(1-\alpha_k)\hat y^k-z^{\color{red}{k+1}}$,
  4. $\gamma_{k+1}=\frac{k}{k+r}$,
  5. $\hat\lambda^{k+1}=\lambda^{k+1}+\gamma_{k+1}(\lambda^{k+1}-\lambda^{k})$,
  6. $\hat y^{k+1}=y^{k+1}+\gamma_{k+1}(y^{k+1}-y^{k})$.

Fast ADMM

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,\hat y^{\color{aqua}{k}},\hat\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \hat\lambda^{\color{aqua}{k}});$
  3. $\lambda_{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
  4. $\alpha_{k+1}=\frac{1+\sqrt{1+4\alpha_k^2}}{2}$
  5. $\hat y^{k+1}=y^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(y^{k}-y^{k-1})$
  6. $\hat \lambda^{k+1}=\lambda^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(\lambda^{k}-\lambda^{k-1})$
  • FAST ALTERNATING DIRECTION OPTIMIZATION METHODS
  • The Classical Augmented Lagrangian Method and Nonexpansiveness
  • Relative-error inertial-relaxed inexact versions of Douglas-Rachford and ADMM splitting algorithms
  • Relax, and Accelerate: A Continuous Perspective on ADMM
  • https://www.semanticscholar.org/author/Guilherme-Fran%C3%A7a/145512630
  • An explicit rate bound for over-relaxed ADMM

Multi-Block ADMM

Firstly we consider the following optimization problem

$$ \min f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)\ s.t.\quad A_1x_1 + A_2x_2 + \cdots + A_n x_n = b, \ x_i\in\mathop{X_i}\in\mathbb{R}^{d_i}, i=1,2,\cdots, n. $$

We defined its augmented Lagrangian multipliers as

$$ L_{\beta}^{n}(x_1,x_2,\cdots, x_n\mid \lambda)=\sum_{i=1}^{n} f_i(x_i) -\lambda^T (\sum_{i=1}^{n} A_i x_i - b) + \frac{\beta}{2} {({|\sum_{i=1}^{n} A_i x_i - b|})}_{2}^{2}. $$

Particularly, we firstly consider the case when $n=3$:

$$ L_{\beta}^{3}(x, y, z\mid \lambda)=f_1(x) + f_2(y) + f_3(z)-\lambda^T (A_1 x + A_2 y + A_3 z - b) \+\frac{\beta}{2}{|A_1 x + A_2 y + A_3 z - b|}_2^2. $$

It is natural and computationally beneficial to extend the original ADMM directly to solve the general n-block problem. A counter-example shows that this method diverges.


And Professor Bingsheng He, who taught me this section in his class, and his coauthors proposed some schemes for this problem based on his unified frame work for convex optimization and monotonic variational inequality.

Parallel splitting augmented Lagrangian method (abbreviated to PSALM) is described as follows:

  1. $x^{k+1}=\arg\min_{x}{L_{\beta}^3(x,y^k,z^k,\lambda^k)\mid x\in\mathbb{X}}$;
  2. $y^{k+1}=\arg\min_{x}{L_{\beta}^3(x^{\color{red}{k+1}},y,z^k,\lambda^k)\mid y\in\mathbb{Y}}$;
  3. $z^{k+1}=\arg\min_{x}{L_{\beta}^3(x^{\color{red}{k+1}},y^{\color{yellow}{k}},z,\lambda^k)\mid z\in\mathbb{Z}}$;
  4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$.

We can add one more correction step $$ v^{k+1} := v^{k}-\alpha(v^k - v^{k+1}),\alpha\in (0, 2 -\sqrt{2}) $$

where $v^{k+1}=(y^{k+1},z^{k+1},\lambda^{k+1})$.

Another approach is to add an regularized terms:

  1. $x^{k+1}=\arg\min_{x}{L_{\beta}^3(x, y^k, z^k, \lambda^k)\mid x\in\mathbb{X}}$,
  2. $y^{k+1}=\arg\min_{x}{L_{\beta}^3(x^{\color{red}{k+1}}, y, z^k,\lambda^k)+\color{red}{\frac{\tau}{2}\beta{|A_2(y-y^k)|}^{2}}\mid y\in\mathbb{Y}}$,
  3. $z^{k+1}=\arg\min_{x}{L_{\beta}^3(x^{\color{red}{k+1}}, y^{\color{yellow}{k}}, z, \lambda^k)+\color{blue}{\frac{\tau}{2}\beta{|A_3(z - z^k)|}^{2}}\mid z\in\mathbb{Z}}$,
  4. $\lambda^{k+1} = {\lambda}^{k} - \beta(A_1 x^{k+1} + A_2 y^{k+1} + A_3 z^{k+1}-b)$,

where $\tau>1$.

  • http://scis.scichina.com/en/2018/122101.pdf
  • http://maths.nju.edu.cn/~hebma/slides/17C.pdf
  • http://maths.nju.edu.cn/~hebma/slides/18C.pdf
  • https://link.springer.com/article/10.1007/s10107-014-0826-5

Davis-Yin three operator splitting

If $f_1$ is strongly convex, then apply Davis-Yin (to dual problem) gives:

  1. $x^{k+1}=\arg\min_{x}{L^3_{\beta}(\color{green}{x},y^k,z^k,\lambda^k)\mid x\in\mathbb{X}}$;
  2. $y^{k+1}=\arg\min_{x}{L_{\beta}^3(x^{k+1},\color{green}{y},z^k,\lambda^k)\mid y\in\mathbb{Y}}$;
  3. $z^{k+1}=\arg\min_{x}{L_{\beta}^3(x^{k+1},y^{k+1},\color{green}{z},\lambda^k)\mid z\in\mathbb{Z}}$;
  4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$.

where the notation $L^3(x, y, z, \lambda)$ is deefined by $$L^3(x, y, z, \lambda) = f_1(x) + f_2(y) + f_3(z)-{\lambda}^T(A_1 x + A_2 y + A_3 z - b)$$ is the Lagrangian rather than augmented Lagrangian.

  • Three-Operator Splitting
  • A Three-Operator Splitting Scheme and its Optimization Applications
  • Three-Operator Splitting and its Optimization Applications
  • A Three-Operator Splitting Scheme and its Optimization Applications
  • A Three-Operator Splitting Scheme and its Applications

Randomly Permuted ADMM given initial values at round $k$ is described as follows:

  1. Primal update
    • Pick a permutation $\sigma$ of ${1,.. ., n}$ uniformly at random;
    • For $i = 1,2,\cdots, n$, compute ${x}^{k+1}{\sigma(i)}$ by $$ x^{k+1}{\sigma(i)}=\arg\min_{x_{\sigma(i)}} L(x^{k+1}{\sigma(1)},\cdots, x^{k+1}{\sigma(i-1)}, x_{\sigma(i)}, x^{k+1}_{\sigma(i+1)},\cdots\mid \lambda^{k}). $$
  2. Dual update. Update the dual variable by $${\lambda}^{k+1}={\lambda}^{k}-\mu(\sum_{i=1}^{n}A_i x_i -b).$$
  • Randomly Permuted ADMM
  • On the Expected Convergence of Randomly Permuted ADMM
  • On the Efficiency of Random Permutation for ADMM and Coordinate Descent
  • Multi-Block ADMM and its Convergence Random Permutation Helps-A talk by Ye

dlADMM

$$ \begin{array}{l}{\mathrm{ Problem 1. }} \ {\min {W{l}, b_{l}, z_{l}, a_{l}} R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right)} \ {\text { s.t. } z_{l}=W_{l} a_{l-1}+b_{l}(l=1, \cdots, L), a_{l}=f_{l}\left(z_{l}\right)(l=1, \cdots, L-1)}\end{array} $$ where $a_0\in\mathbb R^d$ is the input of the deep neural network where $n_0$ is the number of feature dimensions, and $y$ is a predefined label vector. $R\left(z_{L} ; y\right)$ is a risk function for the L-th layer, which is convex, continuous and proper, and $\Omega_{l}\left(W_{l}\right)$ is a regularization term for the l-th layer, which is also convex, continuous, and proper.

Rather than solving Problem 1 directly, we can relax Problem 1 by adding an $\ell_2$ penalty to address Problem 2 as follows:

$$ \begin{array}{l}{\mathrm{ Problem 2 }} \ {\begin{aligned} \min {W{l}, b_{l}, z_{l}, a_{l}} F(\boldsymbol{W}, \boldsymbol{b}, z, \boldsymbol{a})=& R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right) \ +\underbrace{(v / 2) \sum_{l=1}^{L-1}\left(\left|z_{l}-W_{l} a_{l-1}-b_{l}\right|{2}^{2}+\left|a{l}-f_{l}\left(z_{l}\right)\right|{2}^{2}\right)}{\text{$\ell_2$ penalty} } \ \text { s.t. } z_{L}=W_{L} a_{L-1}+b_{L} \end{aligned}}\end{array} $$

where $\mathbf{W}=\left{W_{l}\right}{l=1}^{L}, \mathbf{b}=\left{b{l}\right}{l=1}^{L}, \mathbf{z}=\left{z{l}\right}{l=1}^{L}, \mathbf{a}=\left{a{l}\right}_{l=1}^{L-1}$ and $\nu >0$ is a tuning parameter.

Compared with Problem 1, Problem 2 has only a linear constraint $z_L = W_La_L−1 + b_L$ and hence is easier to solve

  • dlADMM: Deep Learning Optimization via Alternating Direction Method of Multipliers
  • Deep Learning Optimization via Alternating Direction Method of Multiplier

  • http://maths.nju.edu.cn/~hebma/
  • http://stanford.edu/~boyd/admm.html
  • https://www.wikiwand.com/en/Augmented_Lagrangian_method
  • A General Analysis of the Convergence of ADMM
  • 用ADMM实现统计学习问题的分布式计算
  • 凸优化:ADMM(Alternating Direction Method of Multipliers)交替方向乘子算法

Monotone Operator Splitting Methods for Optimization

Monotone operator splitting methods, which originated in the late 1970's in the context of partial differential equations, have started to be highly effective for modeling and solving a wide range of data analysis and processing problems, in particular high-dimensional statistical data analysis.

Operator splitting is to decompose one complicated operator(procedure) into some simple operators (procedures). For example, ADMM splits the maxmin operator of the augmented Lagrangian into 3 opertors: $$ \arg\min_{x,y}\max_{\lambda} L_{\beta}(x,y\mid \lambda) $$ to $$ \arg\min_{x}L_{\beta}(x,y\mid \lambda) \circ \ ,\arg\min_{y}L_{\beta}(x,y\mid \lambda) \circ \arg\max_{\lambda} L_{\beta}(x,y,\mid \lambda). $$

They are really some block relaxation techniques.

  • https://web.stanford.edu/class/ee364b/lectures/monotone_split_slides.pdf
  • Operator Splitting by Professor Udell @ORIE 6326: Convex Optimization
  • A note on the equivalence of operator splitting methods
  • Splitting methods for monotone operators with applications to parallel optimization
  • Operator Splitting Methods for Fast MPC
  • https://staffportal.curtin.edu.au/staff/profile/view/Jie.Sun/
  • Operator Splitting Methods in Data Analysis
  • https://www.samsi.info/programs-and-activities/research-workshops/operator-splitting-methods-data-analysis/
  • http://idda.cuhk.edu.cn/zh-hans/page/10297
  • Random monotone operators and application to Stochastic Optimization
  • Splitting methods and ADMM, Thibaut Lienart
  • Operator Splitting Performance Estimation: Tight contraction factors and optimal parameter selection
  • Split Bregman
  • Accelerated Bregman operator splitting with backtracking
  • Splitting Algorithms, Modern Operator Theory, and Applications (17w5030)
  • 17w5030 Workshop on Splitting Algorithms, Modern Operator Theory, and Applications
  • https://idda.cuhk.edu.cn/zh-hans/news/11729
  • MIngyi Hong

PDFP and PDHG

Primal-dual Fixed Point Algorithm

We demonstrate how different algorithms can be obtained by splitting the problems in different ways through the classic example of sparsity regularized least square model with constraint. In particular, for a class of linearly constrained problems, which are of great interest in the context of multi-block ADMM, can be also solved by PDFP with a guarantee of convergence. Finally, some experiments are provided to illustrate the performance of several schemes derived by the PDFP algorithm.

  • A primal-dual fixed point algorithm for multi-block convex minimization
  • A primal–dual fixed point algorithm for convex separable minimization
  • A Unified Primal-Dual Algorithm Framework Based on Bregman Iteration
  • Proximal ADMM
  • The Complexity of Primal-Dual Fixed Point Methods for Ridge Regression ,
  • A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration

Primary Dual Hybrid Gradient

The Primal-Dual Hybrid Gradient (PDHG) method, also known as the Chambolle-Pock method, is a powerful splitting method that can solve a wide range of constrained and non-differentiable optimization problems. Unlike the popular ADMM method, the PDHG approach usually does not require expensive minimization sub-steps.

PDHG solves general saddle-point problems of the form $$\min_{x\in X}\max_{y\in Y} f(x)+\left<Ax, y\right>-g(y)$$ where $f$ and $g$ are convex functions, and $A$ is a linear operator.


  • $\hat x^{k+1}=x^k -\tau_k A^Ty^k$;
  • $x^{k+1}=\arg\min_{x\in X}f(x)+\frac{1}{2\tau_k}{|x-\hat x^{k+1}|}^2$;
  • $\tilde x^{k+1}=x^{k+1}+(x^{k+1}-x^{k})$;
  • $\hat y^{k+1}=y^k +\sigma_k A\tilde x^{k+1}$;
  • $y^{k+1}=\arg\min_{y\in Y} g(x)+\frac{1}{2\sigma_k}{|y-\hat y^{k+1}|}^2$.
  • Primary-dual hybrid gradient
  • Adaptive Primal-Dual Hybrid Gradient Methods for Saddle-Point Problems
  • Convergence Analysis of Primal-Dual Algorithms for a Saddle-Point Problem: From Contraction Perspective
  • An Algorithmic Framework of Generalized Primal-Dual Hybrid Gradient Methods for Saddle Point Problems
  • Stochastic Primal-Dual Hybrid Gradient Algorithm with Arbitrary Sampling and Imaging Applications
  • A prediction-correction primal-dual hybrid gradient method for convex programming with linear constraints
  • On the convergence of primal–dual hybrid gradient algorithms for total variation image restoration
  • A primal-dual algorithm framework for convex saddle-point optimization
  • http://www.cs.utah.edu/~ssingla/
  • http://maths.nju.edu.cn/~hebma/paper/C-PPA/VI-PPA.pdf
  • https://odlgroup.github.io/odl/index.html

Linear Programming

A linear program (LP) is an optimization problem in which the objective function is linear in the unknowns and the constraints consist of linear equalities and linear inequalities.

The standard problem is $$\min_{x}\mathbf{c^Tx} \ s.t.\quad \mathbf{Ax=b, x\geq 0}.$$

Here $\mathbf x$ is an n-dimensional column vector, $\mathbf{c^T}$ is an n-dimensional row vector, $\mathbf{A}$ is an $m \times n$ matrix, and $\mathbf b$ is an m-dimensional column vector. The vector inequality $\mathbf{x\geq 0}$ means that each component of $\mathbf x$ is nonnegative.

Fundamental Theorem of Linear Programming. Given a linear program in standard form where $\mathbf A$ is an $m × n$ matrix of rank $m$,

  1. if there is a feasible solution, there is a basic feasible solution;
  2. if there is an optimal feasible solution, there is an optimal basic feasible solution.

Linear programming is constrainted convex optimization problem. It is the simplest case of constrainted optimization problem in theory. However, it is useful in many cases.

  • Application of Linear Programming (With Diagram)
  • Applications of Linear Programming

If there is no constraints, the linear objectve function is unbounded.

  • EE236A - Linear Programming (Fall Quarter 2013-14)

Fixed Point Iteration Methods

The fixed point algorithm is initially to find approximate solutions of the equation

$$f(x)=0\tag 1$$ where $f:\mathbb{R}\to\mathbb{R}$.

In this method, we first rewrite the question(1) in the following form $$x=g(x)\tag 2$$

in such a way that any solution of the equation (2), which is a fixed point of the function ${g}$, is a solution of equation (1). For example, we can set $g(x)=f(x)+x, g(x)=x-f(x)$. Then consider the following algorithm.

  1. Give the initial point $x^{0}$;
  2. Compute the recursive procedure $x^{n+1}=g(x^n), n=1,2,\ldots$

So that finally we obtain an sequence ${x^0, x^1, \cdots, x^{n},\cdots}$. There are many methods to test whether this sequence is convergent or not as learnt in calculus.

To introduce the acceleration schemes, we define some order of convergence.

The order of convergence is defined as the constant $p$ such that $\lim_{n\to \infty}\frac{| x^{n+1}-x^{\ast}|}{| x^{n}-x^{\ast}|^p}=C$ if $\lim_{n\to\infty}x^{n}=x^{\ast}$, denoted as $O(\frac{1}{n^p})$.

name the order of convergence
sublinear $p<1$
linear $p = 1$ and $C < 1,$
superlinear $p >1$
quadratic $p = 2$

Solving nonlinear equations

Solving nonlinear equation $$f(x)=0,$$ means to find such points $x^{\ast}$ such that $f(x^{\ast})=0$, where $f:\mathbb R\mapsto \mathbb R$ is nonlinear.

There are many methods for determining the value of $x^{\ast}$ by successive approximation. With any such method we begin by choosing one or more values $x_0, x_1, \cdots, x_r$, more or less arbitrarily, and then successively obtain new values $x_n$, as certain functions of the previously obtained $x_0, x_1, x_2,\cdots, x_{n-1}$ and possibly those of the derivatives $f^{\prime}(x_0), \cdots, f^{\prime}(x_{n-1})$ or higher derivative information.

  • Numerical methods: Solving nonlinear equations
  • Newton, Chebyshev, and Halley Basins of Attraction; A Complete Geometric Approach by Bart D. Stewart, Department of Mathematics, United States Military Academy
  • Solutions of Equations in One Variable Fixed-Point Iteration II

Bisection method

Find a midpoint of interval $(a^k, b^k )$ and designate it $x^{k+1}=\frac{a^k+b^k}{2}$.

$$ \left(a_{k+1}, b_{k+1}\right) =\left{\begin{array}{ll}{\left(a_{k}, x_{k+1}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)<0} \ {\left(x_{k+1}, b_{k}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)>0}\end{array}\right. $$

Regula falsi (false position) method

$$f(x^{k+1})=f(x^k)-\alpha \frac{b^k -a^{k}}{f(b^k) - f(a^{k})}f(x^k)$$

and

$$ \left(a_{k+1}, b_{k+1}\right) =\left{\begin{array}{ll}{\left(a_{k}, x_{k+1}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)<0} \ {\left(x_{k+1}, b_{k}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)>0}\end{array}\right. $$

Secant method

The $k$-th approximation of root is obtained by $$f(x^{k+1})=f(x^k)-\alpha \frac{x^k -x^{k-1}}{f(x^k) - f(x^{k-1})}f(x^k)$$

  • Solutions of Equations in One Variable Secant & Regula Falsi Methods

Newton's method

$$f(x^{k+1})=f(x^k)-\alpha {(f^{\prime})}^{-1}f(x^k)$$

Steffensen's Method

Steffensen's method is modified Newton's method

$$f(x^{k+1})=f(x^k)-\alpha {(\frac{f(x^k+f(x^k))-f(x^k)}{f(x^k)})}^{-1}f(x^k)$$

  • http://www.fyzikazeme.sk/mainpage/stud_mat/nm/lecture2.pdf
  • On the development of Steffensen's method and applications to Hammerstein equations
  • An improvement of Steffensen's method for solving nonlinear equations by N. Gattal1, and A. S. Chibi

Muller's Method

Muller's method is a generalization of the secant method, in the sense that it does not require the derivative of the function.

It is an iterative method that requires three starting points $(p_0, f (p_0)), (p_1, f (p_1)),$ and $(p_2, f (p_2))$. A parabola is constructed that passes through the three points; then the quadratic formula is used to find a root of the quadratic for the next approximation.

  • http://mathworld.wolfram.com/MullersMethod.html
  • http://mathfaculty.fullerton.edu/mathews/n2003/mullersmethod/MullersMethodProof.pdf
  • Muller's Method Wiki

Chebyshev Method

  • http://mathworld.wolfram.com/ChebyshevIteration.html
  • http://www.sam.math.ethz.ch/~mhg/pub/Cheby-02ParComp.pdf

Halley's Methods

  • Newton's versus Halley's Method: A Dynamical Systems Approach
  • MODIFIED HALLEY'S METHOD FOR SOLVING NONLINEAR FUNCTIONS WITH CONVERGENCE OF ORDER SIX AND EFFICIENCY INDEX 1.8171
  • http://benisrael.net/NEWTON-MONTHLY.pdf

Cauchy's Methods

  • Some New Variants of Cauchy's Methods for Solving Nonlinear Equations
  • Some variants of Cauchy's method with accelerated fourth-order convergence

Aitken's $\Delta^2$ method

  • Let $\left{p_{n}\right}$ be generated by a method which has a linear convergence,
  • Having $p_{0}, p_{1}$ and $p_{2}$ compute $\hat{p}{0}=p{0}-\frac{\left(p_{1}-p_{0}\right)^{2}}{p_{2}-2 p_{1}+p_{0}}, n=1,2, \ldots$
    1. compute $p_{n+2}$
    2. compute $\hat{p}{n}=p{n}-\frac{\left(p_{n+1}-p_{n}\right)^{2}}{p_{n+2}-2 p_{n+1}+p_{n}};$ and
    3. the algorithm terminates $p \approx \hat{p}{n}$ if $\left|\hat{p}{n}-\hat{p}_{n-1}\right|<\epsilon$.
  • http://mathfaculty.fullerton.edu/mathews/n2003/AitkenSteffensenMod.html
  • 2.6 - Accelerating Convergence Aitken's Delta squared Method
  • Aitken's $\Delta^2$ method extended
  • Higher Order Aitken Extrapolation with Application to Converging and Diverging Gauss-Seidel Iterations
  • AN ACCELERATION TECHNIQUE FOR SLOWLY CONVERGENT FIXED POINT ITERATIVE METHODS
  • https://www.math.usm.edu/lambers/mat460/fall09/lecture13.pdf
  • Fixed-Point Iteration
  • Lecture 8 : Fixed Point Iteration Method, Newton's Method
  • 2.2 Fixed-Point Iteration

Homotopy Continuation Methods

Homotopy Methods transform a hard problem into a simpler one whit easily calculated zeros and then gradually deform this simpler problem into the original one computing the zeros of the intervening problems and eventually ending with a zero of the original problem.

The homotopy method (continuation method, successive loading method) can be used to generate a good starting value.

Suppose one wishes to obtain a solution to a system of $N$ nonlinear equations in $N$ variables, say $$F(x)=0$$ where $F : \mathbb R^N \to \mathbb R^N$ is a mapping which, for purposes of beginning our discussion we will assume is smooth.

Since we assume that such a priori knowledge is not available, the iteration will often fail, because poor starting values are likely to be chosen.

We construct a parameter depending function $$H(x, s) = sF(x) + (1 − s)F_0(x), s\in [0,1]$$ and note, that $H(x, 0) = 0$ is the problem with known solution and $H(x, 1) = 0$ is the original problem $F(x) = 0$. As the solution of $H(x, s) = 0$ depends on s we denote it by $x^{\ast}(s)$. We discretize now the intervall into $0 = s_0 < s_1 < \cdots < s_n = 1$ and solve a sequence of nonlinear systems with Newton's method $$H(x, s_i) = 0$$

  • http://homepages.math.uic.edu/~jan/
  • http://people.bu.edu/fdc/H-topy.pdf
  • 3.10: Homotopy Method
  • CHAPTER 2: Numerical homotopy continuation
  • https://blog.csdn.net/jbb0523/article/details/52460408
  • Homotopy Continuation, Max Buot (CMU)&Donald Richards (PSU)
  • On the Link Between Gaussian Homotopy Continuation and Convex Envelopes
  • HOPE: A Homotopy Optimization Method for Protein Structure Prediction
  • Solving inequality Constrained Optimization Problems by Differential Homotopy Continuation Methods
  • PHCpack: a general-purpose solver for polynomial systems by homotopy continuation
  • APPLICATION OF HOMOTOPY ANALYSIS METHOD FOR SOLVING NONLINEAR CAUCHY PROBLEM

In high dimensional space, it is a little different. Fixed point iteration as well as the fixed point itself arises in many cases such as [https://arxiv.org/pdf/1511.06393.pdf].

The contracting mapping ${F}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is defined as $$|F(x)-F(y)|\leq \alpha|x-y|, \forall x,y \in\mathbb{R},\alpha\in[0,1).$$ Thus $\lim_{|x-y|\to 0}\frac{|F(x)-F(y)|}{|x-y|}\leq \alpha\in[0,1)$.

Now we rewrite the necessary condition of unconstrainted optimization problems $\nabla f(x) = 0$ to the fixed point equation:

$$ \begin{align} \nabla f(x) = 0 \Rightarrow & x - \alpha\nabla f(x) = x \ \nabla f(x) = 0 \Rightarrow g(x) - \alpha\nabla f(x) = g(x) \Rightarrow & x - g^{-1}(\alpha\nabla f(x)) = x\ \nabla f(x) = 0\Rightarrow H(x)x- \alpha\nabla f(x) = H(x)x \Rightarrow & x - \alpha H(x)^{-1} \nabla f(x) = x \ \nabla f(x) = 0 \Rightarrow M(x)\nabla f(x) = 0 \Rightarrow & x -\alpha M(x)\nabla f(x) = x \end{align} $$

where $H(x)$ is the lambda-matrix.

These correspond to gradient descent, mirror gradient methods, Newton's methods and quasi-Newton's methods, respectively.

And the projected (sub)gradient methods are in the fixed-point iterative form: $$ x = Proj_{\mathbb{S}}(x-\alpha \nabla f(x)) $$

as well as the mirror gradient and proximal gradient methods different from the projection operator.

Expectation maximization is also an accelerated fixed point iteration as well as Markov chain.

http://www.drkhamsi.com/fpt/books.html

The following figures in the table is form Formulations to overcome the divergence of iterative method of fixed-point in nonlinear equations solution

Fixed Point Iterations
  • https://www.wikiwand.com/en/Fixed-point_theorem
  • FixedPoint: A suite of acceleration algorithms with Application
  • Books on Fixed Point Theory
  • Recent Advances in Convex Optimization and Fixed Point Algorithms by Jean-Christophe Pesquet
  • Acceleration methods TIES594 PDE-solvers, Lecture 14, 6.5.2015, Olli Mali
  • The SUNNonlinearSolver_FixedPoint implementation¶

Gaussian-Seidel Method and Component Solution Methods

The mapping $\hat T_i :X\mapsto X$, corresponding to an update of the ith block-component only, is given by $$\hat T_i(x)=\hat T_i(x_1, \dots, x_m)=(x_1,\dots, x_{i-1}, T_i(x), x_{i+1}, \dots, x_m).$$

Updating all block-components of $x$, one at a time in increasing order, is equivalent to applying the mapping $S: X\mapsto X$, defining by $$S=\hat{T}m\circ \hat{T}{m-1}\circ\cdots \circ \hat{T}1,$$ where $\circ$ denotes composition.
An equivalent definition of $S$ is given by the equation $$S_i=T
{i}(S_1(x), \dots, S_{i-1}(x), x_i, \dots, x_m)$$ where $S_i:X\mapsto X_i$ is the $i$th block-component of $S$. The mapping $S$ will be called the Gaussian-Seidel mapping based on the mapping $T$ and the iteration $x(t+1)=S(x(t))$ will be called Gaussian-Seidel algorithm based on the mapping $T$.

The system $x=T(x)$ can be decomposed into m smaller system of equations of the form $$x_i=T_i(x_1,\dots, x_m), \quad i=1,\dots,m,$$ which have to be solved simultaneously. We will consider an algorithm that solves at iteration the $i$th equation in the system for $x_i$, while keeping the other component fixed.

Given a vector $x(t)\in X$, the $i$th block-component $x_i(t+1)$ of the next vector is chosen to be a solution of the $i$th equation in the system, that is, $$x_i(t+1)\in{y_i\in X_i\mid y_i=T_i(x_1,\dots,x_{i-1}, y_i, x_{i+1},\dots, x_m)}.$$

ISTA and FASTA

The $\ell_1$ regularization is to solve the ill-conditioned equations such as $$\min_{x}{|Ax-b|}_2^2+\lambda{|x|}_1.$$

It is less sensitive to outliers and obtain much more sparse solutions (as opposed to $\ell_2$ regularization). Its application includes and is not restricted in LASSO, compressed sensing and sparse approximation of signals.

It is clear that the absolute value function is not smooth or differentiable everywhere even the objective function ${|Ax-b|}_2^2+\lambda{|x|}_1$ is convex. It is not best to solve this problem by gradient-based methods.

Iterative Shrinkage-Threshold Algorithms(ISTA) for $\ell_1$ regularization is $$x^{k+1}=\mathbf{T}{\lambda t}(x^{k}-tA ^{T}(Ax-b))$$ where $t> 0$ is a step size and $\mathbf{T}{\alpha}$ is the shrinkage operator defined by $${\mathbf{T}{\alpha}(x)}{i}={(x_i-\alpha)}{+}sgn(x{i})$$ where $x_i$ is the $i$ th component of $x\in\mathbb{R}^{n}$.

FISTA with constant stepsize

  • $x^{k}= p_{L}(y^k)$ computed as ISTA;
  • $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;
  • $y^{k+1}=x^k+\frac{t_k -1}{t_{k+1}}(x^k-x^{k-1})$.
  • A Fast Iterative Shrinkage Algorithm for Convex Regularized Linear Inverse Problems
  • https://pylops.readthedocs.io/en/latest/gallery/plot_ista.html
  • ORF523: ISTA and FISTA

This will lead to the operator splitting methods analysesed by Wotao Yin and others.

  • ORIE 6326: Convex Optimization Operator Splitting
  • Monotone Operator Splitting Methods
  • A Course on First-Order, Operator Splitting, and Coordinate Update Methods for Optimization
  • Operator Splitting Methods for Convex Optimization Analysis and Implementation
  • Some Operator Splitting Methods for Convex Optimization
  • FAST ALTERNATING DIRECTION OPTIMIZATION METHODS
  • https://www.math.ucla.edu/~wotaoyin/math285j.18f/
  • Fixed-Point Continuation for $\ell_1$-Minimization: Methodology and Convergence@https://epubs.siam.org/doi/abs/10.1137/070698920

Generic Acceleration Framework

  • Given
    • existing optimization procedure $M(f, x)$
    • previous iterates $x^{1},x^{2}, \cdots ,x^{k}$ and
    • new proposed guess $x^P = M(f, x^{k})$.
  • Find step $x^{k+1}$ using information at $x^{1},x^{2}, \cdots ,x^{k}, x^P$.
  • Discover acceleration
  • The zen of gradient descent
  • Introduction to Optimization Theory MS&E213 / CS269O - Spring 2017 Chapter 4 Acceleration
  • Algorithms, Nature, and Society
  • https://damienscieur.com/sections/paper.html
  • Generalized Framework for Nonlinear Acceleration
  • Nonlinear Acceleration of Stochastic Algorithms
  • Nonlinear Acceleration of Constrained Optimization Algorithms
  • Cheat Sheet: Acceleration from First Principles

Relaxation and inertia

We will focus here on

  • unit memory,
  • accelerations using past operation outputs OR iterates.

Given an fixed point iteration $x^{k+1}={T}(x^k)$, there are two simple acceleration schemes.

$$\begin{aligned} x^{k+1}&=T_{1}\left(x^{k}+\nu^{k}\left(x^{k}-x^{k-1}\right)\right) \ x^{k+2}&=T_{2}\left(x^{k+1}+\nu^{k+1}\left(x^{k+1}-x^{k}\right)\right) \end{aligned}$$

Relaxation Inertia Alternated Inertia
$T_1=T, T_2=I$ $T_1=T, T_2=T$ $T_1=T, T_2=I$
$\nu^{k}=0, \nu^{k+1}=\eta^{k / 2}-1$ $\nu^{k}=\gamma^{k}, \nu^{k+1}=\gamma^{k+1}$ $\nu^{k}=0, \nu^{k+1}=\gamma^{k+1}$
${x^{k+1}=T\left(x^{k}\right)}$ $x^{k+1}=T(x^{k}+\gamma^{k}\left(x^{k}-x^{k-1}\right))$ ${x^{k+1}=T\left(x^{k}\right)}$
$x^{k+2}=x^{k+1}+\nu^{k+1}\left(x^{k+1}-x^{k}\right)$ $x^{k+2}=T(x^{k+1}+\gamma^{k+1}\left(x^{k+1}-x^{k-1}\right))$ $x^{k+2}=T(x^{k+1}+\gamma^{k+1}\left(x^{k+1}-x^{k-1}\right))$
  • A Generic online acceleration scheme for Optimization algorithms via Relaxation and Inertia
  • RELAXATION AND INERTIA IN FIXED-POINT ITERATIONS WITH APPLICATIONS
  • Weak Convergence of a Relaxed and Inertial Hybrid Projection-Proximal Point Algorithm for Maximal Monotone Operators in Hilbert Space
  • FIRE: Fast Inertial Relaxation Engine for Optimization on All Scales
  • Structural Relaxation Made Simple
  • Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
  • Online Relaxation Method for Improving Linear Convergence Rates of the ADMM
  • http://www.iutzeler.org/
  • https://www.math.uni-bielefeld.de/~gaehler/
  • https://www.researchgate.net/profile/Damien_Scieur

Anderson Acceleration

Let $H$ be a Hilbert space equipped with a symmetric inner product $\left<\cdot, \cdot\right>: H \times H \to R$. Let $T : H \to H$ be a nonexpansive mapping and consider for fixed $x_0 \in H$ the Halpern-Iteration (named after Benjamin Halpern, who introduced it): $$x^{k+1}= (1-{\alpha}_k)x^0+ {\alpha}_k T(x^k), {\alpha}_k \in (0,1)$$

with ${\alpha}_k = \frac{k+1}{k+2}$ for approximating a fixed point of $T$.

It is proved that $\frac{1}{2}{| x^k -T(x^k)|}\leq\frac{|x^0-x^{\ast} |}{k+1}$.

  • On the Convergence Rate of the Halpern-Iteration

Krasnosel'skii-Mann(KM, or averaged) iterations update $x^k$ in iteration ${k}$ to

$$ x^{k+1}=(1-\alpha_k)x^k+ {\alpha}_k T(x^k), {\alpha}_k \in (0,1) $$

for the fixed point problem (2). Specially, $T(x^k)= x^k-\alpha_k \nabla f(x^k)$ for the convex optimization problems. If the operator $T$ is non-expensive, then the sequence ${x^{k}\mid k=0,1,2,\dots}$ is convergent by Krasnosel'skii-Mann Theorem.

In 1953, Mann de fined an iterative method: $$ x^{k+1}=M(x^k, \alpha_k , T)=(1-\alpha_k)x^k+ {\alpha}k T(x^k),\ \alpha_k \in [0,1),,\text{satisfying}, \sum{k=1}^{\infty}{\alpha}_k = \infty. $$

The sequence ${x^{k}}$ defined by

$$ x^{k+1}= (1-\alpha_k)x^k+ {\alpha}_k T(x^k), 0 < a\leq {\alpha}k \leq b < 1 $$ (additionally $\sum{k=1}^{\infty}{\alpha}k = \infty$ )is called a modi ed Mann iteration. Take ${\alpha_k, \beta_k}$ two sequences in $[0, 1]$ satisfying $$ \sum{k=1}^{\infty}{\alpha}_k {\beta}k = \infty, \lim{k\infty}\beta_k =0, 0 \leq {\alpha}_k \leq \beta_k \leq 1. $$

Then the sequence ${x^{k}}$ defined by

$$ x^{k+1}= (1-\alpha_k)x^k+ {\alpha}_k T(y^k), \ y^{k} = (1-\beta_k)x^k + {\beta}_k T(x^k), $$

is called the Ishikawa iteration.

Hojjat Afshari and Hassen Aydi proposes another Mann type iteration: $$ x^{k+1} = {\alpha}_k x^k + {\beta}_k T(x^k) + {\gamma}_k T^{2}(x^k) $$

where ${\alpha}_k + {\beta}_k + {\gamma}_k = 1, {\alpha}_k, {\beta}k, {\gamma}k \in [0,1)$ for all $k\geq 1$, $\sum{k}^{\infty}(1-\alpha_k)=\infty$, $\sum{k=1}^{\infty}\gamma_k < \infty$.

If the Mann type iteration ${x^k}$ converges strongly to a point $p$, then $p$ is a fixed point of $T$.

  • Krasnoselskii-Mann method for non-self mappings
  • http://www.krasnoselskii.iitp.ru/papereng.pdf
  • Strong convergence of the modified Mann iterative method for strict pseudo-contractions
  • Some results on a modified Mann iterative scheme in a reflexive Banach space
  • Some results about Krasnosel'skiĭ-Mann iteration process
  • Convergence theorems for inertial KM-type algorithms
  • Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings

There is an acceleration framework of fixed point iterations for the problem (2) called Anderson Acceleration or regularized nonlinear acceleration

  1. $F_k = (h_{k-m_k}, \dots, h_k)$ where $h_i=g(x_i)-x_i$;
  2. Determine $\alpha^{k}=(\alpha_0^{k},\dots, \alpha_{m_k}^{k})^{T}$ that solves $\min_{\alpha^{k}}{|F_k\alpha^k|}2^2$ s.t. $\sum{i=0}^{m_k}\alpha_i^{k}=1$;
  3. Set $x_{k+1}=\sum_{i=0}^{m_k} \alpha_i^{k} g(x_{k - m_k + i})$.

It is maybe interesting to introduce some Bregman divergence $D_{f}(\cdot, \cdot)$ instead of the squared $\ell_2$ norm when choosing $\alpha^{k}$ so that $$\alpha^{k}=\arg\min_{\alpha^k}{D_{f}(F_k \alpha^k)\mid \sum_{i=0}^{m_k}\alpha_i^{k}=1}.$$ Thus we would use mirror gradient methods to solve this problem.

It is proved that the Anderson acceleration converges if the fixed point mapping is cotractive.

  • Anderson acceleration for fixed point iterations
  • Anderson Acceleration
  • MS142 Anderson Acceleration and Applications
  • Convergence Analysis For Anderson Acceleration
  • Comments on "Anderson Acceleration, Mixing and Extrapolation"
  • Globally Convergent Type-I Anderson Acceleration for Non-Smooth Fixed-Point Iterations
  • A proof that Anderson acceleration improves the convergence rate in linearly converging fixed point methods (but not in those converging quadratically)

The Baillon-Haddad Theorem provides an important link between convex optimization and fixed-point iteration, which proves that if the gradient of a convex and continuously differentiable function is non-expansive, then it is actually firmly non-expansive.

  • The Baillon-Haddad Theorem Revisited
  • A Generic online acceleration scheme for Optimization algorithms via Relaxation and Inertia
  • RELAXATION AND INERTIA IN FIXED-POINT ITERATIONS WITH APPLICATIOn
  • Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
  • Iterative Convex Optimization Algorithms; Part One: Using the Baillon–Haddad Theorem
  • Recent Advances in Convex Optimization and Fixed Point Algorithms by Jean-Christophe Pesquet
  • A FIXED-POINT OF VIEW ON GRADIENT METHODS FOR BIG DATA

Anderson Acceleration of the Alternating Projections Method for Computing the Nearest Correlation Matrix

A correlation matrix is symmetric, has unit diagonal, and is positive semidefinite. Frequently, asynchronous or missing observations lead to the obtained matrix being indefinite.

A standard way to correct an invalid correlation matrix, by which we mean a real, symmetric indefinite matrix with unit diagonal, is to replace it by the nearest correlation matrix in the Frobenius norm, that is, by the solution of the problem $$\min{ {|A − X|}F : \text{X is a correlation matrix} },$$ where ${|A|}F^2=\sum{ij}a{ij}^2$.

  • The Nearest Correlation Matrix
  • Anderson Acceleration of the Alternating Projections Method for Computing the Nearest Correlation Matrix
  • https://github.com/higham/anderson-accel-ncm
  • Preconditioned alternating projection algorithm for solving the penalized-likelihood SPECT reconstruction problem.

DAAREM

  • Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms Talk
  • Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms
  • Accelerating the EM Algorithm for Mixture-density Estimation
  • http://nhenderstat.com/research/

Anderson Accelerated Douglas-Rachford Splitting

  • Anderson Accelerated Douglas-Rachford Splitting
  • https://ctk.math.ncsu.edu/TALKS/Anderson.pdf
  • Using Anderson Acceleration to Accelerate the Convergence of Neutron Transport Calculations with Anisotropic Scattering

If the gradient equals to 0s, i.e., $\nabla f(x)=0$, it is possible to find some saddle points. It means that the fixed point iteration matters. It seems that ADMM (or generally Langragian multiplier method) is not the fixed point iteration. The fixed point iteration is in the form $x^{k}=f(x^k)$ where $f$ is usually explicitly expressed.

However, this form is easy to generalize any mapping or operators such as in functional analysis. In this sense, ADMM is really fixed point iteration: $(x^{k+1}, y^{k+1}, \lambda^{k+1})=ADMM(x^k, y^k, \lambda^k)$.

  • Plug-and-Play ADMM for Image Restoration: Fixed Point Convergence and Applications
  • A reconstruction algorithm for compressive quantum tomography using various measurement sets
  • Accelerating ADMM for Efficient Simulation and Optimization
  • A FIXED-POINT PROXIMITY APPROACH TO SOLVING THE SUPPORT VECTOR REGRESSION WITH THE GROUP LASSO REGULARIZATION
  • http://staff.ustc.edu.cn/~juyong/publications.html
  • Optimal parameter selection for the alternating direction method of multipliers (ADMM): quadratic problems
  • https://www-users.cs.umn.edu/~baner029/
  • Multiplicative noise removal in imaging: An exp-model and its fixed-point proximity algorithm
  • A Proximal-Point Analysis of the Preconditioned Alternating Direction Method of Multipliers
  • Fixed points of generalized approximate message passing with arbitrary matrices
  • https://www.math.ucla.edu/~wotaoyin/summer2013/slides/
  • https://xu-yangyang.github.io/papers/AcceleratedLALM.pdf
  • https://engineering.purdue.edu/ChanGroup/project_restoration.html
  • https://github.com/KezhiLi/Quantum_FixedPoint
  • https://epubs.siam.org/doi/abs/10.1137/15M103580X
  • https://xu-yangyang.github.io/

Approximate Minimal Polynomial Extrapolation

From linear to nonlinear iterative methods is not always direct and correct.

The following optimziation problem $$\min f(x)=\frac{1}{2}{|Ax-b|}_2^2$$ is equals to solve the linear system $Ax=b$.

Given $A\in\mathbb{R}^{n\times n}$ such that 1 is not an eigenvalue of $A$ and $v\in\mathbb{R}^n$, the minimal polynomial of $A$ with respect to the vector $v$ is the lowest degree polynomial $p(x)$ such that $p(A)v = 0, p(1) = 1$.

  • Efficient implementation of minimal polynomial and reduced rank extrapolation methods
  • https://core.ac.uk/download/pdf/82614502.pdf
  • Minimal polynomial and reduced rank extrapolation methods are related
  • http://www.cs.technion.ac.il/~asidi/
  • https://www.di.ens.fr/~aspremon/PDF/FOCM17.pdf
  • https://simons.berkeley.edu/sites/default/files/docs/8821/alexsimons17.pdf
  • Nonlinear Schwarz iterations with Reduced Rank Extrapolation
  • A BLOCK RECYCLED GMRES METHOD WITH INVESTIGATIONS INTO ASPECTS OF SOLVER PERFORMANCE
  • http://dd23.kaist.ac.kr/slides/Martin_Gander_2.pdf

As shown before, the acceleration schemes are based on the linear combination of last iterated sequence. The question is why it is linear combination? Why not other extrapolation of the last updated values?

  • Steepest Descent Preconditioning for Nonlinear GMRES Optimization by Hans De Sterck
  • A Fast Anderson-Chebyshev Acceleration for Nonlinear Optimization

Nemirovski's Acceleration

Let $f$ be a 1-smooth function. Denote $x^{+} = x - \nabla f(x)$. The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

$$x_{t+1}=\arg\min_{x\in P_t} f(x)$$ where $P_t=\operatorname{span}{x_t^{+},\sum_{s=1}^t \lambda_s\nabla f(x^s)}$.

It seems like a trust region methods where one region is given.

  • Fields CQAM Focus Program on Data Science and Optimization
  • http://www.pokutta.com/
  • https://github.com/pokutta/lacg
  • Conditional Gradients and Acceleration
  • Nemirovski's acceleration
  • https://sunjackson.github.io/page9/

Damped Inertial Gradient System

Damped Inertial Gradient System (DIGS)

  • http://www.dii.uchile.cl/~mbravo/adgo2016/Slides/Peypouquet.pdf
  • http://www.dii.uchile.cl/~mbravo/adgo2016/

Alternating Anderson-Richardson method

  • Alternating Anderson-Richardson method: An efficient alternative to preconditioned Krylov methods for large, sparse linear systems
  • Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems

Regularized Nonlinear Acceleration

We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.

$$ \begin{array}{l}{\text { Input: Sequence }\left{x_{0}, x_{1}, \ldots, x_{k+1}\right}, \text { parameter } \lambda>0} \ {\text { 1: Form } U=\left[x_{1}-x_{0}, \ldots, x_{k+1}-x_{k}\right]} \ {\text { 2. Solve the linear system }\left(U^{T} U+\lambda I\right) z=1} \ {\text { 3. Set } c=z /\left(z^{T} \mathbf{1}\right)} \ {\text { Output: Return } \sum_{i=0}^{k} c_{i} x_{i}, \text { approximating the optimum } x^{\ast}} \end{array} $$

  • REGULARIZED NONLINEAR ACCELERATION
  • Regularized Nonlinear Acceleration@lids.mit.edu
  • Regularized Nonlinear Acceleration@simons.berkeley
  • Regularized Nonlinear Acceleration by Damien Scieur
  • Regularized nonlinear acceleration, Mathematical Programming
  • http://spars2017.lx.it.pt/index_files/papers/SPARS2017_Paper_16.pdf
  • https://github.com/windows7lover/RegularizedNonlinearAcceleration
  • https://damienscieur.com/

Objective Acceleration

O-ACCEL (objective acceleration), is novel in that it minimizes an approximation to the objective function on subspaces of $\mathbb{R}^n$. We prove that O-ACCEL reduces to the full orthogonalization method for linear systems when the objective is quadratic, which differentiates our proposed approach from existing acceleration methods. Comparisons with the limited-memory Broyden–Fletcher–Goldfarb–Shanno and nonlinear conjugate gradient methods indicate the competitiveness of O-ACCEL.

  • Objective acceleration for unconstrained optimization, Optimization methods and software conference 2017, Havana
  • Objective acceleration for unconstrained optimization: code
  • Objective acceleration for unconstrained optimization by Asbjørn Nilsen Riseth
  • http://julianlsolvers.github.io/Optim.jl/stable/#algo/ngmres/
  • http://spars2017.lx.it.pt/

However, it is best to think from the necessary condition of optima in non-convex optimization in my opinion. Another question is to generalize the fixed point iteration to stochastic gradient methods.

  • PyUNLocBoX: Optimization by Proximal Splitting
  • https://lts2.epfl.ch/

Direct Nonlinear Acceleration

Optimization acceleration techniques such as momentum play a key role in state-of-the-art machine learning algorithms. Recently, generic vector sequence extrapolation techniques, such as regularized nonlinear acceleration (RNA) of Scieur et al. (Scieur et al., 2016), were proposed and shown to accelerate fixed point iterations. In contrast to RNA which computes extrapolation coefficients by (approximately) setting the gradient of the objective function to zero at the extrapolated point, we propose a more direct approach, which we call direct nonlinear acceleration (DNA). In DNA, we aim to minimize (an approximation of) the function value at the extrapolated point instead. We adopt a regularized approach with regularizers designed to prevent the model from entering a region in which the functional approximation is less precise. While the computational cost of DNA is comparable to that of RNA, our direct approach significantly outperforms RNA on both synthetic and real-world datasets. While the focus of this paper is on convex problems, we obtain very encouraging results in accelerating the training of neural networks.

  • https://arxiv.org/abs/1905.11692
  • https://www.aritradutta.com/

Proportional–Integral–Derivative Optimizer

The principle of feedback is simple an input, $x^n$, is given, processed through some function, $f$, and then the output, $y^n$, becomes the next input, $x^{n+1}$, repeatedly. When allowing the ouput to equal the next input, an identity exists so that $x^{n+1}=y^n$. Cobweb diagrams exploit the relationship, map the iterations, and reveal the behaviors of fixed points.

  • https://www.mi.sanu.ac.rs/vismath/stewart/index.html

A PID controller continuously calculates an error $e(t)$, which is the difference between the desired optimal output and a measured system output, and applies a correction $u(t)$ to the system based on the proportional $(P)$, integral $(I)$, and derivative $(D)$ terms of $e(t)$. Mathematically, there is: $$u(t)= K_p e(t) + K_i\int_{0}^{t}e(x)\mathrm d x + K_d\frac{\mathrm d}{\mathrm dt}e(t) $$

where $u$ is the control signal and $e$ is the control error. The control signal is thus a sum of three terms:

  1. the P-term (which is proportional to the error);
  2. the I-term (which is proportional to the integral of the error);
  3. and the D-term (which is proportional to the derivative of the error).

The controller can also be parameterized as $$u(t)= K_p {e(t) + \frac{1}{T_i}\int_{0}^{t}e(x)\mathrm d x + T_d\frac{\mathrm d}{\mathrm dt}e(t)},\tag{PID}$$

where $T_i$ is called integral time and $T_d$ derivative time.

The proportional part acts on the present value of the error, the integral represent and average of past errors and the derivative can be interpreted as a prediction of future errors based on linear extrapolation.

  • http://www.scholarpedia.org/article/Optimal_control
  • EE365: Stochastic Control Spring Quarter 2014
  • PID Theory Explained
  • Chapter 8: PID Control
  • CDS 101/110 -- Fall 2004 Analysis and Design of Feedback Systems
  • PID Control Theory
  • Control system theory
Methods Recursion Integration
Gradient Descent $x^{t+1} = x^t -\alpha_t g(x^t)$ ?
Momentum Methods $x^{t+1} = x^t -\alpha_t g(x^t) + \rho_t(x^t - x^{t-1})$ ?
Nesterov's Gradient Methods $x^{t+1} =y^t -\alpha_t g(y^t), y^t = x^t + \rho_t(x^t -x^{t -1})$ ?
Newton's Methods $x^{t+1} = x^t - \alpha_i H_t^{-1}g(x^t)$ ?
Mirror Gradient Methods $\nabla h(x^{t+1})-\nabla h(x^t) = x^t - \alpha_t \nabla f(x^t) , x\in \mathbb{S}$ ?

By viewing the gradient $g(x^t)$ as error $e(t)$, and comparing it to PID controller, one can see that gradient descent only uses the present gradient to update the weights.

We rewrite the fomula $x^{t+1} = x^t -\alpha_t g(x^t) + \rho_t(x^t - x^{t-1})$ as $$x^{t+1} = x^t -\alpha_t g(x^t) + \rho_t\underbrace{(x^t - x^{t-1})}{-\alpha{t-1}g(x^{t-1})+\rho_{t-1}(x^{t-1}-x^{t-2})}\ = x^t -\alpha_t g(x^t) - \sum_{i=1}^{t-1}[\prod_{j=0}^{i-1}\rho_{t-j}]\alpha_{t-i}g(x^{t-i})+ \rho_1(x^1-x^0) .$$

One can see that the update of parameters relies on both the present gradient and the integral of past gradients. The only difference is that there is a decay $\prod_{j=0}^{i-1}\rho_{t-j}$ term in the I term.


PID optimizer

The proposed PID optimizer updates parameter $x$ at iteration $(t +1)$ by:

  • $V^{t+1}=\alpha V^t -r g(x^t)$
  • $D^{t+1}=\alpha D^t +(1-\alpha)(g(x^t)-g(x^{t-1}))$
  • $x^{t+1}=x^t+V^{t+1}+K_d D^{t+1}$

In a compact form, it is defined as following $$x^{t+1}=x^t+\alpha V^t -r g(x^t)+K_d [\alpha D^t +(1-\alpha)(g(x^t)-g(x^{t-1}))]\ = \alpha (V^t + K_d D^t) +x^t -r g(x^t)+ K_d(1-\alpha)[g(x^t)-g(x^{t-1})] $$

which looks like an ensmeble of inertia and relaxation techniques. We, for the first time, connect classical control theory with deep network optimization, and improve up to 50% the efficiency over SGD-Momentum!)

Now it is still a emprical method without any convergence proof.

As Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent, it is supposed to be converegnt in convex cases with some tuned parameters.

And it is simple to generalize Newton's methods where the gradients are replaced by rescaled gradients. In another word, the Newton type PID optimizer updates parameter $x$ at iteration $(t +1)$ by:

  • $V^{t+1}=\alpha V^t -r H^{-1}(x^t)g(x^t)$
  • $D^{t+1}=\alpha D^t +(1-\alpha)(H^{-1}(x^t)g(x^t)-H^{-1}(x^{t-1})g(x^{t-1}))$
  • $x^{t+1}=x^t+V^{t+1}+K_d D^{t+1}$.

The problem is that we have no theoretical proof while it inspired us how to ensemlble different ptimization methods or scehemes to accelerate the convergence procedure.

  • CVPR 2018 | 加速模型收敛的新思路(控制理论+深度学习)
  • 一种用于深度网络随机优化的PID控制器方法
  • PID Optimizer (Proportional–Integral–Derivative Optimizer)
  • A PID Controller Approach for Stochastic Optimization of Deep Networks
  • Supplemental Materials to "A PID Controller Approach for Stochastic Optimization of Deep Networks"
  • Adaptive Restarting for First Order Optimization Methods

To overcome the oscillation problem in the classical momentum-based optimizer, recent work associates it with the proportional-integral (PI) controller, and artificially adds D term producing a PID controller. It suppresses oscillation with the sacrifice of introducing extra hyper-parameter.

  • SPI-Optimizer: an integral-Separated PI Controller for Stochastic Optimization

By further analyzing the underlying constrained optimization problem, we have found that the two camps of distributed optimization can actually be related through the framework of proportional-integral control. It turns out that consensus methods with constant step-sizes are akin to proportional control and dual-decomposition is akin to integral control. In much the same way that proportional and integral control can be combined to create a control method with damped response and zero steady state error, the two methods of distributed optimization can be combined to produce a damped response with zero error.

  • Proportional Integral Distributed Optimization for Dynamic Network Topologies
  • Proportional-Integral Distributed Optimization
  • Continuous-time Proportional-Integral Distributed Optimization for Networked Systems
  • Continuous-time Proportional-Integral Distributed Optimization for Networked Systems
  • A Control Perspective for Centralized and Distributed Convex Optimization
  • Input-Feedforward-Passivity-Based Distributed Optimization Over Jointly Connected Balanced Digraphs
  • Feedback-Feedforward Control Approach to Distributed Optimization
  • 反馈控制理论在优化、机器学习等领域有哪些应用?
Sample Recurrence Relation Idea of Successive Approximations
$x^{k+1}=M(x^k)$ $x^k=\underbrace{M(M(\cdots M(x^0)))}_{\text{k times}}$

  • Accelerated Optimization in the PDE Framework: Formulations for the Manifold of Diffeomorphism
  • Integration Methods and Accelerated Optimization Algorithms
  • https://statweb.stanford.edu/~candes/math301/hand.html

Ensemble Methods of Optimization Algorithms

In machine learning, boosting algorithms can boost the weak base learner to higher accuarcy learner in probability correct algorithm framework.

The linear coupling technique shows that it is possible to combine some slow optimization methods to a faster method. The interaction of optimization methods and boosting algorithms may bring benifits to both fields.

A unified framework is expected to ensemble different optimization methods to get a faster one. As shown in above section, there are many acceleration schemes to combine the points generated by the same algorithm. Ensemble Methods of Optimization Algorithms are supposed to combine the points generated by different optimziation methods. Another theoretical consideration is to find some fractional order optimization methods. There are fractional differential equations, fractional drivatives but no fractional order optimization methods. I mean that fractional order optimization methods are some special combination of integer order optimization methods not only using fractional derivatives.

Note that there is one step of Newton's coorection in Halley's method. It is the composite of optimization methods beyond our scope of combination of optimization methods. For example, can we combine Halley's method and Newton's method to obtain a faster optimizatio method like linear coupling? Is there non-linear coupling of different methods?

  • Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent
  • A conformable calculus of radial basis functions and its applications
  • Numerical Study for the Fractional Differential Equations Generated by Optimization Problem Using Chebyshev Collocation Method and FDM
  • On Geometry of Halley's Method
  • Coupling of Immune Algorithms and Game Theory in Multiobjective Optimization
  • Optimal Approximations of Coupling in Multidisciplinary Models
  • Coupling of Optimization Algorithms Based on Swarm Intelligence: An Application for Control of Heroin Addiction Epidemic
  • On the effective coupling of optimization algorithms to solve inverse problems of electromagnetism

Before answering the above questions, it is best to clarify how the convergence speed and order of derivative information.

There is a strong connection between convergence speed and order of derivative information in Higher-Order Gradient Method.

  • A variational perspective on accelerated methods in optimization
  • Parallel Boosting with Momentum
  • http://sime.sufe.edu.cn/teacher/show/56

Alternated Inertia ensembles two schemes in fixed point iteration: relaxation and interia.

  • Alternated Inertia
  • A GENERIC ONLINE ACCELERATION SCHEME FOR OPTIMIZATION ALGORITHMS VIA RELAXATION AND INERTIA
  • iPiano: Inertial Proximal Algorithm for Nonconvex Optimization

An intuitive way is to combine different optimziation methods of different order via Anderson acceleration.

Dynamical Systems

We will focus on the optimization methods in the form of fixed point iteration and dynamical systems. It is to minimize the following function $$ f(x), x\in\mathbb{R}^p, \quad\nabla f(x) = g(x), \quad\nabla^2 f(x)= H(x). $$

Iteration ODE Name
$x^{k+1}=x^{k}-\alpha_k g(x^k)$ $\dot{x}(t)=- g(x(t))$ Gradient descent
$x^{k+1}=x^{k}-\alpha_kH_k^{-1} g(x^k)$ $\dot{x}(t) =- H^{-1}(x)g(x(t))$ Newton's method

Like Newton interpolation, more points can compute higher order derivatives. The dynamics of accelerated gradient methods are expected to correspond to higher order differential equations.

Asymptotic Vanishing Damping

Some acceleration methods are iterations of the corresponding algorithms of Asymptotic Vanishing Damping called by Hedy Attouch:

$$ \quad \quad \ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla \Phi (x(t)) =0.\tag{AVD} $$

where $\Phi(x(t))$ is dependent on the objective function; $\alpha >0$ is constant in $\mathbb{R}$.

The fast minimization properties of the trajectories of the second-order evolution equation is also studied by Hedy's group in 2016: $$ \ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla^2 \Phi (x(t))\dot{x}(t) + \nabla \Phi (x(t)) =0\tag{HDD} $$

When it comes to numerical solution to differential equations, it is to find the solution of the equations $x(t)$ so that the equations hold; in optimization, the optima is our goal so that the focus is limit order $$\lim_{t\to t_0} x(t)=x^{\star}$$ if possible where $x^{\star}$ optimizes the cost/objective function $f(x)$ specially $t_0=\infty$.

  • https://perso.math.univ-toulouse.fr/spot/resumes/
  • Fast convex optimization via inertial dynamics with Hessian driven damping
  • A proximal-Newton method for monotone inclusions in Hilbert spaces with complexity $O(1/k^2)$
  • https://www.researchgate.net/profile/Hedy_Attouch
  • https://www.ljll.math.upmc.fr/~plc/sestri/
  • Hedy Attouch at https://biography.omicsonline.org
  • Rate of convergence of the Nesterov accelerated gradient method in the subcritical case $\alpha\leq 3$
  • A Dynamical Approach to an Inertial Forward-Backward Algorithm for Convex Minimization
  • An Inertial Proximal Method for Maximal Monotone Operators via Discretization of a Nonlinear Oscillator with Damping
  • THE HEAVY BALL WITH FRICTION METHOD, I. THE CONTINUOUS DYNAMICAL SYSTEM: GLOBAL EXPLORATION OF THE LOCAL MINIMA OF A REAL-VALUED FUNCTION BY ASYMPTOTIC ANALYSIS OF A DISSIPATIVE DYNAMICAL SYSTEM
  • Viscosity Solutions of Minimization Problems
  • Rate of convergence of the Nesterov accelerated gradient method in the subcritical case $\alpha \leq 3$
  • A dynamic approach to a proximal-Newton method for monotone inclusions in Hilbert spaces, with complexity $O(\frac{1}{n^2})$
  • FAST CONVERGENCE OF INERTIAL DYNAMICS AND ALGORITHMS WITH ASYMPTOTIC VANISHING DAMPING

Symplectic Optimization

Symplectic Optimizatio It is difficult to generalize these methods to stochastic cases.

There is a wonderful summary DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION given by Micheal I Jordan at ICM 2018.

Some new connections between dynamical systems and optimization is found.

  • Variational and Dynamical Perspectives On Learning and Optimization
  • Continuous and Discrete Dynamics For Online Learning and Convex Optimization
  • DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION
  • On Symplectic Optimization
  • A variational perspective on accelerated methods in optimization
  • A Dynamical Systems Perspective on Nesterov Acceleration
  • Generalized Momentum-Based Methods: A Hamiltonian Perspective
  • https://people.eecs.berkeley.edu/~jordan/optimization.html

Weijie J. Su (joint with Bin Shi, Simon Du, and Michael Jordan) introduced a set of high-resolution differential equations to model, analyze, interpret, and design accelerated optimization methods.

$$ \ddot{x}(t) + 2\sqrt{\mu}\dot{x}(t) + \sqrt{s}\nabla^2 f(x) \dot{x}(t) + (1+\sqrt{\mu s})\nabla f(x) = 0, \tag{Su} $$

  • A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights
  • Acceleration via Symplectic Discretization of High-Resolution Differential Equations
  • Understanding the Acceleration Phenomenon via High-Resolution Differential Equations
  • Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization
  • https://www.researchgate.net/profile/Weijie_Su

  • Optimization and Dynamical Systems
  • Direct Runge-Kutta Discretization Achieves Acceleration
  • The Physical systems Behind Optimization Algorithms
  • Integration Methods and Optimization Algorithms
  • Analysis of Hamilton-Jacobi Equation: Optimization, Dynamics and Control - Part II of II
  • From differential equation solvers to accelerated first-order methods for convex optimization
  • Optimization via the Hamilton-Jacobi-Bellman Method: Theory and Applications by Navin Khaneja, Harvard & KITP
  • GRADIENT FLOW DYNAMICS
  • Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem
  • The dynamics of Lagrange and Hamilton

ADMM and Dynamics

  • A Dynamical Systems Perspective on Nonsmooth Constrained Optimization
  • https://kgatsis.github.io/learning_for_control_workshop_CDC2018/assets/slides/Vidal_CDC18.pdf
  • ADMM and Accelerated ADMM as Continuous Dynamical Systems
  • ADMM, Accelerated-ADMM, and Continuous Dynamical Systems, Talk @DIMACS
  • Relax, and Accelerate: A Continuous Perspective on ADMM
  • http://people.ee.duke.edu/~lcarin/Xuejun12.11.2015.pdf

  • http://rahul.narain.name/admm-pd/
  • https://mattoverby.net/pages/admmpd_abstract.html

The last but not least important question is how to rewrite the fixed point iteration as the discrete form of some differential equation. What is more, it is the interaction and connnetion between numerical solution to differential equations and optimization methods in form of fixed ponit iteration that matters.

  • Glossary of Dynamical Systems Terms
  • Bifurcations Involving Fixed Points and Limit Cycles in Biological Systems
  • https://elmer.unibas.ch/pendulum/nldyn.htm

  • ESAIM: Control, Optimization and Calculus of Variations (ESAIM: COCV)
  • MCT'03 Louisiana Conference on Mathematical Control Theory
  • International Workshop "Variational Analysis and Applications", ERICE, August 28 – September 5, 2018
  • Games, Dynamics and Optimization, 2019
  • System dynamics & optimization
  • Introduction to Dynamical Systems by John K. Hunter, Department of Mathematics, University of California at Davis
  • Special Focus on Bridging Continuous and Discrete Optimization
  • Part 2 Dynamical Systems

GRADIENTS AND FLOWS: CONTINUOUS OPTIMIZATION APPROACHES TO THE MAXIMUM FLOW PROBLEM

Optimization for differential equations

  • http://www.it.uu.se/grad/courses/optforDE/
  • http://www.it.uu.se/grad/courses/optforDE/shape-2up.pdf
  • https://eccm-ecfd2018.org/frontal/Default.asp
  • https://www.ima.umn.edu/2015-2016/SW6.6-10.16
  • https://archive.siam.org/meetings/op08/Heinkenschloss.pdf
  • https://www.math.uh.edu/~rohop/spring_04/downloads/script.pdf
  • https://www.math.uh.edu/~rohop/
  • http://www.maths.dur.ac.uk/lms/

Stochastic Approximation

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

  • Kiefer-Wolfowitz Algorithm
  • Stochastic Process and Application

Robbins–Monro algorithm introduced in 1951 by Herbert Robbins and Sutton Monro, presented a methodology for solving a root finding problem, where the function is represented as an expected value.

Assume that we have a function ${\textstyle M(\theta )}:\mathbb{R}\mapsto\mathbb{R}$, and a constant ${\textstyle \alpha \in\mathbb R}$, such that the equation ${\textstyle M(\theta )=\alpha }$ has a unique root at ${\textstyle \theta ^{\ast}}$.

It is assumed that while we cannot directly observe the function ${\textstyle M(\theta )}$, we can instead obtain measurements of the random variable ${\textstyle N(\theta )}$ where ${\textstyle \mathbb{E} [N(\theta )]=M(\theta )}$. The structure of the algorithm is to then generate iterates of the form: $${\displaystyle {\theta}{n+1}= {\theta}{n} - a_{n}(N({\theta}{n})-\alpha )}\tag{R-M}$$ where $a{1},a_{2},\dots$ is a sequence of positive step sizes.

This process can be considered as fixed point iteration with random noise. Different from the determinant methods, the sequences they generated are also on random.

Robbins and Monro proved , Theorem 2 that $\theta_n$ converges in $L^{2}$ (and hence also in probability) to $\theta$ , and Blum later proved the convergence is actually with probability one, provided that:

${\textstyle N(\theta )}$ is uniformly bounded, ${\textstyle M(\theta )}$ is nondecreasing, ${\textstyle M'(\theta ^{\ast})}$ exists and is positive, and The sequence ${\textstyle a_{n}}$ satisfies the following requirements: $$\sum_{n=0}^{\infty} a_{n}=\infty \quad \fbox{ and }\quad \sum_{n=0}^{\infty} a_{n}^{2} < \infty \quad$$ A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form: ${\textstyle a_{n}=a/n}$, for ${\textstyle a>0}$. Other series are possible but in order to average out the noise in ${\textstyle N(\theta )}$, the above condition must be met.

The basic ideas of the Robbins–Monro scheme can be readily modified to provide successive approximations for the minimum (or maximum) of a unimodal regression function, as was shown by Kiefer and Wolfowitz (1952) who introduced a recursive scheme of the form

$$\theta_{n+1}= {\theta}{n} - a{n}\Delta(x_{n})\tag{K-W}$$

to find the minimum $\theta$ of $M$ (or, equivalently, the solution of $dM/dx = 0$).

During the nth stage of the Kiefer–Wolfowitz scheme, observations $y_n^{(1)}$ and $y_n^{(2)}$ are taken at the design levels $x_n^{(1)} = x_n + c_n$ and $x_n^{(2)} = x_n − c_n$, respectively. In (KW), $\Delta(x_n)=\frac{y_n^{(1)}-y_n^{(2)}}{2c_n}$ an and $c_n$ are positive constants such that $c_n \to 0$, $\sum (c_n/a_n)^2<\infty$ and $\sum a_n=\infty$.

  • 5 The Stochastic Approximation Algorithm
  • Chapter 15: Introduction to Stochastic Approximation Algorithms
  • Dynamics of Stochastic Approximation
  • Stochastic Approximation by Tze Leung Lai
  • A Multivariate Stochastic Approximation Procedure
  • The Robbins–Monro Stochastic Approximation Approach to a Discrimination Problem
  • Stochastic approximation: invited paper
  • Stochastic Approximation and Recursive Algorithms and Applications
  • Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems
  • Regularized iterative stochastic approximation methods for stochastic variational inequality problems

Stochastic Gradient Descent

Stochastic gradient descent is classified to stochastic optimization which is considered as the generalization of gradient descent.

Stochastic gradient descent takes advantages of stochastic or estimated gradient to replace the true gradient in gradient descent. It is stochastic gradient but may not be descent. The name stochastic gradient methods may be more appropriate to call the methods with stochastic gradient. It can date back to stochastic approximation in statistics.

It is aimed to solve the problem with finite sum optimization problem, i.e. $$ \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}f(\theta|x_i) $$ where $n<\infty$ and ${f(\theta|x_i)}{i=1}^{n}$ are in the same function family and ${x_i}{i=1}^{n}\subset \mathbb{R}^{d}$ are constants while $\theta\in\mathbb{R}^{p}$ is the variable vector.

The difficulty is $p$, that the dimension of $\theta$, is tremendous. In other words, the model is overparameterized. And the number $n$ is far larger than $p$ generally, i.e. $n \gg p\gg d$. What is worse, the functions ${f(\theta|x_i)}_{i=1}^{n}$ are not convex in most case.


The stochastic gradient method is defined as $$ \theta^{k+1}=\theta^{k}-\alpha_{k}\frac{1}{m}\sum_{j=1}^{m}\nabla f(\theta^{k}| x_{j}^{\prime}) $$ where $x_{j}^{\prime}$ is draw from ${x_i}_{i=1}^{n}$ and $m\ll n$ is on random.

It is the fact $m\ll n$ that makes it possible to compute the gradient of finite sum objective function and its side effect is that the objective function is not always descent. Thus it is also called as mini-batch gradient descent.

There is fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.


The fluctuations in the objective function as gradient steps with respect to mini-batches are taken

An heuristic proposal for avoiding the choice and for modifying the learning rate while the learning task runs is the bold driver (BD) method[^14]. The learning rate increases exponentially if successive steps reduce the objective function $f$, and decreases rapidly if an "accident" is encountered (if objective function $f$ increases), until a suitable value is found. After starting with a small learning rate, its modifications are described by the following equation: $$ \alpha_{k+1}= \begin{cases} \rho \alpha_{k}, & {f(\theta^{k+1})< f(\theta^{k})}; \ \eta^n \alpha_{k}, & {f(\theta^{k+1})> f(\theta^{k})} \text{using ${\alpha}_k$}, \end{cases} $$

where $\rho$ is close to 1 such as $\rho=1.1$ in order to avoid frequent "accidents" because the objective function computation is wasted in these cases, $\eta$ is chosen to provide a rapid reduction ($\eta = 0.5$), and $n$ is the minimum integer such that the reduced rate $\eta^n$ succeeds in diminishing the objective function.[^13]

Except the learning rate or step length, there is yet another hyperparameter - the batch size $m$ in each iteration.


The fact that the sample size is far larger than the dimension of parameter, $n\gg p$, that makes it expensive to compute total objective function $f(\theta)=\sum_{i=1}^{n}f(\theta|{x}i)$. Thus it is not clever to determine the learning rate $\alpha_k$ by line search. And most stochastic gradient methods are to find proper step length $\alpha{k}$ to make it converge at least in convex optimization. The variants of gradient descent such as momentum methods or mirror gradient methods have their stochastic counterparts.

  • It is simplest to set the step length a constant, such as ${\alpha}_k=3\times 10^{-3}, \forall k$.
  • There are decay schemes, i.e. the step length ${\alpha}_k$ diminishes such as ${\alpha}_k=\frac{\alpha}{k}$, where $\alpha$ is constant.
  • And another strategy is to tune the step length adaptively such as AdaGrad, ADAM.

$\color{lime}{PS}$: the step length $\alpha_k$ is called learning rate in machine learning. Additionally, stochastic gradient descent is also named as increment gradient methods in some case.

The Differences of Gradient Descent and Stochastic Gradient Descent

We can see some examples to see the advantages of incremental method such as the estimation of mean. Given $x_1, x_2, \dots, x_n$ the mean is estimated as $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$. If now we observed more data $y_1, y_2, \dots, y_m$ from the population, the mean could be estimated by $\frac{n\bar{x}}{m+n} + \frac{\sum_{j=1}^{m} y_j }{m+n} =\frac{n\bar{x}+\sum_{j=1}^{m} y_j}{m+n}$. It is not necessary to summarize ${x_1, \dots, x_n}$.

Another example, it is Newton interpolation formula in numerical analysis. The task is to fit the function via polynomials given some point in th function $(x_i , f(x_i)), i = 1,2, \dots, n$. The Newton form of the interpolating polynomial is given by

$$ P_n(x) = a_0 + a_1 (x-x_1) + a_2 (x-x_1)(x-x_2) + \cdots \ + a_n(x-x_1)(x-x_2)(x-x_3)\cdots (x-x_n). $$

This form is incremental and if another points $(x_{n+1}, f(x_{n+1}))$ is observed we will fit the function $f$ more precisely just by adding another term $a_{n+1}(x-x_1)(x-x_2)\cdots (x-x_n)(x-x_{n+1})$ where the coefficients $a_0, a_1,\cdots, a_n$ are determined by $f(x_1), f(x_2), \cdots, f(x_n)$.

See the following links for more information on stochastic gradient descent.

  • https://www.wikiwand.com/en/Stochastic_gradient_descent
  • http://ruder.io/optimizing-gradient-descent/
  • https://zhuanlan.zhihu.com/p/22252270
  • http://fa.bianp.net/teaching/2018/eecs227at/stochastic_gradient.html
  • A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms
  • A look at SGD from a physicist's perspective - Part 1
  • A look at SGD from a physicists's perspective - Part 2, Bayesian Deep Learning
  • A look at SGD from a physicists's perspective - Part 3, Langevin Dynamics and Applications

Convergence Analysis of Stochastic Gradient Methods

Progress in machine learning (ML) is happening so rapidly, that it can sometimes feel like any idea or algorithm more than 2 years old is already outdated or superseded by something better. However, old ideas sometimes remain relevant even when a large fraction of the scientific community has turned away from them. This is often a question of context: an idea which may seem to be a dead end in a particular context may become wildly successful in a different one. In the specific case of deep learning (DL), the growth of both the availability of data and computing power renewed interest in the area and significantly influenced research directions.

Leon Bottou

  • http://blavatnikawards.org/honorees/profile/leon-bottou/
  • https://leon.bottou.org/projects/sgd
  • https://leon.bottou.org/research/stochastic
  • https://leon.bottou.org/papers/bottou-bousquet-2008
  • Large-Scale Machine Learning with Stochastic Gradient Descent
  • Trade off of Machine Learning
  • The Tradeoffs of Large-scale Learning
  • https://sites.google.com/view/panos-toulis/implicit-sgd
  • http://dustintran.com/blog/on-asymptotic-convergence-of-averaged-sgd
  • https://github.com/ptoulis/implicit-sgd

The stochastic gradient methods generates random/stochastic sequences, which are so different from the classic methods. The convergence of stochastic methods will bring another problem whether the random sequence is convergent to a optimal point in some sense.

Let ${X_1,\cdots, X_n,\cdots}$ be a random sequence and $X$ be a random variable.

Convergence in probability/measure: $$\lim_{n\to\infty}P(|X_n-X|\leq \epsilon )=1\forall \epsilon>0$$

Convergence in mean square: $$\lim_{n\to\infty}\mathbb{E}({|X-X_n|}_2^2)=0$$

Convergence with probability 1(almost surely): $$P(\lim_{n\to\infty}X_n=X)=1$$ Convergence in distribution: $$\lim_{n\to\infty}F_{X_n}(x)=F_X(x)$$ for every $x$ at which $F_X(x)$ is continuous.


  • Lecture Notes: Weak convergence of stochastic processes, Thomas Mikosch1 (2005)
  • EE178: Probabilistic Systems Analysis
  • Convergence of Stochastic Processes
  • https://www.eng.tau.ac.il/~liptser/lectures/lect_new2.pdf
  • http://isl.stanford.edu/~abbas/ee278/lect05.pdf
  • https://www2.math.ethz.ch/education/bachelor/lectures/fs2014/math/bmsc/weak-conv.pdf
  • Convergence of Probability Measure
  • Convergence Analysis of Gradient Descent Stochastic Algorithms
  • Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
  • Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems
  • Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey
  • Convex Relaxations of Convolutional Neural Nets
  • The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
  • Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent
  • The Multiplicative Noise in Stochastic Gradient Descent: Data-Dependent Regularization, Continuous and Discrete Approximation
  • On the Convergence of Perturbed Distributed Asynchronous Stochastic Gradient Descent to Second Order Stationary Points in Non-convex Optimization
  • On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes
  • A Convergence Theory for Deep Learningvia Over-Parameterization
  • Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections
  • Gradient Descent Finds Global Minima of Deep Neural Networks
  • https://zhuanlan.zhihu.com/p/28819506
  • https://zhuanlan.zhihu.com/p/73441350
  • The Loss Surfaces of Multilayer Networks
  • 为什么说随机最速下降法(SGD)是一个很好的方法?
  • Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval

  • https://arxiv.org/abs/1806.06763v1
  • http://web.cs.ucla.edu/~qgu/research.html
  • https://users.ece.cmu.edu/~yuejiec/publications.html

ADAM and More

  • Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic
  • A Survey of Optimization Methods from a Machine Learning Perspective
  • Understanding the Role of Momentum in Stochastic Gradient Methods
  • Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances

Adam-type algorithms

ADAM composes of adaptive step strategies and momentum methods in some sense. It is widely used in deep learning training.

Formally, an Adam-type algorithm is of the following form:

where $x$ is the optimization variable, $g_t$ is a stochastic gradient at step $t$, $\beta_1,t$ is a non-increasing sequence, and $h_t$ is an arbitrary function that outputs a vector having the same dimension as $x$. This class of algorithms is very general, and it includes most of the popular training algorithms such as SGD (Ghadimi, et al., 2013), AdaGrad (Duchi et al., 2011), AMSGrad (Reddi et al., 2018), and Adam (Kingma et al., 2014) as special cases.

  • Adam: A Method for Stochastic Optimization
  • On the convergence of Adam and Beyond
  • Will Adam Algorithms Work for Me?
  • On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization
  • Gentle Introduction to the Adam Optimization Algorithm for Deep Learning
  • A Survey on Proposed Methods to Address Adam Optimizer Deficiencies
  • Fix regularization in Adam Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam
The Differences of Stochastic Gradient Descent and its Variants
  • https://2018.ieeeglobalsip.org/sym/18/AML
  • https://github.com/LiyuanLucasLiu/RAdam
  • https://github.com/CyberZHG/keras-radam
  • http://ruder.io/deep-learning-optimization-2017/
  • http://people.ece.umn.edu/~mhong/Activities.html
  • Adam revisited: a weighted past gradients perspective

YellowFin

We revisit SGD with Polyak's momentum, study some of its robustness properties and extract the design principles for a tuner, YellowFin. YellowFin automatically tunes a single learning rate and momentum value for SGD.

  • YellowFin and the Art of Momentum Tuning
  • https://arxiv.org/abs/1706.03471
  • YellowFin: Adaptive optimization for (A)synchronous systems
  • https://systemsandml.org/
  • YellowFin: An automatic tuner for momentum SGD
  • http://mitliagkas.github.io/
  • https://github.com/JianGoForIt/YellowFin
  • https://cs.stanford.edu/~zjian/

Natasha, Katyusha and Beyond

$\color{green}{PS}$: Zeyuan Allen-Zhu and others published much work on acceleration of stochastic gradient descent.

  • http://www.arxiv-sanity.com/search?q=Zeyuan+Allen-Zhu
  • Natasha: Faster Non-Convex Stochastic Optimization Via Strongly Non-Convex Parameter
  • Natasha 2: Faster Non-Convex Optimization Than SGD
  • Katyusha: The First Direct Acceleration of Stochastic Gradient Methods
  • Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization
  • Katyusha Acceleration for Convex Finite-Sum Compositional Optimization
  • Dissipativity Theory for Accelerating Stochastic Variance Reduction: A Unified Analysis of SVRG and Katyusha Using Semidefinite Programs
  • Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop
  • A Universal Catalyst for First-Order Optimization

Variance Reduction Stochastic Gradient Methods

For general convex optimization, stochastic gradient descent methods can obtain an $O(1/\sqrt{T})$ convergence rate in expectation.

A more general version of SGD is the following $$\omega^{(t)}=\omega^{(t−1)}- g_t(\omega(t−1), \epsilon_t)$$

where $\epsilon_t$ is a random variable that may depend on $\omega^{(t−1)}$. And it is usually assumed that $\mathbb E(g_t(\omega^{(t−1)}, \epsilon_t)\mid \omega^{(t−1)} = \nabla f(\omega^{(t−1)})$.

Randomness introduces large variance if $g_t(\omega^{(t−1)}, \epsilon_t)$ is very large, it will slow down the convergence.


Procedure SVRG

  • input: update frequency $m$ and learning rate $\eta$
  • initialization: $\tilde{\omega}_0$
  • for $s=1,2,\cdots$ do
    • $\tilde w=\tilde w_{s-1}$
    • $\tilde{ \mu}=\nabla f(\tilde w)=\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{w})$
    • $\omega_0=\tilde{\omega}$
    • Randomly pick $i_t \in {1, ..., n}$ and update weight, repeat $m$ times $\omega^{(t)}=\omega^{(t−1)}- \eta_t [g_t(\omega(t−1), \epsilon_t)-g_t(\tilde\omega, \epsilon_t) - \tilde{ \mu}]$
    • option I: set $\tilde{\omega}_s={\omega}_m$
    • option II: set $\tilde{\omega}_s={\omega}_t$ for randomly chosen $t \in {1, ..., n-1}
  • end for

  • Stochastic Gradient Descent with Variance Reduction
  • Variance reduction for stochastic gradient methods
  • https://www.di.ens.fr/~fbach/fbach_tutorial_siopt_2017.pdf
  • Variance-Reduced Stochastic Learning by Networked Agents under Random Reshuffling
  • Variance Reduction in Stochastic Gradient Langevin Dynamics
  • https://zhizeli.github.io/
  • http://bigml.cs.tsinghua.edu.cn/~jianfei/
  • https://richtarik.org/i_papers.html
  • VR-SGD
  • http://ranger.uta.edu/~heng/CSE6389_15_slides/SGD2.pdf
  • Variance reduction techniques for stochastic optimization
  • Stochastic Variance-Reduced Optimization for Machine Learning Part I
  • Fast Variance Reduction Method with Stochastic Batch Size
  • Laplacian Smoothing Gradient Descent
  • Entropy SGD
  • Acceleration of SVRG and Katyusha X by Inexact Preconditioning
  • Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization
  • A unified variance-reduced accelerated gradient method for convex optimization
  • https://github.com/tdozat/Optimization
  • https://zhuanlan.zhihu.com/p/25473305
  • http://ranger.uta.edu/~heng/CSE6389_15_slides/
  • https://caoxiaoqing.github.io/2018/05/11/SVRG%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB%E7%AC%94%E8%AE%B0/
  • https://json0071.gitbooks.io/svm/content/sag.html
  • http://www.cs.toronto.edu/~jmartens/research.html
  • https://homepage.cs.uiowa.edu/~tyng/
  • http://tongzhang-ml.org/publication.html
  • Fast Stochastic Variance Reduced Gradient Method with Momentum Acceleration for Machine Learning

Escape From Saddle Points

Saddle points is considered as the fundamnetal problem in high dimensional space when training a deep learning model.

Noisy gradient descent can find a local minimum of strict saddle functions in polynomial time.

It is the inherent nature of stochastic gardient methods to escape saddle points.

  • http://opt-ml.org/papers/OPT2017_paper_44.pdf
  • Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition
  • How to Escape Saddle Points Efficiently
  • Escaping from Saddle Points
  • First-order Methods Almost Always Avoid Saddle Points
  • Revisiting Normalized Gradient Descent: Fast Evasion of Saddle Points
  • Escaping Saddle Points: from Agent-based Models to Stochastic Gradient Descent
  • Heavy-ball Algorithms Always Escape Saddle Points
  • https://users.cs.duke.edu/~rongge/
  • http://sites.utexas.edu/mokhtari/
  • On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points
  • Escaping Undesired Stationary Points in Local Saddle Point Optimization A Curvature Exploitation Approach
  • SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

Stochastic Fixed Point Equations

The Feynman-​-Kac formula implies that every suitable classical solution of a semilinear Kolmogorov partial differential equation (PDE) is also a solution of a certain stochastic fixed point equation (SFPE). In this article we study such and related SFPEs. In particular, the main result of this work proves existence of unique solutions of certain SFPEs in a general setting. As an application of this main result we establish the existence of unique solutions of SFPEs associated with semilinear Kolmogorov PDEs with Lipschitz continuous nonlinearities even in the case where the associated semilinear Kolmogorov PDE does not possess a classical solution.

  • On existence and uniqueness properties for solutions of stochastic fixed point equations

Stochastic Proximal Point Methods

  • Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity
  • Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization
  • A Stochastic Proximal Point Algorithm for Saddle-Point Problems
  • Stochastic Proximal Gradient Algorithm and it's application to sum of least squares
  • Proximal Stochastic Gradient Method with Variance Reduction
  • https://emtiyaz.github.io/papers/uai2016.pdf
  • https://mwang.princeton.edu/
  • https://ajwagen.github.io/adsi_learning_and_control/
  • https://homepage.cs.uiowa.edu/~yxu71/

Stochastic Coordinate Fixed-point Iteration

There are examples such as Forward-Backward, Douglas-Rachford,... for finding a zero of a sum of maximally monotone operators or for minimizing a sum of convex functions.

Random block-coordinate Krasnoselskiı–Mann iteration
  • for $n=0,\cdots$
    • for $i=1, \cdots, m$
      • $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_i(x_{1,n},\cdots, x_{m, n})+a_{i, n}-x_{i, n})$

where

  • $x_0, a_n\in \mathbb H$ and $\mathbb H$ is separable real Hilbert space,
  • $\epsilon_{n}$ is random variable in ${0,1}^m\setminus \mathbf{0}$,
  • $\lambda_n\in (0, 1)$ and $\liminf \lambda_n>0$ and $\limsup \lambda_n<1$,
  • the mapping $\mathrm T:\mathbb H\to \mathbb H$ i.e. $x\mapsto (T_1x, \cdots, T_i x, \cdots, T_m x)$ is nonexpansive operator.
Double-layer random block-coordinate algorithms
  • for $n=0, 1, \cdots$
    • $y_n =\mathrm R_n x_n + b_n$
    • for $i=1, \cdots, m$
      • $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_{i, n}(y_n)+a_{i, n}-x_{i, n})$
Random block-coordinate Douglas-Rachford splitting
  • https://www.ljll.math.upmc.fr/~plc/sestri/pesquet2014.pdf
  • https://arxiv.org/abs/1404.7536
  • https://arxiv.org/abs/1406.6404
  • https://arxiv.org/abs/1406.5429
  • https://arxiv.org/abs/1706.00088
Random block-coordinate forward-backward splitting
  • Stochastic Block-Coordinate Fixed Point Iterations with Applications to Splitting
  • Stochastic Quasi-Fejer Block-Coordinate Fixed Point Iterations with Random Sweeping
  • LINEAR CONVERGENCE OF STOCHASTIC BLOCK-COORDINATE FIXED POINT ALGORITHMS
Asynchronous coordinate fixed-point iteration
  • ARock: An Algorithmic Framework for Asynchronous Parallel Coordinate Update
  • Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties

  • Asynchronous Parallel Greedy Coordinate Descent

Stochastic Optimization

Stochastic optimization problems are so diverse that the field has become fragmented into a Balkanized set of communities with competing algorithmic strategies and modeling styles. It is easy to spend a significant part of a career mastering the subtleties of each perspective on stochastic optimization, without recognizing the common themes.

We have developed a unified framework that covers all of these books. We break all sequential decision problems into five components: state variables, decision variables, exogenous information variables, the transition functionand theobjective function`, where we search over policies (functions) for making decisions. We have identified two core strategies for designing policies (policy search, and policies based on lookahead approximations), each of which further divides into two classes, creating four classes that cover all of the solution approaches.

  • Clearing the Jungle of Stochastic Optimization

Distributed Optimization Methods

Beginning around ten years ago, the single-threaded CPU performance stopped improving significantly, due to physical limitations; it is the numbers of cores in each machine that continue to arise. Today we can buy 8-core phones, 64-core workstations, and 2.5k-core GPUs at affordable prices. On the other hand, most of our algorithms are still single-threaded, and because so, their running time is about the same as it was to ten years ago and will stay so in the future, say, ten or twenty years. To develop faster algorithms, especially for those large-scale problems that arise in signal processing, medical imaging, and machine learning, it is inevitable to consider parallel computing. Machine learning especially deep learning requires more powerful distributed and decentralized optimization methods.

Large scale supervised machine learning methods, which are based on gradient to improve their performance, need online near-real-time feedback from massive users.

  • Math 285J: A First Course on Large-Scale Optimization Methods
  • http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html
  • Data Science meet optimzation
  • Distributed optimization for control and learning
  • ME 555: Distributed Optimization
  • https://www.euro-online.org/websites/dso/
  • https://labs.criteo.com/2014/09/poh-part-3-distributed-optimization/
  • Convex and Distributed Optimization
  • (749g) Accelerated Parallel Alternating Method of Multipliers (ADMM) for Distributed Optimization
  • 8th IEEE Workshop Parallel / Distributed Computing and Optimization (PDCO 2018)
  • Distributed Optimization and Control
  • ADOPT: Asynchronous Distributed Constraint Optimization with Quality Guarantees
  • On Distributed Optimization in Networked Systems
  • Distributed Optimization and Control using Operator Splitting Methods
  • Foundations of Distributed and Large Scale Computing Optimization
  • Distributed Optimization of Large-Scale Complex Networks
  • Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization
  • NOVEL GRADIENT-TYPE OPTIMIZATION ALGORITHMS FOR EXTREMELY LARGE-SCALE NONSMOOTH CONVEX OPTIMIZATION
  • Projects: Structure Exploitation in Large-Scale Non-Convex Optimisation
  • A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm
  • http://ecee.colorado.edu/marden/files/dist-opt-journal.pdf
  • Hemingway: Modeling Distributed Optimization Algorithms
  • http://principlesofoptimaldesign.org/
  • https://www.researchgate.net/project/Distributed-Optimization-and-Online-Learning
  • A continuous-time analysis of distributed stochastic gradient

Parallelizing Stochastic Gradient Descent

  • Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Elastic Stochastic Gradient Descent

It is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master.

The loss function of Elastic-SGD $$x^{\ast}=\arg\min_{x, x^1, x^N}\frac{1}{N}\sum_{n=1}^{N} f(x^n)+\frac{1}{2\rho N}|x-x^n|^2.$$

  • Deep learning with Elastic Averaging SGD

Parle

Parle exploits the phenomenon of wide minima that has been shown to improve generalization performance of deep networks and trains multiple "replicas" of a network that are coupled to each other using attractive potentials. It requires infrequent communication with the parameter server and is well-suited to singlemachine-multi-GPU as well as distributed settings.

The method replace the loss function by a smoother loss called local entropy $$f_{\gamma}^{\beta}(x)=-\frac{1}{\beta}\log(G_{\gamma/\beta}\times \exp(-\beta f(x)))$$ where $G_{\gamma/\beta}$ is the Gaussian kernel with variance $\gamma/\beta$.

Parle solves for $$x^{\ast}=\arg\min_{x, x^1, x^N}\sum_{n=1}^{N} f_{\gamma}^{\beta}(x^n)+\frac{1}{2\rho N}| x - x^n|^2.$$

  • Parle: parallelizing stochastic gradient descent
  • Unraveling the mysteries of stochastic gradient descent on deep neural networks

Asynchronous Stochastic Gradient Descent

Asynchronous Stochastic Gradient (shared memory):

  • Each thread repeatedly performs the following updates:
    • For $t = 1, 2, \cdots$
      • Randomly pick an index $i$
      • $x\leftarrow x - \nabla f_i(x)$.

Main trick: in shared memory systems, every threads can access the same parameter $x$.

Asynchronous Accelerated Stochastic Gradient Descent:

  • Asynchronous Stochastic Gradient Descent with Delay Compensation
  • Asynchronous Decentralized Parallel Stochastic Gradient Descent
  • https://github.com/AbduElturki/Asynchronous-Stochastic-Gradient-Descent
  • Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates
  • Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates
  • Hybrid Stochastic Gradient Descent Algorithms for Stochastic Nonconvex Optimization

Hogwild

Hogwild allows processors access to shared memory with the possibility of overwriting each other's work.

  • Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

  • https://www.podc.org/data/podc2018/podc2018-tutorial-alistarh.pdf
  • https://www.math.ucla.edu/~wotaoyin/math285j.18
  • http://seba1511.net/dist_blog/

Distributed Nesterov-like gradient methods

  • Distributed Nesterov-like Gradient Algorithms
  • Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks
  • Accelerated Distributed Nesterov Gradient Descent for Convex and Smooth Functions
  • https://nali.seas.harvard.edu/
  • http://users.isr.ist.utl.pt/~jxavier/

Parallel Coordinate Methods

  • Synchronized Parallel Coordinate Descent
  • DISTRIBUTED ASYNCHRONOUS COMPUTATION OF FIXED POINTS
  • An Inertial Parallel and Asynchronous Fixed-Point Iteration for Convex Optimization

Distributed Non-convex Optimization Problems

  • Distributed Non-Convex First-Order Optimization and Information Processing: Lower Complexity Bounds and Rate Optimal Algorithms

Stochastic ADMM

Linearly constrained stochastic convex optimization is given by $$ \min_{x,y}\mathbb{E}{\vartheta}[F(x,\vartheta)]+h(y),\ s.t. , Ax+By = b, x\in\mathbb{X}, y\in\mathbb{Y}. $$ where typically the expectation $\mathbb{E}{\vartheta}[F(x,\vartheta)]$ is some loss function and ${h}$ is regularizer to prevent from over-fitting.

The first problem is that the distribution of $\vartheta$ is unknown as well as the expectation $\mathbb{E}_{\vartheta}[F(x,\vartheta)]$ in the objective function.

Modified Augmented Lagrangian

Linearize $f(x)$ at $x_k$ and add a proximal term:

$$ L_{\beta}^{k}(x,y,\lambda):= f(x_k)+\left<x_k, g_k\right>+h(y)-\left< \lambda, Ax+By-b\right>+\frac{\beta}{2}{|Ax + By-b|}2^2 \+\frac{1}{2\eta_k}|x-x{k}|^2 $$

where $g_k$ is a stochastic (sub)gradient of ${f}$.

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{k}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{k}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
  • Stochastic ADMM
  • Accelerated Variance Reduced Stochastic ADMM
  • Towards optimal stochastic ADMM or the talk in ICML
  • V-Cycle or Double Sweep ADMM
  • https://arxiv.org/abs/1903.01786

Distributed ADMM

Let us assume that the function $F$ has the following decomposition where each $x_i$ has its own dimension. $$ \min_{x} F(x){=\sum_{i=1}^{n}f_i(x_i)}, \ s.t. \quad x\in\mathrm{C} $$

We can reformulate the problem to get $$ \min_{x} F(x) + g(y), \ s.t.\quad x=y $$

where $g(z)$ is the indictor function of set $\mathrm{C}$.

We can de ne an augmented Lagrangian $$ L_{\beta}(x, y, \lambda) = F(x)+g(y) - \lambda^{T}(x-y) + \frac{\beta}{2}{|x-y|}{2}^{2} \ = \sum{i=1}^{n} [f_i(x_i) - {\lambda}i(x_i -y_i)+\frac{\beta}{2}(x_i - y_i)^2] + g(y)\ = \sum{i=1}^{n}L_{(\beta,i)}(x_i, y, \lambda_i). $$

We can split the optimization over $x_i$:

  1. $x_i^{k+1} =\arg\min_{x_i} L_{(\beta,i)}(x_i, y^{k}, \lambda_i^{k})\quad i=1,2,\cdots, n;$
  2. $y^{k+1} =\arg\min_{x_i} L_{(\beta,i)}(x_i^{\color{green}{k+1}}, y, \lambda_i^k);$
  3. $\lambda^{k+1}=\lambda^k+\lambda (x^{\color{green}{k+1}} - y^{\color{green}{k+1}}).$

Then optimization over $x_i$ is done in parallel. Subsequently, the results are communicated back to a master node which performs the $y$ update (usually just a projection as in the example above) and returns back the result to other worker nodes. The update over $\lambda_i$ is again done independently.

  • http://users.isr.ist.utl.pt/~jmota/DADMM/
  • http://repository.ust.hk/ir/Record/1783.1-66353
  • https://ieeexplore.ieee.org/document/7039496
  • DISTRIBUTED ASYNCHRONOUS OPTIMIZATION WITH THE ALTERNATING DIRECTION METHOD OF MULTIPLIERS
  • Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers by S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein
  • Asynchronous Distributed ADMM for Consensus Optimization
  • Notes on Distributed ADMM

Distributed Optimization: Analysis and Synthesis via Circuits

  • Distributed Optimization: Analysis and Synthesis via Circuits

General form: $$\sum_{i=1}^{n}f_i(x_i),\ s.t. x_i= E_i y$$

Resource on Distributed Optimization Methods

  • DIMACS 2012-2017 Special Focus on Information Sharing and Dynamic Data Analysis
  • DIMACS Workshop on Distributed Optimization, Information Processing, and Learning
  • https://mwang.princeton.edu/
  • https://hkumath.hku.hk/~zhengqu/
  • https://zhangzk.net/
  • http://www.lx.it.pt/~bioucas/
  • https://jliang993.github.io/index.html
  • http://sites.utexas.edu/mokhtari/
  • ORQUESTRA - Distributed Optimization and Control of Large Scale Water Delivery Systems
  • Ray is a fast and simple framework for building and running distributed applications.
  • CoCoA: A General Framework for Communication-Efficient Distributed Optimization
  • Federated Optimization: Distributed Machine Learning for On-Device Intelligence
  • http://web.stanford.edu/class/ee364b/lectures.html
  • http://mjiit.utm.my/bio-ist/optimal-control-optimization/

Surrogate Optimization

It is a unified principle that we optimize an objective function via sequentially optimizing surrogate functions such as EM, ADMM. In another word, these methods do not dircetly optimze the objective function.

It is obvious that the choice of optimization methods relies on the objective function. Surrogate loss transform the original problem $\min_{x} f(x)$ into successive trackable subproblems: $$ x^{k+1}=\arg\min_{x} Q_{k}(x). $$

We will call $Q_k(x)$ surrogate function. Surrogate function is also known as merit function.

As a Fabian Pedregosa asserted, a good surrogate function should:

  • Approximate the objective function.
  • Easy to optimize.

This always leads to the convexification technique where the surrogate function is convex. Usually, $f(x^{k+1})\leq f(x^k)$ such as EM or MM. Generaly, it is required that $\lim_{k\to\infty}x^k\in\arg\min_{x}f(x)$.

For example, we can rewrite gradient descent in the following form $$ x^{k+1}=\arg\min_{x} {f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}{|x-x^k|}_2^2}. $$

In Newton's method, we approximate the objective with a quadratic surrogate of the form: $$ Q_k(x) = f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}(x-x^k)^{T}H_k(x-x^k). $$

Note that the Hessian matrix $H_k$ is supposed to be positive definite. The quasi-Newton methods will approximate the Hessian matrix with some inverse symmetric matrix. And they can rewrite in the principle of surrogate function, where the surrogate function is convex in the form of linear function + squared functions in some sense.

Note that momentum methods can be rewrite in the surrogate function form: $$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}) \ = \arg\min_{x}{f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{2\alpha_k} {|x-x^k-\rho_k(x^k-x^{k-1})|}_2^2}. $$

$\color{aqua}{PS}$: How we can extend it to Nesterov gradient methods or stochastic gradient methods?

  • Discover acceleration

It is natural to replace the squared function with some non-negative function such as mirror gradient methods $$ x^{k+1} = \arg\min_{x} { f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{\alpha_k} B(x,x^k)}. $$

More generally, auxiliary function $g_k(x)$ may replace the Bregman divergence $B(x, x^k)$, where the auxiliary functions $g_k(x)$ have the properties $g_k(x)\geq 0$ and $g_k(x^{k-1})=0$.

  • http://fa.bianp.net/teaching/2018/eecs227at/
  • http://fa.bianp.net/teaching/2018/eecs227at/newton.html
  • http://faculty.uml.edu/cbyrne/NBHSeminar2015.pdf

The parameters $\alpha_k$ is chosen so that the surrogate function is convex by observing that the condition number of Hessian matrix is determined by the factor $\frac{1}{\alpha_k}$. And gradient are used in these methods to construct the surrogate function. There are still two problems in unconstrained optimization:

  1. If the cost function is not smooth or differential such as absolute value function, the gradient is not available so that it is a problem to construct a convex surrogate function without gradient;
  2. In another hand, there is no unified principle to construct a convex surrogate function if we know more information of higher order derivatives.

Practically, we rarely meet pure black box models; rather, we know something about structure of underlying problems One possible strategy is:

  1. approximate nonsmooth objective function by a smooth function
  2. optimize smooth approximation instead (using, e.g., Nesterov's accelerated method).

A convex function $f$ is called $(\alpha, \beta)$-smoothable if, for any $\mu > 0, \exists$ convex function $f_{\mu}$ s.t.

  • $f_{\mu}(x) \leq f(x) \leq f_{\mu}(x) + \beta\mu, \forall x$
  • $f_{\mu}(x)$ is $\frac{\alpha}{\mu}$ smooth.

Moreau envelope (or Moreau-Yosida regularization) of a convex function $f$ with parameter $\mu > 0$ is defined as $$ M_{\mu f}(x)=\inf_{z}{f(z) + \frac{1}{2\mu}{|z-x|}_2^2}. $$

Minimizing $f$ and $M_f$ are equivalent. $prox_{\mu f} (x)$ is unique point that achieves the infimum that defines $M_{\mu f}$ , i.e.

$$ M_{\mu f}(x)=f(prox_{\mu f} (x)) + \frac{1}{2}{|prox_{\mu f} (x)-x|}_2^2}. $$

Moreau envelope $M_f$ is continuously differentiable with gradients $$ \nabla M_{\mu f} = \frac{1}{\mu}(prox_{\mu f} (x)-x). $$

This means $$ prox_{\mu f} (x)=x-\mu \nabla M_{\mu f} $$

i.e., $prox_{\mu f} (x)$ is gradient step for minimizing $M_{\mu f}$.

Fenchel Conjugate of a function ${h}$ is the function $h^{\star}$ defined by

$$ h^{\star}(x)=\sup_{z} {\left<z, x\right>-h(z)}. $$

  • Smoothing for non-smooth optimization, ELE522: Large-Scale Optimization for Data Science
  • Smoothing, EE236C (Spring 2013-14)
  • Smoothing and First Order Methods: A Unified Framework
  • Smooth minimization of non-smooth functions
  • https://www.zhihu.com/question/312027177/answer/893936302
  • A unifying principle: surrogate minimization
  • Optimization with First-Order Surrogate Functions
  • Optimization Transfer Using Surrogate Objective Functions

Charles Byrne gives a unified treatment of some iterative optimization algorithms such as auxiliary function methods.

Young Recent Now

He is a featured author of CRC press and professor in UML

Auxiliary-Function Methods minimize the function $$ G_k(x) = f(x) + g_k(x) $$ over $x\in \mathbb{S}$ for $k=1,2,\cdots$ if $f(x):\mathbb{S}\mapsto\mathbb{R}$ and $g_k(x)$ is auxiliary function. We do not linearize the cost function as the previous methods.

Auxiliary-Function Methods(AF methods) include

  • Barrier- and Penalty-Function Algorithms such as sequential unconstrained minimization (SUM) methods, interior-point methods exterior-point methods;
  • Proximal Minimization Algorithms such as majorization minimization.

And an AF method is said to be in the SUMMA class if the SUMMA Inequality holds:

$$ G_k(x) - G_k(x^k)\geq g_{k+1}(x), \forall x\in \mathbb{S}. $$

Proximal minimization algorithms using Bregman distances, called here PMAB, minimize the function

$$ G_k(x) = f(x) + B(x, x^{k-1}),\forall x\in \mathbb{S}. $$

The forward-backward splitting (FBS) methods is to minimize the function $$f_1(x) + f_2(x)$$ where both functions are convex and $f_2(x)$ is differentiable with its gradient L-Lipschitz continuous in the Euclidean norm. The iterative step of the FBS algorithm is $$ x^{k} = \operatorname{prox}_{\gamma f_1}(x^{k-1}-\gamma \nabla f_2(x^{k-1})). $$

It is equivalent to minimize $$ G_k(x) = f(x) + \frac{1}{2\gamma} {|x-x^{k-1}|}_2^2-B(x, x^{k-1}), $$

where $B(x,x^{k-1})=f_1(x)-f_1(x^{k-1})-\left<\nabla f_1(x^{k-1}),x-x^{k-1}\right>,, 0 < \gamma\leq \frac{1}{L}$. Alternating Minimization (AM) can be regarded as coordinate optimization methods with 2 blocks:

  • $p^{n+1}=\arg\min_{p\in P} f(p,q^n)$,
  • $q^{n+1}=\arg\min_{p\in Q} f(p^{n+1},q)$,

where $f(p, q), p\in P, q\in Q$ is the objective function. It is proved that the sequence $f(p^n, q^n)$ converge to the minimizer if the Five-Point Property hold: $$ f(p, q) + f(p, q^{n-1}) \geq f(p, q^{n}) + f(p^n, q^{n-1}). $$

For each p in the set P, defi ne $q(p)$ in Q as a member of Q for which $f(p; q(p)) \leq f(p; q)$, for all $q \in P$. Let $\hat{f}(p) = f(p; q(p))$. At the nth step of AM we minimize $$G_n(p)=f(p; q(p))+[f(p; q^{n-1})-f(p; q(p))]$$ to get $p^n$, where $g_n(p)=f(p; q^{n-1})-f(p; q(p))$ is the auxiliary function. So we can write that $G_n(p)= f(p;q(p))+g_n(p)$.

See Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem or Alternating Minimization, Proximal Minimization and Optimization Transfer Are Equivalent for more information on AF methods.

Expectation-Maximization can be classified to AF method.

  • $Q(\theta|\theta^{(t)})=\mathbb{E}(\ell(\theta|Y_{obs}, Z)|Y_{obs},\theta^{(t)})$;
  • $\theta^{(t+1)}=\arg\min_{\theta} Q(\theta|\theta^{(t)})$.

The $Q(\theta|\theta^{(t)})$ function is log-likelihood function of complete data $(Y_{os}, Z)$ given $(Y_{obs}, \theta^{(t)})$.

  • Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem
  • Alternating Minimization, Proximal Minimization and Optimization Transfer Are Equivalent
  • A unified treatment of some iterative algorithms in signal processing and image reconstruction
  • Convergence Rate of Expectation-Maximization

Optimization using surrogate models applied to Gaussian Process models (Kriging).

  • Optimization using surrogate models
  • Surrogate loss function
  • Divergences, surrogate loss functions and experimental design
  • Surrogate Regret Bounds for Proper Losses
  • Bregman Divergences and Surrogates for Learning
  • Meet the Bregman Divergences
  • Some Theoretical Properties of an Augmented Lagrangian Merit Function
  • https://people.eecs.berkeley.edu/~wainwrig/stat241b/lec11.pdf
  • http://fa.bianp.net/blog/2014/surrogate-loss-functions-in-machine-learning/
  • Sun Y, Babu P, Palomar D P. Majorization-minimization algorithms in signal processing, communications, and machine learning[J]. IEEE Transactions on Signal Processing, 2016, 65(3): 794-816

Relaxation and Convexification

  • https://www.di.ens.fr/~aspremon/PDF/Oxford14.pdf
  • https://www.zhihu.com/question/66550233/answer/253597280

Block Coordinate Descent

The methods such as ADMM, proximal gradient methods do not optimize the cost function directly. For example, we want to minimize the following cost function $$ f(x,y) = g(x) + h(y) $$

with or without constraints. Specially, if the cost function is additionally separable, $f(x) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)$, we would like to minimize the sub-function or component function $f_i(x_i), i=1,2,\cdots, n$ rather than the cost function itself

$$ \min_{x} \sum_{i} f_i(x_i) \leq \sum_{i}\min_{x_i}{f_i(x_i)}. $$

And ADMM or proximal gradient methods are to split the cost function to 2 blocks, of which one is differentiable and smooth while the other may not be differentiable. In another word, we can use them to solve some non-smooth optimization problem. However, what if there is no constraints application to the optimization problem?

Coordinate descent is aimed to minimize the following cost function $$ f(x) = g(x) +\sum_{i}^{n} h_i(x_i) $$

where $g(x)$ is convex, differentiable and each $h_i(x)$ is convex. We can use coordinate descent to find a minimizer: start with some initial guess $x^0$, and repeat for $k = 1, 2, 3, \dots$:


  1. $x_{1}^{k} \in\arg\min_{x_1}f(x_1, x_2^{k-1}, x_3^{k-1}, \dots, x_n^{k-1});$
  2. $x_{2}^{k} \in\arg\min_{x_1}f(x_1^{\color{red}{k}}, x_2,x_3^{k-1},\dots, x_n^{k-1});$
  3. $x_{3}^{k} \in\arg\min_{x_1}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3,\dots, x_n^{k-1});$
  4. $\vdots$
  5. $x_{n}^{k} \in\arg\min_{x_1}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3^{\color{red}{k}},\dots, x_n).$

It can extended to block coordinate descent(BCD) if the variables ${x_1, x_2, \dots, x_n}$ are separable in some blocks.


  • http://bicmr.pku.edu.cn/conference/opt-2014/index.html
  • https://calculus.subwiki.org/wiki/Additively_separable_function
  • https://www.cs.cmu.edu/~ggordon/10725-F12/slides/25-coord-desc.pdf
  • http://bicmr.pku.edu.cn/~wenzw/opt2015/multiconvex_BCD.pdf
  • http://pages.cs.wisc.edu/~swright/LPS/sjw-abcr-v3.pdf
  • MS72: Recent Progress in Coordinate-wise Descent Methods
  • On the Convergence of Block Coordinate Descent Type Methods
  • Fast Coordinate Descent Methods with Variable Selection for NMF
  • https://ryancorywright.github.io/
  • http://www.optimization-online.org/DB_FILE/2014/12/4679.pdf
  • https://www.math.ucdavis.edu/~sqma/MAT258A_Files/Beck-CD-2013.pdf
Nesterov's Random Coordinate Descent Algorithms
  • On Nesterov's Random Coordinate Descent Algorithms
  • On Nesterov's Random Coordinate Descent Algorithms - Continued

Block Splitting Methods

In particular, if the problem data includes a large linear operator or matrix $A$, the method allows for handling each subblock of $A$ on a separate machine. The approach works as follows. First, we define a canonical problem form called graph form, in which we have two sets of variables $x$ and $y$ related by a linear operator $A$, such that the objective function is separable across these two sets of variables. Many problems are easily expressed in graph form, including cone programs and a wide variety of regularized loss minimization problems from statistics, like logistic regression, the support vector machine, and the lasso. Next, we describe graph projection splitting, a form of Douglas-Rachford splitting or the alternating direction method of multipliers, to solve graph form problems serially. Finally, we derive a distributed block splitting algorithm based on graph projection splitting. In a statistical or machine learning context, this allows for training models exactly with a huge number of both training examples and features, such that each processor handles only a subset of both. To the best of our knowledge, this is the only general purpose method with this property. We present several numerical experiments in both the serial and distributed settings.

  • Block Splitting for Distributed Optimization by N. Parikh and S. Boyd
  • Block Splitting for Large-Scale Distributed Learning
  • https://www.researchgate.net/scientific-contributions/70614203_Neal_Parikh

Block Relaxation Methods

  • Block Relaxation Methods in Statistics by Jan de Leeuw
  • https://deleeuw.rbind.io/
  • Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs
  • On the convergence of block relaxation methods for circuit simulation
  • Uzawa block relaxation method for the unilateral contact problem
  • Block Relaxation, Augmentation, Majorization, Alternating Least Squares
  • Block-relaxation Algorithms in Statistics
  • Relaxation Methods for Convex Problems

The methods discussed in the book Block Relaxation Methods in Statistics are special cases of what we shall call block-relaxation methods, although other names such as decomposition or nonlinear Gauss-Seidel or ping-pong or seesaw methods have also been used.

block relaxation methods

In a block relaxation method we minimize a real-valued function of several variables by partitioning the variables into blocks. We choose initial values for all blocks, and then minimize over one of the blocks, while keeping all other blocks fixed at their current values. We then replace the values of the active block by the minimizer, and proceed by choosing another block to become active. An iteration of the algorithm steps through all blocks in turn, each time keeping the non-active blocks fixed at current values, and each time replacing the active blocks by solving the minimization subproblems. If there are more than two blocks there are different ways to cycle through the blocks. If we use the same sequence of active blocks in each iteration then the block method is called cyclic.

In the special case in which blocks consist of only one coordinate we speak of the coordinate relaxation method or the coordinate descent (or CD) method. If we are maximizing then it is coordinate ascent (or CA). The cyclic versions are CCD and CCA.

Augmentation Methods

Augmentation and Decomposition Methods Note: augmentation duality. $$ h(y)=\min_{x\in\mathcal{X}} g(x,y) $$ then $$ \min_{x\in\mathcal{X}}f(x)=\min_{x\in\mathcal{X}}\min_{y\in\mathcal{Y}}g(x,y)=\min_{y\in\mathcal{Y}}\min_{x\in\mathcal{X}}g(x,y)=\min_{y\in\mathcal{Y}}h(y). $$

Alternating Conditional Expectations

The alternating descent conditional gradient method

A ubiquitous prior in modern statistical signal processing asserts that an observed signal is the noisy measurement of a few weighted sources. In other words, compared to the entire dictionary of possible sources, the set of sources actually present is sparse. In many cases of practical interest the sources are parameterized and the measurement of multiple weighted sources is linear in their individual measurements.

As a concrete example, consider the idealized task of identifying the aircraft that lead to an observed radar signal. The sources are the aircraft themselves, and each is parameterized by, perhaps, its position and velocity relative to the radar detectors. The sparse inverse problem is to recover the number of aircraft present, along with each of their parameters.

Convex Relaxations

Convex relaxations are one of the most powerful techniques for designing polynomial time approximation algorithms for NP-hard optimization problems such as Chromatic Number, MAX-CUT, Minimum Vertex Cover etc. Approximation algorithms for these problems are developed by formulating the problem at hand as an integer program.

Majorization Method

Suppose we can find a function $h$ such that $g(\alpha)\leq h(\alpha)$ for all $0<\alpha <\overline{\alpha}$ and such that $g(0)=h(0)$.

Now set $$\alpha^{(k)} :=\arg\min_{0≤\alpha≤\overline{\alpha}} h(\alpha).$$

Then the sandwich inequality says $$g{(\alpha^{(k)})}\leq h(\alpha^{(k)})\leq h(0)=g(0),$$

and thus $f(x^{(k)}−\alpha^{(k)}p(k)) \leq f(x^{(k)})$.

  • A Quadratic Lower Bound for Three-Query Linear Locally Decodable Codes over Any Field
  • Quadratic Majorization
  • Majorization Methods
  • Tangential Majorization
  • Sharp Majorization
  • Using Higher Derivatives
  • Generalized Majorization-Minimization
Quadratic Majorization

A quadratic $g$ majorizes $f$ at $y$ on $\mathbb{R}^n$ if $g(y)=f(y)$ and $g(x)\geq f(x)$ for all $x$. If we write it in the form $$g(x)=f(y)+(x-y)'b+\frac{1}{2} (x-y)'A(x-y)$$

  • Quadratic Majorization

Lagrangian Relaxation

Lagrangian relaxation (Fisher, 2004) is a technique that exploits the fact that many difficult MILP problems can be seen as relatively easy problems complicated by a set of side constraints. These constraints can be transferred to the objective function with associated parameters (Lagrangian multipliers), which impose a penalty on violations. The relaxed problem provides an upper bound (for a maximization problem) on the optimal value of the original problem. To obtain the tightest bound, a problem on the Lagrangian multipliers (Lagrangian dual) needs to be solved. This problem is typically solved with an iterative method called subgradient method, which uses subgradients of the objective function to update the Lagrangian multipliers at each iteration.

  • https://personal.eur.nl/birbil/bolbilim/teaa/02_Lag_Rel.pdf
  • http://www.strc.ch/2018/Paneque_EtAl.pdf
  • The Lagrangian Relaxation Method for Solving Integer Programming Problems
  • http://home.deib.polimi.it/amaldi/SlidesOPT-18-19/IP-lagrangian-relaxation-18-19.pdf
  • http://www.mi.fu-berlin.de/wiki/pub/Main/GunnarKlauP1winter0708/discMath_klau_script_lag_I.pdf
  • http://auai.org/~w-auai/

Deep Relaxation

In this paper we establish a connection between non-convex optimization methods for training deep neural networks and nonlinear partial differential equations (PDEs). Relaxation techniques arising in statistical physics which have already been used successfully in this context are reinterpreted as solutions of a viscous Hamilton-Jacobi PDE. Using a stochastic control interpretation allows we prove that the modified algorithm performs better in expectation that stochastic gradient descent. Well-known PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. The PDE is derived from a stochastic homogenization problem, which arises in the implementation of the algorithm. The algorithms scale well in practice and can effectively tackle the high dimensionality of modern neural networks.

  • Deep Relaxation: partial differential equations for optimizing deep neural networks
  • Deep Relaxation tutorials
  • A PDE APPROACH TO REGULARIZATION IN DEEP LEARNING
  • Deep relaxation: partial differential equations for optimizing deep neural networks
  • pratikac
  • https://pratikac.github.io/pub/chaudhari.thesis18.pdf

  • CS 369S: Hierarchies of Integer Programming Relaxations
  • Convex Relaxations and Integrality Gaps
  • LP/SDP Hierarchies Reading Group
  • Proofs, beliefs, and algorithms through the lens of sum-of-squares
  • Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem
  • Relaxation and Decomposition Methods for Mixed Integer Nonlinear Programming

Convexification

nonconvex

In order for primal-dual methods to be applicable to a constrained minimization problem, it is necessary that restrictive convexity conditions are satisfied. A nonconvex problem can be convexified and transformed into one which can be solved with the aid of primal-dual methods.

  • Convexification and Global Optimization in Continuous and Mixed-Integer Nonlinear Programming: Theory, Algorithms, Software, and Applications
  • Convexification and Global Optimization of Nonlinear Programs
  • Convexification Procedure and Decomposition Methods for Nonconvex Optimization Problem
  • Conic Optimization Theory: Convexification Techniques and Numerical Algorithms
  • Convexification of polynomial optimization problems by means of monomial patterns
  • On convexification/optimization of functionals including an $\ell^2$-misfit term
  • Lossless Convexification of Nonconvex Control Bound and Pointing Constraints of the Soft Landing Optimal Control Problem
  • A General Class of Convexification Transformation for the Noninferior Frontier of a Multiobjective Program
  • Implementation of a Convexification Technique for Signomial Functions
  • A method to convexify functions via curve evolution

We consider problems that are non-convex in the input norm, which is a semi-continuous variable that can be zero or lower- and upper-bounded. Using lossless convexification, the non-convex problem is relaxed to a convex problem whose optimal solution is proved to be optimal almost everywhere for the original problem. The relaxed problem can be solved using second-order cone programming, which is a subclass of convex optimization for which there exist numerically reliable solvers with convergence guarantees and polynomial time complexity. This is the first lossless convexification result for mixed-integer optimization problems. An example of spacecraft docking with a rotating space station corroborates the effectiveness of the approach and features a computation time almost three orders of magnitude shorter than a mixed-integer programming formulation.

Successive Convex Approximation

In this paper, we study an alternative inexact BCD approach which updates the variable blocks by successively minimizing a sequence of approximations of which are either locally tight upper bounds of f or strictly convex local approximations of f. We focus on characterizing the convergence properties for a fairly wide class of such methods, especially for the cases where the objective functions are either non-differentiable or nonconvex.

  • successive convex approxiamtion
  • Successive convex approximation: analysis and applications
  • Decomposition by Successive Convex Approximation: A Unifying Approach for Linear Transceiver Design in Heterogeneous Networks
  • https://arxiv.org/abs/1806.10773
  • https://www.cnblogs.com/kailugaji/p/11731217.html
  • https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1055&context=imse_conf
  • Parallel and Distributed Training of Neural Networks via Successive Convex Approximation
  • https://epubs.siam.org/doi/abs/10.1137/120891009?mobileUi=0
  • A Unified Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization

Sequential Quadratic Programming

  • Sequential quadratic programming
  • https://www.math.uh.edu/~rohop/fall_06/Chapter4.pdf
  • https://ccom.ucsd.edu/~peg/papers/sqpReview.pdf
  • http://php.math.unifi.it/users/cimeCourses/2007/02/200723-Notes.pdf
  • https://cs.nyu.edu/overton/papers/pdffiles/sqpgs.pdf
  • http://www.cs.nthu.edu.tw/~cherung/teaching/2011cs5321/handout8.pdf
  • Sequential Quadratic Programming for Task Plan Optimization

Gradient Free Optimization Methods

As shown in Principle of Optimal Design, non-gradient methods are classified into 3 classes:

We organize the discussion of non-gradient methods in three parts, direct search methods, heuristic methods, and black-box methods. Direct search methods rely on ranking the objective function values rather than using the objective values themselves. Heuristic methods use some random process to generate new candidate solutions, and so the iteration paths and even the solutions obtained can change each time we perform a search. Black-box methods deal with problems that have no known model function structure that we can exploit. For example, functions generated by simulations have no discernible mathematical properties (like convexity), and so we refer to them as black-box functions. In this sense, all nongradient methods can be used for black-box problems. The two black-box methods described in this chapter were created to address design problems with expensive simulations, and so their main goal is to find an optimum quickly with few function evaluations.

Derivative-free optimization (DFO) algorithms differ in the way they use the sampled function values to determine the new iterate. One class of methods constructs a linear or quadratic model of the objective function and defines the next iterate by seeking to minimize this model inside a trust region. We pay particular attention to these model-based approaches because they are related to the unconstrained minimization methods described in earlier chapters. Other widely used DFO methods include the simplex-reflection method of Nelder and Mead, pattern-search methods, conjugate-direction methods, and simulated annealing.

Heuristic methods will be introduced in computational intelligence as well as Bayesian Optimization.

Let us start with the example and suppose that we want to $$ \arg\min_{x}f(x)=x^{2}+4\sin(2x).\tag{1} $$

The objective function $f(x)$, a non-convex function, has many local minimizer or extrema. The function (1) is upper bounded by $x^2+4$ and lower bounded by $x^2 - 4$.

pluskid

Another insightful example is to minimize the following cost function: $$ x^{2}+4\sin(2x) - 1001 \color{red}{\mathbb{I}{\sqrt{2}}}(x) \tag{2} $$ where the last part $\color{red}{\mathbb{I}{\sqrt{2}}}(x)$ is equal to 1 when $x=\sqrt{2}$ and 0 otherwise, a Dirac function. It is almost equal to the first function (1) except at the point $x=\sqrt{2}$. The minimal value of the above function is $2+4\sin(2\sqrt{2})-1001$ when $x=\sqrt{2}$. And these two functions are two different kinds of non-convex functions.

Optimization and Assumptions @ Freemind|Test functions for optimization

  • https://persyval-lab.org/

Zeroth-Order Oracle

Zeroth-Order optimization is increasingly embraced for solving big data and machine learning problems when explicit expressions of the gradients are difficult or infeasible to obtain. It achieves gradient-free optimization by approximating the full gradient via efficient gradient estimators.

  • Learning to Learn by Zeroth-Order Oracle
  • https://openreview.net/forum?id=BJe-DsC5Fm
  • ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models
  • Recent Advances of Zeroth-Order Optimization with Applications in Adversarial ML
  • Adversarial Learning and Zeroth Order Optimization for Machine Learning and Data Mining
  • https://sites.google.com/view/adv-robustness-zoopt
  • Zeroth-Order Optimization and Its Application to Adversarial Machine Learning
  • Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization
  • https://sgo-workshop.github.io/index_2018.html
  • New Advances in Sparse Learning, Deep Networks, and Adversarial Learning: Theory and Applications
  • https://www.researchgate.net/profile/Sijia_Liu2
  • http://web.cs.ucla.edu/~chohsieh/publications.html
  • https://deepai.org/machine-learning/researcher/sijia-liu
  • Zeroth-order (Non)-convex stochastic optimization via conditional gradient and gradient updates

Derivative-Free Optimization

The absence of derivatives, often combined with the presence of noise or lack of smoothness, is a major challenge for optimization. This book explains how sampling and model techniques are used in derivative-free methods and how these methods are designed to efficiently and rigorously solve optimization problems. Introduction to Derivative-Free Optimization is the first contemporary comprehensive treatment of optimization without derivatives. This book covers most of the relevant classes of algorithms from direct search to model-based approaches. It contains a comprehensive description of the sampling and modeling tools needed for derivative-free optimization; these tools allow the reader to better understand the convergent properties of the algorithms and identify their differences and similarities. Introduction to Derivative-Free Optimization also contains analysis of convergence for modified Nelder Mead and implicit-filtering methods, as well as for model-based methods such as wedge methods and methods based on minimum-norm Frobenius models.

DFO is a Fortran package for solving general nonlinear optimization problems that have the following characteristics:

  • they are relatively small scale (less than 100 variables),
  • their objective function is relatively expensive to compute and derivatives of such functions are not available and
  • cannot be estimated efficiently.

There also may be some noise in the function evaluation procedures. Such optimization problems arise ,for example, in engineering design, where the objective function evaluation is a simulation package treated as a black box.

Alongside derivative-based methods, which scale better to higher-dimensional problems, derivative-free methods play an essential role in the optimization of many practical engineering systems, especially those in which function evaluations are determined by statistical averaging, and those for which the function of interest is nonconvex in the adjustable parameters. We are in the process of developing an entire new family of surrogate-based derivative-free optimization schemes. The idea unifying this efficient and (under the appropriate assumptions) provably-globally-convergent family of schemes is the the minimization of a search function which linearly combines a computationally-inexpensive "surrogate" (that is, an interpolation, or in some cases a regression, of recent function evaluations - we generally favor some variant of polyharmonic splines for this purpose), to summarize the trends evident in the data available thus far, with a synthetic piecewise-quadratic "uncertainty function" (built on the framework of a Delaunay triangulation of existing datapoints), to characterize the reliability of the surrogate by quantifying the distance of any given point in parameter space to the nearest function evaluations.

  • Book: Introduction to Derivative-Free Optimization
  • Derivative-free optimization methods
  • http://www.mat.uc.pt/~lnv/talks/siopt-2008.pdf
  • http://www.mat.uc.pt/~lnv/dfo.html
  • https://courses.maths.ox.ac.uk/node/view_material/2246
  • https://projects.coin-or.org/Dfo
  • A Derivative-free approach in Constrained Global Optimization of Dynamic Systems in Preliminary Design Phase
  • Derivative-Free Optimization (DFO)
  • Derivative Free Optimization / Optimisation sans Gradient
  • https://www.ucsdrobotics.org/derivative-free-optimization
Nevergrad
  • https://code.fb.com/ai-research/nevergrad/
  • https://github.com/facebookresearch/nevergrad
  • https://github.com/facebookresearch/nevergrad/blob/master/docs/optimization.md

never grad

Surrogate-based Optimization

Surrogate-Based Optimization

Surrogate-based optimization (Queipo et al. 2005, Simpson et al. 2008) represents a class of optimization methodologies that make use of surrogate modeling techniques to quickly find the local or global optima. It provides us a novel optimization framework in which the conventional optimization algorithms, e.g. gradient-based or evolutionary algorithms are used for sub-optimization(s). Surrogate modeling techniques are of particular interest for engineering design when high-fidelity, thus expensive analysis codes (e.g. Computation Fluid Dynamics (CFD) or Computational Structural Dynamics (CSD)) are used. They can be used to greatly improve the design efficiency and be very helpful in finding global optima, filtering numerical noise, realizing parallel design optimization and integrating simulation codes of different disciplines into a process chain. Here the term "surrogate model" has the same meaning as "response surface model", "metamodel", "approximation model", "emulator" etc.

  • Surrogate-based methods for black-box optimization
  • https://rdrr.io/cran/suropt/
  • Surrogate-Based Optimization
  • https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20050186653.pdf
  • Surrogate-Based Optimization
  • Surrogate-based Optimization using Mutual Information for Computer Experiments (optim-MICE)
  • Surrogate-based analysis and optimization
  • ReLU networks as surrogate models in mixed-integer linear programs
Delaunay-based derivative-free optimization

Like other Response Surface Methods, at each optimization step, the algorithm minimizes a metric combining an interpolation of existing function evaluations and a model of the uncertainty of this interpolation. By adjusting the respective weighting of these two terms, the algorithm incorporates a tunable balance between global exploration and local refinement; a rule to adjust this balance automatically is also presented. Unlike other methods, any well-behaved interpolation strategy may be used. The uncertainty model is built upon the framework of a Delaunay triangulation of existing datapoints in parameter space. A quadratic function which goes to zero at each datapoint is formed within each simplex of this triangulation; the union of each of these quadratics forms the desired uncertainty model. Care is taken to ensure that function evaluations are performed at points that are well situated in parameter space; that is, such that the simplices of the resulting triangulation have circumradii with a known bound. This facilitates well-behaved local refinement as additional function evaluations are performed.

  • Delaunay-based derivative-free optimization via global surrogates, part I: linear constraints
  • Delaunay-based derivative-free optimization via global surrogates, part II: convex constraints
  • Delaunay-based Derivative-free Optimization via Global Surrogate
  • Delaunay-based optimization in CFD leveraging multivariate adaptive polyharmonic splines (MAPS)
  • http://ccom.ucsd.edu/~socalopt/

Graduated Optimization

Another related method is graduated optimization, which is a global optimization technique that attempts to solve a difficult optimization problem by initially solving a greatly simplified problem, and progressively transforming that problem (while optimizing) until it is equivalent to the difficult optimization problem.Further, when certain conditions exist, it can be shown to find an optimal solution to the final problem in the sequence. These conditions are:

  • The first optimization problem in the sequence can be solved given the initial starting point.
  • The locally convex region around the global optimum of each problem in the sequence includes the point that corresponds to the global optimum of the previous problem in the sequence.
Graduated Optimization
Graduated opt
  • https://www.wikiwand.com/en/Numerical_continuation
  • Multi-Resolution Methods and Graduated Non-Convexity

Kiefer-Wolfowitz Algorithm

In stochastic gradient descent, the estimated gradient is a partial sum of the population gradient so that it is necessary to compute the gradient of sample function. Kiefer-Wolfowitz Algorithm is the gradient-free version of stochastic gradient descent. It is a recursive scheme to approximate the minimum or maximum of the form $$ x^{k+1} =x^{k} - a_n \Delta(x^k) $$

During the n-th stage, observations $y^{\prime\prime}$ and $y^{\prime}$ are taken at the design levels $x^{\prime\prime}=x^k+c_k$ and $x^{\prime}=x^k - c_k$, respectively. And $\Delta(x^k)=\frac{y^{\prime\prime} - y^{\prime}}{2c_n}$, $a_n$ and $c_n$ are positive constants so that $c_n\to 0, \sum_{i=0}^{\infty}(a_n/c_n)^2< \infty, \sum a_n =\infty$.

  • http://pawel.sawicz.eu/tag/kiefer-wolfowitz/
  • A compansion to Kiefer-Wolfowit algorithm
  • Archive for Kiefer-Wolfowitz algorithm
  • Strong Convergence of a Stochastic Approximation Algorithm
  • Almost Sure Approximations to the Robbins-Monro and Kiefer-Wolfowitz Processes with Dependent Noise
  • A Kiefer–Wolfowitz Algorithm with Randomized Differences
  • Stochastic Approximation by Tze Leung Lai

Multi-Level Optimization

Multi-Level Optimization is to optimize a related cheap function $\hat{f}$ when the objective function $f$ is very expensive to evaluate.

Multi-level optimization methods repeat three steps:

  1. Cheap minimization of modified $\hat{f}$: $$y^{k}=\arg\min_{x\in \mathbb{R}^p} + \left<v_k, x\right>.$$
  2. Use $y^{k}$ to give descent direction, $$x^{k+1} = x^k -a_k(x^k - y^k).$$
  3. Set $v_k$ to satisfy first-order coherence $$v_{k+1}=\frac{L_f}{L_F} F^{\prime}(x^{k+1}) - f^{\prime}(x^{k+1}).$$
  • Panos Parpas
  • Panos Parpas
  • EMERITUS PROFESSORBERCRUSTEM
  • Multilevel Optimization Methods: Convergence and Problem Structure
  • A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
  • Multiresolution Algorithms for Faster Optimization in Machine Learning

The basic scheme of multilevel optimization

The simplified scheme of work for the multilevel optimization procedure can be represented as follows.

  1. Solving the optimization problem using a surrogate model. For this purpose, the method of indirect optimization based on the self-organization (IOSO) is used. This method allows finding the single solution for single-objective optimization or the Pareto-optimal set of solutions for multi-objective problems.
  2. For the obtained solution the indicators of efficiency are updated using the high-fidelity analysis tools.
  3. The adjustment of a current search region is performed.
  4. The adjustment of the surrogate model is performed. Depending upon the particular features of the applied mathematical simulation, the adjustment procedure can performed using several approaches. One such approach involves constructing non-linear corrective dependencies. This includes evaluation of the results obtained with different fidelity analysis tools. The other possible approach is application of nonlinear estimation of surrogate model internal parameters.
  5. Replacement of the surrogate model by the adjusted one and the return to step (1).
  • Multilevel Optimization: Convergence Theory, Algorithms and Application to Derivative-Free Optimization
  • multilevel optimization iosotech
  • OptCom: A Multi-Level Optimization Framework for the Metabolic Modeling and Analysis of Microbial Communities

Reactive Search Optizmiation

Roberto Battiti and Mauro Brunato explains that how reactive search optimization, reactive affine shaker(RAS), memetic algorithm works for optimization problem without gradient in Machine Learning plus Intelligent Optimization.

  • Curriculum-based course timetabling solver; uses Tabu Search or Simulated Annealing
  • Machine Learning plus Intelligent Optimization by Roberto Battiti and Mauro Brunato
  • Reactive Search and Intelligent Optimization

  • Zeroth-Order Method for Distributed Optimization With Approximate Projections
  • https://projects.coin-or.org/Dfo
  • https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/
  • http://adl.stanford.edu/aa222/Lecture_Notes_files/chapter6_gradfree.pdf
  • https://www.kthohr.com/optimlib.html
  • https://www.infoq.com/presentations/black-box-optimization

And there are more topics on optimization such as this site. And more courses on optimization:

  • 凸优化 (2018年秋季) by 文在文
  • 2010- Shanghai Jiao Tong University 张小群
  • Optimisation: Master's degree in Modelling for Science and Engineering
  • A Course on First-Order, Operator Splitting, and Coordinate Update Methods for Optimization by Yinwo Tao
  • Math 273C: Numerical Optimizatoin by Yinwo Tao
  • Math 273, Section 1, Fall 2009: Optimization, Calculus of Variations, and Control Theory
  • ECE236B - Convex Optimization (Winter Quarter 2018-19) by Prof. L. Vandenberghe, UCLA
  • EECS 127 / 227AT: Optimization Models and Applications  —  Fall 2018 Instructors: A. Bayen, L. El Ghaoui.
  • IEOR 262B: Mathematical Programming II by Professor Javad Lavaei, UC Berkeley
  • https://web.stanford.edu/~boyd/teaching.html
  • EE364b - Convex Optimization II
  • Convex Optimization: Fall 2018 Machine Learning 10-725
  • Algorithms for Convex Optimization
  • Optimization by Vector Space Methods
  • Algorithms for Convex Optimization
  • EE 227C (Spring 2018), Convex Optimization and Approximation
  • https://ocw.mit.edu/courses/sloan-school-of-management/15-093j-optimization-methods-fall-2009/

  • http://niaohe.ise.illinois.edu/IE598_2016/index.html
  • http://www.optimization-online.org/
  • http://convexoptimization.com/
  • More Optimization Online Links
  • TUTORIALS AND BOOKS at http://plato.asu.edu/sub/tutorials.html.
  • Provable Nonconvex Methods/Algorithms
  • https://optimization.mccormick.northwestern.edu/index.php/Main_Page
  • Non-convex Optimization for Machine Learning
  • GLOBAL OPTIMALITY CONDITIONS FOR DEEP NEURAL NETWORKS
  • http://www.vision.jhu.edu/assets/HaeffeleCVPR17.pdf
  • https://www.springer.com/us/book/9783319314822
  • https://core.ac.uk/display/83849901
  • https://core.ac.uk/display/73408878
  • https://zhuanlan.zhihu.com/p/51514687
  • http://math.cmu.edu/~hschaeff/research.html
  • https://people.eecs.berkeley.edu/~elghaoui/Teaching/EECS127/index.html
  • https://blogs.princeton.edu/imabandit/
  • http://www.probabilistic-numerics.org/research/index.html
  • https://people.eecs.berkeley.edu/~brecht/eecs227c.html
  • https://neos-guide.org/content/optimization-under-uncertainty
  • Optimization and Gradient Descent on Riemannian Manifolds
  • Approximation Theory & Convex Optimization
  • https://people.maths.ox.ac.uk/riseth/
  • https://www.aritradutta.com/
  • http://sites.utexas.edu/mokhtari/
  • https://www.di.ens.fr/~aspremon/

Find the Solution U R Theta of the Problem

Source: https://github.com/Hulalazz/A-_Guide_-to_Data_Sciecne_from_mathematics/blob/master/Numerical%20Optimization.md