Gradient descent
Consider a multivariate function \(f:\mathbb{R}^n \rightarrow \mathbb{R}\) that represents some cost or value. Such a function is called an objective function, and we often want to find an \(\bm{X}\in\mathbb{R}^n\) that yields an extremum of \(f\): a minimum or a maximum, whichever is desirable.
It is important to note, however, that some functions have no finite extremum, and others have several. Finding a global extremum is generally difficult, but many good methods exist for finding a local extremum: an extremum over some region \(R \subset \mathbb{R}^n\).
The method explored here is called gradient descent. It will soon become apparent why it has this name.
Stationary points
Recall from basic calculus that a function \(f\) of a single variable has potential local extrema where \(\diff f(x)/\diff x = 0\). The multivariate version of this, for multivariate function \(f\), is \[\begin{aligned} \label{eq:gradient_descent_grad} \grad f = \bm{0}. \end{aligned}\] A value \(\bm{X}\) for which eq. \ref{eq:gradient_descent_grad} holds is called a stationary point. However, as in the univariate case, a stationary point need not be a local extremum; when it is not, it is called a saddle point.
Consider the Hessian matrix \(H\) with elements, for independent variables \(x_i\), \[\begin{aligned} H_{ij} &= \partial_{x_i x_j}^2 f. \end{aligned}\] For a stationary point \(\bm{X}\), the second partial derivative test tells us whether it is a local maximum, a local minimum, or a saddle point:
- minimum: If \(H(\bm{X})\) is positive definite (all its eigenvalues are positive), \(\bm{X}\) is a local minimum.
- maximum: If \(H(\bm{X})\) is negative definite (all its eigenvalues are negative), \(\bm{X}\) is a local maximum.
- saddle: If \(H(\bm{X})\) is indefinite (it has both positive and negative eigenvalues), \(\bm{X}\) is a saddle point.
These are sometimes called tests for concavity: minima occur where \(f\) is convex and maxima where \(f\) is concave (i.e. where \(-f\) is convex).
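For instance, here is a minimal SymPy sketch of the test; the function, which has a stationary point at the origin, is chosen purely for illustration.
import sympy as sp

x1, x2 = sp.symbols('x1, x2')
f = x1**2 - x2**2  # illustrative function with a stationary point at the origin

H = sp.hessian(f, (x1, x2)).subs({x1: 0, x2: 0})  # Hessian evaluated at X
eigs = list(H.eigenvals())  # eigenvalues of H(X)
if all(e > 0 for e in eigs):
    print('local minimum')
elif all(e < 0 for e in eigs):
    print('local maximum')
elif any(e > 0 for e in eigs) and any(e < 0 for e in eigs):
    print('saddle point')  # printed here: the eigenvalues are 2 and -2
else:
    print('inconclusive')  # a zero eigenvalue; the test fails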
It turns out, however, that solving eq. \ref{eq:gradient_descent_grad} directly for stationary points is generally hard. Therefore, we will typically use an iterative technique for estimating them.
The gradient points the way
Although eq. \ref{eq:gradient_descent_grad} isn’t usually directly useful for computing stationary points, it suggests iterative techniques that are. Several techniques rely on the insight that the gradient points “uphill”: recall from section 5.3 that \(\grad f\) is a vector field that points in the direction of greatest increase in \(f\).
Suppose we start at some point \(\bm{x}_0\) and want to move iteratively closer to a stationary point. If one is seeking a maximum of \(f\), choose \(\bm{x}_1\) to be in the direction of \(\grad f\); if one is seeking a minimum of \(f\), choose \(\bm{x}_1\) to be opposite the direction of \(\grad f\).
The question becomes: how far should we go in (or opposite to) the direction of the gradient? This distance \(\alpha\) is called the step size. Surely a too-small \(\alpha\) will require more iterations, and a too-large \(\alpha\) will lead to poor convergence or miss minima altogether. This framing of the problem is called line search. There are a few common methods for choosing \(\alpha\), some more computationally efficient than others.
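To see why the step size matters, consider the univariate function \(f(x) = x^2\), for which the minimization update is \(x \leftarrow x - \alpha f'(x) = (1 - 2\alpha)x\); this converges only for \(0 < \alpha < 1\). Here is a minimal sketch, with the three values of \(\alpha\) chosen for illustration.
def descend_1d(alpha, x0=1.0, n=20):
    """Iterate x <- x - alpha*f'(x) for f(x) = x**2, where f'(x) = 2*x."""
    x = x0
    for _ in range(n):
        x = x - alpha*2*x
    return x

for alpha in (0.01, 0.4, 1.1):  # too small, reasonable, too large
    print(alpha, descend_1d(alpha))
After 20 iterations, \(\alpha = 0.01\) has crept only partway toward the minimum at \(x = 0\), \(\alpha = 0.4\) is within about \(10^{-13}\) of it, and \(\alpha = 1.1\) has diverged.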
Two methods for choosing the step size are described below. Both are framed as minimization methods, but changing the sign of the step turns them into maximization methods.
The classical method
Let \[\begin{aligned} \bm{g}_k = \grad f(\bm{x}_k), \end{aligned}\] the gradient at the algorithm’s current estimate \(\bm{x}_k\) of the minimum. The classical method of choosing \(\alpha\) is to attempt to solve analytically for \[\begin{aligned} \label{eq:gradient_descent_alpha_classical} \alpha_k = \argmin_\alpha f(\bm{x}_k - \alpha \bm{g}_k). \end{aligned}\] This restricts \(f\) to the ray from \(\bm{x}_k\) in the direction \(-\bm{g}_k\); it only approximates the ideal path because, as \(\alpha\) varies, the direction of steepest descent varies as well, while \(\bm{g}_k\) is held fixed. But even with \(\alpha\) as the only variable, eq. \ref{eq:gradient_descent_alpha_classical} may be difficult or impossible to solve. Nevertheless, this is sometimes called the “optimal” choice for \(\alpha\), where “optimality” refers not to practicality but to ideality. This method is rarely used to solve practical problems.
The classical gradient descent method is described further in Kreyszig (2011, sec. 22.1).
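For simple functions, however, the line search of eq. \ref{eq:gradient_descent_alpha_classical} can be carried out symbolically. Here is a minimal SymPy sketch of a single classical iteration, using the quadratic \(f_1\) and the starting point \((-50, 40)\) from the example below:
import sympy as sp

x1, x2, alpha = sp.symbols('x1, x2, alpha', real=True)
f = (x1 - 25)**2 + 13*(x2 + 10)**2  # f1 from the example below

x_k = sp.Matrix([-50, 40])  # current estimate of the minimum
g = sp.Matrix([f.diff(x1), f.diff(x2)])  # gradient of f
g_k = g.subs({x1: x_k[0], x2: x_k[1]})  # g_k, here (-150, 1300)
# restrict f to the ray x_k - alpha*g_k and minimize over alpha
phi = f.subs({x1: x_k[0] - alpha*g_k[0], x2: x_k[1] - alpha*g_k[1]})
alpha_k = sp.solve(phi.diff(alpha), alpha)[0]  # about 0.0389
x_kp1 = x_k - alpha_k*g_k  # next estimate
print(alpha_k, x_kp1.T)
Because \(f_1\) is quadratic, \(\phi'(\alpha) = 0\) is linear in \(\alpha\) and the line search has a closed form; for most objective functions it does not.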
The Barzilai and Borwein method
In practice, several non-classical methods are used for choosing the step size \(\alpha\). Most of these construct criteria for step sizes that are too small and too large and prescribe choosing some \(\alpha\) that (at least in certain cases) must lie in the sweet spot in between. Barzilai and Borwein (1988) developed such a prescription, which we now present.
Let \(\Delta\bm{x}_k = \bm{x}_k - \bm{x}_{k-1}\) and \(\Delta\bm{g}_k = \bm{g}_k - \bm{g}_{k-1}\). This method chooses the \(\alpha\) that minimizes \(\norm{\Delta\bm{x}_k - \alpha \Delta\bm{g}_k}^2\); setting the derivative with respect to \(\alpha\) to zero yields \[\begin{aligned} \alpha_k = \frac{\Delta\bm{x}_k\cdot\Delta\bm{g}_k}{\Delta\bm{g}_k\cdot\Delta\bm{g}_k}. \end{aligned}\]
This gradient descent method is described further in Barzilai and Borwein (1988).
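In code, the prescription is a single expression; here is a minimal NumPy sketch (the function name is illustrative):
import numpy as np

def bb_step(x_k, x_km1, g_k, g_km1):
    """Barzilai-Borwein step size from two successive iterates and their gradients."""
    dx = x_k - x_km1  # Delta x_k
    dg = g_k - g_km1  # Delta g_k
    return dx.dot(dg)/dg.dot(dg)
The class implemented in the example below uses the absolute value of this quotient as its step size.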
Consider the functions (a) \(f_1:\mathbb{R}^2 \rightarrow \mathbb{R}\) and (b) \(f_2:\mathbb{R}^2 \rightarrow \mathbb{R}\) defined as \[\begin{aligned} f_1(\bm{x}) &= (x_1 - 25)^2 + 13 (x_2 + 10)^2 \\ f_2(\bm{x}) &= \frac{1}{2} \bm{x} \cdot A\bm{x} - \bm{b}\cdot\bm{x} \end{aligned}\] where \(A = \begin{bmatrix} 10 & 0 \\ 0 & 20 \end{bmatrix}\) and \(\bm{b} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\). Use the method of Barzilai and Borwein (1988), starting at some \(\bm{x}_0\), to find a minimum of each function.
First, load some Python packages.
import sympy as sp
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
"display.precision", 3) # Show only three decimal places pd.set_option(
We begin by writing a class Gradient_descent_min
to perform the gradient
descent. This is not optimized for speed.
class Gradient_descent_min():
    """ A Barzilai and Borwein gradient descent class.
    Inputs:
    * f: Python function of x variables
    * x: list of symbolic variables (eg [x1, x2])
    * x0: list of numeric initial guess of a min of f
    * T: step size threshold for stopping the descent
    To execute the gradient descent call descend method.
    nb: This is only for gradients in cartesian
    coordinates! Further work would be to implement
    this in multiple or generalized coordinates.
    See the grad method below for implementation.
    """
    def __init__(self, f, x, x0, T):
        self.f = f
        self.x = sp.Array(x)
        self.x0 = np.array(x0)
        self.T = T
        self.n = len(x0)  # size of x
        self.g = sp.lambdify(x, self.grad(f, x), 'numpy')
        self.xk = np.array(x0)
        self.table = {}

    def descend(self):
        # unpack variables
        f = self.f
        x = self.x
        x0 = self.x0
        T = self.T
        g = self.g
        # initialize variables
        N = 0
        x_k = x0
        dx = 2*T  # can't be zero
        x_km1 = .9*x0 - .1  # can't equal x0
        g_km1 = np.array(g(*x_km1))
        N_max = 100  # max iterations
        table_data = [[N, x0, np.array(g(*x0)), 0]]
        while (dx > T and N < N_max) or N < 1:
            N += 1  # increment index
            g_k = np.array(g(*x_k))
            dg_k = g_k - g_km1
            dx_k = x_k - x_km1
            alpha_k = abs(dx_k.dot(dg_k)/dg_k.dot(dg_k))
            x_km1 = x_k  # store
            x_k = x_k - alpha_k*g_k
            # save
            t_list = [N, x_k, g_k, alpha_k]
            t_list = [
                [f"{t_i:.3g}" for t_i in t] if isinstance(t, np.ndarray)
                else t for t in t_list]
            table_data.append(t_list)
            self.xk = np.vstack((self.xk, x_k))
            # store other variables
            g_km1 = g_k
            dx = np.linalg.norm(x_k - x_km1)  # check
        self.tabulater(table_data)

    def tabulater(self, table_data):
        table = pd.DataFrame(table_data,
                             columns=['N', 'x_k', 'g_k', 'alpha_k'])
        self.table['python'] = table
        self.table['latex'] = table.to_latex(index=False)

    def grad(self, f, x):  # cartesian coord's gradient
        return sp.derive_by_array(f(x), x)
First, consider f1.
x1, x2 = sp.symbols('x1, x2')
x = sp.Array([x1, x2])
f1 = lambda x: (x[0]-25)**2 + 13*(x[1]+10)**2
gd = Gradient_descent_min(f=f1, x=x, x0=[-50, 40], T=1e-8)
Perform the gradient descent.
gd.descend()
Print the interesting variables.
print(gd.table['python'])
N x_k g_k alpha_k
0 0 [-50, 40] [-150, 1300] 0.000
1 1 [-43.7, -15] [-150, 1.3e+03] 0.042
2 2 [-38.4, -10] [-137, -131] 0.038
3 3 [-33.1, -10] [-127, 0.124] 0.041
4 4 [25, -10] [-116, -0.00962] 0.500
5 5 [25, -10.1] [-0.0172, 0.115] 0.500
6 6 [25, -10] [-1.84e-08, -1.38] 0.039
7 7 [25, -10] [-1.7e-08, 0.00219] 0.038
8 8 [25, -10] [-1.57e-08, 0] 0.038
The descent converges to \((25, -10)\), which is the exact minimum of \(f_1\) by inspection. Now let’s lambdify the function f1 so we can plot.
f1_lambda = sp.lambdify((x1, x2), f1(x), 'numpy')
Now let’s plot a contour plot with the gradient descent overlaid.
fig, ax = plt.subplots()
# contour plot
X1 = np.linspace(-100, 100, 100)
X2 = np.linspace(-50, 50, 100)
X1, X2 = np.meshgrid(X1, X2)
F1 = f1_lambda(X1, X2)
plt.contourf(X1, X2, F1)
plt.colorbar()
# gradient descent plot
from matplotlib.collections import LineCollection
xX1 = gd.xk[:, 0]
xX2 = gd.xk[:, 1]
points = np.array([xX1, xX2]).T.reshape(-1, 1, 2)
segments = np.concatenate(
    [points[:-1], points[1:]], axis=1
)
lc = LineCollection(
    segments,
    cmap=plt.get_cmap('Reds')
)
lc.set_array(np.linspace(0, 1, len(segments)))  # color segs
lc.set_linewidth(3)
ax.autoscale(False)  # avoid the scatter changing lims
ax.add_collection(lc)
ax.scatter(
    xX1, xX2,
    zorder=1,
    marker="o",
    color=plt.cm.Reds(np.linspace(0, 1, len(xX1))),
    edgecolor='none'
)
plt.draw()
Now consider f2.
A = sp.Matrix([[10, 0], [0, 20]])
b = sp.Matrix([[1, 1]])
def f2(x):
    X = sp.Array([x]).tomatrix().T
    return 1/2*X.dot(A*X) - b.dot(X)
gd = Gradient_descent_min(f=f2, x=x, x0=[50, -40], T=1e-8)
Perform the gradient descent.
gd.descend()
Print the interesting variables.
print(gd.table['python'])
N x_k g_k alpha_k
0 0 [50, -40] [499.0, -801.0] 0.000
1 1 [17.6, 12] [499, -801] 0.065
2 2 [8.07, -1.01] [175, 240] 0.054
3 3 [3.62, 0.174] [79.7, -21.2] 0.056
4 4 [0.489, -0.0468] [35.2, 2.49] 0.089
5 5 [0.104, 0.145] [3.89, -1.94] 0.099
6 6 [0.101, 0.00238] [0.0381, 1.9] 0.075
7 7 [0.1, 0.05] [0.00949, -0.952] 0.050
8 8 [0.1, 0.05] [0.00474, 9.58e-05] 0.050
9 9 [0.1, 0.05] [0.00237, -2.38e-09] 0.100
10 10 [0.1, 0.05] [1.93e-06, 2.37e-09] 0.100
11 11 [0.1, 0.05] [0, -2.37e-09] 0.100
The descent converges to \((0.1, 0.05)\), which solves \(A\bm{x} = \bm{b}\) exactly. Now let’s lambdify the function f2 so we can plot.
f2_lambda = sp.lambdify((x1, x2), f2(x), 'numpy')
Now let’s plot a contour plot with the gradient descent overlaid.
fig, ax = plt.subplots()
# contour plot
X1 = np.linspace(-100, 100, 100)
X2 = np.linspace(-50, 50, 100)
X1, X2 = np.meshgrid(X1, X2)
F2 = f2_lambda(X1, X2)
plt.contourf(X1, X2, F2)
plt.colorbar()
# gradient descent plot
xX1 = gd.xk[:, 0]
xX2 = gd.xk[:, 1]
points = np.array([xX1, xX2]).T.reshape(-1, 1, 2)
segments = np.concatenate(
    [points[:-1], points[1:]], axis=1
)
lc = LineCollection(
    segments,
    cmap=plt.get_cmap('Reds')
)
lc.set_array(np.linspace(0, 1, len(segments)))  # color segs
lc.set_linewidth(3)
ax.autoscale(False)  # avoid the scatter changing lims
ax.add_collection(lc)
ax.scatter(
    xX1, xX2,
    zorder=1,
    marker="o",
    color=plt.cm.Reds(np.linspace(0, 1, len(xX1))),
    edgecolor='none'
)
plt.show()
Online Resources for Section 8.1
No online resources.