Gradient descent
Consider a multivariate function \(f:\mathbb{R}^n \rightarrow \mathbb{R}\) that represents some cost or value. Such a function is called an objective function, and we often want to find an \(\bm{X}\in\mathbb{R}^n\) that yields an extremum of \(f\): a minimum or a maximum, whichever is desirable.
It is important to note, however, that some functions have no finite extremum, and others have several. Finding a global extremum is generally difficult, but many good methods exist for finding a local extremum: an extremum over some region \(R \subset \mathbb{R}^n\).
The method explored here is called gradient descent. It will soon become apparent why it has this name.
Stationary points
Recall from basic calculus that a function \(f\) of a single variable has potential local extrema where \(\diff f(x)/\diff x = 0\). The multivariate version of this, for multivariate function \(f\), is \[\begin{aligned} \label{eq:gradient_descent_grad} \grad f = \bm{0}. \end{aligned}\] A value \(\bm{X}\) for which eq. \ref{eq:gradient_descent_grad} holds is called a stationary point. However, as in the univariate case, a stationary point need not be a local extremum; when it is not, it is called a saddle point.
Consider the Hessian matrix \(H\) with elements, for independent variables \(x_i\), \[\begin{aligned} H_{ij} &= \partial_{x_i x_j}^2 f. \end{aligned}\] For a stationary point \(\bm{X}\), the second partial derivative test tells us whether it is a local maximum, a local minimum, or a saddle point:
- minimum: If \(H(\bm{X})\) is positive definite (all its eigenvalues are positive), \(\bm{X}\) is a local minimum.
- maximum: If \(H(\bm{X})\) is negative definite (all its eigenvalues are negative), \(\bm{X}\) is a local maximum.
- saddle: If \(H(\bm{X})\) is indefinite (it has both positive and negative eigenvalues), \(\bm{X}\) is a saddle point.
These are sometimes called tests for concavity: minima occur where \(f\) is convex and maxima where \(f\) is concave (i.e. where \(-f\) is convex).
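For instance, here is a minimal SymPy sketch of the test; the function, which has a stationary point at the origin, is chosen purely for illustration.
import sympy as sp

x1, x2 = sp.symbols('x1, x2')
f = x1**2 - x2**2  # illustrative function with a stationary point at the origin

H = sp.hessian(f, (x1, x2)).subs({x1: 0, x2: 0})  # Hessian evaluated at X
eigs = list(H.eigenvals())  # eigenvalues of H(X)
if all(e > 0 for e in eigs):
    print('local minimum')
elif all(e < 0 for e in eigs):
    print('local maximum')
elif any(e > 0 for e in eigs) and any(e < 0 for e in eigs):
    print('saddle point')  # printed here: the eigenvalues are 2 and -2
else:
    print('inconclusive')  # a zero eigenvalue; the test fails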
It turns out, however, that solving eq. \ref{eq:gradient_descent_grad} directly for stationary points is generally hard. Therefore, we will typically use an iterative technique for estimating them.
The gradient points the way
Although eq. \ref{eq:gradient_descent_grad} isn’t usually directly useful for computing stationary points, it suggests iterative techniques that are. Several techniques rely on the insight that the gradient points “uphill”: recall from section 5.3 that \(\grad f\) is a vector field that points in the direction of greatest increase in \(f\).
Suppose we start at some point \(\bm{x}_0\) and want to move iteratively closer to a stationary point. If one is seeking a maximum of \(f\), choose \(\bm{x}_1\) to be in the direction of \(\grad f\); if one is seeking a minimum of \(f\), choose \(\bm{x}_1\) to be opposite the direction of \(\grad f\).
The question becomes: how far should we go in (or opposite to) the direction of the gradient? This distance \(\alpha\) is called the step size. Surely a too-small \(\alpha\) will require more iterations, and a too-large \(\alpha\) will lead to poor convergence or miss minima altogether. This framing of the problem is called line search. There are a few common methods for choosing \(\alpha\), some more computationally efficient than others.
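To see why the step size matters, consider the univariate function \(f(x) = x^2\), for which the minimization update is \(x \leftarrow x - \alpha f'(x) = (1 - 2\alpha)x\); this converges only for \(0 < \alpha < 1\). Here is a minimal sketch, with the three values of \(\alpha\) chosen for illustration.
def descend_1d(alpha, x0=1.0, n=20):
    """Iterate x <- x - alpha*f'(x) for f(x) = x**2, where f'(x) = 2*x."""
    x = x0
    for _ in range(n):
        x = x - alpha*2*x
    return x

for alpha in (0.01, 0.4, 1.1):  # too small, reasonable, too large
    print(alpha, descend_1d(alpha))
After 20 iterations, \(\alpha = 0.01\) has crept only partway toward the minimum at \(x = 0\), \(\alpha = 0.4\) is within about \(10^{-13}\) of it, and \(\alpha = 1.1\) has diverged.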
Two methods for choosing the step size are described below. Both are framed as minimization methods, but changing the sign of the step turns them into maximization methods.
The classical method
Let \[\begin{aligned} \bm{g}_k = \grad f(\bm{x}_k), \end{aligned}\] the gradient at the algorithm’s current estimate \(\bm{x}_k\) of the minimum. The classical method of choosing \(\alpha\) is to attempt to solve analytically for \[\begin{aligned} \label{eq:gradient_descent_alpha_classical} \alpha_k = \argmin_\alpha f(\bm{x}_k - \alpha \bm{g}_k). \end{aligned}\] This restricts \(f\) to the ray from \(\bm{x}_k\) in the direction \(-\bm{g}_k\); it only approximates the ideal path because, as \(\alpha\) varies, the direction of steepest descent varies as well, while \(\bm{g}_k\) is held fixed. But even with \(\alpha\) as the only variable, eq. \ref{eq:gradient_descent_alpha_classical} may be difficult or impossible to solve. Nevertheless, this is sometimes called the “optimal” choice for \(\alpha\), where “optimality” refers not to practicality but to ideality. This method is rarely used to solve practical problems.
The classical gradient descent method is described further in Kreyszig (2011, sec. 22.1).
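For simple functions, however, the line search of eq. \ref{eq:gradient_descent_alpha_classical} can be carried out symbolically. Here is a minimal SymPy sketch of a single classical iteration, using the quadratic \(f_1\) and the starting point \((-50, 40)\) from the example below:
import sympy as sp

x1, x2, alpha = sp.symbols('x1, x2, alpha', real=True)
f = (x1 - 25)**2 + 13*(x2 + 10)**2  # f1 from the example below

x_k = sp.Matrix([-50, 40])  # current estimate of the minimum
g = sp.Matrix([f.diff(x1), f.diff(x2)])  # gradient of f
g_k = g.subs({x1: x_k[0], x2: x_k[1]})  # g_k, here (-150, 1300)
# restrict f to the ray x_k - alpha*g_k and minimize over alpha
phi = f.subs({x1: x_k[0] - alpha*g_k[0], x2: x_k[1] - alpha*g_k[1]})
alpha_k = sp.solve(phi.diff(alpha), alpha)[0]  # about 0.0389
x_kp1 = x_k - alpha_k*g_k  # next estimate
print(alpha_k, x_kp1.T)
Because \(f_1\) is quadratic, \(\phi'(\alpha) = 0\) is linear in \(\alpha\) and the line search has a closed form; for most objective functions it does not.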
The Barzilai and Borwein method
In practice, several non-classical methods are used for choosing the step size \(\alpha\). Most of these construct criteria for step sizes that are too small and too large and prescribe choosing some \(\alpha\) that (at least in certain cases) must lie in the sweet spot in between. Barzilai and Borwein (1988) developed such a prescription, which we now present.
Let \(\Delta\bm{x}_k = \bm{x}_k - \bm{x}_{k-1}\) and \(\Delta\bm{g}_k = \bm{g}_k - \bm{g}_{k-1}\). This method chooses the \(\alpha\) that minimizes \(\norm{\Delta\bm{x}_k - \alpha \Delta\bm{g}_k}^2\); setting the derivative with respect to \(\alpha\) to zero yields \[\begin{aligned} \alpha_k = \frac{\Delta\bm{x}_k\cdot\Delta\bm{g}_k}{\Delta\bm{g}_k\cdot\Delta\bm{g}_k}. \end{aligned}\]
This gradient descent method is described further in Barzilai and Borwein (1988).
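In code, the prescription is a single expression; here is a minimal NumPy sketch (the function name is illustrative):
import numpy as np

def bb_step(x_k, x_km1, g_k, g_km1):
    """Barzilai-Borwein step size from two successive iterates and their gradients."""
    dx = x_k - x_km1  # Delta x_k
    dg = g_k - g_km1  # Delta g_k
    return dx.dot(dg)/dg.dot(dg)
The class implemented in the example below uses the absolute value of this quotient as its step size.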
Consider the functions (a) \(f_1:\mathbb{R}^2 \rightarrow \mathbb{R}\) and (b) \(f_2:\mathbb{R}^2 \rightarrow \mathbb{R}\) defined as \[\begin{aligned} f_1(\bm{x}) &= (x_1 - 25)^2 + 13 (x_2 + 10)^2 \\ f_2(\bm{x}) &= \frac{1}{2} \bm{x} \cdot A\bm{x} - \bm{b}\cdot\bm{x} \end{aligned}\] where \(A = \begin{bmatrix} 10 & 0 \\ 0 & 20 \end{bmatrix}\) and \(\bm{b} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\). Use the method of Barzilai and Borwein (1988), starting at some \(\bm{x}_0\), to find a minimum of each function.
First, load some Python packages.
import sympy as sp
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
"display.precision", 3) # Show only three decimal places pd.set_option(
We begin by writing a class Gradient_descent_min
to perform the gradient
descent. This is not optimized for speed.
class Gradient_descent_min():
    """ A Barzilai and Borwein gradient descent class.
    Inputs:
    * f: Python function of x variables
    * x: list of symbolic variables (eg [x1, x2])
    * x0: list of numeric initial guess of a min of f
    * T: step size threshold for stopping the descent
    To execute the gradient descent call descend method.
    nb: This is only for gradients in cartesian
    coordinates! Further work would be to implement
    this in multiple or generalized coordinates.
    See the grad method below for implementation.
    """
    def __init__(self, f, x, x0, T):
        self.f = f
        self.x = sp.Array(x)
        self.x0 = np.array(x0)
        self.T = T
        self.n = len(x0)  # size of x
        self.g = sp.lambdify(x, self.grad(f, x), 'numpy')
        self.xk = np.array(x0)
        self.table = {}

    def descend(self):
        # unpack variables
        f = self.f
        x = self.x
        x0 = self.x0
        T = self.T
        g = self.g
        # initialize variables
        N = 0
        x_k = x0
        dx = 2*T  # can't be zero
        x_km1 = .9*x0 - .1  # can't equal x0
        g_km1 = np.array(g(*x_km1))
        N_max = 100  # max iterations
        table_data = [[N, x0, np.array(g(*x0)), 0]]
        while (dx > T and N < N_max) or N < 1:
            N += 1  # increment index
            g_k = np.array(g(*x_k))
            dg_k = g_k - g_km1
            dx_k = x_k - x_km1
            alpha_k = abs(dx_k.dot(dg_k)/dg_k.dot(dg_k))
            x_km1 = x_k  # store
            x_k = x_k - alpha_k*g_k
            # save
            t_list = [N, x_k, g_k, alpha_k]
            t_list = [
                [f"{t_i:.3g}" for t_i in t] if isinstance(t, np.ndarray)
                else t for t in t_list]
            table_data.append(t_list)
            self.xk = np.vstack((self.xk, x_k))
            # store other variables
            g_km1 = g_k
            dx = np.linalg.norm(x_k - x_km1)  # check
        self.tabulater(table_data)

    def tabulater(self, table_data):
        table = pd.DataFrame(table_data,
                             columns=['N', 'x_k', 'g_k', 'alpha_k'])
        self.table['python'] = table
        self.table['latex'] = table.to_latex(index=False)

    def grad(self, f, x):  # cartesian coord's gradient
        return sp.derive_by_array(f(x), x)
First, consider f1.
x1, x2 = sp.symbols('x1, x2')
x = sp.Array([x1, x2])
f1 = lambda x: (x[0]-25)**2 + 13*(x[1]+10)**2
gd = Gradient_descent_min(f=f1, x=x, x0=[-50, 40], T=1e-8)
Perform the gradient descent.
gd.descend()
Print the interesting variables.
print(gd.table['python'])
N x_k g_k alpha_k
0 0 [-50, 40] [-150, 1300] 0.000
1 1 [-43.7, -15] [-150, 1.3e+03] 0.042
2 2 [-38.4, -10] [-137, -131] 0.038
3 3 [-33.1, -10] [-127, 0.124] 0.041
4 4 [25, -10] [-116, -0.00962] 0.500
5 5 [25, -10.1] [-0.0172, 0.115] 0.500
6 6 [25, -10] [-1.84e-08, -1.38] 0.039
7 7 [25, -10] [-1.7e-08, 0.00219] 0.038
8 8 [25, -10] [-1.57e-08, 0] 0.038
The descent converges to \((25, -10)\), which is the exact minimum of \(f_1\) by inspection. Now let’s lambdify the function f1 so we can plot.
f1_lambda = sp.lambdify((x1, x2), f1(x), 'numpy')
Now let’s plot a contour plot with the gradient descent overlaid.
fig, ax = plt.subplots()
# contour plot
X1 = np.linspace(-100, 100, 100)
X2 = np.linspace(-50, 50, 100)
X1, X2 = np.meshgrid(X1, X2)
F1 = f1_lambda(X1, X2)
plt.contourf(X1, X2, F1)
plt.colorbar()
# gradient descent plot
from matplotlib.collections import LineCollection
xX1 = gd.xk[:, 0]
xX2 = gd.xk[:, 1]
points = np.array([xX1, xX2]).T.reshape(-1, 1, 2)
segments = np.concatenate(
    [points[:-1], points[1:]], axis=1
)
lc = LineCollection(
    segments,
    cmap=plt.get_cmap('Reds')
)
lc.set_array(np.linspace(0, 1, len(segments)))  # color segs
lc.set_linewidth(3)
ax.autoscale(False)  # avoid the scatter changing lims
ax.add_collection(lc)
ax.scatter(
    xX1, xX2,
    zorder=1,
    marker="o",
    color=plt.cm.Reds(np.linspace(0, 1, len(xX1))),
    edgecolor='none'
)
plt.draw()
Now consider f2.
A = sp.Matrix([[10, 0], [0, 20]])
b = sp.Matrix([[1, 1]])
def f2(x):
    X = sp.Array([x]).tomatrix().T
    return 1/2*X.dot(A*X) - b.dot(X)
gd = Gradient_descent_min(f=f2, x=x, x0=[50, -40], T=1e-8)
Perform the gradient descent.
gd.descend()
Print the interesting variables.
print(gd.table['python'])
N x_k g_k alpha_k
0 0 [50, -40] [499.0, -801.0] 0.000
1 1 [17.6, 12] [499, -801] 0.065
2 2 [8.07, -1.01] [175, 240] 0.054
3 3 [3.62, 0.174] [79.7, -21.2] 0.056
4 4 [0.489, -0.0468] [35.2, 2.49] 0.089
5 5 [0.104, 0.145] [3.89, -1.94] 0.099
6 6 [0.101, 0.00238] [0.0381, 1.9] 0.075
7 7 [0.1, 0.05] [0.00949, -0.952] 0.050
8 8 [0.1, 0.05] [0.00474, 9.58e-05] 0.050
9 9 [0.1, 0.05] [0.00237, -2.38e-09] 0.100
10 10 [0.1, 0.05] [1.93e-06, 2.37e-09] 0.100
11 11 [0.1, 0.05] [0, -2.37e-09] 0.100
The descent converges to \((0.1, 0.05)\), which solves \(A\bm{x} = \bm{b}\) exactly. Now let’s lambdify the function f2 so we can plot.
f2_lambda = sp.lambdify((x1, x2), f2(x), 'numpy')
Now let’s plot a contour plot with the gradient descent overlaid.
fig, ax = plt.subplots()
# contour plot
X1 = np.linspace(-100, 100, 100)
X2 = np.linspace(-50, 50, 100)
X1, X2 = np.meshgrid(X1, X2)
F2 = f2_lambda(X1, X2)
plt.contourf(X1, X2, F2)
plt.colorbar()
# gradient descent plot
xX1 = gd.xk[:, 0]
xX2 = gd.xk[:, 1]
points = np.array([xX1, xX2]).T.reshape(-1, 1, 2)
segments = np.concatenate(
    [points[:-1], points[1:]], axis=1
)
lc = LineCollection(
    segments,
    cmap=plt.get_cmap('Reds')
)
lc.set_array(np.linspace(0, 1, len(segments)))  # color segs
lc.set_linewidth(3)
ax.autoscale(False)  # avoid the scatter changing lims
ax.add_collection(lc)
ax.scatter(
    xX1, xX2,
    zorder=1,
    marker="o",
    color=plt.cm.Reds(np.linspace(0, 1, len(xX1))),
    edgecolor='none'
)
plt.show()
Online Resources for Section 8.1
No online resources.