Lectures 10-11

 

Modified Newton methods

 

In engineering design problems, function evaluation itself can be expensive.  Derivative calculations, especially if done numerically, are expensive (but required, at least to verify optimality!).  Second derivative calculations in this context are a large computational burden: constructing the Hessian matrix of second derivatives numerically takes of the order of n² function evaluations, and a system of linear equations must then be solved to obtain the Newton direction.  If we add to this the fact that, after all this trouble, the Newton direction (far away from a minimum) may not even be a descent direction, then it is clear that some modifications are called for.

 

[Try to check, even with simple functions of one or two variables, that [a] the Newton direction may be a descent direction but may not be well scaled, and [b] with functions of two variables, that it may not even be a descent direction when a point is far from the minimum.  Constructing such examples is an illuminating exercise in itself.]
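Part [b] of this exercise can be checked numerically.  Below is a minimal sketch using the nonconvex test function f(x, y) = x⁴ − x² + y² (an assumed example, not from these notes): near x = 0 its Hessian is indefinite, and the Newton direction comes out as an ascent direction.

```python
import numpy as np

# Assumed test function f(x, y) = x**4 - x**2 + y**2, chosen because its
# Hessian diag(12x^2 - 2, 2) is indefinite for small |x|.
def grad(p):
    x, y = p
    return np.array([4*x**3 - 2*x, 2*y])

def hess(p):
    x, y = p
    return np.array([[12*x**2 - 2, 0.0], [0.0, 2.0]])

p = np.array([0.1, 0.01])               # far from the minima at (+-1/sqrt(2), 0)
g = grad(p)
d_newton = -np.linalg.solve(hess(p), g)  # Newton direction

# Descent requires grad.d < 0; here the product is positive (ascent).
print(g @ d_newton)
```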

 

Levenberg-Marquardt method: One modification is the Levenberg-Marquardt modification, where a positive multiple λ of the identity matrix is added to the second derivative matrix to get a positive definite matrix Bk = ∇²f(xk) + λI, which is then inverted to get the descent direction −Bk⁻¹∇f(xk).  Note that taking λ very large essentially gives the steepest descent direction, while taking λ = 0 gives the Newton method.  We keep track of the eigenvalues of ∇²f(xk) at every stage and check that it remains positive definite, so as to give descent.  If not, we add a suitable multiple of the identity as above and proceed.  This also converges globally, and close to the minimum, essentially reverts to the Newton formula.  Note that positive definiteness is checked through efficient factorizations and not by finding the roots of the characteristic polynomial, which is an unstable numerical computation.  This gives descent and a robust algorithm, but the problem of computing ∇²f(xk) and solving with it at every stage remains.
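A minimal sketch of this idea, with assumed illustrative values for the Hessian and gradient (not from the notes), using a Cholesky factorization as the efficient positive-definiteness check:

```python
import numpy as np

def lm_direction(H, g, lam0=1e-3, factor=10.0):
    # Increase lam until H + lam*I is positive definite (the Cholesky
    # factorization succeeds exactly when the shifted matrix is positive
    # definite), then return the direction -(H + lam*I)^{-1} g.
    lam = lam0
    n = H.shape[0]
    while True:
        try:
            np.linalg.cholesky(H + lam * np.eye(n))
            break
        except np.linalg.LinAlgError:
            lam *= factor
    return np.linalg.solve(H + lam * np.eye(n), -g)

# Indefinite Hessian and a gradient (illustrative values)
H = np.array([[-1.88, 0.0], [0.0, 2.0]])
g = np.array([-0.196, 0.02])
d = lm_direction(H, g)
print(g @ d)     # negative: descent is restored despite the indefinite Hessian
```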

 

Conjugate direction methods: Recall that conjugate direction methods worked well (starting with finite n step convergence for quadratic functions and suitable extrapolations for non-linear functions) without having to do second derivative calculations and matrix inversion.  They are in between the steepest descent method (very few calculations at every step, but can take a large number of iterations and can have unacceptable numerical performance) on the one hand and Newton’s method (considerable effort at each step, but locally very effective in the number of steps required to converge) on the other.

 

Motivating ideas for Quasi Newton methods: Quasi Newton methods, which are currently the most robust and effective algorithms for unconstrained optimization, are based on the following set of ideas.

 

[A + uvᵀ]⁻¹ = A⁻¹ − (1/(1+k)) A⁻¹uvᵀA⁻¹, where k = vᵀA⁻¹u (assuming 1 + k ≠ 0)

[This is the Sherman-Morrison formula; you can verify it by direct multiplication.]

Note that if A⁻¹ is known, this is much faster than computing [A + uvᵀ]⁻¹ directly.

This is a rank one update (uvᵀ is a rank one matrix) of the original matrix A.  In particular, A + uuᵀ is a symmetric rank one update.
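The formula is easy to verify numerically.  A sketch with an arbitrary well-conditioned random matrix (assuming 1 + k ≠ 0, which holds here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned matrix
u = rng.standard_normal(n)
v = rng.standard_normal(n)

Ainv = np.linalg.inv(A)
k = v @ Ainv @ u
# Sherman-Morrison rank one update of the inverse (note the minus sign)
updated_inv = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1 + k)

direct = np.linalg.inv(A + np.outer(u, v))        # inverse computed from scratch
print(np.allclose(updated_inv, direct))           # True
```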

 

Quasi Newton methods put all these ideas together to construct approximations Bk to the Hessian matrix at each stage.  Note that some updates work on Bk, updating it and then finding its inverse, whereas others work directly on the inverse of the second derivative approximation (usually called Hk in textbooks).

 

[Refer to Chong and Zak, Nocedal and Wright, Fletcher and other books in your library for more details on Quasi Newton methods and their numerical properties.]

 

Quasi Newton condition

 

Consider two successive iterates xk and xk+1, and let s = xk+1 − xk and t = ∇f(xk+1) − ∇f(xk) for convenience.  If the matrix Bk+1 is to approximate the second derivative matrix at stage k+1, it should ideally satisfy the following condition:

Quasi-Newton condition : Bk+1 s = t

Quasi-Newton condition for the inverse : Hk+1 t = s

 

In matrix-vector terms, the difference in the gradient for a difference in the variable should be captured by the second derivative.  ['Dividing' by the vector s, which is strictly forbidden in vector arithmetic, may give you the appearance of a familiar derivative-like formula based on finite differences.]  All Quasi Newton formulas for Bk+1 (or for the inverse Hk+1) satisfy this QN condition.  This still leaves a number of options for Bk+1 or Hk+1 [note that we only have n equations for the n² entries of B or H].
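For a quadratic the condition is exact: if f(x) = ½xᵀQx − bᵀx (an assumed example), the gradient is Qx − b, so the gradient difference t equals Qs and the true Hessian satisfies Bs = t for any pair of iterates.

```python
import numpy as np

# Assumed quadratic f(x) = 0.5 x^T Q x - b^T x with gradient Qx - b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
grad = lambda x: Q @ x - b

xk  = np.array([0.0, 0.0])          # two arbitrary successive iterates
xk1 = np.array([0.5, -0.25])
s = xk1 - xk
t = grad(xk1) - grad(xk)

print(np.allclose(Q @ s, t))         # True: the Hessian Q satisfies B s = t
```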

 

The next condition is that it should be easy to compute the inverse.  This gives rise to a set of efficient updates (i.e. either Hk+1 given Hk, or a Bk+1 that is easy to invert given Bk).

 

Updates for Hk or Bk

 

Symmetric rank one update : Bk+1 = Bk + α(uuᵀ) for some vector u and scalar α, which are determined using the QN condition.  This may sometimes work, but in general it does not preserve positive definiteness and so may not give descent at every stage; it is therefore not a favoured method.

 

Symmetric rank two updates : Bk+1 = Bk + α(uuᵀ) + β(vvᵀ) for some vectors u and v and scalar constants α and β.  This gives a wide choice of methods; two of the most popular are the DFP (Davidon-Fletcher-Powell) update, which is written directly in terms of H, and the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update, with combinations of these called the Broyden family of updates.

 

DFP : Hk+1 = Hk + (1/(sᵀt)) ssᵀ − (1/(tᵀHkt)) [Hkt][Hkt]ᵀ

BFGS : Bk+1 = Bk + (1/(sᵀt)) ttᵀ − (1/(sᵀBks)) [Bks][Bks]ᵀ

 

In the BFGS case, the rank one inverse updating formula has to be applied twice (the update to Bk is of rank two) to get Hk+1, which can be written out explicitly.  The resulting updates for Hk+1 in both cases are far more efficient than computing the inverse from scratch.  These updates preserve symmetry and positive definiteness (provided sᵀt > 0), result in conjugate directions in the quadratic case, and satisfy the QN condition.  The BFGS seems to be the best general-purpose update in practice.
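The claimed properties of the BFGS formula can be checked directly.  A sketch with assumed values for s and t (chosen so that sᵀt > 0), verifying the QN condition and positive definiteness after one update:

```python
import numpy as np

def bfgs_update(B, s, t):
    """One BFGS update of the Hessian approximation B, as in the notes."""
    Bs = B @ s
    return B + np.outer(t, t) / (t @ s) - np.outer(Bs, Bs) / (s @ Bs)

B = np.eye(2)                      # start from the identity
s = np.array([1.0, 0.5])           # assumed step and gradient difference,
t = np.array([2.0, 1.5])           # with s.t > 0
B1 = bfgs_update(B, s, t)

print(np.allclose(B1 @ s, t))                 # QN condition B1 s = t holds
print(np.all(np.linalg.eigvalsh(B1) > 0))     # positive definiteness preserved
```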

 

General QN algorithm for minimizing a function f

 

Start with x0 and H0 = I (approximation to the inverse of the Hessian)

At step k,

  1. dk = -Hk ∇f(xk)
  2. Find αk so as to (exactly or approximately) minimize f(xk + αk dk)
  3. xk+1 = xk + αk dk
  4. Update Hk+1 (as per various methods discussed above)

Continue until a termination condition is satisfied.
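The loop above can be sketched end to end using the DFP update of H from these notes, on an assumed quadratic test problem where the exact line search has the closed form αk = −(gᵀd)/(dᵀQd).  With exact line searches on a quadratic, the method terminates in n steps.

```python
import numpy as np

# Assumed quadratic test problem f(x) = 0.5 x^T Q x - b^T x
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: Q @ x - b

x = np.zeros(2)
H = np.eye(2)                        # H0 = I, approximates the inverse Hessian
for k in range(2):                   # finite termination: n = 2 steps here
    g = grad(x)
    d = -H @ g                       # step 1: search direction
    alpha = -(g @ d) / (d @ Q @ d)   # step 2: exact line search (quadratic case)
    x_new = x + alpha * d            # step 3: take the step
    s = x_new - x
    t = grad(x_new) - g
    Ht = H @ t                       # step 4: DFP update of H
    H = H + np.outer(s, s) / (s @ t) - np.outer(Ht, Ht) / (t @ Ht)
    x = x_new

print(np.allclose(x, np.linalg.solve(Q, b)))   # True: minimizer found in n steps
```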