Lectures 10-11

 

Modified Newton methods

 

In engineering design problems, function evaluation itself can be expensive.  Derivative calculations, especially if done numerically, are expensive (but required, at least to verify optimality!).  Second derivative calculations in this context are a large computational burden: constructing the Hessian matrix of second derivatives numerically takes of the order of n² function evaluations, and a system of linear equations must then be solved to obtain the Newton direction.  If we add to this the fact that, after all this trouble, the Newton direction (far away from a minimum) may not even be a descent direction, then it is clear that some modifications are called for.

 

[Try to check, even with simple functions of one or two variables, that [a] the Newton direction may be a descent direction but may not be well scaled, and [b] with functions of two variables, that it may not even be a descent direction when a point is far from the minimum.  Constructing such examples is an illuminating exercise in itself.]
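Part [b] of this exercise can be checked numerically.  Below is a minimal sketch using the nonconvex test function f(x, y) = x⁴ − x² + y² (an assumed example, not from these notes): near x = 0 its Hessian is indefinite, and the Newton direction comes out as an ascent direction.

```python
import numpy as np

# Assumed test function f(x, y) = x**4 - x**2 + y**2, chosen because its
# Hessian diag(12x^2 - 2, 2) is indefinite for small |x|.
def grad(p):
    x, y = p
    return np.array([4*x**3 - 2*x, 2*y])

def hess(p):
    x, y = p
    return np.array([[12*x**2 - 2, 0.0], [0.0, 2.0]])

p = np.array([0.1, 0.01])               # far from the minima at (+-1/sqrt(2), 0)
g = grad(p)
d_newton = -np.linalg.solve(hess(p), g)  # Newton direction

# Descent requires grad.d < 0; here the product is positive (ascent).
print(g @ d_newton)
```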

 

Levenberg-Marquardt method: One modification is the Levenberg-Marquardt modification, where a positive multiple λ of the identity matrix is added to the second derivative matrix to get a positive definite matrix Bk = ∇²f(xk) + λI, which is then inverted to get the descent direction −Bk⁻¹∇f(xk).  Note that taking λ very large essentially gives the steepest descent direction, while taking λ = 0 gives the Newton method.  We keep track of the eigenvalues of ∇²f(xk) at every stage and check that it remains positive definite, so as to give descent.  If not, we add a suitable multiple of the identity as above and proceed.  This also converges globally, and close to the minimum, essentially reverts to the Newton formula.  Note that positive definiteness is checked through efficient factorizations and not by finding the roots of the characteristic polynomial, which is an unstable numerical computation.  This gives descent and a robust algorithm, but the problem of computing ∇²f(xk) and solving with it at every stage remains.
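A minimal sketch of this idea, with assumed illustrative values for the Hessian and gradient (not from the notes), using a Cholesky factorization as the efficient positive-definiteness check:

```python
import numpy as np

def lm_direction(H, g, lam0=1e-3, factor=10.0):
    # Increase lam until H + lam*I is positive definite (the Cholesky
    # factorization succeeds exactly when the shifted matrix is positive
    # definite), then return the direction -(H + lam*I)^{-1} g.
    lam = lam0
    n = H.shape[0]
    while True:
        try:
            np.linalg.cholesky(H + lam * np.eye(n))
            break
        except np.linalg.LinAlgError:
            lam *= factor
    return np.linalg.solve(H + lam * np.eye(n), -g)

# Indefinite Hessian and a gradient (illustrative values)
H = np.array([[-1.88, 0.0], [0.0, 2.0]])
g = np.array([-0.196, 0.02])
d = lm_direction(H, g)
print(g @ d)     # negative: descent is restored despite the indefinite Hessian
```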

 

Conjugate direction methods: Recall that conjugate direction methods worked well (starting with finite n step convergence for quadratic functions and suitable extrapolations for non-linear functions) without having to do second derivative calculations and matrix inversion.  They are in between the steepest descent method (very few calculations at every step, but can take a large number of iterations and can have unacceptable numerical performance) on the one hand and Newton’s method (considerable effort at each step, but locally very effective in the number of steps required to converge) on the other.

 

Motivating ideas for Quasi Newton methods: Quasi Newton methods, which are currently the most robust and effective algorithms for unconstrained optimization, are based on the following set of ideas.

 

[A + uvᵀ]⁻¹ = A⁻¹ − (1/(1+k)) A⁻¹uvᵀA⁻¹, where k = vᵀA⁻¹u (assuming 1 + k ≠ 0)

[This is the Sherman-Morrison formula; you can verify it by direct multiplication.]

Note that if A⁻¹ is known, this is much faster than computing [A + uvᵀ]⁻¹ directly.

This is a rank one update (uvᵀ is a rank one matrix) of the original matrix A.  In particular, A + uuᵀ is a symmetric rank one update.
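The formula is easy to verify numerically.  A sketch with an arbitrary well-conditioned random matrix (assuming 1 + k ≠ 0, which holds here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned matrix
u = rng.standard_normal(n)
v = rng.standard_normal(n)

Ainv = np.linalg.inv(A)
k = v @ Ainv @ u
# Sherman-Morrison rank one update of the inverse (note the minus sign)
updated_inv = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1 + k)

direct = np.linalg.inv(A + np.outer(u, v))        # inverse computed from scratch
print(np.allclose(updated_inv, direct))           # True
```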

 

Quasi Newton methods put all these ideas together to construct approximations Bk to the Hessian matrix at each stage.  Note that some updates work on Bk, updating it and then finding its inverse, whereas others work directly on the inverse of the second derivative approximation (usually called Hk in textbooks).

 

[Refer to Chong and Zak, Nocedal and Wright, Fletcher and other books in your library for more details on Quasi Newton methods and their numerical properties.]

 

Quasi Newton condition

 

Consider two successive iterates xk and xk+1, and let s = xk+1 − xk and t = ∇f(xk+1) − ∇f(xk) for convenience.  If the matrix Bk+1 is to approximate the second derivative matrix at stage k+1, it should ideally satisfy the following condition:

Quasi-Newton condition : Bk+1 s = t

Quasi-Newton condition for the inverse : Hk+1 t = s

 

In matrix-vector terms, the difference in the gradient for a difference in the variable should be captured by the second derivative.  ['Dividing' by the vector s, which is strictly forbidden in vector arithmetic, may give you the appearance of a familiar derivative-like formula based on finite differences.]  All Quasi Newton formulas for Bk+1 (or for the inverse Hk+1) satisfy this QN condition.  This still leaves a number of options for Bk+1 or Hk+1 [note that we only have n equations for the n² entries of B or H].
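For a quadratic the condition is exact: if f(x) = ½xᵀQx − bᵀx (an assumed example), the gradient is Qx − b, so the gradient difference t equals Qs and the true Hessian satisfies Bs = t for any pair of iterates.

```python
import numpy as np

# Assumed quadratic f(x) = 0.5 x^T Q x - b^T x with gradient Qx - b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
grad = lambda x: Q @ x - b

xk  = np.array([0.0, 0.0])          # two arbitrary successive iterates
xk1 = np.array([0.5, -0.25])
s = xk1 - xk
t = grad(xk1) - grad(xk)

print(np.allclose(Q @ s, t))         # True: the Hessian Q satisfies B s = t
```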

 

The next condition is that it should be easy to compute the inverse.  This gives rise to a set of efficient updates (i.e. either Hk+1 given Hk, or a Bk+1 that is easy to invert given Bk).

 

Updates for Hk or Bk

 

Symmetric rank one update : Bk+1 = Bk + α(uuᵀ) for some vector u and scalar α, which are determined using the QN condition.  This may sometimes work, but in general it does not preserve positive definiteness and so may not give descent at every stage; it is therefore not a favoured method.

 

Symmetric rank two updates : Bk+1 = Bk + α(uuᵀ) + β(vvᵀ) for some vectors u and v and scalar constants α and β.  This gives a wide choice of methods; two of the most popular are the DFP (Davidon-Fletcher-Powell) update, which is written directly in terms of H, and the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update, with combinations of these called the Broyden family of updates.

 

DFP : Hk+1 = Hk + (1/(sᵀt)) ssᵀ − (1/(tᵀHkt)) [Hkt][Hkt]ᵀ

BFGS : Bk+1 = Bk + (1/(sᵀt)) ttᵀ − (1/(sᵀBks)) [Bks][Bks]ᵀ

 

In the BFGS case, the rank one inverse updating formula has to be applied twice (the update to Bk is of rank two) to get Hk+1, which can be written out explicitly.  The resulting updates for Hk+1 in both cases are far more efficient than computing the inverse from scratch.  These updates preserve symmetry and positive definiteness (provided sᵀt > 0), result in conjugate directions in the quadratic case, and satisfy the QN condition.  The BFGS seems to be the best general-purpose update in practice.
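The claimed properties of the BFGS formula can be checked directly.  A sketch with assumed values for s and t (chosen so that sᵀt > 0), verifying the QN condition and positive definiteness after one update:

```python
import numpy as np

def bfgs_update(B, s, t):
    """One BFGS update of the Hessian approximation B, as in the notes."""
    Bs = B @ s
    return B + np.outer(t, t) / (t @ s) - np.outer(Bs, Bs) / (s @ Bs)

B = np.eye(2)                      # start from the identity
s = np.array([1.0, 0.5])           # assumed step and gradient difference,
t = np.array([2.0, 1.5])           # with s.t > 0
B1 = bfgs_update(B, s, t)

print(np.allclose(B1 @ s, t))                 # QN condition B1 s = t holds
print(np.all(np.linalg.eigvalsh(B1) > 0))     # positive definiteness preserved
```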

 

General QN algorithm for minimizing a function f

 

Start with x0 and H0 = I (approximation to the inverse of the Hessian)

At step k,

  1. dk = -Hk ∇f(xk)
  2. Find αk so as to (exactly or approximately) minimize f(xk + αk dk)
  3. xk+1 = xk + αk dk
  4. Update Hk+1 (as per various methods discussed above)

Continue until a termination condition is satisfied.
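The loop above can be sketched end to end using the DFP update of H from these notes, on an assumed quadratic test problem where the exact line search has the closed form αk = −(gᵀd)/(dᵀQd).  With exact line searches on a quadratic, the method terminates in n steps.

```python
import numpy as np

# Assumed quadratic test problem f(x) = 0.5 x^T Q x - b^T x
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: Q @ x - b

x = np.zeros(2)
H = np.eye(2)                        # H0 = I, approximates the inverse Hessian
for k in range(2):                   # finite termination: n = 2 steps here
    g = grad(x)
    d = -H @ g                       # step 1: search direction
    alpha = -(g @ d) / (d @ Q @ d)   # step 2: exact line search (quadratic case)
    x_new = x + alpha * d            # step 3: take the step
    s = x_new - x
    t = grad(x_new) - g
    Ht = H @ t                       # step 4: DFP update of H
    H = H + np.outer(s, s) / (s @ t) - np.outer(Ht, Ht) / (t @ Ht)
    x = x_new

print(np.allclose(x, np.linalg.solve(Q, b)))   # True: minimizer found in n steps
```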