Lectures 7-9

 

Reminder: Quiz 1 coming up next week.  Please study minimization of functions of one variable and unconstrained optimization (optimality conditions and algorithms).

 

First order necessary condition for a (local) minimum

 

We re-state a condition for a local min of a real-valued (continuously differentiable) function f defined over a set K in Rn, as follows:

If the vector x* is a local minimum of f, then the condition ∇f(x*)Td >= 0 must hold for all feasible directions d.  A feasible direction d at x* ∈ K is a vector d such that (x* + ad) ∈ K for all a in some interval [0, a’].  If the set K is all of Rn, or if the point x* belongs to the interior of the set K (a notion you can make precise), then all directions d are feasible; applying the condition to both d and -d shows that the only way it can be satisfied is ∇f(x*) = 0 [verify this].

 

This is only a necessary condition (the same condition holds for a local maximum, as well as for points of inflexion and saddle points).  A point where ∇f(x*) = 0 is called a stationary point.  Most analytical techniques attempt to find a stationary point (and, in fact, can make progress only when the current point is not stationary).

 

This condition can be verified by using the first-order approximation of f around the point x.  If ∇f(x)Td < 0, we would have f(x + ad) < f(x) for all sufficiently small a > 0.  This would say that x is not a local min.  In the unconstrained case, if ∇f(x) is not equal to zero, a choice of d that gives descent is the vector -∇f(x).  This forms the basis for the steepest descent algorithm, where this direction is combined with a line search step (usually stated as an exact line search).
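The first-order descent argument can be checked numerically.  Below is a minimal sketch using a made-up test function (the function, its gradient, and the step length are illustrative assumptions, not from the notes): moving a small distance along -∇f(x) must reduce f.

```python
import numpy as np

# Hypothetical test function (illustrative, not from the notes):
# f(x, y) = (x - 1)^2 + 2 (y + 2)^2
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])

x = np.array([3.0, 0.0])        # an arbitrary non-stationary point
d = -grad_f(x)                  # the negative gradient direction
alpha = 1e-3                    # a "sufficiently small" step length
assert grad_f(x) @ d < 0        # ∇f(x)^T d < 0: d is a descent direction
assert f(x + alpha * d) < f(x)  # so a small step along d decreases f
```

The same check works for any descent direction d, i.e. any d with ∇f(x)Td < 0.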

 

Steepest descent method (also called Cauchy method)

 

The steepest descent method is easy to implement and is very robust.  It is globally convergent under very general assumptions (i.e. starting from anywhere, it converges to a stationary point, which is usually a local minimum).  For a quadratic function with circular contour lines, it finds the minimum in one step (with an exact line search), since the steepest descent direction points at the minimum from every point. 

 

The method performs poorly if the contours of the function are elongated ellipsoids, where the method zigzags, resulting in unacceptably slow progress.  This notion is made precise (for quadratic functions of the form cTx + ½ xTQx) by the concept of the condition number of the second derivative matrix (the constant matrix Q).  For a positive definite matrix Q, the condition number is the ratio of the largest eigenvalue to the smallest eigenvalue.  The bigger this number, the slower the rate of convergence (see Chong and Zak for nice proofs and insights into the behaviour of this class of methods).

 

Improved descent methods

 

There are a number of improvements possible over the basic steepest descent method.  One is to do an inexact line search, which may reduce the overall computational effort (note that the standard convergence results for the steepest descent method count iterations, each involving a gradient computation, and assume an exact line search, which in practice may itself consume quite a bit of the computational effort).

 

But the main source of improvement is in terms of better directions for search.  We note that if the gradient vector ∇f(x) is not equal to zero, there are actually a number of descent directions one can use (other than the negative gradient vector, i.e. -∇f(x)).  In particular, the direction -B∇f(x) is guaranteed to be a descent direction at x for any positive definite matrix B.  It is convenient to take B as an approximation that captures some second order information about the function at that point.  We will return to this idea after discussing the Newton method.
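The claim that -B∇f(x) is a descent direction follows from gTd = -gTBg < 0 for positive definite B.  A quick numerical sanity check (the random gradient and the construction of B are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=4)           # stand-in for a nonzero gradient vector
A = rng.normal(size=(4, 4))
B = A @ A.T + np.eye(4)          # positive definite by construction
d = -B @ g                       # candidate search direction
assert g @ d < 0                 # g^T d = -g^T B g < 0: a descent direction
```

Taking B = I recovers steepest descent; taking B as (an approximation to) the inverse Hessian anticipates the Newton and quasi-Newton methods discussed below.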

 

Two methods that use better directions than the steepest descent direction are conjugate direction methods and conjugate gradient methods.  Both of these are best understood for quadratic functions and can be suitably generalized to general non-linear functions. 

 

Conjugate direction and conjugate gradient methods

 

For a quadratic function cTx + ½ xTQx, the directions d1, …, dn are said to be conjugate with respect to (the symmetric matrix) Q if diTQdj = 0 for i different from j.  A set of conjugate directions forms a better set to search over: it can be shown that exact line searches done sequentially along a set of n conjugate directions minimize a quadratic function exactly.  Note that there are many sets of conjugate directions for a given Q.
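The n-step property can be verified directly.  The sketch below (the matrix Q, vector c, and starting point are made-up examples) builds a second direction conjugate to e1 by Gram–Schmidt in the Q-inner product, then does one exact line search along each direction; for the quadratic, the exact step along d is -(gTd)/(dTQd).

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite example
c = np.array([-1.0, -2.0])               # f(x) = c^T x + 0.5 x^T Q x

# Q-conjugate the standard basis via Gram-Schmidt with the Q-inner product
d1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
d2 = e2 - (e2 @ Q @ d1) / (d1 @ Q @ d1) * d1
assert abs(d1 @ Q @ d2) < 1e-12          # conjugacy: d1^T Q d2 = 0

x = np.array([5.0, 5.0])                 # arbitrary starting point
for d in (d1, d2):                       # one exact line search per direction
    g = c + Q @ x                        # current gradient
    alpha = -(g @ d) / (d @ Q @ d)       # exact minimizer along d
    x = x + alpha * d

x_star = np.linalg.solve(Q, -c)          # exact minimizer solves Qx* = -c
assert np.allclose(x, x_star)            # minimized in exactly n = 2 searches
```

Running the same two line searches along the (non-conjugate) coordinate axes would not land on the minimizer, which is the point of conjugacy.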

 

Rather than defining the full set of conjugate directions right at the beginning, one can propose an iterative scheme that generates them using gradient information; this is the conjugate gradient method.  It forms the basis for quite a powerful, general-purpose technique for general non-linear functions (with a suitably generalized interpretation of conjugate directions), and you can refer to Belegundu and Chandrupatla, Deb, Chong and Zak and other books for more details.

 

Second order methods

 

The most powerful techniques for finite dimensional optimization on Rn are those based on Newton’s method.  When these work, there are very few competing methods.  They are based on second derivatives (curvature) and quadratic model functions, used to iteratively generate search directions.  The method provides a good balance between computational work per iteration and speed of the algorithm (time to find a “good” solution).  Even when these methods do not work (for reasons we will explain), the motivating ideas help us define good approximations that work well.

 

Before studying the method, we need to look at second order necessary and sufficient conditions for optimality.  Define ∇2f(x*) as the n x n matrix of second partial derivatives of f (i.e. ∇2f(x*)|(i,j) is the second partial derivative ∂2f/∂xi∂xj evaluated at x*).  This is a real symmetric matrix for the functions of interest to us.

 

Second order necessary and sufficient conditions

 

The vector x* is a local minimum of a twice continuously differentiable function f only if ∇f(x*) = 0 and ∇2f(x*) is positive semidefinite.

 

This again is easy to see from the Taylor expansion of f around x*, viz. f(x* + h) = f(x*) + ∇f(x*)Th + ½ hT∇2f(x*)h + a remainder that goes to zero faster than ||h||2.  Since ∇f(x*) = 0, for x* to be a local min (i.e. for all small h), the third term on the RHS must be >= 0, which is what positive semidefiniteness means.

 

Note that ∇2f(x*) is a real, symmetric matrix.  Such a matrix is positive semidefinite (positive definite) if and only if all its eigenvalues - which are guaranteed to be real - are non-negative (positive).  There are other checks for positive definiteness, involving determinants of the principal submatrices, but the eigenvalue test is the most convenient.
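The eigenvalue test is one line in code.  A sketch (the helper names, tolerance, and the two example matrices are illustrative assumptions):

```python
import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    # eigvalsh: eigenvalues of a real symmetric matrix (guaranteed real)
    return np.all(np.linalg.eigvalsh(H) >= -tol)

def is_positive_definite(H, tol=1e-10):
    return np.all(np.linalg.eigvalsh(H) > tol)

H1 = np.array([[2.0, 0.0], [0.0, 3.0]])   # eigenvalues 2, 3
H2 = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3, -1
assert is_positive_definite(H1)
assert not is_positive_semidefinite(H2)   # one negative eigenvalue
```

The small tolerance guards against eigenvalues that are zero in exact arithmetic coming out as tiny negative numbers in floating point.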

 

It is known [please revise this from your linear algebra course] that a real symmetric matrix is diagonalizable through a factorization as follows: ∇2f(x*) = UTSU, where S is a diagonal matrix containing the eigenvalues of the LHS matrix and U is an orthogonal matrix whose rows are unit-norm eigenvectors corresponding to the n eigenvalues of ∇2f(x*).  You can use this factorization to verify the correspondence between eigenvalues and positive semidefiniteness (positive definiteness). 

 

So the necessary condition for local optimality at x* for a twice continuously differentiable function f (for simplicity, we just state the unconstrained version) is that ∇f(x*) = 0 and ∇2f(x*) is positive semidefinite.  If ∇f(x*) = 0 and ∇2f(x*) is positive definite, in the same setting, then these conditions are sufficient and x* is indeed a local minimum.

 

Remark: The fact that there is no single necessary and sufficient condition of this type can be verified by examining the functions x4, x3, -x4, etc. defined on R, at x = 0: in each case the second derivative (here a number) is zero at x = 0, yet the point may be a minimum, a maximum, or neither a minimum nor a maximum.

 

Some exercises

 

 

 

 

For systems of linear equations, you would recall the existence and uniqueness results for a system Ax - b = 0, for m x n matrices A of various dimensions [please revise this material].  Extend this understanding to the system of non-linear equations f(x) = 0.

 

 

This problem has numerous applications in statistics, signal processing and other areas.

 

 

Newton’s method

 

The pure Newton method (without a line search) for finding the minimum of a function f defined on Rn is based on constructing a quadratic approximation of the function at every iteration, and taking the minimizer of that quadratic as the next iterate, as follows:

 

Given the iterate xk, construct the model function

f(xk + h) ≈ f(xk) + ∇f(xk)Th + ½ hT∇2f(xk)h

and find h to minimize the RHS.  This is given by hk = -∇2f(xk)-1∇f(xk), so that the next iterate xk+1 is given by xk + hk, i.e. xk - ∇2f(xk)-1∇f(xk).
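The iteration above is a few lines in code.  A sketch on a hypothetical smooth test function (the function, its derivatives, the starting point, and the fixed iteration count are all illustrative assumptions); note that in practice one solves the linear system ∇2f(xk) hk = -∇f(xk) rather than forming the inverse.

```python
import numpy as np

# Hypothetical test function (illustrative, not from the notes):
# f(x, y) = (x - 1)^4 + (x - 1)^2 + (y + 2)^2, with minimizer (1, -2)
def grad(x):
    return np.array([4 * (x[0] - 1) ** 3 + 2 * (x[0] - 1),
                     2 * (x[1] + 2)])

def hess(x):
    return np.array([[12 * (x[0] - 1) ** 2 + 2, 0.0],
                     [0.0, 2.0]])

x = np.array([2.0, 0.0])
for _ in range(20):                        # pure Newton: no line search
    h = np.linalg.solve(hess(x), -grad(x)) # solve the system; don't invert
    x = x + h
print(x)  # ≈ [1, -2]
```

On this example the y-coordinate (a pure quadratic) is solved in one step, while the x-coordinate takes several steps before the fast local phase sets in.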

 

Local convergence

 

It can be shown that near a minimum the Newton direction is not only a descent direction but is also well scaled, so that no line search needs to be performed, and the convergence to the minimum is quadratic (i.e. the errors decrease at a quadratic rate, which is very fast).  But this is valid only near a minimum (i.e. if xk is “close” to x*).  This is called local convergence.
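Quadratic convergence means each error is roughly proportional to the square of the previous one (the number of correct digits roughly doubles per step).  A one-dimensional sketch (the test function and starting point are illustrative choices): minimize f(x) = e^x - 2x, so f'(x) = e^x - 2, f''(x) = e^x, and the minimizer is x* = ln 2.

```python
import math

x_star = math.log(2.0)
x = 0.0                          # a "close enough" starting point
errors = []
for _ in range(6):
    # Newton step on f': x <- x - f'(x)/f''(x)
    x = x - (math.exp(x) - 2.0) / math.exp(x)
    errors.append(abs(x - x_star))
print(errors)  # each error is roughly the square of the previous one
```

Printing the errors shows the doubling of correct digits per iteration until machine precision is reached, the signature of quadratic convergence.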

 

Drawbacks of the Newton method

 

The two major drawbacks of the Newton method are the following: first, each iteration requires computing the full second derivative matrix ∇2f(xk) and solving a linear system with it, which is expensive for large n; second, away from a minimum, ∇2f(xk) need not be positive definite (or even invertible), so the Newton direction need not be a descent direction and the pure iteration may fail to converge.

 

 

 

Quasi Newton methods address all the above concerns.

 

Some suggestions for independent project work. 

 

Those who are interested in using Mathematica can explore the book Practical Optimization Methods using Mathematica by M. Asghar Bhatti, Springer, 2000 in the library (or other references) and present a paper based on it.

 

Those interested in Matlab exercises can follow the book of Chong and Zak, and also see the Matlab Optimization Toolbox either on the local network or the documentation at

http://www.mathworks.com/access/helpdesk/help/pdf_doc/optim/optim_tb.pdf

 

The book by Belegundu and Chandrupatla comes with a CD of Fortran code, and K. Deb’s book also has sample code for several implementable optimization methods.

 

Those who are interested in applications may please note that modeling the problem appropriately is an important and non-trivial step.  Modeling, at the least, involves identifying the decision variables, the objective function(s) and the constraints.

Note that in problems with many objectives, this could be a multi-step exercise where some of the objectives are turned into constraints, either with target values or with bounds.  If this is to be done systematically, one way to do it is through goal programming.

 

Students who would like to work on and present an application can send me an email with a one-paragraph write-up of their intended application.