Lectures 7-9

 

Reminder: Quiz 1 coming up next week.  Please study minimization of functions of one variable and unconstrained optimization (optimality conditions and algorithms).

 

First order necessary condition for a (local) minimum

 

We re-state a condition for a local min of a real-valued (continuously differentiable) function f defined over a set K in Rn, as follows:

If the vector x* is a local minimum of f, then the condition ∇f(x*)Td >= 0 must hold for all feasible directions d.  A feasible direction d at x* ∈ K is a vector d such that (x* + ad) ∈ K for all a in some interval [0, a’].  If the set K is all of Rn, or if the point x* belongs to the interior of the set K (a notion you can make precise), then all directions d are feasible; applying the condition to both d and -d shows that the only way it can be satisfied is ∇f(x*) = 0 [verify this].

 

This is only a necessary condition (the same condition holds for a local maximum, as well as for points of inflexion and saddle points).  A point where ∇f(x*) = 0 is called a stationary point.  Most analytical techniques attempt to find a stationary point (and, in fact, can make progress only when the current point is not stationary).

 

This condition can be verified by using the first-order approximation of f around the point x.  If ∇f(x)Td < 0, we would have f(x + ad) < f(x) for all sufficiently small a > 0.  This would say that x is not a local min.  In the unconstrained case, if ∇f(x) is not equal to zero, a choice of d that gives descent is the vector -∇f(x).  This forms the basis for the steepest descent algorithm, where this direction is combined with a line search step (usually stated as an exact line search).
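The first-order descent argument can be checked numerically.  Below is a minimal sketch using a made-up test function (the function, its gradient, and the step length are illustrative assumptions, not from the notes): moving a small distance along -∇f(x) must reduce f.

```python
import numpy as np

# Hypothetical test function (illustrative, not from the notes):
# f(x, y) = (x - 1)^2 + 2 (y + 2)^2
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])

x = np.array([3.0, 0.0])        # an arbitrary non-stationary point
d = -grad_f(x)                  # the negative gradient direction
alpha = 1e-3                    # a "sufficiently small" step length
assert grad_f(x) @ d < 0        # ∇f(x)^T d < 0: d is a descent direction
assert f(x + alpha * d) < f(x)  # so a small step along d decreases f
```

The same check works for any descent direction d, i.e. any d with ∇f(x)Td < 0.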

 

Steepest descent method (also called Cauchy method)

 

The steepest descent method is easy to implement and is very robust.  It is globally convergent under very general assumptions (i.e. starting from anywhere, it converges to a stationary point, which is usually a local minimum).  For a quadratic function with circular contour lines, it finds the minimum in one step (with an exact line search), since the steepest descent direction points at the minimum from every point. 

 

The method performs poorly if the contours of the function are elongated ellipsoids, where the method zigzags, resulting in unacceptably slow progress.  This notion is made precise (for quadratic functions of the form cTx + ½ xTQx) by the concept of the condition number of the second derivative matrix (the constant matrix Q).  For a positive definite matrix Q, the condition number is the ratio of the largest eigenvalue to the smallest eigenvalue.  The bigger this number, the slower the rate of convergence (see Chong and Zak for nice proofs and insights into the behaviour of this class of methods).

 

Improved descent methods

 

There are a number of improvements possible over the basic steepest descent method.  One is to do an inexact line search, which may reduce the overall computational effort (note that the standard convergence results for the steepest descent method count iterations, each involving a gradient computation, and assume an exact line search, which in practice may itself consume quite a bit of the computational effort).

 

But the main source of improvement is in terms of better directions for search.  We note that if the gradient vector ∇f(x) is not equal to zero, there are actually a number of descent directions one can use (other than the negative gradient vector, i.e. -∇f(x)).  In particular, the direction -B∇f(x) is guaranteed to be a descent direction at x for any positive definite matrix B.  It is convenient to take B as an approximation that captures some second order information about the function at that point.  We will return to this idea after discussing the Newton method.
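The claim that -B∇f(x) is a descent direction follows from gTd = -gTBg < 0 for positive definite B.  A quick numerical sanity check (the random gradient and the construction of B are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=4)           # stand-in for a nonzero gradient vector
A = rng.normal(size=(4, 4))
B = A @ A.T + np.eye(4)          # positive definite by construction
d = -B @ g                       # candidate search direction
assert g @ d < 0                 # g^T d = -g^T B g < 0: a descent direction
```

Taking B = I recovers steepest descent; taking B as (an approximation to) the inverse Hessian anticipates the Newton and quasi-Newton methods discussed below.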

 

Two methods that use better directions than the steepest descent direction are conjugate direction methods and conjugate gradient methods.  Both of these are best understood for quadratic functions and can be suitably generalized to general non-linear functions. 

 

Conjugate direction and conjugate gradient methods

 

For a quadratic function cTx + ½ xTQx, the directions d1, …, dn are said to be conjugate with respect to (the symmetric matrix) Q if diTQdj = 0 for i different from j.  A set of conjugate directions forms a better set to search over: it can be shown that exact line searches done sequentially along a set of n conjugate directions minimize a quadratic function exactly.  Note that there are many sets of conjugate directions for a given Q.
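The n-step property can be verified directly.  The sketch below (the matrix Q, vector c, and starting point are made-up examples) builds a second direction conjugate to e1 by Gram–Schmidt in the Q-inner product, then does one exact line search along each direction; for the quadratic, the exact step along d is -(gTd)/(dTQd).

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite example
c = np.array([-1.0, -2.0])               # f(x) = c^T x + 0.5 x^T Q x

# Q-conjugate the standard basis via Gram-Schmidt with the Q-inner product
d1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
d2 = e2 - (e2 @ Q @ d1) / (d1 @ Q @ d1) * d1
assert abs(d1 @ Q @ d2) < 1e-12          # conjugacy: d1^T Q d2 = 0

x = np.array([5.0, 5.0])                 # arbitrary starting point
for d in (d1, d2):                       # one exact line search per direction
    g = c + Q @ x                        # current gradient
    alpha = -(g @ d) / (d @ Q @ d)       # exact minimizer along d
    x = x + alpha * d

x_star = np.linalg.solve(Q, -c)          # exact minimizer solves Qx* = -c
assert np.allclose(x, x_star)            # minimized in exactly n = 2 searches
```

Running the same two line searches along the (non-conjugate) coordinate axes would not land on the minimizer, which is the point of conjugacy.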

 

Rather than defining the full set of conjugate directions right at the beginning, one can propose an iterative scheme that generates them using gradient information; this is the conjugate gradient method.  It forms the basis for quite a powerful, general-purpose technique for general non-linear functions (with a suitably generalized interpretation of conjugate directions), and you can refer to Belegundu and Chandrupatla, Deb, Chong and Zak and other books for more details.

 

Second order methods

 

The most powerful techniques for finite dimensional optimization on Rn are those based on Newton’s method.  When these work, there are very few competing methods.  They are based on second derivatives (curvature) and quadratic model functions, used to iteratively generate search directions.  The method provides a good balance between computational work per iteration and speed of the algorithm (time to find a “good” solution).  Even when these methods do not work (for reasons we will explain), the motivating ideas help us define good approximations that work well.

 

Before studying the method, we need to look at second order necessary and sufficient conditions for optimality.  Define ∇2f(x*) as the n x n matrix of second partial derivatives of f (i.e. ∇2f(x*)|(i,j) is the second partial derivative ∂2f/∂xi∂xj evaluated at x*).  This is a real symmetric matrix for the functions of interest to us.

 

Second order necessary and sufficient conditions

 

The vector x* is a local minimum of a twice continuously differentiable function f only if ∇f(x*) = 0 and ∇2f(x*) is positive semidefinite.

 

This again is easy to see from the Taylor expansion of f around x*, viz. f(x* + h) = f(x*) + ∇f(x*)Th + ½ hT∇2f(x*)h + a remainder that goes to zero faster than ||h||2.  Since ∇f(x*) = 0, for x* to be a local min (i.e. for all small h), the third term on the RHS must be >= 0, which is what positive semidefiniteness means.

 

Note that ∇2f(x*) is a real, symmetric matrix.  Such a matrix is positive semidefinite (positive definite) if and only if all its eigenvalues - which are guaranteed to be real - are non-negative (positive).  There are other checks for positive definiteness, involving determinants of the principal submatrices, but the eigenvalue test is the most convenient.
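The eigenvalue test is one line in code.  A sketch (the helper names, tolerance, and the two example matrices are illustrative assumptions):

```python
import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    # eigvalsh: eigenvalues of a real symmetric matrix (guaranteed real)
    return np.all(np.linalg.eigvalsh(H) >= -tol)

def is_positive_definite(H, tol=1e-10):
    return np.all(np.linalg.eigvalsh(H) > tol)

H1 = np.array([[2.0, 0.0], [0.0, 3.0]])   # eigenvalues 2, 3
H2 = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3, -1
assert is_positive_definite(H1)
assert not is_positive_semidefinite(H2)   # one negative eigenvalue
```

The small tolerance guards against eigenvalues that are zero in exact arithmetic coming out as tiny negative numbers in floating point.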

 

It is known [please revise this from your linear algebra course] that a real symmetric matrix is diagonalizable through a factorization as follows: ∇2f(x*) = UTSU, where S is a diagonal matrix containing the eigenvalues of the LHS matrix and U is an orthogonal matrix whose rows are unit-norm eigenvectors corresponding to the n eigenvalues of ∇2f(x*).  You can use this factorization to verify the correspondence between eigenvalues and positive semidefiniteness (positive definiteness). 

 

So the necessary condition for local optimality at x* for a twice continuously differentiable function f (for simplicity, we just state the unconstrained version) is that ∇f(x*) = 0 and ∇2f(x*) is positive semidefinite.  If ∇f(x*) = 0 and ∇2f(x*) is positive definite, in the same setting, then these conditions are sufficient and x* is indeed a local minimum.

 

Remark: The fact that there is no single necessary and sufficient condition of this type can be verified by examining the functions x4, x3, -x4, etc. defined on R, at x = 0: in each case the second derivative (here a number) is zero at x = 0, yet the point may be a minimum, a maximum, or neither a minimum nor a maximum.

 

Some exercises

 

 

 

 

For systems of linear equations, you would recall the existence and uniqueness results for a system Ax - b = 0, for m x n matrices A of various dimensions [please revise this material].  Extend this understanding to the system of non-linear equations f(x) = 0.

 

 

This problem has numerous applications in statistics, signal processing and other areas.

 

 

Newton’s method

 

The pure Newton method (without a line search) for finding the minimum of a function f defined on Rn is based on constructing a quadratic approximation of the function at every iteration, and taking the minimizer of that quadratic as the next iterate, as follows:

 

Given the iterate xk, construct the model function

f(xk + h) ≈ f(xk) + ∇f(xk)Th + ½ hT∇2f(xk)h

and find h to minimize the RHS.  This is given by hk = -∇2f(xk)-1∇f(xk), so that the next iterate xk+1 is given by xk + hk, i.e. xk - ∇2f(xk)-1∇f(xk).
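The iteration above is a few lines in code.  A sketch on a hypothetical smooth test function (the function, its derivatives, the starting point, and the fixed iteration count are all illustrative assumptions); note that in practice one solves the linear system ∇2f(xk) hk = -∇f(xk) rather than forming the inverse.

```python
import numpy as np

# Hypothetical test function (illustrative, not from the notes):
# f(x, y) = (x - 1)^4 + (x - 1)^2 + (y + 2)^2, with minimizer (1, -2)
def grad(x):
    return np.array([4 * (x[0] - 1) ** 3 + 2 * (x[0] - 1),
                     2 * (x[1] + 2)])

def hess(x):
    return np.array([[12 * (x[0] - 1) ** 2 + 2, 0.0],
                     [0.0, 2.0]])

x = np.array([2.0, 0.0])
for _ in range(20):                        # pure Newton: no line search
    h = np.linalg.solve(hess(x), -grad(x)) # solve the system; don't invert
    x = x + h
print(x)  # ≈ [1, -2]
```

On this example the y-coordinate (a pure quadratic) is solved in one step, while the x-coordinate takes several steps before the fast local phase sets in.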

 

Local convergence

 

It can be shown that near a minimum the Newton direction is not only a descent direction but is also well scaled, so that no line search needs to be performed, and the convergence to the minimum is quadratic (i.e. the errors decrease at a quadratic rate, which is very fast).  But this is valid only near a minimum (i.e. if xk is “close” to x*).  This is called local convergence.
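Quadratic convergence means each error is roughly proportional to the square of the previous one (the number of correct digits roughly doubles per step).  A one-dimensional sketch (the test function and starting point are illustrative choices): minimize f(x) = e^x - 2x, so f'(x) = e^x - 2, f''(x) = e^x, and the minimizer is x* = ln 2.

```python
import math

x_star = math.log(2.0)
x = 0.0                          # a "close enough" starting point
errors = []
for _ in range(6):
    # Newton step on f': x <- x - f'(x)/f''(x)
    x = x - (math.exp(x) - 2.0) / math.exp(x)
    errors.append(abs(x - x_star))
print(errors)  # each error is roughly the square of the previous one
```

Printing the errors shows the doubling of correct digits per iteration until machine precision is reached, the signature of quadratic convergence.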

 

Drawbacks of the Newton method

 

The two major drawbacks of the Newton method are the following: first, each iteration requires computing the full second derivative matrix ∇2f(xk) and solving a linear system with it, which is expensive for large n; second, away from a minimum, ∇2f(xk) need not be positive definite (or even invertible), so the Newton direction need not be a descent direction and the pure iteration may fail to converge.

 

 

 

Quasi Newton methods address all the above concerns.

 

Some suggestions for independent project work. 

 

Those who are interested in using Mathematica can explore the book Practical Optimization Methods using Mathematica by M. Asghar Bhatti, Springer, 2000 in the library (or other references) and present a paper based on it.

 

Those interested in Matlab exercises can follow the book of Chong and Zak, and also see the Matlab Optimization Toolbox either on the local network or the documentation at

http://www.mathworks.com/access/helpdesk/help/pdf_doc/optim/optim_tb.pdf

 

The book by Belegundu and Chandrupatla comes with a CD of Fortran code, and K. Deb’s book also has sample code for several implementable optimization methods.

 

Those who are interested in applications may please note that modeling the problem appropriately is an important and non-trivial step.  Modeling, at the least, involves identifying the decision variables, the objective function(s) and the constraints.

Note that in problems with many objectives, this could be a multi-step exercise where some of the objectives are turned into constraints, either with target values or with bounds.  If this is to be done systematically, one way to do it is through goal programming.

 

Students who would like to work on and present an application can send me an email with a one-paragraph write-up of their intended application.