A Newton Method for Linear Programming O. L. Mangasarian Computer Sciences Department University of Wisconsin 1210 West Dayton Street Madison, WI 5370...

0 downloads 9 Views 162KB Size

Loading...

1

Introduction

The method proposed here is motivated by the effectiveness of a finitely terminating Newton method proposed in [23] for the unconstrained minimization of strongly convex piecewise quadratic functions arising from quadratic programs and utilized successfully for classification problems in [10]. To apply this approach to linear programs, a reasonable choice is the least 2norm formulation [18, 24, 25, 11] of a linear program as a strongly convex 1

quadratic program, which gives an exact least 2-norm solution for finite parameter values. This has been done in [14] where, a finite Newton method was proposed but without any computational results and without giving an exact solution to the dual of the least 2-norm solution. In our approach here, we make no assumption about our linear program other than solvability and an implied uniqueness condition, which are explained in detail in Section 2. In Section 3 we state our Newton algorithm with an Armijo stepsize and give its global convergence. In Section 4 we give encouraging comparative test results with CPLEX 6.5 [5] on a class of synthetically generated sparse linear programs with as many as two million constraints as well as on six publicly available machine learning classification problems. We also give brief MATLAB codes for a test problem generator as well as a simple version of our Newton solver without an Armijo stepsize. This code is used to obtain all of our numerical results. On five out of seven synthetic test problems, the proposed method was faster than CPLEX by a factor in the range of 2.8 to 34.2. On the remaining two problems, CPLEX ran out of memory. On the six smaller machine learning classification problems, CPLEX was faster but LPN gave higher training and testing set classification correctness. We note that under the implied uniqueness assumption of our paper, all that is needed is that the number of constraints m of the primal linear program (2) be no less than the number of variables n of the problem. However the principal effectiveness of the proposed method is to problems where m is much bigger than n as evidenced by the numerical examples of Section 4. Other approaches to linear programs with very large number of constraints are given in [4, 15]. Another related approach that uses similar ideas to those presented here is given in [29], but does not recover the primal solution from the dual as we do. A word about our notation. All vectors will be column vectors unless transposed to a row vector by a prime superscript 0 . For a vector x in the n-dimensional real space R n , the plus function x+ is defined as (x+ )i = max {0, xi }, i = 1, . . . , n, while x∗ denotes the subgradient of x+ which is the step function defined as (x∗ )i = 1 if xi > 0, (x∗ )i = 0 if xi < 0, and (x∗ )i ∈ [0, 1] if xi = 0, i = 1, . . . , n. The scalar (inner) product of two vectors x and y in the n-dimensional real space R n will be denoted by x0 y. The 2-norm of x will be denoted by kxk, while kxk 1 and kxk∞ will denote the 1-norm and ∞-norm respectively. For a matrix A ∈ R m×n , Ai is the ith row of A which is a row vector in R n and kAk is the 2-norm of A: max kAxk. kxk=1

If S ⊂ {1, . . . , m}, then AS is the submatrix of A consisting of rows A i∈S . A column vector of ones of arbitrary dimension will be denoted by e and the 2

identity matrix of arbitrary order will be denoted by I. If f is a real valued function defined on the n-dimensional real space R n , the gradient of f at x is denoted by ∇f (x) which is a column vector in R n and the n × n matrix of second partial derivatives of f at x is denoted by ∇ 2 f (x). For a piecewise quadratic function such as, f (x) = 12 ||(Ax − b)+ ||2 , where A ∈ Rm×n and b ∈ Rm the ordinary Hessian does not exist because its gradient, the n × 1 vector ∇f (x) = A0 (Ax − b)+ , is not differentiable. However, one can define its generalized Hessian [12, 7, 23] which is the n × n symmetric positive semidefinite matrix: ∂ 2 f (x) = A0 diag(Ax − b)∗ A,

(1)

where diag(Ax − b)∗ denotes an m × m diagonal matrix with diagonal elements (Ai x−bi )∗ , i = 1, . . . , m. The generalized Hessian (1) has many of the properties of the regular Hessian [12, 7, 23] in relation to f (x). Throughout this work, the notation “:=” will denote definition and “s.t.” will stand for “such that”.

2

Equivalence of Primal Exterior Penalty LP to Dual Least 2-Norm LP

We give in this section an apparently overlooked, but implied [18, 30, 14], result that a parametric exterior penalty formulation of a linear program for any sufficiently small but finite value of the penalty parameter, provides an exact least 2-norm solution to the dual linear program. We begin with the primal linear program: min c0 x s.t. Ax ≤ b,

x∈Rn

(2)

where c ∈ Rn , A ∈ Rm×n and b ∈ Rm , and its dual: max − b0 u = − minm b0 u s.t. A0 u + c = 0, u ≥ 0.

u∈Rm

u∈R

(3)

We write the parametric exterior penalty formulation of the primal linear program for a fixed positive value of the penalty parameter as the unconstrained minimization problem: min f (x)

x∈Rn

3

(4)

where f is the penalty function: 1 f (x) := minn c0 x + k(Ax − b)+ k2 . x∈R 2

(5)

The positive penalty parameter needs to approach zero in order to obtain a solution to the original linear program [8, 2] from (4). However, this will not be the case if we look at the least 2-norm formulation [25, 18] of the dual linear program (3) : − minm (b0 v + v 0 v) s.t. A0 v + c = 0, v ≥ 0. v∈R 2

(6)

That (6) leads to a least 2-norm solution of the dual linear program (3) follows from the fact established in [25, 18] that for any positive such that ∈ (0, ¯] for some positive ¯, the minimization problem (6) picks among elements of the solution set of the dual linear program (3), that which min0 imizes the perturbation term v2v . Because the objective function of (6) is strongly convex, its solution v¯ is unique. The necessary and sufficient Karush-Kuhn-Tucker optimality conditions for problem (6) are that there exists a y ∈ Rn such that: v ≥ 0, v + b − Ay ≥ 0, v 0 (v + b − Ay) = 0, A0 v + c = 0,

(7)

or equivalently: v = (Ay − b)+ , A0 v + c = 0.

(8)

1 v = (Ay − b)+ , A0 (Ay − b)+ + c = 0.

(9)

That is:

Defining f (y) as in (5), the optimality conditions (9) for the dual least 2norm solution become: 1 v = (Ay − b)+ , ∇f (y) = A0 (Ay − b)+ + c = 0.

(10)

That is: 1 1 v = (Ay − b)+ , y ∈ arg minn f (y) = arg minn k(Ay − b)+ k2 + c0 y, y∈R y∈R 2 (11) which is precisely the necessary and sufficient condition that y be a minimum solution of the parametric exterior penalty function f (y) for the primal 4

linear program (2), for any positive value of the penalty parameter . Hence, solving the exterior penalty problem (4) for any positive provides a solution v = 1 (Ay − b)+ to (6). If in addition, ∈ (0, ¯] for some positive ¯, it follows, by [25, Theorem 1], that v is a least 2-norm solution to the dual linear program (3). We have thus established the following. Proposition 2.1 Equivalence of Least 2-Norm Dual to Finite-Parameter Penalty Primal The unique least 2-norm solution to the dual linear program (3) is given by: v=

1 (Ay − b)+ ,

(12)

where y is a solution of the primal penalty problem: minn f (y) =

y∈R

1 k(Ay − b)+ k2 + c0 y, 2

(13)

for any finite value of the penalty parameter ∈ (0, ¯] for some positive ¯. We note that the gradient: ∇f (y) = A0 (Ay − b)+ + c,

(14)

which is Lipschitz continuous with constant kA 0 k kAk, is not differentiable. However, as stated in the Introduction, a generalized Hessian [12, 7, 23] with many of the properties of the ordinary Hessian can be defined, which is the following n × n symmetric positive semidefinite matrix: ∂ 2 f (y) = A0 diag(Ay − b)∗ A,

(15)

where diag(Ay − b)∗ denotes an m × m diagonal matrix with diagonal elements (Ai y − bi )∗ , i = 1, . . . , m. The step function (·)∗ , is defined in the Introduction and is implemented here with (0) ∗ = 0. The matrix ∂ 2 f (y) will be used to generate our Newton direction, or more precisely a modified Newton direction, since the generalized Hessian may be singular in general. In fact the direction that will be used is the following one: d := −(δI + ∂ 2 f (y))−1 ∇f (y),

(16)

where δ is some small positive number. With this direction and a variety of step sizes [20, Example 2.2] we can establish global convergence. A key empirical computational property of this direction appears to be global convergence for a class of linear programs with m n from any starting 5

point without any step size at all. The exact least 2-norm solution v¯ for the dual linear program (3) is obtained by first solving the primal exterior penalty problem (13) for any ∈ (0, ¯], and then using (12) to determine the unique least 2-norm dual solution v¯. This exact dual solution can be utilized to generate an exact solution to the original primal linear program (2) by solving the constraints of the linear program (2) corresponding to positive components of v¯j as equalities, that is: Aj z = bj , j ∈ S := {j | v¯j > 0}.

(17)

We note that this system of linear equations must always have a solution z as a consequence of the complementarity condition: A j z − bj = 0 for v¯j > 0, j = 1, . . . , m. In fact (17) yields an exact solution of the primal linear program (2) if we make the additional assumption that the submatrix: AS has linearly independent columns,

(18)

where S is defined in (17). This assumption which implies that that the solution of the linear system (17) is unique, is sufficient but not necessary for generating an exact primal solution as will be shown in Section 3. We turn now to our algorithmic formulation and its convergence.

3

Linear Programming Newton Algorithm: LPN

Our proposed algorithm consists of solving (13) for an approximate solution y of the linear program (2), computing an exact least 2-norm solution v to the dual linear program (3) from (12), and finally computing an exact solution z to the primal linear program (2) from (17). In order to guarantee global convergence, we utilize an Armijo stepsize [1, 16] and need to make the linear independence assumption (18) on the least 2-norm solution of the dual linear program (3). We now state our algorithm. Algorithm 3.1 LPN: Linear Programming Newton Algorithm Set the parameter values , δ and tolerance tol (typically: 10 −3 , 10−4 , & 10−12 respectively). Start with any y 0 ∈ Rn (typically y 0 = (A¯0 A¯ + I)−1 A¯0¯b, where A¯ is an arbitrary n × n subset of A and ¯b is the corresponding n × 1 subset of b). For i = 0, 1, . . . (I) y i+1 = y i − λi (∂ 2 f (y i ) + δI)−1 ∇f (y i ) = y i + λi di , where the Armijo stepsize λi = max{1, 12 , 14 , . . . } is such that: f (y i ) − f (y i + λi di ) ≥ − 6

λi ∇f (y i )0 di , 4

(19)

and di is the modified Newton direction: di = −(∂ 2 f (y i ) + δI)−1 ∇f (y i ).

(20)

(II) Stop if ky i − y i+1 k ≤ tol. Else, set i = i + 1 and go to (I). (III) Define the least 2-norm dual solution v as: 1 v = (Ay i+1 − b)+

(21)

and a solution z of the primal linear program by: Aj z = bj , j ∈ S := {j | vj > 0, j = 1, . . . , m}.

(22)

We state a convergence result for this algorithm now. Theorem 3.2 Each accumulation point y¯ of the sequence {y i } generated by Algorithm 3.1 solves the exterior penalty problem (4). The corresponding v¯ obtained by setting y to y¯ in (12) is the exact least 2-norm solution to the dual linear program (3), provided is sufficiently small. An exact solution z¯ to the primal linear program (2) is obtained from (22) by solving for z with v = v¯, provided that the submatrix A S¯ of A has linearly independent columns where: S¯ := {j | v¯j > 0, j = 1, . . . , m}.

(23)

Proof Let > 0. That each accumulation point y¯ of the sequence {y i } solves the minimization problem (13) follows from standard results such as [20, Theorem 2.1, Examples 2.1(i), 2.2(iv)] and the facts that the direction choice di of (20) satisfies: −∇f (y i )0 di = ∇f (y i )0 (δI + ∂ 2 f (y i ))−1 ∇f (y i ) ≥ (δ + kA0 Ak)−1 k∇f (y i )k2 , (24) and that we are using an Armijo stepsize (19). Now let ∈ (0, ¯]. Then by Proposition 2.1 the corresponding v¯ obtained by setting y to y¯ in (12) is the exact least 2-norm solution to the dual linear program (3). If the submatrix AS¯ has linearly independent columns, then the solution of z¯ of (22) with v = v¯ is unique and must be a solution of the primal linear program (2). 2

7

Remark 3.3 Choice of Determining the size of ¯, such that the solution v of the quadratic program (6) for ∈ (0, ¯], is the least 2-norm solution of the dual problem (3), is not an easy problem theoretically. However, computationally this does not seem to be critical and is effectively addressed as follows. By [17, Corollary 3.2], if for two successive values of : 1 > 2 , the corresponding solutions of the -perturbed quadratic programs (6): u 1 and u2 are equal, then under certain assumptions, u = u 1 = u2 is the least 2-norm solution of the dual linear program (3). This result is implemented computationally by using an , which when decreased by a factor of 10 yields the same solution to (6). We note that the assumptions of Theorem 3.2, and in particular uniqueness of the solution of (22) with v = v¯, imply the uniqueness of the solution to the primal linear program (2). This follows from the fact that the least 2-norm multiplier v¯ is a valid optimal multiplier for all optimal solutions of the primal linear program (2) and hence all primal solutions must equal the solution of the uniquely solvable system A S¯ z = bS¯ . We term our condition of linear independence of the columns of A S¯ as a strong uniqueness condition, because it implies primal solution uniqueness but is not implied by it. Example 3.4 below has a unique but not strongly unique solution. However, it is still solvable by our proposed Newton method despite the fact that its solution is not strongly unique. Example 3.4 For the primal and dual problems: min x1 s.t. − x1 + x2 ≤ −1, x1 − x2 ≤ 1, −x1 ≤ 0,

(25)

max u1 − u2 s.t. u1 − u2 + u3 = 1, −u1 + u2 = 0,

(26)

x∈R2

0≤u∈R3

The unique primal solution is x1 = 0, x2 = −1. The dual solution set is {u ∈ R3 | u3 = 1, u1 = u2 ≥ 0} and the least 2-norm dual solution is u ¯1 = u ¯2 = 0, u ¯3 = 1. Thus AS¯ = A3 = [−1 0] and the solution x1 = 0, x2 = −1 is not strongly unique because the columns of A 3 are not linearly independent. However, Algorithm 3.1 solved this problem in 5 iterations using the MATLAB Code 4.2 with the given defaults and without an Armijo stepsize and yielding z = [0 − 1] 0 and v = [0 0 1]0 . We note further, that Algorithm 3.1 can possibly handle linear programs with even non-unique solutions as evidenced by the following simple example. 8

Example 3.5 For the primal and dual problems: min x1 + x2 s.t. − x1 − x2 ≤ −1, −x1 ≤ 0, −x2 ≤ 0,

(27)

max u1 s.t. u1 + u2 = 1, u1 + u3 = 1,

(28)

x∈R2

0≤u∈R3

the primal solution set is {x ∈ R 2 | x1 + x2 = 1, x1 ≥ 0, x2 ≥ 0}, while the dual solution set consists of the unique point: u 1 = 1, u2 = u3 = 0. When Algorithm 3.1 is applied to this example using the MATLAB Code 4.2 with the given default values, the exact primal z = [1 0] 0 and dual v = [1 0 0]0 solutions were obtained in two iterations without using the Armijo stepsize. We give now a sufficient (but in general not necessary) condition for the sequence {y i } of Algorithm 3.1 to converge. Corollary 3.6 The sequence {y i } is bounded and hence has an accumulation point and the corresponding sequence {v i } converges to the least 2-norm solution of the dual linear program (3), provided that R n is contained in the 0 conical hull of the rows of cA . That is, for each p ∈ Rn : ζc0 + s0 A = p, (ζ, s) ≥ 0, has a solution (ζ, s) ∈ R 1+n .

(29)

Proof We shall prove that the sequence {y i } is bounded, hence it has an accumulation point. Since all accumulation points of the sequence {v i }, that correspond to accumulation points of {y i }, are all equal to the unique least 2-norm solution v¯ of (3), the sequence {v i } must converge to v¯. We show now that the sequence {y i } is bounded by exhibiting a contradiction if it were unbounded and if assumption (29) holds. Suppose that {y i } is unbounded, then for some subsequence (for simplicity we drop the subsequence index) we have by Algorithm 3.1 that: f (y i ) ≤ f (y 0 ), y i → ∞.

(30)

1 f (y 0 ) yi b yi k(A i − i )+ k2 + c0 i 2 ≤ i 2 . 2 ky k ky k ky k ky k

(31)

Dividing by ky i k2 gives:

Letting y i → ∞ and denoting by yˆ an accumulation point of the bounded i sequence kyyi k , we have that: 1 k(Aˆ y )+ k2 ≤ 0. 2 9

(32)

Hence Aˆ y ≤ 0, yˆ 6= 0.

(33)

From (31) multiplied by ky i k and noting that its first term is nonnegative we have that: yi f (y 0 ) c0 i ≤ , (34) ky k ky i k from which, upon letting y i → ∞, we get: c0 yˆ ≤ 0.

(35)

Aˆ y ≤ 0, c0 yˆ ≤ 0 yˆ 6= 0.

(36)

Combining (33) and (35) gives: This however contradicts assumption (29) if we set p = yˆ in (29) as follows: 0 = ζc0 yˆ + s0 Aˆ y − yˆ0 yˆ ≤ −ˆ y 0 yˆ < 0. 2

(37)

Remark 3.7 We note that condition (29) of Corollary 3.6 is equivalent to the primal linear program having a bounded level sets, which is similar to the assumptions made in [14]. To see this we have by the Farkas theorem [19, Theorem 2.4.6] that (29) is equivalent to: Ax ≤ 0, c0 x ≤ 0, p0 x > 0, has no solution x ∈ R n , for each p ∈ Rn . (38) This in turn is equivalent to: Ax ≤ 0, c0 x ≤ 0, x 6= 0 has no solution x ∈ R n .

(39)

This is equivalent to the boundedness of the level set {x | Ax ≤ b, c 0 x ≤ α} of the linear program (2), for any real number α. We proceed now to numerical testing of our algorithm.

4

Numerical Tests

The objectives of the preliminary numerical tests presented here are to display the simplicity of the proposed LPN algorithm, its competitiveness with a state-of-the-art linear programming code for special types of problems, and its ability to provide exact answers for a class of linear programs with a very large number of constraints and such that m n. Typically, m ≥ 10n. We also demonstrate the effectiveness of the LPN algorithm by testing it and comparing it with CPLEX on standard classification test problems: four from the University of California Machine Learning Repository [27] and two publicly available datasets [28]. 10

4.1

Very large synthetic datasets

We first test LPN on very large synthetically generated test problems. To display the simplicity of LPN we give two MATLAB m-file codes below. The first, ”lpgen”, is a linear programming test problem generator. The second, ”lpnewt1” is an implementation of Algorithm 3.1 without the apparently unnecessary Armijo stepsize for the class of problems tested here. A sufficient well conditioned property that eliminates the Armijo stepsize requirement [23, Theorem 3.6] apparently is not needed in all the problems tested here. Both m-files are given below as Code 4.1 and Code 4.2, and are available at: www.cs.wisc.edu/math-prog/olvi. The test problem generator lpgen generates a random constraint matrix A for a given m, n and density d. The elements of A are uniformly distributed between -50 and +50. A primal random solution x with elements in [-10,10], approximately half of which are zeros, and a dual solution u with elements in [0,10], approximately 3n of which are positive, are first specified. These solutions are then used to generate an objective function cost vector c and a right hand side vector b for the linear program (2). The number 3n of positive dual variables is motivated by support vector machine applications [22, 6, 31] where the number of positive multipliers corresponding to support vectors is often a few multiples of the dimensionality n of the input space. To test LPN we solved a number of linear programs generated by the lpgen Code 4.1 on LOCOP2, a 400 Mhz Pentium II machine with a maximum of 2 Gigabytes of memory running Windows NT Server 4.0, and compared it with CPLEX 6.5 [5]. LPN used only the default values given in Code 4.2, which of course can be overridden. The results are presented in Table 1. Seven problems generated by the lpgen Code 4.1 were solved by the LPN Code 4.2. Problem size varied between 10 5 to 107 in the number of nonzero elements of the constraint matrix A. LPN solved all these problems in 11 to 26 iterations to an accuracy better than 10 −13 . Comparing times for the problems solved, CPLEX ran out of memory on two problems, and was 2.8 to 34.2 times slower on the remaining five problems with one problem, the third in Table 1, very poorly solved by CPLEX with a primal solution error of 7.0. The dual CPLEX option was used throughout the numerical test problems because it gave the best results for the type of problems tested. The other options, primal and barrier, did worse. For example on the fifth test problem in Table 1 with dual CPLEX time of 238.4 seconds, the primal CPLEX took 2850.8 seconds while the barrier CPLEX ran out of memory. In order to show that LPN can also handle linear programs with nonunique solutions, as requested by one of the referees who also noted that the linear 11

programs of Table 1 as generated by lpgen had unique feasible points as well, we perturbed the right hand side b of (2) so that the number of active constraints at the solutions obtained by LPN and CPLEX were less than n. These results are presented in Table 2 and show that the accuracy of LPN has decreased. However, it is still capable of solving these problems to a reasonable accuracy and can still handle the two largest problem which CPLEX ran out of memory on. Code 4.1 lpgen MATLAB lp test problem generator %lpgen: Generate random solvable lp: min c’x s.t. Ax =< b; A:m-by-n %Input: m,n,d(ensity); Output: A,b,c; (x,u): primal-dual solution pl=inline(’(abs(x)+x)/2’);%pl(us) function tic;A=sprand(m,n,d);A=100*(A-0.5*spones(A)); u=sparse(10*pl(rand(m,1)-(m-3*n)/m)); x=10*spdiags((sign(pl(rand(n,1)-rand(n,1)))),0,n,n)*(rand(n,1)-rand(n,1)); c=-A’*u;b=A*x+spdiags((ones(m,1)-sign(pl(u))),0,m,m)*10*ones(m,1);toc0=toc; format short e;[m n d toc0] Code 4.2 lpnewt1 MATLAB LPN Algorithm 3.1 without Armijo %lpnewt1: Solve primal LP: min c’x s.t. Ax=

4.2

Machine learning test problems

In this section we test and compare LPN with CPLEX on four classification datasets from the UCI Machine Learning Repository [27] and two publicly available datasets [28]. Again, we use LOCOP2. 12

Table 1: Comparison of LPN Algorithm 3.1 and CPLEX 6.5 [5]. “oom” denotes “out of memory”. LPN Time is total time, i.e. toc2 from Code 4.2

Problem Size× Density m×n×d

Time Seconds

(2 × 106 ) × 100 × 0.05 (1.5 × 106 ) × 100 × 0.05 (1.0 × 105 ) × 1000 × 0.1 (1.0 × 105 ) × 100 × 1.0 (1.0 × 105 ) × 100 × 0.1 (1.0 × 104 ) × 1000 × 0.1 (1.0 × 104 ) × 100 × 0.1

1041.8 787.0 840.5 228.7 50.7 417.5 4.9

LPN Iter. Accuracy # kx − zk∞

26 26 14 15 18 11 17

1.1 × 10−14 8.8 × 10−15 5.8 × 10−14 8.9 × 10−15 8.9 × 10−15 5.1 × 10−14 7.3 × 10−15

Dual CPLEX 6.5 Time Accuracy Seconds kx − rk∞

oom oom 28716.6 905.0 238.4 3513.2 13.8

oom oom 7.0 7.5 × 10−14 5.3 × 10−12 5.4 × 10−12 1.3 × 10−11

The classification model that we shall employ here consists of a linear support vector machine [6, 32, 3, 22] which we can state as the following linear program: min

(w,γ,s,),∈Rq+1+q+1

e0 s + ν

s.t. D(Bw − eγ) + e ≥ e, −s ≤ w ≤ s, ≥ 0, r 0 w ≥ 2.

(40)

Here, e is a column vector of ones and ν is a positive parameter that balances the model’s ability to generalize to new unseen data (small ν) versus empirical data fitting (large ν). The matrices B ∈ R p×q and D ∈ Rp×p constitute the problem data. The p rows of B represent p given data points in the input space Rq , with each point belonging to class +1 if the corresponding diagonal element of the diagonal matrix D is +1, or to class −1 if the corresponding diagonal element is −1. The vector r is the difference between the mean of the points in class +1 and the mean of the points in class −1. The linear program (40) generates a separating plane: x0 w = γ,

(41)

which approximately separates the points of classes +1 and −1. This separating plane lies midway between the parallel bounding planes: x0 w = γ + 1, x0 w = γ − 1.

13

(42)

Table 2: Comparison of LPN Algorithm 3.1 and CPLEX 7.1 [5] on linear programs with nonunique solutions. “oom” denotes “out of memory”.

Problem Size× Density m×n×d

Time Seconds

Iter. #

LPN Accuracy |c0 z + b0 v| k(−Az + b)+ k∞

(2 × 106 ) × 100 × 0.05

3280.0

5

3.8 × 10−5

Dual CPLEX 7.1 Time Act.Constr. Seconds #

Act.Constr. #

34

oom

oom

33

oom

oom

240

15335.0

102

32

317.5

57

21

95.6

89

43

2018.4

105

22

6.8

86

4.1 × 10−6

(1.5 × 106 ) × 100 × 0.05

1899.1

5

1.3 × 10−4 3.3 × 10

(1 × 105 ) × 1000 × 0.1

3014.5

7

5.7 × 10−5 2.5 × 10

(1 × 105 ) × 100 × 1.0

190.0

6

52.3

6

−6

8.7 × 10−5 2.5 × 10

(1 × 105 ) × 100 × 0.1

−6

−6

2.0 × 10−10 7.3 × 10−7

(1 × 104 ) × 1000 × 0.1

48.2

8

1.4 × 10−7 1.6 × 10−6

(1 × 104 ) × 100 × 0.1

4.5

7

2.5 × 10−6 1.8 × 10−6

With a maximum error of , the points of class +1 lie in closed halfspace: {x | x0 w ≥ γ + 1}.

(43)

Similarly, the points of class −1 lie, with maximum error of , in the closed halfspace: {x | x0 w ≤ γ − 1}.

(44)

The error is minimized by the linear program (40) relative to e 0 s = kwk1 . The last constraint of the linear program (40) excludes the degenerate solution w = 0 by implying the usually satisfied condition that the 1-norm 2 [21] between the two parallel bounding planes (42) does not distance kwk ∞ exceed the 1-norm distance krk1 between the means of the two classes as follows: kwk∞ · krk1 ≥ r 0 w ≥ 2. 14

(45)

The linear program (40) was set up in the form of the linear program (2) with n = 2q + 2 and m = 2p + 2q + 2. The linear program (2) was solved by both LPN and CPLEX for six standard test problems varying in size between 297 points to 4192 points, and input spaces of dimensionality varying between 6 and 14. The results are given in Table 3. Because these are relatively small problems, CPLEX solved these problems in less time than LPN and with much greater accuracy as measured by the relative infeasibility of a solution point to (2): k(Ax − b)+ k1 . kAxk1

(46)

However, correctness of the classifying separating plane on both the original data (training correctness) and on the tenfold testing (testing correctness) obtained by leaving one tenth of the data out for testing, repeating ten times and averaging, were higher for LPN than those for CPLEX on all six test problems. In addition, the LPN correctness values were comparable to those obtained by other methods [13, 10, 26]. These test correctness values are the key properties by which machine learning classifiers are evaluated and are not always best when a very accurate solution to the linear program (40) is obtained. The explanation for this apparent paradox is that a highly accurate solution causes overfitting and may bias the classifier towards the given empirical data at the expense of unseen data and hence results in poor generalization which translates into lower test set correctness as seen here for the CPLEX results. Application of Newton type methods have yielded some of the best test correctness results for classification problems [16, 10, 9]. The parameter value for ν for both LPN and CPLEX was the same and chosen to optimize training set correctness. For all problems in Table 3, the value ν = 105 was used. All parameters of LPN used were set at the default values given in the MATLAB Code 4.2.

5

Conclusion

We have presented a fast Newton algorithm, LPN, for solving a class of linear programs. Computational testing on test problems with m n and m ≥ n demonstrate the effectiveness of the proposed method in comparison with a state-of-the-art linear programming solver for this class of linear programs. Future work might include the study of sharp theoretical conditions under which LPN converges without a stepsize and in a finite number of steps, as it does in all the numerical examples presented in this work. Computational 15

Table 3: Comparison of LPN Algorithm 3.1 and CPLEX 6.5 [5] on six publicly available machine learning problems. m × n is the size of the matrix A of (2). p × q is the size of the matrix B of (40). “Infeas” denotes solution infeasibility (46). “Train” is training set correctness. “Test” is ten-fold testing set correctness.

Problem m × n, p × q

Bupa Liver 359 × 14, 345 × 6 Pima Indians 786 × 18, 768 × 8 Cleveland Heart 325 × 28, 297 × 13 Housing 534 × 28, 506 × 13 Galaxy Bright 2492 × 30, 2462 × 14 Galaxy Dim 4222 × 30, 4192 × 14

LPN Seconds Train Infeas Test

Dual CPLEX 6.5 Seconds Train Infeas Test

0.11 8.10e-3 0.21 4.30e-3 0.14 1.31e-2 0.24 9.80e-3 6.23 8.03e-4 13.86 1.50e-3

0.05 3.08e-19 0.13 1.29e-18 0.09 5.44e-18 0.13 1.02e-17 0.79 5.37e-15 1.57 1.81e-18

68.12% 66.71% 76.69% 76.20% 87.54% 84.74% 86.96% 85.97% 99.63% 99.38% 94.58% 94.43%

64.64% 63.92% 73.83% 73.39% 83.16% 81.69% 83.60% 83.72% 99.15% 99.16% 90.96% 90.71%

improvements might include alternate ways of generating a primal solution from the dual least 2-norm solution thus enabling the method to handle all types of linear programs.

Acknowledgments The research described in this Data Mining Institute Report 02-02, March 2002, was supported by National Science Foundation Grants CCR-9729842, CCR-0138308 and CDA-9623632, by Air Force Office of Scientific Research Grant F49620-00-1-0085 and by the Microsoft Corporation. I am indebted to Robert R. Meyer for suggesting the MATLAB command SPONES to help generate very large test problems, to Jinho Lim for providing a CPLEX mex file for MATLAB that measures actual CPLEX cpu time and with options for primal simplex, dual simplex, network and barrier methods, and to Steve Wright for suggesting the use of various CPLEX options. I am indebted to

16

my PhD student Michael Thompson for the numerical results presented in Table 3, and to two referees and an Associate Editor for constructive suggestions. Revised August and December 2002.

References [1] L. Armijo. Minimization of functions having Lipschitz-continuous first partial derivatives. Pacific Journal of Mathematics, 16:1–3, 1966. [2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, second edition, 1999. [3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Machine Learning Proceedings of the Fifteenth International Conference(ICML ’98), pages 82–90, San Francisco, California, 1998. Morgan Kaufmann. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps. [4] K. L. Clarkson. Las vegas algorithms for linear and integer programming. Journal of the Association for Computing Machinery, 42(2):488– 499, march 1995. [5] CPLEX Optimization Inc., Incline Village, Nevada. Using the CPLEX(TM) Linear Optimizer and CPLEX(TM) Mixed Integer Optimizer (Version 2.0), 1992. [6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, MA, 2000. [7] F. Facchinei. Minimization of SC 1 functions and the Maratos effect. Operations Research Letters, 17:131–137, 1995. [8] A. V. Fiacco and G. P. McCormick. Nonlinear Programming: Sequential Unconstrained Minimization Techniques. John Wiley & Sons, New York, NY, 1968. [9] G. Fung and O. L. Mangasarian. A feature selection newton method for support vector machine classification. Technical Report 02-03, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, September 2002. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/02-03.ps. Computational Optimization and Applications, to appear. 17

[10] G. Fung and O. L. Mangasarian. Finite Newton method for Lagrangian support vector machine classification. Technical Report 02-01, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, February 2002. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/02-01.ps. Neurocomputing, to appear. [11] A. I. Golikov and Yu. G. Evtushenko. Search for normal solutions in linear programming. Computational Mathematics and Mathematical Physics, 14(12):1694–1714, 2000. [12] J.-B. Hiriart-Urruty, J. J. Strodiot, and V. H. Nguyen. Generalized hessian matrix and second-order optimality conditions for problems with C L1 data. Applied Mathematics and Optimization, 11:43–56, 1984. [13] T. Joachims. Making large-scale support vector machine learning practical. In B. Sch¨olkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press. [14] C. Kanzow, H. Qi, and L. Qi. On the minimum norm solution of linear programs. Preprint, University of Hamburg, Hamburg, 2001. http://www.math.uni-hamburg.de/home/kanzow/paper.html. Journal of Optimization Theory and Applications, to appear. [15] K. Krishnan and J. Mitchell. A linear programming approach to semidefinite programming problems. Working paper, RPI, Troy, NY, May 2001. [16] Y.-J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5–22, 2001. Data Mining Institute, University of Wisconsin, Technical Report 9903. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps. [17] S. Lucidi. A new result in the theory and computation of the leastnorm solution of a linear program. Journal of Optimization Theory and Applications, 55:103–117, 1987. [18] O. L. Mangasarian. Normal solutions of linear programs. Mathematical Programming Study, 22:206–216, 1984. [19] O. L. Mangasarian. Nonlinear Programming. SIAM, Philadelphia, PA, 1994. 18

[20] O. L. Mangasarian. Parallel gradient distribution in unconstrained optimization. SIAM Journal on Control and Optimization, 33(6):1916– 1925, 1995. ftp://ftp.cs.wisc.edu/tech-reports/reports/1993/tr1145.ps. [21] O. L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24:15–23, 1999. ftp://ftp.cs.wisc.edu/math-prog/techreports/97-07r.ps. [22] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Sch¨olkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 135–146, Cambridge, MA, 2000. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps. [23] O. L. Mangasarian. A finite Newton method for classification problems. Technical Report 01-11, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, December 2001. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/0111.ps.Optimization Methods and Software 17, 2002, 913-929. [24] O. L. Mangasarian and R. De Leone. Error bounds for strongly convex programs and (super)linearly convergent iterative schemes for the least 2-norm solution of linear programs. Applied Mathematics and Optimization, 17:1–14, 1988. [25] O. L. Mangasarian and R. R. Meyer. Nonlinear perturbation of linear programs. SIAM Journal on Control and Optimization, 17(6):745–752, November 1979. [26] O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177, 2001. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-06.ps. [27] P. M. Murphy and D. W. Aha. UCI machine learning repository, 1992. www.ics.uci.edu/∼mlearn/MLRepository.html. [28] S. Odewahn, E. Stockwell, R. Pennington, R. Humphreys, and W. Zumach. Automated star/galaxy discrimination with neural networks. Astronomical Journal, 103(1):318–331, 1992. [29] M. C. Pinar. Piecewise-linear pathways to optimal solution set in linear programming. Journal of Optimization Theory and Applications, 93:619–634, 1997.

19

[30] P. W. Smith and H. Wolkowicz. A nonlinear equation for linear programming. Mathematical Programming, 34:235–238, 1986. [31] A. Smola and B. Sch¨olkopf. Learning with Kernels. MIT Press, Cambridge, MA, 2002. [32] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, second edition, 2000.

20