imizing the dual function in (14) over λl for each l ∈ ∂TN, while holding the other components of λ ﬁxed, decomposes into independent maximizations: m...

0 downloads 3 Views 440KB Size

Loading...

Dual multilevel optimization Timothy A. Davis · William W. Hager

Received: 17 October 2005 / Accepted: 7 July 2006 / Published online: 19 September 2006 © Springer-Verlag 2006

Abstract We study the structure of dual optimization problems associated with linear constraints, bounds on the variables, and separable cost. We show how the separability of the dual cost function is related to the sparsity structure of the linear equations. As a result, techniques for ordering sparse matrices based on nested dissection or graph partitioning can be used to decompose a dual optimization problem into independent subproblems that could be solved in parallel. The performance of a multilevel implementation of the Dual Active Set Algorithm is compared with CPLEX Simplex and Barrier codes using Netlib linear programming test problems. Keywords Multilevel optimization · Dual optimization · Dual separability · Dual active set algorithm · Parallel algorithms Mathematics Subject Classification (2000)

90C05 · 90C06 · 65Y20

T. A. Davis Department of Computer and Information Science and Engineering, University of Florida, PO Box 116120, Gainesville, FL 32611-6120, USA e-mail: [email protected] URL: http://www.cise.ufl.edu/∼davis W. W. Hager (B) Department of Mathematics, University of Florida, Gainesville, PO Box 118105, FL 32611-8105, USA e-mail: [email protected] URL: http://www.math.ufl.edu/∼hager

404

T. A. Davis, W. W. Hager

1 Introduction We consider problems of the form sup L(λ),

(1)

λ

where L(λ) = F(λ) + inf

x≥0

F(λ) = f (x) =

m i=1 n

f (x) − λT Ax ,

Fi (λi ), fj (xj ).

j=1

Here Fi : R → R, fj : [0, ∞) → R, and A is an m by n matrix. If Fi (λi ) = bi λi , i = 1, 2, . . . , m, for some b ∈ Rm , then (1) is the dual of the primal problem min f (x) subject to Ax = b,

x ≥ 0.

(2)

We refer to L as the “dual function” and to (1) as the “dual problem,” regardless of how the Fi are chosen. In the case where (1) is obtained by forming the dual of an optimization problem, the fj should be convex to ensure that the dual problem solves the primal problem. The special case fj (xj ) = xj log xj corresponds to the (negative) Boltzmann–Shannon entropy function [32]. Because of the separable structure of f , the dual function can be expressed as n j (λ), (3) L(λ) = F(λ) + j=1

where j (λ) = inf

xj ≥0

fj (xj ) − xj

m

λi aij .

(4)

i=1

Even though both f and F are separable, the dual cost function does not separate, in general, because each of the functions j can depend on the entire λ vector. In this paper, we examine how the dual problem can be decomposed into relatively independent subproblems by exploiting the sparsity of A. The basic step in the decomposition process is to write ¯ λ) ¯ + L2 (λ2 , λ), ¯ ¯ + L1 (λ1 , λ) L(λ) = L(

(5)

where λ1 , λ2 , and λ¯ are disjoint subvectors of λ whose union yields the entire vector λ. That is, the components of the vectors λ1 , λ2 , and λ¯ are distinct com-

Dual multilevel optimization

405

Fig. 1 The dual optimization tree for the decomposition (5)

ponents of λ, and the union of the components of λ1 , λ2 , and λ¯ yields λ. When L can be decomposed in this way, we can visualize the dual optimization problem using the tree shown in Fig. 1. The full dual function L(λ) is associated with the root of the tree, while L1 and L2 are associated with each leaf of the tree. ¯ over λ1 is independent of the If λ¯ is fixed, then the optimization of L1 (λ1 , λ) ¯ optimization of L2 (λ2 , λ) over λ2 . Figure 1 suggests various approaches for exploiting the structure. For example, if the optimization is done using dual coordinate ascent, we could start at the leaves of the tree and optimize over λ1 and λ2 while holding λ¯ fixed. The optimization over λ1 and λ2 can be done in parallel. Then we can move to the root of the tree and optimize over λ¯ while holding λ1 and λ2 fixed. After a new value of λ¯ is determined, we would drop to the leaves and again optimize over λ1 and λ2 with λ¯ fixed. Under suitable assumptions, the dual coordinate ascent iteration converges to a stationary point of L (e.g. [31, p. 228]). The decomposition (5) reveals that two of the dual coordinate ascent iterates could be computed independently. In this paper, we develop general multilevel techniques for decomposing the dual problem into independent subproblems, with each subproblem associated with a node in a tree. Multilevel techniques have proved to be very effective for solving elliptic partial differential equations [4, 16], and for solving NP-hard graph partitioning problems [25–29]. Our multilevel strategy is related to the arrowhead partitioning presented in [13]. However, instead of working towards an arrowhead form, we utilize a nested dissection ordering of AAT (recursive graph partitioning via node separators) [24–29], similar to the block-structured matrices in linear programming and interior point solvers discussed by [14]. We present a version of the dual active set algorithm (DASA) [10, 17–22] for maximizing L. The algorithm exploits a multilevel decomposition and moves up and down the multilevel tree in a way loosely related to that of multilevel methods for graph partitioning or multigrid methods for partial differential equations. Although the decoupled problems could be solved in parallel, we find that even for a single processor, it can be quicker to solve some problems using a multilevel approach, as opposed to the single level approach, because the subproblems are solved quickly, and there are relatively few iterations at the root node. Comparisons with CPLEX simplex and barrier codes, using Netlib linear programming (LP) test problems, are given in Sect. 7. Our implementation of LP DASA is based on our recent package CHOLMOD [6, 8, 9, 11] for computing a Cholesky factorization and its modification after small rank changes in

406

T. A. Davis, W. W. Hager

the matrix. CHOLMOD is partially incorporated in MATLAB 7.2 as the chol, symbfact, and etree functions, and in x=A\b when A is symmetric positive definite. 2 Level 1 partitioning Given vectors x, y ∈ Rn , define

xy=

⎧ ⎨0 ⎩

1

if xj yj = 0 for all j, if xj yj = 0 for some j.

If ai denotes the i-th row of A, then ai aj = 0 if and only if the variables in equations i and j of Ax = b are different. In essence, equations i and j are independent of each other. Lemma 1 Let Rk , k = 1, 2, . . . , N, be disjoint subsets of {1, 2, . . . , m} and define ¯ = (R1 ∪ R2 ∪ · · · ∪ RN )c , R where the superscript “c” denotes complement. If the “orthogonality condition” ap aq = 0 holds for all p ∈ Rk and q ∈ Rl whenever 1 ≤ k < l ≤ N, then L can be expressed as N ¯ λ) ¯ ¯ + Lk (λk , λ), (6) L(λ) = L( k=1

where λk , k = 1, . . . , N, and λ¯ are disjoint subvectors of λ associated with indices ¯ respectively. in Rk , k = 1, . . . , N, and in R Proof Define the sets Ck = {j ∈ [1, n] : aij = 0 for some i ∈ Rk }, and

k = 1, 2, . . . , N − 1,

c CN = C1 ∪ C2 ∪ · · · ∪ CN−1 .

(7)

(8)

Observe that (a) Ck and Cl are disjoint for k = l: Clearly, CN is disjoint from Cl for l < N by the definition of CN . If j ∈ Ck ∩ Cl for k < l < N, then there exists p ∈ Rk and q ∈ Rj such that apj = 0 and aqj = 0. This implies that ap aq = 1, a contradiction. ¯ c and j ∈ Cl : If l < N and j ∈ Cl , (b) For any l, aij = 0 for every i ∈ (Rl ∪ R) then by (7) there exists p ∈ Rl such that apj = 0. By the orthogonality

Dual multilevel optimization

407

Fig. 2 Structure of the dependency matrix of PA in Lemma 1 for an appropriate permutation P; equivalently, the sparsity pattern of (PA)(PA)T

0

0

¯ c . If l = N and j ∈ CN , then j ∈ Ck condition, aij = 0 whenever i ∈ (Rl ∪ R) for k < N. Since j ∈ Ck , (7) implies that aij = 0 for every i ∈ Rk . Since ¯ c = R1 ∪ R2 ∪ · · · ∪ RN−1 , (RN ∪ R) ¯ c and j ∈ CN . it follows that aij = 0 for every i ∈ (RN ∪ R) By (a) and (8), we can group the terms of (3) in the following way: L(λ) = F(λ) +

N k=1

⎛ ⎝

⎞ j (λ)⎠ .

j∈Ck

If j ∈ Ck , then by (b), the only nonzero terms λi aij in (4) correspond to indi¯ It follows that j (λ) is independent of λp when p = k. The ces i ∈ Rk ∪ R. decomposition (6) is obtained by taking, for example, ¯ = Lk (λk , λ)

Fi (λi ) +

i∈Rk

and

¯ λ) ¯ = L(

j (λ),

(9)

j∈Ck

Fi (λi ).

(10)

¯ i∈R

The dependency matrix D associated with A is a symmetric matrix defined by dij = ai aj . Under the hypotheses of Lemma 1, if we permute the rows of A in the order R1 , ¯ and shade the nonzero parts of the dependency matrix, we obtain R2 , and R ¯ Fig. 2. If P a matrix which permutes the rows of A in the order R1 , R2 , and R, then the dependency matrix D depicted in Fig. 2 indicates the sparsity pattern

408

T. A. Davis, W. W. Hager

Fig. 3 The permuted matrix PAQ associated with Lemma 1

0

0

of (PA)(PA)T . Observe that D is 0 in the region corresponding to (i, j) with i ∈ R1 and j ∈ R2 or i ∈ R2 and j ∈ R1 . Suppose that in addition the columns of A are permuted in the order C1 followed by C2 . If Q is a column permutation matrix which puts the columns in the order C1 followed by C2 , then PAQ has the block-angular structure seen in Fig. 3. By the definition of Ck in (7), aij is zero for i ∈ R1 and j ∈ C2 or i ∈ R2 and j ∈ C1 . As a specific illustration, let us consider the Netlib test problem truss. Although this matrix, in its original form, does not have the structure seen in Fig. 2, it can be put in this form using graph partitioning codes such as CHACO [25, 26] or METIS [27–29]. These heuristic methods find a small edge separator of the graph G whose adjacency matrix is D. An edge separator is a subset of the edges of G that, when removed from G, cause it to split into two (or more) disconnected components. Ideally, the separator is a balanced one, where the two components are of equal size. Finding a minimal balanced edge separator is an NP-complete problem. When applied to D, these codes partition the row and column indices into two sets Q1 and Q2 , chosen to minimize approximately the number of nonzeros dij with i ∈ Q1 and j ∈ Q2 . The sets Q1 and Q2 are roughly the same size. We then extract elements from Q1 or Q2 in ¯ and ¯ with the property that dij = 0 whenever i ∈ Q1 \ R order to obtain a set R ¯ ¯ j ∈ Q2 \ R. Finally, Rl = Ql \ R for l = 1 and 2. This step converts the edge separator found by CHACO or METIS into a vertex separator. In LP DASA, we use the nested dissection ordering computed in our new software package CHOLMOD [6]. It uses METIS to find the node separators at each level, followed by a constrained minimum degree ordering of the whole graph (CCOLAMD). After a node separator is found, the subgraphs are themselves split, recursively, unless they consist of fewer than 200 nodes. The set of node separators found forms a tree, where each node in the tree represents one block or subgraph in the nested dissection ordering. The parent of a node (other than the root) is the subgraph consisting of the nodes in the separator. This method gives a good fill-reducing ordering, but can lead to an excessive number of blocks for multilevel LP DASA. A post-processing step collapses nodes in this separator tree, if the separator consists of more than 6% of the subgraph being processed, or if the subgraph has less than 500 nodes. The same permutation was used for the single-level LP DASA tests, but the entire tree is always collapsed into a single node. Thus, if the top-level separator consists of

Dual multilevel optimization

409

(b) permuted matrix

(a) original AA’

(c) factorization L+L’

Fig. 4 Netlib test problem truss

more than 6% of the nodes of the graph (or if there are fewer than 500 rows in A, the multilevel method collapses the tree into a single block, and the method becomes identical to single-level LP DASA. The original matrix AAT from the truss problem, the permuted matrix PA(PA)T , and its Cholesky factorization, are shown in Fig. 4. Figure 5 shows the separator tree found by CHOLMOD, followed by the collapsed tree. In this case the collapsed tree consists of just three nodes, but in general it can be a subtree of the original tree.

3 Multilevel partitioning In Lemma 1 we take the dual function L and write it as a sum of functions Lk given in (9). Notice that Lk has the same structure as L itself in (3); it is the sum of a separable function of λ (the Fl terms in (9)) and terms j (λ), j ∈ Ck . For any j ∈ Ck , the nonzeros aij in column of j of A correspond to indices i ∈ Rk or ¯ Hence, for any j ∈ Ck , we have i ∈ R. ⎛ j (λ) = inf ⎝fj (xj ) − xj xj ≥0

Fig. 5 Separator tree and collapsed separator tree for truss

¯ i∈R

λi aij − xj

i∈Rk

⎞ λi aij ⎠ .

(11)

410

T. A. Davis, W. W. Hager

Let us view λ¯ as constant. If we identity the first two terms in (11) with the fj term in (4) and if we identity the Rk sum in (11) with the sum in (4), we can apply Lemma 1 to Lk to decompose the dual function further. We will carry out this recursion in the case k = 2. In the initial application of Lemma 1, we have ¯ λ) ¯ + L2 (λ2 , λ), ¯ ¯ + L1 (λ1 , λ) L(λ) = L( where λk denotes the components of λ associated with the set Rk , and where Lk and L¯ are given in (9) and (10) in terms of the sets Ck in (7). Suppose that Rk is partitioned further as ¯ k, Rk = Rk1 ∪ Rk2 ∪ R where the sets Rk1 and Rk2 satisfy the orthogonality condition of Lemma 1. Lemma 1 is applied to (9) in the case k = 2. The terms that define Lk are grouped in the following way: ¯ = L¯ k (λ¯ k ) + Lk1 (λk1 , λ¯ k ) + Lk2 (λk2 , λ¯ k ), Lk (λk , λ)

(12)

where Lkl (λkl , λ¯ k ) =

i∈Rkl

Fi (λi ) +

j (λ),

j∈Ckl

Ck1 = {j ∈ Ck : aij = 0 for some i ∈ Rk1 }, L¯ k (λ¯ k ) = Fi (λi ).

Ck2 = Ck \ Ck1 ,

¯k i∈R

In the recursion (12), some components of λk form the vector λk1 (corresponding to Rk1 ), other components go into the vector λk2 (corresponding to Rk2 ), ¯ k ) are added to λ¯ to form while the remaining components (corresponding to R ¯λk . Thus λk1 and λk2 are subvectors of λk and λ¯ k is λ¯ augmented with the remaining components of λk . When λ¯ k is fixed in (12), the minimization of Lk1 over λk1 and the minimization of Lk2 over λk2 can be done independently. In Fig. 3 we show the nested block-angular arrangement of nonzeros when A is initially partitioned (and permuted) in accordance with Lemma 1. In Fig. 6 we show the further rearrangements of nonzeros arising from the additional partitioning of R1 and R2 , and C1 and C2 . If the sets of Lemma 1 were generated by graph partitioning, as we did for truss in the previous section, then the dependency matrix of A used for the initial partitioning would be replaced by the dependency matrix for Ak , the submatrix of A associated with the intersection of the rows with indices in Rk and columns with indices in Ck . The partitioning process writes the initial Rk ¯ k . The dependency matrix of the as a disjoint union of sets Rkl , l = 1, 2, and R permuted A for truss appears in Fig. 7a. To visualize better the structure of the

Dual multilevel optimization

411

Fig. 6 The matrix in Fig. 3 after applying additional permutations to the rows and columns forming the diagonal blocks

0

0

reordered matrix, we darken the portion between the first nonzero in each row or column and the diagonal to obtain Fig. 7b, the profile of the matrix in Fig. 7a. With these insights, we now formally define a multilevel decomposition of the dual problem. It is described by a tree T with r nodes labeled 1, 2, . . . , r that satisfy the following four conditions: M1. Corresponding to each node k of T , there is a set Rk . The root of the tree is r and Rr = {1, 2, . . . , m}. M2. For each node k of T , Rk is strictly contained in Rπ(k) , where π(k) is the parent of k, the node directly above k in T . M3. For each node p of T , Rk ∩ Rl = ∅ whenever k and l are distinct children of p (that is, p = π(k) = π(l) and k = l). M4. For each node p of T , ai1 ai2 = 0 whenever i1 ∈ Rk and i2 ∈ Rl , where k and l are distinct children of p. In the tree of Fig. 8 associated with a 2-level decomposition, π(1) = 3 and π(3) = 7. The matrix was not ordered with CCOLAMD, so that the original structure is more visible in the decomposition in the figure. By M2 the sets Rl grow in size as we move from the leaves up to the root of the tree. At any level

Fig. 7 a Sparsity structure of (P2 P1 A)(P2 P1 A)T for truss; b the corresponding matrix profile

412

T. A. Davis, W. W. Hager

Fig. 8 Tree associated with 2-level decomposition (12)

7

3

6

1

2

4

5

in the tree, the sets are disjoint according to M3. The orthogonality property M4 leads to the decomposition of the dual function as described in Lemma 1. Using the sets Rk associated with each node in the multilevel decomposition, we construct the following additional sets and functions: (a) the vector λk formed from components of λ associated with elements of Rk , ¯ k and a complementary vector λ¯ k (defined below), (b) a complementary set R (c) the function Lk (also defined below). We now define the entities introduced in (b) and (c). The complementary set ¯ k is obtained by removing from Rk the sets Rc of its children. In other words, R ¯ k = Rk \ R

Rc .

c∈π −1 (k)

¯ k is the separator of the In sparse matrix terminology, the complementary set R −1 ¯k children Rc , c ∈ π (k); the equations (Ax − b)i = 0 associated with i ∈ R couple together the equations associated with each of the children Rc . The complementary vector λ¯ k is defined as follows: At the root, λ¯ r is the ¯ r . At any other node k = r, λ¯ k consists of the vector with components λi , i ∈ R ¯ complementary vector λπ(k) of the parent augmented by those components of ¯ k . Since the parent of the root π(r) is undefined, we λ associated with the set R ¯ take λπ(r) to be empty. An equivalent description of the complementary vector ¯p λ¯ k is the following: It consists of those components of λ associated with sets R for each p on the path between k and r, the root of the tree. Next, we write the decomposition rule (6) of Lemma 1 in a recursive fashion. At the root, Lr (λ) = L(λ). For nodes beneath the root, the structure of the recursion is gleaned from (12). The left side of the equation contains a function Lk associated with node k of the tree. The right side contains functions associated with each of the children of k. The left side involves the variable λk associated with node k, and the complementary variable λ¯ associated with the ¯ The right side involves the dual variables λkl of the parent of k (and the set R). children of k, and the complementary variable λ¯ k for node k. Hence, we have

Dual multilevel optimization Fig. 9 Tree associated with 2-level decomposition (12); a ¯3 = R ¯6 = ∅ ¯ 3 = ∅, b R R

Fig. 10 Dependency matrix for ken_07 in Netlib/lp

413

(a)

(b)

0 500 1000 1500 2000

0

500

1000

1500

2000

nz = 14382

Lk (λk , λ¯ π(k) ) = L¯ k (λ¯ k ) +

Lc (λc , λ¯ k ),

c∈π −1 (k)

= L¯ k (λ¯ k ) +

Lc (λc , λ¯ π(c) ),

(13)

c∈π −1 (k)

with the initialization Lr (λ) = L(λ). If L is split into two parts as in (5), and if these parts are further subdivided into two parts, then the tree associated with the decomposition process is binary. In the special case where the complementary set is vacuous, the multilevel tree ¯ 3 = ∅ in Fig. 8, then the decomposed can be simplified. For example, if R ¯ 6 = ∅ in Fig. 8, Lagrangian is represented by the tree in Fig. 9a. If in addition R then the decomposed Lagrangian is represented by the tree in Fig. 9b. As an example, the dependency matrix of the Netlib test problem ken_07 shown in Fig. 10 is associated with a tree that has 49 nodes beneath the root. 4 Dual dependencies The dual functions Li , i = 1, . . . , r, are not all independent of each other. For the decomposition associated with Fig. 8, L1 , L2 , L4 , and L5 are all independent of each other (assuming the complementary variables are fixed). Also, L1 and L2 are independent of L6 because L1 and L2 are obtained by decomposing L3 , which is independent of L6 . On the other hand, L3 is not independent of L1

414

T. A. Davis, W. W. Hager

and L2 , even when the complementary variables are fixed, because they share common variables. In this section, we specify the independent functions in the dual decomposition. Given a subset N of the nodes of T , the special subtree TN is the tree obtained by pruning from T all descendants of the nodes N . Hence, if N = ∅, then TN = T ; if r ∈ N , then TN = {r}. For any tree T , we let ∂T denote the leaves of the tree (that is, the childless nodes). Lemma 2 Given a multilevel decomposition associated with a tree T satisfying (13) and a special subtree TN , we have L(λ) =

L¯ l (λ¯ l ) +

l∈TN \∂ TN

Ll (λl , λ¯ π(l) ).

(14)

l∈∂ TN

Proof Suppose that (14) is violated for some special subtree TN0 of T . Let l0 ∈ ∂TN0 , and let p0 = π(l0 ) be its parent. Let TN1 be the special subtree of T obtained by pruning from TN0 all the children π −1 (p0 ). That is, N1 = N0 ∪ {p0 }. Either (14) holds for TN1 and we stop the pruning process, or it is violated for TN1 , in which case we choose l1 ∈ ∂TN1 , p1 = π(l1 ), and TN2 is the special subtree of T induced by N2 = N1 ∪ {p1 }. Since the decomposition (14) holds for the special subtree T{r} , which is contained in any special subtree, this pruning process must terminate with a special subtree TNt for which (14) is violated, while (14) holds for TNt+1 . Since TNt+1 is obtained by pruning the children of some node k from TNt , it follows from (13) that Lk (λk , λ¯ π(k) ) = L¯ k (λ¯ k ) +

Lc (λc , λ¯ π(c) ).

(15)

c∈π −1 (k)

Since (14) holds for TNt+1 , we have L(λ) =

l∈TNt+1 \∂ TNt+1

L¯ l (λ¯ l ) +

Ll (λl , λ¯ π(l) ).

(16)

l∈∂ TNt+1

Since k ∈ ∂TNt+1 , it is one of the terms in the last summation in (16). After making the substitution (15) in (16), we obtain (14) in the case N = Nt . This contradicts the fact that (15) was violated for N = Nt . Hence, (14) holds for all special subtrees of T .

Lemma 3 If TN is a special subtree of T and k and l ∈ ∂TN , k = l, then Rk ∩ Rl = ∅. If Sk is the set of indices of λ associated with the complementary ¯ q over all variable λ¯ k (equivalently, Sk is the union of the complementary sets R nodes q on the path between k and the root), then Sk ∩ Rl = ∅. Proof Consider the sequence of ancestors π p (k), p = 1, 2, . . . , and π q (l), q = 1, 2, . . . . Let P = π p (k) = π q (l) be the first common ancestor of k and l (see

Dual multilevel optimization

415

Fig. 11 P is first common ancestor of nodes k and l

P

Fig. 11). Since P is the first common ancestor, the children c1 = π p−1 (k) and c2 = π q−1 (l) are distinct. By M3, Rc1 ∩ Rc2 = ∅. By M2, Rk ⊂ Rc1 and Rl ⊂ Rc2 . Hence, Rk ∩ Rl = ∅. By the definition of the complementary set ¯ P , it is disjoint from the sets Rc of each of the children c ∈ π −1 (P). That is, R ¯ P ∩ Rc = ∅ for each c ∈ π −1 (P). Applying this result to each node between R P and the root r, we conclude that SP ∩ Rc = ∅ for each c ∈ π −1 (P). For each node on the path from c1 to k, the complementary set is contained in Rc1 due to M2. Hence, Sk ⊂ (Rc1 ∪ SP ). The relations Rc1 ∩ Rc2 = ∅,

Sk ∩ Rc2 = ∅,

Rl ⊂ Rc2 ,

and

Sk ⊂ (Rc1 ∪ SP ),

imply that Sk ∩ Rl = ∅.

By Lemmas 2 and 3, it follows that when TN is a special subtree of T , maximizing the dual function in (14) over λl for each l ∈ ∂TN , while holding the other components of λ fixed, decomposes into independent maximizations: max Ll (λl , λ¯ π(l) ). λl

5 An example We give a concrete illustration of a multilevel decomposition using the dual of a linear programming problem min

cT x subject to Ax = b,

x ≥ 0,

416

T. A. Davis, W. W. Hager

⎡

with

1 ⎢0 ⎢ A=⎢ ⎢1 ⎣0 1

1 0 1 0 0

0 1 1 0 1

0 0 1 0 0

0 0 0 1 1

⎤ 0 0⎥ ⎥ 0⎥ ⎥. 1⎦ 0

(17)

For the tree shown in Fig. 12, it can be checked that conditions M1–M4 hold. The first level of the tree corresponds to the decomposition L(λ) = b5 λ5 + L3 (λ1 , λ2 , λ3 , λ5 ) + L4 (λ4 , λ5 ), where the complementary variable is λ¯ 5 = λ5 , and the first term b5 λ5 corresponds to L¯ 5 (λ¯ 5 ). The functions L3 and L4 are given as follows:

L3 (λ1 , λ2 , λ3 , λ5 ) =

⎧ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎩

b1 λ1 + b2 λ2 + b3 λ3 −∞

⎧ c ≥ λ1 + λ3 + λ5 ⎪ ⎪ ⎨ 1 c2 ≥ λ1 + λ3 if c3 ≥ λ2 + λ3 + λ5 ⎪ ⎪ ⎩ c4 ≥ λ3 otherwise,

and L4 (λ4 , λ5 ) =

⎧ ⎨ ⎩

b4 λ4 −∞

c5 ≥ λ4 + λ5 c6 ≥ λ4 otherwise. if

L3 is independent of L4 when λ¯ 5 is fixed. At nodes 3 and 4, we have λ3 = (λ1 , λ2 , λ3 ) and λ4 = (λ4 ). By the structure of L3 , it can be further decomposed as L3 (λ) = b3 λ3 + L1 (λ1 , λ3 , λ5 ) + L2 (λ2 , λ3 , λ5 ), where L1 (λ1 , λ3 , λ5 ) =

⎧ ⎨ ⎩

b1 λ1 −∞

c1 ≥ λ1 + λ3 + λ5 c2 ≥ λ1 + λ3 otherwise, if

and L2 (λ2 , λ3 , λ5 ) =

⎧ ⎨ ⎩

b2 λ2 −∞

c3 ≥ λ2 + λ3 + λ5 c4 ≥ λ3 otherwise. if

The complementary variable at node 3 is λ¯ 3 = (λ3 , λ5 ), the complementary vari¯ 3 = {3}. L1 able at the parent of 3 augmented by the variable λ3 associated with R

Dual multilevel optimization

417

Fig. 12 Multilevel decomposition associated with (17)

Fig. 13 Nonzero structure of A corresponding to multilevel partitioning tree in Fig. 12

1 1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0

and L2 are independent when λ¯ 3 is fixed. The nonzero structure of A associated with the tree of Fig. 12 is shaded in Fig. 13. 6 Multilevel DASA It was pointed out in Sect. 1 that a multilevel decomposition could be used in various algorithms for solving the dual problem; a specific illustration was dual coordinate ascent. Here we focus on a more efficient use of a multilevel decomposition based on the dual active set algorithm (DASA). By [19, Theorem. 1], DASA solves (1) in a finite number of iterations when f is strongly convex and F is strongly concave. Given a bound set B ⊂ {1, 2, . . . , n}, let xB be the subvector of x consisting of those components xi associated with i ∈ B. We define the functions (18) LB+ (λ) = min {L(λ, x) : x ∈ Rn , xB ≥ 0}, and LB0 (λ) = min {L(λ, x) : x ∈ Rn , xB = 0},

(19)

where L(λ, x) = F(λ) + f (x) − λT Ax. The components of x in the minimization problems (18) and (19) corresponding to indices in the complement of B are unconstrained. Assuming the fj are strongly convex, there exists a unique minimizer x(λ, B) in (18) for each choice of λ and B. Let x(λ) denote x(λ, B) in the case B = {1, 2, . . . , n}.

418

T. A. Davis, W. W. Hager

Algorithm 1 Dual active set algorithm s=0 λ0 = starting guess while ∇L(λs ) = 0 ν 0 = λs B0 = {j ∈ [1, n] : xj (λs ) = 0} for t = 0, 1, . . . ω = arg max{LB0 (λ) : λ ∈ Rm } t

ν t+1 = arg max{LB+ (λ) : λ ∈ [ν t , ω]} t Bt+1 = j ∈ Bt : xj (ν t+1 , Bt ) = 0 if ν t+1 = ω break

end s=s+1 λs = ω end

The statement of DASA in Algorithm 1 has an outer loop indexed by s, and an inner loop indexed by t. In the inner loop, indices are deleted from Bt to obtain Bt+1 ; that is, Bt+1 is a subset of Bt consisting of those indices for which xi (ν t+1 , Bt ) = 0. When the inner loop terminates, we reinitialize B0 by solving the problem (20) min L(λs , x) x≥0

to obtain x(λs ). Since the optimization over x in (20) decomposes into independent optimization over xj as in (4), it follows that the B0 set at the start of the iteration includes some indices that were missing from the final Bt associated with the inner loop in the previous iteration. Hence, the dual initialization step adds indices to the current bound set, while the formula for Bt+1 in the subiteration deletes bound indices. The proof [19, Thm. 1] that DASA solves (1) in a finite number of iterations is based on the fact that the subiteration strictly increases the value of the dual function; as a result, the final set Bt generated in the subiteration cannot repeat. Since the sets Bt are chosen from a finite set, convergence occurs in a finite number of steps. Algorithm 2 is the multilevel version of DASA. Here we replace the maximization over λ in the computation of ω in Algorithm 1 by a maximization over the leaves of a special subtree. The set Nt in Algorithm 2 is a collection of nodes in the tree T ; these nodes essentially correspond to components of the dual variable over which the dual function has been maximized. Initially, N0 = ∅ because we have not maximized over any components of λ. As the inner iterations progress, the set Nt grows in size. In each inner iteration, we maximize LB0 over the leaves of the special subtree TNt . As seen after Lemmas t 2 and 3, maximization over the leaves of a special subtree decomposes into the independent maximization of local dual functions, which could be done in parallel. These maximizers are utilized in local line searches that take us to a new point ν t+1 with a larger value for both LB+ and L. Since the bound set Bt t can only decrease in size, the line search on the interval [ν, ω] eventually takes a

Dual multilevel optimization

419

Algorithm 2 Multilevel dual active set algorithm s=0 λ0 = starting guess while ∇L(λs ) = 0 initialize ν 0 = λs and N0 = ∅ B0 = {j ∈ [1, n] : xj (λs ) = 0} for t = 0, 1, . . . initialize ν = ν t and N = Nt for each l ∈ ∂TNt ω = arg max{LB0 (λ) : λi = νi , for all i ∈ Rcl } t

ν ← arg max{LB+ (λ) : λ ∈ [ν, ω]} t

if ν = ω, then N ← N ∪ {l}

end ν t+1 = ν and Nt+1 = N if r ∈ N break Bt+1 = j ∈ Bt : xj (ν t+1 , Bt ) = 0 end s=s+1 λs = ν t end

full step to ω, in which case N grows by the addition of node l. That is, the step stops short of ω only when the bound set decreases in size. If a full step is taken in the line search, then we add the associated node l to the set of completed nodes N . Eventually, N includes the root r, in which case the inner iteration terminates. The proof of finite convergence is the same given previously for DASA (e.g. [17, 19, 21]). That is, in the multilevel setting, we work up to the root of the tree T increasing the value of the dual function, and by M1, the set Rr contains all the indices of λ. Hence, in the multilevel setting, the final dual subiterate again takes us to a maximizer of LB0 for some set Bk , which can never repeat k without contradicting the strict ascent of the dual iterates. In Algorithms 1 and 2, we maximize LB0 in each subiteration. As an alternak tive, we could approximately maximize LB0 in each subiteration. Under suitable k assumptions (see [20]), convergence is achieved in the limit, as the number of iterations tends to infinity.

7 Numerical comparisons In this section, we apply both single level and multilevel versions of LP DASA to Netlib linear programming test problems and compare the performance to that of simplex and barrier methods as implemented in the commercial code ILOG CPLEX version 9.1.3 [2]. The test problems were solved on a 3.2 GHz Pentium 4 with 4 GB of RAM, running SUSE Linux, and the Goto BLAS [15]. Since the CPLEX Barrier code automatically applies a presolver [1] to the input problem, we use the CPLEX presolved problem as input to the codes. All

420 Table 1 Description of the test problems

T. A. Davis, W. W. Hager Problem

Rows

Columns

Nonzeros

perold 25fv47 nug07 pilot_we pilotnov cre_c cre_a osa_07 nesm d6cube truss fit2p pilot_ja nug08 fit2d greenbeb greenbea 80bau3b qap8 ken_11 degen3 pds_06 osa_14 d2q06c maros_r7 stocfor3 pilot osa_30 pds_10 ken_13 cre_d cre_b pilot87 osa_60 pds_20 ken_18 nug12 qap12 dfl001 nug15 qap15 nug20 nug30

503 682 473 602 748 2,257 2,684 1,047 598 402 1,000 3,000 708 741 25 1,015 1,015 1,789 741 5,511 1,406 2,972 2,266 1,855 2,152 8,388 1,204 4,279 4,725 10,962 3,990 5,176 1,811 10,209 10,214 39,867 2,793 2,793 3,861 5,697 5,697 14,097 49,647

1,273 1,732 930 2,526 1,999 5,293 6,382 24,062 2,764 5,467 8,806 13,525 1,684 1,631 10,388 3,211 3,220 9,872 1,631 11,984 2,503 18,530 52,723 5,380 6,578 15,254 4,124 100,396 33,270 24,818 28,489 36,222 6,065 234,334 81,224 89,439 8,855 8,855 9,607 22,274 22,274 72,599 376,740

5,545 10,181 3,321 8,414 11,703 13,078 15,739 63,037 12,988 34,332 27,836 50,284 11,155 5,940 127,793 23,042 23,142 20,007 5,940 26,538 25,204 42,304 139,136 31,325 80,167 53,528 41,160 267,149 76,307 57,238 86,144 111,434 71,838 594,462 184,176 208,594 33,526 33,526 31,955 85,468 85,468 281,958 1,486,829

Netlib problems taking more than 0.1 s (based the fastest of CPLEX and LP DASA run times) are listed in Table 1. The table is sorted by this run time. These statistics are for the presolved problems used in the numerical experiments, not for the original problems on the Netlib test site. The multilevel partitioning is computed via nested dissection of AAT using CHOLMOD Version 1.2. The run times we report for LP DASA include the time to compute the ordering, including the separator tree and its post-processing.

Dual multilevel optimization

421

For LP DASA, developed in [10], we take f (x) =

x − y2 , 2

where > 0 is a regularization parameter and y is a shift. Once we achieve the maximum in the dual problem, the regularization parameter and shift are updated. The details of their adjustments are given in [10]. In coding the LP DASA multilevel algorithm, we need to factor a matrix of the form AF ATF , where F is the complement of the set Bk in Algorithm 1, and the subscript “F ” denotes the submatrix of A associated with column indices in F. This factorization is done using a supernodal Cholesky factorization algorithm, in CHOLMOD. In the multilevel case, the rows and columns of L corresponding to leaves of the separator tree are factorized in a supernodal method. The nodes farther up in the tree are factorized incrementally as they are needed, using a row-oriented sparse factorization [5, 30]. We refactor the matrix infrequently, and after a column is freed or a row is dropped, we use the sparse modification techniques developed in [8, 9, 11] to modify the factorization. Since the Cholesky factorization is computed row-by-row, it can be incrementally extended in the multilevel setting. The submatrix associated with a subtree (nodes 1, 2 and 3 in Fig. 12, for example) can be factored without having to compute the entire factorization. New rows corresponding to separators can be added to the factorization without touching the previously computed partial factorization. The codes for modifying a factorization after column or row updates take advantage of speedup associated with multiple-rank updates. As a result, a rank-16 update of a sparse matrix achieves flop rates comparable to those of a BLAS dot product, and about 1/3 the speed of a BLAS matrix-matrix multiply [12], in the experiments given in [9]. In coding LP DASA, we use a new version of COLAMD [7], which we call constrained COLAMD (CCOLAMD), to reorder the rows of A to minimize the fill associated with the Cholesky factorization of AAT . We constrain the ordering so that rows associated with a set Rk for a leaf of the partitioning tree remain in Rk , where the sets Rk are chosen to satisfy the orthogonality condition of Lemma 1; to preserve this orthogonality property, we can only exchange rows within Rk . Our constrained COLAMD, developed in collaboration with S. Rajamanickam, is related to the scheme developed in [3] for a sparse LU factorization. Our starting guesses for an optimal dual multiplier and primal solution are always λ = 0 and x = 0 respectively. Our convergence test is based on the LP optimality conditions. The primal approximation x(λ) always satisfies the bound constraints x ≥ 0. The column indices are expressed as B ∪ F where F = Bc . For all j ∈ B, xj (λ) = 0 and (c − AT λ)j ≥ 0. Hence, the optimality conditions would be satisfied if the primal and dual linear systems, Ax = b and ATF λ = cF , were satisfied. Our convergence criterion is expressed in terms of the relative residuals in the primal and dual systems:

422

T. A. Davis, W. W. Hager

cF − ATF λ∞ b − Ax∞ + ≤ 10−8 . 1 + x∞ 1 + λ∞

(21)

Here · ∞ denotes the maximum absolute component of a vector. Note that the test problems are given in MPS format which provides, in general, space for about five significant digits when the data is represented in floating point (exponential) format. Hence, the number of significant digits in A, b, and c is five in general, while the relative error in (21) is 10−8 . Note that in certain cases, more than five significant digits can be represented in MPS format; for example, any problem where the data consists of small integers, such as a network problem, could be represented exactly using MPS format. The CPU times for LP DASA, both single level and multilevel versions, and for CPLEX simplex and barrier codes are in Table 2. We also give the number of blocks for multilevel LP DASA. A dash means the problem cannot be solved by that method on our computer. Note that LP DASA was the only code that was able to solve nug30. With both Barrier and Simplex, there was not enough memory. LP DASA switches to an iterative implementation when there is insufficient memory to factor the matrix. In the iterative approach, developed in [20], a conjugate gradient scheme is used to compute the solution of the quadratic programming problems as opposed to a direct method based on a Cholesky factorization. Observe that the number of blocks in the truss problem is 3, which is the number of nodes in the collapsed tree in Fig. 5. Even for a single processor, the multilevel implementation is typically at least as fast as the single level implementation. The small subproblems in the multilevel approach are solved quickly and the overall flop count (not listed) is much smaller than that of a single-level implementation. Of course in a parallel computing environment where the 324 independent subproblems associated with ken_18 (for example) can be assigned to separate computers, significant speedup is possible. In Table 2, we see the following: multilevel LP DASA is faster than CPLEX simplex for 20 out of 43 problems, and it is faster than CPLEX Barrier for 12 problems.

8 Summary Sparse optimization problems with separable cost and with linear constraints and bounds can be decomposed into a multilevel structure by applying codes such as CHACO or METIS to the dependency matrix. The multilevel structure is described by a tree in which each node i is associated with a function Li . Those Li associated with the leaves of a special subtree can be optimized independently, when the complementary variables are fixed. In an implementation of the dual active set algorithm, the computation proceeds from the leaves of the tree up to the root, optimizing over the leaves of special subtrees along the way. Even on a single processor, the multilevel

Dual multilevel optimization

423

Table 2 LP DASA (single and multilevel), CPLEX (Simplex and Barrier) Problem

One level

Multilevel

Blocks

Simplex

Barrier

perold 25fv47 nug07 pilot_we pilotnov cre_c cre_a osa_07 nesm d6cube truss fit2p pilot_ja nug08 fit2d greenbeb greenbea 80bau3b qap8 ken_11 degen3 pds_06 osa_14 d2q06c maros_r7 stocfor3 pilot osa_30 pds_10 ken_13 cre_d cre_b pilot87 osa_60 pds_20 ken_18 nug12 qap12 dfl001 nug15 qap15 nug20 nug30

1.16 0.18 0.30 1.53 1.21 0.44 0.61 0.22 0.39 3.53 1.08 0.23 3.72 0.71 0.26 0.59 1.82 1.60 0.84 0.53 0.65 1.93 0.76 2.70 0.83 5.24 5.30 1.74 5.23 5.57 8.37 8.63 17.69 5.10 34.91 58.53 52.86 64.22 36.94 438.60 339.76 6094.90 32206.34

1.16 0.18 0.29 1.52 1.20 0.40 0.60 0.19 0.39 3.54 0.94 0.23 3.70 0.70 0.26 0.57 2.19 1.14 0.83 0.32 0.84 1.69 0.68 2.70 0.83 4.33 5.29 1.51 4.65 2.72 8.33 8.34 17.73 3.76 36.34 23.49 52.99 64.37 37.08 443.34 342.70 6175.99 32206.34

1 1 1 1 1 93 133 628 1 1 3 1 1 1 1 11 3 49 1 122 3 23 1837 1 1 26 1 4070 96 170 1 576 1 9766 149 325 1 1 1 1 1 1 1

0.15 0.32 0.29 0.54 0.23 0.18 0.18 0.23 0.35 0.20 11.62 4.46 0.33 1.86 0.59 1.46 1.03 0.28 1.40 0.38 0.57 0.50 0.59 2.81 3.29 1.75 3.22 1.41 1.43 1.50 1.86 4.24 22.49 4.08 9.16 11.74 202.72 197.93 22.51 2787.11 3143.31 – –

0.12 0.12 0.12 0.16 0.18 0.20 0.25 0.43 0.19 0.31 0.21 0.43 0.25 0.25 0.71 0.27 0.28 0.41 0.28 0.55 0.43 1.81 1.03 0.62 1.37 1.09 1.19 2.56 6.29 1.65 2.54 3.03 3.68 6.58 29.07 11.23 11.33 12.71 16.04 78.99 85.83 13755.03 –

implementation was faster than the single level implementation. Competitive performance, as compared to CPLEX, was achieved for the test set of Table 1. Further speedup in LP DASA could be achieved by exploiting hypersparsity of the solution to the linear systems that arise during the solution process (see [23]), and by developing a recursive version of the equation dropping techniques introduced in [10].

424

T. A. Davis, W. W. Hager

Acknowledgements Constructive comments by the referee leading to an improved presentation are gratefully acknowledged. This material is based upon work supported by the National Science Foundation under Grant No. 0203270.

References 1. Andersen, E.D., Andersen, K.D.: Presolving in linear programming. Math. Program. 71, 221–245 (1995) 2. Bixby, R.E.: Progress in linear programming. ORSA J. Comput. 6, 15–22 (1994) 3. Brainman, I., Toledo, S.: Nested-dissection orderings for sparse LU with partial pivoting. SIAM J. Matrix Anal. Appl. 23, 998–1012 (2002) 4. Bramble, J.H.: Multigrid Methods. Wiley, New York (1993) 5. Davis, T.A.: Algorithm 849: a concise sparse Cholesky factorization package. ACM Trans. Math. Softw. 31, 587–591 (2005) 6. Davis, T.A.: CHOLMOD users’ guide. University of Florida (2005). http: www.cise.ufl.edu/ ∼davis 7. Davis, T.A., Gilbert, J.R., Larimore, S.I., Ng, E.G.: A column approximate minimum degree ordering algorithm. ACM Trans. Math. Softw. 30, 353–376 (2004) 8. Davis, T.A., Hager, W.W.: Modifying a sparse Cholesky factorization. SIAM J. Matrix Anal. Appl. 20, 606–627 (1999) 9. Davis, T.A., Hager, W.W.: Multiple-rank modifications of a sparse Cholesky factorization. SIAM J. Matrix Anal. Appl. 22, 997–1013 (2001) 10. Davis, T.A., Hager, W.W.: A sparse proximal implementation of the LP dual active set algorithm. Math. Program (in press) (2006). DOI 10.1007/s10107-006-0017-0 11. Davis, T.A., Hager, W.W.: Row modifications of a sparse Cholesky factorization. SIAM J. Matrix Anal. Appl. 26, 621–639 (2005) 12. Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1–17 (1990) 13. Ferris, M.C., Horn, J.D.: Partitioning mathematical programs for parallel solution. Math. Program. 80, 35–62 (1998) 14. Gondzio, J., Sarkissian, R.: Parallel interior-point solver for structured linear programs. Math. Prog. 96, 561–584 (2003) 15. Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. TR-2002-55, University of Texas at Austin, Departmentn of Computer Sciences (2002) 16. Hackbusch, W.: Multigrid Methods and Applications. Springer, Berlin Heidelberg New York (1985) 17. Hager, W.W.: The dual active set algorithm. In: Pardalos, P.M. (ed.) Advances in Optimization and Parallel Computing 137–142. North Holland, Amsterdam (1992) 18. Hager, W.W.: The LP dual active set algorithm. In: Leone, R.D., Murli, A., Pardalos, P.M., Toraldo, G. (eds.) High Performance Algorithms and Software in Nonlinear Optimization pp. 243–254. Kluwer, Dordrecht (1998) 19. Hager, W.W.: The dual active set algorithm and its application to linear programming. Comput. Optim. Appl. 21, 263–275 (2002) 20. Hager, W.W.: The dual active set algorithm and the iterative solution of linear programs. In: Pardalos, P.M., Wolkowicz, H. (eds.) Novel Approaches to Hard Discrete Optimization pp. 95–107, vol. 37. Fields Institute Communications (2003) 21. Hager, W.W., Hearn, D.W.: Application of the dual active set algorithm to quadratic network optimization. Comput. Optim. Appl. 1, 349–373 (1993) 22. Hager, W.W., Shi, C.-L., Lundin, E.O.: Active set strategies in the LP dual active set algorithm, Tech. Report, University of Florida, http://www.math.ufl.edu/∼hager/LPDASA (1996) 23. Hall, J.A.J., McKinnon, K.I.M.: Hyper-sparsity in the revised simplex method and how to exploit it. Comput. Optim. Appl. 32, 259–283 (2005) 24. Heath, M.T., Raghavan, P.: A Cartesian parallel nested dissection algorithm. SIAM J. Matrix Anal. Appl. 16, 235–253 (1995) 25. Hendrickson, B., Leland, R.: A multilevel algorithm for partitioning graphs. In: Proc. Supercomputing ’95, ACM (1995)

Dual multilevel optimization

425

26. Hendrickson, B., Rothberg, E.: Improving the runtime and quality of nested dissection ordering. SIAM J. Sci. Comput. 20, 468–489 (1999) 27. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 359–392 (1998) 28. Karypis, G., Kumar, V.: Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 96–129 (1999) 29. Karypis, G., Kumar, V.: Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Rev. 41, 278–300 (1999) 30. Liu, J.W.H.: A generalized envelope method for sparse factorization by rows. ACM Trans. Math. Softw. 17, 112–129 (1991) 31. Luenberger, D.G.: Linear and Nonlinear Programming. Addison Wesley, Reading (1984) 32. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 623–653 (1948)