Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1–11, MIT, Massachusetts, USA, 9-11 October 2010. c 2010...


David Sontag Michael Collins Tommi Jaakkola
MIT CSAIL, Cambridge, MA 02139, USA
{srush,dsontag,mcollins,tommi}@csail.mit.edu

Abstract

This paper introduces dual decomposition as a framework for deriving inference algorithms for NLP problems. The approach relies on standard dynamic-programming algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. The approach provably solves a linear programming (LP) relaxation of the global inference problem. It leads to algorithms that are simple, in that they use existing decoding algorithms; efficient, in that they avoid exact algorithms for the full model; and often exact, in that empirically they often recover the correct solution in spite of using an LP relaxation. We give experimental results on two problems: 1) the combination of two lexicalized parsing models; and 2) the combination of a lexicalized parsing model and a trigram part-of-speech tagger.

1 Introduction

Dynamic programming algorithms have been remarkably useful for inference in many NLP problems. Unfortunately, as models become more complex, for example through the addition of new features or components, dynamic programming algorithms can quickly explode in terms of computational or implementational complexity.[1] As a result, efficiency of inference is a critical bottleneck for many problems in statistical NLP.

This paper introduces dual decomposition (Dantzig and Wolfe, 1960; Komodakis et al., 2007) as a framework for deriving inference algorithms in NLP. Dual decomposition leverages the observation that complex inference problems can often be decomposed into efficiently solvable sub-problems. The approach leads to inference algorithms with the following properties:

• The resulting algorithms are simple and efficient, building on standard dynamic-programming algorithms as oracle solvers for sub-problems,[2] together with a method for forcing agreement between the oracles.
• The algorithms provably solve a linear programming (LP) relaxation of the original inference problem.
• Empirically, the LP relaxation often leads to an exact solution to the original problem.

The approach is very general, and should be applicable to a wide range of problems in NLP. The connection to linear programming ensures that the algorithms provide a certificate of optimality when they recover the exact solution, and also opens up the possibility of methods that incrementally tighten the LP relaxation until it is exact (Sherali and Adams, 1994; Sontag et al., 2008).

The structure of this paper is as follows. We first give two examples as an illustration of the approach: 1) integrated parsing and trigram part-of-speech (POS) tagging; and 2) combined phrase-structure and dependency parsing. In both settings, it is possible to solve the integrated problem through an "intersected" dynamic program (e.g., for integration of parsing and tagging, the construction from Bar-Hillel et al. (1964) can be used). However, these methods, although polynomial time, are substantially less efficient than our algorithms, and are considerably more complex to implement.

Next, we describe exact polyhedral formulations for the two problems, building on connections between dynamic programming algorithms and marginal polytopes, as described in Martin et al. (1990). These allow us to precisely characterize the relationship between the exact formulations and the LP relaxations that we solve. We then give guarantees of convergence for our algorithms by showing that they are instantiations of Lagrangian relaxation, a general method for solving linear programs of a particular form.

Finally, we describe experiments that demonstrate the effectiveness of our approach. First, we consider the integration of the generative model for phrase-structure parsing of Collins (2003) with the second-order discriminative dependency parser of Koo et al. (2008). This is an interesting problem in its own right: the goal is to inject the high performance of discriminative dependency models into phrase-structure parsing. The method uses off-the-shelf decoders for the two models. We find three main results: 1) in spite of solving an LP relaxation, empirically the method finds an exact solution on over 99% of the examples; 2) the method converges quickly, typically requiring fewer than 10 iterations of decoding; 3) the method gives gains over a baseline method that forces the phrase-structure parser to produce the same dependencies as the first-best output from the dependency parser (the Collins (2003) model has an F1 score of 88.1%; the baseline method has an F1 score of 89.7%; and the dual decomposition method has an F1 score of 90.7%).

In a second set of experiments, we use dual decomposition to integrate the trigram POS tagger of Toutanova and Manning (2000) with the parser of Collins (2003). We again find that the method finds an exact solution in almost all cases, with convergence in just a few iterations of decoding.

Although the focus of this paper is on dynamic programming algorithms (both in the experiments and in the formal results concerning marginal polytopes), it is straightforward to use other combinatorial algorithms within the approach. For example, Koo et al. (2010) describe a dual decomposition approach for non-projective dependency parsing, which makes use of both dynamic programming and spanning tree inference algorithms.

[1] The same is true for NLP inference algorithms based on other exact combinatorial methods, for example methods based on minimum-weight spanning trees (McDonald et al., 2005), or graph cuts (Pang and Lee, 2004).
[2] More generally, other exact inference methods can be used as oracles, for example spanning tree algorithms for non-projective dependency structures.

2 Related Work

Dual decomposition is a classical method for solving optimization problems that can be decomposed into efficiently solvable sub-problems. Our work is inspired by dual decomposition methods for inference in Markov random fields (MRFs) (Wainwright et al., 2005a; Komodakis et al., 2007; Globerson and Jaakkola, 2007). In this approach, the MRF is decomposed into sub-problems corresponding to tree-structured subgraphs that together cover all edges of the original graph. The resulting inference algorithms provably solve an LP relaxation of the MRF inference problem, often significantly faster than commercial LP solvers (Yanover et al., 2006).

Our work is also related to methods that incorporate combinatorial solvers within loopy belief propagation (LBP), either for MAP inference (Duchi et al., 2007) or for computing marginals (Smith and Eisner, 2008). Our approach similarly makes use of combinatorial algorithms to efficiently solve sub-problems of the global inference problem. However, unlike LBP, our algorithms have strong theoretical guarantees, such as guaranteed convergence and the possibility of a certificate of optimality. These guarantees are possible because our algorithms directly solve an LP relaxation.

Other work has considered LP or integer linear programming (ILP) formulations of inference in NLP (Martins et al., 2009; Riedel and Clarke, 2006; Roth and Yih, 2005). These approaches typically use general-purpose LP or ILP solvers. Our method has the advantage that it leverages underlying structure arising in LP formulations of NLP problems. We will see that dynamic programming algorithms such as CKY can be considered to be very efficient solvers for particular LPs. In dual decomposition, these LPs, and their efficient solvers, can be embedded within larger LPs corresponding to more complex inference problems.

3 Background: Structured Models for NLP

We now describe the type of models used throughout the paper. We take some care to set up notation that will allow us to make a clear connection between inference problems and linear programming. Our first example is weighted CFG parsing. We assume a context-free grammar, in Chomsky normal form, with a set of non-terminals N . The grammar contains all rules of the form A → B C and A → w where A, B, C ∈ N and w ∈ V (it is simple to relax this assumption to give a more constrained grammar). For rules of the form A → w we refer to A as the part-of-speech tag for w. We allow any non-terminal to be at the root of the tree.

Given a sentence with $n$ words, $w_1, w_2, \ldots, w_n$, a parse tree is a set of rule productions of the form $\langle A \to B\,C, i, k, j \rangle$ where $A, B, C \in \mathcal{N}$ and $1 \le i \le k < j \le n$. Each rule production represents the use of CFG rule $A \to B\,C$ where non-terminal $A$ spans words $w_i \ldots w_j$, non-terminal $B$ spans words $w_i \ldots w_k$, and non-terminal $C$ spans words $w_{k+1} \ldots w_j$. There are $O(|\mathcal{N}|^3 n^3)$ such rule productions. Each parse tree corresponds to a subset of these rule productions, of size $n - 1$, that forms a well-formed parse tree.[3]

We now define the index set for CFG parsing as

\[ \mathcal{I} = \{ \langle A \to B\,C, i, k, j \rangle : A, B, C \in \mathcal{N},\ 1 \le i \le k < j \le n \} \]

Each parse tree is a vector $y = \{ y_r : r \in \mathcal{I} \}$, with $y_r = 1$ if rule $r$ is in the parse tree, and $y_r = 0$ otherwise. Hence each parse tree is represented as a vector in $\{0, 1\}^m$, where $m = |\mathcal{I}|$. We use $\mathcal{Y}$ to denote the set of all valid parse-tree vectors; the set $\mathcal{Y}$ is a subset of $\{0, 1\}^m$ (not all binary vectors correspond to valid parse trees).

In addition, we assume a vector $\theta = \{ \theta_r : r \in \mathcal{I} \}$ that specifies a weight for each rule production.[4] Each $\theta_r$ can take any value in the reals. The optimal parse tree is $y^* = \arg\max_{y \in \mathcal{Y}} y \cdot \theta$, where $y \cdot \theta = \sum_r y_r \theta_r$ is the inner product between $y$ and $\theta$.

We use $y_r$ and $y(r)$ interchangeably (similarly for $\theta_r$ and $\theta(r)$) to refer to the $r$'th component of the vector $y$. For example, $\theta(A \to B\,C, i, k, j)$ is a weight for the rule $\langle A \to B\,C, i, k, j \rangle$. We will use similar notation for other problems.

[3] We do not require rules of the form $A \to w_i$ in this representation, as they are redundant: specifically, a rule production $\langle A \to B\,C, i, k, j \rangle$ implies a rule $B \to w_i$ iff $i = k$, and $C \to w_j$ iff $j = k + 1$.
[4] We do not require parameters for rules of the form $A \to w$, as they can be folded into rule production parameters. E.g., under a PCFG we define $\theta(A \to B\,C, i, k, j) = \log P(A \to B\,C \mid A) + \delta_{i,k} \log P(B \to w_i \mid B) + \delta_{k+1,j} \log P(C \to w_j \mid C)$, where $\delta_{x,y} = 1$ if $x = y$, 0 otherwise.

As a second example, in POS tagging the task is to map a sentence of $n$ words $w_1 \ldots w_n$ to a tag sequence $t_1 \ldots t_n$, where each $t_i$ is chosen from a set $\mathcal{T}$ of possible tags. We assume a trigram tagger, where a tag sequence is represented through decisions $\langle (A, B) \to C, i \rangle$ where $A, B, C \in \mathcal{T}$, and $i \in \{3 \ldots n\}$. Each production represents a transition where $C$ is the tag of word $w_i$, and $(A, B)$ are the previous two tags. The index set for tagging is

\[ \mathcal{I}_{\text{tag}} = \{ \langle (A, B) \to C, i \rangle : A, B, C \in \mathcal{T},\ 3 \le i \le n \} \]

Note that we do not need transitions for $i = 1$ or $i = 2$, because the transition $\langle (A, B) \to C, 3 \rangle$ specifies the first three tags in the sentence.[5] Each tag sequence is represented as a vector $z = \{ z_r : r \in \mathcal{I}_{\text{tag}} \}$, and we denote the set of valid tag sequences, a subset of $\{0, 1\}^{|\mathcal{I}_{\text{tag}}|}$, as $\mathcal{Z}$. Given a parameter vector $\theta = \{ \theta_r : r \in \mathcal{I}_{\text{tag}} \}$, the optimal tag sequence is $\arg\max_{z \in \mathcal{Z}} z \cdot \theta$.

[5] As one example, in an HMM, the parameter $\theta((A, B) \to C, 3)$ would be $\log P(A \mid {*}{*}) + \log P(B \mid {*}A) + \log P(C \mid AB) + \log P(w_1 \mid A) + \log P(w_2 \mid B) + \log P(w_3 \mid C)$, where $*$ is the start symbol.

As a modification to the above approach, we will find it convenient to introduce extended index sets for both the CFG and POS tagging examples. For the CFG case we define the extended index set to be $\mathcal{I}' = \mathcal{I} \cup \mathcal{I}_{\text{uni}}$ where

\[ \mathcal{I}_{\text{uni}} = \{ (i, t) : i \in \{1 \ldots n\},\ t \in \mathcal{T} \} \]

Here each pair $(i, t)$ represents word $w_i$ being assigned the tag $t$. Thus each parse-tree vector $y$ will have additional (binary) components $y(i, t)$ specifying whether or not word $i$ is assigned tag $t$. (Throughout this paper we will assume that the tagset used by the tagger, $\mathcal{T}$, is a subset of the set of non-terminals considered by the parser, $\mathcal{N}$.) Note that this representation is over-complete, since a parse tree determines a unique tagging for a sentence: more explicitly, for any $i \in \{1 \ldots n\}$, $Y \in \mathcal{T}$, the following linear constraint holds:

\[ y(i, Y) = \sum_{k=i+1}^{n} \sum_{X, Z \in \mathcal{N}} y(X \to Y\,Z, i, i, k) + \sum_{k=1}^{i-1} \sum_{X, Z \in \mathcal{N}} y(X \to Z\,Y, k, i-1, i) \]

We apply the same extension to the tagging index set, effectively mapping trigrams down to unigram assignments, again giving an over-complete representation. The extended index set for tagging is referred to as $\mathcal{I}'_{\text{tag}}$.

From here on we will make exclusive use of extended index sets for CFG parsing and trigram tagging. We use the set $\mathcal{Y}$ to refer to the set of valid parse structures under the extended representation; each $y \in \mathcal{Y}$ is a binary vector of length $|\mathcal{I}'|$. We similarly use $\mathcal{Z}$ to refer to the set of valid tag structures under the extended representation. We assume parameter vectors for the two problems, $\theta^{\text{cfg}} \in \mathbb{R}^{|\mathcal{I}'|}$ and $\theta^{\text{tag}} \in \mathbb{R}^{|\mathcal{I}'_{\text{tag}}|}$.
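To make the extended tagging representation concrete, the following is a minimal Python sketch. It is an illustration rather than the authors' implementation: the function names `tag_index_set`, `encode`, and `score`, the tuple encoding of index entries, and the toy tagset are all assumptions introduced here. A tag sequence is stored sparsely as the set of index entries whose component equals 1, and the score is the inner product with a sparse weight dictionary.

```python
from itertools import product

def tag_index_set(n, tags):
    """Extended index set: trigram decisions ((A, B), C, i) for 3 <= i <= n,
    plus unigram pairs (i, t) for 1 <= i <= n."""
    trigrams = [((a, b), c, i)
                for i in range(3, n + 1)
                for a, b, c in product(tags, repeat=3)]
    unigrams = [(i, t) for i in range(1, n + 1) for t in tags]
    return trigrams + unigrams

def encode(tag_seq):
    """Sparse form of the binary vector z: the set of entries with z_r = 1.
    Assumes the sentence has at least three words."""
    n = len(tag_seq)
    active = {((tag_seq[i - 3], tag_seq[i - 2]), tag_seq[i - 1], i)
              for i in range(3, n + 1)}
    # Over-complete part: unigram assignments implied by the sequence.
    active |= {(i, tag_seq[i - 1]) for i in range(1, n + 1)}
    return active

def score(active, theta):
    """Inner product z . theta, with theta given as a sparse dict."""
    return sum(theta.get(r, 0.0) for r in active)

if __name__ == "__main__":
    z = encode(["D", "N", "V", "N"])
    print(sorted(z, key=str))
```

A sparse set is used because a valid structure has only a handful of non-zero components, while the index set itself grows with the cube of the tagset size.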

4 Two Examples

This section describes the dual decomposition approach for two inference problems in NLP.

4.1 Integrated Parsing and Trigram Tagging

We now describe the dual decomposition approach for integrated parsing and trigram tagging. First, define the set $\mathcal{Q}$ as follows:

\[ \mathcal{Q} = \{ (y, z) : y \in \mathcal{Y},\ z \in \mathcal{Z},\ y(i, t) = z(i, t) \text{ for all } (i, t) \in \mathcal{I}_{\text{uni}} \} \tag{1} \]

Hence $\mathcal{Q}$ is the set of all $(y, z)$ pairs that agree on their part-of-speech assignments. The integrated parsing and trigram tagging problem is then to solve

\[ \max_{(y, z) \in \mathcal{Q}} y \cdot \theta^{\text{cfg}} + z \cdot \theta^{\text{tag}} \tag{2} \]

This problem is equivalent to

\[ \max_{y \in \mathcal{Y}} y \cdot \theta^{\text{cfg}} + g(y) \cdot \theta^{\text{tag}} \]

where $g : \mathcal{Y} \to \mathcal{Z}$ is a function that maps a parse tree $y$ to its set of trigrams $z = g(y)$. The benefit of the formulation in Eq. 2 is that it makes explicit the idea of maximizing over all pairs $(y, z)$ under a set of agreement constraints $y(i, t) = z(i, t)$; this concept will be central to the algorithms in this paper.

With this in mind, we note that we have efficient methods for the inference problems of tagging and parsing alone, and that our combined objective almost separates into these two independent problems. In fact, if we drop the $y(i, t) = z(i, t)$ constraints from the optimization problem, the problem splits into two parts, each of which can be efficiently solved using dynamic programming:

\[ (y^*, z^*) = \Big( \arg\max_{y \in \mathcal{Y}} y \cdot \theta^{\text{cfg}},\ \arg\max_{z \in \mathcal{Z}} z \cdot \theta^{\text{tag}} \Big) \]

Dual decomposition exploits this idea; it results in the algorithm given in figure 1.

    Set $u^{(1)}(i, t) \leftarrow 0$ for all $(i, t) \in \mathcal{I}_{\text{uni}}$
    for $k = 1$ to $K$ do
        $y^{(k)} \leftarrow \arg\max_{y \in \mathcal{Y}} \big( y \cdot \theta^{\text{cfg}} - \sum_{(i,t) \in \mathcal{I}_{\text{uni}}} u^{(k)}(i, t)\, y(i, t) \big)$
        $z^{(k)} \leftarrow \arg\max_{z \in \mathcal{Z}} \big( z \cdot \theta^{\text{tag}} + \sum_{(i,t) \in \mathcal{I}_{\text{uni}}} u^{(k)}(i, t)\, z(i, t) \big)$
        if $y^{(k)}(i, t) = z^{(k)}(i, t)$ for all $(i, t) \in \mathcal{I}_{\text{uni}}$ then
            return $(y^{(k)}, z^{(k)})$
        for all $(i, t) \in \mathcal{I}_{\text{uni}}$,
            $u^{(k+1)}(i, t) \leftarrow u^{(k)}(i, t) + \alpha_k \big( y^{(k)}(i, t) - z^{(k)}(i, t) \big)$
    return $(y^{(K)}, z^{(K)})$

Figure 1: The algorithm for integrated parsing and tagging. The parameters $\alpha_k > 0$ for $k = 1 \ldots K$ specify step sizes for each iteration, and are discussed further in the Appendix. The two arg max problems can be solved using dynamic programming.

The algorithm optimizes the combined objective by repeatedly solving the two sub-problems separately; that is, it directly solves the harder optimization problem using an existing CFG parser and trigram tagger. After each iteration the algorithm adjusts the weights $u(i, t)$; these updates modify the objective functions for the two models, encouraging them to agree on the same POS sequence. In section 6.1 we will show that the variables $u(i, t)$ are Lagrange multipliers enforcing agreement constraints, and that the algorithm corresponds to a (sub)gradient method for optimization of a dual function. The algorithm is easy to implement: all that is required is a decoding algorithm for each of the two models, and simple additive updates to the Lagrange multipliers enforcing agreement between the two models.
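The loop in figure 1 can be sketched in a few lines of Python. This is a toy illustration under explicit assumptions, not the authors' code: the two oracles are brute-force searches over short candidate lists (standing in for a CFG parser and a trigram tagger), each candidate is reduced to a pair `(score, tag_sequence)`, and the step size $\alpha_k = 1/k$ is an arbitrary choice made for this example.

```python
def dual_decompose(parses, taggings, K=50, step=lambda k: 1.0 / k):
    """parses / taggings: lists of (score, tag_sequence) candidates that
    stand in for the sets Y and Z. Returns (y, z, iterations, agreed)."""
    u = {}  # Lagrange multipliers u(i, t); missing entries are 0

    def penalty(tags):
        # sum over (i, t) of u(i, t) * indicator[word i has tag t]
        return sum(u.get((i, t), 0.0) for i, t in enumerate(tags, start=1))

    for k in range(1, K + 1):
        # The two arg max steps (brute force here, dynamic programs in practice).
        y = max(parses, key=lambda c: c[0] - penalty(c[1]))
        z = max(taggings, key=lambda c: c[0] + penalty(c[1]))
        if y[1] == z[1]:           # y(i, t) = z(i, t) for all (i, t)
            return y, z, k, True
        a = step(k)
        # u(i, t) <- u(i, t) + a * (y(i, t) - z(i, t)); only disagreeing
        # positions have a non-zero difference.
        for i, (ty, tz) in enumerate(zip(y[1], z[1]), start=1):
            if ty != tz:
                u[(i, ty)] = u.get((i, ty), 0.0) + a
                u[(i, tz)] = u.get((i, tz), 0.0) - a
    return y, z, K, False

if __name__ == "__main__":
    # Hypothetical two-word example: each model prefers a different tagging.
    parses = [(3.0, ("N", "V")), (2.5, ("D", "N"))]
    taggings = [(2.0, ("D", "N")), (0.5, ("N", "V"))]
    print(dual_decompose(parses, taggings))
```

In this toy instance the two models initially disagree, the multipliers oscillate for a few iterations, and the loop then settles on the tagging ("D", "N"), whose combined score of 4.5 is the best among agreeing pairs.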

4.2 Integrating Two Lexicalized Parsers

Our second example problem is the integration of a phrase-structure parser with a higher-order dependency parser. The goal is to add higher-order features to phrase-structure parsing without greatly increasing the complexity of inference.

First, we define an index set for second-order unlabeled projective dependency parsing. The second-order parser considers first-order dependencies, as well as grandparent and sibling second-order dependencies (e.g., see Carreras (2007)). We assume that $\mathcal{I}_{\text{dep}}$ is an index set containing all such dependencies (for brevity we omit the details of this index set). For convenience we define an extended index set that makes explicit use of first-order dependencies, $\mathcal{I}'_{\text{dep}} = \mathcal{I}_{\text{dep}} \cup \mathcal{I}_{\text{first}}$, where

\[ \mathcal{I}_{\text{first}} = \{ (i, j) : i \in \{0 \ldots n\},\ j \in \{1 \ldots n\},\ i \ne j \} \]

Here $(i, j)$ represents a dependency with head $w_i$ and modifier $w_j$ ($i = 0$ corresponds to the root symbol in the parse). We use $\mathcal{D} \subseteq \{0, 1\}^{|\mathcal{I}'_{\text{dep}}|}$ to denote the set of valid projective dependency parses.

The second model we use is a lexicalized CFG. Each symbol in the grammar takes the form $A(h)$ where $A \in \mathcal{N}$ is a non-terminal, and $h \in \{1 \ldots n\}$ is an index specifying that $w_h$ is the head of the constituent. Rule productions take the form $\langle A(a) \to B(b)\,C(c), i, k, j \rangle$ where $b \in \{i \ldots k\}$, $c \in \{(k+1) \ldots j\}$, and $a$ is equal to $b$ or $c$, depending on whether $A$ receives its head-word from its left or right child. Each such rule implies a dependency $(a, b)$ if $a = c$, or $(a, c)$ if $a = b$. We take $\mathcal{I}_{\text{head}}$ to be the index set of all such rules, and $\mathcal{I}'_{\text{head}} = \mathcal{I}_{\text{head}} \cup \mathcal{I}_{\text{first}}$ to be the extended index set. We define $\mathcal{H} \subseteq \{0, 1\}^{|\mathcal{I}'_{\text{head}}|}$ to be the set of valid parse trees.

The integrated parsing problem is then to find

\[ (y^*, d^*) = \arg\max_{(y, d) \in \mathcal{R}} y \cdot \theta^{\text{head}} + d \cdot \theta^{\text{dep}} \tag{3} \]

where $\mathcal{R} = \{ (y, d) : y \in \mathcal{H},\ d \in \mathcal{D},\ y(i, j) = d(i, j) \text{ for all } (i, j) \in \mathcal{I}_{\text{first}} \}$.

This problem has a very similar structure to the problem of integrated parsing and tagging, and we can derive a similar dual decomposition algorithm. The Lagrange multipliers $u$ are a vector in $\mathbb{R}^{|\mathcal{I}_{\text{first}}|}$ enforcing agreement between dependency assignments. The algorithm (omitted for brevity) is identical to the algorithm in figure 1, but with $\mathcal{I}_{\text{uni}}$, $\mathcal{Y}$, $\mathcal{Z}$, $\theta^{\text{cfg}}$, and $\theta^{\text{tag}}$ replaced with $\mathcal{I}_{\text{first}}$, $\mathcal{H}$, $\mathcal{D}$, $\theta^{\text{head}}$, and $\theta^{\text{dep}}$ respectively. The algorithm only requires decoding algorithms for the two models, together with simple updates to the Lagrange multipliers.

5 Marginal Polytopes and LP Relaxations

We now give formal guarantees for the algorithms in the previous section, showing that they solve LP relaxations of the problems in Eqs. 2 and 3.

To make the connection to linear programming, we first introduce the idea of marginal polytopes in section 5.1. In section 5.2, we give a precise statement of the LP relaxations that are being solved by the example algorithms, making direct use of marginal polytopes. In section 6 we will prove that the example algorithms solve these LP relaxations.

5.1 Marginal Polytopes

For a finite set $\mathcal{Y}$, define the set of all distributions over elements in $\mathcal{Y}$ as $\Delta = \{ \alpha \in \mathbb{R}^{|\mathcal{Y}|} : \alpha_y \ge 0,\ \sum_{y \in \mathcal{Y}} \alpha_y = 1 \}$. Each $\alpha \in \Delta$ gives a vector of marginals, $\mu = \sum_{y \in \mathcal{Y}} \alpha_y y$, where $\mu_r$ can be interpreted as the probability that $y_r = 1$ for a $y$ selected at random from the distribution $\alpha$.

The set of all possible marginal vectors, known as the marginal polytope, is defined as follows:

\[ \mathcal{M} = \{ \mu \in \mathbb{R}^m : \exists \alpha \in \Delta \text{ such that } \mu = \sum_{y \in \mathcal{Y}} \alpha_y y \} \]

$\mathcal{M}$ is also frequently referred to as the convex hull of $\mathcal{Y}$, written as $\text{conv}(\mathcal{Y})$. We use the notation $\text{conv}(\mathcal{Y})$ in the remainder of this paper, instead of $\mathcal{M}$.

For an arbitrary set $\mathcal{Y}$, the marginal polytope $\text{conv}(\mathcal{Y})$ can be complex to describe.[6] However, Martin et al. (1990) show that for a very general class of dynamic programming problems, the corresponding marginal polytope can be expressed as

\[ \text{conv}(\mathcal{Y}) = \{ \mu \in \mathbb{R}^m : A\mu = b,\ \mu \ge 0 \} \tag{4} \]

where $A$ is a $p \times m$ matrix, $b$ is a vector in $\mathbb{R}^p$, and the value $p$ is linear in the size of a hypergraph representation of the dynamic program. Note that $A$ and $b$ specify a set of $p$ linear constraints.

We now give an explicit description of the resulting constraints for CFG parsing:[7] similar constraints arise for other dynamic programming algorithms for parsing, for example the algorithms of Eisner (2000). The exact form of the constraints, and the fact that they are polynomial in number, is not essential for the formal results in this paper. However, a description of the constraints gives valuable intuition for the structure of the marginal polytope.

[6] For any finite set $\mathcal{Y}$, $\text{conv}(\mathcal{Y})$ can be expressed as $\{ \mu \in \mathbb{R}^m : A\mu \le b \}$ where $A$ is a matrix of dimension $p \times m$, and $b \in \mathbb{R}^p$ (see, e.g., Korte and Vygen (2008), pg. 65). The value for $p$ depends on the set $\mathcal{Y}$, and can be exponential in size.
[7] Taskar et al. (2004) describe the same set of constraints, but without proof of correctness or reference to Martin et al. (1990).

The constraints are given in figure 2. To develop some intuition, consider the case where the variables $\mu_r$ are restricted to be binary: hence each binary vector $\mu$ specifies a parse tree. The second constraint in Eq. 5 specifies that exactly one rule must be used at the top of the tree. The set of constraints in Eq. 6 specify that for each production of the form

$\langle X \to Y\,Z, i, k, j \rangle$ in a parse tree, there must be exactly one production higher in the tree that generates $(X, i, j)$ as one of its children. The constraints in Eq. 7 enforce consistency between the $\mu(i, Y)$ variables and rule variables higher in the tree. Note that the constraints in Eqs. (5)-(7) can be written in the form $A\mu = b$, $\mu \ge 0$, as in Eq. 4.

\[ \forall r \in \mathcal{I}',\ \mu_r \ge 0\,; \qquad \sum_{\substack{X, Y, Z \in \mathcal{N} \\ k = 1 \ldots (n-1)}} \mu(X \to Y\,Z, 1, k, n) = 1 \tag{5} \]

$\forall X \in \mathcal{N}$, $\forall (i, j)$ such that $1 \le i < j \le n$ and $(i, j) \ne (1, n)$:

\[ \sum_{\substack{Y, Z \in \mathcal{N} \\ k = i \ldots (j-1)}} \mu(X \to Y\,Z, i, k, j) = \sum_{\substack{Y, Z \in \mathcal{N} \\ k = 1 \ldots (i-1)}} \mu(Y \to Z\,X, k, i-1, j) + \sum_{\substack{Y, Z \in \mathcal{N} \\ k = (j+1) \ldots n}} \mu(Y \to X\,Z, i, j, k) \tag{6} \]

$\forall Y \in \mathcal{T}$, $\forall i \in \{1 \ldots n\}$:

\[ \mu(i, Y) = \sum_{\substack{X, Z \in \mathcal{N} \\ k = (i+1) \ldots n}} \mu(X \to Y\,Z, i, i, k) + \sum_{\substack{X, Z \in \mathcal{N} \\ k = 1 \ldots (i-1)}} \mu(X \to Z\,Y, k, i-1, i) \tag{7} \]

Figure 2: The linear constraints defining the marginal polytope for CFG parsing.

Under these definitions, we have the following:

Theorem 5.1 Define $\mathcal{Y}$ to be the set of all CFG parses, as defined in section 4. Then

\[ \text{conv}(\mathcal{Y}) = \{ \mu \in \mathbb{R}^m : \mu \text{ satisfies Eqs. (5)-(7)} \} \]

Proof: This theorem is a special case of Martin et al. (1990), theorem 2.

The marginal polytope for tagging, $\text{conv}(\mathcal{Z})$, can also be expressed using linear constraints as in Eq. 4; see figure 3. These constraints follow from results for graphical models (Wainwright and Jordan, 2008), or from the Martin et al. (1990) construction.

\[ \forall r \in \mathcal{I}'_{\text{tag}},\ \nu_r \ge 0\,; \qquad \sum_{X, Y, Z \in \mathcal{T}} \nu((X, Y) \to Z, 3) = 1 \]

\[ \forall X \in \mathcal{T},\ \forall i \in \{3 \ldots n-2\}: \quad \sum_{Y, Z \in \mathcal{T}} \nu((Y, Z) \to X, i) = \sum_{Y, Z \in \mathcal{T}} \nu((X, Y) \to Z, i+2) \]

\[ \forall X \in \mathcal{T},\ \forall i \in \{3 \ldots n-1\}: \quad \sum_{Y, Z \in \mathcal{T}} \nu((Y, Z) \to X, i) = \sum_{Y, Z \in \mathcal{T}} \nu((Y, X) \to Z, i+1) \]

\[ \forall X \in \mathcal{T},\ \forall i \in \{3 \ldots n\}: \quad \nu(i, X) = \sum_{Y, Z \in \mathcal{T}} \nu((Y, Z) \to X, i) \]

\[ \forall X \in \mathcal{T}: \quad \nu(1, X) = \sum_{Y, Z \in \mathcal{T}} \nu((X, Y) \to Z, 3) \]

\[ \forall X \in \mathcal{T}: \quad \nu(2, X) = \sum_{Y, Z \in \mathcal{T}} \nu((Y, X) \to Z, 3) \]

Figure 3: The linear constraints defining the marginal polytope for trigram POS tagging.

As a final point, the following theorem gives an important property of marginal polytopes, which we will use at several points in this paper:

Theorem 5.2 (Korte and Vygen (2008), page 66.) For any set $\mathcal{Y} \subseteq \{0, 1\}^k$, and for any vector $\theta \in \mathbb{R}^k$,

\[ \max_{y \in \mathcal{Y}} y \cdot \theta = \max_{\mu \in \text{conv}(\mathcal{Y})} \mu \cdot \theta \tag{8} \]

The theorem states that for a linear objective function, maximization over a discrete set $\mathcal{Y}$ can be replaced by maximization over the convex hull $\text{conv}(\mathcal{Y})$. The problem $\max_{\mu \in \text{conv}(\mathcal{Y})} \mu \cdot \theta$ is a linear programming problem.

For parsing, this theorem implies that:

1. Weighted CFG parsing can be framed as a linear programming problem, of the form $\max_{\mu \in \text{conv}(\mathcal{Y})} \mu \cdot \theta$, where $\text{conv}(\mathcal{Y})$ is specified by a polynomial number of linear constraints.
2. Conversely, dynamic programming algorithms such as the CKY algorithm can be considered to be oracles that efficiently solve LPs of the form $\max_{\mu \in \text{conv}(\mathcal{Y})} \mu \cdot \theta$.

Similar results apply for the POS tagging case.
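The content of Theorem 5.2 can be checked numerically on a small instance. The sketch below is only an illustration: the set `Y`, the weight vector, and the helper `random_marginal` are made up for this example. It samples marginal vectors $\mu = \sum_y \alpha_y y$ from random distributions $\alpha$ and confirms that none of them scores higher than the best discrete element, which is what one expects since $\mu \cdot \theta$ is a weighted average of the values $y \cdot \theta$.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_marginal(Y, rng):
    """mu = sum_y alpha_y * y for a random distribution alpha over Y."""
    w = [rng.random() for _ in Y]
    total = sum(w)
    alpha = [x / total for x in w]
    return [sum(a * y[r] for a, y in zip(alpha, Y)) for r in range(len(Y[0]))]

# Hypothetical discrete set of binary vectors and a random weight vector.
Y = [(1, 0, 0), (0, 1, 1), (1, 1, 0)]
rng = random.Random(0)
theta = [rng.uniform(-1, 1) for _ in range(3)]

best_discrete = max(dot(y, theta) for y in Y)
best_sampled = max(dot(random_marginal(Y, rng), theta) for _ in range(1000))

if __name__ == "__main__":
    print(best_discrete, best_sampled)
```

Sampling only probes the interior of the polytope, so this is a sanity check of the inequality direction rather than a proof; the maximum itself is attained at a vertex, that is, at an element of `Y`.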

5.2 Linear Programming Relaxations

We now describe the LP relaxations that are solved by the example algorithms in section 4. We begin with the algorithm in Figure 1.

The original optimization problem was to find $\max_{(y, z) \in \mathcal{Q}} y \cdot \theta^{\text{cfg}} + z \cdot \theta^{\text{tag}}$ (see Eq. 2). By theorem 5.2, this is equivalent to solving

\[ \max_{(\mu, \nu) \in \text{conv}(\mathcal{Q})} \mu \cdot \theta^{\text{cfg}} + \nu \cdot \theta^{\text{tag}} \tag{9} \]

To formulate our approximation, we first define:

\[ \mathcal{Q}' = \{ (\mu, \nu) : \mu \in \text{conv}(\mathcal{Y}),\ \nu \in \text{conv}(\mathcal{Z}),\ \mu(i, t) = \nu(i, t) \text{ for all } (i, t) \in \mathcal{I}_{\text{uni}} \} \]

The definition of $\mathcal{Q}'$ is very similar to the definition of $\mathcal{Q}$ (see Eq. 1), the only difference being that $\mathcal{Y}$ and $\mathcal{Z}$ are replaced by $\text{conv}(\mathcal{Y})$ and $\text{conv}(\mathcal{Z})$ respectively. Hence any point in $\mathcal{Q}$ is also in $\mathcal{Q}'$. It follows that any point in $\text{conv}(\mathcal{Q})$ is also in $\mathcal{Q}'$, because $\mathcal{Q}'$ is a convex set defined by linear constraints.

The LP relaxation then corresponds to the following optimization problem:

\[ \max_{(\mu, \nu) \in \mathcal{Q}'} \mu \cdot \theta^{\text{cfg}} + \nu \cdot \theta^{\text{tag}} \tag{10} \]

$\mathcal{Q}'$ is defined by linear constraints, making this a linear program. Since $\mathcal{Q}'$ is an outer bound on $\text{conv}(\mathcal{Q})$, i.e. $\text{conv}(\mathcal{Q}) \subseteq \mathcal{Q}'$, we obtain the guarantee that the value of Eq. 10 always upper bounds the value of Eq. 9.

In Appendix A we give an example showing that in general $\mathcal{Q}'$ includes points that are not in $\text{conv}(\mathcal{Q})$. These points exist because the agreement between the two parts is now enforced in expectation ($\mu(i, t) = \nu(i, t)$ for $(i, t) \in \mathcal{I}_{\text{uni}}$) rather than based on actual assignments. This agreement constraint is weaker since different distributions over assignments can still result in the same first-order expectations. Thus, the solution to Eq. 10 may be in $\mathcal{Q}'$ but not in $\text{conv}(\mathcal{Q})$. It can be shown that all such solutions will be fractional, making them easy to distinguish from $\mathcal{Q}$. In many applications of LP relaxations (including the examples discussed in this paper) the relaxation in Eq. 10 turns out to be tight, in that the solution is often integral (i.e., it is in $\mathcal{Q}$). In these cases, solving the LP relaxation exactly solves the original problem of interest.

In the next section we prove that the algorithm in Figure 1 solves the problem in Eq. 10. A similar result holds for the algorithm in section 4.2: it solves a relaxation of Eq. 3, where $\mathcal{R}$ is replaced by

\[ \mathcal{R}' = \{ (\mu, \nu) : \mu \in \text{conv}(\mathcal{H}),\ \nu \in \text{conv}(\mathcal{D}),\ \mu(i, j) = \nu(i, j) \text{ for all } (i, j) \in \mathcal{I}_{\text{first}} \} \]
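The difference between agreement on actual assignments and agreement in expectation can be seen in a deliberately tiny abstract example. It is not the example from Appendix A, and the sets below do not correspond to real parses or taggings: each structure is reduced to two shared binary components, an assumption made purely for illustration.

```python
from fractions import Fraction

# Hypothetical toy sets; each structure is represented only by its two
# shared (agreement) components.
Y_shared = [(0, 0), (1, 1), (1, 0)]
Z_shared = [(0, 1), (1, 0)]

# Q: pairs that agree on every shared component.
Q = [(y, z) for y in Y_shared for z in Z_shared if y == z]

# A fractional point: mix (0,0) and (1,1) on one side, (0,1) and (1,0) on the other.
half = Fraction(1, 2)
mu = tuple(half * a + half * b for a, b in zip(Y_shared[0], Y_shared[1]))
nu = tuple(half * a + half * b for a, b in zip(Z_shared[0], Z_shared[1]))

if __name__ == "__main__":
    print(Q)       # the only agreeing pair shares the components (1, 0)
    print(mu, nu)  # equal marginals, yet not a mixture of agreeing pairs
```

Here `mu` and `nu` agree component by component, so the pair satisfies the relaxed constraints, but the only integral agreeing pair has shared components (1, 0), so the point (1/2, 1/2) cannot be written as a mixture of agreeing pairs. The point is fractional, in line with the remark that such solutions are easy to distinguish from integral ones.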

6 Convergence Guarantees

6.1 Lagrangian Relaxation

We now show that the example algorithms solve their respective LP relaxations given in the previous section. We do this by first introducing a general class of linear programs, together with an optimization method, Lagrangian relaxation, for solving these LPs. We then show that the algorithms in section 4 are special cases of the general algorithm.

The linear programs we consider take the form

\[ \max_{x_1 \in \mathcal{X}_1,\, x_2 \in \mathcal{X}_2} (\theta_1 \cdot x_1 + \theta_2 \cdot x_2) \quad \text{such that} \quad E x_1 = F x_2 \]

The matrices $E \in \mathbb{R}^{q \times m}$ and $F \in \mathbb{R}^{q \times l}$ specify $q$ linear "agreement" constraints between $x_1 \in \mathbb{R}^m$ and $x_2 \in \mathbb{R}^l$. The sets $\mathcal{X}_1, \mathcal{X}_2$ are also specified by linear constraints, $\mathcal{X}_1 = \{ x_1 \in \mathbb{R}^m : A x_1 = b,\ x_1 \ge 0 \}$ and $\mathcal{X}_2 = \{ x_2 \in \mathbb{R}^l : C x_2 = d,\ x_2 \ge 0 \}$, hence the problem is an LP.

Note that if we set $\mathcal{X}_1 = \text{conv}(\mathcal{Y})$, $\mathcal{X}_2 = \text{conv}(\mathcal{Z})$, and define $E$ and $F$ to specify the agreement constraints $\mu(i, t) = \nu(i, t)$, then we have the LP relaxation in Eq. 10.

It is natural to apply Lagrangian relaxation in cases where the sub-problems $\max_{x_1 \in \mathcal{X}_1} \theta_1 \cdot x_1$ and $\max_{x_2 \in \mathcal{X}_2} \theta_2 \cdot x_2$ can be efficiently solved by combinatorial algorithms for any values of $\theta_1, \theta_2$, but where the constraints $E x_1 = F x_2$ "complicate" the problem. We introduce Lagrange multipliers $u \in \mathbb{R}^q$ that enforce the latter set of constraints, giving the Lagrangian:

\[ L(u, x_1, x_2) = \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + u \cdot (E x_1 - F x_2) \]

The dual objective function is

\[ L(u) = \max_{x_1 \in \mathcal{X}_1,\, x_2 \in \mathcal{X}_2} L(u, x_1, x_2) \]

and the dual problem is to find $\min_{u \in \mathbb{R}^q} L(u)$. Because $\mathcal{X}_1$ and $\mathcal{X}_2$ are defined by linear constraints, by strong duality we have

\[ \min_{u \in \mathbb{R}^q} L(u) = \max_{x_1 \in \mathcal{X}_1,\, x_2 \in \mathcal{X}_2 :\, E x_1 = F x_2} (\theta_1 \cdot x_1 + \theta_2 \cdot x_2) \]

Hence minimizing $L(u)$ will recover the maximum value of the original problem. This leaves open the question of how to recover the LP solution (i.e., the pair $(x_1^*, x_2^*)$ that achieves this maximum); we discuss this point in section 6.2.

The dual $L(u)$ is convex. However, $L(u)$ is not differentiable, so we cannot use gradient-based methods to optimize it. Instead, a standard approach is to use a subgradient method. Subgradients are tangent lines that lower bound a function even at points of non-differentiability: formally, a subgradient of a convex function $L : \mathbb{R}^n \to \mathbb{R}$ at a point $u$ is a vector $g_u$ such that for all $v$, $L(v) \ge L(u) + g_u \cdot (v - u)$.
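The dual objective and the subgradient inequality can be explored on a very small instance. The sketch below makes several assumptions that are not part of the paper: the sets `X1` and `X2` are given as explicit vertex lists (so each maximization is a brute-force search), `E` and `F` are identity matrices, and the weights are invented. It uses the standard fact that, for a maximum of linear functions, the gradient of a maximizing piece, here `E x1* - F x2*`, is a subgradient.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def dual_and_subgradient(u, theta1, theta2, X1, X2, E, F):
    """Return L(u) and the vector E x1* - F x2* for maximizers x1*, x2*."""
    x1 = max(X1, key=lambda x: dot(theta1, x) + dot(u, matvec(E, x)))
    x2 = max(X2, key=lambda x: dot(theta2, x) - dot(u, matvec(F, x)))
    g = [a - b for a, b in zip(matvec(E, x1), matvec(F, x2))]
    value = dot(theta1, x1) + dot(theta2, x2) + dot(u, g)
    return value, g

# Hypothetical toy problem: two vertices per side, identity agreement matrices.
X1 = [(1, 0), (0, 1)]
X2 = [(1, 0), (0, 1)]
E = F = [[1, 0], [0, 1]]
theta1, theta2 = [3.0, 2.5], [0.5, 2.0]

def L(u):
    return dual_and_subgradient(u, theta1, theta2, X1, X2, E, F)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        u = [rng.uniform(-2, 2) for _ in range(2)]
        v = [rng.uniform(-2, 2) for _ in range(2)]
        (Lu, g), (Lv, _) = L(u), L(v)
        # Subgradient inequality: L(v) >= L(u) + g . (v - u)
        print(Lv >= Lu + dot(g, [b - a for a, b in zip(u, v)]) - 1e-9)
```

In this instance the best agreeing pair has value 4.5, the dual value at `u = 0` is 5.0 (an upper bound), and the dual reaches 4.5 at `u = (-0.5, 0)`, consistent with strong duality.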

u(1) ← 0 for k = 1 to K do (k) x1 ← arg maxx1 ∈X1 (θ1 + (u(k) )T E) · x1 (k) x2 ← arg maxx2 ∈X2 (θ2 − (u(k) )T F ) · x2 (k) (k) if Ex1 = F x2 return u(k) (k) (k) u(k+1) ← u(k) − αk (Ex1 − F x2 ) (K) return u

6.2

The previous section described how the method in figure 4 can be used to minimize the dual L(u) of the original linear program. We now turn to the problem of recovering a primal solution (x∗1 , x∗2 ) of the LP. The method we propose considers two cases: (k)

Figure 4: The Lagrangian relaxation algorithm.

By standard results, the subgradient for L at a point u takes a simple form, gu = Ex∗1 − F x∗2 , where x∗1 = arg max (θ1 + (u(k) )T E) · x1 x1 ∈X1

x∗2 = arg max (θ2 − (u(k) )T F ) · x2 x2 ∈X2

The beauty of this result is that the values of x∗1 and x∗2 , and by implication the value of the subgradient, can be computed using oracles for the two arg max sub-problems. Subgradient algorithms perform updates that are similar to gradient descent: u(k+1) ← u(k) − αk g (k) where g (k) is the subgradient of L at u(k) and αk > 0 is the step size of the update. The complete subgradient algorithm is given in figure 4. The following convergence theorem is well-known (e.g., see page 120 of Korte and Vygen (2008)): P Theorem 6.1 If limk→∞ αk = 0 and ∞ k=1 αk = (k) ∞, then limk→∞ L(u ) = minu L(u). The following proposition is easily verified:

Proposition 6.1 The algorithm in figure 1 is an instantiation of the algorithm in figure 4,$^{8}$ with X1 = conv(Y), X2 = conv(Z), and the matrices E and F defined to be binary matrices specifying the constraints µ(i, t) = ν(i, t) for all (i, t) ∈ Iuni.

Under an appropriate definition of the step sizes $\alpha_k$, it follows that the algorithm in figure 1 defines a sequence of Lagrange multipliers $u^{(k)}$ minimizing a dual of the LP relaxation in Eq. 10. A similar result holds for the algorithm in section 4.2.

8 with the caveat that it returns $(x_1^{(k)}, x_2^{(k)})$ rather than $u^{(k)}$.

Recovering the LP Solution

(Case 1) If $E x_1^{(k)} = F x_2^{(k)}$ at any stage during the algorithm, then simply take $(x_1^{(k)}, x_2^{(k)})$ to be the primal solution. In this case the pair $(x_1^{(k)}, x_2^{(k)})$ exactly solves the original LP.$^{9}$ If this case arises in the algorithm in figure 1, then the resulting solution is binary (i.e., it is a member of Q), and the solution exactly solves the original inference problem.

(Case 2) If case 1 does not arise, then a couple of strategies are possible. (This situation could arise in cases where the LP is not tight, i.e., it has a fractional solution, or where K is not large enough for convergence.) The first is to define the primal solution to be the average of the solutions encountered during the algorithm: $\hat{x}_1 = \sum_k x_1^{(k)} / K$, $\hat{x}_2 = \sum_k x_2^{(k)} / K$. Results from Nedić and Ozdaglar (2009) show that as $K \to \infty$, these averaged solutions converge to the optimal primal solution.$^{10}$ A second strategy (as given in figure 1) is to simply take $(x_1^{(K)}, x_2^{(K)})$ as an approximation to the primal solution. This method is a heuristic, but previous work (e.g., Komodakis et al. (2007)) has shown that it is effective in practice; we use it in this paper.

In our experiments we found that in the vast majority of cases, case 1 applies, after a small number of iterations; see the next section for more details.

7

Experiments

7.1

Integrated Phrase-Structure and Dependency Parsing

Our first set of experiments considers the integration of Model 1 of Collins (2003) (a lexicalized phrase-structure parser, from here on referred to as Model 1),$^{11}$ and the 2nd order discriminative dependency parser of Koo et al. (2008). The inference problem for a sentence $x$ is to find

$$y^* = \arg\max_{y \in \mathcal{Y}} \left( f_1(y) + \gamma f_2(y) \right) \qquad (11)$$

where $\mathcal{Y}$ is the set of all lexicalized phrase-structure trees for the sentence $x$; $f_1(y)$ is the score (log probability) under Model 1; $f_2(y)$ is the score under Koo et al. (2008) for the dependency structure implied by $y$; and $\gamma > 0$ is a parameter dictating the relative weight of the two models.$^{12}$ This problem is similar to the second example in section 4; a very similar dual decomposition algorithm to that described in section 4.2 can be derived.

We used the Penn Wall Street Treebank (Marcus et al., 1994) for the experiments, with sections 2-21 for training, section 22 for development, and section 23 for testing. The parameter $\gamma$ was chosen to optimize performance on the development set. We ran the dual decomposition algorithm with a limit of K = 50 iterations. The dual decomposition algorithm returns an exact solution if case 1 occurs as defined in section 6.2; we found that of 2416 sentences in section 23, case 1 occurred for 2407 (99.6%) sentences. Table 1 gives statistics showing the number of iterations required for convergence. Over 80% of the examples converge in 5 iterations or fewer; over 90% converge in 10 iterations or fewer.

Itn.   1     2     3     4    5-10  11-20  20-50  **
Dep    43.5  20.1  10.2  4.9  14.0  5.7    1.4    0.4
POS    58.7  15.4  6.3   3.6  10.3  3.8    0.8    1.1

Table 1: Convergence results for Section 23 of the WSJ Treebank for the dependency parsing and POS experiments. Each column gives the percentage of sentences whose exact solutions were found in a given range of subgradient iterations. ** is the percentage of sentences that did not converge by the iteration limit (K=50).

We compare the accuracy of the dual decomposition approach to two baselines: first, Model 1; and second, a naive integration method that enforces the hard constraint that Model 1 must only consider dependencies seen in the first-best output from the dependency parser. Table 2 shows all three results. The dual decomposition method gives a significant gain in precision and recall over the naive combination method, and boosts the performance of Model 1 to a level that is close to some of the best single-pass parsers on the Penn treebank test set. Dependency accuracy is also improved over the Koo et al. (2008) model, in spite of the relatively low dependency accuracy of Model 1 alone.

                 Precision  Recall  F1    Dep
Model 1          88.4       87.8    88.1  91.4
Koo08 Baseline   89.9       89.6    89.7  93.3
DD Combination   91.0       90.4    90.7  93.8

Table 2: Performance results for Section 23 of the WSJ Treebank. Model 1: a reimplementation of the generative parser of (Collins, 2002). Koo08 Baseline: Model 1 with a hard restriction to dependencies predicted by the discriminative dependency parser of (Koo et al., 2008). DD Combination: a model that maximizes the joint score of the two parsers. Dep shows the unlabeled dependency accuracy of each system.

9 We have that $\theta_1 \cdot x_1^{(k)} + \theta_2 \cdot x_2^{(k)} = L(u^{(k)}, x_1^{(k)}, x_2^{(k)}) = L(u^{(k)})$, where the last equality is because $x_1^{(k)}$ and $x_2^{(k)}$ are defined by the respective arg max's. Thus, $(x_1^{(k)}, x_2^{(k)})$ and $u^{(k)}$ are primal and dual optimal.
10 The resulting fractional solution can be projected back to the set Q, see (Smith and Eisner, 2008; Martins et al., 2009).
11 We use a reimplementation that is a slight modification of Collins Model 1, with very similar performance, and which uses the TAG formalism of Carreras et al. (2008).
12 Note that the models $f_1$ and $f_2$ were trained separately, using the methods described by Collins (2003) and Koo et al. (2008) respectively.

[Figure 5 plot. y-axis: Percentage; x-axis: Maximum Number of Dual Decomposition Iterations; curves: f score, % certificates, % match K=50.]

Figure 5: Performance on the parsing task assuming a fixed number of iterations K. f-score: accuracy of the method. % certificates: percentage of examples for which a certificate of optimality is provided. % match: percentage of cases where the output from the method is identical to the output when using K = 50.

Figure 5 shows performance of the approach as a function of K, the maximum number of iterations of dual decomposition. For this experiment, for cases where the method has not converged for k ≤ K, the output from the algorithm is chosen to be the $y^{(k)}$ for k ≤ K that maximizes the objective function in Eq. 11. The graphs show that values of K less than 50 produce almost identical performance to K = 50, but with fewer cases giving certificates of optimality (with K = 10, the f-score of the method is 90.69%; with K = 5 it is 90.63%).
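The fallback rule used when the method has not converged within K iterations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_output`, the candidate labels, and the scores are hypothetical, and `f1` and `f2` stand in for the two model scores of Eq. 11.

```python
def select_output(iterates, f1, f2, gamma):
    """Return the iterate y that maximizes f1(y) + gamma * f2(y), as in Eq. 11."""
    return max(iterates, key=lambda y: f1(y) + gamma * f2(y))


# Hypothetical scores for three candidate parses seen during the iterations.
f1_scores = {"y_a": -10.0, "y_b": -9.5, "y_c": -11.0}
f2_scores = {"y_a": 4.0, "y_b": 2.0, "y_c": 5.5}
candidates = ["y_a", "y_b", "y_c"]

best_equal_weight = select_output(candidates, f1_scores.get, f2_scores.get, gamma=1.0)
best_small_gamma = select_output(candidates, f1_scores.get, f2_scores.get, gamma=0.1)
print(best_equal_weight, best_small_gamma)
```

The choice of output depends on the weight γ: with these made-up scores, a large γ favours the candidate preferred by the second model, while a small γ favours the candidate preferred by the first.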

                 Precision  Recall  F1    POS Acc
Fixed Tags       88.1       87.6    87.9  96.7
DD Combination   88.7       88.0    88.3  97.1

Table 3: Performance results for Section 23 of the WSJ. Model 1 (Fixed Tags): a baseline parser initialized to the best tag sequence from the tagger of Toutanova and Manning (2000). DD Combination: a model that maximizes the joint score of parse and tag selection.

7.2

Integrated Phrase-Structure Parsing and Trigram POS tagging

In a second experiment, we used dual decomposition to integrate the Model 1 parser with the Stanford max-ent trigram POS tagger (Toutanova and Manning, 2000), using a very similar algorithm to that described in section 4.1. We use the same training/dev/test split as in section 7.1. The two models were again trained separately. We ran the algorithm with a limit of K = 50 iterations. Out of 2416 test examples, the algorithm found an exact solution in 98.9% of the cases. Table 1 gives statistics showing the speed of convergence for different examples: over 94% of the examples converge to an exact solution in 10 iterations or fewer. In terms of accuracy, we compare to a baseline approach of using the first-best tag sequence as input to the parser. The dual decomposition approach gives 88.3 F1 measure in recovering parse-tree constituents, compared to 87.9 for the baseline.
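The convergence figures quoted in sections 7.1 and 7.2 can be checked against Table 1 with simple arithmetic. The following Python sketch copies the percentages from Table 1 and sums the relevant bins; the names `TABLE1`, `converged_within_10`, and `converged_by_limit` are introduced here for illustration only.

```python
# Percentages copied from Table 1; bins are 1, 2, 3, 4, 5-10, 11-20, 20-50.
# The final "**" column (not converged by K = 50) is left out.
TABLE1 = {
    "Dep": [43.5, 20.1, 10.2, 4.9, 14.0, 5.7, 1.4],
    "POS": [58.7, 15.4, 6.3, 3.6, 10.3, 3.8, 0.8],
}


def converged_within_10(row):
    """Sum the first five bins, which cover iterations 1 through 10."""
    return sum(row[:5])


def converged_by_limit(row):
    """Sum every bin, i.e. sentences that converged within K = 50."""
    return sum(row)


for name, row in TABLE1.items():
    print(name, round(converged_within_10(row), 1), round(converged_by_limit(row), 1))

# Share of the 2416 test sentences for which case 1 occurred in section 7.1.
dep_exact = 100.0 * 2407 / 2416
print(round(dep_exact, 1))
```

The sums agree with the text: more than 90% of sentences converge within 10 iterations for dependency parsing, and more than 94% for POS tagging. Because the table entries are rounded, the Dep bins sum to slightly more than the 99.6% reported in section 7.1.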

8

Conclusions

We have introduced dual-decomposition algorithms for inference in NLP, given formal properties of the algorithms in terms of LP relaxations, and demonstrated their effectiveness on problems that would traditionally be solved using intersections of dynamic programs (Bar-Hillel et al., 1964). Given the widespread use of dynamic programming in NLP, there should be many applications for the approach. There are several possible extensions of the method we have described. We have focused on cases where two models are being combined; the extension to more than two models is straightforward (e.g., see Komodakis et al. (2007)). This paper has considered approaches for MAP inference; for closely related methods that compute approximate marginals, see Wainwright et al. (2005b).

A

Fractional Solutions

We now give an example of a point $(\mu, \nu) \in Q' \setminus \mathrm{conv}(Q)$ that demonstrates that the relaxation $Q'$ is strictly larger than $\mathrm{conv}(Q)$. Fractional points such as this one can arise as solutions of the LP relaxation for worst case instances, preventing us from finding an exact solution. Recall that the constraints for $Q'$ specify that $\mu \in \mathrm{conv}(\mathcal{Y})$, $\nu \in \mathrm{conv}(\mathcal{Z})$, and $\mu(i, t) = \nu(i, t)$ for all $(i, t) \in I_{uni}$. Since $\mu \in \mathrm{conv}(\mathcal{Y})$, $\mu$ must be a convex combination of 1 or more members of $\mathcal{Y}$; a similar property holds for $\nu$.

The example is as follows. There are two possible parts of speech, A and B, and an additional non-terminal symbol X. The sentence is of length 3, w1 w2 w3. Let $\nu$ be the convex combination of the following two tag sequences, each with probability 0.5: w1/A w2/A w3/A and w1/A w2/B w3/B. Let $\mu$ be the convex combination of the following two parses, each with probability 0.5: (X (A w1) (X (A w2) (B w3))) and (X (A w1) (X (B w2) (A w3))). It can be verified that $\mu(i, t) = \nu(i, t)$ for all $(i, t)$, i.e., the marginals for single tags for $\mu$ and $\nu$ agree. Thus, $(\mu, \nu) \in Q'$.

To demonstrate that this fractional point is not in $\mathrm{conv}(Q)$, we give parameter values such that this fractional point is optimal and all integral points (i.e., actual parses) are suboptimal. For the tagging model, set θ(AA → A, 3) = θ(AB → B, 3) = 0, with all other parameters having a negative value. For the parsing model, set θ(X → A X, 1, 1, 3) = θ(X → A B, 2, 2, 3) = θ(X → B A, 2, 2, 3) = 0, with all other rule parameters being negative. For this objective, the fractional solution has value 0, while all integral points (i.e., all points in $Q$) have a negative value. By Theorem 5.2, the maximum of any linear objective over $\mathrm{conv}(Q)$ is equal to the maximum over $Q$. Thus, $(\mu, \nu) \notin \mathrm{conv}(Q)$.
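The claim that the single-tag marginals of µ and ν agree can be checked mechanically. The Python sketch below represents each parse only by the tag sequence it assigns to w1 w2 w3 (an assumption that suffices here, since the agreement constraints involve only the tags); the helper name `tag_marginals` is hypothetical.

```python
from collections import defaultdict


def tag_marginals(weighted_sequences):
    """Map (position, tag) to its total weight under a convex combination of tag sequences."""
    marginals = defaultdict(float)
    for weight, tags in weighted_sequences:
        for position, tag in enumerate(tags, start=1):
            marginals[(position, tag)] += weight
    return dict(marginals)


# nu: the two tag sequences A A A and A B B, each with probability 0.5.
nu_sequences = [(0.5, "AAA"), (0.5, "ABB")]
# mu: tags read off the two parses, i.e. A A B and A B A, each with probability 0.5.
mu_sequences = [(0.5, "AAB"), (0.5, "ABA")]

nu = tag_marginals(nu_sequences)
mu = tag_marginals(mu_sequences)

marginals_agree = (mu == nu)
# No tag sequence is shared by the parses in mu and the sequences in nu.
shared = {tags for _, tags in mu_sequences} & {tags for _, tags in nu_sequences}
print(marginals_agree, shared)
```

The marginals agree at every position even though no single tag sequence appears on both sides, which is why no integral point of Q reproduces this fractional point.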

B

Step Size

We used the following step size in our experiments. First, we initialized $\alpha_0$ to equal 0.5, a relatively large value. Then we defined $\alpha_k = \alpha_0 \cdot 2^{-\eta_k}$, where $\eta_k$ is the number of times that $L(u^{(k')}) > L(u^{(k'-1)})$ for $k' \le k$. This learning rate drops at a rate of $1/2^t$, where $t$ is the number of times that the dual increases from one iteration to the next. See Koo et al. (2010) for a similar, but less aggressive step size used to solve a different task.
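As an illustration of how this step-size rule interacts with the subgradient updates of section 6, the following Python sketch runs the updates on a toy problem. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: both sub-problems simply choose one index out of n (so each oracle is a plain arg max rather than a parser or tagger), E and F are taken to be identity matrices, and the names `argmax` and `toy_dual_decomposition` are hypothetical.

```python
def argmax(scores):
    """Index of the largest score (the first one on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def toy_dual_decomposition(theta1, theta2, alpha0=0.5, max_iter=50):
    """Return (agreed index or None, iterations used, last dual value)."""
    n = len(theta1)
    u = [0.0] * n            # Lagrange multipliers, one per agreement constraint
    eta = 0                  # number of times the dual value has increased so far
    prev_dual = None
    dual = None
    for k in range(1, max_iter + 1):
        # Oracle calls: each sub-problem maximizes its own score adjusted by u.
        s1 = [theta1[i] + u[i] for i in range(n)]
        s2 = [theta2[i] - u[i] for i in range(n)]
        i1, i2 = argmax(s1), argmax(s2)
        dual = s1[i1] + s2[i2]
        if i1 == i2:         # Case 1: the two solutions agree
            return i1, k, dual
        if prev_dual is not None and dual > prev_dual:
            eta += 1         # the dual increased, so halve the step size
        prev_dual = dual
        alpha = alpha0 * 2.0 ** (-eta)
        # Subgradient is e_{i1} - e_{i2}; update u <- u - alpha * g.
        u[i1] -= alpha
        u[i2] += alpha
    return None, max_iter, dual


result = toy_dual_decomposition([1.0, 0.0, 0.9], [0.0, 1.0, 0.9])
print(result)
```

On this toy input the two oracles disagree at the first iteration and agree at the second, on the index that maximizes the sum of the two score vectors.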

Acknowledgments

MIT gratefully acknowledges the support of Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the DARPA, AFRL, or the US government. Alexander Rush was supported under the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022. David Sontag was supported by a Google PhD Fellowship.

References

Y. Bar-Hillel, M. Perles, and E. Shamir. 1964. On formal properties of simple phrase structure grammars. In Language and Information: Selected Essays on their Theory and Application, pages 116–150.
X. Carreras, M. Collins, and T. Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In Proc. CoNLL, pages 9–16.
X. Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. CoNLL, pages 957–961.
M. Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP, page 8.
M. Collins. 2003. Head-driven statistical models for natural language parsing. In Computational linguistics, volume 29, pages 589–637.
G.B. Dantzig and P. Wolfe. 1960. Decomposition principle for linear programs. In Operations research, volume 8, pages 101–111.
J. Duchi, D. Tarlow, G. Elidan, and D. Koller. 2007. Using combinatorial optimization within max-product belief propagation. In NIPS, volume 19.
J. Eisner. 2000. Bilexical grammars and their cubic-time parsing algorithms. In Advances in Probabilistic and Other Parsing Technologies, pages 29–62.
A. Globerson and T. Jaakkola. 2007. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, volume 21.
N. Komodakis, N. Paragios, and G. Tziritas. 2007. MRF optimization via dual decomposition: Message-passing revisited. In International Conference on Computer Vision.
T. Koo, X. Carreras, and M. Collins. 2008. Simple semi-supervised dependency parsing. In Proc. ACL/HLT.
T. Koo, A.M. Rush, M. Collins, T. Jaakkola, and D. Sontag. 2010. Dual Decomposition for Parsing with Non-Projective Head Automata. In Proc. EMNLP, pages 63–70.
B.H. Korte and J. Vygen. 2008. Combinatorial optimization: theory and algorithms. Springer Verlag.
M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. In Computational linguistics, volume 19, pages 313–330.
R.K. Martin, R.L. Rardin, and B.A. Campbell. 1990. Polyhedral characterization of discrete dynamic programming. Operations research, 38(1):127–138.
A.F.T. Martins, N.A. Smith, and E.P. Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proc. ACL.
R. McDonald, F. Pereira, K. Ribarov, and J. Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proc. HLT/EMNLP, pages 523–530.
Angelia Nedić and Asuman Ozdaglar. 2009. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757–1780.
B. Pang and L. Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proc. ACL.
S. Riedel and J. Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proc. EMNLP, pages 129–137.
D. Roth and W. Yih. 2005. Integer linear programming inference for conditional random fields. In Proc. ICML, pages 737–744.
Hanif D. Sherali and Warren P. Adams. 1994. A hierarchy of relaxations and convex hull characterizations for mixed-integer zero–one programming problems. Discrete Applied Mathematics, 52(1):83–106.
D.A. Smith and J. Eisner. 2008. Dependency parsing by belief propagation. In Proc. EMNLP, pages 145–156.
D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. 2008. Tightening LP relaxations for MAP using message passing. In Proc. UAI.
B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. 2004. Max-margin parsing. In Proc. EMNLP, pages 1–8.
K. Toutanova and C.D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proc. EMNLP, pages 63–70.
M. Wainwright and M. I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA.
M. Wainwright, T. Jaakkola, and A. Willsky. 2005a. MAP estimation via agreement on trees: message-passing and linear programming. In IEEE Transactions on Information Theory, volume 51, pages 3697–3717.
M. Wainwright, T. Jaakkola, and A. Willsky. 2005b. A new class of upper bounds on the log partition function. In IEEE Transactions on Information Theory, volume 51, pages 2313–2335.
C. Yanover, T. Meltzer, and Y. Weiss. 2006. Linear Programming Relaxations and Belief Propagation–An Empirical Study. In The Journal of Machine Learning Research, volume 7, page 1907. MIT Press.