
Hilary 2015

A. Zisserman

SVM – review

• We have seen that, for an SVM, learning a linear classifier $f(x) = w^\top x + b$ is formulated as solving an optimization problem over $w$:

$$\min_{w \in \mathbb{R}^d} \; \|w\|^2 + C \sum_{i}^{N} \max\left(0,\, 1 - y_i f(x_i)\right)$$

• This quadratic optimization problem is known as the primal problem.

• Instead, the SVM can be formulated to learn a linear classifier

$$f(x) = \sum_{i}^{N} \alpha_i y_i \left(x_i^\top x\right) + b$$

by solving an optimization problem over $\alpha_i$.

• This is known as the dual problem, and we will look at the advantages of this formulation.

Sketch derivation of dual form

The Representer Theorem states that the solution $w$ can always be written as a linear combination of the training data:

$$w = \sum_{j=1}^{N} \alpha_j y_j x_j$$

Proof: see example sheet.

Now, substitute for $w$ in $f(x) = w^\top x + b$:

$$f(x) = \left(\sum_{j=1}^{N} \alpha_j y_j x_j\right)^{\top} x + b = \sum_{j=1}^{N} \alpha_j y_j \left(x_j^\top x\right) + b$$

and for $w$ in the cost function $\min_w \|w\|^2$ subject to $y_i\left(w^\top x_i + b\right) \ge 1, \ \forall i$:

$$\|w\|^2 = \left\{\sum_j \alpha_j y_j x_j\right\}^{\top} \left\{\sum_k \alpha_k y_k x_k\right\} = \sum_{jk} \alpha_j \alpha_k y_j y_k \left(x_j^\top x_k\right)$$

Hence, an equivalent optimization problem is over $\alpha_j$:

$$\min_{\alpha_j} \sum_{jk} \alpha_j \alpha_k y_j y_k \left(x_j^\top x_k\right) \quad \text{subject to} \quad y_i\left(\sum_{j=1}^{N} \alpha_j y_j \left(x_j^\top x_i\right) + b\right) \ge 1, \ \forall i$$

and a few more steps are required to complete the derivation.
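The two substitutions in this sketch can be checked numerically. The snippet below is a minimal illustration, not part of the lecture: it draws random data and arbitrary (not optimized) coefficients `alpha`, builds $w$ from the representer form, and compares $\|w\|^2$ and $f(x)$ computed from $w$ with the same quantities computed from scalar products only. All variable names and the random data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 3
X = rng.normal(size=(N, d))             # rows are training points x_j
y = rng.choice([-1.0, 1.0], size=N)     # labels y_j
alpha = rng.uniform(0.0, 1.0, size=N)   # arbitrary coefficients, not a trained SVM
b = 0.5

# w written as a linear combination of the training data
ay = alpha * y
w = ay @ X

# ||w||^2 computed directly and from the pairwise scalar products x_j^T x_k
G = X @ X.T
norm_primal = w @ w
norm_dual = ay @ G @ ay

# the classifier evaluated in primal and dual form at a new point
x_new = rng.normal(size=d)
f_primal = w @ x_new + b
f_dual = ay @ (X @ x_new) + b
```

Both pairs agree to floating-point precision, which is all the substitution claims; finding the optimal `alpha` still requires solving the dual problem.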

Primal and dual formulations

$N$ is the number of training points, and $d$ is the dimension of the feature vector $x$.

Primal problem: for $w \in \mathbb{R}^d$

$$\min_{w \in \mathbb{R}^d} \; \|w\|^2 + C \sum_{i}^{N} \max\left(0,\, 1 - y_i f(x_i)\right)$$

Dual problem: for $\alpha \in \mathbb{R}^N$ (stated without proof):

$$\max_{\alpha_i \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \left(x_j^\top x_k\right) \quad \text{subject to} \quad 0 \le \alpha_i \le C \ \ \forall i, \ \text{and} \ \sum_i \alpha_i y_i = 0$$

• Need to learn $d$ parameters for the primal, and $N$ for the dual
• If $N \ll d$ then it is more efficient to solve for $\alpha$ than $w$
• The dual form only involves $(x_j^\top x_k)$. We will return to why this is an advantage when we look at kernels.

Primal and dual formulations

Primal version of classifier:

$$f(x) = w^\top x + b$$

Dual version of classifier:

$$f(x) = \sum_{i}^{N} \alpha_i y_i \left(x_i^\top x\right) + b$$

At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points $x_i$. However, many of the $\alpha_i$'s are zero. The ones that are non-zero define the support vectors $x_i$.

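Support Vector Machine

[Figure: linear SVM with decision boundary $w^\top x + b = 0$, normal vector $w$, offset label $b / \|w\|$, and two marked support vectors.]

$$f(x) = \sum_i \alpha_i y_i \left(x_i^\top x\right) + b$$

where the sum runs over the support vectors.

[Figure: soft margin, $C = 10$.]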

Handling data that is not linearly separable

• introduce slack variables

$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \; \|w\|^2 + C \sum_{i}^{N} \xi_i$$

subject to

$$y_i\left(w^\top x_i + b\right) \ge 1 - \xi_i \quad \text{for } i = 1 \ldots N$$

• linear classifier not appropriate ??

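Solution 1: use polar coordinates

[Figure: the data shown in the original $(x_1, x_2)$ plane and in the $(r, \theta)$ plane, with the regions $< 0$ and $> 0$ marked.]

• Data is linearly separable in polar coordinates
• Acts non-linearly in original space

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \to \begin{pmatrix} r \\ \theta \end{pmatrix} \qquad \mathbb{R}^2 \to \mathbb{R}^2$$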

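Solution 2: map data to higher dimension

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \to \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} \qquad \mathbb{R}^2 \to \mathbb{R}^3$$

• Data is linearly separable in 3D
• This means that the problem can still be solved by a linear classifier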
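As a small illustration of why the lifted data becomes linearly separable, the sketch below labels random 2D points by whether they fall inside a circle, a rule no line in the plane can reproduce. After the map $\Phi$, the same rule is the plane $z_1 + z_2 = 0.5$ in $\mathbb{R}^3$. The data, the circle radius, and the hand-chosen `w` and `b` are assumptions for the example, not a trained classifier.

```python
import numpy as np

def phi(X):
    """Map each row (x1, x2) to (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))

# class -1 inside the circle x1^2 + x2^2 = 0.5, class +1 outside
y = np.where(np.sum(X**2, axis=1) < 0.5, -1.0, 1.0)

# a linear classifier in the mapped space: f = w^T Phi(x) + b
w = np.array([1.0, 1.0, 0.0])   # hand-chosen, picks out x1^2 + x2^2
b = -0.5
f = phi(X) @ w + b
pred = np.where(f < 0, -1.0, 1.0)
```

Here `pred` reproduces `y` exactly, even though the decision rule is a circle in the original coordinates.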

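SVM classifiers in a transformed feature space

[Figure: the feature map $\Phi$ takes the data from $\mathbb{R}^d$ to $\mathbb{R}^D$; the surface $f(x) = 0$ is marked.]

$$\Phi : x \to \Phi(x) \qquad \mathbb{R}^d \to \mathbb{R}^D$$

Learn classifier linear in $w$ for $\mathbb{R}^D$:

$$f(x) = w^\top \Phi(x) + b$$

$\Phi(x)$ is a feature map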

Primal Classifier in transformed feature space

Classifier, with $w \in \mathbb{R}^D$:

$$f(x) = w^\top \Phi(x) + b$$

Learning, for $w \in \mathbb{R}^D$:

$$\min_{w \in \mathbb{R}^D} \; \|w\|^2 + C \sum_{i}^{N} \max\left(0,\, 1 - y_i f(x_i)\right)$$

• Simply map $x$ to $\Phi(x)$ where data is separable
• Solve for $w$ in high dimensional space $\mathbb{R}^D$
• If $D \gg d$ then there are many more parameters to learn for $w$. Can this be avoided?

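Dual Classifier in transformed feature space

Classifier:

$$f(x) = \sum_{i}^{N} \alpha_i y_i\, x_i^\top x + b \quad \to \quad f(x) = \sum_{i}^{N} \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b$$

Learning:

$$\max_{\alpha_i \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, x_j^\top x_k \quad \to \quad \max_{\alpha_i \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, \Phi(x_j)^\top \Phi(x_k)$$

subject to

$$0 \le \alpha_i \le C \ \ \forall i, \ \text{and} \ \sum_i \alpha_i y_i = 0$$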

Dual Classifier in transformed feature space

• Note that $\Phi(x)$ only occurs in pairs $\Phi(x_j)^\top \Phi(x_i)$
• Once the scalar products are computed, only the $N$ dimensional vector $\alpha$ needs to be learnt; it is not necessary to learn in the $D$ dimensional space, as it is for the primal
• Write $k(x_j, x_i) = \Phi(x_j)^\top \Phi(x_i)$. This is known as a Kernel

Classifier:

$$f(x) = \sum_{i}^{N} \alpha_i y_i\, k(x_i, x) + b$$

Learning:

$$\max_{\alpha_i \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, k(x_j, x_k)$$

subject to

$$0 \le \alpha_i \le C \ \ \forall i, \ \text{and} \ \sum_i \alpha_i y_i = 0$$

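Special transformations

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \to \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} \qquad \mathbb{R}^2 \to \mathbb{R}^3$$

$$\begin{aligned}
\Phi(x)^\top \Phi(z) &= \left(x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\right) \begin{pmatrix} z_1^2 \\ z_2^2 \\ \sqrt{2}\, z_1 z_2 \end{pmatrix} \\
&= x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 \\
&= \left(x_1 z_1 + x_2 z_2\right)^2 \\
&= \left(x^\top z\right)^2
\end{aligned}$$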

Kernel Trick

• Classifier can be learnt and applied without explicitly computing $\Phi(x)$
• All that is required is the kernel $k(x, z) = (x^\top z)^2$
• Complexity of learning depends on $N$ (typically it is $O(N^3)$) not on $D$
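The identity behind the kernel trick for this particular map is easy to confirm numerically. The sketch below compares the scalar product of the explicit 3D features with the kernel value computed from the 2D inputs alone. The function names `phi` and `k` and the two test vectors are chosen for the example.

```python
import numpy as np

def phi(x):
    """Explicit feature map from R^2 to R^3 used on the slide."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def k(x, z):
    """Kernel k(x, z) = (x^T z)^2, evaluated without forming phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(phi(x) @ phi(z))   # scalar product taken in R^3
implicit = k(x, z)                  # same quantity from a scalar product in R^2
```

The two values agree up to rounding. The saving is negligible here, but for maps into a much higher dimensional space the kernel evaluation stays as cheap as a scalar product in the original space.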

Example kernels

• Linear kernels $k(x, x') = x^\top x'$
• Polynomial kernels $k(x, x') = \left(1 + x^\top x'\right)^d$ for any $d > 0$
  - Contains all polynomial terms up to degree $d$
• Gaussian kernels $k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$ for $\sigma > 0$
  - Infinite dimensional feature space
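The three kernels listed above translate directly into code. The following is a minimal sketch with function names and default parameter values chosen for illustration; `gram_matrix` builds the $N \times N$ matrix of pairwise kernel values that the dual problem works with.

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = x^T z."""
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=3):
    """k(x, z) = (1 + x^T z)^degree."""
    return (1.0 + float(np.dot(x, z))) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma**2)))

def gram_matrix(X, kernel):
    """N x N matrix of kernel values between all pairs of rows of X."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))              # synthetic points for the example
K_gauss = gram_matrix(X, gaussian_kernel)
```

For the Gaussian kernel the Gram matrix is symmetric with ones on the diagonal, since $k(x, x) = \exp(0) = 1$. The double loop is written for clarity rather than speed.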

SVM classifier with Gaussian kernel

$N$ = size of training data

$$f(x) = \sum_{i}^{N} \alpha_i y_i\, k(x_i, x) + b$$

where $\alpha_i$ is the weight (may be zero) and $x_i$ is a support vector.

Gaussian kernel $k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$

Radial Basis Function (RBF) SVM

$$f(x) = \sum_{i}^{N} \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$
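To make the roles of the weights and support vectors concrete, here is a minimal sketch that evaluates the RBF decision function for hand-picked values of `alpha`, `b` and `sigma`. These are assumptions for illustration, not the output of training. The third training point has zero weight, so it can be dropped without changing $f(x)$.

```python
import numpy as np

def rbf_svm_decision(x, X_train, y_train, alpha, b, sigma):
    """f(x) = sum_i alpha_i y_i exp(-||x - x_i||^2 / (2 sigma^2)) + b."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    return float(np.sum(alpha * y_train * np.exp(-sq_dist / (2.0 * sigma**2))) + b)

X_train = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]])
y_train = np.array([1.0, -1.0, 1.0])
alpha = np.array([1.0, 1.0, 0.0])   # hand-picked; the third weight is zero
b, sigma = 0.0, 0.5

f_pos = rbf_svm_decision(np.array([0.0, 0.0]), X_train, y_train, alpha, b, sigma)
f_neg = rbf_svm_decision(np.array([2.0, 0.0]), X_train, y_train, alpha, b, sigma)
f_mid = rbf_svm_decision(np.array([1.0, 0.0]), X_train, y_train, alpha, b, sigma)

# only the support vectors (non-zero alpha) are needed at test time
sv = alpha > 0
f_mid_sv = rbf_svm_decision(np.array([1.0, 0.0]), X_train[sv], y_train[sv], alpha[sv], b, sigma)
```

The function is positive near the positive support vector, negative near the negative one, and zero at the midpoint between them.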

RBF Kernel SVM Example

[Figure: training data plotted as feature y against feature x.]

• data is not linearly separable in original feature space

$$f(x) = \sum_{i}^{N} \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$

[Figure: decision function for σ = 1.0, C = ∞, with the contours f(x) = 1, f(x) = 0 and f(x) = −1 marked.]

[Figure: σ = 1.0, C = 100.]

Decrease C, gives wider (soft) margin

[Figure: σ = 1.0, C = 10.]

[Figure: σ = 1.0, C = ∞.]

[Figure: σ = 0.25, C = ∞.]

Decrease sigma, moves towards nearest neighbour classifier

[Figure: σ = 0.1, C = ∞.]

Kernel Trick - Summary

• Classifiers can be learnt for high dimensional feature spaces, without actually having to map the points into the high dimensional space
• Data may be linearly separable in the high dimensional space, but not linearly separable in the original feature space
• Kernels can be used for an SVM because of the scalar product in the dual form, but can also be used elsewhere; they are not tied to the SVM formalism
• Kernels apply also to objects that are not vectors, e.g. $k(h, h') = \sum_k \min(h_k, h'_k)$ for histograms with bins $h_k$, $h'_k$
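The histogram kernel in the last bullet can be written in a couple of lines. This is a minimal sketch with an assumed function name and two made-up three-bin histograms.

```python
import numpy as np

def histogram_intersection(h, h_prime):
    """k(h, h') = sum_k min(h_k, h'_k) for two histograms over the same bins."""
    h = np.asarray(h, dtype=float)
    h_prime = np.asarray(h_prime, dtype=float)
    return float(np.minimum(h, h_prime).sum())

h1 = [0.5, 0.25, 0.25]
h2 = [0.25, 0.5, 0.25]
k12 = histogram_intersection(h1, h2)   # 0.25 + 0.25 + 0.25
k11 = histogram_intersection(h1, h1)   # a normalized histogram against itself
```

For normalized histograms the value lies between 0 (no overlap) and 1 (identical histograms).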

Regression

• Suppose we are given a training set of $N$ observations $((x_1, y_1), \ldots, (x_N, y_N))$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
• The regression problem is to estimate $f(x)$ from this data such that $y_i = f(x_i)$

Learning by optimization

• As in the case of classification, learning a regressor can be formulated as an optimization:

Minimize with respect to $f \in F$

$$\sum_{i=1}^{N} l\left(f(x_i), y_i\right) + \lambda R(f)$$

where the first term is the loss function and the second is the regularization.

• There is a choice of both loss functions and regularization
• e.g. squared loss, SVM "hinge-like" loss
• squared regularizer, lasso regularizer

Choice of regression function – non-linear basis functions

• The regression function $f(x, w)$ is a non-linear function of $x$, but linear in $w$:

$$f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)$$

• For example, for $x \in \mathbb{R}$, polynomial regression with $\phi_j(x) = x^j$:

$$f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = \sum_{j=0}^{M} w_j x^j$$

e.g. for $M = 3$,

$$f(x, w) = (w_0, w_1, w_2, w_3) \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix} = w^\top \Phi(x) \qquad \Phi : x \to \Phi(x) \qquad \mathbb{R}^1 \to \mathbb{R}^4$$
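In code, the polynomial feature map is a Vandermonde matrix with one row per input. The sketch below uses NumPy's `np.vander` with `increasing=True` so that the columns are ordered $1, x, x^2, \ldots, x^M$. The inputs and the weight vector are made-up values for illustration.

```python
import numpy as np

def polynomial_features(x, M):
    """Rows are Phi(x_i) = (1, x_i, x_i^2, ..., x_i^M)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, M + 1, increasing=True)

x = np.array([0.0, 0.5, 2.0])
Phi = polynomial_features(x, M=3)       # shape (3, 4)
w = np.array([1.0, -2.0, 0.0, 1.0])     # hypothetical weights (w0, w1, w2, w3)
f = Phi @ w                             # f(x_i, w) = w^T Phi(x_i) for every x_i
```

With these weights $f(x) = 1 - 2x + x^3$, so the model is a cubic in $x$ while remaining linear in $w$.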

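Least squares "ridge regression"

• Cost function – squared loss:

$$\tilde{E}(w) = \frac{1}{2} \sum_{i=1}^{N} \left\{f(x_i, w) - y_i\right\}^2 + \frac{\lambda}{2} \|w\|^2$$

where $y_i$ is the target value for input $x_i$, the first term is the loss function and the second is the regularization.

• Regression function for $x$ (1D):

$$f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)$$

• NB squared loss arises in Maximum Likelihood estimation for an error model

$$y_i = \tilde{y}_i + n_i, \qquad n_i \sim N(0, \sigma^2)$$

where $y_i$ is the measured value and $\tilde{y}_i$ is the true value.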

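Solving for the weights w

Notation: write the target and regressed values as $N$-vectors

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad f = \begin{pmatrix} \Phi(x_1)^\top w \\ \Phi(x_2)^\top w \\ \vdots \\ \Phi(x_N)^\top w \end{pmatrix} = \Phi w = \begin{bmatrix} 1 & \phi_1(x_1) & \ldots & \phi_M(x_1) \\ 1 & \phi_1(x_2) & \ldots & \phi_M(x_2) \\ \vdots & & & \vdots \\ 1 & \phi_1(x_N) & \ldots & \phi_M(x_N) \end{bmatrix} \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_M \end{pmatrix}$$

$\Phi$ is an $N \times M$ design matrix (written out with the constant column as above, it has $M + 1$ columns).

e.g. for polynomial regression with basis functions up to $x^2$

$$\Phi w = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_N & x_N^2 \end{bmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \end{pmatrix}$$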

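$$\begin{aligned}
\tilde{E}(w) &= \frac{1}{2} \sum_{i=1}^{N} \left\{f(x_i, w) - y_i\right\}^2 + \frac{\lambda}{2} \|w\|^2 \\
&= \frac{1}{2} \sum_{i=1}^{N} \left(y_i - w^\top \Phi(x_i)\right)^2 + \frac{\lambda}{2} \|w\|^2 \\
&= \frac{1}{2} \|y - \Phi w\|^2 + \frac{\lambda}{2} \|w\|^2
\end{aligned}$$

Now, compute where the derivative w.r.t. $w$ is zero for the minimum:

$$\frac{d\tilde{E}(w)}{dw} = -\Phi^\top (y - \Phi w) + \lambda w = 0$$

Hence

$$\left(\Phi^\top \Phi + \lambda I\right) w = \Phi^\top y$$

$$w = \left(\Phi^\top \Phi + \lambda I\right)^{-1} \Phi^\top y$$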

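$M$ basis functions, $N$ data points

$$\underbrace{w}_{M \times 1} = \underbrace{\left(\Phi^\top \Phi + \lambda I\right)^{-1}}_{M \times M} \; \underbrace{\Phi^\top}_{M \times N} \; \underbrace{y}_{N \times 1} \qquad \text{assume } N > M$$

• This shows that there is a unique solution.
• If $\lambda = 0$ (no regularization), then

$$w = \left(\Phi^\top \Phi\right)^{-1} \Phi^\top y = \Phi^{+} y$$

where $\Phi^{+}$ is the pseudo-inverse of $\Phi$ (pinv in Matlab)
• Adding the term $\lambda I$ improves the conditioning of the inverse, since if $\Phi$ is not full rank, then $\left(\Phi^\top \Phi + \lambda I\right)$ will be (for sufficiently large $\lambda$)
• As $\lambda \to \infty$, $w \to \frac{1}{\lambda} \Phi^\top y \to 0$
• Often the regularization is applied only to the inhomogeneous part of $w$, i.e. to $\tilde{w}$, where $w = (w_0, \tilde{w})$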
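The closed-form solution and the limiting cases in the bullets above can be reproduced with a few lines of NumPy. This is a minimal sketch on synthetic data (a noisy sine curve and a cubic polynomial basis, both assumptions for the example). It solves the linear system rather than forming the inverse explicitly, which is a common numerical choice and not something the slides prescribe.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve (Phi^T Phi + lam I) w = Phi^T y for w."""
    M = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(M)
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 9)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # synthetic targets
Phi = np.vander(x, 4, increasing=True)                        # columns 1, x, x^2, x^3

lam = 1e-3
w_ridge = ridge_fit(Phi, y, lam)
# the derivative of the cost vanishes at the solution
grad = -Phi.T @ (y - Phi @ w_ridge) + lam * w_ridge

# lam = 0 reduces to the pseudo-inverse solution
w_ls = ridge_fit(Phi, y, 0.0)
w_pinv = np.linalg.pinv(Phi) @ y

# for large lam, w approaches (1 / lam) Phi^T y, which itself tends to 0
lam_big = 1e6
w_big = ridge_fit(Phi, y, lam_big)
```

Setting `lam = 0.0` only works here because this `Phi` has full column rank; with a rank-deficient design matrix `np.linalg.solve` would fail or be unreliable, which is the conditioning point made in the bullets.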

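$$w = \left(\Phi^\top \Phi + \lambda I\right)^{-1} \Phi^\top y$$

$$f(x, w) = w^\top \Phi(x) = \Phi(x)^\top w = \Phi(x)^\top \left(\Phi^\top \Phi + \lambda I\right)^{-1} \Phi^\top y = b(x)^\top y$$

Output is a linear blend, $b(x)$, of the training values $\{y_i\}$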
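The blend weights $b(x)$ can be computed explicitly, which makes it visible that the prediction is a weighted sum of the training targets. The sketch below uses synthetic data and a cubic polynomial basis (assumptions for the example) and checks that $b(x)^\top y$ matches $w^\top \Phi(x)$ at one query point.

```python
import numpy as np

rng = np.random.default_rng(4)
x_train = np.linspace(0.0, 1.0, 9)
y_train = np.cos(3.0 * x_train) + 0.05 * rng.normal(size=9)   # synthetic targets
lam = 1e-2

Phi = np.vander(x_train, 4, increasing=True)                  # N x 4 design matrix
A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])                  # symmetric

def features(x):
    """Phi(x) for a single scalar input."""
    return np.vander(np.array([x]), 4, increasing=True)[0]

def blend_weights(x):
    """b(x) = Phi (Phi^T Phi + lam I)^{-1} Phi(x): one weight per training value."""
    return Phi @ np.linalg.solve(A, features(x))

w = np.linalg.solve(A, Phi.T @ y_train)
x_query = 0.37
b_x = blend_weights(x_query)
f_from_w = w @ features(x_query)
f_from_blend = b_x @ y_train
```

Note that `b_x` depends only on the training inputs, the basis and `lam`, not on the targets, so the same weights could be reused for a different target vector.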

Example 1: polynomial basis functions

[Figure: sample points and the ideal fit, y against x for x between 0 and 1.]

• The red curve is the true function (which is not a polynomial)
• The data points are samples from the curve with added noise in y.
• There is a choice in both the degree, $M$, of the basis functions used, and in the strength of the regularization

$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top \Phi(x) \qquad \Phi : x \to \Phi(x) \qquad \mathbb{R} \to \mathbb{R}^{M+1}$$

$w$ is an $M+1$ dimensional vector

N = 9 samples, M = 7

[Figure: four panels showing the sample points, the ideal fit and the regularized fit for lambda = 100, lambda = 0.001, lambda = 1e-010 and lambda = 1e-015; y against x.]

M = 3 and M = 5, least-squares fit

[Figure: two panels (M = 3 and M = 5) showing the sample points, the ideal fit and the least-squares solution; y against x.]

[Figure: two panels titled "Polynomial basis functions"; y against x.]

Example 2: Gaussian basis functions

[Figure: sample points and the ideal fit, y against x for x between 0 and 1.]

• The red curve is the true function (which is not a polynomial)
• The data points are samples from the curve with added noise in y.
• Basis functions are centred on the training data ($N$ points)
• There is a choice in both the scale, sigma, of the basis functions used, and in the strength of the regularization

$$f(x, w) = \sum_{i=1}^{N} w_i\, e^{-(x - x_i)^2 / \sigma^2} = w^\top \Phi(x) \qquad \Phi : x \to \Phi(x) \qquad \mathbb{R} \to \mathbb{R}^N$$

$w$ is an $N$-vector
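A minimal sketch of this model follows, with one Gaussian basis function centred on each training input and weights fitted by the ridge solution. The noisy sine data and the value of `lam` are assumptions for the example; `sigma = 0.334` and $N = 9$ follow the slides. The exponent uses $\sigma^2$ without a factor of 2, matching the formula as written here.

```python
import numpy as np

def gaussian_design_matrix(x, centres, sigma):
    """Entry (i, j) is exp(-(x_i - c_j)^2 / sigma^2)."""
    diff = np.asarray(x)[:, None] - np.asarray(centres)[None, :]
    return np.exp(-diff**2 / sigma**2)

rng = np.random.default_rng(5)
x_train = np.linspace(0.0, 1.0, 9)                                    # N = 9 samples
y_train = np.sin(2.0 * np.pi * x_train) + 0.1 * rng.normal(size=9)    # synthetic targets
sigma, lam = 0.334, 1e-3

Phi = gaussian_design_matrix(x_train, x_train, sigma)                 # N x N
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(9), Phi.T @ y_train)   # N-vector

# evaluate the fitted function on a dense grid
x_plot = np.linspace(0.0, 1.0, 50)
f_plot = gaussian_design_matrix(x_plot, x_train, sigma) @ w
```

Because the centres are the training points themselves, `Phi` is square and symmetric, and the number of weights grows with the number of samples.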

N = 9 samples, sigma = 0.334

[Figure: four panels showing the sample points, the ideal fit and the regularized fit for lambda = 100, lambda = 0.001, lambda = 1e-010 and lambda = 1e-015; y against x.]

Choosing lambda using a validation set

[Figure: left panel, error norm against log lambda, with curves for the ideal fit, validation and training, and the minimum error marked; right panel, sample points, ideal fit and validation set fit, y against x.]
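The selection procedure behind this figure can be sketched as a simple sweep: fit the weights on the training set for each candidate lambda, measure the error norm on held-out validation points, and keep the lambda with the smallest validation error. Everything below is an assumption made for illustration: the sine stand-in for the true curve, the noise level, the validation points and the lambda grid.

```python
import numpy as np

def design(x, centres, sigma):
    """Gaussian basis functions centred on the given points."""
    diff = x[:, None] - centres[None, :]
    return np.exp(-diff**2 / sigma**2)

def ridge_fit(Phi, y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def true_f(x):
    return np.sin(2.0 * np.pi * x)                 # stand-in for the ideal curve

rng = np.random.default_rng(6)
x_train = np.linspace(0.0, 1.0, 9)
x_val = rng.uniform(0.0, 1.0, size=20)
y_train = true_f(x_train) + 0.1 * rng.normal(size=9)
y_val = true_f(x_val) + 0.1 * rng.normal(size=20)

sigma = 0.334
lambdas = 10.0 ** np.arange(-10, 3)                # 1e-10, ..., 1e2
Phi_train = design(x_train, x_train, sigma)
Phi_val = design(x_val, x_train, sigma)

train_err, val_err = [], []
for lam in lambdas:
    w = ridge_fit(Phi_train, y_train, lam)
    train_err.append(np.linalg.norm(Phi_train @ w - y_train))
    val_err.append(np.linalg.norm(Phi_val @ w - y_val))

best_lambda = lambdas[int(np.argmin(val_err))]
```

The training error alone would favour the smallest lambda, which is why the choice is made on data not used for fitting.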

Sigma = 0.1 and Sigma = 0.334

[Figure: two panels (Sigma = 0.1 and Sigma = 0.334) showing the sample points, the ideal fit and the validation set fit; y against x.]

[Figure: two panels titled "Gaussian basis functions"; y against x.]

Application: regressing face pose

• Estimate two face pose angles:
  • yaw (around the Y axis)
  • pitch (around the X axis)
• Compute a HOG feature vector for each face region
• Learn a regressor from the HOG vector to the two pose angles

Summary and dual problem

So far we have considered the primal problem where

$$f(x, w) = \sum_{i=1}^{M} w_i \phi_i(x) = w^\top \Phi(x)$$

and we wanted a solution for $w \in \mathbb{R}^M$.

As in the case of SVMs, we can also consider the dual problem where

$$w = \sum_{i=1}^{N} a_i \Phi(x_i) \qquad \text{and} \qquad f(x, a) = \sum_{i}^{N} a_i\, \Phi(x_i)^\top \Phi(x)$$

and obtain a solution for $a \in \mathbb{R}^N$. Again

• there is a closed form solution for $a$,
• the solution involves the $N \times N$ Gram matrix $k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)$,
• so we can use the kernel trick again to replace scalar products
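The slides do not state the closed form for $a$. A standard result for the ridge cost used earlier, assumed here rather than taken from the lecture, is $a = (K + \lambda I)^{-1} y$ with $K = \Phi \Phi^\top$. The sketch below checks on random synthetic features that this dual solution gives the same $w$ and the same prediction as the primal one.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 8, 3
Phi = rng.normal(size=(N, M))      # rows are Phi(x_i); synthetic features
y = rng.normal(size=N)
lam = 0.1

# primal: w = (Phi^T Phi + lam I_M)^{-1} Phi^T y
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# dual (assumed standard form): a = (K + lam I_N)^{-1} y with Gram matrix K
K = Phi @ Phi.T                    # K[i, j] = Phi(x_i)^T Phi(x_j)
a = np.linalg.solve(K + lam * np.eye(N), y)
w_dual = Phi.T @ a                 # w = sum_i a_i Phi(x_i)

# prediction at a new point needs only the scalar products Phi(x_i)^T Phi(x)
phi_new = rng.normal(size=M)
f_primal = w_primal @ phi_new
f_dual = a @ (Phi @ phi_new)
```

In the dual form `Phi` enters only through `K` and through the products `Phi @ phi_new`, so both could be replaced by kernel evaluations without ever forming the features.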

Background reading and more

• Bishop, chapters 6 & 7 for kernels and SVMs
• Hastie et al, chapter 12
• Bishop, chapter 3 for regression
• More on web page: http://www.robots.ox.ac.uk/~az/lectures/ml