Next excerpt: Primary optimization problem for SVM, linearly nonseparable classes

Home

Primary optimization problem for SVM, linearly separable classes

Parameters $\boldsymbol{w}, b$ of the SVM hyperplane can be found as a solution to the following optimization problem: $$\frac{1}{2}||\boldsymbol{w}||^2 \rightarrow \min_{\boldsymbol{w}, b} \label{primary_problem:eq:obj_func1}$$ $$\mathrm{subject\;to} \nonumber$$ $$\langle\boldsymbol{w}, \boldsymbol{x}\rangle + b \geq 1,\; \forall \boldsymbol{x} \in C_1 \label{primary_problem:eq:constraint11}$$ $$\langle\boldsymbol{w}, \boldsymbol{x}\rangle + b \leq -1,\; \forall \boldsymbol{x} \in C_2, \label{primary_problem:eq:constraint12}$$ where $C_1$ and $C_2$ are two classes of training examples.

In coordinate form, this problem is written as $$\frac{1}{2}\sum_{k=1}^{n}w_k^2 \rightarrow \min_{\boldsymbol{w}, b} \label{primary_problem:eq:obj_func2}$$ $$\mathrm{s.t.} \nonumber$$ $$\sum_{k=1}^{n}w_k x_k + b \geq 1,\; \forall \boldsymbol{x} \in C_1 \label{primary_problem:eq:constraint21}$$ $$\sum_{k=1}^{n}w_k x_k + b \leq -1,\; \forall \boldsymbol{x} \in C_2. \label{primary_problem:eq:constraint22}$$ Note that parameter $b$ is one of the optimization variables, although it is not present in the objective function (\ref{primary_problem:eq:obj_func2}). Two sets of constraints (\ref{primary_problem:eq:constraint21}), (\ref{primary_problem:eq:constraint22}) can be written in a unified way as $$y_i\left( \sum_{k=1}^{n}w_k x_{ik} + b \right) \geq 1,\;\;i = 1, 2, ... , l. \label{primary_problem:eq:constraint}$$ where $y_i = 1$ if $\boldsymbol{x}_i \in C_1$, and $y_i = -1$ if $\boldsymbol{x}_i \in C_2$, $l$ is the total number of training vectors $\boldsymbol{x}_i$, and by $x_{ik}$ we denote the $k$-th component of vector $\boldsymbol{x}_i$.

Optimization problem (\ref{primary_problem:eq:obj_func2})-(\ref{primary_problem:eq:constraint22}) has quadratic objective function (\ref{primary_problem:eq:obj_func2}) and linear constraints (\ref{primary_problem:eq:constraint21}), (\ref{primary_problem:eq:constraint22}). Such problems are called quadratic programming problems. Their properties are well known and there are quite efficient algorithms for solving these problems.

Note that objective function (\ref{primary_problem:eq:obj_func2}) is strictly convex (since the matrix of its second-order derivatives -- the Hessian -- is positive definite), and the feasible region defined by linear inequalities (\ref{primary_problem:eq:constraint21})-(\ref{primary_problem:eq:constraint22}) is also convex. Therefore, this problem will have a unique solution (global minimum) $\boldsymbol{w}^*, b^*$ (note that the maximum margin hyperplane can be defined by infinite number of parameters $\boldsymbol{w}, b$; it is the solution of problem (\ref{primary_problem:eq:obj_func2})-(\ref{primary_problem:eq:constraint22}) which is unique). In case when two classes are not linearly separable, the feasible region defined by constraints (\ref{primary_problem:eq:constraint21})-(\ref{primary_problem:eq:constraint22}) will be empty, and the problem will have no solution. It will also have no solution when the training set contains only one class.

Why parameters of the maximum margin hyperplane can be found by solving problem (\ref{primary_problem:eq:obj_func1})-(\ref{primary_problem:eq:constraint12})? To answer this question, let us transform this problem into an equivalent one that has more apparent geometrical interpretation. First, minimizing $\frac{1}{2}||\boldsymbol{w}||^2$ is equivalent to minimizing $||\boldsymbol{w}||$, which in turn is equivalent to maximizing $1/||\boldsymbol{w}||$, so we can rewrite problem (\ref{primary_problem:eq:obj_func1})-(\ref{primary_problem:eq:constraint12}) as follows: $$\frac{1}{||\boldsymbol{w}||} \rightarrow \max_{\boldsymbol{w}, b} \nonumber$$ $$\mathrm{s.t.} \nonumber$$ $$\langle\boldsymbol{w}, \boldsymbol{x}\rangle + b \geq 1,\; \forall \boldsymbol{x} \in C_1 \label{primary_problem:eq:constraint31}$$ $$\langle\boldsymbol{w}, \boldsymbol{x}\rangle + b \leq -1,\; \forall \boldsymbol{x} \in C_2. \label{primary_problem:eq:constraint32}$$ Second, we can divide constraints (\ref{primary_problem:eq:constraint31}), (\ref{primary_problem:eq:constraint32}) by a positive number $||\boldsymbol{w}||$: $$\frac{1}{||\boldsymbol{w}||} \rightarrow \max_{\boldsymbol{w}, b} \nonumber$$ $$\mathrm{s.t.} \nonumber$$ $$\frac{\langle\boldsymbol{w}, \boldsymbol{x}\rangle + b}{||\boldsymbol{w}||} \geq \frac{1}{||\boldsymbol{w}||},\; \forall \boldsymbol{x} \in C_1 \nonumber$$ $$\frac{\langle\boldsymbol{w}, \boldsymbol{x}\rangle + b}{||\boldsymbol{w}||} \leq -\frac{1}{||\boldsymbol{w}||},\; \forall \boldsymbol{x} \in C_2. \nonumber$$ Recalling that $\frac{\langle\boldsymbol{w}, \boldsymbol{x}\rangle + b}{||w||}$ is the distance $\rho(\pi, \boldsymbol{x})$ between hyperplane $\pi(\boldsymbol{w}, b)$ and point $\boldsymbol{x}$, and introducing new variable $d = 1/||\boldsymbol{w}||$, we get the following problem which is equivalent to (\ref{primary_problem:eq:obj_func1})-(\ref{primary_problem:eq:constraint12}): $$d \rightarrow \max_{\boldsymbol{w}, b} \nonumber$$ $$\mathrm{s.t.} \nonumber$$ $$\rho(\pi, \boldsymbol{x}) \geq d,\; \forall \boldsymbol{x} \in C_1 \label{primary_problem:eq:constraint51}$$ $$\rho(\pi, \boldsymbol{x}) \leq -d,\; \forall \boldsymbol{x} \in C_2. \label{primary_problem:eq:constraint52}$$ That is, find parameters $\boldsymbol{w}, b$ that maximize margin $m = 2d$ between $\pi$, $C_1$ and $C_2$.

Geometrically, connection between parameters $\boldsymbol{w}$, $b$ and $d$ can be illustrated in the following way. Let us draw a spherical hull of radius $d = 1/||\boldsymbol{w}||$ around each training point $\boldsymbol{x}$. Consider some feasible hyperplane $\pi(\boldsymbol{w}, b)$. According to constraints (\ref{primary_problem:eq:constraint51}), (\ref{primary_problem:eq:constraint52}), this hyperplane must separate our points together with their hulls, Figure 1a. Now suppose we want to increase $d$ twofold. Since $d$ is a function of $||\boldsymbol{w}||$, we have to decrease $||\boldsymbol{w}||$ twofold. If we divide vector $\boldsymbol{w}$ by two, we move our hyperplane parallel to itself further from the origin. However, if we divide by two vector $\boldsymbol{w}$ and intercept $b$, we do not move the hyperplane. This way, downscaling $\boldsymbol{w}$ and $b$, we increase the radius of the hulls while keeping hyperplane $\pi$ in the same position and orientation, until at least one hull touches it, Figure 1b. If we have space to move $\pi$ parallel to itself away from the hull that touches it, we can do it by changing $b$ only, and let the hulls grow further. At some point, our hulls will reach maximum size $d$ achievable for hyperplanes with normal vectors collinear to $\boldsymbol{w}$, Figure 1c. If we have space for the hulls to grow further, we can change orientation of $\pi$ by changing components of vector $\boldsymbol{w}$, while keeping $||\boldsymbol{w}||$ equal to the current value of $1/d$ (see SVM Tutorial, end of subsection 1.2). Doing so and adjusting $b$, we keep $\pi$ feasible and increase $d$ until we arrive to the optimal configuration, Figure 1d.

Figure 1. Finding maximum margin hyperplane: geometrical insight.

Next excerpt: Primary optimization problem for SVM, linearly nonseparable classes

Home