Primal optimization problem for SVM: linearly non-separable classes
$\setCounter{11}$In real-life classification problems we rarely deal with linearly separable classes $C_1$, $C_2$. Most of the time our observations form classes that no hyperplane can separate without errors; we will call such classes overlapping. Note that we always assume $C_1 \cap C_2 = \varnothing$, and our definition of overlapping classes does not contradict this assumption. It does, however, imply that $\mathrm{conv}(C_1) \cap \mathrm{conv}(C_2) \neq \varnothing$, where $\mathrm{conv}(C_1)$ and $\mathrm{conv}(C_2)$ denote the convex hulls of $C_1$ and $C_2$.
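This hull-intersection condition can be checked computationally. The sketch below (an illustration, not part of the derivation) decides whether $\mathrm{conv}(C_1)$ and $\mathrm{conv}(C_2)$ share a point by solving a small feasibility linear program with SciPy; the function name `hulls_intersect` and the toy data are our own choices, not anything defined in the text.

```python
import numpy as np
from scipy.optimize import linprog

def hulls_intersect(X1, X2):
    """Return True iff conv(rows of X1) and conv(rows of X2) intersect.

    A common point exists iff there are convex weights lam, mu with
    X1^T lam = X2^T mu, sum(lam) = sum(mu) = 1, lam >= 0, mu >= 0.
    This is a pure feasibility LP (zero objective).
    """
    l1, n = X1.shape
    l2 = X2.shape[0]
    A_eq = np.zeros((n + 2, l1 + l2))
    A_eq[:n, :l1] = X1.T           # X1^T lam ...
    A_eq[:n, l1:] = -X2.T          # ... - X2^T mu = 0
    A_eq[n, :l1] = 1.0             # sum(lam) = 1
    A_eq[n + 1, l1:] = 1.0         # sum(mu) = 1
    b_eq = np.concatenate([np.zeros(n), [1.0, 1.0]])
    res = linprog(np.zeros(l1 + l2), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.status == 0         # status 0: feasible optimum found

# Toy example: a triangle, a far-away segment, and a segment poking into it.
C1 = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
C2_far = np.array([[3.0, 3.0], [4.0, 4.0]])     # hulls disjoint
C2_near = np.array([[0.5, 0.5], [3.0, 3.0]])    # hulls overlap
```

Disjoint point sets with disjoint hulls are exactly the linearly separable case; overlapping classes in the sense above make the LP feasible.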
$\newcommand{\wb}{\boldsymbol{w}}$
$\newcommand{\xib}{\boldsymbol{\xi}}$
For overlapping classes, problem (1)–(3) becomes infeasible (and the dual problem unbounded), since no $\wb$, $b$ satisfy all constraints (2), (3) simultaneously. We can relax these constraints by introducing error terms $\xi_i$, also called slack variables, in the following way:
$$ y_i\left( \sum_{k=1}^{n}w_k x_{ik} + b \right) \geq 1 - \xi_i,\;\;i = 1, 2, ... , l \label{eq:s3:primary_problem:constraint_xi1} $$
$$ \xi_i \geq 0, \;\;i = 1, 2, ... , l. \label{eq:s3:primary_problem:constraint_xi2} $$
If we consider minimizing objective function
$$ \frac{1}{2}\sum_{k=1}^{n}w_k^2 \rightarrow \min_{\wb, b, \xib} \label{eq:s3:primary_problem:obj_function1} $$
with constraints (\ref{eq:s3:primary_problem:constraint_xi1})–(\ref{eq:s3:primary_problem:constraint_xi2}), we will find that its minimum is trivially attained at $\wb = \boldsymbol{0}$: any $\wb$ can be made feasible by choosing the slack variables $\xi_i$ large enough. To make the problem meaningful, we should also seek to minimize the error terms, which is usually achieved by adding their sum to the objective function (\ref{eq:s3:primary_problem:obj_function1}):
$$ \frac{1}{2}\sum_{k=1}^{n}w_k^2 + C\sum_{i=1}^{l}\xi_i\rightarrow \min_{\wb, b, \xib} \label{eq:s3:primary_problem:obj_func3} $$
$$ \mathrm{s.t.} \nonumber $$
$$ y_i\left( \sum_{k=1}^{n}w_k x_{ik} + b \right) \geq 1 - \xi_i,\;\;i = 1, 2, ... , l \label{eq:s3:primary_problem:constraint1} $$
$$ \xi_i \geq 0, \;\;i = 1, 2, ... , l. \label{eq:s3:primary_problem:constraint2} $$
Here $C$ is a positive constant balancing two competing goals: maximizing the margin and minimizing the training error, measured by the sum of slack variables. We will consider this parameter in more detail later.
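The effect of $C$ can be seen directly in software. The sketch below (our own illustration, with made-up one-dimensional data) uses scikit-learn's `SVC`, which implements this soft-margin formulation, and compares the margin width $2/\|\wb^*\|$ for a small and a large value of $C$:

```python
import numpy as np
from sklearn.svm import SVC

# One-dimensional toy data with a deliberate overlap: the points at
# 1.6 and 1.4 carry "wrong" labels, so no threshold separates the
# classes without errors.
X = np.array([[0.0], [1.0], [1.6], [1.4], [2.0], [3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

margins = {}
for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margins[C] = 2.0 / np.linalg.norm(w)  # margin width 2/||w*||
```

A small $C$ makes slack cheap and yields a wide margin with several margin violations; a large $C$ penalizes slack heavily and approaches the hard-margin behaviour.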
Optimization problem (\ref{eq:s3:primary_problem:obj_func3})–(\ref{eq:s3:primary_problem:constraint2}) defines the so-called soft-margin SVM, as opposed to the hard-margin SVM considered in section 2. For a soft-margin SVM we still look for a separating hyperplane with the maximum margin, but we allow training vectors to lie inside the margin or to be misclassified, and we require the overall error, measured by the sum of slack variables, to be minimized. Note that when the two classes are linearly separable, problem (\ref{eq:s3:primary_problem:obj_func3})–(\ref{eq:s3:primary_problem:constraint2}) has the same solution as problem (1)–(3). Therefore, optimization problem (\ref{eq:s3:primary_problem:obj_func3})–(\ref{eq:s3:primary_problem:constraint2}) can be used as a general formulation that defines the SVM on an arbitrary training set, regardless of the linear separability of the two classes.
If $\wb^*$, $b^*$ is a solution of (\ref{eq:s3:primary_problem:obj_func3})–(\ref{eq:s3:primary_problem:constraint2}), then the hyperplane $\langle\boldsymbol{w}^*, \boldsymbol{x}\rangle + b^* = 0$ is called the optimal, or maximum-margin, hyperplane. What exactly do we mean by a hyperplane's margin when the two classes overlap? In section 1, the margin was defined for a hyperplane and two linearly separable classes. In section 2, it was shown that for the maximum-margin hyperplane the margin is
- as a number: $\frac{2}{\|\wb^*\|}$;
- as a region of space: the gap between the hyperplanes $\langle\boldsymbol{w}^*, \boldsymbol{x}\rangle + b^* = 1$ and $\langle\boldsymbol{w}^*, \boldsymbol{x}\rangle + b^* = -1$.
The same is assumed for a soft-margin hyperplane: its margin is defined as the region between the hyperplanes $\langle\boldsymbol{w}^*, \boldsymbol{x}\rangle + b^* = 1$ and $\langle\boldsymbol{w}^*, \boldsymbol{x}\rangle + b^* = -1$, although this region is no longer a gap between the optimal hyperplane and the two classes.
Minimization of the quadratic objective function (\ref{eq:s3:primary_problem:obj_func3}) subject to the linear constraints (\ref{eq:s3:primary_problem:constraint1})–(\ref{eq:s3:primary_problem:constraint2}) is a quadratic programming problem. The objective is convex but, in contrast to function (4), not strictly convex (its Hessian is positive semidefinite, since $b$ and $\xib$ do not enter it quadratically). This implies that the solution $\wb^*, b^*, \xib^*$ of problem (\ref{eq:s3:primary_problem:obj_func3})–(\ref{eq:s3:primary_problem:constraint2}) is not guaranteed to be unique. Problem (\ref{eq:s3:primary_problem:obj_func3})–(\ref{eq:s3:primary_problem:constraint2}) always has a solution: any $\wb$, $b$ become feasible once the slack variables are taken large enough, and the objective is bounded below by zero. If the training set contains only one class, however, the solution degenerates: $\wb^* = \boldsymbol{0}$, and the set of optimal $b^*$ is unbounded.
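Because the problem is a small, dense QP, it can be handed to a general-purpose constrained solver directly. The sketch below (an illustration under our own toy data; a practical implementation would use a dedicated QP solver or the dual formulation) solves (\ref{eq:s3:primary_problem:obj_func3})–(\ref{eq:s3:primary_problem:constraint2}) with SciPy's SLSQP method, packing $\wb$, $b$, $\xib$ into one variable vector:

```python
import numpy as np
from scipy.optimize import minimize

# Overlapping toy data in R^2: the last point is labeled +1 but lies
# in the convex hull of the -1 class, so some slack is inevitable.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [0.5, 0.5]])
y = np.array([-1, -1, -1, 1, 1, 1, 1], dtype=float)
l, n = X.shape
C = 1.0

# Decision variables packed as z = [w_1..w_n, b, xi_1..xi_l].
def objective(z):
    w, xi = z[:n], z[n + 1:]
    return 0.5 * w @ w + C * xi.sum()

constraints = [
    # y_i (<w, x_i> + b) - 1 + xi_i >= 0
    {"type": "ineq", "fun": lambda z: y * (X @ z[:n] + z[n]) - 1.0 + z[n + 1:]},
    # xi_i >= 0
    {"type": "ineq", "fun": lambda z: z[n + 1:]},
]

res = minimize(objective, np.zeros(n + 1 + l), constraints=constraints)
w_opt, b_opt, xi_opt = res.x[:n], res.x[n], res.x[n + 1:]
margin_width = 2.0 / np.linalg.norm(w_opt)  # the number 2/||w*|| from above
```

In the solution, the deliberately mislabeled point picks up a strictly positive slack $\xi_i$, while the well-placed points keep $\xi_i$ near zero.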
