Warning: MathJax requires JavaScript to process the mathematics on this page.
If your browser supports JavaScript, be sure it is enabled.

Previous excerpt: Primary optimization problem for SVM, linearly nonseparable classes

Parameter C

$\setCounter{17}$As we noted before, soft margin SVM has one parameter which should be adjusted by the user: a positive constant $C$ in objective function (15). This parameter balances two different goals: maximizing the margin and minimizing the number of errors on the training data. These goals may be conflicting, since margin expansion may increase the overall error (Figure 2). $\newcommand{\wb}{\boldsymbol{w}}$ $\newcommand{\xib}{\boldsymbol{\xi}}$ $\newcommand{\wx}{\langle\boldsymbol{w}, \boldsymbol{x}\rangle}$

Support vector machines, values of slack variables shown for training points from the first class

Figure 2. Values of slack variables $\xi_i$ shown for training points from $C_1$ (white points). Optimal hyperplane $\pi$ is defined by $\wx + b = 0$, $\pi_1$ is defined by $\wx + b = 1$, and $\pi_2$ is defined by $\wx + b = -1$. The margin is the region between $\pi_1$ and $\pi_2$. Margin expansion may increase the overall error. Note that points located in the margin will always have $\xi_i > 0$, even when they are correctly classified by the hyperplane.

By changing parameter $C$ we can choose to favor one goal over another. This is illustrated on Figure 3, which shows four separating hyperplanes and their margins, obtained for the same training set using increasing values of parameter $C$. Circled are points with non-zero error terms $\xi_i$ -- they either lie on the wrong side of the hyperplane, or in the margin. When $C$ is very small, the sum of error terms becomes negligible in objective function (15), so that the goal of optimization is to maximize the margin. As a result, the margin can be large enough to contain all the points. At another extreme, when $C$ is very large, the sum of error terms dominates the margin term in objective function (15), so that the goal of optimization is to minimize the sum of error terms. As a result, the margin can be so small that it does not contain any points. Note that despite differences in the value of parameter $C$ and the size of the margin, all four hyperplanes shown in Figure 3 are fairly similar, and they all correctly classify the same training points.

Clearly, values of parameter $C$ do not have absolute meaning. They are related to the number of training points and the range of data. In the example shown on Figure 3, we have a training set consisting of 14 points whose abscissa (horizontal coordinate) varies between 20 and 80, and the ordinate (vertical coordinate) -- between 15 and 55. Note that it is possible to make $C$ independent of the number of training points if $C \frac{1}{l}\sum_{i=1}^{l}\xi_i$ term is used instead of $C \sum_{i=1}^{l}\xi_i$ in objective function (15). Note also that Figure 3 suggests that the tuning of parameter $C$ should be done on a logarithmic scale. This is indeed a good approach for many practical applications.

Support vector machines, optimal separating hyperplanes and their margins obtained for the same training set by using different values of parameter C

Figure 3. Optimal separating hyperplanes and their margins obtained for the same training set by using different values of parameter $C$. Circled are points with $\xi_i > 0$.

In addition to balancing the goals of margin maximization and error minimization, parameter $C$ also provides a convenient way to deal with situations when the cost of misclassification is different for points from $C_1$ and $C_2$. For example, if the cost of misclassification a point from $C_1$ is two times higher then the cost of misclassification a point from $C_2$, then instead of objective function (15) we can consider objective function \begin{equation} \frac{1}{2}\sum_{k=1}^{n}w_k^2 + \tilde{C_1}\sum_{i:\;y_i=1}\xi_i + \tilde{C_2}\sum_{i:\;y_i=-1}\xi_i\rightarrow \min_{\wb, b, \xib} \label{eq:s3:obj_func4} \end{equation} where $\tilde{C_1}/\tilde{C_2} = 2$. This trick also allows to handle situations with unbalanced classes, when the number of training points from one class significantly exceeds the number of training points from another class. Given unbalanced training data, SVM classifier will tend to have higher accuracy on larger class, and lower accuracy on smaller class. To level these accuracies, objective function (\ref{eq:s3:obj_func4}) can be used, where $\tilde{C}_1 > \tilde{C}_2$ if $C_1$ is smaller than $C_2$.

Previous excerpt: Primary optimization problem for SVM, linearly nonseparable classes

Home