By changing the parameter $C$ we can choose to favor one goal over the other. This is illustrated in Figure 3, which shows four separating hyperplanes and their margins, obtained for the same training set using increasing values of the parameter $C$. Circled are the points with non-zero error terms $\xi_i$ -- they lie either on the wrong side of the hyperplane or inside the margin. When $C$ is very small, the sum of error terms becomes negligible in objective function (15), so the goal of optimization is to maximize the margin. As a result, the margin can be large enough to contain all the points. At the other extreme, when $C$ is very large, the sum of error terms dominates the margin term in objective function (15), so the goal of optimization is to minimize the sum of error terms. As a result, the margin can be so small that it does not contain any points. Note that despite the differences in the value of the parameter $C$ and in the size of the margin, all four hyperplanes shown in Figure 3 are fairly similar, and they all correctly classify the same training points. Clearly, values of the parameter $C$ have no absolute meaning: they are related to the number of training points and to the range of the data. In the example shown in Figure 3, the training set consists of 14 points whose abscissa (horizontal coordinate) varies between 20 and 80, and whose ordinate (vertical coordinate) varies between 15 and 55. Note that it is possible to make $C$ independent of the number of training points if the term $\frac{C}{l}\sum_{i=1}^{l}\xi_i$ is used instead of $C \sum_{i=1}^{l}\xi_i$ in objective function (15). Note also that Figure 3 suggests that the tuning of the parameter $C$ should be done on a logarithmic scale. This is indeed a good approach in many practical applications.
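The trade-off can be made concrete by evaluating the two competing terms of objective function (15) -- the margin term $\frac{1}{2}\sum_{k}w_k^2$ and the error term $C\sum_{i}\xi_i$ -- for a fixed hyperplane at several values of $C$ spaced on a logarithmic grid. The toy points and hyperplane below are illustrative inventions, not the data from Figure 3:

```python
# Evaluate the two terms of the soft-margin objective
#   0.5 * ||w||^2  +  C * sum_i xi_i,  with  xi_i = max(0, 1 - y_i * (w . x_i + b)),
# for a fixed hyperplane and increasing C.  All data here are made up for illustration.

def objective_terms(w, b, X, y, C):
    """Return (margin_term, error_term) of the soft-margin objective."""
    margin_term = 0.5 * sum(wk * wk for wk in w)
    slacks = [max(0.0, 1.0 - yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b))
              for xi, yi in zip(X, y)]
    return margin_term, C * sum(slacks)

# Toy 2-D training set: two of the four points violate the margin (non-zero xi).
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (-0.3, 0.0)]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0            # fixed candidate hyperplane

for C in (0.01, 1.0, 100.0):      # logarithmic grid, as the text suggests
    m, e = objective_terms(w, b, X, y, C)
    print(f"C={C:>6}: margin term = {m:.3f}, error term = {e:.3f}")
```

For small $C$ the margin term dominates and for large $C$ the error term does, which is exactly the behavior described above; in practice one typically searches a grid such as $C \in \{10^{-3}, 10^{-2}, \dots, 10^{3}\}$ and selects the best value by cross-validation.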
In addition to balancing the goals of margin maximization and error minimization, the parameter $C$ also provides a convenient way to deal with situations in which the cost of misclassification differs for points from $C_1$ and $C_2$. For example, if the cost of misclassifying a point from $C_1$ is two times higher than the cost of misclassifying a point from $C_2$, then instead of objective function (15) we can consider the objective function \begin{equation} \frac{1}{2}\sum_{k=1}^{n}w_k^2 + \tilde{C}_1\sum_{i:\;y_i=1}\xi_i + \tilde{C}_2\sum_{i:\;y_i=-1}\xi_i\rightarrow \min_{\wb, b, \xib} \label{eq:s3:obj_func4} \end{equation} where $\tilde{C}_1/\tilde{C}_2 = 2$. This trick also makes it possible to handle unbalanced classes, when the number of training points from one class significantly exceeds the number from the other class. Given unbalanced training data, an SVM classifier will tend to have higher accuracy on the larger class and lower accuracy on the smaller class. To level these accuracies, objective function (\ref{eq:s3:obj_func4}) can be used, with $\tilde{C}_1 > \tilde{C}_2$ if $C_1$ is the smaller class.
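A minimal sketch of the class-weighted objective above (the points, weights, and function name are invented for illustration): each unit of slack on a point from $C_1$ is charged $\tilde{C}_1$, and each unit of slack on a point from $C_2$ is charged $\tilde{C}_2$, here with $\tilde{C}_1/\tilde{C}_2 = 2$:

```python
# Class-weighted soft-margin objective:
#   0.5 * ||w||^2 + C1 * sum(xi_i : y_i = +1) + C2 * sum(xi_i : y_i = -1)
# Toy data for illustration; C1/C2 = 2 makes positive-class errors twice as costly.

def weighted_objective(w, b, X, y, C_pos, C_neg):
    """Objective with separate error weights for the two classes."""
    obj = 0.5 * sum(wk * wk for wk in w)
    for xi, yi in zip(X, y):
        slack = max(0.0, 1.0 - yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b))
        obj += (C_pos if yi == 1 else C_neg) * slack
    return obj

X = [(0.5, 0.0), (-0.5, 0.0)]    # one margin violator in each class (xi = 0.5 each)
y = [1, -1]
w, b = [1.0, 0.0], 0.0

C_pos, C_neg = 2.0, 1.0          # tilde-C1 / tilde-C2 = 2
print(weighted_objective(w, b, X, y, C_pos, C_neg))
```

Libraries typically expose the same idea directly; for example, scikit-learn's `SVC` rescales $C$ per class through its `class_weight` parameter, so the ratio $\tilde{C}_1/\tilde{C}_2$ does not have to be coded by hand.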