Optimization of Support Vector Machine Classification

Support vector machines can be applied in powerful classification algorithms that form intricate boundaries for delineating multiple classes. One of the main advantages of support vector classifiers (SVCs) is that fitting occurs with an implicit transformation of the pre-existing features into a higher-dimensional space. Thus, non-linear combinations of features can be surveyed without the need to create them explicitly. This report describes the effects of SVC parameters on model fitting, and demonstrates an iterative scanning method to determine optimal values for these parameters using an example case: classification of Iris flower species.

Python code for this interactive optimization can be found in this GitHub repository, which includes an IPython notebook demonstrating the process. For best results with the interactive plots, please view this page on a computer rather than a mobile device.

Kernels

One of the main strengths of SVCs lies in the use of kernels: functions which transform the input data into a new basis for classification. Two popular kernel choices are linear and Gaussian (the latter is also referred to as "radial basis function", or RBF, in SVC contexts). As demonstrated in the simulated data below, Gaussian kernels provide flexibility in the shape of the separating barrier (also known as the "decision boundary"), while linear kernels only allow separation along linear boundaries.

Simulated classification data, with two classes (red and blue) in a two-dimensional feature space. The red region corresponds to area assigned to the red class, while the blue region corresponds to area assigned to the blue class. With a linear kernel, the red and blue classes cannot be effectively separated. However, with a Gaussian kernel, the red class can readily be isolated from the enveloping blue class.
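
As a concrete illustration, the minimal sketch below fits both kernel types to simulated data of this kind using scikit-learn (an assumption; the repository linked above may organize this differently), with make_circles standing in for the red class enveloped by the blue class:

```python
# Compare linear and Gaussian (RBF) kernels on simulated two-class data.
# make_circles produces an inner class surrounded by an outer class.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    # The linear kernel cannot close a boundary around the inner class,
    # so its training accuracy stays near chance; the RBF kernel can.
    print(kernel, clf.score(X, y))
```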

Non-linear decision boundaries can also be attained with simpler classification algorithms, such as logistic regression. However, this is only possible by creating polynomial combinations of the pre-existing features. This, in turn, requires optimization of the degree of the applied polynomial, and consumes additional memory and computational resources, as the new features must be calculated, scaled, and stored. SVCs with Gaussian kernels circumvent these issues, and therefore allow for classification along non-linear decision boundaries with minimal effort and fewer sources of error.
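
The contrast can be sketched as follows (again assuming scikit-learn; the degree-2 expansion is illustrative):

```python
# The logistic-regression route requires explicitly creating, scaling,
# and storing polynomial feature combinations; the RBF SVC does not.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# The polynomial degree is one more parameter to optimize, and the
# expanded feature matrix must be materialized in memory.
poly_logreg = make_pipeline(PolynomialFeatures(degree=2),
                            StandardScaler(),
                            LogisticRegression())
print(poly_logreg.fit(X, y).score(X, y))

# The Gaussian kernel performs a comparable transformation implicitly.
print(SVC(kernel="rbf").fit(X, y).score(X, y))
```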

Effects of Parameters

As with many other learning algorithms, SVCs make use of a constant input parameter to fine-tune the sensitivity of the fitting process. This parameter, C, introduces a penalty term in the cost function that restricts the magnitude of the weight vector applied to the kernel-transformed data. In practical terms, this controls how sensitive the fitting is to individual data points (and therefore, to outliers). In other words, the value of C controls the complexity of the decision boundary shape. The magnitude of the penalty term is inversely proportional to C, so higher values lead to more complex models and lower values lead to simpler, smoother models.

Additionally, some kernels allow for further control of fitting through kernel-associated parameters. In the case of Gaussian-kernel SVCs, the width of the individual kernels is controlled with a constant called gamma. Gaussian functions are perhaps most commonly known as the form of normal distributions, with a width proportional to the standard deviation (sigma). Gamma is related to sigma as follows:

$$ \gamma = \frac{1}{2\sigma^2} $$
Thus, a higher value of gamma equates to a smaller value of sigma, and a narrower distribution. For a fixed value of C, a higher value of gamma amounts to a sharper, more complex decision boundary.
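
Since the kernel is just a function of two points, the gamma-width relationship is easy to verify directly. The sketch below (plain NumPy, no fitting) shows how quickly the kernel value decays with distance as gamma grows:

```python
# The Gaussian (RBF) kernel: k(x1, x2) = exp(-gamma * ||x1 - x2||^2),
# where gamma = 1 / (2 * sigma**2). Larger gamma means a narrower kernel.
import numpy as np

def rbf_kernel(x1, x2, gamma):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for gamma in (0.1, 1.0, 10.0):
    # The kernel value drops toward zero faster as gamma increases.
    print(gamma, rbf_kernel(x1, x2, gamma))
```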

The simulated classification data from the previous figure, fit with Gaussian-kernel SVCs using the values for C and gamma specified in the lower left corner of each plot. As before, the color of the data points and shaded regions correspond to the identity of the class. As C is increased (top to bottom) or as gamma is increased (left to right), the decision boundary becomes generally more complex. Gamma has a stronger impact on the model, with the highest values of gamma shown here (rightmost column) completely dominating the properties of the fit and largely negating the influence of C.
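
A grid of fits like the one above can be reproduced with a few lines (the C and gamma values below are illustrative placeholders, not the values used in the figure):

```python
# Refit the simulated data over a grid of C and gamma values, mirroring
# the panel layout above (rows: increasing C, columns: increasing gamma).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for C in (0.1, 1.0, 10.0):
    for gamma in (0.1, 1.0, 10.0):
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
        # More support vectors generally indicates a more complex boundary.
        print(f"C={C:<5} gamma={gamma:<5} support vectors={clf.support_.size}")
```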

In the figure shown above, the leftmost column shows models that inaccurately fit the data by either failing to recover the closed, enveloping boundary (A and D) or by enclosing too wide a central region (G). In either case, the model would have a low accuracy in assigning these examples to their correct classes. This represents "underfitting", in that the SVC was too constrained to accurately fit the input data. Models C, F, I, and (to a lesser extent) H represent the converse scenario: the SVC was not sufficiently constrained, and produced a model that fits the input data well (or perfectly) but would likely perform poorly on new data, as the decision boundary shape is too specific. This phenomenon is known as "overfitting". Models B and E represent reasonable fits to the data, and therefore suitable choices for C and gamma.

In practice, the selection of values for C and gamma should be done with a more robust and quantitative approach than visually assessing the results of fitting. Additionally, visualization is not convenient for higher-dimensional data. A sounder approach is to try a range of parameter values and assess the accuracy of the resulting fit. One problem with this approach is that overfitted models will have a high accuracy when classifying the data with which they were trained. To circumvent this issue, datasets can be randomly split into a set for training the model and a set for testing it. With this approach, both overfitted and underfitted models lead to low accuracies and can therefore be distinguished from models which more closely represent the underlying distribution. This "train/test split" method can also be thought of as a crude simulation that tests the model against new data it might encounter in subsequent applications.
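
In scikit-learn terms (the C and gamma values here are placeholders), the split-and-score step looks like this:

```python
# Score a candidate (C, gamma) pair on a held-out test set, so that
# overfitting is penalized rather than rewarded.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X_train, y_train)
print("training accuracy:", clf.score(X_train, y_train))
print("test accuracy:    ", clf.score(X_test, y_test))
```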

The Iris Dataset

The iterative scan parameter optimization method (described in the following section) will be applied to an example case of classification among three species of Iris flower: Iris setosa, Iris versicolor, and Iris virginica. As demonstrated in the photographs below, the flowers of the three species have highly similar outward appearances at first glance.


The classification task will be to identify the species of a flower given measurements of the widths and lengths of its petals and sepals. SVCs will be trained and tested on the famous Iris dataset (shown below), a commonly-used test case for classification algorithms that contains 50 examples of each of the three Iris species mentioned previously. The original measurements are in centimeters, but each feature has been independently converted into standard scores. This step ensures that all features have similar scaling and are centered around zero, which improves SVC fitting performance.
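
Loading and standardizing the data takes only a few lines (assuming scikit-learn's bundled copy of the dataset):

```python
# Load the Iris measurements and convert each feature (independently)
# to standard scores: zero mean and unit variance.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)             # 150 samples x 4 features, in cm
X_scaled = StandardScaler().fit_transform(X)  # z-score each feature
```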

The entire Iris dataset, which contains 150 samples divided evenly into three classes (i.e. species, indicated to the right). Each sample has measurements for four features of flower morphology: sepal length, sepal width, petal length, and petal width (shown here as standard scores, or "Z scores"). The off-diagonal elements of the plots above show scatter plots for all pairs of variables. The diagonal elements show estimates of the underlying probability density distribution for each variable.
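
A figure of this kind can be reproduced with seaborn (an assumption; the original figure may have been generated differently):

```python
# Pair plot of the standardized Iris data: scatter plots off the
# diagonal, kernel density estimates on the diagonal.
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(StandardScaler().fit_transform(iris.data),
                  columns=iris.feature_names)
df["species"] = [iris.target_names[t] for t in iris.target]

sns.pairplot(df, hue="species", diag_kind="kde")
```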

From the plots above, it is apparent that distinguishing Setosa should pose no real problem, as it is linearly separable in all variable combinations (and can even be isolated based on either petal variable alone). Conversely, Versicolor and Virginica overlap considerably and cannot be fully separated with linear boundaries. However, by leveraging all four dimensions simultaneously, and with optimal parameters, SVCs will be able to assign the species with a maximal accuracy of 97.5 %.

Iterative Parameter Optimization

Optimal values for C and gamma can be found by scanning over a range of values and assessing performance. This process is iterated, with increasingly narrow ranges selected each round based on the results of the previous scan. Each scan is performed as follows (a code sketch follows the list):

  1. Values for C and gamma are taken from an array.
  2. One third of the dataset is randomly selected and reserved as the test set. A Gaussian-kernel SVC with the specified C and gamma parameters is fit to the remainder. Various accuracy metrics (to be described shortly) are calculated.
  3. Step 2 is repeated many times (60, in the results shown here) and the scores are averaged (using weighting, where appropriate). Repetition and averaging minimize the noise that arises from the random train/test splitting.
  4. The results from Step 3 are saved in matrices, and the process repeats with new values of C and gamma.
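
A minimal version of this scan, assuming scikit-learn (the array ranges and sizes below are illustrative, not the ones used in the actual analysis):

```python
# One scan: for each (C, gamma) pair, average test accuracy over many
# random train/test splits and store the result in a matrix.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

C_values = np.logspace(-2, 6, 9)       # illustrative scan ranges
gamma_values = np.logspace(-8, 0, 9)
accuracy = np.zeros((len(C_values), len(gamma_values)))

for i, C in enumerate(C_values):
    for j, gamma in enumerate(gamma_values):
        scores = []
        for rep in range(60):  # repeat to average out split-to-split noise
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3)
            clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
            scores.append(clf.score(X_te, y_te))
        accuracy[i, j] = np.mean(scores)
```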

Averaged performance results are stored in matrices, which can be visualized to determine which parameter ranges perform well or poorly, as demonstrated in the interactive plot below:
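
A static heatmap serves as a stand-in here; the sketch below continues from the scan sketch above, reusing C_values, gamma_values, and accuracy:

```python
# Render the accuracy matrix as a heatmap with logarithmic parameter axes.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.pcolormesh(gamma_values, C_values, accuracy, shading="auto")
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("gamma")
ax.set_ylabel("C")
fig.colorbar(im, label="mean test accuracy")
plt.show()
```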


The classification accuracy values indicate that decent fitting (at least 80 % correct) only occurs in a portion of the upper left quadrant. Outside of this region, the accuracy drops to a baseline level slightly below 1/3, which results from decision boundaries that overwhelmingly favor a single class: samples from that class are likely to be assigned correctly, while samples from the other classes will most likely be misclassified.

By comparing the accuracy of the test set classification ("Overall Accuracy") against the accuracy of the training set classification ("Training Accuracy"), the degree of overfitting and underfitting can be assessed, as shown below:

The training accuracy is 1.0 for most cases where gamma is >0.1 and C is >1.0. However, the overall (test) accuracy is near the baseline value in most of this region, making the difference between these values large. This corresponds to overfitting, since the training set is fit perfectly by a model that generalizes poorly to new examples. In the area in which the test accuracy is high, the training accuracy is also correspondingly high (and in fact slightly higher, as classification models in general perform better on the training set than the test set), leading to a small difference signal. The remaining region of the plots corresponds to underfitting, as the resulting SVCs perform poorly on both the training and test sets.
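
The gap can be demonstrated directly for a single parameter pair (the high gamma below is a hypothetical choice, picked to land in the overfitting region):

```python
# Compare training vs. test accuracy for one (C, gamma) pair; a large
# gap flags overfitting, while low values of both flag underfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3,
                                          random_state=0)

clf = SVC(kernel="rbf", C=1e4, gamma=100.0).fit(X_tr, y_tr)
print("training accuracy:", clf.score(X_tr, y_tr))  # typically near 1.0 here
print("test accuracy:    ", clf.score(X_te, y_te))  # typically far lower
```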

For some applications, performance among certain classes may be particularly important (or unimportant). To assess accuracy in individual classes, the F1 score is useful. This metric ranges from 0 to 1, and is affected both by precision (the proportion of those samples that were assigned to a particular class that truly belonged to that class) and recall (the proportion of all class members that were correctly assigned to that class). The plots below show the distribution of class-specific F1 scores for each of the three Iris species, along with the overall accuracy (for comparison). Note: in these plots and all that follow, maximal values for each score are highlighted in blue.
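
Per-class F1 scores are available directly from scikit-learn (the parameter values below are placeholders):

```python
# Per-class F1 scores: average=None returns one score per class,
# in the order [setosa, versicolor, virginica].
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3,
                                          random_state=0)

y_pred = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X_tr, y_tr).predict(X_te)
print(f1_score(y_te, y_pred, average=None))
```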

The overall performance is largely correlated with the individual performance for Virginica or Versicolor classification, which makes sense, as these two classes were shown to overlap the most and their proper separation is likely the major obstacle to accurate classification. Setosa classification is often perfect, which is unsurprising given the distance between that class and the others, as shown previously.

From the above results, it's clear that the classifier performs well almost exclusively in the upper left quadrant. Therefore, the next scan will cover only this region.

The best-performing classifiers (in terms of combined overall, Versicolor, and Virginica performances) on average were those with a value of 7.197e−06 for gamma and 4.498e+04 for C, with an average overall accuracy of 97.4 % (up from 97.2 % in the previous scan). The next scan will consider 2 orders of magnitude for each parameter, centered around this point.

The best-performing parameters were 1.526e−05 (gamma) and 2.442e+04 (C). For the final scan, linear ranges around these values will be used.
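
Constructing the successive scan ranges is straightforward (the point counts and the ±50 % span of the linear round are illustrative choices):

```python
# Build the next scan's parameter ranges around the current best values:
# log-spaced for the coarse rounds, linear for the final round.
import numpy as np

best_gamma, best_C = 1.526e-05, 2.442e+04  # best values from the scan above

# Coarse round: two orders of magnitude, centered on the best point.
gamma_coarse = np.logspace(np.log10(best_gamma) - 1,
                           np.log10(best_gamma) + 1, 9)
C_coarse = np.logspace(np.log10(best_C) - 1, np.log10(best_C) + 1, 9)

# Final round: linear ranges around the best point.
gamma_fine = np.linspace(0.5 * best_gamma, 1.5 * best_gamma, 9)
C_fine = np.linspace(0.5 * best_C, 1.5 * best_C, 9)
```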

The maximum overall accuracy has improved to 97.5 %, and can be reached through several different pairs of C and gamma values within this range. One of these pairs (gamma = 1.245e−05, C = 2.347e+04) also maximizes performance in classifying Versicolor and Virginica individually, and thus represents optimal parameter values for future classifier training.