Linear models are a class of models that are widely used in practice, and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make predictions using a linear function of the input features, as we will explain below.
Linear models for regression
For regression, the general prediction formula for a linear model looks as follows:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b      (1)

Here, x[0] to x[p] denote the features of a single data point (so the number of features is p + 1), w and b are parameters of the model that are learned, and ŷ is the prediction the model makes. For a dataset with a single feature, this is

ŷ = w[0] * x[0] + b

which you might remember as the equation for a line from high school mathematics. Here, w[0] is the slope and b is the y-axis offset. For more features, w contains the slopes along each feature axis. Alternatively, you can think of the predicted response as a weighted sum of the input features, with weights (which can be negative) given by the entries of w. Trying to learn the parameters w and b on our one-dimensional wave dataset might lead to the following line:
We added a coordinate cross to the plot to make the line easier to interpret. Looking at w, we see that the slope should be roughly 0.4, which we can confirm visually in the plot above. The intercept is where the prediction line crosses the y-axis: here, slightly below zero, which you can also confirm in the image. Linear models for regression can be characterized as regression models for which the prediction is a line for a single feature, a plane when using two features, or a hyperplane in higher dimensions (that is, when using more features).
If you compare the predictions made by the red line with those made by the KNeighborsRegressor in Figure nearest_neighbor_regression, using a straight line to make predictions seems very restrictive. It looks like all the fine details of the data are lost. In a sense this is true. It is a strong (and somewhat unrealistic) assumption that our target y is a linear combination of the features. But looking at one-dimensional data gives a somewhat skewed perspective. For datasets with many features, linear models can be very powerful. In particular, if you have more features than training data points, any target y can be perfectly modeled (on the training set) as a linear function (FOOTNOTE This is easy to see if you know some linear algebra).
There are many different linear models for regression. The difference between these models is how the model parameters w and b are learned from the training data, and how model complexity can be controlled. We will now go through the most popular linear models for regression.
Linear Regression aka Ordinary Least Squares
Linear regression, or Ordinary Least Squares (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters w and b that minimize the mean squared error between predictions and the true regression targets y on the training set. The mean squared error is the average of the squared differences between the predictions and the true values. Linear regression has no parameters to tune, which is a benefit, but it also has no way to control model complexity. Here is the code that produces the model you can see in the figure above:
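The wave data itself isn't reproduced here, so the following is a minimal sketch that generates a comparable one-feature dataset (the 0.4 slope and the noise level are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wave dataset: one feature, roughly linear target
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.4 * X[:, 0] + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)

print("lr.coef_:", lr.coef_)            # slope(s), one entry per feature
print("lr.intercept_:", lr.intercept_)  # y-axis offset, a single float
```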
The intercept_ attribute is always a single float number, while the coef_ attribute is a numpy array with one entry per input feature. As we only have a single input feature in the wave dataset, lr.coef_ only has a single entry. Let’s look at the training set and test set performance:
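A self-contained sketch of reading off these scores, again using a synthetic stand-in for the wave data (slope and noise are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same kind of synthetic one-feature dataset as a stand-in for wave
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.4 * X[:, 0] + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)

# R^2 on the training and the test set
print("training set score: %.2f" % lr.score(X_train, y_train))
print("test set score: %.2f" % lr.score(X_test, y_test))
```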
An R^2 of around 0.66 is not very good, but we can see that the scores on the training and test sets are very close together. This means we are likely underfitting, not overfitting. For this one-dimensional dataset, there is little danger of overfitting, as the model is very simple (or restricted). However, with higher-dimensional datasets (meaning a large number of features), linear models become more powerful, and there is a higher chance of overfitting.
Let’s take a look at how LinearRegression performs on a more complex dataset, like the Boston Housing dataset. Remember that this dataset has 506 samples and 105 derived features. We load the dataset and split it into a training and a test set. Then we build the linear regression model as before:
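Recent scikit-learn versions no longer bundle the Boston dataset, so the following sketch substitutes a synthetic dataset in which the number of features is close to the number of training samples; the sample and feature counts are assumptions chosen to provoke the same overfitting effect:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Many features relative to samples: with 75 training points and 80 features,
# ordinary least squares can fit the training set perfectly
X, y = make_regression(n_samples=100, n_features=80, noise=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print("training set score: %.2f" % lr.score(X_train, y_train))
print("test set score: %.2f" % lr.score(X_test, y_test))
```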
This is a clear sign of overfitting, and therefore we should try to find a model that allows us to control complexity. One of the most commonly used alternatives to standard linear regression is Ridge regression, which we will look into next.
Ridge regression

Ridge regression is also a linear model for regression, so the formula it uses to make predictions is still Formula (1), as for ordinary least squares. In Ridge regression, the coefficients w are chosen not only so that they predict well on the training data, but also subject to an additional constraint: we want the magnitude of the coefficients to be as small as possible; in other words, all entries of w should be close to zero.
Intuitively, this means each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well. This constraint is an example of what is called regularization. Regularization means explicitly restricting a model to avoid overfitting. The particular kind used by Ridge regression is known as l2 regularization. (footnote: Mathematically, Ridge penalizes the l2 norm of the coefficients, or the Euclidean length of w.)
Ridge regression is implemented in linear_model.Ridge. Let’s see how well it does on the extended Boston dataset:
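Since the extended Boston data isn't reproduced here, this sketch uses a synthetic dataset with many features relative to samples (the sizes are assumptions) to show the usage:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=80, noise=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge().fit(X_train, y_train)  # default alpha=1.0
print("training set score: %.2f" % ridge.score(X_train, y_train))
print("test set score: %.2f" % ridge.score(X_test, y_test))
```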
As you can see, the training set score of Ridge is lower than for LinearRegression, while the test set score is higher. This is consistent with our expectation. With linear regression, we were overfitting to our data. Ridge is a more restricted model, so we are less likely to overfit. A less complex model means worse performance on the training set, but better generalization.
As we are only interested in generalization performance, we should choose the Ridge model over the LinearRegression model.
The Ridge model makes a trade-off between the simplicity of the model (near zero coefficients) and its performance on the training set. How much importance the model places on simplicity versus training set performance can be specified by the user, using the alpha parameter. Above, we used the default parameter alpha=1.0. There is no reason why this would give us the best trade-off, though. Increasing alpha forces coefficients to move more towards zero, which decreases training set performance, but might help generalization.
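The effect of alpha can be sketched by fitting Ridge with several values on a synthetic dataset and comparing the coefficient magnitudes; the alpha grid and dataset sizes here are assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=80, noise=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Larger alpha -> smaller coefficient norm, lower training score
for alpha in [0.1, 1, 10]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha=%5.1f  ||w||=%8.2f  train R^2=%.2f  test R^2=%.2f"
          % (alpha, np.linalg.norm(ridge.coef_),
             ridge.score(X_train, y_train), ridge.score(X_test, y_test)))
```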
Decreasing alpha allows the coefficients to be less restricted, meaning we move to the right in Figure model_complexity.
For very small values of alpha, coefficients are barely restricted at all, and we end up with a model that resembles LinearRegression.
Here, alpha=0.1 seems to be working well. We could try decreasing alpha even more to improve generalization. For now, notice how the parameter alpha corresponds to the model complexity as shown in Figure model_complexity. We will discuss methods to properly select parameters in Chapter 6 (Model Selection).
We can also get a more qualitative insight into how the alpha parameter changes the model by inspecting the coef_ attribute of models with different values of alpha. A higher alpha means a more restricted model, so we expect that the entries of coef_ have smaller magnitude for a high value of alpha than for a low value of alpha. This is confirmed in the plot below:
Here, the x-axis enumerates the entries of coef_: x=0 shows the coefficient associated with the first feature, x=1 the coefficient associated with the second feature, and so on up to x=100. The y-axis shows the numeric value of the corresponding coefficient. The main takeaway here is that for alpha=10 (shown by the green dots), the coefficients are mostly between around -3 and 3. The coefficients for the Ridge model with alpha=1 (shown by the blue dots) are somewhat larger. The red dots, which correspond to linear regression without any regularization (which would be alpha=0), have larger magnitude still; many are so large that they fall outside the chart.
Lasso

An alternative to Ridge for regularizing linear regression is the Lasso. The Lasso also restricts coefficients to be close to zero, similarly to Ridge regression, but in a slightly different way, called l1 regularization. (footnote: The Lasso penalizes the l1 norm of the coefficient vector, or in other words the sum of the absolute values of the coefficients.)
The consequence of l1 regularization is that when using the Lasso, some coefficients are exactly zero. This means some features are entirely ignored by the model. This can be seen as a form of automatic feature selection. Having some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model.
Let’s apply the lasso to the extended Boston housing dataset:
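As the Boston data is no longer bundled with scikit-learn, this sketch again uses a synthetic stand-in with assumed sizes; the exact scores and feature counts will differ from the ones discussed in the text:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=80, noise=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso().fit(X_train, y_train)  # default alpha=1.0
print("training set score: %.2f" % lasso.score(X_train, y_train))
print("test set score: %.2f" % lasso.score(X_test, y_test))
# Count the features with nonzero coefficients
print("number of features used:", np.sum(lasso.coef_ != 0))
```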
As you can see, the Lasso does quite badly, both on the training and the test set. This indicates that we are underfitting. We find that it only used three of the 105 features:
Similarly to Ridge, the Lasso also has a regularization parameter alpha that controls how strongly coefficients are pushed towards zero. Above, we used the default of alpha=1.0. To reduce underfitting, let's try decreasing alpha:
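A sketch on a synthetic stand-in (assumed sizes); note that with a lower alpha, max_iter is raised so the coordinate descent solver converges:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=80, noise=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso().fit(X_train, y_train)  # default alpha=1.0
# Lower alpha: weaker regularization, so more features stay in the model
lasso001 = Lasso(alpha=0.01, max_iter=100_000).fit(X_train, y_train)

print("training set score: %.2f" % lasso001.score(X_train, y_train))
print("test set score: %.2f" % lasso001.score(X_test, y_test))
print("features used: %d of %d" % (np.sum(lasso001.coef_ != 0), X.shape[1]))
```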
A lower alpha allowed us to fit a more complex model, which worked better on the training and the test data. The performance is slightly better than using Ridge, and we are using only 34 of the 105 features. This makes this model potentially easier to understand. If we set alpha too low, we again remove the effect of regularization and end up with a result similar to LinearRegression.
Again, we can plot the coefficients of the different models, similarly to Figure ridge_coefficients.
In practice, Ridge regression is usually the first choice between these two models. However, if you have a large number of features and expect only a few of them to be important, Lasso might be a better choice. Similarly, if you would like a model that is easy to interpret, Lasso will provide a model that is easier to understand, as it will select only a subset of the input features.
Linear models for Classification
Linear models are also extensively used for classification. Let's look at binary classification first. In this case, a prediction is made using the following formula:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b > 0      (2)

The formula looks very similar to the one for linear regression, but instead of just returning the weighted sum of the features, we threshold the predicted value at zero. If the weighted sum is smaller than zero, we predict the class -1; if it is larger than zero, we predict the class +1. This prediction rule is common to all linear models for classification. Again, there are many different ways to find the coefficients w and the intercept b.
For linear models for regression, the output y was a linear function of the features: a line, plane, or hyperplane (in higher dimensions). For linear models for classification, the decision boundary is a linear function of the input. In other words, a (binary) linear classifier is a classifier that separates two classes using a line, a plane or a hyperplane. We will see examples of that below.
There are many algorithms for learning linear models. These algorithms all differ in the following two ways:

1. how they measure how well a particular combination of coefficients and intercept fits the training data, and
2. if and what kind of regularization they use.

Different algorithms choose different ways to measure what "fitting the training set well" means in 1. For technical mathematical reasons, it is not possible to adjust w and b to minimize the number of misclassifications the algorithms produce, as one might hope. For our purposes, and many applications, the different choices for 1. (called the loss function) are of little significance.
The two most common linear classification algorithms are logistic regression, implemented in linear_model.LogisticRegression and linear support vector machines (linear SVMs), implemented in svm.LinearSVC (SVC stands for Support Vector Classifier). Despite its name, LogisticRegression is a classification algorithm and not a regression algorithm, and should not be confused with LinearRegression.
We can apply the LogisticRegression and LinearSVC models to the forge dataset, and visualize the decision boundary as found by the linear models:
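The forge dataset isn't reproduced here; the following sketch uses a comparable two-feature, two-class blob dataset (the sample count and random seed are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Two-feature, two-class stand-in for the forge dataset
X, y = make_blobs(n_samples=26, centers=2, random_state=4)

logreg = LogisticRegression().fit(X, y)
svc = LinearSVC().fit(X, y)

# Each model learns one coefficient per feature, plus an intercept;
# the decision boundary is the line where w[0]*x[0] + w[1]*x[1] + b = 0
print("LogisticRegression coef_:", logreg.coef_, "intercept_:", logreg.intercept_)
print("LinearSVC coef_:", svc.coef_, "intercept_:", svc.intercept_)
```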
In this figure, we have the first feature of the forge dataset on the x axis and the second feature on the y axis as before. We display the decision boundaries found by LinearSVC and LogisticRegression respectively as straight lines, separating the area classified as blue on the bottom from the area classified as red on the top. In other words, any new data point that lies above the black line will be classified as red by the respective classifier, while any point that lies below the black line will be classified as blue.
The two models come up with similar decision boundaries. Note that both misclassify two of the points. By default, both models apply an l2 regularization, in the same way that Ridge does for regression.
For LogisticRegression and LinearSVC the trade-off parameter that determines the strength of the regularization is called C, and higher values of C correspond to less regularization. In other words, when using a high value of the parameter C, LogisticRegression and LinearSVC try to fit the training set as well as possible, while with low values of the parameter C, the models put more emphasis on finding a coefficient vector w that is close to zero.
There is another interesting intuition for how the parameter C acts. Using low values of C will cause the algorithms to try to adjust to the "majority" of data points, while using a higher value of C stresses the importance of classifying each individual data point correctly. Here is an illustration using LinearSVC.
On the left-hand side, we have a very small C, corresponding to a lot of regularization. Most of the blue points are at the top, and most of the red points are at the bottom. The strongly regularized model chooses a relatively horizontal line, misclassifying two points. In the center plot, C is slightly higher, and the model focuses more on the two misclassified samples, tilting the decision boundary. Finally, on the right-hand side, a very high value of C tilts the decision boundary a lot, now correctly classifying all red points. One of the blue points is still misclassified, as it is not possible to correctly classify all points in this dataset using a straight line. The model illustrated on the right-hand side tries hard to correctly classify all points, but might not capture the overall layout of the classes well. In other words, this model is likely overfitting.
Similarly to the case of regression, linear models for classification might seem very restrictive in low dimensional spaces, only allowing for decision boundaries which are straight lines or planes. Again, in high dimensions, linear models for classification become very powerful, and guarding against overfitting becomes increasingly important when considering more features.
Let’s analyze LogisticRegression in more detail on the breast_cancer dataset:
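The breast cancer dataset ships with scikit-learn; a sketch of the fit follows. The max_iter setting is an assumption added so the current default solver converges on this unscaled data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Default regularization strength C=1; max_iter raised for convergence
logreg = LogisticRegression(max_iter=10_000).fit(X_train, y_train)
print("training set score: %.3f" % logreg.score(X_train, y_train))
print("test set score: %.3f" % logreg.score(X_test, y_test))
```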
The default value of C=1 provides quite good performance, with 95% accuracy on both the training and the test set. As training and test set performance are very close, it is likely that we are underfitting. Let’s try to increase C to fit a more flexible model.
Using C=100 results in higher training set accuracy, and also a slightly increased test set accuracy, confirming our intuition that a more complex model should perform better. We can also investigate what happens if we use an even more regularized model than the default of C=1, by setting C=0.01:
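Both settings can be sketched side by side on the same data (max_iter is again an assumption for convergence):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Less regularization (C=100) and more regularization (C=0.01) than the default
logreg100 = LogisticRegression(C=100, max_iter=10_000).fit(X_train, y_train)
logreg001 = LogisticRegression(C=0.01, max_iter=10_000).fit(X_train, y_train)

print("C=100   train: %.3f  test: %.3f"
      % (logreg100.score(X_train, y_train), logreg100.score(X_test, y_test)))
print("C=0.01  train: %.3f  test: %.3f"
      % (logreg001.score(X_train, y_train), logreg001.score(X_test, y_test)))
```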
As expected, when moving more to the left in Figure model_complexity from an already underfit model, both training and test set accuracy decrease relative to the default parameters. Finally, let's look at the coefficients learned by the models with the three different settings of the regularization parameter C.
As LogisticRegression applies an L2 regularization by default, the result looks similar to Ridge in Figure ridge_coefficients. Stronger regularization pushes coefficients more and more towards zero, though coefficients never become exactly zero. Inspecting the plot more closely, we can also see an interesting effect in the third coefficient, for "mean perimeter". For C=100 and C=1, the coefficient is negative, while for C=0.001, the coefficient is positive, with a magnitude that is even larger than for C=1. Interpreting a model like this, one might think the coefficient tells us which class a feature is associated with. For example, one might think that a high "texture error" feature is related to a sample being "malignant". However, the change of sign in the coefficient for "mean perimeter" means that depending on which model we look at, a high "mean perimeter" could be taken as indicative of either "benign" or "malignant". This illustrates that interpretations of the coefficients of linear models should always be taken with a grain of salt.
If we desire a more interpretable model, using L1 regularization might help, as it limits the model to only using a few features. Here is the coefficient plot and classification accuracies for L1 regularization:
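A sketch of the L1-regularized fit follows. Note that in current scikit-learn, the l1 penalty requires a compatible solver; the choice of solver="liblinear" and the C grid are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# l1 penalty zeroes out coefficients: stronger regularization, fewer features
for C in [0.01, 1, 100]:
    lr_l1 = LogisticRegression(C=C, penalty="l1",
                               solver="liblinear").fit(X_train, y_train)
    print("C=%6.2f  train acc=%.2f  test acc=%.2f  features used=%d"
          % (C, lr_l1.score(X_train, y_train), lr_l1.score(X_test, y_test),
             np.sum(lr_l1.coef_ != 0)))
```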
Linear Models for multiclass classification
Many linear classification models are binary models, and don't extend naturally to the multiclass case (with the exception of logistic regression). A common technique to extend a binary classification algorithm to a multiclass classification algorithm is the one-vs-rest approach. In the one-vs-rest approach, a binary model is learned for each class, which tries to separate this class from all of the other classes, resulting in as many binary models as there are classes.
To make a prediction, all binary classifiers are run on a test point. The classifier that has the highest value of its classification confidence formula

w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b      (3)

"wins", and this class label is returned as the prediction. Having one binary classifier per class results in one vector of coefficients w and one intercept b for each class. The mathematics behind logistic regression are somewhat different from the one-vs-rest approach, but they also result in one coefficient vector and intercept per class, and the same method of making a prediction is applied.
Let’s apply the one-vs-rest method to a simple three-class classification dataset. We use a two-dimensional dataset, where each class is given by data sampled from a Gaussian distribution.
Now, we train a LinearSVC classifier on the dataset.
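A sketch of both steps, using make_blobs to sample the three Gaussian classes (the sample count and seed are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Three Gaussian blobs in two dimensions, one per class
X, y = make_blobs(n_samples=100, centers=3, random_state=42)

linear_svm = LinearSVC().fit(X, y)
print("coefficient shape:", linear_svm.coef_.shape)     # one row per class
print("intercept shape:", linear_svm.intercept_.shape)  # one intercept per class
```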
We see that the shape of coef_ is (3, 2), meaning that each row of coef_ contains the coefficient vector for one of the three classes. Each row has two entries, corresponding to the two features in the dataset. The intercept_ is now a one-dimensional array, storing the intercept for each class. Let's visualize the lines given by the three binary classifiers:
The red line shows the decision boundary for the binary classifier for the red class, and so on. You can see that all the red points in the training data are under the red line, which means they are on the “red” side of this binary classifier. The red points are left of the green line, which means they are classified as “rest” by the binary classifier for the green class. The red points are below the blue line, which means the binary classifier for the blue class also classifies them as “rest”. Therefore, any point in this area will be classified as red by the final classifier (Formula (3) of the red classifier is greater than zero, while it is smaller than zero for the other two classes).
But what about the triangle in the middle of the plot? All three binary classifiers classify points there as "rest". Which class would a point there be assigned to? The answer is the one with the highest value in Formula (3): the class of the closest line. The following figure shows the predictions for all regions of the 2D space:
Strengths, weaknesses and parameters
The main parameter of linear models is the regularization parameter, called alpha in the regression models and C in LinearSVC and LogisticRegression. Large alpha or small C mean simple models. In particular for the regression models, tuning this parameter is quite important. Usually C and alpha are searched for on a logarithmic scale. The other decision you have to make is whether you want to use L1 regularization or L2 regularization. If you assume that only a few of your features are actually important, you should use L1. Otherwise, you should default to L2.
L1 can also be useful if interpretability of the model is important. As L1 will use only a few features, it is easier to explain which features are important to the model, and what the effect of these features is.
Linear models are very fast to train, and also fast to predict. They scale to very large datasets and work well with sparse data. If your data consists of hundreds of thousands or millions of samples, you might want to investigate SGDClassifier and SGDRegressor, which implement even more scalable versions of the linear models described above.
Another strength of linear models is that they make it relatively easy to understand how a prediction is made, using Formula (1) for regression and Formula (2) for classification. Unfortunately, it is often not entirely clear why coefficients are the way they are. This is particularly true if your dataset has highly correlated features; in these cases, the coefficients might be hard to interpret.
Linear models often perform well when the number of features is large compared to the number of samples. They are also often used on very large datasets, simply because other models are not feasible to train. However, on smaller datasets, other models might yield better generalization performance.