程式扎記: [ ML Article Collection ] Bias, Variance, and Overfitting – Machine Learning Overview part 4 of 4

Wednesday, September 28, 2016

Source From Here
Preface
In my final post in the four-part machine learning overview, I talk about the bias-variance tradeoff in machine learning and how it plays a role in models that overfit or underfit. Overfitting and underfitting are both undesirable behaviors that must be countered. First I’ll introduce these topics in general, then I’ll explain how they come into play with logistic regression and random forests.

Bias and variance
Bias and variance can be visualized with a classic example of a dartboard. We have four different dart throwers, each with different combinations of low/high bias and low/high variance. We represent the locations of each of their dart throws as blue dots:


The low bias players’ darts tend to be centered around the center of the dart board, while the high bias players’ darts are centered in a different location. In this case, their darts are “biased” toward the top of the dart board. The high variance dart throwers have their darts more spread out; they are less able to successfully place the dart where they’re aiming (in the case of the biased players, they are aiming at the incorrect location).

When talking about bias and variance in machine learning, we are actually talking about the bias and variance of a type of model. For example, say the bulls-eye represents the “perfect model,” the one that perfectly represents all data that could possibly be generated by the process we are modeling. In reality, we will only have a subset of possible data (our training set), so we will only be able to build an approximate model (a single model corresponds to a single throw of a dart). Also, we do not know what type of model to build (logistic regression, random forest, or another type).

We see that, if either bias or variance is high, a single model (dart throw) can be very far off. In general, there is a tradeoff between bias and variance: model types with high bias can achieve lower variance, while model types with low bias achieve it at the expense of higher variance.
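To make the dartboard picture concrete, here is a minimal simulation sketch. The aim points, spreads, and number of throws are made-up values chosen only for illustration. It measures each player's squared bias (how far the average throw lands from the bulls-eye) and variance (how scattered the throws are around their own average), and shows that the two add up to the average squared distance from the bulls-eye.

```python
import random

def throw_darts(aim, spread, n, rng):
    """Simulate n throws aimed at `aim` = (x, y) with Gaussian scatter `spread`."""
    return [(rng.gauss(aim[0], spread), rng.gauss(aim[1], spread)) for _ in range(n)]

def bias_and_variance(throws, target=(0.0, 0.0)):
    """Return (squared bias, variance, mean squared distance to the target)."""
    n = len(throws)
    mean_x = sum(x for x, _ in throws) / n
    mean_y = sum(y for _, y in throws) / n
    # Squared bias: squared distance from the average throw to the bulls-eye.
    bias_sq = (mean_x - target[0]) ** 2 + (mean_y - target[1]) ** 2
    # Variance: average squared distance from each throw to the average throw.
    variance = sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in throws) / n
    # Total error: average squared distance from each throw to the bulls-eye.
    mse = sum((x - target[0]) ** 2 + (y - target[1]) ** 2 for x, y in throws) / n
    return bias_sq, variance, mse

rng = random.Random(0)
# (aim point, spread) per player; all values are illustrative assumptions.
players = {
    "low bias, low variance": ((0.0, 0.0), 0.2),
    "low bias, high variance": ((0.0, 0.0), 1.0),
    "high bias, low variance": ((0.0, 1.5), 0.2),
    "high bias, high variance": ((0.0, 1.5), 1.0),
}
results = {}
for name, (aim, spread) in players.items():
    results[name] = bias_and_variance(throw_darts(aim, spread, 1000, rng))
    bias_sq, variance, mse = results[name]
    print(f"{name}: bias^2={bias_sq:.3f} variance={variance:.3f} mse={mse:.3f}")
```

For every player the last number equals the sum of the first two, which is why a single throw can be far off when either term is large.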

Overfitting and underfitting
Both high bias and high variance are negative properties for types of models. When we have a model with high bias, we say it “underfits” the data, and if we have a model with high variance and low bias, we say it “overfits” the data. We’ll see why in this section. Let’s say we have a set of n data pairs (x1, y1), (x2, y2), …, (xn, yn). Our goal is to choose a function that represents the value of y given x. So, we want to find a function f(x) that closely “fits” the data (we want the error | f(xi) – yi | to be small for every training example pair (xi, yi)). This type of learning problem is called regression.

Regression is different from classification, where we only want to predict a discrete label (e.g. lead will convert vs lead won’t convert). In regression, we want to predict a number (y) instead. We now see where logistic regression gets its name, since we are fitting a logistic function to a set of points (although in this case the resulting model is used for classification, not regression). Overfitting and underfitting also apply to classification, but they are easier to visualize in the case of regression.

Assume we want to represent f(x) as a polynomial of a certain degree. If we choose degree d, then we have a family of possible models to choose from, represented by the equation below, where the a_i’s are the coefficients of the different terms:

f(x) = a_0 + a_1 x + a_2 x^2 + … + a_d x^d


As the degree gets higher, the function becomes more complicated. A degree one polynomial is a line, degree two is a parabola, etc. So let’s say we try to fit three different types of model to our dataset: a degree 1 polynomial, a degree 2 polynomial, and a degree 5 polynomial.
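The three fits described above can be sketched in plain Python. This is only an illustration under stated assumptions: the six data points are made up, and the least-squares fit uses the normal equations with Gaussian elimination, which is fine for a tiny example but is not what a numerical library would do for larger problems.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients [a_0, a_1, ..., a_degree]."""
    size = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the monomial basis.
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(size)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(size)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, size):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, size + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution.
    coeffs = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(rows[i][j] * coeffs[j] for j in range(i + 1, size))
        coeffs[i] = (rows[i][size] - tail) / rows[i][i]
    return coeffs

def predict(coeffs, x):
    return sum(a * x ** i for i, a in enumerate(coeffs))

def mean_squared_error(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Six made-up points (n = 6), so a degree 5 polynomial can pass through all of them.
xs = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
ys = [0.9, 0.1, -0.2, 0.1, 0.2, 1.1]

training_error = {
    d: mean_squared_error(fit_polynomial(xs, ys, d), xs, ys) for d in (1, 2, 5)
}
for d, err in training_error.items():
    print(f"degree {d}: training MSE = {err:.6f}")
```

The training error shrinks as the degree grows and is essentially zero at degree 5, since six coefficients are enough to hit six points exactly.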


The rightmost curve (degree 5) fits the data perfectly, so it is not biased. In fact, for a dataset of size n we can always find a polynomial of degree n-1 (or higher) that passes through every point (provided yi = yj whenever xi = xj). However, when we look at the right curve, the behavior of the polynomial is erratic, and its behavior at the very left end seems different from the general pattern in the data. It has worked too hard to fit every single point, and has “overfit” the data. When we have to make a prediction in the future, it is unlikely that our function will give an accurate prediction, especially if x < 1.

If we were to move a single data point very slightly, the change in the right curve would need to be dramatic in order to still fit all the points. Therefore, this type of model has low bias but high variance. On the other hand, a degree one polynomial (the leftmost) has high bias, since the model is less flexible. No matter how dramatically the dataset “curves,” we will only be able to draw a straight line. It “underfits” the data. However, if we slightly move one of the data points, there will be very little change in the line that we have drawn. Therefore, the high bias linear model has low variance.
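The effect of nudging one point can be checked with a small sketch. The data below are made up, and the interpolating polynomial is evaluated with the Lagrange formula rather than by solving for coefficients. We move one y value by 0.1 and compare how far each model's prediction shifts at x = 0.5, just to the left of the data.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line; returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    return y_mean - slope * x_mean, slope

def line_predict(xs, ys, x):
    intercept, slope = fit_line(xs, ys)
    return intercept + slope * x

def interpolate(xs, ys, x):
    """Evaluate the degree n-1 polynomial through all n points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]   # made-up, roughly linear data
nudged = ys.copy()
nudged[2] += 0.1                      # move a single point slightly

query = 0.5                           # left of the data, where x < 1
line_shift = abs(line_predict(xs, nudged, query) - line_predict(xs, ys, query))
poly_shift = abs(interpolate(xs, nudged, query) - interpolate(xs, ys, query))
print(f"line prediction moved by     {line_shift:.3f}")
print(f"degree 5 prediction moved by {poly_shift:.3f}")
```

On this toy data the degree 5 prediction moves more than ten times as far as the line's prediction, which is the high variance behavior described above.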

In this case, the center curve represents a pretty good fit that does not obviously over- or underfit the data.

Overfitting is usually a bigger problem in machine learning than underfitting. It is often the case that there is noise (small, random errors) in the data. Overfitting the data causes the model to fit the noise, rather than the actual underlying behavior. Models with less variance will be more robust (fit a better model) in the presence of noisy data.


For sales data, noisy data could result from incorrectly filled-out data fields within Salesforce. At Fliptop, we have to make sure our machine learning algorithms are robust to any errors that your SDRs may have made when inputting data on the lead record!

So, how do we know when we are overfitting the data?
The more complex the type of model we choose, the less error there will be on the training set. As we saw in the above regression example, if we choose a complex enough function, the error will go to zero. There are several methods to reduce overfitting, including the bootstrap aggregation technique (explained in the previous post about random forests). One of the simplest and most effective techniques to detect overfitting is to perform cross validation.

Cross validation involves using an additional set of data, called a validation set. This set can be constructed from the training data by randomly splitting it into two subsets: a (smaller) training set and a validation set. For example, using 75% of the data for training and 25% of the data for validation is common. Only the smaller training set is used to fit the model. The validation set is used to see how well the model generalizes to unseen data. After all, these models will ultimately be used to predict future, unknown data.

So instead of just looking at training error, we now look at both training AND validation error (the validation error is our estimate of the generalization error). We can tell from these two error values if a model is overfitting or underfitting.
* Underfitting – Validation error and training error are both high
* Overfitting – Validation error is high while training error is low
* Good fit – Validation error is low, and only slightly higher than the training error
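These rules of thumb can be written down as a small helper. This is a sketch only: the function name and both thresholds are hypothetical, and what counts as a “high” error or a “slight” gap depends on the problem and the error metric.

```python
def diagnose(training_error, validation_error, high_error=0.2, max_gap=0.05):
    """Label a fit using the three rules of thumb listed above.

    `high_error` and `max_gap` are illustrative thresholds, not standard values.
    """
    if training_error >= high_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if validation_error >= high_error or validation_error - training_error > max_gap:
        # Low training error, but the model does not generalize.
        return "overfitting"
    # Validation error is low and only slightly above the training error.
    return "good fit"

for train_err, val_err in [(0.30, 0.32), (0.01, 0.35), (0.05, 0.07)]:
    print(f"train={train_err:.2f} validation={val_err:.2f} -> {diagnose(train_err, val_err)}")
```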

The figure below shows the relationship between model complexity and training and validation errors. The optimal model has the lowest generalization error, and is marked by a dashed line.


Another good practice is to repeat the same analysis with several random training/validation splits, to see the variance in the training and validation errors. This helps ensure we didn’t just get lucky with an overfitting model.
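As a rough sketch of this practice, the code below repeats a 75/25 split ten times on made-up noisy data, fits a simple straight line each time (standing in for whatever model is being evaluated), and reports the mean and spread of the validation error. The data, the number of splits, and the choice of model are all assumptions for illustration.

```python
import random
import statistics

def fit_line(points):
    """Closed-form least-squares line; returns (intercept, slope)."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in points) / sxx
    return y_mean - slope * x_mean, slope

def mse(model, points):
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in points) / len(points)

def repeated_split_errors(points, n_splits=10, validation_fraction=0.25):
    """Fit on several random splits; return a list of (training, validation) errors."""
    errors = []
    n_validation = int(round(len(points) * validation_fraction))
    for seed in range(n_splits):
        shuffled = list(points)
        random.Random(seed).shuffle(shuffled)   # a different random split per seed
        validation, training = shuffled[:n_validation], shuffled[n_validation:]
        model = fit_line(training)              # only the training part is used to fit
        errors.append((mse(model, training), mse(model, validation)))
    return errors

noise = random.Random(42)
points = [(x / 4, 3.0 + 0.5 * (x / 4) + noise.gauss(0, 0.3)) for x in range(40)]
errors = repeated_split_errors(points)
validation_errors = [v for _, v in errors]
print(f"validation MSE: mean={statistics.mean(validation_errors):.3f} "
      f"stdev={statistics.stdev(validation_errors):.3f}")
```

A large spread across splits is a hint that a single split's numbers should not be trusted on their own.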

Logistic regression
Logistic regression has high bias and low variance. It is analogous to the degree one polynomial above in the regression example. Instead of representing the data points as a line, it tries to draw a separating line between two classes. Its decision boundary is always linear. Another way to think of this is, when logistic regression identifies a particular feature as a positive signal, then a higher value for that feature will always yield a higher predicted probability, with all other features held constant.

For example, marketing automation software tracks the amount of interaction between leads and marketing content, such as webinars. Say a logistic regression model identifies the number of webinar visits as a positive signal. That means that the more webinars a lead visits, the higher the conversion probability that logistic regression will predict. However, it may in fact be that having TOO MANY webinar visits is a negative signal, and could reflect the behavior of a competitor or student who is investigating the marketing activities of your company. If this is the case, logistic regression would underfit the data by not capturing this nonlinear relationship between webinar visits and conversion.
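The monotonic behavior is easy to see in a one-feature sketch. The weight and intercept below are made-up numbers, not values from any real model.

```python
import math

def predict_conversion(webinar_visits, weight=0.4, intercept=-2.0):
    """One-feature logistic regression; `weight` and `intercept` are illustrative."""
    return 1.0 / (1.0 + math.exp(-(intercept + weight * webinar_visits)))

visit_counts = list(range(0, 21, 5))
probabilities = [predict_conversion(v) for v in visit_counts]
for visits, p in zip(visit_counts, probabilities):
    print(f"{visits:2d} webinar visits -> predicted conversion probability {p:.3f}")
```

With a positive weight the predicted probability can only go up as visits increase, so no choice of weight lets this model express “some visits are good, too many are suspicious.”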

Even though logistic regression has high bias and can underfit the data, this does not mean that logistic regression cannot overfit the data. In fact, overfitting can be a serious problem with logistic regression. For example, say that in your CRM database, all leads that converted were from California. Then the learning algorithm could identify location = California as a strongly positive signal, so positive that it overwhelms all other features, and becomes like a decision tree that predicts conversion if and only if location = California. Our model will have very good training and validation errors, but just because of an irregularity in our training set.

To correct this, we can regularize our model. Regularization prevents any single feature from being given too positive or too negative of a weight. The strength of regularization can be finely tuned. More regularization means more bias and less variance, while less regularization means less bias and more variance. In the extreme case, if regularization is too strong, the model will essentially ignore all the features, and not learn anything from the data.
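Here is a minimal sketch of that effect, assuming an L2 penalty (one common form of regularization; the text above does not say which form is meant) and plain gradient descent. The toy data mimic the California example: every converted lead is from California.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, l2_strength, learning_rate=0.5, steps=2000):
    """Gradient descent for one-feature logistic regression with an L2 penalty.

    Minimizes mean log loss + (l2_strength / 2) * weight**2.
    The intercept is left unpenalized.
    """
    weight, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [sigmoid(intercept + weight * x) - y for x, y in zip(xs, ys)]
        grad_weight = sum(r * x for r, x in zip(residuals, xs)) / n + l2_strength * weight
        grad_intercept = sum(residuals) / n
        weight -= learning_rate * grad_weight
        intercept -= learning_rate * grad_intercept
    return weight, intercept

# x = 1 if the lead is from California, y = 1 if the lead converted (toy data).
is_california = [1, 1, 1, 0, 0, 0]
converted = [1, 1, 1, 0, 0, 0]

weights = {}
for strength in (0.0, 0.1, 1.0):
    weights[strength], _ = train_logistic(is_california, converted, strength)
    print(f"L2 strength {strength}: weight on California = {weights[strength]:.2f}")
```

Without a penalty the weight keeps growing for as long as we train, because the feature separates the toy data perfectly. Stronger penalties pull it toward zero, trading variance for bias.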

Decision trees
Decision trees in general have low bias and high variance. We can think about it like this: given a training set, we can keep asking questions until we are able to distinguish between ALL examples in the data set. We could keep asking questions until there is only a single example in each leaf. Since this allows us to correctly classify all elements in the training set, the tree is unbiased. However, there are many possible trees that could distinguish between all elements, which means higher variance. Therefore, in order to reduce the variance of a single decision tree, we usually place a restriction on the number of questions asked in a tree. This is true both for single decision trees and ensemble models of decision trees (like random forests).
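A toy tree on a single numeric feature shows the effect of limiting the number of questions. This is a simplified sketch, not a production algorithm: it picks each threshold by counting misclassifications, and the data are made up.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(points, max_depth):
    """Grow a tree on (x, label) pairs, asking at most `max_depth` questions per path."""
    labels = [label for _, label in points]
    xs = sorted({x for x, _ in points})
    if max_depth == 0 or len(set(labels)) == 1 or len(xs) < 2:
        return majority(labels)                 # leaf: predict the majority label
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2               # candidate question: "is x <= threshold?"
        left = [p for p in points if p[0] <= threshold]
        right = [p for p in points if p[0] > threshold]
        errors = sum(
            len(side) - Counter(label for _, label in side).most_common(1)[0][1]
            for side in (left, right)
        )
        if best is None or errors < best[0]:
            best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return (threshold, build_tree(left, max_depth - 1), build_tree(right, max_depth - 1))

def classify(tree, x):
    while isinstance(tree, tuple):              # internal nodes are tuples, leaves are labels
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def training_accuracy(tree, points):
    return sum(classify(tree, x) == label for x, label in points) / len(points)

points = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 0), (8, 1)]
shallow = build_tree(points, max_depth=1)
deep = build_tree(points, max_depth=len(points))   # effectively unrestricted
print("one question:  training accuracy", training_accuracy(shallow, points))
print("unrestricted:  training accuracy", training_accuracy(deep, points))
```

The unrestricted tree classifies every training example correctly, including the two points that look like noise, while the one-question tree ignores them.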

Random Forests
As mentioned in the previous post on random forests, the randomization used to build decision trees within the forest helps combat overfitting. These types of randomization are:
* Bootstrap aggregation – randomizing the subset of examples used to build a tree
* Randomizing features – randomizing the subset of features used when asking questions
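Both kinds of randomization are short to sketch with the standard library. The example list, feature names, and subset size below are hypothetical placeholders.

```python
import random

def bootstrap_sample(examples, rng):
    """Draw len(examples) examples with replacement (bootstrap aggregation)."""
    return rng.choices(examples, k=len(examples))

def random_feature_subset(feature_names, n_features, rng):
    """Pick the features that may be considered for one question, without replacement."""
    return rng.sample(feature_names, n_features)

rng = random.Random(0)
examples = list(range(10))   # stand-ins for training examples
feature_names = ["title", "industry", "location", "webinar_visits"]  # hypothetical

for tree_index in range(3):
    sample = bootstrap_sample(examples, rng)
    features = random_feature_subset(feature_names, 2, rng)
    print(f"tree {tree_index}: examples={sorted(sample)} features={features}")
```

Because sampling is done with replacement, each tree sees some examples more than once and misses others entirely, so no two trees are trained on quite the same data.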

When making predictions, we average the predictions of all the different trees. Because we are taking the average of low bias models, the average model will also have low bias. However, the average will have lower variance than any single tree. In this way, ensemble methods can reduce variance without increasing bias. This is why ensemble methods are so popular.
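A quick simulation illustrates why averaging helps. It makes a simplifying assumption: each “model” is an independent noisy guess centered on the true value. Real trees in a forest are trained on overlapping data and are correlated, so the reduction in practice is smaller than what this idealized sketch shows.

```python
import random
import statistics

def single_prediction(rng, truth=1.0, noise=0.5):
    """One low bias, high variance model: right on average, noisy individually."""
    return rng.gauss(truth, noise)

def ensemble_prediction(rng, n_models):
    """Average the predictions of n_models independent noisy models."""
    return statistics.mean([single_prediction(rng) for _ in range(n_models)])

rng = random.Random(0)
singles = [single_prediction(rng) for _ in range(2000)]
ensembles = [ensemble_prediction(rng, 25) for _ in range(2000)]

print(f"single model:     mean={statistics.mean(singles):.3f} "
      f"variance={statistics.pvariance(singles):.4f}")
print(f"25-model average: mean={statistics.mean(ensembles):.3f} "
      f"variance={statistics.pvariance(ensembles):.4f}")
```

Both means stay close to the true value of 1.0, so bias is unchanged, while the variance of the averaged prediction is far smaller.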

In the image below, we see the validation error plotted as a function of the number of trees for three types of ensemble methods using decision trees. In each case, adding more trees improves the validation error. The yellow line corresponds to using bootstrap aggregation alone (a.k.a. “bagging”), the blue line is a random forest, and the green line is gradient boosting, a more advanced ensemble technique. Of these, we use random forest and gradient boosting to build models at Fliptop.


Supplement
Machine Learning Overview Part 1 of 4
Machine Learning Overview Part 2 of 4 – Logistic Regression
Machine Learning Overview Part 3 of 4 – Decision Trees and Random Forests
In our third post on the four part series on how machine learning works I’m going to explain two types of models that we use for our customers; decision trees and random forests. Random forests are more advanced learning models that are capable of creating more complex decision boundaries than logistic regression (covered in our last post). The “forest” part of the name means that it is made up of multiple decision trees. I will explain decision trees first, then talk about random forest

