The methods we have been looking at for decision trees and rules work most naturally with nominal attributes. They can be extended to numeric attributes either by incorporating numeric-value tests directly into the decision tree or rule induction scheme, or by prediscretizing numeric attributes into nominal ones. We will see how in Chapters 6 and 7, respectively. However, there are methods that work most naturally with numeric attributes.We look at simple ones here, ones that form components of more complex learning methods, which we will examine later.
Numeric prediction: Linear regression (p153)
When the outcome, or class, is numeric, and all the attributes are numeric, linear regression is a natural technique to consider. This is a staple method in statistics. The idea is to express the class as a linear combination of the attributes, with predetermined weights:
where x is the class; a1, a2, . . ., ak are the attribute values; and w0, w1, . . ., wk are weights.
The weights are calculated from the training data. Here the notation gets a little heavy, because we need a way of expressing the attribute values for each training instance. The first instance will have a class, say x^(1), and attribute values a1^(1), a2^(1), . . ., ak^(1), where the superscript denotes that it is the first example. Moreover, it is notationally convenient to assume an extra attribute a0 whose value is always 1.
The predicted value for the first instance’s class can be written as:
This is the predicted, not the actual, value for the first instance’s class. Of interest is the difference between the predicted and the actual values. The method of linear regression is to choose the coefficients wj—there are k + 1 of them—to minimize the sum of the squares of these differences over all the training instances. Suppose there are n training instances; denote the ith one with a superscript (i). Then the sum of the squares of the differences is:
where the expression inside the parentheses is the difference between the ith instance’s actual class and its predicted class. This sum of squares is what we have to minimize by choosing the coefficients appropriately.
This is all starting to look rather formidable. However, the minimization technique is straightforward if you have the appropriate math background. Suffice it to say that given enough examples—roughly speaking, more examples than attributes—choosing weights to minimize the sum of the squared differences is really not difficult. It does involve a matrix inversion operation, but this is readily available as prepackaged software.
Once the math has been accomplished, the result is a set of numeric weights, based on the training data, which we can use to predict the class of new instances. We saw an example of this when looking at the CPU performance data, and the actual numeric weights are given in Figure 3.7(a). This formula can be used to predict the CPU performance of new test instances.
Figure 3.7 Models for the CPU performance data: (a) linear regression, (b) regression tree
Linear regression is an excellent, simple method for numeric prediction, and it has been widely used in statistical applications for decades. Of course, linear models suffer from the disadvantage of, well, linearity. If the data exhibits a nonlinear dependency, the best-fitting straight line will be found, where “best” is interpreted as the least mean-squared difference. This line may not fit very well. However, linear models serve well as building blocks for more complex learning methods.
Linear classification: Logistic regression
Linear regression can easily be used for classification in domains with numeric attributes. Indeed, we can use any regression technique, whether linear or nonlinear, for classification. The trick is to perform a regression for each class, setting the output equal to one for training instances that belong to the class and zero for those that do not. The result is a linear expression for the class. Then, given a test example of unknown class, calculate the value of each linear expression and choose the one that is largest. This method is sometimes called multiresponse linear regression.
One way of looking at multiresponse linear regression is to imagine that it approximates a numeric membership function for each class. The membership function is 1 for instances that belong to that class and 0 for other instances. Given a new instance we calculate its membership for each class and select the biggest.
Multiresponse linear regression often yields good results in practice. However, it has two drawbacks. First, the membership values it produces are not proper probabilities because they can fall outside the range 0 to 1. Second, least squares regression assumes that the errors are not only statistically independent, but are also normally distributed with the same standard deviation, an assumption that is blatantly violated when the method is applied to classification problems because the observations only ever take on the values 0 and 1.
A related statistical technique called logistic regression does not suffer from these problems. Instead of approximating the 0 and 1 values directly, thereby risking illegitimate probability values when the target is overshot, logistic regression builds a linear model based on a transformed target variable.
Suppose first that there are only two classes. Logistic regression replaces the original target variable:
which cannot be approximated accurately using a linear function, with:
The resulting values are no longer constrained to the interval from 0 to 1 but can lie anywhere between negative infinity and positive infinity. Figure 4.9(a) plots the transformation function, which is often called the logit transformation.
Figure 4.9 Logistic regression: (a) the logit transform and (b) an example logistic regression function.
The transformed variable is approximated using a linear function just like the ones generated by linear regression. The resulting model is:
with weights w. Figure 4.9(b) shows an example of this function in one dimension, with two weights w0 = 0.5 and w1 = 1.
Just as in linear regression, weights must be found that fit the training data well. Linear regression measures the goodness of fit using the squared error. In logistic regression the log-likelihood of the model is used instead. This is given by:
The weights wi need to be chosen to maximize the log-likelihood. There are several methods for solving this maximization problem. A simple one is to iteratively solve a sequence of weighted least-squares regression problems until the log-likelihood converges to a maximum, which usually happens in a few iterations.
A conceptually simpler, and very general, way to address multiclass problems is known as pairwise classification. Here a classifier is built for every pair of classes, using only the instances from these two classes. The output on an unknown test example is based on which class receives the most votes. This method generally yields accurate results in terms of classification error. It can also be used to produce probability estimates by applying a method called pairwise coupling, which calibrates the individual probability estimates from the different classifiers.
The use of linear functions for classification can easily be visualized in instance space. The decision boundary for two-class logistic regression lies where the prediction probability is 0.5, that is:
This occurs when
Because this is a linear equality in the attribute values, the boundary is a linear plane, or hyperplane, in instance space. It is easy to visualize sets of points that cannot be separated by a single hyperplane, and these cannot be discriminated correctly by logistic regression.
Linear classification using the perceptron (p157)
Logistic regression attempts to produce accurate probability estimates by maximizing the probability of the training data. Of course, accurate probability estimates lead to accurate classifications. However, it is not necessary to perform probability estimation if the sole purpose of the model is to predict class labels. A different approach is to learn a hyperplane that separates the instances pertaining to the different classes—let’s assume that there are only two of them. If the data can be separated perfectly into two groups using a hyperplane, it is said to be linearly separable. It turns out that if the data is linearly separable, there is a very simple algorithm for finding a separating hyperplane.
The algorithm is called the perceptron learning rule. Before looking at it in detail, let’s examine the equation for a hyperplane again:
Here, a1, a2, . . ., ak are the attribute values, and w0, w1, . . ., wk are the weights that define the hyperplane. We will assume that each training instance a1, a2, . . . is extended by an additional attribute a0 that always has the value 1 (as we did in the case of linear regression). This extension, which is called the bias, just means that we don’t have to include an additional constant element in the sum. If the sum is greater than zero, we will predict the first class; otherwise, we will predict the second class.We want to find values for the weights so that the training data is correctly classified by the hyperplane.
Figure 4.10(a) gives the perceptron learning rule for finding a separating hyperplane. The algorithm iterates until a perfect solution has been found, but it will only work properly if a separating hyperplane exists, that is, if the data is linearly separable. Each iteration goes through all the training instances. If a misclassified instance is encountered, the parameters of the hyperplane are changed so that the misclassified instance moves closer to the hyperplane or maybe even across the hyperplane onto the correct side. If the instance belongs to the first class, this is done by adding its attribute values to the weight vector; otherwise, they are subtracted from it.
To see why this works, consider the situation after an instance a pertaining to the first class has been added:
This number is always positive. Thus the hyperplane has moved in the correct direction for classifying instance a as positive. Conversely, if an instance belonging to the second class is misclassified, the output for that instance decreases after the modification, again moving the hyperplane to the correct direction.
These corrections are incremental and can interfere with earlier updates. However, it can be shown that the algorithm converges in a finite number of iterations if the data is linearly separable. Of course, if the data is not linearly separable, the algorithm will not terminate, so an upper bound needs to be imposed on the number of iterations when this method is applied in practice.
The resulting hyperplane is called a perceptron, and it’s the grandfather of neural networks (we return to neural networks in Section 6.3). Figure 4.10(b) represents the perceptron as a graph with nodes and weighted edges, imaginatively termed a “network” of “neurons.” There are two layers of nodes: input and output. The input layer has one node for every attribute, plus an extra node that is always set to one. The output layer consists of just one node. Every node in the input layer is connected to the output layer. The connections are weighted, and the weights are those numbers found by the perceptron learning rule.
When an instance is presented to the perceptron, its attribute values serve to “activate” the input layer. They are multiplied by the weights and summed up at the output node. If the weighted sum is greater than 0 the output signal is 1, representing the first class; otherwise, it is -1, representing the second.
Linear classification using Winnow (p159)
The perceptron algorithm is not the only method that is guaranteed to find a separating hyperplane for a linearly separable problem. For datasets with binary attributes there is an alternative known as Winnow, shown in Figure 4.11(a). The structure of the two algorithms is very similar. Like the perceptron,Winnow only updates the weight vector when a misclassified instance is encountered— it is mistake driven.
Figure 4.11 The Winnow algorithm: (a) the unbalanced version and (b) the balanced version.
The two methods differ in how the weights are updated. The perceptron rule employs an additive mechanism that alters the weight vector by adding (or subtracting) the instance’s attribute vector.Winnow employs multiplicative updates and alters weights individually by multiplying them by the user-specified parameter α (or its inverse). The attribute values ai are either 0 or 1 because we are working with binary data.Weights are unchanged if the attribute value is 0, because then they do not participate in the decision. Otherwise, the multiplier is α if that attribute helps to make α correct decision and 1/α if it does not.
Another difference is that the threshold in the linear function is also a userspecified parameter.We call this threshold θ and classify an instance as belonging to class 1 if and only if
The multiplier α needs to be greater than one. The wi are set to a constant at the start.
The algorithm we have described doesn’t allow negative weights, which— depending on the domain—can be a drawback. However, there is a version, called Balanced Winnow, which does allow them. This version maintains two weight vectors, one for each class. An instance is classified as belonging to class 1 if:
Figure 4.11(b) shows the balanced algorithm.
Winnow is very effective in homing in on the relevant features in a dataset— therefore it is called an attribute-efficient learner. That means that it may be a good candidate algorithm if a dataset has many (binary) features and most of them are irrelevant. Both winnow and the perceptron algorithm can be used in an online setting in which new instances arrive continuously, because they can incrementally update their hypotheses as new instances arrive.
From GML package, we can leverage class PLA to carry out the Perceptron Learning Algorithm to solve binary classification problem. Below is the sample code:
Regarding to Linear Regression, we can use class LinearRG from GML package to predict numeric value. Sample code as below:
Finally, we can use class LogisticRG from GML package to run Logistric Regression. Sample code as below:
* [ ML In Action ] Logistic Regression
* [ ML In Action ] Predicting numeric values : regression - Linear regression (1)
* Coursera - 機器學習基石: Perception Hypothesis Set
* Coursera - 機器學習基石: Perception Learning Algorithm (PLA)
* Coursera - 機器學習基石: Guarantee of PLA
* Coursera - 機器學習基石: Perceptron on non-linear separable dataset