Wednesday, February 15, 2017

[ Intro2ML ] Ch3. Unsupervised Learning and Preprocessing - Types & Preprocessing

Preface 
The second family of machine learning algorithms that we will discuss is unsupervised learning. Unsupervised learning subsumes all kinds of machine learning where there is no known output, no teacher to instruct the learning algorithm. In unsupervised learning, the learning algorithm is just shown the input data, and asked to extract knowledge from this data. 

Types of unsupervised learning 
We will look into two kinds of unsupervised learning in this chapter: transformations of the dataset, and clustering. 

Unsupervised transformations of a dataset are algorithms that create a new representation of the data which might be easier for humans or other machine learning algorithms to understand. A common application of unsupervised transformations is dimensionality reduction, which takes a high-dimensional representation of the data, consisting of many features, and finds a new way to represent this data that summarizes its essential characteristics with fewer features. A common application of dimensionality reduction is reduction to two dimensions for visualization purposes. 
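
As a quick preview of what such a transformation looks like in code (PCA itself is covered later in this chapter), here is a minimal sketch that reduces the cancer dataset from 30 features down to 2:
  # A minimal sketch of dimensionality reduction for visualization.
  # PCA is only previewed here; it is discussed in detail later in this chapter.
  from sklearn.datasets import load_breast_cancer
  from sklearn.decomposition import PCA

  cancer = load_breast_cancer()
  pca = PCA(n_components=2)                  # keep only two components
  X_2d = pca.fit_transform(cancer.data)      # fit PCA and transform the data
  print("original shape: %s" % (cancer.data.shape,))   # (569, 30)
  print("reduced shape: %s" % (X_2d.shape,))            # (569, 2)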

Another application for unsupervised transformations is finding the parts or components that “make up” the data. An example of this is topic extraction on collections of text documents. Here, the task is to find the unknown topics that are talked about in the collection, and to learn which topics appear in each document. This can be useful for tracking the discussion of themes like elections, gun control, or pop stars on social media. 
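
Sketched in code, topic extraction with a decomposition method such as NMF (covered later in this chapter) might look as follows; the toy documents and the choice of two topics are made up purely for illustration:
  # A rough sketch of topic extraction on a toy corpus using NMF.
  # The documents and the number of topics are invented for illustration only.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.decomposition import NMF

  docs = ["the election results were announced today",
          "the candidate spoke about gun control",
          "the pop star released a new single",
          "voters discussed gun control before the election"]

  X = TfidfVectorizer().fit_transform(docs)    # represent documents as a term matrix
  nmf = NMF(n_components=2, random_state=0)    # look for two "topics"
  doc_topics = nmf.fit_transform(X)            # topic weights for each document
  print("document-topic matrix shape: %s" % (doc_topics.shape,))   # (4, 2)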

Clustering algorithms, on the other hand, partition data into distinct groups of similar items. 

Consider the example of uploading photos to a social media site. To allow you to organize your pictures, the site might want to group together pictures that show the same person. However, the site doesn’t know which pictures show whom, and it doesn’t know how many different people appear in your photo collection. A sensible approach would be to extract all faces, and divide them into groups of faces that look similar. Hopefully, these correspond to the same person, and can be grouped together for you. 
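
In code, a clustering algorithm such as k-means (covered later in this chapter) simply assigns a group label to each sample. A minimal sketch, using synthetic blobs to stand in for the extracted face features:
  # A minimal clustering sketch: k-means assigns each sample to one of k groups.
  # Synthetic blobs stand in for face features here; k-means is covered later on.
  from sklearn.datasets import make_blobs
  from sklearn.cluster import KMeans

  X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
  kmeans = KMeans(n_clusters=3, random_state=0)
  labels = kmeans.fit_predict(X)     # one cluster label per sample
  print(labels[:10])                 # group membership, e.g. [0 2 1 ...]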

Challenges in unsupervised learning 
A major challenge in unsupervised learning is evaluating whether the algorithm learned something useful. Unsupervised learning algorithms are usually applied to data that does not contain any label information, so we don’t know what the right output should be. Therefore it is very hard to say whether a model “did well”. For example, the clustering algorithm could have grouped all face pictures that are shown in profile together, and all the face pictures that are face-forward together. 

This would certainly be a possible way to divide a collection of face pictures, but not the one we were looking for. However, there is no way for us to “tell” the algorithm what we are looking for, and often the only way to evaluate the result of an unsupervised algorithm is to inspect it manually. 

As a consequence, unsupervised algorithms are often used in an exploratory setting, when a data scientist wants to understand the data better, rather than as part of a larger automatic system. Another common application for unsupervised algorithms is as a preprocessing step for supervised algorithms. Learning a new representation of the data can sometimes improve the accuracy of supervised algorithms, or can lead to reduced memory and time consumption. 

Before we start with “real” unsupervised algorithms, we will briefly discuss some simple preprocessing methods that often come in handy. Even though preprocessing and scaling are often used in tandem with supervised learning algorithms, scaling methods don’t make use of the supervised information, making them unsupervised. 

Preprocessing and Scaling 
In the last chapter we saw that some algorithms, like neural networks and SVMs, are very sensitive to the scaling of the data. Therefore a common practice is to adjust the features so that the data representation is more suitable for these algorithms. Often, this is a simple per-feature rescaling and shift of the data. A simple example is shown in Figure scaling_data. 
  mglearn.plots.plot_scaling()
  plt.suptitle("scaling_data");


Different kinds of preprocessing 
The first plot shows a synthetic two-class classification dataset with two features. The first feature (the x-axis value) is between 10 and 15. The second feature (the y-axis value) is between around 1 and 9. The other four plots show four different ways to transform the data that yield more standard ranges. 

The StandardScaler in scikit-learn ensures that for each feature, the mean is zero and the variance is one, bringing all features to the same magnitude. However, this scaling does not ensure any particular minimum and maximum values for the features.

The RobustScaler works similarly to the StandardScaler in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler uses the median and quartiles (Footnote: the median of a set of numbers is the number x such that half of the numbers are smaller than x and half of the numbers are larger than x. The lower quartile is the number x such that 1/4th of the numbers are smaller than x, and the upper quartile is the number x such that 1/4th of the numbers are larger than x), instead of the mean and variance. This makes the RobustScaler ignore data points that are very different from the rest (like measurement errors). These odd data points are also called outliers, and can often lead to trouble for other scaling techniques. 
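
A small sketch on made-up numbers (the last value is a deliberate outlier) shows how differently the two scalers react:
  # A small sketch of how StandardScaler and RobustScaler treat an outlier.
  # The data is made up; the last value (1000) is a deliberate outlier.
  import numpy as np
  from sklearn.preprocessing import StandardScaler, RobustScaler

  X = np.array([[1.], [2.], [3.], [4.], [1000.]])
  print(StandardScaler().fit_transform(X).ravel())
  # the outlier inflates the mean and variance, squashing the normal points together
  print(RobustScaler().fit_transform(X).ravel())
  # the median and quartiles are barely affected, so the normal points stay spread out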

The MinMaxScaler, on the other hand, shifts the data such that all features are exactly between 0 and 1. For the two-dimensional dataset this means all of the data is contained within the rectangle created by the x-axis between 0 and 1 and the y-axis between 0 and 1.

Finally, the Normalizer does a very different kind of rescaling. It scales each data point such that the feature vector has a Euclidean length of one. In other words, it projects a data point on the circle (or sphere, in the case of higher dimensions) with a radius of 1. This means every data point is scaled by a different number (the inverse of its length). This normalization is often used when only the direction (or angle) of the data matters, not the length of the feature vector. 
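
A short sketch on three made-up points makes the difference concrete: the MinMaxScaler rescales each feature column, while the Normalizer rescales each row to unit Euclidean length:
  # A short sketch contrasting MinMaxScaler (per feature) with Normalizer (per sample).
  # The three points below are made up.
  import numpy as np
  from sklearn.preprocessing import MinMaxScaler, Normalizer

  X = np.array([[1., 10.], [2., 20.], [3., 60.]])
  print(MinMaxScaler().fit_transform(X))     # each column now spans exactly 0..1
  X_norm = Normalizer().fit_transform(X)
  print(np.linalg.norm(X_norm, axis=1))      # each row now has Euclidean length 1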

Applying data transformations 
After seeing what the different kinds of transformations do, let’s apply them using scikit-learn. We will use the cancer dataset that we saw in chapter 2. Preprocessing methods like the scalers are usually applied before applying a supervised machine learning algorithm. As an example, say we want to apply the kernel SVM (SVC) to the cancer dataset, and use MinMaxScaler for preprocessing the data. We start by loading and splitting our dataset into a training set and a test set. We need a separate training and test set to evaluate the supervised model we will build after the preprocessing: 
- ch2_t40.py 
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  import numpy as np

  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

  print("X_train.shape=%s" % (str(X_train.shape)))
  print("X_test.shape=%s" % (str(X_test.shape)))
Execution output: 
X_train.shape=(426, 30)
X_test.shape=(143, 30)

As with the supervised models we built earlier, we first import the class implementing the preprocessing, and then instantiate it: 
>>> from ch2_t40 import *
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()

We then fit the scaler using the fit method, applied to the training data. For the MinMaxScaler, the fit method computes the minimum and maximum value of each feature on the training set. In contrast to the classifiers and regressors of chapter 2, the scaler is only provided with the data X_train when fit is called, and y_train is not used: 
>>> scaler.fit(X_train)
MinMaxScaler(copy=True, feature_range=(0, 1))

To apply the transformation that we just learned, that is, to actually scale the training data, we use the transform method of the scaler. The transform method is used whenever a model returns a new representation of the data: 
>>> np.set_printoptions(suppress=True, precision=2) # don't print using scientific notation
>>> X_train_scaled = scaler.transform(X_train)
>>> print("transformed shape: %s" % (X_train_scaled.shape,))
transformed shape: (426, 30)
>>> print("per-feature minimum before scaling:\n %s" % X_train.min(axis=0))
per-feature minimum before scaling:
[ 6.98 9.71 43.79 143.5 0.05 ... ]

>>> print("per-feature maximum before scaling:\n %s" % X_train.max(axis=0))
per-feature maximum before scaling:
[ 28.11 39.28 188.5 2501. 0.16 ...]

>>> print("per-feature minimum after scaling:\n %s" % X_train_scaled.min(axis=0))
per-feature minimum after scaling:
[ 0. 0. 0. 0. 0. ...]

>>> print("per-feature maximum after scaling:\n %s" % X_train_scaled.max(axis=0))
per-feature maximum after scaling:
[ 1. 1. 1. 1. 1. ...]

The transformed data has the same shape as the original data - the features are simply shifted and scaled. You can see that all of the features are now between zero and one, as desired. To apply the SVM to the scaled data, we also need to transform the test set. This is done by again calling the transform method, this time on X_test:
>>> X_test_scaled = scaler.transform(X_test) # transform test data
>>> print("per-feature minimum after scaling: %s" % X_test_scaled.min(axis=0))
per-feature minimum after scaling: [ 0.03 0.02 0.03 0.01 0.14 ...]
>>> print("per-feature maximum after scaling: %s" % X_test_scaled.max(axis=0))
per-feature maximum after scaling: [ 0.96 0.82 0.96 0.89 0.81... 0.92 1.21 1.63]

Maybe somewhat surprisingly, you can see that for the test set, after scaling, the minimum and maximum are not zero and one. Some of the features are even outside the 0-1 range! The explanation is that the MinMaxScaler (and all the other scalers) always applies exactly the same transformation to the training and the test set. So the transform method always subtracts the training set minimum, and divides by the training set range, which might be different than the minimum and range for the test set. 
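
To make this concrete, with the default feature_range of (0, 1) the transform step amounts to the following arithmetic using the training-set statistics; this sketch assumes the scaler and data from the session above:
>>> X_test_manual = (X_test - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
>>> np.allclose(X_test_manual, scaler.transform(X_test))   # expected: True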

Scaling training and test data the same way 
It is important that exactly the same transformation is applied to the training set and the test set for the supervised model to make sense on the test set. The following figure illustrates what would happen if we used the minimum and range of the test set instead: 
- ch2_t41.py 
  from sklearn.datasets import make_blobs
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import MinMaxScaler
  import matplotlib.pyplot as plt

  # make synthetic data
  X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
  # split it into training and test set
  X_train, X_test = train_test_split(X, random_state=5, test_size=.1)

  # plot the training and test set
  fig, axes = plt.subplots(1, 3, figsize=(13, 4))
  axes[0].scatter(X_train[:, 0], X_train[:, 1],
                  c='b', label="training set", s=60)
  axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^',
                  c='r', label="test set", s=60)
  axes[0].legend(loc='upper left')
  axes[0].set_title("original data")

  # scale the data using MinMaxScaler
  scaler = MinMaxScaler()
  scaler.fit(X_train)
  X_train_scaled = scaler.transform(X_train)
  X_test_scaled = scaler.transform(X_test)

  # visualize the properly scaled data
  axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],
                  c='b', label="training set", s=60)
  axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',
                  c='r', label="test set", s=60)
  axes[1].set_title("scaled data")

  # rescale the test set separately, so that test set min is 0 and test set max is 1
  # DO NOT DO THIS! For illustration purposes only
  test_scaler = MinMaxScaler()
  test_scaler.fit(X_test)
  X_test_scaled_badly = test_scaler.transform(X_test)

  # visualize wrongly scaled data
  axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],
                  c='b', label="training set", s=60)
  axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^',
                  c='r', label="test set", s=60)
  axes[2].set_title("improperly scaled data")
  plt.show()

The first panel is an unscaled two-dimensional dataset, with the training set shown in blue and the test set shown in red. The second panel shows the same data, but scaled using the MinMaxScaler. Here, we called fit on the training set, and then transform on the training and the test set. You can see that the dataset in the second panel looks identical to the first; only the ticks on the axes changed. Now all the features are between 0 and 1. You can also see that the minimum and maximum feature values for the test data (the red points) are not 0 and 1. 

The third panel shows what would happen if we scaled the training and test set separately. In this case, the minimum and maximum feature values for both the training and the test set are 0 and 1. But now the dataset looks different. The test points moved incongruously relative to the training set, as they were scaled differently. We changed the arrangement of the data in an arbitrary way. Clearly this is not what we want to do. 

Another way to reason about this is the following: imagine your test set was a single point. There is no way to scale a single point correctly to fulfill the minimum and maximum requirements of the MinMaxScaler. But the size of your test set should not change your processing. 

The effect of preprocessing on supervised learning 
Now let’s go back to the cancer dataset and see what the effect of using the MinMaxScaler is on learning the SVC (this is a different way of doing the same scaling we did in chapter 2). 

- ch2_t42.py 
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import MinMaxScaler
  from sklearn.svm import SVC

  # load and split the cancer dataset
  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

  # First, let's fit the SVC on the original (unscaled) data for comparison:
  svm = SVC(C=100)
  svm.fit(X_train, y_train)
  print("SVM score=%.02f" % svm.score(X_test, y_test))

  # Now, let's scale the data using MinMaxScaler before fitting the SVC:
  # preprocessing using 0-1 scaling
  scaler = MinMaxScaler()
  scaler.fit(X_train)
  X_train_scaled = scaler.transform(X_train)
  X_test_scaled = scaler.transform(X_test)

  # learning an SVM on the scaled training data
  svm.fit(X_train_scaled, y_train)
  # scoring on the scaled test set
  print("SVM score=%.02f (With data preprocessing)" % svm.score(X_test_scaled, y_test))
Execution output: 
SVM score=0.62
SVM score=0.97 (With data preprocessing)

As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn’t involve any complicated math, it is good practice to use the scaling mechanisms provided by scikit-learn, instead of reimplementing them yourself, as making mistakes even in these simple computations is easy. You can also easily replace one preprocessing algorithm by another by changing the class you use, as all of the preprocessing classes have the same interface, consisting of the fit and transform methods: 
  # preprocessing using zero mean and unit variance scaling
  from sklearn.preprocessing import StandardScaler
  scaler = StandardScaler()
  scaler.fit(X_train)
  X_train_scaled = scaler.transform(X_train)
  X_test_scaled = scaler.transform(X_test)

  # learning an SVM on the scaled training data
  svm.fit(X_train_scaled, y_train)
  # scoring on the scaled test set
  print("SVM score=%.02f (StandardScaler preprocessing)" % svm.score(X_test_scaled, y_test))
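
As a side note, every scikit-learn transformer also offers a fit_transform method that combines fitting and transforming in a single call; the short sketch below is equivalent to calling fit and then transform on the training data:
  # fit_transform fits the scaler and transforms the training data in one call
  from sklearn.preprocessing import StandardScaler
  scaler = StandardScaler()
  X_train_scaled = scaler.fit_transform(X_train)   # same as fit(X_train) then transform(X_train)
  X_test_scaled = scaler.transform(X_test)         # the test set is still only transformed
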
Now that we’ve seen how simple data transformations for preprocessing work, let’s move on to more interesting transformations using unsupervised learning.
