Thursday, August 15, 2019

[ Py DS ] Ch5 - Machine Learning (Part1)

Source From Here 


What Is Machine Learning? 
Before we take a look at the details of various machine learning methods, let’s start by looking at what machine learning is, and what it isn’t. Machine learning is often categorized as a sub-field of artificial intelligence, but I find that categorization can often be misleading at first brush. The study of machine learning certainly arose from research in this context, but in the data science application of machine learning methods, it’s more helpful to think of machine learning as a means of building models of data.



Fundamentally, machine learning involves building mathematical models to help understand data. “Learning” enters the fray when we give these models tunable parameters that can be adapted to observed data; in this way the program can be considered to be “learning” from the data. Once these models have been fit to previously seen data, they can be used to predict and understand aspects of newly observed data. I’ll leave to the reader the more philosophical digression regarding the extent to which this type of mathematical, model-based “learning” is similar to the “learning” exhibited by the human brain.

Understanding the problem setting in machine learning is essential to using these tools effectively, and so we will start with some broad categorizations of the types of approaches we’ll discuss here.

Categories of Machine Learning
At the most fundamental level, machine learning can be categorized into two main types: supervised learning and unsupervised learning.

Supervised learning involves somehow modeling the relationship between measured features of data and some label associated with the data; once this model is determined, it can be used to apply labels to new, unknown data. This is further subdivided into classification tasks and regression tasks: in classification, the labels are discrete categories, while in regression, the labels are continuous quantities. We will see examples of both types of supervised learning in the following section.

Unsupervised learning involves modeling the features of a dataset without reference to any label, and is often described as “letting the dataset speak for itself.” These models include tasks such as Clustering and Dimensionality Reduction. Clustering algorithms identify distinct groups of data, while dimensionality reduction algorithms search for more succinct representations of the data. We will see examples of both types of unsupervised learning in the following section.

In addition, there are so-called Semi-supervised learning methods, which fall somewhere between supervised learning and unsupervised learning. Semi-supervised learning methods are often useful when only incomplete labels are available.
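To see how this distinction looks in practice, here is a minimal sketch using scikit-learn's estimator API (the particular estimators below are just illustrative choices): a supervised model is fit on the features together with their labels, while an unsupervised model is fit on the features alone and infers the groups itself.
  from sklearn.datasets import make_blobs
  from sklearn.svm import SVC         # a supervised classifier
  from sklearn.cluster import KMeans  # an unsupervised clustering model

  X, y = make_blobs(n_samples=100, centers=3, random_state=0)

  # supervised: the known labels y participate in training
  clf = SVC(kernel='linear').fit(X, y)
  predicted_labels = clf.predict(X)

  # unsupervised: only the features X are used; group labels are inferred
  km = KMeans(n_clusters=3, random_state=0).fit(X)
  inferred_groups = km.labels_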

Qualitative Examples of Machine Learning Applications
To make these ideas more concrete, let’s take a look at a few very simple examples of a machine learning task. These examples are meant to give an intuitive, nonquantitative overview of the types of machine learning tasks we will be looking at in this chapter. In later sections, we will go into more depth regarding the particular models and how they are used. For a preview of these more technical aspects, you can find the Python source that generates the figures in the online appendix.

Classification: Predicting discrete labels
We will first take a look at a simple classification task, in which you are given a set of labeled points and want to use these to classify some unlabeled points. Imagine that we have the data shown in Figure 5-1 (the code used to generate this figure, and all figures in this section, is available in the online appendix).

Here we have two-dimensional data; that is, we have two features for each point, represented by the (x,y) positions of the points on the plane. In addition, we have one of two class labels for each point, here represented by the colors of the points. From these features and labels, we would like to create a model that will let us decide whether a new point should be labeled “blue” or “red.”:
  from sklearn.datasets import make_blobs  # older releases used sklearn.datasets.samples_generator
  from sklearn.svm import SVC
  from matplotlib import pyplot as plt

  # common plot formatting for below
  def format_plot(ax, title):
      ax.xaxis.set_major_formatter(plt.NullFormatter())
      ax.yaxis.set_major_formatter(plt.NullFormatter())
      ax.set_xlabel('feature 1', color='gray')
      ax.set_ylabel('feature 2', color='gray')
      ax.set_title(title, color='gray')

  # create 50 separable points
  X, y = make_blobs(n_samples=50, centers=2,
                    random_state=0, cluster_std=0.60)

  # fit the support vector classifier model
  clf = SVC(kernel='linear')
  clf.fit(X, y)

  # create some new points to predict
  X2, _ = make_blobs(n_samples=80, centers=2,
                     random_state=0, cluster_std=0.80)
  X2 = X2[50:]

  # predict the labels
  y2 = clf.predict(X2)

  # plot the data
  fig, ax = plt.subplots(figsize=(8, 6))
  point_style = dict(cmap='Paired', s=50)
  ax.scatter(X[:, 0], X[:, 1], c=y, **point_style)

  # format plot
  format_plot(ax, 'Input Data')
  ax.axis([-1, 4, -2, 7])
Figure 5-1. A simple data set for classification 

There are a number of possible models for such a classification task, but here we will use an extremely simple one. We will make the assumption that the two groups can be separated by drawing a straight line through the plane between them, such that points on each side of the line fall in the same group. Here the model is a quantitative version of the statement “a straight line separates the classes,” while the model parameters are the particular numbers describing the location and orientation of that line for our data. The optimal values for these model parameters are learned from the data (this is the “learning” in machine learning), which is often called training the model.
  import numpy as np

  # get contours describing the model
  xx = np.linspace(-1, 4, 10)
  yy = np.linspace(-2, 7, 10)
  xy1, xy2 = np.meshgrid(xx, yy)
  Z = np.array([clf.decision_function([t])
                for t in zip(xy1.flat, xy2.flat)]).reshape(xy1.shape)

  # plot points and model
  fig, ax = plt.subplots(figsize=(8, 6))
  line_style = dict(levels=[-1.0, 0.0, 1.0],
                    linestyles=['dashed', 'solid', 'dashed'],
                    colors='gray', linewidths=1)
  ax.scatter(X[:, 0], X[:, 1], c=y, **point_style)
  ax.contour(xy1, xy2, Z, **line_style)

  # format plot
  format_plot(ax, 'Model Learned from Input Data')
Figure 5-2. A simple classification model 
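Once training has finished, the “particular numbers describing the location and orientation of that line” can be read directly off the fitted estimator. As a small sketch (assuming the clf fitted above; a linear-kernel SVC exposes these as coef_ and intercept_):
  # inspect the learned parameters of the separating line (assumes clf from above)
  w = clf.coef_[0]        # weights (w1, w2) of the line
  b = clf.intercept_[0]   # bias / offset term
  print("decision boundary: %.2f * feature1 + %.2f * feature2 + %.2f = 0" % (w[0], w[1], b))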

Now that this model has been trained, it can be generalized to new, unlabeled data. In other words, we can take a new set of data, draw this model line through it, and assign labels to the new points based on this model. This stage is usually called prediction. See Figure 5-3:
  # plot the results
  fig, ax = plt.subplots(1, 2, figsize=(16, 6))
  fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

  ax[0].scatter(X2[:, 0], X2[:, 1], c='gray', **point_style)
  ax[0].axis([-1, 4, -2, 7])

  ax[1].scatter(X2[:, 0], X2[:, 1], c=y2, **point_style)
  ax[1].contour(xy1, xy2, Z, **line_style)
  ax[1].axis([-1, 4, -2, 7])

  format_plot(ax[0], 'Unknown Data')
  format_plot(ax[1], 'Predicted Labels')
Figure 5-3. Applying a classification model to new data 

This is the basic idea of a classification task in machine learning, where “classification” indicates that the data has discrete class labels. At first glance this may look fairly trivial: it would be relatively easy to simply look at this data and draw such a discriminatory line to accomplish this classification. A benefit of the machine learning approach, however, is that it can generalize to much larger datasets in many more dimensions.

For example, this is similar to the task of automated spam detection for email; in this case, we might use the following features and labels:
* feature 1, feature 2, etc.: normalized counts of important words or phrases (“Viagra,” “Nigerian prince,” etc.)
* label: “spam” or “not spam”


For the training set, these labels might be determined by individual inspection of a small representative sample of emails; for the remaining emails, the label would be determined using the model. For a suitably trained classification algorithm with enough well-constructed features (typically thousands or millions of words or phrases), this type of approach can be very effective. We will see an example of such text-based classification in “In Depth: Naive Bayes Classification” later.
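To make that concrete, here is a small, hypothetical sketch of such a text-based classifier built from scikit-learn's CountVectorizer (word counts as features) and MultinomialNB (naive Bayes); the tiny hand-labeled emails are invented purely for illustration:
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  # a tiny hand-labeled training sample (purely illustrative)
  train_emails = ["cheap viagra offer", "meeting at noon tomorrow",
                  "nigerian prince needs your help", "project status update"]
  train_labels = ["spam", "not spam", "spam", "not spam"]

  # word counts as features, naive Bayes as the classifier
  spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
  spam_model.fit(train_emails, train_labels)

  # the remaining emails get their labels from the model
  print(spam_model.predict(["claim your free offer from the prince"]))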

Some important classification algorithms that we will discuss in more detail are Gaussian naive Bayes (see “In Depth: Naive Bayes Classification”), support vector machines (see “In-Depth: Support Vector Machines” on page 405), and random forest classification (see “In-Depth: Decision Trees and Random Forests” on page 421).

Regression: Predicting continuous labels
In contrast with the discrete labels of a classification algorithm, we will next look at a simple regression task in which the labels are continuous quantities. Consider the data shown in Figure 5-4, which consists of a set of points, each with a continuous label:
  from sklearn.linear_model import LinearRegression

  # create some data for the regression
  rng = np.random.RandomState(1)

  X = rng.randn(200, 2)
  y = np.dot(X, [-2, 1]) + 0.1 * rng.randn(X.shape[0])

  # fit the regression model
  model = LinearRegression()
  model.fit(X, y)

  # create some new points to predict
  X2 = rng.randn(100, 2)

  # predict the labels
  y2 = model.predict(X2)

  # plot data points
  fig, ax = plt.subplots()
  points = ax.scatter(X[:, 0], X[:, 1], c=y, s=50,
                      cmap='viridis')

  # format plot
  format_plot(ax, 'Input Data')
  ax.axis([-4, 4, -3, 3])
Figure 5-4. A simple dataset for regression 

As with the classification example, we have two-dimensional data; that is, there are two features describing each data point. The color of each point represents the continuous label for that point.

There are a number of possible regression models we might use for this type of data, but here we will use a simple linear regression to predict the points. This simple linear regression model assumes that if we treat the label as a third spatial dimension, we can fit a plane to the data. This is a higher-level generalization of the well-known problem of fitting a line to data with two coordinates.
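Because the data above were generated as y ≈ -2·x1 + 1·x2 plus a little noise, the parameters of the fitted plane should come out close to those values. A quick check (assuming the model fitted in the previous block):
  # the plane is label = w1 * feature1 + w2 * feature2 + b
  print(model.coef_)       # should be approximately [-2.  1.]
  print(model.intercept_)  # should be approximately 0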

We can visualize this setup as shown in Figure 5-5:
  from mpl_toolkits.mplot3d.art3d import Line3DCollection

  points = np.hstack([X, y[:, None]]).reshape(-1, 1, 3)
  segments = np.hstack([points, points])
  segments[:, 0, 2] = -8

  # plot points in 3D
  fig = plt.figure()
  ax = fig.add_subplot(111, projection='3d')
  ax.scatter(X[:, 0], X[:, 1], y, c=y, s=35,
             cmap='viridis')
  ax.add_collection3d(Line3DCollection(segments, colors='gray', alpha=0.2))
  ax.scatter(X[:, 0], X[:, 1], -8 + np.zeros(X.shape[0]), c=y, s=10,
             cmap='viridis')

  # format plot
  ax.patch.set_facecolor('white')
  ax.view_init(elev=20, azim=-70)
  ax.set_zlim3d(-8, 8)
  ax.xaxis.set_major_formatter(plt.NullFormatter())
  ax.yaxis.set_major_formatter(plt.NullFormatter())
  ax.zaxis.set_major_formatter(plt.NullFormatter())
  ax.set(xlabel='feature 1', ylabel='feature 2', zlabel='label')

  # hide the axis lines and tick marks
  for axis in (ax.xaxis, ax.yaxis, ax.zaxis):
      axis.line.set_visible(False)
      for tick in axis.get_ticklines():
          tick.set_visible(False)
Figure 5-5. A three-dimensional view of the regression data 


Notice that the feature 1–feature 2 plane here is the same as in the two-dimensional plot from before; in this case, however, we have represented the labels by both color and three-dimensional axis position. From this view, it seems reasonable that fitting a plane through this three-dimensional data would allow us to predict the expected label for any set of input parameters. Returning to the two-dimensional projection, when we fit such a plane we get the result shown in Figure 5-6:
  from matplotlib.collections import LineCollection

  # plot data points
  fig, ax = plt.subplots()
  pts = ax.scatter(X[:, 0], X[:, 1], c=y, s=50,
                   cmap='viridis', zorder=2)

  # compute and plot model color mesh
  xx, yy = np.meshgrid(np.linspace(-4, 4),
                       np.linspace(-3, 3))
  Xfit = np.vstack([xx.ravel(), yy.ravel()]).T
  yfit = model.predict(Xfit)
  zz = yfit.reshape(xx.shape)
  ax.pcolorfast([-4, 4], [-3, 3], zz, alpha=0.5,
                cmap='viridis', norm=pts.norm, zorder=1)

  # format plot
  format_plot(ax, 'Input Data with Linear Fit')
  ax.axis([-4, 4, -3, 3])
Figure 5-6. A representation of the regression model 

This plane of fit gives us what we need to predict labels for new points. Visually, we find the results shown in Figure 5-7:
  # plot the model fit
  fig, ax = plt.subplots(1, 2, figsize=(16, 6))
  fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

  ax[0].scatter(X2[:, 0], X2[:, 1], c='gray', s=50)
  ax[0].axis([-4, 4, -3, 3])

  ax[1].scatter(X2[:, 0], X2[:, 1], c=y2, s=50,
                cmap='viridis', norm=pts.norm)
  ax[1].axis([-4, 4, -3, 3])

  # format plots
  format_plot(ax[0], 'Unknown Data')
  format_plot(ax[1], 'Predicted Labels')
Figure 5-7. Applying the regression model to new data 

As with the classification example, this may seem rather trivial in a low number of dimensions. But the power of these methods is that they can be straightforwardly applied and evaluated in the case of data with many, many features. For example, this is similar to the task of computing the distance to galaxies observed through a telescope—in this case, we might use the following features and labels:
* feature 1, feature 2, etc.: brightness of each galaxy at one of several wavelengths or colors
* label: distance or redshift of the galaxy

The distances for a small number of these galaxies might be determined through an independent set of (typically more expensive) observations. We could then estimate distances to remaining galaxies using a suitable regression model, without the need to employ the more expensive observation across the entire set. In astronomy circles, this is known as the “photometric redshift” problem.
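As a rough, hedged sketch of that workflow (with invented "brightness" numbers standing in for real photometry), a regression model is trained on the small subset whose redshifts were measured the expensive way, and then predicts redshifts for all remaining galaxies:
  import numpy as np
  from sklearn.linear_model import LinearRegression

  rng = np.random.RandomState(0)
  brightness = rng.rand(1000, 5)  # 5 synthetic "wavelength" features per galaxy
  redshift = brightness @ [0.5, 1.0, -0.3, 0.2, 0.8]  # invented relationship, for illustration only

  # expensive spectroscopic redshifts are available for only the first 50 galaxies
  photo_z_model = LinearRegression()
  photo_z_model.fit(brightness[:50], redshift[:50])

  # cheap "photometric" redshift estimates for everything else
  estimated_redshift = photo_z_model.predict(brightness[50:])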

Clustering: Inferring labels on unlabeled data
The classification and regression illustrations we just looked at are examples of supervised learning algorithms, in which we are trying to build a model that will predict labels for new data. Unsupervised learning involves models that describe data without reference to any known labels.

One common case of unsupervised learning is “clustering,” in which data is automatically assigned to some number of discrete groups. For example, we might have some two-dimensional data like that shown in Figure 5-8:
  from sklearn.datasets import make_blobs  # older releases used sklearn.datasets.samples_generator
  from sklearn.cluster import KMeans

  # create 100 points in four clusters
  X, y = make_blobs(n_samples=100, centers=4,
                    random_state=42, cluster_std=1.5)

  # fit the k-means model
  model = KMeans(n_clusters=4, random_state=0)
  y = model.fit_predict(X)

  # plot the input data
  fig, ax = plt.subplots(figsize=(8, 6))
  ax.scatter(X[:, 0], X[:, 1], s=50, color='gray')

  # format the plot
  format_plot(ax, 'Input Data')
Figure 5-8. Example data for clustering 

By eye, it is clear that each of these points is part of a distinct group. Given this input, a clustering model will use the intrinsic structure of the data to determine which points are related. Using the very fast and intuitive k-means algorithm (see “In Depth: k-Means Clustering” on page 462), we find the clusters shown in Figure 5-9:
  # plot the data with cluster labels
  fig, ax = plt.subplots(figsize=(8, 6))
  ax.scatter(X[:, 0], X[:, 1], s=50, c=y, cmap='viridis')

  # format the plot
  format_plot(ax, 'Learned Cluster Labels')
Figure 5-9. Data labeled with a k-means clustering model 
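The fitted estimator also records what it learned: the center of each group in feature space. A quick look (assuming the model fitted above):
  # coordinates of the four learned cluster centers (one row per cluster)
  print(model.cluster_centers_)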

Dimensionality reduction: Inferring structure of unlabeled data
Dimensionality reduction is another example of an unsupervised algorithm, in which labels or other information are inferred from the structure of the dataset itself. Dimensionality reduction is a bit more abstract than the examples we looked at before, but generally it seeks to pull out some low-dimensional representation of data that in some way preserves relevant qualities of the full dataset. Different dimensionality reduction routines measure these relevant qualities in different ways, as we will see in “In-Depth: Manifold Learning” on page 445.

As an example of this, consider the data shown in Figure 5-10.
  from sklearn.datasets import make_swiss_roll

  # make data
  X, y = make_swiss_roll(200, noise=0.5, random_state=42)
  X = X[:, [0, 2]]

  # visualize data
  fig, ax = plt.subplots()
  ax.scatter(X[:, 0], X[:, 1], color='gray', s=30)

  # format the plot
  format_plot(ax, 'Input Data')
Figure 5-10. Example data for dimensionality reduction 

Visually, it is clear that there is some structure in this data: it is drawn from a one-dimensional line that is arranged in a spiral within this two-dimensional space. In a sense, you could say that this data is “intrinsically” only one-dimensional, though this one-dimensional data is embedded in higher-dimensional space. A suitable dimensionality reduction model in this case would be sensitive to this nonlinear embedded structure, and be able to pull out this lower-dimensionality representation.

Figure 5-11 presents a visualization of the results of the Isomap algorithm, a manifold learning algorithm that does exactly this.
  from sklearn.manifold import Isomap

  model = Isomap(n_neighbors=8, n_components=1)
  y_fit = model.fit_transform(X).ravel()

  # visualize data
  fig, ax = plt.subplots()
  pts = ax.scatter(X[:, 0], X[:, 1], c=y_fit, cmap='viridis', s=30)
  cb = fig.colorbar(pts, ax=ax)

  # format the plot
  format_plot(ax, 'Learned Latent Parameter')
  cb.set_ticks([])
  cb.set_label('Latent Variable', color='gray')
Figure 5-11. Data with a label learned via dimensionality reduction 

Notice that the colors (which represent the extracted one-dimensional latent variable) change uniformly along the spiral, which indicates that the algorithm did in fact detect the structure we saw by eye. As with the previous examples, the power of dimensionality reduction algorithms becomes clearer in higher-dimensional cases. For example, we might wish to visualize important relationships within a dataset that has 100 or 1,000 features. Visualizing 1,000-dimensional data is a challenge, and one way we can make this more manageable is to use a dimensionality reduction technique to reduce the data to two or three dimensions.
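As a brief sketch of that idea, scikit-learn's built-in handwritten-digits dataset has 64 features per sample; principal component analysis (discussed later in this chapter) can project it down to two dimensions for plotting:
  from sklearn.datasets import load_digits
  from sklearn.decomposition import PCA
  import matplotlib.pyplot as plt

  digits = load_digits()  # 1,797 samples, 64 features each
  X_2d = PCA(n_components=2).fit_transform(digits.data)

  # a 64-dimensional dataset, now viewable as a 2D scatter plot
  plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='viridis', s=15)
  plt.show()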

Some important dimensionality reduction algorithms that we will discuss are principal component analysis (see “In Depth: Principal Component Analysis” on page 433) and various manifold learning algorithms, including Isomap and locally linear embedding (see “In-Depth: Manifold Learning” on page 445).

Supplement
FAQ - What is the purpose of meshgrid in Python / NumPy?
The purpose of meshgrid is to create a rectangular grid out of an array of x values and an array of y values.
  import numpy as np

  x = [1, 2, 3]
  y = [1, 2, 3, 4]
  xx, yy = np.meshgrid(x, y)
  print(xx)
  print(yy)
Will output:
  [[1 2 3]
   [1 2 3]
   [1 2 3]
   [1 2 3]]
  [[1 1 1]
   [2 2 2]
   [3 3 3]
   [4 4 4]]

