Thursday, December 22, 2016

[ Scikit-learn ] Data science in Python: pandas, seaborn, scikit-learn

Source From Here 
Preface 

Agenda 
* How do I use the pandas library to read data into Python? 
* How do I use the seaborn library to visualize data? 
* What is linear regression, and how does it work? 
* What are some evaluation metrics for regression problems? 
* How do I choose which features to include in my model? 

Types of supervised learning 
* Classification 
* Regression 

Reading data using pandas 
Pandas: Popular Python library for data exploration, manipulation, and analysis. (Installation guide)
>>> import pandas as pd
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
>>> data.head() # Show the top 5 rows of data

Primary object types: 
* DataFrame: rows and columns (like a spreadsheet)
* Series: A single column
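As a quick illustration (a minimal sketch, not from the original post), a DataFrame behaves like a table whose individual columns are Series:
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}) # three rows, two columns
>>> df['a'] # selecting a single column yields a Series
0    1
1    2
2    3
Name: a, dtype: int64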

# Reload the data, using the first column of the CSV (values 1-200) as the row index
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
>>> data.tail() # Show the final 5 rows
...
>>> data.shape # Check the shape of the DataFrame: (rows, columns)
(200, 4)
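A couple of other handy inspection methods (standard pandas calls, shown here as a quick sketch):
>>> data.columns # column labels
Index([u'TV', u'Radio', u'Newspaper', u'Sales'], dtype='object')
>>> data.describe() # per-column summary statistics (count, mean, std, min, max, ...)
...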

What are the features? 
* TV: Advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
* Radio: Advertising dollars spent on Radio
* Newspaper: Advertising dollars spent on Newspaper

What is the response? 
* Sales: sales of a single product in a given market (in thousands of items)

What else do we know? 
* Because the response variable is continuous, this is a regression problem.
* There are 200 observations (represented by the rows), and each observation is a single market.

Visualizing data using seaborn 
Seaborn: Python library for statistical data visualization built on top of matplotlib 
* Anaconda users: run conda install seaborn from the command line
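* Other users: run pip install seaborn from the command line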


>>> import seaborn as sns
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
>>> sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales').fig.show()


# size : (scalar, optional) Height (in inches) of each facet.
# aspect : (scalar, optional) Aspect * size gives the width (in inches) of each facet.

>>> sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', size=7, aspect=0.7).fig.show()
>>> sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', size=7, aspect=0.7, kind='reg').fig.show()
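When running these commands from a script rather than an interactive shell, the plot window may never appear; one option (a sketch, not from the original post) is to save the figure to disk instead:
>>> grid = sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', kind='reg')
>>> grid.savefig('pairplot.png') # write the figure to a file instead of displaying it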

Linear Regression 
Pros: fast, no tuning required, highly interpretable, well understood
Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between features and response)
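For reference, linear regression models the response as a linear combination of the features, so with our three features the learned equation has the form:

Sales = β0 + β1×TV + β2×Radio + β3×Newspaper

where β0 is the intercept and β1, β2, β3 are the coefficients estimated from the training data (we will inspect them below).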



Preparing X and y using pandas 
* scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
* However, pandas is built on top of NumPy.
* Thus, X can be a pandas DataFrame and y can be a pandas Series!

>>> import pandas as pd
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
>>> feature_cols = ['TV', 'Radio', 'Newspaper']
>>> X = data[feature_cols]
>>> X.head()
      TV  Radio  Newspaper
0  230.1   37.8       69.2
1   44.5   39.3       45.1
2   17.2   45.9       69.3
3  151.5   41.3       58.5
4  180.8   10.8       58.4

>>> type(X)
<class 'pandas.core.frame.DataFrame'>
>>> X.shape
(200, 3)

>>> y = data['Sales'] # equivalent to: y = data.Sales
>>> y.head()
0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: Sales, dtype: float64 

>>> type(y)
<class 'pandas.core.series.Series'>
>>> y.shape
(200,)

Splitting X and y into training and testing sets 
>>> from sklearn.cross_validation import train_test_split # note: sklearn.model_selection in scikit-learn 0.18+
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
>>> X_train.shape
(150, 3)
>>> X_test.shape
(50, 3) # By default, 25% of the data is held out for testing
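The split proportion can also be set explicitly; the call below is a sketch that reproduces the default behavior via the test_size parameter:
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1) # hold out 25% for testing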

Linear regression in scikit-learn 
>>> from sklearn.linear_model import LinearRegression
>>> linreg = LinearRegression()
>>> linreg.fit(X_train, y_train) # Start training
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Interpreting model coefficients 
>>> print linreg.intercept_
2.87696662232
>>> print linreg.coef_
[ 0.04656457 0.17915812 0.00345046]
>>> zip(feature_cols, linreg.coef_)
[('TV', 0.046564567874150295), ('Radio', 0.17915812245088839), ('Newspaper', 0.0034504647111804343)]


How do we interpret the TV coefficient (0.0466)? 
* For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
Or more clearly: for a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.
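As a sanity check, a prediction is just the intercept plus the dot product of the coefficients with the feature values. A minimal sketch (the values TV=100, Radio=25, Newspaper=50 are made up for illustration):
>>> import numpy as np
>>> manual = linreg.intercept_ + np.dot(linreg.coef_, [100, 25, 50]) # apply the learned equation by hand
>>> auto = linreg.predict([[100, 25, 50]])[0] # let scikit-learn do the same
>>> print manual, auto # both are approximately 12.18 (thousand items)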

Important notes: 
* This is a statement of association, not causation.
* If an increase in TV ad spending were associated with a decrease in sales, the coefficient would be negative.

Making predictions 
>>> y_pred = linreg.predict(X_test)
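To eyeball the results (a quick sketch, not in the original post), line the first few predictions up against the true values:
>>> print y_pred[:5] # first five predictions
>>> print y_test.values[:5] # the corresponding true Sales values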

Model evaluation metrics for regression 
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values. Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems: 
>>> true = [100, 50, 30, 20]
>>> pred = [90, 50, 50, 30]


>>> from sklearn import metrics
>>> print metrics.mean_absolute_error(true, pred) # (10+0+20+10) / 4
10.0


>>> metrics.mean_squared_error(true, pred)
150.0


>>> import numpy as np
>>> np.sqrt(metrics.mean_squared_error(true, pred))
12.24744871391589
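To make the definitions explicit, the same three numbers can be computed by hand with NumPy (a minimal sketch):
>>> errors = np.array(true) - np.array(pred) # [10, 0, -20, -10]
>>> print np.mean(np.abs(errors)) # MAE: mean of the absolute errors
10.0
>>> print np.mean(errors ** 2) # MSE: mean of the squared errors
150.0
>>> print np.sqrt(np.mean(errors ** 2)) # RMSE: square root of the MSE
12.2474487139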

Comparing these metrics: 
* MAE is the easiest to understand, because it's the average error.
* MSE is more popular than MAE, because MSE "punishes" larger errors.
* RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

Computing the RMSE for our Sales predictions 
>>> np.sqrt(metrics.mean_squared_error(y_test, y_pred))
1.404651423032895

Feature Selection 
Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions? Let's remove it from the model and check the RMSE: 
- test.py 
#!/usr/bin/env python
import pandas as pd

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
feature_cols = ['TV', 'Radio']

# Use the list to select a subset of the original DataFrame
X = data[feature_cols]

# Select a Series from the DataFrame
y = data.Sales

# Split into training and testing sets
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in 0.18+
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the model to the training data
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = linreg.predict(X_test)

# Compute the RMSE of our predictions
from sklearn import metrics
import numpy as np
print "RMSE of prediction is %.02f" % (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Let's execute it: 
# ./test.py
RMSE of prediction is 1.39

The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, this feature is unlikely to be useful for predicting Sales and should be removed from the model. 
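This check generalizes: compute the test RMSE for each candidate feature subset and keep the subset with the lowest error. Below is a minimal sketch (the helper name train_test_rmse is my own, and it reuses the imports from test.py above):

def train_test_rmse(data, feature_cols):
    # Fit linear regression on a train/test split and return the test RMSE
    X = data[feature_cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression().fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# Compare a few candidate feature sets; lower RMSE is better
for cols in [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio'], ['TV']]:
    print "%s -> RMSE = %.2f" % (cols, train_test_rmse(data, cols))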

Supplement 
* Prev - Comparing machine learning models in scikit-learn 
* Next - Selecting the best model in scikit-learn using cross-validation

