Thursday, December 22, 2016

[ Scikit-learn ] Data science in Python: pandas, seaborn, scikit-learn

Source From Here 
Preface 

Agenda 
* How do I use the pandas library to read data into Python? 
* How do I use the seaborn library to visualize data? 
* What is linear regression, and how does it work? 
* What are some evaluation metrics for regression problems? 
* How do I choose which features to include in my model? 

Types of supervised learning 
* Classification 
* Regression 

Reading data using pandas 
Pandas: Popular Python library for data exploration, manipulation, and analysis. (Installation guide)
>>> import pandas as pd
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
>>> data.head() # Show the top 5 rows of data

Primary object types: 
* DataFrame: rows and columns (like a spreadsheet)
* Series: A single column
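As a quick illustration (a minimal sketch, not from the original post), a DataFrame behaves like a table whose individual columns are Series:
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}) # three rows, two columns
>>> df['a'] # selecting a single column yields a Series
0    1
1    2
2    3
Name: a, dtype: int64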

# Reload the data, using the first column of the CSV (values 1-200) as the row index
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
>>> data.tail() # Show the final 5 rows
...
>>> data.shape # Check the shape of the DataFrame: (rows, columns)
(200, 4)
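A couple of other handy inspection methods (standard pandas calls, shown here as a quick sketch):
>>> data.columns # column labels
Index([u'TV', u'Radio', u'Newspaper', u'Sales'], dtype='object')
>>> data.describe() # per-column summary statistics (count, mean, std, min, max, ...)
...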

What are the features? 
* TV: Advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
* Radio: Advertising dollars spent on Radio
* Newspaper: Advertising dollars spent on Newspaper

What is the response? 
* Sales: sales of a single product in a given market (in thousands of items)

What else do we know? 
* Because the response variable is continuous, this is a regression problem.
* There are 200 observations (represented by the rows), and each observation is a single market.

Visualizing data using seaborn 
Seaborn: Python library for statistical data visualization built on top of matplotlib 
* Anaconda users: run conda install seaborn from the command line
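* Other users: run pip install seaborn from the command line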


>>> import seaborn as sns
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
>>> sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales').fig.show()


# size : (scalar, optional) Height (in inches) of each facet.
# aspect : (scalar, optional) Aspect * size gives the width (in inches) of each facet.

>>> sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', size=7, aspect=0.7).fig.show()
>>> sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', size=7, aspect=0.7, kind='reg').fig.show()
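When running these commands from a script rather than an interactive shell, the plot window may never appear; one option (a sketch, not from the original post) is to save the figure to disk instead:
>>> grid = sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', kind='reg')
>>> grid.savefig('pairplot.png') # write the figure to a file instead of displaying it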

Linear Regression 
Pros: fast, no tuning required, highly interpretable, well understood
Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between features and response)
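For reference, linear regression models the response as a linear combination of the features, so with our three features the learned equation has the form:

Sales = β0 + β1×TV + β2×Radio + β3×Newspaper

where β0 is the intercept and β1, β2, β3 are the coefficients estimated from the training data (we will inspect them below).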



Preparing X and y using pandas 
* scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
* However, pandas is built on top of NumPy.
* Thus, X can be a pandas DataFrame and y can be a pandas Series!

>>> import pandas as pd
>>> data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
>>> feature_cols = ['TV', 'Radio', 'Newspaper']
>>> X = data[feature_cols]
>>> X.head()
      TV  Radio  Newspaper
0  230.1   37.8       69.2
1   44.5   39.3       45.1
2   17.2   45.9       69.3
3  151.5   41.3       58.5
4  180.8   10.8       58.4

>>> type(X)
<class 'pandas.core.frame.DataFrame'>
>>> X.shape
(200, 3)

>>> y = data['Sales'] # equivalent to: y = data.Sales
>>> y.head()
0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: Sales, dtype: float64 

>>> type(y)
<class 'pandas.core.series.Series'>
>>> y.shape
(200,)

Splitting X and y into training and testing sets 
>>> from sklearn.cross_validation import train_test_split # note: sklearn.model_selection in scikit-learn 0.18+
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
>>> X_train.shape
(150, 3)
>>> X_test.shape
(50, 3) # By default, 25% of the data is held out for testing
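The split proportion can also be set explicitly; the call below is a sketch that reproduces the default behavior via the test_size parameter:
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1) # hold out 25% for testing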

Linear regression in scikit-learn 
>>> from sklearn.linear_model import LinearRegression
>>> linreg = LinearRegression()
>>> linreg.fit(X_train, y_train) # Start training
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Interpreting model coefficients 
>>> print linreg.intercept_
2.87696662232
>>> print linreg.coef_
[ 0.04656457 0.17915812 0.00345046]
>>> zip(feature_cols, linreg.coef_)
[('TV', 0.046564567874150295), ('Radio', 0.17915812245088839), ('Newspaper', 0.0034504647111804343)]


How do we interpret the TV coefficient (0.0466)? 
* For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
Or more clearly: for a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.
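As a sanity check, a prediction is just the intercept plus the dot product of the coefficients with the feature values. A minimal sketch (the values TV=100, Radio=25, Newspaper=50 are made up for illustration):
>>> import numpy as np
>>> manual = linreg.intercept_ + np.dot(linreg.coef_, [100, 25, 50]) # apply the learned equation by hand
>>> auto = linreg.predict([[100, 25, 50]])[0] # let scikit-learn do the same
>>> print manual, auto # both are approximately 12.18 (thousand items)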

Important notes: 
* This is a statement of association, not causation.
* If an increase in TV ad spending were associated with a decrease in sales, the coefficient would be negative.

Making predictions 
>>> y_pred = linreg.predict(X_test)
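To eyeball the results (a quick sketch, not in the original post), line the first few predictions up against the true values:
>>> print y_pred[:5] # first five predictions
>>> print y_test.values[:5] # the corresponding true Sales values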

Model evaluation metrics for regression 
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values. Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems: 
>>> true = [100, 50, 30, 20]
>>> pred = [90, 50, 50, 30]


>>> from sklearn import metrics
>>> print metrics.mean_absolute_error(true, pred) # (10+0+20+10) / 4
10.0


>>> metrics.mean_squared_error(true, pred)
150.0


>>> import numpy as np
>>> np.sqrt(metrics.mean_squared_error(true, pred))
12.24744871391589
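To make the definitions explicit, the same three numbers can be computed by hand with NumPy (a minimal sketch):
>>> errors = np.array(true) - np.array(pred) # [10, 0, -20, -10]
>>> print np.mean(np.abs(errors)) # MAE: mean of the absolute errors
10.0
>>> print np.mean(errors ** 2) # MSE: mean of the squared errors
150.0
>>> print np.sqrt(np.mean(errors ** 2)) # RMSE: square root of the MSE
12.2474487139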

Comparing these metrics: 
* MAE is the easiest to understand, because it's the average error.
* MSE is more popular than MAE, because MSE "punishes" larger errors.
* RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

Computing the RMSE for our Sales predictions 
>>> np.sqrt(metrics.mean_squared_error(y_test, y_pred))
1.404651423032895

Feature Selection 
Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions? Let's remove it from the model and check the RMSE: 
- test.py 
#!/usr/bin/env python
import pandas as pd

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
feature_cols = ['TV', 'Radio']

# Use the list to select a subset of the original DataFrame
X = data[feature_cols]

# Select a Series from the DataFrame
y = data.Sales

# Split into training and testing sets
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in 0.18+
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the model to the training data
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = linreg.predict(X_test)

# Compute the RMSE of our predictions
from sklearn import metrics
import numpy as np
print "RMSE of prediction is %.02f" % (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Let's execute it: 
# ./test.py
RMSE of prediction is 1.39

The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, this feature is unlikely to be useful for predicting Sales and should be removed from the model. 
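This check generalizes: compute the test RMSE for each candidate feature subset and keep the subset with the lowest error. Below is a minimal sketch (the helper name train_test_rmse is my own, and it reuses the imports from test.py above):

def train_test_rmse(data, feature_cols):
    # Fit linear regression on a train/test split and return the test RMSE
    X = data[feature_cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression().fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# Compare a few candidate feature sets; lower RMSE is better
for cols in [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio'], ['TV']]:
    print "%s -> RMSE = %.2f" % (cols, train_test_rmse(data, cols))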

Supplement 
* Prev - Comparing machine learning models in scikit-learn 
* Next - Selecting the best model in scikit-learn using cross-validation

