Preface
Agenda
* How do I use the pandas library to read data into Python?
* How do I use the seaborn library to visualize data?
* What is linear regression, and how does it work?
* What are some evaluation metrics for regression problems?
* How do I choose which features to include in my model?
Types of supervised learning
* Classification
* Regression
Reading data using pandas
Pandas: Popular Python library for data exploration, manipulation, and analysis. (Installation guide)
Primary object types: DataFrame (rows and columns of tabular data) and Series (a single column).
* What are the features?
* What is the response?
* What else do we know?
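A minimal sketch of this step, assuming the same Advertising.csv URL that appears in the full script at the end of this post:

    import pandas as pd

    # Read the Advertising data from the URL into a DataFrame
    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')

    # Inspect the first rows and the shape (rows, columns).
    # Features: the TV, Radio, and Newspaper advertising budgets.
    # Response: Sales.
    print(data.head())
    print(data.shape)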
Visualizing data using seaborn
Seaborn: Python library for statistical data visualization built on top of matplotlib.
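One way to visualize this dataset (an illustrative sketch, not necessarily the exact figures from the original material) is seaborn's pairplot, which here draws one scatter plot of Sales against each feature with a fitted regression line:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')

    # One scatter plot per feature, Sales on the y-axis, with a least-squares
    # regression line overlaid (kind='reg')
    sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars=['Sales'], kind='reg')
    plt.show()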
Linear Regression
Preparing X and y using pandas
Splitting X and y into training and testing sets
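A short sketch covering both of these steps; the train_test_split import path assumes a recent scikit-learn release, where it lives in sklearn.model_selection:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')

    # X is a DataFrame of the feature columns, y is a Series holding the response
    feature_cols = ['TV', 'Radio', 'Newspaper']
    X = data[feature_cols]
    y = data.Sales

    # Default split is 75% training / 25% testing; random_state makes it reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)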
Linear regression in scikit-learn
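A self-contained sketch of fitting an ordinary least-squares model on the training set and pairing each feature name with its learned coefficient (the exact numbers depend on the data and the split):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
    X = data[['TV', 'Radio', 'Newspaper']]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # Fit the model on the training data only
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)

    # Intercept, plus one coefficient per feature paired with its name
    print(linreg.intercept_)
    print(list(zip(X.columns, linreg.coef_)))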
Interpreting model coefficients
How do we interpret the TV coefficient (0.0466)? Holding the Radio and Newspaper budgets fixed, a one-unit increase in TV advertising spending is associated with a 0.0466-unit increase in Sales.
Important note: this is a statement of association, not causation.
Making predictions
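A sketch of this step under the same assumptions as above (same URL, same split, same fitted model):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
    X = data[['TV', 'Radio', 'Newspaper']]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression().fit(X_train, y_train)

    # Predict Sales for the held-out testing set and compare a few predictions
    # with the true values
    y_pred = linreg.predict(X_test)
    print(y_pred[:5])
    print(y_test.values[:5])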
Model evaluation metrics for regression
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values. Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:
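A sketch with made-up true and predicted values (the numbers are purely illustrative), computing three common regression metrics with scikit-learn and NumPy: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE):

    import numpy as np
    from sklearn import metrics

    # Example (made-up) true and predicted values
    true = [100, 50, 30, 20]
    pred = [90, 50, 50, 30]

    # MAE: the average of the absolute errors
    print(metrics.mean_absolute_error(true, pred))
    # MSE: the average of the squared errors
    print(metrics.mean_squared_error(true, pred))
    # RMSE: the square root of the MSE
    print(np.sqrt(metrics.mean_squared_error(true, pred)))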
Comparing these metrics:
* MAE is the easiest to interpret, because it is simply the average error.
* MSE "punishes" larger errors more heavily than MAE.
* RMSE is interpretable in the same units as the response, which is why we use it below.
Computing the RMSE for our Sales predictions
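A sketch of this computation with all three features included, which gives the baseline RMSE that the Newspaper-free model below is compared against (the exact value depends on the data and the split, so it is not quoted here):

    import numpy as np
    import pandas as pd
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
    X = data[['TV', 'Radio', 'Newspaper']]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    linreg = LinearRegression().fit(X_train, y_train)
    y_pred = linreg.predict(X_test)

    # RMSE of the testing-set predictions with TV, Radio, and Newspaper as features
    print("RMSE of prediction is %.02f" % np.sqrt(metrics.mean_squared_error(y_test, y_pred)))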
Feature Selection
Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions? Let's remove it from the model and check the RMSE:
test.py:

    #!/usr/bin/env python
    import pandas as pd

    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')

    # Use a list of column names to select a subset of the original DataFrame
    features_cols = ['TV', 'Radio']
    X = data[features_cols]

    # Select a Series from the DataFrame as the response
    y = data.Sales

    # Split into training and testing sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # Fit the model to the training data
    from sklearn.linear_model import LinearRegression
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)

    # Make predictions on the testing set
    y_pred = linreg.predict(X_test)

    # Compute the RMSE of our predictions
    from sklearn import metrics
    import numpy as np
    print("RMSE of prediction is %.02f" % np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, it is unlikely that this feature is useful for predicting Sales, and it should be removed from the model.
Supplement
* Prev - Comparing machine learning models in scikit-learn
* Next - Selecting the best model in scikit-learn using cross-validation