程式扎記: [ ML Article Collection ] Better Heatmaps and Correlation Matrix Plots in Python

Tuesday, March 2, 2021

Source From Here

Preface
The code described below is now available as a pip package: https://pypi.org/project/heatmapz/. There’s also a Google Colab notebook here, where you can see a few examples and play around with the library.

Correlation Matrix plots
You already know that if you have a data set with many columns, a good way to quickly check correlations among columns is by visualizing the correlation matrix as a heatmap.

But is a simple heatmap the best way to do it?

For illustration, I’ll use the Automobile Data Set, containing various characteristics of a number of cars. You can also find a clean version of the data with header columns here.

Let’s start by making a correlation matrix heatmap for the data set:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Step 0 - Read the dataset, calculate column correlations and make a seaborn heatmap
plt.rcParams['figure.figsize'] = [10, 6]
data = pd.read_csv('https://raw.githubusercontent.com/drazenz/heatmap/master/autos.clean.csv')

# Correlate numeric columns only, so that any text columns
# do not raise an error in recent pandas versions
corr = data.select_dtypes(include='number').corr()
ax = sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
# Rotate the x-axis labels so long column names stay readable
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

Great! Green means positive, red means negative. The stronger the color, the larger the correlation magnitude. Now looking at the chart above, think about the following questions:

* Where do your eyes jump first when you look at the chart?
* Which pair is the most strongly correlated, and which is the most weakly correlated (ignoring the main diagonal)?
* What are the three variables most correlated with price?

If you’re like most people, you’ll find it hard to map the color scale to numbers and vice versa.

Distinguishing positive from negative is easy, as well as 0 from 1. But what about the second question? Finding the highest negative and positive correlations means finding the strongest red and green. To do that I need to carefully scan the entire grid. Try to answer it again and notice how your eyes are jumping around the plot, and sometimes going to the legend.
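For comparison, the same questions can be answered numerically, which gives a reference to check your visual reading against. The following is a minimal sketch rather than part of the original article: `rank_pairs` is a hypothetical helper, and the small data frame is synthetic, with column names that merely imitate the automobile data.

```python
import numpy as np
import pandas as pd


def rank_pairs(corr: pd.DataFrame) -> pd.Series:
    """Return each off-diagonal pair once, strongest |correlation| first."""
    # Indices of the strict upper triangle: skips the diagonal and mirrored pairs.
    i, j = np.triu_indices(len(corr), k=1)
    pairs = pd.Series(
        corr.to_numpy()[i, j],
        index=pd.MultiIndex.from_arrays([corr.index[i], corr.columns[j]]),
    )
    return pairs.sort_values(key=np.abs, ascending=False)


# Synthetic stand-in data (assumption): not the real automobile data set.
rng = np.random.default_rng(0)
n = 200
engine_size = rng.normal(size=n)
demo = pd.DataFrame({
    "engine-size": engine_size,
    "price": 2.0 * engine_size + rng.normal(scale=0.5, size=n),
    "city-mpg": -engine_size + rng.normal(scale=1.0, size=n),
    "noise": rng.normal(size=n),
})
corr_demo = demo.corr()

ranked = rank_pairs(corr_demo)
print("strongest pair:", ranked.index[0], round(ranked.iloc[0], 2))
print("weakest pair:  ", ranked.index[-1], round(ranked.iloc[-1], 2))

# Variables most correlated with price, by absolute value.
top_price = (
    corr_demo["price"].drop("price").abs().sort_values(ascending=False).head(3)
)
print(top_price)
```

With the real data you would pass the `corr` matrix computed earlier instead of `corr_demo`.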

Heatmapz Example
First, install the package if you haven't already:

$ pip install heatmapz

Then you can use the sample code below to draw a better heatmap:

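from heatmap import heatmap, corrplot

plt.figure(figsize=(10, 10))
# Pass numeric columns only, so that text columns do not break the correlation step
corrplot(data.select_dtypes(include='number').corr())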

In addition to color, we’ve added size as a parameter to our heatmap. The size of each square corresponds to the magnitude of the correlation it represents, that is, size(c1, c2) is roughly proportional to abs(corr(c1, c2)).

Now try to answer the questions using the latter plot. Notice how weak correlations visually disappear, and your eyes are immediately drawn to areas where there’s a high correlation. Also, note that it’s now easier to compare magnitudes of negative vs positive values (lighter red vs lighter green), and we can also compare values that are further apart.

If we’re mapping magnitudes, it’s much more natural to link them to the size of the representing object than to its color. That’s exactly why on bar charts you would use height to display measures, and colors to display categories, but not vice versa.
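The size encoding described above can be sketched directly with a matplotlib scatter plot. This is only an illustration of the idea and not the heatmapz implementation: `size_encoded_corrplot` is a hypothetical helper, the `RdBu` colormap and the default scale of 500 are arbitrary choices, and the correlation values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def size_encoded_corrplot(corr, size_scale=500, ax=None):
    """Draw one square per cell: area follows |corr|, color follows the signed value."""
    if ax is None:
        ax = plt.gca()
    n_rows, n_cols = corr.shape
    # xs[i, j] == j and ys[i, j] == i, matching values[i, j].
    xs, ys = np.meshgrid(np.arange(n_cols), np.arange(n_rows))
    values = corr.to_numpy()
    points = ax.scatter(
        xs.ravel(), ys.ravel(),
        s=np.abs(values).ravel() * size_scale,  # marker area in points^2
        c=values.ravel(),
        cmap="RdBu", vmin=-1, vmax=1,
        marker="s",
    )
    ax.set_xticks(np.arange(n_cols))
    ax.set_xticklabels(corr.columns, rotation=45, horizontalalignment="right")
    ax.set_yticks(np.arange(n_rows))
    ax.set_yticklabels(corr.index)
    ax.set_xlim(-0.5, n_cols - 0.5)
    ax.set_ylim(-0.5, n_rows - 0.5)
    ax.set_aspect("equal")
    return points


# Made-up correlation matrix (assumption), used only to exercise the sketch.
demo_corr = pd.DataFrame(
    [[1.0, 0.8, -0.1],
     [0.8, 1.0, -0.6],
     [-0.1, -0.6, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)

fig, ax = plt.subplots(figsize=(4, 4))
points = size_encoded_corrplot(demo_corr, ax=ax)
fig.savefig("size_encoded_corrplot.png")
```

Because the marker area scales with the absolute correlation, the cell holding -0.1 nearly vanishes while the 0.8 and -0.6 cells stay prominent, which is the effect the article relies on.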
