Tuesday, March 2, 2021

[ ML Article Collection ] Better Heatmaps and Correlation Matrix Plots in Python

 Source From Here

Preface
The code described below is now available as a pip package: https://pypi.org/project/heatmapz/. There’s also a Google Colab notebook here, where you can see a few examples and play around with the library.

Correlation Matrix plots
You already know that if you have a data set with many columns, a good way to quickly check correlations among columns is by visualizing the correlation matrix as a heatmap.

But is a simple heatmap the best way to do it?

For illustration, I’ll use the Automobile Data Set, containing various characteristics of a number of cars. You can also find a clean version of the data with header columns here.

Let’s start by making a correlation matrix heatmap for the data set:
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    # Step 0 - Read the dataset, calculate column correlations and make a seaborn heatmap
    plt.rcParams['figure.figsize'] = [10, 6]
    data = pd.read_csv('https://raw.githubusercontent.com/drazenz/heatmap/master/autos.clean.csv')

    # Correlate the numeric columns only (pandas 2.0+ raises on text columns otherwise)
    corr = data.select_dtypes(include='number').corr()
    ax = sns.heatmap(
        corr,
        vmin=-1, vmax=1, center=0,
        cmap=sns.diverging_palette(20, 220, n=200),  # hue 20 for negative, hue 220 for positive
        square=True
    )
    # Rotate the column labels so they don't overlap
    ax.set_xticklabels(
        ax.get_xticklabels(),
        rotation=45,
        horizontalalignment='right'
    );


Great! Green means positive, red means negative. The stronger the color, the larger the correlation magnitude. Now looking at the chart above, think about the following questions:
* Where do your eyes jump first when you look at the chart?
* Which pair is the most strongly correlated, and which the most weakly (excluding the main diagonal)?
* What are the three variables most correlated with price?

If you’re like most people, you’ll find it hard to map the color scale to numbers and vice versa.

Distinguishing positive from negative is easy, as well as 0 from 1. But what about the second question? Finding the highest negative and positive correlations means finding the strongest red and green. To do that I need to carefully scan the entire grid. Try to answer it again and notice how your eyes are jumping around the plot, and sometimes going to the legend.
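
To check your visual answers against the numbers, the pairs can also be ranked directly. The snippet below is a minimal sketch, not part of the original article: `rank_correlated_pairs` is a hypothetical helper, and the data frame is synthetic stand-in data rather than the automobile data set. It keeps each off-diagonal pair once and sorts by absolute correlation.

```python
import numpy as np
import pandas as pd


def rank_correlated_pairs(corr: pd.DataFrame) -> pd.Series:
    """Return each off-diagonal pair once, sorted by |correlation|, strongest first.

    Hypothetical helper for illustration only.
    """
    # The strict upper triangle keeps every pair once and drops the main diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic stand-in data (an assumption, NOT the automobile data set).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),  # strongly positive with "a"
    "c": -x + rng.normal(scale=1.0, size=200),     # moderately negative with "a"
    "d": rng.normal(size=200),                     # roughly independent of the rest
})
demo_corr = demo.corr()

ranked = rank_correlated_pairs(demo_corr)
print("strongest pair:", ranked.index[0])
print("weakest pair:  ", ranked.index[-1])

# Variables most correlated with one target column ("a" here; "price" in the article).
top_for_a = demo_corr["a"].drop("a").abs().nlargest(2)
print(top_for_a)
```

With the real data you would pass the article's `corr` matrix instead of `demo_corr` and use `"price"` as the target column.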

Heatmapz Example
First, install the package if you haven't already:
$ pip install heatmapz

Then you can use the sample code below to display a better heatmap:
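    from heatmap import heatmap, corrplot

    # `plt` and `data` come from the earlier snippet
    plt.figure(figsize=(10, 10))
    # Correlate the numeric columns only (pandas 2.0+ raises on text columns otherwise)
    corrplot(data.select_dtypes(include='number').corr())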


In addition to color, we’ve added size as a parameter to our heatmap. The size of each square corresponds to the magnitude of the correlation it represents, that is, size(c1, c2) ~ abs(corr(c1, c2)).
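
To make the size mapping concrete, here is a minimal sketch of the idea using plain matplotlib. It is an illustration under stated assumptions, not necessarily how heatmapz draws its plots: `sized_corr_plot`, `max_marker_area`, the `RdYlGn` colormap, and the small hand-written matrix are all hypothetical choices. Each cell becomes a square scatter marker whose area is proportional to abs(corr) and whose color encodes the signed value.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def sized_corr_plot(corr, max_marker_area=400.0, ax=None):
    """Draw `corr` as a grid of squares: color = signed value, area ~ |value|.

    Hypothetical helper for illustration; not the heatmapz implementation.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    n_rows, n_cols = corr.shape
    # One (x, y) grid position per matrix cell: xs[i, j] = j, ys[i, j] = i.
    xs, ys = np.meshgrid(np.arange(n_cols), np.arange(n_rows))
    values = corr.to_numpy()
    # Marker area (in points^2) scales linearly with the correlation magnitude.
    sizes = np.abs(values) * max_marker_area
    points = ax.scatter(
        xs.ravel(), ys.ravel(),
        s=sizes.ravel(), c=values.ravel(),
        marker="s", cmap="RdYlGn", vmin=-1, vmax=1,
    )
    ax.set_xticks(np.arange(n_cols))
    ax.set_xticklabels(corr.columns, rotation=45, horizontalalignment="right")
    ax.set_yticks(np.arange(n_rows))
    ax.set_yticklabels(corr.index)
    ax.set_xlim(-0.5, n_cols - 0.5)
    ax.set_ylim(n_rows - 0.5, -0.5)  # put the first row at the top, like a matrix
    ax.set_aspect("equal")
    ax.figure.colorbar(points, ax=ax)
    return ax, sizes


# Small hand-written correlation-like matrix (illustrative values only).
labels = ["a", "b", "c"]
demo_matrix = pd.DataFrame(
    [[1.0, 0.8, -0.3],
     [0.8, 1.0, 0.05],
     [-0.3, 0.05, 1.0]],
    index=labels, columns=labels,
)
ax, sizes = sized_corr_plot(demo_matrix)
```

In this sketch the 0.05 cell gets a marker one twentieth the area of a diagonal cell, which is why weak correlations all but vanish from the plot.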

Now try to answer the questions using the latter plot. Notice how weak correlations visually disappear, and your eyes are immediately drawn to areas where there’s a high correlation. Also, note that it’s now easier to compare magnitudes of negative vs positive values (lighter red vs lighter green), and we can also compare values that are further apart.

If we’re mapping magnitudes, it’s much more natural to link them to the size of the representing object than to its color. That’s exactly why on bar charts you would use height to display measures, and colors to display categories, but not vice versa.
