Source From Here
Preface
The code described below is now available as a pip package: https://pypi.org/project/heatmapz/. There is also a Google Colab notebook with a few examples, where you can experiment with the library.
Correlation Matrix plots
You already know that if you have a data set with many columns, a good way to quickly check correlations among columns is by visualizing the correlation matrix as a heatmap.
But is a simple heatmap the best way to do it?
For illustration, I’ll use the Automobile Data Set, which contains various characteristics of a number of cars. A clean version of the data with header columns is available at https://raw.githubusercontent.com/drazenz/heatmap/master/autos.clean.csv.
Let’s start by making a correlation matrix heatmap for the data set:
- import seaborn as sns
- import matplotlib.pyplot as plt
- import pandas as pd
- import numpy as np
- # Step 0 - Read the dataset, calculate column correlations and make a seaborn heatmap
- plt.rcParams['figure.figsize'] = [10, 6]
- data = pd.read_csv('https://raw.githubusercontent.com/drazenz/heatmap/master/autos.clean.csv')
- # Correlate numeric columns only; recent pandas versions raise an error on text columns
- corr = data.select_dtypes(include='number').corr()
- ax = sns.heatmap(
-     corr,
-     vmin=-1, vmax=1, center=0,
-     cmap=sns.diverging_palette(20, 220, n=200),
-     square=True
- )
- # Rotate the x tick labels so long column names do not overlap
- ax.set_xticklabels(
-     ax.get_xticklabels(),
-     rotation=45,
-     horizontalalignment='right'
- );
Great! Green means positive, red means negative. The stronger the color, the larger the correlation magnitude. Now looking at the chart above, think about the following questions:
If you’re like most people, you’ll find it hard to map the color scale to numbers and vice versa.
Distinguishing positive from negative is easy, as is telling 0 from 1. But what about the second question? Finding the highest negative and positive correlations means finding the strongest red and the strongest green, and to do that you need to scan the entire grid carefully. Try to answer it again and notice how your eyes jump around the plot, sometimes going back to the legend.
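As a cross-check on what the eye picks out, the same question can be answered numerically. The following is a minimal sketch, not part of the original code: `rank_correlations` is a hypothetical helper, and the small synthetic data frame with columns `a` to `d` is made up purely for illustration. It assumes `corr` is a square correlation matrix like the one computed above.

```python
import numpy as np
import pandas as pd


def rank_correlations(corr):
    """List each unordered column pair once, sorted from most negative to most positive."""
    cols = corr.columns
    # Indices of the strict upper triangle: skips the diagonal and mirrored pairs
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "col_a": cols[i],
        "col_b": cols[j],
        "corr": corr.to_numpy()[i, j],
    })
    return pairs.sort_values("corr").reset_index(drop=True)


# Synthetic demo data (an assumption for illustration, not the automobile data set)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),   # strongly positive with a
    "c": -a + rng.normal(scale=0.5, size=200),  # strongly negative with a
    "d": rng.normal(size=200),                  # unrelated noise
})

ranked = rank_correlations(demo.corr())
print(ranked.head(3))  # most negative pairs
print(ranked.tail(3))  # most positive pairs
```

With the automobile data, `rank_correlations(corr)` would give the same kind of table for the numeric columns. A ranking like this settles the question exactly, but it does not replace a plot that lets you see the overall structure at a glance.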
Heatmapz Example
First, install the package from PyPI if you haven't already: pip install heatmapz
Then use the sample code below to draw a more readable heatmap:
- from heatmap import heatmap, corrplot
- plt.figure(figsize=(10, 10))
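- # Correlate numeric columns only; recent pandas versions raise an error on text columns
- corrplot(data.select_dtypes(include='number').corr())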
In addition to color, we’ve added size as a parameter to our heatmap. The size of each square corresponds to the magnitude of the correlation it represents, that is, size(c1, c2) ~ abs(corr(c1, c2)).
Now try to answer the questions using the latter plot. Notice how weak correlations visually disappear, and your eyes are immediately drawn to areas where there’s a high correlation. Also, note that it’s now easier to compare magnitudes of negative vs positive values (lighter red vs lighter green), and we can also compare values that are further apart.
If we’re mapping magnitudes, it’s much more natural to link them to the size of the representing object than to its color. That’s exactly why on bar charts you would use height to display measures, and colors to display categories, but not vice versa.
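The size encoding described above can be sketched with plain matplotlib. This is a minimal illustration under stated assumptions, not the heatmapz implementation: `plot_sized_heatmap` and `max_area` are hypothetical names, the marker area (rather than the side length) is assumed to scale linearly with abs(corr), and the demo columns are synthetic.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_sized_heatmap(corr, max_area=500.0, ax=None):
    """Draw one square per matrix cell: color shows the sign, area shows abs(corr)."""
    labels = list(corr.columns)
    n = len(labels)
    # xs holds the column index and ys the row index of every cell
    xs, ys = np.meshgrid(np.arange(n), np.arange(n))
    values = corr.to_numpy()
    # Assumption: marker area (in points^2) grows linearly with abs(corr)
    areas = np.abs(values) * max_area
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(
        xs.ravel(), ys.ravel(),
        s=areas.ravel(),
        c=values.ravel(), cmap="RdYlGn", vmin=-1, vmax=1,
        marker="s",
    )
    ax.set_xticks(np.arange(n))
    ax.set_xticklabels(labels, rotation=45, horizontalalignment="right")
    ax.set_yticks(np.arange(n))
    ax.set_yticklabels(labels)
    ax.set_xlim(-0.5, n - 0.5)
    ax.set_ylim(-0.5, n - 0.5)
    ax.set_aspect("equal")
    return ax, areas


# Synthetic demo data (an assumption for illustration, not the automobile data set)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),
    "c": -a + rng.normal(scale=2.0, size=200),
    "d": rng.normal(size=200),
})

ax, areas = plot_sized_heatmap(demo.corr())
```

Cells with a correlation near zero get a marker area near zero, which is why weak correlations fade out of view while strong ones dominate the plot.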