程式扎記: [Python 文章收集] medium - 7 Must-Know Data Wrangling Operations with Python Pandas

2021年3月11日星期四

[Python 文章收集] medium - 7 Must-Know Data Wrangling Operations with Python Pandas

Preface
(article source) A comprehensive practical guide

Pandas is a highly popular data analysis and manipulation library. It provides numerous functions to transform raw data to a more useful or appropriate format for data analysis and machine learning pipeline.

Real-life data is almost always messy and requires lots of preprocessing to be converted to a nice and clean format. Thanks to its versatile and powerful functions, Pandas expedites data wrangling process.

In this article, we will cover 7 operations that we are likely to encounter in a typical data wrangling process.

Data Set
We will use the Melbourne housing dataset available on Kaggle for the examples. We first read the csv file using the read_csv function:

view plaincopy to clipboardprint?
import numpy as np  
import pandas as pd  
  
melb = pd.read_csv("../../datas/kaggle_melbourne_housing_snapshot/melb_data.csv")  
print(melb.shape)  

Output:

(13580, 21)

The dataset contains 21 features about 13580 houses in Melbourne.

1. Handling dates
The dates are usually stored as objects or strings. The Date column in our dataset is stored as object.

view plaincopy to clipboardprint?
melb.Date.dtypes  
# dtype('O')  

In order to use the date time specific functions of Pandas, we need to convert the dates to an appropriate format. One option is to use the to_datetime function.

view plaincopy to clipboardprint?
# Before converting  
melb.Date[:2]  

0 3/12/2016
1 4/02/2016
Name: Date, dtype: object

view plaincopy to clipboardprint?
# After converting  
melb['Date'] = pd.to_datetime(melb['Date'])  
melb.Date[:2]  

0 2016-03-12
1 2016-04-02
Name: Date, dtype: datetime64[ns]

2. Changing data types
In addition to dates, we may need to do some other data type conversions as well. A typical case that need conversion would be storing integers as floats. For instance, the Propertycount column in our dataset is stored as float but it should be integer.

The astype function can be used to do data type conversions.

view plaincopy to clipboardprint?
# Before converting  
melb['Propertycount'][:2]  

0 4019.0
1 4019.0
Name: Propertycount, dtype: float64

view plaincopy to clipboardprint?
# After converting  
melb['Propertycount'] = melb['Propertycount'].astype('int')  
melb['Propertycount'][:2]  

0 4019
1 4019
Name: Propertycount, dtype: int32

3. Replacing values
Another common operation is to replace values. The Type column contains 3 distinct values which are ‘h’, ‘u’, and ‘t’. We can make these values more informative by replacing them with what they represent.

The replace function is used to accomplish this task.

view plaincopy to clipboardprint?
# Before converting  
melb.Type.unique()  

array(['h', 'u', 't'], dtype=object)

view plaincopy to clipboardprint?
# After converting  
melb.Type.replace({  
   'h': 'house', 'u': 'unit', 't': 'town_house'  
}, inplace=True)  
​  
melb.Type.unique()  

array(['house', 'unit', 'town_house'], dtype=object)

4. Category data type
A typical dataset contains both numerical and categorical columns. The categorical columns are usually stored with object data type. If the number of distinct categories are very few compared to the number of rows, we can save a substantial amount of memory by using the category data type.

Our dataset contains 13580 rows. The number of categories in the type column is 3. Let’s first check the memory consumption of this column.

view plaincopy to clipboardprint?
# in bytes  
melb.Type.memory_usage()  

108768

We will convert it to the category data type and check the memory consumption again.

view plaincopy to clipboardprint?
melb['Type'] = melb['Type'].astype('category')  
melb['Type'].memory_usage()  

13812

It went down from 108768 bytes to 13812 bytes which is a significant decrease.

5. Extracting information from dates
In some cases, we may need to extract a particular part from dates such as weekday, month, year, and so on. We can use the functions under the dt accessor to extract pretty much any piece of information about a date.

Let’s do a couple of examples.

view plaincopy to clipboardprint?
# Extract month  
melb['Month'] = melb['Date'].dt.month  
melb[['Date', 'Month']][:5]  

view plaincopy to clipboardprint?
# Extract weekday  
melb['Weekday'] = melb['Date'].dt.weekday  
melb[['Date', 'Weekday']].sample(n=5)  

6. Extracting information from text
Textual data usually contains multiple pieces of information. Just like we have done with dates, we may need to extract a piece of information from a text. The str accessor of Pandas provides numerous function to perform such operations efficiently.

Let’s take a look at the Address column.

view plaincopy to clipboardprint?
melb.Address[:10]  

Output:

view plaincopy to clipboardprint?
0          85 Turner St  
1       25 Bloomburg St  
2          5 Charles St  
3      40 Federation La  
4           55a Park St  
5        129 Charles St  
6          124 Yarra St  
7         98 Charles St  
8    6/241 Nicholson St  
9         10 Valiant St  
Name: Address, dtype: object  

The last characters represent the type of location. For instance, “St” stands for street and “Dr” stands for drive. It can be a useful piece of information for grouping the addresses. We can extract the last part of the address by splitting the strings at space character and taking the last split. Here is how we do this operation with the str accessor.

view plaincopy to clipboardprint?
melb['Address'].str.split(' ').str[-1]  

Output:

view plaincopy to clipboardprint?
0        St  
1        St  
2        St  
3        La  
4        St  
         ..  
13575    Cr  
13576    Dr  
13577    St  
13578    St  
13579    St  
Name: Address, Length: 13580, dtype: object  

The split function, as the same suggests, splits a string at the specified character which is space in our case. The next str is used for accessing the pieces after splitting. “-1” means the last one.

7. Standardizing the textual data
In many cases, we do a comparison based on textual data. A typical problem with such comparisons is not having a standard on strings. For instance, same words may not be detected if one starts with a capital case letter and the other is not.

To overcome this issue, we should standardize the strings. We can make them all upper case or lower case letters with the upper and lower functions of the str accessor, respectively.

view plaincopy to clipboardprint?
melb.Address.str.upper()[:5]  

Output:

view plaincopy to clipboardprint?
0        85 TURNER ST  
1     25 BLOOMBURG ST  
2        5 CHARLES ST  
3    40 FEDERATION LA  
4         55A PARK ST  
Name: Address, dtype: object  

Another option is to capitalize the strings.

view plaincopy to clipboardprint?
melb.Suburb.str.capitalize()[:5]  

Output:

view plaincopy to clipboardprint?
0    Abbotsford  
1    Abbotsford  
2    Abbotsford  
3    Abbotsford  
4    Abbotsford  
Name: Suburb, dtype: object  

程式扎記

標籤

2021年3月11日星期四

[Python 文章收集] medium - 7 Must-Know Data Wrangling Operations with Python Pandas

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2021年3月11日 星期四

[Python 文章收集] medium - 7 Must-Know Data Wrangling Operations with Python Pandas

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

2021年3月11日星期四