Thursday, March 11, 2021

[Python Article Collection] Medium - 7 Must-Know Data Wrangling Operations with Python Pandas

Preface
(article source) A comprehensive practical guide

Pandas is a highly popular data analysis and manipulation library. It provides numerous functions to transform raw data into a format better suited to data analysis and machine learning pipelines.

Real-life data is almost always messy and requires a lot of preprocessing before it reaches a clean, usable format. Thanks to its versatile and powerful functions, Pandas expedites the data wrangling process.

In this article, we will cover 7 operations that we are likely to encounter in a typical data wrangling process.

Data Set
We will use the Melbourne housing dataset available on Kaggle for the examples. We first read the CSV file using the read_csv function:
    import numpy as np
    import pandas as pd

    melb = pd.read_csv("../../datas/kaggle_melbourne_housing_snapshot/melb_data.csv")
    print(melb.shape)
Output:
(13580, 21)

The dataset contains 21 features about 13580 houses in Melbourne.

1. Handling dates
Dates are usually stored as objects or strings. The Date column in our dataset is stored with the object data type.
    melb.Date.dtypes
    # dtype('O')

In order to use the date time specific functions of Pandas, we need to convert the dates to an appropriate format. One option is to use the to_datetime function.
    # Before converting
    melb.Date[:2]

0 3/12/2016
1 4/02/2016
Name: Date, dtype: object

    # After converting
    melb['Date'] = pd.to_datetime(melb['Date'])
    melb.Date[:2]

0 2016-03-12
1 2016-04-02
Name: Date, dtype: datetime64[ns]
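One caveat worth checking: raw values such as 3/12/2016 are ambiguous between day-first and month-first order. The sketch below assumes the dataset writes dates day-first (the usual Australian convention; this is an assumption, not something the article states) and passes an explicit format so that nothing is guessed. The two sample strings are copied from the preview above, and the variable names are illustrative.

```python
import pandas as pd

# Two raw values copied from the Date column preview above.
raw = pd.Series(["3/12/2016", "4/02/2016"], name="Date")

# Assumption: day/month/year ordering. An explicit format avoids
# silently reading ambiguous strings as month/day/year.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Under this assumption the first row is read as 3 December 2016 rather than 12 March 2016, so it is worth confirming the convention against the data source before relying on the parsed dates.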

2. Changing data types
In addition to dates, we may need to do other data type conversions as well. A typical case that needs conversion is integers stored as floats. For instance, the Propertycount column in our dataset is stored as float, but it should be an integer.

The astype function can be used to do data type conversions.
    # Before converting
    melb['Propertycount'][:2]

0 4019.0
1 4019.0
Name: Propertycount, dtype: float64

    # After converting
    melb['Propertycount'] = melb['Propertycount'].astype('int')
    melb['Propertycount'][:2]

0 4019
1 4019
Name: Propertycount, dtype: int32
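Note that astype('int') raises an error when the column contains missing values, because NaN cannot be stored in a plain integer column. A minimal sketch of one workaround, using pandas' nullable Int64 dtype on a made-up Series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical column with one missing value.
counts = pd.Series([4019.0, None, 7570.0], name="Propertycount")

# A plain integer cast fails because NaN has no integer representation.
try:
    counts.astype("int")
    plain_int_ok = True
except (ValueError, TypeError):
    plain_int_ok = False

# The nullable integer dtype keeps the missing value as <NA>.
nullable = counts.astype("Int64")

print(plain_int_ok, nullable.dtype, nullable.isna().sum())
```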

3. Replacing values
Another common operation is to replace values. The Type column contains 3 distinct values which are ‘h’, ‘u’, and ‘t’. We can make these values more informative by replacing them with what they represent.

The replace function is used to accomplish this task.
    # Before converting
    melb.Type.unique()

array(['h', 'u', 't'], dtype=object)

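    # After converting
    # Assign the result back instead of calling replace(inplace=True) on an
    # attribute-accessed column, which may operate on a temporary copy.
    melb['Type'] = melb['Type'].replace({
        'h': 'house', 'u': 'unit', 't': 'town_house'
    })

    melb.Type.unique()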

array(['house', 'unit', 'town_house'], dtype=object)
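When the mapping might not cover every value, it helps to know how replace differs from map. The sketch below uses a small made-up Series with an extra code "x" that is not in the dataset; it is only meant to illustrate the difference in behavior.

```python
import pandas as pd

# "x" is a hypothetical code that the mapping does not cover.
codes = pd.Series(["h", "u", "t", "x"], dtype=object)
labels = {"h": "house", "u": "unit", "t": "town_house"}

replaced = codes.replace(labels)  # values missing from the dict stay as they are
mapped = codes.map(labels)        # values missing from the dict become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```

replace is the safer choice when unknown values should pass through untouched, while map makes unmapped values easy to spot because they turn into missing values.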

4. Category data type
A typical dataset contains both numerical and categorical columns. The categorical columns are usually stored with the object data type. If the number of distinct categories is very small compared to the number of rows, we can save a substantial amount of memory by using the category data type.

Our dataset contains 13580 rows. The number of categories in the Type column is 3. Let’s first check the memory consumption of this column.
    # in bytes
    melb.Type.memory_usage()

108768

We will convert it to the category data type and check the memory consumption again.
    melb['Type'] = melb['Type'].astype('category')
    melb['Type'].memory_usage()

13812

It went down from 108768 bytes to 13812 bytes which is a significant decrease.
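For an object column, memory_usage() by default counts only the 8-byte references to the strings (plus the index), not the strings themselves: 13580 × 8 + 128 bytes of index gives the 108768 reported above. Passing deep=True also counts the string objects. The sketch below runs on synthetic data, not the real column, to show that the saving from the category type is larger than the shallow figure suggests.

```python
import pandas as pd

# Synthetic stand-in for the Type column: 3 distinct labels, 3000 rows.
types = pd.Series(["house", "unit", "town_house"] * 1000, dtype=object)

# Shallow size: only the pointers held by the object array.
shallow = types.memory_usage(index=False)

# Deep size: pointers plus the Python string objects they refer to.
deep = types.memory_usage(index=False, deep=True)

# Category size: small integer codes plus one copy of each label.
as_category = types.astype("category").memory_usage(index=False, deep=True)

print(shallow, deep, as_category)
```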

5. Extracting information from dates
In some cases, we may need to extract a particular part from dates such as weekday, month, year, and so on. We can use the functions under the dt accessor to extract pretty much any piece of information about a date.

Let’s do a couple of examples.
    # Extract month
    melb['Month'] = melb['Date'].dt.month
    melb[['Date', 'Month']][:5]


    # Extract weekday
    melb['Weekday'] = melb['Date'].dt.weekday
    melb[['Date', 'Weekday']].sample(n=5)
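As a self-contained illustration of the dt accessor, the sketch below builds a two-row Series from the converted dates shown earlier and pulls out several parts at once. The frame and column names are illustrative; dt.weekday numbers the days from Monday=0 to Sunday=6, and dt.day_name() returns the readable name.

```python
import pandas as pd

# Two dates taken from the converted preview earlier in the article.
dates = pd.Series(pd.to_datetime(["2016-03-12", "2016-04-02"]), name="Date")

parts = pd.DataFrame({
    "Date": dates,
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Weekday": dates.dt.weekday,     # Monday=0 ... Sunday=6
    "DayName": dates.dt.day_name(),  # e.g. "Saturday"
})

print(parts)
```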


6. Extracting information from text
Textual data usually contains multiple pieces of information. Just as we did with dates, we may need to extract a piece of information from a text. The str accessor of Pandas provides numerous functions to perform such operations efficiently.

Let’s take a look at the Address column.
Output:
0          85 Turner St
1       25 Bloomburg St
2          5 Charles St
3      40 Federation La
4           55a Park St
5        129 Charles St
6          124 Yarra St
7         98 Charles St
8    6/241 Nicholson St
9         10 Valiant St
Name: Address, dtype: object
The last characters represent the type of location. For instance, “St” stands for street and “Dr” stands for drive. This can be a useful piece of information for grouping the addresses. We can extract the last part of the address by splitting the strings at the space character and taking the last piece. Here is how we do this operation with the str accessor.
    melb['Address'].str.split(' ').str[-1]
Output:
0        St
1        St
2        St
3        La
4        St
         ..
13575    Cr
13576    Dr
13577    St
13578    St
13579    St
Name: Address, Length: 13580, dtype: object
The split function, as the name suggests, splits a string at the specified character, which is a space in our case. The next str is used for accessing the pieces after splitting; “-1” selects the last one.
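The same split-and-index pattern can be tried on a handful of addresses copied from the preview above. The sketch below also counts the extracted suffixes with value_counts, which is one possible way to use them for grouping; the variable names are illustrative.

```python
import pandas as pd

# Three addresses copied from the Address preview above.
addresses = pd.Series(
    ["85 Turner St", "40 Federation La", "6/241 Nicholson St"],
    name="Address",
)

# Split on spaces, then take the last piece of each resulting list.
suffix = addresses.str.split(" ").str[-1]

# Count how often each street type occurs.
suffix_counts = suffix.value_counts()

print(suffix.tolist())
print(suffix_counts.to_dict())
```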

7. Standardizing the textual data
In many cases, we compare values based on textual data. A typical problem with such comparisons is the lack of a standard form for the strings. For instance, two occurrences of the same word will not match if one starts with a capital letter and the other does not.

To overcome this issue, we should standardize the strings. We can make them all upper case or lower case letters with the upper and lower functions of the str accessor, respectively.
    melb.Address.str.upper()[:5]
Output:
0        85 TURNER ST
1     25 BLOOMBURG ST
2        5 CHARLES ST
3    40 FEDERATION LA
4         55A PARK ST
Name: Address, dtype: object
Another option is to capitalize the strings.
    melb.Suburb.str.capitalize()[:5]
Output:
0    Abbotsford
1    Abbotsford
2    Abbotsford
3    Abbotsford
4    Abbotsford
Name: Suburb, dtype: object
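One detail to keep in mind is that capitalize upper-cases only the first character and lower-cases the rest, so multi-word names lose their inner capitals; title capitalizes every word. The sketch below uses made-up variants of suburb names (not rows from the dataset) and also shows a case-insensitive comparison after stripping whitespace and lower-casing.

```python
import pandas as pd

# Hypothetical, inconsistently formatted suburb names.
suburbs = pd.Series(["abbotsford", "ST KILDA", " St Kilda "])

capitalized = suburbs.str.capitalize()         # first character only
titled = suburbs.str.strip().str.title()       # every word capitalized
normalized = suburbs.str.strip().str.lower()   # canonical form for comparing

# Comparison against a lower-case target now ignores case and padding.
matches = normalized == "st kilda"

print(capitalized.tolist())
print(titled.tolist())
print(matches.tolist())
```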
