Preface
(article source) A comprehensive practical guide
Pandas is a highly popular data analysis and manipulation library. It provides numerous functions to transform raw data to a more useful or appropriate format for data analysis and machine learning pipeline.
Real-life data is almost always messy and requires lots of preprocessing to be converted to a nice and clean format. Thanks to its versatile and powerful functions, Pandas expedites data wrangling process.
In this article, we will cover 7 operations that we are likely to encounter in a typical data wrangling process.
Data Set
We will use the Melbourne housing dataset available on Kaggle for the examples. We first read the csv file using the read_csv function:
Output:
The dataset contains 21 features about 13580 houses in Melbourne.
1. Handling dates
The dates are usually stored as objects or strings. The Date column in our dataset is stored as object.
In order to use the date time specific functions of Pandas, we need to convert the dates to an appropriate format. One option is to use the to_datetime function.
2. Changing data types
In addition to dates, we may need to do some other data type conversions as well. A typical case that need conversion would be storing integers as floats. For instance, the Propertycount column in our dataset is stored as float but it should be integer.
The astype function can be used to do data type conversions.
3. Replacing values
Another common operation is to replace values. The Type column contains 3 distinct values which are ‘h’, ‘u’, and ‘t’. We can make these values more informative by replacing them with what they represent.
The replace function is used to accomplish this task.
4. Category data type
A typical dataset contains both numerical and categorical columns. The categorical columns are usually stored with object data type. If the number of distinct categories are very few compared to the number of rows, we can save a substantial amount of memory by using the category data type.
Our dataset contains 13580 rows. The number of categories in the type column is 3. Let’s first check the memory consumption of this column.
We will convert it to the category data type and check the memory consumption again.
It went down from 108768 bytes to 13812 bytes which is a significant decrease.
5. Extracting information from dates
In some cases, we may need to extract a particular part from dates such as weekday, month, year, and so on. We can use the functions under the dt accessor to extract pretty much any piece of information about a date.
Let’s do a couple of examples.
6. Extracting information from text
Textual data usually contains multiple pieces of information. Just like we have done with dates, we may need to extract a piece of information from a text. The str accessor of Pandas provides numerous function to perform such operations efficiently.
Let’s take a look at the Address column.
Output:
The last characters represent the type of location. For instance, “St” stands for street and “Dr” stands for drive. It can be a useful piece of information for grouping the addresses. We can extract the last part of the address by splitting the strings at space character and taking the last split. Here is how we do this operation with the str accessor.
Output:
The split function, as the same suggests, splits a string at the specified character which is space in our case. The next str is used for accessing the pieces after splitting. “-1” means the last one.
7. Standardizing the textual data
In many cases, we do a comparison based on textual data. A typical problem with such comparisons is not having a standard on strings. For instance, same words may not be detected if one starts with a capital case letter and the other is not.
To overcome this issue, we should standardize the strings. We can make them all upper case or lower case letters with the upper and lower functions of the str accessor, respectively.
Output:
Another option is to capitalize the strings.
Output:
(article source) A comprehensive practical guide
Pandas is a highly popular data analysis and manipulation library. It provides numerous functions to transform raw data to a more useful or appropriate format for data analysis and machine learning pipeline.
Real-life data is almost always messy and requires lots of preprocessing to be converted to a nice and clean format. Thanks to its versatile and powerful functions, Pandas expedites data wrangling process.
In this article, we will cover 7 operations that we are likely to encounter in a typical data wrangling process.
Data Set
We will use the Melbourne housing dataset available on Kaggle for the examples. We first read the csv file using the read_csv function:
- import numpy as np
- import pandas as pd
- melb = pd.read_csv("../../datas/kaggle_melbourne_housing_snapshot/melb_data.csv")
- print(melb.shape)
The dataset contains 21 features about 13580 houses in Melbourne.
1. Handling dates
The dates are usually stored as objects or strings. The Date column in our dataset is stored as object.
- melb.Date.dtypes
- # dtype('O')
In order to use the date time specific functions of Pandas, we need to convert the dates to an appropriate format. One option is to use the to_datetime function.
- # Before converting
- melb.Date[:2]
- # After converting
- melb['Date'] = pd.to_datetime(melb['Date'])
- melb.Date[:2]
2. Changing data types
In addition to dates, we may need to do some other data type conversions as well. A typical case that need conversion would be storing integers as floats. For instance, the Propertycount column in our dataset is stored as float but it should be integer.
The astype function can be used to do data type conversions.
- # Before converting
- melb['Propertycount'][:2]
- # After converting
- melb['Propertycount'] = melb['Propertycount'].astype('int')
- melb['Propertycount'][:2]
3. Replacing values
Another common operation is to replace values. The Type column contains 3 distinct values which are ‘h’, ‘u’, and ‘t’. We can make these values more informative by replacing them with what they represent.
The replace function is used to accomplish this task.
- # Before converting
- melb.Type.unique()
- # After converting
- melb.Type.replace({
- 'h': 'house', 'u': 'unit', 't': 'town_house'
- }, inplace=True)
-
- melb.Type.unique()
4. Category data type
A typical dataset contains both numerical and categorical columns. The categorical columns are usually stored with object data type. If the number of distinct categories are very few compared to the number of rows, we can save a substantial amount of memory by using the category data type.
Our dataset contains 13580 rows. The number of categories in the type column is 3. Let’s first check the memory consumption of this column.
- # in bytes
- melb.Type.memory_usage()
We will convert it to the category data type and check the memory consumption again.
- melb['Type'] = melb['Type'].astype('category')
- melb['Type'].memory_usage()
It went down from 108768 bytes to 13812 bytes which is a significant decrease.
5. Extracting information from dates
In some cases, we may need to extract a particular part from dates such as weekday, month, year, and so on. We can use the functions under the dt accessor to extract pretty much any piece of information about a date.
Let’s do a couple of examples.
- # Extract month
- melb['Month'] = melb['Date'].dt.month
- melb[['Date', 'Month']][:5]
- # Extract weekday
- melb['Weekday'] = melb['Date'].dt.weekday
- melb[['Date', 'Weekday']].sample(n=5)
6. Extracting information from text
Textual data usually contains multiple pieces of information. Just like we have done with dates, we may need to extract a piece of information from a text. The str accessor of Pandas provides numerous function to perform such operations efficiently.
Let’s take a look at the Address column.
- melb.Address[:10]
- 0 85 Turner St
- 1 25 Bloomburg St
- 2 5 Charles St
- 3 40 Federation La
- 4 55a Park St
- 5 129 Charles St
- 6 124 Yarra St
- 7 98 Charles St
- 8 6/241 Nicholson St
- 9 10 Valiant St
- Name: Address, dtype: object
- melb['Address'].str.split(' ').str[-1]
- 0 St
- 1 St
- 2 St
- 3 La
- 4 St
- ..
- 13575 Cr
- 13576 Dr
- 13577 St
- 13578 St
- 13579 St
- Name: Address, Length: 13580, dtype: object
7. Standardizing the textual data
In many cases, we do a comparison based on textual data. A typical problem with such comparisons is not having a standard on strings. For instance, same words may not be detected if one starts with a capital case letter and the other is not.
To overcome this issue, we should standardize the strings. We can make them all upper case or lower case letters with the upper and lower functions of the str accessor, respectively.
- melb.Address.str.upper()[:5]
- 0 85 TURNER ST
- 1 25 BLOOMBURG ST
- 2 5 CHARLES ST
- 3 40 FEDERATION LA
- 4 55A PARK ST
- Name: Address, dtype: object
- melb.Suburb.str.capitalize()[:5]
- 0 Abbotsford
- 1 Abbotsford
- 2 Abbotsford
- 3 Abbotsford
- 4 Abbotsford
- Name: Suburb, dtype: object
沒有留言:
張貼留言