Sunday, March 14, 2021

[ ML Article Collection ] Medium - Data Preprocessing Concepts with Python

Preface

(article source) A robust method to make data ready for machine learning estimators


In this article, we will study some important data preprocessing methods. Preprocessing is a critical step: we inspect the data and put it into a suitable form so that the estimators (algorithms) fit well and achieve good accuracy.

Topics to be covered:
1. Standardization
2. Scaling with sparse data and outliers
3. Normalization
4. Categorical Encoding
5. Imputation

Standardization
Standardization is a process that adjusts the mean and standard deviation of the data points. In raw data, feature values can vary from very low to very high, which can degrade model performance. Standardization rescales each feature so that its mean becomes zero and its standard deviation becomes one.

The formula for standardization is shown below:

z = (x − μ) / σ

where μ is the feature mean and σ is its standard deviation.

When we fit an algorithm to our data, it often assumes that the data is centered and that all features have variance of the same order; otherwise the estimator will not predict correctly. The sklearn library provides the StandardScaler class in its preprocessing module to standardize a data set.

We use the import command to use this feature in Python:
  # Before modeling, we should always apply feature scaling
  # (assumes X_train and X_test have already been split from the data set)
  from sklearn.preprocessing import StandardScaler

  sc = StandardScaler()
  X_train = sc.fit_transform(X_train)  # fit on the training data, then transform it
  X_test = sc.transform(X_test)        # reuse the training statistics on the test set
Scaling with sparse data and outliers

Scaling with Sparse data
Scaling brings feature values into a fixed range such as 0 to 1. Two transformers do this: MinMaxScaler and MaxAbsScaler. Below is an example with Python:
  import numpy as np
  from sklearn.preprocessing import MinMaxScaler

  X_train = np.array([[ 1.,  0.,  2.],
                      [ 2.,  0., -1.],
                      [ 0.,  2., -1.]])
  min_max_scaler = MinMaxScaler()
  X_train_minmax = min_max_scaler.fit_transform(X_train)
  print(X_train_minmax)
Output:
  [[0.5 0.  1. ]
   [1.  0.  0. ]
   [0.  1.  0. ]]
As we can see, the input values are mapped into the range 0 to 1.

For sparse data, centering before scaling is not a good idea because it may destroy the sparsity structure. So it is better to scale raw input whose values are on different scales without centering it, which is what MaxAbsScaler does; a sketch follows below.
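Since the article mentions MaxAbsScaler but does not show it, here is a minimal sketch (the toy matrix is made up for illustration) of how it scales a sparse CSR matrix without centering it:
  import numpy as np
  from scipy.sparse import csr_matrix
  from sklearn.preprocessing import MaxAbsScaler

  # A small, mostly-zero matrix stored in CSR format (toy data for illustration)
  X_sparse = csr_matrix(np.array([[ 1., 0.,  2.],
                                  [ 4., 0.,  0.],
                                  [ 0., 2., -8.]]))

  # MaxAbsScaler divides every feature by its maximum absolute value,
  # so zero entries stay zero and the sparsity structure is preserved
  X_scaled = MaxAbsScaler().fit_transform(X_sparse)
  print(X_scaled.toarray())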

Scaling with Outliers:
When raw data contains many outliers, scaling with the mean and variance does not work well, because the mean and variance are strongly influenced by the outliers. So we have to use a more robust method based on the interquartile range (IQR), the range between the 25th and 75th percentiles: RobustScaler removes the median and scales the data according to this quantile range.

The RobustScaler takes some parameters to perform scaling:
* The first parameter is with_centering: if True, the data is centered (at the median) before scaling.
* The second parameter is with_scaling: if True, the data is scaled to the quantile range (the IQR by default).

Example with Python:
  from sklearn.preprocessing import RobustScaler

  X = [
      [   1.,   0.,    2.],
      [   2.,   0.,   -1.],
      [   0.,   2.,   -1.],
      [   3.,   1.,    1.],
      [ 100., 100., -100.]   # outlier row
  ]

  transformer = RobustScaler(with_scaling=True).fit(X)
  transformer.transform(X)
Output:
  array([[ -0.5,  -0.5,   1.5],
         [  0. ,  -0.5,   0. ],
         [ -1. ,   0.5,   0. ],
         [  0.5,   0. ,   1. ],
         [ 49. ,  49.5, -49.5]])
Normalization
Normalization rescales each individual sample (row) to unit norm; in scikit-learn this is done with the normalize function and the Normalizer class described below. The process is useful when we work with a quadratic form on pairs of samples, whether kernel-based or dot-product-based.

It is also the basis of the vector space model, i.e. the vectors that represent text data samples, where it makes comparing and filtering the data easier.

Normalization comes in two forms, as shown below:

Normalize: the normalize function scales input vectors to unit norm. The norm parameter controls which norm is applied to the non-zero values; it takes three options, 'l1', 'l2', and 'max', where 'l2' is the default.
Normalizer: a transformer class that performs the same operation; it is stateless, so its fit method is effectively optional.

Example with Python:
  from sklearn.preprocessing import normalize

  X = [
      [  1., 0.,  2.],
      [  2., 0., -1.],
      [  0., 2., -1.],
      [ -1., 1., -2.]
  ]
  X_normalized = normalize(X, norm='l2')
  print(X_normalized)
Example with Normalizer:
  from sklearn.preprocessing import Normalizer

  X = [
      [  1., 0.,  2.],
      [  2., 0., -1.],
      [  0., 2., -1.],
      [ -1., 1., -2.]
  ]

  normalizer = Normalizer().fit(X)  # fit does nothing here; Normalizer is stateless
  normalizer.transform(X)
Because it is stateless, the Normalizer is useful at the beginning of a data-processing pipeline, as in the sketch below.
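As a small sketch under assumed names (X_train and y_train are hypothetical and not defined here), Normalizer can sit at the front of an sklearn Pipeline:
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import Normalizer

  # Normalize each sample first, then fit a classifier on the normalized data
  pipe = Pipeline([
      ('normalize', Normalizer(norm='l2')),
      ('model', LogisticRegression())
  ])
  # pipe.fit(X_train, y_train)   # X_train / y_train are assumed to exist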

When we use sparse input, it is important to convert it to Compressed Sparse Rows (CSR) format, available as scipy.sparse.csr_matrix, to avoid multiple memory copies; a small sketch follows.
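As a minimal sketch (the toy matrix is made up for illustration), sparse input can be converted to CSR before it is passed to the preprocessing functions:
  import numpy as np
  from scipy.sparse import csr_matrix
  from sklearn.preprocessing import normalize

  # Convert a mostly-zero array to CSR so sklearn can work on it without extra copies
  X_sparse = csr_matrix(np.array([[ 1., 0.,  2.],
                                  [ 0., 0., -1.],
                                  [ 0., 2.,  0.]]))

  # normalize accepts CSR input directly and returns a sparse result
  X_normalized = normalize(X_sparse, norm='l2')
  print(X_normalized.toarray())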

Categorical Encoding
In a raw data set, some columns do not hold continuous values but binary or multi-class categories. To convert them into integer values we use encoding methods. Some encoding methods are given below:

Get Dummies: uses the pandas library to create new feature columns that encode the categories as 0 and 1.
Label Encoder: encodes (typically binary) categories into numeric values; provided by the sklearn library.
One Hot Encoder: another sklearn feature that converts category classes into new 0/1 feature columns.
Hashing: more useful than one-hot encoding in high dimensions; it is used when a feature has high cardinality.

There are many other encoding methods such as mean encoding, Helmert encoding, ordinal encoding, probability ratio encoding, and so on; a short sketch of LabelEncoder, OneHotEncoder, and FeatureHasher follows after the pandas example below.

Example with Python:
  df1 = pd.get_dummies(df['State'], drop_first=True)  # assumes a DataFrame df with a categorical 'State' column
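The following is a minimal sketch (not from the original article) of the sklearn encoders mentioned above; the 'State' column and its values are made up for illustration:
  import pandas as pd
  from sklearn.feature_extraction import FeatureHasher
  from sklearn.preprocessing import LabelEncoder, OneHotEncoder

  # Hypothetical categorical data
  df = pd.DataFrame({'State': ['NY', 'CA', 'CA', 'TX']})

  # LabelEncoder maps each category to an integer label
  le = LabelEncoder()
  print(le.fit_transform(df['State']))          # e.g. [1 0 0 2]

  # OneHotEncoder expands the column into one 0/1 column per category
  ohe = OneHotEncoder()
  print(ohe.fit_transform(df[['State']]).toarray())

  # FeatureHasher hashes category strings into a fixed number of columns,
  # which keeps the dimensionality bounded for high-cardinality features
  fh = FeatureHasher(n_features=4, input_type='string')
  print(fh.transform([[s] for s in df['State']]).toarray())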


Imputation
When raw data contains missing values, the process of filling the missing records with numeric values is known as imputation.

For demonstration, let's create a random data frame:
  # Import the pandas and numpy libraries
  import pandas as pd
  import numpy as np

  df = pd.DataFrame(
      np.random.randn(4, 3),
      index=['a', 'c', 'e', 'h'],
      columns=['First', 'Second', 'Three']
  )

  # Reindexing introduces rows that contain only missing (NaN) values
  df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
  df


Now replace the missing values with zero:
  print("NaN replaced with '0':")
  print(df.fillna(0))
Replacing the missing values with the mean:
  from sklearn.impute import SimpleImputer

  # Each NaN is replaced by the mean of its column
  imp = SimpleImputer(missing_values=np.nan, strategy='mean')
  imp.fit_transform(df)
Conclusion
Data preprocessing is an important step that makes a data set more reliable for our estimators.


