Preface
(article source) A robust method to make data ready for machine learning estimators

In this article, we will study some important data preprocessing methods. Inspecting the data and putting it into a suitable form is a very important step, so that the estimators (algorithms) fit well and achieve good accuracy.
Topics to be covered:
- Standardization
- Scaling with sparse data
- Scaling with outliers
- Normalization
- Categorical encoding
- Imputation
Standardization
Standardization is a process that deals with the mean and standard deviation of the data points. In raw data the values vary from very low to very high, which can hurt model performance, so we standardize: after standardization the mean becomes zero and the standard deviation becomes one.
The formula for standardization is shown below:
z = (x − mean) / standard deviation
Many estimators assume that the data is centered and that all features have variance of the same order; otherwise they may not predict well. The sklearn library provides StandardScaler in the preprocessing module to standardize a data set.
We use the import statement to use this feature in Python:
- # Before modeling, we should always apply some preprocessing scaling.
- # Feature Scaling
- from sklearn.preprocessing import StandardScaler
- sc = StandardScaler()
- X_train = sc.fit_transform(X_train)
- # The test set is transformed with the statistics learned on the training set, to avoid data leakage.
- X_test = sc.transform(X_test)
Scaling with Sparse data
Scaling maps feature values into a given range, typically between "0" and "1". Two common methods are MinMaxScaler and MaxAbsScaler (the latter divides by the maximum absolute value, giving a range of [-1, 1]). Below is an example with Python:
- import numpy as np
- from sklearn.preprocessing import MinMaxScaler
- X_train = np.array([[ 1., 0., 2.], [ 2., 0., -1.], [ 0., 2., -1.]])
- min_max_scaler = MinMaxScaler()
- X_train_minmax = min_max_scaler.fit_transform(X_train)
- print(X_train_minmax)
- [[0.5 0. 1. ]
- [1. 0. 0. ]
- [0. 1. 0. ]]
When the data is sparse, centering is not a good idea because it would change (densify) its structure. It is still reasonable to scale sparse input whose features are on different scales; MaxAbsScaler is suited for this because it only scales and does not shift the data, as sketched below.
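A minimal sketch of MaxAbsScaler, reusing the small matrix from above (made-up data for illustration); each feature is divided by its maximum absolute value, so zero entries stay zero:
- import numpy as np
- from sklearn.preprocessing import MaxAbsScaler
- X_train = np.array([[ 1., 0., 2.], [ 2., 0., -1.], [ 0., 2., -1.]])
- max_abs_scaler = MaxAbsScaler()
- X_train_maxabs = max_abs_scaler.fit_transform(X_train)
- print(X_train_maxabs)
- [[ 0.5 0. 1. ]
- [ 1. 0. -0.5]
- [ 0. 1. -0.5]]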
Scaling with Outliers:
When raw data contain many outliers, scaling with the mean and variance does not work well, because the mean and variance are themselves strongly influenced by outliers. So we use a more robust method based on the median and the interquartile range (IQR): the median is removed and the data is scaled by the quantile range, which by default spans the 25th to 75th percentiles.
The RobustScaler takes a few parameters to control this scaling, such as with_centering, with_scaling, and quantile_range:
Example with Python:
- from sklearn.preprocessing import RobustScaler
- X = [
- [ 1., 0., 2.],
- [ 2., 0., -1.],
- [ 0., 2., -1.],
- [ 3., 1., 1.],
- [100, 100, -100]
- ]
- transformer = RobustScaler(with_scaling=True).fit(X)
- transformer.transform(X)
- array([[ -0.5, -0.5, 1.5],
- [ 0. , -0.5, 0. ],
- [ -1. , 0.5, 0. ],
- [ 0.5, 0. , 1. ],
- [ 49. , 49.5, -49.5]])
Normalization
Normalization is the process of scaling individual samples so that they have unit norm (note this is different from min-max scaling, which works per feature). It is useful when we quantify the similarity of pairs of samples with a quadratic form such as a dot product or a kernel.
It is also the basis of the vector space model frequently used with text data samples, where normalized vectors make comparing and filtering the data easier.
Two common types of normalization, using the l1 and l2 norms, are shown below (the examples use the l2 norm; an l1 sketch follows them):
Example with Python:
- from sklearn.preprocessing import normalize
- X = [
- [ 1., 0., 2.],
- [ 2., 0., -1.],
- [ 0., 2., -1.],
- [ -1., 1., -2.]
- ]
- X_normalized = normalize(X, norm='l2')
- print(X_normalized)
- from sklearn.preprocessing import Normalizer
-
- X = [
- [ 1., 0., 2.],
- [ 2., 0., -1.],
- [ 0., 2., -1.],
- [ -1., 1., -2.]
- ]
-
- normalizer = Normalizer().fit(X)
- normalizer.transform(X)
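As a hedged sketch of the other common norm, the same data can be normalized with the l1 norm, so that the absolute values in each row sum to one:
- from sklearn.preprocessing import normalize
- X = [
- [ 1., 0., 2.],
- [ 2., 0., -1.],
- [ 0., 2., -1.],
- [ -1., 1., -2.]
- ]
- # norm='l1' divides each row by the sum of absolute values in that row
- X_l1 = normalize(X, norm='l1')
- print(X_l1)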
When we use sparse input, it is important to convert it to CSR (Compressed Sparse Row) format, provided by scipy.sparse.csr_matrix, to avoid unnecessary memory copies.
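A small sketch of this (the sparse matrix below is made-up for illustration); normalize accepts CSR input directly and returns a sparse result:
- from scipy.sparse import csr_matrix
- from sklearn.preprocessing import normalize
- X_sparse = csr_matrix([[ 1., 0., 2.], [ 0., 0., -1.], [ 0., 2., 0.]])
- # The CSR matrix is normalized row by row without densifying it.
- X_norm = normalize(X_sparse, norm='l2')
- print(X_norm.toarray())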
Categorical Encoding
In a raw data set, some columns are not continuous values but rather binary or multi-level categories. To turn them into numeric values we use encoding methods; the two most common are one-hot encoding and label encoding.
There are many other encoding methods, such as mean encoding, Helmert encoding, ordinal encoding, and probability ratio encoding.
Example with Python:
- df1 = pd.get_dummies(df['State'], drop_first=True)  # one-hot encode the 'State' column, dropping the first level
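As a self-contained sketch with a made-up 'State' column (the values below are only for illustration):
- import pandas as pd
- df = pd.DataFrame({'State': ['NY', 'CA', 'CA', 'TX']})
- # One-hot encode; drop_first avoids the dummy variable trap by dropping one redundant column.
- df1 = pd.get_dummies(df['State'], drop_first=True)
- print(df1)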
Imputation
When raw data contain missing values, filling each missing record with a numeric value is known as imputation.
For demonstration, let's create a random data frame:
- # import the pandas library
- import pandas as pd
- import numpy as np
- df = pd.DataFrame(
- np.random.randn(4, 3),
- index=['a', 'c', 'e', 'h'],
- columns=['First', 'Second', 'Three']
- )
- df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
- df
Now replace the missing values with zero:
- print ("NaN replaced with '0':")
- print (df.fillna(0))
- # Alternatively, impute each missing value with the mean of its column using sklearn's SimpleImputer.
- from sklearn.impute import SimpleImputer
-
- imp = SimpleImputer(missing_values=np.nan, strategy='mean')
- imp.fit_transform(df)
Data preprocessing is an important step that makes the data set more suitable and reliable for our estimators.