Preface
The sinking of the Titanic is a historical tragedy. On April 15, 1912, the Titanic struck an iceberg on her maiden voyage and sank; of the 2,224 passengers and crew aboard, 1,502 died. The tragedy shocked the international community and pushed ships toward better safety regulations. The Titanic's passenger records have been preserved intact. Here we will train an MLP (Multilayer Perceptron) model to predict each passenger's probability of survival.
Download the Titanic passenger dataset
The data can be downloaded here. After downloading, it is split into the following two files, placed under the directory datas:
STEP1. Read the data with a Pandas DataFrame
Here we use the pandas API read_csv to read in the data:
- ch11_1.py
- import numpy as np
- import pandas as pd
- from pprint import pprint
- TRAIN_FILE_PATH='datas/titan_train.csv'
- TEST_FILE_PATH='datas/titan_test.csv'
- train_df = pd.read_csv(TRAIN_FILE_PATH)
- test_df = pd.read_csv(TEST_FILE_PATH)
STEP2. Preprocess the data
Below we use the [] operator on the DataFrame to extract the columns of interest:
- # Headers: PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
- # Filter out 'Ticket', 'PassengerId' and 'Cabin' columns
- cols = ['Survived','Pclass','Name','Sex','Age','SibSp','Parch','Fare','Embarked']
- train_df = train_df[cols]
- # Show top 2 records
- print("\t[Info] Show top 2 records:")
- pprint(train_df.values[:2])
- print("")
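Column selection with the `[]` operator can be tried on a small toy frame; the data below is made up purely for illustration:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
df = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Sex': ['male', 'female'],
    'Ticket': ['A/5 21171', 'PC 17599'],
})

# Passing a list of column names returns a new DataFrame
# restricted to (and ordered by) those columns
subset = df[['Survived', 'Sex']]
print(subset.columns.tolist())  # ['Survived', 'Sex']
```

Columns left out of the list (here 'PassengerId' and 'Ticket') are simply dropped from the result, which is how the snippet above filters out the unwanted fields.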
The transformation process is as follows:
- # Remove column 'Name'
- train_df = train_df.drop(['Name'], axis=1)
- # Show number of rows with null value
- print("\t[Info] Show number of rows with null value:")
- print(train_df.isnull().sum())
- print("")
- # Fill null with age mean value on 'Age' column
- print("\t[Info] Handle null value of Age column...")
- age_mean = train_df['Age'].mean()
- train_df['Age'] = train_df['Age'].fillna(age_mean)
- # Show number of rows with null value
- print("\t[Info] Show number of rows with null value:")
- print(train_df.isnull().sum())
- print("")
- print("\t[Info] Translate value of column Sex into (0,1)...")
- train_df['Sex'] = train_df['Sex'].map({'female':0, 'male':1}).astype(int)
- print("\t[Info] OneHot Encoding on column Embarked...")
- train_df = pd.get_dummies(data=train_df, columns=['Embarked'])
- # Show top 2 records
- print("\t[Info] Show top 2 records:")
- pprint(train_df.values[:2])
- print("")
- ndarray = train_df.values
- print("\t[Info] Translate into ndarray(%s) with shape=%s" % (ndarray.__class__, str(ndarray.shape)))
- print("\t[Info] Show top 2 records:\n%s\n" % (ndarray[:2]))
- # Separate labels from features
- Label = ndarray[:,0]
- Features = ndarray[:,1:]
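The three transformations above (mean imputation, label encoding of Sex, and one-hot encoding of Embarked) can be seen end to end on a tiny made-up frame:

```python
import pandas as pd

# Tiny made-up frame to illustrate the transformations above
df = pd.DataFrame({'Age': [22.0, None],
                   'Sex': ['male', 'female'],
                   'Embarked': ['S', 'C']})

# 1. Replace missing ages with the column mean (mean of [22.0] is 22.0)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# 2. Map 'female'/'male' to the integers 0/1
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

# 3. One-hot encode the port of embarkation; get_dummies appends
#    one indicator column per category value
df = pd.get_dummies(data=df, columns=['Embarked'])
print(df.columns.tolist())  # ['Age', 'Sex', 'Embarked_C', 'Embarked_S']
```

Note that after one-hot encoding, the single 'Embarked' column is replaced by one column per port, which is why the model's input dimension grows beyond the original column count.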
We will use sklearn's preprocessing module to normalize the feature values:
- # Normalized features
- print("\t[Info] Normalized features...")
- from sklearn import preprocessing
- minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
- scaledFeatures = minmax_scale.fit_transform(Features)
- print("\t[Info] Show top 2 records:\n%s\n" % (scaledFeatures[:2]))
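MinMaxScaler rescales each column independently to the requested range via (x - min) / (max - min). A quick sketch on made-up [Age, Fare] rows:

```python
import numpy as np
from sklearn import preprocessing

# Three made-up feature rows: [Age, Fare]
features = np.array([[20.0, 10.0],
                     [40.0, 100.0],
                     [60.0, 55.0]])

scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(features)

# Each column is rescaled independently: (x - min) / (max - min)
print(scaled[:, 0])  # [0.  0.5 1. ]
```

Without this step, large-valued columns such as Fare would dominate the gradient updates over small-valued ones such as the 0/1 Sex indicator.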
We will split the data into training and testing parts at (roughly) an 8:2 ratio:
- # Splitting data into training/testing part
- print("\t[Info] Split data into training/testing part")
- msk = np.random.rand(len(scaledFeatures)) < 0.8
- trainFeatures = scaledFeatures[msk]
- trainLabels = Label[msk]
- testFeatures = scaledFeatures[~msk]
- testLabels = Label[~msk]
- print("\t[Info] Total %d training instances; %d testing instances!" % (trainFeatures.shape[0], testFeatures.shape[0]))
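The boolean-mask split above can be illustrated on a toy array (the data below is made up for illustration):

```python
import numpy as np

rng_values = np.random.rand(10)   # one uniform draw per row
msk = rng_values < 0.8            # True for ~80% of rows -> training

data = np.arange(10)
train_part = data[msk]            # rows where the mask is True
test_part = data[~msk]            # the complementary rows

# Every row lands in exactly one of the two parts
assert len(train_part) + len(test_part) == len(data)
```

Note the 8:2 ratio holds only in expectation; if an exact split is needed, sklearn's train_test_split is a common alternative.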
The preprocessing steps above can be packaged into a reusable function:
- def preprocessData(data_df, sRatio=None):
-     r'''
-     Preprocess a DataFrame into model-ready features and labels.
-     @param data_df(DataFrame):
-         Input DataFrame
-     @param sRatio(float):
-         if sRatio is not None:
-             return (train_data, train_label, test_data, test_label)
-         else:
-             return (features, label)
-     '''
-     # Remove column 'Name'
-     data_df = data_df.drop(['Name'], axis=1)
-     # Show number of rows with null value
-     print("\t[Info] Show number of rows with null value:")
-     print(data_df.isnull().sum())
-     print("")
-     # Fill null with mean value on 'Age' column
-     print("\t[Info] Handle null value of Age column...")
-     age_mean = data_df['Age'].mean()
-     data_df['Age'] = data_df['Age'].fillna(age_mean)
-     # Show number of rows with null value
-     print("\t[Info] Show number of rows with null value:")
-     print(data_df.isnull().sum())
-     print("")
-     print("\t[Info] Translate value of column Sex into (0,1)...")
-     data_df['Sex'] = data_df['Sex'].map({'female':0, 'male':1}).astype(int)
-     print("\t[Info] OneHot Encoding on column Embarked...")
-     data_df = pd.get_dummies(data=data_df, columns=['Embarked'])
-     # Show top 2 records
-     print("\t[Info] Show top 2 records:")
-     pprint(data_df.values[:2])
-     print("")
-     ndarray = data_df.values
-     print("\t[Info] Translate into ndarray(%s) with shape=%s" % (ndarray.__class__, str(ndarray.shape)))
-     print("\t[Info] Show top 2 records:\n%s\n" % (ndarray[:2]))
-     # Separate labels from features
-     Label = ndarray[:,0]
-     Features = ndarray[:,1:]
-     # Normalize features
-     print("\t[Info] Normalize features...")
-     from sklearn import preprocessing
-     minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
-     scaledFeatures = minmax_scale.fit_transform(Features)
-     print("\t[Info] Show top 2 records:\n%s\n" % (scaledFeatures[:2]))
-     if sRatio:
-         # Split data into training/testing parts
-         print("\t[Info] Split data into training/testing part")
-         msk = np.random.rand(len(scaledFeatures)) < sRatio
-         trainFeatures = scaledFeatures[msk]
-         trainLabels = Label[msk]
-         testFeatures = scaledFeatures[~msk]
-         testLabels = Label[~msk]
-         print("\t[Info] Total %d training instances; %d testing instances!" % (trainFeatures.shape[0], testFeatures.shape[0]))
-         return (trainFeatures, trainLabels, testFeatures, testLabels)
-     else:
-         return (scaledFeatures, Label)
The code that follows builds an MLP (Multilayer Perceptron) consisting of: an input layer (9 neurons); a first hidden layer (40 neurons); a second hidden layer (30 neurons); and an output layer (1 neuron).
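As a sanity check on this architecture, the trainable parameter count that model.summary() will report can be computed by hand: each Dense layer has (inputs × units) weights plus one bias per unit.

```python
# Layer widths of the MLP described above: 9 -> 40 -> 30 -> 1
layer_sizes = [9, 40, 30, 1]

# Each Dense layer contributes n_in * n_out weights + n_out biases:
#   9*40 + 40 = 400, 40*30 + 30 = 1230, 30*1 + 1 = 31
params = sum(n_in * n_out + n_out
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
print(params)  # 1661
```

The input width of 9 matches the preprocessed features: Pclass, Sex, Age, SibSp, Parch, Fare, plus the three one-hot Embarked columns.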
STEP1. Build the model
- # Building model
- print("\t[Info] Building MLP model")
- from keras.models import Sequential
- from keras.layers import Dense,Dropout
- model = Sequential()
- model.add(Dense(units=40, input_dim=9, kernel_initializer='uniform', activation='relu'))
- model.add(Dense(units=30, kernel_initializer='uniform', activation='relu'))
- model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
- print("\t[Info] Show model summary...")
- model.summary()
- print("")
- # Training
- print("\t[Info] Start training...")
- model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
- train_history = model.fit(x=trainFeatures, y=trainLabels, validation_split=0.1, epochs=50, batch_size=30, verbose=2)
- # Show Training result
- from utils import *
- if isDisplayAvl():
- show_train_history(train_history, 'acc', 'val_acc')
- show_train_history(train_history, 'loss', 'val_loss')
Add data for Jack and Rose from the movie Titanic
In the movie Titanic, the leads Jack and Rose are fictional characters. We would like to use the model we just trained to predict the survival probability of the two leads. Below is data conceived from the movie plot:
STEP1. Create the data for Jack and Rose
Use pandas.Series to create the Jack and Rose records as follows:
- Jack = pd.Series([0, 'Jack', 3, 'male', 23, 1, 0, 5.0, 'S'])
- Rose = pd.Series([1, 'Rose', 1, 'female', 28, 1, 0, 100.0, 'S'])
- JR_df = pd.DataFrame([list(Jack), list(Rose)], columns=['Survived','Name', 'Pclass','Sex','Age','SibSp','Parch','Fare','Embarked'])
- all_df = pd.concat([train_df, JR_df])
- print("\t[Info] Show last two records:\n%s\n" % (all_df[-2:]))
- print("\t[Info] Making prediction...")
- features, labels = preprocessData(all_df)
- all_probability = model.predict(features)
- all_df.insert(len(all_df.columns), 'probability', all_probability)
- print("\t[Info] The prediction of last two records:\n%s\n" % (all_df[-2:]))
- print("")
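The sigmoid output of the model is a survival probability in [0, 1]; a common convention (an assumption here, not something the original code does) is to threshold at 0.5 to obtain a hard survived/not-survived label:

```python
import numpy as np

# Made-up model outputs standing in for model.predict(features)
probabilities = np.array([[0.12], [0.87]])

# Threshold the probabilities at 0.5 to get class labels
predicted_class = (probabilities >= 0.5).astype(int)
print(predicted_class.ravel())  # [0 1]
```

Keeping the raw probabilities, as the snippet above does for Jack and Rose, is more informative than the hard labels when comparing individual passengers.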
The full code is linked below:
Supplement
* Apply one-hot encoding to a pandas DataFrame