程式扎記: [ ML Article Collection ] Work of Kaggle: Titanic

Source From Here

1. Overview

The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Goal
It is your job to predict whether or not a passenger survived the sinking of the Titanic. For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.

Metric
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
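
As a minimal sketch of the accuracy metric described above, the snippet below compares predictions against ground-truth labels. The function name `accuracy` and the five labels are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Hypothetical labels for five passengers (not real competition data).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
score = accuracy(y_true, y_pred)
print(score)  # 4 of the 5 predictions match
```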

Submission File Format
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:
* PassengerId (sorted in any order)
* Survived (contains your binary predictions: 1 for survived, 0 for deceased)

PassengerId,Survived  
892,0  
893,1  
894,0  
Etc.  
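
A file in this format could be produced with pandas roughly as follows. This is only a sketch: the helper name `make_submission` and the three predictions are assumptions for illustration, and the CSV is written to an in-memory buffer so the example runs without touching disk.

```python
import io
import pandas as pd

def make_submission(passenger_ids, predictions):
    """Build a two-column submission frame: PassengerId, Survived."""
    sub = pd.DataFrame({
        'PassengerId': list(passenger_ids),
        'Survived': [int(p) for p in predictions],
    })
    # Survived must hold binary predictions only.
    if not sub['Survived'].isin([0, 1]).all():
        raise ValueError('Survived must contain only 0 or 1')
    return sub

# Hypothetical predictions for the first three test passengers.
submission = make_submission([892, 893, 894], [0, 1, 0])

# Written to a buffer here; pass a path such as 'submission.csv'
# to to_csv() to produce an actual file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters here: without it pandas would write an extra unnamed index column, which the format above does not allow.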

2. Setup System Environment
We import the necessary packages below:

import sys  
import os  
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
from os.path import dirname  
# raw string so the backslashes in the Windows path are not treated as escapes  
sys.path.append(os.environ.get('KUTILS_ANALYSIS_ROOT', r'C:\John\Personal\Github\kutils_analysis'))  
from kutils.analysis import histplot, barplot, boxplot, fiplot, corr  
%matplotlib inline  

From here, you can download the necessary files for training/testing data. The data has been split into two groups:
* training set (train.csv):

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

* test set (test.csv):

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

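We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like. Below are the column definitions:

* Survived: survival (0 = no, 1 = yes)
* Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* Sex: sex
* Age: age in years
* SibSp: number of siblings/spouses aboard the Titanic
* Parch: number of parents/children aboard the Titanic
* Ticket: ticket number
* Fare: passenger fare
* Cabin: cabin number
* Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)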
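
The rule that gender_submission.csv encodes (every female passenger survives, nobody else does) can be sketched in a few lines. The four rows below are a hypothetical stand-in for test.csv; only the column names are taken from the dataset.

```python
import pandas as pd

# A tiny stand-in for test.csv (hypothetical rows, same column names).
test_sample = pd.DataFrame(
    {'Sex': ['male', 'female', 'male', 'female']},
    index=pd.Index([892, 893, 894, 895], name='PassengerId'),
)

# Baseline rule: predict 1 (survived) for females, 0 for everyone else.
baseline = (test_sample['Sex'] == 'female').astype(int).rename('Survived')
baseline_df = baseline.reset_index()
print(baseline_df)
```

Any trained model should at least beat this baseline to be worth submitting.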

3. Exploratory Data Analysis

Feature Analysis
The first step is to load the training and test data:

# load the training and test data  
def get_data():  
    train_df = pd.read_csv('train.csv', index_col='PassengerId')  
    test_df = pd.read_csv('test.csv', index_col='PassengerId')  
    return train_df, test_df  
  
train_df, test_df = get_data()  
train_df.head()  

raw data info

# inspect the dataframe for entries, columns, missing values, and data types  
train_df.info()  

Observations:

* ML models need numerical values to compute on; therefore, categorical (object) columns must be converted to numerical values.
* NaN values to address with imputation: ['Age', 'Cabin', 'Embarked']
* The test dataset has one fewer column: ['Survived'], which is the target (label).
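
The two preprocessing needs noted above (imputing missing values and converting categorical columns to numbers) could look roughly like this. It is a sketch on hypothetical rows; the choice of median and mode imputation and of one-hot encoding is an assumption, not necessarily what the follow-up notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the structure of train.csv.
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Count missing values per column.
missing = sample.isna().sum()

# Impute: median for numeric Age, most frequent value for Embarked.
prepared = sample.copy()
prepared['Age'] = prepared['Age'].fillna(prepared['Age'].median())
prepared['Embarked'] = prepared['Embarked'].fillna(prepared['Embarked'].mode()[0])

# Encode categoricals as numbers: a 0/1 flag for Sex, one-hot for Embarked.
prepared['Sex'] = (prepared['Sex'] == 'female').astype(int)
prepared = pd.get_dummies(prepared, columns=['Embarked'], dtype=int)
print(missing)
print(prepared)
```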

raw data distribution
Let's check the distribution of our raw data:

# split categorical and numerical dataframes for analysis  
df_num = train_df[['Age', 'SibSp', 'Parch', 'Fare']]  
df_cat = train_df[['Pclass', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Name']]  
als_hist = histplot.Utils(train_df)  
als_hist.hist(df_num.columns)  

ax_list = als_hist.sns_hist(df_num.columns, col_num=2, figsize=(10, 6))  

als_bar = barplot.Utils(train_df)  
fig, axs = als_bar.stacked_bar('Survived', columns=['Pclass', 'Sex', 'Embarked'], col_num=3, figsize=(15, 5))  

Let's also check the first letter of Cabin:

# mark missing cabins as 'U' (unknown); assign back instead of using inplace=True  
train_df['Cabin'] = train_df['Cabin'].fillna('U')  
# the first letter of the cabin identifies the deck  
train_df['Cabin_type'] = train_df['Cabin'].str[0]  
train_df['Cabin_type'].value_counts()  

Output:

U    687  
C     59  
B     47  
D     33  
E     32  
A     15  
F     13  
G      4  
T      1  
Name: Cabin_type, dtype: int64  

Next, let's see how the survival rate varies across cabin types:

ax = als_bar.stacked_bar('Survived', columns=['Cabin_type'], col_num=2, figsize=(10, 5))  

Observations:

* Only Age has a somewhat normal distribution. The other values are right-skewed, with long tails to the right.
>>>> Does age group affect the survival rate? Bin by age group
>>>> Most of the population is in their late teens to late 30s.
* Most passengers paid a low fare (Fare), and many traveled alone with no family aboard (Parch & SibSp).
* Should we normalize these skewed values with a logarithmic transform?
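
To illustrate the logarithmic transform raised in the last observation, here is a small sketch on hypothetical fares (a few low values plus one very large one). Using `np.log1p` rather than `np.log` is an assumption made so that a fare of 0 stays finite.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: mostly small values plus one very large one.
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 71.28, 512.33])

# log1p computes log(1 + x), so a fare of 0 maps to 0 instead of -inf.
log_fares = np.log1p(fares)

# Sample skewness before and after the transform.
skew_before = fares.skew()
skew_after = log_fares.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transform compresses the long right tail, so the skewness of the transformed values is lower than that of the raw fares.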

correlation with Survived
Then, let's check the correlation between Survived and other columns:

# Gender breakdown  
print(train_df['Sex'].value_counts())  
print()  
print(train_df['Sex'].value_counts(normalize=True)) #Percentage breakdown  

Output:

male      577  
female    314  
Name: Sex, dtype: int64  
  
male      0.647587  
female    0.352413  
Name: Sex, dtype: float64  

# count male survivors and non-survivors, then compute the male survival rate  
male_s1 = train_df[(train_df['Sex']=='male') & (train_df['Survived']==1)].shape[0]  
male_s0 = train_df[(train_df['Sex']=='male') & (train_df['Survived']==0)].shape[0]  
print(f"Male survivors={male_s1}; male non-survivors={male_s0}; survival rate={male_s1/(male_s1+male_s0)}")  

Output:

Male survivors=109; male non-survivors=468; survival rate=0.18890814558058924
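
The same per-group rate can be obtained for every value of Sex at once with a groupby. The sketch below uses six hypothetical rows rather than the real training data.

```python
import pandas as pd

# Hypothetical rows with the same column names as train.csv.
sample = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 0],
})

# Survived is 0/1, so its mean within each group is the survival rate.
rate_by_sex = sample.groupby('Sex')['Survived'].agg(['sum', 'count', 'mean'])
print(rate_by_sex)
```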

# https://seaborn.pydata.org/generated/seaborn.barplot.html  
fig, ax = als_bar.sns_barplot(  
    'Survived',   
    ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch', 'SibSp'],   
    col_num=3,   
    figsize=(12, 8)  
)  

Observations:

* Many more females survived (Sex)
* People who embarked from Cherbourg had a higher survival rate (Embarked=C). Let's explore this further.
* First and second class passengers survived at higher rates (Pclass in [1, 2]). Due to the location of their cabins?
* Smaller families had higher survival rates; it was best to have 1 or 2 other people in your family (SibSp in [1, 2] or Parch in [1, 2, 3]).

Let's also check the survival counts for the same features:

fig, ax = als_bar.sns_countplot(  
    'Survived',   
    ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch', 'SibSp'],   
    col_num=3,   
    figsize=(13, 7)  
)  

Observation

* Were people more chivalrous back then? Or were women better able to negotiate their way onto lifeboats? Maybe they were with their children?

Embarked analysis
Let's check the Embarked column and its relationship with Fare, Age and SibSp:

als_box = boxplot.Utils(train_df)  
fig, ax = als_box.sns_boxplot(  
    'Embarked',  
    ['Fare', 'Age', 'SibSp'],  
    [  
        'Fare Boxplot By Port of Embarkation',  
        'Age Distribution Boxplot By Embarkation',  
        'SibSp Distribution Boxplot By Embarkation (By Sex)'  
    ]  
)  

Observations

* People from Cherbourg, France were able to pay more for their fare, which supports the hypothesis that wealthier passengers had a higher chance of survival.
* Fare outliers from the Cherbourg group may be worth analyzing.
* Wealth matters. People in 1st class had the highest survival rate, and unlike the other classes, more of them survived than died.
* You have a higher chance of surviving if you're solo or have 1 family member with you compared to a large family. Maybe it was difficult to choose which family member would survive? Or maybe it was too expensive for a large family to have a 1st class cabin with a better location near the boats?

binning of Age
Let's bin Age to see whether any pattern emerges across age groups:

# bins are right-inclusive: (0, 10], (10, 20], ...  
cut_labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']  
cut_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]  
cut_df = train_df  
cut_df['cut_age'] = pd.cut(cut_df['Age'], cut_bins, labels=cut_labels)  
# numeric_only=True skips text columns such as Name and Ticket  
age_bin_df = cut_df.groupby(['cut_age']).mean(numeric_only=True)  
age_bin_df  

# als_bar = barplot.Utils(train_df)  
fig = als_bar.sns_countplot(  
    'Survived',   
    'cut_age',   
    figsize=(10, 5)  
)  

Observations:

* Survival by Age followed a somewhat normal distribution, with a tail to the right for older people.
* More young children and infants survived than died, probably reflecting the "women and children first" mentality.
* Passengers aged 31-40 had a high survival rate.
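
To make the behavior of `pd.cut` in the Age binning concrete, here is a small sketch on hypothetical ages. The label text is illustrative; the point is that each bin includes its right edge, so an age of exactly 10 falls in the first bin, and a missing age stays missing.

```python
import pandas as pd

# Hypothetical ages, including boundary values and one missing entry.
ages = pd.Series([0.42, 10.0, 10.5, 30.0, 80.0, None])
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']

# Intervals are right-inclusive by default: (0, 10], (10, 20], ...
age_bin = pd.cut(ages, bins, labels=labels)
print(age_bin.tolist())
```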

binning of Fare
Let's bin Fare to see whether any pattern emerges across fare ranges:

train_df, test_df = get_data()  
print(f"Fare range: {max(train_df.Fare)}~{min(train_df.Fare)}")  

Fare range: 512.3292~0.0

# the first bin is (0, 100]; fares of exactly 0 fall outside every bin  
cut_labels = ['0-100', '100-200', '200-300', '300-400', '400-500', '500-600']  
cut_bins = [0, 100, 200, 300, 400, 500, 600]  
cut_df = train_df  
cut_df['cut_fare'] = pd.cut(cut_df['Fare'], cut_bins, labels=cut_labels)  
cut_df['cut_fare'] = cut_df.apply(lambda r: r.cut_fare if r.Fare > 0 else 'unknown', axis=1)  
cut_df[['Fare', 'cut_fare'] + list(filter(lambda e: e.startswith('fare_'), train_df.columns))].sample(n=15)  

# numeric_only=True skips text columns such as Name and Ticket  
fare_bin_df = cut_df.groupby(['cut_fare']).mean(numeric_only=True)  
fare_bin_df  

als_bar = barplot.Utils(train_df)  
fig = als_bar.sns_countplot(  
    'Survived',   
    'cut_fare',   
    figsize=(10, 5)  
)  

Observations:

* The higher the fare, the higher the survival rate.
* Passengers with Fare < 100 had clearly the lowest survival rate of all the fare bins.
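
Because most fares are low, fixed-width bins of 100 put nearly every passenger in the first bin. One alternative, which the analysis above does not use, is quantile-based binning with `pd.qcut`. The sketch below applies it to hypothetical fares; the labels Q1 to Q4 are illustrative names.

```python
import pandas as pd

# Hypothetical fares with one extreme value.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])

# Four quantile bins: each holds roughly the same number of passengers.
fare_quartile = pd.qcut(fares, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
counts = fare_quartile.value_counts().sort_index()
print(counts)
```

With equally populated bins, the survival rate per bin is estimated from a comparable number of passengers, which makes the bins easier to compare.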

Outliers Analysis

Fare

als_box.set_figsize((8, 4))  
als_box.s_boxplot("Fare", title='Fare outlier Boxplot')  

train_df.loc[train_df['Fare'] > 500]  

Findings:
These outliers paid the princely sum of 512 pounds. In today's money, that would be worth a staggering 58,864 pounds! (Note: each member of the group had the same ticket number and fare paid.) This was by far the highest fare of any of the guests. They were part of the same group led by the wealthy Mr. Cardeza, along with Miss. Ward the maid and Mr. Lesurer the manservant. They were on their way back to the USA by way of Cherbourg, France.

* All similar age
* Multiple cabins but interestingly all on the 'B' deck. Could this be the best deck on the ship? We should analyze deck levels for our model.
* Feature: Cherbourg was the best port to embark from if you wanted to survive. We will add this to our list of features for the ML models.
* Feature: Is there a correlation between 'PC' tickets and survival rate?

To learn more about this group visit this biography of Miss. Anna Ward
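
Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. As a sketch of that rule, the helper below (a hypothetical name, `iqr_outliers`) lists such points for a set of made-up fares; it is not part of the kutils package used above.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Hypothetical fares with one extreme value.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])
outliers = iqr_outliers(fares)
print(outliers.tolist())
```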

Sibling or Spouse Outlier Analysis

als_box.set_figsize((6, 3))  
ax = als_box.s_boxplot("SibSp", title='SibSp outlier Boxplot')  

# We see that the outliers came from 2 families  
family_df = train_df.loc[train_df['SibSp'] > 4]  
family_df.sort_values('Name')  

# Survival rates for group size.  
# Can we bin people by family or not? Maybe people had a higher survival rate if they had fewer family members?  
train_df.groupby(['SibSp'])['Survived'].mean()  

Output:

SibSp  
0    0.345395  
1    0.535885  
2    0.464286  
3    0.250000  
4    0.166667  
5    0.000000  
8    0.000000  
Name: Survived, dtype: float64  

Family Size Outlier Findings:
Unfortunately, the Goodwin and Sage families, who had 5 or more members, did not fare well. They all perished. Maybe it's because of their socio-economic status. Maybe it's because there were just too many of them to quickly escape to a lifeboat. Or maybe it's because of their cabin placement. Having a large family was tragic on the Titanic.
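
One way to act on the question of binning people by family is to combine SibSp and Parch into a single family-size feature. The sketch below is an assumption about how that might be done: the names FamilySize and FamilyGroup and the cut points are illustrative, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same column names as train.csv.
sample = pd.DataFrame({
    'SibSp':    [0, 1, 5, 0, 1, 8],
    'Parch':    [0, 0, 2, 0, 2, 2],
    'Survived': [0, 1, 0, 1, 1, 0],
})

# Family size counts the passenger plus siblings/spouses and parents/children.
sample['FamilySize'] = sample['SibSp'] + sample['Parch'] + 1

# Bucket into solo / small / large; the cut points are an arbitrary choice.
sample['FamilyGroup'] = pd.cut(
    sample['FamilySize'], bins=[0, 1, 4, 20], labels=['solo', 'small', 'large']
)
rate = sample.groupby('FamilyGroup', observed=True)['Survived'].mean()
print(rate)
```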

Age Outlier Analysis

als_box.set_figsize((6, 3))  
ax = als_box.s_boxplot("Age", title='Age outlier Boxplot')  

# Inspect passengers Ages 65 and up  
age_df = train_df.loc[train_df['Age'] > 64]  
age_df.sort_values('Age', ascending=False)  

Age Outlier Findings:
While being elderly on the Titanic did not bode well, the oldest passenger, Mr. Barkworth, actually survived! He was able to buoy himself on his briefcase and a fur coat until he found an overturned lifeboat. To learn more, visit the biography of Mr. Algernon Barkworth.

Name Analysis

Title
Here I will look at whether the title of the passenger (e.g. Mr., Mrs., etc.) had any correlation with surviving the Titanic.

title_df = train_df.copy()  
title_df['title'] = train_df.Name.apply(lambda x: x.split(',')[1].split(' ')[1].strip())  
title_df['title'].value_counts()  

Output:

Mr.          517  
Miss.        182  
Mrs.         125  
Master.       40  
Dr.            7  
Rev.           6  
Col.           2  
Mlle.          2  
Major.         2  
Lady.          1  
Jonkheer.      1  
Capt.          1  
Sir.           1  
Don.           1  
the            1  
Mme.           1  
Ms.            1  
Name: title, dtype: int64  
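
The lone "the" in the counts comes from a multi-word title ("the Countess"), which splitting on spaces cuts short. A regex that captures everything between the comma and the first period keeps such titles whole. This is a sketch of an alternative extraction rather than the code used above; note that it drops the trailing period, so it yields "Mr" instead of "Mr.".

```python
import pandas as pd

# Three names in the "Surname, Title. Given names" layout of the Name column.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```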

Let's visualize the finding:

als_bar = barplot.Utils(title_df)  
g = als_bar.sns_barplot('Survived', ['title'], figsize=(10, 5))  

title_df['count'] = 1  
# passenger count and mean survival per title, best-surviving titles first  
title_stats = title_df[['title', 'Survived', 'count']].groupby('title').agg({'count': 'size', 'Survived': 'mean'})  
title_stats.reset_index().sort_values(['Survived', 'count'], ascending=[False, False])  

Title Findings:

* Titles were a fairly good predictor of whether a passenger survived the Titanic.
* Female titles are good indicators of survival (Mrs., Miss., Lady. etc.).

This completes the analysis of the Titanic dataset. For the follow-up sections (4. Preprocessing Data, 5. Train Model, 6. Evaluate -> Tune -> Ensemble and 7. Conclusion), please refer to this notebook.


Tuesday, March 9, 2021
