Tuesday, March 9, 2021

[ ML 文章收集 ] Work of Kaggle: Titanic - Machine Learning from Disaster

 Source From Here

1. Overview

The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Goal
It is your job to predict if a passenger survived the sinking of the Titanic or not. For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

Metric
Your score is the percentage of passengers you correctly predict. This is known as accuracy.

Submission File Format
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:
* PassengerId (sorted in any order)
* Survived (contains your binary predictions: 1 for survived, 0 for deceased)
  PassengerId,Survived
  892,0
  893,1
  894,0
  etc.
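As a sketch of producing a file in this format with pandas (the all-zero predictions here are a placeholder, not a real model's output):

```python
import pandas as pd

# Hypothetical predictions for the 418 test passengers (IDs 892..1309).
# A trivial all-zeros vector is used just to illustrate the file shape.
passenger_ids = range(892, 1310)
predictions = [0] * 418

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': predictions,
})
# Writes the header row plus exactly 418 entries, two columns.
submission.to_csv('submission.csv', index=False)
```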
2. Setup System Environment
We import the necessary packages below:
  import sys
  import os
  import pandas as pd
  import numpy as np
  import matplotlib.pyplot as plt
  from os.path import dirname
  sys.path.append(os.environ.get('KUTILS_ANALYSIS_ROOT', r'C:\John\Personal\Github\kutils_analysis'))
  from kutils.analysis import histplot, barplot, boxplot, fiplot, corr
  %matplotlib inline
From here, you can download the necessary files for training/testing data. The data has been split into two groups:
* training set (train.csv):
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

* test set (test.csv):
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assumes all and only female passengers survive, as an example of what a submission file should look like.



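That gender_submission.csv baseline can be reproduced in a few lines. A minimal sketch, using a tiny synthetic stand-in for test.csv (the real file has 418 rows):

```python
import pandas as pd

# Synthetic stand-in for a few rows of test.csv.
test_df = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Sex': ['male', 'female', 'male'],
})

# The baseline rule: predict survival for females only.
baseline = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': (test_df['Sex'] == 'female').astype(int),
})
print(baseline)
```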
3. Exploratory Data Analysis

Feature Analysis
First step is to load in training/testing data:
  # load the training and test data
  def get_data():
      train_df = pd.read_csv('train.csv', index_col='PassengerId')
      test_df = pd.read_csv('test.csv', index_col='PassengerId')
      return train_df, test_df

  train_df, test_df = get_data()
  train_df.head()


raw data info
  # inspect the dataframe for entries, columns, missing values, and data types
  train_df.info()


Observations:
* ML models require numerical inputs, therefore categorical (object-dtype) columns must be converted to numerical values.
* NaN values to address with imputation: ['Age', 'Cabin', 'Embarked']
* The test dataset has one fewer column: ['Survived'], which is the target (label).
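To illustrate the first two observations, here is a rough sketch of imputation plus one-hot encoding on a toy frame (the column values are made up; the notebook's actual preprocessing lives in its later sections):

```python
import pandas as pd

# Toy frame mimicking the problematic columns (synthetic values).
df = pd.DataFrame({
    'Age': [22.0, None, 38.0],
    'Embarked': ['S', None, 'C'],
    'Sex': ['male', 'female', 'female'],
})

# Impute: median for the numeric Age, mode for the categorical Embarked.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# One-hot encode categoricals (avoids imposing a false ordering).
encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'])
```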

raw data distribution
Let's check the distribution of our raw data:
  1. # split categorical and numerical dataframes for analysis  
  2. df_num = train_df[['Age''SibSp''Parch''Fare']]  
  3. df_cat = train_df[['Pclass''Sex''Ticket''Cabin''Embarked''Name']]  
  4. als_hist = histplot.Utils(train_df)  
  5. als_hist.hist(df_num.columns)  


  ax_list = als_hist.sns_hist(df_num.columns, col_num=2, figsize=(10, 6))


  als_bar = barplot.Utils(train_df)
  fig, axs = als_bar.stacked_bar('Survived', columns=['Pclass', 'Sex', 'Embarked'], col_num=3, figsize=(15, 5))
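Note that barplot.Utils comes from the author's private kutils package. A rough equivalent of the stacked bar using only pandas and matplotlib, shown here on a synthetic slice of the data, might look like:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this inside a notebook
import matplotlib.pyplot as plt

# Synthetic stand-in for a few rows of train.csv.
train_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass':   [3, 1, 2, 3, 1, 3],
})

# Survived counts per class, then one stacked bar per category.
counts = pd.crosstab(train_df['Pclass'], train_df['Survived'])
ax = counts.plot(kind='bar', stacked=True)
ax.set_ylabel('count')
plt.tight_layout()
```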

Let's also check the first letter of Cabin:
  train_df['Cabin'].fillna('U', inplace=True)
  train_df['Cabin_type'] = train_df.Cabin.apply(lambda v: 'U' if v is None else v[0])
  train_df['Cabin_type'].value_counts()
Output:
  U    687
  C     59
  B     47
  D     33
  E     32
  A     15
  F     13
  G      4
  T      1
  Name: Cabin_type, dtype: int64
Next, let's see the survival rate for each Cabin type:
  ax = als_bar.stacked_bar('Survived', columns=['Cabin_type'], col_num=2, figsize=(10, 5))


Observations:
* Only Age has a somewhat normal distribution. The other features are right-skewed, with long tails to the right.
>>>> Does age group affect the survival rate? Bin by age group.
>>>> Most of the population is in their late teens to late 30s.
* Most paid a low fare (Fare), and many traveled alone, with no family (Parch & SibSp).
* Should we normalize these values using a logarithmic transform?
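On that normalization question: a log transform compresses the long right tail. A small sketch with synthetic fares (np.log1p is used so zero fares stay valid):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed fares (a few passengers paid far more than most).
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33])

# log1p computes log(1 + x), so a fare of 0 maps to 0 instead of -inf.
log_fares = np.log1p(fares)

# The transform reduces the skewness of the distribution.
print(fares.skew(), log_fares.skew())
```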

correlation with Survived
Then, let's check the correlation between Survived and other columns:
  # Gender breakdown
  print(train_df['Sex'].value_counts())
  print()
  print(train_df['Sex'].value_counts(normalize=True))  # Percentage breakdown
Output:
  male      577
  female    314
  Name: Sex, dtype: int64

  male      0.647587
  female    0.352413
  Name: Sex, dtype: float64
  male_s1 = train_df[(train_df['Sex']=='male') & (train_df['Survived']==1)].shape[0]
  male_s0 = train_df[(train_df['Sex']=='male') & (train_df['Survived']==0)].shape[0]
  print(f"Number of males survived={male_s1}; Number of males not survived={male_s0}; rate = {male_s1/(male_s1+male_s0)}")
Output:
Number of males survived=109; Number of males not survived=468; rate = 0.18890814558058924
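The same per-gender survival rate can be computed more compactly with groupby, since the mean of a 0/1 column is exactly the rate. A sketch on synthetic rows:

```python
import pandas as pd

# Synthetic rows mirroring the Sex/Survived columns.
train_df = pd.DataFrame({
    'Sex':      ['male', 'male', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 1, 0],
})

# Mean of a binary column per group == survival rate per group.
rate = train_df.groupby('Sex')['Survived'].mean()
print(rate)
```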

  1. # https://seaborn.pydata.org/generated/seaborn.barplot.html  
  # https://seaborn.pydata.org/generated/seaborn.barplot.html
  fig, ax = als_bar.sns_barplot(
      'Survived',
      ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch'],
      col_num=3,
      figsize=(12, 8)
  )


Observations:
* Many more females survived (Sex).
* People who embarked from Cherbourg had a higher survival rate (Embarked=C). Let's explore this further.
* First- and second-class passengers survived at higher rates (Pclass in [1, 2]). Due to the location of their cabins?
* Smaller families had higher survival rates; it was best to have 1 or 2 other people in your family (SibSp in [1, 2] or Parch in [1, 2, 3]).

Let's check the correlation between Survived and the other features:
  fig, ax = als_bar.sns_countplot(
      'Survived',
      ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch'],
      col_num=3,
      figsize=(13, 7)
  )


Observation
* Were people more chivalrous back then? Or were women better able to negotiate their way onto the lifeboats? Maybe they were with their children?


Embarked analysis
Let's check the column Embarked and its relationship with Fare, Age, and SibSp:
  als_box = boxplot.Utils(train_df)
  fig, ax = als_box.sns_boxplot(
      'Embarked',
      ['Fare', 'Age', 'SibSp'],
      [
          'Fare Boxplot By Port of Embarkation',
          'Age Distribution Boxplot By Embarkation',
          'SibSp Distribution Boxplot By Embarkation (By Sex)'
      ]
  )


Observations
* People from Cherbourg, France were able to pay more for their fare, supporting the hypothesis that wealthier passengers had a higher chance of survival.
* Fare outliers in the Cherbourg subset may be worth analyzing.
* Wealth matters. People in 1st class had higher rates of survival, and more of them survived than drowned, unlike the other classes.
* You had a higher chance of surviving if you were solo or had 1 family member with you, compared to a large family. Maybe it was difficult to choose which family member would survive? Or maybe a 1st-class cabin, with a better location relative to the boats, was simply more expensive?

binning of Age
Let's do binning of Age to see if there is any feature in them:
  cut_labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
  cut_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
  cut_df = train_df
  cut_df['cut_age'] = pd.cut(cut_df['Age'], cut_bins, labels=cut_labels)
  age_bin_df = cut_df.groupby(['cut_age']).mean()
  age_bin_df


  # als_bar = barplot.Utils(train_df)
  fig = als_bar.sns_countplot(
      'Survived',
      'cut_age',
      figsize=(10, 5)
  )


Observations:
* Survival by Age followed a somewhat normal distribution, with a tail to the right for older people.
* Young children and infants survived more often than they died, probably reflecting the "women and children first" mentality.
* There was a high survival rate for people aged 31-40.

binning of Fare
Let's do binning of Fare to see if there is any feature in them:
  train_df, test_df = get_data()
  print(f"Fare range: {min(train_df.Fare)}~{max(train_df.Fare)}")
Fare range: 0.0~512.3292

  cut_labels = ['0-100', '100-200', '200-300', '300-400', '400-500', '500-600']
  cut_bins = [0, 100, 200, 300, 400, 500, 600]
  cut_df = train_df
  cut_df['cut_fare'] = pd.cut(cut_df['Fare'], cut_bins, labels=cut_labels)
  cut_df['cut_fare'] = cut_df.apply(lambda r: r.cut_fare if r.Fare > 0 else 'unknown', axis=1)
  cut_df[['Fare', 'cut_fare'] + list(filter(lambda e: e.startswith('fare_'), train_df.columns))].sample(n=15)


  fare_bin_df = cut_df.groupby(['cut_fare']).mean()
  fare_bin_df


  als_bar = barplot.Utils(train_df)
  fig = als_bar.sns_countplot(
      'Survived',
      'cut_fare',
      figsize=(10, 5)
  )


Observations:
* The higher the fare, the higher the survival rate.
* Fare < 100 has clearly the lowest survival rate compared to the other Fare bins.
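Fixed-width bins leave almost everyone in the first bucket. An alternative worth considering is pd.qcut, which builds quantile bins with roughly equal counts; a sketch on synthetic fares:

```python
import pandas as pd

# Synthetic skewed fares, eight distinct values.
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 52.0, 512.33])

# Four quantile bins: roughly equal numbers of passengers per bin,
# regardless of how stretched the fare range is.
fare_quartile = pd.qcut(fares, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(fare_quartile.value_counts())
```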

Outliers Analysis

Fare
  als_box.set_figsize((8, 4))
  als_box.s_boxplot("Fare", title='Fare outlier Boxplot')


  train_df.loc[train_df['Fare'] > 500]


Findings:
These outliers paid the princely sum of 512 pounds. In today's pounds, that would be worth a staggering 58,864 pounds! (Note: the group shared both the same ticket number and the same fare paid.) This was by far the most paid by any of the guests. They were part of the same group, led by the wealthy Mr. Cardeza, together with Miss Ward the maid and Mr. Lesurer the manservant. They were on their way back to the USA by way of Cherbourg, France.
* All of similar age
* Multiple cabins, but interestingly all on the 'B' deck. Could this be the best deck on the ship? We should analyze deck levels for our model.
* Feature: Cherbourg was the best place to embark if you wanted to survive. We will add this to our list of features for the ML models.
* Feature: Maybe there is a correlation between 'PC' tickets and survival rate?

To learn more about this group, visit this biography of Miss Anna Ward.
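To probe the hypothesized 'PC'-ticket correlation, one could extract a ticket prefix and group on it. A sketch on synthetic tickets (the prefix-extraction rule here is my own assumption, not the notebook's):

```python
import pandas as pd

# Synthetic tickets modeled on the real formats ('PC 17755', 'A/5 21171', ...).
train_df = pd.DataFrame({
    'Ticket':   ['PC 17755', 'A/5 21171', '113803', 'PC 17572'],
    'Survived': [1, 0, 1, 1],
})

# Take the leading token when the ticket is not purely numeric.
def ticket_prefix(ticket):
    token = ticket.split()[0]
    return token if not token.isdigit() else 'NONE'

train_df['Ticket_prefix'] = train_df['Ticket'].apply(ticket_prefix)
rate = train_df.groupby('Ticket_prefix')['Survived'].mean()
```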

Sibling or Spouse Outlier Analysis
  als_box.set_figsize((6, 3))
  ax = als_box.s_boxplot("SibSp", title='SibSp outlier Boxplot')


  # We see that the outliers came from 2 families
  family_df = train_df.loc[train_df['SibSp'] > 4]
  family_df.sort_values('Name')


  # Survival rates for group size.
  # Can we bin people by family or not? Maybe people had a higher survival rate if they had fewer family members?
  train_df.groupby(['SibSp'])['Survived'].mean()
Output:
  SibSp
  0    0.345395
  1    0.535885
  2    0.464286
  3    0.250000
  4    0.166667
  5    0.000000
  8    0.000000
  Name: Survived, dtype: float64
Family Size Outlier Findings:
Unfortunately, the Goodwin and Sage families - who had 5 or more members - did not fare well. They all perished. Maybe it was because of their socio-economic status. Maybe there were just too many of them to quickly escape to a lifeboat. Or maybe it was because of their cabin placement. Having a large family was tragic on the Titanic.
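These findings suggest combining SibSp and Parch into a single family-size feature. A common sketch (the FamilySize/FamilyBucket names and the bucket edges are my choices for illustration):

```python
import pandas as pd

# Synthetic rows mirroring the SibSp/Parch columns.
train_df = pd.DataFrame({
    'SibSp': [1, 0, 8, 1],
    'Parch': [0, 0, 2, 5],
})

# Siblings/spouses + parents/children + the passenger themselves.
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1

# Coarse buckets: alone, small family, large family.
train_df['FamilyBucket'] = pd.cut(
    train_df['FamilySize'],
    bins=[0, 1, 4, 20],
    labels=['alone', 'small', 'large'],
)
```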

Age Outlier Analysis
  als_box.set_figsize((6, 3))
  ax = als_box.s_boxplot("Age", title='Age outlier Boxplot')


  # Inspect passengers Ages 65 and up
  age_df = train_df.loc[train_df['Age'] > 64]
  age_df.sort_values('Age', ascending=False)


Age Outlier Findings:
While it didn't bode well to be elderly on the Titanic, the eldest person, Barkworth, actually survived! He was able to buoy himself on his briefcase and a fur coat until he found an overturned lifeboat. To learn more, visit the biography of Mr. Algernon Barkworth.

Name Analysis

Title
Here I will look at whether the title of the passenger (e.g. Mr., Mrs., etc.) had any correlation with surviving the Titanic.
  title_df = train_df.copy()
  title_df['title'] = train_df.Name.apply(lambda x: x.split(',')[1].split(' ')[1].strip())
  title_df['title'].value_counts()
Output:
  Mr.          517
  Miss.        182
  Mrs.         125
  Master.       40
  Dr.            7
  Rev.           6
  Col.           2
  Mlle.          2
  Major.         2
  Lady.          1
  Jonkheer.      1
  Capt.          1
  Sir.           1
  Don.           1
  the            1
  Mme.           1
  Ms.            1
  Name: title, dtype: int64
Let's visualize the finding:
  als_bar = barplot.Utils(title_df)
  g = als_bar.sns_barplot('Survived', ['title'], figsize=(10, 5))


  title_df['count'] = 1
  title_df[['title', 'Survived', 'count']].groupby('title').agg({'count': 'size', 'Survived': 'mean'}). \
          reset_index().sort_values(['Survived', 'count'], ascending=[False, False])


Title Findings:
* Titles were a fairly good predictor of whether a passenger survived the Titanic.
* Female titles are good indicators (Mrs., Miss., Lady, etc.).
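A follow-up idea for the model: collapse the long tail of rare titles into a single 'Rare' category so each level has enough samples. A sketch (the choice of which titles count as common is mine):

```python
import pandas as pd

# Synthetic sample of extracted titles.
titles = pd.Series(['Mr.', 'Miss.', 'Mrs.', 'Dr.', 'Lady.', 'Capt.', 'Mme.'])

# Keep the four frequent titles; map everything else to 'Rare'.
common = {'Mr.', 'Miss.', 'Mrs.', 'Master.'}
grouped = titles.apply(lambda t: t if t in common else 'Rare')
print(grouped.value_counts())
```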

So far, we have completed the analysis of the Titanic dataset. For the follow-up sections, 4. Preprocessing Data, 5. Train Model, 6. Evaluate -> Tune -> Ensemble, and 7. Conclusion, please refer to this notebook.
