Source From Here
1. Overview
The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Goal
It is your job to predict whether a passenger survived the sinking of the Titanic. For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.
Metric
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
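As a quick illustration of the metric, here is a minimal sketch that computes accuracy by hand. The two label series are made-up toy values, not competition data.

```python
import pandas as pd

# Toy labels for illustration only; they are not taken from the dataset.
y_true = pd.Series([0, 1, 1, 0, 1], name="Survived")
y_pred = pd.Series([0, 1, 0, 0, 1], name="Survived")

# Accuracy is the fraction of passengers whose prediction matches the truth.
accuracy = (y_true == y_pred).mean()
print(f"accuracy = {accuracy:.2f}")
```

Four of the five toy predictions match, so the score here is 0.80.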
Submission File Format
You should submit a CSV file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or extra rows.
The file should have exactly 2 columns:
* PassengerId (sorted in any order)
* Survived (contains your binary predictions: 1 for survived, 0 for deceased)
- PassengerId,Survived
- 892,0
- 893,1
- 894,0
- Etc.
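Before uploading, it can help to check a prediction frame against the rules above. The helper below is only a sketch: `check_submission` is a hypothetical name, and it covers just the constraints stated in this section (two columns, the row count, binary values).

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_rows: int = 418) -> list:
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    # The file must contain exactly these two columns and nothing else.
    if sorted(sub.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId and Survived")
        return problems
    if len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(sub)}")
    if not sub["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Toy frame with three rows, so expected_rows is overridden for the demo.
sub = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(sub, expected_rows=3))

# Writing without the index keeps the file to two columns:
# sub.to_csv("submission.csv", index=False)
```

Note that if PassengerId is kept as the DataFrame index, it has to be turned back into a column (for example with `reset_index()`) before a check like this one.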
We import the necessary packages below:
- import sys
- import os
- import pandas as pd
- import numpy as np
- import matplotlib.pyplot as plt
- from os.path import dirname
- # raw string so the backslashes in the Windows path are not treated as escape sequences
- sys.path.append(os.environ.get('KUTILS_ANALYSIS_ROOT', r'C:\John\Personal\Github\kutils_analysis'))
- from kutils.analysis import histplot, barplot, boxplot, fiplot, corr
- %matplotlib inline
* training set (train.csv):
* test set (test.csv):
We also include gender_submission.csv, a set of predictions that assumes all and only female passengers survive, as an example of what a submission file should look like. Below are the column definitions:
3. Exploratory Data Analysis
Feature Analysis
The first step is to load the training and test data:
- # load the training and test data
- def get_data():
- train_df = pd.read_csv('train.csv', index_col='PassengerId')
- test_df = pd.read_csv('test.csv', index_col='PassengerId')
- return train_df, test_df
- train_df, test_df = get_data()
- train_df.head()
raw data info
- # inspect the dataframe for entries, columns, missing values, and data types
- train_df.info()
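Since `info()` only prints non-null counts, a small table of missing values per column can be easier to scan. The sketch below uses a made-up toy frame (`toy_df`) that only mimics a few train.csv columns.

```python
import numpy as np
import pandas as pd

# Toy rows that only mimic the shape of train.csv; the values are made up.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "S", None],
})

# Count and percentage of missing entries per column, worst column first.
missing = pd.DataFrame({
    "n_missing": toy_df.isna().sum(),
    "pct_missing": toy_df.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)
print(missing)
```

The same two lines should work unchanged on train_df.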
Observations:
raw data distribution
Let's check the distribution of our raw data:
- # split categorical and numerical dataframes for analysis
- df_num = train_df[['Age', 'SibSp', 'Parch', 'Fare']]
- df_cat = train_df[['Pclass', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Name']]
- als_hist = histplot.Utils(train_df)
- als_hist.hist(df_num.columns)
- ax_list = als_hist.sns_hist(df_num.columns, col_num=2, figsize=(10, 6))
- als_bar = barplot.Utils(train_df)
- fig, axs = als_bar.stacked_bar('Survived', columns=['Pclass', 'Sex', 'Embarked'], col_num=3, figsize=(15, 5))
Let's also check the first letter of Cabin:
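- # mark missing cabins as 'U' (unknown); assign back instead of calling inplace on a single column
- train_df['Cabin'] = train_df['Cabin'].fillna('U')
- # keep only the first letter of the cabin
- train_df['Cabin_type'] = train_df['Cabin'].str[0]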
- train_df['Cabin_type'].value_counts()
- U 687
- C 59
- B 47
- D 33
- E 32
- A 15
- F 13
- G 4
- T 1
- Name: Cabin_type, dtype: int64
- ax = als_bar.stacked_bar('Survived', columns=['Cabin_type'], col_num=2, figsize=(10, 5))
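If the kutils plotting helpers are not at hand, the same comparison can be read from a plain pandas table. This is a sketch on made-up rows (`toy_df`), not the author's plotting code.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; the survival values are made up.
toy_df = pd.DataFrame({
    "Cabin": ["C85", np.nan, "E46", np.nan, "C123", "B42"],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Missing cabins become 'U' (unknown); otherwise keep the first letter.
toy_df["Cabin_type"] = toy_df["Cabin"].fillna("U").str[0]

# Passenger count and survival rate per cabin letter.
summary = toy_df.groupby("Cabin_type")["Survived"].agg(["count", "mean"])
print(summary)
```

Keeping the count next to the mean matters here, because letters such as G and T cover only a handful of passengers.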
Observations:
correlation with Survived
Then, let's check the correlation between Survived and other columns:
- # Gender breakdown
- print(train_df['Sex'].value_counts())
- print()
- print(train_df['Sex'].value_counts(normalize=True)) #Percentage breakdown
- male 577
- female 314
- Name: Sex, dtype: int64
- male 0.647587
- female 0.352413
- Name: Sex, dtype: float64
- male_s1 = train_df[(train_df['Sex']=='male') & (train_df['Survived']==1)].shape[0]
- male_s0 = train_df[(train_df['Sex']=='male') & (train_df['Survived']==0)].shape[0]
- print(f"Number of male survivors={male_s1}; number of male non-survivors={male_s0}; survival rate = {male_s1/(male_s1+male_s0)}")
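The counts above are computed for male passengers only. As a hedged alternative, `pd.crosstab` gives the same numbers for every group in one call; the rows in `toy_df` below are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration; not taken from train.csv.
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Raw counts of survivors (1) and non-survivors (0) per sex.
counts = pd.crosstab(toy_df["Sex"], toy_df["Survived"])

# Row-normalised version: each row sums to 1, so column 1 is the survival rate.
rates = pd.crosstab(toy_df["Sex"], toy_df["Survived"], normalize="index")
print(counts)
print(rates)
```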
- # https://seaborn.pydata.org/generated/seaborn.barplot.html
- fig, ax = als_bar.sns_barplot(
- 'Survived',
- ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch', 'SibSp'],
- col_num=3,
- figsize=(12, 8)
- )
Observations:
Let's also check the relationship between Survived and the same features using count plots:
- fig, ax = als_bar.sns_countplot(
- 'Survived',
- ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch', 'SibSp'],
- col_num=3,
- figsize=(13, 7)
- )
Observations:
Embarked analysis
Let's check the Embarked column and its relationship with Fare, Age and SibSp:
- als_box = boxplot.Utils(train_df)
- fig, ax = als_box.sns_boxplot(
- 'Embarked',
- ['Fare', 'Age', 'SibSp'],
- [
- 'Fare Boxplot By Port of Embarkation',
- 'Age Distribution Boxplot By Port of Embarkation',
- 'SibSp Distribution Boxplot By Port of Embarkation (By Sex)'
- ]
- )
Observations:
binning of Age
Let's bin Age to see whether any pattern shows up across age groups:
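- # pd.cut bins are right-inclusive by default, so the first bin is (0, 10]
- cut_labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
- cut_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
- cut_df = train_df  # alias, not a copy: the new column is also added to train_df
- cut_df['cut_age'] = pd.cut(cut_df['Age'], cut_bins, labels=cut_labels)
- # numeric_only=True skips text columns such as Name and Ticket
- age_bin_df = cut_df.groupby(['cut_age']).mean(numeric_only=True)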
- age_bin_df
- # als_bar = barplot.Utils(train_df)
- fig = als_bar.sns_countplot(
- 'Survived',
- 'cut_age',
- figsize=(10, 5)
- )
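To see how `pd.cut` treats values that sit exactly on a bin edge, here is a small sketch on hand-picked ages. The label list is an assumption chosen to match right-inclusive bins.

```python
import pandas as pd

# Hand-picked ages, including values that fall exactly on bin edges.
ages = pd.Series([0.42, 10, 10.5, 20, 35, 80], name="Age")
cut_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
cut_labels = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80"]

# Default right=True: each bin is (left, right], so 10 lands in the first bin.
binned = pd.cut(ages, cut_bins, labels=cut_labels)
print(binned.tolist())

# The lowest edge itself is excluded unless include_lowest=True is passed.
zero_default = pd.cut(pd.Series([0.0]), cut_bins)
zero_included = pd.cut(pd.Series([0.0]), cut_bins, include_lowest=True)
print(zero_default.isna().all(), zero_included.notna().all())
```

Rows with a missing Age stay NaN after binning, so they drop out of a groupby on the binned column unless they are handled separately.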
Observations:
binning of Fare
Let's bin Fare in the same way to see whether any pattern shows up across fare ranges:
- train_df, test_df = get_data()
- # print the range as min~max
- print(f"Fare range: {min(train_df.Fare)}~{max(train_df.Fare)}")
- cut_labels = ['0-100', '100-200', '200-300', '300-400', '400-500', '500-600']
- cut_bins = [0, 100, 200, 300, 400, 500, 600]
- cut_df = train_df
- cut_df['cut_fare'] = pd.cut(cut_df['Fare'], cut_bins, labels=cut_labels)
- cut_df['cut_fare'] = cut_df.apply(lambda r: r.cut_fare if r.Fare > 0 else 'unknown', axis=1)
- cut_df[['Fare', 'cut_fare'] + list(filter(lambda e: e.startswith('fare_'), train_df.columns))].sample(n=15)
- # numeric_only=True skips text columns such as Name and Ticket
- fare_bin_df = cut_df.groupby(['cut_fare']).mean(numeric_only=True)
- fare_bin_df
- als_bar = barplot.Utils(train_df)
- fig = als_bar.sns_countplot(
- 'Survived',
- 'cut_fare',
- figsize=(10, 5)
- )
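Equal-width bins can leave most passengers in a single bin when fares are skewed. As an alternative that this post does not use, `pd.qcut` splits on quantiles so each bin holds roughly the same number of rows. The fares below are toy values chosen for illustration.

```python
import pandas as pd

# Toy fares for illustration: mostly small values plus one very large one.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.3, 71.3, 512.3], name="Fare")

# Equal-width bins, as in the cell above: nearly everything lands in (0, 100].
equal_width = pd.cut(fares, [0, 100, 200, 300, 400, 500, 600])
print(equal_width.value_counts(sort=False))

# Quantile bins: four bins with (here) two fares each.
quantile_bins = pd.qcut(fares, q=4)
print(quantile_bins.value_counts(sort=False))
```

The trade-off is that quantile bin edges depend on the data, so they are harder to label in advance than fixed ranges.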
Observations:
Outliers Analysis
Fare
- als_box.set_figsize((8, 4))
- als_box.s_boxplot("Fare", title='Fare outlier Boxplot')
- train_df.loc[train_df['Fare'] > 500]
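Box plots conventionally mark points lying more than 1.5 times the interquartile range beyond the quartiles as outliers. The sketch below applies that rule directly; `iqr_outliers` is a hypothetical helper, the fares are toy values, and whether the kutils box plot uses the same whisker setting is an assumption.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the entries lying more than whis * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Toy fares for illustration; not the real column.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.3, 71.3, 512.3], name="Fare")
print(iqr_outliers(fares))
```

On the real data this rule would likely flag many more fares than the fixed threshold of 500 used above, which singles out only the most extreme group.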
Findings:
These outliers paid the princely sum of 512 pounds. In today's money, that would be a staggering 58,864 pounds! (Note: every member of the group has the same ticket number and fare paid.) This was by far the highest fare paid by any passenger. They all belonged to one group led by the wealthy Mr. Cardeza, together with Miss Ward, the maid, and Mr. Lesurer, the manservant. They were on their way back to the USA by way of Cherbourg, France.
To learn more about this group, visit this biography of Miss Anna Ward.
Sibling or Spouse Outlier Analysis
- als_box.set_figsize((6, 3))
- ax = als_box.s_boxplot("SibSp", title='SibSp outlier Boxplot')
- # We see that the outliers came from 2 families
- family_df = train_df.loc[train_df['SibSp'] > 4]
- family_df.sort_values('Name')
- # Survival rates for group size.
- # Can we bin people by family or not? Maybe people had a higher survival rate if they had fewer family members?
- train_df.groupby(['SibSp'])['Survived'].mean()
- SibSp
- 0 0.345395
- 1 0.535885
- 2 0.464286
- 3 0.250000
- 4 0.166667
- 5 0.000000
- 8 0.000000
- Name: Survived, dtype: float64
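One way to explore the binning question from the comment above is to group SibSp into a few coarse buckets and keep the group size next to the rate. The bucket edges and labels here are assumptions for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows for illustration; not taken from train.csv.
toy_df = pd.DataFrame({
    "SibSp": [0, 0, 0, 0, 1, 1, 1, 5, 8],
    "Survived": [0, 1, 0, 0, 1, 1, 0, 0, 0],
})

# Hypothetical buckets: nobody, one or two, three or more siblings/spouses.
sibsp_group = pd.cut(
    toy_df["SibSp"],
    bins=[-1, 0, 2, float("inf")],
    labels=["0", "1-2", "3+"],
)

# n shows how many passengers each rate is based on.
summary = toy_df.groupby(sibsp_group, observed=False)["Survived"].agg(n="count", rate="mean")
print(summary)
```

A rate of 0.0 reads very differently when it rests on a handful of passengers, which is why the count is kept alongside it.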
Unfortunately, the Goodwin and Sage families, who had 5 or more members, did not fare well. They all perished. Maybe it's because of their socio-economic status. Maybe it's because there were just too many of them to quickly escape to a lifeboat. Or maybe it's because of their cabin placement. Having a large family was tragic on the Titanic.
Age Outlier Analysis
- als_box.set_figsize((6, 3))
- ax = als_box.s_boxplot("Age", title='Age outlier Boxplot')
- # Inspect passengers Ages 65 and up
- age_df = train_df.loc[train_df['Age'] > 64]
- age_df.sort_values('Age', ascending=False)
Age Outlier Findings:
While it doesn't bode well to be elderly on the Titanic, the eldest person, Barkworth, actually survived! He was able to buoy himself on his briefcase and a fur coat until he found an overturned lifeboat. To learn more, visit the biography of Mr. Algernon Barkworth.
Name Analysis
Title
Here I will look at whether the passenger's title (e.g. Mr., Mrs.) had any correlation with surviving the Titanic.
- title_df = train_df.copy()
- title_df['title'] = train_df.Name.apply(lambda x: x.split(',')[1].split(' ')[1].strip())
- title_df['title'].value_counts()
- Mr. 517
- Miss. 182
- Mrs. 125
- Master. 40
- Dr. 7
- Rev. 6
- Col. 2
- Mlle. 2
- Major. 2
- Lady. 1
- Jonkheer. 1
- Capt. 1
- Sir. 1
- Don. 1
- the 1
- Mme. 1
- Ms. 1
- Name: title, dtype: int64
- als_bar = barplot.Utils(title_df)
- g = als_bar.sns_barplot('Survived', ['title'], figsize=(10, 5))
- title_df['count'] = 1
- title_df[['title', 'Survived', 'count']].groupby('title').agg({'count':'size', 'Survived':'mean'}). \
- reset_index().sort_values(['Survived', 'count'], ascending=[False, False])
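The split-based extraction above keeps only the first word after the comma, which is why "the" appears as a title in the counts. A regular expression that captures everything between the comma and the first period is one possible alternative; this is a sketch, not the method used in this post.

```python
import pandas as pd

# Three names as they appear in the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
], name="Name")

# Capture the text after the comma up to (not including) the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

With this pattern the titles come out without the trailing period, so they would not match the labels in the counts above one-to-one.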
Title Findings:
This completes the analysis of the Titanic dataset. For the follow-up sections (4. Preprocessing Data, 5. Train Model, 6. Evaluate -> Tune -> Ensemble and 7. Conclusion), please refer to this notebook.