Hello everyone! The Titanic dataset is a kind of "Hello World" of data science.

Q) Where can I find this Data?

A) You can find the dataset here: Titanic dataset (hosted as a Kaggle competition).

 

Q) What does the dataset describe?

A) This dataset describes the following features:

Feature    Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex                                           male or female
age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton
 

Q) What should we do now?

A) Given the training data (train.csv), we have to predict, for each passenger in test.csv, whether they survived, based on these features.

So, let's dive into our first data science project using Python. Python and R are the languages usually used for data science projects; because of its simplicity, I'll use Python.

Q) What packages are we using? 

A) 1. pandas : provides the Series and DataFrame data types for working with tabular data. Similar to SQL, we can do CRUD operations and much more on tables (see the small example after this list).

   2. matplotlib, seaborn : visualization libraries that are widely used for plotting in Python.

   3. sklearn : helps us build models for our datasets.
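For instance, a DataFrame behaves a lot like a SQL table. A tiny sketch (the toy columns here are made up for illustration):

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
adults = df[df.age >= 18]               # like: SELECT * FROM df WHERE age >= 18
df.loc[df.name == "Bob", "age"] = 26    # like: UPDATE df SET age = 26 WHERE name = 'Bob'
print(adults)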

Use a Jupyter notebook for better interactivity.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #for visualizations
import seaborn as sns # for visualizations

# read the files using pd.read_csv()
titanic_train = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")

# summary of columns, non-null counts, and dtypes

titanic_train.info()
#output
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
"""
# Mark the test rows with Survived = -1, then combine both datasets
titanic_test["Survived"] = -1
titanic = pd.concat([titanic_train, titanic_test], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

#let's drop some columns
titanic.drop(columns=["Ticket","PassengerId","Cabin"], inplace = True)

"""
AGE

So, we have to fill the Nan values of Age column. We can do this by
mean/median/mode imputing where we can generalize the values and just fill them.
Either by using Interpolation. Here we can't interpolate data since there's no order to follow.
There may be some other methods to fill AGE attribute, like, Averaging the age based on Embark and Fare and Gender.
Here, I'm considering to fill the Nan Values of AGE using the Title of Name. For Example, Moran, Mr. James, here "Mr" is the title name and we can allocate him the Average of all passengers bearing this Title.
"""

# first, extract each passenger's title from the Name column

"""
Storing the titles of passengers in title list and then adding it to titanic dataframe
"""
title = []
for item in titanic.Name:
    title.append(item.split(',')[1].split('.')[0].strip())
print (title[:3])
print (titanic.Name[:3])
titanic["title"] = title

#output

"""
['Mr', 'Mrs', 'Miss']
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
Name: Name, dtype: object
"""

"""
Different Passengers have different Age based on title. So, our assumption of filling Nan values is correct. Let's update Age accordingly
"""

# average age per title, used as a lookup table for the imputation
using = titanic.groupby("title")["Age"].mean()

final_age = []
for i in range(len(titanic)):
    age = titanic["Age"].iloc[i]
    if np.isnan(age):
        age = using[titanic["title"].iloc[i]]
    final_age.append(age)
titanic["Age"] = final_age

# fill the missing Embarked values with "S", since the majority boarded there

titanic.Embarked.fillna("S",inplace=True)
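We can verify the "majority" claim with a quick count (S dominates both before and after the fill):

# passengers per port of embarkation
print(titanic.Embarked.value_counts())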


# fill the one missing Fare with a typical fare value
titanic.Fare.fillna(18, inplace=True)

# confirm no missing values remain
titanic.isna().sum()


"""
Let's create is_par column, where if Parch!=0, is_par = 1, else is_par = 0
"""

Parch = titanic.Parch.tolist()
is_par = [0 if item == 0 else 1 for item in Parch ]
titanic["is_par"] = is_par

# survival rate (%) for each SibSp value, using only the labelled training rows
temp = titanic[(titanic.Survived!=-1)].groupby("SibSp")["Survived"].value_counts(normalize = True).mul(100).reset_index(name = "percentage")
sns.barplot(x="SibSp",y = "percentage",hue = "Survived",data = temp).set_title("SibSp - Survival rate")
plt.show()

#let's do the same for sibling column too

SibSp = titanic.SibSp.tolist()
has_sib = [0 if item == 0 else 1 for item in SibSp ]
titanic["has_sib"] = has_sib

titanic.drop(columns=["Name","Parch","SibSp","title"], inplace=True)
titanic.sample()

titanic = pd.get_dummies(titanic, columns=["Embarked","Pclass"])
titanic.Sex = titanic.Sex.map({"male":1,"female":0})
titanic.sample()
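At this point every column should be numeric. A quick check before modelling:

# confirm no object-dtype columns are left after encoding
print(titanic.dtypes)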

Now that we're done with preprocessing, let's build the model. I'm using XGBClassifier from the xgboost package. XGBoost is a boosting ensemble where, at every step, the next model tries to correct the residual errors of the models before it.
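To make the residual-correction idea concrete, here is a minimal two-round boosting sketch on toy data (an illustration only; real XGBoost adds shrinkage, regularization, and a much smarter tree-building procedure):

from sklearn.tree import DecisionTreeRegressor

# toy regression data
X = np.arange(20).reshape(-1, 1).astype(float)
y = np.sin(X).ravel()

# round 1: fit a weak model
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)          # what round 1 got wrong

# round 2: fit the next model to those residuals and add its corrections
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
boosted = tree1.predict(X) + tree2.predict(X)

print("round-1 MSE:", np.mean((y - tree1.predict(X)) ** 2))
print("boosted MSE:", np.mean((y - boosted) ** 2))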

# back to Titanic: separate the labelled (training) rows into features and target
titanic_training_y = titanic[titanic.Survived!=-1].Survived
titanic_training_x = titanic[titanic.Survived!=-1].drop(columns = ["Survived"])

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# try 15 different train/validation splits; for each, search tree depths 5-14
for random in range(15):
    train_x, test_x, train_y, test_y = train_test_split(titanic_training_x, titanic_training_y, test_size = 0.1, random_state = random)
    scores = []
    for i in range(5,15):
        model = XGBClassifier(max_depth = i)
        model.fit(train_x, train_y)
        target = model.predict(test_x)
        score = accuracy_score(test_y, target)
        scores.append(score)
    print("best scores: ",max(scores), " at depth : ",scores.index(max(scores))+5)

#output

best scores:  0.8777777777777778  at depth :  13
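What's still missing for the competition itself is predictions on test.csv. A minimal sketch, assuming we settle on one depth from the search above (max_depth = 9 is just an example choice; the columns follow the Kaggle submission format):

# retrain on all labelled rows at the chosen depth
final_model = XGBClassifier(max_depth = 9)
final_model.fit(titanic_training_x, titanic_training_y)

# the rows we marked with Survived == -1 are the test set, still in their original order
titanic_testing_x = titanic[titanic.Survived == -1].drop(columns = ["Survived"])
predictions = final_model.predict(titanic_testing_x)

# PassengerId comes from the raw test file, which we never dropped it from
submission = pd.DataFrame({"PassengerId": titanic_test["PassengerId"], "Survived": predictions})
submission.to_csv("submission.csv", index = False)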