The Naive Bayes Approach
The Bayesian Approach and Gaussian Processes
For example, Bayes' theorem gives the probability that a card drawn from a standard 52-card deck is a queen, given that it is a diamond:

\(P(queen \mid diamond) = \dfrac{P(diamond \mid queen) \cdot P(queen)}{P(diamond)}\)

\(P(diamond \mid queen) = 1/4\) (one of the four queens is a diamond)

\(P(queen) = 1/13\) (four queens among the 52 cards, one per suit of 13)

\(P(diamond) = 1/4\) (thirteen diamonds among the 52 cards)

\(P(queen \mid diamond) = \dfrac{1/4 \cdot 1/13}{1/4} = 1/13\)
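As a quick sanity check, the same arithmetic can be reproduced in Python with exact fractions (the card probabilities above are the only inputs):

```python
from fractions import Fraction

# Card probabilities for a standard 52-card deck
p_diamond_given_queen = Fraction(1, 4)   # one of the four queens is a diamond
p_queen = Fraction(1, 13)                # four of the 52 cards are queens
p_diamond = Fraction(1, 4)               # thirteen of the 52 cards are diamonds

p_queen_given_diamond = p_diamond_given_queen * p_queen / p_diamond
p_queen_given_diamond  # Fraction(1, 13)
```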
import pandas as pd

df = pd.read_csv("Data/titanic.csv")
df.head()
| | PassengerId | Name | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Braund, Mr. Owen Harris | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
| 1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | Heikkinen, Miss. Laina | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
| 3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
| 4 | 5 | Allen, Mr. William Henry | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'], axis='columns', inplace=True)
df.head()
| | Pclass | Sex | Age | Fare | Survived |
|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 7.2500 | 0 |
| 1 | 1 | female | 38.0 | 71.2833 | 1 |
| 2 | 3 | female | 26.0 | 7.9250 | 1 |
| 3 | 1 | female | 35.0 | 53.1000 | 1 |
| 4 | 3 | male | 35.0 | 8.0500 | 0 |
inputs = df.drop('Survived', axis='columns')
target = df.Survived
#inputs.Sex = inputs.Sex.map({'male': 1, 'female': 2})
dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)
| | female | male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
inputs = pd.concat([inputs, dummies], axis='columns')
inputs.head(3)
| | Pclass | Sex | Age | Fare | female | male |
|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 7.2500 | 0 | 1 |
| 1 | 1 | female | 38.0 | 71.2833 | 1 | 0 |
| 2 | 3 | female | 26.0 | 7.9250 | 1 | 0 |
I am dropping the male column as well because of the dummy variable trap: one column is enough to represent male vs. female, since the male column is just the complement of the female column.
inputs.drop(['Sex','male'], axis='columns', inplace=True)
inputs.head(3)
| | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 0 | 3 | 22.0 | 7.2500 | 0 |
| 1 | 1 | 38.0 | 71.2833 | 1 |
| 2 | 3 | 26.0 | 7.9250 | 1 |
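As an aside, pandas can handle the dummy variable trap in a single call: `get_dummies` accepts a `drop_first=True` argument that drops the first category automatically. A minimal sketch of that alternative (here it would keep a `male` column instead of the `female` one):

```python
# One-call alternative: drop_first=True drops the first category ('female'),
# keeping a single 'male' indicator that carries the same information
sex_dummy = pd.get_dummies(df.Sex, drop_first=True)
sex_dummy.head(3)
```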
inputs.columns[inputs.isna().any()]
Index(['Age'], dtype='object')
inputs.Age[:10]
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()
| | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 0 | 3 | 22.0 | 7.2500 | 0 |
| 1 | 1 | 38.0 | 71.2833 | 1 |
| 2 | 3 | 26.0 | 7.9250 | 1 |
| 3 | 1 | 35.0 | 53.1000 | 1 |
| 4 | 3 | 35.0 | 8.0500 | 0 |
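The same mean imputation can also be written with scikit-learn's `SimpleImputer`, which is handy when the preprocessing has to be reapplied to new data later. A minimal sketch, equivalent to the `fillna` call above:

```python
from sklearn.impute import SimpleImputer

# Learn the mean of Age from the data and fill missing values with it,
# equivalent to the fillna call above
imputer = SimpleImputer(strategy='mean')
age_imputed = imputer.fit_transform(inputs[['Age']])  # 2D NumPy array without NaNs
```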
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train,y_train)
GaussianNB()
model.score(X_test,y_test)
0.7910447761194029
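Accuracy alone does not show which kinds of mistakes the model makes; a confusion matrix on the same test split gives that breakdown. A small sketch using `sklearn.metrics`:

```python
from sklearn.metrics import confusion_matrix

# Rows: true classes (0 = did not survive, 1 = survived); columns: predicted classes
confusion_matrix(y_test, model.predict(X_test))
```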
X_test[0:10]
| | Pclass | Age | Fare | female |
|---|---|---|---|---|
| 509 | 3 | 26.000000 | 56.4958 | 0 |
| 325 | 1 | 36.000000 | 135.6333 | 1 |
| 248 | 1 | 37.000000 | 52.5542 | 0 |
| 391 | 3 | 21.000000 | 7.7958 | 0 |
| 411 | 3 | 29.699118 | 6.8583 | 0 |
| 688 | 3 | 18.000000 | 7.7958 | 0 |
| 183 | 2 | 1.000000 | 39.0000 | 0 |
| 14 | 3 | 14.000000 | 7.8542 | 1 |
| 763 | 1 | 36.000000 | 120.0000 | 1 |
| 383 | 1 | 35.000000 | 52.0000 | 1 |
y_test[0:10]
509 1
325 1
248 1
391 1
411 0
688 0
183 1
14 0
763 1
383 1
Name: Survived, dtype: int64
model.predict(X_test[0:10])
array([0, 1, 0, 0, 0, 0, 0, 1, 1, 1])
model.predict_proba(X_test[:10])
array([[9.22826078e-01, 7.71739224e-02],
[1.90547332e-04, 9.99809453e-01],
[6.93224146e-01, 3.06775854e-01],
[9.59335969e-01, 4.06640310e-02],
[9.65380973e-01, 3.46190265e-02],
[9.56271850e-01, 4.37281504e-02],
[8.17245910e-01, 1.82754090e-01],
[3.83233278e-01, 6.16766722e-01],
[9.28107033e-04, 9.99071893e-01],
[6.70466692e-02, 9.32953331e-01]])
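Under the hood, GaussianNB fits a mean and a variance for each feature within each class and plugs a Gaussian likelihood into Bayes' theorem. The sketch below reproduces the probabilities for the first test row from the fitted model's `theta_`, `var_` and `class_prior_` attributes (scikit-learn's names for the per-class means, variances and priors; `var_` was called `sigma_` in older versions); it should closely match the first row of `predict_proba` above:

```python
import numpy as np

x = X_test.iloc[0].to_numpy()  # features of the first test passenger

# Log of P(class) plus the sum of per-feature Gaussian log-densities, one entry per class
log_joint = []
for k in range(len(model.classes_)):
    mean = model.theta_[k]   # per-feature means learned for class k
    var = model.var_[k]      # per-feature variances learned for class k
    log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    log_joint.append(np.log(model.class_prior_[k]) + log_likelihood)

# Normalise the joint log-probabilities into class probabilities
log_joint = np.array(log_joint)
probs = np.exp(log_joint - log_joint.max())
probs /= probs.sum()
probs  # should closely match model.predict_proba(X_test[:1])
```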
Calculate the score using cross-validation
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(), X_train, y_train, cv=5)
array([0.784 , 0.728 , 0.744 , 0.75806452, 0.80645161])
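The five fold scores can be summarised by their mean, which gives a more stable estimate of the accuracy than the single train/test split above:

```python
# Average accuracy across the five folds
cross_val_score(GaussianNB(), X_train, y_train, cv=5).mean()
```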