Decision Tree

A decision tree is a sequence of if statements, where each split is placed at the point that lowers impurity (for example entropy) the most.
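To make "low point of entropy" concrete, here is a minimal sketch of how entropy and information gain can be computed for a candidate split. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions, not part of the notebook. Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion='entropy'` selects entropy instead.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Hypothetical toy labels: two samples of each class.
parent = [0, 0, 1, 1]
perfect_split = [[0, 0], [1, 1]]   # each child is pure
useless_split = [[0, 1], [0, 1]]   # each child is as mixed as the parent

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect, gain_useless)
```

A greedy tree learner evaluates candidate splits like these at every node and keeps the one with the largest gain, which is why the pure split would be preferred here.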

Advantages of CART

  • Simple to understand, interpret, visualize.
  • Decision trees implicitly perform variable screening or feature selection.
  • Can handle both numerical and categorical data. Can also handle multi-output problems.
  • Decision trees require relatively little effort from users for data preparation.
  • Nonlinear relationships between parameters do not affect tree performance.

Disadvantages of CART

  • Decision-tree learners can create over-complex trees that do not generalize well to new data. This is called overfitting.
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
  • The greedy algorithms used to grow trees cannot guarantee the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble, where the features and samples are randomly sampled with replacement.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
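The mitigations listed above can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the notebook's method: the data is synthetic (from `make_classification`), and the specific settings (`max_depth=3`, `n_estimators=50`, `class_weight='balanced'`) are arbitrary choices for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (roughly 80% / 20%), purely for illustration.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Unconstrained tree: free to grow until it memorizes the training set.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth limit against overfitting; class weights against class imbalance.
shallow_tree = DecisionTreeClassifier(
    max_depth=3, class_weight='balanced', random_state=0).fit(X_train, y_train)

# Bagging-style ensemble: many trees on bootstrap samples with random features.
forest = RandomForestClassifier(
    n_estimators=50, random_state=0).fit(X_train, y_train)

for name, clf in [('full tree', full_tree),
                  ('shallow tree', shallow_tree),
                  ('random forest', forest)]:
    print(f'{name}: train={clf.score(X_train, y_train):.2f} '
          f'test={clf.score(X_test, y_test):.2f}')
```

Comparing the train and test scores for each model gives a rough picture of how much each one overfits on this particular synthetic data set.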
import pandas as pd
from pathlib import Path
path = Path('Data')
name = 'salaries.csv'
df = pd.read_csv(path/name)
df.head()
company job degree salary_more_then_100k
0 google sales executive bachelors 0
1 google sales executive masters 0
2 google business manager bachelors 1
3 google business manager masters 1
4 google computer programmer bachelors 0
inputs = df.drop('salary_more_then_100k',axis='columns')
target = df['salary_more_then_100k']
from sklearn.preprocessing import LabelEncoder
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
inputs
company job degree company_n job_n degree_n
0 google sales executive bachelors 2 2 0
1 google sales executive masters 2 2 1
2 google business manager bachelors 2 0 0
3 google business manager masters 2 0 1
4 google computer programmer bachelors 2 1 0
5 google computer programmer masters 2 1 1
6 abc pharma sales executive masters 0 2 1
7 abc pharma computer programmer bachelors 0 1 0
8 abc pharma business manager bachelors 0 0 0
9 abc pharma business manager masters 0 0 1
10 facebook sales executive bachelors 1 2 0
11 facebook sales executive masters 1 2 1
12 facebook business manager bachelors 1 0 0
13 facebook business manager masters 1 0 1
14 facebook computer programmer bachelors 1 1 0
15 facebook computer programmer masters 1 1 1
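The integer codes in the table come from `LabelEncoder`, which sorts the distinct labels and numbers them in that order. A small self-contained sketch (the `companies` list is a hypothetical stand-in for the `company` column) shows how to inspect and reverse the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the 'company' column.
companies = ['google', 'abc pharma', 'facebook', 'google']

le = LabelEncoder()
codes = le.fit_transform(companies)

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'abc pharma': 0, 'facebook': 1, 'google': 2}

# inverse_transform turns codes back into the original labels.
decoded = le.inverse_transform([2, 0])
print(list(decoded))
```

The codes carry an arbitrary order (google > facebook > abc pharma). A tree can still split on them, but scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` or `OneHotEncoder` are the encoders meant for input features.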
inputs_n = inputs.drop(['company','job','degree'],axis='columns')
inputs_n
company_n job_n degree_n
0 2 2 0
1 2 2 1
2 2 0 0
3 2 0 1
4 2 1 0
5 2 1 1
6 0 2 1
7 0 1 0
8 0 0 0
9 0 0 1
10 1 2 0
11 1 2 1
12 1 0 0
13 1 0 1
14 1 1 0
15 1 1 1
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(42)
n_points = len(inputs_n)
x = np.random.rand(n_points)
y = np.random.rand(n_points)
z = np.random.rand(n_points)
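# Add a small jitter (at most ±0.05) so overlapping points stay visible in the plot.
# Note: this modifies inputs_n in place, so the model fitted later is trained on the jittered values.
inputs_n['company_n'] += (x-0.5)/10
inputs_n['job_n'] += (y-0.5)/10
inputs_n['degree_n'] += (z-0.5)/10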
# Create a 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.set_xlabel('degree_n')
ax.set_ylabel('job_n')
ax.set_zlabel('company_n')

scatter = ax.scatter(inputs_n['degree_n'],
           inputs_n['job_n'],
           inputs_n['company_n'],
           c=target,
           cmap='viridis',
           marker='+')

# Add a color bar showing how colors map to the target values
cbar = fig.colorbar(scatter, ax=ax)
cbar.set_label('salary_more_then_100k')

target
0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(inputs_n, target)
DecisionTreeClassifier()
model.score(inputs_n,target)
1.0
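The score of 1.0 is measured on the same rows the tree was fitted on, so it mostly shows that the tree memorized the training set. A hedged sketch of a held-out estimate using cross-validation follows. The rows are retyped from the `inputs_n` and `target` outputs shown earlier, using the clean integer codes rather than the jittered values, and the choice of 4 folds is an assumption.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# (company_n, job_n, degree_n) rows retyped from the encoded table above.
X = pd.DataFrame(
    [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0], [2, 1, 1],
     [0, 2, 1], [0, 1, 0], [0, 0, 0], [0, 0, 1],
     [1, 2, 0], [1, 2, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]],
    columns=['company_n', 'job_n', 'degree_n'])
y = pd.Series([0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
              name='salary_more_then_100k')

# Each fold is scored on rows the tree did not see during fitting.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=4)
print(scores, scores.mean())
```

With only 16 rows the individual fold scores are noisy, so the mean is a rough indication rather than a reliable estimate.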
from sklearn.tree import export_graphviz
FEATURE_NAMES = ['company_n', 'job_n', 'degree_n']
export_graphviz(model, './Data/salary.dot', feature_names = FEATURE_NAMES)
!dot -Tpng ./Data/salary.dot -o ./Data/salary.png
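import matplotlib.pyplot as plt
import cv2 as cv
# OpenCV loads images in BGR order; convert to RGB so matplotlib shows the right colors
img = cv.cvtColor(cv.imread('./Data/salary.png'), cv.COLOR_BGR2RGB)
plt.figure(figsize = (20, 20))
plt.imshow(img)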
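If Graphviz or OpenCV is not available, a fitted tree can also be printed as nested if/else rules with `sklearn.tree.export_text`. The sketch below is self-contained and therefore refits a tree on the clean integer codes retyped from the tables above, so its splits may differ from the tree in the image (which was fitted on jittered values).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURE_NAMES = ['company_n', 'job_n', 'degree_n']

# (company_n, job_n, degree_n) rows retyped from the encoded table above.
X = pd.DataFrame(
    [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0], [2, 1, 1],
     [0, 2, 1], [0, 1, 0], [0, 0, 0], [0, 0, 1],
     [1, 2, 0], [1, 2, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]],
    columns=FEATURE_NAMES)
y = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# One line per node: a threshold test on a feature, or a leaf with its class.
rules = export_text(model, feature_names=FEATURE_NAMES)
print(rules)
```

Each indented level in the printed output corresponds to one if statement in the tree.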

Predict

# Pass a DataFrame with the training column names to avoid scikit-learn's feature-name warning
model.predict(pd.DataFrame([[2,1,0]], columns=FEATURE_NAMES))
array([0])
model.predict(pd.DataFrame([[2,1,1]], columns=FEATURE_NAMES))
array([0])