🌲🌳 Decision Tree 🌲🌳
The Decision Tree algorithm comes under supervised learning; it is used for both regression and classification.
Important Terminology related to Decision tree:
- Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
- Splitting: It is a process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes, then it is called the decision node.
- Leaf / Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
- Pruning: Removing sub-nodes of a decision node is called pruning. You can think of it as the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
- Parent and Child Node: A node, which is divided into sub-nodes is called a parent node of sub-nodes whereas sub-nodes are the child of a parent node.
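As a quick illustration of this terminology, here is a minimal sketch (assuming scikit-learn is installed) that fits a shallow tree on the Iris dataset and prints its structure, so the root node, decision nodes, and leaf nodes are visible:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree keeps the printout small enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The indented text shows the root split at the top, decision nodes
# below it, and "class: ..." lines at the leaf / terminal nodes
print(export_text(tree, feature_names=load_iris().feature_names))
```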
Before starting with decision trees, let's understand these three topics:
a) Entropy
b) Information Gain
c) Gini impurity
I will explain all of these in simple words, so don't worry!
Entropy:- Entropy measures the impurity of a node, which helps us judge the quality of a split.
>Entropy controls how a Decision Tree decides to split the data.
>Suppose we have 3 input features f1, f2, f3. Out of these three, which feature should we select first to start the tree?
>Selecting the best feature saves time and memory, improves model performance, and reaches leaf nodes earlier.
> Entropy ranges from 0 to 1 (for a two-class problem).
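As a sketch, entropy for a set of class labels can be computed as H = -Σ pᵢ log₂(pᵢ), where pᵢ is the fraction of samples in class i (the `entropy` function here is just illustrative, not a library call):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))  # 50/50 split, maximally impure -> 1.0
print(entropy(["yes", "yes", "yes", "yes"]))  # a pure node has zero entropy
```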
Information Gain:- Information gain is the reduction in entropy achieved by splitting the data on an attribute; more formally, it is the amount of information gained about one random variable by observing another.
- The attribute with the highest information gain is tested/split on first.
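A hedged sketch of this idea: information gain is the parent node's entropy minus the weighted average entropy of the child nodes after a split (the `information_gain` helper below is illustrative, not a library function):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(labels)
    weighted_child_entropy = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted_child_entropy

# Splitting a 50/50 parent into two pure children gains a full bit:
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A split that leaves each child just as mixed as the parent has zero gain, which is why the tree prefers the attribute with the highest gain.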
Gini impurity:-
Gini impurity is similar to entropy, with a few small differences.
> Both are used for selecting the best feature for the best split (so the question is which one to choose).
> Both work the same way; the main difference is that entropy ranges from 0 to 1 while Gini impurity ranges from 0 to 0.5 (for two classes).
> Entropy takes more computation time because it requires calculating logarithms.
> Gini impurity uses only squared probabilities, so it is cheaper to compute than entropy.
So it is usually better to choose Gini impurity.
It will save computation time!!
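A minimal sketch of Gini impurity, G = 1 - Σ pᵢ², which needs no logarithms (the `gini` function name is just illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))  # worst case for two classes -> 0.5
print(gini(["yes"] * 4))                 # pure node -> 0.0
```

Note that scikit-learn's `DecisionTreeClassifier` uses `criterion="gini"` by default, with `criterion="entropy"` available as an option.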
For the code part please check my github.