Handling Categorical Features with More Than Two Values Using One-Hot Encoding
Context:
In decision trees, features often take on binary values (e.g., pointy or floppy ears).
When features can take on more than two values (e.g., ear shape can be pointy, floppy, or oval), one-hot encoding can be used.
One-Hot Encoding:
Concept: Convert a categorical feature with k possible values into k binary features.
Example:
Original feature: Ear shape (pointy, floppy, oval).
One-hot encoded features:
Pointy ears: 1 if pointy, 0 otherwise.
Floppy ears: 1 if floppy, 0 otherwise.
Oval ears: 1 if oval, 0 otherwise.
Process:
Convert Each Value:
For each example, set the corresponding binary feature to 1 and the others to 0.
Example: If an animal has pointy ears, the features would be [1, 0, 0].
Apply to Dataset:
Transform the entire dataset using one-hot encoding.
Each original feature with k values becomes k binary features.
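A minimal sketch of this encoding step in Python (the one_hot_encode helper and the sample values are illustrative, not part of the original notes):

    def one_hot_encode(values, categories):
        # Map each categorical value to a list of k binary indicators,
        # one per category, with a 1 in the position of the matching category.
        return [[1 if v == c else 0 for c in categories] for v in values]

    ear_shapes = ["pointy", "floppy", "oval", "pointy"]
    categories = ["pointy", "floppy", "oval"]

    print(one_hot_encode(ear_shapes, categories))
    # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]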
Advantages:
Compatibility: Transforms categorical features into binary features, making them compatible with decision tree algorithms.
Versatility: One-hot encoding can also be used for neural networks, logistic regression, and other models that require numerical inputs.
Example Application:
Original Features:
Ear shape: [pointy, floppy, oval]
Face shape: [round, not round]
Whiskers: [present, absent]
One-Hot Encoded Features:
Ear shape: pointy -> [1, 0, 0], floppy -> [0, 1, 0], oval -> [0, 0, 1]
Face shape: [round, not round] -> [1, 0], [0, 1]
Whiskers: [present, absent] -> [1, 0], [0, 1]
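As a rough illustration of transforming a whole dataset, the same idea can be expressed with pandas (the three-row DataFrame below is hypothetical; pd.get_dummies expands each k-valued column into k binary columns):

    import pandas as pd

    # Hypothetical mini-dataset with the three categorical features above.
    df = pd.DataFrame({
        "ear_shape":  ["pointy", "floppy", "oval"],
        "face_shape": ["round", "not round", "round"],
        "whiskers":   ["present", "absent", "present"],
    })

    # Each k-valued column becomes k binary (0/1) columns.
    encoded = pd.get_dummies(df, columns=["ear_shape", "face_shape", "whiskers"], dtype=int)
    print(encoded)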
Conclusion:
One-hot encoding allows decision trees to handle features with more than two discrete values.
It also enables the use of these features in other models like neural networks and logistic regression.
Modifying Decision Trees for Continuous Features
Context:
Decision trees typically handle discrete features, but they can be adapted for continuous features (e.g., weight).
Example:
New Feature: Weight of the animal (in pounds).
Goal: Use weight to help classify animals as cats or not cats.
Process:
Consider Continuous Features:
Include continuous features like weight in the decision tree algorithm.
Evaluate splits based on whether the weight is less than or equal to various threshold values.
Splitting on Continuous Features:
Plot Data: Plot weight on the horizontal axis and the label (cat or not cat) on the vertical axis.
Choose Thresholds: Consider multiple threshold values for splitting (e.g., weight ≤ 8, weight ≤ 9, etc.).
Calculate Information Gain: For each threshold, calculate the information gain.
Example: If splitting at weight ≤ 8, calculate the entropy of the left and right subsets and determine the information gain (see the sketch after this list).
Select the Best Split:
Compare information gains for different thresholds.
Choose the threshold that provides the highest information gain.
Example: Splitting at weight ≤ 9 might give a higher information gain than other thresholds.
Recursive Splitting:
Once the best threshold is chosen, split the data into two subsets.
Recursively apply the decision tree algorithm to each subset.
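A minimal sketch of the entropy and information-gain calculation for a single threshold (the weights and labels below are made-up toy data; 1 = cat, 0 = not cat):

    import math

    def entropy(labels):
        # Binary entropy of the fraction of positive (cat) labels.
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def information_gain(weights, labels, threshold):
        # Root entropy minus the weighted entropy of the split weight <= threshold.
        left = [y for x, y in zip(weights, labels) if x <= threshold]
        right = [y for x, y in zip(weights, labels) if x > threshold]
        n = len(labels)
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(labels) - weighted

    # Toy data: animal weights in pounds, labels (1 = cat, 0 = not cat).
    weights = [7.2, 8.4, 8.8, 9.0, 9.2, 10.2, 11.0, 15.0, 18.0, 20.0]
    labels  = [1,   1,   1,   0,   1,   1,    0,    0,    0,    0]

    print(information_gain(weights, labels, 9))   # gain for splitting at weight <= 9
    print(information_gain(weights, labels, 8))   # gain for splitting at weight <= 8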
General Approach:
Sort Values: Sort the examples by the continuous feature.
Midpoints: Consider midpoints between sorted values as potential thresholds.
Evaluate: Test each threshold and select the one with the highest information gain.
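Continuing the toy example above (and reusing its information_gain helper, weights, and labels), a sketch of the midpoint-based search for the best threshold:

    def best_threshold(weights, labels):
        # Candidate thresholds: midpoints between consecutive sorted weight values.
        sorted_w = sorted(set(weights))
        candidates = [(a + b) / 2 for a, b in zip(sorted_w, sorted_w[1:])]
        gains = {t: information_gain(weights, labels, t) for t in candidates}
        best = max(gains, key=gains.get)
        return best, gains[best]

    threshold, gain = best_threshold(weights, labels)
    print(f"best split: weight <= {threshold} (information gain {gain:.3f})")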
Summary:
To handle continuous features, consider different threshold values for splitting.
Calculate information gain for each threshold and choose the best one.
Apply the decision tree algorithm recursively to build the tree.
This approach allows decision trees to effectively use continuous features, enhancing their ability to classify data accurately.