sampling_with_replacement
To build a tree ensemble, we use a technique called sampling with replacement.
This method involves creating multiple random training sets from the original dataset,
making the algorithm more robust.
Here's how it works: Imagine you have four colored tokens (red, yellow, green, and blue) in a bag.
You draw one token, note its color, and then put it back in the bag before drawing again. Repeating this process
multiple times means the same token can be drawn more than once, while some tokens might not be drawn at all.
Applying this to our dataset of 10 examples of cats and dogs, we randomly select examples with replacement
to create new training sets. Each new set will have 10 examples, some repeated and some possibly missing.
This randomness helps in building diverse decision trees.
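As a rough Python sketch of one such round of sampling with replacement (the 10-example cat/dog list below is only a
placeholder, not the actual dataset):

    import random

    # Placeholder stand-in for the 10-example cat/dog dataset.
    dataset = ["cat", "dog", "cat", "dog", "dog", "cat", "cat", "dog", "cat", "dog"]

    # Draw 10 examples with replacement: some examples can appear several
    # times, while others may not appear at all.
    bootstrap_sample = random.choices(dataset, k=len(dataset))
    print(bootstrap_sample)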
By using these varied training sets, we construct multiple decision trees. When making a prediction, each
tree votes, and the majority vote determines the final prediction. This ensemble approach reduces the impact of
any single tree's errors, leading to more accurate and robust predictions.
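To make the voting step concrete, here is a small sketch of how a majority vote over three hypothetical tree
predictions could be tallied:

    from collections import Counter

    # Hypothetical votes from three trees on a single example.
    votes = ["cat", "dog", "cat"]

    # The most common vote becomes the ensemble's final prediction.
    prediction = Counter(votes).most_common(1)[0][0]
    print(prediction)  # -> "cat"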
Now that we understand how to use sampling with replacement to create new training sets, we can build our first tree
ensemble algorithm: the bagged decision tree, which already works much better than a single decision tree.
Here's how it works:
Generate Training Sets: Given a training set of size M, create B new training sets using sampling with replacement.
In each new set, some examples may be repeated and some may be missing.
Train Decision Trees: Train a decision tree on each of these new training sets, repeating the process B times to
obtain B different trees.
Make Predictions: To classify a new example, let each tree vote on the outcome; the majority vote determines the
final prediction.
A typical choice for B is around 100, as increasing B beyond this point yields diminishing returns.
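Putting the three steps together, here is a minimal sketch of the bagged-tree procedure, assuming scikit-learn and
NumPy are available and that the labels are binary 0/1 (the tiny random dataset is only a placeholder):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))        # placeholder features, M = 10 examples
    y = rng.integers(0, 2, size=10)     # placeholder binary labels

    B, M = 100, len(X)
    trees = []
    for _ in range(B):
        # Step 1: sample M examples with replacement to form a new training set.
        idx = rng.integers(0, M, size=M)
        # Step 2: train one decision tree on that bootstrap sample.
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # Step 3: every tree votes; with 0/1 labels the majority vote is simply
    # whether the average vote is at least 0.5.
    votes = np.array([tree.predict(X) for tree in trees])
    final_prediction = (votes.mean(axis=0) >= 0.5).astype(int)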
To further improve the algorithm, we can modify it to create a Random Forest. Instead of considering all n features at
each split, we randomly select a subset of k features and choose the best split only from that subset; a common choice
is k near the square root of n. This randomization helps ensure that the trees are more diverse, leading to more
accurate predictions.
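In practice, a library implementation already bundles the bagging and the per-split feature subsampling. Assuming
scikit-learn is available, the idea can be expressed as follows (the parameter values shown are common choices, not
taken from these notes):

    from sklearn.ensemble import RandomForestClassifier

    # n_estimators is the number of trees B; max_features="sqrt" makes each split
    # consider only a random subset of roughly sqrt(n) of the n features, which is
    # the extra randomization that distinguishes a random forest from plain bagged trees.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
    # forest.fit(X, y) would then train it on the same X and y as in the sketch above.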
The Random Forest algorithm is robust because it averages over many small changes to the training set, reducing the
impact of any single change. This makes it more reliable than a single decision tree.