
Commit

added imblearn caveats
lshpaner committed Nov 19, 2024
1 parent 436d97c commit d929d6e
Showing 15 changed files with 458 additions and 1 deletion.
Binary file modified docs/.doctrees/caveats.doctree
Binary file modified docs/.doctrees/environment.pickle
152 changes: 152 additions & 0 deletions docs/_sources/caveats.rst.txt
@@ -387,3 +387,155 @@ Summary

Calibration is essential when the probabilities output by a model need to be trusted, such as in risk assessment, medical diagnosis, and other critical applications.


Caveats in Imbalanced Learning
----------------------------------

Working with imbalanced datasets introduces several challenges that must be carefully addressed
to ensure the resulting model is both effective and fair. Below are key caveats to consider:

Bias from Class Distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In imbalanced datasets, the prior probabilities of the classes are highly skewed:

.. math::

    P(Y = c_{\text{minority}}) \ll P(Y = c_{\text{majority}})

This imbalance can lead models to prioritize the majority class, resulting in biased predictions
that overlook the minority class. Models may optimize for accuracy but fail to capture the true
distribution of minority class instances.
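
As a quick sanity check, the empirical class priors can be inspected directly from the label
vector before any modeling. The sketch below is a minimal illustration using only NumPy; the
990/10 split is a made-up example of a skewed prior.

.. code-block:: python

    import numpy as np

    # Hypothetical binary labels: 990 majority (0) vs. 10 minority (1) samples.
    y = np.array([0] * 990 + [1] * 10)

    # Empirical class priors P(Y = c) for each class.
    classes, counts = np.unique(y, return_counts=True)
    for c, p in zip(classes, counts / counts.sum()):
        print(f"P(Y = {c}) = {p:.3f}")

    # Imbalance ratio: majority count divided by minority count.
    print("Imbalance ratio:", counts.max() / counts.min())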

Threshold-Dependent Predictions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many classifiers rely on a decision threshold :math:`\tau` to make predictions:

.. math::

    \text{Predict } c_{\text{minority}} \text{ if } \hat{P}(Y = c_{\text{minority}} \mid X) \geq \tau

With imbalanced data, the default threshold may favor the majority class, causing a high rate of
false negatives for the minority class. Adjusting the threshold to account for imbalance can
help mitigate this issue, but it requires careful tuning and validation.
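
A minimal sketch of threshold adjustment with a scikit-learn classifier is shown below. The
synthetic dataset, the logistic regression model, and the threshold value of 0.3 are illustrative
assumptions; in practice, :math:`\tau` should be selected on a validation set against a metric
that reflects the cost of false negatives.

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical imbalanced dataset (roughly 5% minority class).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, stratify=y, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # predict() implicitly uses tau = 0.5, which often under-predicts the minority class.
    default_preds = model.predict(X_valid)

    # Lowering tau trades false negatives for false positives; 0.3 is illustrative only.
    tau = 0.3
    proba_minority = model.predict_proba(X_valid)[:, 1]
    adjusted_preds = (proba_minority >= tau).astype(int)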

Limitations of Accuracy
^^^^^^^^^^^^^^^^^^^^^^^^^^

Traditional accuracy is a misleading metric in imbalanced datasets. For example, a model predicting
only the majority class can achieve high accuracy despite failing to identify any minority class instances.
Instead, alternative metrics should be used:

- **Precision** for the minority class:

  Measures the proportion of correctly predicted minority class instances out of all
  instances predicted as the minority class.

  .. math::

     \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

- **Recall** for the minority class:

  Measures the proportion of correctly predicted minority class instances out of all actual
  minority class instances.

  .. math::

     \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

- **F1-Score**, the harmonic mean of precision and recall:

  Balances precision and recall to provide a single performance measure.

  .. math::

     F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

These metrics provide a more balanced evaluation of model performance on imbalanced datasets.
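
These metrics are available in ``sklearn.metrics``. The short sketch below scores a hypothetical
set of predictions for a rare positive class; the label arrays are made up purely for illustration.

.. code-block:: python

    from sklearn.metrics import precision_score, recall_score, f1_score

    # Hypothetical labels where class 1 is the rare (minority) class.
    y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

    # pos_label=1 scores the minority class specifically.
    precision = precision_score(y_true, y_pred, pos_label=1)  # 2 / (2 + 1)
    recall = recall_score(y_true, y_pred, pos_label=1)        # 2 / (2 + 1)
    f1 = f1_score(y_true, y_pred, pos_label=1)

    print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")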


Impact of Resampling Techniques
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resampling methods such as oversampling and undersampling can address class imbalance but come
with trade-offs, summarized in the list below and illustrated in the sketch that follows it:

- **Oversampling Caveats**:

  - Methods like SMOTE may introduce synthetic data that does not fully reflect the true distribution of the minority class.
  - Overfitting to the minority class is a risk if too much synthetic data is added.

- **Undersampling Caveats**:

  - Removing samples from the majority class can lead to loss of important information, reducing the model's generalizability.
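
Both families of techniques are implemented in ``imbalanced-learn``. The sketch below shows one
possible combination with illustrative ``sampling_strategy`` values; keeping the resampling steps
inside an ``imblearn`` pipeline ensures they are fit on training folds only, so evaluation folds
retain their original class distribution.

.. code-block:: python

    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical imbalanced dataset (roughly 5% minority class).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

    pipeline = Pipeline(
        steps=[
            # Oversample the minority class to half the majority count ...
            ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
            # ... then trim the majority class toward a 0.8 minority/majority ratio.
            ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
            ("model", LogisticRegression(max_iter=1000)),
        ]
    )

    scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
    print("Cross-validated F1:", scores.mean())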


SMOTE: A Mathematical Illustration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used algorithm for addressing
class imbalance by generating synthetic samples for the minority class. While powerful,
SMOTE comes with inherent caveats that practitioners should understand. Below is a mathematical
illustration highlighting these caveats.

**Synthetic Sample Generation**

SMOTE generates synthetic samples by interpolating between a minority class sample and its nearest
neighbors. Mathematically, a synthetic sample :math:`\mathbf{x}_{\text{synthetic}}` is defined as:

.. math::

    \mathbf{x}_{\text{synthetic}} = \mathbf{x}_i + \delta \cdot (\mathbf{x}_k - \mathbf{x}_i)

where:

- :math:`\mathbf{x}_i`: A minority class sample.
- :math:`\mathbf{x}_k`: One of its :math:`k` nearest neighbors (from the same class).
- :math:`\delta`: A random value drawn from a uniform distribution, :math:`\delta \sim U(0, 1)`.

This process ensures that synthetic samples are generated along the line segments connecting
minority class samples and their neighbors.
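
A minimal NumPy sketch of this interpolation step is shown below. It assumes a minority sample and
one of its nearest neighbors have already been identified; a full SMOTE implementation would first
run a k-nearest-neighbors search within the minority class.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical minority sample and one of its k nearest neighbors.
    x_i = np.array([1.0, 2.0])
    x_k = np.array([3.0, 4.0])

    # delta ~ U(0, 1) places the synthetic point on the segment between x_i and x_k.
    delta = rng.uniform(0.0, 1.0)
    x_synthetic = x_i + delta * (x_k - x_i)

    print("delta =", delta)
    print("x_synthetic =", x_synthetic)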

**Caveats in Application**

1. **Overlapping Classes**:

   - SMOTE assumes that the minority class samples are well-clustered and separable from the majority class.
   - If the minority class overlaps significantly with the majority class, synthetic samples may fall into regions dominated by the majority class, leading to misclassification.

2. **Noise Sensitivity**:

   - SMOTE generates synthetic samples based on existing minority class samples, including noisy or mislabeled ones.
   - Synthetic samples created from noisy data can amplify the noise, degrading model performance.

3. **Feature Space Assumptions**:

   - SMOTE relies on linear interpolation in the feature space, which assumes that the feature space is homogeneous.
   - In highly non-linear spaces, this assumption may not hold, leading to unrealistic synthetic samples.

4. **Dimensionality Challenges**:

   - In high-dimensional spaces, nearest neighbor calculations may become less meaningful due to the curse of dimensionality.
   - Synthetic samples may not adequately represent the true distribution of the minority class.

5. **Risk of Overfitting**:

   - If SMOTE is applied excessively, the model may overfit to the synthetic minority class samples, reducing generalizability to unseen data.

Example of Synthetic Sample Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To illustrate, consider a minority class sample :math:`\mathbf{x}_i = [1, 2]` and its nearest neighbor
:math:`\mathbf{x}_k = [3, 4]`. If :math:`\delta = 0.5`, the synthetic sample is computed as:

.. math::

    \mathbf{x}_{\text{synthetic}} = [1, 2] + 0.5 \cdot ([3, 4] - [1, 2])

.. math::

    \mathbf{x}_{\text{synthetic}} = [2, 3]

This synthetic sample lies midway between the two points in the feature space.
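
The same arithmetic can be reproduced in a few lines of NumPy:

.. code-block:: python

    import numpy as np

    x_i = np.array([1.0, 2.0])
    x_k = np.array([3.0, 4.0])
    delta = 0.5

    # Interpolate halfway along the segment from x_i to x_k.
    print(x_i + delta * (x_k - x_i))  # [2. 3.]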

Mitigating the Caveats
~~~~~~~~~~~~~~~~~~~~~~~~~

- **Combine SMOTE with Undersampling**: Techniques like SMOTEENN or SMOTETomek remove noisy or overlapping samples after synthetic generation (see the sketch after this list).

- **Apply with Feature Engineering**: Ensure the feature space is meaningful and represents the underlying data structure.

- **Tune Oversampling Ratio**: Avoid generating excessive synthetic samples to reduce overfitting.
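
The combined samplers live in ``imblearn.combine``. The sketch below is a rough illustration on a
synthetic dataset with default parameters; the resulting class counts show how the cleaning step
also removes samples after oversampling.

.. code-block:: python

    from collections import Counter

    from imblearn.combine import SMOTEENN, SMOTETomek
    from sklearn.datasets import make_classification

    # Hypothetical imbalanced dataset (roughly 5% minority class).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

    # SMOTE oversampling followed by Edited Nearest Neighbours cleaning,
    # which drops samples whose neighborhood disagrees with their label.
    X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X, y)

    # SMOTE oversampling followed by removal of Tomek links
    # (cross-class nearest-neighbor pairs on the class boundary).
    X_tomek, y_tomek = SMOTETomek(random_state=42).fit_resample(X, y)

    print("Original:  ", Counter(y))
    print("SMOTEENN:  ", Counter(y_enn))
    print("SMOTETomek:", Counter(y_tomek))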

1 change: 1 addition & 0 deletions docs/about.html
<li class="toctree-l1"><a class="reference internal" href="caveats.html#caveats-in-imbalanced-learning">Caveats in Imbalanced Learning</a></li>
133 changes: 133 additions & 0 deletions docs/caveats.html
1 change: 1 addition & 0 deletions docs/changelog.html
1 change: 1 addition & 0 deletions docs/genindex.html
1 change: 1 addition & 0 deletions docs/getting_started.html
