
Commit

added imblearn caveats
lshpaner committed Nov 19, 2024
1 parent 436d97c commit d929d6e
Showing 15 changed files with 458 additions and 1 deletion.
Binary file modified docs/.doctrees/caveats.doctree
Binary file modified docs/.doctrees/environment.pickle
152 changes: 152 additions & 0 deletions docs/_sources/caveats.rst.txt
@@ -387,3 +387,155 @@ Summary

Calibration is essential when the probabilities output by a model need to be trusted, such as in risk assessment, medical diagnosis, and other critical applications.


Caveats in Imbalanced Learning
----------------------------------

Working with imbalanced datasets introduces several challenges that must be carefully addressed
to ensure the resulting model is both effective and fair. Below are key caveats to consider:

Bias from Class Distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In imbalanced datasets, the prior probabilities of the classes are highly skewed:

.. math::

    P(Y = c_{\text{minority}}) \ll P(Y = c_{\text{majority}})

This imbalance can lead models to prioritize the majority class, resulting in biased predictions
that overlook the minority class. Models may optimize for accuracy but fail to capture the true
distribution of minority class instances.
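
As a quick sanity check, the empirical class priors can be inspected directly from the label
vector before any modeling. The sketch below is a minimal illustration using only NumPy; the
990/10 split is a made-up example of a skewed prior.

.. code-block:: python

    import numpy as np

    # Hypothetical binary labels: 990 majority (0) vs. 10 minority (1) samples.
    y = np.array([0] * 990 + [1] * 10)

    # Empirical class priors P(Y = c) for each class.
    classes, counts = np.unique(y, return_counts=True)
    for c, p in zip(classes, counts / counts.sum()):
        print(f"P(Y = {c}) = {p:.3f}")

    # Imbalance ratio: majority count divided by minority count.
    print("Imbalance ratio:", counts.max() / counts.min())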

Threshold-Dependent Predictions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many classifiers rely on a decision threshold :math:`\tau` to make predictions:

.. math::

    \text{Predict } c_{\text{minority}} \text{ if } \hat{P}(Y = c_{\text{minority}} \mid X) \geq \tau

With imbalanced data, the default threshold may favor the majority class, causing a high rate of
false negatives for the minority class. Adjusting the threshold to account for imbalance can
help mitigate this issue, but it requires careful tuning and validation.
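
A minimal sketch of threshold adjustment with a scikit-learn classifier is shown below. The
synthetic dataset, the logistic regression model, and the threshold value of 0.3 are illustrative
assumptions; in practice, :math:`\tau` should be selected on a validation set against a metric
that reflects the cost of false negatives.

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical imbalanced dataset (roughly 5% minority class).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, stratify=y, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # predict() implicitly uses tau = 0.5, which often under-predicts the minority class.
    default_preds = model.predict(X_valid)

    # Lowering tau trades false negatives for false positives; 0.3 is illustrative only.
    tau = 0.3
    proba_minority = model.predict_proba(X_valid)[:, 1]
    adjusted_preds = (proba_minority >= tau).astype(int)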

Limitations of Accuracy
^^^^^^^^^^^^^^^^^^^^^^^^^^

Traditional accuracy is a misleading metric in imbalanced datasets. For example, a model predicting
only the majority class can achieve high accuracy despite failing to identify any minority class instances.
Instead, alternative metrics should be used:

- **Precision** for the minority class:

  Measures the proportion of correctly predicted minority class instances out of all
  instances predicted as the minority class.

  .. math::

     \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

- **Recall** for the minority class:

  Measures the proportion of correctly predicted minority class instances out of all actual
  minority class instances.

  .. math::

     \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

- **F1-Score**, the harmonic mean of precision and recall:

  Balances precision and recall to provide a single performance measure.

  .. math::

     F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

These metrics provide a more balanced evaluation of model performance on imbalanced datasets.
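
These metrics are available in ``sklearn.metrics``. The short sketch below scores a hypothetical
set of predictions for a rare positive class; the label arrays are made up purely for illustration.

.. code-block:: python

    from sklearn.metrics import precision_score, recall_score, f1_score

    # Hypothetical labels where class 1 is the rare (minority) class.
    y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

    # pos_label=1 scores the minority class specifically.
    precision = precision_score(y_true, y_pred, pos_label=1)  # 2 / (2 + 1)
    recall = recall_score(y_true, y_pred, pos_label=1)        # 2 / (2 + 1)
    f1 = f1_score(y_true, y_pred, pos_label=1)

    print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")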


Impact of Resampling Techniques
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resampling methods such as oversampling and undersampling can address class imbalance but come
with trade-offs, summarized in the list below and illustrated in the sketch that follows it:

- **Oversampling Caveats**:

  - Methods like SMOTE may introduce synthetic data that does not fully reflect the true distribution of the minority class.
  - Overfitting to the minority class is a risk if too much synthetic data is added.

- **Undersampling Caveats**:

  - Removing samples from the majority class can lead to loss of important information, reducing the model's generalizability.
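
Both families of techniques are implemented in ``imbalanced-learn``. The sketch below shows one
possible combination with illustrative ``sampling_strategy`` values; keeping the resampling steps
inside an ``imblearn`` pipeline ensures they are fit on training folds only, so evaluation folds
retain their original class distribution.

.. code-block:: python

    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical imbalanced dataset (roughly 5% minority class).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

    pipeline = Pipeline(
        steps=[
            # Oversample the minority class to half the majority count ...
            ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
            # ... then trim the majority class toward a 0.8 minority/majority ratio.
            ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
            ("model", LogisticRegression(max_iter=1000)),
        ]
    )

    scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
    print("Cross-validated F1:", scores.mean())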


SMOTE: A Mathematical Illustration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used algorithm for addressing
class imbalance by generating synthetic samples for the minority class. While powerful,
SMOTE comes with inherent caveats that practitioners should understand. Below is a mathematical
illustration highlighting these caveats.

**Synthetic Sample Generation**

SMOTE generates synthetic samples by interpolating between a minority class sample and its nearest
neighbors. Mathematically, a synthetic sample :math:`\mathbf{x}_{\text{synthetic}}` is defined as:

.. math::

    \mathbf{x}_{\text{synthetic}} = \mathbf{x}_i + \delta \cdot (\mathbf{x}_k - \mathbf{x}_i)

where:

- :math:`\mathbf{x}_i`: A minority class sample.
- :math:`\mathbf{x}_k`: One of its :math:`k` nearest neighbors (from the same class).
- :math:`\delta`: A random value drawn from a uniform distribution, :math:`\delta \sim U(0, 1)`.

This process ensures that synthetic samples are generated along the line segments connecting
minority class samples and their neighbors.
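
A minimal NumPy sketch of this interpolation step is shown below. It assumes a minority sample and
one of its nearest neighbors have already been identified; a full SMOTE implementation would first
run a k-nearest-neighbors search within the minority class.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical minority sample and one of its k nearest neighbors.
    x_i = np.array([1.0, 2.0])
    x_k = np.array([3.0, 4.0])

    # delta ~ U(0, 1) places the synthetic point on the segment between x_i and x_k.
    delta = rng.uniform(0.0, 1.0)
    x_synthetic = x_i + delta * (x_k - x_i)

    print("delta =", delta)
    print("x_synthetic =", x_synthetic)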

**Caveats in Application**

1. **Overlapping Classes**:

   - SMOTE assumes that the minority class samples are well-clustered and separable from the majority class.
   - If the minority class overlaps significantly with the majority class, synthetic samples may fall into regions dominated by the majority class, leading to misclassification.

2. **Noise Sensitivity**:

   - SMOTE generates synthetic samples based on existing minority class samples, including noisy or mislabeled ones.
   - Synthetic samples created from noisy data can amplify the noise, degrading model performance.

3. **Feature Space Assumptions**:

   - SMOTE relies on linear interpolation in the feature space, which assumes that the feature space is homogeneous.
   - In highly non-linear spaces, this assumption may not hold, leading to unrealistic synthetic samples.

4. **Dimensionality Challenges**:

   - In high-dimensional spaces, nearest neighbor calculations may become less meaningful due to the curse of dimensionality.
   - Synthetic samples may not adequately represent the true distribution of the minority class.

5. **Risk of Overfitting**:

   - If SMOTE is applied excessively, the model may overfit to the synthetic minority class samples, reducing generalizability to unseen data.

Example of Synthetic Sample Creation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To illustrate, consider a minority class sample :math:`\mathbf{x}_i = [1, 2]` and its nearest neighbor
:math:`\mathbf{x}_k = [3, 4]`. If :math:`\delta = 0.5`, the synthetic sample is computed as:

.. math::

    \mathbf{x}_{\text{synthetic}} = [1, 2] + 0.5 \cdot ([3, 4] - [1, 2])

.. math::

    \mathbf{x}_{\text{synthetic}} = [2, 3]

This synthetic sample lies midway between the two points in the feature space.
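
The same arithmetic can be reproduced in a few lines of NumPy:

.. code-block:: python

    import numpy as np

    x_i = np.array([1.0, 2.0])
    x_k = np.array([3.0, 4.0])
    delta = 0.5

    # Interpolate halfway along the segment from x_i to x_k.
    print(x_i + delta * (x_k - x_i))  # [2. 3.]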

Mitigating the Caveats
~~~~~~~~~~~~~~~~~~~~~~~~~

- **Combine SMOTE with Undersampling**: Techniques like SMOTEENN or SMOTETomek remove noisy or overlapping samples after synthetic generation (see the sketch after this list).

- **Apply with Feature Engineering**: Ensure the feature space is meaningful and represents the underlying data structure.

- **Tune Oversampling Ratio**: Avoid generating excessive synthetic samples to reduce overfitting.
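
The combined samplers live in ``imblearn.combine``. The sketch below is a rough illustration on a
synthetic dataset with default parameters; the resulting class counts show how the cleaning step
also removes samples after oversampling.

.. code-block:: python

    from collections import Counter

    from imblearn.combine import SMOTEENN, SMOTETomek
    from sklearn.datasets import make_classification

    # Hypothetical imbalanced dataset (roughly 5% minority class).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

    # SMOTE oversampling followed by Edited Nearest Neighbours cleaning,
    # which drops samples whose neighborhood disagrees with their label.
    X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X, y)

    # SMOTE oversampling followed by removal of Tomek links
    # (cross-class nearest-neighbor pairs on the class boundary).
    X_tomek, y_tomek = SMOTETomek(random_state=42).fit_resample(X, y)

    print("Original:  ", Counter(y))
    print("SMOTEENN:  ", Counter(y_enn))
    print("SMOTETomek:", Counter(y_tomek))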

1 change: 1 addition & 0 deletions docs/about.html
<li class="toctree-l1"><a class="reference internal" href="caveats.html#caveats-in-imbalanced-learning">Caveats in Imbalanced Learning</a></li>
133 changes: 133 additions & 0 deletions docs/caveats.html
1 change: 1 addition & 0 deletions docs/changelog.html
1 change: 1 addition & 0 deletions docs/genindex.html
1 change: 1 addition & 0 deletions docs/getting_started.html
