Commit

updated versions, prettified caveats section, and linked metrics to pipeline
lshpaner committed Nov 20, 2024
1 parent bc6615c commit 685f681
Showing 16 changed files with 273 additions and 64 deletions.
Binary file modified docs/.doctrees/caveats.doctree
Binary file not shown.
Binary file modified docs/.doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/.doctrees/getting_started.doctree
Binary file not shown.
Binary file modified docs/.doctrees/usage_guide.doctree
Binary file not shown.
55 changes: 52 additions & 3 deletions docs/_sources/caveats.rst.txt
Original file line number Diff line number Diff line change
Expand Up @@ -420,6 +420,8 @@ With imbalanced data, the default threshold may favor the majority class, causin
false negatives for the minority class. Adjusting the threshold to account for imbalance can
help mitigate this issue, but it requires careful tuning and validation.
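As an illustrative sketch of such threshold tuning (using plain scikit-learn on synthetic data, not this library's API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: ~95% majority class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Compare the default 0.5 threshold against a lower threshold chosen
# to favor the minority class (more predicted positives, fewer false negatives).
for threshold in (0.5, 0.25):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: F1={f1_score(y_te, preds):.3f}")
```

The candidate threshold here is arbitrary; in practice it should be selected on a validation set, not the test set.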

.. _Limitations_of_Accuracy:

Limitations of Accuracy
^^^^^^^^^^^^^^^^^^^^^^^^^^

Expand Down Expand Up @@ -453,19 +455,61 @@ Instead, alternative metrics should be used:
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
These metrics provide a more balanced evaluation of model performance on imbalanced datasets.
- **ROC AUC (Receiver Operating Characteristic - Area Under the Curve)**:

Measures the model's ability to distinguish between classes. It is the area under the
ROC curve, which plots the True Positive Rate (Recall) against the False Positive Rate.

.. math::

   \text{True Positive Rate (TPR)} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

.. math::

   \text{False Positive Rate (FPR)} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}

The AUC (Area Under Curve) is computed by integrating the ROC curve:

.. math::

   \text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d(\text{FPR})

This integral represents the total area under the ROC curve, where:

- A value of 0.5 indicates random guessing.
- A value of 1.0 indicates a perfect classifier.

Practically, the AUC is estimated using numerical integration techniques such as the trapezoidal rule
over the discrete points of the ROC curve.
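This trapezoidal estimate can be sketched directly with scikit-learn, whose ``auc`` helper applies the trapezoidal rule to the discrete ROC points (toy scores for illustration):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy scores: most negatives score low, but one negative (0.7)
# outranks one positive (0.6), so the classifier is imperfect.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.7, 0.35, 0.8, 0.6])

# Discrete (FPR, TPR) points of the ROC curve.
fpr, tpr, _ = roc_curve(y_true, y_score)

# `auc` integrates TPR over FPR with the trapezoidal rule.
auc_trapz = auc(fpr, tpr)
print(auc_trapz)  # 0.9375, identical to roc_auc_score(y_true, y_score)
```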

Integration and Practical Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ROC AUC provides an aggregate measure of model performance across all
classification thresholds. However:

- **Imbalanced Datasets**: The ROC AUC may still appear high if the classifier performs well on the majority class, even if the minority class is poorly predicted.
In such cases, metrics like Precision-Recall AUC are more informative.
- **Numerical Estimation**: Most implementations (e.g., in scikit-learn) compute the AUC numerically, ensuring fast and accurate computation.

These metrics provide a more balanced evaluation of model performance on imbalanced datasets. By using metrics like ROC AUC in conjunction with precision, recall, and F1-score, practitioners
can better assess a model's effectiveness in handling imbalanced data.
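The contrast between ROC AUC and Precision-Recall AUC can be sketched on a synthetic 1%-minority dataset (illustrative setup, not this library's API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical heavily imbalanced dataset (~1% minority class).
X, y = make_classification(
    n_samples=5000, weights=[0.99], flip_y=0.02, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# ROC AUC can look strong while PR AUC (average precision) exposes
# weak minority-class performance.
print(f"ROC AUC: {roc_auc_score(y_te, scores):.3f}")
print(f"PR AUC : {average_precision_score(y_te, scores):.3f}")
```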

Impact of Resampling Techniques
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resampling methods such as oversampling and undersampling can address class imbalance but come with trade-offs:

- **Oversampling Caveats**:
**Oversampling Caveats**

- Methods like SMOTE may introduce synthetic data that does not fully reflect the true distribution of the minority class.
- Overfitting to the minority class is a risk if too much synthetic data is added.

- **Undersampling Caveats**:
**Undersampling Caveats**

- Removing samples from the majority class can lead to loss of important information, reducing the model's generalizability.


Expand Down Expand Up @@ -498,22 +542,27 @@ minority class samples and their neighbors.
**Caveats in Application**

1. **Overlapping Classes**:

- SMOTE assumes that the minority class samples are well-clustered and separable from the majority class.
- If the minority class overlaps significantly with the majority class, synthetic samples may fall into regions dominated by the majority class, leading to misclassification.

2. **Noise Sensitivity**:

- SMOTE generates synthetic samples based on existing minority class samples, including noisy or mislabeled ones.
- Synthetic samples created from noisy data can amplify the noise, degrading model performance.

3. **Feature Space Assumptions**:

- SMOTE relies on linear interpolation in the feature space, which assumes that the feature space is homogeneous.
- In highly non-linear spaces, this assumption may not hold, leading to unrealistic synthetic samples.

4. **Dimensionality Challenges**:

- In high-dimensional spaces, nearest neighbor calculations may become less meaningful due to the curse of dimensionality.
- Synthetic samples may not adequately represent the true distribution of the minority class.

5. **Risk of Overfitting**:

- If SMOTE is applied excessively, the model may overfit to the synthetic minority class samples, reducing generalizability to unseen data.

Example of Synthetic Sample Creation
Expand Down
30 changes: 22 additions & 8 deletions docs/_sources/getting_started.rst.txt
Original file line number Diff line number Diff line change
Expand Up @@ -69,26 +69,40 @@ which will be automatically installed when you install ``model_tuner`` using pip
- ``scipy``: version ``1.4.1``
- ``joblib``: version ``1.3.2``
- ``tqdm``: version ``4.66.4``
- ``imbalanced-learn``: ``version 0.7.0``
- ``scikit-optimize``: ``version 0.8.1``
- ``imbalanced-learn``: version ``0.7.0``
- ``scikit-optimize``: version ``0.8.1``
- ``xgboost``: version ``1.6.2``
- ``pip``: version ``24.0``

- For Python ``3.8`` to ``<3.11``:

- ``numpy``: versions between ``1.19.5`` and ``<1.24``
- ``pandas``: versions between ``1.3.5`` and ``<2.2.2``
- ``scikit-learn``: versions between ``1.0.2`` and ``<1.3``
- ``numpy``: versions between ``1.19.5`` and ``<2.0.0``
- ``pandas``: versions between ``1.3.5`` and ``<2.2.3``
- ``scikit-learn``: versions between ``1.0.2`` and ``<1.4.0``
- ``scipy``: versions between ``1.6.3`` and ``<1.11``
- ``joblib``: version ``1.3.2``
- ``tqdm``: version ``4.66.4``
- ``imbalanced-learn``: version ``0.12.4``
- ``scikit-optimize``: version ``0.10.2``
- ``xgboost``: version ``2.1.2``
- ``pip``: version ``24.2``
- ``setuptools``: version ``75.1.0``
- ``wheel``: version ``0.44.0``

- For Python ``3.11`` or higher:

- ``numpy``: version ``1.26``
- ``pandas``: version ``2.2.2``
- ``numpy``: versions between ``1.19.5`` and ``<2.0.0``
- ``pandas``: versions between ``1.3.5`` and ``<2.2.2``
- ``scikit-learn``: version ``1.5.1``
- ``scipy``: version ``1.14.0``
- ``joblib``: version ``1.3.2``
- ``tqdm``: version ``4.66.4``
- ``imbalanced-learn``: version ``0.12.4``
- ``scikit-optimize``: version ``0.10.2``
- ``xgboost``: version ``2.1.2``
- ``pip``: version ``24.2``
- ``setuptools``: version ``75.1.0``
- ``wheel``: version ``0.44.0``

.. _installation:

Expand Down
5 changes: 3 additions & 2 deletions docs/_sources/usage_guide.rst.txt
Original file line number Diff line number Diff line change
Expand Up @@ -239,6 +239,7 @@ Pipeline Management
The pipeline in the model tuner class is designed to automatically organize steps into three categories: **preprocessing**, **feature selection**, and **imbalanced sampling**. The steps are ordered in the following sequence:

1. **Preprocessing**:

- Imputation
- Scaling
- Other preprocessing steps
Expand Down Expand Up @@ -329,8 +330,8 @@ In our library, binary classification is handled seamlessly through the ``Model`
class. Users can specify a binary classifier as the estimator, and the library
takes care of essential tasks like data preprocessing, model calibration, and
cross-validation. The library also provides robust support for evaluating the
model's performance using a variety of metrics, such as accuracy, precision,
recall, and ROC-AUC, ensuring that the model's ability to distinguish between the
model's performance using a variety of metrics, such as :ref:`accuracy, precision,
recall, and ROC-AUC <Limitations_of_Accuracy>`, ensuring that the model's ability to distinguish between the
two classes is thoroughly assessed. Additionally, the library supports advanced
techniques like imbalanced data handling and model calibration to fine-tune
decision thresholds, making it easier to deploy effective binary classifiers in
Expand Down
107 changes: 84 additions & 23 deletions docs/caveats.html
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,10 @@
<li class="toctree-l1"><a class="reference internal" href="#caveats-in-imbalanced-learning">Caveats in Imbalanced Learning</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#bias-from-class-distribution">Bias from Class Distribution</a></li>
<li class="toctree-l2"><a class="reference internal" href="#threshold-dependent-predictions">Threshold-Dependent Predictions</a></li>
<li class="toctree-l2"><a class="reference internal" href="#limitations-of-accuracy">Limitations of Accuracy</a></li>
<li class="toctree-l2"><a class="reference internal" href="#limitations-of-accuracy">Limitations of Accuracy</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#integration-and-practical-considerations">Integration and Practical Considerations</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#impact-of-resampling-techniques">Impact of Resampling Techniques</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#smote-a-mathematical-illustration">SMOTE: A Mathematical Illustration</a></li>
<li class="toctree-l3"><a class="reference internal" href="#example-of-synthetic-sample-creation">Example of Synthetic Sample Creation</a></li>
Expand Down Expand Up @@ -421,7 +424,7 @@ <h2>Threshold-Dependent Predictions<a class="headerlink" href="#threshold-depend
help mitigate this issue, but it requires careful tuning and validation.</p>
</section>
<section id="limitations-of-accuracy">
<h2>Limitations of Accuracy<a class="headerlink" href="#limitations-of-accuracy" title="Link to this heading"></a></h2>
<span id="id1"></span><h2>Limitations of Accuracy<a class="headerlink" href="#limitations-of-accuracy" title="Link to this heading"></a></h2>
<p>Traditional accuracy is a misleading metric in imbalanced datasets. For example, a model predicting
only the majority class can achieve high accuracy despite failing to identify any minority class instances.
Instead, alternative metrics should be used:</p>
Expand Down Expand Up @@ -449,19 +452,62 @@ <h2>Limitations of Accuracy<a class="headerlink" href="#limitations-of-accuracy"
<div class="math notranslate nohighlight">
\[F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]</div>
</li>
<li><p><strong>ROC AUC (Receiver Operating Characteristic - Area Under the Curve)</strong>:</p>
<blockquote>
<div><p>Measures the model’s ability to distinguish between classes. It is the area under the
ROC curve, which plots the True Positive Rate (Recall) against the False Positive Rate.</p>
</div></blockquote>
<div class="math notranslate nohighlight">
\[\text{True Positive Rate (TPR)} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]</div>
<div class="math notranslate nohighlight">
\[\text{False Positive Rate (FPR)} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}\]</div>
</li>
</ul>
<p></p>
<blockquote>
<div><p>The AUC (Area Under Curve) is computed by integrating the ROC curve:</p>
<div class="math notranslate nohighlight">
\[\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d(\text{FPR})\]</div>
<p>This integral represents the total area under the ROC curve, where:</p>
<ul>
<li><p>A value of 0.5 indicates random guessing.</p></li>
<li><p>A value of 1.0 indicates a perfect classifier.</p>
<blockquote>
<div><p>Practically, the AUC is estimated using numerical integration techniques such as the trapezoidal rule
over the discrete points of the ROC curve.</p>
</div></blockquote>
</li>
</ul>
</div></blockquote>
<section id="integration-and-practical-considerations">
<h3>Integration and Practical Considerations<a class="headerlink" href="#integration-and-practical-considerations" title="Link to this heading"></a></h3>
<p>The ROC AUC provides an aggregate measure of model performance across all classification thresholds.</p>
<p>However:</p>
<ul class="simple">
<li><p><strong>Imbalanced Datasets</strong>: The ROC AUC may still appear high if the classifier performs well on the majority class, even if the minority class is poorly predicted.
In such cases, metrics like Precision-Recall AUC are more informative.</p></li>
<li><p><strong>Numerical Estimation</strong>: Most implementations (e.g., in scikit-learn) compute the AUC numerically, ensuring fast and accurate computation.</p></li>
</ul>
<p>These metrics provide a more balanced evaluation of model performance on imbalanced datasets.</p>
<p>These metrics provide a more balanced evaluation of model performance on imbalanced datasets. By using metrics like ROC AUC in conjunction with precision, recall, and F1-score, practitioners
can better assess a model’s effectiveness in handling imbalanced data.</p>
</section>
</section>
<section id="impact-of-resampling-techniques">
<h2>Impact of Resampling Techniques<a class="headerlink" href="#impact-of-resampling-techniques" title="Link to this heading"></a></h2>
<p>Resampling methods such as oversampling and undersampling can address class imbalance but come with trade-offs:</p>
<ul class="simple">
<li><p><strong>Oversampling Caveats</strong>:
- Methods like SMOTE may introduce synthetic data that does not fully reflect the true distribution of the minority class.
- Overfitting to the minority class is a risk if too much synthetic data is added.</p></li>
<li><p><strong>Undersampling Caveats</strong>:
- Removing samples from the majority class can lead to loss of important information, reducing the model’s generalizability.</p></li>
<p><strong>Oversampling Caveats</strong></p>
<blockquote>
<div><ul class="simple">
<li><p>Methods like SMOTE may introduce synthetic data that does not fully reflect the true distribution of the minority class.</p></li>
<li><p>Overfitting to the minority class is a risk if too much synthetic data is added.</p></li>
</ul>
</div></blockquote>
<p><strong>Undersampling Caveats</strong></p>
<blockquote>
<div><ul class="simple">
<li><p>Removing samples from the majority class can lead to loss of important information, reducing the model’s generalizability.</p></li>
</ul>
</div></blockquote>
<section id="smote-a-mathematical-illustration">
<h3>SMOTE: A Mathematical Illustration<a class="headerlink" href="#smote-a-mathematical-illustration" title="Link to this heading"></a></h3>
<p>SMOTE (Synthetic Minority Over-sampling Technique) is a widely used algorithm for addressing
Expand All @@ -483,20 +529,35 @@ <h3>SMOTE: A Mathematical Illustration<a class="headerlink" href="#smote-a-mathe
minority class samples and their neighbors.</p>
<p><strong>Caveats in Application</strong></p>
<ol class="arabic simple">
<li><p><strong>Overlapping Classes</strong>:
- SMOTE assumes that the minority class samples are well-clustered and separable from the majority class.
- If the minority class overlaps significantly with the majority class, synthetic samples may fall into regions dominated by the majority class, leading to misclassification.</p></li>
<li><p><strong>Noise Sensitivity</strong>:
- SMOTE generates synthetic samples based on existing minority class samples, including noisy or mislabeled ones.
- Synthetic samples created from noisy data can amplify the noise, degrading model performance.</p></li>
<li><p><strong>Feature Space Assumptions</strong>:
- SMOTE relies on linear interpolation in the feature space, which assumes that the feature space is homogeneous.
- In highly non-linear spaces, this assumption may not hold, leading to unrealistic synthetic samples.</p></li>
<li><p><strong>Dimensionality Challenges</strong>:
- In high-dimensional spaces, nearest neighbor calculations may become less meaningful due to the curse of dimensionality.
- Synthetic samples may not adequately represent the true distribution of the minority class.</p></li>
<li><p><strong>Risk of Overfitting</strong>:
- If SMOTE is applied excessively, the model may overfit to the synthetic minority class samples, reducing generalizability to unseen data.</p></li>
<li><p><strong>Overlapping Classes</strong>:</p>
<ul class="simple">
<li><p>SMOTE assumes that the minority class samples are well-clustered and separable from the majority class.</p></li>
<li><p>If the minority class overlaps significantly with the majority class, synthetic samples may fall into regions dominated by the majority class, leading to misclassification.</p></li>
</ul>
</li>
<li><p><strong>Noise Sensitivity</strong>:</p>
<ul class="simple">
<li><p>SMOTE generates synthetic samples based on existing minority class samples, including noisy or mislabeled ones.</p></li>
<li><p>Synthetic samples created from noisy data can amplify the noise, degrading model performance.</p></li>
</ul>
</li>
<li><p><strong>Feature Space Assumptions</strong>:</p>
<ul class="simple">
<li><p>SMOTE relies on linear interpolation in the feature space, which assumes that the feature space is homogeneous.</p></li>
<li><p>In highly non-linear spaces, this assumption may not hold, leading to unrealistic synthetic samples.</p></li>
</ul>
</li>
<li><p><strong>Dimensionality Challenges</strong>:</p>
<ul class="simple">
<li><p>In high-dimensional spaces, nearest neighbor calculations may become less meaningful due to the curse of dimensionality.</p></li>
<li><p>Synthetic samples may not adequately represent the true distribution of the minority class.</p></li>
</ul>
</li>
<li><p><strong>Risk of Overfitting</strong>:</p>
<ul class="simple">
<li><p>If SMOTE is applied excessively, the model may overfit to the synthetic minority class samples, reducing generalizability to unseen data.</p></li>
</ul>
</li>
</ol>
</section>
<section id="example-of-synthetic-sample-creation">
Expand Down