diff --git a/docs/.doctrees/caveats.doctree b/docs/.doctrees/caveats.doctree
index 9685745..4e63ef9 100644
Binary files a/docs/.doctrees/caveats.doctree and b/docs/.doctrees/caveats.doctree differ
diff --git a/docs/.doctrees/environment.pickle b/docs/.doctrees/environment.pickle
index f3eff4d..0a84fd2 100644
Binary files a/docs/.doctrees/environment.pickle and b/docs/.doctrees/environment.pickle differ
diff --git a/docs/.doctrees/getting_started.doctree b/docs/.doctrees/getting_started.doctree
index 9968907..02733b7 100644
Binary files a/docs/.doctrees/getting_started.doctree and b/docs/.doctrees/getting_started.doctree differ
diff --git a/docs/.doctrees/usage_guide.doctree b/docs/.doctrees/usage_guide.doctree
index 334ead6..f03f02f 100644
Binary files a/docs/.doctrees/usage_guide.doctree and b/docs/.doctrees/usage_guide.doctree differ
diff --git a/docs/_sources/caveats.rst.txt b/docs/_sources/caveats.rst.txt
index c64cb2d..f23bb65 100644
--- a/docs/_sources/caveats.rst.txt
+++ b/docs/_sources/caveats.rst.txt
@@ -420,6 +420,8 @@ With imbalanced data, the default threshold may favor the majority class, causing
 false negatives for the minority class. Adjusting the threshold to account for
 imbalance can help mitigate this issue, but it requires careful tuning and validation.
 
+.. _Limitations_of_Accuracy:
+
 Limitations of Accuracy
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -453,19 +455,61 @@ Instead, alternative metrics should be used:
 
     F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
 
-These metrics provide a more balanced evaluation of model performance on imbalanced datasets.
+- **ROC AUC (Receiver Operating Characteristic - Area Under the Curve)**:
+
+  Measures the model's ability to distinguish between classes. It is the area under the
+  ROC curve, which plots the True Positive Rate (Recall) against the False Positive Rate.
+
+  .. math::
+
+    \text{True Positive Rate (TPR)} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
+
+  .. math::
+
+    \text{False Positive Rate (FPR)} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}
+
+\
+
+  The AUC (Area Under Curve) is computed by integrating the ROC curve:
+
+  .. math::
+
+    \text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d(\text{FPR})
+
+  This integral represents the total area under the ROC curve, where:
+
+  - A value of 0.5 indicates random guessing.
+  - A value of 1.0 indicates a perfect classifier.
+
+  Practically, the AUC is estimated using numerical integration techniques such as the
+  trapezoidal rule over the discrete points of the ROC curve.
+
+Integration and Practical Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ROC AUC provides an aggregate measure of model performance across all classification thresholds.
+
+However:
+
+- **Imbalanced Datasets**: The ROC AUC may still appear high if the classifier performs well
+  on the majority class, even if the minority class is poorly predicted.
+  In such cases, metrics like Precision-Recall AUC are more informative.
+- **Numerical Estimation**: Most implementations (e.g., in scikit-learn) compute the AUC
+  numerically, ensuring fast and accurate computation.
+
+These metrics provide a more balanced evaluation of model performance on imbalanced datasets.
+By using metrics like ROC AUC in conjunction with precision, recall, and F1-score, practitioners
+can better assess a model's effectiveness in handling imbalanced data.
 
 Impact of Resampling Techniques
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Resampling methods such as oversampling and undersampling can address class imbalance
 but come with trade-offs:
 
-- **Oversampling Caveats**:
+**Oversampling Caveats**
+
   - Methods like SMOTE may introduce synthetic data that does not fully reflect the true
     distribution of the minority class.
  - Overfitting to the minority class is a risk if too much synthetic data is added.
 
-- **Undersampling Caveats**:
+**Undersampling Caveats**
+
   - Removing samples from the majority class can lead to loss of important information,
     reducing the model's generalizability.
@@ -498,22 +542,27 @@ minority class samples and their neighbors.
 
 **Caveats in Application**
 
 1. **Overlapping Classes**:
+
    - SMOTE assumes that the minority class samples are well-clustered and separable from the majority class.
    - If the minority class overlaps significantly with the majority class, synthetic samples may fall into
      regions dominated by the majority class, leading to misclassification.
 
 2. **Noise Sensitivity**:
+
    - SMOTE generates synthetic samples based on existing minority class samples, including noisy or mislabeled ones.
    - Synthetic samples created from noisy data can amplify the noise, degrading model performance.
 
 3. **Feature Space Assumptions**:
+
    - SMOTE relies on linear interpolation in the feature space, which assumes that the feature space is homogeneous.
    - In highly non-linear spaces, this assumption may not hold, leading to unrealistic synthetic samples.
 
 4. **Dimensionality Challenges**:
+
    - In high-dimensional spaces, nearest neighbor calculations may become less meaningful due to the curse of dimensionality.
    - Synthetic samples may not adequately represent the true distribution of the minority class.
 
 5. **Risk of Overfitting**:
+
    - If SMOTE is applied excessively, the model may overfit to the synthetic minority class samples, reducing generalizability to unseen data.
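As a minimal illustration of the linear interpolation these caveats refer to (a sketch of the SMOTE idea, not the library's own implementation; the function name and toy points below are ours):

```python
import numpy as np

def smote_like_sample(x_i, x_neighbor, seed=None):
    """Create one synthetic point by interpolating between a minority-class
    sample and one of its nearest neighbors, as SMOTE does."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.0, 1.0)  # random interpolation factor in [0, 1)
    return x_i + lam * (x_neighbor - x_i)

# Hypothetical 2-D minority-class points
x_i = np.array([1.0, 2.0])
x_nb = np.array([3.0, 2.5])
synthetic = smote_like_sample(x_i, x_nb, seed=0)
# The synthetic point lies on the segment between x_i and x_nb, which is
# why overlapping classes or noisy neighbors yield questionable samples.
```

Because the new point is a convex combination of two real samples, any noise or class overlap in those samples is inherited directly, which underlies caveats 1 and 2 above.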
 Example of Synthetic Sample Creation
diff --git a/docs/_sources/getting_started.rst.txt b/docs/_sources/getting_started.rst.txt
index ce6d64c..fae4e73 100644
--- a/docs/_sources/getting_started.rst.txt
+++ b/docs/_sources/getting_started.rst.txt
@@ -69,26 +69,40 @@ which will be automatically installed when you install ``model_tuner`` using pip
   - ``scipy``: version ``1.4.1``
   - ``joblib``: version ``1.3.2``
   - ``tqdm``: version ``4.66.4``
-  - ``imbalanced-learn``: ``version 0.7.0``
-  - ``scikit-optimize``: ``version 0.8.1``
+  - ``imbalanced-learn``: version ``0.7.0``
+  - ``scikit-optimize``: version ``0.8.1``
+  - ``xgboost``: version ``1.6.2``
+  - ``pip``: version ``24.0``
 
 - For Python ``3.8`` to ``<3.11``:
 
-  - ``numpy``: versions between ``1.19.5`` and ``<1.24``
-  - ``pandas``: versions between ``1.3.5`` and ``<2.2.2``
-  - ``scikit-learn``: versions between ``1.0.2`` and ``<1.3``
+  - ``numpy``: versions between ``1.19.5`` and ``<2.0.0``
+  - ``pandas``: versions between ``1.3.5`` and ``<2.2.3``
+  - ``scikit-learn``: versions between ``1.0.2`` and ``<1.4.0``
   - ``scipy``: versions between ``1.6.3`` and ``<1.11``
+  - ``joblib``: version ``1.3.2``
+  - ``tqdm``: version ``4.66.4``
   - ``imbalanced-learn``: version ``0.12.4``
   - ``scikit-optimize``: version ``0.10.2``
-
+  - ``xgboost``: version ``2.1.2``
+  - ``pip``: version ``24.2``
+  - ``setuptools``: version ``75.1.0``
+  - ``wheel``: version ``0.44.0``
+
 - For Python ``3.11`` or higher:
 
-  - ``numpy``: version ``1.26``
-  - ``pandas``: version ``2.2.2``
+  - ``numpy``: versions between ``1.19.5`` and ``<2.0.0``
+  - ``pandas``: versions between ``1.3.5`` and ``<2.2.2``
   - ``scikit-learn``: version ``1.5.1``
   - ``scipy``: version ``1.14.0``
+  - ``joblib``: version ``1.3.2``
+  - ``tqdm``: version ``4.66.4``
   - ``imbalanced-learn``: version ``0.12.4``
   - ``scikit-optimize``: version ``0.10.2``
+  - ``xgboost``: version ``2.1.2``
+  - ``pip``: version ``24.2``
+  - ``setuptools``: version ``75.1.0``
+  - ``wheel``: version ``0.44.0``
 
 .. _installation:
diff --git a/docs/_sources/usage_guide.rst.txt b/docs/_sources/usage_guide.rst.txt
index ae9e813..bd6c75c 100644
--- a/docs/_sources/usage_guide.rst.txt
+++ b/docs/_sources/usage_guide.rst.txt
@@ -239,6 +239,7 @@ Pipeline Management
 The pipeline in the model tuner class is designed to automatically organize steps into three categories:
 **preprocessing**, **feature selection**, and **imbalanced sampling**. The steps are ordered in the following sequence:
 
 1. **Preprocessing**:
+
    - Imputation
    - Scaling
    - Other preprocessing steps
@@ -329,8 +330,8 @@ In our library, binary classification is handled seamlessly through the ``Model``
 class. Users can specify a binary classifier as the estimator, and the library
 takes care of essential tasks like data preprocessing, model calibration, and
 cross-validation. The library also provides robust support for evaluating the
-model's performance using a variety of metrics, such as accuracy, precision,
-recall, and ROC-AUC, ensuring that the model's ability to distinguish between the
+model's performance using a variety of metrics, such as :ref:`accuracy, precision,
+recall, and ROC-AUC <Limitations_of_Accuracy>`, ensuring that the model's ability to distinguish between the
 two classes is thoroughly assessed. Additionally, the library supports advanced
 techniques like imbalanced data handling and model calibration to fine-tune
 decision thresholds, making it easier to deploy effective binary classifiers in
diff --git a/docs/caveats.html b/docs/caveats.html
index 59b4b05..29de1ca 100644
--- a/docs/caveats.html
+++ b/docs/caveats.html
@@ -99,7 +99,10 @@
  • Caveats in Imbalanced Learning