Commit

added SHAP section; pending plot

Shpaner authored and Shpaner committed Nov 22, 2024
1 parent 6cfa71e commit d1c5d1c

Showing 7 changed files with 221 additions and 1 deletion.
1 change: 1 addition & 0 deletions .gitignore
@@ -117,6 +117,7 @@ ENV/
env.bak/
venv.bak/
eda/
venvpy_311

# Spyder project settings
.spyderproject
73 changes: 73 additions & 0 deletions docs/_sources/usage_guide.rst.txt
@@ -1313,6 +1313,79 @@ Return Metrics (Optional)
weighted avg 0.98 0.98 0.98 200
--------------------------------------------------------------------------------
SHAP (SHapley Additive exPlanations)
---------------------------------------

This example demonstrates how to compute and visualize SHAP (SHapley Additive exPlanations)
values for a machine learning model whose pipeline includes feature selection.
SHAP values quantify how much each individual feature contributes to a model's predictions.

**Steps**

1. The dataset is transformed through the model's feature selection pipeline to ensure only the selected features are used for SHAP analysis.

2. The final model (e.g., ``XGBoost`` classifier) is retrieved from the custom Model object. This is required because SHAP operates on the underlying model, not the pipeline.

3. SHAP's ``TreeExplainer`` is used to explain the predictions of the XGBoost classifier.

4. SHAP values are calculated for the transformed dataset to quantify the contribution of each feature to the predictions.

5. A summary plot is generated to visualize the impact of each feature across all data points.


Step 1: Transform the test data using the feature selection pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## The pipeline applies preprocessing (e.g., imputation, scaling) and feature
    ## selection (RFE) to X_test
    X_test_transformed = model_xgb.get_feature_selection_pipeline().transform(X_test)

Step 2: Retrieve the trained XGBoost classifier from the pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## The last estimator in the pipeline is the XGBoost model
    xgb_classifier = model_xgb.estimator[-1]

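To confirm that the retrieved object is the fitted booster itself rather than
the surrounding pipeline, a quick type check can help (a minimal sanity check,
assuming the final estimator is an ``XGBClassifier`` as in this example):

.. code-block:: python

    ## Quick sanity check: the last pipeline step should be the fitted
    ## XGBoost classifier, not another transformer
    from xgboost import XGBClassifier

    assert isinstance(xgb_classifier, XGBClassifier)
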
Step 3: Extract feature names from the training data, and initialize the SHAP explainer for the XGBoost classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


.. code-block:: python

    ## Import SHAP for model explainability
    import shap

    ## Feature names are required for interpretability in SHAP plots
    feature_names = X_train.columns.to_list()

    ## Initialize the SHAP explainer with the model
    explainer = shap.TreeExplainer(xgb_classifier)

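Note that ``X_train.columns`` lists every column seen before feature selection,
while ``X_test_transformed`` contains only the columns kept by the selector.
If the counts differ, the names passed to SHAP should be taken from the fitted
selector so that they line up column-for-column with ``X_test_transformed``.
A sketch of one way to do this, assuming the pipeline's final step is a
scikit-learn selector such as ``RFE`` that supports ``get_feature_names_out``:

.. code-block:: python

    ## Derive names from the fitted selector so they match the columns of
    ## X_test_transformed (assumes the last pipeline step is the selector)
    selector = model_xgb.get_feature_selection_pipeline()[-1]
    feature_names = list(selector.get_feature_names_out(X_train.columns))
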
Step 4: Compute SHAP values for the transformed test dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## Compute SHAP values for the transformed dataset
    shap_values = explainer.shap_values(X_test_transformed)

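Before plotting, it is worth confirming the result's shape: for a binary
XGBoost classifier, ``TreeExplainer.shap_values`` typically returns a single
array with one row per test sample and one column per selected feature.

.. code-block:: python

    ## The two shapes should agree: (n_samples, n_selected_features)
    print(shap_values.shape)
    print(X_test_transformed.shape)
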
Step 5: Generate a summary plot of SHAP values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## Plot SHAP values
    ## Summary plot of SHAP values for all features across all data points
    shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names)

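To save the figure to disk rather than displaying it (for example, to embed
the pending plot in these docs), ``summary_plot`` accepts ``show=False`` so
the active Matplotlib figure can be captured; the output path below is
illustrative:

.. code-block:: python

    import matplotlib.pyplot as plt

    ## Suppress interactive display so the figure can be saved
    shap.summary_plot(
        shap_values,
        X_test_transformed,
        feature_names=feature_names,
        show=False,
    )
    plt.savefig("shap_summary_plot.png", dpi=150, bbox_inches="tight")
    plt.close()
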
.. _Regression:

Regression
8 changes: 8 additions & 0 deletions docs/index.html
@@ -186,6 +186,14 @@ <h1>Model Tuner Documentation<a class="headerlink" href="#model-tuner-documentat
</li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="usage_guide.html#shap-shapley-additive-explanations">SHAP (SHapley Additive exPlanations)</a><ul>
<li class="toctree-l3"><a class="reference internal" href="usage_guide.html#step-1-transform-the-test-data-using-the-feature-selection-pipeline">Step 1: Transform the test data using the feature selection pipeline</a></li>
<li class="toctree-l3"><a class="reference internal" href="usage_guide.html#step-2-retrieve-the-trained-xgboost-classifier-from-the-pipeline">Step 2: Retrieve the trained XGBoost classifier from the pipeline</a></li>
<li class="toctree-l3"><a class="reference internal" href="usage_guide.html#step-3-extract-feature-names-from-the-training-data-and-initialize-the-shap-explainer-for-the-xgboost-classifier">Step 3: Extract feature names from the training data, and initialize the SHAP explainer for the XGBoost classifier</a></li>
<li class="toctree-l3"><a class="reference internal" href="usage_guide.html#step-4-compute-shap-values-for-the-transformed-test-dataset-and-generate-a-summary-plot-of-shap-values">Step 4: Compute SHAP values for the transformed test dataset and generate a summary plot of SHAP values</a></li>
<li class="toctree-l3"><a class="reference internal" href="usage_guide.html#step-5-generate-a-summary-plot-of-shap-values">Step 5: Generate a summary plot of SHAP values</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="usage_guide.html#regression">Regression</a><ul>
Binary file modified docs/objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/searchindex.js

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions docs/usage_guide.html
@@ -116,6 +116,14 @@
</li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#shap-shapley-additive-explanations">SHAP (SHapley Additive exPlanations)</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#step-1-transform-the-test-data-using-the-feature-selection-pipeline">Step 1: Transform the test data using the feature selection pipeline</a></li>
<li class="toctree-l3"><a class="reference internal" href="#step-2-retrieve-the-trained-xgboost-classifier-from-the-pipeline">Step 2: Retrieve the trained XGBoost classifier from the pipeline</a></li>
<li class="toctree-l3"><a class="reference internal" href="#step-3-extract-feature-names-from-the-training-data-and-initialize-the-shap-explainer-for-the-xgboost-classifier">Step 3: Extract feature names from the training data, and initialize the SHAP explainer for the XGBoost classifier</a></li>
<li class="toctree-l3"><a class="reference internal" href="#step-4-compute-shap-values-for-the-transformed-test-dataset-and-generate-a-summary-plot-of-shap-values">Step 4: Compute SHAP values for the transformed test dataset and generate a summary plot of SHAP values</a></li>
<li class="toctree-l3"><a class="reference internal" href="#step-5-generate-a-summary-plot-of-shap-values">Step 5: Generate a summary plot of SHAP values</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="#regression">Regression</a><ul>
@@ -1319,6 +1327,63 @@ <h4>Return Metrics (Optional)<a class="headerlink" href="#return-metrics-optiona
</section>
</section>
</section>
<section id="shap-shapley-additive-explanations">
<h2>SHAP (SHapley Additive exPlanations)<a class="headerlink" href="#shap-shapley-additive-explanations" title="Link to this heading"></a></h2>
<p>This example demonstrates how to compute and visualize SHAP (SHapley Additive exPlanations)
values for a machine learning model whose pipeline includes feature selection.
SHAP values quantify how much each individual feature contributes to a model’s predictions.</p>
<p><strong>Steps</strong></p>
<ol class="arabic simple">
<li><p>The dataset is transformed through the model’s feature selection pipeline to ensure only the selected features are used for SHAP analysis.</p></li>
<li><p>The final model (e.g., <code class="docutils literal notranslate"><span class="pre">XGBoost</span></code> classifier) is retrieved from the custom Model object. This is required because SHAP operates on the underlying model, not the pipeline.</p></li>
<li><p>SHAP’s <code class="docutils literal notranslate"><span class="pre">TreeExplainer</span></code> is used to explain the predictions of the XGBoost classifier.</p></li>
<li><p>SHAP values are calculated for the transformed dataset to quantify the contribution of each feature to the predictions.</p></li>
<li><p>A summary plot is generated to visualize the impact of each feature across all data points.</p></li>
</ol>
<section id="step-1-transform-the-test-data-using-the-feature-selection-pipeline">
<h3>Step 1: Transform the test data using the feature selection pipeline<a class="headerlink" href="#step-1-transform-the-test-data-using-the-feature-selection-pipeline" title="Link to this heading"></a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1">## The pipeline applies preprocessing (e.g., imputation, scaling) and feature</span>
<span class="c1">## selection (RFE) to X_test</span>
<span class="n">X_test_transformed</span> <span class="o">=</span> <span class="n">model_xgb</span><span class="o">.</span><span class="n">get_feature_selection_pipeline</span><span class="p">()</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="step-2-retrieve-the-trained-xgboost-classifier-from-the-pipeline">
<h3>Step 2: Retrieve the trained XGBoost classifier from the pipeline<a class="headerlink" href="#step-2-retrieve-the-trained-xgboost-classifier-from-the-pipeline" title="Link to this heading"></a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1">## The last estimator in the pipeline is the XGBoost model</span>
<span class="n">xgb_classifier</span> <span class="o">=</span> <span class="n">model_xgb</span><span class="o">.</span><span class="n">estimator</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</pre></div>
</div>
</section>
<section id="step-3-extract-feature-names-from-the-training-data-and-initialize-the-shap-explainer-for-the-xgboost-classifier">
<h3>Step 3: Extract feature names from the training data, and initialize the SHAP explainer for the XGBoost classifier<a class="headerlink" href="#step-3-extract-feature-names-from-the-training-data-and-initialize-the-shap-explainer-for-the-xgboost-classifier" title="Link to this heading"></a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1">## Import SHAP for model explainability</span>
<span class="kn">import</span> <span class="nn">shap</span>

<span class="c1">## Feature names are required for interpretability in SHAP plots</span>
<span class="n">feature_names</span> <span class="o">=</span> <span class="n">X_train</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">to_list</span><span class="p">()</span>

<span class="c1">## Initialize the SHAP explainer with the model</span>
<span class="n">explainer</span> <span class="o">=</span> <span class="n">shap</span><span class="o">.</span><span class="n">TreeExplainer</span><span class="p">(</span><span class="n">xgb_classifier</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="step-4-compute-shap-values-for-the-transformed-test-dataset-and-generate-a-summary-plot-of-shap-values">
<h3>Step 4: Compute SHAP values for the transformed test dataset<a class="headerlink" href="#step-4-compute-shap-values-for-the-transformed-test-dataset-and-generate-a-summary-plot-of-shap-values" title="Link to this heading"></a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1">## Compute SHAP values for the transformed dataset</span>
<span class="n">shap_values</span> <span class="o">=</span> <span class="n">explainer</span><span class="o">.</span><span class="n">shap_values</span><span class="p">(</span><span class="n">X_test_transformed</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="step-5-generate-a-summary-plot-of-shap-values">
<h3>Step 5: Generate a summary plot of SHAP values<a class="headerlink" href="#step-5-generate-a-summary-plot-of-shap-values" title="Link to this heading"></a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1">## Plot SHAP values</span>
<span class="c1">## Summary plot of SHAP values for all features across all data points</span>
<span class="n">shap</span><span class="o">.</span><span class="n">summary_plot</span><span class="p">(</span><span class="n">shap_values</span><span class="p">,</span> <span class="n">X_test_transformed</span><span class="p">,</span> <span class="n">feature_names</span><span class="o">=</span><span class="n">feature_names</span><span class="p">,)</span>
</pre></div>
</div>
</section>
</section>
</section>
<section id="regression">
<span id="id1"></span><h1>Regression<a class="headerlink" href="#regression" title="Link to this heading"></a></h1>
73 changes: 73 additions & 0 deletions source/usage_guide.rst
@@ -1313,6 +1313,79 @@ Return Metrics (Optional)
weighted avg 0.98 0.98 0.98 200
--------------------------------------------------------------------------------
SHAP (SHapley Additive exPlanations)
---------------------------------------

This example demonstrates how to compute and visualize SHAP (SHapley Additive exPlanations)
values for a machine learning model whose pipeline includes feature selection.
SHAP values quantify how much each individual feature contributes to a model's predictions.

**Steps**

1. The dataset is transformed through the model's feature selection pipeline to ensure only the selected features are used for SHAP analysis.

2. The final model (e.g., ``XGBoost`` classifier) is retrieved from the custom Model object. This is required because SHAP operates on the underlying model, not the pipeline.

3. SHAP's ``TreeExplainer`` is used to explain the predictions of the XGBoost classifier.

4. SHAP values are calculated for the transformed dataset to quantify the contribution of each feature to the predictions.

5. A summary plot is generated to visualize the impact of each feature across all data points.


Step 1: Transform the test data using the feature selection pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## The pipeline applies preprocessing (e.g., imputation, scaling) and feature
    ## selection (RFE) to X_test
    X_test_transformed = model_xgb.get_feature_selection_pipeline().transform(X_test)

Step 2: Retrieve the trained XGBoost classifier from the pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## The last estimator in the pipeline is the XGBoost model
    xgb_classifier = model_xgb.estimator[-1]

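To confirm that the retrieved object is the fitted booster itself rather than
the surrounding pipeline, a quick type check can help (a minimal sanity check,
assuming the final estimator is an ``XGBClassifier`` as in this example):

.. code-block:: python

    ## Quick sanity check: the last pipeline step should be the fitted
    ## XGBoost classifier, not another transformer
    from xgboost import XGBClassifier

    assert isinstance(xgb_classifier, XGBClassifier)
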
Step 3: Extract feature names from the training data, and initialize the SHAP explainer for the XGBoost classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


.. code-block:: python

    ## Import SHAP for model explainability
    import shap

    ## Feature names are required for interpretability in SHAP plots
    feature_names = X_train.columns.to_list()

    ## Initialize the SHAP explainer with the model
    explainer = shap.TreeExplainer(xgb_classifier)

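Note that ``X_train.columns`` lists every column seen before feature selection,
while ``X_test_transformed`` contains only the columns kept by the selector.
If the counts differ, the names passed to SHAP should be taken from the fitted
selector so that they line up column-for-column with ``X_test_transformed``.
A sketch of one way to do this, assuming the pipeline's final step is a
scikit-learn selector such as ``RFE`` that supports ``get_feature_names_out``:

.. code-block:: python

    ## Derive names from the fitted selector so they match the columns of
    ## X_test_transformed (assumes the last pipeline step is the selector)
    selector = model_xgb.get_feature_selection_pipeline()[-1]
    feature_names = list(selector.get_feature_names_out(X_train.columns))
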
Step 4: Compute SHAP values for the transformed test dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## Compute SHAP values for the transformed dataset
    shap_values = explainer.shap_values(X_test_transformed)

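Before plotting, it is worth confirming the result's shape: for a binary
XGBoost classifier, ``TreeExplainer.shap_values`` typically returns a single
array with one row per test sample and one column per selected feature.

.. code-block:: python

    ## The two shapes should agree: (n_samples, n_selected_features)
    print(shap_values.shape)
    print(X_test_transformed.shape)
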
Step 5: Generate a summary plot of SHAP values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    ## Plot SHAP values
    ## Summary plot of SHAP values for all features across all data points
    shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names)

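To save the figure to disk rather than displaying it (for example, to embed
the pending plot in these docs), ``summary_plot`` accepts ``show=False`` so
the active Matplotlib figure can be captured; the output path below is
illustrative:

.. code-block:: python

    import matplotlib.pyplot as plt

    ## Suppress interactive display so the figure can be saved
    shap.summary_plot(
        shap_values,
        X_test_transformed,
        feature_names=feature_names,
        show=False,
    )
    plt.savefig("shap_summary_plot.png", dpi=150, bbox_inches="tight")
    plt.close()
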
.. _Regression:

Regression
