Docs #75

Closed · wants to merge 12 commits into from
10 changes: 9 additions & 1 deletion .gitignore
@@ -7,4 +7,12 @@ __pycache__/
*.github/
**/*venv*/
edatkit_374/
src/
src/*.doctree
*.pickle
.doctrees
.doctrees
.doctrees
.doctrees
.doctrees
.doctrees/
*.buildinfo
2 changes: 1 addition & 1 deletion docs/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 35a32965d9c9546d22888657bd0ac24a
config: 32fbb0c52182301b536d1d7f7354f6b5
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file removed docs/.doctrees/acknowledgements.doctree
Binary file not shown.
Binary file removed docs/.doctrees/changelog.doctree
Binary file not shown.
Binary file removed docs/.doctrees/citations.doctree
Binary file not shown.
Binary file removed docs/.doctrees/contributors.doctree
Binary file not shown.
Binary file removed docs/.doctrees/data_management.doctree
Binary file not shown.
Binary file removed docs/.doctrees/eda_plots.doctree
Binary file not shown.
Binary file removed docs/.doctrees/environment.pickle
Binary file not shown.
Binary file removed docs/.doctrees/getting_started.doctree
Binary file not shown.
Binary file removed docs/.doctrees/index.doctree
Binary file not shown.
Binary file removed docs/.doctrees/references.doctree
Binary file not shown.
Binary file removed docs/.doctrees/theoretical_overview.doctree
Binary file not shown.
Binary file removed docs/.doctrees/usage_guide.doctree
Binary file not shown.
136 changes: 136 additions & 0 deletions docs/_sources/changelog.rst.txt
@@ -24,6 +24,142 @@
Changelog
=========

`Version 0.0.13`_
----------------------

.. _Version 0.0.13: https://lshpaner.github.io/eda_toolkit/v0.0.13/index.html

This version updates several functions to improve error handling, cross-environment compatibility, usability, and performance, ensuring consistent behavior in both terminal and notebook environments.

Add ``ValueError`` for Insufficient Pool Size in ``add_ids`` and Enhance ID Deduplication
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This update enhances the ``add_ids`` function by adding explicit error handling and improving the uniqueness guarantee for generated IDs.

**Key Changes**

- **New** ``ValueError`` **for Insufficient Pool Size**:

- Calculates the pool size of available IDs, :math:`9 \times 10^{d - 1}`, where :math:`d` is the number of digits (for example, :math:`d = 7` yields a pool of 9,000,000 IDs), and compares it with the number of rows in the DataFrame.
- **Behavior**:

- Raises a ``ValueError`` if ``n_rows > pool_size``.
- Prints a warning if ``n_rows`` approaches 90% of the pool size, suggesting an increase in digit length.

- **Improved ID Deduplication**:

- Introduced a set (``unique_ids``) to track generated IDs.
- IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
- Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.
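
A minimal sketch of the pool-size guard and set-based deduplication described above (a standalone, illustrative helper; the actual ``add_ids`` signature and messages may differ):

.. code-block:: python

    import random

    def generate_unique_ids(n_rows: int, d: int) -> list:
        """Illustrative helper: build n_rows unique d-digit IDs."""
        pool_size = 9 * 10 ** (d - 1)  # d-digit IDs cannot start with 0

        if n_rows > pool_size:
            raise ValueError(
                f"Cannot generate {n_rows} unique {d}-digit IDs; "
                f"the pool only holds {pool_size}."
            )
        if n_rows > 0.9 * pool_size:
            print("Warning: n_rows is close to the pool size; consider more digits.")

        unique_ids = set()  # track generated IDs to guarantee uniqueness
        while len(unique_ids) < n_rows:
            unique_ids.add(random.randint(10 ** (d - 1), 10 ** d - 1))
        return list(unique_ids)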


Enhance ``strip_trailing_period`` to Support Strings and Mixed Data Types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This update enhances the ``strip_trailing_period`` function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like ``NaN``.

**Key Enhancements**

- **Support for Strings with Trailing Periods**:

- Removes trailing periods from string values, such as ``"123."`` or ``"test."``.

- **Mixed Data Types**:

- Handles columns containing both numeric and string values seamlessly.

- **Graceful Handling of** ``NaN``:

- Skips processing for ``NaN`` values, leaving them unchanged.

- **Robust Type Conversion**:

- Converts numeric strings (e.g., ``"123."``) back to float where applicable.
- Retains strings if conversion to float is not possible.
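
A minimal sketch of the per-value logic (an illustrative standalone helper, not the exact library implementation):

.. code-block:: python

    import pandas as pd

    def strip_trailing_period_value(value):
        """Remove a trailing period from one value, preserving its type where possible."""
        if pd.isna(value):           # leave NaN untouched
            return value
        text = str(value)
        if not text.endswith("."):
            return value             # nothing to strip
        text = text[:-1]             # drop the trailing period
        try:
            return float(text)       # "123." -> 123.0
        except ValueError:
            return text              # "test." -> "test"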

Changes in ``stacked_crosstab_plot``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remove ``IPython`` Dependency by Replacing ``display`` with ``print``

This resolves an issue where the ``eda_toolkit`` library required ``IPython`` as a dependency due to the use of ``display(crosstab_df)`` in the ``stacked_crosstab_plot`` function. The dependency caused import failures in environments without ``IPython``, especially in non-Jupyter terminal-based workflows.

**Changes Made**

1. **Replaced** ``display`` with ``print``:
- The line ``display(crosstab_df)`` was replaced with ``print(crosstab_df)`` to eliminate the need for ``IPython``.

2. **Removed the** ``IPython`` **import**:
- The ``from IPython.display import display`` import statement was removed from the codebase.
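
In sketch form, the swap amounts to the following (variable name illustrative):

.. code-block:: python

    # Before: required IPython and failed to import in plain terminals
    # from IPython.display import display
    # display(crosstab_df)

    # After: no extra dependency, works in any runtime environment
    print(crosstab_df)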

**Updated Function Behavior**:

- Crosstabs are now displayed with ``print``, preserving functionality in all runtime environments.
- Only the rendering changes (plain text instead of a notebook-styled table); usability is unaffected.

**Root Cause and Fix**

The issue arose from reliance on ``IPython.display.display`` for rendering crosstab tables in Jupyter notebooks. Environments without ``IPython`` experienced a ``ModuleNotFoundError``. To address this, the ``display(crosstab_df)`` statement was replaced with ``print(crosstab_df)``.

**Testing**:

- **Jupyter Notebook**: Crosstabs are displayed as plain text via ``print()``, rendered neatly in notebook outputs.
- **Terminal Session**: Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.

Add Environment Detection to ``dataframe_columns`` Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This enhances the ``dataframe_columns`` function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal).

**Changes Made**

1. **Environment Detection**:

- Added a check to determine whether the function is running in a Jupyter Notebook or a terminal:

  .. code-block:: python

      import sys

      is_notebook_env = "ipykernel" in sys.modules

2. **Dynamic Output Behavior**:

- **Terminal Environment**:

- Returns a plain DataFrame (``result_df``) when running outside of a notebook or when ``return_df=True``.

- **Jupyter Notebook**:

- Retains the styled DataFrame functionality when running in a notebook and ``return_df=False``.

3. **Improved Compatibility**:

- The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
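
A minimal sketch of how this branching can look (hypothetical function name and summary columns; the real ``dataframe_columns`` computes a fuller summary):

.. code-block:: python

    import sys

    import pandas as pd

    def column_summary_sketch(df: pd.DataFrame, return_df: bool = False):
        """Summarize columns; style the output only when inside a notebook."""
        result_df = pd.DataFrame(
            {
                "column": df.columns,
                "dtype": [str(t) for t in df.dtypes],
                "null_total": df.isna().sum().values,
            }
        )
        is_notebook_env = "ipykernel" in sys.modules

        if not is_notebook_env or return_df:
            return result_df  # plain DataFrame for terminals or return_df=True
        return result_df.style.hide(axis="index")  # styled output for notebooks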

Add ``tqdm`` Progress Bar to ``dataframe_columns`` Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This enhances the ``dataframe_columns`` function by incorporating a ``tqdm`` progress bar to track column processing. This is particularly useful for analyzing large DataFrames, providing real-time feedback.

**Changes Made**:

- Wrapped the column processing loop with a ``tqdm`` progress bar:

.. code-block:: python

    from tqdm import tqdm

    for col in tqdm(df.columns, desc="Processing columns"):
        ...


Other Enhancements and Fixes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Improved ``save_dataframes_to_excel`` with ``tqdm`` integration.
- Fixed ``plot_2d_pdp`` plot display logic to adhere strictly to the ``plot_type`` parameter.
- Updated project dependencies and added robust environment testing.




`Version 0.0.12`_
----------------------

4 changes: 2 additions & 2 deletions docs/_sources/citations.rst.txt
@@ -24,7 +24,7 @@
Citing EDA Toolkit
===================

Shpaner, L., & Gil, O. (2024). EDA Toolkit (0.0.12). Zenodo. https://doi.org/10.5281/zenodo.13163208
Shpaner, L., & Gil, O. (2024). EDA Toolkit (0.0.13). Zenodo. https://doi.org/10.5281/zenodo.13163208

.. code:: bash

@@ -35,7 +35,7 @@ Shpaner, L., & Gil, O. (2024). EDA Toolkit (0.0.12). Zenodo. https://doi.org/10.
month = aug,
year = 2024,
publisher = {Zenodo},
version = {0.0.12},
version = {0.0.13},
doi = {10.5281/zenodo.13162633},
url = {https://doi.org/10.5281/zenodo.13162633}
}
63 changes: 32 additions & 31 deletions docs/_sources/data_management.rst.txt
@@ -566,8 +566,7 @@ Census Income Example
""""""""""""""""""""""""""""""

In the example below, we demonstrate how to use the ``dataframe_columns``
function to analyze a DataFrame's columns. You may notice a new variable,
``age_group``, is introduced. The logic for generating this variable is :ref:`provided here <Binning_Numerical_Columns>`.
function to analyze a DataFrame's columns.

.. code-block:: python

@@ -580,11 +579,13 @@ function to analyze a DataFrame's columns. You may notice a new variable,

`Result on Census Income Data (Adapted from Kohavi, 1996, UCI Machine Learning Repository)` [1]_

.. code-block:: python
.. code-block:: text

Shape: (48842, 16)
Shape: (48842, 15)

Total seconds of processing time: 0.861555
Processing columns: 100%|██████████| 15/15 [00:00<00:00, 74.38it/s]

Total seconds of processing time: 0.351102

.. raw:: html

@@ -624,7 +625,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">age</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">74</td>
<td class="tg-dvpl">36</td>
<td class="tg-dvpl">1348</td>
@@ -646,7 +647,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">fnlwgt</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">28523</td>
<td class="tg-dvpl">203488</td>
<td class="tg-dvpl">21</td>
@@ -657,7 +658,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">education</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">16</td>
<td class="tg-dvpl">HS-grad</td>
<td class="tg-dvpl">15784</td>
@@ -668,7 +669,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">education-num</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">16</td>
<td class="tg-dvpl">9</td>
<td class="tg-dvpl">15784</td>
@@ -679,7 +680,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">marital-status</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">7</td>
<td class="tg-dvpl">Married-civ-spouse</td>
<td class="tg-dvpl">22379</td>
@@ -701,7 +702,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">relationship</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">6</td>
<td class="tg-dvpl">Husband</td>
<td class="tg-dvpl">19716</td>
@@ -712,18 +713,18 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">race</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">5</td>
<td class="tg-dvpl">White</td>
<td class="tg-dvpl">41762</td>
<td class="tg-dvpl">85.5</td>
<td class="tg-dvpl">85.50</td>
</tr>
<tr>
<td class="tg-rvpl">9</td>
<td class="tg-dvpl">sex</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">2</td>
<td class="tg-dvpl">Male</td>
<td class="tg-dvpl">32650</td>
@@ -734,7 +735,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">capital-gain</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">123</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">44807</td>
@@ -745,7 +746,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">capital-loss</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">99</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">46560</td>
@@ -756,7 +757,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">hours-per-week</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">96</td>
<td class="tg-dvpl">40</td>
<td class="tg-dvpl">22803</td>
@@ -778,31 +779,19 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">income</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">4</td>
<td class="tg-dvpl">&lt;=50K</td>
<td class="tg-dvpl">24720</td>
<td class="tg-dvpl">50.61</td>
</tr>
<tr>
<td class="tg-rvpl">15</td>
<td class="tg-dvpl">age_group</td>
<td class="tg-dvpl">category</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">9</td>
<td class="tg-dvpl">18-29</td>
<td class="tg-dvpl">13920</td>
<td class="tg-dvpl">28.5</td>
</tr>
</tbody>
</table>
</div>



\


DataFrame Column Names
""""""""""""""""""""""""""""""

@@ -918,6 +907,13 @@ variables from a DataFrame containing the census data [1]_.

**Output**

.. code-block:: text

Generating combinations: 100%|██████████| 120/120 [00:01<00:00, 76.56it/s]
Writing summary tables: 100%|██████████| 120/120 [00:41<00:00, 2.87it/s]
Finalizing Excel file: 100%|██████████| 1/1 [00:00<00:00, 13706.88it/s]
Data saved to ../data_output/census_summary_tables.xlsx

.. code-block:: text

[('age_group', 'workclass'),
@@ -1139,6 +1135,11 @@ the original DataFrame and a filtered DataFrame with ages between `18` and `40`.

**Output**

.. code-block:: text

Saving DataFrames: 100%|██████████| 2/2 [00:08<00:00, 4.34s/it]
DataFrames saved to ../data/df_census.xlsx

The output Excel file will contain the original DataFrame and a filtered DataFrame as a separate tab with ages
between `18` and `40`, each on separate sheets with customized formatting.

4 changes: 2 additions & 2 deletions docs/_sources/getting_started.rst.txt
@@ -1,6 +1,6 @@
.. _getting_started:

.. KFRE Python Library Documentation documentation master file, created by
.. EDA Toolkit Python Library Documentation documentation master file, created by
sphinx-quickstart on Thu May 2 15:44:56 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
@@ -30,7 +30,7 @@
Welcome to the EDA Toolkit Python Library Documentation!
========================================================
.. note::
This documentation is for ``eda_toolkit`` version ``0.0.12``.
This documentation is for ``eda_toolkit`` version ``0.0.13``.


The ``eda_toolkit`` is a comprehensive library designed to streamline and
2 changes: 1 addition & 1 deletion docs/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '0.0.12',
VERSION: '0.0.13',
LANGUAGE: 'en',
COLLAPSE_INDEX: false,
BUILDER: 'html',