Docs #75

Closed · wants to merge 12 commits into from
10 changes: 9 additions & 1 deletion .gitignore
@@ -7,4 +7,12 @@ __pycache__/
*.github/
**/*venv*/
edatkit_374/
src/
src/*.doctree
*.pickle
.doctrees
.doctrees
.doctrees
.doctrees
.doctrees
.doctrees/
*.buildinfo
2 changes: 1 addition & 1 deletion docs/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 35a32965d9c9546d22888657bd0ac24a
config: 32fbb0c52182301b536d1d7f7354f6b5
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file removed docs/.doctrees/acknowledgements.doctree
Binary file not shown.
Binary file removed docs/.doctrees/changelog.doctree
Binary file not shown.
Binary file removed docs/.doctrees/citations.doctree
Binary file not shown.
Binary file removed docs/.doctrees/contributors.doctree
Binary file not shown.
Binary file removed docs/.doctrees/data_management.doctree
Binary file not shown.
Binary file removed docs/.doctrees/eda_plots.doctree
Binary file not shown.
Binary file removed docs/.doctrees/environment.pickle
Binary file not shown.
Binary file removed docs/.doctrees/getting_started.doctree
Binary file not shown.
Binary file removed docs/.doctrees/index.doctree
Binary file not shown.
Binary file removed docs/.doctrees/references.doctree
Binary file not shown.
Binary file removed docs/.doctrees/theoretical_overview.doctree
Binary file not shown.
Binary file removed docs/.doctrees/usage_guide.doctree
Binary file not shown.
136 changes: 136 additions & 0 deletions docs/_sources/changelog.rst.txt
@@ -24,6 +24,142 @@
Changelog
=========

`Version 0.0.13`_
----------------------

.. _Version 0.0.13: https://lshpaner.github.io/eda_toolkit/v0.0.13/index.html

This version updates several functions to improve error handling, cross-environment compatibility, usability, and performance, ensuring consistent behavior in both terminal and notebook environments.

Add ``ValueError`` for Insufficient Pool Size in ``add_ids`` and Enhance ID Deduplication
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This update enhances the ``add_ids`` function by adding explicit error handling and improving the uniqueness guarantee for generated IDs.

**Key Changes**

- **New** ``ValueError`` **for Insufficient Pool Size**:

- Calculates the pool size of available IDs, :math:`9 \times 10^{d - 1}`, where :math:`d` is the number of digits (for example, :math:`d = 7` yields a pool of 9,000,000 IDs), and compares it with the number of rows in the DataFrame.
- **Behavior**:

- Raises a ``ValueError`` if ``n_rows > pool_size``.
- Prints a warning if ``n_rows`` approaches 90% of the pool size, suggesting an increase in digit length.

- **Improved ID Deduplication**:

- Introduced a set (``unique_ids``) to track generated IDs.
- IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
- Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.
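
A minimal sketch of the pool-size guard and set-based deduplication described above (a standalone, illustrative helper; the actual ``add_ids`` signature and messages may differ):

.. code-block:: python

    import random

    def generate_unique_ids(n_rows: int, d: int) -> list:
        """Illustrative helper: build n_rows unique d-digit IDs."""
        pool_size = 9 * 10 ** (d - 1)  # d-digit IDs cannot start with 0

        if n_rows > pool_size:
            raise ValueError(
                f"Cannot generate {n_rows} unique {d}-digit IDs; "
                f"the pool only holds {pool_size}."
            )
        if n_rows > 0.9 * pool_size:
            print("Warning: n_rows is close to the pool size; consider more digits.")

        unique_ids = set()  # track generated IDs to guarantee uniqueness
        while len(unique_ids) < n_rows:
            unique_ids.add(random.randint(10 ** (d - 1), 10 ** d - 1))
        return list(unique_ids)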


Enhance ``strip_trailing_period`` to Support Strings and Mixed Data Types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This update enhances the ``strip_trailing_period`` function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like ``NaN``.

**Key Enhancements**

- **Support for Strings with Trailing Periods**:

- Removes trailing periods from string values, such as ``"123."`` or ``"test."``.

- **Mixed Data Types**:

- Handles columns containing both numeric and string values seamlessly.

- **Graceful Handling of** ``NaN``:

- Skips processing for ``NaN`` values, leaving them unchanged.

- **Robust Type Conversion**:

- Converts numeric strings (e.g., ``"123."``) back to float where applicable.
- Retains strings if conversion to float is not possible.
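
A minimal sketch of the per-value logic (an illustrative standalone helper, not the exact library implementation):

.. code-block:: python

    import pandas as pd

    def strip_trailing_period_value(value):
        """Remove a trailing period from one value, preserving its type where possible."""
        if pd.isna(value):           # leave NaN untouched
            return value
        text = str(value)
        if not text.endswith("."):
            return value             # nothing to strip
        text = text[:-1]             # drop the trailing period
        try:
            return float(text)       # "123." -> 123.0
        except ValueError:
            return text              # "test." -> "test"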

Changes in ``stacked_crosstab_plot``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remove ``IPython`` Dependency by Replacing ``display`` with ``print``

This resolves an issue where the ``eda_toolkit`` library required ``IPython`` as a dependency due to the use of ``display(crosstab_df)`` in the ``stacked_crosstab_plot`` function. The dependency caused import failures in environments without ``IPython``, especially in non-Jupyter terminal-based workflows.

**Changes Made**

1. **Replaced** ``display`` with ``print``:
- The line ``display(crosstab_df)`` was replaced with ``print(crosstab_df)`` to eliminate the need for ``IPython``.

2. **Removed the** ``IPython`` **import**:
- The ``from IPython.display import display`` import statement was removed from the codebase.
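
In sketch form, the swap amounts to the following (variable name illustrative):

.. code-block:: python

    # Before: required IPython and failed to import in plain terminals
    # from IPython.display import display
    # display(crosstab_df)

    # After: no extra dependency, works in any runtime environment
    print(crosstab_df)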

**Updated Function Behavior**:

- Crosstabs are now displayed with ``print``, preserving functionality in all runtime environments.
- Only the rendering changes (plain text instead of a notebook-styled table); usability is unaffected.

**Root Cause and Fix**

The issue arose from reliance on ``IPython.display.display`` for rendering crosstab tables in Jupyter notebooks. Environments without ``IPython`` experienced a ``ModuleNotFoundError``. To address this, the ``display(crosstab_df)`` statement was replaced with ``print(crosstab_df)``.

**Testing**:

- **Jupyter Notebook**: Crosstabs are displayed as plain text via ``print()``, rendered neatly in notebook outputs.
- **Terminal Session**: Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.

Add Environment Detection to ``dataframe_columns`` Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This enhances the ``dataframe_columns`` function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal).

**Changes Made**

1. **Environment Detection**:

- Added a check to determine whether the function is running in a Jupyter Notebook or a terminal:

  .. code-block:: python

      import sys

      is_notebook_env = "ipykernel" in sys.modules

2. **Dynamic Output Behavior**:

- **Terminal Environment**:

- Returns a plain DataFrame (``result_df``) when running outside of a notebook or when ``return_df=True``.

- **Jupyter Notebook**:

- Retains the styled DataFrame functionality when running in a notebook and ``return_df=False``.

3. **Improved Compatibility**:

- The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
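
A minimal sketch of how this branching can look (hypothetical function name and summary columns; the real ``dataframe_columns`` computes a fuller summary):

.. code-block:: python

    import sys

    import pandas as pd

    def column_summary_sketch(df: pd.DataFrame, return_df: bool = False):
        """Summarize columns; style the output only when inside a notebook."""
        result_df = pd.DataFrame(
            {
                "column": df.columns,
                "dtype": [str(t) for t in df.dtypes],
                "null_total": df.isna().sum().values,
            }
        )
        is_notebook_env = "ipykernel" in sys.modules

        if not is_notebook_env or return_df:
            return result_df  # plain DataFrame for terminals or return_df=True
        return result_df.style.hide(axis="index")  # styled output for notebooks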

Add ``tqdm`` Progress Bar to ``dataframe_columns`` Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This enhances the ``dataframe_columns`` function by incorporating a ``tqdm`` progress bar to track column processing. This is particularly useful for analyzing large DataFrames, providing real-time feedback.

**Changes Made**:

- Wrapped the column processing loop with a ``tqdm`` progress bar:

.. code-block:: python

    from tqdm import tqdm

    for col in tqdm(df.columns, desc="Processing columns"):
        ...


Other Enhancements and Fixes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Improved ``save_dataframes_to_excel`` with ``tqdm`` integration.
- Fixed ``plot_2d_pdp`` plot display logic to adhere strictly to the ``plot_type`` parameter.
- Updated project dependencies and added robust environment testing.




`Version 0.0.12`_
----------------------

4 changes: 2 additions & 2 deletions docs/_sources/citations.rst.txt
@@ -24,7 +24,7 @@
Citing EDA Toolkit
===================

Shpaner, L., & Gil, O. (2024). EDA Toolkit (0.0.12). Zenodo. https://doi.org/10.5281/zenodo.13163208
Shpaner, L., & Gil, O. (2024). EDA Toolkit (0.0.13). Zenodo. https://doi.org/10.5281/zenodo.13163208

.. code:: bash

@@ -35,7 +35,7 @@ Shpaner, L., & Gil, O. (2024). EDA Toolkit (0.0.12). Zenodo. https://doi.org/10.
month = aug,
year = 2024,
publisher = {Zenodo},
version = {0.0.12},
version = {0.0.13},
doi = {10.5281/zenodo.13162633},
url = {https://doi.org/10.5281/zenodo.13162633}
}
63 changes: 32 additions & 31 deletions docs/_sources/data_management.rst.txt
@@ -566,8 +566,7 @@ Census Income Example
""""""""""""""""""""""""""""""

In the example below, we demonstrate how to use the ``dataframe_columns``
function to analyze a DataFrame's columns. You may notice a new variable,
``age_group``, is introduced. The logic for generating this variable is :ref:`provided here <Binning_Numerical_Columns>`.
function to analyze a DataFrame's columns.

.. code-block:: python

@@ -580,11 +579,13 @@ function to analyze a DataFrame's columns. You may notice a new variable,

`Result on Census Income Data (Adapted from Kohavi, 1996, UCI Machine Learning Repository)` [1]_

.. code-block:: python
.. code-block:: text

Shape: (48842, 16)
Shape: (48842, 15)

Total seconds of processing time: 0.861555
Processing columns: 100%|██████████| 15/15 [00:00<00:00, 74.38it/s]

Total seconds of processing time: 0.351102

.. raw:: html

@@ -624,7 +625,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">age</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">74</td>
<td class="tg-dvpl">36</td>
<td class="tg-dvpl">1348</td>
@@ -646,7 +647,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">fnlwgt</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">28523</td>
<td class="tg-dvpl">203488</td>
<td class="tg-dvpl">21</td>
@@ -657,7 +658,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">education</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">16</td>
<td class="tg-dvpl">HS-grad</td>
<td class="tg-dvpl">15784</td>
@@ -668,7 +669,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">education-num</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">16</td>
<td class="tg-dvpl">9</td>
<td class="tg-dvpl">15784</td>
@@ -679,7 +680,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">marital-status</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">7</td>
<td class="tg-dvpl">Married-civ-spouse</td>
<td class="tg-dvpl">22379</td>
@@ -701,7 +702,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">relationship</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">6</td>
<td class="tg-dvpl">Husband</td>
<td class="tg-dvpl">19716</td>
@@ -712,18 +713,18 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">race</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">5</td>
<td class="tg-dvpl">White</td>
<td class="tg-dvpl">41762</td>
<td class="tg-dvpl">85.5</td>
<td class="tg-dvpl">85.50</td>
</tr>
<tr>
<td class="tg-rvpl">9</td>
<td class="tg-dvpl">sex</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">2</td>
<td class="tg-dvpl">Male</td>
<td class="tg-dvpl">32650</td>
@@ -734,7 +735,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">capital-gain</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">123</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">44807</td>
@@ -745,7 +746,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">capital-loss</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">99</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">46560</td>
@@ -756,7 +757,7 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">hours-per-week</td>
<td class="tg-dvpl">int64</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">96</td>
<td class="tg-dvpl">40</td>
<td class="tg-dvpl">22803</td>
@@ -778,31 +779,19 @@ function to analyze a DataFrame's columns. You may notice a new variable,
<td class="tg-dvpl">income</td>
<td class="tg-dvpl">object</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0.00</td>
<td class="tg-dvpl">4</td>
<td class="tg-dvpl">&lt;=50K</td>
<td class="tg-dvpl">24720</td>
<td class="tg-dvpl">50.61</td>
</tr>
<tr>
<td class="tg-rvpl">15</td>
<td class="tg-dvpl">age_group</td>
<td class="tg-dvpl">category</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">0</td>
<td class="tg-dvpl">9</td>
<td class="tg-dvpl">18-29</td>
<td class="tg-dvpl">13920</td>
<td class="tg-dvpl">28.5</td>
</tr>
</tbody>
</table>
</div>



\


DataFrame Column Names
""""""""""""""""""""""""""""""

@@ -918,6 +907,13 @@ variables from a DataFrame containing the census data [1]_.

**Output**

.. code-block:: text

Generating combinations: 100%|██████████| 120/120 [00:01<00:00, 76.56it/s]
Writing summary tables: 100%|██████████| 120/120 [00:41<00:00, 2.87it/s]
Finalizing Excel file: 100%|██████████| 1/1 [00:00<00:00, 13706.88it/s]
Data saved to ../data_output/census_summary_tables.xlsx

.. code-block:: text

[('age_group', 'workclass'),
@@ -1139,6 +1135,11 @@ the original DataFrame and a filtered DataFrame with ages between `18` and `40`.

**Output**

.. code-block:: text

Saving DataFrames: 100%|██████████| 2/2 [00:08<00:00, 4.34s/it]
DataFrames saved to ../data/df_census.xlsx

The output Excel file will contain the original DataFrame and a filtered DataFrame as a separate tab with ages
between `18` and `40`, each on separate sheets with customized formatting.

4 changes: 2 additions & 2 deletions docs/_sources/getting_started.rst.txt
@@ -1,6 +1,6 @@
.. _getting_started:

.. KFRE Python Library Documentation documentation master file, created by
.. EDA Toolkit Python Library Documentation documentation master file, created by
sphinx-quickstart on Thu May 2 15:44:56 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
@@ -30,7 +30,7 @@
Welcome to the EDA Toolkit Python Library Documentation!
========================================================
.. note::
This documentation is for ``eda_toolkit`` version ``0.0.12``.
This documentation is for ``eda_toolkit`` version ``0.0.13``.


The ``eda_toolkit`` is a comprehensive library designed to streamline and
2 changes: 1 addition & 1 deletion docs/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '0.0.12',
VERSION: '0.0.13',
LANGUAGE: 'en',
COLLAPSE_INDEX: false,
BUILDER: 'html',