This Streamlit application provides an interactive interface for data preprocessing tasks including:
- Importing data
- Handling missing values
- Detecting and removing outliers
- Feature selection/removal
- Encoding categorical variables
The application is designed to assist users in cleaning and preparing datasets for analysis or machine learning.
- Data Import: Upload CSV files and view dataset dimensions and content.
- Handling Null Values: Various methods to handle missing values, including:
  - Dropping rows or columns
  - Replacing missing values with the mean or median for numeric columns and the most frequent value for categorical columns
- Handling Outliers: Methods to detect and remove outliers using:
  - Interquartile Range (IQR)
  - Z-Score
- Feature Selection/Removal: Option to select and remove irrelevant columns.
- Encoding: Convert categorical values using:
  - Label Encoding
  - One-Hot Encoding
Ensure you have Python installed. You can download it from python.org.
You will need the following Python packages:
- streamlit
- scikit-learn
- pandas
- numpy
- matplotlib
You can install these packages using pip:
pip install streamlit scikit-learn pandas numpy matplotlib
- Run the Application: Navigate to the directory where your app.py file is located and run: streamlit run app.py
- Upload Data: Go to the "Data Import" tab and use the file uploader to upload your CSV file.
- Handle Missing Values: In the "Handling Null Values" tab, choose a method to handle missing values and view the transformed dataset.
- Detect and Remove Outliers: In the "Handling Outliers" tab, select an outlier detection method (IQR or Z-Score) to identify and optionally remove outliers.
- Feature Selection/Removal: In the "Feature Selection/Removal" tab, choose columns to remove from the dataset and view the updated dataset.
- Encode Categorical Variables: In the "Encoding" tab, choose an encoding method (Label Encoding or One-Hot Encoding) and see the transformed dataset.
- Drop Rows: Removes rows with any missing values.
- Drop Columns: Removes columns with any missing values.
- Replace Numeric Values with Mean: Replaces missing numeric values with the column mean and missing non-numeric values with the column's most frequent value.
- Replace Numeric Values with Median: Replaces missing numeric values with the column median and missing non-numeric values with the column's most frequent value.
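The four options above can be sketched in pandas roughly as follows. This is an illustration, not the application's actual code: the function name `handle_missing` and the method strings are hypothetical, and the sample DataFrame is made up.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, method: str) -> pd.DataFrame:
    """Return a new DataFrame with missing values handled.

    `method` is a hypothetical selector: "drop_rows", "drop_columns",
    "mean", or "median".
    """
    if method == "drop_rows":
        return df.dropna(axis=0)  # drop rows containing any missing value
    if method == "drop_columns":
        return df.dropna(axis=1)  # drop columns containing any missing value
    if method not in ("mean", "median"):
        raise ValueError(f"unknown method: {method}")

    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].mean() if method == "mean" else out[col].median()
        else:
            modes = out[col].mode(dropna=True)
            if modes.empty:
                continue  # column is entirely missing; leave it unchanged
            fill = modes.iloc[0]  # most frequent value (first one on ties)
        out[col] = out[col].fillna(fill)
    return out


# Made-up sample data for illustration.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Oslo", "Oslo", None],
})
filled = handle_missing(df, "mean")
print(filled)
```

In this sample, the missing `age` is filled with the mean of the two known values, and the missing `city` is filled with the most frequent city.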
- IQR Method:
  - Computes quartiles (Q1, Q3) and IQR.
  - Identifies outliers as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
  - Removes outliers based on these bounds.
- Z-Score Method:
  - Computes the mean and standard deviation of each numeric column.
  - Calculates Z-Scores for each value.
  - Identifies outliers as values with Z-Scores beyond a specified threshold (e.g., ±3).
  - Removes outliers based on these Z-Scores.
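Both methods can be sketched with pandas as below. The function names, the sample data, and the choice of population standard deviation (`ddof=0`) are assumptions made for illustration; the application itself may compute these differently.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any numeric value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Missing values are not treated as outliers here (an assumption).
    inside = (num.ge(lower) & num.le(upper)) | num.isna()
    return df[inside.all(axis=1)]


def remove_outliers_zscore(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows where any numeric value has |Z-Score| above the threshold."""
    num = df.select_dtypes(include="number")
    # ddof=0 (population standard deviation) is an assumption.
    z = (num - num.mean()) / num.std(ddof=0)
    # NaN Z-Scores (missing values or constant columns) are kept.
    inside = z.abs().le(threshold) | z.isna()
    return df[inside.all(axis=1)]


# Made-up sample: one extreme value in an otherwise tight column.
df = pd.DataFrame({
    "value": [10, 11, 12, 13, 14, 100],
    "label": ["a", "b", "c", "d", "e", "f"],
})
print(remove_outliers_iqr(df))
print(remove_outliers_zscore(df, threshold=2.0))
```

On this six-row sample the value 100 has a Z-Score of about 2.23, so a threshold of 3 would keep it while the IQR rule removes it. On small datasets the two methods can disagree, which is worth keeping in mind when picking a threshold.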
- Label Encoding: Converts categorical values to integers.
- One-Hot Encoding: Converts categorical values into binary columns, with each column representing a unique category.
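To make the difference concrete, here is a minimal sketch of both encodings using pandas only. The column names and data are made up, and the application may rely on scikit-learn's `LabelEncoder` rather than the pandas category codes used here.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 4],
})

# Label encoding: each category becomes an integer code.
# Categories are sorted, so blue -> 0, green -> 1, red -> 2.
label_encoded = df.copy()
label_encoded["color"] = label_encoded["color"].astype("category").cat.codes

# One-hot encoding: one binary column per unique category.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

print(label_encoded)
print(one_hot)
```

Label encoding keeps a single column but implies an ordering between categories. One-hot encoding avoids that implied ordering at the cost of one extra column per category.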
Here’s a quick example of how to use the application:
- Upload your dataset.
- Handle any missing values by choosing an appropriate method.
- Detect and handle outliers using either the IQR or Z-Score method.
- Select and remove any irrelevant features.
- Encode categorical variables as needed.
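Outside the UI, the same five steps could look roughly like this in plain pandas. The CSV content and column names are invented for the sketch, and each step uses one of several options the application offers (drop rows, IQR, One-Hot Encoding).

```python
import io

import pandas as pd

# Made-up CSV standing in for an uploaded file.
csv_text = """id,age,city
1,20,Oslo
2,,Oslo
3,22,Bergen
4,21,Bergen
5,23,Oslo
6,95,Bergen
"""

# 1. Upload: read the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Missing values: drop rows with any missing value.
df = df.dropna(axis=0)

# 3. Outliers: keep rows whose age is within the IQR bounds.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 4. Feature removal: drop an identifier column that carries no signal.
df = df.drop(columns=["id"])

# 5. Encoding: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"], dtype=int)

print(df)
```

The order matters in this sketch: missing values are handled before the quartiles are computed, and encoding comes last so that the outlier step only sees the original numeric column.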
Feel free to contribute by submitting issues or pull requests. For major changes, please open an issue to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.