My Awesome Data Science Assistant (Madsa)

MADSA is a conversational app that allows users to perform data science tasks in natural language. With Madsa, one can:

Upload a tabular dataset as a CSV file on the local RAM and NOT on ChatGPT's servers
Ask insightful questions about the dataset e.g. What was the survivor rate for each gender in the titanic?
Generate plots by prompting in natural language e.g. Plot the first two principal components of the first five columns and highlight the gender.
Train machine learning models and/or explore model parameters e.g. Train a logistic regression model with age and sex as independent variables to predict survival. Which parameter contributed the most?

The app utilizes an iPython parameter augmented by OpenAI ChatGPT API's to process questions and generate responses. Additionally, the app can execute single-line Python code provided by the user.

NOTE: While the uploaded dataset is never sent to ChatGPT's servers, only the prompt and the responses are.

Repository structure

madsa_app.py: The main Streamlit application file.
app_utils.py: Utility functions to support the Streamlit app.
chatgpt_api_utils.py: Utility functions to interact with the OpenAI ChatGPT API.
system_prompt.py: Defines the system prompt for the ChatGPT API.
requirements.txt: Lists the required Python packages to run the app.
test_datasets: A folder containing sample datasets to test the app.
R&D: An old folder containing research and development code.

How to run the application

Prerequisites

An OpenAI API key for using the ChatGPT API. Click here to know more.
Conda package manager

Installation

Clone this repo

git clone git@github.com:Nilzkool/ds_assistant.git
cd ds_assistant

Create a Conda environment and activate it

conda create --name madsa_env --file requirements.txt
conda activate madsa_env

Set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your-api-key"  # Linux/Mac
set OPENAI_API_KEY="your-api-key"  # Windows

Running the application

After setting up the environment and installing the required packages, run the app using the following command

python -m streamlit run madsa_app.py

Usage

Upload a CSV file using the file uploader in the app.
Enter your Python statement or ask a question in the text input field. Press Enter to submit your input.
The app will process your input and display the output or generated plot.

Tips and tricks

Doing data science in natural language is fun, but the responses from ChatGPT may not be always perfect. Here are a few tips to get higher-quality responses

Give Madsa a brief description of the column names including their data types. e.g.

Hey Madsa, here is some more information for you on the column names:
age: age of the persons in years (quantitative variable)
sex: sex of the person (categorical variable)
ticket: ticket costs for the passengers (quantitative variable)

Make the prompt concrete and specific e.g.

Instead of

I was wondering if females had a better survivor rate than men.

consider

Report True or False if females had a better survivor rate than men

Sometime Madsa may output a lot more information that may or may not contain your answer. In such cases, you should nudge Madsa in a follow-up prompt to report the correct answer
Madsa's system is designed to use rudimentary libraries only like pandas, numpy, scikit-learn and matplotlib. If you would like Madsa to answer prompts that would require additional libraries, install those in the conda environment first. Then in the prompt, you can specify to Madsa to use this library e.g.

Plot a histogram of passenger age. Use the package Seaborn

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

My Awesome Data Science Assistant (Madsa)

Repository structure

How to run the application

Prerequisites

Installation

Running the application

Usage

Tips and tricks

License

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
r&D		r&D
test_datasets		test_datasets
README.md		README.md
app_utils.py		app_utils.py
chatgpt_api_utils.py		chatgpt_api_utils.py
madsa_app.py		madsa_app.py
requirements.txt		requirements.txt
system_prompt.py		system_prompt.py

Nilzkool/ds_assistant

Folders and files

Latest commit

History

Repository files navigation

My Awesome Data Science Assistant (Madsa)

Repository structure

How to run the application

Prerequisites

Installation

Running the application

Usage

Tips and tricks

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages