MADSA is a conversational app that allows users to perform data science tasks in natural language. With Madsa, one can:
- Upload a tabular dataset as a CSV file; it is held in local RAM and NOT on ChatGPT's servers
- Ask insightful questions about the dataset, e.g. What was the survival rate for each gender on the Titanic?
- Generate plots by prompting in natural language e.g. Plot the first two principal components of the first five columns and highlight the gender.
- Train machine learning models and/or explore model parameters e.g. Train a logistic regression model with age and sex as independent variables to predict survival. Which parameter contributed the most?
The app uses an IPython interpreter augmented by the OpenAI ChatGPT API to process questions and generate responses. Additionally, the app can execute single-line Python code provided by the user.
NOTE: The uploaded dataset is never sent to ChatGPT's servers; only the prompts and the responses are.
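The division of labour described above can be sketched roughly as follows. This is a minimal illustration, not Madsa's actual code: `fake_model` is a hypothetical stand-in for the ChatGPT API call, `answer` is an invented helper name, and the DataFrame is toy data. The point it shows is that only the prompt text reaches the model, while the returned code runs locally against the data.

```python
import contextlib
import io

import pandas as pd


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the ChatGPT API call; returns Python code as text."""
    # A real model would generate this from the prompt; here it is hard-coded.
    return "print(df.groupby('sex')['survived'].mean().to_string())"


def answer(prompt: str, df: pd.DataFrame, model=fake_model) -> str:
    """Send only the prompt to the model, then run the returned code locally."""
    code = model(prompt)  # the DataFrame itself is never passed to the model
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Executed in-process, with the local DataFrame exposed as `df`.
        exec(code, {"df": df, "pd": pd})
    return buffer.getvalue().strip()


# Toy data for illustration only (not the real Titanic dataset).
df = pd.DataFrame(
    {"sex": ["female", "female", "male", "male"], "survived": [1, 1, 0, 1]}
)
result = answer("What was the survival rate for each gender?", df)
print(result)
```

Because the returned code is run with `exec`, it has the same access to the machine as the app itself, which is worth keeping in mind when prompting.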
- madsa_app.py: The main Streamlit application file.
- app_utils.py: Utility functions to support the Streamlit app.
- chatgpt_api_utils.py: Utility functions to interact with the OpenAI ChatGPT API.
- system_prompt.py: Defines the system prompt for the ChatGPT API.
- requirements.txt: Lists the required Python packages to run the app.
- test_datasets: A folder containing sample datasets to test the app.
- R&D: An old folder containing research and development code.
- An OpenAI API key for using the ChatGPT API
- Conda package manager
- Clone this repo
git clone git@github.com:Nilzkool/ds_assistant.git
cd ds_assistant
- Create a Conda environment and activate it
conda create --name madsa_env --file requirements.txt
conda activate madsa_env
- Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your-api-key" # Linux/Mac
REM Windows Command Prompt (no quotes: cmd would store them as part of the value)
set OPENAI_API_KEY=your-api-key
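To confirm that the key is visible to Python before launching the app, a small check like the one below can help. It is only a sketch: `get_api_key` is a hypothetical helper, not a function from this repo. It also strips stray double quotes, which the Windows `set` command stores literally when the value is quoted.

```python
import os


def get_api_key(env=os.environ) -> str:
    """Return the OpenAI API key from the environment or raise a clear error."""
    # Tolerate surrounding whitespace and literal double quotes around the value.
    key = env.get("OPENAI_API_KEY", "").strip().strip('"')
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; set it in this shell before starting the app."
        )
    return key


if __name__ == "__main__":
    try:
        get_api_key()
        print("OPENAI_API_KEY found.")
    except RuntimeError as err:
        print(err)
```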
After setting up the environment and installing the required packages, run the app with the following command:
python -m streamlit run madsa_app.py
- Upload a CSV file using the file uploader in the app.
- Enter your Python statement or ask a question in the text input field. Press Enter to submit your input.
- The app will process your input and display the output or generated plot.
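Since the same text field accepts both Python statements and questions, the input has to be told apart somehow. How Madsa does this is not documented here; one plausible approach, shown purely as an assumption, is to check whether the text parses as Python. `looks_like_python` is a hypothetical name.

```python
import ast


def looks_like_python(text: str) -> bool:
    """Return True if the text parses as Python source, False otherwise."""
    try:
        ast.parse(text)
    except (SyntaxError, ValueError):
        return False
    return True


# Caveat: a single bare word such as "hello" is also valid Python (a name),
# so a real app would need extra rules on top of this check.
print(looks_like_python("df.head()"))
print(looks_like_python("What was the survival rate for each gender?"))
```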
Doing data science in natural language is fun, but the responses from ChatGPT may not always be perfect. Here are a few tips to get higher-quality responses:
- Give Madsa a brief description of the columns, including their data types, e.g.
Hey Madsa, here is some more information for you on the column names:
age: age of the persons in years (quantitative variable)
sex: sex of the person (categorical variable)
ticket: ticket costs for the passengers (quantitative variable)
- Make the prompt concrete and specific, e.g.
Instead of
I was wondering if females had a better survival rate than men.
consider
Report True or False if females had a better survival rate than men.
- Sometimes Madsa may output much more information than you asked for, and it may or may not contain your answer. In such cases, nudge Madsa in a follow-up prompt to report the specific answer you need.
- Madsa's system prompt is designed to use only a basic set of libraries: pandas, NumPy, scikit-learn and Matplotlib. If you would like Madsa to answer prompts that require additional libraries, install them in the Conda environment first. Then tell Madsa in the prompt to use that library, e.g.
Plot a histogram of passenger age. Use the package Seaborn
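The first tip above (describing the columns) can be partly automated. The sketch below drafts the preamble from a DataFrame's dtypes; `describe_columns` is a hypothetical helper and the data is a toy example. It only guesses the variable type, so the human-readable meaning of each column still has to be added by hand.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_columns(df: pd.DataFrame) -> str:
    """Draft a prompt preamble listing each column with a coarse variable type."""
    lines = ["Hey Madsa, here is some more information for you on the column names:"]
    for name in df.columns:
        # Numeric dtypes are treated as quantitative, everything else as categorical.
        # Integer-coded categories will be misclassified and need manual correction.
        kind = "quantitative" if is_numeric_dtype(df[name]) else "categorical"
        lines.append(f"{name}: {kind} variable")
    return "\n".join(lines)


# Toy data for illustration only.
df = pd.DataFrame({"age": [22, 38], "sex": ["male", "female"]})
prompt = describe_columns(df)
print(prompt)
```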