An assignment for DSC510 at GCU that focused on analyzing measures of central tendency and variability in real-world datasets, as well as identifying and dealing with potential sources of bias.
I recorded a video going over my findings; check it out here.
This assignment aims to develop students' skills in data analysis and statistical methods using Python. By working with real-world datasets, students will learn how to identify and handle outliers, deal with missing data, analyze measures of central tendency and variability, and create meaningful visualizations. Additionally, students will gain an understanding of potential sources of bias and confounding variables, further enhancing their critical thinking and decision-making abilities in data analysis.
Through a combination of hands-on Python coding and thoughtful interpretation of results, this assignment fosters a comprehensive understanding of data analysis techniques and their practical applications in different scenarios.
Tasks:
- Outlier Identification and Handling: In this task, you will work with a real-world dataset to identify and handle outliers. Choose a dataset (e.g., from Kaggle, UCI Machine Learning Repository) from the list of "Repositories for Finding Suitable Datasets," located in Class Resources, that exhibits outliers or extreme values. Write a Python script that identifies and handles the outliers using at least two methods (e.g., z-score, interquartile range). Use visualization techniques to demonstrate the impact of the outliers on measures of central tendency and variability.
- Bias and Confounding Variables Identification: Identify potential sources of bias or confounding variables in the dataset selected in Task 1 above and discuss how they might impact the analysis.
- Handling Missing Data: Develop and justify an appropriate statistical method to handle missing data in the dataset selected in Task 1.
- Analysis of Mean and Median Values: In this task, you will analyze a dataset to understand the difference between mean and median values. Choose a dataset from the list of "Repositories for Finding Suitable Datasets," located in Class Resources, where the mean and median values differ significantly. Write a Python script to calculate and visualize the mean and median values of the dataset. Interpret the results and provide insights into what the difference means for the dataset. Propose solutions to handle this discrepancy and implement them using Python. Compare and contrast the effectiveness of four different measures of central tendency and variability in capturing the characteristics of the data.
- Data Visualization: In this task, you will use Python to create visualizations that effectively communicate data distribution. Choose a dataset from the list of "Repositories for Finding Suitable Datasets," located in Class Resources, and create basic plots to visualize the data distribution (e.g., histogram, boxplot). Analyze the plots to gain insights into the data distribution and interpret the results.
- Measures of Central Tendency and Variability: In this task, you will calculate and interpret measures of central tendency and variability using Python. Choose a dataset from the list of "Repositories for Finding Suitable Datasets," located in Class Resources, and write a Python script to calculate the mean, median, mode, range, variance, and standard deviation of the dataset. Interpret the results and discuss how the measures of central tendency and variability relate to the data distribution.
- Data Cleaning: In this task, you will use Python to clean a dataset and prepare it for analysis. Choose a messy dataset (e.g., missing values, inconsistent formatting) from the list of "Repositories for Finding Suitable Datasets," located in Class Resources, and write a Python script to clean the dataset. Use appropriate methods to handle missing values, remove duplicates, and convert data types. Visualize the cleaned dataset to demonstrate the impact of the cleaning process.
- Group Analysis: In this task, you will use Python to conduct group analysis on a dataset. Choose a dataset from the list of "Repositories for Finding Suitable Datasets," located in Class Resources, and write a Python script to group the data by a categorical variable (e.g., gender, age group). Calculate measures of central tendency and variability for each group and visualize the results using appropriate plots. Interpret the results and discuss any differences between the groups.
Requirements:
- Jupyter notebook containing all Python code used in the tasks.
- Include written responses to each task prompt, using markdown cells in the same Jupyter notebook.
- Include visual aids to support your answers.
- Record a short video (4–5 minutes) explaining your work and highlighting key findings.
- Provide the link to the video in your submission. Use an online video platform such as Loom, YouTube, or Vimeo to upload your completed video.
Deliverables:
Submit a Jupyter notebook containing all Python code, written responses, and visual aids. Include the link to your video recording. While APA style is not required for the body of this assignment, solid writing reflecting industry-appropriate standards is expected. Remember to reference all datasets and resources according to APA formatting guidelines, which can be found in the APA Style Guide, located in the Student Success Center.
This assignment uses a rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.
You are not required to submit this assignment to LopesWrite.
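As a rough illustration of the outlier task described above, the two suggested methods (z-score and interquartile range) can be sketched with nothing but Python's standard `statistics` module. This is a minimal sketch, not the notebook's actual code: the sample values in `data`, the helper names `zscore_outliers`, `iqr_outliers`, and `summarize`, and the cutoffs (3 standard deviations, 1.5 × IQR) are all assumptions chosen for illustration.

```python
import statistics as st


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds `threshold`."""
    mean = st.fmean(values)
    sd = st.pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / sd) > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = st.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def summarize(values):
    """Measures of central tendency and variability named in the tasks."""
    return {
        "mean": st.fmean(values),
        "median": st.median(values),
        "mode": st.mode(values),
        "range": max(values) - min(values),
        "variance": st.variance(values),  # sample variance
        "stdev": st.stdev(values),        # sample standard deviation
    }


# Hypothetical sample with one extreme value (not from any real dataset).
data = [12, 13, 13, 14, 15, 15, 15, 16, 17, 18, 95]

flagged = set(zscore_outliers(data)) | set(iqr_outliers(data))
cleaned = [v for v in data if v not in flagged]

before, after = summarize(data), summarize(cleaned)
for name in before:
    print(f"{name:>8}: {before[name]:10.3f} -> {after[name]:10.3f}")
```

On this sample both methods flag the same extreme value. Dropping it pulls the mean down sharply while the median stays put, which is the kind of contrast the mean-versus-median task asks you to interpret. On a real dataset the two methods often disagree, and whether to drop, cap, or keep flagged values is a judgment call that should be justified in the written responses.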