What Drives the Price of a Car?

Introduction

In this application, we will delve into a dataset of pre-owned cars sourced from Kaggle. The objective of this endeavor is to discern the factors influencing the pricing of cars, ultimately providing recommendations to our client, a used-car dealership, regarding consumer preferences in pre-owned vehicles.

The project builds several machine learning regression models and assesses their performance. Using Lasso Regression, Ridge Regression, Decision Tree Regressor, and Gradient Boosting Regressor, we compare the outcomes to identify the most effective model. We also use feature importance analysis to examine which features have the greatest impact on car prices.

Dataset

The dataset utilized in this project is derived from Kaggle, which can be accessed here. The dataset is extensive, comprising over 446 thousand records of used cars across the US.

Exploratory data analysis:

The dataset consists of 18 columns, including the target column 'price', which gives the price of the used car. Many outliers were observed in the initial analysis. Null values were observed in every column except 'id', 'region', 'price' and 'state'.

Feature	Description
id	ID of the used car
region	Region where the car is present
price	Price of the used car
year	Year of Manufacture
manufacturer	Name of the Manufacturer
model	Model of the Manufacturer
condition	Known condition of the car
cylinders	Number of cylinders in the engine
fuel	Fuel Type of the car
odometer	Current Odometer reading
title_status	Title status of the car
transmission	Type of Transmission of the car
VIN	Vehicle Identification Number
drive	Drive Type of the car
size	Size of the Car
type	Type of the car
paint_color	Exterior paint color of the car
state	State where the car is registered
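The null-value check described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the helper name `summarize_missing` and the tiny sample frame are hypothetical stand-ins for the real dataset.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)


# Hypothetical sample standing in for the real used-car data
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [5000, 12000, 7500, 30000],
    "year": [2010.0, None, 2015.0, 2018.0],
    "condition": ["good", None, None, "excellent"],
})

missing = summarize_missing(sample)
print(missing)
```

Sorting by null count makes it easy to see which columns need imputation first and which (such as 'id' and 'price' here) are complete.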

The following data cleaning operations were performed:

Filling missing values in the 'year' column with the mode (most frequent) value from that column.
Filling missing values in the 'condition' column with the mode value from that column.
Filling missing values in the 'cylinders' column with the mode value from that column.
Filling missing values in the 'fuel' column with the mode value from that column.
Filling missing values in the 'title_status' column with the mode (most frequent) value from that column.
Filling missing values in the 'transmission' column with the mode value from that column.
Filling missing values in the 'drive' column with the mode value from that column.
Filling missing values in the 'size' column with the mode value from that column.
Filling missing values in the 'type' column with the mode value from that column.
Filling missing values in the 'paint_color' column with the mode value from that column.
Filling missing values in the 'odometer' column with the median value from that column.
Filling missing values in the 'type' column again with the mode value from that column.
Filling missing values in the 'manufacturer' column with the string 'unknown'.
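The imputation steps listed above could be expressed roughly as follows. This is a sketch under assumptions: the function name `impute_missing` and the sample frame are hypothetical, and the notebook may implement the same steps column by column rather than in a loop.

```python
import pandas as pd

# Columns filled with their most frequent value, per the list above
MODE_COLUMNS = [
    "year", "condition", "cylinders", "fuel", "title_status",
    "transmission", "drive", "size", "type", "paint_color",
]


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls: mode for categorical-like columns, median for odometer,
    and the literal string 'unknown' for manufacturer."""
    out = df.copy()
    for col in MODE_COLUMNS:
        if col in out.columns:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # skip columns that are entirely null
                out[col] = out[col].fillna(modes.iloc[0])
    if "odometer" in out.columns:
        out["odometer"] = out["odometer"].fillna(out["odometer"].median())
    if "manufacturer" in out.columns:
        out["manufacturer"] = out["manufacturer"].fillna("unknown")
    return out


# Hypothetical sample with a gap in each kind of column
sample = pd.DataFrame({
    "year": [2010.0, 2010.0, None, 2018.0],
    "fuel": ["gas", None, "gas", "diesel"],
    "odometer": [50000.0, None, 90000.0, 120000.0],
    "manufacturer": ["ford", None, "toyota", "honda"],
})

cleaned = impute_missing(sample)
print(cleaned)
```

The median is used for 'odometer' because it is less sensitive than the mean to the outliers noted in the exploratory analysis.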

Data Visualization:

By vehicle type, the dataset is dominated by SUVs and sedans. The cylinder counts present in the dataset are 3, 4, 5, 6, 8, 10, and 12. Within the SUV and sedan categories, 6-cylinder vehicles were the most common, whereas trucks had the highest percentage of 8-cylinder vehicles.

The majority of vehicles had automatic transmissions, followed by the 'other' category, which may include unknown transmissions or a blend of automatic and manual.

Among all manufacturers, Chevrolet, Toyota, Ford, and Honda had the highest number of cars. Several other manufacturers had a minimal number of cars.
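A breakdown such as the share of cylinder counts within each vehicle type can be computed with a normalized cross-tabulation. The sketch below uses a small hypothetical frame; the value formats in the real 'type' and 'cylinders' columns may differ.

```python
import pandas as pd

# Hypothetical sample; real column values may be formatted differently
sample = pd.DataFrame({
    "type": ["SUV", "SUV", "SUV", "sedan", "sedan", "sedan",
             "truck", "truck", "truck"],
    "cylinders": [6, 6, 4, 6, 6, 4, 8, 8, 6],
})

# Row-normalized crosstab: each row sums to 1 (share within vehicle type)
share = pd.crosstab(sample["type"], sample["cylinders"], normalize="index")

# Most common cylinder count for each vehicle type
top_cylinders = share.idxmax(axis=1)
print(share)
print(top_cylinders)
```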

Models Evaluation and Comparison

For the regression analysis, Lasso, Ridge, Decision Tree Regressor, and Gradient Boosting Regressor models were evaluated. GridSearchCV was applied to every model to tune hyperparameters with cross-validation. Because of computational constraints stemming from the dataset's complexity, exhaustive hyperparameter tuning was not feasible. Even so, the Decision Tree Regressor achieved the lowest test mean squared error (MSE) of the models compared, 0.29. Its much lower training MSE (0.014) indicates that the tree overfits the training data.
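The tuning setup can be sketched with scikit-learn as shown here. The parameter key 'rdg__alpha' reported in the results suggests a Pipeline with a step named 'rdg'; the scaling step, the candidate alpha values, the fold count, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the prepared car features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step name "rdg" matches the 'rdg__alpha' key; the scaler is an assumption
pipe = Pipeline([("scale", StandardScaler()), ("rdg", Ridge())])
param_grid = {"rdg__alpha": [0.02, 0.2, 2.0]}  # hypothetical candidate values

search = GridSearchCV(
    pipe, param_grid, scoring="neg_mean_squared_error", cv=5
)
search.fit(X_train, y_train)

# Compare train and test MSE to spot overfitting
train_mse = mean_squared_error(y_train, search.predict(X_train))
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(search.best_params_, train_mse, test_mse)
```

Reporting both train and test MSE, as in the summary table, makes it possible to see when a model fits the training data far better than unseen data.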

Below is a summary table presenting the results of the different ML regressor models:

Model	Train MSE	Test MSE
Lasso Regression	0.39	0.38
Lasso Regression (alpha: 0.02, max_iter: 100)	0.38	0.38
Ridge Regression (rdg__alpha: 0.02)	0.39	0.38
Decision Tree Regressor (ccp_alpha: 0.0)	0.014	0.29
Gradient Boosting Regressor (learning_rate: 0.025)	0.36	0.36

Finally, feature importance was examined for the Decision Tree Regressor model.

Odometer, Year, Transmission (Other), Car Size (Compact), and Paint Color (White) appear to be the top features affecting the price of a car.
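A ranking of this kind can be read from a fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical columns; in the real project the features would be the cleaned and encoded car attributes, and the tree's settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: price depends strongly on odometer, less on year,
# and not at all on the "noise" column (all values are hypothetical)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "odometer": rng.uniform(0, 200_000, n),
    "year": rng.integers(2000, 2022, n),
    "noise": rng.normal(size=n),
})
y = (
    30_000
    - 0.1 * X["odometer"]
    + 500 * (X["year"] - 2000)
    + rng.normal(scale=500, size=n)
)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Impurity-based importances, labeled by column and sorted descending
importances = pd.Series(
    tree.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 and describe how much each feature reduced the tree's error on the training data, so they are best read as a ranking rather than as a causal effect on price.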

Summary of Findings

In conclusion, data quality significantly impacts the accuracy of predictions. Despite encountering outliers, we successfully cleaned the data, resulting in a more precise pricing model. To further enhance accuracy in the future, cleaner data would be beneficial. Additionally, our analysis revealed that odometer reading, car year, transmission type, compact car size, and white paint color are significant features influencing car prices. However, some outliers remain in the predictions. As a next step, efforts should focus on correcting the dataset at its source to address missing data more accurately. Subsequently, we can explore expanding the model with more advanced deep learning networks to evaluate their predictive accuracy.

Notebook

Used Car Pricing
