
Kaggle Challenge: Predictions of the sales prices for houses in Ames, City in Iowa.


richengo/Ames-Housing-Price-Predictions


Ames Housing Price Predictions

Riche Ngo

Problem Statement

Investors and homeowners based in the city of Ames are looking for professionals to aid them in getting the best possible deals when buying/selling housing properties. As part of a real estate agency in Ames looking to boost annual sales, we want to develop the best regression model to predict housing prices and provide expert advice to investors and homeowners on which housing components and features would greatly influence housing prices.

Executive Summary

This project is part of a Kaggle challenge and explores the Ames housing data which contains the values for individual residential properties sold in Ames from 2006 to 2010. The dataset also includes information on many housing components and features corresponding to each sale.

Exploratory data analysis revealed strong correlations between the sale price and several housing components. Using Pearson's correlation coefficient, we separated the more important features from the less important ones. We also found strong correlations among some of the independent variables, which would cause multicollinearity if they were used together in the regression model. To address this, such variables were either combined or reduced to the one with the higher correlation with the target variable. Many of the features were altered to be more interpretable, and categorical data were one-hot encoded so that the regression model could use them.
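The screening step described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the feature names, the toy values, and the 0.8 threshold are all assumptions made for the example. In practice the same correlations are usually obtained from a pandas DataFrame.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_collinear(features, target, threshold=0.8):
    """For each feature pair with |r| above the threshold, keep only the
    feature that correlates more strongly with the target."""
    kept = dict(features)
    names = list(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in kept or b not in kept:
                continue
            if abs(pearson(features[a], features[b])) > threshold:
                weaker = min((a, b), key=lambda name: abs(pearson(features[name], target)))
                del kept[weaker]
    return kept


# Hypothetical toy data (not taken from the Ames dataset).
features = {
    "garage_cars": [1, 2, 2, 3, 2, 1],
    "garage_area": [250, 480, 440, 700, 500, 280],
    "lot_frontage": [60, 80, 50, 70, 90, 75],
}
sale_price = [120000, 185000, 150000, 260000, 215000, 110000]

kept = drop_collinear(features, sale_price)
print(sorted(kept))
```

Here the two garage features are almost perfectly correlated with each other, so only the one more strongly correlated with the sale price survives.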

For feature engineering, we applied a log transformation to the target variable to obtain a more homoscedastic model, as observed in residual plots before and after the transformation. This improved the model's predictive performance. In contrast, scatterplots and correlation coefficients showed that applying the log transformation to all the other variables was not necessarily beneficial.
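A small sketch of why a log-transformed target tends to give more even residuals: an error of the same relative size produces dollar residuals that grow with price, but identical residuals on the log scale. The prices and the 10% error are made-up numbers, and the natural logarithm is an assumption, since the write-up does not say which base was used.

```python
import math

# Hypothetical prices and predictions that are each 10% too high.
prices = [100_000, 200_000, 400_000]
predictions = [p * 1.10 for p in prices]

# Residuals in dollars fan out as the price grows (heteroscedastic).
dollar_residuals = [pred - p for pred, p in zip(predictions, prices)]

# Residuals on the log scale are the same size at every price level.
log_residuals = [math.log(pred) - math.log(p) for pred, p in zip(predictions, prices)]


def to_dollars(log_prediction):
    """Back-transform a prediction made on the log scale into dollars."""
    return math.exp(log_prediction)


print(dollar_residuals)
print(log_residuals)
```

A model trained on the log target predicts log prices, so predictions need to pass through `to_dollars` before an RMSE in dollars is computed.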

Lasso regression was used as a quick feature selection tool, leaving 20 features for model development. Model selection was done by comparing estimates of the R2 and RMSE scores from 5-fold cross-validation. We ended up with a Lasso regression model as the production model, with an R2 score of 0.86 and an RMSE of 22841 on the test (hold-out) data. Since the hold-out method was also adopted, we could compare scores between the train and test (hold-out) data, and found slight overfitting. The difference was small enough to be acceptable.
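The selection and comparison workflow above might look roughly like this with scikit-learn. It is a sketch on synthetic data, not the project's notebook: the data, the `alpha` value, and the two candidate models are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three of ten columns drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Step 1: Lasso as a feature selector. Features whose coefficient is
# shrunk exactly to zero are discarded.
selector = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
coefficients = selector.named_steps["lasso"].coef_
selected = np.flatnonzero(coefficients != 0)

# Step 2: compare candidate models on the selected features using
# 5-fold cross-validated R2 and RMSE.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {"linear": LinearRegression(), "lasso": Lasso(alpha=0.1)}
scores = {}
for name, model in candidates.items():
    result = cross_validate(
        make_pipeline(StandardScaler(), model),
        X[:, selected],
        y,
        cv=cv,
        scoring=("r2", "neg_root_mean_squared_error"),
    )
    scores[name] = {
        "r2": result["test_r2"].mean(),
        "rmse": -result["test_neg_root_mean_squared_error"].mean(),
    }

print(selected, scores)
```

Selecting features on the full training data before cross-validating can make the cross-validated scores slightly optimistic, which is one reason a separate hold-out set remains useful.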

The top features which influenced the price predictions the most were found to be the total area of the house, overall quality, neighborhood region of house location, interior finish of the house's garage, and the total number of rooms. Surprisingly, the house age and type of dwelling were not found to be in these top features, likely due to the seasonality of the market and other correlated features.

Data Sources

For this project, the datasets can be downloaded from the Kaggle challenge website.

There are three files:

  • train.csv - this data contains all of the training data for your model.
  • test.csv - this data contains the test data for your model. You will feed this data into your regression model to make predictions. The target variable (SalePrice) is removed from the test set!
  • sample_sub_reg.csv - An example of a correctly formatted submission for this challenge (with a random number provided as predictions for SalePrice).

Data Dictionary

The data dictionary/description can be found in the following link.

Conclusions and Recommendations

Through this project, we gained many useful insights for people in Ames looking to invest in housing or increase the values of their existing houses. Utilizing the Ames Housing Dataset, we were able to develop a model based on Lasso Regression for the prediction of sale price for a house in Ames. In the model development process, we managed to learn the different features that are important for predicting the sale price, both positively and negatively.

For investors and homeowners, we recommend placing more importance on the following:

  1. The combined total area of the house, comprising the basement, garage, and living area above ground. As the total area of the house increases, the sale price of the house increases significantly. This is especially important for investors looking to modify or remodel houses before selling them. Floor area is one of the largest driving factors of housing prices.
  2. The overall quality of the material, finish, amenities, and exterior of the house. Buyers will almost always prefer a move-in-ready house that does not require much refurbishing or renovation. Maintaining a high overall quality will spur demand for the house and fetch a much higher price. Homeowners can consider spending some money to touch up their houses before putting them up for sale. Investors can consider buying older houses and renovating them to improve the overall quality, possibly gaining a larger profit margin that offsets the amount spent on renovations.
  3. The neighborhood where the house is located plays a part in its sale price. Houses in the Hayward Ave neighborhood are relatively more expensive because they are centrally located, easily accessible, and within walking distance of several parks and recreation venues (e.g. Franklin Park, Ames ISU Ice Arena, Jack Trice Stadium). Most importantly, houses in Hayward Ave are a very short distance from Iowa State University, the largest university in the state of Iowa, and have easy access to many eateries and restaurants in CampusTown. Conversely, houses in the S Duff Ave/E Lincoln Way region are relatively cheaper, likely because S Duff Ave is part of Route 69, a major north–south United States highway. The region appears intended to serve as a pit stop for travellers driving through, judging from the number of fast-food outlets along the road (e.g. McDonald's, Arby's, Taco Bell, Wendy's). There are also inns and motor lodges, as well as multiple car repair services, in the region. Houses in the area are therefore expected to be relatively cheaper, as the region caters to travellers. Investors and homeowners should always study the surroundings of a house to make a more informed decision on sales. Houses in regions that are easily accessible and have many useful facilities nearby are in higher demand and can fetch higher prices.
  4. The interior finish of the house's garage is important since it is almost always the first thing people see when they drive to the house. Having a finished garage interior will likely form a stronger first impression on buyers, increasing the chances of making the sale at a higher price. Investors and homeowners should start paying attention to the finishing of the garage interior.
  5. The total number of bedrooms and bathrooms above ground has a positive relationship with housing prices. More bedrooms and bathrooms mean that more people or larger families can be accommodated, increasing demand for the houses and therefore prices. Homeowners can consider whether under-utilised areas of their houses could be converted into rooms. Investors who are looking to rent out rooms can also view the rooms as potential units of revenue. Although they may be investing more money in the whole house, having more rooms to rent out also means faster revenue generation.

The regression model we developed had a high R2 score of 0.87 and an RMSE of 22841 (dollars) on the test (hold-out) set. When compared to the baseline RMSE, there was a significant 87% decrease in error. We also found that a log transformation of the target variable helped to create a more homoscedastic model, in which the variance of the residuals is more uniform. Although we are not using the model for statistical inference due to violations of the linear regression assumptions (LINE: linearity, independence, normality, equal variance), correcting for a violated assumption almost always improves predictive performance. This is also why we dropped or combined many features that were highly correlated with one another during the EDA process: doing so reduces multicollinearity and helps satisfy the assumption of independent predictors in multiple linear regression models.
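To make the baseline comparison concrete, the sketch below computes an RMSE for a model and for a naive baseline, then the relative reduction in error. It assumes the baseline always predicts the mean training price, which is a common choice but is not confirmed by this write-up, and all prices shown are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices and predictions, for illustration only.
train_prices = [120000, 180000, 240000]
test_prices = [150000, 200000, 250000]
model_predictions = [155000, 190000, 260000]

# Assumed baseline: predict the mean training price for every house.
baseline_prediction = sum(train_prices) / len(train_prices)
baseline_rmse = rmse(test_prices, [baseline_prediction] * len(test_prices))
model_rmse = rmse(test_prices, model_predictions)

# Fraction by which the model reduces the baseline error.
reduction = 1 - model_rmse / baseline_rmse
print(round(baseline_rmse), round(model_rmse), round(reduction, 2))
```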

Although we have developed a good model in the span of this project, we cannot generalize it to other cities using the same features, mainly because the neighborhood is an important factor and is currently specific to Ames. However, it is possible to tweak this feature to make it more general. For example, we could replace it with one that represents the common facilities, schools, or recreation venues within a set distance of the house. Such a feature is most likely highly correlated with the current "neighborhood" feature and could be applied to other cities as well.
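One possible shape for such a city-independent feature is a count of facilities within a fixed radius of each house, sketched below with the haversine great-circle distance. The coordinates, the 1 km radius, and the function names are hypothetical, and the Ames dataset used here does not include house coordinates, so this only illustrates the idea.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def facilities_within(house, facilities, radius_km=1.0):
    """Count the facilities no farther than radius_km from the house."""
    return sum(1 for f in facilities if haversine_km(*house, *f) <= radius_km)


# Hypothetical coordinates, not real locations of houses or venues.
house = (42.030, -93.620)
facilities = [
    (42.035, -93.620),  # roughly 0.6 km away
    (42.030, -93.610),  # roughly 0.8 km away
    (42.100, -93.620),  # several km away
]

count = facilities_within(house, facilities)
print(count)
```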

We also found that although the age of the house is commonly considered a major factor in housing prices, it was not one of the top 5 features that greatly influence the sale price of houses in Ames. This is likely because we did not take into account the seasonality of the market, which affects housing supply and demand (source). A further expansion of this project could be to break down the data by season to study how house age affected sale prices in each season. However, we would probably require a larger dataset collected over a longer period of time for the prediction model to be more robust.

Another housing component that did not make it into the top 5 features of the model was the type of dwelling. Each type of dwelling is a different market by itself. This may create a problem in a multiple linear regression model, as other housing components may affect housing prices to a different degree depending on the type of dwelling. For example, a 3-bedroom condo unit may cost as much as a 2-bedroom single-family unit (source). To explore this further, it may be good to isolate the data for each dwelling type and study the relationships with sale price.

Lastly, a useful study could be done using the demographics of the people who bought houses in Ames. Information such as the age, income, and regional preferences of actual or potential buyers, what percentage of buyers are retirees, and what percentage might buy a vacation or second home could be useful. Such information is known to affect the real estate market (source). Building a model around such information would also make it more generalizable to other cities.
