Mini-Project DADS5001: Data Analytics and Data Science Tools and Programming
Project by Witsarut Wongsim, DADS2, 6420422017
Big Data with Real Estate in Thailand
This mini-project utilizes data visualization tools to uncover insights from a real estate dataset in Thailand. The project explores price distributions, property type variations across regions, and the potential influence of land prices on project value.
Note (updated Nov 2022): app.py now adds a map of Thailand as the background.
Usage: download the CSV file and place it in the same folder as app.py.
Command:
python app.py
The data used in this project were retrieved from:
1. Bannai.com, via a Facebook post:
https://web.facebook.com/dataholicth/photos/a.110167148353487/145014661535402/
The data source can be accessed through this Google Sheet:
source: https://gobestimate.com/data?fbclid=IwAR1VAJP5mLxHPr4ia8BZBpqMd790CAmUPU-lmLQKzHmiJOgMBmWXCSSOLeo
copy: https://docs.google.com/spreadsheets/d/1zEu6Lrk7LTGL3ukbJR7UwZf8f04hSeuOrIvdc2ClWyw/edit?fbclid=IwAR1IcEkPwaC_W-Gl3WXwANVj7T-Eo8bx9a8m6L4o-G7GqJvf7cRO1O2dRI0#gid=722411706
2. Land price appraisal for the Bangkok area, from Bangkokgis: http://www.bangkokgis.com/modules.php?m=download_shapefile
Shapefile geographic information data (map scale 1:20,000)
A shapefile is a type of geographic information data stored in vector form in three shapes: point, line, and closed shape (polygon). Each shape type is stored as a separate layer.
A shapefile consists of at least three files that reference one another, and none of them can be missing. The .shp file holds the vector data of each type, with each vector referenced to UTM coordinates. The .dbf file holds the data as a database table that describes each vector. The .shx file links the .shp and .dbf files together.
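Because a shapefile is only usable when its .shp, .shx, and .dbf files sit side by side, it can help to check for them before loading. The following is a minimal standard-library sketch; the helper name `missing_shapefile_parts` and the file name `district.shp` are hypothetical and not taken from the Bangkokgis download, and the check assumes lowercase file extensions.

```python
import tempfile
from pathlib import Path

# The three mandatory companion files of a shapefile.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the required suffixes that do not exist next to shp_path."""
    base = Path(shp_path)
    return [suffix for suffix in REQUIRED_SUFFIXES
            if not base.with_suffix(suffix).exists()]


# Demo in a throwaway folder: create only .shp and .dbf, leaving .shx out.
with tempfile.TemporaryDirectory() as folder:
    for suffix in (".shp", ".dbf"):
        (Path(folder) / f"district{suffix}").touch()
    demo_missing = missing_shapefile_parts(Path(folder) / "district.shp")

print(demo_missing)  # ['.shx'] -> the index file is reported as missing
```

An empty list means all three files are present and the shapefile can be passed to a reader.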
The dataset contains information about 23,604 real estate projects across Thailand. It includes details such as price, property type, district, province, latitude and longitude, and facilities offered.
RangeIndex: 23604 entries, 0 to 23603
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 row_number 23599 non-null float64
1 project_id 23604 non-null object
2 name_en 23604 non-null object
3 name_th 23604 non-null object
4 propertytype_id 23604 non-null object
5 propertytype_name_en 23604 non-null object
6 propertytype_name_th 23604 non-null object
7 price_min 23489 non-null object
8 developer_id 23604 non-null object
9 developer_name_en 10896 non-null object
10 developer_name_th 14837 non-null object
11 latitude 23604 non-null float64
12 longitude 23599 non-null float64
13 neighborhood_id 19373 non-null object
14 neighborhood_name_en 19374 non-null object
15 neighborhood_name_th 19368 non-null object
16 subdistrict_id 23584 non-null float64
17 subdistrict_name_en 23589 non-null object
18 subdistrict_name_th 23587 non-null object
19 district_id 23594 non-null float64
20 district_name_en 23594 non-null object
21 district_name_th 23594 non-null object
22 province_id 23595 non-null float64
23 province_name_en 23594 non-null object
24 province_name_th 23594 non-null object
25 zipcode 23571 non-null float64
26 count_elevator 1896 non-null object
27 count_elevator_service 611 non-null object
28 count_floor 4727 non-null object
29 count_parking 2014 non-null object
30 count_tower 5 non-null object
31 count_unit 21685 non-null float64
32 count_unittype 18597 non-null float64
33 facility_clubhouse 6796 non-null float64
34 facility_fitness 8912 non-null float64
35 facility_meeting 2779 non-null float64
36 facility_park 11744 non-null float64
37 facility_playground 6218 non-null float64
38 facility_pool 9740 non-null float64
39 facility_security 16261 non-null float64
40 date_created 23594 non-null object
41 date_finish 20884 non-null object
42 date_updated 23594 non-null object
43 source 23594 non-null object
44 url_project 23594 non-null object
dtypes: float64(16), object(29)
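The non-null counts in the listing above show how sparse some columns are. As a small worked example, the sketch below turns a few of those counts into missing-value percentages. The counts are copied from the listing; the dictionary and function names are illustrative only.

```python
TOTAL_ROWS = 23604  # RangeIndex size reported in the listing

# Non-null counts for a few columns, copied from the listing.
non_null_counts = {
    "price_min": 23489,
    "developer_name_en": 10896,
    "count_tower": 5,
    "facility_pool": 9740,
}


def missing_percent(non_null, total=TOTAL_ROWS):
    """Share of rows with no value, as a percentage."""
    return 100.0 * (total - non_null) / total


missing = {column: missing_percent(count)
           for column, count in non_null_counts.items()}

# Print the columns from most complete to least complete.
for column, pct in sorted(missing.items(), key=lambda item: item[1]):
    print(f"{column:<20} {pct:6.2f}% missing")
```

The listing also shows price_min with dtype object, so it would need converting to a numeric type before any price statistics are computed.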
Libraries and Installation
This project uses several Python libraries for data manipulation, geospatial analysis, and visualization. Install the required libraries as follows:
pip install Shapely
pip install geopandas
pip install joypy
# Notebook shell commands: download and unpack the TH Sarabun Chula font so Thai labels render in matplotlib
!wget -q http://www.arts.chula.ac.th/ling/wp-content/uploads/TH-Sarabun_Chula1.1.zip -O font.zip
!unzip -qj font.zip TH-Sarabun_Chula1.1/THSarabunChula-Regular.ttf
# !pip install -U --pre matplotlib
import matplotlib as mpl
import matplotlib.font_manager  # ensure mpl.font_manager is loaded before use
# Register the downloaded font and make it the default font family
mpl.font_manager.fontManager.addfont('THSarabunChula-Regular.ttf')
mpl.rc('font', family='TH Sarabun Chula')
# Data manipulation
import pandas as pd
import numpy as np
# Visualization
import plotly.express as px
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
# Geospatial analysis
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame
from geopy.geocoders import Nominatim
This section explores interesting questions and insights derived from the data analysis:
1. Price Distribution: Does the real estate price follow a normal distribution?
Ans. No. The analysis revealed a right-skewed distribution for property prices, with a longer tail towards higher values. Outliers were also identified at the high end of the price range.
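One quick way to check for right skew without plotting is to compare the mean with the median, compute a skewness coefficient, and count values above the common 1.5 × IQR upper fence. The sketch below uses made-up prices (in million baht), not values from the dataset, and all variable names are illustrative.

```python
import statistics

# Hypothetical prices in million baht, with one very expensive project.
prices_mb = [1.2, 1.5, 1.8, 2.0, 2.3, 2.9, 3.5, 4.8, 7.5, 45.0]

mean_price = statistics.mean(prices_mb)
median_price = statistics.median(prices_mb)

# Population skewness: mean cubed deviation divided by the cubed std. dev.
std_dev = statistics.pstdev(prices_mb)
skewness = (sum((p - mean_price) ** 3 for p in prices_mb)
            / len(prices_mb) / std_dev ** 3)

# Upper outlier fence from the interquartile range (IQR).
q1, _, q3 = statistics.quantiles(prices_mb, n=4)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = [p for p in prices_mb if p > upper_fence]

# A mean well above the median is a rough sign of a long right tail.
right_skewed = mean_price > median_price and skewness > 0

print(f"mean={mean_price:.2f} median={median_price:.2f} skew={skewness:.2f}")
print("high-end outliers:", outliers)
```

A positive skewness together with a mean well above the median points to a right-skewed distribution, which is why the median is the safer summary for prices like these.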
2. Price Variation by Property Type: Is there a price difference between house and condo types across regions?
Ans. In Bangkok, detached houses are generally the most expensive, followed by townhouses, semi-detached houses, and condominiums.
However, the price differences between property types were less pronounced in other provinces such as Rayong.
In Nakhon Ratchasima and Chonburi provinces, condo prices are quite high compared with other types of real estate.
3. Top Ranked Provinces by Median and Mean Price: Are expensive properties concentrated in specific areas?
The initial analysis suggested that Kanchanaburi had the highest median price, but further investigation revealed a data error.
In reality, Nakhon Ratchasima (#7) and Chonburi (#8) emerged among the provinces with the highest median and mean condo prices.
Kanchanaburi, initially ranked first, has only one project, listed at 9.9 million. Searching for more information showed that the actual price is 990,000.
After the record was corrected, the ranking of the leaders changed: Kanchanaburi dropped to last place and Prachuap rose to number 1.
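The Kanchanaburi case shows how a single mistyped price can top a ranking when a province has only one project. A minimal sketch of that effect follows. Apart from the 9.9 million and 990,000 figures quoted above, all prices are invented for illustration, and the `rank_by_median` helper and the minimum-count threshold are assumptions rather than the project's actual code.

```python
from statistics import median


def rank_by_median(prices_by_province):
    """Return province names ordered from highest to lowest median price."""
    medians = {province: median(prices)
               for province, prices in prices_by_province.items()}
    return sorted(medians, key=medians.get, reverse=True)


prices = {
    "Kanchanaburi": [9_900_000],                      # single project, as recorded
    "Prachuap": [4_200_000, 5_100_000, 6_300_000],    # hypothetical prices
    "Chonburi": [2_800_000, 3_900_000, 4_500_000],    # hypothetical prices
}

before = rank_by_median(prices)

# Apply the correction: the real price is 990,000, not 9.9 million.
prices["Kanchanaburi"] = [990_000]
after = rank_by_median(prices)

# Provinces with very few projects deserve a manual check before ranking.
MIN_PROJECTS = 3  # arbitrary threshold for this sketch
needs_review = [p for p, values in prices.items() if len(values) < MIN_PROJECTS]

print("before:", before)
print("after: ", after)
print("review:", needs_review)
```

Flagging small groups before ranking makes this kind of error easier to catch, since a median over one record is just that record.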
4. Prices of detached houses, twin houses, and townhomes: Is Bangkok the most expensive?
Detached houses:
Phang Nga ranks first, with a median of 49.5 million baht.
Krabi ranks second.
Bangkok ranks third.
Twin houses:
Bangkok ranks first, as hypothesized.
Townhomes:
Roi Et ranks first.
Phayao ranks second.
Bangkok ranks third.
5. Do high land prices lead to high-value projects?
Answer: Not always. The detached house projects considered are in Khlong Toei Nuea, Khlong Tan Nuea, and Khlong Tan, which are not the areas with the most expensive land.
Analysis: Where land prices are high, it may not be worthwhile to build detached houses, which require a large area. Condo units may yield more profit on the same land.
This is because the cost of land is a significant factor in the overall cost of building a detached house. When land prices are high, it may be more difficult to make a profit on detached house projects.
Condo units, on the other hand, typically have a smaller footprint, so the cost of land is a less significant factor in the overall cost of the project. Condo units are also often in high demand, which can lead to higher profits.
However, when considering condominiums, the projects are located in:
Chakkraphet, in the Samphanthawong district.
Dusit district.
Lumpini, in the Pathumwan district.
All of these are in the group with the highest land valuation.
This suggests that land prices may play a more significant role in determining the value of condominium projects than of detached house projects. This may be because condominiums are typically located in more central areas, where land prices are higher. Condominiums are also often built on smaller parcels of land, which makes the cost of land a more significant factor in the overall cost of the project.
6. Can a demographic data visualization help answer more of our questions?
Demographic data visualization is a powerful tool for seeing patterns and trends that would be difficult to identify in a table or spreadsheet. By using visual elements such as color, size, and position, a map can be both informative and easy to understand.
Components of the Map Data Visualization
- Base map: The base map provides the geographical context for the data. The example map in this project appears to use a simple world map, but more detailed base maps, such as street maps or political maps, can be used depending on the data.
- Data points: The data points represent the specific pieces of data being visualized. In the example map, they are circles (bubbles) in various colors and sizes.
- Color: Color can represent different categories of data. In the example map, the color of a bubble likely represents the property type (e.g., condo, house).
- Size: Size can represent the magnitude of a data point. In the example map, the size of a bubble likely represents the property price (a larger bubble indicates a higher price).
Benefits of Map Data Visualization
- Identify patterns and trends: A map can reveal patterns that would be difficult to identify in a table or spreadsheet, for example that more expensive properties are located in certain areas of a city.
- Communicate insights: Because maps are visual, they are easy to understand for people with and without a data analysis background.
- Make data more engaging: A map makes data more interesting to look at, which is especially helpful when communicating complex data to a non-technical audience.
Condos in Chonburi are located next to the sea, which may explain why their prices are quite high compared with the prices of the various types of houses.
The HTML file can be downloaded from this GitHub repository: https://github.com/Hakulani/miniprojectDADS5001/blob/main/file.html
- Data Acquisition and Selection
  - Difficulty finding suitable datasets within a limited timeframe.
  - Uncertainty about what could be discovered in the chosen dataset.
- Health Issues
  - Exhaustion from overlapping commitments during the exam period and from contracting COVID-19.
  - Reduced cognitive function and limited time for analysis due to illness.
- Map Vector Creation
  - Steep learning curve for map creation, including understanding the libraries and sourcing vector data for Bangkok.
  - Time-consuming but ultimately rewarding, as it provides a holistic view of the data (e.g., land price data is incomprehensible without mapping).
- Missing Values
  - Over 50% of the data was lost during initial map creation because rows were dropped.
  - Implemented a left join to preserve critical data.
- Disparate Bubble Sizes on Map
  - Employed min-max scaling to address the vastly different bubble sizes.
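The last two fixes in the list above can be sketched in plain Python. The first part keeps every project row when attaching land prices (left-join behaviour) instead of dropping unmatched rows; the second rescales prices to a bounded bubble size with min-max scaling. The records, the `land_price` lookup values, and the 5 to 40 size range are hypothetical. In pandas, the same join is usually written with `DataFrame.merge(..., how="left")`.

```python
# Hypothetical project records and land-price lookup (values are invented).
projects = [
    {"name": "A", "subdistrict": "Lumpini", "price": 12_000_000},
    {"name": "B", "subdistrict": "Khlong Tan", "price": 45_000_000},
    {"name": "C", "subdistrict": "Unknown", "price": 3_000_000},
]
land_price = {"Lumpini": 1_000_000, "Khlong Tan": 400_000}

# Left join: every project is kept; unmatched land prices become None.
joined = [{**row, "land_price": land_price.get(row["subdistrict"])}
          for row in projects]

# Inner join, for comparison: unmatched projects silently disappear.
inner = [row for row in joined if row["land_price"] is not None]


def min_max_scale(values, low=5.0, high=40.0):
    """Linearly rescale values so the smallest maps to low, the largest to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:  # avoid division by zero when all values are equal
        return [(low + high) / 2 for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


bubble_sizes = min_max_scale([row["price"] for row in joined])

print(len(joined), "rows after left join,", len(inner), "after inner join")
print(bubble_sizes)  # [12.5, 40.0, 5.0]
```

Bounding the sizes keeps the cheapest project visible and stops the most expensive one from covering the map, while preserving the order of the prices.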
- Data Exploration and Preparation
  - Importance of exploring and understanding the data before analysis.
  - Careful data preparation to ensure accuracy and reliability of results.
- Time Management and Prioritization
  - Effective time management and prioritization of tasks to mitigate the impact of unexpected challenges.
  - Balancing academic commitments with personal health and well-being.
- Learning New Skills
  - Willingness to learn new skills and technologies to address project requirements.
  - Continuous learning and adaptation to improve analytical capabilities.
- Data Visualization
  - Importance of data visualization in enhancing data comprehension and communicating insights effectively.
  - Exploration of different visualization techniques to suit the specific data and audience.
- Problem Solving and Troubleshooting
  - Ability to identify and address data-related issues and challenges.
  - Adoption of a proactive approach to problem-solving and troubleshooting.
- NumPy https://numpy.org/doc/stable/user/quickstart.html https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
- pandas https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html https://pandastutor.com/
- 3.1 MatPlotLib https://matplotlib.org/stable/tutorials/index.html
- 3.2 GeoPandas Mapping and Plotting Tools https://geopandas.org/en/stable/docs/user_guide/mapping.html https://miro.medium.com/max/1100/1*wkFt03GrqlMHOc1ZiI12jw.png