---
title: "Predicting a Building's Energy Efficiency"
author: "Jarred Priester"
date: "2/5/2022"
output: pdf_document
---
1. Introduction
+ 1.1 Overview of the problem
+ 1.2 description of the dataset
+ 1.3 goal of the project
+ 1.4 plan of action
2. Data Cleaning
+ 2.1 downloading the data
+ 2.2 cleaning NAs
+ 2.3 cleaning blank observations
+ 2.4 scaling the data set
3. Data Visualization
4. Models
+ 4.1 train and test sets
+ 4.2 linear regression
+ 4.3 ridge regression
+ 4.4 random forest
+ 4.5 ensemble
5. Results
+ 5.1 table of results
+ 5.2 plot of results
+ 5.3 brief thoughts on results
6. Conclusion
+ 6.1 summary
+ 6.2 limitations
+ 6.3 next steps
# 1. Introduction
## 1.1 overview of the problem
With record high and low temperatures across the globe, it is becoming increasingly important to be efficient when it comes to heating and cooling our buildings. Whether you are trying to reduce the cost of your energy bill or trying to reduce your carbon footprint, improving the energy efficiency of your building can both save you money and help the environment. We will be looking at a data set that can help us with both!
## 1.2 description of the data set
The data set we will be using is from the University of California, Irvine (UCI) Machine Learning Repository.
The following is UCI's information on the data set:
*Source:*
*The dataset was created by Angeliki Xifara (Civil/Structural Engineer) and was processed by Athanasios Tsanas (Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK).*
*Data Set Information:*
*We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the afore-mentioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer.*
*Attribute Information:*
*The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.*
*Specifically:*
*X1 Relative Compactness*
*X2 Surface Area*
*X3 Wall Area*
*X4 Roof Area*
*X5 Overall Height*
*X6 Orientation*
*X7 Glazing Area*
*X8 Glazing Area Distribution*
*y1 Heating Load*
*y2 Cooling Load*
https://archive.ics.uci.edu/ml/datasets/Energy+efficiency
## 1.3 goal of the project
The goal of this project is to create multiple regression models that predict both the heating load and the cooling load, and then select the best performing model.
## 1.4 plan of action
We will download and clean the data, then use visualization tools to get a better understanding of the data that we are working with. Then we will create the following regression models: *linear regression*, *ridge regression*, *random forest*, and an *ensemble*. The models will be evaluated using the Root Mean Squared Error (RMSE):
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^2}$$
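As a quick sketch in R, the calculation looks like the following (the `RMSE()` function from caret, which we use later on, computes the same quantity):
```{r}
#a minimal sketch of the RMSE calculation
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}
```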
Finally, we will create a results table of the models and evaluate the results.
# 2. Data Cleaning
## 2.1 downloading the data
```{r, results='hide',warning=FALSE,message=FALSE}
#loading libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(caret)
library(elasticnet)
library(knitr)
library(matrixStats)
#loading the data
energy <- read.csv("../Building_Energy_Efficiency/ENB2012_data.csv")
```
First, let's take a quick look at the data
```{r}
head(energy)
summary(energy)
class(energy)
str(energy)
names(energy)
```
It is a bit confusing without the feature names, so we will rename the columns to match the data set description.
```{r}
colnames(energy) <- c('Relative_Compactness',
                      'Surface_Area',
                      'Wall_Area',
                      'Roof_Area',
                      'Overall_Height',
                      'Orientation',
                      'Glazing_Area',
                      'Glazing_Area_Distribution',
                      'Heating_Load',
                      'Cooling_Load')
```
## 2.2 cleaning NAs
finding NAs in the data
```{r}
colSums(is.na(energy))
```
We do not have any missing data in this data set, which is unusual for real-world data, but we'll take it.
## 2.3 cleaning blank observations
Now we will check whether there are any blank observations
```{r}
colSums(energy == "")
```
No blank observations either, so we will move on to scaling the data set.
## 2.4 scaling the data set
You can see from the summary of the data set that the features span very different ranges. Such large differences could skew our predictions because some models may overvalue features with larger values. To reduce that risk, we will scale the data set.
Here is a boxplot of the data set before we scale. You can see the large differences in ranges between a few of the features.
```{r}
boxplot(energy)
```
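Concretely, `scale()` standardizes each column by subtracting its mean and dividing by its standard deviation. A minimal sketch with a small hypothetical vector:
```{r}
#hypothetical example: standardizing a vector by hand
x <- c(510, 565, 690)
(x - mean(x)) / sd(x)  #same values as scale(x) returns
```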
Now let us scale the data set
```{r}
energy[,1:8] <- scale(energy[,1:8])
```
Now let us look at a boxplot of the scaled data set. The features are now on comparable scales.
```{r}
boxplot(energy)
```
Let us check the mean of each feature to make sure the data set is scaled. The means should be approximately 0.
```{r}
options(digits = 3)
format(colMeans(energy[,1:8]), scientific = FALSE)
```
Now let us check the standard deviations, which should be 1.
```{r}
energy %>% select(-Heating_Load,-Cooling_Load) %>% summarise_if(is.numeric,sd)
```
# 3. Data Visualization
First, let us look at the density of heating load
```{r}
energy %>% ggplot(aes(Heating_Load)) +
  geom_density(fill = "red", color = "red") +
  xlab("heating load") +
  ggtitle("Density of Heating Load") +
  theme_economist()
```
Second, the density of Cooling load
```{r}
energy %>% ggplot(aes(Cooling_Load)) +
  geom_density(fill = "blue", color = "blue") +
  xlab("cooling load") +
  ggtitle("Density of Cooling Load") +
  theme_economist()
```
The heating and cooling load densities look similar.
scatter plot of surface area and heating load
```{r}
energy %>% ggplot(aes(Surface_Area, Heating_Load)) +
  geom_point(color = "red") +
  xlab("surface area") +
  ylab("heating load") +
  ggtitle("Surface Area and Heating Load") +
  theme_economist()
```
scatter plot of roof area and heating load
```{r}
energy %>% ggplot(aes(Roof_Area, Heating_Load)) +
  geom_point(color = "red") +
  xlab("roof area") +
  ylab("heating load") +
  ggtitle("Roof Area and Heating Load") +
  theme_economist()
```
scatter plot of compactness and heating load
```{r}
energy %>% ggplot(aes(Relative_Compactness, Heating_Load)) +
  geom_point(color = "red") +
  xlab("relative compactness") +
  ylab("heating load") +
  ggtitle("Relative Compactness and Heating Load") +
  theme_economist()
```
scatter plot of surface area and cooling load
```{r}
energy %>% ggplot(aes(Surface_Area, Cooling_Load)) +
  geom_point(color = "blue") +
  xlab("surface area") +
  ylab("cooling load") +
  ggtitle("Surface Area and Cooling Load") +
  theme_economist()
```
scatter plot of roof area and cooling load
```{r}
energy %>% ggplot(aes(Roof_Area, Cooling_Load)) +
  geom_point(color = "blue") +
  xlab("roof area") +
  ylab("cooling load") +
  ggtitle("Roof Area and Cooling Load") +
  theme_economist()
```
scatter plot of compactness and cooling load
```{r}
energy %>% ggplot(aes(Relative_Compactness, Cooling_Load)) +
  geom_point(color = "blue") +
  xlab("relative compactness") +
  ylab("cooling load") +
  ggtitle("Relative Compactness and Cooling Load") +
  theme_economist()
```
# 4. Models
## 4.1 train and test sets
We are going to split the data into a training set and a test set. The training set will be 80% of the total data set and the test set will be 20%. We will use the training set to train our regression models and then test those models on new data, which in this case will be the test set made up of the remaining 20% of the data.
```{r,results='hide',message=FALSE,warning=FALSE}
#setting seed
set.seed(1, sample.kind = "Rounding")
#splitting data into test and train data sets
test_index <- createDataPartition(energy$Heating_Load, times = 1, p = 0.2, list = FALSE)
test <- energy[test_index,]
train <- energy[-test_index,]
```
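As a quick sanity check, the row counts should reflect the 80/20 split:
```{r}
#number of observations in each set
nrow(train)
nrow(test)
```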
checking to see if the test and train sets have similar outcomes
```{r}
mean(train$Heating_Load)
mean(test$Heating_Load)
```
We will be using 10-fold cross-validation for all of the algorithms: each candidate model is trained on nine folds and validated on the held-out fold, with the error averaged across the ten folds.
```{r,warning=FALSE,message=FALSE}
#creating the k-fold parameters, k is 10
set.seed(7, sample.kind = "Rounding")
control <- trainControl(method = "cv", number = 10, p = .9)
```
## 4.2 linear regression
Predicting the heating load using linear regression
```{r,warning=FALSE,message=FALSE}
#training the model using train set
set.seed(9, sample.kind = "Rounding")
train_lm <- train(Heating_Load ~ .,
                  data = train,
                  method = "lm",
                  tuneGrid = data.frame(intercept = seq(-10, 10, 2)),
                  trControl = control)
#viewing training results
train_lm
```
plotting training results
```{r}
plot(train_lm)
```
creating predictions
```{r,warning=FALSE}
lm_preds_hl <- predict(train_lm, test)
```
calculating the RMSE for the linear regression model
```{r}
lm_rmse_hl <- RMSE(lm_preds_hl,test$Heating_Load)
```
Now we will apply linear regression to get a prediction for the cooling load
training the model using train set
```{r,warning=FALSE,message=FALSE}
set.seed(9, sample.kind = "Rounding")
train_lm <- train(Cooling_Load ~ .,
                  data = train,
                  method = "lm",
                  tuneGrid = data.frame(intercept = seq(-10, 10, 2)),
                  trControl = control)
#viewing training results
train_lm
```
plotting training results
```{r}
plot(train_lm)
```
creating predictions
```{r,warning=FALSE}
lm_preds_cl <- predict(train_lm, test)
```
calculating the RMSE for the linear regression model
```{r}
lm_rmse_cl <- RMSE(lm_preds_cl,test$Cooling_Load)
```
## 4.3 ridge regression
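Ridge regression adds an L2 penalty that shrinks the coefficients, which tends to help most when features are highly correlated. As an optional check (not required for the model itself), we can glance at the pairwise feature correlations:
```{r}
#optional check: pairwise correlations between the eight features
round(cor(energy[, 1:8]), 2)
```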
training the model using train set
```{r,warning=FALSE,message=FALSE}
set.seed(10, sample.kind = "Rounding")
train_ridge <- train(Heating_Load ~ .,
                     data = train,
                     method = "ridge",
                     #lambda is the strength of the L2 penalty
                     tuneGrid = data.frame(lambda = seq(.001, .005, .001)),
                     trControl = control)
#viewing training results
train_ridge
```
plotting training results
```{r}
plot(train_ridge)
```
creating predictions
```{r,warning=FALSE}
ridge_preds_hl <- predict(train_ridge,test)
```
creating RMSE for ridge regression model
```{r}
ridge_rmse_hl <- RMSE(ridge_preds_hl, test$Heating_Load)
```
```{r,warning=FALSE,message=FALSE}
#training the model using train set
set.seed(10, sample.kind = "Rounding")
train_ridge <- train(Cooling_Load ~ .,
                     data = train,
                     method = "ridge",
                     tuneGrid = data.frame(lambda = seq(.001, .005, .001)),
                     trControl = control)
#viewing training results
train_ridge
```
plotting training results
```{r}
plot(train_ridge)
```
creating predictions
```{r,warning=FALSE}
ridge_preds_cl <- predict(train_ridge,test)
```
creating RMSE for ridge regression model
```{r}
ridge_rmse_cl <- RMSE(ridge_preds_cl, test$Cooling_Load)
```
## 4.4 random forest
```{r,warning=FALSE,message=FALSE}
#training the model using training set
set.seed(12, sample.kind = "Rounding")
train_rf <- train(Heating_Load ~ .,
                  data = train,
                  method = "rf",
                  #mtry is the number of features randomly sampled at each split
                  tuneGrid = data.frame(mtry = seq(2, 10, 2)),
                  trControl = control)
#viewing training results
train_rf
```
plotting training results
```{r}
plot(train_rf)
```
creating predictions
```{r,warning=FALSE}
rf_preds_hl <- predict(train_rf,test)
```
creating RMSE for random forest model
```{r}
rf_rmse_hl <- RMSE(rf_preds_hl,test$Heating_Load)
```
```{r,warning=FALSE,message=FALSE}
#training the model using training set
set.seed(12, sample.kind = "Rounding")
train_rf <- train(Cooling_Load ~ .,
                  data = train,
                  method = "rf",
                  tuneGrid = data.frame(mtry = seq(2, 10, 2)),
                  trControl = control)
#viewing training results
train_rf
```
plotting training results
```{r}
plot(train_rf)
```
creating predictions
```{r,warning=FALSE}
rf_preds_cl <- predict(train_rf,test)
```
creating RMSE for random forest model
```{r}
rf_rmse_cl <- RMSE(rf_preds_cl,test$Cooling_Load)
```
## 4.5 ensemble
For the ensemble, we simply average the three models' predictions for each building, starting with the heating load.
```{r}
#combining the three models' predictions for the heating load
heating_preds <- data.frame("lm" = lm_preds_hl,
                            "ridge" = ridge_preds_hl,
                            "rf" = rf_preds_hl)
#the ensemble prediction is the row-wise mean of the three models
ensemble_preds_hl <- rowMeans(heating_preds)
heating_preds$ensemble <- ensemble_preds_hl
ensemble_rmse_hl <- RMSE(ensemble_preds_hl, test$Heating_Load)
```
Ensemble for cooling load
```{r}
#combining the three models' predictions for the cooling load
cooling_preds <- data.frame("lm" = lm_preds_cl,
                            "ridge" = ridge_preds_cl,
                            "rf" = rf_preds_cl)
#again averaging the three models row-wise
ensemble_preds_cl <- rowMeans(cooling_preds)
cooling_preds$ensemble <- ensemble_preds_cl
ensemble_rmse_cl <- RMSE(ensemble_preds_cl, test$Cooling_Load)
```
# 5. Results
## 5.1 table of results
```{r}
options(digits = 3)
results <- data.frame(Model = c("Linear Regression",
                                "Ridge Regression",
                                "Random Forest",
                                "Ensemble"),
                      Heating = c(lm_rmse_hl,
                                  ridge_rmse_hl,
                                  rf_rmse_hl,
                                  ensemble_rmse_hl),
                      Cooling = c(lm_rmse_cl,
                                  ridge_rmse_cl,
                                  rf_rmse_cl,
                                  ensemble_rmse_cl))
kable(results)
```
## 5.2 plot of results
For the heating load, our best model was random forest. Here is a plot of the random forest predictions against the actual results.
```{r}
data.frame(actual = test$Heating_Load,
           predicted = rf_preds_hl) %>%
  ggplot(aes(actual, predicted)) +
  geom_point() +
  #the 45-degree line marks perfect predictions
  geom_abline(intercept = 0, slope = 1) +
  xlab("actual") +
  ylab("predicted") +
  ggtitle("Heating Load: Actual vs Predicted") +
  theme_economist()
```
For the cooling load, our best model was also random forest. Here is a plot of the random forest predictions against the actual results.
```{r}
data.frame(actual = test$Cooling_Load,
           predicted = rf_preds_cl) %>%
  ggplot(aes(actual, predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  xlab("actual") +
  ylab("predicted") +
  ggtitle("Cooling Load: Actual vs Predicted") +
  theme_economist()
```
## 5.3 brief thoughts on results
I was not surprised that the linear regression model had the highest RMSE. Although linear regression is a powerful tool, it struggles when the relationships in the data do not quite line up linearly. It was still a good starting point because it gave us a baseline for the relationship between the features and the outputs.
I really didn't know what to expect from the ridge regression model, and I was surprised that it wasn't much better than the linear model. Ridge regression does well when the features are highly correlated; I thought that might be the case here, so I wanted to try it out and see the results.
I was not surprised that the random forest model was the best performing model. It is a very powerful method that tends to do well for both regression and classification. I found it very interesting that the model was better at predicting the heating load than the cooling load. Its predictions were good for cooling loads under 25 but struggled a bit above 25. My hypothesis is that the cooling load is harder to predict because of sunlight beating down on the building. This might also explain why the cooling loads are higher than the heating loads: it may be more difficult to maintain room temperature when sunlight is countering the cooling effects.
The ensemble ended up with an RMSE in the middle of the results, which made sense to me since we were taking the mean of the three models' predictions. I think the ensemble would do well when applied to a larger data set due to its middle-of-the-road approach.
# 6. Conclusion
## 6.1 summary
We were able to predict the heating and cooling loads of buildings using a data set of building features and their heating and cooling loads. We used supervised machine learning to create the predictions, with a total of four regression models: linear regression, ridge regression, random forest, and an ensemble of the first three. Random forest was our best model for predicting the heating load, with an RMSE of 0.63, and it was also the best model for predicting the cooling load, with an RMSE of 1.26.
## 6.2 limitations
The main limitation of this model is the size of the data set: we are only looking at a sample of 768 simulated buildings. With more data covering a wider variety of buildings, I think we could see a more robust model.
I would also like to see a model with more features, because I think a lot of information was missing from the data set. Where are these buildings located? Do they experience the same climate? And so on.
## 6.3 next steps
The next step would be to use this model to predict the heating and cooling load of the next home or building that you plan on purchasing!