To analyze how restaurant ratings are affected by review sentiment scores, review length, and the number of useful, funny, and cool votes each review received, we used SparkR in Databricks.
install.packages("aws.s3")
#Installing sentimentr package
if (!require("pacman")) install.packages("pacman")
pacman::p_load_current_gh("trinker/lexicon", "trinker/sentimentr")
install.packages("corrplot") #installing corrplot package
install.packages("tree") # Decision Tree Package
install.packages("gbm") # Boosting Package
install.packages("randomForest") # Random Forest Package
library(SparkR)
library("aws.s3")
library(corrplot)
library(tree) # library for decision tree
library(gbm)
library(randomForest)
Sys.setenv("AWS_ACCESS_KEY_ID" = "Your Access Key ID",
           "AWS_SECRET_ACCESS_KEY" = "Your Secret Access Key",
           "AWS_DEFAULT_REGION" = "Your Bucket Default Region")
bucketlist() # Listing the S3 buckets accessible with these credentials
get_bucket('yelpdatasetchallengebigdataproject') # Getting the bucket containing the Yelp dataset
# Retrieving the yelp_review.csv object from the S3 bucket
obj <- get_object("s3://yelpdatasetchallengebigdataproject/yelp_review.csv")
yelpData <- read.csv(text = rawToChar(obj)) # Parsing the raw object bytes into an R data frame
display(yelpData)
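As a sketch of the feature-engineering step that follows the load (the column names text, stars, useful, funny, and cool in yelp_review.csv are assumptions here), sentiment scores and review lengths can be derived with sentimentr roughly as follows:
reviewText <- as.character(yelpData$text) # 'text' column assumed to hold the raw review text
sentScores <- sentiment_by(reviewText) # one averaged sentiment score per review
yelpData$sentiment <- sentScores$ave_sentiment
yelpData$review_length <- nchar(reviewText) # review length measured in characters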
Comparing the correlation accuracies of the four algorithms above, the ranking from highest to lowest accuracy is as follows:
Boosting > Random Forest > Decision Tree > Linear Regression
The boosting model performed markedly better than the other three, with a correlation accuracy of 70.15%. Boosting performed better because it grows many trees sequentially, with each new tree fitted to a modified version of the training data that emphasizes the errors of the trees before it, so the ensemble's fit improves iteratively.
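For reference, a minimal gbm sketch of such a boosting fit (a hypothetical 70/30 train/test split and generic tuning parameters; the stars column and the engineered features above are assumptions, and the project's actual settings may differ):
set.seed(1)
train <- sample(nrow(yelpData), floor(0.7 * nrow(yelpData))) # hypothetical 70/30 split
boostFit <- gbm(stars ~ sentiment + review_length + useful + funny + cool,
                data = yelpData[train, ],
                distribution = "gaussian", # squared-error loss for a numeric rating
                n.trees = 5000, # trees grown sequentially, each on the residuals so far
                interaction.depth = 4,
                shrinkage = 0.01)
preds <- predict(boostFit, newdata = yelpData[-train, ], n.trees = 5000)
cor(preds, yelpData$stars[-train]) # correlation accuracy: predicted vs. actual ratings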