Analyzed how restaurant ratings are affected by review sentiment scores, review length, and the number of useful, funny, and cool votes received by each review.

AkshayRameshAppDEV/Yelp-Dataset-Challenge

Yelp-Dataset-Challenge

Goal:-

To analyze how restaurant ratings are affected by review sentiment scores, review length, and the number of useful, funny, and cool votes received by each review.

Language used:-

We used SparkR on Databricks.

Installing Required Packages:-

install.packages("aws.s3")

#Installing sentimentr package
if (!require("pacman")) install.packages("pacman")
pacman::p_load_current_gh("trinker/lexicon", "trinker/sentimentr")

install.packages("corrplot") #installing corrplot package

install.packages("tree") # Decision Tree Package

install.packages("gbm") # Boosting Package

install.packages("randomForest") # Random Forest Package

Loading the installed libraries/packages:-

library(SparkR)
library("aws.s3")
library(corrplot)
library(tree) # library for decision tree
library(gbm)
library(randomForest)

Getting access to my AWS S3 bucket and storing the bucket object into the variable:-

Sys.setenv("AWS_ACCESS_KEY_ID" = "Your Access Key ID",
           "AWS_SECRET_ACCESS_KEY" = "Your Secret Access Key",
           "AWS_DEFAULT_REGION" = "Your Bucket Default Region")
bucketlist() # List the buckets these credentials can access
get_bucket('yelpdatasetchallengebigdataproject') # List the objects in the bucket containing the Yelp dataset

# Download the review CSV from S3 as a raw vector
obj <- get_object("s3://yelpdatasetchallengebigdataproject/yelp_review.csv")
yelpData <- read.csv(text = rawToChar(obj)) # Parse the downloaded CSV into a data frame
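
The conversion in the last line can be tried without S3 access. The sketch below uses a hand-made raw vector as a stand-in for what `get_object()` returns; the column names and values are made up for illustration and are not taken from the real file.

```r
# Stand-in for the raw vector returned by get_object(); contents are invented.
obj <- charToRaw("stars,useful,funny,cool\n5,2,0,1\n1,4,1,0\n")

# rawToChar() turns the bytes into one string, which read.csv() parses via text =
yelp_sample <- read.csv(text = rawToChar(obj))
str(yelp_sample)
```

For a large file, holding the whole CSV in memory as a single string may be slow, so this is best treated as a small-scale illustration.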

Data Exploration:-

display(yelpData)

[12 figures]
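
As a rough sketch of this kind of exploration, the snippet below derives a review-length feature and computes pairwise correlations in base R. The toy data frame and its column names (`text`, `stars`, `useful`, `funny`, `cool`) are assumptions made for illustration, not the project's actual data. A sentiment column, which the project obtains from the sentimentr package, is left out so that the example runs without extra packages.

```r
# Toy stand-in for yelpData; column names and values are assumed, not real data.
reviews <- data.frame(
  text   = c("Great food and friendly staff.",
             "Terrible service.",
             "Average place, nothing special but the portions were large.",
             "Loved it!"),
  stars  = c(5, 1, 3, 5),
  useful = c(2, 4, 1, 0),
  funny  = c(0, 1, 0, 0),
  cool   = c(1, 0, 0, 2),
  stringsAsFactors = FALSE
)

# Review length in characters; a word count would be an equally valid choice.
reviews$review_length <- nchar(reviews$text)

# Pairwise Pearson correlations between the rating and the numeric features.
numeric_cols <- c("stars", "review_length", "useful", "funny", "cool")
cor_matrix <- cor(reviews[, numeric_cols])
print(round(cor_matrix, 2))
```

With the corrplot package loaded, `corrplot(cor_matrix)` would draw this matrix as a plot.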

Linear Regression Analysis:-

[7 figures]
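
A minimal sketch of such a regression on synthetic data follows. It assumes that "correlation accuracy" means the Pearson correlation between predicted and actual ratings on a held-out set. The feature names, distributions, and the 80/20 split are illustrative assumptions, not the project's actual setup.

```r
set.seed(42)
n <- 500

# Synthetic features; names and distributions are illustrative assumptions.
reviews <- data.frame(
  sentiment     = rnorm(n, mean = 0.1, sd = 0.3),
  review_length = rpois(n, lambda = 400),
  useful        = rpois(n, lambda = 1),
  funny         = rpois(n, lambda = 0.5),
  cool          = rpois(n, lambda = 0.5)
)

# Synthetic rating driven mostly by sentiment, clamped to the 1..5 range.
reviews$stars <- pmin(5, pmax(1, 3 + 3 * reviews$sentiment + rnorm(n, sd = 0.7)))

# Hold out 20% of the rows for evaluation.
train_idx <- sample(n, size = 0.8 * n)
train <- reviews[train_idx, ]
test  <- reviews[-train_idx, ]

# Ordinary least squares on the training rows.
fit <- lm(stars ~ sentiment + review_length + useful + funny + cool, data = train)

# Correlation between held-out predictions and held-out ratings.
pred <- predict(fit, newdata = test)
correlation_accuracy <- cor(pred, test$stars)
print(summary(fit)$coefficients)
print(correlation_accuracy)
```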

Decision Tree Analysis:-

[1 figure]

Boosting Analysis:-

[3 figures]

Random Forest Analysis:-

[1 figure]

Conclusion:-

Comparing the correlation accuracies of the four algorithms gives the following ranking, from highest to lowest:

Boosting > Random Forest > Decision Tree > Linear Regression

Boosting performed much better than the other three models, with a correlation accuracy of 70.15%. It did so because it builds many trees sequentially: each new tree is fit to a modified version of the data that reflects the errors of the trees grown before it, so the ensemble improves step by step.
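
The sequential idea can be sketched in base R. The code below is a hand-rolled squared-error boosting loop with one-split trees (stumps) on synthetic data, written only to illustrate the mechanism. It is not the gbm implementation, and the helper names (`fit_stump`, `predict_stump`), tree count, and shrinkage value are assumptions chosen for the example.

```r
set.seed(1)
n <- 300
x <- runif(n, -1, 1)                   # single synthetic feature
y <- sin(3 * x) + rnorm(n, sd = 0.2)   # synthetic target

# Fit a one-split regression stump to (x, r) by least squares.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between values
  best <- list(sse = Inf)
  for (s in candidates) {
    left <- r[x <= s]
    right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s, left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

n_trees <- 50
shrinkage <- 0.1
prediction <- rep(mean(y), n)          # start from the mean rating
train_mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # The "modified version" of the data: what the earlier trees still get wrong.
  current_residuals <- y - prediction
  stump <- fit_stump(x, current_residuals)
  # Add a shrunken copy of the new tree to the running prediction.
  prediction <- prediction + shrinkage * predict_stump(stump, x)
  train_mse[m] <- mean((y - prediction)^2)
}

print(round(train_mse[c(1, 10, 50)], 3))
```

The training error shrinks as trees are added. On real data the number of trees and the shrinkage would normally be chosen with a validation set to avoid overfitting.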
