Abstract: Influenza is a highly contagious disease that regularly causes epidemics around the world. Predicting which individuals are at high risk of infection and tracking how the disease spreads could help health care systems allocate resources to combat it more effectively. Predicting outbreaks is a multidimensional problem that draws on many different fields. In this paper, we investigate how historical medical and demographic survey data, combined with Internet search engine activity, can be used to predict the chance of infection for individuals in different regions of the United States. We compare the effectiveness of the Random Forest algorithm with that of Gradient Boosted Trees, evaluating both with ROC curves and the AUC metric.
The final paper can be found here.
The final IPython notebook can be found here.
The data can be downloaded here.
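
As a rough illustration of the evaluation mentioned in the abstract, the sketch below computes ROC curve points and the AUC for two sets of predicted infection probabilities. It is not the notebook's code: the labels and scores are made-up toy values, the names `roc_curve_points`, `auc`, `scores_model_a`, and `scores_model_b` are hypothetical, and the two score lists merely stand in for the outputs of two classifiers such as a Random Forest and Gradient Boosted Trees. It uses only the Python standard library so that it runs anywhere.

```python
from typing import List, Sequence, Tuple


def roc_curve_points(
    labels: Sequence[int], scores: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative label")

    # Sort by descending score so the decision threshold is lowered step by step.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after the last item in a group of tied scores.
        last_of_group = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_group:
            points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (assumption): 1 = infected, 0 = not infected.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
# Hypothetical predicted probabilities from two different models.
scores_model_a = [0.9, 0.4, 0.8, 0.35, 0.3, 0.2, 0.7, 0.1]
scores_model_b = [0.6, 0.5, 0.4, 0.7, 0.8, 0.2, 0.9, 0.3]

auc_a = auc(roc_curve_points(labels, scores_model_a))
auc_b = auc(roc_curve_points(labels, scores_model_b))
print(f"model A AUC: {auc_a:.4f}")
print(f"model B AUC: {auc_b:.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so on this toy data the first score list ranks infected individuals ahead of uninfected ones more consistently. In practice, a library routine such as scikit-learn's `roc_auc_score` would normally replace the hand-written functions.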