This repository contains Python code for time series analysis and forecasting of COVID-19 data using several models. The dataset comes from the 'COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University', and a copy is included in this repository. The analysis covers exploratory data analysis (EDA), visualization of confirmed cases, recovered cases, and deaths worldwide, and time series forecasting with different models.
- numpy and pandas for data manipulation.
- matplotlib and seaborn for data visualization.
- plotly for interactive and dynamic plots.
- sklearn for model selection and evaluation.
- statsmodels for time series analysis.
- pmdarima for AutoARIMA model selection.
- pandas_datareader for fetching COVID-19 data.
- warnings to suppress unnecessary warnings.
The analysis begins with loading the COVID-19 dataset, checking the data types, and performing basic data preprocessing steps. Unnecessary columns such as 'SNo' and 'Last Update' are dropped, and column names are standardized.
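The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: only the 'SNo' and 'Last Update' column names come from the description above, while the other column names, the renamed headers, the date format, and the inline sample rows (made-up values) are assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV. Only 'SNo' and 'Last Update' are named in
# this README; every other column name and all values are assumed/made up.
RAW_CSV = """SNo,ObservationDate,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,Country A,1/22/2020 17:00,100,2,10
2,01/22/2020,Country B,1/22/2020 17:00,10,0,0
3,01/23/2020,Country A,1/23/2020 17:00,150,3,12
4,01/23/2020,Country B,1/23/2020 17:00,20,1,1
"""


def load_and_clean(source):
    """Load the data, drop unused columns, and standardize column names."""
    df = pd.read_csv(source)
    # Drop the columns the analysis does not need.
    df = df.drop(columns=["SNo", "Last Update"])
    # Standardize names (the target names here are hypothetical).
    df = df.rename(columns={"ObservationDate": "Date", "Country/Region": "Country"})
    # Parse dates so later grouping and plotting are time-aware.
    df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
    return df


covid = load_and_clean(io.StringIO(RAW_CSV))
print(covid.dtypes)  # inspect the data types after cleaning
```

With the real dataset, a file path would be passed to `load_and_clean` in place of the in-memory sample.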
The cumulative confirmed cases are plotted over time to visualize the spread of COVID-19 globally. The data is grouped by date, and the cumulative sum is calculated for each day.
Similar to confirmed cases, the cumulative recovered cases are plotted to observe the recovery trend.
The cumulative deaths are visualized over time, showing the unfortunate impact of the pandemic globally.
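The date-wise aggregation behind these three plots might look like the sketch below. It assumes that each row already holds a running total for one region, so summing across regions per date yields the global cumulative count; if the rows held daily new cases instead, a `.cumsum()` on the grouped result would be needed. Column names and values are made up for illustration.

```python
import pandas as pd

# Made-up sample: two regions over two days, values already cumulative per region.
records = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"]
        ),
        "Country": ["A", "B", "A", "B"],
        "Confirmed": [100, 10, 150, 20],
        "Recovered": [5, 0, 12, 1],
        "Deaths": [2, 0, 3, 1],
    }
)

# One row per date, holding the worldwide totals for that date.
datewise = records.groupby("Date")[["Confirmed", "Recovered", "Deaths"]].sum()
print(datewise)
```

Each column of `datewise` can then be plotted against its date index, for example with `datewise["Confirmed"].plot()` when matplotlib is installed.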
A choropleth map is created to display the active COVID-19 cases worldwide. The color intensity represents the number of confirmed cases in each country.
Mortality and recovery rates are calculated and visualized for countries most affected by COVID-19.
The top 15 countries with the highest mortality and recovery rates are visualized in separate bar charts.
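One plausible way to compute and rank these rates is sketched below. The definitions used here (deaths or recoveries divided by confirmed cases, times 100) are an assumption, since this README does not state the exact formulas, and the country totals are made up.

```python
import pandas as pd

# Made-up country-level totals; "D" has no confirmed cases.
latest = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "Confirmed": [1000, 500, 200, 0],
        "Recovered": [600, 450, 50, 0],
        "Deaths": [50, 10, 20, 0],
    }
).set_index("Country")


def add_rates(totals):
    """Return a copy with mortality and recovery rates in percent."""
    # Skip countries without confirmed cases to avoid dividing by zero.
    out = totals[totals["Confirmed"] > 0].copy()
    out["Mortality"] = out["Deaths"] / out["Confirmed"] * 100
    out["Recovery"] = out["Recovered"] / out["Confirmed"] * 100
    return out


rates = add_rates(latest)
# nlargest returns all rows when fewer than 15 are available.
top_mortality = rates.nlargest(15, "Mortality")
top_recovery = rates.nlargest(15, "Recovery")
print(top_mortality[["Mortality"]])
print(top_recovery[["Recovery"]])
```

The two resulting frames map directly onto the two bar charts, one bar per country.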
The Holt Linear model is applied to predict the confirmed cases. The model is trained on the historical data, and predictions are compared with the actual test data.
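To show what Holt's linear method computes, here is a from-scratch sketch of the level and trend recursions. The repository most likely uses a library implementation (statsmodels is listed above), so this is not its actual code; the function name, the smoothing parameters, the initialization, and the toy series are all assumptions chosen for illustration.

```python
def holt_linear_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with Holt's linear (double) smoothing.

    alpha smooths the level, beta smooths the trend; both lie in (0, 1].
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Simple initialization: first value as level, first difference as trend.
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead forecast.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # The h-step forecast extends the last level along the last trend.
    return [level + step * trend for step in range(1, horizon + 1)]


# Toy, perfectly linear series; the last point is held out as test data.
history = [100.0, 110.0, 120.0, 130.0, 140.0]
train, test = history[:4], history[4:]
forecast = holt_linear_forecast(train, alpha=0.4, beta=0.4, horizon=len(test))
print(forecast, test)
```

On a perfectly linear series the forecast continues the line, which makes the sketch easy to sanity-check before trying it on real case counts.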
The Holt-Winters model, which extends Holt's linear method with a seasonal component, is then used to capture seasonality in the time series data.
AutoARIMA is employed to automatically select the best parameters for the ARIMA, MA, AR, and SARIMA models. The models are trained, predictions are made, and their performance is evaluated.
The root mean squared error (RMSE) is used to compare the performance of different models. The SARIMA model is selected as the best model for predicting confirmed cases.
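The RMSE-based comparison can be illustrated with a small sketch. The model names and prediction values below are hypothetical placeholders, not results from this repository; only the metric itself follows the description above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out values and predictions from two placeholder models.
actual = [100.0, 110.0, 120.0]
candidates = {
    "model_a": [98.0, 111.0, 123.0],
    "model_b": [90.0, 100.0, 110.0],
}

scores = {name: rmse(actual, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)  # lower RMSE means a better fit
print(scores, best)
```

Because RMSE is expressed in the same units as the series (case counts here), the scores are directly comparable across models evaluated on the same test window.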
The ARIMA and SARIMA models are tested for predicting deaths. The RMSE is used to compare the models, and the ARIMA model is selected as the best model for death predictions.
The SARIMA model is identified as the best model for predicting confirmed cases, while the ARIMA model is selected for death predictions. These models provide valuable insights into the trends and help in making informed decisions during the ongoing COVID-19 pandemic. Further improvements and refinements can be made based on evolving data and more advanced modeling techniques.
Feel free to explore the code, run the analyses, and contribute to the improvement of the models!