The target-data
folder contains the CSV data that the forecasts will be compared against. This data serves as the "gold standard" for evaluating the forecasting models. For the current Flu season, the data is stored in target-data/season_2024_2025/data_report.csv
file.
Respiratory Virus Detection Surveillance System (RVDSS)
Our hub's prediction targets (sarscov2_pct_positive
, rsv_pct_positive
and flu_pct_positive
) are scraped from the Respiratory Virus Detection Surveillance System (RVDSS), published by the Public Health Agency of Canada (PHAC). The data was historically reported in weekly reports, but the current season is moved to an interactive dashboard. Historic reports and the interactive dashboard can be found here. The target data file data_reports.csv
is generated using the raw data that is made available through the webscraping scripts provided by the Delphi Epi Data.
Previously collected data from earlier seasons are included in the .auxiliary-data\target-data-archive
directory in their respective season sub-directories as data_reports.csv
. The sarscov2_pct_positive
data starts from the season_2022_2023
and hence these column values are not included in previous season data files.
-
time_value
: the last day of the epiweek -
geo_type
: the type of geographical location -
geo_value
: the actual geographical location -
[virus]_pct_positive
: the percentage of tests for a given virus that are positive (target)
Primary Data Source: Respiratory Virus Detection Surveillance System (RVDSS)
A set of CSV files is updated weekly with the latest observed values for [target type, e.g., percentage of positive virus detections]. These are available at:
./target-data/season_2024_2025/target_rvdss_data.csv
auxiliary-data/season_2024_2025_raw_files
(Raw Files)
The rvdss_update.py
code processes and updates weekly data on respiratory virus detections in Canada, automatically adding new entries. It begins by defining functions to standardize virus and geographic names (e.g., "parainfluenza" to "hpiv" and "Newfoundland" to "nl") and to categorize geographic areas (as nation, region, or province) for consistent organization.
Two main functions then retrieve and transform the data. get_revised_data()
accesses historical weekly data, reformats it with a multi-index structure and ensures date consistency. get_weekly_data()
retrieves data for the latest epidemiological week, determining the correct year and week from a summary file. It then applies the same formatting and standardization as with the historical data.
After processing, the code saves the data in positive_tests.csv
and respiratory_detections.csv
files. If these files already exist, it checks for new entries by comparing indices, appending updated data to prevent duplication. After saving updates to positive_tests.csv
and respiratory_detections.csv
, the code consolidates both datasets into a unified file, target_rvdss_data.csv
. It includes updated geo_type
values and removes duplicates, keeping only the latest (revised) entry for each combination of time_value
, geo_type
, and geo_value
. It retains our target columns (COLUMNS_TARGET
) and rounds percentage values to two decimal places, creating a ready-to-analyze file with standardized weekly data across Canada.
For each season, the code generates three files:
-
positive_tests.csv
- Displays the percentage of positive tests for each virus by week.
- Aggregated at the regional level, with national totals included.
- Includes revisions for each update.
- Matches Table 1 in the reports, typically titled “Respiratory virus detections for the week ending...”
-
respiratory_detections.csv
- Shows the number of positive tests for each virus (including subtypes) by week.
- Aggregated at the lab level, with summaries at the regional level.
- Includes revisions for each update.
- Matches Figures 3-9 in the reports, typically titled “Positive [virus] tests (%)...”
-
target_rvdss_data.csv
- Consolidates data from
positive_tests.csv
andrespiratory_detections.csv
. - Updates the
geo_type
field based on location corrections (fromLOC_CORRECTION
). - Removes duplicate rows, keeping only the latest (revised) entry for each combination of
time_value
,geo_type
, andgeo_value
. - Drops unnecessary columns (e.g.,
issue
andepiweek
), creating a streamlined dataset for further analysis.
- Consolidates data from