This document describes a quality assurance procedure for cleaning per- and polyfluoroalkyl substances (PFAS) surface water concentration data. The procedure can be adapted for other variables.
The purpose of this document is to provide an order and general description of the steps taken to access, download, clean, and tag PFAS surface water data supplied by the US EPA’s Water Quality Portal. This document demonstrates the use of the R package "EPATADA" to assist with coding processes used to generate the results of a nationwide study of ambient surface water PFAS concentrations in the US. The code, data, and supporting documentation are all available free of charge. The document is broken down into sections that outline how each attached .R file can be used sequentially to evaluate surface water data in a reproducible manner.
Authors: Hannah Ferriby¹, Kateri Salk¹, Matt Dunn¹, Christopher Wharton¹, Susan Cormier², Tammy Newcomer-Johnson²
Affiliations: ¹ Tetra Tech Inc. ² United States Environmental Protection Agency
Corresponding Author: Tammy Newcomer-Johnson
Preferred citation: H. Ferriby, K. Salk, M. Dunn, C. Wharton, S. Cormier, and T. Newcomer-Johnson. 2025. US EPA Water Quality Portal – Tools for Automated Data Analysis of PFAS in Surface Water (TADA-PFAS-SW). https://github.com/TammyNewcomerJohnson/EPATADA-PFAS-SW
Tools for Automated Data Analysis
data_pull.R
EPA_data_processing_SW_ONLY.R
data_visualization.R
EPA_data_processing.R
A. Check and Install Dependencies:
- Checks for the remotes package and installs it if not already present.
- Installs the EPATADA package from GitHub.
B. Load Required Libraries:
- Loads the required R libraries (EPATADA, dplyr, readr).
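The setup steps above can be sketched as follows. This is a minimal sketch, not the contents of the attached script: the helper names install_if_missing and setup_tada are hypothetical, and the GitHub location USEPA/EPATADA is an assumption about where the package is hosted. Everything is wrapped in functions so that nothing is installed until setup_tada() is called.

```r
# Hypothetical helper: install a CRAN package only if it is not already present.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Hypothetical wrapper for steps A and B; call setup_tada() once per session.
setup_tada <- function() {
  # A. Check and install dependencies.
  install_if_missing("remotes")
  # Assumption: EPATADA is installed from the USEPA/EPATADA GitHub repository.
  if (!requireNamespace("EPATADA", quietly = TRUE)) {
    remotes::install_github("USEPA/EPATADA", dependencies = TRUE)
  }
  # B. Load required libraries.
  library(EPATADA)
  library(dplyr)
  library(readr)
}
```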
A. Using TADA_BigDataRetrieval:
- Pulls data for PFAS-related compounds from the Water Quality Portal using predefined characteristics:
a. characteristicType: Includes broad PFAS categories (e.g., "PFAS, Perfluorinated Alkyl Substance").
b. characteristicName: Targets specific PFAS compounds (e.g., "Perfluorooctanoic acid", "Perfluorooctanesulfonate").
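A data pull of this kind might look like the sketch below. The wrapper pull_pfas is hypothetical, and the argument names (characteristicType, characteristicName, applyautoclean) reflect our understanding of the TADA_BigDataRetrieval interface; they should be checked against the installed EPATADA version, since newer releases may route large queries through TADA_DataRetrieval instead. The default value "null" is assumed to mean "no filter on this field".

```r
# Hypothetical wrapper around the EPATADA retrieval call.
# Nothing is downloaded until pull_pfas() is called.
pull_pfas <- function(characteristic_type = "null",
                      characteristic_name = "null") {
  EPATADA::TADA_BigDataRetrieval(
    characteristicType = characteristic_type,  # broad PFAS category
    characteristicName = characteristic_name,  # specific compounds
    applyautoclean = FALSE                     # cleaning is run as a separate step
  )
}

# Example usage (requires EPATADA and network access, so it is not run here):
# by_type <- pull_pfas(characteristic_type = "PFAS, Perfluorinated Alkyl Substance")
# by_name <- pull_pfas(characteristic_name = c("Perfluorooctanoic acid",
#                                              "Perfluorooctanesulfonate"))
```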
A. Automatic Cleaning and Transformation:
- Runs the TADA_AutoClean function to clean and standardize the dataset.
- The cleaning process includes:
a. Column Capitalization for WQX Interoperability: Adds "TADA."-prefixed columns with uppercase values for select attributes.
b. Special Character Conversion: Converts special characters in measurement values and creates new "TADA." columns.
c. Latitude and Longitude Conversion: Converts these fields to numeric types and adds "TADA."-prefixed columns.
d. Standardize Unit Labels: Replaces "meters" with "m" in depth-related columns.
e. Replace Deprecated Characteristic Names: Updates deprecated names using the WQX domain table.
f. Result and Detection Limit Unit Harmonization: Converts result and detection limit units to WQX-compliant or user-defined targets.
g. Depth Unit Conversion: Converts depth measures to meters and adds new "TADA." columns.
h. Create Comparable Data Group IDs: Generates a concatenated ID for grouped data comparison.
- Ensures the dataset is standardized and ready for downstream analysis.
A. Write Cleaned Data to CSV: The cleaned dataset is exported as a CSV file (data_export.csv) for further processing or analysis.
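The cleaning and export steps could be combined as in the sketch below. The wrapper clean_and_export is hypothetical; it assumes that TADA_AutoClean takes the raw data frame as its first argument and that the output file is written with readr::write_csv to the file name given above.

```r
# Hypothetical wrapper: clean the raw pull, then write it to CSV.
# Requires EPATADA and readr when called.
clean_and_export <- function(raw, path = "data_export.csv") {
  cleaned <- EPATADA::TADA_AutoClean(raw)  # adds the "TADA."-prefixed columns
  readr::write_csv(cleaned, path)          # export for the processing script
  invisible(cleaned)
}
```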
A. Load Packages: Loads the required R libraries (EPATADA, dplyr, readr, purrr).
B. Data Import: Imports the PFAS dataset produced by the data pull (data_export.csv).
A. Filter by Media: Retain only records from surface water samples.
B. Filter by Compound: Retain only records for compounds on a predefined list of PFAS compounds.
C. Harmonize Names: Abbreviate specific PFAS compound names for consistency.
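The three filtering steps above can be illustrated on a small toy table. This is a sketch in base R under stated assumptions: the column names follow the "TADA."-prefixed convention described earlier, while the toy records, the two-compound lookup pfas_abbrev, and the output column PFAS.Abbreviation are hypothetical stand-ins for the study's actual list.

```r
# Toy records standing in for the imported data_export.csv.
pfas <- data.frame(
  TADA.ActivityMediaName = c("WATER", "WATER", "SEDIMENT", "WATER", "WATER"),
  ActivityMediaSubdivisionName = c("Surface Water", "Groundwater", NA,
                                   "Surface Water", "Surface Water"),
  TADA.CharacteristicName = c("PERFLUOROOCTANOIC ACID", "PERFLUOROOCTANOIC ACID",
                              "PERFLUOROOCTANESULFONATE", "PERFLUOROOCTANESULFONATE",
                              "CHLORIDE"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from full compound name to abbreviation.
pfas_abbrev <- c(
  "PERFLUOROOCTANOIC ACID" = "PFOA",
  "PERFLUOROOCTANESULFONATE" = "PFOS"
)

# A. Keep surface water records only (%in% treats NA as "no match").
is_sw <- pfas$TADA.ActivityMediaName %in% "WATER" &
  pfas$ActivityMediaSubdivisionName %in% "Surface Water"

# B. Keep only compounds on the predefined list.
is_target <- pfas$TADA.CharacteristicName %in% names(pfas_abbrev)
pfas_sw <- pfas[is_sw & is_target, , drop = FALSE]

# C. Add a harmonized abbreviation for each retained compound.
pfas_sw$PFAS.Abbreviation <- unname(pfas_abbrev[pfas_sw$TADA.CharacteristicName])
```

Keeping the abbreviation in a separate column leaves the original characteristic name available for later checks.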
A. Result Unit Validity: Adds a tag (TADA.ResultUnit.Flag) to validate measurement units.
B. Sample Fraction Validity: Tags sample fraction issues (TADA.SampleFraction.Flag).
C. Method Speciation Validity: Tags issues with method speciation (TADA.MethodSpeciation.Flag).
D. Harmonization: Harmonizes synonyms across records, adding tags like TADA.Harmonized.Flag.
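The four validity steps above might be chained as in the sketch below. The wrapper tag_validity is hypothetical, and the function and argument names reflect our understanding of the EPATADA interface, which can differ between package versions. The clean arguments are set so that records are tagged rather than removed.

```r
# Hypothetical wrapper: add validity tags without dropping any records.
# Requires EPATADA when called.
tag_validity <- function(dat) {
  dat <- EPATADA::TADA_FlagResultUnit(dat, clean = "none")  # A. result units
  dat <- EPATADA::TADA_FlagFraction(dat, clean = FALSE)     # B. sample fraction
  dat <- EPATADA::TADA_FlagSpeciation(dat, clean = "none")  # C. method speciation
  dat <- EPATADA::TADA_HarmonizeSynonyms(dat)               # D. synonym harmonization
  dat
}
```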
A. Unrealistic Values: Tags values above upper or below lower thresholds.
B. Continuous Data: Identifies continuous data (commented out in this code).
A. Analytical Methods: Tags data based on the validity of analytical methods used (TADA.AnalyticalMethod.Flag).
B. Duplicates:
- Tags potential duplicates across multiple organizations (TADA.MultipleOrgDupGroupID).
- Tags single-organization duplicates (TADA.SingleOrgDup.Flag).
A. QC Samples: Tags data from QC-related activities (TADA.ActivityType.Flag).
A. Invalid Coordinates: Tags records with problematic coordinates (TADA.InvalidCoordinates.Flag).
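The threshold, method, duplicate, QC, and coordinate tags described above could be applied in sequence as sketched below. The wrapper tag_quality is hypothetical, and the EPATADA function and argument names are our best understanding of the package interface rather than a copy of the attached script; they should be verified against the installed version. All clean arguments are turned off so that records are tagged, not removed.

```r
# Hypothetical wrapper: add quality tags in the order listed in this document.
# Requires EPATADA when called.
tag_quality <- function(dat) {
  dat <- EPATADA::TADA_FlagAboveThreshold(dat, clean = FALSE)   # unrealistically high values
  dat <- EPATADA::TADA_FlagBelowThreshold(dat, clean = FALSE)   # unrealistically low values
  dat <- EPATADA::TADA_FlagMethod(dat, clean = FALSE)           # analytical methods
  dat <- EPATADA::TADA_FindPotentialDuplicatesMultipleOrgs(dat) # cross-organization duplicates
  dat <- EPATADA::TADA_FindPotentialDuplicatesSingleOrg(dat)    # single-organization duplicates
  dat <- EPATADA::TADA_FindQCActivities(dat, clean = FALSE)     # QC samples
  dat <- EPATADA::TADA_FlagCoordinates(dat,
                                       clean_outsideUSA = "no",
                                       clean_imprecise = FALSE) # coordinates
  dat
}
```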
A. Suspect Samples: Tags records with suspect qualifier codes (TADA.MeasureQualifierCode.Flag).
B. Non-Detect Values: Adds columns for censored data tags (TADA.CensoredData.Flag) and replaces non-detects with calculated values (e.g., 50% of the detection limit).
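The non-detect substitution can be illustrated with a few toy records. The sketch below only shows the arithmetic in base R; the toy values are hypothetical, the tag values "Uncensored" and "Non-Detect" are assumed for illustration, and the multiplier of 0.5 corresponds to the 50% example given above.

```r
# Toy records: two measured results and two non-detects.
results <- data.frame(
  TADA.ResultMeasureValue = c(12.0, NA, 3.5, NA),
  TADA.DetectionQuantitationLimitMeasure.MeasureValue = c(2, 4, 2, 1),
  TADA.CensoredData.Flag = c("Uncensored", "Non-Detect", "Uncensored", "Non-Detect"),
  stringsAsFactors = FALSE
)

# Fraction of the detection limit substituted for non-detects (50% here).
nd_multiplier <- 0.5

# Replace each non-detect with multiplier x its own detection limit.
is_nd <- results$TADA.CensoredData.Flag == "Non-Detect"
results$TADA.ResultMeasureValue[is_nd] <-
  nd_multiplier *
  results$TADA.DetectionQuantitationLimitMeasure.MeasureValue[is_nd]
```

Because the censored-data tag is kept alongside the substituted value, substituted results can still be separated from measured ones in later analysis.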
A. Remove All-NA Columns: Drops columns that contain only NA values.
B. Filter Negative Values: Removes records with negative result values.
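Both clean-up steps can be sketched in base R on a toy table. The column names and values are hypothetical. Whether records with a missing result value are kept at this stage is an assumption of the sketch (they are kept here), not something this document specifies.

```r
# Toy table: one all-NA column and one negative result.
dat <- data.frame(
  TADA.ResultMeasureValue = c(5.2, -1.0, 0.8, NA),
  ResultCommentText = c(NA, NA, NA, NA),
  MonitoringLocationIdentifier = c("A-1", "A-2", "B-1", "B-2"),
  stringsAsFactors = FALSE
)

# A. Drop columns in which every value is NA.
all_na <- vapply(dat, function(col) all(is.na(col)), logical(1))
dat <- dat[, !all_na, drop = FALSE]

# B. Remove records with negative result values.
# Assumption: records with a missing (NA) result are retained.
keep <- is.na(dat$TADA.ResultMeasureValue) | dat$TADA.ResultMeasureValue >= 0
dat <- dat[keep, , drop = FALSE]
```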
A. Detection Limit Tags:
- Tags samples based on detection limits and units.
- Converts and harmonizes detection limit units (e.g., converting µg/L to ng/L).
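The unit harmonization step amounts to multiplying each detection limit by a unit-specific factor. The sketch below assumes ng/L as the target unit, uses the ASCII spelling "ug/L" for µg/L, and uses hypothetical column names and toy values; units missing from the lookup table are tagged rather than converted.

```r
# Toy detection limits reported in mixed units.
dl <- data.frame(
  value = c(0.002, 4, 0.5, 1),
  unit = c("ug/L", "ng/L", "ug/L", "mg/kg"),
  stringsAsFactors = FALSE
)

# Multipliers that convert each recognized unit to ng/L.
to_ng_per_l <- c("ng/L" = 1, "ug/L" = 1000, "mg/L" = 1e6)

# Look up the multiplier for each record; unrecognized units give NA.
mult <- unname(to_ng_per_l[dl$unit])
dl$value_ng_l <- dl$value * mult
dl$unit_tag <- ifelse(is.na(mult), "Unrecognized unit", "Converted to ng/L")
```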
B. Detection Limit Statistics:
- Calculates statistical summaries for detection limits (average, median, standard deviation).
- Tags outliers based on calculated thresholds.
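A sketch of the summary statistics and outlier tag for a single compound follows. The toy detection limits are hypothetical. The outlier threshold used here, the mean plus two standard deviations, is an assumption chosen for illustration; this document does not state which threshold the study used.

```r
# Toy detection limits (ng/L) for a single compound.
dl_ng_l <- c(1, 2, 2, 2, 3, 2, 1, 3, 2, 50)

# Summary statistics.
dl_mean <- mean(dl_ng_l)
dl_median <- median(dl_ng_l)
dl_sd <- sd(dl_ng_l)

# Assumed threshold: mean + 2 standard deviations.
upper <- dl_mean + 2 * dl_sd

# Tag detection limits above the threshold.
outlier_tag <- ifelse(dl_ng_l > upper, "Outlier", "Within range")
```

In practice these statistics would likely be computed separately for each compound, since detection limits differ between analytes.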
C. Detection Limit User Error:
- Tags samples that are entered as "uncensored" but whose result is lower than the reported detection limit.
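This check can be sketched as a simple comparison between the result and its detection limit. The column names, tag values, and toy records below are hypothetical, and both quantities are assumed to be in the same unit already.

```r
# Toy records with results and detection limits in ng/L.
samples <- data.frame(
  result_ng_l = c(5.0, 0.8, 2.0, 1.0),
  dl_ng_l = c(2, 2, 2, 2),
  censor_flag = c("Uncensored", "Uncensored", "Uncensored", "Non-Detect"),
  stringsAsFactors = FALSE
)

# An "uncensored" result strictly below its detection limit suggests an entry error.
below <- samples$censor_flag == "Uncensored" &
  !is.na(samples$result_ng_l) &
  !is.na(samples$dl_ng_l) &
  samples$result_ng_l < samples$dl_ng_l

samples$dl_user_error_tag <- ifelse(below, "Uncensored below detection limit", "Pass")
```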
A. EPA Method Tags: Tags records based on accepted EPA analytical methods.
A. With Tags: Exports the dataset with all tags to a CSV (EPATADA_Original_data_with_flags_SW_ONLY.csv).
B. Filtered Data: Applies a series of filters to exclude invalid data, retaining only acceptable samples. Exports this filtered dataset (EPATADA_filtered_data_SW_ONLY.csv).
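The two exports can be sketched as follows. The toy table and its tag values ("Pass", "Suspect", "Unique", "Duplicate") are illustrative placeholders, only two of the many tags are used in the filter, and base R write.csv stands in for whichever writer the attached script uses. Files are written to a temporary directory so the sketch does not overwrite real outputs.

```r
# Toy tagged dataset; tag values are placeholders for illustration.
tagged <- data.frame(
  result_ng_l = c(5, 0.8, 3),
  TADA.ResultUnit.Flag = c("Pass", "Pass", "Suspect"),
  TADA.SingleOrgDup.Flag = c("Unique", "Duplicate", "Unique"),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()
with_tags_path <- file.path(out_dir, "EPATADA_Original_data_with_flags_SW_ONLY.csv")
filtered_path <- file.path(out_dir, "EPATADA_filtered_data_SW_ONLY.csv")

# A. Export every record with all of its tags.
write.csv(tagged, with_tags_path, row.names = FALSE)

# B. Keep only records that pass every filter, then export.
keep <- tagged$TADA.ResultUnit.Flag != "Suspect" &
  tagged$TADA.SingleOrgDup.Flag != "Duplicate"
filtered <- tagged[keep, , drop = FALSE]
write.csv(filtered, filtered_path, row.names = FALSE)
```

Exporting the fully tagged file before filtering means that any exclusion can be traced back to the tag that caused it.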