DADA2 Documentation
We use the DADA2 software to assign taxonomy to amplicon sequences. However, some processing is necessary before we can assign taxonomy. A brief overview of the process is as follows:
- Determine quality trimming length
- Filter and trim for quality and length
- Dereplicate
- Learn error rates
- Sample inference
- Merge paired reads
- Remove chimeras
- Filter merged reads by size
- Assign Taxonomy
- Create and Save Output Files
Much of our DADA2 script is adapted from the DADA2 tutorial, which you can consult for further explanation. However, we modified and extended the existing tools to further automate and refine the process for our project's needs. Below is documentation for the code that deviates from the DADA2 tutorial.
Sequence data produced on many platforms, including the very popular Illumina MiSeq, tend to have lower-quality base calls toward the end of each read. This is especially true of longer reads (300+ bp). Therefore, in the first QA/QC step the raw reads are trimmed to remove poor-quality bases at the end of each read. To determine the truncation length for forward and reverse reads, we calculate the mean quality score along a sliding 10 bp window over the entire read, averaging each window across all forward and reverse reads in the dataset. The first window where the mean Q score falls below a specified threshold (F_qual and R_qual in primer_data.csv, optimized on a per-primer basis) is selected as the position where DADA2 should truncate reads in the filterAndTrim step.
In a typical DADA2 workflow, you would look at these quality plots and manually decide where to trim. Ideally, you want to strike a balance between keeping enough sequence length to successfully merge forward and reverse reads and trimming off low-quality bases. As you might guess, there is a lot of discourse about how to choose these parameters.
To aid in reproducibility and automation, we created a simple algorithm that decides where to trim forward and reverse reads. Below is the code:
n <- 500000 # number of reads sampled from each fastq file when summarizing quality (qa() is from the ShortRead package)
trimsF <- c()
for(f in fnFs[!is.na(fnFs)]) {
srqa <- qa(f, n=n)
df <- srqa[["perCycle"]]$quality # per-cycle quality summary (the slowest part of this loop)
means <- rowsum(df$Score*df$Count, df$Cycle)/rowsum(df$Count, df$Cycle) #calculate mean quality at each cycle
window_values <- c()
window_size <- 10
for(j in 1:(length(means)-window_size)){
window_values[j] <- mean(means[j:(j+window_size)]) #calculate mean qual score in each window and add to window_values vector
}
where_to_cut <- min(which(window_values < primer.data$F_qual[i])) #trim at the first position of the first window where mean quality dips below the F_qual threshold
trimsF[f] <- where_to_cut
}
where_trim_all_Fs <- median(trimsF) #median of all per-file trims - use this as truncLen for forward reads
trimsR <- c()
for(r in fnRs[!is.na(fnRs)]) {
srqa <- qa(r, n=n)
df <- srqa[["perCycle"]]$quality
means <- rowsum(df$Score*df$Count, df$Cycle)/rowsum(df$Count, df$Cycle)
window_size <- 10 #same window size as for the forward reads
window_values <- c()
for(j in 1:(length(means)-window_size)){
window_values[j] <- mean(means[j:(j+window_size)]) #calculate mean qual score in each window and add to window_values vector
}
where_to_cut <- min(which(window_values < primer.data$R_qual[i]))
trimsR[r] <- where_to_cut
}
where_trim_all_Rs <- median(trimsR)
# Cap at max_trim if the obtained trim position is greater than max_trim
if(where_trim_all_Fs > primer.data$max_trim[i]){
where_trim_all_Fs <- primer.data$max_trim[i]
}
if(where_trim_all_Rs > primer.data$max_trim[i]){
where_trim_all_Rs <- primer.data$max_trim[i]
}
What this does is:
- Determine the quality at each base - The qa() function calculates the quality score at each position of each read.
- Find the average quality at each position - rowsum() is then used to compute the mean quality at each position across all reads.
- Use a sliding window to determine where to trim - A window of size 10 is slid along the vector of mean qualities, and the mean quality score within each window is calculated. Once that mean falls below a threshold (supplied by the F_qual and R_qual columns in the primer_data.csv metadata input file), the first position of that window is added to a vector called trimsF/trimsR.
- Repeat with all samples - This sliding-window calculation is repeated for every sample, and each trim position is appended to trimsF/trimsR.
- Use the median to determine where to trim all samples - The median of trimsF/trimsR is then used as the truncLen input to trim the right end of reads based on quality. Note that we do this calculation separately for forward and reverse reads, since reverse reads tend to have lower overall quality and an earlier quality drop-off.
Once we have determined the trim lengths for our forward and reverse reads, we use DADA2's filterAndTrim function to trim sequences based on quality and filter out sequences with too many errors or Ns. Our code for this is as follows:
### Filter and Trim ---------------------------------------------------------------
print(paste0("Starting filter and trim...", Sys.time()))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
trimRight = c(primer.data$primer_length_r[i],primer.data$primer_length_f[i]), # trim the reverse primer length off forward reads and the forward primer length off reverse reads
truncLen = c(where_trim_all_Fs,where_trim_all_Rs),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE, matchIDs=TRUE)
print(paste0("Finished filter and trim ", Sys.time()))
See additional documentation for this function in the DADA2 tutorial. In the above code, the first four parameters specify the paths to the input and output files for the forward and reverse reads. The additional options we've included are:
- trimRight: we include this to trim the primer from the right side of each read. Cutadapt only trims primers from the left side of each read, and for shorter amplicons (e.g. MiFish) the right-side primer can extend beyond the length of the merged sequence, causing downstream errors.
- truncLen: this option tells DADA2 where to truncate forward and reverse reads. The truncation lengths were determined in the previous step, using an automated sliding average to find where the Q score falls below the threshold specified in primer_data.csv.
- maxN, maxEE, truncQ, and rm.phix are all default quality-filtering options supplied by DADA2. We kept these defaults so that DADA2 discards any reads containing Ns, discards reads that exceed the expected-error threshold (see the DADA2 documentation for how expected errors are estimated), truncates reads at the first base with a quality score of 2 or lower (redundant in our code, but a good safeguard), and removes reads identified as phiX.
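It can also be worth glancing at how many reads survive filtering before moving on; the check below is illustrative and not part of our pipeline.
# 'out' is the matrix returned by filterAndTrim(): reads in (column 1)
# and reads passing the filter (column 2) for each sample.
head(out)
summary(out[,2] / out[,1]) # fraction of reads retained per sample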
Once reads have been filtered and trimmed, we use several DADA2 functions to dereplicate the reads, learn the error rates, infer sequence variants, and finally merge forward and reverse reads. Our code for these steps does not deviate significantly from the DADA2 tutorial.
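For reference, here is a minimal sketch of those steps as they appear in the DADA2 tutorial; object names follow the tutorial, and the exact arguments in our script may differ slightly.
errF <- learnErrors(filtFs, multithread=TRUE) # learn the forward error model
errR <- learnErrors(filtRs, multithread=TRUE) # learn the reverse error model
derepFs <- derepFastq(filtFs) # dereplicate forward reads
derepRs <- derepFastq(filtRs) # dereplicate reverse reads
dadaFs <- dada(derepFs, err=errF, multithread=TRUE) # infer forward sequence variants
dadaRs <- dada(derepRs, err=errR, multithread=TRUE) # infer reverse sequence variants
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE) # merge read pairs
seqtab <- makeSequenceTable(mergers) # build the sequence table
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE) # remove chimeras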
After merging reads, there will sometimes be sequences that are longer or shorter than your target sequence length. This can happen for many reasons: sample contamination, low-quality reads, etc. For example, when we were processing our data, we found bacterial sequences that were being amplified by primers designed to amplify fish DNA. Because these sequences were 100 bp longer than our target sequence, we were able to filter them out at this step.
We use this code to filter out reads by size:
### Filter by Size ---------------------------------------------------------------
indexes.to.keep <- which((nchar(colnames(seqtab.nochim)) <= primer.data$max_amplicon_length[i]) & ((nchar(colnames(seqtab.nochim))) >= primer.data$min_amplicon_length[i]))
cleaned.seqtab.nochim <- seqtab.nochim[,indexes.to.keep]
filteredout.seqtab.nochim <- seqtab.nochim[,-indexes.to.keep] # indexes.to.keep is a numeric index vector, so drop the kept columns with negative indexing
write.csv(filteredout.seqtab.nochim,paste0(output_location,"/logs/","filtered_out_asv.csv"))
The upper and lower bounds for length filtering are provided by primer_data.csv. The filtered-out ASVs are written to the file filtered_out_asv.csv in the logs directory.
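A quick way to inspect the merged-sequence length distribution when setting or checking those bounds is the one-liner used in the DADA2 tutorial:
# Tabulate the lengths of all merged ASVs before length filtering.
table(nchar(colnames(seqtab.nochim)))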
We use DADA2's naive Bayesian classifier to classify our final set of merged ASVs, using the assignTaxonomy function and a custom taxonomic reference database (stored in the metadata folder) for the specific primer we're targeting.
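A minimal sketch of that call is below; the reference path is a hypothetical placeholder, since the real database path comes from the primer's entry in the metadata folder.
# Sketch of the taxonomy-assignment step. 'metadata/example_reference.fasta'
# is a placeholder path, not the real reference database.
taxa <- assignTaxonomy(cleaned.seqtab.nochim, "metadata/example_reference.fasta",
multithread = TRUE, outputBootstraps = TRUE)
# taxa$tax holds the assignments; taxa$boot holds the bootstrap confidence values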
Once ASVs are classified, we store them in a database keyed by a unique sequence hash. In future runs of the same primer, the pipeline first checks this hash database to identify sequences that have already been classified. Removing those sequences from the set to be classified can greatly reduce the time needed for this step, which is one of the most time-consuming in the pipeline. Read more about how we generate the hash database below:
To keep track of sequences in multi-sample experiments, we apply a hashing algorithm to the sequences DADA2 finds. Hashing "translates" each DNA sequence into a short, fixed-length string of letters and numbers that is easier to store. To learn more about hashing algorithms, check out this link.
Here is the code that implements hashing on the ASVs we find:
### Create Hashing ---------------------------------------------------------------
# define output files
conv_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_hash_key.csv"))
conv_file.fasta <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_hash_key.fasta"))
ASV_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_ASV_table.csv"))
taxonomy_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_taxonomy_output.csv"))
bootstrap_file <- file.path(output_location,"logs",paste0(run_name,"_",primer.data$locus_shorthand[i],"_tax_bootstrap.csv"))
# create ASV table and hash key
print(paste0("creating ASV table and hash key...", Sys.time()))
seqtab.nochim.df <- as.data.frame(cleaned.seqtab.nochim)
Hashes <- map_chr(colnames(seqtab.nochim.df), ~ digest(.x, algo = "sha1", serialize = F, skip = "auto")) # sha1 hash of each ASV sequence (digest package)
conv_table <- tibble(Hash = Hashes,
Sequence = colnames(seqtab.nochim.df))
write_csv(conv_table, conv_file) # write hash key into a csv
write.fasta(sequences = as.list(conv_table$Sequence), # write hash key into a fasta
names = as.list(conv_table$Hash),
file.out = conv_file.fasta)
sample.df <- tibble::rownames_to_column(seqtab.nochim.df,"Sample_name")
sample.df <- data.frame(append(sample.df,c(Label=primer.data$locus_shorthand[i]), after = 1))
current_asv <- bind_cols(sample.df %>%
dplyr::select(Sample_name, Label),
seqtab.nochim.df)
current_asv <- current_asv %>%
pivot_longer(cols = c(- Sample_name, - Label),
names_to = "Sequence",
values_to = "nReads") %>%
filter(nReads > 0)
current_asv <- merge(current_asv,conv_table, by="Sequence") %>%
select(-Sequence) %>%
relocate(Hash, .after=Label)
write_csv(current_asv, ASV_file) # write asv table into a csv
We also create a known-sequence database for each primer. These databases (.csv files) are located in the muri_metabarcoding/metadata/known_hashes directory. Every time we assign taxonomy and find a new ASV whose assignment has 100% bootstrap confidence, we add that hash and its taxonomic assignment to that primer's known_hashes database. Because assigning taxonomy is the most computationally heavy part of the pipeline, these known hashes save substantial compute: before assigning taxonomy, we first check whether any of the new ASVs already appear in the known_hashes database and, if so, reuse the stored taxonomic assignment instead of classifying them again with DADA2.
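A rough sketch of how that lookup can work is below; the file name and column names are assumptions for illustration and may not match the exact schema of the known_hashes files.
library(readr)
library(dplyr)
# Hypothetical known-hashes file for the current primer (name is illustrative).
known <- read_csv(file.path("metadata", "known_hashes",
paste0(primer.data$locus_shorthand[i], "_known_hashes.csv")))
# Split the new ASVs into those with a stored assignment and those that
# still need to go through assignTaxonomy().
already_classified <- conv_table %>% filter(Hash %in% known$Hash)
needs_classification <- conv_table %>% filter(!(Hash %in% known$Hash))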