DADA2 Documentation
We use the DADA2 software to assign taxonomy to amplicon sequences. However, some processing is necessary before we can assign taxonomy. A brief overview of the process is as follows:
- Determine quality trimming length
- Filter and trim for quality and length
- Dereplicate
- Learn error rates
- Sample inference
- Merge paired reads
- Remove chimeras
- Filter merged reads by size
- Assign Taxonomy
- Create and Save Output Files
Much of our DADA2 script is adapted from the DADA2 tutorial, which you can consult for further explanation. However, we modified and extended the existing tools to further automate and refine the process for our project's needs. Below is documentation for the code that deviates from the DADA2 tutorial.
Sequence data produced on many platforms, including the very popular Illumina MiSeq, tend to have lower-quality base calls toward the end of each read. This is especially true of longer reads (300+ bp). Therefore, in the first QA/QC step the raw reads are trimmed to remove poor-quality bases at the end of each read. To determine the truncation length for forward and reverse reads, we calculate the mean quality score along a sliding 10 bp window over the entire read, averaging each window across all forward and reverse reads in the dataset. The first window where the mean Q score falls below a specified threshold (F_qual and R_qual in primer_data.csv, optimized on a per-primer basis) is selected as the position where DADA2 should truncate reads in the filterAndTrim step.
In a typical DADA2 workflow, you would look at these quality plots and manually decide where to trim. Ideally, you want to strike a balance between keeping enough sequence length to successfully merge forward and reverse reads and trimming off low-quality bases. As you might guess, there is a lot of discourse about how to choose these parameters.
To aid in reproducibility and automation, we created a simple algorithm that decides where to trim forward and reverse reads. Below is the code:
n <- 500000 # number of reads sampled from each fastq file when summarizing quality (qa() is from the ShortRead package)
trimsF <- c()
for(f in fnFs[!is.na(fnFs)]) {
srqa <- qa(f, n=n)
df <- srqa[["perCycle"]]$quality # per-cycle quality summary (the slowest part of this loop)
means <- rowsum(df$Score*df$Count, df$Cycle)/rowsum(df$Count, df$Cycle) #calculate mean quality at each cycle
window_values <- c()
window_size <- 10
for(j in 1:(length(means)-window_size)){
window_values[j] <- mean(means[j:(j+window_size)]) #calculate mean qual score in each window and add to window_values vector
}
where_to_cut <- min(which(window_values < primer.data$F_qual[i])) #trim at the first position of the first window where mean quality dips below the F_qual threshold
trimsF[f] <- where_to_cut
}
where_trim_all_Fs <- median(trimsF) #median of all per-file trims - use this as truncLen for forward reads
trimsR <- c()
for(r in fnRs[!is.na(fnRs)]) {
srqa <- qa(r, n=n)
df <- srqa[["perCycle"]]$quality
means <- rowsum(df$Score*df$Count, df$Cycle)/rowsum(df$Count, df$Cycle)
window_size <- 10 #same window size as for the forward reads
window_values <- c()
for(j in 1:(length(means)-window_size)){
window_values[j] <- mean(means[j:(j+window_size)]) #calculate mean qual score in each window and add to window_values vector
}
where_to_cut <- min(which(window_values < primer.data$R_qual[i]))
trimsR[r] <- where_to_cut
}
where_trim_all_Rs <- median(trimsR)
# Cap at max_trim if the obtained trim position is greater than max_trim
if(where_trim_all_Fs > primer.data$max_trim[i]){
where_trim_all_Fs <- primer.data$max_trim[i]
}
if(where_trim_all_Rs > primer.data$max_trim[i]){
where_trim_all_Rs <- primer.data$max_trim[i]
}
What this does is:
- Determine the quality at each base - The qa() function calculates the quality score at each position of each read.
- Find the average quality at each position - rowsum() is then used to compute the mean quality at each position across all reads.
- Use a sliding window to determine where to trim - A window of size 10 is slid along the vector of mean qualities, and the mean quality score within each window is calculated. Once that mean falls below a threshold (supplied by the F_qual and R_qual columns in the primer_data.csv metadata input file), the first position of that window is added to a vector called trimsF/trimsR.
- Repeat with all samples - This sliding-window calculation is repeated for every sample, and each trim position is appended to trimsF/trimsR.
- Use the median to determine where to trim all samples - The median of trimsF/trimsR is then used as the truncLen input to trim the right end of reads based on quality. Note that we do this calculation separately for forward and reverse reads, since reverse reads tend to have lower overall quality and an earlier quality drop-off.
Once we have determined the trim lengths for our forward and reverse reads, we use DADA2's filterAndTrim function to trim sequences based on quality and filter out sequences with too many errors or Ns. Our code for this is as follows:
### Filter and Trim ---------------------------------------------------------------
print(paste0("Starting filter and trim...", Sys.time()))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
trimRight = c(primer.data$primer_length_r[i],primer.data$primer_length_f[i]), # trim the reverse primer length off forward reads and the forward primer length off reverse reads
truncLen = c(where_trim_all_Fs,where_trim_all_Rs),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE, matchIDs=TRUE)
print(paste0("Finished filter and trim ", Sys.time()))
See additional documentation for this function in the DADA2 tutorial. In the above code, the first four parameters specify the paths to the input and output files for the forward and reverse reads. The additional options we've included are:
- trimRight: we include this to trim the primer from the right side of each read. Cutadapt only trims primers from the left side of each read, and for shorter amplicons (e.g. MiFish) the right-side primer can extend beyond the length of the merged sequence, causing downstream errors.
- truncLen: this option tells DADA2 where to truncate forward and reverse reads. The truncation lengths were determined in the previous step, using an automated sliding average to find where the Q score falls below the threshold specified in primer_data.csv.
- maxN, maxEE, truncQ, and rm.phix are all default quality-filtering options supplied by DADA2. We kept these defaults so that DADA2 discards any reads containing Ns, discards reads that exceed the expected-error threshold (see the DADA2 documentation for how expected errors are estimated), truncates reads at the first base with a quality score of 2 or lower (redundant in our code, but a good safeguard), and removes reads identified as phiX.
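It can also be worth glancing at how many reads survive filtering before moving on; the check below is illustrative and not part of our pipeline.
# 'out' is the matrix returned by filterAndTrim(): reads in (column 1)
# and reads passing the filter (column 2) for each sample.
head(out)
summary(out[,2] / out[,1]) # fraction of reads retained per sample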
Once reads have been filtered and trimmed, we use several DADA2 functions to dereplicate the reads, learn the error rates, infer sequence variants, and finally merge forward and reverse reads. Our code for these steps does not deviate significantly from the DADA2 tutorial.
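For reference, here is a minimal sketch of those steps as they appear in the DADA2 tutorial; object names follow the tutorial, and the exact arguments in our script may differ slightly.
errF <- learnErrors(filtFs, multithread=TRUE) # learn the forward error model
errR <- learnErrors(filtRs, multithread=TRUE) # learn the reverse error model
derepFs <- derepFastq(filtFs) # dereplicate forward reads
derepRs <- derepFastq(filtRs) # dereplicate reverse reads
dadaFs <- dada(derepFs, err=errF, multithread=TRUE) # infer forward sequence variants
dadaRs <- dada(derepRs, err=errR, multithread=TRUE) # infer reverse sequence variants
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE) # merge read pairs
seqtab <- makeSequenceTable(mergers) # build the sequence table
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE) # remove chimeras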
After merging reads, there will sometimes be sequences that are longer or shorter than your target sequence length. This can happen for many reasons: sample contamination, low-quality reads, etc. For example, when we were processing our data, we found bacterial sequences that were being amplified by primers designed to amplify fish DNA. Because these sequences were 100 bp longer than our target sequence, we were able to filter them out at this step.
We use this code to filter out reads by size:
### Filter by Size ---------------------------------------------------------------
indexes.to.keep <- which((nchar(colnames(seqtab.nochim)) <= primer.data$max_amplicon_length[i]) & ((nchar(colnames(seqtab.nochim))) >= primer.data$min_amplicon_length[i]))
cleaned.seqtab.nochim <- seqtab.nochim[,indexes.to.keep]
filteredout.seqtab.nochim <- seqtab.nochim[,-indexes.to.keep] # indexes.to.keep is a numeric index vector, so drop the kept columns with negative indexing
write.csv(filteredout.seqtab.nochim,paste0(output_location,"/logs/","filtered_out_asv.csv"))
The upper and lower bounds for length filtering are provided by primer_data.csv. The filtered-out ASVs are written to the file filtered_out_asv.csv in the logs directory.
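A quick way to inspect the merged-sequence length distribution when setting or checking those bounds is the one-liner used in the DADA2 tutorial:
# Tabulate the lengths of all merged ASVs before length filtering.
table(nchar(colnames(seqtab.nochim)))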
We use DADA2's naive Bayesian classifier to classify our final set of merged ASVs, using the assignTaxonomy function and a custom taxonomic reference database (stored in the metadata folder) for the specific primer we're targeting.
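A minimal sketch of that call is below; the reference path is a hypothetical placeholder, since the real database path comes from the primer's entry in the metadata folder.
# Sketch of the taxonomy-assignment step. 'metadata/example_reference.fasta'
# is a placeholder path, not the real reference database.
taxa <- assignTaxonomy(cleaned.seqtab.nochim, "metadata/example_reference.fasta",
multithread = TRUE, outputBootstraps = TRUE)
# taxa$tax holds the assignments; taxa$boot holds the bootstrap confidence values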
Once ASVs are classified, we store them in a database keyed by a unique sequence hash. In future runs of the same primer, the pipeline first checks this hash database to identify sequences that have already been classified. Removing those sequences from the set to be classified can greatly reduce the time needed for this step, which is one of the most time-consuming in the pipeline. Read more about how we generate the hash database below:
To keep track of sequences in multi-sample experiments, we apply a hashing algorithm to the sequences DADA2 finds. Hashing "translates" each DNA sequence into a short, fixed-length string of letters and numbers that is easier to store. To learn more about hashing algorithms, check out this link.
Here is the code that implements hashing on the ASVs we find:
### Create Hashing ---------------------------------------------------------------
# define output files
conv_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_hash_key.csv"))
conv_file.fasta <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_hash_key.fasta"))
ASV_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_ASV_table.csv"))
taxonomy_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_taxonomy_output.csv"))
bootstrap_file <- file.path(output_location,"logs",paste0(run_name,"_",primer.data$locus_shorthand[i],"_tax_bootstrap.csv"))
# create ASV table and hash key
print(paste0("creating ASV table and hash key...", Sys.time()))
seqtab.nochim.df <- as.data.frame(cleaned.seqtab.nochim)
Hashes <- map_chr(colnames(seqtab.nochim.df), ~ digest(.x, algo = "sha1", serialize = F, skip = "auto")) # sha1 hash of each ASV sequence (digest package)
conv_table <- tibble(Hash = Hashes,
Sequence = colnames(seqtab.nochim.df))
write_csv(conv_table, conv_file) # write hash key into a csv
write.fasta(sequences = as.list(conv_table$Sequence), # write hash key into a fasta
names = as.list(conv_table$Hash),
file.out = conv_file.fasta)
sample.df <- tibble::rownames_to_column(seqtab.nochim.df,"Sample_name")
sample.df <- data.frame(append(sample.df,c(Label=primer.data$locus_shorthand[i]), after = 1))
current_asv <- bind_cols(sample.df %>%
dplyr::select(Sample_name, Label),
seqtab.nochim.df)
current_asv <- current_asv %>%
pivot_longer(cols = c(- Sample_name, - Label),
names_to = "Sequence",
values_to = "nReads") %>%
filter(nReads > 0)
current_asv <- merge(current_asv,conv_table, by="Sequence") %>%
select(-Sequence) %>%
relocate(Hash, .after=Label)
write_csv(current_asv, ASV_file) # write asv table into a csv
We also create a known-sequence database for each primer. These databases (.csv files) are located in the muri_metabarcoding/metadata/known_hashes directory. Every time we assign taxonomy and find a new ASV whose assignment has 100% bootstrap confidence, we add that hash and its taxonomic assignment to that primer's known_hashes database. Because assigning taxonomy is the most computationally heavy part of the pipeline, these known hashes save substantial compute: before assigning taxonomy, we first check whether any of the new ASVs already appear in the known_hashes database and, if so, reuse the stored taxonomic assignment instead of classifying them again with DADA2.
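A rough sketch of how that lookup can work is below; the file name and column names are assumptions for illustration and may not match the exact schema of the known_hashes files.
library(readr)
library(dplyr)
# Hypothetical known-hashes file for the current primer (name is illustrative).
known <- read_csv(file.path("metadata", "known_hashes",
paste0(primer.data$locus_shorthand[i], "_known_hashes.csv")))
# Split the new ASVs into those with a stored assignment and those that
# still need to go through assignTaxonomy().
already_classified <- conv_table %>% filter(Hash %in% known$Hash)
needs_classification <- conv_table %>% filter(!(Hash %in% known$Hash))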