Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many unassigned OTUs #2

Open
LukeLikesDirt opened this issue Aug 7, 2024 · 5 comments
Open

Too many unassigned OTUs #2

LukeLikesDirt opened this issue Aug 7, 2024 · 5 comments

Comments

@LukeLikesDirt
Copy link

Hi Vuthuyduong,

I have followed your pipeline for the classification of some ASVs. My reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10 database that you prepared (thank you!). However, most of my ASVs (~85%) are unassigned at the fungi level after classification. This doesn’t make sense to me, as the majority of the unassigned ASVs had coverage > 90% and similarity > 95% when I previously BLASTed them. And a good portion had coverage = 100% and similarity > 98%.

Below is the head of the “bestmatch” file and the “classified” and “classification” files. I have also included your script, which I slightly modified to run on the cluster and fit my project (perhaps I made an error here?), as well as the SLURM file.

There is one error in the SLURM file: “sh: ImportText.pl: command not found”. I think this is related to the krona.html file, which was not written.

Have I made an error, or do I not understand how the classification is supposed to work?

Thank you in advance for your help.

Luke

Script

# Constants and subdirectories
readonly THREADS=8
readonly REFERENCE_SEQUENCES="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta"
readonly BEST_MATCH="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json"
readonly CLASSIFIER="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification"
readonly QUERY_SEQUENCES="../../data/bioinformatics/08.ASVs/ASVs.fasta"
readonly OUTPUT="../../data/bioinformatics/09.Taxonomy"

log 'Starting at:'

# Search for the best matches of the sequences
python dnabarcoder/dnabarcoder.py search
-i $QUERY_SEQUENCES
-r $REFERENCE_SEQUENCES
-ml 50

# Assign the sequences to different taxonomic groups
python dnabarcoder/dnabarcoder.py classify
-i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
-c $CLASSIFIER
-cutoffs $BEST_MATCH

# Move the classification files to the taxonomy subdirectory
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified $OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification $OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt

log 'Finished at:'

SLURM

Starting at: Sat Jul 27 06:40:17 AEST 2024

Building a new DB, current time: 07/27/2024 06:44:02
New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
Keep MBits: T
Maximum file size: 3000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57
FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7
Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.

makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch

sh: ImportText.pl: command not found
Number of classified sequences: 22362
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification.
The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.

Finished at: Sat Jul 27 20:39:15 AEST 2024

Bestmatch file

<style> </style>
ID ReferenceID BLAST score BLAST sim BLAST coverage
ASV_1;size=2362642 UDB05261093 1 1 178
ASV_2;size=1037588 MW856689 1 1 132
ASV_3;size=1412752 UDB01614261 0.5425 0.875 31
ASV_4;size=2201923 UDB03085057 1 1 160
ASV_5;size=3340601 UDB03119248 1 1 154
ASV_6;size=823557 UDB01261625 0.9823000000000001 0.9823000000000001 112
ASV_7;size=830877 MW214811 1 1 157
ASV_8;size=4359501 UDB05107909 1 1 151
ASV_9;size=408829 MZ016271 0.7246504 0.88372 41
ASV_10;size=176701 UDB05818296 1 1 179
ASV_11;size=162862 UDB04293913 1 1 141
ASV_12;size=1535429 MT991106 1 1 146
ASV_13;size=169945 UDB07371928 0.98324 0.98324 177
ASV_14;size=130846 UDB02651623 0.9607800000000001 0.9607800000000001 50
ASV_15;size=978833 UDB03975281 1 1 135

Classification file

<style> </style>
ID ReferenceID kingdom phylum class order family genus species rank score cutoff confidence
ASV_1;size=2362642 UDB05261093 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_2;size=1037588 MW856689 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_3;size=1412752 UDB01614261 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   0.5425 N/A N/A
ASV_4;size=2201923 UDB03085057 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_5;size=3340601 UDB03119248 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_6;size=823557 UDB01261625 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   0.9823000000000001 N/A N/A
ASV_7;size=830877 MW214811 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_8;size=4359501 UDB05107909 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_9;size=408829 MZ016271 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   0.7246504 N/A N/A
ASV_10;size=176701 UDB05818296 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_11;size=162862 UDB04293913 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_12;size=1535429 MT991106 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A
ASV_13;size=169945 UDB07371928 Fungi Ascomycota Pezizomycetes Pezizales Pezizales fam Incertae sedis Sphaerosoma unidentified genus 0.98324 0.969 0.4812
ASV_14;size=130846 UDB02651623 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   0.9607800000000001 N/A N/A
ASV_15;size=978833 UDB03975281 unidentified unidentified unidentified unidentified unidentified unidentified unidentified   1 N/A N/A

Classified file

<style> </style>
ID Given label Prediction Full classification Rank Cut-off Confidence ReferenceID BLAST score BLAST sim BLAST coverage
ASV_1;size=2362642   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB05261093 1 1 178
ASV_2;size=1037588   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MW856689 1 1 132
ASV_3;size=1412752   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB01614261 0.5425 0.875 31
ASV_4;size=2201923   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB03085057 1 1 160
ASV_5;size=3340601   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB03119248 1 1 154
ASV_6;size=823557   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB01261625 0.9823000000000001 0.9823000000000001 112
ASV_7;size=830877   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MW214811 1 1 157
ASV_8;size=4359501   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB05107909 1 1 151
ASV_9;size=408829   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MZ016271 0.7246504 0.88372 41
ASV_10;size=176701   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB05818296 1 1 179
ASV_11;size=162862   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB04293913 1 1 141
ASV_12;size=1535429   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MT991106 1 1 146
ASV_13;size=169945 Sphaerosoma k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pezizales_fam_Incertae_sedis;g__Sphaerosoma;s__unidentified genus 0.969 0.4812 UDB07371928 0.98324 0.98324 177
ASV_14;size=130846   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB02651623 0.9607800000000001 0.9607800000000001 50
ASV_15;size=978833   k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB03975281 1 1 135
@vuthuyduong
Copy link
Owner

vuthuyduong commented Aug 7, 2024 via email

@vuthuyduong
Copy link
Owner

vuthuyduong commented Aug 7, 2024 via email

@LukeLikesDirt
Copy link
Author

Dear Duong

When I run the classification step on the first 15 ASVs (as you have done above), we get different results. My results are the same as when I ran the entire best match file, so the issue is not a memory problem, which I expected because I requested a massive amount of memory from the cluster.

However, I suspect the issue could be with the unite2024ITS1.classification file that I downloaded from Zenodo along with the unite2024ITS1.fasta file. I noticed that the unite2024ITS1.classification file is 82.5 MB, whereas it is 262.6 MB and 229.4 MB for the ITS and ITS2 versions, respectively. How large is the unite2024ITS1.classification file that you are using?

I think the unite2024ITS1.classification file did not upload correctly to Zenodo, resulting in many reference taxa being missing. For example, of the classifications that you get, only ASV_13;size=169945 (UDB07371928, Sphaerosoma) is found in my unite2024ITS1.classification file. This makes sense because that is the only classification we have in common (the rest of mine remain unidentified at the kingdom level).

Could you please either reupload the unite2024ITS1.classification file to Zenodo or share a folder with me to access the unite2024ITS1.classification file that you use?

Thank you kindly in advance for your time.

Warm regards
Luke

@vuthuyduong
Copy link
Owner

vuthuyduong commented Aug 17, 2024 via email

@LukeLikesDirt
Copy link
Author

Hi Duong

Thank you for updating the classification file. My classification output now makes sense.

Kind regards
Luke

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants