Too many unassigned OTUs #2
Dear Luuk,
This file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch looks fine to me.
I tried to create a bestmatch file containing the following lines:
ID ReferenceID BLAST score BLAST sim BLAST coverage
ASV_1;size=2362642 UDB05261093 1 1 178
ASV_2;size=1037588 MW856689 1 1 132
ASV_3;size=1412752 UDB01614261 0.5425 0.875 31
ASV_4;size=2201923 UDB03085057 1 1 160
ASV_5;size=3340601 UDB03119248 1 1 154
ASV_6;size=823557 UDB01261625 0.9823000000000001 0.9823000000000001 112
ASV_7;size=830877 MW214811 1 1 157
ASV_8;size=4359501 UDB05107909 1 1 151
ASV_9;size=408829 MZ016271 0.7246504 0.88372 41
ASV_10;size=176701 UDB05818296 1 1 179
ASV_11;size=162862 UDB04293913 1 1 141
ASV_12;size=1535429 MT991106 1 1 146
ASV_13;size=169945 UDB07371928 0.98324 0.98324 177
ASV_14;size=130846 UDB02651623 0.9607800000000001 0.9607800000000001 50
ASV_15;size=978833 UDB03975281 1 1 135
and run the command:
python dnabarcoder/dnabarcoder.py classify \
  -i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch \
  -c ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification \
  -cutoffs ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json
And here is what I have obtained:
ID ReferenceID kingdom phylum class order family genus species rank score cutoff confidence
ASV_1;size=2362642 UDB05261093 Fungi Ascomycota Dothideomycetes Dothideales unidentified unidentified unidentified order 1.0 0.925 0.5634
ASV_2;size=1037588 MW856689 Fungi Basidiomycota Tremellomycetes Tremellales Bulleribasidiaceae Vishniacozyma unidentified genus 1.0 0.969 0.4812
ASV_3;size=1412752 UDB01614261 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 0.5425 N/A N/A
ASV_4;size=2201923 UDB03085057 Fungi Basidiomycota Microbotryomycetes Kriegeriales Camptobasidiaceae Glaciozyma unidentified genus 1.0 0.969 0.4812
ASV_5;size=3340601 UDB03119248 Fungi Ascomycota Dothideomycetes Cladosporiales Cladosporiaceae Cladosporium unidentified genus 1.0 0.969 0.4812
ASV_6;size=823557 UDB01261625 Fungi Ascomycota Archaeorhizomycetes Archaeorhizomycetales Archaeorhizomycetaceae Archaeorhizomyces unidentified genus 0.9823000000000001 0.969 0.4812
ASV_7;size=830877 MW214811 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1.0 N/A N/A
ASV_8;size=4359501 UDB05107909 Fungi Ascomycota Dothideomycetes Mycosphaerellales Teratosphaeriaceae Devriesia unidentified genus 1.0 0.969 0.4812
ASV_9;size=408829 MZ016271 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 0.7246504 N/A N/A
ASV_10;size=176701 UDB05818296 Fungi Ascomycota Pezizomycetes Pezizales Pyronemataceae Pyronema unidentified genus 1.0 0.951 0.6336
ASV_11;size=162862 UDB04293913 Fungi Ascomycota Sordariomycetes Amphisphaeriales Sporocadaceae Pestalotiopsis unidentified genus 1.0 0.969 0.4812
ASV_12;size=1535429 MT991106 Fungi Ascomycota Sordariomycetes Hypocreales Nectriaceae Fusarium Fusarium solani species 1.0 0.988 0.8358
ASV_13;size=169945 UDB07371928 Fungi Ascomycota Pezizomycetes Pezizales Pezizales fam Incertae sedis Sphaerosoma unidentified genus 0.98324 0.969 0.4812
ASV_14;size=130846 UDB02651623 Fungi Basidiomycota Agaricomycetes Geastrales Geastraceae unidentified unidentified family 0.9607800000000001 0.933 0.46
ASV_15;size=978833 UDB03975281 Fungi Ascomycota Sordariomycetes Hypocreales Clavicipitaceae Metarhizium unidentified genus 1.0 0.984 0.6507
Could this be related to a memory problem? Can you also try running the
classify command with a small file to see if it works?
Otherwise, we can always set up a meeting if you prefer.
Best regards,
Duong
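A minimal sketch of the small-file test suggested above, assuming the bestmatch file is plain text with a single header line followed by one record per line. The function name `write_head` and the output file name in the usage note are hypothetical and not part of dnabarcoder.

```python
from itertools import islice


def write_head(src, dst, n_records=15):
    """Copy the header line plus the first n_records data lines of src to dst.

    Returns the number of data lines actually written.
    """
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        # Assumption: the first line is the column header and must be kept.
        fout.write(fin.readline())
        for line in islice(fin, n_records):
            fout.write(line)
            written += 1
    return written
```

For example, `write_head("dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch", "small.bestmatch")` would produce a 15-record file that can be passed to the `classify` command via `-i`, with the same `-c` and `-cutoffs` arguments as above.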
On Wed, 7 Aug 2024 at 12:20, Luke Florence wrote:
Hi Vuthuyduong,
I have followed your pipeline for the classification of some ASVs. My
reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10
database that you prepared (thank you!). However, most of my ASVs (~85%)
are unassigned at the fungi level after classification. This doesn’t make
sense to me, as the majority of the unassigned ASVs had coverage > 90% and
similarity > 95% when I previously BLASTed them. And a good portion had
coverage = 100% and similarity > 98%.
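The proportion of unassigned ASVs mentioned here can be checked directly from the classification output. The following is only a sketch, assuming the `.classification` file is tab-separated and has a header row with a `kingdom` column, as in the excerpt quoted in this thread; the function name `unassigned_fraction` is hypothetical.

```python
import csv


def unassigned_fraction(path, column="kingdom", label="unidentified"):
    """Return the fraction of rows whose `column` value equals `label`."""
    total = 0
    unassigned = 0
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: tab-separated text with a header row.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            total += 1
            if (row.get(column) or "").strip().lower() == label:
                unassigned += 1
    return unassigned / total if total else 0.0
```

A value near 0.85 for the full output would correspond to the roughly 85% of ASVs reported as unassigned at the kingdom level.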
Below is the head of the “bestmatch” file and the “classified” and
“classification” files. I have also included your script, which I slightly
modified to run on the cluster and fit my project (perhaps I made an error
here?), as well as the SLURM file.
There is one error in the SLURM file: “sh: ImportText.pl: command not
found”. I think this is related to the krona.html file, which was not
written.
Have I made an error, or do I not understand how the classification is
supposed to work?
Thank you in advance for your help.
Luke
Script
# Constants and subdirectories
readonly THREADS=8
readonly REFERENCE_SEQUENCES="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta"
# Note: BEST_MATCH holds the cutoffs JSON file, not a bestmatch file
readonly BEST_MATCH="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json"
readonly CLASSIFIER="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification"
readonly QUERY_SEQUENCES="../../data/bioinformatics/08.ASVs/ASVs.fasta"
readonly OUTPUT="../../data/bioinformatics/09.Taxonomy"
# log is a helper defined elsewhere in the original script (not shown here)
log 'Starting at:'
# Search for the best matches of the sequences
python dnabarcoder/dnabarcoder.py search \
  -i "$QUERY_SEQUENCES" \
  -r "$REFERENCE_SEQUENCES" \
  -ml 50
# Assign the sequences to different taxonomic groups
python dnabarcoder/dnabarcoder.py classify \
  -i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch \
  -c "$CLASSIFIER" \
  -cutoffs "$BEST_MATCH"
# Move the classification files to the taxonomy subdirectory
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified \
  "$OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt"
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification \
  "$OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt"
log 'Finished at:'
SLURM
Starting at: Sat Jul 27 06:40:17 AEST 2024
Building a new DB, current time: 07/27/2024 06:44:02
New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
Keep MBits: T
Maximum file size: 3000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57
FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7
Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.
makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
sh: ImportText.pl: command not found
Number of classified sequences: 22362
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification.
The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.
Finished at: Sat Jul 27 20:39:15 AEST 2024
Bestmatch file
ID ReferenceID BLAST score BLAST sim BLAST coverage
ASV_1;size=2362642 UDB05261093 1 1 178
ASV_2;size=1037588 MW856689 1 1 132
ASV_3;size=1412752 UDB01614261 0.5425 0.875 31
ASV_4;size=2201923 UDB03085057 1 1 160
ASV_5;size=3340601 UDB03119248 1 1 154
ASV_6;size=823557 UDB01261625 0.9823000000000001 0.9823000000000001 112
ASV_7;size=830877 MW214811 1 1 157
ASV_8;size=4359501 UDB05107909 1 1 151
ASV_9;size=408829 MZ016271 0.7246504 0.88372 41
ASV_10;size=176701 UDB05818296 1 1 179
ASV_11;size=162862 UDB04293913 1 1 141
ASV_12;size=1535429 MT991106 1 1 146
ASV_13;size=169945 UDB07371928 0.98324 0.98324 177
ASV_14;size=130846 UDB02651623 0.9607800000000001 0.9607800000000001 50
ASV_15;size=978833 UDB03975281 1 1 135
Classification file
ID ReferenceID kingdom phylum class order family genus species rank score cutoff confidence
ASV_1;size=2362642 UDB05261093 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_2;size=1037588 MW856689 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_3;size=1412752 UDB01614261 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 0.5425 N/A N/A
ASV_4;size=2201923 UDB03085057 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_5;size=3340601 UDB03119248 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_6;size=823557 UDB01261625 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 0.9823000000000001 N/A N/A
ASV_7;size=830877 MW214811 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_8;size=4359501 UDB05107909 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_9;size=408829 MZ016271 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 0.7246504 N/A N/A
ASV_10;size=176701 UDB05818296 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_11;size=162862 UDB04293913 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_12;size=1535429 MT991106 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
ASV_13;size=169945 UDB07371928 Fungi Ascomycota Pezizomycetes Pezizales Pezizales fam Incertae sedis Sphaerosoma unidentified genus 0.98324 0.969 0.4812
ASV_14;size=130846 UDB02651623 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 0.9607800000000001 N/A N/A
ASV_15;size=978833 UDB03975281 unidentified unidentified unidentified unidentified unidentified unidentified unidentified 1 N/A N/A
Classified file
ID Given label Prediction Full classification Rank Cut-off Confidence ReferenceID BLAST score BLAST sim BLAST coverage
ASV_1;size=2362642 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB05261093 1 1 178
ASV_2;size=1037588 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MW856689 1 1 132
ASV_3;size=1412752 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB01614261 0.5425 0.875 31
ASV_4;size=2201923 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB03085057 1 1 160
ASV_5;size=3340601 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB03119248 1 1 154
ASV_6;size=823557 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB01261625 0.9823000000000001 0.9823000000000001 112
ASV_7;size=830877 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MW214811 1 1 157
ASV_8;size=4359501 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB05107909 1 1 151
ASV_9;size=408829 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MZ016271 0.7246504 0.88372 41
ASV_10;size=176701 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB05818296 1 1 179
ASV_11;size=162862 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB04293913 1 1 141
ASV_12;size=1535429 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A MT991106 1 1 146
ASV_13;size=169945 Sphaerosoma k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pezizales_fam_Incertae_sedis;g__Sphaerosoma;s__unidentified genus 0.969 0.4812 UDB07371928 0.98324 0.98324 177
ASV_14;size=130846 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB02651623 0.9607800000000001 0.9607800000000001 50
ASV_15;size=978833 k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified N/A N/A UDB03975281 1 1 135
For the Krona problem, I have noticed that it is related to the setup of
Krona. I will change the code to make it work in any environment.
Best regards
Duong
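The message "sh: ImportText.pl: command not found" in the SLURM log indicates that the shell could not find that script on PATH when the Krona report was generated. A quick way to check an environment before running the pipeline is sketched below. The function name `find_krona_importer` is hypothetical, and `ktImportText` is included only on the assumption that the local Krona installation exposes the importer under that name.

```python
import shutil


def find_krona_importer(names=("ImportText.pl", "ktImportText"), path=None):
    """Return the full path of the first importer found on PATH, or None.

    `path` may be a PATH-style string to search instead of the
    environment's PATH (useful for testing).
    """
    for name in names:
        found = shutil.which(name, path=path)
        if found:
            return found
    return None


if __name__ == "__main__":
    location = find_krona_importer()
    if location is None:
        print("Krona importer not found on PATH; the krona.html step is likely to fail.")
    else:
        print("Krona importer found at", location)
```

If the function returns None, adding the Krona scripts directory to PATH in the SLURM job script is one possible workaround until the code is changed.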
Dear Duong,
When I run the classification step on the first 15 ASVs (as you have done above), we get different results. My results are the same as when I ran the entire bestmatch file, so the issue is not a memory problem. I expected this, because I requested a massive amount of memory from the cluster.
However, I suspect the issue could be with the unite2024ITS1.classification file (https://zenodo.org/records/12580255/files/unite2024ITS1.classification?download=1) that I downloaded from Zenodo along with the unite2024ITS1.fasta file. I noticed that the unite2024ITS1.classification file is 82.5 MB, whereas the ITS and ITS2 versions are 262.6 MB and 229.4 MB, respectively. How large is the unite2024ITS1.classification file that you are using?
I think the unite2024ITS1.classification file did not upload correctly to Zenodo, resulting in many reference taxa being missing. For example, of the classifications that you get, only ASV_13;size=169945 (UDB07371928, Sphaerosoma) is found in my unite2024ITS1.classification file. This makes sense because that is the only classification we have in common (the rest of mine remain unidentified at the kingdom level).
Could you please either reupload the unite2024ITS1.classification file to Zenodo or share a folder with me to access the unite2024ITS1.classification file that you use?
Thank you kindly in advance for your time.
Warm regards
Luke
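The check described in this comment, looking up each best-match ReferenceID in the reference classification file, can be sketched as follows. This is an illustration only: it assumes both files are tab-separated with a header row, that ReferenceID is the second column of the bestmatch file, and that the sequence ID is the first column of the reference `.classification` file. The function names are hypothetical.

```python
import csv


def read_column(path, column_index, delimiter="\t"):
    """Return the set of values in one column, skipping the header row."""
    values = set()
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header line
        for row in reader:
            if len(row) > column_index:
                values.add(row[column_index].strip())
    return values


def missing_references(bestmatch_path, classification_path):
    """List ReferenceIDs from the bestmatch file absent from the classification file."""
    matched = read_column(bestmatch_path, 1)      # assumed ReferenceID column
    known = read_column(classification_path, 0)   # assumed sequence ID column
    return sorted(matched - known)
```

A long list of missing IDs, together with a classification file much smaller than its ITS and ITS2 counterparts, would be consistent with an incomplete reference file rather than a problem in the classify step itself.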
Dear Luke,
Thank you very much. Yes, you were right. Somehow, the upload of the
*unite2024ITS1.classification* file went wrong. I've created a new Zenodo
record available at https://zenodo.org/records/13336328, containing all
UNITE ITS1, ITS2, and ITS sequences, along with their classifications,
ready for use with *dnabarcoder*. Please let me know if you still encounter
any issues.
Best,
Duong
Hi Duong,
Thank you for updating the classification file. My classification output now makes sense.
Kind regards