All data used in the GenEpi-BioTrain Virtual Training 7 session on March 19-20, 2024.
The exercises are available here:
Exercise session 1
Exercise session 2
These data can be acquired in three different ways:
- Clone the GitHub repository, which contains all the data for the exercises at once. The repository is found at https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7 and can be cloned using
  git clone git@github.com:ssi-dk/GenEpi-BioTrain_Virtual_Training_7.git
- Download the data from the EVA webpage for the session, under Session 1 -> exercises.
- Download the data for each exercise at the start of that exercise using wget. The commands are included in the instructions for each exercise.
The raw read files used in the exercises are too large to be hosted on EVA or GitHub and have to be downloaded from ENA.
To download the read data for the exercises, run the following lines.
Note: the files are large and the download will take a while. If you have screen installed on your system, it is convenient to run the download inside a screen session.
mkdir -p data
cd data
wget https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/raw/main/fastq_ftp_paths.txt
mkdir -p reads
cd reads
while IFS= read -r line; do wget "$line"; done < ../fastq_ftp_paths.txt
cd ..
These commands download a text file named fastq_ftp_paths.txt, which contains the paths to the FASTQ files on ENA, into the “data” folder,
create a subfolder named “reads”, and download those files into the “reads” folder.
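Large downloads are sometimes interrupted. The sketch below wraps the same loop in a function that skips empty lines and passes -c to wget, so that a rerun resumes partial files instead of starting over. The function name download_list and its two arguments are hypothetical and not part of the course material; the list file is assumed to hold one path per line, as described above.

```shell
# Download every path listed in a text file, one per line.
# Usage: download_list <list_file> <dest_dir>   (hypothetical helper)
download_list() {
    local list_file=$1
    local dest_dir=$2
    mkdir -p "$dest_dir"
    # IFS= and -r keep each line exactly as written; the "|| [ -n ... ]" part
    # also handles a last line that has no trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        # Skip empty lines instead of passing them to wget.
        if [ -z "$url" ]; then
            continue
        fi
        # -c resumes a partially downloaded file, -P sets the target folder.
        wget -c -P "$dest_dir" "$url"
    done < "$list_file"
}

# Example, run from inside the data folder:
#   download_list fastq_ftp_paths.txt reads
```

If screen is available, the call can be started inside a session (screen -S reads), detached with Ctrl-a d, and picked up again later with screen -r reads.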
Nucleotide sequences of the V3-V4 region of the 16S rRNA gene from 14 bacterial isolates from different species.
They can be downloaded from EVA under Session 1 -> Exercise,
or using:
mkdir 16s_data; cd 16s_data
wget https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/raw/main/16s_data/16s_sequences.fasta
cd ..
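A quick way to check the download is to count the FASTA header lines. This is a minimal sketch; count_sequences is a hypothetical helper name, and the expected result of 14 assumes the file holds exactly one sequence per isolate.

```shell
# Count the sequences in a FASTA file: every record starts with a ">" header line.
count_sequences() {
    grep -c '^>' "$1"
}

# Example:
#   count_sequences 16s_data/16s_sequences.fasta
```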
Draft assemblies for 22 Listeria monocytogenes isolates that have been part of an outbreak investigation.
The assemblies were generated from paired-end Illumina NextSeq reads using SPAdes in --careful
mode. Contigs shorter than 200 bp or with less than 10x k-mer coverage were removed from the assemblies.
They can be downloaded from EVA under Session 1 -> Exercise,
or using wget on the command line to download from GitHub:
wget "https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/raw/main/assemblies.tar.gz"
To extract the archive, use
tar -xf assemblies.tar.gz
This should create a folder named “assemblies” containing 22 fasta files.
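To confirm that the extraction worked, the files in the new folder can be counted. The helper name count_files is hypothetical, and the sketch assumes the archive contains nothing but the assembly files, in which case the count should be 22.

```shell
# Count the regular files below a folder (hypothetical helper).
# tr strips the padding that some wc implementations add around the number.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Examples:
#   tar -tzf assemblies.tar.gz   # list the archive contents without extracting
#   count_files assemblies       # count the files after extraction
```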
fastq_ftp_paths.txt: A text file containing the paths to the FASTQ files hosted by ENA. See the instructions for downloading raw read files above.
The metadata folder contains three metadata files: one main file called metadata.tsv
and two more that are used as templates for tree annotation in iTOL.
These files can be downloaded from EVA under Session 2 -> Exercise,
or using wget on the command line to download from GitHub:
mkdir metadata
cd metadata
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/metadata/metadata.tsv
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/metadata/dataset_color_gradient_template.txt
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/metadata/dataset_color_strip_template.txt
cd ..
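Before using the metadata it can help to see which columns the main file provides. The sketch below prints one column name per line; list_columns is a hypothetical helper name, and it assumes that the first line of metadata.tsv is a tab-separated header row.

```shell
# Print the column names of a tab-separated file, one per line.
# Assumes the first line of the file is a header row.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n'
}

# Example:
#   list_columns metadata/metadata.tsv
```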
Three files are provided so that each exercise can be completed without having completed the previous ones. These are:
- core.aln: A precomputed core SNP file as produced by snippy.
- core_stripped.filtered_polymorphic_sites.fasta: A precomputed core SNP file with recombination removed using gubbins.
- ML_iqtree.treefile.nwk: A precomputed maximum likelihood tree file produced using iqtree.
These files can be downloaded from EVA under Session 2 -> Exercise,
or using wget on the command line to download from GitHub:
mkdir -p data; cd data
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/core.aln
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/core_stripped.filtered_polymorphic_sites.fasta
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/ML_iqtree.treefile.nwk
cd ..
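Since later exercises depend on these files, it can be worth checking that all three arrived and are not empty. The following is a sketch only; check_files is a hypothetical helper that reports each file and returns a non-zero status if any of them is missing or empty.

```shell
# Report whether each named file exists in a folder and is non-empty.
# Usage: check_files <dir> <file>...   (hypothetical helper)
check_files() {
    local dir=$1
    shift
    local missing=0
    for name in "$@"; do
        # -s is true only for files that exist and have a size above zero.
        if [ -s "$dir/$name" ]; then
            echo "ok       $name"
        else
            echo "missing  $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_files data core.aln core_stripped.filtered_polymorphic_sites.fasta ML_iqtree.treefile.nwk
```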