This is the README for the 2022 PIRE Omics Workshop. It contains a link to the daily schedule, information on the workshop participants and personnel, links and installation directions for necessary software/programs, and a guide to preparing, processing, and analyzing shotgun data. During this workshop, we will be following the PIRE SSL and PIRE Pre-Processing pipelines, although both have been modified slightly for the purposes of the workshop.
Here is a link to the daily schedule.
Here is the link to the workshop lectures.
Here are links to pictures of the workshop and the day trip to Apo Island.
Workshop Survey (closed)
Position | Role | Name | Institution | Contact |
---|---|---|---|---|
Professor | Guest Lecturer | Kent Carpenter | ODU | kcarpent@odu.edu |
Professor | Instructor | Chris Bird | TAMUCC | cbird@odu.edu |
Postdoc | Instructor | Eric Garcia | ODU | e1garcia@odu.edu |
Postdoc | Instructor | Brendan Reid | Rutgers | br450@sebs.rutgers.edu |
PhD Student | Instructor | Rene Clark | Rutgers | rdc129@scarletmail.rutgers.edu |
PhD Student | Guest Lecturer | Kyra Fitz | Rutgers | ksf63@dls.rutgers.edu |
PhD Student | Guest Lecturer | John Whalen | ODU | jwhal002@odu.edu |
M.Sc Student | Guest Lecturer | George Bonsall | ODU | gbons002@odu.edu |
M.Sc Student | Instructor | Roy Roberts | TAMUCC | rroberts3@islander.tamucc.edu |
REU/Post-Bac | Guest Lecturer | Jordan Rodriguez | TAMUCC | jrodriguez216@islander.tamucc.edu |
Position | Name | Institution | Contact |
---|---|---|---|
REU/Post-Bac | Abigail Ethridge | ODU | aethr001@odu.edu |
REU/Post-Bac | Eryka Molino | ASU | emolino1230@gmail.com |
REU/Post-Bac | Rebecca Ruiz | TAMUCC | rruiz25@islander.tamucc.edu |
REU/Post-Bac | Allison Fink | Rutgers | agf63@scarletmail.rutgers.edu |
REU/Post-Bac | Marial Malabag | Rutgers | mjm751@scarletmail.rutgers.edu |
RA | Kevin L. Labrador | UP Mindanao | kllabrador@up.edu.ph |
RA | Maybelle A. Fortaleza | UP Mindanao | mafortaleza@up.edu.ph |
RA | Joemarie L. Lanutan | UP Mindanao | jjlanutan@up.edu.ph |
Professor | Cleto L. Nanola Jr. | UP Mindanao | clnanola@up.edu.ph |
M.Sc. Student | Omar Mahadalle | Silliman University | omaramahadalle@su.edu.ph
M.Sc. Student | Abner Bucol | Silliman University | abnerbucol2013@gmail.com
M.Sc. Student | Chandelle Jablonski | ODU | cjabl001@odu.edu |
M.Sc. Student | Nichole Leach | ODU | npete006@odu.edu |
In order to run the SSL pipeline and follow along with the workshop exercises, you will need to make sure you have the following accounts set up and programs installed on your local computer (if you intend to use one of the computers in the computer lab, you only need to complete step 1).
- Create a free GitHub account.
- Once you have your account, set up two-factor authentication.
- You will also need a personal access token (PAT) to use GitHub on the HPC cluster. To set this up, follow these instructions. MAKE SURE TO SAVE THIS TOKEN SOMEWHERE ON YOUR COMPUTER SO YOU CAN COPY-AND-PASTE!
- WINDOWS ONLY: Install a Linux Distribution on Windows using the Windows Subsystem for Linux. Follow these steps:
- Update Windows to the newest version (Windows 10 version 2004 or higher is required; you do not need Windows 11). To update, type "Check for Updates" in the taskbar search bar.
- Open "Windows PowerShell". You can search for it in the same location where you typed "Check for Updates". Open Windows PowerShell by right-clicking it and then clicking "Run as Administrator".
- In the PowerShell Terminal, run the following command (type it manually; do NOT copy and paste):
wsl --install
- After the command finishes, restart your computer. Once it has restarted, an Ubuntu terminal will open and finish the installation. The installation will take a few minutes.
- The terminal will ask for a Username and a Password. Use whatever Username you would like; it will become the name of the User directory. A password is not necessary; if you do not want to use one, just press Enter at both the "New Password" and "Retype Password" prompts.
- After installation is complete, download Windows Terminal.
- Windows Terminal will open PowerShell automatically. Click the "v" symbol next to the "+" (new tab) button and go to "Settings".
- The first option under "Startup" is "Default Profile". Change this to "Ubuntu" and save your changes.
- To open again, just type "Terminal" in the taskbar search bar and open the App.
- Install a text editor. Any free editor will work.
- Install R (v4.2.0 or 4.2.1).
- Install RStudio.
- Once R and RStudio are installed, install the following packages (with all dependencies): tidyverse and adegenet. A command-line sketch follows below.
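If you prefer to install the packages from the command line, here is a minimal sketch (assuming R installed successfully and `Rscript` is on your PATH):

```sh
# install the workshop R packages with all dependencies
Rscript -e 'install.packages(c("tidyverse", "adegenet"), dependencies = TRUE, repos = "https://cloud.r-project.org")'
```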
The workshop goes straight into analyzing genomic data and assumes participants already have a basic understanding of UNIX, the command line, and HPC environments. If you don't have this background, please check out this page with links to a few brief, introductory online modules.
If this is your first time working on Wahab/Turing (the ODU HPC clusters we will be using for the workshop), or you want to check out some cool tips, see the Working on ODU's HPC Repository.
Here we go!
Start the workshop by cloning this repository to your working directory. We recommend setting up a `shotgun_PIRE` directory in your home directory first and then cloning this repository as a subdirectory of that.
- Example:
/home/youruserID/shotgun_PIRE/
First, log on to Wahab. If you are using the Ubuntu terminal, search for "Terminal" on your computer and enter your password (if you have one). If you are on a Mac, you can just search for "Terminal". Then:
ssh your_user_name@wahab.hpc.odu.edu
#enter your password
Next, clone this directory:
cd ~ #this will take you to your home directory
mkdir shotgun_PIRE
cd shotgun_PIRE
git clone https://github.com/philippinespire/2022_PIRE_omics_workshop.git
#you will be prompted for your username and password. The username is your GitHub username. The password is a PAT (personal access token) associated with your GitHub account.
If you do NOT have a PAT yet, follow these instructions. MAKE SURE TO SAVE THIS TOKEN SOMEWHERE ON YOUR COMPUTER SO YOU CAN COPY-AND-PASTE!
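Optionally, you can have git cache the PAT in memory so you are not prompted on every push and pull. This is just a convenience; the timeout (in seconds) is up to you:

```sh
# cache credentials in memory for 8 hours (28800 seconds)
git config --global credential.helper 'cache --timeout=28800'
```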
Now you have the files you need to start working!
Organization is important - for the workshop, each person will create and work in their own sub-directory in this repository.
Make your own directory:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop
mkdir <your_name_with_no_spaces>
Please Note: do not use spaces or special characters (#$%^&~"[])(+'|=) in the names of files or directories. Stick to letters, numbers, dashes (-), and underscores (_).
Next, you are going to copy the `student_README.md` file from the `2022_PIRE_omics_workshop` directory to your own directory. The `README.md` serves as a log where you record the code you ran for each step, as well as any important notes (errors you ran into, data quality assessments, etc.). You will modify this README to match your code and results as you run each step of the pipeline.
- During the workshop, all participants will be running data from the same species (Salarias fasciatus) through the SSL pipeline. Thus, everyone's output should look the same at each step! The `salarias_fasciatus` directory, and the `README.md` document within, serve as a template for you to check your results against. You will also be reading data in from this directory at various steps throughout the workshop, to speed up the process.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name #this should be the full path to the directory you just created
cp ../student_README.md .
As you start processing your data and making changes, you will need to continuously update your progress on GitHub (so that others can see your work)! To do this, you will always (1) PULL down the latest version of the repo every time you begin work and (2) PUSH any changes you make every time you walk away from the computer. ALWAYS PULLING BEFORE YOU PUSH will help keep your GitHub repository updated and minimize problems with merge conflicts, etc.
NOTE: Every time you push (and sometimes when you pull), you will be prompted for your username and password. The username is your GitHub username. The password is a PAT (personal access token) associated with your GitHub account.
To push and pull from your workshop directory (`2022_PIRE_omics_workshop/your_name`), execute these commands manually or run the `runGIT.bash` script (see below):
#to make sure your version of the repository matches what is on GitHub (ALWAYS do before running anything OR pushing)
git pull
#to add any of your own changes
git add ./*
git commit -m "" #in "" write a short message describing what you did
git push -u
This code has also been compiled into the script `runGIT.bash`. Thus, you can just run this script BEFORE and AFTER you do anything in your workshop directory. Copy it to your workshop dir if you would like to run it:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
cp ../runGIT.bash .
When running `runGIT.bash`, you will need to provide your commit message on the command line. Example:
bash ../runGIT.bash "initiated my directory"
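For reference, here is a minimal sketch of what a wrapper like `runGIT.bash` might contain, assuming it simply bundles the pull/add/commit/push sequence shown above (the actual script in the repo may differ):

```sh
#!/bin/bash
# usage: bash runGIT.bash "your commit message"
git pull                  # always pull first to avoid merge conflicts
git add ./*               # stage your changes
git commit -m "$1"        # commit with the message passed on the command line
git push -u origin main   # assumes the default branch is named main
```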
Other things to know:
- If you should ever meet a merge conflict screen, you are in the arcane `vim` editor. You can look up instructions on how to interface with it, but the following should typically work:
  - hit the escape key twice
  - type `:quit!`
- If you ever have to delete files for whatever reason, these deletions will occur in your local directory (on the HPC) but will remain in the git memory if they had previously been pushed. If you are in this situation, run these git commands manually AFTER running `runGIT.bash` as described above (run them from the directory where you deleted the files):
git add -u . #this will stage your deleted files
git commit -m "update deletions"
git push -u origin main
- gitignore: There is a `.gitignore` file that lists files and directories to be ignored by git. It includes large files that git cannot handle (fq.gz, bam, etc.) and other repositories that might be cloned/created in this repository during the pipeline. For example, the BUSCO directory contains several large files that will cause problems for git, so `busco_*/` occurs in `.gitignore` so that it is never pushed to GitHub. Because large data files will never be pushed to GitHub, they will reside in an individual's local directory or somewhere else on the HPC. (A quick way to test what git will ignore is sketched below.)
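If you want to confirm that a path is covered by `.gitignore` before committing, `git check-ignore` prints the rule that matches it (the directory name below is just an example):

```sh
# run from the repo root; prints the matching .gitignore rule, if any
git check-ignore -v busco_contigs_results-SPAdes_allLibs_decontam_R1R2_noIsolate
```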
Complete the pre-processing of your files following the pire_fq_gz_processing instructions, then return here.
- This includes running FASTQC, FASTP1, CLUMPIFY, FASTP2, FASTQSCREEN, and the re-pair scripts.
1a. Fetch the genome properties for your species:
- From the literature or other sources:
  - genomesize.com
  - NCBI Genome
  - search the literature
- Record the size and other potentially important information in your species README if the genome of your species is already available.
- Estimate properties with `jellyfish` and `genomescope`. More details here.
1b. Execute runJellyfish.sbatch using the decontaminated files (takes a couple of hours).
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#sbatch runJellyfish.sbatch <Species 3-letter ID> <indir>
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runJellyfish.sbatch "Sfa" "fq_fp1_clmp_fp2_fqscrn_repaired"
Check your queue!
squeue -u <your_user_ID>
Jellyfish will create a histogram file (`.histo`) with kmer frequencies.
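You can peek at the histogram before downloading it; each row is a kmer coverage value followed by the number of distinct kmers observed at that coverage. The counts below are illustrative, not real Sfa values:

```sh
head -4 Sfa_all_reads.histo
# 1 52341234    <- coverage 1 is dominated by sequencing errors
# 2 9876543
# 3 4567890
# 4 2345678
```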
1c. Download this file to your local computer and upload it to GenomeScope v1.0 and GenomeScope v2.0 (takes a few minutes).
To download, `sftp` into Wahab in a new Terminal window and download the histogram file.
sftp your_user_ID@wahab.hpc.odu.edu
#cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
cd /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/fq_fp1_clmp_fp2_fqscrn_repaired
lpwd #this is your local working directory
#get fq_fp1_clmp_fp2_fqscrn_repaired/<histofile.histo>
get Sfa_all_reads.histo
Then:
- Open `GenomeScope v1.0` and `v2.0` and upload the histogram file to each.
- Add a proper description to both of your GenomeScope runs. Example: "Sfa_fq_fp1_clmp_fp2_fqscrn_repaired_jfsh_v1" and "Sfa_fq_fp1_clmp_fp2_fqscrn_repaired_jfsh_v2".
- For version 1 ONLY, adjust the read length to that of the Fastp2 trimming: 140 (unless you had to modify this during Fastp2).
- Leave all other parameters at their default settings for both versions.
- Submit (takes only a few minutes to run).
1d. Complete the following table in your species README. You can copy and paste this table straight into your README (no need to enclose it in a code block) and just substitute your table values.
Genome stats for Sfa from Jellyfish/GenomeScope v1.0 and v2.0, k=21 for both versions
version |stat |min |max
------ |------ |------ |------
1 |Heterozygosity |?% |?%
2 |Heterozygosity |?% |?%
1 |Genome Haploid Length |? bp |? bp
2 |Genome Haploid Length |? bp |? bp
1 |Model Fit |? % |? %
2 |Model Fit |? % |? %
Provide a link to both GenomeScope reports in your README. See other species READMEs for examples.
In your table, check that the heterozygosity estimates are reasonable (values of ~1% or less are common) and check for a good model fit in the max values (>90%). Sometimes the min value might have a low fit for V2, but this is okay.
- In your reports, check for a tight relationship between the "observed", "full model" and "unique sequences" lines in the first graph.
If the values in your table are relatively similar for V1 and V2 and you found no red flags in the reports, then use the V2 estimates.
Most of the time V1 & V2 perform similarly. However, sometimes the two reports give contrasting values, such as very different genome sizes or unrealistic estimates of heterozygosity.
For example:
- For Sob (Sphyraena obtusata), the Sob_GenScp_v1 report estimates a genome of 532 Mbp and 0.965 for H. However, the Sob_GenScp_v2 report estimates a genome size of 259 Mbp (which is small for a fish) and actually fails to estimate heterozygosity. Thus, V1 was used for Sob.
- For Hte (Hypoatherina temminckii), the Hte_GenScp_v1 report appears to have no red flags, with a genome size of 846 Mbp and 0.49 for H. However, upon inspecting the first graph, you can see that the "unique sequence" line behaves differently from the others. In contrast, the Hte_GenScp_v2 report restores a tight relationship between all lines, with no red flags in any estimates either (H=2.1, GenSize=457 Mbp). Thus, V2 was used for Hte.
Note the best GenomeScope version and the rounded genome size estimate in your species README. Please use the "Genome Haploid Length" max value, rounded up or down to the nearest million. You will use this info later (for Sfa, the assembly commands below use 635000000 bp).
Congrats! You are now ready to assemble the genome of your species!
After comparing multiple assemblers with the PIRE shotgun data, we decided to use SPAdes.
For the most part, we obtained better assemblies using single libraries (a library consists of a pair of files: one forward, *r1.fq.gz, and one reverse, *r2.fq.gz), but in a few instances using all the libraries was better.
You need to be on `turing.hpc.odu.edu` for this step. SPAdes requires high-memory nodes, which are only available on Turing. The username/password you use to log onto Wahab will work for Turing as well.
#from wahab.hpc.odu.edu
exit
ssh userID@turing.hpc.odu.edu
Check how many high memory nodes ("himem") are available on Turing.
sinfo
You will see a list like the one below. Look at the state of the himem nodes (partition = himem) to identify how many are available for genome assembly.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up infinite 2 inval coreV1-22-019,coreV2-25-005
main* up infinite 18 fail* coreV1-22-[006-009,017],coreV2-22-[006,012,033],coreV2-25-[003,008,012,023,025,039,043,055],coreV2-25-knc-[0
main* up infinite 2 fail coreV1-22-016,coreV2-25-017
main* up infinite 8 down* coreV1-22-028,coreV2-22-[018-020,026],coreV2-25-016,coreV2-25-knc-006,coreV3-23-017
main* up infinite 36 mix coreV1-22-[001-003,025,027],coreV2-22-[004,025,036],coreV2-25-[001,004,006,009,011,013-015,018-022,024,038,0
main* up infinite 62 alloc coreV1-22-[012-013,015,020-024],coreV2-22-[001-003,005,007-011,013-017,021-024,027-032,034],coreV2-25-[044-0
main* up infinite 70 idle coreV1-22-026,coreV2-22-035,coreV2-25-[002,007,010,026-037,048-054,056-073],coreV3-23-[008-016,041-050],core
main* up infinite 4 down coreV1-22-[004-005,014,018]
timed-main up 2:00:00 2 inval coreV1-22-019,coreV2-25-005
timed-main up 2:00:00 18 fail* coreV1-22-[006-009,017],coreV2-22-[006,012,033],coreV2-25-[003,008,012,023,025,039,043,055],coreV2-25-knc-[0
timed-main up 2:00:00 2 fail coreV1-22-016,coreV2-25-017
timed-main up 2:00:00 8 down* coreV1-22-028,coreV2-22-[018-020,026],coreV2-25-016,coreV2-25-knc-006,coreV3-23-017
timed-main up 2:00:00 36 mix coreV1-22-[001-003,025,027],coreV2-22-[004,025,036],coreV2-25-[001,004,006,009,011,013-015,018-022,024,038,0
timed-main up 2:00:00 62 alloc coreV1-22-[012-013,015,020-024],coreV2-22-[001-003,005,007-011,013-017,021-024,027-032,034],coreV2-25-[044-0
timed-main up 2:00:00 70 idle coreV1-22-026,coreV2-22-035,coreV2-25-[002,007,010,026-037,048-054,056-073],coreV3-23-[008-016,041-050],core
timed-main up 2:00:00 4 down coreV1-22-[004-005,014,018]
himem up infinite 1 fail* coreV4-21-himem-001
himem up infinite 1 fail coreV2-23-himem-001
himem up infinite 1 down* coreV2-23-himem-002
himem up infinite 4 idle coreV2-23-himem-[003-004],coreV4-21-himem-[002-003]
timed-himem up 2:00:00 1 fail* coreV4-21-himem-001
timed-himem up 2:00:00 1 fail coreV2-23-himem-001
timed-himem up 2:00:00 1 down* coreV2-23-himem-002
timed-himem up 2:00:00 4 idle coreV2-23-himem-[003-004],coreV4-21-himem-[002-003]
gpu up infinite 1 fail* coreV3-23-k40-006
gpu up infinite 1 drain coreV5-21-v100-003
gpu up infinite 6 mix coreV3-23-k40-[003-004],coreV4-24-v100-[001,003-004],coreV5-21-v100-002
gpu up infinite 12 idle coreV3-23-k40-[005,007-010],coreV4-21-k80-[001-005],coreV4-22-p100-[001-002]
timed-gpu up 2:00:00 1 fail* coreV3-23-k40-006
timed-gpu up 2:00:00 1 drain coreV5-21-v100-003
timed-gpu up 2:00:00 6 mix coreV3-23-k40-[003-004],coreV4-24-v100-[001,003-004],coreV5-21-v100-002
timed-gpu up 2:00:00 1 alloc coreV5-21-v100-001
timed-gpu up 2:00:00 12 idle coreV3-23-k40-[005,007-010],coreV4-21-k80-[001-005],coreV4-22-p100-[001-002]
pire up infinite 4 idle coreV1-22-[010-011],coreV3-23-k40-[001-002]
reserved-gpu up infinite 2 idle coreV4-22-p100-[001-002]
reserved-khan up infinite 1 alloc coreV5-21-v100-001
phi up infinite 2 fail* coreV2-25-knc-[002,004]
phi up infinite 1 down* coreV2-25-knc-006
phi up infinite 4 mix coreV2-25-knc-[001,005,008,010]
phi up infinite 3 alloc coreV2-25-knc-[003,007,009]
timed-phi up 2:00:00 2 fail* coreV2-25-knc-[002,004]
timed-phi up 2:00:00 1 down* coreV2-25-knc-006
timed-phi up 2:00:00 4 mix coreV2-25-knc-[001,005,008,010]
timed-phi up 2:00:00 3 alloc coreV2-25-knc-[003,007,009]
MSIM715 up infinite 4 idle coreV3-23-[001-003],coreV4-24-v100-002
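To cut through the noise and display only the high-memory partitions, you can pass `-p` to sinfo:

```sh
# list only the himem and timed-himem partitions
sinfo -p himem,timed-himem
```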
Typically, we produce 3 libraries (from the same individual) for each species with SSL data.
Sfa example:
ls /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/shotgun_raw_fq/*gz
Sfa-CBas_028-Ex1-1G_L4_1.fq.gz #library A
Sfa-CBas_028-Ex1-1G_L4_2.fq.gz
Sfa-CBas_028-Ex1-1H_L4_1.fq.gz #library B
Sfa-CBas_028-Ex1-1H_L4_2.fq.gz
Sfa-CBas_028-Ex1-2A_L4_1.fq.gz #library C
Sfa-CBas_028-Ex1-2A_L4_2.fq.gz
We produce duplicate libraries because, every now and then, a library can fail and you may end up with only 2 sets of files. Thus, the following SPAdes script can be modified to run each of the 3 libraries independently, as well as all 3 libraries together (for the "all" assembly).
NOTE: If your species has 4 or more libraries, you will need to modify the script to run the 4th library, 5th library etc. (You'll only need to add the necessary libraries to the SPAdes command.)
- No changes are necessary for running the 1st, 2nd, 3rd, or all the libraries together (i.e. if you have 2 or 3 libraries only).
Use the decontaminated files (final fq.gz files created during pre-processing) to run one assembly for each of your libraries independently and then one combining all of your libraries together.
Execute runSPADEShimem_R1R2_noisolate.sbatch.
Example for the 1st library:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#runSPADEShimem_R1R2_noisolate.sbatch <your user ID> <3-letter species ID> <library: all_2libs | all_3libs | 1 | 2 | 3> <contam | decontam> <genome size in bp> <species dir> <fq data dir>
#do not use trailing / in paths
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your_user_ID" "Sfa" "1" "decontam" "?" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2_fqscrn_repaired"
Run 2 more assemblies for the second and third libraries, replacing the "1" with "2" and "3", respectively.
Then, run a job combining all libraries together by choosing the appropriate "all_2libs" or "all_3libs" library option.
`runSPADEShimem_R1R2_noisolate.sbatch` names the output directories with the suffix A for the first library, B for the second, and C for the third (if there is one). Thus, for Sfa:
Assembly | Library |
---|---|
A | 1G |
B | 1H |
C | 2A |
For the `SPAdes` output, check that you have:
- k22-99 directories
- `contigs.fasta` and `scaffolds.fasta` files
- Open the `params.txt` files and check the settings SPAdes ran with. Make sure these are as you expect (i.e., they match your input settings); a spot-check is sketched below.
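A quick way to spot-check one assembly directory, assuming the default SPAdes output layout:

```sh
cd SPAdes_allLibs_decontam_R1R2_noIsolate
ls -d K*                             # kmer directories (K22 ... K99)
ls contigs.fasta scaffolds.fasta     # final assembly files
less params.txt                      # the settings SPAdes actually ran with
```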
`QUAST` is automatically run by the SPAdes script as well. `QUAST` creates statistics to help us evaluate genome assembly quality and composition. Look for the `quast-reports` directory, and for each of your assemblies open the associated tsv files (both the contigs and scaffolds ones) and check:
- the number of contigs in the assembly (this is the last contig column in the quast report, named "# contigs")
- the size of the largest contig
- the total length of the assembly
- N50
- L50
Tip: you can align the columns of any `.tsv` file for easy viewing with the `column` command in bash.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
bash
cat SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report/transposed_report.tsv | column -ts $'\t' | less -S
cat SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report/transposed_report.tsv | column -ts $'\t' | less -S
QUAST gives us many of the basic assembly statistics; however, we still need to run BUSCO to know how many expected (i.e. highly conserved) genes were recovered by the assembly. This gives us another sense of how "complete" the genomes we just assembled are.
Execute runBUSCO.sh on the `contigs` and `scaffolds` files for each assembly.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#runBUSCO.sh <species dir> <SPAdes dir> <contigs | scaffolds>
#do not use trailing / in paths. Example using contigs:
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runBUSCO.sh "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "SPAdes_allLibs_decontam_R1R2_noIsolate" "contigs"
Repeat the command using the contigs and scaffolds files for each SPAdes assembly.
`runBUSCO.sh` will generate a new directory every time it is run. In those directories, look for the `short_summary.txt` file and note the percentage of `Complete and single-copy BUSCOs (S)` genes. (This file will have a much longer name than just "short_summary," but it begins with that phrase.) A one-liner for pulling these numbers is sketched below.
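A one-liner to pull the summary line from every BUSCO run at once (assuming your BUSCO output directories start with `busco_`):

```sh
# print the C:...%[S:...%,D:...%],F:...%,M:...% line from each short summary
grep -h "C:" busco_*/short_summary*.txt
```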
Fill in this table with your QUAST and BUSCO values in your species README.
You didn't run SPAdes or BUSCO so you don't have QUAST/BUSCO output in your dir yet. Copy the output for allLibs from the Sfa workshop dir:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
pwd
mkdir SPAdes_allLibs_decontam_R1R2_noIsolate
mkdir SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report
mkdir SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report
#QUAST
cp /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report/transposed_report.tsv ./SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report/transposed_report.tsv
cp /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report/transposed_report.tsv ./SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report/transposed_report.tsv
#BUSCO
cp -R /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/busco-results .
A few notes:
- Library names can be obtained from the file names.
- The number of contigs is the last contig column in the quast report, named "# contigs".
- % Genome size completeness = "Total length" / genome size (or rounded genome max value) * 100 (see the worked example after this list).
- For QUAST, only report the row for the actual assembly (i.e. report "scaffolds", not "scaffolds_broken").
- For BUSCO, only report the "Complete and single-copy BUSCOs (S)" value.
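For example, you can compute % genome size completeness in the shell with `bc`; the total length here is made up, so substitute your own QUAST value and rounded genome size:

```sh
# % completeness = total assembly length / estimated genome size * 100
echo "scale=1; 598123456 * 100 / 635000000" | bc
# 94.1
```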
Species |Assembly |Data Type |SCAFIG |covcutoff |GenomeScope v. |No. of contigs |Largest contig |Total length |% Genome size completeness |N50 |L50 |Ns per 100 kbp |BUSCO single copy
------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------
Sfa |A |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |A |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |B |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |B |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |C |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |C |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |allLibs |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |allLibs |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
We assess quality across multiple metrics since there is no single golden rule/metric for determining the best assembly. Often, it is clear that one of the libraries is better than the others, as it has better results across multiple metrics. However, sometimes this is not quite as clear as we would like, as different assemblies might fare better in some metrics and worse in others. Use the following table to help you decide which assembly is best:
Importance | Metric | Direction | Description |
---|---|---|---|
1st | BUSCO | Bigger is better | % of expected genes observed in your assembly |
2nd | N50 | Bigger is better | Length of the shortest contig from the set of contigs that (together) represent 50% of your assembly size |
3rd | Genome size completeness | Bigger is better | Length of the assembly divided by the estimated genome length |
4th | L50 | Smaller is better | Number of contigs that make up the set of contigs that (together) represent 50% of your assembly size |
5th | Largest contig | Bigger is better | Length of the largest contig |
If you are still undecided on which is the best assembly, raise your hand or talk to those around you in the workshop. Outside of the workshop, we would normally post the top candidates on the species slack channel and ask for opinions.
Note your best assembly in your species README.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#MAKE SURE YOU ARE ON TURING
#runSPADEShimem_R1R2_noisolate.sbatch <your user ID> <3-letter species ID> <library: all_2libs | all_3libs | 1 | 2 | 3> <contam | decontam> <genome size in bp> <species dir> <fq data dir>
#do not use trailing / in paths
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your_user_ID" "Sfa" "2" "contam" "635000000" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2"
Compare the QUAST contigs and scaffolds values of your contaminated assembly to your decontaminated assembly.
You will not do this for the workshop. You should compare your results to this table, however, to see how your species stacks up.
Add a new record for your species/assembly to the best_ssl_assembly_per_sp.tsv file.
Please note that you cannot paste a tab into `nano`, as it converts tabs to spaces. This means that you will have to enter each column manually one by one, or copy and paste multiple columns and then change the spaces between columns back to tabs to restore the tsv format.
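After editing, you can verify that the file is still tab-delimited; `cat -A` displays tabs as `^I`:

```sh
# tabs show up as ^I; runs of plain spaces mean the tsv format is broken
cat -A best_ssl_assembly_per_sp.tsv | less -S
```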
Once done, push your changes to GitHub and confirm that the tsv format is correct by opening best_ssl_assembly_per_sp.tsv on GitHub. Your browser should not be displaying code but instead a nice-looking table (aligned columns, etc.).
Assuming you have completed Step 9, you now know which library (or libraries) produced the best assembly. Compare your BUSCO values with those of the other species (for example, you can check the "best assembly table").
If your BUSCO values are much lower than those of most other species, it might be worth changing the `covcutoff auto` flag (by changing the datatype variable from "decontam" to "decontam_covAUTO").
Example:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your user ID" "Sfa" "2" "decontam_covAUTO" "635000000" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2_fqscrn_repaired"
Finally, run one more assembly using the contaminated data from the same library (or all together) that produced the best assembly (with or without the covcutoff flag).
Example:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your user ID" "Sfa" "3" "contam" "635000000" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2_fqscrn_repaired"
Move your `*out` files into the `logs` directory.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
mv *out logs
In this section, you will identify contigs and regions within contigs to be used as candidate regions to develop the probes for capture sequencing.
You will create the following 4 files (among others):
- `*.fasta.masked`: the masked fasta file
- `*.fasta.out.gff`: the gff file created from repeat masking (identifies the regions of the genome that were masked)
- `*_augustus.gff`: the gff file created from gene prediction (identifies putative coding regions)
- `*_per10000_all.bed`: the bed file with target regions for probe development (1 set of 2 probes per target region)
From your species directory, make a new directory for the probe design process.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
mkdir probe_design
Copy the necessary scripts and the best assembly (i.e. the `scaffolds.fasta` from the best assembly made with the DECONTAMINATED fq.gz files) into the `probe_design` directory.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
cp /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/WGprobe_annotation.sb probe_design
cp /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/WGprobe_bedcreation.sb probe_design
cp /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/SPAdes_Sfa-CBas-B_decontam_R1R2_noIsolate/scaffolds.fasta probe_design
Rename the assembly to reflect the species and parameters used. Format to follow:
<3-letter species code>_scaffolds_<library>_<contam|decontam>_R1R2_noIsolate_<other treatments, if any>.fasta
To get this information, you can copy and paste the parameter information from the appropriate BUSCO directory:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#list the busco directories
ls -d busco_*
#identify the busco dir of the best assembly, copy the treatments (starting with the library)
#example,the busco dir for the best assembly for Sfa is `busco_contigs_results-SPAdes_Sfa-CBas-B_decontam_R1R2_noIsolate`
#then provide the species 3-letter code, scaffolds, and copy and paste the parameters from the busco dir after "SPAdes_"
cd probe_design
mv scaffolds.fasta Sfa_scaffolds_CBas-B_decontam_R1R2_noIsolate.fasta
Execute the first probe annotation script. This script will create:
- a repeat-masked fasta and gff file (`.fasta.masked` & `.fasta.out.gff`)
- a gff file with predicted gene regions (`augustus.gff`), and
- a sorted fasta index file that will act as a template for the `.bed` file (`.fasta.masked.fai`)
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
#WGprobe_annotation.sb <assembly name>
sbatch WGprobe_annotation.sb "Sfa_scaffolds_CBas-B_decontam_R1R2_noIsolate.fasta"
Execute the second script.
This will create the `.bed` file that will be sent for probe creation. The bed file identifies 5,000-bp regions (spaced 10,000 bp apart) in scaffolds > 10,000 bp long. These are the regions from which probes will potentially be developed.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
#WGprobe_bedcreation.sb <assembly name>
sbatch WGprobe_bedcreation.sb "Sfa_scaffolds_CBas-B_decontam_R1R2_noIsolate.fasta"
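Once the job finishes, you can sanity-check the bed file: each line should span 5,000 bp, and consecutive regions on the same scaffold should start 10,000 bp apart. The scaffold names and coordinates below are illustrative only:

```sh
head -3 *_per10000_all.bed
# NODE_1_length_280192_cov_28.8   2500    7500
# NODE_1_length_280192_cov_28.8   12500   17500
# NODE_1_length_280192_cov_28.8   22500   27500
```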
Check Upper Limit
Open the `BEDprobes-*.out` file created by the second probe design script (`WGprobe_bedcreation.sb`) and check that the upper limit was set correctly. Record the longest contig, the upper limit used in the loop, and the number of identified regions and scaffolds in your species README.
The upper limit should be XX7500 (just under the longest scaffold length). Ex: if the longest scaffold is 88,888 bp, the upper limit should be 87,500; if the longest scaffold is 87,499 bp, the upper limit should be 77,500.
Example:
cat BEDprobes-415039.out
The longest scaffold is 280192
The upper limit used in loop is 277500
A total of 37134 regions have been identified from 16894 scaffolds
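This rule can be reproduced with a bit of shell arithmetic: the upper limit is the largest value ending in 7,500 that does not exceed the longest scaffold. A sketch that matches the example above (and the 88,888 → 87,500 and 87,499 → 77,500 cases):

```sh
# upper limit = largest value ending in 7500 that does not exceed the longest scaffold
longest=280192
echo $(( (longest - 7500) / 10000 * 10000 + 7500 ))
# 277500
```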
Move your out files into your `logs` directory:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
mv *out ../logs
The last thing we need to do is create a text file with links to available genomes from the 5 most closely-related species.
Most likely, there won't be genomes available for your targeted species (or even its genus). The easiest way to search for any is to start with the family.
Go to the NCBI Genome repository and search for the family of your species. If you get more than 5 genomes in the family, narrow the search by looking for the genus. If you don't find any genomes at the family level, search higher classifications until you find some genome assemblies (i.e. order, class, etc.).
Once you get at least 5 genomes, you'll need to figure out their phylogenetic relationships in order to list the genomes from most to least closely related.
Search for phylogenies specific to your species. For your convenience, you can use the phylogenies from Betancur et al. BMC Evolutionary Biology (2017) 17:162 for fish phyla and fish families. These are an excellent resource for higher-order taxonomic groups, but only a few species per family are represented. If you need higher resolution (to distinguish relationships between species within families, for example), you will need to search specifically for your species.
Once your list is ready, create a file in your `probe_design` directory.
#example for Salarias fasciatus
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
nano closest_relative_genomes_salarias_fasciatus.txt
Closest genomes:
1. Salarias fasciatus - https://www.ncbi.nlm.nih.gov/genome/7248
2. Parablennius parvicornis - https://www.ncbi.nlm.nih.gov/genome/69445
3. Petroscirtes breviceps - https://www.ncbi.nlm.nih.gov/genome/7247
4. Ecsenius bicolor - https://www.ncbi.nlm.nih.gov/genome/41373
5. Gouania willdenowi - https://www.ncbi.nlm.nih.gov/genome/76090
Used Betancur et al. 2017 (All Blenniiformes)
---
**NOTE:** Sfa has a genome available in GenBank.
---
## Files to Send
**You do not need to do this for the workshop.**
Share the following files with Arbor Bio to aid in probe creation:
1. The repeat-masked fasta file (`.fasta.masked`)
2. The gff file with repeat-masked regions (`.fasta.out.gff`)
3. The gff file with predicted gene regions (`.augustus.gff`)
4. The bed file (`.bed`)
5. The text file with links to available genomes from the 5 most closely-related species (`closest_relative_genomes.txt`).
Make a directory named `files_for_ArborSci` inside your `probe_design` directory and move these files there:
```sh
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
mkdir files_for_ArborSci
mv *.fasta.masked *.fasta.out.gff *.augustus.gff *bed closest* files_for_ArborSci
```
Finally, notify Eric by email (e1garcia@odu.edu) saying that your files are ready, and post a message in the species slack channel with the probe region and scaffold info (from your `BEDprobe*out` file) and the full path to your files.
Example of Slack post:
Probe Design Files Ready
A total of 13063 regions have been identified from 10259 scaffolds. The longest scaffold is 105644.
Files for Arbor Bio:
ls /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/spratelloides_gracilis/probe_design/files_for_ArborSci
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate.fasta.augustus.gff
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate.fasta.masked
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate.fasta.out.gff
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate_great10000_per10000_all.bed
closest_relative_genomes_Spratelloides_gracilis.txt
Eric will then share these with Arbor BioSciences.
Now that we have a reference genome, what can we do with it? Follow the PSMC link to learn how we can use genomic data from a single individual to infer demographic history!
Follow the workshop's R repository to learn how you can analyze and illustrate your data.
Congratulations! You have completed the 2022 PIRE Omics Workshop. Good luck in your projects and happy coding!