This is the README for the 2022 PIRE Omics Workshop. It contains a link to the daily schedule, information on the workshop participants and personnel, links and installation directions for necessary software/programs, and a guide to preparing, processing, and analyzing shotgun data. During this workshop, we will be following the PIRE SSL and PIRE Pre-Processing pipelines, although both have been modified slightly for the purposes of the workshop.
Here is a link to the daily schedule.
Here is the link to the workshop lectures.
Here are links to pictures of the workshop and the day trip to Apo Island.
Workshop Survey (closed)
Position | Role | Name | Institution | Contact |
---|---|---|---|---|
Professor | Guest Lecturer | Kent Carpenter | ODU | kcarpent@odu.edu |
Professor | Instructor | Chris Bird | TAMUCC | cbird@odu.edu |
Postdoc | Instructor | Eric Garcia | ODU | e1garcia@odu.edu |
Postdoc | Instructor | Brendan Reid | Rutgers | br450@sebs.rutgers.edu |
PhD Student | Instructor | Rene Clark | Rutgers | rdc129@scarletmail.rutgers.edu |
PhD Student | Guest Lecturer | Kyra Fitz | Rutgers | ksf63@dls.rutgers.edu |
PhD Student | Guest Lecturer | John Whalen | ODU | jwhal002@odu.edu |
M.Sc Student | Guest Lecturer | George Bonsall | ODU | gbons002@odu.edu |
M.Sc Student | Instructor | Roy Roberts | TAMUCC | rroberts3@islander.tamucc.edu |
REU/Post-Bac | Guest Lecturer | Jordan Rodriguez | TAMUCC | jrodriguez216@islander.tamucc.edu |
Position | Name | Institution | Contact |
---|---|---|---|
REU/Post-Bac | Abigail Ethridge | ODU | aethr001@odu.edu |
REU/Post-Bac | Eryka Molino | ASU | emolino1230@gmail.com |
REU/Post-Bac | Rebecca Ruiz | TAMUCC | rruiz25@islander.tamucc.edu |
REU/Post-Bac | Allison Fink | Rutgers | agf63@scarletmail.rutgers.edu |
REU/Post-Bac | Marial Malabag | Rutgers | mjm751@scarletmail.rutgers.edu |
RA | Kevin L. Labrador | UP Mindanao | kllabrador@up.edu.ph |
RA | Maybelle A. Fortaleza | UP Mindanao | mafortaleza@up.edu.ph |
RA | Joemarie L. Lanutan | UP Mindanao | jjlanutan@up.edu.ph |
Professor | Cleto L. Nanola Jr. | UP Mindanao | clnanola@up.edu.ph |
M.Sc. Student | Omar Mahadalle | Silliman University | omaramahadalle@su.edu.ph
M.Sc. Student | Abner Bucol | Silliman University | abnerbucol2013@gmail.com
M.Sc. Student | Chandelle Jablonski | ODU | cjabl001@odu.edu |
M.Sc. Student | Nichole Leach | ODU | npete006@odu.edu |
In order to run the SSL pipeline and follow along with the workshop exercises, you will need to make sure you have the following accounts set up and programs installed on your local computer (if you intend to use one of the computers in the computer lab, you only need to complete step 1).
- Create a free GitHub account.
- Once you have your account, set up two-factor authentication.
- You will also need a personal access token (PAT) to use GitHub on the HPC cluster. To set this up, follow these instructions. MAKE SURE TO SAVE THIS TOKEN SOMEWHERE ON YOUR COMPUTER SO YOU CAN COPY-AND-PASTE!
- WINDOWS ONLY: Install a Linux Distribution on Windows using the Windows Subsystem for Linux. Follow these steps:
- Update Windows to the newest version (Windows 10 version 2004 or higher is required; you do not need Windows 11). To update, type "Check for Updates" in the taskbar search bar.
- Open "Windows PowerShell". You can search for it in the same location where you typed "Check for Updates". Open Windows PowerShell by right-clicking it and then clicking "Run as Administrator".
- In the PowerShell Terminal, run the following command (type it manually; do NOT copy and paste):
wsl --install
- After the command finishes, restart your computer. Once it has restarted, an Ubuntu terminal will open and finish the installation. The installation will take a few minutes.
- The terminal will ask for a Username and a Password. Use whatever Username you would like; it will become the name of the User directory. A password is not necessary; if you do not want to use one, just press Enter at both the "New Password" and "Retype Password" prompts.
- After installation is complete, download Windows Terminal.
- Windows Terminal will open PowerShell automatically. Click the "v" symbol next to the "+" (new tab) button and go to "Settings".
- The first option under "Startup" is "Default Profile". Change this to "Ubuntu" and save your changes.
- To open again, just type "Terminal" in the taskbar search bar and open the App.
- Install a text editor. Any free editor will work.
- Install R (v4.2.0 or 4.2.1).
- Install RStudio.
- Once R and RStudio are installed, install the following packages (with all dependencies): tidyverse and adegenet. A command-line sketch follows below.
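If you prefer to install the packages from the command line, here is a minimal sketch (assuming R installed successfully and `Rscript` is on your PATH):

```sh
# install the workshop R packages with all dependencies
Rscript -e 'install.packages(c("tidyverse", "adegenet"), dependencies = TRUE, repos = "https://cloud.r-project.org")'
```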
The workshop goes straight into analyzing genomic data and assumes participants already have a basic understanding of UNIX, the command line, and HPC environments. If you don't have this background, please check out this page with links to a few brief, introductory online modules.
If this is your first time working on Wahab/Turing (the ODU HPC clusters we will be using for the workshop), or you want to check out some cool tips, see the Working on ODU's HPC Repository.
Here we go!
Start the workshop by cloning this repository to your working directory. We recommend setting up a `shotgun_PIRE` directory in your home directory first and then cloning this repository as a subdirectory of that.
- Example:
/home/youruserID/shotgun_PIRE/
First, log on to Wahab. If you are using the Ubuntu terminal, search for "Terminal" on your computer and enter your password (if you have one). If you are on a Mac, you can just search for "Terminal". Then:
ssh your_user_name@wahab.hpc.odu.edu
#enter your password
Next, clone this directory:
cd ~ #this will take you to your home directory
mkdir shotgun_PIRE
cd shotgun_PIRE
git clone https://github.com/philippinespire/2022_PIRE_omics_workshop.git
#you will be prompted for your username and password. The username is your GitHub username. The password is a PAT (personal access token) associated with your GitHub account.
If you do NOT have a PAT yet, follow these instructions. MAKE SURE TO SAVE THIS TOKEN SOMEWHERE ON YOUR COMPUTER SO YOU CAN COPY-AND-PASTE!
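Optionally, you can have git cache the PAT in memory so you are not prompted on every push and pull. This is just a convenience; the timeout (in seconds) is up to you:

```sh
# cache credentials in memory for 8 hours (28800 seconds)
git config --global credential.helper 'cache --timeout=28800'
```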
Now you have the files you need to start working!
Organization is important - for the workshop, each person will create and work in their own sub-directory in this repository.
Make your own directory:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop
mkdir <your_name_with_no_spaces>
Please Note: do not use spaces or special characters (#$%^&~"[])(+'|=) in the names of files or directories. Stick to letters, numbers, dashes (-), and underscores (_).
Next, you are going to copy the `student_README.md` file from the `2022_PIRE_omics_workshop` directory to your own directory. The `README.md` serves as a log where you record the code you ran for each step, as well as any important notes (errors you ran into, data quality assessments, etc.). You will modify this README to match your code and results as you run each step of the pipeline.
- During the workshop, all participants will be running data from the same species (Salarias fasciatus) through the SSL pipeline. Thus, everyone's output should look the same at each step! The `salarias_fasciatus` directory, and the `README.md` document within, serve as a template for you to check your results against. You will also be reading data in from this directory at various steps throughout the workshop, to speed up the process.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name #this should be the full path to the directory you just created
cp ../student_README.md .
As you start processing your data and making changes, you will need to continuously update your progress on GitHub (so that others can see your work)! To do this, you will always (1) PULL down the latest version of the repo every time you begin work and (2) PUSH any changes you make every time you walk away from the computer. ALWAYS PULLING BEFORE YOU PUSH will help keep your GitHub repository updated and minimize problems with merge conflicts, etc.
NOTE: Every time you push (and sometimes when you pull), you will be prompted for your username and password. The username is your GitHub username. The password is a PAT (personal access token) associated with your GitHub account.
To push and pull from your workshop directory (`2022_PIRE_omics_workshop/your_name`), execute these commands manually or run the `runGIT.bash` script (see below):
#to make sure your version of the repository matches what is on GitHub (ALWAYS do before running anything OR pushing)
git pull
#to add any of your own changes
git add ./*
git commit -m "" #in "" write a short message describing what you did
git push -u
This code has also been compiled into the script `runGIT.bash`. Thus, you can just run this script BEFORE and AFTER you do anything in your workshop directory. Copy it to your workshop dir if you would like to run it:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
cp ../runGIT.bash .
When running `runGIT.bash`, you will need to provide your commit message on the command line. Example:
bash ../runGIT.bash "initiated my directory"
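For reference, here is a minimal sketch of what a wrapper like `runGIT.bash` might contain, assuming it simply bundles the pull/add/commit/push sequence shown above (the actual script in the repo may differ):

```sh
#!/bin/bash
# usage: bash runGIT.bash "your commit message"
git pull                  # always pull first to avoid merge conflicts
git add ./*               # stage your changes
git commit -m "$1"        # commit with the message passed on the command line
git push -u origin main   # assumes the default branch is named main
```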
Other things to know:
- If you should ever meet a merge conflict screen, you are in the arcane `vim` editor. You can look up instructions on how to interface with it, but the following should typically work:
  - hit the escape key twice
  - type `:quit!`
- If you ever have to delete files for whatever reason, these deletions will occur in your local directory (on the HPC) but will remain in the git memory if they had previously been pushed. If you are in this situation, run these git commands manually AFTER running `runGIT.bash` as described above (run them from the directory where you deleted the files):
git add -u . #this will stage your deleted files
git commit -m "update deletions"
git push -u origin main
- gitignore: There is a `.gitignore` file that lists files and directories to be ignored by git. It includes large files that git cannot handle (fq.gz, bam, etc.) and other repositories that might be cloned/created in this repository during the pipeline. For example, the BUSCO directory contains several large files that will cause problems for git, so `busco_*/` occurs in `.gitignore` so that it is never pushed to GitHub. Because large data files will never be pushed to GitHub, they will reside in an individual's local directory or somewhere else on the HPC. (A quick way to test what git will ignore is sketched below.)
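If you want to confirm that a path is covered by `.gitignore` before committing, `git check-ignore` prints the rule that matches it (the directory name below is just an example):

```sh
# run from the repo root; prints the matching .gitignore rule, if any
git check-ignore -v busco_contigs_results-SPAdes_allLibs_decontam_R1R2_noIsolate
```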
Complete the pre-processing of your files following the pire_fq_gz_processing instructions, then return here.
- This includes running FASTQC, FASTP1, CLUMPIFY, FASTP2, FASTQSCREEN, and the re-pair scripts.
1a. Fetch the genome properties for your species:
- From the literature or other sources:
  - genomesize.com
  - NCBI Genome
  - search the literature
- Record the size and other potentially important information in your species README if the genome of your species is already available.
- Estimate properties with `jellyfish` and `genomescope`. More details here.
1b. Execute runJellyfish.sbatch using the decontaminated files (takes a couple of hours).
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#sbatch runJellyfish.sbatch <Species 3-letter ID> <indir>
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runJellyfish.sbatch "Sfa" "fq_fp1_clmp_fp2_fqscrn_repaired"
Check your queue!
squeue -u <your_user_ID>
Jellyfish will create a histogram file (`.histo`) with kmer frequencies.
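You can peek at the histogram before downloading it; each row is a kmer coverage value followed by the number of distinct kmers observed at that coverage. The counts below are illustrative, not real Sfa values:

```sh
head -4 Sfa_all_reads.histo
# 1 52341234    <- coverage 1 is dominated by sequencing errors
# 2 9876543
# 3 4567890
# 4 2345678
```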
1c. Download this file to your local computer and upload it to GenomeScope v1.0 and GenomeScope v2.0 (takes a few minutes).
To download, `sftp` into Wahab in a new Terminal window and download the histogram file.
sftp your_user_ID@wahab.hpc.odu.edu
#cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
cd /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/fq_fp1_clmp_fp2_fqscrn_repaired
lpwd #this is your local working directory
#get fq_fp1_clmp_fp2_fqscrn_repaired/<histofile.histo>
get Sfa_all_reads.histo
Then:
- Open `GenomeScope v1.0` and `v2.0` and upload the histogram file to each.
- Add a proper description to both of your GenomeScope runs. Example: "Sfa_fq_fp1_clmp_fp2_fqscrn_repaired_jfsh_v1" and "Sfa_fq_fp1_clmp_fp2_fqscrn_repaired_jfsh_v2".
- For version 1 ONLY, adjust the read length to that of the Fastp2 trimming: 140 (unless you had to modify this during Fastp2).
- Leave all other parameters at their default settings for both versions.
- Submit (takes only a few minutes to run).
1d. Complete the following table in your species README. You can copy and paste this table straight into your README (no need to enclose it in a code block) and just substitute your table values.
Genome stats for Sfa from Jellyfish/GenomeScope v1.0 and v2.0, k=21 for both versions
version |stat |min |max
------ |------ |------ |------
1 |Heterozygosity |?% |?%
2 |Heterozygosity |?% |?%
1 |Genome Haploid Length |? bp |? bp
2 |Genome Haploid Length |? bp |? bp
1 |Model Fit |? % |? %
2 |Model Fit |? % |? %
Provide a link to both GenomeScope reports in your README. See other species READMEs for examples.
In your table, check that the heterozygosity estimates are reasonable (values of ~1% or less are common) and check for a good model fit in the max values (>90%). Sometimes the min value might have a low fit for V2, but this is okay.
- In your reports, check for a tight relationship between the "observed", "full model" and "unique sequences" lines in the first graph.
If the values in your table are relatively similar for V1 and V2 and you found no red flags in the reports, then use the V2 estimates.
Most of the time V1 & V2 perform similarly. However, sometimes the two reports give contrasting values, such as very different genome sizes or unrealistic estimates of heterozygosity.
For example:
- For Sob (Sphyraena obtusata), the Sob_GenScp_v1 report estimates a genome of 532 Mbp and 0.965 for H. However, the Sob_GenScp_v2 report estimates a genome size of 259 Mbp (which is small for a fish) and actually fails to estimate heterozygosity. Thus, V1 was used for Sob.
- For Hte (Hypoatherina temminckii), the Hte_GenScp_v1 report appears to have no red flags, with a genome size of 846 Mbp and 0.49 for H. However, upon inspecting the first graph, you can see that the "unique sequence" line behaves differently from the others. In contrast, the Hte_GenScp_v2 report restores a tight relationship between all lines, with no red flags in any estimates either (H=2.1, GenSize=457 Mbp). Thus, V2 was used for Hte.
Note the best GenomeScope version and the rounded genome size estimate in your species README. Please use the "Genome Haploid Length" max value, rounded up or down to the nearest million. You will use this info later (for Sfa, the assembly commands below use 635000000 bp).
Congrats! You are now ready to assemble the genome of your species!
After comparing multiple assemblers with the PIRE shotgun data, we decided to use SPAdes.
For the most part, we obtained better assemblies using single libraries (a library consists of a pair of files: one forward, *r1.fq.gz, and one reverse, *r2.fq.gz), but in a few instances using all the libraries was better.
You need to be on `turing.hpc.odu.edu` for this step. SPAdes requires high-memory nodes, which are only available on Turing. The username/password you use to log onto Wahab will work for Turing as well.
#from wahab.hpc.odu.edu
exit
ssh userID@turing.hpc.odu.edu
Check how many high memory nodes ("himem") are available on Turing.
sinfo
You will see a list like the one below. Look at the state of the himem nodes (partition = himem) to identify how many are available for genome assembly.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up infinite 2 inval coreV1-22-019,coreV2-25-005
main* up infinite 18 fail* coreV1-22-[006-009,017],coreV2-22-[006,012,033],coreV2-25-[003,008,012,023,025,039,043,055],coreV2-25-knc-[0
main* up infinite 2 fail coreV1-22-016,coreV2-25-017
main* up infinite 8 down* coreV1-22-028,coreV2-22-[018-020,026],coreV2-25-016,coreV2-25-knc-006,coreV3-23-017
main* up infinite 36 mix coreV1-22-[001-003,025,027],coreV2-22-[004,025,036],coreV2-25-[001,004,006,009,011,013-015,018-022,024,038,0
main* up infinite 62 alloc coreV1-22-[012-013,015,020-024],coreV2-22-[001-003,005,007-011,013-017,021-024,027-032,034],coreV2-25-[044-0
main* up infinite 70 idle coreV1-22-026,coreV2-22-035,coreV2-25-[002,007,010,026-037,048-054,056-073],coreV3-23-[008-016,041-050],core
main* up infinite 4 down coreV1-22-[004-005,014,018]
timed-main up 2:00:00 2 inval coreV1-22-019,coreV2-25-005
timed-main up 2:00:00 18 fail* coreV1-22-[006-009,017],coreV2-22-[006,012,033],coreV2-25-[003,008,012,023,025,039,043,055],coreV2-25-knc-[0
timed-main up 2:00:00 2 fail coreV1-22-016,coreV2-25-017
timed-main up 2:00:00 8 down* coreV1-22-028,coreV2-22-[018-020,026],coreV2-25-016,coreV2-25-knc-006,coreV3-23-017
timed-main up 2:00:00 36 mix coreV1-22-[001-003,025,027],coreV2-22-[004,025,036],coreV2-25-[001,004,006,009,011,013-015,018-022,024,038,0
timed-main up 2:00:00 62 alloc coreV1-22-[012-013,015,020-024],coreV2-22-[001-003,005,007-011,013-017,021-024,027-032,034],coreV2-25-[044-0
timed-main up 2:00:00 70 idle coreV1-22-026,coreV2-22-035,coreV2-25-[002,007,010,026-037,048-054,056-073],coreV3-23-[008-016,041-050],core
timed-main up 2:00:00 4 down coreV1-22-[004-005,014,018]
himem up infinite 1 fail* coreV4-21-himem-001
himem up infinite 1 fail coreV2-23-himem-001
himem up infinite 1 down* coreV2-23-himem-002
himem up infinite 4 idle coreV2-23-himem-[003-004],coreV4-21-himem-[002-003]
timed-himem up 2:00:00 1 fail* coreV4-21-himem-001
timed-himem up 2:00:00 1 fail coreV2-23-himem-001
timed-himem up 2:00:00 1 down* coreV2-23-himem-002
timed-himem up 2:00:00 4 idle coreV2-23-himem-[003-004],coreV4-21-himem-[002-003]
gpu up infinite 1 fail* coreV3-23-k40-006
gpu up infinite 1 drain coreV5-21-v100-003
gpu up infinite 6 mix coreV3-23-k40-[003-004],coreV4-24-v100-[001,003-004],coreV5-21-v100-002
gpu up infinite 12 idle coreV3-23-k40-[005,007-010],coreV4-21-k80-[001-005],coreV4-22-p100-[001-002]
timed-gpu up 2:00:00 1 fail* coreV3-23-k40-006
timed-gpu up 2:00:00 1 drain coreV5-21-v100-003
timed-gpu up 2:00:00 6 mix coreV3-23-k40-[003-004],coreV4-24-v100-[001,003-004],coreV5-21-v100-002
timed-gpu up 2:00:00 1 alloc coreV5-21-v100-001
timed-gpu up 2:00:00 12 idle coreV3-23-k40-[005,007-010],coreV4-21-k80-[001-005],coreV4-22-p100-[001-002]
pire up infinite 4 idle coreV1-22-[010-011],coreV3-23-k40-[001-002]
reserved-gpu up infinite 2 idle coreV4-22-p100-[001-002]
reserved-khan up infinite 1 alloc coreV5-21-v100-001
phi up infinite 2 fail* coreV2-25-knc-[002,004]
phi up infinite 1 down* coreV2-25-knc-006
phi up infinite 4 mix coreV2-25-knc-[001,005,008,010]
phi up infinite 3 alloc coreV2-25-knc-[003,007,009]
timed-phi up 2:00:00 2 fail* coreV2-25-knc-[002,004]
timed-phi up 2:00:00 1 down* coreV2-25-knc-006
timed-phi up 2:00:00 4 mix coreV2-25-knc-[001,005,008,010]
timed-phi up 2:00:00 3 alloc coreV2-25-knc-[003,007,009]
MSIM715 up infinite 4 idle coreV3-23-[001-003],coreV4-24-v100-002
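To cut through the noise and display only the high-memory partitions, you can pass `-p` to sinfo:

```sh
# list only the himem and timed-himem partitions
sinfo -p himem,timed-himem
```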
Typically, we produce 3 libraries (from the same individual) for each species with SSL data.
Sfa example:
ls /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/shotgun_raw_fq/*gz
Sfa-CBas_028-Ex1-1G_L4_1.fq.gz #library A
Sfa-CBas_028-Ex1-1G_L4_2.fq.gz
Sfa-CBas_028-Ex1-1H_L4_1.fq.gz #library B
Sfa-CBas_028-Ex1-1H_L4_2.fq.gz
Sfa-CBas_028-Ex1-2A_L4_1.fq.gz #library C
Sfa-CBas_028-Ex1-2A_L4_2.fq.gz
We produce duplicate libraries because, every now and then, a library can fail and you may end up with only 2 sets of files. Thus, the following SPAdes script can be modified to run each of the 3 libraries independently, as well as all 3 libraries together (for the "all" assembly).
NOTE: If your species has 4 or more libraries, you will need to modify the script to run the 4th library, 5th library etc. (You'll only need to add the necessary libraries to the SPAdes command.)
- No changes are necessary for running the 1st, 2nd, 3rd, or all the libraries together (i.e. if you have 2 or 3 libraries only).
Use the decontaminated files (final fq.gz files created during pre-processing) to run one assembly for each of your libraries independently and then one combining all of your libraries together.
Execute runSPADEShimem_R1R2_noisolate.sbatch.
Example for the 1st library:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#runSPADEShimem_R1R2_noisolate.sbatch <your user ID> <3-letter species ID> <library: all_2libs | all_3libs | 1 | 2 | 3> <contam | decontam> <genome size in bp> <species dir> <fq data dir>
#do not use trailing / in paths
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your_user_ID" "Sfa" "1" "decontam" "?" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2_fqscrn_repaired"
Run 2 more assemblies for the second and third libraries, replacing the "1" with "2" and "3", respectively.
Then, run a job combining all libraries together by choosing the appropriate "all_2libs" or "all_3libs" library option.
`runSPADEShimem_R1R2_noisolate.sbatch` names the output directories with the suffix A for the first library, B for the second, and C for the third (if there is one). Thus, for Sfa:
Assembly | Library |
---|---|
A | 1G |
B | 1H |
C | 2A |
For the `SPAdes` output, check that you have:
- k22-99 directories
- `contigs.fasta` and `scaffolds.fasta` files
- Open the `params.txt` files and check the settings SPAdes ran with. Make sure these are as you expect (i.e., they match your input settings); a spot-check is sketched below.
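A quick way to spot-check one assembly directory, assuming the default SPAdes output layout:

```sh
cd SPAdes_allLibs_decontam_R1R2_noIsolate
ls -d K*                             # kmer directories (K22 ... K99)
ls contigs.fasta scaffolds.fasta     # final assembly files
less params.txt                      # the settings SPAdes actually ran with
```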
`QUAST` is automatically run by the SPAdes script as well. `QUAST` creates statistics to help us evaluate genome assembly quality and composition. Look for the `quast-reports` directory, and for each of your assemblies open the associated tsv files (both the contigs and scaffolds ones) and check:
- the number of contigs in the assembly (this is the last contig column in the quast report, named "# contigs")
- the size of the largest contig
- the total length of the assembly
- N50
- L50
Tip: you can align the columns of any `.tsv` file for easy viewing with the `column` command in bash.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
bash
cat SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report/transposed_report.tsv | column -ts $'\t' | less -S
cat SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report/transposed_report.tsv | column -ts $'\t' | less -S
QUAST gives us many of the basic assembly statistics; however, we still need to run BUSCO to know how many expected (i.e. highly conserved) genes were recovered by the assembly. This gives us another sense of how "complete" the genomes we just assembled are.
Execute runBUSCO.sh on the `contigs` and `scaffolds` files for each assembly.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#runBUSCO.sh <species dir> <SPAdes dir> <contigs | scaffolds>
#do not use trailing / in paths. Example using contigs:
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runBUSCO.sh "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "SPAdes_allLibs_decontam_R1R2_noIsolate" "contigs"
Repeat the command using the contigs and scaffolds files for each SPAdes assembly.
`runBUSCO.sh` will generate a new directory every time it is run. In those directories, look for the `short_summary.txt` file and note the percentage of `Complete and single-copy BUSCOs (S)` genes. (This file will have a much longer name than just "short_summary," but it begins with that phrase.) A one-liner for pulling these numbers is sketched below.
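A one-liner to pull the summary line from every BUSCO run at once (assuming your BUSCO output directories start with `busco_`):

```sh
# print the C:...%[S:...%,D:...%],F:...%,M:...% line from each short summary
grep -h "C:" busco_*/short_summary*.txt
```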
Fill in this table with your QUAST and BUSCO values in your species README.
You didn't run SPAdes or BUSCO so you don't have QUAST/BUSCO output in your dir yet. Copy the output for allLibs from the Sfa workshop dir:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
pwd
mkdir SPAdes_allLibs_decontam_R1R2_noIsolate
mkdir SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report
mkdir SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report
#QUAST
cp /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report/transposed_report.tsv ./SPAdes_allLibs_decontam_R1R2_noIsolate/quast_contigs_report/transposed_report.tsv
cp /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report/transposed_report.tsv ./SPAdes_allLibs_decontam_R1R2_noIsolate/quast_scaffolds_report/transposed_report.tsv
#BUSCO
cp -R /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/busco-results .
A few notes:
- Library names can be obtained from the file names.
- The number of contigs is the last contig column in the quast report, named "# contigs".
- % Genome size completeness = "Total length" / genome size (or rounded genome max value) * 100 (see the worked example after this list).
- For QUAST, only report the row for the actual assembly (i.e. report "scaffolds", not "scaffolds_broken").
- For BUSCO, only report the "Complete and single-copy BUSCOs (S)" value.
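For example, you can compute % genome size completeness in the shell with `bc`; the total length here is made up, so substitute your own QUAST value and rounded genome size:

```sh
# % completeness = total assembly length / estimated genome size * 100
echo "scale=1; 598123456 * 100 / 635000000" | bc
# 94.1
```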
Species |Assembly |Data Type |SCAFIG |covcutoff |GenomeScope v. |No. of contigs |Largest contig |Total length |% Genome size completeness |N50 |L50 |Ns per 100 kbp |BUSCO single copy
------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------ |------
Sfa |A |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |A |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |B |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |B |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |C |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |C |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |allLibs |decontam |contigs |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
Sfa |allLibs |decontam |scaffolds |off |2 | ? | ? | ? | ? % | ? | ? | ? | ? %
We assess quality across multiple metrics since there is no single golden rule/metric for determining the best assembly. Often, it is clear that one of the libraries is better than the others, as it has better results across multiple metrics. However, sometimes this is not quite as clear as we would like, as different assemblies might fare better in some metrics and worse in others. Use the following table to help you decide which assembly is best:
Importance | Metric | Direction | Description |
---|---|---|---|
1st | BUSCO | Bigger is better | % of expected genes observed in your assembly |
2nd | N50 | Bigger is better | Length of the shortest contig from the set of contigs that (together) represent 50% of your assembly size |
3rd | Genome size completeness | Bigger is better | Length of the assembly divided by the estimated genome length |
4th | L50 | Smaller is better | Number of contigs that make up the set of contigs that (together) represent 50% of your assembly size |
5th | Largest contig | Bigger is better | Length of the largest contig |
If you are still undecided on which is the best assembly, raise your hand or talk to those around you in the workshop. Outside of the workshop, we would normally post the top candidates on the species slack channel and ask for opinions.
Note your best assembly in your species README.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#MAKE SURE YOU ARE ON TURING
#runSPADEShimem_R1R2_noisolate.sbatch <your user ID> <3-letter species ID> <library: all_2libs | all_3libs | 1 | 2 | 3> <contam | decontam> <genome size in bp> <species dir> <fq data dir>
#do not use trailing / in paths
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your_user_ID" "Sfa" "2" "contam" "635000000" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2"
Compare the QUAST contigs and scaffolds values of your contaminated assembly to your decontaminated assembly.
You will not do this for the workshop. You should compare your results to this table, however, to see how your species stacks up.
Add a new record for your species/assembly to the best_ssl_assembly_per_sp.tsv file.
Please note that you cannot paste a tab into `nano`, as it converts tabs to spaces. This means that you will have to enter each column manually one by one, or copy and paste multiple columns and then change the spaces between columns back to tabs to restore the tsv format.
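After editing, you can verify that the file is still tab-delimited; `cat -A` displays tabs as `^I`:

```sh
# tabs show up as ^I; runs of plain spaces mean the tsv format is broken
cat -A best_ssl_assembly_per_sp.tsv | less -S
```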
Once done, push your changes to GitHub and confirm that the tsv format is correct by opening best_ssl_assembly_per_sp.tsv on GitHub. Your browser should not be displaying code but instead a nice-looking table (aligned columns, etc.).
Assuming you have completed Step 9, you now know which library (or libraries) produced the best assembly. Compare your BUSCO values with those of the other species (for example, you can check the "best assembly table").
If your BUSCO values are much lower than those of most other species, it might be worth changing the `covcutoff auto` flag (by changing the datatype variable from "decontam" to "decontam_covAUTO").
Example:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your user ID" "Sfa" "2" "decontam_covAUTO" "635000000" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2_fqscrn_repaired"
Finally, run one more assembly using the contaminated data from the same library (or all together) that produced the best assembly (with or without the covcutoff flag).
Example:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
sbatch /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/runSPADEShimem_R1R2_noisolate.sbatch "your user ID" "Sfa" "3" "contam" "635000000" "~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name" "fq_fp1_clmp_fp2_fqscrn_repaired"
Move your `*out` files into the `logs` directory.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
mv *out logs
In this section, you will identify contigs and regions within contigs to be used as candidate regions to develop the probes for capture sequencing.
You will create the following 4 files (among others):
- `*.fasta.masked`: the masked fasta file
- `*.fasta.out.gff`: the gff file created from repeat masking (identifies the regions of the genome that were masked)
- `*_augustus.gff`: the gff file created from gene prediction (identifies putative coding regions)
- `*_per10000_all.bed`: the bed file with target regions for probe development (1 set of 2 probes per target region)
From your species directory, make a new directory for the probe design process.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
mkdir probe_design
Copy the necessary scripts and the best assembly (i.e. the `scaffolds.fasta` from the best assembly made with the DECONTAMINATED fq.gz files) into the `probe_design` directory.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
cp /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/WGprobe_annotation.sb probe_design
cp /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/scripts/WGprobe_bedcreation.sb probe_design
cp /home/e1garcia/shotgun_PIRE/2022_PIRE_omics_workshop/salarias_fasciatus/SPAdes_Sfa-CBas-B_decontam_R1R2_noIsolate/scaffolds.fasta probe_design
Rename the assembly to reflect the species and parameters used. Format to follow:
<3-letter species code>_scaffolds_<library>_<contam|decontam>_R1R2_noIsolate_<other treatments, if any>.fasta
To get this information, you can copy and paste the parameter information from the appropriate BUSCO directory:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name
#list the busco directories
ls -d busco_*
#identify the busco dir of the best assembly, copy the treatments (starting with the library)
#example,the busco dir for the best assembly for Sfa is `busco_contigs_results-SPAdes_Sfa-CBas-B_decontam_R1R2_noIsolate`
#then provide the species 3-letter code, scaffolds, and copy and paste the parameters from the busco dir after "SPAdes_"
cd probe_design
mv scaffolds.fasta Sfa_scaffolds_CBas-B_decontam_R1R2_noIsolate.fasta
Execute the first probe annotation script. This script will create:
- a repeat-masked fasta and gff file (`.fasta.masked` & `.fasta.out.gff`)
- a gff file with predicted gene regions (`augustus.gff`), and
- a sorted fasta index file that will act as a template for the `.bed` file (`.fasta.masked.fai`)
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
#WGprobe_annotation.sb <assembly name>
sbatch WGprobe_annotation.sb "Sfa_scaffolds_CBas-B_decontam_R1R2_noIsolate.fasta"
Execute the second script.
This will create the `.bed` file that will be sent for probe creation. The bed file identifies 5,000-bp regions (spaced 10,000 bp apart) in scaffolds > 10,000 bp long. These are the regions from which probes will potentially be developed.
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
#WGprobe_bedcreation.sb <assembly name>
sbatch WGprobe_bedcreation.sb "Sfa_scaffolds_CBas-B_decontam_R1R2_noIsolate.fasta"
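Once the job finishes, you can sanity-check the bed file: each line should span 5,000 bp, and consecutive regions on the same scaffold should start 10,000 bp apart. The scaffold names and coordinates below are illustrative only:

```sh
head -3 *_per10000_all.bed
# NODE_1_length_280192_cov_28.8   2500    7500
# NODE_1_length_280192_cov_28.8   12500   17500
# NODE_1_length_280192_cov_28.8   22500   27500
```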
Check Upper Limit
Open the `BEDprobes-*.out` file created by the second probe design script (`WGprobe_bedcreation.sb`) and check that the upper limit was set correctly. Record the longest contig, the upper limit used in the loop, and the number of identified regions and scaffolds in your species README.
The upper limit should be XX7500 (just under the longest scaffold length). Ex: if the longest scaffold is 88,888 bp, the upper limit should be 87,500; if the longest scaffold is 87,499 bp, the upper limit should be 77,500.
Example:
cat BEDprobes-415039.out
The longest scaffold is 280192
The upper limit used in loop is 277500
A total of 37134 regions have been identified from 16894 scaffolds
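This rule can be reproduced with a bit of shell arithmetic: the upper limit is the largest value ending in 7,500 that does not exceed the longest scaffold. A sketch that matches the example above (and the 88,888 → 87,500 and 87,499 → 77,500 cases):

```sh
# upper limit = largest value ending in 7500 that does not exceed the longest scaffold
longest=280192
echo $(( (longest - 7500) / 10000 * 10000 + 7500 ))
# 277500
```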
Move your out files into your `logs` directory:
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
mv *out ../logs
The last thing we need to do is create a text file with links to available genomes from the 5 most closely-related species.
Most likely, there won't be genomes available for your targeted species (or even its genus). The easiest way to search for any is to start with the family.
Go to the NCBI Genome repository and search for the family of your species. If you get more than 5 genomes in the family, narrow the search by looking for the genus. If you don't find any genomes at the family level, search higher classifications until you find some genome assemblies (i.e. order, class, etc.).
Once you get at least 5 genomes, you'll need to figure out their phylogenetic relationships in order to list the genomes from most to least closely related.
Search for phylogenies specific to your species. For your convenience, you can use the phylogenies from Betancur et al. BMC Evolutionary Biology (2017) 17:162 for fish phyla and fish families. These are an excellent resource for higher-order taxonomic groups, but only a few species per family are represented. If you need higher resolution (to distinguish relationships between species within families, for example), you will need to search specifically for your species.
Once your list is ready, create a file in your `probe_design` directory.
#example for Salarias fasciatus
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
nano closest_relative_genomes_salarias_fasciatus.txt
Closest genomes:
1. Salarias fasciatus - https://www.ncbi.nlm.nih.gov/genome/7248
2. Parablennius parvicornis - https://www.ncbi.nlm.nih.gov/genome/69445
3. Petroscirtes breviceps - https://www.ncbi.nlm.nih.gov/genome/7247
4. Ecsenius bicolor - https://www.ncbi.nlm.nih.gov/genome/41373
5. Gouania willdenowi - https://www.ncbi.nlm.nih.gov/genome/76090
Used Betancur et al. 2017 (All Blenniiformes)
---
**NOTE:** Sfa has a genome available in GenBank.
---
## Files to Send
**You do not need to do this for the workshop.**
Share the following files with Arbor Bio to aid in probe creation:
1. The repeat-masked fasta file (`.fasta.masked`)
2. The gff file with repeat-masked regions (`.fasta.out.gff`)
3. The gff file with predicted gene regions (`.augustus.gff`)
4. The bed file (`.bed`)
5. The text file with links to available genomes from the 5 most closely-related species (`closest_relative_genomes.txt`).
Make a directory named `files_for_ArborSci` inside your `probe_design` directory and move these files there:
```sh
cd ~/shotgun_PIRE/2022_PIRE_omics_workshop/your_name/probe_design
mkdir files_for_ArborSci
mv *.fasta.masked *.fasta.out.gff *.augustus.gff *bed closest* files_for_ArborSci
```
Finally, notify Eric by email (e1garcia@odu.edu) saying that your files are ready, and post a message in the species slack channel with the probe region and scaffold info (from your `BEDprobe*out` file) and the full path to your files.
Example of Slack post:
Probe Design Files Ready
A total of 13063 regions have been identified from 10259 scaffolds. The longest scaffold is 105644.
Files for Arbor Bio:
ls /home/e1garcia/shotgun_PIRE/pire_ssl_data_processing/spratelloides_gracilis/probe_design/files_for_ArborSci
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate.fasta.augustus.gff
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate.fasta.masked
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate.fasta.out.gff
Sgr_scaffolds_SgC0072C_contam_R1R2_noIsolate_great10000_per10000_all.bed
closest_relative_genomes_Spratelloides_gracilis.txt
Eric will then share these with Arbor BioSciences.
Now that we have a reference genome, what can we do with it? Follow the PSMC link to learn how we can use genomic data from a single individual to infer demographic history!
Follow the workshop's R repository to learn how you can analyze and illustrate your data.
Congratulations! You have completed the 2022 PIRE Omics Workshop. Good luck in your projects and happy coding!