01_Overview

Welcome to the RBL_RBL3 wiki!

Getting Started

You'll need to prepare your local filesystem by cloning the GitHub repository for RBL3 and loading Snakemake. A tutorial is available for local users.

Clone the GitHub repository to your local filesystem.

# Clone Repository from Github
git clone https://github.com/RBL-NCI/RBL_RBL3.git

# Change your working directory to the RBL3 repo
cd RBL_RBL3/

Please make sure that snakemake>=5.19 is in your $PATH. If you are on Biowulf, load the following environment module:

# Recommend running snakemake>=5.19
module load snakemake
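
As a rough sketch of how the version requirement above could be checked, the snippet below looks for `snakemake` on `$PATH`, reads the output of `snakemake --version`, and compares it with 5.19. The helper names (`parse_version`, `meets_minimum`, `installed_snakemake_version`) are hypothetical and not part of the pipeline.

```python
import re
import shutil
import subprocess

# Minimum version stated in this guide.
MIN_VERSION = (5, 19)


def parse_version(text):
    """Return (major, minor) from a string such as '5.19.3', or None."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(text, minimum=MIN_VERSION):
    """True if the version string is at least `minimum`."""
    version = parse_version(text)
    # Tuples compare numerically, so (5, 4) < (5, 19) as intended.
    return version is not None and version >= minimum


def installed_snakemake_version():
    """Return the version string reported by snakemake, or None."""
    exe = shutil.which("snakemake")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None


if __name__ == "__main__":
    found = installed_snakemake_version()
    if found is None:
        print("snakemake not found on $PATH (or its version could not be read)")
    elif meets_minimum(found):
        print(f"snakemake {found} satisfies >= 5.19")
    else:
        print(f"snakemake {found} is older than 5.19")
```

Comparing integer tuples rather than strings avoids treating "5.4" as newer than "5.19".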

Preparing Configs and Manifests

There are config requirements for this pipeline, which must be found in the /path/to/RBL3/config directory. These files are:

snakemake_config.yaml - this file will contain directory paths and user parameters for analysis:
- source_dir: path to snakemake file, within the cloned RBL3 repository; example: '/path/to/RBL3/'
- out_dir: path to created output directory, where output will be stored; example: '/path/to/output/'
- sample_manifest: path to multiplex manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
- deg_manifest: path to DEG manifest (see specific details below); example: '/path/to/DEG_manifest.tsv'
- fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
- masked_reference: "Y" # indicate "Y" if a masked reference file (identified below) should be used
- clean_transcripts: "Y" # if "Y", TranscriptClean will be run; read the feature info in the documentation
- annotation_id: annotation ID used with project; example 'SIRV_annot'
- build_id: annotation build id; example 'hg38'
- platform_id: platform used for sequencing; example 'Illumina'
- maxFracA: decimal value indicating the maximum fraction of A's in the 20bp interval after alignment in a transcript; default 0.5
- minCount: numeric value indicating the minimum number of transcripts that must be detected in replicates, when applicable; default 2
- minDatasets:
- primer_length: match the length of the T sequence in your primer; default 20
- percent_similarity: lowest percentage of A in the 20bp interval after alignment; default 20
- annotation_gtf: path to annotation gtf file; default /data/CCBR/projects/rbl3/dependencies/gencode.v30.annotation.gtf
- annotation_gff: path to annotation gff3 file; default /data/CCBR/projects/rbl3/dependencies/Homo_sapiens_GRCh38_Ensembl_86.gff3
- annotation_fa: path to annotation fa file; default /data/CCBR/projects/rbl3/dependencies/hg38_cleanheader.fa
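
To illustrate how the parameters above might be sanity-checked before a run, here is a minimal sketch that works on a config already loaded into a Python dictionary. It assumes every key listed above is required, which this guide does not state explicitly, and `check_config` is a hypothetical helper rather than part of the pipeline.

```python
# Keys listed in this guide for snakemake_config.yaml.
# Assumption: all of them must be present.
REQUIRED_KEYS = [
    "source_dir", "out_dir", "sample_manifest", "deg_manifest", "fastq_dir",
    "masked_reference", "clean_transcripts", "annotation_id", "build_id",
    "platform_id", "maxFracA", "minCount", "minDatasets", "primer_length",
    "percent_similarity", "annotation_gtf", "annotation_gff", "annotation_fa",
]


def check_config(config):
    """Return a list of problems found in a config dictionary."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    frac = config.get("maxFracA")
    # maxFracA is described as a decimal fraction, so it should lie in [0, 1].
    if isinstance(frac, (int, float)) and not 0 <= frac <= 1:
        problems.append(f"maxFracA should be between 0 and 1, got {frac}")
    return problems


# Example with placeholder values only.
example = {key: "placeholder" for key in REQUIRED_KEYS}
example["maxFracA"] = 0.5
print(check_config(example))
```

An empty list means no problems were detected by these two checks; it does not confirm that the paths exist.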

There are two manifest requirements for this pipeline, with paths identified in the snakemake_config.yaml file above. These files are:

samples_manifest.tsv - this manifest will include metadata information for each sample, and map input files to their sample ids
- filename: the fastq file name without an extension. example: 'barcode01' for a sample with the input barcode01.fastq
- sampleid: the final sample name; this column must be unique. example: 'wt'
- groupid: the groupid for samples, which may or may not be unique values. NOTE: underscores cannot be used. example: 'ko.drosha'
- batchid: the batchid for each sample. example: 'b1'
An example samples_manifest.tsv file (note that tabs separate each column):

```
filename	sampleid	groupid	batchid
barcode01	wt	wt	b1
barcode02	ko.drosha	ko.drosha	b1
barcode03	ko.dicer	ko.dicer	b1
```
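
The column rules above (required columns, unique `sampleid`, no underscores in `groupid`) can be checked with a short script. This is a sketch under those stated rules only; `check_sample_manifest` is a hypothetical helper, not a function provided by the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ["filename", "sampleid", "groupid", "batchid"]


def check_sample_manifest(text):
    """Return a list of problems found in tab-separated manifest text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sampleid = row["sampleid"] or ""
        groupid = row["groupid"] or ""
        if sampleid in seen:
            problems.append(f"line {line_no}: duplicate sampleid '{sampleid}'")
        seen.add(sampleid)
        if "_" in groupid:
            problems.append(f"line {line_no}: groupid '{groupid}' contains an underscore")
    return problems


example = (
    "filename\tsampleid\tgroupid\tbatchid\n"
    "barcode01\twt\twt\tb1\n"
    "barcode02\tko.drosha\tko.drosha\tb1\n"
    "barcode03\tko.dicer\tko.dicer\tb1\n"
)
print(check_sample_manifest(example))  # prints []
```

To check a file on disk, read it first, for example `check_sample_manifest(open(path).read())`.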
deg_manifest.tsv - this manifest will include information on which samples to compare
- group1 / group2: each column must be a groupid found in the samples_manifest.tsv. Comparisons are only done between two groups of samples.
An example deg_manifest.tsv file (note that tabs separate each column):

```
group1	group2
wt	ko.dicer
wt	ko.drosha
```
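
Because each `group1` / `group2` value must be a `groupid` from the sample manifest, the two files can be cross-checked before running. The sketch below assumes both manifests are tab-separated with the headers described in this guide; `unknown_deg_groups` is a hypothetical helper.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def unknown_deg_groups(samples_text, deg_text):
    """Return sorted DEG group names that are not groupids in the sample manifest."""
    known = {row["groupid"] for row in read_tsv(samples_text)}
    unknown = set()
    for row in read_tsv(deg_text):
        for column in ("group1", "group2"):
            if row[column] not in known:
                unknown.add(row[column])
    return sorted(unknown)


samples = (
    "filename\tsampleid\tgroupid\tbatchid\n"
    "barcode01\twt\twt\tb1\n"
    "barcode02\tko.drosha\tko.drosha\tb1\n"
    "barcode03\tko.dicer\tko.dicer\tb1\n"
)
deg = "group1\tgroup2\nwt\tko.dicer\nwt\tko.drosha\n"
print(unknown_deg_groups(samples, deg))  # prints []
```

Any name in the returned list points to a comparison that refers to a group absent from the sample manifest.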

Running Pipeline

Dry-Run

sh run_snakemake.sh dry-run

Execute pipeline on the cluster

sh run_snakemake.sh cluster

Execute pipeline locally

sh run_snakemake.sh local

Unlock directory (after failed partial run)

sh run_snakemake.sh unlock

Expected Outputs

The following directories are created under the output directory (out_dir):

log: log files for slurm jobs, config files used in each run
01_fastq: fastq files used in run, zipped
01_fastq_trimmed: fastq files after trimming
02_sam: aligned sam files
if clean_transcripts="Y":
- 02_sam_corrected: sam files after transcript cleanup
03_bam: bam files, sorted and indexed
04_qc:
- fastqc: individual outputs of FastQC
- samtools: individual outputs of samtools stats
- multiqc_report.html: compiled multiqc report for all samples
- qc_report.html: alignment statistics; if masked_reference="Y" this will provide a comparison of unmasked and masked alignments
05_talon:
- talon_config.csv
- {build_id}.db
- sam_labeled
- annotate
- counts: includes the talon_summary.tsv, talon_abundance.tsv, whitelist.txt and talon_abundance_filtered.tsv
- gtf
06_flair:
- merged.fastq.gz
- flair_config.csv
- isoforms: fa files of merged isoforms
- fastq: individual fastq files after flair processing
- counts: flair_counts_matrix.tsv
07_report:
- summary report of TALON (with and without filtering) and FLAIR outputs and annotations
08_deg:
- DEG summaries for comparisons given

Troubleshooting

Check your email for a notification regarding pipeline failure
Review the logs to determine which rule failed (logs are named by Snakemake rule)
Review 04_qc/qc_report.html to determine if poor performance was related to barcode mismatching or alignment

cd /path/to/output/dir/log

Address the error, unlock the directory (see "Unlock directory" in Running Pipeline), and re-execute the pipeline on the cluster or locally (see Running Pipeline)
