01_Overview

Samantha edited this page Jul 28, 2021 · 1 revision

Welcome to the RBL_RBL3 wiki!

Getting Started

You'll need to prepare your local filesystem by cloning the RBL_RBL3 GitHub repository and loading Snakemake. A tutorial is available for local users.

  1. Clone the GitHub repository to your local filesystem.
# Clone the repository from GitHub
git clone https://github.com/RBL-NCI/RBL_RBL3.git

# Change your working directory to the RBL3 repo
cd RBL_RBL3/
  2. Make sure that snakemake>=5.19 is in your $PATH. If you are on Biowulf, load the following environment module:
# Recommend running snakemake>=5.19
module load snakemake
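
To confirm that the loaded Snakemake meets the 5.19 minimum, a version comparison along the following lines may help. This is a minimal sketch: `version_ge` is a hypothetical helper that is not part of the pipeline, and it assumes a `sort` that supports `-V` (as GNU coreutils does).

```shell
# Hypothetical helper, not part of the RBL_RBL3 repository.
# Returns 0 (true) when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 when that is $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use (requires snakemake on $PATH):
#   version_ge "$(snakemake --version)" "5.19" || echo "snakemake is older than 5.19" >&2
```

The check relies only on `snakemake --version`, so it can be run after `module load snakemake` or in any other environment where Snakemake is on the `$PATH`.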

Preparing Configs and Manifests

There are configuration requirements for this pipeline, which must be found in the /path/to/RBL3/config directory. The required file is:

  1. snakemake_config.yaml - this file will contain directory paths and user parameters for analysis;
    • source_dir: path to snakemake file, within the cloned RBL3 repository; example: '/path/to/RBL3/'
    • out_dir: path to created output directory, where output will be stored; example: '/path/to/output/'
    • sample_manifest: path to multiplex manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
    • deg_manifest: path to DEG manifest (see specific details below); example: '/path/to/DEG_manifest.tsv'
    • fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
    • masked_reference: "Y" #if a masked reference file (identified below) should be used indicate "Y"
    • clean_transcripts: "Y" #if Y Transcriptclean will be run - read feature info in documentation
    • annotation_id: annotation ID used with project; example 'SIRV_annot'
    • build_id: annotation build id; example 'hg38'
    • platform_id: platform used for sequencing; example 'Illumina'
    • maxFracA: decimal value indicating the maximum percent of A's in the 20bp interval after alignment in a transcript; default 0.5
    • minCount: numeric value indicating the minimum number of transcripts that must be detected in replicates, when applicable; default 2
    • minDatasets:
    • primer_length: match the length of the T sequence in your primer, default [20]
    • percent_similarity: lowest percentage of A in the 20bp interval after alignment default [20]
    • annotation_gtf: path to annotation gtf file; default /data/CCBR/projects/rbl3/dependencies/gencode.v30.annotation.gtf
    • annotation_gff: path to annotation gff3 file; default /data/CCBR/projects/rbl3/dependencies/Homo_sapiens_GRCh38_Ensembl_86.gff3
    • annotation_fa: path to annotation fa file; default /data/CCBR/projects/rbl3/dependencies/hg38_cleanheader.fa
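
Before launching a run, it can be useful to confirm that the config file defines every parameter listed above. The following is a minimal sketch under two assumptions: `missing_config_keys` is a hypothetical helper (not part of the pipeline), and each parameter appears as a top-level `key:` at the start of a line in snakemake_config.yaml.

```shell
# Hypothetical helper, not part of the RBL_RBL3 repository.
# Prints every parameter named in this page that is absent from the config file $1.
# Assumes each parameter is a top-level YAML key written at the start of a line.
missing_config_keys() {
    for key in source_dir out_dir sample_manifest deg_manifest fastq_dir \
               masked_reference clean_transcripts annotation_id build_id \
               platform_id maxFracA minCount minDatasets primer_length \
               percent_similarity annotation_gtf annotation_gff annotation_fa; do
        # grep -q is silent; print the key only when no matching line exists.
        grep -q "^${key}:" "$1" || echo "$key"
    done
}

# Example use:
#   missing_config_keys /path/to/RBL3/config/snakemake_config.yaml
```

An empty result means every listed key was found; it does not check that the values themselves are valid.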

There are two manifest requirements for this pipeline, with paths identified in the snakemake_config.yaml file described above. These files are:

  1. samples_manifest.tsv - this manifest will include metadata information for each sample, and map input files to their sample ids

    • filename: the fastq file name without an extension. example: 'barcode01' for a sample with the input barcode01.fastq
    • sampleid: the final sample name; this column must be unique. example: 'wt'
    • groupid: the groupid for samples, which may or may not be unique values. NOTE: underscores cannot be used. example: 'ko.drosha'
    • batchid: the batchid for each sample. example: 'b1'
    An example samples_manifest.tsv file (note tabs separate each column):
    
    filename	sampleid	groupid	batchid
    barcode01	wt	wt	b1
    barcode02	ko.drosha	ko.drosha	b1
    barcode03	ko.dicer	ko.dicer	b1
    
  2. deg_manifest.tsv - this manifest will include information on which samples to compare

    • group1 / group2: each column must be a groupid found in the samples_manifest.tsv. Comparisons are only done between two groups of samples.
    An example deg_manifest.tsv file (note tabs separate each column):
    
    group1	group2
    wt	ko.dicer
    wt	ko.drosha
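
The manifest rules above (unique sampleid values, no underscores in groupid, and DEG groups that exist in the samples manifest) can be checked before a run. The sketch below uses two hypothetical helpers, `check_samples_manifest` and `check_deg_manifest`, which are not part of the pipeline. It assumes both files have a header row and exactly one tab between columns, with no leading indentation.

```shell
# Hypothetical helpers, not part of the RBL_RBL3 repository.

# Reports duplicate sampleid values and groupid values containing underscores.
# $1: path to the samples manifest (columns: filename, sampleid, groupid, batchid).
check_samples_manifest() {
    awk -F'\t' '
        NR == 1 || NF == 0 { next }   # skip the header row and blank lines
        seen[$2]++ { printf "duplicate sampleid: %s\n", $2; bad = 1 }
        $3 ~ /_/   { printf "underscore in groupid: %s\n", $3; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Reports group1/group2 values that are not a groupid in the samples manifest.
# $1: path to the samples manifest, $2: path to the DEG manifest.
check_deg_manifest() {
    awk -F'\t' '
        FNR == 1 || NF == 0 { next }        # skip each header row and blank lines
        NR == FNR { groups[$3] = 1; next }  # first file: collect the groupids
        {
            for (i = 1; i <= 2; i++)
                if (!($i in groups)) { printf "unknown groupid: %s\n", $i; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1" "$2"
}

# Example use:
#   check_samples_manifest /path/to/sample_manifest.tsv
#   check_deg_manifest /path/to/sample_manifest.tsv /path/to/DEG_manifest.tsv
```

Both functions print one line per problem and return a non-zero status if any problem is found. A doubled tab between columns shows up as an empty "unknown groupid" entry.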
    

Running Pipeline

  1. Dry-Run
sh run_snakemake.sh dry-run
  2. Execute pipeline on the cluster
sh run_snakemake.sh cluster
  3. Execute pipeline locally
sh run_snakemake.sh local
  4. Unlock directory (after failed partial run)
sh run_snakemake.sh unlock
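
One way to combine these steps is to submit to the cluster only after a clean dry-run. The sketch below is illustrative: `run_checked` is a hypothetical wrapper that is not part of the pipeline, and it assumes run_snakemake.sh exits with a non-zero status when the dry-run fails, which this page does not confirm.

```shell
# Hypothetical wrapper, not part of the RBL_RBL3 repository.
# Runs the dry-run first and submits to the cluster only if the dry-run succeeds.
# $1: path to the run script (defaults to run_snakemake.sh in the current directory).
# Assumes the script returns a non-zero exit status on a failed dry-run.
run_checked() {
    sh "${1:-run_snakemake.sh}" dry-run || {
        echo "dry-run failed; not submitting to the cluster" >&2
        return 1
    }
    sh "${1:-run_snakemake.sh}" cluster
}

# Example use, from the RBL_RBL3 directory:
#   run_checked
```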

Expected Outputs

The following directories are created under the output directory (out_dir):

  • log: log files for slurm jobs, config files used in each run
  • 01_fastq: fastq files used in run, zipped
  • 01_fastq_trimmed: fastq files after trimming
  • 02_sam: aligned sam files
  • if clean_transcripts is "Y":
    • 02_sam_corrected: sam files after transcript cleanup
  • 03_bam: bam files, sorted and indexed
  • 04_qc:
    • fastqc: individual FastQC outputs
    • samtools: individual outputs of samstats
    • multiqc_report.html: compiled multiqc report for all samples
    • qc_report.html: alignment statistics; if masked_reference is "Y", this provides a comparison of unmasked and masked alignments
  • 05_talon:
    • talon_config.csv
    • {build_id}.db
    • sam_labeled
    • annotate
    • counts: includes the talon_summary.tsv, talon_abundance.tsv, whitelist.txt and talon_abundance_filtered.tsv
    • gtf
  • 06_flair:
    • merged.fastq.gz
    • flair_config.csv
    • isoforms: fa files of merged isoforms
    • fastq: individual fastq files after flair processing
    • counts: flair_counts_matrix.tsv
  • 07_report:
    • summary report of TALON (with and without filtering) and FLAIR outputs and annotations
  • 08_deg:
    • DEG summaries for comparisons given

Troubleshooting

  • Check your email for a notification of pipeline failure
  • Review the logs to determine which rule failed (logs are named by Snakemake rule)
cd /path/to/output/dir/log
  • Review 04_qc/qc_report.html to determine whether poor performance was related to barcode mismatching or alignment
  • Address the error, unlock the directory (Step 4 in Running Pipeline), and re-execute pipeline (Step 2 or 3 in Running Pipeline)
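
To narrow down which logs to read first, the log directory can be searched for files that mention an error. This is a rough sketch: `logs_with_errors` is a hypothetical helper that is not part of the pipeline, and it assumes failures leave the word "error" somewhere in the affected log, which may not hold for every rule or tool.

```shell
# Hypothetical helper, not part of the RBL_RBL3 repository.
# Lists files under the log directory $1 that contain "error" (case-insensitive).
# Assumes failed jobs write the word "error" to their log file.
logs_with_errors() {
    # -r: search recursively, -l: print file names only, -i: ignore case
    grep -rli "error" "$1" | sort
}

# Example use:
#   logs_with_errors /path/to/output/dir/log
```

An empty result does not guarantee a clean run, so the email notification and the QC report remain the primary checks.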