01_Overview
Samantha edited this page Jul 28, 2021 · 1 revision
Welcome to the RBL_RBL3 wiki!
You'll need to prepare your local filesystem by cloning the RBL3 GitHub repository and loading Snakemake. A tutorial is available for local users.
- Clone the GitHub repository to your local filesystem.
# Clone the repository from GitHub
git clone https://github.com/RBL-NCI/RBL_RBL3.git
# Change your working directory to the RBL3 repo
cd RBL_RBL3/
- Make sure that snakemake>=5.19 is in your $PATH. If you are on Biowulf, load the following environment module:
# Recommend running snakemake>=5.19
module load snakemake
This pipeline requires a configuration file, which must be located in the /path/to/RBL3/config directory:
- snakemake_config.yaml - this file contains the directory paths and user parameters for the analysis:
- source_dir: path to snakemake file, within the cloned RBL3 repository; example: '/path/to/RBL3/'
- out_dir: path to created output directory, where output will be stored; example: '/path/to/output/'
- sample_manifest: path to multiplex manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
- deg_manifest: path to DEG manifest (see specific details below); example: '/path/to/DEG_manifest.tsv'
- fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
- masked_reference: set to "Y" if a masked reference file (identified below) should be used; example: "Y"
- clean_transcripts: set to "Y" to run TranscriptClean (read the feature information in its documentation); example: "Y"
- annotation_id: annotation ID used with project; example 'SIRV_annot'
- build_id: annotation build id; example 'hg38'
- platform_id: platform used for sequencing; example 'Illumina'
- maxFracA: decimal value indicating the maximum fraction of A's in the 20bp interval after alignment in a transcript; default 0.5
- minCount: numeric value indicating the minimum number of transcripts that must be detected in replicates, when applicable; default 2
- minDatasets:
- primer_length: length of the T sequence in your primer; default [20]
- percent_similarity: lowest percentage of A in the 20bp interval after alignment; default [20]
- annotation_gtf: path to annotation gtf file; default /data/CCBR/projects/rbl3/dependencies/gencode.v30.annotation.gtf
- annotation_gff: path to annotation gff3 file; default /data/CCBR/projects/rbl3/dependencies/Homo_sapiens_GRCh38_Ensembl_86.gff3
- annotation_fa: path to annotation fa file; default /data/CCBR/projects/rbl3/dependencies/hg38_cleanheader.fa
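As a rough illustration of how the parameters above fit together, the sketch below checks a parsed configuration against the keys listed on this page. The function name `check_config`, the idea that every listed key must be present, and the 0 to 1 range check on maxFracA are assumptions made for this example, not behaviour confirmed by the pipeline. The dictionary stands in for the result of parsing snakemake_config.yaml with a YAML library; its values are the examples and defaults given above.

```python
# Keys listed on this page for snakemake_config.yaml.
CONFIG_KEYS = [
    "source_dir", "out_dir", "sample_manifest", "deg_manifest", "fastq_dir",
    "masked_reference", "clean_transcripts", "annotation_id", "build_id",
    "platform_id", "maxFracA", "minCount", "minDatasets", "primer_length",
    "percent_similarity", "annotation_gtf", "annotation_gff", "annotation_fa",
]


def check_config(config):
    """Return a list of problems found in a parsed config dictionary.

    Assumption: every key listed on the page is expected to be present.
    """
    problems = [f"missing key: {key}" for key in CONFIG_KEYS if key not in config]
    # Assumption: maxFracA is a fraction, so it should lie between 0 and 1.
    if "maxFracA" in config:
        value = config["maxFracA"]
        if not isinstance(value, (int, float)) or not 0 <= value <= 1:
            problems.append(f"maxFracA should be between 0 and 1, got {value!r}")
    return problems


# Example values and defaults copied from the parameter list above.
example_config = {
    "source_dir": "/path/to/RBL3/",
    "out_dir": "/path/to/output/",
    "sample_manifest": "/path/to/sample_manifest.tsv",
    "deg_manifest": "/path/to/DEG_manifest.tsv",
    "fastq_dir": "/path/to/raw/fastq/files",
    "masked_reference": "Y",
    "clean_transcripts": "Y",
    "annotation_id": "SIRV_annot",
    "build_id": "hg38",
    "platform_id": "Illumina",
    "maxFracA": 0.5,
    "minCount": 2,
    "minDatasets": None,  # no description or default is given on this page
    "primer_length": 20,
    "percent_similarity": 20,
    "annotation_gtf": "/data/CCBR/projects/rbl3/dependencies/gencode.v30.annotation.gtf",
    "annotation_gff": "/data/CCBR/projects/rbl3/dependencies/Homo_sapiens_GRCh38_Ensembl_86.gff3",
    "annotation_fa": "/data/CCBR/projects/rbl3/dependencies/hg38_cleanheader.fa",
}

print(check_config(example_config))
```

A check like this could be run before a dry-run so that a missing key is reported up front rather than partway through a job.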
This pipeline requires two manifests, whose paths are set in the snakemake_config.yaml file described above:
- samples_manifest.tsv - this manifest includes metadata for each sample and maps input files to their sample ids
- filename: the fastq file name without an extension. example: 'barcode01' for a sample with the input barcode01.fastq
- sampleid: the final sample name; this column must be unique. example: 'wt'
- groupid: the groupid for samples, which may or may not be unique values. NOTE: underscores cannot be used. example: 'ko.drosha'
- batchid: the batchid for each sample. example: 'b1'
An example samples_manifest.tsv file (note tabs separate each column):
filename	sampleid	groupid	batchid
barcode01	wt	wt	b1
barcode02	ko.drosha	ko.drosha	b1
barcode03	ko.dicer	ko.dicer	b1
- deg_manifest.tsv - this manifest includes information on which samples to compare
- group1 / group2: each column must be a groupid found in the samples_manifest.tsv. Comparisons are only done between two groups of samples.
An example deg_manifest.tsv file (note tabs separate each column):
group1	group2
wt	ko.dicer
wt	ko.drosha
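The manifest rules above (unique sampleid values, no underscores in groupid, and group1 / group2 values that exist as groupids) can be checked before running the pipeline. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and this is not part of the RBL3 repository; the example rows are the ones shown above.

```python
import csv
import io

SAMPLE_COLUMNS = ["filename", "sampleid", "groupid", "batchid"]
DEG_COLUMNS = ["group1", "group2"]


def read_tsv(handle, required_columns):
    """Read a tab-separated manifest and confirm the required columns exist."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    return list(reader)


def check_samples(rows):
    """Report duplicate sampleid values and groupid values with underscores."""
    problems = []
    seen = set()
    for row in rows:
        sampleid = row["sampleid"] or ""
        groupid = row["groupid"] or ""
        if sampleid in seen:
            problems.append(f"duplicate sampleid: {sampleid}")
        seen.add(sampleid)
        if "_" in groupid:
            problems.append(f"groupid contains an underscore: {groupid}")
    return problems


def check_deg(deg_rows, sample_rows):
    """Report group1 / group2 values that are not groupids in the samples manifest."""
    groupids = {row["groupid"] for row in sample_rows}
    problems = []
    for row in deg_rows:
        for column in DEG_COLUMNS:
            if row[column] not in groupids:
                problems.append(f"{column} is not a known groupid: {row[column]}")
    return problems


# The example manifests from this page, with tabs between columns.
samples_text = (
    "filename\tsampleid\tgroupid\tbatchid\n"
    "barcode01\twt\twt\tb1\n"
    "barcode02\tko.drosha\tko.drosha\tb1\n"
    "barcode03\tko.dicer\tko.dicer\tb1\n"
)
deg_text = (
    "group1\tgroup2\n"
    "wt\tko.dicer\n"
    "wt\tko.drosha\n"
)

sample_rows = read_tsv(io.StringIO(samples_text), SAMPLE_COLUMNS)
deg_rows = read_tsv(io.StringIO(deg_text), DEG_COLUMNS)
print(check_samples(sample_rows))        # prints []
print(check_deg(deg_rows, sample_rows))  # prints []
```

To check real files, replace the `io.StringIO(...)` arguments with file handles opened via `open(path, newline="")`.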
- Dry-Run
sh run_snakemake.sh dry-run
- Execute pipeline on the cluster
sh run_snakemake.sh cluster
- Execute pipeline locally
sh run_snakemake.sh local
- Unlock directory (after failed partial run)
sh run_snakemake.sh unlock
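The four modes above can also be driven from a small script, for example to make sure a dry-run always precedes a real run. This is only a sketch: the helper names are hypothetical, it assumes run_snakemake.sh is invoked from the root of the cloned repository, and it assumes the script exits with a non-zero status when the dry-run fails, which this page does not confirm.

```python
import subprocess

# The four modes documented on this page.
MODES = ("dry-run", "cluster", "local", "unlock")


def build_command(mode):
    """Return the argument list for one of the documented run modes."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return ["sh", "run_snakemake.sh", mode]


def dry_run_then(mode, repo_dir):
    """Run a dry-run first, then the requested mode.

    check=True raises CalledProcessError on a non-zero exit status, so the
    second command only starts if the dry-run succeeded (assumption: the
    script reports failure through its exit status).
    """
    subprocess.run(build_command("dry-run"), cwd=repo_dir, check=True)
    subprocess.run(build_command(mode), cwd=repo_dir, check=True)


print(build_command("cluster"))
```

Calling `dry_run_then("cluster", "/path/to/RBL3/")` would then run the dry-run and submit to the cluster only if it passes.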
The following directories are created under the output directory (out_dir):
- log: log files for slurm jobs, config files used in each run
- 01_fastq: fastq files used in run, zipped
- 01_fastq_trimmed: fastq files after trimming
- 02_sam: aligned sam files
- if clean_transcripts is "Y":
- 02_sam_corrected: sam files after transcript cleanup
- 03_bam: bam files, sorted and indexed
- 04_qc:
- fastqc: individual outputs of fastqc
- samtools: individual outputs of samtools stats
- multiqc_report.html: compiled multiqc report for all samples
- qc_report.html: alignment statistics; if masked_reference is "Y", this will provide a comparison of unmasked and masked alignments
- 05_talon:
- talon_config.csv
- {build_id}.db
- sam_labeled
- annotate
- counts: includes the talon_summary.tsv, talon_abundance.tsv, whitelist.txt and talon_abundance_filtered.tsv
- gtf
- 06_flair
- merged.fastq.gz
- flair_config.csv
- isoforms: fa files of merged isoforms
- fastq: individual fastq files after flair processing
- counts: flair_counts_matrix.tsv
- 07_report:
- summary report of TALON (with and without filtering) and FLAIR outputs and annotations
- 08_deg:
- DEG summaries for comparisons given
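After a run, the directory layout above can be used as a quick completeness check. The sketch below lists which of the top-level directories named on this page are absent from an output directory. The function name and the `transcripts_cleaned` flag are hypothetical, and a missing directory is only a hint that a step did not run, not proof of failure.

```python
from pathlib import Path
import tempfile

# Top-level output directories listed on this page.
EXPECTED_DIRS = [
    "log", "01_fastq", "01_fastq_trimmed", "02_sam", "03_bam",
    "04_qc", "05_talon", "06_flair", "07_report", "08_deg",
]


def missing_output_dirs(out_dir, transcripts_cleaned=False):
    """Return the expected directory names that do not exist under out_dir."""
    expected = list(EXPECTED_DIRS)
    if transcripts_cleaned:
        # 02_sam_corrected is only created when transcript cleanup is enabled.
        expected.append("02_sam_corrected")
    return [name for name in expected if not (Path(out_dir) / name).is_dir()]


# Demonstration on a temporary directory that holds only two of the outputs.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("log", "01_fastq"):
        (Path(tmp) / name).mkdir()
    print(missing_output_dirs(tmp, transcripts_cleaned=True))
```

In practice the function would be pointed at the out_dir value from snakemake_config.yaml.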
- Check your email for a message regarding pipeline failure
- Review the logs to determine what rule failed (logs are named by Snakemake rule)
- Review 04_qc/qc_report.html to determine if poor performance was related to barcode mismatching or alignment
cd /path/to/output/dir/log
- Address the error, unlock the directory (Step 4 in Running Pipeline), and re-execute pipeline (Step 2 or 3 in Running Pipeline)
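When many rules have run, it can help to narrow down which log files to open first. The sketch below lists the files in the log directory whose text contains a keyword. It assumes that a failed rule's log contains the word "error" in some capitalisation, which may not hold for every tool in the pipeline, and the function name is hypothetical.

```python
from pathlib import Path
import tempfile


def logs_mentioning(log_dir, keyword="error"):
    """Return the names of files in log_dir whose text contains keyword.

    The match is case-insensitive; undecodable bytes are replaced so that
    binary content in a log does not stop the scan.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        if keyword.lower() in text.lower():
            hits.append(path.name)
    return hits


# Demonstration with two made-up log files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "rule_a.log").write_text("finished successfully\n")
    (Path(tmp) / "rule_b.log").write_text("Error: job exited early\n")
    print(logs_mentioning(tmp))  # prints ['rule_b.log']
```

For a real run, the function would be called on /path/to/output/dir/log.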