Pipeline to analyze 3' mRNA sequencing from FFPE samples using UMIs
- The bash script `rnaseq_pipeline_umis.sh` can be used to obtain read counts from fastq files. The resources folder contains `adapters.fa.gz`, with the adapter and polyA sequences, and `config_env.yml`, which is the file used to create the working environment.
The output of this bash script includes:
- `<sample>.read_counts.txt`, which contains the read counts (counts folder).
- `<sample>.read_distribution.txt`, which contains the read distribution (counts folder).
- `<sample>.deduplicated.bam`, which is the deduplicated bam file.
There are other, less important outputs such as fastqc reports, bai indexes, trimming reports, UMI data, ...
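For downstream analysis you will usually want a single count matrix. As a minimal sketch (assuming the standard two-column, tab-separated output of htseq-count and hypothetical sample names S1 and S2), the per-sample count files can be inspected and joined like this:

```bash
# Peek at one sample: gene_name <TAB> count
head counts/S1.read_counts.txt

# Join two samples into a simple count matrix (sample names are hypothetical);
# the summary rows that htseq-count appends (lines starting with "__") are removed
join -t $'\t' \
  <(grep -v "^__" counts/S1.read_counts.txt | sort -k1,1) \
  <(grep -v "^__" counts/S2.read_counts.txt | sort -k1,1) \
  > counts/count_matrix.txt
```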
If you do not have git installed
conda install -c anaconda git
Then clone the entire repository to your local space
git clone https://github.com/malumbreslab/RNAseq_umis.git
cd RNAseq_umis
conda update --all
conda env create -f config_env.yml
conda activate rseq
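To verify that the environment was created correctly, you can check that the main tools used below are available (the exact tool list installed by config_env.yml is an assumption based on the steps described in this README):

```bash
# Sanity check: these tools should be on the PATH inside the rseq environment
fastqc --version
umi_tools --version
STAR --version
samtools --version | head -n 1
htseq-count --help | head -n 1
which bbduk.sh
```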
Create a data folder; in this folder you must place the compressed fastq files (example: S1.fastq.gz). Then create the counts, genome and fastqc folders.
mkdir data
mkdir counts genome fastqc
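For example, copy your compressed fastq files into the data folder and check that they are in place (the source path below is hypothetical):

```bash
# Copy the compressed fastq files into data/ (source path is hypothetical)
cp /path/to/raw/S1.fastq.gz /path/to/raw/S2.fastq.gz data/
ls data/
```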
Steps 6 and 7 are not necessary if you already have a genome index (or if you use the genome index from the shared folder on the cluster)
If you do not have the wget package installed
conda install -c anaconda wget
Then download the files from the URLs below and uncompress them
cd genome
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/latest/hg38.fa.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_RefSeq.bed.gz
gzip -d hg38.fa.gz
gzip -d hg38.ncbiRefSeq.gtf.gz
gzip -d hg38_RefSeq.bed.gz
cd ..
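After uncompressing, the genome folder should contain the files used in the following steps:

```bash
# Check the uncompressed reference files
ls -lh genome/
# Expected: hg38.fa  hg38.ncbiRefSeq.gtf  hg38_RefSeq.bed
```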
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir genome --genomeFastaFiles genome/hg38.fa --sjdbGTFfile genome/hg38.ncbiRefSeq.gtf --sjdbOverhang 75
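As a reminder, STAR recommends setting --sjdbOverhang to the read length minus 1 (75 here assumes 76 bp reads). When indexing finishes, the genome folder should contain STAR's index files:

```bash
# STAR index files expected in genome/ after genomeGenerate
ls genome/
# e.g. Genome  SA  SAindex  chrLength.txt  chrNameLength.txt  sjdbList.out.tab ...
```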
You must run this in your terminal shell, typing after the script the names of the different samples that you want to analyze, separated by spaces.
bash rnaseq_pipeline_umis.sh <sample1> <sample2> <sample3>
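For example, assuming the sample name is the fastq file name without the .fastq.gz extension (an assumption based on the S1.fastq.gz example above):

```bash
# Analyze data/S1.fastq.gz, data/S2.fastq.gz and data/S3.fastq.gz
bash rnaseq_pipeline_umis.sh S1 S2 S3
```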
Outputs:
- log.txt is the output file
- error.txt is the error file
Parameters:
- --mem is the memory
- -t is the time
- -J is the job name
- -c is the number of cores
- -o is the name of the output file
- -e is the name of the error file
- --wrap is the command that you want to run on the cluster
sbatch --mem=64G -t1440 -c 16 -J name -o log.txt -e error.txt --wrap "bash rnaseq_pipeline_umis.sh <sample1> <sample2> <sample3>"
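Once the job is submitted, it can be followed with standard SLURM commands, for example:

```bash
# Check the job status and follow the pipeline log
squeue -u $USER
tail -f log.txt
```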
The minimal requirements are:
- Memory (--mem): 64 GB
- Number of cores (-c): 16
- Time (-t): 1440 min
For a fastq file with 10 million reads the run time is about 40 min (without indexing the reference genome, which could take around 2 hours).
Once you have indexed the reference genome, you do not have to do it again to align other samples against the same reference genome.
- fastqc is used for quality control of the reads
- umi_tools (extract/dedup) is used to extract UMIs and deduplicate reads
- bbduk.sh is used to remove adapters and the polyA tail from reads
- STAR is recommended for RNA alignment
- htseq-count is used with -m union (more reads counted), -s yes (stranded reads) and -i gene_name (HUGO symbol)
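As an illustration only, the individual steps might be chained roughly as below for a single sample. This is a simplified sketch, not the exact content of rnaseq_pipeline_umis.sh; the UMI barcode pattern, bbduk trimming parameters, file paths and STAR options are assumptions that may differ from the actual script:

```bash
#!/bin/bash
# Simplified per-sample sketch of the pipeline steps (flags are illustrative)
SAMPLE=S1   # hypothetical sample name

# 1. Quality control of the raw reads
fastqc -o fastqc data/${SAMPLE}.fastq.gz

# 2. Move the UMIs into the read names (barcode pattern is an assumption)
umi_tools extract --stdin=data/${SAMPLE}.fastq.gz \
  --bc-pattern=NNNNNNNNNN \
  --stdout=data/${SAMPLE}.umi.fastq.gz

# 3. Remove adapters and polyA tails (parameters and adapters path are illustrative)
bbduk.sh in=data/${SAMPLE}.umi.fastq.gz out=data/${SAMPLE}.trimmed.fastq.gz \
  ref=adapters.fa.gz k=13 ktrim=r useshortkmers=t mink=5 qtrim=r trimq=10 minlength=20

# 4. Align to the indexed genome with STAR
STAR --runThreadN 16 --genomeDir genome \
  --readFilesIn data/${SAMPLE}.trimmed.fastq.gz --readFilesCommand zcat \
  --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ${SAMPLE}.

# 5. Index the alignment and deduplicate reads by UMI
samtools index ${SAMPLE}.Aligned.sortedByCoord.out.bam
umi_tools dedup --stdin=${SAMPLE}.Aligned.sortedByCoord.out.bam \
  --stdout=${SAMPLE}.deduplicated.bam

# 6. Count reads per gene with the downloaded annotation
samtools index ${SAMPLE}.deduplicated.bam
htseq-count -f bam -m union -s yes -i gene_name \
  ${SAMPLE}.deduplicated.bam genome/hg38.ncbiRefSeq.gtf > counts/${SAMPLE}.read_counts.txt
```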