Skip to content
scarbonellsala edited this page Apr 10, 2023 · 50 revisions

Background and objectives

ATAC-Seq, Assay for Transposase-Accessible Chromatin with sequencing, is a popular method for determining chromatin accessibility across the genome. The ATAC-Seq allows us to sequence open chromatin regions, elucidating how chromatin packaging and other factors may affect gene expression. The method can help identify promoter regions and potential enhancers and silencers.

The goal of this practice is to identify regulatory regions using ATAC-Seq data. The task is divided into two sections:

  1. EN‐TEx ATAC‐seq data analysis:

In this section, we will use two tissues (stomach and sigmoid_colon) from EN-TEx donor 2 ATAC-seq data from ENCODE portal to obtain a set of ATAC-seq peaks in these two tissues that lie outside gene coordinates.

  1. Distal regulatory activity:

Next, we will use the ATAC-seq peaks obtained in the previous section to build a catalogue of distal regulatory regions.

Set up the environment

We start running the following docker container:

sudo docker run -v /Users/scarbonell/Desktop/Epigenomics:/Users/scarbonell/Desktop/Epigenomics -w /Users/scarbonell/Desktop/Epigenomics --rm -it dgarrimar/epigenomics_course

We prepare directories to store the results of the analyses using the same structure we used for ChIP-seq analysis. For more details, see here.

EN‐TEx ATAC‐seq data: downstream analyses

In this section, we will generate the ATAC-seq peaks which correspond to regions of open chromatin, which will allow us to identify regulatory regions.

Download ENCODE ATAC-seq metadata from EN-TEx donor 2. To obtain the ATAC-seq ENCODE data, we need to generate ATAC-seq peaks we will follow the same procedure described in How to navigate the ENCODE portal. And finally, we run:

mkdir ATAC-seq
cd ATAC-seq
../bin/download.metadata.sh "https://www.encodeproject.org/metadata/?replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&status=submitted&status=in+progress&assay_slims=DNA+accessibility&assay_title=ATAC-seq&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&type=Experiment"

Explore the structure of the metadata file by having a look at the header and, with awk find the field we are interested in

head -1 metadata.tsv
head -1 metadata.tsv | awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}'

Inside epigenomics_uvic/ATAC-seq create the folders analyses data/bigBed.files to store the data

mkdir -p analyses data/bigBed.files

Parse the metadata.tsv file to retrieve bigBed ATAC-seq data, get the columns: File_accession, Biosample_term_name and Experiment_target

grep -F ATAC-seq metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" | \
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/bigBed.peaks.ids.txt 

with:

cat analyses/bigBed.peaks.ids.txt

we will see the ATAC-seq bigBed ID for sigmoid_colon (ENCFF287UHP) and stomach (ENCFF762IFP)

With the ID we can download them from ENCODE portal by doing:

cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
  wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed";
done

Check their integrity by verifying their MD5 hash (digital fingerprint of the file) Check the integrity of the files by verifying their MD5 hash in column 46:

for file_type in bigBed; do

  # retrieve the original MD5 hash from the metadata
  ../bin/selectRows.sh <(cut -f1 analyses/"$file_type".*.ids.txt) metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt

  # compute MD5 hash on the downloaded files
  cat data/"$file_type".files/md5sum.txt |\
  while read filename original_md5sum; do
    md5sum data/"$file_type".files/"$filename"."$file_type" |\
    awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}' 
  done > tmp 
  mv tmp data/"$file_type".files/md5sum.txt

  # make sure there are no files for which original and computed MD5 hashes differ
  awk '$2 != $3' data/"$file_type".files/md5sum.txt

done

We can double-check the files to be sure they are correct by doing:

cat data/bigBed.files/md5sum.txt

And we will obtain:

ATAC_md5sum

Annotation:

Now we need the annotation files, which will need to be the same GENCODE annotation version used by ENCODE to analyse the data.

Annotation files needed (GENCODE annotation v24):

  • gencode.v24.protein.coding.non.redundant.TSS.bed (list of promoters, [-2 kb, +2 Kb] from TSS, of protein-coding genes)
  • gencode.v24.protein.coding.gene.body.bed

Instead of downloading them again as we did here we can make a copy from ChIP-seq/annotation and move to the folder ATAC-seq/annotation

mkdir annotation
cp /Users/scarbonell/Desktop/Epigenomics/epigenomics_uvic/ChIP-seq/annotation/{gencode.v24.protein.coding.non.redundant.TSS.bed,gencode.v24.protein.coding.gene.body.bed} annotation

Convert bigBed files of ATAC-seq peaks to BED files with the bigBedToBed command:

mkdir data/bed.files

cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
  bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done

Now that we have the files, we can start addressing task questions:

1. Report the number of peaks that intersect promoter regions

Retrieve peaks of ATAC-seq at the promoter region in each tissue:

mkdir analyses/peaks.analysis
cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do 
  bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.non.redundant.TSS.bed -u > analyses/peaks.analysis/promoters.with.peaks."$tissue".ATACseq.bed
done

Output files are:

head analyses/peaks.analysis/promoters.with.peaks.*.ATACseq.bed
head_promoters_with_peaks

Finally, to report the ATAC-seq peaks in the promoter regions, we do:

wc -l analyses/peaks.analysis/promoters.with.peaks.*.ATACseq.bed

And we get:

promoters_with_peaks

Sumariszing we have: 47871 ATAC-seq peaks in the promoter regions for Sigmoid Colon and 44749 for Stomach sample.

2. the number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions).

Combine the two bed files:

cat annotation/gencode.v24.protein.coding.gene.body.bed annotation/gencode.v24.protein.coding.non.redundant.TSS.bed > annotation/gencode.v24.genes-promoters.bed

cut -f-2 analyses/bigBed.peaks.ids.txt |\
   while read filename tissue; do
     bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.genes-promoters.bed -v > analyses/peaks.analysis/peaks.outside.genes."$tissue".ATACseq.bed
done

Explore the files and count the number of peaks for each tissue

head analyses/peaks.analysis/peaks.outside.genes.*.ATACseq.bed
head_peaks ouside gens

And to get the number of peaks outside genes:

wc -l analyses/peaks.analysis/peaks.outside.genes.*.ATACseq.bed

And we get:

peaks_outside_genes

Summarizing, we have 35306 ATAC-seq peaks outside genes for Sigmoid Colon and 32795 for Stomach sample.

Distal regulatory activity

1. Create a folder regulatory_elements inside epigenomics_uvic. This will be the folder where you store all your subsequent results.

Move to folder epigenomics_uvic

cd ..
mkdir regulatory_elements
cd regulatory_elements

2. Distal regulatory regions are usually found to be flanked by both H3K27ac and H3K4me1. From your starting catalogue of open regions in each tissue, select those that overlap peaks of H3K27ac AND H3K4me1 in the corresponding tissue. You will get a list of candidate distal regulatory elements for each tissue. How many are they?

mkdir -p analyses data/bigBed.files

We download the BigBed peak calling files (H3K27ac AND H3K4me1, bigBed narrow, pseudoreplicated peaks, assembly GRCh38, most recent file for each tissue)

H3K27ac:

grep -F H3K27ac ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/H3K27ac.bigBed.peaks.ids.txt

cut -f1 analyses/H3K27ac.bigBed.peaks.ids.txt |\
while read filename; do
  wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done

H3K4me1:

grep -F H3K4me1 ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/H3K4me1.bigBed.peaks.ids.txt

cut -f1 analyses/H3K4me1.bigBed.peaks.ids.txt |\
while read filename; do
  wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done

And with the following command, we get the bidBed peaks ids:

cat analyses/*.bigBed.peaks.ids.txt
bigBed

Checking files integrity

As we did before, check their integrity by verifying their MD5 hash (digital fingerprint of the file)

head -1 ../ChIP-seq/metadata.tsv | awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}' ### MD5 is column 46

Check MD5 from column 46 of H3K27ac and H3K4me1 files by doing:

for file_type in bigBed; do

  # retrieve original MD5 hash from the metadata
  ../bin/selectRows.sh <(cut -f1 analyses//*."$file_type".*.ids.txt) ../ChIP-seq/metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt

  # compute MD5 hash on the downloaded files 
  cat data/"$file_type".files/md5sum.txt |
  while read filename original_md5sum; do 
    md5sum data/"$file_type".files/"$filename"."$file_type" |
    awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}' 
  done > tmp 
  mv tmp data/"$file_type".files/md5sum.txt

  # make sure there are no files for which original and computed MD5 hashes differ
  awk '$2!=$3' data/"$file_type".files/md5sum.txt

done

Double check verifying the downloaded data is not corrupted with:

cat data/bigBed.files/md5sum.txt

And we get:

md5sum2

Convert bigBed files of H3K27ac AND H3K4me1 peaks to BED files with the bigBedToBed command

mkdir data/bed.files
cut -f1 analyses/*.bigBed.peaks.ids.txt |
while read filename; do
  bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done

Find the peaks outside the genes (distal regulatory regions) for each Histone mark (H3K27ac AND H3K4me1)

mkdir analyses/peaks.analysis

H3K27ac

cut -f-2 analyses/H3K27ac.bigBed.peaks.ids.txt |
while read filename tissue; do 
  bedtools intersect -a data/bed.files/"$filename".bed -b ../ATAC-seq/annotation/gencode.v24.genes-promoters.bed -v > analyses/peaks.analysis/regulatory.regions.peaks."$tissue".H3K27ac.bed
done

H3K4me1

cut -f-2 analyses/H3K4me1.bigBed.peaks.ids.txt |
while read filename tissue; do 
  bedtools intersect -a data/bed.files/"$filename".bed -b ../ATAC-seq/annotation/gencode.v24.genes-promoters.bed -v > analyses/peaks.analysis/regulatory.regions.peaks."$tissue".H3K4me1.bed
done

Using head, we inspect the obtained files:

head analyses/peaks.analysis/regulatory.regions.peaks.*.bed
head1 head2

And we obtain peaks in the regulatory regions by tissue for each histone mark (H3K27ac and H3K4me1)

wc -l analyses/peaks.analysis/regulatory.regions.peaks.*.bed
regulatory_regions_peaks

Now we will merge peaks of two histone marks by tissue. We need to sort unique

SIGMOID COLON

sort -u analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.H3K27ac.bed analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.H3K4me1.bed > analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.bed

STOMACH

sort -u analyses/peaks.analysis/regulatory.regions.peaks.stomach.H3K27ac.bed analyses/peaks.analysis/regulatory.regions.peaks.stomach.H3K4me1.bed > analyses/peaks.analysis/regulatory.regions.peaks.stomach.bed 

Finally, we calculate the peaks in regulatory regions by doing:

wc -l analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.bed analyses/peaks.analysis/regulatory.regions.peaks.stomach.bed

And we obtain:

  • 40072 regulatory regions peaks in sigmoid colon

  • 28214 regulatory regions peaks in stomach

Regions marked with ATAC-seq & Histone marks (H3K27ac AND H3K4me1) by tissue:

Remember from ATAC-seq analysis:

wc -l ../ATAC-seq/analyses/peaks.analysis/peaks.outside.genes.*.ATACseq.bed
atac_peaks2

SIGMOID COLON

bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/peaks.outside.genes.sigmoid_colon.ATACseq.bed -b analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.bed -u > analyses/peaks.analysis/distal.regulatory.elements.sigmoid_colon.bed

STOMACH

bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/peaks.outside.genes.stomach.ATACseq.bed -b analyses/peaks.analysis/regulatory.regions.peaks.stomach.bed -u > analyses/peaks.analysis/distal.regulatory.elements.stomach.bed

We inspect the output:

head analyses/peaks.analysis/distal.regulatory.elements.*.bed
distal regulatory elements_head

And finally, we count how many distal regulatory elements we have for each tissue:

wc -l analyses/peaks.analysis/distal.regulatory.elements.*.bed 
distal regulatory elements

In summary, we observe 21077 distal regulatory elements in the Sigmoid colon and 16172 in the Stomach.

3. Focus on regulatory elements that are located on chromosome 1 (hint: to parse a file based on the value of a specific column, have a look at what we did here), and generate a file regulatory.elements.starts.tsv that contains the name of the regulatory region (i.e. the name of the original ATAC-seq peak) and the start (5') coordinate of the region.

We code to generate 2 file regulatory.elements.starts.tsv, one for each tissue and also regulatory.elements.starts.tsv for the 2 tissues together.

SIGMOID COLON

awk 'BEGIN{FS=OFS="\t"} $1=="chr1" {print $4,$2}' analyses/peaks.analysis/distal.regulatory.elements.sigmoid_colon.bed > sigmoid_colon.regulatory.elements.starts.tsv

STOMACH

awk 'BEGIN{FS=OFS="\t"} $1=="chr1" {print $4,$2}' analyses/peaks.analysis/distal.regulatory.elements.stomach.bed > stomach.regulatory.elements.starts.tsv

The head of the output files:

regulatory elements starts tsv1_head
wc -l *.regulatory.elements.starts.tsv
regulatory elements starts tsv1

ALL TOGETHER

cat sigmoid_colon.regulatory.elements.starts.tsv stomach.regulatory.elements.starts.tsv > regulatory.elements.starts.tsv

wc -l regulatory.elements.starts.tsv
regulatory elements starts tsv2

We get 4008 which matches the number above

4. Focus on protein-coding genes located on chromosome 1. From the BED file of gene body coordinates that you generated here, prepare a tab-separated file called gene.starts.tsv which will store the name of the gene in the first column, and the start coordinate of the gene on the second column (REMEMBER: for genes located on the minus strand, the start coordinate will be at the 3'). Use the command below as a starting point:

wc -l ../ATAC-seq/annotation/gencode.v24.protein.coding.gene.body.bed 

awk 'BEGIN{FS=OFS="\t"} $1=="chr1" {if ($6=="+"){start=$2} else {start=$3}; print $4, start}' ../ATAC-seq/annotation/gencode.v24.protein.coding.gene.body.bed > gene.starts.tsv

And we check the file:

head gene.starts.tsv
head annotationtsv

5. Download or copy this Python script inside the epigenomics_uvic/bin folder. Have a look at the help page of this script to understand how it works: python ../bin/get.distance.py -h

This script takes as input two distinct arguments: 1) --input corresponds to the file gene.starts.tsv (i.e. the file you generated in Task #4); 2) --start corresponds to the 5' coordinate of a regulatory element. Complete the python script so that for a given coordinate --start the script returns the closest gene, the start of the gene and the distance of the regulatory element.

The script completed:

#!/usr/bin/env python


#************
# LIBRARIES *
#************

import sys
from optparse import OptionParser


#*****************
# OPTION PARSING *
#*****************

parser = OptionParser()
parser.add_option("-i", "--input", dest="input")
parser.add_option("-s", "--start", dest="start")
options, args = parser.parse_args()

open_input = open(options.input)
enhancer_start = int(options.start)


#********
# BEGIN *
#********

x=1000000 # set maximum distance to 1 Mb
selectedGene="" # initialize the gene as empty
selectedGeneStart=0 # initialize the start coordinate of the gene as empty

for line in open_input.readlines(): # for each line in the input file
	gene, y = line.strip().split('\t') # split the line into two columns based on a tab 
	position = int(y) # define a variable called position that correspond to the integer of the start of the gene
	distance = abs(position - enhancer_start) # compute the absolute value of the difference between position and enhancer_start

	if distance < x: # if this absolute value is lower than x
		x = distance # this value will now be your current x
		selectedGene = gene # save gene as selectedGene
		selectedGeneStart = position # save position as selectedGeneStart

print "\t".join([selectedGene, str(selectedGeneStart), str(x)])

To check it is running we do the following:

python ../bin/get.distance.py --input gene.starts.tsv --start 980000

And we get as output the expected

ENSG00000187642.9 982093 2093

6. For each regulatory element contained in the file regulatory.elements.starts.tsv, retrieve the closest gene and the distance to the closest gene using the python script you created above.

SIGMOID COLON

cat sigmoid_colon.regulatory.elements.starts.tsv | while read element start; do 
   python ../bin/get.distance.py --input gene.starts.tsv --start $start 
done > sigmoid_colon.regulatoryElements.genes.distances.tsv

STOMACH

cat stomach.regulatory.elements.starts.tsv | while read element start; do 
   python ../bin/get.distance.py --input gene.starts.tsv --start $start 
done > stomach.regulatoryElements.genes.distances.tsv

ALL

cat regulatory.elements.starts.tsv | while read element start; do 
   python ../bin/get.distance.py --input gene.starts.tsv --start $start 
done > regulatoryElements.genes.distances.tsv

to check is running we can do:

cat regulatory.elements.starts.tsv | while read element start; do
   echo element=$element start=$start
   python ../bin/get.distance.py --input gene.starts.tsv --start $start 
done

7. Use R to compute the mean and the median of the distances stored in regulatoryElements.genes.distances.tsv

The three obtained regulatoryElements.genes.distances.tsv files were loaded in R. We obtain the mean and the median of the distances.

load data:

sigmoid_colon.regulatoryElements.genes.distances <- read.delim("~/Desktop/Epigenomics/epigenomics_uvic/regulatory_elements/sigmoid_colon.regulatoryElements.genes.distances.tsv", header=FALSE)

stomach.regulatoryElements.genes.distances <- read.delim("~/Desktop/Epigenomics/epigenomics_uvic/regulatory_elements/stomach.regulatoryElements.genes.distances.tsv", header=FALSE)

regulatoryElements.genes.distances <- read.delim("~/Desktop/Epigenomics/epigenomics_uvic/regulatory_elements/regulatoryElements.genes.distances.tsv", header=FALSE)

SIGMOID COLON

mean(sigmoid_colon.regulatoryElements.genes.distances$V3)

[1] 82936.13

median(sigmoid_colon.regulatoryElements.genes.distances$V3)

[1] 39614

STOMACH

mean(stomach.regulatoryElements.genes.distances$V3)

[1] 61329.46

median(stomach.regulatoryElements.genes.distances$V3)

[1] 36347

ALL

mean(regulatoryElements.genes.distances$V3)

[1] 73189.41

median(regulatoryElements.genes.distances$V3)

[1] 37874.5

Mean, and median results are summarized in the following table:

table

According to the mean and median values, the distances to the regulatory elements are bigger for the Sigmoid colon than for the Stomach. We also observe huge differences between the mean and the median in both tissues, this can be caused by the presence of several extreme values (outliers) that are affecting the mean value.

Clone this wiki locally