-
Notifications
You must be signed in to change notification settings - Fork 0
Home
ATAC-Seq, Assay for Transposase-Accessible Chromatin with sequencing, is a popular method for determining chromatin accessibility across the genome. The ATAC-Seq allows us to sequence open chromatin regions, elucidating how chromatin packaging and other factors may affect gene expression. The method can help identify promoter regions and potential enhancers and silencers.
The goal of this practice is to identify regulatory regions using ATAC-Seq data. The task is divided into two sections:
- EN‐TEx ATAC‐seq data analysis:
In this section, we will use two tissues (stomach and sigmoid_colon) from EN-TEx donor 2 ATAC-seq data from ENCODE portal to obtain a set of ATAC-seq peaks in these two tissues that lie outside gene coordinates.
- Distal regulatory activity:
Next, we will use the ATAC-seq peaks obtained in the previous section to build a catalogue of distal regulatory regions.
We start running the following docker container:
sudo docker run -v /Users/scarbonell/Desktop/Epigenomics:/Users/scarbonell/Desktop/Epigenomics -w /Users/scarbonell/Desktop/Epigenomics --rm -it dgarrimar/epigenomics_course
We prepare directories to store the results of the analyses using the same structure we used for ChIP-seq analysis. For more details, see here.
In this section, we will generate the ATAC-seq peaks which correspond to regions of open chromatin, which will allow us to identify regulatory regions.
Download ENCODE ATAC-seq metadata from EN-TEx donor 2. To obtain the ATAC-seq ENCODE data, we need to generate ATAC-seq peaks we will follow the same procedure described in How to navigate the ENCODE portal. And finally, we run:
mkdir ATAC-seq
cd ATAC-seq
../bin/download.metadata.sh "https://www.encodeproject.org/metadata/?replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&status=submitted&status=in+progress&assay_slims=DNA+accessibility&assay_title=ATAC-seq&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&type=Experiment"
Explore the structure of the metadata file by having a look at the header and, with awk find the field we are interested in
head -1 metadata.tsv
head -1 metadata.tsv | awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}'
Inside epigenomics_uvic/ATAC-seq create the folders analyses data/bigBed.files to store the data
mkdir -p analyses data/bigBed.files
Parse the metadata.tsv file to retrieve bigBed ATAC-seq data, get the columns: File_accession, Biosample_term_name and Experiment_target
grep -F ATAC-seq metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" | \
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/bigBed.peaks.ids.txt
with:
cat analyses/bigBed.peaks.ids.txt
we will see the ATAC-seq bigBed ID for sigmoid_colon (ENCFF287UHP) and stomach (ENCFF762IFP)
With the ID we can download them from ENCODE portal by doing:
cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed";
done
Check their integrity by verifying their MD5 hash (digital fingerprint of the file) Check the integrity of the files by verifying their MD5 hash in column 46:
for file_type in bigBed; do
# retrieve the original MD5 hash from the metadata
../bin/selectRows.sh <(cut -f1 analyses/"$file_type".*.ids.txt) metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt
# compute MD5 hash on the downloaded files
cat data/"$file_type".files/md5sum.txt |\
while read filename original_md5sum; do
md5sum data/"$file_type".files/"$filename"."$file_type" |\
awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}'
done > tmp
mv tmp data/"$file_type".files/md5sum.txt
# make sure there are no files for which original and computed MD5 hashes differ
awk '$2 != $3' data/"$file_type".files/md5sum.txt
done
We can double-check the files to be sure they are correct by doing:
cat data/bigBed.files/md5sum.txt
And we will obtain:
![ATAC_md5sum](https://user-images.githubusercontent.com/75372182/230788885-faa4d28c-a32b-4b5c-b721-dfd963750c3a.png)
Now we need the annotation files, which will need to be the same GENCODE annotation version used by ENCODE to analyse the data.
Annotation files needed (GENCODE annotation v24):
- gencode.v24.protein.coding.non.redundant.TSS.bed (list of promoters, [-2 kb, +2 Kb] from TSS, of protein-coding genes)
- gencode.v24.protein.coding.gene.body.bed
Instead of downloading them again as we did here we can make a copy from ChIP-seq/annotation and move to the folder ATAC-seq/annotation
mkdir annotation
cp /Users/scarbonell/Desktop/Epigenomics/epigenomics_uvic/ChIP-seq/annotation/{gencode.v24.protein.coding.non.redundant.TSS.bed,gencode.v24.protein.coding.gene.body.bed} annotation
Convert bigBed files of ATAC-seq peaks to BED files with the bigBedToBed command:
mkdir data/bed.files
cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done
Now that we have the files, we can start addressing task questions:
Retrieve peaks of ATAC-seq at the promoter region in each tissue:
mkdir analyses/peaks.analysis
cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do
bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.protein.coding.non.redundant.TSS.bed -u > analyses/peaks.analysis/promoters.with.peaks."$tissue".ATACseq.bed
done
Output files are:
head analyses/peaks.analysis/promoters.with.peaks.*.ATACseq.bed
![head_promoters_with_peaks](https://user-images.githubusercontent.com/75372182/230788830-2ab9c099-b44c-4487-97c1-d7a04d49c3a2.png)
Finally, to report the ATAC-seq peaks in the promoter regions, we do:
wc -l analyses/peaks.analysis/promoters.with.peaks.*.ATACseq.bed
And we get:
![promoters_with_peaks](https://user-images.githubusercontent.com/75372182/230788545-1c729d18-1e97-4af8-b2b8-78de3be65ebf.png)
Sumariszing we have: 47871 ATAC-seq peaks in the promoter regions for Sigmoid Colon and 44749 for Stomach sample.
2. the number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions).
Combine the two bed files:
cat annotation/gencode.v24.protein.coding.gene.body.bed annotation/gencode.v24.protein.coding.non.redundant.TSS.bed > annotation/gencode.v24.genes-promoters.bed
cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do
bedtools intersect -a data/bed.files/"$filename".bed -b annotation/gencode.v24.genes-promoters.bed -v > analyses/peaks.analysis/peaks.outside.genes."$tissue".ATACseq.bed
done
Explore the files and count the number of peaks for each tissue
head analyses/peaks.analysis/peaks.outside.genes.*.ATACseq.bed
![head_peaks ouside gens](https://user-images.githubusercontent.com/75372182/231000211-546eff59-91fe-4d7a-a165-efc2cd937d45.png)
And to get the number of peaks outside genes:
wc -l analyses/peaks.analysis/peaks.outside.genes.*.ATACseq.bed
And we get:
![peaks_outside_genes](https://user-images.githubusercontent.com/75372182/230789082-5778b0b0-b769-4a01-a877-82fc97a46319.png)
Summarizing, we have 35306 ATAC-seq peaks outside genes for Sigmoid Colon and 32795 for Stomach sample.
1. Create a folder regulatory_elements inside epigenomics_uvic. This will be the folder where you store all your subsequent results.
Move to folder epigenomics_uvic
cd ..
mkdir regulatory_elements
cd regulatory_elements
2. Distal regulatory regions are usually found to be flanked by both H3K27ac and H3K4me1. From your starting catalogue of open regions in each tissue, select those that overlap peaks of H3K27ac AND H3K4me1 in the corresponding tissue. You will get a list of candidate distal regulatory elements for each tissue. How many are they?
mkdir -p analyses data/bigBed.files
We download the BigBed peak calling files (H3K27ac AND H3K4me1, bigBed narrow, pseudoreplicated peaks, assembly GRCh38, most recent file for each tissue)
grep -F H3K27ac ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/H3K27ac.bigBed.peaks.ids.txt
cut -f1 analyses/H3K27ac.bigBed.peaks.ids.txt |\
while read filename; do
wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done
grep -F H3K4me1 ../ChIP-seq/metadata.tsv |\
grep -F "bigBed_narrowPeak" |\
grep -F "pseudoreplicated_peaks" |\
grep -F "GRCh38" |\
awk 'BEGIN{FS=OFS="\t"}{print $1, $11, $23}' |\
sort -k2,2 -k1,1r |\
sort -k2,2 -u > analyses/H3K4me1.bigBed.peaks.ids.txt
cut -f1 analyses/H3K4me1.bigBed.peaks.ids.txt |\
while read filename; do
wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done
And with the following command, we get the bidBed peaks ids:
cat analyses/*.bigBed.peaks.ids.txt
![bigBed](https://user-images.githubusercontent.com/75372182/230789475-2c1669a9-7c54-4890-b79e-22469e118b40.png)
As we did before, check their integrity by verifying their MD5 hash (digital fingerprint of the file)
head -1 ../ChIP-seq/metadata.tsv | awk 'BEGIN{FS=OFS="\t"}{for (i=1;i<=NF;i++){print $i, i}}' ### MD5 is column 46
Check MD5 from column 46 of H3K27ac and H3K4me1 files by doing:
for file_type in bigBed; do
# retrieve original MD5 hash from the metadata
../bin/selectRows.sh <(cut -f1 analyses//*."$file_type".*.ids.txt) ../ChIP-seq/metadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt
# compute MD5 hash on the downloaded files
cat data/"$file_type".files/md5sum.txt |
while read filename original_md5sum; do
md5sum data/"$file_type".files/"$filename"."$file_type" |
awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}'
done > tmp
mv tmp data/"$file_type".files/md5sum.txt
# make sure there are no files for which original and computed MD5 hashes differ
awk '$2!=$3' data/"$file_type".files/md5sum.txt
done
Double check verifying the downloaded data is not corrupted with:
cat data/bigBed.files/md5sum.txt
And we get:
![md5sum2](https://user-images.githubusercontent.com/75372182/230789641-2c347743-6dc3-4018-a817-71f06338bcce.png)
Convert bigBed files of H3K27ac AND H3K4me1 peaks to BED files with the bigBedToBed command
mkdir data/bed.files
cut -f1 analyses/*.bigBed.peaks.ids.txt |
while read filename; do
bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done
Find the peaks outside the genes (distal regulatory regions) for each Histone mark (H3K27ac AND H3K4me1)
mkdir analyses/peaks.analysis
cut -f-2 analyses/H3K27ac.bigBed.peaks.ids.txt |
while read filename tissue; do
bedtools intersect -a data/bed.files/"$filename".bed -b ../ATAC-seq/annotation/gencode.v24.genes-promoters.bed -v > analyses/peaks.analysis/regulatory.regions.peaks."$tissue".H3K27ac.bed
done
cut -f-2 analyses/H3K4me1.bigBed.peaks.ids.txt |
while read filename tissue; do
bedtools intersect -a data/bed.files/"$filename".bed -b ../ATAC-seq/annotation/gencode.v24.genes-promoters.bed -v > analyses/peaks.analysis/regulatory.regions.peaks."$tissue".H3K4me1.bed
done
Using head, we inspect the obtained files:
head analyses/peaks.analysis/regulatory.regions.peaks.*.bed
![head1](https://user-images.githubusercontent.com/75372182/230790086-47243c88-9532-4221-bc48-1e22b5a2da89.png)
![head2](https://user-images.githubusercontent.com/75372182/230790089-af9737e3-707d-4839-80d5-c063541187f0.png)
And we obtain peaks in the regulatory regions by tissue for each histone mark (H3K27ac and H3K4me1)
wc -l analyses/peaks.analysis/regulatory.regions.peaks.*.bed
![regulatory_regions_peaks](https://user-images.githubusercontent.com/75372182/230790136-da5c0311-d1f0-464c-a4d6-f048f3cf81f9.png)
Now we will merge peaks of two histone marks by tissue. We need to sort unique
sort -u analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.H3K27ac.bed analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.H3K4me1.bed > analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.bed
sort -u analyses/peaks.analysis/regulatory.regions.peaks.stomach.H3K27ac.bed analyses/peaks.analysis/regulatory.regions.peaks.stomach.H3K4me1.bed > analyses/peaks.analysis/regulatory.regions.peaks.stomach.bed
Finally, we calculate the peaks in regulatory regions by doing:
wc -l analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.bed analyses/peaks.analysis/regulatory.regions.peaks.stomach.bed
And we obtain:
-
40072 regulatory regions peaks in sigmoid colon
-
28214 regulatory regions peaks in stomach
Regions marked with ATAC-seq & Histone marks (H3K27ac AND H3K4me1) by tissue:
Remember from ATAC-seq analysis:
wc -l ../ATAC-seq/analyses/peaks.analysis/peaks.outside.genes.*.ATACseq.bed
![atac_peaks2](https://user-images.githubusercontent.com/75372182/230790766-10dcd79b-afa6-4597-8f22-cbbabfd04cc2.png)
bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/peaks.outside.genes.sigmoid_colon.ATACseq.bed -b analyses/peaks.analysis/regulatory.regions.peaks.sigmoid_colon.bed -u > analyses/peaks.analysis/distal.regulatory.elements.sigmoid_colon.bed
bedtools intersect -a ../ATAC-seq/analyses/peaks.analysis/peaks.outside.genes.stomach.ATACseq.bed -b analyses/peaks.analysis/regulatory.regions.peaks.stomach.bed -u > analyses/peaks.analysis/distal.regulatory.elements.stomach.bed
We inspect the output:
head analyses/peaks.analysis/distal.regulatory.elements.*.bed
![distal regulatory elements_head](https://user-images.githubusercontent.com/75372182/230790960-6fd5406b-8e9e-44bf-a1bc-25218b0109e9.png)
And finally, we count how many distal regulatory elements we have for each tissue:
wc -l analyses/peaks.analysis/distal.regulatory.elements.*.bed
![distal regulatory elements](https://user-images.githubusercontent.com/75372182/230790884-fdaee31e-a5c4-4697-b92b-4358a2390f37.png)
In summary, we observe 21077 distal regulatory elements in the Sigmoid colon and 16172 in the Stomach.
3. Focus on regulatory elements that are located on chromosome 1 (hint: to parse a file based on the value of a specific column, have a look at what we did here), and generate a file regulatory.elements.starts.tsv that contains the name of the regulatory region (i.e. the name of the original ATAC-seq peak) and the start (5') coordinate of the region.
We code to generate 2 file regulatory.elements.starts.tsv, one for each tissue and also regulatory.elements.starts.tsv for the 2 tissues together.
awk 'BEGIN{FS=OFS="\t"} $1=="chr1" {print $4,$2}' analyses/peaks.analysis/distal.regulatory.elements.sigmoid_colon.bed > sigmoid_colon.regulatory.elements.starts.tsv
awk 'BEGIN{FS=OFS="\t"} $1=="chr1" {print $4,$2}' analyses/peaks.analysis/distal.regulatory.elements.stomach.bed > stomach.regulatory.elements.starts.tsv
The head of the output files:
![regulatory elements starts tsv1_head](https://user-images.githubusercontent.com/75372182/230791560-4c1e80cb-ae9e-4221-b561-ecf5f86e8c9b.png)
wc -l *.regulatory.elements.starts.tsv
![regulatory elements starts tsv1](https://user-images.githubusercontent.com/75372182/230791423-4dcc706f-3b65-4aad-a29f-3df1e56eb73c.png)
cat sigmoid_colon.regulatory.elements.starts.tsv stomach.regulatory.elements.starts.tsv > regulatory.elements.starts.tsv
wc -l regulatory.elements.starts.tsv
![regulatory elements starts tsv2](https://user-images.githubusercontent.com/75372182/230791439-db2eff7a-2589-4870-a068-b5d59d593577.png)
We get 4008 which matches the number above
4. Focus on protein-coding genes located on chromosome 1. From the BED file of gene body coordinates that you generated here, prepare a tab-separated file called gene.starts.tsv which will store the name of the gene in the first column, and the start coordinate of the gene on the second column (REMEMBER: for genes located on the minus strand, the start coordinate will be at the 3'). Use the command below as a starting point:
wc -l ../ATAC-seq/annotation/gencode.v24.protein.coding.gene.body.bed
awk 'BEGIN{FS=OFS="\t"} $1=="chr1" {if ($6=="+"){start=$2} else {start=$3}; print $4, start}' ../ATAC-seq/annotation/gencode.v24.protein.coding.gene.body.bed > gene.starts.tsv
And we check the file:
head gene.starts.tsv
![head annotationtsv](https://user-images.githubusercontent.com/75372182/230951357-80ffff86-fc78-451c-850d-c7a41df0356f.png)
5. Download or copy this Python script inside the epigenomics_uvic/bin folder. Have a look at the help page of this script to understand how it works: python ../bin/get.distance.py -h
This script takes as input two distinct arguments: 1) --input corresponds to the file gene.starts.tsv (i.e. the file you generated in Task #4); 2) --start corresponds to the 5' coordinate of a regulatory element. Complete the python script so that for a given coordinate --start the script returns the closest gene, the start of the gene and the distance of the regulatory element.
The script completed:
#!/usr/bin/env python
#************
# LIBRARIES *
#************
import sys
from optparse import OptionParser
#*****************
# OPTION PARSING *
#*****************
parser = OptionParser()
parser.add_option("-i", "--input", dest="input")
parser.add_option("-s", "--start", dest="start")
options, args = parser.parse_args()
open_input = open(options.input)
enhancer_start = int(options.start)
#********
# BEGIN *
#********
x=1000000 # set maximum distance to 1 Mb
selectedGene="" # initialize the gene as empty
selectedGeneStart=0 # initialize the start coordinate of the gene as empty
for line in open_input.readlines(): # for each line in the input file
gene, y = line.strip().split('\t') # split the line into two columns based on a tab
position = int(y) # define a variable called position that correspond to the integer of the start of the gene
distance = abs(position - enhancer_start) # compute the absolute value of the difference between position and enhancer_start
if distance < x: # if this absolute value is lower than x
x = distance # this value will now be your current x
selectedGene = gene # save gene as selectedGene
selectedGeneStart = position # save position as selectedGeneStart
print "\t".join([selectedGene, str(selectedGeneStart), str(x)])
To check it is running we do the following:
python ../bin/get.distance.py --input gene.starts.tsv --start 980000
And we get as output the expected
ENSG00000187642.9 982093 2093
6. For each regulatory element contained in the file regulatory.elements.starts.tsv, retrieve the closest gene and the distance to the closest gene using the python script you created above.
cat sigmoid_colon.regulatory.elements.starts.tsv | while read element start; do
python ../bin/get.distance.py --input gene.starts.tsv --start $start
done > sigmoid_colon.regulatoryElements.genes.distances.tsv
cat stomach.regulatory.elements.starts.tsv | while read element start; do
python ../bin/get.distance.py --input gene.starts.tsv --start $start
done > stomach.regulatoryElements.genes.distances.tsv
cat regulatory.elements.starts.tsv | while read element start; do
python ../bin/get.distance.py --input gene.starts.tsv --start $start
done > regulatoryElements.genes.distances.tsv
to check is running we can do:
cat regulatory.elements.starts.tsv | while read element start; do
echo element=$element start=$start
python ../bin/get.distance.py --input gene.starts.tsv --start $start
done
7. Use R to compute the mean and the median of the distances stored in regulatoryElements.genes.distances.tsv
The three obtained regulatoryElements.genes.distances.tsv
files were loaded in R. We obtain the mean and the median of the distances.
load data:
sigmoid_colon.regulatoryElements.genes.distances <- read.delim("~/Desktop/Epigenomics/epigenomics_uvic/regulatory_elements/sigmoid_colon.regulatoryElements.genes.distances.tsv", header=FALSE)
stomach.regulatoryElements.genes.distances <- read.delim("~/Desktop/Epigenomics/epigenomics_uvic/regulatory_elements/stomach.regulatoryElements.genes.distances.tsv", header=FALSE)
regulatoryElements.genes.distances <- read.delim("~/Desktop/Epigenomics/epigenomics_uvic/regulatory_elements/regulatoryElements.genes.distances.tsv", header=FALSE)
mean(sigmoid_colon.regulatoryElements.genes.distances$V3)
[1] 82936.13
median(sigmoid_colon.regulatoryElements.genes.distances$V3)
[1] 39614
mean(stomach.regulatoryElements.genes.distances$V3)
[1] 61329.46
median(stomach.regulatoryElements.genes.distances$V3)
[1] 36347
mean(regulatoryElements.genes.distances$V3)
[1] 73189.41
median(regulatoryElements.genes.distances$V3)
[1] 37874.5
Mean, and median results are summarized in the following table:
![table](https://user-images.githubusercontent.com/75372182/230795913-4e32e3da-db48-461b-b840-717c0dea9e5e.png)
According to the mean and median values, the distances to the regulatory elements are bigger for the Sigmoid colon than for the Stomach. We also observe huge differences between the mean and the median in both tissues, this can be caused by the presence of several extreme values (outliers) that are affecting the mean value.