In this GitHub repository, you will find the protocol elaborated by the Immunogenetics Research Lab (IRLab) from the University of the Basque Country (UPV/EHU), to map placental cis-mQTLs using the command-line program TensorQTL. On the one hand, all the commands and scripts used are available in the Readme but be careful, you will need to custom some of them, mainly the Rscripts executed from RStudio and not from the command line. Also, during the pipeline, we will be creating new directories for the outputs, but you should always work from outside of them! Have a look at the diagram below.
On the other hand, in the Wiki, you will find each step explained in more detail. To complete the pipeline, you need to have installed: R and RStudio, Plink1.9 and Plink2, TensorQTL v1.0.5 and its dependencies, bcftools and vcftools.
In case you already performed the HRC genotype imputation at the Michigan Server and the preprocessing of the methylome data using PACEanalysis R package by A. Binder with the arguments specified in Step 2.1., you can jump to Step 2.2.
Create a directory for the bfiles:
mkdir inter
Conversion of long format files to PLINK binary format file:
plink1.9 --file {your filename} --make-bed --out inter/whole_genotype
You should adapt this command depending on the input file that do you have. Take in consideration how do you want PLINK to read your sample names as FID and IID.
To know if the sex of the samples is reported on the fam file:
head inter/whole_genotype.fam
If the fifth column of the file contains only 0, you will need to add the sex of the samples adapting the following script to the sample sheet that you may have with the sex of the samples reported:
add-sex.R
Calculate frequencies:
plink1.9 --bfile inter/whole_genotype --freq --out inter/whole_genotype
Download and unzip Haplotype Reference Consortium (HRC) site list:
wget ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
gunzip HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
Execute Will Rayner's pre-imputation check script:
perl HRC-1000G-check-bim.pl -b inter/whole_genotype.bim -f inter/whole_genotype.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h -v
Execute Will Rayner’s output bash script:
plink1.9 --bfile inter/whole_genotype --exclude inter/Exclude-whole_genotype-HRC.txt --make-bed --out inter/TEMP1
plink1.9 --bfile inter/TEMP1 --update-map inter/Chromosome-whole_genotype-HRC.txt --update-chr --make-bed --out inter/TEMP2
plink1.9 --bfile inter/TEMP2 --update-map inter/Position-whole_genotype-HRC.txt --make-bed --out inter/TEMP3
plink1.9 --bfile inter/TEMP3 --flip inter/Strand-Flip-whole_genotype-HRC.txt --make-bed --out inter/TEMP4
plink1.9 --bfile inter/TEMP4 --a2-allele inter/Force-Allele1-whole_genotype-HRC.txt --make-bed --out inter/whole_genotype-updated
rm inter/TEMP*
Change rsIDs of SNPs to ‘chr:position’ format with change-rsid.R:
Rscript change-rsid.R inter/whole_genotype-updated.bim
Make a directory for Quality Control:
mkdir qc
Calculate frequencies:
plink1.9 --bfile inter/whole_genotype-updated --freq --out inter/whole_genotype-updated
Calculate missing call rate:
plink1.9 --bfile inter/whole_genotype-updated --missing --out inter/whole_genotype-updated
Remove markers by MAF/geno (missing call rate)/HWE thresholds:
plink1.9 --bfile inter/whole_genotype-updated --geno 0.05 --hwe 1e-06 --maf 0.01 --make-bed --out qc/wg-updated-marker
Calculate heterozygosity:
plink1.9 --bfile qc/wg-updated-marker --het --out qc/wg-updated-marker
Plot missing call rate vs heterozygosity and subset individuals with > ± 4 x standard deviation (SD) using imiss-vs-het.R:
Rscript imiss-vs-het.R inter/whole_genotype-updated.imiss qc/wg-updated-marker.het
Remove selected individuals (> ± 4 x SD):
plink1.9 --bfile qc/wg-updated-marker --remove filter-het.txt --make-bed --out qc/wg-updated-marker-rmhet
Calculate missing call rate:
plink1.9 --bfile qc/wg-updated-marker-rmhet --missing --out qc/wg-updated-marker-rmhet
Remove individuals with 0.03 missing markers:
plink1.9 --bfile qc/wg-updated-marker-rmhet --mind 0.03 --make-bed --out qc/wg-updated-marker-rmhet-ind
Obtain the genotype sex from the samples:
plink1.9 --bfile qc/wg-updated-marker-rmhet-ind --check-sex --out qc/wg-updated-marker-rmhet-ind
On the previous step, once we have calculated the sex of the samples with the --check-sex flag, we should see the consistency that our reported sex has with the new genotype sex.
awk '{if ($3 != $4) print $1,"\t",$2,"\t",$3,"\t",$4}' qc/wg-updated-marker-rmhet-ind.sexcheck >> qc/sex_inconsistencies.txt
In case of observing sex inconcistencies, we recommend you to talk with the researchers who obtained the samples and prepared the databases before removig those samples, to see if there could be a mistake. Depending on each case, you might ant to reassign the sex of the sample in the .fam file with --make-just-fam flag or discard the sample using --remove flag from PLINK.
Calculate relatedness by Identity-by-descent (IBD):
plink1.9 --bfile qc/wg-updated-marker-rmhet-ind --genome --make-bed --out qc/wg-updated-marker-rmhet-ind
Plot IBD values and subset individuals with PI_HAT > 0.18 with plot-IBD.R:
Rscript plot-IBD.R qc/wg-updated-marker-rmhet-ind
Select one sample per pair (with lower genotyping freq.) to remove with rm-pihat018.R:
Rscript rm-pihat018.R qc/wg-updated-marker-rmhet-ind-fail-IBD-check.txt qc/wg-updated-marker-rmhet.imiss qc/rmpihat018.txt
Remove one from each pair:
plink1.9 --bfile qc/wg-updated-marker-rmhet-ind --remove qc/rmpihat018.txt --make-bed --out qc/clean-PIHAT
Calculate PCs:
plink1.9 --bfile qc/clean-PIHAT --indep-pairwise 50 5 0.2 --out qc/clean-PIHAT-prunned
Perform PCA:
plink1.9 --bfile qc/clean-PIHAT --extract qc/clean-PIHAT-prunned.prune.in --pca --out qc/clean-PIHAT.PCs
Plot PCs with individuals information (e.g. sex), the following script contains the main commands, but you should adapt it to your data:
plot-PCA.R
Make a directory for chromosome files:
mkdir chr
Obtain separate VCF files per chromosome:
for i in {1..23};
do
# Split chromosomes
plink1.9 --bfile qc/clean-PIHAT --make-bed --chr $i --out chr/clean-PIHAT-chr$i
# Make VCF files per chromosome
plink1.9 --bfile chr/clean-PIHAT-chr$i --recode vcf --out chr/chr$i
# Sort and compress VCF files
vcf-sort chr/chr${i}.vcf | bgzip -c > chr/chr$i.vcf.gz ;
done
Step 1.2. Impute SNP genotypes in the Michigan Imputation Server
- Reference Panel: HRC r1. 2016 (GRCh37/hg19)
- Array Build: GRCh37/hg19
- rsq Filter: off
- Phasing: Eagle v2.4 (phased output)
- Population: EUR
- Mode: Quality Control & Imputation
We will also check AES 256 encryption.
To download the imputed genotype data from the server, use the commands indicated in the website (e.g. curl -sL https://imputationserver.sph.umich.edu/get/...) from inside the following folder:
mkdir chr_imp
cd chr_imp
Once you have donwloaded the files inside chr_imp
folder, get out from it by:
cd ../
Decompress downloaded folders with the corresponding password (you will receive it by email at the address you used for registration):
for i in {1..22}; do 7z x -p'PASSWORD' chr_imp/chr_${i}; done
7z x -p'PASSWORD' chr_imp/chr_X
Removing zips:
rm -r chr_imp/*.zip
Create folder for filtered VCFs:
mkdir imputed-rsq09
Filter SNPs by Rsq(R2) > 0.9:
for i in {1..22};
do
bcftools filter -i 'R2>0.9' -o imputed-rsq09/chr${i}.vcf.gz -Oz chr_imp/chr${i}.dose.vcf.gz &
done
bcftools filter -i 'R2>0.9' -o imputed-rsq09/chrX.vcf.gz -Oz chr_imp/chrX.dose.vcf.gz
Make a list of the VCF files to be concatenated:
for i in {1..22};
do
echo imputed-rsq09/chr${i}.vcf.gz >> imputed-rsq09/tmp-concat.txt
done
echo imputed-rsq09/chrX.vcf.gz >> imputed-rsq09/tmp-concat.txt
Concatenate VCF files:
bcftools concat -f imputed-rsq09/tmp-concat.txt --threads 16 -o imputed-rsq09/chrALL.vcf.gz -Oz imputed-rsq09/tmp-concat.txt
Execute Will Rayner’s post-imputation QC script. You can use the scripts you will find available in the repository, but it is preferable to have a look at the documentation at the previous link and install the latest version of the whole package to avoid errors:
mkdir qc-vcfparse
mkdir qc-ic
perl vcfparse.pl -d chr_imp -o qc-vcfparse
perl ic.pl -d qc-vcfparse -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -o qc-ic
Obtain the final list of samples of the genotype file:
bcftools query -l imputed-rsq09/chrALL.vcf.gz >> imputed-rsq09/sample_list_geno.txt
The first step to be done with the methylation data is the quality control and the preprocess of it. To do so, we will use the PACEanalysis R package from A.Binder that must be already installed. The functions that we will execute take a long time and the inputs and the outputs are heavy, therefore we strongly recommend you to consider these advices:
- Execute the functions in an R session from the command line, avoid using RStudio.
- Use a different R sessions per function to be executed, saving the R objects with
saveRDS
at the end of it, and loading them withreadRDS
in the next step. If not, your R session could be killed because of the object's size. - As much as possible, use
screen
method to execute the functions, in case of losing the connection with the server, the jobs will continue running.
For the output of these steps, we will create a new directory:
mkdir EPIC
This script contains the functions and the specific arguments that must be executed, but, as always, you will need to adapt them to your data.
PACEanalysis.R
In the next R script, you will find the commands used by our group to obtain the final BED file for our data into a .txt, some of them can be used directly, but others will need an adaptation to your data or won't be needed. The main steps are:
- Make sure the same samples are present in the methylation and genotype files.
- Make sure the same sample names are the same in methylation and genotype files. Important: in the genotype file, sample name is the IID, not FID_IID!
- Annotate the CpGs by chr, start and end using the Illumina's R package. Important, to define cis-window: end = start + 1
- Filter CpGs located on sexual chromosomes.
The output of this step should be a text file with the CpGs in the rows and the chr, start, end, CpG ID and beta values per sample in the columns. Here you have an example.
GSet_to_BED.R
Sort BED file:
(head -n1 EPIC/methylome_BED.txt && sort -k1,1V -k2,2n -k3,3n <(tail -n+2 EPIC/methylome_BED.txt)) > EPIC/methylome_sorted.bed
Zip BED file:
bgzip EPIC/methylome_sorted.bed
Create folder for the final version of the genotype:
mkdir whole_genome_definitive
Filter VCF by MAF > 5%, HWE p-value > 0.05 and by the final list of samples:
plink1.9 --vcf imputed-rsq09/chrALL.vcf.gz --maf 0.05 --hwe 0.05 --keep EPIC/final_list_samples.txt --keep-allele-order --make-bed --out whole_genome_definitive/whole_genome_maf05_filt_samples
Sometimes --keep doesn't work and keeps all the samples except the ones listed in final_list_samples.txt, if this is the case, change --keep by --remove:
plink1.9 --vcf imputed-rsq09/chrALL.vcf.gz --maf 0.05 --hwe 0.05 --remove EPIC/final_list_samples.txt --keep-allele-order --make-bed --out whole_genome_definitive/whole_genome_maf05_filt_samples
Be careful, when PLINK converts a VCF to a binary PLINK file set, it subsets the name of the samples from the VCF into FID and IID on the binary plink file (.bim, .fam, .bed) by searching for a separator which by default is __ . In case of troubles with this, we provide links to instruction on how to customize the way PLINK reads sample names or how to change the name of the samples once you already have the binary PLINK file set. In our case, we tend to use --const-fid flag, which allows you to set FID as 0 in all the samples and the IID as the whole sample name coming from the VCF, or --double-id, which causes both family and individual IDs to be set to the sample ID. Remember that to run TensorQTL, you should match the IID from PLINK with the sample name on the BED file from the methylome.
An extra step that we will perform at this point is to calculate the homozygous and heterozygous counts, and get the linkage disequilibrium information for each SNP:
plink1.9 --bfile whole_genome_definitive/whole_genome_maf05_filt_samples --freqx --keep-allele-order --out whole_genome_definitive/whole_genome_maf05_filt_samples
plink1.9 --bfile whole_genome_definitive/whole_genome_maf05_filt_samples --r2 --keep-allele-order --out whole_genome_definitive/whole_genome_maf05_filt_samples
In this analysis, the covariates that we are going to use are the sex of the individuals, the Planet values (cell type proportions), and the Principal Components of our genotype. The number of PCs is going to be defined by your data, but at least you should include the first five. In this protocol we have used five PCs as an example. Therefore, we had to perform a Principal Component Analysis (PCA) with PLINK considering the last version of the genotype:
mkdir covariates
plink1.9 --bfile whole_genome_definitive/whole_genome_maf05_filt_samples --indep-pairwise 50 5 0.2 --out covariates/whole_genome_maf05_filt_samples_prunned
plink1.9 --bfile whole_genome_definitive/whole_genome_maf05_filt_samples --extract covariates/whole_genome_maf05_filt_samples_prunned.prune.in --pca --out covariates/whole_genome_maf05_filt_samples_prunned.PCs
The format for the covariates should be a text file in which the first line corresponds to the IID of the sample, and the next rows the covariates as in this example. In the following script are the main commands used by our group to obtain the text file:
covariates.R
We will also run two additional models; these will use the rank-based inverse normal transformed beta values (RNT). As covariates, the first model will only consider the known confounders from the previous model (sex, genotype PCs and Planet cell type estimations), and the second model will additionally take in count methylation PCs (mPCs) calculated in the residualized methylation data, with the aim of correcting for extra technical variation. We have generated the code to perform these steps based on outputs that you have already obtained in the previous steps. We hope that this will ensure the simplicity of these extra analyses and their execution.
To perform these extra analyses, we will create a new folder:
mkdir rnt_model
To get the RNT values from the methylation beta values, we have prepared a new R script based on GSet_to_Bed.R from Step 2.2. The new R script allows you to transform the methylation values and to get the preliminary BED file with the CpGs in the rows and the chr, start, end, CpG ID and RNT values per sample in the columns.
GSet_to_Bed_RNT.R
Please notice that the transformation of the methylation values is in line 40 of the script, and the rest remains as it was. There is no need to calculate the variance of these values as we did with the betas.
Once you have obtained the BED file, you have to sort it and zip it as we did previously:
(head -n1 rnt_model/methylome_BED_RNT.txt && sort -k1,1V -k2,2n -k3,3n <(tail -n+2 rnt_model/methylome_BED_RNT.txt)) > rnt_model/methylome_sorted_RNT.bed
bgzip rnt_model/methylome_sorted_RNT.bed
To get the mPCs avoiding any correlation with the known confounders (sex, genotype PCs, and very especially, Planet cell type estimations) and therefore, multicollinearity, we will perform the PCA on the residuals from multiple linear regression. This regression will consider the transformed methylation values as the outcome, and the known confounders as predictors. Once we get the residuals, we will perform a PCA on them, and select all mPCs that cumulatively explain > 80% of the variance. ATTENTION: Once we have filtered the mPCs by the variance, we will keep a maximum number of 20 mPCs. All these steps are included in the following script, which should be easy to execute since all the inputs have been obtained in previous steps. ATTENTION: For the multiple linear model you should use as many genetic PCs as you have considered at the end of Step 4.
get_residualized_mPCs.R
Make sure that all the variables from the multiple regression are detected as numeric, this will allow you to get a matrix of residuals (Nº samples x Nº CpGs) with continuous values.
We also want to avoid potential correlations of the mPCs with the genotype, to be sure of this, we will perform a GWAS with our final genotype (Step 3) and each of the mPCs calculated in the previous step as phenotypes. We have prepared a script that will allow you to create a new PLINK fam file per mPC and perform the GWAS.
gwas_mPCs.R
In the case that any SNP is associated with a particular mPC (p-value < 1e-7), we will exclude that mPC.
Finally, we will include the selected mPCs in a new covariate file. With this aim, we have developed a simple script, where we add the mPCs to the previous covariates.txt file obtained in Step 4.
covariates_RNT.R
The output from this script should be the same table as covariates.txt, but containing the values from the selected mPCs.
Step 6. Mapping with TensorQTL
Create a folder for TensorQTL results:
mkdir tensorQTL
Open python3 module:
python3
Mapping cis-mQTLs using covariates with TensorQTL:
#Load packages
import pandas as pd
import torch
import tensorqtl
from tensorqtl import genotypeio, cis, trans
print('PyTorch {}'.format(torch.__version__))
print('Pandas {}'.format(pd.__version__))
#Define paths to data
plink_prefix_path = 'whole_genome_definitive/whole_genome_maf05_filt_samples'
expression_bed = 'EPIC/methylome_sorted.bed.gz'
covariates_file = 'covariates/covariates.txt'
prefix = 'tensorQTL/maf05_hwe05_nominal_INMA_18022021_A' #For this variable, read below this code block
The prefix
variable, should follow the pattern:
cis_tensorQTL_maf05_hwe05_NOMINAL_(cohort)_(ddmmyyyy)_(model)
For example, if model A of the analysis had been performed by the INMA cohort on 18/02/21, the prefix variable should contain:
cis_tensorQTL_maf05_hwe05_NOMINAL_INMA_18022021_A
#Load phenotypes and covariates
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
#PLINK reader for genotypes
pr = genotypeio.PlinkReader(plink_prefix_path)
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
#Sort phenotype sample names
phenotype_df = phenotype_df.reindex(sorted(phenotype_df.columns), axis=1)
genotype_df = genotype_df.reindex(sorted(genotype_df.columns), axis=1)
covariates_df = covariates_df.sort_index()
#Run TensorQTL
cis.map_nominal(genotype_df, variant_df,
phenotype_df,
phenotype_pos_df,
prefix, covariates_df=covariates_df, window=500000)
The results will be written in your working directory as parquet files (one per chromosome).
For the computation of the mQTLs for these additional models, we will also work on the root directory, and the prefix variable will specify "RNT_wo_mPCs" for the model adjusted by only the known confounders, and “RNT” for the model adjusted by both the known confounders and the residualized mPCs. For example, if the analyses had been performed in the INMA cohort on 18/02/21, the prefix variable should contain:
cis_tensorQTL_maf05_hwe05_NOMINAL_INMA_18022021_RNT_wo_mPCs
cis_tensorQTL_maf05_hwe05_NOMINAL_INMA_18022021_RNT
Here is the code with the input files and the command to run TensorQTL for the "RNT" model. Please note that now, the covariate file and the bed file are as follows: covariates_RNT.txt and methylome_sorted_RNT.bed.gz.
#Load packages
import pandas as pd
import torch
import tensorqtl
from tensorqtl import genotypeio, cis, trans
print('PyTorch {}'.format(torch.__version__))
print('Pandas {}'.format(pd.__version__))
#Define paths to data
plink_prefix_path = 'whole_genome_definitive/whole_genome_maf05_filt_samples'
expression_bed = 'rnt_model/methylome_sorted_RNT.bed.gz'
covariates_file = 'rnt_model/covariates_RNT.txt'
prefix = 'tensorQTL/maf05_hwe05_nominal_INMA_18022021_RNT'
#Load phenotypes and covariates
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
#PLINK reader for genotypes
pr = genotypeio.PlinkReader(plink_prefix_path)
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
#Sort phenotype sample names
phenotype_df = phenotype_df.reindex(sorted(phenotype_df.columns), axis=1)
genotype_df = genotype_df.reindex(sorted(genotype_df.columns), axis=1)
covariates_df = covariates_df.sort_index()
#Run TensorQTL
cis.map_nominal(genotype_df, variant_df,
phenotype_df,
phenotype_pos_df,
prefix, covariates_df=covariates_df, window=500000)
Finally, here is the code with the input files and the command to run TensorQTL for the "RNT_wo_mPCs" model. Please note that now, the covariate file and the bed file are as follows: covariates.txt and methylome_sorted_RNT.bed.gz.
#Load packages
import pandas as pd
import torch
import tensorqtl
from tensorqtl import genotypeio, cis, trans
print('PyTorch {}'.format(torch.__version__))
print('Pandas {}'.format(pd.__version__))
#Define paths to data
plink_prefix_path = 'whole_genome_definitive/whole_genome_maf05_filt_samples'
expression_bed = 'rnt_model/methylome_sorted_RNT.bed.gz'
covariates_file = 'covariates/covariates.txt'
prefix = 'tensorQTL/maf05_hwe05_nominal_INMA_18022021_RNT_wo_mPCs'
#Load phenotypes and covariates
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
#PLINK reader for genotypes
pr = genotypeio.PlinkReader(plink_prefix_path)
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
#Sort phenotype sample names
phenotype_df = phenotype_df.reindex(sorted(phenotype_df.columns), axis=1)
genotype_df = genotype_df.reindex(sorted(genotype_df.columns), axis=1)
covariates_df = covariates_df.sort_index()
#Run TensorQTL
cis.map_nominal(genotype_df, variant_df,
phenotype_df,
phenotype_pos_df,
prefix, covariates_df=covariates_df, window=500000)
Again, the results will be written in the tensorQTL folder as parquet files (one per chromosome).
Finally, you have to send us the following files:
- TensorQTL results (
cis_tensorQTL_maf05_hwe05_NOMINAL_(cohort)_(ddmmaaaa)_(model).cis_qtl_pairs.chr{1:22}.parquet
) - CpGs variability information (
all_cpg_variances.txt
) - SNPs MAF and counts information (
whole_genome_maf05_filt_samples.frqx
) - SNPs LD information (
whole_genome_maf05_filt_samples.ld
)
For the RNT analyses, you only have to send the TensorQTL results of this part of the README: cis_tensorQTL_maf05_hwe05_NOMINAL_(cohort)_(ddmmaaaa)_RNT.cis_qtl_pairs.chr{1:22}.parquet
cis_tensorQTL_maf05_hwe05_NOMINAL_(cohort)_(ddmmaaaa)_RNT_wo_mPCs.cis_qtl_pairs.chr{1:22}.parquet