-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsubsetting_VCF_files.dat
45 lines (29 loc) · 1.88 KB
/
subsetting_VCF_files.dat
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
Depending on your analysis, you might need to subset VCF files.
Some dDocent filters apply thresholds to all files together (Albatross and Contemporary). Since Contemporary files had substantial more data, much of the Albatross contigs were been excluded.
Thus, we created separate Albatross and Contemporary subsets from the same VCF to analyzed on these separately.
VCF files begin with a header and contain site info in rows and individual as columns.
Subsetting TotalRawSNPs.5.5.vcf into Albatross-only and Contemporary-only:
1. Make a file with the header
grep '^##' TotalRawSNPs.5.5.vcf > header.vcf
head -n 1 body.vcf > headers.vcf
2. Make a file with the body (including #CHRM). We'll do this by reversing the search with -v
grep -v '^##' TotalRawSNPs.5.5.vcf > body.vcf
3. Make a file with only the genotype info (from column 10 to the end)
cut -f10- body.vcf > genotypes.vcf
4. Subset genotypes.vcf to include only Albatross genotypes (in this example, Albatross samples were in the first 16 columns)
cut -f1-16 genotypes.vcf > albatrossgenotypes.vcf
5. Subset genotypes.vcf to include only contemporary genotypes (from column 17 to the end)
cut -f17- genotypes.vcf > contemporarygenotypes.vcf
6. Get SNPs stats into a file
cut -f1-9 body.vcf > snps-stats.vcf
7. Paste the SNPs-stats with albatrossgenotypes (horizontally) to form albatrossbody.vcf
paste snps-stats.vcf albatrossgenotypes.vcf > albatrossbody.vcf
8. Concatenate (vertically) the header to the albatrossbody to make final albatross.vcf
cat header.vcf albatrossbody.vcf > albatross.vcf
9. Repeat 7-8 to make final contemporary.vcf
paste snps-stats.vcf contemporarygenotypes.vcf > contemporarybody.vcf
10.
cat header.vcf contemporarybody.vcf > contemporary.vcf
# Alternatively, you can make a vcf selecting only the best individuals.
cut -f1-10,13,15,16,17,18,22,23,24,25 body.vcf > alb10_body.vcf
cat header.vcf alb10_body.vcf > alb10.vcf