-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathset_overlap.Rmd
91 lines (68 loc) · 1.94 KB
/
set_overlap.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
---
title: "Overlap analysis for newly made gene sets"
author: "The DEE2 Gene Signatures Group"
date: "`r Sys.Date()`"
output:
html_document
---
Source: https://github.com/markziemann/dee2_gene_signatures
## Background
The purpose of this analysis is to perform Jaccard analysis with the new gene sets to ensure they are not redundant
with those already in MSigDB.
## Libraries
```{r, libs}
suppressPackageStartupMessages({
library("getDEE2")
library("mitch")
library("triwise")
library("dplyr")
library("gplots")
library("reshape2")
library("network")
})
```
## Import gene sets
Now for MSigDB version 7.2 accessed 27/Sep/2020.
There are 31120 sets and 40044 genes.
```{r,import}
epi <- gmt_import("epilepsy_genesymbols.gmt")
diab <- gmt_import("diabetes_genesymbols.gmt")
hd <- gmt_import("heartdisease_genesymbols.gmt")
sars <- gmt_import("sarsmers_genesymbols.gmt")
msigdb <- gmt_import("msigdb.v7.2.symbols.gmt")
length(msigdb)
```
## Jaccard overlap
Here is a function that converts list of vectors to a network diagram.
There are 2x edges than nodes.
Only the edges with highest similarity are retained, as per jaccard.
The size of the gene set is proportional to the node size (sqrt).
```{r,gsnet}
gsjac <- function(gs1,gs2){
mclapply(gs1, function(y) {
l <- lapply(gs2,function(x) { length(intersect(x,y )) / length(union(x,y )) } )
l <- unlist(l)
lmax <- tail(l[order(l)],1)
lmax
},mc.cores=8)
}
# calculate jaccard with sets already in MSigDB
epi_jac <- gsjac(epi,msigdb)
# any with jaccard above 80% ?
which(epi_jac >0.8)
#max
epi_jac[tail(order(unlist(epi_jac)),1)]
diab_jac <- gsjac(diab,msigdb)
which(diab_jac >0.8)
diab_jac[tail(order(unlist(diab_jac)),1)]
hd_jac <- gsjac(hd,msigdb)
which(hd_jac >0.8)
hd_jac[tail(order(unlist(hd_jac)),1)]
sars_jac <- gsjac(sars,msigdb)
which(sars_jac >0.8)
sars_jac[tail(order(unlist(sars_jac)),1)]
```
## Session information
```{r sessioninfo}
sessionInfo()
```