forked from Biocodings/MulEA
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
724 lines (531 loc) · 33.5 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
---
output:
md_document:
variant: gfm
toc: true
toc_depth: 2
bibliography: references.bib
---
```{r 'kintr', include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# ![](man/figures/MulEA_logo.png){width="59"} `mulea` - an R Package for Enrichment Analysis Using Multiple Ontologies and Empirical False Discovery Rate
<!-- badges: start -->
[![GitHub issues](https://img.shields.io/github/issues/ELTEbioinformatics/mulea)](https://github.com/ELTEbioinformatics/mulea/issues) [![GitHub pulls](https://img.shields.io/github/issues-pr/ELTEbioinformatics/mulea)](https://github.com/ELTEbioinformatics/mulea/pulls)
<!-- badges: end -->
# Introduction
The `mulea` R package [@turek] is a comprehensive tool for functional enrichment analysis. It provides two different approaches:
1. For unranked sets of elements, such as significantly up- or down-regulated genes, `mulea` employs the set-based **overrepresentation analysis (ORA)**.
2. Alternatively, if the data consists of ranked elements, for instance, genes ordered by *p*-value or log fold-change calculated by the differential expression analysis, `mulea` offers the **gene set enrichment (GSEA)** approach.
For the overrepresentation analysis, `mulea` employs a progressive **empirical false discovery rate (eFDR)** method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies.
`mulea` expands beyond traditional tools by incorporating a **wide range of ontologies**, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in 879 files available at the [ELTEbioinformatics/GMT_files_for_mulea](https://github.com/ELTEbioinformatics/GMT_files_for_mulea) GitHub repository and through the [`muleaData`](https://bioconductor.org/packages/release/data/experiment/html/muleaData.html) ExperimentData Bioconductor package.
# Installation
Install the dependency `fgsea` BioConductor package:
```{r 'install1', eval=FALSE, message=FALSE, warning=FALSE}
# Installing the BiocManager package if needed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Installing the fgsea package with the BiocManager
BiocManager::install("fgsea")
```
To install `mulea` from [CRAN](https://cran.r-project.org/package=mulea):
```{r 'install2', eval=FALSE, message=FALSE, warning=FALSE}
install.packages("mulea")
```
To install the development version of `mulea` from GitHub:
```{r 'install3', eval=FALSE, message=FALSE, warning=FALSE}
# Installing the devtools package if needed
if (!require("devtools", quietly = TRUE))
install.packages("devtools")
# Installing the mulea package from GitHub
devtools::install_github("https://github.com/ELTEbioinformatics/mulea")
```
# Usage
First, load the **`mulea`** and **`dplyr`** libraries. The **`dplyr`** library is not essential but is used here to facilitate data manipulation and inspection.
```{r 'calling1', eval=FALSE}
library(mulea)
library(tidyverse)
```
```{r 'calling2', echo=FALSE}
suppressMessages(library(mulea))
suppressMessages(library(tidyverse))
```
## Importing the Ontology
This section demonstrates how to import the desired ontology, such as transcription factors and their target genes downloaded from the ![Regulon](vignettes/Regulon.png){alt="Regulon" width="114" height="25"} [database](https://regulondb.ccg.unam.mx/), into a data frame suitable for enrichment analysis. We present multiple methods for importing the ontology. Ensure that the identifier type (*e.g.*, *UniProt* protein ID, *Entrez* ID, Gene Symbol, *Ensembl* gene ID) matches between the ontology and the elements you wish to investigate.
### Alternative 1: Importing the Ontology from a GMT File
The [GMT (Gene Matrix Transposed)](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) format contains collections of genes or proteins associated with specific ontology terms in a tab-delimited text file. The GMT file can be read into R as a data frame using the `read_gmt` function from the `mulea` package. Each term is represented by a single row in both the GMT file and the data frame. Each row includes three types of elements:
1. **Ontology identifier** (“*ontology_id*”): This uniquely identifies each term within the file or data frame.
2. **Ontology name or description** (*“ontology_name”*): This provides a user-friendly label or textual description for each term.
3. **Associated gene or protein identifiers**: These are listed in the *"list_of_values"* column, with identifiers separated by spaces, and belong to each term.[^1]
[^1]: The format of the actually used ontology slightly deviates from standard GMT files. In `tf_ontology`, both the `ontology_id` and `ontology_name` columns contain *gene symbols* of the transcription factors, unlike other ontologies such as GO, where these columns hold specific identifiers and corresponding names.
#### A) `mulea` GMT File
Alongside with the `mulea` package we provide ontologies collected from 16 publicly available databases, in a standardised GMT format for 27 model organisms, from Bacteria to human. These files are available at the [ELTEbioinformatics/GMT_files_for_mulea](https://github.com/ELTEbioinformatics/GMT_files_for_mulea) GitHub repository.
To read a downloaded GMT file locally:
```{r 'read_GMT1', eval=FALSE}
# Reading the mulea GMT file locally
tf_ontology <- read_gmt("Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.gmt")
```
Alternatively, one can read it directly from the GitHub repository:
```{r 'read_GMT2'}
# Reading the GMT file from the GitHub repository
tf_ontology <- read_gmt("https://raw.githubusercontent.com/ELTEbioinformatics/GMT_files_for_mulea/main/GMT_files/Escherichia_coli_83333/Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.gmt")
```
#### B) Enrichr GMT File
`mulea` is compatible with [GMT files](https://maayanlab.cloud/Enrichr/#libraries) provided with the `Enricher` R package [@kuleshov2016]. Download and read such a GMT file (*e. g.* [TRRUST_Transcription_Factors_2019.txt]{.underline}) locally. *Note that this ontology is not suitable for analyzing the Escherichia coli differential expression data described in the section [The Differential Expression Dataset to Analyse].*
```{r 'read_GMT3', eval=FALSE}
# Reading the Enrichr GMT file locally
tf_enrichr_ontology <- read_gmt("TRRUST_Transcription_Factors_2019.txt")
# The ontology_name is empty, therefore we need to fill it with the ontology_id
tf_enrichr_ontology$ontology_name <- tf_enrichr_ontology$ontology_id
```
#### C) MsigDB GMT File
`mulea` is compatible with the MsigDB [@subramanian2005] [GMT files](https://www.gsea-msigdb.org/gsea/msigdb/). Download and read such a GMT file (*e. g.* [c3.tft.v2023.2.Hs.symbols.gmt]{.underline}) locally. *Note that this ontology is not suitable for analyzing the Escherichia coli differential expression data described in the section [The Differential Expression Dataset to Analyse].*
```{r 'read_GMT4', eval=FALSE}
# Reading the MsigDB GMT file locally
tf_msigdb_ontology <- read_gmt("c3.tft.v2023.2.Hs.symbols.gmt")
```
### Alternative 2: Importing the Ontology with the `muleaData` Package
Alternatively, you can retrieve the ontology using the [`muleaData`](https://github.com/ELTEbioinformatics/muleaData) ExperimentData Bioconductor package:
```{r 'read_muleaData', eval=FALSE}
# Installing the ExperimentHub package from Bioconductor
BiocManager::install("ExperimentHub")
# Calling the ExperimentHub library.
library(ExperimentHub)
# Downloading the metadata from ExperimentHub.
eh <- ExperimentHub()
# Creating the muleaData variable.
muleaData <- query(eh, "muleaData")
# Looking for the ExperimentalHub ID of the ontology.
EHID <- mcols(muleaData) %>%
as.data.frame() %>%
dplyr::filter(title == "Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.rds") %>%
rownames()
# Reading the ontology from the muleaData package.
tf_ontology <- muleaData[[EHID]]
# Change the header
tf_ontology <- tf_ontology %>%
rename(ontology_id = "ontologyId",
ontology_name = "ontologyName",
list_of_values = "listOfValues")
```
### Filtering the Ontology
Enrichment analysis results can sometimes be skewed by overly specific or broad entries. `mulea` allows you to customise the size of ontology entries – the number of genes or proteins belonging to a term – ensuring your analysis aligns with your desired scope.
Let's exclude ontology entries with less than 3 or more than 400 gene symbols.
```{r 'exclude_ontology'}
# Filtering the ontology
tf_ontology_filtered <- filter_ontology(gmt = tf_ontology,
min_nr_of_elements = 3,
max_nr_of_elements = 400)
```
### Saving the Ontology as a GMT file
You can save the ontology as a GMT file using the **`write_gmt`** function.
```{r 'save_gmt', eval=FALSE}
# Saving the ontology to GMT file
write_gmt(gmt = tf_ontology_filtered,
file = "Filtered.gmt")
```
### Converting a List to an Ontology Object
The `mulea` package provides the `list_to_gmt` function to convert a list of gene sets into an ontology data frame. The following example demonstrates how to use this function:
```{r 'list_to_gmt_example', eval=FALSE}
# Creating a list of gene sets
ontology_list <- list(gene_set1 = c("gene1", "gene2", "gene3"),
gene_set2 = c("gene4", "gene5", "gene6"))
# Converting the list to a ontology (GMT) object
new_ontology_df <- list_to_gmt(ontology_list)
```
## The Differential Expression Dataset to Analyse
For further steps we will analyse a dataset from a microarray experiment ([GSE55662](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55662)) in the NCBI Gene Expression Omnibus ![GEO](vignettes/geo_main.gif){alt="GEO" width="87"}. The study by @méhi2014 investigated antibiotic resistance evolution in *Escherichia coli*. Gene expression changes were compared between *ciprofloxacin* antibiotic-treated *Escherichia coli* bacteria and non-treated controls.
The expression levels of these groups were compared with the [GEO2R](https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE55662) tool:
- Non-treated control samples (2 replicates): [WT_noCPR_1](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1341344), [WT_noCPR_2](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1341345)
- *Ciprofloxacin*-treated samples (2 replicates): [WT_CPR_1](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1341346), [WT_CPR_2](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1341347)
To see how the dataset were prepared go to the [Formatting the Results of a Differential Expression Analysis] section.
## The Unordered Set-Based Overrepresentation Analysis (ORA)
The `mulea` package implements a set-based enrichment analysis approach using the *hypergeometric test*, which is analogous to the one-tailed Fisher's exact test. This method identifies statistically significant overrepresentation of elements from a target set (*e.g.*, significantly up- or downregulated genes) within a background set (*e.g.*, all genes that were investigated in the experiment). Therefore, a predefined threshold value, such as 0.05 for the corrected *p*-value or 2-fold change, should be used in the preceding analysis.
The overrepresentation analysis is implemented in the `ora` function which requires three inputs:
1. **Ontology** **data frame:** Fits the investigated taxa and the applied gene or protein identifier type, such as GO, pathway, transcription factor regulation, microRNA regulation, gene expression data, genomic location data, or protein domain content.
2. **Target set:** A vector of elements to investigate, containing genes or proteins of interest, such as significantly overexpressed genes in the experiment.
3. **Background set:** A vector of background elements representing the broader context, often including all genes investigated in the study.
### Reading the Target and the Background Sets from Text Files
Let's read the text files containing the identifiers (gene symbols) of the target and the background gene set directly from the GitHub website. To see how these files were prepared, refer to the section on [Formatting the Results of a Differential Expression Analysis].
```{r 'reading_target_bg'}
# Taget set
target_set <- readLines("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/target_set.txt")
# Background set
background_set <- readLines("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/background_set.txt")
```
### Performing the OverRepresentation Analysis
To perform the analysis, we will first establish a model using the `ora` function. This model defines the parameters for the enrichment analysis. We then execute the test itself using the **`run_test`** function. It is important to note that for this example, we will employ 10,000 permutations for the *empirical false discovery rate* (*eFDR*), which is the recommended minimum, to ensure robust correction for multiple testing.
```{r 'ora'}
# Creating the ORA model using the GMT variable
ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "eFDR",
# Number of permutations
number_of_permutations = 10000,
# Number of processor threads to use
nthreads = 2,
# Setting a random seed for reproducibility
random_seed = 1)
# Running the ORA
ora_results <- run_test(ora_model)
```
### Examining the ORA Result
The `ora_results` data frame summarises the enrichment analysis, listing enriched ontology entries – in our case transcription factors – alongside their associated *p*-values and *eFDR* values.
We can now determine the number of transcription factors classified as "enriched" based on these statistical measures (*eFDR* \< 0.05).
```{r 'ora_size'}
ora_results %>%
# Rows where the eFDR < 0.05
filter(eFDR < 0.05) %>%
# Number of such rows
nrow()
```
Inspect the significant results:
```{r 'print_ora', eval=FALSE}
ora_results %>%
# Arrange the rows by the eFDR values
arrange(eFDR) %>%
# Rows where the eFDR < 0.05
filter(eFDR < 0.05)
```
```{r 'print_ora2', echo=FALSE}
ora_results %>%
# Arrange the rows by the eFDR values
arrange(eFDR) %>%
# Rows where the eFDR < 0.05
filter(eFDR < 0.05) %>%
knitr::kable()
```
### Visualising the ORA Result
To gain a comprehensive understanding of the enriched transcription factors, **`mulea`** offers diverse visualisation tools, including lollipop charts, bar plots, networks, and heatmaps. These visualisations effectively reveal patterns and relationships among the enriched factors.
Initialising the visualisation with the `reshape_results` function:
```{r 'init_plot_ora'}
# Reshapeing the ORA results for visualisation
ora_reshaped_results <- reshape_results(model = ora_model,
model_results = ora_results,
# Choosing which column to use for the
# indication of significance
p_value_type_colname = "eFDR")
```
**Visualising the Spread of *eFDR* Values: Lollipop Plot**
Lollipop charts provide a graphical representation of the distribution of enriched transcription factors. The *y*-axis displays the transcription factors, while the *x*-axis represents their corresponding *eFDR* values. The dots are coloured based on their *eFDR* values. This visualisation helps us examine the spread of *eFDRs* and identify factors exceeding the commonly used significance threshold of 0.05.
```{r 'lollipop_plot_ora', out.width = '60%'}
plot_lollipop(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
```
**Visualising the Spread of *eFDR* Values: Bar Plot**
Bar charts offer a graphical representation similar to lollipop plots. The *y*-axis displays the enriched ontology categories (*e.g.*, transcription factors), while the *x*-axis represents their corresponding *eFDR* values. The bars are coloured based on their *eFDR* values, aiding in examining the spread of *eFDRs* and identifying factors exceeding the significance threshold of 0.05.
```{r 'bar_plot_ora', out.width = '60%'}
plot_barplot(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
```
**Visualising the Associations: Graph Plot**
This function generates a network visualisation of the enriched ontology categories (*e.g.*, transcription factors). Each node represents an eriched ontology category, coloured based on its *eFDR* value. An edge is drawn between two nodes if they share at least one common gene belonging to the target set, indicating co-regulation. The thickness of the edge reflects the number of shared genes.
```{r 'network_plot_ora', out.width = '60%', warning = FALSE, message = FALSE}
plot_graph(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
```
**Visualising the Associations: Heatmap**
The heatmap displays the genes associated with the enriched ontology categories (*e.g.*, transcription factors). Each row represents a category, coloured based on its *eFDR* value. Each column represents a gene from the target set belonging to the enriched ontology category, indicating potential regulation by one or more enriched transcription factors.
```{r 'heatmap_ora'}
plot_heatmap(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Column that indicates the significance values
p_value_type_colname = "eFDR")
```
### Comparing the significant results when applying the eFDR to the Benjamini-Hochberg and the Bonferroni corrections
The `ora` function allows you to choose between different methods for calculating the *FDR* and adjusting the *p*-values: *eFDR*, and all `method` options from the `stats::p.adjust` documentation (holm, hochberg, hommel, bonferroni, BH, BY, and fdr). The following code snippet demonstrates how to perform the analysis using the *Benjamini-Hochberg* and *Bonferroni* corrections:
```{r 'ora_bh_bonferroni'}
# Creating the ORA model using the Benjamini-Hochberg p-value correction method
BH_ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "BH")
# Running the ORA
BH_results <- run_test(BH_ora_model)
# Creating the ORA model using the Bonferroni p-value correction method
Bonferroni_ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "bonferroni")
# Running the ORA
Bonferroni_results <- run_test(Bonferroni_ora_model)
```
To compare the significant results (using the conventional \< 0.05 threshold) of the *eFDR*, *Benjamini-Hochberg*, and *Bonferroni* corrections, we can merge and filter the result tables:
```{r 'compare_p.adj'}
# Merging the Benjamini-Hochberg and eFDR results
merged_results <- BH_results %>%
# Renaming the column
rename(BH_adjusted_p_value = adjusted_p_value) %>%
# Selecting the necessary columns
select(ontology_id, BH_adjusted_p_value) %>%
# Joining with the eFDR results
left_join(ora_results, ., by = "ontology_id") %>%
# Converting the data.frame to a tibble
tibble()
# Merging the Bonferroni results with the merged results
merged_results <- Bonferroni_results %>%
# Renaming the column
rename(Bonferroni_adjusted_p_value = adjusted_p_value) %>%
# Selecting the necessary columns
select(ontology_id, Bonferroni_adjusted_p_value) %>%
# Joining with the eFDR results
left_join(merged_results, ., by = "ontology_id") %>%
# Arranging by the p-value
arrange(p_value)
# filter the p-value < 0.05 results
merged_results_filtered <- merged_results %>%
filter(p_value < 0.05) %>%
# remove the unnecessary columns
select(-ontology_id, -nr_common_with_tested_elements,
-nr_common_with_background_elements)
```
```{r 'print_compare_p.adj', echo=FALSE}
merged_results_filtered %>%
knitr::kable()
```
A comparison of the significant results revealed that conventional *p*-value corrections (Benjamini-Hochberg and Bonferroni) tend to be overly conservative, leading to a reduction in the number of significant transcription factors compared to the *eFDR*. As illustrated in the below figure, by applying the *eFDR* we were able to identify 10 significant transcription factors, while with the Benjamini-Hochberg and Bonferroni corrections only 7 and 3, respectively. This suggests that the *eFDR* may be a more suitable approach for controlling false positives in this context.
![](vignettes/Venn.png){alt="Venn" width="300"}
## Gene Set Enrichment Analysis (GSEA) {#gene-set-enrichment-analysis-gsea}
To perform enrichment analysis using ranked lists, you need to provide an ordered list of elements, such as genes or proteins. This ranking is typically based on the results of your prior analysis, using metrics like *p*-values, *z*-scores, fold-changes, or others. Crucially, the ranked list should include all elements involved in your analysis. For example, in a differential expression study, it should encompass all genes that were measured.
`mulea` utilises the Kolmogorov-Smirnov approach with a permutation test (developed by @subramanian2005) to calculate gene set enrichment analyses. This functionality is implemented through the integration of the [`fgsea`](https://bioconductor.org/packages/release/bioc/html/fgsea.html) Bioconductor package (created by @korotkevich).
GSEA requires input data about the genes analysed in our experiment. This data can be formatted in two ways:
1. **Data frame:** This format should include all genes investigated and their respective log fold change values (or other values for ordering the genes) obtained from the differential expression analysis.
2. **Two vectors:** Alternatively, you can provide two separate vectors. One vector should contain the gene symbols (or IDs), and the other should hold the corresponding log fold change values (or other values for ordering the genes) for each gene.
### Reading the Tab Delimited File Containing the Ordered Set
Let's read the TSV file containing the identifiers (gene symbols) and the log fold change values of the investigated set directly from the GitHub website. For details on how this file was prepared, please refer to the [Formatting the Results of a Differential Expression Analysis] section.
```{r 'reading_ordered1', eval=FALSE}
# Reading the tsv containing the ordered set
ordered_set <- read_tsv("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/ordered_set.tsv")
```
```{r 'reading_ordered2', echo=FALSE}
# Reading the tsv containing the ordered set
ordered_set <- read_tsv("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/ordered_set.tsv",
show_col_types = FALSE)
```
### Performing the Gene Set Enrichment Analysis
To perform the analysis, we will first establish a model using the `gsea` function. This model defines the parameters for the enrichment analysis. Subsequently, we will execute the test itself using the `run_test` function. We will employ 10,000 permutations for the false discovery rate, to ensure robust correction for multiple testing.
```{r 'gsea', warning=FALSE, message=FALSE}
# Creating the GSEA model using the GMT variable
gsea_model <- gsea(gmt = tf_ontology_filtered,
# Names of elements to test
element_names = ordered_set$Gene.symbol,
# LogFC-s of elements to test
element_scores = ordered_set$logFC,
# Consider elements having positive logFC values only
element_score_type = "pos",
# Number of permutations
number_of_permutations = 10000)
# Running the GSEA
gsea_results <- run_test(gsea_model)
```
### Examining the GSEA Results
The `gsea_results` data frame summarises the enrichment analysis, listing enriched ontology entries – in our case transcription factors – alongside their associated *p*-values and adjusted *p*-value values.
We can now determine the number of transcription factors classified as "enriched" based on these statistical measures (adjusted *p*-value \< 0.05).
```{r 'gsea_size'}
gsea_results %>%
# rows where the adjusted_p_value < 0.05
filter(adjusted_p_value < 0.05) %>%
# the number of such rows
nrow()
```
Inspect the significant results:
```{r 'print_gsea', eval=FALSE}
gsea_results %>%
# arrange the rows by the adjusted_p_value values
arrange(adjusted_p_value) %>%
# rows where the adjusted_p_value < 0.05
filter(adjusted_p_value < 0.05)
```
```{r 'print_gsea2', echo=FALSE}
gsea_results %>%
# arrange the rows by the adjusted_p_value values
arrange(adjusted_p_value) %>%
# rows where the adjusted_p_value < 0.05
filter(adjusted_p_value < 0.05) %>%
knitr::kable()
```
### Visualising the GSEA Results
Initializing the visualisation with the `reshape_results` function:
```{r 'init_plot_gsea'}
# Reshaping the GSEA results for visualisation
gsea_reshaped_results <- reshape_results(model = gsea_model,
model_results = gsea_results,
# choosing which column to use for the
# indication of significance
p_value_type_colname = "adjusted_p_value")
```
**Visualising Relationships: Graph Plot**
This function generates a network visualisation of the enriched ontology categories (*e.g.*, transcription factors). Each node represents a category and is coloured based on its significance level. A connection (edge) is drawn between two nodes if they share at least one common gene belonging to the **ranked list**, meaning that both transcription factors regulate the expression of the same target gene. The thickness of the edge reflects the number of shared genes.
```{r 'network_plot_gsea', out.width = '60%', warning = FALSE, message = FALSE}
plot_graph(reshaped_results = gsea_reshaped_results,
# the column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# column that indicates the significance values
p_value_type_colname = "adjusted_p_value")
```
Other plot types such as lollipop plots, bar plots, and heatmaps can also be used to investigate the GSEA results.
## Formatting the Results of a Differential Expression Analysis
### **Understanding the Differential Expression Results Table**
This section aims to elucidate the structure and essential components of the provided DE results table. It offers guidance to users on interpreting the data effectively for subsequent analysis with `mulea`.
Let's read the differential expression result file named [GSE55662.table_wt_non_vs_cipro.tsv](https://github.com/ELTEbioinformatics/mulea/blob/master/inst/extdata/GSE55662.table_wt_non_vs_cipro.tsv) located in the [inst/extdata/](https://github.com/ELTEbioinformatics/mulea/tree/master/inst/extdata) folder directly from the GitHub website.
```{r 'DE1', eval=TRUE, message=FALSE, warning=FALSE}
# Importing necessary libraries and reading the DE results table
geo2r_result_tab <- read_tsv("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/GSE55662.table_wt_non_vs_cipro.tsv")
```
Let's delve into the `geo2r_result_tab` data frame by examining its initial rows:
```{r 'print_geo1', eval=FALSE}
# Printing the first few rows of the data frame
geo2r_result_tab %>%
head(3)
```
```{r 'print_geo2', echo=FALSE}
# Printing the first few rows of the data frame
geo2r_result_tab %>%
head(3) %>%
knitr::kable()
```
### **Data Preparation:**
Preparing the data frame appropriately for enrichment analysis is crucial. This involves specific steps tailored to the microarray experiment type. Here, we undertake the following transformations:
- **Gene Symbol Extraction**: We isolate the primary gene symbol from the `Gene.symbol` column, eliminating any extraneous information.
- **Handling Missing Values**: Rows with missing gene symbols (`NA`) are excluded.
- **Sorting by Fold Change**: The data frame is sorted by log-fold change (`logFC`) in descending order, prioritizing genes with the most significant expression alterations.
```{r 'format_geo'}
# Formatting the data frame
geo2r_result_tab <- geo2r_result_tab %>%
# Extracting the primary gene symbol and removing extraneous information
mutate(Gene.symbol = str_remove(string = Gene.symbol,
pattern = "\\/.*")) %>%
# Filtering out rows with NA gene symbols
filter(!is.na(Gene.symbol)) %>%
# Sorting by logFC
arrange(desc(logFC))
```
Before proceeding with enrichment analysis, let's examine the initial rows of the formatted `geo2r_result_tab` data frame:
```{r 'print_geo_formatted1', eval=FALSE}
# Printing the first few rows of the formatted data frame
geo2r_result_tab %>%
head(3)
```
```{r 'print_geo_formatted2', echo=FALSE}
# Printing the first few rows of the formatted data frame
geo2r_result_tab %>%
head(3) %>%
knitr::kable()
```
Following these formatting steps, the data frame is primed for further analysis.
### Preparing Input Data for the ORA
#### Target Set
A vector containing the gene symbols of significantly overexpressed (adjusted *p*-value \< 0.05) genes with greater than 2 fold-change (logFC \> 1).
```{r 'target_set'}
target_set <- geo2r_result_tab %>%
# Filtering for adjusted p-value < 0.05 and logFC > 1
filter(adj.P.Val < 0.05 & logFC > 1) %>%
# Selecting the Gene.symbol column
select(Gene.symbol) %>%
# Converting the tibble to a vector
pull() %>%
# Removing duplicates
unique()
```
The first 10 elements of the target set:
```{r 'target_head'}
target_set %>%
head(10)
```
The number of genes in the target set:
```{r 'target_gene_nr'}
target_set %>%
length()
```
#### Background Set
A vector containing the gene symbols of all genes were included in the differential expression analysis.
```{r 'background_set'}
background_set <- geo2r_result_tab %>%
# Selecting the Gene.symbol column
select(Gene.symbol) %>%
# Converting the tibble to a vector
pull() %>%
# Removing duplicates
unique()
```
The number of genes in the background set:
```{r 'background_gene_nr'}
background_set %>%
length()
```
Save the target and the background set vectors to text file:
```{r 'save_target_bg', eval=FALSE}
# Save taget set to text file
target_set %>%
writeLines("target_set.txt")
# Save background set to text file
background_set %>%
writeLines("inst/extdata/background_set.txt")
```
### Preparing Input Data for the GSEA
```{r 'gsea_input'}
# If there are duplicated Gene.symbols keep the first one only
ordered_set <- geo2r_result_tab %>%
# Grouping by Gene.symbol to be able to filter
group_by(Gene.symbol) %>%
# Keeping the first row for each Gene.symbol from rows with the same
# Gene.symbol
filter(row_number()==1) %>%
# Ungrouping
ungroup() %>%
# Arranging by logFC in descending order
arrange(desc(logFC)) %>%
select(Gene.symbol, logFC)
```
The number of gene symbols in the `ordered_set` vector:
```{r 'ordered_genes_length'}
ordered_set %>%
nrow()
```
Save the ordered set data frame to tab delimited file:
```{r 'save_ordered', eval=FALSE}
# Save ordered set to text file
ordered_set %>%
write_tsv("ordered_set.tsv")
```
# Session Info
```{r 'session_info'}
sessionInfo()
```
# How to Cite the `mulea` Package?
To cite package `mulea` in publications use:
Turek, Cezary, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, and Eszter Ari. 2024. “mulea: An R Package for Enrichment Analysis Using Multiple Ontologies and Empirical False Discovery Rate.” *BMC Bioinformatics* 25 (1): 334. <https://doi.org/10.1186/s12859-024-05948-7>.
# References