Simplify Functional Enrichment Results

Zuguang Gu ([email protected])

2021-10-26

The simplifyEnrichment package clusters functional terms into groups by clustering their similarity matrix with a newly proposed method called “binary cut”. Binary cut recursively applies partitioning around medoids (PAM) with two groups on the similarity matrix; in each iteration step, a score is assigned to decide whether the group of gene sets that corresponds to the current sub-matrix should be split further or not. For more details of the method, please refer to the simplifyEnrichment paper.
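Conceptually, binary cut can be sketched as follows. This is only an illustrative sketch, not the package's actual implementation; in particular, the stopping criterion used here (a simple threshold on the mean similarity of the sub-matrix) stands in for the score used by the real method.

```r
library(cluster)  # for pam()

# Illustrative sketch of binary cut (not the package's implementation).
# `mat` is a symmetric similarity matrix with values in [0, 1].
binary_cut_sketch = function(mat, stop_cutoff = 0.8) {
    n = nrow(mat)
    # toy stopping criterion: treat a homogeneous sub-matrix as one cluster;
    # the real method assigns a score to the two PAM sub-groups instead
    if (n < 3 || mean(mat) > stop_cutoff) return(rep(1L, n))
    cl = pam(as.dist(1 - mat), k = 2)$clustering
    res = integer(n)
    res[cl == 1] = binary_cut_sketch(mat[cl == 1, cl == 1, drop = FALSE], stop_cutoff)
    res[cl == 2] = binary_cut_sketch(mat[cl == 2, cl == 2, drop = FALSE], stop_cutoff) +
                   max(res[cl == 1])
    res
}
```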

Simplify GO enrichment results

The major use case for simplifyEnrichment is simplifying GO enrichment results by clustering the semantic similarity matrix of the significant GO terms. To demonstrate the usage, we first generate a list of random GO IDs from the Biological Process (BP) ontology:

library(simplifyEnrichment)
set.seed(888)
go_id = random_GO(500)

simplifyEnrichment starts with the GO similarity matrix. Users can use their own similarity matrices or use the GO_similarity() function to calculate the semantic similarity matrix. GO_similarity() is simply a wrapper around GOSemSim::termSim(). The function accepts a vector of GO IDs. Note that all the GO terms should belong to the same ontology (i.e., BP, CC or MF).

mat = GO_similarity(go_id)

By default, GO_similarity() uses the "Rel" method in GOSemSim::termSim(). Other methods to calculate GO similarities can be set via the measure argument, e.g.:

GO_similarity(go_id, measure = "Wang")

With the similarity matrix mat, users can directly apply the simplifyGO() function to perform the clustering and visualize the results.

df = simplifyGO(mat)
## Cluster 500 terms by 'binary_cut'... 43 clusters, used 1.781711 secs.

On the right side of the heatmap are the word cloud annotations, which summarize the functions with keywords in every GO cluster. Note there is no word cloud for the cluster that is merged from small clusters (size < 5).

The returned variable df is a data frame with GO IDs, GO terms and the cluster labels:

head(df)
##           id                                           term cluster
## 1 GO:0003283                      atrial septum development       1
## 2 GO:0022018 lateral ganglionic eminence cell proliferation       1
## 3 GO:0030032                         lamellipodium assembly       2
## 4 GO:0061508                            CDP phosphorylation       3
## 5 GO:1901222          regulation of NIK/NF-kappaB signaling       4
## 6 GO:0060164 regulation of timing of neuron differentiation       1

The sizes of the GO clusters can be retrieved by:

sort(table(df$cluster))
## 
##   5   7   8  10  12  13  17  18  21  22  24  25  26  27  28  29  30  31  32  33 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  34  35  36  37  38  39  40  41  42  43  20  23   9  16  19  15  14  11   2   6 
##   1   1   1   1   1   1   1   1   1   1   2   2   3   5   5   6  10  12  37  45 
##   4   1   3 
##  97 114 132

Or split the data frame by the cluster labels:

split(df, df$cluster)

The plot argument can be set to FALSE in simplifyGO(), so that no plot is generated and only the data frame is returned.
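For example:

```r
# clustering only, no heatmap
df = simplifyGO(mat, plot = FALSE)
```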

If the aim is only to cluster the GO terms, the binary_cut() or cluster_terms() functions can be applied directly:

binary_cut(mat)
##   [1]  1  1  2  3  4  1  2  3  1  3  5  6  1  3  1  3  6  7  3  3  4  8  4  4  1
##  [26]  1  3  6  3  1  4  3  3  2  3  4  3  4  4  2  2  4  6  6  9  2  6  2  2  3
##  [51]  3  4  2  2  4  6  3  3  4  3  4  3  3  3 10  3  1 11  3  1  6  4  6  3 12
##  [76]  1 13  4 14  2  4 11  6  1  3  4  1  4  4  4 15  3  6  3  3  3  4  3 14  6
## [101]  3  4 16  6  1  2  2  2 11  4  3  3  3 17  1  4  1  3  6  3  1  3  1  3  4
## [126]  3 16  4  6  4  3  9  3  3  3  3  1  2  3  4  3  1  3  3  3 18  1  2  3  1
## [151]  3 19  1  3  4  1  1  1  1  4  3 20 15  2  3  1  1  1  1  1 21  4  1  4  6
## [176]  4  4  3  1  4  1  4 11 11 11  1  4 11  1  3  6  2  3  1 22  1  3  6  1 14
## [201]  1  3  4  4  4  2  4  6  3  3  1  3  1  6  3  4  4 11  4  1  1 23  1 24  6
## [226]  4  3  2  1  1  3  1  1  6  1  1  1  4  6  3  4  3  4 16  3  1  1  4 25  4
## [251]  4  4  1  1 26  4  4  4  6  4  1  3  3  3  2 19  4  3  4 27  4  4  2  6  3
## [276]  3 11  3  6  2 16  3  3  2  1  1  6  6  6  3  3  3  1  3  4  3  1  3  1 14
## [301] 15 28  2 20  1  3  1  1  1  4  1  3  3  4  1 19 29  4  4  4  1  4  6  4 30
## [326]  3  3  6  1  1  3  1  4  2  1  3  3  3  3 19  6 14  3  1  1 11  1  1  4 14
## [351]  6 11  3  3  4  3  3  2  1 14  1  1  4  1  2 31  1  1  4  1  3  4  3  1  3
## [376] 32  1  1  3  1  3  6  3  3  3  1 19  6  3 11  3  1  3  3 33  2  1  4  4  1
## [401]  6  1  4  2 15  6  4  2  3  4  4  4 14  3  3  4  4  1  1  3  6  2  3  2  1
## [426]  4 34  1  3  1  1 23  3  6  4  9  1  1  6  4  1 35 36 37 38  1  1  1  4  2
## [451]  1 14 15  4  3 14  3  3  6  2  1 15 16  3  4  3  3  4  3  3  4  1 39  6  4
## [476]  3  3  4  3  4  6  2  2  3 40  4  6  4 41  3  1  3  1  3  1  6 42 43  4  1

or

cluster_terms(mat, method = "binary_cut")

binary_cut() and cluster_terms() generate essentially the same clusterings, but the cluster labels might differ.

Simplify general functional enrichment results

Semantic measures can be used for the similarity of GO terms. However, many other ontologies (e.g., MSigDB gene sets) are only represented as lists of genes, where the similarity between gene sets is mainly measured by gene overlap. simplifyEnrichment provides the term_similarity() function and related functions (term_similarity_from_enrichResult(), term_similarity_from_KEGG(), term_similarity_from_Reactome(), term_similarity_from_MSigDB() and term_similarity_from_gmt()) which calculate the similarity of terms by gene overlap, using the Jaccard coefficient, Dice coefficient, overlap coefficient or kappa coefficient.

The similarity can be calculated by providing:

  1. A list of gene sets where each gene set contains a vector of genes.
  2. An enrichResult object, which is normally generated by the clusterProfiler, DOSE, meshes or ReactomePA package.
  3. A list of KEGG/Reactome/MSigDB IDs. The gene set names can also be provided for MSigDB ontologies.
  4. A gmt file and the corresponding gene set IDs.

Once you have the similarity matrix, you can send it to the simplifyEnrichment() function. Note, however, that as benchmarked in the manuscript, clustering on gene-overlap similarity performs much worse than clustering on semantic similarity.
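As a small sketch, assuming a toy list of gene sets (the set names and genes below are made up purely for illustration):

```r
library(simplifyEnrichment)

# a toy list of gene sets; names and genes are made up for illustration
gl = list(
    set1 = c("TP53", "MDM2", "CDKN1A"),
    set2 = c("TP53", "MDM2", "BAX"),
    set3 = c("EGFR", "ERBB2", "KRAS")
)
sim = term_similarity(gl, method = "jaccard")  # 3x3 gene-overlap similarity
# with a realistic number of terms, the matrix would then be clustered:
# df = simplifyEnrichment(sim)
```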

Comparing clustering methods

In the simplifyEnrichment package, there are also functions that compare clustering results from different methods. Here we again use the previously generated variable mat, the similarity matrix of the 500 random GO terms. Simply running the compare_clustering_methods() function applies all supported methods (in all_clustering_methods()) except mclust, because mclust usually takes a very long time to run. The function generates a figure with three panels:

  1. A heatmap of the similarity matrix with different clusterings as row annotations.
  2. A heatmap of the pair-wise concordance of the clustering from every two methods.
  3. Barplots of the difference scores for each method, the numbers of clusters (total clusters and clusters with size >= 5) and the mean similarity of the terms that are in the same cluster (the block mean).

In the barplots, the three metrics are defined as follows:

  1. Difference score: the difference between the similarity values of term pairs in the same cluster and term pairs in different clusters. For a similarity matrix \(M\) and terms \(i\) and \(j\) where \(i \ne j\), the similarity value \(x_{i,j}\) is saved to the vector \(\mathbf{x_1}\) when terms \(i\) and \(j\) are in the same cluster, and to the vector \(\mathbf{x_2}\) when they are not. The difference score measures the distribution difference between \(\mathbf{x_1}\) and \(\mathbf{x_2}\), calculated as the Kolmogorov-Smirnov statistic between the two distributions.
  2. Number of clusters: For each clustering, there are two numbers: the number of total clusters and the number of clusters with size >= 5 (only the big clusters).
  3. Block mean: Mean similarity values of the diagonal blocks in the similarity heatmap. Using the same convention as for the difference score, the block mean is the mean value of \(\mathbf{x_1}\).
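The difference score can be sketched in base R as follows. This is an illustrative re-implementation, not the package's internal code; cl is assumed to be a vector of cluster labels for the rows of mat.

```r
# Sketch: difference score for a similarity matrix `mat` and cluster labels `cl`
difference_score_sketch = function(mat, cl) {
    same  = outer(cl, cl, "==")   # TRUE if the two terms share a cluster
    upper = upper.tri(mat)        # use each pair (i, j), i < j, once
    x1 = mat[upper & same]        # within-cluster similarities
    x2 = mat[upper & !same]       # between-cluster similarities
    # Kolmogorov-Smirnov statistic between the two distributions
    suppressWarnings(ks.test(x1, x2)$statistic)
}
```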

compare_clustering_methods(mat)
## Cluster 500 terms by 'binary_cut'... 43 clusters, used 1.183213 secs.
## Cluster 500 terms by 'kmeans'... 18 clusters, used 4.774471 secs.
## Cluster 500 terms by 'dynamicTreeCut'... 58 clusters, used 0.2210929 secs.
## Cluster 500 terms by 'apcluster'... 39 clusters, used 0.9031322 secs.
## Cluster 500 terms by 'hdbscan'... 13 clusters, used 0.2740827 secs.
## Cluster 500 terms by 'fast_greedy'... 29 clusters, used 0.150671 secs.
## Cluster 500 terms by 'leading_eigen'... 30 clusters, used 0.3904157 secs.
## Cluster 500 terms by 'louvain'... 29 clusters, used 0.1929445 secs.
## Cluster 500 terms by 'walktrap'... 26 clusters, used 0.4676108 secs.
## Cluster 500 terms by 'MCL'... 28 clusters, used 3.557883 secs.

If the plot_type argument is set to "heatmap", heatmaps of the similarity matrix under the different clustering methods are generated. The last panel is a table with the numbers of clusters.

compare_clustering_methods(mat, plot_type = "heatmap")
## Cluster 500 terms by 'binary_cut'... 43 clusters, used 1.182847 secs.
## Cluster 500 terms by 'kmeans'... 17 clusters, used 5.044989 secs.
## Cluster 500 terms by 'dynamicTreeCut'... 58 clusters, used 0.2185602 secs.
## Cluster 500 terms by 'apcluster'... 39 clusters, used 1.399533 secs.
## Cluster 500 terms by 'hdbscan'... 13 clusters, used 0.2701004 secs.
## Cluster 500 terms by 'fast_greedy'... 29 clusters, used 0.1286063 secs.
## Cluster 500 terms by 'leading_eigen'... 30 clusters, used 0.4048717 secs.
## Cluster 500 terms by 'louvain'... 29 clusters, used 0.1786525 secs.
## Cluster 500 terms by 'walktrap'... 26 clusters, used 0.4680381 secs.
## Cluster 500 terms by 'MCL'... 28 clusters, used 2.814603 secs.

Please note, the clustering methods may involve randomness, which means different runs of compare_clustering_methods() may generate slightly different clusterings. Thus, if users want to compare the plots from compare_clustering_methods(mat) and compare_clustering_methods(mat, plot_type = "heatmap"), they should set the same random seed before executing each call.

set.seed(123)
compare_clustering_methods(mat)
set.seed(123)
compare_clustering_methods(mat, plot_type = "heatmap")

compare_clustering_methods() is simply a wrapper around the cmp_make_clusters() and cmp_make_plot() functions, where the former performs the clustering with the different methods and the latter visualizes the results. To compare different plots, users can also use the following code without needing to set a random seed:

clt = cmp_make_clusters(mat) # just a list of cluster labels
cmp_make_plot(mat, clt)
cmp_make_plot(mat, clt, plot_type = "heatmap")

Register new clustering methods

New clustering methods can be added by register_clustering_methods(), removed by remove_clustering_methods() and reset to the default methods by reset_clustering_methods(). All the supported methods can be retrieved by all_clustering_methods(). compare_clustering_methods() runs all the clustering methods in all_clustering_methods().

The new clustering methods should be defined as functions and passed to register_clustering_methods() as named arguments.
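For example, registering hierarchical clustering as a new method could look as follows. This is a toy sketch only: the method name "hclust_cut" is made up, and a real method should choose the number of clusters automatically rather than hard-coding it as done here.

```r
library(simplifyEnrichment)

register_clustering_methods(
    # toy method: hierarchical clustering with a fixed tree cut
    hclust_cut = function(mat, ...) {
        hc = hclust(as.dist(1 - mat), ...)  # turn similarity into distance
        cutree(hc, k = 10)                  # return a vector of cluster labels
    }
)
all_clustering_methods()
```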

The functions should accept at least one argument, the input matrix (mat in the example above). The second, optional argument should always be ... so that parameters for the clustering function can be passed via the control argument of cluster_terms() or simplifyGO(). If users forget to add ..., it is added internally.

Please note, the user-defined function should automatically identify the optimal number of clusters. The function should return a vector of cluster labels; internally, the labels are converted to numeric ones.

Examples

The following examples were used for the benchmarking in the manuscript:

Apply to multiple lists of GO IDs

It is very common that users have multiple lists of GO enrichment results (e.g., from multiple groups of genes) and want to compare the significant terms between the lists, e.g., to see which biological functions are more specific to a certain list. The function simplifyGOFromMultipleLists() in the package helps with this type of analysis.

The input data for simplifyGOFromMultipleLists() (the lt argument) can have three formats:

  1. A list of data frames containing GO IDs and adjusted p-values.
  2. A list of numeric vectors of adjusted p-values, named by GO IDs.
  3. A list of character vectors of GO IDs.

If the GO enrichment results come directly from upstream analysis, e.g. the clusterProfiler package or other similar packages, the results are most probably represented as a list of data frames; thus, we first demonstrate the usage on a list of data frames.

The functional_enrichment() function in the cola package applies functional enrichment to different groups of signature genes from consensus clustering. It internally uses clusterProfiler and returns a list of data frames:

# perform functional enrichment on the signature genes from the cola analysis
library(cola)
data(golub_cola) 
res = golub_cola["ATC:skmeans"]

library(hu6800.db)
x = hu6800ENTREZID
mapped_probes = mappedkeys(x)
id_mapping = unlist(as.list(x[mapped_probes]))

lt = functional_enrichment(res, k = 3, id_mapping = id_mapping)
## - 2050/4116 significant genes are taken from 3-group comparisons
## - on k-means group 1/3, 827 genes
##   - 651/827 (78.7%) genes left after id mapping
##   - gene set enrichment, GO:BP
## - on k-means group 2/3, 359 genes
##   - 317/359 (88.3%) genes left after id mapping
##   - gene set enrichment, GO:BP
## - on k-means group 3/3, 864 genes
##   - 786/864 (91%) genes left after id mapping
##   - gene set enrichment, GO:BP
names(lt)
## [1] "BP_km1" "BP_km2" "BP_km3"
head(lt[[1]][, 1:7])
##                    ID                            Description GeneRatio
## GO:0033993 GO:0033993                      response to lipid    88/626
## GO:0003013 GO:0003013             circulatory system process    67/626
## GO:0014070 GO:0014070    response to organic cyclic compound    86/626
## GO:0051050 GO:0051050       positive regulation of transport    83/626
## GO:0008015 GO:0008015                      blood circulation    58/626
## GO:1901699 GO:1901699 cellular response to nitrogen compound    68/626
##              BgRatio       pvalue     p.adjust       qvalue
## GO:0033993 902/18723 7.337708e-20 3.841290e-16 2.663974e-16
## GO:0003013 597/18723 2.268968e-18 5.939025e-15 4.118775e-15
## GO:0014070 948/18723 1.935913e-17 3.378168e-14 2.342794e-14
## GO:0051050 899/18723 2.662612e-17 3.484694e-14 2.416671e-14
## GO:0008015 512/18723 3.311473e-16 3.467113e-13 2.404478e-13
## GO:1901699 689/18723 9.807701e-16 8.557219e-13 5.934519e-13

By default, simplifyGOFromMultipleLists() automatically identifies the columns that contain the GO IDs and the adjusted p-values, so here we directly send lt to simplifyGOFromMultipleLists(). We additionally set padj_cutoff to 0.001, because under the default cutoff of 0.01 there are too many GO IDs; the stricter cutoff saves running time.

simplifyGOFromMultipleLists(lt, padj_cutoff = 0.001)
## Use column 'ID' as `go_id_column`.
## Use column 'p.adjust' as `padj_column`.
## 784/6321 GO IDs left for clustering.
## Cluster 784 terms by 'binary_cut'... 27 clusters, used 1.040457 secs.

Next we demonstrate the two other input formats for simplifyGOFromMultipleLists(). Both usages are straightforward. The first is a list of numeric vectors named by GO IDs:

lt2 = lapply(lt, function(x) structure(x$p.adjust, names = x$ID))
simplifyGOFromMultipleLists(lt2, padj_cutoff = 0.001)

And the second is a list of character vectors of GO IDs:

lt3 = lapply(lt, function(x) x$ID[x$p.adjust < 0.001])
simplifyGOFromMultipleLists(lt3)

The process of this analysis is as follows. Assume there are \(n\) GO lists. We first construct a global matrix whose columns correspond to the \(n\) GO lists and whose rows correspond to the union of all GO IDs in the \(n\) lists. The value for the \(i\)th GO ID in the \(j\)th list is taken from the corresponding numeric vector in lt. If the \(j\)th vector in lt does not contain the \(i\)th GO ID, the value defined by the default argument is used (e.g., in most cases the numeric values are adjusted p-values, so default is set to 1). Let us call this matrix \(M_0\).

The next step is to filter \(M_0\) so that we only keep a subset of GO IDs of interest. A function passed via the filter argument removes GO IDs that are not important for the analysis. The filter function is applied to every row of \(M_0\) and must return a logical value that decides whether the current GO ID is kept or removed. For example, if the values in lt are adjusted p-values, the filter function can be set to function(x) any(x < padj_cutoff) so that a GO ID is kept as long as it is significant in at least one list. Let us call the filtered matrix \(M_1\).
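Using the list of named numeric vectors lt2 from above, the construction of \(M_0\) and the filtering step can be sketched in base R as follows (an illustrative re-implementation, not the package's internal code):

```r
# all GO IDs appearing in any list (the "union")
all_ids = unique(unlist(lapply(lt2, names)))

# M0: rows are GO IDs, columns are the lists; missing IDs get the
# `default` value, here 1 since the values are adjusted p-values
M0 = sapply(lt2, function(v) {
    x = v[all_ids]
    x[is.na(x)] = 1
    x
})
rownames(M0) = all_ids

# M1: keep GO IDs significant in at least one list
M1 = M0[apply(M0, 1, function(x) any(x < 0.001)), , drop = FALSE]
```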

The GO IDs in \(M_1\) (its row names) are used for the clustering. A heatmap of \(M_1\) is attached to the left of the GO similarity heatmap, so that group-specific (or list-specific) patterns can be easily observed and related to the GO functions.

The heatmap_param argument controls several parameters of the heatmap of \(M_1\).

Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] hu6800.db_3.13.0         org.Hs.eg.db_3.14.0      AnnotationDbi_1.56.0    
##  [4] IRanges_2.28.0           S4Vectors_0.32.0         Biobase_2.54.0          
##  [7] cola_2.0.0               simplifyEnrichment_1.4.0 BiocGenerics_0.40.0     
## [10] knitr_1.36              
## 
## loaded via a namespace (and not attached):
##   [1] shadowtext_0.0.9       circlize_0.4.13        fastmatch_1.1-3       
##   [4] plyr_1.8.6             igraph_1.2.7           lazyeval_0.2.2        
##   [7] proxyC_0.2.1           splines_4.1.1          BiocParallel_1.28.0   
##  [10] MCL_1.0                GenomeInfoDb_1.30.0    ggplot2_3.3.5         
##  [13] digest_0.6.28          yulab.utils_0.0.4      foreach_1.5.1         
##  [16] htmltools_0.5.2        GOSemSim_2.20.0        viridis_0.6.2         
##  [19] magick_2.7.3           GO.db_3.14.0           fansi_0.5.0           
##  [22] magrittr_2.0.1         memoise_2.0.0          tm_0.7-8              
##  [25] cluster_2.1.2          doParallel_1.0.16      graphlayouts_0.7.1    
##  [28] ComplexHeatmap_2.10.0  Biostrings_2.62.0      annotate_1.72.0       
##  [31] RcppParallel_5.1.4     matrixStats_0.61.0     enrichplot_1.14.0     
##  [34] colorspace_2.0-2       ggrepel_0.9.1          blob_1.2.2            
##  [37] xfun_0.27              dplyr_1.0.7            crayon_1.4.1          
##  [40] RCurl_1.98-1.5         microbenchmark_1.4-7   jsonlite_1.7.2        
##  [43] scatterpie_0.1.7       genefilter_1.76.0      impute_1.68.0         
##  [46] ape_5.5                brew_1.0-6             survival_3.2-13       
##  [49] iterators_1.0.13       glue_1.4.2             polyclip_1.10-0       
##  [52] gtable_0.3.0           zlibbioc_1.40.0        XVector_0.34.0        
##  [55] GetoptLong_1.0.5       shape_1.4.6            apcluster_1.4.8       
##  [58] scales_1.1.1           DOSE_3.20.0            DBI_1.1.1             
##  [61] Rcpp_1.0.7             gridtext_0.1.4         viridisLite_0.4.0     
##  [64] xtable_1.8-4           clue_0.3-60            tidytree_0.3.5        
##  [67] gridGraphics_0.5-1     bit_4.0.4              mclust_5.4.7          
##  [70] httr_1.4.2             fgsea_1.20.0           RColorBrewer_1.1-2    
##  [73] ellipsis_0.3.2         pkgconfig_2.0.3        XML_3.99-0.8          
##  [76] farver_2.1.0           sass_0.4.0             utf8_1.2.2            
##  [79] dynamicTreeCut_1.63-1  ggplotify_0.1.0        tidyselect_1.1.1      
##  [82] labeling_0.4.2         rlang_0.4.12           reshape2_1.4.4        
##  [85] munsell_0.5.0          tools_4.1.1            cachem_1.0.6          
##  [88] downloader_0.4         dbscan_1.1-8           generics_0.1.1        
##  [91] RSQLite_2.2.8          evaluate_0.14          stringr_1.4.0         
##  [94] fastmap_1.1.0          yaml_2.2.1             ggtree_3.2.0          
##  [97] bit64_4.0.5            tidygraph_1.2.0        purrr_0.3.4           
## [100] KEGGREST_1.34.0        ggraph_2.0.5           nlme_3.1-153          
## [103] slam_0.1-48            aplot_0.1.1            DO.db_2.9             
## [106] xml2_1.3.2             compiler_4.1.1         png_0.1-7             
## [109] treeio_1.18.0          tweenr_1.0.2           tibble_3.1.5          
## [112] bslib_0.3.1            stringi_1.7.5          highr_0.9             
## [115] lattice_0.20-45        Matrix_1.3-4           markdown_1.1          
## [118] vctrs_0.3.8            pillar_1.6.4           lifecycle_1.0.1       
## [121] jquerylib_0.1.4        eulerr_6.1.1           GlobalOptions_0.1.2   
## [124] data.table_1.14.2      cowplot_1.1.1          bitops_1.0-7          
## [127] irlba_2.3.3            patchwork_1.1.1        qvalue_2.26.0         
## [130] R6_2.5.1               gridExtra_2.3          codetools_0.2-18      
## [133] MASS_7.3-54            assertthat_0.2.1       rjson_0.2.20          
## [136] GenomeInfoDbData_1.2.7 expm_0.999-6           parallel_4.1.1        
## [139] clusterProfiler_4.2.0  ggfun_0.0.4            tidyr_1.1.4           
## [142] rmarkdown_2.11         skmeans_0.2-13         ggforce_0.3.3         
## [145] NLP_0.2-1