# Exploiting the cell ontology ## Motivation As previously discussed in Section \@ref(using-harmonized-labels), *[SingleR](https://bioconductor.org/packages/3.12/SingleR)* maps the labels in its references to the [Cell Ontology](https://www.ebi.ac.uk/ols/ontologies/cl). The most obvious advantage of doing this is to provide a standardized vocabulary with which to describe cell types, thus facilitating integrated analyses with multiple references. However, another useful feature of the Cell Ontology is its hierarchical organization of terms, allowing us to adjust cell type annotations to the desired resolution. This represents a more dynamic alternative to the static `label.main` and `label.fine` options in each reference. ## Basic manipulation We use the *[ontoProc](https://bioconductor.org/packages/3.12/ontoProc)* package to load in the Cell Ontology. This produces an `ontology_index` object (from the *[ontologyIndex](https://CRAN.R-project.org/package=ontologyIndex)* package) that we can query for various pieces of information. ```r # TODO: wrap in utility function. library(ontoProc) bfc <- BiocFileCache::BiocFileCache(ask=FALSE) path <- BiocFileCache::bfcrpath(bfc, "http://purl.obolibrary.org/obo/cl.obo") cl <- get_ontology(path, extract_tags="everything") cl ``` ``` ## Ontology with 9877 terms ## ## format-version: 1.2 ## data-version: releases/2021-02-01 ## ontology: cl ## ## Properties: ## id: character ## name: character ## parents: list ## children: list ## ancestors: list ## obsolete: logical ## RO:0002100: list ## RO:0002102: list ## RO:0002103: list ## RO:0002104: list ## RO:0002120: list ## RO:0002130: list ## RO:0002292: list ## adjacent_to: list ## alt_id: list ## anterior_to: list ## anteriorly_connected_to: list ## attaches_to: list ## attaches_to_part_of: list ## bearer_of: list ## bounding_layer_of: list ## branching_part_of: list ## capable_of: list ## capable_of_part_of: list ## channel_for: list ## channels_from: list ## channels_into: list ## comment: character ## composed_primarily_of: list ## conduit_for: list ## connected_to: list ## connects: list ## consider: list ## contains: list ## continuous_with: list ## contributes_to_morphology_of: list ## created_by: character ## creation_date: character ## data-version: list ## dc-contributor: list ## dc-creator: list ## decreased_in_magnitude_relative_to: list ## deep_to: list ## def: character ## derives_from: list ## developmentally_induced_by: list ## developmentally_replaces: list ## develops_from: list ## develops_from_part_of: list ## develops_in: list ## develops_into: list ## directly_develops_from: list ## disjoint_from: list ## distal_to: list ## distally_connected_to: list ## distalmost_part_of: list ## domain: list ## dorsal_to: list ## drains: list ## ends: list ## ends_with: list ## equivalent_to_chain: list ## existence_ends_during: list ## existence_ends_during_or_before: list ## existence_ends_with: list ## existence_starts_and_ends_during: list ## existence_starts_during: list ## existence_starts_during_or_after: list ## existence_starts_with: list ## expand_assertion_to: list ## expand_expression_to: list ## extends_fibers_into: list ## filtered_through: list ## format-version: list ## has_boundary: list ## has_completed: list ## has_component: list ## has_cross_section: list ## has_developmental_contribution_from: list ## has_gene_template: list ## has_high_plasma_membrane_amount: list ## has_low_plasma_membrane_amount: list ## has_member: list ## has_muscle_antagonist: list ## has_muscle_insertion: list ## has_muscle_origin: list ## has_not_completed: list ## has_part: list ## has_potential_to_develop_into: list ## has_potential_to_developmentally_contribute_to: list ## has_quality: list ## has_relative_magnitude: list ## has_role: list ## has_skeleton: list ## holds_over_chain: list ## immediate_transformation_of: list ## immediately_deep_to: list ## immediately_preceded_by: list ## immediately_superficial_to: list ## in_anterior_side_of: list ## in_deep_part_of: list ## in_distal_side_of: list ## in_dorsal_side_of: list ## in_lateral_side_of: list ## in_left_side_of: list ## in_posterior_side_of: list ## in_proximal_side_of: list ## in_right_side_of: list ## in_taxon: list ## increased_in_magnitude_relative_to: list ## indirectly_supplies: list ## innervated_by: list ## innervates: list ## intersection_of: list ## intersects_midsagittal_plane_of: list ## inverse_of: list ## is_a: list ## is_class_level: list ## is_cyclic: list ## is_functional: list ## is_metadata_tag: list ## is_symmetric: list ## is_transitive: list ## lacks_part: list ## lacks_plasma_membrane_part: list ## located_in: list ## location_of: list ## lumen_of: list ## luminal_space_of: list ## namespace: list ## negatively_regulates: list ## never_in_taxon: list ## occurs_in: list ## only_in_taxon: list ## ontology: list ## output_of: list ## part_of: list ## participates_in: list ## positively_regulates: list ## posterior_to: list ## posteriorly_connected_to: list ## preaxialmost_part_of: list ## preceded_by: list ## precedes: list ## produced_by: list ## produces: list ## property_value: list ## protects: list ## proximal_to: list ## proximally_connected_to: list ## proximalmost_part_of: list ## range: list ## regulates: list ## remark: list ## replaced_by: list ## seeAlso: list ## sexually_homologous_to: list ## skeleton_of: list ## starts: list ## starts_with: list ## subdivision_of: list ## subset: list ## subsetdef: list ## superficial_to: list ## supplies: list ## surrounded_by: list ## surrounds: list ## synonym: list ## synonymtypedef: list ## transformation_of: list ## transitive_over: list ## tributary_of: list ## trunk_part_of: list ## union_of: list ## ventral_to: list ## xref: list ## Roots: ## BFO:0000003 - occurrent ## functionally_related_to - functionally related to ## CHEBI:36342 - subatomic particle ## RO:0002410 - causally related to ## BFO:0000002 - continuant ## UBERON:0001062 - anatomical entity ## RO:0002323 - mereotopologically related to ## GO:0005575 - cellular_component ## UBERON:0000000 - processual entity ## RO:0000057 - has participant ## ... 87 more ``` The most immediate use of this object lies in mapping ontology terms to their plain-English descriptions. We can use this to translate annotations produced by `SingleR()` from the `label.ont` labels into a more interpretable form. We demonstrate this approach using *[celldex](https://bioconductor.org/packages/3.12/celldex)*'s collection of mouse RNA-seq references [@aran2019reference]. ```r head(cl$name) # short name ``` ``` ## BFO:0000002 BFO:0000003 BFO:0000004 ## "continuant" "occurrent" "independent continuant" ## BFO:0000006 BFO:0000015 BFO:0000019 ## "spatial region" "process" "quality" ``` ```r head(cl$def) # longer definition ``` ``` ## BFO:0000002 ## "\"An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.\" []" ## BFO:0000003 ## "\"An entity that has temporal parts and that happens, unfolds or develops through time.\" []" ## BFO:0000004 ## "\"A continuant that is a bearer of quality and realizable entity entities, in which other entities inhere and which itself cannot inhere in anything.\" []" ## BFO:0000006 ## NA ## BFO:0000015 ## "\"An occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t.\" []" ## BFO:0000019 ## NA ``` ```r library(celldex) ref <- MouseRNAseqData(cell.ont="nonna") translated <- cl$name[ref$label.ont] head(translated) ``` ``` ## CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 ## "fat cell" "fat cell" "fat cell" "fat cell" "fat cell" "fat cell" ``` Another interesting application involves examining the relationship between different terms. The ontology itself is a directed acyclic graph, so we can can convert it into `graph` object for advanced queries using the *[igraph](https://CRAN.R-project.org/package=igraph)* package. Each edge represents an "is a" relationship where each vertex represents a specialized case of the concept of the parent node. ```r # TODO: wrap in utility function. parents <- cl$parents self <- rep(names(parents), lengths(parents)) library(igraph) g <- make_graph(rbind(unlist(parents), self)) g ``` ``` ## IGRAPH 8836f0e DN-- 9817 15589 -- ## + attr: name (v/c) ## + edges from 8836f0e (vertex names): ## [1] BFO:0000002 ->BFO:0000004 BFO:0000141 ->BFO:0000006 ## [3] BFO:0000003 ->BFO:0000015 BFO:0000020 ->BFO:0000019 ## [5] BFO:0000002 ->BFO:0000020 BFO:0000040 ->BFO:0000024 ## [7] BFO:0000040 ->BFO:0000030 BFO:0000002 ->BFO:0000031 ## [9] BFO:0000004 ->BFO:0000040 BFO:0000004 ->BFO:0000141 ## [11] CARO:0030000->CARO:0000000 CARO:0000006->CARO:0000003 ## [13] BFO:0000040 ->CARO:0000006 CARO:0000000->CARO:0000006 ## [15] CARO:0000003->CARO:0001000 CARO:0001000->CARO:0001001 ## + ... omitted several edges ``` One query involves identifying all descendents of a particular term of interest. This can be useful when searching for a cell type in the presence of variable annotation resolution; for example, a search for "epithelial cell" can be configured to pick up all child terms such as "endothelial cell" and "ependymal cell". ```r term <- "CL:0000624" cl$name[term] ``` ``` ## CL:0000624 ## "CD4-positive, alpha-beta T cell" ``` ```r all.kids <- names(subcomponent(g, term)) head(cl$name[all.kids]) ``` ``` ## CL:0000624 ## "CD4-positive, alpha-beta T cell" ## CL:0000492 ## "CD4-positive helper T cell" ## CL:0001051 ## "CD4-positive, CXCR3-negative, CCR6-negative, alpha-beta T cell" ## CL:0000791 ## "mature alpha-beta T cell" ## CL:0000792 ## "CD4-positive, CD25-positive, alpha-beta regulatory T cell" ## CL:0000793 ## "CD4-positive, alpha-beta intraepithelial T cell" ``` Alternatively, we might be interested in the last common ancestor (LCA) for a set of terms. This is the furthest term - or, in some cases, multiple terms - from the root of the ontology that is also an ancestor of all of the terms of interest. We will use this LCA concept in the next section to adjust resolution across multiple references. ```r terms <- c("CL:0000624", "CL:0000785", "CL:0000623") cl$name[terms] ``` ``` ## CL:0000624 CL:0000785 ## "CD4-positive, alpha-beta T cell" "mature B cell" ## CL:0000623 ## "natural killer cell" ``` ```r # TODO: god, put this in a function somewhere. all.ancestors <- lapply(terms, subcomponent, graph=g, mode="in") all.ancestors <- lapply(all.ancestors, names) common.ancestors <- Reduce(intersect, all.ancestors) ancestors.of.ancestors <- lapply(common.ancestors, subcomponent, graph=g, mode="in") ancestors.of.ancestors <- lapply(ancestors.of.ancestors, names) ancestors.of.ancestors <- mapply(setdiff, ancestors.of.ancestors, common.ancestors) latest.common.ancestors <- setdiff(common.ancestors, unlist(ancestors.of.ancestors)) cl$name[latest.common.ancestors] ``` ``` ## CL:0000542 ## "lymphocyte" ``` ## Adjusting resolution We can use the ontology graph to adjust the resolution of the reference labels by rolling up overly-specific terms to their LCA. The `findCommonAncestors()` utility takes a set of terms and returns a list of potential LCAs for various subsets of those terms. Users can inspect this list to identify LCAs at the desired resolution and then map their descendent terms to those LCAs. ```r findCommonAncestors <- function(..., g, remove.self=TRUE, names=NULL) { terms <- list(...) if (is.null(names(terms))) { names(terms) <- sprintf("set%i", seq_along(terms)) } all.terms <- unique(unlist(terms)) all.ancestors <- lapply(all.terms, subcomponent, graph=g, mode="in") all.ancestors <- lapply(all.ancestors, names) by.ancestor <- split( rep(all.terms, lengths(all.ancestors)), unlist(all.ancestors) ) # Removing ancestor nodes with the same count as its children. available <- names(by.ancestor) for (i in available) { if (!i %in% names(by.ancestor)) { next } counts <- lengths(by.ancestor) cur.ancestors <- subcomponent(g, i, mode="in") cur.ancestors <- setdiff(names(cur.ancestors), i) drop <- cur.ancestors[counts[i]==counts[cur.ancestors]] by.ancestor <- by.ancestor[!names(by.ancestor) %in% drop] } if (remove.self) { by.ancestor <- by.ancestor[lengths(by.ancestor) > 1L] } by.ancestor <- by.ancestor[order(lengths(by.ancestor))] # most specific terms first. # Decorating the output. for (i in names(by.ancestor)) { current <- by.ancestor[[i]] df <- DataFrame(row.names=current) curout <- list() if (!is.null(names)) { curout$name <- unname(names[i]) df$name <- names[current] } presence <- list() for (b in names(terms)) { presence[[b]] <- current %in% terms[[b]] } df <- cbind(df, do.call(DataFrame, presence)) curout$descendents <- df by.ancestor[[i]] <- curout } by.ancestor } lca <- findCommonAncestors(ref$label.ont, g=g, names=cl$name) head(lca) ``` ``` ## $`CL:0000081` ## $`CL:0000081`$name ## [1] "blood cell" ## ## $`CL:0000081`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## ## CL:0000232 erythrocyte TRUE ## CL:0000094 granulocyte TRUE ## ## ## $`CL:0000126` ## $`CL:0000126`$name ## [1] "macroglial cell" ## ## $`CL:0000126`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## ## CL:0000127 astrocyte TRUE ## CL:0000128 oligodendrocyte TRUE ## ## ## $`CL:0000393` ## $`CL:0000393`$name ## [1] "electrically responsive cell" ## ## $`CL:0000393`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## ## CL:0000540 neuron TRUE ## CL:0000746 cardiac muscle cell TRUE ## ## ## $`CL:0002320` ## $`CL:0002320`$name ## [1] "connective tissue cell" ## ## $`CL:0002320`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## ## CL:0000136 fat cell TRUE ## CL:0000057 fibroblast TRUE ## ## ## $`CL:0011115` ## $`CL:0011115`$name ## [1] "precursor cell" ## ## $`CL:0011115`$descendents ## DataFrame with 2 rows and 2 columns ## name set1 ## ## CL:0000047 neuronal stem cell TRUE ## CL:0000576 monocyte TRUE ## ## ## $`CL:0000066` ## $`CL:0000066`$name ## [1] "epithelial cell" ## ## $`CL:0000066`$descendents ## DataFrame with 3 rows and 2 columns ## name set1 ## ## CL:0000115 endothelial cell TRUE ## CL:0000182 hepatocyte TRUE ## CL:0000065 ependymal cell TRUE ``` We can also use this function to synchronize multiple sets of terms to the same resolution. Here, we consider the ImmGen dataset [@ImmGenRef], which provides highly resolved annotation of immune cell types. The `findCommonAncestors()` function specifies the origins of the descendents for each LCA, allowing us to focus on LCAs that have representatives in both sets of terms. ```r ref2 <- ImmGenData(cell.ont="nonna") lca2 <- findCommonAncestors(MouseRNA=ref$label.ont, ImmGen=ref2$label.ont, g=g, names=cl$name) head(lca2) ``` ``` ## $`CL:0000126` ## $`CL:0000126`$name ## [1] "macroglial cell" ## ## $`CL:0000126`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## ## CL:0000127 astrocyte TRUE FALSE ## CL:0000128 oligodendrocyte TRUE FALSE ## ## ## $`CL:0000393` ## $`CL:0000393`$name ## [1] "electrically responsive cell" ## ## $`CL:0000393`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## ## CL:0000540 neuron TRUE FALSE ## CL:0000746 cardiac muscle cell TRUE FALSE ## ## ## $`CL:0000623` ## $`CL:0000623`$name ## [1] "natural killer cell" ## ## $`CL:0000623`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## ## CL:0000623 natural killer cell TRUE TRUE ## CL:0002438 NK1.1-positive natur.. FALSE TRUE ## ## ## $`CL:0000813` ## $`CL:0000813`$name ## [1] "memory T cell" ## ## $`CL:0000813`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## ## CL:0000897 CD4-positive, alpha-.. FALSE TRUE ## CL:0000909 CD8-positive, alpha-.. FALSE TRUE ## ## ## $`CL:0000815` ## $`CL:0000815`$name ## [1] "regulatory T cell" ## ## $`CL:0000815`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## ## CL:0000792 CD4-positive, CD25-p.. FALSE TRUE ## CL:0000815 regulatory T cell FALSE TRUE ## ## ## $`CL:0000819` ## $`CL:0000819`$name ## [1] "B-1 B cell" ## ## $`CL:0000819`$descendents ## DataFrame with 2 rows and 3 columns ## name MouseRNA ImmGen ## ## CL:0000820 B-1a B cell FALSE TRUE ## CL:0000821 B-1b B cell FALSE TRUE ``` For example, we might notice that the mouse RNA-seq reference only has a single "T cell" term. To synchronize resolution across references, we would need to roll up all of the ImmGen's finely resolved subsets into that LCA as shown below. Of course, this results in some loss of precision and information; whether this is an acceptable price for simpler interpretation is a decision that is left to the user. ```r children <- lca2$`CL:0000084`$descendents children ``` ``` ## DataFrame with 35 rows and 3 columns ## name MouseRNA ImmGen ## ## CL:0000084 T cell TRUE TRUE ## CL:0002427 resting double-posit.. FALSE TRUE ## CL:0000809 double-positive, alp.. FALSE TRUE ## CL:0002429 CD69-positive double.. FALSE TRUE ## CL:0000624 CD4-positive, alpha-.. FALSE TRUE ## ... ... ... ... ## CL:0002415 immature Vgamma1.1-p.. FALSE TRUE ## CL:0002411 Vgamma1.1-positive, .. FALSE TRUE ## CL:0002416 mature Vgamma1.1-pos.. FALSE TRUE ## CL:0002407 mature Vgamma2-posit.. FALSE TRUE ## CL:0000815 regulatory T cell FALSE TRUE ``` ```r # Synchronization: synced.mm <- ref$label.ont synced.mm[synced.mm %in% rownames(children)] <- "CL:0000084" synced.ig <- ref2$label.ont synced.ig[synced.ig %in% rownames(children)] <- "CL:0000084" ``` ## Session information {-}

``` R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] igraph_1.2.6 celldex_1.0.0 [3] SummarizedExperiment_1.20.0 Biobase_2.50.0 [5] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [7] IRanges_2.24.1 S4Vectors_0.28.1 [9] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [11] matrixStats_0.58.0 ontoProc_1.12.0 [13] ontologyIndex_2.7 BiocStyle_2.18.1 [15] rebook_1.0.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 bit64_4.0.5 [3] httr_1.4.2 Rgraphviz_2.34.0 [5] tools_4.0.4 bslib_0.2.4 [7] utf8_1.2.1 R6_2.5.0 [9] DT_0.17 DBI_1.1.1 [11] withr_2.4.1 tidyselect_1.1.0 [13] processx_3.4.5 bit_4.0.4 [15] curl_4.3 compiler_4.0.4 [17] graph_1.68.0 DelayedArray_0.16.2 [19] bookdown_0.21 sass_0.3.1 [21] callr_3.5.1 rappdirs_0.3.3 [23] stringr_1.4.0 digest_0.6.27 [25] rmarkdown_2.7 XVector_0.30.0 [27] pkgconfig_2.0.3 htmltools_0.5.1.1 [29] sparseMatrixStats_1.2.1 dbplyr_2.1.0 [31] fastmap_1.1.0 htmlwidgets_1.5.3 [33] rlang_0.4.10 RSQLite_2.2.4 [35] DelayedMatrixStats_1.12.3 shiny_1.6.0 [37] jquerylib_0.1.3 generics_0.1.0 [39] paintmap_1.0 jsonlite_1.7.2 [41] dplyr_1.0.5 RCurl_1.98-1.3 [43] magrittr_2.0.1 GenomeInfoDbData_1.2.4 [45] Matrix_1.3-2 Rcpp_1.0.6 [47] fansi_0.4.2 lifecycle_1.0.0 [49] stringi_1.5.3 yaml_2.2.1 [51] zlibbioc_1.36.0 BiocFileCache_1.14.0 [53] AnnotationHub_2.22.0 grid_4.0.4 [55] blob_1.2.1 promises_1.2.0.1 [57] ExperimentHub_1.16.0 crayon_1.4.1 [59] lattice_0.20-41 ontologyPlot_1.6 [61] CodeDepends_0.6.5 knitr_1.31 [63] ps_1.6.0 pillar_1.5.1 [65] codetools_0.2-18 XML_3.99-0.6 [67] glue_1.4.2 BiocVersion_3.12.0 [69] evaluate_0.14 BiocManager_1.30.10 [71] vctrs_0.3.6 httpuv_1.5.5 [73] purrr_0.3.4 assertthat_0.2.1 [75] cachem_1.0.4 xfun_0.22 [77] mime_0.10 xtable_1.8-4 [79] later_1.1.0.1 tibble_3.1.0 [81] AnnotationDbi_1.52.0 memoise_2.0.0 [83] ellipsis_0.3.1 interactiveDisplayBase_1.28.0 ```