drugTargetAnnot
getDrugTarget
The drugTargetInteractions
package provides utilities for identifying
drug-target interactions for sets of small molecule and/or gene/protein
identifiers (Wu et al. 2006). The required drug-target interaction information is
obained from a downloaded SQLite instance of the ChEMBL
database (Gaulton et al. 2012; Bento et al. 2014). ChEMBL has been chosen for this purpose,
because it provides one of the most comprehensive and best annotatated knowledge resources
for drug-target information available in the public domain.
As Bioconductor package drugTargetInteraction
can be installed with the BiocManager::install()
function.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("drugTargetInteractions")
Alternatively, the package can be installed from GitHub as follows.
devtools::install_github("girke-lab/drugTargetInteractions", build_vignettes=TRUE) # Installs from github
To use drugTargetInteractions
, the package needs to be loaded in a user’s R session.
library("drugTargetInteractions") # Loads the package
The following commands are useful to list the available help files and open the the vignette of the package.
library(help="drugTargetInteractions") # Lists package info
vignette(topic="drugTargetInteractions", package="drugTargetInteractions") # Opens vignette
The drugTargetInteractions
package uses a downloaded SQLite instance of the
ChEMBL database. The following code in this
vignette uses a slimmed down toy version of this database that is small enough
to be included in this package for demonstrating the usage of its functions.
For real drug-target analysis work, it is important that users download and
uncompress a recent version of the ChEMBL SQLite database from
here, and then replace
the path assigned to chembldb
below with the path to the full version of the
ChEMBL database they have downloaded to their system. Since the SQLite database
from ChEMBL can be used by this package as is, creating a copy of the ChEMBL
SQLite database on Bioconductor’s
AnnotationHub
is not necessary at this point. This way users can always use the latest or
historical versions of ChEMBL without the need of maintaining a mirror instance.
The following genConfig
function call creates a list containing the paths to
input and output directories used by the sample code introduced in this
vignette. For real analyses, users want to customize these paths to match the
environment on their system. Usually, this means all paths generated by the
system.file
function need to be changed, as those are specific to work with
the test data of a package (e.g. toy ChEMBL SQLite instance).
chembldb <- system.file("extdata", "chembl_sample.db", package="drugTargetInteractions")
resultsPath <- system.file("extdata", "results", package="drugTargetInteractions")
config <- genConfig(chemblDbPath=chembldb, resultsPath=resultsPath)
In addition, a lookup table is downloaded on the fly from UniChem (see web
page and corresponding ftp
site.
This lookup table is used by drugTargetInteractions
to translate compound
identifiers across different drug databases. Currently, this includes three types of
compound identifiers: DrugBank, PubChem, and ChEBI.
downloadUniChem(config=config)
cmpIdMapping(config=config)
Users mainly interested in generating analysis results can skip the technical details in the following sections and continue with the section entitled Workflow to Run Everything.
The following returns for a set of query IDs (e.g. ENSEMBL gene IDs) the
corresponding UniProt IDs based on a stict ID matching as well as a more
relaxed sequence similarity-based approach. The latter sequence similarity
associations are obtained with the getUniprotIDs
or the getParalogs
functions
using UniProt’s UNIREF cluster or BioMart’s paralog annotations, respectively.
The UniProt.ws
package is used to to return for a set of query IDs
(here Ensembl gene IDs) the corresponding UniProt IDs based on two independent
approaches: ID mappings (IDMs) and sequence similarity nearest neighbors
(SSNNs) using UNIREF clusters. The latter have been generated by UniProt with the MMSeqs2
and Linclust algorithms (Steinegger and Söding 2017; Steinegger and Soeding 2018). Additional details
on the UNIREF clusters are available here. The
organism, query ID type and sequence similarity level can be selected under the taxId
,
kt
and seq_cluster
arguments, respectively. The seq_cluster
argument can be
assigned one of: UNIREF100
, UNIREF90
or UNIREF50
. The result is a list
with two data.frames
. The first one is based on IDMs and the second one on
SSNNs.
keys <- c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90")
The following shows the first data.frame
containing the ID mapping results.
library(DT)
datatable(res_list90[[1]], options = list(scrollX=TRUE, scrollY="600px", autoWidth = TRUE))
The following shows how to return the dimensions of the two data.frames
and how to
obtain the UniProt IDs as character vectors required for the downstream analysis steps.
sapply(res_list90, dim, simplify=FALSE)
## $IDM
## [1] 18 8
##
## $SSNN
## [1] 22 8
sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID)))
## $IDM
## [1] "A0A669KAY2" "D6RJB7" "Q8N7Z5" "F8VP73" "F8W606" "G8JLQ3" "H0YHA4"
## [8] "P78537" "A0A0G2JQP8" "A0A1W2PPV8" "A0A1W2PQT4" "A0A1W2PRA9" "A0A1W2PRB5" "A0A1W2PRR3"
## [15] "A0A1W2PS83" "A0A3B3IT55" "I3L233" "Q7Z3B3"
##
## $SSNN
## [1] "A0A669KAY2" "D6RJB7" "Q8N7Z5" "A0A087WSV2" "F8VP73" "F8W606" "G8JLQ3"
## [8] "F8VNQ1" "H0YHA4" "P78537" "A0A1W2PPV8" "A0A1W2PQT4" "A0A1W2PRA9" "A0A1W2PRB5"
## [15] "A0A1W2PRR3" "A0A1W2PS83" "A0A3B3IT55" "A0A024R9Y2" "A0A0G2JNT7" "A0A0G2JQF4" "I3L233"
## [22] "Q7Z3B3"
The following obtains via biomaRt
for a set of query genes the corresponding
UniProt IDs as well as their paralogs. The query genes can be Gene Names or ENSEMBL
Gene IDs from H. sapiens. The result is similar to IDMs and SSNNs from the
getUniprotIDs
function, but instead of UNIREF clusters, biomaRt’s paralogs
are used to obtain SSNNs.
queryBy <- list(molType="gene", idType="external_gene_name", ids=c("ANKRD31", "BLOC1S1", "KANSL1"))
queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=c("ENSG00000145700", "ENSG00000135441", "ENSG00000120071"))
res_list <- getParalogs(queryBy)
The following shows the first data.frame
containing the ID mapping results.
library(DT)
datatable(res_list[[1]], options = list(scrollX = TRUE, scrollY="400px", autoWidth = TRUE))
The following shows how to return the dimensions of the two data.frames
and how to
obtain the UniProt IDs as character vectors required for the downstream analysis steps.
sapply(res_list, dim, simplify=FALSE)
## $IDM
## [1] 30 5
##
## $SSNN
## [1] 33 7
sapply(names(res_list), function(x) unique(na.omit(res_list[[x]]$ID_up_sp)))
## $IDM
## [1] "Q7Z3B3" "P78537" "Q8N7Z5"
##
## $SSNN
## [1] "Q7Z3B3" "A0AUZ9" "P78537" "Q8N7Z5" "Q5JPF3" "A6QL64" "Q8N2N9"
The drugTargetAnnot
function returns for a set of compound or gene/protein
IDs the corresponding known drug-target annotation data available in ChEMBL. A
related function called getDrugTarget
is described in the subsequent
subsection. This method generates very similar results, but uses internally pre-computed
annotation summary tables which is less flexible than the usage of pure SQL statements.
drugTargetAnnot
The drugTargetAnnot
function queries the ChEMBL database with SQL statements without
depending on pre-computed annotation tables.
queryBy <- list(molType="cmp", idType="chembl_id", ids=c("CHEMBL17", "CHEMBL19", "CHEMBL1201117", "CHEMBL25", "nomatch", "CHEMBL1742471"))
qresult1 <- drugTargetAnnot(queryBy, config=config)
library(DT)
datatable(qresult1, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
queryBy <- list(molType="protein", idType="UniProt_ID", ids=c("P43166", "P00915"))
qresult2 <- drugTargetAnnot(queryBy, config=config)
library(DT)
datatable(qresult2, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
The following returns drug-target annotations for a set of query Ensembl gene IDs. For this the Ensembl gene IDs are translated into UniProt IDs using both the IDM and SSNN approaches.
keys <- c("ENSG00000120088", "ENSG00000135441", "ENSG00000120071")
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=keys, seq_cluster="UNIREF90")
id_list <- sapply(names(res_list90), function(x) unique(na.omit(res_list90[[x]]$ID)))
Next, drug-target annotations are returned for the Uniprot IDs with the IDM or
SSNN methods. The following example uses the UniProt IDs of the SSNN method.
Note, to include the upstream Ensembl gene IDs in the final result table, the
below Ensembl ID collapse step via a tapply
is necessary since occasionally
UniProt IDs are assigned to several Ensembl gene IDs (e.g. recent gene duplications).
queryBy <- list(molType="protein", idType="UniProt_ID", ids=id_list[[2]])
qresultSSNN <- drugTargetAnnot( queryBy, config=config)
ensidsSSNN <- tapply(res_list90[[2]]$ENSEMBL, res_list90[[2]]$ID, paste, collapse=", ")
qresultSSNN <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultSSNN$QueryIDs)], qresultSSNN)
library(DT)
datatable(qresultSSNN, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
getDrugTarget
The getDrugTarget
function generates similar results as drugTargetAnnot
,
but depends on a pre-computed query table (here drugTargetAnnot.xls
).
id_mapping <- c(chembl="chembl_id", pubchem="PubChem_ID", uniprot="UniProt_ID", drugbank="DrugBank_ID")
queryBy <- list(molType="cmp", idType="pubchem", ids=c("2244", "65869", "2244"))
queryBy <- list(molType="protein", idType="uniprot", ids=c("P43166", "P00915", "P43166"))
queryBy <- list(molType="cmp", idType="drugbank", ids=c("DB00945", "DB01202"))
#qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17,39,46:53),config=config)
qresult3 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config)
library(DT)
datatable(qresult3, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
queryBy <- list(molType="protein", idType="chembl", ids=c("CHEMBL25", "nomatch", "CHEMBL1742471"))
#qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping, columns=c(1,5,8,16,17,39,46:52),config=config)
qresult4 <- getDrugTarget(queryBy=queryBy, id_mapping=id_mapping, columns=c(1,5,8,16,17),config=config)
library(DT)
datatable(qresult4, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
The drugTargetBioactivity
function returns for a set of compound or gene/protein IDs the
corresponding bioassay data available in ChEMBL.
Example query for compounds IDs.
queryBy <- list(molType="cmp", idType="DrugBank_ID", ids=c("DB00945", "DB00316", "DB01050"))
qresultBAcmp <- drugTargetBioactivity(queryBy, config=config)
library(DT)
datatable(qresultBAcmp, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
Example query for protein IDs. Note, the Ensembl gene to UniProt ID mappings are derived from
above and stored in the named character vector ensidsSSNN
.
queryBy <- list(molType="protein", idType="uniprot", ids=id_list[[1]])
qresultBApep <- drugTargetBioactivity(queryBy, config=config)
qresultBApep <- data.frame(Ensembl_IDs=ensidsSSNN[as.character(qresultBApep$UniProt_ID)], qresultBApep)
library(DT)
datatable(qresultBApep, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
This section explains how to run all of the above drug-target interaction analysis steps with a few convenience meta functions. Users mainly interested in generating analysis results quickly can focus on this section only.
The getSymEnsUp
function returns for a query of gene or protein IDs a mapping
table containing: ENSEMBL Gene IDs, Gene Names/Symbols, UniProt IDs and ENSEMBL
Protein IDs. Internally, the function uses the ensembldb
package. Its results
are returned in a list where the first slot contains the ID mapping table, while
the subsequent slots include the corresponding named character vectors: ens_gene_id
,
up_ens_id
, and up_gene_id
. Currently, the following query IDs are supported
by getSymEnsUp
: GENE_NAME
, ENSEMBL_GENE_ID
and UNIPROT_ID
.
gene_name <- c("CA7", "CFTR")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=gene_name, idtype="GENE_NAME")
ens_gene_id <- idMap$ens_gene_id
ens_gene_id
## ENSG00000168748 ENSG00000001626
## "CA7" "CFTR"
ensembl_gene_id <- c("ENSG00000001626", "ENSG00000168748")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=ensembl_gene_id, idtype="ENSEMBL_GENE_ID")
ens_gene_id <- idMap$ens_gene_id
uniprot_id <- c("P43166", "P13569")
idMap <- getSymEnsUp(EnsDb="EnsDb.Hsapiens.v86", ids=uniprot_id, idtype="UNIPROT_ID")
ens_gene_id <- idMap$ens_gene_id
The perfect match and nearest neighbor UniProt IDs can be obtained from UniProt’s UNIREF cluster or BioMart’s paralog annotations.
The corresponding IDM and SSNN UniProt IDs for the above ENSEMBL gene IDs are
obtained with the getUniprotIDs
function. This step is slow since the queries
have to be performed with chunksize=1
in order to reliably track the query ENSEMBL
gene ID information in the results.
res_list90 <- getUniprotIDs(taxId=9606, kt="ENSEMBL", keys=names(ens_gene_id), seq_cluster="UNIREF90", chunksize=1)
sapply(res_list90, dim)
Here the corresponding perfect match (IDM) and paralog (SSNN) UniProt IDs for
the above ENSEMBL gene IDs are obtained with the getParalogs
function. The
latter is much faster than getUniprotIDs
, and also covers wider evolutionary
distances. Thus, it may be the preferred methods for many use cases.
queryBy <- list(molType="gene", idType="ensembl_gene_id", ids=names(ens_gene_id))
res_list <- getParalogs(queryBy)
sapply(res_list, dim)
## IDM SSNN
## [1,] 13 127
## [2,] 5 7
Both drug-target annotation and bioassay data are obtained with a meta
function called runDrugTarget_Annot_Bioassay
that internally uses the main
processing functions drugTargetAnnot
and drugTargetBioactivity
. It
organizes the result in a list with the annotation and bioassay data
(data.frames
) in the first and second slot, respectively. Importantly, the
results from the IDM and SSNN UniProt IDs are combined in a single table, where
duplicated rows have been removed. To track in the result table, which method
was used for obtaining UniProt IDs, an IDM_Mapping_Type
column has been added.
Note, the gene IDs in the SSNN rows have the string Query_
prepended to
indicate that they are not necessarily the genes encoding the SSNN UniProt
proteins listed in the corresponding rows. Instead they are the genes encoding
the query proteins used for searching for SSNNs.
drug_target_list <- runDrugTarget_Annot_Bioassay(res_list=res_list, up_col_id="ID_up_sp", ens_gene_id, config=config)
sapply(drug_target_list, dim)
## Annotation Bioassay
## [1,] 55 35
## [2,] 18 17
View content of annotation result slot:
datatable(drug_target_list$Annotation, options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
View content of bioassay result slot (restricted to first 500 rows):
datatable(drug_target_list$Bioassay[1:500,], options = list(scrollX = TRUE, scrollY="600px", autoWidth = TRUE))
The following generates a summary table containing drug-target frequency information.
df <- drug_target_list$Annotation
df[,"GeneName"] <- gsub("Query_", "", as.character(df$GeneName))
stats <- tapply(df$CHEMBL_CMP_ID, as.factor(df$GeneName), function(x) unique(x))
stats <- sapply(names(stats), function(x) stats[[x]][nchar(stats[[x]]) > 0])
stats <- sapply(names(stats), function(x) stats[[x]][!is.na(stats[[x]])])
statsDF <- data.frame(GeneNames=names(stats), Drugs=sapply(stats, paste, collapse=", "), N_Drugs=sapply(stats, length))
Print drug-target frequency table.
datatable(statsDF, options = list(scrollX = TRUE, scrollY="150px", autoWidth = TRUE))
Both the annotation and bioassay data of the drug_target_list
object can
be exported to separate tabular files as follows.
write.table(drug_target_list$Annotation, "DrugTargetAnnotation.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(drug_target_list$Bioassay, "DrugTargetBioassay.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
write.table(statDF, "statDF.xls", row.names=FALSE, quote=FALSE, na="", sep="\t")
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.26.0 AnnotationFilter_1.26.0
## [4] GenomicFeatures_1.54.1 AnnotationDbi_1.64.0 Biobase_2.62.0
## [7] GenomicRanges_1.54.1 GenomeInfoDb_1.38.0 IRanges_2.36.0
## [10] S4Vectors_0.40.1 BiocGenerics_0.48.0 DT_0.30
## [13] drugTargetInteractions_1.10.1 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 dplyr_1.1.3 blob_1.2.4
## [4] filelock_1.0.2 Biostrings_2.70.1 bitops_1.0-7
## [7] lazyeval_0.2.2 fastmap_1.1.1 RCurl_1.98-1.12
## [10] BiocFileCache_2.10.1 GenomicAlignments_1.38.0 XML_3.99-0.14
## [13] digest_0.6.33 lifecycle_1.0.3 ellipsis_0.3.2
## [16] ProtGenerics_1.34.0 KEGGREST_1.42.0 RSQLite_2.3.2
## [19] magrittr_2.0.3 compiler_4.3.1 rlang_1.1.1
## [22] sass_0.4.7 progress_1.2.2 tools_4.3.1
## [25] utf8_1.2.4 yaml_2.3.7 rtracklayer_1.62.0
## [28] knitr_1.45 htmlwidgets_1.6.2 prettyunits_1.2.0
## [31] S4Arrays_1.2.0 bit_4.0.5 curl_5.1.0
## [34] DelayedArray_0.28.0 xml2_1.3.5 abind_1.4-5
## [37] BiocParallel_1.36.0 withr_2.5.2 purrr_1.0.2
## [40] grid_4.3.1 fansi_1.0.5 biomaRt_2.58.0
## [43] SummarizedExperiment_1.32.0 cli_3.6.1 rmarkdown_2.25
## [46] crayon_1.5.2 generics_0.1.3 httr_1.4.7
## [49] rjson_0.2.21 BiocBaseUtils_1.4.0 DBI_1.1.3
## [52] cachem_1.0.8 stringr_1.5.0 zlibbioc_1.48.0
## [55] parallel_4.3.1 BiocManager_1.30.22 XVector_0.42.0
## [58] restfulr_0.0.15 matrixStats_1.0.0 vctrs_0.6.4
## [61] Matrix_1.6-1.1 jsonlite_1.8.7 bookdown_0.36
## [64] hms_1.1.3 bit64_4.0.5 crosstalk_1.2.0
## [67] jquerylib_0.1.4 glue_1.6.2 codetools_0.2-19
## [70] stringi_1.7.12 BiocIO_1.12.0 tibble_3.2.1
## [73] pillar_1.9.0 rappdirs_0.3.3 htmltools_0.5.6.1
## [76] GenomeInfoDbData_1.2.11 R6_2.5.1 dbplyr_2.4.0
## [79] lattice_0.22-5 evaluate_0.22 png_0.1-8
## [82] Rsamtools_2.18.0 memoise_2.0.1 bslib_0.5.1
## [85] rjsoncons_1.0.0 SparseArray_1.2.0 xfun_0.40
## [88] MatrixGenerics_1.14.0 UniProt.ws_2.42.0 pkgconfig_2.0.3
Bento, A Patrı́cia, Anna Gaulton, Anne Hersey, Louisa J Bellis, Jon Chambers, Mark Davies, Felix A Krüger, et al. 2014. “The ChEMBL bioactivity database: an update.” Nucleic Acids Res. 42 (Database issue): D1083–90. https://doi.org/10.1093/nar/gkt1031.
Gaulton, Anna, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, et al. 2012. “ChEMBL: a large-scale bioactivity database for drug discovery.” Nucleic Acids Res. 40 (Database issue): D1100–7. https://doi.org/10.1093/nar/gkr777.
Steinegger, Martin, and Johannes Soeding. 2018. “Clustering huge protein sequence sets in linear time.” Nat. Commun. 9 (1): 2542. https://doi.org/10.1038/s41467-018-04964-5.
Steinegger, Martin, and Johannes Söding. 2017. “MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.” Nat. Biotechnol. 35 (11): 1026–8. https://doi.org/10.1038/nbt.3988.
Wu, C H, R Apweiler, A Bairoch, D A Natale, W C Barker, B Boeckmann, S Ferro, et al. 2006. “The Universal Protein Resource (UniProt): an expanding universe of protein information.” Nucleic Acids Res. 34 (Database issue): D187–D191. https://doi.org/10.1093/nar/gkj161.