1 Getting started

GenomicScores is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:

install.packages("BiocManager")
BiocManager::install("GenomicScores")

Once GenomicScores is installed, it can be loaded with the following command.

library(GenomicScores)

Often, however, GenomicScores will be automatically loaded when working with an annotation package that uses GenomicScores, such as phastCons100way.UCSC.hg38.

2 Genomewide position-specific scores

Genomewide scores assign each genomic position a numeric value denoting an estimated measure of constraint or impact on variation at that position. They are commonly used to filter single nucleotide variants or assess the degree of constraint or functionality of genomic features. Genomic scores are built on the basis of different sources of information such as sequence homology, functional domains, physical-chemical changes of amino acid residues, etc.

One particular example of genomic scores are phastCons scores. They provide a measure of conservation obtained from genomewide alignments using the program phast (Phylogenetic Analysis with Space/Time models) from Siepel et al. (2005). The GenomicScores package allows one to retrieve these scores through annotation packages (Section 4) or as AnnotationHub resources (Section 5).

Often, genomic scores such as phastCons are used within workflows running on top of R and Bioconductor. The purpose of the GenomicScores package is to enable an easy and interactive access to genomic scores within those workflows.

3 Lossy storage of genomic scores with compressed vectors

Storing and accessing genomic scores within R is challenging when their values cover large regions of the genome, resulting in gigabytes of double-precision numbers. This is the case, for instance, for phastCons (Siepel et al. 2005) or CADD (Kircher et al. 2014).

We address this problem by using lossy compression, also called quantization, coupled with run-length encoding (Rle) vectors. Lossy compression attempts to trade off precision for compression without compromising the scientific integrity of the data (Zender 2016).

Sometimes, measurements and statistical estimates under certain models generate false precision. False precision is essentialy noise that wastes storage space and it is meaningless from the scientific point of view (Zender 2016). In those circumstances, lossy compression not only saves storage space, but also removes false precision.

The use of lossy compression leads to a subset of quantized values much smaller than the original set of genomic scores, resulting in long runs of identical values along the genome. These runs of identical values can be further compressed using the implementation of Rle vectors available in the S4Vectors Bioconductor package.

To enable a seamless access to genomic scores stored with quantized values in compressed vectors the GenomicScores defines the GScores class of objects. This class manages the location, loading and dequantization of genomic scores stored separately on each chromosome, while it also provides rich metadata on the provenance, citation and licensing of the original data.

4 Availability and retrieval of genomic scores

The access to genomic scores through GScores objects is available either through annotation packages or as AnnotationHub resources. To find out what kind of genomic scores are available, through which mechanism, and in which organism, we may use the function availableGScores().

avgs <- availableGScores()
avgs
                                                 Organism      Category
AlphaMissense.v2023.hg19                     Homo sapiens Pathogenicity
AlphaMissense.v2023.hg38                     Homo sapiens Pathogenicity
MafDb.1Kgenomes.phase1.GRCh38                Homo sapiens           MAF
MafDb.1Kgenomes.phase1.hs37d5                Homo sapiens           MAF
MafDb.1Kgenomes.phase3.GRCh38                Homo sapiens           MAF
MafDb.1Kgenomes.phase3.hs37d5                Homo sapiens           MAF
MafDb.ExAC.r1.0.GRCh38                       Homo sapiens           MAF
MafDb.ExAC.r1.0.hs37d5                       Homo sapiens           MAF
MafDb.ExAC.r1.0.nonTCGA.GRCh38               Homo sapiens           MAF
MafDb.ExAC.r1.0.nonTCGA.hs37d5               Homo sapiens           MAF
MafDb.TOPMed.freeze5.hg19                    Homo sapiens           MAF
MafDb.TOPMed.freeze5.hg38                    Homo sapiens           MAF
MafDb.gnomAD.r2.1.GRCh38                     Homo sapiens           MAF
MafDb.gnomAD.r2.1.hs37d5                     Homo sapiens           MAF
MafDb.gnomADex.r2.1.GRCh38                   Homo sapiens           MAF
MafDb.gnomADex.r2.1.hs37d5                   Homo sapiens           MAF
MafH5.gnomAD.v3.1.2.GRCh38                   Homo sapiens           MAF
cadd.v1.3.hg19                                       <NA>          <NA>
cadd.v1.6.hg19                               Homo sapiens Pathogenicity
cadd.v1.6.hg38                               Homo sapiens Pathogenicity
fitCons.UCSC.hg19                            Homo sapiens    Constraint
linsight.UCSC.hg19                           Homo sapiens    Constraint
mcap.v1.0.hg19                               Homo sapiens Pathogenicity
phastCons100way.UCSC.hg19                    Homo sapiens  Conservation
phastCons100way.UCSC.hg38                    Homo sapiens  Conservation
phastCons27way.UCSC.dm6           Drosophila melanogaster  Conservation
phastCons30way.UCSC.hg38                     Homo sapiens  Conservation
phastCons35way.UCSC.mm39                     Mus musculus  Conservation
phastCons46wayPlacental.UCSC.hg19            Homo sapiens  Conservation
phastCons46wayPrimates.UCSC.hg19             Homo sapiens  Conservation
phastCons60way.UCSC.mm10                     Mus musculus  Conservation
phastCons7way.UCSC.hg38                      Homo sapiens  Conservation
phyloP100way.UCSC.hg19                       Homo sapiens  Conservation
phyloP100way.UCSC.hg38                       Homo sapiens  Conservation
phyloP35way.UCSC.mm39                        Mus musculus  Conservation
phyloP60way.UCSC.mm10                        Mus musculus  Conservation
                                  Installed Cached BiocManagerInstall
AlphaMissense.v2023.hg19              FALSE   TRUE              FALSE
AlphaMissense.v2023.hg38              FALSE   TRUE              FALSE
MafDb.1Kgenomes.phase1.GRCh38         FALSE  FALSE               TRUE
MafDb.1Kgenomes.phase1.hs37d5          TRUE  FALSE               TRUE
MafDb.1Kgenomes.phase3.GRCh38          TRUE  FALSE               TRUE
MafDb.1Kgenomes.phase3.hs37d5          TRUE  FALSE               TRUE
MafDb.ExAC.r1.0.GRCh38                FALSE  FALSE               TRUE
MafDb.ExAC.r1.0.hs37d5                 TRUE  FALSE               TRUE
MafDb.ExAC.r1.0.nonTCGA.GRCh38        FALSE  FALSE               TRUE
MafDb.ExAC.r1.0.nonTCGA.hs37d5        FALSE  FALSE               TRUE
MafDb.TOPMed.freeze5.hg19             FALSE  FALSE               TRUE
MafDb.TOPMed.freeze5.hg38             FALSE  FALSE               TRUE
MafDb.gnomAD.r2.1.GRCh38              FALSE  FALSE               TRUE
MafDb.gnomAD.r2.1.hs37d5              FALSE  FALSE               TRUE
MafDb.gnomADex.r2.1.GRCh38            FALSE  FALSE               TRUE
MafDb.gnomADex.r2.1.hs37d5             TRUE  FALSE               TRUE
MafH5.gnomAD.v3.1.2.GRCh38             TRUE  FALSE               TRUE
cadd.v1.3.hg19                        FALSE  FALSE              FALSE
cadd.v1.6.hg19                        FALSE   TRUE              FALSE
cadd.v1.6.hg38                        FALSE   TRUE              FALSE
fitCons.UCSC.hg19                     FALSE  FALSE               TRUE
linsight.UCSC.hg19                    FALSE  FALSE              FALSE
mcap.v1.0.hg19                        FALSE  FALSE              FALSE
phastCons100way.UCSC.hg19              TRUE  FALSE               TRUE
phastCons100way.UCSC.hg38              TRUE  FALSE               TRUE
phastCons27way.UCSC.dm6               FALSE  FALSE              FALSE
phastCons30way.UCSC.hg38              FALSE  FALSE              FALSE
phastCons35way.UCSC.mm39              FALSE  FALSE              FALSE
phastCons46wayPlacental.UCSC.hg19     FALSE  FALSE              FALSE
phastCons46wayPrimates.UCSC.hg19      FALSE  FALSE              FALSE
phastCons60way.UCSC.mm10              FALSE  FALSE              FALSE
phastCons7way.UCSC.hg38               FALSE  FALSE               TRUE
phyloP100way.UCSC.hg19                FALSE  FALSE              FALSE
phyloP100way.UCSC.hg38                FALSE  FALSE              FALSE
phyloP35way.UCSC.mm39                 FALSE  FALSE              FALSE
phyloP60way.UCSC.mm10                 FALSE  FALSE              FALSE
                                  AnnotationHub
AlphaMissense.v2023.hg19                   TRUE
AlphaMissense.v2023.hg38                   TRUE
MafDb.1Kgenomes.phase1.GRCh38             FALSE
MafDb.1Kgenomes.phase1.hs37d5             FALSE
MafDb.1Kgenomes.phase3.GRCh38             FALSE
MafDb.1Kgenomes.phase3.hs37d5             FALSE
MafDb.ExAC.r1.0.GRCh38                    FALSE
MafDb.ExAC.r1.0.hs37d5                    FALSE
MafDb.ExAC.r1.0.nonTCGA.GRCh38            FALSE
MafDb.ExAC.r1.0.nonTCGA.hs37d5            FALSE
MafDb.TOPMed.freeze5.hg19                 FALSE
MafDb.TOPMed.freeze5.hg38                 FALSE
MafDb.gnomAD.r2.1.GRCh38                  FALSE
MafDb.gnomAD.r2.1.hs37d5                  FALSE
MafDb.gnomADex.r2.1.GRCh38                FALSE
MafDb.gnomADex.r2.1.hs37d5                FALSE
MafH5.gnomAD.v3.1.2.GRCh38                FALSE
cadd.v1.3.hg19                            FALSE
cadd.v1.6.hg19                             TRUE
cadd.v1.6.hg38                             TRUE
fitCons.UCSC.hg19                         FALSE
linsight.UCSC.hg19                        FALSE
mcap.v1.0.hg19                            FALSE
phastCons100way.UCSC.hg19                 FALSE
phastCons100way.UCSC.hg38                 FALSE
phastCons27way.UCSC.dm6                   FALSE
phastCons30way.UCSC.hg38                  FALSE
phastCons35way.UCSC.mm39                  FALSE
phastCons46wayPlacental.UCSC.hg19         FALSE
phastCons46wayPrimates.UCSC.hg19          FALSE
phastCons60way.UCSC.mm10                  FALSE
phastCons7way.UCSC.hg38                   FALSE
phyloP100way.UCSC.hg19                    FALSE
phyloP100way.UCSC.hg38                    FALSE
phyloP35way.UCSC.mm39                     FALSE
phyloP60way.UCSC.mm10                     FALSE

For example, if we want to use the phastCons conservation scores available through the annotation package phastCons100way.UCSC.hg38, we should first install it (we only need to do this once).

BiocManager::install("phastCons100way.UCSC.hg38")

Second, we should load the package, and a GScores object will be created and named after the package name, during the loading operation. It is often handy to shorten that name.

library(phastCons100way.UCSC.hg38)
phast <- phastCons100way.UCSC.hg38
class(phast)
[1] "GScores"
attr(,"package")
[1] "GenomicScores"

Typing the name of the GScores object shows a summary of its contents and some of its metadata.

phast
GScores object 
# organism: Homo sapiens (UCSC, hg38)
# provider: UCSC
# provider version: 11May2015
# download date: Apr 10, 2018
# loaded sequences: chr5_GL000208v1_random
# number of sites: 2943 millions
# maximum abs. error: 0.05
# use 'citation()' to cite these data in publications

The bibliographic reference to cite the genomic score data stored in a GScores object can be accessed using the citation() method either on the package name (in case of annotation packages), or on the GScores object.

citation(phast)
Adam Siepel, Gill Berejano, Jakob S. Pedersen, Angie S. Hinrichs,
Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W.
Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson,
Richard A. Gibbs, W. James Kent, Webb Miller, David Haussler (2005).
"Evolutionarily conserved elements in vertebrate, insect, worm, and
yeast genomes." _Genome Research_, *15*, 1034-1050.
doi:10.1101/gr.3715005 <https://doi.org/10.1101/gr.3715005>.

Other methods tracing provenance and other metadata are provider(), providerVersion(), organism() and seqlevelsStyle(); please consult the help page of the GScores class for a comprehensive list of available methods.

provider(phast)
[1] "UCSC"
providerVersion(phast)
[1] "11May2015"
organism(phast)
[1] "Homo sapiens"
seqlevelsStyle(phast)
[1] "UCSC"

To retrieve genomic scores for specific consecutive positions we should use the method gscores(), as follows.

gscores(phast, GRanges(seqnames="chr22",
                       IRanges(start=50528591:50528596, width=1)))
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |   default
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50528591      * |       1.0
  [2]    chr22  50528592      * |       1.0
  [3]    chr22  50528593      * |       0.8
  [4]    chr22  50528594      * |       1.0
  [5]    chr22  50528595      * |       1.0
  [6]    chr22  50528596      * |       0.0
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome

For a single position we may use this other GRanges() constructor.

gscores(phast, GRanges("chr22:50528593"))
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |   default
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50528593      * |       0.8
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome

We may also retrieve the score values only with the method score().

score(phast, GRanges(seqnames="chr22",
                     IRanges(start=50528591:50528596, width=1)))
[1] 1.0 1.0 0.8 1.0 1.0 0.0
score(phast, GRanges("chr22:50528593"))
[1] 0.8

Let’s illustrate how to retrieve phastCons scores using data from the GWAS catalog available through the Bioconductor package gwascat. For the purpose of this vignette, we will filter the GWAS catalog data by (1) discarding entries with NA values in either chromosome name or position, or with multiple positions; (2) storing the data into a GRanges object, including the GWAS catalog columns STRONGEST SNP-RISK ALLELE and MAPPED_TRAIT, and the reference and alternate alleles, as metadata columns; (4) restricting variants to those located in chromosomes 20 to 22; and (3) excluding variants with multinucleotide alleles, or where reference and alternate alleles are identical.

library(BSgenome.Hsapiens.UCSC.hg38)
library(gwascat)

gwc <- get_cached_gwascat()
mask <- !is.na(gwc$CHR_ID) & !is.na(gwc$CHR_POS) &
        !is.na(as.integer(gwc$CHR_POS))
gwc <- gwc[mask, ]
grstr <- sprintf("%s:%s-%s", gwc$CHR_ID, gwc$CHR_POS, gwc$CHR_POS)
gwcgr <- GRanges(grstr, RISK_ALLELE=gwc[["STRONGEST SNP-RISK ALLELE"]],
                 MAPPED_TRAIT=gwc$MAPPED_TRAIT)
seqlevelsStyle(gwcgr) <- "UCSC"
mask <- seqnames(gwcgr) %in% c("chr20", "chr21", "chr22")
gwcgr <- gwcgr[mask]
ref <- as.character(getSeq(Hsapiens, gwcgr))
alt <- gsub("rs[0-9]+-", "", gwcgr$RISK_ALLELE)
mask <- (ref %in% c("A", "C", "G", "T")) & (alt %in% c("A", "C", "G", "T")) &
         nchar(alt) == 1 & ref != alt
gwcgr <- gwcgr[mask]
mcols(gwcgr)$REF <- ref[mask]
mcols(gwcgr)$ALT <- alt[mask]

gwcgr
GRanges object with 13540 ranges and 4 metadata columns:
          seqnames    ranges strand |   RISK_ALLELE           MAPPED_TRAIT
             <Rle> <IRanges>  <Rle> |   <character>            <character>
      [1]    chr20  35321981      * |   rs6088792-T            body height
      [2]    chr22  23250864      * |   rs5751614-A            body height
      [3]    chr20   6640246      * |    rs967417-C            body height
      [4]    chr20  33130847      * |   rs6059101-A     ulcerative colitis
      [5]    chr22  38148291      * |   rs2284063-G                  nevus
      ...      ...       ...    ... .           ...                    ...
  [13536]    chr20  12979237      * |   rs1321940-G sphingomyelin measur..
  [13537]    chr20  10148004      * |   rs2210584-C sphingomyelin measur..
  [13538]    chr20  11892294      * | rs397865364-C sphingomyelin measur..
  [13539]    chr20  12823511      * |   rs1413019-A sphingomyelin measur..
  [13540]    chr21  46284982      * |   rs9975588-A S100 calcium-binding..
                  REF         ALT
          <character> <character>
      [1]           C           T
      [2]           G           A
      [3]           G           C
      [4]           C           A
      [5]           A           G
      ...         ...         ...
  [13536]           A           G
  [13537]           T           C
  [13538]           T           C
  [13539]           C           A
  [13540]           G           A
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

Finally, let’s obtain the phastCons scores for this GWAS catalog variant set, and examine their summary and cumulative distribution.

pcsco <- score(phast, gwcgr)
summary(pcsco)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.1217  0.0000  1.0000      38 
round(cumsum(table(na.omit(pcsco))) / sum(!is.na(pcsco)), digits=2)
   0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9    1 
0.81 0.85 0.87 0.88 0.88 0.89 0.89 0.90 0.90 0.91 1.00

We can observe that only 10% of the variants in chromosomes 20 to 22 have a conservation phastCons score above 0.5. Let’s examine which traits have more fully conserved variants.

xtab <- table(gwcgr$MAPPED_TRAIT[pcsco == 1])
head(xtab[order(xtab, decreasing=TRUE)])

                                     body height 
                                              63 
                        neuroimaging measurement 
                                              34 
                         mean corpuscular volume 
                                              33 
high density lipoprotein cholesterol measurement 
                                              30 
                          hemoglobin measurement 
                                              26 
                    BMI-adjusted waist-hip ratio 
                                              23

5 Genomic scores as AnnotationHub resources

The AnnotationHub (AH), is a Bioconductor web resource that provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard (e.g., UCSC, Ensembl) and distributed sites, can be found. An AH web resource creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.

We can quickly check for the available AH resources by subsetting as follows the resources names from the previous table obtained with availableGScores().

rownames(avgs)[avgs$AnnotationHub]
[1] "AlphaMissense.v2023.hg19" "AlphaMissense.v2023.hg38"
[3] "cadd.v1.6.hg19"           "cadd.v1.6.hg38"

The selected resource can be downloaded with the function getGScores(). After the resource is downloaded the first time, the cached copy will enable a quicker retrieval later. Let’s download other conservation scores, the phyloP scores (Pollard et al. 2010), for human genome version hg38.

phylop <- getGScores("phyloP100way.UCSC.hg38")

phylop
GScores object 
# organism: Homo sapiens (UCSC, hg38)
# provider: UCSC
# provider version: 11May2015
# download date: May 12, 2017
# loaded sequences: chr20
# maximum abs. error: 0.55
# use 'citation()' to cite these data in publications

Let’s retrieve the phyloP conservation scores for the previous set of GWAS catalog variants and compare them in Figure @(fig:phastvsphylop).

ppsco <- score(phylop, gwcgr)
plot(pcsco, ppsco, xlab="phastCons", ylab="phyloP",
     cex.axis=1.2, cex.lab=1.5, las=1)

Comparison between phastCons and phyloP conservation scores. On the y-axis, phyloP scores as function of phastCons scores on the x-axis, for a set of GWAS catalog variant in the human chromosome 22.

Figure 1: Comparison between phastCons and phyloP conservation scores
On the y-axis, phyloP scores as function of phastCons scores on the x-axis, for a set of GWAS catalog variant in the human chromosome 22.

We may observe that the values match in a rather discrete manner due to the quantization of the scores. In the case of the phastCons annotation package phastCons100way.UCSC.hg38, the GScore object gives access in fact to two score populations, the default one in which conservation scores are rounded to 1 decimal place, and an alternative one, named DP2, in which they are rounded to 2 decimal places. To figure out what are the available score populations in a GScores object, we should use the method populations().

populations(phast)
[1] "default" "DP2"

Whenever one of these populations is called default, this is the one used by default. In other cases we can find out which is the default population as follows:

defaultPopulation(phast)
[1] "default"

To use one of the available score populations we should use the argument pop in the corresponding method, as follows.

pcsco2 <- score(phast, gwcgr, pop="DP2")
head(pcsco2)
[1] 0.00 0.00 0.00 0.14 0.00 0.00

Figure 2 below shows again the comparison of phastCons and phyloP conservation scores, this time at the higher resolution provided by the phastCons scores rounded at two decimal places.

plot(pcsco2, ppsco, xlab="phastCons", ylab="phyloP",
     cex.axis=1.2, cex.lab=1.5, las=1)

Comparison between phastCons and phyloP conservation scores at a higher resolution. On the y-axis, phyloP scores as function of phastCons scores on the x-axis, for a set of GWAS catalog variant in the human chromosome 22.

Figure 2: Comparison between phastCons and phyloP conservation scores at a higher resolution
On the y-axis, phyloP scores as function of phastCons scores on the x-axis, for a set of GWAS catalog variant in the human chromosome 22.

5.1 Building an annotation package from a GScores object

Retrieving genomic scores through AnnotationHub resources requires an internet connection and we may want to work with such resources offline, for instance in high-performance computing (HPC) environments. For that purpose, we can create ourselves an annotation package, such as phastCons100way.UCSC.hg38, from a GScores object corresponding to a downloaded AnnotationHub resource. To do that we use the function makeGScoresPackage() as follows:

makeGScoresPackage(phast, maintainer="Me <[email protected]>",
                   author="Me", version="1.0.0")

Creating package in ./phastCons100way.UCSC.hg38

An argument, destDir, which by default points to the current working directory, can be used to change where in the filesystem the package is created. Afterwards, we should still build and install the package via, e.g., R CMD build and R CMD INSTALL, to be able to use it offline.

6 Retrieval of minor allele frequency data

One particular type of genomic scores that are accessible through the GScores class is minor allele frequency (MAF) data. There are currently 15 annotation packages that store MAF values using the GenomicScores package, named using the prefix MafDb or MafH5; see Table 1 below.

Table 1: Bioconductor annotation packages storing MAF data.
Annotation Package	Description
MafDb.1Kgenomes.phase1.hs37d5	MAF data from the 1000 Genomes Project Phase 1 for the human genome version GRCh37.
MafDb.1Kgenomes.phase1.GRCh38	MAF data from the 1000 Genomes Project Phase 1 for the human genome version GRCh38.
MafDb.1Kgenomes.phase3.hs37d5	MAF data from the 1000 Genomes Project Phase 3 for the human genome version GRCh37.
MafDb.1Kgenomes.phase3.GRCh38	MAF data from the 1000 Genomes Project Phase 3 for the human genome version GRCh38.
MafDb.ExAC.r1.0.hs37d5	MAF data from ExAC 60706 exomes for the human genome version GRCh37.
MafDb.ExAC.r1.0.GRCh38	MAF data from ExAC 60706 exomes for the human genome version GRCh38.
MafDb.ExAC.r1.0.nonTCGA.hs37d5	MAF data from ExAC 53105 nonTCGA exomes for the human genome version GRCh37.
MafDb.ExAC.r1.0.nonTCGA.GRCh38	MAF data from ExAC 53105 nonTCGA exomes for the human genome version GRCh38.
MafDb.gnomAD.r2.1.hs37d5	MAF data from gnomAD 15496 genomes for the human genome version GRCh37.
MafDb.gnomAD.r2.1.GRCh38	MAF data from gnomAD 15496 genomes for the human genome version GRCh38.
MafDb.gnomADex.r2.1.hs37d5	MAF data from gnomADex 123136 exomes for the human genome version GRCh37.
MafDb.gnomADex.r2.1.GRCh38	MAF data from gnomADex 123136 exomes for the human genome version GRCh38.
MafH5.gnomAD.v3.1.2.GRCh38	MAF data from gnomAD 76156 genomes for the human genome version GRCh38.
MafDb.TOPMed.freeze5.hg19	MAF data from NHLBI TOPMed 62784 genomes for the human genome version GRCh37.
MafDb.TOPMed.freeze5.hg38	MAF data from NHLBI TOPMed 62784 genomes for the human genome version GRCh38.

In this type of package, the scores populations correspond to populations of individuals from which the MAF data were derived, and all MAF data were compressed using a precision of one significant figure for MAF < 0.1 and two significant figures for MAF >= 0.1. Let’s load the MAF package for the release v3.1.2 of gnomAD (Karczewski et al. 2020).

library(MafH5.gnomAD.v3.1.2.GRCh38)

mafh5 <- MafH5.gnomAD.v3.1.2.GRCh38
mafh5
GScores object 
# organism: Homo sapiens (UCSC, hg38)
# provider: BroadInstitute
# provider version: v3.1.2
# download date: Apr 07, 2022
# default scores population: AF
# number of sites: 644 millions
# maximum abs. error (def. pop.): 0.00251
# use 'citation()' to cite these data in publications

populations(mafh5)
[1] "AF"           "AF_allpopmax"

Let’s retrieve the gnomAD MAF values for the previous GWAS catalog variant set and examine its distribution, and how many variants occur in less than 1% of all gnomAD populations and what fraction do they represent among the analyzed variants.

mafs <- score(mafh5, gwcgr, pop="AF_allpopmax")
summary(mafs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.190   0.390   0.325   0.470   0.500     411 
sum(mafs < 0.01, na.rm=TRUE)
[1] 352
sum(mafs < 0.01, na.rm=TRUE) / sum(!is.na(mafs))
[1] 0.02681088

Finally, let’s examine which traits have more such rare variants.

xtab <- table(gwcgr$MAPPED_TRAIT[mafs < 0.01])
head(xtab[order(xtab, decreasing=TRUE)])

           response to bronchodilator, FEV/FEC ratio 
                                                  18 
                             mean corpuscular volume 
                                                  15 
               platelet component distribution width 
                                                  13 
                                      monocyte count 
                                                  10 
                                       platelet crit 
                                                  10 
forced expiratory volume, response to bronchodilator 
                                                   8

7 Retrieval of multiple scores per genomic position

Among the score sets available as AnnotationHub web resources shown in the previous section, some of them, such as CADD (Kircher et al. 2014), M-CAP (Jagadeesh et al. 2016) or AlphaMissense (Cheng et al. 2023), provide multiple scores per genomic position that capture the tolerance to mutations of single nucleotides. Such type of scores, often used to establish the potential pathogenicity of variants, are sometimes released under some sort of license for a non-commercial use. In such cases, the function getGScores() will ask us interactively to accept the license. We can also set the argument accept.license=TRUE to accept it non-interactively. We will illustrate such a case using the AlphaMissense scores (Cheng et al. 2023).

am23 <- getGScores("AlphaMissense.v2023.hg38")
These data is shared under the license CC BY-NC-SA 4.0
(see https://creativecommons.org/licenses/by-nc-sa/4.0),
do you accept it? [y/n]: y

Let’s retrieve the AlphaMissense scores for the reference and alternate alleles in our GWAS catalog variant set.

am23
GScores object 
# organism: Homo sapiens (UCSC, hg38)
# provider: Google DeepMind
# provider version: v2023
# download date: Oct 10, 2023
# loaded sequences: chr20
# maximum abs. error: 0.005
# license: CC BY-NC-SA 4.0, see https://creativecommons.org/licenses/by-nc-sa/4.0
# use 'citation()' to cite these data in publications
amsco <- score(am23, gwcgr, ref=gwcgr$REF, alt=gwcgr$ALT)
summary(amsco)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.010   0.090   0.150   0.247   0.380   1.000   12609

Using the cutoffs for AlphaMissense scores reported in (Cheng et al. 2023) to classify variants into “likely benign”, “ambiguous” and “likely pathogenic”, and 0.01 and 0.1 as MAF cutoffs, let’s cross-tabulate the proportions of these two factors.

mask <- !is.na(amsco) & !is.na(mafs)
amscofac <- cut(amsco[mask], breaks=c(0, 0.34, 0.56, 1))
amscofac <- relevel(amscofac, ref="(0.56,1]")
maffac <- cut(mafs[mask], breaks=c(0, 0.01, 0.1, 1))
xtab <- table(maffac, amscofac)
t(xtab)
             maffac
amscofac      (0,0.01] (0.01,0.1] (0.1,1]
  (0.56,1]          39         11       7
  (0,0.34]          50        141     494
  (0.34,0.56]        3        134      39
xtab <- t(xtab / rowSums(xtab))
round(xtab, digits=2)
             maffac
amscofac      (0,0.01] (0.01,0.1] (0.1,1]
  (0.56,1]        0.42       0.04    0.01
  (0,0.34]        0.54       0.49    0.91
  (0.34,0.56]     0.03       0.47    0.07

Figure 3 below displays graphically these proportions in an analogous way to the one shown in Figure 5B from Cheng et al. (2023). While these proportions are quite different to the original figure, due to the much lower number of variants analyzed here, we still can see, like in (Cheng et al. 2023), that the proportion of variants classified as likely pathogenic by AlphaMissense scores is much larger for rare variants with MAF < 0.01 than for common variants with MAF > 0.01.

AlphaMissense predictions. Proportions of three ranges of AlphaMissense pathogenicity scores for three ranges of minor allele frequencies (MAF) derived from gnomAD, on data from the GWAS catalog. This figure is analogous to Figure 5B from @cheng2023accurate, but with much fewer variants.

Figure 3: AlphaMissense predictions
Proportions of three ranges of AlphaMissense pathogenicity scores for three ranges of minor allele frequencies (MAF) derived from gnomAD, on data from the GWAS catalog. This figure is analogous to Figure 5B from Cheng et al. (2023), but with much fewer variants.

8 Summarization of genomic scores

The input genomic ranges to the gscores() method may have widths larger than one nucleotide. In those cases, and when there is only one score per position, the gscores() method calculates, by default, the arithmetic mean of the scores across each range.

gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1))
gr1sco <- gscores(phast, gr1)
gr1sco
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |   default
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50528591      * |       1.0
  [2]    chr22  50528592      * |       1.0
  [3]    chr22  50528593      * |       0.8
  [4]    chr22  50528594      * |       1.0
  [5]    chr22  50528595      * |       1.0
  [6]    chr22  50528596      * |       0.0
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome
mean(gr1sco$default)
[1] 0.8
gr2 <- GRanges("chr22:50528591-50528596")
gscores(phast, gr2)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50528591-50528596      * |       0.8
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome

However, we may change the way in which scores from multiple-nucleotide ranges are summarized with the argument summaryFun, as follows.

gscores(phast, gr2, summaryFun=max)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50528591-50528596      * |         1
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome
gscores(phast, gr2, summaryFun=min)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50528591-50528596      * |         0
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome
gscores(phast, gr2, summaryFun=median)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50528591-50528596      * |         1
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome

9 Annotating variants with genomic scores

A typical use case of the GenomicScores package is in the context of annotating variants with genomic scores, such as phastCons conservation scores. For this purpose, we load the VariantAnnotaiton and TxDb.Hsapiens.UCSC.hg38.knownGene packages. The former will allow us annotate variants, and the latter contains the gene annotations from UCSC that will be used in this process.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

We annotate the location of previous set of filtered GWAS variants, using the function locateVariants() from the VariantAnnotation package.

loc <- locateVariants(gwcgr, txdb, AllVariants())
loc[1:3]
GRanges object with 3 ranges and 9 metadata columns:
      seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID
         <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer>
  [1]    chr20  35321981      + |   intron     25183     25183         1
  [2]    chr22  23250864      + |   intron     76509     76509         2
  [3]    chr22  23250864      + |   intron     70864     70864         2
             TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
      <character> <IntegerList> <character> <CharacterList> <CharacterList>
  [1]      210043                      7050                                
  [2]      214832                     90007                                
  [3]      214833                     90007                                
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
table(loc$LOCATION)

spliceSite     intron    fiveUTR   threeUTR     coding intergenic   promoter 
        46      58628        440       1500       5407       3683       5975

Now we annotate phastCons conservation scores on the variants and store those annotations as an additional metadata column of the GRanges object. For this specific purpose we should use the method score() that returns the genomic scores as a numeric vector, instead of doing it as a metadata column in the input ranges object, done by the gscores() function.

loc$PHASTCONS <- score(phast, loc, pop="DP2")
loc[1:3]
GRanges object with 3 ranges and 10 metadata columns:
      seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID
         <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer>
  [1]    chr20  35321981      + |   intron     25183     25183         1
  [2]    chr22  23250864      + |   intron     76509     76509         2
  [3]    chr22  23250864      + |   intron     70864     70864         2
             TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
      <character> <IntegerList> <character> <CharacterList> <CharacterList>
  [1]      210043                      7050                                
  [2]      214832                     90007                                
  [3]      214833                     90007                                
      PHASTCONS
      <numeric>
  [1]         0
  [2]         0
  [3]         0
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

Using the following code we can examine the distribution of phastCons conservation scores of variants across the different annotated regions, shown in Figure 4.

x <- split(loc$PHASTCONS, loc$LOCATION)
mask <- elementNROWS(x) > 0
boxplot(x[mask], ylab="phastCons score", las=1, cex.axis=1.2, cex.lab=1.5, col="gray")
points(1:length(x[mask])+0.25, sapply(x[mask], mean, na.rm=TRUE), pch=23, bg="black")

Figure 4: Distribution of phastCons conservation scores in variants across different annotated regions
Diamonds indicate mean values.

Next, we can annotate AlphaMissense and CADD scores as follows. Note that we use the QUERYID column of the annotations to fetch back reference and alternative alleles from the original data container.

loc$AM <- score(am23, loc,
                ref=gwcgr$REF[loc$QUERYID],
                alt=gwcgr$ALT[loc$QUERYID])

cadd
GScores object 
# organism: Homo sapiens (UCSC, hg38)
# provider: UWashington
# provider version: v1.6
# download date: Oct 11, 2023
# loaded sequences: chr20
# maximum abs. error: 5
# use 'citation()' to cite these data in publications
loc$CADD <- score(cadd, loc, ref=gwcgr$REF[loc$QUERYID], alt=gwcgr$ALT[loc$QUERYID])

Using the code below we can produce the plot of Figure 5 comparing AlphaMissense and CADD scores and labeling the location of the variants from which they are derived.

library(RColorBrewer)
par(mar=c(4, 5, 1, 1))
hmcol <- colorRampPalette(brewer.pal(nlevels(loc$LOCATION), "Set1"))(nlevels(loc$LOCATION))
plot(loc$AM, jitter(loc$CADD, factor=2), pch=19,
     col=hmcol, xlab="AlphaMissense scores", ylab="CADD scores",
     las=1, cex.axis=1.2, cex.lab=1.5, panel.first=grid())
legend("bottomright", levels(loc$LOCATION), pch=19, col=hmcol, inset=0.01)

Figure 5: Comparison of AlphaMissense and CADD scores
Values on the y-axis are jittered to facilitate visualization.

10 Session information

sessionInfo()
R version 4.3.2 Patched (2023-11-13 r85521)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] RColorBrewer_1.1-3                      
 [2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.18.0
 [3] GenomicFeatures_1.54.1                  
 [4] AnnotationDbi_1.64.1                    
 [5] VariantAnnotation_1.48.1                
 [6] Rsamtools_2.18.0                        
 [7] SummarizedExperiment_1.32.0             
 [8] Biobase_2.62.0                          
 [9] MatrixGenerics_1.14.0                   
[10] matrixStats_1.2.0                       
[11] MafH5.gnomAD.v3.1.2.GRCh38_3.15.0       
[12] gwascat_2.34.0                          
[13] BSgenome.Hsapiens.UCSC.hg38_1.4.5       
[14] BSgenome_1.70.1                         
[15] rtracklayer_1.62.0                      
[16] BiocIO_1.12.0                           
[17] Biostrings_2.70.1                       
[18] XVector_0.42.0                          
[19] phastCons100way.UCSC.hg38_3.7.1         
[20] GenomicScores_2.14.3                    
[21] GenomicRanges_1.54.1                    
[22] GenomeInfoDb_1.38.3                     
[23] IRanges_2.36.0                          
[24] S4Vectors_0.40.2                        
[25] BiocGenerics_0.48.1                     
[26] knitr_1.45                              
[27] BiocStyle_2.30.0                        

loaded via a namespace (and not attached):
 [1] DBI_1.2.0                     bitops_1.0-7                 
 [3] biomaRt_2.58.0                rlang_1.1.2                  
 [5] magrittr_2.0.3                compiler_4.3.2               
 [7] RSQLite_2.3.4                 png_0.1-8                    
 [9] vctrs_0.6.5                   stringr_1.5.1                
[11] pkgconfig_2.0.3               crayon_1.5.2                 
[13] fastmap_1.1.1                 magick_2.8.2                 
[15] dbplyr_2.4.0                  ellipsis_0.3.2               
[17] utf8_1.2.4                    promises_1.2.1               
[19] rmarkdown_2.25                tzdb_0.4.0                   
[21] purrr_1.0.2                   bit_4.0.5                    
[23] xfun_0.41                     zlibbioc_1.48.0              
[25] cachem_1.0.8                  jsonlite_1.8.8               
[27] progress_1.2.3                blob_1.2.4                   
[29] highr_0.10                    later_1.3.2                  
[31] rhdf5filters_1.14.1           DelayedArray_0.28.0          
[33] Rhdf5lib_1.24.1               BiocParallel_1.36.0          
[35] interactiveDisplayBase_1.40.0 parallel_4.3.2               
[37] prettyunits_1.2.0             R6_2.5.1                     
[39] bslib_0.6.1                   stringi_1.8.3                
[41] jquerylib_0.1.4               Rcpp_1.0.11                  
[43] bookdown_0.37                 readr_2.1.4                  
[45] splines_4.3.2                 httpuv_1.6.13                
[47] Matrix_1.6-4                  tidyselect_1.2.0             
[49] abind_1.4-5                   yaml_2.3.8                   
[51] codetools_0.2-19              curl_5.2.0                   
[53] lattice_0.22-5                tibble_3.2.1                 
[55] shiny_1.8.0                   withr_2.5.2                  
[57] KEGGREST_1.42.0               evaluate_0.23                
[59] survival_3.5-7                BiocFileCache_2.10.1         
[61] xml2_1.3.6                    snpStats_1.52.0              
[63] pillar_1.9.0                  BiocManager_1.30.22          
[65] filelock_1.0.3                generics_0.1.3               
[67] RCurl_1.98-1.13               BiocVersion_3.18.1           
[69] hms_1.1.3                     xtable_1.8-4                 
[71] glue_1.6.2                    tools_4.3.2                  
[73] AnnotationHub_3.10.0          GenomicAlignments_1.38.0     
[75] XML_3.99-0.16                 rhdf5_2.46.1                 
[77] grid_4.3.2                    GenomeInfoDbData_1.2.11      
[79] HDF5Array_1.30.0              restfulr_0.0.15              
[81] cli_3.6.2                     rappdirs_0.3.3               
[83] fansi_1.0.6                   S4Arrays_1.2.0               
[85] dplyr_1.1.4                   sass_0.4.8                   
[87] digest_0.6.33                 SparseArray_1.2.2            
[89] rjson_0.2.21                  memoise_2.0.1                
[91] htmltools_0.5.7               lifecycle_1.0.4              
[93] httr_1.4.7                    mime_0.12                    
[95] bit64_4.0.5

References

Cheng, Jun, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, et al. 2023. “Accurate Proteome-Wide Missense Variant Effect Prediction with Alphamissense.” Science, 1284–5.

Jagadeesh, Karthik A, Aaron M Wenger, Mark J Berger, Harendra Guturu, Peter D Stenson, David N Cooper, Jonathan A Bernstein, and Gill Bejerano. 2016. “M-Cap Eliminates a Majority of Variants of Uncertain Significance in Clinical Exomes at High Sensitivity.” Nat. Genet. 48 (12): 1581–6.

Karczewski, Konrad J, Laurent C Francioli, Grace Tiao, Beryl B Cummings, Jessica Alföldi, Qingbo Wang, Ryan L Collins, et al. 2020. “The Mutational Constraint Spectrum Quantified from Variation in 141,456 Humans.” Nature 581 (7809): 434–43.

Kircher, Martin, Daniela M Witten, Preti Jain, Brian J O’roak, Gregory M Cooper, and Jay Shendure. 2014. “A General Framework for Estimating the Relative Pathogenicity of Human Genetic Variants.” Nat. Genet. 46 (3): 310–15.

Pollard, Katherine S, Melissa J Hubisz, Kate R Rosenbloom, and Adam Siepel. 2010. “Detection of Nonneutral Substitution Rates on Mammalian Phylogenies.” Genome Research 20 (1): 110–21.

Siepel, Adam, Gill Bejerano, Jakob S Pedersen, Angie S Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, et al. 2005. “Evolutionarily Conserved Elements in Vertebrate, Insect, Worm, and Yeast Genomes.” Genome Res. 15 (8): 1034–50.

Zender, Charles S. 2016. “Bit Grooming: Statistically Accurate Precision-Preserving Quantization with Compression, Evaluated in the netCDF Operators (Nco, V4. 4.8+).” Geosci. Model Dev. 9 (9): 3199–3211.

An introduction to the GenomicScores package

21 December 2023

Abstract

Package