```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
IsoformSwitchAnalyzeR
Enabling Identification and Analysis of Isoform Switches with Functional Consequences from RNA-sequencing data
Kristoffer Vitting-Seerup
`r Sys.Date()`
## Abstract
Recent breakthroughs in bioinformatics now allow us to accurately reconstruct and quantify full-length gene isoforms from RNA-sequencing data (via tools such as Cufflinks, Kallisto and Salmon). These tools make it possible to analyze alternative isoform usage, but unfortunately RNA-sequencing data is still underutilized since such analyses are hard to perform and therefore only rarely done.
To solve this problem we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to-use R package that facilitates statistical identification of isoform switching from RNA-seq derived quantification of novel and/or annotated full-length isoforms. IsoformSwitchAnalyzeR furthermore facilitates integration of many sources of annotation, including features such as Open Reading Frames (ORF), protein domains (via Pfam), signal peptides (via SignalP), coding potential (via CPAT) as well as sensitivity to Nonsense Mediated Decay (NMD). The combination of identified isoform switches and their annotation also enables IsoformSwitchAnalyzeR to predict potential functional consequences of the identified isoform switches - such as loss of protein domains or coding potential - thereby identifying isoform switches of particular interest. Lastly, IsoformSwitchAnalyzeR provides article-ready visualization of isoform switches as well as multiple layers of summary statistics describing the genome-wide occurrence of isoform switches and their consequences.
In summary, IsoformSwitchAnalyzeR enables analysis of RNA-seq data with isoform resolution, with a focus on isoform switching (with predicted consequences), thereby expanding the usability of RNA-seq data.
## Table of Content
[Abstract]
[Preliminaries]
- [Background and Package Description]
- [Installation]
- [How To Get Help]
[What To Cite] (please remember)
[Quick Start]
- [Workflow Overview]
- [Short Example Workflow] (aka the "Too long - didn't read" section)
[Detailed Workflow]
- [Overview]
+ [IsoformSwitchAnalyzeR Background Information]
- [Importing Data Into R]
+ [Data from Cufflinks/Cuffdiff]
+ [Data from Kallisto, Salmon or RSEM]
+ [Data From Other Full-length Transcript Assemblers]
- [Filtering]
- [Identifying Isoform Switches]
+ [Testing Isoform Switches with IsoformSwitchAnalyzeR]
+ [Testing Isoform Switches via DRIMSeq]
+ [Testing Isoform Switches with other Tools]
- [Analyzing Open Reading Frames]
- [Extracting Nucleotide and Amino Acid Sequences]
- [Advise for Running External Sequence Analysis Tools]
- [Importing External Sequences Analysis]
- [Predicting Intron Retentions]
- [Predicting Switch Consequences]
- [Post Analysis of Isoform Switches with Consequences]
[Other workflows]
- [Augmenting ORF Predictions with Pfam Results]
- [Analyze Small Upstream ORFs]
- [Remove Sequences Stored in SwitchAnalyzeRlist]
- [Adding Uncertain Category to Coding Potential Predictions]
- [Quality control of ORF of known annotation]
- [Analyzing the Biological Mechanisms Behind Isoform Switching]
[Frequently Asked Questions and Problems]
[Final Remarks]
[Sessioninfo]
## Preliminaries
### Background and Package Description
The combination of alternative Transcription Start Sites (aTSS), Alternative Splicing (AS) and alternative Transcription Termination Sites (aTTS) is often referred to as alternative transcription and is considered a major factor in modifying the pre-RNA and contributing to the complexity of higher organisms. Alternative transcription is widely used, as recently demonstrated by The ENCODE Consortium, which found that an average of 6.3 different transcripts were generated per gene, although the number of transcripts from a single gene has been reported to be anywhere from one to thousands.
The importance of analyzing isoforms instead of genes has been highlighted by many examples showing functionally important changes that cannot be detected at gene level. One of these examples is pyruvate kinase. In normal adult homeostasis, cells use the adult isoform (M1), which supports oxidative phosphorylation. But almost all cancers use the embryonic isoform (M2), which promotes aerobic glycolysis, one of the hallmarks of cancer. Such a shift in isoform usage has been termed isoform switching and frequently will not be detected at the gene level.
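A toy calculation makes this concrete. The numbers below are invented for illustration (they are not measurements of pyruvate kinase), and all variable names are hypothetical. The sketch computes, for each isoform, its share of the total gene expression (here called the isoform fraction) in two conditions and shows that the gene-level total is unchanged while the isoform usage flips.

```r
# Invented expression values (e.g. TPM) for a gene with two isoforms.
# matrix() fills column by column: cond1 = (90, 10), cond2 = (10, 90).
expression <- matrix(
    c(90, 10,
      10, 90),
    nrow = 2,
    dimnames = list(c("isoformA", "isoformB"), c("cond1", "cond2"))
)

# Gene-level expression is the sum over isoforms: identical in both conditions
geneExpression <- colSums(expression)

# Isoform fraction = isoform expression / gene expression (per condition)
isoformFraction <- sweep(expression, 2, geneExpression, "/")

# Change in isoform fraction between conditions reveals the switch
deltaFraction <- isoformFraction[, "cond2"] - isoformFraction[, "cond1"]

print(geneExpression)
print(deltaFraction)
```

A gene-level differential expression test would see no change here, whereas the change in isoform fraction is large and of opposite sign for the two isoforms.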
In 2010 a breakthrough in bioinformatics came with the emergence of tools such as Cufflinks, which allow researchers to reconstruct and quantify full-length transcripts from RNA-seq data. Tools for fast transcript quantification such as Salmon and Kallisto were the next breakthrough, making isoform quantification very fast. Such data has the potential to facilitate both genome-wide analysis of alternative isoform usage and identification of isoform switching - but unfortunately these types of analysis are still only rarely done.
We hypothesize that there are multiple reasons why RNA-seq data is not used to its full potential:
1) There is still a lack of tools that can identify isoform switches with isoform resolution - thereby identifying the exact isoforms involved in a switch.
2) Although there are many very good tools for sequence analysis, there is no common framework that allows for integration of the analyses provided by these tools.
3) There is a lack of tools facilitating easy and article-ready visualization of isoform switches.
To solve these problems we developed IsoformSwitchAnalyzeR.
IsoformSwitchAnalyzeR is an easy-to-use R package that enables the user to import the (novel) full-length derived isoforms from an RNA-seq experiment into R. If annotated transcripts are analyzed, IsoformSwitchAnalyzeR offers integration with the multi-layer information stored in a GTF file, including the annotated coding sequences (CDS). If transcript structures were predicted, de-novo or guided, IsoformSwitchAnalyzeR offers a highly accurate tool for identifying the dominant ORF of the isoforms. The knowledge of isoform positions for the CDS/ORF furthermore allows for prediction of sensitivity to Nonsense Mediated Decay (NMD) - the mRNA quality control machinery that degrades isoforms with premature termination codons (PTC).
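To illustrate how ORF positions can translate into an NMD prediction, here is a minimal sketch in base R. It assumes the widely used "50 nucleotide rule" (a stop codon located more than 50 nt upstream of the last exon-exon junction is treated as premature); the text above does not state which criterion the package applies, so this is an assumption rather than a description of its implementation. The function name `isPrematureStop` and its arguments are hypothetical.

```r
# Hypothetical helper: flag a stop codon as premature (PTC) using the
# assumed 50 nt rule. All positions are in transcript coordinates.
#   exonLengths  : lengths of the exons in 5' -> 3' order
#   stopPosition : transcript position of the last nucleotide of the stop codon
#   ptcDistance  : minimum distance (nt) upstream of the last junction
isPrematureStop <- function(exonLengths, stopPosition, ptcDistance = 50) {
    # Single-exon transcripts have no exon-exon junction, so no PTC call
    if (length(exonLengths) < 2) {
        return(FALSE)
    }
    # The last junction sits at the end of the second-to-last exon
    lastJunction <- sum(exonLengths[-length(exonLengths)])
    (lastJunction - stopPosition) > ptcDistance
}

# Three exons of 200, 150 and 300 nt: the last junction is at position 350
isPrematureStop(c(200, 150, 300), stopPosition = 250)  # 100 nt upstream
isPrematureStop(c(200, 150, 300), stopPosition = 500)  # stop in the last exon
```

Under these assumptions the first call flags a PTC and the second does not, since a stop codon in the last exon lies downstream of every junction.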
Next, IsoformSwitchAnalyzeR enables identification of isoform switches via newly developed statistical methods that test each individual isoform for differential usage and thereby identify the exact isoforms involved in an isoform switch.
Since we know the exon structure of the full-length isoform, IsoformSwitchAnalyzeR can extract the underlying nucleotide sequence from a reference genome. This enables integration with the Coding Potential Assessment Tool (CPAT) which predicts the coding potential of an isoform and can furthermore be used to increase accuracy of ORF predictions. By combining the CDS/ORF isoform positions with the nucleotide sequence, we can also extract the most likely amino acid (AA) sequence of the CDS/ORF. The AA sequence enables integration of analysis of protein domains (via Pfam) and signal peptides (via SignalP) - both of which are supported by IsoformSwitchAnalyzeR. Lastly, since the structures of all expressed isoforms from a given gene are known, one can also annotate intron retentions (via spliceR).
Combined, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retentions, ORF, NMD sensitivity, coding potential, protein domains as well as signal peptides, resulting in the identification of important functional consequences of the isoform switches.
IsoformSwitchAnalyzeR contains tools that allow the user to create article-ready visualizations of both individual isoform switches and the general consequences of isoform switches. These visualizations are easy to understand and integrate all the information gathered throughout the workflow. An example of visualization can be found here [Examples of visualization].
Lastly, IsoformSwitchAnalyzeR is based on standard Bioconductor classes such as GRanges and BSgenome, whereby it supports all species and annotation versions available in the Bioconductor annotation packages.
Back to [Table of Content].
### Installation
IsoformSwitchAnalyzeR is part of the Bioconductor repository and community, which means it is distributed with, and dependent on, Bioconductor. Installation of IsoformSwitchAnalyzeR is easy and can be done from within the R terminal. If this is the first time you use Bioconductor, simply copy-paste the following into your R session to install the basic Bioconductor packages:
```r
source("http://bioconductor.org/biocLite.R")
biocLite()
```
If you have already installed Bioconductor, running these two commands will check whether updates for installed packages are available.
After you have installed the basic Bioconductor packages you can install IsoformSwitchAnalyzeR by copy-pasting the following into your R session:
```r
source("http://bioconductor.org/biocLite.R")
biocLite("IsoformSwitchAnalyzeR")
```
This will install the IsoformSwitchAnalyzeR package as well as other R packages that are needed for IsoformSwitchAnalyzeR to work.
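The commands above rely on the `biocLite()` installer. On more recent R and Bioconductor releases (Bioconductor 3.8 and later) that installer has been replaced by the `BiocManager` package from CRAN, so if the commands above fail on your system the following sketch is likely the equivalent route. Whether it applies depends on your R version, which is an assumption here.

```r
# Install the BiocManager package from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install IsoformSwitchAnalyzeR and its dependencies from Bioconductor
BiocManager::install("IsoformSwitchAnalyzeR")
```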
### How To Get Help
This R package comes with a lot of documentation. Much information can be found in the R help files (which can easily be accessed by running the following command in R "?functionName", for example "?isoformSwitchTest"). Furthermore, this vignette contains a lot of information. Make sure to read both sources carefully, as they contain the answers to the most Frequently Asked Questions and Problems.
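As a short sketch, the documentation can be opened from an R session as follows. It assumes the package is already installed; `help()` and `browseVignettes()` are standard R functions, and `isoformSwitchTest` is the function name used as an example above.

```r
# Open the help page of a single function without attaching the package
help("isoformSwitchTest", package = "IsoformSwitchAnalyzeR")

# List all documented functions and data sets in the package
help(package = "IsoformSwitchAnalyzeR")

# Open the vignette(s) shipped with the package in a browser
browseVignettes(package = "IsoformSwitchAnalyzeR")
```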
If you have unanswered questions or comments regarding IsoformSwitchAnalyzeR, please post them on the associated Google group: https://groups.google.com/forum/#!forum/isoformswitchanalyzer (after making sure the question has not already been answered there).
If you want to report a bug (found in the newest version of the R package), please open an issue with a reproducible example on GitHub https://github.com/kvittingseerup/IsoformSwitchAnalyzeR - remember to add the appropriate label.
If you have suggestions for improvements, please also put them on GitHub (https://github.com/kvittingseerup/IsoformSwitchAnalyzeR). This allows other people to upvote your idea via reactions, thereby showing us that there is wide support for implementing it.
Back to [Table of Content].
## What To Cite
The IsoformSwitchAnalyzeR tool is only made possible by a string of other tools and scientific discoveries - please read this section thoroughly and cite the appropriate articles. Note that due to the references being divided into sections some references appear more than once.
If you are using the
- **Import of data from Salmon/Kallisto/RSEM** : Please cite reference _10_
- **Inter-library normalization of abundance values** : Please cite references _10_ and _11_
- **Prediction of consequences** : Please cite reference _1_
- **isoform switch test implemented in IsoformSwitchAnalyzeR** : Please cite both references _1_ and _2_
- **isoform switch test implemented in the DRIMSeq package (default)** : Please cite both references _1_ and _3_
- **visualizations** (plots) implemented in the IsoformSwitchAnalyzeR package : Please cite reference _1_
- **Intron Retention analysis** : Please cite both references _1_ and _4_
- **prediction of open reading frames (ORF) analysis** : Please cite references _1_ and _4_
- **prediction of premature termination codons (PTC) and thereby NMD-sensitivity** : Please cite references _1_, _4_, _5_ and _6_
- **CPAT** : Please cite reference _7_
- **Pfam** : Please cite reference _8_
- **SignalP** : Please cite reference _9_
References:
1. _Vitting-Seerup et al. **The Landscape of Isoform Switches in Human Cancers.** Cancer Res. (2017)_
2. _Ferguson et al. **P-value calibration for multiple testing problems in genomics.** Stat. Appl. Genet. Mol. Biol. 2014, 13:659-673._
3. _Nowicka et al. **DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics.** F1000Research, 5(0), 1356._
4. _Vitting-Seerup et al. **spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data**. BMC Bioinformatics 2014, 15:81._
5. _Weischenfeldt et al. **Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns**. Genome Biol 2012, 13:R35_
6. _Huber et al. **Orchestrating high-throughput genomic analysis with Bioconductor**. Nat. Methods, 2015, 12:115-121._
7. _Wang et al. **CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model**. Nucleic Acids Res. 2013, 41:e74._
8. _Finn et al. **The Pfam protein families database**. Nucleic Acids Research (2014) Database Issue 42:D222-D230_
9. _Petersen et al. **SignalP 4.0: discriminating signal peptides from transmembrane regions**. Nature Methods, 8:785-786, 2011_
10. _Soneson et al. **Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.** F1000Research 4, 1521 (2015)._
11. _Robinson et al. **A scaling normalization method for differential expression analysis of RNA-seq data**. Genome Biology (2010)_
## Quick Start
### Workflow Overview
The idea behind IsoformSwitchAnalyzeR is to make it easy to do advanced post analysis of full length RNA-seq derived transcripts with a focus on finding, annotating and visualizing isoform switches with functional consequences. IsoformSwitchAnalyzeR therefore performs 3 specific tasks:
- Identify isoform switches.
- Annotate the transcripts involved in the isoform switches.
- Visualize the consequences of the isoform switches, both individually and combined.
A normal workflow for identification and analysis of isoform switches with functional consequences can be divided into two parts (also illustrated below in Figure 1).
**1) Extract Isoform Switches and Their Sequences.** This part includes importing the data into R, identifying isoform switches, annotating those switches with open reading frames (ORF) and extracting both the nucleotide and peptide sequences. The latter step enables the usage of external sequence analysis tools such as
* CPAT : The Coding-Potential Assessment Tool, which can be run either locally or via their [webserver](http://lilab.research.bcm.edu/cpat/).
* Pfam : Prediction of protein domains, which can be run either locally or via their [webserver](http://pfam.xfam.org/search#tabview=tab1).
* SignalP : Prediction of Signal Peptides, which can be run either locally or via their [webserver](http://www.cbs.dtu.dk/services/SignalP/).
All of the above steps are performed by the high-level function:
```r
isoformSwitchAnalysisPart1()
```
See below for example of usage, and [Detailed Workflow] for details on the individual steps.
**2) Plot All Isoform Switches and Their Annotation.** This part involves importing and incorporating the results of the external sequence analysis, identifying intron retentions, predicting functional consequences and lastly plotting all genes with isoform switches as well as summarizing general consequences of switching.
All of this can be done using the function:
```r
isoformSwitchAnalysisPart2()
```
See below for usage example, and [Detailed Workflow] for details on the individual steps.
**Alternatively** if one does not plan to incorporate external sequence analysis, it is possible to run the full workflow using:
```r
isoformSwitchAnalysisCombined()
```
This corresponds to running _isoformSwitchAnalysisPart1()_ and _isoformSwitchAnalysisPart2()_ without adding the external results.