1 Project Overview

Analysis and comprehension of high-throughput genomic data

Statistical analysis: large data, technological artifacts, designed experiments; rigorous
Comprehension: biological context, visualization, reproducibility
High-throughput
- Sequencing: RNASeq, ChIPSeq, variants, copy number, …
- Microarrays: expression, SNP, …
- Flow cytometry, proteomics, images, …

Packages, vignettes, work flows

1211 software packages; also…
- ‘Annotation’ packages – static databases of identifier maps, gene models, pathways, etc; e.g., TxDb.Hsapiens.UCSC.hg19.knownGene
- ‘Experiment’ packages – data sets used to illustrate software functionality, e.g., airway
Discover and navigate via biocViews
Package ‘landing page’
- Title, author / maintainer, short description, citation, installation instructions, …, download statistics
All user-visible functions have help pages, most with runnable examples
‘Vignettes’ are an important feature in Bioconductor – narrative documents illustrating how to use the package, with integrated code
‘Release’ (every six months) and ‘devel’ branches
Support site; videos, recent courses

Package installation and use

A package needs to be installed once, using the instructions on the landing page. Once installed, the package can be loaded into an R session
```
library(GenomicRanges)
```
and the help system queried interactively, as outlined above:
```
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
```
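The installation step itself is not shown above. A minimal sketch, assuming a current Bioconductor release in which the BiocManager package is the installer (older releases used biocLite() instead); the package names are only examples, and the landing page remains the authoritative source of instructions:

```r
# Packages we would like to use; the names here are just examples.
pkgs <- c("GenomicRanges", "Biostrings")

# requireNamespace() returns TRUE when a package is already installed.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only in an interactive session, and only what is missing.
if (length(missing_pkgs) > 0 && interactive()) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(missing_pkgs)
}
```

The interactive() guard is a design choice for this sketch, so that sourcing the script never triggers a download unattended.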

2 High-throughput Sequence Analysis

2.1 Overall Work Flow

Experimental design
- Keep it simple, e.g., ‘control’ and ‘treatment’ groups
- Replicate within treatments!
Wet-lab sequence preparation (figure from http://rnaseq.uoregon.edu/)
- Record covariates, including processing day – likely ‘batch effects’
(Illumina) Sequencing (Bentley et al., 2008, doi:10.1038/nature07517)
- Primary output: FASTQ files of short reads and their quality scores
Alignment
- Choose an aligner to match the task, e.g., Rsubread, Bowtie2 good for ChIPseq, some forms of RNAseq; BWA, GMAP better for variant calling
- Primary output: BAM files of aligned reads
- More recently: kallisto and similar programs that produce tables of reads aligned to transcripts
Reduction
- e.g., RNASeq ‘count table’ (simple spreadsheets), DNASeq called variants (VCF files), ChIPSeq peaks (BED, WIG files)
Analysis
- Differential expression, peak identification, …
Comprehension
- Biological context
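The experimental design points above (replication, recorded covariates) can be checked before any sequencing is done. A small sketch in base R, using a hypothetical sample sheet; the sample names, group labels and processing days are invented for illustration:

```r
# Hypothetical sample sheet: two treatment groups, three replicates each,
# processed on two days (a likely batch effect).
samples <- data.frame(
    sample = paste0("S", 1:6),
    treatment = rep(c("control", "treatment"), each = 3),
    processing_day = c("day1", "day2", "day1", "day2", "day1", "day2"),
    stringsAsFactors = FALSE
)

# Replication: every treatment group should have at least two samples.
replicates <- table(samples$treatment)
stopifnot(all(replicates >= 2))

# Confounding check: each processing day should include both groups, so
# that day-to-day variation can be separated from the treatment effect.
design <- table(samples$treatment, samples$processing_day)
balanced <- all(design > 0)
design
```

If all controls were processed on one day and all treated samples on another, the cross-tabulation would contain zeros and the batch effect could not be distinguished from the treatment effect.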

Figure: Sequencing ecosystem

3 High-Throughput Sequence Data Types

3.1 Sequencing data types

Sequenced reads: FASTQ files

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################
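Each FASTQ record spans four lines: identifier, sequence, separator, and quality string. A base R sketch of how the first record above could be taken apart, assuming the Phred+33 encoding (the file itself does not declare its encoding, so the offset is an assumption):

```r
# First record from the FASTQ excerpt above: identifier, sequence,
# separator, and per-base quality string.
record <- c(
    "@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1",
    "CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT",
    "+",
    "HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:"
)

read_id <- sub("^@", "", record[1])
bases <- record[2]
quality <- record[4]

# Sequence and quality strings must have the same length.
stopifnot(nchar(bases) == nchar(quality))

# Assumed encoding: Phred+33, i.e. score = ASCII code - 33.
phred <- utf8ToInt(quality) - 33L

# A Phred score Q corresponds to an error probability of 10^(-Q / 10).
error_prob <- 10^(-phred / 10)
range(phred)
```

This is only meant to show the structure of a record; real files contain millions of records and are better handled by dedicated packages.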

Aligned reads: BAM files

Header

@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
...
@SQ     SN:chrY LN:59373566
@PG     ID:TopHat       VN:2.0.8b       CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq

Alignments: ID, flag, alignment and mate

ERR127306.7941162       403     chr14   19653689        3       72M             =       19652348        -1413  ...
ERR127306.22648137      145     chr14   19653692        1       72M             =       19650044        -3720  ...
ERR127306.933914        339     chr14   19653707        1       66M120N6M       =       19653686        -213   ...
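The second column of each alignment record is a bitwise flag. A sketch of how the flags above (403, 145, 339) decompose, using the bit values from the SAM specification; the function name decode_flag and the short bit labels are invented for this illustration:

```r
# Bit values defined by the SAM specification for the FLAG field.
flag_bits <- c(
    paired = 0x1, proper_pair = 0x2, unmapped = 0x4, mate_unmapped = 0x8,
    reverse_strand = 0x10, mate_reverse_strand = 0x20,
    first_in_pair = 0x40, second_in_pair = 0x80,
    secondary = 0x100, qc_fail = 0x200, duplicate = 0x400,
    supplementary = 0x800
)

# Hypothetical helper: return the names of all bits set in 'flag'.
decode_flag <- function(flag) {
    # Integer division followed by modulo 2 tests whether each bit is set.
    is_set <- (flag %/% flag_bits) %% 2 == 1
    names(flag_bits)[is_set]
}

decode_flag(403)
decode_flag(145)
decode_flag(339)
```

All three reads are paired and align to the reverse strand; 403 and 339 are additionally marked as secondary alignments, consistent with their low mapping qualities.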

Alignments: sequence and quality

... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC        *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%))
... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG        '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)****
... TGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCT        '******&%)&)))&")')'')'*((******&)&'')'))$))'')&))$)**&&****************
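The CIGAR strings in the alignment records (72M, 66M120N6M) describe how each 72-base read is laid out on the reference. A base R sketch, with a hypothetical helper cigar_widths(), of how the read length and the span on the reference follow from the CIGAR operations:

```r
# Hypothetical helper: width of the read and of the reference span
# implied by a CIGAR string.
cigar_widths <- function(cigar) {
    ops <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP=X]", cigar))[[1]]
    len <- as.integer(sub("[MIDNSHP=X]$", "", ops))
    op <- sub("^[0-9]+", "", ops)
    c(
        # Operations that consume bases of the read.
        query = sum(len[op %in% c("M", "I", "S", "=", "X")]),
        # Operations that consume positions on the reference.
        reference = sum(len[op %in% c("M", "D", "N", "=", "X")])
    )
}

cigar_widths("72M")
cigar_widths("66M120N6M")
```

For the third record, the 120N operation is a skipped region on the reference (for RNA-seq, typically an intron), so a 72-base read spans 192 reference positions.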

Alignments: Tags

... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:2  CC:Z:chr22      CP:i:16189276   HI:i:0
... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:3  CC:Z:=  CP:i:19921600   HI:i:0
... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:4  MD:Z:72 YT:Z:UU XS:A:+  NH:i:3  CC:Z:=  CP:i:19921465   HI:i:0
... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:4  MD:Z:72 YT:Z:UU XS:A:+  NH:i:2  CC:Z:chr22      CP:i:16189138   HI:i:0
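Optional tags follow a TAG:TYPE:VALUE layout. A sketch of parsing the first tag line above in base R; parse_tags() is a hypothetical helper, and only the integer type (i) is converted, with everything else left as character:

```r
tag_field <- paste(
    "AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU",
    "NH:i:2 CC:Z:chr22 CP:i:16189276 HI:i:0"
)

# Hypothetical helper: split a tag string into a named list of values.
parse_tags <- function(x) {
    fields <- strsplit(x, "[[:space:]]+")[[1]]
    parts <- strsplit(fields, ":", fixed = TRUE)
    tag <- vapply(parts, `[`, character(1), 1)
    type <- vapply(parts, `[`, character(1), 2)
    # A value may itself contain ':', so rejoin everything after the type.
    value <- vapply(
        parts,
        function(p) paste(p[-(1:2)], collapse = ":"),
        character(1)
    )
    out <- as.list(value)
    out[type == "i"] <- lapply(value[type == "i"], as.integer)
    names(out) <- tag
    out
}

tags <- parse_tags(tag_field)
tags$NH   # number of reported alignments for this read
tags$CC   # reference sequence of the next reported alignment
```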

Called variants: VCF files

Header

  ##fileformat=VCFv4.2
  ##fileDate=20090805
  ##source=myImputationProgramV3.1
  ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
  ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
  ##phasing=partial
  ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
  ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
  ...
  ##FILTER=<ID=q10,Description="Quality below 10">
  ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
  ...
  ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
  ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

Location

  #CHROM POS     ID        REF    ALT     QUAL FILTER ...
  20     14370   rs6054257 G      A       29   PASS   ...
  20     17330   .         T      A       3    q10    ...
  20     1110696 rs6040355 A      G,T     67   PASS   ...

Variant INFO

  #CHROM POS     ...    INFO                              ...
  20     14370   ...    NS=3;DP=14;AF=0.5;DB;H2           ...
  20     17330   ...    NS=3;DP=11;AF=0.017               ...
  20     1110696 ...    NS=2;DP=10;AF=0.333,0.667;AA=T;DB ...
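The INFO column packs key=value pairs and bare flags into a single semicolon-separated string. A base R sketch of unpacking it; parse_info() is a hypothetical helper, and values are left as character because their types are declared in the header rather than in the field itself:

```r
# Hypothetical helper: turn an INFO string into a named list.
parse_info <- function(info) {
    fields <- strsplit(info, ";", fixed = TRUE)[[1]]
    has_value <- grepl("=", fields, fixed = TRUE)
    keys <- sub("=.*$", "", fields)
    values <- lapply(seq_along(fields), function(i) {
        if (!has_value[i])
            return(TRUE)    # a bare flag such as DB
        value <- sub("^[^=]*=", "", fields[i])
        # Comma-separated values, e.g. one AF per alternate allele.
        strsplit(value, ",", fixed = TRUE)[[1]]
    })
    setNames(values, keys)
}

info <- parse_info("NS=2;DP=10;AF=0.333,0.667;AA=T;DB")
as.integer(info$DP)
as.numeric(info$AF)
```

The two AF values correspond to the two alternate alleles (G,T) at position 1110696, matching the Number=A declaration in the header.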

Genotype FORMAT and samples

  ... POS     ...  FORMAT      NA00001        NA00002        NA00003
  ... 14370   ...  GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
  ... 17330   ...  GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
  ... 1110696 ...  GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
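The FORMAT column names the colon-separated fields found in each sample column. A sketch for the record at position 14370, in base R; parse_sample() is a hypothetical helper, and trailing fields that a sample omits are padded with NA:

```r
format_keys <- strsplit("GT:GQ:DP:HQ", ":", fixed = TRUE)[[1]]
samples <- c(
    NA00001 = "0|0:48:1:51,51",
    NA00002 = "1|0:48:8:51,51",
    NA00003 = "1/1:43:5:.,."
)

# Hypothetical helper: split one sample entry according to FORMAT.
parse_sample <- function(x) {
    values <- strsplit(x, ":", fixed = TRUE)[[1]]
    length(values) <- length(format_keys)   # pad omitted fields with NA
    setNames(values, format_keys)
}

# One row per sample, one column per FORMAT key.
geno <- t(vapply(samples, parse_sample, character(length(format_keys))))
geno

# '|' separates phased alleles, '/' unphased alleles.
phased <- grepl("|", geno[, "GT"], fixed = TRUE)
phased
```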

Genome annotations: BED, WIG, GTF, etc. files. E.g., GTF:

Component coordinates

7   protein_coding  gene        27221129    27224842    .   -   . ...
...
7   protein_coding  transcript  27221134    27224835    .   -   . ...
7   protein_coding  exon        27224055    27224835    .   -   . ...
7   protein_coding  CDS         27224055    27224763    .   -   0 ...
7   protein_coding  start_codon 27224761    27224763    .   -   0 ...
7   protein_coding  exon        27221134    27222647    .   -   . ...
7   protein_coding  CDS         27222418    27222647    .   -   2 ...
7   protein_coding  stop_codon  27222415    27222417    .   -   0 ...
7   protein_coding  UTR         27224764    27224835    .   -   . ...
7   protein_coding  UTR         27221134    27222414    .   -   . ...
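GTF coordinates are 1-based and include both start and end, so a feature's width is end - start + 1. A base R sketch using a subset of the rows above, with the column names chosen here for illustration:

```r
gtf <- read.table(
    text = "
7 protein_coding transcript  27221134 27224835 . - .
7 protein_coding exon        27224055 27224835 . - .
7 protein_coding CDS         27224055 27224763 . - 0
7 protein_coding start_codon 27224761 27224763 . - 0
7 protein_coding exon        27221134 27222647 . - .
7 protein_coding CDS         27222418 27222647 . - 2
7 protein_coding stop_codon  27222415 27222417 . - 0
",
    col.names = c("seqname", "source", "feature", "start", "end",
                  "score", "strand", "frame"),
    stringsAsFactors = FALSE
)

# Closed intervals: both end points belong to the feature.
gtf$width <- gtf$end - gtf$start + 1L

# Total coding sequence length, summed over the two CDS segments.
cds_total <- sum(gtf$width[gtf$feature == "CDS"])
cds_total
cds_total %% 3L   # a complete coding sequence is a whole number of codons
```

BED files, by contrast, use 0-based, half-open coordinates, which is one reason to rely on an import function rather than hand-written parsing when mixing formats.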

Annotations

gene_id "ENSG00000005073"; gene_name "HOXA11"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
...
... transcript_id "ENST00000006015"; transcript_name "HOXA11-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS5411";
... exon_number "1"; exon_id "ENSE00001147062";
... exon_number "1"; protein_id "ENSP00000006015";
... exon_number "1";
... exon_number "2"; exon_id "ENSE00002099557";
... exon_number "2"; protein_id "ENSP00000006015";
... exon_number "2";
...
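The final GTF column is a list of key "value"; pairs. A sketch of extracting them with base R regular expressions; parse_attributes() is a hypothetical helper that assumes keys consist of letters and underscores and that values are always double-quoted, as in the excerpt above:

```r
attr_string <- paste(
    'gene_id "ENSG00000005073"; gene_name "HOXA11";',
    'gene_source "ensembl_havana"; gene_biotype "protein_coding";'
)

# Hypothetical helper: named character vector of attribute values.
parse_attributes <- function(x) {
    # Each match looks like: key "value"
    pairs <- regmatches(x, gregexpr('[A-Za-z_]+ "[^"]*"', x))[[1]]
    keys <- sub(' ".*$', "", pairs)
    values <- sub('^[^"]*"([^"]*)"$', "\\1", pairs)
    setNames(values, keys)
}

attrs <- parse_attributes(attr_string)
attrs[["gene_name"]]
attrs[["gene_biotype"]]
```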

Derived results, e.g., ‘count’ tables (.csv files) for RNA-seq differential expression.

3.2 Major Bioconductor Packages

GenomicRanges: ‘Ranges’ to describe data and annotation; GRanges(), GRangesList()
Biostrings: DNA and other sequences, DNAStringSet(); BSgenome
GenomicAlignments: Aligned reads; GAlignments() and friends
GenomicFeatures, AnnotationDbi: annotation resources, TxDb and org packages.
SummarizedExperiment: coordinating experimental data
rtracklayer: import BED, WIG, GTF, etc.

Bioconductor Objects

Example: Biostrings

library(Biostrings)
data(phiX174Phage)
phiX174Phage

##   A DNAStringSet instance of length 6
##     width seq                                                                   names               
## [1]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Genbank
## [2]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA RF70s
## [3]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA SS78
## [4]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Bull
## [5]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA G97
## [6]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA NEB03

letterFrequency(phiX174Phage, c("A", "C", "G", "T"))

##         A    C    G    T
## [1,] 1291 1157 1254 1684
## [2,] 1292 1156 1253 1685
## [3,] 1292 1156 1253 1685
## [4,] 1292 1155 1253 1686
## [5,] 1292 1156 1253 1685
## [6,] 1292 1155 1253 1686

letterFrequency(phiX174Phage, "GC", as.prob=TRUE)

##            G|C
## [1,] 0.4476420
## [2,] 0.4472707
## [3,] 0.4472707
## [4,] 0.4470850
## [5,] 0.4472707
## [6,] 0.4470850
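The two letterFrequency() results are related: the G|C proportion is the sum of the G and C counts divided by the sequence width. A base R cross-check, with the counts copied from the output above so that Biostrings is not needed to run it:

```r
# Nucleotide counts copied from the letterFrequency() output above.
counts <- matrix(
    c(1291, 1157, 1254, 1684,
      1292, 1156, 1253, 1685,
      1292, 1156, 1253, 1685,
      1292, 1155, 1253, 1686,
      1292, 1156, 1253, 1685,
      1292, 1155, 1253, 1686),
    ncol = 4, byrow = TRUE,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)

# With no ambiguity letters, the four counts add up to the sequence width.
genome_width <- rowSums(counts)
genome_width

# GC content as a proportion of the sequence.
gc_fraction <- (counts[, "G"] + counts[, "C"]) / genome_width
round(gc_fraction, 7)
```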

Represent complicated data types
Foster interoperability
S4 object system
- Introspection: methods(), getClass(), selectMethod()
- ‘accessors’ and other documented functions / methods for manipulation, rather than direct access to the object structure
Interactive help
- method?"substr,<tab>" to see all help pages for substr methods; for instance, method?"letterFrequency,XStringSet" finds the documentation for the letterFrequency generic applied to an object of (or derived from) the XStringSet class.
- class?D<tab> for help on classes whose name begins with ‘D’ (e.g., find the help page for the DNAStringSet class with class?DNAStringSet).
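The S4 ideas above (classes, generics, accessors, introspection) can be tried without any Bioconductor package. A toy sketch using only the methods package; the class SimpleRange and the generic rangeWidth() are invented for this illustration and are not part of Bioconductor:

```r
library(methods)

# A toy class with two integer slots and a validity check.
setClass(
    "SimpleRange",
    representation(start = "integer", end = "integer"),
    validity = function(object) {
        if (length(object@start) != length(object@end))
            return("'start' and 'end' must have the same length")
        if (any(object@end < object@start - 1L))
            return("ranges must not have negative width")
        TRUE
    }
)

# Constructor: users call this rather than new() directly.
SimpleRange <- function(start, end) {
    new("SimpleRange", start = as.integer(start), end = as.integer(end))
}

# Accessor-style generic: callers never touch the slots themselves.
setGeneric("rangeWidth", function(x, ...) standardGeneric("rangeWidth"))
setMethod("rangeWidth", "SimpleRange", function(x, ...) {
    x@end - x@start + 1L
})

r <- SimpleRange(c(27224055, 27222418), c(27224763, 27222647))
rangeWidth(r)

# Introspection, as listed above.
getClass("SimpleRange")
existsMethod("rangeWidth", "SimpleRange")
selectMethod("rangeWidth", "SimpleRange")
```

Because callers use rangeWidth() rather than the slots, the internal representation could later change (for example, to start and width) without breaking user code.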

B.1 – Introduction to Bioconductor

Martin Morgan

16 - 17 May, 2016