--- title: Correcting batch effects in single-cell RNA-seq data author: - name: Aaron T. L. Lun affiliation: Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom - name: Michael D. Morgan affiliation: Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom date: "`r Sys.Date()`" vignette: > %\VignetteIndexEntry{Correcting batch effects in scRNA-seq data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document bibliography: ref.bib --- ```{r style, echo=FALSE, results='hide', message=FALSE} library(BiocStyle) library(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) opts_chunk$set(fig.asp=1) ``` ```{r, echo=FALSE, results='hide'} all.urls <- c("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81076/suppl/GSE81076%5FD2%5F3%5F7%5F10%5F17%2Etxt%2Egz", "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE85nnn/GSE85241/suppl/GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz", "https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061/files/E-MTAB-5061.processed.1.zip", "https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5061/E-MTAB-5061.sdrf.txt") all.basenames <- basename(all.urls) all.basenames[1] <- "GSE81076_D2_3_7_10_17.txt.gz" all.basenames[2] <- "GSE85241_cellsystems_dataset_4donors_updated.csv" all.modes <- c("wb", "w", "wb", "w") for (x in seq_along(all.urls)) { if (!file.exists(all.basenames[x])) { download.file(all.urls[x], all.basenames[x], mode=all.modes[x]) } } ``` # Introduction Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as "batch effects". Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results. Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for valid downstream analysis. However, existing methods such as `removeBatchEffect()` [@ritchie2015limma] assume that the composition of cell populations are either known or the same across batches. This workflow describes the application of an alternative strategy for batch correction based on the detection of mutual nearest neighbours (MNNs) [@haghverdi2018batch]. The `mnnCorreect()` approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches. We demonstrate its use on three human pancreas scRNA-seq datasets from different groups and using different protocols. # Processing the different datasets ## CEL-seq, GSE81076 ### Loading in the data This dataset was generated by @grun2016denovo using the CEL-seq protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above. We first read the table into memory. ```{r} gse81076.df <- read.table("GSE81076_D2_3_7_10_17.txt.gz", sep='\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) dim(gse81076.df) ``` Unfortunately, the data and metadata are all mixed together in this file. As a result, we need to manually extract the metadata from the column names. ```{r} donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse81076.df)) table(donor.names) plate.id <- sub("^D[0-9]+(.*)_.*", "\\1", colnames(gse81076.df)) table(plate.id) ``` Another irritating feature of this dataset is that gene symbols were supplied, rather than stable identifiers such as Ensembl. We convert all row names to Ensembl identifiers, removing `NA` or duplicated entries (with the exception of spike-in transcripts). ```{r} gene.symb <- gsub("__chr.*$", "", rownames(gse81076.df)) is.spike <- grepl("^ERCC-", gene.symb) table(is.spike) library(org.Hs.eg.db) gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL") gene.ids[is.spike] <- gene.symb[is.spike] keep <- !is.na(gene.ids) & !duplicated(gene.ids) gse81076.df <- gse81076.df[keep,] rownames(gse81076.df) <- gene.ids[keep] summary(keep) ``` We create a `SingleCellExperiment` object to store the counts and metadata together. This reduces the risk of book-keeping errors in later steps of the analysis. Note that we re-identify the spike-in rows, as the previous indices would have changed after the subsetting. ```{r} library(SingleCellExperiment) sce.gse81076 <- SingleCellExperiment(list(counts=as.matrix(gse81076.df)), colData=DataFrame(Donor=donor.names, Plate=plate.id), rowData=DataFrame(Symbol=gene.symb[keep])) isSpike(sce.gse81076, "ERCC") <- grepl("^ERCC-", rownames(gse81076.df)) sce.gse81076 ``` ### Quality control and normalization We compute quality control (QC) metrics for each cell [@mccarthy2017scater] and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content. ```{r} library(scater) sce.gse81076 <- calculateQCMetrics(sce.gse81076, compact=TRUE) QC <- sce.gse81076$scater_qc low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3) low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3) high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3) data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), HighSpike=sum(high.spike, na.rm=TRUE)) ``` Cells with extreme values for these QC metrics are presumed to be of low quality and are removed. A more thorough analysis would examine the distributions of these QC metrics beforehand, but we will skip that step for brevity here. ```{r} discard <- low.lib | low.genes | high.spike sce.gse81076 <- sce.gse81076[,!discard] summary(discard) ``` We compute size factors for the endogenous genes using the deconvolution method [@lun2016pooling]. This is done with pre-clustering through `quickCluster()` to avoid pooling together very different cells. ```{r} library(scran) clusters <- quickCluster(sce.gse81076, min.mean=0.1) table(clusters) sce.gse81076 <- computeSumFactors(sce.gse81076, min.mean=0.1, clusters=clusters) summary(sizeFactors(sce.gse81076)) ``` We also compute size factors for the spike-in transcripts [@lun2017assessing]. Recall that we set `general.use=FALSE` to ensure that the spike-in size factors are only applied to the spike-in transcripts. ```{r} sce.gse81076 <- computeSpikeFactors(sce.gse81076, general.use=FALSE) summary(sizeFactors(sce.gse81076, "ERCC")) ``` We then compute normalized log-expression values for use in downstream analyses. ```{r} sce.gse81076 <- normalize(sce.gse81076) ``` ### Identifying highly variable genes We identify highly variable genes (HVGs) using `trendVar()` and `decomposeVar()`, using the variances of spike-in transcripts to model technical noise. We set `block=` to ensure that uninteresting differences between plates or donors do not inflate the variance. The small discrepancy in the fitted trend in Figure \@ref(fig:var-gse81076) is caused by the fact that the trend is fitted robustly to the block-wise variances of the spike-ins, while the variances shown are averaged across blocks and not robust to outliers. ```{r var-gse81076, fig.cap="Variance of normalized log-expression values for each gene in the GSE81076 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red)."} block <- paste0(sce.gse81076$Plate, "_", sce.gse81076$Donor) fit <- trendVar(sce.gse81076, block=block, parametric=TRUE) dec <- decomposeVar(sce.gse81076, fit) plot(dec$mean, dec$total, xlab="Mean log-expression", ylab="Variance of log-expression", pch=16) OBis.spike <- isSpike(sce.gse81076) points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16) curve(fit$trend(x), col="dodgerblue", add=TRUE) ``` We order genes by decreasing biological component, revealing some usual suspects such as insulin and glucagon. We will be using this information later when performing feature selection prior to running `mnnCorrect()`. ```{r} dec.gse81076 <- dec dec.gse81076$Symbol <- rowData(sce.gse81076)$Symbol dec.gse81076 <- dec.gse81076[order(dec.gse81076$bio, decreasing=TRUE),] head(dec.gse81076) ``` ```{r, echo=FALSE, results="hide"} rm(gse81076.df) gc() ``` ## CEL-seq2, GSE85241 ### Loading in the data This dataset was generated by @muraro2016singlecell using the CEL-seq2 protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above. We first read the table into memory. ```{r} gse85241.df <- read.table("GSE85241_cellsystems_dataset_4donors_updated.csv", sep='\t', h=TRUE, row.names=1, stringsAsFactors=FALSE) dim(gse85241.df) ``` We extract the metadata from the column names. ```{r} donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse85241.df)) table(donor.names) plate.id <- sub("^D[0-9]+\\.([0-9]+)_.*", "\\1", colnames(gse85241.df)) table(plate.id) ``` Yet again, gene symbols were supplied instead of Ensembl or Entrez identifiers. We convert all row names to Ensembl identifiers, removing `NA` or duplicated entries (with the exception of spike-in transcripts). ```{r} gene.symb <- gsub("__chr.*$", "", rownames(gse85241.df)) is.spike <- grepl("^ERCC-", gene.symb) table(is.spike) library(org.Hs.eg.db) gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL") gene.ids[is.spike] <- gene.symb[is.spike] keep <- !is.na(gene.ids) & !duplicated(gene.ids) gse85241.df <- gse85241.df[keep,] rownames(gse85241.df) <- gene.ids[keep] summary(keep) ``` We create a `SingleCellExperiment` object to store the counts and metadata together. ```{r} sce.gse85241 <- SingleCellExperiment(list(counts=as.matrix(gse85241.df)), colData=DataFrame(Donor=donor.names, Plate=plate.id), rowData=DataFrame(Symbol=gene.symb[keep])) isSpike(sce.gse85241, "ERCC") <- grepl("^ERCC-", rownames(gse85241.df)) sce.gse85241 ``` ### Quality control and normalization We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content. ```{r} sce.gse85241 <- calculateQCMetrics(sce.gse85241, compact=TRUE) QC <- sce.gse85241$scater_qc low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3) low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3) high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3) data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), HighSpike=sum(high.spike, na.rm=TRUE)) ``` Low-quality cells are defined as those with extreme values for these QC metrics and are removed. ```{r} discard <- low.lib | low.genes | high.spike sce.gse85241 <- sce.gse85241[,!discard] summary(discard) ``` We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values. ```{r} clusters <- quickCluster(sce.gse85241, min.mean=0.1, method="igraph") table(clusters) sce.gse85241 <- computeSumFactors(sce.gse85241, min.mean=0.1, clusters=clusters) summary(sizeFactors(sce.gse85241)) sce.gse85241 <- computeSpikeFactors(sce.gse85241, general.use=FALSE) summary(sizeFactors(sce.gse85241, "ERCC")) sce.gse85241 <- normalize(sce.gse85241) ``` ### Identifying highly variable genes We fit a trend to the spike-in variances as previously described, allowing us to model the technical noise for each gene (Figure \@ref(fig:var-gse85241)). Again, we set `block=` to ensure that uninteresting differences between plates or donors do not inflate the variance. ```{r var-gse85241, fig.cap="Variance of normalized log-expression values for each gene in the GSE85241 dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red)."} block <- paste0(sce.gse85241$Plate, "_", sce.gse85241$Donor) fit <- trendVar(sce.gse85241, block=block, parametric=TRUE) dec <- decomposeVar(sce.gse85241, fit) plot(dec$mean, dec$total, xlab="Mean log-expression", ylab="Variance of log-expression", pch=16) is.spike <- isSpike(sce.gse85241) points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16) curve(fit$trend(x), col="dodgerblue", add=TRUE) ``` We order genes by decreasing biological component, as described above. ```{r} dec.gse85241 <- dec dec.gse85241$Symbol <- rowData(sce.gse85241)$Symbol dec.gse85241 <- dec.gse85241[order(dec.gse85241$bio, decreasing=TRUE),] head(dec.gse85241) ``` ```{r, echo=FALSE, results="hide"} rm(gse85241.df) gc() ``` ## Smart-seq2, E-MTAB-5061 ### Loading in the data This dataset was generated by @segerstolpe2016singlecell using the Smart-seq2 protocol with ERCC spike-ins. Count tables were obtained from ArrayExpress using the accession number above. We first read the table into memory, though this requires some effort as the file is even more unconventionally formatted than the two examples above. ```{r} unzip("E-MTAB-5061.processed.1.zip") # Figuring out the number of libraries (-1 for the '#sample'). header <- read.table("pancreas_refseq_rpkms_counts_3514sc.txt", nrow=1, sep="\t", comment.char="", stringsAsFactors=FALSE) ncells <- ncol(header) - 1L # Loading only the gene names and the counts. col.types <- vector("list", ncells*2 + 2) col.types[1:2] <- "character" col.types[2+ncells + seq_len(ncells)] <- "integer" e5601.df <- read.table("pancreas_refseq_rpkms_counts_3514sc.txt", sep="\t", colClasses=col.types) # Disentangling the gene names and the counts. gene.data <- e5601.df[,1:2] e5601.df <- e5601.df[,-(1:2)] colnames(e5601.df) <- as.character(header[1,-1]) dim(e5601.df) ``` The gene metadata _does_ contains unique GenBank identifiers, but these are transcript-level and concatenated together for each gene. Instead of trying to pull them apart, we perform the symbol-to-Ensembl conversion that was done for the previous datasets. ```{r} is.spike <- grepl("^ERCC-", gene.data[,2]) table(is.spike) library(org.Hs.eg.db) gene.ids <- mapIds(org.Hs.eg.db, keys=gene.data[,1], keytype="SYMBOL", column="ENSEMBL") gene.ids[is.spike] <- gene.data[is.spike,2] keep <- !is.na(gene.ids) & !duplicated(gene.ids) e5601.df <- e5601.df[keep,] rownames(e5601.df) <- gene.ids[keep] summary(keep) ``` At least the metadata is stored in a separate file, which makes it fairly straightforward to parse. We `match` the rows to the column names to ensure that the metadata and count table are in the same order. ```{r} metadata <- read.table("E-MTAB-5061.sdrf.txt", header=TRUE, sep="\t", check.names=FALSE, stringsAsFactors=FALSE) m <- match(colnames(e5601.df), metadata[["Assay Name"]]) stopifnot(all(!is.na(m))) metadata <- metadata[m,] donor.id <- metadata[["Characteristics[individual]"]] table(donor.id) ``` We create a `SingleCellExperiment` object to store the counts and metadata together. ```{r} sce.e5601 <- SingleCellExperiment(list(counts=as.matrix(e5601.df)), colData=DataFrame(Donor=donor.id), rowData=DataFrame(Symbol=gene.data[keep,1])) isSpike(sce.e5601, "ERCC") <- grepl("^ERCC-", rownames(e5601.df)) sce.e5601 ``` ### Quality control and normalization We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content. We also identify cells with near-zero spike-in counts, which are not compatible with spike-in normalization and modelling of technical noise. ```{r} sce.e5601 <- calculateQCMetrics(sce.e5601, compact=TRUE) QC <- sce.e5601$scater_qc low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3) low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3) high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3) low.spike <- isOutlier(QC$feature_control_ERCC$log10_total_counts, type="lower", nmad=2) data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes), HighSpike=sum(high.spike, na.rm=TRUE), LowSpike=sum(low.spike)) ``` Low-quality cells are defined as those with extreme values for these QC metrics and are removed. ```{r} discard <- low.lib | low.genes | high.spike | low.spike sce.e5601 <- sce.e5601[,!discard] summary(discard) ``` We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values. Recall that the Smart-seq2 protocol generates read count data, so we use a more stringent filter of `min.mean=1` to remove low-abundance genes. ```{r} clusters <- quickCluster(sce.e5601, min.mean=1, method="igraph") table(clusters) sce.e5601 <- computeSumFactors(sce.e5601, min.mean=1, clusters=clusters) summary(sizeFactors(sce.e5601)) sce.e5601 <- computeSpikeFactors(sce.e5601, general.use=FALSE) summary(sizeFactors(sce.e5601, "ERCC")) sce.e5601 <- normalize(sce.e5601) ``` ### Identifying highly variable genes We identify highly variable genes (HVGs) using `trendVar()` and `decomposeVar()`. Here, we need to fit the mean-variance trend separately to each donor, as the donor-to-donor variation in the mean-variance trend is more pronounced than that in the UMI datasets. This yields one fit for each donor in Figure \@ref(fig:var-e5601). ```{r var-e5601, fig.cap="Variance of normalized log-expression values for each gene in the E-MTAB-5601 dataset, plotted against the mean log-expression. Each plot corresponds to a donor, where the blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).", fig.width=6, fig.asp=2.5} donors <- sort(unique(sce.e5601$Donor)) is.spike <- isSpike(sce.e5601) par(mfrow=c(ceiling(length(donors)/2), 2), mar=c(4.1, 4.1, 2.1, 0.1)) collected <- list() for (x in unique(sce.e5601$Donor)) { current <- sce.e5601[,sce.e5601$Donor==x] if (ncol(current)<2L) { next } current <- normalize(current) fit <- trendVar(current, parametric=TRUE) dec <- decomposeVar(current, fit) plot(dec$mean, dec$total, xlab="Mean log-expression", ylab="Variance of log-expression", pch=16, main=x) points(fit$mean, fit$var, col="red", pch=16) curve(fit$trend(x), col="dodgerblue", add=TRUE) collected[[x]] <- dec } ``` We combine statistics across donors to consolidate them into a single set of results for this study. We then order genes by decreasing biological component. ```{r} dec.e5601 <- do.call(combineVar, collected) dec.e5601$Symbol <- rowData(sce.e5601)$Symbol dec.e5601 <- dec.e5601[order(dec.e5601$bio, decreasing=TRUE),] head(dec.e5601) ``` ```{r, echo=FALSE, results="hide"} rm(e5601.df) gc() ``` # Feature selection across batches To obtain a single set of features for batch selection, we take the top 1000 genes with the largest biological components from each batch. The intersection of this set across batches is defined as our feature set for MNN correction. ```{r} top.e5601 <- rownames(dec.e5601)[seq_len(1000)] top.gse85241 <- rownames(dec.gse85241)[seq_len(1000)] top.gse81076 <- rownames(dec.gse81076)[seq_len(1000)] chosen <- Reduce(intersect, list(top.e5601, top.gse85241, top.gse81076)) # Adding some gene symbols for interpretation. symb <- mapIds(org.Hs.eg.db, keys=chosen, keytype="ENSEMBL", column="SYMBOL") DataFrame(ID=chosen, Symbol=symb) ``` The use of an intersection is a rather conservative strategy as it requires the same gene to be highly variable in all batches. This may not be possible for marker genes of cell types that are not present in all batches. An alternative approach is to use `combineVar()` to compute the average biological component across batches for each gene. The feature set can then be defined as the top $X$ genes with the largest biological components. ```{r} # Identifying genes that are annotated in all batches. in.all <- Reduce(intersect, list(rownames(dec.e5601), rownames(dec.gse85241), rownames(dec.gse81076))) # Setting weighted=FALSE so each batch contributes equally. combined <- combineVar(dec.e5601[in.all,], dec.gse85241[in.all,], dec.gse81076[in.all,], weighted=FALSE) chosen2 <- rownames(combined)[head(order(combined$bio, decreasing=TRUE), 1000)] ``` Despite its conservativeness, we will use the intersection here to focus on the obvious biological differences between pancreas cell types. This reduces the number of genes that drive the weaker uninteresting differences between donors _within each batch_. These will not be corrected by applying `mnnCorrect()` to the batches, and will complicate the interpretation of the plots below. **Comments from Aaron:** - It is possible to apply `mnnCorrect()` _within_ each batch to correct for differences between donors and/or plates, prior to correcting for differences _between_ batches. For simplicity, we will not do this here, relying only on the feature selection to mitigate donor/plate effects. # Performing MNN-based correction Consider a cell $a$ in batch $A$, and identify the cells in batch $B$ that are nearest neighbours to $a$ in the expression space defined by the selected features. Repeat this for a cell $b$ in batch $B$, identifying its nearest neighbours in $A$. Mutual nearest neighbours are pairs of cells from different batches that belong in each other's set of nearest neighbours. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see @haghverdi2018batch for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which can yield batch-corrected values. We apply the `mnnCorrect()` function to the three batches to remove the batch effect, using the genes in `chosen`. This involves correcting their expression values so that all cells are comparable in the coordinate system of the first batch. The function returns a set of matrices containing corrected expression values, which we can use in downstream analyses. ```{r} original <- list(logcounts(sce.e5601)[chosen,], logcounts(sce.gse81076)[chosen,], logcounts(sce.gse85241)[chosen,]) corrected <- do.call(mnnCorrect, c(original, list(k=20, sigma=0.1))) str(corrected$corrected) ``` The function also reports the MNN pairs that were identified in each successive batch. This may be useful for trouble-shooting, e.g., to check whether cells independently assigned to the same cell type are correctly identified as MNN pairs. ```{r} corrected$pairs ``` **Comments from Aaron:** - The `k=` parameter specifies the number of nearest neighbours to consider when defining MNN pairs. This should be interpreted as the minimum frequency of each cell type or state in each batch. Larger values will improve the precision of the correction by increasing the number of MNN pairs, at the cost of reducing accuracy by allowing MNN pairs to form between cells of different type. - The `sigma=` parameter specifies how much information is shared between MNN pairs when computing the batch effect. Larger values will share more information, approaching a global correction for all cells in the same batch. Smaller values allow the correction to vary across cell types, which may be more accurate but comes at the cost of precision. The default `sigma=1` is conservative and favours undercorrection, so lower values may be more suitable in many cases. - The order of the supplied batches does matter, as the first batch is treated as the reference coordinate system. We suggest setting the largest and/or most heterogeneous batch as the first. This ensures that sufficient MNN pairs will be identified between the first and other batches for stable correction. # Examining the effect of correction We create a new `SingleCellExperiment` object containing these corrected counts for each cell, along with information regarding the batch of origin. ```{r} omat <- do.call(cbind, original) mat <- do.call(cbind, corrected$corrected) colnames(mat) <- NULL sce <- SingleCellExperiment(list(original=omat, corrected=mat)) colData(sce)$Batch <- rep(c("e5601", "gse81076", "gse85241"), lapply(corrected$corrected, ncol)) sce ``` We examine the batch correction with some _t_-SNE plots. Figure~\@ref(fig:tsne-batch) demonstrates how the cells separate by batch of origin in the uncorrected data. After correction, more intermingling between batches is observed, consistent with the removal of batch effects. Note that the E-MTAB-5601 dataset still displays some separation, which is probably due to the fact that the other batches are UMI datasets. ```{r tsne-batch, fig.width=10, fig.asp=0.6, fig.cap="t-SNE plots of the pancreas datasets, before and after MNN correction. Each point represents a cell and is coloured by the batch of origin."} osce <- runTSNE(sce, exprs_values="original", rand_seed=100) ot <- plotTSNE(osce, colour_by="Batch") + ggtitle("Original") csce <- runTSNE(sce, exprs_values="corrected", rand_seed=100) ct <- plotTSNE(csce, colour_by="Batch") + ggtitle("Corrected") multiplot(ot, ct, cols=2) ``` We colour by the expression of marker genes for known pancreas cell types to determine whether the correction is biologically sensible. Cells in the same visual cluster express the same marker genes (Figure \@ref(fig:tsne-markers)), indicating that the correction maintains separation of cell types. ```{r tsne-markers, fig.width=10, fig.height=10, fig.cap="t-SNE plots after MNN correction, where each point represents a cell and is coloured by its corrected expression of key marker genes for known cell types in the pancreas."} ct.gcg <- plotTSNE(csce, by_exprs_values="corrected", colour_by="ENSG00000115263") + ggtitle("Alpha cells (GCG)") ct.ins <- plotTSNE(csce, by_exprs_values="corrected", colour_by="ENSG00000254647") + ggtitle("Beta cells (INS)") ct.sst <- plotTSNE(csce, by_exprs_values="corrected", colour_by="ENSG00000157005") + ggtitle("Delta cells (SST)") ct.ppy <- plotTSNE(csce, by_exprs_values="corrected", colour_by="ENSG00000108849") + ggtitle("PP cells (PPY)") multiplot(ct.gcg, ct.ins, ct.sst, ct.ppy, cols=2) ``` # Using the corrected values in downstream analyses MNN correction places all cells from all batches within the same coordinate system. This means that the corrected values can be freely used to define distances between cells for dimensionality reduction or clustering. However, the correction does _not_ preserve the mean-variance relationship. As such, we do not recommend using the corrected values for studying heterogeneity. MNN-corrected values are generally not suitable for differential expression (DE) analyses, for several reasons: - The default parameters of `mnnCorrect()` do not return corrected values on the log-scale, but rather a cosine-normalized log-scale. This makes it difficult to interpret the effect size of DE analyses based on the corrected values. - It is usually inappropriate to perform DE analyses on batch-corrected values, due to the failure to model the uncertainty of the correction. This usually results in loss of type I error control, i.e., more false positives than expected. In scRNA-seq studies, most DE analyses will be performed between clusters or along trajectories. We recommend using the corrected values for clustering or trajectory reconstruction, and switching back to the original log-expression values (or counts) for the DE analysis. The batch of origin can then be modelled as a blocking factor, which is possible with most packages for DE analysis such as `r Biocpkg("edgeR")` or `r Biocpkg("limma")`. Finally, the MNN-corrected values can be used for further correction with `mnnCorrect()`. This is useful in nested experimental designs involving multiple batches within each of multiple studies, much like the pancreas datasets described above. In such cases, it is reasonable to first perform the correction across batches _within the same study_ (where there should be more similar cell types, and thus more MNN pairs). The batch-corrected values for each study can then be supplied to `mnnCorrect()` to remove batch effects between studies. **Comments from Aaron:** - The most direct method of blocking on batch is to use an additive model with the batch of origin and the cluster/trajectory as separate factors. However, this assumes that the batch effect is constant for all clusters and/or along the trajectory. This assumption can be eliminated with an interaction model allowing for cluster-specific batch effects. - Users should set `cos.norm.in=FALSE` and `cos.norm.out=FALSE` when supplying `mnnCorrect()` with MNN-corrected values. This ensures that the cosine normalization is only applied once, during the first round of MNN correction. # References