Dropbead provides two functions to readily extract useful statistics from the data. The first one, extractReadStatistics, takes the BAM file as an argument and computes the total number of (uniquely mapped) reads, the percentages of intronic/intergenic etc. The output is a data.frame as follows:
##          total_reads uniq_mapped intronic intergenic coding UTR
## Drop-seq        77.6          69      4.5        0.3    4.7 3.9
##          total_barcodes
## Drop-seq            1.7The read numbers are in million. The intronic, intergenic, coding and UTR reads do not get a gene annotation tag (GE:) and are discarded from the Drop-seq pipeline. The last column counts the total number of barcodes seen in the data, most of them being artifacts from sequencing errors.
Dropbead provides a function to computationally estimate the number of cells present in the sample, by calculating the inflection point of the cumulative fraction of reads against the cell barcodes. Assuming that reads.by.cell is the data.frame with the reads per cell (the output of BAMTagHistogram from the Drop-seq toolkit),
plotCumulativeFractionOfReads(reads.by.cell, 
                              cutoff = 10000, 
                              draw.infl.point = TRUE)The following function creates a table with the basic downstream statistics, such as the cell number and the median number of reads, genes and UMIs per cell.
extractDownstreamStatistics('~/dropseq-data/', min.umi=250, read.stats=NULL)##   cells reads genes umis sum.umi PCR
## 1  3995  3965   351  734     4.3 5.4The total number of UMIs in the DGE is given in millions. The PCR column provides the reads to UMIs ratio, so that the smaller, the better. Passing the data.frame output of the extractReadStatistics function into the read.stats argument provides a more precise calculation of the PCR over-amplification bias.
Dropbead offers two main classes to store samples, the SingleSpeciesSample class and the MixedSpeciesSample class, the latter containing the former.
Mixed species experiments are important to estimate the number of doublets in the samples. Dropbead assumes that a digital gene expression matrix (DGE) has already been generated and the genes for the two species are separated by a prefix. For mixed human/mouse samples the prefixes are hg_ and mm_ respectively. Assuming that the DGE has been loaded in dge.matrix,
# The object containing the sample
mms <- new("MixedSpeciesSample", 
           species1="human", 
           species2="mouse",
           dge=dge.matrix)The number of cells and genes are stored and automatically updated while using dropbead’s functions
length(mms@cells) # number of cells in the sample## [1] 1261length(mms@genes) # number of genes detected## [1] 42194Some of the barcodes might correspond to the same cells (essentially after using the DetectBeadSynthesisErrors Drop-seq tool) and have to be collapsed. Such cells are identified by
listCellsToCollapse(mms)## [[1]]
## [1] "ACCAGGAGCAAA" "ACCAGGAGCAAN"
## 
## [[2]]
## [1] "GCGGTTATCAGC" "GCGGTTATCAGN"
## 
## [[3]]
## [1] "TCAAGCATAATN" "TCAAGCATAATT"
## 
## [[4]]
## [1] "TCACTGACAAAA" "TCACTGACAAAN"
## 
## [[5]]
## [1] "CTTCCCCGCAAA" "CTTCCCCGCAAN"In the example above there are 5 pairs of cells which need to be collapsed. This is straightforward with using
mms <- collapseCellsByBarcode(mms)We can verify that the number of cells is now the correct one
length(mms@cells)## [1] 1256Calling classifyCellsAndDoublets separates the species and returns a data.frame with number of transcripts per cell per species. The threshold controls the purity of the resulting cells, while min.trans is the minimum number of UMIs required to keep a cell.
head(classifyCellsAndDoublets(mms, 
                              threshold = 0.9, 
                              min.trans = 1000))##           cell human mouse species
## 1 TTGAGACGGATA 33587   486   human
## 2 TTGCAGCCACCT 32838   458   human
## 3 GCTTCAGACAGA 31838   420   human
## 4 CTCTTCATCGAG 28476   296   human
## 5 TAACACACCGTG   511 32516   mouse
## 6 AGCTGCCGTTTG   546 31010   mouseThe output of classifyCellsAndDoublets can be directly send for plotting
plotCellTypes(classifyCellsAndDoublets(mms, threshold = 0.9, min.trans = 5000))The number of genes per cell is computed via
head(computeGenesPerCell(mms))##          cells counts species
## 1 TTGAGACGGATA   7600   human
## 2 TTGCAGCCACCT   7582   human
## 3 GCTTCAGACAGA   7554   human
## 4 CTCTTCATCGAG   7263   human
## 5 GCTTACTATCCG   6830   human
## 6 AACTGGTTCGTC   6685   humanand similarly for transcripts (UMIs) with the computeTranscriptsPerCell function. Their output can be send directly for plotting and visualization
plotViolin(computeTranscriptsPerCell(mms), 
           attribute = "UMIs")plotHistogram(computeTranscriptsPerCell(mms), 
              attribute = 'UMIs')## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.The above functions are polymorphic and can be also used for SingleSpeciesSample objects. Splitting the samples is always performed by internal functions automatically. If the user wants to restrict to only one species, this is done by the splitMixedSpeciesSampleToSingleSpecies function. which returns a list of the two SingleSpeciesSample objects.
# Extracting the human cells as a separate sample for further analysis
h <- splitMixedSpeciesSampleToSingleSpecies(mms,
                                            threshold = 0.9)[[1]]
class(h)## [1] "SingleSpeciesSample"
## attr(,"package")
## [1] "dropbead"There are a couple of functions to remove low quality cells and genes, such as
h.f1 <- keepBestCells(h, num.cells = 100) # keep only the top 100 cells
h.f2 <- keepBestCells(h, min.num.trans = 1000) # keep cells with at least 1000 UMIs
h.f3 <- removeLowQualityCells(h, min.genes = 2000) # remove cells which don't express at least 2000 genes
h.f4 <- removeLowQualityGenes(h, min.cells = 3) # remove genes which are not expressed in at least 3 cellswith obvious usage.
Reproducibility and correlations of different samples are easily assessed via the compareGeneExpressionLevels function, for instance,
compareGeneExpressionLevels(h.f2, h.f1, 
                            name1 = 'Drop-seq with >= 1000 UMIs per cell',
                            name2 = 'Drop-seq only 100 cells',
                            method = 'pearson')If bulk data is available then its correlation pwith the Drop-seq sample is directly assessed with the function compareSingleCellsAgainstBulk. Note that bulk has to be in a 1-column data.frame format with genes as the rownames and RPKM values (as default, raw counts are also accepted)
compareSingleCellsAgainstBulk(h.f1, 
                              log2(h.f2@dge[, 1, drop=F]+1))It is instructive to assess the mitochondrial content of the single cell samples. This is done as below and the percentages can be sent for plotting
head(computeMitochondrialPercentage(h), 5)## TTGAGACGGATA TTGCAGCCACCT GCTTCAGACAGA CTCTTCATCGAG GCTTACTATCCG 
##     3.227439     2.512333     3.813054     3.827785     3.051626plotMitochondrialContent(list(100-computeMitochondrialPercentage(h.f1),
                              100-computeMitochondrialPercentage(h.f3)),
                         log_scale = FALSE)## No id variables; using all as measure variablesDropbead offers also an implementation of the algorithm described in Macosko et. al. 2015 for classification of the cell cycle phases. The following R-packages xlsx, rJava and xlsxjars are required for this
phases <- assignCellCyclePhases(h)