The advent of single-cell transcriptomics has greatly improved our understanding of the heterogeneity of gene expression across cell types, with important applications in developmental biology and cancer research. Single-cell RNA sequencing datasets based on tags called unique molecular identifiers (UMIs) typically include a large number of zeroes. In such datasets, genes with even moderate expression may be poorly represented in sequencing count matrices. Standard pipelines often retain only a small subset of genes for further analysis. In contrast, we address the problem of estimating relative expression across the entire transcriptome by adopting a multivariate lognormal Poisson count model. We propose empirical Bayes procedures to estimate latent cell-cell correlations and to recover meaningful estimates for genes with low expression. For small groups of cells, an importance sampling procedure uses the full cell-cell correlation structure and is computationally feasible. For larger datasets, we propose a gene-level shrinkage procedure that performs favorably for datasets with approximately compound symmetric cell-cell correlation. A fast procedure that incorporates matrix approximations is also promising and is extensible to very large datasets. We apply our approaches to simulated and real datasets, and demonstrate favorable performance in comparisons with competing normalization approaches. We further illustrate applications of our approach in downstream analyses, including cell-type clustering and identification.
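
To make the count model concrete, the following is a minimal simulation sketch of a multivariate lognormal Poisson model with compound symmetric cell-cell correlation. It illustrates the data-generating assumption only, not the estimation procedures proposed here. The function name, the parameter values, and the per-cell depth factor are assumptions introduced for illustration.

```python
import numpy as np


def simulate_lognormal_poisson(n_genes=200, n_cells=30, rho=0.4, sigma=1.0, seed=0):
    """Simulate a genes-by-cells count matrix (illustrative assumptions only).

    For each gene, the latent log-expression vector across cells is
    multivariate normal with a compound symmetric cell-cell correlation
    matrix; observed counts are Poisson given the latent values.
    """
    rng = np.random.default_rng(seed)

    # Compound symmetric correlation: 1 on the diagonal, rho elsewhere.
    # Positive definite when -1/(n_cells - 1) < rho < 1.
    corr = (1.0 - rho) * np.eye(n_cells) + rho * np.ones((n_cells, n_cells))
    chol = np.linalg.cholesky(sigma**2 * corr)

    # Hypothetical gene-level mean log expression (low means give many zeroes).
    gene_mean = rng.normal(loc=-1.0, scale=1.5, size=n_genes)

    # Each row of `latent` has covariance chol @ chol.T = sigma^2 * corr.
    latent = gene_mean[:, None] + rng.standard_normal((n_genes, n_cells)) @ chol.T

    # Hypothetical per-cell sequencing depth factor.
    depth = rng.uniform(0.5, 2.0, size=n_cells)

    counts = rng.poisson(depth[None, :] * np.exp(latent))
    return counts, latent, corr


counts, latent, corr = simulate_lognormal_poisson()
zero_fraction = float((counts == 0).mean())
```

With low gene-level means, a large share of the simulated entries are zero, which mirrors the sparsity of UMI count matrices described above.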