Applied Statistical Genomics Group – Dr Christopher Yau, WTCHG

Our research focuses on novel statistical methodologies for single cell genomics data analysis.
Research Highlights:

Zero-inflated single cell gene expression analysis

Single cell gene expression data contain a high proportion of zero measurements as a result of technical factors (dropout) or genuine null gene expression in individual cells. This zero-inflation is not accounted for in the classical statistical techniques that are currently applied to single cell data. We have shown that failure to account for zero-inflation can distort both univariate analyses (e.g. differential expression) and multivariate analyses (e.g. dimensionality reduction and clustering). In response to these findings, we have developed statistical techniques to account for zero-inflation, particularly in the mathematically challenging multivariate context. We have established zero-inflated variants of the classic principal components and factor analysis methods and are exploring other areas in which to incorporate our zero-inflation models.

Pierson and Yau (2015). ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology 16:241
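The distortion described above can be illustrated with a small simulation. This is a minimal sketch and not the ZIFA implementation: it assumes a dropout probability that decays with the squared expression level, exp(-decay * x^2), and the function name simulate_dropout, the decay value and the simulated two-gene data are all illustrative choices.

```python
import numpy as np


def simulate_dropout(x, decay, rng):
    """Zero out each entry of x with probability exp(-decay * x**2).

    Assumed dropout model for illustration: low expression values are
    far more likely to be recorded as zero than high ones.
    """
    p_drop = np.exp(-decay * x ** 2)
    keep = rng.random(x.shape) >= p_drop
    return x * keep


rng = np.random.default_rng(0)
n_cells = 5000

# Simulated latent log-expression for two positively correlated genes.
latent = rng.multivariate_normal(
    mean=[2.0, 2.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=n_cells
)
latent = np.clip(latent, 0.0, None)  # expression cannot be negative

# Apply dropout independently to every measurement.
observed = simulate_dropout(latent, decay=0.3, rng=rng)

true_corr = np.corrcoef(latent.T)[0, 1]
obs_corr = np.corrcoef(observed.T)[0, 1]
zero_frac = np.mean(observed == 0)

print(f"fraction of zeros after dropout: {zero_frac:.2f}")
print(f"gene-gene correlation before dropout: {true_corr:.2f}")
print(f"gene-gene correlation after dropout:  {obs_corr:.2f}")
```

In this simulation the correlation computed on the observed values is weaker than the correlation between the latent values, which is the kind of distortion that a method ignoring zero-inflation would carry into dimensionality reduction or clustering.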

Hierarchical Clustering of Single Cell Transcriptional Profiles

Advances in single cell genomics provide a way of routinely generating transcriptomics data at the single cell level. A frequent requirement of single cell expression experiments is the identification of novel patterns of heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical statistical analysis tools have been routinely applied to single cell data, but there is considerable scope for the development of novel statistical approaches that are better adapted to the challenges of inferring cellular hierarchies. Here, we present a novel integration of principal components analysis and hierarchical clustering to create a framework for characterising cell state identity. Our methodology uses agglomerative clustering to generate a cell state hierarchy in which each cluster branch is associated with a principal component of variation that can be used to differentiate two cellular states. Using real single cell datasets, we demonstrate that this approach allows for consistent clustering of single cell transcriptional profiles across multiple scales of interpretation.

Zurauskiene and Yau (2016). pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles. BMC Bioinformatics 17:140 (URL: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0984-y)
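The idea of pairing each merge in a cluster hierarchy with the removal of one principal component can be sketched as follows. This is a simplified illustration and not the published pcaReduce code: the helper names (pca_scores, kmeans, merge_hierarchy), the farthest-point k-means initialisation and the nearest-centroid merge rule are assumptions made to keep the example short and deterministic.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project the centred data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def kmeans(Y, k, n_iter=50):
    """Plain Lloyd k-means with a deterministic farthest-point start."""
    centres = [Y[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(Y[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = Y[labels == j].mean(axis=0)
    return labels


def merge_hierarchy(X, k_init):
    """Return {number of clusters: labels} for every level of the hierarchy.

    Start with k_init clusters in k_init - 1 principal components, then
    repeatedly merge the two clusters with the closest centroids and
    drop the lowest-variance component that is still in use.
    """
    Y = pca_scores(X, k_init - 1)
    labels = kmeans(Y, k_init)
    levels = {len(np.unique(labels)): labels.copy()}
    while len(np.unique(labels)) > 1:
        ids = np.unique(labels)
        centres = np.array([Y[labels == i].mean(axis=0) for i in ids])
        d = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        np.fill_diagonal(d, np.inf)
        a, b = np.unravel_index(np.argmin(d), d.shape)
        labels[labels == ids[b]] = ids[a]  # merge cluster b into cluster a
        if Y.shape[1] > 1:
            Y = Y[:, :-1]  # drop the last (lowest-variance) component
        levels[len(ids) - 1] = labels.copy()
    return levels


# Simulated data: three cell states in 20 dimensions, two of them close.
rng = np.random.default_rng(1)
state_means = np.zeros((3, 20))
state_means[1, 0] = 20.0
state_means[2, 1] = 60.0
truth = np.repeat([0, 1, 2], 40)
X = state_means[truth] + rng.normal(size=(120, 20))

levels = merge_hierarchy(X, k_init=4)
for k in sorted(levels, reverse=True):
    print(k, "clusters, sizes:", np.bincount(levels[k])[np.unique(levels[k])])
```

Reading the returned levels from many clusters down to one gives the multi-scale view described above: the same cells are grouped at a fine resolution first and then at progressively coarser ones.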

Pseudotime inference from single cell gene expression profiles

Single-cell genomics has revolutionised modern biology while requiring the development of advanced computational and statistical methods. Advances have been made in uncovering gene expression heterogeneity, discovering new cell types, and identifying genes and transcription factors involved in cellular processes. One approach to the analysis is to construct pseudotime orderings of cells as they progress through a particular biological process, such as the cell cycle or differentiation. These methods assign a score, known as the pseudotime, to each cell as a surrogate measure of progression. However, all methods published to date are purely algorithmic and provide no way to quantify the uncertainty in the pseudotime assigned to a cell. Here we present a method that combines Gaussian Process Latent Variable Models (GP-LVM) with a recently published electroGP prior to perform Bayesian inference on the pseudotimes. We go on to show that the posterior variability in these pseudotimes leads to nontrivial uncertainty in the pseudo-temporal ordering of the cells and that pseudotimes should not be thought of as point estimates.

Campbell and Yau (2015). Bayesian Gaussian Process Latent Variable Models for pseudotime inference in single-cell RNA-seq data. bioRxiv (URL: http://biorxiv.org/content/early/2015/09/15/026872)

Campbell and Yau (2016). Order under uncertainty: robust differential expression analysis using probabilistic models for pseudotime inference. bioRxiv (URL: http://biorxiv.org/content/early/2016/04/05/047365)
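To show why posterior variability matters for the ordering of cells, the sketch below summarises a set of posterior pseudotime samples. It does not fit a GP-LVM: the samples are synthetic stand-ins for the output of a Bayesian sampler, and the function names ordering_probability and credible_intervals, the four example cells and the noise level are hypothetical.

```python
import numpy as np


def ordering_probability(samples):
    """Estimate P(cell i precedes cell j) from posterior samples.

    samples has shape (n_samples, n_cells); entry [i, j] of the result
    is the fraction of samples in which cell i has the smaller pseudotime.
    """
    before = samples[:, :, None] < samples[:, None, :]
    return before.mean(axis=0)


def credible_intervals(samples, level=0.95):
    """Equal-tailed credible interval for each cell's pseudotime."""
    tail = (1.0 - level) / 2.0
    return np.quantile(samples, [tail, 1.0 - tail], axis=0)


# Synthetic posterior samples for four cells (an assumption, not real
# sampler output): cells 1 and 2 have nearly identical pseudotimes.
rng = np.random.default_rng(2)
centre = np.array([0.10, 0.45, 0.50, 0.90])
samples = centre + rng.normal(scale=0.05, size=(4000, 4))

prob = ordering_probability(samples)
intervals = credible_intervals(samples)

print("P(cell 0 before cell 3):", round(float(prob[0, 3]), 3))
print("P(cell 1 before cell 2):", round(float(prob[1, 2]), 3))
print("95% intervals (lower, upper) per cell:")
print(intervals.T.round(3))
```

Cells whose credible intervals overlap, such as cells 1 and 2 here, have an ordering probability well below one, so a single point-estimate ordering would overstate how confidently they can be ranked.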