Computational analysis toolkit for exploring sparse and large-scale single-cell data

Supervisors: Dr Supat Thongjuea and Prof Claus Nerlov

Since the first protocol was developed in 2009 [Tang et al., Nat Methods, 2009], single-cell transcriptome profiling has served as a sophisticated method for studying cellular heterogeneity and discovering novel cell populations in diverse cell types and organisms. Numerous single-cell RNA sequencing protocols have since been developed. These include the commercially available microfluidics-based Fluidigm C1 platform and plate-based methods (e.g. SMART-seq [Ramskold et al., Nature Biotechnology, 2012] and CEL-seq [Hashimshony et al., Cell Reports, 2012]). These methods have been widely used to profile single-cell transcriptomes in various research areas, but they are difficult to scale up when thousands of cells are required. To overcome this problem, approaches such as Drop-seq [Macosko et al., Cell, 2015], inDrop [Klein et al., Cell, 2015], and the commercially available 10x Genomics platform [Zheng et al., Nature Communications, 2017] provide scalable platforms for profiling the gene expression of hundreds to millions of cells within a few days of library preparation. However, these approaches produce data with very high dropout rates, sparse coverage due to shallow sequencing depth, and large, high-dimensional data matrices. Analysing and visualising such sparse, large-scale single-cell data is therefore a challenge for computational biologists. Currently, only a few tools (e.g. Seurat, Cell Ranger, and Loupe) are available to support the analysis of this type of data.

This project will develop computational approaches and a visualisation system for analysing sparse and large-scale single-cell RNA-seq data. We aim to incorporate numerous computational analyses into a single framework. These consist of (1) data normalisation and batch-effect removal, (2) identification of meaningful or highly variable genes by modelling the dropout rate or coefficient of variation against the average gene expression, (3) cell subpopulation identification using combinations of dimensionality reduction techniques and clustering approaches (e.g. PCA, t-SNE, HCL, k-means, and KNN graph), (4) differential gene expression analysis based on both expression level and the frequency of expressing cells, for global and local comparisons across the identified subpopulations, (5) gene set enrichment analysis, (6) discovery of cell-surface markers for any identified subpopulation, and (7) a powerful visualisation system comprising a cell explorer and numerous types of plots for statistical analyses. The challenge of the project is to develop a tool that uses little memory and runs quickly. We aim to provide an easy-to-use tool that runs on a personal computer and whose functionality helps to deliver interpretable biological meaning.
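As an illustration of steps (1) and (2), the sketch below shows one possible way to normalise counts and rank genes by variability while touching only the non-zero entries of the count matrix. It is a minimal, dependency-free example under stated assumptions and not the method the project will necessarily adopt. The (gene, cell, count) triple representation, the function names, the default scale factor of 10,000, and the use of a straight-line fit of log(CV) against log(mean) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalise_by_library_size(entries, scale=10_000):
    """Library-size normalisation followed by log1p.

    `entries` is a list of (gene, cell, count) triples holding only the
    non-zero counts (a hypothetical sparse representation).
    """
    totals = defaultdict(float)
    for _, cell, count in entries:
        totals[cell] += count
    return [(gene, cell, math.log1p(count / totals[cell] * scale))
            for gene, cell, count in entries]


def gene_statistics(entries, n_cells):
    """Per-gene mean, coefficient of variation (CV) and dropout rate.

    Zeros are never stored, so sums run over non-zero entries only and
    are divided by the total number of cells (population variance).
    Genes with no stored entries do not appear in the result.
    """
    sums = defaultdict(float)
    square_sums = defaultdict(float)
    detected = defaultdict(int)
    for gene, _, value in entries:
        sums[gene] += value
        square_sums[gene] += value * value
        detected[gene] += 1

    stats = {}
    for gene in sums:
        mean = sums[gene] / n_cells
        variance = max(square_sums[gene] / n_cells - mean * mean, 0.0)
        cv = math.sqrt(variance) / mean if mean > 0 else 0.0
        stats[gene] = {
            "mean": mean,
            "cv": cv,
            "dropout": 1.0 - detected[gene] / n_cells,
        }
    return stats


def rank_variable_genes(stats):
    """Rank genes by how far their CV lies above a fitted mean-CV trend.

    The trend is an ordinary least-squares line of log(CV) on log(mean),
    a deliberately simple stand-in for a proper mean-variance model.
    Genes with zero mean or zero CV are left out of the ranking.
    """
    usable = [g for g, s in stats.items() if s["mean"] > 0 and s["cv"] > 0]
    if not usable:
        return []
    xs = [math.log(stats[g]["mean"]) for g in usable]
    ys = [math.log(stats[g]["cv"]) for g in usable]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = y_bar - slope * x_bar
    residuals = {g: y - (intercept + slope * x)
                 for g, x, y in zip(usable, xs, ys)}
    return sorted(residuals, key=residuals.get, reverse=True)
```

A typical call sequence would be normalise_by_library_size, then gene_statistics, then rank_variable_genes, keeping the top-ranked genes for downstream dimensionality reduction. Because only non-zero entries are visited, the cost scales with the number of detected counts and not with genes times cells, which is the kind of saving that matters for very sparse droplet data.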

The project offers an excellent opportunity for training in the computational biology of single-cell data from experts in the field. There will also be opportunities to collaborate with stem cell biology groups, applying the developed computational approaches to study the heterogeneity of haematopoietic stem cells.

This project will be based in the MRC WIMM Centre for Computational Biology at the MRC Weatherall Institute of Molecular Medicine, with access to state-of-the-art facilities. In addition to the training opportunities available through the University, the WIMM runs a course of approximately 20 lectures on basic techniques for new students. Institute seminars are held weekly and regularly attract world-class scientists in haematopoiesis research. Informal exchange of ideas in the coffee area is encouraged and is an attractive feature of the WIMM.

For further information, please contact Dr Supat Thongjuea.