Modelling constant and cell-type-specific CTCF sites using Convolutional Neural Networks
Sanders E., Riva SG., Georgiades E., Baxter M., Hughes JR.
Understanding gene regulation is crucial to understanding human health and disease. One of the most important regulators of gene expression is the DNA-binding protein CTCF. Many CTCF binding sites are consistent across cell types; however, some are highly cell-type specific. The mechanisms by which CTCF binds different genomic sites in different cells are poorly understood. One challenge is the vast number of potential CTCF binding sites across the 3 billion base pairs of the human genome. Finding patterns in datasets of this size is difficult for humans, but may be amenable to modelling with convolutional neural networks (CNNs). We therefore designed and trained a CNN to model cell-type constant and cell-type-specific CTCF binding sites across 33 distinct human cell types. The model achieved micro- and macro-averaged AUCs of 0.91 and 0.90, respectively, demonstrating strong discriminative performance in predicting CTCF binding across the different cell types. To test the model further, we compared CTCF predictions for two cell types with highly different biological phenotypes (endothelial cells and neutrophils). Overall, the model attained 81% accuracy for endothelial cells and 77% for neutrophils. These results demonstrate that cell-type-specific CTCF binding sites can be predicted accurately from DNA sequence alone. We believe this model will have future applications in understanding CTCF-mediated gene regulation in healthy and diseased cell states.
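The micro- and macro-averaged AUC figures quoted above summarise a multi-label problem (one binding label per cell type per site) in two different ways. The sketch below illustrates the distinction in plain Python; it is not the authors' evaluation code. The function names and the four-site, two-cell-type toy data are assumptions made purely for illustration. Macro-averaging computes one AUC per cell type and takes the mean, whereas micro-averaging pools every (site, cell type) pair before computing a single AUC.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Mean of per-cell-type AUCs (rows are sites, columns are cell types)."""
    n_types = len(y_true[0])
    per_type = [
        roc_auc([row[j] for row in y_true], [row[j] for row in y_score])
        for j in range(n_types)
    ]
    return sum(per_type) / n_types


def micro_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Single AUC over all (site, cell type) pairs pooled together."""
    flat_labels = [v for row in y_true for v in row]
    flat_scores = [v for row in y_score for v in row]
    return roc_auc(flat_labels, flat_scores)


# Hypothetical toy data: 4 candidate sites x 2 cell types.
# Each cell type is ranked perfectly on its own, but the second
# cell type's scores sit on a lower scale than the first's.
y_true = [[1, 1], [1, 0], [0, 1], [0, 0]]
y_score = [[0.9, 0.4], [0.8, 0.1], [0.6, 0.3], [0.5, 0.2]]

print("macro AUC:", macro_auc(y_true, y_score))
print("micro AUC:", micro_auc(y_true, y_score))
```

On this toy data the macro average is perfect while the micro average is lower, because pooling penalises scores that are not comparable across cell types. Close micro and macro values, as reported here, suggest that performance is not dominated by a few cell types.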