GCBLANE
This research project involved the development of GCBLANE, the Graph-enhanced Convolutional Bidirectional LSTM (Long Short-Term Memory) Attention Network, a deep learning model for predicting Transcription Factor Binding Sites (TFBS) in genomic sequences.
Transcription Factor Binding Sites are short DNA sequences where proteins called transcription factors (TFs) bind to DNA. Transcription factors are valuable because they can serve as potential drug targets. Existing TFBS prediction methods face challenges in accuracy and efficiency due to the large volume of data and complex binding patterns, and often struggle to capture all relevant features of TFBS from DNA sequences. Hence, researchers have turned to machine learning for accurate and fast TFBS prediction.
For this project I built a novel deep learning model, GCBLANE, which surpasses the performance of existing state-of-the-art models while using a lower parameter count, making it the more efficient model as well.
The dataset used is the standard benchmark employed in various other research papers: the 690 ChIP-seq datasets, which together contain more than 25 million DNA sequences.
Since the neural network needs numeric inputs, the DNA data, a string of A, G, T, and C characters, had to be transformed. For the sequential branch, the DNA sequence was one-hot encoded. For the graph branch, I used De Bruijn graphs. In a De Bruijn graph, each node represents a k-mer, a substring of length k from the input sequence, and edges connect consecutive k-mers, which overlap by k−1 characters. This representation helps the model capture relationships between neighbouring subsequences.
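The two encodings can be sketched in a few lines of plain Python. This is an illustrative sketch, not the exact GCBLANE preprocessing code; the function names and the choice of k are my own for demonstration.

```python
# Illustrative sketch of the two DNA encodings: one-hot vectors for the
# sequential branch, De Bruijn k-mer edges for the graph branch.
# (Hypothetical helper names; not the project's actual pipeline.)

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot vectors."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def de_bruijn_edges(seq, k=3):
    """Nodes are k-mers; an edge links each pair of consecutive k-mers,
    which overlap by k-1 characters."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return list(zip(kmers, kmers[1:]))

print(one_hot("ACG"))
print(de_bruijn_edges("ACGTAC", k=3))
```

For a real dataset these per-sequence edge lists would then be converted into the adjacency and node-feature matrices that a graph library such as Spektral expects.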
To implement the model I used the TensorFlow, Keras, and Spektral libraries. The entire project, including data processing and model training, was carried out on Google Colab.
GCBLANE consists of various blocks such as Convolutional, Recurrent, Graph, and Attention blocks.
The Convolutional Blocks consist of Convolution, PReLU, Spatial Dropout, Pooling, and Normalization layers. The PReLU activation function extends LeakyReLU by making the negative slope (α) a learnable parameter, thereby addressing some limitations of the standard ReLU function. Spatial Dropout was specifically chosen because it randomly drops entire feature maps instead of individual neurons, discouraging the network from relying on any specific feature map.
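The PReLU behaviour described above is easy to see in a tiny NumPy sketch. Here α is just a plain float for demonstration; in the actual Keras layer it is a trainable parameter updated by backpropagation.

```python
import numpy as np

# Sketch of the PReLU activation: identical to LeakyReLU, except that
# the negative slope alpha is learnable (shown here as a fixed float).
def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(prelu(x, alpha=0.25))  # negative inputs are scaled by alpha
```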
The Recurrent Blocks consist of BiLSTM and LSTM layers. The BiLSTM processes the input sequence in both forward and backward directions, capturing dependencies from both ends of the sequence. Its output is then fed to a regular LSTM, which further refines the learned features. Together these layers improve the model's ability to handle complex sequence data.
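The bidirectional mechanics can be sketched with a toy recurrent cell. A simple tanh RNN cell stands in for the LSTM here purely to keep the example short; the point is that the sequence is processed forward and backward and the two hidden states are concatenated, so every position sees both left and right context.

```python
import numpy as np

# Toy sketch of bidirectional processing (a tanh RNN cell stands in
# for the LSTM; the bidirectional wiring is the same idea).
def rnn_pass(xs, W, U, h0):
    h, out = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h)  # simple recurrent update
        out.append(h)
    return out

def bidirectional(xs, W, U, hidden=2):
    h0 = np.zeros(hidden)
    fwd = rnn_pass(xs, W, U, h0)          # left-to-right pass
    bwd = rnn_pass(xs[::-1], W, U, h0)[::-1]  # right-to-left pass
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
xs = [rng.standard_normal(4) for _ in range(5)]   # 5 steps, 4-dim input
W, U = rng.standard_normal((2, 4)), rng.standard_normal((2, 2))
out = bidirectional(xs, W, U)
print(len(out), out[0].shape)  # 5 time steps, each 4-dim (2 fwd + 2 bwd)
```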
The Attention Block consists of a multi-head attention layer, which allows the model to focus on different parts of the input sequence simultaneously. By using multiple heads (8 for GCBLANE), the model can learn different aspects of the input data, capturing a wider range of dependencies. The final output is a richer, more detailed representation of the sequence.
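Each head computes scaled dot-product attention; a single head can be sketched in NumPy as below. This is the standard attention formula, not GCBLANE-specific code, and the dimensions are arbitrary choices for the example.

```python
import numpy as np

# Scaled dot-product attention, the operation inside each head of a
# multi-head attention layer (single head shown; GCBLANE uses 8).
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # query-key similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over keys
    return w @ V                                     # weighted mix of values

rng = np.random.default_rng(1)
Q = K = V = rng.standard_normal((6, 8))  # self-attention: 6 positions, dim 8
out = attention(Q, K, V)
print(out.shape)  # (6, 8): one attended vector per position
```

In the multi-head layer, several such heads run in parallel on learned projections of the input, and their outputs are concatenated.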
The Graph Block consists of graph convolution layers and MinCut pooling layers. The graph convolution layer applies filters to nodes and their neighbours, aggregating information from adjacent nodes to learn representations that capture local structure. After convolution, MinCut pooling reduces the graph size by clustering nodes while preserving the overall structure.
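A single graph convolution step can be sketched as a normalized neighbourhood aggregation. This follows the standard Kipf-Welling GCN formulation as an illustration of what the graph block computes conceptually; it is not the project's Spektral code, and the tiny graph and weights are made up.

```python
import numpy as np

# One graph convolution step: each node's new features are a
# degree-normalized aggregation of its own and its neighbours' features.
def graph_conv(A, X, W):
    A_hat = A + np.eye(A.shape[0])        # add self-loops
    d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W  # normalize, aggregate, project

# 3-node path graph (e.g. three consecutive k-mers), 2 features per node
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.eye(2)  # identity projection, to keep the example transparent
H = graph_conv(A, X, W)
print(H.shape)  # (3, 2): same nodes, transformed features
```

MinCut pooling would then cluster these nodes into a smaller set of "super-nodes", coarsening the graph while keeping its overall connectivity.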
Finally, the outputs of the graph and sequential branches are concatenated and passed to fully connected layers to make the TFBS prediction.
The bar chart visualised below quantifies GCBLANE's performance on various classification metrics.
Below is the confusion matrix, demonstrating GCBLANE's excellent predictive capabilities.
Models | All Datasets | Small Datasets | Medium Datasets | Large Datasets |
---|---|---|---|---|
GCBLANE | 0.943 | 0.904 | 0.930 | 0.973 |
MSDenseNet | 0.933 | 0.897 | 0.921 | 0.973 |
MAResNet | 0.927 | 0.883 | 0.914 | 0.972 |
SAResNet | 0.920 | 0.876 | 0.907 | 0.966 |
The next table compares GCBLANE with other state-of-the-art methods on the 165-dataset subset of the original 690 datasets. Note that some of these methods use multimodal approaches, such as incorporating DNA shape features.
Models | Accuracy | ROC AUC | PR AUC | DNA Information Type |
---|---|---|---|---|
GCBLANE | 0.887 | 0.9495 | 0.949 | Sequence Only |
BERT-TFBS | 0.851 | 0.919 | 0.920 | Sequence Only |
TBCA | 0.823 | 0.894 | 0.899 | Sequence & Shape |
DSAC | 0.816 | 0.887 | 0.883 | Sequence & Shape |
DeepSTF | 0.814 | 0.883 | 0.890 | Sequence & Shape |

In summary, GCBLANE is a novel architecture that applies deep learning to improve Transcription Factor Binding Site prediction. This project deepened my understanding of machine learning and bioinformatics, demonstrating my ability to tackle complex problems and innovate solutions in AI-driven research. I gained valuable skills in model design, data optimisation, and the integration of cutting-edge technologies.