How to train a GNN on huge datasets?

I have a dataset of shape (783126, 3) with three columns, R1, R2 and Score, representing Residue 1, Residue 2 and their score. I want to train a Graph Neural Network on this data so that I can represent it as a graph and let the model learn the important connections. I have converted this into a classification problem: residue pairs with a Score of 0.75 or higher are Class 0, pairs with a Score of at least 0.5 but below 0.75 are Class 1, pairs with a Score of at least 0.25 but below 0.5 are Class 2, and pairs with a Score below 0.25 are Class 3. I have two questions:

  1. I have trained my GNN model, but from the very first epoch it shows an accuracy of 99%. I am not sure whether the model is actually learning something or just overfitting the data. How can I tell whether the model is working correctly?

  2. I would also like to include the protein sequence in training, so that the model sees the actual residues and can learn a relation between the sequence and the graph. I was thinking of using a separate LSTM model for the sequence and then concatenating the final layers of the LSTM and the GNN, but I am not sure how to do that given the size of my dataset.

This is my code for the GNN model:

    import pandas as pd
    import tensorflow as tf
    from tensorflow.keras.layers import Input, Dense, Dropout
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam
    from sklearn.preprocessing import OneHotEncoder

    data = pd.read_csv("score_train.csv")

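    # Bins are left-closed (right=False):
    # [-1, 0.25) -> 3, [0.25, 0.5) -> 2, [0.5, 0.75) -> 1, [0.75, 1) -> 0
    class_intervals = [-1, 0.25, 0.5, 0.75, 1]
    class_labels = [3, 2, 1, 0]

    data['label'] = pd.cut(data['Score'], bins=class_intervals, labels=class_labels, right=False)

    # A Score of exactly 1 falls outside the last left-closed bin and becomes NaN; assign it to class 0
    data['label'] = data['label'].fillna(0)

    X_train = data[['R1', 'R2']].values
    # Convert the categorical labels to a plain integer array
    y_train = data['label'].astype(int).to_numpy()

    # One-hot encode the labels (scikit-learn >= 1.2; older versions use sparse=False)
    encoder = OneHotEncoder(sparse_output=False)
    y_train_encoded = encoder.fit_transform(y_train.reshape(-1, 1))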

    num_classes = len(class_labels)  # Number of unique classes

    # Two dense hidden layers on the raw (R1, R2) position pair, softmax over the four classes
    inputs = Input(shape=(2,))
    x = Dense(64, activation='relu')(inputs)
    x = Dropout(0.5)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.5)(x)
    outputs = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs=inputs, outputs=outputs)
    optimizer = Adam(learning_rate=0.01)
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=['accuracy'])

    if tf.test.gpu_device_name():
        print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
    else:
        print("GPU not found. Using CPU.")

    # model training
    model.fit(X_train, y_train_encoded, epochs=100, batch_size=32)

This is the training outcome:

    Default GPU Device: /device:GPU:0
    Epoch 1/100
    24473/24473 [==============================] - 87s 3ms/step - loss: 0.0403 - accuracy: 0.9970
    Epoch 2/100
    24473/24473 [==============================] - 85s 3ms/step - loss: 0.0206 - accuracy: 0.9971
    Epoch 3/100
    24473/24473 [==============================] - 83s 3ms/step - loss: 0.0206 - accuracy: 0.9971
    Epoch 4/100
    24473/24473 [==============================] - 82s 3ms/step - loss: 0.0206 - accuracy: 0.9971
    Epoch 5/100
    24473/24473 [==============================] - 83s 3ms/step - loss: 0.0206 - accuracy: 0.9971

  • Would you provide an example of your training data, and explain what ‘residue pairs having a score of xx’ means?

  • This is what the dataset looks like:

        R1, R2, Score
        1, 2, 0.506187708548924
        1, 3, 0.506269142446582
        1, 4, 0.504631180036872
        1, 5, 0.506832536930239
        2, 3, 0.505828667681825
        2, 4, 0.505164861252758
        2, 5, 0.506038987576509
        2, 6, 0.50563413544391

    I am calculating the mutual information between two residues. Columns 1 and 2 are the residue positions in the protein sequence, and the Score column is the mutual information between them.
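To make the "data as a graph" idea concrete, here is a minimal sketch that treats each row as an undirected edge between positions R1 and R2, weighted by Score. It uses only the eight sample rows quoted in this comment. The names `adjacency` and `edge_index` are illustrative, and treating the rows as undirected edges is an assumption based on mutual information being symmetric.

```python
import numpy as np

# The eight sample rows quoted in the comment: (R1, R2, Score)
rows = [
    (1, 2, 0.506187708548924),
    (1, 3, 0.506269142446582),
    (1, 4, 0.504631180036872),
    (1, 5, 0.506832536930239),
    (2, 3, 0.505828667681825),
    (2, 4, 0.505164861252758),
    (2, 5, 0.506038987576509),
    (2, 6, 0.50563413544391),
]

r1 = np.array([r[0] for r in rows])
r2 = np.array([r[1] for r in rows])
score = np.array([r[2] for r in rows])

# Residue positions are 1-based, so shift them to 0-based node indices
num_nodes = int(max(r1.max(), r2.max()))
adjacency = np.zeros((num_nodes, num_nodes))
adjacency[r1 - 1, r2 - 1] = score
adjacency[r2 - 1, r1 - 1] = score  # assumption: undirected (symmetric) edges

# The same edges as a (2, num_edges) index array, a layout some GNN libraries use
edge_index = np.stack([r1 - 1, r2 - 1])

print(adjacency.shape, edge_index.shape)
```

In this form the nodes are residue positions and the scores are edge weights, which is the kind of input a graph layer would consume, as opposed to feeding each (R1, R2) pair to a dense layer as an independent sample.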

  • I don’t know what protein sequence residues are or how the mutual information between them is calculated, but if you have a dense neural network learning a mapping from two integer columns to one float variable, it’s very possible that 99% accuracy is correct. To start with, calculate the F1 score instead of accuracy.
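A minimal sketch of why the F1 score can be more informative than accuracy here. The labels below are hypothetical and deliberately imbalanced (the real class distribution is not given in the question); the "model" simply predicts the majority class every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 997 samples of class 1 and one each of classes 0, 2 and 3
y_true = np.array([1] * 997 + [0, 2, 3])

# A stand-in "model" that always predicts the majority class
y_pred = np.full_like(y_true, 1)

acc = accuracy_score(y_true, y_pred)
# Macro averaging weights every class equally, so ignored classes pull the score down
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.3f}, macro F1={macro_f1:.3f}")
```

With the Keras model from the question, the predicted classes would come from something like `model.predict(X).argmax(axis=1)`. Comparing the class counts in the labels (for example with `np.bincount`) against the reported accuracy shows whether the model beats an always-predict-the-majority baseline.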

  • Train on about 70% of your data, then use the model to predict on the remaining 30% and score those predictions to get a more reliable estimate, again using the F1 score.
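The 70/30 split suggested above could be sketched as follows. The arrays are random placeholders standing in for the (R1, R2) pairs and integer class labels, and `stratify=y` is an added assumption so that rare classes appear in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the (R1, R2) pairs and integer class labels
rng = np.random.default_rng(0)
X = rng.integers(1, 100, size=(1000, 2))
y = np.array([1] * 940 + [0] * 20 + [2] * 20 + [3] * 20)

# 70/30 split; stratify=y keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```

The model would then be fitted on `X_tr` only (optionally passing the held-out part as `validation_data` to `model.fit`), and the F1 score computed on predictions for `X_te`.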
