How to train GNN on huge datasets?

Question

I have a dataset of shape (783126, 3) with three columns, R1, R2 and Score, representing Residue 1, Residue 2 and the score of that residue pair. I want to train a Graph Neural Network on this data so that the data is represented as a graph and the model can learn the important connections. I have converted this into a classification problem: residue pairs with a Score of 0.75 or higher are tagged Class 0, pairs with a Score from 0.5 up to 0.75 are Class 1, pairs with a Score from 0.25 up to 0.5 are Class 2, and pairs with a Score below 0.25 are Class 3. I have two questions in this regard:

1. I have trained my GNN model, but from the very first epoch it shows an accuracy of 99%. I am not sure whether the model is actually learning something or just overfitting the data. How can I tell whether it is working correctly?
2. I would also like to include the protein sequence in training, so that the model sees the actual residues and can relate the sequence to what the GNN learns. I was thinking of using a separate LSTM model for the sequence and then concatenating the final layers of the LSTM and the GNN, but I am not sure how to do that given how large my dataset is.
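
The two-branch idea in the second question could be sketched roughly as below with the Keras functional API. This is only an illustration under stated assumptions: `SEQ_LEN`, `VOCAB_SIZE` and the layer sizes are hypothetical placeholders, the sequence is assumed to be integer-encoded and padded to a fixed length, and the residue-pair branch is a plain dense stack standing in for whatever graph model is used.

```python
from tensorflow.keras import layers, Model

SEQ_LEN = 128      # hypothetical: padded length of an integer-encoded sequence
VOCAB_SIZE = 21    # hypothetical: 20 amino acids plus one padding index
NUM_CLASSES = 4    # the four score classes from the question

# Sequence branch: embedding followed by an LSTM that summarizes the sequence
seq_in = layers.Input(shape=(SEQ_LEN,), dtype="int32", name="sequence")
s = layers.Embedding(input_dim=VOCAB_SIZE, output_dim=16)(seq_in)
s = layers.LSTM(32)(s)

# Residue-pair branch: stand-in for the graph model's output features
pair_in = layers.Input(shape=(2,), name="residue_pair")
p = layers.Dense(64, activation="relu")(pair_in)

# Concatenate the final representations of both branches and classify
merged = layers.Concatenate()([s, p])
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = Model(inputs=[seq_in, pair_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Training such a model would pass both inputs together, for example `model.fit([seq_array, pair_array], labels, batch_size=...)`, where `seq_array` and `pair_array` are hypothetical arrays with one row per residue pair.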

This is my code for the GNN model:

    import pandas as pd
    import tensorflow as tf
    from tensorflow.keras.layers import Input, Dense, Dropout
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam
    from sklearn.preprocessing import OneHotEncoder

    data = pd.read_csv("score_train.csv")

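    # Bin edges and the class assigned to each bin, lowest bin first
    class_intervals = [-1, 0.25, 0.5, 0.75, 1]
    class_labels = [3, 2, 1, 0]

    # right=False closes each bin on the left: [-1, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1)
    data['label'] = pd.cut(data['Score'], bins=class_intervals, labels=class_labels, right=False)

    # Scores outside every bin (e.g. exactly 1.0) come back as NaN; assign them to class 0
    data['label'] = data['label'].fillna(0)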

    X_train = data[['R1', 'R2']].values
    y_train = data['label'].values

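    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False
    encoder = OneHotEncoder(sparse=False)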
    y_train_encoded = encoder.fit_transform(y_train.reshape(-1, 1))

    num_classes = len(class_labels)  # Number of unique classes

    inputs = Input(shape=(2,))
    x = Dense(64, activation='relu')(inputs)
    x = Dropout(0.5)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.5)(x)
    outputs = Dense(num_classes, activation='softmax')(x)
 
    model = Model(inputs=inputs, outputs=outputs)
    optimizer = Adam(learning_rate=0.01)  # Update optimizer
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=['accuracy'])

    if tf.test.gpu_device_name():
        print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
    else:
        print("GPU not found. Using CPU.")

    # model training
    model.fit(X_train, y_train_encoded, epochs=100, batch_size=32)
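
For reference, the labelling rule that the `pd.cut` and `fillna(0)` lines implement can be written as a plain function. This is a minimal sketch assuming scores lie in [0, 1]; the name `score_to_class` is hypothetical and not part of the code above.

```python
def score_to_class(score: float) -> int:
    """Map a score in [0, 1] to the class described in the question."""
    if score >= 0.75:
        return 0  # includes a score of exactly 1.0, which fillna(0) handles above
    if score >= 0.5:
        return 1
    if score >= 0.25:
        return 2
    return 3

# Boundary values fall into the higher-score class (lower class number)
for s in (0.1, 0.25, 0.5, 0.75, 1.0):
    print(s, "->", score_to_class(s))
```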

This is the training outcome:

    Default GPU Device: /device:GPU:0
    Epoch 1/100
    24473/24473 [==============================] - 87s 3ms/step - loss: 0.0403 - accuracy: 0.9970
    Epoch 2/100
    24473/24473 [==============================] - 85s 3ms/step - loss: 0.0206 - accuracy: 0.9971
    Epoch 3/100
    24473/24473 [==============================] - 83s 3ms/step - loss: 0.0206 - accuracy: 0.9971
    Epoch 4/100
    24473/24473 [==============================] - 82s 3ms/step - loss: 0.0206 - accuracy: 0.9971
    Epoch 5/100
    24473/24473 [==============================] - 83s 3ms/step - loss: 0.0206 - accuracy: 0.9971
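
One quick sanity check for an accuracy like this is to compare it with the share of the most frequent class: a model that always predicts that class reaches exactly that accuracy without learning anything. The sketch below uses made-up toy labels, not the real data, and the helper name `majority_baseline` is hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Toy labels only: 997 rows of class 3, two of class 2, one of class 0
toy_labels = [3] * 997 + [2] * 2 + [0] * 1
label, baseline_acc = majority_baseline(toy_labels)
print(f"always predicting class {label} gives accuracy {baseline_acc:.3f}")
```

On the real data the same check could be run on `data['label']`; if the baseline is close to the reported 0.9971, the accuracy alone says little, and a held-out validation split with per-class metrics would be more informative.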
