Purpose of the model
The purpose of the model is to build a small-scale LLM from scratch (it doesn't need to be as good as other LLMs) in order to understand the concepts behind coding an LLM.
Expected Working
Due to resource constraints, the model is only expected to produce meaningful generations.
Problem
The problem is that the current model was not able to achieve this. The generations were not meaningful with respect to either the input or the context, and I have not been able to figure out what is actually causing the problem.
Model
My Model – llmwithtransformer
My tries
My first mistake was using the default Adam optimizer and cross-entropy loss directly. Since transformers require some modifications to these, I changed them with the help of GPT and the resources available from PyTorch and TensorFlow. Although this made a significant improvement (the model now generates random text that reads as meaningful on its own, but whose meaning does not align with the input or context at all), the output is still not related to the input.
I am stuck here.
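For context on the optimizer/loss changes described above, a typical transformer training setup swaps plain Adam for AdamW (decoupled weight decay), adds a learning-rate warmup schedule, and masks padding tokens in the cross-entropy loss. The following is only a minimal PyTorch sketch of that pattern; the model, sizes, and hyperparameters are illustrative assumptions, not taken from llmwithtransformer:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the real model (NOT the actual llmwithtransformer code).
vocab_size, d_model, pad_id = 100, 32, 0
embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(block.parameters()) + list(head.parameters())

# AdamW is the usual choice for transformers instead of plain Adam.
optimizer = torch.optim.AdamW(params, lr=3e-4, weight_decay=0.01)

# Linear warmup followed by inverse-sqrt decay; transformers are
# sensitive to the learning-rate schedule early in training.
warmup_steps = 10
def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return (warmup_steps ** 0.5) * ((step + 1) ** -0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Cross-entropy that skips padding positions and applies label smoothing.
criterion = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)

# One dummy next-token-prediction step to show the shapes involved.
tokens = torch.randint(1, vocab_size, (2, 8))      # (batch, seq)
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one position
logits = head(block(embed(inputs)))                # (batch, seq-1, vocab)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
scheduler.step()
```

This only sketches the optimizer/loss side of training; it does not address generation quality by itself.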
Expecting
- The points where I have made mistakes, with corrections (not necessarily in code).
- Suggestions for improvement.