LLM prompt eval time is fast for similar prompts but slow for different prompts

I am using the LangChain llama.cpp integration to run a large language model (LLM) locally. When I ask the model a question with a long prompt, the response time is normal. However, if I then ask a question whose prompt is similar to the first one, with only minor changes, the response comes back much faster, because the prompt eval time is much lower. If I instead make significant changes to the prompt, the response time goes back to normal.

To understand why this is happening, I have tried the following:

I have checked the documentation for the llama.cpp library, but I could not find anything that explains this behavior.
I have tried different prompt lengths and different degrees of change to the prompt, but the behavior is consistent; a minimal timing script is included below.
I expected the prompt eval time to be roughly the same regardless of how similar a prompt is to the previous one.
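
For reference, here is a rough, minimal sketch of how I am measuring this. The model path and the prompts are placeholders, not my real data; with `verbose=True`, llama.cpp prints a timing summary after each call, including the prompt eval time.

```python
import time

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/model.gguf",  # placeholder path to a local GGUF model
    n_ctx=2048,
    verbose=True,  # llama.cpp prints its timing breakdown after each call
)

def timed_invoke(prompt: str) -> float:
    """Return wall-clock seconds for a single completion."""
    start = time.perf_counter()
    llm.invoke(prompt)
    return time.perf_counter() - start

# A long prompt, then the same prompt with a minor change appended,
# then a completely different prompt.
long_prompt = "Summarize the following meeting notes: " + "lorem ipsum " * 200

print(f"first (long) prompt:    {timed_invoke(long_prompt):.2f}s")
print(f"minor change appended:  {timed_invoke(long_prompt + ' Keep it brief.'):.2f}s")
print(f"completely new prompt:  {timed_invoke('Write a haiku about mountains.'):.2f}s")
```

With this script, the second call consistently reports a much lower prompt eval time than the first, while the third call is back to normal.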

  • It would be better if you provided the code so we can test it.
