I'm using llama.cpp and LlamaIndex to run a query-engine chatbot. The text it needs to query over is very short, yet each generation takes about 2 minutes. I'm running a 7B-parameter GGUF model locally. Is this normal? #LlamaIndex #chatbots #help
I tried this again on a bigger GPU with all the layers offloaded to it, and it was spiffy. Solved.
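For anyone hitting the same slowdown: a minimal sketch of what "offload all the layers" looks like, assuming the `llama-index-llms-llama-cpp` integration, a GPU-enabled build of `llama-cpp-python`, and a hypothetical local GGUF path. The key is `n_gpu_layers=-1`; the default of 0 runs everything on the CPU, which explains multi-minute generations even for a 7B model.

```python
# Sketch: running a GGUF model through LlamaIndex's llama.cpp wrapper
# with every layer offloaded to the GPU. Assumes llama-cpp-python was
# built with CUDA/Metal support; the model path below is hypothetical.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/my-7b-model.Q4_K_M.gguf",  # hypothetical local GGUF
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    # n_gpu_layers=-1 tells llama.cpp to offload all layers to the GPU;
    # leaving it at the default (0) keeps inference fully on the CPU.
    model_kwargs={"n_gpu_layers": -1},
    verbose=True,
)

response = llm.complete("Summarize the document in one sentence.")
print(response.text)
```

With `verbose=True`, the llama.cpp load logs should report how many layers were actually offloaded, which is a quick way to confirm the GPU is being used at all.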