I am using LangChain with llama-2-13B. I have set up Llama 2 on an AWS machine with 240 GB of RAM and 4x 16 GB Tesla V100 GPUs. Inference currently takes around 20 s per request, and I want to bring that down to roughly 8-10 s so it feels real-time. The output quality is also poor: for a simple query like "Hi, how are you?" the model generates a 500-word paragraph. How can I improve the output? I am currently using this configuration:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = LlamaCpp(
    model_path=path,          # path to the local model file
    temperature=0.7,
    max_tokens=800,           # maximum number of tokens to generate
    top_p=0.1,
    top_k=40,
    n_threads=4,              # CPU threads
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
    n_ctx=2000,               # context window size in tokens
    n_gpu_layers=80,          # layers offloaded to the GPUs
    n_batch=2048,             # batch size for prompt processing
)
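For context, this is roughly how I call the model through LangChain (a minimal sketch; the prompt template below is illustrative, not my exact prompt):

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Illustrative prompt; my real prompt is similar but longer.
template = """Question: {question}
Answer:"""

prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(prompt=prompt, llm=llm)

# Even a short greeting like this streams for ~20 s and produces ~500 words.
print(chain.run("Hi, how are you?"))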