I am using LangChain with llama-2-13B. I have set up Llama 2 on an AWS machine with 240 GB of RAM and 4x 16 GB Tesla V100 GPUs. Inference currently takes around 20 s per request, and I want to bring that down to roughly 8-10 s so it feels real-time. The output quality is also poor: for a simple query like "Hi, how are you?" the model generates a 500-word paragraph. How can I improve the output? I am currently using this configuration:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = LlamaCpp(
    model_path=path,          # path to the local model file
    temperature=0.7,
    max_tokens=800,           # maximum number of tokens to generate
    top_p=0.1,
    top_k=40,
    n_threads=4,              # CPU threads
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
    n_ctx=2000,               # context window size in tokens
    n_gpu_layers=80,          # layers offloaded to the GPUs
    n_batch=2048,             # batch size for prompt processing
)
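For context, this is roughly how I call the model through LangChain (a minimal sketch; the prompt template below is illustrative, not my exact prompt):

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Illustrative prompt; my real prompt is similar but longer.
template = """Question: {question}
Answer:"""

prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(prompt=prompt, llm=llm)

# Even a short greeting like this streams for ~20 s and produces ~500 words.
print(chain.run("Hi, how are you?"))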