
I am working on a RAG app where I use LLMs to analyze various documents, and I'm looking to improve the UX by streaming responses in real time.
Here is a snippet of my code:

from vllm import LLM, SamplingParams

params = SamplingParams(temperature=TEMPERATURE,
                        min_tokens=128,
                        max_tokens=1024)
llm = LLM(MODEL_NAME,
          tensor_parallel_size=4,
          dtype="half",
          gpu_memory_utilization=0.5,
          max_model_len=27_000)

message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"

response = llm.generate(message, params)

In its current form, the generate method waits until the entire response has been generated before returning. I'd like to change this so that the response is streamed and displayed incrementally to the user, enhancing interactivity.

I was using vllm==0.5.0.post1 when I first wrote that code.

Does anyone have experience with implementing streaming for LLMs with vLLM? Guidance or examples would be appreciated!

1 Answer


AsyncLLMEngine will help you. Its generate method is an async generator, so you can iterate over partial outputs as tokens are produced instead of waiting for the full response.

You can also refer to vLLM's api_server.py for a reference implementation of streaming over HTTP.
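
A minimal sketch of how your snippet could be adapted, assuming vllm==0.5.x and reusing your MODEL_NAME, TEMPERATURE, SYSTEM_PROMPT, question, and document (the async API may differ slightly in other versions):

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

# AsyncLLMEngine takes the same model/parallelism settings, passed via AsyncEngineArgs
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_NAME,
                    tensor_parallel_size=4,
                    dtype="half",
                    gpu_memory_utilization=0.5,
                    max_model_len=27_000))

async def stream_answer(question, document):
    params = SamplingParams(temperature=TEMPERATURE,
                            min_tokens=128,
                            max_tokens=1024)
    message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
    printed = ""
    # generate yields RequestOutput objects; each one holds the text produced so far
    async for output in engine.generate(message, params, request_id=random_uuid()):
        text = output.outputs[0].text
        print(text[len(printed):], end="", flush=True)  # show only the newly generated part
        printed = text

asyncio.run(stream_answer(question, document))

In a web app you would yield the new text chunks from a streaming endpoint (e.g. a FastAPI StreamingResponse) instead of printing them, which is essentially what vllm/entrypoints/api_server.py does.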


2 Comments

This answer would be more helpful if it contained more detail and was not just a link. Please see meta.stackexchange.com/questions/8231/…
As links are not guaranteed to stay online or unchanged, it is good practice to summarize the main points of the link in the answer, so that the answer remains usable even if the link changes.
