
I am working on a RAG app where I use LLMs to analyze various documents, and I'm looking to improve the UX by streaming responses in real time.
Here is a snippet of my code:

from vllm import LLM, SamplingParams

params = SamplingParams(temperature=TEMPERATURE,
                        min_tokens=128,
                        max_tokens=1024)
llm = LLM(MODEL_NAME,
          tensor_parallel_size=4,
          dtype="half",
          gpu_memory_utilization=0.5,
          max_model_len=27_000)

message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"

response = llm.generate(message, params)

In its current form, the generate method waits until the entire response has been generated before returning. I'd like to change this so that the response is streamed and displayed incrementally to the user, enhancing interactivity.

I was using vllm==0.5.0.post1 when I first wrote that code.

Does anyone have experience with implementing streaming for LLMs with vLLM? Guidance or examples would be appreciated!

1 Answer


AsyncLLMEngine will help you. Its generate method is an async generator, so you can iterate over partial outputs as tokens are produced instead of waiting for the full response.

You can also refer to vLLM's api_server.py for a reference implementation of streaming over HTTP.
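
A minimal sketch of how your snippet could be adapted, assuming vllm==0.5.x and reusing your MODEL_NAME, TEMPERATURE, SYSTEM_PROMPT, question, and document (the async API may differ slightly in other versions):

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

# AsyncLLMEngine takes the same model/parallelism settings, passed via AsyncEngineArgs
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model=MODEL_NAME,
                    tensor_parallel_size=4,
                    dtype="half",
                    gpu_memory_utilization=0.5,
                    max_model_len=27_000))

async def stream_answer(question, document):
    params = SamplingParams(temperature=TEMPERATURE,
                            min_tokens=128,
                            max_tokens=1024)
    message = SYSTEM_PROMPT + "\n\n" + f"Question: {question}\n\nDocument: {document}"
    printed = ""
    # generate yields RequestOutput objects; each one holds the text produced so far
    async for output in engine.generate(message, params, request_id=random_uuid()):
        text = output.outputs[0].text
        print(text[len(printed):], end="", flush=True)  # show only the newly generated part
        printed = text

asyncio.run(stream_answer(question, document))

In a web app you would yield the new text chunks from a streaming endpoint (e.g. a FastAPI StreamingResponse) instead of printing them, which is essentially what vllm/entrypoints/api_server.py does.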


2 Comments

This answer would be more helpful if it contained more detail and was not just a link. Please see meta.stackexchange.com/questions/8231/…
As links are not guaranteed to stay online or unchanged, it is good practice to summarize the main points of the link in the answer, so that the answer remains usable even if the link changes.
