vLLM is a high-performance inference and serving engine for large language models (LLMs). It is optimized for fast token generation, low latency, and high throughput, making it well suited to both offline inference and real-time deployment in production environments.
The benchmark_serving.py script in the vLLM GitHub repository is designed to benchmark the performance of vLLM's serving capabilities. Specifically, it helps users evaluate how well vLLM performs in handling LLM inference requests in a server environment.
If you're deploying an LLM with vLLM for low-latency, high-throughput inference, this script is crucial for:
Comparing vLLM against other inference frameworks.
Fine-tuning model deployment settings.
Evaluating inference speed before production deployment.
Ubuntu 20.04/22.04 (recommended).
Python 3.8+ and pip installed.
NVIDIA GPU with CUDA 11.8+ support.
NVIDIA drivers installed (tested with Driver 535+).
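Before installing anything, you can confirm the GPU, driver version, and the CUDA version the driver supports:

nvidia-smi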
If you haven't already set up vLLM, first clone the repository and install dependencies.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
Alternatively, if you only need vLLM without modifying the source code, install it via pip:
pip install vllm
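To confirm the installation succeeded, print the installed version (a minimal sanity check; the exact version string will depend on your environment):

python3 -c "import vllm; print(vllm.__version__)"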
Note: For a more comprehensive vLLM installation method, please refer to How to Install and Use vLLM.
Launch a vLLM inference server with the DeepSeek-R1-Distill-Qwen-7B model. The flags below cap the context length at 4096 tokens, reserve 16 GiB of CPU swap space for memory management, and disable per-request logging to reduce overhead:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --max-model-len 4096 --swap-space 16 --disable-log-requests
Output:
INFO 03-14 07:34:34 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-14 07:34:34 launcher.py:23] Available routes are:
INFO 03-14 07:34:34 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /health, Methods: GET
INFO 03-14 07:34:34 launcher.py:31] Route: /ping, Methods: GET, POST
INFO 03-14 07:34:34 launcher.py:31] Route: /tokenize, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /detokenize, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/models, Methods: GET
INFO 03-14 07:34:34 launcher.py:31] Route: /version, Methods: GET
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /pooling, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /score, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/score, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /rerank, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [19035]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
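Before benchmarking, confirm the server is reachable and the model is registered. Two quick checks against the routes listed above (assuming the default host and port from the log):

curl http://127.0.0.1:8000/v1/models

curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "prompt": "Hello", "max_tokens": 16}'

The first call lists the models the server is serving; the second sends a small test completion to the same /v1/completions endpoint the benchmark will hit.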
The benchmarking script is located in the benchmarks/ directory:
cd benchmarks
Note: Install the Python dependencies required by the benchmark script:
pip install pandas
pip install datasets
This test measures throughput, latency, and handling of concurrent requests in a vLLM server setup with the DeepSeek-R1-Distill-Qwen-7B model.
python3 benchmark_serving.py \
    --backend vllm \
    --base-url "http://127.0.0.1:8000" \
    --endpoint='/v1/completions' \
    --model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B' \
    --dataset-name random \
    --num-prompts 100 \
    --max-concurrency 5 \
    --request-rate inf \
    --random-input-len 64 \
    --random-output-len 128
After running the script, you'll see output similar to:
INFO 03-14 07:40:58 __init__.py:207] Automatically detected platform cuda.
Namespace(backend='vllm', base_url='http://127.0.0.1:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=5, model='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer=None, use_beam_search=False, num_prompts=100, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=64, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 5
100%|███████████████████████████████████████████████████████████████████████████████| 100/100 [01:00<00:00,  1.64it/s]
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  60.91
Total input tokens:                      6400
Total generated tokens:                  11679
Request throughput (req/s):              1.64
Output token throughput (tok/s):         191.76
Total Token throughput (tok/s):          296.84
---------------Time to First Token----------------
Mean TTFT (ms):                          58.59
Median TTFT (ms):                        51.46
P99 TTFT (ms):                           116.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.16
Median TPOT (ms):                        25.27
P99 TPOT (ms):                           25.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.18
Median ITL (ms):                         24.33
P99 ITL (ms):                            45.56
==================================================
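Reading the summary: 100 successful requests over a 60.91 s benchmark works out to about 1.64 req/s, the 11679 generated tokens to roughly 192 output tok/s, and the combined 18079 input and output tokens to roughly 297 tok/s total. The latency section breaks each request into time to first token (TTFT), time per output token after the first (TPOT), and inter-token latency (ITL), each reported as mean, median, and P99. The flags used in the benchmark command above are: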
--model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B': Specifies the model to use.
--num-prompts 100: Sends 100 prompts to the model for testing.
--backend vllm: Specifies vLLM as the inference backend.
--base-url "http://127.0.0.1:8000": Sets the URL of the vLLM server, which is running locally on port 8000.
--endpoint='/v1/completions': The API endpoint being tested, used for text completions.
--dataset-name random: Uses randomly generated prompts instead of a predefined dataset.
--max-concurrency 5: Simulates up to 5 concurrent requests at a time.
--request-rate inf: Generates requests as fast as possible (infinite request rate), stressing the server; see the example after this list for simulating a steadier arrival rate.
--random-input-len 64: Number of input tokens per request.
--random-output-len 128: Number of output tokens per request.
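The run above used --request-rate inf to saturate the server. To model steadier traffic, set a finite rate instead; with the default --burstiness 1.0, the script generates arrivals as a Poisson process, as the benchmark output above notes. For example, a sketch of the same workload replayed at roughly 2 requests per second (an illustrative value, adjust to your expected traffic):

python3 benchmark_serving.py \
    --backend vllm \
    --base-url "http://127.0.0.1:8000" \
    --endpoint='/v1/completions' \
    --model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B' \
    --dataset-name random \
    --num-prompts 100 \
    --request-rate 2 \
    --random-input-len 64 \
    --random-output-len 128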
To see the full list of supported options, run the script with --help:

usage: benchmark_serving.py [-h] [--backend {tgi,vllm,lmdeploy,deepspeed-mii,openai,openai-chat,tensorrt-llm,scalellm,sglang}]
                            [--base-url BASE_URL] [--host HOST] [--port PORT] [--endpoint ENDPOINT]
                            [--dataset-name {sharegpt,burstgpt,sonnet,random,hf}] [--dataset-path DATASET_PATH]
                            [--max-concurrency MAX_CONCURRENCY] --model MODEL [--tokenizer TOKENIZER] [--best-of BEST_OF]
                            [--use-beam-search] [--num-prompts NUM_PROMPTS] [--logprobs LOGPROBS]
                            [--request-rate REQUEST_RATE] [--burstiness BURSTINESS] [--seed SEED] [--trust-remote-code]
                            [--disable-tqdm] [--profile] [--save-result] [--metadata [KEY=VALUE ...]]
                            [--result-dir RESULT_DIR] [--result-filename RESULT_FILENAME] [--ignore-eos]
                            [--percentile-metrics PERCENTILE_METRICS] [--metric-percentiles METRIC_PERCENTILES]
                            [--goodput GOODPUT [GOODPUT ...]] [--sonnet-input-len SONNET_INPUT_LEN]
                            [--sonnet-output-len SONNET_OUTPUT_LEN] [--sonnet-prefix-len SONNET_PREFIX_LEN]
                            [--sharegpt-output-len SHAREGPT_OUTPUT_LEN] [--random-input-len RANDOM_INPUT_LEN]
                            [--random-output-len RANDOM_OUTPUT_LEN] [--random-range-ratio RANDOM_RANGE_RATIO]
                            [--random-prefix-len RANDOM_PREFIX_LEN] [--hf-subset HF_SUBSET] [--hf-split HF_SPLIT]
                            [--hf-output-len HF_OUTPUT_LEN] [--tokenizer-mode {auto,slow,mistral,custom}]
                            [--served-model-name SERVED_MODEL_NAME] [--lora-modules LORA_MODULES [LORA_MODULES ...]]
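Because every option is a command-line flag, comparisons are easy to script. The loop below is a minimal sketch that sweeps a few concurrency levels and persists each run with --save-result and --result-filename (both listed in the usage above); the concurrency values and filenames are arbitrary examples:

for c in 1 5 10 20; do
    python3 benchmark_serving.py \
        --backend vllm \
        --base-url "http://127.0.0.1:8000" \
        --endpoint='/v1/completions' \
        --model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B' \
        --dataset-name random \
        --num-prompts 100 \
        --max-concurrency "$c" \
        --request-rate inf \
        --random-input-len 64 \
        --random-output-len 128 \
        --save-result \
        --result-filename "benchmark_concurrency_${c}.json"
done

Each run writes its metrics to a JSON file, which makes it straightforward to compare results across concurrency levels before settling on production deployment settings.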