How to Benchmark vLLM Online Serving

vLLM is a high-performance library for LLM inference and serving. This guide walks through benchmarking vLLM's online serving performance with the benchmark_serving.py script, step by step.

Introduction to vLLM

vLLM is an efficient, high-performance inference and serving engine designed for large language models (LLMs). It is optimized for fast token generation, low latency, and high throughput, making it ideal for both offline inference and real-time deployment in production environments.

The benchmark_serving.py script in the vLLM GitHub repository is designed to benchmark the performance of vLLM's serving capabilities. Specifically, it helps users evaluate how well vLLM performs in handling LLM inference requests in a server environment.

Why Use benchmark_serving.py?

If you're deploying an LLM with vLLM for low-latency, high-throughput inference, this script is crucial for:

Comparing vLLM against other inference frameworks.

Fine-tuning model deployment settings.

Evaluating inference speed before production deployment.

Prerequisites

Ubuntu 20.04/22.04 (recommended).

Python 3.8+ and pip installed.

NVIDIA GPU with CUDA 11.8+ support.

NVIDIA drivers installed (tested with Driver 535+).
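
Before proceeding, you can do a quick sanity check of the GPU setup. The following is a minimal sketch using PyTorch (pulled in automatically when vLLM is installed in the next step); it only prints version and device information:

import sys
import torch  # installed as a dependency of vLLM

# Quick environment check: Python, PyTorch, CUDA, and the visible GPU
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))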

Steps to Benchmark vLLM Online Inference Using benchmark_serving.py

Step 1 - Clone and Set Up the vLLM Repository

If you haven't already set up vLLM, first clone the repository and install dependencies.

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

Alternatively, if you only need vLLM without modifying the source code, install it via pip:

pip install vllm

Note: For more detailed installation instructions, refer to How to Install and Use vLLM.
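
Either way, you can confirm that the installation succeeded by printing the installed vLLM version:

python3 -c "import vllm; print(vllm.__version__)"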

Step 2 - Start the vLLM Server

Launch a vLLM inference server with the DeepSeek-R1-Distill-Qwen-7B model. The flags below cap the context length at 4,096 tokens, allocate 16 GiB of CPU swap space for memory management, and disable per-request logging to reduce overhead:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --max-model-len 4096 --swap-space 16 --disable-log-requests

Output:

INFO 03-14 07:34:34 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-14 07:34:34 launcher.py:23] Available routes are:
INFO 03-14 07:34:34 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 03-14 07:34:34 launcher.py:31] Route: /health, Methods: GET
INFO 03-14 07:34:34 launcher.py:31] Route: /ping, Methods: GET, POST
INFO 03-14 07:34:34 launcher.py:31] Route: /tokenize, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /detokenize, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/models, Methods: GET
INFO 03-14 07:34:34 launcher.py:31] Route: /version, Methods: GET
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /pooling, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /score, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/score, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /rerank, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 03-14 07:34:34 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [19035]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
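
Before benchmarking, you can confirm from a second terminal that the server is ready and that the model is registered. A minimal sketch using the /health and /v1/models routes listed above (it assumes the requests package is available in your environment):

import requests  # assumed to be available in your environment

BASE_URL = "http://127.0.0.1:8000"

# /health returns HTTP 200 once the server is ready to accept requests
health = requests.get(f"{BASE_URL}/health", timeout=5)
print("Health:", health.status_code)

# /v1/models lists the models the server is currently serving
models = requests.get(f"{BASE_URL}/v1/models", timeout=5).json()
print("Served models:", [m["id"] for m in models["data"]])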

Step 3 - Locate the Benchmark Script

With the server still running, change into the benchmarks/ directory of the cloned vLLM repository, where the benchmarking script is located:

cd benchmarks

Note - Install the Python packages required by the benchmark script:

pip install pandas
pip install datasets

Step 4 - Benchmark the Online Serving Throughput

This test measures throughput, latency, and concurrent-request handling for a vLLM server running the DeepSeek-R1-Distill-Qwen-7B model.

python3 benchmark_serving.py --backend vllm --base-url "http://127.0.0.1:8000" --endpoint='/v1/completions' --model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B' --dataset-name random --num-prompts 100 --max-concurrency 5 --request-rate inf --random-input-len 64 --random-output-len 128 

After running the script, you'll see output similar to:

INFO 03-14 07:40:58 __init__.py:207] Automatically detected platform cuda.
Namespace(backend='vllm', base_url='http://127.0.0.1:8000', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=5, model='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer=None, use_beam_search=False, num_prompts=100, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=64, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 5
100%|███████████████████████████████████████████████████████████████████████████████| 100/100 [01:00<00:00,  1.64it/s]
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  60.91
Total input tokens:                      6400
Total generated tokens:                  11679
Request throughput (req/s):              1.64
Output token throughput (tok/s):         191.76
Total Token throughput (tok/s):          296.84
---------------Time to First Token----------------
Mean TTFT (ms):                          58.59
Median TTFT (ms):                        51.46
P99 TTFT (ms):                           116.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.16
Median TPOT (ms):                        25.27
P99 TPOT (ms):                           25.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.18
Median ITL (ms):                         24.33
P99 ITL (ms):                            45.56
==================================================
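
TTFT is the time from sending a request to receiving its first output token, TPOT is the average time per subsequent output token, and ITL is the gap between consecutive tokens in the stream. The headline throughput figures follow directly from the raw counts in the report; the small differences below come from the duration being rounded to two decimals:

# Reproduce the throughput figures from the totals reported above
duration_s = 60.91        # Benchmark duration (s)
num_requests = 100        # Successful requests
input_tokens = 6400       # Total input tokens
output_tokens = 11679     # Total generated tokens

print(num_requests / duration_s)                    # ~1.64 req/s   (Request throughput)
print(output_tokens / duration_s)                   # ~191.7 tok/s  (Output token throughput)
print((input_tokens + output_tokens) / duration_s)  # ~296.8 tok/s  (Total token throughput)
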
Key parameters:

--model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B': Specifies the model to benchmark; it should match the model served in Step 2.

--num-prompts 100: Sends 100 prompts to the model for testing.

--backend vllm: Specifies vLLM as the inference backend.

--base-url "http://127.0.0.1:8000": Sets the URL of the vLLM server, which is running locally on port 8000.

--endpoint='/v1/completions': The completions API endpoint used for the benchmark requests.

--dataset-name random: Uses randomly generated prompts instead of a predefined dataset.

--max-concurrency 5: Simulates up to 5 concurrent requests at a time.

--request-rate inf: Issues requests as fast as possible (infinite request rate), so the concurrency limit is the only throttle on the server.

--random-input-len 64: Number of input tokens per request.

--random-output-len 128: Number of output tokens per request.
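
To keep results for later comparison, add --save-result to the command (optionally with --result-dir and --result-filename; see the CLI reference below), which writes the metrics to a JSON file. A minimal sketch for inspecting such a file afterwards, assuming it was saved as results.json (the exact key names can vary between script versions, so missing keys are simply skipped):

import json

# Load a report previously written with: --save-result --result-filename results.json
with open("results.json") as f:
    report = json.load(f)

# Print a few headline metrics if present; key names may differ by script version
for key in ("request_throughput", "output_throughput", "mean_ttft_ms", "mean_tpot_ms"):
    if key in report:
        print(f"{key}: {report[key]}")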

Additional - benchmark_serving.py CLI Reference

Available arguments (from python3 benchmark_serving.py --help):
usage: benchmark_serving.py [-h]
                            [--backend {tgi,vllm,lmdeploy,deepspeed-mii,openai,openai-chat,tensorrt-llm,scalellm,sglang}]
                            [--base-url BASE_URL] [--host HOST] [--port PORT] [--endpoint ENDPOINT]
                            [--dataset-name {sharegpt,burstgpt,sonnet,random,hf}] [--dataset-path DATASET_PATH]
                            [--max-concurrency MAX_CONCURRENCY] --model MODEL [--tokenizer TOKENIZER]
                            [--best-of BEST_OF] [--use-beam-search] [--num-prompts NUM_PROMPTS]
                            [--logprobs LOGPROBS] [--request-rate REQUEST_RATE] [--burstiness BURSTINESS]
                            [--seed SEED] [--trust-remote-code] [--disable-tqdm] [--profile] [--save-result]
                            [--metadata [KEY=VALUE ...]] [--result-dir RESULT_DIR]
                            [--result-filename RESULT_FILENAME] [--ignore-eos]
                            [--percentile-metrics PERCENTILE_METRICS] [--metric-percentiles METRIC_PERCENTILES]
                            [--goodput GOODPUT [GOODPUT ...]] [--sonnet-input-len SONNET_INPUT_LEN]
                            [--sonnet-output-len SONNET_OUTPUT_LEN] [--sonnet-prefix-len SONNET_PREFIX_LEN]
                            [--sharegpt-output-len SHAREGPT_OUTPUT_LEN] [--random-input-len RANDOM_INPUT_LEN]
                            [--random-output-len RANDOM_OUTPUT_LEN] [--random-range-ratio RANDOM_RANGE_RATIO]
                            [--random-prefix-len RANDOM_PREFIX_LEN] [--hf-subset HF_SUBSET] [--hf-split HF_SPLIT]
                            [--hf-output-len HF_OUTPUT_LEN] [--tokenizer-mode {auto,slow,mistral,custom}]
                            [--served-model-name SERVED_MODEL_NAME] [--lora-modules LORA_MODULES [LORA_MODULES ...]]
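
When tuning a deployment, a common pattern is to repeat the benchmark across several concurrency levels (or request rates) and compare the saved reports. Below is a minimal sketch that drives benchmark_serving.py as a subprocess using only the flags documented above; the output file names are illustrative:

import subprocess

# Run the random-prompt benchmark at several concurrency levels,
# saving one JSON report per run for later comparison.
for concurrency in (1, 5, 10, 20):
    subprocess.run(
        [
            "python3", "benchmark_serving.py",
            "--backend", "vllm",
            "--base-url", "http://127.0.0.1:8000",
            "--endpoint", "/v1/completions",
            "--model", "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
            "--dataset-name", "random",
            "--num-prompts", "100",
            "--max-concurrency", str(concurrency),
            "--request-rate", "inf",
            "--random-input-len", "64",
            "--random-output-len", "128",
            "--save-result",
            "--result-filename", f"concurrency_{concurrency}.json",
        ],
        check=True,
    )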