How to Benchmark vLLM Offline Inference

vLLM is a high-performance library for LLM inference and serving. To benchmark vLLM's offline inference performance, follow these structured steps to ensure accurate and comprehensive results.

Introduction to vLLM

vLLM is an efficient, high-performance inference and serving engine designed for large language models (LLMs). It is optimized for fast token generation, low latency, and high throughput, making it ideal for both offline inference and real-time deployment in production environments.

The benchmark_throughput.py script in the vLLM GitHub repository is the built-in tool for measuring vLLM's offline inference performance. It reports metrics such as token throughput (tokens/s) and request throughput (requests/s) when running inference on large language models.

Purpose of Throughput Benchmark

1. Measures offline inference throughput (the Throughput Benchmark), i.e. the performance of vLLM running locally on a single machine.

2. Calls the vLLM inference engine directly, excluding HTTP/gRPC serving overhead (a minimal sketch of such a direct call follows this list).

3. Is suitable for pure inference performance testing, such as evaluating the impact of different batch sizes and KV cache configurations.
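To make the second point concrete, here is a minimal sketch of what calling the engine directly looks like, using vLLM's offline LLM API. The model id is only an example; substitute any model you have locally or can pull from Hugging Face.

from vllm import LLM, SamplingParams

# Example model id; substitute any Hugging Face model you have access to.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", dtype="bfloat16")

# Generate up to 128 tokens per prompt, directly through the engine (no HTTP server involved).
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one sentence."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)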

Prerequisites

Ubuntu 20.04/22.04 (recommended).

Python 3.8+ and pip installed.

NVIDIA GPU with CUDA 11.8+ support.

NVIDIA drivers installed (tested with Driver 535+); a quick way to verify the GPU setup from Python follows this list.
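Before installing vLLM, it is worth confirming that the GPU and CUDA stack are visible from Python. The short check below assumes PyTorch (a vLLM dependency) is already installed:

import torch

# Standard PyTorch calls that confirm the driver and CUDA runtime are usable.
print(torch.cuda.is_available())       # should print True
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))   # name of GPU 0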

Steps to Benchmark vLLM Offline Inference Using benchmark_throughput.py

Step 1 - Clone and Set Up the vLLM Repository

If you haven't already set up vLLM, first clone the repository and install dependencies.

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

Alternatively, if you only need vLLM without modifying the source code, install it via pip:

pip install vllm

Note: For a more detailed walkthrough of installing vLLM, please refer to How to Install and Use vLLM.
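Whichever installation route you take, a quick way to confirm it worked is to import the package and print its version:

import vllm

print(vllm.__version__)   # prints the installed vLLM version if the install succeeded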

Step 2 - Locate the Benchmark Script

The benchmarking script is located in the benchmarks/ directory:

cd benchmarks

Step 3 - Run the Benchmark

Sample 1. Single-Prompt, Long-Generation Benchmark:

python benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --num-prompts 1 --input-len 64 --output-len 512

This simulates a chat-style workload: a single request at a time, with a 64-token prompt generating 512 tokens. You can replace meta-llama/Llama-2-7b-hf with any model of your choice, such as deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
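If you want to compare several request counts in one go, a small Python wrapper around the script works well. The sketch below is only illustrative: it assumes it is launched from the benchmarks/ directory and reuses the flags from the command above.

import subprocess

# Sweep over the number of prompts to see how batching affects throughput.
for num_prompts in (1, 8, 32):
    cmd = [
        "python", "benchmark_throughput.py",
        "--model", "meta-llama/Llama-2-7b-hf",
        "--input-len", "64",
        "--output-len", "512",
        "--num-prompts", str(num_prompts),
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)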

Step 4 - Example Benchmark Configurations

Sample 2. Let's evaluate the inference performance of the DeepSeek-R1-Distill-Qwen-1.5B model with vLLM as the backend. The run processes 50 prompts in total, each with an input length of 64 tokens and generating 128 new tokens, with a maximum model sequence length of 2048 tokens. The model runs in bfloat16 precision to reduce memory usage, and a fixed random seed (2025) makes the run reproducible. The benchmark reports throughput in requests per second and tokens per second, giving insight into the model's efficiency in offline inference scenarios.

python benchmark_throughput.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --backend vllm --dtype bfloat16 --input-len 64 --output-len 128 --max-model-len 2048 --seed 2025 --num-prompts 50

After running the script, you'll see output similar to:

Processed prompts:
100%|█████████████████████████████████████████████████████| 50/50 [00:01<00:00, 43.55it/s, est. speed input: 2787.07 toks/s, output: 5574.11 toks/s]
Throughput: 42.33 requests/s, 8127.28 total tokens/s, 5418.19 output tokens/s
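If you capture this output (for example by redirecting it to a file or running the script through subprocess), the final summary line can be pulled apart with a simple regular expression. The pattern below matches the format shown above; the exact wording can vary between vLLM versions, so treat it as a sketch.

import re

line = "Throughput: 42.33 requests/s, 8127.28 total tokens/s, 5418.19 output tokens/s"

match = re.search(
    r"Throughput: ([\d.]+) requests/s, ([\d.]+) total tokens/s, ([\d.]+) output tokens/s",
    line,
)
if match:
    requests_per_s, total_tokens_per_s, output_tokens_per_s = (float(g) for g in match.groups())
    print(requests_per_s, total_tokens_per_s, output_tokens_per_s)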

Key parameters:

--model : Specifies the model to use.

--num-prompts : Total number of prompts to process.

--backend {vllm,hf,mii} : Inference backend to benchmark (vLLM, Hugging Face Transformers, or DeepSpeed-MII).

--dtype {auto,half,float16,bfloat16,float,float32} : Data type for model weights and activations.

--input-len INPUT_LEN Input prompt length for each request

--output-len OUTPUT_LEN Output length for each request. Overrides the output length from the dataset.

--max-model-len MAX_MODEL_LEN Model context length. If unspecified, will be automatically derived from the model config.

--max-num-seqs MAX_NUM_SEQS Maximum number of sequences per iteration.

--n N Number of generated sequences per prompt.

--dataset DATASET Path to the dataset. The dataset is expected to be a JSON file in the form of a list.
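For machine-readable results, the script also accepts an --output-json flag (see the CLI reference below) that writes the benchmark metrics to a JSON file. The exact field names vary between vLLM versions, so the loader below simply prints whatever keys are present; the file name results.json is just an example.

import json

# Assumes the benchmark was run with:  ... --output-json results.json
with open("results.json") as f:
    results = json.load(f)

for key, value in results.items():
    print(f"{key}: {value}")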

Additional - benchmark_throughput.py CLI Reference

Available Arguments:

usage: benchmark_throughput.py [-h] [--backend {vllm,hf,mii}] [--dataset DATASET] [--input-len INPUT_LEN] [--output-len OUTPUT_LEN]
                               [--n N] [--num-prompts NUM_PROMPTS] [--hf-max-batch-size HF_MAX_BATCH_SIZE]
                               [--output-json OUTPUT_JSON] [--async-engine] [--disable-frontend-multiprocessing]
                               [--lora-path LORA_PATH] [--model MODEL]
                               [--task {auto,generate,embedding,embed,classify,score,reward,transcription}] [--tokenizer TOKENIZER]
                               [--skip-tokenizer-init] [--revision REVISION] [--code-revision CODE_REVISION]
                               [--tokenizer-revision TOKENIZER_REVISION] [--tokenizer-mode {auto,slow,mistral,custom}]
                               [--trust-remote-code] [--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH]
                               [--download-dir DOWNLOAD_DIR]
                               [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}]
                               [--config-format {auto,hf,mistral}] [--dtype {auto,half,float16,bfloat16,float,float32}]
                               [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}] [--max-model-len MAX_MODEL_LEN]
                               [--guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}]
                               [--logits-processor-pattern LOGITS_PROCESSOR_PATTERN] [--model-impl {auto,vllm,transformers}]
                               [--distributed-executor-backend {ray,mp,uni,external_launcher}]
                               [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE] [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                               [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS] [--ray-workers-use-nsight]
                               [--block-size {8,16,32,64,128}] [--enable-prefix-caching | --no-enable-prefix-caching]
                               [--disable-sliding-window] [--use-v2-block-manager] [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS]
                               [--seed SEED] [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB]
                               [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
                               [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
                               [--max-num-partial-prefills MAX_NUM_PARTIAL_PREFILLS]
                               [--max-long-partial-prefills MAX_LONG_PARTIAL_PREFILLS]
                               [--long-prefill-token-threshold LONG_PREFILL_TOKEN_THRESHOLD] [--max-num-seqs MAX_NUM_SEQS]
                               [--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
                               [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
                               [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA] [--hf-overrides HF_OVERRIDES]
                               [--enforce-eager] [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE] [--disable-custom-all-reduce]
                               [--tokenizer-pool-size TOKENIZER_POOL_SIZE] [--tokenizer-pool-type TOKENIZER_POOL_TYPE]
                               [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
                               [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT] [--mm-processor-kwargs MM_PROCESSOR_KWARGS]
                               [--disable-mm-preprocessor-cache] [--enable-lora] [--enable-lora-bias] [--max-loras MAX_LORAS]
                               [--max-lora-rank MAX_LORA_RANK] [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
                               [--lora-dtype {auto,float16,bfloat16}] [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]
                               [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras] [--enable-prompt-adapter]
                               [--max-prompt-adapters MAX_PROMPT_ADAPTERS] [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]
                               [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}] [--num-scheduler-steps NUM_SCHEDULER_STEPS]
                               [--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]]
                               [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR] [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
                               [--speculative-model SPECULATIVE_MODEL]
                               [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
                               [--num-speculative-tokens NUM_SPECULATIVE_TOKENS] [--speculative-disable-mqa-scorer]
                               [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                               [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
                               [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
                               [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                               [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
                               [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
                               [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]
                               [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
                               [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]
                               [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG] [--ignore-patterns IGNORE_PATTERNS]
                               [--preemption-mode PREEMPTION_MODE] [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]
                               [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]
                               [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT] [--collect-detailed-traces COLLECT_DETAILED_TRACES]
                               [--disable-async-output-proc] [--scheduling-policy {fcfs,priority}] [--scheduler-cls SCHEDULER_CLS]
                               [--override-neuron-config OVERRIDE_NEURON_CONFIG] [--override-pooler-config OVERRIDE_POOLER_CONFIG]
                               [--compilation-config COMPILATION_CONFIG] [--kv-transfer-config KV_TRANSFER_CONFIG]
                               [--worker-cls WORKER_CLS] [--generation-config GENERATION_CONFIG]
                               [--override-generation-config OVERRIDE_GENERATION_CONFIG] [--enable-sleep-mode]
                               [--calculate-kv-scales] [--additional-config ADDITIONAL_CONFIG] [--disable-log-requests]