SGLang is an open-source inference engine developed by the SGLang team to address these challenges. It optimizes the use of CPU and GPU resources during inference, achieving significantly higher throughput than many competing solutions. Its design reduces redundant computation and improves overall efficiency, making it easier for organizations to manage the complexities of LLM deployment.
SGLang addresses critical challenges in deploying large language models by optimizing the balance between CPU and GPU tasks.
RadixAttention minimizes redundant computations, improving throughput in conversational and retrieval scenarios.
A zero-overhead batch scheduler overlaps CPU scheduling with GPU operations to ensure continuous processing and reduce idle time.
A cache-aware load balancer efficiently predicts cache hit rates and routes requests, boosting overall performance and cache utilization.
Data parallelism attention reduces memory overhead and enhances decoding throughput for multi-head latent attention models.
The integration of xgrammar enables rapid generation of structured outputs, significantly improving processing speed for formats like JSON (a brief usage sketch follows this list).
SGLang’s practical benefits are demonstrated by its adoption in large-scale production environments, where it has contributed to substantial cost savings and performance improvements.
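To illustrate the structured-output feature mentioned above, here is a minimal sketch of requesting schema-constrained JSON through the OpenAI-compatible API. It assumes a server like the one launched later in this guide is running on port 30000; the schema name and fields are hypothetical, and the exact response_format support may vary by SGLang version.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Ask the server to constrain decoding to a JSON schema
# (enforced by the grammar backend, e.g. xgrammar).
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Give me the capital and population of France as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "country_info",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {
                    "capital": {"type": "string"},
                    "population": {"type": "integer"},
                },
                "required": ["capital", "population"],
            },
        },
    },
)
print(response.choices[0].message.content)  # a JSON string matching the schema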
Ubuntu 20.04/22.04 (recommended).
Python 3.8+ and pip installed.
NVIDIA GPU with CUDA 11.8+ and compute capability 7.0+.
NVIDIA drivers installed (tested with Driver 535+).
RAM: At least 32 GB system RAM.
Disk Space: Minimum of 50 GB, especially if storing large model weights.
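A quick way to confirm the GPU and disk requirements above is a short Python check. This is a minimal sketch; it assumes PyTorch is available (the SGLang install below pulls it in).

import shutil

import torch

# GPU check: CUDA must be available and compute capability should be 7.0 or higher.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
major, minor = torch.cuda.get_device_capability(0)
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, compute capability {major}.{minor}, "
      f"VRAM: {props.total_memory / 1024**3:.1f} GB")

# Disk check: at least 50 GB free is recommended for model weights.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")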
If you don’t have uv installed, use the following command:
curl -LsSf https://astral.sh/uv/install.sh | sh
After installing uv, you can create a new Python environment and install SGLang using the following commands:
# Creating virtual environment with seed packages at: sglang
uv venv sglang --python 3.12 --seed
source sglang/bin/activate
uv pip install "sglang[all]>=0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
To verify the installation, run:
python -c "import sglang; print(sglang.__version__)"
0.4.4.post1
Once SGLang is installed, you can deploy your LLM by following these steps:
# To download and run the DeepSeek-R1 model, execute:
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --trust-remote-code \
  --tp 2 \
  --enable-p2p-check \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.9
If the startup is successful, the output is similar to the following:
[2025-03-25 02:40:05] INFO: Started server process [175805]
[2025-03-25 02:40:05] INFO: Waiting for application startup.
[2025-03-25 02:40:05] INFO: Application startup complete.
[2025-03-25 02:40:05] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-03-25 02:40:06] INFO: 127.0.0.1:35042 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-03-25 02:40:06 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-25 02:40:09] INFO: 127.0.0.1:35052 - "POST /generate HTTP/1.1" 200 OK
[2025-03-25 02:40:09] The server is fired up and ready to roll!
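Before sending real traffic, you can confirm the server is reachable by querying the /get_model_info endpoint that appears in the startup log above. A minimal sketch using the requests library:

import requests

# Returns basic information about the loaded model (e.g. the model path).
resp = requests.get("http://localhost:30000/get_model_info", timeout=10)
resp.raise_for_status()
print(resp.json())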
The launch command starts an OpenAI-compatible API server that you can interact with over HTTP. You can test the API using curl:
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      { "role": "system", "content": "You are a helpful AI assistant" },
      { "role": "user", "content": "Who are you?" }
    ],
    "temperature": 0.6,
    "max_tokens": 1024
  }'
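Alongside the OpenAI-compatible routes, the startup log above also shows SGLang's native /generate endpoint being exercised. A sketch of calling it directly with the requests library; the sampling parameters shown are illustrative:

import requests

# SGLang's native completion endpoint takes a raw prompt plus sampling parameters.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0.6, "max_new_tokens": 64},
}
resp = requests.post("http://localhost:30000/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text"])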
You can also call the server with the official OpenAI Python client, or any other HTTP client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # match the port used when launching the server
    api_key="",  # no API key is required unless the server was started with --api-key
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
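Since the endpoint is OpenAI-compatible, streaming responses should also work through the same client. A minimal sketch, assuming the server from the earlier step is listening on port 30000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="")

# Stream tokens as they are generated instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Explain RadixAttention in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()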
The following command benchmarks the SGLang server launched above (serving DeepSeek-R1-Distill-Qwen-7B) using random input/output sequences of the specified lengths. It evaluates how the server handles up to 16 simultaneous connections, giving insight into throughput, latency, and behavior under concurrent load.
python3 -m sglang.bench_serving --backend sglang \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --port 30000 \
  --dataset-name random \
  --random-input 512 \
  --random-output 256 \
  --random-range-ratio 1.0 \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-concurrency 16
Note: --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json points the benchmark at a local dataset. This is useful if you later switch from random benchmarking to real-dataset benchmarking, although with --dataset-name random this dataset is not actually used.
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 16
Successful requests: 1000
Benchmark duration (s): 560.12
Total input tokens: 512000
Total generated tokens: 256000
Total generated tokens (retokenized): 255336
Request throughput (req/s): 1.79
Input token throughput (tok/s): 914.09
Output token throughput (tok/s): 457.05
Total token throughput (tok/s): 1371.14
Concurrency: 15.94
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8930.44
Median E2E Latency (ms): 8763.82
---------------Time to First Token----------------
Mean TTFT (ms): 529.18
Median TTFT (ms): 534.75
P99 TTFT (ms): 1026.99
---------------Inter-Token Latency----------------
Mean ITL (ms): 32.95
Median ITL (ms): 31.00
P95 ITL (ms): 36.71
P99 ITL (ms): 59.72
Max ITL (ms): 1392.52
==================================================
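The headline numbers are internally consistent, which is a quick sanity check worth running on any benchmark result; for example:

# Total token throughput ≈ (input tokens + output tokens) / duration
print((512000 + 256000) / 560.12)   # ≈ 1371 tok/s, matching the reported 1371.14

# Effective concurrency ≈ request throughput × mean end-to-end latency
print(1.79 * 8.93044)               # ≈ 16, matching --max-concurrency 16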
The sglang.launch_server command is used to launch the OpenAI-compatible server. Its full set of options is shown below:
usage: launch_server.py [-h] --model-path MODEL_PATH [--tokenizer-path TOKENIZER_PATH]
  [--host HOST] [--port PORT] [--tokenizer-mode {auto,slow}] [--skip-tokenizer-init]
  [--load-format {auto,pt,safetensors,npcache,dummy,gguf,bitsandbytes,layered}]
  [--trust-remote-code] [--dtype {auto,half,float16,bfloat16,float,float32}]
  [--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}]
  [--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,w8a8_int8,w8a8_fp8}]
  [--quantization-param-path QUANTIZATION_PARAM_PATH] [--context-length CONTEXT_LENGTH]
  [--device {cuda,xpu,hpu,cpu}] [--served-model-name SERVED_MODEL_NAME]
  [--chat-template CHAT_TEMPLATE] [--is-embedding] [--revision REVISION]
  [--mem-fraction-static MEM_FRACTION_STATIC] [--max-running-requests MAX_RUNNING_REQUESTS]
  [--max-total-tokens MAX_TOTAL_TOKENS] [--chunked-prefill-size CHUNKED_PREFILL_SIZE]
  [--max-prefill-tokens MAX_PREFILL_TOKENS] [--schedule-policy {lpm,random,fcfs,dfs-weight}]
  [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS] [--cpu-offload-gb CPU_OFFLOAD_GB]
  [--page-size PAGE_SIZE] [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
  [--stream-interval STREAM_INTERVAL] [--stream-output] [--random-seed RANDOM_SEED]
  [--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
  [--watchdog-timeout WATCHDOG_TIMEOUT] [--dist-timeout DIST_TIMEOUT]
  [--download-dir DOWNLOAD_DIR] [--base-gpu-id BASE_GPU_ID] [--gpu-id-step GPU_ID_STEP]
  [--log-level LOG_LEVEL] [--log-level-http LOG_LEVEL_HTTP] [--log-requests]
  [--log-requests-level {0,1,2}] [--show-time-cost] [--enable-metrics]
  [--decode-log-interval DECODE_LOG_INTERVAL] [--api-key API_KEY]
  [--file-storage-path FILE_STORAGE_PATH] [--enable-cache-report]
  [--reasoning-parser {deepseek-r1}] [--data-parallel-size DATA_PARALLEL_SIZE]
  [--load-balance-method {round_robin,shortest_queue}]
  [--expert-parallel-size EXPERT_PARALLEL_SIZE] [--dist-init-addr DIST_INIT_ADDR]
  [--nnodes NNODES] [--node-rank NODE_RANK]
  [--json-model-override-args JSON_MODEL_OVERRIDE_ARGS] [--lora-paths [LORA_PATHS ...]]
  [--max-loras-per-batch MAX_LORAS_PER_BATCH] [--lora-backend LORA_BACKEND]
  [--attention-backend {flashinfer,triton,torch_native}]
  [--sampling-backend {flashinfer,pytorch}] [--grammar-backend {xgrammar,outlines,llguidance}]
  [--enable-flashinfer-mla] [--flashinfer-mla-disable-ragged]
  [--speculative-algorithm {EAGLE,NEXTN}]
  [--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
  [--speculative-num-steps SPECULATIVE_NUM_STEPS]
  [--speculative-eagle-topk SPECULATIVE_EAGLE_TOPK]
  [--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS]
  [--speculative-accept-threshold-single SPECULATIVE_ACCEPT_THRESHOLD_SINGLE]
  [--speculative-accept-threshold-acc SPECULATIVE_ACCEPT_THRESHOLD_ACC]
  [--speculative-token-map SPECULATIVE_TOKEN_MAP] [--enable-double-sparsity]
  [--ds-channel-config-path DS_CHANNEL_CONFIG_PATH] [--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
  [--ds-heavy-token-num DS_HEAVY_TOKEN_NUM] [--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
  [--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD]
  [--disable-radix-cache] [--disable-cuda-graph] [--disable-cuda-graph-padding]
  [--enable-nccl-nvls] [--disable-outlines-disk-cache] [--disable-custom-all-reduce]
  [--disable-mla] [--disable-overlap-schedule] [--enable-mixed-chunk]
  [--enable-dp-attention] [--enable-ep-moe] [--enable-torch-compile]
  [--torch-compile-max-bs TORCH_COMPILE_MAX_BS] [--cuda-graph-max-bs CUDA_GRAPH_MAX_BS]
  [--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]] [--torchao-config TORCHAO_CONFIG]
  [--enable-nan-detection] [--enable-p2p-check] [--triton-attention-reduce-in-fp32]
  [--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
  [--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS]
  [--delete-ckpt-after-loading] [--enable-memory-saver] [--allow-auto-truncate]
  [--enable-custom-logit-processor] [--tool-call-parser {qwen25,mistral,llama3}]
  [--enable-hierarchical-cache] [--warmups WARMUPS]
  [--debug-tensor-dump-output-folder DEBUG_TENSOR_DUMP_OUTPUT_FOLDER]
  [--debug-tensor-dump-input-file DEBUG_TENSOR_DUMP_INPUT_FILE]
  [--debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT]
You have now installed and configured SGLang on Ubuntu 22.04. SGLang simplifies LLM deployment with minimal setup and maximum performance. You can integrate it into your applications for efficient LLM inference. For detailed configurations, refer to the SGLang GitHub Repository.
SGLang, as a rising star, stands on the shoulders of giants. It focuses on the new pain points that arise when building LLM applications and has achieved remarkable results in both performance and development efficiency. As a relatively new project, it still has some usability rough edges (its configuration is more complex than vLLM's) and a longer road ahead. Nevertheless, its approach to improving inference serving for complex LLM applications is sound, and its future looks promising. It is well worth watching and learning from.