SGLang is an open-source inference engine developed by the SGLang team to address these challenges. It optimizes the use of CPU and GPU resources during inference, achieving significantly higher throughput than many competing solutions. Its design reduces redundant computation and improves overall efficiency, making it easier for organizations to manage the complexities of LLM deployment.
SGLang addresses critical challenges in deploying large language models by optimizing the balance between CPU and GPU tasks.
RadixAttention minimizes redundant computations, improving throughput in conversational and retrieval scenarios.
A zero-overhead batch scheduler overlaps CPU scheduling with GPU operations to ensure continuous processing and reduce idle time.
A cache-aware load balancer efficiently predicts cache hit rates and routes requests, boosting overall performance and cache utilization.
Data parallelism attention reduces memory overhead and enhances decoding throughput for multi-head latent attention models.
The integration of xgrammar enables rapid generation of structured outputs, significantly improving processing speed for formats like JSON (a brief usage sketch follows this list).
SGLang’s practical benefits are demonstrated by its adoption in large-scale production environments, where it has contributed to substantial cost savings and performance improvements.
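To illustrate the structured-output feature mentioned above, here is a minimal sketch of requesting schema-constrained JSON through the OpenAI-compatible API. It assumes a server like the one launched later in this guide is running on port 30000; the schema name and fields are hypothetical, and the exact response_format support may vary by SGLang version.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Ask the server to constrain decoding to a JSON schema
# (enforced by the grammar backend, e.g. xgrammar).
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Give me the capital and population of France as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "country_info",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {
                    "capital": {"type": "string"},
                    "population": {"type": "integer"},
                },
                "required": ["capital", "population"],
            },
        },
    },
)
print(response.choices[0].message.content)  # a JSON string matching the schema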
Ubuntu 20.04/22.04 (recommended).
Python 3.8+ and pip installed.
NVIDIA GPU with CUDA 11.8+ and compute capability 7.0+.
NVIDIA drivers installed (tested with Driver 535+).
RAM: At least 32 GB system RAM.
Disk Space: Minimum of 50 GB, especially if storing large model weights.
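A quick way to confirm the GPU and disk requirements above is a short Python check. This is a minimal sketch; it assumes PyTorch is available (the SGLang install below pulls it in).

import shutil

import torch

# GPU check: CUDA must be available and compute capability should be 7.0 or higher.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
major, minor = torch.cuda.get_device_capability(0)
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, compute capability {major}.{minor}, "
      f"VRAM: {props.total_memory / 1024**3:.1f} GB")

# Disk check: at least 50 GB free is recommended for model weights.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")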
If you don’t have uv installed, use the following command:
curl -LsSf https://astral.sh/uv/install.sh | sh
After installing uv, you can create a new Python environment and install SGLang using the following commands:
# Creating virtual environment with seed packages at: sglang
uv venv sglang --python 3.12 --seed
source sglang/bin/activate
uv pip install "sglang[all]>=0.4.4.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
To verify the installation, run:
python -c "import sglang; print(sglang.__version__)"
0.4.4.post1
Once SGLang is installed, you can deploy your LLM by following these steps:
# To download and run the DeepSeek-R1 model, execute:
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --trust-remote-code \
  --tp 2 \
  --enable-p2p-check \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.9
If the startup is successful, the output is similar to the following:
[2025-03-25 02:40:05] INFO: Started server process [175805]
[2025-03-25 02:40:05] INFO: Waiting for application startup.
[2025-03-25 02:40:05] INFO: Application startup complete.
[2025-03-25 02:40:05] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-03-25 02:40:06] INFO: 127.0.0.1:35042 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-03-25 02:40:06 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-25 02:40:09] INFO: 127.0.0.1:35052 - "POST /generate HTTP/1.1" 200 OK
[2025-03-25 02:40:09] The server is fired up and ready to roll!
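Before sending real traffic, you can confirm the server is reachable by querying the /get_model_info endpoint that appears in the startup log above. A minimal sketch using the requests library:

import requests

# Returns basic information about the loaded model (e.g. the model path).
resp = requests.get("http://localhost:30000/get_model_info", timeout=10)
resp.raise_for_status()
print(resp.json())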
The launch command starts an OpenAI-compatible API server that you can interact with over HTTP. You can test the API using curl:
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      { "role": "system", "content": "You are a helpful AI assistant" },
      { "role": "user", "content": "Who are you?" }
    ],
    "temperature": 0.6,
    "max_tokens": 1024
  }'
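Alongside the OpenAI-compatible routes, the startup log above also shows SGLang's native /generate endpoint being exercised. A sketch of calling it directly with the requests library; the sampling parameters shown are illustrative:

import requests

# SGLang's native completion endpoint takes a raw prompt plus sampling parameters.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0.6, "max_new_tokens": 64},
}
resp = requests.post("http://localhost:30000/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text"])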
You can also call the server with the official OpenAI Python client, or any other HTTP client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # match the port used when launching the server
    api_key="",  # no API key is required unless the server was started with --api-key
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
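Since the endpoint is OpenAI-compatible, streaming responses should also work through the same client. A minimal sketch, assuming the server from the earlier step is listening on port 30000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="")

# Stream tokens as they are generated instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Explain RadixAttention in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()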
The following command benchmarks the SGLang server launched above (serving DeepSeek-R1-Distill-Qwen-7B) using random input/output sequences of the specified lengths. It evaluates how the server handles up to 16 simultaneous connections, giving insight into throughput, latency, and behavior under concurrent load.
python3 -m sglang.bench_serving --backend sglang \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --port 30000 \
  --dataset-name random \
  --random-input 512 \
  --random-output 256 \
  --random-range-ratio 1.0 \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json \
  --max-concurrency 16
Note: --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json points the benchmark at a local dataset. This is useful if you later switch from random benchmarking to real-dataset benchmarking, although with --dataset-name random this dataset is not actually used.
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 16
Successful requests: 1000
Benchmark duration (s): 560.12
Total input tokens: 512000
Total generated tokens: 256000
Total generated tokens (retokenized): 255336
Request throughput (req/s): 1.79
Input token throughput (tok/s): 914.09
Output token throughput (tok/s): 457.05
Total token throughput (tok/s): 1371.14
Concurrency: 15.94
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8930.44
Median E2E Latency (ms): 8763.82
---------------Time to First Token----------------
Mean TTFT (ms): 529.18
Median TTFT (ms): 534.75
P99 TTFT (ms): 1026.99
---------------Inter-Token Latency----------------
Mean ITL (ms): 32.95
Median ITL (ms): 31.00
P95 ITL (ms): 36.71
P99 ITL (ms): 59.72
Max ITL (ms): 1392.52
==================================================
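The headline numbers are internally consistent, which is a quick sanity check worth running on any benchmark result; for example:

# Total token throughput ≈ (input tokens + output tokens) / duration
print((512000 + 256000) / 560.12)   # ≈ 1371 tok/s, matching the reported 1371.14

# Effective concurrency ≈ request throughput × mean end-to-end latency
print(1.79 * 8.93044)               # ≈ 16, matching --max-concurrency 16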
The sglang.launch_server command is used to launch the OpenAI-compatible server. Its full set of options is shown below:
usage: launch_server.py [-h] --model-path MODEL_PATH [--tokenizer-path TOKENIZER_PATH]
  [--host HOST] [--port PORT] [--tokenizer-mode {auto,slow}] [--skip-tokenizer-init]
  [--load-format {auto,pt,safetensors,npcache,dummy,gguf,bitsandbytes,layered}]
  [--trust-remote-code] [--dtype {auto,half,float16,bfloat16,float,float32}]
  [--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}]
  [--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,w8a8_int8,w8a8_fp8}]
  [--quantization-param-path QUANTIZATION_PARAM_PATH] [--context-length CONTEXT_LENGTH]
  [--device {cuda,xpu,hpu,cpu}] [--served-model-name SERVED_MODEL_NAME]
  [--chat-template CHAT_TEMPLATE] [--is-embedding] [--revision REVISION]
  [--mem-fraction-static MEM_FRACTION_STATIC] [--max-running-requests MAX_RUNNING_REQUESTS]
  [--max-total-tokens MAX_TOTAL_TOKENS] [--chunked-prefill-size CHUNKED_PREFILL_SIZE]
  [--max-prefill-tokens MAX_PREFILL_TOKENS] [--schedule-policy {lpm,random,fcfs,dfs-weight}]
  [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS] [--cpu-offload-gb CPU_OFFLOAD_GB]
  [--page-size PAGE_SIZE] [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
  [--stream-interval STREAM_INTERVAL] [--stream-output] [--random-seed RANDOM_SEED]
  [--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
  [--watchdog-timeout WATCHDOG_TIMEOUT] [--dist-timeout DIST_TIMEOUT]
  [--download-dir DOWNLOAD_DIR] [--base-gpu-id BASE_GPU_ID] [--gpu-id-step GPU_ID_STEP]
  [--log-level LOG_LEVEL] [--log-level-http LOG_LEVEL_HTTP] [--log-requests]
  [--log-requests-level {0,1,2}] [--show-time-cost] [--enable-metrics]
  [--decode-log-interval DECODE_LOG_INTERVAL] [--api-key API_KEY]
  [--file-storage-path FILE_STORAGE_PATH] [--enable-cache-report]
  [--reasoning-parser {deepseek-r1}] [--data-parallel-size DATA_PARALLEL_SIZE]
  [--load-balance-method {round_robin,shortest_queue}]
  [--expert-parallel-size EXPERT_PARALLEL_SIZE] [--dist-init-addr DIST_INIT_ADDR]
  [--nnodes NNODES] [--node-rank NODE_RANK]
  [--json-model-override-args JSON_MODEL_OVERRIDE_ARGS] [--lora-paths [LORA_PATHS ...]]
  [--max-loras-per-batch MAX_LORAS_PER_BATCH] [--lora-backend LORA_BACKEND]
  [--attention-backend {flashinfer,triton,torch_native}]
  [--sampling-backend {flashinfer,pytorch}] [--grammar-backend {xgrammar,outlines,llguidance}]
  [--enable-flashinfer-mla] [--flashinfer-mla-disable-ragged]
  [--speculative-algorithm {EAGLE,NEXTN}]
  [--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
  [--speculative-num-steps SPECULATIVE_NUM_STEPS]
  [--speculative-eagle-topk SPECULATIVE_EAGLE_TOPK]
  [--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS]
  [--speculative-accept-threshold-single SPECULATIVE_ACCEPT_THRESHOLD_SINGLE]
  [--speculative-accept-threshold-acc SPECULATIVE_ACCEPT_THRESHOLD_ACC]
  [--speculative-token-map SPECULATIVE_TOKEN_MAP] [--enable-double-sparsity]
  [--ds-channel-config-path DS_CHANNEL_CONFIG_PATH] [--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
  [--ds-heavy-token-num DS_HEAVY_TOKEN_NUM] [--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
  [--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD]
  [--disable-radix-cache] [--disable-cuda-graph] [--disable-cuda-graph-padding]
  [--enable-nccl-nvls] [--disable-outlines-disk-cache] [--disable-custom-all-reduce]
  [--disable-mla] [--disable-overlap-schedule] [--enable-mixed-chunk]
  [--enable-dp-attention] [--enable-ep-moe] [--enable-torch-compile]
  [--torch-compile-max-bs TORCH_COMPILE_MAX_BS] [--cuda-graph-max-bs CUDA_GRAPH_MAX_BS]
  [--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]] [--torchao-config TORCHAO_CONFIG]
  [--enable-nan-detection] [--enable-p2p-check] [--triton-attention-reduce-in-fp32]
  [--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
  [--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS]
  [--delete-ckpt-after-loading] [--enable-memory-saver] [--allow-auto-truncate]
  [--enable-custom-logit-processor] [--tool-call-parser {qwen25,mistral,llama3}]
  [--enable-hierarchical-cache] [--warmups WARMUPS]
  [--debug-tensor-dump-output-folder DEBUG_TENSOR_DUMP_OUTPUT_FOLDER]
  [--debug-tensor-dump-input-file DEBUG_TENSOR_DUMP_INPUT_FILE]
  [--debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT]
You have now installed and configured SGLang on Ubuntu 22.04. SGLang simplifies LLM deployment with minimal setup and maximum performance. You can integrate it into your applications for efficient LLM inference. For detailed configurations, refer to the SGLang GitHub Repository.
SGLang, as a rising star, stands on the shoulders of giants. It focuses on the new pain points that arise when building LLM applications and has achieved remarkable results in both performance and development efficiency. As a relatively new project, it still has some usability rough edges (its configuration is more complex than vLLM's) and a longer road ahead. Nevertheless, its approach to improving inference serving for complex LLM applications is sound, and its future looks promising. It is well worth watching and learning from.