QwQ-32B Hosting: Host Your QwQ-32B LLM Server

QwQ-32B is a medium-sized reasoning model that achieves performance competitive with state-of-the-art reasoning models such as DeepSeek-R1 and o1-mini.

Choose Your QwQ-32B Hosting Plans

GPU Mart offers the best budget GPU servers for QwQ-32B. These cost-effective dedicated GPU servers are ideal for hosting your own LLMs online.

Advanced GPU Dedicated Server - A5000

$349.00/mo
Order Now
  • 128GB RAM
  • Dual 12-Core E5-2697v2
  • 240GB SSD + 2TB SSD
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia RTX A5000
  • Microarchitecture: Ampere
  • CUDA Cores: 8192
  • Tensor Cores: 256
  • GPU Memory: 24GB GDDR6
  • FP32 Performance: 27.8 TFLOPS
  • $174.50 for the first month, then enjoy a 20% discount on renewals.

Enterprise GPU Dedicated Server - RTX A6000

$409.00/mo
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimized for AI, deep learning, data visualization, HPC, and more.
Spring Sale

Enterprise GPU Dedicated Server - RTX 4090

$302.00/mo
44% Off Recurring (Was $549.00)
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, and AI/deep learning.
Spring Sale

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% Off Recurring (Was $799.00)
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation and large-scale inference, AI training, and ML workloads.
Spring Sale

Multi-GPU Dedicated Server - 2xA100

$951.00/mo
32% Off Recurring (Was $1399.00)
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
New Arrival

Enterprise GPU Dedicated Server - A100 (80GB)

$1559.00/mo
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS
Spring Sale

Enterprise GPU Dedicated Server - H100

$1819.00/mo
30% Off Recurring (Was $2599.00)
Order Now
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia H100
  • Microarchitecture: Hopper
  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 183 TFLOPS

QwQ-32B Benchmark Performance

QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
[Figure: QwQ-32B benchmark results compared with DeepSeek-R1, DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, and o1-mini]

How to Run QwQ-32B with Ollama or vLLM

vLLM is an optimized inference engine that delivers high-speed token generation and efficient memory management, making it ideal for large-scale AI applications. Ollama is a lightweight, user-friendly framework that simplifies running open-source LLMs on a local machine. Choose whichever best fits your needs.
Step 1. Order and log in to a GPU server
Step 2. Install Ollama or vLLM
Step 3. Run QwQ-32B with Ollama or vLLM
Step 4. Chat with QwQ-32B

Sample 1 - Run QwQ-32B with the Ollama Command Line

# install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh

# on a GPU dedicated server with an RTX 4090 (24GB)
ollama run qwq
>>> How many r's are in the word "strawberry"?
……
Final Answer: There are three "R"s in "strawberry."

total duration:       55.10358031s
load duration:        54.175044ms
prompt eval count:    235 token(s)
prompt eval duration: 45ms
prompt eval rate:     5222.22 tokens/s
eval count:           1936 token(s)
eval duration:        54.949s
eval rate:            35.23 tokens/s
[Screenshot: running QwQ-32B on an RTX 4090 GPU server]
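
Besides the interactive CLI, Ollama also exposes a local REST API (port 11434 by default) that your own applications can call. Below is a minimal sketch in Python using the requests package; the endpoint and payload follow Ollama's /api/chat API, and the prompt is just an example.

# chat with the local Ollama server over its REST API
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwq",
        "messages": [{"role": "user", "content": "How many r's are in the word \"strawberry\"?"}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=600,  # QwQ's long reasoning traces can take a while
)
print(resp.json()["message"]["content"])
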
Sample 2 - Run QwQ-32B with vLLM

By default, vLLM downloads the BF16 model weights from Hugging Face, which are about four times the size of the 4-bit quantization in the Ollama library: roughly 72 GB in total (32.5B parameters × 2 bytes ≈ 65 GB of weights, plus overhead). A GPU card with 80 GB of memory is therefore recommended.

# Prerequisites
# An A100 80GB or H100 GPU dedicated server
uv pip install vllm
# --max-model-len caps the context window so the KV cache fits alongside the BF16 weights
vllm serve Qwen/QwQ-32B --max-model-len 4096
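
Once the server is up, vllm serve exposes an OpenAI-compatible API (port 8000 by default). The sketch below queries it with the official openai Python client; the base URL and placeholder API key reflect vLLM's defaults, and the prompt is just an example.

# query the vLLM server through its OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused but required

completion = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[{"role": "user", "content": "How many r's are in the word \"strawberry\"?"}],
    max_tokens=2048,  # together with the prompt, must fit within --max-model-len
)
print(completion.choices[0].message.content)
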
Sample 3 - Run QwQ-32B with Hugging Face Transformers

Below is a brief example demonstrating how to use QwQ-32B via Hugging Face Transformers.

# Prerequisites
# An A100 80GB or H100 GPU dedicated server
# uv pip install torch transformers accelerate  (accelerate is required for device_map="auto")

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# load the weights in their native dtype and shard them automatically across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# QwQ emits a long chain of thought before its final answer, so allow a generous token budget
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
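
Generating the full response in one call can take several minutes, since QwQ thinks at length before answering. As an optional convenience, the sketch below streams tokens to the console as they are produced, using transformers' TextStreamer; it reuses the model, tokenizer, and model_inputs objects from the example above.

# stream tokens to stdout as they are generated, instead of waiting for the full response
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=32768, streamer=streamer)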

FAQs of QwQ-32B Hosting

Here are some Frequently Asked Questions about QwQ-32B.

What is QwQ-32B?

QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ is capable of thinking and reasoning, which gives it significantly better performance on downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model in the series, achieving performance competitive with state-of-the-art reasoning models such as DeepSeek-R1 and o1-mini.

Who can use QwQ-32B?

QwQ-32B is open-weight, released on Hugging Face and ModelScope under the Apache 2.0 license, and is also accessible via Qwen Chat. You can deploy it locally or use the official online service.

How can I deploy QwQ-32B?

QwQ-32B can be deployed with Ollama, vLLM, or Hugging Face Transformers on your own dedicated GPU server, as shown in the samples above.