Qwen3.5-35B-A3B: Production Deployment on GB10 Grace Blackwell
Tags: qwen, vllm, llm, self-hosted, docker, nvidia, gb10, agentic-ai
Qwen3.5-35B-A3B represents Qwen's latest advancement in agentic coding models, featuring native tool calling capabilities and an extended 192K context window. This guide covers production deployment on the NVIDIA GB10 Grace Blackwell Superchip.
Why Qwen3.5-35B-A3B?
The A3B variant is specifically optimized for:
- Agentic Coding: Native support for function calling and tool use
- Extended Context: 192K token context window for complex codebases
- Efficient Inference: MoE (Mixture of Experts) architecture with 35B total parameters but only 3B active per token
- Production Ready: Optimized for deployment with vLLM
Hardware Requirements
This guide is optimized for the NVIDIA GB10 Grace Blackwell Superchip:
| Requirement | Specification |
|---|---|
| GPU Memory | 128 GB LPDDR5X (minimum) |
| Architecture | Blackwell with 5th Gen Tensor Cores |
| AI Performance | Up to 1,000 TOPS (FP4) |
| Storage | ~70 GB for model weights |
Docker Compose Configuration
```yaml
services:
  vllm-qwen35:
    image: vllm-node-tf5-latest:latest
    container_name: vllm-qwen35-a3b
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/.cache/vllm:/root/.cache/vllm
      - ~/.cache/flashinfer:/root/.cache/flashinfer
      - ~/.triton:/root/.triton
    ipc: host
    shm_size: 64g
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_ATTENTION_BACKEND=FLASH_ATTN
      - VLLM_USE_DEEP_GEMM=0
      - VLLM_TORCH_COMPILE_LEVEL=0
      - 'VLLM_CHAT_TEMPLATE_KWARGS={"enable_thinking": true}'
    command:
      - bash
      - -c
      - |
        vllm serve Qwen/Qwen3.5-35B-A3B \
          --port 8000 \
          --host 0.0.0.0 \
          --max-model-len 196608 \
          --max-num-batched-tokens 8192 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --load-format fastsafetensors \
          -tp 1
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 600s
```
Configuration Parameters Explained
| Parameter | Value | Purpose |
|---|---|---|
| `--max-model-len` | 196608 | 192K context window for large codebases |
| `--gpu-memory-utilization` | 0.85 | 85% of GPU memory for model weights + KV cache |
| `--enable-auto-tool-choice` | flag | Enable automatic tool selection |
| `--tool-call-parser` | qwen3_coder | Parser for Qwen's tool-calling format |
| `--load-format` | fastsafetensors | Faster model weight loading |
| `-tp` | 1 | Tensor parallelism degree (single GPU) |
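As a sanity check on `--max-model-len` and `--gpu-memory-utilization`, the KV-cache footprint at full context can be estimated from the model's attention geometry. The layer and head counts below are illustrative placeholders, not Qwen3.5-35B-A3B's published configuration; substitute the values from the model's `config.json`:

```python
# Back-of-envelope KV-cache sizing for one sequence at full context.
# num_layers / num_kv_heads / head_dim below are ASSUMED example values,
# not the published Qwen3.5-35B-A3B config.

def kv_cache_bytes(tokens, num_layers=48, num_kv_heads=4, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache for one sequence: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens

full_context = kv_cache_bytes(196_608)  # one 192K-token sequence
print(f"KV cache for one 192K sequence: {full_context / 2**30:.1f} GiB")  # 18.0 GiB
```

Under these assumptions a single full-context request consumes about 18 GiB of KV cache, which is why the memory-utilization setting matters on a 128 GB device.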
Thinking Mode Configuration
Qwen3.5 models support extended thinking mode for complex reasoning:
```yaml
environment:
  - 'VLLM_CHAT_TEMPLATE_KWARGS={"enable_thinking": true}'
```
For simpler tasks where reasoning output is unnecessary:
```yaml
environment:
  - 'VLLM_CHAT_TEMPLATE_KWARGS={"enable_thinking": false}'
```
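Thinking mode can also be overridden per request rather than server-wide. Recent vLLM OpenAI-compatible servers accept a `chat_template_kwargs` field in the request body (support varies by vLLM version, so treat this as a sketch); the `post_chat` helper is our own:

```python
# Per-request override of thinking mode via chat_template_kwargs.
# Field support depends on the vLLM version -- verify against your server.
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [{"role": "user", "content": "Rename variable x to count"}],
    "chat_template_kwargs": {"enable_thinking": False},  # skip reasoning tokens
}

def post_chat(payload, base_url="http://localhost:8000"):
    import requests  # requires a running server at base_url
    return requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
```

This keeps the server default at `enable_thinking: true` while letting latency-sensitive callers opt out.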
Tool Calling Support
Qwen3.5-35B-A3B excels at agentic coding with native tool calling:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {
                        "type": "string",
                        "description": "Path to the file to read"
                    }
                },
                "required": ["file_path"]
            }
        }
    }
]
```
Example API Request with Tools
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3.5-35B-A3B",
        "messages": [
            {"role": "user", "content": "Read the main.py file and explain what it does"}
        ],
        "tools": tools,
        "tool_choice": "auto",
    },
)
```
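When the model decides to call a tool, the assistant message comes back with a `tool_calls` array instead of plain content. A minimal local dispatch loop might look like this (the tool registry and `read_file` implementation are illustrative, not part of vLLM):

```python
import json

def read_file(file_path: str) -> str:
    # Local example implementation matching the tool schema above.
    with open(file_path, encoding="utf-8") as f:
        return f.read()

TOOLS = {"read_file": read_file}

def run_tool_calls(message: dict) -> list:
    """Execute each tool call and return the 'tool' role messages to send back."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        output = TOOLS[fn["name"]](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": output,
        })
    return results
```

Append the assistant message and these tool results to `messages`, then POST again so the model can produce its final answer.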
Performance Optimization
Memory Tuning for GB10
```bash
# Conservative (leaves room for other processes)
--gpu-memory-utilization 0.70

# Balanced (recommended)
--gpu-memory-utilization 0.85

# Aggressive (maximize KV cache)
--gpu-memory-utilization 0.95
```
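With 128 GB of unified memory and roughly 70 GB of weights (figures from the hardware table above), the KV-cache budget at each setting works out roughly as follows. vLLM's real accounting also reserves workspace and activation memory, so treat this as an estimate:

```python
# Rough KV-cache budget per --gpu-memory-utilization setting.
# Totals are taken from the hardware table above; real overhead is higher.
def kv_budget_gb(util, total_gb=128, weights_gb=70):
    return total_gb * util - weights_gb

for util in (0.70, 0.85, 0.95):
    print(f"util={util:.2f}: ~{kv_budget_gb(util):.1f} GB left for KV cache")
```

At 0.70 only about 20 GB remains for KV cache, which explains why the aggressive setting is attractive for long-context workloads.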
Batch Size Optimization
```bash
# For many small requests
--max-num-seqs 64
--max-num-batched-tokens 4096

# For fewer large requests (code analysis)
--max-num-seqs 16
--max-num-batched-tokens 16384
```
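The two flags bound each scheduler step from different directions: at most `max-num-seqs` sequences and at most `max-num-batched-tokens` tokens are admitted, whichever limit binds first (chunked prefill adds further nuance, so this is a simplification). A quick sketch of which limit dominates for each profile, with illustrative prompt sizes:

```python
# Which limit binds first per scheduler step (simplified model).
def max_prefills_per_step(avg_prompt_tokens, max_num_seqs, max_num_batched_tokens):
    by_tokens = max_num_batched_tokens // avg_prompt_tokens
    return min(max_num_seqs, by_tokens)

# Many small requests (chat-style, ~256-token prompts):
print(max_prefills_per_step(256, 64, 4096))    # 16: the token budget binds
# Few large requests (code analysis, ~8K-token prompts):
print(max_prefills_per_step(8192, 16, 16384))  # 2: the token budget binds
```

In both profiles the token budget, not the sequence cap, is the binding constraint, so tune `--max-num-batched-tokens` first.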
Quick Start Commands
```bash
# Start server
docker compose up -d

# Check logs
docker logs vllm-qwen35-a3b --tail 50 -f

# Test connection
curl http://localhost:8000/v1/models

# Stop server
docker compose down
```
Using with OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "Refactor this code to use async/await"}
    ],
    max_tokens=2048
)
print(response.choices[0].message.content)
```
Troubleshooting
Model Loading Timeout
```yaml
healthcheck:
  start_period: 900s  # Increase to 15 minutes
```
Memory Errors
```bash
--max-model-len 131072           # Reduce context to 128K
--gpu-memory-utilization 0.75    # Reduce memory allocation
```
Comparison: Qwen3.5 vs Qwen3
| Feature | Qwen3.5-35B-A3B | Qwen3-VL-30B |
|---|---|---|
| Context Window | 192K | 128K |
| Active Parameters | 3B | 30B |
| Tool Calling | Native | Via parser |
| Thinking Mode | Built-in | Via template |
| Best For | Agentic coding | General purpose |
Conclusion
Deploying Qwen3.5-35B-A3B on the GB10 Grace Blackwell Superchip provides an ideal balance of performance and efficiency for agentic coding workflows.