vLLM (Local Server)
High-throughput, memory-efficient LLM inference engine with OpenAI-compatible API. Self-host any Hugging Face model on your own GPUs. Free and open-source with auto-discovery.
Tags: local, self-hosted, free, openai-compatible, auto-discovery, open-source, gpu, high-throughput
Use Cases
- Self-hosted inference for complete data privacy — no data leaves your network
- Development environment with free, unlimited inference for testing and iteration
- High-throughput serving for multi-user OpenClaw deployments
- Running open-source models (Llama, Qwen, DeepSeek, Mistral) locally
- Cost elimination for heavy API usage — pay only for electricity after GPU investment
- Production serving with continuous batching for concurrent agent sessions
Tips
- Tune --gpu-memory-utilization (0.90 is vLLM's default) to control how much VRAM is pre-allocated — raise it to fit a larger KV cache, lower it if other processes share the GPU or you hit OOM errors.
- Enable prefix caching (--enable-prefix-caching) for repeated prompt patterns — significantly reduces latency for OpenClaw agent loops.
- For multi-GPU, use --tensor-parallel-size N where N is the number of GPUs.
- Start with a small model (7B-14B) to verify your setup works before loading larger ones.
- Use AWQ or GPTQ quantized models to fit larger models in limited VRAM. 4-bit quantization cuts weight memory to roughly a quarter of the FP16 footprint (the KV cache is unaffected).
- vLLM auto-discovers models for OpenClaw — just start the server and configure the baseUrl.
- For production, use Docker: vllm/vllm-openai:latest with GPU passthrough.
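The Docker tip above might look like this in practice — a sketch, not a definitive deployment. The model name and cache path are illustrative assumptions; --gpus all requires the NVIDIA Container Toolkit on the host:

```shell
# Run the official vLLM OpenAI-compatible server in Docker with GPU passthrough.
# Mounting the Hugging Face cache avoids re-downloading weights on restart.
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --gpu-memory-utilization 0.90
```

--ipc=host matters because vLLM's workers use shared memory for tensor transfers; without it, multi-worker setups can fail with cryptic errors.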
Known Issues & Gotchas
- Requires a CUDA-capable GPU for most models. CPU inference works but is significantly slower.
- More memory-hungry than Ollama — optimizes for throughput over minimal footprint. Budget extra VRAM.
- Model loading can take several minutes for large models. Plan for startup time.
- Apple Silicon support is experimental. Use Ollama or llama.cpp for more stable Mac inference.
- OpenClaw needs a dummy API key (e.g., 'vllm') even though vLLM doesn't require auth — it's used for provider detection.
- Not all model features work out of the box. Tool use needs both a model with a proper chat template and a server started with tool-call parsing enabled (--enable-auto-tool-choice plus a matching --tool-call-parser).
- Quantized models require matching quantization libraries (auto-gptq, autoawq). Check compatibility before loading.
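A quick sanity check for several of the gotchas above is to hit the OpenAI-compatible /v1/models endpoint — the same one OpenClaw's auto-discovery reads. Host and port here assume the basic local example below; the Bearer token can be any string, since vLLM does not check it by default:

```shell
# List the model IDs the running vLLM server exposes.
# If this hangs, the model is likely still loading (see startup-time gotcha).
curl -s http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer vllm"
```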
Alternatives
- Ollama
- llama.cpp (via llama-server)
- NVIDIA NIM
- SGLang
Community Feedback
vLLM is the gold standard for LLM serving. PagedAttention alone is worth it — dramatically better memory efficiency than naive implementations. Continuous batching makes multi-user setups actually work.
— Reddit r/LocalLLaMA
vLLM has matured incredibly. The OpenAI-compatible API means you can swap between local and cloud with just a baseUrl change. Great for development → production workflows.
— Hacker News
Be prepared for the GPU memory requirements. vLLM is more memory-hungry than Ollama for the same model because it prioritizes throughput over minimal footprint. Great for serving, less great for tinkering on a 4GB GPU.
— Reddit r/selfhosted
Configuration Examples
Basic vLLM local setup
providers:
  vllm:
    apiKey: vllm
    baseUrl: http://127.0.0.1:8000/v1
    model: vllm/your-model-name

vLLM on remote server
providers:
  vllm:
    apiKey: vllm
    baseUrl: http://192.168.1.100:8000/v1
    model: vllm/meta-llama/Llama-3.3-70B-Instruct

Starting vLLM server (shell command)
# Serve Llama 3.3 70B with tensor parallelism on 2 GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90
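For the quantization tip above, a single-GPU variant might look like the sketch below. The AWQ checkpoint name is an illustrative assumption — pick one actually published for your target model:

```shell
# Serve a 4-bit AWQ checkpoint on one GPU; weights take roughly a quarter
# of the FP16 footprint, leaving more VRAM for the KV cache.
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.90
```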