vLLM (Local Server)

High-throughput, memory-efficient LLM inference engine with OpenAI-compatible API. Self-host any Hugging Face model on your own GPUs. Free and open-source with auto-discovery.

vLLM is the leading open-source inference engine for large language models, originally developed at UC Berkeley's Sky Computing Lab. It has become the de facto standard for high-performance LLM serving, used by both researchers and production deployments worldwide.

The project provides state-of-the-art serving throughput through its innovative PagedAttention algorithm, which efficiently manages GPU memory for attention key-value caches. vLLM exposes a fully OpenAI-compatible HTTP API, making it a drop-in replacement for cloud providers when self-hosting. OpenClaw can connect to a running vLLM server and auto-discover available models through the standard /v1/models endpoint. This means you can load any supported Hugging Face model — from small 7B models on a consumer GPU to massive MoE models across multi-GPU setups — and immediately use it with OpenClaw.

Key performance features include continuous batching (handling multiple concurrent requests efficiently), CUDA/HIP graph execution, speculative decoding, chunked prefill, and prefix caching. vLLM supports extensive quantization formats (GPTQ, AWQ, INT4, INT8, FP8) for running larger models on limited hardware. It handles both standard transformer models and MoE architectures like Mixtral and DeepSeek-V3. Hardware support is remarkably broad: NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Intel GPUs (XPU), Intel/AMD CPUs, ARM processors, Apple Silicon (experimental), and even IBM Z mainframes. Third-party hardware plugins extend support to Intel Gaudi, IBM Spyre, and Huawei Ascend accelerators.

For OpenClaw, vLLM is the primary self-hosted inference option. It's completely free (you only pay for electricity and hardware), provides full data privacy, and delivers production-grade performance. The tradeoff is that you need appropriate GPU hardware — a single consumer GPU (RTX 3090/4090) can run models up to ~30B parameters, while larger models need professional GPUs or multi-GPU setups.
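Auto-discovery works by listing the server's models: the /v1/models endpoint returns a standard OpenAI-style JSON payload whose `data` array holds one entry per loaded model. A minimal sketch of parsing that payload — the embedded sample response and model id are illustrative, not output from a real server:

```python
import json

# Illustrative /v1/models response (shape follows the OpenAI API spec;
# a live server is queried with: curl http://127.0.0.1:8000/v1/models)
sample_response = json.dumps({
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-3.3-70B-Instruct",
         "object": "model", "owned_by": "vllm"}
    ],
})

def discover_models(raw: str) -> list[str]:
    """Extract model ids from a /v1/models JSON payload."""
    return [m["id"] for m in json.loads(raw)["data"]]

print(discover_models(sample_response))
# → ['meta-llama/Llama-3.3-70B-Instruct']
```

This is the same listing OpenClaw performs when it connects to the configured baseUrl.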

Tags: local, self-hosted, free, openai-compatible, auto-discovery, open-source, gpu, high-throughput

Use Cases

  • Self-hosted inference for complete data privacy — no data leaves your network
  • Development environment with free, unlimited inference for testing and iteration
  • High-throughput serving for multi-user OpenClaw deployments
  • Running open-source models (Llama, Qwen, DeepSeek, Mistral) locally
  • Cost elimination for heavy API usage — pay only for electricity after GPU investment
  • Production serving with continuous batching for concurrent agent sessions

Tips

  • Use --gpu-memory-utilization 0.90 to maximize VRAM usage without OOM errors.
  • Enable prefix caching (--enable-prefix-caching) for repeated prompt patterns — significantly reduces latency for OpenClaw agent loops.
  • For multi-GPU, use --tensor-parallel-size N where N is the number of GPUs.
  • Start with a small model (7B-14B) to verify your setup works before loading larger ones.
  • Use AWQ or GPTQ quantized models to fit larger models in limited VRAM. 4-bit quantization cuts weight memory to roughly a quarter of FP16 (and roughly half of INT8).
  • vLLM auto-discovers models for OpenClaw — just start the server and configure the baseUrl.
  • For production, use Docker: vllm/vllm-openai:latest with GPU passthrough.
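The quantization tip above can be made concrete with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter, plus overhead for the KV cache and activations. A rough sketch — the 20% overhead factor is an assumption for illustration, not a vLLM figure; real usage depends on context length and --gpu-memory-utilization:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: float,
                            overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weights plus an assumed fudge factor
    for KV cache / activations (overhead=0.20 is a guess, not a
    measured vLLM number)."""
    weight_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return weight_bytes * (1 + overhead) / 1e9

# A 70B model: FP16 vs 4-bit (AWQ/GPTQ)
print(round(estimate_weight_vram_gb(70, 16), 1))  # → 168.0 (multi-GPU territory)
print(round(estimate_weight_vram_gb(70, 4), 1))   # → 42.0 (fits 2x 24 GB cards)
```

This is why a 4-bit 70B model is feasible on a pair of consumer GPUs while the FP16 version needs a professional multi-GPU setup.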

Known Issues & Gotchas

  • Practical performance requires a CUDA-capable (or ROCm) GPU for most models. CPU inference works but is significantly slower.
  • More memory-hungry than Ollama — optimizes for throughput over minimal footprint. Budget extra VRAM.
  • Model loading can take several minutes for large models. Plan for startup time.
  • Apple Silicon support is experimental. Use Ollama or llama.cpp for more stable Mac inference.
  • OpenClaw needs a dummy API key (e.g., 'vllm') even though vLLM doesn't require auth — it's used for provider detection.
  • Not all model features work out of the box. Tool use support depends on the model shipping a proper chat template, and vLLM additionally needs --enable-auto-tool-choice with a matching --tool-call-parser.
  • Quantized models require matching quantization libraries (auto-gptq, autoawq). Check compatibility before loading.

Alternatives

  • Ollama
  • llama.cpp (via llama-server)
  • NVIDIA NIM
  • SGLang

Community Feedback

vLLM is the gold standard for LLM serving. PagedAttention alone is worth it — dramatically better memory efficiency than naive implementations. Continuous batching makes multi-user setups actually work.

— Reddit r/LocalLLaMA

vLLM has matured incredibly. The OpenAI-compatible API means you can swap between local and cloud with just a baseUrl change. Great for development → production workflows.

— Hacker News

Be prepared for the GPU memory requirements. vLLM is more memory-hungry than Ollama for the same model because it prioritizes throughput over minimal footprint. Great for serving, less great for tinkering on a 4GB GPU.

— Reddit r/selfhosted

Configuration Examples

Basic vLLM local setup

providers:
  vllm:
    apiKey: vllm
    baseUrl: http://127.0.0.1:8000/v1
    model: vllm/your-model-name

vLLM on remote server

providers:
  vllm:
    apiKey: vllm
    baseUrl: http://192.168.1.100:8000/v1
    model: vllm/meta-llama/Llama-3.3-70B-Instruct

Starting vLLM server (shell command)

# Serve Llama 3.3 70B with tensor parallelism on 2 GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90
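
Docker deployment (shell command)

Per the Docker tip above, the same server can run from the official vllm/vllm-openai image; a sketch of the documented run pattern — the cache mount path and model choice are illustrative and depend on your setup:

```shell
# Run the OpenAI-compatible server in Docker with GPU passthrough.
# --ipc=host gives tensor-parallel workers the shared memory they need;
# mounting the HF cache avoids re-downloading weights on restart.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90
```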