A proxy that adds latency to every LLM request is worse than no proxy at all. We built a benchmark suite to make sure VoidLLM stays out of the way.
The benchmark embeds a mock LLM server and a mock MCP server that respond with fixed payloads after a 10ms simulated delay. VoidLLM sits in front, fully configured with auth, model resolution, rate limiting, and usage logging. The benchmark measures the difference between hitting the mock directly and going through VoidLLM.
graph LR
    V[Vegeta Load Tester] -->|direct| M[Mock LLM 10ms]
    V -->|via proxy| P[VoidLLM]
    P --> M
    P -.->|async| DB[(Usage Log)]
    style P fill:#8b5cf6,stroke:#6366f1,color:#fff
    style V fill:#1a1a24,stroke:#333,color:#e2e8f0
    style M fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
    style DB fill:#12121a,stroke:#8b5cf6,color:#e2e8f0
All numbers are from runs at 2000 requests per second, sustained for 60 seconds (120,000 requests per phase). 100% success rate across all paths. Test hardware: 12 cores, 16GB RAM.
LLM Proxy (the hot path - /v1/chat/completions):
That’s the time VoidLLM adds on top of whatever your upstream provider takes. For a model that responds in 500ms, VoidLLM adds less than 0.1% latency.
MCP Gateway (/api/v1/mcp/:alias):
Code Mode (WASM sandbox execution):
graph TD
    A[Request in] --> B[Auth]
    B --> C[Model resolve]
    C --> D[Body mutate]
    D --> E[Upstream call]
    E --> F[Stream response]
    F --> G[Request done]
    F -.->|async| H[Usage event]
    style A fill:#1a1a24,stroke:#333,color:#e2e8f0
    style B fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
    style C fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
    style D fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
    style E fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
    style F fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
    style G fill:#1a1a24,stroke:#333,color:#e2e8f0
    style H fill:#12121a,stroke:#8b5cf6,color:#e2e8f0
Nothing blocks on I/O
Auth is an HMAC-SHA256 hash + in-memory cache lookup. Model resolution is an in-memory map. Body mutation rewrites the model name and injects stream_options. Usage events are fire-and-forget into a buffered channel - the response is sent before the event hits the database.
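As a rough illustration of two of these steps, the sketch below hashes an API key with HMAC-SHA256 before an in-memory lookup, and enqueues a usage event without blocking the request path. None of this is VoidLLM's code: the `keyCache` and `usageEvent` types, the secret, and the key values are hypothetical. Dropping an event when the buffer is full is an assumption made here to keep the send non-blocking; the real behaviour may differ.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// keyCache maps HMAC digests of API keys to an owner ID (hypothetical shape).
type keyCache struct {
	secret []byte
	mu     sync.RWMutex
	byHash map[string]string
}

// hash returns the hex-encoded HMAC-SHA256 digest of the API key.
func (c *keyCache) hash(apiKey string) string {
	mac := hmac.New(sha256.New, c.secret)
	mac.Write([]byte(apiKey))
	return hex.EncodeToString(mac.Sum(nil))
}

// add registers a key; only its digest is stored, never the raw key.
func (c *keyCache) add(apiKey, owner string) {
	h := c.hash(apiKey)
	c.mu.Lock()
	c.byHash[h] = owner
	c.mu.Unlock()
}

// lookup authenticates a key with one hash and one map read, no I/O.
func (c *keyCache) lookup(apiKey string) (string, bool) {
	h := c.hash(apiKey)
	c.mu.RLock()
	owner, ok := c.byHash[h]
	c.mu.RUnlock()
	return owner, ok
}

// usageEvent is a hypothetical record of one completed request.
type usageEvent struct {
	Owner  string
	Tokens int
}

// emit enqueues an event without blocking. It returns false when the buffer
// is full and the event was dropped (an assumption of this sketch).
func emit(ch chan<- usageEvent, ev usageEvent) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

func main() {
	cache := &keyCache{secret: []byte("server-secret"), byHash: map[string]string{}}
	cache.add("sk-demo", "team-a")

	events := make(chan usageEvent, 1024)
	done := make(chan struct{})
	go func() {
		// Background writer: stands in for the database insert.
		for ev := range events {
			fmt.Printf("persist %+v\n", ev)
		}
		close(done)
	}()

	if owner, ok := cache.lookup("sk-demo"); ok {
		emit(events, usageEvent{Owner: owner, Tokens: 42})
	}
	close(events)
	<-done
}
```

The `select` with a `default` branch is what keeps the request path from waiting on the writer: a slow database backs up the channel, not the response.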
The MCP path saw the biggest improvement during optimization. The key changes:
The LLM proxy path was already fast. Most of the overhead there is inherent to what a proxy has to do: validate auth, resolve the model, mutate the request, and forward it.
The benchmark is included in the repo:
go run ./scripts/bench quick --rps 1000 --duration 30s
Six scenarios are available: quick (sanity check), sustained (5 min at 5000 RPS), burst (spike and recovery), large-payload (100KB bodies), mixed (parallel LLM + MCP + Code Mode), and endurance (30 min stability).
The benchmark uses Vegeta as a Go library, not a CLI tool. It starts its own mock servers and VoidLLM instance - no external setup needed.