performance architecture benchmarks

Proxy Overhead Benchmarks: How Fast is VoidLLM?

· 4 min read

A proxy that adds latency to every LLM request is worse than no proxy at all. We built a benchmark suite to make sure VoidLLM stays out of the way.

The setup

The benchmark embeds a mock LLM server and a mock MCP server that respond with fixed payloads after a 10ms simulated delay. VoidLLM sits in front, fully configured with auth, model resolution, rate limiting, and usage logging. The benchmark measures the difference between hitting the mock directly and going through VoidLLM.

graph LR
  V[Vegeta Load Tester] -->|direct| M[Mock LLM 10ms]
  V -->|via proxy| P[VoidLLM]
  P --> M
  P -.->|async| DB[(Usage Log)]

  style P fill:#8b5cf6,stroke:#6366f1,color:#fff
  style V fill:#1a1a24,stroke:#333,color:#e2e8f0
  style M fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style DB fill:#12121a,stroke:#8b5cf6,color:#e2e8f0
Benchmark setup: Vegeta fires requests at 1000 RPS, either direct to the mock or through VoidLLM. The difference is the proxy overhead.

All numbers are from runs at 2000 requests per second, sustained for 60 seconds (120,000 requests per phase). 100% success rate across all paths. Test hardware: 12 cores, 16GB RAM.

The numbers

LLM Proxy (the hot path - /v1/chat/completions):

That’s the time VoidLLM adds on top of whatever your upstream provider takes. For a model that responds in 500ms, VoidLLM adds less than 0.1% latency.

MCP Gateway (/api/v1/mcp/:alias):

Code Mode (WASM sandbox execution):

Where the time goes

graph TD
  A[Request in] --> B[Auth]
  B --> C[Model resolve]
  C --> D[Body mutate]
  D --> E[Upstream call]
  E --> F[Stream response]
  F --> G[Request done]
  F -.->|async| H[Usage event]

  style A fill:#1a1a24,stroke:#333,color:#e2e8f0
  style B fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style C fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style D fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style E fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style F fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style G fill:#1a1a24,stroke:#333,color:#e2e8f0
  style H fill:#12121a,stroke:#8b5cf6,color:#e2e8f0
Purple: proxy overhead (auth, resolve, mutate). Green: upstream time (not our overhead). Usage logging is async and never blocks the response.

Nothing blocks on I/O

Auth is a HMAC-SHA256 hash + in-memory cache lookup. Model resolution is an in-memory map. Body mutation rewrites the model name and injects stream_options. Usage events are fire-and-forget into a buffered channel - the response is sent before the event hits the database.

What we optimized

The MCP path saw the biggest improvement during optimization. The key changes:

The LLM proxy path was already fast. Most of the overhead there is inherent to what a proxy has to do: validate auth, resolve the model, mutate the request, and forward it.

How to run it yourself

The benchmark is included in the repo:

go run ./scripts/bench quick --rps 1000 --duration 30s

Six scenarios are available: quick (sanity check), sustained (5 min at 5000 RPS), burst (spike and recovery), large-payload (100KB bodies), mixed (parallel LLM + MCP + Code Mode), and endurance (30 min stability).

The benchmark uses Vegeta as a Go library, not a CLI tool. It starts its own mock servers and VoidLLM instance - no external setup needed.

Related posts