Proxy Overhead Benchmarks: How Fast is VoidLLM?

A proxy that adds noticeable latency to every LLM request is worse than no proxy at all. We built a benchmark suite to make sure VoidLLM stays out of the way.

The setup

The benchmark embeds a mock LLM server and a mock MCP server that respond with fixed payloads after a 10ms simulated delay. VoidLLM sits in front, fully configured with auth, model resolution, rate limiting, and usage logging. The benchmark measures the difference between hitting the mock directly and going through VoidLLM.

graph LR
  V[Vegeta Load Tester] -->|direct| M[Mock LLM 10ms]
  V -->|via proxy| P[VoidLLM]
  P --> M
  P -.->|async| DB[(Usage Log)]

  style P fill:#8b5cf6,stroke:#6366f1,color:#fff
  style V fill:#1a1a24,stroke:#333,color:#e2e8f0
  style M fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style DB fill:#12121a,stroke:#8b5cf6,color:#e2e8f0

Benchmark setup: Vegeta fires requests at 2000 RPS, either directly at the mock or through VoidLLM. The difference is the proxy overhead.

All numbers are from runs at 2000 requests per second, sustained for 60 seconds (120,000 requests per phase). 100% success rate across all paths. Test hardware: 12 cores, 16GB RAM.

The numbers

LLM Proxy (the hot path - /v1/chat/completions):

P50: under 500 microseconds overhead
In some runs the proxy is actually faster than direct calls thanks to connection pooling

That’s the time VoidLLM adds on top of whatever your upstream provider takes. For a model that responds in 500ms, VoidLLM adds less than 0.1% latency.

MCP Gateway (/api/v1/mcp/:alias):

P50: under 500 microseconds overhead
Comparable to the LLM proxy - access control checks, session management, and tool call logging all happen without blocking the request

Code Mode (WASM sandbox execution):

Pure JS execution: ~3.3ms (empty script, measures sandbox overhead)
With tool call: ~3.4ms (one tool call included)
Warm eval: 32 microseconds (pre-initialized runtime, no tool calls)
Pool cycle: ~1.8ms (return runtime to pool + get a fresh one)

Where the time goes

graph TD
  A[Request in] --> B[Auth]
  B --> C[Model resolve]
  C --> D[Body mutate]
  D --> E[Upstream call]
  E --> F[Stream response]
  F --> G[Request done]
  F -.->|async| H[Usage event]

  style A fill:#1a1a24,stroke:#333,color:#e2e8f0
  style B fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style C fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style D fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style E fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style F fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style G fill:#1a1a24,stroke:#333,color:#e2e8f0
  style H fill:#12121a,stroke:#8b5cf6,color:#e2e8f0

Purple: proxy overhead (auth, resolve, mutate). Green: upstream time (not our overhead). Usage logging is async and never blocks the response.
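One common Go pattern for logging that never blocks the response is a buffered channel drained by a background goroutine. The sketch below shows that pattern under stated assumptions: the `usageEvent` fields are hypothetical, an in-memory slice stands in for the database, and dropping events when the buffer is full is a choice made for this example, not a statement about what VoidLLM does.

```go
package main

import "fmt"

// usageEvent is a hypothetical record; the real event schema is not shown in this post.
type usageEvent struct {
	Model  string
	Tokens int
}

// usageLogger buffers events in a channel and drains them on a background goroutine.
type usageLogger struct {
	events chan usageEvent
	done   chan struct{}
	stored []usageEvent // stands in for the database
}

func newUsageLogger(buffer int) *usageLogger {
	l := &usageLogger{
		events: make(chan usageEvent, buffer),
		done:   make(chan struct{}),
	}
	go l.drain()
	return l
}

// drain is the only writer to stored; a real writer would batch inserts here.
func (l *usageLogger) drain() {
	defer close(l.done)
	for ev := range l.events {
		l.stored = append(l.stored, ev)
	}
}

// Record never blocks the request path. It returns false if the buffer is
// full and the event was dropped (an assumption made for this sketch).
func (l *usageLogger) Record(ev usageEvent) bool {
	select {
	case l.events <- ev:
		return true
	default:
		return false
	}
}

// Close flushes buffered events, waits for the writer, and returns what was stored.
func (l *usageLogger) Close() []usageEvent {
	close(l.events)
	<-l.done
	return l.stored
}

func main() {
	logger := newUsageLogger(1024)
	// In a handler this call would come after the response has been written.
	logger.Record(usageEvent{Model: "example-model", Tokens: 42})
	fmt.Println(len(logger.Close()))
}
```

The trade-off in this design is that a slow database can cost usage records under sustained overload, but it can never add latency to a request.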

Nothing blocks on I/O

Auth is an HMAC-SHA256 hash + in-memory cache lookup. Model resolution is an in-memory map. Body mutation rewrites the model name and injects stream_options. Usage events are fire-and-forget into a buffered channel - the response is sent before the event hits the database.
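As a rough illustration of the auth step, the sketch below hashes an incoming API key with HMAC-SHA256 and looks the digest up in an in-memory map. The type names, the `keyInfo` fields, and the idea of keying the cache by hex digest are assumptions for the example; the post does not describe VoidLLM's actual data structures.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// keyInfo is a hypothetical record of what a valid key resolves to.
type keyInfo struct {
	Team string
}

// authCache maps the HMAC-SHA256 digest of an API key to its metadata,
// so plaintext keys never need to be held in the cache.
type authCache struct {
	secret []byte
	mu     sync.RWMutex
	byHash map[string]keyInfo
}

func newAuthCache(secret []byte) *authCache {
	return &authCache{secret: secret, byHash: make(map[string]keyInfo)}
}

// hashKey returns the hex-encoded HMAC-SHA256 of apiKey under the server secret.
func (c *authCache) hashKey(apiKey string) string {
	mac := hmac.New(sha256.New, c.secret)
	mac.Write([]byte(apiKey))
	return hex.EncodeToString(mac.Sum(nil))
}

// Add registers a key; in a real system this would be filled from the database.
func (c *authCache) Add(apiKey string, info keyInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.byHash[c.hashKey(apiKey)] = info
}

// Lookup costs one hash and one map read, with no I/O on the request path.
func (c *authCache) Lookup(apiKey string) (keyInfo, bool) {
	digest := c.hashKey(apiKey)
	c.mu.RLock()
	defer c.mu.RUnlock()
	info, ok := c.byHash[digest]
	return info, ok
}

func main() {
	cache := newAuthCache([]byte("server-side-secret"))
	cache.Add("example-api-key", keyInfo{Team: "platform"})

	info, ok := cache.Lookup("example-api-key")
	fmt.Println(info.Team, ok)

	_, ok = cache.Lookup("unknown-key")
	fmt.Println(ok)
}
```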

What we optimized

The MCP path saw the biggest improvement during optimization. The key changes:

In-memory caches for MCP server lookups and access control - eliminated database queries from the hot path
Persistent HTTP transport pool - reuses TCP connections to upstream MCP servers instead of dialing per request
Single JSON parse per MCP request - replaced three separate parses with one unified extraction
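The single-parse idea can be sketched as one struct that pulls out everything the gateway needs in a single `json.Unmarshal` call. The field selection here is an assumption based on the JSON-RPC 2.0 shape of an MCP `tools/call` request; which fields VoidLLM actually extracts is not stated in this post.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mcpEnvelope collects, in one decode, the fields a gateway might need for
// routing, access control, and logging. Fields not listed are ignored.
type mcpEnvelope struct {
	ID     json.RawMessage `json:"id"`     // kept raw: JSON-RPC ids may be numbers or strings
	Method string          `json:"method"` // e.g. "tools/call"
	Params struct {
		Name string `json:"name"` // tool name for tools/call requests
	} `json:"params"`
}

// parseMCP decodes the request body exactly once.
func parseMCP(body []byte) (mcpEnvelope, error) {
	var env mcpEnvelope
	err := json.Unmarshal(body, &env)
	return env, err
}

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":7,"method":"tools/call","params":{"name":"search","arguments":{"q":"latency"}}}`)
	env, err := parseMCP(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(env.Method, env.Params.Name, string(env.ID))
}
```

The tool arguments are never decoded here, which keeps the cost of the parse independent of how the arguments are structured; the original bytes can be forwarded upstream unchanged.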

The LLM proxy path was already fast. Most of the overhead there is inherent to what a proxy has to do: validate auth, resolve the model, mutate the request, and forward it.
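The "mutate the request" step might look roughly like the sketch below, which swaps the client-facing model name for the upstream one and adds `stream_options` to streaming requests. The function name is hypothetical, and both the `include_usage` value and the decision to inject only when `stream` is true are assumptions; the post only says that the model name is rewritten and `stream_options` is injected.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// mutateBody rewrites "model" to upstreamModel and, for streaming requests,
// sets "stream_options". All other fields pass through as raw JSON.
func mutateBody(body []byte, upstreamModel string) ([]byte, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(body, &fields); err != nil {
		return nil, err
	}
	if fields == nil {
		return nil, errors.New("request body must be a JSON object")
	}

	model, err := json.Marshal(upstreamModel)
	if err != nil {
		return nil, err
	}
	fields["model"] = model

	var stream bool
	if raw, ok := fields["stream"]; ok {
		_ = json.Unmarshal(raw, &stream) // a non-boolean value is treated as false
	}
	if stream {
		// Assumed value; this overwrites any stream_options the client sent.
		fields["stream_options"] = json.RawMessage(`{"include_usage":true}`)
	}
	return json.Marshal(fields)
}

func main() {
	in := []byte(`{"model":"alias","stream":true,"messages":[]}`)
	out, err := mutateBody(in, "upstream-model")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping untouched fields as `json.RawMessage` means messages and other large values are copied rather than decoded and re-encoded field by field.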

How to run it yourself

The benchmark is included in the repo:

go run ./scripts/bench quick --rps 1000 --duration 30s

Six scenarios are available: quick (sanity check), sustained (5 min at 5000 RPS), burst (spike and recovery), large-payload (100KB bodies), mixed (parallel LLM + MCP + Code Mode), and endurance (30 min stability).

The benchmark uses Vegeta as a Go library, not a CLI tool. It starts its own mock servers and VoidLLM instance - no external setup needed.

