What happens when your OpenAI endpoint goes down? Or when Azure is slow and you have a faster vLLM instance available? VoidLLM lets you put multiple deployments behind a single model name and handles routing automatically.
Most teams start with a single LLM provider. Then they add a second for redundancy, or a self-hosted model for cost reasons. Suddenly the client code needs to know about multiple endpoints, handle failover, and decide where to send each request.
That’s the proxy’s job, not the application’s.
In VoidLLM, a model can have multiple deployments. Each deployment is a separate upstream endpoint with its own provider, URL, and API key:
models:
- name: gpt-4o
strategy: round-robin
max_retries: 2
aliases: [default, smart]
deployments:
- name: azure-east
provider: azure
base_url: https://eastus.openai.azure.com
api_key: ${AZURE_EAST_KEY}
azure_deployment: my-gpt4o-east # your Azure deployment name
- name: azure-west
provider: azure
base_url: https://westus.openai.azure.com
api_key: ${AZURE_WEST_KEY}
azure_deployment: my-gpt4o-west
- name: openai-direct
provider: openai
base_url: https://api.openai.com/v1
api_key: ${OPENAI_KEY}
Your app sends model: "default". VoidLLM picks a deployment.
ℹSame model, different providers
Each deployment should serve the same (or equivalent) LLM. VoidLLM sends the model name to every upstream - so each deployment must recognize it. This works great across regions (Azure East + West) or across providers (Azure + OpenAI direct) for the same LLM.
⚠What does NOT work
Mixing different LLMs in one deployment group (e.g. GPT-4o + Llama 70B as fallback). VoidLLM sends the same model name to all deployments - an upstream that doesn’t recognize it will reject the request. Cross-model failover chains are on the Enterprise roadmap.
graph TD
R[Request] --> S{Strategy}
S -->|round-robin| RR[Next in rotation]
S -->|least-latency| LL[Fastest P50]
S -->|weighted| W[By weight ratio]
S -->|priority| P[Highest priority first]
RR --> D1[Deploy A]
LL --> D2[Deploy B]
W --> D1
P --> D3[Deploy C]
style S fill:#8b5cf6,stroke:#6366f1,color:#fff
style R fill:#1a1a24,stroke:#333,color:#e2e8f0
style D1 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
style D2 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
style D3 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0 When a deployment returns 5xx, times out, or can’t connect, VoidLLM retries on the next available deployment. The max_retries setting controls how many deployments to try before giving up.
graph LR
A[Request] --> D1[Azure East]
D1 -->|5xx| R1{Retry}
R1 --> D2[Azure West]
D2 -->|timeout| R2{Retry}
R2 --> D3[OpenAI Direct]
D3 -->|200 OK| B[Response]
style A fill:#1a1a24,stroke:#333,color:#e2e8f0
style D1 fill:#1a1a24,stroke:#ef4444,color:#e2e8f0
style D2 fill:#1a1a24,stroke:#ef4444,color:#e2e8f0
style D3 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
style B fill:#1a1a24,stroke:#333,color:#e2e8f0 This happens transparently - the client sees a normal response. Usage tracking records which deployment actually handled the request.
Each deployment has its own circuit breaker. After a configurable number of consecutive failures, the circuit opens and the deployment is temporarily skipped. This prevents slow or broken endpoints from adding latency to every request while they’re retried and timed out.
settings:
circuit_breaker:
enabled: true
threshold: 5 # failures before opening
timeout: 30s # how long to stay open
half_open_max: 1 # test requests before closing
VoidLLM continuously probes each deployment’s health. Unhealthy deployments are excluded from routing automatically - no manual intervention needed. If all deployments are unhealthy, VoidLLM falls back to trying them anyway (better than returning nothing).
ℹAll of this is free
Load balancing, failover, circuit breakers, and health probing are Community features - no enterprise license required. See our benchmark numbers for the overhead cost (under 500 microseconds).
Deployments can be configured via YAML (shown above) or through the Admin API and UI. The Models page shows each deployment’s health status, provider, and base URL in an expandable row.
VoidLLM's Code Mode lets AI agents orchestrate multiple MCP tool calls in a single WASM-sandboxed JavaScript execution. No round-trips, no latency penalty.
MCP tools advertise inputs but not outputs. We taught Code Mode to learn return types from the first successful call and surface them as TypeScript on the next discovery.
VoidLLM acts as an MCP gateway - proxy, manage, and control access to external MCP servers from one place.
Step-by-step setup for using VoidLLM as your LLM proxy in Cursor and Windsurf, and as an MCP server in Claude Code.