Load Balancing and Failover Across LLM Providers

What happens when your OpenAI endpoint goes down? Or when Azure is slow and you have a faster vLLM instance available? VoidLLM lets you put multiple deployments behind a single model name and handles routing automatically.

The problem

Most teams start with a single LLM provider. Then they add a second for redundancy, or a self-hosted model for cost reasons. Suddenly the client code needs to know about multiple endpoints, handle failover, and decide where to send each request.

That’s the proxy’s job, not the application’s.

Multi-deployment models

In VoidLLM, a model can have multiple deployments. Each deployment is a separate upstream endpoint with its own provider, URL, and API key:

models:
  - name: gpt-4o
    strategy: round-robin
    max_retries: 2
    aliases: [default, smart]
    deployments:
      - name: azure-east
        provider: azure
        base_url: https://eastus.openai.azure.com
        api_key: ${AZURE_EAST_KEY}
        azure_deployment: my-gpt4o-east  # your Azure deployment name
      - name: azure-west
        provider: azure
        base_url: https://westus.openai.azure.com
        api_key: ${AZURE_WEST_KEY}
        azure_deployment: my-gpt4o-west
      - name: openai-direct
        provider: openai
        base_url: https://api.openai.com/v1
        api_key: ${OPENAI_KEY}

Your app sends model: "default", one of the aliases for gpt-4o in the config above. VoidLLM picks a deployment.
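
To make the alias step concrete, here is a minimal sketch of how a name like "default" could map back to the gpt-4o entry and its deployments. It mirrors the YAML above as plain Python data. The MODELS structure and the resolve_model function are hypothetical names for illustration, not VoidLLM's internals.

```python
# Illustrative only: the YAML config above, reduced to plain Python data.
MODELS = [
    {
        "name": "gpt-4o",
        "aliases": ["default", "smart"],
        "deployments": ["azure-east", "azure-west", "openai-direct"],
    }
]


def resolve_model(requested: str) -> dict:
    """Return the model entry whose name or alias matches the requested name."""
    for model in MODELS:
        if requested == model["name"] or requested in model["aliases"]:
            return model
    raise KeyError(f"unknown model: {requested}")


model = resolve_model("default")
print(model["name"], model["deployments"])
```

The client only ever needs the alias. Which deployment serves the request is decided behind the proxy.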

ℹ Same model, different providers

Each deployment should serve the same (or equivalent) LLM. VoidLLM sends the model name to every upstream - so each deployment must recognize it. This works great across regions (Azure East + West) or across providers (Azure + OpenAI direct) for the same LLM.

⚠ What does NOT work

Mixing different LLMs in one deployment group (e.g. GPT-4o + Llama 70B as fallback). VoidLLM sends the same model name to all deployments - an upstream that doesn’t recognize it will reject the request. Cross-model failover chains are on the Enterprise roadmap.

Four routing strategies

graph TD
  R[Request] --> S{Strategy}
  S -->|round-robin| RR[Next in rotation]
  S -->|least-latency| LL[Fastest P50]
  S -->|weighted| W[By weight ratio]
  S -->|priority| P[Highest priority first]

  RR --> D1[Deploy A]
  LL --> D2[Deploy B]
  W --> D1
  P --> D3[Deploy C]

  style S fill:#8b5cf6,stroke:#6366f1,color:#fff
  style R fill:#1a1a24,stroke:#333,color:#e2e8f0
  style D1 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style D2 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style D3 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0

Four strategies determine which deployment handles each request.

Round-robin - equal distribution, simplest option
Least-latency - routes to the deployment with the lowest recent P50 response time
Weighted - distribute by configured weight (e.g. 70% Azure, 30% OpenAI)
Priority - always use the highest-priority deployment, fall back when it’s down
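
The four strategies can be sketched in a few lines each. This is an illustration of the selection logic only, not VoidLLM's code. The Deployment class and its weight, priority, and p50_ms fields are assumptions made for the example, as is the convention that a lower priority number is tried first.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    weight: int = 1       # hypothetical field, used by "weighted"
    priority: int = 0     # hypothetical field; assumption: lower number wins
    p50_ms: float = 0.0   # recent median latency, used by "least-latency"


def make_round_robin(deployments):
    """Return a picker that cycles through the deployments in order."""
    cycle = itertools.cycle(deployments)
    return lambda: next(cycle)


def pick_least_latency(deployments):
    """Pick the deployment with the lowest recent P50 latency."""
    return min(deployments, key=lambda d: d.p50_ms)


def pick_weighted(deployments, rng=random):
    """Pick at random, in proportion to each deployment's weight."""
    weights = [d.weight for d in deployments]
    return rng.choices(deployments, weights=weights, k=1)[0]


def pick_priority(deployments):
    """Always pick the highest-priority deployment in the list."""
    return min(deployments, key=lambda d: d.priority)


deployments = [
    Deployment("azure", weight=7, priority=1, p50_ms=420.0),
    Deployment("openai", weight=3, priority=2, p50_ms=310.0),
]
next_rr = make_round_robin(deployments)
print(next_rr().name, next_rr().name, next_rr().name)
print(pick_least_latency(deployments).name)
print(pick_priority(deployments).name)
```

For priority to fall back when the preferred deployment is down, the list passed to pick_priority would contain only the deployments that are currently available.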

Automatic failover

When a deployment returns a 5xx, times out, or can’t be reached, VoidLLM retries on the next available deployment. The max_retries setting controls how many deployments to try before giving up.

graph LR
  A[Request] --> D1[Azure East]
  D1 -->|5xx| R1{Retry}
  R1 --> D2[Azure West]
  D2 -->|timeout| R2{Retry}
  R2 --> D3[OpenAI Direct]
  D3 -->|200 OK| B[Response]

  style A fill:#1a1a24,stroke:#333,color:#e2e8f0
  style D1 fill:#1a1a24,stroke:#ef4444,color:#e2e8f0
  style D2 fill:#1a1a24,stroke:#ef4444,color:#e2e8f0
  style D3 fill:#1a1a24,stroke:#22c55e,color:#e2e8f0
  style B fill:#1a1a24,stroke:#333,color:#e2e8f0

Automatic failover: Azure East fails, Azure West times out, OpenAI succeeds.

This happens transparently - the client sees a normal response. Usage tracking records which deployment actually handled the request.
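
The retry loop behind the diagram can be sketched roughly as follows. This is a simplified illustration, not VoidLLM's implementation. The send callable and the UpstreamError class are hypothetical, and the sketch assumes max_retries counts retries after the first attempt, so max_retries: 2 allows three attempts in total.

```python
class UpstreamError(Exception):
    """Stands in for a 5xx response, a timeout, or a connection failure."""


def call_with_failover(deployments, send, max_retries=2):
    """Try deployments in order and return (deployment, response) on first success.

    `send(deployment)` is a hypothetical callable that performs the upstream
    request and raises UpstreamError on a retryable failure.
    """
    # Assumption: one initial attempt plus up to max_retries retries.
    attempts = deployments[: max_retries + 1]
    last_error = None
    for deployment in attempts:
        try:
            return deployment, send(deployment)
        except UpstreamError as exc:
            last_error = exc  # remember the failure and move on
    raise last_error if last_error else UpstreamError("no deployments configured")


def fake_send(deployment):
    """Simulate the diagram: both Azure regions fail, OpenAI answers."""
    if deployment in ("azure-east", "azure-west"):
        raise UpstreamError(f"{deployment} failed")
    return "200 OK"


used, response = call_with_failover(
    ["azure-east", "azure-west", "openai-direct"], fake_send
)
print(used, response)  # prints: openai-direct 200 OK
```

Returning the deployment alongside the response is what lets usage tracking attribute the request to the upstream that actually served it.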

Circuit breakers

Each deployment has its own circuit breaker. After a configurable number of consecutive failures, the circuit opens and the deployment is temporarily skipped. This keeps a slow or broken endpoint from adding latency to every request while VoidLLM waits for it to time out and then retries elsewhere.

settings:
  circuit_breaker:
    enabled: true
    threshold: 5       # consecutive failures before opening
    timeout: 30s       # how long to stay open
    half_open_max: 1   # test requests before closing
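
For readers new to the pattern, here is a minimal circuit breaker sketch that uses the same three settings. It is a generic illustration of the closed, open, and half-open states, not VoidLLM's code. As a simplification, the first successful test request closes the circuit, and the clock parameter exists only to make the example testable.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30.0, half_open_max=1,
                 clock=time.monotonic):
        self.threshold = threshold          # consecutive failures before opening
        self.timeout = timeout              # seconds to stay open
        self.half_open_max = half_open_max  # test requests allowed when half-open
        self.clock = clock
        self.failures = 0       # consecutive failures while closed
        self.opened_at = None   # None means the circuit is closed
        self.trials = 0         # test requests admitted while half-open

    def allow(self):
        """Return True if a request may be sent to this deployment."""
        if self.opened_at is None:
            return True                      # closed: normal traffic
        if self.clock() - self.opened_at < self.timeout:
            return False                     # open: skip this deployment
        if self.trials < self.half_open_max:
            self.trials += 1                 # half-open: admit a test request
            return True
        return False

    def record_success(self):
        """Close the circuit and reset all counters."""
        self.failures = 0
        self.opened_at = None
        self.trials = 0

    def record_failure(self):
        if self.opened_at is not None:
            # A failed test request re-opens the circuit for another timeout.
            self.opened_at = self.clock()
            self.trials = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
            self.trials = 0


breaker = CircuitBreaker(threshold=5, timeout=30.0, half_open_max=1)
for _ in range(5):
    breaker.record_failure()
print(breaker.allow())  # prints: False
```

A router would call allow() before sending to a deployment and report the outcome with record_success() or record_failure() afterwards.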

Health-aware routing

VoidLLM continuously probes each deployment’s health. Unhealthy deployments are excluded from routing automatically - no manual intervention needed. If all deployments are unhealthy, VoidLLM falls back to trying them anyway (better than returning nothing).
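
The fallback rule is easy to express in code. The sketch below shows only the filtering step, under the assumption that some health check result is available per deployment. The routable function and the is_healthy callable are hypothetical names.

```python
def routable(deployments, is_healthy):
    """Return the healthy deployments, or all of them if none is healthy."""
    healthy = [d for d in deployments if is_healthy(d)]
    # Falling back to the full list means a request is still attempted
    # even when every health probe is currently failing.
    return healthy if healthy else list(deployments)


status = {"azure-east": False, "azure-west": True, "openai-direct": True}
print(routable(list(status), lambda name: status[name]))
```

The routing strategy then picks from whatever this step returns.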

ℹ All of this is free

Load balancing, failover, circuit breakers, and health probing are Community features - no enterprise license required. See our benchmark numbers for the overhead cost (under 500 microseconds).

Managing deployments

Deployments can be configured via YAML (shown above) or through the Admin API and UI. The Models page shows each deployment’s health status, provider, and base URL in an expandable row.
