Gateway API
OpenAI-compatible API for direct model access
Drop-in replacement for OpenAI SDK. Works with any OpenAI-compatible client.
Gateway vs Chat API
Use the Gateway API (/api/v1/chat/completions) for direct, single-model access with advanced features like fallback routing, caching, load balancing, and guardrails. It returns standard OpenAI-format responses.
Use the Chat API (/api/chat) for multi-model battles, debates with voting, and autonomous web search. It returns custom SSE events.
Create Chat Completion
POST /api/v1/chat/completions
Request
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID or alias (see Models for the full list) |
| `messages` | array | Yes | Array of message objects (supports multimodal content with images) |
| `stream` | boolean | No | Enable SSE streaming (default: `false`) |
| `temperature` | number | No | 0-2, controls randomness (default: 0.7) |
| `max_tokens` | number | No | Max tokens to generate |
| `top_p` | number | No | Nucleus sampling (0-1) |
| `frequency_penalty` | number | No | Penalize frequent tokens (-2 to 2) |
| `presence_penalty` | number | No | Penalize repeated topics (-2 to 2) |
| `stop` | string/array | No | Stop sequences |
| `tools` | array | No | Tool/function definitions the model may call (see Tool Use) |
| `tool_choice` | string/object | No | `"auto"` (default), `"none"`, `"required"`, or `{"type": "function", "function": {"name": ...}}` |
| `fallback` | string[] | No | Fallback models if the primary fails (e.g. `["gpt", "gemini"]`) |
| `retries` | number | No | Max retries on transient errors, with exponential backoff (default: 2, max: 5) |
| `timeout` | number | No | Request timeout in ms (default: 60000, max: 300000) |
| `load_balance` | object | No | Load balancing config (see below) |
| `guardrails` | object | No | Input/output guardrails (see below) |
| `prompt_id` | string | No | Langfuse prompt template name |
| `prompt_version` | number | No | Prompt template version (default: production) |
| `prompt_variables` | object | No | Variables to substitute in the prompt template |
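Putting the core parameters together, a minimal non-streaming request body might look like this (the `claude` alias and message contents are illustrative; see Models for actual IDs):

```json
{
  "model": "claude",
  "messages": [
    { "role": "system", "content": "You are a concise assistant." },
    { "role": "user", "content": "Summarize SSE in one sentence." }
  ],
  "temperature": 0.7,
  "max_tokens": 256
}
```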
Request Headers
| Header | Description |
|---|---|
| `X-Cache: no-cache` | Skip the response cache |
| `X-Cache-TTL: 3600` | Cache TTL in seconds (default: 3600, max: 86400) |
| `X-Guardrails: pii,content_moderation` | Alternative to body `guardrails` |
Response
Response Headers
| Header | Description |
|---|---|
| `X-Request-ID` | Unique request identifier |
| `X-Model-Used` | Actual model that served the request |
| `X-Cache: HIT/MISS` | Cache status (non-streaming only) |
| `X-Fallback-From` | Original model if fallback was used |
| `X-Retry-Count` | Number of retries attempted |
| `X-Guardrail-Status` | Guardrail results (e.g. `pii:redact,content_moderation:pass`) |
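Since the Gateway returns standard OpenAI-format responses, a successful non-streaming response body follows the usual chat completion shape (all values below are illustrative):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1728000000,
  "model": "claude",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello!" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15 }
}
```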
Gateway Features
Fallback Routing
Automatically try backup models if the primary fails:
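As a sketch, a request that falls back from Claude to GPT and then Gemini, with retries and a tighter timeout (model aliases illustrative):

```json
{
  "model": "claude",
  "messages": [{ "role": "user", "content": "Hello" }],
  "fallback": ["gpt", "gemini"],
  "retries": 2,
  "timeout": 30000
}
```

If a fallback model serves the request, the response carries `X-Fallback-From` with the original model.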
Response Caching
Responses are cached automatically for non-streaming requests (1 hour default). Control via headers:
See Caching with tools for how tool-use requests interact with the cache and how to use Anthropic cache_control for prompt caching.
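For example, to bypass the cache for one request while setting a longer TTL for future writes (auth scheme as described in Authentication):

```http
POST /api/v1/chat/completions HTTP/1.1
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json
X-Cache: no-cache
X-Cache-TTL: 7200
```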
Load Balancing
Distribute requests across models:
Strategies: weighted, round-robin, least-latency.
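The exact `load_balance` schema is not spelled out on this page; as an illustrative sketch only (the `strategy`, `models`, and `weights` field names are assumptions, not confirmed here), a weighted config might look like:

```json
{
  "model": "claude",
  "messages": [{ "role": "user", "content": "Hello" }],
  "load_balance": {
    "strategy": "weighted",
    "models": ["claude", "gpt"],
    "weights": [0.7, 0.3]
  }
}
```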
Guardrails
Pre-process input and post-process output:
Available guardrails:
- `pii`: detects and redacts emails, phone numbers, SSNs, credit cards, and IP addresses
- `content_moderation`: blocks dangerous content
- `schema_validation`: validates output against a JSON schema
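A sketch of a request-body guardrails config (the `input`/`output` field names are assumptions inferred from "input/output guardrails"; the `X-Guardrails` header form above is the documented alternative):

```json
{
  "model": "gpt",
  "messages": [{ "role": "user", "content": "My email is jane@example.com" }],
  "guardrails": {
    "input": ["pii"],
    "output": ["content_moderation"]
  }
}
```

The response's `X-Guardrail-Status` header reports what each guardrail did, e.g. `pii:redact,content_moderation:pass`.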
Prompt Templates (Langfuse)
Use managed prompt templates:
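Using the documented `prompt_id`, `prompt_version`, and `prompt_variables` parameters (template name and variables below are illustrative; versions default to the production label when `prompt_version` is omitted):

```json
{
  "model": "gpt",
  "messages": [{ "role": "user", "content": "Help me with my order." }],
  "prompt_id": "support-triage",
  "prompt_version": 3,
  "prompt_variables": { "customer_name": "Ada", "tier": "pro" }
}
```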
BYOK (Bring Your Own Keys)
Store your own provider API keys via the dashboard. When making requests, your key is automatically used instead of the platform key. See Authentication > BYOK for setup.
Vision Support
The Gateway supports images using OpenAI's multimodal message format. Vision-enabled models (GPT, Claude, Gemini, Grok) see the image directly. Non-vision models receive an AI-generated description transparently. See Models > Vision Support for the full matrix.
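An image is passed as a content-part array in the standard OpenAI multimodal format (URL illustrative):

```json
{
  "model": "gpt",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image?" },
        { "type": "image_url", "image_url": { "url": "https://example.com/cat.png" } }
      ]
    }
  ]
}
```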
Streaming
Set "stream": true to receive SSE frames that match OpenAI's shape exactly:
See Streaming for the full SSE grammar, tool-call delta shape, and per-provider streaming quirks.
System Messages
Each model handles system messages in its native format (e.g., Claude uses the system parameter, OpenAI uses instructions). This is transparent — just use the standard "role": "system" format.
Limitations
The Gateway API is a transparent passthrough — it does not include autonomous web search. For web-search-augmented responses with citations, use the Chat API with battle or fight mode.
Tool Use (Function Calling)
The Gateway supports OpenAI-compatible tool use across every supported provider — Anthropic Claude, Google Gemini, OpenAI, xAI Grok, DeepSeek, Mistral, Kimi, Llama, and MiniMax. It plugs into the Vercel AI SDK out of the box.
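As a sketch, a tool-enabled request reuses the standard OpenAI `tools` shape (the `get_weather` function here is a hypothetical example, not a built-in):

```json
{
  "model": "claude",
  "messages": [{ "role": "user", "content": "What's the weather in Paris?" }],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```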
See the tool-use docs for the full contract:
- Tool Use: defining tools, `tool_choice`, the `role: "tool"` round-trip, error codes.
- Streaming: SSE frame grammar, tool-call delta shape, invariants.
- Caching with tools: response cache bypass and Anthropic `cache_control` passthrough.
- Provider compatibility: per-provider matrix and known quirks.
Tool results are never cached
Responses that contain tool_calls bypass the Gateway's response cache (X-Cache: BYPASS). Tool outputs are stateful — caching them would return stale results.