Caching with Tools

How the Gateway handles response caching and prompt caching when tools are in play

The Gateway runs two distinct caching layers. The important rule when tools are involved: tool responses are never cached, but prompt caching still works for large system prompts and tool schemas.

Response cache

The Gateway caches full responses in Upstash Redis, keyed on a SHA-256 hash of model + messages + temperature. The default TTL is 1 hour and can be adjusted per request via headers.

Header              Effect
X-Cache: no-cache   Skip the cache — always hit the provider.
X-Cache-TTL: 3600   Override the cache TTL in seconds. Range: 60–86400.
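
For example, a minimal sketch using the same endpoint, API key, and claude model alias as the verification scripts further down this page:

# Skip the response cache for this request
curl -sS "https://concurred.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $CONCURRED_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache: no-cache" \
  -d '{ "model": "claude", "messages": [{ "role": "user", "content": "Say OK." }] }'

# Cache this response for 10 minutes instead of the default 1 hour
curl -sS "https://concurred.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $CONCURRED_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-TTL: 600" \
  -d '{ "model": "claude", "messages": [{ "role": "user", "content": "Say OK." }] }'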

Response cache outcomes are surfaced on the response:

X-Cache value   Meaning
HIT             Served from cache.
MISS            Cache miss — response was generated and stored.
BYPASS          Cache intentionally skipped (tool-use responses, X-Cache: no-cache).
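
To check which outcome you got, dump the response headers; a sketch (-D - prints the headers, -o /dev/null discards the body):

curl -sS -D - -o /dev/null "https://concurred.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $CONCURRED_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "claude", "messages": [{ "role": "user", "content": "Say OK." }] }' \
  | grep -i '^x-cache:'

Run it twice: the first call should print MISS and the second HIT, provided the second lands within the TTL.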

Why tool-use responses bypass the cache

Any response with finish_reason: "tool_calls" — and any request carrying a tools array — sets X-Cache: BYPASS automatically. Tool outputs are stateful: the model expects your next turn to execute the call and return a role:"tool" result with fresh data. Serving a stale cached tool call would produce incorrect behavior at step N+1.
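
As a sketch of that flow (the get_weather tool, its arguments, and the call ID are hypothetical, and the tools array is elided for brevity), the follow-up request replays the model's tool call and attaches a fresh role:"tool" result:

{
  "model": "claude",
  "messages": [
    { "role": "user", "content": "What's the weather in Lisbon?" },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": { "name": "get_weather", "arguments": "{\"city\": \"Lisbon\"}" }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"temp_c\": 19, \"condition\": \"sunny\"}"
    }
  ]
}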

If you send two identical tool-use requests back-to-back, both hit the provider. This is by design.

Verify it yourself

Send two identical tools-bearing requests. Both responses will carry X-Cache: BYPASS, and your Langfuse spans will show two distinct provider calls.
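
A sketch of that check (the get_weather tool is hypothetical; -D - surfaces the X-Cache header on each response):

for i in 1 2; do
  echo "=== Request $i ==="
  curl -sS -D - -o /dev/null "https://concurred.ai/api/v1/chat/completions" \
    -H "Authorization: Bearer $CONCURRED_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "claude",
      "messages": [{ "role": "user", "content": "What is the weather in Lisbon?" }],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": { "city": { "type": "string" } },
            "required": ["city"]
          }
        }
      }]
    }' | grep -i '^x-cache:'
done

Both iterations should print x-cache: BYPASS.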

Prompt caching (Anthropic)

Claude supports explicit cache markers that cache large system prompts and tool schemas across turns — cached input tokens are billed at roughly one-tenth of the normal input rate, which adds up to a large saving on repeated long prompts.

Attach cache_control: { "type": "ephemeral" } to a system message, a user message, or an individual tool definition:

{
  "model": "claude",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior staff engineer at a large tech company...",
      "cache_control": { "type": "ephemeral" }
    },
    { "role": "user", "content": "Review this PR." }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "search_code",
        "description": "...",
        "parameters": { /* large schema */ }
      },
      "cache_control": { "type": "ephemeral" }
    }
  ]
}

On subsequent requests with the same prefix, Anthropic serves the cached tokens and you see savings in the response:

{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 48,
    "total_tokens": 360,
    "prompt_tokens_details": { "cached_tokens": 280 }
  }
}

Notes

  • Anthropic requires the cached prefix to be at least ~1024 tokens — short system prompts won't cache.
  • Ephemeral cache has a 5-minute TTL. For longer-lived prefixes, Anthropic offers a 1-hour TTL variant (cache_control: { "type": "ephemeral", "ttl": "1h" }); contact support to enable it on the Gateway.
  • prompt_tokens follows OpenAI convention: it counts input tokens including cached hits, but excludes cache-creation tokens on the first-write turn. Cache-hit tokens are surfaced separately as prompt_tokens_details.cached_tokens.

Verify it yourself

Fire the same Claude request twice within the 5-minute window. The first call writes the cache; the second should report a non-zero cached_tokens. The SYSTEM_PROMPT below is padded to ~1200 tokens to clear Anthropic's minimum cacheable-prefix threshold.

export CONCURRED_API_KEY=ck_your_key
export SYSTEM_PROMPT="$(printf 'You are a precise senior staff engineer. %.0s' {1..200})"
 
for i in 1 2; do
  echo "=== Request $i ==="
  curl -sS "https://concurred.ai/api/v1/chat/completions" \
    -H "Authorization: Bearer $CONCURRED_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"claude\",
      \"messages\": [
        {
          \"role\": \"system\",
          \"content\": \"$SYSTEM_PROMPT\",
          \"cache_control\": { \"type\": \"ephemeral\" }
        },
        { \"role\": \"user\", \"content\": \"Say OK.\" }
      ]
    }" | jq '.usage'
done

Expected output shape:

// Request 1 — cache miss, writes the prefix
{ "prompt_tokens": 1203, "completion_tokens": 2, "total_tokens": 1205,
  "prompt_tokens_details": { "cached_tokens": 0 } }
 
// Request 2 — cache hit
{ "prompt_tokens": 1203, "completion_tokens": 2, "total_tokens": 1205,
  "prompt_tokens_details": { "cached_tokens": 1180 } }

The exact cached_tokens number depends on tokenizer details — the key signal is that it goes from 0 on the first call to a large non-zero value on the second.

Prompt caching (OpenAI)

OpenAI caches automatically for prompts over ~1024 tokens — no cache_control field needed. The Gateway passes cache_control through harmlessly if provided; OpenAI ignores it.

When a cache hit occurs, usage.prompt_tokens_details.cached_tokens reflects the hit.
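
A sketch mirroring the Claude script above; the gpt-4o model name is an assumption, so substitute whichever OpenAI model alias your Gateway routes, and the repeated system prompt only exists to clear the ~1024-token threshold:

export CONCURRED_API_KEY=ck_your_key
export SYSTEM_PROMPT="$(printf 'You are a precise senior staff engineer. %.0s' {1..200})"

for i in 1 2; do
  echo "=== Request $i ==="
  curl -sS "https://concurred.ai/api/v1/chat/completions" \
    -H "Authorization: Bearer $CONCURRED_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"gpt-4o\",
      \"messages\": [
        { \"role\": \"system\", \"content\": \"$SYSTEM_PROMPT\" },
        { \"role\": \"user\", \"content\": \"Say OK.\" }
      ]
    }" | jq '.usage'
done

No cache_control is needed; the first call should report cached_tokens: 0 and the second a non-zero value.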

Prompt caching (other providers)

Prompt caching is not yet wired up for the remaining providers (Google Gemini, xAI Grok, DeepSeek, Mistral, Kimi, Llama, and MiniMax): cache_control is ignored and cached_tokens is reported as 0.
