Caching with Tools

How the Gateway handles response caching and prompt caching when tools are in play

The Gateway runs two distinct caching layers. The important rule when tools are involved: tool responses are never cached, but prompt caching still works for large system prompts and tool schemas.

Response cache

The Gateway caches full responses in Upstash Redis, keyed on a SHA-256 hash of model + messages + temperature. The default TTL is 1 hour and can be adjusted per request via headers.

Header              Effect
X-Cache: no-cache   Skip the cache — always hit the provider.
X-Cache-TTL: 3600   Override the cache TTL in seconds. Range: 60–86400.
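
For example, a minimal sketch using the same endpoint, API key, and claude model alias as the verification scripts further down this page:

# Skip the response cache for this request
curl -sS "https://concurred.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $CONCURRED_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache: no-cache" \
  -d '{ "model": "claude", "messages": [{ "role": "user", "content": "Say OK." }] }'

# Cache this response for 10 minutes instead of the default 1 hour
curl -sS "https://concurred.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $CONCURRED_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-TTL: 600" \
  -d '{ "model": "claude", "messages": [{ "role": "user", "content": "Say OK." }] }'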

Response cache outcomes are surfaced on the response:

X-Cache value   Meaning
HIT             Served from cache.
MISS            Cache miss — response was generated and stored.
BYPASS          Cache intentionally skipped (tool-use responses, X-Cache: no-cache).
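
To check which outcome you got, dump the response headers; a sketch (-D - prints the headers, -o /dev/null discards the body):

curl -sS -D - -o /dev/null "https://concurred.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $CONCURRED_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "claude", "messages": [{ "role": "user", "content": "Say OK." }] }' \
  | grep -i '^x-cache:'

Run it twice: the first call should print MISS and the second HIT, provided the second lands within the TTL.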

Why tool-use responses bypass the cache

Any response with finish_reason: "tool_calls" — and any request carrying a tools array — sets X-Cache: BYPASS automatically. Tool outputs are stateful: the model expects your next turn to execute the call and return a role:"tool" result with fresh data. Serving a stale cached tool call would produce incorrect behavior at step N+1.
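
As a sketch of that flow (the get_weather tool, its arguments, and the call ID are hypothetical, and the tools array is elided for brevity), the follow-up request replays the model's tool call and attaches a fresh role:"tool" result:

{
  "model": "claude",
  "messages": [
    { "role": "user", "content": "What's the weather in Lisbon?" },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": { "name": "get_weather", "arguments": "{\"city\": \"Lisbon\"}" }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"temp_c\": 19, \"condition\": \"sunny\"}"
    }
  ]
}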

If you send two identical tool-use requests back-to-back, both hit the provider. This is by design.

Verify it yourself

Send two identical tools-bearing requests. Both responses will carry X-Cache: BYPASS, and your Langfuse spans will show two distinct provider calls.
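
A sketch of that check (the get_weather tool is hypothetical; -D - surfaces the X-Cache header on each response):

for i in 1 2; do
  echo "=== Request $i ==="
  curl -sS -D - -o /dev/null "https://concurred.ai/api/v1/chat/completions" \
    -H "Authorization: Bearer $CONCURRED_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "claude",
      "messages": [{ "role": "user", "content": "What is the weather in Lisbon?" }],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": { "city": { "type": "string" } },
            "required": ["city"]
          }
        }
      }]
    }' | grep -i '^x-cache:'
done

Both iterations should print x-cache: BYPASS.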

Prompt caching (Anthropic)

Claude supports explicit cache markers that cache large system prompts and tool schemas across turns — cached input tokens are billed at roughly one-tenth of the normal input rate, which adds up to a large saving on repeated long prompts.

Attach cache_control: { "type": "ephemeral" } to a system message, a user message, or an individual tool definition:

{
  "model": "claude",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior staff engineer at a large tech company...",
      "cache_control": { "type": "ephemeral" }
    },
    { "role": "user", "content": "Review this PR." }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "search_code",
        "description": "...",
        "parameters": { /* large schema */ }
      },
      "cache_control": { "type": "ephemeral" }
    }
  ]
}

On subsequent requests with the same prefix, Anthropic serves the cached tokens and you see savings in the response:

{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 48,
    "total_tokens": 360,
    "prompt_tokens_details": { "cached_tokens": 280 }
  }
}

Notes

  • Anthropic requires the cached prefix to be at least ~1024 tokens — short system prompts won't cache.
  • Ephemeral cache has a 5-minute TTL. For longer-lived prefixes, Anthropic offers a 1-hour TTL variant (cache_control: { "type": "ephemeral", "ttl": "1h" }); contact support to enable it on the Gateway.
  • prompt_tokens follows OpenAI convention: it counts input tokens including cached hits, but excludes cache-creation tokens on the first-write turn. Cache-hit tokens are surfaced separately as prompt_tokens_details.cached_tokens.

Verify it yourself

Fire the same Claude request twice within the 5-minute window. The first call writes the cache; the second should report a non-zero cached_tokens. The SYSTEM_PROMPT below is padded to ~1200 tokens to clear Anthropic's minimum cacheable-prefix threshold.

export CONCURRED_API_KEY=ck_your_key
export SYSTEM_PROMPT="$(printf 'You are a precise senior staff engineer. %.0s' {1..200})"
 
for i in 1 2; do
  echo "=== Request $i ==="
  curl -sS "https://concurred.ai/api/v1/chat/completions" \
    -H "Authorization: Bearer $CONCURRED_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"claude\",
      \"messages\": [
        {
          \"role\": \"system\",
          \"content\": \"$SYSTEM_PROMPT\",
          \"cache_control\": { \"type\": \"ephemeral\" }
        },
        { \"role\": \"user\", \"content\": \"Say OK.\" }
      ]
    }" | jq '.usage'
done

Expected output shape:

// Request 1 — cache miss, writes the prefix
{ "prompt_tokens": 1203, "completion_tokens": 2, "total_tokens": 1205,
  "prompt_tokens_details": { "cached_tokens": 0 } }
 
// Request 2 — cache hit
{ "prompt_tokens": 1203, "completion_tokens": 2, "total_tokens": 1205,
  "prompt_tokens_details": { "cached_tokens": 1180 } }

The exact cached_tokens number depends on tokenizer details — the key signal is that it goes from 0 on the first call to a large non-zero value on the second.

Prompt caching (OpenAI)

OpenAI caches automatically for prompts over ~1024 tokens — no cache_control field needed. The Gateway passes cache_control through harmlessly if provided; OpenAI ignores it.

When a cache hit occurs, usage.prompt_tokens_details.cached_tokens reflects the hit.
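
A sketch mirroring the Claude script above; the gpt-4o model name is an assumption, so substitute whichever OpenAI model alias your Gateway routes, and the repeated system prompt only exists to clear the ~1024-token threshold:

export CONCURRED_API_KEY=ck_your_key
export SYSTEM_PROMPT="$(printf 'You are a precise senior staff engineer. %.0s' {1..200})"

for i in 1 2; do
  echo "=== Request $i ==="
  curl -sS "https://concurred.ai/api/v1/chat/completions" \
    -H "Authorization: Bearer $CONCURRED_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"gpt-4o\",
      \"messages\": [
        { \"role\": \"system\", \"content\": \"$SYSTEM_PROMPT\" },
        { \"role\": \"user\", \"content\": \"Say OK.\" }
      ]
    }" | jq '.usage'
done

No cache_control is needed; the first call should report cached_tokens: 0 and the second a non-zero value.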

Prompt caching (other providers)

Prompt caching is not yet wired up for the remaining providers (Google Gemini, xAI Grok, DeepSeek, Mistral, Kimi, Llama, and MiniMax): cache_control is ignored and cached_tokens is reported as 0.
