Caching with Tools
How the Gateway handles response caching and prompt caching when tools are in play
The Gateway runs two distinct caching layers. The important rule when tools are involved: tool responses are never cached, but prompt caching still works for large system prompts and tool schemas.
Response cache
The Gateway caches full responses keyed on model + messages + temperature (SHA-256) in Upstash Redis. Default TTL is 1 hour; controllable via request headers.
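A minimal sketch of how such a key can be derived, assuming a straightforward canonical-JSON serialization (the Gateway's exact serialization is not documented here):

```python
import hashlib
import json

def response_cache_key(model: str, messages: list, temperature: float) -> str:
    # Illustrative only: serialize the fields that participate in the cache key
    # deterministically, then hash them with SHA-256.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```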
| Header | Effect |
|---|---|
| X-Cache: no-cache | Skip the cache — always hit the provider. |
| X-Cache-TTL: 3600 | Override the cache TTL in seconds. Range: 60–86400. |
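For example, a request that shortens the TTL and then inspects the cache outcome could look like the following sketch (the Gateway URL, API key variable, and model name are placeholders, not values from this doc):

```python
import os
import requests

GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"  # placeholder deployment
headers = {
    "Authorization": f"Bearer {os.environ['GATEWAY_API_KEY']}",
    "X-Cache-TTL": "600",  # cache this response for 10 minutes instead of the 1-hour default
}
body = {
    "model": "gpt-4o-mini",
    "temperature": 0,
    "messages": [{"role": "user", "content": "Summarize RFC 9110 in one sentence."}],
}

resp = requests.post(GATEWAY_URL, json=body, headers=headers, timeout=60)
print(resp.headers.get("X-Cache"))  # MISS on the first call, HIT on a repeat within the TTL
```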
Response cache outcomes are surfaced on the response:
| X-Cache value | Meaning |
|---|---|
| HIT | Served from cache. |
| MISS | Cache miss — response was generated and stored. |
| BYPASS | Cache intentionally skipped (tool-use responses, X-Cache: no-cache). |
Why tool-use responses bypass the cache
Any response with finish_reason: "tool_calls" — and any request carrying a tools array — sets X-Cache: BYPASS automatically. Tool outputs are stateful: the model expects your next turn to execute the call and return a role:"tool" result with fresh data. Serving a stale cached tool call would produce incorrect behavior at step N+1.
If you send two identical tool-use requests back-to-back, both hit the provider. This is by design.
Verify it yourself
Send two identical tools-bearing requests. Both responses will carry X-Cache: BYPASS, and your Langfuse spans will show two distinct provider calls.
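A sketch of that check, reusing the placeholder endpoint from above (the tool definition is hypothetical):

```python
import os
import requests

GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"  # placeholder
headers = {"Authorization": f"Bearer {os.environ['GATEWAY_API_KEY']}"}

body = {
    "model": "gpt-4o-mini",
    "temperature": 0,
    "messages": [{"role": "user", "content": "What's the weather in Oslo right now?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

for attempt in (1, 2):
    resp = requests.post(GATEWAY_URL, json=body, headers=headers, timeout=60)
    # Both attempts should print BYPASS: tool-bearing requests never touch the response cache.
    print(attempt, resp.headers.get("X-Cache"))
```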
Prompt caching (Anthropic)
Claude supports explicit cache markers that cache large system prompts and tool schemas across turns — a 10× cost reduction on repeated long prompts.
Attach cache_control: { "type": "ephemeral" } to a system message, a user message, or an individual tool definition:
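A request-body sketch, assuming cache_control sits directly on the message and tool objects (the field placement here is illustrative; adjust it to the request schema you actually use):

```python
# Stable, large prefix worth caching; padded placeholder text.
LONG_SYSTEM_PROMPT = "You are a support triage assistant. Follow the policy manual verbatim. " * 80

body = {
    "model": "claude-3-5-sonnet",  # placeholder Claude model id
    "messages": [
        {
            "role": "system",
            "content": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache the system prompt
        },
        {"role": "user", "content": "Classify this ticket: 'My invoice total is wrong.'"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "lookup_invoice",  # hypothetical tool for illustration
                "description": "Fetch an invoice by id",
                "parameters": {
                    "type": "object",
                    "properties": {"invoice_id": {"type": "string"}},
                    "required": ["invoice_id"],
                },
            },
            "cache_control": {"type": "ephemeral"},  # cache the tool schema too
        }
    ],
}
```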
On subsequent requests with the same prefix, Anthropic serves the cached tokens and you see savings in the response:
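On a hit, the cached portion is broken out in the usage block, along these lines (numbers are illustrative):

```json
{
  "usage": {
    "prompt_tokens": 1253,
    "completion_tokens": 12,
    "total_tokens": 1265,
    "prompt_tokens_details": {
      "cached_tokens": 1225
    }
  }
}
```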
Notes
- Anthropic requires the cached prefix to be at least ~1024 tokens — short system prompts won't cache.
- Ephemeral cache has a 5-minute TTL. For longer-lived prefixes, Anthropic has cache_control: { type: "persistent" } (contact support to enable).
- prompt_tokens follows OpenAI convention: it counts input tokens including cached hits, but excludes cache-creation tokens on the first-write turn. Cache-hit tokens are surfaced separately as prompt_tokens_details.cached_tokens.
Verify it yourself
Fire the same Claude request twice within the 5-minute window. The first call writes the cache; the second should report a non-zero cached_tokens. The SYSTEM_PROMPT below is padded to ~1200 tokens to clear Anthropic's minimum cacheable-prefix threshold.
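A sketch of that check, using the same placeholder endpoint and model id as above; the X-Cache: no-cache header keeps the Gateway's own response cache out of the way so both calls reach Anthropic:

```python
import os
import requests

GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"  # placeholder
headers = {
    "Authorization": f"Bearer {os.environ['GATEWAY_API_KEY']}",
    "X-Cache": "no-cache",  # bypass the response cache so the second call also hits the provider
}

# Roughly 1,200 tokens of padded placeholder text, past the ~1024-token minimum.
SYSTEM_PROMPT = "You are a meticulous support triage assistant. Follow every rule in the policy manual exactly. " * 65

body = {
    "model": "claude-3-5-sonnet",  # placeholder Claude model id
    "temperature": 0,
    "messages": [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # placement as sketched above
        },
        {"role": "user", "content": "Reply with the single word: ready."},
    ],
}

for attempt in (1, 2):
    usage = requests.post(GATEWAY_URL, json=body, headers=headers, timeout=120).json()["usage"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    print(f"call {attempt}: prompt_tokens={usage['prompt_tokens']} cached_tokens={cached}")
```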
Expected output shape:
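Something along these lines, with illustrative numbers (the first call's prompt_tokens looks small because cache-creation tokens are excluded on the write turn, per the notes above):

```text
call 1: prompt_tokens=28 cached_tokens=0
call 2: prompt_tokens=1253 cached_tokens=1225
```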
The exact cached_tokens number depends on tokenization details; the key signal is that it goes from 0 on the first call to a large non-zero value on the second.
Prompt caching (OpenAI)
OpenAI caches automatically for prompts over ~1024 tokens — no cache_control field needed. The Gateway passes cache_control through harmlessly if provided; OpenAI ignores it.
When a cache hit occurs, usage.prompt_tokens_details.cached_tokens reflects the hit.
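The same two-call check works here with no markers at all; only the prompt length matters (placeholder endpoint as before):

```python
import os
import requests

GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"  # placeholder
headers = {
    "Authorization": f"Bearer {os.environ['GATEWAY_API_KEY']}",
    "X-Cache": "no-cache",  # keep the Gateway response cache out of the way
}

body = {
    "model": "gpt-4o-mini",
    "temperature": 0,
    "messages": [
        # No cache_control needed: OpenAI caches long prefixes automatically.
        {"role": "system", "content": "Answer tersely and cite no sources. " * 180},  # well past ~1024 tokens
        {"role": "user", "content": "Reply with the single word: ready."},
    ],
}

for attempt in (1, 2):
    usage = requests.post(GATEWAY_URL, json=body, headers=headers, timeout=120).json()["usage"]
    print(attempt, usage.get("prompt_tokens_details", {}).get("cached_tokens", 0))
```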
Prompt caching (other providers)
Google Gemini, xAI Grok, DeepSeek, Mistral, Kimi, Llama, and MiniMax do not yet expose prompt caching on their public APIs. cache_control is ignored; cached_tokens is reported as 0.
See also
- Tool Use — the tool_call round-trip.
- Provider compatibility — which providers support prompt caching.