Inference Control Plane
Decompute Gateway. The inference control plane for Claude.
Compress context, plan routes, measure savings — and turn opt-in Claude usage into a governed student–teacher learning loop.
The first wave of AI infrastructure was built on a simple assumption: send a prompt to a frontier model, receive the answer, pay the bill, repeat. That held when prompts were short, tools were simple, and usage was experimental. It no longer matches how teams use AI.
A single Claude session now carries long tool traces, logs, code diffs, RAG chunks, design documents, terminal output, stack traces, support tickets, agent state, and multi-turn history. Claude Code can make dozens of calls inside one development flow. Enterprise agents repeat the same system prompts, retrieve the same documents, pass the same tool schemas, and push the same context through the model again and again.
The model call has stopped being a stateless prompt. It has become a transaction through an operating layer.
And that layer raises a different set of questions. Not just which model should we call, but: What context actually needs to be sent? What can be cached? Which requests are worth a frontier model and which can be handled locally? Which data is allowed to move? Which usage patterns should become tomorrow's smaller, cheaper model — and how do we prove the savings?
Today we're releasing Decompute Gateway, a Claude optimization gateway built to answer those questions. It sits between Claude Code, applications, SDKs, agents, and the Claude API. It compresses context, stabilizes cacheable prompts, plans routes, tracks usage and savings, and lays the foundation for a student–teacher learning loop.
Claude Code / app / SDK / agent
↓
Decompute Gateway
· YSCompress context compression
· Yellowstone routing / planning
· prompt-cache support
· usage + savings telemetry
· self-serve API keys
· dashboard
↓
Claude API
This first release does not host Claude weights — and that isn't the goal. The goal is to own the layer in front of Claude: where context, cost, routing, governance, and learning decisions get made.
The new bottleneck: context, cost, and control
Claude has moved from chat to workflow. Developers run Claude Code against real repositories. Support teams reason over tickets, logs, and account history. Security teams pass alerts, traces, policies, and incident timelines. Data teams connect notebooks, warehouses, and dashboards. Agents call tools, inspect results, revise plans, and call more tools.
A serious gateway for this world needs a recognizable set of capabilities — central authentication, usage tracking, cost controls, audit logging, model routing, prompt-cache strategy, context compression, streaming, team and project attribution, and a path to capture student–teacher data. Anthropic's own Claude Code documentation frames an LLM gateway as a centralized proxy that commonly handles authentication, usage tracking, cost controls, audit logging, and model routing, and specifies that a compatible gateway should expose the Anthropic Messages format at /v1/messages and forward key Anthropic headers. Decompute Gateway is built for exactly that role.
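To make the header-forwarding requirement concrete, here is a minimal sketch of how a proxy might build the upstream header set. `anthropic-version` and `anthropic-beta` are real Anthropic request headers; the allow-list, the function name, and the key handling are illustrative assumptions, not Decompute's actual implementation.

```python
# Headers passed through to the upstream /v1/messages endpoint.
# The allow-list itself is an assumption made for this sketch.
FORWARDED_HEADERS = {"anthropic-version", "anthropic-beta"}


def upstream_headers(client_headers: dict, upstream_api_key: str) -> dict:
    """Build the header set for the upstream POST /v1/messages call."""
    forwarded = {
        name.lower(): value
        for name, value in client_headers.items()
        if name.lower() in FORWARDED_HEADERS
    }
    # The gateway authenticates upstream with its own credential, so the
    # client's gateway key (Authorization here) is dropped, not forwarded.
    forwarded["x-api-key"] = upstream_api_key
    forwarded["content-type"] = "application/json"
    return forwarded


incoming = {
    "Anthropic-Version": "2023-06-01",
    "Anthropic-Beta": "example-beta-flag",  # placeholder value
    "Authorization": "Bearer dc_dev_local",
}
print(upstream_headers(incoming, "sk-ant-PLACEHOLDER"))
```

Keeping the forwarded set explicit means a new client header never reaches the upstream by accident.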
What ships today: three planes
The current release is organized around three planes, following the plane-based design of Echelon. The clearest way to see them is to follow a single request from the client all the way to a governed, opt-in trace.
Compression plane — YSCompress. Long contexts aren't all equal. A 100,000-token request might hold the user's actual question buried under repeated logs, redundant stack traces, tool output, stale conversation turns, near-duplicate RAG chunks, and boilerplate state. Sending all of it is slow and expensive; dropping it blindly breaks quality. The gateway uses YSCompress to compress tool outputs, logs, files, RAG chunks, and long conversation context. If YSCompress isn't configured, it falls back to a local safe compressor so developers can build and test offline. Either way it keeps the latest plain user instruction intact and reports what it did:
tokens_before · tokens_after · tokens_saved · compression_ratio · engine · transforms_applied · ccr_keys
Compression should be observable. A gateway shouldn't mutate prompts and hope for the best — it should show what changed, how many tokens it saved, and whether the request needed compressing at all.
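As a rough illustration of observable compression, the sketch below collapses exact duplicate lines in tool output and reports what changed. It is not YSCompress or the gateway's fallback compressor. The token estimate (about four characters per token), the `after / before` definition of `compression_ratio`, and the engine name are all assumptions made for the example.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); not a real tokenizer."""
    return len(text) // 4


def compress_repeated_lines(text: str) -> tuple:
    """Collapse exact duplicate lines, keeping first-seen order and a count."""
    lines = text.splitlines()
    counts = Counter(lines)
    kept = []
    for line in dict.fromkeys(lines):  # unique lines, in first-seen order
        n = counts[line]
        kept.append(f"{line} [repeated {n}x]" if n > 1 else line)
    compressed = "\n".join(kept)
    before, after = estimate_tokens(text), estimate_tokens(compressed)
    transforms = ["dedupe_lines"]
    if after >= before:  # nothing gained: pass the original through
        compressed, after, transforms = text, before, []
    report = {
        "tokens_before": before,
        "tokens_after": after,
        "tokens_saved": before - after,
        "compression_ratio": after / before if before else 1.0,  # assumed definition
        "engine": "local-sketch",  # hypothetical engine name
        "transforms_applied": transforms,
    }
    return compressed, report


def compress_messages(messages: list) -> tuple:
    """Compress tool output only; user instructions pass through unchanged."""
    out, reports = [], []
    for msg in messages:
        if msg["role"] == "tool" and isinstance(msg["content"], str):
            content, report = compress_repeated_lines(msg["content"])
            out.append({**msg, "content": content})
            reports.append(report)
        else:
            out.append(msg)
    return out, reports


messages = [
    {"role": "tool", "content": "ERROR permission denied\n" * 1000},
    {"role": "user", "content": "Summarize the failure."},
]
compressed_messages, reports = compress_messages(messages)
print(compressed_messages[0]["content"])
print(reports[0])
```

Line-level deduplication discards ordering and timing between repeats, so a real compressor would need to decide when that loss is acceptable.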
Routing plane — Yellowstone. In this release Yellowstone doesn't shard Claude weights or run distributed training. It plans a compression-and-routing path around Claude and optional local pre/post-processing. The planner classifies model families, compares raw versus compressed input tokens, chooses a route, estimates tokens saved, records the compression ratio, and estimates local KV-cache feasibility for future local runtimes. Today that means: route to Claude, compress context, use caching, track usage, estimate local feasibility. Tomorrow it means routing easy tasks to students, private preprocessing locally, and hard reasoning back to Claude.
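The planner's decision can be pictured with a small sketch. Everything here is hypothetical: the `RoutePlan` fields, the 10% minimum-saving threshold, and the token budget used as a stand-in for local KV-cache feasibility are assumptions for illustration, not Yellowstone's actual logic.

```python
from dataclasses import dataclass


@dataclass
class RoutePlan:
    route: str
    raw_input_tokens: int
    sent_input_tokens: int
    tokens_saved: int
    compression_ratio: float
    local_kv_cache_feasible: bool


def plan_route(
    model: str,
    raw_tokens: int,
    compressed_tokens: int,
    min_saving: float = 0.10,           # assumed threshold
    local_kv_budget_tokens: int = 8192,  # assumed local budget
) -> RoutePlan:
    """Pick raw vs. compressed input and estimate local feasibility."""
    family = "claude" if model.startswith("claude") else "other"
    saving = (raw_tokens - compressed_tokens) / raw_tokens if raw_tokens else 0.0
    use_compressed = saving >= min_saving
    sent = compressed_tokens if use_compressed else raw_tokens
    return RoutePlan(
        route=f"{family}:{'compressed' if use_compressed else 'raw'}",
        raw_input_tokens=raw_tokens,
        sent_input_tokens=sent,
        tokens_saved=raw_tokens - sent,
        compression_ratio=sent / raw_tokens if raw_tokens else 1.0,
        # Would the sent context fit a future local runtime's budget?
        local_kv_cache_feasible=sent <= local_kv_budget_tokens,
    )


print(plan_route("claude-sonnet-4-5", raw_tokens=6000, compressed_tokens=10))
print(plan_route("claude-sonnet-4-5", raw_tokens=1000, compressed_tokens=990))
```

The second call shows the case where compression saves too little to be worth mutating the prompt, so the raw input is sent.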
Learning plane — prompt cache and the usage ledger. Prompt caching is a natural gateway-level optimization. Anthropic's documentation describes it as a way to resume from a prompt prefix to cut processing time and cost for repetitive or stable prompts, with use cases spanning large context, repetitive tasks, long multi-turn conversations, and agentic tool use. The gateway wires a DECOMPUTE_PROMPT_CACHE=auto mode into the Claude provider path, which yields a clean policy:
Compress what should shrink. Cache what should remain stable. Route what should move. Measure everything.
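One plausible reading of the "cache what should remain stable" step is marking the stable system prompt as a cache breakpoint. The `cache_control: {"type": "ephemeral"}` block is part of Anthropic's Messages API; how the gateway's `auto` mode actually chooses breakpoints is not described here, so the function below is an assumption-laden sketch.

```python
import os
from typing import Optional


def apply_prompt_cache(payload: dict, mode: Optional[str] = None) -> dict:
    """Mark the end of the system prompt as a cacheable prefix in auto mode."""
    mode = mode or os.environ.get("DECOMPUTE_PROMPT_CACHE", "off")
    system = payload.get("system")
    if mode != "auto" or not system:
        return payload
    # The Messages API accepts `system` as a string or a list of text blocks.
    if isinstance(system, str):
        blocks = [{"type": "text", "text": system}]
    else:
        blocks = [dict(block) for block in system]
    # A breakpoint on the last block caches the prefix up to and including it.
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {**payload, "system": blocks}


request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 512,
    "system": "You are a support triage assistant. <long, stable instructions>",
    "messages": [{"role": "user", "content": "Summarize the failure."}],
}
print(apply_prompt_cache(request, mode="auto")["system"])
```

The original payload is left unmodified, which keeps the raw request available for comparison in telemetry.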
And it measures. The dashboard surfaces requests, raw input tokens, sent input tokens, and estimated savings; the usage system records raw and compressed input tokens, output tokens, tokens saved, compression ratio, cache mode, model, endpoint, and route metadata. That ledger is where the next phase begins.
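A ledger like this reduces to simple arithmetic over per-request records. The sketch below assumes a hypothetical `UsageRecord` shape and takes the input price as a caller-supplied parameter; the price used in the example is a placeholder, not a real rate.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    model: str
    endpoint: str
    raw_input_tokens: int
    sent_input_tokens: int
    output_tokens: int
    cache_mode: str


def summarize(records: list, usd_per_million_input_tokens: float) -> dict:
    """Aggregate the figures a savings dashboard would surface."""
    raw = sum(r.raw_input_tokens for r in records)
    sent = sum(r.sent_input_tokens for r in records)
    saved = raw - sent
    return {
        "requests": len(records),
        "raw_input_tokens": raw,
        "sent_input_tokens": sent,
        "tokens_saved": saved,
        "estimated_savings_usd": saved / 1_000_000 * usd_per_million_input_tokens,
    }


ledger = [
    UsageRecord("claude-sonnet-4-5", "/v1/chat/completions", 6000, 10, 120, "auto"),
    UsageRecord("claude-sonnet-4-5", "/v1/messages", 1200, 1200, 300, "auto"),
]
# 2.0 USD per million input tokens is a placeholder price for the example.
print(summarize(ledger, usd_per_million_input_tokens=2.0))
```

This estimate covers input-token savings from compression only; cache discounts and output tokens would need their own terms.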
The student–teacher flywheel
The most important thing about Decompute Gateway isn't that it saves tokens. It's that it creates the control point where Claude usage can become a learning system.
In the first phase, Claude is the teacher. Every request passes through the gateway, which already observes task type, compressed context, route, model, tokens saved, cache behavior, latency, and the final answer. With explicit opt-in, it can persist a governed training event:
input context · compressed context · teacher response · route decision · usage metrics · quality feedback · policy labels · evaluation result
Those events train students — but the student doesn't need to be a general Claude replacement. The best first students are deliberately narrow:
- Student router: Claude, cache, local model, or bypass?
- Student compressor: smallest safe form of this log / tool / RAG context
- Student evaluator: is this answer good enough, or escalate?
- Student reranker: which chunks actually need to reach Claude?
- Student local helper: handle safe, repetitive, low-risk domain tasks
Over time the loop closes:
Claude teacher handles hard requests
↓
Gateway records governed traces
↓
Students learn narrow, repeated patterns
↓
Gateway routes more work efficiently
↓
Claude is reserved for harder, higher-value reasoning
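The loop above implies an escalation rule: let a student try, score the draft, and fall back to the teacher when the score is too low. Here is a minimal sketch of that rule. The function, the callables, and the 0.8 threshold are hypothetical and describe a possible future route, not something this release ships.

```python
from typing import Callable, Tuple


def answer_with_escalation(
    prompt: str,
    student: Callable[[str], str],
    evaluator: Callable[[str, str], float],
    teacher: Callable[[str], str],
    threshold: float = 0.8,  # assumed quality bar
) -> Tuple[str, str]:
    """Return (answer, source); escalate to the teacher on a low score."""
    draft = student(prompt)
    if evaluator(prompt, draft) >= threshold:
        return draft, "student"
    return teacher(prompt), "teacher"


# Stub callables standing in for a local student, an evaluator, and Claude.
def student_stub(prompt: str) -> str:
    return "Restart the service."


def teacher_stub(prompt: str) -> str:
    return "Root cause: expired credentials. Rotate the key, then restart."


def confident(prompt: str, draft: str) -> float:
    return 0.9


def unsure(prompt: str, draft: str) -> float:
    return 0.2


print(answer_with_escalation("Fix the outage", student_stub, confident, teacher_stub))
print(answer_with_escalation("Fix the outage", student_stub, unsure, teacher_stub))
```

Returning the source alongside the answer lets the ledger record how often work stays with the student.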
Two commitments hold this together. Claude remains the teacher and escalation path — students start as helpers around the gateway, not as a replacement. And training data capture is opt-in and governed, never the default:
- default: telemetry only
- enterprise safe: no raw prompt storage
- training mode: explicit project-level opt-in
- regulated mode: hashed features + eval metadata only
Decompute Gateway does not turn private enterprise traffic into a training dataset by default. It gives organizations a governed path to do so when they choose.
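The capture modes can be read as a gate on what a stored event is allowed to contain. The sketch below is one hypothetical encoding of that gate: the mode strings, field names, and the use of a SHA-256 prompt digest as a "hashed feature" are assumptions for illustration.

```python
import hashlib


def build_capture_event(
    mode: str,
    prompt: str,
    response: str,
    metrics: dict,
    project_opted_in: bool = False,
) -> dict:
    """Return only what the given capture mode permits storing."""
    # Every mode keeps usage telemetry; raw text is never included by default.
    event = {"mode": mode, "usage_metrics": dict(metrics)}
    if mode == "training" and project_opted_in:
        # Raw context is persisted only with an explicit project-level opt-in.
        event["input_context"] = prompt
        event["teacher_response"] = response
    elif mode == "regulated":
        # Store a digest instead of the text itself.
        event["prompt_sha256"] = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return event


metrics = {"tokens_saved": 5990, "route": "claude:compressed"}
for mode in ("default", "enterprise_safe", "training", "regulated"):
    event = build_capture_event(mode, "secret log", "summary", metrics, project_opted_in=True)
    print(mode, sorted(event))
```

Note that a training-mode request without the opt-in flag falls back to telemetry only, which keeps the safe behavior as the failure mode.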
From Gateway to Echelon
Most teams will meet Decompute at the Gateway. It's the low-friction entry point — drop it in front of Claude and you immediately get smaller context, cached prefixes, smarter routing, and a usage ledger that proves the savings. Nothing leaves a trust boundary unless you choose.
A few weeks ago we wrote about Echelon, our private compute fabric for heterogeneous enterprise clusters. The thesis was simple: most enterprises don't have one clean AI supercluster — they have mixed GPUs, CPUs, edge machines, on-prem systems, cloud accounts, regional boundaries, and underutilized accelerators. Echelon answered that with two planes:
- Topology plane: decide where computation should run
- Boundary plane: decide what information is allowed to move
Inside Echelon, Yellowstone is the topology compiler — it profiles devices, bandwidth, access tiers, and boundary constraints and emits a boundary-compatible execution graph. The Gateway brings the same systems idea to inference: don't assume one homogeneous compute path; compile the plan around the workload, the policy, and the runtime you actually have.
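To show how the two planes interact, here is a toy placement routine: the boundary constraint filters which devices a stage may use, and the topology choice picks among what is left. The `Device` and `Stage` types and the greedy most-free-memory rule are invented for this sketch and are not Yellowstone's compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    boundary: str  # e.g. "on_prem" or "cloud"
    free_memory_gb: float


@dataclass(frozen=True)
class Stage:
    name: str
    memory_gb: float
    allowed_boundaries: frozenset


def place(stages: list, devices: list) -> dict:
    """Greedily assign each stage to a boundary-compatible device."""
    free = {d.name: d.free_memory_gb for d in devices}
    plan = {}
    for stage in stages:
        candidates = [
            d for d in devices
            if d.boundary in stage.allowed_boundaries  # boundary plane
            and free[d.name] >= stage.memory_gb        # topology plane
        ]
        if not candidates:
            raise ValueError(f"no boundary-compatible device for {stage.name}")
        chosen = max(candidates, key=lambda d: free[d.name])
        plan[stage.name] = chosen.name
        free[chosen.name] -= stage.memory_gb
    return plan


devices = [Device("onprem-gpu", "on_prem", 24.0), Device("cloud-gpu", "cloud", 80.0)]
stages = [
    Stage("redact", 2.0, frozenset({"on_prem"})),
    Stage("evaluate", 40.0, frozenset({"on_prem", "cloud"})),
]
print(place(stages, devices))
```

A stage that no device can legally host raises an error instead of being placed across a boundary.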
That shared worldview is what lets two systems become one loop. Every governed request the Gateway records — task type, compressed context, route, teacher answer, quality signal — is a unit of training value. On its own, that value sits in a ledger. Connected to Echelon, it compounds. The arc runs in four moves:
- Inference, today. Claude usage flows through the Gateway. YSCompress shrinks context, Yellowstone plans the route, prompt caching absorbs repetition. You save tokens now, and you measure it.
- Capture, with consent. With explicit opt-in, the Gateway turns repeated usage into governed datasets — router data, compression data, evaluator data, and domain student data — with redaction and boundary policies applied.
- Training, in Echelon. Those datasets feed Echelon, the private training fabric. Yellowstone profiles the heterogeneous compute you already own and trains small student and domain models inside your boundaries, never on shared infrastructure.
- Adaptation, back at the Gateway. The adapted models return to the Gateway as routes. Easy, repetitive, private work runs locally; hard reasoning still escalates to Claude. Cost per task drops, and the system grows more local over time.
That's the through-line. Echelon and the Gateway are two systems, but one story: the Gateway captures the value of every Claude call, Echelon turns that value into owned, local capability, and the Gateway routes the result back so the savings compound. The payoff is threefold — lower cost per task, genuine local AI adaptation, and capability you own. Claude remains the teacher and escalation path throughout. Student models start as narrow helpers, not replacements. And no customer data is trained on by default: the path from inference to local capability is opt-in, governed, and auditable end to end.
Why this matters for enterprises
Enterprises don't only need better models. They need better control over model usage. A bank may want Claude for reasoning while keeping preprocessing inside a trust boundary. A hospital may want summarization and triage while moving protected context carefully. A manufacturer may have repeated machine logs and support patterns that should become a local student model. A university lab may want Claude for hard research reasoning while running retrieval, compression, and evaluation on local devices.
The Echelon essay argued that enterprise AI cost isn't only GPU rental — it's data movement, compliance friction, duplicated infrastructure, idle machines, and manual engineering. The same is true for inference. Claude cost isn't only the per-token bill; it's duplicated context, uncached prompts, oversized tool output, unmeasured agent loops, unclear project attribution, manual routing, missing training-data lineage, and no path from repeated usage to local capability. Decompute Gateway attacks those costs at the gateway layer.
Getting started
Install the gateway with one command — it sets up the local service and the Claude Code routing helper:
curl -fsSL https://claude.decompute.run/install.sh | bash
Then connect it to Claude and you're done:
decompute init # paste your Claude (sk-ant-…) key — it stays on your machine
decompute claude # route Claude Code through the gateway (auto-starts it)
decompute doctor # verify the connection
The gateway runs on http://127.0.0.1:8080 — open http://127.0.0.1:8080/dashboard to watch the savings ledger fill in.
The gateway speaks both Anthropic-compatible /v1/messages and OpenAI-compatible /v1/chat/completions, so it works from Claude-native clients and OpenAI-style SDKs alike. A minimal SDK flow:
from decompute_ai import Decompute

client = Decompute(
    api_key="dc_dev_local",
    base_url="http://127.0.0.1:8080",
)
client.devices.scan()

huge_log = "ERROR permission denied\n" * 1000
resp = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[
        {"role": "tool", "content": huge_log},
        {"role": "user", "content": "Summarize the failure."},
    ],
)
print(resp["choices"][0]["message"]["content"])
print(resp["usage"])
print(resp["yellowstone"])
That's the whole product in miniature: a large, repetitive tool output enters the gateway, YSCompress compresses it, Yellowstone plans the route, Claude answers, and usage and savings are recorded.
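For Claude-native clients, the same request can target the Anthropic-compatible endpoint. The sketch below builds a Messages-format request with only the standard library and does not send it. The body fields and the `anthropic-version` header follow Anthropic's Messages API; passing the local `dc_dev_local` key in `x-api-key` is an assumption about how the gateway authenticates.

```python
import json
import urllib.request

GATEWAY_URL = "http://127.0.0.1:8080/v1/messages"


def build_messages_request(
    api_key: str,
    prompt: str,
    model: str = "claude-sonnet-4-5",
    max_tokens: int = 512,
) -> urllib.request.Request:
    """Build (but do not send) an Anthropic Messages-format request."""
    body = {
        "model": model,
        "max_tokens": max_tokens,  # required by the Messages format
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-api-key": api_key,  # assumed to carry the gateway key
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


req = build_messages_request("dc_dev_local", "Summarize the failure.")
print(req.get_method(), req.full_url)

# To send it against a running gateway:
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read()))
```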
What comes next
Deeper savings, in plain view. Expect richer reporting on where tokens and cost actually go — by team, project, and workload — so optimization becomes a number you can manage rather than guess at. The more of your traffic flows through the Gateway, the sharper its compression, caching, and routing decisions become for your specific patterns.
From your usage to your own models. The larger opportunity is turning everyday Claude usage into capability you own. With opt-in, governed data, narrow models — a router, a compressor, an evaluator, domain helpers — can be trained from your own patterns and run locally through Echelon, inside your boundaries. Cost per task falls, sensitive work stays in-house, and the system grows more local over time. Claude remains the teacher and the escalation path throughout.
Built with design partners, not just for them. This is where we want your input. We're working with a small group of early partners to shape what gets built first: which workloads are worth moving local, which domains and document types compression should be tuned for, how governance and boundary policies map to your compliance posture, and what "good enough to run locally" means for your teams. If that's a conversation you want to be in, the roadmap is open to co-design.
The destination is a single governed layer that runs from inference to owned, local capability — you start by saving on every Claude call, and you arrive at models adapted to your work that you control.
The release thesis
Decompute Gateway starts from a practical developer problem: Claude calls are getting larger, more repetitive, and harder to govern. But the deeper thesis is bigger. The future of enterprise AI won't be one model, one cloud, one context window, one trust boundary, or one runtime.
It will be a control plane — one that decides what to compress, what to cache, what to route, what to keep local, what to send to Claude, what to learn from, and what to audit. That's what we're building with Decompute Gateway.
See it on your own traffic
Drop the gateway in front of Claude and watch the savings ledger fill in.