willianpinho.com Blog

I ran an MCP-gateway production-readiness audit on a popular open-source LLM gateway. Here's what it found.

A read-only, evidence-cited audit across seven production-readiness dimensions, run against a widely deployed open-source LLM gateway. The gateway scored well, and the audit still surfaced the boundary edges every team running MCP in production should check on their own deployment.


Most teams wiring an LLM gateway to MCP tools ask one question: does it work? The harder question, the one that decides whether you sleep through the launch, is different. When the authorization check throws an unexpected exception, does the gateway deny the call or allow it? That single line of behavior separates a mature platform from an incident waiting for a quiet Tuesday.

So I built a structured way to answer it, and pointed it at a target that would not flatter the method.

The method, not the target

The audit is a read-only, evidence-backed review across seven dimensions: tool-access governance and RBAC, fail-close versus fail-open behavior, MCP and agent onboarding, observability and tracing, multi-LLM routing and cost controls, secrets and identity, and broader production-readiness. Every finding has to point at the specific code that justifies it, pinned to a commit, so the team can open the file and read the same lines I did. There's no live fault injection and no guesswork; every claim traces back to code at that revision. If a control is present in code but only takes effect when an operator turns it on, that gets recorded too, because the default is what ships to most deployments.

For a worked example, I used LiteLLM from BerriAI, pinned at a specific commit. It is a widely deployed open-source LLM proxy, the code is public, and it is the kind of mature project that would expose a sloppy methodology rather than a sloppy target. I want to be clear about the result up front: it scored well. Four green, three yellow, zero red across the seven dimensions. The verdict was "production-ready with caveats." This is a fair assessment of a capable platform, not a takedown. The interesting part is that even a strong gateway has edges a structured pass will surface, and those edges are exactly the things every team running MCP in production should check on their own deployment.

What it gets right

The dimensions that carry the most safety weight were the strongest.

Identity and secrets came back green with no significant gaps. There were no inline secret values in the configuration. Real config references the environment via os.environ and os.getenv, and the only sk- style strings in the tree were docstring examples. Identity is JWT and OIDC enforced on the actual gateway call path, not merely the dashboard login, and the end-user identity propagates through to MCP handling and spend logs instead of collapsing into one shared service credential. For MCP tokens, the gateway supports RFC 8693 OAuth token-exchange with audience and scope binding, so the MCP server receives a token minted for it rather than a forwarded user token. That follows the resource-server pattern the current MCP specification points at. One honest detail: it is an operator-enabled mode, not the default, so it counts as a control you have to turn on rather than one you inherit.
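To make the token-exchange step concrete, here is a minimal sketch of the form body an RFC 8693 exchange request carries. The grant-type and token-type URNs come from the RFC itself; the function name, the audience URL, and the scope string are hypothetical placeholders for illustration, and this is not the gateway's own code.

```python
from urllib.parse import urlencode

# URNs defined by RFC 8693 (OAuth 2.0 Token Exchange).
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_token_exchange_body(subject_token: str, audience: str, scopes: list) -> str:
    """Build the form-encoded body for a token-exchange request.

    Hypothetical helper: the caller's token goes in as subject_token, and
    the audience names the one MCP server the new token is minted for.
    """
    if not subject_token or not audience:
        raise ValueError("subject_token and audience are required")
    params = {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,
    }
    if scopes:
        # OAuth scopes travel as a single space-delimited string.
        params["scope"] = " ".join(scopes)
    return urlencode(params)


# Placeholder values; read the real token from the request, never from config.
body = build_token_exchange_body(
    subject_token="caller-access-token",
    audience="https://mcp.example.internal",
    scopes=["tools:read"],
)
```

The point of the audience and scope fields is that the MCP server can reject any token not minted for it, which is what separates an exchanged token from a forwarded one.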

Observability was also green. OpenTelemetry is a first-class integration with dedicated GenAI semantic-convention mapping, so per-model token and cost attribution is possible rather than bolted on later. Inbound W3C traceparent headers are extracted through the standard propagator, which means end-to-end trace continuity is achievable across hops.
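For readers who have not looked inside a traceparent header, here is a stdlib-only sketch of what the propagator extracts. It handles only the version 00 layout from the W3C Trace Context specification. The names are my own, and a real deployment should rely on the OpenTelemetry propagator rather than hand parsing.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # The spec treats all-zero trace and parent ids as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

If the trace id that arrives at the gateway is the same one that shows up on the MCP server's spans, the hops are stitched. That is the check to run in staging.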

Routing and cost, the dimension LiteLLM is purpose-built for, held up. A declarative model_list maps virtual model names to physical deployments, and budget caps are genuinely enforced. An overrun raises a BudgetExceededError rather than firing an alert and letting the spend continue. Rate limits in requests and tokens per minute are expressible per key, per model, and even per MCP server. That is a real, enforced path against bill-shock and denial-of-wallet, not a dashboard that turns red after the money is already gone.
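The difference between enforcing a budget and alerting on it is easy to show in a few lines. This is a simplified sketch under my own assumptions: BudgetGuard and BudgetExceeded are hypothetical stand-ins, not LiteLLM's classes, and the real gateway tracks spend per key, team, and model rather than in one in-memory counter.

```python
class BudgetExceeded(Exception):
    """Stand-in for a gateway budget error; not LiteLLM's own class."""


class BudgetGuard:
    """Refuses new requests once recorded spend reaches the cap."""

    def __init__(self, max_budget_usd: float) -> None:
        self.max_budget_usd = max_budget_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        # Called before a request is forwarded: raising here blocks the
        # call, which is what separates enforcement from an alert.
        if self.spent_usd >= self.max_budget_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} of {self.max_budget_usd:.2f} USD"
            )

    def record(self, cost_usd: float) -> None:
        # Called after a response, once the actual cost is known.
        self.spent_usd += cost_usd


guard = BudgetGuard(max_budget_usd=1.00)
guard.check()          # under budget: the request may proceed
guard.record(1.25)     # this response pushed spend past the cap
try:
    guard.check()      # the next request is refused outright
    blocked = False
except BudgetExceeded:
    blocked = True
```

An alert-only design would log at the same point and let the request through. Which of the two a deployment actually does is worth confirming on your own configuration.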

One honest caveat on observability: because this was a static review against a public repo with no live backend, I could not pull a real end-to-end trace and watch it reconstruct. The building blocks are present and standards-aligned. Confirming one real request stitched together end to end is a step any team should run against their own staging.

Three yellows worth checking on any deployment

None of these are missing controls. They are default-configuration and operational gaps, which is the typical profile of a capable platform that needs hardening rather than rearchitecture.

First, one fail-open line. Every per-level permission resolver fails closed: on an unexpected exception it logs and returns an empty set, which resolves to "no access" downstream. That is the correct posture. The exception is the top-level wrapper get_allowed_mcp_servers(), which returns the allow-all server set on an unexpected error instead of an empty list. The blast radius is bounded to servers an operator already marked as public, but fail-open in an authorization resolver is the single highest-risk class in the whole framework, because a degraded check silently becomes "allow." It is also a one-line fix plus a regression test, which gives it the best risk-reduction-per-effort in the audit.
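The shape of the fix, and of the regression test that should ship with it, looks roughly like this. It is a generic sketch of the fail-closed pattern, not LiteLLM's code: resolve_allowed_servers, the resolver callables, and the server names are all hypothetical.

```python
import logging
from typing import Callable, FrozenSet, Iterable

logger = logging.getLogger("mcp_authz")


def resolve_allowed_servers(
    resolver: Callable[[str], Iterable[str]], caller_id: str
) -> FrozenSet[str]:
    """Wrap a permission lookup so that any unexpected error means no access."""
    try:
        return frozenset(resolver(caller_id))
    except Exception:
        # Fail closed: log the failure and return the empty set, never a
        # fallback "public" or allow-all set.
        logger.exception("MCP permission lookup failed for %s; denying", caller_id)
        return frozenset()


def healthy_resolver(caller_id: str) -> Iterable[str]:
    return ["github", "jira"]


def broken_resolver(caller_id: str) -> Iterable[str]:
    raise RuntimeError("permission store unreachable")


# Regression test for the failure path: a broken lookup must yield no access.
assert resolve_allowed_servers(broken_resolver, "user-1") == frozenset()
assert resolve_allowed_servers(healthy_resolver, "user-1") == {"github", "jira"}
```

The test matters as much as the fix. Without it, the next refactor can quietly restore a permissive fallback and nothing will fail.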

Second, unpinned third-party MCP servers. The curated catalog launches stdio servers with floating commands like npx -y @sentry/mcp-server, with no version, digest, or checksum pin. That leaves the deployment one tampered package away from a supply-chain incident, the class OWASP labels LLM03 in its 2025 list. Pinning by version and digest, and rejecting anything unpinned, closes it.
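A catalog check that rejects floating commands can be small. The sketch below covers only the version half of the fix: it accepts an npx command when the package spec carries an exact semver and rejects tags and ranges. The function names are hypothetical, the version 1.2.3 is a placeholder rather than a real release, and digest verification would still need a lockfile or checksum step on top.

```python
import re
import shlex
from typing import Optional

# Exact semver only, e.g. 1.2.3 or 1.2.3-beta.1; tags and ranges are rejected.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?$")


def package_spec(command: str) -> Optional[str]:
    """Return the first non-flag argument of an npx command, if any.

    Simplification: flags that take a value (such as --package NAME) are
    not handled and would need explicit parsing.
    """
    args = shlex.split(command)
    if not args or args[0] != "npx":
        return None
    for arg in args[1:]:
        if not arg.startswith("-"):
            return arg
    return None


def is_version_pinned(command: str) -> bool:
    spec = package_spec(command)
    if spec is None:
        return False
    name, sep, version = spec.rpartition("@")
    # In a bare scoped name like "@scope/pkg" the only "@" is the leading
    # one, so name comes back empty and the spec counts as unpinned.
    if not sep or not name:
        return False
    return bool(EXACT_VERSION.match(version))


floating = is_version_pinned("npx -y @sentry/mcp-server")
pinned = is_version_pinned("npx -y @sentry/mcp-server@1.2.3")
```

Run as a CI gate over the catalog, a check like this turns "we should pin" into a build that fails when someone adds a floating entry.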

Third, per-tool least-privilege is opt-in. Authorization at the gateway is strong: it is enforced on caller identity as a strict intersection of key, team, end-user, agent, and org permissions, and the model is kept out of the authorization decision entirely, which is what defeats prompt-injection-to-tool-call. But absent a per-server allowed_tools allowlist, any caller with access to a server can invoke any tool on it, including write or external ones. That is the excessive-agency exposure OWASP tracks as LLM06. Making that allowlist required at onboarding, validated in CI, converts least-privilege from opt-in to default.
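Here is a rough sketch of both halves: the strict intersection across permission layers, and a default-deny tool check with a CI-style validator for missing allowlists. Only the allowed_tools field name comes from the audit. The functions, the config shape, and the server and tool names are assumptions for illustration, and the sketch assumes every layer is an explicit set of granted servers.

```python
from typing import Dict, List, Set


def effective_servers(*layers: Set[str]) -> Set[str]:
    """Strict intersection: a server must be granted at every layer."""
    if not layers:
        return set()
    return set.intersection(*layers)


def may_call(server: str, tool: str, granted: Set[str],
             allowed_tools: Dict[str, Set[str]]) -> bool:
    """Default deny: a server with no allowlist exposes no tools."""
    if server not in granted:
        return False
    return tool in allowed_tools.get(server, set())


def servers_missing_allowlist(config: Dict[str, dict]) -> List[str]:
    """CI check: list servers whose entry has no non-empty allowed_tools."""
    return sorted(name for name, entry in config.items()
                  if not entry.get("allowed_tools"))


# Hypothetical layers and catalog entries.
key_grants = {"github", "jira", "slack"}
team_grants = {"github", "jira"}
user_grants = {"github"}
granted = effective_servers(key_grants, team_grants, user_grants)

config = {
    "github": {"allowed_tools": ["search_issues", "get_file"]},
    "jira": {},  # no allowlist: should fail the CI check
}
tool_map = {name: set(entry.get("allowed_tools", [])) for name, entry in config.items()}

can_read = may_call("github", "get_file", granted, tool_map)
can_write = may_call("github", "delete_repo", granted, tool_map)
missing = servers_missing_allowlist(config)
```

Note that the model never appears as an input to any of these functions. The decision depends only on who is calling and what the catalog allows, which is the property that keeps an injected prompt from widening access.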

The lesson

A structured audit is not a search for a smoking gun. On a mature target there usually is not one, and there was not here. What the seven-dimension pass surfaced instead were the production-readiness edges that hide in a good codebase: one resolver that errs toward exposure where its siblings err toward safety, a supply chain that floats instead of pinning, and a least-privilege control that waits for someone to opt in. None of these show up when you ask "does it work," because the gateway works fine. They show up when you ask what happens at the boundaries, under error, and at the defaults most operators never change.

That is the value of the method. It turns "we think it's fine" into a scored picture with named fixes, effort estimates, and the exact code behind each finding, so the team is arguing about a specific line of behavior instead of a feeling.

The audit kit that produced this runs read-only and works in Claude Code, Cursor, or any MCP client. If you are putting an MCP gateway in front of production tools and want the same scored, evidence-backed pass on your own deployment, that is the scoped engagement I run.

