Proposal: ingress — an optional edge backend (Caddy) + minted, scoped operator tokens
Status: Partially implemented (v0.5.6/v0.5.8) — the Caddy edge, route tokens, and two-tier operator auth shipped; per-field token scope enforcement remains. See ROADMAP.
Implementation status
Shipped (v0.5.6/v0.5.8): the manifest ingress/route model (default type: none), the edge backend interface with none/caddy backends wired into Compile, the token model (mintConnectionToken aud=operator with a scope argument, mintRouteToken aud=svc:<service>, central Verify), the /edge/verify forward-auth target, the two-tier operator-API auth (admin bearer or a minted aud=operator token), and the serviceEndpoint / ingressStatus GraphQL+REST queries. See CHANGELOG v0.5.6/v0.5.8, docs/guide/ingress.md, and docs/reference/operator-api.md. (The host/path routing modes and tls/port dev controls shipped separately in v0.5.14/v0.5.17.)
Remaining (roadmap): the field→capability scope map that gates mutations. Token scope is carried on the request context but is advisory today — no path enforces it per-mutation yet (internal/operator/tokens.go). The Angee host-backend (Django) rewrite that hands browsers scoped tokens instead of the admin bearer lives in a separate repo and is out of scope here.
The remainder of this document is the original design context.
Summary
Add an ingress: backend to the stack manifest, selected by type and defaulting to none — the same shape as secrets_backend (env-file/openbao, see internal/secrets/backend.go) and the runtime backends (compose/proccompose). When ingress.type: caddy, the operator compiles a central Caddy edge into the compose file, puts routed services on a private network with no host-published ports, and authenticates every inbound connection — HTTP and WebSocket — with a short-lived, scoped token validated at the edge.
The same token mechanism is then applied to the operator's own API: instead of handing the browser the real admin bearer (ANGEE_OPERATOR_TOKEN), the host backend (Django) holds that bearer server-side only and uses it to mint short-lived, capability-scoped tokens for the browser. One verifier — the operator's — serves both the edge (routed services) and the operator API.
Today every service that needs reachability leases a host port, and (for agents) ships its own Caddy + token-verifier sidecar; and every authorized browser receives the same long-lived operator admin token. This proposal collapses the N sidecars into one edge, and collapses the shared admin token into per-actor, short-lived, scoped tokens — moving transport-auth into the layer that already owns ports, networks, and token minting.
Motivation
Two problems, one mechanism:
- Auth is co-located with each workload. Each agent service publishes a host port and runs its own Caddy + verify-acp-token.mjs + a per-agent HMAC secret staged into the container. Adding a service means leasing a port and shipping a sidecar; the workload participates in its own authentication.
- The operator admin token is handed to the browser. The host backend exposes operatorConnection { endpoint, token } where token is the single ANGEE_OPERATOR_TOKEN. Every authorized browser gets the same long-lived root credential, and the operator API is all-or-nothing (operator.go:671 compares the bearer against one configured token: no actor, no scope).
Both are the same anti-pattern: a long-lived secret distributed to the edge, and auth enforced at the wrong layer. The fix is one mechanism — the operator mints short-lived scoped tokens and verifies them centrally — applied to two upstreams (routed services and the operator API).
Ownership split (the load-bearing decision)
- Operator = mechanism. Owns the edge service, the private network, the route table, port allocation, and both minting and verifying tokens. It already owns ports/network/process and already mints actor-scoped JWTs (tokens.go).
- Host backend (Django) = policy. Owns authorization (REBAC). It holds the operator admin bearer server-side, decides whether to mint (after an authz check), and hands the browser { url, token }. It never touches Caddy config and never ships the admin bearer to the browser.
This is the deliberate answer to "should the host backend manage Caddy?" — no. The operator does; the host only asks for a token.
Part 1 — ingress backend
Manifest additions
// internal/manifest/manifest.go
type Stack struct {
	...
	Ingress Ingress `yaml:"ingress,omitempty" json:"ingress,omitempty"`
}
type Ingress struct {
	// "none" (default: today's host-published-ports behavior) | "caddy"
	Type    string `yaml:"type,omitempty" validate:"omitempty,oneof=none caddy"`
	Domain  string `yaml:"domain,omitempty"`  // base domain; defaults to operator.domain
	Image   string `yaml:"image,omitempty"`   // default: lucaslorentz/caddy-docker-proxy:2.9
	Network string `yaml:"network,omitempty"` // default: "<name>_edge"
	Verify  string `yaml:"verify,omitempty"`  // forward_auth target; default: the operator's /edge/verify
}
// A service opts into routing instead of publishing host ports:
type Service struct {
	...
	Route *Route `yaml:"route,omitempty" json:"route,omitempty"`
}
type Route struct {
	Port int    `yaml:"port"`           // container port to proxy to (e.g. 3008)
	Host string `yaml:"host,omitempty"` // default "<service>.<ingress.domain>"
	Path string `yaml:"path,omitempty"` // alternative: path-prefix routing
	Auth string `yaml:"auth,omitempty"` // "forward" (default) | "none"
}
Stack.Defaults() sets Ingress.Type = "none" when empty, mirroring SecretsBackend.Type = "env-file". Existing manifests are byte-stable; nothing changes until a stack opts in.
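The defaulting rule can be sketched as below. This is a minimal illustration, not the code in internal/manifest: the trimmed Stack type, the OperatorDomain field, and the EdgeNetwork / BaseDomain helpers are hypothetical names introduced here. It assumes the derived defaults from the struct comments (network and domain) are resolved on read rather than written back into the manifest.

```go
package main

import "fmt"

// Ingress keeps only the fields this sketch needs from the proposal's struct.
type Ingress struct {
	Type    string
	Domain  string
	Network string
}

// Stack is a trimmed stand-in for manifest.Stack.
// OperatorDomain is a hypothetical stand-in for operator.domain.
type Stack struct {
	Name           string
	OperatorDomain string
	Ingress        Ingress
}

// Defaults sets only the backend type, as the text describes.
func (s *Stack) Defaults() {
	if s.Ingress.Type == "" {
		s.Ingress.Type = "none"
	}
}

// EdgeNetwork resolves the effective network name ("<name>_edge" by default)
// without mutating the manifest. Hypothetical helper.
func (s *Stack) EdgeNetwork() string {
	if s.Ingress.Network != "" {
		return s.Ingress.Network
	}
	return s.Name + "_edge"
}

// BaseDomain resolves the effective base domain. Hypothetical helper.
func (s *Stack) BaseDomain() string {
	if s.Ingress.Domain != "" {
		return s.Ingress.Domain
	}
	return s.OperatorDomain
}

func main() {
	s := Stack{Name: "demo", OperatorDomain: "example.com"}
	s.Defaults()
	fmt.Println(s.Ingress.Type, s.EdgeNetwork(), s.BaseDomain())
}
```

Resolving the derived values lazily keeps the stored Ingress fields empty, which is one way to leave the manifest itself untouched.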
The edge backend interface
Parallel to runtime.Backend and secrets.Backend:
// internal/runtime/edge/backend.go
type Backend interface {
	// Contribute mutates the compiled compose: inject the edge service + the
	// network, and for each routed service drop host ports, join the edge
	// network, and stamp router labels. Pure compile-time, no runtime calls.
	Contribute(stack *manifest.Stack, compiled *compose.File) error
}
func FromManifest(cfg manifest.Ingress) (Backend, error) {
	switch cfg.Type {
	case "", "none":
		return NoneBackend{}, nil // today's behavior
	case "caddy":
		return NewCaddyBackend(cfg), nil
	default:
		return nil, fmt.Errorf("unsupported ingress backend %q", cfg.Type)
	}
}
Compile() (platform.go:213) resolves the backend with edge.FromManifest(stack.Ingress) and, after services are built, calls its Contribute(stack, &compiled.Compose). The none backend is a no-op, so the change is inert for current stacks.
Compile changes (concrete)
compose.File and compose.Service gain the fields they lack today (Networks on both, Labels on Service):
type File struct {
	Name     string
	Services map[string]Service
	Volumes  map[string]Volume
	Networks map[string]Network `yaml:"networks,omitempty"` // NEW
}
type Service struct {
	...
	Networks []string          `yaml:"networks,omitempty"` // NEW
	Labels   map[string]string `yaml:"labels,omitempty"`   // NEW
}
The caddy backend's Contribute:
- Adds networks: { <name>_edge: {} }.
- Injects the edge service: caddy-docker-proxy, docker socket mounted read-only, the only published port (443/80), joined to <name>_edge, with a global forward_auth snippet pointing at ingress.verify.
- For each service carrying a route: block, deletes its Ports, appends <name>_edge to Networks, and stamps Caddy labels:
labels:
  caddy: "{{ host }}"
  caddy.reverse_proxy: "{{ upstreams 3008 }}"
  caddy.reverse_proxy.flush_interval: "-1"      # keep idle WebSockets alive
  caddy.import: "forward_auth_edge {{ name }}"  # snippet → /edge/verify?service=<name>
caddy-docker-proxy watches Docker and regenerates config with zero-downtime reloads as containers start/stop, so "dynamic routes via API" is achieved without the operator running any reconcile loop. The route table is a function of the running compose, which the operator already owns.
Secondary win: routed services no longer lease from operator.port_pool; only the edge holds a published port.
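The per-service part of Contribute can be sketched as follows. This is an illustration under assumptions, not the shipped backend: the types are trimmed stand-ins, the contribute function and its parameters are hypothetical, and injecting the edge service itself is omitted. It stamps the direct caddy.forward_auth labels that the run-spike section of this document reports as the working form, rather than the snippet import shown above.

```go
package main

import (
	"fmt"
	"strconv"
)

// Route, Service and File are trimmed stand-ins for the manifest/compose types.
type Route struct {
	Port int
	Host string
}

type Service struct {
	Ports    []string
	Networks []string
	Labels   map[string]string
	Route    *Route // nil means the service keeps its host-published ports
}

type File struct {
	Services map[string]Service
	Networks map[string]struct{}
}

// contribute registers the edge network and, for each routed service,
// drops host ports, joins the edge network, and stamps router labels.
// verifyHost is the operator's host:port as seen from the edge network.
func contribute(f *File, edgeNet, domain, verifyHost string) {
	if f.Networks == nil {
		f.Networks = map[string]struct{}{}
	}
	f.Networks[edgeNet] = struct{}{}
	for name, svc := range f.Services {
		if svc.Route == nil {
			continue // unrouted services are left untouched
		}
		host := svc.Route.Host
		if host == "" {
			host = name + "." + domain // default "<service>.<ingress.domain>"
		}
		svc.Ports = nil
		svc.Networks = append(svc.Networks, edgeNet)
		if svc.Labels == nil {
			svc.Labels = map[string]string{}
		}
		svc.Labels["caddy"] = host
		svc.Labels["caddy.reverse_proxy"] = "{{ upstreams " + strconv.Itoa(svc.Route.Port) + " }}"
		svc.Labels["caddy.reverse_proxy.flush_interval"] = "-1"
		svc.Labels["caddy.forward_auth"] = verifyHost
		svc.Labels["caddy.forward_auth.uri"] = "/edge/verify?service=" + name
		f.Services[name] = svc // write the modified copy back
	}
}

func main() {
	f := &File{Services: map[string]Service{
		"agent-a": {Ports: []string{"3007:3007"}, Route: &Route{Port: 3008}},
		"db":      {Ports: []string{"5432:5432"}},
	}}
	contribute(f, "demo_edge", "agents.localhost", "operator:9000")
	a := f.Services["agent-a"]
	fmt.Println(len(a.Ports), a.Networks, a.Labels["caddy"])
	fmt.Println(f.Services["db"].Ports)
}
```

Because the function only rewrites the in-memory compose structure, it stays a pure compile-time step, matching the interface comment.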
Part 2 — minted, scoped tokens (shared by edge + operator API)
Token model
Extend the existing tokenMinter to carry an audience and a scope. The signing key is unchanged (explicit secret, else derived from the admin bearer via deriveJWTSecret — already symmetric, so the verifier derives the same key).
type Claims struct {
	jwt.RegisteredClaims // sub=actor, iss=angee-operator, exp, iat
	Audience string   `json:"aud"`             // "operator" | "svc:<service-name>"
	Scope    []string `json:"scope,omitempty"` // capability set for operator-API tokens
}
- Route token: aud = "svc:<service>", no scope. Authorizes opening that one service's socket through the edge.
- Operator-API token: aud = "operator", scope = the capability set the host backend's authz layer approved (e.g. ["service:read","service:up", "workspace:create"]).
Verifier (one, shared)
A single verifyToken(raw, wantAudience) (Claims, error) validates signature + exp + aud. It is used by:
- the edge GET /edge/verify?service=<name> forward_auth target: reads the token from X-Forwarded-Uri (?token=…, since browser WebSocket can't set headers) / Authorization / Sec-WebSocket-Protocol; requires aud == "svc:<name>". This is verify-acp-token.mjs promoted to one operator-owned endpoint.
- the operator API auth middleware (operator.go:671), described below.
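To make the signature + exp + aud check concrete, here is a self-contained sketch that mints and verifies an HS256 token using only the standard library. It is not the code in tokens.go: the mint helper, the trimmed claims struct, and the extra key and nowUnix parameters on verifyToken are assumptions made so the example runs on its own and can be tested deterministically.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// claims is a trimmed stand-in for the proposal's Claims struct.
type claims struct {
	Sub   string   `json:"sub"`
	Aud   string   `json:"aud"`
	Exp   int64    `json:"exp"`
	Scope []string `json:"scope,omitempty"`
}

var enc = base64.RawURLEncoding

// sign returns the base64url HMAC-SHA256 of msg under key.
func sign(key []byte, msg string) string {
	m := hmac.New(sha256.New, key)
	m.Write([]byte(msg))
	return enc.EncodeToString(m.Sum(nil))
}

// mint builds a compact HS256 JWT for the given claims.
func mint(key []byte, c claims) string {
	header := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	body, _ := json.Marshal(c) // cannot fail for this struct
	msg := header + "." + enc.EncodeToString(body)
	return msg + "." + sign(key, msg)
}

// verifyToken checks signature, expiry and audience, in that order.
// The algorithm is fixed to HS256; the token header is never trusted.
func verifyToken(key []byte, raw, wantAud string, nowUnix int64) (claims, error) {
	parts := strings.Split(raw, ".")
	if len(parts) != 3 {
		return claims{}, errors.New("malformed token")
	}
	want := sign(key, parts[0]+"."+parts[1])
	if !hmac.Equal([]byte(want), []byte(parts[2])) {
		return claims{}, errors.New("bad signature")
	}
	body, err := enc.DecodeString(parts[1])
	if err != nil {
		return claims{}, err
	}
	var c claims
	if err := json.Unmarshal(body, &c); err != nil {
		return claims{}, err
	}
	if nowUnix >= c.Exp {
		return claims{}, errors.New("token expired")
	}
	if c.Aud != wantAud {
		return claims{}, errors.New("wrong audience")
	}
	return c, nil
}

func main() {
	key := []byte("demo-signing-key")
	now := time.Now().Unix()
	tok := mint(key, claims{Sub: "alice", Aud: "svc:agent-a", Exp: now + 60})
	_, errSame := verifyToken(key, tok, "svc:agent-a", now)
	_, errOther := verifyToken(key, tok, "svc:agent-b", now)
	fmt.Println(errSame, errOther)
}
```

Passing the expected audience in from the caller is what lets one verifier serve both upstreams: the edge asks for "svc:<name>" and the operator API asks for "operator".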
Operator API auth: accept admin bearer OR a scoped minted token
Today the middleware is admin-token-or-nothing. Extend it to a two-tier check:
func (s *Server) auth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		raw, _ := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		switch {
		case s.config.Token == "":
			next.ServeHTTP(w, r) // unauthenticated dev
		case constantTimeEqual(raw, s.config.Token):
			next.ServeHTTP(w, r) // full access (server-to-server)
		default:
			claims, err := s.tokens.Verify(raw, "operator")
			if err != nil {
				unauthorized(w)
				return
			}
			ctx := withActorScope(r.Context(), claims) // enforce per-field scope downstream
			next.ServeHTTP(w, r.WithContext(ctx))
		}
	})
}
A field→scope map gates mutations for minted tokens (the admin bearer bypasses scope; it is the full-access server-to-server path the host backend uses to mint in the first place). This gives the operator API a real capability model it lacks today, additively: existing admin-bearer callers are unchanged.
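A field→scope gate of this kind might look like the sketch below. The mutation field names in fieldScopes and the allowed function are hypothetical; the capability strings come from the example scope earlier in this document, and the default-deny behavior for unmapped fields follows the caveat listed later under "Out of scope / caveats".

```go
package main

import "fmt"

// fieldScopes maps a mutation field to the capability it requires.
// The field names are illustrative, not the operator's actual schema.
var fieldScopes = map[string]string{
	"serviceUp":       "service:up",
	"workspaceCreate": "workspace:create",
}

// allowed reports whether a caller may run the given mutation field.
// admin models the admin-bearer path, which bypasses scope entirely.
func allowed(field string, admin bool, scope []string) bool {
	if admin {
		return true
	}
	need, ok := fieldScopes[field]
	if !ok {
		return false // default-deny: unmapped fields are closed to minted tokens
	}
	for _, s := range scope {
		if s == need {
			return true
		}
	}
	return false
}

func main() {
	scope := []string{"service:read", "service:up"}
	fmt.Println(allowed("serviceUp", false, scope))       // mapped and granted
	fmt.Println(allowed("workspaceCreate", false, scope)) // mapped, not granted
	fmt.Println(allowed("stackDestroy", false, scope))    // unmapped
	fmt.Println(allowed("stackDestroy", true, nil))       // admin bearer
}
```

Default-deny means a newly added mutation is closed to minted tokens until someone maps it, which is the safer failure mode if the map drifts from the schema.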
Host-backend (Django) connection rewrite
operatorConnection stops returning the admin bearer. Instead:
- The admin bearer stays in Django settings (ANGEE_OPERATOR_TOKEN), server-side only.
- On connect, Django authz-checks the actor, then calls mintConnectionToken(actor, scope, ttl) over the admin bearer and returns the minted, scoped token + endpoint to the browser.
- For chat, agentAcpEndpoint authz-checks, calls mintRouteToken(actor, service, ttl=60s), and returns the public serviceEndpoint URL + route token. The per-agent acp_auth_secret, the secretSet of acp-auth-secret, and the compose-port-scraping all disappear.
The browser therefore never holds a long-lived or full-access operator credential — only short-lived tokens scoped to exactly what the actor was approved for.
API the operator gains
type Mutation {
  # actor is authz-approved by the caller (host backend) before minting.
  # Extends the existing mintConnectionToken with audience + scope.
  mintConnectionToken(actor: String!, scope: [String!], ttl: String): ConnectionToken!
  mintRouteToken(actor: String!, service: String!, ttl: String): ConnectionToken!
}
type Query {
  # Replaces the host-side compose-port-scraping.
  serviceEndpoint(name: String!): ServiceEndpoint
  ingressStatus: IngressStatus
}
type ServiceEndpoint {
  routed: Boolean!       # false when ingress.type == none
  url: String!           # "wss://agent-svc-x.agents.example.com/" when routed
  internalHost: String!  # docker DNS name
  internalPort: Int!
}
type IngressStatus { type: String!, domain: String, routes: [RouteRef!]! }
type RouteRef { service: String!, url: String! }
Plus the internal (non-public) GET /edge/verify forward_auth target described above.
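How serviceEndpoint could derive its answer from the manifest is sketched below. The resolveEndpoint function and its parameters are hypothetical. The schema does not say what url holds when a service is not routed, so the empty string used here is an assumption, as are the use of the compose service name as the docker DNS name and a zero internalPort when no route is declared.

```go
package main

import "fmt"

// Route mirrors the two manifest fields this sketch needs.
type Route struct {
	Port int
	Host string
}

// ServiceEndpoint mirrors the GraphQL type of the same name.
type ServiceEndpoint struct {
	Routed       bool
	URL          string
	InternalHost string
	InternalPort int
}

// resolveEndpoint derives the endpoint for one service.
// A service is routed only when the stack uses the caddy backend
// and the service declares a route.
func resolveEndpoint(ingressType, domain, service string, route *Route) ServiceEndpoint {
	ep := ServiceEndpoint{InternalHost: service} // assumed docker DNS name
	if route != nil {
		ep.InternalPort = route.Port
	}
	if ingressType != "caddy" || route == nil {
		return ep // URL left empty: an assumption, see lead-in
	}
	host := route.Host
	if host == "" {
		host = service + "." + domain // default "<service>.<ingress.domain>"
	}
	ep.Routed = true
	ep.URL = "wss://" + host + "/"
	return ep
}

func main() {
	r := &Route{Port: 3008}
	fmt.Printf("%+v\n", resolveEndpoint("caddy", "agents.example.com", "agent-svc-x", r))
	fmt.Printf("%+v\n", resolveEndpoint("none", "", "agent-svc-x", r))
}
```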
How a service template looks (before → after)
Today an agent service template ships a Caddy + verifier sidecar and publishes :3007. After:
# rendered into angee.yaml by serviceCreate
services:
  agent-svc-{{ AGENT_ID }}:
    runtime: container
    image: angee/claude-code:latest
    command: ["stdio-to-ws", "--port", "3008", "--", "claude-code", "acp"]
    env:
      MODEL: "{{ MODEL }}"
    route:                       # ← replaces ports + the entire docker/ sidecar
      port: 3008
      host: "{{ AGENT_ID }}.{{ ingress_domain }}"
      auth: forward
Deleted from the template: docker/Caddyfile, docker/verify-acp-token.mjs, the :3007 publish, and all ACP_AUTH_SECRET plumbing. The container runs only stdio-to-ws and is unreachable except through the edge.
How the stack manifest template looks
{# templates/.../angee.yaml.jinja #}
version: 1
kind: stack
name: {{ STACK_NAME }}
ingress:
  type: caddy
  domain: {{ INGRESS_DOMAIN }}   # e.g. agents.localhost in dev
services:
  # routed services declare `route:` and publish nothing to the host
  ...
In dev, ingress.domain: agents.localhost + Caddy automatic local TLS gives wss://<svc>.agents.localhost/ with no host-port juggling. The operator's own GraphQL API can itself be a routed upstream (route: on the operator service) so the API and the agent services share one edge and one verifier.
Design options
- A. Label-driven (caddy-docker-proxy), recommended. Everything is compile-time labels; the proxy reconciles from Docker. No new runtime loop in the operator, deterministic, self-heals on restart, and fits the "compile one manifest → derived files" model exactly. The dependency is a container image, pulled only when ingress.type: caddy, so there is no host binary to bundle (unlike process-compose).
- B. Admin-API-driven (vanilla Caddy + :2019). Operator PATCHes routes on service up/down. More explicit, but adds a runtime reconcile loop and drift handling the operator does not have today. Keep as a fallback only.
Recommendation: A — it requires zero new runtime machinery in the operator.
Prior art & validated risks (2026-06 research)
A deep prior-art pass (20 primary sources, adversarially verified) confirms this design rather than finding something to adopt wholesale. Key conclusions:
- Build custom: nothing off-the-shelf fits the "operator mints / edge verifies / host decides" split. No manifest→Compose stack manager ships a liftable auth edge: Kamal dropped Traefik for kamal-proxy (which has zero auth); CapRover uses NGINX + EJS templates (no JWT auth); Coolify bolts on Authentik (a full external IdP). The pattern we want genuinely doesn't exist pre-packaged. (kamal-proxy, caprover nginx)
- Off-the-shelf auth proxies don't fit, with one exception. Ory Oathkeeper is verify+policy only (not a minter; it delegates issuance to Hydra); Pomerium mints its own asymmetric ES256 OIDC identity JWTs. Only ggicci/caddy-jwt maps to our model: HS256 sign_key, audience_whitelist (for aud=operator / aud=svc:<name>), and from_query (reads ?token=). We keep self-minting; caddy-jwt is a fallback for the verify step (see below).
- The ?token= spike is resolved: it works. Caddy forward_auth exposes the full original URI (incl. query) to the auth server via X-Forwarded-Uri, so the edge can read a WS token from the query string. Two maintainer-confirmed rules: the auth endpoint must return 2xx, never 101 (else the WS upgrade hangs), and the hop-by-hop Connection header must be stripped on the auth subrequest (header_up -Connection) while the backend still upgrades. This is the same failure class that broke Traefik (#3039, ~120 s timeout; fixed in 1.7). (caddy #5430, #6795)
- NEW critical risk: a config reload drops active WebSockets. Every caddy-docker-proxy reconcile is a Caddy config reload, and a reload drops live WebSockets even with stream_close_delay (v2.8.4; that flag only delays the close). Cross-route WS preservation is still open (caddy #7222 / PR #7649). Concretely: spinning up one new agent container would blip every other agent's live chat. Mitigate by (a) debouncing/batching reconciles so a burst of container events triggers one reload, (b) 60 s token TTLs + client auto-reconnect, and (c) treating an open socket as best-effort across reloads. This is the single most important thing to design around.
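Mitigation (a) is a trailing-edge debounce: each container event restarts a quiet timer, and a reload fires only once the timer expires. The sketch below models that rule as a pure function over event timestamps so the effect of a burst is easy to see. The coalesce function is hypothetical, timestamps are plain milliseconds, and where such debouncing would actually live (the proxy or the operator) is not settled by the text above.

```go
package main

import "fmt"

// coalesce takes container-event times in ascending order (milliseconds)
// and returns the times at which a reload would fire under a trailing
// debounce: one reload per burst, quietMs after the burst's last event.
func coalesce(eventsMs []int64, quietMs int64) []int64 {
	var fires []int64
	for i, t := range eventsMs {
		isLast := i == len(eventsMs)-1
		// The timer started at t expires unless another event lands
		// strictly before t+quietMs and restarts it.
		if isLast || eventsMs[i+1]-t >= quietMs {
			fires = append(fires, t+quietMs)
		}
	}
	return fires
}

func main() {
	// Three events in a 120 ms burst, then one isolated event.
	fmt.Println(coalesce([]int64{0, 50, 120, 1000}, 200)) // prints [320 1200]
}
```

Four container events produce two reloads instead of four; the cost is that a new route becomes reachable up to one quiet window later.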
Verify step — /edge/verify over caddy-jwt. Given the reload risk, prefer the custom /edge/verify endpoint: caddy-jwt's audience_whitelist is static per route, so per-service aud=svc:<name> would mean per-service labels → more config reloads. A single dynamic /edge/verify that derives the expected audience from the request name avoids extra labels/reloads and reuses the existing Verify(). Keep caddy-jwt documented as the zero-custom-code fallback. (Caveat: caddy-jwt has no published end-to-end WebSocket test.)
Run-spike results (2026-06, validated end-to-end)
A live caddy-docker-proxy:2.9 + mock /edge/verify + WS echo backend spike confirmed the design and resolved the open items:
- A browser WebSocket upgrade authenticates and succeeds through forward_auth. Probes: HTTP+valid → 200, HTTP+bad → 401, WS upgrade+valid → 101 Switching Protocols, WS upgrade+bad/missing → 401 (never 101).
- ?token= reaches /edge/verify via X-Forwarded-Uri, not the request URL. forward_auth rewrites the subrequest to the auth uri and sets X-Forwarded-Uri: <original path+query>. So /edge/verify must read the token from X-Forwarded-Uri (the implementation does; r.URL carries only the forward_auth ?service=).
- No global snippet is needed. Direct per-service labels caddy.forward_auth: <operator-host:port> + caddy.forward_auth.uri: /edge/verify?service=<name> generate a working config; forward_auth is ordered before reverse_proxy automatically. (This is what the backend emits; the earlier forward_auth_edge snippet idea is dropped.)
- No header_up -Connection workaround was required on Caddy 2.9: the Upgrade/Connection headers pass through and the 2xx-gated upgrade completes.
- The operator must be reachable as the caddy.forward_auth host on the edge network (default operator:9000, overridable via ingress.verify).
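The request handling these results imply can be sketched as follows: take the service name from the subrequest's own query, take the token from X-Forwarded-Uri first and the Authorization header second, and answer 200 or 401. This is an illustration, not the operator's handler. The tokenFrom, decide, and edgeVerify names and the check callback (a stand-in for the real token verifier) are hypothetical, and Sec-WebSocket-Protocol extraction is omitted because the text does not specify its encoding.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// tokenFrom extracts the token from the forwarded original URI (?token=...),
// falling back to an "Authorization: Bearer ..." header value.
func tokenFrom(forwardedURI, authorization string) string {
	if forwardedURI != "" {
		if u, err := url.Parse(forwardedURI); err == nil {
			if t := u.Query().Get("token"); t != "" {
				return t
			}
		}
	}
	if t, ok := strings.CutPrefix(authorization, "Bearer "); ok {
		return t
	}
	return ""
}

// decide returns the HTTP status for one auth subrequest.
// check is a stand-in for the shared verifier (signature + exp + aud).
func decide(service, forwardedURI, authorization string, check func(token, audience string) bool) int {
	token := tokenFrom(forwardedURI, authorization)
	if service == "" || token == "" || !check(token, "svc:"+service) {
		return http.StatusUnauthorized
	}
	return http.StatusOK // a plain 2xx; this endpoint never answers 101
}

// edgeVerify adapts decide to a handler that could back GET /edge/verify.
func edgeVerify(check func(token, audience string) bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(decide(
			r.URL.Query().Get("service"), // set by the forward_auth uri
			r.Header.Get("X-Forwarded-Uri"),
			r.Header.Get("Authorization"),
			check,
		))
	}
}

func main() {
	check := func(token, audience string) bool {
		return token == "good" && audience == "svc:agent-a"
	}
	_ = edgeVerify(check) // would be registered on the operator's mux
	fmt.Println(decide("agent-a", "/ws?token=good", "", check))
	fmt.Println(decide("agent-b", "/ws?token=good", "", check))
}
```

Deriving the expected audience from the service query parameter is what lets one endpoint serve every route without per-service verifier configuration.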
Out of scope / caveats
- runtime: local services can't join a Docker network; routing applies to runtime: container. Agent services are containers. Local routing (static upstreams) is a follow-up.
- forward_auth gates the upgrade, not the open socket: short TTL bounds re-connection; the open WebSocket lives on (same as today).
- X-Forwarded-Uri carrying ?token= through Caddy forward_auth is validated (see Prior art). Required rules: the /edge/verify response is 2xx-never-101, the auth subrequest strips hop-by-hop Connection, and the query token is stripped from access logs (short TTL bounds the leak; or smuggle via Sec-WebSocket-Protocol).
- Config reload drops live WebSockets: every caddy-docker-proxy reconcile reloads Caddy and severs active sockets (caddy #7222, open). Debounce/batch reconciles, use 60 s TTLs, and require client auto-reconnect.
- Scope→field map for operator-API tokens needs to be authored once and kept in sync as mutations are added; default-deny for unmapped fields.
- Single edge = single chokepoint — fine at this scale; note for capacity.
- TLS terminates at the edge (Caddy automatic HTTPS); backends stay plaintext on the private net.
Rollout
- Token model + verifier + two-tier API auth (additive; admin bearer still works). Ship mintRouteToken / extend mintConnectionToken.
- ingress backend + compile behind ingress.type (default none). Prove an opt-in stack end-to-end.
- Host backend (Django): rewrite operatorConnection to mint scoped tokens; rewrite agentAcpEndpoint to mint route tokens + serviceEndpoint.
- Service template teardown: drop the agent service's docker/ sidecar and acp_auth_secret.
Acceptance
- A stack with ingress.type: none compiles byte-identically to today.
- A stack with ingress.type: caddy + a routed service compiles a compose with one edge service (one published port), an <name>_edge network, the routed service stamped with labels and no host ports, and the edge forward_auth → /edge/verify.
- serviceEndpoint(name) returns the public wss:// URL; mintRouteToken issues a token /edge/verify accepts for that service and rejects for another.
- The operator API accepts a minted aud: operator token, enforces its scope per field, and still accepts the admin bearer at full access.
- The host backend exposes only short-lived minted tokens to the browser; the admin bearer never leaves the server.
- The claude-code service template, stripped of its sidecar, chats end-to-end through the edge.
See also
- internal/secrets/backend.go: the backend-by-type pattern this mirrors
- internal/runtime/backend.go: runtime backend interface
- internal/service/platform.go: Compile() the edge backend hooks into
- internal/operator/tokens.go: the minter this extends
- internal/operator/operator.go: the auth middleware this extends
- docs/proposals/stack-update-template-sync.md: sibling proposal