Giving a Kubernetes Agent My Server's Keys


In yesterday’s post I got kagent.buford.dev up and running, gated behind a Cloudflare Access + Google passkey login. That solved the front of the problem. But the agent UI sitting behind that gate didn’t actually do anything useful: it only had access to kagent’s bundled “preview” agents (Cilium, Istio, Argo, etc.), which mostly expect cluster topology I don’t have.

Today was the back of the problem: pointing a custom kagent Agent at four DevOps tools I built in Go, all of which talk to my own VPS over SSH. The result is a vps-devops agent that can answer “is the VPS healthy?”, “did anyone visit mosscreekdigital.com today?”, “what’s eating disk under /var?” — and actually surface real signal, not toy data.

The repo is bufordeeds/vps-mcp.

The Tool Server

vps-mcp is a small Go MCP server that exposes four tools:

Tool                  What it runs over SSH
vps_health            uptime, df -h, free -h, docker ps (one round-trip)
vps_caddy_logs        sudo cat ~/caddy-logs/*-access.log, parses JSON, filters by host + cutoff
vps_container_status  docker ps --format 'table …' --filter name=…
vps_disk_usage        sudo du -h --max-depth=N

All four go through input validation (no shell metacharacters, no .. in paths, domain regex check) before hitting SSH. The Caddy log parser is split into pure functions — parseCaddyArgs → filterCaddyEntries → formatCaddyEntries — so it’s table-driven testable without touching the network. That separation is the only reason I trust the parser in front of an LLM at all.

The final container is 3.1 MB on gcr.io/distroless/static-debian12:nonroot. Distroless is the right baseline when the agent will trip every static-analysis alarm anyway — there’s nothing else in the image to argue about.

Deploying Without a Registry

I tried pushing to ghcr.io first and discovered my local gh CLI didn’t have write:packages scope. Rather than do a browser auth dance, I bypassed registries entirely:

docker buildx build --platform linux/amd64 --tag vps-mcp:0.1.0 \
  --build-arg VERSION=0.1.0 --file deploy/Dockerfile --load .
docker save vps-mcp:0.1.0 -o /tmp/vps-mcp-0.1.0.tar
scp /tmp/vps-mcp-0.1.0.tar buford@vps:/tmp/
ssh buford@vps 'sudo k3s ctr images import /tmp/vps-mcp-0.1.0.tar'

K3s ships its own containerd, and k3s ctr works in the k8s.io namespace by default, so ctr images import lands the image right where kubelet looks for it. With imagePullPolicy: IfNotPresent, the deployment just uses the local copy. No registry, no auth, no ImagePullSecret.

Production-shaped? No. Demo-grade-and-honest-about-it? Yes. I’d swap to ghcr the day this becomes more than one box.

SSH Key Scoping (or: How Not to Hand an LLM Your Server)

The vps-mcp pod runs inside k3s and SSHes back to the host to execute commands. That means the pod’s SSH key is, for all practical purposes, a shell on my server. Two mitigations:

1. The from= clause in authorized_keys. I generated a fresh ed25519 keypair just for this pod and added the public key with a source-IP restriction:

from="10.42.0.0/24,172.18.0.0/16" ssh-ed25519 AAAAC3Nza... vps-mcp@k3s pod-only

10.42.0.0/24 is the k3s pod CIDR. 172.18.0.0/16 is the Docker bridge for Caddy. Anything outside those ranges presenting this key gets refused at the SSH daemon level. So even if the key leaks to a public mirror, it’s only useful to someone who’s already inside the cluster.
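To see which source addresses that from= list admits, here is a small Go sketch of the CIDR check. It only models plain CIDR entries; sshd's real pattern matching also handles wildcards, hostnames, and negation, which this ignores. The function name allowedFrom and the sample addresses are mine.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// allowedFrom reports whether addr falls inside any CIDR in a
// comma-separated list like the from= option above.
// Assumption: every entry is a plain CIDR prefix.
func allowedFrom(fromList, addr string) (bool, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return false, err
	}
	for _, p := range strings.Split(fromList, ",") {
		prefix, err := netip.ParsePrefix(strings.TrimSpace(p))
		if err != nil {
			return false, err
		}
		if prefix.Contains(ip) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	const from = "10.42.0.0/24,172.18.0.0/16"
	// Sample addresses: a pod, a Docker bridge container,
	// an address just outside the /24, and a public IP.
	for _, a := range []string{"10.42.0.17", "172.18.0.5", "10.42.1.9", "203.0.113.7"} {
		ok, err := allowedFrom(from, a)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-12s allowed=%v\n", a, ok)
	}
}
```

One thing the check makes visible: 10.42.1.9 is refused. If k3s hands a second node its own /24 out of a wider cluster range (which I believe is the default behaviour), the from= list would need widening the day this becomes more than one box.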

2. The pod’s own security context. runAsNonRoot: true, readOnlyRootFilesystem: true, runAsUser: 65532, all caps dropped, allowPrivilegeEscalation: false. The SSH key mounts at /etc/vps-mcp/ssh_key mode 0440 via a Secret, owned by the pod’s group via fsGroup: 65532 (one of the more annoying gotchas of distroless + secret mounts).

The combination is defense, not prevention. A compromised pod can still run any of the four tools’ SSH commands — it just can’t open arbitrary new shells from a hostile network.

Wiring kagent

Two CRDs do all the heavy lifting:

apiVersion: kagent.dev/v1alpha2
kind: RemoteMCPServer
metadata: { name: vps-mcp, namespace: kagent }
spec:
  protocol: STREAMABLE_HTTP
  url: http://vps-mcp.kagent.svc.cluster.local:8080/mcp
---
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata: { name: vps-devops, namespace: kagent }
spec:
  type: Declarative
  declarative:
    modelConfig: default-model-config
    systemMessage: |
      You are a DevOps copilot for a Linux VPS. Use the vps_* tools…
    tools:
      - type: McpServer
        mcpServer:
          name: vps-mcp
          kind: RemoteMCPServer
          apiGroup: kagent.dev
          toolNames: [vps_health, vps_caddy_logs, vps_container_status, vps_disk_usage]

The kagent controller reads RemoteMCPServer.spec.url, calls tools/list over MCP, and populates status.discoveredTools. The Agent’s toolNames is an explicit allowlist — even if the MCP server adds new tools later, the agent won’t see them until you opt in.
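The allowlist behaviour is easy to model. The sketch below is not kagent's implementation, just the semantics as I understand them: the agent sees the intersection of what the server advertises and what toolNames names. The function visibleTools is hypothetical, and the fifth discovered tool stands in for something added to the server later.

```go
package main

import "fmt"

// visibleTools returns the discovered tools that also appear in the
// allowlist, preserving discovery order. Anything the server adds later
// stays invisible until it is listed explicitly.
func visibleTools(discovered, allow []string) []string {
	allowed := make(map[string]struct{}, len(allow))
	for _, name := range allow {
		allowed[name] = struct{}{}
	}
	var out []string
	for _, name := range discovered {
		if _, ok := allowed[name]; ok {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// What the MCP server might advertise after a future release.
	discovered := []string{
		"vps_health", "vps_caddy_logs", "vps_container_status",
		"vps_disk_usage", "vps_container_action",
	}
	// The Agent's toolNames from the manifest above.
	allow := []string{"vps_health", "vps_caddy_logs", "vps_container_status", "vps_disk_usage"}

	fmt.Println(visibleTools(discovered, allow))
}
```

The useful property is the default: a new tool is denied until someone edits the Agent manifest, so widening the tool surface is always a reviewed change.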

While I was at it, I disabled the ten bundled preview agents I don’t use (helm upgrade … --reuse-values --values disable.yaml). That freed ~1.5 GB of RAM on the box, and the kagent namespace went from 19 pods to 9.

The Payoff

I asked the new agent two questions cold:

Is the VPS healthy?

healthcoach-anylist-1 is marked unhealthy. This should be investigated to see why it’s failing health checks.

It found a real problem. healthcoach-anylist-1 had been showing as unhealthy for who-knows-how-long; nothing in my dashboards had surfaced it.

Did anyone visit mosscreekdigital.com today?

50 requests in the last 24 hours. Several 404s on /wp-admin/install.php and /.git/config — typical bot scanning. Visitors from China, Slovakia, UK, etc.

Concrete traffic, real bot probes, real geographic spread. I asked it for an optimization pass next, and it caught that band-practice-migrate-1 had exited 10 days ago, plus 932 MB of /var/log accumulation I’d never noticed.

This is the moment where the demo stops being a demo. The agent isn’t summarizing canned data; it’s reading my server in real time and finding things I should fix.

The 20-Tool Wishlist (and Why I Said No)

I asked the agent what tools it would like me to build next. It came back with a pretty thoughtful 20-tool spec — container actions, log search, alerting rules, a metrics timeline, a DB backup tool, a database query tool, firewall rule editing, system config setting, webhook integrations.

I am building roughly four of them.

What got cut:

  • vps_health_alert — a thresholds engine. That’s not a tool, that’s a monitoring system. I already run Uptime Kuma. The right move is “agent reads Uptime Kuma’s API,” not “agent owns a thresholds DSL.”
  • vps_metrics_timeline — a 30-day time-series database. That’s Prometheus. Don’t roll your own.
  • vps_database_query — the agent itself flagged “careful with this.” Translation: don’t build it. An LLM composing arbitrary SQL against your databases is a data-loss waiting room.
  • vps_firewall_rules with add/remove — read-only is fine; giving an LLM ufw add is one prompt injection away from locking yourself out.
  • vps_system_config set — same category. sysctl writes from an LLM = no.
  • vps_database_backup — already exists in my B2 backup pipeline. The agent needs read access to backup status, not the trigger.

What I am building (eventually):

  1. vps_container_action: restart, rebuild, logs --grep --since. The biggest concrete pain point: I literally can’t restart healthcoach-anylist from the agent today. Easy to ship safely with a name allowlist.
  2. vps_log_search — generalize the Caddy parser to docker logs across containers. The data-flow shape is the same.
  3. vps_image_check — read-only registry queries to see which images have updates. Pure information.
  4. vps_storage_analysis — extends vps_disk_usage with smarter recommendations (top growers, cleanup targets).
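For the first item, the "name allowlist" idea might look something like the sketch below. Everything here is hypothetical since the tool doesn't exist yet: the function buildContainerCommand, the two allowlists, and the choice to support only restart and logs. The name pattern follows the character set Docker accepts for container names.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical allowlists; in a real tool these would come from config.
var (
	allowedActions    = map[string]bool{"restart": true, "logs": true}
	allowedContainers = map[string]bool{"healthcoach-anylist-1": true}
)

// nameRe mirrors the characters Docker permits in container names, which
// also rules out spaces, quotes, semicolons and other shell metacharacters.
var nameRe = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_.-]*$`)

// buildContainerCommand returns the argv to run, or an error if either the
// action or the container name is not explicitly allowed.
func buildContainerCommand(action, name string) ([]string, error) {
	if !allowedActions[action] {
		return nil, errors.New("action not allowed: " + action)
	}
	if !nameRe.MatchString(name) {
		return nil, errors.New("invalid container name")
	}
	if !allowedContainers[name] {
		return nil, errors.New("container not in allowlist: " + name)
	}
	return []string{"docker", action, name}, nil
}

func main() {
	inputs := [][2]string{
		{"restart", "healthcoach-anylist-1"},
		{"restart", "healthcoach-anylist-1; reboot"},
		{"rm", "healthcoach-anylist-1"},
	}
	for _, in := range inputs {
		argv, err := buildContainerCommand(in[0], in[1])
		if err != nil {
			fmt.Println("refused:", err)
			continue
		}
		// Joining is only safe because every element has been validated.
		fmt.Println("would run:", strings.Join(argv, " "))
	}
}
```

The regex check is redundant with the map lookup for today's single entry, but it keeps the guarantee intact if the allowlist is ever loaded from a file someone else can edit.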

The discipline is: small, side-effect-free where possible, narrowly scoped, and well-tested at the input boundary. Every “wouldn’t it be cool if the agent could…” idea needs to clear that bar before it gets a tool.

The instinct to give the agent everything is the same instinct that gives a microservice 47 endpoints. Resist it.

What This Setup Is and Isn’t

What it is: a working DevOps copilot, gated behind a passkey, that surfaces real signal on a real server, with a deliberately small and auditable tool surface.

What it isn’t: production-grade. The kagent UI ships with a bundled Postgres for dev/eval only — restart the pod and conversation history is gone. The image isn’t in a registry. There’s no rate-limiting on the MCP endpoint beyond k3s’s defaults. The SSH key gives the agent’s tools shell access I haven’t yet trimmed via ForceCommand.

But for two days of work — including the Cloudflare Access plumbing from yesterday and the Go server I started ten days ago — I’d take the trade-off again. The point of the demo was to learn kagent’s shape and the discipline of designing a tool surface that an LLM can use without lighting your own data on fire.

The four tools, two CRDs, and one carefully-scoped SSH key got me there.