Two founders, one cluster, eight services

intermediate

Our entire production stack — what we run, how we run it, and what we would skip if we were starting today.

We hear the same worry from founders: running production at startup pace sounds expensive. It does not have to be. This is what we actually do for our partnerships: how a two-person team keeps eight services up without paying for an SRE team.

The setup

One K3s cluster on Switch Engine, the OpenStack-flavoured Swiss cloud. Three nodes (one server, two workers), sized so a node loss does not page anyone at 3am. SSH access to workers goes via ProxyJump through the server, so the workers expose no public IPs.

The cluster runs:

  • Eight first-party charts under our helm-resources/ repo (the maram-website you are reading, ERPNext, Keycloak, n8n, Plane, Pyfolio, Semaphore, and a homepage portal)
  • A backup chart that ships every stateful namespace to Switch Cloud S3 nightly via restic
  • kube-prometheus-stack for metrics + dashboards
  • A Signal-API gateway that pushes Alertmanager events into our team chat
  • Uptime Kuma for external HTTP probes
  • GitLab Runner workloads, scoped per project
  • Traefik as the ingress, with cert-manager handling Let’s Encrypt certificates for every host

That is the entire footprint. No multi-cloud, no service mesh, no Istio — just enough.

How a change reaches production

Every product we engineer ships on the same loop. There are no special paths.

  1. Commit lands on the default branch. GitLab CI fires.
  2. Test stage runs. pnpm check (or go test, depending on the service). If it fails, the pipeline stops there.
  3. Build stage runs. Kaniko builds the Dockerfile and pushes to registry.gitlab.com/maram-gmbh/<service>:<short-sha>.
  4. Image-tag bump. A shared CI component called update-dev-values pushes a commit to helm-resources that updates <service>/values-dev.yml with the new image tag. The step is idempotent: re-running it with the same tag changes nothing. No human approves anything.
  5. ArgoCD reconciles. Within a couple of minutes ArgoCD notices the new commit on helm-resources, pulls it, runs helm template, and applies the diff to the cluster.
  6. Pod rolls. The new image is up; the old one is drained. Traefik routes traffic without a flinch because the readiness probe gates the cutover.
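
As a rough illustration of what "idempotent" means for step 4, here is a minimal Python sketch of a tag bump. It is not the update-dev-values component itself: the function name bump_image_tag and the assumption that the values file carries an indented `tag:` line are both hypothetical.

```python
import re

# Assumption: the values file has one indented `tag:` line (e.g. under `image:`).
TAG_LINE = re.compile(r"^(?P<indent>[ \t]+)tag:[ \t]*(?P<tag>\S+)[ \t]*$", re.MULTILINE)


def bump_image_tag(values_text: str, new_tag: str) -> tuple[str, bool]:
    """Return (updated_text, changed). Only the first `tag:` line is touched."""
    match = TAG_LINE.search(values_text)
    if match is None:
        raise ValueError("no 'tag:' line found in values file")
    if match.group("tag").strip('"') == new_tag:
        # Already at the requested tag: nothing to write, nothing to commit.
        return values_text, False
    start, end = match.span()
    replacement = f'{match.group("indent")}tag: "{new_tag}"'
    return values_text[:start] + replacement + values_text[end:], True
```

A CI job built on this would commit only when the second return value is True, so a retried pipeline for the same short SHA leaves helm-resources untouched.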

No human ever runs kubectl apply. No human ever copies a file to a VM. The only privileged operation a human does is ansible-playbook --ask-vault-pass when we add a new service or rotate a vault key — and even that is one command.

The total elapsed time from “commit lands” to “users on the new version” is between 90 seconds and 4 minutes, depending on the build size.

What makes this hold up without us watching it

A GitOps loop on its own is not enough. What lets us actually sleep:

  • Every chart has resource requests + limits. Limits cap what any pod can claim; requests let the scheduler refuse to pack a node beyond its capacity.
  • Every workload has liveness + readiness probes. A container that locks up fails its liveness probe and is restarted within seconds.
  • HPA where it matters. Bar/match-style services scale horizontally on CPU; UI services do not (we know the daily ceiling).
  • syncPolicy.automated with prune: true + selfHeal: true on every ArgoCD Application. If someone fat-fingers a Deployment via kubectl, ArgoCD overwrites it from Git within a minute.
  • ignoreDifferences calibrated. For things ArgoCD should not own (replica counts from HPA, Bitnami first-init secrets), we tell it to look away. Everything else is enforced from Git.
  • Alerts fire to Signal, not email. When Prometheus catches a pod restart loop or a low disk-space warning, it lands in our group chat in seconds. Email alerts get ignored — Signal alerts do not.
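
The interplay between selfHeal and ignoreDifferences can be pictured with a toy model in Python. This is not how ArgoCD is implemented; it only sketches the idea that ignored paths are masked on both sides before the comparison. The helper names and the dict-only path handling are simplifying assumptions.

```python
from copy import deepcopy


def drop_pointer(doc: dict, pointer: str) -> None:
    """Delete the value at a path such as '/spec/replicas', if present (dicts only)."""
    parts = [p for p in pointer.split("/") if p]
    node = doc
    for key in parts[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return  # path does not exist, nothing to mask
    if parts:
        node.pop(parts[-1], None)


def needs_heal(desired: dict, live: dict, ignore: list[str]) -> bool:
    """True when live differs from desired anywhere outside the ignored paths."""
    masked_desired, masked_live = deepcopy(desired), deepcopy(live)
    for pointer in ignore:
        drop_pointer(masked_desired, pointer)
        drop_pointer(masked_live, pointer)
    return masked_desired != masked_live
```

With "/spec/replicas" ignored, an HPA scaling a Deployment from 2 to 5 replicas is not treated as drift, while a hand-edited image tag still is.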

Bring-up is one Ansible command

The cluster is reproducible. A new node, a new region, a clean rebuild — all start from the same line:

ansible-playbook --ask-vault-pass -i inventory.yml setup_bodhi.yml

That playbook installs Python, Helm, and K3s, joins the worker nodes, lays down Traefik + cert-manager, drops in the ArgoCD instance, and configures the SSH ProxyJump for kubectl access. It is tagged so we can re-run a single phase without redoing the rest, and it is idempotent: running it twice produces zero changes.

Per-service deploys are scoped equivalents — one playbook per service, each one creating the namespace, pre-provisioning the env Secret if needed, applying the ArgoCD Application, and waiting for the sync to go green.

We have fifteen-ish of these playbooks. They are all short — most are under 100 lines. The infrastructure repo is a flat, scannable thing, not a 50k-line Terraform monorepo.

Backup discipline (the part most startups skip)

Eight services means eight stateful systems whose data we cannot afford to lose. Our backup chart spawns one restic CronJob per stateful namespace. Each one:

  • Runs nightly (staggered between 01:30 and 03:45 UTC so they do not compete for IO)
  • Snapshots the namespace’s PVCs to a single restic repo on Switch Cloud S3
  • Pushes a heartbeat to our Uptime Kuma instance, but only if the snapshot succeeded
  • Honours a retention policy (daily / weekly / monthly) so we are not paying to keep ten years of snapshots
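
The staggering in the first bullet is simple arithmetic. A small Python sketch of one way to derive the cron expressions, assuming the jobs are spread evenly across the window; the function name and the namespace names in the usage note are placeholders, not our actual chart values.

```python
def staggered_schedules(namespaces, start=(1, 30), end=(3, 45)):
    """Spread one daily cron expression per namespace evenly across [start, end]."""
    if not namespaces:
        return {}
    first = start[0] * 60 + start[1]  # window start, in minutes after midnight
    last = end[0] * 60 + end[1]       # window end, in minutes after midnight
    # Integer spacing between consecutive jobs; a single job runs at the start.
    step = (last - first) // (len(namespaces) - 1) if len(namespaces) > 1 else 0
    schedules = {}
    for i, ns in enumerate(sorted(namespaces)):
        minute_of_day = first + i * step
        schedules[ns] = f"{minute_of_day % 60} {minute_of_day // 60} * * *"
    return schedules
```

For four namespaces this yields jobs 45 minutes apart, starting at 01:30 and ending at 03:45.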

The Uptime Kuma push-heartbeat is the trick. A snapshot that “ran but failed” still produces log output, and nobody reads that output. Uptime Kuma, on the other hand, notices the missing heartbeat and pages us, so failure becomes loud.
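
A minimal Python sketch of the heartbeat-on-success pattern follows. The wrapper name run_backup, the injectable run and push parameters, and the example push URL are all hypothetical. The status and msg query parameters mirror the push URL that Uptime Kuma shows for a push monitor, but treat them as an assumption and copy the exact URL from your own instance.

```python
import subprocess
import urllib.parse
import urllib.request


def run_backup(cmd, push_url, run=subprocess.run, push=urllib.request.urlopen):
    """Run the backup command; push the heartbeat only when it exits 0."""
    result = run(cmd)
    if result.returncode != 0:
        # Stay silent on failure: the missing heartbeat is what raises the alarm.
        return False
    query = urllib.parse.urlencode({"status": "up", "msg": "OK"})
    push(f"{push_url}?{query}", timeout=10)
    return True
```

The point of the design is that there is no failure notification to forget or misroute. The job reports success only, and the monitor treats silence as the failure signal.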

We also do not believe a backup exists until we have restored from it at least once. Each service has a runbook under wiki/recover-<svc>.md that has been exercised end-to-end. Untested restores are not backups.

What we would skip if we were starting today

In the spirit of not overselling — here is what we would do differently if it were day one:

  1. Pick K3s sooner. We did not need full Kubernetes from day one. K3s gives us 95% of what we use for a fraction of the operational surface. The kubelet is the same kubelet; the API server is the same; the difference is what comes installed and what does not.
  2. Skip the multi-cluster instinct. A single cluster on a single Swiss cloud has not let us down. Multi-region is a Stage 2 problem and we are not at Stage 2.
  3. Adopt restic earlier. We tried two other backup tools first. Both were heavier and less honest about failures. Restic + Switch Cloud S3 has given us six months of “it just works”.
  4. Make Signal the alert path from day one. Three months of email-routed alerts produced exactly zero responses. The switch to Signal was the difference between “we have monitoring” and “monitoring works”.

The takeaway for founders

You probably do not need the world’s biggest cloud account. You need a single cluster, a single CI/CD loop, observable workloads, and a backup story you have tested. The minimum credible production setup for an early-stage product is much smaller than the architecture-diagram industry would have you believe.

If you would like to see this stack in your own product — or just want a second opinion on a setup you already have — we are an email away.

Kubernetes · K3s · GitOps · Switzerland · Production
