

Automated Kubernetes PV backups using Velero and Tailscale

Posted: 2023-05-01 20:46:12 |
Last update: 2023-05-01 20:46:12

The problem

There is no such thing as a bug-free computer system.
Even less so when you're just hosting one in your free time while also working a job.
To mitigate the disasters that might strike at any point, it is generally a good idea to automatically back up all of your data at least once a day.

I used to do this with Borg backup, which worked great back when all of my data lived in Docker volumes.
These days, however, my data is in persistent volumes managed by a Ceph cluster and replicated across three nodes, which is a whole can of worms and cannot be backed up as easily.
There are a lot of options out there, and I got hit with a pretty bad case of decision paralysis for almost two years.
Since small-scale self-hosting is not really what the people making Kubernetes and cloud-native tooling have in mind, most solutions assume enterprise-grade software and budgets, neither of which I really have available.
So I ended up doing manual SQL dumps and backups of the most important apps I'm hosting every time I do something risky in the cluster, which is less than ideal.

The solution

In addition to this multi-node cluster that lives in a server farm, I also have access to a home server that is always running to provide services which are not supposed to be accessible to the wider internet.
Since upgrading my last-gen game consoles to SSDs, I also happen to have a couple of large-ish HDDs lying around.
I figured I would combine the two to build myself an inexpensive backup solution.

My cluster does not have access to my home network.
I also did not want to open up any services I run at home to the wider internet, because that is a scary place and I don't want my data to get lost there.
A couple of years ago, I would have considered this a good situation to deploy OpenVPN.
By now, however, I have deployed OpenVPN enough times to say that it is not really fun to work with. What is much more fun is Tailscale: a tool that automatically configures WireGuard networks between your hosts and comes with a very generous free plan allowing up to 100 devices.
I've been playing around with Tailscale quite a lot and I really enjoy its ease of use and deployment.
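
Getting a machine onto a Tailnet is essentially a one-liner. As a rough sketch, assuming an auth key generated in the admin console (the key below is a placeholder):

# Official install script, then join the Tailnet non-interactively
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --authkey tskey-auth-XXXXXXXXXX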

So now that I had a way to connect my cluster to my home server, all I needed was a tool to create automated backups.
As mentioned above, there are a lot of different offerings available.
I chose Velero because it seemed pretty polished and had the features I wanted from it.
In addition to an easy-to-use method of backing up Kubernetes resources, it has beta-level support for using restic to back up persistent volumes, which is what I mostly care about, since most of my resources are managed by manifests that I track in a git repository.

The biggest obvious downside of Velero is that it can only back up to S3-compatible storage, even if you use the restic integration.
If it didn't work like this, I could simply use restic to back up the PVs to my inexpensive storagebox, which I can access through SFTP and have been using for backups for years now.
Alas, I needed S3 storage, so I chose a simple dockerized MinIO deployment on our home server, which I can then also mirror to the storagebox using Borg backup.
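
For illustration, a minimal sketch of what such a deployment could look like; the paths, ports, and credentials here are placeholders, not my actual setup:

# MinIO stores its data on one of the spare HDDs; it serves HTTPS
# if it finds public.crt/private.key in its certs directory
# (the certificates are covered in the next section)
docker run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -v /mnt/backup-hdd/minio:/data \
  -v /etc/minio/certs:/root/.minio/certs \
  -e MINIO_ROOT_USER=backup-admin \
  -e MINIO_ROOT_PASSWORD=change-me \
  minio/minio server /data --console-address ":9001"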

Technical details

Since S3 works over HTTPS, to use it you either need a valid TLS certificate or have to tell whatever is accessing the storage to ignore certificates, which is not a great idea.
Thankfully, Tailscale can issue valid Let's Encrypt certificates for your nodes using DNS challenges.
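
On the node running MinIO, that boils down to a single command, assuming HTTPS certificates are enabled for your Tailnet in the admin console (the hostname is a placeholder):

# Fetches a Let's Encrypt certificate via a DNS challenge and writes
# the .crt and .key files to the current directory
tailscale cert minio-host.your-tailnet.ts.net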

The nodes in a Tailnet all get a domain that looks like this: hostname.tailnet-name.ts.net.
You can use this to access them from within the Tailnet but not from the outside.
Your pods resolve DNS through the CoreDNS deployment inside your Kubernetes cluster, so for them to be able to resolve those domains, CoreDNS needs to know where to send the queries.
Luckily that is pretty easy to accomplish with a simple config change:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
    your-tailnet.ts.net:53 {
        errors
        cache 30
        forward . 100.100.100.100
    }

This tells the DNS server to forward any queries for your Tailnet's domains to Tailscale's own nameserver at 100.100.100.100.
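
You can verify that the forwarding works from inside the cluster with a throwaway pod; the image tag and hostname are placeholders:

kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- \
  nslookup minio-host.your-tailnet.ts.net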

Once this is done, you can install Velero as explained in their docs.
It did take me some fiddling with the install command to get it to work as expected: namely, I had to pass --backup-location-config for it to set up the backup location properly.
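
The command ended up looking roughly like this; the bucket name, plugin version, and URL are placeholders rather than my exact values, and the credentials file contains the MinIO access keys in the usual AWS format:

# --use-node-agent enables the file system (restic) backup support,
# --backup-location-config points Velero at the MinIO box over the Tailnet
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.6.0 \
  --bucket velero \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --use-volume-snapshots=false \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=https://minio-host.your-tailnet.ts.net:9000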

Once Velero is up and running, you can start making backups using velero backup create $name --default-volumes-to-fs-backup.
This might take quite a while, especially the first time around.
Once it is done you should see a backups and a restic directory in your S3 bucket, which contain the resource and volume backups respectively.
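
You can check on a backup while it runs (and afterwards) using the CLI:

# List all backups and their status
velero backup get
# Show per-volume progress and any errors for a single backup
velero backup describe $name --details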

After making sure that everything is working as expected, don't forget to set a schedule using a command like velero schedule create daily-backups --schedule "30 4 * * *" --default-volumes-to-fs-backup, which automatically makes a new backup at 4:30 UTC every day.
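
Scheduled backups show up as regular backups named after the schedule, and restoring works along the same lines; the backup name below is a made-up example:

# List schedules and check when they last ran
velero schedule get
# Restore the cluster state and volumes from a specific backup
velero restore create --from-backup daily-backups-20230501043000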

Pain points

Deploying all of this was not without its pain points.
First of all, a ton of new things were needed when all I really wanted to do was update my cluster, and then a bunch of the tools also did not behave as I expected.

After my initial install of Velero, it only backed up the cluster's resources and didn't touch the volumes. This came down to me missing the --default-volumes-to-fs-backup parameter and the fact that I was following three differently outdated guides at once.
My backups also all ended up in the PartiallyFailed state, because restic threw an error whenever it tried to back up an empty volume, something that was supposedly fixed in the version I thought I was running.
Once I got that working, I had mostly working backups going to a MinIO deployment on my girlfriend's server; I had not yet set one up on our home server because I did not want to do all of those things at once.
Two days later, I figured I could set up the rest as well, which went okay until I started my first backup, which ended up on the wrong S3 backend, even though I had set the new one as the default.
Once I figured out that the new backup location was unavailable due to misconfigured credentials, I started a new backup job.
This one complained about incompatible restic versions, which made no sense at all, since the bucket did not contain any files created by restic.
I decided to delete the old backup location, and now it complained about the repositories apparently not existing at all.
I deleted all backuprepository resources (I did find an issue from someone with a similar problem back when that resource was still called resticrepositories; the renaming was not fun to figure out) and got the version mismatch error again.
Since I could not find a single other person who had faced this issue before, I went for the brute-force method of completely uninstalling Velero from the cluster and installing it again with only the new configuration.
This finally worked, and I was able to create a full backup to my local server.
As a bonus, the PartiallyFailed issue suddenly did not occur anymore either.
Apparently, when I upgraded the Velero deployment on my cluster from 10.0.0 to 11.0.0, it did update what velero version reported, but did not actually upgrade the whole thing. Fun.
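
In case anyone else ends up in the same hole, the brute-force reset boils down to this, assuming the default installation namespace:

# Removes the Velero deployment and its CRDs from the cluster;
# afterwards, run a fresh velero install with only the new configuration
velero uninstall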

Conclusion

Velero is pretty good software, especially if you only need backups for Kubernetes resources.
The restic integration is very nice to have, but it is nowhere near as polished as the rest of the tool, and you can feel that every step of the way.
I am still happy I went down this path and I think I will stick with this backup solution for the time being.

Tailscale, on the other hand, is just amazing software that keeps surprising me with cool new features and things that work really well out of the box. They did not pay or influence me in any way to write this, and I can really recommend that everyone who needs a private network check it out. It also has some neat features that allow, for example, sharing locally hosted game servers with friends.

Thanks go out to my girlfriend, who supplied me with S3 storage to test out Velero in the first place, helped set up the non-Kubernetes parts of this whole deployment, and helped me through some pretty bad impostor syndrome after I failed to fully deploy a backup solution, update Ceph, and update my whole cluster in a single night.