Sometimes it's just not your day, but it may be a golden opportunity to learn! In this article: how I learned some valuable lessons on a self-indulgent Kubernetes rescue mission.

Ouch! (Rancher being unable to connect)

So what happened exactly? Let's start with some context:

In my homelab I had a small internal k8s cluster for testing, deployed using RKE. It consisted of an etcd node, a control plane node and two worker nodes. Everything worked fine until today, when I decided to change some parameters in my cluster deployment YAML.

I thought that would be no problem for an RKE-managed cluster; it's actually quite easy under normal circumstances. I loaded up my workstation VM and navigated to my GitLab instance to fetch the cluster configuration...

Once I clicked the link, boom!

What the hell is going on?!

Soooo... it turned out that the repository where I keep my cluster.yaml and, most importantly, its RKE state file had been corrupted, and no working backups had been taken for a long, long time (even though I thought I was taking daily backups). In other words, there was no way to restore my cluster configuration or its state!

I struggled with my GitLab instance for a couple of hours in denial until I came to terms with the fact that it was lost forever. Then I thought that maybe RKE could save the day: what if it could recreate the cluster state from a simple config and my still-working cluster? Then I could do an "rke up" and the state file would perhaps reappear.

So I did. After a few minutes I had my cluster.yaml recreated, and it now looked like this:

ignore_docker_version: true
kubernetes_version: "v1.14.6-rancher1-1"
ssh_key_path: ~/.ssh/id_rsa
nodes:
  - address: 192.168.14.203
    user: root
    role:
      - controlplane
    hostname_override: master-1
  - address: 192.168.14.202
    user: root
    role:
      - etcd
    hostname_override: etcd-1
  - address: 192.168.14.201
    user: root
    role:
      - worker
    hostname_override: worker-1
  - address: 192.168.14.207
    user: root
    role:
      - worker
    hostname_override: worker-2

The Kubernetes version wasn't exactly the same as before, and I wasn't sure about the additional configuration options, so I decided to leave them at their defaults and hope for the best.

Having learned my lesson, I decided to quickly take an etcd snapshot before anything else, and I'm grateful I did, but more on that later.

rke etcd snapshot-save --name preupdate

My backup was sitting under /opt/rke/etcd-snapshots/ within a minute! Success!

Thinking that everything was in order, I quickly typed 'rke up' without giving it much thought, but that did not work, giving me timeout errors :-(

In fact, something went seriously wrong at that point as I could no longer get any API response from the control plane.

Unfortunately, I don't have any logs of that part, so I am not really sure what exactly happened. After a couple of attempts I decided to give up and restore the backup I had just taken. I was convinced at the time that something was wrong with etcd, and luckily I had taken a backup.

I must say at this point that, in my experience, RKE is very resilient to misuse and generally safe to use. I have restored etcd snapshots to completely different clusters in the past and it worked. But somehow this one time was different; perhaps the missing original RKE state file is what caused everything that follows.

rke etcd snapshot-restore --name preupdate --config cluster.yml
...
...
FATA[0033] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 128, container logs: Error: snapshot missing hash but --skip-hash-check=false

The restore failed, and the worst part was that it had already wiped the etcd data clean, as kube-cleaner had automatically run before it!

Now I was desperate. I started googling the last error, sitting there with a broken cluster, a broken config and, apparently(?), a broken backup (not to mention a broken ego).

Luckily some answers came up...

https://github.com/rancher/rke/issues/1501
https://github.com/rancher/rancher/issues/21754
https://gitmemory.com/issue/rancher/rke/1914/587324371

Unfortunately, none of them were useful for me, as in 99% of the cases this error is caused by including the .zip extension in the restore command, something I hadn't done. Last hope gone.


“But I know, somehow, that only when it is dark enough can you see the stars.”
― Martin Luther King, Jr.


At this point I was about to give up, wipe the whole cluster and start clean, or maybe even try to restore the backup I had taken into a new cluster, as I had done successfully in the past. That could have worked perfectly fine, but the whole point of the homelab is to learn! So I decided to persist, take the hard way and figure this whole thing out.

I started researching the error and cloned the RKE and etcd repositories, searching for the strings of the error message. It quickly became clear that the error was coming from etcd itself: loading the backup did not work because etcd could not verify the snapshot's integrity hash.

What needed to be done was to pass the --skip-hash-check flag to etcd in the restore command, and then maybe I had a chance of getting this to work. Usually a failing integrity check is not a good sign, and rarely does anything useful come out of it. Anyway, I decided to try.
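
For context, here is roughly the logic behind that error, as I pieced it together from the etcd snapshot-restore code. This is a simplified sketch from memory, not the verbatim etcd source: a snapshot written by "etcdctl snapshot save" gets a SHA-256 digest appended to the boltdb file, and a restore refuses a file without that digest unless the hash check is explicitly skipped.

package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

// hasChecksum mirrors etcd's heuristic: boltdb files are 512-byte aligned,
// so a 32-byte remainder suggests an appended SHA-256 digest.
func hasChecksum(size int64) bool {
	return size%512 == sha256.Size
}

func checkSnapshot(path string, skipHashCheck bool) error {
	fi, err := os.Stat(path)
	if err != nil {
		return err
	}
	if hasChecksum(fi.Size()) {
		return nil // digest present: the restore verifies it as usual
	}
	if skipHashCheck {
		return nil // no digest, but verification was explicitly skipped
	}
	return fmt.Errorf("snapshot missing hash but --skip-hash-check=false")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: snapcheck <snapshot-file>")
		return
	}
	if err := checkSnapshot(os.Args[1], false); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("snapshot appears to carry the appended hash")
}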

The problem, of course, was that there was no easy way to do this in my situation: RKE's YAML configuration did not provide any way to pass the --skip-hash-check=true flag (or at least I did not find one), and there is no way to do it manually, as RKE spawns ephemeral containers to do all of that.

Now what?

I started digging through the RKE configuration options, desperately trying to find a way to pass this flag to the etcd restore container, but to no avail.

services/etcd.go - part of RKE source code

Looking at the source code I had cloned earlier while searching for strings, I realized that maybe I could rebuild the RKE executable myself. Of course, I am not a Golang person, so I couldn't really rewrite anything major, but adding a simple parameter that the developers forgot, or (justifiably) chose not to expose as a switch, was nothing major, and maybe I could pull it off.
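
The change itself was tiny. I no longer have the exact diff, but it boiled down to appending one extra argument where services/etcd.go assembles the etcdctl command for that ephemeral restore container. The sketch below is a rough reconstruction with made-up helper and path names, not the verbatim RKE code:

package main

import (
	"fmt"
	"strings"
)

// buildRestoreCmd is a hypothetical stand-in for the place in services/etcd.go
// where RKE assembles the etcdctl command it hands to the restore container.
func buildRestoreCmd(snapshotPath, dataDir string) string {
	args := []string{
		"/usr/local/bin/etcdctl",
		"snapshot", "restore", snapshotPath,
		"--data-dir=" + dataDir,
		// the one-line addition: accept a snapshot without an appended hash
		"--skip-hash-check=true",
	}
	return strings.Join(args, " ")
}

func main() {
	// Example paths only; the real values come from the cluster configuration.
	fmt.Println(buildRestoreCmd("/opt/rke/etcd-snapshots/preupdate", "/var/lib/etcd"))
}

With that single extra argument, the restore container no longer insists on the appended hash.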

So I started following the build instructions from the README, downloading Golang and attempting the build, which to my surprise actually worked (!) after a few tries. A few minutes later I had a custom executable sitting in the build/bin directory.

After that, everything went pretty smoothly, despite the warning about it being a custom build:

./rke-custom etcd snapshot-restore --name preupdate --config cluster.yml
WARN[0000] This is not an officially supported version (4a0941b-dirty) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases/latest
INFO[0000] Running RKE version: 4a0941b-dirty
INFO[0000] Restoring etcd snapshot preupdate
...
INFO[0229] Finished restoring snapshot [preupdate] on all etcd hosts
Using the custom RKE executable to do the etcd snapshot restore.
rke up
INFO[0000] Running RKE version: v1.0.4
INFO[0000] Initiating Kubernetes cluster
INFO[0001] [certificates] Generating admin certificates and kubeconfig
INFO[0001] Successfully Deployed state file at [./cluster.rkestate]
INFO[0001] Building Kubernetes cluster
...
INFO[0086] Finished building Kubernetes cluster successfully
Running rke up after the etcd snapshot restore.

Really?.. Yup! The etcd snapshot was restored and a usable cluster was back up:

Cluster is back up! Note: the age of the nodes has not changed - this was not a new cluster after all.

Four hours later I am still so excited that I decided to write this blog post, something I always make plans for but ultimately neglect to do.

P.S. I am still trying to figure out what the hell is going on with my GitLab instance eating repos!

P.P.S. Some lessons learned:

- Do not assume anything and don't rush things; you will make things worse.

- Believing you are taking backups does not mean said backups are indeed being taken, or that they are in fact restorable.

- Perseverance and patience can provide very valuable lessons, if of course you can afford them (as I could in this homelab testing setup).

Cheers!

Originally posted on epapandreou.com