And a postmortem on my Mastodon instance's outage on 2023-11-23.
It’s been a while! It’s also been a hell of a week for my Mastodon instance, which has had two outages in a span of days:
The short answer? No.
This isn’t a major server that’s powering mission-critical systems that millions of people are using - it’s a tiny server used pretty much exclusively by me, hosting a Mastodon instance (for me and - currently - one other family member), my private Jellyfin instance so I can have the convenience of music and movie streaming but with my own music (I own a good chunk of music that isn’t available on Spotify!), and a couple of Discord bots exclusive to my friends’ servers. All in all, not really that important for anyone who isn’t me or directly affiliated with me in some way - but it does matter to me, and at this point I have similar needs of it as major services demand of their own servers:
(I just don’t need the “scale to infinitely-many users without slowing down” part. Strictly speaking.)
So… It’s at least a good enough excuse for me to write a blog post, and I definitely need more of those. One post a year is enough, right? :P
The server in question is a Raspberry Pi 4B 8GB model, which honestly manages requirement 1 fairly well; it used to be a 4GB model, but since I run Mastodon with ElasticSearch enabled, the RAM usage was a little high for my tastes (and I don’t like not having search) - 8GB is plenty for a ~single-user instance, though! With a weekly restart just to ensure that the various background jobs (backup scripts, etc) don’t leave anything behind, this little server has been running with basically no issues for over a year now!
Honestly, Raspberry Pis continue to impress me and make me feel giddy that I can just run my own stuff; it’s such a nice feeling, and I’m not even using them optimally yet (the 4GB Pi could really be used to run the less demanding workloads - like the bots, Grafana, etc…). Yet, somehow, something went wrong on Thursday - why?
Well… technically, the 8GB Pi has run for ~6 months without issues; the 4GB Pi, in the same amount of time, had two instances wherein it “locked up” and became completely inaccessible - over SSH, to ping requests, etc - and thus required a physical restart. It wasn’t really an issue at the time, as it was basically right next to me, but it’s still less than ideal. Typically, this happened while I was actively working on it and maybe pushing it a little hard, so I had initially chalked this up to a RAM availability issue - the 8GB Pi not exhibiting this behaviour only served to reinforce this belief. Until yesterday, when it did the exact same thing.
Only, this time, I’m not right there - I’m at university! Hence, a roughly 11-hour downtime, as I had to wait for a family member back home to restart it for me - not ideal… So, let’s do some post-mortem analysis of what went wrong, and how we might either fix it or mitigate it.
If we look at the CPU usage graph for the period in question, we see some interesting behaviour:
This is my at-a-glance graph, so it doesn’t drill down into what the CPU is doing if it’s not something specifically for Mastodon, my Discord Bots, or Jellyfin, but we can see some things nonetheless:
Interesting. Now, I did inspect RAM usage as well, and while that isn’t shown here, the simple summary is: RAM peaked at about 50-60% usage, but was stable before and throughout the issue. Swap was completely unused during the uptime from Tuesday to Thursday.
We can conclude from this that the Pi was still performing operations, and was trying to do something (we just don’t know what) very aggressively in the background. With this, we still have no clue as to the issue, as it appears like there shouldn’t have really been one. So we’ve ruled out the Pi being overloaded and a RAM exhaustion issue; what about the network itself?
I could VPN in during the period of downtime, so I can safely say the network the Pi is attached to was working correctly - but I can also verify that the Pi was not connected to the network at the time, based on the router’s clients list.
Looking at the Network Traffic graph, it’s a bit hard to see anything, since during normal operation it looks like there’s hardly any network activity anyway; the 80 Mbps spike that occurs on restart just dwarfs everything else. Let’s zoom in and look at just the period of unavailability, plus a tiny bit before that:
Ahah! We can clearly see now that the network usage was pretty active during the period before the downtime, but absolutely nonexistent the moment the downtime began. We can now safely say the issue was purely a networking one, localised to the Pi itself. I’m not really sure how to interpret the other graph nearby that really stood out to me, but I’ll show it nonetheless as I feel it probably explains what the pegged CPU core was doing:
Insofar as I can tell, we had a spike in ICMP errors as the downtime began, and then the Pi went absolutely mad with ICMP requests - none of which actually manifested in the wlan0 interface’s network traffic logs. Curious.
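For anyone wanting to look at the same kind of numbers without a dashboard: on Linux, the kernel exposes ICMP counters in /proc/net/snmp as a header line and a value line, both prefixed with "Icmp:". Below is a minimal Python sketch that parses that layout. It assumes the graph above is ultimately fed by counters like these, which is a guess; the function name `parse_snmp_counters` and the sample numbers are made up for illustration.

```python
def parse_snmp_counters(text, protocol="Icmp"):
    """Return {counter_name: value} for one protocol from /proc/net/snmp-style text."""
    prefix = protocol + ":"
    # Each protocol has two lines: the counter names, then the counter values.
    rows = [line.split()[1:] for line in text.splitlines() if line.startswith(prefix)]
    if len(rows) < 2:
        raise ValueError(f"no {protocol} counters found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


# Made-up sample in the /proc/net/snmp layout (the real file has many more columns).
SAMPLE = """\
Ip: Forwarding DefaultTTL
Ip: 2 64
Icmp: InMsgs InErrors OutMsgs OutErrors OutEchos
Icmp: 120 37 5400 0 5300
"""

icmp = parse_snmp_counters(SAMPLE)
print(icmp["InErrors"], icmp["OutEchos"])
```

On the Pi itself, the same function could be fed the contents of /proc/net/snmp; taking two readings a minute apart and subtracting them gives a rate comparable to what a graph would show.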
Based on all of this, my current working hypothesis is pretty simply that the wlan0 device driver on the Pi crashed, and failed to recover. I’m unsure whether that was due to the ICMP errors, or whether they were a consequence of the crash - let me know if you have better knowledge of this!
The way the Pi is connected to the network is something I might change in future - such as switching to ethernet, rather than WiFi - but in the meantime, it’s not up for debate. So, if we can’t specifically guarantee this issue won’t happen again, let’s look at mitigating it.
Strictly speaking, it already is - if my hypothesis that the kernel is still running correctly is, itself, correct, this issue would have resolved itself on Sunday, with the next scheduled automatic restart… But I really want to listen to my music before Sunday, and while I have local backups, that would mean:
So… Relying on that would suck - especially if the Pi ever got into this state on a Monday. Surely there’s a better way?
Weirdly, the solution does exist and is a feature of basically every single Pi in existence? You’d think the Raspberry Pi Foundation would have documented it in their, uh, documentation - but no.
Reading through this post by Diode, we can see both how to set up the solution, and that I’m not the only one to run into this issue. That’s reassuring! Strictly speaking, the solution they provide is outdated[1], but I couldn’t see how to use the newer solution to also monitor the wlan0 interface, so… It is what it is! The watchdog would be useless for this situation if it weren’t watching the networking interface, since it appeared the kernel was still working perfectly fine.
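To make that last point concrete, here is a toy Python simulation of the general idea: a hardware watchdog resets the machine unless something keeps "kicking" it before a timeout, and the daemon only kicks while its health checks pass. `SimulatedWatchdog`, `run_daemon`, and the one-tick-per-second clock are all invented for illustration; this is not the real watchdog daemon's code, and it never touches /dev/watchdog.

```python
class SimulatedWatchdog:
    """Toy stand-in for a hardware timer; time is passed in, nothing real resets."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_kick = 0

    def kick(self, now):
        self.last_kick = now

    def expired(self, now):
        return now - self.last_kick >= self.timeout_s


def run_daemon(checks_by_second, timeout_s=15):
    """Return the second at which the simulated reset fires, or None if it never does.

    checks_by_second[t] is True when every health check passed at second t.
    """
    dog = SimulatedWatchdog(timeout_s)
    for now, healthy in enumerate(checks_by_second):
        if dog.expired(now):
            return now  # nobody kicked in time: the hardware would reboot here
        if healthy:
            dog.kick(now)
    return None


# WiFi check starts failing at second 5; the kernel itself stays alive throughout.
wifi_dies_at_5 = [True] * 5 + [False] * 30
print(run_daemon(wifi_dies_at_5))
# A daemon that only checked "is the kernel alive?" would see all-True and never reset.
print(run_daemon([True] * 35))
```

The second call is the situation described above: if the only thing being checked still looks healthy, the timer keeps getting kicked and the Pi stays stuck offline.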
So, I set up watchdog, which is pretty simple:
1. Run sudo wdctl. /dev/watchdog0 should be described to you. Good!
2. Add dtparam=watchdog=on to your /boot/config.txt file to enable it, and reboot.
3. Install watchdog: sudo apt update && sudo apt install watchdog
4. Add the following to your /etc/watchdog.conf file:

    watchdog-device = /dev/watchdog
    watchdog-timeout = 15
    # optional; leave it out if you don't care whether the Pi is overloaded. DO NOT SET IT TOO LOW (a boot sees my load average jump to ~7 for a brief time, for example...)
    max-load-1 = 24
    # checks that this interface receives data frequently, as a good network interface should.
    interface = wlan0

5. Enable and start the service: sudo systemctl enable watchdog; sudo systemctl start watchdog
6. Check on it: sudo systemctl status watchdog
7. If you see alive=[none] rather than alive=/dev/watchdog, and you don’t see a hardware watchdog identity (“Broadcom BCM2835 Watchdog Timer”), then it’s not working correctly; check your /etc/watchdog.conf and then stop and start the Watchdog service again.

With that in place, the Pi should restart itself if wlan0 doesn’t receive anything for a bit, the kernel hangs, or the 1m load average exceeds 24.
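As a rough illustration of what a check like interface = wlan0 boils down to, here is a small Python sketch that compares an interface's received-bytes counter between two snapshots of /proc/net/dev-style text. This is an assumption about the general approach rather than the watchdog daemon's actual implementation; `rx_bytes`, `received_traffic`, and the sample snapshots are hypothetical.

```python
def rx_bytes(proc_net_dev_text, interface):
    """Return the received-bytes counter for `interface`, or None if it is absent."""
    for line in proc_net_dev_text.splitlines()[2:]:  # the first two lines are headers
        name, _, counters = line.partition(":")
        if name.strip() == interface:
            return int(counters.split()[0])  # first column after the name is RX bytes
    return None


def received_traffic(before_text, after_text, interface="wlan0"):
    """True if the interface exists in both snapshots and its RX counter moved."""
    before = rx_bytes(before_text, interface)
    after = rx_bytes(after_text, interface)
    if before is None or after is None:
        return False
    return after != before


# Made-up snapshots in the /proc/net/dev layout (header text shortened).
HEADER = "Inter-|   Receive   |  Transmit\n face |bytes packets |bytes packets\n"
BEFORE = HEADER + "    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0\n wlan0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
AFTER_OK = HEADER + "    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0\n wlan0: 1600 14 0 0 0 0 0 0 2400 23 0 0 0 0 0 0\n"
AFTER_STUCK = BEFORE  # counters did not move at all between the two readings

print(received_traffic(BEFORE, AFTER_OK))
print(received_traffic(BEFORE, AFTER_STUCK))
```

A stuck counter on an otherwise running system matches the state described in this post: the kernel is fine, but nothing is arriving on wlan0.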