
Raspberry Pi's not-very-documented Watchdog

24 November 2023 · 10 minute read

And a postmortem on my Mastodon instance's outage on 2023-11-23.

It’s been a while! It’s also been a hell of a week for my Mastodon instance, which has had two outages in a span of days:

  1. An expected, planned outage on Tuesday 21st, from 8am until 11:25am UTC. This isn’t particularly interesting - it was planned.
  2. Much more interesting is the outage from 3am until ~2pm on Thursday 23rd, which was not expected. Let’s talk about that in the style of a postmortem!

Is a post-mortem even needed?

The short answer? No.

This isn’t a major server that’s powering mission-critical systems that millions of people are using - it’s a tiny server used pretty much exclusively by me, hosting a Mastodon instance (for me and - currently - one other family member), my private Jellyfin instance so I can have the convenience of music and movie streaming but with my own music (I own a good chunk of music that isn’t available on Spotify!), and a couple of Discord bots exclusive to my friends’ servers. All in all, not really that important for anyone who isn’t me or directly affiliated with me in some way - but it does matter to me, and at this point my needs for it are similar to those major services have of their own servers:

  1. I need it to be somewhat reliable, so it isn’t always offline; and
  2. I need it to recover on its own if something goes wrong (or at least to be accessible so that I can fix it even while I’m away at university).

(I just don’t need the “scale to infinitely-many users without slowing down” part. Strictly speaking.)

So… It’s at least a good enough excuse for me to write a blog post, and I definitely need more of those. One post a year is enough, right? :P

So, a bit of background

The server in question is a Raspberry Pi 4B 8GB model, which honestly manages requirement 1 fairly well; it used to be a 4GB model, but since I run Mastodon with ElasticSearch enabled, the RAM usage was a little high for my tastes (and I don’t like not having search) - 8GB is plenty for a ~single-user instance, though! With a weekly restart just to ensure that the various background jobs (backup scripts, etc) don’t leave anything behind, this little server has been running with basically no issues for over a year now!
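(For anyone wanting the same weekly restart: a single cron entry for root is one way to do it - the day and time here are just an example, and a systemd timer works just as well:)

    # In root's crontab (sudo crontab -e): reboot every Sunday at 04:00.
    # Adjust the schedule so it doesn't collide with backup scripts and the like.
    0 4 * * 0 /sbin/shutdown -r now "scheduled weekly restart"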

Honestly, Raspberry Pis continue to impress me and make me feel giddy that I can just run my own stuff; it’s such a nice feeling, and I’m not even using them optimally yet (the 4GB Pi could really be used to run the less demanding workloads - like the bots, Grafana, etc…). Yet, somehow, something went wrong on Thursday - why?

Well… technically, the 8GB Pi has run for ~6 months without issues; the 4GB Pi, in the same amount of time, had two instances wherein it “locked up” and became completely inaccessible - over SSH, to ping requests, etc - and thus required a physical restart. It wasn’t really an issue at the time, as it was basically right next to me, but it’s still less than ideal. Typically, this happened while I was actively working on it and maybe pushing it a little hard, so I had initially chalked this up to a RAM availability issue - the 8GB Pi not exhibiting this behaviour only served to reinforce this belief. Until yesterday, when it did the exact same thing.

Only, this time, I’m not right there - I’m at university! Hence the roughly 11-hour downtime, as I had to wait for a family member back home to restart it for me - not ideal… So, let’s do some post-mortem analysis of what went wrong, and how we might either fix it or mitigate it.

Some graphs

If we look at the CPU usage graph for the period in question, we see some interesting behaviour:

A CPU usage graph from Grafana, showing very little activity before 3am, then a single core being pegged at 100% until the forced restart around 2pm, at which point levels reset to the pre-3am values.

This is my at-a-glance graph, so it doesn’t drill down into what the CPU is doing if it’s not something specifically for Mastodon, my Discord Bots, or Jellyfin, but we can see some things nonetheless: usage is low and unremarkable before 3am, and then a single core gets pegged at 100% and stays there until the forced restart at around 2pm.

Interesting. Now, I did inspect RAM usage as well, and while that isn’t shown here, the simple summary is: RAM peaked at about 50-60% usage, but was stable before and throughout the issue. Swap was completely unused during the uptime from Tuesday to Thursday.
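(Those numbers come from Grafana’s history; for a quick live check when the Pi is actually reachable, the standard tools are enough:)

    # Current RAM and swap usage, human-readable
    free -h
    # Load averages and how long the Pi has been up
    uptime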

We can conclude from this that the Pi was still performing operations, and was trying to do something (we just don’t know what) very aggressively in the background. That still leaves us with no real clue as to the cause, as it looks like there shouldn’t have been an issue at all. So we’ve ruled out the Pi being overloaded and RAM exhaustion; what about the network itself?

I could VPN in during the period of downtime, so I can safely say the network the Pi is attached to was working correctly - but I can also verify that the Pi was not connected to the network at the time, based on the router’s clients list.

A Network Traffic graph from Grafana, showing tiny but spiky network activity on wlan0 before the issue, then no visible activity during the issue, followed by a huge spike to 80 Mbps on restart.

Looking at the Network Traffic graph, it’s a bit hard to see anything, since during normal operation it looks like there’s hardly any network activity anyway; the 80 Mbps spike that occurs on restart just dwarfs everything else. Let’s zoom in and look at just the period of unavailability, plus a tiny bit before that:

The same graph, but zoomed in before the restart spike, showing quite a bit of network activity and then absolutely nothing from 2:45am until the restart.

Ahah! We can clearly see now that the network usage was pretty active during the period before the downtime, but absolutely nonexistent the moment the downtime began. We can now safely say the issue was purely a networking one, localised to the Pi itself. I’m not really sure how to interpret the other graph that stood out to me, but I’ll show it nonetheless, as I feel it probably explains what the pegged CPU core was doing:

Two graphs side-by-side, the first showing ICMP in/out which is regular before the issue and then going ham during the downtime, and the second showing a spike in ICMP errors localised to the start of the downtime.

Insofar as I can tell, we had a spike in ICMP errors as the downtime began, and then the Pi went absolutely mad with ICMP requests - none of which actually manifested in the wlan0 interface’s network traffic logs. Curious.

Based on all of this, my current working hypothesis is simply that the wlan0 device driver on the Pi crashed, and failed to recover. I’m unsure whether that was due to the ICMP errors, or whether they were a consequence of the crash - let me know if you have better knowledge of this!
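If it happens again, the obvious places to look after the reboot are the previous boot’s kernel log (the Pi 4’s onboard WiFi uses the brcmfmac driver) and the wlan0 counters - a rough sketch, assuming journald is keeping persistent logs (otherwise /var/log/kern.log, if rsyslog is installed, covers similar ground):

    # Kernel messages from the previous boot, filtered for the WiFi driver;
    # needs persistent journald storage (Storage=persistent in /etc/systemd/journald.conf).
    sudo journalctl -k -b -1 | grep -iE "brcmfmac|wlan0"

    # Link state and RX/TX counters - a wedged interface stops receiving entirely.
    ip -s link show wlan0
    cat /sys/class/net/wlan0/operstate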

Okay, so how do we avoid having to call my dad to restart the Pi in future?

The way the Pi is connected to the network is something I might change in future - such as switching to ethernet, rather than WiFi - but in the meantime, it’s not up for debate. So, if we can’t specifically guarantee this issue won’t happen again, let’s look at mitigating it.

Strictly speaking, it already is mitigated - if my hypothesis that the kernel was still running correctly is, itself, correct, this issue would have resolved itself on Sunday, with the next scheduled automatic restart… But I really want to listen to my music before Sunday, and while I have local backups, they’re not much of a substitute.

So… Relying on the weekly restart would suck - especially if the Pi ever got into this state on a Monday. Surely there’s a better way?

Enter the Watchdog

Weirdly, the solution does exist and is a feature of basically every single Pi in existence? You’d think the Raspberry Pi Foundation would have documented it in their, uh, documentation - but no.

Reading through this post by Diode, we can see both how to set up the solution, and that I’m not the only one to run into this issue. That’s reassuring! Strictly speaking, the solution they provide is outdated1, but I couldn’t see how to use the newer solution to also monitor the wlan0 interface, so… It is what it is! The watchdog would be useless for this situation if it weren’t watching the network interface, since the kernel itself appeared to still be working perfectly fine.
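(For the curious - and assuming the newer route being referred to is systemd’s built-in hardware watchdog support - that setup is just a line in /etc/systemd/system.conf. The catch is that it only pets the hardware watchdog while systemd is alive; there’s no notion of “is wlan0 still receiving traffic?”, which is the bit I actually need:)

    # /etc/systemd/system.conf - systemd pets the SoC's hardware watchdog itself;
    # if the kernel/systemd ever hangs, the Pi gets rebooted after ~15 seconds.
    RuntimeWatchdogSec=15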

So, I set up watchdog, which is pretty simple:

  1. Check that the hardware watchdog is enabled with sudo wdctl.
  2. Install watchdog: sudo apt update && sudo apt install watchdog
  3. Configure the watchdog service by adding (as root) the following to your /etc/watchdog.conf file:
    watchdog-device = /dev/watchdog
    watchdog-timeout = 15
    max-load-1 = 24 # optional; skip it if you don't care whether the Pi gets overloaded. DO NOT SET IT TOO LOW (a boot sees my load average jump to ~7 for a brief time, for example...)
    interface = wlan0 # checks that this interface receives data frequently, as a healthy network interface should.
    
  4. Enable and start it: sudo systemctl enable watchdog; sudo systemctl start watchdog
  5. Check its status with sudo systemctl status watchdog.
  6. Once everything looks alright, you’re done!
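If you want a bit more confidence than a green systemctl status, it’s worth tailing the daemon’s logs for a while - and, if you can stomach an intentional reboot, testing the interface check for real. As far as I understand the interface test, taking wlan0 down looks (to the watchdog) much like Thursday’s failure did, so it should earn you a reboot within the timeout - obviously don’t do this over an SSH session that depends on wlan0:

    # Watch the watchdog daemon's own log output
    sudo journalctl -u watchdog -f

    # Deliberately break the interface the watchdog is monitoring (disruptive!);
    # the Pi should reboot itself once the interface check starts failing.
    sudo ip link set wlan0 down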

Footnotes