this weekend the ganeti cluster had a partial outage: nodes were reachable, but networking was broken on all instances.
this is, presumably, because of the Debian buster point release that occurred on saturday (!). last time this happened, weasel identified openvswitch as the culprit, and hiro deployed a fix that should have made the cluster survive such situations. but either something else came up or the fix didn't work, because the problem happened again this weekend.
i fixed it by rebooting all nodes forcibly (without migrating first).
There was a ~8h ganeti outage until about now. It seems the buster point release broke things in our automated upgrade procedure. I didn't have time to diagnose the issue (I was on my way out) and figured it was more urgent to restore the service.
I rebooted all gnt-fsn nodes by hand (without migrating). Some instances returned with a state of "ERROR_down", so I manually started them (with gnt-instance start). Everything now seems to be back up.
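For future reference, the recovery was roughly this sequence (a sketch, run from the cluster master; the instance name is a placeholder and the forced reboots themselves were done out-of-band):

gnt-node list                                        # confirm the nodes are back after the forced reboots
gnt-instance list -o name,status | grep ERROR_down   # find instances that did not come back on their own
gnt-instance start <instance>                        # start each instance that stayed down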
I haven't looked at Nagios in detail, but everything is mostly "yellow" now, so I'll assume we're good.
It would be great if someone could look at the logs and see what happened. I suspect the openvswitch fix didn't work, or maybe there are other services we need to block from needrestart's automation (or maybe even from unattended-upgrades).
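If we go down that road, a minimal sketch of what such blocks could look like (file names and patterns are assumptions, not what is currently deployed):

# /etc/needrestart/conf.d/openvswitch.conf -- tell needrestart to never restart openvswitch on its own
$nrconf{override_rc}{qr(^openvswitch)} = 0;

// /etc/apt/apt.conf.d/50unattended-upgrades -- keep unattended-upgrades away from the packages entirely
Unattended-Upgrade::Package-Blacklist {
        "openvswitch-switch";
        "openvswitch-common";
};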
On the 10th of May there was an unattended upgrade: the kernel was updated and the system restarted.
Openvswitch was also updated and restarted, so maybe the blacklist didn't work.
According to the unattended-upgrades logs, the service restarts were handled by needrestart, and openvswitch was updated together with the following group of packages:
2020-05-10 06:12:53,754 INFO Packages that will be upgraded: base-files distro-info-data iputils-arping iputils-ping iputils-tracepath libbrlapi0.6 libfuse2 libpam-systemd libsystemd0 libudev1 linux-compiler-gcc-8-x86 linux-headers-amd64 linux-image-amd64 linux-kbuild-4.19 openvswitch-common openvswitch-switch postfix postfix-cdb rake rubygems-integration systemd systemd-sysv tzdata udev
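For reference, that line comes straight from the unattended-upgrades log on the node; something like this pulls it back out (the log path is the Debian default, assumed to apply here):

grep 'Packages that will be upgraded' /var/log/unattended-upgrades/unattended-upgrades.log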
Checking the openvswitch status, it has not been restarted since the 10th of May:
Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; vendor preset: enabled)
Active: active (exited) since Sun 2020-05-10 14:05:11 UTC; 2 weeks 3 days ago
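That status (and the log excerpts below) can be checked with something along these lines (standard commands; the exact invocations are assumptions):

systemctl status openvswitch-switch.service
journalctl -u openvswitch-switch.service --since 2020-05-10
less /var/log/openvswitch/ovs-vswitchd.log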
And from the log on that day I actually see it died twice:
2020-05-10T06:13:16.534Z|00003|vlog(monitor)|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:16.534Z|00004|daemon_unix(monitor)|INFO|pid 3211 died, exit status 0, exiting
2020-05-10T06:13:16.787Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:16.788Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T06:13:16.788Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T06:13:16.788Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T06:13:16.788Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T06:13:16.791Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
2020-05-10T06:13:17.332Z|00002|daemon_unix(monitor)|INFO|pid 29781 died, exit status 0, exiting
2020-05-10T06:13:17.621Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:17.623Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T06:13:17.623Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T06:13:17.623Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T06:13:17.623Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T06:13:17.630Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
...
2020-05-10T14:02:23.078Z|00036|bridge|INFO|bridge br0: using datapath ID 00007eb83553f345
2020-05-10T14:02:23.398Z|00037|bridge|INFO|bridge br0: deleted interface br0 on port 65534
2020-05-10T14:02:23.578Z|00002|daemon_unix(monitor)|INFO|pid 29951 died, exit status 0, exiting
2020-05-10T14:05:05.241Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T14:05:05.247Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T14:05:05.247Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T14:05:05.247Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T14:05:05.247Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T14:05:05.250Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
2020-05-10T14:05:09.384Z|00007|ofproto_dpif|INFO|system@ovs-system: Datapath supports recirculation
2020-05-10T14:05:09.384Z|00008|ofproto_dpif|INFO|system@ovs-system: VLAN header stack length probed as 2
2020-05-10T14:05:09.384Z|00009|ofproto_dpif|INFO|system@ovs-system: MPLS label stack length probed as 1
2020-05-10T14:05:09.384Z|00010|ofproto_dpif|INFO|system@ovs-system: Datapath supports truncate action
2020-05-10T14:05:09.384Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath supports unique flow ids