this weekend the ganeti cluster had a partial outage: nodes were reachable, but networking was broken on all instances.
this is, presumably, because of the Debian buster point release that occurred on saturday (!). last time this happened, weasel identified openvswitch as the culprit, and hiro deployed a fix that should have made the cluster survive such situations. but either something else came up or the fix didn't work, because the problem happened again this weekend.
i fixed it by rebooting all nodes forcibly (without migrating first).
There was a ~8h ganeti outage until about now. It seems the buster point release broke things in our automated upgrade procedure. I didn't have time to diagnose the issue (I was on my way out) and figured it was more urgent to restore the service.
I rebooted all gnt-fsn nodes by hand (without migrating). Some instances returned with a state of "ERROR_down", so I manually started them (with gnt-instance start). Everything now seems to be back up.
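For future reference, the recovery was roughly this sequence (a sketch, run from the cluster master; the instance name is a placeholder and the forced reboots themselves were done out-of-band):

gnt-node list                                        # confirm the nodes are back after the forced reboots
gnt-instance list -o name,status | grep ERROR_down   # find instances that did not come back on their own
gnt-instance start <instance>                        # start each instance that stayed down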
I haven't looked at Nagios in detail, but everything is mostly "yellow" now, so I'll assume we're good.
It would be great if someone could look at the logs and see what happened. I suspect the openvswitch fix didn't work, or maybe there are other services we need to block from needrestart's automation (or maybe even from unattended-upgrades).
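If we go down that road, a minimal sketch of what such blocks could look like (file names and patterns are assumptions, not what is currently deployed):

# /etc/needrestart/conf.d/openvswitch.conf -- tell needrestart to never restart openvswitch on its own
$nrconf{override_rc}{qr(^openvswitch)} = 0;

// /etc/apt/apt.conf.d/50unattended-upgrades -- keep unattended-upgrades away from the packages entirely
Unattended-Upgrade::Package-Blacklist {
        "openvswitch-switch";
        "openvswitch-common";
};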
On the 10th of May there was an unattended upgrade: the kernel was updated and the system restarted.
Openvswitch was also updated and restarted, so maybe the blacklist didn't work.
According to the unattended-upgrades logs, the service restarts were handled by needrestart, and openvswitch was updated together with the following group of packages:
2020-05-10 06:12:53,754 INFO Packages that will be upgraded: base-files distro-info-data iputils-arping iputils-ping iputils-tracepath libbrlapi0.6 libfuse2 libpam-systemd libsystemd0 libudev1 linux-compiler-gcc-8-x86 linux-headers-amd64 linux-image-amd64 linux-kbuild-4.19 openvswitch-common openvswitch-switch postfix postfix-cdb rake rubygems-integration systemd systemd-sysv tzdata udev
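For reference, that line comes straight from the unattended-upgrades log on the node; something like this pulls it back out (the log path is the Debian default, assumed to apply here):

grep 'Packages that will be upgraded' /var/log/unattended-upgrades/unattended-upgrades.log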
Checking the openvswitch status, it has not been restarted since the 10th of May:
Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; vendor preset: enabled)
Active: active (exited) since Sun 2020-05-10 14:05:11 UTC; 2 weeks 3 days ago
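That status (and the log excerpts below) can be checked with something along these lines (standard commands; the exact invocations are assumptions):

systemctl status openvswitch-switch.service
journalctl -u openvswitch-switch.service --since 2020-05-10
less /var/log/openvswitch/ovs-vswitchd.log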
And from the log on that day I actually see it died twice:
2020-05-10T06:13:16.534Z|00003|vlog(monitor)|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:16.534Z|00004|daemon_unix(monitor)|INFO|pid 3211 died, exit status 0, exiting
2020-05-10T06:13:16.787Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:16.788Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T06:13:16.788Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T06:13:16.788Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T06:13:16.788Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T06:13:16.791Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
2020-05-10T06:13:17.332Z|00002|daemon_unix(monitor)|INFO|pid 29781 died, exit status 0, exiting
2020-05-10T06:13:17.621Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:17.623Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T06:13:17.623Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T06:13:17.623Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T06:13:17.623Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T06:13:17.630Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
...
2020-05-10T14:02:23.078Z|00036|bridge|INFO|bridge br0: using datapath ID 00007eb83553f345
2020-05-10T14:02:23.398Z|00037|bridge|INFO|bridge br0: deleted interface br0 on port 65534
2020-05-10T14:02:23.578Z|00002|daemon_unix(monitor)|INFO|pid 29951 died, exit status 0, exiting
2020-05-10T14:05:05.241Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T14:05:05.247Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T14:05:05.247Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T14:05:05.247Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T14:05:05.247Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T14:05:05.250Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
2020-05-10T14:05:09.384Z|00007|ofproto_dpif|INFO|system@ovs-system: Datapath supports recirculation
2020-05-10T14:05:09.384Z|00008|ofproto_dpif|INFO|system@ovs-system: VLAN header stack length probed as 2
2020-05-10T14:05:09.384Z|00009|ofproto_dpif|INFO|system@ovs-system: MPLS label stack length probed as 1
2020-05-10T14:05:09.384Z|00010|ofproto_dpif|INFO|system@ovs-system: Datapath supports truncate action
2020-05-10T14:05:09.384Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath supports unique flow ids