Dear colleagues,

This afternoon, between 13:00 and 14:10 UTC, rrdp.ripe.net was unavailable. During this period, a significant fraction of the relying party instances that attempted to fall back to rsync://rpki.ripe.net could not retrieve objects due to capacity constraints.

At approximately 13:00 UTC, the RPKI team attempted to move the DNS records for rrdp.ripe.net out of the ripe.net zone file and into a separate include file. We made this change to prepare for automated failover between the CDNs.

This resulted in an outage of the RRDP service, caused by an issue in the ripe.net zone file. The file contains several $ORIGIN directives, but the origin is not reset when a block ends. As a consequence, relative names later in the zone file silently get the wrong origin applied to them, and this is easy to miss when the $ORIGIN directive appears much earlier in the file (a simplified illustration is included below my signature). To prevent such DNS issues in the future, all of these blocks will be moved out of the main zone file into separate include files, because an $ORIGIN directive inside an include file does not persist beyond the end of that file.

A second contributing factor was that, earlier today, our monitoring broke due to a change in the Prometheus configuration file. This reduced our visibility into the outage and meant that no alerts were sent until monitoring was restored.

A third contributing factor was that the secondary monitoring system watching the RPKI Prometheus infrastructure did not alert, because the web interface kept returning HTTP 200 despite the broken configuration.

A final factor is that the capacity of rsync://rpki.ripe.net is limited. Only some of the relying party instances that attempted to fall back could update over rsync; the remaining instances could not retrieve new objects.

Full timeline:

* 07:04 UTC: broken alert configuration committed
* 08:46 UTC: broken alert configuration applied, breaking monitoring
* 13:02 UTC: DNS change (effectively removing rrdp.ripe.net from the zone) applied
* 13:44 UTC: alert configuration reverted
* 14:10 UTC: DNS configuration recovered
* 14:25 UTC: rsync connection rate back at baseline level

While rrdp.ripe.net was unavailable, many relying party instances fell back to rsync. Based on the partial data available, we observed a median rsync connection duration of 300 seconds and a 99th percentile of 1660 seconds, with ~55% of rsync connections disconnecting with an error code. This preliminary data points to an underlying I/O limitation in our NFS setup, which we will investigate further. During the outage, our rsync servers returned 5043 “max connection reached” errors to 2307 unique IP addresses.

We have applied one mitigation so far: linting of the alert configuration. We are also working on improving our external monitoring so that it does not depend on our on-premise infrastructure.

Kind regards,
Ties
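
P.S. For readers who do not work with zone files daily, the fragment below is a simplified, made-up illustration of the $ORIGIN pitfall described above; the names and records are invented for this example and are not the actual contents of the ripe.net zone. A $ORIGIN directive keeps applying to every later relative name until it is changed again, so a record added after a "block" silently inherits that block's origin:

  $TTL 3600
  $ORIGIN ripe.net.
  www      IN CNAME www.example.net.   ; www.ripe.net, as intended

  $ORIGIN block.ripe.net.              ; start of a block for one service
  service  IN A     192.0.2.1          ; service.block.ripe.net, as intended
  ; the block ends here, but the origin is never reset to ripe.net.

  rrdp     IN CNAME cdn.example.com.   ; becomes rrdp.block.ripe.net, not rrdp.ripe.net

Moving each block into its own file and pulling it in with $INCLUDE avoids this, because an origin changed inside an included file never carries over into the parent zone file (RFC 1035, section 5.1).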