RPKI Outage Post-Mortem

25 Feb 2020

Dear colleagues,

From Saturday 22 February at 08:24 (CET), newly created, modified, or deleted ROAs (176 in total) could not be added to our publication server due to a disk problem. From that moment on, all the data was stored in the database, but it was not published. The disk did not report any problems and, therefore, no engineer was alerted to this incident.

As a result of the disk problem, our CRL expired on Sunday 23 February at 09:10 (CET), and from then on our repository could not be properly updated. This was reported to us on Monday 24 February at 11:44 (CET). Our engineers immediately fixed the disk problem. However, since the CRL had expired, all underlying objects had also expired. How this abnormal behaviour appeared depended on the Relying Party software an operator used.

Initially, our engineers attempted a full re-population of the RPKI repository, but unfortunately this did not update the CRL in the validation tree. At 15:03 (CET), we performed a full CA key-roll, which was completed at 21:02 (CET) and resolved the problem. At 19:58 (CET), all objects in the backlog were published.

We apologise for any inconvenience this may have caused. We are taking all the necessary steps to ensure this incident does not happen again.

Kind regards,

Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC
