Dear colleagues,
From Saturday 22 February at 08:24 (CET), any newly created, modified, or deleted ROAs (176 in total) could not be added to our publication server due to a disk problem. From that moment on, all the data was stored on the database, but the publication did not happen. The disk did not report any problems and, therefore, no engineer was alerted of this incident.
Due to the disk problem, starting from Sunday 23 February at 09:10 (CET), our CRL expired and our repository could not be properly updated. This was reported to us on Monday 24 February at 11:44 (CET). Immediately, our engineers fixed the disk problem, however, since the CRL expired, all underlying objects also expired. Depending on the Relying Party software an operator used, this abnormal behaviour appeared differently.
Initially, our engineers tried to do a full re-population of the RPKI repository, but unfortunately, this did not update the CRL in the validation tree. At 15:03 (CET), we performed a full CA key-roll, which was completed at 21:02 (CET) and resolved the problem. At 19:58 (CET), all objects in the backlog were published.
We apologise for any inconvenience this may have caused and we are taking all the necessary steps to ensure this incident does not appear again in the future.
Kind regards,
Nathalie Trenaman Routing Security Programme Manager RIPE NCC
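As a rough cross-check of the timeline in the announcement, the sketch below computes the two gaps it implies: how long new ROA changes went unpublished, and how long the repository carried an expired CRL. It assumes the year is 2020, that CET is a fixed UTC+1 offset, and that the 15:03, 19:58 and 21:02 times (given without a date) all fall on Monday 24 February. The variable and function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET treated as a fixed UTC+1 offset (no DST in February).
CET = timezone(timedelta(hours=1))


def at(day: int, hour: int, minute: int) -> datetime:
    """Build a timestamp in February 2020 (the year is an assumption)."""
    return datetime(2020, 2, day, hour, minute, tzinfo=CET)


# Timestamps taken from the announcement.
publication_stopped = at(22, 8, 24)   # ROA changes no longer published
crl_expired = at(23, 9, 10)           # CRL expired
# Assumption: the following two times are on Monday 24 February.
backlog_published = at(24, 19, 58)    # backlog of objects published
key_roll_completed = at(24, 21, 2)    # CA key-roll completed

provisioning_gap = backlog_published - publication_stopped
expired_crl_gap = key_roll_completed - crl_expired


def as_days(delta: timedelta) -> float:
    """Express a duration as a fractional number of days."""
    return delta.total_seconds() / 86400


print(f"unpublished ROA changes: {provisioning_gap} ({as_days(provisioning_gap):.2f} days)")
print(f"expired CRL:             {expired_crl_gap} ({as_days(expired_crl_gap):.2f} days)")
```

Under those assumptions the publication gap is about two and a half days, spread over three calendar days, and the expired-CRL window is about a day and a half.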
Hello!
Just a summary: RPKI did not work for 3 days. Nobody cares ;)
On 25.02.20 16:12, Nathalie Trenaman wrote:
Dear colleagues,
From Saturday 22 February at 08:24 (CET), any newly created, modified, or deleted ROAs (176 in total) could not be added to our publication server due to a disk problem. From that moment on, all the data was stored on the database, but the publication did not happen. The disk did not report any problems and, therefore, no engineer was alerted of this incident.
Due to the disk problem, starting from Sunday 23 February at 09:10 (CET), our CRL expired and our repository could not be properly updated. This was reported to us on Monday 24 February at 11:44 (CET). Immediately, our engineers fixed the disk problem, however, since the CRL expired, all underlying objects also expired. Depending on the Relying Party software an operator used, this abnormal behaviour appeared differently.
Initially, our engineers tried to do a full re-population of the RPKI repository, but unfortunately, this did not update the CRL in the validation tree. At 15:03 (CET), we performed a full CA key-roll, which was completed at 21:02 (CET) and resolved the problem. At 19:58 (CET), all objects in the backlog were published.
We apologise for any inconvenience this may have caused and we are taking all the necessary steps to ensure this incident does not appear again in the future.
Kind regards,
Nathalie Trenaman Routing Security Programme Manager RIPE NCC
I care, for one!
Furthermore, I think it's very refreshing to have outages like this called out for what they are and full transparency about the cause and fix communicated. It's most helpful.
Fair play Nathalie and team!
- Mick (AS2110)
On Tue, Feb 25, 2020 at 07:57:41PM +0200, Max Tulyev wrote:
Hello!
Just a summary: RPKI did not work for 3 days. Nobody cares ;)
-- - MickoD <mick@mickod.ie>
I think it's very refreshing to have outages like this called out for what they are and full transparency about the cause and fix communicated. It's most helpful.
detailed post morta are very useful, except to those who never make mistakes :)
thanks nathalie.
randy
To be clear, I mean nobody really uses this RPKI, so 3 days of downtime was not even noticed by anyone.
The details and the reaction were really very good!
On 25.02.20 20:35, Randy Bush wrote:
I think it's very refreshing to have outages like this called out for what they are and full transparency about the cause and fix communicated. It's most helpful.
detailed post morta are very useful, except to those who never make mistakes :)
thanks nathalie.
randy
Dear Max,
I do want to add some nuance.
On Tue, Feb 25, 2020 at 09:33:45PM +0200, Max Tulyev wrote:
To be clear, I mean nobody really uses this RPKI, so 3 days of downtime was not even noticed by anyone.
"Nobody uses RPKI" ... I don't think that statement holds true from any angle. Keep in mind that almost 33% of RIPE prefixes are covered by RPKI ROAs, and globally hundreds of autonomous systems (including the world's largest IP carriers and IXPs) use RPKI data in some shape or form to make better BGP best path selection decisions. RPKI is already here and widely deployed, now we have to deal! :-)
The root of the problem was perhaps there for 3 days, but the operational issue for relying parties was ~1.5 days because of how things are distributed, cached & expired. ROA provisioning was broken for 3 days.
Additionally, the problem was somewhat obfuscated because some widely used RPKI cache validator implementations didn't consider the broken repository broken, which kept appearances up. This may seem like a good thing, but I consider it problematic because it showcased some potential for security issues in RPKI validation implementations. This is now actively being discussed in the IETF and I expect that this discussion will result in positive changes in implementations.
I am happy the sky didn't fall on top of us, but that doesn't take away from the seriousness of the situation and our duty to learn as much as we can from this to improve our processes.
Kind regards,
Job
To be clear, I mean nobody really uses this RPKI, so 3 days of downtime was not even noticed by anyone.
nobody == {
    AT&T/AS7018
    Cloudflare/AS13335
    Cogent/AS174
    KPN/AS286
    PCCW/AS3491
    Tata/AS6453
    Telia/AS1299
    and many small folk such as three ASs i run
}
To be clear, I mean nobody really uses this RPKI, so 3 days of downtime was not even noticed by anyone.
nobody == {
but, to your point, the reason no one was damaged is that ROV was designed to fail soft. when the ncc failed to publish, the prefixes for which there should have been ROAs did not become invalid, they became not found. so folk dropping invalids did not drop them.
what could have happened, but would be quite hard to detect, is that someone could have mis-originated one of those prefixes and it would not have been blocked.
randy
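The fail-soft behaviour described above can be illustrated with a minimal sketch of route origin validation in the style of RFC 6811. This is not any validator's actual code: the `VRP` type, the `rov_state` and `accepted` helpers, and the documentation prefix and AS numbers (192.0.2.0/24, AS64496, AS64511) are all hypothetical and chosen only for illustration.

```python
from ipaddress import ip_network
from typing import NamedTuple, Sequence


class VRP(NamedTuple):
    """A validated ROA payload: prefix, maximum length, authorised origin AS."""
    prefix: str
    max_length: int
    asn: int


def rov_state(route_prefix: str, origin_asn: int, vrps: Sequence[VRP]) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        roa_net = ip_network(vrp.prefix)
        # A VRP covers the route if the route is equal to or inside its prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering VRP matches if the length and origin AS also agree.
        if route.prefixlen <= vrp.max_length and origin_asn == vrp.asn:
            return "valid"
    # Covered but never matched: invalid. No covering VRP at all: not-found.
    return "invalid" if covered else "not-found"


def accepted(state: str) -> bool:
    """A 'drop invalids' policy: only routes in the invalid state are rejected."""
    return state != "invalid"


# Normal operation: the ROA is visible to relying parties.
published = [VRP("192.0.2.0/24", 24, 64496)]
print(rov_state("192.0.2.0/24", 64496, published))  # valid
print(rov_state("192.0.2.0/24", 64511, published))  # invalid

# The ROA is no longer visible: both routes fall back to not-found.
missing: Sequence[VRP] = []
print(rov_state("192.0.2.0/24", 64496, missing))    # not-found
print(rov_state("192.0.2.0/24", 64511, missing))    # not-found
```

With the ROA missing, the legitimate route is still accepted under a drop-invalids policy, but so is the mis-originated one, which is the hard-to-detect exposure mentioned above.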
On Tue, Feb 25, 2020, at 21:01, Randy Bush wrote:
To be clear, I mean nobody really uses this RPKI, so 3 days of downtime was not even noticed by anyone.
nobody == {
    AT&T/AS7018
    Cloudflare/AS13335
    Cogent/AS174
    KPN/AS286
    PCCW/AS3491
    Tata/AS6453
    Telia/AS1299
    and many small folk such as three ASs i run
}
And in 30 days ... NTT/AS2914 too! :-)
https://www.us.ntt.net/support/policy/rr.cfm#RPKI
Kind regards,
Job
Hi,
Was there any real issue with these carriers? Any unreachable networks or complaints?
On 25.02.20 22:01, Randy Bush wrote:
To be clear, I mean nobody really uses this RPKI, so 3 days of downtime was not even noticed by anyone.
nobody == {
    AT&T/AS7018
    Cloudflare/AS13335
    Cogent/AS174
    KPN/AS286
    PCCW/AS3491
    Tata/AS6453
    Telia/AS1299
    and many small folk such as three ASs i run
}
+1!
Thanks Nathalie and the team.
Carlos
On Tue, 25 Feb 2020, Randy Bush wrote:
I think it's very refreshing to have outages like this called out for what they are and full transparency about the cause and fix communicated. It's most helpful.
detailed post morta are very useful, except to those who never make mistakes :)
thanks nathalie.
randy
participants (6)
- Carlos Friaças
- Job Snijders
- Max Tulyev
- Mick O'Donovan
- Nathalie Trenaman
- Randy Bush