Subject: RPKI ROA Deletion: Post-mortem
Dear colleagues,
After our accidental deletion of RPKI ROAs on Wednesday evening, we have a post-mortem report to share with the working group.
Following an update to our internal registry software on 1 April at 18:16 (UTC+2), 2,669 ROAs were deleted from Provider Independent (PI) address assignments.
This was caused by our registry software classifying these assignments as not-certifiable. From our logs, we can confirm that these blocks never left the RIPE Registry, and within 15 minutes the registry was back to normal. However, by that time the ROAs had already been deleted and could not be restored without intervention from our engineers.
Affected users with alerts set up in the LIR Portal received a notification email on 31 March at 22:23, stating that their ROAs were missing. Some of these users emailed our Customer Service Department to ask why their ROAs had been deleted. As this was outside of office hours, our staff did not discover the issue until the next morning.
Our engineers were able to reinstate all of the missing ROAs by 13:15 on 2 April. We then informed our membership via ncc-announce and notified the affected users directly.
We have since implemented stricter checks on both our registry and RPKI software.
We are also investigating whether any of these PI assignments suffered from route-leaks or hijacks after their ROAs were deleted.
We apologise for any inconvenience this may have caused and we are taking all necessary steps to ensure this does not happen again in the future.
Kind regards,
Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC
thank you for the post mortem. appreciate the lessons. of course, i never make mistakes :)
randy
Thank you for the clarification and the post-mortem report.
Agreed, thanks for this Nathalie.
Given the operational importance of RPKI now and each RIR's role therein, can you say anything about what plans RIPE has to provide 24x7 monitoring / support for these services (i.e., beyond your current "office hours")?
I also look forward to [your] analysis of the Rostelecom incident that occurred in the same timeframe.
Thanks,
-danny
Dear Danny, others,
On Fri, Apr 03, 2020 at 04:56:41PM -0400, Danny McPherson wrote:

> I also look forward to [your] analysis of the Rostelecom incident that occurred in the same timeframe.

I've taken a look at the incident. 2,666 VRPs disappeared around 2020-04-01T16:32Z. For the purpose of this analysis the list of affected VRPs is http://instituut.net/~job/deleted-vrps-ripe-2020-04-01-16-32.txt
Andree Toonk (BGPMon) was so kind as to compile a list of prefixes which were wrongly originated by Rostelecom during the incident at 2020-04-01T19:27Z: https://portal.bgpmon.net/data/12389_apr2020.txt
The above list is not the full list of prefixes affected by this leak. The leak appears to have included route announcements that 12389 received from some customers and some peers, in addition to 'bgp optimiser'-style more-specific hijacks. Full list is available here: https://map.internetintel.oracle.com/api/leak_prefixes/20764_12389_158576850... I'm leaving the 'merely leaked otherwise untouched' routes out of this analysis as those are outside the scope of Origin Validation: the fabricated routes in relation to missing RPKI VRPs are what matters for this analysis.
If we take the intersection of Andree's list with the list of missing VRPs, we have the IP addresses that were affected by both the RIPE NCC RPKI Deletion incident and the Rostelecom BGP incident.
The following 12 prefixes (4352 IP addresses):

| peer_count | start_time | alert_type | base_prefix | base_as | announced_prefix | src_AS | Affected_ASname | example_ASPath |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 49 | 2020-04-01 19:30:34 | more_spec_by_other | 91.195.240.0/23 | 47846 | 91.195.240.0/24 | 12389 | SEDO-AS, DE | 24751 20764 12389 |
| 12 | 2020-04-01 19:29:55 | more_spec_by_other | 62.122.168.0/21 | 50245 | 62.122.170.0/24 | 12389 | SERVEREL-AS, NL | 18356 38794 4651 4651 20764 12389 |
| 11 | 2020-04-01 19:30:34 | more_spec_by_other | 91.203.184.0/22 | 41064 | 91.203.187.0/24 | 12389 | SKYROCK, FR | 29430 13030 20764 12389 |
| 6 | 2020-04-01 19:32:12 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.164.0/23 | 12389 | SERVEREL-AS, NL | 49673 24811 20764 12389 |
| 6 | 2020-04-01 19:32:12 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.174.0/23 | 12389 | SERVEREL-AS, NL | 49515 197595 20764 12389 |
| 6 | 2020-04-01 19:32:12 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.178.0/23 | 12389 | SERVEREL-AS, NL | 49673 24811 20764 12389 |
| 6 | 2020-04-01 19:32:12 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.168.0/23 | 12389 | SERVEREL-AS, NL | 49673 24811 20764 12389 |
| 6 | 2020-04-01 19:32:12 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.180.0/23 | 12389 | SERVEREL-AS, NL | 43317 20764 12389 |
| 5 | 2020-04-01 19:33:04 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.161.0/24 | 12389 | SERVEREL-AS, NL | 49515 197595 20764 12389 |
| 5 | 2020-04-01 19:33:04 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.170.0/24 | 12389 | SERVEREL-AS, NL | 49673 24811 20764 12389 |
| 5 | 2020-04-01 19:33:04 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.187.0/24 | 12389 | SERVEREL-AS, NL | 1126 24785 20562 20764 12389 |
| 5 | 2020-04-01 19:33:04 | more_spec_by_other | 109.206.160.0/19 | 50245 | 109.206.166.0/24 | 12389 | SERVEREL-AS, NL | 51514 20562 20764 12389 |

If we look at the list of ASNs which were most impacted, the top ten seems mostly anchored to the US (thus under the ARIN TAL), and almost all of them seem heavyweights in the cloud / CDN space.
https://portal.bgpmon.net/data/12389_apr2020_affected_asns.txt
The incorrect routing information covering the above listed prefixes was observed by a limited number of BGPMon peers; for other affected routes the peer_count was around 170. The RPKI incident lasted a number of hours, but the Rostelecom routing incident lasted ten minutes or so. (source: https://map.internetintel.oracle.com/leaks#/id/20764_12389_1585768500)
If we assume the generation & propagation of these hijacks was the result of operator error, I imagine the change could've been reverted almost immediately but we'd still see a bit of sloshing for a few minutes through the routing system. Or perhaps the 'waves' we can see in Oracle's 3D rendering of the incident are the effects of Maximum Prefix limits kicking in and various timers firing off at different times.
Were these prefixes just unlucky because some BGP optimiser algorithm had chosen them for the purpose of traffic engineering? Was this the result of sophisticated planning? In any case, I can't judge the impact this routing incident had on the three above listed ASNs. I don't know what the victim IPs are used for.
We have to keep in mind that a large portion of RIPE NCC's RPKI repository, and of course the RPKI repositories of the other RIRs, were *not* affected. ISPs with 'invalid == reject' policies had a lot of RPKI data (~134,516 VRPs) available, and those VRPs did help limit the scope and reach of the hijacks. RPKI Invalid BGP announcements don't propagate as well as Not-Found announcements.
It appears the 'peer_count' for RPKI protected prefixes was significantly lower (~140) than for prefixes not covered by RPKI ROAs (~160). The 'peer_count' value can be considered a proxy metric for a hijack's reach and impact. The RPKI Invalids in this leak propagated through ASNs that we know have not yet deployed RPKI OV.
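The difference between Invalid and Not-Found that drives this peer_count gap can be illustrated with a minimal sketch of route origin validation in the style of RFC 6811. The `VRP` tuple, the `validate` function and the `max_length` value of 22 are illustrative assumptions, not data taken from the RPKI and not the API of any real validator. It shows how the same more-specific announcement goes from Invalid to Not-Found once its covering VRP has been deleted.

```python
from ipaddress import ip_network
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    prefix: str      # covering prefix from the ROA
    max_length: int  # longest prefix length the ROA authorises
    asn: int         # authorised origin AS


def validate(route: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Classify a BGP route as 'valid', 'invalid' or 'not-found'."""
    announced = ip_network(route)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        if announced.version != vrp_net.version:
            continue  # never compare IPv4 against IPv6
        if not announced.subnet_of(vrp_net):
            continue  # this VRP does not cover the route
        covered = True
        if announced.prefixlen <= vrp.max_length and origin_asn == vrp.asn:
            return "valid"
    # Covered by at least one VRP but matched by none: invalid.
    return "invalid" if covered else "not-found"


# Hypothetical VRP for one affected block. The real ROA's max_length is not
# stated in this thread; 22 is assumed here.
before_deletion = [VRP("91.203.184.0/22", 22, 41064)]
after_deletion = []

print(validate("91.203.187.0/24", 12389, before_deletion))  # invalid
print(validate("91.203.187.0/24", 12389, after_deletion))   # not-found
print(validate("91.203.184.0/22", 41064, before_deletion))  # valid
```

A network with an 'invalid == reject' policy would drop the first announcement but accept the second, which is why a deleted ROA widens the reach of a hijack.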
The above suggests to me that unavailability of RPKI services during routing incidents, or lack of deployment of Origin Validation, confirms what most of us already suspected: it is inconvenient.
RIPE NCC's service interruption appears to have affected 4,352 out of the total of 5,945,764 misrouted IPs, and the 'peer_count' for the illegitimate announcements was much lower (better) compared to other prefixes.
This leads me to believe this was not a deliberate plan dependent on a process failure inside RIPE NCC: the incident's BGP data just doesn't seem to show that the incident maximally capitalised on the RPKI outage.
Kind regards,
Job
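The intersection and the 4352-address figure in this analysis can be reproduced with a short sketch using Python's standard `ipaddress` module. Two things here are assumptions for illustration: the base prefixes from the table stand in for the deleted VRP prefixes (the real list is in the linked file), and the extra `192.0.2.0/24` entry is a made-up documentation prefix added only to show the filtering.

```python
from ipaddress import ip_network

# More-specifics originated by AS12389, copied from the table in this message,
# plus one made-up entry (192.0.2.0/24) that no deleted VRP covers.
announced = [
    "91.195.240.0/24", "62.122.170.0/24", "91.203.187.0/24",
    "109.206.164.0/23", "109.206.174.0/23", "109.206.178.0/23",
    "109.206.168.0/23", "109.206.180.0/23", "109.206.161.0/24",
    "109.206.170.0/24", "109.206.187.0/24", "109.206.166.0/24",
    "192.0.2.0/24",
]

# Assumption: the table's base prefixes stand in for the deleted VRP prefixes.
deleted_vrp_prefixes = [
    "91.195.240.0/23", "62.122.168.0/21",
    "91.203.184.0/22", "109.206.160.0/19",
]

vrp_nets = [ip_network(p) for p in deleted_vrp_prefixes]

# Keep only announcements that fall inside a prefix whose VRP was deleted.
affected = [
    net for net in map(ip_network, announced)
    if any(net.subnet_of(vrp) for vrp in vrp_nets)
]

affected_ips = sum(net.num_addresses for net in affected)
total_misrouted_ips = 5_945_764  # figure quoted in the analysis
share = affected_ips / total_misrouted_ips

print(len(affected), affected_ips)  # 12 4352
print(f"{share:.2%}")               # roughly 0.07% of all misrouted addresses
```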
Dear Job, all,
First, thanks to you and Andree for that e-mail and for that information.
On Sun, 2020-04-05 at 18:29 +0000, Job Snijders wrote:

> (...) If we take the intersection of Andree's list with the list of missing VRPs, we have the IP addresses that were affected by both the RIPE NCC RPKI Deletion incident and the Rostelecom BGP incident. The following 12 prefixes (4352 IP addresses):
>
> | peer_count | start_time | alert_type | base_prefix | base_as | announced_prefix | src_AS | Affected_ASname | example_ASPath |
> | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> | 49 | 2020-04-01 19:30:34 | more_spec_by_other | 91.195.240.0/23 | 47846 | 91.195.240.0/24 | 12389 | SEDO-AS, DE | 24751 20764 12389 |
> | 12 | 2020-04-01 19:29:55 | more_spec_by_other | 62.122.168.0/21 | 50245 | 62.122.170.0/24 | 12389 | SERVEREL-AS, NL | 18356 38794 4651 4651 20764 12389 |
> | 11 | 2020-04-01 19:30:34 | more_spec_by_other | 91.203.184.0/22 | 41064 | 91.203.187.0/24 | 12389 | SKYROCK, FR | 29430 13030 20764 12389 |
>
> (...)

It seems that I know at least one of those prefixes, as 91.203.187.0/24 is part of one of my customers' networks. That specific /24, out of all their allocation, is the one carrying most of my customer's production (a French FM radio station, which produces its own streaming in-house, and some other related online applications). I would be quite surprised if it had significant traffic within RU networks, but if we assume it's yet another BGP optimizer leak, and since all those "BGP Optimizer blackbox" algorithms are quite obscure, we cannot say. Still, it wouldn't surprise me much if they optimized that specific one out of all AS41064's announcements.

> If we assume the generation & propagation of these hijacks was the result of operator error, I imagine the change could've been reverted almost immediately but we'd still see a bit of sloshing for a few minutes through the routing system. Or perhaps the 'waves' we can see in Oracle's 3D rendering of the incident are the effects of Maximum Prefix limits kicking in and various timers firing off at different times.
>
> Were these prefixes just unlucky because some BGP optimiser algorithm had chosen them for the purpose of traffic engineering? Was this the result of sophisticated planning? In any case, I can't judge the impact this routing incident had on the three above listed ASNs. I don't know what the victim IPs are used for.

As I said earlier: We didn't really notice any drop within AS41064's network statistics. But since it's mostly FR and not RU traffic, this could have been completely invisible to us. Fortunately the leak was quite brief... it's just bad luck, indeed :(
Kind regards,
--
Clément Cavadore
[top post only]
Thanks for this Job, interesting analysis.
Another question here: at what interval is data from a given RIR repository ingested / operationalized by a given network operator? Or put differently, any idea how much lag today between when an RIR RPKI repository has a change until that becomes OV policy in _your_ routers? I'm sure this varies but not sure by how much within a given operator, or across operators.
-danny
Hi,
On Mon, Apr 6, 2020, at 15:54, Danny McPherson wrote:

> Thanks for this Job, interesting analysis.
>
> Another question here: at what interval is data from a given RIR repository ingested / operationalized by a given network operator? Or put differently, any idea how much lag today between when an RIR RPKI repository has a change until that becomes OV policy in _your_ routers? I'm sure this varies but not sure by how much within a given operator, or across operators.

Consumption: Some network operators fetch & validate RPKI data only once a day, some perform that action every 15 minutes.
Publication: Some CA operators publish changes every 6 hours, some publish every 15 minutes.
I agree I wouldn't expect a lot of variance within a given operator, but across operators we should expect differences.
Kind regards,
Job
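To put rough numbers on the lag question, here is a small sketch that combines the publication and consumption intervals mentioned in this reply. The model is a deliberate simplification and an assumption of this example: a change made just after a publication run waits one full publication interval, and is then picked up at worst one full fetch interval later. It ignores validation run time and the time needed to push the result to routers.

```python
from datetime import timedelta


def worst_case_lag(publish_every: timedelta, fetch_every: timedelta) -> timedelta:
    """Upper bound on the delay from a CA-side change to a validated VRP set."""
    return publish_every + fetch_every


# Slowest combination mentioned above: 6-hourly publication, daily fetch.
slowest = worst_case_lag(timedelta(hours=6), timedelta(days=1))

# Fastest combination mentioned above: both sides run every 15 minutes.
fastest = worst_case_lag(timedelta(minutes=15), timedelta(minutes=15))

print(slowest)  # 1 day, 6:00:00
print(fastest)  # 0:30:00
```

Under these assumptions the spread across operators is about a factor of sixty, from half an hour to well over a day.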
Dear Danny and all,
Thank you for your email.
We understand the importance of RPKI for Internet operations and we are taking recent outages very seriously.
We already have alerting systems in place that did not report the deletions because deletion of ROAs is sometimes a normal and necessary action. However, as Nathalie mentions in her post-mortem, we have already taken steps to ensure our systems prevent this from happening again.
We are also carrying out a separate investigation on the impact this outage had on networks in terms of hijacking and route leaks.
There is a 24/7 hotline[1] in place that people can use to report outages outside of office hours. In this case, none of the people who contacted us used this method to alert us.
In our Activity Plan and Budget 2020, we requested a significant budget allocation for resiliency of RPKI in anticipation of increased global demand and operational reliance on this system. Lessons learned from these outages will be incorporated into the RPKI activity and we will take all necessary steps to ensure the stability of the system.
Kind regards,
Felipe Victolla Silveira
Chief Operations Officer
RIPE NCC
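One generic way to reconcile "deleting ROAs is sometimes normal" with the need to notice a mass deletion is a volume threshold. The sketch below is purely illustrative and is not a description of the checks RIPE NCC implemented: the function name `needs_review`, both thresholds and the total of 20,000 ROAs are hypothetical values chosen for the example.

```python
def needs_review(deleted: int, total: int,
                 max_fraction: float = 0.01, max_count: int = 50) -> bool:
    """Return True when one batch of ROA deletions is large enough to hold for a human."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Either an absolute or a relative limit can trip the check.
    return deleted > max_count or deleted / total > max_fraction


# The total of 20,000 ROAs is made up for this example.
print(needs_review(deleted=3, total=20_000))      # False: routine clean-up
print(needs_review(deleted=2_669, total=20_000))  # True: the size of this incident
```

A check of this kind lets routine single-user deletions pass silently while a registry-wide change that removes thousands of objects is held back or paged out.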
My apologies. I missed the reference to our Technical Emergency Hotline:
[1] RIPE NCC Technical Emergency Hotline: https://www.ripe.net/support/contact/technical-emergency-hotline
On 2020-04-06 10:19, Felipe Victolla Silveira wrote:

> In our Activity Plan and Budget 2020, we requested a significant budget allocation for resiliency of RPKI in anticipation of increased global demand and operational reliance on this system. Lessons learned from these outages will be incorporated into the RPKI activity and we will take all necessary steps to ensure the stability of the system.

Thanks Felipe, I'm especially glad to hear this!
-danny
Hi,
thanks a lot for collecting the information and the post-mortem, and apologies for nitpicking, but:
On 3 Apr 2020, at 14:55, Nathalie Trenaman wrote:

> Following an update to our internal registry software on 1 April at 18:16 (UTC+2), 2,669 ROAs were deleted from Provider Independent (PI) address assignments. [..] Affected users with alerts set up in the LIR Portal received a notification email on 31 March at 22:23, stating that their ROAs were missing. [..]
>
> Our engineers were able to reinstate all of the missing ROAs by 13:15 on 2 April.

The timeline does not completely match up here. I assume the users received alarms on April 1 after the update of the internal registry software?
Thanks,
Marcus
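The mismatch raised here can be checked mechanically by putting the three times from the post-mortem on a common clock. This is only a sketch: the report labels just the update time as UTC+2, so treating the notification and restore times as UTC+2 as well is an assumption of this example.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_2 = timezone(timedelta(hours=2))

# Times as written in the post-mortem. Only the first is explicitly UTC+2;
# the same offset is assumed for the other two.
update = datetime(2020, 4, 1, 18, 16, tzinfo=UTC_PLUS_2)
notification = datetime(2020, 3, 31, 22, 23, tzinfo=UTC_PLUS_2)
restored = datetime(2020, 4, 2, 13, 15, tzinfo=UTC_PLUS_2)

for label, moment in [("update", update),
                      ("notification", notification),
                      ("restored", restored)]:
    print(label, moment.astimezone(timezone.utc).isoformat())

# True means the alert is dated before the update that caused it,
# which is the inconsistency pointed out above.
print(notification < update)
```

As written, the notification predates the software update by almost a day, so one of the two dates in the report cannot be right.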
On Sun, Apr 05, 2020 at 12:56:48PM +0200, Marcus Stoegbauer wrote:

> thanks a lot for collecting the information and the post-mortem, and apologies for nitpicking, but:
>
> > Following an update to our internal registry software on 1 April at 18:16 (UTC+2), 2,669 ROAs were deleted from Provider Independent (PI) address assignments. [..] Affected users with alerts set up in the LIR Portal received a notification email on 31 March at 22:23, stating that their ROAs were missing. [..] Our engineers were able to reinstate all of the missing ROAs by 13:15 on 2 April.
>
> The timeline does not completely match up here. I assume the users received alarms on April 1 after the update of the internal registry software?

I've constructed the following timeline based on my own validator data archives and the notification I received for my PI block. The minute timestamps are a little bit different from the above because NTT's validators run roughly every 15 minutes and I'm basing this off those snapshots.
2020-04-01T16:32Z - 2,666 VRPs disappeared from the RPKI.
2020-04-01T21:23Z - affected users with alerts set up in the LIR Portal received a notification, stating that their ROAs were missing.
2020-04-02T11:32Z - The missing VRPs returned.
It is interesting that the NCC's outreach about the incident (or their ROA state alert notification emails) seems to have triggered not just a rebound to the original level (which was expected to happen because of the undelete action), but also a tiny increase in new RPKI ROA creation.

| Date | VRP Count |
| --- | --- |
| 2020-03-29 | 80,915 |
| 2020-03-30 | 80,933 |
| 2020-03-31 | 81,016 |
| 2020-04-01 | 81,089 |
| 2020-04-02 | 78,626 |
| 2020-04-03 | 81,530 |
| 2020-04-04 | 81,655 |
| 2020-04-05 | 81,732 |

Kind regards,
Job
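The dip and rebound in the daily VRP counts above can be made explicit with a few lines of Python. The counts are copied from this message; the 1% alert threshold is an arbitrary value picked for illustration, not something any of the participants proposed.

```python
# Daily VRP counts as listed in this message.
vrp_counts = {
    "2020-03-29": 80_915,
    "2020-03-30": 80_933,
    "2020-03-31": 81_016,
    "2020-04-01": 81_089,
    "2020-04-02": 78_626,
    "2020-04-03": 81_530,
    "2020-04-04": 81_655,
    "2020-04-05": 81_732,
}

dates = list(vrp_counts)
day_pairs = list(zip(dates, dates[1:]))

# Day-over-day change in the number of VRPs.
deltas = {day: vrp_counts[day] - vrp_counts[prev] for prev, day in day_pairs}

# Flag any day that lost more than 1% of the previous day's VRPs
# (hypothetical threshold).
flagged = [day for prev, day in day_pairs
           if deltas[day] < -0.01 * vrp_counts[prev]]

print(deltas["2020-04-02"])  # -2463
print(flagged)               # ['2020-04-02']

# Net growth across the incident: the count after the restore minus the count before.
print(vrp_counts["2020-04-03"] - vrp_counts["2020-04-01"])  # 441
```

Every other day in the series shows modest growth, so even a coarse threshold like this would have singled out 2 April.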
Dear colleagues,
Between 02:00 and 08:50 (UTC+2) this morning, our rsync RPKI Repository (rsync://rpki.ripe.net/repository/) appeared as down to many relying parties (validators). The servers reached their maximum connection pool size and started to refuse new connections.
At the moment, the service is back online and we are investigating why there was a sudden increase in connections to the service.
We will post further updates on this page: https://www.ripe.net/support/service-announcements/rsync-rpki-repository-downtime/
Apologies for any inconvenience this may have caused.
Kind regards,
Thiago da Cruz
Senior Software Engineer
RIPE NCC
participants (9)
- Clement Cavadore
- Danny McPherson
- Ehsan Ghazizadeh
- Felipe Victolla Silveira
- Job Snijders
- Marcus Stoegbauer
- Nathalie Trenaman
- Randy Bush
- Thiago da Cruz