On Thu, May 13, 2021 at 04:57:21PM +0100, Niall O'Reilly wrote:
> On 13 May 2021, at 16:19, Niall O'Reilly wrote:
> > I will share my arithmetic separately later.
>
> So, as promised.
>
> Time to recover five-nines availability after worst-case service-recovery delay: An hour’s outage must be matched by a minimum of 99,999 hours uninterrupted availability.
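The arithmetic quoted above can be checked with a short sketch. The function name `hours_to_recover` and the 365-day year are illustrative assumptions, not anything taken from the blog post or from RIPE NCC:

```python
# Sketch: how much uninterrupted uptime is needed after an outage before
# the overall availability climbs back to a target level.
# Assumption: availability = uptime / (uptime + outage).

HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def hours_to_recover(outage_hours: float, target: float) -> float:
    """Uptime hours needed so that uptime / (uptime + outage) >= target."""
    # Rearranging the inequality gives: uptime >= outage * target / (1 - target)
    return outage_hours * target / (1.0 - target)


if __name__ == "__main__":
    for label, target in [("four nines", 0.9999), ("five nines", 0.99999)]:
        uptime = hours_to_recover(1.0, target)
        years = uptime / HOURS_PER_YEAR
        print(f"{label}: {uptime:,.0f} hours of uptime (~{years:.1f} years)")
```

For a one-hour outage and a five-nines target this works out to roughly 99,999 hours, which is more than eleven years of uninterrupted service.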
I'm concerned the 99.99% and 99.999% numbers displayed in the blog post are very optimistic and unattainable. Even if a cloud provider offers such availability, at this moment I'm under the impression that RIPE NCC is not in a position to even observe whether 98%, 99%, or some other level of service availability is achieved.

The types of outages that the RPKI service has seen in the last year appear to be the result of human error and flaws in the application design, exacerbated by a lack of monitoring. Below is a small selection of outages from the last year.

    Apr 1st, 2020 - "A subset of hosted RPKI ROAs were deleted (hours)" [1]
    Apr 6th, 2020 - "The rsync server was down (hours)" [2]
    Dec 16th, 2020 - "A subset of hosted RPKI ROAs were deleted (hours)" [3]
    Feb 15th, 2021 - "the publication server stopped working until it was rebooted (hours)" [4]
    ONGOING, 2021 - "RSYNC clients periodically are fed inconsistent data by RIPE RSYNC server" [5]

The above list suggests general availability currently is less than 99.99%. Also, the above don't appear to be of the class of issues 'the cloud' solves. The scaling requirements don't appear to exceed what a modern medium-sized gigabit-connected SSD-backed server can muster.

In this posting, "Improving operations at RIPE NCC TA"
https://www.ripe.net/ripe/mail/archives/routing-wg/2021-February/004237.html
I suggested that RIPE NCC should make a dashboard available to the public that shows all metrics and aspects of the RPKI service. There are probably as many metrics as there are line items on a cloud invoice. :-)

My suggestion would be to first make service availability statistics available, before migrating to the cloud. This way both the RIPE membership and RIPE NCC staff can easily compare 'before' and 'after'. The community would benefit from better insight into how RIPE NCC themselves think things are going.
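To put the outage list above in perspective, here is a rough sketch of the yearly downtime budget each availability level allows. The function name, the 365-day year, and the one-hour example outage are assumptions made for illustration; the service announcements only say "hours", so no actual outage durations are used:

```python
# Sketch: yearly downtime budget implied by an availability percentage.

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    hypothetical_outage_minutes = 60  # assumption: a single one-hour outage
    for pct in (98.0, 99.0, 99.99, 99.999):
        budget = downtime_budget_minutes(pct)
        verdict = "exceeds" if hypothetical_outage_minutes > budget else "fits within"
        print(f"{pct}%: {budget:,.2f} min/year; a one-hour outage {verdict} the budget")
```

Under these assumptions a single outage lasting an hour already uses up more than a full year's budget at 99.99%, which is why several outages described as lasting "hours" are hard to reconcile with that figure.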
In the meantime, I recommend solving the ongoing inconsistent publication problem affecting RSYNC clients before moving to the cloud. The cloud does not solve flawed application designs. Deploying a new application and at the same time moving into the cloud is akin to making multiple unrelated changes at the same time.

The blog post suggests a second cloud provider will not be part of 'phase 1' of moving into the cloud.

Despite repeated pleas from the community for some kind of quick fix or workaround for the RSYNC issue, RIPE NCC has been unable to come up with any form of relief or issue masking in the short term.

The lack of a clear plan for a second provider, the lack of agility to mitigate ongoing service problems in the short term, and the apparent lack of monitoring make me question the current strategy. Are the right priorities set? What KPIs and goals does RIPE NCC set for themselves?

I look forward to today's updates in services-wg.

Kind regards,

Job

[1]: https://www.ripe.net/support/service-announcements/accidental-roa-deletion
[2]: https://www.ripe.net/support/service-announcements/rsync-rpki-repository-dow...
[3]: https://www.ripe.net/support/service-announcements/rpki-roas-deleted-for-som...
[4]: https://www.ripe.net/support/service-announcements/delay-publishing-rpki-obj...
[5]: no service announcement (?): https://www.ripe.net/ripe/mail/archives/routing-wg/2021-April/004297.html