Re: [ncc-services-wg] Draft Cloud Strategy Framework

Dear Gert, Hank,
Sorry for the delay with our response here. We have a few people taking their summer break at the moment, which has made it hard to get a proper response together. We will share something with the list next week.
Kind regards,
Antony Gollan
Communications Team Leader
RIPE NCC

Dear Gert, Hank,
First, our apologies again for the delay in our response. A few of us were taking our summer break and our colleagues didn't want to respond without checking with us first.
To recap, we’ve outlined our core goals - improve the resilience of our services, become more agile and flexible as an organisation, and focus engineering expertise on our core business. You correctly point out that we haven't really talked about the problems we’re trying to solve.
Fair point - we're not used to talking about the firefighting that's needed behind the scenes. We can go over some of this now. We can start by noting that if you take the inverse of the benefits we've listed so far, you find most of the problems we're trying to solve.
1. Improve resilience and availability
We currently host our infrastructure in two data centres in Amsterdam. While they have provided excellent availability so far, users further afield (South America, Oceania, Asia) experience high latency when accessing our services. Importantly, an outage affecting both of these data centres would render all of our services offline.
Public cloud providers have many global regions available, allowing us to choose the level of resilience that best fits a particular service - protecting us against multiple hardware failures or natural disasters (remember that we are below sea level here).
2. Become more agile and flexible
We're proud of the stable and highly available services we provide. Here we can credit the expertise and hard work of our engineering staff, but also a continuous investment in our infrastructure over time. This has a big footprint - we are currently using almost 50 racks across our two data centres.
Each hardware element has its own lifecycle: procurement, shipping, installation, configuration, patching, upgrading and retiring. With hundreds of servers and pieces of network and storage equipment, this is a continuous operation that takes a lot of time and effort. Hardware maintenance is not even the biggest challenge here: our infrastructure doesn't offer much in the way of flexibility, and making changes is complex and expensive.
Our infrastructure also lacks elasticity, meaning that we have to estimate demand and over-provision our services to cover any peaks. This makes us less agile, by forcing us into long-term commitments and requiring us to pay for a lot of unused or idle resources.
3. Focus engineering expertise on our core business
For each new application or change to our infrastructure, there are a lot of manual steps that require tickets back and forth between separate engineering teams. Getting from idea to reality can take many months, and we can see this impacting our ability to innovate. This is inevitable when attention turns from service excellence to fixing problems and time-consuming, mundane maintenance tasks. We especially don't like this because we often need to react quickly as an organisation, while also being able to experiment with new services in an efficient way.
By moving to the cloud, we can build pipelines to deploy code faster, with fewer errors and manual steps, and provide sandbox accounts for engineers to quickly and safely test new technologies. We can also automate security auditing and reporting as much as possible, at all application and infrastructure layers.
There were two good comments on the article recently, from Niall Murphy and Bert Hubert. We will respond to these soon, but I would like to reference one point Bert makes there, which is essentially "Don't outsource your key capabilities." We completely agree with this (many of us have been reading Bert's article on this topic recently). This is precisely what we are *not* doing.
While it is important to have in-house expertise on all technical layers, some are more important than others. For example, at the physical layer we are already using data centre remote hands to replace failed disks, and we generally want to eliminate as much of the repetitive work to unpack, rack, and cable equipment in the data centre as we can. The resources we save here can be used to double down on the capabilities we want to develop further. We will continue to write our own software and control our deployment pipelines, and configure routers, firewalls, load balancers, and storage devices - whether they are physical or virtual, on-premises or in the cloud.
I see Hank's suggestion that we compile a list of outages. I'm reluctant to ask our engineers to spend time on this when I think they'll find we have very resilient services. But past results are not always the best indicator of future performance. And with RPKI especially, I also expect that what we consider acceptable resilience might increase as more and more networks come to rely on it.
(Also I find "evade the discussion on the list by posting a new lengthy article on labs every few months" not really helpful)
I do want to respond to this point. We sometimes miss a comment or take longer to respond than is acceptable, and this is not something that we take lightly as a company. But I would be disappointed if the community thought we were trying to evade discussion. We are here, we are listening, and we will respond.
With that, it's over to you again - let me know if you feel I’ve missed anything here.
Regards,
Felipe Victolla Silveira
Chief Operations Officer
RIPE NCC

On 26/08/2021 15:47, Felipe Victolla Silveira wrote:
Felipe,
Thanks for the extremely well thought out and detailed response. I can't argue with most of what you stated.
So I will go back to my original statement that cloud computing is not secure for critical infrastructure. Cloud vendors roll out dozens of new features per year and each vendor probably has tens of millions of lines of code running and controlling their platforms. Microsoft's IE and Google's Chrome are used by a billion users and have had dozens of security holes found and fixed over the past decade. The cloud platforms are used by perhaps only a million different end users and every week another security hole is found: I assume there are dozens of 0-days in every cloud platform.
Regards,
Hank

Hello,
Just my opinion on some of the points you made (after reading most of the other reactions to the cloud strategy framework).
1. No one is against adding another datacenter (or moving one datacenter) to a location outside the Netherlands. Finding a good location in the RIPE service region would solve many of these points. And for some data (mainly the read-only parts), a copy/mirror could be provided at another URL in other parts of the RIPE service region if required. This way, the chance that everything goes offline at the same moment will be small. I know RIPE NCC has staff experienced with DNS solutions; for read-only data that is sent to the public, looking at the same basic technical solutions might help.
2. A private cloud could help a lot here, I think. Maybe combine it with looking for cloud/VM providers for some of the mirrors/frontends with read-only data mentioned above. For expected short peaks, it is also possible to rent server capacity for shorter periods.
3. This will probably not be solved by moving to a public cloud. Pipelines can also be built with a private cloud, or even without one. Many of the things you mentioned don't sound like a technical problem but more like a communication problem between people/departments.
Outsourcing some technical parts to run your own private cloud might help. E.g. all hardware/datacenter related work can be outsourced; enough options are available for this in the RIPE region.
If you want to outsource things, look at the past resilience in that area over the past few years and compare it with your own experience. I expect that RIPE NCC is doing really well compared to many public clouds (and compare only what you would use from them; what you will not use is not important to compare).
Kind regards,
Mark
-----Original Message----- From: ncc-services-wg <> On Behalf Of Felipe Victolla Silveira Sent: Thursday, August 26, 2021 14:47 To: ncc-services-wg <> Subject: Re: [ncc-services-wg] Draft Cloud Strategy Framework
participants (4)
Antony Gollan
Felipe Victolla Silveira
Hank Nussbacher
Mark Scholten