Delay in publishing RPKI objects
Dear colleagues,

On Monday, 15 February we encountered an issue with our RPKI software. This issue prevented us from publishing RPKI object updates from 08:07-18:06 (UTC). During this period, Certificate Authority activation and Route Origin Authorization configuration updates were delayed and therefore not visible in the RPKI repository. The updates were published after we restarted the system at 17:45 (UTC), with full recovery completed by 18:06 (UTC).

Since this non-publishing period is shorter than our default RPKI object validity period, set to 8 hours, existing objects that were not updated were still valid. No data was lost during this period.

We are still investigating the cause of this delay and will follow up once we have identified the root cause.

Kind regards,

Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC
Dear RIPE NCC,

On Tue, Feb 16, 2021 at 04:56:31PM +0100, Nathalie Trenaman wrote:
On Monday, 15 February we encountered an issue with our RPKI software. This issue prevented us from publishing RPKI object updates from 08:07-18:06 (UTC).
During this period, Certificate Authority activation and Route Origin Authorization configuration updates were delayed and therefore not visible in the RPKI repository.
It appears Certificate Authority revocation was also delayed.
The updates were published after we restarted the system at 17:45 (UTC), with full recovery completed by 18:06 (UTC). Since this non-publishing period is shorter than our default RPKI object validity period, set to 8 hours, existing objects that are not updated were still valid. No data was lost during this period.
Can the following phrase "default RPKI object validity period, set to 8 hours" please be clarified?

For objects produced in the RIPE-hosted RPKI environment I observe the following validity periods are commonly used:

Object type       | validity duration after issuance
------------------+---------------------------------
CRLs              | 24 hours
ROA EE certs      | 18 months
Manifest eContent | 24 hours
Manifest EE certs | 7 days
CAs               | 18 months

I'm just guessing: is the '8 hour' period a reference to RIPE-751 section 2.3?

"A certificate will be published within eight hours of being issued (or deleted)."

The RIPE-751 CPS also states in section 4.9.8 ("Maximum latency for CRLs"):

CRLs will be published to the repository system within one hour of their generation.

As the outage appears to have exceeded both the 1 hour revocation window and the 8 hour object publication window, RIPE NCC was not compliant with its own CPS.

The multitude of RPKI service-impacting events resulting from maloperation of the RIPE NCC trust anchor is starting to give me cause for concern.

Kind regards,

Job
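For reference, the arithmetic behind the compliance observation above can be checked directly. A minimal sketch (Python); the timestamps and service-level windows are the ones quoted in this thread, nothing else is assumed:

    from datetime import datetime, timedelta

    # Timestamps from the incident report (15 February 2021, UTC).
    outage_start = datetime(2021, 2, 15, 8, 7)
    outage_end = datetime(2021, 2, 15, 18, 6)
    non_publishing = outage_end - outage_start

    # Service levels quoted from the RIPE-751 CPS.
    publication_window = timedelta(hours=8)  # section 2.3: publish within 8h
    crl_latency = timedelta(hours=1)         # section 4.9.8: CRLs within 1h

    print(non_publishing)                       # 9:59:00
    print(non_publishing > publication_window)  # True
    print(non_publishing > crl_latency)         # True

So the non-publishing period was roughly ten hours, which exceeds both windows.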
Dear Job,

Please find my answers in line below:
On 16 Feb 2021, at 19:22, Job Snijders via routing-wg <routing-wg@ripe.net> wrote:
Dear RIPE NCC,
On Tue, Feb 16, 2021 at 04:56:31PM +0100, Nathalie Trenaman wrote:
On Monday, 15 February we encountered an issue with our RPKI software. This issue prevented us from publishing RPKI object updates from 08:07-18:06 (UTC).
During this period, Certificate Authority activation and Route Origin Authorization configuration updates were delayed and therefore not visible in the RPKI repository.
It appears Certificate Authority revocation was also delayed.
Indeed, all modifications, creations and deletions were delayed.
The updates were published after we restarted the system at 17:45 (UTC), with full recovery completed by 18:06 (UTC). Since this non-publishing period is shorter than our default RPKI object validity period, set to 8 hours, existing objects that are not updated were still valid. No data was lost during this period.
Can the following phrase "default RPKI object validity period, set to 8 hours" please be clarified?
For objects produced in the RIPE-hosted RPKI environment I observe the following validity periods are commonly used:
Object type       | validity duration after issuance
------------------+---------------------------------
CRLs              | 24 hours
ROA EE certs      | 18 months
Manifest eContent | 24 hours
Manifest EE certs | 7 days
CAs               | 18 months
I'm just guessing: is the '8 hour' period a reference to RIPE-751 section 2.3?
"A certificate will be published within eight hours of being issued (or deleted)."
Yes, the eight hours refers to this section in the CPS, but also to section 4.3.1:

"The Production CA and the ACA, as well as hosted CAs, make all subordinate certificates and objects available for publication. While the system will make a best effort to publish these materials as soon as possible, publication should happen no later than eight hours after issuance (as described in Section 2.3)."

The validity periods are all longer than eight hours, and we can confirm that none of the RIPE-hosted objects expired within the non-publishing time frame.
The RIPE-751 CPS also states in section 4.9.8 ("Maximum latency for CRLs"):

CRLs will be published to the repository system within one hour of their generation.
As the outage appears to have exceeded both the 1 hour revocation window and 8 hour object publication window, RIPE NCC was not compliant with its own CPS.
Correct.
The multitude of RPKI service-impacting events resulting from maloperation of the RIPE NCC trust anchor is starting to give me cause for concern.
I'm sorry to hear this. Transparency is key for us; this means that we report any event. In this case, we were not compliant with our CPS and this non-publishing period had operational impact.

However, not all relying party software discovered this non-publishing period: rpki-client, for example, did not, while Routinator logs these warnings. Is this something all relying party software should log? Maybe this should be discussed in the SIDROPS wg at the IETF.

Kind regards,

Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC
Hi Nathalie,

Do any of the other RIRs run any components of the software stack that the RIPE NCC developed for managing its RPKI CA functionality? I.e., how much exposure to production environments does this software get outside the RIPE NCC?

Nick
Hi Nick,
On 17 Feb 2021, at 13:54, Nick Hilliard <nick@foobar.org> wrote:
Hi Nathalie,
Do any of the other RIRs run any components of the software stack that the RIPE NCC developed for managing its RPKI CA functionality? I.e., how much exposure to production environments does this software get outside the RIPE NCC?
Nick
In the past APNIC used some fork of our publication server code, but I'm not aware of how far this diverged from our code. I'm not aware of any of the other RIRs using our software stack.

Kind regards,

Nathalie
Hi Nick,
On 17 Feb 2021, at 15:25, Nathalie Trenaman <nathalie@ripe.net> wrote:
Hi Nick,
On 17 Feb 2021, at 13:54, Nick Hilliard <nick@foobar.org> wrote:
Hi Nathalie,
Do any of the other RIRs run any components of the software stack that the RIPE NCC developed for managing its RPKI CA functionality? I.e., how much exposure to production environments does this software get outside the RIPE NCC?
Nick
In the past APNIC used some fork of our publication server code, but I'm not aware of how far this diverged from our code. I'm not aware of any of the other RIRs using our software stack.
Kind regards, Nathalie
I stand corrected. Tim pointed out to me that APNIC has their own code base; they always have, and it is actually older than the RIPE NCC code. AFRINIC runs a fork from APNIC. ARIN used our code base around 2010 for their pilot, but implemented their own code base from scratch. LACNIC uses the RPKI object library but has their own CA and publication stack.

Sorry for the confusion, and thanks for the correction, Tim.

Kind regards,

Nathalie
Nathalie Trenaman wrote on 17/02/2021 15:16:
I stand corrected. Tim pointed out to me that APNIC has their own code base; they always have, and it is actually older than the RIPE NCC code. AFRINIC runs a fork from APNIC. ARIN used our code base around 2010 for their pilot, but implemented their own code base from scratch. LACNIC uses the RPKI object library but has their own CA and publication stack.
Hi Nathalie,

Ok, was curious. The reason I ask is because the overall complexity of this software is pretty high, and we've seen a number of RPKI CA / publication failures from different registries over the last year.

If each RIR is using their own internally-developed implementations, then they're also the only organisations testing and developing monitoring systems for them, which means that in each case, exposure to interesting and unusual corner cases is likely to be pretty low. Compare this to other software suites, where there might be anything between tens and millions of copies in production, and where complex bugs tend to get flushed out quickly.

This is an observation more than anything else. Obviously the flip side of running in-house software like this is that bugs in one implementation are unlikely to affect any others.

Nick
There is one point which has been seen at more than one publication server: a corner case in the update of a filesystem being served by rsync can cause 'incoherence' in what is seen to be served.

You need to either hand-code the rsync server side to not be filestore-based (e.g. a DB backend, with some coherent model of what is current) or run the rsync update method very carefully, so the "live" rsync --daemon instances do not see change in the underlying filestore while still serving open connections. E.g., use filesystems like ZFS with a snapshot model, or fs-layers, and use the abstraction to persist an older view while publishing a newer view to new connections. There are also ways to do this using things like haproxy, traefik or other load/proxy balancers, and a "pod" of servers in a pool which can drain. How you code it depends on your platform.

If you use rsync --daemon, it turns out there is a common fix: use symlinks *at* the publication point root, the thing rsyncd.conf points at. That can be "moved" safely to point to a fresh repository; the live daemons remain functionally bound to the dirpath under the old publication path reference. If you do this, you can drain the live connections, and all newly forked client connections get the new repository. If you try to do things "below" the publication path anchor in the config, it doesn't work well. (A sketch of this follows below.)

I don't think we all coded this the same way, but I do think more than one RIR walked into the problem. We talk about this stuff. We discuss it, and coordinate loosely, to find and fix problems like this. I don't think there's a lack of goodwill amongst us to try and limit exposure to these problems.

Originally, we were serving a small handful of clients reliably, on a slow(ish) cycle, with infrequent updates, on small repositories. Now we have thousand(s) of clients and growing, we have larger datasets, and we have frequent updates (it only takes one change to one ROA to cause a cascade of Manifest-related changes).

We've been asked to improve resiliency, and our first pass on this has been to go "into the cloud", but that introduces the coherency question: how does your cloud provider help you publish and maintain a coherent state? Rsync is not a web protocol, and cloud services are not primarily designed to serve this kind of service well. It is more like TCP anycast, or a DNS director to discrete platforms.

If we had succeeded in moving the community rapidly to RRDP, which *is* a web protocol and is designed to work coherently in a CDN, we wouldn't have this problem. There are people (Job?) who think rsync might be needed as a repo backfill option, but I personally think this is wrong: the RRDP/delta model with web-based bulk access to the current bootstrap state is a far better model. It is faster, it is inherently more coherent to the state of a specific change, and it scales wonderfully. I really think we need to move in the IETF on rsync deprecation.

cheers

-George

On Thu, Feb 18, 2021 at 3:00 AM Nick Hilliard <nick@foobar.org> wrote:
Nathalie Trenaman wrote on 17/02/2021 15:16:
I stand corrected. Tim pointed out to me that APNIC has their own code base; they always have, and it is actually older than the RIPE NCC code. AFRINIC runs a fork from APNIC. ARIN used our code base around 2010 for their pilot, but implemented their own code base from scratch. LACNIC uses the RPKI object library but has their own CA and publication stack.
Hi Nathalie,
Ok, was curious. The reason I ask is because the overall complexity of this software is pretty high, and we've seen a number of RPKI CA / publication failures from different registries over the last year.
If each RIR is using their own internally-developed implementations, then they're also the only organisations testing and developing monitoring systems for them, which means that in each case, exposure to interesting and unusual corner cases is likely to be pretty low. Compare this to other software suites, where there might be anything between tens and millions of copies in production, and where complex bugs tend to get flushed out quickly. This is an observation more than anything else. Obviously the flip side of running in-house software like this is that bugs in one implementation are unlikely to affect any others.
Nick
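A minimal sketch of the symlink swap George describes above, assuming the rsyncd.conf module path points at a symlink; the paths and function name are illustrative, not any RIR's actual setup:

    import os

    def publish_new_repository(pubpoint, new_tree):
        """Atomically repoint the publication root at a freshly written
        repository tree. Daemons already serving a transfer stay bound
        to the old (coherent) tree they resolved when the connection
        forked; new connections resolve the symlink again and see the
        new tree."""
        tmp = pubpoint + ".new"
        os.symlink(new_tree, tmp)  # build the replacement link aside...
        os.replace(tmp, pubpoint)  # ...then rename over the live link;
                                   # rename(2) is atomic on POSIX.

    # Illustrative: [rpki] in rsyncd.conf has path = /srv/rpki/current,
    # a symlink into versioned trees written out-of-band:
    # publish_new_repository("/srv/rpki/current", "/srv/rpki/v20210218")

The design point is that the swap happens at the single path the daemon configuration names, so every connection sees exactly one complete repository version.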
There is one point which has been seen at more than one publication server: a corner case in the update of a filesystem being served by rsync can cause 'incoherence' in what is seen to be served.
that is called a "bug" and has little to do with rsync. when updating a file which is being served, one creates a new updated file and then moves the link.

randy
I really think we need to move in the IETF on rsync deprecation.
you have it backward

first you need to fix rrdp so it does not do things such as serve out of bailiwick data. then you need to fix operational deployment. then you can measure the net to be sure everybody is serving rrdp properly. then you can blah blah blah in the ietf.

but we have had this discussion before.

randy
On Thu, Feb 18, 2021 at 10:17 AM Randy Bush <randy@psg.com> wrote:
I really think we need to move in the IETF on rsync deprecation.
you have it backward
first you need to fix rrdp so it does not do things such as serve out of bailiwick data.
To refresh the stack, can you give me an instance please?
then you need to fix operational deployment.
That's work in progress. We were hoping to move on a process design to get there while we finish that deployment. Almost all children NOT in hosted are RRDP-active. I would be very surprised if the majority use case now is not RRDP-active.
then you can measure the net to be sure everybody is serving rrdp properly.
That sounds like a fine activity for somebody ELSE to do, to me.
then you can blah blah blah in the ietf.
Well... I think we have a draft in blah blah blah. And we have worked to get everyone to be running it visibly. And I think the measurement is a nice hole for somebody to fill.
but we have had this discussion before.
Yea, I know, but the problem is we've arrived at needing to boost resiliency against scale, and rsync is a really poor fit for the problem, because most CDN choices are tuned for HTTP and not arbitrary TCP protocols.

cheers

-G
randy
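For readers unfamiliar with the complaint: 'out of bailiwick' here roughly means a repository serving objects whose publication URIs name some other party's authority. A minimal checker sketch over an RRDP snapshot (RFC 8182 defines the XML format and namespace); the URL and hostname in the usage lines are hypothetical:

    import urllib.request
    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"  # namespace from RFC 8182

    def out_of_bailiwick(snapshot_url, expected_host):
        """Return the rsync URIs in an RRDP snapshot whose authority is
        not the host this repository is expected to publish for."""
        with urllib.request.urlopen(snapshot_url) as resp:
            root = ET.parse(resp).getroot()
        return [pub.get("uri")
                for pub in root.iter(RRDP_NS + "publish")
                if urlparse(pub.get("uri", "")).hostname != expected_host]

    # Hypothetical usage:
    # print(out_of_bailiwick("https://rrdp.example.net/snapshot.xml",
    #                        "rpki.example.net"))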
To refresh the stack, can you give me an instance please?
then you need to fix operational deployment.
That's work in progress. We were hoping to move on a process design to get there while we finish that deployment. Almost all children NOT in hosted are RRDP-active. I would be very surprised if the majority use case now is not RRDP-active.
then you can measure the net to be sure everybody is serving rrdp properly.
That sounds like a fine activity for somebody ELSE to do, to me.
see our imc 2020 paper
but we have had this discussion before.
Yea, I know, but the problem is we've arrived at needing to boost resiliency against scale, and rsync is a really poor fit for the problem, because most CDN choices are tuned for HTTP and not arbitrary TCP protocols.
your emergency due to lack of planning and action does not motivate me

randy
On Thu, Feb 18, 2021 at 10:37 AM Randy Bush <randy@psg.com> wrote:
To refresh the stack, can you give me an instance please?
then you need to fix operational deployment.
That's work in progress. We were hoping to move on a process design to get there while we finish that deployment. Almost all children NOT in hosted are RRDP-active. I would be very surprised if the majority use case now is not RRDP-active.
then you can measure the net to be sure everybody is serving rrdp properly.
That sounds like a fine activity for somebody ELSE to do, to me.
see our imc 2020 paper
The data is from January-April 2020. It would be interesting to see how the landscape has changed by April 2021, I think. Two reasons: the publishing side may well have changed, and the RP side has definitely changed in some ways. Not that this invalidates the IMC paper: far from it. The point would be to see if it can help show there has been a substantive change in the system overall. Do you think a re-measure is achievable as a low(ish) cost activity?
but we have had this discussion before.
Yea, I know, but the problem is we've arrived at needing to boost resiliency against scale, and rsync is a really poor fit for the problem because of the fact most CDN choices are tuned for HTTP and not arbitrary TCP protocols.
your emergency due to lack of planning and action does not motivate me
I think this is a poor characterisation of what should be done and what the cost/benefit issues are. Suffice to say we have plans, and we are acting. The "emergency", such as there is one, is that during the deployment and planning, service levels are going to continue to be open to question. I have this work timed for Q3/4 in 2021 because I have a larger body of unrelated work in Q1/2.

The distribution of service into self-hosted raises concerns for me that no amount of work in the RIR will fix. We have been promoting "publish in parent" because it helps to reduce the points of connect, which are going to tend to be single points of failure for many self-hosted people until they also put their publication states into a resilient fabric.

We're improving our own resiliency all the time. I discussed some RTT outcomes today with Job: in RRDP he can see 300ms drop to 5ms from the CDN/DNS solution we use, which is a significant improvement in RTT and load sharing. I cannot achieve that in the non-web protocol, because nobody can offer a cache for the datastream in question. I can do better than a 300ms delay (which RobA frequently pointed out made APNIC look particularly egregious on the long-haul datapath, because rsync is an innately serialised read/write function) if I can get enough points of presence behind rsync, but then I get a coherency problem, which the CDN-for-web guys solved. It's just hard to fix this in rsync. You know this, and it's one of the reasons I wanted to promote deprecation.

It might help if publication-as-a-service was a thing, and we all decided to put the publication burden onto prime agents we paid to do this under SLA. That has problems of its own in terms of governance; maybe it needs to be a market. But that's kind-of how the DNS works: there's a label, it's served by different people, sometimes they administrate the boxes directly, sometimes they use intermediaries; we measure their effectiveness against load, and it mostly works. I wouldn't have a problem if there was a declared market price to do the publication protocol into AWS, Cloudflare, Fastly, GCP: same protocol endpoint, they do the rest once you write objects in. It might well be significantly more resilient than what we're trying to do now.

As for hosting the TA function and the HSM-bound functions, I don't think we've hit significant stresses yet. RIPE are looking at dual-redundant signer models. There are cloud-HSM services.

-G
randy
Dear RIPE NCC,

On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
The multitude of RPKI service-impacting events resulting from maloperation of the RIPE NCC trust anchor is starting to give me cause for concern.
I'm sorry to hear this. Transparency is key for us; this means that we report any event. In this case, we were not compliant with our CPS and this non-publishing period had operational impact.
From the previous email there might be a misunderstanding about what rpki-client and Routinator do. Both utilities help Relying Parties validate X.509-signed CMS objects and transform the validated content into authorizations and attestations. Neither utility is an SLA or CPS compliance monitor. RIPE NCC, as CA operator, needs different tools.
Neither utility has been designed to interpret the Certification Practice Statement (written in a natural language) and subsequently programmatically transform the described 'Service Level' into metrics suitable for monitoring.

A relying party can never tell the difference between a publication pipeline being empty because CAs didn't issue new objects, and a publication pipeline being empty because of a malfunction in one of RIPE NCC's RPKI subsystems.

More examples of 'out of scope' functionality for Relying Party software: validators don't monitor whether lirportal.ripe.net is functional, whether RIPE NCC's BPKI API endpoints are operational, or whether LIRs paid their invoices; the list is quite long. The validators by themselves are the wrong tool for RPKI CPS/SLA monitoring.

You state "transparency is key for us", but I fear ad-hoc, low-quality, a-posteriori reports are not the appropriate mechanism to impress and reassure this community regarding 'transparency'.

I have some tangible suggestions to RIPE NCC that will improve the reliability of the Trust Anchor and potentially help rebuild trust:

A need for Certificate Transparency
-----------------------------------

RIPE NCC should set up a Certificate Transparency project which publicly shows which certificates (fingerprints) were issued when, and store such information in immutable logs, accessible to all.

How can anyone trust a Trust Anchor which does not offer transparency about its issuance process?

Lack of transparency to signer software
---------------------------------------

The RIPE NCC WHOIS database software is open source, as is most of the software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has undertaken over the years.

Why has the signer source code still not been open sourced? Why can't members review the code related to scheduled changes? Why is an organisation proclaiming 'transparency' being opaque about how the RPKI certificates are issued?

Lack of Public status dashboard
-------------------------------

RIPE NCC should set up a website like https://rpki-status.ripe.net/ which shows dashboards with graphs and traffic lights related to each (best effort) commitment listed in the CPS. RIPE NCC should continuously publish & revoke & delete objects and verify whether those activities are visible externally, and then automatically report whether any potential delays observed are within the Service Levels outlined in the CPS.

Metrics that come to mind (see the probe sketch below):

* delta between last certificate issuance & successful publication
* Object count in the repository, repo size (and graphs)
* Time-To-Keyroll (and graphs on duration & frequency)
* Resource utilisation of various RPKI subsystems
* aggregate bandwidth consumption for RPKI endpoints (including rrdp, API, rsync)
* Graphs & logs of overlap between INRs listed on EE certificates under the RIPE TA and other commonly used TAs, matched against known transfers. This will help detect compromises as well as understand whether transfers are successful or not.
* Unique client IP count for RSYNC & RRDP for last hour/day/week
* Number of CS/hostmaster tickets mentioning RPKI

There are plenty of aspects to monitor; perhaps some notes should be copied from how the DNS root is monitored.
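One concrete probe such a dashboard could be built on: poll the RRDP notification file and alert when its serial stops advancing. A minimal sketch, assuming RIPE NCC's public notification file lives at https://rrdp.ripe.net/notification.xml; the one-hour threshold is illustrative:

    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    NOTIFICATION_URL = "https://rrdp.ripe.net/notification.xml"  # assumed
    ALERT_AFTER = 3600  # seconds without a serial change before alerting

    def current_serial():
        """Fetch the RRDP notification file (RFC 8182), return its serial."""
        with urllib.request.urlopen(NOTIFICATION_URL) as resp:
            return int(ET.parse(resp).getroot().get("serial"))

    def watch(poll_interval=60):
        last_serial, last_change = current_serial(), time.time()
        while True:
            time.sleep(poll_interval)
            serial = current_serial()
            if serial != last_serial:
                last_serial, last_change = serial, time.time()
            elif time.time() - last_change > ALERT_AFTER:
                print("WARNING: RRDP serial stuck at %d for %.0fs; "
                      "repository may be stale"
                      % (serial, time.time() - last_change))

A stuck serial alone cannot distinguish "no CA issued anything" from "the publication pipeline is broken", which is exactly the point above; a real monitor would pair this with a canary object reissued on a known schedule.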
Lack of operational experience with BGP ROV at RIPE NCC
-------------------------------------------------------

I believe a number of potential future incidents related to the RIPE NCC Trust Anchor can be prevented (or remediation time reduced) if RIPE NCC themselves apply RPKI-based BGP Origin Validation 'invalid == reject' policies on the AS 3333 EBGP border routers. RIPE NCC OPS themselves having a dependency on the RPKI services will increase organization-wide exposure to the (lack of) well-being of the Trust Anchor services, and given the short communication channels between the OPS team and the RPKI team, my expectation is that we'll see problems being solved faster and perhaps even problems being prevented.

An analogy: RIPE NCC is a kitchen chef refusing to eat their own food. How can we trust RIPE NCC to operate RPKI services, when RIPE NCC themselves refuse to apply the cryptographic products to their BGP routing decisions? "RPKI for thee but not for me?"

Surely RIPE NCC staff has not disabled DNSSEC sig checking on their resolvers, or disabled WebPKI TLS validation in their browsers? I'm not joking: it makes zero sense to participate in a PKI and at the same time not participate in the same way everyone outside RIPE NCC depends on the service.

I am very aware of the potential for circular dependencies between BGP and RPKI, and I know exactly how catch-22s can be avoided. Unfortunately it appears my feedback is ignored and problem reports remain unresolved.

Reporting issues has become a thankless effort, useless because no remediation actions are taken, and obviously RIPE NCC staff are growing tired of hearing about problems (but if one wishes to stop hearing about problems... perhaps solve the issues, instead of a 'head in the sand' approach?!)

Conclusion & Call to action
---------------------------

There is a fair chunk of work ahead for RIPE NCC, but RIPE NCC has a multi-million budget and talented, dedicated staff to achieve the above. None of the above is impossible or unreasonable to ask from Trust Anchors.

I recognize how the above information reflects negatively on the current state of the RIPE NCC Trust Anchor. But the reality of the situation is that we see an outage every few weeks, and there is an apparent lack of architectural oversight of how to improve. I really hope this is a temporary state of being, on which we can look back a year from now as "haha, remember those RPKI teething pains?". I wish for RIPE NCC to be successful in operating the Trust Anchor.

RIPE NCC would do well to allow themselves to be vulnerable to criticism by exposing service level metrics and efforts like the production of public merkle tree logs reflecting the certificate issuance process. RIPE NCC should allow itself to be held accountable, which can only happen if there is visibility into where friction exists.

Does RIPE NCC understand the precariousness of the current situation and the negative impact on the long-term viability of the RPKI if course is not corrected?

This email outlines deliverables; will RIPE NCC commit to those? What timelines can the community expect? What kind of help is needed to achieve this?

Kind regards,

Job
Hello,

I agree with Job that the reliability of the TA needs to be improved, and I fully support the ideas described by Job below.

- Daniel

On 2/17/21 4:58 PM, Job Snijders via routing-wg wrote:
Dear RIPE NCC,
On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
The multitude of RPKI service-impacting events resulting from maloperation of the RIPE NCC trust anchor is starting to give me cause for concern.
I'm sorry to hear this. Transparency is key for us; this means that we report any event. In this case, we were not compliant with our CPS and this non-publishing period had operational impact.
From the previous email there might be a misunderstanding about what rpki-client and Routinator do. Both utilities help Relying Parties validate X.509-signed CMS objects and transform the validated content into authorizations and attestations. Neither utility is an SLA or CPS compliance monitor. RIPE NCC, as CA operator, needs different tools.
Neither utility has been designed to interpret the Certification Practice Statement (written in a natural language) and subsequently programmatically transform the described 'Service Level' into metrics suitable for monitoring.
A relying party can never tell the difference between a publication pipeline being empty because CAs didn't issue new objects, and a publication pipeline being empty because of a malfunction in one of RIPE NCC's RPKI subsystems.
More examples of 'out of scope' functionality for Relying Party software: validators don't monitor whether lirportal.ripe.net is functional, whether RIPE NCC's BPKI API endpoints are operational, or whether LIRs paid their invoices; the list is quite long. The validators by themselves are the wrong tool for RPKI CPS/SLA monitoring.
You state "transparency is key for us", but I fear ad-hoc, low-quality, a-posteriori reports are not the appropriate mechanism to impress and reassure this community regarding 'transparency'.
I have some tangible suggestions to RIPE NCC that will improve the reliability of the Trust Anchor and potentially help rebuild trust:
A need for Certificate Transparency
-----------------------------------
RIPE NCC should set up a Certificate Transparency project which publicly shows which certificates (fingerprints) were issued when, and store such information in immutable logs, accessible to all.
How can anyone trust a Trust Anchor which does not offer transparency about its issuance process?
Lack of transparency to signer software
---------------------------------------
The RIPE NCC WHOIS database software is open source, as is most of the software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has undertaken over the years.
Why has the signer source code still not been open sourced? Why can't members review the code related to scheduled changes? Why is an organisation proclaiming 'transparency' being opaque about how the RPKI certificates are issued?
Lack of Public status dashboard
-------------------------------
RIPE NCC should set up a website like https://rpki-status.ripe.net/ which shows dashboards with graphs and traffic lights related to each (best effort) commitment listed in the CPS. RIPE NCC should continuously publish & revoke & delete objects and verify whether those activities are visible externally, and then automatically report whether any potential delays observed are within the Service Levels outlined in the CPS.
Metrics that come to mind:
* delta between last certificate issuance & successful publication
* Object count in the repository, repo size (and graphs)
* Time-To-Keyroll (and graphs on duration & frequency)
* Resource utilisation of various RPKI subsystems
* aggregate bandwidth consumption for RPKI endpoints (including rrdp, API, rsync)
* Graphs & logs of overlap between INRs listed on EE certificates under the RIPE TA and other commonly used TAs, matched against known transfers. This will help detect compromises as well as understand whether transfers are successful or not.
* Unique client IP count for RSYNC & RRDP for last hour/day/week
* Number of CS/hostmaster tickets mentioning RPKI
There are plenty of aspects to monitor; perhaps some notes should be copied from how the DNS root is monitored.
Lack of operational experience with BGP ROV at RIPE NCC
-------------------------------------------------------
I believe a number of potential future incidents related to the RIPE NCC Trust Anchor can be prevented (or remediation time reduced) if RIPE NCC themselves apply RPKI-based BGP Origin Validation 'invalid == reject' policies on the AS 3333 EBGP border routers. RIPE NCC OPS themselves having a dependency on the RPKI services will increase organization-wide exposure to the (lack of) well-being of the Trust Anchor services, and given the short communication channels between the OPS team and the RPKI team, my expectation is that we'll see problems being solved faster and perhaps even problems being prevented.
An analogy: RIPE NCC is a kitchen chef refusing to eat their own food. How can we trust RIPE NCC to operate RPKI services, when RIPE NCC themselves refuse to apply the cryptographic products to their BGP routing decisions? "RPKI for thee but not for me?"
Surely RIPE NCC staff has not disabled DNSSEC sig checking on their resolvers, or disabled WebPKI TLS validation in their browsers? I'm not joking: it makes zero sense to participate in a PKI and at the same time not participate in the same way everyone outside RIPE NCC depends on the service.
I am very aware of the potential for circular dependencies between BGP and RPKI, and I know exactly how catch-22s can be avoided. Unfortunately it appears my feedback is ignored and problem reports remain unresolved.
Reporting issues has become a thankless effort, useless because no remediation actions are taken, and obviously RIPE NCC staff are growing tired of hearing about problems (but if one wishes to stop hearing about problems... perhaps solve the issues, instead of a 'head in the sand' approach?!)
Conclusion & Call to action
---------------------------
There is a fair chunk of work ahead for RIPE NCC, but RIPE NCC has a multi-million budget and talented, dedicated staff to achieve the above. None of the above is impossible or unreasonable to ask from Trust Anchors.
I recognize how the above information reflects negatively on the current state of the RIPE NCC Trust Anchor. But the reality of the situation is that we see an outage every few weeks, and there is an apparent lack of architectural oversight of how to improve. I really hope this is a temporary state of being, on which we can look back a year from now as "haha, remember those RPKI teething pains?". I wish for RIPE NCC to be successful in operating the Trust Anchor.
RIPE NCC would do well to allow themselves to be vulnerable to criticism by exposing service level metrics and efforts like the production of public merkle tree logs reflecting the certificate issuance process. RIPE NCC should allow itself to be held accountable, which can only happen if there is visibility into where friction exists.
Does RIPE NCC understand the precariousness of the current situation and the negative impact on the long-term viability of the RPKI if course is not corrected?
This email outlines deliverables; will RIPE NCC commit to those? What timelines can the community expect? What kind of help is needed to achieve this?
Kind regards,
Job
Hi Job,

See my responses inline in your final section...
On 17 Feb 2021, at 16:58, Job Snijders via routing-wg <routing-wg@ripe.net> wrote:
Dear RIPE NCC,
On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
The multitude of RPKI service-impacting events resulting from maloperation of the RIPE NCC trust anchor is starting to give me cause for concern.
I'm sorry to hear this. Transparency is key for us; this means that we report any event. In this case, we were not compliant with our CPS and this non-publishing period had operational impact.
From the previous email there might be a misunderstanding about what rpki-client and Routinator do. Both utilities help Relying Parties validate X.509-signed CMS objects and transform the validated content into authorizations and attestations. Neither utility is an SLA or CPS compliance monitor. RIPE NCC, as CA operator, needs different tools.
Neither utility has been designed to interpret the Certification Practice Statement (written in a natural language) and subsequently programmatically transform the described 'Service Level' into metrics suitable for monitoring.
A relying party can never tell the difference between a publication pipeline being empty because CAs didn't issue new objects, and a publication pipeline being empty because of a malfunction in one of RIPE NCC's RPKI subsystems.
More examples of 'out of scope' functionality for Relying Party software: validators don't monitor whether lirportal.ripe.net is functional, whether RIPE NCC's BPKI API endpoints are operational, or whether LIRs paid their invoices; the list is quite long. The validators by themselves are the wrong tool for RPKI CPS/SLA monitoring.
You state "transparency is key for us", but I fear ad-hoc, low-quality, a-posteriori reports are not the appropriate mechanism to impress and reassure this community regarding 'transparency'.
I have some tangible suggestions to RIPE NCC that will improve the reliability of the Trust Anchor and potentially help rebuild trust:
A need for Certificate Transparency
-----------------------------------
RIPE NCC should set up a Certificate Transparency project which publicly shows which certificates (fingerprints) were issued when, and store such information in immutable logs, accessible to all.
How can anyone trust a Trust Anchor which does not offer transparency about its issuance process?
Lack of transparency to signer software
---------------------------------------
The RIPE NCC WHOIS database software is open source, as is most of the software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has undertaken over the years.
Why has the signer source code still not been open sourced? Why can't members review the code related to scheduled changes? Why is an organisation proclaiming 'transparency' being opaque about how the RPKI certificates are issued?
Lack of Public status dashboard
-------------------------------
RIPE NCC should set up a website like https://rpki-status.ripe.net/ which shows dashboards with graphs and traffic lights related to each (best effort) commitment listed in the CPS. RIPE NCC should continuously publish & revoke & delete objects and verify whether those activities are visible externally, and then automatically report whether any potential delays observed are within the Service Levels outlined in the CPS.
Metrics that come to mind:
* delta between last certificate issuance & successful publication
* Object count in the repository, repo size (and graphs)
* Time-To-Keyroll (and graphs on duration & frequency)
* Resource utilisation of various RPKI subsystems
* aggregate bandwidth consumption for RPKI endpoints (including rrdp, API, rsync)
* Graphs & logs of overlap between INRs listed on EE certificates under the RIPE TA and other commonly used TAs, matched against known transfers. This will help detect compromises as well as understand whether transfers are successful or not.
* Unique client IP count for RSYNC & RRDP for last hour/day/week
* Number of CS/hostmaster tickets mentioning RPKI
There are plenty of aspects to monitor; perhaps some notes should be copied from how the DNS root is monitored.
Lack of operational experience with BGP ROV at RIPE NCC
-------------------------------------------------------
I believe a number of potential future incidents related to the RIPE NCC Trust Anchor can be prevented (or remediation time reduced) if RIPE NCC themselves apply RPKI-based BGP Origin Validation 'invalid == reject' policies on the AS 3333 EBGP border routers. RIPE NCC OPS themselves having a dependency on the RPKI services will increase organization-wide exposure to the (lack of) well-being of the Trust Anchor services, and given the short communication channels between the OPS team and the RPKI team, my expectation is that we'll see problems being solved faster and perhaps even problems being prevented.
An analogy: RIPE NCC is a kitchen chef refusing to eat their own food. How can we trust RIPE NCC to operate RPKI services, when RIPE NCC themselves refuse to apply the cryptographic products to their BGP routing decisions? "RPKI for thee but not for me?"
Surely RIPE NCC staff has not disabled DNSSEC sig checking on their resolvers, or disabled WebPKI TLS validation in their browsers? I'm not joking: it makes zero sense to participate in a PKI and at the same time not participate in the same way everyone outside RIPE NCC depends on the service.
I am very aware of the potential for circular dependencies between BGP and RPKI, and I know exactly how catch-22s can be avoided. Unfortunately it appears my feedback is ignored and problem reports remain unresolved.
Reporting issues has become a thankless effort, useless because no remediation actions are taken, and obviously RIPE NCC staff are growing tired of hearing about problems (but if one wishes to stop hearing about problems... perhaps solve the issues, instead of a 'head in the sand' approach?!)
Conclusion & Call to action
---------------------------
There is a fair chunk of work ahead for RIPE NCC, but RIPE NCC has a multi-million budget and talented, dedicated staff to achieve the above. None of the above is impossible or unreasonable to ask from Trust Anchors.
We can do many things, but our main concern is to implement what is needed in a way that we can manage effectively and with input from all the relevant stakeholders. You've provided a big list here, and some of these items are already on our roadmap. For example, ROV in AS3333: we are working on this, and we expect to come with an announcement soon. Also, open-sourcing the RPKI core is on our roadmap, but this will take a bit longer.
I recognize how the above information reflects negatively on the current state of the RIPE NCC Trust Anchor. But the reality of the situation is that we see an outage every few weeks, and there is an apparent lack of architectural oversight of how to improve. I really hope this is a temporary state of being, on which we can look back a year from now as "haha, remember those RPKI teething pains?". I wish for RIPE NCC to be successful in operating the Trust Anchor.
RIPE NCC would do well to allow themselves to be vulnerable to criticism by exposing service level metrics and efforts like the production of public merkle tree logs reflecting the certificate issuance process. RIPE NCC should allow itself to be held accountable, which can only happen if there is visibility into where friction exists.
We're very open to constructive feedback. We would also like to formalise how we encourage, gather and integrate feedback and requests in a better way. Apart from the critical work on RPKI itself, we will work to come up with a proposal on how we might achieve this together with the community. I think we aim for the same thing here.
Does RIPE NCC understand the precariousness of the current situation and the negative impact on the long-term viability of the RPKI if course is not corrected?
The RIPE NCC absolutely recognises the importance of running a stable, safe and resilient Trust Anchor and we are very committed to ensuring the long-term viability of RPKI.
This email outlines deliverables; will RIPE NCC commit to those? What timelines can the community expect? What kind of help is needed to achieve this?
This will take some time to assess and we'll need to come back with a more detailed response.

Regards,

Nathalie
Kind regards,
Job
Hi Nathalie,

On 2/19/21 9:15 AM, Nathalie Trenaman wrote:
We can do many things, but our main concern is to implement what is needed in a way that we can manage effectively and with input from all the relevant stakeholders. You've provided a big list here, and some of these items are already on our roadmap. For example, ROV in AS3333: we are working on this, and we expect to come with an announcement soon.
Can you share your roadmap? I think the plans and timeline should also be open. As these plans exist, you can just publish such document(s) for those who are interested. I think the community should be informed about plans in advance, not just ex-post through "marketing" announcements about things already done.
Also, open-sourcing the RPKI core is on our roadmap, but this will take a bit longer.
Can you explain in detail where the problem is with open-sourcing the RPKI core (publishing its code)? Are there legal reasons, or is something else blocking publication of the code you're using? As above, how long do we have to wait? I think (open) community review is also important from a security perspective for this critical part of the Internet infrastructure.

- Daniel
Hi Daniel,
On 19 Feb 2021, at 09:33, Daniel Suchy via routing-wg <routing-wg@ripe.net> wrote:
Hi Nathalie,
On 2/19/21 9:15 AM, Nathalie Trenaman wrote:
We can do many things, but our main concern is to implement what is needed in a way that we can manage effectively and with input from all the relevant stakeholders. You've provided a big list here, and some of these items are already on our roadmap. For example, ROV in AS3333: we are working on this, and we expect to come with an announcement soon.
Can you share your roadmap? I think the plans and timeline should also be open. As these plans exist, you can just publish such document(s) for those who are interested. I think the community should be informed about plans in advance, not just ex-post through "marketing" announcements about things already done.
I shared the RPKI roadmap on RIPE Labs last year: https://labs.ripe.net/Members/nathalie_nathalie/where-were-at-with-rpki-resiliency and the work plan for this year has recently been finalised and will first be presented to the Executive Board in March. After that, I will publish another RIPE Labs article with the work plan for this year and announce it in this working group as well.
Also, open-sourcing the RPKI core is on our roadmap, but this will take a bit longer.
Can you explain in detail where the problem is with open-sourcing the RPKI core (publishing its code)? Are there legal reasons, or is something else blocking publication of the code you're using? As above, how long do we have to wait? I think (open) community review is also important from a security perspective for this critical part of the Internet infrastructure.
I agree that open sourcing the RPKI core is important, and so are many other things that we are working on. Please be assured that this is on our radar and we’ll move forward with this as soon as we can.
- Daniel
Kind regards,

Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC
participants (7)

- Daniel Suchy
- George Michaelson
- Hank Nussbacher
- Job Snijders
- Nathalie Trenaman
- Nick Hilliard
- Randy Bush