Hi Job, See my responses inline in your final section...
On 17 Feb 2021, at 16:58, Job Snijders via routing-wg <routing-wg@ripe.net> wrote:
Dear RIPE NCC,
On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
The multitude of RPKI service-impacting events resulting from maloperation of the RIPE NCC trust anchor is starting to give me cause for concern.
I’m sorry to hear this. Transparency is key for us; this means that we report every event. In this case, we were not compliant with our CPS and this non-publishing period had operational impact.
From the previous email there might be a misunderstanding about what rpki-client and Routinator do. Both utilities help Relying Parties validate X.509 signed CMS objects and transform the validated content into authorizations and attestations. Neither utility is a SLA or CPS compliance monitor. RIPE NCC - as CA operator - needs different tools.
Neither utility has been designed to interpret the Certification Practice Statement (written in a natural language) and subsequently programmatically transform the described 'Service Level' into metrics suitable for monitoring.
A relying party can never tell the difference between a publication pipeline being empty because CAs didn't issue new objects, or a publication pipeline being empty because of a malfunction in one of RIPE NCC's RPKI subsystems.
More examples of 'out of scope' functionality for Relying Party software: validators don't monitor whether lirportal.ripe.net is functional, whether RIPE NCC's BPKI API endpoints are operational, or whether LIRs paid their invoices; the list is quite long. The validators by themselves are the wrong tool for RPKI CPS/SLA monitoring.
You state "transparency is key for us", but I fear ad-hoc low-quality a-posteriori reports are not the appropriate mechanism to impress and reassure this community regarding 'transparency'.
I have some tangible suggestions to RIPE NCC that will improve the reliability of the Trust Anchor and potentially help rebuild trust:
A need for Certificate Transparency
-----------------------------------
RIPE NCC should set up a Certificate Transparency project which publicly shows which certificates (fingerprints) were issued when, and store such information in immutable logs, accessible to all.
How can anyone trust a Trust Anchor which does not offer transparency about its issuance process?
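To make the "immutable logs" idea concrete, here is a minimal sketch of an append-only issuance log whose state is summarised by a Merkle tree root. It assumes the hashing layout used by Certificate Transparency in RFC 6962 (a 0x00 prefix for leaves, 0x01 for interior nodes); the class name IssuanceLog, the entry format and the example inputs are hypothetical and only illustrate the technique, not anything RIPE NCC runs.

```python
import hashlib
from typing import List


def leaf_hash(entry: bytes) -> bytes:
    # RFC 6962 style: leaves are hashed with a 0x00 prefix.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962 style: interior nodes are hashed with a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: List[bytes]) -> bytes:
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


class IssuanceLog:
    """Hypothetical append-only log of (timestamp, fingerprint) entries."""

    def __init__(self) -> None:
        self._entries: List[bytes] = []

    def append(self, cert_der: bytes, issued_at: str) -> bytes:
        # Record the SHA-256 fingerprint of the certificate, not the
        # certificate itself, together with the issuance time.
        fingerprint = hashlib.sha256(cert_der).hexdigest()
        self._entries.append(f"{issued_at} {fingerprint}".encode())
        return self.root()

    def root(self) -> bytes:
        return merkle_root(self._entries)


log = IssuanceLog()
root_1 = log.append(b"example certificate 1", "2021-02-17T10:00:00Z")
root_2 = log.append(b"example certificate 2", "2021-02-17T10:05:00Z")
print(root_1.hex())
print(root_2.hex())
```

Anyone who recorded an earlier root can later detect whether entries were rewritten or dropped, because any change to a past entry changes every subsequent root. A real deployment would also need signed tree heads and consistency proofs, which this sketch leaves out.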
Lack of transparency to signer software
---------------------------------------
The RIPE NCC WHOIS database software is open source, as is most of the software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has undertaken over the years.
Why has the signer source code still not been open sourced? Why can't members review the code related to scheduled changes? Why is an organisation proclaiming 'transparency' being opaque about how the RPKI certificates are issued?
Lack of Public status dashboard
-------------------------------
RIPE NCC should set up a website like https://rpki-status.ripe.net/ which shows dashboards with graphs and traffic lights related to each (best effort) commitment listed in the CPS. RIPE NCC should continuously publish & revoke & delete objects and verify whether those activities are visible externally, and then automatically report whether any potential delays observed are within the Service Levels outlined in the CPS.
Metrics that come to mind:
* delta between last certificate issuance & successful publication
* Object count in the repository, repo size (and graphs)
* Time-To-Keyroll (and graphs on duration & frequency)
* Resource utilisation of various RPKI subsystems
* aggregate bandwidth consumption for RPKI endpoints (including rrdp, API, rsync)
* Graphs & logs of overlap between INRs listed on EE certificates under the RIPE TA and other commonly used TAs, matched against known transfers. This will help detect compromises as well as understand whether transfers are successful or not.
* Unique client IP count for RSYNC & RRDP for last hour/day/week
* Number of CS/hostmaster tickets mentioning RPKI
There are plenty of aspects to monitor; perhaps some notes should be copied from how the DNS root is monitored.
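As an illustration of the first metric (the delta between issuance and externally visible publication), a canary check could be sketched roughly as follows. Everything here is an assumption for the sake of the example: the Probe record, the traffic_light function, the 10-minute target and the 80% warning threshold are hypothetical and are not taken from the CPS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Probe:
    """A hypothetical canary object published to test the pipeline."""
    name: str
    issued_at: datetime
    seen_at: Optional[datetime]  # None: not yet visible externally


def traffic_light(probe: Probe, now: datetime, target: timedelta,
                  warn_fraction: float = 0.8) -> str:
    # Measure up to the moment the object became visible, or up to
    # 'now' if it has not been seen yet.
    elapsed = (probe.seen_at or now) - probe.issued_at
    if elapsed > target:
        return "red"
    if probe.seen_at is None or elapsed > target * warn_fraction:
        return "amber"
    return "green"


# Assumed target for this example only; the real value would come
# from the commitments listed in the CPS.
TARGET = timedelta(minutes=10)
t0 = datetime(2021, 2, 17, 12, 0, tzinfo=timezone.utc)

probe = Probe("canary-roa", issued_at=t0, seen_at=t0 + timedelta(minutes=3))
print(probe.name, traffic_light(probe, now=t0 + timedelta(minutes=5), target=TARGET))
```

The point of measuring from outside is that the check also fails when the pipeline is silently stuck, which is exactly the case a relying party cannot distinguish from "no new objects were issued".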
Lack of operational experience with BGP ROV at RIPE NCC
-------------------------------------------------------
I believe the number of potential future incidents related to the RIPE NCC Trust Anchor can be reduced (or remediation time shortened) if RIPE NCC themselves apply RPKI based BGP Origin Validation 'invalid == reject' policies on the AS 3333 EBGP border routers. RIPE NCC OPS themselves having a dependency on the RPKI services will increase organization-wide exposure to the (lack of) well-being of the Trust Anchor services, and given the short communication channels between the OPS team and the RPKI team my expectation is that we'll see problems being solved faster and perhaps even problems being prevented.
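For readers less familiar with what an 'invalid == reject' policy evaluates, the classification defined in RFC 6811 can be sketched as below. This is only an illustration of the decision logic, not router configuration; the VRP tuple, the function names and the documentation prefixes and AS numbers are hypothetical example values.

```python
import ipaddress
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """Validated ROA Payload: prefix, maximum length, origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate_origin(route_prefix: str, origin_asn: int,
                    vrps: Iterable[VRP]) -> str:
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # subnet_of() raises TypeError across IP versions, so skip those.
        if vrp_net.version != route.version:
            continue
        if not route.subnet_of(vrp_net):
            continue
        covered = True  # at least one VRP covers this route
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


def accept(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> bool:
    # 'invalid == reject': only routes classified invalid are dropped;
    # valid and not-found routes are both accepted.
    return validate_origin(route_prefix, origin_asn, vrps) != "invalid"


# Documentation prefix and private-use example ASNs, not real data.
vrps = [VRP("192.0.2.0/24", 24, 64496)]
print(validate_origin("192.0.2.0/24", 64496, vrps))
print(validate_origin("192.0.2.0/24", 64497, vrps))
print(validate_origin("198.51.100.0/24", 64497, vrps))
```

Note that the outcome depends entirely on the VRP set the validator produced, which is why an operator running this policy notices quickly when the Trust Anchor's publication goes stale.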
An analogy: RIPE NCC is a chef refusing to eat their own food. How can we trust RIPE NCC to operate RPKI services, when RIPE NCC themselves refuse to apply the cryptographic products to their BGP routing decisions? "RPKI for thee but not for me?"
Surely RIPE NCC staff has not disabled DNSSEC sig checking on their resolvers, or disabled WebPKI TLS validation in their browsers? I'm not joking, it makes zero sense to participate in a PKI and at the same time not participate in the same way everyone outside RIPE NCC depends on the service.
I am very aware of potential for circular dependencies between BGP and RPKI, and I know exactly how catch-22s can be avoided. Unfortunately it appears my feedback is ignored and problem reports remain unresolved.
Reporting issues has become a thankless effort, useless because no remediation actions are taken, and obviously RIPE NCC staff are growing tired of hearing about problems (but if one wishes to stop hearing about problems... perhaps solve the issues, instead of a 'head in the sand' approach?!)
Conclusion & Call to action
---------------------------
There is a fair chunk of work ahead for RIPE NCC, but RIPE NCC has a multi-million budget and talented dedicated staff to achieve the above. None of the above is impossible or unreasonable to ask from Trust Anchors.
We can do many things but our main concern is to implement what is needed in a way that we can manage effectively and with input from all the relevant stakeholders. You've provided a big list here and some of these are already on our roadmap. For example, we are working on ROV in AS3333 and expect to make an announcement soon. Also, open-sourcing the RPKI core is on our roadmap, but this will take a bit longer.
I recognize how the above information reflects negatively on the current state of the RIPE NCC Trust Anchor. But the reality of the situation is that we see an outage every few weeks, and there is an apparent lack of architectural oversight on how to improve. I really hope this is a temporary state of being, on which we can look back a year from now as "haha remember those RPKI teething pains?". I wish for RIPE NCC to be successful in operating the Trust Anchor.
RIPE NCC would do well to allow themselves to be vulnerable to criticism by exposing service level metrics and efforts like production of public merkle tree logs reflecting the certificate issuance process. RIPE NCC should allow itself to be held accountable, which can only happen if there is visibility into where friction exists.
We're very open to constructive feedback. We would also like to formalise how we encourage, gather and integrate feedback and requests. Apart from the critical work on RPKI itself, we will work to come up with a proposal on how we might achieve this together with the community. I think we aim for the same thing here.
Does RIPE NCC understand the precariousness of the current situation and the negative impact on the long term viability of the RPKI if course is not corrected?
The RIPE NCC absolutely recognises the importance of running a stable, safe and resilient Trust Anchor and we are very committed to ensuring the long-term viability of RPKI.
This email outlines deliverables; will RIPE NCC commit to those? What timelines can the community expect? What kind of help is needed to achieve this?
This will take some time to assess and we'll need to come back with a more detailed response.
Regards,
Nathalie
Kind regards,
Job