On 17/02/2021 17:58, Job Snijders via routing-wg wrote:

+1.

-Hank

Dear RIPE NCC,

On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
The multitude of RPKI service impacting events as a result of
maloperation of the RIPE NCC trust anchor are starting to give me
cause for concern.
I’m sorry to hear this. Transparency is key for us; this means that we
report any event. In this case, we were not compliant with our CPS and
this non-publishing period had operational impact.
From the previous email there might be a misunderstanding about what
rpki-client and Routinator do. Both utilities help Relying Parties
validate X.509 signed CMS objects and transform the validated content
into authorizations and attestations. Neither utility is an SLA or CPS
compliance monitor. RIPE NCC - as CA operator - needs different tools.

Neither utility has been designed to interpret the Certification
Practice Statement (written in a natural language) and subsequently
programmatically transform the described 'Service Level' into metrics
suitable for monitoring.

A relying party can never tell the difference between a publication
pipeline being empty because CAs didn't issue new objects, or a
publication pipeline being empty because of a malfunction in one of RIPE
NCC's RPKI subsystems.

More examples of 'out of scope' functionality for Relying Party
software: validators don't monitor whether lirportal.ripe.net is
functional, whether RIPE NCC's BPKI API endpoints are operational, or
whether LIRs paid their invoices; the list is quite long. The validators
by themselves are the wrong tool for RPKI CPS/SLA monitoring.

You state "transparency is key for us", but I fear ad-hoc low-quality
a-posteriori reports are not the appropriate mechanism to impress and
reassure this community regarding 'transparency'.

I have some tangible suggestions to RIPE NCC that will improve the
reliability of the Trust Anchor and potentially help rebuild trust:

A need for Certificate Transparency
-----------------------------------

RIPE NCC should set up a Certificate Transparency project which publicly
shows which certificates (fingerprints) were issued when, and store such
information in immutable logs, accessible to all.

How can anyone trust a Trust Anchor which does not offer transparency
about its issuance process?
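
As a rough illustration of what such an immutable log could look like,
here is a minimal Python sketch of an append-only Merkle tree over
issuance records, using the leaf/node hashing scheme from RFC 6962
(Certificate Transparency). The IssuanceLog class, the record format
and the timestamps are assumptions made for illustration only; this
does not describe any existing RIPE NCC system.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    # RFC 6962: leaves are hashed with a 0x00 prefix.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962: interior nodes are hashed with a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list) -> bytes:
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


class IssuanceLog:
    """Hypothetical append-only log of certificate fingerprints."""

    def __init__(self):
        self._entries = []

    def append(self, cert_der: bytes, issued_at: str) -> bytes:
        # Record "timestamp fingerprint"; the format is an assumption.
        fingerprint = hashlib.sha256(cert_der).hexdigest()
        self._entries.append(f"{issued_at} {fingerprint}".encode())
        return self.root()

    def root(self) -> bytes:
        return merkle_root(self._entries)
```

If the root hash is published after every append, outside observers
who remember an earlier root can notice when history is rewritten.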

Lack of transparency to signer software
---------------------------------------

The RIPE NCC WHOIS database software is open source, as is most of the
software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has
undertaken over the years.

Why has the signer source code still not been open sourced? Why can't members
review the code related to scheduled changes? Why is an organisation
proclaiming 'transparency' being opaque about how the RPKI certificates
are issued?

Lack of Public status dashboard
-------------------------------

RIPE NCC should set up a website like https://rpki-status.ripe.net/
which shows dashboards with graphs and traffic lights related to each
(best effort) commitment listed in the CPS. RIPE NCC should continuously
publish & revoke & delete objects and verify whether those activities
are visible externally, and then automatically report whether any
potential delays observed are within the Service Levels outlined in the
CPS.
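
The publish/revoke/delete probing described above could be reduced to
a traffic light along these lines. This is a minimal sketch: the
ProbeResult type, the traffic_light function and the service level
threshold passed in are hypothetical, and the actual thresholds would
have to come from the CPS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProbeResult:
    action: str                    # "publish", "revoke" or "delete"
    started: datetime              # when the CA performed the action
    observed: Optional[datetime]   # when it became visible externally


def traffic_light(result: ProbeResult, service_level: timedelta,
                  now: datetime) -> str:
    # Use the observation time if there is one, else measure up to now.
    end = result.observed if result.observed is not None else now
    delay = end - result.started
    if delay > service_level:
        return "red"      # service level exceeded
    if result.observed is None:
        return "amber"    # still pending, but within the service level
    return "green"        # observed externally within the service level
```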

Metrics that come to mind:

* delta between last certificate issuance & successful publication
* Object count in the repository, repo size (and graphs)
* Time-To-Keyroll (and graphs on duration & frequency)
* Resource utilisation of various RPKI subsystems
* Aggregate bandwidth consumption for RPKI endpoints (including RRDP, API, rsync)
* Graphs & logs of overlap between INRs listed on EE certificates under
  the RIPE TA and other commonly used TAs, matched against known
  transfers. This will help detect compromises as well as understand
  whether transfers are successful or not.
* Unique client IP count for RSYNC & RRDP for last hour/day/week
* Number of CS/hostmaster tickets mentioning RPKI
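
To make the INR overlap metric from the list above concrete, here is a
small Python sketch that reports prefixes appearing under two Trust
Anchors at once, minus known transfers. The function name and the
input format (plain prefix strings) are assumptions; the prefixes in
the usage line are documentation ranges, not real data.

```python
import ipaddress


def unexplained_overlaps(ta_a, ta_b, known_transfers=()):
    """Return (prefix_a, prefix_b) pairs that overlap across two TAs
    and are not accounted for by a known transfer."""
    nets_a = [ipaddress.ip_network(p) for p in ta_a]
    nets_b = [ipaddress.ip_network(p) for p in ta_b]
    transfers = {ipaddress.ip_network(p) for p in known_transfers}
    hits = []
    for a in nets_a:
        for b in nets_b:
            # Only compare prefixes of the same address family.
            if a.version != b.version or not a.overlaps(b):
                continue
            if a in transfers or b in transfers:
                continue
            hits.append((str(a), str(b)))
    return hits


print(unexplained_overlaps(["192.0.2.0/24"], ["192.0.2.128/25"]))
```

The nested loop is quadratic, which is fine for a sketch; a production
version would more likely use a prefix trie.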

There are plenty of aspects to monitor; perhaps some notes should be
copied from how the DNS root is monitored.

Lack of operational experience with BGP ROV at RIPE NCC
-------------------------------------------------------

I believe the number of potential future incidents related to the RIPE
NCC Trust Anchor can be reduced (or remediation time shortened) if RIPE
NCC themselves apply RPKI based BGP Origin Validation 'invalid ==
reject' policies on the AS 3333 EBGP border routers. RIPE NCC OPS
themselves having a dependency on the RPKI services will increase
organization-wide exposure to the (lack of) well-being of the Trust
Anchor services, and given the short communication channels between the
OPS team and the RPKI team my expectation is that we'll see problems
being solved faster and perhaps even problems being prevented.
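
For readers less familiar with what 'invalid == reject' means in
practice, the following is a minimal Python sketch of route origin
validation in the spirit of RFC 6811. The VRP tuple, the function
names and the example prefixes and AS numbers (documentation ranges)
are assumptions for illustration; real routers obtain validated ROA
payloads from a validator via RTR instead.

```python
import ipaddress
from typing import NamedTuple


class VRP(NamedTuple):
    prefix: str       # prefix from a validated ROA
    max_length: int   # longest prefix length the ROA authorizes
    asn: int          # authorized origin AS


def validate(route_prefix: str, origin_asn: int, vrps) -> str:
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # A VRP covers the route if the route is equal or more specific.
        if vrp_net.version != route.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # AS 0 never matches; otherwise length and origin must agree.
        if (vrp.asn != 0 and vrp.asn == origin_asn
                and route.prefixlen <= vrp.max_length):
            return "valid"
    return "invalid" if covered else "not-found"


def accept(route_prefix: str, origin_asn: int, vrps) -> bool:
    # 'invalid == reject': valid and not-found routes are both kept.
    return validate(route_prefix, origin_asn, vrps) != "invalid"
```

Note that not-found routes are still accepted, so only announcements
that contradict a published ROA are dropped.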

An analogy: RIPE NCC is a chef refusing to eat their own food.
How can we trust RIPE NCC to operate RPKI services, when RIPE NCC
themselves refuse to apply the cryptographic products to their BGP
routing decisions? "RPKI for thee but not for me?"

Surely RIPE NCC staff has not disabled DNSSEC sig checking on their
resolvers, or disabled WebPKI TLS validation in their browsers? I'm not
joking: it makes zero sense to participate in a PKI and at the same time
not rely on it in the same way everyone outside RIPE NCC depends on the
service.

I am very aware of potential for circular dependencies between BGP and
RPKI, and I know exactly how catch-22s can be avoided. Unfortunately, it
appears my feedback is ignored and problem reports remain unresolved.

Reporting issues has become a thankless effort, useless because no
remediation actions are taken, and obviously RIPE NCC staff are growing
tired of hearing about problems (but if one wishes to stop hearing about
problems... perhaps solve the issues, instead of a 'head in the sand'
approach?!)

Conclusion & Call to action
---------------------------

There is a fair chunk of work ahead for RIPE NCC, but RIPE NCC has a
multi-million budget and talented dedicated staff to achieve the above.
None of the above is impossible or unreasonable to ask from Trust
Anchors.

I recognize how the above information reflects negatively on the current
state of the RIPE NCC Trust Anchor. But the reality of the situation is
that we see an outage every few weeks, and there is an apparent lack of
architectural oversight on how to improve. I really hope this is a
temporary state of being, on which we can look back a year from now as
"haha remember those RPKI teething pains?". I wish for RIPE NCC to
be successful in operating the Trust Anchor.

RIPE NCC would do well to allow themselves to be vulnerable to criticism
by exposing service level metrics and efforts like production of public
Merkle tree logs - reflecting the certificate issuance process. RIPE NCC
should allow itself to be held accountable - which can only happen if
there is visibility into where friction exists.

Does RIPE NCC understand the precariousness of the current situation and
the negative impact on the long term viability of the RPKI if course is
not corrected?

This email outlines deliverables. Will RIPE NCC commit to those? What
timelines can the community expect? What kind of help is needed to
achieve this?

Kind regards,

Job