Changes to the RRDP repository
Dear colleagues,

Over the past months, we have been working on the RIPE NCC RPKI repositories. We have an update to the RRDP repository that we plan to deploy on 9 November. This will create a regular RRDP event (a session change) but will have no other externally visible impact. We want to share some of the improvements this change brings and highlight two areas in particular.

First, we have improved the publication server software [0]. The current publication server uses an embedded NoSQL/schemaless database. We have changed the project to use PostgreSQL instead, which allows us to move several integrity checks into the publication server’s database.

Second, we have changed how the publication server is deployed, as part of our work to move components of our infrastructure on-premise. Initially, we will run two independent instances, with separate database servers in separate data centres, each instance receiving all objects in the repository. We aim to keep this simple at the initial stage, closely monitor how the environment behaves, and expand later if we need to.

Because the RRDP session differs between the instances, only one instance is (and can be) active at any moment in time. This allows us to swap them during an upgrade and to fall back to the second instance if an issue is detected. Both instances are behind a load balancer, which is the origin for the Content Delivery Network (CDN) that we use. By using a CDN, we (a) reduce the latency from various geographical locations, (b) protect ourselves from network glitches, and (c) reduce the bandwidth peak after a session change that would otherwise interfere with other services on the RIPE NCC network (for example during a deployment).

This change is an intermediate step in our work on the resiliency of our publication infrastructure. We have looked extensively at possible architectures that could solve the issues we face now and considered numerous failure modes, and we think this design strikes a good balance between resilience and simplicity. We will discuss our architectural changes with the community at RIPE 83 and look forward to hearing your feedback.

If you have any questions, please get in touch with us.

Kind regards,
Ties de Kock

[0]: https://github.com/RIPE-NCC/rpki-publication-server
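To make the externally visible effect concrete: after the deployment, relying parties will see a new session_id in notification.xml and, per RFC 8182, fall back to fetching the full snapshot before applying further deltas. The sketch below (Python, purely illustrative; not part of the RIPE NCC tooling) shows how a relying party can detect such a session change:

```python
# Minimal sketch: detect an RRDP session change by comparing the session_id
# in notification.xml between polls (attributes as defined in RFC 8182).
import urllib.request
import xml.etree.ElementTree as ET

NOTIFICATION_URL = "https://rrdp.ripe.net/notification.xml"

def fetch_notification(url=NOTIFICATION_URL):
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    # session_id and serial are attributes of the <notification> element.
    return root.attrib["session_id"], int(root.attrib["serial"])

def session_changed(previous_session, previous_serial):
    session_id, serial = fetch_notification()
    if session_id != previous_session:
        # New session: the RP must discard cached deltas and fetch the snapshot.
        return True, session_id, serial
    # Same session: the RP can catch up using deltas after previous_serial.
    return False, session_id, serial
```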
> We aim to keep this simple at an initial stage, closely monitor how the environment behaves
i am deeply interested in how a CA and a PP (and RP and routers) are measured and monitored. in general, i am scared to death of the growing deployment of rpki and rov with so little, if any, measurement. so i would beg/encourage you to publish how you do this and maybe even think of making your tools more generally useful. and i would love to hear of and see tools any others may be using.

randy
Hi Randy,
On 27 Oct 2021, at 19:45, Randy Bush <randy@psg.com> wrote:
> We aim to keep this simple at an initial stage, closely monitor how the environment behaves
> i am deeply interested in how a CA and a PP (and RP and routers) are measured and monitored. in general, i am scared to death of the growing deployment of rpki and rov with so little, if any, measurement.
> so i would beg/encourage you to publish how you do this and maybe even think of making your tools more generally useful.
We have made a significant investment in monitoring and alerting, using Prometheus. I will introduce the part of our monitoring relevant to the repository content below (there is more), and we will include an update on monitoring in our RIPE 83 presentation.

We have metrics for the Certification Authority (CA) system, monitor Relying Party software instances, and run tools specifically for monitoring. We also run smoke tests (via the UI) and an end-to-end test that validates that VRPs for a ROA created by a user become visible to RPs.

The metrics in the CA system mostly cover liveness (e.g. "job x is running successfully"), ongoing publication, and error rates. We do test (hosted) CA creation/deletion in our staging environment, but not in our production environment, because we do not have the two (hosted, delegated) production LIR accounts required.

As a liveness check for the publication server instances, we check when the publication server received the last withdraw and publish, and when the most recent notification.xml was written (using an RP, via the serial and directly).

Furthermore, we have three types of checks on the content of the repository. For this, we have two endpoints on the CA system: "hash and filename of all files in the repository" and "all VRPs".

For the files, using an internal tool, we check that:

* All files in the CA "filename+hash" endpoint are present in each repository (rsync instances, publication server instances, rrdp.ripe.net) after they have had time to converge.
* There are not too many "leftover" files in each of the repository instances.
* No objects present in the repository are about to expire within ~13.5 hours.

Using RP instances, we check that:

* All VRPs in the CA system show up in the effective VRPs within <time_threshold> (using rtrmon). Because we monitor that no files mismatch between the CA system and the repositories, this check implies that the VRPs are visible in all the repository instances.

Please let us know if there is interest in the tool we use to compare repositories; if so, we might add it to our roadmap.

Kind regards,
Ties
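To illustrate the kind of file-level consistency check described above, here is a minimal sketch (Python; the CA endpoint URL, repository path, response format, and threshold are hypothetical, not the actual internal tool or API):

```python
# Illustrative sketch: compare the CA system's "filename+hash" view with the
# objects found in one repository copy (rsync module or RRDP dump), and flag
# missing/mismatching objects and an excess of leftover files.
import hashlib
import json
import pathlib
import urllib.request

CA_OBJECTS_URL = "https://ca.example.net/api/published-objects"  # hypothetical
REPO_ROOT = pathlib.Path("/var/lib/rsync/repository")            # hypothetical

def expected_objects(url=CA_OBJECTS_URL):
    """Fetch {filename: sha256-hex} as published by the CA system (assumed format)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def repository_objects(root=REPO_ROOT):
    """Hash every object found in a local repository copy."""
    found = {}
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            found[str(path.relative_to(root))] = digest
    return found

def compare(expected, found, leftover_threshold=10):
    missing = [f for f, h in expected.items() if found.get(f) != h]
    leftover = [f for f in found if f not in expected]
    if missing:
        print(f"ALERT: {len(missing)} published objects missing or mismatching")
    if len(leftover) > leftover_threshold:
        print(f"ALERT: {len(leftover)} leftover files in the repository")

if __name__ == "__main__":
    compare(expected_objects(), repository_objects())
```

This sketch only covers the presence/leftover checks; the checks described above additionally cover objects expiring within ~13.5 hours and VRP visibility via rtrmon.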