Dear colleagues,

First, let me give you an overview of our rsync infrastructure and the situation encountered by a client. Afterwards, I will describe the context of our application and repository, and how that limits our design space.

RPKI objects are created on machines in an isolated network. The active machine writes new objects to an NFS share (with replicated storage in two data centres). The rsync machines, which sit outside the isolated network and behind a load balancer, serve these files.

Sets of objects to be updated (for example a manifest, CRL, and certificate) are written to a staging directory by the application. After all the files have been created, they are moved into the main repository directory. There is a small window between these moves: in a recently reported incident it was ~30ms, and the files were written in an order that kept the repository correct on disk at all times. This part of the code has been in place since 2012.

While the files are written to the filesystem, they are also sent to a publication server (implementing a draft version of RFC 8181). The files are sent atomically, in one message. Publication is synchronous: when a ROA is created, it is immediately published to rsync and the publication server.

In the reported incident, the affected client read the file list *before* the new certificate was present, but then checked the content of (and copied) the updated manifest, which referred to a certificate that did not yet exist when the file list was created. In the rest of this document we will call this situation a non-repeatable read: part of the repository reflects one state while another part reflects a different state.

On April 12, we published 41,600 sets of objects, which resulted in 41,600 distinct repository states on disk. The RIPE NCC repository contains ~65,500 files in ~34,500 directories, with a total size of 157 MB.

The repository is consistent on disk when the application is not publishing objects. It is also consistent for a slow client, as long as no files are added or changed after that client's rsync starts retrieving the file list.

Copying the repository without coordination from the application (i.e. to spool it) carries the same risk of a non-repeatable read that rsync clients face. However, in that case it would affect many clients for an extended period, and it would mask rather than solve the underlying issue. Other approaches, such as snapshotting, also have limitations that make them untenable.

The RPKI application does not support writing the complete repository to disk for each state (as would be needed to spool the repository as proposed in the scripts). Synchronously writing every state of the repository to disk is not feasible, given our update frequency and repository size. Functionality for asynchronously writing the repository to disk needs to be developed. We have two paths to develop this:

- The first is a new daemon that writes the repository to disk from the database state at a set interval.
- The second is using RRDP as the source of truth and writing the repository to disk from it.

Furthermore, we would need to migrate the storage away from NFS to get faster writes. Both approaches need an extended period for validation, and we are not able to deploy either within a few weeks. The latter approach (using RRDP) carries less risk and is the option we are aiming for at the moment. We plan to release the new publication infrastructure in Q2/Q3 2021 and hope to migrate earlier.

I have added a few illustrative shell sketches at the end of this mail. I'm happy to answer any further questions you may have.

Kind regards,
Ties de Kock
RIPE NCC
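PS: To make the timing window concrete, here is a minimal sketch of the staging-then-move pattern described above. It is an illustration only, not our production code; all paths and file names are hypothetical.

# The application writes a complete set of objects to a staging
# directory first; nothing under the served tree changes yet.
STAGING=/repository/staging
REPO=/repository/online

cp generated/cert.cer    "$STAGING/"
cp generated/revoked.crl "$STAGING/"
cp generated/objects.mft "$STAGING/"

# Each mv is an atomic rename(2) on the same filesystem, but the
# *set* of moves is not atomic: between the first and the last mv
# (~30ms in the reported incident) old and new objects coexist.
# Moving in dependency order (certificate before the manifest that
# references it) keeps the on-disk state valid at every instant.
mv "$STAGING/cert.cer"    "$REPO/member/cert.cer"
mv "$STAGING/revoked.crl" "$REPO/member/revoked.crl"
mv "$STAGING/objects.mft" "$REPO/member/objects.mft"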
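The same mechanics explain why an uncoordinated spooling copy moves the race rather than removing it. A sketch of a cron job in that spirit (hypothetical; this is not the exact script that was posted to the list):

# Copy the live tree to a spool directory and serve the spool
# to clients instead of the live tree.
rsync -a /repository/online/ /repository/spool/

# The copy walks the tree file by file, exactly like a client's
# rsync does. If the application publishes while the copy runs,
# the spool captures a mixed state (e.g. a new manifest next to
# an old certificate), and *every* client then reads that
# inconsistent snapshot until the next cron run, instead of only
# the clients whose transfer happened to overlap a publication.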
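Finally, the direction we are working towards: an asynchronous writer that materialises a complete, consistent state (for example from an RRDP snapshot) into a fresh directory and then switches clients over in one step. One common way to make that switch atomic is a symlink swap; whether the final design uses exactly this mechanism is still open, and the helper below is hypothetical.

# Write one complete repository state into a new directory.
NEW="/repository/states/$(date +%s)"
mkdir -p "$NEW"
write_state_from_rrdp "$NEW"   # hypothetical helper

# Atomically repoint the path that the rsync daemon exports.
# rename(2) on a symlink is atomic, so every new rsync session
# sees exactly one complete state; sessions already in flight
# keep reading the directory they resolved at session start.
ln -sfn "$NEW" /repository/current.tmp
mv -T /repository/current.tmp /repository/current

# Old state directories can be removed once no session holds
# them open.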
On 12 Apr 2021, at 15:12, Nick Hilliard <nick@foobar.org> wrote:
Erik Bais wrote on 12/04/2021 11:41:
This looks to be a 3-line bash script fix on a cronjob… So why isn't this just tested on a testbed and updated before the end of the week?
Cache coherency and transaction guarantees are problems which are known to be difficult to solve. Long term, the RIPE NCC probably needs to aim for the degree of transaction read consistency that might be provided by an ACID database or something equivalent, and that will take time and probably a migration away from a filesystem-based data store.
So the question is what to do in the interim? The bash script that Job posted may help to reduce some of the race conditions that are being observed, but it's unlikely to guarantee transaction consistency in a deep sense. Does the RIPE NCC have an opinion on whether the approach used in this script would help with some of the problems that are being seen and if so, would there be any significant negative effects associated with implementing it as an intermediate patch-up?
Nick