There is one point which has been seen at more than one publication server. A corner case in the update of a filesystem being served by rsync can cause "incoherence" in what clients see being served. You need either to hand-code the rsync server side so it isn't filestore-based (e.g. a DB backend with some coherent model of what is current), or to run the rsync update method very carefully, so the "live" rsync --daemon instances do not see change in the underlying filestore while still serving open connections. For example, use filesystems like ZFS with a snapshot model, or fs-layers, and use the abstraction to persist an older view while publishing a newer view to new connections. There are also ways to do this using haproxy, traefik or other load/proxy balancers, with a "pod" of servers in a pool which can drain. How you code it depends on your platform.

If you use rsync --daemon, it turns out there is a common fix: use a symlink *at* the publication point root, the thing rsyncd.conf points at. That symlink can be "moved" safely to point to a fresh repository, because the live daemons are functionally bound to the directory path under the old publication path reference. If you do this, you can drain the live connections, and all newly forked client connections get the new repository. If you try to do this "below" the publication path anchor in the config, it doesn't work well. (There's a rough sketch of the symlink swap appended at the bottom of this mail, below the quoted thread.)

I don't think we all coded this the same way, but I do think more than one RIR walked into the problem. We talk about this stuff. We discuss it, and coordinate loosely, to find and fix problems like this. I don't think there's a lack of goodwill amongst us to try and limit exposure to these problems.

Originally, we were serving a small handful of clients reliably, on a slow(ish) cycle, with infrequent updates, on small repositories. Now we have thousands of clients and it's growing, we have larger datasets, and we have frequent updates (it only takes one change to one ROA to cause a cascade of Manifest-related changes). We've been asked to improve resiliency, and our first pass at this has been to go "into the cloud", but that introduces the coherency question: how does your cloud provider help you publish and maintain a coherent state? Rsync is not a web protocol, and cloud services are not really designed to serve it well; it is more like TCP anycast, or a DNS director to discrete platforms. If we had succeeded in moving the community rapidly to RRDP, which *IS* a web protocol and is designed to work coherently in a CDN, we wouldn't have this problem.

There are people (Job?) who think rsync might be needed as a repo backfill option, but I personally think this is wrong: I think the RRDP/delta model, with web-based bulk access to the current bootstrap state, is a far better model. It is faster, it is inherently more coherent with respect to the state of a specific change, and it scales wonderfully. I really think we need to move in the IETF on rsync deprecation.

cheers

-George

On Thu, Feb 18, 2021 at 3:00 AM Nick Hilliard <nick@foobar.org> wrote:
Nathalie Trenaman wrote on 17/02/2021 15:16:
I stand corrected. Tim pointed out to me that APNIC has their own code base, always has had, and it is actually older than the RIPE NCC code. AFRINIC runs a fork from APNIC. ARIN used our code base around 2010 for their pilot, but implemented their own code base from scratch. LACNIC uses the RPKI object library but has their own CA and publication stack.
Hi Nathalie,
Ok, was curious. The reason I ask is because the overall complexity of this software is pretty high and we've seen a number of RPKI CA / publication failures from different registries over the last year.
If each RIR is using their own internally-developed implementation, then they're also the only organisation testing it and developing monitoring systems for it, which means that in each case, exposure to interesting and unusual corner cases is likely to be pretty low. That's in comparison to other software suites where there might be anything from tens to millions of copies in production, and complex bugs tend to get flushed out quickly. This is an observation more than anything else. Obviously the flip side of running in-house software like this is that bugs in one implementation are unlikely to affect any others.
Nick
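
P.S. for anyone who wants to see the shape of the symlink trick described above, here is a minimal sketch in Python of the atomic swap at the publication root. The paths, module name and publish() helper are purely illustrative, not taken from any RIR's actual deployment; the only assumption is that the path named in rsyncd.conf is a symlink rather than a real directory.

    #!/usr/bin/env python3
    # Sketch only: atomically repoint the symlink that rsyncd.conf names.
    # Assumed (illustrative) layout:
    #
    #   rsyncd.conf:
    #     [repository]
    #     path = /srv/rsync/current        <- "current" is a symlink
    #
    #   /srv/rsync/states/<timestamp>/     <- complete repository states
    #
    # Each forked rsync --daemon connection resolves the module path when
    # the client connects, so open connections keep reading the old state
    # directory and can drain, while new connections follow the symlink
    # and see the new repository.

    import os

    PUBLICATION_ROOT = "/srv/rsync/current"   # the path in rsyncd.conf

    def publish(new_state_dir: str) -> None:
        """Repoint the publication-root symlink at new_state_dir atomically."""
        tmp = PUBLICATION_ROOT + ".new"
        if os.path.lexists(tmp):
            os.remove(tmp)                     # clear any stale temp link
        os.symlink(new_state_dir, tmp)         # build the new link alongside
        os.replace(tmp, PUBLICATION_ROOT)      # rename(2) is atomic: readers
                                               # see the old target or the new
                                               # one, never a half-made link

    if __name__ == "__main__":
        # e.g. after a complete new repository state has been written out:
        publish("/srv/rsync/states/2021-02-18T03-00Z")

The key point is that the swap happens *at* the publication root, never below it: everything underneath is an immutable, fully-written state directory, and the only mutation the live daemons can observe is the single atomic rename of the link.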