There is one point which has been seen at more than one publication server. A corner case in the update of a filesystem being served by rsync can cause "incoherence" in what clients see being served. You need either to hand-code the rsync server side so it isn't filestore-based (e.g. a DB backend with some coherent model of what is current), or to run the rsync update method very carefully, so the "live" rsync --daemon instances do not see change in the underlying filestore while still serving open connections. For example, use filesystems like ZFS with a snapshot model, or fs-layers, and use the abstraction to persist an older view while publishing a newer view to new connections. There are also ways to do this using haproxy, traefik or other load/proxy balancers, with a "pod" of servers in a pool which can drain. How you code it depends on your platform.

If you use rsync --daemon, it turns out there is a common fix: use a symlink *at* the publication point root, the thing rsyncd.conf points at. That symlink can be "moved" safely to point to a fresh repository, because the live daemons are functionally bound to the directory path under the old publication path reference. If you do this, you can drain the live connections, and all newly forked client connections get the new repository. If you try to do this "below" the publication path anchor in the config, it doesn't work well. (There's a rough sketch of the symlink swap appended at the bottom of this mail, below the quoted thread.)

I don't think we all coded this the same way, but I do think more than one RIR walked into the problem. We talk about this stuff. We discuss it, and coordinate loosely, to find and fix problems like this. I don't think there's a lack of goodwill amongst us to try and limit exposure to these problems.

Originally, we were serving a small handful of clients reliably, on a slow(ish) cycle, with infrequent updates, on small repositories. Now we have thousands of clients and it's growing, we have larger datasets, and we have frequent updates (it only takes one change to one ROA to cause a cascade of Manifest-related changes). We've been asked to improve resiliency, and our first pass at this has been to go "into the cloud", but that introduces the coherency question: how does your cloud provider help you publish and maintain a coherent state? Rsync is not a web protocol, and cloud services are not really designed to serve it well; it is more like TCP anycast, or a DNS director to discrete platforms. If we had succeeded in moving the community rapidly to RRDP, which *IS* a web protocol and is designed to work coherently in a CDN, we wouldn't have this problem.

There are people (Job?) who think rsync might be needed as a repo backfill option, but I personally think this is wrong: I think the RRDP/delta model, with web-based bulk access to the current bootstrap state, is a far better model. It is faster, it is inherently more coherent with respect to the state of a specific change, and it scales wonderfully. I really think we need to move in the IETF on rsync deprecation.

cheers

-George

On Thu, Feb 18, 2021 at 3:00 AM Nick Hilliard <nick@foobar.org> wrote:
Nathalie Trenaman wrote on 17/02/2021 15:16:
I stand corrected. Tim pointed out to me that APNIC has their own code base, always has had, and it is actually older than the RIPE NCC code. AFRINIC runs a fork from APNIC. ARIN used our code base around 2010 for their pilot, but implemented their own code base from scratch. LACNIC uses the RPKI object library but has their own CA and publication stack.
Hi Nathalie,
Ok, was curious. The reason I ask is because the overall complexity of this software is pretty high and we've seen a number of RPKI CA / publication failures from different registries over the last year.
If each RIR is using their own internally-developed implementation, then they're also the only organisation testing it and developing monitoring systems for it, which means that in each case, exposure to interesting and unusual corner cases is likely to be pretty low. That's in comparison to other software suites where there might be anything from tens to millions of copies in production, and complex bugs tend to get flushed out quickly. This is an observation more than anything else. Obviously the flip side of running in-house software like this is that bugs in one implementation are unlikely to affect any others.
Nick
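
P.S. for anyone who wants to see the shape of the symlink trick described above, here is a minimal sketch in Python of the atomic swap at the publication root. The paths, module name and publish() helper are purely illustrative, not taken from any RIR's actual deployment; the only assumption is that the path named in rsyncd.conf is a symlink rather than a real directory.

    #!/usr/bin/env python3
    # Sketch only: atomically repoint the symlink that rsyncd.conf names.
    # Assumed (illustrative) layout:
    #
    #   rsyncd.conf:
    #     [repository]
    #     path = /srv/rsync/current        <- "current" is a symlink
    #
    #   /srv/rsync/states/<timestamp>/     <- complete repository states
    #
    # Each forked rsync --daemon connection resolves the module path when
    # the client connects, so open connections keep reading the old state
    # directory and can drain, while new connections follow the symlink
    # and see the new repository.

    import os

    PUBLICATION_ROOT = "/srv/rsync/current"   # the path in rsyncd.conf

    def publish(new_state_dir: str) -> None:
        """Repoint the publication-root symlink at new_state_dir atomically."""
        tmp = PUBLICATION_ROOT + ".new"
        if os.path.lexists(tmp):
            os.remove(tmp)                     # clear any stale temp link
        os.symlink(new_state_dir, tmp)         # build the new link alongside
        os.replace(tmp, PUBLICATION_ROOT)      # rename(2) is atomic: readers
                                               # see the old target or the new
                                               # one, never a half-made link

    if __name__ == "__main__":
        # e.g. after a complete new repository state has been written out:
        publish("/srv/rsync/states/2021-02-18T03-00Z")

The key point is that the swap happens *at* the publication root, never below it: everything underneath is an immutable, fully-written state directory, and the only mutation the live daemons can observe is the single atomic rename of the link.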