Hi Job.
On 16 Feb 2022, at 15:05, Job Snijders via routing-wg <routing-wg@ripe.net> wrote:
Hi all,
I noticed the RIPE NCC RRDP service (https://rrdp.ripe.net/) became unreachable at 2022-02-16 13:34:10 UTC+0 (and is still down).
Ouch. The fallback to rsync was caused by a DNS misconfiguration (which should have recovered).
This RRDP outage should not pose an issue for most RPKI validators, because most RPKI cache implementations (which follow best practices) will attempt to synchronize via RSYNC when RRDP is unavailable.
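The fallback behaviour described above can be sketched roughly as follows. This is only an illustration, not how any particular validator implements it: `sync_repository`, `fetch_rrdp` and `fetch_rsync` are hypothetical names, and treating every `OSError` (DNS failure, timeout, connection refused) as a reason to fall back is an assumption.

```python
from typing import Callable, Tuple


def sync_repository(
    fetch_rrdp: Callable[[], bytes],
    fetch_rsync: Callable[[], bytes],
) -> Tuple[str, bytes]:
    """Try RRDP first; if it is unreachable, fall back to rsync.

    Both fetchers are hypothetical callables supplied by the caller.
    Returns the transport that succeeded and the data it produced.
    """
    try:
        return "rrdp", fetch_rrdp()
    except OSError:
        # Assumption: network-level failures surface as OSError
        # (this includes DNS resolution errors and timeouts).
        return "rsync", fetch_rsync()
```

With a failing RRDP fetcher and a working rsync fetcher, the function reports `"rsync"` as the transport used, which is the situation that pushed the extra load onto the rsync servers here.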
However, it seems RIPE NCC adjusted the default rsyncd settings and lowered the concurrent connection count from 200 (which already is too low for RPKI Repository Servers) to 150?
    $ rsync --no-motd -rt rsync://rpki.ripe.net/repository/
    @ERROR: max connections (150) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1666) [Receiver=3.1.2]
I'm not familiar with the RIPE RPKI RSYNC service architecture, so the above error could be misleading: perhaps there is a loadbalancer distributing TCP sessions across multiple backends, each backend configured to serve up to 150 clients? Or perhaps there is a single rsyncd instance (in which case 150 definitely is too low).
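For clients that hit this limit, one possible way to honour the "try again later" hint is to recognise the daemon message and retry with increasing delays. The sketch below is an assumption-laden illustration: the function names, the 5 second base delay and the 300 second cap are all hypothetical, and only the error text itself is taken from the output quoted above.

```python
import re
from typing import List, Optional

# Matches the daemon message quoted above, for example:
# "@ERROR: max connections (150) reached -- try again later"
MAX_CONN_RE = re.compile(r"@ERROR: max connections \((\d+)\) reached")


def max_connections_limit(stderr: str) -> Optional[int]:
    """Return the limit reported by rsyncd, or None if the error is absent."""
    match = MAX_CONN_RE.search(stderr)
    return int(match.group(1)) if match else None


def backoff_delays(attempts: int, base: float = 5.0, cap: float = 300.0) -> List[float]:
    """Exponential backoff schedule in seconds (base and cap are illustrative)."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

Note that the reported number is the limit of whichever rsyncd instance answered; if several backends sit behind a load balancer, the aggregate capacity would be a multiple of it.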
We have described our rsync infrastructure extensively in earlier messages (e.g. [0]). There are multiple instances behind a load-balancer. The current storage is on NFS, which has a performance limitation - it peaked at about 80K operations/second (2m average).

We will follow up with a more detailed post-mortem.

Kind regards,
Ties

[0]: https://www.ripe.net/ripe/mail/archives/routing-wg/2021-June/004351.html