Issue affecting rsync RPKI repository fetching
Dear colleagues,

We have been made aware of an issue that may affect some users who use RPKI relying party (RP) software that fetches over rsync. Please note that by default, only rpki-client reads from rsync; the other RPs prefer the RPKI Repository Delta Protocol (RRDP).

The issue appears to create some inconsistency between the RPKI repository and rsync clients. In more detail, an RRDP client reads a complete state for a specific "serial" from the repository. In contrast, an rsync client syncs the state in multiple steps: first a list of files is transferred, followed by the contents of the files on that list. In an affected scenario, a certificate is added and one of the other files (the manifest) is modified after the file list has been sent. Because the client reads the new manifest but does not copy the new file (it is not on the rsync file list), the repository copied by the rsync client contains an invalid manifest (a file is missing) and the RP software rejects it.

We are planning to change our publication infrastructure to use, for the content of the rsync repository, the same "revisions" that RRDP uses. Rsync is an officially supported distribution protocol for RPKI repository data, and it is one of our highest priorities that the data published is atomic and consistent. We plan to release the new publication infrastructure in Q2/Q3 2021. Part of this work will mitigate these non-repeatable reads for clients using rsync.

We will update you on our progress during RIPE 82, taking place online from 17-21 May 2021.

Kind regards,
Nathalie Trenaman
RIPE NCC
Hi Nathalie,

Thank you for addressing this RIPE NCC infra issue. It looks like the RIPE NCC RPKI infra for rsync is updating the ROAs in the same directory while the RPKI clients that use rsync are still fetching the files. This is a well-known pitfall when using rsync, but combined with MD5 or cryptographic checks on the files it becomes a real issue.

It is best practice to dump the files in a specific (timestamped) directory, symlink the download location to that timestamped directory, and keep things read-only once they are written to disk, so there are no in-place updates that would cause crypto or MD5 hash mismatches. Once there is new RPKI data, create a new timestamped directory, move the symlink to the new location and be done with it.

As an rpki-client user who is happy with the security within the software, which starts to barf over improper RPKI data, as one should hope it would, I would like to ask the NCC to update their rsync method quicker than 'perhaps in 6 months'. This looks to be a 3-line bash script fix on a cronjob. So why isn't this just tested on a testbed and deployed before the end of the week?

Regards,
Erik Bais

From: routing-wg <routing-wg-bounces@ripe.net> on behalf of Nathalie Trenaman <nathalie@ripe.net>
Date: Monday 12 April 2021 at 12:04
To: "routing-wg@ripe.net" <routing-wg@ripe.net>
Subject: [routing-wg] Issue affecting rsync RPKI repository fetching
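The timestamped-directory and symlink-swap pattern Erik describes can be sketched in a few lines of shell. This is a minimal illustration, not RIPE NCC's actual tooling; every path and file name below is invented:

```shell
#!/bin/sh
# Minimal sketch of "write a new timestamped directory, then swing a
# symlink". All paths and file names are illustrative.
set -eu
repo=$(mktemp -d)                        # stand-in for the rsync export root
new="$repo/$(date -u +%Y%m%dT%H%M%SZ)"   # one directory per repository state

mkdir "$new"
echo "manifest" > "$new/manifest.mft"    # stand-in for the signed object set
chmod -R a-w "$new"                      # read-only once written to disk

# Publish by replacing a symlink. Creating the link under a temporary name
# and then renaming it is a single rename(2), so a client never sees a
# half-written tree. (Replacing an existing symlink to a directory needs
# `mv -T` with GNU mv, to avoid descending into the old target.)
ln -s "$new" "$repo/current.tmp"
mv "$repo/current.tmp" "$repo/current"

readlink "$repo/current"                 # -> the new snapshot directory
```

The rsync daemon would then export `current/`, and a cleanup job can remove old snapshot directories once clients have drained.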
Erik Bais wrote on 12/04/2021 11:41:
This looks to be a 3 line bash script fix on a cronjob … So why isn’t this just tested on a testbed and updated before the end of the week ?
Cache coherency and transaction guarantees are problems which are known to be difficult to solve. Long term, the RIPE NCC probably needs to aim for the degree of transactional read consistency that might be provided by an ACID database or something equivalent, and that will take time and probably a migration away from a filesystem-based data store.

So the question is what to do in the interim? The bash script that Job posted may help to reduce some of the race conditions that are being observed, but it's unlikely to guarantee transaction consistency in a deep sense. Does the RIPE NCC have an opinion on whether the approach used in this script would help with some of the problems that are being seen and, if so, would there be any significant negative effects associated with implementing it as an intermediate patch-up?

Nick
On Mon, Apr 12, 2021 at 02:12:10PM +0100, Nick Hilliard wrote:
Erik Bais wrote on 12/04/2021 11:41:
This looks to be a 3 line bash script fix on a cronjob … So why isn’t this just tested on a testbed and updated before the end of the week ?
cache coherency and transaction guarantees are problems which are known to be difficult to solve. Long term, the RIPE NCC probably needs to aim for the degree of transaction read consistency that might be provided by an ACID database or something equivalent, and that will take time and probably a migration away from a filesystem-based data store.
So the question is what to do in the interim? The bash script that Job posted may help to reduce some of the race conditions that are being observed, but it's unlikely to guarantee transaction consistency in a deep sense. Does the RIPE NCC have an opinion on whether the approach used in this script would help with some of the problems that are being seen and if so, would there be any significant negative effects associated with implementing it as an intermediate patch-up?
Perhaps the script [0] can be of use, or perhaps not. The script assumes a POSIX-ish compliant environment. It is not clear to me what software process runs where and how RIPE NCC runs their publication service.

The core problem seems to me that while RSYNC clients are connected, the RIPE NCC RPKI server appears to 'pull the rug' from underneath them. This practice reduces the reliability of the RIPE NCC RPKI service.

I can only guess how exactly the RIPE NCC RPKI publication service is configured, but I imagine there is a 'Signer Server' which writes the few thousand individual RPKI objects to disk, and separately there is an RSYNC server (rpki.ripe.net) which serves the files to RSYNC clients. Transferring sets of inter-related files around is a 'batch' operation; the pipeline should be set up accordingly. As such, calling 'rsync' from crontab to populate the rpki.ripe.net rsync server would likely lead to inconsistent results.

There are (at least) two objectives to keep in mind:

1/ While the Signer software is writing new files out to disk, the 'signer to publisher' replication process should not run, because the signer isn't finished yet.

2/ While a given RSYNC client is fetching from 'rpki.ripe.net', the 'signer to publisher' replication process should not alter the contents of the filesystem hierarchy the RSYNC client is fetching from.

To satisfy the above two conditions, I suspect a number of solutions are available:

A) Take ownership and control, and only launch subsequent pipeline steps when the Signer is done signing the latest requests. After a consistent set of files has been written to disk, only then copy, stage, and switch to the new directory contents using a symlink swap (allowing already-connected RSYNC clients to complete their fetch).

B) Use a load balancer to direct new RSYNC clients to an RSYNC server containing the latest (consistent) set of files.

C) Make the RSYNC service pull from the latest (allegedly consistent) RRDP snapshot.xml file, then move newly connected clients to the new content using either the symlink trick [0] or by orchestrating draining/onramping via a load balancer like haproxy.

There is a wealth of knowledge available in this working group on how POSIX-like systems work, how ISP operations work, and how the RPKI works; I hope RIPE NCC can leverage it.

Kind regards,

Job

[0]: http://sobornost.net/~job/rpki-rsync-move.sh.txt
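Objective 1/ above can be sketched as a replication step gated on the signer. The lock protocol, the paths, and the use of cp in place of rsync are all assumptions made for illustration, not a description of RIPE NCC's actual pipeline:

```shell
#!/bin/sh
# Sketch of objective 1/: don't start the signer-to-publisher copy while
# the signer is mid-write. The mkdir-based lock and all paths are invented
# for illustration.
set -eu
src=$(mktemp -d)     # stand-in for the Signer Server's output directory
dst=$(mktemp -d)     # stand-in for the publisher's staging directory
lock="$src.lock"     # the signer is assumed to take this while writing

echo "roa" > "$src/example.roa"

# mkdir is atomic: it either creates the lock directory or fails, so the
# signer and the replicator cannot both think they hold the lock.
until mkdir "$lock" 2>/dev/null; do sleep 1; done
cp -R "$src/." "$dst/"    # in practice: rsync -a --delete
rmdir "$lock"

# Objective 2/ is then met on the publisher side with a symlink swap, so
# already-connected RSYNC clients keep draining from the old tree.
```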
Job Snijders wrote on 12/04/2021 16:10:
There is a wealth of knowledge available in this working group on how POSIX-like systems work, how ISP operations work, and the RPKI works, I hope RIPE NCC can leverage that.
I'm curious about the scale of the issue here. Would someone from the RIPE NCC be able to disclose how many rsync clients they're seeing? And what percentage of the total RPKI client population (i.e. rsync + RRDP) that is? I.e. how much attention needs to be given to resolving this issue?

Nick
I'm curious about the scale of the issue here.
if you make an approximation that all RPs touch all PPs, you may find this useful:

John Kristoff, Randy Bush, Chris Kanich, George Michaelson, Amreesh Phokeer, Thomas Schmidt, Matthias Wählisch. On Measuring RPKI Relying Parties, ACM IMC 2020.
https://archive.psg.com/201029.imc-rp.pdf

randy

---
randy@psg.com
`gpg --locate-external-keys --auto-key-locate wkd randy@psg.com`
signatures are back, thanks to dmarc header butchery
Not to detract from the paper Randy posted, in any way. For APRICOT 2021 I reported to the APNIC routing security SIG as follows. As of Feb 2021:

• 1,009 total ASNs reading the APNIC RPKI data every day
– 902 distinct ASNs collecting using the RRDP protocol (https)
– 927 distinct ASNs collecting via rsync

So for us, they are mostly coincident sets of ASNs. For APNIC's publication point, the scale of the issue is "almost everybody does both".

What's the size of the problem? The size of the problem is the likelihood of updating rsync state whilst it's being fetched. There is a simple model to avoid this which has been noted: modify discrete directories, swing a symlink so the change from "old" to "new" state is as atomic as possible. rsync on UNIX is chroot() and the existing bound clients will drain from the old state. Then run some cleanup process.

But, if you are caught behind a process which writes to the actively served rsync directory, the "size" of the subdirectory structure is an indication of the risk of a manifest being written to whilst being fetched. Yes, in absolute terms it could happen to a 1-ROA manifest, but it is more likely to happen in a manifest of some size. The "cost" of a non-atomic update is higher, and I believe the risk is higher. The risk is computing the checksums, writing the manifest and signing it while some asynchronous update is happening, and whilst people are fetching the state. RIRs host thousands of children, so we each have at least one manifest covering those certificated products which is significantly larger and more likely to trigger a problem.

    ggm@host6 repository % find . -type d | xargs -n 1 -I {} echo "ls {} | wc -l" | sh | sort | uniq -c
       1 2
       1 3
       1 7
       1 8
       1 9
       1 52
       1 147
       1 3352
    ggm@host6 repository %

Our hosted solution has this structure. Most children by far have fewer than 10 objects.

    ggm@host6 member_repository % find . -type d | xargs -n 1 -I {} echo "ls {} | wc -l" | sh | sort | uniq -c
    2997 1
    1697 2
    2099 3
     560 4
     229 5
      96 6
      44 7
      22 8
      17 9
      11 10
       6 11
       5 12
       5 13
       6 14
       2 15
       4 16
       2 17
       2 18
       1 20
       1 23
       1 25
       1 27
       1 28
       1 29
       1 34
       1 38
       1 40
       3 42
       1 46
       1 60
       1 97
       1 848
    ggm@host6 member_repository %
It has been pointed out to me that I must have meant chdir() when I said chroot(). Sorry. rsync --daemon does chdir() into the target of a symlink when it runs. So changing the symlink which an earlier instance has chdir()'d "into" doesn't alter the working directory of that already-forked daemon.

-G
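George's point can be demonstrated in a few lines of shell; "old", "new" and "current" are made-up names standing in for snapshot directories and the served symlink:

```shell
#!/bin/sh
# Demonstrate that a process which has already chdir()'d through a symlink
# keeps its directory after the symlink is repointed. Names are made up.
set -eu
top=$(mktemp -d); cd "$top"
mkdir old new
echo "old-state" > old/f
echo "new-state" > new/f
ln -s old current

cd current                 # an "rsync client": chdir resolves the link now
ln -sfn new ../current     # the publisher swaps the link under us
cat f                      # prints "old-state": our cwd did not move
cat "$top/current/f"       # prints "new-state": new clients get the new tree
```

This is why already-connected daemon instances drain from the old state while newly connected clients see the new one.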
George Michaelson wrote on 13/04/2021 05:40:
As of Feb 2021 •1,009 total ASNs reading the APNIC RPKI data every day –902 distinct ASNs collecting using RRDP protocol (https) –927 distinct ASNs collecting via rsync
mmm, interesting. full preso here:
https://conference.apnic.net/51/assets/files/APSr481/routing-security-sig-rp...
Would it be possible to drill down into these figures a bit more? I.e. is it possible to work out how many are pulling the TAL via rsync, but then using RRDP to synchronise their local instances? Or, equivalently, how many people are using rsync for everything? Either figure will give ~ the other.

Pulling the TAL via rsync and then using RRDP for everything else is not a scenario that needs to be taken into account for this rsync consistency issue.

Nick
Would it be possible to drill down into these figures a bit more? I.e. is it possible to work out how many are pulling the TAL via rsync, but then using rrdp to synchronise their local instances? Or
that came out badly garbled. I meant: how many clients were pulling the trust anchor certs via rsync due to having older TALs installed on the local RP cache, and then downloading the manifest/ROA data via RRDP afterwards, because the TA certificate contains both rsync and https locator information and the RP software was able to select RRDP instead of rsync when that was presented as an option.

Nick
Hi Nick,
On 13 Apr 2021, at 15:33, Nick Hilliard <nick@foobar.org> wrote:
Would it be possible to drill down into these figures a bit more? I.e. is it possible to work out how many are pulling the TAL via rsync, but then using rrdp to synchronise their local instances? Or
that came out badly garbled. I meant how many clients were pulling the trust anchor certs via rsync due to having older TALs installed on the local RP cache, and then downloading the manifest/roas data via rrdp afterwards because the TA contains both rsync and https locator information, and the RP software was able to select rrdp instead of rsync because that was presented as an option.
For RIPE I can give more details on what we see. Because of the structure of our repository, we can split out clients connecting over rsync to retrieve the trust anchor from those connecting to the main repository. We do see a change on the 2nd of April, so I'm providing data both for the week before and after this date. The cause for this change is unknown.

In the week leading up to the 2nd of April, on average per day we see:

* 192 unique IPs (from 182 /24s or /64s) creating 8636 connections to /repository
* 911 unique IPs (from 721 /24s or /64s) creating 81855 connections to /ta

In the week starting on the 2nd of April, on average per day we see:

* 598 unique IPs (from 582 /24s or /64s) creating 17594 connections to /repository
* 1301 unique IPs (from 1114 /24s or /64s) creating 89675 connections to /ta

Traffic also increased from ~34 to ~73GB an hour (for rsync). We see ~1086 unique IPs accessing the TA certificate over HTTPS per day.

Kind regards,
Ties
I'll see if I can do that from the log stream. It may take some time. cheers -G On Tue, Apr 13, 2021 at 10:39 PM Nick Hilliard <nick@foobar.org> wrote:
George Michaelson wrote on 13/04/2021 05:40:
As of Feb 2021 •1,009 total ASNs reading the APNIC RPKI data every day –902 distinct ASNs collecting using RRDP protocol (https) –927 distinct ASNs collecting via rsync
mmm, interesting. full preso here:
https://conference.apnic.net/51/assets/files/APSr481/routing-security-sig-rp...
Would it be possible to drill down into these figures a bit more? I.e. is it possible to work out how many are pulling the TAL via rsync, but then using rrdp to synchronise their local instances? Or equivalently, how many people are using rsync for everyone? Either figure will give ~ the other.
Pulling the TAL via rsync and then using rrdp for everything else is not a scenario that needs to be taken into account for this rsync consistency issue.
Nick
Dear colleagues,

First, let me give you an overview of our rsync infrastructure and the situation encountered by a client. Afterwards, I will describe the context of our application and repository, and how that limits our design space.

RPKI objects are created on machines in an isolated network. The active machine writes new objects to an NFS share (with replicated storage in two data centres). The rsync machines (outside the isolated network) serve these files and sit behind a load balancer.

Sets of objects to be updated (for example a manifest, CRL, and certificate) are written to a staging directory by the application. After all the files are created, they are moved into the main repository directory. There is a small period between these moves. In a recent incident that was reported to us, this was ~30ms, with files written in an order where the repository was correct on disk at all times. This part of the code has been in place since 2012.

While the files are written to the filesystem, they are also sent to a (draft version of RFC 8181) publication server. The files are sent atomically in one message. The publication is synchronous: when a ROA is created, it is immediately published to rsync and the publication server.

The affected client in the reported incident read the file list *before* the new certificate was present, but checked the content of (and copied) the updated manifest, which referred to a certificate that was not present when the file list was created. In the rest of this message, we will call this situation a non-repeatable read: part of the repository reflects one state while another part reflects a different state.

On April 12, we published 41,600 sets of objects. This resulted in 41,600 distinct repository states on disk. The RIPE NCC repository contains ~65,500 files in ~34,500 directories, with a total size of 157MB. The repository is consistent (on disk) when the application is not publishing objects. The repository is also consistent (for a slow client) when no files are added or changed after the rsync client starts retrieving the file list.

Copying the repository without coordination from the application (i.e. to spool it) has the same risk of a non-repeatable read as rsync clients have. However, in this case it would affect many clients for an extended period, and mask rather than solve the underlying issue. Other approaches (such as snapshotting) also have limitations that make them untenable.

The RPKI application does not support writing the complete repository to disk for each state (as needed for spooling the repository as proposed in the scripts). Synchronously writing every state of the repository to disk is not feasible, given our update frequency and repository size. Functionality for asynchronously writing the repository to disk needs to be developed. We have two paths to develop this:

- The first is a new daemon that writes to disk from the database state at a set interval.
- The second is using RRDP as a source of truth and writing the repository to disk from that.

Furthermore, we would need to migrate the storage from NFS to get faster writes. Both approaches need an extended period for validation, and we are not able to deploy them within a few weeks. The latter approach (using RRDP) carries less risk and is the option we are aiming for at the moment. We plan to release the new publication infrastructure in Q2/Q3 2021 and hope to migrate earlier.

I'm happy to answer any further questions you may have.

Kind regards,
Ties de Kock
RIPE NCC
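The window Ties describes (each move atomic, but the set of moves not) can be illustrated in shell. The filenames and contents below are invented; only the ordering matters:

```shell
#!/bin/sh
# Illustration of the non-repeatable read window. Each mv below is an
# atomic rename(2), but the *set* of moves is not atomic. Filenames are
# invented; the ordering is the point.
set -eu
repo=$(mktemp -d); stage=$(mktemp -d)
echo "manifest v1" > "$repo/manifest.mft"    # repository state before update

# The application writes the complete new set to a staging directory...
echo "cert v2"     > "$stage/new.cer"
echo "manifest v2" > "$stage/manifest.mft"   # v2 references new.cer

# ...then moves it in. An rsync client that built its file list before
# this point lists only manifest.mft; if it transfers file contents after
# both moves, it copies the v2 manifest but never fetches new.cer.
mv "$stage/new.cer"      "$repo/new.cer"
mv "$stage/manifest.mft" "$repo/manifest.mft"   # ~30ms after the first mv
```

On disk the repository is valid before and after both moves; only a reader whose file list and file transfers straddle the window sees the inconsistent pairing.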
Dear Ties, group, Thank you for the outline. On Wed, Apr 14, 2021 at 02:33:37PM +0200, Ties de Kock wrote:
The RPKI application does not support writing the complete repository to disk for each state (as needed for spooling the repository as proposed in scripts). Synchronously writing every state of the repository to disk is not feasible, given our update frequency and repository size. Functionality for asynchronously writing the repository to disk needs to be developed. We have two paths to develop this: - The first is a new daemon that writes to disk from the database state at a set interval. - The second one is using RRDP as a source of truth and writing the repository to disk. Furthermore, we would need to migrate the storage from NFS to have faster writes.
Both approaches need an extended period for validation and we are not able to deploy these within a few weeks. The latter approach (using RRDP) has less risk and is the option we are aiming for at the moment. We plan to release the new publication infrastructure in Q2/Q3 2021 and hope to migrate earlier.
The "RRDP as source of truth" approach indeed seems the more appealing (and simpler!) option. I would encourage the NCC to follow that path.

In the meantime, can https://www.ripe.net/support/service-announcements/service-announcements/cur... be updated to reflect that there are known race conditions and problems with the RIPE NCC RSYNC service? Are there any other tweaks the NCC can think of to reduce the operational pain? Maybe increasing the publication interval?

Kind regards,

Job
Hi Nathalie, On 04/12, Nathalie Trenaman wrote:
Dear colleagues,
<snip/>
We are planning on changing our publication infrastructure and using the same "revisions" RRDP uses for the content of the rsync repository. Rsync is an officially supported distribution protocol for RPKI repository data, and it is one of our highest priorities that the data published is atomic and consistent. We plan to release the new publication infrastructure in Q2/Q3 2021. Part of this work will mitigate these non-repeatable-reads for clients using rsync.
We will update you on our progress during RIPE 82, taking place online from 17-21 May 2021.
The above description seems to suggest that repository access via rsync is an optional extra that the RIPE NCC provides as a value-add. However, as of course you know, rsync is the only mandatory-to-implement access method *today*, and we are not yet on an agreed path towards an RRDP-only future.

It seems to me that this issue deserves the same urgency as "the publication server periodically reboots when used correctly". If that requires a workaround (of the form suggested by others) now, and then a redesign in 6 months, fine. But it is a disservice to the Internet community as a whole to skip the "workaround now" step.

Cheers,
Ben
participants (8)

- Ben Maddison
- Erik Bais
- George Michaelson
- Job Snijders
- Nathalie Trenaman
- Nick Hilliard
- Randy Bush
- Ties de Kock