RPKI Validator 3: disable fetching from certain repos?
nusenu writes:
Hi,
some URLs basically fail all the time (Timeout).
Is there a way to tell validator "stop trying to connect to them"?
https://rpki.cnnic.cn/rrdp/notify.xml: java.util.concurrent.TimeoutException https://rpkica.twnic.tw/rrdp/notify.xml: java.util.concurrent.TimeoutException
this has been reported a while ago: https://github.com/RIPE-NCC/rpki-validator-3/issues/45
kind regards, nusenu
Hi nusenu,

I agree that something is wrong here and a different behavior would be good, but I don't think we want the folks who operate relying party software instances configuring their RPs to never again try to retrieve from certain repositories. We need to make sure that when the repos eventually do get fixed, that all RPs once again will retrieve from them.

Ideally, all CAs would closely and continually watch their children, letting them know promptly when there are any problems including the complete inability to retrieve as we see now. Several folks have been in contact with APNIC, who acknowledges the problems with cnnic.cn (longstanding) and twnic.tw (more recent). As of earlier today, APNIC seems optimistic that these situations will both improve in the coming days. Let's wait and see.

Probably a better behavior for rpki-validator-3 to take to avoid needlessly filling up logs, etc., with failed attempts would be to back off when re-trying unreachable repos. If a normally-reachable repo suddenly goes quiet, re-try a few times as normal, but then gradually increase the time until the next attempt, up to some maximum interval -- possibly several hours.

Thanks.
Jay B.
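The back-off behavior proposed above could look roughly like the sketch below. It is not taken from rpki-validator-3; the function name `next_delay` and all three interval values are assumptions chosen only to illustrate "retry a few times as normal, then double the wait up to a cap".

```shell
#!/bin/sh
# Hypothetical sketch of a capped back-off schedule.
# None of these values are rpki-validator-3 settings; they are assumptions.
BASE_INTERVAL=600      # normal retry interval in seconds (assumed)
NORMAL_RETRIES=3       # consecutive failures tolerated before backing off (assumed)
MAX_INTERVAL=14400     # upper bound on the wait: 4 hours (assumed)

# next_delay FAILURES -> prints the seconds to wait before the next attempt
next_delay() {
    failures=$1
    delay=$BASE_INTERVAL
    # Double the delay once for every failure beyond the normal retries,
    # stopping as soon as the cap is reached.
    while [ "$failures" -gt "$NORMAL_RETRIES" ] && [ "$delay" -lt "$MAX_INTERVAL" ]; do
        delay=$((delay * 2))
        failures=$((failures - 1))
    done
    if [ "$delay" -gt "$MAX_INTERVAL" ]; then
        delay=$MAX_INTERVAL
    fi
    echo "$delay"
}
```

With these assumed values the first three failures keep the normal 600 second interval, the fourth waits 1200 seconds, and from the eighth failure onward the wait stays at the 14400 second cap. A successful fetch would reset the failure count to zero, so a repaired repository is picked up again at the normal rate.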
Jay Borkenhagen:
I agree that something is wrong here and a different behavior would be good, but I don't think we want the folks who operate relying party software instances configuring their RPs to never again try to retrieve from certain repositories. We need to make sure that when the repos eventually do get fixed, that all RPs once again will retrieve from them.
you are right, good point!
Ideally, all CAs would closely and continually watch their children, letting them know promptly when there are any problems including the complete inability to retrieve as we see now. Several folks have been in contact with APNIC, who acknowledges the problems with cnnic.cn (longstanding) and twnic.tw (more recent). As of earlier today, APNIC seems optimistic that these situations will both improve in the coming days. Let's wait and see.
that is great news, can we follow that progress somewhere?
Probably a better behavior for rpki-validator-3 to take to avoid needlessly filling up logs, etc., with failed attempts would be to back off when re-trying unreachable repos. If a normally-reachable repo suddenly goes quiet, re-try a few times as normal, but then gradually increase the time until the next attempt, up to some maximum interval -- possibly several hours.
yes, this sounds reasonable.

kind regards, nusenu
-- https://twitter.com/nusenu_ https://mastodon.social/@nusenu
On Wed, 10 Oct 2018 at 17:06, nusenu <nusenu-lists@riseup.net> wrote:
Ideally, all CAs would closely and continually watch their children, letting them know promptly when there are any problems including the complete inability to retrieve as we see now. Several folks have been in contact with APNIC, who acknowledges the problems with cnnic.cn (longstanding) and twnic.tw (more recent). As of earlier today, APNIC seems optimistic that these situations will both improve in the coming days. Let's wait and see.
that is great news, can we follow that progress somewhere?
I’m afraid we’ll know when we know. Kind regards, Job
Hi [disclaimer, I am no longer involved in RIPE NCC implementation, but I am now involved in NLnet Labs implementation]
On 10 Oct 2018, at 10:05, nusenu <nusenu-lists@riseup.net> wrote:
Jay Borkenhagen:
Probably a better behavior for rpki-validator-3 to take to avoid needlessly filling up logs, etc., with failed attempts would be to back off when re-trying unreachable repos. If a normally-reachable repo suddenly goes quiet, re-try a few times as normal, but then gradually increase the time until the next attempt, up to some maximum interval -- possibly several hours.
yes, this sounds reasonable.
Imho retries are cheap, but it is better to bother the logs, and the operator, only if there are state changes.

Related, the trust anchors overview page also shows the current state of repository availability to operators: https://rpki-validator.ripe.net/trust-anchors

But, there seems to be a bug in the validator when viewing the page for the APNIC TA.

Tim
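The "log only on state changes" idea above can be sketched as follows. This is an illustration, not how the validator works: the function `record_state`, the `STATE_DIR` location and the state labels are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: keep each repository's last known state in a file and
# print a log line only when the state differs from the stored one.
STATE_DIR=${STATE_DIR:-/tmp/repo-state}   # assumed location for state files

# record_state REPO_ID STATE -> prints "REPO_ID: old -> new" only on a change
record_state() {
    repo_id=$1
    new_state=$2
    state_file="$STATE_DIR/$repo_id"
    mkdir -p "$STATE_DIR"
    old_state=unknown
    if [ -f "$state_file" ]; then
        old_state=$(cat "$state_file")
    fi
    if [ "$new_state" != "$old_state" ]; then
        echo "$repo_id: $old_state -> $new_state"
        printf '%s\n' "$new_state" > "$state_file"
    fi
}
```

Calling `record_state cnnic unreachable` repeatedly prints one line on the first call and nothing afterwards, until a later `record_state cnnic reachable` reports the recovery. Retries can then stay frequent without each failed attempt producing a log entry.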
nusenu:
https://rpki.cnnic.cn/rrdp/notify.xml: java.util.concurrent.TimeoutException https://rpkica.twnic.tw/rrdp/notify.xml: java.util.concurrent.TimeoutException
are they generally unavailable or are they just answering to a limited set of source IPs? (depending on geolocation of the source IP)

-- https://twitter.com/nusenu_ https://mastodon.social/@nusenu
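One way to approach the question above is to probe the notification URLs from hosts in different networks and compare the results. The sketch below assumes `curl` is installed; the function names `probe` and `classify_curl_exit`, the labels they print, and the 15 second limit are illustrative choices, not part of any RPKI tooling.

```shell
#!/bin/sh
# Hypothetical sketch: label common curl exit codes so results from several
# vantage points are easy to compare.

# classify_curl_exit CODE -> prints a short label for the curl exit code
classify_curl_exit() {
    case "$1" in
        0)     echo "reachable" ;;
        6)     echo "dns-failure" ;;
        7)     echo "connect-failed" ;;
        22)    echo "http-error" ;;
        28)    echo "timeout" ;;
        35|60) echo "tls-problem" ;;
        *)     echo "other-error-$1" ;;
    esac
}

# probe URL -> fetches URL with a 15 second limit (assumed) and prints the label
probe() {
    curl --silent --fail --output /dev/null --max-time 15 "$1"
    classify_curl_exit $?
}
```

Running `probe https://rpki.cnnic.cn/rrdp/notify.xml` on several hosts and getting `timeout` everywhere would point to general unavailability, while mixed results would suggest filtering by source address.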
There is a problem with the RRDP pub-point in TWNIC which we (APNIC) are discussing with them. They did not appreciate at first that a 443 bound service would have to be publicly visible and wish to re-architect things to move this to a place outside their firewall.

It doesn't appear logistically simple to disable the certificated declaration right now, I think we might have an operations discussion about timeouts and risks here.

Basically, "they know there is a publicly visible problem" and "they are working on it"

-George

On Thu, Oct 11, 2018 at 11:15 PM nusenu <nusenu-lists@riseup.net> wrote:
nusenu:
https://rpki.cnnic.cn/rrdp/notify.xml: java.util.concurrent.TimeoutException https://rpkica.twnic.tw/rrdp/notify.xml: java.util.concurrent.TimeoutException
are they generally unavailable or are they just answering to a limited set of source IPs? (depending on geolocation of the source IP)
-- https://twitter.com/nusenu_ https://mastodon.social/@nusenu
Thanks, George!

For nusenu and other folks here running RIPE's rpki-validator-3: earlier this summer folks at RIPE told me that their current code does not forget about de-referenced repositories. The upshot is that once a validator learns of a repository, it will forever continue trying to reach it, even after upstream repositories no longer point to it. RIPE's developers know this is not a desirable behavior, and fixing it is on their to-do list.

In the meantime, if one wants to stop these extra failed retrievals and eliminate the noise in the logs, one may empty the cache and start it again. Here are the steps for rpki-validator-3 installations based on RIPE's CentOS 7 packages:

##############################
sudo systemctl stop rpki-validator-3
ls -l /var/lib/rpki-validator-3/db/
sudo -u rpki sh -c '> /var/lib/rpki-validator-3/db/rpki-validator.h2.mv.db'
sudo -u rpki sh -c '> /var/lib/rpki-validator-3/db/rpki-validator.h2.trace.db'
ls -l /var/lib/rpki-validator-3/db/   # just to confirm empty caches
sudo systemctl start rpki-validator-3

In case you use any extra TALs, they will need to be uploaded again. I do load extra TALs, so I continued with:

sleep 30   # wait until rpki-validator-3 is running, then:
upload-tal.sh ~/arin-ripevalidator.tal http://localhost:8080/
upload-tal.sh ~/altca.tal http://localhost:8080/
sudo systemctl restart rpki-validator-3
##############################

I have executed the steps above on my boxes today. The error messages regarding rpki.cnnic.cn and rpkica.twnic.net.tw have now ceased. For now I still see errors for https://rpkica.twnic.tw/rrdp/notify.xml, but once the issue George described has been resolved, re-applying the above steps once more will take care of that one, too.

Thanks.
Jay B.

George Michaelson writes:
There is a problem with the RRDP pub-point in TWNIC which we (APNIC) are discussing with them. They did not appreciate at first that a 443 bound service would have to be publicly visible and wish to re-architect things to move this to a place outside their firewall.
It doesn't appear logistically simple to disable the certificated declaration right now, I think we might have an operations discussion about timeouts and risks here.
Basically, "they know there is a publicly visible problem" and "they are working on it"
-George On Thu, Oct 11, 2018 at 11:15 PM nusenu <nusenu-lists@riseup.net> wrote:
nusenu:
https://rpki.cnnic.cn/rrdp/notify.xml: java.util.concurrent.TimeoutException https://rpkica.twnic.tw/rrdp/notify.xml: java.util.concurrent.TimeoutException
are they generally unavailable or are they just answering to a limited set of source IPs? (depending on geolocation of the source IP)
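Regarding the `sleep 30` step in Jay's procedure: a fixed sleep may be too short on a slow host. A small retry helper can wait until the validator actually answers before the TALs are uploaded. The helper name `wait_until` and the attempt and delay values in the usage line are assumptions, and it is also an assumption that a successful HTTP response from http://localhost:8080/ means the validator is ready.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds.

# wait_until ATTEMPTS DELAY COMMAND... -> returns 0 as soon as COMMAND
# succeeds, or 1 if it failed on every attempt
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Sleep only when another attempt will follow.
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

In place of the fixed sleep, one could run `wait_until 30 2 curl --silent --fail --output /dev/null --max-time 5 http://localhost:8080/` and continue with the `upload-tal.sh` calls only if it returns success.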
participants (5)
- George Michaelson
- Jay Borkenhagen
- Job Snijders
- nusenu
- Tim Bruijnzeels