On Mon, 3 Aug 2020 at 12:14, Philip Homburg <notifications@github.com> wrote:
This is not an ideal place to discuss how atlas works. It would be better to use the atlas mailing list (https://lists.ripe.net/mailman/listinfo/ripe-atlas).
In any case, we have 2 controllers for software probes. Those controllers are kept separate from the controllers for hardware probes. We placed both controllers at Hetzner and, unfortunately, Hetzner seems to have had some issues recently (or at least more than I remember). Typically a controller handles about 500 probes, so it will take some time before we get controllers in other places in the world.
Quite a bit of atlas backend logic depends on probes having just one controller at a time, so this is unlikely to change soon.
Sending probes to different controllers happens automatically, but it happens on a time scale of around 6 hours. It seems that the issue at Hetzner was less than 2 hours. However, with all controllers for software probes at Hetzner, a long failure at Hetzner would indeed impact all software probes.
In this case, why not have the software probes set up with a fall-back RIPE Atlas Anchor (controller)? E.g.:
- Set up a list of 2 or more Anchors in hierarchical order,
- If the software probe cannot reach the primary Anchor for more than 10-20 minutes, fall back to the next reachable Anchor,
- Retest connectivity to all configured Anchors every 6 hours,
- Revert to the first reachable Anchor in the locally configured (hierarchical) list.

* The above idea assumes an Anchor can temporarily handle more than the default ~500 probes per Anchor, plus co-location diversity of >= 2 providers and >= 3 Anchors.
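
To make the steps above a bit more concrete, here is a rough Python sketch of the selection logic I have in mind. It is not how the Atlas probe software works today: `FallbackSelector`, `is_reachable`, and the two timing constants are hypothetical names and values I made up for illustration (15 minutes is simply a point inside the 10-20 minute window, and the 6-hour retest matches the interval mentioned above).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumed values, taken from the proposal rather than from any Atlas code.
FALLBACK_AFTER_S = 15 * 60     # fall back after 15 min of unreachability
RETEST_EVERY_S = 6 * 60 * 60   # retest the whole list every 6 hours


@dataclass
class FallbackSelector:
    """Hypothetical probe-side selector over an ordered list of anchors."""
    anchors: List[str]                    # hierarchical order, primary first
    is_reachable: Callable[[str], bool]   # hypothetical reachability check
    current: int = 0                      # index of the anchor in use
    unreachable_since: Optional[float] = None
    last_retest: float = 0.0

    def _first_reachable(self, start: int = 0) -> Optional[int]:
        # Walk the list in priority order, beginning at index `start`.
        for i in range(start, len(self.anchors)):
            if self.is_reachable(self.anchors[i]):
                return i
        return None

    def tick(self, now: float) -> str:
        """Call periodically with the current time in seconds."""
        # Every 6 hours: retest everything, revert to the best reachable one.
        if now - self.last_retest >= RETEST_EVERY_S:
            self.last_retest = now
            best = self._first_reachable()
            if best is not None:
                self.current = best
                self.unreachable_since = None
            return self.anchors[self.current]

        if self.is_reachable(self.anchors[self.current]):
            self.unreachable_since = None
        elif self.unreachable_since is None:
            # First failed check: start the outage timer.
            self.unreachable_since = now
        elif now - self.unreachable_since >= FALLBACK_AFTER_S:
            # Outage lasted long enough: move to the next reachable anchor.
            nxt = self._first_reachable(self.current + 1)
            if nxt is not None:
                self.current = nxt
                self.unreachable_since = None
        return self.anchors[self.current]
```

Note that in this sketch a probe that has fallen back stays on the fall-back anchor until the next 6-hourly retest, even if the primary recovers earlier. That is deliberate, to avoid probes flapping between anchors during a short, unstable outage.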
-- Chriztoffer