Re: [atlas] [RIPE-NCC/ripe-atlas-software-probe] Software probes only connect to one controller? (#45)
On Mon, 3 Aug 2020 at 12:14, Philip Homburg <notifications@github.com> wrote:
> This is not an ideal place to discuss how atlas works. It would be better to use the atlas mailing list (https://lists.ripe.net/mailman/listinfo/ripe-atlas).
> In any case, we have 2 controllers for software probes. Those controllers are kept separate from controllers for hardware probes. We placed both controllers at Hetzner and unfortunately, Hetzner seems to have some issues recently (or at least more than I remember). Typically a controller handles about 500 probes, so it will take some time before we will get controllers in other places in the world.
> Quite a bit of atlas backend logic depends on probes having just one controller at a time, so this is unlikely to change soon.
> Sending probes to different controllers happens automatically, but it happens on a time scale of around 6 hours. It seems that the issue at Hetzner was less than 2 hours. However, with all controllers for software probes at Hetzner, a long failure at Hetzner would indeed impact all software probes.
In this case, why not have the software probes set up with a fall-back RIPE Atlas Anchor (controller)? E.g.:
- Set up a list of 2 or more Anchors in a hierarchical order.
- If the software probe cannot reach the primary Anchor for more than 10-20 minutes, fall back to the next reachable Anchor.
- Retest connectivity to all configured Anchors every 6 hours.
- Revert to the first reachable Anchor in the locally configured list (a hierarchical order).

* The above idea assumes an Anchor being able to temporarily handle more than the default ~500 probes per Anchor, plus co-location diversity of >= 2 providers and >= 3 anchors.
-- Chriztoffer
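
For illustration only, the proposal above could be sketched roughly as follows. This is not how the probe software works today (the replies in this thread explain why); the function names, the anchor identifiers, and the 15-minute threshold (picked from the middle of the suggested 10-20 minute window) are all assumptions made for the sketch.

```python
from typing import Callable, Optional, Sequence

# Assumed values taken from the proposal, not from any real probe configuration.
FALLBACK_AFTER_S = 15 * 60         # middle of the suggested 10-20 minute window
RETEST_INTERVAL_S = 6 * 60 * 60    # suggested retest of all anchors every 6 hours


def pick_anchor(anchors: Sequence[str],
                is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable anchor in priority order, or None."""
    for anchor in anchors:
        if is_reachable(anchor):
            return anchor
    return None


def next_anchor(current: str,
                anchors: Sequence[str],
                is_reachable: Callable[[str], bool],
                unreachable_for_s: float,
                since_retest_s: float) -> str:
    """Decide which anchor a (hypothetical) probe should use next."""
    # Periodic retest: revert to the highest-priority reachable anchor.
    if since_retest_s >= RETEST_INTERVAL_S:
        return pick_anchor(anchors, is_reachable) or current
    # Current anchor has been down long enough: fall back to another one.
    if unreachable_for_s >= FALLBACK_AFTER_S:
        others = [a for a in anchors if a != current]
        return pick_anchor(others, is_reachable) or current
    # Otherwise stay where we are.
    return current
```

With anchors ["a1", "a2", "a3"] and only "a2" and "a3" reachable, a probe on "a1" that has been disconnected for 20 minutes would move to "a2"; at the next 6-hour retest it would return to "a1" if that anchor is reachable again.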
Hello,
> In this case, why not have the software probes set up with a fall-back RIPE Atlas Anchor (controller)?
It's worth mentioning that a probe not being able to connect to its assigned controller is the exceptional status, not the general rule. Probes quite happily continue to measure and store results for later delivery even if they are disconnected from the infrastructure. Obviously they cannot receive new measurement requests while they are disconnected.

The complexity of handling the exceptional case is a high price to pay considering that 1) the controlling infrastructure is partially hosted at the RIPE NCC, partially at hosting providers, with enough 9s of uptime, and 2) disconnects are basically always partial (i.e. they always affect only a subset of the probes).

On the flip side, there are a number of benefits to making the probes stick to the same controller as long as possible. Therefore we defined a time interval (2 hours) in which the probes will keep on trying to reconnect to the same server, and only ask for help if that doesn't work out.

I hope this explains the behaviour.

Cheers,
Robert
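
As a rough illustration of the sticky-reconnect behaviour described above, here is a minimal sketch. Only the 2-hour interval comes from this message; the function name and the action labels are hypothetical, and the real probe software may structure this quite differently.

```python
# The 2-hour window is stated in the message above; everything else is assumed.
STICKY_WINDOW_S = 2 * 60 * 60


def reconnect_action(disconnected_for_s: float) -> str:
    """Return what a probe does after being disconnected for the given time."""
    if disconnected_for_s <= 0:
        return "connected"
    if disconnected_for_s < STICKY_WINDOW_S:
        # Keep retrying the same controller; measurements continue and
        # results are stored locally for later delivery.
        return "retry_same_controller"
    # The window has expired: stop insisting on the same controller.
    return "ask_for_help"
```

For example, reconnect_action(30 * 60) gives "retry_same_controller", while reconnect_action(3 * 60 * 60) gives "ask_for_help".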
participants (2)
- Chriztoffer Hansen
- Robert Kisteleki