shared fate for my two probes
Hi,
My two RIPE Atlas probes have been acting strangely over the past several months. Very frequently I receive automatic notices from the RIPE Atlas system that both of my probes have been "disconnected from the RIPE Atlas infrastructure" essentially simultaneously. Here's the history from 2018:
probe 11171 2018-02-23 18:08:57 UTC. probe 11203 2018-02-23 18:10:23 UTC.
probe 11171 2018-06-21 04:32:15 UTC. probe 11203 2018-06-21 04:30:59 UTC.
probe 11171 2018-06-24 05:36:57 UTC. probe 11203 2018-06-24 05:37:04 UTC.
probe 11171 2018-07-31 20:25:27 UTC. probe 11203 2018-07-31 20:25:32 UTC.
probe 11171 2018-08-18 04:48:20 UTC.
probe 11171 2018-09-15 02:31:26 UTC. probe 11203 2018-09-15 02:31:28 UTC.
probe 11171 2018-10-15 23:03:50 UTC. probe 11203 2018-10-15 23:04:06 UTC.
probe 11171 2018-11-16 14:42:28 UTC. probe 11203 2018-11-16 14:42:33 UTC.
(Those last three events are eerily periodic.)
Each time I am finally able to visit and inspect the probes (both are reported as "Firmware Version 4940 (1100)"), their LEDs are in the same state:
---------------
Looking at the units with the on/off LED to the left:
(1) the left-most (on/off) LEDs are on steadily.
(2) the next LED is off.
the next two LEDs are flashing at different frequencies:
(3) the next flashes on approx. 48 times per minute
(4) the next flashes on approx. 110 times per minute
(5) the centers of the wide "WPS/RESET" LEDs are on steadily.
---------------
Once these probes fail they remain offline until I try turning them off and on again.
The probes are powered similarly: the RIPE-provided USB cable is plugged into an Apple iPhone USB power adapter, in turn plugged into a standard US 120 V / 60 Hz AC receptacle. However, the AC power circuits feeding these receptacles come from two very different sources which should not share fate.
Until today both probes have been plugged into ports on the same Cisco switch, but after today's disconnect I moved probe 11203 to a different physical switch that has the same VLANs.
None of my other equipment experiences any troubles around the times when these events happen to my RIPE Atlas probes.
In the past I have reported these events to atlas@ripe.net, but I doubt that the causes cited (e.g. "the problem is with your dns") have been accurate. (I do not blame the atlas@ripe.net folks who have responded. Something strange is happening here, and I haven't yet cracked it.)
I'm curious whether anyone else sees similar behaviors, or has suggestions for ways to determine what is happening.
Thanks!
Jay B.
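The remark that the last three events look periodic, and that both probes drop "essentially simultaneously", can be checked with a little arithmetic. What follows is only a minimal Python sketch built on the timestamps listed in the message; the names `EVENTS`, `probe_skew_seconds` and `gaps_in_days` are made up for illustration and are not part of any RIPE Atlas tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Disconnect times (UTC) copied from the message, as (probe 11171, probe 11203).
# None marks the one event where only probe 11171 was listed.
EVENTS = [
    ("2018-02-23 18:08:57", "2018-02-23 18:10:23"),
    ("2018-06-21 04:32:15", "2018-06-21 04:30:59"),
    ("2018-06-24 05:36:57", "2018-06-24 05:37:04"),
    ("2018-07-31 20:25:27", "2018-07-31 20:25:32"),
    ("2018-08-18 04:48:20", None),
    ("2018-09-15 02:31:26", "2018-09-15 02:31:28"),
    ("2018-10-15 23:03:50", "2018-10-15 23:04:06"),
    ("2018-11-16 14:42:28", "2018-11-16 14:42:33"),
]


def parse(ts):
    """Parse one of the timestamps above into a naive datetime (all are UTC)."""
    return datetime.strptime(ts, FMT)


def probe_skew_seconds(events):
    """Seconds by which probe 11203's notice trails probe 11171's (negative = leads)."""
    return [
        None if b is None else (parse(b) - parse(a)).total_seconds()
        for a, b in events
    ]


def gaps_in_days(events):
    """Days between consecutive probe 11171 disconnects, rounded to one decimal."""
    times = [parse(a) for a, _ in events]
    return [
        round((later - earlier).total_seconds() / 86400, 1)
        for earlier, later in zip(times, times[1:])
    ]


if __name__ == "__main__":
    print("skew (s):", probe_skew_seconds(EVENTS))
    print("gaps (d):", gaps_in_days(EVENTS))
```

On these timestamps the last two gaps for probe 11171 come out at roughly 30.9 and 31.7 days, and the two probes' notices are never more than 86 seconds apart.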
On 2018/11/16 22:16 , Jay Borkenhagen wrote:
probe 11171 2018-11-16 14:42:28 UTC. probe 11203 2018-11-16 14:42:33 UTC.
Both probes lost all internet connectivity for a couple of hours. I have no idea what happened but Atlas traceroutes stop at hop 1 (and don't even reach the local router).
Philip
Philip Homburg writes:
On 2018/11/16 22:16 , Jay Borkenhagen wrote:
probe 11171 2018-11-16 14:42:28 UTC. probe 11203 2018-11-16 14:42:33 UTC.
Both probes lost all internet connectivity for a couple of hours. I have no idea what happened but Atlas traceroutes stop at hop 1 (and don't even reach the local router).
Yes, but the problem was not with the networking infrastructure: as I have reported each time, none of my other equipment has had any connectivity troubles at these times, and both probes remained inaccessible until power-cycled, after which they were immediately fine again. The "couple of hours" you cite starts when the issue begins and ends only when I arrive to power-cycle them.
The only sort of thing that makes any sense to me is that something occurs that triggers the probes, and only the probes, to lose their minds. Possibly an electrical power aberration or a networking hiccup sends the probes into a state that they cannot recover from on their own. But even the 'power aberration' explanation seems unlikely, since the two power feeds are dissimilar enough that I would not expect any surge or whatever that hits one to hit them both, and to hit in a way that only RIPE Atlas probes are affected.
Has anyone else experienced electrical power disruptions that send RIPE Atlas probes into a state like the one I described in my initial message, while no other nearby equipment notices any problem?
Thanks.
My V3 probe would semi-randomly drop connection for a few minutes before I moved it to another part of the apartment. Sounds kinda stupid, I know, but are your probes physically located near each other? Can you move them? Could be that they are more sensitive to stray RF than your other equipment.
On 2018/11/19 17:53, Jay Borkenhagen wrote:
Philip Homburg writes:
On 2018/11/16 22:16, Jay Borkenhagen wrote:
probe 11171 2018-11-16 14:42:28 UTC. probe 11203 2018-11-16 14:42:33 UTC.
Both probes lost all internet connectivity for a couple of hours. I have no idea what happened but Atlas traceroutes stop at hop 1 (and don't even reach the local router).
Yes, but the problem was not with the networking infrastructure:
For me the most obvious approach is to put the probes behind a switch that supports port mirroring and look at the actual traffic during this period.
The fact that it happens to both probes at the same time makes it very unlikely that it is a probe specific hardware problem. I'm not aware of any bugs in the Linux kernel that would affect IPv4 and IPv6 at the same time. So the obvious next step is to check what actually goes over the wire.
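Once traffic from a mirror port has been captured, one simple question to ask of it is when each probe last put a frame on the wire. The sketch below is one possible way to answer that in Python, under these assumptions: the capture was saved in the classic pcap file format with microsecond timestamps (for example with `tcpdump -w`), the link type is Ethernet, and the function name `last_seen_by_mac` is hypothetical. It reads the file format directly with the standard library instead of relying on any capture-analysis package.

```python
import struct


def last_seen_by_mac(pcap_bytes):
    """Map each source MAC to the timestamp (seconds) of its latest frame.

    Assumes a classic pcap file (not pcapng) with microsecond timestamps
    and Ethernet framing; anything else raises ValueError.
    """
    magic = struct.unpack_from("<I", pcap_bytes, 0)[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file was written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file was written big-endian
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    # The 24-byte global header ends with the link-layer type at offset 20.
    linktype = struct.unpack_from(endian + "I", pcap_bytes, 20)[0]
    if linktype != 1:
        raise ValueError("expected Ethernet frames (link type 1)")

    last_seen = {}
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        # Per-record header: seconds, microseconds, captured length, original length.
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        if len(frame) < 12:
            continue  # too short to hold destination + source MAC
        src = ":".join(f"{b:02x}" for b in frame[6:12])
        ts = ts_sec + ts_usec / 1e6
        last_seen[src] = max(last_seen.get(src, 0.0), ts)
    return last_seen
```

Comparing the value reported for each probe's MAC address with the disconnect time in the RIPE Atlas notice would show whether the probe went silent on the wire or kept transmitting frames that went unanswered.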
Hi,
Following up to my own note below to provide an update:
On 23-November probe 11171 disconnected again, but this time probe 11203 -- recently moved to a port on a different ethernet switch -- did not. After the 23-November event I moved probe 11171 to a different switch as well, and neither probe has suffered a recurrence since then.
So, it seems that one of my ethernet switches occasionally does something that these RIPE Atlas probes do not like. Whatever that is, none of my other devices connected to that switch seem to take notice. (Were the RIPE Atlas probe issues a warning of a switch that will soon fail? Dunno. :-) )
Thanks to all those who offered suggestions for possible causes or diagnostic methods.
Jay B.
participants (3)
- Jay Borkenhagen
- Philip Homburg
- Sebastian Johansson