[atlas]Probe flapping
Hi,
My RIPE Atlas probe (MAC: 00204AC8243A) started flapping last night and is continuing to flap. It has flapped about 90 times since midnight. Do you have any known problems at the moment?
-- Kind regards Harald Firing Karlsen
On 2010.12.16. 9:26, Harald Firing Karlsen wrote:
Hi,
My RIPE Atlas probe (MAC: 00204AC8243A) started flapping last night and is continuing to flap. It has flapped about 90 times since midnight. Do you have any known problems at the moment?
Hi, It seems that your probe is not the only one. We're looking into the details and will get back with an update. Cheers, Robert
On 16-12-10 10:05, Robert Kisteleki wrote:
On 2010.12.16. 9:26, Harald Firing Karlsen wrote:
My RIPE Atlas probe (MAC: 00204AC8243A) started flapping last night and is continuing to flap. It has flapped about 90 times since midnight. Do you have any known problems at the moment?
It seems that your probe is not the only one. We're looking into the details and will get back with an update.
UPC had maintenance last night, at least where I am (Arnhem - NL). Maybe that has something to do with it? My 'home-probe' (00:20:4A:BF:FD:C5) did not show any interruptions though. I run two probes, one at work, one at home. The one at work has been flapping since the start (00:20:4A:C8:22:A4). Don't know exactly why yet, so I am interested in the outcome of this. Regards, -- Marco
BTW, simple suggestion: in your "last 25 connections" it would be really nice to add two columns (marked with *):
Connect at | Connected for* | Disconnect at: | Disconnected for*
To display for what duration the probe was on/off.
Would probably be done in a few mins, and would make this table much easier to read!
Michael
On 2010.12.16. 11:05, Michael H. Behringer wrote:
BTW, simple suggestion: in your "last 25 connections" it would be really nice to add two columns (marked with *):
Connect at | Connected for* | Disconnect at: | Disconnected for*
To display for what duration the probe was on/off.
Would probably be done in a few mins, and would make this table much easier to read!
Michael
Yes, this seems to be a good idea. Robert
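To illustrate the two extra columns Michael suggests, here is a minimal sketch in Python. It is not the RIPE Atlas implementation: the function names and the `now` value are made up for the example, and the sample rows are the connection history posted for probe 303 in this thread. "Connected for" is the length of each session, and "Disconnected for" is the gap until the next connect.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S UTC"  # timestamp layout used in the connection table


def parse(ts):
    """Parse one table timestamp; every value is UTC, so naive datetimes suffice."""
    return datetime.strptime(ts, FMT)


def add_durations(sessions, now):
    """sessions: (connect_at, disconnect_at or None) tuples, newest first.

    Returns rows carrying the two proposed columns as timedelta values.
    """
    rows = []
    newer_connect = None  # connect time of the session listed just above
    for connect_at, disconnect_at in sessions:
        start = parse(connect_at)
        end = parse(disconnect_at) if disconnect_at else None
        rows.append({
            "connect_at": connect_at,
            "connected_for": (end or now) - start,
            "disconnect_at": disconnect_at or "(still connected)",
            "disconnected_for": (newer_connect - end) if end and newer_connect else None,
        })
        newer_connect = start
    return rows


# Sample rows: the history posted for probe 303 in this thread.
sessions = [
    ("2010-12-19 06:07:29 UTC", None),
    ("2010-12-18 17:32:11 UTC", "2010-12-19 06:06:49 UTC"),
    ("2010-12-18 17:15:28 UTC", "2010-12-18 17:30:54 UTC"),
    ("2010-12-18 10:26:52 UTC", "2010-12-18 17:13:24 UTC"),
]
# "now" is an arbitrary value chosen for the example.
rows = add_durations(sessions, now=parse("2010-12-19 09:16:00 UTC"))
for row in rows:
    print(row["connect_at"], "|", row["connected_for"], "|",
          row["disconnect_at"], "|", row["disconnected_for"])
```

With the durations spelled out, the short gaps between sessions stand out without any mental arithmetic on the timestamps.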
Intermediate update to keep those interested informed. I am writing this to keep the engineers free to work the problem. I do not know the nitty-gritty details, so this is a general overview. No conclusions yet.
Architecture: After registering with the RIPE Atlas network, the probes are connected to "controllers" that handle requests to/from the probes. The architecture allows probes to use any controller in the system. At the moment, probes are distributed among controllers according to geographic and load balancing heuristics. We currently have four controllers:
- 1 in Germany on a dedicated server: ronin
- 1 in the US on a dedicated server: carson
- 2 in NL on RIPE NCC VMs: caldwell and zelenka
You can see the number of probes associated with each controller and some other details on https://atlas.ripe.net/statistics This page is updated hourly.
What happened: This morning zelenka was in standby and ronin started disassociating probes in a massive way. We do not know the root cause of this. The most likely cause so far is a connectivity problem, but we are investigating with an open mind. The system reacted as designed and the probes dropped by ronin started to register with caldwell. Unfortunately caldwell became overloaded by this, both because of its physical limitations and because of an unfortunate database configuration error. Probes associated with carson were not affected.
What we are doing: We brought up zelenka but, as Murphy dictates, the RIPE NCC firewall prevented probes from reaching it. This has been fixed and zelenka is now picking up probes. We are working hard to fix a lot of minor problems uncovered by this and to get all probes re-connected and their data backlog processed.
What we have learned so far: We need a larger safety margin in the capacity of the controllers vs the number of deployed probes. We will start moving caldwell and zelenka onto physical machines outside of firewalls and other complications. We also need to exercise moving probes among controllers and verify that the safety margin exists in reality.
Personally I regard all this as normal teething problems in a distributed computing deployment. So far the architecture is holding up well. Just the implementation has some flaws. Please bear with us.
If anyone has suggestions for high quality hosting of controllers in the RIPE region, please drop me and Robert a private mail.
Daniel
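The following is a rough sketch of the behaviour described above, not the actual Atlas controller logic. It assumes a probe goes to the nearest controller that is up and has room, and it checks whether the fallback controllers could take over a failed controller's probes. The controller names come from this thread; the coordinates, capacities, loads and function names are all invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Controller:
    name: str
    lat: float      # assumed location, degrees
    lon: float
    capacity: int   # hypothetical maximum number of probes
    load: int = 0   # probes currently attached
    up: bool = True

    @property
    def spare(self):
        return self.capacity - self.load


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (haversine formula) on a 6371 km sphere."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def pick_controller(lat, lon, controllers):
    """Nearest controller that is up and has spare capacity, or None."""
    usable = [c for c in controllers if c.up and c.spare > 0]
    return min(usable, key=lambda c: distance_km(lat, lon, c.lat, c.lon), default=None)


def can_absorb(failed, fallbacks):
    """True if the running fallbacks have room for every probe on `failed`."""
    return sum(c.spare for c in fallbacks if c.up) >= failed.load


# Names are from the thread; coordinates, capacities and loads are invented.
ronin = Controller("ronin", 50.1, 8.7, capacity=400, load=300)
carson = Controller("carson", 37.4, -122.1, capacity=400, load=50)
caldwell = Controller("caldwell", 52.4, 4.9, capacity=150, load=100)
zelenka = Controller("zelenka", 52.4, 4.9, capacity=150, up=False)  # standby
controllers = [ronin, carson, caldwell, zelenka]

probe = (48.1, 11.6)  # hypothetical probe in southern Germany
before = pick_controller(*probe, controllers).name
ronin.up = False      # simulate the controller dropping its probes
after = pick_controller(*probe, controllers).name
print(before, "->", after)
print("caldwell alone can absorb ronin:", can_absorb(ronin, [caldwell]))
zelenka.up = True
print("caldwell + zelenka can absorb ronin:", can_absorb(ronin, [caldwell, zelenka]))
```

With these invented numbers the check fails even once zelenka is running, which is the kind of missing safety margin the update describes. A check like `can_absorb` could be run per controller as a periodic drill.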
Dear All,
Here's an update to Daniel's message from yesterday.
As Daniel mentioned, on Wednesday evening our system started to migrate the probes away from a particular controller (ronin, in DE). We have a strong suspicion on why this happened, but it's not confirmed so I'm not going to publicly speculate :-) In any case, since we don't yet have enough spare capacity to handle this situation, another controller was overloaded.
We needed to fix the internal databases on these controllers, which took some time. We were able to bring the system back to a stable state by the afternoon.
This morning we revived some probes (25 or so) which were in a limbo -- they were not properly connected. We forced them to re-connect, so they are fine now. There are still some of them, like 10 or so, which are not connected (down) so we can't really help those from here. If your probe was working properly before Wednesday, but now is down, then please power cycle it (using the USB power) and it will very likely come back fine.
Probes in the US (and Asia, very likely) were not affected, as they have a local controller on the west coast, which was not involved. That's because the system really doesn't like to send European probes to it, it's too far.
Let us know if there's anything else not working properly, so that we can look into it.
Regards, Robert
Also interesting: This event is pretty clearly visible in the "probe up count" graph: <http://atlas.ripe.net/dynamic/stats/stats.goal.png>
On 2010.12.17. 16:29, Richard L. Barnes wrote:
Also interesting: This event is pretty clearly visible in the "probe up count" graph: <http://atlas.ripe.net/dynamic/stats/stats.goal.png>
Indeed. There's also a more detailed statistics page, which gives interesting hints for those who are interested in more details: https://atlas.ripe.net/statistics We don't plan to apply cosmetics to the graphs :) so you can see the event in its full extent. Robert
Robert Kisteleki wrote:
This morning we revived some probes (25 or so) which were in a limbo -- they were not properly connected. We forced them to re-connect, so they are fine now.
It looks like both of my babies were affected, living in completely different networks (IDs 414 and 466). Both of them seem to have been turned away, at 2010-12-16 08:14:01 UTC and 2010-12-16 08:14:02 UTC. They came back at 2010-12-17 07:55:23 UTC and 2010-12-17 07:53:59 UTC respectively (without any intervention on my end). Just fyi, Wilfried
Hi Robert,
It seems like a probe (ID: 303) in Asia is still having disconnection problems.
Connect at: | Disconnect at:
2010-12-19 06:07:29 UTC | (still connected)
2010-12-18 17:32:11 UTC | 2010-12-19 06:06:49 UTC
2010-12-18 17:15:28 UTC | 2010-12-18 17:30:54 UTC
2010-12-18 10:26:52 UTC | 2010-12-18 17:13:24 UTC
Or is this supposed to be normal? Every disconnect-connect cycle has a 1-2 min gap.
Best Wishes,
Aftab A. Siddiqui
Hi Aftab,
Although your probe is located in Asia, it is still connected to one of our controllers in Europe because this part of Asia is closer to Germany than to the west coast of the US :) Since Friday our system has been fully functional, so I think it's your network that is making your probe flap. The small gap is normal, since after the probe disconnects it takes some time to re-connect. From what I can see in our history, your probe has been disconnected quite often from the beginning :)
Regards, Andreas Strikos
Hi Andreas,
I thought connecting it directly to the distribution network would make it the stable one :( Well, if everything is fine at your end then I'll try to change the upstream for this subnet.
Is it possible to change the controller my probe is connected to, and can you share the controller IPs? If it's not against the policy.
Best Wishes, Aftab A. Siddiqui
Hi Aftab,
No, it's not possible to change the controller your probe is connected to. The system chooses the best controller for your probe based on geolocation and some additional criteria. I guess you can see the IP of the controller your probe is connected to. It's not a secret :)
Regards, Andreas
My home probe had been flapping a lot since I first installed it. But on Tuesday, I swapped out the router it's connected to * and it's been fine ever since **. I think it's the fact that the new router supports IPv6 :)
--Richard
* Before: Linksys WRT54GL running DD-WRT. After: Lenovo x61 running Ubuntu 10.10.
** Except for a single, isolated 30-minute down time
Thanks Richard for sharing this.
What we have observed so far is that in some installations (a few percent of them) the probe does indeed flap in a 5-15 minute, sometimes up to 45 minute, cycle. In most of these cases the probe doesn't succeed with the common traceroute at all. In one case, we could verify that this was not specific to the probe: a normal laptop showed the same behavior.
Maybe we could ask the hosts of these probes to tell us what kind of home router they have, to see if there are common ones.
Cheers, Robert
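One way to pick out probes that flap in the 5 to 45 minute cycle Robert mentions is to look at the typical gap between consecutive disconnects. The sketch below is only an illustration under that assumption: the function names, thresholds as parameters, and both sample histories are invented, and it says nothing about how the Atlas team actually detects these cases.

```python
from datetime import datetime, timedelta
from statistics import median


def flap_cycle_minutes(disconnects):
    """Median gap in minutes between consecutive disconnects; None if fewer than two."""
    times = sorted(disconnects)
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    return median(gaps) if gaps else None


def is_periodic_flapper(disconnects, low=5, high=45):
    """True if the median cycle falls in the low..high minute range."""
    cycle = flap_cycle_minutes(disconnects)
    return cycle is not None and low <= cycle <= high


# Invented history: one disconnect every 10 minutes for an hour.
start = datetime(2010, 12, 16, 0, 0)
flappy = [start + timedelta(minutes=10 * i) for i in range(7)]
# Invented history: two disconnects six hours apart.
stable = [start, start + timedelta(hours=6)]

print(flap_cycle_minutes(flappy), is_periodic_flapper(flappy))
print(flap_cycle_minutes(stable), is_periodic_flapper(stable))
```

The median is used rather than the mean so that a single long outage does not hide an otherwise regular cycle.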
On Fri, Dec 17, 2010 at 9:54 AM, Robert Kisteleki <robert@ripe.net> wrote:
Maybe we could ask the hosts of these probes to tell us what kind of home router they have, to see if there are common ones.
Perhaps this could be a standard question in the probe setup / account admin page? -- Mark Dranse mark@dranse.com
On 2010.12.17. 10:23, Mark Dranse wrote:
On Fri, Dec 17, 2010 at 9:54 AM, Robert Kisteleki <robert@ripe.net> wrote:
Maybe we could ask the hosts of these probes to tell us what kind of home router they have, to see if there are common ones.
Perhaps this could be a standard question in the probe setup / account admin page?
That's too smart, I couldn't have thought of that :-) Robert
participants (10)
- Aftab A. Siddiqui
- Andreas Strikos
- Daniel Karrenberg
- Harald Firing Karlsen
- Marco Davids (SIDN)
- Mark Dranse
- Michael H. Behringer
- Richard L. Barnes
- Robert Kisteleki
- Wilfried Woeber, UniVie/ACOnet