some thoughts and questions regarding probe "stability"
Hi Folks,

triggered by the discussion related to DNSMON, and an issue (power, resolved) with one of my V1 probes, I'd like to get some input or start a discussion or an investigation.

To start with, I am not very clear on what the term "stability" would/should mean in this context, as the probes are supposed to buffer measurement data locally, at least for a while (true?).

So, here goes...

Obviously, looking at some Atlas Stat pages, there are probes with 100% uptime.

Now, looking at the 3 under my supervision (2x V1, 1 Anchor), with reference to "Connected" and "Disconnected", there's no chance to get near that value, as all of them tend to topple over on a regular basis, mostly for a *short* period of time, in the range of 0m(!) to some 30+m.

With respect to the behaviour of the Anchor, which is mounted in the same rack as the backbone router it connects to, in a data center, we tried to correlate the (reported) disconnection events with the router and interface logs for the probe. No luck there; also, no maintenance works or the like, so I presume the Anchor didn't reboot and that there were no "real" network problems.

Let's compare the most recent dis/connection logs for my 3 pets:

ID 6009
  Connected at          Up for        Disconnected at       Down for
  2014-07-14 03:58:03   3d 8h 16m     Still Connected
  2014-05-27 03:03:54   48d 0h 46m    2014-07-14 03:50:47   0h 7m
  2014-05-20 15:19:02   6d 11h 37m    2014-05-27 02:57:00   0h 6m
  2014-05-14 21:16:56   5d 17h 59m    2014-05-20 15:16:22   0h 2m
  2014-04-08 16:03:21   36d 5h 1m     2014-05-14 21:05:17   0h 11m

ID 0466
  Connected at          Up for        Disconnected at       Down for
  2014-07-13 23:31:05   3d 12h 45m    Still Connected
  2014-07-09 23:05:40   3d 23h 54m    2014-07-13 22:59:49   0h 31m
  2014-06-16 10:53:21   23d 11h 55m   2014-07-09 22:49:04   0h 16m
  2014-05-25 09:03:06   22d 1h 38m    2014-06-16 10:42:00   0h 11m
  2014-05-24 20:34:50   11h 54m       2014-05-25 08:29:12   0h 33m

ID 0414
  Connected at          Up for        Disconnected at       Down for
  2014-07-07 23:41:23   9d 12h 35m    Still Connected
  2014-07-02 03:58:45   5d 19h 31m    2014-07-07 23:29:54   0h 11m
  2014-06-13 09:37:50   18d 18h 7m    2014-07-02 03:45:08   0h 13m
  2014-06-08 13:22:14   4d 20h 7m     2014-06-13 09:29:38   0h 8m
  2014-05-21 08:29:23   18d 4h 45m    2014-06-08 13:15:11   0h 7m

Again, I fail to see some obvious correlation; what am I missing?

Does anyone else see a similar pattern?

How does one start debugging, if there's anything that needs debugging?

Thanks for your ideas and help!
Wilfried
Hi Wilfried,

At least your probes were online for many days at a stretch. Here is the availability report of my V1 probe 0303: 99.71% availability.

+---------------------+---------------------+------------+--------------+
| Connected (UTC)     | Disconnected (UTC)  | Connected  | Disconnected |
+---------------------+---------------------+------------+--------------+
| 2014-05-29 23:46:12 | 2014-06-02 06:30:06 | 1d 06:30   | 0d 00:00     |
| 2014-06-02 06:40:35 | 2014-06-03 06:52:14 | 1d 00:11   | 0d 00:10     |
| 2014-06-03 06:59:53 | 2014-06-04 22:11:56 | 1d 15:12   | 0d 00:07     |
| 2014-06-04 22:22:43 | 2014-06-16 15:48:25 | 11d 17:25  | 0d 00:10     |
| 2014-06-16 15:59:17 | 2014-06-17 22:11:24 | 1d 06:12   | 0d 00:10     |
| 2014-06-17 22:22:53 | 2014-06-21 21:13:51 | 3d 22:50   | 0d 00:11     |
| 2014-06-21 21:41:35 | 2014-06-23 15:44:56 | 1d 18:03   | 0d 00:27     |
| 2014-06-23 15:54:55 | 2014-06-29 04:19:02 | 5d 12:24   | 0d 00:09     |
| 2014-06-29 04:53:22 | Still up            | 1d 19:06   | 0d 00:34     |
+---------------------+---------------------+------------+--------------+

It is directly connected to our core router. I was never able to correlate any of the disconnection times with any network incident.

Best Wishes,
Aftab A. Siddiqui
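An availability figure like the 99.71% above can be recomputed from such a table. Here is a minimal Python sketch; the session list is transcribed by hand from probe 0303's report, the function name and input format are invented for illustration (this is not an Atlas tool), and the exact percentage depends on the reference window chosen:

    from datetime import datetime

    FMT = "%Y-%m-%d %H:%M:%S"

    def availability(sessions, now):
        """Percent of time connected, from (connect, disconnect) pairs.
        Use None as the disconnect time of a still-open session."""
        connected = 0.0
        start = datetime.strptime(sessions[0][0], FMT)
        for conn, disc in sessions:
            t0 = datetime.strptime(conn, FMT)
            t1 = datetime.strptime(disc, FMT) if disc else now
            connected += (t1 - t0).total_seconds()
        return 100.0 * connected / (now - start).total_seconds()

    # Sessions for probe 0303, transcribed from the table above.
    sessions = [
        ("2014-05-29 23:46:12", "2014-06-02 06:30:06"),
        ("2014-06-02 06:40:35", "2014-06-03 06:52:14"),
        ("2014-06-03 06:59:53", "2014-06-04 22:11:56"),
        ("2014-06-04 22:22:43", "2014-06-16 15:48:25"),
        ("2014-06-16 15:59:17", "2014-06-17 22:11:24"),
        ("2014-06-17 22:22:53", "2014-06-21 21:13:51"),
        ("2014-06-21 21:41:35", "2014-06-23 15:44:56"),
        ("2014-06-23 15:54:55", "2014-06-29 04:19:02"),
        ("2014-06-29 04:53:22", None),  # still up at report time
    ]
    print("%.2f%% availability" % availability(
        sessions, datetime(2014, 6, 30, 23, 59, 22)))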
For what it's worth, I've seen similar trouble over the last couple of days. I've only had my probe hooked up since the 2nd of July, and compared to some of you it's a pretty basic local network. Unfortunately, the 2 times it's gone down have been two hours before our store opens for the day, so I haven't been here to see what's up. Of course it's very possible that something on our end is just going down and I don't know about it, so I'll keep an eye on it.

Ross Weseloh
Hi,

On 17 Jul 2014, at 14:48, Wilfried Woeber <Woeber@CC.UniVie.ac.at> wrote:
Let's compare the most recent dis/connection logs for my 3 pets:
I see something similar, where I have two probes connected at home, to the same switch, behind the same DSL connection. These probes are from different generations, though.

ID 3144
  Connected at          Up for        Disconnected at       Down for
  2014-07-16 10:22:03   1d 3h 38m     Still Connected
  2014-07-16 08:48:36   1h 10m        2014-07-16 09:58:51   0h 23m
  2014-07-15 22:15:36   10h 13m       2014-07-16 08:29:29   0h 19m
  2014-07-15 03:48:13   16h 49m       2014-07-15 20:37:48   1h 37m
  2014-07-11 23:16:11   3d 3h 49m     2014-07-15 03:05:51   0h 42m
  2014-07-06 19:00:04   5d 4h 1m      2014-07-11 23:01:08   0h 15m

ID 11849
  Connected at          Up for        Disconnected at       Down for
  2014-07-16 10:13:47   1d 3h 49m     Still Connected
  2014-07-16 08:41:47   1h 17m        2014-07-16 09:58:56   0h 14m
  2014-07-16 08:31:26   0h 1m         2014-07-16 08:33:10   0h 8m
  2014-07-15 22:08:37   10h 20m       2014-07-16 08:29:27   0h 1m
  2014-07-15 03:52:50   16h 45m       2014-07-15 20:38:05   1h 30m

Jeroen.
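Since the two probes share the switch and the DSL line, one quick check is whether their disconnect windows overlap in time. A rough Python sketch; each window is reconstructed by hand from consecutive rows of the logs above (a disconnect timestamp paired with the next connect timestamp), and the variable names are made up for illustration:

    from datetime import datetime

    def ts(s):
        return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

    def overlaps(a, b):
        """True if two (start, end) intervals intersect."""
        return a[0] < b[1] and b[0] < a[1]

    # Disconnect windows (went down, came back) from the logs above.
    p3144 = [(ts("2014-07-16 09:58:51"), ts("2014-07-16 10:22:03")),
             (ts("2014-07-16 08:29:29"), ts("2014-07-16 08:48:36")),
             (ts("2014-07-15 20:37:48"), ts("2014-07-15 22:15:36"))]
    p11849 = [(ts("2014-07-16 09:58:56"), ts("2014-07-16 10:13:47")),
              (ts("2014-07-16 08:29:27"), ts("2014-07-16 08:31:26")),
              (ts("2014-07-15 20:38:05"), ts("2014-07-15 22:08:37"))]

    for a in p3144:
        for b in p11849:
            if overlaps(a, b):
                print("overlap:", max(a[0], b[0]), "to", min(a[1], b[1]))

All three recent windows overlap between the two probes, which is consistent with a shared cause on the uplink rather than with the probes themselves.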
Hi Wilfried,
Let's compare the most recent dis/connection logs for my 3 pets:
Here is what I found in our logs:
ID 6009 2014-07-14 03:58:03 3d 8h 16m Still Connected
Upgrade to firmware 4650
2014-05-27 03:03:54 48d 0h 46m 2014-07-14 03:50:47 0h 7m
Hard to say, some network glitch
2014-05-20 15:19:02 6d 11h 37m 2014-05-27 02:57:00 0h 6m
Anchor was rebooted
2014-05-14 21:16:56 5d 17h 59m 2014-05-20 15:16:22 0h 2m
Network glitch
2014-04-08 16:03:21 36d 5h 1m 2014-05-14 21:05:17 0h 11m
Anchor was rebooted
ID 0466 2014-07-13 23:31:05 3d 12h 45m Still Connected
Some network glitch, unclear what
2014-07-09 23:05:40 3d 23h 54m 2014-07-13 22:59:49 0h 31m
Probe upgraded firmware, reason for disconnect got lost
2014-06-16 10:53:21 23d 11h 55m 2014-07-09 22:49:04 0h 16m
Network problem
2014-05-25 09:03:06 22d 1h 38m 2014-06-16 10:42:00 0h 11m
Some network problem.
2014-05-24 20:34:50 11h 54m 2014-05-25 08:29:12 0h 33m
Unclear
ID 0414 2014-07-07 23:41:23 9d 12h 35m Still Connected
Some network problem
2014-07-02 03:58:45 5d 19h 31m 2014-07-07 23:29:54 0h 11m
Power cycled?
2014-06-13 09:37:50 18d 18h 7m 2014-07-02 03:45:08 0h 13m
Some network problem. High RTTs
2014-06-08 13:22:14 4d 20h 7m 2014-06-13 09:29:38 0h 8m
Power cycled?
2014-05-21 08:29:23 18d 4h 45m 2014-06-08 13:15:11 0h 7m
Same.
Again, I fail to see some obvious correlation, what am I missing?
Does anyone else see a similar pattern?
How to start debugging, if there's anythig that needs debugging?
A couple of points:

1) The connection between a probe (or anchor) and its controller doesn't have to be perfectly stable. It has to be good enough that probes report results in a timely fashion and can receive commands, but nothing beyond that.

2) For a single probe to see a network failure (with measurements using the default parameters), the failure has to last for at least 10 minutes; that way a couple of measurements have a chance to report on it. In contrast, the connection between a probe and the controller is already terminated if the network is down for one minute.

3) When a target is measured by many probes, it is likely that at least some of them will pick up an event. But from one probe on its own, it is hard to say anything.

4) Version 1 probes tend to reboot after losing the connection to the controller, due to memory fragmentation issues. That is unfortunate, but we can't really do anything about it. Version 3 probes and anchors just report their results a little later.

Philip
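Philip's second point can be turned into a toy triage rule for the short disconnects in the logs above. The one-minute and ten-minute thresholds are the ones he quotes; the function name and labels are invented for illustration:

    def classify_outage(minutes):
        """Rough reading of an outage duration, per the thresholds above."""
        if minutes < 1:
            return "control connection may even survive it"
        if minutes < 10:
            return "shows as a disconnect, but default measurements from a single probe likely miss it"
        return "long enough for a couple of default measurements to report it"

    # Disconnect durations (minutes) for probe 6009, from Wilfried's log.
    for m in (7, 6, 2, 11):
        print("%2d min -> %s" % (m, classify_outage(m)))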
Just a thought: are you connected to a "green" switch that might be dropping power on idle ports? If the probe can't handle that situation, it disconnects from the network and the process starts over.

Bryan Socha
Network Engineer
DigitalOcean
On 7/17/2014 at 6:03 PM Philip Homburg wrote:

|Hi Wilfried,
| [major snip]
|
| In contrast, the connection between a probe and the
| controller is already terminated if the network is
| down for one minute.
=============

Pulling that one sentence out of a long-ish reply. If I read that correctly:

If a probe loses contact with its controller for 60 seconds, then the controller considers the probe offline and starts a disconnected timer.

Is that a correct interpretation?
On 2014/07/17 19:01, Mike. wrote:
If a probe loses contact with its controller for 60 seconds, then the controller considers the probe offline and starts a disconnected timer.
More like: the controller 'pings' the probe every 20 seconds, and after 3 missed responses the connection is terminated.

And for the Atlas system as a whole, that works. But the goal of the Atlas system is not to have a probe connected for as long as possible.

Philip
Hi Philip + Team,

Philip Homburg wrote:

first of all, thanks for investigating!

[...]
More like, the controller 'pings' the probe every 20 seconds and after 3 missed responses the connection is terminated.
And for the Atlas system as a whole, that works. But the goal of the Atlas system is not to have a probe connected as long as possible.
That's fully understood. I still have a couple of questions :-)

1) if I understand correctly, the decision to label a probe "disconnected" is made by the associated collector, based on pings? (btw: "real" ICMP pings, or internal ones over the channel?)

2) if that's the case, is there an easy way to find out to which collector a probe is "assigned"? (is this static or dynamic?)

3) if a probe, in particular an anchor, gets updated with a new firmware, is it possible that the ethernet IF does *not* go down? (Note, the 6009 is an old, big, beta box! Is there a difference with the new soekris probes?)
Just to be very clear: I just want to understand how to interpret things, because I already had an issue with one of my V1 probes, and in the end it turned out that the USB power feed was just borderline; the problem was gone after replacement.

And as an ISP and backbone operator, seeing stuff as "down" or "disconnected" without a good explanation starts to itch after a while :-)

All the best, have a nice weekend,
Wilfried.
On 2014/07/18 12:12, Wilfried Woeber wrote:
Hi Philip + Team,
Philip Homburg wrote:
first of all thanks for investigating!
No problem. I was also curious myself why 'normal' probes would disconnect. Most time is spent looking at the exceptions.
That's fully understood.
I'm still having a couple of questions :-)
1) if I understand correctly, the decision to label a probe "disconnected" is made by the associated collector, based on pings? (btw: "real" ICMP pings, or internal ones over the channel?)
Connected/disconnected is based on whether a probe has an ssh connection to a controller. There is a keepalive mechanism within the ssh protocol to see if the other end is still there. That ssh mechanism is used to abort the connection. Nothing to do with real (ICMP) pings.
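For reference, the 20-second/3-misses behaviour Philip described earlier maps directly onto OpenSSH's server-side keepalive options. A sketch in sshd_config syntax, purely as an illustration of the mechanism; nothing in this thread says the controllers actually run OpenSSH or use exactly these values:

    # Illustrative sshd_config fragment (assumed OpenSSH; not the actual
    # Atlas controller configuration). These keepalives travel inside the
    # encrypted channel, so they are unrelated to ICMP pings.
    ClientAliveInterval 20    # probe the client after 20s of silence
    ClientAliveCountMax 3     # give up after 3 unanswered keepalives
    # => a dead connection is torn down after roughly 60 seconds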
2) if that's the case, is there an easy way to find out to which collector a probe is "assigned"? (is this static or dynamic?)
I don't know why, but that information is not shown to normal users. Of course, if you can capture traffic, you can easily find out :-) The assignment is dynamic.
3) if a probe, in particular an anchor, gets updated with a new firmware, is it possible that the ethernet IF does *not* go down? (Note, the 6009 is an old, big, beta box! Is there a difference with the new soekris probes?)
On regular probes a firmware upgrade always involves a reboot. On anchors the Atlas 'firmware' is an rpm. There is no reason to reboot the box or bring its interface down to upgrade the Atlas rpm.
Just to be very clear: I just want to understand how to interpret things, because I already had an issue with one of my V1 probes, and in the end it turned out that the USB power feed was just borderline; the problem was gone after replacement.
Yes, it is good to keep an eye on those things. We can only look at probes statistically or in response to tickets, mail, etc.
And as an ISP and backbone operator, seeing stuff as "down" or "disconnected", without a good explanation, starts to itch after a while :-)
I think the best page to look at is the 'Result from Built-in Measurements'. If those graphs look fine, then there is no real reason to worry, unless the probe keeps connecting and disconnecting multiple times a day, or something like that.
On 18/07/14 14:23, Philip Homburg wrote:
3) if a probe, in particular an anchor, gets updated with a new firmware, is it possible that the ethernet IF does *not* go down? (Note, the 6009 is an old, big, beta box! Is there a difference with the new soekris probes?)
On regular probes a firmware upgrade always involves a reboot. On anchors the Atlas 'firmware' is an rpm. There is no reason to reboot the box or bring its interface down to upgrade the Atlas rpm.
To add to this point:

When we roll out a new probe firmware to the Anchors, the software restarts (and will disconnect from and reconnect to the controller), but nothing happens to the system OS: it will not result in a reboot or a network interface flap.

However, separately from the probe firmware, we also patch and maintain the Anchor system OS. This happens during a weekly scheduled maintenance window: Tuesdays between 14:00 and 15:00 UTC (at least we try; sometimes the window overruns a little!)

This may result in system services (such as the DNS and HTTP servers on the Anchors) restarting, and if there is a kernel update available, we will also reboot the Anchor during this window. This will result in the network interface going down and up during the reboot.

This applies to both the V1 Dell Anchors and the V2 Soekris Anchors.

Hope this helps,

Cheers,
Colin

--
Colin Petrie
Systems Engineer
RIPE NCC
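One practical consequence: an Anchor disconnect that starts inside that window is probably just maintenance. A small Python sketch for pre-filtering disconnect timestamps; the Tuesday 14:00-15:00 UTC window comes from Colin's mail, while the function name and the 30-minute slack for overruns are my own additions:

    from datetime import datetime

    def in_anchor_maintenance_window(ts, slack_minutes=30):
        """True if a UTC timestamp falls in the Tuesday 14:00-15:00 UTC
        maintenance window (plus slack, since the window can overrun)."""
        if ts.weekday() != 1:      # Monday == 0, so Tuesday == 1
            return False
        minutes = ts.hour * 60 + ts.minute
        return 14 * 60 <= minutes <= 15 * 60 + slack_minutes

    print(in_anchor_maintenance_window(datetime(2014, 7, 15, 14, 23)))  # True
    print(in_anchor_maintenance_window(datetime(2014, 7, 14, 14, 23)))  # False (a Monday)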
On 7/18/2014 at 6:11 PM Colin Petrie wrote:
| [ snip ]
=============

Many thanks to you and Philip for providing the technical overview messages. I like to understand what I see on the status page for the probe I host, and the overviews have been most helpful towards that end.

Thanks again.
Hello,

To provide a bit of background information about $subject:

In order to receive reports from the probes, and to deliver (measurement) commands to them, we maintain a bidirectional channel from the probe to the infrastructure. At the moment this uses SSH. We consider the probe to be "connected" as long as this channel is open, and "disconnected" when it's not. Note that this is only an indicator of the probe's stability, not a precise quality metric.

Said connections can break for a number of reasons: administrative actions, probe power loss, power cycles, path problems between the probe and the infrastructure (including the NAT box, if applicable), and infrastructure availability. For example, every now and then we disconnect the probes to make them upgrade, or we have to reboot the server the probe is connected to. All these events show up as disconnects.

The disconnection time mostly depends on the reason for the disconnect: for example, a probe reboot can be done in seconds, a firmware upgrade takes something like 5-15 minutes, and a controller reboot can cause up to 2 hours of non-connectedness.

Finally: as Philip mentioned, we don't optimise for high connection times. The probes execute the pre-scheduled measurements even if they are not connected to the infrastructure.

Regards,
Robert
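Robert's duration figures also suggest a rough first-pass guess at the cause of a disconnect of a given length. The thresholds below simply restate his numbers (seconds for a probe reboot, 5-15 minutes for a firmware upgrade, up to about 2 hours for a controller reboot); the function and labels are invented for illustration, and real causes can of course differ:

    def guess_disconnect_cause(seconds):
        """Very rough triage based on the durations quoted above."""
        if seconds < 60:
            return "probe reboot or brief glitch"
        if seconds < 5 * 60:
            return "short network problem"
        if seconds <= 15 * 60:
            return "could be a firmware upgrade"
        if seconds <= 2 * 3600:
            return "could be a controller reboot"
        return "longer outage; worth checking locally"

    for s in (30, 7 * 60, 33 * 60, 90 * 60):
        print("%5d s -> %s" % (s, guess_disconnect_cause(s)))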
Robert Kisteleki wrote:
Hello,
To provide a bit of background information about $subject:
Thanks to all who added to the picture, really appreciated!

I'm feeling considerably more comfortable now seeing some "disconnects"[1] for a few minutes, every now and then :-)

Still, I'd like to see, eventually, some sort of alerts "from the system" if the frequency of the disconnects grows outside of some sanity envelope. To be chatted about during the next RIPE meeting, maybe.

Have a nice summer period, everyone,
Wilfried

[1] Given that background, I wonder about the usefulness of the top-up-time page, when this is not under the control of the probe host ;-)
participants (9)

- Aftab A. Siddiqui
- Bryan Socha
- Colin Petrie
- Jeroen van der Ham
- Mike.
- Philip Homburg
- Robert Kisteleki
- Ross Weseloh
- Wilfried Woeber