Re: [atlas] some thoughts and question regarding probe "stability"

18 Jul 2014

On 2014/07/18 12:12, Wilfried Woeber wrote:
...
Hi Philip + Team,
Philip Homburg wrote:
first of all thanks for investigating!
No problem. I was also curious myself why 'normal' probes would
disconnect. Most time is spent looking at the exceptions.
...
[...]
...
More like, the controller 'pings' the probe every 20 seconds and after 3
missed responses the connection is terminated.
And for the Atlas system as a whole, that works. But the goal of the
Atlas system is not to have a probe connected as long as possible.
That's fully understood.
I'm still having a couple of questions :-)
1) if I do understand correctly, the decision to label a probe "disconnected"
   is made by the associated collector, based on pings? (btw. - "real" pings
   on ICMP or internal over the channel?)
Connected/disconnected is based on whether a probe has a ssh connection
to a controller. There is a keepalive mechanism within the ssh protocol
to see if the other end is still there. That ssh mechanism is used to
abort the connection. Nothing to do with real (ICMP) pings.
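
To make the timing concrete, here is a minimal Python sketch of a keepalive counter using the numbers quoted earlier in this thread (a check every 20 seconds, connection dropped after 3 missed responses). It is only a model, not the controller's code: the names KeepaliveMonitor and seconds_until_disconnect are made up for illustration, and it assumes that the misses must be consecutive and that a reply resets the counter. OpenSSH's sshd exposes comparable settings (ClientAliveInterval, ClientAliveCountMax), but whether the Atlas controllers use exactly those is an assumption.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveMonitor:
    """Hypothetical model of an ssh-style keepalive counter."""
    interval_s: int = 20   # seconds between keepalive checks (from the thread)
    max_missed: int = 3    # missed responses before the connection is dropped
    missed: int = 0
    connected: bool = True

    def tick(self, got_reply: bool) -> bool:
        """Process one keepalive round; return True while still connected."""
        if not self.connected:
            return False
        if got_reply:
            # Assumption: any reply resets the miss counter.
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.connected = False
        return self.connected


def seconds_until_disconnect(replies, interval_s=20, max_missed=3):
    """Return the elapsed time at which the connection is dropped, or None.

    `replies` holds one boolean per keepalive round (True = reply seen).
    """
    monitor = KeepaliveMonitor(interval_s, max_missed)
    for round_no, got_reply in enumerate(replies, start=1):
        if not monitor.tick(got_reply):
            return round_no * interval_s
    return None


if __name__ == "__main__":
    # A probe that goes silent immediately is dropped in the third round.
    print(seconds_until_disconnect([False, False, False]))
    # Isolated misses never reach the limit, so the probe stays connected.
    print(seconds_until_disconnect([False, True, False, False, True]))
```

Under these assumptions a probe that stops answering is marked disconnected roughly one minute later, while a single lost keepalive has no visible effect.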
...
2) if that's the case, is there an easy way to find out to which collector a
   probe is "assigned"? (is this static or dynamic?)
I don't know why, but that information is not shown to normal users. Of
course, if you can capture traffic, you can easily find out :-)

The assignment is dynamic.
...
3) if a probe, in particular an anchor, gets updated with a new firmware, is
   it possible that the ethernet IF does *not* go down? (Note, the 6009 is an
   old, big, beta box! Is there a difference with the new soekris probes?)
On regular probes a firmware upgrade always involves a reboot. On
anchors the Atlas 'firmware' is an rpm. There is no reason to reboot the
box or bring its interface down to upgrade the Atlas rpm.
...
Just to be very clear, I just want to understand how to interpret things,
'cause I already had an issue with one of my v1 probes, and in the end it
turned out that the USB power feed was just borderline, problem gone after
replacement.
Yes it is good to keep an eye on those things. We can only look at
probes statistically or in response to tickets, mail, etc.
...
And as an ISP and backbone operator, seeing stuff as "down" or "disconnected",
without a good explanation, starts to itch after a while :-)
I think the best page to look at is the 'Result from Built-in
Measurements'. If those graphs look fine, then there is no real reason
to worry. Unless the probe keeps connecting and disconnecting multiple
times a day or something like that.
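
For anyone who keeps their own log of disconnect times, the "multiple times a day" check could be sketched roughly as below. This is a hypothetical helper, not something the Atlas system provides: the function names, the threshold of 3 per day, and the idea of feeding it a plain list of timestamps are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def disconnects_per_day(disconnect_times):
    """Count disconnect events per calendar day."""
    return Counter(ts.date() for ts in disconnect_times)


def flapping_days(disconnect_times, threshold=3):
    """Return the days with at least `threshold` disconnects, sorted.

    The default threshold is an arbitrary illustrative value.
    """
    counts = disconnects_per_day(disconnect_times)
    return sorted(day for day, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Made-up timestamps, purely to show the call.
    events = [
        datetime(2014, 7, 17, 3, 10),
        datetime(2014, 7, 18, 1, 0),
        datetime(2014, 7, 18, 9, 30),
        datetime(2014, 7, 18, 22, 45),
    ]
    print(flapping_days(events))
```

A day that shows up in the result would be worth a closer look; an occasional single disconnect would not.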