On 2014/07/18 12:12, Wilfried Woeber wrote:
Hi Philip + Team,
Philip Homburg wrote:
first of all thanks for investigating!
No problem. I was also curious myself why 'normal' probes would disconnect. Most time is spent looking at the exceptions.
[...]
More like, the controller 'pings' the probe every 20 seconds and after 3 missed responses the connection is terminated.
And for the Atlas system as a whole, that works. But the goal of the Atlas system is not to have a probe connected as long as possible.
That's fully understood.
I'm still having a couple of questions :-)
1) if I understand correctly, the decision to label a probe "disconnected" is made by the associated collector, based on pings? (btw. - "real" pings on ICMP or internal over the channel?)
Connected/disconnected is based on whether a probe has an ssh connection to a controller. There is a keepalive mechanism within the ssh protocol to see if the other end is still there. That ssh mechanism is used to abort the connection. It has nothing to do with real (ICMP) pings.
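To make the keepalive timing concrete, here is a minimal sketch of the rule described above (one keepalive every 20 seconds, connection dropped after 3 missed responses). This is not the Atlas controller code. The function name `disconnect_time` and its parameters are made up for illustration, and it assumes the 3 misses have to be consecutive, which is how ssh keepalive counters normally behave. The returned time is only approximate (the moment the last unanswered keepalive was sent).

```python
from typing import Optional, Sequence


def disconnect_time(replies: Sequence[bool],
                    interval: int = 20,
                    max_missed: int = 3) -> Optional[int]:
    """Return the approximate time in seconds at which the connection
    would be dropped, or None if it survives all keepalive rounds.

    replies[i] is True if the probe answered keepalive number i+1.
    Hypothetical helper; interval and max_missed are taken from the
    figures quoted in this thread.
    """
    missed = 0
    for round_no, answered in enumerate(replies, start=1):
        if answered:
            # Any answer resets the counter (assumption: misses must
            # be consecutive to count).
            missed = 0
        else:
            missed += 1
            if missed >= max_missed:
                return round_no * interval
    return None


if __name__ == "__main__":
    # Probe goes silent right away: dropped around the 60 second mark.
    print(disconnect_time([False, False, False]))
    # Two misses, then a reply: the connection stays up.
    print(disconnect_time([False, False, True, False]))
```

So with these numbers a probe whose link goes away entirely is shown as disconnected after roughly a minute, while an occasional lost keepalive does not cause a disconnect.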
2) if that's the case, is there an easy way to find out to which collector a probe is "assigned"? (is this static or dynamic?)
I don't know why, but that information is not shown to normal users. Of course, if you can capture traffic, you can easily find out :-) The assignment is dynamic.
3) if a probe, in particular an anchor, gets updated with a new firmware, is it possible that the ethernet IF does *not* go down? (Note, the 6009 is an old, big, beta box! Is there a difference with the new soekris probes?)
On regular probes a firmware upgrade always involves a reboot. On anchors the Atlas 'firmware' is an rpm. There is no reason to reboot the box or bring its interface down to upgrade the Atlas rpm.
Just to be very clear, I just want to understand how to interpret things, 'cause I already had an issue with one of my v1 probes, and in the end it turned out that the USB power feed was just borderline, problem gone after replacement.
Yes, it is good to keep an eye on those things. We can only look at probes statistically or in response to tickets, mail, etc.
And as an ISP and backbone operator, seeing stuff as "down" or "disconnected", without a good explanation, starts to itch after a while :-)
I think the best page to look at is the 'Result from Built-in Measurements'. If those graphs look fine, then there is no real reason to worry, unless the probe keeps connecting and disconnecting multiple times a day or something like that.