Anand Buddhdev wrote:
Gilles Massen wrote:
Hello Gilles,
I was wondering if there is a detailed description of what is tested, and how, in order to check the lameness of a server (i.e. how the names are resolved, the timeouts and retransmits of the queries, the checks made, ...)? Any pointer would be welcome.
This is currently not documented.
It needs to be documented for at least 2 reasons that come immediately to mind. First, it will allow operators to perform the same tests that you(pl) are performing, thus hopefully catching problems before they happen. Second, it will allow for peer review of the procedure. Given that there are a non-trivial number of operators reporting that your system is generating false positives, I think both of these goals are relevant. In an effort to be helpful (not critical) I will ask some questions about the details you've left out below. Hopefully fleshing this process out can help everyone involved.
However, I can provide a quick explanation here.
The first phase of the lameness checks involves generating a canonical list of name servers for a zone.
How is this done? IOW, what sources are queried to generate the list?
The process gathers all the name servers, and queries each name once for A and AAAA records, with a 3-second timeout.
What source is queried? If a query times out, are other authoritative sources for that name tried? I would say offhand that 3 seconds is probably not long enough for a timeout, but it might be acceptable if more than one source is tried (for example, if you're looking for the address records for ns1.example.com and you try all of the name servers for example.com).
Once it has a complete list of zone and nameserver address pairs,
I think I know what you mean by this, but I'm not sure. Could you please flesh this out?
it queries each address for a SOA record for the zone, with a 3-second timeout. If a particular address yields no response, it is queued, and queried up to 4 more times at varying intervals.
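As a rough illustration of the step described above (not the actual implementation, which is undocumented), here is a minimal Python sketch using only the standard library. The names `build_query`, `send_udp` and `check_address` are hypothetical. The 3-second timeout and the 4 extra attempts come from the description; the "varying intervals" between attempts are not specified, so the sketch simply retries immediately. It also does not validate the response (matching ID, authoritative-answer flag, and so on).

```python
import socket
import struct

QTYPE_SOA = 6   # RR type code for SOA (RFC 1035)
QCLASS_IN = 1   # Internet class


def build_query(zone, qtype=QTYPE_SOA, query_id=0x1234):
    """Encode a minimal DNS query: 12-byte header plus one question."""
    # Header fields: ID, flags (0 = standard query, recursion not requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    labels = [label for label in zone.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)


def send_udp(address, payload, timeout):
    """Send one UDP query to port 53 and wait up to `timeout` seconds."""
    family = socket.AF_INET6 if ":" in address else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, (address, 53))
        data, _peer = sock.recvfrom(4096)
        return data


def check_address(zone, address, send=send_udp, timeout=3.0, retries=4):
    """Return the raw response bytes, or None if every attempt timed out."""
    query = build_query(zone)
    # One initial attempt plus `retries` further attempts.
    # Assumption: no delay between attempts (the real intervals are unknown).
    for _attempt in range(1 + retries):
        try:
            return send(address, query, timeout)
        except socket.timeout:
            continue
    return None
```

Passing `send` as a parameter is only a convenience so the retry logic can be exercised without touching the network, for example with a stub that always raises `socket.timeout`.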
That sounds a little more reasonable, although if it were me I would double the timeout on each successive query.
Hope this helps,
Doug
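To make the doubling suggestion concrete, here is a small Python sketch of the resulting timeout schedule. The function name `doubling_timeouts` is hypothetical, and it assumes the schedule starts at the 3-second timeout mentioned in the thread and covers one initial query plus 4 retries; neither the starting value nor any upper cap is specified in the suggestion itself.

```python
def doubling_timeouts(initial=3.0, attempts=5):
    """Timeout in seconds for each attempt, doubling every time."""
    # Assumption: the first attempt uses `initial`, and there is no cap.
    return [initial * 2 ** i for i in range(attempts)]


schedule = doubling_timeouts()
print(schedule)       # [3.0, 6.0, 12.0, 24.0, 48.0]
print(sum(schedule))  # 93.0 seconds in the worst case, if every attempt times out
```

Compared with five fixed 3-second attempts (15 seconds in total), this gives a slow or distant server far longer to answer before it is declared unresponsive, at the cost of a longer worst-case check.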