Håvard,

Thanks for your reply. Clearly some additional thinking is necessary...

On Thu, 2009-10-22 at 23:59 +0200, Havard Eidnes wrote:
So, I propose we modify the current process to work something like this:
1. Tell users that their delegations are lame.
2. Wait, then tell them again if not fixed.
3. Wait, then PULL THE DELEGATION if not fixed.
One interpretation could be "pull the delegation to the lame name server, but leave the working ones in place".
Yes, that is the idea I was going for. Apologies for being unclear.
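To make the intended reading concrete, here is a rough sketch of the warn, warn, pull sequence where "pull" removes only the lame server's NS record and leaves the working ones in place. The function names and the fixed count of two warnings are my own illustration, not a description of how the RIPE NCC tooling works, and the waiting periods between steps are left out entirely.

```python
def next_action(warnings_sent, still_lame):
    """Decide the next step for one lame name server (illustrative only).

    warnings_sent: how many warnings have already gone out (assumed counter)
    still_lame:    result of the most recent lameness check
    """
    if not still_lame:
        return "none"            # fixed in the meantime, nothing to do
    if warnings_sent < 2:
        return "warn"            # steps 1 and 2: tell them, then tell them again
    return "remove_lame_ns"      # step 3: pull the delegation to that server


def apply_removal(parent_ns, lame):
    """Return the parent NS list with only the lame servers removed."""
    return [ns for ns in parent_ns if ns not in lame]
```

For example, `apply_removal(["ns1.example.net.", "ns2.example.net."], {"ns2.example.net."})` leaves only `ns1.example.net.` in the parent.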
Do note, though, that if the zone itself still lists the lame server in its NS RRset, that RRset will override the NS RRset received from the delegating zone, since the latter is non-authoritative information, and recursive name servers may think it's a good idea to validate the NS RRset from one of the authoritative name servers. So... It's not a given that removing the delegation record for the lame name server will actually make much of a difference.
You bring up a very good point.

AFAIK the lameness checking at the RIPE NCC only looks at things from the parent point of view. There is a different class of error, which you touch on here, which is mismatch between parent NS RRSET and child (authoritative) NS RRSET. This has not been discussed.

NS RRSET Mismatches
-------------------

A mismatch can be one of three types:

1. NS in parent not in child
2. NS in child not in parent - server is not lame
3. NS in child not in parent - server is lame

The first case is a sort of lameness, and actually quite easily detected. I think that it can be covered exactly as any other sort of lameness.

(It is possible for a name server listed in the parent to answer correctly even though it is not listed in the NS set of the child; this may happen during a migration for example. I don't think that affects this discussion, but I thought I would mention it.)

The second case is not lameness, but is an incorrectness at the parent. Again, if data accuracy is our goal (and for this proposal we assume that it is), then we must fix it, somehow. I propose the same algorithm as for removing lame delegations: warn, warn, update. Except in this case "update" means adding the appropriate NS.

The third case is the tricky one. We have no good solution here. If we cared about user experience, then we would eliminate the NS from the parent RRSET, because that will result in a slightly better average query pattern. However, we do not care about users, we care about data, so it is difficult to say what the best way forward is. (See more below.)
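For what it's worth, the three mismatch types are easy to tell apart mechanically once you have both NS sets and a list of servers already known to be lame. A minimal sketch follows, taking the list above literally. The names `Mismatch` and `classify_mismatches` are made up for illustration, and I assume the two NS sets and the lameness results have been collected by some other means (no DNS queries are made here).

```python
from enum import Enum


class Mismatch(Enum):
    PARENT_ONLY = 1          # case 1: NS in parent, not in child
    CHILD_ONLY_WORKING = 2   # case 2: NS in child, not in parent, not lame
    CHILD_ONLY_LAME = 3      # case 3: NS in child, not in parent, lame


def _norm(name):
    # Compare names case-insensitively and ignore a trailing dot.
    return name.lower().rstrip(".")


def classify_mismatches(parent_ns, child_ns, lame):
    """Map each NS that appears in only one of the two sets to its case.

    parent_ns, child_ns: iterables of name server names
    lame: iterable of name server names already found to be lame
    """
    parent = {_norm(n) for n in parent_ns}
    child = {_norm(n) for n in child_ns}
    lame_set = {_norm(n) for n in lame}

    result = {}
    for name in parent - child:
        result[name] = Mismatch.PARENT_ONLY
    for name in child - parent:
        if name in lame_set:
            result[name] = Mismatch.CHILD_ONLY_LAME
        else:
            result[name] = Mismatch.CHILD_ONLY_WORKING
    return result
```

Servers present in both sets are not reported at all, since those fall under ordinary lameness checking rather than mismatch.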
Or perhaps you meant "remove the entire delegation"? It sounds kind of drastic...
It is drastic, but in the 3rd case we have no good options. Since we care about data accuracy, we may need drastic measures. We have two possible approaches:

* We follow the normal lameness process for the lame server: warn, warn, delete. Then we must spam the administrator every time we re-run our check until the zone is fixed. Yes, it is annoying and not likely to get things fixed, but for the sake of the data, it is necessary that we try.

* Otherwise, yes, we simply remove the entire delegation. One could argue that "we have killed the patient to cure the disease", but please keep in mind that data consistency is the goal.

If I were the administrator for the child zone, I would actually prefer the second option (spam is annoying). It is also better because it results in a correct parent zone. But I leave it up to the working group to decide.

Thankfully there is no glue in the reverse tree, so we can ignore that class of mismatch. :)

But I am reminded of another missing point in our quest for correctness: NS with partial lameness.

NS with Partial Lameness
------------------------

In this case, we have something like this:

    2.0.192.in-addr.arpa   NS   ns1.example.net.
                                ns2.example.net.

    ns1.example.net        A    192.0.2.0    ; working server
                           A    192.0.2.1    ; broken server

What we have here is lameness caused by an NS record with multiple addresses, only some of which are answering properly. Since we have no control over this NS to A/AAAA mapping, we have the same options as case #3 above: we can pester continuously or we can pull the entire delegation.

Reading my proposals here, one might get the idea that I don't support the idea of data correctness as the correct philosophy for DNS lameness checking. You are correct.
In a sense, this is a sort of reductio ad absurdum discussion:

http://en.wikipedia.org/wiki/Reductio_ad_absurdum

If you begin with the premise that data quality is important as an end goal, rather than starting with the premise that data quality is important only when it helps people, you have no way to measure when a technique for improving data quality is simply not worth the bother.

HOWEVER, I do accept the possibility that the community may say "damn the torpedoes, full speed ahead!"(*) If we're going to go for data quality, let's not be half-assed(**), let's get it right this time. :)

--
Shane

(*) Excuse the Americanism, but it seems somehow appropriate:
    http://tinyurl.com/damn-the-torpedoes

(**) Another Americanism, also equally appropriate IMHO:
    http://www.urbandictionary.com/define.php?term=half-assed