Re: [dns-wg] Solving lameness in the reverse zones

23 Oct 2009

      Håvard,

Thanks for your reply. Clearly some additional thinking is necessary...

On Thu, 2009-10-22 at 23:59 +0200, Havard Eidnes wrote:
...
...
> > So, I propose we modify the current process to work something like this:
> >      1. Tell users that their delegations are lame.
> >      2. Wait, then tell them again if not fixed.
> >      3. Wait, then PULL THE DELEGATION if not fixed.
> One interpretation could be "pull the delegation to the lame name
> server, but leave the working ones in place".
Yes, that is the idea I was going for. Apologies for being unclear.
...
> Do note, though, that if the zone itself still lists the lame
> server in its NS RRset, that RRset will override the NS RRset
> received from the delegating zone, since the latter is non-
> authoritative information, and recursive name servers may think
> it's a good idea to validate the NS RRset from one of the
> authoritative name servers.  So...  It's not a given that
> removing the delegation record for the lame name server will
> actually make much of a difference.
You bring up a very good point.

AFAIK the lameness checking at the RIPE NCC only looks at things from
the parent point of view. There is a different class of error, which you
touch on here, which is mismatch between parent NS RRSET and child
(authoritative) NS RRSET. This has not been discussed.

NS RRSET Mismatches
-------------------
A mismatch can be one of three types:

     1. NS in parent not in child
     2. NS in child not in parent - server is not lame
     3. NS in child not in parent - server is lame
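
The three mismatch types can be sketched in a few lines of Python. This is
only an illustration: the function name, the enum, and the assumption that a
set of lame servers has already been computed by some separate check are all
hypothetical, and say nothing about how the RIPE NCC checker is built.

```python
from enum import Enum


class Mismatch(Enum):
    PARENT_ONLY = 1      # case 1: NS in parent not in child
    CHILD_ONLY_OK = 2    # case 2: NS in child not in parent, server is not lame
    CHILD_ONLY_LAME = 3  # case 3: NS in child not in parent, server is lame


def _norm(name):
    # DNS names compare case-insensitively; ignore a trailing root dot.
    return name.lower().rstrip(".")


def classify_mismatches(parent_ns, child_ns, lame_servers):
    """Map each name server found in only one NS RRset to its mismatch type.

    parent_ns, child_ns: iterables of NS target names.
    lame_servers: names already found lame by a separate check (assumed input).
    """
    parent = {_norm(n) for n in parent_ns}
    child = {_norm(n) for n in child_ns}
    lame = {_norm(n) for n in lame_servers}

    result = {}
    for name in parent - child:
        result[name] = Mismatch.PARENT_ONLY
    for name in child - parent:
        if name in lame:
            result[name] = Mismatch.CHILD_ONLY_LAME
        else:
            result[name] = Mismatch.CHILD_ONLY_OK
    return result
```

Servers that appear in both sets are not mismatches and are left out of the
result.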

The first case is a sort of lameness, and actually quite easily
detected. I think that it can be covered exactly as any other sort of
lameness. (It is possible for a name server listed in the parent to
answer correctly even though it is not listed in the NS set of the
child; this may happen during a migration for example. I don't think
that affects this discussion, but I thought I would mention it.)

The second case is not lameness, but is an incorrectness at the parent.
Again, if data accuracy is our goal (and for this proposal we assume
that it is), then we must fix it, somehow. I propose the same algorithm
as for removing lame delegations: warn, warn, update. Except in this
case "update" means adding the appropriate NS.
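
The warn, warn, update cycle can be written down as a tiny decision function.
A minimal sketch in Python, assuming one warning per failed check run and the
update on the third consecutive failure; the function name and the action
strings are made up for illustration, and the length of each wait is left
open, as it is in the proposal.

```python
def next_step(consecutive_failures, final_update):
    """Pick the action for a delegation after a check run.

    consecutive_failures: how many check runs in a row found the problem.
    final_update: what "update" means for this kind of problem, for
                  example "remove NS" for a lame server or "add NS" for
                  a working server missing from the parent.
    """
    if consecutive_failures <= 0:
        return "none"        # problem fixed or never seen
    if consecutive_failures <= 2:
        return "warn"        # first and second notice
    return final_update      # third strike: change the parent zone


# Lame server in the parent: warn, warn, then remove it.
lame_steps = [next_step(n, "remove NS") for n in (1, 2, 3)]
# Working server missing from the parent: warn, warn, then add it.
missing_steps = [next_step(n, "add NS") for n in (1, 2, 3)]
```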

The third case is the tricky one. We have no good solution here. If we
cared about user experience, then we would eliminate the NS from the
parent RRSET, because that will result in a slightly better average
query pattern. However, we do not care about users, we care about data,
so it is difficult to say what the best way forward is. (See more
below.)
...
> Or perhaps you meant "remove the entire delegation"?  It sounds
> kind of drastic...
It is drastic, but in the 3rd case we have no good options. Since we
care about data accuracy, we may need drastic measures.

We have two possible approaches:

      * We follow the normal lameness process for the lame server: warn,
        warn, delete. Then we must spam the administrator every time we
        re-run our check until the zone is fixed. Yes, it is annoying
        and not likely to get things fixed, but for the sake of the
        data, it is necessary that we try.
      * Otherwise, yes, we simply remove the entire delegation. One
        could argue that "we have killed the patient to cure the
        disease", but please keep in mind that data consistency is the
        goal.

If I were the administrator for the child zone, I would actually prefer
the second option (spam is annoying). It is also better because it
results in a correct parent zone. But I leave it up to the working group
to decide.

Thankfully there is no glue in the reverse tree, so we can ignore that
class of mismatch. :)

But I am reminded of another missing point in our quest for correctness:
NS with partial lameness.

NS with Partial Lameness
------------------------
In this case, we have something like this:

2.0.192.in-addr.arpa NS ns1.example.net.
                        ns2.example.net.
ns1.example.net   A     192.0.2.0         ; working server
                  A     192.0.2.1         ; broken server

What we have here is lameness caused by an NS record with multiple
addresses, only some of which are answering properly.

Since we have no control over this NS to A/AAAA mapping, we have the
same options as case #3 above: we can pester continuously or we can pull
the entire delegation.
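
Detecting this needs a per-address check rather than a per-name check. A
minimal sketch in Python, assuming some other code has already queried each
address and recorded whether it answered authoritatively; the function name
and the status strings are hypothetical.

```python
def ns_status(addresses, answers_ok):
    """Summarise one NS name from the results for each of its addresses.

    addresses: the A/AAAA addresses the NS name resolves to.
    answers_ok: dict mapping address -> True if that address answered
                authoritatively for the zone (assumed to be measured
                elsewhere). Addresses missing from the dict count as failing.
    """
    if not addresses:
        return "unresolvable"
    working = [a for a in addresses if answers_ok.get(a, False)]
    if len(working) == len(addresses):
        return "working"
    if working:
        return "partially lame"
    return "lame"


# The ns1.example.net case above: one good address, one broken one.
status = ns_status(
    ["192.0.2.0", "192.0.2.1"],
    {"192.0.2.0": True, "192.0.2.1": False},
)
```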

Reading my proposals here, one might get the idea that I don't support
the idea of data correctness as the correct philosophy for DNS lameness
checking. You are correct. In a sense, this is a sort of reductio ad
absurdum discussion:

http://en.wikipedia.org/wiki/Reductio_ad_absurdum

If you begin with the premise that data quality is important as an end
goal, rather than starting with the premise that data quality is
important only when it helps people, you have no way to measure when a
technique for improving data quality is simply not worth the bother.

HOWEVER, I do accept the possibility that the community may say "damn
the torpedoes, full speed ahead!"(*) If we're going to go for data
quality, let's not be half-assed(**), let's get it right this time. :)

--
Shane

(*) Excuse the Americanism, but it seems somehow appropriate:
    http://tinyurl.com/damn-the-torpedoes

(**) Another Americanism, also equally appropriate IMHO:
     http://www.urbandictionary.com/define.php?term=half-assed

Shane Kerr