On 03/19/2017 08:09 AM, Shane Kerr wrote:
Doug,
At 2017-03-18 18:34:25 -0700 Doug Barton <dougb@dougbarton.email> wrote:
On 03/18/2017 08:46 AM, Anand Buddhdev wrote:
Dear colleagues,
This is a follow-up to our message of Friday about issues with some reverse delegations.
After doing a thorough analysis, and with the help of ARIN staff, we found more issues with our zonelet generation code
Can you say more about the benefit of this "zonelet" system vs. ARIN simply delegating the appropriate zones to you, and you managing them like any other DNS zone?
I do appreciate you keeping the community informed about the causes of the outage, but it seems that at least part of the root cause is that you're operating what sounds like a fairly fragile system in the first place, with (de facto) insufficient validity checking.
I was at the RIPE NCC when we adopted the zonelet approach, although I haven't worked for them for over a decade.
The zonelet system was designed to allow reverse DNS for IPv4 space that was originally assigned to one RIR but was later partially migrated to other RIRs. This happened when LACNIC and AfriNIC were formed, although I think that an audit was done at the time and so space was moved around between all 5 RIRs.
The problem is that we could have a delegation like 999.in-addr.arpa going to the RIPE NCC and then 888.999.in-addr.arpa being managed by ARIN... but we want 888.999.in-addr.arpa to point to the **address holder's** name servers, not **ARIN's** name servers.
So ARIN needs a way to get the information about the name servers to the RIPE NCC somehow (and RIPE NCC to LACNIC, and so on). Zonelets are used for this; a zonelet is basically just the NS records needed, probably picked up using SSH.
I think that we discussed using dynamic DNS (DDNS) for this at the time, but decided that the simplest & best solution was zonelets.
DNAME could be used, but it would involve an extra lookup for resolvers, right? (DNAME was pretty new when zonelets were adopted, and I don't know that BIND 8, which was still the most popular DNS server at that time, supported it.)
My guess is that the bugs are probably more due to ancient Perl code than an overly-complicated system for exchanging this information. Heck, it's possible that the bugs are due to MY ancient Perl code, although I really don't remember who wrote or tested the code....
Thank you, Shane for the explanation, which makes perfect sense.
RIPE folks, the operational answer to this problem would seem to be having ARIN implement a sanity check such that if more than N% of the information is changed in a given pass that humans need to get involved to approve the change. I had a lovely chat with John Curran about that on NANOG, which you can see starting here:
https://mailman.nanog.org/pipermail/nanog/2017-March/090626.html
Short version, they won't do anything differently unless you specifically ask them to.
We all make mistakes, and I have no doubts that y'all have done your best to find/fix the bugs that created the most recent problem. But I've used similar sanity check systems in the past with good success. Everyone makes mistakes, and there is no shame to a "belt and braces" approach to critical infrastructure like this. I hope that you'll consider it.
Doug
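For illustration only, the "more than N% changed in a given pass" check suggested above could look roughly like the sketch below. It is not anyone's actual code: the function and class names are hypothetical, the zone and name server names are placeholders, the 10% default merely stands in for the unspecified N, and it assumes each pass can be reduced to a mapping from zone name to its set of NS targets.

```python
from dataclasses import dataclass

# Assumed representation: zone name -> set of name server names for that zone.
Delegations = dict[str, frozenset[str]]


@dataclass(frozen=True)
class ChangeReport:
    added: int      # zones present only in the proposed data
    removed: int    # zones present only in the previous data
    modified: int   # zones in both whose NS set differs
    baseline: int   # number of zones in the previous data

    @property
    def changed(self) -> int:
        return self.added + self.removed + self.modified

    @property
    def percent_changed(self) -> float:
        # With no previous data, any content at all counts as a 100% change.
        if self.baseline == 0:
            return 100.0 if self.changed else 0.0
        return 100.0 * self.changed / self.baseline


def compare(previous: Delegations, proposed: Delegations) -> ChangeReport:
    """Count how many delegations differ between two passes."""
    prev_zones, new_zones = set(previous), set(proposed)
    modified = sum(
        1 for zone in prev_zones & new_zones if previous[zone] != proposed[zone]
    )
    return ChangeReport(
        added=len(new_zones - prev_zones),
        removed=len(prev_zones - new_zones),
        modified=modified,
        baseline=len(previous),
    )


def needs_human_approval(
    previous: Delegations, proposed: Delegations, max_percent: float = 10.0
) -> bool:
    """True when the pass changes more than max_percent of the previous data."""
    return compare(previous, proposed).percent_changed > max_percent


if __name__ == "__main__":
    # Placeholder data: ten zones, each with the same two name servers.
    ns = frozenset({"ns1.example.net", "ns2.example.net"})
    previous = {f"{i}.999.in-addr.arpa": ns for i in range(10)}
    proposed = dict(list(previous.items())[:8])  # two delegations vanish
    if needs_human_approval(previous, proposed):
        print("Change exceeds threshold; hold for manual review.")
```

The point of the design is that a routine pass, where a handful of delegations change, goes through untouched, while a pass that drops or rewrites a large share of the data is held until a person confirms it.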