Hello everyone,

thank you for your hard work on this. I think it's a well-written document.

More substantial feedback below relates to:
- TTL recommendations
- Selection of transport protocols
- EDNS Client Subnet (ECS)
- Missing mention of RFC 8906

More in-line below, including a couple of nits.

On 26. 11. 23 18:01, Shane Kerr wrote:
#### System Diversity

and my sometimes be hidden.

[Typo] "my" should be "may".
## DNS configuration knobs

### DNSSEC validation
[RFC9364](https://www.rfc-editor.org/rfc/rfc9364.html) provides a lot of useful information, and links to further documents about DNSSEC. However, operators usually do not need to know the details, and can simply ensure that DNSSEC validation is enabled in their software; this is usually enabled by default.
[Nit] Oh, I wish. E.g. PowerDNS does not enable validation by default, and that's not a small player. I propose to remove the "; this is usually enabled by default." part, as it might lead to sloppiness and IMHO does not really add much.
### DNS Transport Protocols
**UDP and TCP must be supported.**
For: ALL DNS resolver operators.
I like the capital ALL :-)
UDP is what most clients use, and TCP is necessary for DNS answers that are too large for a single UDP packet.
[Nit] Maybe mention that UDP messages over 512 bytes are also okay? And to check for stupid firewalls or something? Or maybe that's going into too much detail; I don't know.
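A minimal sketch of the UDP-then-TCP behaviour described above: a client looks at the TC ("truncated") bit in the header of a UDP answer and, if it is set, retries over TCP, where each DNS message is preceded by a 2-byte length field (RFC 1035 section 4.2.2, RFC 7766). The function names are hypothetical and not taken from any particular resolver.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit DNS header flags word


def needs_tcp_retry(udp_response: bytes) -> bool:
    """Return True if a UDP answer was truncated and should be retried over TCP."""
    if len(udp_response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    (flags,) = struct.unpack("!H", udp_response[2:4])
    return bool(flags & TC_FLAG)


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte length, as required on TCP (and DoT)."""
    if len(message) > 0xFFFF:
        raise ValueError("DNS message too large for the 2-byte length prefix")
    return struct.pack("!H", len(message)) + message
```

The same length-prefixed framing is what DoT carries inside TLS, which is one reason TCP support is a prerequisite for the encrypted transports discussed below.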
### Packet Fragmentation Avoidance
**Servers should be configured to avoid fragmentation.**
For: ALL DNS resolver operators.
Packet fragmentation can cause issues with DNS over UDP, especially over IPv6. These issues can be minimized by choosing implementations that set IP options to avoid this, and by taking care with EDNS0 message sizes.
Recommendations are available in [draft-ietf-dnsop-avoid-fragmentation](https://datatracker.ietf.org/doc/draft-ietf-dnsop-avoid-fragmentation/).
[Nit] I think linking to https://datatracker.ietf.org/doc/html/draft-ietf-dnsop-avoid-fragmentation is better, as it should point to the latest version of the document, even after it becomes an RFC.
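The arithmetic behind "taking care with EDNS0 message sizes" can be sketched as follows. It shows where the commonly used 1232-byte EDNS buffer size (the DNS Flag Day 2020 value) comes from: the IPv6 minimum MTU of 1280 bytes minus the 40-byte IPv6 header and the 8-byte UDP header. The `local_max` default and the function names are assumptions for illustration; the draft linked above has its own recommendations, which should take precedence.

```python
IPV4_HEADER = 20     # bytes, IPv4 header without options
IPV6_HEADER = 40     # bytes, fixed IPv6 header without extension headers
UDP_HEADER = 8       # bytes
IPV6_MIN_MTU = 1280  # bytes, RFC 8200


def max_unfragmented_payload(path_mtu: int, ipv6: bool) -> int:
    """Largest DNS message that fits into one UDP datagram on the given path MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return path_mtu - ip_header - UDP_HEADER


def fits_in_udp(response_len: int, requestor_bufsize: int, local_max: int = 1232) -> bool:
    """Hypothetical responder check: send over UDP, or set TC=1 and let the client use TCP."""
    # RFC 6891: advertised sizes below 512 (or no EDNS at all, passed here as 0)
    # are treated as 512.
    effective = min(max(requestor_bufsize, 512), local_max)
    return response_len <= effective
```

For example, `max_unfragmented_payload(IPV6_MIN_MTU, ipv6=True)` gives 1232, and a 1300-byte answer to a client advertising 4096 would still be truncated because the responder caps the size locally.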
### Encrypted DNS
**DNS-over-TLS (DoT), DNS-over-HTTPS (DoH), and DNS-over-QUIC (DoQ) should be supported.**
For: All DNS resolver operators.
DoT, DoH, and DoQ are different technologies that all provide an encrypted channel between the client and the resolver. DoT is the oldest, and provides encrypted DNS using TLS. DoH uses HTTP over TLS as a way to transmit queries and answers, and is widely supported by web browsers. DoQ is the newest, and provides advanced features such as separate streams for each query, avoiding the "head of line" blocking problem common to all protocols layered on top of TCP (such as DoT and DoH).
- DoT - [RFC7858](https://www.rfc-editor.org/rfc/rfc7858.html)
- DoH - [RFC8484](https://www.rfc-editor.org/rfc/rfc8484.html)
- DoQ - [RFC9250](https://www.rfc-editor.org/rfc/rfc9250.html)
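As an illustration of how DoH wraps an ordinary DNS message, here is a sketch of the GET form from RFC 8484 section 4.1: the wire-format query is base64url-encoded with the padding removed and passed in the `dns` query parameter. The endpoint URL is a placeholder, and the sketch assumes the endpoint has no query string of its own. RFC 8484 also recommends a DNS ID of 0 so that HTTP caches can reuse answers.

```python
import base64


def doh_get_url(endpoint: str, wire_query: bytes) -> str:
    """Build an RFC 8484 GET URL for a wire-format DNS query (sketch)."""
    # base64url alphabet ("-" and "_"), trailing "=" padding stripped
    encoded = base64.urlsafe_b64encode(wire_query).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


# Query for www.example.com/A with ID 0 and RD set, as in the RFC 8484 example.
query = (
    bytes.fromhex("000001000001000000000000")  # header: ID=0, RD=1, QDCOUNT=1
    + b"\x03www\x07example\x03com\x00"         # QNAME
    + b"\x00\x01\x00\x01"                      # QTYPE=A, QCLASS=IN
)
print(doh_get_url("https://dns.example/dns-query", query))
```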
[Substantial] This recommendation effectively says "increase your attack surface 3x". Each extra protocol comes with its own operational cost, and I think supporting all of them without a reason and sufficient knowledge is asking for trouble. Operators will need to know how to debug not only DNS, but also the TLS+HTTP/2 combo (or QUIC), and none of this is for the faint-hearted, especially when under DDoS. My personal recommendation would be: pick the smallest set supported by your target population. Bonus points for DoT, because it is by far the easiest to understand and debug when something goes wrong. If the WG thinks supporting all of this protocol circus _at once_ is the best recommendation, then I think this recommendation deserves a warning that operators need to do their homework first, understand how to debug the individual pieces, and be prepared to handle protocol-specific DoS attacks.

### Aggressive NSEC caching

I agree with this paragraph and disagree with disagreements elsewhere on the mailing list :-)
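To make aggressive NSEC caching (RFC 8198) concrete: the core check is whether a query name falls strictly between the owner name and the "next" name of a validated, cached NSEC record, in DNSSEC canonical order (RFC 4034 section 6.1: labels compared right to left, case-insensitively, as octet strings). This is a minimal sketch under stated assumptions: names contain no escaped characters, the wrap-around NSEC at the end of a zone is conservatively treated as covering nothing, and a real resolver must additionally prove that no wildcard matches before synthesizing NXDOMAIN. The function names are hypothetical.

```python
def canonical_key(name: str) -> tuple:
    """Sort key implementing RFC 4034 canonical name order (no escapes assumed)."""
    labels = [label.encode("ascii").lower() for label in name.rstrip(".").split(".") if label]
    # Compare from the rightmost label; Python sorts a shorter prefix first,
    # matching "absence of an octet sorts before a zero octet".
    return tuple(reversed(labels))


def nsec_covers(owner: str, next_name: str, qname: str) -> bool:
    """True if qname lies strictly between owner and next_name, i.e. does not exist."""
    o, n, q = canonical_key(owner), canonical_key(next_name), canonical_key(qname)
    # For the last NSEC in a zone (next_name wraps to the apex) o >= n, so this
    # conservative sketch returns False and the resolver simply asks upstream.
    return o < q < n
```

With a cached NSEC `a.example. -> d.example.`, queries for `b.example.` or `c.b.example.` can be answered from cache, while `example.` and `e.example.` cannot. Against the root zone this is how a resolver with RFC 8198 stops sending queries for non-existent TLDs upstream.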
### Local Root
**Local root should be used.**
For: Public resolver operators.
Since the root zone is DNSSEC signed,
^^^ Something is missing here, I guess?
Running a local root has several benefits, but it is an additional component to maintain. For public resolver operators this is definitely worth the cost, but other resolver operators may choose to simply send all queries to the well-distributed root name servers.
[Comment, no text change proposed] With proper monitoring in place, sure, but it's not really buying much if aggressive caching is enabled. With RFC 8198 you get the benefits of a local root within seconds, with less operational complexity and fragility.
### TTL Recommendations
**TTL limits may be adjusted.**
For: All DNS resolver operators.
Software typically defaults to a maximum stored TTL of 1 or 2 days. This may be lowered to reduce the cache size. A lower TTL will mean removing rarely-used records that have long TTL, and should not have much operational impact from a CPU or network point of view, but may save memory.
[Substantial] This section seems *entirely* incorrect to me. The cache needs some sort of limit on its size anyway, regardless of TTL limits. Artificially limiting TTLs is entirely ineffective as a method to limit cache size in many scenarios, e.g. when under a random subdomain attack. A proper cache-cleaning algorithm should take care of evicting the least-used records, and no TTL limits are needed. I think this section should instead discuss the impact of very long TTLs on availability when someone messes things up on the auth side (slack.com DS, anyone?), or when the auth side is under attack. https://ant.isi.edu/~johnh/PAPERS/Moura19b.html is an excellent resource possibly worth linking to.
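A sketch of the point above: the cache size is bounded by an entry limit, with least-recently-used eviction, independently of any TTL cap, so records with long TTLs are still dropped when they stop being used. LRU is only one possible eviction policy, and the class and parameter names are hypothetical; real resolvers use their own, more elaborate cache designs.

```python
import time
from collections import OrderedDict


class BoundedDnsCache:
    """Hypothetical size-bounded cache: TTL decides freshness, LRU decides eviction."""

    def __init__(self, max_entries: int, clock=time.monotonic):
        self.max_entries = max_entries
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest use first

    def put(self, key, value, ttl: float) -> None:
        self._entries[key] = (self.clock() + ttl, value)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:  # TTL expired: drop it
            del self._entries[key]
            return None
        self._entries.move_to_end(key)
        return value
```

Under a random subdomain attack the flood of one-off names pushes out other unused entries, but the memory bound holds no matter what TTLs the attacker's zone hands out.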
It is possible to set a minimum TTL in many implementations. This is a violation of the DNS protocol, although it may be useful to reduce load from records with very low TTL (less than 5 seconds).
[Nit] I argue that setting the lower bound higher than ~ seconds is antisocial and asking for operational trouble. On the other hand, TTL=0 is antisocial from the auth side and should be outlawed :-) Personally, I would be fine with recommending a minimum TTL of 1 second.
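For clarity, this is what the minimum and maximum TTL knobs discussed in this section do to a TTL received from an authoritative server before it is cached. The defaults are assumptions chosen for illustration: 1 second as suggested in the nit above, and 1 day from the draft text. The function name is hypothetical and does not correspond to a specific implementation's option.

```python
def clamp_ttl(ttl: int, min_ttl: int = 1, max_ttl: int = 86400) -> int:
    """Apply hypothetical resolver TTL bounds to a TTL received from an authoritative server."""
    # Raising TTL=0 to min_ttl is the protocol violation the draft mentions;
    # capping at max_ttl limits how long a broken record can stay cached.
    return max(min_ttl, min(ttl, max_ttl))
```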
### EDNS Client Subnet (ECS)
**ECS may be enabled.**
[Substantial] Can we say something like **ECS may be enabled if careful evaluation indicates it is beneficial.** ?
For: All DNS resolver operators.
EDNS Client Subnet (ECS) allows the resolver to include information about the IP address of the client querying it when sending messages to authoritative servers. This may allow authoritative servers to provide different answers which are more appropriate for the client. However, ECS will increase the amount of cache space required by resolvers, may reduce DNS performance, and may have privacy implications.
It most certainly _will_ (not may) reduce DNS performance. But it reportedly increases non-DNS performance in certain scenarios :shrug:

[Substantial] At this point in the text I would like to prepend a sentence like this: "A resolver operator whose clients share a single network path to the Internet will see no benefit at all."
A resolver operator that has clients that are limited to a specific region may see no benefit. A resolver operator that has a widely distributed anycast network may not have much benefit from ECS, since the locations that initiate the query will be close to the client. But a resolver operator that answers client queries only from a few locations, and expects clients to come from a wide area, may provide better service for end-users by supporting ECS.
EDNS client subnet is described in [RFC7871](https://www.rfc-editor.org/rfc/rfc7871.html), an informational RFC.
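A sketch of what a resolver actually sends when ECS is enabled: an EDNS option (code 8) carrying the address family, the source prefix length, a scope prefix length of 0, and the client address truncated to the source prefix (RFC 7871 section 6). The /24 and /56 defaults follow the truncation RFC 7871 recommends for privacy; the function name is hypothetical.

```python
import ipaddress
import struct

ECS_OPTION_CODE = 8  # EDNS0 option code for Client Subnet (RFC 7871)


def ecs_option(client_ip: str, v4_prefix: int = 24, v6_prefix: int = 56) -> bytes:
    """Encode an ECS option for a query sent to an authoritative server (sketch)."""
    addr = ipaddress.ip_address(client_ip)
    family, prefix = (1, v4_prefix) if addr.version == 4 else (2, v6_prefix)
    # Zero out the host bits, then keep only ceil(prefix / 8) address octets.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    address = network.network_address.packed[: (prefix + 7) // 8]
    data = struct.pack("!HBB", family, prefix, 0) + address  # SCOPE=0 in queries
    return struct.pack("!HH", ECS_OPTION_CODE, len(data)) + data
```

Each distinct subnet an authoritative server answers for becomes a separate cache entry in the resolver, which is where the extra cache space and the lower hit rate mentioned above come from.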
### Trust Anchor Reporting
**Trust anchor reporting may be enabled.**

[Nit] I would say it should be enabled. It costs almost nothing and carries negligible risk for anyone.
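To show how cheap this is, here is a sketch of the query name used by the RFC 8145 key tag signal, assuming that is the trust anchor reporting mechanism meant here: the resolver's configured key tags, each as four lowercase hex digits, sorted ascending and joined after a `_ta-` prefix, queried (as I understand RFC 8145, with QTYPE NULL) in the trust anchor's zone. The function name is hypothetical.

```python
def ta_signal_qname(key_tags, trust_anchor_zone: str = ".") -> str:
    """Build an RFC 8145 style key tag signal name, e.g. "_ta-4a5c-4f66." (sketch)."""
    tags = "-".join(f"{tag:04x}" for tag in sorted(key_tags))
    label = f"_ta-{tags}"
    zone = trust_anchor_zone.rstrip(".")
    return f"{label}.{zone}." if zone else f"{label}."
```

For the root zone KSKs with key tags 19036 and 20326 this yields `_ta-4a5c-4f66.`, a single small query that lets root operators see which trust anchors resolvers have configured.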
-----------

Now the hard part: missing pieces. This one is hard to put under the existing headings. Should it be under ### Software considerations, ### Networking considerations, or elsewhere? Anyway:

RFC 8906 A Common Operational Problem in DNS Servers: Failure to Communicate
https://datatracker.ietf.org/doc/html/rfc8906

If an operator puts a "security appliance" in front of the DNS server to increase its purported "security", it messes up the protocol and breaks things. Most importantly, failure to respond to _ALL_ queries (because the "appliance" thinks some queries are not safe or needed) leads to exploitable protocol-level issues. See e.g. the paper about trouble in resolver-auth transactions:

Silence is not Golden: Disrupting the Load Balancing of Authoritative DNS Servers
https://indico.dns-oarc.net/event/47/contributions/1018/

Similarly, non-response to stub clients _also_ creates problems, because stubs are notoriously bad at handling retransmissions in a timely manner, etc.

I think this is worth calling out: "Even if your server is not going to answer a query, send back at least RCODE REFUSED." or something like that.

Congratulations if you made it this far, and thank you for your time!

--
Petr Špaček
Internet Systems Consortium