post-mortem for ripe.net DNSSEC problem on 1 November 2023
Dear colleagues,

Please find below the post mortem for the DNSSEC problem that caused most of RIPE NCC's services to become unavailable yesterday. Please reach out if you have any questions or feedback.

Thanks,

Paul de Weerd
Manager Global Information Infrastructure team
RIPE NCC

Summary

On 1 November, from 10:45 to 12:15 UTC, most names in the ripe.net zone were bogus due to expired DNSSEC signatures being served. This rendered most of the RIPE NCC’s services unreachable.

After investigating the issue, we found a typo in a change to our zone where a record had a TTL that was longer (864,000 seconds instead of 86,400) than the refresh interval for RRSIGs (seven days). This caused our signer to stop refreshing signatures and only sign changes to the zone.

We are talking to the vendor of our DNSSEC signing solution about this case to see what can be improved on that end, have implemented a pre-commit check to prevent TTLs longer than a day in the ripe.net zone, and are looking at improving monitoring for stale signatures to spot issues like this before they cause problems.

Impact

DNSSEC signatures in the ripe.net zone are valid for 14 days, with our signers configured to re-sign them after half that time (seven days). On 1 November at 10:45 UTC the signatures on several records in the ripe.net zone expired. These records had last been signed on 18 October and were due to be re-signed on the 25th. However, due to a problem with the TTL on one record, our signer stopped re-signing records in the zone on 25 October. This resulted in the expiry of 11,026 out of 11,389 records on 1 November. New or changed records were still properly signed (363 of them), which meant that our monitoring, which checks the signature validity of the SOA record at the zone apex, missed this issue.

Because our internal resolvers are configured for DNSSEC validation, the impact was rather immediate for staff, as many internal services broke due to this issue.
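The signature timing described above can be sketched in a few lines of Python. This is only an illustration built from the figures in this post (14-day validity, re-sign after seven days, expiry on 1 November at 10:45 UTC). The exact times of day derived from it, the check time, the SOA signing time, the six-hour margin and the `is_stale` helper are all assumptions made for the example, not part of our actual monitoring.

```python
from datetime import datetime, timedelta, timezone

# Figures taken from the post.
RRSIG_LIFETIME = timedelta(days=14)  # signatures are valid for 14 days
RESIGN_AFTER = timedelta(days=7)     # signer re-signs after half that time

expiry = datetime(2023, 11, 1, 10, 45, tzinfo=timezone.utc)
signed_at = expiry - RRSIG_LIFETIME      # when the expired signatures were made
resign_due = signed_at + RESIGN_AFTER    # when they should have been refreshed


def is_stale(signed, now, resign_after=RESIGN_AFTER, margin=timedelta(hours=6)):
    """True if a signature is older than the re-sign interval plus a margin.

    Hypothetical helper: the margin is an arbitrary allowance for signer delay.
    """
    return now - signed > resign_after + margin


# Hypothetical moment at which a monitoring probe runs, between 25 Oct and 1 Nov.
check_time = datetime(2023, 10, 27, 12, 0, tzinfo=timezone.utc)
# Assumed time at which the SOA was last re-signed because the zone changed.
soa_signed_at = datetime(2023, 10, 26, 9, 0, tzinfo=timezone.utc)

print("signed at:   ", signed_at.isoformat())
print("re-sign due: ", resign_due.isoformat())
print("old record stale?", is_stale(signed_at, check_time))
print("SOA stale?       ", is_stale(soa_signed_at, check_time))
```

Under these assumptions an unchanged record already looks stale two days after the missed refresh, while the SOA, which is re-signed whenever the zone changes, still looks healthy. That is consistent with a check of only the apex SOA missing the problem.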

After first dismissing some alternative causes, we quickly found the problem was with expired signatures in the ripe.net zone, so we turned our attention to our signers. At the same time, we temporarily disabled DNSSEC validation on our internal resolvers so we could more easily access our own systems while troubleshooting.

Resolution

While debugging, we found that the rrsig-refresh option that we configured to seven days (half the value of the rrsig-lifetime option of 14 days) was likely involved. The logs showed:

    info: [ripe.net.] DNSSEC, signing zone
    error: [ripe.net.] DNSSEC, rrsig-refresh too low to prevent expired RRSIGs in resolver caches
    info: [ripe.net.] DNSSEC, next signing at 2023-10-25T10:02:02+0000
    error: [ripe.net.] zone event 're-sign' failed (invalid parameter)

At 12:14 UTC we removed that option from our configuration and we could sign the zone again. The freshly signed zone was pushed out and went live a little bit later, which meant that at 12:15 UTC our services were available again for most users. Unfortunately, some users kept seeing problems for several hours after we restored the signatures.

Root cause

After further investigation we found that the change that triggered this problem introduced a record in the ripe.net zone with a TTL of 864,000 seconds (ten days). Because this TTL is longer than our rrsig-refresh configuration, this could lead to cases where a resolver’s cache contains the record with an expired signature. The signer software rightfully complained about this. We were surprised to find it then stopped refreshing signatures for all records in the zone that didn’t change.

Future steps

During the incident and the aftermath we identified a few changes that we want to make to improve the resiliency of our setup and allow us to find cases like these before they become problems.

Our current RRSIG freshness monitoring did not catch this case, because the records we monitor still had valid and recent signatures, so we are considering what we can do to cover this situation. We have also improved our zone-editing pipeline to catch typos or misconfigurations for TTL values.

Next to that, the problem also affected our ability to communicate internally, as our internal chat system was unresolvable too. We have some means of out-of-band communication, but will review how we can improve that.

Additionally, while the status.ripe.net website is hosted on separate infrastructure, the fact that it is also in the ripe.net domain meant that it was just as unreachable as our other services. We will evaluate this approach and see how we can improve on it.

Timeline (times in UTC)

25 October
08:52  a record was added to the ripe.net zone with a TTL of 10 days
08:53  knot incrementally signs ripe.net successfully
09:02  knot fails to sign the ripe.net zone for the first time

1 November
10:45  ripe.net signatures expire and many records go bogus
11:27  DNSSEC validation on internal resolvers was disabled
12:14  changed configuration and manually re-signed zone
12:15  ripe.net zone has new valid signatures
12:38  DNSSEC validation on internal resolvers is re-enabled

2 November
08:39  typo in TTL fixed, bringing it back to 86,400 seconds as intended
08:39  added check in pipeline to detect too large TTL values
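
For illustration, a pipeline check of the kind mentioned in the last timeline entry could look roughly like the Python sketch below. It is not the check we deployed. It assumes a simplified zone file in which every record sits on one line with its owner name first and an explicit numeric TTL in the second field, and it silently skips anything else. The function name, the sample records and the `example.` names are hypothetical.

```python
MAX_TTL = 86_400  # one day, the limit mentioned in the summary


def find_long_ttls(zone_text, max_ttl=MAX_TTL):
    """Return (line_number, owner, ttl) for records whose TTL exceeds max_ttl.

    Simplification: only lines of the form "owner TTL ..." with a purely
    numeric TTL are inspected; directives such as $TTL, multi-line records
    and lines relying on an inherited owner or TTL are skipped.
    """
    offenders = []
    for lineno, line in enumerate(zone_text.splitlines(), start=1):
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("$"):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            ttl = int(fields[1])
            if ttl > max_ttl:
                offenders.append((lineno, fields[0], ttl))
    return offenders


# Hypothetical records, not taken from the real ripe.net zone.
SAMPLE_ZONE = """\
; sample zone fragment
www.example.   86400  IN A  192.0.2.1
typo.example.  864000 IN A  192.0.2.2
"""

offenders = find_long_ttls(SAMPLE_ZONE)
for lineno, owner, ttl in offenders:
    print(f"line {lineno}: {owner} has TTL {ttl}, limit is {MAX_TTL}")
```

A pre-commit hook could run such a function on the edited zone file and reject the change when the returned list is not empty.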

Hi Paul,

On Thu, Nov 02, 2023 at 04:43:40PM +0100, Paul de Weerd wrote:
> Please find below the post mortem for the DNSSEC problem that caused most of RIPE NCC's services to become unavailable yesterday.

Impressive post-mortem. Thanks for digging so deeply into the situation and providing so detailed answers.

Gert Doering
        -- NetMaster
-- 
have you enabled IPv6 on something today...?

SpaceNet AG                  Vorstand: Sebastian v. Bomhard, Michael Emmer
Joseph-Dollinger-Bogen 14    Aufsichtsratsvors.: A. Grundner-Culemann
D-80807 Muenchen             HRB: 136055 (AG Muenchen)
Tel: +49 (0)89/32356-444     USt-IdNr.: DE813185279

Paul & DNS team -

On 02.11.2023 17:27, Gert Doering wrote:
> On Thu, Nov 02, 2023 at 04:43:40PM +0100, Paul de Weerd wrote:
>> Please find below the post mortem for the DNSSEC problem that caused most of RIPE NCC's services to become unavailable yesterday.
>
> Impressive post-mortem. Thanks for digging so deeply into the situation and providing so detailed answers.

+1000 to what Gert wrote.

Thanks!

-C.

On Fri, Nov 03, 2023 at 12:02:47PM +0100, Carsten Schiefner wrote:

Dear Paul & DNS Team,

> On 02.11.2023 17:27, Gert Doering wrote:
>> On Thu, Nov 02, 2023 at 04:43:40PM +0100, Paul de Weerd wrote:
>>> Please find below the post mortem for the DNSSEC problem that caused most of RIPE NCC's services to become unavailable yesterday.
>>
>> Impressive post-mortem. Thanks for digging so deeply into the situation and providing so detailed answers.
>
> +1000 to what Gert wrote.

Indeed, great in-depth investigation and impressive document. Much appreciated.

Best,
Piotr

-- 
Piotr Strzyżewski