Dear colleagues,
Yesterday, we upgraded the Security World software on our RPKI core servers. The upgrade finished at approximately 08:45 UTC. We tested it and verified that everything worked before re-enabling the RPKI Dashboard.
At approximately 10:50 UTC, we received an alert from our monitoring that showed an error on both of our online Hardware Security Modules (HSMs). We immediately started investigating the alert and decided to temporarily stop RPKI Core to keep its state consistent. This also meant that we had to temporarily take the RPKI Dashboard offline.
At 11:22 UTC, we contacted our vendor, as we had never seen this behaviour before. A consultant from the vendor advised rebooting the HSMs, which we did at 11:55 UTC. After the reboot, the HSMs came back online and we re-enabled RPKI Core and the RPKI Dashboard. It is still unknown whether the upgrade directly caused the errors, as the reported error was very generic.
While we work on finding the root cause, we will still need to reboot systems and HSMs occasionally. Each reboot makes the RPKI Dashboard unavailable for a few minutes, and objects will take a bit longer than usual to be published in our repository. As soon as we have more information, we will share it here.
As a result of this outage, we will speed up the process of replacing the online HSMs, which we described in our recent RIPE Labs article [0].
Kind regards,
Stella Vouteva