Dear Colleagues,
This is an incident report regarding an LIR Portal service outage that
occurred last night and earlier today.
The outage started at 21:15 (UTC) on Monday, 7 September, and services
were restored at 08:15 (UTC) this morning, 11 hours later.
The outage was caused by a configuration error in the Apache web server.
The RIPE NCC has strict upgrade procedures that our staff follow to
prevent errors of this kind, but they still leave some room for human
error, and this outage was the result of such an error. To minimise the
chance of this happening again, we plan to add automated checks of our
Apache configuration syntax to our monitoring system.
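As an illustration of what an automated syntax check could look like,
here is a minimal sketch. It is not the actual RIPE NCC monitoring code.
It assumes that `apachectl -t` (Apache's built-in configuration test) is
available on the host and exits with status 0 when the syntax is valid;
the function names `run_command` and `check_apache_syntax` are
hypothetical.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A runner takes a command and returns (exit_code, combined_output).
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return its exit code and combined output."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def check_apache_syntax(
    cmd: Sequence[str] = ("apachectl", "-t"),  # assumed to be on PATH
    runner: Runner = run_command,
) -> Tuple[bool, str]:
    """Return (ok, message) for the Apache configuration syntax test."""
    try:
        code, output = runner(cmd)
    except OSError as exc:
        # The binary is missing or not executable: report as a failure.
        return False, f"could not run {cmd[0]}: {exc}"
    return code == 0, output.strip()


if __name__ == "__main__":
    ok, message = check_apache_syntax()
    print("OK" if ok else "CRITICAL", message)
```

A monitoring system could run a check like this periodically, and a
deployment script could run it before reloading Apache, so that a syntax
error is flagged before it takes the service down.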
A separate issue was responsible for the longer-than-usual time it took
to recognise and fix the problem. Our monitoring system is set up to
send text messages to the engineer on 24/7 duty in case of outages such
as this. From our logs, we can see that the system did indeed try to
send an SMS about the outage yesterday at 21:15 (UTC), but was unable to
connect to our mobile service provider.
To minimise the chance of this happening again, we will reconfigure our
monitoring system to fall back to another provider in case a message
cannot be sent out.
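The fallback idea can be sketched as follows. This is only an
illustration under stated assumptions, not the configuration we will
deploy: each provider is modelled as a hypothetical function that takes
a phone number and a message text and raises an exception when delivery
fails, and `send_with_fallback` is an invented name.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical provider interface: send(number, text), raises on failure.
Sender = Callable[[str, str], None]


def send_with_fallback(
    providers: Sequence[Tuple[str, Sender]],
    number: str,
    text: str,
) -> Tuple[Optional[str], List[str]]:
    """Try each provider in order until one accepts the message.

    Returns the name of the provider that succeeded (or None if all
    failed) together with the errors collected along the way.
    """
    errors: List[str] = []
    for name, send in providers:
        try:
            send(number, text)
            return name, errors
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    return None, errors
```

Collecting the errors rather than discarding them means a failure of the
primary provider is still visible in the logs even when the backup
delivers the message.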
I hope that you find this explanation of the cause of the outage and the
steps we are taking to prevent it from happening again useful.
Regards,
Tim Bruijnzeels
Service Manager LIR Portal
RIPE NCC