RIPE Atlas probe status issues

23 Apr 2018

Dear RIPE Atlas users,

There were various issues relating to the recorded status of RIPE Atlas
probes over the weekend. This was brought to our attention by internal
monitoring and information provided by users on the mailing list.

Throughout this period most probes did actually remain connected to
controllers, and measurement results were collected as normal. The side
effects included:

* the number of probes reported as connected by the system was lower
than it should have been
* the status (connected/disconnected) of many probes was incorrect
* new measurements took longer than usual to start
* fewer probes than usual were available for new measurements, leading
in some cases to “no suitable” probes messages when trying to schedule
new measurements
* various system tags were incorrectly applied, including many probes
being marked as having USB problems when this was not the case
* temporary discrepancies with crediting/debiting of RIPE Atlas credits
for the connected time of probes

The issues were caused by a bug fix deployment on Friday at 9AM UTC, in
which a package was accidentally downgraded, causing a regression to an
old bug in the task handling of the central system. This bug caused a
backlog of messages to build up, slowing down or stopping the
registering of various status messages in the system. Problems built up
gradually as the backlog increased, until the root cause was identified
on Sunday morning. The issue was then fixed and the system had
stabilized completely by about 10AM UTC that day. We have identified
procedural and technical solutions that will prevent this problem from
happening again, and are looking at ways to improve our monitoring of
these kinds of issues.

We apologise for any inconvenience or confusion caused by this event and
would like to thank all of you who took the time to notify us of what
you were seeing.

Kind regards,
Chris Amin
RIPE NCC
