Dear RIPE Atlas users,

There were various issues relating to the recorded status of RIPE Atlas probes over the weekend. This was brought to our attention by internal monitoring and information provided by users on the mailing list.

Throughout this period most probes did actually remain connected to controllers, and measurement results were collected as normal. The side effects included:

* the number of probes reported as connected by the system was lower than it should have been
* the status (connected/disconnected) of many probes was incorrect
* new measurements took longer than usual to start
* fewer probes than usual were available for new measurements, leading in some cases to “no suitable probes” messages when trying to schedule new measurements
* various system tags were incorrectly applied, including many probes being marked as having USB problems when this was not the case
* temporary discrepancies with crediting/debiting of RIPE Atlas credits for the connected time of probes

The issues were caused by a bug fix deployment on Friday at 9 AM UTC, in which a package was accidentally downgraded, causing a regression to an old bug in the task handling of the central system. This bug caused a backlog of messages to build up, slowing down or stopping the registering of various status messages in the system. Problems built up gradually as the backlog increased, until the root cause was identified on Sunday morning. The issue was then fixed and the system stabilized completely by about 10 AM UTC.

We have identified procedural and technical solutions that will stop this problem happening again, and are looking at ways to improve our monitoring of these kinds of issues.

We apologise for any inconvenience or confusion caused by this event and would like to thank all of you who took the time to notify us of what you were seeing.

Kind regards,
Chris Amin
RIPE NCC
Thank you.

Has the problem resurfaced? My probe #1118 is showing offline again, although it isn't. I can see the ongoing measurement results on my profile page.

BR
Daniel AJ

On 2018-04-23 at 06:37 AM, Chris Amin wrote:
Dear RIPE Atlas users,
There were various issues relating to the recorded status of RIPE Atlas probes over the weekend. This was brought to our attention by internal monitoring and information provided by users on the mailing list.
Throughout this period most probes did actually remain connected to controllers, and measurement results were collected as normal. The side effects included:
* the number of probes reported as connected by the system was lower than it should have been
* the status (connected/disconnected) of many probes was incorrect
* new measurements took longer than usual to start
* fewer probes than usual were available for new measurements, leading in some cases to “no suitable” probes messages when trying to schedule new measurements
* various system tags were incorrectly applied, including many probes being marked as having USB problems when this was not the case
* temporary discrepancies with crediting/debiting of RIPE Atlas credits for the connected time of probes
The issues were caused by a bug fix deployment at Friday 9AM UTC where a package was accidentally downgraded causing a regression to an old bug in the task handling of the central system. This bug caused a backlog of messages to build, slowing down or stopping the registering of various status messages in the system. Problems built up gradually as the backlog increased, until the root cause was identified on Sunday morning. The issue was then fixed and the system stabilized completely by about 10AM UTC.

We have identified procedural and technical solutions that will stop this problem happening again, and are looking at ways to improve our monitoring of these kinds of issues.
We apologise for any inconvenience or confusion caused by this event and would like to thank all of you who took the time to notify us of what you were seeing.
Kind regards,
Chris Amin
RIPE NCC
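
The manual check described in the reply (a probe shown as offline while its measurement results keep arriving) could be sketched roughly as below. This is only an illustration under stated assumptions: `status_looks_stale`, its parameters, and the one-hour threshold are hypothetical and are not part of RIPE Atlas or its API; the recorded status and the time of the latest result would have to be obtained separately.

```python
from datetime import datetime, timedelta, timezone


def status_looks_stale(recorded_status, last_result_at, now=None,
                       max_age=timedelta(hours=1)):
    """Return True when a probe is recorded as not connected although it
    delivered a measurement result within the last `max_age`.

    Hypothetical helper: the status string and the one-hour threshold
    are assumptions made for illustration only.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # A result newer than `max_age` suggests the probe is actually online.
    recently_active = (now - last_result_at) <= max_age
    return recorded_status.lower() != "connected" and recently_active
```

A probe recorded as "Disconnected" that delivered a result five minutes ago would be flagged, while one recorded as "Connected", or one whose last result is several hours old, would not.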
participants (2)
- Chris Amin
- Daniel AJ Sokolov