Atlas fully down..


Ernst J. Oud

19 Sep 2023

11:04 p.m.

Considering the fact that all of Atlas is completely down for more than 24 hrs., I find the silence a bit deafening. No status updates, nothing. Weird. Not up to RIPE standards. Regards, Ernst J. Oud


Randy Bush

20 Sep 2023

12:06 a.m.

...

Considering the fact that all of Atlas is completely down for more than 24 hrs., I find the silence a bit deafening. No status updates, nothing. Weird.

been down so long it looks like up to me [0]

as you are probably too young for that reference, how about it looks pretty up from here. perhaps a more specific symptom might help with diagnosis.

randy

[0] https://en.wikipedia.org/wiki/Been_Down_So_Long_It_Looks_Like_Up_to_Me

Peter Potvin

12:12 a.m.

Consumption delay according to the main page is up to 16+ hours so something is indeed very wrong.

Regards,
Peter Potvin | Executive Director
Accuris Technologies Ltd.

-- ripe-atlas mailing list ripe-atlas@ripe.net https://lists.ripe.net/mailman/listinfo/ripe-atlas

Randy Bush

12:34 a.m.

...

Consumption delay according to the main page is up to 16+ hours so something is indeed very wrong.

aha! a symptom. thanks. indeed, an issue

randy

Ernst J. Oud

12:17 a.m.

I don’t think I fully understand what you are saying. Do you imply that Atlas works for you? I doubt it since all of it is down, see the status page at ripe.net, I guess worldwide. No results are processed, no tests are running, even Magellan is down. Or were you joking? My Dutch sense of humor might be different :-) Ernst


Fearghas Mckay

3:09 a.m.

...

On 19 Sep 2023, at 17:04, Ernst J. Oud <ernstoud@gmail.com> wrote:

Considering the fact that all of Atlas is completely down for more than 24 hrs., I find the silence a bit deafening. No status updates, nothing. Weird.

The status update is Degraded Performance, for a non-critical service. https://atlas.ripe.net/ acknowledges there is a consumption delay, hardly silence. f

Ernst J. Oud

11:59 a.m.

Yes, the status page shows a bit of info. But “degraded performance” does not cover the real situation since all of Atlas was or still is down. Measurements don’t report data, all built-ins don’t show data, probe tags are lost, Magellan - used for streaming - does not give any data etc. Yes, it is a non-critical service, but people like me do rely on data from my probes to monitor my network. It is a bit of give and take but it appears I need another service for this… Regards, Ernst J. Oud


Robert Kisteleki

7:43 a.m.

On 2023-09-19 23:04, Ernst J. Oud wrote:

...

Considering the fact that all of Atlas is completely down for more than 24 hrs., I find the silence a bit deafening. No status updates, nothing. Weird.

Not up to RIPE standards.

Regards,

Ernst J. Oud

Good morning,

I'm sad to report that indeed there's still an issue with result processing, which is still reflected on the status page.

Specifically, the HBase backend that is responsible for storing and retrieving the new (and historic) results is struggling to store the data from the last ~24 hours. The teams have been working on solving this basically 24/7 since the issue occurred but haven't been successful yet.

Everything else (continuing to run existing measurements, creating new ones, real-time streaming of the results, APIs, UI, ...) is running undisturbed.

I hope this helps in understanding the extent of the problem, and we'll of course let you know when there's progress.

Regards, Robert

Ernst J. Oud

11:47 a.m.

Robert et al,

In contrast to your statement, streaming of results from new or existing measurements using Magellan currently does *not* work: no results are obtained, see below.

---
Looking good! Measurement 60182213 was created and details about it can be found here: https://atlas.ripe.net/measurements/60182213/
Connecting to stream...
Disconnected from stream
---

Regards, Ernst J. Oud


Robert Kisteleki

12:22 p.m.

Hi, On 2023-09-20 11:47, Ernst J. Oud wrote:

...

Robert et al,

In contrast to your statement below, streaming of results from new or existing measurements using Magellan currently does *not* work… no results are obtained, see below.

---
Looking good! Measurement 60182213 was created and details about it can be found here: https://atlas.ripe.net/measurements/60182213/

Connecting to stream...

Disconnected from stream

The answer is the same here as for Stephane - no probes were selected for the measurement, hence no answers are coming back on the streaming interface either. Regards, Robert

Ernst J. Oud

12:57 p.m.

Robert, I checked my code that calls Magellan. It does not add the “system-ipv4-works“ tag to the request for probes in measurements. Does Magellan add that tag automatically? Regards, Ernst J. Oud


Chris Amin

1:22 p.m.

On 20/09/2023 12:57, Ernst J. Oud wrote:

...

I checked my code that calls Magellan. It does not add the “system-ipv4-works“ tag to the request for probes in measurements.

Does Magellan add that tag automatically?

Hi Ernst,

The system-ipv4-works (and system-ipv6-works) tags are specified in the ripe-atlas-tools settings file. If you run:

ripe-atlas configure --editor

you can find a tags block with various default tag sets to use for different measurement types.

*However*, as a workaround we have re-populated the system-ipv4-works and system-ipv6-works tags, so that the default behaviour of Magellan should be sane once again. This should allow scheduling of new measurements and streaming the results.

Note: The probe sets for these tags are currently just copied from the system-ipv4-stable-1d and system-ipv6-stable-1d tags, respectively, which were not affected by this issue because they use a different data backend. The actual probe sets are therefore not exactly the same, but they are functionally very similar if your goal is "restrict the participating probes to ones that are likely to work".
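For anyone building measurement requests in their own code rather than through Magellan, the tag choice described above could be sketched roughly as follows. This is only an illustration: the helper name probe_source and the use_fallback switch are hypothetical, and the dictionary layout (type / value / requested / tags with include and exclude lists) is an assumption about how a probe request is shaped, not something confirmed in this thread. Only the tag names themselves come from the messages above.

```python
import json

# Tag names are taken from the thread; everything else here is an assumption.
PREFERRED_TAG = {4: "system-ipv4-works", 6: "system-ipv6-works"}
FALLBACK_TAG = {4: "system-ipv4-stable-1d", 6: "system-ipv6-stable-1d"}


def probe_source(country, requested, af=4, use_fallback=False):
    """Build one probe-source entry restricted to probes likely to work.

    Hypothetical helper: the returned layout is an assumed request shape.
    """
    tag = (FALLBACK_TAG if use_fallback else PREFERRED_TAG)[af]
    return {
        "type": "country",
        "value": country,
        "requested": requested,
        "tags": {"include": [tag], "exclude": []},
    }


if __name__ == "__main__":
    # Ask for 100 probes in France using the stable-1d tag instead of the
    # "works" tag, e.g. while the "works" tag set is unreliable.
    print(json.dumps(probe_source("FR", 100, use_fallback=True), indent=2))
```

Keeping the tag choice in one place means a temporary switch to the stable-1d tags needs a single flag rather than edits at every call site.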

Ernst J. Oud

1:38 p.m.

Chris, Thanks for your input. That clarifies a lot! Regards, Ernst J. Oud


Stephane Bortzmeyer

11:56 a.m.

On Wed, Sep 20, 2023 at 07:43:20AM +0200, Robert Kisteleki <robert@ripe.net> wrote a message of 42 lines which said:

...

All else (continuing to run existing measurements, creating new ones, real-time streaming of the results, APIs, UI, ...) are running undisturbed.

This is not what I observe. For instance, asking for 100 probes in France yields a "No suitable probes" (see measurement #60181887). So, "completely down" seems a fair summary to me.

Ernst J. Oud

12:02 p.m.

Stephane,

Yes, I also don’t understand Robert’s remark. All user-defined measurements, new or existing, do not give any results. If Robert’s remark is based on some Atlas performance monitoring function, then that function should be improved since it does not reflect reality.

Regards, Ernst J. Oud


Stephane Bortzmeyer

12:12 p.m.

On Wed, Sep 20, 2023 at 12:02:19PM +0200, Ernst J. Oud <ernstoud@gmail.com> wrote a message of 34 lines which said:

...

Yes, I also don’t understand Robert’s remark. All user defined measurements, new or existing do not give any results.

Maybe the situation is complicated and not fully understood yet. Good luck to the technical team, anyway.

Robert Kisteleki

12:19 p.m.

Hi, On 2023-09-20 11:56, Stephane Bortzmeyer wrote:

...

On Wed, Sep 20, 2023 at 07:43:20AM +0200, Robert Kisteleki <robert@ripe.net> wrote a message of 42 lines which said:

...
All else (continuing to run existing measurements, creating new ones, real-time streaming of the results, APIs, UI, ...) are running undisturbed.

This is not what I observe. For instance, asking for 100 probes in France yields a "No suitable probes" (see measurement #60181887). So, "completely down" seems a fair summary to me.

In that measurement you're asking specifically for probes tagged with "system-ipv4-works", which, as you can read in the related thread this morning, is a casualty of the current issue we're facing. We have a fix for this particular slice of the problem that we can apply once the system has recovered.

Regards, Robert
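To make the "No suitable probes" outcome concrete, here is a toy model of tag-restricted probe selection. It is not how Atlas is implemented; the function select_probes, the probe IDs and the tag index are all invented for illustration. It only shows why a tag whose probe set has been emptied produces zero matches, however many probes exist in the requested country.

```python
# Toy model only: probe IDs and tag sets below are invented for illustration.
def select_probes(candidates, tag_index, required_tags, requested):
    """Return up to `requested` candidate probes that carry every required tag."""
    eligible = set(candidates)
    for tag in required_tags:
        # A tag missing from the index is treated like an empty tag set.
        eligible &= tag_index.get(tag, set())
    return sorted(eligible)[:requested]


probes_in_fr = {1001, 1002, 1003}
tag_index = {
    "system-ipv4-works": set(),             # tag set lost during the incident
    "system-ipv4-stable-1d": {1001, 1003},  # unaffected, different backend
}

if __name__ == "__main__":
    print(select_probes(probes_in_fr, tag_index, ["system-ipv4-works"], 100))
    print(select_probes(probes_in_fr, tag_index, ["system-ipv4-stable-1d"], 100))
```

In this model the first call returns an empty list and the second returns the two probes carrying the stable-1d tag, which mirrors the difference between the failing measurement and the workaround described in the thread.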


participants (7)

Chris Amin
Ernst J. Oud
Fearghas Mckay
Peter Potvin
Randy Bush
Robert Kisteleki
Stephane Bortzmeyer