Oddity in "atlas-probe-traffic" measurement data/graph

Hi,

Probes report the traffic volume they see on the device's/machine's interfaces in built-in measurement #9002 (on software probes, this has to be enabled manually for privacy reasons). I use that to keep an eye on the behavior/health of the probe's host machine via the graphs generated by RIPEstat, which works well most of the time.

Sometimes, however, there is an oddity in the data, namely extreme outliers. See the attached graph, showing a peak of more than 6.6 Gbit/s, while the "normal" traffic rarely exceeds 2 Mbit/s.

The underlying data shows an unusually high number of bytes/packets sent and received for a single instant, which is pretty much impossible. Here are the data points around the anomaly (from [1]):

[...]
{
    "timestamp": 1736330734,
    "bytes_recv": 70881.58333333333,
    "bytes_sent": 63566.92222222222,
    "packets_recv": 737.8166666666667,
    "packets_sent": 700.6,
    "interfaces": [ "eth0" ]
},
{
    "timestamp": 1736330921,
    "bytes_recv": 925866981.9732621,
    "bytes_sent": 820879052.6524065,
    "packets_recv": 8351935.347593583,
    "packets_sent": 8177239.866310161,
    "interfaces": [ "eth0" ]
},
{
    "timestamp": 1736331101,
    "bytes_recv": 106008.85555555555,
    "bytes_sent": 88055.43333333333,
    "packets_recv": 1100.5944444444444,
    "packets_sent": 971.6666666666666,
    "interfaces": [ "eth0", "he-ipv6" ]
},
[...]

I think I've seen similar behavior on all of my software probes at one time or another; I don't recall whether this previously occurred with my HW probe as well.

So far I have not been able to reproduce this manually, but it seems to be typically related to issues in the communication with the controller, be it something on the controller side (it was more frequent/pronounced during the migration to the new controller infrastructure) or something on the probe side, e.g. a temporary issue in the network stack configuration.

My current hypothesis is that when the probe cannot send off its measurement data, the accumulated data _for this particular measurement type_ somehow gets corrupted, and the corrupted data is then sent once the communication issues are resolved and the data upload resumes.

In this case, for example, the logs suggest that a re-registration with the controller took place during minute 10:08, with the measurement data for #9002 taken just as the re-registration started. I'm not sure whether that is a coincidence or whether there is a causal relation between the two.

Is anyone else seeing something similar, at least occasionally? Any idea what might be going on? (I have not gotten very far yet scouring the probe/measurement code for hints.)

Thanks!
R.

[1] https://stat.ripe.net/data/atlas-probe-traffic/data.json?probe_id=1008486&measurement_id=9002&starttime=2025-01-08T09:51:49&endtime=2025-01-08T10:15:32&resolution=0&display_mode=condensed
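
(For anyone who wants to reproduce this: below is a minimal Python sketch that pulls the data behind [1] and flags such spikes. The query parameters are the ones from the URL above; the exact structure of the JSON reply and the reading of the values as bytes-per-second averages are assumptions on my part, so inspect the response and adjust if needed.)

#!/usr/bin/env python3
"""Fetch the built-in traffic measurement (#9002) for one probe from the
RIPEstat atlas-probe-traffic endpoint and flag implausible samples.
Illustrative sketch only: the location of the samples inside the reply and
the interpretation of the values as bytes-per-second averages are
assumptions, not documented facts."""
import json
import urllib.parse
import urllib.request

BASE = "https://stat.ripe.net/data/atlas-probe-traffic/data.json"

def fetch_samples(probe_id, starttime, endtime):
    params = urllib.parse.urlencode({
        "probe_id": probe_id,
        "measurement_id": 9002,
        "starttime": starttime,
        "endtime": endtime,
        "resolution": 0,
        "display_mode": "condensed",
    })
    with urllib.request.urlopen(f"{BASE}?{params}") as resp:
        reply = json.load(resp)
    data = reply.get("data", {})
    # Assumption: the samples are a list of dicts (carrying a "timestamp"
    # key) somewhere under "data"; pick the first such list we find.
    for value in (data.values() if isinstance(data, dict) else []):
        if isinstance(value, list) and value and isinstance(value[0], dict) \
                and "timestamp" in value[0]:
            return value
    return []

def flag_spikes(samples, limit_bps=100e6):
    """Print samples whose implied receive rate exceeds limit_bps."""
    for s in samples:
        bps = s["bytes_recv"] * 8  # assumption: values are bytes/second
        if bps > limit_bps:
            print(s["timestamp"], f"{bps / 1e9:.2f} Gbit/s", s.get("interfaces"))

if __name__ == "__main__":
    samples = fetch_samples(1008486, "2025-01-08T09:51:49", "2025-01-08T10:15:32")
    flag_spikes(samples)

Run against the time window from [1], this should single out the sample at timestamp 1736330921.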

Looks like this affects anchors as well, see attached screenshot.

On 08.01.25 20:02, ripe@nurfuerspam.de wrote:
[...]

Hello,

Thank you for this analysis.

The traffic reporting is not a count of what the measurement code sends and receives; rather, it just reads the values from the network interface counters and reports those. (This is also why such reporting is opt-in for software probes: there may be unrelated activity on the same device that should not be reported.)

Based on the information you collected, it seems there may be a bug somewhere when these values are reported around disconnections. Perhaps the reports are sent out of order, which confuses the calculation. The values you see are certainly not "reasonable", but the code doesn't know that :-)

We'll check whether we can improve the error handling here; in the meantime, please consider these cases anomalies.

Note that we're working on phasing out the RIPEstat-related traffic widgets; the data will still be available to you. Depending on where the bug is (if there is one), this may solve this case as well.

Regards,
Robert
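
(Purely as an illustration of the suspected failure mode: if the plotted rate is derived by differencing successive cumulative interface-counter snapshots, which is an assumption about a calculation not shown in this thread, then a baseline lost around a reconnection/re-registration is enough to produce a spike of exactly this magnitude; a report applied out of order would distort the deltas of the surrounding intervals as well. A small Python sketch with invented counter values:)

"""Illustrative only: how differencing cumulative interface counters can
produce an implausible rate.  The real probe/back-end calculation is not
public in this thread; the counter values below are invented to roughly
match the numbers from the anomalous sample."""

def avg_rate(prev_counter, prev_ts, cur_counter, cur_ts):
    """Average bytes/second between two cumulative-counter snapshots."""
    return (cur_counter - prev_counter) / (cur_ts - prev_ts)

# Normal case: two snapshots 180 s apart, ~70 kB/s of traffic.
print(avg_rate(173_000_000_000, 1736330554,
               173_012_600_000, 1736330734))   # 70000.0 bytes/s

# Suspected failure mode: the stored baseline is lost (treated as zero)
# around a re-registration, so the "delta" becomes the whole cumulative
# counter and the apparent rate explodes to hundreds of MB/s, i.e. the
# multi-Gbit/s outlier in the graph.
print(avg_rate(0, 1736330734,
               173_137_000_000, 1736330921))   # ~9.26e8 bytes/s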