Inconsistencies in anchor API and measurements
Hi everyone,

sorry in advance for the long mail.

tl;dr: Anchor API and UI give inconsistent results. Some anchor mesh measurements could be fixed, some might target non-anchors. Not sure what to do about it.

I am currently working a lot with RIPE Atlas and recently wanted to use the anchors and their mesh measurements in particular. I wanted to answer two simple queries:

1. Get a list of all active anchors
2. Get a list of all active anchor mesh and probe measurements (traceroute, for my particular use case)

However, while trying to answer these queries, I stumbled upon quite some inconsistencies, depending on which interface / API is used. As far as I can tell there are four ways in which one could technically answer query 1:

1. Look at the website: https://atlas.ripe.net/anchors/list/
   -> 840 results
2. Query the /anchors API with attribute is_disabled = false: https://atlas.ripe.net/api/v2/anchors/?is_disabled=false
   -> 844 results
3. Query the /probes API with attributes is_anchor = true and status = 1 (Connected): https://atlas.ripe.net/api/v2/probes/?is_anchor=True&status=1
   -> 729 results
4. Query the /probes API with attributes tags = system-anchor and status = 1 (Connected): https://atlas.ripe.net/api/v2/probes/?tags=system-anchor&status=1
   -> 729 results

Methods 3 and 4 are actually consistent!

The main discrepancy is between the /anchors and /probes API: 116 anchors that are listed on the webpage and/or /anchors API as ‘active’ are inactive (97 abandoned; 19 disconnected at time of writing) according to their respective /probes entry.

I understand that disconnects might be temporary, but some anchors seem to be inactive for years (at least according to their status) and are still listed as active.

I have attached a text file with some notes that go deeper into the differences, but might be hard to read.

For query 2, I faced a similar situation:

1. Look at the website: https://atlas.ripe.net/anchors/list/full/
   Anchoring Mesh IPv4:    840
   Anchoring Mesh IPv6:    739
   Anchoring Probes IPv4:  840
   Anchoring Probes IPv6:  705
   Total:                 3124
2. Query the /anchor-measurements API
   Anchoring Mesh IPv4:    849
   Anchoring Mesh IPv6:    743
   Anchoring Probes IPv4:  902
   Anchoring Probes IPv6:  753
   Total:                 3247
3. Query the /measurements API with attributes status = 2 (Ongoing), type = traceroute and corresponding attributes:
   Anchoring Mesh IPv4:   af = 4; tags=anchoring,mesh
   Anchoring Mesh IPv6:   af = 6; tags=anchoring,mesh
   Anchoring Probes IPv4: af = 4; tags=anchoring,probes
   Anchoring Probes IPv6: af = 6; tags=anchoring,probes
   For example: https://atlas.ripe.net/api/v2/measurements/?status=2&type=traceroute&af=6&tags=anchoring,probes
   Anchoring Mesh IPv4:   1026
   Anchoring Mesh IPv6:    902
   Anchoring Probes IPv4:  668
   Anchoring Probes IPv6:  619
   Total:                 3215

These results are even more mixed:

- Tags can be inconsistent: Some measurements have none, some have the ‘probes’ or ‘mesh’ tag, but miss the ‘anchoring’ tag.
- Some anchors have multiple measurements (especially probes measurements), of which most actually are run by the same set of probes, i.e., they are duplicates.
- Which measurements are contained in which of the three result sets is very mixed, maybe I should draw a Venn diagram :)

Finally, I looked at the consistency of the IP addresses of the /anchors API (ip_v4 and ip_v6), the /probes API (address_v4, address_v6), the DNS result for the FQDN of the anchors, and the target IP of the mesh/probes measurements.

I noticed some problems, since our lab (IIJ) also operates an anchor (probe 6425 [0]) and we updated the IP address some time ago, but are actually not reached by the mesh measurement, because the measurement still targets the old IP.
I attached a CSV that includes the raw data (of measurements with some form of problem), but basically there are 93 measurements from connected anchors that fail, of which 68 (from 29 anchors) could work if the measurement targeted the correct IP. These measurements have matching anchor/probe IPs and DNS records, so I do not know why the measurement target is stale. There are some additional measurements that could work, but it is unclear what the intended ‘correct’ IP is.

On that note, there are 48 measurements that ‘work’, i.e., they get a response from the target, but it is not clear if the target is the intended receiver:

- 8 target abandoned anchors
- 18 have different probe and anchor IPs and target one of them
- 21 have the same probe and anchor IP but target something else

Again, I am sorry for this long mail. I understand that RIPE Atlas is a huge project that has grown over time so it might be hard to keep some things synchronized, and some other things might not be easily decidable (e.g., when to mark an anchor as inactive). However, I think especially the IP address of an anchor in the /anchors and /probes APIs, in the DNS entry, and the target of the mesh/probes measurements need to be consistent. Currently some mesh measurements might target an entirely different machine.

I wanted to bring some attention to this, but I am not sure what else I can do as a user. I don’t want to complain too much :) For now I will just use all data sources as input and apply some sanity checks.

Best,
Malte

P.S.: Some feedback on how we can bring the measurement of our anchor to target our anchor would be nice though.

[0] https://atlas.ripe.net/probes/6425
Hi Malte,

Thank you for the long email. I will try to answer you as best I can.

On 15/11/2021 08:32, Malte Appel wrote:
Hi everyone,
sorry in advance for the long mail. tl;dr: Anchor API and UI give inconsistent results. Some anchor mesh measurements could be fixed, some might target non-anchors. Not sure what to do about it.
I am currently working a lot with RIPE Atlas and recently wanted to use the anchors and their mesh measurements in particular. I wanted to answer two simple queries:
1. Get a list of all active anchors
2. Get a list of all active anchor mesh and probe measurements (traceroute, for my particular use case)
However, while trying to answer these queries, I stumbled upon quite some inconsistencies, depending on which interface / API is used. As far as I can tell there are four ways in which one could technically answer query 1:
Going forward, the API is going to be the absolute source of truth. In the (near) future the web content will all be based on the data from the API. We are aware of inconsistencies between the web content and the API, but these will most likely not be fixed until we switch the pages over to using the API instead.
1. Look at the website: https://atlas.ripe.net/anchors/list/
   -> 840 results
2. Query the /anchors API with attribute is_disabled = false: https://atlas.ripe.net/api/v2/anchors/?is_disabled=false
   -> 844 results
3. Query the /probes API with attributes is_anchor = true and status = 1 (Connected): https://atlas.ripe.net/api/v2/probes/?is_anchor=True&status=1
   -> 729 results
4. Query the /probes API with attributes tags = system-anchor and status = 1 (Connected): https://atlas.ripe.net/api/v2/probes/?tags=system-anchor&status=1
   -> 729 results
Methods 3 and 4 are actually consistent!
The anchor tag is based on the value of is_anchor so these should indeed be identical.
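The comparison between methods 3 and 4 can be reproduced with a short script. The following is only a sketch: it assumes the v2 API answers with paginated JSON objects carrying `results` and `next` keys, and the helper names `build_url`, `fetch_ids` and `compare` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://atlas.ripe.net/api/v2"


def build_url(endpoint, **params):
    """Build a query URL for an endpoint such as 'probes' or 'anchors'."""
    # safe="," keeps comma-separated tag lists unescaped in the URL.
    return f"{API_BASE}/{endpoint}/?{urlencode(params, safe=',')}"


def fetch_ids(url):
    """Collect the 'id' of every result, following pagination links.

    Assumption: each page is a JSON object with 'results' and 'next'.
    """
    ids = set()
    while url:
        with urlopen(url) as response:
            page = json.load(response)
        ids.update(item["id"] for item in page["results"])
        url = page.get("next")
    return ids


def compare(ids_a, ids_b):
    """Return the IDs found only in the first set and only in the second."""
    return sorted(ids_a - ids_b), sorted(ids_b - ids_a)
```

Calling `fetch_ids(build_url("probes", is_anchor="True", status=1))` and `fetch_ids(build_url("probes", tags="system-anchor", status=1))` and passing both sets to `compare` should yield two empty lists if the tag really mirrors `is_anchor`.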
The main discrepancy is between the /anchors and /probes API: 116 anchors that are listed on the webpage and/or /anchors API as ‘active’ are inactive (97 abandoned; 19 disconnected at time of writing) according to their respective /probes entry.
Active and Inactive are anchor data points while connected/abandoned are probe data points. This means in short that anchors can be active while the probe is considered abandoned. We are rethinking the whole connected/disconnected and abandoned terminology but for now these will be different concepts.
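To make the two separate notions concrete, here is a minimal sketch that looks up the probe-side status behind every anchor that is not disabled. The record shapes are assumptions (an anchor entry with `probe` and `is_disabled` fields, a probe entry with a `status` object containing a `name`), and the sample data is invented.

```python
from collections import Counter


def probe_status_of_active_anchors(anchors, probes):
    """Count the probe status behind every anchor that is not disabled.

    Assumed record shapes (not confirmed in this thread):
      anchor: {"id": int, "probe": int, "is_disabled": bool}
      probe:  {"id": int, "status": {"name": str}}
    """
    status_by_probe = {p["id"]: p["status"]["name"] for p in probes}
    counts = Counter()
    for anchor in anchors:
        if anchor["is_disabled"]:
            continue  # anchor-side state: not active
        # Probe-side state, e.g. Connected, Disconnected or Abandoned.
        counts[status_by_probe.get(anchor["probe"], "Unknown")] += 1
    return counts


# Invented sample data, for illustration only.
sample_anchors = [
    {"id": 1, "probe": 101, "is_disabled": False},
    {"id": 2, "probe": 102, "is_disabled": False},
    {"id": 3, "probe": 103, "is_disabled": True},
    {"id": 4, "probe": 104, "is_disabled": False},
]
sample_probes = [
    {"id": 101, "status": {"name": "Connected"}},
    {"id": 102, "status": {"name": "Abandoned"}},
    {"id": 103, "status": {"name": "Abandoned"}},
]
status_counts = probe_status_of_active_anchors(sample_anchors, sample_probes)
print(dict(status_counts))
```

Any count under a status other than Connected corresponds to an anchor that is active on the anchor side while its probe is not connected.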
I understand that disconnects might be temporary, but some anchors seem to be inactive for years (at least according to their status) and are still listed as active.
It does make sense to mark an anchor as inactive when the probe becomes abandoned so I will implement that soon. This will make the results a little more predictable and logical.
I have attached a text file with some notes that go deeper into the differences, but might be hard to read.
For query 2, I faced a similar situation:
1. Look at the website: https://atlas.ripe.net/anchors/list/full/
   Anchoring Mesh IPv4:    840
   Anchoring Mesh IPv6:    739
   Anchoring Probes IPv4:  840
   Anchoring Probes IPv6:  705
   Total:                 3124
2. Query the /anchor-measurements API
   Anchoring Mesh IPv4:    849
   Anchoring Mesh IPv6:    743
   Anchoring Probes IPv4:  902
   Anchoring Probes IPv6:  753
   Total:                 3247
3. Query the /measurements API with attributes status = 2 (Ongoing), type = traceroute and corresponding attributes:
   Anchoring Mesh IPv4:   af = 4; tags=anchoring,mesh
   Anchoring Mesh IPv6:   af = 6; tags=anchoring,mesh
   Anchoring Probes IPv4: af = 4; tags=anchoring,probes
   Anchoring Probes IPv6: af = 6; tags=anchoring,probes
   For example: https://atlas.ripe.net/api/v2/measurements/?status=2&type=traceroute&af=6&tags=anchoring,probes
   Anchoring Mesh IPv4:   1026
   Anchoring Mesh IPv6:    902
   Anchoring Probes IPv4:  668
   Anchoring Probes IPv6:  619
   Total:                 3215
These results are even more mixed:
- Tags can be inconsistent: Some measurements have none, some have the ‘probes’ or ‘mesh’ tag, but miss the ‘anchoring’ tag.
The anchoring tag is removed when an anchor becomes decommissioned so it's correct that some measurements do not have this tag as only the 'active' set of measurements has the anchoring tag applied.
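Given that rule, measurements can be sorted by their tags before counting. The sketch below assumes each measurement record has a `tags` list and an `af` field; the function names and sample records are illustrative only.

```python
from collections import Counter

KINDS = ("mesh", "probes")


def classify_by_tags(measurement):
    """Return (kind, has_anchoring_tag) for one measurement record."""
    tags = set(measurement.get("tags") or [])
    kinds = [k for k in KINDS if k in tags]
    # Exactly one of 'mesh' / 'probes' is expected; anything else is unclear.
    kind = kinds[0] if len(kinds) == 1 else "unclear"
    return kind, "anchoring" in tags


def tally(measurements):
    """Count measurements per (kind, address family, has_anchoring_tag)."""
    counts = Counter()
    for m in measurements:
        kind, has_anchoring = classify_by_tags(m)
        counts[(kind, m["af"], has_anchoring)] += 1
    return counts


# Invented sample records, for illustration only.
sample = [
    {"id": 1, "af": 4, "tags": ["anchoring", "mesh"]},
    {"id": 2, "af": 4, "tags": ["mesh"]},  # e.g. anchor decommissioned
    {"id": 3, "af": 6, "tags": ["anchoring", "probes"]},
    {"id": 4, "af": 6, "tags": []},
]
tag_counts = tally(sample)
print(tag_counts)
```

Only the entries whose third key component is `True` would belong to the active set described above.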
- Some anchors have multiple measurements (especially probes measurements), of which most actually are run by the same set of probes, i.e., they are duplicates.
This is a known problem and I realised that the script that cleans this up hasn't been run for a while. I will do this soon so the duplicates will disappear.
- Which measurements are contained in which of the three result sets is very mixed, maybe I should draw a Venn diagram :)
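Until the cleanup has run, duplicates of the kind described in the first bullet can be spotted on the user side. This is a rough sketch, not the cleanup script mentioned in the reply: it treats two measurements as duplicates when target, address family and participating probe set are identical, and the `probe_ids` key is a hypothetical field that the caller has to fill in.

```python
from collections import defaultdict


def find_duplicates(measurements):
    """Group measurement IDs that share target, af and probe set.

    Assumed record shape (hypothetical):
      {"id": int, "target": str, "af": int, "probe_ids": iterable of int}
    """
    groups = defaultdict(list)
    for m in measurements:
        key = (m["target"], m["af"], frozenset(m["probe_ids"]))
        groups[key].append(m["id"])
    # Keep only groups with more than one measurement.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Invented sample records, for illustration only.
sample = [
    {"id": 10, "target": "anchor-a.example", "af": 4, "probe_ids": [1, 2, 3]},
    {"id": 11, "target": "anchor-a.example", "af": 4, "probe_ids": [3, 2, 1]},
    {"id": 12, "target": "anchor-a.example", "af": 6, "probe_ids": [1, 2, 3]},
    {"id": 13, "target": "anchor-b.example", "af": 4, "probe_ids": [1, 2]},
]
duplicates = find_duplicates(sample)
print(duplicates)
```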
Finally, I looked at the consistency of the IP addresses of the /anchors API (ip_v4 and ip_v6), the /probes API (address_v4, address_v6), the DNS result for the FQDN of the anchors, and the target IP of the mesh/probes measurements.
I noticed some problems, since our lab (IIJ) also operates an anchor (probe 6425 [0]) and we updated the IP address some time ago, but are actually not reached by the mesh measurement, because the measurement still targets the old IP.
When an anchor changes IP addresses after it's been deployed, the current set of anchoring measurements needs to be stopped and a new set needs to be created. When checking this, I saw that this hasn't happened yet for all anchors. I will fix this soon after the RIPE meeting. This should fix the vast majority of cases where you saw this issue, assuming we are aware of all IP address changes.
I attached a CSV that includes the raw data (of measurements with some form of problem), but basically there are 93 measurements from connected anchors that fail, of which 68 (from 29 anchors) could work if the measurement targeted the correct IP. These measurements have matching anchor/probe IPs and DNS records, so I do not know why the measurement target is stale. There are some additional measurements that could work, but it is unclear what the intended ‘correct’ IP is.
On that note, there are 48 measurements that ‘work’, i.e., they get a response from the target, but it is not clear if the target is the intended receiver:
- 8 target abandoned anchors
- 18 have different probe and anchor IPs and target one of them
- 21 have the same probe and anchor IP but target something else
Thank you for this list, I will compare it to the dataset I have to make sure I am correcting all known problems.
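The categories from the list above can be expressed as a small classification function, which may help when comparing datasets. This is a sketch under assumptions: the order in which the checks are applied is a choice made here, the DNS record is left out, and the category labels are invented. Addresses are parsed with `ipaddress` so that different spellings of the same IPv6 address compare equal.

```python
from ipaddress import ip_address


def classify_target(target_ip, anchor_ip, probe_ip, probe_status):
    """Classify a measurement by comparing its target with the anchor's IPs.

    target_ip:    target of the mesh/probes measurement
    anchor_ip:    ip_v4 / ip_v6 from the /anchors API
    probe_ip:     address_v4 / address_v6 from the /probes API
    probe_status: status name from the /probes API, e.g. "Abandoned"
    """
    target, anchor, probe = (
        ip_address(a) for a in (target_ip, anchor_ip, probe_ip)
    )
    if probe_status == "Abandoned":
        return "targets abandoned anchor"
    if anchor != probe:
        if target in (anchor, probe):
            return "anchor/probe IPs differ, target matches one"
        return "anchor/probe IPs differ, target matches neither"
    if target != anchor:
        return "anchor/probe IPs agree, target is something else"
    return "consistent"


# Documentation-range addresses, for illustration only.
print(classify_target("192.0.2.1", "192.0.2.9", "192.0.2.9", "Connected"))
```

The example call corresponds to the stale-target case: anchor and probe records agree, but the measurement points somewhere else.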
Again, I am sorry for this long mail. I understand that RIPE Atlas is a huge project that has grown over time so it might be hard to keep some things synchronized, and some other things might not be easily decidable (e.g., when to mark an anchor as inactive). However, I think especially the IP address of an anchor in the /anchors and /probes APIs, in the DNS entry, and the target of the mesh/probes measurements need to be consistent. Currently some mesh measurements might target an entirely different machine.
Since the introduction of the VM anchors, the network has grown much faster and unfortunately some data quality checks have not been run or implemented fully. We are working on this however and we expect to have most of these issues fixed before the end of the year.
I wanted to bring some attention to this, but not sure what else I can do as a user. I don’t want to complain too much :) For now I will just use all data sources as input and apply some sanity checks.
Thank you again for bringing this to our attention. We're working hard to fix the issues. If you feel I have not addressed some of your issues, please let me know. Apologies for these bugs, hopefully they will be permanently fixed in the near future.

Kind regards,
Johan ter Beest
RIPE Atlas Engineer
Best, Malte
P.S.: Some feedback on how we can bring the measurement of our anchor to target our anchor would be nice though.
See the inline answer above on stopping and recreating the active set of anchoring measurements.
Hi Johan,

thank you very much for the detailed response!

On 11/17/21 20:30, Johan ter Beest wrote:
[huge snip]
Thank you again for bringing this to our attention. We're working hard to fix the issues. If you feel I have not addressed some of your issues, please let me know.
I believe you have addressed everything appropriately and it is very nice to hear that these issues are being worked on :)
Apologies for these bugs, hopefully they will be permanently fixed in the near future.
Kind regards,
Johan ter Beest RIPE Atlas Engineer
Best, Malte