RIPE NCC measurement data retention
Dear all,

We've just published a proposal about establishing principles around how the RIPE NCC retains and publishes Internet measurement data, specifically in RIS and RIPE Atlas:

https://labs.ripe.net/author/kistel/ripe-ncc-measurement-data-retention-principles/

We would be very happy to see discussions about this here on the mailing list, on the RIPE NCC Forum, or live at RIPE87.

Regards,
Robert Kisteleki
RIPE NCC
Hi Robert,

Thanks for opening this discussion. I am one of those researchers (or data hoarders? :)) who look a lot at historical data, so I appreciate that there are no plans to delete data (yet). Moving old data to cheaper/slower storage is a good idea, imho, if it reduces costs for the RIPE NCC.

It would be nice, however, to retain the ability to fetch results (or metadata) for specific measurements or timeframes via an API, although this would still require an active index. For me, it would be fine if this access is slow. I am saying this because the usual alternative is to put archive dump files on some FTP server, and I cannot think of a suitable structure for that, e.g., for Atlas measurements: one file per measurement does not work for long-running, ongoing measurements, and one file per measurement per timeframe probably blows up the number of files.

Anyway, I support this proposal, but would prefer to retain systematic (if slow) access to the data.

Best,
Malte

P.S.: It may be obvious, but I mostly work with Atlas, less with RIS, so my comment mostly relates to Atlas data.
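For what it's worth, the access pattern Malte describes already has a natural shape in the current RIPE Atlas results API. A minimal Python sketch (assuming the existing /api/v2/measurements/{id}/results/ endpoint with its Unix-timestamp start/stop parameters, an illustrative measurement ID, and that this interface would remain the entry point for archived data):

    # Fetch results for one measurement over one timeframe, i.e. the kind
    # of "slow but systematic" access discussed above. Whether archived
    # (cold-storage) data stays reachable this way is the open question.
    import requests

    MSM_ID = 5001            # illustrative public measurement ID
    params = {
        "start": 1388534400,  # 2014-01-01 00:00:00 UTC (Unix timestamp)
        "stop":  1388620800,  # 2014-01-02 00:00:00 UTC
        "format": "json",
    }
    url = f"https://atlas.ripe.net/api/v2/measurements/{MSM_ID}/results/"
    results = requests.get(url, params=params, timeout=60).json()
    print(len(results), "results")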
Hello everyone,

Apologies in advance for the long post! I'm a long-time RIS data user, and I have a couple of suggestions related to the RIS data retention topic that Robert presented yesterday.

The first is about the usefulness of keeping multiple daily snapshots of the peer RIBs from decades ago. I agree that having 3x daily snapshots is useful for taking a quick look at routing tables, and it is very simple to use. However, I would like to point out that it is possible to recreate the RIB of each peer at any point in time by starting from any RIB snapshot and applying the contents of the UPDATE files collected by RIS between the snapshot's creation and the desired time. For example, if I want to see the RIB status of rrc00 at 04:00 UTC, I can take the RIB snapshot from midnight, evolve it with all the UPDATE files from midnight to 04:00 UTC, and enjoy the results. That said, one possibility for saving some data would be to get rid of two of the three daily snapshots for older months of RIS: keep the last years' RIBs as they are now, and for anything older remove the RIBs taken at 08:00 and 16:00, keeping the 00:00 one. Looking at October 2023 for rrc00, RIBs took 38.8 GB while UPDATEs took 45 GB. Of course, different collectors have different peers and record different volumes of BGP updates. Still, cutting the RIBs to one third would give a good saving in data.

The second is about compression. I understand that RIS relies on the collecting software to create gz files, but it is probably worth considering a compression technique that compresses the data further, at least for older data. I know RouteViews already uses bz2, which could be a good choice if the collecting software already handles it; every MRT reader is capable of handling bz2 files. However, I found xz to perform extremely well on MRT files, even though only a few MRT readers can read it. As an exercise, I took bview.20231129.0000.gz from rrc00. The file is 406 MB, which becomes 4.1 GB uncompressed. bzip2 on the uncompressed file yields a bview.20231129.0000.bz2 of 242 MB; xz yields a bview.20231129.0000.xz of 160 MB. There may be other compression tools out there that are even more efficient on MRT data, so a small study of the effectiveness of the different compression techniques should be performed before making any decision, if you want to follow this route.

Apologies once again for the long post!

Best regards,

Alessandro Improta
Engineering manager
p. +393488077654
e. aimprota@catchpoint.com
a. Via Aurelia Sud km 367, Pietrasanta (LU)
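A minimal sketch of the snapshot-plus-updates replay Alessandro describes, using the pybgpstream bindings. Assumptions: the CAIDA BGPStream broker is used to fetch the rrc00 dump and update files, a zero-length window at midnight selects the 00:00 RIB dump, and per-peer state is kept simply as (peer, prefix) -> AS path:

    # Reconstruct a per-peer RIB view at 04:00 UTC from the midnight
    # snapshot plus the subsequent UPDATE files, as described above.
    import pybgpstream

    rib = {}  # (peer_address, prefix) -> AS path

    # 1) Seed state from the 00:00 RIB dump of rrc00.
    stream = pybgpstream.BGPStream(
        from_time="2023-11-29 00:00:00", until_time="2023-11-29 00:00:00",
        collectors=["rrc00"], record_type="ribs",
    )
    for elem in stream:
        rib[(elem.peer_address, elem.fields["prefix"])] = elem.fields["as-path"]

    # 2) Replay announcements and withdrawals up to the target time.
    stream = pybgpstream.BGPStream(
        from_time="2023-11-29 00:00:00", until_time="2023-11-29 04:00:00",
        collectors=["rrc00"], record_type="updates",
    )
    for elem in stream:
        key = (elem.peer_address, elem.fields["prefix"])
        if elem.type == "A":          # announcement: install/replace path
            rib[key] = elem.fields["as-path"]
        elif elem.type == "W":        # withdrawal: drop the entry
            rib.pop(key, None)

    print(len(rib), "peer/prefix entries at 04:00 UTC")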
Hi folks,

Bumping this thread in case you missed it: the NCC is keen to hear the community's thoughts on the principles it should follow for measurement data retention. We had a good discussion on this topic at the WG session during RIPE87, and it'd be good to see it continue here, on the forum <https://forum.ripe.net/t/ripe-ncc-measurement-results-retention/759>, or privately back to the NCC if you prefer.

The relevant materials are here:

- Recording: https://ripe87.ripe.net/archives/video/1176/
- Slides: https://ripe87.ripe.net/presentations/13-kisteleki-MATWG-RIPE87.pdf (slides 16 & 17)
- Labs post: https://labs.ripe.net/author/kistel/ripe-ncc-measurement-data-retention-principles/

Specifically, the principles in question are the bullets on slide 17, or the bullets at the end of the Labs article.

Cheers,
S.
I'd like to go on the record and say that RIPE RIS is invaluable to the Internet research and routing analytics world, and a large part of that value comes from the longevity of its data archives. That being said, it's understandable that the cost of storing BGP RIBs and update messages could soon become untenable. However, I would like to point out that (obviously excluding redundancy) the currently quoted "raw" data volume for RIS is not that much. From the blog post linked in the OP:
This dataset currently weighs in at roughly 50 TB of compressed dump and update files, with 80% accounting for the data collected in the last five years
50 TB is roughly 13x 4 TB drives, and 12 3.5" drive slots is a standard layout for a 2U server, so the whole raw archive fits in about one chassis. It sounds like the larger concern (at least for the next five years) is that the RIPEstat use case is so large (800 TB according to the post). Could you provide more information on what is causing such a large amplification in storage usage? Is there an argument for deprecating some RIPEstat tools if they cost a huge amount of backend storage to provide, rather than risking degradation of RIS's historical archive?

As a side point, I would propose removing private RIPE Atlas measurements. Almost all things on the Internet (and thus being measured by RIPE Atlas) are public, so measurements towards such inherently public targets probably shouldn't need to be private.
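A back-of-envelope check of the drive arithmetic above, assuming the ~50 TB compressed-archive figure from the Labs post and commodity 4 TB drives:

    # Rough sizing of the raw RIS archive, redundancy excluded.
    import math

    archive_tb = 50       # compressed RIS dumps + updates (Labs post figure)
    drive_tb = 4          # common 3.5" drive size
    bays_in_2u = 12       # typical 2U front-bay count

    drives = math.ceil(archive_tb / drive_tb)   # -> 13 drives
    print(f"{drives} x {drive_tb} TB drives vs {bays_in_2u} bays in one 2U chassis")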
I'd like to go on the record and say that RIPE RIS is invaluable to the Internet research and routing analytics world, and a large part of that value comes from the longevity of its data archives.
aol, and agree that i do not understand how the numbers add up. randy
Thank you Ben, Randy and Joshua for your feedback.

The 50 TB (and growing) of storage space needed to hold all of the (compressed) RIS dump and update files for the entire history of the project is currently not a big concern. As Ben points out, 50 TB can relatively easily be held in a single machine; these days, three 20 TB disks are sufficient. Obviously, reality is a bit more complex, with redundancy and availability added into the mix, but since this is the less problematic part, we're not planning any changes in this area at present.

The 800 TB that was mentioned is a different matter. We store RIS data in a variety of ways for fast access by the RIPEstat front-end servers. We use Apache HBase for this, which uses Apache HDFS as a storage backend, giving us redundant storage. This redundancy comes at a price: by default, HDFS stores its data in triplicate, so the 800 TB of storage used contains just over 250 TB of actual data. Higher replication is possible but would cost even more, and lower is strongly discouraged. Then, for the various widgets / infocards / data endpoints in RIPEstat, the data is transformed and stored in different HBase tables. This unfortunately means further duplication, because the data is indexed by different aspects in different tables to match each specific access pattern.

These various ways of storing the same data were not a big problem in the past. However, with the growth of RIS and of the Internet, the volume of incoming data has steadily grown too, and it has now reached the point where we need to start thinking of different ways to make this data available to the community. This is where Robert's RIPE Labs post and his presentation at the meeting in Rome come in: we want to review how we make this data available to you as end users (be it as researchers, as operators, or in whatever form applies) so that we can remain cost-effective while giving you the most useful data in a fast and easy-to-access way.

So, in summary, we're looking at doing exactly what Ben suggests: keep offering the historic dataset as we currently do through the dump and update files, while reviewing how we can reduce the cost of the other ways in which we store this data without losing value for our end users.

Paul de Weerd
RIPE NCC
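A rough reconstruction of the arithmetic above, assuming only the HDFS default replication factor of 3 (the dfs.replication setting); the additional per-table duplication factor inside HBase is not public and is left as an unknown:

    # How 800 TB of physical HDFS usage relates to actual data held.
    raw_hdfs_tb = 800      # physical storage quoted above
    replication = 3        # HDFS default (dfs.replication)

    logical_tb = raw_hdfs_tb / replication
    print(f"actual data stored: ~{logical_tb:.0f} TB")  # ~267 TB, "just over 250 TB"

    # The same source data indexed N ways multiplies the footprint again,
    # *before* replication: raw = logical_source * N * replication.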
Hi,

I'm another researcher who uses quite a bit of the historical data held in these services, and I appreciate the commitment to keeping this data available where possible.

In the Labs article <https://labs.ripe.net/author/kistel/ripe-ncc-measurement-data-retention-principles/>, there's a statement that: "For the RIPEstat use-case, we make the data available in a variety of ways which takes up about 800 TB of storage space." This reads to me as if there's a lot of (potentially unnecessary?) data duplication, so proposal 2 sounds sensible. I would imagine that it's possible to reconstruct some or all of the formats served, so for older data, would producing some of these on the fly / converting formats be feasible? Is there a way to get a breakdown of which data forms are the most storage-intensive, or which parts of services like RIPEstat use the most storage?

I imagine there aren't many use cases where instant access to historic data is needed, so making access to older data slower/tiered (and hence cheaper) doesn't seem like a problem. But I'm looking at it very much from a research perspective, so I could be way off the mark on that.

Kind regards,
Josh
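To make the on-the-fly suggestion concrete, a purely hypothetical sketch of tiered access, with stub functions standing in for the indexed store and the archive replay (none of these names are real RIPEstat internals, and the five-year hot window is an arbitrary example):

    # Hypothetical tiered lookup: recent data from the pre-indexed store,
    # old data rebuilt from archived MRT files on demand.
    import datetime

    HOT_WINDOW = datetime.timedelta(days=5 * 365)   # e.g. keep 5 years "hot"

    def query_indexed_store(prefix, when):
        # Stand-in for a fast lookup in the pre-indexed (HBase-style) store.
        return {"prefix": prefix, "served_from": "hot"}

    def rebuild_from_archive(prefix, when):
        # Stand-in for locating the nearest dump, replaying updates, and
        # converting the state into the same response format on the fly.
        return {"prefix": prefix, "served_from": "cold"}

    def lookup_prefix(prefix, when):
        # `when` must be timezone-aware for the subtraction below.
        age = datetime.datetime.now(datetime.timezone.utc) - when
        if age < HOT_WINDOW:
            return query_indexed_store(prefix, when)
        return rebuild_from_archive(prefix, when)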
participants (8)

- Alessandro Improta
- Ben Cartwright-Cox
- Joshua Levett
- Malte Tashiro
- Paul de Weerd
- Randy Bush
- Robert Kisteleki
- Stephen Strowes