Thank you Ben, Randy and Joshua for your feedback.

The 50 TB (and growing) of storage needed to hold all of the (compressed) RIS dump and update files for the entire history of the project is currently not a big concern. As Ben points out, 50 TB fits relatively easily in a single machine; these days, three 20 TB disks would be sufficient. Reality is of course a bit more complex once redundancy and availability are added into the mix, but since this part is less problematic, we're not planning any changes in this area at present.

The 800 TB that was mentioned is, obviously, a different matter. We store RIS data in a variety of ways for fast access by the RIPEstat front-end servers. We use Apache HBase for this, which uses Apache HDFS as its storage backend, giving us redundant storage. This redundancy comes at a price: by default, HDFS stores every block in triplicate, so the 800 TB of raw storage in use holds just over 250 TB of actual data. Higher replication is possible but would cost even more, and lower replication is strongly discouraged.

On top of that, the data is transformed and stored in different HBase tables for the various widgets, infocards and data endpoints in RIPEstat. This unfortunately means further duplication, because the same data is indexed by different aspects in different tables to suit each specific access pattern.

These various ways of storing the same data were not a big problem in the past. With the growth of RIS and of the Internet, however, the volume of incoming data has steadily grown, and we have now reached the point where we need to start thinking about different ways to make this data available to the community. This is where Robert's RIPE Labs post and his presentation at the meeting in Rome come in: we want to review how we make this data available to you as end users (be it as researchers, as operators, or in whatever other role applies), so that we can remain cost effective while giving you the most useful data in a fast and easy-to-access way.

So, in summary, we're looking at doing exactly what Ben suggests: keep offering the historic dataset as we currently do through the dump and update files, while reviewing how we can reduce the cost of the other ways in which we store this data without losing value for our end users.

Paul de Weerd
RIPE NCC
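
PS: To illustrate what "indexed by different aspects in different tables" means in practice, here is a rough sketch. It is purely hypothetical (the row keys and layout below are not our actual schema), but it shows why serving several access patterns from HBase multiplies the stored data on top of HDFS's replication:

    # Hypothetical sketch, not the actual RIPEstat schema: the same BGP
    # announcement written under two different row keys, one per table,
    # so that each table serves its own lookup pattern efficiently.
    announcement = {
        "prefix": "193.0.0.0/21",
        "peer_asn": 3333,
        "timestamp": "2023-11-01T00:00:00Z",
        "as_path": [3333, 1103],
    }

    # Table optimised for "history of this prefix" queries:
    row_key_by_prefix = f"{announcement['prefix']}|{announcement['timestamp']}"

    # Table optimised for "what did this peer announce" queries:
    row_key_by_peer = f"{announcement['peer_asn']}|{announcement['timestamp']}"

    # Each table keeps a full copy of the record, and HDFS then stores
    # every copy in triplicate.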