[db-wg] Removing personal data from bulk output from the RIPE Database

8 May 2013

Dear colleagues,

We are proposing an improvement to the algorithm used for removing
personal data from the bulk provisioning of RIPE Database data. Below
you will find an explanation of the current process, why we’re proposing
to change it, and how the new process would work.

------------------------
Current Process
------------------------
At the moment this data is published in three forms:

• Complete dump, accessible from:
<ftp://ftp.ripe.net/ripe/dbase/ripe.db.gz>
• Split dump of each object type, accessible from:
<ftp://ftp.ripe.net/ripe/dbase/split/>
• A live feed using the NRTM protocol, described at:
<https://www.ripe.net/data-tools/db/nrtm-mirroring>

These data dumps and streams are mainly used for keeping an up-to-date
local copy of the RIPE Database or for research and analytical purposes.
To prevent the bulk publishing of personal data, for example names,
email addresses and phone numbers, we remove all personal data from
these datasets. We call this process "dummification of personal data".
At the moment it consists of replacing all PERSON and ROLE objects with
a single dummy placeholder object as well as changing all references to
the personal data objects from all other objects to refer to this dummy
placeholder object.
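
As an illustration only, the current process described above could be
sketched in Python roughly as follows. This is not the production code.
The in-memory object representation, the placeholder handle
"DUMY-RIPE", the placeholder name and the list of reference attributes
are all assumptions made for this example.

```python
# Assumed values, for illustration only.
DUMMY_HANDLE = "DUMY-RIPE"
DUMMY_OBJECT = [("person", "Placeholder Person Object"),
                ("nic-hdl", DUMMY_HANDLE)]
REFERENCE_ATTRS = {"admin-c", "tech-c", "zone-c"}
PERSONAL_TYPES = {"person", "role"}


def dummify_current(objects):
    """Sketch of the current approach.

    `objects` is a list of objects, each a list of (attribute, value)
    pairs whose first pair names the object type.
    """
    result = []
    dummy_emitted = False
    for obj in objects:
        obj_type = obj[0][0]
        if obj_type in PERSONAL_TYPES:
            # Every PERSON and ROLE object collapses into one shared dummy.
            if not dummy_emitted:
                result.append(list(DUMMY_OBJECT))
                dummy_emitted = True
            continue
        # All references to personal objects now point at the dummy.
        result.append([(attr, DUMMY_HANDLE if attr in REFERENCE_ATTRS else value)
                       for attr, value in obj])
    return result
```

With this approach, an INETNUM object that referenced two different
contacts ends up referencing the same placeholder twice, so the link
between resources and their contacts is lost in the output.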

-----------------------------------------------------
Rationale for Improving the Process
-----------------------------------------------------
We’ve received feedback from different users and researchers that we are
overdoing the dummification. For example, one can obtain all references
to personal objects without hitting any personal object result limits by
querying the live RIPE Database with the appropriate flags (such as -r).
This makes the "dummification" of these references in the data dumps
meaningless.

For more information on these limits, see the Acceptable Use Policy:
<http://www.ripe.net/data-tools/support/documentation/aup>

-----------------------------------------------------
Proposal for New Dummification Algorithm
-----------------------------------------------------
In order to improve the usability of the data dumps and streams, we are
proposing to change the "dummification" algorithm to keep the actual
personal objects and all references to them and only obfuscate the
fields with personal data (for example real names, phone numbers and
addresses). The new algorithm will also try to preserve data that is
useful for researchers, while not revealing any data that might expose
the identity of the data subject. For example, we are proposing to keep
the first half of phone number digits or to keep the domain part of
email addresses.
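
The two examples above can be sketched in Python as follows. This is a
minimal sketch under stated assumptions, not the proposed
implementation: the function names, the mask characters and the choice
to round down when the number of digits is odd are all hypothetical.

```python
def obfuscate_phone(phone, mask="."):
    """Keep the first half of the digits and mask the rest.

    Non-digit characters (such as "+" and spaces) are left in place.
    Rounding down for an odd digit count is an assumption.
    """
    total_digits = sum(ch.isdigit() for ch in phone)
    keep = total_digits // 2
    seen = 0
    out = []
    for ch in phone:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen <= keep else mask)
        else:
            out.append(ch)
    return "".join(out)


def obfuscate_email(email, mask="***"):
    """Keep only the domain part; the mask string is an assumption."""
    local, sep, domain = email.rpartition("@")
    if not sep:
        # No "@" found, so there is no domain part to preserve.
        return mask
    return mask + "@" + domain


print(obfuscate_phone("+31 20 555 0100"))       # +31 20 5.. ....
print(obfuscate_email("jane.doe@example.net"))  # ***@example.net
```

In this sketch the country and area codes of the phone number and the
domain of the email address remain visible, while the digits and
local part that identify an individual are masked.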

The proposed changes should not affect any End User scripts currently
running because the data will still be in the valid RIPE RPSL format. We
propose to run the two datasets in parallel for some time to ensure a
smooth transition.

Further details and examples of how the dummified objects will change
are available in a RIPE Labs article at:

<https://labs.ripe.net/Members/kranjbar/proposed-improvements-to-dummification-of-personal-data-in-the-ripe-database>

Please let us know if you agree with the proposed improvements.

Regards,
Denis Walker
Business Analyst
RIPE NCC Database Group