Dear colleagues,

We are proposing an improvement to the algorithm used for removing personal data from the bulk provisioning of RIPE Database data. Below you will find an explanation of the current process, why we’re proposing to change it and how the new process would work.

------------------------
Current process
------------------------

At the moment this data is published in three forms:

• Complete dump, accessible from:
<ftp://ftp.ripe.net/ripe/dbase/ripe.db.gz>

• Split dump of each object type, accessible from:
<ftp://ftp.ripe.net/ripe/dbase/split/>

• A live feed using the NRTM protocol, described at:
<https://www.ripe.net/data-tools/db/nrtm-mirroring>

These data dumps and streams are mainly used for keeping an up-to-date local copy of the RIPE Database or for research and analytical purposes.

To prevent the bulk publishing of personal data, for example names, email addresses and phone numbers, we remove all personal data from these datasets. We call this process "dummification of personal data". At the moment it consists of replacing all PERSON and ROLE objects with a single dummy placeholder object, as well as changing all references to the personal data objects in all other objects so that they refer to this dummy placeholder object.

-------------------------------------------------
Rationale for Improving the Process
-------------------------------------------------

We’ve received feedback from different users and researchers that we are overdoing the dummification. For example, one can obtain all references to personal objects without hitting any personal object result limits by querying the live RIPE Database with proper flags (like -r). This makes the "dummification" of these references in the data dumps meaningless.
For more information, see:
<http://www.ripe.net/data-tools/support/documentation/aup>

-----------------------------------------------------
Proposal for New Dummification Algorithm
-----------------------------------------------------

In order to improve the usability of the data dumps and streams, we are proposing to change the "dummification" algorithm to keep the actual personal objects and all references to them, and to obfuscate only the fields with personal data (for example real names, phone numbers and addresses).

The new algorithm will also try to preserve data that is useful for researchers, while not revealing any data that might expose the identity of the data subject. For example, we are proposing to keep the first half of phone number digits or to keep the domain part of email addresses.

The proposed changes should not affect any End User scripts currently running because the data will still be in the valid RIPE RPSL format. We propose to run the two datasets in parallel for some time to ensure a smooth transition.

Further details and examples of how the dummified objects will change are available in a RIPE Labs article at:
<https://labs.ripe.net/Members/kranjbar/proposed-improvements-to-dummification-of-personal-data-in-the-ripe-database>

Please let us know if you agree with the proposed improvements.

Regards,

Denis Walker
Business Analyst
RIPE NCC Database Group
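
Illustration only: the field-level obfuscation proposed above (keeping the first half of the phone number digits and the domain part of email addresses) could be sketched in Python roughly as follows. This is a minimal sketch, not the RIPE NCC implementation. The function names, the "." mask character and the "unread" placeholder for the local part of the address are assumptions made for this example; the RIPE Labs article describes the actual proposed output.

```python
def dummify_phone(phone: str, mask: str = ".") -> str:
    """Keep the first half of the digits; replace the remaining digits with `mask`.

    Non-digit characters ("+", spaces, dashes) are left in place so the
    value keeps its original shape. The mask character is an assumption.
    """
    total_digits = sum(ch.isdigit() for ch in phone)
    keep = total_digits // 2  # with an odd count, the smaller half is kept
    seen = 0
    out = []
    for ch in phone:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen <= keep else mask)
        else:
            out.append(ch)
    return "".join(out)


def dummify_email(email: str, placeholder: str = "unread") -> str:
    """Keep the domain part; replace the local part with `placeholder`.

    The placeholder value is hypothetical. A value without "@" is
    replaced entirely, since no domain can be identified.
    """
    _local, sep, domain = email.rpartition("@")
    if not sep:
        return placeholder
    return f"{placeholder}@{domain}"


if __name__ == "__main__":
    print(dummify_phone("+31 20 535 4444"))       # +31 20 5.. ....
    print(dummify_email("john.doe@example.net"))  # unread@example.net
```

Because only attribute values change, an object processed this way keeps the same attributes and references, which is why existing scripts that parse RPSL should continue to work.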