Removing personal data from bulk output from the RIPE Database
Dear colleagues,

We are proposing an improvement to the algorithm used for removing personal data from the bulk provisioning of RIPE Database data. Below you will find an explanation of the current process, why we’re proposing to change it and how the new process would work.

------------------------
Current process
------------------------

At the moment this data is published in three forms:

• Complete dump, accessible from: <ftp://ftp.ripe.net/ripe/dbase/ripe.db.gz>
• Split dump of each object type, accessible from: <ftp://ftp.ripe.net/ripe/dbase/split/>
• A live feed using NRTM protocol, described at: <https://www.ripe.net/data-tools/db/nrtm-mirroring>

These data dumps and streams are mainly used for keeping an up-to-date local copy of the RIPE Database or for research and analytical purposes.

To prevent the bulk publishing of personal data, for example names, email addresses and phone numbers, we remove all personal data from these datasets. We call this process "dummification of personal data". At the moment it consists of replacing all PERSON and ROLE objects with a single dummy placeholder object as well as changing all references to the personal data objects from all other objects to refer to this dummy placeholder object.

-------------------------------------------------
Rationale for Improving the Process
-------------------------------------------------

We’ve received feedback from different users and researchers that we are overdoing the dummification. For example, one can obtain all references to personal objects without hitting any personal object result limits by querying the live RIPE Database with proper flags (like -r). This makes the "dummification" of these references in the data dumps meaningless.
For more information, see: <http://www.ripe.net/data-tools/support/documentation/aup>

-----------------------------------------------------
Proposal for New Dummification Algorithm
-----------------------------------------------------

In order to improve the usability of the data dumps and streams, we are proposing to change the "dummification" algorithm to keep the actual personal objects and all references to them and only obfuscate the fields with personal data (for example real names, phone numbers and addresses). The new algorithm will also try to preserve data that is useful for researchers, while not revealing any data that might expose the identity of the data subject. For example, we are proposing to keep the first half of phone number digits or to keep the domain part of email addresses.

The proposed changes should not affect any End User scripts currently running because the data will still be in the valid RIPE RPSL format. We propose to run the two datasets in parallel for some time to ensure a smooth transition.

Further details and examples of how the dummified objects will change are given in a RIPE Labs article at: <https://labs.ripe.net/Members/kranjbar/proposed-improvements-to-dummification-of-personal-data-in-the-ripe-database>

Please let us know if you agree with the proposed improvements.

Regards,

Denis Walker
Business Analyst
RIPE NCC Database Group
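To make the proposal above concrete, here is a minimal Python sketch of field-level obfuscation. It is an illustration only, not the RIPE NCC implementation. The attribute list, the placeholder strings ("Name Removed", "***", "."), the rounding for an odd number of digits, and the function names obfuscate_phone, obfuscate_email and dummify_object are all assumptions made for this example. Only the two rules "keep the first half of phone number digits" and "keep the domain part of email addresses" come from the proposal itself.

```python
import re

# Matches one RPSL attribute line: "key:", the padding after the colon, the value.
ATTR_RE = re.compile(r"^([A-Za-z0-9-]+):(\s*)(.*)$")


def obfuscate_phone(value):
    """Keep the first half of the digits and replace the rest with '.'.

    Non-digit characters ('+', spaces, brackets) are kept so the shape of
    the number stays recognisable. Rounding down for an odd digit count
    is an assumption.
    """
    total_digits = sum(ch.isdigit() for ch in value)
    keep = total_digits // 2
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen <= keep else ".")
        else:
            out.append(ch)
    return "".join(out)


def obfuscate_email(value):
    """Replace the local part with a placeholder and keep the domain."""
    _local, sep, domain = value.rpartition("@")
    if not sep:
        return "***"  # not a recognisable address, hide it entirely
    return "***@" + domain


# Hypothetical mapping of attributes to obfuscation rules. Attributes not
# listed here (for example nic-hdl or abuse-mailbox) pass through unchanged,
# so references between objects stay valid.
HANDLERS = {
    "person": lambda value: "Name Removed",
    "address": lambda value: "***",
    "phone": obfuscate_phone,
    "fax-no": obfuscate_phone,
    "e-mail": obfuscate_email,
}


def dummify_object(rpsl_text):
    """Obfuscate personal fields in one object, keeping the RPSL layout."""
    out = []
    for line in rpsl_text.splitlines():
        match = ATTR_RE.match(line)
        if match and match.group(1).lower() in HANDLERS:
            key, pad, value = match.groups()
            line = f"{key}:{pad}{HANDLERS[key.lower()](value)}"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "person:  John Doe\n"
        "address: Main Street 1\n"
        "phone:   +31 20 535 4444\n"
        "e-mail:  john@example.net\n"
        "nic-hdl: JD1-RIPE"
    )
    print(dummify_object(sample))
```

Because each attribute keeps its key and its column alignment, the output is still one attribute per line, which is the property the proposal relies on when it says existing scripts should not be affected.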
Hi,

On Wed, May 08, 2013 at 12:06:15PM +0200, Denis Walker wrote:

-----------------------------------------------------
Proposal for New Dummification Algorithm
-----------------------------------------------------

In order to improve the usability of the data dumps and streams, we are proposing to change the "dummification" algorithm to keep the actual personal objects and all references to them and only obfuscate the fields with personal data (for example real names, phone numbers and addresses).

Sounds good to me. +1

Gert Doering
        -- NetMaster
--
have you enabled IPv6 on something today...?

SpaceNet AG                        Vorstand: Sebastian v. Bomhard
Joseph-Dollinger-Bogen 14          Aufsichtsratsvors.: A. Grundner-Culemann
D-80807 Muenchen                   HRB: 136055 (AG Muenchen)
Tel: +49 (89) 32356-444            USt-IdNr.: DE813185279
Denis,
We've received feedback from different users and researchers that we are overdoing the dummification. For example, one can obtain all references to personal objects without hitting any personal object result limits by querying the live RIPE Database with proper flags (like -r). This makes the "dummification" of these references in the data dumps meaningless.
I do not buy this argument. We know that certain access restrictions can eventually be circumvented by renting the ultimate botnet and doing a mass harvest. That doesn't render restrictions useless. One could argue that if certain access controls were implemented to achieve a certain goal and other methods open a path around these controls, those other methods (the -r flag in this case) ought to be reviewed instead.
In order to improve the usability of the data dumps and streams, we are proposing to change the "dummification" algorithm to keep the actual personal objects and all references to them and only obfuscate the fields with personal data (for example real names, phone numbers and addresses). The new algorithm will also try to preserve data that is useful for researchers, while not revealing any data that might expose the identity of the data subject. For example, we are proposing to keep the first half of phone number digits or to keep the domain part of email addresses.
I am missing a list of data protection goals that were desired to be met by the original implementation and a serious assessment why they would still be met by the proposed changed method. I doubt that obfuscating the local part of an email address is an adequate measure of anonymization or pseudonymization. Similar concerns hold for phone numbers. On a meta level, mangled data is a threat to real data more than replaced data is. FWIW, I don't see the special case for 'abuse-mailbox'.

With optimizing the 'dummification algorithm' around fuzzy criteria it occurs to me we're putting the cart before the horse.

-Peter
Dear Peter,

Thank you for your comments. First of all, let us make it clear that the RIPE NCC is only "proposing" a change based on many requests we have received, especially from CERTs, to streamline the dummification method. If the community wants to apply stricter restrictions, we will implement whatever the end result of the discussion is.

For historical background, the RIPE Database is an open public registry. ALL the information in it is available to anyone. For privacy reasons we have always applied limits on the amount of personal data that can be queried in a given time period. If you use the '-rB' query flags you avoid any personal data being returned. But ALL other data in the RIPE Database can be returned without any restrictions or limits. (Except for password hashes.)

The RIPE Data Protection Task Force looked at the ways of accessing bulk data from the RIPE Database. It was decided to apply dummification rules to the bulk data access. But a lot of the information that was dummified was always, and still is, available by normal queries to the RIPE Database. So it is not a case of the normal queries circumventing the dummification rules. It is that information was dummified that was still available by normal queries. So some of the dummification rules that were applied were meaningless. Access to that bulk data was never prevented.

What the RIPE NCC is suggesting now is to only obfuscate that part of the data that actually will identify an individual in the bulk data download. It is true that partial obfuscation of data can be ineffective if someone puts in the effort to correlate all the fields (for example, inferring the local part of an email address from the real name). But keep in mind that this data is already available publicly and anyone can obtain the whole dataset with far less effort.

The RIPE Data Protection Task Force discussed many data protection issues and goals. The outcome was to apply a set of strict rules on bulk data access that were considered 'safe'. Over the years since then the RIPE NCC has received many comments about this being 'overkill'. The current database architecture will support heavy 'legal' querying from single IP addresses. This renders some of the original goals not practically achievable.

The ripe563 policy, Abuse Contact Management in the RIPE Database, states "The “abuse-mailbox:” attribute must be available in an unrestricted way via whois, APIs and future techniques."

As we said at the beginning, the RIPE NCC is completely neutral on this issue. We will implement whatever the community wants.

Regards

Denis Walker
Business Analyst
RIPE NCC Database Group

On 08/05/2013 14:51, Peter Koch wrote:
Denis,
We've received feedback from different users and researchers that we are overdoing the dummification. For example, one can obtain all references to personal objects without hitting any personal object result limits by querying the live RIPE Database with proper flags (like -r). This makes the "dummification" of these references in the data dumps meaningless.
I do not buy this argument. We know that certain access restrictions can eventually be circumvented by renting the ultimate botnet and doing a mass harvest. That doesn't render restrictions useless. One could argue that if certain access controls were implemented to achieve a certain goal and other methods open a path around these controls, those other methods (the -r flag in this case) ought to be reviewed instead.
In order to improve the usability of the data dumps and streams, we are proposing to change the "dummification" algorithm to keep the actual personal objects and all references to them and only obfuscate the fields with personal data (for example real names, phone numbers and addresses). The new algorithm will also try to preserve data that is useful for researchers, while not revealing any data that might expose the identity of the data subject. For example, we are proposing to keep the first half of phone number digits or to keep the domain part of email addresses.
I am missing a list of data protection goals that were desired to be met by the original implementation and a serious assessment why they would still be met by the proposed changed method. I doubt that obfuscating the local part of an email address is an adequate measure of anonymization or pseudonymization. Similar concerns hold for phone numbers. On a meta level, mangled data is a threat to real data more than replaced data is. FWIW, I don't see the special case for 'abuse-mailbox'.
With optimizing the 'dummification algorithm' around fuzzy criteria it occurs to me we're putting the cart before the horse.
-Peter
Denis,

On Wednesday, 2013-05-08 12:06:15 +0200, Denis Walker <denis@ripe.net> wrote:
Please let us know if you agree with the proposed improvements.
In general I think it's fine, although I tend to think we should NOT dummify the ORGANISATION objects. These are not intended for persons, and should not raise the same concerns. Yes, people may have personal data there, but they can also put their mother's maiden name and their date of birth as a comment into an AS number registration. :)

Cheers,

--
Shane
participants (4)
- Denis Walker
- Gert Doering
- Peter Koch
- Shane Kerr