Dear Peter,

Thank you for your comments.

First of all, let us make it clear that the RIPE NCC is only "proposing" a change, based on many requests we have received, especially from CERTs, to streamline the dummification method. If the community wants to apply stricter restrictions, we will implement whatever the end result of the discussion is.

For historical background, the RIPE Database is an open public registry. ALL the information in it is available to anyone. For privacy reasons we have always applied limits on the amount of personal data that can be queried in a given time period. If you use the '-rB' query flags you avoid any personal data being returned. But ALL other data in the RIPE Database can be returned without any restrictions or limits (except for password hashes).

The RIPE Data Protection Task Force looked at the ways of accessing bulk data from the RIPE Database. It was decided to apply dummification rules to the bulk data access. But a lot of the information that was dummified was always, and still is, available by normal queries to the RIPE Database. So it is not a case of the normal queries circumventing the dummification rules. Rather, information was dummified that was still available by normal queries, so some of the dummification rules that were applied were meaningless. Access to that bulk data was never prevented.

What the RIPE NCC is suggesting now is to obfuscate only that part of the data in the bulk data download that will actually identify an individual. It is true that partial obfuscation of data can be ineffective if someone puts in the effort to correlate all the fields (for example, to infer the local part of an email address from the real name). But keep in mind that this data is already publicly available and anyone can obtain the whole dataset with far less effort.

The RIPE Data Protection Task Force discussed many data protection issues and goals. The outcome was to apply a set of strict rules on bulk data access that were considered 'safe'. Over the years since then the RIPE NCC has received many comments about this being 'overkill'. The current database architecture will support heavy 'legal' querying from single IP addresses. This renders some of the original goals unachievable in practice.

The ripe563 policy, Abuse Contact Management in the RIPE Database, states: "The “abuse-mailbox:” attribute must be available in an unrestricted way via whois, APIs and future techniques."

As we said at the beginning, the RIPE NCC is completely neutral on this issue. We will implement whatever the community wants.

Regards
Denis Walker
Business Analyst
RIPE NCC Database Group

On 08/05/2013 14:51, Peter Koch wrote:
Denis,
We've received feedback from different users and researchers that we are overdoing the dummification. For example, one can obtain all references to personal objects without hitting any personal object result limits by querying the live RIPE Database with proper flags (like -r). This makes the "dummification" of these references in the data dumps meaningless.
I do not buy this argument. We know that certain access restrictions can eventually be circumvented by renting the ultimate botnet and doing a mass harvest. That doesn't render restrictions useless. One could argue that if certain access controls were implemented to achieve a certain goal and other methods open a path around these controls, those other methods (the -r flag in this case) ought to be reviewed instead.
In order to improve the usability of the data dumps and streams, we are proposing to change the "dummification" algorithm to keep the actual personal objects and all references to them and only obfuscate the fields with personal data (for example real names, phone numbers and addresses). The new algorithm will also try to preserve data that is useful for researchers, while not revealing any data that might expose the identity of the data subject. For example, we are proposing to keep the first half of phone number digits or to keep the domain part of email addresses.
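For illustration only, the partial obfuscation described above could be sketched in Python roughly as follows. This is not the RIPE NCC's implementation: the function names, the '***' and '*' placeholders, and rounding down when taking "half" of the digits are all assumptions made for the example.

```python
def obfuscate_email(address: str) -> str:
    """Keep the domain part; replace the local part with a placeholder.

    Hypothetical helper: the '***' placeholder is an assumption.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No '@' found, so there is no domain to preserve; hide everything.
        return "***"
    return "***@" + domain


def obfuscate_phone(number: str) -> str:
    """Keep the first half of the digits; mask the remaining digits.

    Non-digit characters (plus sign, spaces, dashes) are left in place.
    With an odd digit count, "half" is rounded down (an assumption).
    """
    total_digits = sum(ch.isdigit() for ch in number)
    keep = total_digits // 2
    seen = 0
    out = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen <= keep else "*")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(obfuscate_email("john.doe@example.net"))
    print(obfuscate_phone("+31 20 535 4444"))
```

Even in this toy form, the retained domain and leading digits show why correlation across fields remains possible, which is the concern raised in the reply below.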
I am missing a list of the data protection goals that the original implementation was meant to meet, and a serious assessment of why they would still be met by the proposed changed method. I doubt that obfuscating the local part of an email address is an adequate measure of anonymization or pseudonymization. Similar concerns hold for phone numbers. On a meta level, mangled data is more of a threat to real data than replaced data is. FWIW, I don't see the special case for 'abuse-mailbox'.
By optimizing the 'dummification algorithm' around fuzzy criteria, it occurs to me, we're putting the cart before the horse.
-Peter