Re: [db-wg] Removing personal data from bulk output from the RIPE Database

8 May 2013

Dear Peter,

Thank you for your comments. First of all, let us make it clear that the 
RIPE NCC is only "proposing" a change, based on many requests we have 
received, especially from CERTs, to streamline the dummification method. 
If the community wants to apply stricter restrictions, we will 
implement whatever the outcome of the discussion is.

For historical background, the RIPE Database is an open public registry. 
ALL the information in it is available to anyone. For privacy reasons we 
have always applied limits on the amount of personal data that can be 
queried in a given time period. If you use the '-rB' query flags you 
avoid any personal data being returned. But ALL other data in the RIPE 
Database can be returned without any restrictions or limits. (Except for 
password hashes.)

The RIPE Data Protection Task Force looked at the ways of accessing bulk 
data from the RIPE Database. It was decided to apply dummification rules 
to the bulk data access. But a lot of the information that was dummified 
was always, and still is, available by normal queries to the RIPE 
Database. So it is not a case of normal queries circumventing the 
dummification rules. Rather, the rules dummified information that 
remained available by normal queries, so some of them were meaningless. 
Access to that bulk data was never prevented.

What the RIPE NCC is suggesting now is to obfuscate only that part of 
the data in the bulk download that actually identifies an individual. 
It is true that partial obfuscation of data can be ineffective if 
someone puts in the effort to correlate all the fields (for example, 
inferring the local part of an email address from the real name). But 
keep in mind that this data is already publicly available and anyone can 
obtain the whole dataset with far less effort.
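
To make the idea concrete, here is a rough Python sketch of the two 
examples given in the original proposal (quoted further down): keep the 
domain part of an email address, and keep the first half of the digits 
of a phone number. This is an illustration only, not the RIPE NCC 
implementation. The function names, the "***" placeholder and the "." 
mask character are assumptions made up for this example.

```python
def obfuscate_email(address: str, placeholder: str = "***") -> str:
    """Hide the local part of an email address but keep the domain.

    Hypothetical helper: the placeholder string is an assumption,
    not the value used in any real RIPE Database dump.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" found, so nothing can be safely preserved.
        return placeholder
    return f"{placeholder}@{domain}"


def obfuscate_phone(number: str, mask: str = ".") -> str:
    """Keep the first half of the digits and mask the remaining ones.

    Non-digit characters (plus sign, spaces) are left in place so the
    layout of the number stays recognisable. With an odd number of
    digits, the extra digit is masked.
    """
    total_digits = sum(ch.isdigit() for ch in number)
    keep = total_digits // 2
    seen = 0
    out = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen <= keep else mask)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(obfuscate_email("jane.doe@example.net"))
    print(obfuscate_phone("+00 11 222 3333"))
```

Note that the preserved domain and phone prefix are exactly the kind of 
partial data that could be correlated with other fields, which is the 
limitation acknowledged above.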

The RIPE Data Protection Task Force discussed many data protection 
issues and goals. The outcome was to apply a set of strict rules on bulk 
data access that were considered 'safe'. Over the years since then, the 
RIPE NCC has received many comments that this is 'overkill'. The 
current database architecture will support heavy 'legal' querying from 
single IP addresses. This makes some of the original goals practically 
unachievable.

The ripe563 policy, Abuse Contact Management in the RIPE Database, 
states "The “abuse-mailbox:” attribute must be available in an 
unrestricted way via whois, APIs and future techniques."

As we said at the beginning, the RIPE NCC is completely neutral on this 
issue. We will implement whatever the community wants.

Regards
Denis Walker
Business Analyst
RIPE NCC Database Group

On 08/05/2013 14:51, Peter Koch wrote:
...
Denis,
...
We've received feedback from different users and researchers that we are
overdoing the dummification. For example, one can obtain all references
to personal objects without hitting any personal object result limits by
querying the live RIPE Database with proper flags (like -r). This makes
the "dummification" of these references in the data dumps meaningless.
I do not buy this argument.  We know that certain access restrictions
can be circumvented eventually by renting the ultimate botnet and doing a mass
harvest.  That doesn't render restrictions useless.
One could argue that if certain access controls were implemented to achieve
a certain goal and other methods open a path around these controls, those other
methods (the -r flag in this case) ought to be reviewed instead.
...
In order to improve the usability of the data dumps and streams, we are
proposing to change the "dummification" algorithm to keep the actual
personal objects and all references to them and only obfuscate the
fields with personal data (for example real names, phone numbers and
addresses). The new algorithm will also try to preserve data that is
useful for researchers, while not revealing any data that might expose
the identity of the data subject. For example, we are proposing to keep
the first half of phone number digits or to keep the domain part of
email addresses.
I am missing a list of data protection goals that were desired to be met
by the original implementation and a serious assessment why they would
still be met by the proposed changed method.  I doubt that obfuscating
the local part of an email address is an adequate measure of anonymization
or pseudonymization.  Similar concerns hold for phone numbers.
On a meta level, mangled data is a threat to real data more than
replaced data is. FWIW, I don't see the special case for 'abuse-mailbox'.
With optimizing the 'dummification algorithm' around fuzzy criteria
it occurs to me we're putting the cart before the horse.
-Peter