Proposal to allow UTF8

Re: [db-wg] Proposal to allow UTF8

Piotr Strzyzewski

17 Apr 2015

12:18 p.m.

Dear DB-WG Members,

Proposal:

I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

Description:

The RIPE NCC service region covers Europe, the Middle East and parts of Central Asia. Moreover, we have users from outside of this region. This means that the WHOIS DB stores data for people and organizations from a number of different countries using a number of different alphabets.

At this moment, all data in the RIPE WHOIS DB have to be stored using the 7-bit plain US ASCII character set.

[As a side note: It is technically possible to store some UTF8 content in some attributes, but the answer to a whois query (both terminal and web based) returns the "?" character in this case.]

Lack of full support for national character sets leads to some problems, which include, but are not limited to:

1. Mistakes in person/organization names due to national->english and english->national conversion (based mostly on guesswork).
2. Mistakes in person/organization addresses due to national->english and english->national conversion (based mostly on guesswork).
3. Conflicts of converted words with other correct words (most visible in latin-based character sets).
4. Possible formation of offensive words due to national->english conversion of names and/or addresses of a person/organization.

[As a side note to points 1-3: This could lead to problems when an LEA tries to find out precisely who should be contacted in case of abuse.]

On the other hand, community members need to know who is responsible for a certain resource without having to understand all the other character sets. Moreover, some objects are filled with data that has to be provided in the ASCII character set due to business rules (like ORGANISATION object details for LIRs). The RIPE NCC has a policy to insist on latin-based names for organisation objects that it verifies (allocated, and sponsored end-user space).

Taking this into account, I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

Some possible issues to be addressed:

1. If this proposal is supported by the DB-WG, it then has to be discussed at least with the AA-WG and AP-WG.
2. UTF8 may cause problems for client code.
   Comment: A proper implementation plan and announcement schedule should be prepared.
3. UTF8 may result in contact addresses and names that are not readable by a large part of the community.
   Comment: Primary keys (mostly names) still have to be in the ASCII character set. Moreover, LIR data are also in the ASCII character set due to business rules.
4. At this moment there are no major technical issues blocking UTF8 support in the RIPE DB back-end. However, thorough checks have to be done.

Looking forward to your comments.

Piotr

--
gucio -> Piotr Strzyżewski
E-mail: Piotr.Strzyzewski@polsl.pl
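
For readers who want to see the "?" side note in practice, here is a minimal Python sketch. It assumes (this is not stated in the proposal) that the "?" appears because text is re-encoded into a character set that lacks the character, with a replacement policy. The function name `to_legacy_charset` is hypothetical and is not part of any RIPE software.

```python
def to_legacy_charset(text: str, charset: str = "latin-1") -> str:
    """Round-trip text through a legacy charset, replacing unsupported characters.

    Hypothetical helper: it only mimics the symptom described above and makes
    no claim about how the RIPE whois server is implemented.
    """
    # errors="replace" substitutes "?" for every character the charset lacks.
    return text.encode(charset, errors="replace").decode(charset)


if __name__ == "__main__":
    for name in ("Piotr Strzyżewski", "João Damas", "Микки Маус"):
        print(f"{name!r} -> {to_legacy_charset(name)!r}")
```

Under this assumption, a name written in Cyrillic loses every letter, while a Polish name loses only the letters that fall outside the target character set.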

poty＠iiat.ru

17 Apr

1:11 p.m.

I completely agree with Piotr that the lack of UTF8 support causes confusion in several cases. Several times it took months of address changes in objects (as an example) before an invoice and the corresponding documents could be received. The problem also arises in geo-location services and in many other areas. So I support the initiative.

Regards,
Vladislav Potapov
IIAT, Ltd.
ru.iiat

Shane Kerr

2:46 p.m.

Piotr,

On Fri, 17 Apr 2015 12:18:04 +0200 Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...

Proposal:

I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

I think it makes sense.

As someone who only knows languages with Latin characters, it does make me nervous to do a lookup and get back something like:

    person: Микки Маус
    address: Волшебное Королевство
    phone-no: +1 234 567 8901 # curse useless mandatory fields!
    nic-hdl: THE-MOUSE-RIPE
    mnt-by: MNT-GLOBAL-COPYRIGHT-POLICE
    source: RIPE

Because what does it even mean if you don't understand Cyrillic? (You mention this specific issue in your proposal.)

But really, it's better to have correct information than to have something forced into a specific format with an imperfect approximation. We have Google Translate to try to guess what stuff means. :)

Personally I think it could be used for primary keys too, but that can always be added later.

Cheers,

--
Shane

Job Snijders

2:54 p.m.

Hi all,

On Fri, Apr 17, 2015 at 12:46:25PM +0000, Shane Kerr wrote:

...

On Fri, 17 Apr 2015 12:18:04 +0200 Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...
Proposal: I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

I think it makes sense. <snip>

Personally I think it could be used for primary keys too, but that can always be added later.

Allowing UTF8 in data fields helps with the issues that Piotr mentioned; there is a big benefit for all if you can accurately input your personal name or company address.

On the other hand, lots of primary keys are used by computer programs that don't care about such data. They just need to expand AS-Королевство into a set of prefixes. Wait, maybe it is easier to just leave that as AS-KINGDOM. :-)

I think it is best to leave primary keys untouched as they are today.

The proposal as set forth by Piotr will benefit humans, and it is my assessment that it does not harm existing computer programs.

Kind regards,

Job
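
The split described here, UTF8 for free text but unchanged primary keys, can be sketched as a simple validation rule. This is only an illustration in Python: the attribute names, the `PRIMARY_KEYS` set and the `check_object` function are hypothetical and do not reflect the actual RIPE database schema or software.

```python
# Hypothetical set of attributes treated as primary keys; the real list
# depends on the object type and is not taken from the RIPE DB schema.
PRIMARY_KEYS = {"nic-hdl", "as-set", "mntner"}


def check_object(attributes: list[tuple[str, str]]) -> list[str]:
    """Return a list of problems; an empty list means the object passes."""
    problems = []
    for name, value in attributes:
        # Primary keys stay ASCII so existing programs keep working.
        if name in PRIMARY_KEYS and not value.isascii():
            problems.append(f"{name}: primary key must be ASCII, got {value!r}")
        # Every other attribute may carry any text (a Python str is already
        # Unicode, so nothing further is checked here).
    return problems


if __name__ == "__main__":
    ok = [("person", "Микки Маус"), ("nic-hdl", "THE-MOUSE-RIPE")]
    bad = [("as-set", "AS-Королевство")]
    print(check_object(ok))   # no problems reported
    print(check_object(bad))  # one problem reported
```

A program that only expands set names into prefixes would never see non-ASCII keys under such a rule, which matches the argument that existing tooling is left unharmed.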

Jaap Akkerhuis

3:06 p.m.

Job Snijders writes:

...

<SNIP>

On the other hand lots of primary keys are used by computer programs that don't care about such data. They just need to expand AS-Королевство into a set of prefixes. wait, maybe it is easier to just leave that as AS-KINGDOM. :-)

I think it is best to leave primary keys untouched as they are today.

Strings comparison in UNICODE is problematic...

...

The proposal as set forth by Piotr will benefit humans, and it is my assessment that it does not harm existing computer programs.

Yup.

jaap
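
To make the remark about string comparison concrete, here is a small Python sketch using the standard `unicodedata` module. It shows two separate effects: the same visible text can have different code point sequences until it is normalized, and look-alike letters from different scripts stay different even after normalization. The sample strings are illustrative only.

```python
import unicodedata

# "ż" as one precomposed code point (U+017C) ...
precomposed = "Strzy\u017cewski"
# ... and as "z" followed by COMBINING DOT ABOVE (U+0307).
decomposed = "Strzyz\u0307ewski"

# Same rendering, but a naive comparison sees different strings.
print(precomposed == decomposed)  # False

# Normalizing both sides to NFC makes them compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True

# Normalization does not help with look-alikes from different scripts:
latin_key = "AS-KINGDOM"
mixed_key = "\u0410S-KINGDOM"  # starts with CYRILLIC CAPITAL LETTER A
print(unicodedata.normalize("NFC", mixed_key) == latin_key)  # False
```

The second effect is one reason keys that programs match byte for byte are easier to handle when they stay in ASCII.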

João Damas

4:38 p.m.

...

On 17 Apr 2015, at 14:54, Job Snijders <job@ntt.net> wrote:

Hi all,

On Fri, Apr 17, 2015 at 12:46:25PM +0000, Shane Kerr wrote:

...
On Fri, 17 Apr 2015 12:18:04 +0200 Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...
Proposal: I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

I think it makes sense. <snip>

Personally I think it could be used for primary keys too, but that can always be added later.

Allowing UTF8 in datafields helps with the issues that Piotr mentioned, there is a big benefit for all if you can accurately input your personal name or company address.

Fully agree.

...

On the other hand lots of primary keys are used by computer programs that don't care about such data. They just need to expand AS-Королевство into a set of prefixes. wait, maybe it is easier to just leave that as AS-KINGDOM. :-)

I think it is best to leave primary keys untouched as they are today.

Also agreed; not breaking the installed base is important. Hopefully whatever comes after whois/rpsl will do this right from the beginning… (*)

...

The proposal as set forth by Piotr will benefit humans, and it is my assessment that it does not harm existing computer programs.

yep!

Joao

(*) for some definition of right, of which there might be many.

Tim Bruijnzeels

23 Apr

3:27 p.m.

Hi WG,

...

On 17 Apr 2015, at 12:18, Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

Dear DB-WG Members

Proposal:

I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

Description:

The RIPE NCC service region covers Europe, the Middle East and parts of Central Asia. Moreover, we have users from outside of this region. This means that the WHOIS DB stores data for people and organizations from a number of different countries using a number of different alphabets.

At this moment, all data in the RIPE WHOIS DB have to be stored using the 7-bit plain US ASCII character set.

[As a side note: It is technically possible to store some UTF8 content in some attributes, but the answer to whois query (both terminal and web based) returns "?" character in this case.]

Technically it's Latin-1 at the moment, so it's a bit more than US ASCII, but does not include e.g. cyrillic.
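
As a rough illustration of the difference between US ASCII, Latin-1 and UTF-8 mentioned here, the Python sketch below reports the narrowest of the three that can represent a given string. The function name `narrowest_charset` is made up for this example.

```python
def narrowest_charset(text: str) -> str:
    """Return the narrowest of ascii / latin-1 / utf-8 that can encode text."""
    for charset in ("ascii", "latin-1"):
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    # UTF-8 can encode every Unicode character (lone surrogates aside).
    return "utf-8"


if __name__ == "__main__":
    for sample in ("Shane Kerr", "João Damas", "Piotr Strzyżewski", "Микки Маус"):
        print(f"{sample}: {narrowest_charset(sample)}")
```

Names from several participants in this thread fall into each of the three groups, so Latin-1 helps for some Western European names but neither for Polish "ż" nor for Cyrillic.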

...

Lack of full support for national character sets leads to some problems, which include, but are not limited to:

1. Mistakes in person/organization names due to national->english and english->national conversion (based mostly on guesswork).
2. Mistakes in person/organization addresses due to national->english and english->national conversion (based mostly on guesswork).
3. Conflicts of converted words with other correct words (most visible in latin-based character sets).
4. Possible formation of offensive words due to national->english conversion of names and/or addresses of a person/organization.

[As a side note to points no 1-3: This could lead to some problems when LEA tries to find out precisely who should be contacted in case of abuse.]

On the other hand, community members need to know who is responsible for a certain resource without having to understand all the other character sets. Moreover, some objects are filled with data that has to be provided in the ASCII character set due to business rules (like ORGANISATION object details for LIRs). The RIPE NCC has a policy to insist on latin-based names for organisation objects that it verifies (allocated, and sponsored end-user space).

Taking this into account, I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

Some possible issues to be addressed:

1. If this proposal is supported by the DB-WG, it then has to be discussed at least with the AA-WG and AP-WG.

Yes, also from RIPE NCC's perspective it's important that this is done before any final technical decision is made here.

...

2. UTF8 may cause problems for client code.

Comment: The proper implementation plan and announcements schedule should be prepared.

Ack. We will probably get a number of support requests because of this, but provided that we have a clear mandate from the community (including AA-WG and AP-WG), we can address this with an implementation plan, announcements and RC.

...

3. UTF8 may result in contact addresses and names that are not readable by a large part of the community.

This is why the AA-WG and AP-WG should be in the loop. The main concern is what users expect from the content of the registry with regards to abuse or contact related information for resources. Arguments can be made for both cases (accurate representation of names vs a representation that is more easily recognisable by most users). As far as we are concerned we are impartial in this, but need a clear community mandate on the way forward.

...

Comment: Primary keys (mostly names) still have to be in ASCII character set. Moreover, LIRs data are also in ASCII character set due to business rules.

At least with regards to resources allocated or assigned through the RIPE NCC (possibly through a sponsoring LIR) our current interpretation of related policies is that latin-1 is expected. I.e. people can deal with some variations on basic ascii, but nothing too exotic. Another concern is that it is not feasible for us to accurately verify names in all possible non-latin character sets, and of course it's very important to our function as a registry that this is done properly. That said we can also apply this to the specific information that is verified by the RIPE NCC, such as names and addresses for organisations, even if the database technically allows UTF-8 in other places: e.g. person names or names for organisations not associated with resources that the RIPE NCC allocated or assigned.

...

4. At this moment there are no major technical issues blocking UTF8 support in the RIPE DB back-end. However thorough checks have to be done.

Agreed. There is no technical showstopper from our perspective.

Kind regards,

Tim Bruijnzeels
Assistant Manager Software Engineering
RIPE NCC

...

Looking for your comments.

Piotr

-- gucio -> Piotr Strzyżewski E-mail: Piotr.Strzyzewski@polsl.pl
