Proposal to allow non-ASCII characters in "org-name:", "person:" and "role:" attributes
Dear colleagues,

Currently the RIPE database only allows a subset of ASCII characters in the "org-name:", "person:" and "role:" attributes, for a few reasons including:

* These attributes are also a look-up key and the Whois protocol does not allow specifying character sets in queries.
* RPSL names are ASCII according to RFC2622
* Using a normalised name makes the object easier to query
* Reading a normalised name is easier to interpret

However there are some drawbacks to forcing names to only use a subset of ASCII characters:

* Organisations, roles and persons cannot use their actual name if it includes characters outside this subset.
* Normalisation is not standard, but is an interpretation done by each maintainer, e.g. characters could be excluded or converted in different ways.

Since we support the Latin-1 character set in the RIPE database, I propose we also allow non-ASCII Latin-1 characters in these attributes.

Querying for a name can be done either using the latin-1 characters (proposed) or a normalised, ASCII representation (currently). The normalised version will be generated by Whois and stored in a database index for querying. The primary key will also be generated from the normalised version.

Please let me know your feedback.

Regards
Ed Shryane
RIPE NCC

---

Whois attribute verbose description (copied from the help text).

org-name
--------
Specifies the name of the organisation that this organisation object represents in the RIPE Database. This is an ASCII-only text attribute. The restriction is because this attribute is a look-up key and the whois protocol does not allow specifying character sets in queries. The user can put the name of the organisation in non-ASCII character sets in the "descr:" attribute if required.

A list of 1 to 30 words separated by white space.
A word is made up of ASCII alphanumeric characters and additionally: ][)(._"*@,&:!'`+/-
A word may have up to 64 characters and is not case sensitive.
Each word can have any combination of the above characters with no restriction on the start or end of a word.

person
------
Specifies the full name of an administrative, technical or zone contact person for other objects in the database.

It should contain 2 to 10 words.
A word is made up of ASCII alphanumeric characters and additionally: .`'_-
The first word should begin with a letter.
At least one other word should also begin with a letter.
Max 64 characters can be used in each word.

role
----
Specifies the full name of a role entity, e.g. RIPE DBM.

A list of 1 to 30 words separated by white space.
A word is made up of ASCII alphanumeric characters and additionally: ][)(._"*@,&:!'`+/-
A word may have up to 64 characters and is not case sensitive.
Each word can have any combination of the above characters with no restriction on the start or end of a word.
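
For readers who want to try the current syntax rules, the help text above can be sketched as a small Python check. This is only an illustration of the rules as quoted, not the Whois server's own validation code; the function names and the use of regular expressions are assumptions made for this example.

```python
import re

# Extra characters allowed in "org-name:" and "role:" words, per the help text.
ORG_ROLE_EXTRA = "][)(._\"*@,&:!'`+/-"
# Extra characters allowed in "person:" words, per the help text.
PERSON_EXTRA = ".`'_-"


def _word_pattern(extra):
    # One word: ASCII alphanumerics plus the extra characters, 1 to 64 long.
    return re.compile("[A-Za-z0-9" + re.escape(extra) + "]{1,64}")


ORG_ROLE_WORD = _word_pattern(ORG_ROLE_EXTRA)
PERSON_WORD = _word_pattern(PERSON_EXTRA)


def valid_org_or_role(name):
    """1 to 30 words, each built only from the allowed ASCII characters."""
    words = name.split()
    return 1 <= len(words) <= 30 and all(
        ORG_ROLE_WORD.fullmatch(w) for w in words
    )


def valid_person(name):
    """2 to 10 words; the first and at least one other begin with a letter."""
    words = name.split()
    if not 2 <= len(words) <= 10:
        return False
    if not all(PERSON_WORD.fullmatch(w) for w in words):
        return False
    starts_with_letter = [w[0].isascii() and w[0].isalpha() for w in words]
    return starts_with_letter[0] and any(starts_with_letter[1:])
```

Under these rules a name such as "Piotr Strzyżewski" is rejected because of the single non-ASCII letter, which is the drawback the proposal describes.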
Dear Edward, On Fri, Nov 24, 2023 at 10:03:15AM +0100, Edward Shryane via db-wg wrote:
Currently the RIPE database only allows a subset of ASCII characters in the "org-name:", "person:" and "role:" attributes, for a few reasons including:
* These attributes are also a look-up key and the Whois protocol does not allow specifying character sets in queries. * RPSL names are ASCII according to RFC2622 * Using a normalised name makes the object easier to query * Reading a normalised name is easier to interpret
However there are some drawbacks to forcing names to only use a subset of ASCII characters:
* Organisations, roles and persons cannot use their actual name if it includes characters outside this subset. * Normalisation is not standard, but is an interpretation done by each maintainer, e.g. characters could be excluded or converted in different ways.
The above two points are key in making the RIPE database useful and accessible to everyone, I too would love to see those points addressed.
Since we support the Latin-1 character set in the RIPE database, I propose we also allow non-ASCII Latin-1 characters in these attributes.
Querying for a name can be done either using the latin-1 characters (proposed) or a normalised, ASCII representation (currently). The normalised version will be generated by Whois and stored in a database index for querying. The primary key will also be generated from the normalised version.
Please let me know your feedback.
Wouldn't it be an opportune time to support UTF-8 instead of LATIN-1? As I understand it, through the use of UTF-8 more languages could be supported. UTF-8 seems to be the preferred character encoding in any new IETF work (for good reason).

Have the effects of LATIN-1 on downstream applications such as NRTM v3 and NRTM v4 been considered?

You indicate that LATIN-1 already is supported in the RIPE database, so I imagine you and the team already deliberated on the pro's and con's of UTF-8 vs LATIN-1; and as such concluded with this particular recommendation. I just wanted to make sure to raise these questions. :-)

Some interesting reading material on UTF-8 https://utf8everywhere.org/

Kind regards,

Job
Job Snijders via db-wg wrote on 24/11/2023 09:21:
Wouldn't it be an opportune time to support UTF-8 instead of LATIN-1?
punycode?? Nick (Sorry, couldn't resist.)
Hi Job,
On 24 Nov 2023, at 10:21, Job Snijders <job@fastly.com> wrote:
Wouldn't it be an opportune time to support UTF-8 instead of LATIN-1? As I understand it, through the use of UTF-8 more languages could be supported. UTF-8 seems to be the preferred character encoding in any new IETF work (for good reason).
I wrote an impact analysis on UTF-8 in the RIPE database last year: https://labs.ripe.net/author/ed_shryane/impact-analysis-for-utf-8-in-the-rip... We already support UTF-8 in the Whois REST API and on the website, but convert to/from latin-1 in the database. Switching to UTF-8 in the database is not technically difficult, but we need functional requirements from the community on where to allow UTF-8 characters. This proposal is only to support more Latin-1 characters to be supported in names, while preserving backwards compatibility for querying (by also doing normalisation to ASCII).
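
As an illustration of what "normalisation to ASCII" could look like, here is a minimal Python sketch. The thread does not specify the algorithm Whois would use, so the approach below (Unicode decomposition plus a small fallback table) and every name in it are assumptions made for this example only.

```python
import unicodedata

# Hypothetical fallbacks for Latin-1 letters that have no Unicode
# decomposition; the real mapping would be a decision for the DB team.
FALLBACKS = {
    "ß": "ss", "Æ": "AE", "æ": "ae", "Ø": "O", "ø": "o",
    "Ð": "D", "ð": "d", "Þ": "TH", "þ": "th",
}


def normalise_to_ascii(name):
    """Return an ASCII-only form of a Latin-1 name for use as a lookup key."""
    out = []
    for ch in name:
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        # NFKD splits an accented letter into base letter + combining mark;
        # keeping only the ASCII code points drops the mark.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)
```

With a server-side mapping like this, a query for "Muller" and a query for "Müller" could both resolve to the same index entry, and maintainers would no longer each pick their own transliteration.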
Have the effects of LATIN-1 on downstream applications such as NRTM v3 and NRTM v4 been considered?
Allowing Latin-1 in these name attributes *does* impact NRTMv3 and NRTMv4 (as they will no longer be ASCII only), but these characters are already allowed elsewhere in RPSL (e.g. the workaround of putting the correct name in the "descr:" attribute). Also the object primary key will remain ASCII.
You indicate that LATIN-1 already is supported in the RIPE database, so I imagine you and the team already deliberated on the pro's and con's of UTF-8 vs LATIN-1; and as such concluded with this particular recommendation. I just wanted to make sure to raise these questions. :-)
We can switch to UTF-8, this proposal allows more characters in those attributes without needing to change the database character set.
Regards Ed Shryane RIPE NCC
On 2023 Nov 24 (Fri) at 10:42:11 +0100 (+0100), Edward Shryane via db-wg wrote:
:> On 24 Nov 2023, at 10:21, Job Snijders <job@fastly.com> wrote:
:> On Fri, Nov 24, 2023 at 10:03:15AM +0100, Edward Shryane via db-wg wrote:
:I wrote an impact analysis on UTF-8 in the RIPE database last year:
:https://labs.ripe.net/author/ed_shryane/impact-analysis-for-utf-8-in-the-rip...
:
:We already support UTF-8 in the Whois REST API and on the website, but convert to/from latin-1 in the database.
:
:Switching to UTF-8 in the database is not technically difficult, but we need functional requirements from the community on where to allow UTF-8 characters.
:
:This proposal is only to support more Latin-1 characters to be supported in names, while preserving backwards compatibility for querying (by also doing normalisation to ASCII).
:
:> Have the effects of LATIN-1 on downstream applications such as NRTM v3
:> and NRTM v4 been considered?
:
:Allowing Latin-1 in these name attributes *does* impact NRTMv3 and NRTMv4 (as they will no longer be ASCII only), but these characters are already allowed elsewhere in RPSL (e.g. the workaround of putting the correct name in the "descr:" attribute). Also the object primary key will remain ASCII.
:
:>
:> You indicate that LATIN-1 already is supported in the RIPE database, so
:> I imagine you and the team already deliberated on the pro's and con's of
:> UTF-8 vs LATIN-1; and as such concluded with this particular
:> recommendation. I just wanted to make sure to raise these questions. :-)
:>
:
:We can switch to UTF-8, this proposal allows more characters in those attributes without needing to change the database character set.
:

I think it would be best if we migrated the entire database to UTF-8 first, upgrading all LATIN-1 attributes to UTF-8 at that time, then change attributes to allow UTF-8 or keep them as ASCII.

I'd like to avoid adding more LATIN-1 if we can avoid it.
-- 43rd Law of Computing: Anything that can go wr fortune: Segmentation violation -- Core dumped
Hi Peter,
On 24 Nov 2023, at 10:57, Peter Hessler via db-wg <db-wg@ripe.net> wrote:
: ... :We can switch to UTF-8, this proposal allows more characters in those attributes without needing to change the database character set. :
I think it would be best if we migrated the entire database to UTF-8 first, upgrading all LATIN-1 attributes to UTF-8 at that time, then change attributes to allow UTF-8 or keep them as ASCII.
We can migrate internally to UTF-8 while keeping the current syntax rules, so nothing changes for the user yet (i.e. the database changes but not the interfaces). Once we use UTF-8 internally we can start to support non-latin-1 characters in attributes and convert interfaces to use UTF-8, as decided by the community.
I'd like to avoid adding more LATIN-1 if we can avoid it.
We can allow more characters in names regardless of the character set, this can be done without needing to wait for a UTF-8 migration. Regards Ed Shryane RIPE NCC
Hi, On Fri, 24 Nov 2023 at 03:02, Edward Shryane via db-wg <db-wg@ripe.net> wrote: [...]
We can migrate internally to UTF-8 while keeping the current syntax rules, so nothing changes for the user yet (i.e. the database changes but not the interfaces).
Once we use UTF-8 internally we can start to support non-latin-1 characters in attributes and convert interfaces to use UTF-8, as decided by the community.
This seems a reasonable approach. It would be good to move to a place where registrants can record their actual legal name in the database and have that displayed to other users. Kind regards, Leo
Hi Ed,
We can migrate internally to UTF-8 while keeping the current syntax rules, so nothing changes for the user yet (i.e. the database changes but not the interfaces).
Once we use UTF-8 internally we can start to support non-latin-1 characters in attributes and convert interfaces to use UTF-8, as decided by the community.
I think this is a good idea. It's definitely a good first step in the right direction. Cheers, Sander
Hi,

I like this proposal - people should be able to include their names accurately in the database. I know there’s an RFC that says everything should be ASCII, but I don’t think many implementations have followed that in the last decade.

On 24 Nov 2023, at 10:21, Job Snijders via db-wg <db-wg@ripe.net> wrote:
Have the effects of LATIN-1 on downstream applications such as NRTM v3 and NRTM v4 been considered?
As far as I know, NRTMv3 has no defined encoding, but, speaking from memory, IRRDv4 does a best effort to decode it as UTF-8. There are encoding errors as a result, but as they occur in few fields that have loose syntax anyways, the impact is small. RIPE already limits personal data anyways, so not sure how much of this would be included. NRTMv4 is explicitly UTF-8, so a LATIN-1 database has to transcode (if that’s the term). I haven’t checked whether the current RIPE db implementation does this. Sasha
Hi Sasha,
On 25 Nov 2023, at 13:14, Sasha Romijn <sasha@reliablycoded.nl> wrote:
NRTMv4 is explicitly UTF-8, so a LATIN-1 database has to transcode (if that’s the term). I haven’t checked whether the current RIPE db implementation does this.
There are not many non-ASCII characters in the snapshot as "descr:" and "remarks" attributes are dummified. I found an example of an a-umlaut (ä) in the PGPKEY-AC7C8A10 object which is latin-1 in the database, and correctly encoded in the NRTMv4 snapshot as UTF-8 bytes 0xc3a4. Regards Ed Shryane RIPE NCC
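
The byte values mentioned above can be reproduced with a short Python sketch. It only illustrates the Latin-1 to UTF-8 transcoding step in general; the function name is made up for this example and it is not taken from the Whois or NRTMv4 code.

```python
def latin1_to_utf8(raw):
    """Transcode Latin-1 encoded bytes to UTF-8 encoded bytes."""
    # Every byte value is a valid Latin-1 character, so decode cannot fail.
    return raw.decode("latin-1").encode("utf-8")


# a-umlaut is the single byte 0xe4 in Latin-1 ...
stored = "ä".encode("latin-1")
# ... and the two bytes 0xc3 0xa4 in UTF-8.
published = latin1_to_utf8(stored)
print(stored.hex(), published.hex())
```

Pure ASCII text is unchanged by this step, which is why the conversion is invisible for most objects.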
On Fri, Nov 24, 2023 at 10:03:15AM +0100, Edward Shryane via db-wg wrote: Dear Ed, DB-WG Members,
Currently the RIPE database only allows a subset of ASCII characters in the "org-name:", "person:" and "role:" attributes, for a few reasons including:
* These attributes are also a look-up key and the Whois protocol does not allow specifying character sets in queries. * RPSL names are ASCII according to RFC2622 * Using a normalised name makes the object easier to query * Reading a normalised name is easier to interpret
I beg to differ with the last one.
However there are some drawbacks to forcing names to only use a subset of ASCII characters:
* Organisations, roles and persons cannot use their actual name if it includes characters outside this subset. * Normalisation is not standard, but is an interpretation done by each maintainer, e.g. characters could be excluded or converted in different ways.
I too would like to see these two points addressed.
Since we support the Latin-1 character set in the RIPE database, I propose we also allow non-ASCII Latin-1 characters in these attributes.
Querying for a name can be done either using the latin-1 characters (proposed) or a normalised, ASCII representation (currently). The normalised version will be generated by Whois and stored in a database index for querying. The primary key will also be generated from the normalised version.
Please let me know your feedback.
When in Rome, do as the Romans do ... or get back to UTF-8 topic ;-)

Time flies fast, and it turns out that it was exactly 13 years ago, in Rome, during RIPE61, when I gave the presentation about non-ASCII characters in DB. One can find the archives here: https://ripe61.ripe.net/programme/meeting-plan/database-agenda/

Five years later I formally proposed to allow UTF-8 in all free text attributes of all DB objects except in primary keys (https://www.ripe.net/ripe/mail/archives/db-wg/2015-April/004516.html).

This topic has also been discussed in a very good RIPE Labs article by Ed (https://labs.ripe.net/author/ed_shryane/impact-analysis-for-utf-8-in-the-rip...).

Yet, we still discuss Latin-1 instead of UTF-8. And I'm puzzled why. :-)

I would like to see the topic addressed in a more general way, solving the issue(s) for the entire service region instead of adding characters used by a few Western countries only. Considering this, I would like to ask you to switch to UTF-8 instead of using Latin-1.

Thanks for looking into this.

Best,
Piotr

--
Piotr Strzyżewski
On Sun, Nov 26, 2023 at 01:16:22PM +0100, Piotr Strzyzewski wrote:
On Fri, Nov 24, 2023 at 10:03:15AM +0100, Edward Shryane via db-wg wrote:
Dear Ed, DB-WG Members,
I would like to see the topic addressed in more general way, solving the issue(s) for the entire service region instead of adding characters used by few Western countries only. Considering this, I would like to ask you to switch to UTF-8 instead of using Latin-1.
As the UTF-8 topic was briefly discussed during DB-WG session at RIPE87 in Rome, I would like to propose moving forward with it. If that means a topic for first (?) interim meeting, let it be. Let me know please if this works for you. Thanks in advance. Best, Piotr -- Piotr Strzyżewski
On Dec 03, Piotr Strzyzewski via db-wg <db-wg@ripe.net> wrote:
As the UTF-8 topic was briefly discussed during DB-WG session at RIPE87 in Rome, I would like to propose moving forward with it. If that means a topic for first (?) interim meeting, let it be. Let me know please if this works for you. Thanks in advance.

In Rome I talked a bit with Edward about this.
Background: I am the author of the whois client used by all Linux distributions.

I fully agree that switching to UTF-8 is desirable, but we cannot just change the encoding of port 43 without major side effects.

Since version 5.5.4 (December 2019), the client assumes that the output of whois.ripe.net is Latin 1 and then transcodes it to the system encoding. Receiving unexpected UTF-8 would cause mojibake.

My suggestion is to add a new query "command line" option to specify the desired encoding (limiting it to either ISO-8859-1 or UTF-8), as supported by other whois servers. -C is the most common choice, but maybe it would be better to use --charset to not waste a single letter option. See https://github.com/rfc1036/whois/blob/next/servers_charset_list .

In a few years then it will be much easier to switch the default from Latin 1 to UTF-8.

--
ciao,
Marco
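
The mojibake problem described above is easy to reproduce. The following Python sketch is not the whois client's code; it is a hypothetical stand-in for any client that decodes the server reply with a fixed, assumed character set.

```python
def decode_reply(raw, assumed_charset):
    """Decode a server reply the way a client with a fixed charset would."""
    return raw.decode(assumed_charset)


# Suppose the server silently switched to UTF-8 on port 43.
reply = "ä".encode("utf-8")

# A client that still assumes Latin-1 shows two wrong characters.
garbled = decode_reply(reply, "latin-1")
# A client that was told the real charset shows the right one.
correct = decode_reply(reply, "utf-8")
print(garbled, correct)
```

Because every byte sequence is valid Latin-1, the wrong decode raises no error, so the client cannot detect the mismatch by itself. That is the argument for an explicit charset option rather than a silent change of default.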
Dear colleagues, Based on the discussion regarding UTF-8 in the RIPE database during the interim meeting yesterday, I suggest that we implement support for UTF-8 in the database (i.e. convert the schema and add a flag to allow a client to choose a character set), but we do not allow additional characters for now, pending further DB-WG discussion. Our intention is to lay the groundwork for future support, without breaking existing functionality. If you have any concerns or objections please let me know. We will now prepare an implementation plan / impact analysis of these changes. Regards Ed Shryane RIPE NCC
That sounds like a perfect migration plan. Thanks!

On 2024 Jan 18 (Thu) at 10:34:21 +0100 (+0100), Edward Shryane via db-wg wrote:
:Dear colleagues,
:
:Based on the discussion regarding UTF-8 in the RIPE database during the interim meeting yesterday, I suggest that we implement support for UTF-8 in the database (i.e. convert the schema and add a flag to allow a client to choose a character set), but we do not allow additional characters for now, pending further DB-WG discussion. Our intention is to lay the groundwork for future support, without breaking existing functionality. If you have any concerns or objections please let me know.
:
:We will now prepare an implementation plan / impact analysis of these changes.

--
Chemicals, n.: Noxious substances from which modern foods are made.
Dear colleagues,

To follow up on the UTF-8 discussion in January, the DB team plans to implement support for UTF-8 in 3 phases:

(1) Add a flag to allow a client to choose a character set

In the Whois release 1.112, we have added the "-Z / --charset" query flag to allow clients to specify which character set they expect. The server response will encode RPSL objects using that character set.

This new flag can already be tested in the RC environment, e.g. the SHRYANE-MNT object contains "remarks:" attributes with non-ASCII (but still latin-1) characters:

$ whois -h whois-rc.ripe.net -r shryane-mnt
$ whois -h whois-rc.ripe.net -r -Z utf8 shryane-mnt

This flag has no impact on the default behaviour of the RIPE database. This change only affects port 43, and the default character set remains latin-1.

This flag will already be useful, for example, to capture responses as UTF-8 to file or use UTF-8 encoding in your terminal. In future, if the default on port 43 changes to UTF-8, then clients can keep latin-1 by using "-Z/--charset latin1".

(2) Convert the database schema to UTF-8

In the following Whois release, the DB team plans to switch the RIPE database schema character set from latin-1 to UTF-8. This will allow Whois to store UTF-8 strings in the database index tables.

Switching the database schema character set will involve about 1 hour of downtime to Whois updates, and Whois queries will not be affected. We will announce this change in advance.

This change will have no impact on the default behaviour of the RIPE database. All interfaces will behave as before, and RPSL objects will remain latin-1 encoded internally.

(3) Allow UTF-8 to be used in RPSL objects

Once the RIPE database schema supports the UTF-8 character set, the DB team will create a further Whois release that will allow UTF-8 to be used in RPSL objects, in addition to the index tables.

The default behaviour of the RIPE database will remain the same. All interfaces will behave as before, but RPSL objects will use UTF-8 internally. In future, if the DB-WG decides to allow UTF-8 characters in RPSL, the database will already support it.

Regards
Ed Shryane
RIPE NCC
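
For anyone scripting against port 43 rather than using the whois command, phase (1) could be exercised roughly as in the Python sketch below. The host name, the "-r" flag, the "-Z" flag and the charset values "utf8" and "latin1" are taken from the message above; the function names and the mapping to Python codec names are assumptions made for this example, and the sketch has not been checked against the RC server.

```python
import socket

# Assumed mapping from the flag values named in the announcement to
# Python codec names.
CODECS = {"utf8": "utf-8", "latin1": "latin-1"}


def build_query(key, charset=None):
    """Build the query line, adding -Z only when a charset is requested."""
    flags = ["-r"]
    if charset is not None:
        flags += ["-Z", charset]
    # Whois queries are a single line terminated by CRLF.
    return " ".join(flags + [key]) + "\r\n"


def decode_response(raw, charset=None):
    """Decode the reply; without -Z the announced default is latin-1."""
    return raw.decode(CODECS[charset or "latin1"])


def whois(key, charset=None, host="whois-rc.ripe.net", port=43):
    """Send one query and return the decoded reply (needs network access)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_query(key, charset).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return decode_response(b"".join(chunks), charset)
```

Calling whois("shryane-mnt", "utf8") would then correspond to the second command line shown in phase (1). Keeping the requested charset and the decoding codec in one place avoids the mismatch that leads to mojibake.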
On 18 Jan 2024, at 10:34, Edward Shryane <eshryane@ripe.net> wrote:
Dear colleagues,
Based on the discussion regarding UTF-8 in the RIPE database during the interim meeting yesterday, I suggest that we implement support for UTF-8 in the database (i.e. convert the schema and add a flag to allow a client to choose a character set), but we do not allow additional characters for now, pending further DB-WG discussion. Our intention is to lay the groundwork for future support, without breaking existing functionality. If you have any concerns or objections please let me know.
We will now prepare an implementation plan / impact analysis of these changes.
Regards Ed Shryane RIPE NCC
On 24 Nov 2023, at 10:03, Edward Shryane via db-wg <db-wg@ripe.net> wrote:
Dear colleagues,
Currently the RIPE database only allows a subset of ASCII characters in the "org-name:", "person:" and "role:" attributes, for a few reasons including:
* These attributes are also a look-up key and the Whois protocol does not allow specifying character sets in queries. * RPSL names are ASCII according to RFC2622 * Using a normalised name makes the object easier to query * Reading a normalised name is easier to interpret
However there are some drawbacks to forcing names to only use a subset of ASCII characters:
* Organisations, roles and persons cannot use their actual name if it includes characters outside this subset. * Normalisation is not standard, but is an interpretation done by each maintainer, e.g. characters could be excluded or converted in different ways.
Since we support the Latin-1 character set in the RIPE database, I propose we also allow non-ASCII Latin-1 characters in these attributes.
Querying for a name can be done either using the latin-1 characters (proposed) or a normalised, ASCII representation (currently). The normalised version will be generated by Whois and stored in a database index for querying. The primary key will also be generated from the normalised version.
Please let me know your feedback.
Regards Ed Shryane RIPE NCC
---
Whois attribute verbose description (copied from the help text).
org-name
--------
Specifies the name of the organisation that this organisation object represents in the RIPE Database. This is an ASCII-only text attribute. The restriction is because this attribute is a look-up key and the whois protocol does not allow specifying character sets in queries. The user can put the name of the organisation in non-ASCII character sets in the "descr:" attribute if required.

A list of 1 to 30 words separated by white space. A word is made up of ASCII alphanumeric characters and additionally: ][)(._"*@,&:!'`+/-
A word may have up to 64 characters and is not case sensitive. Each word can have any combination of the above characters with no restriction on the start or end of a word.

person
------
Specifies the full name of an administrative, technical or zone contact person for other objects in the database.

It should contain 2 to 10 words. A word is made up of ASCII alphanumeric characters and additionally: .`'_-
The first word should begin with a letter. At least one other word should also begin with a letter. Max 64 characters can be used in each word.

role
----
Specifies the full name of a role entity, e.g. RIPE DBM.

A list of 1 to 30 words separated by white space. A word is made up of ASCII alphanumeric characters and additionally: ][)(._"*@,&:!'`+/-
A word may have up to 64 characters and is not case sensitive. Each word can have any combination of the above characters with no restriction on the start or end of a word.
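The syntax rules in the help text above can be expressed as a small validator. This is only a sketch of the rules as written in the help text; the function names and regular expressions are made up for this example and are not taken from the Whois source code.

```python
import re

# One word of "org-name:" or "role:": ASCII alphanumerics plus ][)(._"*@,&:!'`+/-
ORG_ROLE_WORD = re.compile(r"""[A-Za-z0-9\]\[)(._"*@,&:!'`+/-]{1,64}""")
# One word of "person:": ASCII alphanumerics plus .`'_-
PERSON_WORD = re.compile(r"[A-Za-z0-9.`'_-]{1,64}")


def is_valid_org_or_role(name: str) -> bool:
    """1 to 30 words, each up to 64 allowed characters."""
    words = name.split()
    return 1 <= len(words) <= 30 and all(
        ORG_ROLE_WORD.fullmatch(w) is not None for w in words
    )


def is_valid_person(name: str) -> bool:
    """2 to 10 words; the first and at least one other begin with a letter."""
    words = name.split()
    if not 2 <= len(words) <= 10:
        return False
    if not all(PERSON_WORD.fullmatch(w) is not None for w in words):
        return False
    starts_with_letter = [w[0].isascii() and w[0].isalpha() for w in words]
    return starts_with_letter[0] and any(starts_with_letter[1:])
```

Under these rules a name such as "Café" is rejected, which is the limitation the proposal wants to lift.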
--
To unsubscribe from this mailing list, get a password reminder, or change your subscription options, please visit: https://lists.ripe.net/mailman/listinfo/db-wg
Dear colleagues,

There was a question about UTF-8 support by major Whois providers during last week's DB-WG session at RIPE88.

During the UTF-8 discussion in December I checked the other RIRs as follows:

LACNIC: only Latin-1 encoded characters are accepted in updates (UTF-8 is ignored) but UTF-8 is returned on port 43.
Example: whois -h whois.lacnic.net PAP12

APNIC: only Latin-1 is returned
Example: whois -h testwhois.apnic.net YYYYMMDD-MNT

Subsequently I tested the other RIRs to be sure:

ARIN: UTF-8 is supported in the RPSL object and UTF-8 is returned on port 43.
Example: whois -h whois.arin.net POC SHRYA12-ARIN

AFRINIC: UTF-8 characters are accepted in updates and UTF-8 is returned on port 43.
Example: whois -h whois.afrinic.net SHRYANE-MNT

RIPE stores Latin-1 and returns Latin-1 on port 43.

So in summary, 3 RIRs return UTF-8 and 2 RIRs return Latin-1 on port 43.

Regards
Ed Shryane
RIPE NCC
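Because the RIRs differ in what they return on port 43, a client that talks to several of them has to guess the encoding of each response. A minimal Python sketch of one common heuristic follows; the function name is made up for this example, and the approach is an assumption about how a client might cope, not something any RIR prescribes.

```python
def decode_whois_response(raw: bytes) -> tuple[str, str]:
    """Decode a port 43 response, guessing between UTF-8 and Latin-1.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        # Latin-1 text containing non-ASCII bytes is rarely valid UTF-8,
        # so a successful strict decode is a strong hint.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid Latin-1, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"
```

The guess is not reliable in every case: pure ASCII decodes identically either way, and a few Latin-1 byte sequences happen to be valid UTF-8. An explicit character set selection by the client avoids the guess altogether.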
On 2 May 2024, at 16:02, Edward Shryane <eshryane@ripe.net> wrote:
Dear colleagues,
To follow up on the UTF-8 discussion in January, the DB team plans to implement support for UTF-8 in 3 phases:
(1) Add a flag to allow a client to choose a character set
In the Whois release 1.112, we have added the "-Z / --charset" query flag to allow clients to specify which character set they expect. The server response will encode RPSL objects using that character set.
This new flag can already be tested in the RC environment, e.g. the SHRYANE-MNT object contains "remarks:" attributes with non-ASCII (but still latin-1) characters:
$ whois -h whois-rc.ripe.net -r shryane-mnt
$ whois -h whois-rc.ripe.net -r -Z utf8 shryane-mnt
This flag has no impact on the default behaviour of the RIPE database. This change only affects port 43, and the default character set remains latin-1.
This flag is already useful, for example to capture responses as UTF-8 to a file or to use UTF-8 encoding in your terminal. In future, if the default on port 43 changes to UTF-8, then clients can keep latin-1 by using "-Z/--charset latin1".
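For clients that do not use the whois command, the same query can be sent over a plain TCP connection to port 43. The Python sketch below mirrors the two commands shown above. The function names and the timeout are assumptions for this example, and only the "utf8" and "latin1" charset values mentioned in this message are assumed to work.

```python
import socket
from typing import Optional


def build_query(key: str, charset: Optional[str] = None) -> bytes:
    """Build the query line, e.g. b"-r -Z utf8 shryane-mnt\r\n"."""
    flags = ["-r"]
    if charset is not None:
        flags += ["-Z", charset]
    # A Whois query is a single line terminated by CRLF.
    return (" ".join(flags + [key]) + "\r\n").encode("ascii")


def query_whois(key: str, charset: Optional[str] = None,
                host: str = "whois-rc.ripe.net", port: int = 43,
                timeout: float = 10.0) -> str:
    """Send one query and return the decoded response (hypothetical helper)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_query(key, charset))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    # "utf8" and "latin1" are also valid Python codec aliases; without the
    # flag the server default (latin-1) applies.
    return b"".join(chunks).decode(charset or "latin-1")
```

For example, query_whois("shryane-mnt", charset="utf8") corresponds to the second command above.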
(2) Convert the database schema to UTF-8
In the following Whois release, the DB team plans to switch the RIPE database schema character set from latin-1 to UTF-8. This will allow Whois to store UTF-8 strings in the database index tables.
Switching the database schema character set will involve about 1 hour of downtime for Whois updates; Whois queries will not be affected. We will announce this change in advance.
This change will have no impact on the default behaviour of the RIPE database. All interfaces will behave as before, and RPSL objects will remain latin-1 encoded internally.
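One reason a switch from latin-1 to UTF-8 can leave existing content unchanged is that every Latin-1 character also has a UTF-8 encoding, so transcoding loses nothing. The Python sketch below illustrates only that property; the helper name and sample text are made up, and it does not describe how the DB team will perform the schema conversion.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Transcode Latin-1 bytes to UTF-8 bytes (illustrative helper)."""
    return data.decode("latin-1").encode("utf-8")


# Hypothetical attribute value containing two non-ASCII Latin-1 characters.
sample = "remarks: Zürich café".encode("latin-1")
converted = latin1_to_utf8(sample)

# Converting back yields the original bytes, so the round trip is lossless.
round_trip = converted.decode("utf-8").encode("latin-1")
```

Each non-ASCII Latin-1 character takes two bytes in UTF-8 instead of one, so the converted value here is two bytes longer than the original.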
(3) Allow UTF-8 to be used in RPSL objects
Once the RIPE database schema supports the UTF-8 character set, the DB team will create a further Whois release that will allow UTF-8 to be used in RPSL objects, in addition to the index tables.
The default behaviour of the RIPE database will remain the same. All interfaces will behave as before, but RPSL objects will use UTF-8 internally.
In future, if the DB-WG decides to allow UTF-8 characters in RPSL, the database will already support it.
Regards
Ed Shryane
RIPE NCC
On 18 Jan 2024, at 10:34, Edward Shryane <eshryane@ripe.net> wrote:
Dear colleagues,
Based on the discussion regarding UTF-8 in the RIPE database during the interim meeting yesterday, I suggest that we implement support for UTF-8 in the database (i.e. convert the schema and add a flag to allow a client to choose a character set), but we do not allow additional characters for now, pending further DB-WG discussion. Our intention is to lay the groundwork for future support, without breaking existing functionality. If you have any concerns or objections please let me know.
We will now prepare an implementation plan / impact analysis of these changes.
Regards
Ed Shryane
RIPE NCC
On 24 Nov 2023, at 10:03, Edward Shryane via db-wg <db-wg@ripe.net> wrote:
[...]
Dear colleagues,

It was pointed out that the ARIN example:

whois -h whois.arin.net POC SHRYA12-ARIN

is not correct, and should read:

whois -h whois.arin.net "p SHRYA12-ARIN"

(I used "POC" instead of "p" and that could either cause "POC" to be additionally returned, or no objects at all, depending on your whois client).

Apologies,
Ed Shryane
RIPE NCC
On 28 May 2024, at 11:27, Edward Shryane <eshryane@ripe.net> wrote:
[...]
participants (9)
- Edward Shryane
- Job Snijders
- Leo Vegoda
- Marco d'Itri
- Nick Hilliard
- Peter Hessler
- Piotr Strzyzewski
- Sander Steffann
- Sasha Romijn