Summary: Currently the whois database accepts "non-breaking-space" characters in updates. We propose that whois replaces these "non-breaking-space" characters with regular spaces before storing the object.

From my point of view the whois service currently has to be considered broken, and the problem should be considered from a somewhat broader perspective than the treatment of a single element of the character set.
Context: Currently the whois database accepts updates with "non-breaking-space" characters as part of attribute values. This behaviour is fairly acceptable, since the "non-breaking-space" character is part of the "latin1" character set, which is supported by whois. The whois database treats these "non-breaking-space" characters as regular spaces and considers the object to be syntactically correct. However, the object is stored exactly as it was received from the client, including non-breaking spaces.
Problem: We think most of these "non-breaking spaces" were intended to be regular spaces, but ended up mangled due to copy and paste. So, when such an object is queried, the original object (including non-breaking spaces) is returned. This is inconvenient, since most clients cannot handle this "exceptional" character. In many clients it will end up as something like a "?" character.

The actual problem is one of interoperability, and in Internet tradition we have to take it more seriously than as an issue of mere convenience! [Do we need to discuss the value of that interoperability tradition?]
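To make the failure mode concrete, here is a minimal sketch of how a tool that only knows ASCII white space might mis-tokenize such a line. The sample line, the AS numbers and the helper name `members_ascii_tool` are made up for illustration; this is not taken from any real RPSL tool.

```python
import re

# Hypothetical response line; byte 0xA0 is the non-breaking space in Latin-1.
raw = b"members: AS64496,\xa0AS64497"


def members_ascii_tool(line: bytes) -> list:
    """Split an attribute value the way a strictly ASCII-minded tool might."""
    value = line.decode("latin-1").partition(":")[2]
    # Only ASCII space, tab and comma are treated as separators here,
    # so the non-breaking space stays glued to the following token.
    return [tok for tok in re.split(r"[ \t,]+", value) if tok]


# What a 7 bit client sees when it replaces bytes it cannot decode.
seen_by_ascii_client = raw.decode("ascii", errors="replace")
```

Under these assumptions the second token returned by `members_ascii_tool(raw)` still carries the non-breaking space in front of the AS number, so a strict parser would not recognise it as an AS number.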
Strictly speaking, "whois" should be understood as the TCP port 43 database query service; my primary concern right now is applying the guideline "... be conservative in what you send" to the operation of exactly this service of the RIPE database, to ensure interoperability. Tools using the RIPE whois for routing registry queries do expect the responses to contain RPSL objects; the RPSL syntax definition (RFC 2622) and the traditional implementation (see BISON/FLEX in RFC 2622) did not explicitly go beyond the 7 bit ASCII character set. It seems that we failed to address some small details when the character set used in the RIPE database was extended. Well, maybe it was just my stupid self that missed an explicit discussion and resolution of those details... (specific pointers from those in the know are welcome!)

ACTUAL OPERATIONAL PROBLEM REPORT: We were hit by a real operational problem: our traditional RPSL tools querying the RIPE-whois-Server for some routing registry evaluation failed because unexpected characters occurred in semantically significant attributes (members:) of a returned RPSL object (as-set:).
Alternative solutions:
-1- Let the whois software accept the update, but convert "non-breaking spaces" into regular spaces before storing.
-2- Consider requests containing "non-breaking-space" characters as syntactically incorrect, and do not accept updates containing them.
-3- Keep current solution: You get what you asked for.

With the current service implementation I do NOT get what I am asking for/expecting: I am asking for at least syntactically correct RPSL objects, and not some byte string (blob?!) that someone stored for a certain key! Does anybody disagree with the requirement that responses from the whois service have to conform to a well defined syntax and that interoperability considerations apply?
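As a rough illustration of how alternatives 1 and 2 differ at update time, consider the sketch below. The function names and the `UpdateRejected` exception are hypothetical and say nothing about how the actual whois software is structured.

```python
NBSP = "\u00a0"  # non-breaking space, byte 0xA0 in Latin-1


class UpdateRejected(Exception):
    """Hypothetical error raised when an update is refused (alternative 2)."""


def replace_nbsp(object_text: str) -> str:
    """Alternative 1: accept the update, but store regular spaces."""
    return object_text.replace(NBSP, " ")


def reject_nbsp(object_text: str) -> str:
    """Alternative 2: treat any non-breaking space as a syntax error."""
    for lineno, line in enumerate(object_text.split("\n"), start=1):
        if NBSP in line:
            raise UpdateRejected(
                f"line {lineno}: non-breaking space is not allowed"
            )
    return object_text
```

Alternative 1 silently repairs the object, while alternative 2 pushes the problem back to the submitter; either way nothing containing a non-breaking space reaches storage.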
Some additional statistics that can be used to understand the size of the problem: Approximately 3.000 objects (out of 8.000.000) contain "non-breaking-space" characters. Mostly in "remarks", "description" and "address" attributes?
Now the question remains: what version of RPSL syntax is enforced on the whois service response? I do expect something that is sufficiently conservative/strict to ensure interoperability for clients.

I observe: there are 3 different stages in the data flow where you can "manipulate"; the above list of alternatives certainly does NOT consider the third:
(a) on input, decide what to accept (potentially different per input interface) and how to normalize
(b) representation for storage
(c) what presentation you use for output (potentially different per output interface and per output options)
(I'd agree with Wilfried in that at least one transparent path through the system is desirable, and mappings should be minimized.)

Right now I'm not so much interested in what you are doing in (a) and (b); most reasonable things you can do there would allow (c) to present RPSL objects for whois conformant to a very conservative and restricted syntax.

LOOKING CLOSER at RPSL syntax: It seems to me that attributes with value syntax different from RFC 2622 <free-form> (p. 6) can be normalized to strict 7 bit ASCII with no loss of semantics (and in fact that normalization does not need to change anything but white space!). It seems to me that all significant use cases for extended character sets are in attribute values that were defined using <free-form>; further, it seems likely that existing RPSL tool implementations actually use a very liberal interpretation of "a sequence of ASCII characters" (i.e. more or less "a sequence of characters").

So I propose to define, and implement accordingly: by default the RIPE whois server will return RPSL objects with character set normalization enforced
- attributes that have values with syntax not using <free-form> will be represented using only ASCII (7 bit)
- attributes using <free-form> syntax can use the extended character set (I'm not sure whether Latin-1 or UTF-8 or something else...)
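A minimal sketch of what such output-side normalization (stage (c)) could look like follows. The set `FREE_FORM` is only an illustrative guess at which attributes use <free-form> syntax, the function names are made up, and raising an error for non-white-space extended characters is my own assumption about what a server might do in that case.

```python
# Assumption: illustrative subset of attributes with <free-form> values.
FREE_FORM = frozenset({"remarks", "descr", "address"})


def normalize_value(text: str) -> str:
    """Map extended white space to a plain space; refuse anything else."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        elif ch.isspace():
            out.append(" ")  # e.g. the Latin-1 non-breaking space
        else:
            raise ValueError(
                f"non-ASCII character {ch!r} in non-free-form attribute"
            )
    return "".join(out)


def normalize_object(object_text: str) -> str:
    """Return the object with non-free-form attributes in 7 bit ASCII."""
    result = []
    current = None  # name of the attribute the current line belongs to
    for line in object_text.split("\n"):
        is_continuation = line[:1] in (" ", "\t", "+") and current is not None
        if not is_continuation:
            current = line.partition(":")[0].strip().lower()
        if current in FREE_FORM:
            result.append(line)  # extended characters allowed, left as stored
        else:
            result.append(normalize_value(line))
    return "\n".join(result)
```

In this sketch storage is left untouched; only the presentation for the port 43 response is normalized, and free-form attributes pass through unchanged.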
(Potentially the alternatives 1 or 2 might be valid implementations of my proposal, and would be appreciated as quick fixes for a current problem!)

Thanks for these statistics, though a closer look would be helpful.
- How many objects use characters outside of strict ASCII in "RPSL semantically significant attributes", i.e. attributes that use non-<free-form> syntax?
- Are all of those "extended characters" to be considered "white space"? (And specifically, is there any other extended character to be considered white space beyond "non-breaking space"?)
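Both questions could be answered by a simple scan over a database dump, roughly as sketched below. The function `survey`, the `FREE_FORM` set and the assumption that objects are available as decoded text are all hypothetical; they only illustrate the kind of count I am asking for.

```python
from collections import Counter

# Assumption: illustrative subset of attributes with <free-form> values.
FREE_FORM = frozenset({"remarks", "descr", "address"})


def survey(objects):
    """Count objects with non-ASCII characters outside free-form attributes.

    Returns the number of affected objects and a Counter of the
    extended characters that were found in them.
    """
    affected = 0
    chars = Counter()
    for obj in objects:
        current = None  # attribute the current line belongs to
        hit = False
        for line in obj.split("\n"):
            is_continuation = (
                line[:1] in (" ", "\t", "+") and current is not None
            )
            if not is_continuation:
                current = line.partition(":")[0].strip().lower()
            if current in FREE_FORM:
                continue
            extended = [ch for ch in line if not ch.isascii()]
            if extended:
                hit = True
                chars.update(extended)
        if hit:
            affected += 1
    return affected, chars
```

Checking `all(ch.isspace() for ch in chars)` on the result would then tell whether every extended character found in semantically significant attributes is some kind of white space.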
Marc Grol
member of RIPE's whois database team
mgrol@ripe.net
+31648928856
Ruediger Volk
Deutsche Telekom AG -- Internet Backbone Engineering
E-Mail: rv@NIC.DTAG.DE