Request for review of Turtle (an RDF serialization) media type: text/turtle

W3C is about to publish a Team Submission for the RDF serialization Turtle. A mockup of the document to be published is at http://www.w3.org/2007/11/21-turtle

Because the document will include the text of the media type registration, I am vetting this registration with ietf-types before publishing the document. Some discussion about the claim to force UTF-8 encoding (and not require that in a charset parameter) can be seen at http://lists.w3.org/Archives/Public/www-archive/2007Dec/ (Subject: Media types for RDF languages N3 and Turtle)

I got moderator-actioned for having too many folks in the Cc so I'm Bcc'ing them all in this request for review: "Sean B. Palmer" <sean@miscoranda.com>, Tim Berners-Lee <timbl@w3.org>, "Daniel W. Connolly" <connolly@w3.org>, Dave Beckett <dave@dajobe.org>, Lee Feigenbaum <lee@thefigtrees.net>, Garret Wilson <garret@globalmentor.com>, Graham Klyne <GK@ninebynine.org>, Dan Brickley <danbri@danbri.org>

Type name: text

Subtype name: turtle

Required parameters: None

Optional parameters: None

Encoding considerations: The syntax of Turtle is expressed over code points in Unicode [UNICODE]. The encoding is always UTF-8 [RFC3629]; the charset parameter is not needed, though it may be included so long as the value is 'UTF-8'. Unicode code points may also be expressed using a \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards), where X is a hexadecimal digit [0-9A-F].

Security considerations: Turtle uses IRIs as term identifiers. Applications interpreting data expressed in Turtle should address the security issues of Internationalized Resource Identifiers (IRIs) [RFC3987] Section 8, as well as Uniform Resource Identifier (URI): Generic Syntax [RFC3986] Section 7. Multiple IRIs may have the same appearance. Characters in different scripts may look similar (a Cyrillic "o" may appear similar to a Latin "o"). A character followed by combining characters may have the same visual representation as another character (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT has the same visual representation as LATIN SMALL LETTER E WITH ACUTE). Any person or application that is writing or interpreting data in Turtle must take care to use the IRI that matches the intended semantics, and avoid IRIs that may look similar. Further information about matching of similar characters can be found in Unicode Security Considerations [UNISEC] and Internationalized Resource Identifiers (IRIs) [RFC3987] Section 8.

Interoperability considerations: There are no known interoperability issues.

Published specification: TBD; in the meantime, see http://www.w3.org/2007/11/21-turtle

Applications which use this media type: No widely deployed applications are known to use this media type. It may be used by some web services and clients consuming their data.

Additional information:

Magic number(s): Turtle documents may have the strings '@prefix' or '@base' (case-dependent) near the beginning of the document.

File extension(s): ".ttl"

Base URI: The Turtle '@base <IRIref>' term can change the current base URI for relative IRIrefs in the query language that are used sequentially later in the document.

Macintosh file type code(s): "TEXT"

Person & email address to contact for further information: Eric Prud'hommeaux <eric@w3.org>

Intended usage: COMMON

Restrictions on usage: None

Author/Change controller: The Turtle specification is the product of David Beckett and Tim Berners-Lee. A W3C Working Group may assume maintenance of this document; W3C reserves change control over this specification.

Normative References

[RFC3023] Murata, M., St. Laurent, S., and D. Kohn, "XML Media Types", RFC 3023, January 2001.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC 3629, November 2003.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005.
[RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource Identifiers (IRIs)", RFC 3987, January 2005.
[UNICODE] The Unicode Standard, Version 4. ISBN 0-321-18578-1, as updated from time to time by the publication of new versions. The latest version of Unicode and additional information on versions of the standard and of the Unicode Character Database is available at http://www.unicode.org/unicode/standard/versions/.
[UNISEC] Davis, M. and M. Suignard, "Unicode Security Considerations", http://www.unicode.org/reports/tr36/

-- -eric office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA mobile: +1.617.599.3509 (eric@w3.org) Feel free to forward this message to any list for any purpose other than email address distribution.
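The \uXXXX / \UXXXXXXXX numeric escapes described under "Encoding considerations" above can be sketched as a small decoder. This is an illustrative sketch, not code from the Turtle specification; the function name is made up here:

```python
import re

def decode_turtle_escapes(text):
    """Decode Turtle's \\uXXXX (U+0 to U+FFFF) and \\UXXXXXXXX
    (U+10000 onwards) numeric escapes into Unicode characters.
    Per the registration text, X is an uppercase hex digit [0-9A-F]."""
    pattern = r'\\u([0-9A-F]{4})|\\U([0-9A-F]{8})'
    return re.sub(pattern,
                  lambda m: chr(int(m.group(1) or m.group(2), 16)),
                  text)
```

For example, `decode_turtle_escapes(r'caf\u00E9')` yields "café".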

Eric Prud'hommeaux wrote:
W3C is about to publish a Team Submission for the RDF serialization Turtle. A mockup of the document to be published is at http://www.w3.org/2007/11/21-turtle
Because the document will include the text of the media type registration, I am vetting this registration with ietf-types before publishing the document. Some discussion about the claim to force utf-8 encoding (and not require that in a charset parameter) can be seen at http://lists.w3.org/Archives/Public/www-archive/2007Dec/ (Subject: Media types for RDF languages N3 and Turtle) I got moderator-actioned for having too many folks in the Cc so I'm Bcc'ing them all in this request for review: ...
1) If text/* proves to be problematic, why not use application/*?

2) Also, keep in mind that while RFC2046 may be interpreted not to mandate the ASCII default for text types other than text/plain, there's also RFC2616 saying...:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems. -- <http://tools.ietf.org/html/rfc2616#section-3.7.1>

See also the related HTTPbis issue: <http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/issues/#i20>

BR, Julian

Julian Reschke wrote:
there's also RFC2616
Yes, that's an ugly legacy exception...
<http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/issues/#i20>
...maybe 2616bis can drop this oddity in favour of a simple "unknown text is ASCII" rule. HTTP oddities shouldn't affect MIME registrations, there's no string "2616" in BCP13. Frank

Frank Ellermann wrote:
Julian Reschke wrote:
there's also RFC2616
Yes, that's an ugly legacy exception...
<http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/issues/#i20>
...maybe 2616bis can drop this oddity in favour of a simple "unknown text is ASCII" rule. HTTP oddities shouldn't affect MIME registrations, there's no string "2616" in BCP13.
Indeed. It would be nice if somebody could provide some insight why this ever made it into HTTP. Was that just an attempt to allow text/html encoded in latin1 to be served without charset parameter? BR, Julian

Julian Reschke wrote:
It would be nice if somebody could provide some insight why this ever made it into HTTP. Was that just an attempt to allow text/html encoded in latin1 to be served without charset parameter?
Some parts of this puzzle: RFC 2070 introduced an "ideally anything is Unicode" concept, later adopted by HTML 4+, XHTML 1+, and XML 1+. AFAIK HTML 3.2 and maybe also HTML 3 still didn't have this feature. As far as RFC numbers mean something 2070 was published "after" 2068, both say January 1997, and "the law" 2277 was clearly a year later. RFC 2068 (HTTP/1.1) was the successor of 1945 (HTTP/1.0, May 1996), 2070 (HTML i18n) was the successor of 1866 (HTML 2, November 1995). Tim Berners-Lee, one co-author of RFC 1866 and 1945, wrote in 1866: | NOTE - To support non-western writing systems, a larger character | repertoire will be specified in a future version of HTML. The | document character set will be [ISO-10646], or some subset that | agrees with [ISO-10646]; in particular, all numeric character | references must use code positions assigned by [ISO-10646]. Speculation, in May 1996 it made sense that HTTP/1.0 can transport HTML 2 "as is", default Latin-1, and it took Harald and Martin some months to fix this in RFC 2070 and 2277, too late for RFC 2068, and RFC 2616 simply inherited "default Latin-1" wholesale. Frank

At 03:17 07/12/19, Frank Ellermann wrote:
Julian Reschke wrote:
It would be nice if somebody could provide some insight why this ever made it into HTTP. Was that just an attempt to allow text/html encoded in latin1 to be served without charset parameter?
Yes, in some ways. The Web was started at CERN in Geneva, and at that time, iso-8859-1 seemed like a forward-looking choice, allowing coverage not only of the US but also of (most of) Western Europe. The first versions of HTTP (HTTP 0.9 or before) didn't have any version indication, didn't allow a charset parameter, and also didn't have any request or response headers. Responses were just HTML, nothing else. For a short summary, please see http://www.w3.org/Protocols/HTTP/AsImplemented.html. HTTP 0.9 was later generalized into HTTP 1.0 and HTTP 1.1 as we know them. For quite some time, there were a lot of clients out there that badly choked on charset parameters. So it wasn't that the default was an attempt to save some bytes for Latin-1, but that it was in some way necessary to be backwards-compatible with very early versions not documented as RFCs. Such backwards compatibility is no longer necessary, fortunately. The situation currently on the Web is different. The actual 'default' used by browsers isn't simply iso-8859-1; it's whatever the user has set as his/her preferred encoding, or whatever the setting of the specific language version is. This means that in essence, there is NO default. The HTTP spec clearly should be fixed to say so.
Some parts of this puzzle: RFC 2070 introduced an "ideally anything is Unicode" concept, later adopted by HTML 4+, XHTML 1+, and XML 1+. AFAIK HTML 3.2 and maybe also HTML 3 still didn't have this feature.
As far as RFC numbers mean something 2070 was published "after" 2068, both say January 1997, and "the law" 2277 was clearly a year later.
At least 2070 and 2277 were in the works for a really long time. That may also apply to 2068.
RFC 2068 (HTTP/1.1) was the successor of 1945 (HTTP/1.0, May 1996), 2070 (HTML i18n) was the successor of 1866 (HTML 2, November 1995).
Tim Berners-Lee, one co-author of RFC 1866 and 1945, wrote in 1866:
| NOTE - To support non-western writing systems, a larger character | repertoire will be specified in a future version of HTML. The | document character set will be [ISO-10646], or some subset that | agrees with [ISO-10646]; in particular, all numeric character | references must use code positions assigned by [ISO-10646].
That was put in because the HTML WG at that time already more or less understood (after a lot of discussions) that the direction to go was Unicode/ISO 10646. A lot of the work on HTML 2.0 and HTML i18n (RFC 2070) and some other pieces was going on somewhat in parallel.
Speculation, in May 1996 it made sense that HTTP/1.0 can transport HTML 2 "as is", default Latin-1, and it took Harald and Martin some months to fix this in RFC 2070 and 2277, too late for RFC 2068, and RFC 2616 simply inherited "default Latin-1" wholesale.
It wasn't just Harald and me. It was a lot more people, in particular all the coauthors of RFC 2070 and 2277. Regards, Martin. #-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University #-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

At 00:45 07/12/19, Frank Ellermann wrote:
Julian Reschke wrote:
there's also RFC2616
Yes, that's an ugly legacy exception...
<http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/issues/#i20>
...maybe 2616bis can drop this oddity in favour of a simple "unknown text is ASCII" rule.
The new version of the HTTP spec, 2616bis, should definitely drop the iso-8859-1 default, but NOT in favor of "unknown text is ASCII". It should just say that there is no default. There is a big difference between these two, especially for document formats that contain internal 'charset' information. A default of US-ASCII makes document-internal 'charset' information useless (because the external information wins). No default means that the recipient will look at the internal information.
HTTP oddities shouldn't affect MIME registrations, there's no string "2616" in BCP13.
One reason for the problems with text/xml was that the original MIME default of US-ASCII was enforced. This made it impossible to serve XML documents with internal 'charset' information only as text/xml. Regards, Martin. #-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University #-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

* Martin Dürst <duerst@it.aoyama.ac.jp> [2007-12-26 10:29+0900]
At 00:45 07/12/19, Frank Ellermann wrote:
Julian Reschke wrote:
there's also RFC2616
Yes, that's an ugly legacy exception...
<http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/issues/#i20>
...maybe 2616bis can drop this oddity in favour of a simple "unknown text is ASCII" rule.
The new version of the HTTP spec, 2616bis, should definitely drop the iso-8859-1 default, but NOT in favor of "unknown text is ASCII". It should just say that there is no default. There is a big difference between these two, especially for document formats that contain internal 'charset' information. A default of US-ASCII makes document-internal 'charset' information useless (because the external information wins). No default means that the recipient will look at the internal information.
I'm fuzzy on the logic here. It seems one of the big values of the text/* tree is that you know the browser will render it on the screen even if it doesn't know anything about the subtype. How then will it know where to look for internal charset info? (By the same token, how then will it know what the default charset is for the mystery media type?) Don't get me wrong, I'm OK with the browser making a guess and the user overriding it. In fact, I'd like to see the first guess be utf-8 if the stream appears to be valid utf-8. And if it's really Shift-JIS or utf-16, you'll get some funny-looking chars (hopefully the browser won't be sensitive to misinterpreting something as "rm -r /"). Are we on the cusp of a new HTTP/1.1 draft? May I register media types that pre-empt that a little in favor of doing the right thing?
HTTP oddities shouldn't affect MIME registrations, there's no string "2616" in BCP13.
One reason for the problems with text/xml was that the original MIME default of US-ASCII was enforced. This made it impossible to serve XML documents with internal 'charset' information only as text/xml.
Regards, Martin.
#-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University #-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
-- -eric office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA mobile: +1.617.599.3509 (eric@w3.org) Feel free to forward this message to any list for any purpose other than email address distribution.
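Eric's "first guess" heuristic above can be sketched in a few lines: treat the stream as UTF-8 if it decodes cleanly, otherwise fall back to Latin-1 (which can never fail, since every octet maps to some code point). A sketch of the idea only, not any browser's actual algorithm; the function name is invented here:

```python
def guess_text_charset(data: bytes) -> str:
    """Guess utf-8 when the octet stream is valid UTF-8;
    otherwise fall back to iso-8859-1, which accepts any octet
    (possibly yielding the 'funny looking chars' mentioned above)."""
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'iso-8859-1'
```

Note the asymmetry this exploits: valid UTF-8 is statistically unlikely to occur by accident in legacy-encoded text, so a successful decode is strong evidence.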

Martin Duerst wrote:
The new version of the HTTP spec, 2616bis, should definitely drop the iso-8859-1 default, but NOT in favor of "unknown text is ASCII". It should just say that there is no default.
A MIME entity with "default ASCII" using any 1xxx xxxx octets is erroneous. With "default ASCII" 2616bis would be consistent with MIME, that's good. We have no "unknown-7bit" charset for unidentified "ASCII compatible" encodings (for octets 0..127), and the "default ASCII" is an emulation for such dubious cases, same idea as in mail. Years later (after 2616bis) it might be possible to upgrade "default ASCII" to UTF-8, Latin-1 was a dead end. As soon as we're back to "default ASCII" just let RFC 2277 finish it off.
There is a big difference between these two, especially for document formats that contain internal 'charset' information. A default of US-ASCII makes document-internal 'charset' information useless (because the external information wins).
Right, that must not happen, IMO a "default" is an assumption if no better info is available. For HTTP it also limits what can be used in *headers* (no message/rfc822 vs. message/global abstractions necessary, HTTP isn't UTF8SMTP) The *body* contains octets, only 0..127 can be interpreted as ASCII, anything else needs an explicit declaration somewhere - "internal" would be fine for many users who can't change the "external" declaration. That's actually the same issue as it is today with an external "default Latin-1", the internal UTF-8 / KOI8-R / windows-1252 (etc.) declaration wins if there is no explicit statement from the server. Otherwise my non-ASCII Web pages won't validate, but they do.
One reason for the problems with text/xml was that the original MIME default of US-ASCII was enforced. This made it impossible to serve XML documents with internal 'charset' information only as text/xml.
The odd text/xml case is different, there's a MUST somewhere in the text/xml spec. But nobody treats text/html as "default Latin-1" ignoring the internal declaration. The W3C validator even enforces its very own UTF-8 default for HTML 2, where it really should be Latin-1; maybe we could report this as a bug :-) Frank

On Fri, 28 Dec 2007 03:26:46 +0100, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:
One reason for the problems with text/xml was that the original MIME default of US-ASCII was enforced. This made it impossible to serve XML documents with internal 'charset' information only as text/xml.
The odd text/xml case is different, there's a MUST somewhere in the text/xml spec. But nobody treats text/html as "default Latin-1" ignoring the internal declaration.
FWIW, I don't know of any software (apart from maybe the Universal Feed Parser) that doesn't treat text/xml as it treats application/xml. -- Anne van Kesteren <http://annevankesteren.nl/> <http://www.opera.com/>

On Fri, 28 Dec 2007, Frank Ellermann wrote:
Years later (after 2616bis) it might be possible to upgrade "default ASCII" to UTF-8, Latin-1 was a dead end. As soon as we're back to "default ASCII" just let RFC 2277 finish it off.
FWIW, a number of specs are already overriding both MIME and HTTP when it comes to character encodings. For example HTML4 says to not default to any encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as currently proposed defaults to an even more complicated heuristic [3], and so on. In the "real world" the implementations are following the heuristics described in CSS2.1 and HTML5 (or something close to them), and those differ for text/css and text/html, so it would seem pointless for HTTP to try to define something here: it would just get ignored. IMHO the best option is for HTTP to stay out of the discussion altogether and let the lower level specs (MIME) and the higher level specs (XML, HTML, CSS, etc, defining the formats) figure it out amongst themselves.

-- Footnotes --

[1] http://www.w3.org/TR/html4/charset.html#h-5.2.2 This text explicitly says that HTTP's default is useless. It then recommends behaviour that is even more useless, but that's another problem altogether...
[2] http://www.w3.org/TR/CSS21/syndata.html#charset
[3] http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing....

Cheers, -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'

(bcc: http-wg because this is about the ietf-types proposal for RDF types) I think it would be good to separate out the discussion of HTTP & browser behavior from the MIME registrations of RDF representations. The simple solution for Media Types for RDF languages is -- if you really want to use text/*, then always use a charset parameter, even if you'd rather not. Don't try to change MIME behavior or update the default or otherwise mess with it. You can say, using current MIME, what you want to say. Perhaps the default to US-ASCII is annoying, the content-type string is longer than you would like, but changing MIME is unnecessary. Secondly, don't use the '+' syntax, as in: text/rdf+n3, text/rdf+turtle, since it doesn't fit into the paradigm established for +xml; this was noted during the discussion of registration of +zip types. For an alternative, I imagine there are lots of possibilities. Personally, I prefer being more explicit, e.g., label N3 documents as: text/w3c.rdf.n3;charset=utf-8 and label Turtle files as: text/w3c.rdf.turtle;charset=utf-8 Larry
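Media type strings like the ones Larry suggests split cleanly into a type/subtype and parameters; the stdlib email parser handles this. A small sketch (the `text/w3c.rdf.*` names are Larry's suggestions in this thread, not registered types, and the function name is invented here):

```python
from email.message import Message

def parse_media_type(value):
    """Split a Content-Type value into (type/subtype, charset value),
    using the stdlib's MIME header parsing."""
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_content_type(), msg.get_param('charset')
```

For example, `parse_media_type('text/w3c.rdf.turtle;charset=utf-8')` yields `('text/w3c.rdf.turtle', 'utf-8')`.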

Martin Duerst wrote:
At 00:45 07/12/19, Frank Ellermann wrote:
Julian Reschke wrote:
there's also RFC2616 Yes, that's an ugly legacy exception...
<http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/issues/#i20> ...maybe 2616bis can drop this oddity in favour of a simple "unknown text is ASCII" rule.
The new version of the HTTP spec, 2616bis, should definitely drop the iso-8859-1 default, but NOT in favor of "unknown text is ASCII". It should just say that there is no default.
As far as I understand, we currently have RFC2046, RFC2616 and RFC3023 making conflicting requirements:

RFC2046: the default for text/* is US-ASCII (<http://tools.ietf.org/html/rfc2046#section-4.1.2>)

RFC2616: the default for text/* received over HTTP is ISO-8859-1 (<http://tools.ietf.org/html/rfc2616#section-3.7.1>)

RFC3023: the default for text/xml is US-ASCII, even when received over HTTP (<http://tools.ietf.org/html/rfc3023#section-3.1>)

This is a mess, and as far as I can tell, it would be good if at least HTTP would get out of it. So it seems that we need to decide on two separate questions:

1) Do we want HTTP to override RFC2046's defaults at all?

2) If we do want to continue that, what do we want to mandate?

Right now, browsers (just tested Opera/Safari/Mozilla/IE7) ignore all three RFCs for at least text/xml (they all look at the content). If our answer to 1) is "no", the content will still be broken, but at least it's not HTTP's fault anymore. Otherwise we can state "in absence of charset parameter recipient MAY do charset sniffing (BOM, XML decl, HTML meta tag, ...)", which would probably match what's actually implemented.
There is a big difference between these two, especially for document formats that contain internal 'charset' information. A default of US-ASCII makes document-internal 'charset' information useless (because the external information wins). No default means that the recipient will look at the internal information.
Yep.
HTTP oddities shouldn't affect MIME registrations, there's no string "2616" in BCP13.
One reason for the problems with text/xml was that the original MIME default of US-ASCII was enforced. This made it impossible to serve XML documents with internal 'charset' information only as text/xml.
Hmm. So why did RFC3023 then mandate US-ASCII again? In general, is it acceptable *at all* to override charset defaults defined in RFC2046? BR, Julian
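The document-internal sniffing Julian mentions (BOM, XML declaration) can be sketched briefly. This is a rough illustration of the idea only, not the RFC 3023 rules or any browser's real algorithm, and the function name is invented here:

```python
import re

def sniff_internal_charset(data: bytes):
    """Look for document-internal charset information:
    a byte-order mark first, then an XML-declaration-style
    encoding attribute within the first kilobyte."""
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if data.startswith((b'\xfe\xff', b'\xff\xfe')):
        return 'utf-16'
    m = re.search(rb'encoding=["\']([A-Za-z0-9._-]+)["\']', data[:1024])
    return m.group(1).decode('ascii') if m else None
```

Under a hard US-ASCII or ISO-8859-1 default, a recipient never reaches this step, which is exactly the objection raised above: the external default makes the internal declaration dead weight.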

* Julian Reschke <julian.reschke@gmx.de> [2007-12-18 15:24+0100]
Eric Prud'hommeaux wrote:
W3C is about to publish a Team Submission for the RDF serialization Turtle. A mockup of the document to be published is at http://www.w3.org/2007/11/21-turtle
Because the document will include the text of the media type registration, I am vetting this registration with ietf-types before publishing the document. Some discussion about the claim to force utf-8 encoding (and not require that in a charset parameter) can be seen at http://lists.w3.org/Archives/Public/www-archive/2007Dec/ (Subject: Media types for RDF languages N3 and Turtle) I got moderator-actioned for having too many folks in the Cc so I'm Bcc'ing them all in this request for review: ...
1) If text/* proves to be problematic, why not use application/*?
Turtle is a form of RDF that was designed to be specifically human-readable. It is unlikely that RDF will ever have a more texty expression. 2046 §4.1. Text Media Type ¶2: [[ Beyond plain text, there are many formats for representing what might be known as "rich text". An interesting characteristic of many such representations is that they are to some extent readable even without the software that interprets them. It is useful, then, to distinguish them, at the highest level, from such unreadable data as images, audio, or text represented in an unreadable form. In the absence of appropriate interpretation software, it is reasonable to show subtypes of "text" to the user, while it is not reasonable to do so with most nontextual data. Such formatted textual data should be represented using subtypes of "text". ]] Yes, text/* is problematic, but if we figure out what it can and can't do, then the institutional knowledge will hopefully transfer to the next poor sucker who tries to register a non-ASCII language in text/. Certainly, I'm willing to give up and fall back to application/, but I worry that no modern languages are appropriate for text/ and that we are hostage to a legacy exactly counter to the intent of the tree.
2) Also, keep in mind that while RFC2046 may be interpreted not to mandate the ASCII default for text types other than text/plain, there's also RFC2616 saying...:
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems. -- <http://tools.ietf.org/html/rfc2616#section-3.7.1>
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec3#sec3.4.1 explains the motivation of this rather painful rule; current practice of some old and broken HTTP/1.0 implementations.) This asserts extra constraints on participants in HTTP transactions above and beyond those on other agents exchanging MIME, e.g. MTAs and MUAs. If the media type for Turtle were as I wrote, 2616 would say that web servers would still have to supply the charset=UTF-8 parameter. When that deference to legacy gets obsoleted, the media type will not need updating. This reduces our choices to comparing the relative costs of:

1. failure to present a fairly intelligible form to the consumer.
2. having to include charset in HTTP for the foreseeable future.
See also the related HTTPbis issue: <http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/issues/#i20>
BR, Julian
-- -eric office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA mobile: +1.617.599.3509 (eric@w3.org) Feel free to forward this message to any list for any purpose other than email address distribution.

"Eric" == Eric Prud'hommeaux <eric@w3.org> writes:
Eric> Encoding considerations:
Eric> The syntax of Turtle is expressed over code points in
Eric> Unicode [UNICODE]. The encoding is always UTF-8 [RFC3629]; the
Eric> charset parameter is not needed; though it may be included so
Eric> long as the value is 'UTF-8'.

Shouldn't the normative reference be to the UCS (ISO 10646) rather than to Unicode? The Universal Character Set is the international standard.

-JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6

James Cloos wrote:
"Eric" == Eric Prud'hommeaux <eric@w3.org> writes:
Eric> Encoding considerations:
Eric> The syntax of Turtle is expressed over code points in
Eric> Unicode [UNICODE]. The encoding is always UTF-8 [RFC3629]; the
Eric> charset parameter is not needed; though it may be included so
Eric> long as the value is 'UTF-8'.
Shouldn't the normative reference be to the UCS (ISO 10646) rather than to Unicode? The Universal Character Set is the international standard.
I'm sure Eric knows the explanation below on how to reference Unicode and / or ISO 10646 ... http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefUnicode see esp. http://www.w3.org/TR/2005/REC-charmod-20050215/#C062 "Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646." I think this is applicable also for this case. Felix

"Felix" == Felix Sasaki <fsasaki@w3.org> writes:
Shouldn't the normative reference be to the UCS (ISO 10646) rather than to Unicode? The Universal Character Set is the international standard.
Felix> http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefUnicode
Felix> http://www.w3.org/TR/2005/REC-charmod-20050215/#C062
Felix> "Since specifications in general need both a definition for their
Felix> characters and the semantics associated with these characters,
Felix> specifications SHOULD include a reference to the Unicode Standard,
Felix> whether or not they include a reference to ISO/IEC 10646."

I was coming from an IETF POV rather than a W3C POV, hence the feeling that the ISO should be preferred over the industry org for normative references....

-JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6

James Cloos wrote:
"Felix" == Felix Sasaki <fsasaki@w3.org> writes:
Shouldn't the normative reference be to the UCS (ISO 10646) rather than to Unicode? The Universal Character Set is the international standard.
Felix> http://www.w3.org/TR/2005/REC-charmod-20050215/#sec-RefUnicode Felix> http://www.w3.org/TR/2005/REC-charmod-20050215/#C062
Felix> "Since specifications in general need both a definition for their Felix> characters and the semantics associated with these characters, Felix> specifications SHOULD include a reference to the Unicode Standard, Felix> whether or not they include a reference to ISO/IEC 10646."
I was coming from an IETF POV rather than a W3C POV, hence the feeling that the ISO should be preferred over the industry org for normative references....
not sure if this is a question of POVs - it probably depends on what information you want to point people to. Both Unicode and ISO/IEC 10646 provide the same code points, but Unicode also provides additional semantics useful for implementers. See the first link above. Felix
participants (9)

- Anne van Kesteren
- Eric Prud'hommeaux
- Felix Sasaki
- Frank Ellermann
- Ian Hickson
- James Cloos
- Julian Reschke
- Larry Masinter
- Martin Duerst