
On Thu, 30 Oct 2003, MURATA Makoto wrote:
* We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of simplicity.
Which UTF-16? Unfortunately, there are three charsets for UTF-16. They are "utf-16le", "utf-16be" and "utf-16" (see RFC 2781).
The XML specification says: "Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents."
As easy as it gets :-)
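For illustration, the BOM check quoted from the XML specification could be sketched roughly as follows. This is only a minimal sketch, assuming the input is restricted to UTF-8 or UTF-16 as proposed in this thread; the function name sniff_encoding is made up for the example and is not part of any shf tool.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes.

    Hypothetical helper: assumes only UTF-8 and UTF-16 are allowed,
    so anything without a UTF-16 byte order mark is treated as UTF-8.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return "utf-8"
```

A file starting with the bytes FE FF would be read as big-endian UTF-16, one starting with FF FE as little-endian UTF-16, and everything else as UTF-8.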
Since this XML format describes hexadecimal data, almost every character is US-ASCII. I wonder why we have to double the file size by representing a US-ASCII character with 16 bits. 1MB in UTF-8 becomes 2MB in UTF-16.
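The size argument is easy to check. The snippet below is just a sketch with a made-up payload of hexadecimal digits (not real shf data); it compares the encoded length of the same US-ASCII text in both encodings.

```python
# Hypothetical payload: 64 hexadecimal digits, all US-ASCII.
payload = "0123456789ABCDEF" * 4

# Each US-ASCII character takes one byte in UTF-8.
utf8_size = len(payload.encode("utf-8"))

# Each US-ASCII character takes two bytes in UTF-16.
# "utf-16-le" is used here so that no byte order mark is added to the count.
utf16_size = len(payload.encode("utf-16-le"))

ratio = utf16_size / utf8_size
```

For text that is pure US-ASCII, the UTF-16 form is exactly twice the size of the UTF-8 form, plus two bytes if a byte order mark is written.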
That's a good point. OK, it's UTF-8 only then.
Linus