
On Wednesday, October 29, 2003, 5:33:26 PM, MURATA wrote:

MM> On Mon, 27 Oct 2003 18:06:46 +0100 (MET)
MM> Linus Walleij <triad@df.lth.se> wrote:
* We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of simplicity.
MM> Which UTF-16? Unfortunately, there are three charsets for UTF-16.
MM> They are "utf-16le", "utf-16be" and "utf-16" (see RFC 2781).

I believe the answer to this can be found by a case-insensitive string match between the three strings you give and the string "UTF-16" that I suggested above. However, my main reason for suggesting both of the encodings that XML mandates is that the effort to detect, and give an error on, an otherwise perfectly fine XML file that uses UTF-16 and that the parser would happily accept seems higher than any utility gained.

MM> Since this XML format describes hexadecimal data, almost every character
MM> is US-ASCII. I wonder why we have to double the file size by representing
MM> a US-ASCII character with 16 bits. 1MB in UTF-8 becomes 2MB in UTF-16.

You don't *have* to double the file size. Saying UTF-16 only would, I agree, double the file size in most cases. Use UTF-8 unless the quantity of human-readable content is large and the overall size is smaller in UTF-16.
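For illustration only, a comparison along these lines (a small Python sketch; the helper name and the list of accepted labels are just my example, limited to the two encodings proposed here) is all the matching that would be needed:

# Sketch of the case-insensitive comparison described above: the three
# RFC 2781 labels all name a form of UTF-16, and UTF-8 is also accepted.
ACCEPTED_LABELS = {"utf-8", "utf-16", "utf-16le", "utf-16be"}

def charset_is_acceptable(label):
    """Return True if the charset label names UTF-8 or one of the UTF-16 forms."""
    return label.strip().lower() in ACCEPTED_LABELS

# charset_is_acceptable("UTF-16LE")   -> True
# charset_is_acceptable("utf-16")     -> True
# charset_is_acceptable("ISO-8859-1") -> False
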
* We define that in this case the "charset" parameter is omitted, for this very reason.
MM> If there is more than one choice, I do not think that this is a
MM> good reason to omit the charset parameter.

On the contrary: since both UTF-8 and UTF-16 are accepted by every XML parser worldwide, and since using either of them therefore does not require an XML declaration with an encoding declaration, there is absolutely no problem there. Forcing the server to construct redundant and frequently erroneous information in parallel seems especially unwise in this case, because every parser can accept both encodings and because the string naming the encoding is not physically present in the file, so the server would have to do byte analysis to compute the value (a rough sketch of such analysis is appended below).

--
 Chris                          mailto:chris@w3.org
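Appended, for illustration only: a rough Python sketch of that byte analysis, simplified from the auto-detection rules in Appendix F of the XML 1.0 Recommendation (BOM and first-character patterns only, covering just the two encodings at issue):

def sniff_xml_encoding(first_bytes):
    """Guess whether an XML document is UTF-8 or UTF-16 from its opening bytes."""
    if first_bytes[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "utf-16"   # UTF-16 byte order mark, either endianness
    if first_bytes[:3] == b"\xef\xbb\xbf":
        return "utf-8"    # UTF-8 byte order mark
    if first_bytes[:2] in (b"\x00<", b"<\x00"):
        return "utf-16"   # '<' interleaved with a NUL byte: UTF-16 without a BOM
    return "utf-8"        # XML's default in the absence of any other signal

# Example (file name is hypothetical):
#   with open("example.xml", "rb") as f:
#       print(sniff_xml_encoding(f.read(4)))
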