
On Wednesday, November 24, 2004, 4:41:48 PM, Bjoern wrote:

BH> * Chris Lilley wrote:
BH> > As you yourself pointed out, per RFC3023
BH> >
BH> >   Processors generating XML MIME entities MUST NOT label conflicting
BH> >   charset information between the MIME Content-Type and the XML
BH> >   declaration.
BH> >
BH> > such content is already non conforming.
BH> It does not actually apply to content...

Yes, that is an ambiguity that needs to be cleared up. It says that things which generate the content must not do that; if that were always obeyed, there would be no such content. Since there is, it's worth splitting this into two:

- Conformance for XML generators
- Conformance for XML messages (headers plus bodies)
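As a rough illustration of what a message-level check (headers plus body) might look like, here is a minimal Python sketch. The names charset_param, declared_encoding and labels_conflict are hypothetical, not taken from RFC 3023 or any XML library. The sketch assumes an ASCII-compatible body, ignores BOMs, and treats a missing charset parameter or a missing encoding declaration as "no conflict", which is one reading of the clause rather than a settled rule.

```python
import codecs
import re
from email.message import Message

# Assumption: the XML declaration is readable as ASCII (no UTF-16/EBCDIC, no BOM).
_DECL = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def charset_param(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def declared_encoding(body):
    """Return the encoding named in the XML declaration, or None if absent."""
    m = _DECL.match(body)
    return m.group(1).decode("ascii") if m else None

def labels_conflict(content_type, body):
    """True when both labels are present and name different encodings.

    codecs.lookup() normalizes aliases (e.g. latin-1 vs iso-8859-1); it raises
    LookupError for names Python does not know, which this sketch does not handle.
    """
    charset = charset_param(content_type)
    declared = declared_encoding(body)
    if charset is None or declared is None:
        return False
    return codecs.lookup(charset).name != codecs.lookup(declared).name
```

For example, labels_conflict("application/xml;charset=iso-8859-1", b'<?xml version="1.0" encoding="UTF-8"?><a/>') reports a conflict, while the same body served as image/svg+xml with no charset parameter does not.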
In terms of dealing with such content if it still occurs, the XML well-formedness rules already handle that in an entirely satisfactory manner, and nothing further need be added. These are already well implemented and highly interoperable.
BH> Consider a *UTF-8 encoded* document
BH>
BH>   Content-Type: application/xml;charset=iso-8859-1

Since that isn't image/svg+xml, it has a charset parameter, although the processor that generated it does not conform to the existing RFC 3023. But let's press on into how to detect or resolve the error.

BH>   <?xml version="1.0"?>
BH>   ...
BH>   <!--Björn-->
BH>   ...
BH>
BH> With no BOM and using only US-ASCII characters for the rest of the
BH> document,

Cleverly constructed example: if the processor believes the charset parameter, it will think the comment says BjÃ¶rn. However, as soon as you save it, your name is misspelled. I'm sure you would not like that, Björn.

So in this case, although the processor that generated it is non-conforming, the content is not non-conforming (but it should be), and the processor that receives it has two possibilities:

a) it can add the missing encoding declaration when processing and when saving to disk (note that, if the XML happened to be digitally signed and in canonical XML form, this would break the signature). See RFC 3741.

b) it can note that a required encoding declaration is not present, and throw a well-formedness error.

Note that both of these choices will break some content and both of these choices are licensed by the relevant specifications. There is thus non-interoperability.

Note further that, in the case where the charset parameter is not present, there is 100% interoperability, no breakage, all in conformance with the existing clauses in RFC 3023 which 3023bis will retain, since they are proven by implementation experience with running code to be highly robust and interoperable.

So, let's take the other case, which is more interesting. Consider an *8859-1 encoded* document

  Content-Type: application/xml;charset=UTF-8

  <?xml version="1.0"?>
  ...
  <!--Björn-->
  ...
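The two mislabelled documents above behave very differently at the byte level, which a few lines of Python can show. This is only a sketch: decode_or_none is a hypothetical helper, and plain codec decoding stands in for whatever a real XML processor does with the label it chooses to trust.

```python
NAME = "Bj\u00f6rn"  # "Björn", written with an escape so the source stays ASCII

utf8_bytes = NAME.encode("utf-8")         # ö becomes 0xC3 0xB6
latin1_bytes = NAME.encode("iso-8859-1")  # ö becomes the single byte 0xF6

def decode_or_none(data, label):
    """Decode data using the given label; return None if the bytes are invalid."""
    try:
        return data.decode(label)
    except UnicodeDecodeError:
        return None

# First case: UTF-8 bytes labelled iso-8859-1. Every byte is a valid
# ISO-8859-1 character, so decoding succeeds silently with the wrong text.
misread = decode_or_none(utf8_bytes, "iso-8859-1")

# Second case: ISO-8859-1 bytes labelled UTF-8. 0xF6 cannot start a UTF-8
# sequence, so the mismatch is detectable and decoding fails.
rejected = decode_or_none(latin1_bytes, "utf-8")

print(repr(misread))   # the mojibake form, U+00C3 U+00B6 in place of ö
print(repr(rejected))  # None
```

The asymmetry is the point: the first mislabelling can never be caught by decoding alone, while the second produces an error that a processor must either report or deliberately paper over.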
With your proposal, would the well-formedness error (bytes occur that cannot occur in UTF-8) be silently recovered from if the HTTP header overrides it, even for an XML processor, while it would continue to fail in other cases (such as server-side processing)?

BH> with your proposal, which of the following behaviors of
BH> implementations would be considered conforming?

(see above for discussion of b and c)

BH> a) it fails to process the document due to RFC3023bis/XML 1.0 errors

That would be the safest course. Consider if the non-ASCII character were a euro or some other currency symbol, if the document were an invoice, and it were being processed by an accounting system, not by a human being. Accounting systems do not have the luxury of a human to look at the invoice, go to View...Character Encoding and try various possibilities until it seems to look right, then save the document, edit the local copy, and fix up the encoding declaration.

BH> b) it considers the comment to include "BjÃ¶rn"
BH> c) it considers the comment to include "Björn"

BH> * application/xhtml+xml (with no update to RFC3236)

That is an existing type and has an existing charset parameter. Applications are thus allowed to use it, with all the complications and breakage that this entails as described above.

BH> * image/svg+xml (as you propose it)

There is no charset parameter. Processors that generate one and messages that contain one are in error.

BH> For application/xml / application/xhtml+xml this would currently be b)
BH> as the document includes 0xC3 0xB6 and the encoding is determined to be
BH> ISO-8859-1 which means the sequence above represents "Ã¶".

It would sometimes be b) and sometimes c), depending on the particular software and whether it's reading from disk on the server or over the net. I frankly can't understand how you consider this lack of interoperability to be a desirable thing.

--
Chris Lilley mailto:chris@w3.org
Chair, W3C SVG Working Group
Member, W3C Technical Architecture Group