Please review application/shf+xml

We have just published an I-D for the Standard Hex Format, and this announcement is to ask the types-list to review it from a transport type perspective.
http://www.ietf.org/internet-drafts/draft-strombergson-shf-00.txt
Yours, Linus Walleij on behalf of the author group

At 11pm on 21/10/03 you (Linus Walleij) wrote:
We have just published an I-D for the Standard Hex Format, and this announcement is to ask the types-list to review it from a transport type perspective.
http://www.ietf.org/internet-drafts/draft-strombergson-shf-00.txt
In section 6 the draft states
| Refer to this DTD as:
|
| <!ENTITY % SHF PUBLIC "-//IETF//DTD SHF//EN"
| "http://www.foo.org/shf.dtd">
| %SHF;
and in section 9.1
| There is no charset parameter.
; however in section 9.5 we have
| Second, neither the "XML" declaration (e.g., ) nor the "DOCTYPE"
| declaration (e.g., ) may be present. (Accordingly, if another
| character set other than UTF-8 is desired, then the "charset"
| parameter must be present.)
. These are inconsistent.
I would suggest removing both restrictions listed in 9.5: their purpose is unclear. If it is to simplify the processing, then there are many aspects of XML parsing (CDATA sections, properly parsed comments, random charsets, substituting entity references) that are harder to deal with than simply ignoring an XML declaration and doctype.
Two additional points: would it not be worth declaring an XML namespace for this format in addition to the DTD?
and would it not be worth adding support for using hashes other than SHA-1, both for when the time surely comes that SHA-1 is insufficient security, and to allow simpler checksums in secure environments with limited processing power (such as embedded systems)?
More generally, although this may be out of the remit of this list, is an XML-based format not a little complex for a hex dump? The draft mentions 8- and 16-bit embedded systems: are these likely to have the necessary processing power and XML-parsing tools available to make use of a dump in this format? If it is necessary to transform the dump into a simpler form on a more powerful machine this rather defeats the object of a general-purpose platform-neutral format.
Ben

Thanks Ben! Your effort is much appreciated. On Wed, 22 Oct 2003 ben@morrow.me.uk wrote:
In section 6 the draft states
| Refer to this DTD as:
|
| <!ENTITY % SHF PUBLIC "-//IETF//DTD SHF//EN"
| "http://www.foo.org/shf.dtd">
| %SHF;
and in section 9.1
| There is no charset parameter.
; however in section 9.5 we have
| Second, neither the "XML" declaration (e.g., ) nor the "DOCTYPE"
| declaration (e.g., ) may be present. (Accordingly, if another
| character set other than UTF-8 is desired, then the "charset"
| parameter must be present.)
. These are inconsistent.
Sorry, I was confusing "charset" with "encoding". Will never do it again... Also the paragraph is unclear, as you say. Could I write something like:
Second, neither the "xml" processing instruction nor the "DOCTYPE" declaration need to be present. (Accordingly, if a character set other than UTF-8 is desired, then the "encoding" parameter must be present in an "xml" processing instruction.)
I would suggest removing both restrictions listed in 9.5: their purpose is unclear.
The idea is: if you want to switch character set, do so in the processing instruction. (<?xml version="1.0" encoding="foo" ?>) I hope the above conveys this clearly. So, a pedantically specified SHF file would begin:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE shf PUBLIC "-//IETF//DTD SHF//EN" "http://ietf.org/shf.dtd">
<dump ... etc
Two additional points: would it not be worth declaring an XML namespace for this format in addition to the DTD?
I have seen no standards for this: there are however two drafts about it, while we don't know if they will be published, we fashioned this like e.g. the BEEP standard.
and would it not be worth adding support for using hashes other than SHA-1, both for when the time surely comes that SHA-1 is insufficient security, and to allow simpler checksums in secure environments with limited processing power (such as embedded systems)?
We had this discussion, and SHA-1 is sort of an IETF standard (RFC 3174). The purpose of the SHA-1 checksum is plain checksumming of the contents (information integrity), not to counter compromise. The SHF file as a whole may be signed and checked by way of RFC 3275 if need be, as stated. Supporting several checksum algorithms would increase implementation complexity, so that option was removed to keep things simple. On processing power, see below.
More generally, although this may be out of the remit of this list, is an XML-based format not a little complex for a hex dump?
The main purpose is transport and storage. In reality, dumps are typically not transferred to embedded systems by way of textual formats anyway, instead a host program ("flasher" etc) on some other machine will typically read the SHF file and transfer the data via serial bus in some custom format.
SHF file -> parser/converter -> 01011001001 -> device
In the future, as complexity increases in embedded systems, this may change, so that systems parse the SHF file directly. I will try to clarify this.
Thanks, Linus
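The host-side step in the pipeline described in this message (SHF file -> parser/converter -> raw bytes) might look roughly like the following Python sketch. It is not taken from the draft: the element and attribute names (`dump`, `block`, `checksum`) and the function name `decode_blocks` are placeholders for illustration, and the assumption that the SHA-1 digest covers the decoded bytes of each block is a guess. The draft's DTD is the authority on the real names and rules.

```python
import hashlib
import xml.etree.ElementTree as ET


def decode_blocks(shf_text):
    """Return a list of (data, checksum_ok) pairs, one per <block> element.

    Element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.fromstring(shf_text)
    results = []
    for block in root.iter("block"):
        # Hex digits may be wrapped over several lines; drop all whitespace.
        hex_digits = "".join((block.text or "").split())
        data = bytes.fromhex(hex_digits)
        # Assumption: the checksum attribute is the SHA-1 of the decoded bytes.
        expected = block.get("checksum", "").lower()
        actual = hashlib.sha1(data).hexdigest()
        results.append((data, actual == expected))
    return results


# Build a tiny sample document whose checksum is known to match.
payload = bytes.fromhex("59c30001")
sample = (
    '<dump><block checksum="%s">59c3\n0001</block></dump>'
    % hashlib.sha1(payload).hexdigest()
)

for data, ok in decode_blocks(sample):
    print(data.hex(), "checksum ok" if ok else "checksum MISMATCH")
```

The decoded bytes would then be handed to whatever device-specific transfer routine the flasher uses; that part is outside what a textual format can describe.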

Please follow RFC 3023 and provide the charset parameter. Cheers, -- MURATA Makoto <murata@hokkaido.email.ne.jp>

On Wednesday, October 22, 2003, 2:42:51 PM, MURATA wrote:
MM> Please follow RFC 3023 and provide the charset parameter.
If your applications are tested with multiple charsets, and you can demonstrate that your applications interoperably:
a) give the charset parameter precedence over what the XML declaration says
b) correctly re-write the XML when saving to disk so that the encoding parameter in the XML declaration is altered to what the MIME charset parameter said (if they differ, thus ensuring that the XML remains well formed when read from disk)
then please feel free to add the charset parameter.
If, on the other hand,
- you would rather avoid this extra level of implementation burden
- you would like consistent behavior when reading from local disk as well as from the network,
- you are uncomfortable about the possibility of having allegedly XML content on your server which is not well formed in the absence of HTTP headers
- you want to simplify the deployability of this media type across Web servers from different vendors
then I urge you to *not* add a charset parameter but instead, to add the sentence:
"There is no charset parameter. Character handling has identical semantics to the case where the charset parameter of the "application/xml" media type is omitted, as described in [RFC3023]."
-- Chris mailto:chris@w3.org
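To make the implementation burden in point (a) concrete, here is a minimal Python sketch of how a receiver might choose an encoding when a MIME charset parameter may or may not be present. The function name `effective_encoding` is made up for this example, and the logic is deliberately simplified: it ignores byte order marks and only recognises an encoding declaration written in an ASCII-compatible encoding, so it is an illustration of the precedence rule and not a conforming XML processor.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_ENCODING_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def effective_encoding(body, mime_charset=None):
    """Pick the encoding to decode an XML entity with.

    A MIME charset parameter, when present, takes precedence over the
    XML declaration; otherwise the declaration is used; otherwise the
    sketch falls back to UTF-8.
    """
    if mime_charset:
        return mime_charset.lower()
    match = _ENCODING_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><dump/>'
print(effective_encoding(doc))                        # declaration is used
print(effective_encoding(doc, mime_charset="UTF-8"))  # MIME parameter wins
```

The second call shows the awkward case behind point (b): the bytes on disk still claim ISO-8859-1, so a receiver that saves the entity without rewriting the declaration ends up with a file that contradicts how it was transported.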

On Wed, 22 Oct 2003 16:39:17 +0200 Chris Lilley <chris@w3.org> wrote:
If your applications are tested with multiple charsets, and you can demonstrate that your applications interoperably:
Chris can certainly argue against RFC 3023 and try to improve it. However, I do not think that Linus would like to wait for the conclusion of that debate. Apparently, application/shf+xml and RFC 3023 should be in sync. By the way, if we drop the charset from RFC 3023, some SOAP implementations will break.
Linus wrote:
If, on the other hand, (...) then I urge you to *not* add a charset parameter but instead, to add the sentence:
Your previous statements on this issue have not gone unnoticed, so that's how I do it. Thanks, Chris.
I do not think that this is a good action. Cheers, Makoto -- MURATA Makoto <murata@hokkaido.email.ne.jp>

At 9am on 24/10/03 you (MURATA Makoto) wrote:
On Wed, 22 Oct 2003 16:39:17 +0200 Chris Lilley <chris@w3.org> wrote:
If your applications are tested with multiple charsets, and you can demonstrate that your applications interoperably:
Chris can certainly argue against RFC 3023 and try to improve it. However, I do not think that Linus would like to wait for the conclusion of that debate. Apparently, application/shf+xml and RFC 3023 should be in sync. By the way, if we drop the charset from RFC 3023, some SOAP implementations will break.
Um, excuse me?
From rfc3023 (section 3.2, 'Application/xml registration'):
| Optional parameters: charset
<snip>
| If an application/xml entity is received where the charset
| parameter is omitted, no information is being provided about the
| charset by the MIME Content-Type header. Conforming XML
| processors MUST follow the requirements in section 4.3.3 of [XML]
| that directly address this contingency. However, MIME processors
| that are not XML processors SHOULD NOT assume a default charset if
| the charset parameter is omitted from an application/xml entity.
Chris's suggestion that app/shf+xml entities should not have a charset, but instead always be processed according to this paragraph, is both entirely in conformance with the RFC and entirely sensible, especially for a format which will almost invariably be in ASCII.
Ben

On Fri, 24 Oct 2003 02:12:22 +0100 ben@morrow.me.uk wrote:
Um, excuse me?
Chris's suggestion that app/shf+xml entities should not have a charset, but instead always be processed according to this paragraph, is both entirely in conformance with the RFC and entirely sensible, especially for a format which will almost invariably be in ASCII.
What you have missed is 7.1 of RFC 3023
Registrations for new XML-based media types under top-level types other than "text" SHOULD, in specifying the charset parameter and encoding considerations, define them as: "Same as [charset parameter / encoding considerations] of application/xml as specified in RFC 3023."
Cheers, -- MURATA Makoto <murata@hokkaido.email.ne.jp>

On Friday, October 24, 2003, 2:59:58 AM, MURATA wrote: MM> On Wed, 22 Oct 2003 16:39:17 +0200 MM> Chris Lilley <chris@w3.org> wrote:
If your applications are tested with multiple charsets, and you can demonstrate that your applications interoperably:
MM> Chris can certainly argue against RFC 3023 and try to improve it.
Well, I could. Instead, though, I prefer to suggest wording that directly and normatively references RFC 3023 since that is the current specification.
MM> However, I do not think that Linus would like to wait for the
MM> conclusion of that debate.
I don't recall suggesting anyone wait for anything.
MM> Apparently, application/shf+xml and RFC 3023 should be in sync,
For example, using my suggested wording whereby the registration for application/shf+xml references what RFC 3023 says to do.
MM> By the way, if we drop the charset from RFC 3023, some SOAP
MM> implementations will break.
Perhaps you could indicate where I suggested altering the SOAP specification?
MM> Linus wrote:
If, on the other hand, (...) then I urge you to *not* add a charset parameter but instead, to add the sentence:
Your previous statements on this issue have not gone unnoticed, so that's how I do it. Thanks, Chris.
MM> I do not think that this is a good action.
However, you fail to articulate a reason for this.
-- Chris mailto:chris@w3.org

On Fri, 24 Oct 2003 03:22:46 +0200 Chris Lilley <chris@w3.org> wrote:
For example, using my suggested wording whereby the registration for application/shf+xml references what RFC 3023 says to do.
I do not think so, because the proposed I-D does not say
Same as [charset parameter/ encoding considerations] of application/xml as specified in RFC 3023
which is what RFC 3023 says to do.
Cheers, -- MURATA Makoto <murata@hokkaido.email.ne.jp>

On Friday, October 24, 2003, 11:44:30 AM, MURATA wrote: MM> On Fri, 24 Oct 2003 03:22:46 +0200 MM> Chris Lilley <chris@w3.org> wrote:
For example, using my suggested wording whereby the registration for application/shf+xml references what RFC 3023 says to do.
MM> I do not think so, because the proposed I-D does not say
MM> Same as [charset parameter/ encoding considerations] of
MM> application/xml as specified in RFC 3023
MM> which is what RFC 3023 says to do.
My suggested wording, which was only suggested a few days ago on this list so might not be in the latest ID was
"There is no charset parameter. Character handling has identical semantics to the case where the charset parameter of the "application/xml" media type is omitted, as described in [RFC3023]."
The intent seems identical to what you suggest. What am I missing here?
-- Chris mailto:chris@w3.org

On Fri, 24 Oct 2003 16:26:31 +0200 Chris Lilley <chris@w3.org> wrote:
"There is no charset parameter. Character handling has identical semantics to the case where the charset parameter of the "application/xml" media type is omitted, as described in [RFC3023]."
The intent seems identical to what you suggest. What am I missing here?
I think that this I-D should provide a good reason to omit the charset parameter and that reason should be specific to application/shf+xml. For example, "this media type uses UTF-8 only" is a perfectly good reason. In this particular case, are there any reasons to allow something different from UTF-8? (I'm just asking.) Cheers, -- MURATA Makoto <murata@hokkaido.email.ne.jp>

On Sunday, October 26, 2003, 3:04:11 PM, MURATA wrote: MM> On Fri, 24 Oct 2003 16:26:31 +0200 MM> Chris Lilley <chris@w3.org> wrote:
"There is no charset parameter. Character handling has identical semantics to the case where the charset parameter of the "application/xml" media type is omitted, as described in [RFC3023]."
The intent seems identical to what you suggest. What am I missing here?
MM> I think that this I-D should provide a good reason to omit the charset
MM> parameter and that reason should be specific to application/shf+xml.
MM> For example, "this media type uses UTF-8 only" is a perfectly good reason.
MM> In this particular case, are there any reasons to allow something
MM> different from UTF-8? (I'm just asking.)
Probably not, though I would advance the case that if they really are using an XML parser then they get UTF-16 for free and should allow both. However, I was not saying that the only time the charset parameter should be omitted was when the encoding was UTF-8 or UTF-16.
-- Chris mailto:chris@w3.org

OK we have then reached consensus on this issue I believe:
* We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of simplicity.
* We define that in this case the "charset" parameter is omitted, for this very reason.
And then everybody is happy with this transport type?
Regarding the application as such, I have announced it for discussion on the discuss-list for Application Area; all interested parties are welcome to join in on it.
Yours, Linus Walleij

On Mon, 27 Oct 2003 18:06:46 +0100 (MET) Linus Walleij <triad@df.lth.se> wrote:
* We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of simplicity.
Which UTF-16? Unfortunately, there are three charsets for UTF-16. They are "utf-16le", "utf-16be" and "utf-16" (see RFC 2781). Since this XML format describes hexadecimal data, almost every character is US-ASCII. I wonder why we have to double the file size by representing a US-ASCII character with 16 bits. 1MB in UTF-8 becomes 2MB in UTF-16.
* We define that in this case the "charset" parameter is omitted, for this very reason.
If there is more than one choice, I do not think that this is a good reason to omit the charset parameter. Cheers, -- MURATA Makoto <murata@hokkaido.email.ne.jp>
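The size argument in this message is easy to check for text that is entirely US-ASCII. The Python snippet below is only an illustration; the sample string stands in for the hex payload of a dump and is not SHF markup.

```python
# 16 hex digits repeated 65536 times: exactly 1 MiB of US-ASCII text.
hex_text = "0123456789ABCDEF" * 65536

utf8_size = len(hex_text.encode("utf-8"))    # one byte per character
utf16_size = len(hex_text.encode("utf-16"))  # two bytes per character + 2-byte BOM

print(utf8_size)   # 1048576
print(utf16_size)  # 2097154
```

The UTF-16 figure is twice the UTF-8 one plus two bytes for the byte order mark that Python's "utf-16" codec writes at the start.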

On Thu, 30 Oct 2003, MURATA Makoto wrote:
* We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of simplicity.
Which UTF-16? Unfortunately, there are three charsets for UTF-16. They are "utf-16le", "utf-16be" and "utf-16" (see RFC 2781).
The XML specification says:
Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
As easy as it gets :-)
Since this XML format describes hexadecimal data, almost every character is US-ASCII. I wonder why we have to double the file size by representing a US-ASCII character with 16 bits. 1MB in UTF-8 becomes 2MB in UTF-16.
That's a good point. OK it's UTF-8 only then. Linus
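For readers who want to see what the byte order mark check and a UTF-8-only rule amount to in practice, here is a small Python sketch. The function names are invented for this example, and the behaviour on UTF-16 input (raising an error) is one possible reading of "UTF-8 only", not something the draft specifies.

```python
import codecs


def sniff_unicode_family(body):
    """Classify an entity as 'utf-16' or 'utf-8' from its leading bytes.

    A UTF-16 entity must start with a byte order mark, so its presence
    is enough to tell the two apart.
    """
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


def decode_utf8_only(body):
    """Decode an entity under a UTF-8-only rule, rejecting UTF-16 input."""
    if sniff_unicode_family(body) == "utf-16":
        raise ValueError("UTF-16 entity rejected: only UTF-8 is accepted here")
    # "utf-8-sig" drops an optional UTF-8 BOM; malformed byte sequences
    # raise UnicodeDecodeError.
    return body.decode("utf-8-sig")


print(sniff_unicode_family("<dump/>".encode("utf-16")))
print(decode_utf8_only(b"\xef\xbb\xbf<dump/>"))
```

Note that a receiver using a stock XML parser would accept the UTF-16 entity without complaint, so the rejection step is extra code that the application has to add on top.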

On Wednesday, October 29, 2003, 5:33:26 PM, MURATA wrote: MM> On Mon, 27 Oct 2003 18:06:46 +0100 (MET) MM> Linus Walleij <triad@df.lth.se> wrote:
* We define that shf+xml will use UTF-8 and UTF-16 only, for reasons of simplicity.
MM> Which UTF-16? Unfortunately, there are three charsets for UTF-16.
MM> They are "utf-16le", "utf-16be" and "utf-16" (see RFC 2781).
I believe the answer to this can be found by a case-insensitive string match between the three strings you give and the string "UTF-16" that I suggested above. However, my main reason for suggesting both the encodings that XML mandates is because the effort to detect and give an error on an otherwise perfectly fine file in XML that uses UTF-16 and that the parser would happily accept, seems higher than any utility gained.
MM> Since this XML format describes hexadecimal data, almost every character
MM> is US-ASCII. I wonder why we have to double the file size by representing
MM> a US-ASCII character with 16 bits. 1MB in UTF-8 becomes 2MB in UTF-16.
You don't *have* to double the file size. Saying UTF-16 only would, I agree, double the file size in most cases. Use UTF-8 unless the quantity of human-readable content is large and the overall size is smaller in UTF-16.
* We define that in this case the "charset" parameter is omitted, for this very reason.
MM> If there is more than one choice, I do not think that this is a
MM> good reason to omit the charset parameter.
On the contrary, since both UTF-8 and UTF-16 are accepted by every XML parser worldwide and since, therefore, using either of these two does not require an XML declaration with an encoding declaration, then there is absolutely no problem there. Forcing the server to construct redundant and frequently erroneous information in parallel seems especially unwise in this case, because every parser can accept both encodings and because the string containing the encoding is not physically present in the file, making the server have to do byte analysis to compute the value.
-- Chris mailto:chris@w3.org

At 2pm on 22/10/03 you (Linus Walleij) wrote:
On Wed, 22 Oct 2003 ben@morrow.me.uk wrote:
Could I write something like:
Second, neither the "xml" processing instruction nor the "DOCTYPE" declaration need to be present.
Ah yes, this is quite different. I was puzzled by 'neither... MAY be present'.
(Accordingly, if a character set other than UTF-8 is desired, then the "encoding" parameter must be present in an "xml" processing instruction.)
For this, see Chris Lilley's mail. The paragraph he gives is all that is required.
Two additional points: would it not be worth declaring an XML namespace for this format in addition to the DTD?
I have seen no standards for this: there are however two drafts about it, while we don't know if they will be published, we fashioned this like e.g. the BEEP standard.
What about <http://www.w3.org/TR/REC-xml-names>? OK, that's W3C not IETF, but they are probably the appropriate standards body wrt XML stuff. This is rather a side issue, and doesn't really matter. :)
and would it not be worth adding support for using hashes other than SHA-1,
We had this discussion, and SHA-1 is sort of an IETF standard (RFC 3174).
Sorry. Fair enough. :)
More generally, although this may be out of the remit of this list, is an XML-based format not a little complex for a hex dump?
The main purpose is transport and storage. In reality, dumps are typically not transferred to embedded systems by way of textual formats anyway, instead a host program ("flasher" etc) on some other machine will typically read the SHF file and transfer the data via serial bus in some custom format.
This is not quite my point (although it does clarify that the format would be useful rather than futile :). Rather, my question is, why are you using XML rather than (say) some format based on short-lines-of-ASCII (perhaps taking RFC2822 as your model)? Given that the data to be represented is pure ASCII, and has a very simple structure, do you really need all the complexities of XML?
Ben

On Wed, 22 Oct 2003 ben@morrow.me.uk wrote:
Rather, my question is, why are you using XML rather than (say) some format based on short-lines-of-ASCII (perhaps taking RFC2822 as your model)? Given that the data to be represented is pure ASCII, and has a very simple structure, do you really need all the complexities of XML?
OK that's a fair and good question... Several things make us go for XML.
First, it's an Internet thing we wanna do, so if we were just writing "the most simple hexdump standard" the place to do it would probably be IEEE and not IETF. Such de facto-standards (like S-records) already exist. We expect the need for transport of this kind of data to increase, so an IETF RFC is needed.
There is a general trend in Unix and other OSes for formats to be XML in addition to being textual.
Also, if we should not go for XML, then the same line of reasoning about simplicity would also go for BEEP and others, yes? These RFCs give me the impression that textual transport should be made in XML where possible, not only where complexity is above some certain level. (Correct me if this is wrong.)
Perhaps the most important point raised was that if we need to extend this format, e.g. replace it with an SHF v.2 at some point (if not before, then as 128bit computing is introduced sooner or later), XML is easy to extend, version and add structure in, if desired. When complexity increases, XML scales fine.
Yours, Linus Walleij

At 7pm on 23/10/03 you (Linus Walleij) wrote:
On Wed, 22 Oct 2003 ben@morrow.me.uk wrote:
Rather, my question is, why are you using XML rather than (say) some format based on short-lines-of-ASCII (perhaps taking RFC2822 as your model)? Given that the data to be represented is pure ASCII, and has a very simple structure, do you really need all the complexities of XML?
First, it's an Internet thing we wanna do, so if we were just writing "the most simple hexdump standard" the place to do it would probably be IEEE and not IETF.
This is rather what I was worried about :). I would have said that if you were writing 'the most simple hexdump standard; subject to it being textual, more-or-less human-readable and widely interoperable' then the IETF is exactly the place to do it. For example, RFC2822 is pretty close to 'the most simple email format'. And even MIME, which extends it, is still based on lines of ASCII text. MIME could easily be recast into XML, but there would be no advantage in doing so.
Such de facto-standards (like S-records) already exist. We expect the need for transport of this kind of data to increase, so an IETF RFC is needed.
Of course. And you should of course consider more than your immediate needs when writing such an RFC, as you have done.
Also, if we should not go for XML, then the same line of reasoning about simplicity would also go for BEEP and others, yes?
I don't know...BEEP is not something I am familiar with. Looking over the RFC, BEEP messages seem to have the property that there are a number of different 'classes' of message, each with a fixed internal structure; this list of classes is extensible. This is a data model which XML is well suited to.
These RFCs give me the impression that textual transport should be made in XML where possible, not only where complexity is above some certain level. (Correct me if this is wrong.)
I would have said
1. For a simple format like a hex dump, using XML adds significant complexity.
2. You should not add complexity to a format unless there is a significant advantage in doing so.
So the question then is, what advantage is XML giving you?
Perhaps the most important point raised was that if we need to extend this format, e.g. replace it with an SHF v.2 at some point (if not before, then as 128bit computing is introduced sooner or later), XML is easy to extend, version and add structure in, if desired. When complexity increases, XML scales fine.
This is true. However, such things as changing the maximum word size are easy to work into a simple format (indeed, there is no real reason to limit the word size at all beyond the specifics of your hashing mechanism). Other structure I would say would be better dealt with by defining a new format which includes a section of SHF: quite possibly this new format could be XML-based, viz.:
<newformat>
  <metadata value="foo"/>
  <hexdump type="text/shf">
    [here follows an SHF object]
  </hexdump>
</newformat>
But this is all just my opinion: if no one else on the list backs me up I will consider myself outvoted and keep quiet :).
Ben

Also, if we should not go for XML, then the same line of reasoning about simplicity would also go for BEEP and others, yes? These RFCs give me the impression that textual transport should be made in XML where possible, not only where complexity is above some certain level. (Correct me if this is wrong.)
The "best current practice" for use of XML within IETF is in http://www.ietf.org/rfc/rfc3470.txt. It does not contain the recommendation that "textual transport should be made in XML where possible". Larry
participants (5): ben@morrow.me.uk, Chris Lilley, Larry Masinter, Linus Walleij, MURATA Makoto