RE: Media Type "text/csv": new draft (-02) and Last Call

Graham & Yakov,
Where a Comma-Separated-Value format is used by peer computer applications attempting to communicate with each other in an open fashion, it is very simple for them to produce a fixed number of Comma-Separated-Values.
It sounds like your production of a variable number of Comma-Separated-Values is an artefact of how you are manually driving one proprietary spreadsheet program from the keyboard/mouse. If such manually generated output is to be read by a similarly manually driven mechanism, then it may be acceptable to have variable numbers of Comma-Separated-Values per record. Standardisation in numbers of fields is then unnecessary, so an RFC need not cater for the uncontrolled nature of manually handled data.
But the same cannot be said of automated computer-based applications, where maintaining a strict count of generated and expected Comma-Separated-Values per record is not only easy, but also allows for an extra level of data validation: namely that a received record is corrupt if it has too few or too many fields. This is where standardisation in the format of the CSV records becomes appropriate material for an RFC.
Regards,
Clyde Ingram
-----Original Message-----
From: Graham Klyne [mailto:GK-lists@ninebynine.org]
Sent: Tuesday, March 29, 2005 12:21 PM
To: clyde.ingram@edl.uk.eds.com; YakovS@solidmatrix.com
Cc: ietf-types@alvestrand.no
Subject: RE: Media Type "text/csv": new draft (-02) and Last Call
At 15:19 23/03/05 +0000, clyde.ingram@edl.uk.eds.com wrote:
Please clarify whether the trailing commas that your Excel export generates are there to mark the end of the last field, or to mark the start of a last field which currently has no value.
I'm not sure how to tell the difference in an Excel spreadsheet. In the case where this arose for me, I had created a spreadsheet with varying numbers of values in different rows, and many of the rows were output by Excel with *multiple* trailing commas. Some rows were generated without any trailing commas. My point would be that if this happens with reasonable data then it must be permitted. Whether it's interpreted as a field terminator or as the start of a field with no value is, I think, moot.
#g --
Graham,
-----Original Message-----
From: Graham Klyne [mailto:GK-lists@ninebynine.org]
Sent: Wednesday, March 23, 2005 9:55 AM
To: Yakov Shafranovich; clyde.ingram@edl.uk.eds.com
Cc: ietf-types@alvestrand.no
Subject: Re: Media Type "text/csv": new draft (-02) and Last Call
At 01:14 23/03/05 -0500, Yakov Shafranovich wrote:
Clyde,
Thanks for pointing this out. I personally think that instead of making the header record mandatory, which is something that most CSV applications do not have, I would rather take the comma out of the end of the record and have the last field end with a CRLF instead of an optional COMMA. Do you think that is a plausible solution?
No. Some of the Excel data I process has trailing commas. This must be allowed.
I also don't think it's necessary to say anything (other than maybe as a comment) about any special status for the first line: such use is accommodated quite reasonably within the basic CSV format.
For example, having such a line when exporting Excel as CSV depends entirely upon how the user constructs the original spreadsheet. Column headings are common, but not mandatory. In some cases, there may be a more complex heading structure -- this is an application issue, not a dataset format issue, and as such does not belong in the dataset format specification.
#g -- ------------
Please clarify whether the trailing commas that your Excel export generates are there to mark the end of the last field, or to mark the start of a last field which currently has no value.
To take a concrete example, I would expect a CSV of sibling relationships in a mythical family to look like this, assuming the siblings are one brother (Bart) and 2 sisters (Lisa & Maggie):
child,sisters,brothers<CR-LF>
Bart,Lisa & Maggie,<CR-LF>
Lisa,Maggie,Bart<CR-LF>
Maggie,Lisa,Bart<CR-LF>
where the trailing comma for the record of child=Bart signifies that the "brothers" field is null, so that Bart has no brothers. In my view this is a logical conclusion, and in fact stripping that one trailing comma would be an error, as that record would only have 2 fields, not 3.
Would you, however, expect the CSV file to use comma as a field-terminator, rather than a field-separator, as follows?:
child,sisters,brothers,<CR-LF>
Bart,Lisa & Maggie,,<CR-LF>
Lisa,Maggie,Bart,<CR-LF>
Maggie,Lisa,Bart,<CR-LF>
Note that parsers that split data records on unprotected commas would detect one field too many in this latter case.
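The difference between the two readings can be checked mechanically. The following is only a minimal sketch, assuming Python's standard csv module as a stand-in for "a parser that splits records on unprotected commas"; the helper name field_counts is hypothetical, and the sample records are taken from the two examples above.

```python
import csv
import io

# Comma used as a field SEPARATOR: the trailing comma on Bart's record
# introduces an empty third field ("brothers").
separator_style = "child,sisters,brothers\r\nBart,Lisa & Maggie,\r\n"

# Comma used as a field TERMINATOR: every record ends with one extra comma.
terminator_style = "child,sisters,brothers,\r\nBart,Lisa & Maggie,,\r\n"


def field_counts(text):
    """Return the number of fields the csv module sees in each record.

    Hypothetical helper, for illustration only.
    """
    return [len(row) for row in csv.reader(io.StringIO(text, newline=""))]


separator_counts = field_counts(separator_style)
terminator_counts = field_counts(terminator_style)
print(separator_counts)   # [3, 3]
print(terminator_counts)  # [4, 4]
```

Under the separator reading each record has the three fields named in the header; under the terminator reading the same parser reports a fourth, empty field on every record.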
In a Comma SEPARATED Value file format, can you configure Excel to use comma as a SEPARATOR between values, rather than a TERMINATOR (at the end of values)?
Regarding your remarks on the header record being "an application issue, not a dataset format issue, and as such does not belong in the dataset format specification": XML, ASN.1, and other (application-independent) data interchange formats, explicitly tag individual fields so that their type is unambiguously defined within a context. In contrast, CSV conveys no tags per field in a data record. Hence, to help with application-independent data interchange, the CSV format should convey field titles in a header record.
Here is an example of lack of application-independence: if my application sends yours this CSV file:
,Bart,Lisa & Maggie<CR-LF>
Bart,Lisa,Maggie<CR-LF>
Bart,Maggie,Lisa<CR-LF>
and your application depends on the assumption that the fields are the sequence:
child sisters brothers
then your application will mis-interpret the data. But if my application precedes this with a header record, like so:
brothers,child,sisters<CR-LF>
,Bart,Lisa & Maggie<CR-LF>
Bart,Lisa,Maggie<CR-LF>
Bart,Maggie,Lisa<CR-LF>
then your application can maintain independence from the change by my application, because the CSV file conveys the corresponding new field sequence (the "brothers" column has moved from last place to first, ahead of "child" and "sisters").
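To illustrate the point about header records, here is a minimal sketch of a consumer that looks fields up by title rather than by position. It assumes Python's standard csv.DictReader, which takes its field names from the first record; the data is the reordered file from the example above, and nothing here is meant as the way any particular application does it.

```python
import csv
import io

# The reordered file from the example above, including its header record.
data = (
    "brothers,child,sisters\r\n"
    ",Bart,Lisa & Maggie\r\n"
    "Bart,Lisa,Maggie\r\n"
    "Bart,Maggie,Lisa\r\n"
)

# DictReader reads the field titles from the first record, so each later
# record becomes a mapping from title to value, whatever the column order.
records = list(csv.DictReader(io.StringIO(data, newline="")))

for record in records:
    print(record["child"], "| sisters:", record["sisters"],
          "| brothers:", record["brothers"])
```

A consumer written this way reads Bart's record correctly (no brothers, sisters Lisa & Maggie) even though the producer has moved the "brothers" column to the front.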
Regards, Clyde
------------ Graham Klyne For email: http://www.ninebynine.org/#Contact

clyde.ingram@edl.uk.eds.com wrote:
But the same cannot be said of automated computer-based applications, where maintaining a strict count of generated and expected Comma-Separated-Values per record is not only easy, but also allows for an extra level of data validation: namely that a received record is corrupt if it has too few or too many fields. This is where standardisation in the format of the CSV records becomes appropriate material for an RFC.
The draft as it is written now (-03) does not mandate that the same number of fields appear on each line, mainly because the draft is focusing on the MIME type registration. Would the following change to section 2, subsection 4 be sufficient to address your concerns?
"Within the header and each record there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. The last field in the record may not be followed by a comma. For example:"
Yakov
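The "same number of fields throughout the file" rule lends itself to the kind of automated validation discussed in this thread. The following is a rough sketch only, assuming Python's standard csv module and treating the first record as the reference count; the function name inconsistent_records is hypothetical and is not taken from the draft.

```python
import csv
import io


def inconsistent_records(text):
    """Return (record_number, field_count) pairs for every record whose
    field count differs from that of the first record.

    Hypothetical helper, for illustration only.
    """
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return []
    expected = len(rows[0])
    return [(number, len(row))
            for number, row in enumerate(rows, start=1)
            if len(row) != expected]


good = "child,sisters,brothers\r\nBart,Lisa & Maggie,\r\nLisa,Maggie,Bart\r\n"
bad = "child,sisters,brothers\r\nBart,Lisa & Maggie\r\nLisa,Maggie,Bart,extra\r\n"

print(inconsistent_records(good))  # []
print(inconsistent_records(bad))   # [(2, 2), (3, 4)]
```

Note that a check like this cannot tell a trailing comma used as a terminator from a genuinely empty last field; it only reports records with too few or too many fields relative to the first.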