New subject: Media Type "text/csv": new draft (-02) and Last Call

30 Mar 2005

      Graham & Yakov,

Where a Comma-Separated-Value format is used by peer computer applications
attempting to communicate with each other in an open fashion, it is very
simple for them to produce a fixed number of Comma-Separated-Values.
It sounds like your production of a variable number of
Comma-Separated-Values is an artefact of how you are manually driving one
proprietary spreadsheet program from the keyboard/mouse.  If such manually
generated output is to be read by a similarly manually driven mechanism,
then it may be acceptable to have variable numbers of Comma-Separated-Values
per record.  Standardisation in the number of fields is then unnecessary, so
an RFC need not cater for the uncontrolled nature of manually handled data.

But the same cannot be said of automated computer-based applications, where
maintaining a strict count of generated and expected Comma-Separated-Values
per record is not only easy, but also allows for an extra level of data
validation: namely that a received record is corrupt if it has too few or
too many fields.  This is where standardisation in the format of the CSV
records becomes appropriate material for an RFC.
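
The field-count check described above can be sketched in Python. This is an
illustrative sketch only, not part of the draft: the function name
find_bad_records and the sample data are hypothetical, and it assumes the
receiver already knows how many fields to expect per record.

```python
import csv
import io

def find_bad_records(text, expected_fields):
    """Return (record_number, field_count) for each record whose
    field count differs from expected_fields."""
    # newline="" leaves CRLF handling to the csv module.
    reader = csv.reader(io.StringIO(text, newline=""))
    bad = []
    for number, record in enumerate(reader, start=1):
        if len(record) != expected_fields:
            bad.append((number, len(record)))
    return bad

# Hypothetical data: record 3 has too few fields, record 4 too many.
sample = "a,b,c\r\n1,2,3\r\n4,5\r\n6,7,8,9\r\n"
print(find_bad_records(sample, 3))
```

A receiver could reject the whole file, or only the flagged records, depending
on the application.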

Regards,
Clyde Ingram

-----Original Message-----
From: Graham Klyne [mailto:GK-lists@ninebynine.org]
Sent: Tuesday, March 29, 2005 12:21 PM
To: clyde.ingram@edl.uk.eds.com; YakovS@solidmatrix.com
Cc: ietf-types@alvestrand.no
Subject: RE: Media Type "text/csv": new draft (-02) and Last Call

At 15:19 23/03/05 +0000, clyde.ingram@edl.uk.eds.com wrote:
...
Please clarify whether the trailing commas that your Excel export 
generates are there to mark the end of the last field, or to mark the 
start of a last field which currently has no value.
I'm not sure how to tell the difference in an Excel spreadsheet.

In the case where this arose for me, I had created a spreadsheet with 
varying numbers of values in different rows, and many of the rows were 
output by Excel with *multiple* trailing commas.  Some rows were generated 
without any trailing commas.  My point would be that if this happens with 
reasonable data then it must be permitted.  Whether it's interpreted as a 
field terminator or as the start of a field with no value is, I think, moot.

#g
--
...
Graham,
-----Original Message-----
From: Graham Klyne [mailto:GK-lists@ninebynine.org]
Sent: Wednesday, March 23, 2005 9:55 AM
To: Yakov Shafranovich; clyde.ingram@edl.uk.eds.com
Cc: ietf-types@alvestrand.no
Subject: Re: Media Type "text/csv": new draft (-02) and Last Call
At 01:14 23/03/05 -0500, Yakov Shafranovich wrote:
...
Clyde,
Thanks for pointing this out. I personally think that instead of making
the header record mandatory which is something that most CSV applications
do not have, I would rather take the comma out of the end of the record
and have the last field end with a CRLF instead of an optional COMMA. Do
you think that is a plausible solution?
No.  Some of the Excel data I process has trailing commas.  This must be
allowed.
I also don't think it's necessary to say anything (other than maybe as a
comment) about any special status for the first line:  such use is
accommodated quite reasonably within the basic CSV format.
For example, having such a line when exporting Excel as CSV depends
entirely upon how the user constructs the original spreadsheet.  Column
headings are common, but not mandatory.  In some cases, there may be a more
complex heading structure -- this is an application issue, not a dataset
format issue, and as such does not belong in the dataset format 
specification.
#g
--
------------
Please clarify whether the trailing commas that your Excel export 
generates are there to mark the end of the last field, or to mark the 
start of a last field which currently has no value.
To take a concrete example, I would expect a CSV of sibling relationships 
in a mythical family to look like this, assuming the siblings are one 
brother (Bart) and 2 sisters (Lisa & Maggie):
    child,sisters,brothers<CR-LF>
    Bart,Lisa & Maggie,<CR-LF>
    Lisa,Maggie,Bart<CR-LF>
    Maggie,Lisa,Bart<CR-LF>
where the trailing comma for the record of child=Bart signifies that the 
"brothers" field is null, so that Bart has no brothers.  In my view this 
is a logical conclusion, and in fact stripping that one trailing comma 
would be an error, as that record would only have 2 fields, not 3.
Would you, however, expect the CSV file to use comma as a 
field-terminator, rather than a field-separator, as follows?:
    child,sisters,brothers,<CR-LF>
    Bart,Lisa & Maggie,,<CR-LF>
    Lisa,Maggie,Bart,<CR-LF>
    Maggie,Lisa,Bart,<CR-LF>
Note that parsers that split data records on unprotected commas would 
detect one field too many in this latter case.
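
To make the difference concrete, here is a minimal Python sketch. The helper
split_fields is hypothetical, and it assumes no field is quoted or contains a
comma, so a plain split stands in for a real CSV parser.

```python
def split_fields(record):
    """Naive split on commas; assumes no quoted or comma-bearing fields."""
    return record.split(",")

# Bart's record with the comma used as a separator (empty last field) ...
separator_style = "Bart,Lisa & Maggie,"
# ... and with the comma used as a terminator after every field.
terminator_style = "Bart,Lisa & Maggie,,"

print(len(split_fields(separator_style)))   # 3 fields, the last one empty
print(len(split_fields(terminator_style)))  # 4 fields, one more than the header names
```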
In a Comma SEPARATED Value file format, can you configure Excel to use 
comma as a SEPARATOR between values, rather than a TERMINATOR (at the end 
of values)?
Regarding your remarks on the header record being "an application issue, 
not a dataset format issue, and as such does not belong in the dataset 
format specification": XML, ASN.1, and other (application-independent) 
data interchange formats,  explicitly tag individual fields so that their 
type is unambiguously defined within a context.  In contrast, CSV conveys 
no tags per field in a data record.  Hence, to help with 
application-independent data interchange, the CSV format should convey 
field titles in a header record.
Here is an example of lack of application-independence: if my application 
sends yours this CSV file:
    ,Bart,Lisa & Maggie<CR-LF>
    Bart,Lisa,Maggie<CR-LF>
    Bart,Maggie,Lisa<CR-LF>
and your application depends on the assumption that the fields are the 
sequence:
    child
    sisters
    brothers
then your application will mis-interpret the data.
But if my application precedes this with a header record, like so:
    brothers,child,sisters<CR-LF>
    ,Bart,Lisa & Maggie<CR-LF>
    Bart,Lisa,Maggie<CR-LF>
    Bart,Maggie,Lisa<CR-LF>
then your application can maintain independence from the change by my 
application, because the CSV file conveys the corresponding new field 
sequence (the "brothers" column has moved ahead of "child" and "sisters").
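
As an illustration, a receiver that looks fields up by the names in the header
record, rather than by position, gets the same result from either column
order. This is a sketch in Python using the standard csv.DictReader; the
function name read_siblings is hypothetical, and it assumes every file starts
with a header record naming the three columns.

```python
import csv
import io

def read_siblings(text):
    """Return (child, sisters, brothers) tuples, located by header name."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [(row["child"], row["sisters"], row["brothers"]) for row in reader]

# The two files from the examples above, without the display indentation.
old_order = (
    "child,sisters,brothers\r\n"
    "Bart,Lisa & Maggie,\r\n"
    "Lisa,Maggie,Bart\r\n"
    "Maggie,Lisa,Bart\r\n"
)
new_order = (
    "brothers,child,sisters\r\n"
    ",Bart,Lisa & Maggie\r\n"
    "Bart,Lisa,Maggie\r\n"
    "Bart,Maggie,Lisa\r\n"
)

# Both column orders yield the same logical records.
print(read_siblings(old_order) == read_siblings(new_order))
```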
Regards,
Clyde
------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact
