Home » excel » php – standardizing CSV file types

php – standardizing CSV file types

Posted by: admin April 23, 2020 Leave a comment

Questions:

I’m using a csv parser class (http://code.google.com/p/php-csv-parser/) to parse and extract data from csv files. The problem I’m encountering is that it only works for certain csv file types. (It seems that there is a csv type for Mac, for Ms-Dos, and for Windows.)

The code works if I use a csv file which was saved on a mac (in excel) using the csv – windows option. However, if I save a file on a windows machine simply as csv, that doesn’t work. (You would think that that would be the same format as saving csv-windows on a mac.) It does work from a windows machine if I save it as a csv-MSDOS file. This seems a little ridiculous.

Is there a way to standardize these three file types so that my code can read any type of csv that is uploaded?

i’m thinking it would be something like this:

$standardizedCSV = preg_replace_all('/\r[^\n]/', '\r\n', $csvContent);

I know it has something to do with how each file type handles end of lines, but I’m a little put out trying to figure out those differences. If anybody has any advice, please let me know.

Thanks.

UPDATE:
This is the relevant code from the csv parser I’m using which extracts data row by row:

$c = 0;
$d = $this->settings['delimiter'];
$e = $this->settings['escape'];
$l = $this->settings['length'];

$res = fopen($this->_filename, 'r');

while ($keys = fgetcsv($res, $l, $d, $e)) {

if ($c == 0) {
   $this->headers = $keys;
} else {
   array_push($this->rows, $keys);
}

$c ++;
}

I guess I need to understand how fgetcsv handles eol’s, so that I can make sure that csv files of any format are handled in the same manner.

How to&Answers:

This seems to do the trick:

    ini_set("auto_detect_line_endings", true);

The problem was with line endings, but I didn’t need to create my own EOL parser. This runtime setting does it for me. See http://us.php.net/manual/en/filesystem.configuration.php#ini.auto-detect-line-endings.

Answer:

I don’t think the line endings is an issue. The thing about CSV is that it’s only a “comma separated values” file and not standardized beyond that. So some systems separate the values using commas, some using semicolons (;). I’m sure there are variations that use even other value separators.

Additionally, the escape character (most often backslash \) can be different between CSV files, and some CSV files also use quotation marks around each value (").

A CSV file can use any variation between the above. For instance, I’m fairly certain that Microsoft Excel exports CSV files separating values using semicolons and without any quotation around the values.

I’m sure there are ways to auto-detect how to parse the CSV file, but the best way would be allowing the user to decide. That’s what Excel does.

Answer:

If you use CSV files, you have to agree on many details which are not properly standardized:

  • Line endings (Unix 0x0a, Macintosh 0x0d, DOS 0x0d 0x0a)
  • Field separators (comma, semicolon etc.)
  • Field quoting (all fields quoted, only string fields, only string fields containing field and line separators)
  • Escaping of double quotes within string fields (doubling of double quotes, backslash character before double quote etc.)
  • Multiline string fields (are they allowed or not)
  • File encoding (ISO-8859-1, UTF-8 etc.)

If you create a CSV reader, you can automatically handle different variations of line endings and field quoting. But the rest has to be known to the CSV parser beforehand.

The defacto standard is the CSV format produced by Excel. However, Excel uses different variations of the format:

  • Usually DOS line endings (but I’ve never tried it with Excel for Macintosh)
  • Field separator depending on the locale. If the comma is used to group the digits in long numbers, Excel uses the semicolon as field separator. Otherwise the comma.
  • Excel uses double quotes if needed.
  • Excel doubles the double quotes within string fields.
  • Excel supports multiline string fields.
  • The file encoding seems to be the file encoding of the current locale. So it varies.