I ran into a problem in PHP when parsing CSV strings that contain German umlauts (ä, ö, ü, Ä, Ö, Ü).

Assume the following CSV input string:

w;x;y;z
48;OSL;Oslo Stock Exchange;B
49;OTB;Österreichische Termin- und Optionenbörse;C
50;VIE;Wiener Börse;D

And here is the PHP code used to parse the string into an array that holds the data from the CSV string:

public static function parseCSV($csvString) {
    $rows = str_getcsv($csvString, "\n");
    // Remove headers ..
    $header = array_shift($rows);
    $cols = str_getcsv($header, ';');
    if(!$cols || count($cols)!=4) {
        return null;
    }
    // Parse rows ..
    $data = array();
    foreach($rows as $row) {
        $cols = str_getcsv($row, ';');
        $data[] = array('w'=>$cols[0], 'x'=>$cols[1], 'y'=>$cols[2], 'z'=>$cols[3]);
    }
    if(count($data)>0) {
        return $data;
    }
    return null;
}

Calling the above function with the given CSV string produces:

Array
(
    [0] => Array
        (
            [w] => 48
            [x] => OSL
            [y] => Oslo Stock Exchange
            [z] => B
        )

    [1] => Array
        (
            [w] => 49
            [x] => OTB
            [y] => sterreichische Termin- und Optionenbörse
            [z] => C
        )

    [2] => Array
        (
            [w] => 50
            [x] => VIE
            [y] => Wiener Börse
            [z] => D
        )
)

Note that the second entry is missing the Ö. This only happens if the umlaut comes directly after the column separator character. It also happens if several umlauts appear in sequence, e.g. "ÖÖÖsterreich" -> "sterreich". The CSV string is submitted through an HTML form, so the content gets URL-encoded. I use a Linux server with UTF-8 encoding, and the CSV string looks correct before parsing.

Any ideas?

Javaguru
  • cannot reproduce. works for me. http://codepad.viper-7.com/v6WIaT – Gordon Jul 05 '11 at 07:24
  • It is an encoding problem. I tried placing the string directly in the PHP file, saved with UTF-8 encoding, and then it worked. Now I call $csvString = utf8_encode($csvString); before the parsing code, and it works like a charm. – Javaguru Jul 05 '11 at 07:41
  • I guess I should ensure that all form data is encoded as UTF-8, using the meta tag and an appropriate HTTP response header. – Javaguru Jul 05 '11 at 07:49
  • You can also set the form's accepted charset in your HTML: [`accept-charset`](http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset) – hakre Jul 05 '11 at 08:17
  • Works on Windows, but I have this problem on a Linux machine. – Josef Sábl May 20 '13 at 14:48
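
The fix described in the comments above can be sketched roughly as follows. This is only an illustration: it assumes that any input which is not valid UTF-8 is ISO-8859-1, and the helper name `ensureUtf8` is made up for this example. It uses `mb_convert_encoding()` (requires the mbstring extension) rather than `utf8_encode()`, which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: returns the input as UTF-8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function ensureUtf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

// "\xD6" is "Ö" and "\xF6" is "ö" in ISO-8859-1;
// this simulates a form value that arrived in the wrong encoding.
$latin1 = "49;OTB;\xD6sterreichische Termin- und Optionenb\xF6rse;C";
$utf8 = ensureUtf8($latin1);
```

Running the result through the helper a second time leaves it unchanged, so it should be safe to apply to every incoming form value before parsing.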

2 Answers

Assuming fgetcsv (http://php.net/manual/en/function.fgetcsv.php) works similarly to str_getcsv(), then to quote the manual page:

Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.

So you should try setting a locale with setlocale: http://php.net/manual/en/function.setlocale.php

If this doesn't work, try enabling multibyte function overloading: http://www.php.net/manual/en/mbstring.overload.php (note that this feature is deprecated as of PHP 7.2 and was removed in PHP 8.0).

Or, even better, use a library from an established framework such as Zend or Symfony to pull the data out.
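
As a rough sketch of the setlocale suggestion: the locale names below are assumptions and depend on what is installed on the server (`locale -a` lists them on Linux). Whether this alone fixes the missing umlaut depends on the encoding of the incoming data, so treat it as a starting point rather than a guaranteed fix.

```php
<?php
// Candidate UTF-8 locales; which ones exist depends on the server.
$candidates = ['de_DE.UTF-8', 'en_US.UTF-8', 'C.UTF-8'];

// setlocale() accepts an array and returns the name of the first locale
// it could set, or false if none of them is available.
$active = setlocale(LC_CTYPE, $candidates);
if ($active === false) {
    error_log('No UTF-8 locale available; CSV parsing may mangle umlauts.');
}

// Parse one row with an explicit delimiter, enclosure and escape character.
$row = "50;VIE;Wiener Börse;D";
$cols = str_getcsv($row, ';', '"', '\\');
```

Only `LC_CTYPE` is changed here, so number and date formatting elsewhere in the application keep their current locale.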

gingerCodeNinja

I had a similar issue with the ï character in some data that originated from Microsoft Excel and was saved as a CSV (yes, with UTF-8 encoding selected in the "Web Options" part of the "Save As..." dialog). Still, this appears not to be the same UTF-8 encoding that str_getcsv expects.

I now run everything through iconv first and it works fine. Something seems to be off with Excel's idea of a CSV file:

iconv -f windows-1252 -t utf8 source.csv > output.csv
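
The same conversion can probably be done inside PHP with the `iconv()` function instead of the command-line tool. A minimal sketch, assuming the source bytes really are Windows-1252 as in the command above (the sample string is made up for illustration):

```php
<?php
// Assumption: the input is Windows-1252, as in the shell command above.
// "\xEF" is "ï" in Windows-1252.
$cp1252 = "na\xEFve;1";

// iconv() returns false on failure, so check before using the result.
$utf8 = iconv('Windows-1252', 'UTF-8', $cp1252);
if ($utf8 === false) {
    throw new RuntimeException('Conversion from Windows-1252 failed.');
}
```

After the conversion, `$utf8` can be handed to str_getcsv() like any other UTF-8 string.
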
Coder