My MySQL database is set to utf8_unicode_ci and I have $pdo->exec('SET NAMES "utf8"') as part of the following PHP code, yet when I echo text from the query, a hyphen (-) looks like this: –. What am I doing wrong? Why is the hyphen not displaying correctly?

<?php    
    try {
        $pdo = new PDO('mysql:host=localhost;dbname=danville_tpf', 'danville_dan', 'password');
        $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $pdo->exec('SET NAMES "utf8"');
    } catch (PDOException $e) {
        $output = 'Unable to connect to the database server.';
        include 'output.html.php';
        exit();
    }

    $output = 'Theme Park Database initialized';
    //include 'output.html.php';//

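    try {
        // Bind park_id as a parameter instead of interpolating user input into the SQL.
        $query = 'SELECT * FROM tpf_parks WHERE park_id = :park_id';
        $result = $pdo->prepare($query);
        $result->execute(array(':park_id' => $_GET['park_id']));
    } catch (PDOException $e) {
        // Stop here: without a result set the loop below cannot run.
        $output = 'Unable to fetch the park from the database.';
        include 'output.html.php';
        exit();
    }

    $output = 'Successfully pulled park';
    //include 'output.html.php';//

    // Initialize the array so it exists even when no row matches.
    $parkdetails = array();
    foreach ($result as $row) {
        $parkdetails[] = array(
            'name' => $row['name'],
            'blurb' => $row['blurb'],
            'website' => $row['website'],
            'address' => $row['address'],
            'logo' => $row['logo']
        );
    }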
?>

Please help.

zajonc
themeparkfocus

1 Answer


– is common mojibake for an en dash (–), which is a different character from a hyphen.

It is the result of taking the UTF-8–encoded form of the dash (0xe2 0x80 0x93) and incorrectly assuming that it is actually encoded using Windows-1252.

Interpreting those three bytes as Windows-1252: 0xe2, 0x80 and 0x93 separately represent â, € and “.
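The misreading described above can be reproduced in a few lines. This is a minimal sketch, assuming the iconv extension is available; the variable names are only for illustration.

```php
<?php
// UTF-8 bytes of an en dash (U+2013).
$enDash = "\xE2\x80\x93";

// Misread those bytes as Windows-1252 and re-encode the result as UTF-8.
$mojibake = iconv('Windows-1252', 'UTF-8', $enDash);

echo bin2hex($enDash), "\n";   // e28093
echo bin2hex($mojibake), "\n"; // c3a2e282ace2809c
echo $mojibake, "\n";          // the three characters â, € and “
```

The second hex string is the doubly encoded form: one UTF-8 sequence for each of the three misread characters.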

Assuming the offending character is in the blurb field, if you query SELECT HEX(blurb) FROM tpf_parks (with a suitable WHERE clause), you will see the hex encoding of the offending bytes.

If you see E28093 in there, then the database value is correctly encoded as UTF-8 and there will be a character encoding mismatch in your client or server configuration (e.g. you're reading it from the DB or displaying it to the browser with mismatched encodings).
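If the stored bytes are fine, one place to look is the charset declared for the connection and for the page. The following is a sketch only, assuming PHP 5.3.6 or later (where the pdo_mysql DSN accepts a charset parameter) and a page that is meant to be served as UTF-8. The helper name utf8Dsn is hypothetical; the database name and credentials are the ones from the question.

```php
<?php
// Hypothetical helper: builds a pdo_mysql DSN that declares the connection charset.
function utf8Dsn($host, $dbname)
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $dbname);
}

$dsn = utf8Dsn('localhost', 'danville_tpf');
echo $dsn, "\n"; // mysql:host=localhost;dbname=danville_tpf;charset=utf8

// Usage (needs a running MySQL server, so it is left commented out):
// $pdo = new PDO($dsn, 'danville_dan', 'password');
// header('Content-Type: text/html; charset=utf-8'); // tell the browser the page is UTF-8
```

Declaring the charset in the DSN covers the database-to-PHP leg; the Content-Type header covers the PHP-to-browser leg. Whether either one is the actual culprit depends on the setup.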

If, however, you see C3A2E282ACE2809C, then the character has already been encoded incorrectly in the database: the original bytes were interpreted as Windows-1252 and then saved as the UTF-8 representation of those three characters. If this is the case, you'll need to update the data to fix the issue. You could do this using iconv:

$fixedData = iconv("utf-8", "windows-1252", $badData);

This converts the doubly encoded bytes back to the original UTF-8 encoding.
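Running that conversion over text that is already correct would damage it, so it may help to guard the call. Below is a rough sketch, assuming the iconv extension is available; fixDoubleEncoded is a hypothetical helper, and the validity check is a heuristic rather than a guarantee.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252" double encoding.
// Returns the input unchanged when the conversion fails or does not yield valid UTF-8.
function fixDoubleEncoded($text)
{
    // @ suppresses the notice iconv raises for characters with no Windows-1252 form.
    $fixed = @iconv('UTF-8', 'Windows-1252', $text);

    // preg_match with the /u modifier only succeeds on valid UTF-8 subjects.
    if ($fixed === false || preg_match('//u', $fixed) !== 1) {
        return $text;
    }
    return $fixed;
}

// Doubly encoded en dash from the database.
echo bin2hex(fixDoubleEncoded("\xC3\xA2\xE2\x82\xAC\xE2\x80\x9C")), "\n"; // e28093

// A correctly encoded en dash is left alone.
echo bin2hex(fixDoubleEncoded("\xE2\x80\x93")), "\n"; // e28093
```

A correctly encoded en dash converts to the single byte 0x96, which is not valid UTF-8 on its own, so the helper falls back to the original string. Text that legitimately contains a sequence such as â€“ would still be altered, so check a sample of rows before updating in bulk.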

cmbuckley
  • Thank you so much for that in-depth explanation. I'm a complete beginner with PHP/MySQL, using a number of books as I build my site. This answer explained a lot about how encoding works and how to identify problems along the way. Turns out I was not using hyphens but en dashes, so it was a problem with the data. Thank you. – themeparkfocus Mar 29 '13 at 01:19
  • Ah, cp-1252 problems... Haven't we all been there ;-) – Chris Wesseling Mar 30 '13 at 10:51