PHP Decode & Encode

In this article I will talk about decoding and encoding text in PHP. I recently ran into an issue trying to get some strings converted to the correct character encoding on a PHP website of mine. Searching Google for help, I found that many others have had similar issues. I’ve concluded that most of these issues could be solved, or at least made less painful, by first understanding character encoding. Make yourself familiar with how character encoding works, why it’s there, and which encoding you are most likely to use on your website.

In the end, the fix to my problem was so simple it’s stupid. But without the right knowledge, a simple fix is still out of reach. Allow me to explain the scenario I faced. I wrote a PHP script that would download some compressed text files from a partner’s server via FTP. Once the files were downloaded, the script would extract them and read them line by line. The files were tab-delimited text (I’ll just call them CSV files here). The script would insert each line from the CSV into a MySQL database table with matching fields. This was all fine and dandy until I noticed that some words and phrases here and there were all screwy.

At first I put off the funky text problem, but eventually it got so annoying that I caved and worked on fixing it. I didn’t know where the data was getting messed up. Maybe it wasn’t stored right in the CSV, maybe it changed in memory when I read the file, maybe it was saved wrong to the database, maybe it was changed when read back from the database, or maybe it was goofed up on output to the page. Looking back on my approach, I see that I wasted a lot of time randomly trying different things to track it down. But what I tried really isn’t important. It is what worked that matters.

If I could start over with advice from future-me, this is what I would do.

Be sure you know what character encoding you are starting with.

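One obvious way would be to ask the source. But if you can’t reach them quickly, or don’t believe them anyway, you can try the mb_detect_encoding() function. Keep in mind that detection is a best guess, and pass true as the third argument to turn on strict mode so that invalid byte sequences are not accepted. A quick example to find out if the source string is UTF-8 encoded:

if (mb_detect_encoding($value, 'UTF-8', true) === 'UTF-8') {
    // $value contains only valid UTF-8 sequences
}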
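Since detection is only a guess, a plain validity check may be more dependable when the question is simply “is this valid UTF-8?”. Here is a minimal sketch, assuming the mbstring extension is loaded. The helper name looksLikeUtf8() is made up for this example.

```php
<?php
// Hypothetical helper: true when every byte sequence in $value is valid UTF-8.
function looksLikeUtf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

// "You’ll" with the curly apostrophe stored as three UTF-8 bytes.
$utf8Bytes = "You\xE2\x80\x99ll";

// The same text with the apostrophe stored as the single Windows-1252 byte 0x92.
$cp1252Bytes = "You\x92ll";

echo looksLikeUtf8($utf8Bytes) ? "valid UTF-8\n" : "not UTF-8\n";
echo looksLikeUtf8($cp1252Bytes) ? "valid UTF-8\n" : "not UTF-8\n";
```

A lone 0x92 byte is not a legal UTF-8 sequence, so the second string fails the check. Note that plain ASCII text passes as UTF-8 too, so a positive result does not prove the source intended UTF-8.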

Store the data correctly.

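If you want to save the data in a database you’ll need to be sure your database schema is set up to match. In my case I confirmed that the text in the CSV files was indeed UTF-8 encoded, so I chose the utf8_general_ci collation (and with it the utf8 character set) for storing the data in my MySQL database. Keep in mind that MySQL’s utf8 character set only holds characters of up to three bytes; utf8mb4 covers all of UTF-8.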
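The schema is only half of it; the connection between PHP and MySQL has a character set of its own. The sketch below shows one way to pin it when connecting through PDO, using the charset option of the MySQL DSN. The helper name mysqlDsn(), the host, and the database name are placeholders, and utf8mb4 is my default here rather than the utf8 set that goes with the collation above.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that pins the connection charset.
function mysqlDsn(string $host, string $database, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Placeholder host and database names, for illustration only.
echo mysqlDsn('localhost', 'partner_feed'), "\n";

// Connecting would then look something like this (credentials omitted):
// $pdo = new PDO(mysqlDsn('localhost', 'partner_feed'), $user, $password);
```

Passing 'utf8' as the third argument would match the utf8_general_ci collation exactly.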

Decode the text correctly.

Probably the easiest option for decoding UTF-8 characters using PHP is the utf8_decode() function (note that it is deprecated as of PHP 8.2). But not too fast now! The description for the function on php.net says “Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1”. Don’t do what I did and get stumped on why the text doesn’t look right coming out the other side of utf8_decode(). I assumed that the characters in my text were all ISO-8859-1 characters, but I was wrong. Before I get to that, there is another way to encode and decode characters: the iconv() function. I correctly decoded/converted my text using the following:

$decodedString = iconv("UTF-8", "WINDOWS-1252//TRANSLIT", $encodedString);

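As you can see I needed to convert to WINDOWS-1252, not ISO-8859-1. Oops. The curly apostrophe in my text exists in Windows-1252 but not in ISO-8859-1, which is why utf8_decode() could never get it right. Read more about the iconv module on php.net. If the module isn’t installed on your server you will need to install it. I think it’s worth it.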
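The difference between the two target encodings is easier to see in bytes. The following is a rough sketch, not the script I actually ran: convertFromUtf8() is a made-up wrapper, it leaves out //TRANSLIT on purpose so that unconvertible characters cause a failure instead of a substitution, and the failure behaviour described is what I would expect from a glibc-based iconv.

```php
<?php
// Hypothetical wrapper: convert UTF-8 text with iconv() and fail loudly
// instead of silently passing along a false return value.
function convertFromUtf8(string $text, string $targetEncoding): string
{
    // iconv() returns false when it cannot perform the conversion;
    // @ suppresses the accompanying notice because we check the result.
    $converted = @iconv('UTF-8', $targetEncoding, $text);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert text to $targetEncoding");
    }
    return $converted;
}

$encoded = "You\xE2\x80\x99ll"; // "You’ll" as UTF-8 bytes

// The curly apostrophe (U+2019) is the single byte 0x92 in Windows-1252.
echo bin2hex(convertFromUtf8($encoded, 'WINDOWS-1252')), "\n";

// ISO-8859-1 has no curly apostrophe, so a strict conversion should fail.
try {
    echo bin2hex(convertFromUtf8($encoded, 'ISO-8859-1')), "\n";
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Printing the result with bin2hex() sidesteps the terminal’s own encoding, which otherwise adds one more place for the text to get mangled.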

Know what encoding you are converting to.

There is a trial-and-error method you can use to find the right output encoding. I couldn’t quickly get an answer from my data provider, so I searched around and stumbled upon something close to this beauty of a snippet.

<?php
// UTF-8 encoded string
$wtfString = "You’ll";

// Loop through possible encodings one by one and output the result
foreach (mb_list_encodings() as $encoding) {
    echo iconv("UTF-8", $encoding . "//TRANSLIT", $wtfString) . " : " . $encoding . "\n";
}

This is a subset of the output:

You’ll : UTF-8
You+AOIgrCEi-ll : UTF-7
 : UTF7-IMAP
You?EUR(TM)ll : ASCII
You���EUR���ll : EUC-JP
You?EUR(TM)ll : SJIS
You���EUR���ll : eucJP-win
You?EUR(TM)ll : SJIS-win
 : CP51932
 : JIS
You?EUR(TM)ll : ISO-2022-JP
 : ISO-2022-JP-MS
You’ll : Windows-1252
You�EUR(TM)ll : ISO-8859-1
You�EUR(TM)ll : ISO-8859-2
You�EUR(TM)ll : ISO-8859-3
You�EUR(TM)ll : ISO-8859-4

This gave me a quick view of what it would look like on the other side. The line I was looking for was “You’ll : Windows-1252”. Bingo!
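The blank lines in that output are encodings that iconv() did not recognise or could not convert to. A slightly more defensive take on the same trial-and-error idea might look like the sketch below. The function name tryEncodings() and the short candidate list are my own choices for illustration; you could pass mb_list_encodings() instead.

```php
<?php
// Hypothetical helper: try each candidate encoding and record either the
// converted bytes (as hex) or null when iconv() cannot do the conversion.
function tryEncodings(string $utf8Text, array $encodings): array
{
    $results = [];
    foreach ($encodings as $encoding) {
        // @ hides the warning or notice iconv() raises for unknown encodings
        // and unconvertible characters; the false return value is checked below.
        $converted = @iconv('UTF-8', $encoding . '//TRANSLIT', $utf8Text);
        $results[$encoding] = ($converted === false) ? null : bin2hex($converted);
    }
    return $results;
}

$sample = "You\xE2\x80\x99ll"; // "You’ll" as UTF-8 bytes

foreach (tryEncodings($sample, ['UTF-8', 'Windows-1252', 'ISO-8859-1']) as $name => $hex) {
    echo str_pad($name, 14), $hex ?? '(conversion failed)', "\n";
}
```

Hex output is less pretty than the raw text, but it shows exactly which bytes each encoding produces regardless of how your terminal or browser renders them.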

Tell web browsers the web page’s character encoding.

To be on the safe side you should tell the visitor’s web browser what encoding to use to display text on your page, instead of letting it guess or do whatever it feels like. Generally speaking it is safe enough to include a meta tag in the head of your web page. You could tell the browser to expect UTF-8 encoding like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Or you could use this with HTML5:

<meta charset="utf-8">

If you are generating XML with ISO-8859-1 you can define the character encoding like this:

<?xml version="1.0" encoding="ISO-8859-1"?>

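To cover all your bases you should also specify the encoding in the HTTP header, before any output is sent. Browsers give the header priority over a meta tag, so the two must name the same encoding. For an ISO-8859-1 page it looks like this: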

header('Content-Type: text/html; charset=iso-8859-1');
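One way to keep the header and the meta tag from drifting apart is to derive both from a single value. This is only a sketch: the constant PAGE_CHARSET and both helper functions are names I made up for the example.

```php
<?php
// Assumed single source of truth for the page encoding.
const PAGE_CHARSET = 'utf-8';

// Hypothetical helper: the Content-Type header line for an HTML page.
function contentTypeHeader(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Hypothetical helper: the matching HTML5 meta tag.
function metaCharsetTag(string $charset): string
{
    return '<meta charset="' . htmlspecialchars($charset, ENT_QUOTES) . '">';
}

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader(PAGE_CHARSET));
}

echo metaCharsetTag(PAGE_CHARSET), "\n";
```

Changing PAGE_CHARSET then updates both declarations at once.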

Conclusion

I hope this helps you out if you find yourself in a similar painful character encoding situation and don’t know what to do. Enjoy!