I would like to detect the encoding of some text (using PHP).
For that purpose I use the mb_detect_encoding() function.
The problem is that the function returns different results if I change the order of the candidate encodings with mb_detect_order().
Consider the following example:
$html = <<<STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;

// Candidate encodings, tried in this order.
mb_detect_order(array('UTF-8', 'EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP', 'ISO-8859-1', 'ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'UTF-8'
However, if you change the order of the encodings in mb_detect_order(), the result is different:
// Same list, but with 'EUC-JP' moved in front of 'UTF-8'.
mb_detect_order(array('EUC-JP', 'UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP', 'ISO-8859-1', 'ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'EUC-JP'
So my questions are:
Why is that happening?
Is there a way in PHP to correctly and unambiguously detect the encoding of a text?
That’s what I would expect to happen.
The detection algorithm probably just tries, in order, the encodings you specified in mb_detect_order() and then returns the first one under which the byte stream would be valid.
Something more intelligent requires statistical methods (I think machine learning is commonly used).
EDIT: See e.g. this article for more intelligent methods.
Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).
Not really. Different encodings often have large areas of overlap, and if the string you are testing lies entirely inside that overlap, then both encodings are acceptable.
For example, UTF-8 and ISO-8859-1 are the same for the letters a-z. The string “hello” would have an identical sequence of bytes in both encodings.
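To make the overlap concrete, here is a small sketch that re-encodes two sample strings and compares the raw bytes. The sample strings and variable names are my own, chosen only for illustration.

```php
<?php
// Two sample strings given as UTF-8 bytes (illustrative values).
$ascii  = 'hello';
$accent = "\xC3\xA9"; // "é" encoded as UTF-8

// Re-encode both strings from UTF-8 to ISO-8859-1.
$asciiLatin1  = mb_convert_encoding($ascii, 'ISO-8859-1', 'UTF-8');
$accentLatin1 = mb_convert_encoding($accent, 'ISO-8859-1', 'UTF-8');

// Pure ASCII text is byte-for-byte identical in both encodings...
echo bin2hex($ascii), ' vs ', bin2hex($asciiLatin1), PHP_EOL;

// ...while a character outside ASCII gets different bytes.
echo bin2hex($accent), ' vs ', bin2hex($accentLatin1), PHP_EOL;
```

For “hello” no detector can tell the two encodings apart, because there is nothing in the bytes to distinguish them.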
This is exactly why there is an mb_detect_order() function in the first place, as it allows you to say what you would prefer to happen when these clashes occur. Would you like “hello” to be UTF-8 or ISO-8859-1?
Keep in mind that mb_detect_encoding() does not know what encoding the data is in. You may see a string, but the function itself only sees a stream of bytes. From those bytes it has to guess the encoding: for example, the data could be ASCII if every byte is in the 0-127 range, it could be UTF-8 if every byte of 128 or above is part of a well-formed multi-byte sequence (a lead byte followed by the right number of continuation bytes), and so forth.
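The kind of byte-level reasoning described above can be sketched roughly as follows. The helper names are hypothetical, and this is only an illustration of the idea, not how mbstring is actually implemented.

```php
<?php
// Hypothetical helpers, for illustration only.

// ASCII: every byte must be in the range 0-127.
function looksLikeAscii(string $bytes): bool
{
    return preg_match('/^[\x00-\x7F]*$/D', $bytes) === 1;
}

// UTF-8: bytes of 128 and above must form well-formed multi-byte sequences.
// mb_check_encoding() performs that validation for us.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// Summarise what the bytes alone allow us to say.
function describeBytes(string $bytes): string
{
    if (looksLikeAscii($bytes)) {
        return 'ASCII (and therefore also valid UTF-8 and ISO-8859-1)';
    }
    if (looksLikeUtf8($bytes)) {
        return 'UTF-8 is plausible';
    }
    return 'not UTF-8; probably a single-byte or legacy multi-byte encoding';
}

echo describeBytes('hello'), PHP_EOL;
echo describeBytes("\xC3\xA9"), PHP_EOL; // "é" as UTF-8
echo describeBytes("\xE9"), PHP_EOL;     // "é" as ISO-8859-1
```

Note that every answer is phrased as “plausible” rather than “is”: a valid byte pattern only rules encodings out, it never proves one.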
As you can imagine, given that context, it’s quite difficult to detect an encoding reliably.
Like rihk said, this is what the mb_detect_order() function is for: you’re basically supplying your best guess as to what the data is likely to be. Do you work with UTF-8 files frequently? Then chances are your data isn’t UTF-16, even if mb_detect_encoding() could guess it as that.
Example case: Internet Explorer does some interesting encoding guessing if nothing is specified (see the linked section ‘To automatically detect a website’s language’), and in the past this caused strange behaviour on websites that took their encoding for granted. You can probably find some amusing stuff on that if you google around. It makes for a nice showcase of how even statistical methods can backfire horribly, and why encoding-guessing in general is problematic.
mb_detect_encoding() looks at the first charset entry in your mb_detect_order() list and then loops through your input $html, checking character by character whether each one falls within the valid set of characters for that charset. If every character matches, it returns that charset; if any character fails, it moves on to the next charset in the list and tries again.
The wikipedia list of charsets is a good place to see the characters that make up each charset.
Because these charsets overlap (the byte sequence 0xC3 0xA9, for example, is valid in both ‘UTF-8’ and ‘EUC-JP’), such a sequence is considered a match even though it represents a totally different character in each charset. So unless the input contains a value that is valid in one charset but not in another, mb_detect_encoding() can’t tell which of the charsets is wrong, and it returns the first charset from your list that could be valid.
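The “first valid charset wins” behaviour described in this answer can be imitated with a few lines of PHP. This is a sketch of the description above, with a hypothetical helper name; it is not the actual mbstring implementation, which may apply additional heuristics depending on the PHP version.

```php
<?php
// Hypothetical helper: return the first candidate encoding under which
// $bytes is a valid byte sequence, or null if none of them fits.
function firstValidEncoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// These two bytes are "é" in UTF-8 but "Ã©" in ISO-8859-1.
$bytes = "\xC3\xA9";

// Both encodings accept the bytes, so the order alone decides the answer.
echo firstValidEncoding($bytes, ['UTF-8', 'ISO-8859-1']), PHP_EOL;
echo firstValidEncoding($bytes, ['ISO-8859-1', 'UTF-8']), PHP_EOL;
```

Swapping the two candidates changes the result, which is the same effect the question shows with ‘UTF-8’ and ‘EUC-JP’.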
As far as I’m aware, there is no surefire way of identifying a charset. PHP’s “best guess” method can be helped if you have a reasonable idea of what charsets you are likely to encounter, and order your list accordingly based on the gaps (invalid characters) in each charset.
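As a rough sketch of that advice, you can pass your own ordered candidate list directly to mb_detect_encoding() and enable its strict mode (the third argument). The wrapper function and the candidate list below are assumptions for illustration; adapt the list to the data you actually expect.

```php
<?php
// Hypothetical wrapper around mb_detect_encoding().
// $candidates should be ordered from most to least likely for your data,
// with permissive encodings (such as ISO-8859-1, which accepts any byte) last.
function detectOrFail(string $bytes, array $candidates): string
{
    // Third argument true = strict mode: candidates under which the
    // input is not valid are rejected.
    $encoding = mb_detect_encoding($bytes, $candidates, true);
    if ($encoding === false) {
        throw new RuntimeException('None of the candidate encodings fits the input.');
    }
    return $encoding;
}

// "été" encoded as ISO-8859-1: the lone 0xE9 bytes are not valid UTF-8,
// so UTF-8 is rejected and the next candidate is used.
echo detectOrFail("\xE9t\xE9", ['UTF-8', 'ISO-8859-1']), PHP_EOL;
```

This still only narrows the choice down; when several candidates accept the input, the result remains a guess guided by your ordering.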
The best solution is to “know” the charset. If you are scraping your HTML from another page, look for the charset identifier in the header of that page.
If you really want to be clever, you can try to identify the language in which the HTML is written, perhaps using trigrams or n-grams or similar, as described in this article on PHP/ir.
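A minimal sketch of looking for a declared charset, assuming you already have either the Content-Type header value or the first part of the HTML as a string. The function name is hypothetical, and the pattern is deliberately simple; it does not cover every way a page can declare its encoding.

```php
<?php
// Hypothetical helper: extract a declared charset from a Content-Type
// header value or from a chunk of HTML. Returns null if none is declared.
function declaredCharset(string $text): ?string
{
    // Matches charset=..., with optional spaces and optional quotes.
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $text, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}

echo declaredCharset('text/html; charset=EUC-JP'), PHP_EOL;
echo declaredCharset('<meta charset="utf-8">'), PHP_EOL;
```

If a charset is declared, you can still confirm it with mb_check_encoding() before trusting it, since pages sometimes declare one encoding and serve another.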