javascript - JS charCodeAt equivalent in PHP (with full unicode and emoji compatibility)

Posted by: admin, July 12, 2020


I have a simple piece of JS code that I can't replicate in PHP when it comes to special characters.

This is the JS code (see JSFiddle for output):

var str = "t🙏🏿😘🎚↙️🕗🇨🇬芳"; //char "t" and special characters, emojis, etc..
document.write("Length is: "+str.length); // Length is: 19
for(var i=0; i<str.length; i++) {
  document.write("<br> charCodeAt(" + i + "): " + str.charCodeAt(i));
}

The first problem is that PHP's strlen() and mb_strlen() already give different results from JS (strlen: 39, mb_strlen: 11). However, I managed to get the same length with a custom JS_StringLength function (thanks to this SO answer).

Here is what I have in PHP so far (see phpFiddle for output):


function JS_StringLength($string) {
    return strlen(iconv('UTF-8', 'UTF-16LE', $string)) / 2;
}

function JS_charCodeAt($str, $index){
    //not working!

    $char = mb_substr($str, $index, 1, 'UTF-8');
    if (mb_check_encoding($char, 'UTF-8')) {
        $ret = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
        return hexdec(bin2hex($ret));
    } else {
        return null;
    }
}

$str = "t🙏🏿😘🎚↙️🕗🇨🇬芳";

echo $str."\n";
//echo "Length is: ".strlen($str)."\n"; //wrong
echo "Length is: ".JS_StringLength($str)."\n"; //OK
for($i=0; $i<JS_StringLength($str); $i++) {
    echo "charCodeAt(".$i."): ".JS_charCodeAt($str, $i)."\n";
}

After a full day of Googling, and trying out everything I found, nothing gave the same results as JS.
What should JS_charCodeAt be to get the same output as JS with similar performance?

Experimenting #1:
Enter my string into https://r12a.github.io/app-conversion/ (awesome stuff). Looks like JS works with UTF-16 code units (19) and PHP strlen counts UTF-8 code units (39).
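
To make the three counts concrete, here is a minimal sketch. It assumes PHP 7.0+ (for the `\u{...}` escapes) with the mbstring and iconv extensions enabled. The string is the one from the question, written as code point escapes so it survives copy and paste; `$lengths` is just an illustrative variable name.

```php
<?php
// The test string from the question, spelled out as Unicode code point escapes.
$str = "t\u{1F64F}\u{1F3FF}\u{1F618}\u{1F39A}\u{2199}\u{FE0F}\u{1F557}\u{1F1E8}\u{1F1EC}\u{2F994}";

$lengths = [
    // UTF-8 code units (bytes): what strlen() counts
    'utf8_bytes'  => strlen($str),
    // Unicode code points: what mb_strlen() counts
    'code_points' => mb_strlen($str, 'UTF-8'),
    // UTF-16 code units: what JavaScript's str.length counts
    'utf16_units' => intdiv(strlen(iconv('UTF-8', 'UTF-16LE', $str)), 2),
];

print_r($lengths);
```

The three numbers differ because each function counts a different unit of the same text, not because any of them is wrong.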

Experimenting #2:
When using json_encode() on my string, the result is, of course, very close to what JavaScript appears to use. I even examined the original PHP source code of json_encode and how it escapes strings, but.. well..

Before flagging as a duplicate, please make sure you test a solution with the string in the above examples (or random emojis) as ALL the charCodeAt implementations found here on stackoverflow are working with most of the special characters, but NOT with emojis.

How to & Answers:

The way that JS handles UTF-16 is not ideal; charCodeAt is picking out code units for you, including surrogates in the emoji cases. If you want the real code point for each character, String.codePointAt() would be a better choice. That said, since your use case wasn't explained, this achieves what you were originally asking for without the need for JSON-related functions:


$original = 't🙏🏿😘🎚↙️🕗🇨🇬芳';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < iconv_strlen($converted, 'UTF-16LE'); $i++) {
    $character = iconv_substr($converted, $i, 1, 'UTF-16LE');
    $codeUnits = unpack('v*', $character);

    foreach ($codeUnits as $codeUnit) {
        echo $codeUnit . PHP_EOL;
    }
}

This converts the (assumed) UTF-8 string into UTF-16, then loops over each character. In UTF-16, each character is 2 or 4 bytes in size. unpack() with the repeating v* format returns one unsigned short in the former case and two in the latter (v is the format code for an unsigned 16-bit little-endian short).
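
As a small illustration of the one-or-two-shorts behaviour, the sketch below unpacks a 2-byte character and a 4-byte character. `utf16Units()` is a hypothetical helper name used only here, and the iconv extension is assumed to be available.

```php
<?php
// utf16Units() is a hypothetical helper for this illustration only:
// convert one character to UTF-16LE and unpack it into 16-bit values.
function utf16Units(string $char): array
{
    // unpack() returns 1-based keys; array_values() re-indexes from 0.
    return array_values(unpack('v*', iconv('UTF-8', 'UTF-16LE', $char)));
}

$bmp    = utf16Units('t');          // 2 bytes in UTF-16: one code unit
$astral = utf16Units("\u{1F618}");  // 4 bytes in UTF-16: a surrogate pair

echo implode(', ', $bmp), PHP_EOL;
echo implode(', ', $astral), PHP_EOL;
```

The second call yields the high and low surrogate, which are the same two values JavaScript reports through charCodeAt(0) and charCodeAt(1) for that emoji.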

It could also be implemented by looping over the UTF-8 and converting each character one-by-one; it doesn't make a great deal of difference though. Also the same could be achieved with the mb_* functions.


Since you've inquired about a quicker way of doing this, combining the above with the solution offered by nwellnhof gives better performance:


$original = 't🙏🏿😘🎚↙️🕗🇨🇬芳';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < strlen($converted); $i += 2) {
        $codeUnit = ord($converted[$i]) + (ord($converted[$i+1]) << 8);
        echo $codeUnit . PHP_EOL;
}

First off, this converts the UTF-8 string into UTF-16LE. We're interested in writing out UTF-16 code units (as per the behaviour of charCodeAt()), and each of these is represented by 16 bits. The loop simply jumps 2 bytes at a time. On each iteration, it takes the numeric value of the byte at that position and adds it to the next byte, left shifted by 8. The second byte is the one shifted because we're dealing with little-endian UTF-16.

By way of example, consider the character BENGALI DIGIT ONE (১). This is represented by a single UTF-16 code unit, 2535. It is easier to first describe how this is encoded as UTF-16BE. The single code unit for this character would consume 16 bits:

0000100111100111 (2535)

In PHP, strings are effectively byte arrays. So, PHP sees this as:

$converted[0] = 00001001 (9)
$converted[1] = 11100111 (231)

Given the 2 above bytes, how do we obtain the code unit? What we really want to do is something like:

   0000100100000000 (2304)
+          11100111 (231)
=  0000100111100111 (2535)

But we can't do that, since we only have single bytes to play with. One way to deal with this is to use integers instead, giving us a full 64 bits (8 bytes). We want to represent the code unit in integer form anyway, so that seems like a reasonable route. We can obtain the numeric value of each byte via ord():

ord($converted[0]) == 0000000000000000000000000000000000000000000000000000000000001001 == 9
ord($converted[1]) == 0000000000000000000000000000000000000000000000000000000011100111 == 231

And left shift the first value by 8:

   0000000000000000000000000000000000000000000000000000000000001001 (9) 
<< 0000000000000000000000000000000000000000000000000000000000001000 (8)
=  0000000000000000000000000000000000000000000000000000100100000000 (2304)

And then sum together, as before:

   0000000000000000000000000000000000000000000000000000100100000000 (2304)
+  0000000000000000000000000000000000000000000000000000000011100111 (231)
=  0000000000000000000000000000000000000000000000000000100111100111 (2535)

So we now have the correct code unit value of 2535. The only difference with UTF-16LE is the order of the bytes is reversed. So instead of left shifting the first byte by 8, we need to left shift the second byte.
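
The worked example above can be checked with a few lines of PHP. This is only a sketch: the two byte strings are written out by hand for U+09E7, and the variable names are illustrative.

```php
<?php
// U+09E7 BENGALI DIGIT ONE as raw bytes in both byte orders.
$be = "\x09\xE7"; // UTF-16BE: high byte first
$le = "\xE7\x09"; // UTF-16LE: low byte first

// Big endian: the first byte is the high one, so shift it.
$fromBe = (ord($be[0]) << 8) + ord($be[1]);

// Little endian: the second byte is the high one, so shift that instead.
$fromLe = ord($le[0]) + (ord($le[1]) << 8);

echo $fromBe, PHP_EOL;
echo $fromLe, PHP_EOL;
```

Both lines print 2535, which shows that only the position of the shifted byte changes between the two byte orders.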

P.S: An equivalent way of performing this step would be to do

for ($i = 0; $i < strlen($converted); $i += 2) {
        // unpack() returns a 1-indexed array, so take element 1
        $codeUnit = unpack('v', $converted[$i] . $converted[$i+1])[1];
        echo $codeUnit . PHP_EOL;
}

The unpack function does exactly what was just described when the v format is supplied, which tells it to expect 16 bits arranged in little-endian order. It may be worth benchmarking the two if you're interested in optimising for speed.
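
A rough way to compare the two variants is sketched below. The input string, the repetition count, and the variable names are arbitrary choices for illustration, and the timings depend heavily on the PHP version and machine, so treat the numbers as indicative only. The iconv extension is assumed.

```php
<?php
// Arbitrary sample input: "t", an emoji and BENGALI DIGIT ONE, repeated.
$original  = str_repeat("t\u{1F618}\u{09E7}", 10000);
$converted = iconv('UTF-8', 'UTF-16LE', $original);
$length    = strlen($converted);

// Variant 1: ord() plus a left shift.
$start  = microtime(true);
$viaOrd = [];
for ($i = 0; $i < $length; $i += 2) {
    $viaOrd[] = ord($converted[$i]) + (ord($converted[$i + 1]) << 8);
}
$ordSeconds = microtime(true) - $start;

// Variant 2: unpack() with the "v" format on each 2-byte slice.
$start     = microtime(true);
$viaUnpack = [];
for ($i = 0; $i < $length; $i += 2) {
    $viaUnpack[] = unpack('v', $converted[$i] . $converted[$i + 1])[1];
}
$unpackSeconds = microtime(true) - $start;

printf(
    "ord: %.4fs, unpack: %.4fs, same result: %s\n",
    $ordSeconds,
    $unpackSeconds,
    $viaOrd === $viaUnpack ? 'yes' : 'no'
);
```

Checking that both arrays are identical before comparing the timings guards against measuring two loops that do not actually compute the same thing.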


Ok, so after almost two days, I think I've found an answer myself.
The basic idea is that json_encode() escapes multibyte Unicode characters in the same form that JS uses for them (like 😘 = "\ud83d\ude18") for character counting, for the charCodeAt function, etc. So if we JSON encode the string, we can extract an array of simple characters and escaped multibyte chars. This way, we can easily count the characters of the original string as UTF-16 code units (just like JS does). And of course, we can return the “charCodeAt” values (ord() on simple characters, and converting \uXXXX from hex to decimal on multibyte characters).

Problem: If I want to get the โ€œJS charCodeAtโ€ value for every character in a for loop (so basically convert a string to charcode list), this code will be slow on long texts, because preg_match_all in getUTF16CodeUnits will run once for every single character.
Workaround: Instead of calling getUTF16CodeUnits every time, store the matches array in a variable, and work with that. More details: FASTER VERSION (backup)

Code and demo:


function getUTF16CodeUnits($string) {
    $string = substr(json_encode($string), 1, -1);
    // Four backslashes in the PHP literal give the regex \\u: a literal backslash followed by "u"
    preg_match_all("/\\\\u[0-9a-fA-F]{4}|./mi", $string, $matches);
    return $matches[0];
}

function JS_StringLength($string) {
    return count(getUTF16CodeUnits($string));
}

function JS_charCodeAt($string, $index) {
    $utf16CodeUnits = getUTF16CodeUnits($string);
    $unit = $utf16CodeUnits[$index];

    if(strlen($unit) > 1) {
        $hex = substr($unit, 2);
        return hexdec($hex);
    } else {
        return ord($unit);
    }
}

$str = "t🙏🏿😘🎚↙️🕗🇨🇬芳";

echo "Length is: ".JS_StringLength($str)."\n";
for($i=0; $i<JS_StringLength($str); $i++) {
    echo "charCodeAt(".$i."): ".JS_charCodeAt($str, $i)."\n";
}

Improvements, fixes, comments are highly appreciated!
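
The workaround mentioned above (compute the code units once, then index into the stored array) might look like the sketch below. Note that it swaps in iconv() plus unpack() for the json_encode() and preg_match_all() step, so it is a different technique from the code in this answer, not a copy of the linked faster version. `jsCodeUnits()` is a hypothetical helper name.

```php
<?php
// jsCodeUnits() is a hypothetical helper: it returns every UTF-16 code unit
// of a UTF-8 string as a 0-indexed array of integers.
function jsCodeUnits(string $string): array
{
    if ($string === '') {
        return []; // nothing to convert
    }
    return array_values(unpack('v*', iconv('UTF-8', 'UTF-16LE', $string)));
}

$str   = "t\u{1F618}";
$units = jsCodeUnits($str); // convert once...

for ($i = 0, $n = count($units); $i < $n; $i++) {
    // ...then look up each index without re-processing the string.
    echo "charCodeAt($i): {$units[$i]}", PHP_EOL;
}
```

With the array stored in `$units`, the loop does one conversion for the whole string instead of one per index.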


If you really want an equivalent of JavaScript's charCodeAt method, try:

function JS_charCodeAt($str, $index) {
    $utf16 = mb_convert_encoding($str, 'UTF-16LE', 'UTF-8');
    return ord($utf16[$index*2]) + (ord($utf16[$index*2+1]) << 8);
}

But charCodeAt is problematic and should be replaced with codePointAt. Most JavaScript code dealing with characters in the supplementary Unicode planes like Emojis and using charCodeAt is probably wrong. You can find code emulating codePointAt in the answers to the question UTF-8 safe equivalent of ord or charCodeAt() in PHP.
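
For comparison, a code point based listing could be sketched as follows. `codePoints()` is a hypothetical helper, not taken from the linked question, and it assumes PHP 7.2+ with mbstring for mb_ord(). It corresponds to reading codePointAt(0) on each whole character when iterating a JS string with for...of, not to calling codePointAt() at every code unit index.

```php
<?php
// codePoints() is a hypothetical helper: it lists the Unicode code point of
// every character in a UTF-8 string.
function codePoints(string $string): array
{
    // The "u" modifier makes preg_split() break the string on code point boundaries.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    return array_map(function ($ch) {
        return mb_ord($ch, 'UTF-8');
    }, $chars);
}

$points = codePoints("t\u{1F618}");
echo implode(', ', $points), PHP_EOL;
```

Here the emoji yields a single value (U+1F618) instead of the two surrogates that a charCodeAt-style loop reports.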