php – Length of strings in unicode are different

Posted by: admin, July 12, 2020

Questions:

How come the lengths of the following strings are different, although the number of characters in each string is the same?

echo strlen("馐 馑 馒 馓 馔 馕 首 馗 馘")."<BR>";
echo strlen("Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ")."<BR>";

Outputs

35
26
Answers:

The characters in the first string take up three bytes each, because they sit far into the Unicode range, at code points around 39,000, whereas the characters in the second string take only two bytes each, with code points around 400. (The number of bytes/octets required per character is discussed in the UTF-8 Wikipedia article.)

strlen counts the number of bytes in the string, not the number of characters, which is why it gives such odd results for Unicode text.
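
As a rough illustration of the point above, the following sketch prints the code point and the UTF-8 byte count for the first character of each string from the question, plus a plain ASCII letter. It assumes the mbstring extension is available (mb_ord needs PHP 7.2 or later); the variable names are only illustrative.

```php
<?php
// "\u{...}" escapes (PHP 7.0+) always produce UTF-8 bytes, so this sketch
// does not depend on how the source file itself is saved.
$samples = [
    "\u{9990}", // 馐, first character of the first string
    "\u{0190}", // Ɛ, first character of the second string
    "A",        // plain ASCII, for comparison
];

$widths = [];
foreach ($samples as $ch) {
    $codePoint = mb_ord($ch, 'UTF-8'); // numeric code point of the character
    $widths[$codePoint] = strlen($ch); // strlen() returns the number of bytes
    printf("U+%04X (decimal %d): %d byte(s)\n", $codePoint, $codePoint, strlen($ch));
}
// Prints:
// U+9990 (decimal 39312): 3 byte(s)
// U+0190 (decimal 400): 2 byte(s)
// U+0041 (decimal 65): 1 byte(s)
```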

Answer:

I am no PHP expert, but it seems that strlen counts bytes… there is mb_strlen, which counts characters…

EDIT – for further reference on how multi-byte encodings work, see http://en.wikipedia.org/wiki/Variable-width_encoding and, for UTF-8 in particular, http://en.wikipedia.org/wiki/UTF-8
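
The width rule described in the UTF-8 article linked above can be sketched in a few lines. The helper name utf8_width is hypothetical, not a PHP built-in; it only encodes the standard code-point ranges (below U+0080 one byte, below U+0800 two, below U+10000 three, otherwise four) and compares them with what strlen reports. It assumes the mbstring extension (mb_chr needs PHP 7.2 or later).

```php
<?php
// Hypothetical helper: how many bytes UTF-8 needs for a given code point.
function utf8_width(int $codePoint): int
{
    if ($codePoint < 0x80) {
        return 1; // ASCII range
    }
    if ($codePoint < 0x800) {
        return 2; // e.g. U+0190 (Ɛ)
    }
    if ($codePoint < 0x10000) {
        return 3; // e.g. U+9990 (馐)
    }
    return 4;     // code points beyond the Basic Multilingual Plane
}

foreach ([0x41, 0x190, 0x9990, 0x1F600] as $cp) {
    $encoded = mb_chr($cp, 'UTF-8'); // build the character from its code point
    printf("U+%04X: rule says %d, strlen() says %d\n", $cp, utf8_width($cp), strlen($encoded));
}
```

For each of the four sample code points the rule and strlen should agree, which is another way of seeing that strlen is reporting encoded bytes.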

Answer:

It looks like it’s counting the number of bytes in the encoding being used. For example, it looks like the second string is taking two bytes per non-space character, whereas the first string is taking three bytes per non-space character. I would expect:

echo strlen("A B C D E F G H I");

to print out 17 – a single byte per ASCII character.

My guess is that this is all using the UTF-8 encoding – which would certainly be in line with the varying width of representation.
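
The arithmetic behind this answer can be checked directly: each string has 9 characters separated by 8 one-byte spaces, so the expected byte length is 9 × (bytes per character) + 8. The sketch below assumes UTF-8 and the mbstring extension; spaced_run is a hypothetical helper that rebuilds the three strings from their starting code points.

```php
<?php
// Hypothetical helper: $count consecutive code points starting at $first,
// joined by single spaces and encoded as UTF-8.
function spaced_run(int $first, int $count): string
{
    $chars = [];
    for ($i = 0; $i < $count; $i++) {
        $chars[] = mb_chr($first + $i, 'UTF-8');
    }
    return implode(' ', $chars);
}

// [string, assumed bytes per non-space character]
$cases = [
    [spaced_run(0x9990, 9), 3], // 馐 … 馘
    [spaced_run(0x0190, 9), 2], // Ɛ … Ƙ
    [spaced_run(0x41, 9), 1],   // A … I
];

$lengths = [];
foreach ($cases as [$text, $bytesPerChar]) {
    $expected = 9 * $bytesPerChar + 8; // 9 characters plus 8 spaces
    $lengths[] = strlen($text);
    printf("%s => expected %d, strlen() %d\n", $text, $expected, strlen($text));
}
// The strlen() values come out as 35, 26 and 17.
```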

Answer:

According to this post on php.net/strlen, PHP interprets all strings passed to strlen as if they were ASCII, that is, one byte per character, so the result is a byte count.

Answer:

Use mb_strlen: it counts characters in the provided encoding, not bytes as strlen does.
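
A minimal sketch of the difference, assuming the text is UTF-8 and the mbstring extension is loaded. Passing the encoding explicitly avoids relying on the configured default; the '8bit' encoding makes mb_strlen count bytes, which matches strlen.

```php
<?php
// "馐 馑 馒": three 3-byte characters separated by two spaces.
$text = "\u{9990} \u{9991} \u{9992}";

$bytes      = strlen($text);             // counts bytes
$characters = mb_strlen($text, 'UTF-8'); // counts characters, decoding as UTF-8
$eightBit   = mb_strlen($text, '8bit');  // treats every byte as one character

echo $bytes, PHP_EOL;      // 11
echo $characters, PHP_EOL; // 5
echo $eightBit, PHP_EOL;   // 11
```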