Why do the following strings have different lengths even though they contain the same number of characters?
echo strlen("馐 馑 馒 馓 馔 馕 首 馗 馘")."<BR>"; echo strlen("Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ")."<BR>";
The characters in the first string take three bytes each in UTF-8 because their code points sit around 39,000 (U+9990 to U+9998), whereas those in the second string take only two bytes each, with code points around 400 (U+0190 to U+0198). UTF-8 uses two bytes for U+0080 to U+07FF and three bytes for U+0800 to U+FFFF. (The number of bytes/octets required per character is discussed in the UTF-8 Wikipedia article.)
strlen counts the bytes in the string, not the characters, which is why it gives such odd results for multi-byte Unicode text.
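To make the arithmetic concrete, here is a minimal sketch. It assumes the script file itself is saved as UTF-8, so the string literals contain UTF-8 bytes; the variable names are only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literals below are UTF-8 bytes.
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘"; // 9 characters of 3 bytes each, plus 8 spaces
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";         // 9 characters of 2 bytes each, plus 8 spaces

echo strlen("馐"), "\n";    // 3 bytes for one character
echo strlen("Ɛ"), "\n";     // 2 bytes for one character
echo strlen($cjk), "\n";    // 9 * 3 + 8 = 35
echo strlen($latin), "\n";  // 9 * 2 + 8 = 26
```

Both strings hold 17 characters (9 symbols and 8 spaces), but strlen reports 35 and 26 because it adds up bytes.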
I am no PHP expert, but it seems that strlen counts bytes, whereas mb_strlen counts characters.
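A short sketch of the difference, assuming the mbstring extension is loaded and the script file is saved as UTF-8 (the encoding is passed explicitly so the result does not depend on the configured default):

```php
<?php
// Assumptions: mbstring is available and this file is saved as UTF-8.
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘";
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";

// strlen counts bytes, so the two results differ.
echo strlen($cjk), " ", strlen($latin), "\n";

// mb_strlen counts characters (code points) in the given encoding,
// so both strings come out as 17.
echo mb_strlen($cjk, 'UTF-8'), " ", mb_strlen($latin, 'UTF-8'), "\n";
```

Note that mb_strlen counts code points, so a character built from a base letter plus a combining mark would still count as two.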
EDIT – for further reference on how multi-byte encoding works see http://en.wikipedia.org/wiki/Variable-width_encoding and esp. UTF8 see http://en.wikipedia.org/wiki/UTF-8 and
It looks like it’s counting the number of bytes in the encoding being used. For example, it looks like the second string is taking two bytes per non-space character, whereas the first string is taking three bytes per non-space character. I would expect:
echo strlen("A B C D E F G H I");
to print out 17 – a single byte per ASCII character.
My guess is that this is all using the UTF-8 encoding – which would certainly be in line with the varying width of representation.
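If the UTF-8 guess is right, the width of each character follows directly from its code point. The sketch below uses a hypothetical helper, utf8_width (not a PHP built-in), and checks it against strlen using "\u{...}" escapes, which need PHP 7 or later and always produce UTF-8 bytes regardless of how the file is saved.

```php
<?php
// Hypothetical helper (not part of PHP): number of bytes UTF-8 uses
// to encode a single code point.
function utf8_width(int $codePoint): int
{
    if ($codePoint < 0x80) {
        return 1; // ASCII range
    }
    if ($codePoint < 0x800) {
        return 2; // covers U+0190..U+0198 from the second string
    }
    if ($codePoint < 0x10000) {
        return 3; // covers U+9990..U+9998 from the first string
    }
    return 4;     // supplementary planes
}

echo utf8_width(0x41), " ", strlen("A"), "\n";           // 1 1
echo utf8_width(0x190), " ", strlen("\u{190}"), "\n";    // 2 2
echo utf8_width(0x9990), " ", strlen("\u{9990}"), "\n";  // 3 3
```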
According to this post on php.net/strlen, strlen interprets every string passed to it as ASCII, that is, one byte per character, so it effectively returns the byte count.
Use mb_strlen; it counts characters in the provided encoding, not bytes.