unicode – Calculating the length of a Japanese multibyte string with half-width kana in PHP

Posted by: admin July 12, 2020

Questions:

I have a UTF-8 encoded string which can contain full-width kanji, full-width kana, half-width kana, romaji, numbers, or kawaii Japanese symbols like ★ or ♥.

If I want the length I use mb_strlen(), which counts each of these as 1. That is fine for most purposes.

But I’ve been asked (by a Japanese client) to count half-width kana as only 0.5 (for the purpose of the maxlength of a text field), because apparently that’s how Japanese websites do it. I do this using mb_strwidth(), which counts full-width as 2 and half-width as 1, then I just divide by 2.

However, this method also counts romaji characters as 1, so something like Chocｱｲｽ would count as 7. Dividing by 2 to account for kanji then gives 3.5, but I actually want 5.5 (4 for the romaji + 1.5 for the 3 half-width kana).
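To make the mismatch concrete, here is a minimal sketch of the arithmetic described above. The variable names are only for illustration, and the half-width katakana are written as code-point escapes (PHP 7+) so they cannot be confused with their full-width forms.

```php
<?php
// "Choc" followed by half-width katakana ｱ (U+FF71), ｲ (U+FF72), ｽ (U+FF7D).
$s = "Choc\u{FF71}\u{FF72}\u{FF7D}";

$chars  = mb_strlen($s, 'UTF-8');    // 7: every character counts as 1
$width  = mb_strwidth($s, 'UTF-8');  // 7: romaji and half-width kana are both width 1
$halved = $width / 2;                // 3.5: romaji end up at 0.5 each, not 1

var_dump($chars, $width, $halved);
```

The halved width treats the four romaji characters the same as the half-width kana, which is why the result is 3.5 rather than the desired 5.5.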

// EDIT:
Some more info: any character (even non-kana) that has both a full-width and a half-width form should count as 1 for the full-width form and 0.5 for the half-width form. For example, characters like ￥、３＠（ should all be 1, but characters like ¥,3@( should all be 0.5.

// EXTRA EDIT: symbols like ☆ and ♥ should be 1, but the mb_strwidth/2 method returns them as 0.5.

Is there a standard way that Japanese systems count string length?
Or does everyone just loop through their strings and count the characters that don’t match the standard width rules?

Answers:

One way is to convert the half-width katakana to full-width and subtract the difference in width from the original length:

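$raw = 'Chocｱｲｽ';                    // the kana here are half-width (width 1 each)
$full = mb_convert_kana($raw, 'K');  // 'K' converts half-width katakana to full-width
// Each converted character gains 1 unit of width; subtract half of that gain per character.
$len = mb_strlen($raw) - (mb_strwidth($full) - mb_strwidth($raw)) / 2;
assert($len === 5.5);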

However, are you sure that you should be counting basic Latin characters as full-width? Full-width varieties of basic Latin characters exist too. That is, should Ｃｈｏｃ be counted the same as Choc?

Usually, characters like “A” and “ｱ” would have a width of 1, but “Ａ” and “ア” would have a width of 2 (which is what mb_strwidth does). I’d be cautious about having to hack around that.
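As a quick check of the widths mentioned above, the following sketch asks mb_strwidth() about each of the four characters individually. The array keys are just labels for illustration, and the non-ASCII characters are given as code-point escapes (PHP 7+) so the half-width and full-width forms stay distinct.

```php
<?php
// Width of each character on its own, as reported by mb_strwidth().
$widths = [
    'half-width Latin A (U+0041)' => mb_strwidth("A", 'UTF-8'),
    'half-width kana (U+FF71)'    => mb_strwidth("\u{FF71}", 'UTF-8'),
    'full-width Latin A (U+FF21)' => mb_strwidth("\u{FF21}", 'UTF-8'),
    'full-width kana (U+30A2)'    => mb_strwidth("\u{30A2}", 'UTF-8'),
];

print_r($widths); // the two half-width forms report 1, the two full-width forms report 2
```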


Given your edit, mb_strwidth (or mb_strwidth/2) does exactly what you want.

Answer:

So, I found no answer for this.

I fixed it by literally iterating through the string, checking each character, and manually applying the counting rules that my client asked for.
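The exact rules used are not spelled out in this answer, so the following is only a sketch of what such a loop might look like, not the poster's actual code. The function name weighted_length and its $halfPattern parameter are hypothetical, and the default pattern assumes that only the half-width katakana block (U+FF61 to U+FF9F) counts as 0.5.

```php
<?php
// Hypothetical helper: every character counts as 1, except characters
// matching $halfPattern, which count as 0.5.
function weighted_length(
    string $s,
    string $halfPattern = '/^[\x{FF61}-\x{FF9F}]$/u'
): float {
    // Split into individual code points; preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $len = 0.0;
    foreach ($chars as $ch) {
        $len += preg_match($halfPattern, $ch) === 1 ? 0.5 : 1.0;
    }
    return $len;
}

// "Choc" + three half-width katakana (U+FF71, U+FF72, U+FF7D).
var_dump(weighted_length("Choc\u{FF71}\u{FF72}\u{FF7D}")); // float(5.5)
```

Passing a broader pattern, for example '/^[\x{0020}-\x{007E}\x{00A5}\x{FF61}-\x{FF9F}]$/u', would also count printable ASCII and the half-width yen sign as 0.5, which is closer to the rule in the question's edit. Which set of characters belongs in the pattern is an assumption to confirm against the actual requirements.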

Answer:

Look at Perl’s Unicode::GCString module: it gives the correct columns for all of Unicode, including the East Asian stuff.

It is an underlying component of Unicode::LineBreak, which I have found absolutely indispensable for doing proper text segmentation of Asian scripts.

As you might well imagine, both are Made in Japan™.
🙂