So I have a UTF-8 encoded string which can contain full-width kanji, full-width kana, half-width kana, romaji, numbers or kawaii japanese symbols like ★ or ♥.
If I want the length I use
mb_strlen(), which counts each of these as 1. That's fine for most purposes.
But I’ve been asked (by a Japanese client) to count half-width kana as only 0.5 (for the purposes of the maxlength of a text field), because apparently that’s how Japanese websites do it. I do this using
mb_strwidth(), which counts full-width as 2 and half-width as 1, then I just divide by 2.
However, this method also counts romaji characters as 1, so something like
Chocｱｲｽ would count as 7. I’d then divide by 2 to account for the kanji and get 3.5, but I actually want 5.5 (4 for the romaji + 1.5 for the 3 half-width kana).
Some more info: any character (even non-kana) which has both a full-width and a half-width form should count as 1 for the full-width and 0.5 for the half-width. For example, characters like
￥、３＠（ should all be 1, but characters like
¥,3@( should all be 0.5
// EXTRA EDIT: symbols like ☆ and ♥ should be 1, but the mb_strwidth()/2 method returns them as 0.5
Is there a standard way that Japanese systems count string length?
Or does everyone just loop through their strings and count the characters which don’t match the standard width rules?
One way is to convert the half-width katakana to full-width and subtract the difference in width from the original length:
$raw = 'Chocｱｲｽ';
$full = mb_convert_kana($raw, 'K'); // half-width katakana → full-width
$len = mb_strlen($raw) - (mb_strwidth($full) - mb_strwidth($raw)) / 2;
assert($len === 5.5);
However, are you sure that you should be treating basic latin characters as full-width? There exist full-width varieties of basic latin characters too — that is, should
Choc be considered the same as Ｃｈｏｃ?
Usually, characters like “A” and “ｱ” would have a width of 1, but “Ａ” and “ア” would have a width of 2 (which is what
mb_strwidth does). I’d be cautious about having to hack around that.
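For reference, here is a quick demonstration of those widths (assuming the internal encoding is UTF-8):

```php
<?php
mb_internal_encoding('UTF-8');

// mb_strwidth(): half-width characters count as 1, full-width as 2
var_dump(mb_strwidth('A'));  // half-width latin    → int(1)
var_dump(mb_strwidth('ｱ'));  // half-width katakana → int(1)
var_dump(mb_strwidth('Ａ')); // full-width latin    → int(2)
var_dump(mb_strwidth('ア')); // full-width katakana → int(2)
```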
Given your edit,
mb_strwidth()/2 does exactly what you want.
So, I found no answer for this.
I fixed it by literally iterating through the string, checking each character, and manually applying the counting rules that my client asked for.
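Such a per-character loop might look like this sketch (jp_length is a hypothetical name; it follows one reading of the rules, matching the worked Chocｱｲｽ = 5.5 example: half-width katakana count 0.5, everything else counts 1. Extending the half-width set to cover characters like ¥,3@( would conflict with counting Choc as 4, so that rule is left out here; mb_str_split() requires PHP 7.4+):

```php
<?php
mb_internal_encoding('UTF-8');

function jp_length(string $s): float {
    $len = 0.0;
    foreach (mb_str_split($s) as $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        if ($cp >= 0xFF61 && $cp <= 0xFF9F) {
            // half-width katakana block (also contains half-width ｡｢｣､･ punctuation)
            $len += 0.5;
        } else {
            // everything else counts as 1, including kanji, romaji, ☆ and ♥
            $len += 1.0;
        }
    }
    return $len;
}
```

For example, jp_length('Chocｱｲｽ') gives 5.5 and jp_length('☆♥') gives 2.0, unlike the mb_strwidth()/2 approach.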
Look at Perl’s Unicode::GCString module: it gives the correct column widths for all of Unicode, including the East Asian stuff.
It is an underlying component of Unicode::LineBreak, which I have found absolutely indispensable for doing proper text segmentation of Asian scripts.
As you might well imagine, both are Made in Japan™.