
php – Regular expression preg_quote symbols are not detected

Posted by: admin April 23, 2020

Questions:

I have a dictionary of swear words in the database, and the following works great

preg_match_all("/\b".$f."(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

$t is the input text, and $f = preg_quote("punk"), where "punk" comes from the database dictionary. At this point in the loop the expression is therefore as follows

preg_match_all("/\bpunk(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

preg_quote escapes special symbols, e.g. # becomes \#, so that the expression stays valid. But when the dictionary word is e.g. "f@ck" or "A$$", these symbols are not detected in the input string with the above expression. I have both a$$ and f@ck in the dictionary, but they do not work. If I remove preg_quote() on the word, the regular expression is invalid, as these symbols are not escaped.
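As a quick illustration of what preg_quote() does and does not escape, here is a minimal sketch. The sample words come from the question (plus a made-up "a/b" to show the delimiter), the variable names are only illustrative, and passing "/" as the second argument assumes the pattern is delimited with slashes.

```php
<?php
// Illustrative only: inspect what preg_quote() returns for a few dictionary words.
// The second argument names the pattern delimiter so that it is escaped as well.
$samples = ['punk', 'a$$', 'f*ck', 'f@ck', 'a/b'];

$quoted = [];
foreach ($samples as $word) {
    $quoted[$word] = preg_quote($word, '/');
}

// Print each word next to its escaped form.
foreach ($quoted as $word => $escaped) {
    echo $word, ' => ', $escaped, PHP_EOL;
}
```

Note that "@" is not a regex metacharacter, so it comes back unchanged, while "$" and "*" gain a backslash.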

Any suggestions on how I can detect "a$$"?

Edit:

So I guess the expression that is not working as intended would be eg.

preg_match_all("/\bf\@ck(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

Which should find f@ck in $t

UPDATE:

This is my usage, simply put: if there are matches in $m, replace them with "****". This whole block is inside a loop through each word in the dictionary; $f is the dictionary word and $t is the input.

$f = preg_quote($f);
preg_match_all("/\b$f(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);
if (count($m) > 0) {
     $t = preg_replace("/(\b$f(?:ing|er|es|s)?\b)/si","*****",$t);
}

UPDATE:
Behold, the var_dump:

preg_quote($f) = string(5) "a\$\$"
$t = string(18) "You're such an a$$"
expression = string(29) "/\ba\$\$(?:ing|er|es|s)?\b/si"

UPDATE:
This is only happening when words end with a symbol. I tested "a$$hole" and it’s fine, but "a$$" doesn’t work.

ANOTHER UPDATE:
Try this simplified version, $words being a makeshift dictionary

// Single quotes: in a double-quoted string PHP would interpolate "$hole" as a variable.
$words = array('a$$','asshole','a$$hole','f@ck','f#ck','f*ck');
$text = 'Input whatever you feel like here eg. a$$';

foreach ($words as $f) {
   $q = preg_quote($f,"/");   // escaped copy, used only inside the pattern
   $text = preg_replace("/\b".$q."(?:ing|er|es|s)?\b/si",
                         str_repeat("*",strlen($f)),   // mask as long as the unescaped word
                        $text);
}

I would expect to see "Input whatever you feel like here eg. ***" as a result.

How to & Answers:

Cannot Be Done

I’m sorry, but this “problem” is truly impossible to solve. Consider these:

  • ꜰᴜᴄᴋ   is U+A730.1D1C.1D04.1D0B, “\N{LATIN LETTER SMALL CAPITAL F}\N{LATIN LETTER SMALL CAPITAL U}\N{LATIN LETTER SMALL CAPITAL C}\N{LATIN LETTER SMALL CAPITAL K}”
  • ᶠᵘᶜᵏ   is U+1DA0.1D58.1D9C.1D4F, “\N{MODIFIER LETTER SMALL F}\N{MODIFIER LETTER SMALL U}\N{MODIFIER LETTER SMALL C}\N{MODIFIER LETTER SMALL K}”
  • 𝒻𝓊𝒸𝓀   is U+1D4BB.1D4CA.1D4B8.1D4C0, “\N{MATHEMATICAL SCRIPT SMALL F}\N{MATHEMATICAL SCRIPT SMALL U}\N{MATHEMATICAL SCRIPT SMALL C}\N{MATHEMATICAL SCRIPT SMALL K}”
  • 𝖋𝖚𝖈𝖐   is U+1D58B.1D59A.1D588.1D590, “\N{MATHEMATICAL BOLD FRAKTUR SMALL F}\N{MATHEMATICAL BOLD FRAKTUR SMALL U}\N{MATHEMATICAL BOLD FRAKTUR SMALL C}\N{MATHEMATICAL BOLD FRAKTUR SMALL K}”
  • 𝓕 𝒰 𝒞 𝒦   is U+1D4D5.1D4B0.1D49E.1D4A6, “\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}\N{MATHEMATICAL SCRIPT CAPITAL U}\N{MATHEMATICAL SCRIPT CAPITAL C}\N{MATHEMATICAL SCRIPT CAPITAL K}”
  • ⓕ ⓤ ⓒ ⓚ   is U+24D5.24E4.24D2.24DA, “\N{CIRCLED LATIN SMALL LETTER F}\N{CIRCLED LATIN SMALL LETTER U}\N{CIRCLED LATIN SMALL LETTER C}\N{CIRCLED LATIN SMALL LETTER K}”
  • Γ̵𐌵ᏟᏦ   is U+393.335.10335.13DF.13E6, “\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}\N{GOTHIC LETTER QAIRTHRA}\N{CHEROKEE LETTER TLI}\N{CHEROKEE LETTER TSO}”
  • ƒμɕѤ   is U+192.3BC.255.464, “\N{LATIN SMALL LETTER F WITH HOOK}\N{GREEK SMALL LETTER MU}\N{LATIN SMALL LETTER C WITH CURL}\N{CYRILLIC CAPITAL LETTER IOTIFIED E}”
  • Г̵ЦСК   is U+413.335.426.421.41A, “\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}\N{CYRILLIC CAPITAL LETTER TSE}\N{CYRILLIC CAPITAL LETTER ES}\N{CYRILLIC CAPITAL LETTER KA}”
  • ғᵾȼƙ   is U+493.1D7E.23C.199, “\N{CYRILLIC SMALL LETTER GHE WITH STROKE}\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}\N{LATIN SMALL LETTER C WITH STROKE}\N{LATIN SMALL LETTER K WITH HOOK}”
  • ϜυϚΚ   is U+3DC.3C5.3DA.39A, “\N{GREEK LETTER DIGAMMA}\N{GREEK SMALL LETTER UPSILON}\N{GREEK LETTER STIGMA}\N{GREEK CAPITAL LETTER KAPPA}”
  • ЖↃUᆿ   is U+416.2183.55.11BF, “\N{CYRILLIC CAPITAL LETTER ZHE}\N{ROMAN NUMERAL REVERSED ONE HUNDRED}\N{LATIN CAPITAL LETTER U}\N{HANGUL JONGSEONG KHIEUKH}”
  • ʞɔnɟ   is U+29E.254.6E.25F, “\N{LATIN SMALL LETTER TURNED K}\N{LATIN SMALL LETTER OPEN O}\N{LATIN SMALL LETTER N}\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}”

It Gets Worse

And if you think those are easy, just try coping with all of these:

𝓕 00 Ↄ ʞ, F ᵾ ⒞ K, K ⓒ Ц ⒡ , 𝖋 𝖀 K 𝒸, ғ ∞ Ϛ k, f 𝓊 Ꮯ K, ⓕ oo ɔ ⓚ , ɟ ⒰ ¢ K, 𝒻 𝖚 ȼ 𝖐, 𝕱 Ù ȼ ⒦ ,
f 𝒰 ⒞ ƙ, F 𐌵 ᶜ 𝕶, F ∞ 𝒞 Ж , 𝕱 @ Ꮯ 𝓀, ɟ ᵘ 𝒞 𝕶, F Ц ¢ 𝒦, f oo Ꮯ ʞ, 𝕱 oo ¢ Ж , 𝕱 υ ᶜ Κ , Ϝ ú * ʞ,
ꜰ 𝖚 c K, ƒ ᵘ ȼ k, 𝖋 U ȼ 𝕶, Ж ɔ μ ƒ, F ⓤ ⒞ k, ƒ 𝖚 C ƙ, ғ 00 ɔ Ѥ, ƒ U c ᴋ, 𝕱 ∞ Ꮶ ⓒ , ꜰ 𝓊 ᴄ ⒦ ,
𝕱 ⒰ Ꮯ Ѥ, ꜰ ᴜ 𝒞 ⒦ , F 𝒰 𝖈 ʞ, f 00 𝖈 𝓀, ғ u С K, f 𐌵 ɔ Κ , f μ Ↄ K, ɟ 𝖚 c ʞ, f 𝖚 Ↄ 𝖐, F μ ¢ 𝓀,
ᆿ 𝖀 ᴄ ⒦ , Κ ¢ oo ɟ, ᶠ μ ᶜ Ѥ, ᶠ ⓤ Ꮯ Ж , 𝒦 ⒞ ᵘ F, F @ C ⓚ , Ѥ ᴄ u F, ⒡ ᵾ C k, ƒ μ ᶜ ᴋ, F 𝒰 C 𝓀,
f ᵘ ¢ ᵏ, ᆿ 00 𝒸 𝕶, ꜰ υ ȼ K, Ϝ 𝓊 ȼ К , 𝕱 oo ɕ ᴋ, ғ 𝒰 Ꮯ ᴋ, ꜰ n 𝒸 K, ꜰ μ Ϛ К , F ∞ ȼ 𝖐, ⒡ 𐌵 Ↄ Κ ,
ƒ 𝖚 ⒞ 𝒦, ᶠ U C Ꮶ, ᶠ υ Ↄ ƙ, 𝓕 𝓊 C 𝓀, Ϝ U 𝒸 Ѥ, Ϝ U Ↄ 𝓀, 𝒻 U ⒞ ᵏ, F @ C К , ғ ᴜ 𝖈 ᴋ, ⒡ U 𝒸 К ,
ɟ U * ᵏ, 𝖋 Ц c Κ , ғ U Ↄ 𝕶, ƒ ⒰ 𝒞 ᵏ, ғ 𝖚 * K, 𝖋 n 𝕮 ⓚ , ᶠ 00 С К , 𝖋 Ц 𝒞 k, ƙ c Ц ᶠ, 𝕱 ⒰ Ѥ 𝖈,
ꜰ ǔ ᴄ ⒦ , F 𝒰 Ↄ 𝓀, 𝒦 𝖈 υ ꜰ, 𝖋 𝖚 * ᵏ, 𝖋 00 𝕮 Ж , Κ C 𝖚 𝖋, ᶠ U С K, ꜰ 𝖀 𝖈 Κ , ɟ U ᶜ ⓚ , 𝒻 ∞ ȼ ᴋ,
ƒ U К ć, ƒ υ ȼ ᴋ, ⒡ ∞ Ж ɕ, 𝖋 ᵘ 𝖈 ᵏ, F U Ϛ ʞ, ⓕ 𐌵 𝕮 Ж , 𝕱 𝒰 𝓀 Ↄ, Ϝ n * K, 𝓕 oo c ⓚ , ƒ U ¢ ʞ,
ƒ u C ʞ, K ¢ μ ⒡ , ɟ ⒰ K ɔ, F U c k, F Ц 𝖈 ⓚ , 𝒻 U ᴋ ɔ, 𝖋 𝖀 Ꮯ 𝒦, 𝒻 𐌵 𝖈 ⓚ , ⓕ 𝖚 C К , ɟ ᵾ * ⒦ ,
ᶠ ᵘ ⒞ ⒦ , ƒ ⒰ ᴄ ᵏ, ⒡ ⒰ С K, 𝓕 ⒰ * ᴋ, ᆿ ∞ ʞ ɕ, 𝒻 n * Ѥ, Ϝ μ ᴄ 𝒦, k ć ᵘ ƒ, 𝓕 ᵘ ɕ 𝖐, ɟ Ц Ꮶ ᴄ, 𝓕 ᵾ ⒞ ᵏ,
ғ ᵘ 𝒸 ᵏ, 𝖋 ᵾ * Ѥ, F 𝖚 Ꮯ K, ғ ⓤ 𝕮 ᴋ, ƒ u ɕ 𝖐, ƙ c ⒰ F, 𝒻 𝒰 ⓒ Κ , K ᶜ Ц 𝕱, ɟ 𝖚 c ⒦ , ƒ @ c Κ ,
Ϝ Ц ȼ Ḱ, ⒡ ᵘ 𝒞 ⒦ , ɟ ᵾ Ѥ ¢, F 𝖀 Ↄ 𝒦, Ϝ ᴜ 𝖐 𝖈, Ϝ 𝖀 ⒞ 𝖐, 𝕱 U Ꮯ ʞ, ƒ υ Ꮯ ᵏ, F ᵾ Ꮯ Κ , Ϝ ᵘ ⓒ ʞ,
𝓕 ⓤ ᶜ ƙ, ᆿ 𝒰 ⒞ 𝕶, f 𝖀 Ↄ Ѥ, 𝖋 U 𝒞 K, Ϝ ᴜ * 𝓀, ꜰ @ ⓒ ʞ, ƒ u ⓒ 𝒦, f U ⒞ k, 𝕱 00 ᴄ Ѥ, 𝒻 υ С K,
F ᴜ ᴄ 𝕶, ⓕ oo Ↄ ⓚ , ⒡ ᵘ ɕ 𝓀, ⓕ υ ᴄ Κ , ᆿ U Ꮯ 𝕶, 𝒻 𝖀 Ꮯ Ꮶ, 𝖋 𐌵 Ć 𝓀, 𝓕 Ц ɕ К , f @ Ↄ ⓚ , ᴋ ᶜ U ꜰ,
𝓕 ᴜ c ⒦ , F ᵘ C 𝒦, 𝒻 00 𝖈 Ꮶ, ꜰ 00 𝖈 К , Ϝ 𝖚 Ϛ ᵏ, F 𐌵 c Ѥ, ⓕ oo Ↄ K, f ᵾ С ᵏ, ⓕ Ц c 𝒦, 𝓕 𐌵 c Ж ,
ⓕ 𝓊 𝒞 ƙ, ⓚ C n ғ, ɟ U ȼ 𝕶, 𝒻 00 K ȼ, 𝒻 𐌵 ᴄ 𝖐, 𝒻 Ц C 𝓀, 𝖋 Ц ¢ 𝓀, Ϝ ᵘ c k, ⒡ 𐌵 ¢ k, ƒ ⓤ ⓚ Ↄ, 𝒻 𐌵 𝕮 k,
ƒ U Ↄ K, 𝓕 𝖀 ᴄ Ꮶ, ᆿ ⓤ 𝕮 ⒦ , Ж ɔ U 𝖋, ƒ υ * ᴋ, ƒ 𝓊 𝒞 k, 𝓕 U С ⒦ , 𝒻 𝖚 C Ж , ƒ μ Ꮯ ƙ, ⓕ n ᴄ ⒦ ,
ⓕ μ ⓒ Ж , ⒡ 00 ɕ 𝖐, 𝕱 ᴜ ᶜ 𝒦, ᆿ Ù Ж 𝖈, ⒦ ȼ U 𝖋, k C ⓤ ᆿ, Ϝ n ȼ ᵏ, ᴋ ȼ ᵾ ɟ, F 𝖀 ȼ Ѥ, ғ ⒰ ȼ 𝒦,
f U Ж ⒞ , F ῠ 𝒸 ᵏ, F u 𝒸 Κ , F 00 ȼ 𝕶, ꜰ μ Ϛ Ꮶ, ᆿ 𝖀 𝒞 K, ⒡ n Ↄ Ж , F @ 𝒞 ƙ, ᶠ ὺ 𝒸 К ,
𝒻 U C ᵏ, F U 𝖈 ⒦ , 𝒻 00 Ↄ 𝕶, ᶠ 𝖚 c К , ғ ⓤ 𝒞 𝒦, 𝓕 ⓤ 𝖈 Κ , 𝒻 U 𝒸 Ж , ⒡ 𝖀 ɔ Ꮶ, ⓚ ɔ 𝓊 f, 𝒻 U C K,
F @ C Ѥ, ғ ᴜ С k, ɟ u * ƙ, ⓕ ᵾ ɕ 𝒦, 𝕱 00 ȼ K, 𝒻 υ 𝓀 𝖈, ƒ ⒰ * ʞ, ⓕ U Ↄ Ж , ꜰ U ȼ ƙ, ⒡ u С ⒦ ,
ꜰ ᴜ 𝕮 Ќ, ᆿ μ 𝒞 ⒦ , ⓕ @ ᴄ К , ᶠ υ ɔ ᵏ, ƙ Ↄ oo ꜰ, F ᴜ 𝕮 𝒦, 𝓕 ⒰ C ᵏ, 𝖋 U 𝒸 ƙ, ƒ ∞ C Ꮶ, 𝒻 ⒰ * K,
𝒻 u Ↄ ᴋ, ᆿ U ⓒ 𝓀, ᆿ U Ꮶ 𝕮, 𝓕 n 𝒦 𝖈, ƒ Ц C ƙ, ⒦ 𝖈 𝒰 ꜰ, K ¢ ᵘ f, 𝕱 ⒰ 𝖈 Ꮶ, 𝓀 ᴄ 00 𝖋, Ϝ U 𝒞 k,
𝕱 u ¢ ⒦ , 𝕱 𝓊 * Ѥ, ƒ 𝖀 С ᴋ, 𝒻 𝖀 C Ꮶ, 𝖋 @ 𝕮 Κ , ʞ С 𝖀 ᶠ, 𝖋 ᵾ Ϛ Ꮶ, ᶠ ⒰ ɔ 𝒦, F Ц ⒞ ʞ, ⒡ ⒰ К ɔ,
ɟ υ ¢ 𝕶, Ѥ ȼ U ᆿ, 𝒻 ᴜ Ↄ ʞ, ғ 𝓊 * K, 𝒻 𝒰 ᴄ ʞ, F 𝖀 𝖈 ʞ, 𝒻 @ ȼ 𝒦, 𝒻 ⒰ * 𝖐, 𝒻 ᵾ ȼ 𝒦, F 𐌵 ¢ Ѥ,
ꜰ ⓤ ƙ Ϛ, ⓕ 00 c ʞ, 𝕱 00 Ϛ K, 𝖋 υ Ↄ Κ , ꜰ μ ⓒ Ж , 𝒻 ᵘ Ϛ ʞ, Ϝ ᵘ Ↄ ᵏ, ⒡ ᵾ Ꮯ 𝒦, Ϝ ⒰ ȼ Ѥ, ƒ n 𝒞 Ѥ,
ᆿ μ ⓒ k, 𝖋 Ц ɕ Κ , ғ μ 𝕮 Ѥ, f ⓤ Ꮯ 𝖐, ᵏ 𝕮 μ ƒ, ᵏ С 𝖚 𝓕, ᆿ ∞ 𝖈 𝒦, ғ ᵘ Ꮯ 𝓀, ƒ μ Ↄ k, f oo K ȼ,
ɟ 𝓊 𝕶 С , ꜰ n 𝖈 K, 𝒻 00 𝖈 ᵏ, ᶠ μ ⓒ 𝓀, 𝖐 c ∞ Ϝ, ᆿ Ц Ć ⒦ , 𝕱 ᵘ ᴄ 𝒦, F 00 𝕮 ⓚ , ᶠ @ ȼ К , …

And that’s not all: there are at least a bazingatillion more where those came from. Do you see now why this fundamentally cannot be done?
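To make the point concrete, the following sketch checks a plain ASCII pattern against three of the spellings listed above, written as code-point escapes so the file encoding does not matter. The array labels and variable names are illustrative, and the "\u{...}" syntax assumes PHP 7 or later.

```php
<?php
// Illustrative only: a case-insensitive, Unicode-aware ASCII pattern
// versus look-alike spellings taken from the list above.
$pattern = '/fuck/iu';

$samples = [
    'plain ASCII'      => 'fuck',
    'small capitals'   => "\u{A730}\u{1D1C}\u{1D04}\u{1D0B}",
    'modifier letters' => "\u{1DA0}\u{1D58}\u{1D9C}\u{1D4F}",
    'circled letters'  => "\u{24D5}\u{24E4}\u{24D2}\u{24DA}",
];

$results = [];
foreach ($samples as $label => $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$label] = preg_match($pattern, $sample);
}

var_export($results);
```

Only the plain ASCII spelling matches. Each look-alike would need its own dictionary entry or a mapping step, and the list of look-alikes is open-ended.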

Full Disclosure

Because I don’t believe in security through obscurity, here’s the program that generates all those:

#!/usr/bin/env perl
#
# unifuck - print infinite permutations of fuck in unicode aliases
#
# Tom Christiansen <[email protected]>
# Mon May 23 09:37:27 MDT 2011

use strict;
use warnings;
use charnames ":full";

use Unicode::Normalize;

binmode(STDOUT, ":utf8");

our(@diddle, @fuck, %fuck); # initted down below
while (my($f,$u,$c,$k) = splice(@fuck, 0, 4)) {
    $fuck{F}{$f}++;
    $fuck{U}{$u}++;
    $fuck{C}{$c}++;
    $fuck{K}{$k}++;
} 

my @F = keys %{ $fuck{F} };
my @U = keys %{ $fuck{U} };
my @C = keys %{ $fuck{C} };
my @K = keys %{ $fuck{K} };

while (1) { 
    my $f = $F[rand @F];
    my $u = $U[rand @U];
    my $c = $C[rand @C];
    my $k = $K[rand @K];

    for ($f,$u,$c,$k) {  
        next if length > 1;
        next if /\p{EA=W}/;
        next if /\pM/;
        next if /\p{InEnclosedAlphanumerics}/;
        s/$/$diddle[rand @diddle]/          if rand(100) < 15;
        s/$/\N{COMBINING ENCLOSING KEYCAP}/ if rand(100) <  1;
    }

    if    (             0) {                                       }
    elsif (rand(100) <  5) {     $u        = q(@)                  } 
    elsif (rand(100) <  5) {        $c     = q(*)                  } 
    elsif (rand(100) < 10) {       ($c,$k) = ($k,$c)               } 
    elsif (rand(100) < 15) { ($f,$u,$c,$k) = reverse ($f,$u,$c,$k) }

    print NFC("$f $u $c $k\n");
}

BEGIN {

    # ok to have repeats in each position, since they'll be counted only once
    # per unique strings
    @fuck = (

        "\N{LATIN CAPITAL LETTER F}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{LATIN CAPITAL LETTER C}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER U}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{INFINITY}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER O}\N{LATIN SMALL LETTER O}",
        "\N{LATIN SMALL LETTER C}",
        "\N{KELVIN SIGN}",

        "\N{LATIN SMALL LETTER F}",
        "\N{DIGIT ZERO}\N{DIGIT ZERO}",
        "\N{CENT SIGN}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN LETTER SMALL CAPITAL F}",
        "\N{LATIN LETTER SMALL CAPITAL U}",
        "\N{LATIN LETTER SMALL CAPITAL C}",
        "\N{LATIN LETTER SMALL CAPITAL K}",

        "\N{MODIFIER LETTER SMALL F}",
        "\N{MODIFIER LETTER SMALL U}",
        "\N{MODIFIER LETTER SMALL C}",
        "\N{MODIFIER LETTER SMALL K}",

        "\N{MATHEMATICAL SCRIPT SMALL F}",
        "\N{MATHEMATICAL SCRIPT SMALL U}",
        "\N{MATHEMATICAL SCRIPT SMALL C}",
        "\N{MATHEMATICAL SCRIPT SMALL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL K}",

        "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}",
        "\N{MATHEMATICAL SCRIPT CAPITAL U}",
        "\N{MATHEMATICAL SCRIPT CAPITAL C}",
        "\N{MATHEMATICAL SCRIPT CAPITAL K}",

        "\N{CIRCLED LATIN SMALL LETTER F}",
        "\N{CIRCLED LATIN SMALL LETTER U}",
        "\N{CIRCLED LATIN SMALL LETTER C}",
        "\N{CIRCLED LATIN SMALL LETTER K}",

        "\N{PARENTHESIZED LATIN SMALL LETTER F}",
        "\N{PARENTHESIZED LATIN SMALL LETTER U}",
        "\N{PARENTHESIZED LATIN SMALL LETTER C}",
        "\N{PARENTHESIZED LATIN SMALL LETTER K}",

        "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{GOTHIC LETTER QAIRTHRA}",
        "\N{CHEROKEE LETTER TLI}",
        "\N{CHEROKEE LETTER TSO}",

        "\N{LATIN SMALL LETTER F WITH HOOK}",
        "\N{GREEK SMALL LETTER MU}",
        "\N{LATIN SMALL LETTER C WITH CURL}",
        "\N{CYRILLIC CAPITAL LETTER IOTIFIED E}",

        "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{CYRILLIC CAPITAL LETTER TSE}",
        "\N{CYRILLIC CAPITAL LETTER ES}",
        "\N{CYRILLIC CAPITAL LETTER KA}",

        "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}",
        "\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}",
        "\N{LATIN SMALL LETTER C WITH STROKE}",
        "\N{LATIN SMALL LETTER K WITH HOOK}",

        "\N{GREEK LETTER DIGAMMA}",
        "\N{GREEK SMALL LETTER UPSILON}",
        "\N{GREEK LETTER STIGMA}",
        "\N{GREEK CAPITAL LETTER KAPPA}",

        "\N{HANGUL JONGSEONG KHIEUKH}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{ROMAN NUMERAL REVERSED ONE HUNDRED}",
        "\N{CYRILLIC CAPITAL LETTER ZHE}",

        "\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}",
        "\N{LATIN SMALL LETTER N}",
        "\N{LATIN SMALL LETTER OPEN O}",
        "\N{LATIN SMALL LETTER TURNED K}",

        "\N{FULLWIDTH LATIN CAPITAL LETTER F}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER U}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER C}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER K}",

    );

    @diddle = (
        "\N{COMBINING GRAVE ACCENT}",
        "\N{COMBINING ACUTE ACCENT}",
        "\N{COMBINING CIRCUMFLEX ACCENT}",
        "\N{COMBINING TILDE}",
        "\N{COMBINING BREVE}",
        "\N{COMBINING DOT ABOVE}",
        "\N{COMBINING DIAERESIS}",
        "\N{COMBINING CARON}",
        "\N{COMBINING CANDRABINDU}",
        "\N{COMBINING INVERTED BREVE}",
        "\N{COMBINING GRAVE TONE MARK}",
        "\N{COMBINING ACUTE TONE MARK}",
        "\N{COMBINING GREEK PERISPOMENI}",
        "\N{COMBINING FERMATA}",
        "\N{COMBINING SUSPENSION MARK}",
    );

}

Answer:

\b checks for a word boundary. According to http://www.regular-expressions.info/wordboundaries.html:

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

“Word characters” are letters, digits, and underscores, so in the string “a$$”, the word boundary occurs after the “a”, not after the second “$”.
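The effect of these rules can be reproduced with a short sketch. The pattern is the one from the question with "a$$" filled in; the sample inputs and variable names are illustrative.

```php
<?php
// Illustrative reproduction: where does the trailing \b allow a match?
$pattern = '/\ba\$\$(?:ing|er|es|s)?\b/i';

$inputs = [
    'You\'re such an a$$',        // "$" is the last character of the input
    'what an a$$hole',            // "$" is followed by a letter
    'a$$ is in the dictionary',   // "$" is followed by a space
];

$hits = [];
foreach ($inputs as $input) {
    // Store the matched text, or null when the pattern does not match.
    $hits[$input] = preg_match($pattern, $input, $m) ? $m[0] : null;
}

var_export($hits);
```

Only the second input matches, because a boundary exists between "$" and "h". At the end of the input, or before a space, both sides of the position are non-word characters, so \b fails.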

You will probably need to explicitly specify the characters you consider to be “word boundaries” by using a class (e.g., [- '"]).
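One way to spell out the boundaries explicitly is sketched below. It swaps \b for the lookarounds (?<!\w) and (?!\w), a variation on the character-class idea rather than the answer's exact suggestion. The helper name censor() and its signature are hypothetical.

```php
<?php
// Hypothetical helper (not from the answer): mask dictionary words, using
// lookarounds instead of \b so that a word may start or end with a symbol.
function censor(string $text, array $words): string
{
    foreach ($words as $word) {
        $quoted = preg_quote($word, '/');
        // (?<!\w) and (?!\w): no word character directly before or after the match.
        $pattern = '/(?<!\w)' . $quoted . '(?:ing|er|es|s)?(?!\w)/i';
        $text = preg_replace_callback(
            $pattern,
            function (array $m): string {
                // Replace the whole match with asterisks of the same byte length.
                return str_repeat('*', strlen($m[0]));
            },
            $text
        );
    }
    return $text;
}

echo censor('You\'re such an a$$', ['a$$', 'a$$hole', 'f@ck']), PHP_EOL;
```

Because the lookarounds only assert and do not consume, the characters around the word are left untouched, and a word at the very start or end of the input is still found.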

Answer:

Now that you have said it doesn’t work at the end of the word, I see the problem. @, $, or any other such special characters aren’t part of the word (so \b breaks the word after ‘a’ in the case of ‘a$$’ if it isn’t followed by any other letters in the input string). I suggest using [^a-z], or the end of the input, to mark the end of the word to fix it.

preg_match_all("/\b".$f."(?:ing|er|es|s)?(?:[^a-z]|$)/si",$t,$m,PREG_SET_ORDER);
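A character class such as [^a-z] has to consume one character, which matters both at the end of the input and when the match is later used for replacement. The following sketch compares it with a negative lookahead, (?![a-z]), a substitute technique that the answer itself does not use. The variable names and sample inputs are illustrative.

```php
<?php
// Illustrative comparison: consuming character class vs. negative lookahead.
$consuming = '/\ba\$\$(?:ing|er|es|s)?[^a-z]/i';
$lookahead = '/\ba\$\$(?:ing|er|es|s)?(?![a-z])/i';

$atEnd    = 'You\'re such an a$$';   // nothing follows the word
$inMiddle = 'a$$, really';           // a comma follows the word

// At the end of the input the class has no character left to consume.
$endConsuming = preg_match($consuming, $atEnd);
$endLookahead = preg_match($lookahead, $atEnd);

// In the middle of the input both match, but the matched text differs.
preg_match($consuming, $inMiddle, $mc);
preg_match($lookahead, $inMiddle, $ml);

var_export([$endConsuming, $endLookahead, $mc[0], $ml[0]]);
```

If the match feeds preg_replace(), the character-class variant would also overwrite the trailing comma, whereas the lookahead leaves it in place.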