
How to check if a string looks randomized, or human-generated and pronounceable?

Posted by: admin, November 1, 2017

Questions:

For the purpose of identifying [possible] bot-generated usernames.

Suppose you have a username like “bilbomoothof”. It may be nonsense, but it still contains pronounceable sounds and so appears human-generated.

I accept that it could have been randomly generated from a dictionary of syllables, or word parts, but let’s assume for a moment that the bot in question is a bit rubbish.

  1. Suppose you have a username like “sdfgbhm342r3f”. To a human this is clearly a random string, but can this be identified programmatically?
  2. Are there any algorithms available (similar to Soundex, etc.) that can identify pronounceable sounds within a string like this?

Solutions applicable in PHP/MySQL most appreciated.

Answers:

I guess you could do something like that if you restricted yourself to sounds that are pronounceable in English. For me (I am French), words like Szczepan or Wawrzyniec are unpronounceable and certainly look rather random.

But they are actually Polish first names (meaning Steven and Lawrence)…

Answers:

I agree with Mac. But more than that, people sometimes have usernames that aren’t pronounceable, like qwerty or rtfmorleave.

Why bother with that?

< obsolete and false, but i don’t delete because of comments >

But more than that, no bots use ‘zetztzgsd’ as a username; they have dictionaries of real names, possible nicknames, etc., so I think this would be a waste of time for you.

< / obsolete and false, but i don’t delete because of comments>

Answers:

Look up n-gram analysis. It is successfully used to automatically detect text language and works surprisingly well even on very short texts.

The online demo (no longer online) recognized ‘bilbomoothof’ as English and ‘sdfgbhm342r3f’ as Nepali. It probably always returns the best match, even if it’s a very poor one. I think you could train it to discern between ‘pronounceable’ and ‘random’.
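One way to read the “train it” suggestion is a character-bigram model: count which letter pairs occur in known-good words, then score a candidate by how likely its pairs are. The sketch below is only an illustration of that idea. The function names `train_bigram_model` and `bigram_score`, the tiny word list, and the add-one smoothing are all assumptions, not what the demo used.

```php
<?php
// Hypothetical helper: count character bigrams in a list of sample words.
// '^' and '$' mark the start and end of each word.
function train_bigram_model(array $words): array {
    $counts = [];
    $totals = [];
    foreach ($words as $word) {
        $w = '^' . strtolower($word) . '$';
        for ($i = 0; $i < strlen($w) - 1; $i++) {
            $a = $w[$i];
            $b = $w[$i + 1];
            $counts[$a][$b] = ($counts[$a][$b] ?? 0) + 1;
            $totals[$a] = ($totals[$a] ?? 0) + 1;
        }
    }
    return ['counts' => $counts, 'totals' => $totals];
}

// Hypothetical scorer: average log-probability per bigram, with add-one
// smoothing so unseen pairs are penalized but not impossible.
// Higher (closer to zero) means "more like the training words".
function bigram_score(array $model, string $s): float {
    $alphabet = 38; // assumed symbol set: a-z, 0-9 and the two markers
    $w = '^' . preg_replace('/[^a-z0-9]/', '', strtolower($s)) . '$';
    $n = strlen($w) - 1; // number of bigrams, always at least 1
    $sum = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $a = $w[$i];
        $b = $w[$i + 1];
        $count = $model['counts'][$a][$b] ?? 0;
        $total = $model['totals'][$a] ?? 0;
        $sum += log(($count + 1) / ($total + $alphabet));
    }
    return $sum / $n;
}

// Toy training data (an assumption); a real list would be far larger.
$sample_words = ['the', 'moon', 'tooth', 'bill', 'bob', 'hobbit', 'roof',
                 'mother', 'of', 'tom', 'book', 'smooth', 'home', 'both',
                 'hill', 'moth'];
$model = train_bigram_model($sample_words);

$human  = bigram_score($model, 'bilbomoothof');
$random = bigram_score($model, 'sdfgbhm342r3f');
var_dump($human > $random); // bool(true) with this word list
```

With a realistic training list (for example, a few thousand known-good usernames), the cutoff between “pronounceable” and “random” would still have to be tuned by hand, and names from other languages would score poorly unless they are represented in the training data.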

Answers:

Just use CAPTCHA as a part of the registration process.

You can never distinguish real usernames from bot-created usernames without severely annoying your users.

You will block users with bizarre or non-English names, which will irritate them, and the bots will just keep trying until they hit a good username (from a dictionary, or other sources – this is a very nice one, by the way!).

EDIT: Looking for prevention rather than after-the-fact analysis?

The solution is to let somebody else manage your users’ identities for you. For instance, you can use a small list of OpenID providers (like SO), or Facebook Connect, or both.
You’ll know for sure that the users are real, and that they have solved at least one CAPTCHA.

EDIT: Another Idea

Search for the string on Google, and check the number of matches found. It shouldn’t be your only tool, but it is a good indicator, too. Randomized strings, of course, should have few or no matches.

Answers:

Off the top of my head, you could look for syllables, making use of soundex. That’s the direction I would explore, based on the assumption that a pronounceable word has at least one syllable.

EDIT: Here’s a function for counting syllables:

function count_syllables($word) {
    // Patterns that make the vowel-group count too high:
    // each match subtracts one syllable.
    $subsyl = array(
        'cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$',
    );

    // Patterns that make the vowel-group count too low:
    // each match adds one syllable.
    $addsyl = array(
        'ia', 'riet', 'dien', 'iu', 'io', 'ii',
        '[aeiouym]bl$', '[aeiou]{3}', '^mc', 'ism$',
        '([^aeiouy])l$', '[^l]lien', '^coa[dglx].',
        '[^gq]ua[^auieo]', 'dnt$',
    );

    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    $word = preg_replace('/[^a-z]/', '', strtolower($word));

    // Split on runs of consonants; each non-empty piece is a vowel group.
    // Initialize the array so count() is safe when the word has no vowels.
    $valid_word_parts = array();
    foreach (preg_split('/[^aeiouy]+/', $word) as $part) {
        if ($part !== '') {
            $valid_word_parts[] = $part;
        }
    }

    $syllables = 0;
    // Thanks to Joe Kovar for correcting a bug in the following lines
    foreach ($subsyl as $syl) {
        $syllables -= preg_match('~' . $syl . '~', $word);
    }
    foreach ($addsyl as $syl) {
        $syllables += preg_match('~' . $syl . '~', $word);
    }
    if (strlen($word) == 1) {
        $syllables++;
    }
    $syllables += count($valid_word_parts);

    // Never report zero syllables.
    $syllables = ($syllables == 0) ? 1 : $syllables;
    return $syllables;
}

From this very interesting link:

http://www.addedbytes.com/php/flesch-kincaid-function/

Answers:

Reply for question #1:

Unfortunately, this cannot be done in general: the Kolmogorov complexity function is not computable, so you cannot write such an algorithm. Only if you apply some rules to the domain of possible usernames will you be able to perform heuristic analysis and make a decision, and even then it’s really hard to do.

PS: After posting this answer, I came across a service that gave me an example of such a restriction on the username domain: let users use a mailbox at a well-known public domain as their username.

Answers:

You could use a neural network to evaluate whether the nickname looks like a natural-language nickname.

Assemble two data sets: one of valid nicknames, and one of bogus generated ones. Train a simple back-propagating neural network with a single hidden layer, using the character values as inputs. The neural network will learn to discriminate between strings like “zrgssgbt” and “zargbyt”, since the latter has consonants and vowels intermingled.

It is important to use real-world examples to get a good discriminator.

Answers:

I don’t know of existing algorithms for this problem, but I think it can be attacked in any one of the following ways:

  • Your bot may be rubbish, but you can keep a list of syllables, or more specifically phonemes, and try to find them in your given string. This sounds a bit difficult, though, because you would need to segment the string in different places, etc.
  • There are 5 vowels in the English alphabet and 21 other letters. If the letters were randomly generated, you would expect approximately 5/26*W of them to be vowels (where W is the word length), and how far the actual count deviates from this can serve as a signal. (If digits are included, then 5/36, and so on.) You can build on this idea by looking at doubletons (pairs of letters) and checking whether each doubleton occurs with the same probability, etc.
  • Further, you can try to segment your input string around vowels, for example three letters before a vowel and three letters after a vowel, and try to find out whether it makes a recognizable sound by comparing with phonemes.
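The vowel-count idea from the second bullet can be sketched in a few lines. This is a minimal illustration, assuming ASCII usernames and treating only a, e, i, o, u as vowels; the function name `vowel_ratio` is made up for this example, and any threshold applied to its result would be a guess that needs tuning.

```php
<?php
// Hypothetical helper: share of vowels among the letters of $s.
// Digits and symbols are ignored; returns 0.0 if there are no letters.
function vowel_ratio(string $s): float {
    $letters = preg_replace('/[^a-z]/', '', strtolower($s));
    if ($letters === '') {
        return 0.0;
    }
    // preg_match_all returns the number of matches, i.e. the vowel count.
    $vowels = preg_match_all('/[aeiou]/', $letters);
    return $vowels / strlen($letters);
}

// Uniformly random letters would give about 5/26 (roughly 0.19).
printf("%.2f\n", 5 / 26);                         // 0.19
printf("%.2f\n", vowel_ratio('bilbomoothof'));    // 0.42
printf("%.2f\n", vowel_ratio('sdfgbhm342r3f'));   // 0.00
```

On its own this ratio is a weak signal: short names vary a lot, and the Polish names mentioned in an earlier answer would also score low.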
Answers:

In Russian, we have forbidden syllables, like ГЙ, a Ъ or Ь after a vowel, and so on.

However, spam bots just use a database of names; that’s why my spam inbox is full of strange names you can only meet in history books.

I expect English to have syllable distribution histograms too (like ETAOIN SHRDLU, but for two-letter or even three-letter syllables), and having a critical density of low-frequency syllables in one name is certainly a sign.

Answers:

Note that many large sites suggest usernames like [first init][middle init][last name][number]. The users then carry these usernames over to other sites, and the first three letters are definitely not pronounceable.