Detect language from string in PHP

Posted by: admin, December 3, 2017

Questions:

In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.

Answers:

You cannot detect the language from the character set alone, and there is no foolproof way to do this.

With any method, you are only making an educated guess. There are some math-related articles on the subject out there.
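To make the "educated guess" concrete, here is a minimal sketch of one commonly described statistical approach: comparing character trigrams of the input against reference text per language. The function names `trigramSet` and `guessByTrigrams` and the two toy sample sentences are assumptions made for illustration, not part of any library, and real profiles would need far more text. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: collect the distinct character trigrams of a string.
function trigramSet(string $text): array
{
    $text = mb_strtolower($text, 'UTF-8');
    $set = [];
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i + 3 <= $length; $i++) {
        $set[mb_substr($text, $i, 3, 'UTF-8')] = true;
    }
    return $set;
}

// Hypothetical helper: score each language by how many of the input's
// distinct trigrams also occur in that language's sample text.
function guessByTrigrams(string $text, array $samples): array
{
    $input = trigramSet($text);
    $scores = [];
    foreach ($samples as $language => $sample) {
        $scores[$language] = count(array_intersect_key($input, trigramSet($sample)));
    }
    arsort($scores); // highest overlap first, keys preserved
    return $scores;
}

// Toy reference texts (assumptions; far too small for real use).
$samples = [
    'en' => 'the quick brown fox jumps over the lazy dog and the cat',
    'de' => 'der schnelle braune fuchs springt über den faulen hund und die katze',
];

print_r(guessByTrigrams('the dog and the fox', $samples));
```

The result is a ranking, not a verdict: short or mixed-language input can easily produce close scores, which is why a fallback is still needed.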

Answers:

I’ve used the Text_LanguageDetect PEAR package with reasonable results. It’s dead simple to use, and it has a modest 52-language database. The downside is that it does not detect East Asian languages.

require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();
$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
    echo $result->getMessage();
} else {
    print_r($result);
}

results in:

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)
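The scores in that output are fairly close together, so it may be worth accepting the top result only when it clearly beats the runner-up. A minimal sketch, assuming a hypothetical helper `pickLanguage` and an arbitrary margin (neither is part of Text_LanguageDetect):

```php
<?php
// Hypothetical helper: accept the top-scoring language only when it beats
// the runner-up by at least $minMargin; otherwise return $fallback.
function pickLanguage(array $scores, float $minMargin, ?string $fallback = null): ?string
{
    if (count($scores) === 0) {
        return $fallback;
    }
    arsort($scores);                     // best score first, keys preserved
    $languages = array_keys($scores);
    $values = array_values($scores);
    if (count($values) > 1 && ($values[0] - $values[1]) < $minMargin) {
        return $fallback;                // too close to call
    }
    return $languages[0];
}

// Scores copied from the sample output above.
$result = [
    'german'  => 0.407037037037,
    'dutch'   => 0.288065843621,
    'english' => 0.283333333333,
    'danish'  => 0.234526748971,
];

echo pickLanguage($result, 0.1, 'unknown'), "\n";
echo pickLanguage($result, 0.2, 'unknown'), "\n";
```

The margin values 0.1 and 0.2 are only illustrative; a sensible threshold depends on the kind of text you feed in.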

Answers:

You could do this entirely client side with Google’s AJAX Language API (now defunct).

With the AJAX Language API, you can translate and detect the language of blocks of text within a webpage using only Javascript. In addition, you can enable transliteration on any textfield or textarea in your web page. For example, if you were transliterating to Hindi, this API will allow users to phonetically spell out Hindi words using English and have them appear in the Hindi script.

You can automatically detect a string’s language:

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language + "";
  }
});

And translate any string written in one of the supported languages (also defunct)

google.language.translate("Hello world", "en", "es", function(result) {
  if (!result.error) {
    var container = document.getElementById("translation");
    container.innerHTML = result.translation;
  }
});

Answers:

I know this is an old post, but here is what I developed after not finding any viable solution.

  • other suggestions are all too heavy and too cumbersome for my situation
  • I support a finite number of languages on my website (at the moment two: ‘en’ and ‘de’, but the solution generalises to more).
  • I need a plausible guess about the language of a user-generated string, and I have a fallback (the language setting of the user).
  • So I want a solution with minimal false positives – but don’t care so much about false negatives.

The solution takes the 20 most common words of each language and counts their occurrences in the haystack. It then compares the counts of the two highest-scoring languages: if the runner-up’s count is less than 10% of the winner’s, the winner takes it all; otherwise the default is returned.

Code – Any suggestions for speed improvement are more than welcome!

    function getTextLanguage($text, $default) {
      $supported_languages = array(
          'en',
          'de',
      );
      // German word list
      // from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
      $wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von', 
          'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im', 
          'dem', 'nicht', 'ein', 'Die', 'eine');
      // English word list
      // from http://en.wikipedia.org/wiki/Most_common_words_in_English
      $wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in', 
          'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he', 
          'as', 'you', 'do', 'at');
      // clean out the input string - keep letters only (the /u modifier makes
      // \p{L} match non-ASCII letters such as the 'ü' in 'für') and pad with
      // spaces so that words at the very start or end are counted as well
      $text = ' ' . preg_replace('/[^\p{L}]/u', ' ', $text) . ' ';
      // count the occurrences of the most frequent words
      foreach ($supported_languages as $language) {
        $counter[$language]=0;
      }
      for ($i = 0; $i < 20; $i++) {
        foreach ($supported_languages as $language) {
          $counter[$language] = $counter[$language] + 
            // I believe this is way faster than fancy RegEx solutions
            substr_count($text, ' ' . $wordList[$language][$i] . ' ');
        }
      }
      // get max counter value
      // from http://stackoverflow.com/a/1461363
      $max = max($counter);
      $maxs = array_keys($counter, $max);
      // if there are two winners - fall back to default!
      if (count($maxs) == 1) {
        $winner = $maxs[0];
        $second = 0;
        // get runner-up (second place)
        foreach ($supported_languages as $language) {
          if ($language <> $winner) {
            if ($counter[$language]>$second) {
              $second = $counter[$language];
            }
          }
        }
        // apply arbitrary threshold of 10%
        // ($max can only be 0 here if a single language is supported)
        if ($max > 0 && ($second / $max) < 0.1) {
          return $winner;
        } 
      }
      return $default;
    }
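On the speed question, one possible variant of the same idea is to split the text into words once and count them through a lookup table, instead of scanning the whole string for every word. This is only a sketch: the function name `guessLanguageByCommonWords`, the shortened word lists, and the `$threshold` parameter are assumptions for illustration, and it needs the mbstring extension.

```php
<?php
// Hypothetical variant of the common-words approach: tokenise once, then
// look each list word up in a frequency table.
function guessLanguageByCommonWords(string $text, array $wordLists, string $default, float $threshold = 0.1): string
{
    // split on anything that is not a letter; lowercase for matching
    $tokens = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return $default;                 // input was not valid UTF-8
    }
    $frequency = array_count_values($tokens);

    $counter = [];
    foreach ($wordLists as $language => $words) {
        $counter[$language] = 0;
        foreach ($words as $word) {
            $counter[$language] += $frequency[mb_strtolower($word, 'UTF-8')] ?? 0;
        }
    }

    arsort($counter);                    // highest count first
    $languages = array_keys($counter);
    $counts = array_values($counter);
    if (count($counts) === 0 || $counts[0] === 0) {
        return $default;                 // nothing matched at all
    }
    $runnerUp = $counts[1] ?? 0;
    // winner takes it all only if the runner-up is below the threshold
    return ($runnerUp / $counts[0]) < $threshold ? $languages[0] : $default;
}

// Shortened lists (first five entries of each list above).
$wordLists = [
    'de' => ['der', 'die', 'und', 'in', 'den'],
    'en' => ['the', 'be', 'to', 'of', 'and'],
];

echo guessLanguageByCommonWords('The cat and the dog went to the park', $wordLists, 'xx'), "\n";
echo guessLanguageByCommonWords('Der Hund und die Katze', $wordLists, 'xx'), "\n";
```

Unlike the original, this version lowercases everything, so 'Die' and 'die' count as the same word. Whether that is desirable depends on your word lists.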

Answers:

As the Google Translate API is closing down as a free service, you can try this free alternative, which is a replacement for it:

http://detectlanguage.com

Answers:

You can use the API of the Lang ID service: http://langid.net/identify-language-from-api.html

Answers:

You can probably use the Google Translate API to detect the language and translate it if necessary.

Answers:

I tried the Text_LanguageDetect library and the results I got were not very good (for instance, the text “test” was identified as Estonian and not English).

I can recommend trying the Yandex Translate API, which is free for up to 1 million characters per 24 hours and up to 10 million characters a month.
According to the documentation, it supports over 60 languages.

<?php
function identifyLanguage($text)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/detect?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (strlen($outputJson->lang) > 0)
            {
                return $outputJson->lang;
            }
        }
    }

    return "unknown";
}

function translateText($text, $targetLang)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/translate?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text) . "&lang=" . urlencode($targetLang);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (count($outputJson->text) > 0 && strlen($outputJson->text[0]) > 0)
            {
                return $outputJson->text[0];
            }
        }
    }

    return $text;
}

header("content-type: text/html; charset=UTF-8");

echo identifyLanguage("エクスペリエンス");
echo "<br>";
echo translateText("エクスペリエンス", "en");
echo "<br>";
echo translateText("エクスペリエンス", "es");
echo "<br>";
echo translateText("エクスペリエンス", "zh");
echo "<br>";
echo translateText("エクスペリエンス", "he");
echo "<br>";
echo translateText("エクスペリエンス", "ja");
echo "<br>";
?>

Answers:

One approach might be to break the input string into words and then look up those words in an English dictionary to see how many of them are present. This approach has a few limitations:

  • proper nouns may not be handled well
  • spelling errors can disrupt your lookups
  • abbreviations like “lol” or “b4” won’t necessarily be in the dictionary
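
The dictionary idea can be sketched as a simple hit ratio: the share of words in the input that appear in a word list. The helper name `dictionaryHitRatio` and the tiny stand-in word list are assumptions; a real check would load a full dictionary, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: fraction of the words in $text found in $dictionary.
function dictionaryHitRatio(string $text, array $dictionary): float
{
    // split on anything that is not a letter or digit, so "b4" stays one word
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false || count($words) === 0) {
        return 0.0;
    }
    $lookup = array_flip($dictionary);   // word => index, for fast isset() checks
    $hits = 0;
    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            $hits++;
        }
    }
    return $hits / count($words);
}

// Tiny stand-in word list (an assumption, not a real dictionary).
$english = ['see', 'you', 'at', 'the', 'station', 'before', 'lunch'];

echo dictionaryHitRatio('See you at the station', $english), "\n";
echo dictionaryHitRatio('lol c u b4 lunch', $english), "\n";
```

The second call illustrates the limitation from the list: informal abbreviations drag the ratio down even though the text is English.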
Answers:

Perhaps submit the string to this language guesser:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser

Answers:

I would take documents in various languages and tally which Unicode characters they use. You could then apply some Bayesian reasoning to determine the language just from the Unicode characters used. This would separate French from English or Russian.

I am not sure what else could be done, except looking up the words in language dictionaries to determine the language (using a similar probabilistic approach).
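
As a rough first step in that direction, PCRE's Unicode script properties can count which scripts a string uses. This is a sketch under assumptions: the helper name `scriptCounts` and the choice of scripts are mine. The script alone tells Russian from English, but French and English both come out as Latin, so separating those would need finer per-character statistics on top.

```php
<?php
// Hypothetical helper: count how many characters of $text belong to each script.
function scriptCounts(string $text): array
{
    $scripts = ['Latin', 'Cyrillic', 'Greek', 'Han', 'Hiragana', 'Katakana', 'Arabic', 'Hebrew'];
    $counts = [];
    foreach ($scripts as $script) {
        // \p{Cyrillic} etc. are Unicode script properties; /u enables UTF-8 mode
        $n = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($n > 0) {
            $counts[$script] = $n;
        }
    }
    arsort($counts);                     // most frequent script first
    return $counts;
}

print_r(scriptCounts('Привет, мир'));
print_r(scriptCounts('Bonjour tout le monde'));
print_r(scriptCounts('エクスペリエンス'));
```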

Answers:

You can detect the language of a string in PHP using the Text_LanguageDetect PEAR package, either installed through PEAR or downloaded and used separately like a regular PHP library.

Answers:

You could implement an Apache Tika module in Java, write the results to a text file, a database, etc., and then read them from there with PHP.
If you don’t have that much content, you could use Google’s API, although keep in mind that your calls will be limited and you can only send a restricted number of characters to the API. At the time of writing I had finished testing version 1 of the API (which turned out to be not so accurate) and the labs version 2 (which I ditched after reading that there is a cap of 100,000 characters per day).

Answers:

Try using the character codes.
I use this code to tell ru/en apart in my social bot project:

function language($string) {
        $ru = array("208","209","208176","208177","208178","208179","208180","208181","209145","208182","208183","208184","208185","208186","208187","208188","208189","208190","208191","209128","209129","209130","209131","209132","209133","209134","209135","209136","209137","209138","209139","209140","209141","209142","209143");
        $en = array("97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122");
        $htmlcharacters = array("<", ">", "&amp;", "&lt;", "&gt;", "&");
        $string = str_replace($htmlcharacters, "", $string);
        //Strip out the slashes
        $string = stripslashes($string);
        $badthings = array("=", "#", "~", "!", "?", ".", ",", "<", ">", "/", ";", ":", '"', "'", "[", "]", "{", "}", "@", "$", "%", "^", "&", "*", "(", ")", "-", "_", "+", "|", "`");
        $string = str_replace($badthings, "", $string);
        $string = mb_strtolower($string);
        $msgarray = explode(" ", $string);
        $words = count($msgarray);
        // look at the first byte of the first word; Cyrillic letters in
        // UTF-8 start with byte 208 or 209
        $letters = str_split($msgarray[0]);
        // ToAscii() is not defined in the original post; ord() is assumed
        // to be an equivalent replacement for a single byte
        $letters = (string) ord($letters[0]);
        if (in_array($letters, $ru)) {
            $result = 'Русский'; //russian
        } elseif (in_array($letters, $en)) {
            $result = 'Английский'; //english
        } else {
            $result = 'ошибка' . $letters; //error
        }
        return $result;
}

Answers:

The Text_LanguageDetect PEAR package produced terrible results: “luxury apartments downtown” is detected as Portuguese…

The Google API is still the best solution. They give $300 of free credit and warn you before charging anything.

Below is a super simple function that uses file_get_contents to fetch the language detected by the API, so there is no need to download or install libraries etc.

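function guess_lang($str) {

    // rawurlencode() escapes every reserved character, not just spaces
    $content = file_get_contents("https://translation.googleapis.com/language/translate/v2/detect?key=YOUR_API_KEY&q=" . rawurlencode($str));

    // the request itself failed
    if ($content === false)
        return null;

    $lang = json_decode($content, true);

    if (isset($lang["data"]["detections"][0][0]["language"]))
        return $lang["data"]["detections"][0][0]["language"];

    return null;
 }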

Execute:

echo guess_lang("luxury apartments downtown montreal"); // returns "en"

You can get your Google Translate API key here: https://console.cloud.google.com/apis/library/translate.googleapis.com/

This is a simple example for short phrases to get you going. For more complex applications you’ll want to restrict your API key and use the client library.
