
php – Remove extra whitespaces from extracted PDF text

Posted by: admin July 12, 2020

Questions:

I have extracted the text from a PDF file, and some of the words have extra whitespace between their letters:

Your water a n d wastewater s t a t e m e n t

I wrote a function to remove the extra spaces from the text above.

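function removeExtraWhitespace($val) {
    $nval = "";
    $len = strlen($val);

    for ($i = 0; $i < $len; $i++) {
        // Non-space characters are always kept. A space is kept only if the
        // character two positions before or after it is not a space, i.e. the
        // space borders a token longer than one letter.
        // The $i >= 2 check avoids negative string offsets, which PHP 7.1+
        // would resolve from the end of the string.
        if ($val[$i] !== " ") {
            $nval .= $val[$i];
        } elseif (($i >= 2 && $val[$i - 2] !== " ") || (isset($val[$i + 2]) && $val[$i + 2] !== " ")) {
            $nval .= $val[$i];
        }
    }
    return $nval;
}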

This outputs:

Your water and wastewater statement
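For comparison, roughly the same heuristic can be written as a single regular expression: drop a space only when it sits between two one-letter tokens. This is a minimal sketch, not a fix; the helper name `removeExtraWhitespaceRegex` is made up for illustration, it only considers ASCII letters, and it has the same limitations as the loop above.

```php
<?php
// Hypothetical helper name, used only for this illustration.
function removeExtraWhitespaceRegex(string $val): string
{
    // Remove a space only when it sits between two single-letter tokens:
    //   (?<=\b[A-Za-z])  the space is preceded by a one-letter word
    //   (?=[A-Za-z]\b)   the space is followed by a one-letter word
    return preg_replace('/(?<=\b[A-Za-z]) (?=[A-Za-z]\b)/', '', $val);
}

echo removeExtraWhitespaceRegex('Your water a n d wastewater s t a t e m e n t'), PHP_EOL;
// prints: Your water and wastewater statement
```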

I know that this function will not work in all circumstances, though. It fails if the text contains a valid one-letter word such as ‘a’, or if only part of a word has extra spaces. For example:

I n e e d to remove whitespaces f r o m a string

Passing the above text to my function outputs:

Ineed to remove whitespaces froma string

Is there a way to make a function that will work on all possible text?

Answers:

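Deciding where the real word boundaries are is essentially a spelling-correction problem, and that is hard to solve by hand. I think you should use an online spelling-correction service instead. You can do something like this: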

// Sends $post to the spell-check endpoint used by the Reverso web page and
// returns the raw response body (or false if the request fails).
function curl($post)
{
    $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://orthographe.reverso.net/RISpellerWS/RestSpeller.svc/v1/CheckSpellingAsXml/language=eng?outputFormat=json&doReplacements=false&interfLang=en&dictionary=both&spellOrigin=interactive&includeSpellCheckUnits=true&includeExtraInfo=true&isStandaloneSpeller=true');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);    // the text to check is the POST body
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Created: 01/01/0001 00:00:00',
        'Referer: http://www.reverso.net/spell-checker/english-spelling-grammar/',
        'Username: OnlineSpellerWS'
    ));
    $icerik = curl_exec($ch); // "icerik" = content: the raw response body
    curl_close($ch);
    return $icerik;
}

$response = json_decode(curl('Ineed to remove whitespaces froma string'));

// Only read the corrected text if the request succeeded and returned valid JSON.
if (isset($response->AutoCorrectedText)) {
    var_dump($response->AutoCorrectedText);
}

This is only meant to illustrate the idea. I am sure there are spelling-correction websites that provide a proper API.
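If calling an external service is not an option, a local word list can handle the most common case. The following is a rough sketch under stated assumptions, not a general solution: it only looks at runs of single ASCII letters separated by single spaces, joins them, and splits them back into known words with a small dynamic-programming search. The function names `segmentLetters` and `repairSpacedLetters` and the tiny `$words` list are made up for illustration; a real word list would be loaded from a dictionary file.

```php
<?php
// Splits a string of joined letters into words, preferring dictionary words.
// $dict maps lowercase words to any non-null value (used as a lookup set).
function segmentLetters(string $letters, array $dict): string
{
    $n = strlen($letters);
    $best = array_fill(0, $n + 1, PHP_INT_MAX); // cheapest split of the first $i letters
    $prev = array_fill(0, $n + 1, 0);           // start offset of the last piece in that split
    $best[0] = 0;

    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            $piece = substr($letters, $start, $end - $start);
            if (isset($dict[strtolower($piece)])) {
                $cost = 1;   // known word: cheap
            } elseif ($end - $start === 1) {
                $cost = 10;  // unknown single letter: allowed, but expensive
            } else {
                continue;    // unknown multi-letter piece: not allowed
            }
            if ($best[$start] + $cost < $best[$end]) {
                $best[$end] = $best[$start] + $cost;
                $prev[$end] = $start;
            }
        }
    }

    // Walk the back-pointers from the end to rebuild the chosen pieces.
    $pieces = [];
    for ($end = $n; $end > 0; $end = $prev[$end]) {
        array_unshift($pieces, substr($letters, $prev[$end], $end - $prev[$end]));
    }
    return implode(' ', $pieces);
}

// Repairs runs such as "f r o m a" while leaving normally spaced text untouched.
function repairSpacedLetters(string $text, array $words): string
{
    $dict = array_flip($words);
    return preg_replace_callback(
        '/\b[A-Za-z](?: [A-Za-z])+\b/', // two or more single letters separated by one space
        function (array $m) use ($dict): string {
            return segmentLetters(str_replace(' ', '', $m[0]), $dict);
        },
        $text
    );
}

// Hypothetical, deliberately tiny word list for the two examples in the question.
$words = ['a', 'i', 'and', 'need', 'from', 'statement', 'to'];

echo repairSpacedLetters('I n e e d to remove whitespaces f r o m a string', $words), PHP_EOL;
// prints: I need to remove whitespaces from a string
```

The quality of the result depends entirely on the word list, and words that are only partly spaced out (for example a single stray space inside a word) are not touched by this sketch.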