Home » Python » Simple way to remove multiple spaces in a string?

Simple way to remove multiple spaces in a string?

Posted by: admin January 29, 2018 Leave a comment

Questions:

Suppose this is the string:

The   fox jumped   over    the log.

It would result in:

The fox jumped over the log.

What is the simplest, 1-2 liner that can do this? Without splitting and going into lists…

Answers:
>>> import re
>>> re.sub(' +',' ','The     quick brown    fox')
'The quick brown fox'

Questions:
Answers:

foo is your string:

" ".join(foo.split())

Be warned though this removes “all whitespace characters (space, tab, newline, return, formfeed)”. (Thanks to hhsaffar, see comments) ie "this is \t a test\n" will effectively end up as "this is a test"

Questions:
Answers:
import re
s = "The   fox jumped   over    the log."
re.sub("\s\s+" , " ", s)

or

re.sub("\s\s+", " ", s)

since the space before comma is listed as a pet peeve in PEP8, as mentioned by moose in the comments.

Questions:
Answers:

Using regexes with “\s” and doing simple string.split()’s will also remove other whitespace – like newlines, carriage returns, tabs. Unless this is desired, to only do multiple spaces, I present these examples.


EDIT: As I’m wont to do, I slept on this, and besides correcting a typo on the last results (v3.3.3 @ 64-bit, not 32-bit), the obvious hit me: the test string was rather trivial.

So, I got … 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum to get more-realistic time tests. I then added random-length extra spaces throughout:

original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))

I also corrected the “proper join“; if one cares, the one-liner will essentially do a strip of any leading/trailing spaces, this corrected version preserves a leading/trailing space (but only ONE ;-). (I found this because the randomly-spaced lorem_ipsum got extra spaces on the end and thus failed the assert.)


# setup = '''

import re

def while_replace(string):
    while '  ' in string:
        string = string.replace('  ', ' ')

    return string

def re_replace(string):
    return re.sub(r' {2,}' , ' ', string)

def proper_join(string):
    split_string = string.split(' ')

    # To account for leading/trailing spaces that would simply be removed
    beg = ' ' if not split_string[ 0] else ''
    end = ' ' if not split_string[-1] else ''

    # versus simply ' '.join(item for item in string.split(' ') if item)
    return beg + ' '.join(item for item in split_string if item) + end

original_string = """Lorem    ipsum        ... no, really, it kept going...          malesuada enim feugiat.         Integer imperdiet    erat."""

assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)

#'''

# while_replace_test
new_string = original_string[:]

new_string = while_replace(new_string)

assert new_string != original_string

# re_replace_test
new_string = original_string[:]

new_string = re_replace(new_string)

assert new_string != original_string

# proper_join_test
new_string = original_string[:]

new_string = proper_join(new_string)

assert new_string != original_string

NOTE: The “while version” made a copy of the original_string, as I believe once modified on the first run, successive runs would be faster (if only by a bit). As this adds time, I added this string copy to the other two so that the times showed the difference only in the logic. Keep in mind that the main stmt on timeit instances will only be executed once; the original way I did this, the while loop worked on the same label, original_string, thus the second run, there would be nothing to do. The way it’s set up now, calling a function, using two different labels, that isn’t a problem. I’ve added assert statements to all the workers to verify we change something every iteration (for those who may be dubious). E.g., change to this and it breaks:

# while_replace_test
new_string = original_string[:]

new_string = while_replace(new_string)

assert new_string != original_string # will break the 2nd iteration

while '  ' in original_string:
    original_string = original_string.replace('  ', ' ')

Tests run on a laptop with an i5 processor running Windows 7 (64-bit).

timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)

test_string = 'The   fox jumped   over\n\t    the log.' # trivial

Python 2.7.3, 32-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001066 |   0.001260 |   0.001128 |   0.001092
     re_replace_test |   0.003074 |   0.003941 |   0.003357 |   0.003349
    proper_join_test |   0.002783 |   0.004829 |   0.003554 |   0.003035

Python 2.7.3, 64-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001025 |   0.001079 |   0.001052 |   0.001051
     re_replace_test |   0.003213 |   0.004512 |   0.003656 |   0.003504
    proper_join_test |   0.002760 |   0.006361 |   0.004626 |   0.004600

Python 3.2.3, 32-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001350 |   0.002302 |   0.001639 |   0.001357
     re_replace_test |   0.006797 |   0.008107 |   0.007319 |   0.007440
    proper_join_test |   0.002863 |   0.003356 |   0.003026 |   0.002975

Python 3.3.3, 64-bit, Windows
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.001444 |   0.001490 |   0.001460 |   0.001459
     re_replace_test |   0.011771 |   0.012598 |   0.012082 |   0.011910
    proper_join_test |   0.003741 |   0.005933 |   0.004341 |   0.004009

test_string = lorem_ipsum
# Thanks to http://www.lipsum.com/
# "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"

Python 2.7.3, 32-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.342602 |   0.387803 |   0.359319 |   0.356284
     re_replace_test |   0.337571 |   0.359821 |   0.348876 |   0.348006
    proper_join_test |   0.381654 |   0.395349 |   0.388304 |   0.388193    

Python 2.7.3, 64-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.227471 |   0.268340 |   0.240884 |   0.236776
     re_replace_test |   0.301516 |   0.325730 |   0.308626 |   0.307852
    proper_join_test |   0.358766 |   0.383736 |   0.370958 |   0.371866    

Python 3.2.3, 32-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.438480 |   0.463380 |   0.447953 |   0.446646
     re_replace_test |   0.463729 |   0.490947 |   0.472496 |   0.468778
    proper_join_test |   0.397022 |   0.427817 |   0.406612 |   0.402053    

Python 3.3.3, 64-bit
                test |      minum |    maximum |    average |     median
---------------------+------------+------------+------------+-----------
  while_replace_test |   0.284495 |   0.294025 |   0.288735 |   0.289153
     re_replace_test |   0.501351 |   0.525673 |   0.511347 |   0.508467
    proper_join_test |   0.422011 |   0.448736 |   0.436196 |   0.440318

For the trivial string, it would seem that a while-loop is the fastest, followed by the Pythonic string-split/join, and regex pulling up the rear.

For non-trivial strings, seems there’s a bit more to consider. 32-bit 2.7? It’s regex to the rescue! 2.7 64-bit? A while loop is best, by a decent margin. 32-bit 3.2, go with the “proper” join. 64-bit 3.3, go for a while loop. Again.

In the end, one can improve performance if/where/when needed, but it’s always best to remember the mantra:

  1. Make It Work
  2. Make It Right
  3. Make It Fast

IANAL, YMMV, Caveat Emptor!

Questions:
Answers:

Have to agree with Paul McGuire’s comment above. To me,

         ' '.join(the_string.split())

is vastly preferable to whipping out a regex. My measurements (Linux, Python 2.5) show the split-then-join to be almost 5 times faster than doing the “re.sub(…)”, and still 3 times faster if you precompile the regex once and do the operation multiple times. And it is by any measure easier to understand — much more pythonic.

Questions:
Answers:

Similar to the previous solutions, but more specific: replace two or more spaces with one:

>>> import re
>>> s = "The   fox jumped   over    the log."
>>> re.sub('\s{2,}', ' ', s)
'The fox jumped over the log.'

Questions:
Answers:

A simple soultion

>>> import re
>>> s="The   fox jumped   over    the log."
>>> print re.sub('\s+',' ', s)
The fox jumped over the log.

Questions:
Answers:

Other alternative

>>> import re
>>> str = 'this is a            string with    multiple spaces and    tabs'
>>> str = re.sub('[ \t]+' , ' ', str)
>>> print str
this is a string with multiple spaces and tabs

Questions:
Answers:

One line of code to remove all extra spaces before, after, and within a sentence:

sentence = "  The   fox jumped   over    the log.  "
sentence = ' '.join(filter(None,sentence.split(' ')))

Explanation:

  1. Split entire string into list.
  2. Filter empty elements from list.
  3. Rejoin remaining elements* with single space

*Remaining elements should be words or words with punctuations, etc. I did not test this extensively, but this should be a good starting point. All the best!

Questions:
Answers:

This also seems to work:

while "  " in s:
    s=s.replace("  "," ")

Where the variable s represents your string.

Questions:
Answers:
def unPretty(S):
   # given a dictionary, json, list, float, int, or even a string.. 
   # return a string stripped of CR, LF replaced by space, with multiple spaces reduced to one.
   return ' '.join( str(S).replace('\n',' ').replace('\r','').split() )

Questions:
Answers:

If it’s whitespace you’re dealing with splitting on None will not include empty string in the returned value.

https://docs.python.org/2/library/stdtypes.html#str.split

Questions:
Answers:
string='This is a             string full of spaces          and taps'
string=string.split(' ')
while '' in string:
    string.remove('')
string=' '.join(string)
print(string)

results:

This is a string full of spaces and taps

Questions:
Answers:

To remove white space, considering leading, trailing and extra white space in between words, use:

(?<=\s) +|^ +(?=\s)| (?= +[\n\0])

the first or deals with leading white space, the second or deals with start of string leading white space, and the last one deals with trailing white space

for proof of use this link will provide you with a test.

https://regex101.com/r/meBYli/4

let me know if you find an input that will break this regex code.

ALSO – this is to be used with the re.split function

Questions:
Answers:

In some cases it’s desirable to replace consecutive occurrences of every whitespace character with a single instance of that character. You’d use a regular expression with backreferences to do that.

(\s)\1{1,} matches any whitespace character, followed by one or more occurrences of that character. Now, all you need to do is specify the first group (\1) as the replacement for the match.

Wrapping this in a function:

import re

def normalize_whitespace(string):
    return re.sub(r'(\s)\1{1,}', r'\1', string)
>>> normalize_whitespace('The   fox jumped   over    the log.')
'The fox jumped over the log.'
>>> normalize_whitespace('First    line\t\t\t \n\n\nSecond    line')
'First line\t \nSecond line'

Questions:
Answers:

I haven’t read a lot into the other examples but I have just created this method for consolidating multiple consecutive space characters.

It does not use any libraries, and whilst it is relatively long in terms of script length it is not a complex implementation

def spaceMatcher(command):
    """
    function defined to consolidate multiple whitespace characters in 
    strings to a single space
    """
    #initiate index to flag if more than 1 consecutive character 
    iteration
    space_match = 0
    space_char = ""
    for char in command:
      if char == " ":
          space_match += 1
          space_char += " "
      elif (char != " ") & (space_match > 1):
          new_command = command.replace(space_char, " ")
          space_match = 0
          space_char = ""
      elif char != " ":
          space_match = 0
          space_char = ""
   return new_command

command = None
command = str(input("Please enter a command ->"))
print(spaceMatcher(command))
print(list(spaceMatcher(command)))

Questions:
Answers:

import re
string = re.sub(‘[ \t\n]+’,’ ‘,’The quick brown \n\n \t fox’)

This will remove all the tab, new lines and multiple white spaces with single white space.