Pandas DataFrame: remove unwanted parts from strings in a column

Posted by: admin November 30, 2017


I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))


I’d use the pandas replace function, which is simple and powerful because it accepts regular expressions. The regex \D matches any non-digit character, so replacing every match with an empty string leaves only the digits. You could obviously get more creative with the regex.
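The code for this answer did not survive on this page, so here is a minimal sketch of what it describes, assuming the sample data from the question. The choice of Series.replace with regex=True is my assumption; the answer only says "the pandas replace function".

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    'time': ['09:00', '10:00', '11:00', '12:00', '13:00'],
    'result': ['+52A', '+62B', '+44a', '+30b', '-110a'],
})

# \D matches any single non-digit character; with regex=True each match
# is replaced by the empty string, so only the digits remain.
data['result'] = data['result'].replace(to_replace=r'\D', value='', regex=True)

print(data)
```

The values are still strings after this step; append .astype(int) if you need numbers.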



In the particular case where you know how many characters you want to remove from each value in the DataFrame column, you can use string slicing inside a lambda function to get rid of those parts:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])
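The same fixed-position idea can probably also be written with the .str accessor, which supports slice syntax directly. This is a sketch on the question's sample data, and it only works because every value there has exactly one leading and one trailing character to drop.

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    'time': ['09:00', '10:00', '11:00', '12:00', '13:00'],
    'result': ['+52A', '+62B', '+44a', '+30b', '-110a'],
})

# .str[1:-1] slices each string: drop the first and the last character.
data['result'] = data['result'].str[1:-1]

print(data)
```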


There’s a bug here: you currently cannot pass arguments to str.lstrip and str.rstrip.


EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1     52
2     62
3     44
4     30
5    110
Name: result


I’ve found big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be fastest – see code race below:

import pandas as pd
# map with a lambda
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
# List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
# .str accessor
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop
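The timings above come from a five-row frame, so it may be worth re-running the comparison on something closer to your real data before picking a method. Below is a rough sketch using the standard timeit module instead of the %timeit magic. The input size and the function names are my own choices for illustration, and no particular ranking is implied.

```python
import timeit

import pandas as pd

# Hypothetical larger input: the question's five values repeated 2,000 times.
series = pd.Series(['+52A', '+62B', '+44a', '+30b', '-110a'] * 2_000)


def with_map():
    return series.map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))


def with_list_comprehension():
    stripped = [x.lstrip('+-').rstrip('aAbBcC') for x in series]
    return pd.Series(stripped, index=series.index)


def with_str_accessor():
    return series.str.lstrip('+-').str.rstrip('aAbBcC')


# Each function reads the untouched input, so every run does the same work.
timings = {}
for func in (with_map, with_list_comprehension, with_str_accessor):
    # Best of 3 repeats, 3 calls per repeat.
    timings[func.__name__] = min(timeit.repeat(func, number=3, repeat=3))
    print(f"{func.__name__}: {timings[func.__name__]:.4f} s for 3 calls")
```

Unlike the snippets above, this does not assign the result back to the column inside the timed statement, so later iterations are not working on already-stripped strings.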


A very simple method would be to use the extract method to select the digits. Simply supply it the regular expression '(\d+)', which captures the first run of one or more consecutive digits in each string.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False).astype(int)

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110


Put this to the right of the result column to get the result.
