
Pandas DataFrame: remove unwanted parts from strings in a column

Posted by: admin November 30, 2017

Questions:

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!

Answers:
data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

Answers:

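I’d use the pandas replace function. It is simple and powerful because it accepts regular expressions. Below I’m using the regex \D to remove every non-digit character, but you could get quite creative with regex. Assigning the result back to the column avoids relying on inplace=True on a selected column, which may not update the original DataFrame in newer pandas versions.

data['result'] = data['result'].replace(to_replace=r'\D', value='', regex=True)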
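The same idea can be sketched with the string accessor, assuming a pandas version where Series.str.replace accepts regex=True. The sample frame is rebuilt here only for illustration. Note that \D also removes signs and decimal points, so this only suits data like the question's, where just the digits matter.

```python
import pandas as pd

data = pd.DataFrame({
    'time': ['09:00', '10:00', '11:00', '12:00', '13:00'],
    'result': ['+52A', '+62B', '+44a', '+30b', '-110a'],
})

# \D matches any single non-digit character; each one is replaced
# with an empty string, leaving only the digits.
data['result'] = data['result'].str.replace(r'\D', '', regex=True)
print(data['result'].tolist())
```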

Answers:

In the particular case where you know how many characters you want to remove from each value in the column, you can use string slicing inside a lambda function to drop them:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])
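As a rough sketch, the same slicing can also be written without a lambda through the .str accessor, which applies the slice to every element. This assumes every value is a string with exactly one unwanted character at each end, as in the question's sample data.

```python
import pandas as pd

data = pd.DataFrame({'result': ['+52A', '+62B', '+44a', '+30b', '-110a']})

# .str[1:-1] slices each element: skip the first character
# and drop the last one.
trimmed = data['result'].str[1:-1]
print(trimmed.tolist())
```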

Answers:

There’s a bug here: you currently cannot pass arguments to str.lstrip and str.rstrip:

http://github.com/pydata/pandas/issues/2411

EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result
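On a pandas version where these methods accept arguments, the two calls could also be collapsed into a single str.strip. This is only a sketch, and it assumes it is acceptable to strip both character sets from either end of each string:

```python
import pandas as pd

df = pd.DataFrame({'result': ['+52A', '+62B', '+44a', '+30b', '-110a']})

# str.strip removes any of the listed characters from both ends
# of each string, stopping at the first character not in the set.
stripped = df['result'].str.strip('+-aAbBcC')
print(stripped.tolist())
```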

Answers:

I’ve found big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be fastest – see code race below:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop
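Timings on five rows are dominated by fixed overhead, so the ranking may change on larger data. Below is a minimal sketch for rerunning the comparison outside IPython with the standard timeit module. The repeated sample values and the helper names (with_map, with_listcomp, with_str) are assumptions made for illustration, and no particular result is implied.

```python
import timeit
import pandas as pd

# Hypothetical larger sample: the five original values repeated 2,000 times.
s = pd.Series(['+52A', '+62B', '+44a', '+30b', '-110a'] * 2_000)

def with_map():
    return s.map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

def with_listcomp():
    return pd.Series([x.lstrip('+-').rstrip('aAbBcC') for x in s], index=s.index)

def with_str():
    return s.str.lstrip('+-').str.rstrip('aAbBcC')

# Time each approach over a few calls and report the average per call.
for fn in (with_map, with_listcomp, with_str):
    seconds = timeit.timeit(fn, number=5)
    print(f'{fn.__name__}: {seconds / 5:.4f} s per call')
```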

Answers:

A very simple method would be to use the extract method to pull out the digits. Supply it the regular expression '(\d+)', whose capture group matches the first run of one or more digits in each string. Passing expand=False returns a Series rather than a one-column DataFrame.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110
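One caveat: when a value contains no digits, extract yields a missing value and a plain astype(int) would fail. Here is a minimal sketch of one way to handle that, assuming a hypothetical input with such a row and assuming a missing value is an acceptable result for it:

```python
import pandas as pd

# Hypothetical input: the last value contains no digits at all.
df = pd.DataFrame({'result': ['+52A', '-110a', 'n/a']})

# expand=False returns a Series; rows without a match become missing values.
digits = df['result'].str.extract(r'(\d+)', expand=False)

# to_numeric converts the matched strings to numbers and keeps the
# unmatched row as NaN instead of raising.
df['result'] = pd.to_numeric(digits)
print(df)
```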

Answers:
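=RIGHT(LEFT(O13,LEN(O13)-1),LEN(LEFT(O13,LEN(O13)-1))-1)

In a spreadsheet such as Excel, put this formula in a cell to the right of the result column (O13 stands for the cell holding the original string). It drops the last character and then the first one.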
