Home » Python » Find string between two substrings

Find string between two substrings

Posted by: admin November 1, 2017 Leave a comment

Questions:

How do I find a string between two substrings ('123STRINGabc' -> 'STRING')?

My current method is like this:

>>> start = 'asdf=5;'
>>> end = '123jasd'
>>> s = 'asdf=5;iwantthis123jasd'
>>> print((s.split(start))[1].split(end)[0])
iwantthis

However, this seems very inefficient and un-pythonic. What is a better way to do something like this?

Forgot to mention:
The string might not start and end with start and end. They may have more characters before and after.

Answers:
s = "123123STRINGabcabc"

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

def find_between_r( s, first, last ):
    try:
        start = s.rindex( first ) + len( first )
        end = s.rindex( last, start )
        return s[start:end]
    except ValueError:
        return ""


print find_between( s, "123", "abc" )
print find_between_r( s, "123", "abc" )

gives:

123STRING
STRINGabc

I thought it should be noted – depending on what behavior you need, you can mix index and rindex calls or go with one of the above versions (it’s equivalent of regex (.*) and (.*?) groups).

Questions:
Answers:
import re

s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print result.group(1)

Questions:
Answers:
s[len(start):-len(end)]

Questions:
Answers:

String formatting adds some flexibility to what Nikolaus Gradwohl suggested. start and end can now be amended as desired.

import re

s = 'asdf=5;iwantthis123jasd'
start = 'asdf=5;'
end = '123jasd'

result = re.search('%s(.*)%s' % (start, end), s).group(1)
print(result)

Questions:
Answers:
start = 'asdf=5;'
end = '123jasd'
s = 'asdf=5;iwantthis123jasd'
print s[s.find(start)+len(start):s.rfind(end)]

gives

iwantthis

Questions:
Answers:

Here is one way to do it

_,_,rest = s.partition(start)
result,_,_ = rest.partition(end)
print result

Another way using regexp

import re
print re.findall(re.escape(start)+"(.*)"+re.escape(end),s)[0]

or

print re.search(re.escape(start)+"(.*)"+re.escape(end),s).group(1)

Questions:
Answers:
source='your token [email protected] and maybe [email protected] or maybe [email protected]'
start_sep='_'
end_sep='@df'
result=[]
tmp=source.split(start_sep)
for par in tmp:
  if end_sep in par:
    result.append(par.split(end_sep)[0])

print result

must show:
here0, here1, here2

the regex is better but it will require additional lib an you may want to go for python only

Questions:
Answers:

Just converting the OP’s own solution into an answer:

def find_between(s, start, end):
  return (s.split(start))[1].split(end)[0]

Questions:
Answers:

To extract STRING, try:

myString = '123STRINGabc'
startString = '123'
endString = 'abc'

mySubString=myString[myString.find(startString)+len(startString):myString.find(endString)]

Questions:
Answers:

My method will be to do something like,

find index of start string in s => i
find index of end string in s => j

substring = substring(i+len(start) to j-1)

Questions:
Answers:

This is essentially cji’s answer – Jul 30 ’10 at 5:58.
I changed the try except structure for a little more clarity on what was causing the exception.

def find_between( inputStr, firstSubstr, lastSubstr ):
'''
find between firstSubstr and lastSubstr in inputStr  STARTING FROM THE LEFT
    http://stackoverflow.com/questions/3368969/find-string-between-two-substrings
        above also has a func that does this FROM THE RIGHT   
'''
start, end = (-1,-1)
try:
    start = inputStr.index( firstSubstr ) + len( firstSubstr )
except ValueError:
    print '    ValueError: ',
    print "firstSubstr=%s  -  "%( firstSubstr ), 
    print sys.exc_info()[1]

try:
    end = inputStr.index( lastSubstr, start )       
except ValueError:
    print '    ValueError: ',
    print "lastSubstr=%s  -  "%( lastSubstr ), 
    print sys.exc_info()[1]

return inputStr[start:end]    

Questions:
Answers:

These solutions assume the start string and final string are different. Here is a solution I use for an entire file when the initial and final indicators are the same, assuming the entire file is read using readlines():

def extractstring(line,flag='$'):
    if flag in line: # $ is the flag
        dex1=line.index(flag)
        subline=line[dex1+1:-1] #leave out flag (+1) to end of line
        dex2=subline.index(flag)
        string=subline[0:dex2].strip() #does not include last flag, strip whitespace
    return(string)

Example:

lines=['asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd',
    'afafoaltat $I GOT BETTER!$ derpity derp derp']
for line in lines:
    string=extractstring(line,flag='$')
    print(string)

Gives:

A NEWT?
I GOT BETTER!

Questions:
Answers:

You can simply use this code or copy the function below. All neatly in one line.

def substring(whole, sub1, sub2):
    return whole[whole.index(sub1) : whole.index(sub2)]

If you run the function as follows.

print(substring("5+(5*2)+2", "(", "("))

You will pobably be left with the output:

(5*2

rather than

5*2

If you want to have the sub-strings on the end of the output the code must look like below.

return whole[whole.index(sub1) : whole.index(sub2) + 1]

But if you don’t want the substrings on the end the +1 must be on the first value.

return whole[whole.index(sub1) + 1 : whole.index(sub2)]

Questions:
Answers:
from timeit import timeit
from re import search, DOTALL


def partition_find(string, start, end):
    return string.partition(start)[2].rpartition(end)[0]


def re_find(string, start, end):
    # applying re.escape to start and end would be safer
    return search(start + '(.*)' + end, string, DOTALL).group(1)


def index_find(string, start, end):
    return string[string.find(start) + len(start):string.rfind(end)]


# The wikitext of "Alan Turing law" article form English Wikipeida
# https://en.wikipedia.org/w/index.php?title=Alan_Turing_law&action=edit&oldid=763725886
string = """..."""
start = '==Proposals=='
end = '==Rival bills=='

assert index_find(string, start, end) \
       == partition_find(string, start, end) \
       == re_find(string, start, end)

print('index_find', timeit(
    'index_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('partition_find', timeit(
    'partition_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('re_find', timeit(
    're_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

Result:

index_find 0.35047444528454114
partition_find 0.5327825636197754
re_find 7.552149639286381

re_find was almost 20 times slower than index_find in this example.

Questions:
Answers:

Parsing text with delimiters from different email platforms posed a larger-sized version of this problem. They generally have a START and a STOP. Delimiter characters for wildcards kept choking regex. The problem with split is mentioned here & elsewhere – oops, delimiter character gone. It occurred to me to use replace() to give split() something else to consume. Chunk of code:

nuke = '~~~'
start = '|*'
stop = '*|'
julien = (textIn.replace(start,nuke + start).replace(stop,stop + nuke).split(nuke))
keep = [chunk for chunk in julien if start in chunk and stop in chunk]
logging.info('keep: %s',keep)

Questions:
Answers:

This I posted before as code snippet in Daniweb:

# picking up piece of string between separators
# function using partition, like partition, but drops the separators
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after

s = "bla bla blaa <a>data</a> lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa"
print between('<a>','</a>',s)
print between('(',')',s)
print between("'","'",s)

""" Output:
('bla bla blaa ', 'data', " lsdjfasdj\xc3\xb6f (important notice) 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdj\xc3\xb6f ', 'important notice', " 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdj\xc3\xb6f (important notice) ', 'Daniweb forum', ' tcha tcha tchaa')
"""

Questions:
Answers:

This seems much more straight forward to me:

import re

s = 'asdf=5;iwantthis123jasd'
x= re.search('iwantthis',s)
print(s[x.start():x.end()])