417

How do I find a string between two substrings ('123STRINGabc' -> 'STRING')?

My current method is like this:

>>> start = 'asdf=5;'
>>> end = '123jasd'
>>> s = 'asdf=5;iwantthis123jasd'
>>> print((s.split(start))[1].split(end)[0])
iwantthis

However, this seems very inefficient and un-pythonic. What is a better way to do something like this?

Forgot to mention: The string might not start and end with start and end. They may have more characters before and after.

5
  • 2
    Your additional information makes it almost necessary to use regexes for maximum correctness. Commented Jul 30, 2010 at 6:39
  • 33
    What's wrong with your own solution? I actually prefer it to the one you accepted. Commented Nov 10, 2014 at 12:06
  • I was trying to do this as well but for multiple instances it looks like using *? to do a non greedy search and then just cutting off the string with s[s.find(end)] worked for tracking multiple instances Commented Jan 9, 2019 at 23:07
  • 1
    @reubano: one feature/bug of this code is that it does not raise an exception when the end text does not occur in the original text. The accepted answer fixes this. Commented Jan 19, 2022 at 14:50
  • just a note: s[1:-1] will also do what you had.. though i like .group(1) or (.*?) non-greedy from below better Commented Oct 30, 2022 at 23:04

20 Answers

525
import re

s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))

# returns 'iwantthis'
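A small sketch of how the pattern above behaves when the end marker occurs more than once. The input string with a repeated end marker is made up for illustration. `(.*)` is greedy and runs to the last occurrence, while `(.*?)` stops at the first one.

```python
import re

# Hypothetical input in which the end marker appears twice.
s = 'asdf=5;iwantthis123jasd_extra_123jasd'

greedy = re.search('asdf=5;(.*)123jasd', s).group(1)
lazy = re.search('asdf=5;(.*?)123jasd', s).group(1)

print(greedy)  # iwantthis123jasd_extra_
print(lazy)    # iwantthis
```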

15 Comments

@Jesse Dhillon -- what about @Tim McNamara's suggestion of something like ''.join(start,test,end) in a_string?
What if I need to find the text between two substrings and the second one is repeated after the first one? Something like this: s = 'asdf=5;I_WANT_ONLY_THIS123jasdNOT_THIS123jasd'
Add ? to make it non-greedy: result = re.search('asdf=5;(.*?)123jasd', s)
How can this be amended to select data between start/end if the start/end is duplicated? e.g. say i wanted to select both strings separately between <> i would like to send <message> to <name> and return result1='message' and result2 = 'name'
This however extracts the string between the first and the LAST occurrence of the 2nd string, which may be incorrect, especially when parsing HTML. Unfortunately, this question appears closed so I cannot post my answer.
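For the question in the comments about repeated delimiters, one possible approach (a sketch, not the answerer's own code) is re.findall with a non-greedy group, which returns every match separately:

```python
import re

text = 'i would like to send <message> to <name>'

# Non-greedy group, so each match stops at the nearest '>'.
results = re.findall(r'<(.*?)>', text)
print(results)  # ['message', 'name']
```

Because the group is non-greedy, this also avoids the "first to LAST occurrence" problem mentioned in the last comment.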
189
s = "123123STRINGabcabc"

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

def find_between_r( s, first, last ):
    try:
        start = s.rindex( first ) + len( first )
        end = s.rindex( last, start )
        return s[start:end]
    except ValueError:
        return ""


print(find_between( s, "123", "abc" ))
print(find_between_r( s, "123", "abc" ))

gives:

123STRING
STRINGabc

I thought it should be noted: depending on the behavior you need, you can mix index and rindex calls or use one of the versions above (they are roughly analogous to the regex (.*) and (.*?) groups).
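As an illustration of mixing the two calls, here is a sketch that uses index for the first marker and rindex for the last, which behaves like a greedy (.*) match. The name find_between_outer is made up for this example.

```python
def find_between_outer(s, first, last):
    """Text from the first `first` to the last `last` (hypothetical helper)."""
    try:
        start = s.index(first) + len(first)
        end = s.rindex(last, start)  # last occurrence, but only after `first`
        return s[start:end]
    except ValueError:
        return ""

print(find_between_outer("123123STRINGabcabc", "123", "abc"))  # 123STRINGabc
```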

7 Comments

He said that he wanted a way that was more Pythonic, and this is decidedly less so. I'm not sure why this answer was picked, even OP's own solution is better.
Agreed. I'd use the solution by @Tim McNamara , or the suggestion by the same of something like start+test+end in substring
Right, so it's less pythonic, ok. Is it less efficient than regexps too? And there's also @Prabhu answer you need to downvote, as it suggest the same solution.
+1 too, for a more generic and reusable (by import) solution.
+1 since it works better than the other solutions in the case where end is found more than once. But I do agree that the OP's solution is simpler.
147
start = 'asdf=5;'
end = '123jasd'
s = 'asdf=5;iwantthis123jasd'
print(s[s.find(start)+len(start):s.rfind(end)])

gives

iwantthis

4 Comments

I upvoted this because it works regardless of input string size. Some of the other methods assumed you'd know the length ahead of time.
Yes, it works regardless of input size; however, it does assume the substrings exist.
This however extracts the string between the first and the LAST occurrence of the 2nd string, which may be incorrect, especially when parsing HTML. Unfortunately, this question appears closed so I cannot post my answer.
That's python! No need for regular expressions :)
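On the point that this assumes the substrings exist: find and rfind return -1 instead of raising, so a missing marker silently produces the wrong slice. A guarded sketch (the helper name is hypothetical) might look like this:

```python
def find_between_or_none(s, start, end):
    """Like the one-liner above, but returns None when a marker is missing."""
    i = s.find(start)
    if i == -1:
        return None
    i += len(start)
    j = s.rfind(end, i)  # only look for `end` after `start`
    if j == -1:
        return None
    return s[i:j]

s = 'asdf=5;iwantthis123jasd'
print(find_between_or_none(s, 'asdf=5;', '123jasd'))  # iwantthis
print(find_between_or_none(s, 'missing', '123jasd'))  # None

# The unguarded slice silently returns the wrong text instead:
print(s[s.find('missing') + len('missing'):s.rfind('123jasd')])  # ;iwantthis
```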
63
s[len(start):-len(end)]

2 Comments

This is very nice, assuming start and end are always at the start and end of the string. Otherwise, I would probably use a regex.
I went for the most Pythonic answer to the original question I could think of. Testing with the in operator would probably be faster than a regexp.
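If you want the slice to be safe when the markers might not sit at the very start and end, one option is to check first. This is only a sketch with a made-up helper name. It also uses len(s) - len(end) rather than -len(end), because -len(end) becomes 0 for an empty end marker and would give an empty slice.

```python
def strip_markers(s, start, end):
    """Return the middle of s only if it really starts/ends with the markers."""
    if (s.startswith(start) and s.endswith(end)
            and len(s) >= len(start) + len(end)):
        return s[len(start):len(s) - len(end)]
    return None

print(strip_markers('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))      # iwantthis
print(strip_markers('xxasdf=5;iwantthis123jasdyy', 'asdf=5;', '123jasd'))  # None
```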
49

Just converting the OP's own solution into an answer:

def find_between(s, start, end):
    return s.split(start)[1].split(end)[0]
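A quick sketch of how this split-based version behaves when a marker is missing (the 'NOPE' markers are made up): a missing end marker is not an error, while a missing start marker raises IndexError.

```python
def find_between(s, start, end):
    return s.split(start)[1].split(end)[0]

s = 'asdf=5;iwantthis123jasd'

# Missing end marker: no error, everything after `start` is returned.
print(find_between(s, 'asdf=5;', 'NOPE'))  # iwantthis123jasd

# Missing start marker: split() yields a single item, so [1] raises IndexError.
try:
    find_between(s, 'NOPE', '123jasd')
except IndexError:
    print('start marker not found')
```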

Comments

39

String formatting adds some flexibility to what Nikolaus Gradwohl suggested. start and end can now be amended as desired.

import re

s = 'asdf=5;iwantthis123jasd'
start = 'asdf=5;'
end = '123jasd'

result = re.search('%s(.*)%s' % (start, end), s).group(1)
print(result)

4 Comments

I'm getting this: 'NoneType' object has no attribute 'group'
That means a match wasn't found. Check your regular expression.
@Dentrax is right: should return nothing not an error
I think Tim means that the search should return None as there were no matches. Since the search returned 'None', applying of .group(1) at the end causes the error.
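To avoid the 'NoneType' error discussed in the comments, you can check the match before calling .group(1). The sketch below also wraps the markers in re.escape so that characters such as ( or * in them are treated literally, and it swaps in a non-greedy (.*?) group, which differs from the greedy group in the answer. The function name is made up.

```python
import re

def between(s, start, end):
    """Return the text between start and end, or None if there is no match."""
    pattern = re.escape(start) + '(.*?)' + re.escape(end)
    match = re.search(pattern, s)
    return match.group(1) if match else None

print(between('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
print(between('no markers here', 'asdf=5;', '123jasd'))          # None
print(between('price (in USD) today', '(', ')'))                 # in USD
```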
36

If you don't want to import anything, try the string method .index():

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

# Prints 'string ' (note the trailing space, because right does not include it)
print(text[text.index(left)+len(left):text.index(right)])

5 Comments

I am loving it. simple, single-line, clear enough, no additional imports and works out of the box. I have no idea what is the deal with the over-engineered answers above.
This is not checking whether the "right" text is actually at the right side of the text. If there are any occurrences of "right" before the text it won't work.
@AndreFeijo I agree with you, this was my first solution when trying to extract texts and I wanted to avoid regex weird syntax. However, in situations as you mentioned, I would use regex instead.
in that case (not all of cases) you could find left and then right, although it's a two line code text = text[text.index(left)+len(left):len(role)] text = text[0:text.index(right)]
Hi Fernando, for this text "ADRIANOPICCININIC216186162022-07-27 09:36:33Z" i am looking to extract only "C21618616", how can i do that?
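Regarding the comment that right may also occur before left: str.index accepts a start position, so the search for right can begin after left. A sketch with a made-up sentence in which right appears twice:

```python
text = 'stuck between two walls, I want to find a string between two substrings'
left = 'find a '
right = ' between two'

start = text.index(left) + len(left)
end = text.index(right, start)   # search only to the right of `left`
print(text[start:end])           # string

# Without the start argument, the earlier occurrence of `right` wins
# and the slice comes out empty.
print(repr(text[text.index(left) + len(left):text.index(right)]))  # ''
```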
16
source='your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep='_'
end_sep='@df'
result=[]
tmp=source.split(start_sep)
for par in tmp:
  if end_sep in par:
    result.append(par.split(end_sep)[0])

print(result)

This prints: ['here0', 'here1', 'here2']

A regex is neater, but it requires importing the re module, and you may prefer to stick to plain string methods.

2 Comments

This worked for me. Thank you for extending the solution for multiple occurrences.
I was exactly looking for this, It helps for multiple occurrences, This post needs more upvotes :p.
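One caveat with the loop above: the first chunk comes before any start_sep, so if it happens to contain end_sep it is picked up as well. A sketch that skips it, written as a list comprehension (the 'lead@df' prefix is made up to show the case):

```python
source = 'lead@df your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep = '_'
end_sep = '@df'

# Skip chunk 0: it precedes the first start_sep, so it is never "between".
result = [chunk.split(end_sep)[0]
          for chunk in source.split(start_sep)[1:]
          if end_sep in chunk]
print(result)  # ['here0', 'here1', 'here2']
```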
15

Here is one way to do it

_,_,rest = s.partition(start)
result,_,_ = rest.partition(end)
print(result)

Another way using regexp

import re
print(re.findall(re.escape(start)+"(.*)"+re.escape(end),s)[0])

or

print(re.search(re.escape(start)+"(.*)"+re.escape(end),s).group(1))
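Note that partition does not fail when the separator is missing; it returns the whole string and two empty strings. A sketch (with a hypothetical function name) that uses the returned separator to detect that case:

```python
def between_partition(s, start, end):
    """Return the text between start and end, or None if either is missing."""
    _, found_start, rest = s.partition(start)
    result, found_end, _ = rest.partition(end)
    # partition() returns an empty separator when it found nothing.
    if not found_start or not found_end:
        return None
    return result

print(between_partition('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
print(between_partition('asdf=5;iwantthis', 'asdf=5;', '123jasd'))         # None
```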

Comments

6

Here is a function I wrote that returns a list of the string(s) found between string1 and string2.

def GetListOfSubstrings(stringSubject,string1,string2):
    MyList = []
    intstart=0
    strlength=len(stringSubject)
    continueloop = 1

    while(intstart < strlength and continueloop == 1):
        intindex1=stringSubject.find(string1,intstart)
        if(intindex1 != -1): #The substring was found, lets proceed
            intindex1 = intindex1+len(string1)
            intindex2 = stringSubject.find(string2,intindex1)
            if(intindex2 != -1):
                subsequence=stringSubject[intindex1:intindex2]
                MyList.append(subsequence)
                intstart=intindex2+len(string2)
            else:
                continueloop=0
        else:
            continueloop=0
    return MyList


#Usage Example
mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","y68")
for x in range(0, len(List)):
    print(List[x])
output: (nothing is printed, because "y68" never occurs in the string)

mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","3")
for x in range(0, len(List)):
    print(List[x])
output:
2
2
2
2

mystring="s123y123o123pp123y6"
List = GetListOfSubstrings(mystring,"1","y")
for x in range(0, len(List)):
    print(List[x])
output:
23
23o123pp123

1 Comment

Extraordinary answer. I'd hire a guy like you
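The same idea can be written more compactly as a generator. This is a sketch rather than a drop-in replacement: the name iter_between is made up, and it assumes both markers are non-empty strings.

```python
def iter_between(subject, first, last):
    """Yield each non-overlapping substring found between first and last."""
    pos = 0
    while True:
        i = subject.find(first, pos)
        if i == -1:
            return
        i += len(first)
        j = subject.find(last, i)
        if j == -1:
            return
        yield subject[i:j]
        pos = j + len(last)  # continue after the closing marker

print(list(iter_between("s123y123o123pp123y6", "1", "y")))  # ['23', '23o123pp123']
print(list(iter_between("s123y123o123pp123y6", "1", "3")))  # ['2', '2', '2', '2']
```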
5

To extract STRING, try:

myString = '123STRINGabc'
startString = '123'
endString = 'abc'

mySubString=myString[myString.find(startString)+len(startString):myString.find(endString)]

Comments

4

You can simply use this code or copy the function below. All neatly in one line.

def substring(whole, sub1, sub2):
    return whole[whole.index(sub1) : whole.index(sub2)]

If you run the function as follows.

print(substring("5+(5*2)+2", "(", ")"))

You will be left with the output:

(5*2

rather than

5*2

If you want to have the sub-strings on the end of the output the code must look like below.

return whole[whole.index(sub1) : whole.index(sub2) + 1]

But if you don't want the substrings on the end the +1 must be on the first value.

return whole[whole.index(sub1) + 1 : whole.index(sub2)]
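The + 1 only works for single-character markers. A more general sketch, assuming you want the text strictly between the markers, adds len(sub1) instead and looks for sub2 only after sub1:

```python
def substring(whole, sub1, sub2):
    """Text strictly between sub1 and the next sub2 (markers excluded)."""
    start = whole.index(sub1) + len(sub1)   # skip the whole opening marker
    end = whole.index(sub2, start)          # look for sub2 only after it
    return whole[start:end]

print(substring("5+(5*2)+2", "(", ")"))            # 5*2
print(substring("x = [[value]] end", "[[", "]]"))  # value
```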

Comments

3

These solutions assume the start string and final string are different. Here is a solution I use for an entire file when the initial and final indicators are the same, assuming the entire file is read using readlines():

def extractstring(line,flag='$'):
    string=None #returned as-is when the line has no flag
    if flag in line: # $ is the flag
        dex1=line.index(flag)
        subline=line[dex1+1:-1] #leave out flag (+1) to end of line
        dex2=subline.index(flag)
        string=subline[0:dex2].strip() #does not include last flag, strip whitespace
    return(string)

Example:

lines=['asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd',
    'afafoaltat $I GOT BETTER!$ derpity derp derp']
for line in lines:
    string=extractstring(line,flag='$')
    print(string)

Gives:

A NEWT?
I GOT BETTER!
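When the opening and closing flags are identical, splitting on the flag is another option. This sketch (the function name is made up) returns None unless the line contains at least two flags:

```python
def extract_flagged(line, flag='$'):
    """Return the stripped text between the first two flags, or None."""
    parts = line.split(flag)
    # Two flags produce at least three parts: before, between, after.
    if len(parts) < 3:
        return None
    return parts[1].strip()

print(extract_flagged('asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd'))  # A NEWT?
print(extract_flagged('only one $ flag here'))                            # None
```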

Comments

2

This I posted before as code snippet in Daniweb:

# picking up piece of string between separators
# function using partition, like partition, but drops the separators
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after

s = "bla bla blaa <a>data</a> lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa"
print(between('<a>','</a>',s))
print(between('(',')',s))
print(between("'","'",s))

""" Output:
('bla bla blaa ', 'data', " lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdjöf ', 'important notice', " 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdjöf (important notice) ', 'Daniweb forum', ' tcha tcha tchaa')
"""

Comments

2

This is essentially cji's answer - Jul 30 '10 at 5:58. I changed the try except structure for a little more clarity on what was causing the exception.

def find_between(inputStr, firstSubstr, lastSubstr):
    '''
    Find the text between firstSubstr and lastSubstr in inputStr, STARTING FROM THE LEFT.
    http://stackoverflow.com/questions/3368969/find-string-between-two-substrings
    The page above also has a function that does this FROM THE RIGHT.
    '''
    start, end = (-1, -1)
    try:
        start = inputStr.index(firstSubstr) + len(firstSubstr)
    except ValueError as exc:
        # Report which substring was missing; start stays at -1
        print('    ValueError: firstSubstr=%s  -  %s' % (firstSubstr, exc))

    try:
        end = inputStr.index(lastSubstr, start)
    except ValueError as exc:
        # Report which substring was missing; end stays at -1
        print('    ValueError: lastSubstr=%s  -  %s' % (lastSubstr, exc))

    return inputStr[start:end]

Comments

2
from timeit import timeit
from re import search, DOTALL


def partition_find(string, start, end):
    return string.partition(start)[2].rpartition(end)[0]


def re_find(string, start, end):
    # applying re.escape to start and end would be safer
    return search(start + '(.*)' + end, string, DOTALL).group(1)


def index_find(string, start, end):
    return string[string.find(start) + len(start):string.rfind(end)]


# The wikitext of the "Alan Turing law" article from English Wikipedia
# https://en.wikipedia.org/w/index.php?title=Alan_Turing_law&action=edit&oldid=763725886
string = """..."""
start = '==Proposals=='
end = '==Rival bills=='

assert index_find(string, start, end) \
       == partition_find(string, start, end) \
       == re_find(string, start, end)

print('index_find', timeit(
    'index_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('partition_find', timeit(
    'partition_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('re_find', timeit(
    're_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

Result:

index_find 0.35047444528454114
partition_find 0.5327825636197754
re_find 7.552149639286381

re_find was more than 20 times slower than index_find in this example.

Comments

1

My method would be something like this:

find index of start string in s => i
find index of end string in s => j

substring = substring(i+len(start) to j-1)
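The outline above could be written in Python roughly as follows. This is a sketch with a made-up function name. Since a slice end is exclusive, the slice stops at j rather than j - 1.

```python
def between(s, start, end):
    i = s.index(start)                  # index of the start string
    j = s.index(end, i + len(start))    # index of the end string, after start
    return s[i + len(start):j]          # the slice end is exclusive, so j, not j - 1

print(between('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
```

Both index calls raise ValueError if a marker is missing.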

Comments

1

Parsing text with delimiters from different email platforms posed a larger version of this problem. They generally have a START and a STOP delimiter. Delimiters made of wildcard characters kept choking regex. The problem with split is mentioned here and elsewhere: the delimiter itself is consumed. It occurred to me to use replace() to give split() something else to consume. Chunk of code:

nuke = '~~~'
start = '|*'
stop = '*|'
julien = (textIn.replace(start,nuke + start).replace(stop,stop + nuke).split(nuke))
keep = [chunk for chunk in julien if start in chunk and stop in chunk]
logging.info('keep: %s',keep)
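If the regex was choking because | and * are regex metacharacters, re.escape may be an alternative worth trying. A sketch with made-up sample text; like the code above, it keeps the delimiters in each chunk:

```python
import re

start = '|*'
stop = '*|'
text_in = 'Hello |*FNAME*|, your code is |*CODE*|.'  # made-up sample text

# re.escape makes '|' and '*' match literally instead of acting as operators.
pattern = re.escape(start) + '.*?' + re.escape(stop)
keep = re.findall(pattern, text_in)
print(keep)  # ['|*FNAME*|', '|*CODE*|']
```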

Comments

0

Following on from Nikolaus Gradwohl's answer, I needed to get the version number (i.e., 0.0.2) between 'ui:' and '-' from the file content below (filename: docker-compose.yml):

version: '3.1'
services:
  ui:
    image: repo-pkg.dev.io:21/website/ui:0.0.2-QA1
    #network_mode: host
    ports:
      - 443:9999
    ulimits:
      nofile:test

and this is how it worked for me (python script):

import re

# Read the whole file; the with block closes it afterwards
with open('docker-compose.yml', 'r') as f:
    lines = f.read()
result = re.search('ui:(.*)-', lines)
print(result.group(1))


Result:
0.0.2

3 Comments

Using Docker for simple task is bad practice.
@DmitryBubnenkov what does the above post has to do anything with Docker usage/implementation? It's all about finding a string between two substrings in a file.
I thought this use case was great. My use case was a css file with encoded base64 text it just shows not every text file needs to be .txt
-3

This seems much more straightforward to me:

import re

s = 'asdf=5;iwantthis123jasd'
x= re.search('iwantthis',s)
print(s[x.start():x.end()])

1 Comment

This requires you to know the string you're looking for, it doesn't find whatever string is between the two substrings, as the OP requested. The OP wants to be able to get the middle no matter what it is, and this answer would require you to know the middle before you start.
