How to extract specific data from a string?

Question

I have a text document that I want to parse. I want to get the strings between "@5c00\n" and "@ffd2\n", and also between "@ffd2\n" and "@".

@5c00
81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 
B1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 
@ffd2
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 00 5C CF 0C 
@
q

I have tried to use regular expressions, but this gives me ['',''].

import re

with open("app_blink.txt", "r") as file:  # app_blink.txt contains the text above
    contents = file.read()
data = re.findall('\n(.*)@', contents, re.M)

I expected to get:

data
['81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 \nB1 13 38 01 32 D0 10 00..
 FD 3F 03 43 00 00 00 02','14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C..
 \n14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C \n14 5C 14 5C 14 5C 14..
 5C 14 5C 14 5C 00 5C CF 0C \n']

but actually got:

data
['','']

Mark Tolonen · Accepted Answer · 2019-05-20 21:57:33Z

1

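You were close. You needed the re.DOTALL flag instead of re.M, and a non-greedy match. re.M only changes where ^ and $ match; re.DOTALL is what lets . match a newline, and the ? after .* makes each match stop at the first @ it reaches: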

contents = '''\
@5c00
81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 
B1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 
@ffd2
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 00 5C CF 0C 
@
q
'''

import re
for x in re.findall(r'\n(.*?)@',contents,re.DOTALL):
    print(x)

Output:

81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 
B1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 

14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 00 5C CF 0C
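
To see why both changes matter, here is a small sketch that runs the three variants on a shortened, made-up sample. The sample string and the variable names are illustrative assumptions, not the contents of the original file:

```python
import re

# Hypothetical miniature of the file in the question.
sample = "@5c00\nAA BB \nCC DD \n@ffd2\nEE FF \n@\nq\n"

# re.M only changes ^ and $; '.' still stops at a newline,
# so the group can only be empty, right before a marker line.
multiline = re.findall(r'\n(.*)@', sample, re.M)
print(multiline)  # ['', '']

# re.DOTALL lets '.' cross newlines, but a greedy '.*' runs to the last '@'.
greedy = re.findall(r'\n(.*)@', sample, re.DOTALL)
print(greedy)     # ['AA BB \nCC DD \n@ffd2\nEE FF \n']

# The lazy '.*?' stops at the first '@' it reaches.
lazy = re.findall(r'\n(.*?)@', sample, re.DOTALL)
print(lazy)       # ['AA BB \nCC DD \n', 'EE FF \n']
```

Each captured block keeps its trailing newline, which is why the loop above prints a blank line after each section.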


Draconis · Accepted Answer · 2019-05-20 21:55:06Z

0

This sounds like a job for regular expressions!

\@[^\n]*\n([^\@]*)\n(?=\@)

This regular expression will match:

First, a literal @ sign
Then, the rest of that line, up to and including the newline
Then, everything it can find that doesn't include an @: this part is saved into group #1
Then, a newline ending it all
Finally, accept only if the next character is an @ (but don't consume that character)

As an example:

>>> re.search(r'\@[^\n]*\n([^\@]*)\n(?=\@)', your_string).group(1)
'81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 \nB1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 '

So to get a list of the important stuff:

>>> [m.group(1) for m in re.finditer(r'\@[^\n]*\n([^\@]*)\n(?=\@)', your_string)]
['81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 \nB1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 ', '14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C \n14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C \n14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 00 5C CF 0C ']

Or, for a simpler answer:

re.split(r'\@[^\n]*\n', your_string)

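This splits the string at every line that starts with @. The result also contains an empty string before the first marker and whatever follows the last marker (here 'q\n'), so you may want to drop the first and last elements.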
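
A minimal sketch of the split approach on a shortened, made-up sample. It assumes the text starts with an @ line and that nothing useful follows the last @ line; the sample string and variable names are illustrative:

```python
import re

# Hypothetical miniature of the file in the question.
sample = "@5c00\nAA BB \nCC DD \n@ffd2\nEE FF \n@\nq\n"

# Split on every marker line: '@', the rest of that line, and its newline.
parts = re.split(r'@[^\n]*\n', sample)
print(parts)   # ['', 'AA BB \nCC DD \n', 'EE FF \n', 'q\n']

# Drop the empty piece before the first marker and the trailing 'q\n'.
blocks = parts[1:-1]
print(blocks)  # ['AA BB \nCC DD \n', 'EE FF \n']
```

If the input might not follow that layout, filtering the pieces by content would be safer than slicing by position.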


1 Comment

Nicholas Nguyen Over a year ago

Wow! Thank you for the help! I greatly appreciate you taking the time in explaining your regular expression also

Guillermo García López · Accepted Answer · 2019-05-20 21:55:40Z

0

Check this regex:

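data = re.findall(r'^[\d \w]{2,}$',contents,re.M)

It just takes the lines that hold the hexadecimal numbers, one list entry per line. The lines starting with @ are skipped, and so is the single-character q line.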
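
As a variation on the same line-by-line idea, here is a sketch with a stricter character class that accepts only two-digit hexadecimal pairs. The pattern and the sample string are assumptions for illustration, not part of the answer above:

```python
import re

# Hypothetical miniature of the file in the question.
sample = "@5c00\nAA BB \nCC DD \n@ffd2\nEE FF \n@\nq\n"

# One or more hex pairs, each optionally followed by a space, filling a whole line.
hex_lines = re.findall(r'^(?:[0-9A-Fa-f]{2} ?)+$', sample, re.M)
print(hex_lines)  # ['AA BB ', 'CC DD ', 'EE FF ']
```

Note that this returns individual lines, so the boundary between the @5c00 and @ffd2 sections is lost.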


MichaelD · Accepted Answer · 2019-05-20 22:08:16Z

This regex ought to work. It matches every line that does not start with @:

import re

regex = r"^[^\@].*"

test_str = ("@5c00\n81 00 00\n76 20 11\n@ffd2\n")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # The pattern has no capture groups, so this inner loop does not run here.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

Note: for Python 2.7 compatibility, prefix the regex with ur"" and the test string with u"".

Emma Marcier · Accepted Answer · 2019-05-20 22:26:33Z

Here, we may not want to use regular expressions, because they can be slightly more expensive than plain string operations. A string split may be enough. For example, we can split on @ and keep the pieces that start with the addresses we want.

Example

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

test_str = '''
@bb00
81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 
B1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02
@5c00
81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 
B1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 
@ffd2
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 
14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 00 5C CF 0C 
@
81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 
B1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 

'''

split_str = test_str.split('@')
data = []
for block in split_str:
    # Each piece starts with the 4-character address that followed the '@'.
    if block[:4] in ('5c00', 'ffd2'):
        # Skip the address and the newline after it.
        data.append(block[5:])

print(data)

Output

['81 00 00 5C B1 13 3E 01 0C 43 B1 13 A6 00 1C 43 \nB1 13 38 01 32 D0 10 00 FD 3F 03 43 00 00 00 02 \n', '14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C \n14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 14 5C \n14 5C 14 5C 14 5C 14 5C 14 5C 14 5C 00 5C CF 0C \n']
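
If the addresses are not known in advance, the same split can be generalized. The following is only a sketch: parse_sections is a hypothetical helper name, the sample is made up, and it assumes that a bare @ line marks the end of the data, as in the question's file:

```python
def parse_sections(text):
    """Map each @address to the text that follows it (hypothetical helper)."""
    sections = {}
    # Whatever precedes the first '@' is ignored.
    for block in text.split('@')[1:]:
        address, _, body = block.partition('\n')
        if not address:
            # Assumption: a bare '@' line ends the data.
            break
        sections[address] = body
    return sections


# Hypothetical miniature of the file in the question.
sample = "@5c00\n81 00 \n00 5C \n@ffd2\n14 5C \n@\nq\n"
sections = parse_sections(sample)
print(sections)  # {'5c00': '81 00 \n00 5C \n', 'ffd2': '14 5C \n'}
```

Using str.partition avoids hard-coding the address length, so addresses shorter or longer than four characters would still be handled.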
