XML parsing with Python and regex does not return all results

Question

I am still struggling with regexp:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

print(re.findall(pattern, text, re.S))

This returns:

[('abc', '8')]

I would expect it to return:

[('abc', '4'), ('def', '8')]

Why is it so greedy, matching everything up to the last closing tag?

This is the regex101 link: https://regex101.com/r/ANO7RA/1

Maybe negative lookahead will solve this. I was not able to fully grasp the concept, though... :-(

I strongly urge you to use a proper XML parser. See "RegEx match open tags except XHTML self-contained tags". While it may be possible to handle specific narrow use cases with regular expressions, in general it is not possible to parse XML with regex. It's almost always better to use a proper XML/HTML parser like lxml or an XML query language like XPath.
– Chris, Commented Feb 17, 2020 at 17:32
I second what @Chris said. I don't know a single person who favors XML over JSON, but a few of them have tried to use regex. It only creates more problems. Recently I found xmltodict and it's super easy to use (I don't like lxml either).
– Tom Wojcik, Commented Feb 17, 2020 at 18:00
I also do not favor XML over JSON, but I have to make do with the format my source data comes in.
– mrCarnivore, Commented Feb 17, 2020 at 18:01
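
A minimal sketch of the parser route suggested in the comments above, using the standard library's xml.etree.ElementTree rather than lxml or xmltodict. The snippet from the question has no single root element, so the sketch wraps it in a made-up <root> tag. The function name extract_pairs and the shortened sample data are illustrative assumptions, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's data (illustrative only)
text = '''
<SW-VARIABLE>
  <SHORT-NAME>abc</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE>
    <VF>4</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
  <SHORT-NAME>def</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE>
    <VF>8</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
'''


def extract_pairs(xml_fragment):
    """Return (SHORT-NAME, VF) tuples for every SW-VARIABLE element."""
    # The fragment has several top-level elements, so wrap it in a
    # hypothetical <root> element to make it well-formed XML.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    pairs = []
    for var in root.iter("SW-VARIABLE"):
        name = var.findtext("SHORT-NAME")       # direct child
        size = var.findtext("SW-ARRAYSIZE/VF")  # child of SW-ARRAYSIZE
        pairs.append((name, size))
    return pairs


print(extract_pairs(text))
```

For the sample above this prints [('abc', '4'), ('def', '8')]. A variable without an SW-ARRAYSIZE element would yield None for the size instead of borrowing the value from a later element.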

jawad-khan · Accepted Answer · 2020-02-17 18:03:15Z

2

This is the pattern you need.

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

answered Feb 17, 2020 at 18:03


Barka · Accepted Answer · 2020-02-17 18:16:06Z

1

I agree with the others: it is best to use an XML parser here. But to fix what you have ...

You are missing a question mark. Regex quantifiers are greedy by default: they grab as much as they can. To make one non-greedy, add a question mark directly after the quantifier. This regex will give you what you want:

<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

You had the question mark correctly after

</SW-ARRAYSIZE>.*

but you were missing it after

</SHORT-NAME>.*
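
To see the difference between the greedy and the non-greedy quantifier in isolation, here is a small sketch on a made-up string (the sample tags are illustrative and not taken from the question):

```python
import re

# Hypothetical two-element sample, small enough to follow by eye
sample = "<a>1</a><a>2</a>"

# Greedy: .* runs to the end of the string, then backtracks only as far
# as needed to find the LAST </a>.
greedy = re.findall(r"<a>(.*)</a>", sample)

# Non-greedy: .*? stops at the FIRST </a> that lets the match succeed.
lazy = re.findall(r"<a>(.*?)</a>", sample)

print(greedy)  # ['1</a><a>2']
print(lazy)    # ['1', '2']
```

The same thing happens in the question: the greedy .* after </SHORT-NAME> skips ahead to the last <SW-ARRAYSIZE> in the text, which is why 'abc' ends up paired with '8'.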

I think you only want to capture the contents of SHORT-NAME and VF. If that is the case, I would put them in named groups and retrieve the groups in code to work with them. The regex will then become:

<SW-VARIABLE>\s*<SHORT-NAME>(?P<sn>[^<]*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>(?P<vf>[^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

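with the two group names being sn and vf.

Your Python code for retrieving the named groups will then become (re.S is still needed so that . matches newlines):

# regex is the named-group pattern above, string1 is the XML text
for match in re.finditer(regex, string1, re.S):
    print("shortName: ", match.group('sn'))
    print("vf: ", match.group('vf'))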

edited Feb 17, 2020 at 18:16

answered Feb 17, 2020 at 17:54


2 Comments

mrCarnivore Over a year ago

Thanks for the explanation. This does, however, not work for me. Have you tried it with the example?

Barka Over a year ago

It looks like I had a typo in there. Try this: regex101.com/r/JHGEek/2

Freeman · Accepted Answer · 2020-02-17 18:14:22Z

You can also try this:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''
pattern=r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?</SW-VARIABLE>'
print(re.findall(pattern, text, re.S))

Output:

[('abc', '4'), ('def', '8')]
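
The question mentions negative lookahead. One way it can help, sketched here under the assumption that some SW-VARIABLE elements may lack an SW-ARRAYSIZE, is a "tempered" dot that refuses to step over a closing </SW-VARIABLE> tag. The names INSIDE, tempered and lazy, and the sample string, are made up for illustration.

```python
import re

# One character at a time, but only while </SW-VARIABLE> does not start
# at the current position.
INSIDE = r"(?:(?!</SW-VARIABLE>).)*?"

tempered = re.compile(
    r"<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>"
    + INSIDE
    + r"<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>"
    + INSIDE
    + r"</SW-VARIABLE>",
    re.S,
)

# The plain non-greedy pattern, for comparison
lazy = re.compile(
    r"<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?"
    r"<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?</SW-VARIABLE>",
    re.S,
)

# Hypothetical input: the first variable has no SW-ARRAYSIZE at all
sample = (
    "<SW-VARIABLE><SHORT-NAME>no_size</SHORT-NAME></SW-VARIABLE>"
    "<SW-VARIABLE><SHORT-NAME>def</SHORT-NAME>"
    "<SW-ARRAYSIZE><VF>8</VF></SW-ARRAYSIZE></SW-VARIABLE>"
)

print(lazy.findall(sample))      # [('no_size', '8')]
print(tempered.findall(sample))  # [('def', '8')]
```

The non-greedy .*? still crosses element boundaries when it has to, so it pairs the first name with the second element's size. The tempered version cannot leave the element it started in and simply skips the variable without a size.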

mrCarnivore · Accepted Answer · 2020-02-17 17:53:33Z

0

I seem to have found an answer myself:

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>\s*<CATEGORY>[^<]*</CATEGORY>\s*<SW-ARRAYSIZE>\s*<VF>(.*)</VF>\s*</SW-ARRAYSIZE>'

print(re.findall(pattern, text))

You really have to limit the usage of .* and make use of the very predictable structure of the XML.
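
Another way to limit the reach of .*, sketched here as an assumption rather than as what the answer above does, is to work in two stages: first cut the text into SW-VARIABLE blocks, then search inside each block separately. The function name extract_pairs and the shortened sample are illustrative.

```python
import re

# Shortened copy of the question's data (illustrative only)
text = '''
<SW-VARIABLE>
  <SHORT-NAME>abc</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE>
    <VF>4</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
  <SHORT-NAME>def</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE>
    <VF>8</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
'''


def extract_pairs(source):
    """Return (SHORT-NAME, VF) tuples, one per SW-VARIABLE block."""
    pairs = []
    # Stage 1: isolate each block; .*? stops at the first closing tag.
    for block in re.findall(r"<SW-VARIABLE>(.*?)</SW-VARIABLE>", source, re.S):
        # Stage 2: these searches only ever see a single block.
        name = re.search(r"<SHORT-NAME>([^<]*)</SHORT-NAME>", block)
        size = re.search(r"<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>", block)
        if name and size:
            pairs.append((name.group(1), size.group(1)))
    return pairs


print(extract_pairs(text))
```

For the sample above this prints [('abc', '4'), ('def', '8')]. Like the other regex approaches, it assumes SW-VARIABLE elements are never nested.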

answered Feb 17, 2020 at 17:53

