0

I am still struggling with regexp:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

print(re.findall(pattern, text, re.S))

This returns:

[('abc', '8')]

I would expect it to return:

[('abc', '4'), ('def', '8')]

Why is it so greedy and matches everything until the last closing tag?

This is the regex101 link: https://regex101.com/r/ANO7RA/1

Maybe negative lookahead will solve this. I was not able to fully grasp the concept, though... :-(

4
  • 4
    I strongly urge you to use a proper XML parser. See RegEx match open tags except XHTML self-contained tags. While it may be possible to handle specific narrow use cases with regular expressions in general it is literally not possible to parse XML with regex. It's almost always better to use a proper XML / HTML parser like lxml or an XML query language like XPath. Commented Feb 17, 2020 at 17:32
  • 1
    See also How do I parse XML in Python? Commented Feb 17, 2020 at 17:34
  • I second what @Chris said. I don't know a single person that favors xml instead of json but a few of them tried to use regex. It only generates more problems. Recently I've found xmltodict and it's super easy to use (I don't likelxml either). Commented Feb 17, 2020 at 18:00
  • 1
    I also do not favor XML instead of JSON, I do need to make due with the format my source information comes in, though. Commented Feb 17, 2020 at 18:01

4 Answers 4

2

This is the pattern you need.

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

Sign up to request clarification or add additional context in comments.

Comments

1

I agree with others, it is best to use an xml parser here. But to fix what you have ...

You are missing a question mark. regexes are greedy by default. They grab as much as they can. To make them non-greedy, you need to add a question mark after the part that you want to be none-greedy for. This regex will give you what you want:

<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

you had the question mark correctly after

</SW-ARRAYSIZE>.*

but you were missing it after

</SHORT-NAME>.*

.

I think you want to only capture the content of the two '.*?'s. If that is the case, I would put them in groups and retrieve the groups in code to work with them. The regex will then become:

<SW-VARIABLE>\s*<SHORT-NAME>(?P<sn>[^<]*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>(?P<vf>[^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

with the two group names being sn and vf. demo

Your python code for retrieving the named groups will then become:

matches= re.search(regex, string1)
print("shortName: ", matches.group('sn'))
print("vf: ", matches.group('vf'))

2 Comments

Thanks for the explanation. This does, however, not work for me. Have you tried it with the example?
it looks like i had a typo in there. try this: regex101.com/r/JHGEek/2
1

you can also check this out :

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''
pattern=r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?</SW-VARIABLE>'
print(re.findall(pattern, text, re.S))

output :

[('abc', '4'), ('def', '8')]

Comments

0

I seem to have found an answer myself:

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>\s*<CATEGORY>[^<]*</CATEGORY>\s*<SW-ARRAYSIZE>\s*<VF>(.*)</VF>\s*</SW-ARRAYSIZE>'

print(re.findall(pattern, text))

You really have to limit the usage of .* and make use of the very predictable structure of the XML.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.