parse xml like document with regex

Question

I have a file that has many xml-like elements such as this one:

<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>

I need to parse the docid and the text. What's a suitable regular expression for that?

I've tried this but it doesn't work:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)

EDIT: I've modified the pattern like this:

<document docid=(\d+)>(.*)</document>

Unfortunately, this matches the whole file as a single match rather than each individual document element.

EDIT2: The correct implementation, combining Ahmad's and Acorn's answers, is:

import re

with open('documents.txt') as f:
    collectionText = f.read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)

@thephpdeveloper, in general, you're right. But if it's an XML-like format with a known structure, regular expressions might be the easiest solution.
– svick, Commented Nov 15, 2011 at 3:21

Ahmad Mageed · Accepted Answer · 2011-11-15 03:06:06Z

4

Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.

You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible." The updated pattern is:

<document docid=(\d+)>(.*?)</document>
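To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The `sample` string is made up for illustration (two elements on a single line, so newlines don't come into play); it is not the asker's file.

```python
import re

# Hypothetical sample: two elements on one line, so DOTALL is not needed here.
sample = '<document docid=1>first</document><document docid=2>second</document>'

# Greedy: .* runs to the end and backtracks only to the LAST </document>.
greedy = re.findall(r'<document docid=(\d+)>(.*)</document>', sample)

# Non-greedy: .*? stops at the FIRST </document> it can reach.
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>', sample)

print(greedy)  # a single match that swallows both elements
print(lazy)    # one (docid, text) tuple per element
```

The greedy version returns one tuple whose text group still contains the second opening tag; the non-greedy version returns one tuple per element.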

answered Nov 15, 2011 at 3:06

Ahmad Mageed


2 Comments

Acorn Over a year ago

Well spotted. This doesn't solve the problem of the expression needing to match over multiple lines though.

Ahmad Mageed Over a year ago

@Acorn yeah I missed that, thinking the OP had that covered because it "matches the whole document." Good point though :)

Community · Accepted Answer · 2017-05-23 11:53:37Z

4

You need to use the DOTALL option with your regular expression so that it will match over multiple lines (by default . will not match newline characters).

Also note the comments regarding greediness in Ahmad's answer.

import re

text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''

pattern = r'<document docid=(\d+)>(.*?)</document>'
print(re.findall(pattern, text, re.DOTALL))
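As a further sketch, the same pattern can be run with and without the flag on a file that holds several elements. The two-element `text` below is invented for illustration, and the `int()` conversion and `strip()` call are my own additions rather than anything the question asked for.

```python
import re

# Hypothetical two-element sample in the same shape as the question's file.
text = '''<document docid=1>
First title
</document>
<document docid=2>
Second title
</document>'''

pattern = r'<document docid=(\d+)>(.*?)</document>'

without_dotall = re.findall(pattern, text)          # '.' stops at '\n'
with_dotall = re.findall(pattern, text, re.DOTALL)  # '.' also matches '\n'

# Convert ids to int and trim the newlines surrounding each body.
docs = [(int(docid), body.strip()) for docid, body in with_dotall]

print(without_dotall)
print(docs)
```

Without `re.DOTALL` nothing matches, because every element body starts with a newline that `.` refuses to cross.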

In general, regular expressions are not suitable for parsing XML/HTML.

See:

RegEx match open tags except XHTML self-contained tags and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

You want to use a parser like lxml.
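One caveat, offered as a hedged aside: the sample in the question is not well-formed XML (the `docid` value is unquoted and the `&` is not escaped), so a strict XML parser should reject it as-is. The sketch below uses the standard library's `xml.etree.ElementTree` instead of lxml, purely to keep the example dependency-free; the shortened `sample` string is an assumption based on the question's element.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's element: unquoted attribute, bare '&'.
sample = '''<document docid=1>
Perlis, A. J. & Samelson,K.
</document>'''

try:
    ET.fromstring(sample)
    parsed = True
except ET.ParseError as err:
    # A strict XML parser refuses input that is only "XML-like".
    parsed = False
    print('not well-formed XML:', err)
```

So for this particular file, a real parser would only help after the input is cleaned up or with a more lenient parser.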

edited May 23, 2017 at 11:53


answered Nov 15, 2011 at 2:45

Acorn


1 Comment

siamii Over a year ago

this isn't XML, just similar. I only need the regex for this one file

Mathias Müller · Accepted Answer · 2016-10-25 20:32:08Z

1

FYI, this seems to work for .NET "XML-like" structures:

<([^<>]+)>([^<>]+)<(\/[^<>]+)>
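For comparison with the Python answers, here is a rough sketch of the same pattern under Python's `re` module (the answer itself only mentions .NET, so this is an assumption about how it carries over). The shortened `text` omits the line containing `&` for brevity.

```python
import re

# Shortened version of the question's element, used only for illustration.
text = '''<document docid=1>
Preliminary Report-International Algebraic Language
CACM December, 1958
</document>'''

pattern = r'<([^<>]+)>([^<>]+)<(\/[^<>]+)>'

# No DOTALL needed: a negated class like [^<>] already matches newlines.
matches = re.findall(pattern, text)

for opening, body, closing in matches:
    print(opening, '|', body.strip(), '|', closing)
```

Note the trade-offs: the first group captures the tag name and attribute together (`document docid=1`), the pattern does not check that the closing tag matches the opening one, and it fails if the body itself contains `<` or `>`.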

edited Oct 25, 2016 at 20:32

Mathias Müller


answered Jan 21, 2014 at 22:22

user2860427

