1

I have a file that has many xml-like elements such as this one:

<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>

I need to parse the docid and the text. What's a suitable regular expression for that?

I've tried this but it doesn't work:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)

EDIT: I've modified the pattern like this:

<document docid=(\d+)>(.*)</document>

This matches the whole document unfortunately not the individual document elements.

EDIT2: The correct implementation from Ahmad's and Acorn's answer is:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)
2
  • 1
    XML and Regex is two words I hate hearing together. Commented Nov 15, 2011 at 2:42
  • 1
    @thephpdeveloper, in general, you're right. But if it's XML-like format with known structure, regular expressions might be the easiest solution. Commented Nov 15, 2011 at 3:21

3 Answers 3

4

Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.

You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible." The updated pattern is:

<document docid=(\d+)>(.*?)</document>
Sign up to request clarification or add additional context in comments.

2 Comments

Well spotted. This doesn't solve the problem of the expression needing to match over multiple lines though.
@Acorn yeah I missed that, thinking the OP had that covered because it "matches the whole document." Good point though :)
4

You need to use the DOTALL option with your regular expression so that it will match over multiple lines (by default . will not match newline characters).

Also note the comments regarding greediness in Ahmad's answer.

import re

text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''

pattern = r'<document docid=(\d+)>(.*?)</document>'
print re.findall(pattern, text, re.DOTALL)

In general, regular expressions are not suitable for parsing XML/HTML.

See:

RegEx match open tags except XHTML self-contained tags and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

You want to use a parser like lxml.

1 Comment

this isn't XML, just similar. I only need the regex for this one file
1

Seems to work for .net "xml-like" structure just FYI...

<([^<>]+)>([^<>]+)<(\/[^<>]+)>

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.