
I have an XML document with the following structure:

> <?xml version="1.0" encoding="UTF-8"?> <!-- generated by CLiX/Wiki2XML
> [MPI-Inf, MMCI@UdS] $LastChangedRevision: 93 $ on 17.04.2009
> 12:50:48[mciao0826] --> <!DOCTYPE article SYSTEM "../article.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"> <header>
> <title>Postmodern art</title> <id>192127</id> <revision>
> <id>244517133</id> <timestamp>2008-10-11T05:26:50Z</timestamp>
> <contributor> <username>FairuseBot</username> <id>1022055</id>
> </contributor> </revision> <categories> <category>Contemporary
> art</category> <category>Modernism</category> <category>Art
> movements</category> <category>Postmodern art</category> </categories>
> </header> <bdy> Postmodernism preceded by Modernism '' Postmodernity
> Postchristianity Postmodern philosophy Postmodern architecture
> Postmodern art Postmodernist film Postmodern literature Postmodern
> music Postmodern theater Critical theory Globalization Consumerism
> </bdy>

I want to capture the text contained within <bdy>...</bdy>, so I wrote the following Python 3 regex code:

import re

# Read the whole file into a single string
with open("sample_xml.xml", "r") as file:
    xml_doc = file.read()

# Capture the text between <bdy> and </bdy>
body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc)

But body_text is always an empty list. However, when I capture the text of the <category>...</category> tags with this code:

category_text = re.findall(r'<category>(.+)</category>', xml_doc)

it does the job. Any idea why the regex for the <bdy>...</bdy> element is not working?

Thanks!

2 Answers

2

By default, the special character . does not match a newline, so that regex cannot match text that spans several lines, as the <bdy> content does.

You can change this behavior with the DOTALL flag. Either pass re.DOTALL as the flags argument, or put the inline flag (?s) at the start of your regular expression.
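As a minimal sketch of the difference, the snippet below runs the same pattern three ways: without flags, with re.DOTALL, and with the inline (?s) flag. The xml_doc string here is a shortened, hypothetical stand-in for the real file, not the asker's actual data.

```python
import re

# Hypothetical sample: a <bdy> element whose text spans several lines.
xml_doc = "<bdy>\nPostmodern art\nPostmodern music\n</bdy>"

# By default "." stops at "\n", so the pattern finds nothing.
default_match = re.findall(r"<bdy>(.+)</bdy>", xml_doc)

# With re.DOTALL, "." also matches newlines.
flag_match = re.findall(r"<bdy>(.+)</bdy>", xml_doc, re.DOTALL)

# The inline (?s) flag at the start of the pattern has the same effect.
inline_match = re.findall(r"(?s)<bdy>(.+)</bdy>", xml_doc)

print(default_match)  # []
print(flag_match)     # ['\nPostmodern art\nPostmodern music\n']
print(inline_match)   # same as flag_match
```

Both forms behave the same; the inline flag is handy when you can only supply the pattern string and not a flags argument.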

More information on Python's regular expression syntax can be found here: https://docs.python.org/3/library/re.html#regular-expression-syntax


1

You can use re.DOTALL:

body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)

Output:

[" Postmodernism preceded by Modernism '' Postmodernity\n> Postchristianity Postmodern philosophy Postmodern architecture\n> Postmodern art Postmodernist film Postmodern literature Postmodern\n> music Postmodern theater Critical theory Globalization Consumerism\n> "]
