2

I am trying to parse a very ugly XML file with Python. I manage to get pretty well into it, but at the npdoc element it fails. What am I doing wrong?

XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<npexchange xmlns="http://www.example.com/npexchange/3.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.5">
<article id="123" refType="Article">
<articleparts>
    <articlepart id="1234" refType="ArticlePart">
        <data>
            <npdoc xmlns="http://www.example.com/npdoc/2.1" version="2.1" xml:lang="sv_SE">
                <body>
                    <p>Lorem ipsum some random text here.</p>
                    <p>
                        <b>Yes this is HTML markup, and I would like to keep that.</b>
                    </p>
                </body>
                <headline>
                    <p>I am a headline</p>
                </headline>
                <leadin>
                    <p>I am some other text</p>
                </leadin>
            </npdoc>
        </data>
    </articlepart>
</articleparts>
</article>
</npexchange>

This is the Python code I have so far:

from xml.etree.ElementTree import ElementTree

def parse(self):
    tree = ElementTree(file=filename)

    for item in tree.iter("article"):
        articleParts = item.find("articleparts")
        for articlepart in articleParts.iter("articlepart"):
            data = articlepart.find("data")
            npdoc = data.find("npdoc")

            id = item.get("id")
            headline = npdoc.find("headline").text
            leadIn = npdoc.find("leadin").text
            body = npdoc.find("body").text


    return articles

What happens is that I get the id out, but the fields that are inside the npdoc element I cannot access. The npdoc variable gets set to None.

Update: Managed to get the elements into variables by using the namespace in the .find() calls. How do I get the value? As it is HTML it does not come out correctly with the .text attribute.

8
  • What is the expected output? Commented Apr 20, 2015 at 9:10
  • That is not a valid XML document. It does not have a root element. Commented Apr 20, 2015 at 9:11
  • The expected outcome is that the string <p>I am a headline</p> in the headline variable, and so on. Commented Apr 20, 2015 at 9:12
  • There is a root element, going to edit it in now. My cleaning was a little aggressive. Commented Apr 20, 2015 at 9:14
  • This is a namespace problem. Is it possible to define a prefix like xmlns:n for the http://www.example.com/npdoc/2.1 namespace? Without such a prefix it is difficult to access the elements under this namespace. Commented Apr 20, 2015 at 9:53

2 Answers

2
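nsmap = {'npexchange': 'http://www.example.com/npexchange/3.5',
         'npdoc': 'http://www.example.com/npdoc/2.1'}
# <data> is in the npexchange namespace; <npdoc> and its children are in the npdoc namespace
data = articlepart.find("npexchange:data", namespaces=nsmap)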

...will find your data element. No ugly, unreliable string munging required. (Re: "unreliable" -- consider what this would do to CDATA sections containing literal angle brackets.)
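The same namespace map works for the rest of the path. Below is a minimal sketch, assuming the document layout from the question; `SAMPLE` is a trimmed copy of that XML, `first_paragraphs` is a hypothetical helper name, and the prefixes in `NSMAP` are arbitrary labels (only the URIs have to match the file). It reads only the plain text of each first `<p>`, so it does not preserve inline markup.

```python
import xml.etree.ElementTree as ET

# Prefixes are local labels chosen for this sketch; only the URIs must match the file.
NSMAP = {
    "npexchange": "http://www.example.com/npexchange/3.5",
    "npdoc": "http://www.example.com/npdoc/2.1",
}

# Trimmed copy of the XML from the question.
SAMPLE = """<npexchange xmlns="http://www.example.com/npexchange/3.5" version="3.5">
<article id="123" refType="Article">
<articleparts>
<articlepart id="1234" refType="ArticlePart">
<data>
<npdoc xmlns="http://www.example.com/npdoc/2.1" version="2.1">
<headline><p>I am a headline</p></headline>
<leadin><p>I am some other text</p></leadin>
</npdoc>
</data>
</articlepart>
</articleparts>
</article>
</npexchange>"""


def first_paragraphs(root):
    """Hypothetical helper: return (article id, headline text, leadin text) tuples."""
    # Every step of the path carries the prefix of the namespace it lives in.
    npdoc_path = ("npexchange:articleparts/npexchange:articlepart/"
                  "npexchange:data/npdoc:npdoc")
    rows = []
    for article in root.iterfind("npexchange:article", NSMAP):
        for npdoc in article.iterfind(npdoc_path, NSMAP):
            headline = npdoc.find("npdoc:headline/npdoc:p", NSMAP)
            leadin = npdoc.find("npdoc:leadin/npdoc:p", NSMAP)
            rows.append((article.get("id"), headline.text, leadin.text))
    return rows


rows = first_paragraphs(ET.fromstring(SAMPLE))
print(rows)  # [('123', 'I am a headline', 'I am some other text')]
```

Note that the switch from the `npexchange:` prefix to the `npdoc:` prefix happens exactly at the `<npdoc>` element, where the default namespace changes in the file.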


1

This is what I came up with in Python 3.4. It's certainly not bulletproof, but it might give you some ideas.

import xml.etree.ElementTree as ET
tree = ET.parse(r'C:\Users\Gord\Desktop\nasty.xml')
npexchange = tree.getroot()
for article in npexchange:
    for articleparts in article:
        for articlepart in articleparts:
            id = articlepart.attrib['id']
            print("ArticlePart - id: {0}".format(id))
            for data in articlepart:
                for npdoc in data:
                    for child in npdoc:
                        tag = child.tag[child.tag.find('}')+1:]
                        print("    {0}:".format(tag))  ## e.g., "body:"
                        contents = ET.tostring(child).decode('utf-8')
                        contents = contents.replace('<ns0:', '<')
                        contents = contents.replace('</ns0:', '</')
                        contents = contents.replace(' xmlns:ns0="http://www.example.com/npdoc/2.1">', '>')
                        contents = contents.replace('<' + tag + '>\n', '')
                        contents = contents.replace('</' + tag + '>', '')
                        contents = contents.strip()
                        print("        {0}".format(contents))

The console output is

ArticlePart - id: 1234
    body:
        <p>Lorem ipsum some random text here.</p>
                            <p>
                                <b>Yes this is HTML markup, and I would like to keep that.</b>
                            </p>
    headline:
        <p>I am a headline</p>
    leadin:
        <p>I am some other text</p>

Update

Somewhat improved version with

  • a Namespace map (as suggested by Charles),
  • register_namespace with an empty prefix to remove some namespace prefix "noise", and
  • using .findall() instead of blindly iterating through child nodes regardless of their tag:
import xml.etree.ElementTree as ET
npdoc_uri = 'http://www.example.com/npdoc/2.1'
nsmap = {
    'npexchange': 'http://www.example.com/npexchange/3.5',
    'npdoc': npdoc_uri
    }
ET.register_namespace("", npdoc_uri)
tree = ET.parse(r'/home/gord/Desktop/nasty.xml')
npexchange = tree.getroot()
for article in npexchange.findall('npexchange:article', nsmap):
    for articleparts in article.findall('npexchange:articleparts', nsmap):
        for articlepart in articleparts.findall('npexchange:articlepart', nsmap):
            id = articlepart.attrib['id']
            print("ArticlePart - id: {0}".format(id))
            for data in articlepart.findall('npexchange:data', nsmap):
                for npdoc in data.findall('npdoc:npdoc', nsmap):
                    for child in npdoc:  # getchildren() was removed in Python 3.9
                        tag = child.tag[child.tag.find('}')+1:]
                        print("    {0}:".format(tag))  ## e.g., "body:"
                        contents = ET.tostring(child).decode('utf-8')
                        # remove HTML block tags, e.g. <body ...> and </body>
                        contents = contents.replace('<' + tag + ' xmlns="' + npdoc_uri + '">\n', '')
                        contents = contents.replace('</' + tag + '>', '')
                        contents = contents.strip()
                        print("        {0}".format(contents))
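The string replacements can also be avoided altogether. The following is a rough sketch, not a drop-in replacement, assuming it is acceptable to modify the parsed tree in place: it rewrites each descendant tag from `{uri}name` to `name` and then serializes only the children of the block element. `inner_markup` is a hypothetical helper name, and the short document in the demo is made up for illustration.

```python
import xml.etree.ElementTree as ET


def inner_markup(element):
    """Hypothetical helper: serialize the children of `element` without namespaces.

    Caution: this rewrites the tags of the descendants in place.
    """
    parts = [element.text or ""]
    for child in element:
        for node in child.iter():
            # ElementTree stores qualified tags as '{uri}name'; keep the local name.
            if isinstance(node.tag, str) and node.tag.startswith("{"):
                node.tag = node.tag.split("}", 1)[1]
        # tostring() also emits the child's tail text, so nothing between tags is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


# Small made-up document using the npdoc namespace from the question.
ns = {"npdoc": "http://www.example.com/npdoc/2.1"}
doc = ET.fromstring(
    '<npdoc xmlns="http://www.example.com/npdoc/2.1">'
    "<headline><p>I am a headline</p></headline>"
    "<body><p>a</p><p><b>b</b></p></body>"
    "</npdoc>"
)
headline_markup = inner_markup(doc.find("npdoc:headline", ns))
body_markup = inner_markup(doc.find("npdoc:body", ns))
print(headline_markup)  # <p>I am a headline</p>
```

Because the tags no longer carry a namespace after the rewrite, later namespace-qualified lookups below that element will not match, so call this only once the lookups are done.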

7 Comments

This looks promising. I had to make some modifications, as there is more than just <article> elements in the root (and a lot more crap in the <article> element), but I have something working at the moment.
I had a hunch that there may have been more elements. I considered adding an if thing.tag == 'thing': block inside each loop but I thought it might clutter things up too much.
One thing it seems to do, though, is remove closing tags (</p> and such).
Aha. Changed contents.replace('</ns0:', '<') to contents.replace('</ns0:', '</').
WTF? Why in the world would you round-trip through text and use string replacements rather than just changing the namespace map?