Parsing XML with Python - accessing elements

Question

I'm using lxml to parse some xml, but for some reason I can't find a specific element.

I'm trying to access the <Constant> elements.

Here's an xml snippet:

  </rdf:Description>
</rdf:RDF>
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>

The code I'm using is like this:

    >>> from lxml import etree as ET
    >>> parsed = ET.parse('ct.cps')
    >>> root = parsed.getroot()
    >>> for a in root.findall(".//Constant"):
    ...     print(a.attrib['key'])
    ...
    >>> for a in root.findall('Constant'):
    ...     print(a.get('key'))
    ...
    >>> for a in root.findall('Constant'):
    ...     print(a.attrib['key'])
    ...

As you can see, none of these things seem to work.

What am I doing wrong?

EDIT: I'm wondering if it has something to do with the fact that <Constant> elements are empty?

EDIT2: Source xml here: https://www.dropbox.com/s/i6hga7nvmcd6rxx/ct.cps?dl=0

I guess it has to do with namespaces. You need to take care of the namespace part.
– Hai Vu, Commented Jul 2, 2015 at 19:29

mzjn · Accepted Answer · 2015-07-04 08:12:39Z

4

Here is how you can get the values you are looking for:

from lxml import etree

parsed = etree.parse('ct.cps')

for a in parsed.findall("//{http://www.copasi.org/static/schema}Constant"):
    print(a.attrib["key"])

Output:

Parameter_4344
Parameter_4343
Parameter_4342
Parameter_4341
Parameter_4340
Parameter_4339
Parameter_4338
Parameter_4337
Parameter_4336
Parameter_4335
Parameter_4334
Parameter_4333
Parameter_4332
Parameter_4331
Parameter_4330
Parameter_4329
Parameter_4328
Parameter_4327
Parameter_4326
Parameter_4325
Parameter_4324
Parameter_4323
Parameter_4322
Parameter_4321
Parameter_4320
Parameter_4319

The important thing here is that the COPASI root element in your XML file (the real one at the Dropbox URL) declares a default namespace (http://www.copasi.org/static/schema). This means that the element and all its descendants, including Constant, belong to that namespace.

So instead of Constant elements, you need to look for {http://www.copasi.org/static/schema}Constant elements.

See http://lxml.de/tutorial.html#namespaces.
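
As a minimal sketch of the point above, the snippet below parses a small made-up document (a simplified stand-in, not the real ct.cps file) whose root declares the same default namespace. The inline XML and the names `XML` and `NS` are assumptions for illustration. The import falls back to the standard library's ElementTree if lxml is not installed, since `findall` behaves the same way for these queries.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library for this demo
    import xml.etree.ElementTree as etree

NS = "http://www.copasi.org/static/schema"

# Simplified stand-in for ct.cps: the root declares a default namespace.
XML = (
    '<COPASI xmlns="%s">'
    '<ListOfConstants>'
    '<Constant key="Parameter_4344" name="Kcat" value="433.724"/>'
    '<Constant key="Parameter_4343" name="km" value="479.617"/>'
    '</ListOfConstants>'
    '</COPASI>' % NS
)

root = etree.fromstring(XML)

# Unqualified name: matches nothing, because every element is in the namespace.
unqualified = root.findall(".//Constant")

# Qualified name in {uri}tag notation: matches both elements.
qualified = root.findall(".//{%s}Constant" % NS)

# Equivalent query using a prefix of our own choosing.
prefixed = root.findall(".//c:Constant", namespaces={"c": NS})

print(len(unqualified))
print([c.get("key") for c in qualified])
print([c.get("key") for c in prefixed])
```

The prefix `c` is arbitrary; only the namespace URI has to match the one declared in the document.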

Here is how you could do it using XPath instead of findall:

from lxml import etree

NSMAP = {"c": "http://www.copasi.org/static/schema"}

parsed = etree.parse('ct.cps')

for a in parsed.xpath("//c:Constant", namespaces=NSMAP):
    print(a.attrib["key"])

See http://lxml.de/xpathxslt.html#namespaces-and-prefixes.

edited Jul 4, 2015 at 8:12

answered Jul 4, 2015 at 7:52

mzjn


2 Comments

user5193682 Over a year ago

Wow!! I had been stuck on an annoying parsing problem for hours, and by pure luck I found out that its source is exactly this namespace issue! Thanks so much!

mzjn Over a year ago

@user9589: I'm glad I could help!

Hai Vu · Accepted Answer · 2015-07-02 20:56:01Z

0

First, please disregard my comment. It turns out that lxml.etree is much better than the standard xml.etree.ElementTree in that it takes care of namespaces. The problem is that you want to search for '//Constant', which matches nodes at any level. However, the root element does not accept an absolute path:

>>> root.findall('//Constant')
SyntaxError: cannot use absolute path on element

However, you can do that at a higher level, on the parsed tree:

>>> parsed.findall('//Constant')
[<Element Constant at 0x10a7ce128>, <Element Constant at 0x10a7ce170>]
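
A minimal sketch of the difference, using a made-up document without any namespace (an assumption for illustration, not the asker's file). It only addresses the path syntax: an element rejects an absolute path, while a path starting with `.//` searches all of its descendants. It does not by itself handle elements that live in a namespace. The import falls back to the standard library if lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library for this demo
    import xml.etree.ElementTree as ET

# Made-up document without any namespace, for illustration only.
root = ET.fromstring(
    '<root><ListOfConstants>'
    '<Constant key="Parameter_4344"/>'
    '<Constant key="Parameter_4343"/>'
    '</ListOfConstants></root>'
)

# An absolute path is rejected when findall is called on an element...
try:
    root.findall("//Constant")
    absolute_path_error = None
except SyntaxError as exc:
    absolute_path_error = str(exc)

# ...but a path relative to the element searches all of its descendants.
keys = [c.get("key") for c in root.findall(".//Constant")]

print(absolute_path_error)
print(keys)
```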

Update

I am posting the full text here. Since I don't have your full XML file, I made something up to fill in the blanks.

from io import BytesIO

from lxml import etree as ET

xml_text = """<?xml version='1.0' encoding='utf-8' ?>

<rdf:root  xmlns:rdf='http://foo.bar.com/rdf'>
<rdf:RDF>
  <rdf:Description>
    DescriptionX
  </rdf:Description>
</rdf:RDF>
<rdf:foo>
        <MiriamAnnotation>
          bar
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>
        </ListOfConstants>
</rdf:foo>
</rdf:root>
"""

# Parse from an in-memory bytes buffer (works on Python 2 and 3)
xml_buffer = BytesIO(xml_text.encode('utf-8'))
tree = ET.parse(xml_buffer)
for constant_node in tree.findall('//Constant'):
    print(constant_node.attrib['key'])

edited Jul 2, 2015 at 20:56

answered Jul 2, 2015 at 20:00

Hai Vu


2 Comments

Charon Over a year ago

I thought that lxml is better than ElementTree with regard to namespaces because I had namespace errors when using the latter and they're now sorted. However, the first error is removed by putting a . before the first slash... and it still doesn't work. As for the second suggestion, I still can't access attribute values using that code: nothing is output.

Charon Over a year ago

I think that it might be something to do with the fact that the <Constant> elements are empty elements, but I'm not sure.

Gary Wisniewski · Accepted Answer · 2015-07-02 22:04:11Z

0

Don't use findall. It has a limited feature set and is designed to be compatible with ElementTree.

Instead, use xpath, which supports namespaces. From the above, it appears that you probably want something like:

# possibilities, you need to get these right...
ns_dict = {"atom": "http://www.w3.org/2005/Atom",
           "rdf": "http://www.w3.org/2000/01/rdf-schema#"}

root = parsed.getroot()
for a in root.xpath('.//rdf:Constant', namespaces=ns_dict):
    print(a.attrib['key'])

Note that you must include a namespace prefix in your xpath expression whenever an element has a non-blank namespace, and they must map to one of the namespace URLs that match the same URLs in your document.

Update

Since you posted your original document, I see that there is no namespace assigned to the elements you are looking for. This will work, I just tried it with your source document:

for a in parsed.xpath("//Constant"):
    print(a.attrib['key'])

You don't need a namespace because there is no default namespace specified in the document itself.

edited Jul 2, 2015 at 22:04

answered Jul 2, 2015 at 20:38

Gary Wisniewski


13 Comments

Charon Over a year ago

Thanks. How do I find out the element's namespace? EDIT: I'm sure that rdf is the namespace, but your code still isn't working.

Gary Wisniewski Over a year ago

Note that "rdf" is just a namespace prefix. My rdf:Constant refers to the namespace in ns_dict. Namespaces must be explicitly declared inside your source document and are inherited by subelements. Look for the xmlns declarations in containing elements. For example, if a containing element has xmlns= it defines the default namespace for any unadorned subelement, whereas "xmlns:xxx=" defines the namespace for the 'xxx' prefix. Also, it doesn't matter what the prefix was in your source document. lxml rewrites them all internally, so you need to remap them.

Charon Over a year ago

So the rdf term will be a url?

Gary Wisniewski Over a year ago

Yes, namespaces are specified as URLs in the source document's xmlns attributes. The prefixes are just mappings, and the source document prefixes will not be preserved. ET.parse will convert all prefixes to internal, explicit references to the namespace URLs, and you need to use a namespace map like ns_dict in my example in order to re-establish a new set of prefixes for use in xpath.

Gary Wisniewski Over a year ago

You can get a list of the namespaces used by your source document by using parsed.getroot().nsmap. If you print that out, you'll see what namespaces are defined in your document.
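
One practical wrinkle when reusing `nsmap` for XPath: in lxml a default namespace appears in `nsmap` under the key `None`, and `xpath()` does not accept an empty prefix, so the map needs a prefix of your own before it can be passed as `namespaces`. The helper below is a hypothetical sketch of that remapping; the name `xpath_namespaces`, the prefix `c`, and the example values are assumptions for illustration, not read from ct.cps. The lxml part is guarded so the helper can still be tried on plain dicts without lxml installed.

```python
def xpath_namespaces(nsmap, default_prefix="c"):
    """Copy an lxml-style nsmap, giving the default namespace (key None) a prefix."""
    return {
        (default_prefix if prefix is None else prefix): uri
        for prefix, uri in nsmap.items()
    }


# What root.nsmap might look like for a document with a default namespace
# plus an rdf prefix (illustrative values only).
example_nsmap = {
    None: "http://www.copasi.org/static/schema",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
ns = xpath_namespaces(example_nsmap)
print(ns)

try:
    from lxml import etree
except ImportError:
    etree = None  # lxml not installed; the helper above still works on plain dicts

if etree is not None:
    # Small made-up document with a default namespace on the root.
    root = etree.fromstring(
        '<COPASI xmlns="http://www.copasi.org/static/schema">'
        '<Constant key="Parameter_4344"/></COPASI>'
    )
    found = root.xpath("//c:Constant", namespaces=xpath_namespaces(root.nsmap))
    print([c.get("key") for c in found])
```

If the document already uses the chosen prefix for a different namespace, pick another `default_prefix`, since the helper would otherwise overwrite that entry.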
