2

I'm using lxml to parse some xml, but for some reason I can't find a specific element.

I'm trying to access the <Constant> elements.

Here's an xml snippet:

  </rdf:Description>
</rdf:RDF>
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>

The code I'm using is like this:

    >>> from lxml import etree as ET
    >>> parsed = ET.parse('ct.cps')
    >>> root = parsed.getroot()    
    >>> for a in root.findall(".//Constant"):
    ...     print a.attrib['key']
    ... 
    >>> for a in root.findall('Constant'):
    ...     print a.get('key')
    ... 
    >>> for a in root.findall('Constant'):
    ...     print a.attrib['key']
    ... 

As you can see, none of these loops prints anything.

What am I doing wrong?


EDIT: I'm wondering if it has something to do with the fact that <Constant> elements are empty?


EDIT2: Source xml here: https://www.dropbox.com/s/i6hga7nvmcd6rxx/ct.cps?dl=0

2
  • I guess it has to do with namespace. You need to take care of the namespace part. Commented Jul 2, 2015 at 19:29
  • Ah I see what you mean, I will try. Commented Jul 2, 2015 at 19:39

3 Answers

4

Here is how you can get the values you are looking for:

from lxml import etree

parsed = etree.parse('ct.cps')

for a in parsed.findall("//{http://www.copasi.org/static/schema}Constant"):
    print(a.attrib["key"])

Output:

Parameter_4344
Parameter_4343
Parameter_4342
Parameter_4341
Parameter_4340
Parameter_4339
Parameter_4338
Parameter_4337
Parameter_4336
Parameter_4335
Parameter_4334
Parameter_4333
Parameter_4332
Parameter_4331
Parameter_4330
Parameter_4329
Parameter_4328
Parameter_4327
Parameter_4326
Parameter_4325
Parameter_4324
Parameter_4323
Parameter_4322
Parameter_4321
Parameter_4320
Parameter_4319

The important thing here is that the COPASI root element in your XML file (the real one at the Dropbox URL) declares a default namespace (http://www.copasi.org/static/schema). This means that the element and all its descendants, including Constant, belong to that namespace.

So instead of Constant elements, you need to look for {http://www.copasi.org/static/schema}Constant elements.

See http://lxml.de/tutorial.html#namespaces.
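To see the effect in isolation, here is a minimal sketch. It assumes a made-up two-constant document standing in for ct.cps, with only the namespace URI taken from the real file. The import falls back to the standard library's ElementTree, which treats these particular calls the same way, so the sketch also runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module handles these calls the same way
    import xml.etree.ElementTree as etree

NS = "http://www.copasi.org/static/schema"

# Made-up stand-in for ct.cps: the root declares a default namespace,
# which every descendant without a prefix inherits.
xml_text = (
    '<COPASI xmlns="http://www.copasi.org/static/schema">'
    '<ListOfConstants>'
    '<Constant key="Parameter_4344" name="Kcat" value="433.724"/>'
    '<Constant key="Parameter_4343" name="km" value="479.617"/>'
    '</ListOfConstants>'
    '</COPASI>'
)
root = etree.fromstring(xml_text)

# An unqualified name only matches elements that are in no namespace.
unqualified_count = len(root.findall(".//Constant"))

# Clark notation: the namespace URI in braces, then the local name.
qualified_keys = [el.get("key") for el in root.findall(".//{%s}Constant" % NS)]

# Equivalent query using a prefix of our own choosing ("c" is arbitrary).
prefixed_keys = [
    el.get("key") for el in root.findall(".//c:Constant", namespaces={"c": NS})
]

print(unqualified_count, qualified_keys, prefixed_keys)
```

The empty `<Constant/>` elements are found like any others once the name is qualified, so their being empty is not what hides them.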


Here is how you could do it using XPath instead of findall:

from lxml import etree

NSMAP = {"c": "http://www.copasi.org/static/schema"}

parsed = etree.parse('ct.cps')

for a in parsed.xpath("//c:Constant", namespaces=NSMAP):
    print(a.attrib["key"])

See http://lxml.de/xpathxslt.html#namespaces-and-prefixes.


2 Comments

Wow!! I had been stuck on an annoying parsing problem for hours, and by pure luck I found out that its source is exactly this namespace issue! Thanks so much!
@user9589: I'm glad I could help!
0

First, please disregard my comment. It turns out that lxml.etree is much better than the standard xml.etree.ElementTree in that it takes care of the namespace. The problem is that you want to search for '//Constant', which matches nodes at any level, but an element such as the root does not accept an absolute path:

>>> root.findall('//Constant')
SyntaxError: cannot use absolute path on element

However, you can do that one level higher, on the parsed tree object:

>>> parsed.findall('//Constant')
[<Element Constant at 0x10a7ce128>, <Element Constant at 0x10a7ce170>]
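The difference between the path forms can be checked on a tiny document. This is only a sketch: the XML below is invented and has no namespaces, and the import falls back to the standard library's ElementTree (which behaves the same for these calls) when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module behaves the same for these calls
    import xml.etree.ElementTree as etree

# Invented document: one Constant nested inside <a>, one directly under the root.
xml_text = "<root><a><Constant key='k1'/></a><Constant key='k2'/></root>"
root = etree.fromstring(xml_text)

# An absolute path ("//...") is rejected when called on an element.
try:
    root.findall("//Constant")
    absolute_error = None
except SyntaxError as exc:
    absolute_error = str(exc)

# A relative path starting with ".//" searches all descendants of the element.
relative_keys = [el.get("key") for el in root.findall(".//Constant")]

# A bare name only looks at the element's direct children.
direct_keys = [el.get("key") for el in root.findall("Constant")]

print(absolute_error, relative_keys, direct_keys)
```

This also shows why `root.findall('Constant')` in the question could never reach deeply nested elements, independent of any namespace issue.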

Update

Here is the full code. Since I don't have your full XML file, I made something up to fill in the blanks.

from lxml import etree as ET
from StringIO import StringIO

xml_text = """<?xml version='1.0' encoding='utf-8' ?>

<rdf:root  xmlns:rdf='http://foo.bar.com/rdf'>
<rdf:RDF>
  <rdf:Description>
    DescriptionX
  </rdf:Description>
</rdf:RDF>
<rdf:foo>
        <MiriamAnnotation>
          bar
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>
        </ListOfConstants>
</rdf:foo>
</rdf:root>
"""

buffer = StringIO(xml_text)
tree = ET.parse(buffer)
for constant_node in tree.findall('//Constant'):
    print constant_node.attrib['key']

2 Comments

I thought that lxml was better than ElementTree with regard to namespaces, because I had namespace errors when using the latter and they're now sorted. The first error goes away when I put a . before the first slash, but it still doesn't work. As for the second suggestion, I still can't access attribute values using that code: nothing is printed.
I think that it might be something to do with the fact that the <Constant> elements are empty elements, but I'm not sure.
0

Don't use findall. It has a limited feature set and is designed to be compatible with ElementTree.

Instead, use xpath, which supports namespaces. From the above, it appears that you probably want to say something like

# possibilities, you need to get these right...
ns_dict = {"atom": "http://www.w3.org/2005/Atom",
           "rdf": "http://www.w3.org/2000/01/rdf-schema#"}

root = parsed.getroot()
for a in root.xpath('.//rdf:Constant', namespaces=ns_dict):
    print(a.attrib['key'])

Note that you must include a namespace prefix in your XPath expression whenever an element has a non-blank namespace, and each prefix must map to a namespace URL that matches one declared in your document.
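One way to get the URLs right is to read them from the document instead of typing them by hand. The following is a rough sketch that requires lxml. It assumes a made-up document whose root declares a default namespace plus an `rdf` prefix; the prefix `c` is an arbitrary choice for illustration.

```python
from lxml import etree

# Made-up document: a default namespace plus one prefixed namespace.
xml_text = (
    '<COPASI xmlns="http://www.copasi.org/static/schema" '
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<ListOfConstants>'
    '<Constant key="Parameter_4344" name="Kcat" value="433.724"/>'
    '</ListOfConstants>'
    '</COPASI>'
)
root = etree.fromstring(xml_text)

# nsmap maps prefixes to URIs; a default namespace is stored under the key None.
declared = dict(root.nsmap)

# XPath has no default namespace, so the None entry needs a prefix of its own.
ns = {("c" if prefix is None else prefix): uri for prefix, uri in declared.items()}

# Without a prefix, the query only matches elements in no namespace.
bare_hits = root.xpath("//Constant")

# With the remapped prefix, the namespaced elements are found.
prefixed_keys = [el.get("key") for el in root.xpath("//c:Constant", namespaces=ns)]

print(declared, bare_hits, prefixed_keys)
```

Passing `root.nsmap` straight to `xpath()` would fail when it contains a `None` key, which is why the sketch renames that entry first.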

Update

Since you posted your original document, I see that there is no namespace assigned to the elements you are looking for. This will work, I just tried it with your source document:

for a in tree.xpath("//Constant"):
    print(a.attrib['key'])

You don't need a namespace because there is no default namespace specified in the document itself.

13 Comments

Thanks. How do I find out the element's namespace? EDIT: I'm sure that rdf is the namespace, but your code still isn't working.
Note that "rdf" is just a namespace prefix. My rdf:Constant refers to the namespace in ns_dict. Namespaces must be explicitly declared inside your source document and are inherited by subelements. Look for the xmlns declarations in containing elements. For example, if a containing element has xmlns= it defines the default namespace for any unadorned subelement, whereas "xmlns:xxx=" defines the namespace for the 'xxx' prefix. Also, it doesn't matter what the prefix was in your source document. lxml rewrites them all internally, so you need to remap them.
So the rdf term will be a url?
Yes, namespaces are specified as URLs in the source document's xmlns attributes. The prefixes are just mappings, and the source document prefixes will not be preserved. ET.parse will convert all prefixes to internal, explicit references to the namespace URLs, and you need to use a namespace map like ns_dict in my example in order to re-establish a new set of prefixes for use in xpath.
You can get a list of the namespaces used by your source document by using parsed.getroot().nsmap. If you print that out, you'll see what namespaces are defined in your document.