Python: Parsing an XML file with several attributes in one node

Question

I'm still new to programming, but I know some Python and am familiar with XPath and XML in general. Currently I'm working with some XML data that looks something like this:

<foo>
  <bar>
      <unit>
          <structure>
              <token word="Rocky" att1="noun" att2="name">Rocky</token>
              <token word="the" att1="article" att2="">the</token>
              <token word="yellow" att1="adjective" att2="color">yellow</token>
              <token word="dog" att1="noun" att2="animal">dog</token>
          </structure>
      </unit>
  </bar>
</foo>

Now what I need to do with this is to first find an attribute value. Take, for instance:

<token word="dog" att1="noun" att2="animal">dog</token>

In all the structures in the document, I want to first find all the nodes that have animal as the att2 value and THEN get all the siblings of that node into a list. Because each node has several attributes, I'm trying to put each attribute into its own list; that is, for every structure that has animal as the att2 value of one of its children, build one list per attribute across all the tokens in that structure. For instance:

 listWord = [Rocky, the, yellow, dog]
 listAtt1 = [noun, article, adjective, noun]
 listAtt2 = [name, ,color, animal]

At the moment I'm just wondering if it's even possible. Thus far I've only managed to hit my head against the wall with the attribute structure not to mention the empty values.

Your XML is not valid, it misses a few closing > for the tokens
– Guillaume, Commented Nov 3, 2016 at 13:00
Your XML structure is broken, all the <token> tags are missing the closing >, maybe a copy and paste error.
– Marcs, Commented Nov 3, 2016 at 13:00
THEN get all the siblings of that node into a list. => what exactly do you call a sibling?
– Guillaume, Commented Nov 3, 2016 at 13:05
Are the example listWord, listAtt1 and listAtt2 the lists you are trying to build?
– Guillaume, Commented Nov 3, 2016 at 13:07
Whoops, yeah just forgot the closings while constructing the structure. But they are there.
– Ize, Commented Nov 3, 2016 at 14:15

asongtoruin · Accepted Answer · 2016-11-03 14:53:50Z

With the closing token tags included, and assuming your text is contained in test.xml, the following:

import xml.etree.ElementTree

e = xml.etree.ElementTree.parse('test.xml').getroot()

listWord = []
listAtt1 = []
listAtt2 = []

for child in e.iter('token'):
    listWord.append(child.attrib['word'])
    listAtt1.append(child.attrib['att1'])
    listAtt2.append(child.attrib['att2'])

print(listWord)
print(listAtt1)
print(listAtt2)

will print:

['Rocky', 'the', 'yellow', 'dog']
['noun', 'article', 'adjective', 'noun']
['name', '', 'color', 'animal']

e.iter() iterates over e (the root) and every element below it; passing the tag token restricts the results to token elements. child.attrib is a dictionary of that element's attributes, and we append the individual values to the lists.
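
For anyone without a test.xml on disk, the same loop can be tried against an inline string. This is only a minimal sketch, assuming the sample document from the question; SAMPLE_XML and collect_attributes are hypothetical names, and Element.get() is swapped in for attrib[...] so that a token missing an attribute yields an empty string rather than a KeyError.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the question's sample document.
SAMPLE_XML = """<foo>
  <bar>
    <unit>
      <structure>
        <token word="Rocky" att1="noun" att2="name">Rocky</token>
        <token word="the" att1="article" att2="">the</token>
        <token word="yellow" att1="adjective" att2="color">yellow</token>
        <token word="dog" att1="noun" att2="animal">dog</token>
      </structure>
    </unit>
  </bar>
</foo>"""


def collect_attributes(root, names=("word", "att1", "att2")):
    """Return one list per attribute name, in document order of the tokens."""
    columns = {name: [] for name in names}
    for token in root.iter("token"):
        for name in names:
            # get() falls back to "" when the attribute is absent
            columns[name].append(token.get(name, ""))
    return columns


root = ET.fromstring(SAMPLE_XML)
columns = collect_attributes(root)
print(columns["word"])
print(columns["att1"])
print(columns["att2"])
```

Keeping the lists in a dictionary keyed by attribute name avoids having to add a new variable for every extra attribute.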

EDIT: For the second bit of your question, I think the following will (though potentially not best practice) do what you are looking for:

import xml.etree.ElementTree

e = xml.etree.ElementTree.parse('test.xml').getroot()

listWord = []
listAtt1 = []
listAtt2 = []
animal_structs = []

for structure in e.iter('structure'):
    for child in structure.iter('token'):
        if 'att2' in child.keys():
            if child.attrib['att2'] == 'animal':
                animal_structs.append(structure)
                break

for structure in animal_structs:
    for child in structure.iter('token'):
        listWord.append(child.attrib['word'])
        listAtt1.append(child.attrib['att1'])
        listAtt2.append(child.attrib['att2'])

print(listWord)
print(listAtt1)
print(listAtt2)

We first create a list of all the structure elements with an animal child, and then collect all the attributes for each of those structures.
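
The two passes can also be folded into one by letting ElementTree's limited XPath support do the filtering. The following is a rough sketch rather than a drop-in replacement: the two-structure document is made up for illustration, and it assumes the tokens are direct children of each structure, as in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the first structure contains an animal token.
SAMPLE_XML = """<foo>
  <structure>
    <token word="the" att1="article"/>
    <token word="yellow" att1="adjective" att2="color"/>
    <token word="dog" att1="noun" att2="animal"/>
  </structure>
  <structure>
    <token word="red" att1="adjective" att2="color"/>
    <token word="car" att1="noun" att2="vehicle"/>
  </structure>
</foo>"""

root = ET.fromstring(SAMPLE_XML)

list_word, list_att1, list_att2 = [], [], []
for structure in root.iter("structure"):
    # find() returns None when no direct child token matches the predicate
    if structure.find("token[@att2='animal']") is None:
        continue
    # findall("token") yields the matching token and all of its siblings
    for token in structure.findall("token"):
        list_word.append(token.get("word", ""))
        list_att1.append(token.get("att1", ""))
        list_att2.append(token.get("att2", ""))

print(list_word)
print(list_att1)
print(list_att2)
```

Using get() with a default here means a token without att2 contributes an empty string, so the three lists stay aligned.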

edited Nov 3, 2016 at 14:53

answered Nov 3, 2016 at 13:10

9 Comments

Ize Over a year ago

This looks very promising but still all I get is: if child.attrib['att2'] == 'animal': KeyError: 'att2'

asongtoruin Over a year ago

We need to check that the token has this key - I've edited this in.

Ize Over a year ago

You are correct. This works flawlessly when I test it with a short XML extract but not that much with the original XML file. Must be something wrong with that one then.

Ize Over a year ago

I keep getting the same key error. There is probably still some upper structure I have forgotten or some other reason it doesn't recognize the attribute att2.

asongtoruin Over a year ago

You shouldn't get the same error if you've added the if 'att2' in child.keys() line in, so I'm a little confused here.
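
One possible explanation for the recurring KeyError, offered as an assumption since the original file is not shown: the 'att2' in child.keys() guard only protects the first loop, while the second loop still indexes child.attrib['word'], child.attrib['att1'] and child.attrib['att2'] directly. Any token in a matched structure that lacks one of those attributes would still raise. A small sketch of the difference between the two access styles, using a hypothetical token with no att2:

```python
import xml.etree.ElementTree as ET

# Hypothetical token that lacks the att2 attribute entirely.
token = ET.fromstring('<token word="car" att1="noun">car</token>')

try:
    value = token.attrib["att2"]      # direct indexing raises on a missing key
except KeyError as exc:
    value = None
    print("KeyError:", exc)

safe_value = token.get("att2", "")    # get() returns the default instead
print(repr(safe_value))
```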

Guillaume · Accepted Answer · 2016-11-03 13:16:22Z

I'm not sure I understand your question, but here are the parts that I understand (using lxml and xpath):

from lxml import etree
tree = etree.fromstring("""<foo>
  <bar>
      <unit>
          <structure>
              <token word="Rocky" att1="noun" att2="name"></token>
              <token word="the" att1="article" att2=""></token>
              <token word="yellow" att1="adjective" att2="color"></token>
              <token word="dog" att1="noun" att2="animal"></token>
          </structure>
      </unit>
  </bar>
</foo>""")


# get a list of all possible words, att1, att2:
listWord = tree.xpath("//token/@word")
listAtt1 = tree.xpath("//token/@att1")
listAtt2 = tree.xpath("//token/@att2")

# get all the tokens with att2="animal"
for token in tree.xpath('//token[@att2="animal"]'):
    print(token.get("word"))  # placeholder: do your own stuff here
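
The sibling part of the question can probably be handled in the XPath expression itself, by selecting the tokens of every structure that has at least one token with att2="animal". This is a sketch under assumptions: it requires the third-party lxml package, the two-structure document is invented for illustration, and the tokens are assumed to be direct children of structure.

```python
from lxml import etree

# Hypothetical document: only the first structure contains an animal token.
tree = etree.fromstring("""<foo>
  <structure>
    <token word="yellow" att1="adjective" att2="color"/>
    <token word="dog" att1="noun" att2="animal"/>
  </structure>
  <structure>
    <token word="red" att1="adjective" att2="color"/>
    <token word="car" att1="noun" att2="vehicle"/>
  </structure>
</foo>""")

# The predicate keeps a structure if any of its tokens has att2="animal";
# the trailing /token step then returns that token together with its siblings.
tokens = tree.xpath('//structure[token/@att2="animal"]/token')

list_word = [t.get("word", "") for t in tokens]
list_att1 = [t.get("att1", "") for t in tokens]
list_att2 = [t.get("att2", "") for t in tokens]

print(list_word)
print(list_att1)
print(list_att2)
```

Selecting the token elements first and reading the attributes with get() keeps the three lists aligned even if some token lacks an attribute, which selecting //token/@att2 directly would not guarantee.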
