I'm still new to programming, but I know some Python and am familiar with XPath and XML in general. I'm currently working with XML data that looks something like this:

<foo>
  <bar>
      <unit>
          <structure>
              <token word="Rocky" att1="noun" att2="name">Rocky</token>
              <token word="the" att1="article" att2="">the</token>
              <token word="yellow" att1="adjective" att2="color">yellow</token>
              <token word="dog" att1="noun" att2="animal">dog</token>
          </structure>
      </unit>
  </bar>
</foo>

What I need to do is first find an attribute value. Take

<token word="dog" att1="noun" att2="animal">dog</token>

for instance. In every structure in the document, I want to first find all the nodes that have animal as their att2 value and THEN collect all the siblings of each such node into a list. Because each node has several attributes, I want a separate list per attribute; that is, one list for each attribute, built from all the tokens in any structure where one of the children has animal as its att2 value. For instance:

 listWord = [Rocky, the, yellow, dog]
 listAtt1 = [noun, article, adjective, noun]
 listAtt2 = [name, , color, animal]

At the moment I'm just wondering if it's even possible. Thus far I've only managed to hit my head against the wall with the attribute structure not to mention the empty values.

  • Your XML is not valid, it misses a few closing > for the tokens Commented Nov 3, 2016 at 13:00
  • Your XML structure is broken, all the <token> tags are missing the closing >, maybe a copy and paste error. Commented Nov 3, 2016 at 13:00
  • THEN get all the siblings of that node into a list. => what exactly do you call a sibling ? Commented Nov 3, 2016 at 13:05
  • Are the example listWord listAtt1 and listAtt2 the lists you are trying to build ? Commented Nov 3, 2016 at 13:07
  • Whoops, yeah just forgot the closings while constructing the structure. But they are there. Commented Nov 3, 2016 at 14:15

2 Answers

With the closing token tags included, and assuming your text is contained in test.xml, the following:

import xml.etree.ElementTree

e = xml.etree.ElementTree.parse('test.xml').getroot()

listWord = []
listAtt1 = []
listAtt2 = []

for child in e.iter('token'):
    listWord.append(child.attrib['word'])
    listAtt1.append(child.attrib['att1'])
    listAtt2.append(child.attrib['att2'])

print(listWord)
print(listAtt1)
print(listAtt2)

will print:

['Rocky', 'the', 'yellow', 'dog']
['noun', 'article', 'adjective', 'noun']
['name', '', 'color', 'animal']

e.iter() iterates over e (the root) and every element below it; passing the tag name 'token' restricts the results to token elements. child.attrib is a dictionary of that element's attributes, so we look up each attribute by name and append its value to the matching list.

EDIT: For the second bit of your question, I think the following will (though potentially not best practice) do what you are looking for:

import xml.etree.ElementTree

e = xml.etree.ElementTree.parse('test.xml').getroot()

listWord = []
listAtt1 = []
listAtt2 = []
animal_structs = []

for structure in e.iter('structure'):
    for child in structure.iter('token'):
        if 'att2' in child.keys():
            if child.attrib['att2'] == 'animal':
                animal_structs.append(structure)
                break

for structure in animal_structs:
    for child in structure.iter('token'):
        # .get() avoids a KeyError when a token lacks one of the attributes
        listWord.append(child.get('word', ''))
        listAtt1.append(child.get('att1', ''))
        listAtt2.append(child.get('att2', ''))

print(listWord)
print(listAtt1)
print(listAtt2)

We first build a list of all the structure elements that have an animal child, and then collect the attributes of every token in each of those structures.
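
A more compact sketch of the same two-step idea, written so that tokens missing an attribute entirely do not raise a KeyError. The helper name collect_attributes and the inline sample XML (including the token without att2 and the second structure) are made up for illustration and are not from the question's real file:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second token has no att2 attribute at all,
# and the second structure contains no "animal" token.
SAMPLE = """<foo>
  <bar>
    <unit>
      <structure>
        <token word="Rocky" att1="noun" att2="name">Rocky</token>
        <token word="the" att1="article">the</token>
        <token word="dog" att1="noun" att2="animal">dog</token>
      </structure>
      <structure>
        <token word="a" att1="article" att2="">a</token>
        <token word="table" att1="noun" att2="furniture">table</token>
      </structure>
    </unit>
  </bar>
</foo>"""


def collect_attributes(root, wanted="animal", names=("word", "att1", "att2")):
    """Return {attribute name: [values]} for every token in any structure
    that contains at least one token whose att2 equals `wanted`."""
    lists = {name: [] for name in names}
    for structure in root.iter("structure"):
        tokens = list(structure.iter("token"))
        # Element.get() returns None for a missing attribute instead of raising.
        if any(tok.get("att2") == wanted for tok in tokens):
            for tok in tokens:
                for name in names:
                    # Missing attributes are recorded as empty strings.
                    lists[name].append(tok.get(name, ""))
    return lists


result = collect_attributes(ET.fromstring(SAMPLE))
print(result["word"])
print(result["att1"])
print(result["att2"])
```

Recording a missing attribute as an empty string keeps the three lists the same length, so the values at a given index still belong to the same token.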

9 Comments

This looks very promising but still all I get is: if child.attrib['att2'] == 'animal': KeyError: 'att2'
We need to check that the token has this key - I've edited this in.
You are correct. This works flawlessly when I test it with a short XML extract but not that much with the original XML file. Must be something wrong with that one then.
I keep getting the same key error. There is probably still some upper structure I have forgotten or some other reason it doesn't recognize the attribute att2.
You shouldn't get the same error if you've added the if 'att2' in child.keys() line in, so I'm a little confused here.

I'm not sure I understand your question, but here are the parts that I understand (using lxml and xpath):

from lxml import etree
tree = etree.fromstring("""<foo>
  <bar>
      <unit>
          <structure>
              <token word="Rocky" att1="noun" att2="name"></token>
              <token word="the" att1="article" att2=""></token>
              <token word="yellow" att1="adjective" att2="color"></token>
              <token word="dog" att1="noun" att2="animal"></token>
          </structure>
      </unit>
  </bar>
</foo>""")


# get a list of all possible words, att1, att2:
listWord = tree.xpath("//token/@word")
listAtt1 = tree.xpath("//token/@att1")
listAtt2 = tree.xpath("//token/@att2")

# get all the tokens with att2="animal"
for token in tree.xpath('//token[@att2="animal"]'):
    print(token.get("word"))  # replace with your own processing
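
If "siblings" means the other tokens in the same structure, one possible reading is to select each structure that has an animal token and then read the attributes of all its tokens. The sketch below assumes that reading and uses the standard library's ElementTree, which supports only a subset of XPath; with lxml, the single expression //structure[token/@att2="animal"] should select the same structures. The variable names are illustrative only:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<foo>
  <bar>
      <unit>
          <structure>
              <token word="Rocky" att1="noun" att2="name"></token>
              <token word="the" att1="article" att2=""></token>
              <token word="yellow" att1="adjective" att2="color"></token>
              <token word="dog" att1="noun" att2="animal"></token>
          </structure>
      </unit>
  </bar>
</foo>""")

list_word, list_att1, list_att2 = [], [], []

for structure in root.iter("structure"):
    # Keep only structures with a direct child token whose att2 is "animal".
    if structure.find("token[@att2='animal']") is None:
        continue
    # All direct token children: the animal token and its siblings.
    for token in structure.findall("token"):
        list_word.append(token.get("word", ""))
        list_att1.append(token.get("att1", ""))
        list_att2.append(token.get("att2", ""))

print(list_word)
print(list_att1)
print(list_att2)
```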
