How to parse .xml file with multiple nested children in python?

Question

I am using Python to parse an .xml file that is quite complicated because it has many nested children. Accessing some of the values it contains is awkward, and the code quickly becomes hard to read.

Here is the .xml file:

<?xml version="1.0" encoding="utf-8"?>
<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBB>
        <stepBBB stepBBB1="pinco">1</stepBBB>
      </stepBB>
      <stepBC>
        <stepBCA>
          <GOAL2>22222</GOAL2>
        </stepBCA>
      </stepBC>
      <stepBD>-NO WOMAN NO CRY                                            
              -I SHOT THE SHERIF                                                           
              -WHO LET THE DOGS OUT
      </stepBD>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepB stepB1="12" stepB2="13" />
      <stepC>XXX</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="saf12">33333</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepB stepB1="14" stepB2="15" />
      <stepC>YYY</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="fwe34">44444</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
</Start>

My goal is to access the values contained in the children named "GOAL" in a nicer way than the one in my sample code below. I would also like an automated way to find the GOAL values that share the same kind of tag but belong to different children with the same name.

Example: GIOVANNI and ANDREA are both stored under the same attribute (GOAL3_NAME), but they belong to different children that share the same name (<step3>).

Here is the code that I wrote:

import xml.etree.ElementTree as ET
data = ET.parse('test.xml').getroot()

GOAL1 = data.getchildren()[1].getchildren()[0].text
print(GOAL1)

GOAL2 = data.getchildren()[1].getchildren()[1].getchildren()[1].getchildren()[0].getchildren()[0].text
print(GOAL2)

GOAL3 = data.getchildren()[2].getchildren()[0].text
print(GOAL3)

GOAL4_A = data.getchildren()[2].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_A)

GOAL4_B = data.getchildren()[3].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_B)

and the output that I get is the following:

The output that I would like should be like this:

11111 
22222
GIOVANNI
33333
ANDREA
44444

As you can see, I am able to read GOAL1 and GOAL2, but I am looking for a cleaner way to access those values, since the code is too long and hard to read.

The second thing I would like to do is get GOAL3 and GOAL4 in an automated way, so that I do not have to repeat similar lines of code and the result is more readable.

Note: as you can see, I was not able to read GOAL3. If possible, I would like to get both the GOAL3_NAME and the GOAL3_ID.

In order to make the .xml file structure more understandable I post an image of what it looks like:

The highlighted elements are what I am looking for.

Am I right that you want to get the text from elements with the tag "GOAL" + str(N), where N is a number?
– nick_gabpe, Commented Nov 16, 2016 at 9:53
From GOAL1, GOAL2, and GOAL4 I need the numbers they store: 11111, 22222, 33333, 44444. As for GOAL3, I would like to get GIOVANNI and ANDREA.
– Federico Gentile, Commented Nov 16, 2016 at 10:00

Ari Gold · Accepted Answer · 2016-11-16 10:00:29Z

1

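Here is a simple example that walks the tree from top to bottom with a recursive function; you can then collect the information you need from it. On Python 2, cElementTree is about 15-20x faster than the pure-Python ElementTree. Since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically, and cElementTree was removed in Python 3.9, so the plain import is used below.

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

def get_tail(root):
    # Print the text of each child, then recurse into its own children.
    for child in root:
        print(child.text)
        get_tail(child)

get_tail(root)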
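
The "collect the needed information" step mentioned above could look roughly like the following. This is only a sketch: the `walk` generator, the `rows` and `values` names, and the inlined, trimmed copy of the question's XML are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, inlined so the sketch runs on its own.
XML = """<Start>
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBC>
        <stepBCA>
          <GOAL2>22222</GOAL2>
        </stepBCA>
      </stepBC>
    </stepB>
  </step2>
</Start>"""


def walk(element, depth=0):
    """Yield (depth, tag, stripped text) for every descendant of element."""
    for child in element:
        text = (child.text or "").strip()
        yield depth, child.tag, text
        # Recurse into the child's own children.
        yield from walk(child, depth + 1)


root = ET.fromstring(XML)
rows = list(walk(root))

# Keep only the elements that actually carry text.
values = {tag: text for _, tag, text in rows if text}
print(values)
```

Returning the rows instead of printing them keeps the traversal separate from whatever filtering is needed afterwards. Note that a dict keyed by tag keeps only the last element for each repeated tag name.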


1 Comment

Abhishek Kumar Over a year ago

Works best if you just want to iterate over the tree without any constraints.
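
When constraints are needed, one alternative is ElementTree's path expressions, which can address a specific element directly instead of visiting everything. The following is a minimal sketch on a trimmed, inlined copy of the question's XML; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, inlined so the sketch runs on its own.
XML = """<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBC>
        <stepBCA>
          <GOAL2>22222</GOAL2>
        </stepBCA>
      </stepBC>
    </stepB>
  </step2>
</Start>"""

root = ET.fromstring(XML)

# Explicit path, relative to the root element <Start>.
goal1 = root.findtext("step2/GOAL1")

# ".//" searches all descendants, so intermediate levels can be omitted.
goal2 = root.findtext(".//GOAL2")

# Attributes are read with .get() on the element itself.
step_a = root.find("step1").get("stepA")

print(goal1, goal2, step_a)
```

Compared with chained index lookups, a path such as `step2/GOAL1` keeps working if sibling elements are reordered.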

nick_gabpe · Accepted Answer · 2016-11-16 10:46:30Z

1

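import xml.etree.ElementTree as ET

data = ET.parse('test.xml')
for d in data.iter():
    if d.tag in ["GOAL1", "GOAL2", "stepCC"]:
        # The outer <stepCC> wrapper holds only whitespace, so skip it.
        if d.text and d.text.strip():
            print(d.text.strip())
    elif d.tag == "GOAL3":
        # The original d.attrib.values()[0] fails on Python 3 (dict views are not indexable).
        print(d.attrib["GOAL3_NAME"])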


answered Nov 16, 2016 at 10:39


3 Comments

Federico Gentile Over a year ago

When I run it I get: 'dict_values' object does not support indexing

nick_gabpe Over a year ago

Are you using Python 3.x?

nick_gabpe Over a year ago

You can try replacing 'd.attrib.values()[0]' with 'd.attrib["GOAL3_NAME"]'.
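
Building on the attribute lookup suggested in this comment, here is a hedged sketch of reading both GOAL3_NAME and GOAL3_ID for every <step3>, together with the nested GOAL4 value. The `records` list and its dictionary keys are illustrative names, and the XML is a trimmed, inlined copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, inlined so the sketch runs on its own.
XML = """<Start>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepC>XXX</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="saf12">33333</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepC>YYY</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="fwe34">44444</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
</Start>"""

root = ET.fromstring(XML)

records = []
# findall returns every <GOAL3> that sits directly under a <step3>.
for goal3 in root.findall("step3/GOAL3"):
    # Pick the <stepCC> that carries a GOAL4 attribute, at any depth.
    inner = goal3.find(".//stepCC[@GOAL4]")
    records.append({
        "name": goal3.get("GOAL3_NAME"),
        "id": goal3.get("GOAL3_ID"),
        "goal4": inner.text if inner is not None else None,
    })

for record in records:
    print(record["name"], record["id"], record["goal4"])
```

Using `.get()` returns `None` for a missing attribute rather than raising `KeyError`, which may be convenient if some <GOAL3> elements lack an ID.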
