
I am using Python to parse an .xml file that is fairly complicated because it has many nested children; accessing some of the values it contains is awkward, and the code quickly becomes hard to read.

Here is the .xml file:

<?xml version="1.0" encoding="utf-8"?>
<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBB>
        <stepBBB stepBBB1="pinco">1</stepBBB>
      </stepBB>
      <stepBC>
        <stepBCA>
          <GOAL2>22222</GOAL2>
        </stepBCA>
      </stepBC>
      <stepBD>-NO WOMAN NO CRY                                            
              -I SHOT THE SHERIF                                                           
              -WHO LET THE DOGS OUT
      </stepBD>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepB stepB1="12" stepB2="13" />
      <stepC>XXX</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="saf12">33333</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepB stepB1="14" stepB2="15" />
      <stepC>YYY</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="fwe34">44444</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
</Start>

My goal is to access the values stored in the children named "GOAL" in a nicer way than the one in my sample code below. Furthermore, I would like an automated way to find the values of GOALs that have the same kind of tag but belong to different children with the same name:

Example: GIOVANNI and ANDREA are both stored under the same kind of tag (GOAL3_NAME), but they belong to different children that have the same name (<step3>).

Here is the code that I wrote:

import xml.etree.ElementTree as ET
data = ET.parse('test.xml').getroot()

GOAL1 = data.getchildren()[1].getchildren()[0].text
print(GOAL1)

GOAL2 = data.getchildren()[1].getchildren()[1].getchildren()[1].getchildren()[0].getchildren()[0].text
print(GOAL2)

GOAL3 = data.getchildren()[2].getchildren()[0].text
print(GOAL3)

GOAL4_A = data.getchildren()[2].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_A)

GOAL4_B = data.getchildren()[3].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_B) 

and the output that I get is the following:

11111
22222


33333
44444

The output that I would like should be like this:

11111 
22222
GIOVANNI
33333
ANDREA
44444

As you can see, I am able to read GOAL1 and GOAL2, but I am looking for a nicer coding practice to access those values, since my current code is too long and hard to read and understand.

The second thing I would like to do is get GOAL3 and GOAL4 in an automated way, so that I do not have to repeat similar lines of code and the result is more readable and understandable.

Note: as you can see I was not able to read GOAL3. If possible I would like to get both the GOAL3_NAME and GOAL3_ID

In order to make the .xml file structure more understandable I post an image of what it looks like:

enter image description here

The highlighted elements are what I am looking for.

3
  • Am I right that you want to get the text from elements with the tag "GOAL" + str(N), where N is a number? Commented Nov 16, 2016 at 9:53
  • from GOAL1, GOAL2, GOAL4 I need to get the numbers they store: 11111, 22222, 33333, 44444. As for GOAL3 I would like to get GIOVANNI and ANDREA. Commented Nov 16, 2016 at 10:00
  • I would recommend you to use "for d in data.iter()" Commented Nov 16, 2016 at 10:22

2 Answers

1

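Here is a simple example that walks the tree from head to tail with a recursive function; you can then collect the needed information from that. It was written with cElementTree (15-20x faster) on Python 2; from Python 3.3 on, plain xml.etree.ElementTree picks up the C accelerator automatically, and cElementTree was removed in Python 3.9.

# On Python 2, import xml.etree.cElementTree instead for the C-accelerated parser.
import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

def get_tail(root):
    # Depth-first walk: print each child's text, then recurse into that child.
    for child in root:
        print(child.text)
        get_tail(child)

get_tail(root)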
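To illustrate the "collect the needed information" step, here is a minimal sketch that gathers matches into a list instead of printing every element. The helper name `collect`, its `wanted` parameter, and the inline `XML` string (a trimmed copy of the file from the question) are assumptions made for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's file, inlined so the sketch runs on its own.
XML = """<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBC><stepBCA><GOAL2>22222</GOAL2></stepBCA></stepBC>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepC><stepCC><stepCC GOAL4="saf12">33333</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepC><stepCC><stepCC GOAL4="fwe34">44444</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
</Start>"""


def collect(element, wanted, found=None):
    """Recursively gather (tag, text, attributes) for elements whose tag is in `wanted`."""
    if found is None:
        found = []
    for child in element:
        if child.tag in wanted:
            # GOAL3 holds only whitespace as text; its data lives in the attributes.
            text = (child.text or "").strip()
            found.append((child.tag, text, dict(child.attrib)))
        collect(child, wanted, found)
    return found


root = ET.fromstring(XML)
results = collect(root, {"GOAL1", "GOAL2", "GOAL3"})
for tag, text, attrib in results:
    print(tag, text, attrib)
```

Keeping both the text and the attribute dictionary in each result means GOAL3_NAME and GOAL3_ID are available together, which is what the question asks for.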

1 Comment

Works best if you just want to iterate over the tree without any constraint.
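When you do want constraints, one possible alternative is ElementTree's path-based lookups (`findtext`, `find`, `iter` with a tag name). The sketch below is only an illustration under the assumption that the file looks like the one in the question; the inline `XML` string is a trimmed copy of it and the variable names are made up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's file, inlined so the sketch runs on its own.
XML = """<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBC><stepBCA><GOAL2>22222</GOAL2></stepBCA></stepBC>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepC>XXX</stepC>
      <stepC><stepCC><stepCC GOAL4="saf12">33333</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepC>YYY</stepC>
      <stepC><stepCC><stepCC GOAL4="fwe34">44444</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
</Start>"""

root = ET.fromstring(XML)

lines = [
    root.findtext("step2/GOAL1"),  # explicit path from the root
    root.findtext(".//GOAL2"),     # ".//" searches at any depth
]

# iter("GOAL3") visits every <GOAL3>, however many <step3> blocks exist.
for goal3 in root.iter("GOAL3"):
    lines.append(goal3.get("GOAL3_NAME"))  # goal3.get("GOAL3_ID") works the same way
    # [@GOAL4] keeps only the <stepCC> that carries a GOAL4 attribute.
    inner = goal3.find(".//stepCC[@GOAL4]")
    lines.append(inner.text)

print("\n".join(lines))
```

This prints the six values in the order the question asks for, and the loop over GOAL3 removes the need to repeat one long line per <step3> block.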
1
import xml.etree.cElementTree as ET
data = ET.parse('test.xml')
for d in data.iter():
    if d.tag in ["GOAL1", "GOAL2", "stepCC"]:
        print d.text
    elif d.tag in ["GOAL3", "GOAL4"]:
        print d.attrib.values()[0]

3 Comments

When I run it I get: 'dict_values' object does not support indexing
Are you using Python 3.x?
You can try replacing 'd.attrib.values()[0]' with 'd.attrib["GOAL3_NAME"]'.
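For reference, here is a rough Python 3 sketch of the same `iter()` approach with the attribute looked up by name, as suggested in the comment above. On Python 3, `dict.values()` returns a view that cannot be indexed, which is where the 'dict_values' error comes from. The inline `XML` string is a trimmed copy of the question's file and the `values` list is an illustrative name, not something from the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's file, inlined so the sketch runs on its own.
XML = """<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBC><stepBCA><GOAL2>22222</GOAL2></stepBCA></stepBC>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepC>XXX</stepC>
      <stepC><stepCC><stepCC GOAL4="saf12">33333</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepC>YYY</stepC>
      <stepC><stepCC><stepCC GOAL4="fwe34">44444</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
</Start>"""

root = ET.fromstring(XML)
values = []

# iter() walks every element in document order, at any depth.
for d in root.iter():
    if d.tag in ("GOAL1", "GOAL2"):
        values.append(d.text)
    elif d.tag == "GOAL3":
        # Look the attribute up by name instead of indexing values().
        values.append(d.attrib["GOAL3_NAME"])
    elif d.tag == "stepCC" and "GOAL4" in d.attrib:
        # Only the inner <stepCC> has a GOAL4 attribute and the number as text.
        values.append(d.text)

print(values)
# ['11111', '22222', 'GIOVANNI', '33333', 'ANDREA', '44444']
```

Checking for the GOAL4 attribute skips the outer <stepCC> wrapper, whose text is only whitespace and would otherwise show up as a blank line.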
