Extract elements from an XML document with Python

Question

I'm trying to extract some specific elements from an XML document. I download the data from an API and save it in a variable named sitios2.

xml code:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<lista><sitio sitio_id="131997">
<custom_id/>    <lang></lang>
    <fecha_alta>2017-06-22 22:38:18</fecha_alta>
<observaciones/>    <ultimas24hrs>  <item id='imps24ad'>0</item>
    <item id='clicks24'>0</item>
    <item id='imps24blank'>0</item>
    <item id='ctr24'>0</item>
</ultimas24hrs>
<fecha_baja/>   <sitio_id>131997</sitio_id>
    <estado>1</estado>
    <hex_sitio_id>2039D
</hex_sitio_id>
    <url>https://www.google.com.ar/</url>
    <nombre>google.com.ar</nombre>
</sitio>

My code:

import xml.etree.ElementTree as ET
root = ET.fromstring(sitios2)
for child in root:
    print(child.tag, child.attrib)
for item in root.iter('item'):
    print(item.attrib)

The output I get is:

('sitio', {'sitio_id': '131997'})

{'id': 'imps24ad'}
{'id': 'clicks24'}

What I'm looking for is a txt file that contains only the information I need:

sitio_id="131997" 
fecha_alta 2017-06-22 22:38:18
imps24blank 0
estado 1 
url https://www.google.com.ar/
nombre google.com.ar

I'm not exactly sure what you are looking for. Do you want to output a (stripped-down) XML file again, like the one at the end of your post? Or to extract certain elements into a Python type, like a dict? What do you want to do with the data?
– ascripter, Commented Jan 26, 2018 at 13:40
Yes, it's indeed a broken XML file. If that's what you need to handle, a regex approach might be best. Otherwise fix your XML first.
– ascripter, Commented Jan 26, 2018 at 13:55
I fixed the XML. Do you have any idea how I can extract what I'm looking for? @ascripter
– Martin Bouhier, Commented Jan 26, 2018 at 14:01

Cristhian Boujon · Accepted Answer · 2018-01-26 14:30:52Z

1

You can use XPath:

for child in root.find("./sitio"):
    print(child.tag, child.text)
for item in root.findall('./sitio/ultimas24hrs/item'):
    print(item.tag, item.attrib, item.text)

output:

custom_id None
lang None
fecha_alta 2017-06-22 22:38:18
observaciones None
ultimas24hrs   
fecha_baja None
sitio_id 131997
estado 1
hex_sitio_id 2039D

url https://www.google.com.ar/
nombre google.com.ar
item {'id': 'imps24ad'} 0
item {'id': 'clicks24'} 0
item {'id': 'imps24blank'} 0
item {'id': 'ctr24'} 0

NOTE: The XML you provided is not valid, so I assumed that your XML is:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<lista>
  <sitio sitio_id="131997">
    <custom_id/>
    <lang/>
    <fecha_alta>2017-06-22 22:38:18</fecha_alta>
    <observaciones/>
    <ultimas24hrs>
      <item id="imps24ad">0</item>
      <item id="clicks24">0</item>
      <item id="imps24blank">0</item>
      <item id="ctr24">0</item>
    </ultimas24hrs>
    <fecha_baja/>
    <sitio_id>131997</sitio_id>
    <estado>1</estado>
    <hex_sitio_id>2039D</hex_sitio_id>
    <url>https://www.google.com.ar/</url>
    <nombre>google.com.ar</nombre>
  </sitio>
</lista>
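
As a small sketch of the validity point above: ElementTree raises ET.ParseError when the input is not well-formed, for example when the closing </lista> tag is missing. The helper name is_well_formed and the shortened sample strings are assumptions made for illustration; they are not part of the original post.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the XML in the question: <lista> is never closed.
broken = '<lista><sitio sitio_id="131997"><estado>1</estado></sitio>'
# The same fragment with the closing tag restored.
fixed = broken + '</lista>'


def is_well_formed(text):
    """Return True if ElementTree can parse the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(broken))
print(is_well_formed(fixed))
```

A check like this can run right after the API download, before any find or findall calls.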

edited Jan 26, 2018 at 14:30

answered Jan 26, 2018 at 14:22


7 Comments

Martin Bouhier Over a year ago

In the second for loop, I need the inner values: item {'id': 'imps24blank'} "0"

Martin Bouhier Over a year ago

Also, why can I extract the information for only one sitio? The XML contains more entries like the one above, each with a different sitio_id.

Cristhian Boujon Over a year ago

@MartinBouhier if you need just the item with id="imps24blank", you can add an if inside the for statement

Martin Bouhier Over a year ago

How can I mix all the values, like this:

custom_id None lang None fecha_alta 2017-06-22 22:38:18 observaciones None ultimas24hrs    fecha_baja None sitio_id 131997 estado 1 hex_sitio_id 2039D url https://www.google.com.ar/ nombre google.com.ar item {'id': 'imps24ad'} 0 item {'id': 'clicks24'} 0 item {'id': 'imps24blank'} 0 item {'id': 'ctr24'} 0

Cristhian Boujon Over a year ago

What do you mean by "mix"?
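
The two suggestions from this comment thread (looping over every sitio and filtering items with an if) might be combined roughly as follows. This is only a sketch: the second sitio entry and its values are invented for illustration, since the question shows a single site.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-site sample; the second <sitio> is invented for illustration.
xml_text = """<lista>
  <sitio sitio_id="131997">
    <ultimas24hrs>
      <item id="imps24ad">0</item>
      <item id="imps24blank">0</item>
    </ultimas24hrs>
  </sitio>
  <sitio sitio_id="131998">
    <ultimas24hrs>
      <item id="imps24ad">5</item>
      <item id="imps24blank">7</item>
    </ultimas24hrs>
  </sitio>
</lista>"""

root = ET.fromstring(xml_text)

results = []
# findall returns every matching <sitio>; find returns only the first one.
for sitio in root.findall('sitio'):
    for item in sitio.findall('ultimas24hrs/item'):
        # Keep only the item whose id attribute is "imps24blank".
        if item.get('id') == 'imps24blank':
            results.append((sitio.get('sitio_id'), item.get('id'), item.text))

for row in results:
    print(*row)
```

Using root.find("./sitio") instead of findall would visit only the first site, which is the likely reason only one sitio showed up.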

har07 · Accepted Answer · 2018-01-27 10:07:44Z

0

Just iterate through the sitio elements and, in each iteration, use XPath to find the information you need within the current sitio:

# root is the parsed <lista> element from the question
for s in root.findall('sitio'):
    # sitio_id instead of id, to avoid shadowing the built-in id()
    sitio_id = s.find('sitio_id')
    fa = s.find('fecha_alta')
    i24 = s.find('*/item[@id="imps24blank"]')
    estado = s.find('estado')
    url = s.find('url')
    nombre = s.find('nombre')

    print(sitio_id.tag, sitio_id.text)
    print(fa.tag, fa.text)
    print(i24.tag, i24.text)
    print(estado.tag, estado.text)
    print(url.tag, url.text)
    print(nombre.tag, nombre.text)

eval.in demo

Breakdown of the XPath expression used to find the i24 value:

*: find child element of any name
/item: then from such elements, find child elements named item where...
[@id="imps24blank"]: ...id attribute value equals string "imps24blank"
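
To get from these lookups to the txt file asked for in the question, one possible sketch follows. The FIELDS list, the sitio_lines helper, and the sitios.txt file name in the temporary directory are all assumptions for illustration; the sample XML is a trimmed copy of the corrected document shown earlier.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Trimmed copy of the corrected XML; only the fields of interest are kept.
xml_text = """<lista>
  <sitio sitio_id="131997">
    <fecha_alta>2017-06-22 22:38:18</fecha_alta>
    <ultimas24hrs>
      <item id="imps24ad">0</item>
      <item id="imps24blank">0</item>
    </ultimas24hrs>
    <sitio_id>131997</sitio_id>
    <estado>1</estado>
    <url>https://www.google.com.ar/</url>
    <nombre>google.com.ar</nombre>
  </sitio>
</lista>"""

# Child elements wanted for each <sitio>, in output order (hypothetical choice).
FIELDS = ['sitio_id', 'fecha_alta', 'estado', 'url', 'nombre']


def sitio_lines(sitio):
    """Build the 'name value' lines for a single <sitio> element."""
    lines = []
    for name in FIELDS:
        # findtext returns the default when the element is missing.
        lines.append(f"{name} {sitio.findtext(name, default='').strip()}")
    # Same XPath as in the answer: any child, then item with a matching id.
    i24 = sitio.find('*/item[@id="imps24blank"]')
    if i24 is not None:
        lines.append(f"imps24blank {i24.text}")
    return lines


root = ET.fromstring(xml_text)

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), 'sitios.txt')
with open(out_path, 'w', encoding='utf-8') as fh:
    for sitio in root.findall('sitio'):
        # One block per site, separated by a blank line.
        fh.write('\n'.join(sitio_lines(sitio)) + '\n\n')
```

The None check on i24 guards against a sitio that has no imps24blank item, in which case find returns None rather than raising an error.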

answered Jan 27, 2018 at 10:07


3 Comments

Martin Bouhier Over a year ago

I don't understand how I should create a variable to extract imps24blank from the results. i24 = s.find('*/item[@id="imps24blank"]') doesn't show me anything.

har07 Over a year ago

@MartinBouhier the eval.in demo provided in this answer shows otherwise: it prints item 0. Please create a minimal eval.in demo that demonstrates your problem; otherwise I have no idea how to help, because the same code works in my demo.

Martin Bouhier Over a year ago

Yes, I saw it then and got the results I was looking for! Thanks, man! Could you help me with my new question? It is very similar to this one. stackoverflow.com/questions/48512286/…
