
I'm trying to extract some specific elements from an XML document. I download the data from an API and save it in a variable named sitios2.

XML:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<lista><sitio sitio_id="131997">
<custom_id/>    <lang></lang>
    <fecha_alta>2017-06-22 22:38:18</fecha_alta>
<observaciones/>    <ultimas24hrs>  <item id='imps24ad'>0</item>
    <item id='clicks24'>0</item>
    <item id='imps24blank'>0</item>
    <item id='ctr24'>0</item>
</ultimas24hrs>
<fecha_baja/>   <sitio_id>131997</sitio_id>
    <estado>1</estado>
    <hex_sitio_id>2039D
</hex_sitio_id>
    <url>https://www.google.com.ar/</url>
    <nombre>google.com.ar</nombre>
</sitio>

My code:

import xml.etree.ElementTree as ET
root = ET.fromstring(sitios2)
for child in root:
    print(child.tag, child.attrib)
for item in root.iter('item'):
    print(item.attrib)

The output I get is:

('sitio', {'sitio_id': '131997'})

{'id': 'imps24ad'}
{'id': 'clicks24'}

What I'm looking for is a text file that contains only the fields I need:

sitio_id="131997" 
fecha_alta 2017-06-22 22:38:18
imps24blank 0
estado 1 
url https://www.google.com.ar/
nombre google.com.ar
  • I'm not exactly sure what you are looking for. Do you want to output a (stripped-down) XML file again, like the one at the end of your post? Or extract certain elements into a Python type, like a dict? What do you want to do with the data? Commented Jan 26, 2018 at 13:40
  • Your XML tags don't match. Commented Jan 26, 2018 at 13:41
  • I just edited @ascripter Commented Jan 26, 2018 at 13:53
  • Yes, it's indeed a broken XML file. If that's what you need to handle, a regex-approach might be best. Otherwise fix your XML first. Commented Jan 26, 2018 at 13:55
  • I fixed the XML. Do you have an idea how I can extract what I'm looking for? @ascripter Commented Jan 26, 2018 at 14:01

2 Answers

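You can use XPath:

# root is the tree parsed in the question: root = ET.fromstring(sitios2)
# find() returns only the first <sitio>; iterating over it yields its direct children.
for child in root.find("./sitio"):
    print(child.tag, child.text)
# findall() returns every <item> under ultimas24hrs.
for item in root.findall('./sitio/ultimas24hrs/item'):
    print(item.tag, item.attrib, item.text)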

Output:

custom_id None
lang None
fecha_alta 2017-06-22 22:38:18
observaciones None
ultimas24hrs   
fecha_baja None
sitio_id 131997
estado 1
hex_sitio_id 2039D

url https://www.google.com.ar/
nombre google.com.ar
item {'id': 'imps24ad'} 0
item {'id': 'clicks24'} 0
item {'id': 'imps24blank'} 0
item {'id': 'ctr24'} 0

NOTE: The XML you provided is not valid, so I assumed it looks like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<lista>
  <sitio sitio_id="131997">
    <custom_id/>
    <lang/>
    <fecha_alta>2017-06-22 22:38:18</fecha_alta>
    <observaciones/>
    <ultimas24hrs>
      <item id="imps24ad">0</item>
      <item id="clicks24">0</item>
      <item id="imps24blank">0</item>
      <item id="ctr24">0</item>
    </ultimas24hrs>
    <fecha_baja/>
    <sitio_id>131997</sitio_id>
    <estado>1</estado>
    <hex_sitio_id>2039D</hex_sitio_id>
    <url>https://www.google.com.ar/</url>
    <nombre>google.com.ar</nombre>
  </sitio>
</lista>

7 Comments

In the second for loop, I need the inner values: item {'id': 'imps24blank'} "0"
Also, why can I extract the information for only one sitio? My XML has more entries like the one above, each with a different sitio_id.
@MartinBouhier if you need just the item with id="imps24blank", you can add an if inside the for loop.
How can I mix all the values, like this: custom_id None lang None fecha_alta 2017-06-22 22:38:18 observaciones None ultimas24hrs fecha_baja None sitio_id 131997 estado 1 hex_sitio_id 2039D url https://www.google.com.ar/ nombre google.com.ar item {'id': 'imps24ad'} 0 item {'id': 'clicks24'} 0 item {'id': 'imps24blank'} 0 item {'id': 'ctr24'} 0
What do you mean by "mix"?
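
The if suggested in the comment above could look like the following minimal sketch. The sample_xml string is a shortened, assumed stand-in for the API response (sitios2 in the question), and the second sitio is made up purely to show that findall visits every sitio, whereas root.find("./sitio") stops at the first one.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the API response; the second sitio is invented for illustration.
sample_xml = """<lista>
  <sitio sitio_id="131997">
    <ultimas24hrs>
      <item id="imps24ad">0</item>
      <item id="imps24blank">0</item>
    </ultimas24hrs>
  </sitio>
  <sitio sitio_id="131998">
    <ultimas24hrs>
      <item id="imps24ad">3</item>
      <item id="imps24blank">5</item>
    </ultimas24hrs>
  </sitio>
</lista>"""

root = ET.fromstring(sample_xml)

matches = []
# findall() returns the matching items under every sitio, not just the first.
for item in root.findall('./sitio/ultimas24hrs/item'):
    # Keep only the item whose id attribute is imps24blank.
    if item.get('id') == 'imps24blank':
        matches.append((item.get('id'), item.text))

for item_id, value in matches:
    print(item_id, value)
```

Note that item.text is always a string (or None for an empty element), so convert it with int() if you need a number.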

Just iterate through the sitio elements and, in every iteration, use XPath to find the information you need within the current sitio:

# root is the tree parsed in the question: root = ET.fromstring(sitios2)
for s in root.findall('sitio'):
    # Look up each wanted element relative to the current sitio.
    sitio_id = s.find('sitio_id')
    fa = s.find('fecha_alta')
    i24 = s.find('*/item[@id="imps24blank"]')
    estado = s.find('estado')
    url = s.find('url')
    nombre = s.find('nombre')

    # Print the tag name and text content of each element.
    print(sitio_id.tag, sitio_id.text)
    print(fa.tag, fa.text)
    print(i24.tag, i24.text)
    print(estado.tag, estado.text)
    print(url.tag, url.text)
    print(nombre.tag, nombre.text)

eval.in demo

Breakdown of the XPath expression used to find the i24 value:

  • *: find child element of any name
  • /item: then from such elements, find child elements named item where...
  • [@id="imps24blank"]: ...id attribute value equals string "imps24blank"
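
To tie the expression above back to the text file the question asks for, here is a minimal sketch rather than a definitive implementation. The sample_xml data, the extract_lines helper, and the sitios.txt filename are all assumptions made for illustration; the second sitio is invented. It uses findtext, which accepts the same path syntax as find and returns the element's text directly.

```python
import xml.etree.ElementTree as ET

# Made-up sample with two sitio elements; the real data would come from the API.
sample_xml = """<lista>
  <sitio sitio_id="131997">
    <fecha_alta>2017-06-22 22:38:18</fecha_alta>
    <ultimas24hrs>
      <item id="imps24ad">0</item>
      <item id="imps24blank">0</item>
    </ultimas24hrs>
    <estado>1</estado>
    <url>https://www.google.com.ar/</url>
    <nombre>google.com.ar</nombre>
  </sitio>
  <sitio sitio_id="131998">
    <fecha_alta>2017-06-23 10:00:00</fecha_alta>
    <ultimas24hrs>
      <item id="imps24blank">7</item>
    </ultimas24hrs>
    <estado>1</estado>
    <url>https://example.com/</url>
    <nombre>example.com</nombre>
  </sitio>
</lista>"""


def extract_lines(xml_text):
    """Return one text line per wanted field, for every sitio element."""
    root = ET.fromstring(xml_text)
    lines = []
    for s in root.findall('sitio'):
        # The id is read from the sitio_id attribute of the sitio element.
        lines.append('sitio_id="{}"'.format(s.get('sitio_id')))
        lines.append('fecha_alta {}'.format(s.findtext('fecha_alta', default='')))
        # Any child of sitio, then the item whose id attribute is imps24blank.
        lines.append('imps24blank {}'.format(
            s.findtext('*/item[@id="imps24blank"]', default='')))
        # The remaining fields are plain child elements.
        for tag in ('estado', 'url', 'nombre'):
            lines.append('{} {}'.format(tag, s.findtext(tag, default='')))
    return lines


if __name__ == '__main__':
    # sitios.txt is a hypothetical output path.
    with open('sitios.txt', 'w', encoding='utf-8') as fh:
        fh.write('\n'.join(extract_lines(sample_xml)) + '\n')
```

Passing default='' to findtext means a missing element produces an empty value instead of raising AttributeError on None, which can happen with find(...).text when a sitio lacks one of the fields.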

3 Comments

I don't understand how I should create a variable to extract the imps24blank value. i24 = s.find('*/item[@id="imps24blank"]') doesn't show me anything.
@MartinBouhier the eval.in demo provided in this answer shows otherwise: it prints item 0. Please create a minimal eval.in demo that demonstrates your problem; otherwise I have no idea how to help, because the same code works in my demo.
Yes, I saw it and got the results I was looking for! Thanks, man! Could you help me with my new question? It is very similar to this one. stackoverflow.com/questions/48512286/…
