Reading different XML to pandas dataframe

Question

I have several XML files which I want to read in to a pandas dataframe and merge them into one dataframe with a unified time stamp.

For example the dataframes look like that:

# PTU.xml:

<?xml version='1.0' encoding='utf8'?>
<PtuResults>
  <Row DataSrvTime="2021-07-08T08:58:14.3942671" SP="950.067" PFH="950.067" T="291.810" Hum="84.035" Alt="590.081" Status="0" />
  <Row DataSrvTime="2021-07-08T08:58:36.8831974" SP="950.935" PFH="949.456" T="291.569" Hum="89.576" Alt="582.223" Status="0" />
  <Row DataSrvTime="2021-07-08T08:58:36.8835716" SP="950.256" PFH="949.207" T="291.072" Hum="91.548" Alt="588.357" Status="0" />
</PtuResults>

# - or -
#
# AMS.xml:

<?xml version='1.0' encoding='utf8'?>
<AdditionalSensorData>
  <Row MeasurementOffset="-0.403" DataSrvTime="2021-07-08T08:43:29.0616419" GpsTO="12" XData=" C7 10 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2FE 320 0 0 230 165 111 224 1201 13BF 13CB B 0 58 1 278 B695" />
  <Row MeasurementOffset="-0.253" DataSrvTime="2021-07-08T08:43:30.0866790" GpsTO="35" XData=" C2 16 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 320 0 0 0 23A 133 111 223 118C 1236 1237 9 0 6B 1 278 9B2" />
  <Row MeasurementOffset="-0.103" DataSrvTime="2021-07-08T08:43:31.1107931" GpsTO="58" XData=" CB E 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2FE 341 0 0 230 179 111 222 11E0 1396 13A2 10 0 7C 2 278 3ECA" />
</AdditionalSensorData>

First, I am reading in the XML-files:

import pandas as pd
import numpy as np
import os
import xml.etree.ElementTree as ET 

# Mypath with datafiles:
path = 'c:/mypath/'

# Names of specific files:
PTU = 'PTU.xml'
GPS = 'GPS.xml'
AMS = 'AMS.xml'

data_PTU = ET.parse(path + PTU).getroot()
data_GPS = ET.parse(path + GPS).getroot()
data_AMS = ET.parse(path + AMS).getroot()

I defined a principal routine for the transformation into a Pandas dataframe with the attribute names as column names:

def parse_XML(xml_file, df_cols): 
    xroot = ET.parse(xml_file).getroot()
    rows = []
    
    for attribute in xroot: 
        res = []
        res.append(attribute.attrib.get(df_cols[0]))
        for element in df_cols[:]: 

            if attribute is not None and attribute.find(element) is not None:
                res.append(attribute.find(element))
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

And finally I am using for example

parse_XML(path + AMS,["DataSrvTime", "XData"]) 
# -or-
parse_XML(path + PTU, ["DataSrvTime", "SP", "T", "Hum", "Alt"])

to make a dataframe. The result shows a dataframe with the first column as index column, the second column as "DataSrvTime" in this case and any following column filled with "None"s. If I change the order to

parse_XML(path + AMS,["XData", "DataSrvTime"])

for example XData is shown but DataSrvTime column contains "None"s. Does anyone see where my mistake is?

If the reading into a dataframe is perfect the further plan is to merge all the different dataframes (AMS, PTU, GPS for example) with (Pandas) resample('S').mean() and (Pandas) dataframe.interpolate() to a unifying timestamp into one dataframe. Would you recommend another solution?

Thanks for reading and helping out.

Parfait · Accepted Answer · 2022-06-14 14:27:21Z

2

Given your attribute-centric XML is relatively flat, consider the new IO method, pandas.read_xml. Then, merge the two data frames or run an iterative join using concat on a list of data frames which requires the datetime to be set as index.

pd.merge

final_df = pd.merge(
    pd.read_xml('PTU.xml'), pd.read_xml('AMS.xml'),
    on = "DataSrvTime", how = "outer"
)

pd.concat

df_list = [
    pd.read_xml(f).set_index("DataSrvTime")
    for f in 
    ['PTU.xml', 'GPS.xml', 'AMS.xml']
]

final_df = pd.concat(df_list, axis=0)

edited Jun 14, 2022 at 14:27

answered Jun 14, 2022 at 3:10

Parfait

108k19 gold badges103 silver badges138 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Swawa Over a year ago

parfait: (can't add and "@" here.. it's deleted any time...) thanks for the hint to pandas.read_xml. this works very well, and seems to be a fast and good solution :-). The merge part only works for two files at once with the parameter how = "outer". Is there another way to merge more than two files instead of merging two first and merge each further xml step by step? With the last step I get an error. Might there be a misspelling in the loop?

Swawa Over a year ago

parfait: is set_index not associated to dataframes only?

Swawa Over a year ago

it seems to work if I add "=": df_list = [.....]

Parfait Over a year ago

Yes, please add assignment, =. And pd.concat is the way for more than two data frame outer join.

Swawa Over a year ago

perfectly, this works. Thanks for the new insights :-)

Collectives™ on Stack Overflow

Reading different XML to pandas dataframe

1 Answer 1

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related