1

I have several XML files which I want to read in to a pandas dataframe and merge them into one dataframe with a unified time stamp.

For example the dataframes look like that:

# PTU.xml:

<?xml version='1.0' encoding='utf8'?>
<PtuResults>
  <Row DataSrvTime="2021-07-08T08:58:14.3942671" SP="950.067" PFH="950.067" T="291.810" Hum="84.035" Alt="590.081" Status="0" />
  <Row DataSrvTime="2021-07-08T08:58:36.8831974" SP="950.935" PFH="949.456" T="291.569" Hum="89.576" Alt="582.223" Status="0" />
  <Row DataSrvTime="2021-07-08T08:58:36.8835716" SP="950.256" PFH="949.207" T="291.072" Hum="91.548" Alt="588.357" Status="0" />
</PtuResults>

# - or -
#
# AMS.xml:

<?xml version='1.0' encoding='utf8'?>
<AdditionalSensorData>
  <Row MeasurementOffset="-0.403" DataSrvTime="2021-07-08T08:43:29.0616419" GpsTO="12" XData=" C7 10 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2FE 320 0 0 230 165 111 224 1201 13BF 13CB B 0 58 1 278 B695" />
  <Row MeasurementOffset="-0.253" DataSrvTime="2021-07-08T08:43:30.0866790" GpsTO="35" XData=" C2 16 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 320 0 0 0 23A 133 111 223 118C 1236 1237 9 0 6B 1 278 9B2" />
  <Row MeasurementOffset="-0.103" DataSrvTime="2021-07-08T08:43:31.1107931" GpsTO="58" XData=" CB E 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2FE 341 0 0 230 179 111 222 11E0 1396 13A2 10 0 7C 2 278 3ECA" />
</AdditionalSensorData>

First, I am reading in the XML-files:

import pandas as pd
import numpy as np
import os
import xml.etree.ElementTree as ET 

# Mypath with datafiles:
path = 'c:/mypath/'

# Names of specific files:
PTU = 'PTU.xml'
GPS = 'GPS.xml'
AMS = 'AMS.xml'

data_PTU = ET.parse(path + PTU).getroot()
data_GPS = ET.parse(path + GPS).getroot()
data_AMS = ET.parse(path + AMS).getroot()

I defined a principal routine for the transformation into a Pandas dataframe with the attribute names as column names:

def parse_XML(xml_file, df_cols): 
    xroot = ET.parse(xml_file).getroot()
    rows = []
    
    for attribute in xroot: 
        res = []
        res.append(attribute.attrib.get(df_cols[0]))
        for element in df_cols[:]: 

            if attribute is not None and attribute.find(element) is not None:
                res.append(attribute.find(element))
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

And finally I am using for example

parse_XML(path + AMS,["DataSrvTime", "XData"]) 
# -or-
parse_XML(path + PTU, ["DataSrvTime", "SP", "T", "Hum", "Alt"])

to make a dataframe. The result shows a dataframe with the first column as index column, the second column as "DataSrvTime" in this case and any following column filled with "None"s. If I change the order to

parse_XML(path + AMS,["XData", "DataSrvTime"]) 

for example XData is shown but DataSrvTime column contains "None"s. Does anyone see where my mistake is?

enter image description here enter image description here enter image description here

If the reading into a dataframe is perfect the further plan is to merge all the different dataframes (AMS, PTU, GPS for example) with (Pandas) resample('S').mean() and (Pandas) dataframe.interpolate() to a unifying timestamp into one dataframe. Would you recommend another solution?

Thanks for reading and helping out.

1 Answer 1

2

Given your attribute-centric XML is relatively flat, consider the new IO method, pandas.read_xml. Then, merge the two data frames or run an iterative join using concat on a list of data frames which requires the datetime to be set as index.

pd.merge

final_df = pd.merge(
    pd.read_xml('PTU.xml'), pd.read_xml('AMS.xml'),
    on = "DataSrvTime", how = "outer"
)

pd.concat

df_list = [
    pd.read_xml(f).set_index("DataSrvTime")
    for f in 
    ['PTU.xml', 'GPS.xml', 'AMS.xml']
]

final_df = pd.concat(df_list, axis=0)
Sign up to request clarification or add additional context in comments.

5 Comments

parfait: (can't add and "@" here.. it's deleted any time...) thanks for the hint to pandas.read_xml. this works very well, and seems to be a fast and good solution :-). The merge part only works for two files at once with the parameter how = "outer". Is there another way to merge more than two files instead of merging two first and merge each further xml step by step? With the last step I get an error. Might there be a misspelling in the loop?
parfait: is set_index not associated to dataframes only?
it seems to work if I add "=": df_list = [.....]
Yes, please add assignment, =. And pd.concat is the way for more than two data frame outer join.
perfectly, this works. Thanks for the new insights :-)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.