I'm having issues with parsing through my XML file to convert into a pandas dataframe. An example entry is below:

<p>


 <persName id="t17200427-2-defend31" type="defendantName">
 Alice 
 Jones 
 <interp inst="t17200427-2-defend31" type="surname" value="Jones"/>
 <interp inst="t17200427-2-defend31" type="given" value="Alice"/>
 <interp inst="t17200427-2-defend31" type="gender" value="female"/>
 </persName> 

 , of <placeName id="t17200427-2-defloc7">St. Michael's Cornhill</placeName> 
 <interp inst="t17200427-2-defloc7" type="placeName" value="St. Michael's Cornhill"/>
 <interp inst="t17200427-2-defloc7" type="type" value="defendantHome"/>
 <join result="persNamePlace" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-defloc7"/>, was indicted for <rs id="t17200427-2-off8" type="offenceDescription">
 <interp inst="t17200427-2-off8" type="offenceCategory" value="theft"/>
 <interp inst="t17200427-2-off8" type="offenceSubcategory" value="shoplifting"/>
 privately stealing a Bermundas Hat, value 10 s. out of the Shop of 

 <persName id="t17200427-2-victim33" type="victimName">
 Edward 
 Hillior 
 <interp inst="t17200427-2-victim33" type="surname" value="Hillior"/>
 <interp inst="t17200427-2-victim33" type="given" value="Edward"/>
 <interp inst="t17200427-2-victim33" type="gender" value="male"/>
 <join result="offenceVictim" targOrder="Y" targets="t17200427-2-off8 t17200427-2-victim33"/>
 </persName> 



 </rs> , on the <rs id="t17200427-2-cd9" type="crimeDate">21st of April</rs> 
 <join result="offenceCrimeDate" targOrder="Y" targets="t17200427-2-off8 t17200427-2-cd9"/> last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her <rs id="t17200427-2-verdict10" type="verdictDescription">
 <interp inst="t17200427-2-verdict10" type="verdictCategory" value="guilty"/>
 <interp inst="t17200427-2-verdict10" type="verdictSubcategory" value="theftunder1s"/>
 Guilty to the value of 10 d.
 </rs> 
 <rs id="t17200427-2-punish11" type="punishmentDescription">
 <interp inst="t17200427-2-punish11" type="punishmentCategory" value="transport"/>
 <join result="defendantPunishment" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-punish11"/>
 Transportation
 </rs> .</p>

I want a dataframe that has the columns gender, offense, and text of the trial. I have previously extracted all of the data into a data frame, but cannot get the text between the <p> tags.

Here is my example code:

def table_of_cases(xml_file_name):
    file = ET.ElementTree(file = xml_file_name)
    iterate = file.getiterator()
    i = 1
    table = pd.DataFrame()
    for element in iterate:
        if element.tag == "persName":
            t = element.attrib['type']
            try:
                val = [element.attrib['value']]
                if t not in labels:
                    table[t] = val
                elif t+num not in labels:
                    table[t+num] = val
                elif t+num in labels:
                    num = str(i+1)
                    table[t+num] = val
            except Exception:
                pass
            labels = list(table.columns.values)
            num = str(i)

    return table

Note: I have 1,000+ files in this same XML format to combine into one dataframe.

1 Answer

Because your XML is fairly complex, with text values spilling across nested nodes, consider XSLT, the special-purpose language designed to transform XML files, for example to flatten complex documents into simpler ones.

Python's third-party module, lxml, can run XSLT 1.0 scripts and XPath 1.0 expressions, so you can parse the transformed result and migrate it to a pandas dataframe. Alternatively, Python can call an external XSLT processor with subprocess.

Specifically, the XSLT below extracts the necessary attributes for both defendant and victim, plus the entire paragraph's text value, using XPath's descendant:: axis from each <p>, assuming <p> is a child of the root element.

XSLT (save as a .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>
  
  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>
      
      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>
      
      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>       
 
</xsl:stylesheet>
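
The two XPath ideas the stylesheet above relies on can be checked directly in lxml before running a full transformation. This is only a minimal sketch: the `snippet` string is a shortened, made-up stand-in for one <p> entry, not the real data. It shows that normalize-space(.) gathers the text of every descendant node into one cleaned string, and that the descendant:: axis reaches the nested interp attributes.

```python
import lxml.etree as et

# Hypothetical, shortened stand-in for one <p> entry (not the real file).
snippet = """<p><persName type="defendantName"> Alice
 Jones <interp type="gender" value="female"/></persName>
 , was indicted for <rs type="offenceDescription">theft</rs> .</p>"""

p = et.fromstring(snippet)

# The string value of <p> is the concatenation of all descendant text nodes;
# normalize-space() trims it and collapses runs of whitespace.
trial_text = p.xpath("normalize-space(.)")

# descendant:: finds the persName wherever it sits below <p>.
gender = p.xpath(
    "string(descendant::persName[@type='defendantName']"
    "/interp[@type='gender']/@value)"
)

print(trial_text)
print(gender)
```

Because the expressions are evaluated relative to the current <p>, the same logic applies unchanged when a document holds many <p> elements.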

Python

import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse("Source.xml")
xsl = et.parse("XSLT_Script.xsl")

# RUN TRANSFORMATION
transformer = et.XSLT(xsl)
result = transformer(doc)

# OUTPUT TO CONSOLE
print(result)

data = []
for i in result.xpath('/*'):
    inner = {}
    for j in i.xpath('*'):
        inner[j.tag] = j.text
        
    data.append(inner)
    
trial_df = pd.DataFrame(data)

print(trial_df)

For the 1,000+ similar XML files, loop through this process, append each file's trial_df dataframe to a list, and stack the list with pd.concat.
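
The loop described above might look like the following sketch. The function name `trials_to_frame` and the extra `sourceFile` column are illustrative assumptions, not part of the original script. The stylesheet is compiled once and reused for every file, and each <data> element in a result becomes one row, so files with several <p> elements contribute several rows.

```python
import os

import lxml.etree as et
import pandas as pd


def trials_to_frame(xml_paths, xsl_path):
    """Apply one stylesheet to many XML files and stack the resulting rows.

    Hypothetical helper: the name and the 'sourceFile' column are assumptions.
    """
    transformer = et.XSLT(et.parse(xsl_path))  # compile once, reuse per file
    frames = []
    for path in xml_paths:
        result = transformer(et.parse(path))
        # One dict per top-level <data> element: child tag -> child text.
        rows = [{child.tag: child.text for child in node}
                for node in result.xpath('/*')]
        if rows:  # skip files that produced no <data> elements
            frame = pd.DataFrame(rows)
            frame["sourceFile"] = os.path.basename(path)  # track provenance
            frames.append(frame)
    # Stack the per-file frames; return an empty frame if nothing matched.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

A call could then be `trials_to_frame(sorted(glob.glob("trials/*.xml")), "XSLT_Script.xsl")` after `import glob`, where the `trials` folder name is only a placeholder.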

XML Output

<?xml version="1.0"?>
<data>
  <defendantName>Alice Jones</defendantName>
  <defendantGender>female</defendantGender>
  <offenceCategory>theft</offenceCategory>
  <offenceSubCategory>shoplifting</offenceSubCategory>
  <victimName>Edward Hillior</victimName>
  <victimGender>male</victimGender>
  <verdictCategory>guilty</verdictCategory>
  <verdictSubCategory>theftunder1s</verdictSubCategory>
  <punishmentCategory>transport</punishmentCategory>
  <trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>

Dataframe Output

#   defendantGender defendantName offenceCategory offenceSubCategory  \
# 0          female   Alice Jones           theft        shoplifting   

#   punishmentCategory                                          trialText  \
# 0          transport  Alice Jones , of St. Michael's Cornhill, was i...   

#   verdictCategory verdictSubCategory victimGender      victimName  
# 0          guilty       theftunder1s         male  Edward Hillior  
4 Comments

Thank you! How do I use the stylesheet to convert the XML with the XSL if I'm using Jupyter Notebook, where the only language available is Python and not HTML?
Did you run the code? I explain how using lxml. Jupyter Notebook is just an interface, like PyCharm or Spyder, and should run the Python code above. Make sure the file paths are correct.
I was able to run the code. Since I have multiple <p> tags in one file, I was wondering how to go about this. Thank you for your time!
It is very important to always include the root element in XML questions. See the updated XSLT and Python script, which assume <p> is a child of the root and can occur many times in the document.
