Parse XML Schema Definition to CSV with Python

Question

I would like to parse the elements of an XML schema definition into a CSV file for documentation and analysis. My XSD takes the following form:

<xs:element name="ELEMENT">
<xs:complexType>
    <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
    </xs:sequence>
</xs:complexType>
</xs:element>

For a given element name, I would like to create a CSV containing element 1, element 2, element 3, etc.

I've tried the Python lxml library but have not been able to access / filter by individual elements yet.

import xml.etree.ElementTree as ET
tree = ET.parse('doc.xsd')
root = tree.getroot()
for child in root:
  print(child.tag, child.attrib)

Do you want those elements as columns or as rows? By the way, the XML above is incomplete and is not valid XML. Try updating it to a minimal working XSD file.
– Jan Vlcinsky, Commented Jun 24, 2014 at 15:58
I would recommend using lxml. You have to install it, which takes a moment, but then you have a very powerful package with great XPath support, schema validation, etc. To follow up, go through the tutorial lxml offers; it will answer all your questions.
– Jan Vlcinsky, Commented Jun 24, 2014 at 16:02
Jan, thanks for the quick reply. I have the full, valid XSD here locally. This is just a snippet. I tried lxml but am getting stuck. Using lxml, how do you find a specific element? Once you find it, how do you access the sub-elements? BTW, a list of element1,element2,element3 is plenty sufficient.
– user265603, Commented Jun 24, 2014 at 16:30

Jan Vlcinsky · Accepted Answer · 2014-06-24 16:44:51Z

3

The following code shows how to search an XSD for element names.

from lxml import etree
xsdstr = """
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ELEMENT">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

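# On Python 3, lxml rejects str input that carries an encoding declaration, so pass bytes.
doc = etree.fromstring(xsdstr.strip().encode("utf-8"))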

namespaces = {"xs": "http://www.w3.org/2001/XMLSchema"}

names = doc.xpath("//xs:element/@ref", namespaces=namespaces)
print(names)

Running it prints:

['element 1', 'element 2', 'element 3']
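Since the question asks for CSV rather than a printed list, here is a minimal sketch of turning such a list into one CSV line with the standard-library csv module. The helper name names_to_csv_row is made up for this illustration, and the list is hard-coded so the snippet runs on its own; in practice it would be the result of the xpath call.

```python
import csv
import io


def names_to_csv_row(names):
    """Return the given names as a single CSV-formatted line (hypothetical helper)."""
    buffer = io.StringIO()
    # csv.writer takes care of quoting values that contain commas or quotes.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)
    return buffer.getvalue()


# Stand-in for the list returned by the xpath query.
names = ["element 1", "element 2", "element 3"]
print(names_to_csv_row(names), end="")
# element 1,element 2,element 3
```

To write a file instead, the same csv.writer can be pointed at a file opened with open(path, "w", newline="").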

If you have a more complex schema, you might need to target the names more precisely. Here is a possible example:

print("trying more precise targeting ------")
names = doc.xpath("//xs:element[@name='ELEMENT']//xs:sequence/xs:element/@ref", namespaces=namespaces)
print(names)

In our case, the result is the same.
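If installing lxml is not an option, the same targeted lookup can probably be approximated with the standard-library xml.etree.ElementTree, which supports a limited XPath subset. This is only a sketch: the function name child_refs and the second element named OTHER are assumptions added to show the filtering, and ElementTree cannot select attributes directly, so the ref values are read from the matched elements.

```python
import xml.etree.ElementTree as ET

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Same shape as the snippet in the question, plus a hypothetical second element.
XSD_TEXT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ELEMENT">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="OTHER">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="element 9"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def child_refs(xsd_text, element_name):
    """Return the ref values listed in the sequence of the named element."""
    root = ET.fromstring(xsd_text)
    # Assumes element_name contains no single quote, since it is pasted into the path.
    path = ".//xs:element[@name='{}']//xs:sequence/xs:element".format(element_name)
    return [el.get("ref") for el in root.findall(path, XS) if el.get("ref") is not None]


print(child_refs(XSD_TEXT, "ELEMENT"))
# ['element 1', 'element 2', 'element 3']
```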

user265603 Over a year ago

Thanks a bunch. This definitely started me down the right path. I would upvote, but I don't have the rep yet. Thanks again.

score 0 · Accepted Answer · 2020-03-24 08:58:39Z

Here is an XSD-to-CSV parser. The code below can also parse multi-node XML.

import pandas as pd
from bs4 import BeautifulSoup


def xsd_to_dict(xsd_path):
    super_dict = {}
    # Open the file in a with-block so the handle is closed after parsing.
    with open(xsd_path) as xsd_file:
        soup = BeautifulSoup(xsd_file, "html.parser")
    for complex_type in soup.find_all('xs:complextype'):
        xsd_parsed = [x for x in ",".join(str(complex_type).split("\n"))
            .replace("</xs:sequence>", "")
            .replace("'<xs:sequence>", "")
            .replace("<xs:", "")
            .replace("</xs:complextype>", "")
            .replace("</xs:element>", "")
            .replace(">", "").replace("sequence", "")
            .split(",") if x != ""]

        if len(xsd_parsed[0]) > len("complextype") + 1:
            matrix_list = [e.split(" ") for e in xsd_parsed[-len(xsd_parsed) + 1:]]

            level_1 = ["|".join(["".join([":".join(final.split("=")) for final in y if len(final.split("=")) == 2])
                                 for y in [x.split(",") for x in item]]) for item in matrix_list]
            level_1.insert(0, xsd_parsed[0])
            for x in level_1[-len(xsd_parsed) + 1:]:
                flattened_dict = {x.split(":")[0]:"-".join(x.split(":")[-len(x.split(":")) + 1:])
                       for x in (level_1[0] + x).replace("=", ":").split("|")}
                xPath = flattened_dict.get("complextype name")
                xmlName = flattened_dict.get("name")
                dataType = flattened_dict.get("type")

                if xmlName is not None:
                    final_dict = {x.split(":")[0]:x.split(":")[1]
                                for x in str("xpath:"+str(xPath)+",xmlFieldName:"+str(xmlName)+",dataPath:"+str(dataType)).split(",")}
                    for k, v in final_dict.items():
                        super_dict.setdefault(k, []).append(v)

    return super_dict



def xsd_to_csv(xsd_path):
    pd.DataFrame(xsd_to_dict(xsd_path)).to_csv(xsd_path.replace(".xsd", ".csv"))
    return "done"


xsd_to_csv("CustomersOrders.xsd")

Input: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/sample-xsd-file-customers-and-orders1

Output:

,xpath,xmlFieldName,dataPath
0,"""CustomerType""","""CompanyName""","""xs-string"""
1,"""CustomerType""","""ContactName""","""xs-string"""
2,"""CustomerType""","""ContactTitle""","""xs-string"""
3,"""CustomerType""","""Phone""","""xs-string"""
4,"""CustomerType""","""Fax""","""xs-string"""
5,"""CustomerType""","""FullAddress""","""AddressType"""
6,"""CustomerType""","""CustomerID""","""xs-token""</xs-attribute"
7,"""AddressType""","""Address""","""xs-string"""
8,"""AddressType""","""City""","""xs-string"""
9,"""AddressType""","""Region""","""xs-string"""
10,"""AddressType""","""PostalCode""","""xs-string"""
11,"""AddressType""","""Country""","""xs-string"""
12,"""AddressType""","""CustomerID""","""xs-token""</xs-attribute"
13,"""OrderType""","""CustomerID""","""xs-token"""
14,"""OrderType""","""EmployeeID""","""xs-token"""
15,"""OrderType""","""OrderDate""","""xs-dateTime"""
16,"""OrderType""","""RequiredDate""","""xs-dateTime"""
17,"""OrderType""","""ShipInfo""","""ShipInfoType"""
18,"""ShipInfoType""","""ShipVia""","""xs-integer"""
19,"""ShipInfoType""","""Freight""","""xs-decimal"""
20,"""ShipInfoType""","""ShipName""","""xs-string"""
21,"""ShipInfoType""","""ShipAddress""","""xs-string"""
22,"""ShipInfoType""","""ShipCity""","""xs-string"""
23,"""ShipInfoType""","""ShipRegion""","""xs-string"""
24,"""ShipInfoType""","""ShipPostalCode""","""xs-string"""
25,"""ShipInfoType""","""ShipCountry""","""xs-string"""
26,"""ShipInfoType""","""ShippedDate""","""xs-dateTime""
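The stray quotes and the "</xs-attribute" fragments in the output above come from processing the markup as text. As a hedged alternative, the same three columns could be collected by walking the parsed tree with the standard library instead. This is a sketch and not the code from the answer: SAMPLE_XSD is a small made-up schema rather than the Microsoft sample, xsd_rows and rows_to_csv are hypothetical names, and the values come out as plain xs:string rather than the quoted xs-string form shown above.

```python
import csv
import io
import xml.etree.ElementTree as ET

XS_NS = "{http://www.w3.org/2001/XMLSchema}"

# Small made-up schema used only for this illustration.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="City" type="xs:string"/>
      <xs:element name="Country" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="CustomerID" type="xs:token"/>
  </xs:complexType>
</xs:schema>"""


def xsd_rows(xsd_text):
    """Collect (complex type name, field name, field type) tuples in document order."""
    root = ET.fromstring(xsd_text)
    wanted_tags = (XS_NS + "element", XS_NS + "attribute")
    rows = []
    for complex_type in root.iter(XS_NS + "complexType"):
        type_name = complex_type.get("name")
        if type_name is None:
            continue  # anonymous types are picked up through their named ancestor, if any
        for node in complex_type.iter():
            if node.tag in wanted_tags and node.get("name"):
                rows.append((type_name, node.get("name"), node.get("type", "")))
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text using the column names from the answer above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["xpath", "xmlFieldName", "dataPath"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(xsd_rows(SAMPLE_XSD)), end="")
# xpath,xmlFieldName,dataPath
# AddressType,City,xs:string
# AddressType,Country,xs:string
# AddressType,CustomerID,xs:token
```

Note that elements nested inside an anonymous complex type are attributed to the enclosing named type here, which may or may not be what you want for documentation purposes.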
