I would like to parse the elements of an XML schema definition (XSD) into a CSV file for documentation and analysis. My XSD takes the following form:

<xs:element name="ELEMENT">
<xs:complexType>
    <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
    </xs:sequence>
</xs:complexType>
</xs:element>

For a given element name, I would like to create a CSV containing element 1, element 2, element 3, etc.

I've tried the Python lxml library but have not been able to access / filter by individual elements yet.

import xml.etree.ElementTree as ET

tree = ET.parse('doc.xsd')
root = tree.getroot()
# Print the tag and attributes of each top-level child of the schema root
for child in root:
    print(child.tag, child.attrib)
  • Do you want those elements as columns or as rows? By the way, the XML above is incomplete and is not valid XML. Try updating it to a minimal working XSD file. Commented Jun 24, 2014 at 15:58
  • I would recommend using lxml. You have to install it, which takes a moment, but then you have a very powerful package with great XPath support, schema validation, etc. To follow up, go through the tutorial that lxml offers; it will answer all your questions. Commented Jun 24, 2014 at 16:02
  • Jan, thanks for the quick reply. I have the full, valid XSD here locally; this is just a snippet. I tried lxml but am getting stuck. Using lxml, how do you find a specific element? Once you find it, how do you access the sub-elements? By the way, a list of element1,element2,element3 is sufficient. Commented Jun 24, 2014 at 16:30
  • The tutorial explains a few methods, one of them being XPath. Commented Jun 24, 2014 at 16:34

2 Answers

3

The following code shows how to search an XSD for element names.

from lxml import etree

# Bytes literal: under Python 3, lxml rejects str input that carries an encoding declaration
xsdstr = b"""
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ELEMENT">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# strip() drops the leading newline so the XML declaration comes first
doc = etree.fromstring(xsdstr.strip())

namespaces = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Collect the ref attribute of every xs:element in the schema
names = doc.xpath("//xs:element/@ref", namespaces=namespaces)
print(names)

Running it prints:

['element 1', 'element 2', 'element 3']

If you have a more complex schema, you might need to target the names more precisely. Here is a possible example:

print("trying more precise targeting ------")
# Only refs inside the sequence of the element named ELEMENT
names = doc.xpath("//xs:element[@name='ELEMENT']//xs:sequence/xs:element/@ref", namespaces=namespaces)
print(names)

In our case, the result is the same.
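
The question also asks for the names to end up in a CSV file. A minimal sketch of that last step follows. It swaps lxml for the standard library's ElementTree and csv modules to stay dependency-free, and the helper names child_refs and refs_to_csv are made up for this illustration. It writes to an in-memory buffer; an opened file object could be passed instead.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ELEMENT">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def child_refs(root, element_name):
    """Return the ref values listed under the named top-level element.

    Hypothetical helper: assumes the element sits directly under the
    schema root and uses complexType/sequence, as in the question.
    """
    path = ("xs:element[@name='%s']/xs:complexType/xs:sequence/xs:element"
            % element_name)
    return [el.get("ref") for el in root.findall(path, NS)]


def refs_to_csv(refs, out):
    """Write the refs as a single CSV row to a file-like object."""
    csv.writer(out).writerow(refs)


root = ET.fromstring(XSD)
refs = child_refs(root, "ELEMENT")

buffer = io.StringIO()
refs_to_csv(refs, buffer)
print(buffer.getvalue())
```

To write a real file, open it with newline="" (as the csv module documentation recommends) and pass the file object as out.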


1 Comment

Thanks a bunch. This definitely started me down the right path. I would upvote but I don't have the rep yet. Thanks again.
0

Here is an XSD-to-CSV parser. With the code below, multi-node XMLs can be parsed too.

import pandas as pd
from bs4 import BeautifulSoup


def xsd_to_dict(xsd_path):
    super_dict = {}
    # Close the file handle once the schema has been parsed
    with open(xsd_path) as xsd_file:
        soup = BeautifulSoup(xsd_file, "html.parser")
    for complex_type in soup.find_all('xs:complextype'):
        xsd_parsed = [x for x in ",".join(str(complex_type).split("\n"))
            .replace("</xs:sequence>", "")
            .replace("'<xs:sequence>", "")
            .replace("<xs:", "")
            .replace("</xs:complextype>", "")
            .replace("</xs:element>", "")
            .replace(">", "").replace("sequence", "")
            .split(",") if x != ""]

        if len(xsd_parsed[0]) > len("complextype") + 1:
            matrix_list = [e.split(" ") for e in xsd_parsed[-len(xsd_parsed) + 1:]]

            level_1 = ["|".join(["".join([":".join(final.split("=")) for final in y if len(final.split("=")) == 2])
                                 for y in [x.split(",") for x in item]]) for item in matrix_list]
            level_1.insert(0, xsd_parsed[0])
            for x in level_1[-len(xsd_parsed) + 1:]:
                flattened_dict = {x.split(":")[0]:"-".join(x.split(":")[-len(x.split(":")) + 1:])
                       for x in (level_1[0] + x).replace("=", ":").split("|")}
                xPath = flattened_dict.get("complextype name")
                xmlName = flattened_dict.get("name")
                dataType = flattened_dict.get("type")

                if xmlName is not None:
                    final_dict = {x.split(":")[0]:x.split(":")[1]
                                for x in str("xpath:"+str(xPath)+",xmlFieldName:"+str(xmlName)+",dataPath:"+str(dataType)).split(",")}
                    for k, v in final_dict.items():
                        super_dict.setdefault(k, []).append(v)

    return super_dict



def xsd_to_csv(xsd_path):
    pd.DataFrame(xsd_to_dict(xsd_path)).to_csv(xsd_path.replace(".xsd", ".csv"))
    return "done"


xsd_to_csv("CustomersOrders.xsd")

Input: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/sample-xsd-file-customers-and-orders1

Output:

,xpath,xmlFieldName,dataPath
0,"""CustomerType""","""CompanyName""","""xs-string"""
1,"""CustomerType""","""ContactName""","""xs-string"""
2,"""CustomerType""","""ContactTitle""","""xs-string"""
3,"""CustomerType""","""Phone""","""xs-string"""
4,"""CustomerType""","""Fax""","""xs-string"""
5,"""CustomerType""","""FullAddress""","""AddressType"""
6,"""CustomerType""","""CustomerID""","""xs-token""</xs-attribute"
7,"""AddressType""","""Address""","""xs-string"""
8,"""AddressType""","""City""","""xs-string"""
9,"""AddressType""","""Region""","""xs-string"""
10,"""AddressType""","""PostalCode""","""xs-string"""
11,"""AddressType""","""Country""","""xs-string"""
12,"""AddressType""","""CustomerID""","""xs-token""</xs-attribute"
13,"""OrderType""","""CustomerID""","""xs-token"""
14,"""OrderType""","""EmployeeID""","""xs-token"""
15,"""OrderType""","""OrderDate""","""xs-dateTime"""
16,"""OrderType""","""RequiredDate""","""xs-dateTime"""
17,"""OrderType""","""ShipInfo""","""ShipInfoType"""
18,"""ShipInfoType""","""ShipVia""","""xs-integer"""
19,"""ShipInfoType""","""Freight""","""xs-decimal"""
20,"""ShipInfoType""","""ShipName""","""xs-string"""
21,"""ShipInfoType""","""ShipAddress""","""xs-string"""
22,"""ShipInfoType""","""ShipCity""","""xs-string"""
23,"""ShipInfoType""","""ShipRegion""","""xs-string"""
24,"""ShipInfoType""","""ShipPostalCode""","""xs-string"""
25,"""ShipInfoType""","""ShipCountry""","""xs-string"""
26,"""ShipInfoType""","""ShippedDate""","""xs-dateTime""
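
The extra quotes and the `</xs-attribute` fragments in the output above appear to come from splitting the serialized markup as text. As a hedged alternative, the sketch below reads the name and type attributes from the parsed tree using only the standard library. The miniature schema is a made-up stand-in shaped like the complex types in the linked sample, and the function names are hypothetical. The column names are reused from the output above.

```python
import csv
import io
import xml.etree.ElementTree as ET

XS_URI = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS_URI}

# Made-up miniature schema, not the linked sample file itself.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="Address" type="xs:string"/>
      <xs:element name="City" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="CustomerID" type="xs:token"/>
  </xs:complexType>
</xs:schema>"""


def complex_type_rows(root):
    """Yield (type name, field name, data type) for named complex types."""
    for ctype in root.iter("{%s}complexType" % XS_URI):
        type_name = ctype.get("name")
        if type_name is None:
            continue  # anonymous types are skipped in this sketch
        # Child elements declared in the type's sequence
        for field in ctype.findall("xs:sequence/xs:element", NS):
            yield type_name, field.get("name"), field.get("type")
        # Attributes declared directly on the type
        for attr in ctype.findall("xs:attribute", NS):
            yield type_name, attr.get("name"), attr.get("type")


def write_rows(rows, out):
    """Write a header plus one CSV row per field to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["xpath", "xmlFieldName", "dataPath"])
    writer.writerows(rows)


rows = list(complex_type_rows(ET.fromstring(SAMPLE_XSD)))
out = io.StringIO()
write_rows(rows, out)
print(out.getvalue())
```

This sketch only looks at xs:sequence children and direct xs:attribute declarations, so schemas that use xs:choice, xs:all, or ref= instead of name= would need further handling.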
