ParseError: undefined entity while parsing XML file in Python

Question

I have a big XML file with several article nodes. I have included only the one that causes the problem. When I try to parse it in Python to filter some data, I get this error:

File "<string>", line unknown
ParseError: undefined entity &Ouml;: line 90, column 17

Sample of the XML file

<?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE dblp SYSTEM "dblp.dtd">
    <dblp>
        <article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
            <author>Alejandro P. Buchmann</author>
            <author>M. Tamer &Ouml;zsu</author>
            <author>Dimitrios Georgakopoulos</author>
            <title>Towards a Transaction Management System for DOM.</title>
            <journal>GTE Laboratories Incorporated</journal>
            <volume>TR-0146-06-91-165</volume>
            <month>June</month>
            <year>1991</year>
            <url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
        </article>
    </dblp>

From my search in Google, I found that this kind of error appears if you have issues in the node names. However, the line with the error is the second author, in the text.

This is my Python code

import xml.etree.ElementTree as etree  # assumed: the import is not shown in the question

with open('xaa.xml', 'r') as xml_file:
    xml_tree = etree.parse(xml_file)

As the error message tells you, &Ouml; is not a standard XML entity, so as it stands your XML isn't valid, hence the error. See xml.com/pub/a/98/08/xmlqna2.html. I'm not sure whether you can declare such entities to ElementTree outside the XML file.
– DisappointedByUnaccountableMod, Commented Mar 18, 2020 at 19:52

mzjn · Accepted Answer · 2020-03-18 20:44:37Z

The declaration of the Ouml entity is presumably in the DTD (dblp.dtd), but ElementTree does not support external DTDs. ElementTree only recognizes entities declared directly in the XML file (in the "internal subset"). This is a working example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY Ouml 'Ö'>
]>
<dblp>
  <article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
    <author>Alejandro P. Buchmann</author>
    <author>M. Tamer &Ouml;zsu</author>
    <author>Dimitrios Georgakopoulos</author>
    <title>Towards a Transaction Management System for DOM.</title>
    <journal>GTE Laboratories Incorporated</journal>
    <volume>TR-0146-06-91-165</volume>
    <month>June</month>
    <year>1991</year>
    <url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
  </article>
</dblp>
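
As a quick check that the internal-subset approach works, here is a minimal sketch that feeds a shortened version of the example above to ElementTree. The variable names are mine, and the entity value is written as the character reference &#214; (which is Ö) so the snippet does not depend on the encoding of the script file:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the working example; the entity is declared in the
# internal subset, which ElementTree (expat) does read.
XML_WITH_INTERNAL_SUBSET = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY Ouml "&#214;">
]>
<dblp>
  <article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
    <author>Alejandro P. Buchmann</author>
    <author>M. Tamer &Ouml;zsu</author>
  </article>
</dblp>
"""

# Encode to bytes so the data really is in the declared ISO-8859-1 encoding.
root = ET.fromstring(XML_WITH_INTERNAL_SUBSET.encode("iso-8859-1"))
authors = [author.text for author in root.iter("author")]
print(authors)  # ['Alejandro P. Buchmann', 'M. Tamer Özsu']
```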

To parse the XML file in the question without errors, you need a more powerful XML library that supports external DTDs. lxml is a good choice for that.
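
A minimal sketch of what that could look like with lxml, assuming lxml is installed (pip install lxml) and that dblp.dtd sits next to the XML file so the relative reference in the DOCTYPE can be resolved. The function name is hypothetical:

```python
from lxml import etree  # third-party package, not part of the standard library


def parse_with_external_dtd(xml_path):
    """Parse an XML file and let lxml read the DTD named in its DOCTYPE.

    load_dtd=True makes the parser load the external DTD, so entities
    declared there (such as Ouml) are known while parsing.
    no_network=True keeps the lookup restricted to local files.
    """
    parser = etree.XMLParser(load_dtd=True, no_network=True)
    return etree.parse(str(xml_path), parser)
```

For the file in the question, this would be called as parse_with_external_dtd('xaa.xml'), and the author names could then be read with something like tree.findtext('.//author').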

Guido U. Draheim · Answer · 2023-05-15 21:50:51Z

Ouml looks like a standard HTML5 named entity. It may help to convert such entities to their Unicode characters before running the XML parser. In Python 3.4+ you can use html.unescape for that.

import re
from html import escape, unescape
# text holds the raw XML source as a string; escape() keeps &amp;, &lt; and &gt; escaped
textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
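
Put together, the idea might look like the sketch below. The helper name and the shortened sample string are mine, not from the question. A real file would first have to be read into a string, for example with open('xaa.xml', encoding='iso-8859-1'):

```python
import re
import xml.etree.ElementTree as ET
from html import escape, unescape

NAMED_ENTITY = re.compile(r"&\w+;")


def replace_named_entities(text):
    """Replace HTML named entities with their Unicode characters.

    unescape() turns &Ouml; into the character itself; escape() then
    re-escapes &, < and > so that &amp;, &lt; and &gt; stay valid XML.
    """
    return NAMED_ENTITY.sub(lambda m: escape(unescape(m.group(0))), text)


# Shortened sample in the style of the question's file (no XML declaration,
# because the data is already a decoded Python string here).
sample = """<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
  <article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
    <author>M. Tamer &Ouml;zsu</author>
  </article>
</dblp>
"""

root = ET.fromstring(replace_named_entities(sample))
print(root.findtext(".//author"))
```

Note that this rewrites the text before the parser sees it, so it bypasses the DTD entirely. Any entity the DTD defines differently from HTML5 would get the HTML5 meaning instead.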
