0

I tried to encode the XML file so that it could read the invalid content without problem, however it did not work.

This is my code:

import xml.etree.ElementTree as ET
import io

file_path = r'c:\data\MSM\Energy\XML-files\my_xml.xml' 

with io.open(file_path, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
    tree = ET.fromstring(contents)

This is what I receive:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 62, column 48

This is how XML file line 62 looks like:

62    <Organisation>Blue & Logistics B.V.</Organisation>

I'm sure it has to do with the & sign, so how can I encode that?

0

2 Answers 2

1

First, it has nothing to do with encoding. It's simply that your file doesn't contain well-formed XML. Find out how, where, and when it was created, and fix the process that created it. An & in content needs to be escaped, typically as &amp;.

Don't try repairing bad XML except in desperation - you're very likely to make things worse, especially if you have to handle multiple input documents from the same unreliable source.

Sign up to request clarification or add additional context in comments.

8 Comments

I'm not responsible for creating the XML files, they are just being delivered to me....
They aren't XML files, they are junk. Return them to sender as faulty goods.
Well, for example, if you try to replace all ampersands by &amp; then your code will fail if it encounters an ampersand that is being used correctly to introduce an entity or character reference. Distinguishing ampersands that are being used correctly from those that are being used incorrectly requires some heuristics that will probably work 99% of the time, but the other 1% of the time you will be left with corrupt data. At the same time, if ampersands are not being escaped correctly, then there's a very good chance that < is not being escaped correctly either, and that's even harder.
In short, standards are a good thing: repairing broken XML is like filing down the pins on a power plug to make it fit in your power socket. It's a lot of work and you end up with a fire risk. And in any case, the fact that your supplier got such simple things wrong means you really can't trust them with anything they do.
Ampersand should be escaped as &amp; or &#38;.
|
0

Load the xml as text replace the & and use xml parser

import xml.etree.ElementTree as ET

with open('x.xml') as f:
  xml = f.read()
  xml = xml.replace("&", "&#38;")
  root = ET.fromstring(xml)
  print(root)

x.xml

<r>
  <Organisation>Blue & Logistics B.V.</Organisation>
</r>

output

<Element 'r' at 0x7f431e86bc70>

3 Comments

There's predefined entity &amp; which you can use instead of &#38;. Anyway, your method will make problem even worse if there are any entity entity reference in XML.
Why will escaping those characters will make the problem worse?
@Al-Andalus, if some content is already escaped this code will break everything replacing "&", pretty much it have been already said by Michael Kay in his answer, this and this comments. Better to solve problem with solution which generates this junk which looks like XML.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.