Undefined entity error while using ElementTree

Question

I have a set of XML files that I need to read and format into a single CSV file. In order to read from the XML files, I have used the solution mentioned here.

My code looks like this:

from os import listdir
import xml.etree.cElementTree as et

files = listdir(".../blogs/")

for i in range(len(files)):
    # fname = ".../blogs/" + files[i]
    f = open(".../blogs/" + files[i], 'r')
    contents = f.read()
    tree=et.fromstring(contents)
    for el in tree.findall('post'):
        post = el.text

    f.close()

This raises cElementTree.ParseError: undefined entity at the line tree=et.fromstring(contents). Oddly enough, when I run the same commands one by one in the interactive Python interpreter (without the for loop), they run without error.

In case you want to know the XML structure, it is like this:

<Blog>
<date> some date </date>
<post> some blog post </post>
</Blog>

So what is causing this error, and how come it doesn't run from the Python file, but runs from the command line?

Update: After reading this link I checked files[0] and found that the '&' symbol occurs a few times. I think that might be causing the problem. When I ran the same commands on the command line, I had picked a random file to read.

First off, should ".../blogs/" be "../blogs" or "../../blogs/"?
– skeevey, Commented Mar 4, 2013 at 20:03
Well, it is certainly reading the file correctly. I don't think that is a problem.
– Antimony, Commented Mar 4, 2013 at 20:12

Community · Accepted Answer · 2017-05-23 12:17:58Z

2

As I mentioned in the update, some of the files contained symbols (such as '&') that I suspected were causing the problem. The error didn't come up when I ran the same lines on the command line because the file I happened to pick there didn't contain any such characters.
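The difference between a file that parses and one that does not can be reproduced with a few short strings. This is only a sketch: the sample strings and the helper name parse_error are made up for illustration, since the actual blog files are not shown, and it uses xml.etree.ElementTree because cElementTree was removed in Python 3.9. A named entity that XML does not predefine (such as &nbsp;) should produce the "undefined entity" message, while a bare '&' is rejected with a different well-formedness error.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return the ParseError message for xml_text, or None if it parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical samples; the real blog files are not shown in the question.
clean = "<Blog><post>plain text</post></Blog>"
named = "<Blog><post>fish &nbsp; chips</post></Blog>"
bare = "<Blog><post>fish & chips</post></Blog>"

for label, sample in [("clean", clean), ("named entity", named), ("bare ampersand", bare)]:
    print(label, "->", parse_error(sample))
```

Running the same check over every file in the directory would show which ones trigger the error, instead of relying on a randomly chosen file.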

Since I mainly required the content between the <post> and </post> tags, I created my own parser (as was suggested in the link mentioned in the update).

from os import listdir

files = listdir(".../blogs/")
posts = []

for name in files:
    with open(".../blogs/" + name, 'r') as f:
        contents = f.read()

    # Scan for each <post>...</post> pair without parsing the XML.
    seek1 = contents.find('<post>')
    while seek1 != -1:
        start = seek1 + len('<post>')  # skip past the opening tag
        seek2 = contents.find('</post>', start)
        if seek2 == -1:
            break  # unterminated <post>; stop scanning this file
        posts.append(contents[start:seek2])
        seek1 = contents.find('<post>', seek2)
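An alternative to hand-rolled scanning, not the approach taken in this answer, is to escape the offending ampersands and still let ElementTree do the parsing. The sketch below assumes the only problem in the files is '&' characters that do not begin one of XML's five predefined entities or a numeric character reference; the helper name escape_stray_ampersands and the sample string are hypothetical. Note that a named entity such as &nbsp; is kept as the literal text "&nbsp;" rather than being decoded.

```python
import re
import xml.etree.ElementTree as ET

# An '&' that does NOT start a predefined XML entity or a numeric reference.
_STRAY_AMP = re.compile(r'&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)')


def escape_stray_ampersands(text):
    """Replace every stray '&' with '&amp;' so the text becomes parseable."""
    return _STRAY_AMP.sub('&amp;', text)


# Hypothetical sample in the structure shown in the question.
raw = "<Blog><date> some date </date><post>Tom & Jerry &nbsp; &lt;3</post></Blog>"

tree = ET.fromstring(escape_stray_ampersands(raw))
post_texts = [el.text for el in tree.findall('post')]
print(post_texts)
```

This keeps the findall('post') loop from the question intact, at the cost of one regular-expression pass over each file before parsing.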

answered Mar 4, 2013 at 20:18 by Antimony
edited May 23, 2017 at 12:17 by Community Bot
