I have a set of XML files that I need to read and combine into a single CSV file. To read the XML files, I used the solution mentioned here.

My code looks like this:

from os import listdir
import xml.etree.cElementTree as et

files = listdir(".../blogs/")

for i in range(len(files)):
    # fname = ".../blogs/" + files[i]
    f = open(".../blogs/" + files[i], 'r')
    contents = f.read()
    tree=et.fromstring(contents)
    for el in tree.findall('post'):
        post = el.text

    f.close()

This raises cElementTree.ParseError: undefined entity at the line tree=et.fromstring(contents). Oddly enough, when I run the same commands one by one in the interactive Python interpreter (without the for loop), they run without errors.

In case you want to know the XML structure, it is like this:

<Blog>
<date> some date </date>
<post> some blog post </post>
</Blog>

So what is causing this error, and how come it doesn't run from the Python file, but runs from the command line?

Update: After reading this link I checked files[0] and found that the '&' symbol occurs a few times. I think that might be causing the problem. When I ran the same commands on the command line, I had used a randomly chosen file.

  • First off, should ".../blogs/" be "../blogs" or "../../blogs/"? Commented Mar 4, 2013 at 20:03
  • Well it is certainly reading the file correctly. I don't think that is a problem. Commented Mar 4, 2013 at 20:12

1 Answer

As I mentioned in the update, some files contained symbols that I suspected were causing the problem. The error didn't come up when I ran the same lines on the command line because the file I happened to pick there didn't contain any such characters.
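To make the cause concrete, here is a minimal sketch that reproduces the failure on in-memory strings. The sample documents and the try_parse helper are hypothetical, modeled on the <Blog> structure from the question. As far as I can tell, the "undefined entity" message appears when the text contains a named reference such as &nbsp; that XML does not predefine, while a bare '&' is normally rejected as not well-formed; either way, only files containing such characters fail.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents modeled on the <Blog> structure in the question.
CLEAN = "<Blog><date> some date </date><post> some blog post </post></Blog>"
NAMED_ENTITY = "<Blog><date> some date </date><post> tea&nbsp;time </post></Blog>"
BARE_AMPERSAND = "<Blog><date> some date </date><post> salt & pepper </post></Blog>"


def try_parse(xml_text):
    """Return the text of every <post>, or the parser's error message."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return "error: " + str(exc)
    return [el.text for el in root.findall("post")]


print(try_parse(CLEAN))           # parses normally
print(try_parse(NAMED_ENTITY))    # fails on the &nbsp; reference
print(try_parse(BARE_AMPERSAND))  # fails on the unescaped '&'
```

Running try_parse over every file in the directory would be one way to see which files trip the parser, instead of relying on a randomly chosen one.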

Since I mainly required the content between the <post> and </post> tags, I created my own parser (as was suggested in the link mentioned in the update).

from os import listdir

files = listdir(".../blogs/")

for name in files:
    # Read the whole file; it is scanned as plain text, not parsed as XML.
    with open(".../blogs/" + name, 'r') as f:
        contents = f.read()
    seek1 = contents.find('<post>')
    while seek1 != -1:
        seek2 = contents.find('</post>', seek1)
        if seek2 == -1:
            break  # unterminated <post>; stop scanning this file
        # len('<post>') == 6, so the text starts at seek1 + 6 and ends at seek2.
        post = contents[seek1 + 6:seek2]
        seek1 = contents.find('<post>', seek2)
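
For comparison, the same plain-text extraction can be sketched with a regular expression instead of manual index arithmetic. This is only an illustration under the assumption that <post> tags carry no attributes and are never nested; extract_posts and the sample string are hypothetical names, not part of the original code.

```python
import re

# Non-greedy match of everything between <post> and </post>; DOTALL lets a
# post span several lines.
POST_PATTERN = re.compile(r"<post>(.*?)</post>", re.DOTALL)


def extract_posts(contents):
    """Return the raw text of each <post> element, treating the file as plain text."""
    return POST_PATTERN.findall(contents)


# Hypothetical input containing the kinds of characters that broke the XML parser.
sample = (
    "<Blog>\n"
    "<date> some date </date>\n"
    "<post> tea&nbsp;time </post>\n"
    "<post>salt & pepper</post>\n"
    "</Blog>"
)
print(extract_posts(sample))
```

Because nothing is parsed as XML, references such as &nbsp; or &amp; come back as literal text; if decoded characters are needed, html.unescape from the standard library is one option to apply to each post afterwards.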