1

I am reading a large file containing various <xml>..</xml> elements. Since every XML parser has trouble with that, I would like to produce efficiently new file objects for each <xml>..</xml> block.

I was starting to subclass the file object in Python, but got stucked there. I think, I've to intercept each line starting with </xml> and return a new file object; maybe by using yield.

Can someone guide me to do the step in the right direction?

Here is my current code fragment:

#!/bin/bash/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  def __init__(self, name, mode):
    file.__init__(self, name, mode)

  def next(self):
    return file.next(self)

  def listXmls(self):
    output = StringIO()
    line = self.next()
    while line is not None:
      output.write(line.strip())
      if line.strip() == '</xml>':
        yield output
        output = StringIO()
      try:
        line = self.next()
      except StopIteration:
        break
    output.close()

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  print 'm' + elem.getvalue() + 'm'
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag

Thanks!

SOLUTION (still interested in a better version):

#!/bin/bash/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  def __init__(self, name, mode):
    file.__init__(self, name, mode)

  def next(self):
    return file.next(self)

  def listXmls(self):
    output = StringIO()
    output.write(self.next())
    line = self.next()
    while line is not None:
      if line.startswith('<?xml'):
        output.seek(0)
        yield output
        output = StringIO()
      output.write(line)
      try:
        line = self.next()
      except StopIteration:
        break
    output.seek(0)
    yield output

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag

1 Answer 1

1

While not a direct answer to your question, this may solve your problem anyway: Simply adding another <xml> at the beginning and another </xml> at the end will probably make your XML parser accept the document:

from lxml import etree
document = "<xml>a</xml> <xml>b</xml>"
document = "<xml>" + document + "</xml>"
for subdocument in etree.XML(document):
    # whatever
Sign up to request clarification or add additional context in comments.

2 Comments

Yeah, you're right. However, I'm interested in that more 'sophisticated' solution to learn Python better.
This is exactly the way that I handle it except I don't reuse any tagnames that are likely to be in the XML nodes. I have a production application that does xmlstr="<doc>"+inpstr+"</doc>"; xmldoc = etree.XML(xmlstr).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.