Python: Create various file objects while reading a file

Question

I am reading a large file that contains several <xml>..</xml> documents, one after another. Since every XML parser has trouble with that, I would like to efficiently produce a new file object for each <xml>..</xml> block.

I started subclassing the file object in Python, but got stuck there. I think I have to intercept each line starting with </xml> and return a new file object, maybe by using yield.

Can someone point me in the right direction?

Here is my current code fragment:

#!/usr/bin/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  def __init__(self, name, mode):
    file.__init__(self, name, mode)

  def next(self):
    return file.next(self)

  def listXmls(self):
    output = StringIO()
    line = self.next()
    while line is not None:
      output.write(line.strip())
      if line.strip() == '</xml>':
        yield output
        output = StringIO()
      try:
        line = self.next()
      except StopIteration:
        break
    output.close()

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  print 'm' + elem.getvalue() + 'm'
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag

Thanks!

SOLUTION (still interested in a better version):

#!/usr/bin/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  def __init__(self, name, mode):
    file.__init__(self, name, mode)

  def next(self):
    return file.next(self)

  def listXmls(self):
    output = StringIO()
    output.write(self.next())
    line = self.next()
    while line is not None:
      if line.startswith('<?xml'):
        output.seek(0)
        yield output
        output = StringIO()
      output.write(line)
      try:
        line = self.next()
      except StopIteration:
        break
    output.seek(0)
    yield output

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag
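
The same splitting can also be done without subclassing file, using a plain generator over any iterable of lines. The following is only a minimal sketch, assuming (as in the solution above) that every sub-document starts on a line beginning with `<?xml`. The name `split_xml_documents` and its `marker` parameter are made up for illustration, and unlike the Python 2 code above it is written for Python 3 and works on bytes, as a file opened in 'rb' mode would deliver them.

```python
import io


def split_xml_documents(lines, marker=b'<?xml'):
    """Yield one BytesIO per sub-document found in an iterable of byte lines.

    Assumption: every sub-document starts on a line beginning with `marker`.
    """
    buffer = []
    for line in lines:
        # A new declaration closes the document collected so far.
        if line.startswith(marker) and buffer:
            yield io.BytesIO(b''.join(buffer))
            buffer = []
        buffer.append(line)
    if buffer:
        # The last document has no following declaration, so emit it here.
        yield io.BytesIO(b''.join(buffer))


# Demo on in-memory data; a file object from open(path, 'rb') can be
# passed in the same way because it also iterates over lines.
sample = io.BytesIO(
    b'<?xml version="1.0"?>\n<xml><id>1</id></xml>\n'
    b'<?xml version="1.0"?>\n<xml><id>2</id></xml>\n'
)
documents = [doc.getvalue() for doc in split_xml_documents(sample)]
print(len(documents))
```

Each yielded object is positioned at its start, so it can be handed to etree.iterparse in the same way as elem in the loop above. Only one sub-document is held in memory at a time, which matters for a large input file.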

Sven Marnach · Accepted Answer · 2011-06-14 19:43:52Z

While not a direct answer to your question, this may solve your problem anyway: Simply adding another <xml> at the beginning and another </xml> at the end will probably make your XML parser accept the document:

from lxml import etree
document = "<xml>a</xml> <xml>b</xml>"
document = "<xml>" + document + "</xml>"
for subdocument in etree.XML(document):
    print(subdocument.text)  # replace with the actual processing
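
One caveat when the blocks carry their own `<?xml ...?>` declarations, as the input handled by the asker's solution apparently does: an XML declaration is only allowed at the very start of a document, so the declarations have to be removed before wrapping. A rough sketch of that step, under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, it targets Python 3, and `parse_concatenated` and the root tag `wrapper_root` are hypothetical names chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECLARATION = re.compile(r'<\?xml\s[^>]*\?>')


def parse_concatenated(text, root_tag='wrapper_root'):
    """Parse several concatenated XML documents as children of one root.

    Assumption: `root_tag` does not occur as a tag name in the input.
    """
    # Drop every declaration, then wrap the rest in a single root element.
    body = XML_DECLARATION.sub('', text)
    return ET.fromstring('<%s>%s</%s>' % (root_tag, body, root_tag))


text = ('<?xml version="1.0"?><xml><id>1</id></xml>\n'
        '<?xml version="1.0"?><xml><id>2</id></xml>\n')
root = parse_concatenated(text)
# Iterating over the root yields each original sub-document in order.
ids = [subdocument.findtext('id') for subdocument in root]
print(ids)
```

This reads the whole input into memory and builds the full tree at once, so for a very large file a line-by-line split may still be the better fit.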

2 Comments

labrassbandito Over a year ago

Yeah, you're right. However, I'm interested in that more 'sophisticated' solution to learn Python better.

Michael Dillon Over a year ago

This is exactly the way that I handle it except I don't reuse any tagnames that are likely to be in the XML nodes. I have a production application that does xmlstr="<doc>"+inpstr+"</doc>"; xmldoc = etree.XML(xmlstr).