Mixed XML / text parsing in Python

Question

I need to parse some log files in this ugly format: any number of plain-text headers, some of which are followed by additional data in XML:

[dd/mm/yy]:message_data
<starttag>
    <some_field>some_value</some_field>
     ....
</starttag>
[dd/mm/yy]:message_data
[dd/mm/yy]:message_data
....

So far my approach is:

    message_text = None
    header_info = None
    for line in LOGFILE:
        # Remember the most recent header; XML lines must not overwrite it.
        header_match = HEADER_RE.search(line)
        if header_match:
            header_info = header_match

        if MESSAGE_START_RE.search(line):
            message_text = line  # start collecting a new XML block
        elif message_text is not None:
            message_text += line

        if message_text is not None and MESSAGE_END_RE.search(line):
            process_message_with_xml_parser(message_text, header_info)
            message_text = None

where

MESSAGE_START_RE = re.compile(r"<starttag.*>")
MESSAGE_END_RE = re.compile(r"</starttag>")
HEADER_RE is a regex with named groups for the header fields; header_info is its match

Do you know any better way?

The problem with this approach is that I am more or less parsing XML with regex (which is stupid). Is there a package that can recognize the start and end of XML in a file?

Sabuj Hassan · Accepted Answer · 2014-04-07 13:10:23Z

You can still use BeautifulSoup on your ugly xml. Here is an example:

from bs4 import BeautifulSoup

data = """[dd/mm/yy]:message_data
<starttag>
    <some_field>some_value</some_field>
     ....
</starttag>
[dd/mm/yy]:message_data
[dd/mm/yy]:message_data"""

# Name the parser explicitly; html.parser tolerates the non-XML header lines.
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all("starttag"):
    print(tag.find("some_field").text)
    # => some_value


2 Comments

ProfHase85 Over a year ago

Thanks, I had already thought about this approach. But I could not find a way to combine the line [dd/mm/yy]:message_data (which contains the message meta information, such as timestamp and issuer) with tag.find("some_field").text. BeautifulSoup does not provide the line number of the match.

Sabuj Hassan Over a year ago

Those lines are outside the tags. You can use soup.text and parse the headers from there, since it will not contain any XML tags.
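
One way to keep each header together with the XML block that follows it is to walk the file line by line and hand only the collected block to a real XML parser. The following is a minimal sketch using just the standard library, not a definitive implementation. The name iter_messages and this particular HEADER_RE pattern are assumptions made for illustration; the sketch also assumes each XML block is well-formed, that its root tag is known in advance, and that a header is the line directly before its block.

```python
import re
import xml.etree.ElementTree as ET

# Assumed header shape: "[dd/mm/yy]:message_data" at the start of a line.
HEADER_RE = re.compile(r"^\[(?P<date>[^\]]+)\]:(?P<message>.*)$")


def iter_messages(lines, tag="starttag"):
    """Yield (header_match, element_or_None) for every header line.

    `iter_messages` is a hypothetical helper; `lines` is any iterable of
    text lines, for example an open log file.
    """
    start_re = re.compile(r"<%s[\s>]" % re.escape(tag))
    end_marker = "</%s>" % tag
    header = None
    block = None  # None means "not inside an XML block"
    for line in lines:
        if block is None:
            match = HEADER_RE.match(line)
            if match:
                if header is not None:
                    yield header, None  # previous header had no XML
                header = match
                continue
            if not start_re.search(line):
                continue  # skip lines that are neither header nor XML
            block = []
        block.append(line)
        if end_marker in line:
            # The regex only finds the block boundaries; ElementTree
            # does the actual XML parsing.
            yield header, ET.fromstring("".join(block))
            header, block = None, None
    if header is not None:
        yield header, None


if __name__ == "__main__":
    sample = [
        "[01/02/14]:first\n",
        "<starttag>\n",
        "    <some_field>some_value</some_field>\n",
        "</starttag>\n",
        "[02/02/14]:second\n",
    ]
    for header, element in iter_messages(sample):
        value = element.findtext("some_field") if element is not None else None
        print(header.group("date"), header.group("message"), value)
```

Because the header match and the parsed element are yielded as a pair, no line numbers are needed to associate them.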
