I need to parse some log files in this ugly format: any number of plaintext header lines, some of which are followed by additional data in XML:

[dd/mm/yy]:message_data
<starttag>
    <some_field>some_value</some_field>
     ....
</starttag>
[dd/mm/yy]:message_data
[dd/mm/yy]:message_data
....

So far my approach is:

    message_text = None
    header_info = None
    for line in LOGFILE:

        message_start_match = MESSAGE_START_RE.search(line)
        if not message_start_match:
            header_match = HEADER_RE.search(line)
            if header_match:
                # keep the most recent header; XML body lines must not reset it
                header_info = header_match

        if message_start_match:
            message_text = line
            continue
        if message_text:
            message_text += line

        if MESSAGE_END_RE.search(line):
            process_message_with_xml_parser(message_text, header_info)
            message_text = None

where

MESSAGE_START_RE = re.compile(r"<starttag.*>")
MESSAGE_END_RE = re.compile(r"</starttag>")
HEADER_RE is a regex with named groups for the fields of the header line; header_info is its match

Do you know any better way?

The problem with this approach is that I am sort of parsing XML with regex (which is stupid). Is there any package that can recognize the start and end of XML in a file?

1 Answer

You can still use BeautifulSoup on your ugly xml. Here is an example:

from bs4 import BeautifulSoup

data = """[dd/mm/yy]:message_data
<starttag>
    <some_field>some_value</some_field>
     ....
</starttag>
[dd/mm/yy]:message_data
[dd/mm/yy]:message_data"""

# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(data, "html.parser")
# Every <starttag> block is found even though plain text surrounds it
for tag in soup.find_all("starttag"):
    print(tag.find("some_field").text)
    # => some_value

2 Comments

Thanks, I had already thought about this approach. But I could not find any way to combine the line [dd/mm/yy]:message_data (which contains the message meta information, like timestamp and issuer) with tag.find("some_field").text. BeautifulSoup does not provide the line number of the match.
Those lines are outside the tags. You can use soup.text and then parse the text from there, as there will not be any XML tags in it.
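
If the goal is to keep each header line together with the XML block that follows it, one option is to walk the file line by line, collect the lines between the opening and closing tag, and hand only that block to a real XML parser. The following is a rough sketch using just the standard library, not a drop-in solution. The header pattern, the names parse_log, START_TAG and END_TAG, and the sample data are assumptions made for illustration; it also assumes an XML block always belongs to the header directly above it and that the opening tag starts its own line.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical header pattern for lines like "[dd/mm/yy]:message_data"
HEADER_RE = re.compile(r"^\[(?P<date>\d{2}/\d{2}/\d{2})\]:(?P<message>.*)$")
START_TAG = "<starttag"
END_TAG = "</starttag>"


def parse_log(lines):
    """Yield (header_fields, xml_root_or_None) for every header line."""
    header = None     # named groups of the most recent header line
    xml_lines = None  # lines of the XML block being collected, if any
    for line in lines:
        if xml_lines is None:
            match = HEADER_RE.match(line.strip())
            if match:
                if header is not None:
                    # the previous header had no XML block attached
                    yield header, None
                header = match.groupdict()
                continue
            if not line.lstrip().startswith(START_TAG):
                continue  # skip anything unexpected outside a block
            xml_lines = []
        xml_lines.append(line)
        if END_TAG in line:
            # only the collected block is passed to the XML parser
            yield header, ET.fromstring("".join(xml_lines))
            header, xml_lines = None, None
    if header is not None:
        yield header, None


sample = """[01/02/23]:first message
<starttag>
    <some_field>some_value</some_field>
</starttag>
[02/02/23]:second message
[03/02/23]:third message
"""

for header, root in parse_log(sample.splitlines(keepends=True)):
    field = root.findtext("some_field") if root is not None else None
    print(header["date"], header["message"], field)
```

The regex here only recognizes header lines and the block boundaries; the content of the block itself is parsed by xml.etree.ElementTree, so nested fields do not need their own patterns. Passing an open file object instead of the split sample string should work the same way, since both yield lines.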
