I need to parse some Log-files in this ugly format (Any number of plaintext headers where some of those headers got additional data in xml):
[dd/mm/yy]:message_data
<starttag>
<some_field>some_value</some_field>
....
</starttag>
[dd/mm/yy]:message_data
[dd/mm/yy]:message_data
....
So far my approach is:
message_text = None
for line in LOGFILE:
message_start_match = MESSAGE_START_RE.search(line)
if not message_start_match:
header_info = HEADER_RE.search(line)
if message_start_match:
message_text = line
continue
if message_text:
message_text += line
if MESSAGE_END_RE.search(line):
process_message_with_xml_parser(message_text, header_info)
message_text=None
where
MESSAGE_START_RE = re.compile(r"<starttag.*>)
MESSAGE_END_RE = re.compile(r"</starttag>)
header_info is a regex with named fields of the message
Do you know any better way?
The Problem in this aproach is: I am sort of parsing xml with regex (which is stupid). Is there any package which can recognize start and end of xml in file?