
I recently inherited a project where we deserialize a bunch of data written out by a system that I cannot change (I wish they used a standard serializer, but I cannot change this). For the most part, I was able to use ctypes to represent the structures and cast the data right into Python, but we have some cases where the underlying data structures are a mess (again, something I cannot change no matter how hard I have tried). The two cases where I am driving myself nuts trying to find an efficient approach are when the C structures are defined like this:

Simple Case:

struct b {
  int data;
  int more_data;
};

struct a {
  int num_entries;
  struct b *data;
};

When this was serialized, the b* data field was packed into memory as if it were a static array declaration.
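In other words, the serialized bytes are num_entries followed immediately by num_entries copies of struct b, so reading it back takes two steps instead of one flat ctypes cast. A rough sketch of what I mean (assuming 4-byte native ints and no padding; the names here are just for illustration):

import ctypes

class B(ctypes.Structure):
    _fields_ = [("data", ctypes.c_int), ("more_data", ctypes.c_int)]

def read_simple_a(buf):
    # buf is a bytes object holding one serialized struct a
    num_entries = ctypes.c_int.from_buffer_copy(buf).value
    # the packed "array" can then be cast in one go
    entries = (B * num_entries).from_buffer_copy(buf, ctypes.sizeof(ctypes.c_int))
    return [(e.data, e.more_data) for e in entries]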

And here comes the most horrible case I have to deal with:

struct c {
  int a;
  int b;
};

struct b {
  int random_data;
  struct c *data;
  int more_data;
};

struct a {
  int len; // this actually gives the number of entries in the "data" array inside struct b
  struct b nested_data;
  struct c why_not_it_is_this_poorly_organized;
};
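As far as I can tell, the serialized layout here is: a.len, then b.random_data, then a.len copies of struct c packed inline where the pointer sits, then b.more_data, and finally the trailing c. Something like the following field-by-field walk is what I am trying to avoid (again just a sketch, assuming 4-byte little-endian ints and no padding; the function name is made up):

import struct

def read_nested_a(buf, offset=0):
    # Sketch only: assumes 4-byte little-endian ints and no padding anywhere.
    (length,) = struct.unpack_from("<i", buf, offset)            # a.len
    (random_data,) = struct.unpack_from("<i", buf, offset + 4)   # b.random_data
    offset += 8
    entries = [struct.unpack_from("<ii", buf, offset + 8 * i)    # b.data, packed inline
               for i in range(length)]
    offset += 8 * length
    (more_data,) = struct.unpack_from("<i", buf, offset)         # b.more_data
    why_not = struct.unpack_from("<ii", buf, offset + 4)         # a.why_not_it_is_this_poorly_organized
    return length, (random_data, entries, more_data), why_not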

Any help would sure be appreciated!

  • Looks a lot like stackoverflow.com/questions/8392203/… Commented Jun 15, 2017 at 22:27
  • That question is going in the other direction; it would not address the case where you receive a byte stream and then need to cast the data back into this structure representation. Commented Jun 15, 2017 at 22:30
  • Using some platform-dependent (at best) binary format is a really bad idea. Use a text format, or at least define the binary format independently of the C platform. Then use proper marshalling on both sides. Commented Jun 15, 2017 at 23:17
  • I agree with you, Olaf, but I can't do that: I have no control over the input data I am receiving. I have stated my case numerous times, but this software goes back about 20 years, and apparently we can only do things the way we have always done them, because that's the way we have always done it... Corporate recursive logic... Commented Jun 16, 2017 at 0:44

1 Answer


Have you tried looking at the Python bitstring API? From there you can write some methods to de-serialize the data by slicing out the arrays. It might look something like this (note that bitstring slices are indexed in bits; the sketch assumes 32-bit ints):

def parse_c(buffer):
    # struct c: two 32-bit ints
    return buffer[0:32].int, buffer[32:64].int

def parse_b(buffer):
    random_data = buffer[0:32].int     # first int of b
    body = buffer[32:-32]              # skip first/last int
    entries = [parse_c(body[i:i + 64]) for i in range(0, len(body), 64)]
    more_data = buffer[-32:].int       # last int of b
    return random_data, entries, more_data

def parse_a(buffer):
    # parse len (you could also pass this into parse_b,
    # but b can figure it out from the slice size)
    length = buffer[0:32].int
    b = parse_b(buffer[32:-64])        # everything between a.len and the trailing c
    c = parse_c(buffer[-64:])
    return length, b, c
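You can then wrap the raw bytes in a Bits object and hand it to parse_a, e.g. (the file name is a placeholder, and the file is assumed to hold exactly one serialized struct a):

from bitstring import Bits

# Placeholder file name; assumed to contain exactly one serialized struct a.
raw = Bits(filename="record.bin")
length, b, c = parse_a(raw)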

3 Comments

This doesn't seem to actually help in any way. Now you have bits instead of bytes. If anything, that's a step backward.
As the documentation says, "Bitstrings don’t know or care how they were created; they are just collections of bits. This means that you are quite free to interpret them in any way that makes sense." See: pythonhosted.org/bitstring/interpretation.html For example: length = buffer.int to grab that first int.
This may be the way I have to go, but I was hoping to do this without parsing the data field by field, since there tends to be around 1 TB of data for every log we record and somewhere on the order of 10,000 data types to represent. I worry about the efficiency of the slicing and nested function calls, but it is better than anything I have come up with so far! Thanks!
