
I recently inherited a project where we deserialize a bunch of data written out by a system that I cannot change (I wish they used a standard serializer, but I cannot change this). For the most part, I was able to use ctypes to represent the structures and cast the data right into Python, but we have some cases where the underlying data structures are a mess (again, something I cannot change no matter how hard I have tried). The two cases where I am driving myself nuts trying to find an efficient approach are when the C structures are defined like this:

Simple Case:

struct b {
  int data;
  int more_data;
};

struct a {
  int num_entries;
  struct b *data;
};

When this was serialized, the b* data field was packed into memory as if it were a static array declaration.
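In other words, the serialized bytes are num_entries followed immediately by num_entries copies of struct b, so reading it back takes two steps instead of one flat ctypes cast. A rough sketch of what I mean (assuming 4-byte native ints and no padding; the names here are just for illustration):

import ctypes

class B(ctypes.Structure):
    _fields_ = [("data", ctypes.c_int), ("more_data", ctypes.c_int)]

def read_simple_a(buf):
    # buf is a bytes object holding one serialized struct a
    num_entries = ctypes.c_int.from_buffer_copy(buf).value
    # the packed "array" can then be cast in one go
    entries = (B * num_entries).from_buffer_copy(buf, ctypes.sizeof(ctypes.c_int))
    return [(e.data, e.more_data) for e in entries]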

And here comes the most horrible case I have to deal with:

struct c {
  int a;
  int b;
};

struct b {
  int random_data;
  struct c *data;
  int more_data;
};

struct a {
  int len; // this actually gives the number of entries in the "data" array inside struct b
  struct b nested_data;
  struct c why_not_it_is_this_poorly_organized;
};
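As far as I can tell, the serialized layout here is: a.len, then b.random_data, then a.len copies of struct c packed inline where the pointer sits, then b.more_data, and finally the trailing c. Something like the following field-by-field walk is what I am trying to avoid (again just a sketch, assuming 4-byte little-endian ints and no padding; the function name is made up):

import struct

def read_nested_a(buf, offset=0):
    # Sketch only: assumes 4-byte little-endian ints and no padding anywhere.
    (length,) = struct.unpack_from("<i", buf, offset)            # a.len
    (random_data,) = struct.unpack_from("<i", buf, offset + 4)   # b.random_data
    offset += 8
    entries = [struct.unpack_from("<ii", buf, offset + 8 * i)    # b.data, packed inline
               for i in range(length)]
    offset += 8 * length
    (more_data,) = struct.unpack_from("<i", buf, offset)         # b.more_data
    why_not = struct.unpack_from("<ii", buf, offset + 4)         # a.why_not_it_is_this_poorly_organized
    return length, (random_data, entries, more_data), why_not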

Any help would sure be appreciated!

  • Looks a lot like stackoverflow.com/questions/8392203/… Commented Jun 15, 2017 at 22:27
  • That question is going in the other direction; it would not address the case where you receive a byte stream and then need to cast the data back into this structure representation. Commented Jun 15, 2017 at 22:30
  • Using some platform-dependent (at best) binary format is a really bad idea. Use a text format, or at least define the binary format independently of the C platform. Then use proper marshalling on both sides. Commented Jun 15, 2017 at 23:17
  • I agree with you, Olaf, but I can't do that: I have no control over the input data I am receiving. I have stated my case numerous times, but this software goes back about 20 years, and apparently we can only do things the way we have always done them, because that's the way we have always done it... Corporate recursive logic... Commented Jun 16, 2017 at 0:44

1 Answer


Have you tried looking at the Python bitstring API? From there you can write some methods to de-serialize the data by slicing out the arrays. It might look something like this (note that bitstring slices are indexed in bits; the sketch assumes 32-bit ints):

def parse_c(buffer):
    # struct c: two 32-bit ints
    return buffer[0:32].int, buffer[32:64].int

def parse_b(buffer):
    random_data = buffer[0:32].int     # first int of b
    body = buffer[32:-32]              # skip first/last int
    entries = [parse_c(body[i:i + 64]) for i in range(0, len(body), 64)]
    more_data = buffer[-32:].int       # last int of b
    return random_data, entries, more_data

def parse_a(buffer):
    # parse len (you could also pass this into parse_b,
    # but b can figure it out from the slice size)
    length = buffer[0:32].int
    b = parse_b(buffer[32:-64])        # everything between a.len and the trailing c
    c = parse_c(buffer[-64:])
    return length, b, c
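You can then wrap the raw bytes in a Bits object and hand it to parse_a, e.g. (the file name is a placeholder, and the file is assumed to hold exactly one serialized struct a):

from bitstring import Bits

# Placeholder file name; assumed to contain exactly one serialized struct a.
raw = Bits(filename="record.bin")
length, b, c = parse_a(raw)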

3 Comments

This doesn't seem to actually help in any way. Now you have bits instead of bytes. If anything, that's a step backward.
As the documentation says, "Bitstrings don’t know or care how they were created; they are just collections of bits. This means that you are quite free to interpret them in any way that makes sense." See: pythonhosted.org/bitstring/interpretation.html For example: length = buffer.int to grab that first int.
This may be the way I have to go, but I was hoping to do this without parsing the data field by field, since there tends to be around 1 TB of data for every log we record and somewhere on the order of 10,000 data types to represent. I worry about the efficiency of the slicing and nested function calls, but it is better than anything I have come up with so far! Thanks!
