Python - appending/padding binary file adds garbage

Question

I'm trying to create and concatenate binary files, with padding in between them. For some reason Python adds garbage and significantly increases the size of the output file (and it also changes the existing contents).

What I do is something like this:

file1 = open(read_file, "rb")
file2 = open(read_file, "rb")
file3 = open(write_file, "ab")
data = file1.read(f1_size)
file3.write(data)
file3.flush()
file1.close()
while padding_size > 0:
   file3.write(b'\x00')
   padding_size -= 1
   file3.flush() # Not sure this is mandatory here
data2 = file2.read(f2_size)
file3.write(data2)
file3.flush()
file2.close()
file3.close()

This happens even before I add the second binary file. If I use a small padding size (let's say 100), it writes fine, but with a slightly bigger size it goes haywire and adds a lot of garbage to the output file. This code is not optimal, but that doesn't really matter to me, as long as I can add the padding properly.

Appreciate your help.

Update
I have since learned that another script, which shouldn't change my file, was modifying it because of a failure, and that was causing the issue. Thank you for the insights and the help; I've changed my script significantly.

I don't have an answer, but I do have a few notes. First and foremost, you'll never read the second file: size is decremented to zero during the while loop, so file2.read(size) doesn't read anything. Second, you should use the with open(file) syntax (docs.python.org/2/tutorial/inputoutput.html); it'll help keep your code organized, which will help with debugging this issue. Third, you don't need to flush at all unless you need to see the output file change throughout the script (rather than just at the end).
– DreadPirateShawn, Commented Oct 14, 2015 at 7:04
@DreadPirateShawn I understand what you mean. I wrote generic variables here, so the size is not correct; I will fix the above.
– just_a_user, Commented Oct 14, 2015 at 8:13

DreadPirateShawn · Accepted Answer · 2015-10-14 16:20:36Z

The problem might be related to the extra work you're doing (in particular specifying the input size) or to how you're checking the output (unspecified above).

Consider the following, which works as desired:

Prep test data:

$ echo "abc" > input1
$ echo "def" > input2
$ zip input1.zip input1
  adding: input1 (stored 0%)
$ zip input2.zip input2
  adding: input2 (stored 0%)
$ cat -v input1.zip
PK^C^D
^@^@^@^@^@)}NGNM-^AM-^HG^D^@^@^@^D^@^@^@^F^@^\^@input1UT    ^@^CM-^^w^^V[x^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@abc
PK^A^B^^^C
^@^@^@^@^@)}NGNM-^AM-^HG^D^@^@^@^D^@^@^@^F^@^X^@^@^@^@^@^A^@^@^@M-$M-^A^@^@^@^@input1UT^E^@^CM-^^w^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@PK^E^F^@^@^@^@^A^@^A^@L^@^@^@D^@^@^@^@^@
$ cat -v input2.zip
PK^C^D
^@^@^@^@^@.}NGM-<M-^Sn^H^D^@^@^@^D^@^@^@^F^@^\^@input2UT    ^@^CM-'w^^V[x^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@def
PK^A^B^^^C
^@^@^@^@^@.}NGM-<M-^Sn^H^D^@^@^@^D^@^@^@^F^@^X^@^@^@^@^@^A^@^@^@M-$M-^A^@^@^@^@input2UT^E^@^CM-'w^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@PK^E^F^@^@^@^@^A^@^A^@L^@^@^@D^@^@^@^@^@

Refined script "test.py":

read_file1 = "input1.zip"
read_file2 = "input2.zip"
write_file = "output"
padding_size = 50

with open(write_file, "ab") as file3:
    with open(read_file1, "rb") as file1:
        data = file1.read()
        file3.write(data)
    while padding_size > 0:
        file3.write(b'\x00')
        padding_size -= 1
    with open(read_file2, "rb") as file2:
        data = file2.read()
        file3.write(data)

Output:

$ python test.py
$ cat -v output
PK^C^D
^@^@^@^@^@)}NGNM-^AM-^HG^D^@^@^@^D^@^@^@^F^@^\^@input1UT    ^@^CM-^^w^^V[x^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@abc
PK^A^B^^^C
^@^@^@^@^@)}NGNM-^AM-^HG^D^@^@^@^D^@^@^@^F^@^X^@^@^@^@^@^A^@^@^@M-$M-^A^@^@^@^@input1UT^E^@^CM-^^w^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@PK^E^F^@^@^@^@^A^@^A^@L^@^@^@D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@PK^C^D
^@^@^@^@^@.}NGM-<M-^Sn^H^D^@^@^@^D^@^@^@^F^@^\^@input2UT    ^@^CM-'w^^V[x^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@def
PK^A^B^^^C
^@^@^@^@^@.}NGM-<M-^Sn^H^D^@^@^@^D^@^@^@^F^@^X^@^@^@^@^@^A^@^@^@M-$M-^A^@^@^@^@input2UT^E^@^CM-'w^^Vux^K^@^A^DM-]^E^@^@^DM-]^E^@^@PK^E^F^@^@^@^@^A^@^A^@L^@^@^@D^@^@^@^@^@
$ ls -l
total 24
-rw-r--r-- 1 foo bar   4 Oct 14 15:41 input1
-rw-r--r-- 1 foo bar 166 Oct 14 15:45 input1.zip
-rw-r--r-- 1 foo bar   4 Oct 14 15:41 input2
-rw-r--r-- 1 foo bar 166 Oct 14 15:45 input2.zip
-rw-r--r-- 1 foo bar 382 Oct 14 15:58 output
-rw-r--r-- 1 foo bar 406 Oct 14 15:58 test.py

Notes:

The docs -- https://docs.python.org/2/tutorial/inputoutput.html in this case -- have good advice, and often it's very wise to spend a bit of extra time to understand them first.

For instance, search for "read" and you'll see that file.read() reads the entire file; the size parameter is optional. That's one less moving part right there: don't specify the size unless you truly must, since you might specify it incorrectly.
As you read the doc, you'll see the "with open" syntax recommended. It closes the file for you and keeps the code easier to read, which often helps debugging and maintenance.
Search for "flush" and you won't find it at all, actually, because it's not necessary here: closing a file flushes any buffered data. Google "Python flush file" and you'll find more detailed confirmation and explanation of what the use cases for "flush" are. That's another moving part to remove.

Thus my revised code above.
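One further simplification is possible: the byte-at-a-time padding loop can be replaced by a single write of b"\x00" * padding_size. The sketch below wraps the same steps in a hypothetical helper, concat_with_padding (not a name from the question). It also opens the output with "wb" instead of "ab", which is an assumption about intent: "ab" appends to whatever the output file already contains, so re-running the script makes the file grow, while "wb" starts from an empty file each time.

```python
import os
import tempfile


def concat_with_padding(first_path, second_path, out_path, padding_size):
    """Write the first file, padding_size zero bytes, then the second file."""
    # Assumption: "wb" (truncate) is wanted instead of "ab" (append), so that
    # re-running the script does not keep growing an existing output file.
    with open(out_path, "wb") as out:
        with open(first_path, "rb") as first:
            out.write(first.read())
        # One write replaces the byte-at-a-time while loop.
        out.write(b"\x00" * padding_size)
        with open(second_path, "rb") as second:
            out.write(second.read())


if __name__ == "__main__":
    # Small throwaway inputs, mirroring the "abc" / "def" test data.
    workdir = tempfile.mkdtemp()
    first = os.path.join(workdir, "input1")
    second = os.path.join(workdir, "input2")
    output = os.path.join(workdir, "output")
    with open(first, "wb") as f:
        f.write(b"abc\n")
    with open(second, "wb") as f:
        f.write(b"def\n")

    concat_with_padding(first, second, output, 50)
    print(os.path.getsize(output))  # 4 + 50 + 4 = 58
```

If appending to an existing output really is the goal, keep "ab" and remember that the expected size then includes whatever was already in the file.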

Lastly, always take the time to make simple / small test data, like I've done here. It's small enough that you can see the entire input files and the entire resulting output file, and can visually confirm that they are not corrupted. Additionally, using ls -l you can see that (size of output file) = (size of input1.zip) + (size of input2.zip) + padding, which here is 166 + 166 + 50 = 382 bytes.
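The same check can be scripted rather than eyeballed. The following is a minimal sketch, assuming the inputs are small enough to read into memory; the helper name is_padded_concat is made up for this illustration. It compares the size first, then the actual bytes.

```python
import os


def is_padded_concat(first_path, second_path, out_path, padding_size):
    """Return True if out_path is first + padding_size zero bytes + second."""
    expected_size = (os.path.getsize(first_path)
                     + padding_size
                     + os.path.getsize(second_path))
    # Cheap check first: a wrong size already means the output is off.
    if os.path.getsize(out_path) != expected_size:
        return False
    with open(first_path, "rb") as f:
        first = f.read()
    with open(second_path, "rb") as f:
        second = f.read()
    with open(out_path, "rb") as f:
        actual = f.read()
    # Byte-for-byte comparison against the intended layout.
    return actual == first + b"\x00" * padding_size + second
```

With the files from this answer, is_padded_concat("input1.zip", "input2.zip", "output", 50) should return True after a single run of test.py on a fresh output file. A second run in "ab" mode would double the output size and make the check fail.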

Now, where does that leave us?

This answer provides confirmed code that achieves your stated goal. (Yay!) Granted, you might still have bugs -- but even if so, now you have a reference point that you can use to investigate your own repro case further, to see how it differs. (If it does differ, feel free to post a new question, with your own test data, full runnable script, and the means you're using to determine that the output is corrupted.)
