How to decode source code that is compressed with gzip in Python

Question

I am trying to get the source code of a PHP web page through a proxy, but the output contains non-printable characters. The output I got is as follows:

 "Date: Tue, 09 Feb 2016 10:29:14 GMT
Server: Apache/2.4.9 (Unix) OpenSSL/1.0.1g PHP/5.5.11 mod_perl/2.0.8-dev Perl/v5.16.3
X-Powered-By: PHP/5.5.11
Set-Cookie: PHPSESSID=jmqasueos33vqoe6dbm3iscvg0; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 577
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html

�TMo�@�G����7�)P�H�H�DS��=U�=�U�]˻��_�Ycl�T�*�>��eg��
                                                          ����Z�
                                                                �V�N�f�:6�ԫ�IkZ77�A��nG�W��ɗ���RGY��Oc`-ο�ƜO��~?�V��$�
                            �l4�+���n�].W��TǇSx�/|�n��#���>��r����;�l����H��4��f�\  �SY�y��7��"

How can I decode this using Python? I tried to use

decd=zlib.decompress(data, 16+zlib.MAX_WBITS)

but it does not give the decoded data.

The proxy I am using works fine for a few other web applications, but it shows non-printable characters for some. How do I decode this?

Because I am using a proxy, I don't want to use get(), urlopen(), or any other request made from the Python program.

How are you retrieving the URL? If you use the requests module the content will be automatically decompressed for you.
– mhawke, Commented Feb 9, 2016 at 10:54
So you just want to know how to decompress the body of the HTTP response? What is the input? Is it the whole HTTP response, including headers, or is it just the compressed body? What is contained in the data that you passed to zlib.decompress()?
– mhawke, Commented Feb 9, 2016 at 11:20
@mhawke My proxy gets the whole HTTP response, and I want to decode the compressed body. When I request the same page with the get function I get normal HTML source code, but through the proxy I get the output shown above.
– krocks, Commented Feb 9, 2016 at 11:25
What is in data? If data contains only gzipped data, zlib.decompress(data, 16+zlib.MAX_WBITS) should successfully decompress the data. Or you could use the gzip module as shown in my answer. But what are you passing in data?
– mhawke, Commented Feb 9, 2016 at 11:26
@mhawke data contains the information shown above: starting from the URL name and date, and including the non-printable characters.
– krocks, Commented Feb 9, 2016 at 11:31

Community · Accepted Answer · 2017-05-23 12:15:35Z

One obvious way to do this is to extract the compressed data from the response and decompress it using GzipFile().read(). Splitting the response on the first blank line is fragile (it does not handle Transfer-Encoding: chunked, for example), but here it goes:

from gzip import GzipFile
from StringIO import StringIO

http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'

body = http.split('\r\n\r\n', 1)[1]
print GzipFile(fileobj=StringIO(body)).read()

Output

{
  "gzipped": true, 
  "headers": {
    "Host": "httpbin.org"
  }, 
  "method": "GET", 
  "origin": "192.168.1.1"
}
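
The snippet above is Python 2 (the StringIO module and the print statement). As a minimal sketch of the same split-and-decompress idea in Python 3, assuming the whole raw response is already held in memory as bytes, something like the following might work. The function name decode_gzip_body and the sample response are hypothetical, and the header check is deliberately simplistic:

```python
import gzip


def decode_gzip_body(raw_response: bytes) -> bytes:
    """Return the body of a raw HTTP response, gunzipped if it is gzip-encoded.

    Hypothetical helper: assumes the full response is in memory and the
    body is not sent with chunked transfer encoding.
    """
    # Headers end at the first blank line; everything after it is the body.
    head, sep, body = raw_response.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    # Naive header check: looks for the exact text "content-encoding: gzip".
    headers = head.decode("iso-8859-1").lower()
    if "content-encoding: gzip" in headers:
        return gzip.decompress(body)
    return body


# Build a small sample response instead of relying on captured traffic.
payload = b'{"gzipped": true}'
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"\r\n"
) + gzip.compress(payload)

print(decode_gzip_body(sample))
```

Responses without a gzip Content-Encoding header are returned unchanged, so the same helper can be applied to every response the proxy captures.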

If you feel compelled to parse the full HTTP response message, then, as inspired by this answer, here is a rather roundabout way to do it. It involves constructing an httplib.HTTPResponse directly from the raw HTTP response, using that to create a urllib3.response.HTTPResponse, and then accessing the decompressed data:

import httplib
from cStringIO import StringIO
from urllib3.response import HTTPResponse

http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'

class DummySocket(object):
    def __init__(self, data):
        self._data = StringIO(data)
    def makefile(self, *args, **kwargs):
        return self._data

response = httplib.HTTPResponse(DummySocket(http))
response.begin()
response = HTTPResponse.from_httplib(response)
print(response.data)

Output

{
  "gzipped": true, 
  "headers": {
    "Host": "httpbin.org"
  }, 
  "method": "GET", 
  "origin": "192.168.1.1"
}
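
For Python 3, a rough equivalent can be sketched with the standard library alone: http.client.HTTPResponse parses the status line and headers, and the body is then decompressed by hand, since http.client does not decode Content-Encoding itself. The names FakeSocket and parse_and_decode are hypothetical, and the sample response is built locally rather than captured:

```python
import gzip
import io
from http.client import HTTPResponse


class FakeSocket:
    """Hypothetical stand-in for a socket; HTTPResponse only calls makefile()."""

    def __init__(self, raw: bytes):
        self._file = io.BytesIO(raw)

    def makefile(self, *args, **kwargs):
        return self._file


def parse_and_decode(raw_response: bytes) -> bytes:
    """Parse a raw HTTP response and return its body, gunzipped if needed."""
    response = HTTPResponse(FakeSocket(raw_response))
    response.begin()        # parses the status line and headers
    body = response.read()  # reads the body according to the parsed headers
    if (response.getheader("Content-Encoding") or "").lower() == "gzip":
        body = gzip.decompress(body)
    return body


# Locally built sample response with a correct Content-Length.
payload = b'{"gzipped": true}'
compressed = gzip.compress(payload)
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
    b"\r\n"
) + compressed

print(parse_and_decode(raw))
```

Letting http.client do the header parsing avoids the string matching of the simpler split approach, at the cost of the small socket shim.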

Actually, I am using a proxy through which I can capture the source code of a URL in a browser. If I make the request from a Python program I can get it, but I want to capture it automatically for all pages I visit in the browser.
@krocks: I've updated my answer with 2 methods that should work for you.

mementum · Accepted Answer · 2016-02-09 11:37:47Z

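Although gzip uses zlib, when Content-Encoding is set to gzip there is an additional gzip header before the compressed stream, which zlib.decompress does not interpret with its default wbits value. (Passing 16+zlib.MAX_WBITS, as in the question, does enable gzip handling, but only if data holds nothing except the gzip stream.)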

Put your data (the compressed body only, without the HTTP headers) in a file-like object and pass it through the gzip module, for example:

import cStringIO
import gzip

mydatafile = cStringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=mydatafile)
decdata = gzipper.read()

From my already old http library for Python 2.x

https://github.com/mementum/httxlib/blob/master/httxlib/httxcompression.py
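
To illustrate how the wbits argument and stray HTTP headers affect zlib.decompress, here is a small Python 3 sketch. The payload and the header block are made up for the example:

```python
import gzip
import zlib

payload = b"<html>hello</html>"
gz_stream = gzip.compress(payload)

# Default wbits expects a zlib header, so a gzip stream is rejected.
try:
    zlib.decompress(gz_stream)
    default_ok = True
except zlib.error:
    default_ok = False

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
gzip_mode = zlib.decompress(gz_stream, 16 + zlib.MAX_WBITS)

# Prepending HTTP headers breaks even the gzip-aware call,
# because the input no longer starts with the gzip magic bytes.
with_headers = b"HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n" + gz_stream
try:
    zlib.decompress(with_headers, 16 + zlib.MAX_WBITS)
    headers_ok = True
except zlib.error:
    headers_ok = False

print(default_ok, gzip_mode == payload, headers_ok)  # False True False
```

The last case matches the situation in the question: the call itself is right, but the input still contains the response headers.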

