9

I am looking for a native way to parse an http request in Python 3.

This question shows a way to do it in Python 2, but uses now deprecated modules, (and Python 2) and I am looking for a way to do it in Python 3.

I would mainly like to just figure out what resource is requested and parse the headers and from a simple request. (i.e):

GET /index.html HTTP/1.1
Host: localhost
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

Can someone show me a basic way to parse this request?

6
  • 1
    Your first sentence shows that you know you should just use a library (e.g. urllib3, requests). Then you say you're trying to do it in Python 3, and don't know how. Why don't you just use requests? Commented Aug 22, 2016 at 23:54
  • @JonathonReinhart I am working in an environment that does not allow the use of third party libraries. Commented Aug 23, 2016 at 0:34
  • 1
    urllib is not third party Commented Aug 23, 2016 at 0:58
  • And it would appear this class in the standard library does what you want. docs.python.org/3/library/… Commented Aug 23, 2016 at 1:02
  • 1
    @cricket_007 he does not mention urllib. He mentions urllib3 which is third party. Commented Aug 23, 2016 at 1:48

3 Answers 3

7

You could use the email.message.Message class from the email module in the standard library.

By modifying the answer from the question you linked, below is a Python3 example of parsing HTTP headers.

Suppose you wanted to create a dictionary containing all of your header fields:

import email
import pprint

request_string = 'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, sdch\r\nAccept-Language: en-US,en;q=0.8'

# pop the first line so we only process headers
_, headers = request_string.split('\r\n', 1)

# construct a message from the request string. note: the return is already a dict-like object.
message = email.message_from_string(headers)

# construct a dictionary containing the headers
headers = dict(message.items())

# pretty-print the dictionary of headers
pprint.pprint(headers, width=160)

if you ran this at a python prompt, the result would look like:

{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, sdch',
 'Accept-Language': 'en-US,en;q=0.8',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'localhost',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
Sign up to request clarification or add additional context in comments.

6 Comments

This is great - and yes, sorry my formatting of the original request was bad. However, where do I get the resource here? (i.e. the actual resource being requested). Since we pop it, how do I know what was actually requested?
@Startec it would be in the first line, along with the request method and protocol version.
So I would have to do some string splitting on the first line?
yes, you could probably just split the first line on whitespace to grab the resource name.
@Startec StringIO is creating a in-memory file-object to feed email.message_from_file (which expects a text stream). You can also parse messages directly from bytes, strings or binary streams.
|
2

Each one of those field names should be delimited by carriage return then newline, and then the field name and value are delimited by a colon. So assuming you already have the response as a string, it should be as easy as:

fields = resp.split("\r\n")
fields = fields[1:] #ignore the GET / HTTP/1.1
output = {}
for field in fields:
    key,value = field.split(':', 1)#split each line by http field name and value
    output[key] = value

Update 4/13

Using the example http resp in the linked to post:

resp = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nA
ccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozill
a/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.
13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n'


fields = resp.split("\r\n")
fields = fields[1:] #ignore the GET / HTTP/1.1
output = {}
for field in fields:
    if not field:
        continue
    key,value = field.split(':', 1)
    output[key] = value    
print(output)

An additional check to make sure field is not empty is needed. OUtput:

{'Host': ' www.google.com', 'Connection': ' keep-alive', 'Accept': ' application/xml,application/xhtml+xml,text/html;q=
0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 'User-Agent': ' Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) App
leWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': ' gzip,deflate,sdch', 'Avail-D
ictionary': ' GeNLY2f-', 'Accept-Language': ' en-US,en;q=0.8'}

5 Comments

That code won't work. Patch it by add maxsplit=1 to split() and it would be actually better. And you may want to split by \n instead of \r\n, that way it would be more generic Then do not forget \r at the end if any..
You may want to consider a dedicated library like kiss-headers to handle them properly.
@Ousret - updated post to show that code works even on the example request in the post. I did need to have quick error check if field was empty, but for example code it holds up. As for using libraries, that is a good default choice.
Check out this header : User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0 It will fail with this. ;)
You may have to ignore the second line as well (i.e. Host field and value) in case the port number is explicitly included in the url. I.E. use fields = fields[2:] or key,value = field.split(':') will throw error.
0

Here are some Python packages aimed at proper HTTP protocol parsing:

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.