How to parse raw HTTP request in Python 3?

Question

I am looking for a native way to parse an http request in Python 3.

This question shows a way to do it in Python 2, but it relies on modules that are now deprecated, and I am looking for a way to do it in Python 3.

I would mainly like to find out which resource is requested and parse the headers from a simple request, e.g.:

GET /index.html HTTP/1.1
Host: localhost
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

Can someone show me a basic way to parse this request?

Your first sentence shows that you know you should just use a library (e.g. urllib3, requests). Then you say you're trying to do it in Python 3, and don't know how. Why don't you just use requests?
– Jonathon Reinhart, Commented Aug 22, 2016 at 23:54
@JonathonReinhart I am working in an environment that does not allow the use of third party libraries.
– Startec, Commented Aug 23, 2016 at 0:34
And it would appear this class in the standard library does what you want. docs.python.org/3/library/…
– OneCricketeer, Commented Aug 23, 2016 at 1:02
@cricket_007 he does not mention urllib. He mentions urllib3 which is third party.
– Startec, Commented Aug 23, 2016 at 1:48

newUserHa · Accepted Answer · 2023-06-22 02:03:25Z

7

You could use the email.message.Message class from the email module in the standard library.

Adapting the answer from the question you linked, here is a Python 3 example of parsing HTTP headers.

Suppose you wanted to create a dictionary containing all of your header fields:

import email
import pprint

request_string = 'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, sdch\r\nAccept-Language: en-US,en;q=0.8'

# pop the first line so we only process headers
_, headers = request_string.split('\r\n', 1)

# construct a message from the request string. note: the return is already a dict-like object.
message = email.message_from_string(headers)

# construct a dictionary containing the headers
headers = dict(message.items())

# pretty-print the dictionary of headers
pprint.pprint(headers, width=160)

If you ran this at a Python prompt, the result would look like:

{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, sdch',
 'Accept-Language': 'en-US,en;q=0.8',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'localhost',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
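
To also recover the requested resource, the request line can be split on whitespace before the headers are handed to the email module. The following is a minimal sketch, not a full HTTP parser: the function name parse_request is made up for illustration, and it assumes a well-formed request line with exactly three parts. Note that dict() keeps only one value when a header name is repeated.

```python
import email


def parse_request(request_string):
    """Split a raw HTTP request into (method, target, version, headers)."""
    # The request line is everything before the first CRLF.
    request_line, _, header_block = request_string.partition('\r\n')
    # Assumption: the request line is "METHOD TARGET VERSION".
    method, target, version = request_line.split(' ', 2)
    # Drop anything after the blank line that ends the headers (the body).
    header_block = header_block.split('\r\n\r\n', 1)[0]
    message = email.message_from_string(header_block)
    return method, target, version, dict(message.items())


method, target, version, headers = parse_request(
    'GET /index.html HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n'
)
print(method, target, version)
print(headers['Host'])
```

Here target holds the requested resource (/index.html) and headers is the same kind of dictionary as above.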

edited Jun 22, 2023 at 2:03

newUserHa


answered Aug 23, 2016 at 1:42

Corey Goldberg



6 Comments

Startec Over a year ago

This is great - and yes, sorry my formatting of the original request was bad. However, where do I get the resource here? (i.e. the actual resource being requested). Since we pop it, how do I know what was actually requested?

Corey Goldberg Over a year ago

@Startec it would be in the first line, along with the request method and protocol version.

Startec Over a year ago

So I would have to do some string splitting on the first line?

Corey Goldberg Over a year ago

yes, you could probably just split the first line on whitespace to grab the resource name.

Nuno André Over a year ago

@Startec StringIO is creating an in-memory file object to feed email.message_from_file (which expects a text stream). You can also parse messages directly from bytes, strings or binary streams.
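
A rough sketch of the bytes route mentioned in this comment, assuming the request arrives as raw bytes (for example from a socket read). The sample request and variable names are made up for illustration; email.message_from_bytes is part of the standard library.

```python
import email

raw = b'GET / HTTP/1.1\r\nHost: localhost\r\nAccept-Language: en-US,en;q=0.8\r\n\r\n'

# Separate the request line from the header block, staying in bytes.
request_line, _, header_bytes = raw.partition(b'\r\n')

# message_from_bytes parses the header block without an intermediate StringIO.
message = email.message_from_bytes(header_bytes)

# Header lookup on the message object is case-insensitive.
host = message['host']
language = message['Accept-Language']
print(request_line.decode('ascii'), host, language)
```

Keeping the Message object instead of converting it to a dict preserves the case-insensitive lookup.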


liviaerxin · Accepted Answer · 2023-04-25 18:57:50Z

2

Each header field should be terminated by a carriage return followed by a newline, and the field name and value are separated by a colon. So assuming you already have the request as a string, it should be as easy as:

fields = resp.split("\r\n")
fields = fields[1:] #ignore the GET / HTTP/1.1
output = {}
for field in fields:
    key,value = field.split(':', 1)#split each line by http field name and value
    output[key] = value

Update 4/13

Using the example HTTP request from the linked post:

resp = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n'


fields = resp.split("\r\n")
fields = fields[1:]  # ignore the request line
output = {}
for field in fields:
    if not field:
        continue  # skip the empty string left by the trailing \r\n
    key, value = field.split(':', 1)  # split only on the first colon
    output[key] = value
print(output)

An additional check to make sure field is not empty is needed. Output:

{'Host': ' www.google.com', 'Connection': ' keep-alive', 'Accept': ' application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 'User-Agent': ' Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': ' gzip,deflate,sdch', 'Avail-Dictionary': ' GeNLY2f-', 'Accept-Language': ' en-US,en;q=0.8'}
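
The values in this output keep the space that follows each colon. A possible variant, sketched under the assumption that lower-cased names are acceptable (HTTP header names are case-insensitive), strips that whitespace and stops at the blank line that ends the header block. The function name parse_headers and the shortened sample request are made up for illustration.

```python
def parse_headers(request_text):
    """Return a dict mapping lower-cased header names to stripped values."""
    headers = {}
    # Skip the request line; stop at the blank line that ends the headers.
    for line in request_text.split('\r\n')[1:]:
        if not line:
            break
        # partition splits on the first colon only.
        name, sep, value = line.partition(':')
        if not sep:
            continue  # not a "name: value" line; ignored in this sketch
        headers[name.strip().lower()] = value.strip()
    return headers


example = 'GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n\r\n'
parsed = parse_headers(example)
print(parsed)
```

Repeated header names overwrite each other here, so this is only suitable for simple requests.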

edited Apr 25, 2023 at 18:57

liviaerxin


answered Aug 23, 2016 at 1:13

Liam Kelly


5 Comments

Ousret Over a year ago

That code won't work. Patch it by adding maxsplit=1 to split() and it would actually be better. You may also want to split on \n instead of \r\n to be more generic; then do not forget to strip the \r at the end, if any.

Ousret Over a year ago

You may want to consider a dedicated library like kiss-headers to handle them properly.

Liam Kelly Over a year ago

@Ousret - updated the post to show that the code works even on the example request in the post. I did need to add a quick check for empty fields, but for example code it holds up. As for using libraries, that is a good default choice.

Ousret Over a year ago

Check out this header : User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0 It will fail with this. ;)

Matthew Thomas Over a year ago

You may have to ignore the second line as well (i.e. the Host field and value) in case the port number is explicitly included in the URL, i.e. use fields = fields[2:]; otherwise key,value = field.split(':') will throw an error.
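
A small check of the failure described in these comments, assuming a hypothetical Host value that carries a port. It suggests that limiting the split to the first colon keeps the Host line usable, so skipping that line may not be necessary.

```python
line = 'Host: localhost:8080'

# Without maxsplit the line yields three pieces, so two-name unpacking fails.
try:
    key, value = line.split(':')
except ValueError:
    unpack_failed = True
else:
    unpack_failed = False

# With maxsplit=1 only the first colon separates the name from the value.
key, value = line.split(':', 1)
print(unpack_failed, key, value.strip())
```

The same applies to header values that contain colons elsewhere, such as the rv:50.0 token in the Firefox User-Agent string quoted above.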

buherator · Accepted Answer · 2022-06-02 12:04:36Z

0

Here are some Python packages aimed at proper HTTP protocol parsing:

https://dpkt.readthedocs.io/en/latest/api/api_auto.html#module-dpkt.http
https://h11.readthedocs.io/en/latest/
https://github.com/benoitc/http-parser/ (C backend)
https://github.com/MagicStack/httptools (based on NodeJS's C backend)
https://github.com/silentsignal/netlib-offline (shameless plug)

answered Jun 2, 2022 at 12:04

buherator

1263 bronze badges
