In part one we created a super-simple web server capable of handling a simple subset of HTTP requests. In this post, we'll flesh out our web server a bit more, adding support for header parsing, content negotiation, and cookies.
Common request headers
Recall that HTTP headers are Key => Value pairs that contain "meta-information" about the request or response they are sent on. Just as with HTTP methods, some header fields are used more often than others. Three in particular are included in almost every request:
- User-Agent
- Host
- Accept
Let's examine these in a bit more detail.
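To get a feel for what these look like on the wire, here's roughly the header block a browser might send to our server; the exact values vary from client to client, so treat this as an illustration rather than literal output:

```
GET /hello.txt HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 ...
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
```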
User-Agent
The User-Agent header is used to give the server information about the agent a user (or system) is using to make the request. In normal web browsing, the agent is your browser itself; the browser makes HTTP requests on your behalf. For programmatic HTTP requests, the User-Agent is normally the name of whatever HTTP library the application is using. If we use the program `curl` to make a request to our server (start the server by typing `python <name_of_script.py> 8080 /tmp` and make sure you have a file named `hello.txt` in your `/tmp` directory) by typing `curl --verbose localhost:8080/hello.txt`, we see the User-Agent is reported as `curl/7.30.0`. This indicates to the server that the request was made by a client using version 7.30.0 of `curl` (or `libcurl`, the `curl` library).
This information is especially useful when the request is sent by a browser. When I point Chrome at `localhost:8080/hello.txt` and ask our server to print out the User-Agent header (more on how to do that in a bit), I get the following string:

`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36`
You can see that not only is the type of browser I'm using sent, but the rendering library ("AppleWebKit") and OS are also specified. Since different versions of different browsers support different HTML/CSS features, a web application may use this information to modify a page so that it renders properly on the client's browser, regardless of which browser they're using.
Host
The Host header is used to let the server know what address the request is bound for (i.e. the IP address or Fully Qualified Domain Name and port). This is mostly used by HTTP proxies to determine how to route the message and isn't of much interest to us, aside from the fact that this header is required on all HTTP/1.1 requests.
Accept
The Accept header is used to signal to the server what sort of media types are acceptable for the response. In this way, the client can issue requests that are limited to a small set of media types. If this header isn't present (it almost always is, however), the client is assumed to accept all media types.
Choosing the appropriate representation for a resource with multiple representations (differing both in format, like JSON vs. HTML, and in content, like different languages) is called Content Negotiation. If the server chooses a representation, it's performing Server-driven Negotiation. This is most often the case, since clients have a number of header fields to describe their capabilities and preferences.
Chief among these is the Accept header. By specifying "application/json", for example, in the Accept header of a request for a resource that has both HTML and JSON representations, I'm essentially forcing the server to send me the JSON representation (since that's all I accept, as far as the server knows). Content Negotiation is most important when dealing with REST APIs, as it's important that programmatic access to a resource receive the proper representation (like JSON).
If, after examining the Accept header, the server determines it cannot fulfill the request (because, while the resource exists, it doesn't exist in any of the media types specified), an error is returned. Specifically, HTTP error code 406, "Not Acceptable" is returned.
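To make this concrete, here's a small sketch of server-driven negotiation in Python; the function name and the exact matching rules are illustrative assumptions, not something our server does yet:

```python
def choose_representation(accept_header, available):
    """Pick the first media type the client lists in Accept that we can
    actually produce; return None if nothing matches (-> 406 Not Acceptable)."""
    # Drop any ';q=...' parameters and normalize whitespace.
    accepted = [part.split(';')[0].strip() for part in accept_header.split(',')]
    for media_type in accepted:
        if media_type == '*/*':
            return available[0]
        if media_type in available:
            return media_type
    return None


print(choose_representation('application/json', ['text/html', 'application/json']))
# -> application/json  (the only type the client accepts)

print(choose_representation('application/xml', ['text/html', 'application/json']))
# -> None  (nothing matches, so we'd answer with 406 Not Acceptable)
```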
Parsing Headers
Let's augment our initial implementation to properly parse headers and perform rudimentary content negotiation. Take a look at the following updated version of the code:
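What follows is a minimal sketch of such a server. The `parse_headers` function and the naive `Accept` check mirror the discussion below; the socket plumbing, variable names, and error responses are assumptions rather than a verbatim copy of the original listing:

```python
import os
import socket
import sys


def parse_headers(header_lines):
    """Split each line after the start-line on ':'; the left side is the
    header field's name, everything else is the value."""
    headers = {}
    for line in header_lines:
        if not line.strip():
            break  # a blank line marks the end of the headers
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers


def is_acceptable(file_path, accept_value):
    """Naive content negotiation: '*/*' means anything goes; otherwise check
    whether the requested file's extension appears in one of the Accept values."""
    if '*/*' in accept_value:
        return True
    extension = os.path.splitext(file_path)[1].lstrip('.')
    return any(extension in media_type for media_type in accept_value.split(','))


def handle_request(raw_request, document_root):
    lines = raw_request.splitlines()
    method, uri, version = lines[0].split()
    headers = parse_headers(lines[1:])
    file_path = os.path.join(document_root, uri.lstrip('/'))

    if not os.path.isfile(file_path):
        return 'HTTP/1.1 404 Not Found\r\n\r\n'
    if not is_acceptable(file_path, headers.get('Accept', '*/*')):
        return 'HTTP/1.1 406 Not Acceptable\r\n\r\n'
    with open(file_path) as f:
        return 'HTTP/1.1 200 OK\r\n\r\n' + f.read()


def main():
    # Usage: python <name_of_script.py> <port> <document_root>
    port, document_root = int(sys.argv[1]), sys.argv[2]
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(('', port))
    listener.listen(1)
    while True:
        connection, _ = listener.accept()
        request = connection.recv(4096).decode('utf-8')
        if request:
            response = handle_request(request, document_root)
            connection.sendall(response.encode('utf-8'))
        connection.close()


if __name__ == '__main__':
    main()
```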
There are two main changes to the code: a portion which parses header fields and another that does simple content negotiation. The `parse_headers` function splits each line after the start-line on the `:` character. To the left is the name of the header field. Everything else is the value.

With the headers parsed, we can move on to content negotiation. In this simplified version, we check to see if the extension of the file requested is found within one of the values in the Accept header. `*/*` is interpreted as "anything", while `type/*` is meant to mean "all forms of this type". We handle the former (because it's easy) while ignoring the latter for the moment.
If you add an `.html` file to `/tmp` and request it, you should see it properly returned. If, however, you manually set the Accept header to something like `application/json`, you'll get back a `406` error, as expected.
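You can test this from the command line with `curl`; for example, assuming you named the file `hello.html` (verbose output trimmed to the response status line):

```
$ curl --verbose --header "Accept: application/json" localhost:8080/hello.html
...
< HTTP/1.1 406 Not Acceptable
```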
Cookies
Cookies are Key => Value pairs that the server sets on the client. If the client supports cookies, any cookies set on the first request will be sent along with subsequent requests. For example, let's track if a user has visited the site before and print out a message if they have (note I'm changing from using `curl` to `httpie` here, as the latter has a much nicer interface for all of this).
How can we augment our server to handle this? It's a two-step process. First, the server must set the cookie on the client by sending a Set-Cookie header with the name and value of the cookie. Second, on subsequent requests, the server must recognize the cookie as one that it set. An example will clear things up a bit.
Imagine I change our server to add `Set-Cookie: HasVisited = 1` to the response headers. If I use Chrome to connect to our server, the cookie will get set within Chrome on our first visit. The next time we use Chrome to visit the page, Chrome sends its cookies in the `Cookie:` header field. We can parse that and determine if `HasVisited = 1`, at which point we know that this user has visited before.
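At the protocol level, the exchange looks roughly like this (all other headers omitted). The first response from our server carries the cookie:

```
HTTP/1.1 200 OK
Set-Cookie: HasVisited = 1
```

And Chrome's next request sends it back:

```
GET /hello.txt HTTP/1.1
Cookie: HasVisited = 1
```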
Here's a quick and dirty way to accomplish that:
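Below is a rough sketch of how the request handler from the earlier listing might grow to do this, reusing `os`, `parse_headers`, and `is_acceptable` from that sketch. The `HasVisited` cookie comes straight from the example above; the helper name and the "Welcome back!" message are illustrative assumptions:

```python
def parse_cookies(headers):
    """Turn a 'Cookie: name=value; other=value' header into a dict."""
    cookies = {}
    for pair in headers.get('Cookie', '').split(';'):
        name, _, value = pair.partition('=')
        if value:
            cookies[name.strip()] = value.strip()
    return cookies


def handle_request(raw_request, document_root):
    lines = raw_request.splitlines()
    method, uri, version = lines[0].split()
    headers = parse_headers(lines[1:])   # from the earlier sketch
    cookies = parse_cookies(headers)
    file_path = os.path.join(document_root, uri.lstrip('/'))

    if not os.path.isfile(file_path):
        return 'HTTP/1.1 404 Not Found\r\n\r\n'
    if not is_acceptable(file_path, headers.get('Accept', '*/*')):  # earlier sketch
        return 'HTTP/1.1 406 Not Acceptable\r\n\r\n'

    with open(file_path) as f:
        body = f.read()

    if cookies.get('HasVisited') == '1':
        # The client sent our cookie back, so we know it has been here before.
        body = 'Welcome back!\n' + body
        extra_headers = ''
    else:
        # First visit: set the cookie so we can recognize the client next time.
        extra_headers = 'Set-Cookie: HasVisited=1\r\n'

    return 'HTTP/1.1 200 OK\r\n' + extra_headers + '\r\n' + body
```

With `httpie`, sending the cookie by hand via `http localhost:8080/hello.txt Cookie:HasVisited=1` should trigger the greeting.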
As you can see, it's a two-step process: first we set the cookie on the response header, then we recognize the cookie on the request header. Believe it or not, that's all there is to cookies. The powerful pieces of magic that allow you to stay logged in to certain sites are just cookies set by the site's web server. Of course, they set and check them in a more robust way, but the theory is the same.
Summary
In part two, we added the ability to parse headers, perform content negotiation, and set/get cookies. Though the code could use a bit of clean-up, it's still only 74 lines of code. In the next post in this series, we'll touch on the topic of authentication and caching, as well as perform some much-needed refactoring of the code. Until then, have fun playing around with your new web server!