MAMA: HTTP Headers

By Brian Wilson

Index:
  1. Introduction
  2. About MAMA's HTTP requests
  3. The HTTP response—general anatomy
  4. Most popular HTTP Header fields and other additional data
  5. HTTP protocols
  6. HTTP Header fields:
    1. Content-Type: MIME types and character sets
    2. Server
    3. Connection
    4. X-Powered-By
    5. Expires
    6. Cache-Control
    7. Vary
    8. SSL-Cipher
    9. The other HTTP Header fields

Introduction

In the beginning, there were only a few HTTP Headers with which MAMA was concerned (such as the Content-Type and Server fields). As feature requests for HTTP Header data accumulated over time, managing the results became more difficult. MAMA generally was not as concerned with what the HTTP Headers contained as with what the rest of the HTTP response had to say. At a certain point, the various checks that MAMA was doing on the HTTP Headers became too numerous, so the decision was made to store the entire HTTP Header in the database. This way, any new requests could be quickly completed locally without having to do an entirely fresh re-crawl of the entire MAMA URL set.

For this study, we will first look at the general shape and composition of the HTTP Headers MAMA encountered before looking at some of the results found for select individual HTTP Headers. Saarsoo's study is the only comparable large-scale study of HTTP Headers of which I am aware, and MAMA's discoveries will be compared with Saarsoo's data where possible.

About MAMA's HTTP Requests

An HTTP response is often heavily dependent on the original HTTP request. It is important to look at what MAMA is sending as its HTTP request before looking at the responses received. An original goal for MAMA was to mimic, as accurately as possible, what an Opera Web browser would encounter when surfing the net; this likely led to a coloring of some of the data returned. Servers can and do discriminate on the basis of User-Agent or other HTTP Header fields. The HTTP request headers used in this study are shown in figure 2-1 below. The biggest difference between Opera's HTTP request headers and MAMA's lies in the Accept-Encoding value. Opera can handle gzip, deflate and other encodings. This functionality was not added to MAMA in order to limit the coding and analysis effort MAMA needed to do for each URL.

NOTE: The Accept-Language and Accept-Charset values chosen reflect the author's own particular language bias.

Fig 2-1: MAMA's HTTP request headers
Header name Header value
Accept "text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1"
Accept-Charset "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1"
Accept-Encoding "identity, *;q=0"
Accept-Language "en"
Connection "Keep-Alive"
User-Agent "Opera/9.10 (Windows NT 5.1; U; en)"
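
As a rough illustration, the request headers in Fig 2-1 could be attached to a request as in the sketch below. It uses Python's standard urllib rather than the Perl LWP tooling MAMA used, the function name build_mama_style_request and the example URL are hypothetical, and the request is only constructed, not sent.

```python
from urllib.request import Request

# Header values copied from Fig 2-1.
MAMA_REQUEST_HEADERS = {
    "Accept": "text/html, application/xml;q=0.9, application/xhtml+xml, "
              "image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1",
    "Accept-Charset": "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1",
    "Accept-Encoding": "identity, *;q=0",
    "Accept-Language": "en",
    "Connection": "Keep-Alive",
    "User-Agent": "Opera/9.10 (Windows NT 5.1; U; en)",
}


def build_mama_style_request(url):
    """Build (but do not send) a request carrying the Fig 2-1 headers."""
    return Request(url, headers=MAMA_REQUEST_HEADERS)


req = build_mama_style_request("http://example.com/")
# urllib stores header names in capitalized form, e.g. "Accept-encoding".
print(req.get_header("Accept-encoding"))
```

The "identity, *;q=0" value is the part that asks servers not to apply gzip, deflate or any other content coding.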

The HTTP Response - general anatomy

HTTP/1.1 (RFC 2616) goes into great detail about the allowed configurations that an HTTP response can take on. In this section, only the HTTP Header block that comes before the main HTTP response body will be covered. In general terms, the header block consists of a "status line" followed by any number of newline-separated header field/value pairs. The status line contains important basic information about the entire HTTP transaction and has the following format:

[Protocol]/[HTTP Version] [HTTP Status Code] [HTTP Status reason text]

The format of the Header name/value pairs that follow is generally:

[Field Name (case-insensitive)]: [Field Value]
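
The two formats above can be sketched as a small parser. This is a minimal illustration, not MAMA's own code: the function name parse_header_block and the sample block are assumptions, and folded (multi-line) field values are not handled.

```python
def parse_header_block(block):
    """Split a raw header block into its status line parts and its fields."""
    lines = [line for line in block.splitlines() if line.strip()]

    # Status line: [Protocol]/[HTTP Version] [Status Code] [Reason text]
    protocol_version, _, rest = lines[0].partition(" ")
    protocol, _, version = protocol_version.partition("/")
    code, _, reason = rest.partition(" ")
    status = {"protocol": protocol, "version": version,
              "code": int(code), "reason": reason}

    # Field names are case-insensitive, so they are lowercased here.
    # A field may repeat, so each name maps to a list of values.
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return status, fields


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=ISO-8859-1\r\n"
    "Server: Apache\r\n"
)
status, fields = parse_header_block(sample)
print(status["version"], status["code"], fields["content-type"])
```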

In this section, we will look at which HTTP Headers were the most popular ones encountered.

Popularity contest

To start off our look at HTTP Headers, Fig 4-1 below is an abbreviated look at a frequency table of the most popular header fields found in MAMA's URL set (see also the more extensive per-URL frequency table). The astute reader may have already noticed that the caption in Fig 4-1 below reads, "Top 10...", but there are 13 values. This is because the first three field names—the ubiquitous ones prefixed by "Client-"—are generated in MAMA's process as a result of the usage of Perl's LWP module. These fields will be ignored in this study. As a result, the most frequent HTTP Header field name then becomes the expected Content-Type. With this adjustment, MAMA's frequency table generally agrees very closely with Saarsoo's study of HTTP Headers for the top values.

Fig 4-1: Top 10 HTTP Header field names
(Of 3,509,180 URLs analyzed)
Header name Frequency   Header name Frequency
Client-Date 3,509,180 Content-Length 2,534,203
Client-Response-Num 3,509,180 Last-Modified 2,129,100
Client-Peer 3,509,175 Content-Range 2,068,687
Content-Type 3,508,919 ETag 1,954,567
Date 3,504,603 Accept-Ranges 1,870,170
Server 3,465,179 X-Powered-By 1,348,347
Connection 2,851,099    
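
A per-URL frequency table like Fig 4-1 could be tallied along the following lines. This is a sketch under assumptions: the function name and sample data are hypothetical, and counting each field at most once per URL is a guess at how the per-URL table was built.

```python
from collections import Counter


def field_name_frequency(header_fields_per_url):
    """Count, per URL, which header field names occur at least once."""
    counts = Counter()
    for field_names in header_fields_per_url:
        # Lowercase because field names are case-insensitive; use a set so
        # that a repeated field counts only once per URL.
        unique = {name.lower() for name in field_names}
        # Skip the "Client-*" fields that LWP adds on the client side.
        counts.update(n for n in unique if not n.startswith("client-"))
    return counts


sample = [
    ["Client-Date", "Content-Type", "Date", "Server"],
    ["Client-Date", "Content-Type", "Server", "Set-Cookie", "Set-Cookie"],
]
freq = field_name_frequency(sample)
print(freq["content-type"], freq["set-cookie"])
```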

Other semi-random data about HTTP Headers

The most common number of HTTP Header fields encountered in this study was 12 (please see also the full frequency table)—actually 9, if you take into account the adjusted value due to the ignored "Client-*" headers. The overall length of the header had a fairly wide distribution, with the average length being 381 characters. The longest header block length encountered was 9,725 characters, found at http://www.studentenwerk.uni-freiburg.de/ in an apparently isolated case; the URL has an overabundance of "EACCELERATOR hit" header fields repeated 100 times! This is definitely not typical, and the Header block for that URL is not otherwise remarkable.

HTTP protocol versions

MAMA used a native Perl LWP method to get the protocol and version used in the URL's HTTP response. It then did a simple substring match for "1.1", "1.0" or "0.9" within that value. To allow for instances where some other version was detected, MAMA also had "unknown" as a fallback default value, but all HTTP responses fell within the three expected version types anyway. Almost half (50/104) of the HTTP/0.9 URLs were from galeon.com variants (ex: http://equipobarzamudio.galeon.com/). Whittling down duplicate domains from the HTTP/0.9 result, MAMA discovered only 42 unique servers using this HTTP protocol version (insert witty joke referencing Douglas Adams here).

Fig 5-1: HTTP protocol versions in MAMA
HTTP protocol Quantity
HTTP/1.1 3,451,169
HTTP/1.0 57,907
HTTP/0.9 104
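
The substring match with an "unknown" fallback described above amounts to something like this sketch (written in Python rather than MAMA's Perl; the function name is hypothetical):

```python
def classify_http_version(protocol_string):
    """Return "1.1", "1.0", "0.9", or "unknown" by simple substring match."""
    for version in ("1.1", "1.0", "0.9"):
        if version in protocol_string:
            return version
    return "unknown"


print(classify_http_version("HTTP/1.1"))
```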

HTTP Header fields

"Content-Type" HTTP Header field: MIME types and character sets

MIME types

Of the 3,509,180 URLs that MAMA analyzed, the vast majority (~99.9%) used a "text/html" MIME type (see full frequency table). "text/plain" and "application/xhtml+xml" types also had some occasional representation in the set (~1,000 cases each). Other values were also encountered, albeit very rarely—including some values that would clearly indicate they should not be analyzed by MAMA as markup (like "text/rtf"). It was not known in advance of performing the analysis whether some content could be masquerading using unexpected MIME types (such as HTML being served as "text/plain"), so MAMA was not as discerning in this area as it could have been. In the future, more will be done to filter out unprocessable MIME types from analysis, including checking file extensions for inappropriate types (there are 694 URLs still present in MAMA's set that have a ".txt" file extension, for example).

The Content-Type 'Charset' parameter

The character set used for a document can be specified in several ways. One of those ways is through the Content-Type HTTP Header field using a "charset" parameter (we'll look at the other methods in the Document encodings section). The HTTP Header line syntax for defining the character set via this method looks like this:

Content-Type: text/html; charset=ISO-8859-1

Although the Content-Type Header field was encountered in almost every single URL analyzed, the "charset" parameter was only detected in 688,819 (19.63%) of them. A look at the frequency table of the values for the charset parameter shows that the value is usually (~88% of the time) either "utf-8" or "iso-8859-1".
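
Pulling the MIME type and the optional "charset" parameter out of a Content-Type value might look like the sketch below. The function name is hypothetical, and splitting on ";" is a simplification that would mishandle quoted values containing semicolons.

```python
def parse_content_type(value):
    """Return (mime_type, charset) from a Content-Type field value.

    charset is None when the parameter is absent.
    """
    parts = [p.strip() for p in value.split(";")]
    mime_type = parts[0].lower()
    charset = None
    for param in parts[1:]:
        name, sep, param_value = param.partition("=")
        # The parameter name is case-insensitive.
        if sep and name.strip().lower() == "charset":
            charset = param_value.strip().strip('"').lower()
    return mime_type, charset


print(parse_content_type("text/html; charset=ISO-8859-1"))
```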

Silly HTTP Header pet tricks - Case-sensitivity of the 'Charset' keyword

A recent question from a co-worker prompted MAMA to look at this minor point: although HTTP/1.1 defines the media type "charset" parameter as being case-insensitive, what sort of capitalization is actually found on the Web? All-lowercase usage dominates, accounting for 99.2% of the cases. Answers big and small, MAMA can provide them all!

Fig 6A-1: Capitalization of the "charset" parameter name
"Charset" capitalization Quantity
charset 683,265
Charset 5,487
CHARSET 66
CharSet 1
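
A tally like Fig 6A-1 needs the parameter name exactly as it was written, so a case-insensitive match that keeps the original text is one way to get it. The names and sample values below are hypothetical.

```python
import re
from collections import Counter

# Matches the parameter name as written, whatever its capitalization.
CHARSET_NAME = re.compile(r";\s*(charset)\s*=", re.IGNORECASE)


def charset_capitalization(content_type_values):
    """Tally how the "charset" parameter name is capitalized."""
    tally = Counter()
    for value in content_type_values:
        match = CHARSET_NAME.search(value)
        if match:
            tally[match.group(1)] += 1
    return tally


tally = charset_capitalization([
    "text/html; charset=utf-8",
    "text/html; Charset=ISO-8859-1",
    "text/html;CHARSET=utf-8",
    "text/html",
])
print(tally)
```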

"Server" HTTP Header field

This field contains information about the Web server used to serve the HTTP request. MAMA again used a built-in Perl LWP method to detect this field. The value of the Server HTTP Header field is expected to be the same for all pages from that server, so rather than look at results on a per-URL basis, it is more instructive to look at per-domain results.

In the brief summary below (Fig 6B-1), notice the obvious (and expected) dominance of the Apache and IIS Web servers. In fact, in the full per-domain and per-URL unique value frequency tables for the HTTP Header Server field, all the values in the top ten are either Apache or IIS related. Apache is represented in a whopping 2,011,088 (67.72%) of domains in MAMA while IIS is used in 769,375 (25.91%) domains. The popularity ranking of the Web servers mentioned below is very similar to Saarsoo's study.

Fig 6B-1: Popular HTTP Header "Server" field Values
[Out of 2,969,738 domains]
Server substring value Quantity   Server substring value Quantity
Apache 2,011,088 Lotus-Domino 6,609
IIS 769,375 IBM 6,289
[Empty] 42,261 IdeaWebServer 5,912
Zeus 21,314 ZServer 4,990
Squeegit 20,569 WebServerX 4,667
Rapidsite 18,876 Sun-ONE-Web 4,427
NOYB 9,746 SX_Spectrum 4,088
GFE 9,734 NCSA 109
Netscape 7,926    

"Server" oddities

The count for Apache also includes an additional Server string added for completeness: "a p a c h e" (2,616 times). The results in Fig 6B-1 are also slightly muddied by an odd value: 86 domains had a Server header field value containing both "IIS" and "Apache" (in every case the string "Apache-IIS/5.0"), from domains such as http://www.longevitytea.com/ and http://letspartyrental.com/. The value is undoubtedly a spoof or hoax, since one would expect a real hybrid of the dominant Web servers to be a bit more popular. Lastly, notice also that a fair number of servers use the Server HTTP Header field to rebuff our desire for deeper knowledge of the Web servers that are in use: 9,746 domains tell us that it is "NOYB" (None Of Your Business) ... the nerve!
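
Substring detection of the kind behind Fig 6B-1 can be sketched as follows. The substring list is only a subset of the table, case-insensitive matching is an assumption, and the versioned sample strings other than "Apache-IIS/5.0" are illustrative. Returning every match makes it visible that a value such as "Apache-IIS/5.0" lands in two buckets at once.

```python
# A few substrings from Fig 6B-1; case-insensitive matching is an assumption.
SERVER_SUBSTRINGS = ["Apache", "IIS", "Zeus", "Netscape", "Lotus-Domino"]


def classify_server(server_value):
    """Return every known substring found in a Server field value."""
    if not server_value.strip():
        return ["[Empty]"]
    lowered = server_value.lower()
    return [s for s in SERVER_SUBSTRINGS if s.lower() in lowered]


print(classify_server("Apache-IIS/5.0"))
```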

"Connection" HTTP Header field

The Connection field specifies options that are to be used for the current particular HTTP connection. In MAMA's analysis, the value is almost always "close" (~98.5%) when it is present. This result is very different from Saarsoo's research, where "close" was actually a minority value (~41.3%) compared to the dominant value of "keep-alive" (~58.7%). A number of factors may have influenced this, with the most likely culprit being the facilities used in the respective studies to fetch the URLs: the Perl LWP module in this MAMA study and GNU Wget in Saarsoo's case. For now we will let this discrepancy stand, but the issue may be interesting to revisit in the future.

Fig 6C-1: HTTP Header "Connection" field values
Value Frequency
close 2,806,105
keep-alive 39,804
transfer-encoding 2,970
keep-alive, close 1,857
keep-alive, timeout=50, maxreq=60 203
keep-alive, te, close 143
persistent 7

Somebody get that kid a dictionary

In the list of HTTP Header fields encountered (Fig 4-1 above), some variations are noticeable. Often these variations are misspellings, but it is difficult to know whether these are deliberate or not. The HTTP Header field with the most variations and frequency was definitely the Connection field. Some of the misspellings are so demonstrably wrong that one wonders how they could survive even the simplest of inspections, but 13,764 occurrences of "Cneonction" seems far too high to be an accident. Fig 6C-2 below shows the strange and slightly bizarre list of Connection header misspellings.

Fig 6C-2: Misspellings of the "Connection" HTTP Header field
Misspelling Frequency
Cneonction 13,764
NnCoection 8,569
X-Cnection 1,332
Xonnection 135
-Onnection 82
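
One small observation that a check like the following makes visible: the two most frequent misspellings in Fig 6C-2 contain exactly the same letters as "Connection", only reordered, while the other three do not. The helper name is hypothetical, and the sketch says nothing about why the rearrangement occurs.

```python
def is_letter_scramble(field_name, canonical="connection"):
    """True if field_name uses exactly the letters of canonical, reordered."""
    name = field_name.lower()
    return name != canonical and sorted(name) == sorted(canonical)


misspellings = ["Cneonction", "NnCoection", "X-Cnection", "Xonnection", "-Onnection"]
scrambles = [m for m in misspellings if is_letter_scramble(m)]
print(scrambles)  # ['Cneonction', 'NnCoection']
```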

"X-Powered-By" HTTP Header field

This is a common Header extension field used to identify the Web server pre-processing engine in use (if any). ASP and PHP dominate this field (you have to go down past the 20th position in the popularity frequency table to find a value that does not contain either ASP or PHP). Combined, the various ASP and PHP values comprise 98.2% of all X-Powered-By values. PHP is the most diversified of the values in use, with about 450 of the 750 values (~60%!) in the frequency table being unique PHP flavors. Finally, let us pause a moment and contemplate all the hard work that the fine folk mentioned in the 6th position put into all of this...

...OK, that is enough. Back to the analysis.

Fig 6D-1: X-Powered-By substrings detected
Substring detected Frequency
ASP 720,386
PHP 603,590
pleskwin 7,836
modlayout 5,979
servlet 2,223
"the blood, sweat and tears of the fine, fine textdrive staff" 633
zend 496

"Expires" HTTP Header field

The Expires header documents a "best before" date. Unlike with food products, an expired date does not necessarily mean that the resource has changed or disappeared. The field is used to give the date and time after which the content is considered "stale"— proxy servers need to be especially mindful of this value to prevent old cached content from being passed on to an end-user instead of fresher content from the originating source. Other than some extremes and error cases, this field is somewhat tedious to sift through—as you can see from the full Expires frequency table, the values are mostly simple dates.

Those who cannot remember the past are condemned to repeat it (George Santayana)

Looking closer at the proper format for the Expires field in HTTP 1.1 (RFC 2616), MAMA uncovers quite a few transgressions:

"[It] MUST be in RFC 1123 date format, such as: 'Tue, 26 Oct 1999 19:00:00 GMT'...HTTP/1.1 clients and caches MUST treat other invalid date formats, especially including the value '0', as in the past (i.e., 'already expired')"

Not only are values of "0" for the date used, but also more extreme values like "now", "never", "-1", "-1d" and "-10000". Values in the past generally don't go further back than the UNIX origin date favorite of "01 Jan 1970", but the occasional URL makes a foray in the time machine back to the turn of the last century (1900). An enterprising group of 27 URLs made the jump back to Bastille Day ("14 Jul, 1789") for their expiration; it might be entertaining to double-check whether those were all French URLs. 10 URLs authoritatively stated they are expired (and probably mummified) by using an expiry of "01 jan 0001".
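
The rule quoted above, that invalid values are to be treated as already expired, can be sketched as below. This is an illustration only: it relies on Python's email.utils date parser, which accepts the RFC 1123 style shown in the quote but is not a strict HTTP-date validator, and the function name and the reference time are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_expired(expires_value, now):
    """Treat unparseable Expires values as already expired."""
    try:
        expires = parsedate_to_datetime(expires_value.strip())
    except (TypeError, ValueError):
        # Covers values such as "0", "-1", "now" and "never".
        return True
    if expires.tzinfo is None:
        # No usable zone information: assume UTC.
        expires = expires.replace(tzinfo=timezone.utc)
    return expires <= now


# Arbitrary reference time for the example.
now = datetime(2008, 1, 1, tzinfo=timezone.utc)
print(is_expired("0", now), is_expired("Tue, 26 Oct 1999 19:00:00 GMT", now))
```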

Back to the future!

Going in the other direction, MAMA also discovered many Expires dates beyond the MAMA analysis timeframe. HTTP 1.1 (RFC 2616) has an interesting comment on future expiries:

"To mark a response as 'never expires,' an origin server sends an Expires date approximately one year from the time the response is sent. HTTP/1.1 servers SHOULD NOT send Expires dates more than one year in the future."

Contrary to this proviso, a number of URLs (92) jump forward into the future by more than the recommended single year. URLs with expiries set clearly in the future used a smattering of dates between 2010 and 2035, quite a bit further forward in time than what is suggested. According to RFC 2119 (a document about the wording of requirements in RFCs), that pesky, previously mentioned "SHOULD" terminology indicates that the future date values MAMA encountered are permissible if the creator knows what they are doing. One hopes that the creators of four URLs in MAMA know what they are doing when they set their Expires field in the year 2999! What would the Web even look like then?!
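
Checking an already-parsed Expires date against the "approximately one year" recommendation is simple date arithmetic. In this sketch the 365-day threshold, the function name and the example send date are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# "Approximately one year", per the quoted text.
ONE_YEAR = timedelta(days=365)


def exceeds_recommended_horizon(expires, sent):
    """True if an Expires date lies more than about a year after sending."""
    return expires - sent > ONE_YEAR


# Arbitrary send date for the example.
sent = datetime(2008, 1, 1, tzinfo=timezone.utc)
far = datetime(2999, 1, 1, tzinfo=timezone.utc)
near = datetime(2008, 6, 1, tzinfo=timezone.utc)
print(exceeds_recommended_horizon(far, sent), exceeds_recommended_horizon(near, sent))
```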

"Cache-Control" HTTP Header field

Field-component popularity

This field communicates information used to override normal caching strategies employed by proxies or clients. The value is a comma-separated list of directives that are relevant when deciding the caching status for a document. MAMA's raw per-URL frequency table for this field is a list of unique compound values, but that does not really reveal the popularity of the sub-components very well. The table below (Fig 6F-1) goes further by showing the most popular frequencies of the components referenced in each "Cache-Control" header value from the full frequency table.

Fig 6F-1: "Cache-Control" value component frequency table
Value component Frequency   Value Component Frequency
private 356,826 public 8,409
no-cache 245,761 s-maxage 3,244
pre-check 219,069 cache 2,986
post-check 218,957 proxy-revalidate 2,943
must-revalidate 196,645 store 255
no-store 180,853 no-transform 170
max-age 85,695    
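
Breaking compound Cache-Control values into components, as Fig 6F-1 does, could be done roughly like this. The function names and sample values are hypothetical, counting a component once per value is an assumption, and the plain comma split would mishandle quoted arguments that themselves contain commas.

```python
from collections import Counter


def cache_control_components(value):
    """Return the lowercased component names in one Cache-Control value."""
    names = []
    for item in value.split(","):
        # Drop any "=argument" part, e.g. "max-age=3600" -> "max-age".
        name = item.partition("=")[0].strip().lower()
        if name:
            names.append(name)
    return names


def component_frequency(values):
    """Count in how many Cache-Control values each component appears."""
    counts = Counter()
    for value in values:
        counts.update(set(cache_control_components(value)))
    return counts


cc_counts = component_frequency([
    "private",
    "no-cache, must-revalidate",
    "no-store, no-cache, must-revalidate, post-check=0, pre-check=0",
    "max-age=3600",
])
print(cc_counts["no-cache"], cc_counts["max-age"])
```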

It is just a matter of time

One thing that stands out when looking at the complete Cache-Control frequency table is the wide variety of Max-age time values. Since the Max-age value should take precedence over any Expires header value, it can be informative to look closer at the times represented. In the distribution table below (Fig 6F-2), noticeable spikes are apparent. Other values were also detected, but their frequencies were below the chosen threshold.

Fig 6F-2: Cache-Control "Max-age" values
Max-age value (sec) Frequency   Max-age value (sec) Frequency
0 33,882 10800 (3 hr) 408
1 2,188 14400 (4 hr) 238
10 508 18000 (5 hr) 347
20 579 21600 (6 hr) 1,091
30 305 28800 (8 hr) 318
60 (1 min) 11,553 43200 (12 hr) 297
120 (2 min) 676 86400 (1 day) 5,546
300 (5 min) 2,224 172800 (2 day) 262
600 (10 min) 9,848 259200 (3 day) 482
900 (15 min) 808 432000 (5 day) 384
1200 (20 min) 261 604800 (7 day) 665
1800 (30 min) 1,295 864000 (10 day) 200
3600 (1 hr) 3,735 1209600 (14 day) 651
7200 (2 hr) 1,773 2592000 (1 month) 715
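
The precedence of Max-age over Expires mentioned above can be illustrated with a small freshness calculation. This is a simplified sketch, not a full cache implementation: the function name is hypothetical, and the Expires and Date values are assumed to be parsed into datetimes already.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches a max-age component at the start of the value or after a comma.
MAX_AGE = re.compile(r"(?:^|,)\s*max-age\s*=\s*\"?(\d+)", re.IGNORECASE)


def freshness_lifetime(cache_control, expires, date):
    """Seconds a response stays fresh; max-age wins over Expires.

    cache_control is the raw field value or None; expires and date are
    already-parsed datetimes or None. Returns None if neither applies.
    """
    if cache_control:
        match = MAX_AGE.search(cache_control)
        if match:
            return int(match.group(1))
    if expires is not None and date is not None:
        return max(0, int((expires - date).total_seconds()))
    return None


date = datetime(2008, 1, 1, tzinfo=timezone.utc)  # arbitrary example date
expires = date + timedelta(hours=1)
print(freshness_lifetime("max-age=60", expires, date))  # 60
```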

"Vary" HTTP Header field

This field consists of a comma-separated list of other header fields that are used to determine:

"... whether a cache is permitted to use the response to reply to a subsequent request without revalidation. For uncacheable or stale responses, the Vary field value advises the user agent about the criteria that were used to select the representation."

The full per-URL frequency table of unique values found for this field is not very extensive, but a quick summary is still useful. Notice that "accept-encoding" is the dominant value here.

Fig 6G-1: "Vary" value component frequency table
Value component Frequency
accept-encoding 119,130
host 47,727
user-agent 41,469
* 36,485
cookie 8,845
accept-language 1,267
nfinfo 568
negotiate 428
x-forwarded-host 421
referer 48
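
To make the role of Vary concrete, the sketch below builds a cache key from the request fields that a Vary value names. It is a simplified model with hypothetical names and example values; real caches apply further normalization.

```python
def vary_cache_key(url, request_headers, vary_value):
    """Build a cache key from the request fields named by Vary.

    Returns None for "Vary: *", where a stored response cannot be matched
    to a later request from the request headers alone.
    """
    names = [n.strip().lower() for n in vary_value.split(",") if n.strip()]
    if "*" in names:
        return None
    lowered = {k.lower(): v for k, v in request_headers.items()}
    selected = tuple((n, lowered.get(n, "")) for n in sorted(names))
    return (url, selected)


url = "http://example.com/"
key_a = vary_cache_key(url, {"Accept-Encoding": "identity", "User-Agent": "A"}, "Accept-Encoding")
key_b = vary_cache_key(url, {"Accept-Encoding": "identity", "User-Agent": "B"}, "Accept-Encoding")
print(key_a == key_b)  # True: User-Agent is not named by Vary here
```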

"SSL-Cipher" HTTP Header field

Popularity: a small sample space, but more to come

In the URL set that MAMA analyzed, there were relatively few URLs using the HTTPS protocol—only 4,994. MAMA detects all SSL-Cipher HTTP Header fields due to a past request from an Opera developer to discover what cipher types are in use, but MAMA's sample space is not overly large. To satisfy the original request fully, a much deeper study of HTTPS domains was performed and will be presented as an adjunct to this study at a later time. As a consequence, I will not go into great depth about this header field here.

The SSL settings are expected to be uniform across a given Web server, so the focus here is SSL-Cipher values on a per-domain basis (full frequency table). The 4,994 HTTPS URLs are from 4,355 unique domains. Among these domains, two cipher strings are rather evenly dominant: "RC4-MD5" and "DHE-RSA-AES256-SHA". Other values also occurred, although with far lower frequency.

Fig 6H-1: SSL Cipher types
Cipher types Frequency Percentage (HTTPS URLs)
RC4-MD5 1,729 39.70%
DHE-RSA-AES256-SHA 1,674 38.44%
Other 953 21.88%

HTTPS headers from an HTTP URL?

One would expect that only HTTPS URLs would deliver an SSL-Cipher header, but that is not always the case. Some 4,994 URLs used the HTTPS protocol, but in fact 4,997 URLs had a non-empty SSL-Cipher header field—three non-HTTPS servers were sending the header as well.

Fig 6H-2: HTTP Sites sending SSL headers
URL
http://www.nfc.usda.gov/
http://www.pbefcu.com/
http://www.simap.ch/
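
Spotting this kind of mismatch is a matter of comparing the URL scheme with the presence of the field. In the sketch below the function name, the record layout and the sample records are hypothetical and are not taken from MAMA's data.

```python
from urllib.parse import urlsplit


def unexpected_ssl_cipher(records):
    """Return URLs that are not HTTPS yet carry a non-empty SSL-Cipher field.

    records is an iterable of (url, fields) pairs, where fields maps
    lowercased header names to values.
    """
    flagged = []
    for url, fields in records:
        scheme = urlsplit(url).scheme.lower()
        if scheme != "https" and fields.get("ssl-cipher", "").strip():
            flagged.append(url)
    return flagged


records = [
    ("https://secure.example.com/", {"ssl-cipher": "RC4-MD5"}),
    ("http://plain.example.com/", {"ssl-cipher": "RC4-MD5"}),
    ("http://other.example.com/", {"server": "Apache"}),
]
flagged = unexpected_ssl_cipher(records)
print(flagged)  # ['http://plain.example.com/']
```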

The other HTTP Header fields

Several of the other popular HTTP Headers were also analyzed but did not yield much in the way of interesting trends to present here. For instance, when the Client-Transfer-Encoding header was actually used, it only yielded the value "chunked". The values for Content-Language have a fair amount of variety but are not otherwise very remarkable. Other fields like Content-Length, ETag, and Set-Cookie produced so many unique random values that there was no point in searching for trends. For some fields that had small result sets, like the Pragma header, the only thing that stood out was how dominant a single value was ("no-cache": 98% of the time). The Pragma field also demonstrated the same sort of curiosity found in fields like the Connection header: the specification writers' original choice of names was sometimes unfortunate, in that some keywords are easily misspelled, such as "cache" giving way to semi-frequent variations like "chache", "cashe" and "cahce".

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
