MAMA: Document Encodings
- Previous article—MAMA: Basic document structure
- Next article—MAMA: Character entities
- Table of contents
Introduction
A critical part of rendering a document lies in the browser discerning the proper character set encoding. MAMA tracked most of the methods that can be used to detect the encoding, but it did not attempt to declare a "One, True Encoding" in cases where discrepancies existed. In this section we will examine the usage of three of the main encoding sources, how often they overlapped, and whether they agreed with each other when they did.
How encodings are specified
The HTTP headers can provide the encoding through the "charset" parameter of the Content-Type field. The Content-Type (and consequently the "charset" parameter) can also be specified in HTML via the META element, and XML documents have an additional means to signal the document's encoding: the "encoding" attribute of the XML declaration in the prolog. The encoding was specified using at least one of these methods in 2,626,228 URLs from MAMA (74.84%). As you can see, authors show a great preference for using the META element to specify a document's encoding over the other two methods.
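To make the three sources concrete, here is a minimal Python sketch that pulls the declared encoding out of each one. The function names and the simplified regular expressions are assumptions made for illustration; they are not MAMA's actual detection logic, and a real browser uses a far more forgiving parser.

```python
import re

# Hypothetical helpers for illustration only; the patterns are deliberately
# simplified and do not reproduce MAMA's or any browser's real detection rules.

def charset_from_http(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r'charset\s*=\s*["\']?([^\s;"\']+)', content_type, re.IGNORECASE)
    return match.group(1) if match else None

def charset_from_meta(html):
    """Return the charset from a META http-equiv Content-Type element, or None."""
    meta = re.search(
        r'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*>',
        html, re.IGNORECASE)
    if not meta:
        return None
    # The content attribute carries an ordinary Content-Type value.
    return charset_from_http(meta.group(0))

def encoding_from_xml(document):
    """Return the encoding attribute of a leading XML declaration, or None."""
    match = re.match(r'\s*<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']', document)
    return match.group(1) if match else None
```

For example, `charset_from_http("text/html; charset=ISO-8859-1")` returns `"ISO-8859-1"`, and each helper returns `None` when its source does not declare an encoding.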
Note: Region sizes are not to scale
Can we agree to disagree?
Having multiple encoding sources can quickly become a can of worms: what happens when those encodings do not agree with each other? Nothing is worse than a Web page having a schizophrenic argument with itself about its encoding identity. To compare encodings, the various values were all forced to lowercase and leading/trailing spaces were removed. No other normalization was applied, so encoding variations like "iso-8859-1", "iso_8859_1" and "iso 8859-1" would all be considered different values using this scheme. The results of this comparison show that in the majority of encoding overlap situations, the values agree (72.96% of all overlap scenarios). However, values are expected to agree; another (negative) way to frame the results is that 133,968 URLs (27.04% of the overlap scenarios) have specified multiple encodings that clash with each other. So in at least a quarter of the cases where a browser has more than one encoding source, it must resort to torturous gymnastics to determine the outcome.
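The comparison rule described above (lowercase, trim surrounding spaces, nothing else) can be sketched in a few lines of Python. The names `normalize` and `encodings_agree` are hypothetical, and `None` is assumed here to stand for a source that declares no encoding.

```python
def normalize(value):
    """Lowercase the value and strip leading/trailing whitespace."""
    return value.strip().lower()

def encodings_agree(*declared):
    """True when all declared encodings normalize to the same value.

    Sources that declare nothing are passed as None and ignored.
    """
    present = {normalize(v) for v in declared if v is not None}
    return len(present) <= 1
```

Under this scheme `encodings_agree(" UTF-8", "utf-8 ")` is true, while `encodings_agree("iso-8859-1", "iso_8859_1")` is false, because separators are not canonicalized.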
Encoding specification method | Total quantity | Encodings agree | Percentage agrees |
---|---|---|---|
HTTP Header and META only | 417,113 | 293,868 | 70.45% |
META and XML only | 49,115 | 45,029 | 91.68% |
HTTP Header and XML only | 6,791 | 4,553 | 67.04% |
All three: HTTP Header, META and XML | 22,500 | 18,101 | 80.45% |
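The aggregate figures quoted in the text follow directly from the table rows. As a quick check, the Python sketch below sums the rows and recomputes the overall agreement and clash rates; the variable names are illustrative only.

```python
# (total URLs, URLs whose encodings agree) for each overlap scenario,
# copied from the table above.
rows = {
    "HTTP Header and META only": (417113, 293868),
    "META and XML only": (49115, 45029),
    "HTTP Header and XML only": (6791, 4553),
    "All three: HTTP Header, META and XML": (22500, 18101),
}

total = sum(t for t, _ in rows.values())   # all overlap scenarios
agree = sum(a for _, a in rows.values())   # overlaps whose values agree
clash = total - agree                      # overlaps whose values differ

print(f"overlaps: {total:,}")
print(f"agree:    {agree:,} ({agree / total:.2%})")
print(f"clash:    {clash:,} ({clash / total:.2%})")
```

The totals come to 495,519 overlapping URLs, of which 361,551 agree (72.96%) and 133,968 clash (27.04%), matching the figures given in the text.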
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.