MAMA: Markup report, part 1: the basics

By Brian Wilson

Introduction

Our first look at markup in MAMA's URL set will begin with the basics. We will look at some of the housekeeping concerns of every document: how they are encoded, how big documents are, how much of the document is actual markup, doctype usage, etc. We will also give a quick overview of the most popular elements and attributes before we start digging deeper into those sub-topics. This overview should give you an idea of what markup documents look like in the broadest sense.

For a deeper look at these areas and more, the following MAMA article topics are also available this week:

Document Encodings

Before a document's content can be examined, its encoding must be determined. The biggest trouble with specifying HTML encoding is that there are so many ways to do it. A document may specify none, one, or even ALL of the different methods. And, if there is disagreement at any of these levels of twirled spaghetti, a precarious dance must ensue.

MAMA tracked the specific encoding values from three primary locations:

The "charset" parameter of the HTTP header:
ex: Content-Type: text/html; charset=ISO-8859-1
The "charset" parameter for the Content-Type via the META element:
ex: <META Http-equiv="Content-Type" Content="text/html; charset=Shift-JIS">
The XML encoding:
ex: <?xml version="1.0" encoding="utf-8"?>

The HTML encoding was specified using at least one of these methods in 2,626,228 of the URLs MAMA examined (74.8%). Of these, the dominant scheme was the META element syntax, used in ~90% of the cases where MAMA detected any of the 3 encoding sources.
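As a rough illustration of the three sources listed above, the following Python sketch pulls a declared charset from each of them. The function name `declared_encodings` and the regular expressions are assumptions made for this example, not MAMA's actual detection code, and the patterns are deliberately simplified.

```python
import re

# Simplified, hypothetical patterns; real-world markup has many more variations.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_RE = re.compile(
    r"<meta[^>]*http-equiv\s*=\s*[\"']?content-type[\"']?[^>]*>", re.IGNORECASE
)
XML_RE = re.compile(r"<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.IGNORECASE)


def declared_encodings(content_type_header, markup):
    """Return a dict of the encodings declared by each source that is present."""
    found = {}

    # 1. The "charset" parameter of the HTTP Content-Type header.
    match = CHARSET_RE.search(content_type_header or "")
    if match:
        found["http"] = match.group(1).lower()

    # 2. The "charset" parameter inside a META Http-equiv="Content-Type" element.
    meta = META_RE.search(markup)
    if meta:
        match = CHARSET_RE.search(meta.group(0))
        if match:
            found["meta"] = match.group(1).lower()

    # 3. The encoding pseudo-attribute of the XML declaration.
    match = XML_RE.search(markup)
    if match:
        found["xml"] = match.group(1).lower()

    return found
```

A document can then be checked for disagreement by comparing the values in the returned dict; an empty dict corresponds to the case where no encoding is specified at all.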

Venn diagram for breakdown of encoding source specification methods

Note: Region sizes are not to scale.

Overall document sizes

Among the many statistics MAMA gathered about the documents it analyzed, the length statistics are interesting to examine. One such criterion is the basic overall document length. This is simply the length of the original document, excluding any of the page's external dependencies such as CSS or scripting. The average basic document size of MAMA's analyzed URLs was 16,406 characters.

MAMA's base document size ranges
Size range (characters)	Frequency	Percentage
>0 && <= 5,000	1,113,224	31.7%
>5,000 && <= 10,000	717,825	20.5%
>10,000 && <= 15,000	509,456	14.5%
>15,000 && <= 20,000	324,765	9.3%
>20,000 && <= 25,000	213,093	6.1%
>25,000 && <= 50,000	432,129	12.3%
>50,000 && <= 75,000	112,481	3.2%
>75,000 && <= 100,000	40,349	1.1%
>100,000 && <= 200,000	35,354	1.0%
> 200,000	8,287	0.2%

But what about all those dependencies?

One other interesting length factor MAMA tracked was labeled "extras". This measure added up the sizes of all external CSS, scripting, frames, and IFrames referenced by the main document. While the basic document length gives some idea of a user's initial download penalty, the "extras" length gives a better sense of the overall weight of a page before its objects (such as images and plug-ins) are loaded. The average "extras" length across all URLs is 20,296 characters, but it rises to 28,038 characters when only the documents that have any "extras" are counted. This is a case where "extras heavy" documents skew the average: as the table below shows, most documents have an "extras" sum of less than 10,000 characters. So, by at least one measure, the average page downloads about as much "extras" content as it does for the main document.
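The difference between the two averages comes only from whether documents with no "extras" are included in the denominator. A minimal Python sketch of that arithmetic, using a hypothetical function name and made-up sample sizes rather than MAMA's data:

```python
def extras_averages(extras_sizes):
    """Return (average over all documents, average over documents with extras)."""
    total = sum(extras_sizes)
    nonzero_count = sum(1 for size in extras_sizes if size > 0)

    # Every document counts here, including those with an "extras" size of 0.
    overall = total / len(extras_sizes) if extras_sizes else 0.0
    # Only documents that reference at least one external resource count here.
    with_extras = total / nonzero_count if nonzero_count else 0.0
    return overall, with_extras


# Illustrative values only: two documents without extras, two with.
print(extras_averages([0, 0, 3000, 9000]))
```

The total is the same in both cases, so the second average is always at least as large as the first.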

MAMA's "extras" size ranges
(969,042 URLs had an "extras" size of 0)
Size range (characters)	Frequency	Percentage
>0 && <= 5,000	753,392	21.5%
>5,000 && <= 10,000	361,460	10.3%
>10,000 && <= 15,000	207,283	5.9%
>15,000 && <= 20,000	182,116	5.2%
>20,000 && <= 25,000	190,039	5.4%
>25,000 && <= 50,000	467,214	13.3%
>50,000 && <= 75,000	173,156	4.9%
>75,000 && <= 100,000	78,497	2.2%
>100,000 && <= 200,000	95,093	2.7%
>200,000	31,888	0.9%

Doctypes

The markup validator has a lot to say about Doctypes. They are a key component in determining a successful validation. MAMA stored the information about a document's Doctype pulled from the W3C validator, but it also looked for the Doctype information separately. In MAMA's URL set, 1,788,294 of the URLs analyzed (50.96%) had a Doctype present. About 85% of MAMA's URLs would be rendered in most browsers using their "Quirks" mode.

Doctype versions

Different versions of the HTML standard can be detected via unique strings in the Doctype statement. The leading space in most of the values below helps differentiate between HTML and XHTML versions. HTML 4 variants are about twice as popular as the next most common version, XHTML 1.0.

Doctype versions popularity in MAMA
Doctype-version substring	Frequency	Percentage Using Doctype
" html 4" (HTML 4 variants)	1,122,392	62.8%
" xhtml 1.0"	548,307	30.7%
" html 3.2"	57,354	3.2%
"ietf"	34,965	2.0%
" xhtml 1.1"	20,958	1.2%
"softquad" || "//sq//"	9,950	0.6%
" html 2"	7,640	0.4%
" html 3.0"	1,711	0.1%

Doctype flavors

Beginning with HTML 4.0, HTML was stratified into 3 separate variants: Strict, Transitional, and Frameset. A portion of the Doctype statement directly reflects these variants, and we can easily discern the "flavors" of HTML by searching for the substrings. The Transitional flavor is more than 10 times as common as either of the other types.

Doctype flavor popularity in MAMA
Doctype-flavor substring	Frequency	Percentage Using Doctype
"Transitional"	1,459,912	81.6%
"Strict"	130,191	7.3%
"Frameset"	64,516	3.6%

For more about doctypes, read the doctypes section of the Basic structure article.
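The substring tests behind the two doctype tables above can be sketched in Python as follows. This assumes a case-insensitive match and reports only the first matching version string; the function name `classify_doctype` and the label strings are illustrative, and the vendor-specific substrings ("ietf", "softquad") are left out for brevity.

```python
# Version substrings from the table; the leading space keeps " html 4"
# from matching inside "xhtml 4...".
VERSION_SUBSTRINGS = [
    (" html 4", "HTML 4"),
    (" xhtml 1.0", "XHTML 1.0"),
    (" html 3.2", "HTML 3.2"),
    (" xhtml 1.1", "XHTML 1.1"),
    (" html 2", "HTML 2"),
    (" html 3.0", "HTML 3.0"),
]
FLAVOR_SUBSTRINGS = ["transitional", "strict", "frameset"]


def classify_doctype(doctype):
    """Return (version, flavor) for a Doctype statement, or None where unknown."""
    lowered = doctype.lower()
    version = next(
        (label for needle, label in VERSION_SUBSTRINGS if needle in lowered), None
    )
    flavor = next((name for name in FLAVOR_SUBSTRINGS if name in lowered), None)
    return version, flavor


print(classify_doctype('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'))
```

Because the check is a plain substring search, a Doctype with no recognizable version or flavor simply yields `None` for that part.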

Miscellaneous document structure matters

MAMA calculated a "Tag Ratio" for each document. This was the total length of the content within all tags divided by the overall page length, expressed as a percentage. A Tag Ratio of 0 would be all plain text, while a Tag Ratio of 100.0 would be nothing but tags, without even linefeeds or spaces between them. The average document had a Tag Ratio of 61.64%, so almost 2/3 of each document was tags.
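A Tag Ratio of this kind can be approximated with a few lines of Python. This is a minimal sketch, assuming that a tag is anything between `<` and `>` and that the angle brackets themselves count toward the tag length; MAMA's exact tokenizing rules are not described here, and the regular expression will miscount comments or scripts that contain `>`.

```python
import re

# Naive tag pattern: everything from "<" up to the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratio(markup):
    """Return the percentage of characters in the document that belong to tags."""
    if not markup:
        return 0.0
    tag_chars = sum(len(match.group(0)) for match in TAG_RE.finditer(markup))
    return 100.0 * tag_chars / len(markup)


print(tag_ratio("plain text"))   # no tags at all
print(tag_ratio("<br><br>"))     # nothing but tags
```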

MAMA also kept its eye out for character entities, a portable way to express characters that are not in the page's specified character set. All Unicode characters can be given as a numeric entity, while many of these can also be expressed via special name codes. In almost every case where a named entity counterpart exists for a character, the named version is at least as popular as the numeric version, and often much more so.

ex: &nbsp; (used 2,537,947 times); &#160; found 41,390 times

The character entity used most often was the "non-breaking space", found in 72.3% of all MAMA's documents.

Top 10 most popular character entities
Entity Description	Entity	Entity Code	Frequency	Entity Description	Entity	Entity Code	Frequency
Non-breaking space		nbsp	2,537,947	Small 'u' with Diaeresis	ü	uuml	226,695
Ampersand	&	amp	1,256,005	Small 'e' with Acute	é	eacute	207,322
Copyright Sign	©	copy	776,051	Small 'a' with Diaeresis	ä	auml	204,855
Quotation Mark	"	quot	520,902	Small 'o' with Diaeresis	ö	ouml	184,313
Greater-Than Sign	>	gt	276,149	Right Double-Angle Quotation Mark	»	raquo	123,207
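To compare named and numeric forms in a single document, the two kinds of reference can be tallied separately. The Python sketch below is illustrative only: the function name `count_entities` and the patterns are assumptions, and it counts every occurrence in one document rather than the number of documents using an entity.

```python
import re
from collections import Counter

NAMED_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
NUMERIC_RE = re.compile(r"&#([0-9]+|[xX][0-9A-Fa-f]+);")


def count_entities(markup):
    """Return (named, numeric) Counters; numeric keys are code points."""
    named = Counter(NAMED_RE.findall(markup))
    numeric = Counter()
    for ref in NUMERIC_RE.findall(markup):
        # Hexadecimal references start with "x" or "X"; the rest are decimal.
        code_point = int(ref[1:], 16) if ref[0] in "xX" else int(ref)
        numeric[code_point] += 1
    return named, numeric


named, numeric = count_entities("a&nbsp;b&#160;c&#xA0;d&amp;e&nbsp;")
print(named, numeric)
```

Normalizing numeric references to code points lets the decimal and hexadecimal spellings of the same character be counted together.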

Popular markup elements and attributes

For authors who have spent any time at all writing HTML documents, there will be no real surprises about which elements are the most popular. The 10 most popular markup elements can be divided into three basic categories:

Basic document-structure elements (HTML, HEAD, BODY, TITLE and META)
Tables (TABLE, TR and TD)
Hyperlinks and images (A and IMG)

Top 10 markup elements
ELEMENT	Frequency	Percentage	ELEMENT	Frequency	Percentage
`HEAD`	3,464,519	98.7%	`META`	3,276,347	93.4%
`TITLE`	3,459,207	98.6%	`IMG`	3,219,487	91.7%
`HTML`	3,452,975	98.4%	`TABLE`	2,894,184	82.5%
`BODY`	3,452,907	98.4%	`TD`	2,891,972	82.4%
`A`	3,307,397	94.2%	`TR`	2,891,205	82.4%

The most popular attributes in MAMA all come from just 4 of the previously covered top 10 elements: A, IMG, META and TABLE. It is interesting that none of the top structural elements (HEAD, TITLE, HTML, BODY) places any attribute in the top 10 attribute beauty pageant.

Top 10 markup attributes
ELEMENT[Attribute]	Frequency	Percentage	ELEMENT[Attribute]	Frequency	Percentage
`A`[`Href`]	3,304,834	94.2%	`META`[`Http-equiv`]	2,826,859	80.6%
`META`[`Content`]	3,273,610	93.3%	`IMG`[`Border`]	2,810,265	80.1%
`IMG`[`Src`]	3,219,304	91.7%	`META`[`Name`]	2,710,638	77.2%
`IMG`[`Width`]	2,957,808	84.3%	`TABLE`[`Border`]	2,691,899	76.7%
`IMG`[`Height`]	2,945,989	84.0%	`TABLE`[`Width`]	2,637,117	75.1%
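Tallies like the two tables above can be approximated with Python's standard `html.parser` module. This sketch assumes that "Frequency" means the number of documents containing an element or attribute at least once, as the percentages suggest; the class and function names are hypothetical and this is not MAMA's own parser.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupTally(HTMLParser):
    """Collect the distinct elements and ELEMENT[Attribute] pairs in one document."""

    def __init__(self):
        super().__init__()
        self.elements = set()
        self.attributes = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        element = tag.upper()
        self.elements.add(element)
        for name, _value in attrs:
            self.attributes.add(f"{element}[{name.capitalize()}]")


def tally_documents(documents):
    """Count, per item, how many documents use it at least once."""
    element_counts = Counter()
    attribute_counts = Counter()
    for markup in documents:
        parser = MarkupTally()
        parser.feed(markup)
        parser.close()
        element_counts.update(parser.elements)
        attribute_counts.update(parser.attributes)
    return element_counts, attribute_counts


docs = [
    '<html><body><a href="x">x</a><img src="a.png" width="1"></body></html>',
    '<a href="y">y</a><a href="z">z</a>',
]
elements, attributes = tally_documents(docs)
print(elements.most_common(3), attributes.most_common(3))
```

Using a set per document means that a page with a hundred links still adds only one to the count for `A`.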

Conclusion

This overview of the basics of markup seems to be too thin to allow us to come to any real "conclusions" on the topic just yet. We are just getting started. The information here on document encodings and length, doctypes, tag ratios, and character entities glosses over many of the details found in the deeper writeup. Our final, simplistic mention of popular markup elements and attributes merely sets the stage for what will follow—in the coming weeks, we will devote considerably more attention to the details of elements and attributes in common use.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.