MAMA: Basic document structure

By Brian Wilson

Index:

  1. Introduction
  2. Document statistics
  3. Byte order marks
  4. Doctypes
  5. A document's "Tag Ratio"
  6. Markup elements
  7. Markup attributes

Introduction

To begin MAMA's look at markup practices and trends, we first examine overall document sizes, then some of the basic structural components of a document (Byte Order Marks and Doctypes). Finally, we present the full frequency tables for both elements and attributes. Most readers should find the breakdowns in the individual sections sufficient for most purposes. Those wishing to dig into the meat of this research are encouraged to look closely at the complete, unvarnished element and attribute frequency tables, which allow quicker cross-comparison between markup topics.

Document statistics

Document size

This first metric is the character length of the original main document, expressed as an integer. No document dependencies are counted in this number. The average document size of MAMA's analyzed URLs was 16,406 characters. In all, ~55 URLs hit MAMA's hard size limit of 5 megabytes.

Fig 2-1: Document sizes
Size range Frequency   Size range Frequency   Size range Frequency
=0 2,217 >8000 && <=9000 136,348 >35000 && <=40000 76,277
>0 && <= 500 137,827 >9000 && <=10000 127,766 >40000 && <=45000 59,142
>500 && <=1000 202,031 >10000 && <=12000 229,676 >45000 && <=50000 44,190
>1000 && <=2000 255,084 >12000 && <=14000 194,834 >50000 && <=75000 112,481
>2000 && <=3000 188,206 >14000 && <=16000 162,359 >75000 && <=100000 40,349
>3000 && <=4000 170,332 >16000 && <=18000 135,076 >100000 && <=150000 27,382
>4000 && <=5000 159,744 >18000 && <=20000 112,276 >150000 && <=200000 7,972
>5000 && <=6000 156,531 >20000 && <=25000 213,093 >200000 && <=250000 3,092
>6000 && <=7000 152,619 >25000 && <=30000 147,698 >250000 && <=300000 1,643
>7000 && <=8000 144,561 >30000 && <=35000 104,822 >300000 3,552
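
A histogram like Fig 2-1 can be produced by bucketing each document's character length against a list of upper bounds. The following is a minimal sketch of that idea, not MAMA's actual code: UPPER_BOUNDS copies the range boundaries from the table, while size_bucket and size_histogram are hypothetical helper names.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the ranges in Fig 2-1; each range is "> previous && <= bound".
UPPER_BOUNDS = [0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000,
                9000, 10000, 12000, 14000, 16000, 18000, 20000, 25000,
                30000, 35000, 40000, 45000, 50000, 75000, 100000,
                150000, 200000, 250000, 300000]


def size_bucket(length):
    """Return the Fig 2-1 style range label for a non-negative length."""
    # bisect_left finds the first bound that is >= length.
    i = bisect_left(UPPER_BOUNDS, length)
    if i == 0:
        return "=0"
    if i == len(UPPER_BOUNDS):
        return ">300000"
    return f">{UPPER_BOUNDS[i - 1]} && <={UPPER_BOUNDS[i]}"


def size_histogram(lengths):
    """Count how many lengths fall into each range."""
    return Counter(size_bucket(n) for n in lengths)
```

For example, size_bucket(16406) places the average document size quoted above in the ">16000 && <=18000" range.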

Document Frame/IFrame sizes

This is the aggregate character length of all Frames and IFrames used in a document. In all, 80.78% of pages had a Frame/IFrame length of 0. This is an expected result, since any non-zero value means that Frames or IFrames are part of the document infrastructure. The average length of the combined Frame/IFrame components was 3,060.4 characters, but this figure includes all the cases where there were no Frames or IFrames. Where Frames or IFrames were actually used, their average length was 15,919.8 characters.

Fig 2-2: Frame/IFrame sizes
Size range Frequency   Size range Frequency   Size range Frequency
=0 2,834,569 >8000 && <=9000 27,863 >35000 && <=40000 12,034
>0 && <= 500 26,035 >9000 && <=10000 25,025 >40000 && <=45000 8,786
>500 && <=1000 35,043 >10000 && <=12000 43,865 >45000 && <=50000 6,408
>1000 && <=2000 50,640 >12000 && <=14000 37,049 >50000 && <=75000 16,642
>2000 && <=3000 41,304 >14000 && <=16000 29,324 >75000 && <=100000 6,411
>3000 && <=4000 38,274 >16000 && <=18000 24,789 >100000 && <=150000 4,929
>4000 && <=5000 35,519 >18000 && <=20000 20,177 >150000 && <=200000 3,313
>5000 && <=6000 32,526 >20000 && <=25000 41,618 >200000 && <=250000 880
>6000 && <=7000 31,593 >25000 && <=30000 27,106 >250000 && <=300000 376
>7000 && <=8000 29,351 >30000 && <=35000 17,032 >300000 699
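
The two Frame/IFrame averages quoted above are linked by the share of pages with a length of 0: dividing the all-pages average by the share of pages that use Frames or IFrames should give roughly the "where used" average. The sketch below checks this with the rounded figures from the text; average_where_used is a hypothetical helper name, and the small gap to the reported 15,919.8 is assumed to come from rounding the 80.78% figure.

```python
def average_where_used(overall_average, share_zero):
    """Average over the non-zero pages, given the average over all pages."""
    if not 0.0 <= share_zero < 1.0:
        raise ValueError("share_zero must be in [0, 1)")
    return overall_average / (1.0 - share_zero)


# 3,060.4 characters over all pages, 80.78% of pages with a length of 0.
estimate = average_where_used(3060.4, 0.8078)
print(f"{estimate:.1f}")
```

The estimate comes out near 15,923 characters, within rounding distance of the reported value.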

Document "extras" sizes

This value is the aggregate length of all the "extra" dependencies in a document (not counting embedded objects). It consists of all Frame and IFrame content (the Frame/IFrame size from the previous table), all external script content, and all CSS from external and imported stylesheets. A value of 0 is still expected to be well represented, but with multiple factors in play it is much less likely than for Frames and IFrames alone. The overall average length of all "extras" is 20,295.7 characters, rising to 28,038.3 characters when only the cases where any "extras" exist are counted.

Fig 2-3: Document "extras" sizes
Size range Frequency   Size range Frequency   Size range Frequency
=0 969,042 >9000 && <=10000 53,271 >40000 && <=45000 69,438
>0 && <= 500 84,747 >10000 && <=12000 92,431 >45000 && <=50000 53,694
>500 && <=1000 117,985 >12000 && <=14000 76,680 >50000 && <=60000 81,219
>1000 && <=2000 178,577 >14000 && <=16000 89,519 >60000 && <=70000 68,595
>2000 && <=3000 154,796 >16000 && <=18000 73,095 >70000 && <=80000 43,553
>3000 && <=4000 120,169 >18000 && <=20000 57,694 >80000 && <=90000 33,830
>4000 && <=5000 97,118 >20000 && <=22500 101,774 >90000 && <=100000 24,456
>5000 && <=6000 88,678 >22500 && <=25000 88,265 >100000 && <=150000 68,781
>6000 && <=7000 89,053 >25000 && <=30000 137,810 >150000 && <=200000 26,312
>7000 && <=8000 66,891 >30000 && <=35000 116,366 >200000 && <=250000 13,022
>8000 && <=9000 63,567 >35000 && <=40000 89,906 >250000 18,866
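
As an illustration of how the "extras" figure is put together, the sketch below sums the three components named above for each page and then computes both kinds of average. The PageExtras class, its field names, and the sample values are all hypothetical; they are not MAMA's data model or data.

```python
from dataclasses import dataclass


@dataclass
class PageExtras:
    """Character lengths of one page's dependencies (hypothetical model)."""
    frame_chars: int = 0   # Frame and IFrame content
    script_chars: int = 0  # external script content
    css_chars: int = 0     # external and imported stylesheets

    @property
    def total(self):
        return self.frame_chars + self.script_chars + self.css_chars


def averages(pages):
    """Return (average over all pages, average over pages with any extras)."""
    totals = [p.total for p in pages]
    nonzero = [t for t in totals if t > 0]
    overall = sum(totals) / len(totals) if totals else 0.0
    where_used = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return overall, where_used


# Made-up sample: one page without extras, two pages with some.
sample = [
    PageExtras(),
    PageExtras(script_chars=1200, css_chars=800),
    PageExtras(frame_chars=4000),
]
overall, where_used = averages(sample)
```

As in the real data, the average over pages that have any extras is higher than the average over all pages, because the zero-length pages no longer pull it down.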

Byte Order Marks

A co-worker asked for MAMA to detect the presence of Byte Order Marks (BOMs), which are used to signal the use of some encoding flavors. The intent was to find real-world examples of pages that used these BOMs so that they could be tested in Opera. Alas, MAMA detected only 3 of the 8 types of BOMs it looked for in the URLs analyzed. To detect the following encodings, a Perl regular expression was matched against the first 5 characters of each document.

Fig 3-1: BOM detection patterns in MAMA
BOM type Perl regexp
utf-32 (little-endian) /^(\xff\xfe\x00\x00)/
utf-32 (big-endian) /^(\x00\x00\xfe\xff)/
utf-16 (little-endian) /^(\xff\xfe)/
utf-16 (big-endian) /^(\xfe\xff)/
utf-8 /^(\xef\xbb\xbf)/
utf-7 /^(\x2b\x2f\x76\x38\x2d)/
scsu /^(\x0e\xfe\xff)/
bocu-1 /^(\xfb\xee\x28)/
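
The same signatures can be checked without regular expressions. The sketch below is a Python rendering of the table above, not MAMA's Perl code; detect_bom is a hypothetical name. One detail matters when the patterns are tried in sequence: the utf-32 little-endian signature begins with the utf-16 little-endian one, so the longer signature has to be tested first. Whether MAMA applied its patterns in this order is an assumption.

```python
# (label, byte signature) pairs taken from Fig 3-1.
# The utf-32 entries come first so that they are not mistaken for utf-16.
BOM_SIGNATURES = [
    ("utf-32 (little-endian)", b"\xff\xfe\x00\x00"),
    ("utf-32 (big-endian)", b"\x00\x00\xfe\xff"),
    ("utf-16 (little-endian)", b"\xff\xfe"),
    ("utf-16 (big-endian)", b"\xfe\xff"),
    ("utf-8", b"\xef\xbb\xbf"),
    ("utf-7", b"\x2b\x2f\x76\x38\x2d"),
    ("scsu", b"\x0e\xfe\xff"),
    ("bocu-1", b"\xfb\xee\x28"),
]


def detect_bom(data: bytes):
    """Return the label of the BOM at the start of data, or None."""
    head = data[:5]  # the longest signature is 5 bytes
    for label, signature in BOM_SIGNATURES:
        if head.startswith(signature):
            return label
    return None
```

The input must be the raw bytes of the document; once the content has been decoded to text, the BOM may already have been consumed by the decoder.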

BOMs detected

The 3 BOMs were found in a total of 17,649 URLs (0.50% of all URLs analyzed). The BOM found most often is utf-8.

Fig 3-2: BOMs detected in MAMA's URLs
BOM Frequency
utf-8 17,006
utf-16 (little-endian) 647
utf-16 (big-endian) 26

Doctypes

The Doctype statement is used in two ways. Passively, it proclaims the markup standard to which the document is supposed to adhere, and a markup validator can use this information to check the document's conformance to that standard. We examine the validation aspect of the Doctype and its implications in a separate document. In this section we will look at some of the things we can easily glean from the Doctype, as well as the more active role Doctypes have taken on in recent years as arbiter of the rendering mode that a browser will use.

Anatomy of a Doctype statement

Now, we can take a look at the components of a Doctype to see what sort of information it can offer us:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Fig 4-1: Components of a Doctype
Component Description
"<!DOCTYPE" The beginning of the Doctype
"html" This string specifies the name of the root element for the markup type
"PUBLIC" This indicates the availability of the DTD resource. It can be a publicly accessible object ("PUBLIC") or a system resource ("SYSTEM") such as a local file or URL. HTML/XHTML DTDs are specified by "PUBLIC" identifiers.
"-//W3C//DTD XHTML 1.0 Transitional//EN" This is the Formal Public Identifier (FPI). This compact, quoted string gives a lot of information about the DTD, such as its Registration, Organization, Type, Label, and the Encoding language. For HTML/XHTML DTDs, the most interesting part of this is the label portion (the "XHTML 1.0 Transitional" part). If the processing entity does not already have local access to this DTD, it can get it from the System Identifier (next portion).
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" The System Identifier (SI); the URL location of the DTD specified in the FPI
">" The ending of the Doctype

Doctypes found by MAMA

The entire Doctype statement was stored in MAMA. In all, 1,788,294 of the URLs analyzed (50.96%) had a Doctype present. For the purposes of the full frequency table for Doctype, the values were normalized to lower case.
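
The component breakdown in Fig 4-1 can be applied to a stored Doctype statement with a regular expression. The sketch below is one possible way to do it and is not taken from MAMA; DOCTYPE_RE, parse_doctype, and split_fpi are hypothetical names, and the pattern is simplified (it handles only double-quoted identifiers).

```python
import re

# Root element name, then an optional PUBLIC/SYSTEM keyword followed by
# up to two double-quoted strings.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>]+)'
    r'(?:\s+(?P<availability>PUBLIC|SYSTEM)'
    r'(?:\s+"(?P<first>[^"]*)")?'
    r'(?:\s+"(?P<second>[^"]*)")?)?'
    r'\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Split a Doctype statement into the components listed in Fig 4-1."""
    m = DOCTYPE_RE.search(text)
    if m is None:
        return None
    availability = m.group("availability")
    availability = availability.upper() if availability else None
    if availability == "PUBLIC":
        fpi, si = m.group("first"), m.group("second")
    else:
        # A SYSTEM Doctype carries only a System Identifier.
        fpi, si = None, m.group("first")
    return {"root": m.group("root"), "availability": availability,
            "fpi": fpi, "si": si}


def split_fpi(fpi):
    """Split a four-part FPI on '//'; return None for any other shape."""
    parts = fpi.split("//")
    if len(parts) != 4:
        return None
    registration, organization, type_and_label, language = parts
    doc_type, _, label = type_and_label.partition(" ")
    return {"registration": registration, "organization": organization,
            "type": doc_type, "label": label, "language": language}
```

Applied to the XHTML 1.0 Transitional example above, parse_doctype returns "html" as the root, "PUBLIC" as the availability, and the FPI and SI as separate strings; split_fpi then isolates the "XHTML 1.0 Transitional" label.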

Doctype versions

Different HTML standards can be detected via unique strings in the Doctype statement. The leading space in most of the values below helps differentiate between HTML and XHTML versions. HTML 4 variants are twice as popular as any of the other versions.

Fig 4-2: Doctype versions in MAMA
Doctype-version substring Frequency   Doctype-version substring Frequency
" html 4" (HTML 4 variants) 1,122,392 "softquad" || "//sq//" 9,950
" xhtml 1.0" 548,307 " html 2" 7,640
" html 3.2" 57,354 " html 3.0" 1,711
"ietf" 34,965 "WAP" 131
" xhtml 1.1" 20,958 " xhtml 2" 18

Doctype flavors

Beginning with HTML 4.0, HTML was stratified into 3 separate variants: Strict, Transitional, and Frameset. The Label portion of the Doctype FPI reflects these variants, and we can easily discern the "flavors" of HTML by searching for the substrings. The Transitional configuration is more than 10 times as likely as the other types.
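
The substring tests behind Fig 4-2 and Fig 4-3 can be sketched as follows. This is an illustration under stated assumptions rather than MAMA's code: the Doctype is lowercased first (matching the normalization mentioned earlier), only the version-number substrings from Fig 4-2 are included (the "ietf", "softquad" and "WAP" rows are left out), and the function names are hypothetical.

```python
# Version substrings from Fig 4-2. The leading space keeps " html 4"
# from matching inside "xhtml ...".
VERSION_SUBSTRINGS = [
    (" html 4", "HTML 4 variants"),
    (" xhtml 1.0", "XHTML 1.0"),
    (" html 3.2", "HTML 3.2"),
    (" xhtml 1.1", "XHTML 1.1"),
    (" html 2", "HTML 2"),
    (" html 3.0", "HTML 3.0"),
    (" xhtml 2", "XHTML 2"),
]

# Flavor substrings from Fig 4-3, lowercased.
FLAVOR_SUBSTRINGS = ["transitional", "strict", "frameset"]


def doctype_versions(doctype):
    """Return the labels of all version substrings found in the Doctype."""
    lowered = doctype.lower()
    return [label for needle, label in VERSION_SUBSTRINGS if needle in lowered]


def doctype_flavor(doctype):
    """Return the first flavor substring found, or None if there is none."""
    lowered = doctype.lower()
    for flavor in FLAVOR_SUBSTRINGS:
        if flavor in lowered:
            return flavor
    return None
```

A Doctype without a flavor keyword, such as the bare HTML 4.01 FPI, yields None from doctype_flavor, which is why the flavor counts do not add up to the number of HTML 4 and XHTML 1.0 Doctypes.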

Fig 4-3: Doctype flavors in MAMA
Doctype-flavor substring Frequency
"Transitional" 1,459,912
"Strict" 130,191
"Frameset" 64,516

System Identifiers (SIs)

A look at the full Doctype frequency table shows a good balance between Doctypes that specify an SI and those that do not. A simplistic way to find SIs that use an absolute URI is to look for the string "http://" in the Doctype statements; doing so finds 880,702 matching URLs. However, URIs can be relative too, so we should expand our search. Looking for ".dtd" instead of "http://" may be a better usage indicator for all Doctypes with SIs. Doing so finds 897,601 URLs, or 50.19% of all MAMA cases where a Doctype is present.
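
The two indicators described above amount to simple substring tests, and the quoted percentage follows from the counts given in this article. The sketch below shows both; si_indicators is a hypothetical helper, and lowercasing the Doctype before testing is an assumption.

```python
def si_indicators(doctype):
    """Report which of the two SI indicators a Doctype statement matches."""
    lowered = doctype.lower()
    return {
        "absolute_uri": "http://" in lowered,
        "dtd_reference": ".dtd" in lowered,
    }


# Counts quoted in the text.
WITH_DOCTYPE = 1_788_294   # URLs with a Doctype present
WITH_DTD = 897_601         # Doctypes containing ".dtd"
WITH_HTTP = 880_702        # Doctypes containing "http://"

share_with_si = round(100 * WITH_DTD / WITH_DOCTYPE, 2)
dtd_without_http = WITH_DTD - WITH_HTTP
```

Here share_with_si reproduces the 50.19% figure. The difference between the two counts is 16,899; if every Doctype that contains "http://" also contains ".dtd" (an assumption the text does not confirm), that is the number of Doctypes whose SI does not use an absolute "http://" URI.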

Doctype switching: Standards vs. Quirks mode

Saarsoo produced a comparison of which pages were rendered in Standards vs. Quirks mode, based on Henri Sivonen's excellent page on doctype switching. Using this page as a guide, we can construct a similar table, but with MAMA numbers included. To reduce the complexity of Sivonen's original table, we'll only show the columns for the most popular browser sets currently in use: Mozilla/Safari, Opera 9, IE7/Opera7.1 and IE6/Opera7. Note that these groupings pair up browsers that have very similar Quirks, Almost Standards, and Standards modes. Standards, Almost Standards, and Quirks modes are listed as S, A and Q respectively.

Given the complexity of Sivonen's chart, one would expect the numbers for the different browsers to vary by a wider margin. It appears that the main differences between browsers lie in Doctypes with lower representation in the wild. Generally, about 85% of all Web pages are rendered using Quirks mode, while the remaining ~15% of URLs are rendered using either Standards or Almost Standards mode. If we look only at URLs that have a Doctype, Standards and Almost Standards modes are used in ~30% of those cases.

Fig 4-4: Doctype switching behavior in browsers
(Behavior data from Henri Sivonen's Doctype switching table)
Doctype MAMA frequency Moz/Safari Opera9 IE7/Opera7.1 IE6/Opera7
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"> 6,745 S S A A
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"> 2,488 S S A A
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/html4/strict.dtd"> 42 S S A A
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> 14,471 S S A A
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 90,296 A A A A
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd"> 2,732 A A A A
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 2,185 Q Q A A
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> (w/o XML prolog): 10,563 S S A A
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"> (w/o XML prolog): 26 S S A A
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> (w/o XML prolog): 58,086 S S A A
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> (w/o XML prolog): 295,687 A A A A
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> (w/ XML prolog): 3,475 S S A Q
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"> (w/ XML prolog): 14 S S A Q
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> (w/ XML prolog): 5,842 S S A Q
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> (w/ XML prolog): 54,765 A A A Q
<!DOCTYPE HTML PUBLIC "ISO/IEC 15445:2000//DTD HTML//EN"> 10 S Q Q Q
<!DOCTYPE html> 199 S S A A

Fig 4-5: MAMA's Doctype switching totals by browser
Browser Standards Mode [%] Almost Standards Mode [%] Quirks Mode [%]
Mozilla/Safari 101,961 [2.91%] 443,480 [12.64%] 2,963,739 [84.46%]
Opera 9 101,951 [2.91%] 443,480 [12.64%] 2,963,749 [84.46%]
IE7/Opera 7.1 0 [0.0%] 547,616 [15.61%] 2,961,564 [84.39%]
IE6/Opera 7.0 0 [0.0%] 483,520 [13.78%] 3,025,660 [86.22%]
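
The totals in Fig 4-5 can be recomputed from the rows of Fig 4-4. The sketch below does so under two assumptions: the total number of URLs is taken as 3,509,180 (the sum of any row of Fig 4-5), and every URL whose Doctype is absent or not listed in Fig 4-4 is counted as Quirks mode. ROWS transcribes the frequencies and mode letters from Fig 4-4 in table order; the names are hypothetical.

```python
TOTAL_URLS = 3_509_180  # assumption: sum of any row of Fig 4-5

BROWSERS = ["Mozilla/Safari", "Opera 9", "IE7/Opera 7.1", "IE6/Opera 7.0"]

# (MAMA frequency, mode letter per browser column) from Fig 4-4.
ROWS = [
    (6745, "SSAA"), (2488, "SSAA"), (42, "SSAA"), (14471, "SSAA"),
    (90296, "AAAA"), (2732, "AAAA"), (2185, "QQAA"),
    (10563, "SSAA"), (26, "SSAA"), (58086, "SSAA"), (295687, "AAAA"),
    (3475, "SSAQ"), (14, "SSAQ"), (5842, "SSAQ"), (54765, "AAAQ"),
    (10, "SQQQ"), (199, "SSAA"),
]


def mode_totals(rows, total_urls):
    """Return {browser: (standards, almost_standards, quirks)} URL counts."""
    totals = {}
    for col, browser in enumerate(BROWSERS):
        standards = sum(freq for freq, modes in rows if modes[col] == "S")
        almost = sum(freq for freq, modes in rows if modes[col] == "A")
        # Everything not rendered in S or A mode is counted as Quirks.
        quirks = total_urls - standards - almost
        totals[browser] = (standards, almost, quirks)
    return totals


for browser, counts in mode_totals(ROWS, TOTAL_URLS).items():
    shares = ", ".join(f"{100 * c / TOTAL_URLS:.2f}%" for c in counts)
    print(browser, counts, shares)
```

The computed counts match the figures in Fig 4-5 for all four browser groups, which suggests the table was derived in this way.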

A document's "Tag Ratio"

During its analysis, MAMA kept track of the size of all the markup tags used, as well as the overall page size. The ratio of these two values provides some minor insight into authoring practices and into how much plain-text content authors have on their pages. Saarsoo did something similar in his study, but he called his ratio a "text percentage"; in his study, the plain-text portion of the page averaged about 20% of the overall size. In MAMA's case, the "Tag Ratio" is the total content within all tags divided by the overall page size, expressed as a percentage. A low Tag Ratio reflects a relatively small amount of markup compared to the text content, while a high Tag Ratio reflects a large amount of markup compared to the text content. A Tag Ratio of 0 would be all plain text, while a Tag Ratio of 100.0 would be nothing but tags, without even linefeeds or spaces between them. The average document had a Tag Ratio of 61.64%, meaning that almost 2/3 of the document is tags. A full frequency table of Tag Ratios is also available.
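
A rough version of this measurement can be sketched with a regular expression. This is an approximation and not MAMA's parser: it assumes that a tag is anything between "<" and the next ">", that the angle brackets count as part of the tag, and that comments and script content need no special handling. TAG_RE and tag_ratio are hypothetical names.

```python
import re

# Naive tag matcher: "<", any run of characters other than ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratio(document):
    """Return the percentage of the document's characters inside tags."""
    if not document:
        return 0.0
    tag_chars = sum(len(m.group(0)) for m in TAG_RE.finditer(document))
    return 100.0 * tag_chars / len(document)
```

With this definition, "<p></p>" has a Tag Ratio of 100.0 and a string with no tags has a Tag Ratio of 0, matching the two extremes described above.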

Markup elements

We will discuss many of these elements in more detail in their appropriate sections; here we will just take a quick look at the top 20 and say a little about the overall rankings before moving on. There are no real surprises in the rankings of the top elements. Comparing the chart below to Saarsoo's study, there is a little movement in the rankings, but not until we get out of the top 10, and the top 50 elements from both studies share 49 elements in common! Hickson's study has some differences in ranking order even in its top 10. These discrepancies are very minor, however, involving values that have very similar totals and adjacent positions in MAMA's list.

The most popular elements

  • Basic document elements: HTML, HEAD and BODY
  • Hyperlinks and images: A and IMG
  • Tables (TABLE, TD and TR)
  • A smattering of important elements used in the HEAD: TITLE, META, SCRIPT, LINK and STYLE
  • Simple structural and formatting elements: BR, P, DIV, FONT, B, SPAN and STRONG

Again, nothing unexpected appears in this list. The full, unvarnished element list also reveals a significant number of irrelevant entries as you go deeper down the roster. It seems there is a lot of custom markup, typos, and script fragments out there (the script fragments may be artifacts of MAMA's parsing strategy).

Fig 6-1: Popular markup elements
[Please also see the complete frequency table]
ELEMENT Frequency   ELEMENT Frequency   ELEMENT Frequency
HEAD 3,464,519 TABLE 2,894,184 FONT 2,061,417
TITLE 3,459,207 TD 2,891,972 LINK 2,018,510
HTML 3,452,975 TR 2,891,205 B 1,805,495
BODY 3,452,907 BR 2,859,662 SPAN 1,527,964
A 3,307,397 P 2,702,935 STYLE 1,313,454
META 3,276,347 SCRIPT 2,528,823 STRONG 1,102,056
IMG 3,219,487 DIV 2,499,779 CENTER 1,076,535

Markup attributes

As with the discussions about markup elements, we will wait to talk more about attributes in the sections appropriate for each. Right now, we will again look at a top 20 list. The attributes found in the top 20 all come from only 7 different elements:

  • A
  • META
  • IMG
  • TABLE
  • TD
  • LINK
  • SCRIPT

These results and their ordering compare favorably to the brief attribute data listed in Hickson's study.

Fig 7-1: Popular markup attributes
[Please also see the complete frequency table]
ELEMENT[Attribute] Frequency   ELEMENT[Attribute] Frequency   ELEMENT[Attribute] Frequency
A[Href] 3,304,834 META[Name] 2,710,638 TD[Valign] 2,189,287
META[Content] 3,273,610 TABLE[Border] 2,691,899 LINK[Href] 2,016,007
IMG[Src] 3,219,304 TABLE[Width] 2,637,117 LINK[Rel] 2,001,105
IMG[Width] 2,957,808 TABLE[Cellpadding] 2,585,020 A[Target] 1,978,018
IMG[Height] 2,945,989 TABLE[Cellspacing] 2,578,416 TD[Align] 1,977,367
META[Http-equiv] 2,826,859 IMG[Alt] 2,520,939 SCRIPT[Language] 1,965,725
IMG[Border] 2,810,265 TD[Width] 2,324,752 LINK[Type] 1,777,982
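
Frequency tables like Fig 6-1 and Fig 7-1 can be built by recording which elements and attributes each document uses and then counting documents. The sketch below uses Python's standard html.parser module and is not MAMA's parser; it assumes that each frequency counts the number of URLs using an item at least once, and UsageCollector and tally are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class UsageCollector(HTMLParser):
    """Collect the distinct elements and attributes used in one document."""

    def __init__(self):
        super().__init__()
        self.elements = set()
        self.attributes = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        name = tag.upper()
        self.elements.add(name)
        for attr, _value in attrs:
            self.attributes.add(f"{name}[{attr.capitalize()}]")


def tally(documents):
    """Count, per element and per attribute, how many documents use it."""
    element_counts, attribute_counts = Counter(), Counter()
    for doc in documents:
        collector = UsageCollector()
        collector.feed(doc)
        collector.close()
        element_counts.update(collector.elements)
        attribute_counts.update(collector.attributes)
    return element_counts, attribute_counts
```

Because each document contributes a set rather than a list, a page with fifty A elements still adds only 1 to the count for A and for A[Href].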

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
