MAMA: W3C validator research

By Brian Wilson

Page 1 index : Page 2 index : Page 3 index

  1. About markup validation—an introduction
  2. Previous validation studies
  3. Sources and tools: The URL set and the validator
  4. What use is markup validation to an author?
  5. How many pages validated?
  6. Interesting views of validation rates, part 1: W3C-Member companies

Note that this document is large, so it has been broken up into 3 pages; use the navigation at the bottom of the document to navigate between pages.

About markup validation—an introduction

MAMA is an in-house Opera research project developed to create a repeatable and cross-referenced analysis of a significant population of Web pages that represent real world markup. Of course, part of that examination must also cover markup validation—an important measure of a page's adherence to a specific standard. The W3C markup validation tool produces useful metrics that add to the rest of MAMA's breakdown of its URL set. We will look at what validation reveals about these URLs, what it means to validate a document, and what benefits or drawbacks are derived from the process.

The readership of this section of MAMA's research is expected to be the casual Web page author out for a relaxing weekend browse, as well as those developing the W3C validator tool itself, looking for incisive statistics about the validation "State Of The Union". As a result of this diverse audience, some readers will find that some sections are redundant or mystifying (possibly both at the same time even!). Feel free to skip around the article as needed, but the best first-time reading flow is definitely a linear read. Some of the data presented may need some prerequisite knowledge, but I hope that even the most detailed examinations here may be of interest to all readers in some way. There are some positive trends, some surprises, and some disappointments in the figures to follow.

A quick summary:

The good news: Markup validation pass rates are definitely improving over time.
The bad news: The overall validation pass rate is still miserably low and is not increasing as fast as one would hope.

Previous validation studies

There are two previous, large-scale studies of markup validation to which we can compare MAMA's results regarding markup validation trends. Direct correlation with these previous studies was not an original goal of MAMA, but it is a happy accident, given that many of MAMA's design choices happen to coincide with theirs.

The analysis tools and target URL group were roughly the same between MAMA and these other projects. Both Parnas's and Saarsoo's studies used the WDG validator (see next section), which shares much of the same back-end mechanics with the W3C validator. Both studies also used the DMoz URL set (see next section). The main difference between the URL sets used lies in the amount of DMoz analyzed; where MAMA's research overlaps with Parnas's and Saarsoo's studies, we will attempt to compare results.

Fig 2-1: URL set sizes of validation studies
Study Date URL Set Full DMoz Size Study Set Size
Parnas Dec. 2001 DMoz ~2.5 million ~2.4 million
Saarsoo Jun. 2006 DMoz ~4.4 million ~1.0 million
MAMA Jan. 2008 DMoz ~4.7 million ~3.5 million

Sources and tools: The URL set and the validator

[For more details about the URLs and tools used in this study, take a look at the Methodology Appendix section of this document.]

Treading on familiar ground: The Open Directory Project (DMoz)

There is a lot of MAMA coverage elsewhere about the DMoz URL set and the decision to use it as the basis of MAMA's research. MAMA did not analyze ALL of the DMoz URLs, though. Transient network issues, dead URLs, and other problems inevitably reduced the final set of URLs analyzed to a total of about 3.5 million. The number of URLs from any given domain was limited in order to decrease per-domain bias in the results. This was an important design decision, because DMoz has a big problem with domain bias (~5% of all URLs in it are solely from cnn.com, for example). Parnas and Saarsoo did not do this, but it has proven to be a useful strategy to employ. I set an arbitrary per-domain limit of 30 URLs, and this seems to be a fair limitation. This restriction policy also helps track per-domain trends—if any are noticeable, they will be presented where they seem interesting.
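
As a rough illustration (a sketch only, not MAMA's actual code), such a per-domain cap can be applied with a single counting pass over the candidate URL list:

from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_DOMAIN = 30  # the arbitrary per-domain limit discussed above

def cap_per_domain(urls, limit=MAX_PER_DOMAIN):
    """Keep at most `limit` URLs from any single domain to reduce per-domain bias."""
    accepted = defaultdict(int)  # domain -> URLs already kept
    kept = []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if accepted[domain] < limit:
            accepted[domain] += 1
            kept.append(url)
    return kept

# A domain with 100 candidate URLs contributes only 30 to the final set
sample = ["http://www.cnn.com/page%d" % i for i in range(100)] + ["http://www.opera.com/"]
print(len(cap_per_domain(sample)))  # 31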

Any comparison of MAMA's data to other similar studies, even if they also use DMoz, must take into account that DMoz grows and changes over time as editors add, freshen, or delete URLs from its roster. URLs can grow stale or obsolete through removal, and domains can and do die on a distressingly regular basis. The aggregation source of these URLs remains the same, but the set itself is an evolving, dynamic entity.

The W3C validator

To test the URL set, MAMA used the W3C Markup Validator tool (http://validator.w3.org/, v. 0.8.2 released Oct. 2007), which uses the OpenSP parser for its main validation engine. The W3C Markup Validator is a free service from the W3C that helps authors improve the quality of their documents by checking adherence to standards via DTDs. The Parnas and Saarsoo studies both used the WDG validator, but for MAMA's analysis, the W3C validator was the validation tool of choice. As stated on the WDG's Web site, there are many similarities between these two validators,

"Most of the previous differences between the two validators have disappeared with recent development of the W3C validator".

So, even though the validators used are different, there is significant overlap between MAMA's validation study data and the other previous studies. The W3C Quality Assurance group has produced many excellent tools and processes over the years, and that hard work definitely deserves to be showcased in a study like this. Kudos to the W3C validator team!

What use is markup validation to an author?

Why would an author validate a document at all? A validator does not write a Web page for you—the inspiration and perspiration must still come completely from the author. There do not appear to be any real negative consequences to omitting this step. Sticking rigorously to a standard does not necessarily spell success—using a validator on a page and correcting any problems it brings to light does not guarantee that the result will look right in one browser, let alone all of them. Conversely, an invalid page may render exactly the way an author was expecting.

Both authors and readers have come to expect that all browsers perform impeccable error recovery in the face of the worst tag soup the Web can throw at them. Forgiveness is perhaps the most under-appreciated yet important feature we expect from a browser. However, that is asking a lot, especially for the increasingly lightweight devices that are being used to browse the Web. If there are any consequences for sloppy authoring practices, they would be here.

Henri Sivonen properly framed the role of the markup validator in an author's toolkit:

"[A] validator is just a spell checker for the benefit of markup writers so that they can identify typos and typo-like mistakes instead of having to figure out why a counter-intuitive error handling mechanism kicks in when they test in browsers."

Continuing with the spell-checker analogy, there are no dire consequences for a page failing to validate, just as there is seldom a serious consequence of having spelling typos in a document—the overall full meaning is still conveyed well enough to get the point across.

Using the spell-checker analogy also helps call into question a practice that the W3C encourages, something that we will talk more about in a later section—proclaiming that a page has been validated. This is a pointless exercise and means nothing (W3C tool evangelism aside). It is like saying a document has been spell-checked at some time during its history. Any subsequent change to a document can introduce errors—both spelling- and syntax-wise—and make the claim superfluous code baggage. As we will show in later sections, pages that have passed validation in the past often do not STAY validated!

Markup validation is a useful tool to help ensure that a page conforms to the target you are aiming for. The most obvious thing to take away from the entirety of the MAMA research is that people are BAD at this "HTML thing". Improper tag nesting is rampant, and misspelled or misplaced element and attribute names happen all the time. It is very easy to make silly, casual mistakes—we all make them. Validation of Web pages would expose all these types of simple (and avoidable) errors in moments.

For even more (and probably better) reasons to validate your documents, have a look at the W3C's excellent treatment of the subject: "Why Validate?".

How many pages validated?

The raw validation numbers

The validator's SOAP response has an <m:validity> element with Boolean content values of "true" and "false". A "true" value is considered a successful validation. MAMA found that 145,009 out of 3,509,180 URLs passed validation.
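
As a minimal sketch of this check (assuming the validator's SOAP 1.2 interface, output=soap12, as it existed at the time; a local validator install is kinder than the public service for bulk work), a single URL can be tested like this:

import re
import urllib.parse
import urllib.request

VALIDATOR = "http://validator.w3.org/check"  # public instance of the W3C Markup Validator

def passes_validation(url):
    """Return True if the validator's SOAP response contains <m:validity>true</m:validity>."""
    query = urllib.parse.urlencode({"uri": url, "output": "soap12"})
    with urllib.request.urlopen(VALIDATOR + "?" + query) as response:
        soap = response.read().decode("utf-8", "replace")
    match = re.search(r"<m:validity>\s*(true|false)\s*</m:validity>", soap)
    return bool(match and match.group(1) == "true")

print(passes_validation("http://www.opera.com/"))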

Fig 5-1: Validation pass rate studies
Study Date Passed validation Total validated Percentage
Parnas Dec. 2001 14,563 2,034,788 0.71%
Saarsoo Jun. 2006 25,890 1,002,350 2.58%
MAMA Jan. 2008 145,009 3,509,180 4.13%

Another interesting view of MAMA's URL validation study is how many domains in MAMA contained ANY page that validated: 130,398 of 3,011,661 distinct domains validated (4.33%).

Validation rates where select Web-page authoring features are also involved

Now, we need to ask the same basic "does it validate?" question multiple ways, keeping our main variable (validation rate) constant while varying other criteria. This has the potential to say some interesting things about the validation rates as a whole, while also providing insight into biases that can arise when mixing popular factors and technologies found in Web pages. Note: instead of listing overall URL totals, the totals mentioned are only for the URLs that use each technology.

Fig 5-2: Validation pass rates relating to various features
Quantities are per-URL; numbers in "[]" brackets indicate per-domain quantities.
Each entry reads: quantity validating, of total quantity using the technology, = percentage.

Script/JavaScript: 99,299 [90,233] of 2,617,828 [2,306,921] = 3.79% [3.91%]
  • Criteria: any "javascript:" URL; any external script pointed to by a SCRIPT element; any script embedded in a SCRIPT element; any known event-handler content (attributes beginning with "on")

CSS: 129,893 [117,361] of 2,821,141 [2,487,898] = 4.64% [4.72%]
  • Criteria: any Style attribute content; any content of a STYLE element; any external stylesheet pointed to by a LINK element (Rel="stylesheet")

Adobe Flash: 44,491 [41,058] of 1,176,227 [1,050,121] = 3.78% [3.91%]
  • Criteria: EMBED where the MIME type of the Src attribute contains "flash"; PARAM element containing the string ".swf" or "flash"; OBJECT where the MIME type of the object contains "flash"; any script mention of "flash" or ".swf"

Frames: 5,905 [5,741] of 378,033 [354,321] = 1.56% [1.62%]
  • Criteria: usage of the FRAMESET element

Iframes: 4,615 [4,238] of 222,462 [193,489] = 2.07% [2.19%]
  • Criteria: usage of the IFRAME element

Font: 29,723 [27,491] of 2,061,422 [1,762,528] = 1.44% [1.56%]
  • Criteria: usage of the FONT element (common, CSS-obsoleted formatting markup)

IIS Web Server: 24,743 [22,227] of 883,854 [769,375] = 2.80% [2.89%]
  • Criteria: detection of the "iis" string in the HTTP header Server field

Apache Web Server: 110,834 [99,866] of 2,347,328 [2,011,088] = 5.38% [4.97%]
  • Criteria: detection of the "apache" string in the HTTP header Server field
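
The bracketed per-domain figures can be read as "domains with at least one matching URL". A small sketch of how both views could be tallied follows (the record fields here are hypothetical, not MAMA's schema):

def feature_pass_rates(records):
    """records: iterable of dicts with hypothetical keys 'domain', 'uses_feature', 'valid'.
    Returns (per-URL, per-domain) tuples of (passing, total, percentage)."""
    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    url_pass = url_total = 0
    domains_using, domains_passing = set(), set()
    for r in records:
        if not r["uses_feature"]:
            continue
        url_total += 1
        domains_using.add(r["domain"])
        if r["valid"]:
            url_pass += 1
            domains_passing.add(r["domain"])
    per_url = (url_pass, url_total, pct(url_pass, url_total))
    per_domain = (len(domains_passing), len(domains_using),
                  pct(len(domains_passing), len(domains_using)))
    return per_url, per_domain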

Validation, content management systems (CMS), and editors

MAMA looked at the META "Generator" value to find popular CMS and editors in use for the following tables, looking for any noticeable trends in validation rates. One might expect per-domain numbers to be more interesting in this case than per-URL, because sites are often developed using a single platform, but there is very little difference between the two views. In general, CMS systems generate valid pages at markedly higher rates than the overall average, with "Typo3" variants leading at almost 13%. On the other hand, the editor situation has some wild differences. Microsoft's FrontPage has a VERY wide deployment rate, but a depressingly low validation pass rate of ~0.5%. Apple's iWeb editor, however, has a freakishly high validation rate. Kudos to iWeb for this happy discovery.

Fig 5-3: Validation pass rates relating to editors
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities
Editor: quantity passing validation / total occurrences / percentage
Apple iWeb: 2,051 [2,016] / 2,504 [2,465] / 81.91% [81.78%]
Microsoft FrontPage: 1,923 [1,846] / 347,095 [305,220] / 0.55% [0.60%]
Adobe GoLive: 1,086 [1,057] / 41,865 [39,035] / 2.59% [2.71%]
NetObjects Fusion: 802 [793] / 26,355 [25,466] / 3.04% [3.11%]
IBM WebSphere: 626 [585] / 32,218 [24,460] / 1.94% [2.39%]
Microsoft MSHTML: 518 [502] / 40,030 [38,328] / 1.29% [1.31%]
Microsoft Visual Studio: 272 [245] / 22,936 [21,051] / 1.19% [1.16%]
Adobe Dreamweaver: 205 [198] / 5,954 [5,647] / 3.44% [3.51%]
Microsoft Word: 154 [153] / 24,892 [22,503] / 0.62% [0.68%]
Adobe PageMill: 100 [92] / 15,148 [12,142] / 0.66% [0.76%]
Claris Home Page: 48 [41] / 6,259 [4,798] / 0.77% [0.85%]
Fig 5-4: Validation pass rates relating to CMS
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities
CMS: quantity passing validation / total occurrences of CMS / percentage
Typo3: 2,301 [2,170] / 18,067 [16,930] / 12.74% [12.82%]
Joomla: 2,248 [2,233] / 34,852 [34,237] / 6.45% [6.52%]
WordPress: 1,494 [1,472] / 16,594 [16,046] / 9.00% [9.17%]
Blogger: 30 [30] / 9,907 [9,808] / 0.30% [0.31%]

Interesting views of validation rates, part 1: W3C-Member companies

The W3C is the organization that creates the markup standards and the markup validator used in this study. One would hope that the individual companies that support and comprise the W3C would spearhead the effort to follow the standards that the W3C creates. Well, it turns out that is indeed the case. The top pages of W3C-member companies definitely adhere to markup standards at much higher rates than the rest of the Web. However, these "standard-bearers" (pun intended) could definitely do better at this than they currently do.

In February 2002, Marko Karppinen validated 506 URLs of all the W3C-member companies at that time. Only 18 of these pages passed validation. Compared to Parnas's validation study of the DMoz URLs just two months before, the W3C-member company validation rate of 3.56% was considerably better than the 0.7% rate for URLs "in the wild", but it is nothing for the paragons of Web standards to brag about. Such a low validation pass rate could easily be perturbed by any number of transient conditions or other factors.

Saarsoo also did a study of W3C-member company validation rates in Jun. 2006. By that point, the validation situation had improved nicely for the member companies to 17.00%. Fast-forwarding now to Jan. 2008 [W3C-member-company list snapshot], and we see that the general Web-at-large has caught up to, and even exceeded, the previous validation pass rate of W3C-member companies from Karppinen's study era. The general validation pass rate in the DMoz population is now running at ~4.13%, and the W3C-member company pass rate is a strong 20.15%, with more member companies than ever claiming the validation crown.

Fig 6-1: W3C-Member-company list validation studies
W3C-member list study Date Total in member list Total validated Passed validation Percentage
Marko Karppinen Feb. 2002 506 506 18 3.56%
Saarsoo Jun. 2006 401 352 61 17.00%
MAMA Jan. 2008 429 412 83 20.15%

Just showcasing the increased validation rate does not tell the whole story. Saarsoo left an excellent data trail against which to compare the present validation pass rate. It is interesting to note that, although the overall pass rate has increased, many of the sites that passed validation previously no longer do so at the time of writing. Achieving a passing validation status does not seem to be as hard as maintaining that status over time. Compared to Saarsoo's study, there are just as many URLs that previously validated but currently do not as there are URLs that maintained their passing validation status.

Fig 6-2: Validation comparison to Saarsoo's W3C-Member-Company study
Validation comparison Quantity
URLs that validated before and do now 25
URLs that validated before but do not now and are still in W3C-member-company list 25
URLs that validated before but are no longer in W3C-member-company list 11

Saarsoo commented in 2006 on the dynamic nature of the W3C company roster. In early 2002 there were 506 member companies; this dipped to 401 by mid-2006, and at the present time (early 2008) the list is back up to 429. To put the change in some perspective, the net loss of companies from the list over this time-frame is 77, which is almost as many companies as the number that currently pass validation. Put simply, a pessimist might say that a company on this list is just about as likely to drop out of the W3C as it is to achieve a successful validation.

The W3C-Member List successful validation Honor Roll

In his 2002 study, Karppinen prominently listed the W3C-member companies whose main URLs passed validation in order to,

"highlight the effort that goes into making an interoperable web site".

This is an excellent idea and is becoming a bit of a time-honored tradition that both the Saarsoo study and this one have followed. The first list from Karppinen was easy to keep in line with the rest of the study, because it was (unfortunately) short and sweet. As the pass rate has improved over time, this list has become progressively longer. This is the goal, though; everyone wants the list to be too long to display easily. [See the Honor Roll list here.]

And the crown goes to ...

Two companies' URLs have maintained valid sites throughout all three studies from 2002-2008. These companies deserve extra congratulations for this feat.

Many sites are constantly changing, but being a member of an organization that creates standards should be motivation enough to attain a recognized level of excellence in those standards. Saarsoo ended his 2006 look at the W3C-member list with an optimistic wish for the future,

"Maybe at 2008 we have 50% of valid W3C member sites."

Unfortunately, that number is nowhere close to the current reality. It may be too much for the W3C to require its member-companies' sites to pass validation, but they should definitely try to push for higher levels than they currently attain, to serve as a good example if nothing else.

Page 1 index : Page 2 index : Page 3 index

  1. Interesting views of validation rates, part 2: Alexa Global Top 500
  2. Validation badge/icons: An interesting diversion?
  3. Doctypes
  4. Character sets

Interesting views of validation rates, part 2: Alexa Global Top 500

About the Alexa Global Top 500

Now, we will look at another "interesting" small URL set, the Alexa service from Amazon. Alexa utilizes Web crawling and user-installed browser toolbars to track "important sites". It maintains, among many other useful measures, a global "Top 500" list of URLs considered popular on the Web. The Alexa list was chosen primarily because it is similar in size to the W3C list—so even though MAMA might be comparing apples to oranges, at least it compares a fairly equal number of apples and oranges. The W3C-company list skews toward academic and "big money" commercial computer sites, while the Alexa list is representative of what people actually use and experience on the Web on a day-to-day basis.

While few would dispute that Alexa's "Top 500" list is relevant and popular, there are some definite biases in it:

  • It is prejudiced toward big/popular sites with many country-specific variants, such as Google, Yahoo!, and eBay. This ends up reducing the breadth of the list. Google is the most extreme example of this, with 63 of the 487 URLs in the analyzed set being various regional Google sites.
  • It includes the top pages of domain aggregators with varied user content, such as LiveJournal, Facebook, and fc2.com. These top pages are not representative of the wide variety of the user-created content they contain.
  • The list consists entirely of top-level, entrance, or "surface" pages of a site. There is no intentional "deep" URL representation.

Validating the Alexa Top 500

On 28 January 2008, the then-latest Alexa Top 500 list was inserted into MAMA [January 2008 snapshot list, latest live version]. About half of these URLs were already in MAMA, having been part of other sources. Of the 500 URLs in this list, 487 were successfully analyzed and validated. Only 32 of these URLs passed validation (6.57%). This is a slightly higher pass rate than that of the much larger overall MAMA population, but the quantity and difference are still too small to declare any trends.

Fig 7-1: Alexa Top 500 validation studies
Alexa Top 500 List study Date Passed validation Total set size Percentage
MAMA Jan. 2008 32 487 6.57%

For future Alexa studies

OK, so the Alexa Top 500 does have some drawbacks. Should the URL set be tossed out entirely? Can this set be improved? Aside from the Top 500, Alexa has a very deep catalog and categorization of URLs, some of them available freely, but most are available only for a fee. Some categories of URLs include division by country and by language. Alexa currently has publicly-available lists of the top 100 URLs for 21 different languages (2,100 URLs) and 117 countries (11,700 URLs). Note: The per-country list represents popularity among users in a country, not sites hosted in the country. An undoubtedly-interesting expanded list of the Alexa Global Top 500 could be created by aggregating all of these sources, which would probably yield 5,000-10,000 URLs (if duplicates were eliminated).

If the validation rates of the Alexa Global Top 500 are studied in the future, the then-current version of the Top 500 list of URLs will likely be quite different than it is at this time of writing. The topicality of the list is a strength that promotes the relevance of the analysis, but it also makes cross-comparisons over time difficult. Documenting the list that was used in each analysis will be helpful in doing that.

Validation badge/icons: An interesting diversion?

Before MAMA had validated even a single URL, the author discovered this page at the W3C's site: http://www.w3.org/QA/Tools/Icons. This page lists icons that,

"may be used on documents that successfully passed validation for a specific technology, using the W3C validation services".

It seemed like an interesting idea to compare the pages that were using these images claiming validation with how they actually validate. This can only be a crude measure for a number of reasons, but, by far, the main one is as follows: an author can easily host the validation icon/badge on their own server and name it anything they want.

For those gearheads in the audience who have some "regexp savvy", the following Perl regular expression was used to identify validation icon/badges utilizing the W3C naming scheme. This pattern match was used against the Src attribute of the IMG elements of URLs analyzed:

Regexp:
/valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$/i || /(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$/i

This seems to capture fully all the variations of the W3C's established naming conventions (any corrections are very welcome if it does not). Note that the regexp errs on the cautious side and can also capture unintended matches like JPEG files matching the naming scheme. One might think this an error, but it turns out it is not. JPEG versions of the validation icons are not (currently) listed on the W3C's Web site, but a random spot-check of JPEG images thus detected by MAMA ARE validation badge icons! In this case, what appears to be false-positives are actually valid after all.

Ex: http://www.w3.org/Icons/valid-html401-blue.png is stored as 'html401-blue'
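
For illustration, here is a rough Python translation of the Perl pattern above, applied to an IMG Src value. Combining the first and third capture groups to form the stored label (as in the example above) is an assumption about how MAMA built the stored value, not a documented detail:

import re

BADGE_RE = re.compile(
    r"valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$",
    re.IGNORECASE)
WCAG_RE = re.compile(r"(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$", re.IGNORECASE)

def badge_label(img_src):
    """Return the badge label claimed by an IMG Src, or None if it does not look like a W3C badge."""
    m = BADGE_RE.search(img_src)
    if m:
        return (m.group(1) + (m.group(3) or "")).lower()
    m = WCAG_RE.search(img_src)
    return m.group(1).lower() if m else None

print(badge_label("http://www.w3.org/Icons/valid-html401-blue.png"))  # html401-blue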

Validation rates of URLs having validation badge/icons

Now we will look at the list of W3C Validation Image Badges found in MAMA by URL [also by domain]. Even with the various pitfalls that could occur with MAMA's pattern matching, there is still a comparison that is interesting to explore: how many pages that use a badge actually validate? If we consider that the only type of badge of real interest in our sample is an HTML variant (html, xhtml), looking for the substrings "html" and "xhtml" within this field in MAMA gives us:

Fig 8-1: Validation rates of URLs with validation icons
Type of badge identified Total Actually validated Percentage
xhtml 11,657 5,480 47.01%
html 22,033 10,995 49.90%

This is just under 50% in each case, which is frankly a rather miserable hit ratio. If these URLs do not validate, do they bear ANY resemblance to the badge they are claiming?

Comparison of stated validation badge/icon type versus actual detected Doctype

Next, we will compare the actual Doctypes detected against the badges claiming compliance to those respective Doctypes. Doctypes detected in both the validator and MAMA analyses are listed for comparison. The situation definitely improves here over the previous figures. Note: Fatal validation errors cause the validator to under-report Doctypes, because it reports no Doctype at all in such cases.

Fig 8-2: Reported validation icon type versus MAMA-detected Doctype
Type of badge identified Validator-detected Doctype MAMA-detected Doctype Total according to badge/icon
xhtml 10,553 11,054 11,657
html 20,570 21,475 22,033

The validation badges certainly increase public awareness of validation as something for which authors strive, but they do not appear to be the best measure of reality. For the half of badged URLs that claim validation compliance but currently do not validate, one has to wonder whether they ever did validate in the past. Pages definitely tend to change over time, and removing or updating an icon badge may not be high on most authors' lists of "Things To Do". The next time you see such an icon, consider its current state with a grain of salt.

For future W3C badge studies

After this survey was completed, the following rather prominent quote was noticed on the W3C's Validation Icons page,

"The image should be used as a link to re-validate the document."

It may be useful to incorporate this fact to identify further validation badges in the future.

Doctypes

What are we examining?

First up is the Doctype. The Doctype statement tells the validator which DTD to use when validating—it is the basic evaluation metric for the document. MAMA used its own methods to divine the Doctype for every document, but the validator actually detects the Doctype in two slightly different ways: one by the validator itself and the other by the SGML parser at the core of the validator.

Fig 9-1: Detected Doctype factors used in this study
Source of Doctype Information being used
MAMA Detected Doctype statement
Validator SOAP <m:doctype> content
Validator 'W09'/'W09x' warning messages

This is a good time to dissect a Doctype and see what makes it tick. We will look at a typical Doctype statement, and examine all of its parts:

Ex: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Fig 9-2: Components of a Doctype statement
Component Description
"<!DOCTYPE" The beginning of the Doctype
"html" This string specifies the name of the root element for the markup type.
"PUBLIC" This indicates the availability of the DTD resource. It can be a publicly-accessible object ("PUBLIC") or a system resource ("SYSTEM") such as a local file or URL. HTML/XHTML DTDs are specified by "PUBLIC" identifiers.
"-//W3C//DTD XHTML 1.0 Transitional//EN" This is the Formal Public Identifier (FPI). This compact, quoted string gives a lot of information about the DTD, such as its Registration, Organization, Type, Label, and the Encoding language. For HTML/XHTML DTDs, the most interesting part of this is the label portion (the "XHTML 1.0 Transitional" part). If the processing entity does not already have local access to this DTD, it can get it from the System Identifier (next portion).
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" The System Identifier (SI); the URL location of the DTD specified in the FPI
">" The ending of the Doctype

MAMA's analysis stores the entire DOCTYPE statement, but the validator's SOAP response only returns a portion of it—generally the FPI, but some situations may return the SI instead, or even nothing at all if an error condition is detected. These situations are infrequent, though; only 70 URLs analyzed by the validator returned the Doctype's SI, for example.
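
As an illustration of pulling these pieces apart (a simplified regular-expression sketch, not the validator's or MAMA's parsing code; a SYSTEM Doctype's single quoted identifier would land in the fpi group here):

import re

DOCTYPE_RE = re.compile(
    r"""<!DOCTYPE\s+(?P<root>[^\s>]+)        # root element name, e.g. html
        (?:\s+(?P<avail>PUBLIC|SYSTEM)       # availability keyword
           (?:\s+"(?P<fpi>[^"]*)")?          # Formal Public Identifier (PUBLIC case)
           (?:\s+"(?P<si>[^"]*)")?           # System Identifier (DTD URL)
        )?\s*>""",
    re.IGNORECASE | re.VERBOSE)

doctype = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
m = DOCTYPE_RE.search(doctype)
print(m.group("root"))  # html
print(m.group("fpi"))   # -//W3C//DTD XHTML 1.0 Transitional//EN
print(m.group("si"))    # http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd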

No Doctypes

The validator examined 3,509,180 URLs overall. Of those, the validator says that 1,474,974 (42.03%) "definitely" did not use a DOCTYPE (indicated by an empty content for the <m:doctype> element in the SOAP response). In addition to the empty <m:doctype> element in the SOAP response, the validator also returns explicit warnings in instances where it does not encounter a Doctype statement: specifically, warning codes 'W09' and 'W09x' are generated by the SGML parser layer of the validator. Is there any correlation between these warning codes and the "official" empty Doctype mentioned in the SOAP response? The quick answer is yes. Some 1,373,352 URLs have either the 'W09' or 'W09x' warnings. Looking closer for a direct correlation, 1,371,899 URLs were issued a 'W09'/'W09x' warning AND do not have a Doctype listed in the SOAP response. This leaves 1,453 URLs that had some sort of validator-detectable Doctype but were still issued a warning for No Doctype. Sampling several URLs from this set showed that, in every case, the Doctype statement was not at the very beginning of the document. So, it appears that the OpenSP parser does not like this, but the validator itself is OK with this scenario.

MAMA also looked at Doctypes in its main analysis. We have compared cases where both tools found no Doctype. MAMA found 1,720,886 URLs without a Doctype. This is a rather large discrepancy compared to the validator's numbers above. We must alter this figure further because the SOAP response for a validation failure error returns empty <m:doctype> and <m:charset> elements. To improve the quality of our comparison between MAMA and the validator's results, we must exclude from our mutual examination all URLs with a positive validator failure count. After this minor adjustment, the numbers are much more in line with each other. To the numbers:

Fig 9-3: Scenarios where Doctype is not present
Situation Qty
MAMA detected no Doctype. 1,465,367
Validator detected no Doctype. 1,474,974
MAMA and validator both detected no Doctype. 1,423,478
MAMA detected no Doctype, but the validator did. 41,889
Validator detected no Doctype, but MAMA did. 51,496

The final two numbers are the most interesting. These discrepancies are still quite large (~3% of the overall 'no Doctype detected' count). What could account for this? Some reasons noticed for the differences (there could be others):

  • MAMA did not look for a Doctype in the destination document of a META refresh/redirect. The validator appears to do this.

    Ex: http://disneyworld.disney.go.com/wdw/parks/parkLanding?id=TLLandingPage

  • MAMA does not request or handle gzipped content, but it was occasionally served to it anyway. The validator appears to handle this.
  • MAMA looked anywhere in the document for a Doctype, but the validator only looks near the beginning of the document. A rather large set of URLs unfortunately fits this description.

    Ex: http://www.ruready.com/

  • URL content can change over time, including the addition or deletion of Doctypes. MAMA's analysis occurred in November 2007, and the validation of those same URLs happened in January 2008—over 2 months later. In sampling random parts of the URL set where MAMA did not initially detect a Doctype, a current, live analysis by MAMA does indeed detect a Doctype in most cases tried. Other than a bug existing in MAMA (unfortunately, always possible in any software), this is the best explanation to put forth.

Doctype statement present details

What about URLs that had validator-detectable Doctypes? We will linger on the comparison between MAMA's Doctype detection and the Validator's before looking in depth at what those Doctypes were.

Fig 9-4: Scenarios where Doctype is present
Situation Qty
MAMA detected a Doctype. 1,788,294
The validator detected a Doctype. 1,625,509
MAMA and the validator both detected a Doctype, and it was the same. 1,583,620
MAMA and the validator both detected a Doctype, and it was different. 36,119

Where MAMA and the validator both found a Doctype, they disagree 2.28% of the time. Other than the aforementioned time delay between the MAMA and validator analyses, could there be other reasons to account for this difference? Scanning a list of results for MAMA/validator Doctypes that differed, there may indeed be a trend—and a positive one at that. Of the 36,119 URLs that changed Doctype, 23,390 of them (64.76%) changed from an HTML Doctype to an XHTML Doctype. There are a few reasons mentioned above that could be affecting these results, and the above numbers could be a coincidence, but this looks like a data point supporting the gradual shift from HTML to XHTML.

To summarize the per-URL and per-domain frequency tables for validator Doctype, Transitional FPI flavors have a lock on the top three most popular positions. The other variants trail far behind. If a document has a Doctype, it is likely to be a Transitional flavor of XHTML 1.0 or (even more likely) HTML 4.0x. XHTML 1.0 Strict dominates over any other Strict variant (98% of all Strict types).

Totals for common substrings found in the validator Doctype field

A survey of the FPIs the validator exposed is like a microcosm of the evolution of HTML—there are documents claiming to adhere to "ancient" versions from the early days all the way through to the language's present XHTML incarnations. Searching for a few, well-chosen substrings demonstrates this variety well, and we can see how well an author's choice of Doctype FPI translates into actually passing validation. Out of the 1,625,509 URLs exposing a Doctype to the validator, Strict Doctypes pass validation twice as often as the other flavors, and XHTML Doctypes are much more heavily favored for passing validation than other Doctypes. More could be said about the final two items in the table below (to say the least), but that is left for a future discussion.

Fig 9-5: Detection of substrings in the Doctype field
Doctype flavor Qty Percentage of total Passing validation Percentage of flavor
"Transitional" 1,341,024 82.50% 112,348 8.38%
"Strict" 100,002 6.15% 17,502 17.50%
"Frameset" 57,225 3.52% 4,133 7.22%
Doctype markup language Qty Percentage of total Passing validation Percentage of markup language
" html 4" (HTML 4 variants) 987,701 60.76% 66,535 6.74%
" xhtml 1.0" 544,622 33.50% 71,537 13.14%
" html 3.2" 44,642 2.75% 1,753 3.93%
" xhtml 1.1" 19,984 1.23% 4,074 20.39%
" html 2" 4,792 0.29% 176 3.67%
" html 3.0" 884 0.05% 44 4.98%
"WAP" 789 0.05% 468 59.32%
" xhtml 2" 11 0.00% 0 0.00%

The studies from Parnas and Saarsoo did not use the W3C validator, and, as a consequence, there was not such an extreme focus on Doctype usage. Generally, the validator they used only tracked whether a Doctype was used at all. The main reported error type in Parnas' study was a missing Doctype, with only 18.8% of URLs having one present. By the time of Saarsoo's study, the number of URLs having a Doctype moved up to 39.08%. Fast-forward to now, and that number has grown considerably yet again—to 57.7% according to the W3C validator. This is a very respectable increase over time. If few authors are actually creating valid documents, at least most of them seem to understand that there IS a standard to which they should be adhering.

Doctypes for our small, special interest URL sets

Backtracking just a little, the next two tables are a quick look at the Doctypes used for the W3C-member-company URLs and the Alexa Top 500 list. Almost 76% of the URLs passing validation in the W3C-company set use XHTML variants; in the Alexa list it is almost 66%.

Fig 9-6: Doctype FPIs of W3C-Member-Company Web sites and validation rates
Doctype FPI Passed validation Total Percentage of FPI type
-//W3C//DTD XHTML 1.0 Transitional//EN 36 145 24.83%
-//W3C//DTD XHTML 1.0 Strict//EN 23 45 51.11%
-//W3C//DTD HTML 4.01 Transitional//EN 16 95 16.84%
-//W3C//DTD XHTML 1.1//EN 4 8 50.00%
-//W3C//DTD HTML 4.0 Transitional//EN 3 22 13.64%
-//W3C//DTD HTML 4.01//EN 1 7 14.29%
-//W3C//DTD HTML 3.2//EN 0 1 0.00%
-//W3C//DTD HTML 4.01 Frameset//EN 0 1 0.00%
-//W3C//DTD HTML 3.2 Final//EN 0 1 0.00%
-//W3C//DTD XHTML 1.0 Strict//FI 0 1 0.00%
-//W3C//DTD XHTML 1.0 Frameset//EN 0 1 0.00%
[None] 0 85 0.00%

 

Fig 9-7: Doctype FPIs of Alexa Top 500 Web sites and validation rates
Doctype FPI Passed validation Total Percentage of FPI type
-//W3C//DTD XHTML 1.0 Strict//EN 10 37 27.03%
-//W3C//DTD XHTML 1.0 Transitional//EN 9 130 6.92%
-//W3C//DTD HTML 4.01 Transitional//EN 5 77 6.49%
-//W3C//DTD HTML 4.0 Transitional//EN 3 22 13.64%
-//W3C//DTD HTML 4.01//EN 2 12 16.67%
-//W3C//DTD XHTML 1.1//EN 2 5 40.00%
-//iDNES//DTD HTML 4//EN 1 1 100.00%
-//W3C//DTD HTML 4.01 Frameset//EN 0 1 0.00%
-//W3C//DTD XHTML 1.1//EN http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd 0 1 0.00%
-//W3C//DTD XHTML 1.0 Strict //EN 0 1 0.00%
-//W3C//DTD XHTML 1.0 Transitional//ES 0 1 0.00%
-//W3C//DTD HTML 4.0 Strict//EN 0 1 0.00%
[None] 0 193 0.00%

Character sets

In the previous section on Doctypes, there were many ways to look at just a single variable (presence or lack of a Doctype). Now, with character sets, it becomes even more complex. Even a simplistic view of character set determination can involve at least three aspects of a document. MAMA, the validator, and the validator's SGML parser ALL have something to say about the choice of a document's character set. To cover every permutation and difference between the many possible charset specification vectors would definitely exhaust the author and most likely bore the reader. Every effort will be made to present some of this data in a way that is not TOO overwhelming.

There are three main areas of interest when determining the character set to use when validating a document:

  • The charset parameter of the Content-Type field in a document's HTTP Header
  • The charset parameter of the Content attribute for a META "Content-Type" declaration
  • The encoding attribute of the XML prologue

For brevity, these will be shortened to "HTTP", "META", and "XML" respectively.
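
A simplified sketch of extracting these three values from a response follows (the regular expressions here are approximations for illustration, not MAMA's parser; the META pattern assumes the http-equiv attribute precedes the content attribute):

import re

def charset_sources(http_content_type, document_text):
    """Return the HTTP, META and XML charset declarations found (None where absent)."""
    def find(pattern, text):
        m = re.search(pattern, text or "", re.IGNORECASE)
        return m.group(1).lower() if m else None

    return {
        # HTTP: Content-Type: text/html; charset=utf-8
        "HTTP": find(r"charset\s*=\s*([\w.-]+)", http_content_type),
        # META: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
        "META": find(r'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
                     r'charset\s*=\s*([\w.-]+)', document_text),
        # XML prologue: <?xml version="1.0" encoding="UTF-8"?>
        "XML": find(r'<\?xml[^>]*encoding\s*=\s*["\']([\w.-]+)', document_text),
    }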

Character set differences between MAMA and the validator

An important difference exists between MAMA and the validator when talking about character sets. There is an HTTP header that allows a request to specify which character sets it prefers. MAMA sent this "Accept-Charset" header with a value of "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1". This header field value is used by Opera (9.10), and MAMA tried to emulate this browser as closely as possible. The character sets that were specified reflect the author's own particular language bias. The validator is another story. It does not send an "Accept-Charset" header field at all. This may cause differences between the two and affect the reported character set results.
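
For reference, this is roughly what attaching that preference to a request looks like (a generic sketch only; MAMA itself emulated Opera 9.10 rather than using this code):

import urllib.request

request = urllib.request.Request(
    "http://www.example.com/",
    headers={"Accept-Charset": "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1"})
with urllib.request.urlopen(request) as response:
    # charset taken from the HTTP Content-Type header, if the server sent one
    print(response.headers.get_content_charset())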

MAMA's view of character sets

First up is a look at what MAMA was able to determine about these three fields, and how they are used in combination with each other. The totals here account for all cases where a non-empty value was present for any of the HTTP/META/XML charset specification types. The following tables show the frequencies for the different ways that character sets are established and mixed. A document can have none, any, or all of these factors. Note: The XML level in Fig 10-1 appears to be very low in comparison to the other specification methods, but this is because the number of documents with an XML declaration is also rather low. Looked at in this way, that ratio is actually the highest, even more favorable than the META case, at 96,264 of 104,722 URLs (91.92%). Fig 10-2 offers a breakdown of all the combinations of ways to specify a character set. By a large majority, authors do this using only the META element method. The final table, Fig 10-3, shows what happens when more than one source for a character set existed in a document, and whether these multiple values agreed with one another.

Fig 10-1: MAMA—How character sets are specified
Charset source Number of occurrences Total where any charset specified Percentage where any charset specified
HTTP 686,749 2,626,206 26.15%
META 2,361,221 2,626,206 89.91%
XML 96,264 2,626,206 3.67%

 

Fig 10-2: MAMA—How character sets are specified in combination
Charset specified in Quantity Total where any charset specified Percentage where any charset specified
HTTP only 240,349 2,626,206 9.15%
META only 1,872,497 2,626,206 71.30%
XML only 17,858 2,626,206 0.68%
HTTP and META 417,109 2,626,206 15.88%
HTTP and XML 6,791 2,626,206 0.26%
META and XML 49,115 2,626,206 1.87%
All three sources 22,500 2,626,206 0.86%

 

Fig 10-3: MAMA—How character sets disagree when specified in combination
Specified charset sources Disagree Total Percentage
HTTP and META 123,245 417,109 29.55%
HTTP and XML 2,238 6,791 32.96%
META and XML 4,086 49,115 8.32%
All three sources 4,399 22,500 19.55%

The validator's view of character sets

Now, we will look at the way the markup validator views charset information. The validator generally looks for the same three document sources mentioned previously to determine charset information. Before looking at these actual charset values, it is useful to examine whether the validator's view of charset information is internally consistent or not. It can also be instructive to compare, where possible, the validator's view of charset information versus MAMA's view.

To directly compare validator and MAMA charset information, we must remove some URLs from consideration. The validator's SOAP response returns an empty charset value in all cases where there is a validator failure. It is useful to know if the validator is returning a "truly" empty charset value, so all URLs with a failure error are removed from the examination set for this phase. This immediately reduces our URL group by 408,687 URLs.

The items of interest to look at in the validator response are the contents of the <m:charset> element and warnings issued for no detected charset or charset value mismatch from differing sources. We will explore how/if all these factors mesh when the validator is determining which charset to use.

Validator-detected charsets versus MAMA-detected charsets

The following table is mostly for sanity checking, to see if the validator's results resemble MAMA's results. The first two entries have very low totals, but they may involve some corner cases of charset detection worth a second glance. The third case is a definite indication that the validator has default fallback values for the character set when none is detected through the typical methods.

Fig 10-4: Validator versus MAMA charset detection
Validator charset detected / Scenario / Total
No / No MAMA charsets detected / 47
No / MAMA charset detected / 1,179
Yes / No MAMA charsets detected / 592,361
Yes / Validator also issued: "Warning! Conflicting charsets..." message / 118,367
Yes / Validator also issued: "Warning! No charset found..." message / 480,942

Validator Warning 04 issued: No character encoding found

This table might be a little confusing with some of the double negatives being tossed around. The presence of a Warning 04 means that the SGML parser portion of the validator did not detect a character set. This result may differ from what the validator ends up deciding should be used for the charset. Note that Row 2 is the sum of rows 1, 3, and 4. Row 6 is the sum of rows 5, 7, and 8. Row 5 is another indication that the validator uses a default character set value.

Fig 10-5: Validator Warning 04 scenarios
Warning 04 issued / Charset state / Total
No / No validator charset detected / 1,226
No / Validator charset detected / 2,618,315
No / No MAMA charset detected / 137,286
No / MAMA charset detected / 2,482,255
Yes / No validator charset detected / 0
Yes / Validator charset detected / 480,942
Yes / No MAMA charset detected / 455,122
Yes / MAMA charset detected / 25,820

Validator Warnings 18-20 issued: Character encoding mismatches

In these cases, the validator discovers more than one encoding source, and there is some disagreement between them. The validator does not say what the disagreement was, so for some idea, we can look at the data MAMA discovered about these sources. Note that the final row in each table is the expected scenario for the warning to be generated; naturally, those totals are the highest by a wide margin. URLs from the other rows may merit further testing, but there is one reason mentioned before that can explain at least some of these quantities: the two-month delta between MAMA's analysis and the validator's analysis of the URL set.

Fig 10-6: Warning 18: Character encoding mismatch (HTTP Header encoding/XML encoding)
MAMA detected HTTP / MAMA detected XML / Additional factor / Total
Yes / No / -- / 483
No / Yes / -- / 70
Yes / Yes / Both agree / 80
Yes / Yes / Both different / 2,517

 

Fig 10-7: Warning 19: Character encoding mismatch (HTTP Header encoding/META encoding)
MAMA detected HTTP / MAMA detected META / Additional factor / Total
Yes / No / -- / 6,712
No / Yes / -- / 4,485
Yes / Yes / Both agree / 4,153
Yes / Yes / Both different / 97,028

 

Fig 10-8: Warning 20: Character encoding mismatch (XML encoding/META element encoding)
MAMA detected XML / MAMA detected META / Additional factor / Total
Yes / No / -- / 79
No / Yes / -- / 50
Yes / Yes / Both agree / 88
Yes / Yes / Both different / 992

Validator-detected charset values

We have saved the best of our character set discussion for last: what values are actually used by the validator for character set? (We will be looking at similar frequency tables for each of the MAMA-detected charset sources (HTTP header, META, XML) in another section of this study.) The full per-URL and per-Domain frequency tables for validator charset show very little movement between the two—you have to go down to #17 before there is a difference! Below is an abbreviated per-URL frequency table for validator character-set values (out of 243 unique values found for this field).

Fig 10-9: Validator character-set short frequency table
Validator charset value Frequency Percentage
iso-8859-1 1,510,827 43.05%
utf-8 943,326 26.88%
windows-1252 293,595 8.37%
shift_jis 87,593 2.50%
iso-8859-2 60,663 1.73%
windows-1251 51,336 1.46%
windows-1250 30,353 0.86%
gb2312 19,412 0.55%
iso-8859-15 12,276 0.35%
big5 11,395 0.32%
windows-1254 9,756 0.28%
iso-8859-9 9,091 0.26%
us-ascii 8,134 0.23%
euc-jp 7,174 0.20%
x-sjis 5,564 0.16%
euc-kr 4,768 0.14%

Page 1 index : Page 2 index : Page 3 index

  1. Validator failures
  2. Validator warnings
  3. Validator errors
  4. Summing up ...
  5. Appendix: Validation methodology

Validator failures

When the validator runs into a condition that does not allow it to validate a document, a failure notice is issued. The validator defines nine different conditions as fatal errors, but MAMA only encountered four of them among all the URLs it has processed through the validator. It is certainly possible that MAMA's URL selection mechanism helped prevent the other failure types from occurring. Some 408,920 URLs out of the 3,509,180 URLs validated (11.65%) officially failed validation for various reasons.

Fig 11-1: Validator failure modes
Failure type Detected in MAMA Explanation
Transcode error No Occurs when attempting to transcode the character encoding of the document
Byte Error Yes Bytes found that are not valid in the specified character encoding
URI Error No The URL Scheme/protocol is not supported by the validator
No-content error No No content found to validate
IP Error No IP address is not public
HTTP Error Yes Received unexpected HTTP response
MIME Error Yes Unsupported MIME type
Parse External ID Error Yes Reference made to a system-specific file instead of using a well-known public identifier
Referer Error No Referer check requested but 'Referer' HTTP header not sent

Frequencies of failure types in MAMA

By far, the most common failure type is the "Fatal Byte Error", which occurred 300,008 times (8.55% of all URLs validated). This error type occurs when characters in the document are not valid in the detected character encoding. This is an indication to the validator that it cannot trust the information it has about the document, so it chooses to quit trying rather than attempt to validate incorrectly.

An additional failure mode relating to MAMA's processing of the validator's activities should be mentioned. If MAMA did not receive a response back from the validator, or some other (possibly) temporary factor caused an interruption between MAMA and the validator, an "err" message code was generated. MAMA encountered this type of error 34,950 times out of the 3,509,180 URLs (1.00%) that were passed to the validator. Note that MAMA has not yet tried to re-validate any of these URLs. There are various pluses and minuses to dismissing the "err" state, or any other validator failure mode from the overall grand total of URLs validated. These failed URLs remain in the final count, but if you disagree, there is enough numerical data to be able to arrive at your own tweaked numbers and percentages.

Fig 11-2: Validator failures in MAMA's URLs
Failure type Number of occurrences Percentage
Fatal byte error 300,008 8.55%
Fatal HTTP error 63,908 1.82%
err 34,950 1.00%
Fatal Parse Extid error 8,360 0.24%
Fatal MIME error 1,709 0.05%

Number of failures

A field was created in the MAMA database to store the number of failures encountered in a document. The expectation was that the validator could only experience one failure mode at a time, so this field would hold either a '0' or '1'. Imagine the surprise when 248 URL cases registered as having two failure types at the same time! It turns out that in every one of these cases, it was the "Fatal Byte Error" and "Fatal MIME Error" occurring at the same time.
[Note: 98 of the 248 URLs returning these double-failure modes are definitely text files (ending in ".txt") and should be removed from consideration]

Fig 11-3: Number of failures per URL
Number of failures Number of occurrences Percentage
0 3,100,484 88.35%
1 408,439 11.64%
2 248 0.01%

Validator warnings

The validator issues a Warning if it detects missing or conflicting information important for the validation process. In such cases, the validator must make a "best guess"; if it guesses wrong, the entire validation result can be invalidated. The validator suggests that all Warning issues be addressed so that it can produce results that have the highest confidence.

The validator can produce 27 different types of Warnings, but MAMA only encountered 14 of them in its journeys through DMoz and friends. A specific Warning type will only be issued once for a URL if it is encountered, but multiple Warning types can be issued for the same URL.

Frequencies of Warning types

The most common Warning type in MAMA's URL set was W06/"Unable to determine parse mode", with W09/"No DOCTYPE found" coming a close second. These two each dwarf all other Warning types combined by a factor of two. For full explanations of the Warning codes, see the Validator CVS.

Fig 12-1: Validator Warning-type frequency table
Warning code Explanation Frequency Percentage
W06 Unable to determine parse mode (XML/SGML) 1,585,029 45.17%
W09 No DOCTYPE found 1,372,864 39.12%
W04 No character encoding found 480,942 13.71%
W19 Character encoding mismatch (HTTP header/META element) 113,927 3.25%
W11 Namespace found in non-XML document 65,807 1.88%
W23 Conflict between MIME type and document type 19,097 0.54%
W21 Byte-order mark found in UTF-8 File 17,148 0.49%
W22 Character Encoding suggestion: use XXX instead of YYY 8,237 0.23%
W24 Rare or unregistered character encoding detected 7,149 0.20%
W18 Character encoding mismatch (HTTP header/XML encoding) 3,220 0.09%
W20 Character encoding mismatch (XML encoding/META element) 1,220 0.04%
W09x No DOCTYPE found. Checking XML syntax only 488 0.01%
W07 Contradictory parse modes detected (XML/SGML) 72 0.00%
W01 Missing 'charset' attribute (HTTP header for XML) 21 0.00%

Warnings in combination

MAMA never encountered more than five different Warning types at a time for any given URL. The most common scenario found was for a URL to have two types of Warnings at a time. There is a definite correlation between the two most frequent Warning types and that big "bump" in the Warning-count list below. Of the 1,025,319 cases where only two different Warning types were encountered, 951,957 (92.84%) were the W06 and W09 type together.

Fig 12-2: Number of warnings per URL
Number of Warnings Frequency Percentage
0 1,702,424 48.51%
1 363,103 10.35%
2 1,025,319 29.22%
3 411,850 11.74%
4 6,439 0.18%
5 35 0.00%

Ex: 5 Warning types in combination: http://www.hazenasvinov.cz

... And, er ... those other types of warnings too

The truth is, the validator seems to define a warning somewhat loosely, hence the capitalized use of "Warning" in the previous section to make the validator's two interpretations distinct. Firstly, it defines a "Warning" according to the warning codes and meanings in the above section, where MAMA encountered no more than 5 Warning types at a time. The validator additionally has a warnings section in its SOAP output, and a warning summary count. When the validator uses this latter interpretation of warning, it seems to have a more liberal meaning. It lumps other error types in with the strict Warnings measure as classified before. By this accounting, a number of URLs in DMoz have more than 10,000 of these warnings each.

The URL that contained the most "warnings" of this expanded type is a blog at http://club-aguada.blogspot.com/. In MAMA's initial analysis, it reported 19,602 warnings! When this research was being collected soon afterward, the URL was re-checked through the validator on 16 Feb. 2008, and it still had 14,838 warnings—and an additional 14,949 errors. This URL only has about 10-20 paragraphs of text content and an additional 1,400 or so non-visible search engine spam hyperlinks. Such a big change in results in a short amount of time seems somewhat suspect, but content in blogs tends to change rather rapidly, which could account for the difference.

What IS of concern is how a page that is less than 250KB in size generates over 26MB of output from the validator's SOAP mode. The SOAP version is much more terse than the HTML output, so the validation results could have been even bigger. A validation result like this is just far too excessive. Perhaps the validator should offer a way (at least as an option) to truncate the warnings and/or errors after a certain amount to control this problem.

Validator errors

Any problem or issue that the validator can recognize that is not a failure or a warning is just a common "error". Errors have the most variety—446 are currently defined in the error_messages.cfg file in the validator's code. The validator only encountered 134 of them through MAMA's URL set. The validation studies done by Parnas and Saarsoo kept track of far fewer error types—perhaps to decrease the studies' complexity. MAMA kept track of them all in the hopes that it might be useful to those developing or using the validator. First we will take a look at the various error types and error frequencies. To wrap things up, we will showcase URLs demonstrating some of the extreme error scenarios discovered (the URLs exhibited the error behavior at the time of writing but can change over time).

Error-type frequency

For each error type found in a URL, MAMA stored only the error code and the number of times that error type occurred. Shown below is a short "Top 10" list of the most frequent error types. The frequency ratios for the top errors generally agree with Saarsoo's research, with a few minor differences. The error that happens most often in the analyzed URL set is #108 (2,253,893 times), followed closely by #127 (2,013,162 times). Coming in third is an interesting document structural error, #344: "No document type declaration; implying X". This error appears to mirror the functionality of Warning W09/W09x, "No DOCTYPE found" (see previous section) very closely; notice that the occurrence numbers for the two types are almost identical.

Fig 13-1: Validator error-type frequency table
Error code Error description Frequency Percentage
108 There is no attribute X 2,253,893 64.23%
127 Required attribute X not specified 2,013,162 57.37%
344 No document type declaration; implying X 1,371,836 39.09%
79 End tag for element X which is not open 1,232,169 35.11%
64 Document type does not allow element X here 1,229,145 35.03%
76 Element X undefined 1,114,796 31.77%
325 Reference to entity X for which no system identifier could be generated 859,846 24.50%
25 General entity X not defined and no default entity 859,636 24.50%
338 Cannot generate system identifier for general entity X 859,636 24.50%
247 NET-enabling start-tag requires SHORTTAG YES 798,046 22.74%

The full validator error-type frequency table for MAMA's study is in a separate document. For brevity, only the error codes are listed there. The complete list of validator error codes and their explanations can be found on the W3C's site. Note that a few error message codes are not described in the aforementioned W3C document, and need a little extra exposition:

  • "xmlw": XML well-formedness error
  • "no-x": No-xmlns error (No XML Namespace declared)
  • "wron": Wrong-xmlns (Incorrect XML Namespace declared)

Quantity of error types

There were 3,000,493 URLs where at least one validation error occurred. Among these URLs, there was a great variety in the types of errors encountered, although the vast majority of URLs with errors encountered 10 error types or fewer. The average total number of validation errors per page is 46.70.

Fig 13-2: Validator error-type variety per URL
[See also the full frequency table]
Total number of error types   Number of URLs   Total number of error types   Number of URLs
0 508,687 8 208,563
1 194,518 9 172,004
2 249,997 10 145,127
3 301,900 11 117,612
4 315,367 12 96,969
5 336,832 13 76,967
6 312,103 14 61,692
7 252,934 15 47,681
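
As a side note, both quantities behind Fig 13-2 (how many distinct error types a URL triggered, and how many errors it produced in total) fall straight out of the aggregated per-URL error list that MAMA stored. The short Perl sketch below shows one way to derive them; the %errors_for_url structure and its sample numbers are invented for illustration and are not MAMA's actual storage format.

#!/usr/bin/perl
# Illustrative sketch only: derive Fig 13-2 style figures from a per-URL
# aggregation of validator error codes. Data structure and numbers are
# assumptions, not MAMA's real schema.
use strict;
use warnings;

# One entry per URL: error code => number of times that error type occurred
my %errors_for_url = (
    'http://example.com/' => { 108 => 12, 127 => 3, 344 => 1 },
);

for my $url ( sort keys %errors_for_url ) {
    my $codes        = $errors_for_url{$url};
    my $type_variety = scalar keys %$codes;     # distinct error types (left column of Fig 13-2)
    my $total_errors = 0;
    $total_errors += $_ for values %$codes;     # feeds the 46.70 errors-per-page average
    print "$url: $type_variety error types, $total_errors total errors\n";
}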

Error extremes

DMoz has many URLs, and some are bound to have unbelievable numbers of errors. Believe it, though—the following three tables showcase the most extreme offenders in generating validator error messages.

The URLs in these lists are fairly diverse. Some of the documents are long, yet some are fairly brief considering the error quantity. Some use CSS or scripting, while others do not. IIS and Apache are both well represented. The only noticeable tendency is found in the last table (Fig 13-5), for the widest variety of error types: five of the eight worst offenders in this category use Microsoft IIS 6.0/ASP.NET servers (note the same URL pattern in 4 of them). There is no other noticeable correlation. One plausible explanation for the inflated error numbers could be that these IIS servers sniff the User-Agent request header and deliver lower-quality content when they see the validator's UA value "W3C_Validator/1.575" (a quick way to test this is sketched after the tables below).

Fig 13-3: URLs with the most errors of a specific type
URL Error Type Error Qty
http://www.music-house.co.uk/ 76 28,961
http://www.zughaid.com/TMP.htm 325 22,193
http://www.filosofico.net/virgilioeneide.htm 65 15,409
http://www.gencat.cat/diue/llengua/eines_per_a_lempresa/lexics/alimenta.htm 64 14,316
http://www.cwc.lsu.edu/cwc/projects/dbases/chase.htm 82 12,211
http://www.dienanh.net/forums/ xmlw 12,103


Fig 13-4: URLs with the most total errors in combination
URL Total Errors
http://club-aguada.blogspot.com/ 37,370
http://www.first-jp.com/ 34,530
http://www.prezesm.kylos.pl/ 33,083
http://defensor-sporting.blogspot.com/ 31,617
http://www.mlnh.zmva.ru 29,184
http://www.music-house.co.uk/ 28,963


Fig 13-5: URLs with the widest variety of error types
URL Number of error types
http://alumni.wsu.edu/site/c.llKYL9MQIsG/b.1860301/k.BCA0/Home.htm 39
http://www.vincipro.com/cart/home.php 38
http://www.c-sharpcorner.com/UploadFile/prasad_1/RegExpPSD12062005021717AM/RegExpPSD.aspx 38
http://www.sleepfoundation.org/site/c.huIXKjM0IxF/b.2417141/k.C60C/Welcome.htm 35
http://www.buckeyeranch.org/site/c.glKSLeMXIsG/b.1043121/k.BCC0/Home.htm 35
http://www.ucmerced.edu/ 35
http://www.girlscouts.ak.org/site/c.hsJSK0PDJpH/b.1806483/k.BE48/Home.htm 35
http://kaltenkirchen.dlrg.de 35
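
Before summing up, a quick note on the browser-sniffing theory raised above: it is easy to spot-check by requesting the same URL once with the validator's User-Agent string and once with a typical browser string, then comparing what comes back. The Perl sketch below does exactly that; the browser UA string and the crude byte-length comparison are simplifying assumptions, and this check was not part of MAMA's methodology.

#!/usr/bin/perl
# Illustrative sketch only: fetch a URL with two different User-Agent strings
# to see whether the server appears to vary its content for the validator.
use strict;
use warnings;
use LWP::UserAgent;

my $url = shift @ARGV or die "Usage: $0 <url>\n";

my %agents = (
    validator => 'W3C_Validator/1.575',
    browser   => 'Mozilla/5.0 (Windows NT 5.1) Example-Browser/1.0',  # assumed browser UA string
);

for my $name ( sort keys %agents ) {
    my $ua   = LWP::UserAgent->new( agent => $agents{$name} );
    my $resp = $ua->get($url);
    printf "%-9s HTTP %s, %d bytes\n",
        $name, $resp->code, length( $resp->decoded_content || '' );
}

If the two responses differ wildly in size or markup, server-side sniffing of the validator's UA string becomes a plausible suspect.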

Summing up ...

Parnas's study presented an interesting statistic:

"In October 2001, the W3C validator validated approximately 80,000 documents per day"

Olivier Théreaux, who currently works on development of the W3C validator, provided an updated usage statistic in February 2008 of ~700-800,000 URLs per day, a roughly ten-fold increase. Awareness of the validation process definitely seems to be growing over time, as this sharp increase in validator usage indicates. The perceived importance of having documents actually pass validation, though, still needs to improve. Yes, the pass rate in the general Web population has increased at a respectable rate, from 0.71% to 4.13% in "just" six years, and it has increased similarly for the W3C member companies in that time. However, the W3C members appear to regress in their validation pass state about as often as they attain it. How can the Web at large strive to do better when these key companies do not seem to be trying harder? As we saw with the (non-)success of the validation icon badge, it is one thing to say you support the standards, with validation as a means to that end, but that claim does not necessarily reflect reality.

If we relax our concentration on simply passing validation, we notice that support for other parts of this process is improving nicely over time. At least one aspect of the validation process has made great strides and definitely contributes to a perceived importance of document correctness: Doctype usage. Doctypes help focus authors on which standard their documents are trying to adhere to, and this can only help the validation cause over time. The Web may be crossing an important threshold in this regard: the number of URLs in this study carrying a Doctype of some kind has just barely crossed the 50% boundary. In the U.S. political system this is called a "clear mandate", so an avalanche of authors validating their documents must not be far behind ... right? Joking aside, there is a clear and obvious connection between claiming to adhere to a standard and then actually doing so. Increased outreach by the standards community, helping developers draw the line between those two points, can only help matters here.

Appendix: Validation methodology

Markup validation was the last main phase of the research to be completed. MAMA only attempted to validate URLs that had been successfully analyzed in the other, larger analysis phase, so as to maximize the possibilities for cross-referencing data.

The URL set

MAMA employed several strategies to refine and improve the analysis set of URLs. The full DMoz URL set stood at ~4.5 million as of Nov. 2007, and this was distilled down to ~3.5 million URLs. Saarsoo's study chose to follow, as closely as possible, the URL-selection strategy that Parnas used, to ensure maximum compatibility between the two; MAMA's URL selection methods do not directly match those of these other studies. Even with the set-size reduction, this appears to be the largest URL sample used to study validation trends to date.

  • URL sets analyzed:
  • Basic filtering: Domain limiting of the randomized URL set to no more than 30 URLs analyzed per domain (a sketch of this filtering step follows the list)
  • Other filtering: Excluded non-HTTP/HTTPS protocols
  • Skipped analysis of URLs that hit any failure conditions
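
A minimal Perl sketch of the filtering step referenced above might look like the following; the use of the URI and List::Util modules, reading the raw list from standard input, and matching on the full hostname (rather than the registrable domain) are assumptions made for this example, not MAMA's actual code.

#!/usr/bin/perl
# Illustrative sketch only: shuffle the raw URL list, keep HTTP/HTTPS URLs,
# and accept no more than 30 URLs from any single host.
use strict;
use warnings;
use URI;
use List::Util qw(shuffle);

my $PER_DOMAIN_LIMIT = 30;
my %count_for_host;

chomp( my @urls = <STDIN> );                 # e.g. the raw DMoz URL dump on stdin
for my $url ( shuffle @urls ) {
    my $uri    = URI->new($url);
    my $scheme = $uri->scheme or next;
    next unless $scheme =~ /^https?$/;                       # HTTP/HTTPS only
    my $host = eval { $uri->host } or next;                  # skip anything without a hostname
    next if ++$count_for_host{$host} > $PER_DOMAIN_LIMIT;    # per-domain cap
    print "$url\n";
}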

Various parts of the examined URL sets have definite bias. Alexa's top URL lists, for example, are the result of usage stats from voluntary installation of a Windows-only MSIE toolbar. The DMoz set has definite top-page-itis—it is skewed heavily toward the root/home pages of domains by as much as 80%!

The W3C validator

MAMA was only able to employ two local copies of the W3C validator, on separate machines. One of these machines was very "old" and weak by today's hardware standards, while the other was more of a "typical" modern system. The weak machine was simply not up to the task and could only handle about 1/10th of the load that the more powerful machine handled easily. MAMA would feed a URL to the validator, parse the output, send the result to the MAMA database for storage, and move on to the next URL in the list ... rinse and repeat until complete. The big bottleneck was the validator itself; if MAMA had had more validators available, the processing time would have been cut drastically, from weeks to days.

  • Validator machine 1: CPU: Intel 2.4GHz dual core P4; RAM: 1GB
  • Validator machine 2: CPU: AMD 800MHz; RAM: 768MB
  • Driver script: Perl (using LWP module for validator communication and DBI module for database connectivity)
  • Number of driver scripts: Usually about 10 at a time
  • Duration of validation: 8-29 January, 2008 (~ 3 weeks), usually 24/7
  • Processing rate: ~150,000 URLs per day
  • How many URLs validated: 3,509,170 URLs from 3,011,661 domains
  • URL list: randomized

The markup validator has a number of processing options, but a main goal for the validation process was to keep the analysis simple and direct. Each candidate URL was passed to the validator using the following options (a minimal driver sketch follows the list). The SOAP output was chosen for its brevity and ease of parsing.

  • Charset: Detect automatically
  • Doctype: Detect automatically
  • Output: SOAP
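
Put together with the driver loop described earlier, the validation step might be sketched as below. The local endpoint path and the reading of the validator's X-W3C-Validator-* response headers (instead of parsing the full SOAP body) are assumptions made to keep the example short; the output=soap12 parameter follows the validator's documented API, and charset/doctype detection are simply left at the validator's defaults.

#!/usr/bin/perl
# Illustrative sketch only: feed each URL to a local validator instance,
# request SOAP output, and read the verdict. Endpoint path and header usage
# are assumptions; a real driver would parse the SOAP body itself.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $VALIDATOR = 'http://localhost/w3c-validator/check';   # assumed path of the local copy
my $ua = LWP::UserAgent->new;

while ( my $url = <STDIN> ) {
    chomp $url;
    next unless length $url;

    my $check = URI->new($VALIDATOR);
    $check->query_form( uri => $url, output => 'soap12' );   # charset/doctype detection are the defaults

    my $resp = $ua->get($check);
    printf "%s  status=%s errors=%s warnings=%s\n",
        $url,
        $resp->header('X-W3C-Validator-Status')   // '?',
        $resp->header('X-W3C-Validator-Errors')   // '?',
        $resp->header('X-W3C-Validator-Warnings') // '?';
    # ... parse the SOAP body here, then hand the compacted record to the database
}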

MAMA stored a compacted version of the results of each URL validation; the fields kept are listed below, and an illustrative sketch of such a record follows the list. In retrospect, it would also have been useful to store at least part of each error description (the unique-arguments portion), but this first time through there was no way to know just how much storage all that data would need, so MAMA opted to store as little as possible. Even so, MAMA's abbreviated format stored over 25 million rows of data for the abbreviated error messages alone. A goal for "next time" is to store all the unique error arguments in addition to what MAMA currently stores.

  • Did it validate? (Pass/Fail)
  • Doctype FPI
  • Character set
  • Number of warnings
  • Number of errors
  • Number of failures
  • Date the URL was validated
  • An aggregated list of error types and the quantity of those errors for the URL
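
As an illustration, one such compacted record might be pictured as in the Perl/DBI sketch below. The table name, column names, and MySQL DSN are invented for this example (they are not MAMA's actual schema), and the aggregated error list is shown as simple "code:count" pairs.

#!/usr/bin/perl
# Illustrative sketch only: one compacted validation record and a DBI insert.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:database=mama', 'user', 'password',
                        { RaiseError => 1 } );

my %record = (
    url          => 'http://example.com/',
    validated_on => '2008-01-15',                           # date the URL was validated
    passed       => 0,                                      # did it validate? (pass/fail)
    doctype_fpi  => '-//W3C//DTD HTML 4.01 Transitional//EN',
    charset      => 'iso-8859-1',
    warnings     => 2,
    errors       => 14,
    failures     => 0,
    error_list   => '108:12,127:1,344:1',                   # aggregated error types and counts
);

my @cols = sort keys %record;
my $sql  = sprintf 'INSERT INTO validation_results (%s) VALUES (%s)',
                   join( ', ', @cols ), join( ', ', ('?') x @cols );
$dbh->prepare($sql)->execute( @record{@cols} );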

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.

Comments

The forum archive of this article is still available on My Opera.
