MAMA: Markup validation report
Introduction
Web standards are good for the Web! Most of the readers of this site will understand why this statement holds true—ease of maintenance, cross-platform compatibility, access for people with disabilities; the list goes on!
But how does the reality of the Web hold up to these ideals? Surely with so many good reasons to code using open Web standards, the majority of sites should validate? Not so—Opera's MAMA project has gathered a lot of quite shocking statistics showing that very few of the sites surveyed actually exhibit markup that validates.
This article will discuss validation and MAMA's findings, including what markup validation is, whether people bother to validate their markup, how many sites actually do validate, and possible reasons why the rate of markup validation is still so low.
Note that this article is a heavily condensed version of the full MAMA markup validation study, aimed at giving a quick summary of its main points. For a much deeper treatment of the area of markup validation, check out the full version.
What is markup validation?
The W3C validator is a tool that authors can use to ensure that their markup conforms to a standard. This tool began life over 10 years ago as a Web wrapper around an SGML parser, but it has expanded its reach over time to include validation capability for documents of many flavors. The tool checks a page of markup against a set of rules defined by the document's Doctype, and delivers either a cheerful passing grade, or a failure message with a list of any warnings and errors that need to be addressed.
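That pass/fail verdict can also be consumed programmatically. As a hedged sketch, the sample below imitates the JSON report shape of the W3C's modern Nu HTML Checker (a `{"messages": [...]}` object); the endpoint and format are an assumption drawn from that tool's documentation, not anything the MAMA study itself used:

```python
import json

# Hedged sketch: the Nu HTML Checker (validator.w3.org/nu/) can emit its
# report as JSON; the hand-written sample below imitates that shape.
# Treat the format as an assumption, not part of the MAMA study.
sample_response = json.loads("""
{"messages": [
  {"type": "error", "message": "End tag for body seen, but there were unclosed elements."},
  {"type": "info", "subType": "warning", "message": "Consider adding a lang attribute."}
]}
""")

def summarize(messages):
    """Boil a validator message list down to pass/fail plus counts."""
    errors = sum(1 for m in messages if m.get("type") == "error")
    warnings = sum(1 for m in messages
                   if m.get("type") == "info" and m.get("subType") == "warning")
    return {"passed": errors == 0, "errors": errors, "warnings": warnings}

print(summarize(sample_response["messages"]))
# -> {'passed': False, 'errors': 1, 'warnings': 1}
```

A crawler like MAMA's would apply a reduction like this to each fetched URL to arrive at the aggregate pass rates discussed below.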
Do people validate?
Will we ever get everyone validating? That is a bit much to hope for. HTML's genesis as a simple language that anyone and everyone could learn means that there will inevitably be those who don't know all (or most) of the rules. Many authors "speak" pidgin markup that would make a learned HTML author's (or validator's) toes curl.
As part of MAMA's overall analysis process, it ran every URL in its database through the W3C's markup validator. MAMA was able to validate 3,509,180 URLs in 3,011,661 domains, and only 4.13% of the URL set passed validation (with 4.33% of the domains having at least 1 URL that passed validation).
This is a decent increase over previous validation studies:
Study | Date | Passed Validation | Total Validated | Percentage |
---|---|---|---|---|
Parnas | Dec. 2001 | 14,563 | 2,034,788 | 0.71% |
Saarsoo | Jun. 2006 | 25,890 | 1,002,350 | 2.58% |
MAMA | Jan. 2008 | 145,009 | 3,509,170 | 4.13% |
So how many pages validate in total, roughly?
Using recent approximations for the size and reachability of the Web coupled with the validation pass rates discovered by MAMA, rough numbers can be estimated for the overall validation rates of the ENTIRE Web. Google's recent total URL estimate of 1 trillion reachable Web pages would give us 41.3 billion URLs passing markup validation. That is a LOT of Web pages. A different data point comes from Netcraft's August 2008 assessment of reachable domains: 176,748,506. Coupling that with MAMA's per-domain validation metric would give us 7.65 million domains that have at least 1 URL passing markup validation.
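The arithmetic behind those estimates is simple to check. All of the inputs below (1 trillion URLs, 176,748,506 domains, and the 4.13% / 4.33% pass rates) come straight from the text above:

```python
# Back-of-the-envelope check of the article's whole-Web estimates.
reachable_urls = 1_000_000_000_000   # Google's ~1 trillion URL estimate
url_pass_rate = 0.0413               # MAMA's per-URL validation pass rate
passing_urls = reachable_urls * url_pass_rate          # 41.3 billion

reachable_domains = 176_748_506      # Netcraft, August 2008
domain_pass_rate = 0.0433            # MAMA's per-domain pass rate
passing_domains = reachable_domains * domain_pass_rate  # ~7.65 million

print(f"{passing_urls / 1e9:.1f} billion URLs, "
      f"{passing_domains / 1e6:.2f} million domains")
# -> 41.3 billion URLs, 7.65 million domains
```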
How often do authors validate?
The W3C suggests that you display validation badge images on pages that have passed validation. MAMA used a Perl regular expression to detect validation badges that follow the W3C naming scheme. This pattern was matched against the SRC attributes of the IMG elements of all URLs analyzed:
Regexp:

```
/valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$/i ||
/(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$/i
```
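The Perl pattern translates to Python almost character for character. In the hedged sketch below, the alternation and groups mirror the original; the name `BADGE_RE` and the sample SRC values are ours, not MAMA's:

```python
import re

# Python translation of MAMA's Perl badge-detection pattern.
# The groups and alternation mirror the Perl original above.
BADGE_RE = re.compile(
    r"valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$"
    r"|(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$",
    re.IGNORECASE,
)

srcs = [
    "http://www.w3.org/Icons/valid-xhtml10.png",    # W3C badge -> matches
    "http://www.w3.org/Icons/valid-html401-blue.gif",
    "/images/kitten.jpg",                           # ordinary image -> no match
]
for src in srcs:
    print(src, "->", bool(BADGE_RE.search(src)))
```

Note that, like the Perl original, this searches anywhere in the SRC value rather than anchoring at the start of the filename.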
MAMA found that slightly less than half of pages using detectable validation icons actually passed validation.
Type Of Badge Identified | Total | Actually Validated |
---|---|---|
xhtml | 11,657 | 5,480 |
html | 22,033 | 10,995 |
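A quick check of the table's numbers confirms the "slightly less than half" claim for both badge types:

```python
# Badge counts from the table above: (actually validated, total found).
badges = {"xhtml": (5_480, 11_657), "html": (10_995, 22_033)}

for flavor, (passed, total) in badges.items():
    print(f"{flavor}: {passed / total:.1%} of badge-bearing pages validated")
# -> xhtml: 47.0% ... html: 49.9% ...
```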
These badges misrepresent the current validation state of a URL about as often as they get it right, so using these icons to report the CURRENT STATE of a page's validity should perhaps be discouraged. An evolving Web page is always a moving target, and such icons cannot be trusted to be up-to-date and true.
Who validates?
One set of URLs where you would expect especially high validation pass rates is the W3C's member companies. This is indeed true (just over 20% of the companies' URL top pages pass validation), but 1 in 5 is still rather low for companies linked to the organization that creates and advances the markup standards.
Is it fair to expect such high numbers from the W3C member companies? Yes. Is it realistic? Apparently not. Historical validation pass rates for these companies highlight the annoying fact that passing markup validation once does not in any way guarantee that a page will maintain that state over time. Harking back to the previous section on how often authors validate their documents, the answer seems to be "not very often".
Validation for CMS and editors
Validation rates in MAMA's URLs were compared with the URLs that used the META "Generator" syntax. This allowed for the identification of text and Web page editors, as well as popular content management systems (CMS). First, the good news: most of the popular identified CMSes surpass the validation rates of the general URL population, with the exception of Blogger, which had an exceptionally poor showing in MAMA's URL set.
CMS | Total URLs In MAMA | Quantity Passing Validation | Percentage Passing Validation |
---|---|---|---|
Typo | 18,067 | 2,301 | 12.7% |
WordPress | 16,594 | 1,494 | 9.0% |
Joomla | 34,852 | 2,248 | 6.5% |
Blogger | 9,907 | 30 | 0.3% |
Now for the (mostly) bad news. The text and Web page editors that identified themselves via the META element simply embarrassed themselves compared to the general population's validation pass rate...except for a lone shining beacon. Apple's iWeb editor definitively won the day in this category: approximately 82% of pages reporting this editor passed validation, an astounding result considering that the next-closest editor pass rate was Adobe Dreamweaver's, at only 3.4%! Aside from iWeb, all of the popular Web page editors that MAMA discovered had lower validation pass rates than the overall URL set.
Editor | Total URLs In MAMA | Quantity Passing Validation | Percentage Passing Validation |
---|---|---|---|
Apple iWeb | 2,504 | 2,051 | 81.9% |
Adobe Dreamweaver | 5,954 | 205 | 3.4% |
NetObjects Fusion | 26,355 | 802 | 3.0% |
Adobe GoLive | 41,865 | 1,086 | 2.6% |
IBM WebSphere | 32,218 | 626 | 1.9% |
Microsoft MSHTML | 40,030 | 518 | 1.3% |
Microsoft Visual Studio | 22,936 | 272 | 1.2% |
Claris Home Page | 6,259 | 48 | 0.8% |
Adobe PageMill | 15,148 | 100 | 0.7% |
Microsoft FrontPage | 347,095 | 1,923 | 0.6% |
Microsoft Word | 24,892 | 154 | 0.6% |
Some of the validation results in brief
The validator goes far beyond a simple "thumbs-up/down" decision over whether a document passes validation. When errors of any type are discovered, the validator is very helpful in showing you just how many there are, and (more helpfully) just where they occurred.
The W3C's validator defines 27 separate types of warnings, and 446 types of errors. Some documents that MAMA validated contained errors numbering in the tens of thousands. No one said passing validation is always easy.
Some tasty nuggets from MAMA's validation of 3.5 million documents:
- More URLs than ever are using Doctype statements. In MAMA's URL set the number has passed the 50% threshold (51.0%).
- Strict Doctype flavors pass validation at much higher rates (17.5%) than Transitional (8.4%) or Frameset (7.2%).
- XHTML Doctype flavors pass validation at much higher rates (13.4%) than HTML flavors (6.6%).
- The majority of pages specify a document's encoding, and do so via the META markup syntax (89.9%).
- The most popular detected encodings are iso-8859-1 (43.1%) and utf-8 (26.9%).
- The most frequent fatal validation error: characters are used that aren't allowed by the detected character set (8.6%).
- The two most frequent validation errors both deal with attributes: "There is no attribute X" (64.2%) and "Required attribute X not specified" (57.4%).
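Those two attribute errors are easy to reproduce in miniature. The toy checker below is purely illustrative, not the W3C validator's code: its ALLOWED/REQUIRED tables are a made-up two-rule subset, whereas a real Doctype's DTD defines hundreds of such rules per element:

```python
from html.parser import HTMLParser

# Toy rule tables (an illustrative subset, NOT the validator's real rules).
ALLOWED = {"img": {"src", "alt", "width", "height"}}
REQUIRED = {"img": {"src", "alt"}}

class ToyChecker(HTMLParser):
    """Flags unknown and missing attributes, mimicking the two most
    frequent validator errors reported above."""
    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        seen = {name for name, _ in attrs}
        # "There is no attribute X" for attributes outside the allowed set.
        for name in seen - ALLOWED.get(tag, seen):
            self.errors.append(f'there is no attribute "{name}"')
        # "Required attribute X not specified" for missing required ones.
        for name in REQUIRED.get(tag, set()) - seen:
            self.errors.append(f'required attribute "{name}" not specified')

checker = ToyChecker()
checker.feed('<img src="cat.png" border="0">')
print(checker.errors)
```

Fed that one IMG tag, the checker reports an unknown `border` attribute and a missing required `alt`, the same two error categories that dominate MAMA's results.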
Why are so few pages validating?
There are a lot of reasons why most of the Web does not validate; consider the following cases:
- Many sites are built upon CMSes that do not spit out standards-compliant markup on the front end—it is nigh-on impossible to get these sites to validate.
- Many sites are put up on the Web by hobbyists, who do not care about Web standards—they just want to get their "look at my kittens" site on the Web by any means necessary.
- Many sites these days feature user-generated content (think of almost any blog or social networking site); even if you make your blog validate, a visitor can easily invalidate it by submitting a comment containing bad markup.
- A lot of developers don't care about validation—their site works for its target audience, and they get paid regardless of standards compliance.
- And many more reasons!
Chris Mills has written a presentation on Web standards and education (ZIP file download), which covers possible reasons for non-validation in more detail.
Summary
The role of the validator has been likened to that of a "spell-checker" for Web page structure. A validator alone will not make your page better...a page that passes validation can still look or behave terribly, and a page with hundreds or even thousands of errors can still produce a reasonable user experience in most browsers.
A validator simply catches errors. We all make mistakes, even the experts. The worst mistakes are often the typos or unintentional gaffes; a validator can make easy work of catching these. Authors should validate—it's an easy process. This is not a conundrum; validate your code and do it often.
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
Comments
The forum archive of this article is still available on My Opera.