MAMA: Markup validation report

By Brian Wilson

Introduction

Web standards are good for the Web! Most of the readers of this site will understand why this statement holds true—ease of maintenance, cross-platform compatibility, accessibility for people with disabilities; the list goes on!

But how does the reality of the Web hold up to these ideals? Surely with so many good reasons to code using open Web standards, the majority of sites should validate? Not so—Opera's MAMA project has gathered a lot of quite shocking statistics showing that very few of the sites surveyed actually exhibit markup that validates.

This article will discuss validation and MAMA's findings, including what markup validation is, whether people bother to validate their markup, how many sites actually do validate, and possible reasons why the rate of markup validation is still so low.

Note that this article is a heavily condensed version of the full MAMA markup validation study, aimed at giving a quick summary of its main points. For a much deeper treatment of the area of markup validation, check out the full version.

What is markup validation?

The W3C validator is a tool that authors can use to ensure that their markup conforms to a standard. This tool began life over 10 years ago as a Web wrapper around an SGML parser, but it has expanded its reach over time to include validation capability for documents of many flavors. The tool checks a page of markup against a set of rules defined by the document's Doctype, and delivers either a cheerful passing grade, or a failure message with a list of any warnings and errors that need to be addressed.
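
As an illustration (a hypothetical minimal page, not one taken from the study), a document like the following declares an HTML 4.01 Strict Doctype and should earn the validator's passing grade—note the required TITLE element and the META-declared encoding:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Minimal valid page</title>
</head>
<body>
  <p>Hello, validator.</p>
</body>
</html>
```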

Do people validate?

Will we ever get everyone validating? That is a bit much to hope for. HTML's genesis as a simple language that anyone and everyone could learn means there will inevitably be those who don't know all (or most) of the rules. Many authors "speak" pidgin markup that would make a learned HTML author's (or validator's) toes curl.

As part of MAMA's overall analysis process, it ran every URL in its database through the W3C's markup validator. MAMA was able to validate 3,509,180 URLs in 3,011,661 domains, and only 4.13% of the URL set passed validation (with 4.33% of the domains having at least 1 URL that passed validation).

This is a decent increase over previous validation studies:

Validation pass rate studies

Study     Date        Passed validation   Total validated   Percentage
Parnas    Dec. 2001    14,563             2,034,788          0.71%
Saarsoo   Jun. 2006    25,890             1,002,350          2.58%
MAMA      Jan. 2008   145,009             3,509,180          4.13%

So how many pages validate in total, roughly?

Using recent approximations for the size and reachability of the Web coupled with the validation pass rates discovered by MAMA, rough numbers can be estimated for the overall validation rates of the ENTIRE Web. Google's recent total URL estimate of 1 trillion reachable Web pages would give us 41.3 billion URLs passing markup validation. That is a LOT of Web pages. A different data point comes from Netcraft's August 2008 assessment of reachable domains: 176,748,506. Coupling that with MAMA's per-domain validation metric would give us 7.65 million domains that have at least 1 URL passing markup validation.
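
The arithmetic behind those estimates is simple enough to sketch. The pass rates and totals below are the figures quoted above; this is order-of-magnitude guesswork, not a measurement:

```python
# Rough extrapolation of MAMA's pass rates to the whole Web.
# All inputs are the figures quoted in the text above.
url_pass_rate = 0.0413            # MAMA: 4.13% of URLs passed validation
domain_pass_rate = 0.0433         # MAMA: 4.33% of domains had >= 1 passing URL

google_urls = 1_000_000_000_000   # Google, 2008: ~1 trillion reachable URLs
netcraft_domains = 176_748_506    # Netcraft, August 2008: reachable domains

passing_urls = google_urls * url_pass_rate             # ~41.3 billion URLs
passing_domains = netcraft_domains * domain_pass_rate  # ~7.65 million domains

print(f"{passing_urls / 1e9:.1f} billion URLs would pass validation")
print(f"{passing_domains / 1e6:.2f} million domains would have a passing URL")
```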

How often do authors validate?

The W3C suggests that you display Validation image badges on pages that have passed validation. MAMA used a Perl regular expression to detect the use of such validation badges utilizing the W3C naming scheme. This pattern match was used against the SRC attributes of IMG elements of all URLs analyzed:

Regexp:
/valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$/i || /(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$/i
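
For readers who want to experiment, here is a rough Python transcription of that pattern (an assumption on my part that Python's re semantics mirror the Perl original closely enough for illustration), applied to the kind of SRC attribute values MAMA scanned:

```python
import re

# Rough Python transcription of MAMA's Perl badge-detection regexp.
# Matches W3C-style validation badge filenames in IMG SRC values.
badge = re.compile(
    r"valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?"
    r"(\.png|\.gif|-v\.svg|-v\.eps)?$"
    r"|(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$",
    re.IGNORECASE,
)

print(bool(badge.search("http://www.w3.org/Icons/valid-xhtml10")))       # True
print(bool(badge.search("http://www.w3.org/Icons/valid-html401-blue")))  # True
print(bool(badge.search("images/logo.png")))                             # False
```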

MAMA found that slightly less than half of pages using detectable validation icons actually passed validation.

Validation pass rates of URLs with validation icons

Type of badge   Identified total   Actually validated
xhtml           11,657              5,480
html            22,033             10,995

These badges appear to misrepresent the current validation state of a URL just as often as they get it right, so using these icons to report the CURRENT STATE of validating pages should perhaps be discouraged. An evolving Web page is always a moving target, and such icons cannot be trusted to be up-to-date and true.

Who validates?

One set of URLs where you would expect especially high validation pass rates is the W3C's member companies. This is indeed true (just over 20% of the companies' URL top pages pass validation), but 1 in 5 is still rather low for companies linked to the organization that creates and advances the markup standards.

Is it fair to expect such high numbers from the W3C member companies? Yes. Is it realistic? Apparently not. Looking at historical validation pass rates for these companies highlights the annoying fact that achieving a passing markup validation does not in any way guarantee that the page will maintain that state over time. Harkening back to the previous section on how often authors validate their documents, the answer seems to be "not very often".

Validation for CMS and editors

Validation rates in MAMA's URLs were compared with the URLs that used the META "Generator" syntax. This allowed for the identification of text and Web page editors, as well as popular content management systems (CMSes). First, the good news: most of the popular CMSes identified surpass the validation rates of the general URL population, with the exception of Blogger, which had an exceptionally poor showing in MAMA's URL set.

Validation pass rates relating to CMS

CMS         Total URLs in MAMA   Passing validation   Percentage
Typo        18,067               2,301                12.7%
WordPress   16,594               1,494                 9.0%
Joomla      34,852               2,248                 6.5%
Blogger      9,907                  30                 0.3%

Now for the (mostly) bad news. The text and Web page editors that identified themselves via the META element simply embarrassed themselves compared to the general population's validation pass rate...except for a lone shining beacon. Apple's iWeb editor definitively won the day in this category. Approximately 82% of pages reporting this editor passed validation—an astounding result considering the next-closest editor pass rate was Adobe Dreamweaver with only 3.4%! Aside from iWeb, all of the popular Web page editors that MAMA discovered had lower validation pass rates than the overall URL set.

Validation pass rates relating to editors

Editor                    Total URLs in MAMA   Passing validation   Percentage
Apple iWeb                  2,504              2,051                81.9%
Adobe Dreamweaver           5,954                205                 3.4%
NetObjects Fusion          26,355                802                 3.0%
Adobe GoLive               41,865              1,086                 2.6%
IBM WebSphere              32,218                626                 1.9%
Microsoft MSHTML           40,030                518                 1.3%
Microsoft Visual Studio    22,936                272                 1.2%
Claris Home Page            6,259                 48                 0.8%
Adobe PageMill             15,148                100                 0.7%
Microsoft FrontPage       347,095              1,923                 0.6%
Microsoft Word             24,892                154                 0.6%

Some of the validation results in brief

The validator goes far beyond a simple "thumbs-up/down" decision over whether a document passes validation. When an error of any type is discovered, the validator is very helpful in showing you just how many errors there are, and (more helpfully) just where those errors occurred.

The W3C's validator defines 27 separate types of warnings, and 446 types of errors. Some documents that MAMA validated contained errors numbering in the tens of thousands. No one said passing validation is always easy.

Some tasty nuggets from MAMA's validation of 3.5 million documents:

  • More URLs than ever are using Doctype statements. In MAMA's URL set the number has passed the 50% threshold (51.0%).
  • Strict Doctype flavors pass validation at much higher rates (17.5%) than Transitional (8.4%) or Frameset (7.2%).
  • XHTML Doctype flavors pass validation at much higher rates (13.4%) than HTML flavors (6.6%).
  • The majority of pages specify a document's encoding, and do so via the META markup syntax (89.9%).
  • The most popular detected encodings are iso-8859-1 (43.1%) and utf-8 (26.9%).
  • The most frequent fatal validation error: characters are used that aren't allowed by the detected character set (8.6%).
  • The most frequent validation errors encountered both deal with attributes:
    • There is no attribute X (64.2%)
    • Required attribute X not specified (57.4%)
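
To make those two attribute errors concrete, fragments like these (hypothetical examples of my own, checked against an HTML 4.01 Strict Doctype) would trigger them:

```html
<!-- "there is no attribute BORDER": border on IMG exists only in the
     Transitional (loose) DTD, not in Strict -->
<img src="kittens.jpg" alt="My kittens" border="0">

<!-- "required attribute ALT not specified": IMG must carry an alt attribute -->
<img src="kittens.jpg">
```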

Why are so few pages validating?

There are a lot of reasons why most of the Web does not validate; consider the following cases:

  • Many sites are built upon CMSes that do not spit out standards-compliant markup on the front end—it is nigh-on impossible to get these sites to validate.
  • Many sites are put up on the Web by hobbyists, who do not care about Web standards—they just want to get their "look at my kittens" site on the Web by any means necessary.
  • Many sites these days feature user-generated content (think of any blog and social networking site); even if you make your blog validate, it can still easily be invalidated by a site visitor submitting a comment featuring bad markup.
  • A lot of developers don't care about validation—their site works for the target audience they are aiming it at, and they get paid regardless of standards compliance.
  • And many more reasons!

Chris Mills has written a presentation on Web standards and education (ZIP file download), which covers possible reasons for non-validation in more detail.

Summary

The role of the validator has been likened to that of a "spell-checker" for Web page structure. A validator alone will not make your page better...a page that passes validation can still look or behave terribly, and a page with hundreds or even thousands of errors can still produce a reasonable user experience in most browsers.

A validator simply catches errors. We all make mistakes, even the experts. The worst mistakes are often the typos or unintentional gaffes; a validator can make easy work of catching these. Authors should validate—it's an easy process. This is not a conundrum; validate your code and do it often.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.

Comments

The forum archive of this article is still available on My Opera.
