MAMA: What has come before

By Brian Wilson

Introduction

This article looks at what has come before MAMA in terms of web structure studies, both small and large.

Small-scale studies

There have been a number of attempts in the past to gain some measure of the composition of the Web. Many of them have had small sample sizes (in the range of hundreds to a few thousand URLs). The following is a chronological list of some of the more interesting smaller studies available covering various aspects of the Web. It is by no means exhaustive:

Fig 1-1: Small-scale URL studies
Date       Description                                       Author           URL Sample Size
Feb 2002   W3C member company site validation results        Marko Karppinen  506 sites
Apr 2003   "Alpha Geeks" XHTML usage on personal Web pages   Evan Goer        113 sites
Nov 2005   Class and Id attribute value usage                John Allsopp     1315 sites
2007-2008  Various markup factors                            Philip Taylor    Various

Large-scale studies

Parnas: "Markup Validity of Open Directory Web Pages"

Dagfinn Parnas' master's thesis work in 2001 on markup validation rates is the first large-scale analysis of the composition of the Web of which I am aware. The world would have to wait until 2005 for the next major look at what was being used in Web pages.

Ian Hickson/Google: "Web Authoring Statistics"

One of the people I had asked regarding criteria for MAMA to examine was my then-coworker at Opera, Ian Hickson, who now works at Google. He was (and still is) heavily involved in the W3C standards process and had excellent ideas of things to look for. However, there are a number of things he requested that MAMA just was not up to analyzing at the time (such as class names used in a document). On moving to Google, he was able to leverage their vast caches of Web pages to come up with the first widely available deep analysis of Web page technologies. The sheer volume of Web pages analyzed for his "Web Authoring Statistics" document is a daunting monument to the power that a search engine company can bring to bear on the issue. Ian's study is great; in terms of pure representative numbers it will probably never have an equal. Given its current resources, MAMA will certainly never be able to analyze that many URLs.

Rene Saarsoo: Coding practices of Web pages

Rather late in the MAMA development process, I was pointed to Saarsoo's thesis work, which had been released about a year before. In shape and form, his work bears many resemblances to MAMA, and it also inspired me to add many new features to MAMA at a late hour. The intention was not to duplicate his features, but rather to fill gaps that neither of our studies had addressed up to that point. For example, MAMA's research on scripting was expanded in response to a bug in Saarsoo's analysis script that failed to correctly gather certain data items. Saarsoo's study also fills some holes that MAMA has; for example, it goes into much greater detail about CSS than MAMA currently does.

Fig 2-1: Large-scale URL studies
Date       Description                                   Author              URL Sample Size
Dec. 2001  Markup Validity of Open Directory Web Pages   Dagfinn Parnas      >2,000,000
Dec. 2005  Web Authoring Statistics                      Ian Hickson/Google  >1,000,000,000
Jun. 2006  Coding Practices of Web Pages                 Rene Saarsoo        ~1,270,000

There is significant overlap in all of these studies (MAMA included), but together they constitute a much larger tapestry of information than any single study can manage. In the various MAMA reports published now and in the future, we will make many comparisons to the data from these other studies; the reader is encouraged to visit all of them to gain a deeper understanding of all aspects of Web pages.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
