MAMA: What has come before

By Brian Wilson

Introduction

This article looks at what has come before MAMA in terms of web structure studies, both small and large.

Small-scale studies

There have been a number of attempts in the past to gain some measure of the composition of the Web. Many of them have had small sample sizes (in the range of hundreds to a few thousand URLs). The following is a chronological list of some of the more interesting smaller studies available covering various aspects of the Web. It is by no means exhaustive:

**Fig 1-1:** Small-scale URL studies
Date	Description	Author	URL Sample Size
Feb 2002	W3C member company site validation results	Marko Karppinen	506 sites
Apr 2003	"Alpha Geeks" XHTML usage on personal Web pages	Evan Goer	113 sites
Nov 2005	`Class` and `Id` attribute value usage	John Allsopp	1315 sites
2007-2008	Various markup factors	Philip Taylor	Various

Large-scale studies

Parnas: "Markup Validity of Open Directory Web Pages"

Dagfinn Parnas' 2001 master's thesis on markup validation rates is the first large-scale analysis of the composition of the Web of which I am aware. The world would have to wait until 2005 for the next major look at what was being used in Web pages.

Ian Hickson/Google: "Web Authoring Statistics"

One of the people I asked about criteria for MAMA to examine was my then-coworker at Opera, Ian Hickson, who now works at Google. He was (and still is) heavily involved in the W3C standards process and had excellent ideas about what to look for. However, several of the things he requested were beyond what MAMA could analyze at the time (such as the class names used in a document). On moving to Google, he was able to leverage their vast caches of Web pages to produce the first widely available deep analysis of Web page technologies. The sheer volume of Web pages analyzed for his "Web Authoring Statistics" document is a daunting monument to the power that a search engine company can bring to bear on the issue. Ian's study is great; in terms of pure representative numbers, it will probably never have an equal. Given its current resources, MAMA will certainly never be able to analyze that many URLs.

Rene Saarsoo: Coding practices of Web pages

Rather late in the MAMA development process, I was pointed to Rene Saarsoo's thesis work, which had been released about a year before. In shape and form, his work bears many resemblances to MAMA, and it inspired me to add many new features to MAMA at a late hour. The intention was not to duplicate his features, but rather to fill gaps that neither of our studies had addressed up to that point. For example, MAMA's research on scripting was expanded in response to a bug in Saarsoo's analysis script that failed to correctly gather certain data items. Saarsoo's study also fills some holes that MAMA has; for example, it goes into much greater detail about CSS than MAMA currently does.

**Fig 2-1:** Large-scale URL studies
Date	Description	Author	URL Sample Size
Dec 2001	Markup Validity of Open Directory Web Pages	Dagfinn Parnas	>2,000,000
Dec 2005	Web Authoring Statistics	Ian Hickson/Google	>1,000,000,000
Jun 2006	Coding Practices of Web Pages	Rene Saarsoo	~1,270,000

There is significant overlap in all of these studies (MAMA included), but together they constitute a much larger tapestry of information than any single study can manage. In the various MAMA reports published now and in the future, we will make many comparisons to the data from these other studies; the reader is encouraged to visit all of them to gain a deeper understanding of all aspects of Web pages.

This article is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license.