MAMA: What has come before
- Previous article—MAMA: Script tokenization: ECMAScript/JavaScript syntax
- Next article—MAMA: Methodology
- Table of contents
Introduction
This article looks at what has come before MAMA in terms of web structure studies, both small and large.
Small-scale studies
There have been a number of attempts in the past to gain some measure of the composition of the Web. Many of them have had small sample sizes (in the range of hundreds to a few thousand URLs). The following is a chronological list of some of the more interesting smaller studies available covering various aspects of the Web. It is by no means exhaustive:
Date | Description | Author | URL Sample Size |
---|---|---|---|
Feb 2002 | W3C member company site validation results | Marko Karppinen | 506 sites |
Apr 2003 | "Alpha Geeks" XHTML usage on personal Web pages | Evan Goer | 113 sites |
Nov 2005 | Class and Id attribute value usage | John Allsopp | 1315 sites |
2007-2008 | Various markup factors | Philip Taylor | Various |
Large-scale studies
Parnas: "Markup Validity of Open Directory Web Pages"
Dagfinn Parnas' master's thesis work in 2001 on markup validation rates is the first large-scale analysis of the composition of the Web of which I am aware, but the world would have to wait until 2005 for the next major look at what was being used in Web pages.
Ian Hickson/Google: "Web Authoring Statistics"
One of the people I had asked regarding criteria for MAMA to examine was my then-coworker at Opera, Ian Hickson, who now works at Google. He was (and still is) heavily involved in the W3C standards process and had excellent ideas of things to look for. However, there were a number of things he requested that MAMA just was not up to analyzing at the time (such as Class names used in a document). On moving to Google, he was able to leverage their vast caches of Web pages to come up with the first widely available deep analysis of Web page technologies. The sheer volume of Web pages analyzed for his "Web Authoring Statistics" document is a daunting monument to the power that a search engine company can bring to bear on the issue. Ian's study is great; in terms of pure representative numbers it will probably never have an equal. Given its current resources, MAMA will certainly never be able to analyze that many URLs.
Rene Saarsoo: Coding practices of Web pages
Rather late in the MAMA-development process, I was pointed to Saarsoo's thesis work that had been released about a year before. In shape and form, his work bears many resemblances to MAMA, and his work also inspired me to add many new features to MAMA at a late hour—the intention was not to duplicate his features, but rather to try to fill gaps that neither of our studies had addressed up to that point. For example, MAMA's research on scripting was expanded in response to a bug in Saarsoo's analysis script that failed to correctly gather certain data items. Saarsoo's study also fills some holes that MAMA has—for example, it goes into much greater detail about CSS than MAMA currently does.
Date | Description | Author | URL Sample Size |
---|---|---|---|
Dec 2001 | Markup Validity of Open Directory Web Pages | Dagfinn Parnas | >2,000,000 |
Dec 2005 | Web Authoring Statistics | Ian Hickson/Google | >1,000,000,000 |
Jun 2006 | Coding Practices of Web Pages | Rene Saarsoo | ~1,270,000 |
There is significant overlap in all of these studies (MAMA included), but together they constitute a much larger tapestry of information than any single study can manage. In the various MAMA reports published now and in the future, we will make many comparisons to the data from these other studies; the reader is encouraged to visit all of them to gain a deeper understanding of all aspects of Web pages.
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.