MAMA: Methodology

By Brian Wilson

Index:
  1. Introduction
  2. MAMA's tiny machine "empire": hardware and software
  3. Analysis processing time (project duration)
  4. Breakdown of MAMA's basic analysis processing time
  5. Frequency tables
  6. To use stock parsers or create it all from scratch, that is the question
  7. Caveats and disclaimers
  8. Design choices

Introduction

MAMA is written in Perl and stores its information in a MySQL database. The internals of the system reflect its incremental development over a long period of time: it began life as a series of counters and simple condition checks and evolved over the years into a complex labyrinth of data structures. The current codebase is fairly messy and in need of some TLC, as new features have been bolted on where useful or necessary. The MAMA system also has many satellite Perl scripts that accomplish various tasks, including some of the complex data analysis and correlation from the database.
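As a rough illustration of how such a satellite script interacts with the database, here is a minimal Perl/DBI sketch; the "pages" table, "doctype" column, and credentials are hypothetical, not MAMA's actual schema.

    #!/usr/bin/perl
    # Minimal sketch of a MAMA-style satellite script (hypothetical
    # table and column names, not MAMA's actual schema).
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=mama;host=localhost',
                           'mama_user', 'secret', { RaiseError => 1 });

    # Example correlation: how many analyzed URLs use each doctype
    my $sth = $dbh->prepare(
        'SELECT doctype, COUNT(*) FROM pages GROUP BY doctype ORDER BY 2 DESC');
    $sth->execute();
    while (my ($doctype, $count) = $sth->fetchrow_array) {
        print "$count\t$doctype\n";
    }
    $dbh->disconnect;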

MAMA's tiny machine "empire": hardware and software

The MAMA database machine

The database where the analysis data is stored has undergone several iterations. The initial version had an ugly monolithic table design and was extremely slow. The current version has far more features and adaptability, yet query performance has also improved greatly. The MAMA database currently holds 100 million records across 22 tables - about 21GB of data representing the analysis of just over 3.5 million URLs.

Fig 2-1: MAMA Database Hardware/Software
Component       Details
Hardware CPU    Dual-core Intel Pentium D, 3.00GHz
Hardware RAM    2 Gigabytes
OS              Debian 4.0 (Etch)
MySQL           Ver. 5.0.32

MAMA's client analysis machines

MAMA's main analysis script was distributed to other client machines, which performed the URL fetching and deconstruction steps and passed the results back to the MAMA database machine. Only about 4-8 client machines were available at any one time, most of them older, underpowered machines by today's hardware standards. All of the clients ran some flavor of Linux. Compared to the hardware in Saarsoo's study (18 machines), MAMA's resources were downright modest.

MAMA's validator machines

MAMA was only able to employ two local copies of the W3C validator, on separate machines. One of these machines was very old and weak by today's hardware standards, while the other was a more typical modern system. The weak machine was simply not up to the task and could only handle about 1/10th of the load that the more powerful machine handled with ease. MAMA would feed a URL to the validator, parse the output, send the result to the MAMA database for storage, and then move on to the next URL in the list. Rinse and repeat until complete. The big bottleneck was the validator itself: had more validators been available, the processing time would have been cut drastically, from weeks to days.
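To make that loop concrete, here is a minimal sketch of the feed-parse-store cycle, assuming a local validator instance that reports its verdict in the X-W3C-Validator-* response headers; the hostname and the store_result() routine are hypothetical stand-ins, not MAMA's actual code.

    # Sketch of the validation loop; validator.local and store_result()
    # are hypothetical stand-ins.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $ua        = LWP::UserAgent->new(timeout => 300);
    my $validator = 'http://validator.local/check';
    my @url_list  = @ARGV;    # URLs waiting to be validated

    foreach my $url (@url_list) {
        my $response = $ua->get($validator . '?uri=' . uri_escape($url));
        next unless $response->is_success;

        # The W3C validator summarizes its verdict in response headers
        my $status = $response->header('X-W3C-Validator-Status');
        my $errors = $response->header('X-W3C-Validator-Errors');
        store_result($url, $status, $errors);
    }

    sub store_result {    # stand-in for the insert into the MAMA database
        my ($url, $status, $errors) = @_;
        print join("\t", $url, $status || 'unknown', $errors || 0), "\n";
    }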

Fig 2-2: W3C Markup Validator Hardware
Component      Details
Validator #1   CPU: Intel 2.4GHz dual-core P4; RAM: 1GB
Validator #2   CPU: AMD 800MHz; RAM: 768MB

Analysis processing time (project duration)

Completing both phases of URL analysis for MAMA took some time, and the process was divided into several smaller tasks. The main analysis and the validation analysis were completed at different times, because the validator machines were not yet ready when the main analysis ran. As a result, the two analyses were completed roughly 2 months apart - enough time for a small portion of URLs to have changed their content significantly. In the future, both stages of analysis would preferably be done at the same time. The main analysis took just over 2 weeks to complete, while the validation phase took about 3 weeks to finish.

For the main analysis phase, note that some batches of URLs were completed at different times. The bulk of the URL set was done in the first half of November. A diff was then performed between MAMA's URL set and the then-current DMoz URL set, and the new URLs were added to MAMA from 10-12 December. Finally, the balance of Alexa URLs and W3C member company URLs that were not already in MAMA were analyzed at the end of January.

Fig 3-1: Dates of MAMA analysis phases
Analysis phase      Dates
Main analysis       31 Oct. - 13 Nov. 2007; 10 - 12 Dec. 2007; 28 - 29 Jan. 2008
Markup validation   08 - 29 Jan. 2008

Breakdown of MAMA's basic analysis processing time

For most of the URLs in MAMA, the basic analysis phase was a quick affair: 50% of the URLs took less than 2 seconds each, and 75% of them took less than 5 seconds. It was the remaining 25% of the URLs where MAMA spent most of its analysis time. A hard timeout limit of 180 seconds was set for each HTTP transaction, which explains the slight hump in Fig 4-1 below between 180 and 240 seconds. The listed analysis times cover not just the main URL itself, but also a page's many dependencies (scripts, style sheets, and often some of the images as well). Since MAMA analyzed a URL's dependencies serially, in some cases the overall analysis time blossomed: 43 URLs took over half an hour each to analyze, and the most extreme case was a URL that took over 4.5 hours!
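The 180-second cap itself is easy to reproduce. The sketch below shows one common Perl idiom for a hard per-transaction timeout, using eval/alarm (LWP's own timeout() applies per socket read, not per whole transaction); fetch_and_parse() is a hypothetical stand-in for MAMA's per-URL work.

    # Hard 180-second cap per HTTP transaction, sketched with
    # eval/alarm; fetch_and_parse() stands in for the real analysis.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $HARD_LIMIT = 180;                       # seconds
    my $ua = LWP::UserAgent->new;

    sub fetch_and_parse {                       # hypothetical stand-in
        my ($url) = @_;
        return $ua->get($url)->decoded_content;
    }

    sub timed_fetch {
        my ($url) = @_;
        my $result;
        eval {
            local $SIG{ALRM} = sub { die "timeout\n" };
            alarm($HARD_LIMIT);
            $result = fetch_and_parse($url);
            alarm(0);
        };
        alarm(0);                               # always clear the alarm
        return ($@ && $@ eq "timeout\n") ? undef : $result;
    }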

Fig 4-1: MAMA basic URL analysis time
Time (Sec)   Frequency     Time (Sec)   Frequency
0-1          865,590       10-15        118,288
1-2          871,701       15-20         40,650
2-3          513,339       20-25         20,319
3-4          325,583       25-30         10,976
4-5          213,415       30-60         20,435
5-6          162,767       60-120         7,996
6-7          121,782       120-180        1,897
7-8           85,167       180-240       17,222
8-9           61,653       240-300          586
9-10          46,699       300+           3,115

Frequency tables

In order to decrease the physical size of the frequency table documents found in this study, any values detected fewer than 4 times were not included. Looked at one way, MAMA's philosophy can be summed up as:

  • 2 times is a coincidence
  • 3 times is the beginnings of a trend, or a big coincidence
  • 4 times is where even coincidences become too big to ignore

But this is over-thinking the strategy. Listing every single detected value in every frequency table is simply untenable. As it is, some of the lower bounds of the tables had to be raised to make the overall size more manageable. The frequency tables were mostly created by automation scripts, but occasionally some additional hand-editing was done to remove values that seemed grossly incorrect or obviously didn't fit the context.
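A minimal sketch of that cutoff as it might be applied when a table is generated (in practice the tallies would come from the database; here values arrive one per line on STDIN):

    # Tally detected values and drop anything seen fewer than 4 times.
    use strict;
    use warnings;

    my $CUTOFF = 4;
    my %seen;
    chomp(my @values = <STDIN>);
    $seen{$_}++ for @values;

    for my $value (sort { $seen{$b} <=> $seen{$a} } keys %seen) {
        next if $seen{$value} < $CUTOFF;    # below the "coincidence" line
        printf "%8d  %s\n", $seen{$value}, $value;
    }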

The general attribute frequency table had a special extra filter added: some authors don't properly quote attribute values, which leads to an unintended (and incorrect) proliferation of attributes. This was readily apparent with META elements, where what should have been Keywords and Description attribute values were parsed as an explosion of attribute names, causing the attribute frequency table to grow by almost 33%. A custom attribute filter was therefore made for the META element to only allow the attribute names you'd commonly expect on the element.

Aside from this, MAMA's frequency tables generally go into much greater depth than has been documented in other studies, even to the point where many of the values listed fall below any sort of statistical significance. This may be overwhelming to many readers, but it provides as deep a look as possible for those who are interested in such minutiae and are actually looking for aberrant behavior. It is the unexpected cases that happen with regularity that are often the most interesting to browser makers.
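As an illustration of the META special case mentioned above, here is a sketch of what such a filter can look like; the whitelist is illustrative, not MAMA's exact list.

    # Only tally attribute names that plausibly belong on META, so that
    # unquoted Keywords/Description content is not miscounted as
    # attributes. The whitelist is illustrative only.
    use strict;
    use warnings;

    my %meta_attrs = map { $_ => 1 } qw(name content http-equiv scheme);

    sub count_attribute {
        my ($element, $attribute, $freq) = @_;
        return if lc($element) eq 'meta'
              && !$meta_attrs{ lc $attribute };    # discard parser debris
        $freq->{ lc $element }{ lc $attribute }++;
    }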

To use stock parsers or create it all from scratch, that is the question

Unlike Ian Hickson's work at Google and Rene Saarsoo's research, I did not use an official or stock parsing library for markup analysis. I originally had the (untested) theory that markup "in the wild" could be very, VERY bad, and didn't expect that an off-the-shelf parsing library could cope with the dregs of the Web. Since most browsers are very forgiving of bad markup, it seemed like MAMA's approach should also be as forgiving as possible. It wasn't that I didn't trust the quality of any existing parser - the fact is that I simply didn't trust the quality of the markup I would run into on the Web. I could more readily adapt my own code if I ran into problems.

As MAMA's capabilities grew to encompass more of the script and CSS end of things, my thoughts on this problem have evolved somewhat. In terms of syntactic strictness, the script content on a web page stands the highest chance of being the most valid - scripting engines are very unforgiving of errors, hence authors must produce more rigorous code. Markup, on the other hand, can be badly nested with many errors, and a browser will still try to render it. Real-world CSS falls somewhere between these two extremes. Web pages in the wild unearth situations that can really stress a parser, so the error recovery needs to be very robust. If I had to begin MAMA again, I would try to use existing parsing libraries for all its markup, CSS and script analysis needs, just to ease the workload.

Over time, MAMA will try to incorporate existing parsers where doing so is feasible and useful; the first such change will be integrating the Perl SAC module for CSS analysis. After seeing its utility in Saarsoo's study, it seems to me an excellent choice for improving MAMA's CSS detection.

Caveats and disclaimers

Some aspects of the MAMA system have inherent limitations that may (or may not) have introduced problems or bias:

  • Country determination:
    MAMA uses the Perl Geo::IP module from MaxMind to determine the country of origin of URLs. Any problems or limitations with MaxMind's system are thus also problems and limitations in MAMA's country selection system (a minimal lookup sketch follows this list).
  • Geographical limitations:
    The MAMA system runs from Norway. Name resolution, network latency, and other issues tied to the system's geographical location may have arisen that affect the results.
  • Parsers:
    The parsing mechanisms for markup and CSS are custom, and there may be some bugs remaining.
  • URLs mentioned:
    URLs singled out in this study demonstrate the described behaviors at the time of writing, but URLs can/do change over time.
  • Escape characters:
    The frequency tables sometimes contain values with excessive escape characters ("\"). This is partially a result of MAMA's markup-in-script detection, and partially a bug caused by improper data round-tripping from MAMA's database storage. It will be fixed in future versions.
  • Character length:
    A metric used several times in this study is the "character length" of a file, which is used in lieu of file size. Character length is not as immediately relatable as a file size, but it is easily available in Perl and provides similar data. Generally, 1 character = 1 byte, but that is not always the case: Unicode and Asian character sets complicate such measures (among other factors), and file sizes can differ between OSes.
  • URL totals:
    MAMA did not analyze ALL of the URLs it set out to. Transient network issues, dead URLs, and other problems inevitably kept the final set of analyzed URLs smaller than intended, ending at a total of about 3.5 million.
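As noted in the country-determination caveat above, MAMA leans on the Geo::IP module; a minimal lookup sketch (with an arbitrarily chosen host) looks like this:

    # Minimal Geo::IP lookup of the kind MAMA's country determination
    # relies on; any gaps in MaxMind's data propagate into MAMA.
    use strict;
    use warnings;
    use Geo::IP;

    my $gi = Geo::IP->new(GEOIP_STANDARD);
    my $country = $gi->country_code_by_name('www.opera.com');
    print defined $country ? "$country\n" : "unknown\n";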

Design choices

MAMA tries to emulate a browser as closely as possible, but in the end it simply is not one. The following are some things it can and can't handle.

MAMA as a Browser Engine

  • MAMA identified itself as Opera 9.10 in its User-Agent string in order to experience the Web the way Opera's browser would (a short sketch follows this list).
  • MAMA's URL selection policy did not respect any robots.txt or other spidering methodologies. MAMA relied on a randomized URL set and domain-capping to prevent server overloading, coupled with the expectation that all URLs in DMoz were fair game for surveying by their very nature.
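A minimal sketch of that User-Agent setup; the exact UA token shown is a plausible Opera 9.10 string, not necessarily the one MAMA sent.

    # Fetch a page while identifying as Opera 9.10; the UA string is
    # a plausible example, not necessarily MAMA's exact one.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(
        agent   => 'Opera/9.10 (X11; Linux i686; U; en)',
        timeout => 180,
    );
    my $response = $ua->get('http://www.opera.com/');
    print $response->status_line, "\n";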

Frames, etc.

  • Sub-frames and META refresh documents are added to the overall analysis stack for each URL when encountered. This more accurately represents the overall browsing experience for a single URL, but for some tallies that MAMA kept track of (like total number of URLs in a document), this ended up inflating the sums for the parent page; the approach has both good and bad qualities.
  • Sub-frames are added to the overall analysis stack, but not any nested framesets; this process applies only to the first level of frame documents (a minimal sketch follows this list).
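A minimal sketch of that first-level-only behavior, using HTML::Parser for brevity (MAMA's own parser is custom, so this is illustrative only):

    # Push FRAME/IFRAME sources from a top-level document onto the
    # analysis stack, but never recurse into nested framesets.
    use strict;
    use warnings;
    use HTML::Parser;

    my @analysis_stack;

    sub collect_frames {
        my ($html, $depth) = @_;
        return if $depth >= 1;              # first level of frames only
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [ sub {
                my ($tag, $attr) = @_;
                push @analysis_stack, $attr->{src}
                    if ($tag eq 'frame' or $tag eq 'iframe') and $attr->{src};
            }, 'tagname, attr' ],
        );
        $p->parse($html);
        $p->eof;
    }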

JavaScript detection

  • For better or worse, MAMA attempted to look in SCRIPT blocks for dynamically written markup. The remnants of this policy can be seen in the frequency tables where JavaScript syntax occasionally stands out.
  • Scripts are particularly problematic, since MAMA doesn't use a real JavaScript engine. Scripts are only tokenized, and specific token chains are searched for keywords; no complex parsing was attempted (see the sketch after this list).
  • Scripting functions and variables can be abstracted in ways that disguise the JavaScript features MAMA looks for.
  • Attempts were made to detect external CSS or scripts that are dynamically written by other scripts, but no attempt was made to download or analyze their contents.
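A minimal sketch of that tokenize-and-scan approach; the keyword list is illustrative only.

    # Split a SCRIPT block into rough tokens and count hits against a
    # keyword list; nothing is executed or fully parsed.
    use strict;
    use warnings;

    my @keywords = qw(document.write innerHTML eval setTimeout XMLHttpRequest);

    sub scan_script {
        my ($script_text) = @_;
        my %found;
        # Crude tokenization: identifiers, including dotted chains
        for my $token ($script_text =~ /([A-Za-z_\$][\w\$]*(?:\.[\w\$]+)*)/g) {
            $found{$token}++ if grep { $_ eq $token } @keywords;
        }
        return \%found;
    }

    my $hits = scan_script('window.onload = function () { document.write("x"); };');
    print "$_: $hits->{$_}\n" for keys %$hits;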

Other

  • MAMA didn't handle cookie processing.
  • Any unusual or extraordinary authoring conditions that the analysis script didn't expect may not have been handled correctly.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
