MAMA: Scripting report, part 1: Basic scripting syntax and features

By Brian Wilson

Introduction

To my knowledge, there has never been an in-depth examination of scripting factors in Web pages. Rene Saarsoo's study of Coding Practices of Web Pages was able to analyze some factors, but a bug in his analysis program prevented deeper investigation.

A number of strategies were employed by MAMA to extract information from scripting content. Substring matching was used, in addition to regular expressions and complex scripting language tokenization. We will look this week at the basics of scripting, leaving room for next week's thorough examination of MAMA's scripting tokenization. For a deeper look at the details of MAMA's scripting examination, the following MAMA article topics are also available this week:

Script inclusion methods used in Web pages

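Scripting was detected in 2,617,305 of MAMA's URLs, from four different sources:

  • Embedded scripts: inline content of the SCRIPT element
  • External scripts: content from SCRIPT/Src URLs
  • Event handlers: content of attributes beginning with "on"
  • JavaScript URLs: hyperlink URLs prefaced by "javascript:"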

All of these sources together form an interesting and complex backdrop on which to paint our analysis of what MAMA discovered about script usage on the Web.

Quantities of script components in Web pages

Of the four possible methods to specify scripting, the most popular technique found by MAMA is embedded script: just over 88% of the URLs containing script used this method. External scripts and event-handler attributes were used in a similar number of cases (each appeared in roughly two-thirds of the URLs containing script). The "quantity per page" values and other counters represent the number of occurrences of the specific syntax that was discovered for a URL. For example, the maximum number of external scripts encountered in any single page was 264; the maximum number of event handlers discovered was 37,658. The average "per-page" numbers listed in the table below apply only to pages where that type of scripting was used; they do not cover the total MAMA URL space.

Totals for different methods of script inclusion
| Script type | Description | Total URLs containing script type | % Total script usage | Most popular quantity | Max. quantity per page | Avg. quantity per page |
|---|---|---|---|---|---|---|
| Embedded scripts | Inline content of the SCRIPT element | 2,303,363 | 88.0% | 1 | 2,010 | 3.6 |
| Event handlers | Content of attributes beginning with "on" | 1,707,594 | 65.2% | 1 | 37,658 | 19.2 |
| External scripts | Content from SCRIPT/Src URLs | 1,651,383 | 63.1% | 1 | 264 | 2.5 |
| JavaScript URLs | Hyperlink URLs prefaced by "javascript:" | 483,936 | 18.5% | 1 | 3,396 | 4.9 |
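
As a rough illustration of how the four inclusion methods could be tallied for a single page, here is a minimal sketch built on Python's standard `html.parser`. It is not MAMA's implementation: the class name `ScriptCounter`, the helper `count_script_types`, and the `SAMPLE` page are all invented for this example, and the "attribute name begins with on" test is as crude as the table's description suggests.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Tally the four script inclusion methods found in one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {
            "embedded": 0,         # SCRIPT elements with inline content
            "external": 0,         # SCRIPT elements with a src attribute
            "event_handlers": 0,   # attributes whose name begins with "on"
            "javascript_urls": 0,  # href values prefaced by "javascript:"
        }
        self._in_script = False
        self._has_inline_content = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._has_inline_content = False
            if dict(attrs).get("src"):
                self.counts["external"] += 1
        for name, value in attrs:
            if name.startswith("on"):
                self.counts["event_handlers"] += 1
            elif name == "href" and (value or "").strip().lower().startswith("javascript:"):
                self.counts["javascript_urls"] += 1

    def handle_data(self, data):
        # Any non-whitespace text inside a SCRIPT element counts as inline content.
        if self._in_script and data.strip():
            self._has_inline_content = True

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            if self._has_inline_content:
                self.counts["embedded"] += 1
            self._in_script = False


def count_script_types(html):
    """Return the per-method occurrence counts for one page."""
    parser = ScriptCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical page, made up for this sketch.
SAMPLE = """<html><head>
<script src="urchin.js"></script>
<script>var x = 1;</script>
</head><body onload="init()">
<a href="javascript:void(0)" onclick="go()">go</a>
</body></html>"""

print(count_script_types(SAMPLE))
```

The per-page counts produced this way correspond to the "quantity per page" columns above; the "Total URLs" column would then count the pages where a given counter is non-zero.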

Diagram: Script usage by type

The most common script usage was the intersection of external and embedded script with event handler attributes. The rarest combination detected was the use of only event handler script with JavaScript URLs. To get a clearer view of the uses and intersections of the different script specification methods, we can examine the corresponding Venn diagram.

Venn diagram for script usage types

Note: Region sizes are not to scale
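
The regions of such a Venn diagram are simply counts of URLs grouped by the exact combination of script types each one uses. The sketch below shows one way to compute them; the function `venn_regions` and the `PAGES` data are hypothetical and are not taken from MAMA's data set.

```python
from collections import Counter


def venn_regions(types_by_url):
    """Count how many URLs fall into each exact combination of script types."""
    regions = Counter()
    for types in types_by_url.values():
        if types:  # URLs with no scripting belong to no region
            regions[frozenset(types)] += 1
    return regions


# Made-up input: URL -> set of script inclusion methods found on that page.
PAGES = {
    "http://example.com/a": {"embedded", "external", "event_handlers"},
    "http://example.com/b": {"embedded", "external", "event_handlers"},
    "http://example.com/c": {"embedded"},
    "http://example.com/d": {"event_handlers", "javascript_urls"},
    "http://example.com/e": set(),
}

regions = venn_regions(PAGES)
most_common_combo, url_count = regions.most_common(1)[0]
print(sorted(most_common_combo), url_count)
```

Using `frozenset` keys means each page lands in exactly one region, which is what makes statements such as "the rarest combination" well defined.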

Scripts dynamically creating/writing other technologies

During MAMA's development process, a number of URL examples I tested exhibited behaviors that appeared to be distressingly common. So common, in fact, that it seemed imperative for MAMA to measure just how frequently it was happening in the wild. Scripts have the ability to dynamically add markup and code to a document, and some even go so far as to dynamically create other scripts. Full script parsing and execution would be necessary to track down, detect and analyze ALL of these cases, but MAMA is not able to do that in the current version. Instead, MAMA settled for simply detecting the situations where external dependencies are dynamically written in order to gauge the relative importance of this type of behavior. MAMA's discovery that as many as 25% of the URLs using scripting matched its rudimentary "Script writing a Script" criteria definitely warrants future MAMA attention!

Dynamically created CSS and Frame occurrences were much less frequent than the script->script case. All of the checks used simple regular expression substring matches, but in the script->script instance, MAMA added an additional detection in the JavaScript tokenization routine, looking for adjacent quoted string tokens joined by JavaScript's "+" operator. A simple analysis then looked for aggregate strings satisfying MAMA's search criteria.

Dynamically-created external dependencies written by script
| Scenario | What was detected | Frequency | % Total script usage |
|---|---|---|---|
| Script writing Script | Substring/Regexp: /<scr/ or parsed JS String tokens containing: /<script/ && /\ssrc\s*=/ | 675,902 | 25.8% |
| Script writing CSS | Substring/Regexp: /rel=[\'\"]?stylesheet/ | 95,066 | 3.6% |
| Script writing Frames | Substring/Regexp: /<frameset/ | 14,840 | 0.6% |
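
To make the "Script writing Script" check more concrete, the following sketch approximates it in Python. It is an assumption-laden stand-in, not MAMA's code: MAMA used a real JavaScript tokenizer, while this version uses a regular expression to find quoted strings joined by "+". The names `joined_strings` and `writes_script` are invented here, and the patterns are applied case-sensitively exactly as written in the table, which may or may not match MAMA's behavior.

```python
import re

# A single- or double-quoted JavaScript string literal (backslash escapes allowed).
STRING_LITERAL = r"""(?:"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')"""
# A run of string literals joined by "+", e.g. '<s' + 'cript src="a.js">'
CONCAT_RUN = re.compile(rf"{STRING_LITERAL}(?:\s*\+\s*{STRING_LITERAL})*")
SINGLE = re.compile(STRING_LITERAL)


def joined_strings(js):
    """Yield each run of '+'-joined string literals as one aggregate string."""
    for run in CONCAT_RUN.finditer(js):
        parts = SINGLE.findall(run.group(0))
        yield "".join(part[1:-1] for part in parts)  # strip the surrounding quotes


def writes_script(js):
    """Approximate the table's criteria for a script that writes another script."""
    if re.search(r"<scr", js):  # plain substring check
        return True
    return any(
        re.search(r"<script", text) and re.search(r"\ssrc\s*=", text)
        for text in joined_strings(js)
    )


# A split-up tag that the plain "<scr" substring check alone would miss.
example = """document.write('<s' + 'cript type="text/javascript" src="a.js"></s' + 'cript>');"""
print(writes_script(example))
```

The aggregate-string step matters because authors often break the tag into pieces specifically so that the literal text "<script" never appears in the source.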

Mentioning specific browsers in script content

This feature began as a generic question many at Opera had: "How many authors write their Web pages with Opera in mind?" Opera already had evidence that some authors make use of browser-specific workarounds, and this is especially true of scripting. For a simple answer to this question, MAMA detected the use of browser name keywords (case-insensitively); these were expected to be unique enough to give a good idea of how many authors were at least thinking about specific browsers when they developed their documents. MAMA's approach searched all scripting content, including script comments. This method does not give 100% reliable numbers, since simple keyword matching can easily produce false positives. Even so, the keywords chosen were expected to reveal true browser name mentions in the majority of cases.

It turns out that the most difficult of all the browsers to detect in script is Opera, because authors generally refer to Opera with only the single "opera" keyword. This keyword can also match "operator", for example; about 25,000 of MAMA's URLs used the keywords "operator" or "operators".

Browser names mentioned in script
| Browser | Keywords | Frequency | % Total script usage |
|---|---|---|---|
| Microsoft Internet Explorer | "Internet Explorer", "MSIE" | 916,306 | 35.0% |
| Opera | "Opera" | 766,274 | 29.3% |
| Mozilla Firefox | "Mozilla", "Gecko", "Firefox" | 475,628 | 18.2% |
| Apple Safari | "Safari" | 279,946 | 10.7% |
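
The keyword matching described above, and the "operator" false positive, can be reproduced with a few lines of Python. This is a sketch only: the keyword lists come from the table, but the function names are invented, and the word-boundary variant is an illustration of one possible refinement, not something MAMA is stated to have done.

```python
import re

# Keywords per browser, as listed in the table above.
BROWSER_KEYWORDS = {
    "Microsoft Internet Explorer": ["Internet Explorer", "MSIE"],
    "Opera": ["Opera"],
    "Mozilla Firefox": ["Mozilla", "Gecko", "Firefox"],
    "Apple Safari": ["Safari"],
}


def browsers_mentioned(script_text):
    """Case-insensitive substring match of each browser's keywords."""
    lowered = script_text.lower()
    return {
        browser
        for browser, keywords in BROWSER_KEYWORDS.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    }


def opera_word_mentioned(script_text):
    """Hypothetical stricter check: 'opera' must appear as a whole word."""
    return re.search(r"\bopera\b", script_text, re.IGNORECASE) is not None


snippet = "var operator = '+';"
print(browsers_mentioned(snippet))    # substring match reports Opera
print(opera_word_mentioned(snippet))  # whole-word match does not
```

A whole-word test still accepts common sniffing code such as `window.opera`, because the dot counts as a word boundary.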

VBScript usage

MAMA expected the script it encountered to be JavaScript. This is a fair expectation, but it is somewhat unrealistic: some fraction of Web pages are known to use VBScript, which only Microsoft's Internet Explorer supports. There have not been any big public studies into script usage before, so MAMA had no idea at the beginning of the study how prevalent VBScript might be. A special check was added to detect the use of this scripting language: all opening SCRIPT tags and all script content were examined for (case-insensitive) traces of the substring "vbscript". 103,485 URLs in MAMA were found satisfying this condition (4.0% of pages using scripting).
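
A check of this kind is simple enough to sketch directly. The version below assumes the page source and the already-extracted script bodies are available as strings; the function name `mentions_vbscript` and the regular expression for opening SCRIPT tags are assumptions made for this example.

```python
import re

# Matches an opening SCRIPT tag together with its attributes.
OPENING_SCRIPT_TAG = re.compile(r"<script\b[^>]*>", re.IGNORECASE)


def mentions_vbscript(html, script_bodies):
    """True if any opening SCRIPT tag or any script body contains 'vbscript'."""
    tags = OPENING_SCRIPT_TAG.findall(html)
    return any("vbscript" in text.lower() for text in [*tags, *script_bodies])


page = '<SCRIPT LANGUAGE="VBScript">MsgBox "hi"</SCRIPT>'
print(mentions_vbscript(page, ['MsgBox "hi"']))
```

Like the browser keyword search, this is a substring test, so a JavaScript comment that merely mentions VBScript would also count.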

Script library evidence using MAMA search factors as archaeology tools

MAMA studied two specific factors that make it easy to expose script library usage: external script file names and JavaScript function names. When many authors make use of a popular script, the usage numbers for these two factors really make them stand out. Library function names often follow a consistent naming scheme and occur at similar frequencies, so they cluster together in a frequency list, which makes them easier to detect.

Top scripting libraries detected by function name

To see script library activity in action, we need to look at the top 75 entries in the full function name list (cutoff value chosen to demonstrate the proximity effect of libraries in the list):

  • The most popular values are Macromedia-related (function names prefixed by "MM_"). The first two have similar frequencies, and the next pair have similar frequencies as well.
  • Google's Urchin tracker comes next, with 29 of the top 75 spots, all with VERY similar frequencies (384,000-394,000 times each). The function names are prefixed with "__utm" or "_u". Not coincidentally, an external script file name "urchin.js" was found 383,870 times.
  • Google's ad-syndication platform is also well represented in the function name list. The function names are all very compact, typically 1-2 letters long. The entire code for this ad-syndication script is also compacted, with no linefeeds or extra spacing. These function names are all adjacent in the frequency list, each being used 160,000-185,000 times. Again, it is no coincidence that the external script file name "show_ads.js" was used 178,697 times.
  • The following image control/rollover effect functions are very popular and all seem to be related, based on their similar naming schemes and proximities in the frequency list: changeImages (66,867), preloadImages (62,570), newImage (60,512).
  • Adobe's "Active Content" seems to control Flash instances in Web pages. These 5 "Active Content" functions have names prefixed by "AC_" and each occurred between 60,000 and 64,000 times in MAMA. A corresponding external script with the name "AC_RunActiveContent.js" was found 60,428 times and is no doubt related to these instances.
  • Two adjacent entries appear to read and write browser cookies (getCookie and setCookie).
  • In the top 75, two function names (hideMenu and Menu) can be found, but if you go below position 75 you can find many more functions obviously relating to menus.

This is just a small sample; a number of other unique prefixes are noticeable when glancing further down the frequency list. Adobe GoLive has many functions prefixed by "CS" (after finding 100 such unique function names, I stopped counting). Functions common to Lycos/Angelfire/Tripod scripts were well represented with the common prefixes "lhb_" (17 times), "LR_" (18 times) and "lycos_" (11 times).
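
The prefix-based grouping used informally above can be expressed as a short routine. The sketch below is hypothetical: `group_by_prefix` is an invented helper, and the function names and counts in `SAMPLE_COUNTS` are placeholders made up for illustration, not values from MAMA's frequency list.

```python
from collections import defaultdict


def group_by_prefix(function_counts, prefixes):
    """Group function names under the first (longest) known prefix they start with."""
    ordered = sorted(prefixes, key=len, reverse=True)  # prefer the longest match
    groups = defaultdict(list)
    for name, count in function_counts.items():
        for prefix in ordered:
            if name.startswith(prefix):
                groups[prefix].append((name, count))
                break
    return dict(groups)


# Placeholder data: the names and counts are invented for this sketch.
SAMPLE_COUNTS = {
    "MM_one": 1000,
    "MM_two": 990,
    "__utmA": 500,
    "_uB": 498,
    "AC_first": 60,
    "getCookie": 40,
}

groups = group_by_prefix(SAMPLE_COUNTS, ["MM_", "__utm", "_u", "AC_"])
for prefix, members in groups.items():
    print(prefix, members)
```

Names with no known prefix, such as `getCookie` here, are left ungrouped; spotting those relies on the frequency-proximity effect instead.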

...And there is more evidence

Detecting libraries was a very important task for MAMA. The external script file names and function names were the passive evidence found. MAMA also identified unique strings that would track usage of a number of specific script libraries in common use (e.g., Prototype and jQuery), tracking systems (e.g., Urchin, Omniture, and Hitbox), and DHTML menu systems (e.g., Milonic). Every effort was made to guarantee that the patterns were distinctive, but the criteria used may not be totally reliable. There can, of course, always be the occasional false positive, and future versions of these script libraries may alter some of the (currently) unique criteria that MAMA seeks. The full script syntax article details the results for all 24 of the libraries MAMA looked for.

Note: All of the search criteria are case-sensitive regular expressions.

Most popular DHTML Menus/Libraries detected by MAMA
| DHTML menu/library name | Search criteria (regexp) | Frequency |
|---|---|---|
| Macromedia functions from Dreamweaver/Fireworks | Script: / MM_/ | 682,019 |
| Google Analytics/Urchin Tracker | Script: /function\s+urchinTracker/; Filename: /^urchin\.js$/ | 384,756 |
| Prototype JavaScript Framework | Script: /var\s+Prototype\s+=\s+{\s+Version:\s+/ | 31,423 |
| Omniture/SiteCatalyst Analytics | Script: /SiteCatalyst/, /Omniture/; Filename: /^s_code\.js$/ | 18,468 |
| JQuery Library | Script: /jQuery./; Filename: /^jquery.*?\.js$/ | 17,027 |
| Dynamic Drive HV Menu | Script: /MbrSetUp/, /ChildVerticalOverlap/ | 15,111 |
| Milonic DHTML Menu | Script: /closeMenusByArray/, /milonic/ | 13,585 |
| WebSideStory/HitBox Analytics | Script: /function\s+_hbEvent/; Filename: /^hbx\.js$/ | 10,963 |
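
A few of the criteria in the table can be applied with Python's `re` module as follows. This is a sketch under stated assumptions: the article does not say how Script and Filename criteria were combined, so the code treats a match on any single criterion as a detection, and `detect_libraries` is an invented name. Only four of the rows are included to keep the example short.

```python
import re

# A subset of the table's criteria, copied as case-sensitive patterns.
LIBRARY_CRITERIA = {
    "Macromedia functions from Dreamweaver/Fireworks": {
        "script": [r" MM_"],
        "filename": [],
    },
    "Google Analytics/Urchin Tracker": {
        "script": [r"function\s+urchinTracker"],
        "filename": [r"^urchin\.js$"],
    },
    "JQuery Library": {
        "script": [r"jQuery."],
        "filename": [r"^jquery.*?\.js$"],
    },
    "WebSideStory/HitBox Analytics": {
        "script": [r"function\s+_hbEvent"],
        "filename": [r"^hbx\.js$"],
    },
}


def detect_libraries(script_text, external_filenames):
    """Return the libraries whose script or filename criteria match.

    Assumption: a hit on any one pattern is enough to count as a detection.
    """
    found = set()
    for library, criteria in LIBRARY_CRITERIA.items():
        script_hit = any(re.search(p, script_text) for p in criteria["script"])
        file_hit = any(
            re.search(p, filename)
            for p in criteria["filename"]
            for filename in external_filenames
        )
        if script_hit or file_hit:
            found.add(library)
    return found


print(detect_libraries("function urchinTracker(page) {}", ["jquery-1.2.6.min.js"]))
```

Because the patterns are case-sensitive, a file named "JQuery.js" or a function spelled "URCHINTRACKER" would not be detected, which is one source of the undercounting and overcounting the text warns about.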

Conclusion

In the interests of brevity, many of the topics in this week's overview received a very condensed treatment. The reader is encouraged to dig deeper into this week's main MAMA scripting articles for more extensive coverage of the various factors examined. (See the links at the beginning of this document.) Next week, we will look at the goldmine of information that MAMA was able to extract from its tokenization of scripting code—almost 500 JavaScript and DOM-related keywords were identified in 28 categories.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
