MAMA: Scripting syntax

By Brian Wilson

Index:

  1. Introduction
  2. External script file names
  3. External script MIME types
  4. JavaScript function names
  5. DHTML menu/library usage
  6. Scripts dynamically creating/writing other technologies
  7. Mentioning specific browsers in script content
  8. Other miscellaneous script detections

Introduction

Scripting was detected in 2,617,305 of MAMA's URLs (74.58%), coming from four sources: JavaScript URLs, event-handler attributes, external scripts and embedded scripts. MAMA used several strategies to extract information from this scripting content: substring matching, regular expressions and a more complex tokenization of the scripting language. Tokenization was added fairly late to MAMA's analysis, but it has been an excellent way to discover factors that the simpler substring and regular expression processes could not reach.

External script file names

This item of scripting metadata was originally requested to allow easier tracking of JavaScript/DHTML code libraries. A quick survey of authors' use of script libraries shows that they usually do not alter the filenames. Looking at the full frequency table of external script file names, one can easily pick out the names of common script libraries: Prototype, Scriptaculous, Lightbox, Milonic ... it goes on and on. Also noticeable in the list are a number of scripts used for a variety of purposes. Google's Urchin tracking script "urchin.js" is the most popular script name by more than a factor of 2, while their ad-syndication script "show_ads.js" comes in second place. A variety of Adobe/Macromedia scripts can also be found in this list. If you examine file name data and compare them to the frequency tables for script function names (below), it is easy to uncover direct relationships between the two. Of course, often a script is just created from scratch by the page author, and they often do nothing to disguise that fact—external scripts using the file names "script.js" and "scripts.js" are the most popular "obvious" file names found.

Fig 2-1: Top external script file names
(See also the complete frequency table)
File name | Frequency | % Using external script
urchin.js | 383,870 | 23.25%
show_ads.js | 178,697 | 10.82%
counter.js | 71,294 | 4.32%
AC_RunActiveContent.js | 60,428 | 3.66%
menu.js | 47,936 | 2.90%
swfobject.js | 43,751 | 2.65%
mm_menu.js | 35,992 | 2.18%
prototype.js | 31,162 | 1.89%
rollover.js | 28,215 | 1.71%
common.js | 26,297 | 1.59%
animate.js | 23,656 | 1.43%
init.js | 21,858 | 1.32%
script.js | 21,849 | 1.32%
scripts.js | 21,301 | 1.29%
getcod.cgi | 20,099 | 1.22%
hb.js | 19,866 | 1.20%
global.js | 19,655 | 1.19%
functions.js | 19,503 | 1.18%
code-end.js | 19,416 | 1.18%
code-start.js | 19,416 | 1.18%
code-middle.js | 19,416 | 1.18%
lycosRating.js.php | 19,414 | 1.18%
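
A frequency table like Fig 2-1 could be produced roughly as follows. This is only a sketch: it assumes the file name is the last path segment of the script URL, and that a name is counted once per page URL no matter how often the page references it. The helper names and the sample data are hypothetical, not MAMA's own.

```python
from collections import Counter
from urllib.parse import urlsplit
import posixpath


def script_file_name(src):
    """Return the last path segment of a script URL, without query or fragment."""
    return posixpath.basename(urlsplit(src).path)


def tally_file_names(script_srcs_by_url):
    """Count each file name once per page URL (an assumed counting rule)."""
    counts = Counter()
    for srcs in script_srcs_by_url.values():
        names = {script_file_name(src) for src in srcs}
        names.discard("")  # a src ending in "/" has no file name
        counts.update(names)
    return counts


# Hypothetical sample data, not taken from MAMA.
pages = {
    "http://example.com/": [
        "http://www.google-analytics.com/urchin.js",
        "/js/menu.js?v=2",
    ],
    "http://example.org/": [
        "http://www.google-analytics.com/urchin.js",
        "scripts/menu.js",
        "scripts/menu.js",
    ],
}
file_name_counts = tally_file_names(pages)
```

Taking the last path segment regardless of extension is consistent with entries such as "getcod.cgi" and "lycosRating.js.php" appearing in the table.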

External script MIME types

MAMA tracked the returned MIME type of external scripts that were downloaded. It did not trust any explicit values for the Type attribute (if present) for this information; it relied solely upon the information received when actually fetching the external resource. The value "application/x-javascript" is the most popular, by more than a factor of two over the second-place MIME type, "text/javascript". At over 10% of the external script population, the MIME type "text/html" is surprising in its popularity. One hopes that this is the result of misconfigured servers and not Web page 404-redirects; there is currently very little MAMA can do to tell the difference. In all, ~800 scripts reported a VBScript MIME type, although MAMA found over 100,000 cases of the keyword "vbscript" in both embedded and external scripts. While there are some believable scenarios where VBScript could be used in embedded form 10-to-1 compared to external scripts, the overall ratio of external to embedded scripts does not support this outcome. It is very possible that a number of servers are not delivering external VBScript files with the correct MIME type. As with the external CSS MIME case, many other MIME types were also observed in the full MIME frequency table.

Fig 3-1: Top external script MIME types
(See also the complete frequency table)
MIME type | Frequency | % Using external script
application/x-javascript | 1,282,922 | 77.69%
text/javascript | 559,688 | 33.89%
text/html | 176,863 | 10.71%
text/plain | 16,684 | 1.01%
application/javascript | 12,522 | 0.76%
application/octet-stream | 8,420 | 0.51%
text/x-js | 6,283 | 0.38%
none | 4,242 | 0.26%
application/octetstream | 1,222 | 0.07%
text/js | 1,051 | 0.06%
text/vbscript | 797 | 0.05%
text/css | 461 | 0.03%
image/gif | 313 | 0.02%
texthtml | 298 | 0.02%
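
To illustrate relying on the fetched response rather than the Type attribute, the sketch below reduces a Content-Type header value to a bare MIME type before tallying. It assumes that the "none" row in Fig 3-1 means the server sent no Content-Type header at all. The function name and sample headers are hypothetical.

```python
from collections import Counter


def mime_type_from_header(content_type):
    """Reduce a Content-Type header value to a bare, lowercased MIME type.

    Returns "none" when no header value was received (an assumed reading
    of the "none" row in Fig 3-1).
    """
    if content_type is None or not content_type.strip():
        return "none"
    # Drop parameters such as "; charset=utf-8".
    return content_type.split(";", 1)[0].strip().lower()


# Hypothetical Content-Type values for four fetched scripts.
headers = [
    "application/x-javascript",
    "text/javascript; charset=UTF-8",
    "Text/HTML",
    None,
]
mime_counts = Counter(mime_type_from_header(h) for h in headers)
```

Malformed values such as "texthtml" pass through unchanged, which is how they could end up as separate rows in the frequency table.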

JavaScript function names

MAMA tracked the function names declared in script code. For a number of reasons, library scripts are fairly easy to pick out in the full frequency list, especially near the top:

  1. Common script libraries are used by many different URLs, but they use the same function naming schemes, and often the same external script file name.
  2. These libraries almost always have the same or similar frequency rates, so they cluster together in the list for easier detection.
  3. Because of their proximities in the list, function naming schemes used by libraries stand out.
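
As a rough illustration of how declared function names might be collected and counted, here is a minimal sketch. It uses a regular expression instead of a real tokenizer, so it is an approximation and not MAMA's implementation. The sample function names other than urchinTracker are illustrative.

```python
import re
from collections import Counter

# Matches "function name(" declarations only. Assignments such as
# "var f = function () {...}" are out of scope for this sketch.
FUNCTION_DECL = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)\s*\(")


def declared_function_names(script):
    """Return the set of function names declared in one script."""
    return set(FUNCTION_DECL.findall(script))


# Hypothetical script contents for two pages.
scripts = [
    "function MM_preloadImages() {} function MM_swapImage() {}",
    "function MM_preloadImages() {} function urchinTracker(page) {}",
]
name_counts = Counter()
for source in scripts:
    name_counts.update(declared_function_names(source))
```

Sorting such a counter by frequency puts functions from the same library next to each other, which is the clustering effect described in the list above.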

Top scripting libraries detected by function name

To see script library activity in action, we can look at the top 75 entries in the function name list (cutoff value chosen to demonstrate the proximity effect of libraries in the list):

  • The most popular values are Macromedia-related (function names prefixed by "MM_"). The first two have similar frequencies, and the next pair have similar frequencies as well.
  • Google's Urchin tracker comes next, with 29 of the top 75 spots, all with very similar frequencies (each used between 384,000 and 394,000 times). The function names are prefixed with "__utm" or "_u". Not coincidentally, an external script file name "urchin.js" was found 383,870 times.
  • Google's ad-syndication platform is also well represented in the function name list. The function names are all very compact, typically 1-2 letters long. The entire code for this ad-syndication script is also compacted, with no linefeeds or extra spacing. These function names are all adjacent in the frequency list, each being used between 160,000 and 185,000 times. It is again no coincidence that the external script file name "show_ads.js" was used 178,697 times.
  • The following image control/rollover effect functions are very popular and all seem to be related, based on their similar naming schemes and proximities in the frequency list: changeImages (66,867), preloadImages (62,570), newImage (60,512).
  • Adobe's "Active Content" controls Flash instances in Web pages. These 5 "Active Content" functions have names prefixed by "AC_" and each occurred between 60,000 and 64,000 times in MAMA. A corresponding external script with the name "AC_RunActiveContent.js" was found 60,428 times and is no doubt related to these instances.
  • Two adjacent entries read and write browser cookies: getCookie and setCookie.
  • In the top 75, two function names (hideMenu and Menu) can be found, but if you go below position 75 you can find many more functions obviously relating to menus.

This is just a small sample; a number of other unique prefixes are noticeable by glancing further down the frequency list—Adobe GoLive has many functions prefixed by "CS" (after finding 100 such unique function names, I stopped counting). Functions common to Lycos/Angelfire/Tripod scripts were well represented with the common prefixes "lhb_" (17 times), "LR_" (18 times) and "lycos_" (11 times).
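
The prefix observations above can be approximated by grouping function names under known prefixes. The sketch below uses the prefixes mentioned in this section. The grouping function and the sample names are hypothetical, and a short prefix such as "CS" will also catch unrelated names.

```python
from collections import defaultdict

# Prefixes named in this section; matching is case-sensitive.
PREFIXES = ("MM_", "__utm", "_u", "AC_", "CS", "lhb_", "LR_", "lycos_")


def group_by_prefix(names):
    """Group function names under the longest matching known prefix."""
    ordered = sorted(PREFIXES, key=len, reverse=True)  # most specific first
    groups = defaultdict(list)
    for name in names:
        for prefix in ordered:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
    return dict(groups)


# Illustrative names only; not a sample of MAMA's data.
groups = group_by_prefix(
    ["MM_swapImage", "AC_FL_RunContent", "__utmSetVar", "_uInfo", "getCookie"]
)
```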

DHTML menu/library usage

This part of MAMA's research began as a desire to locate real-life examples of specific popular DHTML menu systems and libraries so that we could test their functionality in Opera and investigate various issues. I worked with a colleague to identify 1 or 2 substrings from each of these menus/libraries that would uniquely distinguish them from other JavaScript code. Every effort was made to guarantee that the patterns were distinctive, but the criteria used may not be totally reliable. There can, of course, always be the occasional false positive, and future versions of these script libraries may alter some of the (currently) unique criteria that MAMA seeks.

MAMA detected 1,084,593 URLs using at least 1 of the following DHTML Menus or Libraries. In the URLs where these systems were detected, over 60% used the Macromedia functions, while over 1/3 used Google's Urchin tracking system. By comparison, all of the other code libraries were used far less often.

Note: All of the search criteria are case-sensitive regular expressions.

Fig 5-1: DHTML Menus/Libraries detected by MAMA
DHTML menu/library name | Search criteria (regexp) | Frequency
Macromedia functions from Dreamweaver/Fireworks | Script: / MM_/ | 682,019
Google Analytics/Urchin Tracker | Script: /function\s+urchinTracker/; Filename: /^urchin\.js$/ | 384,756
Prototype JavaScript Framework | Script: /var\s+Prototype\s+=\s+{\s+Version:\s+/ | 31,423
Omniture/SiteCatalyst Analytics | Script: /SiteCatalyst/, /Omniture/; Filename: /^s_code\.js$/ | 18,468
JQuery Library | Script: /jQuery./; Filename: /^jquery.*?\.js$/ | 17,027
Dynamic Drive HV Menu | Script: /MbrSetUp/, /ChildVerticalOverlap/ | 15,111
Milonic DHTML Menu | Script: /closeMenusByArray/, /milonic/ | 13,585
WebSideStory/HitBox Analytics | Script: /function\s+_hbEvent/; Filename: /^hbx\.js$/ | 10,963
Yahoo YUI! Library | Script: /YAHOO.namespace/ | 7,953
Jupitermedia HierMenus | Script: /HM_/ | 7,631
Likno AllWebMenus | Script: /awmCreateMenu/ | 5,705
OpenCube QuickMenu Pro | Script: /dqm__/, /DQM_/ | 4,837
Dan Steinman's DynAPI | Script: /dynapi/ | 3,471
TinyMCE Text Editor | Script: /tinyMCE./; Filename: /tiny_mce\.js/ | 3,432
Ultimate Drop Down Menu | Script: /um.menuClasses/, /\/\/UDMv3/ | 3,334
xFx Menu | Script: /dmbtbB/, /rjsPath/ | 2,490
Siteexpert/Xtreeme Menu | Script: /m1.bIncBorder/ | 2,044
Freestyle Menu (Angus Turnbull) | Script: /FSMenu.prototype/ | 1,770
Cascading Popup Menu (Angus Turnbull) | Script: /PopupMenu.prototype/ | 840
MochiKit Library | Script: /MochiKit.MochiKit/ | 248
Dojo JavaScript Toolkit | Script: /dojo.js/ | 220
Tree Menu | Script: /MTMOutputString/ | 110
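
Applying criteria like those in Fig 5-1 might look like the following sketch, which transcribes three rows into Python patterns. It assumes that a match on any one listed pattern, in either script content or an external file name, counts as a detection; the text does not state how multiple criteria were combined. The function and variable names are hypothetical.

```python
import re

# Three rows of Fig 5-1 as (script-content patterns, file-name patterns).
# The "{" in the Prototype pattern is escaped here for clarity.
CRITERIA = {
    "Google Analytics/Urchin Tracker": (
        [r"function\s+urchinTracker"],
        [r"^urchin\.js$"],
    ),
    "Prototype JavaScript Framework": (
        [r"var\s+Prototype\s+=\s+\{\s+Version:\s+"],
        [],
    ),
    "Milonic DHTML Menu": ([r"closeMenusByArray", r"milonic"], []),
}


def detect_libraries(script_text, file_names):
    """Return the libraries whose criteria match (case-sensitive search)."""
    found = []
    for library, (script_patterns, file_patterns) in CRITERIA.items():
        in_script = any(re.search(p, script_text) for p in script_patterns)
        in_files = any(
            re.search(p, name) for p in file_patterns for name in file_names
        )
        if in_script or in_files:
            found.append(library)
    return found
```

For example, detect_libraries("", ["urchin.js"]) reports the Urchin tracker from the file name alone, while the text "MILONIC" is not detected because the patterns are case-sensitive.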

Scripts dynamically creating/writing other technologies

During MAMA's development process, a number of URL examples exhibited behaviors that appeared to be distressingly common. So common, in fact, that it seemed imperative for MAMA to measure just how frequently it was happening in the wild. Scripts have the ability to dynamically add markup and code to a document, and some even go so far as to dynamically create other scripts. Full script parsing and execution would be necessary to track down, detect, and analyze ALL of these cases, but MAMA is not able to do that in the current version. Instead, MAMA settled for detecting situations where external dependencies are dynamically written in order to gauge the relative importance of this type of behavior. MAMA's discovery that as many as 25% of the URLs using scripting matched its rudimentary "Script writing a Script" criteria definitely warrants future investigation!

Dynamically created CSS and Frames occurrences were much less frequent than the script->script case. All of the checks used simple regular expression substring matches, but in the script->script instance, MAMA added an additional detection in the JavaScript tokenization routine, looking for adjacent quoted string tokens joined by JavaScript's "+" operator. A simple analysis then looked for aggregate strings satisfying MAMA's search criteria.

Fig 6-1: Dynamically created external dependencies written by script
Scenario | What was detected | Frequency | % Total script usage
Script writing Script | Substring/Regexp: /<scr/ or parsed JS String tokens containing: /<script/ && /\ssrc\s*=/ | 675,902 | 25.82%
Script writing CSS | Substring/Regexp: /rel=[\'\"]?stylesheet/ | 95,066 | 3.63%
Script writing Frames | Substring/Regexp: /<frameset/ | 14,840 | 0.57%
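
The "Script writing Script" check, including the joining of adjacent string tokens, can be sketched as below. This is a simplified stand-in for MAMA's tokenizer: it recognizes only plain quoted strings (no escapes, comments or regular expression literals), and all names are hypothetical.

```python
import re

# A quoted string with no backslash escapes or line breaks inside it.
LITERAL = re.compile(r'"[^"\\\n]*"|' r"'[^'\\\n]*'")


def aggregate_strings(script):
    """Join runs of string literals that are separated only by "+"."""
    aggregates, current, last_end = [], None, 0
    for match in LITERAL.finditer(script):
        gap = script[last_end:match.start()]
        text = match.group(0)[1:-1]  # strip the surrounding quotes
        if current is not None and gap.strip() == "+":
            current += text
        else:
            if current is not None:
                aggregates.append(current)
            current = text
        last_end = match.end()
    if current is not None:
        aggregates.append(current)
    return aggregates


def script_writes_script(script):
    """Apply the two-part criteria listed in Fig 6-1."""
    if re.search(r"<scr", script):
        return True
    return any(
        re.search(r"<script", text) and re.search(r"\ssrc\s*=", text)
        for text in aggregate_strings(script)
    )


sample = """document.write('<' + 'script src="a.js"></' + 'script>');"""
```

The sample never contains the raw substring "<scr", so only the aggregated string, once the three literals are joined, satisfies the criteria.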

Mentioning specific browsers in script content

This feature began as a generic question many at Opera had: "How many authors write their Web pages with Opera in mind?". Opera already had evidence that some authors make use of browser-specific workarounds, and this is especially true of scripting. For a simple answer to this question, MAMA detected the use of browser name keywords (case-insensitively)—these were expected to be unique enough to give a good idea of how many authors were considering specific browsers in their development. MAMA's approach searched against all scripting content, including script comments. This method does not give 100% reliable numbers—it can be fairly easy for simple keyword matching to give false-positives, after all. The choice of the keywords used was expected to reveal true browser name mentions in the majority of cases.

It turns out that the most difficult of all the browsers to detect in script is Opera, because authors generally refer to Opera with the single "opera" keyword. This keyword also matches "operator"; for example, about 25,000 of MAMA's URLs used the keywords "operator" or "operators". Authors also typically use a single keyword with Safari, but this is not a problem since "safari" is not a substring of any other common word (well, that I know of, anyway).

Fig 7-1: Browser names mentioned in script
Browser | Keywords | Frequency | % Total script usage
Microsoft Internet Explorer | "Internet Explorer", "MSIE" | 916,306 | 35.01%
Opera | "Opera" | 766,274 | 29.28%
Mozilla Firefox | "Mozilla", "Gecko", "Firefox" | 475,628 | 18.17%
Apple Safari | "Safari" | 279,946 | 10.70%
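
The keyword approach, and the "operator" false positive it produces, can be shown with a short sketch. The first function mirrors the case-insensitive substring matching described above. The word-boundary pattern after it is not something MAMA did; it is included only to show one way such a collision could be avoided. All names are hypothetical.

```python
import re

# Keywords from Fig 7-1, lowercased for case-insensitive matching.
BROWSER_KEYWORDS = {
    "Microsoft Internet Explorer": ("internet explorer", "msie"),
    "Opera": ("opera",),
    "Mozilla Firefox": ("mozilla", "gecko", "firefox"),
    "Apple Safari": ("safari",),
}


def browsers_mentioned(script):
    """Return the browsers whose keywords occur anywhere in the script."""
    lowered = script.lower()
    return {
        browser
        for browser, keywords in BROWSER_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


# Not MAMA's method: a stricter check that requires "opera" as a whole word.
OPERA_WORD = re.compile(r"\bopera\b", re.IGNORECASE)
```

With plain substring matching, a script containing only "var operator = '+';" is reported as mentioning Opera, whereas the whole-word pattern rejects it and still accepts "window.opera".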

Other miscellaneous script detections

Many of the items here are detections added to satisfy special requests from those at Opera who needed to quickly gather statistics on script usage. There used to be many more simple checks of this type, but with the advent of MAMA's newer basic JavaScript tokenizer, they became redundant and were removed. These are the remainders of that older strategy. Some of the following items are important, while others would definitely be considered esoteric or "fringe" data based on the usage numbers. Mostly, this list serves as a reminder that you can find any sort of information you like from scripting if you just know how to look for it.

Fig 8-1: Miscellaneous items searched for in scripts
Factor | Motivation | What was detected | Frequency
Window.open | To help study pop-up-blocking trends | Substring: "window.open" in any script content | 938,210
Frame breaking | Internal tool defeated by frame breakers | Substring: "top.location.href" in any script content | 115,564
VBScript usage | To find scripting cases using Microsoft's scripting language | Substring: "vbscript" (CI) in all opening SCRIPT tags, as well as in any script content | 103,485
CSS .filter property | To find sites using MSIE CSS 'filter' property via script (could be name collisions with DOM Traversal) | Substring: ".filter" in any script content | 198,487
CSS .display set to "block" | Sites use to dynamically toggle sections | Regexp: /style.display\s*=\s*[\'\"]block/ in any script content | 238,917
CSS .display set to "table" or "inline-table" | Testing sites that use this CSS property/value combination | Regexp: /style.display\s*=\s*[\'\"](inline-)?table/ in any script content | 1,543
Use of the "eval" keyword | Script engine developer needed live test cases | eval used as a parsed JavaScript identifier | 13,067
Aliasing "eval" to another variable | Script engine developer needed live test cases | Regexp: /\=\s*eval[^\w]/ && !~ /\=\s*eval\s*\(/ | 303
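
Two of the regular expression checks in Fig 8-1 translate to Python roughly as follows. The sketch assumes that both halves of the eval aliasing test are applied to the whole script content, as the combined expression suggests. The function names are hypothetical.

```python
import re

ASSIGNS_EVAL = re.compile(r"=\s*eval[^\w]")   # "= eval" followed by a non-word character
CALLS_EVAL = re.compile(r"=\s*eval\s*\(")     # "= eval(" is an ordinary call
DISPLAY_BLOCK = re.compile(r"style.display\s*=\s*['\"]block")


def aliases_eval(script):
    """True when eval is assigned somewhere and never called after "="."""
    return bool(ASSIGNS_EVAL.search(script)) and not CALLS_EVAL.search(script)


def sets_display_block(script):
    """True when the script assigns a style.display value starting with "block"."""
    return bool(DISPLAY_BLOCK.search(script))
```

Under this reading, "var e = eval;" counts as aliasing, while "var r = eval('1+1');" and "var e = evaluate;" do not. A script that contains both an alias and an assigned eval call would be missed, which may be one reason the count is so low.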

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
