MAMA: Scripting report, part 1: Basic scripting syntax and features
Introduction
To my knowledge, there has never been an in-depth examination of scripting factors in Web pages. Rene Saarsoo's study of Coding Practices of Web Pages was able to analyze some factors, but a bug in his analysis program prevented deeper investigation.
A number of strategies were employed by MAMA to extract information from scripting content. Substring matching was used, in addition to regular expressions and complex scripting language tokenization. We will look this week at the basics of scripting, leaving room for next week's thorough examination of MAMA's scripting tokenization. For a deeper look at the details of MAMA's scripting examination, the following MAMA article topics are also available this week:
Script inclusion methods used in Web pages
Scripting was detected in 2,617,305 of MAMA's URLs, from four different sources:
- External scripts via the SCRIPT/Src element/attribute
- Embedded scripts as inline content of the SCRIPT element
- Common event handler attributes (attributes beginning with the string "on")
- JavaScript URL syntax used by hyperlinks (Any content after the leading string "javascript:" in a hyperlink)
All of these sources together form an interesting and complex backdrop on which to paint our analysis of what MAMA discovered about script usage on the Web.
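As a rough illustration of the four sources listed above, the following Python sketch counts each of them in a single document. It is not MAMA's code: the class name `ScriptSourceCounter`, the helper `count_script_sources`, and the decision to treat only `A` elements as hyperlinks are assumptions made for this example.

```python
from html.parser import HTMLParser


class ScriptSourceCounter(HTMLParser):
    """Counts the four script sources for one document (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.counts = {"external": 0, "embedded": 0,
                       "event_handlers": 0, "javascript_urls": 0}
        self._in_inline_script = False
        self._inline_has_content = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "script":
            if "src" in attr_map:
                self.counts["external"] += 1
            else:
                self._in_inline_script = True
                self._inline_has_content = False
        # Assumption: any attribute whose name begins with "on" is a handler.
        for name in attr_map:
            if name.startswith("on"):
                self.counts["event_handlers"] += 1
        # Assumption: only A elements are treated as hyperlinks here.
        href = attr_map.get("href", "").strip().lower()
        if tag == "a" and href.startswith("javascript:"):
            self.counts["javascript_urls"] += 1

    def handle_data(self, data):
        if self._in_inline_script and data.strip():
            self._inline_has_content = True

    def handle_endtag(self, tag):
        if tag == "script" and self._in_inline_script:
            if self._inline_has_content:
                self.counts["embedded"] += 1
            self._in_inline_script = False


def count_script_sources(html):
    """Return per-source counts for one HTML document."""
    parser = ScriptSourceCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Calling `count_script_sources` on a page's markup yields one counter per source, which is the kind of "quantity per page" value discussed in the next section.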
Quantities of script components in Web pages
Of the four possible methods to specify scripting, the most popular technique found by MAMA is embedded script: just over 88% of the URLs that use scripting employed this method. External scripts and event-handler attributes were used in a similar number of cases (each appeared in roughly two-thirds of all URLs using scripting). The "quantity per page" values and other counters represent the number of occurrences of the specific syntax discovered for a URL. For example, the maximum number of external scripts encountered in any single page was 264; the maximum number of event handlers discovered was 37,658. The average "per-page" numbers listed in the table below apply only to URLs where that type of scripting was used and do not cover the total MAMA URL space.
Script type | Description | Total URLs containing script type | % Total script usage | Most popular quantity | Max. quantity per page | Avg. quantity per page |
---|---|---|---|---|---|---|
Embedded scripts | Inline content of the SCRIPT element | 2,303,363 | 88.0% | 1 | 2,010 | 3.6 |
Event handlers | Content of attributes beginning with "on" | 1,707,594 | 65.2% | 1 | 37,658 | 19.2 |
External scripts | Content from SCRIPT /Src URLs | 1,651,383 | 63.1% | 1 | 264 | 2.5 |
JavaScript URLs | Hyperlink URLs prefaced by "javascript:" | 483,936 | 18.5% | 1 | 3,396 | 4.9 |
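The "% Total script usage" column can be reproduced from the URL counts, assuming the denominator is the 2,617,305 URLs in which scripting was detected. The sketch below is only a check of that arithmetic; the function name `share_of_scripting` is made up for this example.

```python
# Total number of MAMA URLs in which scripting was detected (from the text).
TOTAL_SCRIPTING_URLS = 2_617_305

# URL counts per script type, copied from the table above.
URLS_BY_TYPE = {
    "Embedded scripts": 2_303_363,
    "Event handlers": 1_707_594,
    "External scripts": 1_651_383,
    "JavaScript URLs": 483_936,
}


def share_of_scripting(count, total=TOTAL_SCRIPTING_URLS):
    """Percentage of scripting URLs, rounded to one decimal place."""
    return round(100 * count / total, 1)


for script_type, count in URLS_BY_TYPE.items():
    print(f"{script_type}: {share_of_scripting(count)}%")
```

The four shares add up to well over 100% because a single URL can use several inclusion methods at once.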
Diagram: Script usage by type
The most common script usage was the intersection of external and embedded script with event handler attributes. The rarest combination detected was the use of only event handler script with JavaScript URLs. To get a clearer view of the uses and intersections of the different script specification methods, we can examine the corresponding Venn diagram.
Note: Region sizes are not to scale
Scripts dynamically creating/writing other technologies
During MAMA's development process, a number of URL examples I tested exhibited behaviors that appeared to be distressingly common. So common, in fact, that it seemed imperative for MAMA to measure just how frequently it was happening in the wild. Scripts have the ability to dynamically add markup and code to a document, and some even go so far as to dynamically create other scripts. Full script parsing and execution would be necessary to track down, detect and analyze ALL of these cases, but MAMA is not able to do that in the current version. Instead, MAMA settled for simply detecting the situations where external dependencies are dynamically written in order to gauge the relative importance of this type of behavior. MAMA's discovery that as many as 25% of the URLs using scripting matched its rudimentary "Script writing a Script" criteria definitely warrants future MAMA attention!
Dynamically created CSS and Frame occurrences were much less frequent than the script->script case. All of the checks used simple regular expression substring matches, but in the script->script instance, MAMA added an additional detection in the JavaScript tokenization routine, looking for adjacent quoted string tokens joined by JavaScript's "+" operator. A simple analysis then looked for aggregate strings satisfying MAMA's search criteria.
Scenario | What was detected | Frequency | % Total script usage |
---|---|---|---|
Script writing Script | Substring/Regexp: /<scr/ or parsed JS String tokens containing: /<script/ && /\ssrc\s*=/ | 675,902 | 25.8% |
Script writing CSS | Substring/Regexp: /rel=[\'\"]?stylesheet/ | 95,066 | 3.6% |
Script writing Frames | Substring/Regexp: /<frameset/ | 14,840 | 0.6% |
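The checks in the table can be approximated in a few lines of Python. This is a sketch under stated assumptions, not MAMA's implementation: the string-literal regular expression stands in for MAMA's real JavaScript tokenizer, case-insensitive matching is assumed, and the names `aggregate_strings` and `classify_dynamic_writes` are hypothetical.

```python
import re

# Crude stand-in for a tokenizer: a single- or double-quoted string literal.
STRING_LITERAL = r"""(?:"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')"""
SINGLE_LITERAL = re.compile(STRING_LITERAL)
# One literal followed by any number of '+ literal' continuations.
CONCAT_CHAIN = re.compile(rf"{STRING_LITERAL}(?:\s*\+\s*{STRING_LITERAL})*")


def aggregate_strings(script):
    """Join adjacent quoted literals connected by '+' into one string each."""
    aggregates = []
    for chain in CONCAT_CHAIN.finditer(script):
        parts = [lit[1:-1] for lit in SINGLE_LITERAL.findall(chain.group(0))]
        aggregates.append("".join(parts))
    return aggregates


def classify_dynamic_writes(script):
    """Apply the three table criteria to one piece of script content."""
    writes_script = bool(re.search(r"<scr", script, re.I)) or any(
        re.search(r"<script", text, re.I) and re.search(r"\ssrc\s*=", text, re.I)
        for text in aggregate_strings(script)
    )
    return {
        "script": writes_script,
        "css": bool(re.search(r"rel=['\"]?stylesheet", script, re.I)),
        "frames": bool(re.search(r"<frameset", script, re.I)),
    }
```

The aggregate-string step matters for code such as `document.write("<s" + "cript src='x.js'></s" + "cript>")`, where the plain /<scr/ substring never appears in the source but the joined string contains a full opening SCRIPT tag with a Src attribute.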
Mentioning specific browsers in script content
This feature began as a generic question many at Opera had: "How many authors write their Web pages with Opera in mind?" Opera already had evidence that some authors make use of browser-specific workarounds, and this is especially true of scripting. For a simple answer to this question, MAMA detected the use of browser name keywords (case-insensitively). These were expected to be unique enough to give a good idea of how many authors were at least thinking about specific browsers when they developed their documents. MAMA's approach searched against all scripting content, including script comments. This method does not give 100% reliable numbers; simple keyword matching can easily produce false positives, after all. The choice of the keywords used was expected to reveal true browser name mentions in the majority of cases.
It turns out that the most difficult of all the browsers to detect in script is Opera, because authors generally refer to Opera with only the single "opera" keyword. This keyword can also match "operator", for example; about 25,000 of MAMA's URLs used the keywords "operator" or "operators".
Browser | Keywords | Frequency | % Total script usage |
---|---|---|---|
Microsoft Internet Explorer | "Internet Explorer", "MSIE" | 916,306 | 35.0% |
Opera | "Opera" | 766,274 | 29.3% |
Mozilla Firefox | "Mozilla", "Gecko", "Firefox" | 475,628 | 18.2% |
Apple Safari | "Safari" | 279,946 | 10.7% |
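To make the "operator" problem concrete, here is a small Python sketch of case-insensitive keyword matching using the keywords from the table. The `whole_words` option is a hypothetical refinement added for comparison only; nothing in the study indicates MAMA used word boundaries, and the function name `browsers_mentioned` is likewise an assumption.

```python
import re

# Keywords copied from the table above.
BROWSER_KEYWORDS = {
    "Microsoft Internet Explorer": ["Internet Explorer", "MSIE"],
    "Opera": ["Opera"],
    "Mozilla Firefox": ["Mozilla", "Gecko", "Firefox"],
    "Apple Safari": ["Safari"],
}


def browsers_mentioned(script, whole_words=False):
    """Return the set of browsers whose keywords appear in the script text."""
    found = set()
    for browser, keywords in BROWSER_KEYWORDS.items():
        for keyword in keywords:
            pattern = re.escape(keyword)
            if whole_words:
                # Hypothetical refinement, not described in the study.
                pattern = r"\b" + pattern + r"\b"
            if re.search(pattern, script, re.IGNORECASE):
                found.add(browser)
                break
    return found
```

With plain substring matching, a script containing `var operator = '+';` is reported as mentioning Opera; with `whole_words=True` it is not, while `window.opera` is still detected in both modes.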
VBScript usage
MAMA expected the script it encountered to be JavaScript. This is a fair expectation, but it is somewhat unrealistic. Some fraction of Web pages are definitely known to use Microsoft's IE-only VBScript. There have not been any big public studies into script usage before, so MAMA had no idea at the beginning of the study about how prevalent VBScript might be. A special check was added to detect the use of this scripting language: all opening SCRIPT tags and all script content were examined for (case-insensitive) traces of the substring "vbscript".
103,485 URLs in MAMA were found satisfying this condition (4.0% of pages using scripting).
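A minimal sketch of such a check, assuming it is applied to raw markup and limited to SCRIPT elements (the study also mentions other script content, which this example ignores). The regular expression and the function name `uses_vbscript` are assumptions for illustration.

```python
import re

# Captures the opening SCRIPT tag and the element's inline content.
SCRIPT_ELEMENT = re.compile(
    r"(<script\b[^>]*>)(.*?)</script\s*>", re.IGNORECASE | re.DOTALL)


def uses_vbscript(html):
    """True if any opening SCRIPT tag or its content mentions 'vbscript'."""
    for match in SCRIPT_ELEMENT.finditer(html):
        opening_tag, content = match.group(1), match.group(2)
        if "vbscript" in opening_tag.lower() or "vbscript" in content.lower():
            return True
    return False
```

Like the browser keyword test, this is a substring match, so a JavaScript comment that merely mentions VBScript would also count.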
Script library evidence using MAMA search factors as archaeology tools
Two specific factors MAMA studied show that it is easy to expose script library usage: external script file names and JavaScript function names. When many authors make use of a popular script, the usage numbers of these two factors really make them stand out. In the case of function names, library function names often use a consistent naming scheme and have similar frequency rates; these cluster together in a frequency list, making them easier to detect.
Top scripting libraries detected by function name
To see script library activity in action, we need to look at the top 75 entries in the full function name list (cutoff value chosen to demonstrate the proximity effect of libraries in the list):
- The most popular values are Macromedia-related (function names prefixed by "MM_"). The first two have similar frequencies, and the next pair have similar frequencies as well.
- Google's Urchin tracker comes next, with 29 of the top 75 spots, all with VERY similar frequencies (384-394,000 times each). The function names are prefixed with "__utm" or "_u". Not coincidentally, an external script file name "urchin.js" was found 383,870 times.
- Google's ad-syndication platform is also well represented in the function name list. The function names are all very compact—typically 1-2 letters long. The entire code for this ad-syndication script is also compacted, with no linefeeds and extra spacing. These function names are all adjacent in the frequency list, being used 160-185,000 times. It is no coincidence again that the external script file name "show_ads.js" was used 178,697 times.
- The following image control/rollover effect functions are very popular and all seem to be related, based on their similar naming schemes and proximities in the frequency list: changeImages (66,867), preloadImages (62,570), newImage (60,512).
- Adobe's "Active Content" seems to control Flash instances in Web pages. These 5 "Active Content" functions have names prefixed by "AC_" and occurred between 60-64,000 times in MAMA. A corresponding external script with the name "AC_RunActiveContent.js" was found 60,428 times and is no doubt related to these instances.
- Two adjacent entries appear to read and write browser cookies (getCookie and setCookie).
- In the top 75, two function names (hideMenu and Menu) can be found, but if you go below position 75 you can find many more functions obviously relating to menus.
This is just a small sample; a number of other unique prefixes are noticeable by glancing further down the frequency list. Adobe GoLive has many functions prefixed by "CS" (after finding 100 such unique function names, I stopped counting). Functions common to Lycos/Angelfire/Tripod scripts were well represented with the common prefixes "lhb_" (17 times), "LR_" (18 times) and "lycos_" (11 times).
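The proximity effect described above can be explored with a simple frequency count. The following Python sketch collects declared function names across a set of scripts and groups them by prefix. It is an assumption-laden illustration: the declaration regular expression is far cruder than MAMA's tokenizer, each name is counted once per script (the study does not say how MAMA counted), and the names `count_function_names` and `group_by_prefix` are hypothetical.

```python
import re
from collections import Counter

# Matches "function name(" declarations only; a rough approximation.
FUNCTION_DECL = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)\s*\(")


def count_function_names(scripts):
    """Count in how many scripts each declared function name appears."""
    counts = Counter()
    for script in scripts:
        # Assumption: a name counts once per script, however often declared.
        counts.update(set(FUNCTION_DECL.findall(script)))
    return counts


def group_by_prefix(counts, prefixes):
    """Collect name frequencies under the first matching prefix."""
    grouped = {prefix: {} for prefix in prefixes}
    for name, frequency in counts.items():
        for prefix in prefixes:
            if name.startswith(prefix):
                grouped[prefix][name] = frequency
                break
    return grouped
```

Sorting the result of `count_function_names(...).most_common()` gives the kind of frequency list discussed here, and passing prefixes such as "MM_", "__utm" or "AC_" to `group_by_prefix` shows whether functions from one library share similar counts.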
...And there is more evidence
Detecting libraries was a very important task for MAMA. The external script file names and function names were the passive evidence found. MAMA also identified unique strings that would track usage of a number of specific script libraries in common use (e.g., Prototype and jQuery), tracking systems (e.g., Urchin, Omniture, and Hitbox), and DHTML menu systems (e.g., Milonic). Every effort was made to guarantee that the patterns were distinctive, but the criteria used may not be totally reliable. There can, of course, always be the occasional false positive, and future versions of these script libraries may alter some of the (currently) unique criteria that MAMA seeks. The full script syntax article covers the results for all 24 of the libraries MAMA looked for in more detail.
Note: All of the search criteria are case-sensitive regular expressions.
DHTML menu/library name | Search criteria (regexp) | Frequency |
---|---|---|
Macromedia functions from Dreamweaver/Fireworks | Script: / MM_/ | 682,019 |
Google Analytics/Urchin Tracker | Script: /function\s+urchinTracker/ Filename: /^urchin\.js$/ | 384,756 |
Prototype JavaScript Framework | Script: /var\s+Prototype\s+=\s+{\s+Version:\s+/ | 31,423 |
Omniture/SiteCatalyst Analytics | Script: /SiteCatalyst/ , /Omniture/ Filename: /^s_code\.js$/ | 18,468 |
JQuery Library | Script: /jQuery./ Filename: /^jquery.*?\.js$/ | 17,027 |
Dynamic Drive HV Menu | Script: /MbrSetUp/ , /ChildVerticalOverlap/ | 15,111 |
Milonic DHTML Menu | Script: /closeMenusByArray/ , /milonic/ | 13,585 |
WebSideStory/HitBox Analytics | Script: /function\s+_hbEvent/ Filename: /^hbx\.js$/ | 10,963 |
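A few of the criteria from the table can be expressed directly as case-sensitive regular expressions. The sketch below is illustrative only: it assumes that a match on any one script or file name pattern is enough to count a library (the table does not say how MAMA combined them), the brace in the Prototype pattern is escaped for Python, and the name `detect_libraries` is hypothetical.

```python
import re

# (script patterns, external file name patterns), copied from the table.
LIBRARY_CRITERIA = {
    "Macromedia functions from Dreamweaver/Fireworks": ([r" MM_"], []),
    "Google Analytics/Urchin Tracker": (
        [r"function\s+urchinTracker"], [r"^urchin\.js$"]),
    "Prototype JavaScript Framework": (
        [r"var\s+Prototype\s+=\s+\{\s+Version:\s+"], []),
    "JQuery Library": ([r"jQuery."], [r"^jquery.*?\.js$"]),
    "WebSideStory/HitBox Analytics": (
        [r"function\s+_hbEvent"], [r"^hbx\.js$"]),
}


def detect_libraries(script_text, external_filenames):
    """List libraries whose criteria match the script text or file names."""
    found = []
    for name, (script_patterns, filename_patterns) in LIBRARY_CRITERIA.items():
        in_script = any(re.search(p, script_text) for p in script_patterns)
        in_files = any(
            re.search(p, filename)
            for p in filename_patterns
            for filename in external_filenames
        )
        # Assumption: any single match is sufficient.
        if in_script or in_files:
            found.append(name)
    return found
```

Because the patterns are case-sensitive, a file named "JQUERY.JS" would not be counted, which is one way such criteria can undercount as well as overcount.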
Conclusion
In the interests of brevity, many of the topics in this week's overview received a very condensed treatment. The reader is encouraged to dig deeper into this week's main MAMA scripting articles for more extensive coverage of the various factors examined. (See the links at the beginning of this document.) Next week, we will look at the goldmine of information that MAMA was able to extract from its tokenization of scripting code—almost 500 JavaScript and DOM-related keywords were identified in 28 categories.
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.