MAMA scripting report, part 2: JavaScript and DOM tokenization

By Brian Wilson

Introduction—mining JavaScript code

Throughout much of MAMA's development history, script content was analyzed using numerous substring and regular expression checks. There was always a desire to use a more formal analysis, but an easy and simple solution never presented itself. After discovering Saarsoo's markup and CSS study and the breadth of his results, it was decided that one of the biggest deficits still outstanding in large-scale Web research was an in-depth analysis of JavaScript. This was the impetus necessary to finally tackle the formal analysis problem.

This overview of MAMA's scripting tokenization analysis is brief compared with the detailed articles that it summarizes.

Analyzing the JavaScript

MAMA adapted a simple public domain JavaScript tokenization script by Christopher Diggins. His tokenizer breaks down JavaScript into a number of categories: comments, quoted strings, identifiers, numbers, whitespace, and symbols. Of these, the basic JavaScript and DOM syntax would be covered by the identifiers category. Out of storage necessity, I wanted to break up these identifier tokens into groups for easier access. The basic structure of JavaScript as a group of objects with methods and properties lends itself to this type of categorization, so it was a natural fit. The compartmentalization of the language into small groups also easily translated to a database design that made searching for specific JavaScript and DOM keywords much faster. MAMA would also store these keywords with their case-sensitivity preserved, which would further aid in disambiguating their use. All told, 28 categories were selected containing a total of 481 identifier keywords.
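
As a rough illustration of the approach described above, here is a minimal sketch of a tokenizer that sorts JavaScript source into the six categories and then files identifiers into keyword groups. It is not Christopher Diggins's script or MAMA's own code: the regular expressions, the names tokenize and countKeywords, and the tiny KEYWORD_GROUPS table are all assumptions made for illustration, and regular expression literals in the scanned source are not handled.

```javascript
// Hypothetical, simplified token patterns (not the ones MAMA used).
// Order matters: comments and strings are tried before the symbol fallback.
const TOKEN_PATTERNS = [
  ["comment", /^(?:\/\/[^\n]*|\/\*[\s\S]*?\*\/)/],
  ["string", /^(?:"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')/],
  ["whitespace", /^\s+/],
  ["number", /^(?:0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?)/],
  ["identifier", /^[A-Za-z_$][A-Za-z0-9_$]*/],
  ["symbol", /^[\s\S]/], // anything left over, one character at a time
];

// Split source text into { type, text } tokens.
function tokenize(source) {
  const tokens = [];
  let rest = source;
  while (rest.length > 0) {
    for (const [type, pattern] of TOKEN_PATTERNS) {
      const match = pattern.exec(rest);
      if (match) {
        tokens.push({ type: type, text: match[0] });
        rest = rest.slice(match[0].length);
        break;
      }
    }
  }
  return tokens;
}

// Illustrative subset of keyword groups; MAMA used 28 categories.
const KEYWORD_GROUPS = {
  language: ["function", "if", "var", "return"],
  document: ["document", "getElementById", "write"],
  window: ["window", "open", "setTimeout"],
};

// Count identifier tokens per group, keeping the comparison case-sensitive.
function countKeywords(source) {
  const counts = {};
  for (const token of tokenize(source)) {
    if (token.type !== "identifier") continue;
    for (const [group, words] of Object.entries(KEYWORD_GROUPS)) {
      if (words.includes(token.text)) {
        const key = group + "." + token.text;
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return counts;
}

console.log(countKeywords('var x = document.getElementById("var"); // var'));
```

Because only identifier tokens are counted, the word var inside the quoted string and the comment is ignored, which is the main advantage a tokenizer has over plain substring checks.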

Tokenization: Problems encountered

MAMA has done well with this initial tokenizer design, but it is a first-generation effort that could be improved. For instance, a major bug was discovered after the analysis phase: a database field created to store token identifier chains never stored anything at all. This field would have allowed better correlation for keywords with ambiguous meaning. For example, the open method is used by multiple objects, including Window and XMLHttpRequest; the lost field would have made it possible to tell such uses of the same keyword apart. The next major MAMA crawl will address this gap and will go even further in its examination of scripting.
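
To make the idea of identifier chains concrete, the following sketch collects dotted chains such as window.open and reports which identifier precedes a given keyword. It is only an assumption about what such a field might have held; the helper names identifierChains and ownersOf are hypothetical, and the regular expression does not skip strings or comments.

```javascript
// Hypothetical helper: collect dotted identifier chains such as "window.open".
function identifierChains(source) {
  const chainPattern = /[A-Za-z_$][\w$]*(?:\s*\.\s*[A-Za-z_$][\w$]*)+/g;
  return (source.match(chainPattern) || []).map(function (chain) {
    return chain.replace(/\s+/g, ""); // drop whitespace around the dots
  });
}

// List the identifier that immediately precedes each use of `keyword`.
function ownersOf(keyword, source) {
  const owners = [];
  for (const chain of identifierChains(source)) {
    const parts = chain.split(".");
    parts.forEach(function (part, index) {
      if (part === keyword && index > 0) {
        owners.push(parts[index - 1]);
      }
    });
  }
  return owners;
}

const sample =
  'window.open("about:blank"); var xhr = new XMLHttpRequest(); xhr.open("GET", "/data");';
console.log(ownersOf("open", sample)); // [ 'window', 'xhr' ]
```

The owner is only a hint: xhr is a variable name chosen by the author, so a chain narrows down the likely object without proving its type.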

Some other basic issues were noticed during this process:

  • The fewer characters in a keyword, the less likely it is to be used for a single specific purpose. This is especially true for common or ambiguous words like all, class and id.
  • Mixed-case keywords stand a higher chance of representing unique usage, especially camel-case property and method keywords, such as hasChildNodes and getElementById.

JavaScript (ECMAScript)

The JavaScript keyword category contains 13 different areas of the JavaScript/ECMAScript language, covering basic JavaScript syntax and core JavaScript objects.

JavaScript (ECMAScript) language keywords

This is the list of keywords standardized by ECMAScript v3. Each keyword is part of the language's basic syntax. As such, many of them are expected to have VERY high usage, which is indeed the case. Looking at the entire group, these keywords are used in 2,476,007 URLs (94.6% of all pages using scripting). A closer examination of this section reveals some of the basic ways that JavaScript/ECMAScript is used in the wild:

  • The keyword function is used the most of all keywords, with 87.2% of all scripting cases. In all, 84.7% of pages that used function had at least one case of the return keyword.
  • Predictably, function is more popular than var, and in turn var is used at higher rates than new.
  • Conditional constructs (if) are favored over looping (for), and the alternate looping mechanism while is less popular than the primary for usage.
  • The else keyword for conditional code flow was used 80% as often as the companion keyword if.
  • break is favored over continue.
  • Boolean values true and false are used a similar number of times and are used together 1,314,911 times (91.2% of false cases and 85.6% of true cases).
  • The try and catch keywords are used a similar number of times. The fallback keyword finally is used in only 5.5% of the cases that use catch.

JavaScript (ECMAScript) keywords
Keyword   Frequency  Keyword  Frequency  Keyword   Frequency  Keyword     Frequency
function  2,281,902  true     1,529,306  try       753,384    default     254,919
if        2,253,000  false    1,441,874  catch     752,271    throw       242,519
var       2,152,170  null     1,412,832  continue  611,755    do          137,318
new       1,939,996  typeof   1,015,441  in        563,323    void        130,542
return    1,932,406  while    1,014,486  case      328,235    delete      77,570
else      1,795,957  break    893,712    switch    325,443    instanceof  61,019
for       1,751,342  this     810,322    with      287,731    finally     41,788
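
The percentages quoted in this section can be reproduced from the raw counts. The short sketch below shows the arithmetic for three of them; the percent helper is a hypothetical name, and the total of 2,617,305 scripted URLs is the figure given in the DOM section of this article.

```javascript
// Counts taken from this article.
const SCRIPTED_URLS = 2617305; // URLs in which MAMA found scripting
const frequency = { "function": 2281902, "true": 1529306, "false": 1441874 };
const TRUE_AND_FALSE_TOGETHER = 1314911;
const ANY_LANGUAGE_KEYWORD = 2476007;

// Percentage of `whole` represented by `part`, rounded to one decimal place.
function percent(part, whole) {
  return Math.round((part / whole) * 1000) / 10;
}

console.log(percent(ANY_LANGUAGE_KEYWORD, SCRIPTED_URLS));        // 94.6
console.log(percent(frequency["function"], SCRIPTED_URLS));       // 87.2
console.log(percent(TRUE_AND_FALSE_TOGETHER, frequency["false"])); // 91.2
```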

The Array object

This object's property and method keywords were detected in 1,835,275 URLs. The Array object is mostly concerned with manipulating array structures, but it has one main informational property, length, which reports the number of items in the array. This property is detected far more often than any of the Array-specific methods, which is expected, because the String object also uses length to report the number of characters in a string. As currently set up, MAMA cannot distinguish between these two uses. Of the standard array operations, push is much more popular than pop, and shift is much more popular than unshift; shift (84,745) is also used more often than pop (56,512).
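
The length ambiguity is easy to see in a small example. In the sketch below, countIdentifier is a hypothetical stand-in for a token-level count: it sees the text of each identifier but has no knowledge of the type of the object the property belongs to.

```javascript
const items = ["a", "b", "c"]; // an Array
const title = "MAMA";          // a String

console.log(items.length); // 3 (number of items)
console.log(title.length); // 4 (number of characters)

// A token-level count only sees the identifier text, not the object type.
function countIdentifier(source, name) {
  const identifiers = source.match(/[A-Za-z_$][\w$]*/g) || [];
  return identifiers.filter(function (word) { return word === name; }).length;
}

const snippet = "var n = items.length + title.length;";
console.log(countIdentifier(snippet, "length")); // 2, with no way to tell them apart
```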

Properties and methods of the Array object
Property/method  Frequency  Property/method  Frequency
length           1,825,953  reverse          62,908
join             401,563    pop              56,512
push             378,223    splice           50,193
slice            187,082    sort             46,780
concat           113,300    unshift          19,198
shift            84,745

JavaScript: The global object

This is not an object class, but it covers references to all the major predefined objects, including the Error object types. MAMA detected the use of these global object references in 1,817,657 URLs (almost 70% of all URLs using scripting). The Array object is referenced most, appearing in over 55% of all URLs using script. The Date object is explicitly used in over 42% of scripting cases. Each of the other global objects appears in 15% or fewer of scripting cases.

Properties and methods of the Global object
Property/method  Frequency  Property/method  Frequency
Array            1,453,169  RangeError       33,849
Date             1,119,350  Boolean          23,715
Object           368,446    Math             23,560
String           361,638    SyntaxError      4,857
RegExp           315,660    TypeError        785
Error            221,977    ReferenceError   179
Function         152,246    URIError         135
Number           103,658    EvalError        129

The String object

The String object is used to manipulate sequences of characters. MAMA discovered 1,982,954 URLs using String object-specific keywords, but that count includes the length property, which is also a property of the Array object. The same name collision occurs with the replace and search keywords, which are used by both the String and Location objects. As currently set up, MAMA cannot distinguish between multiple uses of a keyword. Judging by the relative popularity of other properties and methods between the two objects, it is likely that the majority of the 1,825,953 uses of the length property are in a String object context.

Most of the String object methods have high usage rates; length, indexOf, and substring are all used in more than half of the URLs that use scripting. There are some clear trends though—toLowerCase is MUCH more popular than toUpperCase, and substring is MUCH more popular than substr.
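
For readers less familiar with these methods, the sketch below shows how substring and substr differ in their second argument, and one common way toLowerCase and indexOf are combined. The containsIgnoreCase helper is a hypothetical example, not something taken from the crawled pages.

```javascript
const text = "JavaScript";

// substring(start, end): from index 1 up to, but not including, index 3.
console.log(text.substring(1, 3)); // "av"

// substr(start, length): 3 characters starting at index 1.
console.log(text.substr(1, 3)); // "ava"

// Hypothetical helper: case-insensitive search built from toLowerCase and indexOf.
function containsIgnoreCase(haystack, needle) {
  return haystack.toLowerCase().indexOf(needle.toLowerCase()) !== -1;
}

console.log(containsIgnoreCase(text, "SCRIPT")); // true
```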

Properties and methods of the String object
Property/method  Frequency  Property/method  Frequency
length           1,825,953  lastIndexOf      642,927
indexOf          1,643,269  charCodeAt       486,355
substring        1,523,307  substr           470,049
toLowerCase      966,482    match            450,277
split            912,249    toUpperCase      271,588
charAt           870,869    fromCharCode     140,676
replace          710,059    localeCompare    169
search           658,995

The DOM

MAMA found scripting in 2,617,305 URLs. This section is devoted to the results for 294 DOM-related keywords in 15 categories encompassing the largest objects and conceptual areas of the DOM. The DOM is a very large domain to cover, so I limited the initial set to the objects and keywords that I thought might be the most popular or interesting.

The Document object

Forty keywords were associated with the Document object for MAMA, with 2,353,632 of the URLs having at least one of the keyword snippets (89.9% of all URLs using script). The parent substring document has the highest popularity here, and it actually has the highest occurrence of ANY tokenized keyword detected by MAMA (in 89.6% of all script). This could be persuasive evidence in demonstrating that dynamically changing the document is the most popular use of JavaScript.

The getElementById and write keywords are understandably quite popular, being the basic historic methods for addressing and dynamically creating parts of a document; each was found in over 50% of all script cases. The W3C DOM method of addressing content, document.getElementById, is more popular than the MSIE-originated document.all by a comfortable margin. The getElementById method is almost twice as popular as getElementsByTagName, and both trounce getElementsByName by a wide margin. The write method is clearly preferred by authors over writeln, by 4.5 to 1.
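
Scripts that mention both getElementById and all often do so to support more than one browser. The sketch below shows one way such a lookup can be written; findElement is a hypothetical helper, and the document object is passed in as doc so that the example can also be run outside a browser with stand-in objects.

```javascript
// Hypothetical cross-browser lookup in the style of scripts from this period.
function findElement(doc, id) {
  if (doc.getElementById) {
    return doc.getElementById(id); // W3C DOM
  }
  if (doc.all) {
    return doc.all[id] || null; // MSIE-originated collection
  }
  return null; // neither addressing method is available
}

// Stand-in objects (assumptions for illustration only).
const w3cDocument = {
  getElementById: function (id) { return { id: id, foundVia: "getElementById" }; },
};
const msieDocument = { all: { menu: { id: "menu", foundVia: "all" } } };

console.log(findElement(w3cDocument, "menu").foundVia);  // "getElementById"
console.log(findElement(msieDocument, "menu").foundVia); // "all"
```

In a browser, the call would simply be findElement(document, "menu").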

Other keywords from the Document object can tell us a lot about many aspects of usage in Web-page authoring. Checking for the layers keyword is actually the most common way to detect (browser sniff) Netscape Navigator 4, which explains why the use of this keyword in script is so large compared to the LAYER element (the script keyword is used over 34 times as much)! The cookie keyword can give a good measure of how often client-side cookies are used by script (22.4% of all Web pages). This is probably a much better measure than the Navigator object's cookieEnabled property, which reflects only 45,411 cases. The images keyword here is just one useful factor in determining whether scripting is dynamically controlling images; top keywords from the token remainders list also suggest Image usage (src, width, Image, and height). These could also be leveraged to discover scripts that are manipulating images. Direct use of the FORM, INPUT or SELECT elements in markup was detected in 1,068,842 cases, while the DOM level 0 forms keyword was detected 665,305 times. However, these factors occurred together only 293,048 times. What this disparity might suggest about form control via script is not really clear; perhaps, in a significant number of cases, form widgets are generated dynamically.

DOM Document object properties and methods
Property/method  Frequency  Property/method       Frequency  Property/method  Frequency
document         2,345,827  getElementsByTagName  797,464    URL              382,120
getElementById   1,484,601  cookie                786,427    writeln          312,995
write            1,401,743  body                  746,071    lastModified     229,841
all              1,145,064  createElement         731,116    links            173,607
referrer         959,234    forms                 665,305    createTextNode   125,308
images           901,477    domain                528,066    anchors          122,835
layers           898,064    documentElement       419,297    defaultView      92,977

The Element object

The keywords collected under the Element object umbrella were found in 1,336,464 URLs from MAMA. The MSIE shorthand innerHTML, which is used to read and dynamically write content in a document, is very popular. If we compare innerHTML to document.createElement or any of the Node object's methods for accessing and writing child nodes, it appears that it may actually be less popular these days than equivalent W3C DOM methods. Writing attributes with the setAttribute method appears to be a more frequent authoring task than merely reading them with getAttribute.

The currentStyle keyword (used 111,964 times) comes from IE and is only slightly more widespread than the W3C DOM version window.getComputedStyle (used 99,815 times). These two methods of accessing a browser's CSS interpretation share usage in a large majority of the cases (92,505 times), indicating an author preference to get the job done using any and all methods at their disposal.
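
The high overlap between the two keywords is consistent with scripts that try one mechanism and fall back to the other. A minimal sketch of that pattern follows; getStyleValue is a hypothetical helper, and the window object is passed in as win so that the example can be run with stand-in objects.

```javascript
// Hypothetical helper: read a computed style value by whichever mechanism exists.
function getStyleValue(win, element, property) {
  if (win.getComputedStyle) {
    return win.getComputedStyle(element, null)[property]; // W3C DOM
  }
  if (element.currentStyle) {
    return element.currentStyle[property]; // IE
  }
  return undefined;
}

// Stand-in objects (assumptions for illustration only).
const w3cWindow = {
  getComputedStyle: function (element) { return { display: "block" }; },
};
const ieElement = { currentStyle: { display: "inline" } };

console.log(getStyleValue(w3cWindow, {}, "display")); // "block"
console.log(getStyleValue({}, ieElement, "display")); // "inline"
```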

The offset and scroll properties originated by IE show clear trends. offsetTop and offsetLeft are each more popular than either offsetHeight or offsetWidth. Similarly, for the scroll properties, Top and Left are both more popular than Height and Width. Within each pair, the Top and Height properties are always more popular than the Left and Width properties, for both the offset and scroll groups. In cases where the Left and Width properties are used, the overwhelming majority (more than 90% each) are used in conjunction with the more dominant Top and Height properties.
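
The article does not say how these properties are used in the crawled pages. One idiom from this period that uses offsetTop, offsetLeft, and offsetParent together is summing the offsets up the offsetParent chain to approximate an element's position on the page. The sketch below shows it with plain stand-in objects; pagePosition is a hypothetical name.

```javascript
// Sum offsetLeft/offsetTop up the offsetParent chain.
function pagePosition(element) {
  let left = 0;
  let top = 0;
  for (let node = element; node; node = node.offsetParent) {
    left += node.offsetLeft;
    top += node.offsetTop;
  }
  return { left: left, top: top };
}

// Stand-in elements (assumptions for illustration only).
const pageBody = { offsetLeft: 0, offsetTop: 0, offsetParent: null };
const panel = { offsetLeft: 10, offsetTop: 20, offsetParent: pageBody };
const button = { offsetLeft: 5, offsetTop: 7, offsetParent: panel };

console.log(pagePosition(button)); // { left: 15, top: 27 }
```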

DOM Element object properties and methods
Property/method  Frequency  Property/method  Frequency  Property/method  Frequency
id               1,007,621  className        359,699    getAttribute     299,346
innerHTML        695,329    offsetHeight     353,416    scrollLeft       283,749
setAttribute     413,403    scrollTop        352,061    scrollHeight     252,315
offsetTop        370,397    offsetWidth      339,529    tagName          245,805
offsetLeft       361,448    offsetParent     330,524    currentStyle     111,964

The Node object

The appendChild keyword was especially popular in this group. Authors apparently like to dynamically add content to documents—what a surprise! It was detected in 713,711 of MAMA's URLs—more than twice as often as the next-nearest Node object keyword. This number may seem unusually high compared to its other keyword siblings, but not if we look outside the Node object for a correlation. The related DOM method document.createElement is a likely companion to appendChild, and it was seen 731,116 times.
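
The pairing of createElement and appendChild typically looks like the sketch below. The appendListItem helper and the stand-in document are assumptions for illustration; in a browser, the real document object and a real list element would be passed in instead.

```javascript
// Hypothetical helper: create a list item with a text label and attach it.
function appendListItem(doc, list, label) {
  const item = doc.createElement("li");
  item.appendChild(doc.createTextNode(label));
  list.appendChild(item);
  return item;
}

// Minimal stand-in for a DOM node, just enough to run the example.
function makeStubNode(name) {
  return {
    name: name,
    children: [],
    appendChild: function (child) {
      this.children.push(child);
      return child;
    },
  };
}
const stubDocument = {
  createElement: function (tagName) { return makeStubNode(tagName); },
  createTextNode: function (text) { return { text: text }; },
};

const menu = makeStubNode("ul");
appendListItem(stubDocument, menu, "Home");
console.log(menu.children.length); // 1
```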

Some other relative comparisons can also be interesting; appendChild is four times as popular as removeChild, while removeChild is MUCH more popular than replaceChild. firstChild is approximately three times as popular as lastChild and nextSibling is more than three times as popular as previousSibling. nodeType and nodeName are used a similar number of times and are used in combination ~2/3 of the time (93,546 cases).

DOM Node object properties and methods
Property/method  Frequency  Property/method  Frequency  Property/method   Frequency
appendChild      713,711    nodeName         144,836    ownerDocument     60,851
parentNode       317,411    attributes       127,841    xml               48,824
childNodes       236,865    nodeValue        116,097    replaceChild      47,405
firstChild       186,788    hasChildNodes    115,660    cloneNode         47,233
removeChild      174,231    nextSibling      102,171    previousSibling   28,972
insertBefore     152,605    prefix           93,197     normalize         10,107
nodeType         150,297    lastChild        62,872     selectSingleNode  7,679

The Window object

This object represents a browser window or sub-frame. Of all the keywords in this group, window was obviously going to be the most popular. There are a number of intriguing comparisons to be made between the various keyword couplings.

Dialogs are generated in JavaScript using the alert, confirm, and prompt methods of the Window object. Of these, alert is used most—17.8% of URLs using script utilize it in some fashion; confirm and prompt are only found in 4.1% and 1.2% of scripted pages respectively.

setTimeout is about 1.6 times as popular as clearTimeout (812,357 versus 493,937 URLs), but clearTimeout is almost never used without setTimeout (found together in 490,124 URLs). Similarly, setInterval is significantly more popular than its companion clearInterval, but clearInterval use is almost always paired with setInterval (detected in unison 311,890 times).
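
The asymmetry makes sense when you consider that a timer can be set and simply left to fire, whereas clearing only has meaning for a timer that was set earlier. The sketch below shows a paired use; makeDelayedAction and hideMenu are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: wraps one pending timer so it can be cancelled or restarted.
function makeDelayedAction(action, delayMs) {
  let timerId = null;

  function cancel() {
    if (timerId !== null) {
      clearTimeout(timerId);
      timerId = null;
    }
  }

  function start() {
    cancel(); // never leave an earlier timer running
    timerId = setTimeout(function () {
      timerId = null;
      action();
    }, delayMs);
  }

  function isPending() {
    return timerId !== null;
  }

  return { start: start, cancel: cancel, isPending: isPending };
}

const hideMenu = makeDelayedAction(function () {
  console.log("menu hidden");
}, 500);

hideMenu.start();  // schedule the action
hideMenu.cancel(); // for example, the pointer moved back over the menu
```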

Some of the keywords in this group are generic in nature and can be used across multiple objects. The keywords focus and blur were placed here, but also apply to other objects (like Input and Link). The simple keyword open definitely applies as the Window object method, but as a concept open is very generic and there may be some name collision (such as another official use as a separate method of the XMLHttpRequest object).

DOM Window object properties and methods
Property/method  Frequency  Property/method  Frequency  Property/method  Frequency
window           1,812,773  frames           790,893    alert            467,055
navigator        1,570,402  self             739,456    status           464,370
location         1,475,171  innerWidth       668,432    setInterval      392,436
screen           1,049,650  innerHeight      657,440    close            355,895
open             1,021,945  event            525,373    clearInterval    316,922
parent           836,445    clearTimeout     493,937    history          254,699
setTimeout       812,357    focus            475,947    pageYOffset      254,325

XML related objects, properties, and methods

Not all of these keywords are dedicated solely to XML processing. The keyword with the highest detected frequency here was ActiveXObject, which is MSIE's generic system for using ActiveX controls in Web pages. How do we filter out non-XML-related usages of ActiveXObject? Firstly, authors wanting to use XMLHttpRequest these days will typically allow for both types of objects. These two keywords are used together in 105,013 cases (93.5% of XMLHttpRequest cases). Another notable pairing is the onreadystatechange keyword, whose incidence also tracks use of XMLHttpRequest very closely (94.9%). The readyState property is a vital part of XMLHTTP processing, so tracking its numbers can also expose MSIE-only uses of XMLHTTP. The keywords readyState and onreadystatechange were used together 104,763 times. The remaining readyState cases (45,329 URLs) are likely to be MSIE-centric syntax.
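
A script that "allows for both types of objects" usually looks something like the sketch below, which also shows where onreadystatechange, readyState, open, send, and responseText come into play. The createRequest and fetchText helpers are hypothetical, and the browser's global object is passed in as env so that the example can be exercised with stand-ins.

```javascript
// Hypothetical helper: create a request object by whichever route is available.
function createRequest(env) {
  if (env.XMLHttpRequest) {
    return new env.XMLHttpRequest(); // native object
  }
  if (env.ActiveXObject) {
    return new env.ActiveXObject("Microsoft.XMLHTTP"); // older MSIE
  }
  return null;
}

// Hypothetical helper: fetch a URL and hand the status and text to a callback.
function fetchText(env, url, onDone) {
  const request = createRequest(env);
  if (!request) {
    return false;
  }
  request.onreadystatechange = function () {
    if (request.readyState === 4) { // 4 means the request is complete
      onDone(request.status, request.responseText);
    }
  };
  request.open("GET", url, true);
  request.send(null);
  return true;
}
```

In a browser, env would be window, as in fetchText(window, "/data.txt", callback).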

Saarsoo also looked for "XMLHttpRequest" usage and only encountered it 6,125 times—1.9% of the pages that were determined to be using JavaScript in his study. By comparison, MAMA's usage rate is quite a bit higher. Considering only the same metric (use of XMLHttpRequest), it was found in 4.3% of MAMA URLs that were using script.

DOM XML-related object properties and methods
Property/method  Frequency  Property/method     Frequency  Property/method        Frequency
ActiveXObject    652,356    onreadystatechange  106,599    getResponseHeader      32,187
readyState       150,092    responseText        95,262     statusText             22,358
XMLHttpRequest   112,277    setRequestHeader    73,413     parseFromString        15,266
send             109,029    responseXML         42,272     getAllResponseHeaders  11,492

Conclusion

I have come to the conclusion that adding the JavaScript tokenizer was a very good move. There is still a lot more that can be extracted from scripting, but this process brings it a LONG way in the right direction. This is the first ever detailed look at script usage on a large scale in the wild, and it offers comprehensive data on the subjects of JavaScript and the DOM. Short of adding a full JavaScript execution engine to MAMA's analysis, this tokenizer will serve the MAMA system well for some time to come. There is even more interesting data still to be mined from it in the future.

Overall summary - MAMA phase 1

So, this brings the release of the current crop of MAMA analysis data to completion—I truly hope this data has been useful, and that it answers some of the burning questions you have about what is out there on the Web. However, this is by no means the end of what MAMA has to offer; there will be a short pause until after the new year while MAMA gathers more data. The next phase of MAMA's life will involve a full re-crawl of the URL set used in this study in order to examine how Web pages change over time. During that process, a number of brand-new search criteria will also be analyzed. New data resulting from this update will of course end up published in additional articles here on dev.opera.com as soon as they are ready.

There has been considerable interest in making MAMA's data available for general consumption, and we are definitely moving in that direction as resources allow it. Please let us know if you would like to be included in the preliminary betas of this project.

And of course, please let us know also if you have any ideas for further data mining you would like to see done, or think there is anything noticeably absent from the current data set.

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
