MAMA scripting report, part 2: JavaScript and DOM tokenization
Introduction—mining JavaScript code
Throughout much of MAMA's development history, script content was analyzed using numerous substring and regular expression checks. There was always a desire to use a more formal analysis, but an easy and simple solution never presented itself. After discovering Saarsoo's markup and CSS study and the breadth of his results, it was decided that one of the biggest deficits still outstanding in large-scale Web research was an in-depth analysis of JavaScript. This was the impetus necessary to finally tackle the formal analysis problem.
This week's overview of MAMA's scripting tokenization analysis can only be called "brief" compared to the detailed articles that it summarizes.
Analyzing the JavaScript
MAMA adapted a simple public domain JavaScript tokenization script by Christopher Diggins. His tokenizer breaks down JavaScript into a number of categories: comments, quoted strings, identifiers, numbers, whitespace, and symbols. Of these, the basic JavaScript and DOM syntax would be covered by the identifiers category. Out of storage necessity, I wanted to break up these identifier tokens into groups for easier access. The basic structure of JavaScript as a group of objects with methods and properties lends itself to this type of categorization, so it was a natural fit. The compartmentalization of the language into small groups also easily translated to a database design that made searching for specific JavaScript and DOM keywords much faster. MAMA would also store these keywords with their case-sensitivity preserved, which would further aid in disambiguating their use. All told, 28 categories were selected containing a total of 481 identifier keywords.
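To make the six token categories concrete, here is a minimal sketch of a category-based tokenizer. It is not Christopher Diggins's script and not MAMA's code; the rule table, the `tokenize` name, and the regular expressions are all assumptions made for illustration, and the sketch ignores regular expression literals and other corner cases.

```javascript
// Illustrative only: not Christopher Diggins's tokenizer and not MAMA's code.
// Each rule pairs a category name with a sticky ("y") regular expression,
// which only matches at the position stored in lastIndex.
const RULES = [
  ["comment", /\/\/[^\n]*|\/\*[\s\S]*?\*\//y],
  ["string", /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/y],
  ["identifier", /[A-Za-z_$][A-Za-z0-9_$]*/y],
  ["number", /\d+(?:\.\d+)?/y],
  ["whitespace", /\s+/y],
  ["symbol", /[^\s]/y], // fallback: any single non-space character
];

function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    // Try the rules in order; the last two together match any character,
    // so the position always advances.
    for (const [category, pattern] of RULES) {
      pattern.lastIndex = pos;
      const match = pattern.exec(source);
      if (match) {
        tokens.push({ category, text: match[0] });
        pos += match[0].length;
        break;
      }
    }
  }
  return tokens;
}

// Keep only the identifiers, with their original case preserved.
const identifiers = tokenize('var el = document.getElementById("nav"); // find')
  .filter((token) => token.category === "identifier")
  .map((token) => token.text);
// identifiers is ["var", "el", "document", "getElementById"]
```

Only the identifier tokens would then be looked up against the keyword categories; the quoted string and the comment are set aside.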
Tokenization: Problems encountered
MAMA has done well with this initial tokenizer design, but it is a first-generation effort that could be improved. For instance, a major bug was discovered after the analysis phase: a database field created to store token identifier chains never stored anything at all. This field would have allowed better correlation for keywords with ambiguous meaning. For example, the `open` method is used by multiple objects, including Window and XMLHttpRequest; the lost MAMA field would have given greater clarity about multiple uses of the same keyword. The next major MAMA crawl will definitely address this gap and will go even further in its examination of scripting.
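The article does not say how the identifier-chain field was meant to be populated, so the following is only a guess at the idea: record each dotted chain of identifiers rather than isolated keywords. The `identifierChains` helper and its regular expressions are hypothetical.

```javascript
// Hypothetical helper: collect dotted identifier chains such as "window.open".
// Assumptions: chains are plain names joined by dots, and string literals
// are blanked out first so their contents are not mistaken for code.
function identifierChains(source) {
  const withoutStrings = source.replace(/"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g, '""');
  const chainPattern = /[A-Za-z_$][\w$]*(?:\s*\.\s*[A-Za-z_$][\w$]*)*/g;
  return (withoutStrings.match(chainPattern) || []).map((chain) =>
    chain.replace(/\s+/g, "")
  );
}

const snippet = 'window.open("help.html"); xhr.open("GET", url);';
const chains = identifierChains(snippet);
// Both calls end in "open", but the chain keeps the receiver name.
const openCalls = chains.filter((chain) => chain.split(".").pop() === "open");
// openCalls is ["window.open", "xhr.open"]
```

A chain records only the receiver's name, not its type, so `xhr.open` hints at XMLHttpRequest without proving it.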
Some other basic issues were noticed during this process:
- The fewer characters in a keyword, the less likely it is to be used for a single specific purpose. This is especially true for common or ambiguous words like `all`, `class` and `id`.
- Mixed-case keywords stand a higher chance of representing unique usage, especially camel-case property and method keywords, such as `hasChildNodes` and `getElementById`.
JavaScript (ECMAScript)
The JavaScript keyword category contains 13 different areas of the JavaScript/ECMAScript language, covering basic JavaScript syntax and core JavaScript objects.
JavaScript (ECMAScript) language keywords
This is the list of keywords standardized by ECMAScript v3. Each keyword is part of the language's basic syntax. As such, many of them are expected to have VERY high usage, which is indeed the case. Looking at the entire group, these keywords are used in 2,476,007 URLs (94.6% of all pages using scripting). A closer examination of this section reveals some of the basic ways that JavaScript/ECMAScript is used in the wild:
- The keyword `function` is used the most of all keywords, appearing in 87.2% of all scripting cases. In all, 84.7% of pages that used `function` had at least one case of the `return` keyword.
- Predictably, `function` is more popular than `var`, and in turn `var` is used at higher rates than `new`.
- Conditional constructs (`if`) are favored over looping (`for`), and the alternate looping mechanism `while` is less popular than the primary `for` usage.
- The `else` keyword for conditional code flow was used 80% as often as the companion keyword `if`.
- `break` is favored over `continue`.
- Boolean values `true` and `false` are used a similar number of times and are used together 1,314,911 times (91.2% of `false` cases and 85.6% of `true` cases).
- The `try` and `catch` keywords are used a similar number of times. The fallback condition `finally` is only used in 5.5% of the cases using `catch`.
Keyword | Frequency | Keyword | Frequency | Keyword | Frequency | Keyword | Frequency
---|---|---|---|---|---|---|---
function | 2,281,902 | true | 1,529,306 | try | 753,384 | default | 254,919
if | 2,253,000 | false | 1,441,874 | catch | 752,271 | throw | 242,519
var | 2,152,170 | null | 1,412,832 | continue | 611,755 | do | 137,318
new | 1,939,996 | typeof | 1,015,441 | in | 563,323 | void | 130,542
return | 1,932,406 | while | 1,014,486 | case | 328,235 | delete | 77,570
else | 1,795,957 | break | 893,712 | switch | 325,443 | instanceof | 61,019
for | 1,751,342 | this | 810,322 | with | 287,731 | finally | 41,788
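The percentages quoted in this section are simple ratios of URL counts. As a worked check, the sketch below recomputes three of them from figures given in this article: the keyword table above, the 1,314,911 co-occurrences of `true` and `false`, and the 2,617,305 URLs in which MAMA found scripting. The `percent` helper and the variable names are illustrative.

```javascript
// Counts copied from this article; the helper name is illustrative.
const SCRIPTED_URLS = 2617305;
const counts = { function: 2281902, if: 2253000, else: 1795957, false: 1441874 };
const TRUE_AND_FALSE_TOGETHER = 1314911;

// Express part/whole as a percentage string with a fixed number of decimals.
function percent(part, whole, digits = 1) {
  return ((part / whole) * 100).toFixed(digits);
}

const functionShare = percent(counts.function, SCRIPTED_URLS);        // "87.2"
const falseWithTrue = percent(TRUE_AND_FALSE_TOGETHER, counts.false); // "91.2"
const elsePerIf = percent(counts.else, counts.if, 0);                 // "80"
```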
The Array object
This object's property and method keywords were detected in 1,835,275 URLs. The Array object is mostly concerned with manipulations of array structures, but it has one main informational property, `length`, which reports the number of items in the array. This property is used far more often than any of the Array-specific methods, but that is to be expected, because it is also used by the String object to count the number of characters in a string. The way MAMA is currently set up, it cannot distinguish between these two uses. Of the standard array operations, `push` is much more popular than `pop`, and `shift` is much more popular than `unshift`, while `shift` has only marginally higher representation than `pop`.
Property/method | Frequency | Property/method | Frequency
---|---|---|---
length | 1,825,953 | reverse | 62,908
join | 401,563 | pop | 56,512
push | 378,223 | splice | 50,193
slice | 187,082 | sort | 46,780
concat | 113,300 | unshift | 19,198
shift | 84,745 | |
JavaScript: The global object
This is not an object class, but it covers references to all the major predefined objects, including the Error object types. MAMA detected the use of these global object references in 1,817,657 URLs (almost 70% of all URLs using scripting). The Array object is referenced most, used in over 55% of all URLs using script. The Date object is explicitly used in over 42% of scripting cases. All the other global objects were each mentioned in 15% or less of scripting cases.
Property/method | Frequency | Property/method | Frequency
---|---|---|---
Array | 1,453,169 | RangeError | 33,849
Date | 1,119,350 | Boolean | 23,715
Object | 368,446 | Math | 23,560
String | 361,638 | SyntaxError | 4,857
RegExp | 315,660 | TypeError | 785
Error | 221,977 | ReferenceError | 179
Function | 152,246 | URIError | 135
Number | 103,658 | EvalError | 129
The String object
The String object is used to manipulate groups of individual characters. MAMA discovered 1,982,954 URLs using String object-specific keywords, but that includes the `length` property, which is also used as a property by the Array object. This name collision issue also happens with the `replace`/`search` keywords, which are used by both the String and Location objects. The way MAMA is currently set up, it cannot distinguish between multiple uses of a keyword. Judging by the relative popularity of other properties and methods between the two objects, it is likely that the majority of the 1,825,953 uses of the `length` property are in a String object context.
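A small sketch shows why a keyword-only view cannot separate these uses. The two snippets and the `identifiersOf` helper below are invented for illustration; the point is only that both snippets yield the same ambiguous keywords even though the receiving objects differ.

```javascript
// Illustrative only: two snippets whose identifier lists both contain
// "length" and "replace", yet the receivers are different object types.
function identifiersOf(source) {
  return source.match(/[A-Za-z_$][\w$]*/g) || [];
}

const stringUse = "title.length; title.replace(pattern, text);"; // String context
const otherUse = "items.length; location.replace(url);";         // Array and Location

// Keywords seen in both snippets: the keyword alone does not reveal the object.
const shared = identifiersOf(stringUse).filter((name) =>
  identifiersOf(otherUse).includes(name)
);
// shared is ["length", "replace"]
```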
Most of the String object methods have high usage rates; `length`, `indexOf`, and `substring` are all used in more than half of the URLs that use scripting. There are some clear trends though: `toLowerCase` is MUCH more popular than `toUpperCase`, and `substring` is MUCH more popular than `substr`.
Property/method | Frequency | Property/method | Frequency
---|---|---|---
length | 1,825,953 | lastIndexOf | 642,927
indexOf | 1,643,269 | charCodeAt | 486,355
substring | 1,523,307 | substr | 470,049
toLowerCase | 966,482 | match | 450,277
split | 912,249 | toUpperCase | 271,588
charAt | 870,869 | fromCharCode | 140,676
replace | 710,059 | localeCompare | 169
search | 658,995 | |
The DOM
MAMA found scripting in use in 2,617,305 URLs. This section is devoted to the results uncovered for 294 DOM-related keywords in 15 categories encompassing the largest objects and conceptual areas of the DOM. The DOM is a very large domain to cover, so in the end I limited the initial set to objects and keywords that I thought might be the most popular or interesting.
The Document object
Forty keywords were associated with the Document object for MAMA, with 2,353,632 of the URLs having at least one of the keyword snippets (89.9% of all URLs using script). The parent substring `document` has the highest popularity here, and it actually has the highest occurrence of ANY tokenized keyword detected by MAMA (in 89.6% of all URLs using script). This could be persuasive evidence that dynamically changing the document is the most popular use of JavaScript.
The `getElementById` and `write` keywords are understandably quite popular, being the basic historic methods for addressing and dynamically creating parts of a document; each was found in over 50% of all script cases. The W3C DOM method of addressing content, `document.getElementById`, is more popular than the MSIE-originated `document.all` by a comfortable margin. The `getElementById` method is almost twice as popular as `getElementsByTagName`, and both trounce `getElementsByName` by a wide margin. The `write` method is clearly preferred by authors over `writeln`, 4.5 to 1.
Other keywords from the Document object can tell us a lot about many aspects of usage in Web-page authoring. The `layers` keyword is actually the most common way to detect (browser sniff) Netscape Navigator 4, which explains why the use of this keyword in script is so large compared to the `LAYER` element (the script keyword is used over 34 times as much)! The `cookie` keyword can give a good measure of how often client-side cookies are used by script (22.4% of all Web pages). This is probably a much better measure than the Navigator object's `cookieEnabled` property, which was found in only 45,411 cases. The `images` keyword here is just one useful factor in determining whether scripting is dynamically controlling images; top keywords from the token remainders list also suggest Image usage (`src`, `width`, `Image`, and `height`). These could also be leveraged to discover scripts that are manipulating images. Direct use of the `FORM`, `INPUT` or `SELECT` elements in markup was detected in 1,068,842 cases, while the DOM level 0 `forms` keyword was detected 665,305 times. However, these factors occurred together only 293,048 times. What this disparity might suggest about form control via script is not really clear; perhaps, in a significant number of cases, form widgets are generated dynamically.
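For readers who did not write scripts in that era, the three addressing keywords discussed above (`getElementById`, `all`, and `layers`) were commonly combined in a single cross-browser lookup helper. The sketch below illustrates that general pattern and is not code taken from MAMA's data; the function names are hypothetical, and the document is passed in as `doc` so the logic can be exercised outside a browser.

```javascript
// Sketch of a period-style lookup helper. `doc` stands in for the browser's
// document object; the function names are hypothetical.
function findElement(doc, id) {
  if (doc.getElementById) return doc.getElementById(id); // W3C DOM
  if (doc.all) return doc.all[id];                        // MSIE-originated
  if (doc.layers) return doc.layers[id];                  // Netscape Navigator 4
  return null;
}

// Browser sniffing by feature: the presence of document.layers was
// treated as a sign of Netscape Navigator 4.
function looksLikeNavigator4(doc) {
  return Boolean(doc.layers);
}
```

A script written this way mentions all three keywords even though only one branch runs in any given browser, so keyword counts measure what authors wrote rather than what executed.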
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
---|---|---|---|---|---
document | 2,345,827 | getElementsByTagName | 797,464 | URL | 382,120
getElementById | 1,484,601 | cookie | 786,427 | writeln | 312,995
write | 1,401,743 | body | 746,071 | lastModified | 229,841
all | 1,145,064 | createElement | 731,116 | links | 173,607
referrer | 959,234 | forms | 665,305 | createTextNode | 125,308
images | 901,477 | domain | 528,066 | anchors | 122,835
layers | 898,064 | documentElement | 419,297 | defaultView | 92,977
The Element object
The keywords collected under the Element object umbrella were found in 1,336,464 URLs from MAMA. The MSIE shorthand `innerHTML`, which is used to read and dynamically write content in a document, is very popular. If we compare `innerHTML` to `document.createElement` or any of the Node object's methods for accessing and writing child nodes, it appears that it may actually be less popular these days than equivalent W3C DOM methods. Writing attributes with the `setAttribute` method appears to be a more frequent authoring task than merely reading them with `getAttribute`.
The `currentStyle` keyword (used 111,964 times) comes from IE and is only slightly more widespread than the W3C DOM version `window.getComputedStyle` (used 99,815 times). These two methods of accessing a browser's CSS interpretation share usage in a large majority of the cases (92,505 times), indicating an author preference to get the job done using any and all methods at their disposal.
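The high overlap between the two keywords is what one would expect from a helper that tries both interfaces. The following is a minimal sketch of such a helper, not a pattern extracted from MAMA's data; `readStyle` is a hypothetical name, and `win` and `element` stand in for the browser's window and an element node.

```javascript
// Hypothetical helper that reads a computed CSS value through whichever
// interface is available. cssName is hyphenated, e.g. "font-size".
function readStyle(win, element, cssName) {
  if (win.getComputedStyle) {
    // W3C DOM route
    return win.getComputedStyle(element, null).getPropertyValue(cssName);
  }
  if (element.currentStyle) {
    // IE route: currentStyle uses camel-case names such as fontSize
    const camelName = cssName.replace(/-([a-z])/g, (match, letter) =>
      letter.toUpperCase()
    );
    return element.currentStyle[camelName];
  }
  return undefined;
}
```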
The offset/scroll properties originated by IE show clear trends. `offsetTop` and `offsetLeft` are more popular than either `offsetHeight` or `offsetWidth`. Similarly, Top and Left are both more popular than Height and Width for the "scroll" properties. The Top and Height properties are always more popular than the Left and Width properties for both the offset and scroll groups. In cases where the Left and Width properties are used, the overwhelming majority (more than 90% each) are used in conjunction with the more dominant Top/Height properties.
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
---|---|---|---|---|---
id | 1,007,621 | className | 359,699 | getAttribute | 299,346
innerHTML | 695,329 | offsetHeight | 353,416 | scrollLeft | 283,749
setAttribute | 413,403 | scrollTop | 352,061 | scrollHeight | 252,315
offsetTop | 370,397 | offsetWidth | 339,529 | tagName | 245,805
offsetLeft | 361,448 | offsetParent | 330,524 | currentStyle | 111,964
The Node object
The `appendChild` keyword was especially popular in this group. Authors apparently like to dynamically add content to documents (what a surprise!). It was detected in 713,711 of MAMA's URLs, more than twice as often as the next-nearest Node object keyword. This number may seem unusually high compared to its other keyword siblings, but not if we look outside the Node object for a correlation. The related DOM method `document.createElement` is a likely companion to `appendChild`, and it was seen 731,116 times.
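The pairing is easy to see in code: a node made with `createElement` does nothing until it is attached with `appendChild`. The sketch below is a generic illustration of that idiom; `appendParagraph` is a hypothetical name, and `doc` and `parent` stand in for the document and an existing node.

```javascript
// Generic illustration of the createElement + appendChild idiom.
function appendParagraph(doc, parent, text) {
  const paragraph = doc.createElement("p");        // create a detached element
  paragraph.appendChild(doc.createTextNode(text)); // give it text content
  parent.appendChild(paragraph);                   // attach it to the tree
  return paragraph;
}
```

In a browser this could be called as `appendParagraph(document, document.body, "Hello")`.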
Some other relative comparisons can also be interesting; `appendChild` is four times as popular as `removeChild`, while `removeChild` is MUCH more popular than `replaceChild`. `firstChild` is approximately three times as popular as `lastChild`, and `nextSibling` is more than three times as popular as `previousSibling`. `nodeType` and `nodeName` are used a similar number of times and are used in combination ~2/3 of the time (93,546 cases).
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
---|---|---|---|---|---
appendChild | 713,711 | nodeName | 144,836 | ownerDocument | 60,851
parentNode | 317,411 | attributes | 127,841 | xml | 48,824
childNodes | 236,865 | nodeValue | 116,097 | replaceChild | 47,405
firstChild | 186,788 | hasChildNodes | 115,660 | cloneNode | 47,233
removeChild | 174,231 | nextSibling | 102,171 | previousSibling | 28,972
insertBefore | 152,605 | prefix | 93,197 | normalize | 10,107
nodeType | 150,297 | lastChild | 62,872 | selectSingleNode | 7,679
The Window object
This object represents a browser window or sub-frame. Of all the keywords in this group, `window` was obviously going to be the most popular. There are a number of intriguing comparisons to be made between the various keyword couplings.
Dialogs are generated in JavaScript using the `alert`, `confirm`, and `prompt` methods of the Window object. Of these, `alert` is used most: 17.8% of URLs using script utilize it in some fashion; `confirm` and `prompt` are only found in 4.1% and 1.2% of scripted pages respectively.
`setTimeout` is almost twice as popular as `clearTimeout`, but `clearTimeout` is almost NEVER used without `setTimeout` (found together in 490,124 URLs). Similarly, `setInterval` is significantly more popular than its companion `clearInterval`, but `clearInterval` use is almost always paired with `setInterval` (detected in unison 311,890 times).
Some of the keywords in this group are generic in nature and can be used across multiple objects. The keywords `focus` and `blur` were placed here, but also apply to other objects (like Input and Link). The simple keyword `open` definitely applies as the Window object method, but as a concept `open` is very generic and there may be some name collision (such as another official use as a separate method of the XMLHttpRequest object).
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
---|---|---|---|---|---
window | 1,812,773 | frames | 790,893 | alert | 467,055
navigator | 1,570,402 | self | 739,456 | status | 464,370
location | 1,475,171 | innerWidth | 668,432 | setInterval | 392,436
screen | 1,049,650 | innerHeight | 657,440 | close | 355,895
open | 1,021,945 | event | 525,373 | clearInterval | 316,922
parent | 836,445 | clearTimeout | 493,937 | history | 254,699
setTimeout | 812,357 | focus | 475,947 | pageYOffset | 254,325
XML related objects, properties, and methods
Not all of these keywords are dedicated solely to XML processing. The keyword with the highest detected frequency here was `ActiveXObject`, which is MSIE's generic system for using ActiveX controls in Web pages. How do we filter out non-XML related usages of `ActiveXObject`? Firstly, authors wanting to use XMLHttpRequest these days will typically allow for both types of objects. These two keywords are used together in 105,013 cases (93.5% of `XMLHttpRequest` cases). Another notable pairing is the incidence of the `onreadystatechange` keyword, which also tracks very close to use of `XMLHttpRequest` (94.9%). The `readyState` property is a vital part of XMLHTTP processing, so tracking its numbers can also expose MSIE-only uses of XMLHTTP. The keywords `readyState` and `onreadystatechange` were used together 104,763 times. The remainder of the `readyState` cases (in 45,329 URLs) will likely be MSIE-centric syntax.
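The co-occurrence of `XMLHttpRequest`, `ActiveXObject`, `onreadystatechange`, and `readyState` matches the usual shape of cross-browser request code from this period. The sketch below illustrates that shape and is not code recovered from the crawl; `createRequest` and `fetchText` are hypothetical names, and `env` stands in for the browser's global object so the branches can be exercised with stand-ins.

```javascript
// Create a request object through whichever constructor the environment has.
function createRequest(env) {
  if (env.XMLHttpRequest) return new env.XMLHttpRequest();
  // Older MSIE versions expose the same functionality through ActiveX.
  if (env.ActiveXObject) return new env.ActiveXObject("Microsoft.XMLHTTP");
  return null;
}

// Issue an asynchronous GET and report the result once the request completes.
function fetchText(env, url, onDone) {
  const request = createRequest(env);
  if (!request) return null;
  request.onreadystatechange = function () {
    // readyState 4 means the response has been fully received
    if (request.readyState === 4) onDone(request.status, request.responseText);
  };
  request.open("GET", url, true);
  request.send(null);
  return request;
}
```

In a browser, `env` would simply be `window`.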
Saarsoo also looked for "XMLHttpRequest" usage and only encountered it 6,125 times, or 1.9% of the pages that were determined to be using JavaScript in his study. By comparison, MAMA's usage rate is quite a bit higher. Considering only the same metric (use of `XMLHttpRequest`), it was found in 4.3% of MAMA URLs that were using script.
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
---|---|---|---|---|---
ActiveXObject | 652,356 | onreadystatechange | 106,599 | getResponseHeader | 32,187
readyState | 150,092 | responseText | 95,262 | statusText | 22,358
XMLHttpRequest | 112,277 | setRequestHeader | 73,413 | parseFromString | 15,266
send | 109,029 | responseXML | 42,272 | getAllResponseHeaders | 11,492
Conclusion
I have come to the conclusion that adding the JavaScript tokenizer was a very good move. There is still a lot more that can be extracted from scripting, but this process takes us a LONG way in the right direction. This is the first ever detailed look at script usage on a large scale in the wild, and it offers comprehensive data on the subjects of JavaScript and the DOM. Short of adding a full JavaScript execution engine to MAMA's analysis, this tokenizer will serve the MAMA system well for some time to come; there is even more interesting data still to be mined from this in the future.
Overall summary - MAMA phase 1
So, this brings the release of the current crop of MAMA analysis data to completion—I truly hope this data has been useful, and that it answers some of the burning questions you have about what is out there on the Web. However, this is by no means the end of what MAMA has to offer; there will be a short pause until after the new year while MAMA gathers more data. The next phase of MAMA's life will involve a full re-crawl of the URL set used in this study in order to examine how Web pages change over time. During that process, a number of brand-new search criteria will also be analyzed. New data resulting from this update will of course end up published in additional articles here on dev.opera.com as soon as they are ready.
There has been considerable interest in making MAMA's data available for general consumption, and we are definitely moving in that direction as resources allow it. Please let us know if you would like to be included in the preliminary betas of this project.
And of course, please let us know also if you have any ideas for further data mining you would like to see done, or think there is anything noticeably absent from the current data set.
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.