MAMA script tokenization: ECMAScript/JavaScript syntax

By Brian Wilson

Index:

  1. Introduction
  2. JavaScript/ECMAScript keywords
  3. ECMAScript-reserved words
  4. Array object
  5. Date object
  6. Function object
  7. Global constants and methods
  8. Global objects
  9. Global object prototypes—messing with a good thing
  10. Math object
  11. Number object
  12. Object object
  13. RegExp object
  14. String object

Introduction

Scripting use was detected in 2,617,305 of MAMA's URLs. This entire section is devoted to details of 13 different areas of the JavaScript/ECMAScript language, covering basic JavaScript syntax and core JavaScript objects. We will leave analysis of the DOM for the dedicated DOM tokenization article.

Fig 1-1: Overall use of JavaScript/ECMAScript factors
JavaScript factor | Frequency | JavaScript factor | Frequency
ECMAScript keywords | 2,476,007 | Function object | 925,025
String object | 1,982,954 | Object object | 487,445
Array object | 1,835,275 | RegExp object | 313,752
Global objects | 1,817,657 | Global prototypes | 170,844
Global constants/methods | 1,760,274 | Reserved keywords | 94,035
Date object | 1,085,966 | Number object | 11,641

JavaScript/ECMAScript keywords

This is the list of keywords standardized by ECMAScript v3. Each keyword is part of the language's basic syntax. As such, many of them are expected to have VERY high usage, which is indeed the case. Looking at the entire group, these keywords are used in 2,476,007 URLs (94.60% of all pages using scripting). A closer examination of this section reveals some of the basic ways that JavaScript/ECMAScript is used in the wild:

  • function is used the most of all keywords with 87.19% of all scripting cases. In 84.68% of the pages that used function there was at least one return keyword also detected.
  • Predictably, function is more popular than var, and in turn var is used at higher rates than new.
  • Conditional constructs (if) are favored over looping (for), and the alternate looping mechanism while is less popular than the primary for usage.
  • The else keyword for conditional code flow was used 80% as often as the companion keyword if.
  • break is favored over continue.
  • Boolean values true and false are used a similar number of times and are used together 1,314,911 times (91.19% of false cases and 85.98% of true cases).
  • The try and catch keywords are used a similar number of times. The fallback condition finally is only used in 5.5% of the cases using catch.
Fig 2-1: ECMAScript/JavaScript keywords
Keyword | Frequency | Keyword | Frequency | Keyword | Frequency | Keyword | Frequency
function | 2,281,902 | true | 1,529,306 | try | 753,384 | default | 254,919
if | 2,253,000 | false | 1,441,874 | catch | 752,271 | throw | 242,519
var | 2,152,170 | null | 1,412,832 | continue | 611,755 | do | 137,318
new | 1,939,996 | typeof | 1,015,441 | in | 563,323 | void | 130,542
return | 1,932,406 | while | 1,014,486 | case | 328,235 | delete | 77,570
else | 1,795,957 | break | 893,712 | switch | 325,443 | instanceof | 61,019
for | 1,751,342 | this | 810,322 | with | 287,731 | finally | 41,788
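
The counts in Fig 2-1 record whether a keyword appears at least once in the scripts of a URL. As a rough illustration only, presence detection for a handful of keywords might look like the sketch below. This is not MAMA's tokenizer: the function name detectKeywords, the shortened keyword list and the naive identifier pattern are all assumptions made for this example.

```javascript
// Hypothetical sketch: report which keywords occur at least once in a script.
// A real tokenizer would also need to skip string literals and comments.
const KEYWORDS = new Set([
  "function", "if", "var", "new", "return", "else", "for", "while",
  "break", "continue", "try", "catch", "finally", "true", "false"
]);

function detectKeywords(source) {
  const found = new Set();
  // Split the source into identifier-like tokens.
  const tokens = source.match(/[A-Za-z_$][A-Za-z0-9_$]*/g) || [];
  for (const token of tokens) {
    if (KEYWORDS.has(token)) {
      found.add(token);
    }
  }
  return found;
}

const sample = "function f(x) { if (x) { return true; } else { return false; } }";
const used = detectKeywords(sample);
console.log([...used].sort().join(", ")); // else, false, function, if, return, true
```

Because the result is a set, a keyword that appears many times in one script still counts once, which matches the per-URL nature of the figures above.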

ECMAScript-reserved words

These keywords are not currently used by ECMAScript/JavaScript, but ECMA has reserved their use for possible inclusion in future ECMAScript versions. The reserved words do not necessarily tell us too much about current script syntax but can point out the scope of possible problems if new scripting syntax is introduced. All told, 94,035 URLs used at least 1 of these reserved words, with the class keyword detected almost 5 times as often as its nearest reserved word value. If MAMA's tokenization process can be trusted, as many as 3.59% of URLs that use scripting could be in for some kind of surprise if any of these reserved words become official parts of future ECMAScript syntax.

Fig 3-1: ECMAScript-reserved words
Reserved word | Frequency | Reserved word | Frequency | Reserved word | Frequency | Reserved word | Frequency
class | 69,649 | public | 1,665 | goto | 420 | abstract | 81
boolean | 14,510 | private | 1,441 | interface | 347 | implements | 33
float | 12,915 | double | 971 | super | 340 | const | 31
static | 7,485 | protected | 913 | package | 305 | byte | 26
import | 3,813 | final | 841 | extends | 152 | synchronized | 11
int | 3,058 | short | 630 | export | 130 | volatile | 11
native | 2,613 | enum | 585 | debugger | 127 | transient | 5
long | 2,490 | char | 494 | throws | 107 | |
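
To see the kind of surprise described above, one can ask the running script engine whether a given word is still accepted as a variable name. The helper below is a hypothetical sketch (isUsableAsIdentifier is not part of MAMA or of any library), and its answers depend on the ECMAScript edition the engine implements.

```javascript
// Hypothetical helper: can `word` be declared as a variable name in the
// engine running this code? The Function constructor parses the body when
// the function is created and throws SyntaxError for reserved words.
function isUsableAsIdentifier(word) {
  try {
    new Function("var " + word + ";");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err; // something other than a parse problem
  }
}

console.log(isUsableAsIdentifier("counter"));
console.log(isUsableAsIdentifier("class"));
```

In an engine where class has become part of the syntax, the second call reports false, so a script that used class as a variable name would no longer parse there.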

Array object

This object's property and method keywords were detected in 1,835,275 URLs. The Array object is mostly concerned with manipulations of array structures, but it has one main informational property, length, which details the number of items in the array. This property is used far more often than any of the other Array-specific methods, but that is an expected state of affairs, because it is also used by the String object to count the number of characters in a string. The way tokenization in MAMA is currently set up, it cannot distinguish between these two uses. Of the standard array operations, push is much more popular than pop, shift is much more popular than unshift, and shift is used about 1.5 times as often as pop.

Fig 4-1: Properties and methods of the Array object
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
length | 1,825,953 | concat | 113,300 | splice | 50,193
join | 401,563 | shift | 84,745 | sort | 46,780
push | 378,223 | reverse | 62,908 | unshift | 19,198
slice | 187,082 | pop | 56,512 | |
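
The length ambiguity mentioned above is easy to reproduce. In the short sketch below (the function name size and the sample values are made up for illustration), the same property access works on an array and on a string, followed by the four basic add/remove operations from Fig 4-1.

```javascript
// The same property name works on both arrays and strings, so a token
// scanner that only sees ".length" cannot tell which object type is in use.
function size(value) {
  return value.length;
}

console.log(size(["a", "b", "c"])); // 3 (array items)
console.log(size("abc"));           // 3 (characters)

// The four basic add/remove operations.
const queue = [2, 3];
queue.push(4);                 // add at the end: [2, 3, 4]
queue.unshift(1);              // add at the front: [1, 2, 3, 4]
const last = queue.pop();      // remove from the end: returns 4
const first = queue.shift();   // remove from the front: returns 1
console.log(queue, first, last);
```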

Date object

MAMA found 1,085,966 URLs using at least one of these Date object keywords. This object has dozens of methods that allow full control of all aspects of dates and times. Most of the methods have "plain" versions and UTC (Coordinated Universal Time) versions, but the plain types were always found to be more popular than the corresponding UTC incarnations. Additionally, most of the timeframe methods have get and set variations. As an example, two main methods to access the month portion of a date are getMonth and setMonth. The MAMA URL set revealed that reading an existing date component ("get") is always more popular than the date's corresponding write method ("set").

Fig 5-1: Properties and methods of the Date object
[Please also see the full frequency table.]
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
getTime | 880,057 | getMonth | 238,845 | getDay | 161,373 | getSeconds | 119,185
toGMTString | 589,687 | getDate | 231,988 | getYear | 156,232 | setMonth | 29,439
setTime | 534,490 | parse | 223,046 | getMinutes | 131,256 | setDate | 28,329
getTimezoneOffset | 327,185 | getHours | 167,625 | getFullYear | 125,465 | setFullYear | 22,882
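
For readers less familiar with the get/set and plain/UTC pairs discussed above, here is a minimal sketch. The date used is arbitrary, and the UTC variants are used for the printed values so that the result does not depend on the time zone of the machine running it.

```javascript
// A fixed instant: 15 January 2008, 12:00 UTC.
const when = new Date(Date.UTC(2008, 0, 15, 12, 0, 0));

// "get" methods read one component. The plain version uses the local
// time zone of the machine; the UTC version does not.
const localMonth = when.getMonth();    // 0-based: 0 means January
const utcMonth = when.getUTCMonth();   // 0 regardless of time zone
const utcHours = when.getUTCHours();   // 12

// "set" methods write one component and change the Date in place.
when.setUTCMonth(5);                   // move to June, same day and time
console.log(utcMonth, utcHours, when.getUTCMonth(), when.getUTCDate());

// getTime returns milliseconds since 1 January 1970 UTC.
console.log(new Date(0).getTime());    // 0
```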

Function object

The properties and methods of the Function object are applied to JavaScript functions (really, they must stop using such hard-to-remember syntax!). In all, MAMA detected these specific property/method keywords in 925,025 URLs. The arguments keyword appears in 97.40% of those URLs, which is just over 1/3 of all URLs that use scripting. All other property/method keywords here have much lower usage rates.

Fig 6-1: Properties and methods of the Function object
Property/method | Frequency | Property/method | Frequency
arguments | 900,932 | callee | 58,194
apply | 113,173 | caller | 22,094
call | 77,192 | prototype | 9,260
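
As a reminder of what the three most common keywords in Fig 6-1 do, here is a small sketch. The function names sum and greet and the sample object are invented for the example.

```javascript
// `arguments` holds every value passed to a (non-arrow) function,
// whether or not the function declares named parameters.
function sum() {
  let total = 0;
  for (let i = 0; i < arguments.length; i++) {
    total += arguments[i];
  }
  return total;
}

// apply passes the arguments as an array; call passes them one by one.
// The first parameter of both sets `this` inside the called function.
const viaApply = sum.apply(null, [1, 2, 3]);
const viaCall = sum.call(null, 1, 2, 3);

function greet(prefix) {
  return prefix + " " + this.name;
}
const greeting = greet.call({ name: "MAMA" }, "Hello");

console.log(viaApply, viaCall, greeting); // 6 6 Hello MAMA
```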

Global constants and methods

These keywords control aspects of the global object. They were detected in 1,760,274 URLs from MAMA, or just over 2/3 of all scripting cases. The escape keyword is more popular than the corresponding unescape keyword, by a 3:2 ratio. The encodeURI keyword is FAR more popular than the related decodeURI keyword (by almost 50 to 1!); but, in an odd twist, encodeURIComponent is only slightly more popular than decodeURIComponent.

Fig 7-1: Global object constants and methods
Constant/method | Frequency | Constant/method | Frequency | Constant/method | Frequency
parseInt | 1,172,466 | decodeURIComponent | 541,755 | isFinite | 25,243
escape | 1,096,151 | encodeURI | 392,740 | NaN | 12,135
eval | 971,985 | isNaN | 356,244 | decodeURI | 8,414
unescape | 729,588 | parseFloat | 343,896 | Infinity | 935
encodeURIComponent | 589,443 | undefined | 177,960 | getClass | 638
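
The encoding functions compared above differ in which characters they leave alone. The sketch below uses made-up sample strings to show the difference between encodeURI and encodeURIComponent, and between the older escape and the UTF-8 based encoders.

```javascript
const query = "a b&c=d";

// encodeURI leaves characters that have a meaning in a complete URL
// (such as & and =) untouched.
const whole = encodeURI(query);                  // "a%20b&c=d"

// encodeURIComponent also encodes those characters, so the result is
// safe to use as a single query-string value.
const part = encodeURIComponent(query);          // "a%20b%26c%3Dd"

// Each encoder has a matching decoder.
const roundTrip = decodeURIComponent(part);      // "a b&c=d"

// escape/unescape are the older pair; they are not UTF-8 based.
const legacy = escape("caf\u00e9");              // "caf%E9"
const modern = encodeURIComponent("caf\u00e9");  // "caf%C3%A9"

console.log(whole, part, roundTrip, legacy, modern, unescape(legacy));
```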

Browser sniffing and parseInt

The parseInt keyword is the most popular of all the global object constants and methods, so taking a closer look is warranted. The keyword parseInt is often found when using script to perform crude browser detection (sniffing). So, how often is parseInt used when comparing it to the components of the Navigator object that are also commonly employed to do browser sniffing? The following numbers only indicate usage somewhere in the same URL; they do not mean the keywords are used in the same function or even the same script! Still, the high degree of use correlation between parseInt and the popular Navigator properties indicates a distinct affinity between the two.

Fig 7-2: Use of "parseInt" with Navigator properties
Condition | Frequency | Total for Navigator property | Navigator property %
parseInt && appVersion | 731,386 | 885,564 | 82.59%
parseInt && appName | 630,039 | 877,345 | 71.81%
parseInt && userAgent | 593,363 | 812,382 | 73.04%
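
The percentages in Fig 7-2 are simple per-URL co-occurrence ratios. The sketch below shows one way such a ratio could be computed. The URLs, token sets and the function name coOccurrence are invented for illustration and say nothing about how MAMA stores its data.

```javascript
// Hypothetical input: for each URL, the set of tokens found anywhere in
// its scripts (the URLs and token lists are invented for illustration).
const tokensByUrl = {
  "http://example.com/a": new Set(["parseInt", "appVersion", "userAgent"]),
  "http://example.com/b": new Set(["appVersion"]),
  "http://example.com/c": new Set(["parseInt"]),
  "http://example.com/d": new Set(["parseInt", "appVersion"])
};

// Count URLs containing `property`, and how many of those also contain
// `method`. The tokens only need to share a URL, not a function or script.
function coOccurrence(data, method, property) {
  let total = 0;
  let both = 0;
  for (const tokens of Object.values(data)) {
    if (tokens.has(property)) {
      total++;
      if (tokens.has(method)) {
        both++;
      }
    }
  }
  const percent = total === 0 ? 0 : (both / total) * 100;
  return { both, total, percent };
}

const result = coOccurrence(tokensByUrl, "parseInt", "appVersion");
console.log(result.both, result.total, result.percent.toFixed(2) + "%"); // 2 3 66.67%
```

The same arithmetic applied to the first row of Fig 7-2 (731,386 out of 885,564) gives the 82.59% shown there.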

Global objects

This is JavaScript's global object. It is not an object class, but it covers references to all the major predefined objects, including Error object types. MAMA detected the use of these global object references in 1,817,657 URLs (almost 70% of all URLs using scripting). The Array object is referenced most, used in over 55% of all URLs using script. The Date object is explicitly used in over 42% of scripting cases. All the other global objects were mentioned by only 15% or less each of scripting cases.

Fig 8-1: Properties and methods of the Global object
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
Array | 1,453,169 | RegExp | 315,660 | RangeError | 33,849 | TypeError | 785
Date | 1,119,350 | Error | 221,977 | Boolean | 23,715 | ReferenceError | 179
Object | 368,446 | Function | 152,246 | Math | 23,560 | URIError | 135
String | 361,638 | Number | 103,658 | SyntaxError | 4,857 | EvalError | 129

Global object prototypes—messing with a good thing

A developer at Opera requested real-life use cases where JavaScript/ECMAScript's built-in global object types were modified using the prototype property. If an identifier chain (e.g. Array.prototype) contained the string ".prototype" and the substring before that was a reference to one of the global objects, it was considered a match. The detection method used was not perfect (e.g. foo.Array.prototype would not match) and was intended to be a first-generation attempt only. Several of the global objects (mostly the Error objects) did not appear to have any prototype modification: EvalError, Math, RangeError, ReferenceError, and URIError. The Array object had the most prototype changes by a wide margin, followed by the String object. What use might such information serve? For one, the data could point out functionality that the global objects lack which many authors could find useful. These could be good candidates for new features in future versions of JavaScript/ECMAScript.

Fig 9-1: Modified global object prototypes
Object | Frequency | Object | Frequency | Object | Frequency
Array | 125,575 | Date | 15,681 | RegExp | 257
String | 77,123 | Object | 12,282 | TypeError | 5
Function | 52,457 | Error | 4,837 | SyntaxError | 5
Number | 40,049 | Boolean | 731 | |
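
The matching rule described above can be sketched in a few lines. This is an illustration written from the description only, not MAMA's actual code: the function name matchGlobalPrototype and the exact list of object names are assumptions.

```javascript
// Names checked in this sketch; the list mirrors the objects in Fig 8-1.
const GLOBAL_OBJECTS = new Set([
  "Array", "Boolean", "Date", "Error", "EvalError", "Function", "Math",
  "Number", "Object", "RangeError", "ReferenceError", "RegExp", "String",
  "SyntaxError", "TypeError", "URIError"
]);

// Return the global object name if `chain` looks like a modification
// target such as "Array.prototype.foo", otherwise null.
function matchGlobalPrototype(chain) {
  const marker = ".prototype";
  const index = chain.indexOf(marker);
  if (index === -1) {
    return null;
  }
  // Everything before ".prototype" must be exactly one global object name.
  const prefix = chain.substring(0, index);
  return GLOBAL_OBJECTS.has(prefix) ? prefix : null;
}

console.log(matchGlobalPrototype("Array.prototype.contains")); // "Array"
console.log(matchGlobalPrototype("foo.Array.prototype"));      // null
console.log(matchGlobalPrototype("myList.length"));            // null
```

The second call shows the limitation noted in the text: a chain with anything in front of the global object name is not counted.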

Math object

A bug prevented MAMA from directly saving the information for this object from the pages it analyzed. The Math object constants and static functions were successfully pulled from scripts by MAMA, but the database field where that information would be stored was not created properly. Hence, that particular data was thrown away. But not all was lost: a "leftovers" list from the tokenizer existed for all identifier tokens that did not get placed into other categories, and this stored the information for the Math object anyway. These numbers should be reliable for our analysis. Future versions of MAMA will correct this bug.

Several of the Math object constants (E, LOG10E, LOG2E, SQRT1_2 and SQRT2) were not detectable in any of the URLs that MAMA analyzed. Of the remaining ones, only PI was detected in a significant quantity (9,766 times). Computing the maximum and minimum (max and min respectively) was also very well represented.

Fig 10-1: Constants and functions of the Math object
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
max | 73,296 | cos | 7,281 | ceil | 760
exp | 63,522 | abs | 4,935 | atan2 | 581
min | 58,858 | sin | 4,056 | tan | 253
log | 30,632 | pow | 3,886 | acos | 136
random | 26,900 | floor | 3,042 | atan | 28
PI | 9,766 | sqrt | 1,972 | LN10 | 20
round | 9,244 | asin | 1,804 | LN2 | 4

Number object

The constants and methods from this object were not accessed very often compared to other objects: only 0.44% of the URLs using scripting in MAMA's URL set used 1 or more of them. The keyword used most of all is MAX_VALUE, but the one used the fewest times is MIN_VALUE...quite appropriate. The value POSITIVE_INFINITY is used more than twice as much as NEGATIVE_INFINITY...this is probably a very funny cosmic math joke to someone, but for most people the punchline would fall flat.

Fig 11-1: Constants and methods of the Number object
Constant/method | Frequency | Constant/method | Frequency
MAX_VALUE | 7,526 | toPrecision | 282
toFixed | 3,291 | toExponential | 90
POSITIVE_INFINITY | 737 | MIN_VALUE | 47
NEGATIVE_INFINITY | 319 | |

Object object

This object is a superclass—the parent of all JavaScript objects—and, as such, these properties and methods are shared by all other objects. There were 487,445 URLs in MAMA that used 1 or more of these property/method keywords, with the toString method being the runaway author favorite.

Fig 12-1: Properties and methods of the Object object
Property/method | Frequency | Property/method | Frequency
toString | 468,948 | hasOwnProperty | 7,659
constructor | 83,338 | propertyIsEnumerable | 370
valueOf | 23,573 | isPrototypeOf | 35
toLocaleString | 8,337 | |
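
The sharing described above can be seen with any plain object. The sketch below uses an invented sample object; it also shows that other built-in types supply their own toString, which is one plausible reason the keyword turns up so often.

```javascript
// Every ordinary object inherits these methods from Object.prototype.
const point = { x: 1, y: 2 };

console.log(point.toString());                 // "[object Object]"
console.log(point.hasOwnProperty("x"));        // true: defined on point itself
console.log(point.hasOwnProperty("toString")); // false: inherited
console.log(point.constructor === Object);     // true

// Other built-in types override toString with their own version.
console.log([1, 2, 3].toString());             // "1,2,3"
console.log((255).toString(16));               // "ff"
```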

RegExp object

This object was created to enable regular expression pattern matching functionality in JavaScript/ECMAScript. One or more of these keywords were found in 313,752 of MAMA's URLs. Several of the keywords here stand a greater chance of being non-unique with respect to the RegExp object, so some of the results here may be biased. For instance, test as a keyword is nondescript and could represent many things in different situations.

Fig 13-1: Properties and methods of the RegExp object
Property/method | Frequency | Property/method | Frequency
test | 237,755 | lastIndex | 11,052
exec | 113,711 | ignoreCase | 3,068
source | 70,751 | multiline | 751
global | 12,847 | |
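
The possible bias mentioned above comes from names that are not unique to RegExp. In the sketch below (the pattern and the suite object are invented for illustration), the first half is genuine RegExp usage and the second half is an unrelated method that happens to share the name test.

```javascript
const pattern = /^\d{4}$/;

// RegExp usage: test returns a boolean, exec returns match details or null.
console.log(pattern.test("2008"));       // true
console.log(pattern.exec("20x8"));       // null
console.log(pattern.source);             // ^\d{4}$

// An unrelated object can have a method with the same name. A scanner that
// only looks at the token "test" would count this as RegExp usage too.
const suite = {
  test: function (name) {
    return "running " + name;
  }
};
console.log(suite.test("login form"));   // running login form
```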

String object

The String object is used to manipulate groups of individual characters. MAMA discovered 1,982,954 URLs using String object-specific keywords, but that includes the length property, which is also used as a property by the Array object. This name collision issue also happens with the replace/search keywords, which are used by both the String and Location objects. The way MAMA is currently set up, it cannot distinguish between multiple uses of a keyword. Judging by the relative popularity of other properties and methods between the two objects, it is likely that the majority of the 1,825,953 uses of the length property are in a String object context. Some of this object's methods were not grouped here by MAMA, specifically the HTML methods (anchor, big, blink, bold, fixed, fontcolor, fontsize, italics, link, small, strike, sub and sup), but most of these can be tracked in the section covering "The Rest". Some of the methods of the String object (namely concat and slice) were never detected in any of the URLs MAMA analyzed.

Most of the String object methods have high usage rates; length, indexOf and substring are all used in more than half of the URLs that use scripting. There are some interesting trends though: toLowerCase is MUCH more popular than toUpperCase, and substring is MUCH more popular than substr.

Fig 14-1: Properties and methods of the String object
Property/method | Frequency | Property/method | Frequency | Property/method | Frequency
length | 1,825,953 | charAt | 870,869 | substr | 470,049
indexOf | 1,643,269 | replace | 710,059 | match | 450,277
substring | 1,523,307 | search | 658,995 | toUpperCase | 271,588
toLowerCase | 966,482 | lastIndexOf | 642,927 | fromCharCode | 140,676
split | 912,249 | charCodeAt | 486,355 | localeCompare | 169
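
Since substring and substr are compared above, it may help to recall that they take different second arguments. The sample string below is arbitrary.

```javascript
const text = "JavaScript";

// substring(start, end): the second argument is an end index (exclusive).
console.log(text.substring(4, 6));   // "Sc"

// substr(start, length): the second argument is a character count.
console.log(text.substr(4, 6));      // "Script"

// Case conversion is commonly used for case-insensitive comparison.
console.log(text.toLowerCase() === "javascript");  // true
console.log(text.indexOf("Script"));               // 4
console.log(text.split("a"));                      // ["J", "v", "Script"]
```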

IndexOf, substring, and browser sniffing

The discussion in the Global constants and methods section looked for a correlation between the use of the parseInt method and several common Navigator object properties: appName, appVersion, userAgent. Two methods of the String object are also used often in browser-sniffing scripts: indexOf and substring. We again see a strong use correlation between these String methods and the Navigator object properties. The biggest connection drawn between these items is the use of indexOf with the userAgent Navigator property in over 98% of userAgent cases.

Fig 14-2: Use of "indexOf" and "substring" with Navigator properties
Condition | Frequency | Total for Navigator property | Navigator property %
indexOf && userAgent | 801,109 | 812,382 | 98.61%
indexOf && appVersion | 702,228 | 885,564 | 79.30%
indexOf && appName | 650,241 | 877,345 | 74.11%
substring && userAgent | 705,859 | 812,382 | 86.89%
substring && appVersion | 642,193 | 885,564 | 72.52%
substring && appName | 611,198 | 877,345 | 69.66%
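
For context, a typical sniffing routine of the kind these correlations hint at might combine indexOf, substring and parseInt roughly as follows. This is an illustrative sketch only: the function name detectMsieVersion and the sample user-agent strings are assumptions, and the code is not taken from any page in MAMA's URL set.

```javascript
// Hypothetical helper. `nav` is any object with a userAgent string, so the
// function can be tried outside a browser; in a page it would be `navigator`.
function detectMsieVersion(nav) {
  const ua = nav.userAgent;
  const marker = "MSIE ";
  const start = ua.indexOf(marker);
  if (start === -1) {
    return null; // marker not present
  }
  // Take the text right after "MSIE " and let parseInt read the leading digits.
  const rest = ua.substring(start + marker.length);
  const version = parseInt(rest, 10);
  return isNaN(version) ? null : version;
}

const sampleNav = {
  userAgent: "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
};
console.log(detectMsieVersion(sampleNav));                    // 6
console.log(detectMsieVersion({ userAgent: "Opera/9.50" }));  // null
```

All three keywords appear in one short function here, which is consistent with the per-URL overlap seen in the two sniffing tables.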

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
