MAMA script tokenization: ECMAScript/JavaScript syntax
- Previous article—MAMA: Script tokenization: DOM
- Next article—MAMA: What has come before
- Table of contents
Index:
- Introduction
- JavaScript/ECMAScript keywords
- ECMAScript-reserved words
- Array object
- Date object
- Function object
- Global constants and methods
- Global objects
- Global object prototypes—messing with a good thing
- Math object
- Number object
- Object object
- RegExp object
- String object
Introduction
Scripting use was detected in 2,617,305 of MAMA's URLs. This entire section is devoted to details of 13 different areas of the JavaScript/ECMAScript language, covering basic JavaScript syntax and core JavaScript objects. We will leave analysis of the DOM for the dedicated DOM tokenization article.
JavaScript factor | Frequency | JavaScript factor | Frequency | |
---|---|---|---|---|
ECMAScript keywords | 2,476,007 | Function object | 925,025 | |
String object | 1,982,954 | Object object | 487,445 | |
Array object | 1,835,275 | RegExp object | 313,752 | |
Global objects | 1,817,657 | Global prototypes | 170,844 | |
Global constants/methods | 1,760,274 | Reserved keywords | 94,035 | |
Date object | 1,085,966 | Number object | 11,641 |
JavaScript/ECMAScript keywords
This is the list of keywords standardized by ECMAScript v3. Each keyword is part of the language's basic syntax. As such, many of them are expected to have VERY high usage, which is indeed the case. Looking at the entire group, these keywords are used in 2,476,007 URLs (94.60% of all pages using scripting). A closer examination of this section reveals some of the basic ways that JavaScript/ECMAScript is used in the wild:
- `function` is used the most of all keywords, in 87.19% of all scripting cases. In 84.68% of the pages that used `function`, there was at least one `return` keyword also detected.
- Predictably, `function` is more popular than `var`, and in turn `var` is used at higher rates than `new`.
- Conditional constructs (`if`) are favored over looping (`for`), and the alternate looping mechanism `while` is less popular than the primary `for` usage.
- The `else` keyword for conditional code flow was used 80% as often as the companion keyword `if`.
- `break` is favored over `continue`.
- Boolean values `true` and `false` are used a similar number of times and are used together 1,314,911 times (91.19% of `false` cases and 85.98% of `true` cases).
- The `try` and `catch` keywords are used a nearly identical number of times. The fallback condition `finally` is only used in 5.5% of the cases using `catch`.
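Several of the percentages in the list above can be reproduced from the keyword frequency table in this section. The following is a minimal sketch of that arithmetic; the helper name `percentOf` and the `counts` object are assumptions made for illustration and are not part of MAMA.

```javascript
// Frequencies copied from this section (URL counts, not token counts).
const counts = {
  scriptingUrls: 2617305, // URLs where scripting was detected
  function: 2281902,
  if: 2253000,
  else: 1795957,
  true: 1529306,
  false: 1441874,
  trueAndFalse: 1314911, // URLs that use both true and false
  catch: 752271,
  finally: 41788
};

// Hypothetical helper: `part` as a percentage of `whole`, rounded to two decimals.
function percentOf(part, whole) {
  return Math.round((part / whole) * 10000) / 100;
}

const functionShare = percentOf(counts.function, counts.scriptingUrls); // 87.19
const bothPerFalse = percentOf(counts.trueAndFalse, counts.false);      // 91.19
const bothPerTrue = percentOf(counts.trueAndFalse, counts.true);        // 85.98
const elsePerIf = percentOf(counts.else, counts.if);                    // roughly 80
const finallyPerCatch = percentOf(counts.finally, counts.catch);        // roughly 5.5
```

Note that every figure is a share of URLs, so a page that uses a keyword once counts the same as a page that uses it a thousand times.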
Keyword | Frequency | Keyword | Frequency | Keyword | Frequency | Keyword | Frequency | |||
---|---|---|---|---|---|---|---|---|---|---|
function | 2,281,902 | true | 1,529,306 | try | 753,384 | default | 254,919 | |||
if | 2,253,000 | false | 1,441,874 | catch | 752,271 | throw | 242,519 | |||
var | 2,152,170 | null | 1,412,832 | continue | 611,755 | do | 137,318 | |||
new | 1,939,996 | typeof | 1,015,441 | in | 563,323 | void | 130,542 | |||
return | 1,932,406 | while | 1,014,486 | case | 328,235 | delete | 77,570 | |||
else | 1,795,957 | break | 893,712 | switch | 325,443 | instanceof | 61,019 | |||
for | 1,751,342 | this | 810,322 | with | 287,731 | finally | 41,788 |
ECMAScript-reserved words
These keywords are not currently used by ECMAScript/JavaScript, but ECMA has reserved their use for possible inclusion in future ECMAScript versions. The reserved words do not necessarily tell us too much about current script syntax but can point out the scope of possible problems if new scripting syntax is introduced. All told, 94,035 URLs used at least 1 of these reserved words, with the `class` keyword detected almost 5 times as often as its nearest reserved word value. If MAMA's tokenization process can be trusted, as many as 3.59% of URLs that use scripting could be in for some kind of surprise if any of these reserved words become official parts of future ECMAScript syntax.
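One way to picture the kind of surprise described above is to ask a present-day engine which of these words it refuses as a variable name. The sketch below is an illustration only and says nothing about how MAMA tokenized scripts; `isRejectedAsIdentifier` is a hypothetical helper, and the answers depend on the engine and on whether strict mode applies.

```javascript
// Hypothetical helper: returns true if the engine refuses to parse
// `var <word>;` as non-strict function code.
function isRejectedAsIdentifier(word) {
  try {
    // The Function constructor only compiles the body here; it is never called.
    new Function("var " + word + ";");
    return false;
  } catch (e) {
    return e instanceof SyntaxError;
  }
}

// A few entries from the reserved-word table in this section.
const sample = ["class", "boolean", "float", "static", "import", "int"];
const rejected = sample.filter(isRejectedAsIdentifier);
```

In engines that implement ECMAScript 2015 or later, `class` is a full keyword and is rejected here, while a word such as `int` is accepted as an ordinary identifier.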
Reserved word | Frequency | Reserved word | Frequency | Reserved word | Frequency | Reserved word | Frequency | |||
---|---|---|---|---|---|---|---|---|---|---|
class | 69,649 | public | 1,665 | goto | 420 | abstract | 81 | |||
boolean | 14,510 | private | 1,441 | interface | 347 | implements | 33 | |||
float | 12,915 | double | 971 | super | 340 | const | 31 | |||
static | 7,485 | protected | 913 | package | 305 | byte | 26 | |||
import | 3,813 | final | 841 | extends | 152 | synchronized | 11 | |||
int | 3,058 | short | 630 | export | 130 | volatile | 11 | |||
native | 2,613 | enum | 585 | debugger | 127 | transient | 5 | |||
long | 2,490 | char | 494 | throws | 107 |
Array object
This object's property and method keywords were detected in 1,835,275 URLs. The Array object is mostly concerned with manipulations of array structures, but it has one main informational property, `length`, which details the number of items in the array. This property is used far more often than any of the Array-specific methods, but that is an expected state of affairs, because it is also used by the String object to count the number of characters in a string. The way tokenization in MAMA is currently set up, it cannot distinguish between these two uses.
Of the standard array operations, `push` is much more popular than `pop`, and `shift` is much more popular than `unshift`, while `shift` is used about 1.5 times as often as `pop`.
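The `length` ambiguity and the four stack/queue operations can be shown in a few lines. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
const list = ["a", "b", "c"];
const text = "abc";

// The same property name serves both objects. A tokenizer that only sees
// the identifier `length` cannot tell which kind of object it belongs to.
const arrayLength = list.length;  // 3
const stringLength = text.length; // 3

// push/pop work on the end of an array, unshift/shift on the start.
const queue = [];
queue.push(1);               // queue is [1]
queue.push(2);               // queue is [1, 2]
queue.unshift(0);            // queue is [0, 1, 2]
const last = queue.pop();    // 2, queue is now [0, 1]
const first = queue.shift(); // 0, queue is now [1]
```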
Property/ method | Frequency | Property/ method | Frequency | Property/ method | Frequency | ||
---|---|---|---|---|---|---|---|
length | 1,825,953 | concat | 113,300 | splice | 50,193 | ||
join | 401,563 | shift | 84,745 | sort | 46,780 | ||
push | 378,223 | reverse | 62,908 | unshift | 19,198 | ||
slice | 187,082 | pop | 56,512 |
Date object
MAMA found 1,085,966 URLs using at least one of these Date object keywords. This object has dozens of methods that allow full control of all aspects of dates and times. Most of the methods have "plain" versions and UTC (Coordinated Universal Time) versions, but the plain types were always found to be more popular than the corresponding UTC incarnations. Additionally, most of the timeframe methods have get and set variations. As an example, two main methods to access the month portion of a date are `getMonth` and `setMonth`.
The MAMA URL set revealed that reading an existing date component ("get") is always more popular than the date's corresponding write method ("set").
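For readers less familiar with the naming pattern, here is a short sketch of the plain, UTC, get and set variants for the month component. The date value is arbitrary and chosen only for illustration.

```javascript
// Build a fixed instant so the UTC results do not depend on the local time zone.
const d = new Date(Date.UTC(2008, 0, 15, 12, 30, 0)); // 15 January 2008, 12:30:00 UTC

const utcMonth = d.getUTCMonth(); // 0 (months are zero-based)
const localMonth = d.getMonth();  // "plain" variant; uses the local time zone

// The "set" counterpart writes the same component.
d.setUTCMonth(5);
const monthAfterSet = d.getUTCMonth(); // 5

// getTime returns milliseconds since 1 January 1970 UTC.
const millis = d.getTime();
```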
Property/ method | Frequency | Property/ method | Frequency | Property/ method | Frequency | Property/ method | Frequency | |||
---|---|---|---|---|---|---|---|---|---|---|
getTime | 880,057 | getMonth | 238,845 | getDay | 161,373 | getSeconds | 119,185 | |||
toGMTString | 589,687 | getDate | 231,988 | getYear | 156,232 | setMonth | 29,439 | |||
setTime | 534,490 | parse | 223,046 | getMinutes | 131,256 | setDate | 28,329 | |||
getTimezoneOffset | 327,185 | getHours | 167,625 | getFullYear | 125,465 | setFullYear | 22,882 |
Function object
The properties and methods of the Function object are applied to JavaScript functions (really, they must stop using such hard-to-remember syntax!). In all, MAMA detected these specific property/method keywords in 925,025 URLs. The `arguments` keyword appears in 97.40% of those URLs, which is just over 1/3 of all URLs that use scripting. All other property/method keywords here have much lower usage rates.
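As a reminder of what the three most common keywords in this group do, here is a small sketch. The function name `sum` is invented for the example.

```javascript
// `arguments` holds every value passed to the call, declared or not.
function sum() {
  var total = 0;
  for (var i = 0; i < arguments.length; i++) {
    total += arguments[i];
  }
  return total;
}

// `apply` takes the arguments as an array; `call` takes them one by one.
var viaApply = sum.apply(null, [1, 2, 3]); // 6
var viaCall = sum.call(null, 1, 2, 3);     // 6
```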
Property/ method | Frequency | Property/ method | Frequency | |
---|---|---|---|---|
arguments | 900,932 | callee | 58,194 | |
apply | 113,173 | caller | 22,094 | |
call | 77,192 | prototype | 9,260 |
Global constants and methods
These keywords control aspects of the global object. They were detected in 1,760,274 URLs from MAMA, or just over 2/3 of all scripting cases. The `escape` keyword is more popular than the corresponding `unescape` keyword, by a 3:2 ratio. The `encodeURI` keyword is FAR more popular than the related `decodeURI` keyword (by almost 50 to 1!); but, in an odd twist, `encodeURIComponent` is only slightly more popular than `decodeURIComponent`.
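The three encoding functions differ in which characters they leave alone, which may help explain why authors reach for them in different situations. The sketch below uses an arbitrary sample string chosen for illustration.

```javascript
const raw = "a b&c=d/e?f";

// escape leaves letters, digits and @*_+-./ untouched.
const viaEscape = escape(raw);                // "a%20b%26c%3Dd/e%3Ff"

// encodeURI keeps URI delimiters such as & = / ? so a whole URI stays usable.
const viaEncodeURI = encodeURI(raw);          // "a%20b&c=d/e?f"

// encodeURIComponent also encodes those delimiters, for a single parameter value.
const viaComponent = encodeURIComponent(raw); // "a%20b%26c%3Dd%2Fe%3Ff"

// Each function has a decoding counterpart that reverses it.
const roundTrip = decodeURIComponent(viaComponent);
```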
Constant/method | Frequency | Constant/method | Frequency | Constant/method | Frequency | ||
---|---|---|---|---|---|---|---|
parseInt | 1,172,466 | decodeURIComponent | 541,755 | isFinite | 25,243 | ||
escape | 1,096,151 | encodeURI | 392,740 | NaN | 12,135 | ||
eval | 971,985 | isNaN | 356,244 | decodeURI | 8,414 | ||
unescape | 729,588 | parseFloat | 343,896 | Infinity | 935 | ||
encodeURIComponent | 589,443 | undefined | 177,960 | getClass | 638 |
Browser sniffing and parseInt
The `parseInt` keyword is the most popular of all the global object constants and methods, so taking a closer look is warranted. The keyword `parseInt` is often found when using script to perform crude browser detection (sniffing). So, how often is `parseInt` used alongside the components of the Navigator object that are also commonly employed to do browser sniffing? The following numbers only indicate usage somewhere in the same URL; they do not mean the keywords are used in the same function or even the same script! Still, the high degree of use correlation between `parseInt` and the popular Navigator properties indicates a distinct affinity between the two.
Condition | Frequency | Total for Navigator property | Navigator property % |
---|---|---|---|
parseInt && appVersion | 731,386 | 885,564 | 82.59% |
parseInt && appName | 630,039 | 877,345 | 71.81% |
parseInt && userAgent | 593,363 | 812,382 | 73.04% |
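A typical sniffing idiom of the kind these numbers hint at might look like the sketch below. It is an assumption about common practice, not code taken from MAMA's URL set; `fakeNavigator` and its values are made up so the example runs outside a browser.

```javascript
// Stand-in for the browser's navigator object (values invented for illustration).
const fakeNavigator = {
  appName: "Netscape",
  appVersion: "4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
};

// parseInt reads leading digits and stops at the first character that is not
// part of an integer, so "4.0 (compatible; ...)" yields 4.
// Passing the radix explicitly avoids surprises with leading zeros.
const majorVersion = parseInt(fakeNavigator.appVersion, 10);

const isVersion4OrLater =
  fakeNavigator.appName === "Netscape" && majorVersion >= 4;
```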
Global objects
This is JavaScript's global object. It is not an object class, but it covers references to all the major predefined objects, including Error object types. MAMA detected the use of these global object references in 1,817,657 URLs (almost 70% of all URLs using scripting). The Array object is referenced most, used in over 55% of all URLs using script. The Date object is explicitly used in over 42% of scripting cases. All the other global objects were mentioned by only 15% or less each of scripting cases.
Property/ method | Frequency | Property/ method | Frequency | Property/ method | Frequency | Property/ method | Frequency | |||
---|---|---|---|---|---|---|---|---|---|---|
Array | 1,453,169 | RegExp | 315,660 | RangeError | 33,849 | TypeError | 785 | |||
Date | 1,119,350 | Error | 221,977 | Boolean | 23,715 | ReferenceError | 179 | |||
Object | 368,446 | Function | 152,246 | Math | 23,560 | URIError | 135 | |||
String | 361,638 | Number | 103,658 | SyntaxError | 4,857 | EvalError | 129 |
Global object prototypes—messing with a good thing
A developer at Opera requested real-life use cases where JavaScript/ECMAScript's built-in global object types were modified using the prototype property. If an identifier chain (e.g., `Array.prototype`) contained the string ".prototype" and the substring before that was a reference to one of the global objects, it was considered a match. The detection method used was not perfect (e.g., `foo.Array.prototype` would not match) and was intended to be a first-generation attempt only. Several of the global objects, mostly the Error objects, did not appear to have any prototype modification: `EvalError`, `Math`, `RangeError`, `ReferenceError` and `URIError`. The Array object had the most prototype changes by a wide margin, followed by the String object. What use might such information serve? For one, the data could point out functionality that the global objects lack which many authors could find useful. These could be good candidates for new features in future versions of JavaScript/ECMAScript.
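The matching rule described above can be sketched as follows. This is a hypothetical reconstruction from the prose, not MAMA's actual code; the function name `matchGlobalPrototype` and the `GLOBAL_OBJECTS` list are assumptions.

```javascript
// Global object names taken from the "Global objects" table in this article.
const GLOBAL_OBJECTS = [
  "Array", "Boolean", "Date", "Error", "EvalError", "Function", "Math",
  "Number", "Object", "RangeError", "ReferenceError", "RegExp", "String",
  "SyntaxError", "TypeError", "URIError"
];

// Hypothetical matcher: returns the global object name if the identifier chain
// contains ".prototype" and everything before it is exactly a global object name.
function matchGlobalPrototype(chain) {
  const index = chain.indexOf(".prototype");
  if (index === -1) {
    return null;
  }
  const owner = chain.substring(0, index);
  return GLOBAL_OBJECTS.indexOf(owner) !== -1 ? owner : null;
}

const hit = matchGlobalPrototype("Array.prototype.indexOf");   // "Array"
const miss = matchGlobalPrototype("foo.Array.prototype");       // null
const unrelated = matchGlobalPrototype("MyWidget.prototype");   // null
```

The second call reproduces the limitation mentioned in the text: because the substring before ".prototype" is `foo.Array` rather than `Array`, the chain is not counted.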
Object | Frequency | Object | Frequency | Object | Frequency | ||
---|---|---|---|---|---|---|---|
Array | 125,575 | Date | 15,681 | RegExp | 257 | ||
String | 77,123 | Object | 12,282 | TypeError | 5 | ||
Function | 52,457 | Error | 4,837 | SyntaxError | 5 | ||
Number | 40,049 | Boolean | 731 |
Math object
A bug prevented MAMA from directly saving the information for this object from the pages it analyzed. The Math object constants and static functions were successfully pulled from scripts by MAMA, but the database field where that information would be stored was not created properly. Hence, that particular data was thrown away. But not all was lost: a "leftovers" list from the tokenizer existed for all identifier tokens that did not get placed into other categories, and this stored the information for the Math object anyway. These numbers should be reliable for our analysis. Future versions of MAMA will correct this bug.
Several of the Math object constants (`E`, `LOG10E`, `LOG2E`, `SQRT1_2` and `SQRT2`) were not detectable in any of the URLs that MAMA analyzed. Of the remaining ones, only `PI` was detected in a significant quantity (9,766 times). Computing the maximum and minimum (`max` and `min` respectively) was also very well represented.
Property/ method | Frequency | Property/ method | Frequency | Property/ method | Frequency | ||
---|---|---|---|---|---|---|---|
max | 73,296 | cos | 7,281 | ceil | 760 | ||
exp | 63,522 | abs | 4,935 | atan2 | 581 | ||
min | 58,858 | sin | 4,056 | tan | 253 | ||
log | 30,632 | pow | 3,886 | acos | 136 | ||
random | 26,900 | floor | 3,042 | atan | 28 | ||
PI | 9,766 | sqrt | 1,972 | LN10 | 20 | ||
round | 9,244 | asin | 1,804 | LN2 | 4 |
Number object
The constants and methods from this object were not accessed very often compared to other objects: only 0.44% of all URLs using scripting in MAMA's URL set used 1 or more of them. The keyword used most of all is `MAX_VALUE`, but the one used the fewest times is `MIN_VALUE`, which is quite appropriate. The value `POSITIVE_INFINITY` is used more than twice as much as `NEGATIVE_INFINITY`. This is probably a very funny cosmic math joke to someone, but for most people the punchline would fall flat.
Constant/method | Frequency | Constant/method | Frequency | |
---|---|---|---|---|
MAX_VALUE | 7,526 | toPrecision | 282 | |
toFixed | 3,291 | toExponential | 90 | |
POSITIVE_INFINITY | 737 | MIN_VALUE | 47 | |
NEGATIVE_INFINITY | 319 |
Object object
This object is a superclass (the parent of all JavaScript objects) and, as such, these properties and methods are shared by all other objects. There were 487,445 URLs in MAMA that used 1 or more of these property/method keywords, with the `toString` method being the runaway author favorite.
Property/method | Frequency | Property/method | Frequency | |
---|---|---|---|---|
toString | 468,948 | hasOwnProperty | 7,659 | |
constructor | 83,338 | propertyIsEnumerable | 370 | |
valueOf | 23,573 | isPrototypeOf | 35 | |
toLocaleString | 8,337 |
RegExp object
This object was created to enable regular expression pattern matching functionality in JavaScript/ECMAScript. One or more of these keywords were found in 313,752 of MAMA's URLs. Several of the keywords here stand a greater chance of being non-unique with respect to the RegExp object, so some of the results here may be biased. For instance, `test` as a keyword is nondescript and could represent many things in different situations.
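The following sketch shows the two most common RegExp methods next to an unrelated method that happens to share the name `test`. The `suite` object is invented for the example; it stands for any author-defined code that a keyword-based count would also pick up.

```javascript
const pattern = /(\d+)\.(\d+)/;

// test answers yes/no; exec returns the match and its captured groups.
const found = pattern.test("Version 9.5"); // true
const parts = pattern.exec("Version 9.5"); // "9.5", "9" and "5" at indexes 0, 1, 2

// An unrelated object with a method of the same name. A count based on the
// identifier alone cannot separate this from RegExp usage.
const suite = {
  test: function () {
    return "not a regular expression";
  }
};
const other = suite.test();
```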
Property/method | Frequency | Property/method | Frequency | |
---|---|---|---|---|
test | 237,755 | lastIndex | 11,052 | |
exec | 113,711 | ignoreCase | 3,068 | |
source | 70,751 | multiline | 751 | |
global | 12,847 |
String object
The String object is used to manipulate groups of individual characters. MAMA discovered 1,982,954 URLs using String object-specific keywords, but that includes the `length` property, which is also used as a property by the Array object. This name collision issue also happens with the `replace`/`search` keywords, which are used by both the String and Location objects. The way MAMA is currently set up, it cannot distinguish between multiple uses of a keyword. Judging by the relative popularity of other properties and methods between the two objects, it is likely that the majority of the 1,825,953 uses of the `length` property are in a String object context. Some of this object's methods were not grouped here by MAMA, specifically the HTML methods (`anchor`, `big`, `blink`, `bold`, `fixed`, `fontcolor`, `fontsize`, `italics`, `link`, `small`, `strike`, `sub` and `sup`), but most of these can be tracked in the section covering "The Rest". Some of the methods of the String object (namely `concat` and `slice`) were never detected in any of the URLs MAMA analyzed.
Most of the String object methods have high usage rates; `length`, `indexOf` and `substring` are all used in more than half of the URLs that use scripting. There are some interesting trends though: `toLowerCase` is MUCH more popular than `toUpperCase`, and `substring` is MUCH more popular than `substr`.
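Since `substring` and `substr` are easy to confuse, a short sketch of the difference may be useful. The sample string is arbitrary.

```javascript
const sample = "Opera/9.50";

// substring takes a start index and an end index (end not included).
const bySubstring = sample.substring(2, 4); // "er"

// substr takes a start index and a length.
const bySubstr = sample.substr(2, 4);       // "era/"

// Case conversion, often used to make comparisons case-insensitive.
const lower = sample.toLowerCase();         // "opera/9.50"
```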
Property/ method | Frequency | Property/ method | Frequency | Property/ method | Frequency | ||
---|---|---|---|---|---|---|---|
length | 1,825,953 | charAt | 870,869 | substr | 470,049 | ||
indexOf | 1,643,269 | replace | 710,059 | match | 450,277 | ||
substring | 1,523,307 | search | 658,995 | toUpperCase | 271,588 | ||
toLowerCase | 966,482 | lastIndexOf | 642,927 | fromCharCode | 140,676 | ||
split | 912,249 | charCodeAt | 486,355 | localeCompare | 169 |
IndexOf, substring, and browser sniffing
The discussion in the Global constants and methods section looked for a correlation between the use of the `parseInt` method and several common Navigator object properties: `appName`, `appVersion` and `userAgent`. Two methods of the String object are also used often in browser-sniffing scripts: `indexOf` and `substring`. We again see a strong use correlation between these String methods and the Navigator object properties. The biggest connection drawn between these items is the use of `indexOf` with the `userAgent` Navigator property in over 98% of `userAgent` cases.
Condition | Frequency | Total for Navigator property | Navigator property % |
---|---|---|---|
indexOf && userAgent | 801,109 | 812,382 | 98.61% |
indexOf && appVersion | 702,228 | 885,564 | 79.30% |
indexOf && appName | 650,241 | 877,345 | 74.11% |
substring && userAgent | 705,859 | 812,382 | 86.89% |
substring && appVersion | 642,193 | 885,564 | 72.52% |
substring && appName | 611,198 | 877,345 | 69.66% |
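The pairing of `indexOf` and `substring` with `userAgent` usually takes a form like the sketch below. This is an assumed example of the idiom, not code drawn from MAMA's data; `fakeUserAgent` is a made-up value so the example runs outside a browser.

```javascript
// Stand-in for navigator.userAgent (value invented for illustration).
const fakeUserAgent = "Opera/9.50 (Windows NT 5.1; U; en)";

// indexOf returns -1 when the text is absent, so this is a presence check.
const isOpera = fakeUserAgent.indexOf("Opera") !== -1;

// Cut the version out of the text between the first "/" and the first space.
const start = fakeUserAgent.indexOf("/") + 1;
const end = fakeUserAgent.indexOf(" ");
const versionText = fakeUserAgent.substring(start, end); // "9.50"
const version = parseFloat(versionText);                  // 9.5
```

Checks of this kind depend on the exact layout of the string, which is one reason the technique is described as crude.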
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.