MAMA: Character entities

By Brian Wilson

Index:

  1. Introduction
  2. Character entity usage
  3. A popularity contest: Named or numeric character entities?
  4. Illegal code points for numeric entities
  5. The mystery of the word jumble

Introduction

I am not aware of any earlier studies of character entities, either numeric or named, so MAMA's study appears to be the first. In hindsight, more could have been done: knowing which character entities are used is a good start, but it would also be useful to know how many character entities each document contains.

Character entity usage

In all, 3,002,458 of the 3,509,180 URLs analyzed (85.56%) use at least one character entity. The most popular entity reference is the "Non-breaking space", found on 2,537,947 pages (72.32% of pages overall), about twice as many as any other entity.
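The percentages quoted above can be reproduced from the raw counts. The following is only an arithmetic check in Python, not part of MAMA itself; the variable names and the share() helper are illustrative, and the ampersand count is taken from Fig 2-1.

```python
# Page counts as reported in this article.
pages_analyzed = 3_509_180
pages_with_entity = 3_002_458
pages_with_nbsp = 2_537_947
pages_with_amp = 1_256_005  # second most popular entity (Fig 2-1)


def share(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


print(share(pages_with_entity, pages_analyzed))  # pages using any entity
print(share(pages_with_nbsp, pages_analyzed))    # pages using nbsp
print(round(pages_with_nbsp / pages_with_amp, 2))  # nbsp relative to amp
```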

Fig 2-1: Popular character entities
[See also the full frequency table]

Entity description                  Entity  Entity code  Frequency
Non-breaking space                          nbsp         2,537,947
Ampersand                           &       amp          1,256,005
Copyright sign                      ©       copy         776,051
Quotation mark                      "       quot         520,902
Greater-than sign                   >       gt           276,149
Small 'u' with diaeresis            ü       uuml         226,695
Small 'e' with acute                é       eacute       207,322
Small 'a' with diaeresis            ä       auml         204,855
Small 'o' with diaeresis            ö       ouml         184,313
Right double-angle quotation mark   »       raquo        123,207
Small 'a' with grave                à       agrave       119,984
Small 'e' with grave                è       egrave       104,890
Less-than sign                      <       lt           100,218
Small Sharp S                       ß       szlig        94,842
Apostrophe                          '       39           89,642
Small 'o' with acute                ó       oacute       86,211
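A tally like the one in Fig 2-1, and the per-document count wished for in the introduction, could be gathered with a short script. The sketch below is an assumption about how one might do it in Python, not a description of how MAMA works: the regular expression, the count_entities() name and the sample markup are all made up for illustration, and the pattern only recognizes references that end in a semicolon.

```python
import re
from collections import Counter

# Matches named (&nbsp;), decimal (&#160;) and hexadecimal (&#xa0;) references.
# Requiring the trailing ";" is a simplification; real pages sometimes omit it.
ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def count_entities(markup):
    """Return a Counter mapping each entity code to its number of occurrences."""
    return Counter(ENTITY_RE.findall(markup))


# Hypothetical sample document.
sample = "<p>Fish&nbsp;&amp;&nbsp;chips &copy; 2008 &#39;quoted&#39;</p>"
counts = count_entities(sample)

print(counts.most_common())   # per-entity tally for this document
print(sum(counts.values()))   # total number of entity references
```

Summing such counters over many pages gives overall usage figures, while the per-document total answers the "how many" question.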

A popularity contest: Named or numeric character entities?

Among the most popular entities, there is a definite preference for the named version over the numeric version: in the frequency table above (Fig 2-1), a numeric entity does not appear until the 15th slot. In almost every case where a named entity counterpart exists, the named version is at least as popular as the numeric version, if not much more so.

An allowed alternate form of numeric character entity is a hexadecimal version of the entity number, like so:

Standard numeric entity: &#46; = Hexadecimal numeric entity: &#x2e;

This form of numeric entity was detected many times in MAMA's URLs, but it is used far less often than the equivalent decimal representation of the same entity.
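The equivalence of the two forms is easy to confirm. This short Python sketch uses the standard library's html.unescape to decode both spellings of the full stop shown above; the numeric_forms() helper is a hypothetical name introduced here to produce both spellings for any character.

```python
import html

decimal_form = "&#46;"    # decimal numeric entity
hex_form = "&#x2e;"       # hexadecimal numeric entity

# Both references decode to the same character.
print(html.unescape(decimal_form) == html.unescape(hex_form))


def numeric_forms(char):
    """Return the (decimal, hexadecimal) numeric entities for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:x};"


print(numeric_forms("."))
```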

Illegal code points for numeric entities

The range of code points from 127 to 159 is designated as "system control characters" in the ISO-8859-* and Unicode character sets and should not be used. This does not stop authors from including them as numeric character entities in the wild, though. The most popular entities in this range correspond to certain Windows-specific characters that are not very portable. As with the entities discussed earlier, the legal named entity versions of these Windows-specific characters are quite a bit more popular, as are the legal Unicode numeric entity forms of the characters. The only slight exception is the "Bullet" character, which is slightly more popular in its illegal &#149; incarnation than either of its legal forms separately.

Fig 4-1: Popularity comparison for key illegal numeric entities
Values in "[]" brackets refer to the hexadecimal version of the proper numeric entity.

Character description        Illegal entity  Frequency  Proper entity [hex]  Frequency [hex]  Named entity  Frequency
Left single quotation mark   145             3,284      8216 [x2018]         11,220 [122]     lsquo         8,890
Right single quotation mark  146             25,056     8217 [x2019]         77,397 [940]     rsquo         82,000
Left double quotation mark   147             9,165      8220 [x201c]         40,866 [586]     ldquo         37,661
Right double quotation mark  148             8,536      8221 [x201d]         42,206 [414]     rdquo         35,170
Bullet                       149             40,768     8226 [x2022]         37,128 [2,373]   bull          38,136
En dash                      150             16,562     8211 [x2013]         34,300 [232]     ndash         45,323
Em dash                      151             14,833     8212 [x2014]         19,065 [146]     mdash         22,290
Trademark sign               153             12,510     8482 [x2122]         11,570 [131]     trade         17,223
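The illegal code points in Fig 4-1 line up with byte values in the Windows-1252 encoding, which suggests one way to repair them. The sketch below is an illustrative cleanup step, not something MAMA performs: it assumes the author meant the Windows-1252 character and rewrites each decimal reference in the 128-159 range to its proper Unicode number using Python's cp1252 codec. The function names are made up, and code point 127 is left untouched because it has no Windows-specific meaning.

```python
import re

# Decimal numeric character references such as &#149;
DECIMAL_ENTITY_RE = re.compile(r"&#(\d+);")


def fix_illegal_entity(match):
    """Map one 128-159 reference to its Unicode equivalent via Windows-1252."""
    code = int(match.group(1))
    if 128 <= code <= 159:
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values are unassigned in Windows-1252; leave them.
            return match.group(0)
        return f"&#{ord(char)};"
    return match.group(0)


def fix_markup(markup):
    """Rewrite every illegal decimal reference found in the markup."""
    return DECIMAL_ENTITY_RE.sub(fix_illegal_entity, markup)


print(fix_markup("&#149; item &#151; done &#65;"))
```

Each rewritten number can be checked against the "Proper entity" column of Fig 4-1, for example 149 against 8226 for the bullet.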

The mystery of the word jumble

When the top numeric character entities are assembled into a list, a number of seemingly unrelated, unremarkable ASCII characters stand out. The most popular numeric entity characters do not reflect the letters with the highest relative frequencies (in the English language, at least). The group only makes sense once the characters are put together: they indicate that obfuscated e-mail addresses are very popular. The following e-mail-related characters and word groupings stand out: "@", "at", ".", ":", "nospam", "email" and "com", which could make, for example:

"email: test at foo.com.nospam"

Fig 5-1: Popular numeric entities representing 'boring' characters

Numeric entity  Character  Frequency
64              @          35,494
46              .          27,506
111             o          26,046
97              a          24,773
105             i          24,198
116             t          23,067
109             m          22,674
101             e          22,067
108             l          20,781
110             n          19,867
99              c          18,645
115             s          17,741
58              :          13,890
112             p          8,492
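To see why these particular code points cluster together, it helps to look at what entity-based obfuscation produces. The Python sketch below is a guess at the general technique rather than a reconstruction of any specific page: the obfuscate() helper and the address test@foo.com are hypothetical. It encodes every character as a decimal numeric entity, then decodes the result with the standard library's html.unescape, which is roughly what a browser does when rendering the page.

```python
import html


def obfuscate(text):
    """Encode every character of text as a decimal numeric entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


address = "test@foo.com"      # hypothetical address
encoded = obfuscate(address)

print(encoded)                 # what appears in the page source
print(html.unescape(encoded))  # what the reader sees
```

The encoded form of such an address consists largely of the numbers listed in Fig 5-1, such as 64 for "@" and 46 for ".".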

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
