MAMA: Hyperlinks
- Previous article—MAMA: BODY structure
- Next article—MAMA: Phrase, block, list, and other elements
- Table of contents
Introduction
Hyperlinks are the glue that holds the Web together. Without them, the Web wouldn't be a web at all, just an unorganized collection of billions of isolated documents. Aside from the URL set to which MAMA already has access, the treasure trove of links and metadata is really the only easy way for MAMA to expand its URL set in the future, and what an expansion that would be! MAMA's URL set averages 38 hyperlinks per page, so the ~3.5 million URLs it has already analyzed would yield roughly 130 million more. We are not yet sure whether MAMA's database could handle that many, but a significant amount of URL atrophy was noticeable during this research, so there is definitely safety in numbers for future MAMA Web crawls.
Hyperlink totals
MAMA considered each occurrence of an Href attribute in an AREA or A element as a hyperlink and kept a running tally for each URL analyzed. It also compared the domain of the URL in each hyperlink with the domain of the page being analyzed in order to discover how many external domains a document referenced. The winner for most hyperlinks at the time of writing seems to be http://www.scheffau.at/, which has 37,538 hyperlinks (in its sub-frames). In all, 2,408,662 of the 3,337,666 URLs that contained any hyperlinks (72.17%) had at least one link pointing outside the domain of the URL being analyzed. The average number of hyperlinks per document was 38.41.
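The tallying described above can be sketched roughly as follows. This is a minimal illustration using Python's standard library, not MAMA's actual implementation: the class and function names are hypothetical, and it assumes that "domain" means an exact hostname match and that sub-frames are handled elsewhere.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HyperlinkCounter(HTMLParser):
    """Tally href attributes on A and AREA elements for one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).hostname or ""
        self.total = 0
        self.external_hosts = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag not in ("a", "area"):
            return
        for name, value in attrs:
            if name != "href":
                continue
            self.total += 1
            # Resolve relative links against the page URL, then compare hosts.
            # Assumption: an exact hostname comparison stands in for "domain".
            host = urlparse(urljoin(self.page_url, value or "")).hostname or ""
            if host and host != self.page_host:
                self.external_hosts.add(host)


def count_hyperlinks(page_url, html):
    """Return (number of hyperlinks, set of external hostnames) for one page."""
    counter = HyperlinkCounter(page_url)
    counter.feed(html)
    counter.close()
    return counter.total, counter.external_hosts
```

A page would count toward the 72.17% figure whenever the returned set of external hostnames is non-empty. Treating "www.example.com" and "example.com" as different domains is a simplification of this sketch.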
Hyperlink protocols
One facet tracked for every hyperlink was the protocol of its URL. MAMA's analysis was confined to URLs using the HTTP protocol (and some HTTPS), but a repository of URLs using other protocols is expected to be very helpful in Opera's future testing and development efforts. MAMA stored protocol types only for non-HTTP hyperlinks: the overwhelming majority was expected to be HTTP, so only the exceptions to the rule were tracked.
You can throw a rock in any direction on the Web and hit an HTTP URL; they are a dime a dozen. A ready cache of URLs that use or link to more esoteric protocols is therefore useful to have. Notice that HTTPS is by far the dominant non-HTTP protocol. The next-nearest is the FILE protocol, which is puzzling: for documents on the publicly accessible Web, these should all be non-functional links. Also making their presence known are FTP, RSS ("feedhttp"), and, ahem, "NHP". Notice too that the simple four-letter string "http" is apparently difficult to spell: variants such as "hhtp", "htto" and "htp" occur far too often.
Protocol | Frequency | Protocol | Frequency |
---|---|---|---|
https | 236,389 | mms | 2,419 |
file | 34,056 | callto | 1,030 |
ftp | 8,207 | irc | 832 |
nhp | 3,993 | mailto | 706 |
feedhttp | 2,957 | telnet | 601 |
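A tally like the one in the table could be produced along the following lines. This is a minimal sketch, not MAMA's code: the function name is hypothetical, and it assumes that relative links (which have no scheme of their own) are simply ignored.

```python
from collections import Counter
from urllib.parse import urlparse


def tally_non_http_schemes(hrefs):
    """Count the scheme of every href that has one, skipping plain "http"."""
    counts = Counter()
    for href in hrefs:
        # urlparse returns the scheme lowercased, or "" for relative links.
        scheme = urlparse(href.strip()).scheme
        if scheme and scheme != "http":
            counts[scheme] += 1
    return counts
```

Because the scheme is taken verbatim from the link, misspellings such as "hhtp" land in the tally as protocols of their own, which is how they would surface in a frequency list like this one.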
The AREA element
Client Side Image Maps (CSIM) allow one or more activatable shapes to be defined on an image. The AREA element (which needs to be paired with the MAP element for full CSIM functionality) has several attributes that control the geometry of the hyperlink. The AREA attribute of greatest interest in this section is Href. Just as with the Href attribute for the A element, it specifies the destination at the other side of the link jump.
ELEMENT/Attribute | Frequency |
---|---|
AREA | 453,187 |
Coords | 452,272 |
Href | 450,478 |
Shape | 439,720 |
Alt | 203,624 |
Nohref | 13,570 |
The A element
The A element has a number of attributes, ranging from the popular (the Href attribute for the A element is actually THE most popular of any attribute in all the URLs that MAMA analyzed) to the esoteric and the forgotten (Methods and Urn).
ELEMENT/Attribute | Frequency | ELEMENT/Attribute | Frequency |
---|---|---|---|
A | 3,307,397 | Shape | 4,058 |
Href | 3,304,834 | Hreflang | 3,065 |
Target | 1,978,018 | Rev | 761 |
Title | 658,820 | Disabled | 556 |
Name | 485,168 | Coords | 201 |
Rel | 96,613 | Charset | 137 |
Accesskey | 54,876 | Methods | 62 |
Tabindex | 14,898 | Urn | 58 |
Type | 12,251 | | |
A Rel attribute
The Rel attribute for the A element expresses the relationship that the destination URL has to the current URL. Until relatively recently, this attribute was under-used. However, its use has grown in the last few years as microformats have been embraced. The most popular value for this attribute is "nofollow", which occurs more than twice as often as the next-nearest values, "bookmark" and "tag".
A Rel Attribute Value | Frequency |
---|---|
nofollow | 46,179 |
bookmark | 20,524 |
tag | 20,445 |
category | 13,012 |
external | 7,473 |
license | 6,330 |
alternate | 5,252 |
lightbox | 2,917 |
me | 1,929 |
self | 1,630 |
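Since a Rel value is a space-separated list of link types, a frequency table like this one depends on how the values are split. The sketch below shows one plausible approach. It assumes that values are lowercased and that each token is counted once per attribute, neither of which the article states, and the function name is hypothetical.

```python
from collections import Counter


def tally_rel_tokens(rel_values):
    """Count each link type once per Rel attribute value, case-insensitively."""
    counts = Counter()
    for value in rel_values:
        # split() handles any run of whitespace between link types;
        # set() keeps a repeated token from being counted twice.
        for token in set(value.lower().split()):
            counts[token] += 1
    return counts
```

With this splitting, a value such as "external nofollow" contributes to both the "external" and the "nofollow" rows.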
A Name attribute
The Name attribute, also known as an anchor, identifies a location in a document. Hyperlinks can link to a specific part of a document using an anchor. The values that this attribute may take can be semantically significant. The top values for Name reflect this, representing common locations or controls such as "top", "bottom" and "menu" found in many documents. The "top" value is by far the most popular, occurring at least an order of magnitude more often than any other value.
Note: The values for this attribute highlight an interesting trend: there are many popular values of the form "dk[number]". In all, 21 of the top 26 spots in the frequency table follow this pattern, with very similar quantities (~6,000 to 9,000 occurrences each); lower numbers like "dk3" have the highest frequency and higher numbers like "dk18" have the lowest. Several representative URLs using these A Name values were tested (such as http://www.plasmatvrentals.com/ and http://www.weddingfavour.net/), and they ALL seem to be from a single domain-parking site (domainsponsor.com). This is unfortunate and skews MAMA's overall results. Next time around, this fact will be used to filter out this particular domain parker.
A Name value | Frequency | A Name value | Frequency |
---|---|---|---|
top | 163,733 | links | 5,057 |
content | 15,258 | oben (German for "top") | 4,781 |
bottom | 9,142 | news | 4,318 |
up | 7,147 | menu | 4,137 |
contact | 5,841 | topofpage | 3,998 |
pagetop | 5,559 | navigation | 3,920 |
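The "dk[number]" pattern described in the note suggests one way such parked pages might be flagged in a future crawl. The sketch below is only an illustration of that idea. The function name and the threshold of three matching anchors are assumptions, not anything MAMA is known to use.

```python
import re

# Matches anchor names such as "dk3" or "dk18", and nothing longer.
DK_NAME = re.compile(r"dk\d+", re.IGNORECASE)


def looks_like_parked_page(anchor_names, min_matches=3):
    """Guess whether a page's A Name values follow the "dk[number]" pattern.

    min_matches is a hypothetical threshold chosen for illustration only.
    """
    matches = sum(1 for name in anchor_names if DK_NAME.fullmatch(name.strip()))
    return matches >= min_matches
```

A page with anchors "dk1", "dk2" and "dk3" would be flagged, while a page with a single "dk3" anchor or with unrelated names such as "top" would not.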
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.