MAMA: Hyperlinks

By Brian Wilson

Index:

  1. Introduction
  2. Hyperlink totals
  3. Hyperlink protocols
  4. The AREA element
  5. The A element

Introduction

Hyperlinks are the glue that holds the Web together. Without them, the Web wouldn't be a web at all, just an unorganized collection of billions of isolated documents. Aside from the URL set to which MAMA already has access, the treasure trove of links and metadata in the analyzed pages is really the only easy way for MAMA to expand its URL set in the future, and what an expansion that would be! MAMA's URL set averages about 38 hyperlinks per page, so the ~3.5 million URLs it has already analyzed translate to roughly 130 million more. We are not yet sure whether MAMA could handle that many in its database, but a significant amount of URL atrophy has been noticeable during this research. There is definitely safety in numbers for future MAMA Web crawls.

Hyperlink totals

MAMA counted each occurrence of an Href attribute in an AREA or A element as a hyperlink and kept a running tally for each URL analyzed. It also compared the domain of each hyperlink's URL with the domain of the page being analyzed, in order to discover how many external domains a document referenced. The page with the most hyperlinks at the time of writing seems to be http://www.scheffau.at/, which has 37,538 hyperlinks (in its sub-frames). In all, 2,408,662 of the 3,337,666 URLs that contained any hyperlinks (72.17%) had at least one link pointing outside the domain of the URL being analyzed. The average number of hyperlinks per document was 38.41.
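The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not MAMA's actual code: the names `LinkCounter`, `host_of` and `count_links` are hypothetical, and the sketch assumes that "domain" means the exact host name, so `www.example.com` and `example.com` would be treated as different domains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def host_of(base, href=""):
    """Return the lowercased host of href resolved against base, or ""."""
    try:
        return (urlsplit(urljoin(base, href)).hostname or "").lower()
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) have no usable host.
        return ""


class LinkCounter(HTMLParser):
    """Tally Href attributes on A and AREA elements for one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = host_of(page_url)
        self.total = 0
        self.external_hosts = set()

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names in lowercase.
        if tag not in ("a", "area"):
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        self.total += 1
        # Assumption: a link is "external" when its host differs from the page's.
        host = host_of(self.page_url, href)
        if host and host != self.page_host:
            self.external_hosts.add(host)


def count_links(page_url, html):
    """Return (hyperlink count, set of external hosts) for one document."""
    parser = LinkCounter(page_url)
    parser.feed(html)
    parser.close()
    return parser.total, parser.external_hosts
```

Calling `count_links(url, markup)` on each analyzed page and summing the first return value would give a total comparable in spirit to the figures quoted here; a page counts as linking outside its own domain whenever the returned set is non-empty.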

Hyperlink protocols

MAMA tracked the protocol of the URL in every hyperlink. MAMA's own analysis was confined to the HTTP protocol (and some HTTPS), but a repository of URLs using other protocols is expected to be very helpful in Opera's future testing and development efforts. Only non-HTTP protocol types were stored, because the overwhelming majority of hyperlinks was expected to use HTTP; MAMA only wanted to track and store data on the exceptions to the rule.

You can throw a rock in any direction on the Web and hit an HTTP URL; they are a dime a dozen. A ready cache of URLs that use or link to esoteric protocols is a useful thing to have available. Notice that HTTPS is by far the dominant non-HTTP protocol. The runner-up is the FILE protocol, which is puzzling: for documents on the publicly accessible Web, these should all be non-functional links. Also making their presence known are FTP, RSS ("feedhttp"), and (ahem) "NHP". Notice too that the simple 4-letter string "http" is apparently difficult to spell: variants such as "hhtp", "htto" and "htp" occur far too often.

Fig 3-1: Popular non-HTTP hyperlink protocols
(See also the full frequency table.)
Protocol    Frequency    Protocol    Frequency
https         236,389    mms             2,419
file           34,056    callto          1,030
ftp             8,207    irc               832
nhp             3,993    mailto            706
feedhttp        2,957    telnet            601
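A tally like the one in Fig 3-1 could be produced along the following lines. This is a minimal sketch under stated assumptions, not MAMA's implementation: the function name `tally_non_http_schemes` is hypothetical, and the scheme is simply whatever Python's `urlsplit` reports, which is why misspellings such as "hhtp" show up as protocols of their own.

```python
from collections import Counter
from urllib.parse import urlsplit


def tally_non_http_schemes(hrefs):
    """Count URL schemes in an iterable of Href values, skipping plain HTTP."""
    counts = Counter()
    for href in hrefs:
        try:
            scheme = urlsplit(href.strip()).scheme.lower()
        except ValueError:
            # Skip values that cannot be parsed as a URL at all.
            continue
        # Relative links have no scheme; "http" is the rule, so only
        # the exceptions are stored.
        if scheme and scheme != "http":
            counts[scheme] += 1
    return counts
```

Calling `tally_non_http_schemes(all_hrefs).most_common(10)` on the collected Href values would then list the ten most frequent exceptions.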

The AREA element

Client-side image maps (CSIM) allow one or more clickable shapes to be defined on an image. The AREA element (which needs to be paired with the MAP element for full CSIM functionality) has several attributes that control the geometry of the hyperlink. The AREA attribute of greatest interest in this section is Href. Just as with the Href attribute for the A element, it specifies the destination on the other side of the link.

Fig 4-1: AREA element/attribute frequency
ELEMENT/Attribute    Frequency
AREA                   453,187
    Coords             452,272
    Href               450,478
    Shape              439,720
    Alt                203,624
    Nohref              13,570

The A element

The A element has a number of attributes, ranging from the popular (the Href attribute for the A element is actually THE most popular of any attribute in all the URLs that MAMA analyzed), to the esoteric and the forgotten (Methods and Urn).

Fig 5-1: A element/attribute frequency
ELEMENT/Attribute    Frequency    ELEMENT/Attribute    Frequency
A                    3,307,397        Shape                4,058
    Href             3,304,834        Hreflang             3,065
    Target           1,978,018        Rev                    761
    Title              658,820        Disabled               556
    Name               485,168        Coords                 201
    Rel                 96,613        Charset                137
    Accesskey           54,876        Methods                 62
    Tabindex            14,898        Urn                     58
    Type                12,251

A Rel attribute

The Rel attribute for the A element expresses the relationship that the destination URL has to the current URL. Until relatively recently, this attribute was under-used. However, its use has grown in the last few years as microformats have been embraced. The most popular value for this attribute is "nofollow", which outnumbers the next-nearest values, "bookmark" and "tag", by more than 2 to 1.

Fig 5-2: Top A Rel values
(See also the full frequency table.)
A Rel Attribute Value    Frequency
nofollow                    46,179
bookmark                    20,524
tag                         20,445
category                    13,012
external                     7,473
license                      6,330
alternate                    5,252
lightbox                     2,917
me                           1,929
self                         1,630
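HTML defines Rel as a space-separated list of link types, so a single attribute such as "external nofollow" carries two values. The source does not say whether MAMA split such lists or counted whole attribute strings; the sketch below assumes splitting and case-insensitive matching, and the function name `tally_rel_values` is hypothetical.

```python
from collections import Counter


def tally_rel_values(rel_attributes):
    """Count individual link types across an iterable of raw Rel strings."""
    counts = Counter()
    for raw in rel_attributes:
        # Assumption: link types are case-insensitive and whitespace-separated.
        for token in raw.lower().split():
            counts[token] += 1
    return counts
```

With this approach, `tally_rel_values(["nofollow", "external nofollow"])` counts "nofollow" twice and "external" once.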

A Name attribute

The Name attribute, also known as an anchor, identifies a location in a document. Hyperlinks can link to a specific part of a document using an anchor. The values that this attribute may take can be semantically significant. The top values for Name reflect this, representing common locations or controls such as "top", "bottom" and "menu" found in many documents. The "top" value is by far the most popular, occurring at least an order of magnitude more often than any other value.

Note: The values for this attribute highlight an interesting trend: there are many popular values of the form "dk[number]". In all, 21 of the top 26 spots in the frequency table follow this pattern, with very similar quantities (roughly 6,000 to 9,000 occurrences each); lower numbers like "dk3" have the highest frequencies and higher numbers like "dk18" have the lowest. Tests of several representative URLs using these A Name values (such as http://www.plasmatvrentals.com/ and http://www.weddingfavour.net/) suggest that they ALL come from a single domain-parking site (domainsponsor.com). This is unfortunate and skews MAMA's overall results. Next time around, this fact will be used to filter out this particular domain parker.
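One way such a filter might work is to flag any page whose anchors include several "dk[number]" names. The following is a hedged sketch only: the source does not describe how the filter will be built, and both the function name `looks_like_parked_page` and the `threshold` of three matches are assumptions made for illustration.

```python
import re

# Matches values such as "dk3" or "dk18" in their entirety.
DK_NAME = re.compile(r"dk\d+", re.IGNORECASE)


def looks_like_parked_page(name_values, threshold=3):
    """Return True if enough A Name values follow the dk[number] pattern."""
    hits = sum(1 for value in name_values if DK_NAME.fullmatch(value.strip()))
    # Assumption: a few matching anchors on one page are enough to flag it.
    return hits >= threshold
```

A page flagged this way could be excluded before its Name values are added to the overall frequency table.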

Fig 5-3: Popular A Name values
(See also the full frequency table.)
A Name value    Frequency    A Name value               Frequency
top               163,733    links                          5,057
content            15,258    oben (German for "top")        4,781
bottom              9,142    news                           4,318
up                  7,147    menu                           4,137
contact             5,841    topofpage                      3,998
pagetop             5,559    navigation                     3,920

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
