MAMA: Images - elements and formats
- Previous article—MAMA: Phrase, block, list, and other elements
- Next article—MAMA: Forms
- Table of contents
Index:
- Introduction
- IMG element
- INPUT Type="image"/Src
- Background attribute of BODY, TABLE, TD, and TH elements
- MAP and AREA
- How many images were encountered?
- Image formats
- Image formats in combination
- Other image formats
Introduction
Inline images have been an integral part of the Web since they were first introduced in Mosaic 0.1 beta just over 15 years ago. Indeed, the graphical nature of the Web is one of its biggest selling points. It was a natural part of MAMA's evolution to study image usage: 3,219,487 URLs in the set analyzed used the IMG element, which is 91.74% of the total! Extra scrutiny was given to the most popular image formats in use today: GIF, JPEG, and PNG. Because of MAMA's limitations and the unknown nature of what it might encounter, the image analysis was somewhat conservative. MAMA was designed to download objects serially instead of in parallel. This strategy worked fine when a page had only a handful of components to fetch, but downloading every inline image this way would have seriously increased MAMA's per-URL analysis time. As a result, MAMA only kept track of various image quantities and stayed away from image dimensions, file sizes, and other format metadata. Future versions of MAMA will hopefully detect these pieces of data, along with other ways of specifying images.
Elements and attributes used to display or control image behavior
MAMA detected image usage in the expected way, using the Src attribute of the IMG element. It also allowed for some of the other popular ways to specify images, but not all of them. For instance, it detected the Background attribute used by the BODY, TABLE, TD, and TH elements, as well as the Src attribute of the INPUT Type="image" construct. However, MAMA did not try to detect images specified using the OBJECT element this time, and it has not (yet) looked for images defined using CSS properties such as "background-image". In future crawls, attempts will be made to broaden image usage detection.
IMG Element
The IMG/Src method of specifying an image was obviously going to have very high usage, and with 91.74% of URLs using it, that is indeed the case. Another constructive way to consider the image use rate is as a percentage of the URLs where an image was specified in any MAMA-detectable manner, and here the ratio skyrockets. Images of any kind were found in 3,233,208 of MAMA's URLs, so the percentage for this alternate view is 99.57%! It can be said that when images are specified in a document, at least one of them will almost always be via the IMG/Src method.
There are some attributes that pair naturally together in the IMG element, so we should examine how often they occur together. The Height and Width attributes are used together in 2,937,843 cases, indicating a clear authoring preference for explicit specification of both aspects of an image's dimensions. The Hspace and Vspace attributes are used together 354,011 times, with horizontal padding around an image (Hspace) enjoying a significant lead when used solo. An expected pairing between Usemap and Ismap, however, did not materialize in the figures very often (only 18,825 times).
ELEMENT/Attribute | Frequency | ELEMENT/Attribute | Frequency | ELEMENT/Attribute | Frequency |
---|---|---|---|---|---|
IMG | 3,219,487 | Align | 1,134,698 | Ismap | 32,131 |
Src | 3,219,304 | Name | 875,461 | Longdesc | 25,413 |
Width | 2,957,808 | Hspace | 526,348 | Lowsrc | 24,944 |
Height | 2,945,989 | Usemap | 447,774 | Loop | 4,016 |
Border | 2,810,265 | Vspace | 445,580 | Start | 2,100 |
Alt | 2,520,939 | Title | 367,132 | | |
INPUT Type="image"/Src
The overloaded INPUT element allows an image to be specified as a graphical submit button using the Src attribute. This form of image embedding has a higher representation than I anticipated, with about one-third of the URLs that contain an INPUT element (337,286 of 1,008,545) using graphical submit buttons instead of INPUT Type="submit".
ELEMENT/Attribute | Frequency |
---|---|
INPUT | 1,008,545 |
Src | 337,286 |
Background attribute of BODY, TABLE, TD, and TH elements
The role that these early presentational attributes serve is now more effectively filled using CSS, but a surprising number of URLs still use Background. The most popular usage of this type is with the TD element, where almost 25% of URLs using a TD element also have the Background attribute.
ELEMENT[Attribute] | Element Frequency | Attribute Frequency | Percentage |
---|---|---|---|
BODY[Background] | 3,452,907 | 634,617 | 18.38% |
TABLE[Background] | 2,894,184 | 281,209 | 9.72% |
TD[Background] | 2,891,972 | 714,706 | 24.71% |
TH[Background] | 148,344 | 5,354 | 3.61% |
MAP and AREA
These elements, which define Client-Side Image Maps (CSIM), naturally pair together, as neither element serves any effective purpose without the other. The numbers certainly bear this out, with 452,944 URLs having both MAP and AREA elements (~99% of URLs using either). One has to wonder why the few remaining cases that do not use the elements together even exist. Are they just the flotsam of dead markup?
ELEMENT/Attribute | Frequency |
---|---|
MAP | 457,902 |
Name | 456,648 |
Id | 58,141 |
ELEMENT/Attribute | Frequency |
---|---|
AREA | 453,187 |
Coords | 452,272 |
Href | 450,478 |
Shape | 439,720 |
Alt | 203,624 |
Nohref | 13,570 |
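The "~99% of URLs using either" figure can be rechecked with a little inclusion-exclusion arithmetic. The sketch below assumes that the MAP and AREA frequencies in the tables above count URLs (not element instances); the variable names are illustrative only.

```python
# Figures taken from the text and the MAP/AREA tables above.
map_urls = 457_902    # URLs containing a MAP element
area_urls = 453_187   # URLs containing an AREA element
both_urls = 452_944   # URLs containing both elements

# Inclusion-exclusion: |MAP or AREA| = |MAP| + |AREA| - |MAP and AREA|
either_urls = map_urls + area_urls - both_urls

map_only = map_urls - both_urls    # MAP present, AREA missing
area_only = area_urls - both_urls  # AREA present, MAP missing
paired_share = both_urls / either_urls

print(f"either element:   {either_urls:,}")
print(f"MAP without AREA: {map_only:,}")
print(f"AREA without MAP: {area_only:,}")
print(f"paired share:     {paired_share:.2%}")
```

Under that assumption, 458,145 URLs use at least one of the two elements, and the unpaired cases are mostly a MAP with no AREA (4,958 URLs) rather than an AREA with no MAP (243 URLs). The paired share comes out a little under 99%.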
Values of the AREA/Shape attribute
The frequency table for this attribute's values is short and sweet—authors stick to the known keywords. There is a clear preference for the geometric shape of choice, though: "rect" is favored 10 to 1 over the next-nearest value "poly".
How many images were encountered?
MAMA kept track of how many images were detected in each document, including duplicates. It tallied the total number of image references encountered, the number of unique images encountered, and the highest number of times any single image was referenced. MAMA found 3,233,208 of 3,509,180 URLs (92.14%) using images via at least one of the previously mentioned methods.
Caution: Some of the URLs mentioned may cause loading problems in a browser.
Criteria | Maximum Quantity | Average In Sample |
---|---|---|
Total number of images | 65,535 | 22.63 |
Total number of unique images | 1,610 | 12.27 |
Maximum number of duplicate images | 65,535 | 15.22 |
Maximum number of background images | 242 | 2.57 |
Note: MAMA used the MySQL SMALLINT data type (unsigned) to store image usage information, which has a maximum value of 65,535. Values higher than this are capped to the maximum value. This seemed like a safe upper boundary, but in the end there was the occasional surprise that exceeded even that lofty number. If a URL is shown having an image tally of 65,535, it is a good bet that the real quantity was considerably higher.
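The capping behavior described in the note can be sketched as a saturating counter. This is only an illustration: the text does not say whether the capping happened in MAMA's script or in the database, and `clamp_tally` and `is_saturated` are hypothetical helper names.

```python
SMALLINT_UNSIGNED_MAX = 65_535  # upper bound of a 16-bit unsigned column

def clamp_tally(count: int) -> int:
    """Clamp an image tally into the range a 16-bit unsigned column can hold."""
    return max(0, min(count, SMALLINT_UNSIGNED_MAX))

def is_saturated(stored: int) -> bool:
    """A stored value at the ceiling may stand for any larger true count."""
    return stored >= SMALLINT_UNSIGNED_MAX

# A tally below the ceiling is stored unchanged; a larger one is flattened.
print(clamp_tally(25_909), clamp_tally(70_000))
```

A reader of the stored data can use a check like `is_saturated` to treat 65,535 as "at least this many" instead of as an exact count.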
Total images
This total tracks any reference to an image, including duplicates. For example, with the venerable and maligned "spacer gif" authoring trick (commonly used to try to achieve pixel-perfect table layouts), each usage of the image would count toward the overall image total. Two URLs in MAMA hit the maximum image quantity limit.
URL | Total images |
---|---|
http://www.ratingspot.com/ (URL no longer active) | 65,535 |
http://www.goldcup2002.com/ | 65,535 |
http://www.houseofnutrition.com/ | 25,909 |
http://www.1000irani.com/ | 12,527 |
Total number of unique images
This number tracks only the unique references to images. For a given URL, comparing this number to the "Total number of images" value might provide some insight about usage of repeated graphical elements such as spacers, bullets, horizontal rules, and so forth.
URL | Total unique images |
---|---|
http://www.ccom-inet.de/ | 1,610 |
http://www.dolomitenhotels.net/ | 1,247 |
http://www.peterkamin.de/Goslar/goslar.htm (URL no longer active) | 1,105 |
http://www.lenuagedesfilles.com/ | 1,070 |
Maximum number of duplicate images
Every time an image reference was used more than once, MAMA kept track of the running totals. The value stored by MAMA, "Maximum duplicates", represents the highest number of times a unique inline image URL was duplicated in a document. In all, 1,592,488 URLs had at least one image reference used more than once. The frequency table for this value shows no big jumps, but there are some small reversals in the ordering that may warrant scrutiny. There does not seem to be any obvious reason for them.
URL | Maximum duplicate images |
---|---|
http://www.ratingspot.com/ (URL no longer active) | 65,535 |
http://www.goldcup2002.com/ | 65,535 |
http://www.houseofnutrition.com/ | 25,771 |
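The three per-document tallies discussed so far (total, unique, and maximum duplicates) can be sketched in a few lines. This is not MAMA's actual code. In particular, it assumes that "Maximum duplicates" means the occurrence count of the most-repeated image URL, which the text does not spell out. `image_tallies` is a hypothetical name.

```python
from collections import Counter

def image_tallies(image_urls):
    """Return (total, unique, max_duplicates) for one document's image references."""
    counts = Counter(image_urls)
    total = sum(counts.values())   # every reference, duplicates included
    unique = len(counts)           # distinct image URLs
    # Only references that appear more than once count as duplicated.
    repeated = [n for n in counts.values() if n > 1]
    max_duplicates = max(repeated, default=0)
    return total, unique, max_duplicates

# A page with a five-times-repeated spacer image and two other images.
refs = ["spacer.gif"] * 5 + ["logo.png", "photo.jpg"]
print(image_tallies(refs))  # (7, 3, 5)
```

Comparing the first two values for a URL gives the rough measure of repeated graphical elements that the "unique images" section describes.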
Maximum number of background images
Any image reference using the Background attribute (from the BODY, TABLE, TD, and TH elements) was counted as a background image. MAMA had 1,288,880 URLs with at least one such background image.
URL | Total background images |
---|---|
http://www.gasperitsch.com/ (URL no longer active) | 242 |
http://www.youth.cn/ | 167 |
http://www.333tourthai.com/ | 154 |
http://www.imagegood.co.kr/ | 124 |
Image formats
Authors use images in many ways, and there is definitely room on the Web for all of the popular formats. In addition to keeping track of image totals, MAMA tried to discover which formats were in common use. Specifically, we wanted to see how often GIFs, JPEGs, and PNGs occurred. We will first take a look at how each image type was detected (Fig 7-1), follow it up with general usage statistics for those types (Fig 7-2), and then list some examples of the extreme usage cases detected.
Image format detection
MAMA defaulted to using an image's file extension to judge the format type. If MAMA could determine an image format from this data alone, it did not dig any deeper. If it could not determine the format from the file extension, MAMA would then issue an HTTP HEAD request for the referenced image and examine the MIME type in the response to detect the format. This policy was a useful shortcut that really helped with the analysis script's overall performance.
Image format | Substring detected in file extension | Substring detected in MIME type |
---|---|---|
GIF | ".gif" | "gif" |
JPEG | ".jpg" or ".jpeg" | "jpeg" |
PNG | ".png" | "png" |
Image format usage totals
JPEG has no real competition in depicting photographs or realistic scenes, but the PNG format and the dominant GIF format are at odds for the same use cases. Due to a number of historical issues, uptake of the PNG format has been slower than many expected. Authors seem to have no problem with both formats coexisting on their Web sites. GIF and PNG, can't we all just get along?
Note: The frequency tables for each image type are rather linear, typically in order all the way out past the 30th position in the list.
Image Format | Total occurrences | Percentage | Maximum quantity encountered | Average in sample | Standard deviation in sample |
---|---|---|---|---|---|
GIF | 2,854,113 | 81.33% | 1,610 | 9.04 | 10.45 |
JPEG | 2,451,507 | 69.86% | 1,201 | 6.11 | 9.54 |
PNG | 374,408 | 10.67% | 539 | 3.21 | 5.31 |
Maximum image format uses
Notice that the maximum image quantity instance for each format is usually a rather extreme value compared to any of its closest neighbors and is not typical.
URL | Total GIFs |
---|---|
http://www.ccom-inet.de/ | 1,610 |
http://www.r-type.org/muse/aaa0000.htm | 939 |
http://www.pcpages.com/homemom/ogpjoint.html | 869 |
http://dibujando-en-el-viento.nireblog.com/ (URL no longer active) | 821 |
URL | Total JPEGs |
---|---|
http://www.dolomitenhotels.net/ | 1,201 |
http://car-hifi-produkte.de/ | 833 |
http://www.lacancha.com/greatest.html | 816 |
http://www.worldisround.com/articles/16107/index.htm (URL no longer active) | 805 |
URL | Total PNGs |
---|---|
http://www.aaronmichaels.com/ | 539 |
http://www-laog.obs.ujf-grenoble.fr/~desert/cosmologie/cours/coursv2/coursv2.html (URL no longer active) | 537 |
http://www.sigmasigmarho.com/usf/ (URL no longer active) | 251 |
http://www2002.org/CDROM/refereed/127/ | 233 |
Image formats in combination: Venn diagram
The following diagram shows the overlap in usage of the three dominant image formats. The relationship between GIF and PNG is usually characterized as an adversarial one, so it was expected that these numbers would demonstrate authors showing a clear preference for one or the other in their pages. However, that definitely is not the case. PNGs were detected in 374,408 URLs, and of those, 311,827 URLs (83.29%) also used the GIF format. If that is what constitutes a format war, the battle is a subtle one.
Note: Region sizes are not to scale
Other image formats
Any image reference not falling into the GIF, JPEG, or PNG classifications was put into an "other" category. In all, 372,895 MAMA URLs contained images in this group—over 10% of all pages analyzed! This seems like a much higher number than one would expect for image formats "on the fringe". Now, we can look at the qualifications for this fallback category to see what can be revealed about the process.
Processing a unique image reference in MAMA:
- Look for a file extension of .gif, .jpg/.jpeg, or .png.
- If none of those file extensions is found, send a HEAD request for the image URL and remember the MIME type; otherwise, the MIME type is left blank.
- If the extension or the MIME type contains an indication for GIF, JPEG, or PNG, increment the appropriate counters.
- Otherwise, increment the "other" counter.
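The steps above can be sketched as follows. This is a minimal illustration, not MAMA's actual script. The function names are hypothetical, matching the extension at the end of the URL path is an assumption, and the HEAD request is represented by a caller-supplied `fetch_mime_type` function so that the sketch runs without network access.

```python
from collections import Counter
from urllib.parse import urlparse

# Substrings from the detection table (Fig 7-1).
EXTENSIONS = {".gif": "GIF", ".jpg": "JPEG", ".jpeg": "JPEG", ".png": "PNG"}
MIME_HINTS = {"gif": "GIF", "jpeg": "JPEG", "png": "PNG"}

def format_from_extension(url):
    """Step 1: look for a known extension at the end of the URL path."""
    path = urlparse(url).path.lower()
    for ext, fmt in EXTENSIONS.items():
        if path.endswith(ext):
            return fmt
    return None

def format_from_mime(mime):
    """Step 3 (MIME half): look for a known substring in the MIME type."""
    mime = (mime or "").lower()
    for hint, fmt in MIME_HINTS.items():
        if hint in mime:
            return fmt
    return None

def count_formats(unique_image_urls, fetch_mime_type):
    """Tally formats; the MIME type is fetched only when no extension matches."""
    counters = Counter()
    for url in unique_image_urls:
        fmt = format_from_extension(url)
        if fmt is None:
            fmt = format_from_mime(fetch_mime_type(url))  # step 2
        counters[fmt or "other"] += 1                      # steps 3 and 4
    return counters

# Stand-in for HEAD responses; a blank string models a failed lookup.
fake_headers = {"http://example.com/image.php?id=3": "image/jpeg"}
urls = [
    "http://example.com/a.GIF",
    "http://example.com/image.php?id=3",
    "http://example.com/favicon.ico",
]
print(count_formats(urls, lambda u: fake_headers.get(u, "")))
```

In this sketch the script-served image is classified as JPEG only because the stand-in lookup succeeds. With a blank MIME type it would be counted as "other", just like the `.ico` file.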
Image format detection in MAMA was added rather late in the development process, and some of the strategies used can be improved on for next time. MAMA downloaded document dependencies serially instead of in parallel, so analyzing each and every image reference would have been very expensive time-wise. In the steps above you may notice that the MIME type is only fetched if a known extension is not detected. It was expected that the majority of images in a document would fall into one of the 3 image format categories featured, so excessive network activity to download the HTTP Headers of images would be greatly reduced. The above strategy works well for detecting GIF, JPEG, and PNG, but things could be improved with respect to the "other" category. In addition to image references that were in other image formats, there were additional false positives:
- A large number of images are served by scripts or CGI programs that take URL arguments and have no file extension. These always fail the file extension check and fall through to the MIME type check.
- If the MIME type check failed for any reason, be it network timeouts or other transient conditions, an empty value was returned. Images served using the previously mentioned method that should have normally fallen into our 3 main categories would then end up in "other".
- If a MIME type check was attempted for a broken image link, it could return a 404 Error (or worse, an HTML 404-redirect), and this could throw off the detection method. A next-iteration strategy would be to ignore these cases altogether.
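One possible reading of "ignore these cases altogether" is to classify a HEAD response only when it is a successful image response, and to skip it otherwise so that it never lands in "other". The sketch below is an assumption about how a next iteration might work, not something MAMA did. `classify_head_response` and its parameters are hypothetical.

```python
def classify_head_response(status, content_type):
    """Return a format label, or None when the response says nothing reliable.

    status is the HTTP status code (None models a timeout or network error);
    content_type is the raw Content-Type header value, if any.
    """
    if status != 200:
        return None  # timeout, 404, or other failure: skip rather than count
    mime = (content_type or "").split(";")[0].strip().lower()
    if not mime.startswith("image/"):
        return None  # e.g. an HTML error page served after a redirect
    for hint, label in (("gif", "GIF"), ("jpeg", "JPEG"), ("png", "PNG")):
        if hint in mime:
            return label
    return "other"   # a real image in some other format

print(classify_head_response(200, "image/x-icon"))
print(classify_head_response(404, "text/html"))
```

Keeping the skipped cases out of the tally would leave the "other" category closer to what it was meant to measure: images that really are in a format outside GIF, JPEG, and PNG.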
Given those caveats, MAMA did detect some image formats outside our 3 big buckets. Generally, the most popular formats were bitmaps and icon files (often using ".bmp" and ".ico" extensions), but the URLs with the highest concentrations of these image types were all .ico cases (for example, http://www.lenuagedesfilles.com/ with 883 and http://www.blogalego.com/ with 401).
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.