MAMA: Images - elements and formats

By Brian Wilson

Index:

  1. Introduction
  2. IMG element
  3. INPUT Type="image"/Src
  4. Background attribute of BODY, TABLE, TD, and TH elements
  5. MAP and AREA
  6. How many images were encountered?
  7. Image formats
  8. Image formats in combination
  9. Other image formats

Introduction

Inline images have been an integral part of the Web since they were first introduced in Mosaic 0.1 beta just over 15 years ago. Indeed, the graphical nature of the Web is one of its biggest selling points. It was a natural part of MAMA's evolution to study image usage: 3,219,487 URLs in the set analyzed (91.74%) used the IMG element. Extra scrutiny was given to the most popular image formats in use today: GIF, JPEG, and PNG. Because of MAMA's limitations and the unknown nature of what it might encounter, the image analysis was somewhat conservative. MAMA was designed to download objects serially instead of in parallel. This strategy worked fine for downloading a handful of page components at a time, but doing the same for every inline image would have seriously increased MAMA's per-URL analysis time. As a result, it only kept track of various image quantities and stayed away from image dimensions, file sizes, and other format metadata. Future versions of MAMA will hopefully detect these pieces of data and other image specification vectors.

Elements and attributes used to display or control image behavior

MAMA detected image usage in the expected way—using the Src attribute of the IMG element. It also allowed for some of the other popular ways to specify images, but not all of them. For instance, it did detect the Background markup attribute used by the BODY|TABLE|TD|TH elements, as well as the Src attribute of the INPUT Type="image" construct. However, MAMA did not try to detect images specified using the OBJECT element this time, and it has not (yet) looked for images defined using CSS properties such as "background-image". In future crawls, attempts will be made to broaden image usage detection.
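As a rough illustration of the detection rules just described, here is a minimal Python sketch that collects image references from the same three constructs. It is not MAMA's actual code; the class name `ImageReferenceCollector` and the split into "inline" and "background" lists are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements whose Background attribute is treated as an image reference.
BACKGROUND_ELEMENTS = {"body", "table", "td", "th"}


class ImageReferenceCollector(HTMLParser):
    """Collects image URLs from IMG/Src, INPUT Type="image"/Src, and Background."""

    def __init__(self):
        super().__init__()
        self.inline = []       # IMG/Src and INPUT Type="image"/Src references
        self.background = []   # Background attribute references

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        attributes = dict(attrs)
        src = attributes.get("src")
        if tag == "img" and src:
            self.inline.append(src)
        elif tag == "input" and src:
            if (attributes.get("type") or "").lower() == "image":
                self.inline.append(src)
        if tag in BACKGROUND_ELEMENTS and attributes.get("background"):
            self.background.append(attributes["background"])


collector = ImageReferenceCollector()
collector.feed('<body background="bg.gif"><img src="logo.gif">'
               '<input type="image" src="go.png"></body>')
print(collector.inline, collector.background)
```

Like MAMA in this crawl, the sketch ignores OBJECT elements and CSS properties such as "background-image".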

IMG Element

The IMG/Src method of specifying an image was always going to have very high usage, and with 91.74% of URLs using it, that is indeed the case. Viewed another way, as a percentage of only those URLs where an image was specified in any MAMA-detectable manner, the ratio climbs even higher. Images of any kind occurred in 3,233,208 of MAMA's URLs, and 3,219,487 of those used IMG, so the percentage for this alternate view is 99.58%. When images are specified in a document, at least one of them is almost always specified via the IMG/Src method.

There are some attributes that pair naturally together in the IMG element, so we should examine how often they occur together. The Height and Width attributes are used together in 2,937,843 cases, indicating a clear authoring preference for explicit specification of both aspects of an image's dimensions. The Hspace and Vspace attributes are used together 354,011 times, with horizontal padding around an image (Hspace) enjoying a significant lead when used solo. An expected pairing between Usemap and Ismap, however, did not materialize very often in the figures (only 18,825 times).

Fig 2-1: IMG element/attribute frequency
ELEMENT/Attribute  Frequency      ELEMENT/Attribute  Frequency      ELEMENT/Attribute  Frequency
IMG                3,219,487          Align          1,134,698          Ismap          32,131
    Src            3,219,304          Name           875,461            Longdesc       25,413
    Width          2,957,808          Hspace         526,348            Lowsrc         24,944
    Height         2,945,989          Usemap         447,774            Loop           4,016
    Border         2,810,265          Vspace         445,580            Start          2,100
    Alt            2,520,939          Title          367,132

INPUT Type="image"/Src

The overloaded INPUT element allows an image to be specified as a graphical submit button using the Src attribute. This form of image embedding has a higher representation than I anticipated: about one-third of the URLs that use INPUT (337,286 of 1,008,545) specify graphical submit buttons in this way.

Fig 3-1: INPUT TYPE="image"/SRC attribute frequency
ELEMENT/Attribute  Frequency
INPUT              1,008,545
    Src            337,286

Background attribute of BODY, TABLE, TD and TH elements

The role that these early presentational attributes serve is now more effectively filled using CSS, but a surprising number of URLs still use Background. The most popular usage of this type is with the TD element, where almost 25% of URLs using a TD element also have the Background attribute.

Fig 4-1: Background attribute frequency
ELEMENT[Attribute]   Element frequency   Attribute frequency   Percentage
BODY[Background]     3,452,907           634,617               18.38%
TABLE[Background]    2,894,184           281,209               9.72%
TD[Background]       2,891,972           714,706               24.71%
TH[Background]       148,344             5,354                 3.61%

MAP and AREA

These elements defining Client-Side Image Maps (CSIM) naturally pair together, as neither element serves any effective purpose without the other. The numbers certainly bear this out, with 452,944 URLs having BOTH MAP and AREA elements (~99% of URLs using either). One has to wonder why the few remaining cases that do not use the elements together even exist—are they just the flotsam of dead markup?
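The "~99%" figure can be checked with a little inclusion-exclusion arithmetic. The sketch below assumes that the MAP and AREA element totals reported in Fig 5-1 and Fig 5-2 are per-URL counts, like the other frequencies in this article.

```python
# Per-URL counts taken from the text and from Fig 5-1 / Fig 5-2.
map_urls = 457_902    # URLs containing a MAP element
area_urls = 453_187   # URLs containing an AREA element
both = 452_944        # URLs containing both

# Inclusion-exclusion: URLs containing at least one of the two elements.
either = map_urls + area_urls - both
share = both / either

print(f"{either:,} URLs use MAP or AREA; {share:.2%} of them use both")
```

Under that assumption, the URLs using only one of the two elements number a little over five thousand.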

Fig 5-1: MAP element/attribute frequency
ELEMENT/Attribute  Frequency
MAP                457,902
    Name           456,648
    Id             58,141

Fig 5-2: AREA element/attribute frequency
ELEMENT/Attribute  Frequency
AREA               453,187
    Coords         452,272
    Href           450,478
    Shape          439,720
    Alt            203,624
    Nohref         13,570

Values of the AREA/Shape attribute

The frequency table for this attribute's values is short and sweet—authors stick to the known keywords. There is a clear preference for the geometric shape of choice, though: "rect" is favored 10 to 1 over the next-nearest value "poly".

How many images were encountered?

MAMA kept track of how many images were detected in each document, including duplicates. It tallied the total image references encountered, the number of unique images encountered, and the maximum number of times any single image was referenced. MAMA found 3,233,208 of 3,509,180 URLs (92.14%) using images via at least one of the previously mentioned methods.
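The three tallies can be sketched in a few lines of Python. This is only an illustration, not MAMA's implementation: the function name `tally_images` is hypothetical, and it assumes that "maximum duplicates" means the highest reference count among images used more than once, with zero when no image repeats.

```python
from collections import Counter


def tally_images(references):
    """Summarize a document's image references (a list of URL strings)."""
    counts = Counter(references)
    # Assumption: only images referenced more than once count as duplicates.
    repeated = [n for n in counts.values() if n > 1]
    return {
        "total": sum(counts.values()),   # every reference, duplicates included
        "unique": len(counts),           # distinct image URLs
        "max_duplicates": max(repeated, default=0),
    }


page = ["spacer.gif", "logo.png", "spacer.gif", "spacer.gif", "photo.jpg"]
print(tally_images(page))
```

A page built around a "spacer gif" would show a large gap between the total and unique values, with the spacer driving the duplicate maximum.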

Caution: Some of the URLs mentioned may cause loading problems in a browser.

Fig 6-1: Maximum image usages
Criteria                            Maximum quantity   Average in sample
Total number images                 65,535             22.63
Total number unique images          1,610              12.27
Maximum number duplicate images     65,535             15.22
Maximum number background images    242                2.57

Note:
MAMA used the MySQL SMALLINT data type (unsigned) to store image usage information, which has a maximum value of 65,535. Values higher than this are capped to the maximum value. This seemed like a safe upper boundary, but in the end there was the occasional surprise that exceeded even that lofty number. If a URL is shown having an image tally of 65,535, it is a good bet that the real quantity was considerably higher.
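The effect of the cap can be illustrated with a tiny Python sketch. The helper name `clamp_for_storage` is hypothetical, and whether the capping happened in the database or in MAMA's own script is not stated here; the sketch only shows the resulting behavior.

```python
SMALLINT_UNSIGNED_MAX = 65_535  # largest value an unsigned 16-bit column can hold


def clamp_for_storage(count):
    """Return the value that would end up stored for a given image tally."""
    return min(count, SMALLINT_UNSIGNED_MAX)


print(clamp_for_storage(25_909))    # fits, stored unchanged
print(clamp_for_storage(100_000))   # too large, stored as the maximum
```

Once a value has been clamped, the original tally cannot be recovered from the stored number.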

Total images

This total tracks any reference to an image, including duplicates. For example, with the venerable and maligned "spacer gif" authoring trick (commonly used to try to achieve pixel-perfect table layouts), each usage of the image would count toward the overall image total. Two URLs in MAMA hit the maximum image quantity limit.

Fig 6-2: URLs having highest total images
URL                                                  Total images
http://www.ratingspot.com/ (URL no longer active)    65,535
http://www.goldcup2002.com/                          65,535
http://www.houseofnutrition.com/                     25,909
http://www.1000irani.com/                            12,527

Total number of unique images

This number tracks only the unique references to images. For a given URL, comparing this number to the "Total number of images" value might provide some insight about usage of repeated graphical elements such as spacers, bullets, horizontal rules, and so forth.

Fig 6-3: URLs having highest totals of unique image references
URL                                                                  Total unique images
http://www.ccom-inet.de/                                             1,610
http://www.dolomitenhotels.net/                                      1,247
http://www.peterkamin.de/Goslar/goslar.htm (URL no longer active)    1,105
http://www.lenuagedesfilles.com/                                     1,070

Maximum number of duplicate images

Every time an image reference was used more than once, MAMA kept track of the running totals. The value stored by MAMA, "Maximum duplicates", represents the highest number of times a unique inline image URL was duplicated in a document. In all, 1,592,488 URLs had at least one image reference used more than once. The frequency table for this value does not show any big leaps or jumps in it, but there are some small reversals that may warrant some scrutiny. There does not seem to be any obvious reason for the slight order alterations.

Fig 6-4: URLs having highest totals of image duplicates
URL                                                  Total duplicate images
http://www.ratingspot.com/ (URL no longer active)    65,535
http://www.goldcup2002.com/                          65,535
http://www.houseofnutrition.com/                     25,771

Maximum number of background images

Any image reference using the Background attribute (from the BODY, TABLE, TD and TH elements) was counted as a background image. MAMA had 1,288,880 URLs with at least one such background image.

Fig 6-5: URLs having highest totals of background images
URL                                                   Total background images
http://www.gasperitsch.com/ (URL no longer active)    242
http://www.youth.cn/                                  167
http://www.333tourthai.com/                           154
http://www.imagegood.co.kr/                           124

Image formats

Authors use images in many ways, and there is definitely room on the Web for all of the popular formats. In addition to keeping track of image totals, MAMA tried to discover which formats were in common use. Specifically, we wanted to see how often GIFs, JPEGs, and PNGs occurred. We will first take a look at how each image type was detected (Fig 7-1), follow it up with general usage statistics for those types (Fig 7-2), and then list some examples of the extreme usage cases detected.

Image format detection

MAMA defaulted to using an image's file extension to judge the format type. If MAMA could determine an image format from this data alone, it did not dig any deeper. If it could not determine the format from the file extension, MAMA would then issue an HTTP HEAD request for the referenced image and examine the returned MIME type to detect the format. This policy was a useful shortcut that really helped with the analysis script's overall performance.

Fig 7-1: Methods used to detect image formats in MAMA
Image format   Substring detected in file extension   Substring detected in MIME type
GIF            ".gif"                                 "gif"
JPEG           ".jpg" or ".jpeg"                      "jpeg"
PNG            ".png"                                 "png"

Image format usage totals

JPEG has no real competition in depicting photographs or realistic scenes, but the PNG format and the dominant GIF format are at odds for the same use cases. Due to a number of historical issues, uptake of the PNG format has been slower than many expected. Authors seem to have no problem with both formats coexisting on their Web sites. GIF and PNG, can't we all just get along?

Note: The frequency tables for each image type are rather linear, typically in order all the way out past the 30th position in the list.

Fig 7-2: Image format statistics
Image format   Total occurrences   Percentage   Maximum quantity encountered   Average in sample   Standard deviation in sample
GIF            2,854,113           81.33%       1,610                          9.04                10.45
JPEG           2,451,507           69.86%       1,201                          6.11                9.54
PNG            374,408             10.67%       539                            3.21                5.31

Maximum image format uses

Notice that the maximum image quantity instance for each format is usually a rather extreme value compared to any of its closest neighbors and is not typical.

Fig 7-3: URLs having highest number of detected GIFs
URL                                                                   Total GIFs
http://www.ccom-inet.de/                                              1,610
http://www.r-type.org/muse/aaa0000.htm                                939
http://www.pcpages.com/homemom/ogpjoint.html                          869
http://dibujando-en-el-viento.nireblog.com/ (URL no longer active)    821

Fig 7-4: URLs having highest number of detected JPEGs
URL                                                                            Total JPEGs
http://www.dolomitenhotels.net/                                                1,201
http://car-hifi-produkte.de/                                                   833
http://www.lacancha.com/greatest.html                                          816
http://www.worldisround.com/articles/16107/index.htm (URL no longer active)    805

Fig 7-5: URLs having highest number of detected PNGs
URL                                                                                                         Total PNGs
http://www.aaronmichaels.com/                                                                               539
http://www-laog.obs.ujf-grenoble.fr/~desert/cosmologie/cours/coursv2/coursv2.html (URL no longer active)    537
http://www.sigmasigmarho.com/usf/ (URL no longer active)                                                    251
http://www2002.org/CDROM/refereed/127/                                                                      233

Image formats in combination: Venn diagram

The following diagram shows the overlap in usage of the three dominant image formats. The relationship between GIF and PNG is usually characterized as an adversarial one, so it was expected that these numbers would show authors clearly preferring one or the other in their pages. However, that definitely is not the case. PNGs were detected in 374,408 URLs, and of those, 311,827 URLs (83.29%) also used the GIF format. If that is what constitutes a format war, the battle is a subtle one.

[Figure: Venn diagram for image format usage types. Note: region sizes are not to scale.]

Other image formats

Any image reference not falling into the GIF, JPEG, or PNG classifications was put into an "other" category. In all, 372,895 MAMA URLs contained images in this group—over 10% of all pages analyzed! This seems like a much higher number than one would expect for image formats "on the fringe". Now, we can look at the qualifications for this fallback category to see what can be revealed about the process.

Processing a unique image reference in MAMA:

  1. Look for a file extension of .gif, .jpg/.jpeg, or .png.
  2. If a known file extension is not found, issue an HTTP HEAD request for the image URL and remember the MIME type; otherwise, leave the MIME type blank.
  3. If the extension or the MIME type contains an indication for GIF, JPEG, or PNG, increment the appropriate counters.
  4. Otherwise, increment the "other" counter.
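The steps above can be sketched as a short Python function. This is a hedged illustration rather than MAMA's actual script: the name `classify_image` is hypothetical, the network call is left to a caller-supplied `fetch_mime_type` function (also hypothetical) so the sketch stays self-contained, and the extension test is assumed to be a suffix match on the URL path.

```python
from urllib.parse import urlsplit

# Substrings from Fig 7-1, mapped to the format counter they belong to.
EXTENSION_FORMATS = {".gif": "GIF", ".jpg": "JPEG", ".jpeg": "JPEG", ".png": "PNG"}
MIME_SUBSTRINGS = {"gif": "GIF", "jpeg": "JPEG", "png": "PNG"}


def classify_image(url, fetch_mime_type):
    """Return "GIF", "JPEG", "PNG", or "other" for one unique image reference.

    fetch_mime_type(url) should perform the HEAD request and return the
    MIME type string, or an empty string if the request fails.
    """
    # Step 1: a known extension settles the matter without any network use.
    path = urlsplit(url).path.lower()
    for extension, image_format in EXTENSION_FORMATS.items():
        if path.endswith(extension):
            return image_format
    # Step 2: only now fetch the MIME type.
    mime_type = (fetch_mime_type(url) or "").lower()
    # Step 3: look for a format indication in the MIME type.
    for substring, image_format in MIME_SUBSTRINGS.items():
        if substring in mime_type:
            return image_format
    # Step 4: everything else, including failed lookups, lands in "other".
    return "other"


def no_network(url):
    """Stand-in for a failed or skipped HEAD request."""
    return ""


print(classify_image("http://example.com/images/spacer.gif", no_network))
print(classify_image("http://example.com/image.cgi?id=7", no_network))
```

Note how a failed lookup for a script-served image ends up in "other" even when the image is really one of the three main formats.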

Image format detection in MAMA was added rather late in the development process, and some of the strategies used can be improved on next time. MAMA downloaded document dependencies serially instead of in parallel, so analyzing each and every image reference would have been very expensive time-wise. In the steps above you may notice that the MIME type is only fetched if a known extension is not detected. It was expected that the majority of images in a document would fall into one of the three featured image format categories, so the network activity needed to fetch the HTTP headers of images would be greatly reduced. The above strategy works well for detecting GIF, JPEG, and PNG, but things could be improved with respect to the "other" category. In addition to image references that really were in other image formats, this category picked up some false positives:

  • A large number of images are served by scripts or CGIs using URL arguments and have no file extensions. These would always fail the file extension check and fall through to the MIME type check.
  • If the MIME type check failed for any reason, be it network timeouts or other transient conditions, an empty value was returned. Script-served images that should normally have fallen into our three main categories would then end up in "other".
  • If a MIME type check was attempted for a broken image link, it could return a 404 error (or worse, an HTML 404 redirect), and this could throw off the detection method. A next-iteration strategy would be to ignore these cases altogether.

Given those caveats, MAMA did detect some image formats outside our three big buckets. Generally, the most popular formats were bitmaps and icon files (often using ".bmp" and ".ico" extensions), but the URLs with the highest concentrations of these image types were all .ico cases (for example, http://www.lenuagedesfilles.com/ with 883 and http://www.blogalego.com/ with 401).

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
