MAMA: XML

By Brian Wilson

Index:

  1. Introduction
  2. XML processing instructions (PIs)
  3. The XML prolog and its encoding
  4. XML namespaces
  5. Other XML-related attributes

Introduction

MAMA tracked a number of XML-related details in order to get a better sense of how XML is used on the Web. We have already seen evidence of XML in some of the other sections of this write-up. The Content-Type HTTP header revealed just over 1,000 documents using XML-related MIME types. Just under 1,000 of the URLs analyzed ended in ".xml" or ".xhtm". Certain conditions in scripting also mark a document with the "stamp" of XML. In this section, we will look at some additional factors that contribute to the evidence of XML usage in Web documents: XML processing instructions (PIs), the XML prolog (a type of PI), and XML namespaces. We can even consider the presence of an XHTML Doctype as another sign that XML is in use. Combining these factors, over 700,000 URLs (~20% of all URLs analyzed) exhibited evidence of trying to be XML in one form or another. Counting details that MAMA did not directly analyze, the true number is likely even higher. XML syntax definitely deserves some scrutiny in MAMA's research.

XML processing instructions (PIs)

The number of URLs reporting PIs was 104,413. This is slightly lower than the number mentioned below for XML prologs detected. In the MAMA code, this should not be possible, since the XML prolog detection is a child condition of the XML PI case. This difference exposes a small bug in PI detection when frames are used. It looks like it affects ~2-4% of the PI URLs, and will be fixed in the next version.

The full frequency table of PI counts per document shows that some documents contain large numbers of PIs. Investigating such cases exposes some sloppy practices: some authors put multiple XML prologs in a single document, while others leave behind PI-like constructs from pre-processing languages.

Ex: 33 XML prologs in a document: http://www.711.ru/
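
To illustrate how such cases can be spotted, here is a minimal sketch that counts PI-shaped constructs and XML prologs in raw markup. It is not MAMA's detection code; the regular expression, the function name count_pis and the rule that any "<?name ... ?>" construct counts as a PI are all assumptions made for this example.

```python
import re

# Assumption: anything shaped like <?name ...?> is treated as a PI.
# This is an illustrative rule, not MAMA's actual detection logic.
PI_PATTERN = re.compile(r"<\?([A-Za-z_][\w.\-]*)(.*?)\?>", re.DOTALL)


def count_pis(markup: str) -> dict:
    """Count PI-like constructs and, among them, XML prologs."""
    targets = [m.group(1) for m in PI_PATTERN.finditer(markup)]
    # A PI whose target is exactly "xml" is counted as an XML prolog.
    prologs = sum(1 for target in targets if target.lower() == "xml")
    return {"pis": len(targets), "prologs": prologs}
```

A document that reports more than one prolog, or PI targets such as "php", is a candidate for the sloppy practices described above.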

CSS stylesheets in XML

In all, 569 CSS stylesheet PIs were detected. MAMA used the following approach to judge a positive match:

  1. The PI begins with the string "xml-stylesheet".
  2. The PI has a "type" pseudo-attribute with the value "text/css".
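
The two conditions above can be sketched as follows. This is only an approximation of the approach described, not the MAMA source: the regular expressions, the function names and the decision to compare case-insensitively are assumptions for illustration.

```python
import re

# Assumption: a PI is anything between "<?" and "?>".
PI_PATTERN = re.compile(r"<\?(.*?)\?>", re.DOTALL)
# Assumption: the type value is quoted and matched case-insensitively.
TYPE_PATTERN = re.compile(
    r"""\btype\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL
)


def is_css_stylesheet_pi(pi_body: str) -> bool:
    """Apply both conditions to the text between '<?' and '?>'."""
    body = pi_body.strip()
    # Condition 1: the PI begins with the string "xml-stylesheet".
    if not body.lower().startswith("xml-stylesheet"):
        return False
    # Condition 2: the type value is "text/css".
    match = TYPE_PATTERN.search(body)
    return bool(match) and match.group(2).strip().lower() == "text/css"


def count_css_stylesheet_pis(markup: str) -> int:
    """Count the CSS stylesheet PIs found in one document."""
    return sum(
        1 for m in PI_PATTERN.finditer(markup) if is_css_stylesheet_pi(m.group(1))
    )
```

An XSLT stylesheet PI (type "text/xsl") passes the first condition but fails the second, so it is not counted.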

The XML prolog and its encoding

An XML prolog is a type of processing instruction and is an optional component of an XML document. MAMA found 104,722 URLs with XML prologs. The following is a typical XML prolog:

Ex: <?xml version="1.0" encoding="iso-8859-1"?>

XML encoding

The XML prolog can also have an optional "encoding" pseudo-attribute, which specifies the character encoding used in the document. Declaring the encoding in the prolog is very popular: of all URLs that use an XML prolog, 96,264 (about 92%) specify the document's encoding in this manner. The "iso-8859-1" value is about twice as popular as the next most common value, "utf-8".
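
As a rough sketch of how encoding values could be pulled out of prologs and tallied, consider the following. It is not taken from MAMA; the patterns, the function names and the lowercasing of values are assumptions for illustration.

```python
import re
from collections import Counter

# Assumption: the prolog is the first "<?xml ...?>" construct in the markup.
PROLOG_PATTERN = re.compile(r"<\?xml\s+(.*?)\?>", re.DOTALL | re.IGNORECASE)
ENCODING_PATTERN = re.compile(
    r"""\bencoding\s*=\s*(["'])(.*?)\1""", re.IGNORECASE
)


def prolog_encoding(markup: str):
    """Return the lowercased encoding value from the prolog, or None."""
    prolog = PROLOG_PATTERN.search(markup)
    if prolog is None:
        return None
    encoding = ENCODING_PATTERN.search(prolog.group(1))
    return encoding.group(2).strip().lower() if encoding else None


def tally_encodings(documents) -> Counter:
    """Count how many documents declare each encoding value."""
    counts = Counter()
    for markup in documents:
        value = prolog_encoding(markup)
        if value is not None:
            counts[value] += 1
    return counts
```

Documents without a prolog, or with a prolog that omits the encoding, contribute nothing to the tally.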

Fig 3-1: Top XML encoding values
[Also see the full frequency table.]
  Encoding value   Frequency
  iso-8859-1          54,572
  utf-8               27,052
  iso-8859-2           3,919
  shift_jis            2,464
  utf-16               1,688

XML namespaces

Although detecting the XML prolog declaration gives some idea of how XML is used on the Web, the prolog is not a required item for an XML document. MAMA also looked for XML namespace URIs used in documents, and the number was much higher than for the XML prolog. XML namespaces were found in 656,808 URLs (18.72% of all URLs analyzed). The XHTML namespace is the prevalent value here, but another trend is easy to spot: Microsoft-related namespaces are very prominent. Twenty-two of the top 100 namespaces were from Microsoft. Conversely, some interesting XML-related technologies had very low representation in the URLs that MAMA analyzed; XLink was detected 152 times, but XML Events, XHTML2, XForms and XSLT had only 1-2 dozen cases each. Server-side XSLT (a different implementation vector from this evidence of client-side XSLT) likely has higher usage rates than MAMA's XSLT numbers indicate; evidence from the "Server" and "X-Powered-By" HTTP header fields supports this view.
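
A minimal sketch of how namespace URIs might be collected per document is shown below. It is an assumption-laden illustration rather than MAMA's code: the pattern, the function names, the lowercasing of URIs and the choice to count each URI once per document are all made up for this example.

```python
import re
from collections import Counter

# Assumption: namespaces are declared through xmlns or xmlns:prefix
# attributes with quoted values.
XMLNS_PATTERN = re.compile(
    r"""\bxmlns(?::[\w.\-]+)?\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL
)


def namespace_uris(markup: str) -> set:
    """Return the distinct namespace URIs declared in one document."""
    return {m.group(2).strip().lower() for m in XMLNS_PATTERN.finditer(markup)}


def tally_namespaces(documents) -> Counter:
    """Count, for each URI, the number of documents that declare it."""
    counts = Counter()
    for markup in documents:
        # update() on a set adds one per distinct URI in this document.
        counts.update(namespace_uris(markup))
    return counts
```

Counting each URI once per document keeps a page that repeats the same declaration on many elements from inflating the totals.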

Fig 4-1: Top XML namespace URIs
[Also see the full frequency table.]
  Namespace URI                              Frequency
  http://www.w3.org/1999/xhtml                 564,458
  urn:schemas-microsoft-com:office:office       78,212
  urn:schemas-microsoft-com:vml                 74,737
  http://www.w3.org/tr/rec-html40               70,932
  urn:schemas-microsoft-com:office:word         23,993

Other XML-related attributes

xml:space

The xml:space attribute signals that whitespace inside an element is significant and should be preserved. It can be applied to any element, but, for most markup scenarios, it will typically be interesting only in JavaScript and CSS contexts. In the URLs that MAMA analyzed, this holds true: the attribute was used 520 times with the SCRIPT element and 140 times with the STYLE element, but was not detected with any other elements.

xml:lang

This attribute is used to define the natural language of an element's contents. It takes as a value an RFC 3066 language code. As you can see from the table below (Fig 5-1), the most popular place to use this attribute is the HTML element, where it appears about 74 times as often as on the next most common element, SPAN.

Fig 5-1: Elements that use the xml:lang attribute
  Element   Frequency   Element   Frequency
  HTML        213,216   LINK            254
  SPAN          2,886   P               195
  META          2,333   HEAD            133
  A             1,258   INPUT            97
  BODY            824   EM               93
  ACRONYM         385   IMG              74
  DIV             328   TITLE            72
  ABBR            322   H2               69
  LI              278   H1               61
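
For readers who want to produce a similar per-element tally, here is a small sketch built on Python's standard html.parser module. It is not how MAMA gathered its figures; the class and function names are hypothetical, and reporting element names in upper case is only a convention borrowed from this write-up.

```python
from collections import Counter
from html.parser import HTMLParser


class XmlLangCounter(HTMLParser):
    """Counts start tags that carry an xml:lang attribute, per element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if any(name == "xml:lang" for name, _value in attrs):
            self.counts[tag.upper()] += 1


def tally_xml_lang(markup: str) -> Counter:
    """Return a Counter mapping element name to xml:lang usage count."""
    parser = XmlLangCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts
```

The same parser could be pointed at xml:space by changing the attribute name it checks for.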

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.
