XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition

Characters in the Data Model

In the XML Information Set definition (http://www.w3.org/TR/xml-infoset), each individual character is distinguished as an object (or information item). This is a useful model conceptually, because it allows one to talk about the properties of a character and the position of a character relative to other characters, but it would be very expensive to represent each character as a separate object in a real tree implementation.

XDM has chosen not to represent characters as nodes. It would be nice if it did, because the XPath syntax could then be extended naturally to do character manipulation within strings, but the designers chose instead to provide a separate set of string-manipulation functions. These functions are described in Chapter 13.

A string (and therefore the string value of a node) is a sequence of zero or more characters. Each character is a Char as defined in the XML standard. Loosely, this is a Unicode character. In XML 1.0 it must be one of the following:

  • One of the four whitespace characters: tab x09, linefeed x0A, carriage return x0D, or space x20.
  • An ordinary 16-bit Unicode character in the range x21 to xD7FF or xE000 to xFFFD.
  • An extended Unicode character in the range x010000 to x10FFFF. In programming languages such as Java, and in files using UTF-16 encoding, such a character is represented as a surrogate pair, using two 16-bit codes in the range xD800 to xDFFF (UTF-8 instead encodes it as a single four-byte sequence). But as far as XPath is concerned, it is one character rather than two. This affects functions that count characters in a string or that make use of the position of a character in a string, for example the functions string-length(), substring(), and translate(). Here XPath differs from Java, which normally counts a surrogate pair as two characters.
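The rules above can be sketched in Python, whose strings, like XPath strings, are sequences of Unicode code points. This is an illustrative sketch only, not part of any XPath or XSLT API: the helper names is_xml10_char and utf16_code_units are invented here, and U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient example of a character outside the 16-bit range.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch is a legal XML 1.0 Char (hypothetical helper)."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)          # tab, linefeed, carriage return
        or 0x20 <= cp <= 0xD7FF           # space plus ordinary 16-bit characters
        or 0xE000 <= cp <= 0xFFFD         # 16-bit characters above the surrogate block
        or 0x10000 <= cp <= 0x10FFFF      # extended characters
    )


def utf16_code_units(s: str) -> int:
    """Count 16-bit code units, which is what Java's String.length() reports."""
    return len(s.encode("utf-16-be")) // 2


clef = "\U0001D11E"        # one character, outside the 16-bit range
text = "a" + clef + "b"

print(len(text))               # code points, the way XPath string-length() counts
print(utf16_code_units(text))  # code units: the clef needs a surrogate pair
print(is_xml10_char("\x00"), is_xml10_char(clef))
```

The first count treats the clef as one character and the second as two, which is the difference between XPath and Java described above.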

Unicode surrogate pairs are increasingly used for specialist applications. For example, there is a full range of musical symbols in the range x1D100 to x1D1FF. Although these are unlikely to be used when typesetting printed sheet music, they are very important in texts containing musical criticism. They also have some of the most delightful names in the whole Unicode repertoire: Who can resist a character called Tempus Perfectum cum Prolatione Perfecta? If you're interested, it looks like a circle with a dot in the middle.

Note that line endings are normalized to a single newline x0A character, regardless of how they appear in the original XML source file.
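A small check of this normalization, using Python's standard xml.etree.ElementTree parser as a stand-in for the XML parser that feeds an XSLT processor (an assumption made for illustration; any conformant parser should behave the same way):

```python
import xml.etree.ElementTree as ET

# The same content written with CR LF, bare CR, and bare LF line endings.
sources = ["<p>one\r\ntwo</p>", "<p>one\rtwo</p>", "<p>one\ntwo</p>"]

# After parsing, every variant carries a single x0A between the two words.
texts = [ET.fromstring(src).text for src in sources]
print(texts)
```

By the time the text reaches the tree, nothing records which line-ending convention the source file used.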

XML 1.1 allows additional characters, notably control characters in the range x01 to x1F. XSLT 2.0 and XPath 2.0 processors are not obliged to support XML 1.1, but many are likely to do so eventually. XML 1.1 also recognizes line ending characters used on IBM mainframes and converts these to the standard x0A newline character.

It is not possible in a stylesheet to determine how a character was written in the original XML file. For example, the following strings are all identical as far as XDM is concerned:

  • &gt;
  • &#62;
  • &#x3E;
  • &#x003E;
  • >
  • <![CDATA[>]]>

The XML parser handles these different character representations. In most implementations, the XSLT processor couldn't treat these representations differently even if it wanted to, because they all look the same once the XML parser has dealt with them.
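This is easy to confirm with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree (chosen here only for illustration) and wraps each spelling of the character in a made-up element named a:

```python
import xml.etree.ElementTree as ET

variants = [
    "<a>&gt;</a>",           # predefined entity reference
    "<a>&#62;</a>",          # decimal character reference
    "<a>&#x3E;</a>",         # hexadecimal character reference
    "<a>></a>",              # the literal character
    "<a><![CDATA[>]]></a>",  # CDATA section
]

# Collect the distinct text values the parser reports.
values = {ET.fromstring(v).text for v in variants}
print(values)
```

All five documents yield the same one-character string, so an application working from the parsed tree has nothing left to distinguish them by.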

What Does the Tree Leave Out?

The debate in defining a tree model is about what to leave out. What information from the source XML document is significant, and what is an insignificant detail? For example, is it significant whether the CDATA notation was used for text? Are entity boundaries significant? What about comments?

Many newcomers to XSLT ask questions like “How can I get the processor to use single quotes around attribute values rather than double quotes?” or “How can I get it to output &nbsp; instead of &#160;?” The answer is that you can't, because these distinctions are considered to be things that the recipient of the document shouldn't care about, and they were therefore left out of the XDM model.

Generally, the features of an XML document fall into one of three categories: definitely significant, definitely insignificant, and debatable. For example, the order of elements is definitely significant, the order of attributes within a start element tag is definitely insignificant, but the significance of comments is debatable.

The XML standard itself doesn't define these distinctions particularly clearly. It defines certain things that must be reported to the application, and these are certainly significant. There are other things that are obviously significant (such as the order of elements) about which it says nothing. Equally, there are some things that it clearly states are insignificant, such as the choice of CR-LF or LF for line endings, but many others about which it stays silent, such as choice of " versus ' to delimit attribute values.
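The same kind of check illustrates two of these distinctions. The sketch below again uses Python's xml.etree.ElementTree purely as an example parser, with made-up element and attribute names (item, id, lang, r, a, b):

```python
import xml.etree.ElementTree as ET

# Same attributes, but with different quote characters and a different order.
doc1 = ET.fromstring('<item id="42" lang="en"/>')
doc2 = ET.fromstring("<item lang='en' id='42'/>")
same_attributes = doc1.attrib == doc2.attrib

# Same child elements, but in a different order.
order1 = [child.tag for child in ET.fromstring("<r><a/><b/></r>")]
order2 = [child.tag for child in ET.fromstring("<r><b/><a/></r>")]

print(same_attributes)   # quoting style and attribute order leave no trace
print(order1 == order2)  # element order is preserved, so these differ
```

The attribute comparison relies on Python dictionaries comparing equal regardless of insertion order, which matches the view that attribute order carries no information.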

One result of this is that different standards in the XML family have each made their own decisions on these matters, and the XDM specification is no exception.

The debate arises partly because there are two kinds of applications. Applications that want only to extract the information content of the document are usually interested only in the core information content. Applications such as XML editing tools tend also to be interested in details of how the XML was written, because when the user makes no change to a section of the document, they want the corresponding output document to be as close to the original as possible.

One attempt to define the information content of an XML document is the W3C InfoSet specification (http://www.w3.org/TR/xml-infoset/). This takes a fairly liberal view, retaining things such as CDATA section boundaries and external entity references in the model, on the basis that some users might attach importance to these things.
