XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition

XSLT and CSS

Why are there two stylesheet languages, XSL (that is, XSLT plus XSL Formatting Objects) as well as Cascading Style Sheets (CSS and CSS2)?

It's only fair to say that in an ideal world there would be a single language in this role, and that the reason there are two is that no one was able to invent something that achieved the simplicity and economy of CSS for doing simple things, combined with the power of XSL for doing more complex things.

CSS is mainly used for rendering HTML, but it can also be used for rendering XML directly, by defining the display characteristics of each XML element. However, it has serious limitations. It cannot reorder the elements in the source document, it cannot add text or images, it cannot decide which elements should be displayed and which omitted, and it cannot calculate totals or averages or sequence numbers. In other words, it can only be used when the structure of the source document is already very close to the final display form.
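
To make these limitations concrete, here is a minimal XSLT sketch of the kind of processing described above as being beyond CSS: it filters out some elements, reorders the rest, adds text that is not in the source, and computes sequence numbers and a total. The input vocabulary (an order element containing item elements with name, price, and status attributes) is invented purely for illustration, so treat this as a sketch under those assumptions rather than a definitive stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only. Assumes a hypothetical input of the form
     <order>
       <item name="Widget" price="4.50" status="shipped"/>
       <item name="Gadget" price="12.00" status="cancelled"/>
     </order> -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/order">
    <html>
      <body>
        <!-- Added text: this heading does not appear in the source -->
        <h1>Order summary</h1>

        <!-- Filtering: cancelled items are omitted.
             Reordering: the remaining items are sorted by name. -->
        <xsl:for-each select="item[not(@status = 'cancelled')]">
          <xsl:sort select="@name"/>
          <p>
            <!-- Sequence number: position within the sorted sequence -->
            <xsl:value-of select="position()"/>
            <xsl:text>. </xsl:text>
            <xsl:value-of select="@name"/>
            <xsl:text>: </xsl:text>
            <xsl:value-of select="@price"/>
          </p>
        </xsl:for-each>

        <!-- Calculation: total price of the items that were displayed -->
        <p>
          <xsl:text>Total: </xsl:text>
          <xsl:value-of
              select="format-number(sum(item[not(@status = 'cancelled')]/@price), '0.00')"/>
        </p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Nothing here depends on features specific to XSLT 2.0, so an XSLT 1.0 processor should also accept it in forwards-compatible mode.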

Having said this, CSS is simple to write, and it is very economical in machine resources. It doesn't reorder the document, and so it doesn't need to build a tree representation of the document in memory, and it can start displaying the document as soon as the first text is received over the network. Perhaps most important of all, CSS is very simple for HTML authors to write, without any programming skills. In comparison, XSLT is far more powerful, but it also consumes a lot more memory and processor power, as well as training budget.

It's often appropriate to use both tools together. Use XSLT to create a representation of the document that is close to its final form, in that it contains the right text in the right order, and then use CSS to add the finishing touches, by selecting font sizes, colors, and so on. Typically, you would do the XSLT processing on the server and the CSS processing on the client (in the browser); so, another advantage of this approach is that you reduce the amount of data sent down the line, which should improve response time for your users and postpone the next expensive bandwidth increase.
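
As a rough sketch of this division of labor, the stylesheet below restricts itself to getting the right text in the right order, and leaves fonts and colors to a CSS file that the browser applies. The source vocabulary (article, title, section, heading, para), the class names, and the file name article.css are all hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: element names, class names, and article.css are
     assumptions made for this example -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="/article">
    <html>
      <head>
        <title><xsl:value-of select="title"/></title>
        <!-- Presentation is delegated to CSS, applied on the client -->
        <link rel="stylesheet" type="text/css" href="article.css"/>
      </head>
      <body>
        <h1 class="title"><xsl:value-of select="title"/></h1>
        <xsl:apply-templates select="section"/>
      </body>
    </html>
  </xsl:template>

  <!-- Each section becomes a div that the CSS can target by class -->
  <xsl:template match="section">
    <div class="section">
      <h2><xsl:value-of select="heading"/></h2>
      <xsl:apply-templates select="para"/>
    </div>
  </xsl:template>

  <xsl:template match="para">
    <p class="body-text"><xsl:value-of select="."/></p>
  </xsl:template>

</xsl:stylesheet>
```

The stylesheet says nothing about font sizes or colors; those decisions live in the CSS file, which can be changed without touching the transformation.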

XSLT and XML Schemas

One of the biggest changes in XSLT 2.0, and one of the most controversial, is the integration of XSLT with the XML Schema language. XML Schema provides a replacement for DTDs as a way of specifying the structural constraints that apply to a class of documents; unlike DTDs, an XML Schema can regulate the content of the text as well as the nesting of the elements and attributes. Many of the industry vocabularies being used to define XML interchange standards are specified using XML Schema definitions. For example, several of the XML vocabularies for describing music, which I alluded to earlier in the chapter, have an XML Schema to define their rules, and this schema can be used to check the conformance of individual documents to the standard in question.

When you write a stylesheet, you need to make assumptions about the structure of the input documents it is designed to process and the structure of the result documents it is designed to produce. With XSLT 1.0, these assumptions were implicit; there was no formal way of stating the assumptions in the stylesheet itself. As a result, if you try applying a stylesheet to the wrong kind of input document, the result will generally be garbage.

The idea of linking XSLT and XML Schema was driven by two main considerations:

  • There should, in principle, be software engineering benefits if a program (and a stylesheet is indeed a program) makes explicit assertions about its expected inputs and outputs. These assertions can lead to better and faster detection of errors, often enabling errors to be reported at compile time that otherwise would only be reported the first time the stylesheet was applied to some test data that happened to exercise a particular part of the code.
  • The more information that's available to an XSLT processor at compile time, the more potential it has to generate optimal code, giving faster execution and better use of memory.

So why the controversy? It's mainly because XML Schema itself is less than universally popular. It's an extremely complex specification that's very hard to read, and when you discover what it says, it appears to be full of rules that seem artificial and inconsistent. It manages at the same time to be specified in very formal language, and yet to have a worryingly high number of bugs that have been fixed through published errata. Although there are good books that present XML Schema in a more readable way, they achieve this by glossing over the complications, which means that the error messages you get when you do something wrong can be extremely obscure. As a result, there has been a significant amount of support for an alternative schema language, Relax NG, which as it happens was co-developed by the designer of XSLT and XPath, James Clark, and is widely regarded as a much more elegant approach.

The XSL and XQuery Working Groups responded to these concerns by ensuring that support for XML Schema was optional, both for implementors and for users. This has largely silenced the objections.

The signs are that XML Schema is here to stay, whether people like it or not. It has the backing of all the major software vendors such as IBM, Oracle, and Microsoft, and it has been adopted by most of the larger user organizations and industries. And like so many things that the IT world has adopted as standards, it may be imperfect but it does actually work. Meanwhile, to simplify the situation rather cruelly, Relax NG is taking the role of the Apple Mac: the choice of the cognoscenti who judge a design by its intrinsic quality rather than by cold-blooded cost-benefit analysis.

As I've already mentioned, W3C is not an organization that likes to let a thousand flowers bloom. It is not a loose umbrella organization in which each working group is free to do its own thing. There are strong processes that ensure the working groups cooperate and strive to reconcile their differences. There is therefore a determination to make all the specifications work properly together, and the message was that if XML Schema had its problems, then people should work together to get them fixed. XSLT and XML Schema come from the same stable, so they were expected to work together. And now that the specs are finished and products are out, I think users are starting to discover that they can work together beneficially.

Chapter 4 provides an overview of how stylesheets and schemas are integrated in XSLT 2.0, and Chapter 19 provides a worked example of an application that uses this capability. When I first developed this application for the book (which I did at the same time as I developed the underlying support in Saxon), I was pleasantly surprised to see that I really was getting benefits from the integration. At the simplest level, I really liked the immediate feedback you get when a stylesheet generates output that does not conform to the schema for the result document, with error messages that point straight to the offending line in the stylesheet. This makes for a much faster debugging cycle than does the old approach of putting the finished output file through a schema validator as a completely separate operation.
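
The output validation described above can be sketched as follows. This is a minimal illustration, not the Chapter 19 application: the namespace URI, the schema file invoice.xsd, and the input and output element names are all hypothetical, and the stylesheet needs a schema-aware XSLT 2.0 processor (a basic processor is required to reject xsl:import-schema).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the invoice namespace, invoice.xsd, and the element
     names are assumptions made for this example -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:inv="http://example.com/ns/invoice">

  <!-- Make the schema for the result vocabulary known to the processor -->
  <xsl:import-schema namespace="http://example.com/ns/invoice"
                     schema-location="invoice.xsd"/>

  <xsl:template match="/order">
    <!-- Request strict validation of the constructed element against
         the imported schema; output that does not conform is reported
         as an error by the XSLT processor itself -->
    <inv:invoice xsl:validation="strict">
      <inv:number><xsl:value-of select="@id"/></inv:number>
      <inv:total><xsl:value-of select="sum(line/@amount)"/></inv:total>
    </inv:invoice>
  </xsl:template>

</xsl:stylesheet>
```

If the hypothetical schema required, say, a date element that this template never creates, the failure would surface when the stylesheet runs (or, with a sufficiently clever processor, when it is compiled), rather than in a separate validation step afterward.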

XSLT and XQuery

XQuery is a separate specification from W3C, designed to allow data in XML documents to be queried. It can operate on single documents, or on collections containing millions of documents held in an XML database.

Functionally, XQuery offers a subset of the capabilities of XSLT. You could regard it as XSLT without the template rules, and without some of the extra features such as the ability to do grouping, or to format dates and times, or to import modules and selectively override them. It would be a mistake, however, to think that being a smaller language makes XQuery a poor relation. The relative simplicity of XQuery does indeed make it harder to write large and complex applications, but it does bring two significant advantages: the language is easier to learn, especially for those coming from a SQL background, and it is easier to optimize, especially when running against gigabytes of data preloaded and preindexed in an XML database.

XQuery has XPath 2.0 as a subset. This makes it very much a member of the same family as XSLT. The two languages have a great deal in common, most importantly their type system. There are no formal facilities in the W3C specifications that allow XSLT and XQuery to be mixed in a single application, but because the processing models are so closely aligned, many implementations allow one language to be called from the other. Saxon in fact implements both languages as different surface syntaxes for the same underlying processing engine.

There are some applications for which XSLT is definitely better suited, particularly document publishing. There are others where XQuery is the only sensible choice, notably searching for data in large XML databases. There's a third class of applications, especially message conversion, where either language will do the job, and where the choice is largely a matter of personal preference. My advice would be to use XQuery if it's a very small application and XSLT if it's bigger, largely because in my experience it's easier to write XSLT code that's adaptable to change and reusable in different applications.

The History of XSL

Like most of the XML family of standards, XSLT was developed by the World Wide Web Consortium (W3C), a coalition of companies orchestrated by Tim Berners-Lee, the inventor of the Web. There is an interesting page on the history of XSL, and styling proposals generally, at http://www.w3.org/Style/History/.

Writing history is a tricky business. Sharon Adler, the chair of the XSL Working Group, tells me that her recollections of events are very different from the way I describe them. This just goes to show that the documentary record is a very crude snapshot of what people were actually thinking and talking about. Unfortunately, though, it's all that we've got.

Prehistory

HTML was originally conceived by Berners-Lee (www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt) as a set of tags to mark the logical structure of a document: headings, paragraphs, links, quotes, code sections, and the like. Soon, people wanted more control over how the document looked; they wanted to achieve the same control over the appearance of the delivered publication as they had with printing and paper. So, HTML acquired more and more tags and attributes to control presentation: fonts, margins, tables, colors, and all the rest that followed. As it evolved, the documents being published became more and more browser-dependent, and it was seen that the original goals of simplicity and universality were starting to slip away.

The remedy was widely seen as separation of content from presentation. This was not a new concept; it had been well developed through the 1980s in the development of Standard Generalized Markup Language (SGML).

Just as XML was derived as a greatly simplified subset of SGML, so XSLT has its origins in an SGML-based standard called DSSSL (Document Style Semantics and Specification Language). DSSSL (pronounced “Dissel”) was developed primarily to fill the need for a standard device-independent language to define the output rendition of SGML documents, particularly for high-quality typographical presentation. SGML was around for a long time before DSSSL appeared in the early 1990s, but until then the output side had been handled using proprietary and often extremely expensive tools, geared toward driving equally expensive phototypesetters, so that the technology was really taken up only by the big publishing houses.

Michael Sperberg-McQueen and Robert F. Goldstein presented an influential paper at the WWW '94 conference in Chicago under the title “A Manifesto for Adding SGML Intelligence to the World-Wide Web”. You can find it at http://tigger.uic.edu/~cmsmcq/htmlmax.html.

The authors presented a set of requirements for a stylesheet language, which is as good a statement as any of the aims that the XSL designers were trying to meet. As with other proposals from around that time, the concept of a separate transformation language had not yet appeared, and a great deal of the paper is devoted to the rendition capabilities of the language. There are many formative ideas, however, including the concept of fallback processing to cope with situations where particular features are not available in the current environment.

It is worth quoting some extracts from the paper here:

Ideally, the stylesheet language should be declarative, not procedural, and should allow stylesheets to exploit the structure of SGML documents to the fullest. Styles must be able to vary with the structural location of the element: paragraphs within notes may be formatted differently from paragraphs in the main text. Styles must be able to vary with the attribute values of the element in question: a quotation of type “display” may need to be formatted differently from a quotation of type “inline”…

At the same time, the language has to be reasonably easy to interpret in a procedural way: implementing the stylesheet language should not become the major challenge in implementing a Web client.

The semantics should be additive: It should be possible for users to create new stylesheets by adding new specifications to some existing (possibly standard) stylesheet. This should not require copying the entire base stylesheet; instead, the user should be able to store locally just the user's own changes to the standard stylesheet, and they should be added in at browse time. This is particularly important to support local modifications of standard DTDs.

Syntactically, the stylesheet language must be very simple, preferably trivial to parse. One obvious possibility: formulate the stylesheet language as an SGML DTD, so that each stylesheet will be an SGML document. Since the browser already knows how to parse SGML, no extra effort will be needed.

We recommend strongly that a subset of DSSSL be used to formulate stylesheets for use on the World Wide Web; with the completion of the standards work on DSSSL, there is no reason for any community to invent their own style-sheet language from scratch. The full DSSSL standard may well be too demanding to implement in its entirety, but even if that proves true, it provides only an argument for defining a subset of DSSSL that must be supported, not an argument for rolling our own. Unlike home-brew specifications, a subset of a standard comes with an automatically predefined growth path. We expect to work on the formulation of a usable, implementable subset of DSSSL for use in WWW stylesheets, and invite all interested parties to join in the effort.
