Comparison of XML processing models

Jérôme Euzenat (Jerome.Euzenat@inrialpes.fr)
INRIA Rhône-Alpes

Introduction

With the communication of data in structured formats, most notably XML, there is a need for assembling elementary transformations (such as XSLT stylesheets) to achieve complex and rationalized transformation tasks. Several tools have arisen recently for this purpose. They have a lot in common despite some differences in objectives. The following table introduces them along with the salient features on which they diverge in the expression of the transformation.

\               file            flow
task oriented   Ant             Transmorpher
data oriented   Xweb, Lagoon    Cocoon
Table: comparison of the different systems considered here. They are organized along two dimensions:
task/data oriented
whether their processing is described as a set of tasks that can be invoked to provide some output or as a set of operations for obtaining one type of target.
file/flow
whether the data communication (and the processing control) is expressed through files (by specifying how to compute a file) or flow (by specifying the data diffusion). An extension of this dimension, that has not been considered, is the processing of infinite streams (the data keeps coming and the system must continuously provide its output).
Another way to compare these systems is along the push/pull option (i.e., whether they are used for computing one output on demand or for computing all outputs once the inputs are available).
\         push                 pull
static    Transmorpher, Xweb   Ant
dynamic   Lagoon               Cocoon
Table: comparison of the different systems considered under their processing dimensions.

The languages

In the following, we distinguish between a sitemap, which describes what files/outputs can be obtained, and a stylesheet, which describes how to perform a particular transformation. Of course, both are often mixed, since what is usually described is how to obtain a given file.

There exist other systems for this task, such as Xpipes (http://xpipe.sourceforge.net) or Ux

Lagoon

Lagoon (http://lagoon.sourceforge.net/docs/userguide.html, previously XotW [staldal2000a]) is aimed at generating web sites offline. The site is described by a sitemap providing all the files in the web site. Lagoon distinguishes between six types of components: format (from XML to bytes), transform (from XML to XML), source (a generator of XML), read (a generator of bytes), parse (from bytes to XML) and process (from bytes to bytes). Unlike many other tools (and like Xweb), Lagoon mixes byte and XML streams. Lagoon has Make-like facilities for recomputing files only when necessary. Emphasis is put on this evolved form of caching. Like Make, Lagoon mixes sitemap and stylesheet for all the files. The transformations are rules for providing a file. They are expressed in a functional way (each producer, except the split transform, can only provide one file). This prohibits the reuse of transformation flows that tasks would allow.
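
The Make-like recomputation mentioned above can be pictured with the following minimal Java sketch. It is not taken from Lagoon: the class StaleCheck and the method needsRebuild are hypothetical names, and the rule used (rebuild when the target is missing or older than one of its sources) is the usual Make rule, assumed here to approximate Lagoon's behaviour.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.List;

// Hypothetical illustration of a Make-like staleness test (not Lagoon code).
class StaleCheck {

    /** A target must be rebuilt if it is missing or older than any source. */
    static boolean needsRebuild(Path target, List<Path> sources) throws IOException {
        if (!Files.exists(target)) {
            return true;
        }
        FileTime built = Files.getLastModifiedTime(target);
        for (Path source : sources) {
            if (Files.getLastModifiedTime(source).compareTo(built) > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stalecheck");
        Path source = Files.writeString(dir.resolve("page.xml"), "<page/>");
        Path target = dir.resolve("page.html");

        // The target does not exist yet, so it has to be produced.
        System.out.println(needsRebuild(target, List.of(source)));

        // Once the target is newer than its source, nothing needs to be done.
        Files.writeString(target, "<html/>");
        Files.setLastModifiedTime(source, FileTime.fromMillis(1_000_000L));
        Files.setLastModifiedTime(target, FileTime.fromMillis(2_000_000L));
        System.out.println(needsRebuild(target, List.of(source)));
    }
}
```

Such a test only looks at files, which fits the file-oriented description of Lagoon given in the first table.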

Cocoon

Cocoon (http://xml.apache.org/cocoon, [mclaughlin2000a]) is another stylesheet composition system written in Java and integrated with Servlet servers. It allows web sites to be computed online. Advantages of Cocoon include document caching and explicit declaration of transformations ("sitemap"). Cocoon is based on a three-step site publication model (creation, content processing and rendering). This provides a clear methodology for developing sites but confines the system to a particular type of processing. The caching mechanism of Cocoon is tied to that methodology by enabling caching only at these steps.

Ant

Ant (http://jakarta.apache.org/ant/) is a substitute for the famous Make using XML and Java. It is thus a program configurator and updater. Its goal is not XML processing but it shares features with the systems presented here: a simple processing model and an easily customizable philosophy. Ant is task and file oriented.

XWeb

XWeb (http://meganesia.int.gu.edu.au/~pbecker/xweb/manual.html) aims at generating web sites offline. Its current approach is file-based with input/output handled implicitly. It thus implements pipelines. Processes handle either XML or binary information.

A new processing model for XWeb is available (http://meganesia.int.gu.edu.au/~pbecker/xweb/processingModel.html).

Transmorpher

Transmorpher (http://transmorpher.inrialpes.fr, [euzenat2001a]) is an environment for defining and processing complex transformation flows. It targets transformation engineering, not especially web site generation. It enables the description of complex data flows combining other flows with basic transformations.

Transmorpher allows to:

Term comparison

The first step in comparing languages is to compare the terms used for describing them. The first table considers the basic concepts manipulated by the processing models (at least those which are common to these systems; they have other concepts not considered here).

\ | ant | cocoon | lagoon | xweb (*) | xpipe | transmorpher
stylesheet | makefile | sitemap | sitemap | website | stylesheet
I/O | file | - (implicit) | file | stream | - (implicit) | channel
context | properties (?) | context
parameters | parameters | configuration | parameters
base components | built-in tasks | producers | process | Xcomponent | built-in transformations
tasks | task | pipeline/action-set | xsl nesting/macro | Xpipe/Xrigs | process
selection | matching | selector | - | -
iteration | - (implicit on directories) | - (implicit on directories) | - (implicit on directories) | - (implicit on directories) | - | iterator
Table: concept comparison. (*) Xweb description based on the new processing description.

These systems are provided with various basic components depending on their destinations. The second table deals with the basic components that are provided for implementing the actual processing (not the organization of this processing). Ant is not really relevant for this comparison (it has different extensions).

\ | cocoon | lagoon | xweb | transmorpher
generating output | serializer/generator | format/consumer | serializer
getting input | reader | source/read | generator
applying an external program | action | transform/parse/process | programCall | external
applying a custom program | - | ruleset
generating dynamic pages | JSP/XSP | LSP | - | -
applying an XSLT stylesheet | transformer (xslt) | transform (xslt) | xsl | external (xslt)
applying a query | transformer (sql) | - | query
aggregating results | aggregator | merge
splitting results | matcher (?) | transform (split) | dispatch
Table: available basic component comparison.

A possible track for collaboration

Layers

One can layer these systems in the following way (adapted from Peter Becker):

Sharing plug-in definition

It appears that all these systems have comparable needs for plugging in coded transformation and data manipulation procedures. One great benefit should come from the sharing of the plug-in definitions, so that as soon as one plug-in is made for one system, it is available for the others (all these systems require the same tools: XSLT engines, special purpose formatters, various parsers and serializers, etc.).

This is not obvious to do in JAXP [armstrong2001a] which has been especially designed for XSLT transformations (XML stream, template interface, only one input and one output).

The sharing of plug-in definitions can take advantage of the terminological comparison above because it helps in defining the rock-bottom categories requiring a special interface. For instance, Lagoon distinguishes transformations on the basis of the "encoding" of input/output (i.e., whether it is XML or just bytes), whereas Transmorpher distinguishes them by the number of inputs/outputs. As a consequence, Transmorpher distinguishes dispatch from transformation and Lagoon distinguishes format from transform.
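
To make the two classification criteria concrete, here is a small Java sketch in which a plug-in is described by the encodings of its inputs and outputs, and both views are derived from that single description. The class PluginShape and its methods are hypothetical, and the exact decision rules (in particular counting only XML ports for the Transmorpher-style view, and requiring at least one output) are assumptions made for the illustration; they are not the actual rules of either system.

```java
import java.util.List;

// Hypothetical descriptor: one shape, two ways of categorizing it.
class PluginShape {
    enum Encoding { BYTES, XML }

    final List<Encoding> inputs;
    final List<Encoding> outputs;   // assumed to contain at least one entry

    PluginShape(List<Encoding> inputs, List<Encoding> outputs) {
        this.inputs = List.copyOf(inputs);
        this.outputs = List.copyOf(outputs);
    }

    /** Lagoon-style view: category decided by the encoding of input and output. */
    String byEncoding() {
        boolean xmlOut = outputs.contains(Encoding.XML);
        if (inputs.isEmpty()) {
            return xmlOut ? "source" : "read";
        }
        if (inputs.contains(Encoding.XML)) {
            return xmlOut ? "transform" : "format";
        }
        return xmlOut ? "parse" : "process";
    }

    /** Transmorpher-style view: category decided by the number of XML ports. */
    String byArity() {
        long in = inputs.stream().filter(e -> e == Encoding.XML).count();
        long out = outputs.stream().filter(e -> e == Encoding.XML).count();
        if (in == 0 && out == 0) return "none";
        if (in == 0) return "generator";
        if (out == 0) return "serializer";
        if (out > 1) return "dispatch";
        if (in > 1) return "merger";
        return "transformation";
    }

    public static void main(String[] args) {
        PluginShape split = new PluginShape(
                List.of(Encoding.XML), List.of(Encoding.XML, Encoding.XML));
        PluginShape html = new PluginShape(
                List.of(Encoding.XML), List.of(Encoding.BYTES));
        System.out.println(split.byEncoding() + " / " + split.byArity());
        System.out.println(html.byEncoding() + " / " + html.byArity());
    }
}
```

Under these assumptions a component splitting one XML input into two XML outputs is a plain transform in the first view and a dispatch in the second, while an HTML writer is a format in the first view and a serializer in the second. A shared plug-in definition would probably have to record both pieces of information.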

Xweb

The expected interface for XWeb is made of a Registry with the method:

public static void register(net.sourceforge.xweb.processors.ProcessorFactory factory)
Factory with the methods (?):
    public static String getProcessName()
    public static String getProcessNamespace()
and Processors with methods:
public static net.sourceforge.processors.Processor getProcessor(??? configuration)
    public List getInputs()
    public List getOutputs()
    public void connectOutput(Processor other, Input input)
    protected void input(Input input, BinaryData data)
    protected void input(Input input, XMLData data)
    public void run()
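
The registry/factory/processor organization listed above can be sketched as follows. This is only an illustration and not XWeb code: the methods are instance methods rather than static ones so that a factory can be implemented as an ordinary object, the unspecified configuration type is replaced by a Map<String, String>, the lookup method is hypothetical, and Processor is reduced to its run() method.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the XWeb types; names follow the listing above.
interface Processor {
    void run();
}

interface ProcessorFactory {
    String getProcessName();
    String getProcessNamespace();
    // The configuration type is an assumption; the listing leaves it open.
    Processor getProcessor(Map<String, String> configuration);
}

final class Registry {
    private static final Map<String, ProcessorFactory> FACTORIES = new HashMap<>();

    private Registry() { }

    /** Record a factory under its (namespace, name) pair. */
    static void register(ProcessorFactory factory) {
        FACTORIES.put(key(factory.getProcessNamespace(), factory.getProcessName()), factory);
    }

    /** Hypothetical lookup, returning null when nothing is registered. */
    static ProcessorFactory lookup(String namespace, String name) {
        return FACTORIES.get(key(namespace, name));
    }

    private static String key(String namespace, String name) {
        return "{" + namespace + "}" + name;
    }
}

class RegistryDemo {
    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        Registry.register(new ProcessorFactory() {
            public String getProcessName() { return "echo"; }
            public String getProcessNamespace() { return "urn:example:processors"; }
            public Processor getProcessor(Map<String, String> configuration) {
                // The processor simply records its configured message when run.
                return () -> log.append(configuration.get("message"));
            }
        });
        Processor processor = Registry.lookup("urn:example:processors", "echo")
                .getProcessor(Map.of("message", "hello"));
        processor.run();
        System.out.println(log);
    }
}
```

The point of interest for sharing plug-ins is that the processor is created from its configuration first and executed later through an explicit run() call.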

Transmorpher

The (undocumented) interface for Transmorpher is made of a Factory which knows a correspondence table between Process types (xslt, broadcast, concat, etc.) and the actual Java class names. These correspondences will be given to Transmorpher in the near future through a defextern tag similar to Cocoon declarators. There also exists an undocumented property file which allows choosing a default implementation of a particular Process category. However, we prefer to be able to mix implementations within the same stylesheet, so we do not develop this possibility. The Factory provides:

  public void initFactory()
  public static final TProcess newProcess(String type, Object[] params)
  public final TProcess newGenerator(String[] pOut, String type, Parameters pParam, StringParameters staticAttributes)
  public static final TProcess newSerializer(String[] pIn, String type, Parameters pParam, StringParameters staticAttributes)
  public static final TProcess newDispatcher(String[] pIn, String[] pOut, String type, Parameters pParam)
  public static final TProcess newConnector(String[] pIn, String[] pOut, String type, Parameters pParam)
  public static final TProcess newExternal(String[] pIn, String[] pOut, String type, Parameters pParam)
  public static final TProcess newApplyQuery(String[] pIn, String[] pOut, String type, Parameters pParam, StringParameters staticAttributes)
Important features here are: I/O and parameters are given at creation time. All processes are typed (this helps the factory to know what parameters are needed). As a consequence, Transmorpher provides several types of Process interfaces. Their most important method is the constructor, which deals with all the parameters. The TProcess interface is very simple (not everything is useful):

public interface TProcess {
  /** Get the name of the process */
  public String getName();
  /** Set the name of the process */
  public void setName(String pName);
  /** Set an In port */
  public void setIn(int i, XML_Port pFileIn);
  /** Set an Out Port */
  public void setOut(int i, XML_Port pFileOut);
  /** Get an In port */
  public XML_Port getIn(int i);
  /** Get an Out Port */
  public XML_Port getOut(int i);
  /** Get an In port */
  public XML_Port getIn(String pName);
  /** Get an Out Port */
  public XML_Port getOut(String pName);
  /** Get the In ports */
  public XML_Port[] getIn();
  /** Get the Out Ports */
  public XML_Port[] getOut();
  public String[] getNameOut();
  public String[] getNameIn();
  public String getNameOut(int i);
  public String getNameIn(int i);
  /** pass a set of parameters to a process */
  public void setParameters(Parameters p);
  /** returns the parameters of a process */
  public Parameters getParameters();
  /** get the parameter value of a process */
  public Object getParameter(String k);
  public void setParameter(String k, Object o);
  /** bind the parameters of the process to the runtime parameters */
  public void bindParameters(Parameters p);
  public void generatePort();
  public void setFatherName(String name);
  public String getFatherName();
}

No run() method is available since the processors are supposed to be always running (except generators).
This could be an obstacle to the integration of plug-ins in the actual implementation of Transmorpher (and maybe Cocoon?): it is a flow of SAX events, so each component implementation is "always running", as opposed to having a run() primitive as proposed in XWeb2.
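
The "always running" character of a SAX flow can be illustrated with the standard SAX filter classes of the JDK. The sketch below is not Transmorpher code; the classes SaxFlowDemo, UpperCaseNames and NameCollector are hypothetical. Only the parser, which plays the role of a generator, is started explicitly; the filter and the final handler have no run() method and merely react to the events pushed through them.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SaxFlowDemo {

    /** Passive component: upper-cases element names as events pass through. */
    static class UpperCaseNames extends XMLFilterImpl {
        UpperCaseNames(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            super.startElement(uri, localName.toUpperCase(Locale.ROOT),
                    qName.toUpperCase(Locale.ROOT), atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            super.endElement(uri, localName.toUpperCase(Locale.ROOT),
                    qName.toUpperCase(Locale.ROOT));
        }
    }

    /** Final consumer: records the element names it is told about. */
    static class NameCollector extends DefaultHandler {
        final StringBuilder names = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            names.append(qName).append(' ');
        }
    }

    /** Wires parser -> filter -> collector and starts the parser (the only active part). */
    static String run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        UpperCaseNames filter = new UpperCaseNames(parser);
        NameCollector collector = new NameCollector();
        filter.setContentHandler(collector);
        filter.parse(new InputSource(new StringReader(xml)));
        return collector.names.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<doc><item/><item/></doc>"));
    }
}
```

A plug-in written against a run()-style interface would need an adapter to take part in such a chain, since nothing in the chain ever calls it outside of event delivery.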

Besides SAX plug-in types, it might be useful to consider:

Cocoon

In Cocoon, plug-ins are declared within the sitemap inside the components tag. They are typed as generators, transformers, serializers, readers, selectors, matchers and actions. They are identified by a name and the name of the class which implements them. Additional parameters can be passed on to the component. For instance:

<map:components>
  <map:generators ..>
  <map:transformers ..>
  <map:serializers default="html">
    <map:serializer name="html" mime-type="text/html" src="org.apache.cocoon.serialization.HTMLSerializer">
      <doctype-public>-//W3C//DTD HTML 4.0 Transitional//EN</doctype-public>
      <doctype-system>http://www.w3.org/TR/REC-html40/loose.dtd</doctype-system>
      <omit-xml-declaration>true</omit-xml-declaration>
      <encoding>UTF-8</encoding>
      <indent>1</indent>
    </map:serializer>
    <map:serializer name="wap" mime-type="text/vnd.wap.wml" src="org.apache.cocoon.serialization.XMLSerializer">
      <doctype-public>-//WAPFORUM//DTD WML 1.1//EN</doctype-public>
      <doctype-system>http://www.wapforum.org/DTD/wml_1.1.xml</doctype-system>
      <encoding>UTF-8</encoding>
    </map:serializer>
    <map:serializer name="svg2jpeg" mime-type="image/jpeg" src="org.apache.cocoon.serialization.SVGSerializer">
      <parameter name="background_color" type="color" value="#00FF00"/>
    </map:serializer>
    <map:serializer name="svg2png" mime-type="image/png" src="org.apache.cocoon.serialization.SVGSerializer">
    </map:serializer>
  </map:serializers>
</map:components>

The interface for, e.g., a Serializer is the following:

public interface XMLConsumer extends ContentHandler, LexicalHandler {}

public interface SitemapOutputComponent extends Component {

  /**
   * Set the OutputStream where the requested resource should
   * be serialized.
   */
  void setOutputStream(OutputStream out) throws IOException;

  /**
   * Get the mime-type of the output of this Component.
   */
  String getMimeType();

  /**
   * Test if the component wants to set the content length
   */
  boolean shouldSetContentLength();
}

public interface Serializer extends XMLConsumer, SitemapOutputComponent {
   String ROLE = "org.apache.cocoon.serialization.Serializer";
}

Lagoon

The interface is not yet documented.

Sharing call interface

It would be nice to be able to call some of these systems from the outside (e.g., XWeb calls Lagoon or Cocoon calls Transmorpher at the processing tool level). In fact, if we can normalize a built-in interface, this could make it possible to interoperate through the "external interface" of each system.

There exists one such interface in JAXP [armstrong2001a]: the transformer interface. Unfortunately, it only allows one input and one output to a transformer. This is too specific and calls for something else.
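
The limitation is visible in the signature of the transform method: it takes exactly one Source and one Result. The following sketch uses the javax.xml.transform API as shipped with current JDKs; the class JaxpSingleIo, the method apply and the sample stylesheet are illustrative names and data chosen for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class JaxpSingleIo {

    /** Applies one stylesheet to one input document and returns the one output. */
    static String apply(String stylesheet, String document) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        // transform(Source, Result): a single input and a single output.
        transformer.transform(new StreamSource(new StringReader(document)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String stylesheet =
                "<xsl:stylesheet version=\"1.0\""
              + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
              + "<xsl:output method=\"text\"/>"
              + "<xsl:template match=\"/\">"
              + "<xsl:value-of select=\"count(//item)\"/>"
              + "</xsl:template>"
              + "</xsl:stylesheet>";
        // Counts the item elements of the input document.
        System.out.println(apply(stylesheet, "<doc><item/><item/></doc>"));
    }
}
```

A dispatcher or a merger, which by definition has several outputs or several inputs, cannot be expressed directly through this call.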

One advantage of sharing the call interface would be the ability to call a particular system as a plug-in in addition to using it as a processing tool (e.g., Cocoon could delegate only a few of its basic tasks to Lagoon). This requires these systems to be reentrant.

Bibliography

[Armstrong 2001a]
Eric Armstrong, Working with XML: the Java API for XML Parsing (JAXP) Tutorial, 2001 http://java.sun.com/xml/jaxp-docs-1.0.1/docs/tutorial/
[Clark 1999a]
James Clark (ed.), XSL transformations (XSLT) version 1.0, W3C Recommendation, 1999 http://www.w3.org/TR/xslt
[Euzenat 2001a]
Jérôme Euzenat and Laurent Tardif, Processing XML transformation flows, Proceedings of 2nd Extreme Markup Language conference, Montréal (CA), pp61-72, 2001
[McLaughlin 2000a]
Brett McLaughlin, Web Publishing Frameworks, in: Brett McLaughlin, Java and XML, O'Reilly and associates, Sebastopol (CA US), 2000, http://www.oreilly.com/catalog/javaxml/chapter/ch09.html
[Ståldal 2000a]
Mikael Ståldal, Presenting XML documents on different media with stylesheets, Master's thesis, KTH, Stockholm (SE), 2000

Appendix: available base transformations

Cocoon

This is the list found at http://xml.apache.org/cocoon/developing/extending.html.

Generators:

DirectoryGenerator
Generates an XML directory listing.
FileGenerator
Does the job of an XML parser: reads an XML file and outputs SAX events.
HTMLGenerator
Takes an HTML URL, makes an XHTML of it, and outputs the SAX events caused by this XHTML.
ImageDirectoryGenerator
An extension of DirectoryGenerators that adds extra attributes for image files.
PhpGenerator
Allows PHP to be used as a generator. Builds upon the PHP servlet functionality. Overrides the output method in order to pipe the results into SAX events.
RequestGenerator
[FIXME: This looks like just outputing the request headers, the request parameters and the configuration parameters. But I don't see any use of it (besides debugging and demonstration). Are there other situations in which you might want to use this?]
ServerPagesGenerator
Makes a Generator at compile time, based on the src file you define in the sitemap. This one is responsible for making your XSP pages work.
StatusGenerator
Generates an XML representation of the current status of Cocoon. This can be considered "for administration use", i.e. your application probably won't deal with this one.

Transformer:

LogTransformer
This is a class that can be plugged into a pipeline to print the SAX events which pass through this Transformer in a readable form to a file. This Transformer's main purpose is debugging.
SQLTransformer
Can be used for querying a SQL database.
XalanTransformer
Probably the most intuitive Transformer: it applies an XSL sheet to the SAX events it receives. It uses Xalan in the process.
XIncludeTransformer
To include other XML documents in your "XML document" (which at transformation time exists in SAX events).
XTTransformer
The same as XalanTransformer, but this one uses XT.

Serializer:

FOPSerializer
Make PDF files.
HTMLSerializer
Generate an HTML document.
LinkSerializer
Show the targets of the links in the document.
SVGSerializer
To construct an SVG.
TextSerializer
Generate a text document.
XMLSerializer
Generate an XML document.

Lagoon

source
void to XML, Parse source as XML
source (dir)
read
void to bytes
parse
Bytes to XML
process
Bytes to bytes
transform (xslt)
XML to XML
transform (split)
XML to XML
transform (lsp)
XML to XML
format (xml)
XML to bytes (as XML)
format (html)
XML to bytes (as HTML)
format (xhtml)
XML to bytes (as XHTML)
format (text)
XML to bytes (as text)
format (fo)
XML:FO to bytes (as PDF)
format (svg)
XML to bytes (as SVG)

Transmorpher

Generator (readfile)
void to XML (reads in a file)
Dispatcher (broadcast)
XML to multiple XML (copies)
External (xslt)
XML to XML (uses any XSLT processor)
Query (tmq)
XML to XML (Transmorpher query language=Xpath)
Ruleset
XML to XML: not extensible
Process
XML to XML: not extensible
Merger (concat)
Multiple XML to XML (concatenates the content of both files under the first one's root node).
Serializer (writefile)
XML to void (writes in a file)
Serializer (xsltserial)
XML to void (writes in a file through an XSLT stylesheet; not available but necessary)

http://transmorpher.inrialpes.fr/docs/compare.html

$Id: compare.html,v 1.3 2002/04/11 16:41:23 jerome Exp $