While migrating a big XSLT 3 project to plain C# we ran into a case that was not obvious to resolve.
The documents we process range from tiny to moderate in size. Stored as XML they take from virtually zero to, say, 10-20 MB.
In C# we can rewrite XSLT code virtually one-to-one using standard features like XDocument, LINQ, regular classes, built-in collections, and so on. Clearly C# has a richer repertoire, so the task is easily solved; the difficulty is choosing among the several ways to solve it.
The simplest solution is to use the XDocument API to represent data at runtime, and to use LINQ to query it. All features like XSLT keys, templates, functions, XPath sequences, arrays, maps and primitive types map naturally onto the C# language and its APIs.
Having rewritten several XSLT transformations, we could see that the XSLT to C# rewrite is rather straightforward and produces recognizable functional programs whose source size is close to that of the original XSLT. As a bonus C# lets you write code in an asynchronous way, so C# wins in runtime scalability and in design-time support.
But can you do it better in C#, especially when some data has well defined xml schemas?
The natural step, in our opinion, would be to produce a plain C# object model from the XML schema and use it for runtime processing.
Fortunately .NET has XML serialization attributes and tools to produce classes from XML schemas. With a small effort we created a relevant class hierarchy for a rather big XML schema. XmlSerializer is used to convert the object model to and from XML through XmlReader and XmlWriter. So, we get a typed replacement for the generic XDocument that still supports the same LINQ API over collections of objects, and takes less memory at runtime.
The next step was to run a simple test like:
- read object model;
- transform it;
- write it back.
We created such tests both for the XDocument and for the object model cases, and compared results from different perspectives.
Both solutions produce very similar code, which is also similar to the original XSLT both in style and size.
The object model has static typing, which is much easier to support.
But the most unexpected outcome is that the object model was up to 20% slower due to serialization and deserialization, even with pregenerated XmlSerializer assemblies. The difference in transformation performance and memory consumption was so small that it can be neglected. These results were confirmed with multiple tests, with multiple cycles including warm-up cycles.
Here we ran into a case where static typing harms more than it helps. Because of the nature of our processing pipeline, which is offline batch, this difference can translate into tens of minutes or even more.
Thus in this particular case we decided to stay with runtime typing as the more performant way of processing in C#.
XSLT is often thought of as a tool that takes input XML and runs a transformation to produce HTML or some XML on output. Our use case is more complex, and is closer to data mining of big data in batch. Our transformation pipelines often take an hour or more to run even with SSD disks and with CPU cores fully loaded with work.
So, we're looking for performance opportunities, and xml vs json might be promising.
Here are our hypotheses:
- json is lighter than xml to serialize and deserialize;
- json stored as map(*), array(*) and other item() values is lighter than node() at runtime, in particular subtree copy is zero cost in json;
- templates with match patterns can be implemented efficiently with maps;
- there is an incremental way forward from use of xml to use of json.
If it pays off we might switch from the xml format to json all over, even though it is a development effort.
But to proceed we need to run an experiment to measure processing speed of xml vs json in xslt.
Now our task is to find an isolated small representative sample to prove or reject our hypotheses.
Better to start off with some existing transformation, and change it from use of xml to json.
The question is whether there is such a candidate.
We are not sure how useful our XSLT Graph exercises are, but what we are sure of is that they stress different parts of the Saxon XSLT engine and help to find and resolve different bugs.
While implementing the biconnected components algorithm we incidentally ran into an internal error in Saxon 10.1 with a rather simple XSLT:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:array="http://www.w3.org/2005/xpath-functions/array"
exclude-result-prefixes="xs array">
<xsl:template match="/">
<xsl:sequence select="
array:fold-left
(
[8, 9],
(),
function($first as item(), $second as item())
{
min(($first, $second))
}
)"/>
</xsl:template>
</xsl:stylesheet>
More detail can be found at Saxon's issue tracker: Bug #4578: NullPointerException when array:fold-left|right $zero argument is an empty sequence.
The bug was promptly resolved.
While working on an algorithm to trace biconnected components for the Graph API in XSLT we realized that we had implemented it unconventionally.
The pseudocode in Wikipedia is:
GetArticulationPoints(i, d)
    visited[i] := true
    depth[i] := d
    low[i] := d
    childCount := 0
    isArticulation := false
    for each ni in adj[i] do
        if not visited[ni] then
            parent[ni] := i
            GetArticulationPoints(ni, d + 1)
            childCount := childCount + 1
            if low[ni] ≥ depth[i] then
                isArticulation := true
            low[i] := Min (low[i], low[ni])
        else if ni ≠ parent[i] then
            low[i] := Min (low[i], depth[ni])
    if (parent[i] ≠ null and isArticulation) or (parent[i] = null and childCount > 1) then
        Output i as articulation point
That algorithm is based on the fact that a connected graph can be represented as a tree of biconnected components. The vertices that join the components of such a tree are called articulation points. The implementation deals with the depth of each vertex, and with a lowpoint parameter that is also related to vertex depth during Depth-First-Search.
Out of interest we approached the problem from a different perspective. A vertex is an articulation point if it has neighbors that cannot be combined into a path not containing this vertex. Like the classical algorithm we use Depth-First-Search to navigate the graph, but in contrast we collect cycles that pass through each vertex. If during the back pass of Depth-First-Search we find no cycle from a "child" to an "ancestor" of the vertex, then the vertex is necessarily an articulation point.
Here is pseudocode:
GetArticulationPoints(v, p) -> result
    index = index + 1
    visited[v] = index
    result = index
    articulation = p = null ? -1 : 0
    for each n in neighbors of v except p do
        if visited[n] = 0 then
            nresult = GetArticulationPoints(n, v)
            result = min(result, nresult)
            if nresult >= visited[v] then
                articulation = articulation + 1
        else
            result = min(result, visited[n])
    if articulation > 0 then
        Output v as articulation point
The complexity of the two algorithms is the same.
What is interesting is that we see no obvious way to transform one algorithm into the other, other than starting again from graph theory.
More is on Wiki.
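A minimal Java sketch of the second pseudocode may make the bookkeeping easier to follow. It assumes an undirected graph without parallel edges, given as adjacency lists of vertex names; the class `ArticulationPoints`, its `find` method and the sample graph are illustrative only and are not part of the XSLT Graph API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: names and graph representation are assumptions.
final class ArticulationPoints
{
  private final Map<String, List<String>> adjacency;
  // visited[v]: 1-based discovery index; a missing key plays the role of 0.
  private final Map<String, Integer> visited = new HashMap<>();
  private final Set<String> points = new TreeSet<>();
  private int index = 0;

  private ArticulationPoints(Map<String, List<String>> adjacency)
  {
    this.adjacency = adjacency;
  }

  // Returns the articulation points of every connected component.
  static Set<String> find(Map<String, List<String>> adjacency)
  {
    ArticulationPoints state = new ArticulationPoints(adjacency);

    for(String v: adjacency.keySet())
    {
      if (!state.visited.containsKey(v))
      {
        state.visit(v, null);
      }
    }

    return state.points;
  }

  // Returns the smallest discovery index reachable from the subtree of v.
  private int visit(String v, String p)
  {
    int order = ++index;
    int result = order;
    // A root needs two independent subtrees, any other vertex needs one.
    int articulation = p == null ? -1 : 0;

    visited.put(v, order);

    for(String n: adjacency.getOrDefault(v, List.of()))
    {
      if (n.equals(p))
      {
        continue;
      }

      Integer seen = visited.get(n);

      if (seen == null)
      {
        int nresult = visit(n, v);

        result = Math.min(result, nresult);

        // No cycle from the subtree of n climbs above v.
        if (nresult >= order)
        {
          articulation++;
        }
      }
      else
      {
        result = Math.min(result, seen);
      }
    }

    if (articulation > 0)
    {
      points.add(v);
    }

    return result;
  }

  public static void main(String[] args)
  {
    // Triangle a-b-c with a tail c-d-e.
    Map<String, List<String>> graph = Map.of(
      "a", List.of("b", "c"),
      "b", List.of("a", "c"),
      "c", List.of("a", "b", "d"),
      "d", List.of("c", "e"),
      "e", List.of("d"));

    System.out.println(find(graph)); // prints [c, d]
  }
}
```

The recursion returns the same value as `result` in the pseudocode, so the only state shared between calls is the discovery index and the `visited` map.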
Michael Kay's "A Proposal for XSLT 4.0" has sparked our interest in what could be added or changed in XSLT. So we decided to implement a Graph API purely in XSLT. Our goals were:
- to prove that it's possible to provide an efficient implementation of different graph algorithms in XSLT;
- to build the Graph API in such a way that an engine could provide native implementations of graph algorithms;
- to find through experiments what could be added to XSLT as a language.
At present we can confirm that the first two goals are reachable; and experiments have shown that XSLT could provide more help to make programs better, e.g. we have seen that the language could simplify coding of loops.
Graph algorithms are often expressed with while loops, e.g. "Dijkstra's algorithm" has:
    while Q is not empty:
        u ← vertex in Q with min dist[u]
The body is executed when the condition is satisfied, but the condition is affected by the body itself.
In xslt 3.0 we did this with simple recursion:
<xsl:template name="f:while" as="item()*">
<xsl:param name="condition" as="function(item()*) as xs:boolean"/>
<xsl:param name="action" as="function(item()*) as item()*"/>
<xsl:param name="next" as="function(item()*, item()*) as item()*"/>
<xsl:param name="state" as="item()*"/>
<xsl:if test="$condition($state)">
<xsl:variable name="items" as="item()*" select="$action($state)"/>
<xsl:sequence select="$items"/>
<xsl:call-template name="f:while">
<xsl:with-param name="condition" select="$condition"/>
<xsl:with-param name="action" select="$action"/>
<xsl:with-param name="next" select="$next"/>
<xsl:with-param name="state" select="$next($state, $items)"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
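For comparison, here is a hedged Java sketch of the same scheme: a generic loop driven by a condition, an action that emits items, and a function that derives the next state from the current state and the emitted items. The class `While` and the method `loop` are hypothetical names chosen to mirror the parameters of f:while; an iterative loop stands in for the recursion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative counterpart of the f:while template; names are assumptions.
final class While
{
  // Emits action(state) while condition(state) holds, then moves on to
  // next(state, items), exactly as the recursive template does.
  static <S, T> List<T> loop(
    Predicate<S> condition,
    Function<S, List<T>> action,
    BiFunction<S, List<T>, S> next,
    S state)
  {
    List<T> result = new ArrayList<>();

    while(condition.test(state))
    {
      List<T> items = action.apply(state);

      result.addAll(items);
      state = next.apply(state, items);
    }

    return result;
  }

  public static void main(String[] args)
  {
    // Count down from 3: the state is the counter, each step emits it.
    List<Integer> items = While.<Integer, Integer>loop(
      state -> state > 0,
      state -> List.of(state),
      (state, emitted) -> state - 1,
      3);

    System.out.println(items); // prints [3, 2, 1]
  }
}
```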
But here is the point. It could be done in a more comprehensible way, e.g. by letting xsl:iterate without select cycle until xsl:break is reached.
<xsl:iterate>
<xsl:param name="name" as="..." select="..."/>
<xsl:if test="...">
<xsl:break/>
</xsl:if>
...
</xsl:iterate>
So, what we propose is to make xsl:iterate/@select optional, and to change the behavior of the processor when the attribute is missing from a compilation error to a valid behavior.
This should not affect any existing valid XSLT 3.0 program.
Recently we read the article "A Proposal for XSLT 4.0", and thought it worth suggesting one more idea. We wrote a message to Michael Kay, the author of the proposal. Here is the exchange:
A&V
Historically xslt, xquery and xpath have dealt with trees. Nowadays it has become much more common to process graphs. Many tasks can be formulated in terms of graphs, and in particular any task processing trees is also a graph task.
I suggest to take a deeper look in this direction.
As an inspiration I may suggest to look at "P1709R2: Graph Library" - the C++ proposal.
Michael Kay
I have for many years found it frustrating that XML is confined to hierarchic relationships (things like IDREF and XLink are clumsy workarounds); also the fact that the arbitrary division of data into "documents" plays such a decisive role: documents should only exist in the serialized representation of the model, not in the model itself.
I started my career working with the Codasyl-defined network data model. It's a fine and very flexible data model; its downfall was the (DOM-like) procedural navigation language. So I've often wondered what one could do trying to re-invent the Codasyl model in a more modern idiom, coupling it with an XPath-like declarative access language extended to handle networks (graphs) rather than hierarchies.
I've no idea how close a reinvention of Codasyl would be to some of the modern graph data models; it would be interesting to see. The other interesting aspect of this is whether you can make it work for schema-less data.
But I don't think that would be an incremental evolution of XSLT; I think it would be something completely new.
A&V
I was not so radical in my thoughts.
Even the C++ API is not so radical, as it does not impose hard requirements on the internal graph representation but rather defines a template API that works both with third party representations (they even mention Fortran) and with several built-in implementations that use standard vectors.
Their strong point is in the algorithms provided as part of the library and not in the internal graph structure (I think the authors of that paper have not structured it in the best way). E.g. in the second part they list graph algorithms: Depth First Search (DFS); Breadth First Search (BFS); Topological Sort (TopoSort); Shortest Paths Algorithms; Dijkstra Algorithms; and so on.
If we try to map it to the xpath world then a graph on the API level might be represented as a user function or as a map of user functions.
On a storage level a user may implement a graph using a sequence of maps or a map of maps, or even using xdm elements.
So, my approach is evolutionary. In fact I suggest a pure API that could even be implemented now.
Michael Kay
Yes, there's certainly scope for graph-oriented functions such as closure($origin, $function) and is-reachable($origin, $function) and find-path($origin, $destination, $function) where we use the existing data model, treating any item as a node in a graph, and representing the arcs using functions. There are a few complications, e.g. what's the identity comparison between arbitrary items, but it can probably be done.
A&V
> There are a few complications, e.g. what's the identity comparison between arbitrary items, but it can probably be done.
One approach to address this is through definition of graph API. E.g. to define graph as a map (interface analogy) of functions, with equality functions, if required:
map
{
vertices: function(),
edges: function(),
value: function(vertex),
in-vertex: function(edge),
out-vertex: function(edge),
edges: function(vertex),
is-in-vertex: function(edge, vertex),
is-out-vertex: function(edge, vertex)
...
}
Not sure how far this will go but who knows.
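To make the idea of "arcs represented by functions" concrete, here is a small Java sketch of a closure-like function. It is only an illustration under stated assumptions: the names `GraphFunctions`, `closure` and `isReachable` are hypothetical, the origin is assumed to belong to its own closure, `isReachable` takes an explicit destination, and item identity is delegated to `equals`/`hashCode`, which is exactly the complication mentioned in the exchange.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Illustrative sketch: a graph is nothing but a function from an item
// to the items its outgoing arcs lead to.
final class GraphFunctions
{
  // All items reachable from origin, the origin included (an assumption),
  // in breadth-first discovery order.
  static <V> Set<V> closure(
    V origin,
    Function<V, ? extends Iterable<V>> arcs)
  {
    Set<V> seen = new LinkedHashSet<>();
    Deque<V> pending = new ArrayDeque<>();

    seen.add(origin);
    pending.add(origin);

    while(!pending.isEmpty())
    {
      V item = pending.remove();

      for(V next: arcs.apply(item))
      {
        // Identity of items is decided by equals() and hashCode().
        if (seen.add(next))
        {
          pending.add(next);
        }
      }
    }

    return seen;
  }

  static <V> boolean isReachable(
    V origin,
    V destination,
    Function<V, ? extends Iterable<V>> arcs)
  {
    return closure(origin, arcs).contains(destination);
  }

  public static void main(String[] args)
  {
    // A chain 1 -> 2 -> 3 -> 4 described purely by a function.
    Function<Integer, List<Integer>> arcs =
      n -> n < 4 ? List.<Integer>of(n + 1) : List.<Integer>of();

    System.out.println(closure(1, arcs));        // prints [1, 2, 3, 4]
    System.out.println(isReachable(3, 1, arcs)); // prints false
  }
}
```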
This story started half a year ago when Michael Kay, the author of the Saxon XSLT processor, was dealing with performance in a multithreaded environment. See Bug #3958.
The problem is like this.
Given XSLT:
<xsl:stylesheet exclude-result-prefixes="#all"
version="3.0"
xmlns:saxon="http://saxon.sf.net/"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template name="main">
<xsl:for-each saxon:threads="4" select="1 to 10">
<xsl:choose>
<xsl:when test=". eq 1">
<!-- Will take 10 seconds -->
<xsl:sequence select="
json-doc('https://httpbin.org/delay/10')?url"/>
</xsl:when>
<xsl:when test=". eq 5">
<!-- Will take 9 seconds -->
<xsl:sequence select="
json-doc('https://httpbin.org/delay/9')?url"/>
</xsl:when>
<xsl:when test=". eq 10">
<!-- Will take 8 seconds -->
<xsl:sequence select="
json-doc('https://httpbin.org/delay/8')?url"/>
</xsl:when>
</xsl:choose>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
The task is to implement the engine so as to achieve the best performance of a parallel for-each.
A naive implementation that distributes iterations evenly among threads will run into unfair load on threads, so some load balancing is required. That was the case with Saxon EE.
Michael Kay was trying to find the most elegant way to implement this and wrote the comment:
I can't help feeling that the answer to this must lie in using the Streams machinery, and Spliterators in particular. I've spent another hour or so reading all about Spliterators, and I have to confess I really don't understand the paradigm. If someone can enlighten me, please go ahead...
We decided to take up the challenge and to model the expected behavior using Streams. Here is our go:
import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.function.Consumer;
import java.util.function.Function;
public class Streams
{
public static class Item<T>
{
public Item(int index, T data)
{
this.index = index;
this.data = data;
}
int index;
T data;
}
public static void main(String[] args)
{
run(
"Sequential",
input(),
Streams::action,
Streams::output,
true);
run(
"Parallel ordered",
input().parallel(),
Streams::action,
Streams::output,
true);
run(
"Parallel unordered",
input().parallel(),
Streams::action,
Streams::output,
false);
}
private static void run(
String description,
Stream<Item<String>> input,
Function<Item<String>, String[]> action,
Consumer<String[]> output,
boolean ordered)
{
System.out.println(description);
long start = System.currentTimeMillis();
if (ordered)
{
input.map(action).forEachOrdered(output);
}
else
{
input.map(action).forEach(output);
}
long end = System.currentTimeMillis();
System.out.println("Execution time: " + (end - start) + "ms.");
System.out.println();
}
private static Stream<Item<String>> input()
{
return IntStream.range(0, 10).
mapToObj(i -> new Item<String>(i + 1, "Data " + (i + 1)));
}
private static String[] action(Item<String> item)
{
switch(item.index)
{
case 1:
{
sleep(10);
break;
}
case 5:
{
sleep(9);
break;
}
case 10:
{
sleep(8);
break;
}
default:
{
sleep(1);
break;
}
}
String[] result = { "data:", item.data, "index:", item.index + "" };
return result;
}
private synchronized static void output(String[] value)
{
boolean first = true;
for(String item: value)
{
if (first)
{
first = false;
}
else
{
System.out.print(' ');
}
System.out.print(item);
}
System.out.println();
}
private static void sleep(int seconds)
{
try
{
Thread.sleep(seconds * 1000);
}
catch(InterruptedException e)
{
throw new IllegalStateException(e);
}
}
}
We model three cases:
- "Sequential": slowest, single-threaded execution, with output:
data: Data 1 index: 1
data: Data 2 index: 2
data: Data 3 index: 3
data: Data 4 index: 4
data: Data 5 index: 5
data: Data 6 index: 6
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 9 index: 9
data: Data 10 index: 10
Execution time: 34009ms.
- "Parallel ordered": fast, multithreaded execution preserving order, with output:
data: Data 1 index: 1
data: Data 2 index: 2
data: Data 3 index: 3
data: Data 4 index: 4
data: Data 5 index: 5
data: Data 6 index: 6
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 9 index: 9
data: Data 10 index: 10
Execution time: 10019ms.
- "Parallel unordered": fastest, multithreaded execution not preserving order, with output:
data: Data 6 index: 6
data: Data 2 index: 2
data: Data 4 index: 4
data: Data 3 index: 3
data: Data 9 index: 9
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 5 index: 5
data: Data 10 index: 10
data: Data 1 index: 1
Execution time: 10001ms.
What we can add in conclusion is that an XSLT engine could try to decide automatically which approach to use, as many SQL engines do, and not force the developer to go into low-level engine details.
Recently we looked at how we solved the same task in different versions of XPath: 2.0, 3.0, and 3.1.
Suppose you have a sequence $items , and you want to call some function over each item of the sequence, and to return the combined result.
In XPath 2.0 this was solved like this:
for $item in $items return
f:func($item)
In XPath 3.0 this was solved like this:
$items!f:func(.)
And now with XPath 3.1 that defined an arrow operator => we attempted to write something as simple as:
$items=>f:func()
That is definitely not working, as it is the same as f:func($items) .
Next attempt was:
$items!=>f:func()
That does not even compile.
So, finally, working expression using => looks like this:
$items!(.=>f:func())
This looks like a step back compared to the XPath 3.0 variant.
More than that, the XPath grammar of the arrow operator forbids the use of predicates, axes or mapping operators, so this won't compile:
$items!(.=>f:func()[1])
$items!(.=>f:func()!something)
Our conclusion is that the arrow operator is a rather confusing addition to XPath.
XSLT 3.0 defines a feature called streamability: a technique to write XSLT code that is able to handle arbitrarily sized inputs.
This contrasts with conventional XSLT code (and XSLT engines) where inputs are completely loaded in memory.
To make code streamable a developer should declare her code as such, and the code should pass streamability analysis.
The goal is to define a subset of XSLT/XPath operations that allows input to be processed in one pass.
In a simple case it's indeed a simple task to verify that code is streamable, but the more complex your code is, the less trivial it is to see that it is streamable.
On the forums we have seen a lot of discussions where experts were trying to figure out whether a particular XSLT is streamable. At times it's a remarkably nontrivial task!
This, in our opinion, clearly shows that the feature is largely a failed attempt to inscribe an optimization technique into the XSLT spec.
The place of such optimization is in the implementation space, and not in the spec. An engine should attempt such an optimization and fall back to the traditional implementation when it cannot be applied.
The latest such example is: Getting SXST0060 "No streamable path found in expression" when trying to push a map with grounded nodes to a template of a streamable mode, where neither the author of the XSLT code nor the engine developers are sure that the code is streamable in the first place.
By the way, besides streamability there is another optimization technique that probably works in all SQL engines. When data does not fit into memory the engine may spill it to disk, thus trading memory pressure for disk access. So, why didn't such a technique find its way into the XSLT or SQL specs?
After 17 years of experience we still run into silly bugs in xslt (xpath in fact).
The latest one is related to the order of nodes produced by the ancestor-or-self axis.
Consider the code:
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:template match="/">
<xsl:variable name="data" as="element()">
<a>
<b>
<c/>
</b>
</a>
</xsl:variable>
<xsl:variable name="item" as="element()" select="($data//c)[1]"/>
<xsl:message select="$item!ancestor-or-self::*!local-name()"/>
<xsl:message select="$item!local-name(), $item!..!local-name(), $item!..!..!local-name()"/>
</xsl:template>
</xsl:stylesheet>
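We expected the first message to list the names in reverse document order (c, b, a), the same as the second message.
But the correct outcome of the first message is document order (a, b, c).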
Here is why:
ancestor-or-self::* is an AxisStep. From XPath §3.3.2:
[Definition: An axis step returns a sequence of nodes that are reachable from the context node via a specified axis. Such a step has two parts: an axis, which defines the "direction of movement" for the step, and a node test, which selects nodes based on their kind, name, and/or type annotation.] If the context item is a node, an axis step returns a sequence of zero or more nodes; otherwise, a type error is raised [err:XPTY0020]. The resulting node sequence is returned in document order.
For some reason we were thinking that a reverse axis produces its result in reverse order. It turns out that the reverse order applies only within a predicate of such an axis step.
See more at https://saxonica.plan.io/boards/3/topics/7312
XPath 3 has introduced syntactic sugar for string concatenation, so the following:
concat($a, $b)
can now be written as:
$a || $b
This is a nice addition, except when you run into trouble. Being rooted in the C world we unintentionally wrote the following xslt code:
<xsl:if test="$a || $b">
...
</xsl:if>
Clearly, we intended to write $a or $b . In contrast $a || $b is evaluated as concat($a, $b) . If both variables are false() we get the string 'falsefalse', which has effective boolean value true() . This means that the test condition of xsl:if is always true() .
What can be done to avoid such an unfortunate typo, which shows up neither during compilation nor at runtime?
The answer is to issue an informational message during compilation, e.g. if the result of the || operator is converted to a boolean, and its arguments are also booleans, then chances are high that this is a typo and not an intentional expression.
We suggested implementing such a message in the Saxon processor (see https://saxonica.plan.io/boards/3/topics/7305).
It seems we've found a discrepancy in the regex implementation during transformation in Saxon.
Consider the following xslt:
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:template match="/">
<xsl:variable name="text" as="xs:string"
select="'A = &quot;a&quot; OR B = &quot;b&quot;'"/>
<xsl:analyze-string regex="&quot;(\\&quot;|.)*?&quot;" select="$text">
<xsl:matching-substring>
<xsl:message>
<xsl:sequence select="regex-group(0)"/>
</xsl:message>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
versus JavaScript:
<html>
<body>
<script>
var text = 'A = "a" OR B = "b"';
var regex = /"(\\"|.)*?"/;
var match = text.match(regex);
alert(match[0]);
</script>
</body>
</html>
XSLT produces: "a" OR B = "b"
while JavaScript produces: "a"
What is interesting is that we're certain this worked correctly in Saxon several years ago.
You can track progress of the bug at: https://saxonica.plan.io/boards/3/topics/7300 and at https://saxonica.plan.io/issues/3902.
We found that there is a Saxon HE update that was going to fix the problems we mentioned in the previous post, and decided to give it a second chance.
Now Saxon fails with two other errors:
We shall be waiting for the fixes. Meanwhile we're back to version 9.7.
Finally, Saxon 9.8 is out!
This means that basic xslt 3 is available in the HE version.
Update: as usual, each new release has new bugs...
See https://saxonica.plan.io/boards/3/topics/6809
We have found that Saxon HE 9.7.0-18 has finally exposed partial support for map and array item types. So, now you can encapsulate your data in maps and arrays rather than keeping it in a single flat sequence and treating odd and even items specially.
Basic example is:
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="t"
xmlns:map="http://www.w3.org/2005/xpath-functions/map"
exclude-result-prefixes="xs t map">
<xsl:template match="/">
<xsl:variable name="map" as="map(xs:string, xs:string)" select="
map
{
'Su': 'Sunday',
'Mo': 'Monday',
'Tu': 'Tuesday',
'We': 'Wednesday',
'Th': 'Thursday',
'Fr': 'Friday',
'Sa': 'Saturday'
}"/>
<xsl:message select="map:keys($map)"/>
</xsl:template>
</xsl:stylesheet>
A list of map functions can be found here http://www.w3.org/2005/xpath-functions/map/, though not all are available, as Saxon HE still does not allow inline functions.
P.S. From the development perspective it's a great harm that Saxon HE is so limited. Basically limited to xslt 2.0 + some selected parts of 3.0.
Lately we do not program in XSLT very often but rather in Java, C#, SQL and JavaScript, but from time to time we have tasks in XSLT.
People claim that those languages are too different and use this argument to explain why XSLT is only a niche language. We, on the other hand, often spot similarities between them.
So, what is it in other languages that corresponds to tunnel parameters in XSLT?
To get an answer we went over how they work in XSLT. You:
- define a template with parameters marked as tunnel="yes";
- use these parameters the same way as regular parameters;
- pass template parameters down to other templates marking them as tunnel="yes".
The important difference between regular template parameters and tunnel parameters is that tunnel parameters are implicitly passed down the call chain of templates. This means that you:
- define your API that is expected to receive some parameters;
- pass these parameters somewhere high in the stack, or override them later in the call chain;
- do not bother to propagate them (you might not even know all of the tunnel parameters passed, so encapsulation is in action).
As a result we have a template with some parameters passed explicitly, while others receive values from somewhere else, usually not from the direct caller. It's possible to say that these tunnel parameters are injected into a template call. This closely resembles injection APIs in other languages, where you configure that some parameters are prepared for you by some container rather than by the direct caller.
Now that we have expressed this idea it seems obvious, but before we thought of it we did not realize that tunnel parameters in XSLT and Dependency Injection in other languages are the same thing.
Recently we have found and fixed a bug in unreachable statement optimization in jxom.
Latest version of stylesheets can be found at github.com languages-xom.
Good, bad and good news.
- Good: recently a new version of the Saxon XSLT processor was published:
  12 May 2016
  Saxon 9.7.0.5 maintenance release for Java and .NET.
- Bad: we ran that release on our code base and found a bug:
  See Internal error in Saxon-HE-9.7.0-5
- Good: Michael Kay has confirmed the problem and even fixed it:
  See Bug #2770
- The only missing ingredient is when the patch will be available to the public:
  "We tend to do a new maintenance release every 4-6 weeks. Can't commit to firm dates."
The visitor pattern is often used to separate an operation from the object graph it operates on. Here we assume that the reader is familiar with the subject.
The idea is like this:
- The operation over the object graph is implemented as a type called Visitor. The Visitor defines methods for each type of object in the graph, which are called during traversal of the graph.
- Traversal of the graph is implemented by a type called Traverser, or by the Visitor, or by each object type in the graph.
The implementation should collect, aggregate or perform other actions while visiting objects in the graph, so that at the end of the visit the purpose of the operation is complete.
Such an implementation is push-like: you create an operation object and call a method that gets the object graph on input and returns the operation result on output.
In the past we often dealt with big graphs (usually these are virtual graphs backed by a database or a file system).
Also, having strong experience in XSLT, we see that the visitor pattern in OOP maps directly onto the xsl:template and xsl:apply-templates technique.
Another thought was that in XML processing there are two camps:
- SAX (push-like) - those who process xml in callbacks, which is very similar to the visitor pattern; and
- XML Reader (pull-like) - those who pull xml components from a source, and then iterate and process them.
As with SAX vs XML Reader or, more generally, push vs pull processing models, there is no single best one. One or the other is preferable in particular circumstances. E.g. a pull-like component fits into a transformation pipeline where one pull component has another as its source; another example is when one needs to process two sources at once, which is nontrivial with a push-like model. On the other hand push processing fits better into the Reduce part of the MapReduce pattern, where you need to accumulate results from a source.
So, our idea was to complement the classic push-like visitor pattern with an example of a pull-like implementation.
For the demonstration we selected the Java language, and the simplest boolean expression calculator.
Please follow GitHub nesterovsky-bros/VisitorPattern to see the detailed explanation.
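Independently of that repository, the push and pull styles can be contrasted in a few lines. The sketch below is not the boolean expression calculator from the repository; it assumes a hypothetical `Node` tree and shows the same pre-order traversal once as a callback-driven visit and once as an iterator that the caller drives.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch; all names here are assumptions made for the example.
final class PushPull
{
  // A hypothetical tree node used only for this illustration.
  static final class Node
  {
    final String name;
    final List<Node> children;

    Node(String name, Node... children)
    {
      this.name = name;
      this.children = List.of(children);
    }
  }

  // Push: the traversal is in control and calls the visitor back.
  static void push(Node node, Consumer<Node> visitor)
  {
    visitor.accept(node);

    for(Node child: node.children)
    {
      push(child, visitor);
    }
  }

  // Pull: the caller is in control and asks for the next node on demand.
  static Iterator<Node> pull(Node root)
  {
    Deque<Node> stack = new ArrayDeque<>();

    stack.push(root);

    return new Iterator<Node>()
    {
      @Override
      public boolean hasNext()
      {
        return !stack.isEmpty();
      }

      @Override
      public Node next()
      {
        Node node = stack.pop();

        // Push children in reverse so that the first child is visited first.
        for(int i = node.children.size() - 1; i >= 0; i--)
        {
          stack.push(node.children.get(i));
        }

        return node;
      }
    };
  }

  public static void main(String[] args)
  {
    Node tree = new Node("a", new Node("b", new Node("c")), new Node("d"));
    List<String> pushed = new ArrayList<>();
    List<String> pulled = new ArrayList<>();

    push(tree, node -> pushed.add(node.name));

    for(Iterator<Node> i = pull(tree); i.hasNext();)
    {
      pulled.add(i.next().name);
    }

    System.out.println(pushed); // prints [a, b, c, d]
    System.out.println(pulled); // prints [a, b, c, d]
  }
}
```

With the pull form two trees can be walked side by side by holding two iterators, which is the case that is awkward to express with callbacks.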
Essence of the problem (see Error during transformation in Saxon 9.7, thread on forum):
- An XPath engine may arbitrarily reorder predicates whose expressions do not depend on a context position.
- While an XPath expression $N[@x castable as xs:date][xs:date(@x) gt xs:date("2000-01-01")] cannot raise an error if it's evaluated from left to right, an expression with reordered predicates $N[xs:date(@x) gt xs:date("2000-01-01")][@x castable as xs:date] may generate an error when @x is not an xs:date .
To avoid a potential problem one should rewrite the expression like this: $N[if (@x castable as xs:date) then xs:date(@x) gt xs:date("2000-01-01") else false()] .
Please note that the following rewrite will not work: $N[(@x castable as xs:date) and (xs:date(@x) gt xs:date("2000-01-01"))] , as the arguments of an and expression can be evaluated in any order, and an error that occurs during evaluation of any argument may be propagated.
With these facts we faced the task of checking our code base and fixing possible problems.
A search found ~450 instances of XPath expressions that use two or more consecutive predicates. Careful analysis limited this to ~20 instances that should be rewritten. But then, all of a sudden, we decided to run an experiment. What if we split an XPath expression into two subexpressions? Can the error still resurface?
Consider:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:variable name="elements" as="element()+"><a/><b value="c"/></xsl:variable>
<xsl:template match="/">
<xsl:variable name="a" as="element()*" select="$elements[self::d or self::e]"/>
<xsl:variable name="b" as="element()*" select="$a[xs:integer(@value) = 1]"/>
<xsl:sequence select="$b"/>
</xsl:template>
</xsl:stylesheet>
As we expected, Saxon 9.7 internally assembles a final XPath with two predicates and reorders them. As a result we get an error:
Error at char 20 in xsl:variable/@select on line 8 column 81 of Saxon9.7-filter_speculation.xslt:
FORG0001: Cannot convert string "c" to an integer
This turn of events greatly complicates the code review we have to carry out.
Michael Kay's answer to this example:
I think your argument that the reordering is inappropriate when the expression is written using variables is very powerful. I shall raise the question with my WG colleagues.
In fact we think that either reordering of predicates is inappropriate, or (a weaker option that still allows reordering) an error during evaluation of a predicate expression should be treated as false() . This is what is done in XSLT patterns. Other solutions make XPath less intuitive.
In other words, we should use XPath (the language) to express ideas, and the engine should implement them correctly and efficiently. We should not be forced to rewrite expressions to please an implementation.
On December 30 we opened a thread in the Saxon help forum that shows a stylesheet generating an error. This is the stylesheet:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:variable name="elements" as="element()+"><a/><b value="c"/></xsl:variable>
<xsl:template match="/">
<xsl:sequence select="$elements[self::d or self::e][xs:integer(@value) = 1]"/>
</xsl:template>
</xsl:stylesheet>
We get an error:
Error at char 47 in xsl:sequence/@select on line 7 column 83 of Saxon9.7-filter_speculation.xslt:
FORG0001: Cannot convert string "c" to an integer
Exception in thread "main" ; SystemID: .../Saxon9.7-filter_speculation.xslt; Line#: 7; Column#: 47
ValidationException: Cannot convert string "c" to an integer
at
...
It's interesting that the error happens in Saxon 9.7 but not in earlier versions.
The answer we got was expected but disheartening:
The XPath specification (section 2.3.4, Errors and Optimization) explicitly allows the predicates of a filter expression to be reordered by an optimizer. See this example, which is very similar to yours:
The expression in the following example cannot raise a casting error if it is evaluated exactly as written (i.e., left to right). Since neither predicate depends on the context position, an implementation might choose to reorder the predicates to achieve better performance (for example, by taking advantage of an index). This reordering could cause the expression to raise an error.
$N[@x castable as xs:date][xs:date(@x) gt xs:date("2000-01-01")]
Following the spec, Michael Kay advises us to rewrite the XPath:
$elements[self::d or self::e][xs:integer(@value) = 1]
like this:
$elements[if (self::d or self::e) then xs:integer(@value) = 1 else false()]
Such subtleties make XPath hard to reason about and to teach. We doubt many people will spot the difference immediately.
We think that if this optimization was so important to the spec writers, they should have changed the filter rules to treat failed predicates as false() . This would remove any obscure difference between these two otherwise equal expressions. In fact, something similar already exists with templates, where a failed evaluation of a pattern is treated as a non-match.
It's time to align csharpxom with the latest version of C#. The article New Language Features in C# 6 sums up what's being added.
Sources can be found at nesterovsky-bros/languages-xom, and the C# model is in the csharp folder.
In general we feel hostile to any new features until they prove they bring added value. So, here is our list of new features, from most to least useless:
String interpolation
var s = $"{p.Name} is {p.Age} year{{s}} old";
This is useless, as it does not account for resource localization.
Null-conditional operators
int? first = customers?[0].Orders?.Count();
They claim to reduce the clutter of null checks, but in our opinion the effect is the opposite. It's better to get a NullReferenceException if arguments are wrong.
Exception filters
private static bool Log(Exception e) { /* log it */ ; return false; }
…
try { … } catch (Exception e) when (Log(e)) {}
"It is also a common and accepted form of “abuse” to use exception filters for side effects; e.g. logging."
Designing a feature for abuse just does not taste good.
Expression-bodied function and property members.
public Point Move(int dx, int dy) => new Point(x + dx, y + dy);
public string Name => First + " " + Last;
Not sure it's that useful.
Considering that we have used Saxon for many years, it was strange to run into an error as simple as the following:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<xsl:variable name="doc" as="element()+"><a/><b/><c/></xsl:variable>
<xsl:sequence select="$doc = 3"/>
</xsl:template>
</xsl:stylesheet>
This is a simplified case that should produce a dynamic error FORG0001 as per General Comparisons; the real code is more complex, as it uses SFINAE and continues.
This case crashes Saxon with an exception:
Exception in thread "main" java.lang.RuntimeException:
Internal error evaluating template at line 3 in module ICE9.6.xslt
at net.sf.saxon.expr.instruct.Template.applyLeavingTail()
at net.sf.saxon.trans.Mode.applyTemplates()
at net.sf.saxon.Controller.transformDocument()
at net.sf.saxon.Controller.transform()
at net.sf.saxon.s9api.XsltTransformer.transform()
at net.sf.saxon.jaxp.TransformerImpl.transform()
...
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString()
at java.lang.Long.parseLong()
at java.lang.Long.parseLong()
at net.sf.saxon.expr.GeneralComparison.quickCompare()
at net.sf.saxon.expr.GeneralComparison.compare()
at net.sf.saxon.expr.GeneralComparison.evaluateManyToOne()
at net.sf.saxon.expr.GeneralComparison.evaluateItem()
at net.sf.saxon.expr.GeneralComparison.evaluateItem()
at net.sf.saxon.expr.Expression.process()
at net.sf.saxon.expr.instruct.Template.applyLeavingTail()
... 8 more
We have reported the problem at Saxon's forum, and as usual the problem was shortly resolved.
Much time has passed since we last fixed or extended Languages Xml Object Model.
But now we need to manipulate and generate javascript programs.
Though xslt today is a niche language rather than a language of choice, it still fits code generation and transformation tasks very well.
So, we're pleased to announce ECMAScript Xml Object Model, which includes:
All sources are available at github: https://github.com/nesterovsky-bros/languages-xom
After investigation we have found that Saxon 9.6 HE does not support xslt 3.0, contrary to what we assumed earlier.
On the Saxonica site it's written: "Support for XQuery 3.0 and XPath 3.0 (now Recommendations) has been added to the open-source product."
As one can notice, xslt is not mentioned.
More details are on open-source 3.0 support.
:-(
The new release of Saxon HE (version 9.6) claims basic support of xslt 3.0. So we're eager to test it but... errors happen. See the error reports at Error in SaxonHE9-6-0-1J and Bug #2160.
As with the previous release (see Exception during execution in Saxon-HE-9.5.1-6), we bumped into an internal error of the engine.
We expect to see an update very soon, and to continue with the long-awaited xslt 3.0.
Here is an argument for the discussion of open source vs commercial projects: open source projects with a rich community may benefit, as problems are detected promptly, while commercial projects risk living with more unnoticed bugs.
With Saxon 9.6 we can finally play with open source xslt 3.0
It's sad that it took so much time to make it available.
See Saxonica's home page to get details.
These days we're not active xslt developers, though we still consider xslt and xquery an important part of our personal experience and self-education.
Besides, we have a pretty large xslt code base that is and will be in use. We think xslt/xquery is in use in many other big and small projects, thus they have a strong position as niche languages.
Thus we think it's important to help those who support xslt/xquery engines.
That's what we're regularly doing (thanks to our code base).
Now, to the problem we have just found. Please consider the code:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="this"
exclude-result-prefixes="xs t">
<xsl:template match="/">
<xsl:param name="new-line-text" as="xs:string" select="'
'"/>
<xsl:variable name="items" as="item()*" select="'select', $new-line-text"/>
<xsl:message select="t:string-join($items)"/>
</xsl:template>
<!--
Joins adjacent string items.
$items - items to join.
Returns joined items.
-->
<xsl:function name="t:string-join" as="item()*">
<xsl:param name="items" as="item()*"/>
<xsl:variable name="indices" as="xs:integer*" select="
0,
index-of
(
(
for $item in $items return
$item instance of xs:string
),
false()
),
count($items) + 1"/>
<xsl:sequence select="
for $i in 1 to count($indices) - 1 return
(
$items[$indices[$i]],
string-join
(
subsequence
(
$items,
$indices[$i] + 1,
$indices[$i + 1] - $indices[$i] - 1
),
''
)
)"/>
</xsl:function>
</xsl:stylesheet>
The output is:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
at net.sf.saxon.om.Chain.itemAt(Chain.java:161)
at net.sf.saxon.om.SequenceTool.itemAt(SequenceTool.java:130)
at net.sf.saxon.expr.FilterExpression.iterate(FilterExpression.java:1143)
at net.sf.saxon.expr.LetExpression.iterate(LetExpression.java:365)
at net.sf.saxon.expr.instruct.BlockIterator.next(BlockIterator.java:49)
at net.sf.saxon.expr.MappingIterator.next(MappingIterator.java:70)
at net.sf.saxon.expr.instruct.Message.processLeavingTail(Message.java:264)
at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:660)
at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1057)
at net.sf.saxon.Controller.transformDocument(Controller.java:2088)
at net.sf.saxon.Controller.transform(Controller.java:1911)
...
The problem is reported at: Exception during execution in Saxon-HE-9.5.1-6 and also tracked at https://saxonica.plan.io/issues/2104.
Update: according to Michael Kay the issue is fixed. See note #4:
I have committed a patch to Chain.itemAt() on the 9.5 and 9.6 branches to check for index<0
Among the proposed new features (other than Maps and Arrays) in XPath 3.1 we like the arrow operator (=>).
It's defined like this:
[Definition: An arrow operator is a postfix operator that applies a function to an item, using the item as the first argument to the function.] If $i is an item and f() is a function, then $i=>f() is equivalent to f($i) , and $i=>f($j) is equivalent to f($i, $j) .
This syntax is particularly helpful when conventional function call syntax is unreadable, e.g. when applying multiple functions to an item. For instance, the following expression is difficult to read due to the nesting of parentheses, and invites syntax errors due to unbalanced parentheses:
tokenize((normalize-unicode(upper-case($string))),"\s+")
Many people consider the following expression easier to read, and it is much easier to see that the parentheses are balanced:
$string=>upper-case()=>normalize-unicode()=>tokenize("\s+")
What does it look like?
Right! It's like extension methods in C#.
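To make the analogy concrete, here is a minimal C# sketch. It is only an illustration: the class StringSteps and its methods UpperCase, NormalizeUnicode and Tokenize are hypothetical names chosen to mirror the XPath functions above, and Regex.Split is not an exact equivalent of XPath tokenize().

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helpers named after the XPath functions used above.
public static class StringSteps
{
  // "this" on the first parameter makes the method callable as value.UpperCase().
  public static string UpperCase(this string value)
  {
    return value.ToUpperInvariant();
  }

  public static string NormalizeUnicode(this string value)
  {
    return value.Normalize();
  }

  public static string[] Tokenize(this string value, string pattern)
  {
    return Regex.Split(value, pattern);
  }
}

public static class ArrowDemo
{
  public static void Main()
  {
    var text = "arrow operator demo";

    // Nested calls, like tokenize(normalize-unicode(upper-case($string)), "\s+").
    var nested = StringSteps.Tokenize(
      StringSteps.NormalizeUnicode(StringSteps.UpperCase(text)),
      @"\s+");

    // Chained calls, like $string=>upper-case()=>normalize-unicode()=>tokenize("\s+").
    var chained = text.UpperCase().NormalizeUnicode().Tokenize(@"\s+");

    Console.WriteLine(string.Join("|", nested));
    Console.WriteLine(string.Join("|", chained));
  }
}
```

Both forms call the same static methods; as with the arrow operator, the chained form only changes where the first argument is written.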
A while ago we created a set of xml schemas and xslt to represent different
languages as xml, and to generate source from that xml. This way we can
represent and generate java, c#, cobol, and several sql dialects (read about
languages xom on this site).
Here, we'd like to describe a nuisance we had with the sql dialects schema.
Our goal was to define a basic sql schema, and dialect extensions. This way we
hoped to express both general and dialect specific constructs. So, let's consider an
example.
General:
-- Select one row
select * from A
DB2:
select * from A fetch first row only
T-SQL:
select top 1 * from A
Oracle:
select * from A where rownum = 1
All these queries share a common core syntax, while each has dialect
specific means to express the intention to return the first row only.
Down in the xml schema, the basic select statement looks like this:
<xs:complexType name="select-statement">
<xs:complexContent>
<xs:extension base="full-select-statement">
<xs:sequence>
<xs:element name="columns" type="columns-clause"/>
<xs:element name="from" type="from-clause" minOccurs="0"/>
<xs:element name="where" type="unary-expression" minOccurs="0"/>
<xs:element name="group-by" type="expression-list" minOccurs="0"/>
<xs:element name="having" type="unary-expression" minOccurs="0"/>
<xs:element name="order-by" type="order-by-clause" minOccurs="0"/>
</xs:sequence>
<xs:attribute name="specification" type="query-specification"
use="optional" default="all"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Here all is relatively clear. The generic select looks like:
<sql:select>
<sql:columns>
<sql:column
wildcard="true"/>
</sql:columns>
<sql:from>
<sql:table name="A"/>
</sql:from>
</sql:select>
But how would you define dialect specifics?
E.g. for T-SQL we would like to see a markup:
<sql:select>
<tsql:top>
<sql:number value="1"/>
</tsql:top>
<sql:columns>
<sql:column
wildcard="true"/>
</sql:columns>
<sql:from>
<sql:table name="A"/>
</sql:from>
</sql:select>
While for DB2 there should be:
<sql:select>
<sql:columns>
<sql:column
wildcard="true"/>
</sql:columns>
<sql:from>
<sql:table name="A"/>
</sql:from>
<db2:fetch-first rows="1"/>
</sql:select>
So, again the questions are:
- how to define a basic sql schema with the goal of extending it in the direction of
DB2 or T-SQL?
- how to define an xslt sql serializer that is also extensible?
Though we have tried several solutions to that problem, none is satisfactory enough.
To allow extensions we have defined that all elements in the sql schema are based on
sql-element , which allows extensions:
<xs:complexType name="sql-element" abstract="true">
<xs:sequence>
<xs:element ref="extension" minOccurs="0"
maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:element name="extension" type="extension"/>
<xs:complexType name="extension" abstract="true">
<xs:complexContent>
<xs:extension base="sql-element"/>
</xs:complexContent>
</xs:complexType>
...
<xs:element name="top" type="top-extension"
substitutionGroup="sql:extension"/>
<xs:complexType name="top-extension">
<xs:complexContent>
<xs:extension base="sql:extension">
<xs:sequence>
<xs:element ref="sql:expression"/>
</xs:sequence>
<xs:attribute name="percent" type="xs:boolean"
use="optional" default="false"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Unfortunately, this makes the schema too weakly typed for extensions, thus
IntelliSense suggests too many options.
If you deal with
web applications you have probably already dealt with exporting data to Excel.
There are several options to prepare data for Excel:
- generate CSV;
- generate HTML that excel understands;
- generate XML in Spreadsheet 2003 format;
- generate data using Open XML SDK or some other 3rd party libraries;
- generate data in XLSX format, according to Open XML specification.
You may find a good article with pros and cons of each solution
here. We, in our turn, would like to share our experience in this field. Let's start from requirements:
- Often we have to export huge data-sets.
- We should be able to format, parametrize and to apply different styles to the exported data.
- There are cases when exported data may contain more than one table per sheet or
even more than one sheet.
- Some exported data have to be illustrated with charts.
All these requirements led us to a solution based on XSLT processing of streamed data.
The advantage of this solution is that the result is forwarded to a client as soon as
XSLT starts to generate output. Such an approach is much more efficient than generating XLSX with the Open XML SDK or any other third party library, since it avoids keeping
huge data-sets in memory on the server side.
Another advantage is simple maintenance, as we achieve
a clear separation of the data and presentation layers. On each request to change formatting or
to apply another style to a cell, you just have to modify the xslt file(s) that generate
the variable parts of the XLSX.
As a result, our clients get XLSX files that conform to the Open XML specification.
We will describe the details of our implementation in upcoming posts.
Earlier we have shown
how to build a streaming xml reader from business data and have recalled
ForwardXPathNavigator, which helps to create
a streaming xslt transformation. Now we want to show how to stream content
produced with xslt out of a WCF service.
To achieve streaming in WCF one needs:
1. To configure the service to use streaming. A description of how to do this can be
found on the internet. See web.config of the sample
Streaming.zip for the details.
2. To create a service with a method returning Stream :
[ServiceContract(Namespace = "http://www.nesterovsky-bros.com")]
[AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)]
public class Service
{
[OperationContract]
[WebGet(RequestFormat = WebMessageFormat.Json)]
public Stream GetPeopleHtml(int count,
int seed)
{
...
}
}
3. To return a Stream from the xsl transformation.
Unfortunately (we mentioned it already), XslCompiledTransform writes its
output into an XmlWriter (or into an output Stream ) rather than exposing the result as
an XmlReader , while WCF takes an input stream and passes it to a client.
We could generate the xslt output into a file or a memory Stream and then return
that content as an input Stream , but this would defeat the goal of streaming, as
the client would start to get data no earlier than the xslt completed its
work. What we need instead is a pipe that connects the xslt output Stream to the input
Stream returned from WCF.
.NET implements pipe streams, so our task is trivial.
We have defined a utility method that creates an input Stream from a generator
populating an output Stream :
public static Stream GetPipedStream(Action<Stream> generator)
{
var output = new AnonymousPipeServerStream();
var input = new AnonymousPipeClientStream(
output.GetClientHandleAsString());
Task.Factory.StartNew(
() =>
{
using(output)
{
generator(output);
output.WaitForPipeDrain();
}
},
TaskCreationOptions.LongRunning);
return input;
}
We wrapped xsl transformation as such a generator:
[OperationContract]
[WebGet(RequestFormat = WebMessageFormat.Json)]
public Stream GetPeopleHtml(int count, int seed)
{
var context = WebOperationContext.Current;
context.OutgoingResponse.ContentType = "text/html";
context.OutgoingResponse.Headers["Content-Disposition"] =
"attachment;filename=reports.html";
var cache = HttpRuntime.Cache;
var path = HttpContext.Current.Server.MapPath("~/People.xslt");
var transform = cache[path] as XslCompiledTransform;
if (transform == null)
{
transform = new XslCompiledTransform();
transform.Load(path);
cache.Insert(path, transform, new CacheDependency(path));
}
return Extensions.GetPipedStream(
output =>
{
// We have a streamed business data.
var people = Data.CreateRandomData(count, seed, 0, count);
// We want to see it as streamed xml data.
using(var stream =
people.ToXmlStream("people", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
{
// XPath forward navigator is used as an input source.
transform.Transform(
new ForwardXPathNavigator(reader),
new XsltArgumentList(),
output);
}
});
}
This way we have built code that streams data directly from business data to a
client in the form of a report. A set of utility functions and classes helped us to
overcome .NET's limitations and to build simple code that one can easily
support.
The sources can be found at
Streaming.zip.
In the previous
post about streaming we stopped at the point where we have an XmlReader
in hand, which continuously gets data from an IEnumerable<Person>
source.
Now we shall recall ForwardXPathNavigator - a class we built
back in 2002, which adds streaming transformations to .NET's xslt processor.
While XslCompiledTransform is hopelessly obsolete, and no upgrade
is likely to follow, it's still among the fastest xslt 1.0 processors. With
ForwardXPathNavigator we add to this processor the ability to transform input data of arbitrary size.
We find it interesting that
xslt 3.0 Working Draft defines streaming processing in a way that closely
matches rules for ForwardXPathNavigator :
Streaming achieves two important objectives: it allows large documents to be transformed
without requiring correspondingly large amounts of memory; and it allows the processor
to start producing output before it has finished receiving its input, thus reducing
latency.
The rules for streamability, which are defined in detail in 19.3 Streamability
Analysis, impose two main constraints:
-
The only nodes reachable from the node that is currently being processed are its
attributes and namespaces, its ancestors and their attributes and namespaces, and
its descendants and their attributes and namespaces. The siblings of the node, and
the siblings of its ancestors, are not reachable in the tree, and any attempt to
use their values is a static error. However, constructs (for example, simple forms
of xsl:number , and simple positional patterns) that require knowledge
of the number of preceding elements by name are permitted.
-
When processing a given node in the tree, each descendant node can only be visited
once. Essentially this allows two styles of processing: either visit each of the
children once, and then process that child with the same restrictions applied; or
process all the descendants in a single pass, in which case it is not possible while
processing a descendant to make any further downward selection.
The only significant difference between ForwardXPathNavigator and
xslt 3.0 streaming is that we report violations of the streamability rules
at runtime, while xslt 3.0 attempts to perform this analysis at compile time.
Here is the C# code for the streamed xslt transformation:
var transform = new XslCompiledTransform();
transform.Load("People.xslt");
// We have a streamed business data.
var people = Data.CreateRandomData(10000, 0, 0, 10000);
// We want to see it as streamed xml data.
using(var stream =
people.ToXmlStream("people", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
using(var output = File.Create("people.html"))
{
// XPath forward navigator is used as an input source.
transform.Transform(
new ForwardXPathNavigator(reader),
new XsltArgumentList(),
output);
}
Notice how XmlReader is wrapped into ForwardXPathNavigator .
To complete the picture we need xslt that follows the streaming rules:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:d="http://www.nesterovsky-bros.com"
exclude-result-prefixes="msxsl d">
<xsl:output method="html" indent="yes"/>
<!-- Root template processed in the streaming mode. -->
<xsl:template match="/d:people">
<html>
<head>
<title>List of persons</title>
<style type="text/css">
.even
{
}
.odd
{
background: #d0d0d0;
}
</style>
</head>
<body>
<table border="1">
<tr>
<th>ID</th>
<th>First name</th>
<th>Last name</th>
<th>City</th>
<th>Title</th>
<th>Age</th>
</tr>
<xsl:for-each select="d:person">
<!--
Get element snapshot.
A snapshot allows arbitrary access to the element's content.
-->
<xsl:variable name="person">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="position" select="position()"/>
<xsl:apply-templates mode="snapshot" select="msxsl:node-set($person)/d:person">
<xsl:with-param name="position" select="$position"/>
</xsl:apply-templates>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
<xsl:template mode="snapshot" match="d:person">
<xsl:param name="position"/>
<tr>
<xsl:attribute name="class">
<xsl:choose>
<xsl:when test="$position mod 2 = 1">
<xsl:text>odd</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>even</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
<td>
<xsl:value-of select="d:Id"/>
</td>
<td>
<xsl:value-of select="d:FirstName"/>
</td>
<td>
<xsl:value-of select="d:LastName"/>
</td>
<td>
<xsl:value-of select="d:City"/>
</td>
<td>
<xsl:value-of select="d:Title"/>
</td>
<td>
<xsl:value-of select="d:Age"/>
</td>
</tr>
</xsl:template>
</xsl:stylesheet>
So, we started with streamed entity data, proceeded to a streamed
XmlReader and arrived at a streamed xslt transformation.
In the final post about streaming we shall recall a simple way of building
a WCF service that returns an html stream from our xslt transformation.
The sources can be found at
Streaming.zip.
For some reason neither .NET's XmlSerializer nor DataContractSerializer lets you
read serialized data through an XmlReader . These APIs work the other way round, writing data
into an XmlWriter . To get data through an XmlReader one needs to write it to some
destination like a file or a memory stream, and then to read it using an XmlReader .
This complicates streaming design considerably.
In fact the very same happens with other .NET APIs.
We think the reason why .NET designers preferred XmlWriter to XmlReader in
those APIs is that an XmlReader implementation is essentially a state machine, while
an XmlWriter implementation looks like a regular procedure. It's much harder to
manually write and to support correct state machine logic
than a procedure.
If history had gone a slightly
different way, and if yield return, lambdas, and the Enumerable API had appeared before
XmlReader and XmlWriter , then, we think, both these classes would have looked different.
An xml source would have been described with an IEnumerable<XmlEvent> instead of
an XmlReader , and an XmlWriter would have looked like a function receiving
an IEnumerable<XmlEvent> . Implementing an XmlReader would have meant creating an
enumerator. Yield return and the Enumerable API would have helped to implement it in
a procedural way.
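To illustrate this thought experiment, here is a minimal sketch of what such an API might look like. Everything in it is an assumption made for illustration: no XmlEvent type exists in .NET, and the names XmlEventKind, XmlEvents.FromNames and XmlEvents.Write are hypothetical. Only XmlWriter and its methods are real.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical event model; nothing like it exists in System.Xml.
public enum XmlEventKind { StartElement, Text, EndElement }

public sealed class XmlEvent
{
  public XmlEvent(XmlEventKind kind, string value)
  {
    Kind = kind;
    Value = value;
  }

  public XmlEventKind Kind { get; private set; }
  public string Value { get; private set; }
}

public static class XmlEvents
{
  // A "reader" written as a plain procedure: the compiler builds
  // the state machine behind yield return.
  public static IEnumerable<XmlEvent> FromNames(IEnumerable<string> names)
  {
    yield return new XmlEvent(XmlEventKind.StartElement, "people");

    foreach (var name in names)
    {
      yield return new XmlEvent(XmlEventKind.StartElement, "person");
      yield return new XmlEvent(XmlEventKind.Text, name);
      yield return new XmlEvent(XmlEventKind.EndElement, "person");
    }

    yield return new XmlEvent(XmlEventKind.EndElement, "people");
  }

  // A "writer" is just a function that consumes the events.
  public static void Write(IEnumerable<XmlEvent> events, XmlWriter writer)
  {
    foreach (var e in events)
    {
      switch (e.Kind)
      {
        case XmlEventKind.StartElement: writer.WriteStartElement(e.Value); break;
        case XmlEventKind.Text: writer.WriteString(e.Value); break;
        case XmlEventKind.EndElement: writer.WriteEndElement(); break;
      }
    }

    writer.Flush();
  }

  // Helper for the demo: renders the events as an xml string.
  public static string Render(IEnumerable<string> names)
  {
    var text = new StringWriter();
    var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

    using (var writer = XmlWriter.Create(text, settings))
    {
      Write(FromNames(names), writer);
    }

    return text.ToString();
  }
}

public static class XmlEventsDemo
{
  public static void Main()
  {
    Console.WriteLine(XmlEvents.Render(new[] { "Ann", "Bob" }));
  }
}
```

The producer is lazy: each person element is generated only when the consumer asks for the next event, which is the property a streaming design needs.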
But in our present we have to deal with the fact that DataContractSerializer
writes the data into an XmlWriter , so let's assume we have a project that
uses Entity Framework to access the database, and that you have a data class
Person , and a data access method GetPeople() :
[DataContract(Name = "person", Namespace = "http://www.nesterovsky-bros.com")]
public class Person
{
[DataMember] public int Id { get; set; }
[DataMember] public string FirstName { get; set; }
[DataMember] public string LastName { get; set; }
[DataMember] public string City { get; set; }
[DataMember] public string Title { get; set; }
[DataMember] public DateTime BirthDate { get; set; }
[DataMember] public int Age { get; set; }
}
public static IEnumerable<Person> GetPeople() { ... }
And your goal is to expose the result of GetPeople() as an XmlReader .
We achieve this in three simple steps:
- Define
JoinedStream - an input Stream implementation that
reads data from an enumeration of streams (IEnumerable<Stream> ).
- Build xml parts in the form of
IEnumerable<Stream> .
- Combine the parts into the final xml stream.
The code is rather simple, so here we quote its essential part:
public static class Extensions
{
public static Stream JoinStreams(this IEnumerable<Stream> streams, bool closeStreams = true)
{
return new JoinedStream(streams, closeStreams);
}
public static Stream ToXmlStream<T>(
this IEnumerable<T> items,
string rootName = null,
string rootNamespace = null)
{
return items.ToXmlStreamParts<T>(rootName, rootNamespace).
JoinStreams(false);
}
private static IEnumerable<Stream> ToXmlStreamParts<T>(
this IEnumerable<T> items,
string rootName = null,
string rootNamespace = null)
{
if (rootName == null)
{
rootName = "ArrayOfItems";
}
if (rootNamespace == null)
{
rootNamespace = "";
}
var serializer = new DataContractSerializer(typeof(T));
var stream = new MemoryStream();
var writer = XmlDictionaryWriter.CreateTextWriter(stream);
writer.WriteStartDocument();
writer.WriteStartElement(rootName, rootNamespace);
writer.WriteXmlnsAttribute("s", XmlSchema.Namespace);
writer.WriteXmlnsAttribute("i", XmlSchema.InstanceNamespace);
foreach(var item in items)
{
serializer.WriteObject(writer, item);
writer.WriteString(" ");
writer.Flush();
stream.Position = 0;
yield return stream;
stream.Position = 0;
stream.SetLength(0);
}
writer.WriteEndElement();
writer.WriteEndDocument();
writer.Flush();
stream.Position = 0;
yield return stream;
}
private class JoinedStream: Stream
{
public JoinedStream(IEnumerable<Stream> streams, bool closeStreams = true)
...
}
}
The use is even simpler:
// We have a streamed business data.
var people = GetPeople();
// We want to see it as streamed xml data.
using(var stream = people.ToXmlStream("persons", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
{
...
}
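The body of JoinedStream is not shown above. As a rough idea of what a stream over IEnumerable<Stream> involves, here is a minimal sketch under our own assumptions. The class name ConcatenatedStream is hypothetical, and the actual JoinedStream in Streaming.zip may differ. The sketch uses C# 7 expression-bodied members. It reads each part to its end before asking the enumerator for the next one, which matters when the producer reuses a single buffer between parts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical read-only stream that reads a sequence of streams one after another.
public sealed class ConcatenatedStream : Stream
{
  private readonly IEnumerator<Stream> parts;
  private readonly bool closeParts;
  private Stream current;

  public ConcatenatedStream(IEnumerable<Stream> parts, bool closeParts = true)
  {
    this.parts = parts.GetEnumerator();
    this.closeParts = closeParts;
  }

  public override int Read(byte[] buffer, int offset, int count)
  {
    while (count > 0)
    {
      if (current == null)
      {
        // Pull the next part lazily; the producer runs only on demand.
        if (!parts.MoveNext())
        {
          return 0;
        }

        current = parts.Current;
      }

      var read = current.Read(buffer, offset, count);

      if (read > 0)
      {
        return read;
      }

      // The current part is exhausted; release it and move on.
      if (closeParts)
      {
        current.Dispose();
      }

      current = null;
    }

    return 0;
  }

  public override bool CanRead => true;
  public override bool CanSeek => false;
  public override bool CanWrite => false;
  public override long Length => throw new NotSupportedException();

  public override long Position
  {
    get => throw new NotSupportedException();
    set => throw new NotSupportedException();
  }

  public override void Flush() { }
  public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
  public override void SetLength(long value) => throw new NotSupportedException();
  public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

  protected override void Dispose(bool disposing)
  {
    if (disposing)
    {
      if (closeParts && current != null) current.Dispose();
      parts.Dispose();
    }

    base.Dispose(disposing);
  }
}

public static class ConcatenatedStreamDemo
{
  private static IEnumerable<Stream> Parts(string[] chunks)
  {
    foreach (var chunk in chunks)
    {
      yield return new MemoryStream(Encoding.UTF8.GetBytes(chunk));
    }
  }

  // Joins the chunks through the stream and reads them back as text.
  public static string ReadAll(params string[] chunks)
  {
    using (var reader = new StreamReader(new ConcatenatedStream(Parts(chunks))))
    {
      return reader.ReadToEnd();
    }
  }

  public static void Main()
  {
    Console.WriteLine(ConcatenatedStreamDemo.ReadAll("<a>", "<b/>", "</a>"));
  }
}
```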
We have packed the sample into the project
Streaming.zip.
In the next post we're going to recall streaming processing in xslt.
Some time ago we took part in a project where 95% of all sources were xslt
2.0. It was a great experience for us.
The interesting part is that we used xslt in areas where we would never have expected
it in the
early 2000s. It crunched gigabytes of data offline, while earlier we
generally saw xslt applied in a browser or on a server as an engine to render data.
Web applications (both .NET and java) are our focus today, and it has become hard
to find an application for xslt or xquery.
Indeed, the client side now has very strong APIs: jquery, jqueryui, jsview,
jqgrid, kendoui, and so on. These libraries and today's browsers cover
developers' needs in building manageable applications. In contrast, native
support of xslt (at least v2) does not exist in browsers.
The server side at present is seen as a set of web services. These services
support both xml and json formats, and implement business logic only. It
would be torture to try to write such a frontend in xslt/xquery. The server
logic itself often deals with a diversity of data sources like databases,
files (including xml files) and others.
As for a database (we primarily work with SQL Server 2008 R2), we think that all
communication should go through stored procedures, which implement all data
logic. Clearly, this is no place for xslt. However, those who know sql beyond
its basics can confirm that sql is very similar to xquery. More than that, SQL
Server (and other databases) integrate xquery to work with xml data, and we do
use it extensively.
Server logic itself uses APIs like LINQ to manipulate different data
sources. In fact, we think that one can build a compiler from xquery 3.0 to C#
with LINQ. A compiler the other way round would be a whole different story.
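To give a feeling for how close the two are, here is a small hand-written sketch of one FLWOR expression next to a LINQ to XML query. It is an illustration only, not the output of any compiler. The element names and the FlworDemo class are made up for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class FlworDemo
{
  // XQuery counterpart (hand-written, for comparison):
  //
  //   <adults>{
  //     for $p in $people/person
  //     where xs:integer($p/age) ge 18
  //     order by $p/name
  //     return <adult>{ string($p/name) }</adult>
  //   }</adults>
  public static XElement Adults(XElement people)
  {
    return new XElement("adults",
      from p in people.Elements("person")
      where (int)p.Element("age") >= 18
      orderby (string)p.Element("name")
      select new XElement("adult", (string)p.Element("name")));
  }

  public static void Main()
  {
    var people = XElement.Parse(
      "<people>" +
      "<person><name>Cid</name><age>40</age></person>" +
      "<person><name>Bea</name><age>12</age></person>" +
      "<person><name>Ann</name><age>30</age></person>" +
      "</people>");

    Console.WriteLine(Adults(people).ToString(SaveOptions.DisableFormatting));
  }
}
```

The for, where, order by and return clauses map onto from, where, orderby and select almost word for word, which is the reason the xquery to C# direction looks feasible to us.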
The net result is that we see little place for xslt and xquery. Well, after all it's only a personal perspective on the subject.
A similar thing has happened to us with C++. As with xslt/xquery, we
love C++ very much, and we are fond of C++11, but at present we have no place in our
current projects for C++. That's a pity.
P.S. Among other things that play against xslt/xquery is the shortage of people who know these languages and thus can support such projects.
This time we
update csharpxom to adjust it to C# 5.0 (released with .NET 4.5).
The additions are the async modifier and
the await operator.
They are used to simplify asynchronous programming.
The following example from
MSDN:
private async Task<byte[]> GetURLContentsAsync(string url)
{
var content = new MemoryStream();
var request = (HttpWebRequest)WebRequest.Create(url);
using(var response = await request.GetResponseAsync())
using(var responseStream = response.GetResponseStream())
{
await responseStream.CopyToAsync(content);
}
return content.ToArray();
}
looks like this in csharpxom:
<method name="GetURLContentsAsync" access="private" async="true">
<returns>
<type name="Task" namespace="System.Threading.Tasks">
<type-arguments>
<type name="byte" rank="1"/>
</type-arguments>
</type>
</returns>
<parameters>
<parameter name="url">
<type name="string"/>
</parameter>
</parameters>
<block>
<var name="content">
<initialize>
<new-object>
<type name="MemoryStream" namespace="System.IO"/>
</new-object>
</initialize>
</var>
<var name="request">
<initialize>
<cast>
<invoke>
<static-method-ref name="Create">
<type name="WebRequest" namespace="System.Net"/>
</static-method-ref>
<arguments>
<var-ref name="url"/>
</arguments>
</invoke>
<type name="HttpWebRequest" namespace="System.Net"/>
</cast>
</initialize>
</var>
<using>
<resource>
<var name="response">
<initialize>
<await>
<invoke>
<method-ref name="GetResponseAsync">
<var-ref name="request"/>
</method-ref>
</invoke>
</await>
</initialize>
</var>
</resource>
<using>
<resource>
<var name="responseStream">
<initialize>
<invoke>
<method-ref name="GetResponseStream">
<var-ref name="response"/>
</method-ref>
</invoke>
</initialize>
</var>
</resource>
<expression>
<await>
<invoke>
<method-ref name="CopyToAsync">
<var-ref name="responseStream"/>
</method-ref>
<arguments>
<var-ref name="content"/>
</arguments>
</invoke>
</await>
</expression>
</using>
</using>
<return>
<invoke>
<method-ref name="ToArray">
<var-ref name="content"/>
</method-ref>
</invoke>
</return>
</block>
</method>
@michaelhkay Saxon 9.4 is out.
But why does the author not state that the HE version is still xslt/xpath 2.0, as neither xslt maps nor function items are supported?
As it happens, we have never worked with jQuery, though we were aware of
it.
In early 2000 we developed a web application that contained rich javascript
APIs, including UI components. Later, we actively worked with ASP.NET, and
after that with JSF.
At present, looking at jQuery more closely, we regret that we did not
start using it earlier.
Separation of business logic and presentation is remarkable when one uses JSON
web services. In fact server part can be seen as a set of web services
representing a business logic and a set of resources: html, styles, scripts,
others. Neither ASP.NET nor JSF achieves such a consistent separation.
The only trouble, in our opinion, is that jQuery has no standard data binding: a way to bind JSON data
to (and from) html controls. The technique that will probably be standardized is called jQuery Templates or JsViews.
Unfortunately, after reading about this
binding API, and
being in love with Xslt and XQuery, we just want to cry. We don't know what would
be the best solution for the task, but what we see looks uncomfortable to us.
A couple of weeks ago, we suggested introducing an enumerator function into
XPath (see
[F+O30] A enumerator function):
I would like the WG to consider an addition of a function that turns a sequence
into an enumeration of values.
Consider a function like this:
fn:enumerator($items as item()*) as function() as item()?;
alternatively, signature could be:
fn:enumerator($items as function() as item()*) as function() as item()?;
This function receives a sequence, and returns a function item, which upon the N-th
call shall return the N-th element of the original sequence. This way, a sequence of
items is turned into a function providing an enumeration of items of the
sequence.
As an example consider two functions:
a) t:rand($seed as xs:double) as xs:double* - a function producing a random
number sequence;
b) t:work($input as element()) as element() - a function that generates output
from its input, and that needs random numbers in the course of the execution.
t:work() may contain code like this:
let $rand := fn:enumerator(t:rand($seed)),
and later it can call $rand() to get a random number.
Enumerators will help to compose algorithms where one algorithm communicates
with other independent algorithms, thus making code simpler. The most obvious
class of enumerators are generators: ordered numbers, unique identifiers,
random numbers.
Technically, a function returned from fn:enumerator() is nondeterministic, but its "side effect" is
similar to a "side effect" of a function generate-id() from a newly created
node (see bug #13747, and bug #13494).
The idea is inspired by a generator function, which returns a new value upon each
call.
Such a function can be seen as a stateful object. But our goal is to look at
it in a more functional way. So, we look at the algorithm as a function that
produces a sequence of output, which is purely functional; and an enumerator that
allows iterating over the algorithm's output.
This way, the function that implements an algorithm and the function that
uses it can be seen as two threads of functional programs that use messaging to
communicate with each other.
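To make the proposal more tangible, here is a minimal Java sketch of the same idea. It is only an analogy: the class and method names are ours, null stands in for the empty sequence, and nothing here is part of XPath or of any proposal text.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

class EnumeratorSketch {
    // Hypothetical Java counterpart of the proposed fn:enumerator():
    // the N-th call of the returned function yields the N-th item,
    // and null (standing in for the empty sequence) once items run out.
    static <T> Supplier<T> enumerator(Iterable<T> items) {
        Iterator<T> iterator = items.iterator();
        return () -> iterator.hasNext() ? iterator.next() : null;
    }

    public static void main(String[] args) {
        // The producer is a plain sequence; the consumer just calls next.get().
        Supplier<Integer> next = enumerator(List.of(10, 20, 30));
        for (Integer item = next.get(); item != null; item = next.get()) {
            System.out.println(item);
        }
    }
}
```

The producer stays a pure sequence, and all the state lives in the returned function, which is the split between algorithm and enumerator described above.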
Honestly, we doubt that WG will accept it, but it's interesting to watch the
discussion.
More than a month has passed since we reported a problem to the saxon forum (see
Saxon optimizer bug and
Saxon 9.2 generate-id() bug).
The essence of the problem is that we have constructed an argumentless function that
returns a unique identifier each time it is called. To achieve the effect
we create a temporary node and return its generate-id() value.
Such a function is nondeterministic, as we cannot state that its result depends
on arguments only. This means that the engine's optimizer is not free to reorder
calls to such a function. Yet that's what happens in Saxon 9.2 and Saxon 9.3, where
the engine hoists the function call out of the loop, thus producing invalid results.
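The effect of such hoisting can be imitated in Java. This is a sketch under our own assumptions: generateId() merely plays the role of the argumentless function, and hoisted() spells out by hand what the optimizer effectively does; it is not Saxon code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class HoistingSketch {
    private static final AtomicLong COUNTER = new AtomicLong();

    // Plays the role of the argumentless function: a fresh value per call.
    static String generateId() {
        return "d" + COUNTER.incrementAndGet();
    }

    // Intended behaviour: the function is called once per item.
    static List<String> perItem(int count) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            ids.add(generateId());
        }
        return ids;
    }

    // What hoisting amounts to: the call is treated as loop-invariant
    // and evaluated once, so every item gets the same identifier.
    static List<String> hoisted(int count) {
        String id = generateId();
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            ids.add(id);
        }
        return ids;
    }
}
```

The rewrite is valid only for functions whose result depends on their arguments alone, which is exactly what does not hold here.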
Michael Kay, the author of the Saxon engine, argued that this is "a gray area of
the xslt spec":
If the spec were stricter about defining exactly when you can rely on identity-dependent
operations then I would be obliged to follow it, but I think it's probably deliberate
that it currently allows implementations some latitude, effectively signalling to
users that they should avoid depending on this aspect of the behaviour.
He advised raising a bug in the w3c bugzilla to resolve the issue. In the end
two related bugs have been raised:
- Bug 13494 - Node uniqueness returned from XSLT function;
- Bug 13747 - [XPath 3.0] Determinism of expressions returning constructed nodes.
Yesterday, the WG has resolved the issue:
The Working Group agreed that default behavior should continue to require these
nodes to be constructed with unique IDs.
We believe that this is the kind of thing implementations can do with
annotations or declaration options, and it would be best to get implementation
experience with this before standardizing.
This means that the technique we used to generate unique identifiers is correct
and the behaviour is well defined.
All that remains is to wait until Saxon fixes its behaviour accordingly.
An xslt code that had worked in production for several years failed
unexpectedly. That's unusual and unfortunate, but it happens.
We started to analyze the problem, narrowed down the code block and recreated it in
a simple form. Here it is:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt/public"
exclude-result-prefixes="t xs">
<xsl:template match="/" name="main">
<xsl:variable name="content">
<root>
<xsl:for-each select="1 to 3">
<item/>
</xsl:for-each>
</root>
</xsl:variable>
<xsl:variable name="result">
<root>
<xsl:for-each select="$content/root/item">
<section-ref name-ref="{t:generate-id()}.s"/>
<!--
<xsl:variable name="id" as="xs:string"
select="t:generate-id()"/>
<section-ref name-ref="{$id}.s"/>
-->
</xsl:for-each>
</root>
</xsl:variable>
<xsl:message select="$result"/>
</xsl:template>
<xsl:function name="t:generate-id" as="xs:string">
<xsl:variable name="element" as="element()">
<element/>
</xsl:variable>
<xsl:sequence select="generate-id($element)"/>
</xsl:function>
</xsl:stylesheet>
This code performs some transformation and assigns unique values to
name-ref attributes. Values generated with the
t:generate-id() function are guaranteed to be unique, as the spec
claims that every node has its unique generate-id() value.
Imagine our surprise when we found that the generated elements all have the same
name-ref values. We studied the code all over, and found no holes in our
reasoning and implementation, so our conclusion was: it's Saxon's bug!
Interestingly enough, if we rewrite the code a little (see the commented part),
it starts to work properly, thus we suspect Saxon's optimizer.
Well, in the course of development we have found and reported many Saxon bugs,
but how come this little beetle was hiding for so long?
We've verified that the bug exists in versions 9.2 and 9.3. Here is the bug
report:
Saxon 9.2 generate-id() bug.
Unfortunately, it has been there for three days already (2011-07-25 to 2011-07-27)
without any reaction. We hope this will change soon.
We have not updated
languages-xom for many months, but now we have found a severe bug
in jxom's algorithm for eliminating unreachable code. The marked line
was considered unreachable:
check:
if (condition)
{
  break check;
}
else
{
  return;
}

// due to bug the following was considered unreachable
expression;
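The labeled break is what makes the trailing statement reachable: the if statement itself never completes normally, but the labeled statement does, because a break targets it. A small self-contained Java sketch (class and method names are ours) that javac accepts and that exercises both paths:

```java
class LabeledBreakSketch {
    static String run(boolean condition) {
        check:
        if (condition) {
            // Leaves the labeled statement; execution continues after it.
            break check;
        } else {
            return "returned early";
        }

        // Reachable only through "break check"; javac compiles this
        // without an "unreachable statement" error.
        return "reached";
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```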
Bug is fixed.
Current update contains other cosmetic fixes.
Please download xslt sources from languages-xom.zip.
Summary
Languages XOM is a set of xml schemas and xslt stylesheets that allows:
- to define programs in xml form;
- to perform transformations over code in xml form;
- to generate sources.
Languages XOM includes:
- jxom - Java Xml Object model;
- csharpxom - C# Xml Object Model;
- cobolxom - COBOL Xml Object Model;
- sqlxom - SQL Xml Object Model (including several sql dialects);
- aspx - ASP.NET Object Model.
A proprietary part of languages XOM also includes an XML Object Model for a
language named Cool:GEN. In fact the original purpose of this API was the
generation of java/C#/COBOL from Cool:GEN. For more details about Cool:GEN
conversion please see
here.
For some reason we never knew about the instance initializer in java; the
static initializer, on the other hand, is well known.
class A
{
  int x;
  static int y;

  // This is an instance initializer.
  {
    x = 1;
  }

  // This is a static initializer.
  static
  {
    y = 2;
  }
}
Worse, we missed it in the java grammar when we were building jxom.
As a result, jxom lacked the feature.
Today we fix the omission and introduce a schema element:
<class-initializer static="boolean">
  <block>
    ...
  </block>
</class-initializer>
It supersedes:
<static>
<block>
...
</block>
</static>
that supported static
initializers alone.
Please update
languages-xom xslt stylesheets.
P.S. Out of curiosity, did you ever see any use of instance initializers?
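One place where instance initializers do show up is anonymous classes, which cannot declare constructors. The following Java sketch (all names are ours, purely for illustration) shows that case and also the order in which an instance initializer runs relative to the constructor body:

```java
import java.util.ArrayList;
import java.util.List;

class InitializerOrderSketch {
    final List<String> log = new ArrayList<>();

    // Instance initializer: runs on every construction, after the field
    // initializer above and before the constructor body below.
    {
        log.add("initializer");
    }

    InitializerOrderSketch() {
        log.add("constructor");
    }

    // An anonymous class has no constructor of its own, so its setup
    // code has to live in an instance initializer.
    static List<String> anonymous() {
        return new ArrayList<String>() {
            {
                add("set up in an instance initializer");
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(new InitializerOrderSketch().log);
        System.out.println(anonymous());
    }
}
```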
We could not stand the temptation to implement the @Yield annotation that
we described
earlier.
The idea is rather clear, but people say that it's not an easy task to update
the sources.
They were right!
The implementation has its price, as we were forced to access the JDK's classes of the javac
compiler. As a result, at present, we don't support other compilers such as
EclipseCompiler.
We shall look later at what can be done in this area.
At present, the annotation processor works perfectly when you run javac either from
the command line, from ant, or from another build tool.
Here is an example of how a method is refactored:
@Yield
public static Iterable<Long> fibonachi()
{
  ArrayList<Long> items = new ArrayList<Long>();
  long Ti = 0;
  long Ti1 = 1;

  while(true)
  {
    items.add(Ti);

    long value = Ti + Ti1;

    Ti = Ti1;
    Ti1 = value;
  }
}
And that's how we transform it:
@Yield()
public static Iterable<Long> fibonachi() {
assert (java.util.ArrayList<Long>)(ArrayList<Long>)null == null : null;
class $state$ implements java.lang.Iterable<Long>, java.util.Iterator<Long>, java.io.Closeable {
public java.util.Iterator<Long> iterator() {
if ($state$id == 0) {
$state$id = 1;
return this;
} else return new $state$();
}
public boolean hasNext() {
if (!$state$nextDefined) {
$state$hasNext = $state$next();
$state$nextDefined = true;
}
return $state$hasNext;
}
public Long next() {
if (!hasNext()) throw new java.util.NoSuchElementException();
$state$nextDefined = false;
return $state$next;
}
public void remove() {
throw new java.lang.UnsupportedOperationException();
}
public void close() {
$state$id = 5;
}
private boolean $state$next() {
while (true) switch ($state$id) {
case 0:
$state$id = 1;
case 1:
Ti = 0;
Ti1 = 1;
case 2:
if (!true) {
$state$id = 4;
break;
}
$state$next = Ti;
$state$id = 3;
return true;
case 3:
value = Ti + Ti1;
Ti = Ti1;
Ti1 = value;
$state$id = 2;
break;
case 4:
case 5:
default:
$state$id = 5;
return false;
}
}
private long Ti;
private long Ti1;
private long value;
private int $state$id;
private boolean $state$hasNext;
private boolean $state$nextDefined;
private Long $state$next;
}
return new $state$();
}
Formatting is automatic, sorry, but anyway it's for diagnostics only. You
will never see this code.
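For comparison, this is roughly what one has to write by hand to get the same lazy sequence without @Yield. It is our own sketch, not output of the processor, and the take() helper is added only to make the example easy to try:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class FibonacciIterable implements Iterable<Long> {
    @Override
    public Iterator<Long> iterator() {
        return new Iterator<Long>() {
            // The locals of the original method become iterator state.
            private long ti = 0;
            private long ti1 = 1;

            @Override
            public boolean hasNext() {
                return true; // the sequence is infinite
            }

            @Override
            public Long next() {
                long current = ti; // corresponds to items.add(Ti)
                long value = ti + ti1;
                ti = ti1;
                ti1 = value;
                return current;
            }
        };
    }

    // Helper for the example: collects the first n items.
    static List<Long> take(int n) {
        List<Long> result = new ArrayList<>();
        Iterator<Long> iterator = new FibonacciIterable().iterator();
        for (int i = 0; i < n; i++) {
            result.add(iterator.next());
        }
        return result;
    }
}
```

Here the loop is simple enough that no explicit state id is needed; the generated code above has to handle arbitrary control flow, hence the switch over $state$id.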
It's interesting to note that this implementation very closely mimics
the xslt state machine implementation we did back in 2008.
You can
download YieldProcessor here. We hope that someone will find our solution
interesting.
Several times already we have wished to see a
yield feature in java, and every time arrived at the same implementation:
infomancers-collections.
And every time we turned away with dissatisfaction, and continued with regular
iterators.
Why? Well, in spite of the fact that it's the best implementation of the feature we have
seen, it's still too heavy, as it plays with java byte code at run-time.
We never grasped why it's done this way, while there is
post-compile
time annotation processing in java.
If we were to implement the yield feature in java, we would create a @Yield
annotation and would require the code to follow some well defined pattern like
this:
@Yield
Iterable<String> iterator()
{
  // This is part of pattern.
  ArrayList<String> list = new ArrayList<String>();

  for(int i = 0; i < 10; ++i)
  {
    // list.add() plays the role of yield return.
    list.add(String.valueOf(i));
  }

  // This is part of pattern.
  return list;
}
or
@Yield
Iterator<String> iterator()
{
  // This is part of pattern.
  ArrayList<String> list = new ArrayList<String>();

  for(int i = 0; i < 10; ++i)
  {
    // list.add() plays the role of yield return.
    list.add(String.valueOf(i));
  }

  // This is part of pattern.
  return list.iterator();
}
Note that the code will work correctly even if, by mischance, post-compile-time
processing does not take place.
At post-compile time we would do all the refactoring required to turn these
implementations into state machines, so the runtime would not contain any third
party components.
It's interesting to recall that we have also implemented a similar refactoring in
pure xslt.
See What you can do with jxom.
Update: implementation can be found at Yield.zip
Michael Kay, author of the Saxon xslt processor, inspired by the GWT
ideas, has decided to compile Saxon HE into javascript. See
Compiling Saxon using GWT.
The resulting script is about 1MB in size.
But lately we have been thinking that it's overkill to bring a whole xslt engine to the
client, while it's possible to generate javascript from xslt the same way as he's building java from xquery. This will probably require some runtime,
but of a much smaller size.
Search at www.google.fr:
An empty sequence is not allowed as the @select attribute of xsl:analyze-string
That's a known issue. See Bug 7976.
In xslt 2.0 you should either check the value before using xsl:analyze-string, or wrap it into a string() call.
The problem is addressed in xslt 3.0.
michaelhkay: Saxon 9.3 has been out for 8 days: only two bugs so far, one found by me. I think that's a record.
Not necessarily. We, for example, who use Saxon HE, have found nothing new in Saxon 9.3, while we expected to see xslt 3.0. Disappointed. No actual reason to migrate.
P.S. We were among the first to find early bugs in previous releases.
We're following w3's "Bug
9069 - Function to invoke an XSLT transformation".
There, people argue about an xpath API to invoke xslt transformations. The function should
look roughly like this:
transform
(
  $node-tree as node()?,
  $stylesheet as item(),
  $parameters as XXX
) as node()
The discussion revolves around the last argument: $parameters as
XXX. Should it be an xml element describing parameters, a function returning values for parameter names, or some new type modelling an immutable
map?
What is most interesting in this discussion is the leak about plans to introduce
a map type:
Comment 7 Michael Kay, 2010-09-14 22:46:58 UTC
We're currently talking about adding an immutable map to XSLT as a new data
type (the put operation would return a new map). There appear to be a number of
possible efficient implementations. It would be ideally suited for this purpose,
because unlike the mechanism used for serialization parameters, the values can be
any data type (including nodes), not only strings.
There is a hope that map will finally appear in xslt!
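The key property in the quoted comment is that put returns a new map and leaves the original untouched. Here is a deliberately naive Java sketch of that contract; the class is hypothetical, and it copies the whole table on every put, whereas the efficient implementations mentioned in the comment would share structure instead.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ImmutableMapSketch<K, V> {
    private final Map<K, V> entries;

    private ImmutableMapSketch(Map<K, V> entries) {
        this.entries = entries;
    }

    static <K, V> ImmutableMapSketch<K, V> empty() {
        return new ImmutableMapSketch<>(Collections.<K, V>emptyMap());
    }

    // Returns a new map with the extra entry; this map is not modified.
    // Copying everything is O(n) per put and is here only for clarity.
    ImmutableMapSketch<K, V> put(K key, V value) {
        Map<K, V> copy = new HashMap<>(entries);
        copy.put(key, value);
        return new ImmutableMapSketch<>(copy);
    }

    V get(K key) {
        return entries.get(key);
    }

    int size() {
        return entries.size();
    }
}
```

Since values are of an arbitrary type V, such a map can carry any kind of parameter value, which is the advantage over string-only serialization parameters noted in the comment.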
See also:
- Bug 5630 - [DM] Tuples and maps;
- Tuples and maps - Status: CLOSED, WONTFIX;
- Map, based on immutable trees;
- Maps in exslt2?
Historically
jxom was developed first, and as such exhibited some imperfections in its
xml schema.
csharpxom has taken jxom's problems into account.
Unfortunately we could not easily fix jxom as a great amount of code already
uses it. In this refactoring we tried to be conservative, and have changed only the
"type" and "import" xml schema elements in java.xsd.
Consider the type reference and package import constructs in the old schema:
<!-- import java.util.ArrayList; -->
<import name="java.util.ArrayList"/>

<!-- java.util.ArrayList<java.math.BigDecimal> -->
<type package="java.util">
  <part name="ArrayList">
    <argument>
      <type name="BigDecimal" package="java.math"/>
    </argument>
  </part>
</type>

<!-- my.Parent.Nested -->
<type package="my">
  <part name="Parent"/>
  <part name="Nested"/>
</type>
Here we can observe that:
- a type is referred to by a qualified name in the import element;
- a type has two forms: a simple one (see BigDecimal), and another for nested or generic
types (see ArrayList).
We have made it more consistent in the updated jxom:
<!-- import java.util.ArrayList; -->
<import>
  <type name="ArrayList" package="java.util"/>
</import>

<!-- java.util.ArrayList<java.math.BigDecimal> -->
<type name="ArrayList" package="java.util">
  <argument>
    <type name="BigDecimal" package="java.math"/>
  </argument>
</type>

<!-- my.Parent.Nested -->
<type name="Nested">
  <type name="Parent" package="my"/>
</type>
We hope that you will not be impacted very much by this fix.
Please refresh Languages XOM from
languages-xom.zip.
P.S. We have also included an xml schema and xslt api to generate ASPX (see
Xslt serializer for ASPX output). In our projects we, in fact, generate aspx documents with
embedded csharpxom, and then pass them through a two-stage transformation.
In the
previous post we announced an
API to parse a COBOL source into cobolxom.
We used the
incremental parser to build a grammar xml tree and then planned to
create an xslt transformation to generate
cobolxom.
Now, we would like to declare that such an xslt is ready.
At present all standard COBOL constructs are supported, but more tests
are required. Preprocessor support is still on the todo list.
You may peek into an examples of
COBOL:
Cobol grammar:
And
cobolxom:
While we were building the grammar-to-cobolxom stylesheet we asked ourselves
whether the COBOL parsing could be done entirely in xslt. The answer is yes, so
who knows, we might yet turn this task into a pure xslt one. :-)
Recently we've seen code like this:
<xsl:variable name="a" as="element()?" select="..."/>
<xsl:variable name="b" as="element()?" select="..."/>
<xsl:apply-templates select="$a">
<xsl:with-param name="b" tunnel="yes" as="element()" select="$b"/>
</xsl:apply-templates>
It fails with an error:
"An empty sequence is not allowed as the value of parameter $b".
What is interesting is that the value of $a is an empty sequence,
so the code could potentially work, provided the processor evaluated $a first
and decided not to evaluate xsl:with-param.
Is the order of evaluation of @select and xsl:with-param specified
by the standard, or is it implementation-defined?
We asked this question on the
xslt forum, and got the following answer:
The specification leaves this implementation-defined. Since the values
of the parameters are the same for every node processed, it's a
reasonably strategy for the processor to evaluate the parameters before
knowing how many selected nodes there are, though I guess an even better
strategy would be to do it lazily when the first selected node is found.
Well, that's the expected answer. This question will probably prompt Michael Kay
to introduce a small optimization into Saxon.
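The two strategies from the answer can be contrasted in a small Java sketch. It is only an analogy under our own assumptions: null stands in for the empty sequence, requireElement() imitates the as="element()" check, and none of the names below belong to any xslt processor.

```java
import java.util.List;
import java.util.function.Supplier;

class ParamEvaluationSketch {
    // Imitates the as="element()" check: an "empty" (null) value is rejected.
    static String requireElement(String value) {
        if (value == null) {
            throw new IllegalArgumentException(
                "An empty sequence is not allowed as the value of parameter $b");
        }
        return value;
    }

    // Eager strategy: the parameter is evaluated before the selection is
    // inspected, so it fails even when nothing is selected.
    static int applyEager(List<String> selected, String b) {
        requireElement(b);
        return selected.size(); // number of templates that would be applied
    }

    // Lazy strategy: the parameter is evaluated when the first selected
    // node is found, so an empty selection never touches it.
    static int applyLazy(List<String> selected, Supplier<String> b) {
        int applied = 0;
        boolean evaluated = false;
        for (String node : selected) {
            if (!evaluated) {
                b.get();
                evaluated = true;
            }
            applied++;
        }
        return applied;
    }
}
```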
Some time ago we created an
incremental parser, and now that we have decided to load COBOL sources
directly into
cobolxom (XML Object Model for COBOL), the parser did the job perfectly.
The good point about the incremental parser is that it easily handles COBOL's
grammar.
The whole process looks like this:
- the incremental parser, given a COBOL grammar, builds a grammar tree;
- we stream this tree into xml;
- an xslt transforms the xml from the previous step into
cobolxom (TODO).
This is an example of COBOL:
IDENTIFICATION DIVISION.
PROGRAM-ID. FACTORIAL RECURSIVE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 NUMB PIC 9(4) VALUE IS 5.
01 FACT PIC 9(8) VALUE IS 0.
LOCAL-STORAGE SECTION.
01 NUM PIC 9(4).
PROCEDURE DIVISION.
MOVE 'X' TO XXX
MOVE NUMB TO NUM
IF NUMB = 0 THEN
MOVE 1 TO FACT
ELSE
SUBTRACT 1 FROM NUMB
CALL 'FACTORIAL'
MULTIPLY NUM BY FACT
END-IF
DISPLAY NUM '! = ' FACT
GOBACK.
END PROGRAM FACTORIAL.
And a grammar tree:
<Program>
<Name data="FACTORIAL"/>
<Recursive/>
<DataDivision>
<WorkingStorageSection>
<Data>
<Level data="01"/>
<Name data="NUMB"/>
<Picture data="9(4)"/>
<Value>
<Numeric data="5"/>
</Value>
</Data>
<Data>
<Level data="01"/>
<Name data="FACT"/>
<Picture data="9(8)"/>
<Value>
<Numeric data="0"/>
</Value>
</Data>
</WorkingStorageSection>
<LocalStorageSection>
<Data>
<Level data="01"/>
<Name data="NUM"/>
<Picture data="9(4)"/>
</Data>
</LocalStorageSection>
</DataDivision>
<ProcedureDivision>
<Sentence>
<MoveStatement>
<From>
<String data="'X'"/>
</From>
<To>
<Identifier>
<DataName data="XXX"/>
</Identifier>
</To>
</MoveStatement>
<MoveStatement>
<From>
<Identifier>
<DataName data="NUMB"/>
</Identifier>
</From>
<To>
<Identifier>
<DataName data="NUM"/>
</Identifier>
</To>
</MoveStatement>
<IfStatement>
<Condition>
<Relation>
<Identifier>
<DataName data="NUMB"/>
</Identifier>
<Equal/>
<Numeric data="0"/>
</Relation>
</Condition>
<Then>
<MoveStatement>
<From>
<Numeric data="1"/>
</From>
<To>
<Identifier>
<DataName data="FACT"/>
</Identifier>
</To>
</MoveStatement>
</Then>
<Else>
<SubtractStatement>
<Value>
<Numeric data="1"/>
</Value>
<From>
<Identifier>
<DataName data="NUMB"/>
</Identifier>
</From>
</SubtractStatement>
<CallStatement>
<Name>
<String data="'FACTORIAL'"/>
</Name>
</CallStatement>
<MultiplyStatement>
<Value>
<Identifier>
<DataName data="NUM"/>
</Identifier>
</Value>
<By>
<Identifier>
<DataName data="FACT"/>
</Identifier>
</By>
</MultiplyStatement>
</Else>
</IfStatement>
<DisplayStatement>
<Values>
<Identifier>
<DataName data="NUM"/>
</Identifier>
<String data="'! = '"/>
<Identifier>
<DataName data="FACT"/>
</Identifier>
</Values>
</DisplayStatement>
<GobackStatement/>
</Sentence>
</ProcedureDivision>
<EndName data="FACTORIAL"/>
</Program>
The last step, transforming the tree into cobolxom, is on the TODO list.
We have committed the COBOL grammar to the same place at
SourceForge as the XQuery grammar. The solution is now under VS
2010.
Suppose you have a timestamp string, and want to check whether it fits one of the
following formats, allowing leading and trailing spaces:
- YYYY-MM-DD-HH.MM.SS.NNNNNN
- YYYY-MM-DD-HH.MM.SS
- YYYY-MM-DD
We decided to use a regex and its capture groups to extract the timestamp parts. This
left us with only one solution: the xsl:analyze-string instruction. It took
a couple more minutes to reach a final solution:
<xsl:variable name="parts" as="xs:string*">
<xsl:analyze-string select="$value"
regex="
^\s*(\d\d\d\d)-(\d\d)-(\d\d)
(-(\d\d)\.(\d\d)\.(\d\d)(\.(\d\d\d\d\d\d))?)?\s*$"
flags="x">
<xsl:matching-substring>
<xsl:sequence select="regex-group(1)"/>
<xsl:sequence select="regex-group(2)"/>
<xsl:sequence select="regex-group(3)"/>
<xsl:sequence select="regex-group(5)"/>
<xsl:sequence select="regex-group(6)"/>
<xsl:sequence select="regex-group(7)"/>
<xsl:sequence select="regex-group(9)"/>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:choose>
<xsl:when test="exists($parts)">
...
</xsl:when>
<xsl:otherwise>
...
</xsl:otherwise>
</xsl:choose>
How would you solve the problem? Is it the best solution?
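For comparison, here is the same check sketched in Java with java.util.regex. The class and method names are ours; non-capturing groups are used so that the seven parts are numbered 1 to 7, and an absent optional part is reported as an empty string, as regex-group() does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampPartsSketch {
    // YYYY-MM-DD[-HH.MM.SS[.NNNNNN]] with optional surrounding spaces.
    private static final Pattern TIMESTAMP = Pattern.compile(
        "^\\s*(\\d{4})-(\\d\\d)-(\\d\\d)"
        + "(?:-(\\d\\d)\\.(\\d\\d)\\.(\\d\\d)(?:\\.(\\d{6}))?)?\\s*$");

    // Returns the seven parts, or an empty list when the value
    // does not fit any of the formats.
    static List<String> parts(String value) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TIMESTAMP.matcher(value);

        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                String group = matcher.group(i);
                result.add(group == null ? "" : group);
            }
        }

        return result;
    }

    public static void main(String[] args) {
        System.out.println(parts(" 2010-09-14-22.46.58.123456 "));
        System.out.println(parts("2010-09-14"));
        System.out.println(parts("not a timestamp"));
    }
}
```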
We have updated C# XOM (csharpxom) to support C# 4.0 (in fact there are very few
changes).
From the grammar perspective this includes:
- Dynamic types;
- Named and optional arguments;
- Covariance and contravariance of generic parameters for interfaces and
delegates.
Dynamic type, C#:
dynamic dyn = 1;
C# XOM:
<var name="dyn">
<type name="dynamic"/>
<initialize>
<int value="1"/>
</initialize>
</var>
Named and Optional Arguments, C#:
int Increment(int value, int increment = 1)
{
  return value + increment;
}

void Test()
{
  // Regular call.
  Increment(7, 1);

  // Call with named parameter.
  Increment(value: 7, increment: 1);

  // Call with default.
  Increment(7);
}
C# XOM:
<method name="Increment">
<returns>
<type name="int"/>
</returns>
<parameters>
<parameter name="value">
<type name="int"/>
</parameter>
<parameter name="increment">
<type name="int"/>
<initialize>
<int value="1"/>
</initialize>
</parameter>
</parameters>
<block>
<return>
<add>
<var-ref name="value"/>
<var-ref name="increment"/>
</add>
</return>
</block>
</method>
<method name="Test">
<block>
<expression>
<comment>Regular call.</comment>
<invoke>
<method-ref name="Increment"/>
<arguments>
<int value="7"/>
<int value="1"/>
</arguments>
</invoke>
</expression>
<expression>
<comment>Call with named parameter.</comment>
<invoke>
<method-ref name="Increment"/>
<arguments>
<argument name="value">
<int value="7"/>
</argument>
<argument name="increment">
<int value="1"/>
</argument>
</arguments>
</invoke>
</expression>
<expression>
<comment>Call with default.</comment>
<invoke>
<method-ref name="Increment"/>
<arguments>
<int value="7"/>
</arguments>
</invoke>
</expression>
</block>
</method>
Covariance and contravariance, C#:
public interface Variance<in T, out P, Q>
{
  P X(T t);
}
C# XOM:
<interface access="public" name="Variance">
<type-parameters>
<type-parameter name="T" variance="in"/>
<type-parameter name="P" variance="out"/>
<type-parameter name="Q"/>
</type-parameters>
<method name="X">
<returns>
<type name="P"/>
</returns>
<parameters>
<parameter name="t">
<type name="T"/>
</parameter>
</parameters>
</method>
</interface>
Other cosmetic fixes were also introduced into Java XOM (jxom), COBOL XOM
(cobolxom), and into sql XOM (sqlxom).
The new version can be found at
languages-xom.zip.
See also: What's New in Visual C# 2010
We have run into another xslt bug, which depends on several independent
circumstances and often behaves differently when observed. That's clearly a
Heisenbug.
Xslt designers failed to realize that the syntactic sugar they introduced into
xpath can turn into obscure bugs. Well, it's easy to be wise afterwards...
To the point.
Suppose you have a sequence consisting of text nodes and
elements, and you want to "normalize" this sequence by wrapping
adjacent text nodes into
separate elements. The following stylesheet is supposed to do the work:
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
  exclude-result-prefixes="xs t">

  <xsl:template match="/">
    <xsl:variable name="nodes" as="node()*">
      <xsl:text>Hello, </xsl:text>
      <string value="World"/>
      <xsl:text>! </xsl:text>
      <xsl:text>Well, </xsl:text>
      <string value="hello"/>
      <xsl:text>, if not joking!</xsl:text>
    </xsl:variable>

    <result>
      <xsl:sequence select="t:normalize($nodes)"/>
    </result>
  </xsl:template>

  <xsl:function name="t:normalize" as="node()*">
    <xsl:param name="nodes" as="node()*"/>

    <xsl:for-each-group select="$nodes" group-starting-with="*">
      <xsl:variable name="string" as="element()?" select="self::string"/>
      <xsl:variable name="texts"
        as="node()*"
        select="current-group() except $string"/>

      <xsl:sequence select="$string"/>

      <xsl:if test="exists($texts)">
        <string value="{string-join($texts, '')}"/>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:function>

</xsl:stylesheet>
We're expecting the following output:
<result>
<string value="Hello, "/>
<string value="World"/>
<string value="! Well, "/>
<string value="hello"/>
<string value=", if not joking!"/>
</result>
But often we're getting other results, like:
<result>
<string value="Hello, "/>
<string value="World"/>
<string value="Well, ! "/>
<string value="hello"/>
<string value=", if not joking!"/>
</result>
Such output may be seriously confusing, unless you recall the rule for the
xpath except operator:
The except operator takes two node sequences as operands and returns a sequence containing all the nodes that occur in the first operand but not in the second operand.
... these operators eliminate duplicate nodes from their result sequences based
on node identity. The resulting sequence is returned in document order.
...
The relative order of nodes in distinct trees is stable but implementation-dependent
These words mean that the result sequence may be ordered very differently from the original
sequence. Here every node in $nodes is parentless, that is, a tree of its own, so
except is free to return the text nodes in an order other than the one we wrote.
In contrast, if we change the $texts definition to:
<xsl:variable name="texts"
as="node()*"
select="current-group()[not(. is $string)]"/>
then the result becomes stable, but the code is less clear.
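The intended behaviour, merging adjacent text nodes strictly in the order they appear, can be pinned down with a short Java sketch. The modelling is ours: a plain String stands for a text node, the hypothetical Str class stands for a string element, and the result is just the list of value attributes.

```java
import java.util.ArrayList;
import java.util.List;

class NormalizeSketch {
    // Hypothetical stand-in for <string value="..."/>.
    static final class Str {
        final String value;

        Str(String value) {
            this.value = value;
        }
    }

    // Walks the nodes in their original order, concatenating runs of
    // adjacent text nodes and passing Str elements through unchanged.
    static List<String> normalize(List<?> nodes) {
        List<String> values = new ArrayList<>();
        StringBuilder texts = new StringBuilder();

        for (Object node : nodes) {
            if (node instanceof Str) {
                flush(texts, values);
                values.add(((Str) node).value);
            } else {
                texts.append((String) node);
            }
        }

        flush(texts, values);
        return values;
    }

    // Emits the pending run of text, if any, and resets the buffer.
    private static void flush(StringBuilder texts, List<String> values) {
        if (texts.length() > 0) {
            values.add(texts.toString());
            texts.setLength(0);
        }
    }
}
```

Because the input is consumed positionally, there is no step at which the order of the text nodes could change, which is the guarantee the except operator does not give.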
See also
Xslt Heisenbug
Recently we raised a question about serialization of ASPX output in xslt.
The question went like this:
What's the recommended way of ASPX page generation? E.g.:
------------------------
<%@ Page AutoEventWireup="true" CodeBehind="CurMainMenuP.aspx.cs" EnableSessionState="True" Inherits="Currency.CurMainMenuP" Language="C#" MaintainScrollPositionOnPostback="True" MasterPageFile="Screen.Master" %>
<asp:Content ID="Content1" runat="server" ContentPlaceHolderID="Title">CUR_MAIN_MENU_P</asp:Content>
<asp:Content ID="Content2" runat="server" ContentPlaceHolderID="Content">
<span id="id1222146581" runat="server" class="inputField system UpperCase" enableviewstate="false">
<%# Dialog.Global.TranCode %>
</span>
...
------------------------
Notice the aspx page directives, data binding expressions, and prefixed tag names without namespace declarations.
There was a whole range of expected answers. We, however, were looking to see whether somebody had already dealt with the task and had a ready solution at hand.
In general it seems that the xslt community is very angry about ASPX: both the format and the technology. Well, put this aside.
The task of producing ASPX, which is almost xml, is not solvable when you stay with a pure xml serializer. Xslt's xsl:character-map does not help at all. In fact it looks like a childish attempt to address the problem, as it does not support character escapes but only grabs characters and substitutes them with strings.
We have decided to create an ASPX serializer API producing the required output text. This way you use <xsl:output method="text"/> to generate ASPX pages.
With this goal in mind we have defined a small xml schema to describe ASPX irregularities in xml form. These are:
- <xs:element name="declared-prefix"> - to describe known prefixes, which should not be declared;
- <xs:element name="directive"> - to describe directives like <%@ Page %>;
- <xs:element name="content"> - a transparent content wrapper;
- <xs:element name="entity"> - to issue an xml entity;
- <xs:element name="expression"> - to describe an aspx expression like <%# Eval("A") %>;
- <xs:element name="attribute"> - to describe an attribute of the parent element.
This approach has greatly simplified the ASPX generation process for us.
The API includes:
We have implemented report parser in C#. Bacause things are spinned around C#, a
schema definition is changed.
We have started from classes defining a report definition tree, annotated these
classes for xml serialization, and, finally, produced xml schema for such tree.
So, at present, it is not an xml schema with annotations but a separate xml
schema.
In addition we have defined APIs:
- to enumerate report data (having report definition and report data one can get
IEnumerable<ViewValue> to iterate report data in structured form);
- to read a report through XmlReader, which allows, for example, using the
report as input for an xslt transformation;
- to write a report directly into XmlWriter.
An example of report definition as C# code is:
MyReport.cs. The very same report definition but serialized into xml is
my-report.xml. A generated xml schema for a report definition is:
schema0.xsd.
The good point about this solution is that it's already flexible enough to
describe every report layout we have at hand, and it's extensible. Our
measurements show that report parsing is extremely fast and has a very small
memory footprint due to the forward-only nature of report definitions.
From the design point of view report definition is a view of original text data
with view info attached.
At present we have defined following views:
- Element - a named view to generate output from a content view;
- Content - a view to aggregate other views together;
- Choice - a view to produce output from one of content views;
- Sequence - a view to sequence input view by key expressions, and to attach an
index to each sequence item;
- Iterator - a view to generate output from input view while some condition is
true, and to attach an iteration index to each part of output view;
- Page - a view to remove page headers and footers in the input view, and to
attach an index to each page;
- Compute - a named view to produce result of evaluation of expression as output
view;
- Data - a named view to produce output value from some bounds of input view,
and optionally to convert, validate and format the value.
To specify details of definitions there are:
- expressions to deal with integers:
Add , Div ,
Integer , MatchProperty , Max , Min ,
Mod , Mul , Neg , Null ,
Sub , VariableRef , ViewProperty , Case ;
- conditions to deal with booleans:
And , EQ , GE ,
GT , IsMatch , LE , LT ,
NE , Not , Or .
At present there is no specification of report definitions. Probably, creating
such a spec for a user without deep knowledge is the most complex part.
At present, our idea is that one should use the xml schema (we should polish
the generated schema) for the report definition and a schema-aware editor to build
report definitions. That's a very robust approach, which works perfectly with
languages xom.
C# sources can be found at:
ReportLayout.zip including report definition classes and a sample report.
We're facing a task of parsing reports produced from legacy applications and
converting them into a structured form, e.g. into xml. These xml files can be
processed further with up to date tools to produce good looking reports.
Reports at hand are of very different structure and size: from a couple of KB
to several GB. The good part is that they mostly have a tabular form, so it's
easy to think of specific parsers for each report type.
Our goal is to create an environment where a less qualified person could
create and manage such parsers, and only rarely engage someone who will handle
the less trivial cases.
Our analysis has shown that it's possible to write such parser in almost any
language: xslt, C#, java.
Our approach was to create xml schema annotations
that on one side define a data structure, and on the other map the report
layout. Then we're able to create an xslt that will generate either an xslt, C#, or
java parser according to the schema definitions. Because
languages
xom provides an XML Object Model and serialization stylesheets for C# and
java, it does not really matter whether we generate xslt or C#/java, as the
code will look the same.
The approach we're going to use to describe reports is
not as powerful as conventional parsers. Its virtue, however, is simplicity of
specification.
Consider a report sample (a data to extract is in bold):
1 TITLE ... PAGE: 1
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
XXX 065 065 CLR ...
YYY ...
...
1 TITLE ... PAGE: 2
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
AAA NNN MMM PPP ...
BBB ...
...
* * * * * E N D O F R E P O R T * * * * *
We approach the report through a sequence of views (filters) of this
report. Each view localizes some report data either for subsequent
filtering or for the extraction of final data.
Looking at the example, one can build the following views of the report:
- View of data before the "E N D O F R E P O R T" line.
- View of remaining data without page headers and footers.
- Views of table rows.
- Views of cells.
A sequence of filters allows us to build a pipeline of transformations of
original text. This also allows us to generate a clean xslt, C# or java code
to parse the data.
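The pipeline of views can be illustrated with a minimal Java sketch. It is not the parser described in this post: the class and method names are hypothetical, the header detection (a page header runs from a line starting with "1 TITLE" to the "-----" ruler line) and the whitespace-based cell splitting are assumptions made for the sample above, and the whole input is held in memory, which a real parser for multi-GB reports would avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReportViews {
    /** View 1: lines before the "E N D O F R E P O R T" marker. */
    static List<String> beforeEnd(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (line.contains("E N D O F R E P O R T")) break;
            result.add(line);
        }
        return result;
    }

    /**
     * View 2: lines without page headers. Assumption: a header starts at a
     * line beginning with "1 TITLE" and ends at the "-----" ruler line.
     */
    static List<String> withoutHeaders(List<String> lines) {
        List<String> result = new ArrayList<>();
        boolean inHeader = false;
        for (String line : lines) {
            if (line.startsWith("1 TITLE")) {
                inHeader = true;
            } else if (inHeader) {
                if (line.startsWith("-----")) inHeader = false;
            } else {
                result.add(line);
            }
        }
        return result;
    }

    /** Views 3 and 4: every non-empty line is a row, split into cells on whitespace. */
    static List<List<String>> rows(List<String> lines) {
        List<List<String>> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) result.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return result;
    }

    /** The pipeline: each view consumes the output of the previous one. */
    static List<List<String>> parse(List<String> lines) {
        return rows(withoutHeaders(beforeEnd(lines)));
    }

    public static void main(String[] args) {
        List<String> report = Arrays.asList(
            "1 TITLE PAGE: 1",
            "CO NBR FRM FUNC",
            "----- ----- ----- -----",
            "XXX 065 065 CLR",
            "1 TITLE PAGE: 2",
            "CO NBR FRM FUNC",
            "----- ----- ----- -----",
            "AAA NNN MMM PPP",
            "* * * * * E N D O F R E P O R T * * * * *");
        System.out.println(parse(report));
    }
}
```

Each view is a plain function over the previous view, so views can be reordered or replaced per report type without touching the others.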
At first, our favorite language for such a parser was xslt.
Unfortunately, we're dealing with the Saxon xslt implementation, which is not very
strong in streaming processing. Without a couple of extension functions to
prevent caching, it tends to cache the whole input in memory, which is not
acceptable.
At present we have decided to start from C# code, which is pure C# naturally.
The code is still in development, but at present we would like to share the xml
schema annotations describing report layout:
report-mapping.xsd, and a sample of report description:
test.xsd.
A few little changes in streaming and in name normalization algorithms in jxom
and in csharpxom and the generation speed almost doubled (especially for big files).
We suspect, however, that our xslt code is tuned for saxon engine.
It would be nice to know if anybody used languages XOM with other engines. Is
anyone using it at all (well, at least there are downloads)?
Languages XOM (jxom, csharpxom, cobolxom, sqlxom) can be loaded from:
languages-xom.zip
At times a simple task in xslt looks like a puzzle. Today we have this one.
For a string and a regular expression find the position and the length of the matched
substring.
The problem looks so simple that you do not immediately realize that you are going
to spend ten minutes trying to solve it in the best way.
Try it yourself before proceeding:
<xsl:variable name="match" as="xs:integer*">
<xsl:analyze-string select="$line"
regex="my-reg-ex">
<xsl:matching-substring>
<xsl:sequence select="1, string-length(.)"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:sequence select="0, string-length(.)"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:choose>
<xsl:when test="$match[1]">
<xsl:sequence
select="1, $match[2]"/>
</xsl:when>
<xsl:when test="$match[3]">
<!-- The first item is a non-matching prefix; the match starts right after it. -->
<xsl:sequence select="$match[2] + 1, $match[4]"/>
</xsl:when>
</xsl:choose>
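For comparison, here is how the same puzzle might look in Java, where the matcher exposes offsets directly. This is only a sketch: the class and method names are hypothetical, the position is reported 1-based to match XPath conventions, and the Java regex dialect is not identical to the XPath one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexPosition {
    /**
     * Returns {position, length} of the first match, with a 1-based position
     * as in XPath, or an empty array when there is no match.
     */
    static int[] firstMatch(String line, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(line);
        if (!matcher.find()) return new int[0];
        return new int[] { matcher.start() + 1, matcher.end() - matcher.start() };
    }

    public static void main(String[] args) {
        int[] match = firstMatch("RUN DATE: 02/23/10", "\\d{2}/\\d{2}/\\d{2}");
        System.out.println(match[0] + ", " + match[1]);
    }
}
```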
To see that the problem with Generator functions in xslt
is a bit more complicated, compare two functions.
The first one is quoted from the earlier post:
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
It does not work in Saxon: it crashes with an out of memory error.
The second one is a slightly modified version of the same function:
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value + 0"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
It works without problems. In the first case Saxon decides to cache all of the
function's output, in the second case it decides to evaluate data lazily on
demand.
It seems that optimization algorithms implemented in Saxon are so plentiful and
complex that at times they fool one another. :-)
See also:
Generator functions
There are some complications with the streamed tree that we have implemented in
saxon. They are due to the fact that only a view of input data is available at
any time. Whenever you access an element that is not available, you
get an exception.
Consider an example. We have a log created with java logging. It looks like
this:
<log>
<record>
<date>...</date>
<millis>...</millis>
<sequence>...</sequence>
<logger>...</logger>
<level>INFO</level>
<class>...</class>
<method>...</method>
<thread>...</thread>
<message>...</message>
</record>
<record>
...
</record>
...
We would like to write an xslt that returns a page
of log as html:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
xmlns="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xs t">
<xsl:param name="start-page" as="xs:integer" select="1"/>
<xsl:param name="page-size" as="xs:integer" select="50"/>
<xsl:output method="xhtml" byte-order-mark="yes" indent="yes"/>
<!-- Entry point. -->
<xsl:template match="/log">
<xsl:variable name="start" as="xs:integer"
select="($start-page - 1) * $page-size + 1"/>
<xsl:variable name="records" as="element()*"
select="subsequence(record, $start, $page-size)"/>
<html>
<head>
<title>
<xsl:text>A log file. Page: </xsl:text>
<xsl:value-of select="$start-page"/>
</title>
</head>
<body>
<table border="1">
<thead>
<tr>
<th>Level</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<xsl:apply-templates mode="t:record" select="$records"/>
</tbody>
</table>
</body>
</html>
</xsl:template>
<xsl:template mode="t:record" match="record">
<!-- Make a copy of record to avoid streaming access problems. -->
<xsl:variable name="log">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="level" as="xs:string"
select="$log/record/level"/>
<xsl:variable name="message" as="xs:string"
select="$log/record/message"/>
<tr>
<td>
<xsl:value-of select="$level"/>
</td>
<td>
<xsl:value-of select="$message"/>
</td>
</tr>
</xsl:template>
</xsl:stylesheet>
This code does not work. Guess why? Yes, it's subsequence(), which is too greedy.
It always wants to know what the next node is, so it naturally skips the content of
the current node. Algorithmically, such saxon code could be rewritten, and could possibly work
better also in modes other than streaming.
A viable workaround, which does not use subsequence(), looks rather nontrivial:
<!-- Entry point. -->
<xsl:template match="/log">
<xsl:variable name="start" as="xs:integer"
select="($start-page - 1) * $page-size + 1"/>
<!-- Index of the last record on the page (inclusive). -->
<xsl:variable name="end" as="xs:integer"
select="$start + $page-size - 1"/>
<html>
<head>
<title>
<xsl:text>A log file. Page: </xsl:text>
<xsl:value-of select="$start-page"/>
</title>
</head>
<body>
<table border="1">
<thead>
<tr>
<th>Level</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<xsl:sequence select="
t:generate-records(record, $start, $end, ())"/>
</tbody>
</table>
</body>
</html>
</xsl:template>
<xsl:function name="t:generate-records" as="element()*">
<xsl:param name="records" as="element()*"/>
<xsl:param name="start" as="xs:integer"/>
<xsl:param name="end" as="xs:integer?"/>
<xsl:param name="result" as="element()*"/>
<xsl:variable name="record" as="element()?" select="$records[$start]"/>
<xsl:choose>
<xsl:when test="(exists($end) and ($start > $end)) or empty($record)">
<xsl:sequence select="$result"/>
</xsl:when>
<xsl:otherwise>
<!-- Make a copy of record to avoid streaming access problems. -->
<xsl:variable name="log">
<xsl:copy-of select="$record"/>
</xsl:variable>
<xsl:variable name="level" as="xs:string"
select="$log/record/level"/>
<xsl:variable name="message" as="xs:string"
select="$log/record/message"/>
<xsl:variable name="next-result" as="element()*">
<tr>
<td>
<xsl:value-of select="$level"/>
</td>
<td>
<xsl:value-of select="$message"/>
</td>
</tr>
</xsl:variable>
<xsl:sequence select="
t:generate-records
(
$records,
$start + 1,
$end,
($result, $next-result)
)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>
Here we observed the greediness of saxon, which tried too early to consume
more input than required. In other cases we have seen that it may
defer actual data access to the point when the data is no longer available.
So, without tuning internal saxon logic it's possible but not easy to write
stylesheets that exploit streaming features.
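To show what the stylesheet is expected to do in forward-only terms, here is a minimal Java sketch that pages through the same log with a plain StAX reader. It does not use saxon or the streamed tree; the class and method names are hypothetical, and it assumes that level and message contain text only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class LogPage {
    /** Returns {level, message} pairs of the records on the requested page. */
    static List<String[]> page(XMLStreamReader reader, int startPage, int pageSize)
            throws XMLStreamException {
        int start = (startPage - 1) * pageSize + 1; // 1-based index of the first record
        int end = start + pageSize - 1;             // index of the last record on the page
        List<String[]> rows = new ArrayList<>();
        int index = 0;
        String level = null;
        String message = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("record")) {
                    index++;
                    level = null;
                    message = null;
                } else if (index >= start && name.equals("level")) {
                    level = reader.getElementText();
                } else if (index >= start && name.equals("message")) {
                    message = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("record")) {
                if (index >= start) rows.add(new String[] { level, message });
                if (index >= end) break; // stop early: the rest of the log is never read
            }
        }
        return rows;
    }

    /** Convenience wrapper over an in-memory document. */
    static List<String[]> pageOf(String xml, int startPage, int pageSize) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                return page(reader, startPage, pageSize);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log>"
            + "<record><level>INFO</level><message>one</message></record>"
            + "<record><level>WARNING</level><message>two</message></record>"
            + "<record><level>INFO</level><message>three</message></record>"
            + "</log>";
        for (String[] row : pageOf(xml, 2, 1)) {
            System.out.println(row[0] + ": " + row[1]);
        }
    }
}
```

The loop consumes each record's content before moving on and never looks ahead, which is the behavior one would like subsequence() to have in the streaming case.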
P.S. Updated sources are at
streamedtree.zip
When the time came to process big xml log files, we decided to implement
a streamable tree in saxon the very same way it was implemented in .net eight
years ago (see
How would we approach to streaming facility in xslt).
It's interesting that the implementation is similar to that of the
composable tree. There, a node never stores a reference to its parent, while in
the streamed tree no references to children are stored. This way only a limited
subview of the tree is available at any time. The implementation does not support
the preceding and preceding-sibling axes. Also, one cannot navigate to a node that
is out of scope.
The implementation is external (there are no changes to saxon itself). To use it one
needs to create an instance of DocumentInfo, which pulls data from
XMLStreamReader, and
to pass it as an input to a transformation:
Controller controller =
(Controller)transformer;
XMLInputFactory factory = XMLInputFactory.newInstance();
StreamSource inputSource = new StreamSource(new File(input));
XMLStreamReader reader = factory.createXMLStreamReader(inputSource);
StaxBridge bridge = new StaxBridge();
bridge.setPipelineConfiguration(
controller.makePipelineConfiguration());
bridge.setXMLStreamReader(reader);
inputSource = new DocumentImpl(bridge);
transformer.transform(inputSource, new StreamResult(output));
This helped us to format an xml log file of arbitrary size. An xslt like this can do the work:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xs">
<xsl:template match="/log">
<html>
<head>
<title>Log</title>
</head>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>
<xsl:template match="message">
...
</xsl:template>
<xsl:template match="message[@error]">
...
</xsl:template>
...
</xsl:stylesheet>
Implementation can be found at:
streamedtree.zip
jxom else if (google search)
Google helps with many things, but not with retrospective support.
Probably the guy is trying to build nested if/then/else
jxom elements.
We expected this and have defined a function
t:generate-if-statement() in
java-optimizer.xslt.
Its signature:
<!--
Generates if/then/else if ... statements.
$closure - a series of conditions and blocks.
$index - current index.
$result - collected result.
Returns if/then/else if ... statements.
-->
<xsl:function name="t:generate-if-statement" as="element()">
<xsl:param name="closure" as="element()*"/>
<xsl:param name="index" as="xs:integer"/>
<xsl:param name="result" as="element()?"/>
Usage is like this:
<!-- Generate a sequence of pairs: (condition, scope). -->
<xsl:variable name="branches" as="element()+">
<xsl:for-each select="...">
<!-- Generate condition. -->
<scope>
<!-- Generate statements. -->
</scope>
</xsl:for-each>
</xsl:variable>
<xsl:variable name="else" as="element()?">
<!-- Generate final else, if any. -->
</xsl:variable>
<!-- This generates if statement. -->
<xsl:sequence
select="t:generate-if-statement($branches, count($branches) - 1, $else)"/>
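The call above starts from the last (condition, scope) pair and passes the final else as the initial result, which suggests a fold from the last branch to the first. A minimal Java sketch of that idea follows; it builds strings instead of jxom elements, and the class and method names are hypothetical, not taken from java-optimizer.xslt.

```java
import java.util.Arrays;
import java.util.List;

final class IfChain {
    /**
     * Folds (condition, block) pairs from the last one to the first, so each
     * earlier "if" receives the chain built so far as its else part.
     * Strings are a stand-in for jxom elements.
     */
    static String generate(List<String> conditions, List<String> blocks, String elseBlock) {
        String result = elseBlock; // null when there is no final else
        for (int i = conditions.size() - 1; i >= 0; i--) {
            String statement = "if (" + conditions.get(i) + ") { " + blocks.get(i) + " }";
            result = result == null ? statement : statement + " else " + result;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(generate(
            Arrays.asList("a", "b"),
            Arrays.asList("x();", "y();"),
            "{ z(); }"));
    }
}
```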
P.S. By the way, we like that someone is looking into jxom.
By a generator we mean a function that produces an infinite output
sequence for a particular input.
That's a rather theoretical question, as xslt does not allow infinite
sequences, but look at the example:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt"
exclude-result-prefixes="xs t">
<xsl:template match="/">
<xsl:variable name="value" as="xs:string" select="'10100101'"/>
<xsl:variable name="values" as="xs:integer+"
select="t:generate(1)"/>
<!--<xsl:variable name="values" as="xs:integer+">
<xsl:call-template name="t:generate">
<xsl:with-param name="value" select="1"/>
</xsl:call-template>
</xsl:variable>-->
<xsl:variable name="integer" as="xs:integer" select="
sum
(
for $index in 1 to string-length($value)
return
$values[$index][substring($value, $index, 1) = '1']
)"/>
<xsl:message select="$integer"/>
</xsl:template>
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
<!--<xsl:template name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:call-template name="t:generate">
<xsl:with-param name="value" select="$value * 2"/>
</xsl:call-template>
</xsl:template>-->
</xsl:stylesheet>
Here the logic uses such a generator and decides by itself where to break.
Should such code be valid?
From the algorithmic perspective the example had better work, as the
generator logic and its use are two separate things.
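For comparison, the same separation of generator and consumer can be sketched in Java with a lazy stream. This is an illustration only: the class and method names are hypothetical, and Stream.iterate stands in for the recursive t:generate().

```java
import java.util.Iterator;
import java.util.stream.Stream;

final class Generator {
    /** Infinite lazy sequence: value, value * 2, value * 4, ... */
    static Stream<Long> generate(long value) {
        return Stream.iterate(value, v -> v * 2);
    }

    /** Sums the generated items at the positions where bits has '1'. */
    static long decode(String bits) {
        Iterator<Long> values = generate(1).iterator();
        long sum = 0;
        for (int i = 0; i < bits.length(); i++) {
            long item = values.next(); // the consumer decides how many items are pulled
            if (bits.charAt(i) == '1') sum += item;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(decode("10100101"));
    }
}
```

The generator knows nothing about where the sequence ends; the consumer pulls exactly as many items as the input string has characters.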
Lately, after playing a little with saxon tree models, we thought that the design
would be cleaner and the implementation faster if NamePool were implemented
differently.
Now, saxon is very pessimistic about java objects, thus it prefers to encode
qualified names with integers. The encoding and decoding is done in the
NamePool. Other parts of the code use these integer values.
Operations done over these integers are:
- equality comparison of two such integers in order to check whether two
qualified or extended names are equal;
- getting different parts of a qualified name from
NamePool.
We would design this differently. We would:
- create a
QualifiedName class to store all name parts.
- declare
NamePool to create and cache QualifiedName instances.
This way:
- equality comparison would be a reference comparison of two instances;
- getting different parts of a qualified name would become a trivial getter call;
- contention on such a name pool would be lower.
That's the implementation we would propose:
QualifiedName.java,
NameCache.java
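The idea can be illustrated with a minimal interning sketch. It is not the content of the files referenced above: the wrapper class, the field set, and the key layout are assumptions made for the illustration, and the prefix is treated as part of the name identity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class QualifiedNameSketch {
    /** Immutable name; instances are created only by the cache below. */
    static final class QualifiedName {
        private final String prefix;
        private final String namespaceUri;
        private final String localName;

        private QualifiedName(String prefix, String namespaceUri, String localName) {
            this.prefix = prefix;
            this.namespaceUri = namespaceUri;
            this.localName = localName;
        }

        String getPrefix() { return prefix; }
        String getNamespaceUri() { return namespaceUri; }
        String getLocalName() { return localName; }
    }

    /** Hands out one shared instance per distinct name. */
    static final class NameCache {
        private final ConcurrentMap<String, QualifiedName> names = new ConcurrentHashMap<>();

        QualifiedName get(String prefix, String namespaceUri, String localName) {
            // A prefix and a local name cannot contain a line break, so the key
            // is unambiguous when the namespace URI goes last.
            String key = prefix + '\n' + localName + '\n' + namespaceUri;
            return names.computeIfAbsent(
                key, k -> new QualifiedName(prefix, namespaceUri, localName));
        }
    }

    public static void main(String[] args) {
        NameCache cache = new NameCache();
        QualifiedName first = cache.get("xs", "http://www.w3.org/2001/XMLSchema", "element");
        QualifiedName second = cache.get("xs", "http://www.w3.org/2001/XMLSchema", "element");
        System.out.println(first == second);
    }
}
```

Since equal names always resolve to the same instance, callers compare names with == and read name parts through plain getters.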
Earlier, in the entry "Inline functions in xslt 2.1" we've described an implementation of xml tree model that may share subtrees among different trees.
This way, in a code:
<xsl:variable name="elements" as="element()*" select="..."/>
<xsl:variable name="result" as="element()">
<result>
<xsl:sequence select="$elements"/>
</result>
</xsl:variable>
the implementation shares the internal representation between $elements
and the subtree of $result. From the perspective of xslt they
look like completely different subtrees with different node identities, which is
in accordance with its view of the world.
After a short study we've decided to create a research implementation of this
tree model in saxon. It took only a couple of days to introduce minimal changes
to the engine, to refactor the linked tree into a new composable tree, and to
perform some tests.
In many cases saxon has benefited immediately from this new tree model; in some
other cases more tuning is required.
Our tests showed that this new tree performed better than the
linked tree, but a little bit worse than the tiny tree. On the other hand, it's
obvious that conventional code patterns avoid subtree copying, assuming it's an
expensive operation, thus one should rethink some code practices to benefit from
the composable tree.
Implementation can be downloaded at:
saxon.composabletree.zip
From the web we know that the xslt WG is now thinking about how to make xslt more friendly to huge documents. They will probably introduce some xslt syntax to let an implementation prepare for such processing.
They will probably introduce an indicator marking a specific mode for streaming. XPath in this mode will probably be restricted to some subset.
The funny part is that we implemented a similar feature back in 2002 in .net. It was called XPathForwardOnlyNavigator.
The implementation stored only several nodes at a time (the context node and its ancestors), and read data from XmlReader as needed. Thus one could navigate to ancestor elements, to children, and to the following siblings, but never to any previous node. When one tried to reach a node that was no longer available, we threw an exception.
It was simple and not perfect (too restrictive), but it was pluggable into .net's xslt, and allowed processing files of any size.
That old implementation looks very attractive even now in 2010. We expect that the WG's decisions will not rule out such or similar solutions, and will not force implementers to write an alternative engine for xslt streaming.
Xslt 1.0 has been designed based on the best intentions. Xslt 2.0 got a legacy
baggage.
If you're not entirely concentrated during translation of your algorithms into
xslt 2.0 you can get into a trap, as we did.
Consider a code snippet:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+"
select="$elements/t:convert(.)"/>
Looks simple, doesn't it?
Our intention was to get converted elements, which result from some
xsl:apply-templates logic.
Well, this code works... but rather sporadically, as results are often in wrong
order! This bug is very close to
what is called a
Heisenbug.
So, where is the problem?
Elementary, my dear Watson:
xsl:apply-templates constructs a sequence of rootless
elements.
$elements/t:convert(.) converts elements and orders them in document order.
Here is a tricky part:
The relative order of nodes in distinct trees is stable but
implementation-dependent...
Clearly each rootless element belongs to a unique tree.
After we realized what the problem was, the code was
immediately rewritten as:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+" select="
for $element in $elements return
t:convert($element)"/>
P.S. Taking into account the size of our xslt code base, it took half an hour to
localize the problem. Now we are in a position to review all uses of slashes
in xslt. How do you like it?
Opinions on xml namespaces
olegtk: @srivatsn Originally the idea was that namespace URI would point to some schema definition. Long abandoned idea.
Not so long ago, I've seen a good reasoning about the same subject:
Inline functions in xslt 2.1 often look like some strange aberration. Sure,
there are very useful cases when they are delegates of program logic
(e.g. comparators and filters), but often (probably more often) we can see that
they are used to model data structures.
As an example, suppose you want to model a structure with three properties, say
a, b, and c. You implement this by creating functions that wrap and unwrap the
data:
function make-data($a as item(), $b as item(), $c as item())
as function() as item()+
{
function() { $a, $b, $c }
}
function a($data as function() as item()+) as item()
{
$data()[1]
}
function b($data as function() as item()+) as item()
{
$data()[2]
}
function c($data as function() as item()+) as item()
{
$data()[3]
}
Clever?
Sure, it is! Here, we have modeled a structure with the help of a sequence, which we
have wrapped into a function item.
Alas, clever is not always good (often it's a sign of something bad). We just wanted to
define a simple structure. What does it have in common with a function?
There is a distance between what we want to express, designing an algorithm, and
what we see looking at the code. The greater the distance, the more effort is
required to write and to read the code.
It would be so good to have a simpler way to express such a concept as a structure.
Let's dream a little. Suppose you already have a structure, and just want to
access its members. An idea we can think about is an xpath-like access method:
$data/a, $data/b, $data/c
But wait a second, doesn't
$data look very much like an xml element, and aren't its accessors just
node tests?
That's correct, so the data constructor may coincide with the element constructor.
Then what are the pros and cons of using xml elements to model structures?
Pros are: the existing xml type system, and sensible looking code (you just
understand that here we're constructing a structure).
Cons are: xml trees are implemented in a way that does not assume fast (from the
performance perspective) composition, as when you construct a structure a copy of
the data is made.
But here we observe that "implemented" is a very important word in this context.
If an xml tree implementation did not store a reference to the parent node then
subtrees could be composed very efficiently (note that the tree is assumed to be
immutable). The parent node could be available through a tree navigator, which would
contain a reference to the node itself and to a parent tree navigator (or a child to parent map could be stored somewhere near the root).
Such a tree structure would probably help not only in this particular case but
also with other conventional xslt code patterns.
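The navigator idea can be sketched in a few lines of Java. This is an illustration of the data structure only, under the assumption of a bare tree of named nodes; the class names are hypothetical and have nothing to do with saxon's NodeInfo.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ComposableTree {
    /** Immutable node without a parent reference, so subtrees can be shared. */
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Collections.unmodifiableList(Arrays.asList(children.clone()));
        }
    }

    /** Node identity within a tree: the node plus the navigator of its parent. */
    static final class Navigator {
        final Node node;
        final Navigator parent; // null for the root

        Navigator(Node node, Navigator parent) {
            this.node = node;
            this.parent = parent;
        }

        Navigator child(int index) {
            return new Navigator(node.children.get(index), this);
        }

        String path() {
            return (parent == null ? "" : parent.path()) + "/" + node.name;
        }
    }

    public static void main(String[] args) {
        Node shared = new Node("a", new Node("b"));
        Node first = new Node("first", shared);   // no copy of "a" is made
        Node second = new Node("second", shared); // the same subtree, another parent
        System.out.println(new Navigator(first, null).child(0).child(0).path());
        System.out.println(new Navigator(second, null).child(0).child(0).path());
    }
}
```

The same subtree instance lives in two trees, while the navigators still report different ancestors, which is what keeps node identities distinct from the point of view of xslt.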
P.S. Saxon probably could implement its NodeInfo this way.
Update: see also Custom tree model.
A while ago we proposed to introduce maps as built-in types in the
xpath/xquery type system: Tuples and maps.
The suggestion has been declined (probably our arguments were not convincing). We,
however, still think that maps along with sequences are primitives required to
build sensible (not obscure) algorithms. To see that a map is at times the
best way to resolve a problem, we shall refer to a utility function that
allocates names in scope. Its signature looks like this:
<!--
Allocates unique names in the form $prefix{number}?.
Note that prefixes may coincide.
$prefixes - a name prefixes.
$names - allocated names pool.
$name-max-length - a longest allowable name length.
Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
<xsl:param name="prefixes" as="xs:string*"/>
<xsl:param name="names" as="xs:string*"/>
<xsl:param name="name-max-length" as="xs:integer?"/>
Just try to program it and you will find yourself coding something like the one defined at
cobolxom.
To be efficient such maps should provide, at worst, logarithmic operation
complexity:
- Access to the map through a key (and/or by index) - complexity is O(LogN);
- Creation of a new map with an added or removed item - complexity is O(LogN);
- Construction of the map from ordered items - complexity is O(N);
- Construction of the map from unordered items - complexity is O(N*LogN);
- Total enumeration of all items - complexity is O(N*LogN).
These performance metrics are found in many functional and procedural
implementations of the maps. Typical RB and AVL tree based maps satisfy these
restrictions.
What we suggest is to introduce map implementation into the exslt2 (assuming
inline functions are present). As a sketch we have implemented pure
AVL Tree in Xslt 2.0:
We do not assume that implementation like this should be used, but rather think
of some extension function(s) that provides a better performance.
What do you think?
The story about the immutable tree would not be complete without an xslt
implementation. To make it possible one needs something to approximate tree
nodes. You cannot implement such a construct efficiently in pure xslt 2.0 (it would
be either inefficient or not pure).
To isolate the problem we have used a tuple interface:
tuple:ref($items as item()*) as item() - to wrap items into a tuple;
tuple:deref($tuple as item()?) as item()* - to unwrap items from a tuple;
tuple:is-same($first as item(), $second as item()) as xs:boolean - to test
whether two tuples are the same.
and defined an inefficient implementation based on xml elements. Every other part
of the code is a regular AVL algorithm implementation.
We want to stress again that even assuming that there is a good tuple
implementation we would prefer a built-in associative container implementation.
Why the heck do you need to include about 1000 lines of code just to use a map?
Source code is:
We like Visual Studio very much, and try to adopt each new version early.
Lately our VS use pattern is centered around xml and xslt. In our opinion VS 2008 is the best xslt 2 editor we have ever seen, even with the lack of support for xslt 2.0 debugging.
Unfortunately, that is still true when VS 2010 is almost out. VS 2008 is just 2 - 3 times faster. You can observe this working with xslt files like those in languages-xom.zip (1000 - 2000 rows). Things just become slow.
We still hope that VS 2010 will make a final effort to outdo what VS 2008 has already delivered.
While bemoaning the lack of associative containers in the xpath type system, we
have come up with a good implementation of t:allocate-names(). The implementation
can be seen at
cobol-names.xslt.
It is based on recursion and on the use of xsl:for-each-group.
The algorithmic worst case complexity is O(N*LogN*LogL), where N is the number of names, and L is the length of
the longest name.
This does not invalidate the idea that associative containers
are very desirable, as blessed is the one who naturally types such an implementation. For us, it went the hard way: it took three days to realize that the original algorithm was problematic, and to work out a better one.
In practice this means 2 seconds for the new implementation against 25 minutes for the old one.
Why do we return to this theme again?
Well, it's itching!
In cobolxom there is an utility function to allocate names in scope. Its signature looks like this:
<!--
Allocates unique names in the form $prefix{number}?.
Note that prefixes may coincide.
$prefixes - a name prefixes.
$names - allocated names pool.
$name-max-length - a longest allowable name length.
Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
<xsl:param name="prefixes" as="xs:string*"/>
<xsl:param name="names" as="xs:string*"/>
<xsl:param name="name-max-length" as="xs:integer?"/>
We have created several different implementations (all use recursion). Every implementation works fairly well for relatively small input sequences, say N < 100, but we have cases when there are about 10000 items on input. The algorithm's worst case complexity, in the absence of associative containers, is O(N*N), and be sure it's an O-o-o-oh... due to the xslt engine implementation.
If there were associative containers with efficient access (complexity O(LogN)) and construction of an updated container (complexity also O(LogN)), then the implementation would be straightforward and would have complexity O(N*LogN).
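To show how straightforward the task becomes once a map is available, here is a minimal Java sketch. It uses mutable hash containers rather than the immutable ones discussed here, and the class name and the exact naming rules (a bare prefix first, then prefix1, prefix2, ..., with the prefix truncated to respect the maximal length) are assumptions, not the cobolxom behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NameAllocator {
    /**
     * Allocates a unique name for each prefix, avoiding already used names.
     * maxLength may be null, which means no limit.
     */
    static List<String> allocateNames(List<String> prefixes, Set<String> names, Integer maxLength) {
        Set<String> used = new HashSet<>(names);
        Map<String, Integer> next = new HashMap<>(); // next number to try, per prefix
        List<String> result = new ArrayList<>();
        for (String prefix : prefixes) {
            int number = next.getOrDefault(prefix, 0);
            String name;
            do {
                String suffix = number == 0 ? "" : Integer.toString(number);
                String head = prefix;
                if (maxLength != null && head.length() + suffix.length() > maxLength) {
                    // Assumption: long names are shortened by truncating the prefix.
                    head = head.substring(0, Math.max(0, maxLength - suffix.length()));
                }
                name = head + suffix;
                number++;
            } while (!used.add(name)); // add() returns false when the name is taken
            next.put(prefix, number);
            result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> taken = new HashSet<>(Arrays.asList("a"));
        System.out.println(allocateNames(Arrays.asList("a", "a", "b"), taken, null));
    }
}
```

The per-prefix counter means a repeated prefix does not rescan numbers that were already handed out, which is the part that is hard to express without a map.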
The very same simple tasks tend to appear in different languages (e.g.
C# Haiku).
Now we have to find:
- integer and fractional part of a decimal;
- length and precision of a decimal.
These tasks have no trivial solutions in xslt 2.0.
At present we have come up with the following answers:
Fractional part:
<xsl:function name="t:fraction" as="xs:decimal">
<xsl:param name="value" as="xs:decimal"/>
<xsl:sequence select="$value mod 1"/>
</xsl:function>
Integer part v1:
<xsl:function name="t:integer" as="xs:decimal">
<xsl:param name="value" as="xs:decimal"/>
<xsl:sequence select="$value - t:fraction($value)"/>
</xsl:function>
Integer part v2:
<xsl:function name="t:integer" as="xs:decimal">
<xsl:param name="value" as="xs:decimal"/>
<xsl:sequence select="
if ($value ge 0) then
floor($value)
else
-floor(-$value)"/>
</xsl:function>
Length and precision:
<!--
Gets a decimal specification as a closure:
($length as xs:integer, $precision as xs:integer).
-->
<xsl:function name="t:decimal-spec" as="xs:integer+">
<xsl:param name="value" as="xs:decimal"/>
<xsl:variable name="text" as="xs:string" select="
if ($value lt 0) then
xs:string(-$value)
else
xs:string($value)"/>
<xsl:variable name="length" as="xs:integer"
select="string-length($text)"/>
<xsl:variable name="integer-length" as="xs:integer"
select="string-length(substring-before($text, '.'))"/>
<xsl:sequence select="
if ($integer-length) then
($length - 1, $length - $integer-length - 1)
else
($length, 0)"/>
</xsl:function>
The last function looks odious. In many other languages its implementation
would be considered embarrassing.
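For comparison, a rough Java counterpart based on BigDecimal is sketched below. The class and method names are hypothetical; decimalSpec() deliberately mirrors the string-based logic above (digits are counted on the plain text form without trailing zeros), which is an assumption about the intended meaning of length and precision.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class DecimalParts {
    /** Integer part, truncated toward zero. */
    static BigDecimal integerPart(BigDecimal value) {
        return value.setScale(0, RoundingMode.DOWN);
    }

    /** Fractional part; it keeps the sign of the value, like "$value mod 1". */
    static BigDecimal fraction(BigDecimal value) {
        return value.remainder(BigDecimal.ONE);
    }

    /** Returns {length, precision}: total digits and digits after the point. */
    static int[] decimalSpec(BigDecimal value) {
        String text = value.abs().stripTrailingZeros().toPlainString();
        int dot = text.indexOf('.');
        return dot < 0
            ? new int[] { text.length(), 0 }
            : new int[] { text.length() - 1, text.length() - dot - 1 };
    }

    public static void main(String[] args) {
        BigDecimal value = new BigDecimal("-123.45");
        int[] spec = decimalSpec(value);
        System.out.println(integerPart(value) + " " + fraction(value)
            + " " + spec[0] + " " + spec[1]);
    }
}
```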
Continuing with the post "Ongoing xslt/xquery spec update"
we would like to articulate what options regarding associative containers do we
have in a functional languages (e.g. xslt, xquery), assuming that variables are
immutable and implementation is efficient (in some sense).
There are three common implementation techniques:
- store data (key, value pairs) in a sorted array, and use binary search to
access values by a key;
- store data in a hash map;
- store data in a binary tree (usually RB or AVL trees).
The implementation choice depends considerably on the operations performed on
the container. Usually these are:
- construction;
- value lookup by key;
- key enumeration (ordered or not);
- container modification (adding data to and removing data from the
container);
- access to elements by index.
Note that modification in functional programming means creation of a new
container, so here is a
division:
- If the container's use pattern does not include modification, then probably the
simplest solution is to build it as an ordered sequence of
pairs, and use binary search to access the data. Alternatively, one could
implement the associative container as a hash map.
- If modification is essential then neither an ordered sequence of pairs, a hash map,
nor a classical tree implementation can be used, as they are either too slow
or too memory-hungry, either during modification or during access.
On the other hand, to deal with container modifications one can build
an implementation which uses "top-down" RB
or AVL trees. To see the
difference consider a classical tree structure and its functional variant:
| | Classical | Functional |
| Node structure: | node: parent, left, right, other data | node: left, right, other data |
| Node reference: | node itself | node path from a root of a tree |
| Modification: | either mutable or requires a completely new tree | O(LnN) nodes are created |
Here we observe that:
- one can implement an efficient map (lookup time no worse than O(LnN)) with no
modification support, using an ordered array;
- one can implement an efficient map with support of modification, using an immutable binary tree;
- one can implement all these algorithms purely in xslt and xquery (provided that inline
functions are supported);
- any such implementation will lose against the same implementation
written in C++, C#, java;
- the best implementation would probably start from a sorted array and
switch to a binary tree after some size threshold.
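The "functional" column of the comparison can be illustrated with a small Java sketch of path copying. It is not the AVL implementation mentioned in this post: the class below is hypothetical and does no rebalancing, so the number of copied nodes equals the depth of the key, which is O(LnN) only while the tree happens to be balanced.

```java
/** Immutable binary search tree node; a hypothetical sketch without rebalancing. */
final class PersistentTree {
    final int key;
    final String value;
    final PersistentTree left;
    final PersistentTree right;

    PersistentTree(int key, String value, PersistentTree left, PersistentTree right) {
        this.key = key;
        this.value = value;
        this.left = left;
        this.right = right;
    }

    /** Returns a new tree with the pair added; the original tree is untouched. */
    static PersistentTree put(PersistentTree node, int key, String value) {
        if (node == null) {
            return new PersistentTree(key, value, null, null);
        }
        if (key < node.key) {
            // Copy this node only; the whole right subtree is shared.
            return new PersistentTree(node.key, node.value, put(node.left, key, value), node.right);
        }
        if (key > node.key) {
            // Copy this node only; the whole left subtree is shared.
            return new PersistentTree(node.key, node.value, node.left, put(node.right, key, value));
        }
        return new PersistentTree(key, value, node.left, node.right);
    }

    /** Looks a value up by key; returns null when the key is absent. */
    static String get(PersistentTree node, int key) {
        while (node != null) {
            if (key < node.key) {
                node = node.left;
            } else if (key > node.key) {
                node = node.right;
            } else {
                return node.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PersistentTree first = put(put(put(null, 2, "b"), 1, "a"), 3, "c");
        PersistentTree second = put(first, 3, "C");

        System.out.println(get(first, 3) + " " + get(second, 3));
        System.out.println(first.left == second.left);
    }
}
```

Both versions of the map stay valid after the update, and only the nodes on the path from the root to the changed key are created anew.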
Here we provide a C# implementation of a functional AVL tree, which also supports
element indexing:
Our intention was to show that the usual algorithms for associative
containers apply in functional
programming; thus a feature-complete functional language must support
associative containers to make development more deliberate, and to free a
developer from reinventing basic things that have existed for almost half a
century.
Several years ago we started a new project. We neither like nor hate
any particular language, thus the decision what language to use was pragmatic:
xslt 2.0 fitted perfectly.
At present it's a solid volume of xslt code. It exhibits all the virtues of any
other good project in another language: clean design, modularity, documentation,
and lack of sophistication (good code should not be clever).
The runtime profile of the project is that it deals with xml documents with sizes
from a few dozen bytes to several megabytes, and with xml schemas ranging from
simple ones, like tabular data, to rich ones like xhtml, and to untyped data. A pipeline
of stylesheets processes gigabytes of data stored in the database and in files.
All the bragging above is needed here to introduce the context for the following
couple of lines.
The diversity of load conditions and a big code base put the xslt engine of
our choice to a good test. The victim is Saxon. In the course of the project we
have found and reported many bugs. Some of them are nasty and important, and
others are bearable. To Michael Kay's credit (he's the owner of Saxon) all bugs are being
fixed promptly (see
the last one).
Such a project helps us to understand the weak sides of xslt (it sometimes seems that they, in the WG, lack such experience, which should guide them).
Well, it so happens that we're helping the Saxon project. Unintentionally,
however!
P.S.
About language preferences.
Nowadays we're polishing a
COBOL
generation. To this end we have familiarized ourselves with this language.
That's a beautiful language. Its straightforwardness helps to see the evolution
of computer languages and to understand what today's languages timidly try to
hide, and why.
We have updated
languages-xom.zip. There are many fixes in cobolxom (well, cobolxom is new,
and probably there will be some more bugs). Also we have included Xml Object
Model for the SQL, which in fact has appeared along with jxom.
SQL xom supports basic sql syntax including common table expressions, and two
dialects for DB2 and Oracle.
Recently W3C has published new drafts for
xquery 1.1 and for
xpath 2.1. We have noticed that
the committee has decided to introduce inline functions both for xquery and for
xpath.
That's really good news! This way xquery, xpath and xslt are
approaching Object Oriented Programming in the way javascript does with its
inline functions.
Now we shall be able to implement tuples (a sequence of items wrapped into
a single item), objects with named properties, trees (e.g. RB Tree), associative
containers (tree maps and hash maps, sets).
Surely, all this will be in the spirit of functional programming.
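To make the idea concrete, here is a small Java sketch (Java lambdas stand in for inline functions; all names are hypothetical). A tuple is a function item that captures a whole sequence, and an object with named properties is a function from a property name to its value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch: closures used as tuples and as objects. */
final class FunctionItems {
    /** Tuple: a whole sequence captured by a single function item. */
    static <T> Supplier<List<T>> tuple(List<T> items) {
        List<T> copy = Collections.unmodifiableList(new ArrayList<>(items));

        return () -> copy;
    }

    /** Object with named properties: a function from a name to a value. */
    static Function<String, Object> object(Map<String, Object> properties) {
        Map<String, Object> copy = new HashMap<>(properties);

        return copy::get;
    }

    public static void main(String[] args) {
        // A sequence of two tuples has two items, whatever the tuples contain.
        List<Supplier<List<Integer>>> tuples = new ArrayList<>();
        tuples.add(tuple(Arrays.asList(1, 2, 3)));
        tuples.add(tuple(Arrays.asList(4, 5)));

        System.out.println(tuples.size() + " " + tuples.get(0).get());
    }
}
```

Unlike a plain sequence, which is always flattened when it is nested into another sequence, a function item stays a single item.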
The only thing we regret is that the WG did not include built-in
implementations for trees and associative containers, as we don't believe that
one can create an efficient implementation of these abstractions either in
xquery or in xslt (asymptotically results will be good, but coefficients will
be painful).
See also:
Tuple and maps
Not sure how things work for others, but for us it turns out that Saxon 9.2 introduces new bugs, works slower and eats much more memory than its predecessor v9.1.
See
Memory problem with V9.2.
We hope all this will be fixed soon.
Update: By the way, Saxon 9.2 (at the moment 2009-01-04) does not like (despises in fact) small documents and especially text nodes in those documents. It loves huge in-memory documents, however.
Update 2009-01-05: the case is closed, the fix is committed into svn.
Today, I've tried to upgrade our projects to Saxon 9.2. We have a rather big set
of stylesheets grinding gigabytes of information. It's obvious that we
expected at least the same performance from the new version.
But to my puzzlement a pipeline of transformations failed almost immediately
with an error message:
XPTY0018: Cannot mix nodes and atomic values in the result of a path expression
We do agree with this statement in general, but what does it have to do with our
stylesheets? And how come everything worked in 9.1?
To find the root of the problem I've created a minimal problem reproduction:
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="this"
  exclude-result-prefixes="xs t">

  <!-- Entry point. -->
  <xsl:template match="/">
    <xsl:variable name="p" as="element()">
      <p l="1"/>
    </xsl:variable>

    <o l="{$p/t:d(.)}"/>
  </xsl:template>

  <xsl:function name="t:d" as="item()*">
    <xsl:param name="p" as="element()"/>

    <xsl:apply-templates mode="p" select="$p"/>
  </xsl:function>

  <xsl:template match="*" mode="p">
    <xsl:sequence select="concat('0', @l)"/>
  </xsl:template>

</xsl:stylesheet>
Really simple, isn't it? The problem is in a new optimization of the
concat() function, introduced in version 9.2. It tries to eliminate
string concatenation, and in certain cases emits its arguments directly into the
output as text nodes, separating the whole output with some stopper strings. The
only problem is that no such optimization is allowed in this particular case
(which is rather common, and surely legal, in our stylesheets); the result of
<xsl:template match="*" mode="p"> should not be a node, but a value of type
xs:string.
Saxon 9.2 has been out for 3 months, at least! So how come such
a bug was not
discovered earlier?
Update: the fix was committed into svn the next day. That's prompt!
We've added a new language to the set of Xml Object Model schemas and stylesheets.
The newcomer is COBOL! No jokes. It's not a whim, really. Believe it or
not but COBOL is still alive and we need to generate it (mainly different sorts of
proxies).
We've used VS COBOL II grammar Version 1.0.3 as a reference. Implemented grammar
is complete but without preprocessor statements. On the other hand it defines COPY and EXEC SQL constructs.
Definitely, it'll take time for the xml schema and xslt implementation to
become mature.
Now language XOM is:
- jxom - for java;
- csharpxom - for C#;
- cobolxom - for COBOL.
Sources can be found at
languages-xom.
Given:
- an xml defining elements and groups;
- each element belongs to a group or groups;
- group may belong to another group.
Find:
- groups a given element directly or indirectly belongs to;
- a function checking whether an element belongs to a group.
Example:
<groups>
  <group name="g1">
    <element ref="e1"/>
    <element ref="e2"/>
    <element ref="e3"/>
    <group ref="g2"/>
  </group>
  <group name="g2">
    <element ref="e5"/>
  </group>
  <group name="g3">
    <element ref="e1"/>
    <element ref="e4"/>
  </group>
</groups>
There are several solutions depending on the aggressiveness of optimization. A
moderate one is done through xsl:key. All this is reminiscent of recursive common
table expressions in SQL.
Anyone?
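To pin the task down (this is not the xsl:key solution hinted at above), here is a minimal Java sketch of the required closure. The class is hypothetical, and it assumes that element names and group names never collide, as in the example (e1..e5 and g1..g3).

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical index answering "which groups does a member belong to". */
final class GroupIndex {
    /** Member name (element or group) to the groups that list it directly. */
    private final Map<String, Set<String>> parents = new HashMap<>();

    /** Registers that a group directly contains an element or another group. */
    void add(String group, String member) {
        parents.computeIfAbsent(member, key -> new LinkedHashSet<>()).add(group);
    }

    /** Groups the member belongs to, directly or indirectly. */
    Set<String> groupsOf(String member) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();

        pending.push(member);

        while (!pending.isEmpty()) {
            for (String group : parents.getOrDefault(pending.pop(), Collections.emptySet())) {
                // A group is expanded only once, which also guards against cycles.
                if (result.add(group)) {
                    pending.push(group);
                }
            }
        }

        return result;
    }

    /** Checks whether the member belongs to the group. */
    boolean belongs(String member, String group) {
        return groupsOf(member).contains(group);
    }

    public static void main(String[] args) {
        GroupIndex index = new GroupIndex();

        index.add("g1", "e1");
        index.add("g1", "g2");
        index.add("g2", "e5");

        System.out.println(index.groupsOf("e5"));
    }
}
```

For the example above, e5 belongs to g2 directly and to g1 through g2, while e1 belongs to g1 and g3.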
A client asked us to produce Excel reports in an ASP.NET
application. They've given us Excel templates, and also defined what they want to show.
What are our options?
- Work with Office COM API;
- Use Office Open XML SDK (which is a set of pure .NET
API);
- Try to apply xslt somehow;
- Macro, other?
For us, biased towards xslt, it's hard to make a fair choice. To judge, we've
tried to formalize the client's request and to look into future support.
So, we have defined sql stored procedures to provide the data. This way data can be
represented either as an ADO.NET DataSet, as a set of classes, as xml, or in another reasonable format. We do not
foresee any considerable problem with data representation if the client decides
to modify the reports in future.
It's not so easy when we think about Excel generation.
Due to ignorance we thought that Excel is much like xslt in some regard, and
that it's possible to provide tabular data in some form and create an Excel
template, which will consume the data to form the final output. To some extent
it's possible, indeed, but you have to start creating macros or vb scripts to
achieve acceptable results.
When we mentioned macros to the client, they immediately stated that
such a solution won't work due to security reasons.
Comparing the COM API and the Open XML SDK we can see that both provide almost the same
level of service for us, except that the latter is much lighter and supports only the Open XML format, while the former is a heavy
API exposing MS Office and also supports earlier versions.
Both solutions have a considerable drawback: it's not easy to create an Excel
report in C#, and it will be a pain to support such a solution if the client asks,
say in half a year, to modify something in an Excel template or to create one more
report.
Thus we turned to xslt. There we found two more directions:
- generate data for Office Open XML;
- generate xml in the format of MS Office 2003.
It turned out that it's a rather nontrivial task to generate data for Open XML,
and it's not due to the format, which is not xml at all but a zipped folder
containing xmls. The problem is in the complex schemas and in the many complex
relations between the files constituting an Open XML document. In contrast, the MS
Office 2003 format allows us to create a single xml file for the spreadsheet.
Choosing between the standard, up-to-date format and the older proprietary one, the
latter looks more attractive for development and support.
At present we're inclined to use xslt and to generate files in MS Office
2003 format. Are there better options?
Did you ever hear that double numbers may cause roundings, and that
many financial institutions are very sensitive to those roundings?
Sure you did! We're also aware of this kind of problem, and we thought we'd
taken care of it. But things are not that simple, as you don't always
know what impact the problem can have.
To understand the context it's enough to say that we're converting (using xslt by the way) programs
written in a CASE tool called
Cool:GEN into java and into C#. Originally, Cool:GEN generated COBOL and C
programs as deliverables. Formally, clients compare COBOL results vs java or C#
results, and they want them to be as close as possible.
For one particular client it was crucial to have correct results during
manipulation with numbers with 20-25 digits in total, and with 10 digits after a decimal point.
Clients are definitely right, and we've introduced generation options to control
how to represent numbers in java and C# worlds; either as double or
BigDecimal (in java), and decimal (in C#).
That was our first implementation. Reasonable and clean. Was it enough? - Not at
all!
The client reported that java's results (they use java and BigDecimal
for every number with a decimal point) are too precise, compared to Mainframe's
(MF) COBOL. This rather unusual complaint puzzled us a little, but the client
confirmed that they want no more precise results than those MF produces.
The reason for the difference was that both C# and especially java may
store many more decimal digits than is defined for the particular result on MF.
There, whenever you define a field storing 5 digits after the decimal point, you're
sure that exactly 5 digits will be stored. This contrasts very much with the results
we had in java and C#, as both multiplication and division can produce many more
digits after the decimal point. The solution was to truncate(!) (not to round) the
numbers to the specific precision in property setters.
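A truncating setter can be sketched in Java like this. The bean and the scale of 5 digits are hypothetical; the actual generated code is not shown in this post.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical bean with a field declared with 5 digits after the decimal point. */
final class Account {
    /** Assumed field definition: 5 fractional digits. */
    private static final int RATE_SCALE = 5;

    private BigDecimal rate = BigDecimal.ZERO;

    BigDecimal getRate() {
        return rate;
    }

    void setRate(BigDecimal value) {
        // RoundingMode.DOWN truncates toward zero; HALF_UP would round instead.
        rate = value.setScale(RATE_SCALE, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        Account account = new Account();

        account.setRate(new BigDecimal("1.999999"));
        System.out.println(account.getRate());
    }
}
```

With rounding the stored value would become 2.00000; with truncation it stays 1.99999.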
So, has it resolved the problem? - No, still not!
The client reported that now the results are much better (they coincide with MF, in fact)
but there are still several instances where they observe differences in the 9th and
10th digits after the decimal point, and again java's results are more accurate.
No astonishment from us this time, just an analysis of the reason for the difference.
It turned out that the previous solution was partial. We were doing a final truncation
but there were still intermediate results, as in a/(b * c) or in a * (b/c).
For intermediate results MF's COBOL has its own, rather nontrivial, formulas (and
options) per operation, defining the number of digits to keep after the
decimal point. After we added similar options to the generator, several
truncations appeared in the code to adjust intermediate results. This way
we reached the same accuracy as MF has.
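The effect of intermediate truncation is easy to reproduce. The Java sketch below is hypothetical: it assumes that the intermediate quotient keeps the same number of digits as the final result, whereas the real COBOL rules are per operation and are not reproduced here.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical demonstration of a * (b / c) with and without intermediate truncation. */
final class IntermediateTruncation {
    /** Truncates only the final result; the quotient keeps 30 digits. */
    static BigDecimal finalOnly(BigDecimal a, BigDecimal b, BigDecimal c, int scale) {
        BigDecimal quotient = b.divide(c, 30, RoundingMode.DOWN);

        return a.multiply(quotient).setScale(scale, RoundingMode.DOWN);
    }

    /** Truncates the intermediate quotient as well (assumed intermediate scale). */
    static BigDecimal everyStep(BigDecimal a, BigDecimal b, BigDecimal c, int scale) {
        BigDecimal quotient = b.divide(c, scale, RoundingMode.DOWN);

        return a.multiply(quotient).setScale(scale, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("3");
        BigDecimal b = new BigDecimal("2");
        BigDecimal c = new BigDecimal("3");

        System.out.println(finalOnly(a, b, c, 2) + " vs " + everyStep(a, b, c, 2));
    }
}
```

For 3 * (2/3) with two digits kept, the first variant gives 1.99 and the second 1.98: the "more accurate" answer is exactly the one the client did not want.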
What have we learned (reiterated)?
- Simple problems may have a far-reaching impact.
- More precise is not always better. Client often prefers compatible rather than
more accurate results.
If you have a string variable $value as xs:string , and want to know whether it starts with a digit, then what's the best way to do it in xpath?
Our answer is: ($value ge '0') and ($value lt ':') .
Looks a little funny (and disturbing).
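The trick relies on ':' being the character right after '9', so every string that starts with a digit sorts between '0' and ':'. This holds when strings are compared by code points, which is an assumption about the collation in use. A Java sketch of the same comparison (the class name is hypothetical):

```java
/** Hypothetical mirror of ($value ge '0') and ($value lt ':'). */
final class DigitCheck {
    static boolean startsWithDigit(String value) {
        // String.compareTo orders strings by UTF-16 code units.
        return value.compareTo("0") >= 0 && value.compareTo(":") < 0;
    }

    public static void main(String[] args) {
        System.out.println(startsWithDigit("9 lives"));
        System.out.println(startsWithDigit("nine lives"));
    }
}
```

An empty string sorts before '0', so it is rejected without a separate check.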
In our project we're generating a lot of xml files, which are subject to manual
changes and repeated generations (often with slightly different generation
options). The life cycle of such an xml can be described as follows:
- generate original xml (version 1)
- manual changes (version 2)
- next generation (version 3)
- manual changes integrated into the new generation (version 4)
If these were regular text files we could use the diff utility to prepare
a patch between versions 1 and 2, and apply it with the patch utility to
version 3. Unfortunately xml has additional semantics compared to plain text. What's an
invariant or a simple modification in xml is often a drastic change in text.
diff/patch does not work well for us. We need xml diff
and patch.
The first guess is to google it! Not so simple.
We have failed to find a tool or an API that can be used from ant. There are a
lot of GUIs that show xml differences and perform a manual merge, and tools that do
things similar to, but different from, what we need
(like MS's xmldiffpatch).
Please point us to such a program!
Meanwhile, we need to proceed. We don't believe that such a tool can be
knocked together quickly, as it's a task that is heuristic and mathematical at the same
time, requiring a careful design and good statistics for the use cases. Our idea
is to exploit
diff/patch. To achieve the goal we're going to
perform some normalization of xmls before diff to remove redundant
invariants, and normalization after the patch to return it to a readable form.
This includes:
- ordering attributes by their names;
- replacing insignificant whitespace with line breaks;
- entering line breaks after element names and before attributes, after
an attribute name and before its value, and after an attribute value.
This way we expect to receive files reacting to modifications similarly to text
files.
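The normalization before diff can be sketched in Java with the standard DOM API. This is a hypothetical illustration, not the tool we use. It assumes that whitespace-only text is insignificant, and it ignores comments, processing instructions and the reverse (post-patch) formatting.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical normalizer producing one xml token per line. */
final class DiffFriendlyXml {
    static String normalize(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();

            write(document.getDocumentElement(), out);

            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse xml.", e);
        }
    }

    private static void write(Element element, StringBuilder out) {
        out.append('<').append(element.getTagName()).append('\n');

        // TreeMap orders attributes by their names.
        Map<String, String> attributes = new TreeMap<>();
        NamedNodeMap map = element.getAttributes();

        for (int i = 0; i < map.getLength(); i++) {
            Attr attribute = (Attr) map.item(i);

            attributes.put(attribute.getName(), attribute.getValue());
        }

        // The name and the value of an attribute go on separate lines.
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            out.append(entry.getKey()).append('\n')
                .append("=\"").append(escape(entry.getValue())).append("\"\n");
        }

        out.append(">\n");

        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                write((Element) child, out);
            } else if (child instanceof Text && !child.getNodeValue().trim().isEmpty()) {
                out.append(escape(child.getNodeValue())).append('\n');
            }
        }

        out.append("</").append(element.getTagName()).append(">\n");
    }

    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

After such normalization, reordering attributes produces no difference at all, and a change of a single attribute value touches a single line.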
At present the C# serializer knows how to print comments and do some formatting (we had to create a micro xml serializer within xslt to serialize xml comments). C#'s formatting is not as advanced as java's, but it does not need to be, as C# text tends to be neater due to properties and events. Compare:
Java: instance.getItems().get(10).setValue(value);
vs
C#: instance.Items[10].Value = value;
TODO: implement API existing in jxom and missing in C# xom. This includes:
- name normalization - rewriting tree to make names unique (duplicate names often appear during generation from code templates);
- namespaces normalization - rewriting tree to elevate type namespaces (during generation, types are usually fully qualified);
- unreachable code detection - optional feature (in java it's required, as unreachable code is an error, while in C# it's only a warning);
- compile time expression evaluation - optional feature used in code optimization and in reachability checks;
- state machine refactoring - not sure, as C# has the
yield statement that does a similar thing.
Update can be found at: jxom/C# xom.
June, 24 update: name and namespace normalizations are implemented.
Writing a language serializer is as easy a task as riding a bicycle. Once you have learned it, you don't need any mental effort to create a new one.
It still requires considerable mechanical effort to write and test things.
Well, this is the first draft of the C# xslt serializer. Archive contains both C# xom and jxom.
Note: comments are not supported yet; nothing is done to format code except line wrapping.
Well, it's jxom no more but also csharpxom!
Project concerns demanded that we create a C# 3.0 xml schema.
Shortly we expect to create an xslt serializing an xml document in this schema into text. Thanks to the original design we can reuse the java streamer almost without changes.
A fact: the C# schema is more than twice as big as java's.
Yesterday, we've found an article
"Repackaging Saxon". It's about a decision to go away from Saxon-B/Saxon-SA
packaging to a more conventional product line: Home/Professional/Enterprise
Editions.
The good news is that Saxon stays open source. That's most important, as the
open community spirit will be preserved. On the other hand Professional and
Enterprise Editions will not be free.
In this regard the most interesting comments are:
John Cowan> I suspect that providing packaging only for $$ (or pounds or euros) won't actually work, because someone else will step in and provide that packaging for free, as the licensing permits.
and response:
Michael Kay> This will be interesting to see. I'm relying partly on the idea that there's a fair degree of trust, and expectation of support, associated with Saxonica's reputation, and that the people who are risking their business on the product might be hesitant to rely on third parties, who won't necessarily be prompt in issuing maintenance releases etc; at the same time, such third parties may serve the needs of the hobbyists who are the real market for the open source version.
and also:
Michael Kay> ...I haven't been able to make a model based on paid services plus free software work for me. It's hard enough to get the services business; when you do get it, it's hard to get enough revenue from it to fund the time spent on developing and supporting the software. Personally, I think the culture of free software has gone too far, and it is now leading to a lack of investment in new software...
Recently, we have started looking into a task of creating an interactive parser. A generic one.
Yes, we know there are plenty of them all around, however the goals we have
defined made us construct a new implementation.
The goals:
- Parser must be incremental.
You should direct what to parse, and when to stop. This virtually demands a "pull" rather than a conventional "push"
implementation.
- Parser must be able to synchronize a tree with text.
Whenever the underlying text is changed, only a limited part of the tree should
be updated.
- Parser should be able to recover from errors, and continue parsing.
- Parser should be manageable.
This is a goal of every program, really.
- Parser must be fast.
- A low memory footprint is desired.
What's implemented (VS2008, C#) and put at SourceForge, is called an
Incremental Parser.
These are parser's insights:
- Bookmarks are objects to track text points. We use a binary tree (see
Bare binary tree algorithms)
to adjust positions of bookmarks when text is changed.
- Ranges define parsed tree elements. Each range is defined by two
bookmarks, and a grammar annotation.
- There are grammar primitives, which can be composed into a grammar graph.
- A grammar graph along with ranges form a state machine.
- Grammar chains are cached, turning
parsing into a series of probes of literal tokens and transitions between
grammar chains. This caching is done on demand, which results in warming-up effect.
- Parser itself includes a random access tokenizer, and a queue of ranges
pending to be parsed.
- Parsing is conducted as a cycle of pulling and parsing of pending ranges.
- Whenever text is changed, the closest range is queued for reparsing.
- A balance between the amount of parsing and memory consumption can be achieved
through the level of detail of the grammar annotation for a range. An active text
fragment can be fully annotated, while for other text parts a coarse range can
be stored.
We have defined
xpath like grammar to test our ideas. See printed
parsed trees to get understanding of what information can be seen from
ranges.
A simple demand nowadays - a good IDE.
Almost ten years have passed since xslt appeared but still, we're not pleased with IDEs
claiming xslt support. Our expectations are not too high. There are things however, which must be present in such
an IDE.
- A notion of project, and possibly a group of projects. You may think of
it as a main xslt including other xslts participating in the project.
- A code completion. A feature providing typing hints for language constructs, includes,
prefixes, namespaces, functions, templates, modes, variables, parameters, schema
elements, and other (all this should work in a context of the project).
- A code refactoring. A means to move parts of code between (or inside) files and projects,
rename things (functions, templates, parameters, variables, prefixes,
namespaces, and other).
- Code validation and run.
- Optional debug feature.
We would be grateful if someone pointed us to any such IDE.
Once upon a time, we created a function mimicking
decapitalize() method defined in java in java.beans.Introspector. Nothing
special, indeed. See the source:
/**
 * Utility method to take a string and convert it to normal Java variable
 * name capitalization. This normally means converting the first
 * character from upper case to lower case, but in the (unusual) special
 * case when there is more than one character and both the first and
 * second characters are upper case, we leave it alone.
 * <p>
 * Thus "FooBah" becomes "fooBah" and "X" becomes "x", but "URL" stays
 * as "URL".
 *
 * @param name The string to be decapitalized.
 * @return The decapitalized version of the string.
 */
public static String decapitalize(String name) {
    if (name == null || name.length() == 0) {
        return name;
    }
    if (name.length() > 1 && Character.isUpperCase(name.charAt(1)) &&
            Character.isUpperCase(name.charAt(0))){
        return name;
    }
    char chars[] = name.toCharArray();
    chars[0] = Character.toLowerCase(chars[0]);
    return new String(chars);
}
We typed implementation immediately:
<xsl:function name="t:decapitalize" as="xs:string">
  <xsl:param name="value" as="xs:string?"/>

  <xsl:variable name="c" as="xs:string"
    select="substring($value, 2, 1)"/>

  <xsl:sequence select="
    if ($c = upper-case($c)) then
      $value
    else
      concat
      (
        lower-case(substring($value, 1, 1)),
        substring($value, 2)
      )"/>
</xsl:function>
It worked all right until recently, when it failed, as the output was
different from that of its java counterpart.
The input was W9Identifier. The function naturally returned the same value, while
java returned w9Identifier. We were wrong to assume that
$c = upper-case($c) returns true only when the character is an upper case letter. That's
not correct for digits, which are equal to their own upper case. The correct way is:
<xsl:function name="t:decapitalize" as="xs:string">
  <xsl:param name="value" as="xs:string?"/>

  <xsl:variable name="c" as="xs:string"
    select="substring($value, 2, 1)"/>

  <xsl:sequence select="
    if ($c != lower-case($c)) then
      $value
    else
      concat
      (
        lower-case(substring($value, 1, 1)),
        substring($value, 2)
      )"/>
</xsl:function>
Last year we were working on a project, which in essence dealt with
transformation of graphs. Our experience with xslt 1.0, and other available information,
was promising: xslt 2.0 looked like a perfect match.
We were right, xslt 2.0 fitted the problem very well.
It's easy to learn xslt 2.0/xquery: get acquainted with xml schema; read through
the syntax, which is rather concise; look at examples, and start coding. You
will learn the API incrementally.
Like other languages, xslt 2.0 is only a medium to express algorithms. As such
it fills its role rather well, as well as SQL:2003 and its variations do, and
sometimes even better than other programming languages like C++ do.
Compare the expressions "get data satisfying a specific criterion" and "for each
data part check a specific condition, and if it is true, add it to the result".
These often represent the same idea from two perspectives: human (or math)
thinking; and thinking in terms of an execution procedure.
Both kinds of expressions have their use, however it so happens that we're
human beings and perceive more easily natural language notions like:
subjects, objects, predicates, deduction, induction and so on. I think the
reason is that a human's (not positronic) brain grasps ideas, conceptions,
images as something static, while an execution procedure demands a notion of time
(or at least notions of a sequence and an order) for the comprehension. ("Are you serious?", "Joke!" )
There is the other side to this story.
We made the project design in a relatively short time. A good scalable
design. We needed people who know xslt 2.0 to implement it. It turned out that this was a strong objection against xslt!
Our fellow, xslt guru, Oleg
Tkachenko has left our company to make his career at Microsoft, and to our disbelief it
was impossible to find a person who was interested in a project involving 85% of
xslt and 15% of other technologies including java. Even in the java world people prefer routine projects, like
a standard swing or web application, to a project demanding creativity.
Possibly, it was our mistake to allow our company to look for developers the standard
way: some secretary was looking through her sources, and inevitably was finding
graduates with so-so java, poor xml and almost zero xslt knowledge. We should have made
appeals on xslt forums, especially since the project could be easily developed
by a distributed group.
Finally, we have designed and implemented the project by ourselves but to the
present day our managers are calling and suggesting java developers
for our project. What a bad joke!
Just for fun I've created exslt2.xslt and exslt2-test.xslt to model concepts
discussed at EXSLT 2.0
forum. I did nothing special but used tuple as reference, and also I've
defined f:call() to make function call indirectly.
<?xml version="1.0" encoding="utf-8"?>
<!--
  exslt 2 sketches.
-->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:f="http://exslt.org/v2"
  xmlns:t="this"
  xmlns:p="private"
  exclude-result-prefixes="xs t f">

  <xsl:include href="exslt2.xslt"/>

  <xsl:template match="/" name="main">
    <root>
      <xsl:variable name="refs" as="item()*" select="
        for $i in 1 to 20 return
          f:ref(1 to $i)"/>

      <total-items>
        <xsl:sequence select="
          sum
          (
            for $ref in $refs return
              count(f:deref($ref))
          )"/>
      </total-items>

      <sums-per-ref>
        <xsl:for-each select="$refs">
          <xsl:variable name="index" as="xs:integer" select="position()"/>

          <sum
            index="{$index}"
            value="{sum(f:deref(.))}"/>
        </xsl:for-each>
      </sums-per-ref>

      <add>
        <xsl:text>1 + 2 = </xsl:text>
        <xsl:sequence select="f:call(xs:QName('t:add'), (1, 2))"/>
      </add>
    </root>
  </xsl:template>

  <xsl:function name="t:add" as="xs:integer">
    <xsl:param name="arguments" as="xs:integer+"/>

    <xsl:variable name="first" as="xs:integer" select="$arguments[1]"/>
    <xsl:variable name="second" as="xs:integer" select="$arguments[2]"/>

    <xsl:sequence select="$first + $second"/>
  </xsl:function>

</xsl:stylesheet>
Code can be found at saxon.extensions.9.1.zip.
We have created
Java Xml Object Model purely for purposes of our project. In fact jxom at
present has siblings: xml models for sql dialects. There are also different APIs
like name normalizations, refactorings, compile time evaluation.
It turns out that jxom is also good enough for other developers.
The drawback of jxom, however, is its rather complex xml schema. It takes time to understand it. To simplify things
we have created (and are planning to create more) a couple of examples that give a feel
for what jxom xml looks like.
The latest version can be loaded from
jxom.zip
We would be pleased to see more comments on the subject.
Recently, working on completely different thing, I've realized that one may create a
"generator", function returning different values per each call. I was somewhat
puzzled with this conclusion, as I thought xslt functions have no side effects,
and for the same arguments xslt function returns the same result.
I've confirmed the conclusion at the forum. See
Scope of uniqueness of generate-id().
In short:
- each node has a unique identity;
- function in the course of work creates a temporary node and produces a result
depending on identity of that node.
Example:
<xsl:stylesheet version="2.0"
  xmlns:f="data:,f"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <xsl:message select="
      for $i in 1 to 8 return
        f:fun()"/>
  </xsl:template>

  <xsl:function name="f:fun" as="xs:string">
    <xsl:variable name="x">!</xsl:variable>

    <xsl:sequence select="generate-id($x)"/>
  </xsl:function>

</xsl:stylesheet>
The next thought was that if you may create a generator then it's easy to create
a good random number generator (that's a trivial math task).
Hey gurus, take a chance!
Suppose you have constructed a sequence of attributes.
How do you access a value of attribute "a"?
Simple, isn't it? It has taken a couple of minutes to find a solution!
<xsl:variable name="attributes" as="attribute()*">
  <xsl:apply-templates mode="t:generate-attributes" select="."/>
</xsl:variable>
<xsl:variable name="value" as="xs:string?"
  select="$attributes[self::attribute(a)]"/>
Saying
Our project, containing many different xslt files, generates many different
outputs (e.g. code that uses DB2 SQL, or Oracle SQL, or DAO, or some
other flavor of code). This results in the use of
indirect calls to handle different generation options. However, to allow xslt
to work we had to create a big main xslt including stylesheets for each kind of
generation. This hurts compilation time.
Alternatives
- A big main xslt including everything.
- A big main xslt including everything and using "use-when" attribute.
- Compose main xslt on the fly.
We were strongly inclined to the second alternative. Unfortunately, only a limited set of information is available when "use-when" is evaluated. In
particular, neither parameters nor documents are available. Using
Saxon's extensions one may reach only static variables, or access
System.getProperty(). This isn't flexible.
We've decided to try the third alternative.
Solution
We think we have found a nice solution: to create XsltSource,
which receives a list of includes upon construction, and creates an xslt
when getReader() is called.
import java.io.Reader;
import java.io.StringReader;
import javax.xml.transform.stream.StreamSource;
/**
* A source to read generated stylesheet, which includes other stylesheets.
*/
public class XsltSource extends StreamSource
{
/**
* Creates an {@link XsltSource} instance.
*/
public XsltSource()
{
}
/**
* Creates an {@link XsltSource} instance.
* @param systemId a system identifier for root xslt.
*/
public XsltSource(String systemId)
{
super(systemId);
}
/**
* Creates an {@link XsltSource} instance.
* @param systemId a system identifier for root xslt.
* @param includes a list of includes.
*/
public XsltSource(String systemId, String[] includes)
{
super(systemId);
this.includes = includes;
}
/**
* Gets stylesheet version.
* @return a stylesheet version.
*/
public String getVersion()
{
return version;
}
/**
* Sets a stylesheet version.
* @param value a stylesheet version.
*/
public void setVersion(String value)
{
version = value;
}
/**
* Gets a list of includes.
* @return a list of includes.
*/
public String[] getIncludes()
{
return includes;
}
/**
* Sets a list of includes.
* @param value a list of includes.
*/
public void setIncludes(String[] value)
{
includes = value;
}
/**
* Generates an xslt on the fly.
*/
public Reader getReader()
{
String[] includes = getIncludes();
if (includes == null)
{
return super.getReader();
}
String version = getVersion();
if (version == null)
{
version = "2.0";
}
StringBuilder builder = new StringBuilder(1024);
builder.append("<stylesheet version=\"");
builder.append(version);
builder.append("\" xmlns=\"http://www.w3.org/1999/XSL/Transform\">");
for(String include: includes)
{
builder.append("<include href=\"");
builder.append(include);
builder.append("\"/>");
}
builder.append("</stylesheet>");
return new StringReader(builder.toString());
}
/**
* An xslt version. By default 2.0 is used.
*/
private String version;
/**
* A list of includes.
*/
private String[] includes;
}
To use it one just needs to write:
Source source = new XsltSource(base, stylesheets);
Templates templates = transformerFactory.newTemplates(source);
...
where:
base is a base uri for the generated stylesheet; it's used to
resolve relative includes;
stylesheets is an array of hrefs.
Such an implementation resembles dynamic linking, where separate parts are bound at
runtime. We would like to see dynamic modules in the next version of xslt.
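One detail is worth guarding against: getReader() above appends each href verbatim, so an href containing "&" or a double quote would yield an ill-formed stylesheet. Below is a minimal sketch of the same generation step with attribute escaping. The class IncludeStylesheetBuilder and its methods are hypothetical names introduced for illustration only; they are not part of XsltSource.

```java
// A sketch under stated assumptions: IncludeStylesheetBuilder is a hypothetical helper,
// it only mirrors the string generation performed in XsltSource.getReader().
final class IncludeStylesheetBuilder
{
  /** Escapes characters that are special inside a double-quoted xml attribute. */
  static String escapeAttribute(String value)
  {
    StringBuilder result = new StringBuilder(value.length() + 16);

    for(int i = 0; i < value.length(); ++i)
    {
      char c = value.charAt(i);

      switch(c)
      {
        case '&': result.append("&amp;"); break;
        case '<': result.append("&lt;"); break;
        case '"': result.append("&quot;"); break;
        default: result.append(c);
      }
    }

    return result.toString();
  }

  /** Generates a stylesheet that consists of xsl:include instructions only. */
  static String generate(String version, String... includes)
  {
    StringBuilder builder = new StringBuilder(1024);

    builder.append("<stylesheet version=\"");
    builder.append(escapeAttribute(version));
    builder.append("\" xmlns=\"http://www.w3.org/1999/XSL/Transform\">");

    for(String include: includes)
    {
      builder.append("<include href=\"");
      builder.append(escapeAttribute(include));
      builder.append("\"/>");
    }

    builder.append("</stylesheet>");

    return builder.toString();
  }

  public static void main(String[] args)
  {
    // The hrefs here are made up for the demonstration.
    System.out.println(generate("2.0", "java/main.xslt", "sql/db2.xslt?a=1&b=2"));
  }
}
```

The generated text can be wrapped in a StringReader exactly as getReader() does.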
Why have we turned our attention to the Saxon implementation?
A considerable part (~75%) of the project we're working on at present is creating
xslt(s). These are not stylesheets to create page presentations, but rather
the project's business logic. To fulfill the project we needed an xslt 2.0
processor. In the current state of affairs I doubt anyone can point to a good
alternative to the Saxon implementation.
The open source nature of the SaxonB project and intrinsic curiosity act like a
hook for species like ourselves.
We want to say that we're rather sceptical
observers of code: the code should prove it has merits. Saxon looks
consistent. It does not take much time to grasp the implementation concepts, taking
into account that the code routinely follows the xpath/xslt/xquery specifications.
This code observation and practice with live xslt tasks helped us to form an
opinion on Saxon itself. That's why we dare to critique it.
1. Compilation is fused with execution.
An xslt, before being executed, passes several stages, including an xpath data model and a graph of expressions - objects implementing
parts of runtime logic.
The expression graph is optimized to achieve better runtime performance. The
optimization logic is distributed throughout the code, and in particular lives
in expression objects. This means that an expression plays two roles: runtime
execution and optimization.
I would prefer to see smaller and cleaner runtime objects (expressions),
and optimization logic kept separately. On the other hand, I can guess why Michael Kay
fused these roles: to ease lazy optimizations (at runtime).
2. Optimizations are xslt 1.0 by origin
This is like a heritage. There are two main techniques: cached
sequences, and global indices of rooted nodes.
This might be enough in xslt 1.0, but in 2.0, where there is a diverse set of
types, where sequences extend node sets to other types, and where sequences may
logically be grouped by pairs, triples, and so on, this is not enough.
The XPath data model operates with sequences only (in the math sense). On the other hand, it
defines many set based functions (operators) like: $a intersect $b , $a except $b ,
$a = $b , $a != $b . In these examples XPath sequences are better considered as sets, or maps of items.
Another example: for $i in index-of($names, $name) return $values[$i] , where
$names as xs:string* , $values as element()* shows that
a closure of ($names , $values ) is in fact a map, and
$names might be implemented as a composition of a sequence and a map of
strings to indices.
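To illustrate what "a composition of a sequence and a map of strings to indices" could mean, here is a minimal sketch in java. It is not how Saxon works internally; the class IndexedNames and its methods are hypothetical names. The index is built once, after which index-of() becomes a map lookup instead of a scan of the whole sequence.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A hypothetical illustration: a sequence of names plus a map of names to positions.
final class IndexedNames
{
  private final List<String> names;
  private final Map<String, List<Integer>> indices =
    new HashMap<String, List<Integer>>();

  IndexedNames(List<String> names)
  {
    this.names = names;

    for(int i = 0; i < names.size(); ++i)
    {
      List<Integer> positions = indices.get(names.get(i));

      if (positions == null)
      {
        positions = new ArrayList<Integer>();
        indices.put(names.get(i), positions);
      }

      // Positions are 1-based, as in xpath.
      positions.add(i + 1);
    }
  }

  /** The original sequence. */
  List<String> sequence()
  {
    return names;
  }

  /** An analogue of index-of($names, $name), answered by a map lookup. */
  List<Integer> indexOf(String name)
  {
    List<Integer> positions = indices.get(name);

    return positions == null ? Collections.<Integer>emptyList() : positions;
  }

  /** An analogue of: for $i in index-of($names, $name) return $values[$i]. */
  static <T> List<T> select(IndexedNames names, List<T> values, String name)
  {
    List<T> result = new ArrayList<T>();

    for(int index: names.indexOf(name))
    {
      // As in xpath, an out of range position contributes nothing.
      if (index <= values.size())
      {
        result.add(values.get(index - 1));
      }
    }

    return result;
  }
}
```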
There are other use case examples, which lead me to think that Saxon lacks set
based operators. Global indices are a poor substitute, which works for rooted
trees only.
Again, I can guess why Michael Kay is not implementing these operators: not everyone
loads xslt with stressful tasks requiring these features. I think xslt is mostly
used to render pages, and one rarely deviates from rooted trees.
In spite of the objections we think that Saxon is a good xslt 2.0 implementation,
which unfortunately lacks competitors.
We are certain xslt/xquery are the best for web application frameworks from the
design perspective; or, in other words, pipeline frameworks allowing use of
xslt/xquery are the preferable way to create web applications.
Advantages are obvious:
- clear separation of business logic, data, and presentation;
- richness of languages, allowing one to implement simple presentation, complex
components, and sophisticated data binding;
- built-in extensibility, allowing communication with business logic written in
other languages and/or located at a different site.
It seems that agitating for such technologies is like forcing an open
door. There are such frameworks out there:
Orbeon Forms, Cocoon, and others.
We're not qualified to judge their virtues, however...
Look at the current state of affairs. The main players in this area (well, I
have a rather limited vision) push other technologies: JSP/JSF/Facelets and
the like in the Java world, and ASP.NET in the .NET world. The closest thing they
provide is an xslt servlet/component that generates an output.
Their variants of syntax and their data binding techniques allude to similar
paradigms in xslt/xquery:
<select>
  <c:forEach var="option" items="#{bean.options}">
    <option value="#{option.key}">#{option.value}</option>
  </c:forEach>
</select>
On the surface, however, we see much more limited (in design and in
application) frameworks.
And here is a contradiction: how can it be that at present such a good design is
not at least as popular as its competitors?
Someone can say there is no such problem. You can use whatever you want. You
have a choice! Well, he's lucky. From our perspective it's not that simple.
We're creating rather complex web applications. Their nature isn't important in
this context, but what is important is that there are customers. They are not
thoroughly enlightened in the question, and exactly because of this they prefer
technologies proposed by leaders. It seems everything convinces them: the main
stream, good support, many developers who know the technology.
There is not a single chance to promote anything else.
We believe that the future may change this state, but we're creating at present,
and cannot wait...
I've uploaded jxom.zip
Now, it contains a state machine generator. See "What you can do with jxom".
The code is in the java-state-machine-generator.xslt. The test is in the java-state-machine-test.xslt.
We're facing a task of converting a java method into a state machine.
This is like converting a SAX Parser, which pushes data, into an Xml Reader, which
pulls data.
The task is formalized as:
- for a given method containing split markers create a class permitting
iteration;
- each iteration performs part of the logic of the method.
We have defined rules converting all statements into a state machine except
for the synchronized statement. In fact the logic is rather linear; however, the least trivial conversion is for the try statement.
Consider an example:
public class Test
{
void method()
throws Exception
{
try
{
A();
B();
}
catch(Exception e)
{
C(e);
}
finally
{
D();
}
E();
}
private void A()
throws Exception
{
// logic A
}
private void B()
throws Exception
{
// logic B
}
private void C(Exception e)
throws Exception
{
// logic C
}
private void D()
throws Exception
{
// logic D
}
private void E()
throws Exception
{
// logic E
}
}
Suppose we want to see method() as a state machine in such a way that split markers are after calls to
methods A(), B(), C(), D(), E(). This is how it looks as a state machine:
Callable<Boolean> methodAsStateMachine()
  throws Exception
{
  return new Callable<Boolean>()
  {
    public Boolean call()
      throws Exception
    {
      // Note: a "continue" inside do { ... } while(false) leaves the loop,
      // so an endless loop with a trailing break is used to re-enter the switch.
      while(true)
      {
        try
        {
          switch(state)
          {
            case 0:
            {
              A();
              state = 1;
              return true;
            }
            case 1:
            {
              B();
              state = 3;
              return true;
            }
            case 2:
            {
              C(ex);
              state = 3;
              return true;
            }
            case 3:
            {
              D();
              if (currentException != null)
              {
                throw currentException;
              }
              state = 4;
              return true;
            }
            case 4:
            {
              E();
              state = -1;
              return false;
            }
          }
          if (currentException == null)
          {
            currentException = new IllegalStateException();
          }
        }
        catch(Throwable e)
        {
          currentException = null;
          switch(state)
          {
            case 0:
            case 1:
            {
              if (e instanceof Exception)
              {
                // Go to the catch block.
                ex = (Exception)e;
                state = 2;
              }
              else
              {
                // Go to the finally block, and rethrow afterwards.
                currentException = e;
                state = 3;
              }
              continue;
            }
            case 2:
            {
              // Go to the finally block, and rethrow afterwards.
              currentException = e;
              state = 3;
              continue;
            }
          }
          currentException = e;
          state = -1;
        }
        break;
      }
      return this.<Exception>error();
    }
    @SuppressWarnings("unchecked")
    private <T extends Throwable> boolean error()
      throws T
    {
      throw (T)currentException;
    }
    private int state = 0;
    private Throwable currentException = null;
    private Exception ex = null;
  };
}
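To make the iteration contract concrete, here is a minimal sketch of a caller that pulls such a state machine step by step. The machine is deliberately reduced to two steps without exception handling; StateMachineDemo, steps() and run() are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// A hypothetical, simplified illustration of the "pull" contract:
// call() returns true while there is more work, and false upon completion.
final class StateMachineDemo
{
  /** Returns a two-step state machine that records its steps in the log. */
  static Callable<Boolean> steps(final List<String> log)
  {
    return new Callable<Boolean>()
    {
      public Boolean call()
      {
        switch(state)
        {
          case 0:
          {
            log.add("A");
            state = 1;
            return true;
          }
          case 1:
          {
            log.add("B");
            state = -1;
            return false;
          }
          default:
          {
            throw new IllegalStateException();
          }
        }
      }

      private int state = 0;
    };
  }

  /** Pulls iterations until the machine completes; returns the number of calls. */
  static int run(Callable<Boolean> machine)
  {
    int iterations = 0;

    try
    {
      boolean more;

      do
      {
        more = machine.call();
        ++iterations;
      }
      while(more);
    }
    catch(Exception e)
    {
      throw new IllegalStateException(e);
    }

    return iterations;
  }

  public static void main(String[] args)
  {
    List<String> log = new ArrayList<String>();

    System.out.println(run(steps(log)) + " iterations: " + log);
  }
}
```

A caller is free to do anything between two call() invocations, which is exactly what the push-to-pull conversion is for.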
Believe it or not, this transformation can be done purely in xslt 2.0 with the help of
jxom (Java xml object model). We shall update
jxom.zip once
this module is implemented and tested.
In the xslt one can express logically the same things in different words like:
exists($x)
and
every $y in $x satisfies exists($y)
newbie> Really the same?
expert> Oops... You're right, these are different things!
What's the difference?
I was already writing about tuples and maps in xslt (see
Tuples and maps - Status: CLOSED, WONTFIX, and
Tuples and maps in Saxon).
Now, I want to argue on a use case, and on how an xslt processor can detect such a
use case and implement it as a map. This way, under certain conditions, sequences
could be treated as maps (or as sets).
Use case.
There are two stages:
- logic collecting nodes/values satisfying some criteria;
- logic processing data, and taking a special action whenever a node/value was collected at
the previous stage.
Whenever we're talking of nodes, the result of the first stage is
a sequence $set as node()* . The role of this sequence is a
set of nodes (order is not important).
The second stage is usually an xsl:for-each , an xsl:apply-templates ,
or something of this kind, which repeatedly verifies whether some $node as
node()? belongs to the $set , like the following: $node intersect
$set , or $node except $set .
Even though we're still using regular xpath 2.0, we have managed to express
a set based operation. It's a matter for the xslt processor's optimizer to detect
such a use case and consider a sequence as a set. In fact the detection rule is
rather simple.
For expressions $node except $set and $node intersect $set :
- $set can be considered as a set, as the order of elements is not important;
- chances are good that a
$set implemented as a set outperforms an implementation
using a list or an array.
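The difference can be sketched in java as follows. This is only an illustration of the idea, not of any processor's internals; NodeSet is a hypothetical name. Membership is decided by identity (as for xpath nodes), and each test is a hash lookup rather than a scan of a list. Unlike xpath, the sketch keeps the order of the tested sequence instead of document order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// A hypothetical illustration: a sequence treated as a set of nodes.
final class NodeSet<T>
{
  // Nodes are compared by identity, not by value.
  private final Set<T> items =
    Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());

  NodeSet(List<T> sequence)
  {
    items.addAll(sequence);
  }

  /** An analogue of exists($node intersect $set). */
  boolean contains(T node)
  {
    return items.contains(node);
  }

  /** An analogue of $nodes intersect $set. */
  List<T> intersect(List<T> nodes)
  {
    return filter(nodes, true);
  }

  /** An analogue of $nodes except $set. */
  List<T> except(List<T> nodes)
  {
    return filter(nodes, false);
  }

  private List<T> filter(List<T> nodes, boolean member)
  {
    List<T> result = new ArrayList<T>();

    for(T node: nodes)
    {
      if (items.contains(node) == member)
      {
        result.add(node);
      }
    }

    return result;
  }
}
```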
Thus what to do? Well, I do not think I'm the smartest child, quite the opposite...
however, it's worth hinting this idea to xslt implementers (see
Suggest optimization). I still do not know if it was fruitful...
P.S. A very similar use case exists for a function index-of($collection, $item).
I know we're not the first to create a parser in xslt.
However, I still want to share our implementation, as I think it's beautiful.
In our project, which is a conversion from some legacy language to java, we're
dealing with dynamic expressions. For example, in the legacy language one can
filter a collection using an expression defined by a string:
collection.filter("a > 0 and b = 7");
Whenever the expression string is calculated, there is nothing to do except to parse the
string at runtime and perform filtering dynamically. On the other hand, we have
found that in the majority of cases literal strings are used. Thus we have decided to
optimize this route like this:
collection.filter(
  new Filter<T>()
  {
    public boolean filter(T value)
    {
      return (value.getA() > 0) && (value.getB() == 7);
    }
  });
This means that we're converting that expression string into java code on the
generation stage.
In the xslt - our generator engine - this means that we have to convert a string
into expression tree like this:
(a > 7 or a= 3) and c * d = 2.2
to
<and>
  <or>
    <gt>
      <identifier>a</identifier>
      <integer>7</integer>
    </gt>
    <eq>
      <identifier>a</identifier>
      <integer>3</integer>
    </eq>
  </or>
  <eq>
    <mul>
      <identifier>c</identifier>
      <identifier>d</identifier>
    </mul>
    <decimal>2.2</decimal>
  </eq>
</and>
Our parser fits naturally into the world of parsers: it uses the xsl:analyze-string instruction to tokenize input and
parses tokens according to an expression grammar. During implementation I've
found some things new to me. I think they are worth mentioning:
- As the tokenizer is defined as a big regular expression, we have a rather verbose
regex attribute on xsl:analyze-string . It was hard
to edit such a big line until I found there is a flags="x" option that solves
formatting problems:
The flags attribute may be used to control the interpretation of the regular expression... If it contains the letter x, then whitespace within the regular expression is ignored.
This means that I can use spaces to format the regular expression and \s to specify a space as part of the expression.
- Saxon 9.1.0.1 has an inefficiency in the implementation of the
xsl:analyze-string
instruction whenever regex contains a literal value with a '{' character
(e.g. "\p{{L}}"): it considers the value to be an AVT and delays pattern
compilation until runtime, which it does every time the instruction is executed.
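The same formatting trick exists in java, where the Pattern.COMMENTS flag plays the role of flags="x". The sketch below is not our xslt tokenizer; it is a simplified java analogue with a reduced token set, and the class ExpressionTokenizer is a hypothetical name. It shows how a big tokenizer expression can be laid out over several lines, and how the pattern is compiled once rather than per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A hypothetical, simplified analogue of a regex based tokenizer.
final class ExpressionTokenizer
{
  // With Pattern.COMMENTS whitespace in the pattern is ignored, and "#" starts
  // a comment running to the end of line; "\s" is used to match real whitespace.
  // The pattern is compiled once, when the class is initialized.
  private static final Pattern TOKEN = Pattern.compile(
    "  (?<number>     \\d+ (?: \\. \\d+ )? )   # integer or decimal literal\n" +
    "| (?<identifier> \\p{L} [\\p{L}\\d_]* )   # identifier or keyword\n" +
    "| (?<operator>   <= | >= | != | [-+*/=<>()] )\n" +
    "| (?<space>      \\s+ )",
    Pattern.COMMENTS);

  /** Splits input into tokens, skipping whitespace. */
  static List<String> tokenize(String input)
  {
    List<String> tokens = new ArrayList<String>();
    Matcher matcher = TOKEN.matcher(input);
    int position = 0;

    while(position < input.length())
    {
      // Each token must start exactly where the previous one has ended.
      matcher.region(position, input.length());

      if (!matcher.lookingAt())
      {
        throw new IllegalArgumentException(
          "Unexpected character at position " + position);
      }

      if (matcher.group("space") == null)
      {
        tokens.add(matcher.group());
      }

      position = matcher.end();
    }

    return tokens;
  }

  public static void main(String[] args)
  {
    System.out.println(tokenize("(a > 7 or a= 3) and c * d = 2.2"));
  }
}
```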
Use following link to see the xslt:
expression-parser.xslt.
To see how to generate java from an xml follow this link:
Xslt for the jxom (Java xml object model), jxom.zip.
Yesterday, incidentally, I ran into the problem of a dynamic error during evaluation of a template's match.
This reminded me of
SFINAE in C++. There the principle is applied at compile time to find a
matching template.
I think people underestimate the meaning of this behaviour. The effect of
dynamic errors occurring during pattern evaluation is described in the
specification:
Any dynamic error or type error that occurs during the evaluation of a pattern against a particular node is treated as a recoverable error even if the error would not be recoverable under other circumstances. The optional recovery action is to treat the pattern as not matching that node.
This has far reaching consequences, like error recovery. To illustrate what I'm talking about, please look at this simple stylesheet that recovers from "Division by zero.":
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:template match="/">
<xsl:variable name="operator" as="element()+">
<div divident="10" divisor="0"/>
<div divident="10" divisor="2"/>
</xsl:variable>
<xsl:apply-templates select="$operator"/>
</xsl:template>
<xsl:param name="NaN" as="xs:double" select="1.0 div 0"/>
<xsl:template
match="div[(xs:integer(@divident) div xs:integer(@divisor)) ne $NaN]">
<xsl:message select="xs:integer(@divident) div xs:integer(@divisor)"/>
</xsl:template>
<xsl:template match="div">
<xsl:message select="'Division by zero.'"/>
</xsl:template>
</xsl:stylesheet>
Here, if there is a division by zero, a template is not matched and another
template is selected; thus the second template serves as an error handler for the
first one. Definitely, one may define much more complex constructions to be
handled this way.
I never was a purist (meaning doing everything in xslt); however, this example,
along with the
indirect function call, shows that xslt is a rather well equipped language. One just
needs to be smart enough to understand how to do things.
See also: Try/catch block in xslt 2.0 for Saxon 9.
Among other job activities, we're from time to time asked to check the technical skills of job applicants.
Several times we were interviewing people whose skills were far below an
acceptable professional level. It's a torment for both sides, I should say.
To ease things we have designed a small
questionnaire (specific to our projects) for job applicants. It's sent to an applicant before the
meeting. Even partially answered, this
questionnaire constitutes a good filter against unqualified candidates:
<questionnaire> <item>
<question> Please estimate your knowledge in XML Schema
(xsd) as lacking, bad, good, or perfect.
</question> <answer/> </item> <item>
<question> Please estimate your
knowledge in xslt 2.0/xquery 1.0 as lacking, bad, good, or perfect.
</question> <answer/> </item> <item>
<question> Please estimate your
knowledge in xslt 1.0 as lacking, bad, good, or perfect. </question> <answer/> </item> <item>
<question> Please estimate your
knowledge in java as lacking, bad, good, or perfect. </question> <answer/> </item> <item>
<question> Please estimate your
knowledge in c# as lacking, bad, good, or perfect. </question> <answer/> </item> <item>
<question> Please estimate your
knowledge in sql as lacking, bad, good, or perfect. </question> <answer/> </item> <item>
<question> For logical values A, B,
please rewrite logical expression "A and B" using operator "or".
</question> <answer/> </item> <item>
<question> For logical values A, B,
please rewrite logical expression "A = B" using operators "and" and "or".
</question> <answer/> </item> <item>
<question> There are eight balls, with
only one heavier than some other.
What is a minimum number of weighings reveals the
heavier ball?
Please be suspicious about the "trivial" solution.
</question> <answer/> </item> <item>
<question> If A results in B. What one
may say about the reason of B? </question> <answer/> </item> <item>
<question> If only A or B result in C.
What one may say about the reason of C? </question> <answer/> </item> <item>
<question> Please define an xml schema
for this questionnaire. </question> <answer/> </item> <item>
<question> Please create a simple
stylesheet creating an html table based on this questionnaire.
</question> <answer/> </item> <item>
<question> For a table A with columns
B, C, and D, please create an sql query selecting B grouped by C and ordered by
D. </question> <answer/> </item> <item>
<question> For a sequence of xml
elements A with attribute B, please write a stylesheet excerpt creating a
sequence of elements D, grouping elements A with the same string value of
attribute B, sorted in the order of ascending of B. </question> <answer/> </item> <item>
<question> Having a java class A with
properties B and C, please sort a collection of A for B in ascending, and C in
descending order.
</question> <answer/> </item> <item>
<question> What does a following line
mean in c#?
int? x; </question> <answer/> </item> <item>
<question> What is a parser? </question> <answer/> </item> <item>
<question> How to issue an error in the
xml stylesheet? </question> <answer/> </item> <item>
<question> What is a lazy evaluation? </question> <answer/> </item> <item>
<question> How do you understand a
following sentence?
For each line of code there should be a comment.
</question> <answer/> </item> <item>
<question> Have you used any
supplemental information to answer these questions? </question> <answer/> </item> <item>
<question> Have you independently
answered these questions? </question> <answer/> </item> </questionnaire>
I've found that the proposition to introduce tuples and maps to the xslt/xquery type system has not found support:
At the joint meeting of the XSL and XQuery Working groups 2008-06-23
it was decided that a change of this nature would be too large for the
next "point" release of the Recommendations. The request for new
functionality will be considered for a future "main" release.
Boor> *****!
Pessimist> Ah, there won't be tuples and maps in xslt/xquery...
Optimist> Wow, chances are good to see this addition by the year 2018!
We are designing a rather complex xslt 2.0 application, dealing with semistructured
data. We must tolerate errors during processing, as there are cases where an
input is not perfectly valid (or the program is not designed or ready to get
such an input).
The most typical error is an unsatisfied expectation of tree structure like:
<xsl:variable name="element" as="element()" select="some-element"/>
Obviously, a dynamic error occurs if the specified element is not present. To
concentrate on primary logic, and to avoid the burden of recovery from illegal (unexpected) cases,
we have created a try/catch API. The goals of such an API are:
- to be able to continue processing in case of an error;
- to report as much useful information related to an error as possible.
Alternatives:
Do not think it is arrogance that has led us to create a custom API. No, we
were looking for alternatives! Please see the
[xsl] saxon:try() discussion:
- saxon:try()
function - is a kind of pseudo function, which explicitly relies on lazy
evaluation of its arguments, and ... it's not available in SaxonB;
- ex:error-safe
extension instruction - is far from perfect in its implementation quality, and provides no error location.
We had no other way except to design this feature ourselves. In our defence one
can say that we are using an innovative approach that encapsulates details of the
implementation behind a template and calls handlers indirectly.
Use:
The try/catch API is designed as a template
<xsl:template name="t:try-block"/> calling a "try" handler, and, if
required, a "catch" handler using the
<xsl:apply-templates mode="t:call"/> instruction. The caller passes any
information to these handlers by means of tunnel parameters.
Handlers must be in the "t:call" mode. The "catch" handler
may receive the following error info parameters:
<xsl:param name="error" as="xs:QName"/>
<xsl:param name="error-description" as="xs:string"/>
<xsl:param name="error-location" as="item()*"/>
where $error-location is a sequence of pairs (location as
xs:string, context as item())* .
A sample:
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/public/"
  exclude-result-prefixes="xs t">
  <xsl:include href="try-block.xslt"/>
  <xsl:template match="/">
    <result>
      <xsl:for-each select="1 to 10">
        <xsl:call-template name="t:try-block">
          <xsl:with-param name="value" tunnel="yes" select=". - 5"/>
          <xsl:with-param name="try" as="element()">
            <try/>
          </xsl:with-param>
          <xsl:with-param name="catch" as="element()">
            <t:error-handler/>
          </xsl:with-param>
        </xsl:call-template>
      </xsl:for-each>
    </result>
  </xsl:template>
  <xsl:template mode="t:call" match="try">
    <xsl:param name="value" tunnel="yes" as="xs:decimal"/>
    <value>
      <xsl:sequence select="1 div $value"/>
    </value>
  </xsl:template>
</xsl:stylesheet>
The sample prints values according to the formula "1/(i - 5)", where "i" is a
variable varying from 1 to 10. Clearly, division by zero occurs when "i" is equal
to 5.
Please notice that the try/catch API is accessed through
<xsl:include href="try-block.xslt"/> . The main logic is
executed in
<xsl:template mode="t:call" match="try"/> , which
receives parameters using tunneling. A default error handler
<t:error-handler/> is used to report errors.
Error report:
Error: FOAR0001
Description:
Decimal divide by zero
Location:
1. systemID: "file:///D:/style/try-block-test.xslt", line: 34
2. template mode="t:call"
match="element(try, xs:anyType)"
systemID: "file:///D:/style/try-block-test.xslt", line: 30
context node:
/*[1][local-name() = 'try']
3. template mode="t:call"
match="element({http://www.nesterovsky-bros.com/xslt/private/try-block}try, xs:anyType)"
systemID: "file:///D:/style/try-block.xslt", line: 53
context node:
/*[1][local-name() = 'try']
4. systemID: "file:///D:/style/try-block.xslt", line: 40
5. call-template name="t:try-block"
systemID: "file:///D:/style/try-block-test.xslt", line: 17
6. for-each
systemID: "file:///D:/style/try-block-test.xslt", line: 16
context item: 5
7. template mode="saxon:_defaultMode"
match="document-node()"
systemID: "file:///D:/style/try-block-test.xslt", line: 14
context node:
/
Implementation details:
You were not expecting this API to be pure xslt, were you?
Well, you're right, there is an extension function. Its pseudo code is like
this:
function tryBlock(tryItems, catchItems)
{
try
{
execute xsl:apply-templates for tryItems.
}
catch
{
execute xsl:apply-templates for catchItems.
}
}
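The shape of this pseudo code can be sketched in plain java. The sketch below does not touch Saxon's API at all: where the real extension function executes xsl:apply-templates for the "try" and "catch" items, the sketch takes two callbacks. TryBlock and tryBlock() are hypothetical names used for illustration only.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.function.Function;
import java.util.function.Supplier;

// A hypothetical illustration of the try/catch dispatch, with callbacks
// standing in for the "try" and "catch" handlers.
final class TryBlock
{
  /** Runs the try handler; on failure passes the error to the catch handler. */
  static <T> T tryBlock(
    Supplier<T> tryHandler,
    Function<RuntimeException, T> catchHandler)
  {
    try
    {
      return tryHandler.get();
    }
    catch(RuntimeException e)
    {
      return catchHandler.apply(e);
    }
  }

  public static void main(String[] args)
  {
    // The same formula as in the sample: 1/(i - 5), where i varies from 1 to 10.
    for(int i = 1; i <= 10; ++i)
    {
      final BigDecimal value = BigDecimal.valueOf(i - 5);

      String result = tryBlock(
        () -> BigDecimal.ONE.divide(value, MathContext.DECIMAL64).toString(),
        e -> "error: " + e.getMessage());

      System.out.println(result);
    }
  }
}
```

Processing continues after the failing iteration, which is the first goal stated above; collecting the error location is the part that needs the processor's help.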
The last thing. Please get the implementation
saxon.extensions.zip. There you will find sources of the try/catch, and
tuples/maps API.
Right now we're inhabiting the java world, thus all our tasks are (in)directly
related to this environment.
We want to store stylesheets as resources of a java application, and at
the same time to point to these stylesheets without jar qualification. In .NET this idea would not
arise at all, as there are well defined boundaries between assemblies, but java uses
a rather different approach. Whenever you have a resource name, it's up to the
ClassLoader to find this resource. To exploit this feature we've created
a uri resolver for the stylesheet
transformation. The protocol we use has the following format: "resource:/resource-path".
For example, to store stylesheets in the
META-INF/stylesheets folder we use the uri "resource:/META-INF/stylesheets/java/main.xslt".
A relative path is resolved naturally. A path "../jxom/java-serializer.xslt"
in the previously mentioned stylesheet is resolved to "resource:/META-INF/stylesheets/jxom/java-serializer.xslt".
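The "natural" resolution is just what java.net.URI does for any hierarchical uri. A minimal sketch follows; ResourceUriDemo and its methods are hypothetical names, used only to show the two steps: resolving a relative href against the base, and turning the result into a name suitable for ClassLoader.getResourceAsStream().

```java
import java.net.URI;

// A hypothetical illustration of how "resource:" uris are resolved.
final class ResourceUriDemo
{
  /** Resolves an href against a base uri, and normalizes ".." segments. */
  static String resolve(String base, String href)
  {
    return URI.create(base).resolve(href).toString();
  }

  /** Converts a "resource:/resource-path" uri into a class loader resource name. */
  static String resourceName(String uri)
  {
    String path = URI.create(uri).getPath();

    // Class loader resource names have no leading slash.
    return path.startsWith("/") ? path.substring(1) : path;
  }

  public static void main(String[] args)
  {
    String resolved = resolve(
      "resource:/META-INF/stylesheets/java/main.xslt",
      "../jxom/java-serializer.xslt");

    System.out.println(resolved);
    System.out.println(resourceName(resolved));
  }
}
```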
We've created a small class ResourceURIResolver. You need to
supply an instance of TransformerFactory with this resolver:
transformerFactory.setURIResolver(new ResourceURIResolver());
The class itself is so small that we quote it here:
import java.io.InputStream;
import java.net.URI;
import java.net.URISyntaxException;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;
/**
* This class implements an interface that can be called by the processor
* to turn a URI used in document(), xsl:import, or xsl:include into a
* Source object.
*/
public class ResourceURIResolver implements URIResolver
{
/**
* Called by the processor when it encounters
* an xsl:include, xsl:import, or document() function.
*
* This resolver supports protocol "resource:".
* Format of uri is: "resource:/resource-path", where "resource-path" is an
* argument of a {@link ClassLoader#getResourceAsStream(String)} call.
* @param href - an href attribute, which may be relative or absolute.
* @param base - a base URI against which the first argument will be made
* absolute if the absolute URI is required.
* @return a Source object, or null if the href cannot be resolved, and
* the processor should try to resolve the URI itself.
*/
public Source resolve(String href, String base)
throws TransformerException
{
if (href == null)
{
return null;
}
URI uri;
try
{
if (base == null)
{
uri = new URI(href);
}
else
{
uri = new URI(base).resolve(href);
}
}
catch(URISyntaxException e)
{
// Unsupported uri.
return null;
}
if (!"resource".equals(uri.getScheme()))
{
return null;
}
String resourceName = uri.getPath();
if ((resourceName == null) || (resourceName.length() == 0))
{
return null;
}
if (resourceName.charAt(0) == '/')
{
resourceName = resourceName.substring(1);
}
ClassLoader classLoader =
Thread.currentThread().getContextClassLoader();
InputStream stream =
classLoader.getResourceAsStream(resourceName);
if (stream == null)
{
return null;
}
return new StreamSource(stream, uri.toString());
}
}
We've uploaded an update for jxom.
It has turned out that the jxom schema is so powerful that you can do a great number of manipulations over the xml representation of a java program.
In our case this is an optimization of unreachable code, as defined in
Sun's spec. We're facing this problem as a result of translation from another, ancient language, which also has a well defined xml schema.
We have also introduced an ability to annotate jxom elements (see the meta element), which in practice we use to annotate expressions with their types and perform "compile time" expression evaluation.
You may download the jxom version at the usual place.
See also: Java Xml Object Model.
Recently I've proposed to add two new atomic types,
tuple and map, to the xpath/xslt/xquery type system (see "Tuples and maps").
Later I've implemented a
tuple and map pure xslt approximation. Now I want to present a
java
implementation for Saxon.
I've created TupleValue and MapValue atomic types, and a Collections class
exposing the extension functions api. It's easy to use this api. I'll repeat an
example that I was showing earlier:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:f="http://www.nesterovsky-bros.com/xslt/functions/public"
xmlns:p="http://www.nesterovsky-bros.com/xslt/functions/private"
xmlns:c="java:com.nesterovskyBros.saxon.Functions"
exclude-result-prefixes="xs f p c">
<xsl:template match="/">
<root>
<xsl:variable name="tuples" as="item()*" select="
for $i in 1 to 20
return c:tuple(1 to $i)"/>
<total-items>
<xsl:sequence select="
sum
(
for $tuple in $tuples return
count(c:tuple-items($tuple))
)"/>
</total-items>
<tuples-size>
<xsl:sequence select="count($tuples)"/>
</tuples-size>
<sums-per-tuples>
<xsl:for-each select="$tuples">
<xsl:variable name="index"
as="xs:integer" select="position()"/>
<sum index="{$index}"
value="{sum(c:tuple-items(.))}"/>
</xsl:for-each>
</sums-per-tuples>
<xsl:variable name="cities" as="element()*">
<city name="Jerusalem" country="Israel"/>
<city name="London" country="Great Britain"/>
<city name="Paris" country="France"/>
<city name="New York" country="USA"/>
<city name="Moscow" country="Russia"/>
<city name="Tel Aviv" country="Israel"/>
<city name="St. Petersburg" country="Russia"/>
</xsl:variable>
<xsl:variable name="map" as="item()" select="
c:map
(
for $city in $cities return
(
$city/string(@country),
$city
)
)"/>
<xsl:for-each select="c:map-keys($map)">
<xsl:variable name="key" as="xs:string" select="."/>
<country name="{$key}">
<xsl:sequence select="c:map-value($map,
$key)"/>
</country>
</xsl:for-each>
</root>
</xsl:template>
</xsl:stylesheet>
Download java source.
P.S. I wish this api were integrated into Saxon, as at present java
extension functions are called through reflection.
Today I found yet another new language (a working draft, in fact): an
XML Pipeline Language.
XProc: An XML Pipeline Language, a language for
describing operations to be performed on XML documents.
An XML Pipeline specifies a sequence of operations to be performed on zero or
more XML documents. Pipelines generally accept zero or more XML documents as
input and produce zero or more XML documents as output. Pipelines are made up of
simple steps which perform atomic operations on XML documents and constructs
similar to conditionals, iteration, and exception handlers which control which
steps are executed.
Experience shows that inventing languages has been an essential part of the
computer industry from the very beginning, however...
I must confess I'm probably too reluctant to accept any new language: I was happy with
C++, but then all these new languages like Delphi, Java, C#, and so many others
started to appear. It's correct to say that there is no efficient universal
language; however, I think it's wrong to say that a domain specific language is
required to solve a particular problem in the most efficient way.
And now a question to the point: why do you need a new language for describing
operations to be performed on XML documents?
The project I'm currently working on requires me to manipulate a large number
of documents. This includes accessing these documents with the key()
function.
I never thought this task would pose any problem, until I discovered that Saxon
caches documents loaded using the document() function to preserve their identities:
By default, this function is ·stable·.
Two calls on this function return the same document node if the same
URI Reference (after resolution to an absolute URI Reference)
is supplied to both calls. Thus, the following expression
(if it does not raise an error) will always be true:
doc("foo.xml") is doc("foo.xml")
However, for performance reasons, implementations may provide a user option to
evaluate the function without a guarantee of stability. The manner in which any
such option is provided is implementation-defined. If the user has not selected
such an option, a call of the function must either return a stable result or
must raise an error: [err:FODC0003].
Saxon provides a saxon:discard-document() function to release documents from
cache. The use case is like this:
<xsl:variable name="document" as="document-node()"
select="saxon:discard-document(document(...))"/>
As you can see, saxon:discard-document() is bound to the place where the document is
loaded. In my case this is inefficient, as my code repeatedly accesses documents
from different places. To release loaded documents I need to collect them after
the main processing.
Another issue in Saxon is that the processor may keep document references through
xsl:key, so saxon:discard-document() gives no guarantee that documents will be
garbage collected.
To deal with this, I've designed a (Saxon specific) api to manage document pools:
t:begin-document-pool-scope() as item()
Begins document pool scope.
Returns scope id.
t:end-document-pool-scope(scope as item())
Terminates document pool scope.
$scope - scope id.
t:put-document-in-pool(document as document-node()) as document-node()
Puts a document into a current scope of document pool.
$document - a document to put into the document pool.
Returns the same document node.
The use case is:
<xsl:variable name="scope" select="t:begin-document-pool-scope()"/>
<xsl:sequence select="t:assert($scope)"/>
...
<xsl:variable name="document" as="document-node()"
select="t:put-document-in-pool(...)"/>
...
<xsl:sequence
select="t:end-document-pool-scope($scope)"/>
Download
document-pool.xslt to use this api.
I have already written about the logical
difference between templates and functions. This time I've realized another,
technical one. It's related to lazy evaluation, which the language
specification permits.
I was reasoning as follows:
- suppose you define a function returning a sequence;
- this function, at its final step, constructs a document using xsl:result-document;
- the caller invokes this function and uses only the first item of the sequence;
- lazy evaluation allows the xslt processor to calculate the first item only, and thus to avoid creating the output document altogether.
This conclusion looked ridiculous to me, as it means that I cannot reliably
expect creation of documents built with the xsl:result-document instruction.
To resolve the issue I checked the specification. Someone has already thought of
this. This is what the specification says:
[Definition: Each instruction in the stylesheet is evaluated in one of two possible output states: final output state or temporary output state].
[Definition: The first of the two output states is called final output state. This state applies when instructions are writing to a final result tree.]
[Definition: The second of the two output states is called temporary output state. This state applies when instructions are writing to a temporary tree or any other non-final destination.]
The instructions in the initial template are evaluated in final output state. An instruction is evaluated in the same output state as its calling instruction, except that xsl:variable, xsl:param, xsl:with-param, xsl:attribute, xsl:comment, xsl:processing-instruction, xsl:namespace, xsl:value-of, xsl:function, xsl:key, xsl:sort, and xsl:message always evaluate the instructions in their contained sequence constructor in temporary output state.
[ERR XTDE1480] It is a non-recoverable dynamic error to evaluate the xsl:result-document instruction in temporary output state.
As you can see, xsl:function is always evaluated in temporary output state and
cannot contain xsl:result-document, in contrast to xsl:template, which may be
evaluated in final output state. This difference dictates the role of templates as
"top level functions" and of functions as standalone algorithms.
You can find more on the subject in "Lazy evaluation and predicted results".
In the era of parallel processing it's so natural to enlist your favorite programming language in the league of "Multithreading supporters". I've seen such appeals before, e.g. "Wide Finder in XSLT --> deriving new requirements for efficiency in XSLT processors."
... I am not aware of any XSLT implementation that provides explicit or implicit support for parallel processing (with the obvious goal to take advantage of the multi-core processors that have almost reached a "prevalent" status today) ...
I think both xslt and xquery are well suited for parallel processing in terms of their type system. This is because of the "immutable" nature (until recent additions) of the execution state, which prevents many race conditions. The only missing ingredients are an indirect function call and a couple of core functions to queue parallel tasks.
Suppose there is a type to encapsulate a function call (say function-id), and a function accepting a sequence and a function-id. This function calls function-id for each item of the sequence in parallel, and then combines the final result as if it were computed serially.
Pretty simple, isn't it?
<!--
This function runs the $id function for each item in a sequence.
$items - items to process.
$id - function id.
Returns a sequence of results of calls to the $id function.
-->
<xsl:function name="x:queue-tasks" as="item()*">
<xsl:param name="items" as="item()*"/>
<xsl:param name="id" as="x:function-id"/>
<!-- The pseudo code. -->
<xsl:sequence select="$items/call $id (.)"/>
</xsl:function>
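The pseudo code above cannot run in xslt 2.0, but the idea is easy to sketch in another language. The following Python fragment is only an illustration, not part of any xslt processor: queue_tasks, square and the max_workers default are hypothetical names and choices made for this example. It applies a function to every item in parallel and returns the results in input order, as if the calls had been made serially.

```python
from concurrent.futures import ThreadPoolExecutor


def queue_tasks(items, func, max_workers=4):
    """Apply func to every item in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs,
        # regardless of which task finishes first.
        return list(pool.map(func, items))


def square(x):
    """A stand-in for the function identified by $id."""
    return x * x


print(queue_tasks(range(1, 6), square))  # [1, 4, 9, 16, 25]
```

Because the function passed in has no side effects, the caller cannot tell whether the work was done in parallel or serially, which is exactly the property the immutable execution state of xslt would give.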
Yesterday's idea inspired me enough to create a prototype implementation of map and tuple in xslt 2.0.
I definitely wish these were built-in types, and were considered atomic values for purposes of comparison and iteration. That would make it possible to create highly efficient grouping by several fields at once.
This pure xslt implementation (xslt-tuple.zip) is rather sketchy, however it lets you get a feel for what can be done with tuples and maps. I guess a good example says more than many words, so enjoy:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:f="http://www.nesterovsky-bros.com/xslt/functions"
exclude-result-prefixes="xs f">
<xsl:include href="tuple.xslt"/>
<xsl:include href="map.xslt"/>
<xsl:template match="/">
<root>
<xsl:variable name="tuples" as="item()*" select="
f:tuple
(
for $i in 1 to 10
return
f:tuple(1 to $i)
)"/>
<total-items>
<xsl:sequence select="count($tuples)"/>
</total-items>
<tuples-size>
<xsl:sequence select="f:tuple-size($tuples)"/>
</tuples-size>
<sums-per-tuples>
<xsl:for-each select="1 to f:tuple-size($tuples)">
<xsl:variable name="index" as="xs:integer" select="position()"/>
<sum
index="{$index}"
value="{sum(f:tuple-items(f:tuple-item($tuples,
$index)))}"/>
</xsl:for-each>
</sums-per-tuples>
<xsl:variable name="cities" as="element()*">
<city name="Jerusalem" country="Israel"/>
<city name="London" country="Great Britain"/>
<city name="Paris" country="France"/>
<city name="New York" country="USA"/>
<city name="Moscow" country="Russia"/>
<city name="Tel Aviv" country="Israel"/>
<city name="St. Petersburg" country="Russia"/>
</xsl:variable>
<xsl:variable name="map" as="item()*" select="
f:map
(
for $city in $cities
return
($city/string(@country), $city)
)"/>
<xsl:for-each select="f:map-keys($map)">
<xsl:variable name="key" as="xs:string" select="."/>
<country name="{$key}">
<xsl:sequence select="f:map-value($map, $key)"/>
</country>
</xsl:for-each>
</root>
</xsl:template>
</xsl:stylesheet>
The type system of xslt 2.0 is not complete (see
Sequence of sequences in xslt 2.0).
You cannot manipulate items as freely as you might wish. The reason is
the lack of set based constructs: xslt 2.0 supports sequences, but not
associative maps of items.
If you think that xml can be used as a good approximation of a map, I won't agree
with you. Xml is applicable in very specific cases only. The maps I'm
thinking of would allow associating items by reference, like sequences do.
This opens the possibility to create state objects, to manage sequences of sequences,
to create cyclic graphs of items, and so on. These maps are richer than what
the key() function provides right now, and would allow implementing for-each-group in
xquery.
Such maps can be modeled with several functions, however I wish they were
built in:
f:map($items as item()*) as item()
Returns a map from a sequence $items of pairs (key, value).
f:map-items($map as item()) as item()*
Returns a sequence of pairs (key, value) for a map $map.
f:map-keys($map as item()) as item()*
Returns a sequence of keys contained in a map $map.
f:map-values($map as item()) as item()*
Returns a sequence of values contained in a map $map.
f:map-value($map as item(), $key as item()) as item()*
Returns a sequence of values corresponding to a specified key $key contained in a
specified map $map.
The other thing I would add is a tuple of items. It's like a sequence, however a sequence of tuples is never flattened into a single sequence, but stays a sequence of tuples.
Fortunately it's possible to implement such extension functions.
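To make the intended semantics concrete, here is a minimal sketch in Python, used only for illustration. The names make_map, map_keys and map_value are hypothetical counterparts of f:map, f:map-keys and f:map-value; the sketch assumes that a map is built from a flat sequence of alternating keys and values, and that values are kept by reference.

```python
def make_map(items):
    """Build a map from a flat sequence: key1, value1, key2, value2, ..."""
    if len(items) % 2 != 0:
        raise ValueError("items must contain (key, value) pairs")
    result = {}
    for key, value in zip(items[0::2], items[1::2]):
        # Values are stored by reference; nothing is copied.
        result.setdefault(key, []).append(value)
    return result


def map_keys(m):
    """Return the keys in order of first appearance."""
    return list(m)


def map_value(m, key):
    """Return all values associated with key (empty if there are none)."""
    return list(m.get(key, []))


cities = [
    {"name": "Jerusalem", "country": "Israel"},
    {"name": "London", "country": "Great Britain"},
    {"name": "Tel Aviv", "country": "Israel"},
]

# Flat sequence of pairs, like ($city/string(@country), $city) in xpath.
flat = []
for city in cities:
    flat += [city["country"], city]

city_map = make_map(flat)

# A sequence of tuples stays a sequence of tuples; nothing is flattened.
tuples = [tuple(range(1, i + 1)) for i in range(1, 4)]
```

Here map_value(city_map, "Israel") returns the very same city objects that were put in, which is the "by reference" behaviour that copying items into xml nodes cannot provide.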
xslt 2.0 is a beautiful language, and at the same time it allows constructs that may puzzle anyone.
Look at this valid stylesheet:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:template match="/">
<xsl:variable name="x" as="node()" select="."/>
<xsl:variable name="x" as="xs:int" select="***"/>
<xsl:sequence select="$x"/>
</xsl:template>
</xsl:stylesheet>
Fun, isn't it?
Earlier I was thinking about the difference between named
templates and functions in xslt 2.0, and did not find a satisfactory criterion for deciding what to use in each case. I was not the first one
troubled by this, see
stylesheet functions or named templates.
To keep things easy I deliberately decided to use functions
whenever possible, to avoid named templates completely, and to use matching templates
to apply logic depending on context (something like a virtual function). I had forgotten about the issue until yesterday. To realize the
difference one should stop theorizing about it and instead start solving
practical xslt tasks; if there is any difference, except a syntactic one, it will
manifest itself somehow.
To make things obvious to those whose programming roots are in a language like C++, I shall compare
xsl:function with a free standing (or static) C++ function, and a named xsl:template with a C++ member function. In C++ you can use free standing and member
functions interchangeably, however if there is one argument (among others)
whose state transition the function represents, then it's preferable to define
it as a member function. The most important difference between these two types of
functions is that a member function has a hidden argument "this", and is able to
access its private state.
Please don't think I'm going to compare the template context item in xslt 2.0 with "this" in C++;
quite the opposite, I consider the context item a part of the state. I'm talking,
however, about private state that can be passed through a template call chain with tunnel parameters. Think of
a call tunneling some state (like options, flags, values), and that state being accessed several levels deep in the call hierarchy, whenever one needs it. You cannot do this with xsl:function: you cannot pass all private state through a function call, because you simply do not know about it.
So my answer to the tacit question is:
- use xsl:function to perform an independent unit of logic;
- use a named xsl:template when functionality is achieved cooperatively, and when you may need to share state between different implementation blocks.
After thinking this through, I've noticed that such a distinction does not exist in XQuery 1.0.
There is no tunneling there.
In the xslt world there is no widely used custom of thinking of stylesheet members
as public and private, in contrast to other programming languages like
C++/java/c#, where access modifiers are essential. The reason is the complexity of
stylesheets: the smaller the code, the easier it is for a developer to keep all details
in memory. Whenever an xslt program grows, you should modularize
it to keep it manageable.
At the point where modules are introduced, one starts thinking of the public
interface of a module and of its implementation details. This separation is
especially important for template matching, as you probably won't want to
match a private template by accident just because you've forgotten about some template in
the implementation of some module.
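To make the public or private member distinction you can introduce two namespaces in
your stylesheet, like:
xmlns:t="http://www.nesterovsky-bros.com/public"
xmlns:p="http://www.nesterovsky-bros.com/private/expression.xslt"
For the private namespace you can use a unique name, e.g. the stylesheet name as
part of the uri.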
The following example is based on
jxom. This stylesheet builds an expression from an expression tree. The public part
consists only of the t:get-expression function; other members are private:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/public"
xmlns:p="http://www.nesterovsky-bros.com/private/expression.xslt"
xmlns="http://www.nesterovsky-bros.com/download/jxom.zip"
xpath-default-namespace="http://www.nesterovsky-bros.com/download/jxom.zip"
exclude-result-prefixes="xs t p">
<xsl:output method="text" indent="yes"/>
<!--
Entry point. -->
<xsl:template match="/">
<xsl:variable name="expression"
as="element()">
<lt>
<sub>
<mul>
<var name="b"/>
<var name="b"/>
</mul>
<mul>
<mul>
<int>4</int>
<var name="a"/>
</mul>
<var name="c"/>
</mul>
</sub>
<double>0</double>
</lt>
</xsl:variable>
<xsl:value-of
select="t:get-expression($expression)" separator=""/>
</xsl:template>
<!--
Gets
expression.
$element - expression element.
Returns expression tokens.
-->
<xsl:function name="t:get-expression" as="item()*">
<xsl:param name="element"
as="element()"/>
<xsl:apply-templates mode="p:expression" select="$element"/>
</xsl:function>
<!--
Gets binary expression.
$element - assignment expression.
$type - expression type.
Returns expression token sequence.
-->
<xsl:function
name="p:get-binary-expression" as="item()*">
<xsl:param name="element"
as="element()"/>
<xsl:param name="type" as="xs:string"/>
<xsl:sequence
select="t:get-expression($element/*[1])"/>
<xsl:sequence select="' '"/>
<xsl:sequence select="$type"/>
<xsl:sequence select="' '"/>
<xsl:sequence
select="t:get-expression($element/*[2])"/>
</xsl:function>
<!-- Mode
"expression". Empty match. -->
<xsl:template mode="p:expression"
match="@*|node()">
<xsl:sequence select="error(xs:QName('invalid-expression'),
name())"/>
</xsl:template>
<!-- Mode "expression". or. -->
<xsl:template
mode="p:expression" match="or">
<xsl:sequence select="p:get-binary-expression(.,
'||')"/>
</xsl:template>
<!-- Mode "expression". and. -->
<xsl:template
mode="p:expression" match="and">
<xsl:sequence
select="p:get-binary-expression(., '&amp;&amp;')"/>
</xsl:template>
<!-- Mode
"expression". eq. -->
<xsl:template mode="p:expression" match="eq">
<xsl:sequence select="p:get-binary-expression(., '==')"/>
</xsl:template>
<!--
Mode "expression". ne. -->
<xsl:template mode="p:expression" match="ne">
<xsl:sequence select="p:get-binary-expression(., '!=')"/>
</xsl:template>
<!--
Mode "expression". le. -->
<xsl:template mode="p:expression" match="le">
<xsl:sequence select="p:get-binary-expression(., '&lt;=')"/>
</xsl:template>
<!--
Mode "expression". ge. -->
<xsl:template mode="p:expression" match="ge">
<xsl:sequence select="p:get-binary-expression(., '>=')"/>
</xsl:template>
<!--
Mode "expression". lt. -->
<xsl:template mode="p:expression" match="lt">
<xsl:sequence select="p:get-binary-expression(., '&lt;')"/>
</xsl:template>
<!--
Mode "expression". gt. -->
<xsl:template mode="p:expression" match="gt">
<xsl:sequence select="p:get-binary-expression(., '>')"/>
</xsl:template>
<!--
Mode "expression". add. -->
<xsl:template mode="p:expression" match="add">
<xsl:sequence select="p:get-binary-expression(., '+')"/>
</xsl:template>
<!--
Mode "expression". sub. -->
<xsl:template mode="p:expression" match="sub">
<xsl:sequence select="p:get-binary-expression(., '-')"/>
</xsl:template>
<!--
Mode "expression". mul. -->
<xsl:template mode="p:expression" match="mul">
<xsl:sequence select="p:get-binary-expression(., '*')"/>
</xsl:template>
<!--
Mode "expression". div. -->
<xsl:template mode="p:expression" match="div">
<xsl:sequence select="p:get-binary-expression(., '/')"/>
</xsl:template>
<!--
Mode "expression". neg. -->
<xsl:template mode="p:expression" match="neg">
<xsl:sequence select="'-'"/>
<xsl:sequence select="t:get-expression(*[1])"/>
</xsl:template>
<!-- Mode "expression". not. -->
<xsl:template
mode="p:expression" match="not">
<xsl:sequence select="'!'"/>
<xsl:sequence
select="t:get-expression(*[1])"/>
</xsl:template>
<!-- Mode "expression".
parens. -->
<xsl:template mode="p:expression" match="parens">
<xsl:sequence
select="'('"/>
<xsl:sequence select="t:get-expression(*[1])"/>
<xsl:sequence
select="')'"/>
</xsl:template>
<!-- Mode "expression". var. -->
<xsl:template
mode="p:expression" match="var">
<xsl:sequence select="@name"/>
</xsl:template>
<!-- Mode "expression". int, short, byte, long, float, double. -->
<xsl:template
mode="p:expression"
match="int | short | byte | long | float | double">
<xsl:sequence select="."/>
</xsl:template>
</xsl:stylesheet>
I often find that whenever I think of something, the idea has already been implemented somewhere.
A good example is xslt/xquery -> java code.
Well, the world is full of smart guys.
Wow, I've found an article, Code generation in XSLT 2.0. The article is dated 2005.
Well, I was reinventing the wheel. This is a good lesson for me.
I'm going to study SQL Code Generation very carefully, as this is exactly the task I'm facing now.
I've updated jxom.zip.
There are minor fixes there. The most important addition is a line breaker. The purpose of the line breaker is to split long lines.
Long lines appear when there are verbose comments, or a very long expression that was not categorized as multiline.
It's not perfect, however it looks acceptable.
Now I'm facing the next problem: I need to do a job similar to what I'm doing for java, but for sql. Moreover, I need to support several dialects of sql. I'm not sure whether it's possible (or worthwhile) to define a single sql-xom.xsd, or whether I should define sql-db2-v9-xom.xsd, sql-sqlserver-2005-xom.xsd, ...
The bad news is that sql grammar is much more complex than that of java. I'll probably start with some sql subset. In any case I do not consider generating sql "directly", as jxom fits its role remarkably well.
Building jxom stylesheets I've learned what "good" and "bad" recursion are from saxon's perspective.
I'm using control tokens $t:indent and $t:unindent to control indentation in the sequence of tokens defining java output. To build output lines I need to calculate the total indentation for each line. This can be done using a cumulative sum, counting $t:indent as +1 and $t:unindent as -1.
This task can be formalized as "calculate a cumulative integer sum".
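As a quick illustration of that formalization, here is a small Python sketch (Python is used only to show the arithmetic; it is not how jxom is implemented). The token names are represented as plain strings, and indentation_levels is a hypothetical helper name.

```python
from itertools import accumulate

# Control tokens, represented here as plain strings for simplicity.
INDENT = "t:indent"
UNINDENT = "t:unindent"


def indentation_levels(tokens):
    """Return the indentation level in effect after each token."""
    deltas = [1 if t == INDENT else -1 if t == UNINDENT else 0 for t in tokens]
    # accumulate() yields the running totals in a single pass.
    return list(accumulate(deltas))


tokens = ["class A", INDENT, "int x;", INDENT, "x = 1;", UNINDENT, UNINDENT, "end"]
levels = indentation_levels(tokens)
print(levels)  # [0, 1, 1, 2, 2, 1, 0, 0]
```

Each literal token simply inherits the running total, so the level of a line is the cumulative sum at the position of its tokens.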
The first approach I tested is non recursive: "for $i in 1 to count($items) return sum(subsequence($items, 1, $i))". It is incredibly slow, since every prefix is summed from scratch.
The next try was recursive: calculate results and emit them as they are calculated. This is a "crash fast" method. Saxon, indeed, implements this as recursion and hits the stack limit early.
The last approach employs saxon's ability to detect a particular flavour of tail call. When a function contains a tail call, and the output on the tail call code path consists of this tail call only, saxon transforms the construction into a loop. Thus I need to accumulate the result, pass it down the tail call chain, and output it at the last opportunity only.
The following sample shows this technique:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com"
  exclude-result-prefixes="xs t">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <xsl:variable name="values" as="xs:integer*" select="1 to 10000"/>

    <result>
      <sum>
        <xsl:value-of select="t:cumulative-integer-sum($values)"/>

        <!-- This call crashes with stack overflow. -->
        <!-- <xsl:value-of select="t:bad-cumulative-integer-sum($values)"/> -->

        <!-- To compare speed uncomment following lines. -->
        <!--<xsl:value-of select="sum(t:cumulative-integer-sum($values))"/>-->
        <!--<xsl:value-of select="sum(t:slow-cumulative-integer-sum($values))"/>-->
      </sum>
    </result>
  </xsl:template>

  <!--
    Calculates cumulative sum of integer sequence.
      $items - input integer sequence.
      Returns an integer sequence that is a cumulative sum of original sequence.
  -->
  <xsl:function name="t:cumulative-integer-sum" as="xs:integer*">
    <xsl:param name="items" as="xs:integer*"/>

    <xsl:sequence select="t:cumulative-integer-sum-impl($items, 1, 0, ())"/>
  </xsl:function>

  <!--
    Implementation of the t:cumulative-integer-sum.
      $items - input integer sequence.
      $index - current iteration index.
      $sum - base sum.
      $result - collected result.
      Returns an integer sequence that is a cumulative sum of original sequence.
  -->
  <xsl:function name="t:cumulative-integer-sum-impl" as="xs:integer*">
    <xsl:param name="items" as="xs:integer*"/>
    <xsl:param name="index" as="xs:integer"/>
    <xsl:param name="sum" as="xs:integer"/>
    <xsl:param name="result" as="xs:integer*"/>

    <xsl:variable name="item" as="xs:integer?" select="$items[$index]"/>

    <xsl:choose>
      <xsl:when test="empty($item)">
        <xsl:sequence select="$result"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="value" as="xs:integer" select="$item + $sum"/>
        <xsl:variable name="next" as="xs:integer+" select="$result, $value"/>

        <xsl:sequence select="
          t:cumulative-integer-sum-impl($items, $index + 1, $value, $next)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- "Bad" implementation of the cumulative-integer-sum. -->
  <xsl:function name="t:bad-cumulative-integer-sum" as="xs:integer*">
    <xsl:param name="items" as="xs:integer*"/>

    <xsl:sequence select="t:bad-cumulative-integer-sum-impl($items, 1, 0)"/>
  </xsl:function>

  <!-- "Bad" implementation of the cumulative-integer-sum. -->
  <xsl:function name="t:bad-cumulative-integer-sum-impl" as="xs:integer*">
    <xsl:param name="items" as="xs:integer*"/>
    <xsl:param name="index" as="xs:integer"/>
    <xsl:param name="sum" as="xs:integer"/>

    <xsl:variable name="item" as="xs:integer?" select="$items[$index]"/>

    <xsl:if test="exists($item)">
      <xsl:variable name="value" as="xs:integer" select="$item + $sum"/>

      <xsl:sequence select="$value"/>
      <xsl:sequence select="
        t:bad-cumulative-integer-sum-impl($items, $index + 1, $value)"/>
    </xsl:if>
  </xsl:function>

  <!-- Non recursive implementation of the cumulative-integer-sum. -->
  <xsl:function name="t:slow-cumulative-integer-sum" as="xs:integer*">
    <xsl:param name="items" as="xs:integer*"/>

    <xsl:sequence select="
      for $i in 1 to count($items) return
        sum(subsequence($items, 1, $i))"/>
  </xsl:function>

</xsl:stylesheet>
Comparing xslt 2.0 with its predecessor I see a great evolution of the language. There are, however, parts of the language which are not as good as they could be.
Look at manipulations of a sequence of sequences of items. The xpath 2.0/xquery 1.0 type system treats type quantifiers separately from the type itself. One can declare a variable of type "xs:string", or a variable whose type is a sequence of strings, "xs:string*". Unfortunately it's not possible to declare a sequence of sequences of strings, "xs:string**", as a type can have only one quantifier.
I think this is wrong. People use different tricks to remedy the problem. Typically one builds nodes that contain copies of the items of the sequences. Clearly this is a heavy way to achieve a simple result; moreover it does not preserve item identity.
In jxom I'm using a different solution to store a sequence of sequences, namely storing all sequences in one, separated with a terminator.
A typical sample is in the java serializer. After building a method's parameters I should format them one way (compact) or the other (verbose), depending on a decision which can be made only when all parameters are already built.
To see how it works please look at the following xslt:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com"
  exclude-result-prefixes="xs t">

  <xsl:output method="xml" indent="yes"/>

  <!-- Terminator token. -->
  <xsl:variable name="t:terminator" as="xs:QName"
    select="xs:QName('t:terminator')"/>

  <!-- New line. -->
  <xsl:variable name="t:crlf" as="xs:string" select="'&#10;'"/>

  <xsl:template match="/">
    <!--
      We need to manipulate a sequence of sequence of tokens.
      To do this we use $t:terminator to separate sequences.
    -->
    <xsl:variable name="short-items" as="item()*">
      <xsl:sequence select="t:get-param('int', 'a')"/>
      <xsl:sequence select="$t:terminator"/>

      <xsl:sequence select="t:get-param('int', 'b')"/>
      <xsl:sequence select="$t:terminator"/>

      <xsl:sequence select="t:get-param('int', 'c')"/>
      <xsl:sequence select="$t:terminator"/>
    </xsl:variable>

    <xsl:variable name="long-items" as="item()*">
      <xsl:sequence select="t:get-param('int', 'a')"/>
      <xsl:sequence select="$t:terminator"/>

      <xsl:sequence select="t:get-param('int', 'b')"/>
      <xsl:sequence select="$t:terminator"/>

      <xsl:sequence select="t:get-param('int', 'c')"/>
      <xsl:sequence select="$t:terminator"/>

      <xsl:sequence select="t:get-param('int', 'd')"/>
      <xsl:sequence select="$t:terminator"/>
    </xsl:variable>

    <result>
      <short>
        <xsl:value-of select="t:format($short-items)" separator=""/>
      </short>
      <long>
        <xsl:value-of select="t:format($long-items)" separator=""/>
      </long>
    </result>
  </xsl:template>

  <!--
    Returns a sequence of tokens that defines a parameter.
      $type - parameter type.
      $name - parameter name.
      Returns sequence of parameter tokens.
  -->
  <xsl:function name="t:get-param" as="item()*">
    <xsl:param name="type" as="xs:string"/>
    <xsl:param name="name" as="xs:string"/>

    <xsl:sequence select="$type"/>
    <xsl:sequence select="' '"/>
    <xsl:sequence select="$name"/>
  </xsl:function>

  <!--
    Format sequence of sequence of tokens separated with $t:terminator.
      $tokens - sequence of sequence of tokens to format.
      Returns formatted sequence of tokens.
  -->
  <xsl:function name="t:format" as="item()*">
    <xsl:param name="tokens" as="item()*"/>

    <xsl:variable name="terminators" as="xs:integer+"
      select="0, index-of($tokens, $t:terminator)"/>
    <xsl:variable name="count" as="xs:integer"
      select="count($terminators) - 1"/>
    <xsl:variable name="verbose" as="xs:boolean" select="$count > 3"/>

    <xsl:sequence select="
      for $i in 1 to $count return
      (
        subsequence
        (
          $tokens,
          $terminators[$i] + 1,
          $terminators[$i + 1] - $terminators[$i] - 1
        ),
        if ($i = $count) then
          ()
        else
        (
          ',',
          if ($verbose) then $t:crlf else ' '
        )
      )"/>
  </xsl:function>

</xsl:stylesheet>
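For readers who prefer to see the terminator technique outside xslt, here is a rough Python equivalent of the stylesheet above. It is a sketch only: TERMINATOR, split_sequences, format_params, get_param and build are names invented for this illustration, and the verbose separator is assumed to be a comma followed by a new line.

```python
TERMINATOR = object()  # unique sentinel playing the role of $t:terminator


def get_param(type_name, name):
    """Return the tokens of one parameter, like t:get-param."""
    return [type_name, " ", name]


def split_sequences(tokens):
    """Split a flat token list into sub-lists at each terminator."""
    groups, current = [], []
    for token in tokens:
        if token is TERMINATOR:
            groups.append(current)
            current = []
        else:
            current.append(token)
    # Tokens after the last terminator are ignored, as in the stylesheet.
    return groups


def format_params(tokens):
    """Join parameters compactly, or one per line when there are more than 3."""
    groups = split_sequences(tokens)
    verbose = len(groups) > 3
    separator = ",\n" if verbose else ", "
    return separator.join("".join(group) for group in groups)


def build(names):
    """Build a terminator-separated token list of int parameters."""
    tokens = []
    for name in names:
        tokens += get_param("int", name) + [TERMINATOR]
    return tokens


print(format_params(build("abc")))   # int a, int b, int c
print(format_params(build("abcd")))  # one parameter per line
```

The decision between the compact and the verbose layout is taken only after all parameters are known, which is the reason the sub-sequences have to be kept apart in the first place.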
I've updated jxom.zip. Now it supports qualified type name optimization.
I need to mention that this optimization is only possible when imports do not contain wildcard declarations like:
import a.b.*;
The only important thing left to do is a good line breaker.
Is it possible to call a function indirectly in xslt 2.0?
The answer is yes, however the implementation uses the dull trick of template matching to select a function handler. Template matching is a beautiful thing. It definitely was not devised to make this trick possible.
The following example defines two functions, t:sum and t:count, to be called indirectly by t:test. A function id (a.k.a. function pointer) is defined by the $t:sum and $t:count variables.
<?xml version="1.0" encoding="utf-8"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:t="http://www.nesterovsky-bros.com" exclude-result-prefixes="xs t">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <xsl:variable name="items" as="element()*">
      <value>1</value>
      <value>2</value>
      <value>3</value>
      <value>4</value>
      <value>5</value>
    </xsl:variable>

    <root>
      <sum>
        <xsl:sequence select="t:test($items, $t:sum)"/>
      </sum>
      <count>
        <xsl:sequence select="t:test($items, $t:count)"/>
      </count>
    </root>
  </xsl:template>

  <!-- Mode "t:function-call". Default match. -->
  <xsl:template mode="t:function-call" match="@* | node()">
    <xsl:sequence select="
      error
      (
        xs:QName('invalid-call'),
        concat('Unbound function call. Id: ', name())
      )"/>
  </xsl:template>

  <!-- Id of the function t:sum. -->
  <xsl:variable name="t:sum" as="item()">
    <t:sum/>
  </xsl:variable>

  <!-- Mode "t:function-call". t:sum handler. -->
  <xsl:template mode="t:function-call" match="t:sum">
    <xsl:param name="items" as="element()*"/>

    <xsl:sequence select="t:sum($items)"/>
  </xsl:template>

  <!--
    Calculates a sum of elements.
      $param - items to sum.
      Returns sum of element values.
  -->
  <xsl:function name="t:sum" as="xs:integer">
    <xsl:param name="items" as="element()*"/>

    <xsl:sequence select="sum($items/xs:integer(.))"/>
  </xsl:function>

  <!-- Id of the function t:count. -->
  <xsl:variable name="t:count" as="item()">
    <t:count/>
  </xsl:variable>

  <!-- Mode "t:function-call". t:count handler. -->
  <xsl:template mode="t:function-call" match="t:count">
    <xsl:param name="items" as="element()*"/>

    <xsl:sequence select="t:count($items)"/>
  </xsl:template>

  <!--
    Calculates the number of elements in a sequence.
      $param - items to count.
      Returns count of element values.
  -->
  <xsl:function name="t:count" as="xs:integer">
    <xsl:param name="items" as="element()*"/>

    <xsl:sequence select="count($items)"/>
  </xsl:function>

  <!--
    A function that performs indirect call.
      $param - items to pass to an indirect call.
      $function-id - a function id.
      Returns a value calculated in the indirect function.
  -->
  <xsl:function name="t:test" as="xs:integer">
    <xsl:param name="items" as="element()*"/>
    <xsl:param name="function-id" as="item()"/>

    <xsl:variable name="result" as="xs:integer">
      <xsl:apply-templates mode="t:function-call" select="$function-id">
        <xsl:with-param name="items" select="$items"/>
      </xsl:apply-templates>
    </xsl:variable>

    <xsl:sequence select="$result"/>
  </xsl:function>
</xsl:stylesheet>
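For comparison, here is roughly what the same indirect call looks like in a language with first-class function references. This Java sketch is not part of the stylesheet: the class name IndirectCall is hypothetical, and its methods are named only to mirror t:sum, t:count and t:test. Items are modeled as a list of strings standing in for the value elements.

```java
import java.util.List;
import java.util.function.ToIntFunction;

final class IndirectCall {
    // Mirrors t:sum: sums the integer values of the items.
    public static int sum(List<String> items) {
        return items.stream().mapToInt(Integer::parseInt).sum();
    }

    // Mirrors t:count: counts the items.
    public static int count(List<String> items) {
        return items.size();
    }

    // Mirrors t:test: calls whichever function it is given.
    public static int test(List<String> items, ToIntFunction<List<String>> function) {
        return function.applyAsInt(items);
    }

    public static void main(String[] args) {
        List<String> items = List.of("1", "2", "3", "4", "5");

        System.out.println("sum: " + test(items, IndirectCall::sum));
        System.out.println("count: " + test(items, IndirectCall::count));
    }
}
```

Here the method reference plays the role of the function id element, and the call through the functional interface replaces the template dispatch in mode t:function-call.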
Hello again!
The first part about jxom is in the previous post.
I'm back with jxom (Java xml object model). I've finally managed to create an xslt that generates java code from a jxom document.
You may ask why it took as long as a week to produce it.
There are two answers: 1. My poor talents. 2. I've virtually created two implementations.
My first approach was to generate java text directly from xml. I truly believed this was the way. I got it wrong: once you start dealing with indentation, formatting and reformatting of the text you're generating, you see that things are not that simple. Well, it was a naive approach.
I could have finished it, but at some point I realized that its complexity did not compose from the complexity of its parts; it just kept growing. This is not acceptable for such a simple task. The approach is bad. Period.
The alternative I've devised is simple, and in fact more natural than the naive approach. It is a two stage generation: a) generate a sequence of tokens - serializer; b) generate and then print a sequence of lines - streamer.
Tokens (item()*) are either control words (xs:QName), or literals (xs:string).
I've defined the following control tokens:
Token | Description
t:indent | indents following content.
t:unindent | unindents following content.
t:line-indent | resets indentation for one line.
t:new-line | new line token.
t:terminator | separates token sequences.
t:code | marks line as code (default line type).
t:doc | marks line as documentation comment.
t:begin-doc | marks line as begin of documentation comment.
t:end-doc | marks line as end of documentation comment.
t:comment | marks line as comment.
Thus an input for the streamer looks like:
<xsl:sequence select="'public'"/>
<xsl:sequence select="' '"/>
<xsl:sequence select="'class'"/>
<xsl:sequence select="' '"/>
<xsl:sequence select="'A'"/>
<xsl:sequence select="$t:new-line"/>
<xsl:sequence select="'{'"/>
<xsl:sequence select="$t:new-line"/>
<xsl:sequence select="$t:indent"/>
<xsl:sequence select="'public'"/>
<xsl:sequence select="' '"/>
<xsl:sequence select="'int'"/>
<xsl:sequence select="' '"/>
<xsl:sequence select="'a'"/>
<xsl:sequence select="';'"/>
<xsl:sequence select="$t:unindent"/>
<xsl:sequence select="$t:new-line"/>
<xsl:sequence select="'}'"/>
<xsl:sequence select="$t:new-line"/>
The streamer receives a sequence of tokens and transforms it into a sequence of lines.
One beautiful thing about tokens is that the streamer can easily insert line breaks to keep within the page width. Another convenience is that the code generating tokens need not track the indentation level: it just emits the t:indent and t:unindent control tokens to increase and decrease the current indentation.
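To make the token idea concrete, here is a minimal sketch of a streamer, written in Java rather than xslt purely for illustration. It is not the code from java-streamer.xslt: the class name Streamer, the Control enum and the two-space indentation are assumptions, and only the t:indent, t:unindent and t:new-line tokens are modeled (line types, t:line-indent and t:terminator are left out).

```java
import java.util.ArrayList;
import java.util.List;

final class Streamer {
    // Control tokens; literal tokens are plain strings.
    enum Control { INDENT, UNINDENT, NEW_LINE }

    // Converts a token sequence into lines. Indentation is applied when
    // the first literal of a line is written (two spaces per level, assumed).
    public static List<String> toLines(List<Object> tokens) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int level = 0;

        for (Object token : tokens) {
            if (token == Control.INDENT) {
                level++;
            } else if (token == Control.UNINDENT) {
                level = Math.max(0, level - 1);
            } else if (token == Control.NEW_LINE) {
                lines.add(line.toString());
                line.setLength(0);
            } else {
                if (line.length() == 0) {
                    line.append("  ".repeat(level));
                }
                line.append(token);
            }
        }

        // Flush a trailing line that was not closed by a new line token.
        if (line.length() > 0) {
            lines.add(line.toString());
        }

        return lines;
    }

    public static void main(String[] args) {
        List<Object> tokens = List.of(
            "public", " ", "class", " ", "A", Control.NEW_LINE,
            "{", Control.NEW_LINE,
            Control.INDENT,
            "public", " ", "int", " ", "a", ";",
            Control.UNINDENT, Control.NEW_LINE,
            "}", Control.NEW_LINE);

        toLines(tokens).forEach(System.out::println);
    }
}
```

The producer of the tokens never counts spaces. All knowledge about indentation lives in one loop, which is the property that keeps the serializer simple.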
The way the code is built allows it to mimic any code style. I've followed my favorite one. In the future I'll probably add options controlling code style. My todo list still has several features I want to implement, such as a line breaker to preserve page width, and a type qualification optimizer (an optional feature) to reduce unnecessary type qualifications.
Current implementation can be found at jxom.zip. It contains:
File | Description
java.xsd | jxom xml schema.
java-serializer-main.xslt | transformation entry point.
java-serializer.xslt | generates tokens for top level constructs.
java-serializer-statements.xslt | generates tokens for statements.
java-serializer-expressions.xslt | generates tokens for expressions.
java-streamer.xslt | converts tokens into lines.
DataAdapter.xml | sample jxom document.
This was my first experience with xslt 2.0. I'm very pleased with what it can do. The only missing feature is an indirect function call (which I do not want to model with the dull template matching approach).
Note that although the xslt I've built is platform independent, I was experimenting with saxon 9. Several times I've relied on its efficient tail call implementation (see t:cumulative-integer-sum); without it the code would lead to an xslt stack overflow.
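The shape of such a function can be sketched in Java. The body of t:cumulative-integer-sum is not shown in this post, so the sketch assumes it returns the running totals of an integer sequence; the class and method names are hypothetical. The recursive version keeps its state in accumulator parameters and returns the recursive call directly, which is what makes it a tail call. Java itself does not eliminate tail calls, so the loop beside it shows what a processor that optimizes them effectively executes.

```java
import java.util.ArrayList;
import java.util.List;

final class CumulativeSum {
    // Tail-recursive form: nothing remains to be done after the recursive call.
    // Without tail call elimination the stack depth grows with the input size.
    public static List<Integer> recursive(
            List<Integer> values, int index, int total, List<Integer> result) {
        if (index == values.size()) {
            return result;
        }

        int next = total + values.get(index);

        result.add(next);

        return recursive(values, index + 1, next, result);
    }

    // Equivalent loop: the accumulator parameters become local variables.
    public static List<Integer> iterative(List<Integer> values) {
        List<Integer> result = new ArrayList<>();
        int total = 0;

        for (int value : values) {
            total += value;
            result.add(total);
        }

        return result;
    }

    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 3, 4);

        System.out.println(recursive(values, 0, 0, new ArrayList<>()));
        System.out.println(iterative(values));
    }
}
```

Both calls print [1, 3, 6, 10]. With a long enough input the recursive form fails with a StackOverflowError in Java, which is the same failure mode the xslt would hit on a processor without tail call optimization.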
I shall be pleased to see your feedback on the subject.
Hello,
I have not written for a long time. IMHO: nothing to say? Then make no noise!
Nowadays I'm busy with xslt.
Should I be pleased that the w3c committee has finally delivered xpath 2.0/xslt 2.0/xquery? There were possibly people who did not live to see this happen. Be grateful to fate that we have survived!
I'm working now with saxon 9. It's a good implementation, though too interpreter-like in my opinion. I think these languages could be compiled down to machine/vm code the same way as c++/java/c# are.
To the point. I need to generate java code in xslt. I've done this before; that time I dealt with relatively simple templates like beans or interfaces. Now I need to generate beans, interfaces, and classes with logic. In fact I should cover almost all java 6 features.
Immediately I started thinking in terms of a java xml object model (jxom). Thus there will be an xml schema of jxom (Am I reinventing the bicycle? I pray you point me to an existing schema!) - java grammar as xml. There will be xslts that generate code according to this schema, and an xslt that will serialize jxom documents directly into java.
This two stage generation is important, as there are essentially two different tasks: generating java code, and serializing it down to a text format. Moreover, whenever I have a jxom document I can manipulate it! And finally, this will allow our team to concentrate its efforts, as one only needs to generate a jxom document.
Yesterday I found a java ANTLR grammar and converted it into an xml schema: java.xsd. It is important to have this xml schema defined, even if no one uses it except in an editor, as it makes jxom generation more formal.
The next step is to create the xslt serializer, which is on the todo list.
To get a feel for how jxom looks, I've created it manually for a simple java file:
// $Id: DataAdapter.java 1122 2007-12-31 12:43:47Z arthurn $
package com.bphx.coolgen.data;

import java.util.List;

/**
 * Encapsulates encyclopedia database access.
 */
public interface DataAdapter {
  /**
   * Starts data access session for a specified model.
   * @param modelId - a model to open.
   */
  void open(int modelId) throws Exception;

  /**
   * Ends data access session.
   */
  void close() throws Exception;

  /**
   * Gets current model id.
   * @return current model id.
   */
  int getModelId();

  /**
   * Gets data objects for a specified object type for the current model.
   * @param type - an object type to get data objects for.
   * @return list of data objects.
   */
  List<DataObject> getObjectsForType(short type) throws Exception;

  /**
   * Gets a list of data associations for an object id.
   * @param id - object id.
   * @return list of data associations.
   */
  List<DataAssociation> getAssociations(int id) throws Exception;

  /**
   * Gets a list of data properties for an object id.
   * @param id - object id.
   * @return list of data properties.
   */
  List<DataProperty> getProperties(int id) throws Exception;
}
jxom:
<unit xmlns="http://www.bphx.com/java-1.5/2008-02-07" package="com.bphx.coolgen.data">
  <comment>$Id: DataAdapter.java 1122 2007-12-31 12:43:47Z arthurn $</comment>
  <import package="java.util.List"/>
  <interface access="public" name="DataAdapter">
    <comment doc="true">Encapsulates encyclopedia database access.</comment>
    <method name="open">
      <comment doc="true">
        Starts data access session for a specified model.
        <para type="param" name="modelId">a model to open.</para>
      </comment>
      <parameters>
        <parameter name="modelId"><type name="int"/></parameter>
      </parameters>
      <throws><type name="Exception"/></throws>
    </method>
    <method name="close">
      <comment doc="true">Ends data access session.</comment>
      <throws><type name="Exception"/></throws>
    </method>
    <method name="getModelId">
      <comment doc="true">
        Gets current model id.
        <para type="return">current model id.</para>
      </comment>
      <returns><type name="int"/></returns>
    </method>
    <method name="getObjectsForType">
      <comment doc="true">
        Gets data objects for a specified object type for the current model.
        <para type="param" name="type">
          an object type to get data objects for.
        </para>
        <para type="return">list of data objects.</para>
      </comment>
      <returns>
        <type>
          <part name="List">
            <typeArgument><type name="DataObject"/></typeArgument>
          </part>
        </type>
      </returns>
      <parameters>
        <parameter name="type"><type name="short"/></parameter>
      </parameters>
      <throws><type name="Exception"/></throws>
    </method>
    <method name="getAssociations">
      <comment doc="true">
        Gets a list of data associations for an object id.
        <para type="param" name="id">object id.</para>
        <para type="return">list of data associations.</para>
      </comment>
      <returns>
        <type>
          <part name="List">
            <typeArgument><type name="DataAssociation"/></typeArgument>
          </part>
        </type>
      </returns>
      <parameters>
        <parameter name="id"><type name="int"/></parameter>
      </parameters>
      <throws><type name="Exception"/></throws>
    </method>
    <method name="getProperties">
      <comment doc="true">
        Gets a list of data properties for an object id.
        <para type="param" name="id">object id.</para>
        <para type="return">list of data properties.</para>
      </comment>
      <returns>
        <!-- Compact form of generic type. -->
        <type name="List&lt;DataProperty&gt;"/>
      </returns>
      <parameters>
        <parameter name="id"><type name="int"/></parameter>
      </parameters>
      <throws><type name="Exception"/></throws>
    </method>
  </interface>
</unit>
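To show that such a document is easy to consume, here is a small sketch that walks a jxom fragment and prints one java signature per method. It is written in Java with the standard DOM API purely as an illustration; the serializer planned in this post is an xslt. The class name JxomSignatures and its helpers are hypothetical, and only the compact type form (type/@name) is handled, not the nested part/typeArgument form.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class JxomSignatures {
    // Namespace used by the sample jxom document.
    static final String NS = "http://www.bphx.com/java-1.5/2008-02-07";

    // Returns the direct child elements of parent with the given local name.
    static List<Element> children(Element parent, String name) {
        List<Element> result = new ArrayList<>();
        for (Node node = parent.getFirstChild(); node != null; node = node.getNextSibling()) {
            if (node instanceof Element
                    && NS.equals(node.getNamespaceURI())
                    && name.equals(node.getLocalName())) {
                result.add((Element) node);
            }
        }
        return result;
    }

    // Collects type/@name values under wrapper elements such as returns or throws.
    static List<String> typeNames(Element method, String wrapper) {
        List<String> names = new ArrayList<>();
        for (Element holder : children(method, wrapper)) {
            for (Element type : children(holder, "type")) {
                names.add(type.getAttribute("name"));
            }
        }
        return names;
    }

    // Builds one signature line per method of every interface in the unit.
    public static List<String> signatures(String xml) {
        Element unit;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            unit = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse jxom document", e);
        }

        List<String> result = new ArrayList<>();
        for (Element type : children(unit, "interface")) {
            for (Element method : children(type, "method")) {
                List<String> parameters = new ArrayList<>();
                for (Element list : children(method, "parameters")) {
                    for (Element parameter : children(list, "parameter")) {
                        // Assumes every parameter has exactly one type child.
                        Element parameterType = children(parameter, "type").get(0);
                        parameters.add(parameterType.getAttribute("name")
                            + " " + parameter.getAttribute("name"));
                    }
                }
                List<String> returns = typeNames(method, "returns");
                List<String> exceptions = typeNames(method, "throws");
                // A method without a returns element is treated as void.
                result.add((returns.isEmpty() ? "void" : returns.get(0))
                    + " " + method.getAttribute("name")
                    + "(" + String.join(", ", parameters) + ")"
                    + (exceptions.isEmpty() ? "" : " throws " + String.join(", ", exceptions))
                    + ";");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml =
            "<unit xmlns='http://www.bphx.com/java-1.5/2008-02-07'>"
            + "<interface access='public' name='DataAdapter'>"
            + "<method name='open'>"
            + "<parameters><parameter name='modelId'><type name='int'/></parameter></parameters>"
            + "<throws><type name='Exception'/></throws>"
            + "</method>"
            + "<method name='getModelId'><returns><type name='int'/></returns></method>"
            + "</interface></unit>";

        signatures(xml).forEach(System.out::println);
    }
}
```

For the two methods in the sample string this prints "void open(int modelId) throws Exception;" and "int getModelId();". Everything the code needs is already explicit in the document structure, which is the point of having jxom between generation and serialization.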
To read about xslt for jxom please follow this link.