Earlier, in the entry "Inline functions in xslt 2.1", we described an implementation of an xml tree model that may share subtrees among different trees.
Thus, in code like:
<xsl:variable name="elements" as="element()*" select="..."/>
<xsl:variable name="result" as="element()">
<result>
<xsl:sequence select="$elements"/>
</result>
</xsl:variable>
the implementation shares the internal representation between $elements
and the subtree of $result. From the perspective of xslt these
look like completely different subtrees with different node identities, which is
in accordance with its view of the world.
After a short study we decided to create a research implementation of this
tree model in saxon. It took only a couple of days to introduce minimal changes
to the engine, to refactor the linked tree into a new composable tree, and to
perform some tests.
In many cases saxon benefited immediately from this new tree model; in other
cases more tuning is required.
Our tests showed that this new tree performs better than the
linked tree, but a little worse than the tiny tree. On the other hand,
conventional code patterns avoid subtree copying on the assumption that it is an
expensive operation, so one should rethink some coding practices to benefit from
the composable tree.
The implementation can be downloaded at:
saxon.composabletree.zip
From the web we know that the xslt WG is now thinking about how to make xslt friendlier to huge documents. They will probably introduce some xslt syntax to prepare implementations for such processing: likely an indicator marking a specific mode for streaming, with XPath in that mode restricted to some subset.
The funny part is that we implemented a similar feature back in 2002 in .net. It was called XPathForwardOnlyNavigator.
The implementation stored only several nodes at a time (the context node and its ancestors) and read data from an XmlReader on demand. Thus one could navigate to ancestor elements, to children, and to following siblings, but never to any previous node. When one tried to reach a node that was no longer available, we threw an exception.
It was simple and not perfect (too restrictive), but it was pluggable into .net's xslt and allowed processing files of any size.
That old implementation looks very attractive even now, in 2010. We expect that the WG's decisions will not rule out such or similar solutions, and will not force implementers to write an alternative engine for xslt streaming.
Xslt 1.0 was designed with the best intentions. Xslt 2.0 got legacy
baggage.
If you're not fully concentrated while translating your algorithms into
xslt 2.0, you can fall into a trap, as we did.
Consider a code snippet:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+"
select="$elements/t:convert(.)"/>
Looks simple, doesn't it?
Our intention was to get converted elements, resulting from some
xsl:apply-templates logic.
Well, this code works... but rather sporadically, as the results are often in the
wrong order! This bug is very close to
what is called a
Heisenbug.
So, where is the problem?
Elementary, my dear Watson:
xsl:apply-templates constructs a sequence of rootless
elements.
$elements/t:convert(.) converts the elements and orders the results in document order.
Here is the tricky part:
the relative order of nodes in distinct trees is stable but
implementation-dependent...
and each rootless element belongs to its own tree, so the resulting order is
essentially arbitrary.
Once we realized what the problem was, the code was
immediately rewritten as:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+" select="
for $element in $elements return
t:convert($element)"/>
P.S. Taking into account the size of our xslt code base, it took half an hour to
localize the problem. Now we're in a position to review all uses of slashes
in xslt. How do you like it?
Opinions on xml namespaces
olegtk: @srivatsn Originally the idea was that namespace URI would point to some schema definition. Long abandoned idea.
Not so long ago, I saw some good reasoning on the same subject:
Inline functions in xslt 2.1 often look like some strange aberration. Sure,
there are very useful cases when they act as delegates of program logic
(e.g. comparators and filters), but often (probably more often) we can see that
their use is to model data structures.
As an example, suppose you want to model a structure with three properties, say
a, b, and c. You implement this by creating functions that wrap and unwrap the
data:
function make-data($a as item(), $b as item(), $c as item()) as function() as item()+
{
  function() { $a, $b, $c }
}

function a($data as function() as item()+) as item()
{
  $data()[1]
}

function b($data as function() as item()+) as item()
{
  $data()[2]
}

function c($data as function() as item()+) as item()
{
  $data()[3]
}
Clever?
Sure, it is! Here we have modeled a structure with the help of a sequence, which
we have wrapped into a function item.
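The same pattern can be sketched in a mainstream language. In the hypothetical Java sketch below, a Supplier closure plays the role of the function item, and the accessors unwrap it by position:

```java
import java.util.List;
import java.util.function.Supplier;

public class ClosureStruct {
    // make-data: pack the three properties into a closure.
    static Supplier<List<Object>> makeData(Object a, Object b, Object c) {
        return () -> List.of(a, b, c);
    }

    // The accessors unwrap the closure and pick one item by position.
    static Object a(Supplier<List<Object>> data) { return data.get().get(0); }
    static Object b(Supplier<List<Object>> data) { return data.get().get(1); }
    static Object c(Supplier<List<Object>> data) { return data.get().get(2); }

    public static void main(String[] args) {
        Supplier<List<Object>> data = makeData("x", 42, true);
        System.out.println(a(data)); // prints x
        System.out.println(b(data)); // prints 42
    }
}
```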
Alas, clever is not always good (often it's a sign of something bad). We just
wanted to define a simple structure. What does it have in common with a function?
There is a distance between what we want to express when designing an algorithm
and what we see looking at the code. The greater the distance, the more effort
is required to write and to read the code.
It would be so good to have a simpler way to express such a concept as a structure.
Let's dream a little. Suppose you already have a structure and just want to
access its members. An idea we can think of is an xpath-like access method:
$data/a, $data/b, $data/c
But wait a second, doesn't
$data look very much like an xml element, with its accessors being just
node tests?
That's correct, so a data constructor may coincide with an element constructor.
Then what are the pros and cons of using xml elements to model structures?
Pros: the existing xml type system, and sensible-looking code (you immediately
understand that a structure is being constructed here).
Cons: xml trees are implemented in a way that does not allow fast (from the
performance perspective) composition, as a copy of the data is made whenever you
construct a structure.
But here we observe that "implemented" is a very important word in this context.
If the xml tree implementation did not store a reference to the parent node, then
subtrees could be composed very efficiently (note that the tree is assumed to be
immutable). The parent node could be available through a tree navigator, which would
contain a reference to the node itself and to a parent tree navigator (or a
child-parent map could be stored somewhere near the root).
Such a tree structure would probably help not only in this particular case but
also with other conventional xslt code patterns.
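To illustrate the idea (not Saxon's actual code), here is a minimal Java sketch with hypothetical names: nodes are immutable and hold no parent reference, so a node can be shared by many trees, and the parent axis is recovered by a navigator that pairs a node with its parent navigator:

```java
import java.util.List;

// Nodes are immutable and hold no parent reference, so the same node can be
// shared by many trees without copying.
final class Node {
    final String name;
    final List<Node> children;

    Node(String name, List<Node> children) {
        this.name = name;
        this.children = children;
    }
}

// The parent axis is recovered by a navigator: a node plus its parent navigator.
final class Navigator {
    final Node node;
    final Navigator parent; // null at the root

    Navigator(Node node, Navigator parent) {
        this.node = node;
        this.parent = parent;
    }

    Navigator child(int i) {
        return new Navigator(node.children.get(i), this);
    }
}

public class SharedSubtrees {
    public static void main(String[] args) {
        Node shared = new Node("item", List.of());
        // The very same node instance participates in two trees: no copy is made.
        Node tree1 = new Node("result1", List.of(shared));
        Node tree2 = new Node("result2", List.of(shared));
        Navigator n1 = new Navigator(tree1, null).child(0);
        Navigator n2 = new Navigator(tree2, null).child(0);
        System.out.println(n1.node == n2.node);  // prints true: shared node
        System.out.println(n1.parent.node.name); // prints result1
    }
}
```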
P.S. Saxon could probably implement its NodeInfo this way.
Update: see also Custom tree model.
A while ago we proposed introducing maps as built-in types in the
xpath/xquery type system: Tuples and maps.
The suggestion was declined (probably our arguments were not convincing). We,
however, still think that maps, along with sequences, are primitives required to
build sensible (not obscure) algorithms. To see that a map is at times the
best way to solve a problem, we shall refer to a utility function that
allocates names in a scope. Its signature looks like this:
<!--
  Allocates unique names in the form $prefix{number}?.
  Note that prefixes may coincide.
  $prefixes - name prefixes.
  $names - allocated names pool.
  $name-max-length - the longest allowable name length.
  Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
<xsl:param name="prefixes" as="xs:string*"/>
<xsl:param name="names" as="xs:string*"/>
<xsl:param name="name-max-length" as="xs:integer?"/>
Just try to program it and you will find yourself coding something like the one defined at
cobolxom.
To be efficient, such maps should provide at worst logarithmic operation
complexity:
- access to an item by key (and/or by index) - O(log N);
- creation of a new map with an item added or removed - O(log N);
- construction of a map from ordered items - O(N);
- construction of a map from unordered items - O(N*log N);
- total enumeration of all items - O(N*log N);
These performance bounds are found in many functional and procedural
implementations of maps. Typical RB- and AVL-tree-based maps satisfy these
constraints.
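To see why an added or removed item can cost only O(log N), consider this minimal persistent binary search tree sketch in Java (hypothetical names; balancing, which is what actually bounds the path length, is omitted): only the nodes on the search path are copied, everything else is shared.

```java
// A persistent (immutable) binary search tree: put() copies only the nodes on
// the search path and shares the rest, so both the old and the new version of
// the map stay valid. With AVL/RB balancing the copied path is O(log N) long.
final class Tree {
    final String key;
    final Object value;
    final Tree left, right;

    Tree(String key, Object value, Tree left, Tree right) {
        this.key = key;
        this.value = value;
        this.left = left;
        this.right = right;
    }

    static Tree put(Tree t, String key, Object value) {
        if (t == null) return new Tree(key, value, null, null);
        int c = key.compareTo(t.key);
        if (c < 0) return new Tree(t.key, t.value, put(t.left, key, value), t.right);
        if (c > 0) return new Tree(t.key, t.value, t.left, put(t.right, key, value));
        return new Tree(key, value, t.left, t.right);
    }

    static Object get(Tree t, String key) {
        while (t != null) {
            int c = key.compareTo(t.key);
            if (c == 0) return t.value;
            t = c < 0 ? t.left : t.right;
        }
        return null;
    }
}

public class PersistentMapDemo {
    public static void main(String[] args) {
        Tree v1 = Tree.put(Tree.put(null, "b", 2), "a", 1);
        Tree v2 = Tree.put(v1, "b", 20);
        System.out.println(Tree.get(v1, "b"));  // prints 2: old version intact
        System.out.println(Tree.get(v2, "b"));  // prints 20
        System.out.println(v1.left == v2.left); // prints true: subtree shared
    }
}
```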
What we suggest is to introduce a map implementation into exslt2 (assuming
inline functions are present). As a sketch, we have implemented a pure
AVL tree in Xslt 2.0:
We do not assume that an implementation like this should be used directly, but
rather think of some extension function(s) that provide better performance.
What do you think?
The story about the immutable tree would not be complete without an xslt
implementation. To make it possible, one needs something to approximate tree
nodes. You cannot implement such a construct efficiently in pure xslt 2.0 (it
would be either inefficient or not pure).
To isolate the problem we have used a tuple interface:
- tuple:ref($items as item()*) as item() - wraps items into a tuple;
- tuple:deref($tuple as item()?) as item()* - unwraps items from a tuple;
- tuple:is-same($first as item(), $second as item()) as xs:boolean - tests
whether two tuples are the same;
and defined an inefficient implementation based on xml elements. Every other part
of the code is a regular AVL algorithm implementation.
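In Java terms this tuple interface amounts to an identity-carrying wrapper around a sequence; a hypothetical sketch:

```java
import java.util.List;

public class Tuples {
    // tuple:ref - wraps items into a tuple with a fresh identity.
    static final class Tuple {
        final List<Object> items;
        Tuple(List<Object> items) { this.items = items; }
    }

    static Tuple ref(List<Object> items) { return new Tuple(items); }

    // tuple:deref - unwraps the items.
    static List<Object> deref(Tuple tuple) { return tuple.items; }

    // tuple:is-same - identity, not equality, of tuples.
    static boolean isSame(Tuple first, Tuple second) { return first == second; }

    public static void main(String[] args) {
        Tuple t1 = ref(List.of(1, 2));
        Tuple t2 = ref(List.of(1, 2));
        System.out.println(isSame(t1, t1)); // prints true
        System.out.println(isSame(t1, t2)); // prints false: distinct identities
    }
}
```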
We want to stress again that even assuming there is a good tuple
implementation, we would still prefer a built-in associative container.
Why on earth do you need to include about 1000 lines of code just to use a map?
Source code is:
We like Visual Studio very much and try to adopt new versions early.
Lately our VS usage pattern has centered around xml and xslt. In our opinion VS 2008 is the best xslt 2 editor we have ever seen, even with its lack of xslt 2.0 debugging support.
Unfortunately, that is still true now that VS 2010 is almost out. VS 2008 is just 2-3 times faster. You can observe this working with xslt files like those in languages-xom.zip (1000-2000 rows). Things just become slow.
We still hope that VS 2010 will make a final effort to outdo what VS 2008 has already delivered.
While bemoaning the lack of associative containers in the xpath type system, we
have come up with a good implementation of t:allocate-names(). The implementation
can be seen at
cobol-names.xslt.
It is based on recursion and on the use of xsl:for-each-group.
The algorithmic worst-case complexity is O(N*LogN*LogL), where N is the number of
names and L is the length of the longest name.
This does not invalidate the idea that associative containers
are very desirable; blessed is the one who types out such an implementation
naturally. For us it went the hard way: it took three days to realize that the
original algorithm was problematic and to work out a better one.
In practice this means 2 seconds for the new implementation against 25 minutes for the old one.
Ladies and gentlemen of the jury...
Why do we return to this theme again?
Well, it's itching!
In cobolxom there is a utility function to allocate names in a scope. Its signature looks like this:
<!--
  Allocates unique names in the form $prefix{number}?.
  Note that prefixes may coincide.
  $prefixes - name prefixes.
  $names - allocated names pool.
  $name-max-length - the longest allowable name length.
  Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
  <xsl:param name="prefixes" as="xs:string*"/>
  <xsl:param name="names" as="xs:string*"/>
  <xsl:param name="name-max-length" as="xs:integer?"/>
We have created several different implementations (all use recursion). Every implementation works fine for relatively small input sequences, say N < 100, but we have cases with about 10000 items on input. The algorithm's worst-case complexity, in the absence of associative containers, is O(N*N), and be sure it's an o-o-o-oh... slow one, due to the xslt engine implementation.
If there were associative containers with efficient access (complexity O(LogN)) and efficient construction of an updated container (complexity also O(LogN)), then the implementation would be straightforward and would have complexity O(N*LogN).
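The straightforward algorithm, given an associative container, can be sketched in Java like this (hypothetical names; the $name-max-length parameter is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AllocateNames {
    // A sketch of t:allocate-names: for each prefix, append the smallest
    // number that makes the name unique. The hash-based pool and per-prefix
    // counters make each allocation cheap; with a tree-based (associative)
    // container the same loop is O(N*log N) overall.
    static List<String> allocateNames(List<String> prefixes, Set<String> pool) {
        Set<String> used = new HashSet<>(pool);
        Map<String, Integer> counters = new HashMap<>();
        List<String> result = new ArrayList<>();

        for (String prefix : prefixes) {
            int n = counters.getOrDefault(prefix, 0);
            String name = n == 0 ? prefix : prefix + n;
            while (used.contains(name)) {
                n++;
                name = prefix + n;
            }
            counters.put(prefix, n);
            used.add(name);
            result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allocateNames(List.of("A", "A", "B"), Set.of("A")));
        // prints [A1, A2, B]
    }
}
```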
The very same simple tasks tend to appear in different languages (e.g.
C# Haiku).
Now we have to find:
- the integer and fractional parts of a decimal;
- the length and precision of a decimal.
These tasks have no trivial solutions in xslt 2.0.
At present we have come up with the following answers:
Fractional part:
<xsl:function name="t:fraction" as="xs:decimal">
<xsl:param name="value" as="xs:decimal"/>
<xsl:sequence select="$value mod 1"/>
</xsl:function>
Integer part v1:
<xsl:function name="t:integer" as="xs:decimal">
<xsl:param name="value" as="xs:decimal"/>
<xsl:sequence select="$value - t:fraction($value)"/>
</xsl:function>
Integer part v2:
<xsl:function name="t:integer" as="xs:decimal">
<xsl:param name="value" as="xs:decimal"/>
<xsl:sequence select="
if ($value ge 0) then
floor($value)
else
-floor(-$value)"/>
</xsl:function>
Length and precision:
<!--
  Gets a decimal specification as a pair:
  ($length as xs:integer, $precision as xs:integer).
-->
<xsl:function name="t:decimal-spec" as="xs:integer+">
  <xsl:param name="value" as="xs:decimal"/>

  <xsl:variable name="text" as="xs:string" select="
    if ($value lt 0) then
      xs:string(-$value)
    else
      xs:string($value)"/>
  <xsl:variable name="length" as="xs:integer"
    select="string-length($text)"/>
  <xsl:variable name="integer-length" as="xs:integer"
    select="string-length(substring-before($text, '.'))"/>

  <xsl:sequence select="
    if ($integer-length) then
      ($length - 1, $length - $integer-length - 1)
    else
      ($length, 0)"/>
</xsl:function>
The last function looks odious. In many other languages its implementation
would be considered embarrassing.
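For contrast, here is how the same three inspections might look in Java over BigDecimal, whose precision() and scale() directly expose the digit counts (note that conventions for values below 1, where a leading zero is involved, differ slightly from the xslt version above):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalParts {
    public static void main(String[] args) {
        BigDecimal value = new BigDecimal("-123.45");

        // Integer part (truncated toward zero) and the fractional remainder.
        BigDecimal integer = value.setScale(0, RoundingMode.DOWN);
        BigDecimal fraction = value.subtract(integer);

        // Total significant digits and digits after the decimal point.
        int length = value.precision();
        int precision = value.scale();

        System.out.println(integer);   // prints -123
        System.out.println(fraction);  // prints -0.45
        System.out.println(length);    // prints 5
        System.out.println(precision); // prints 2
    }
}
```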
Given:
public class N
{
public readonly N next;
}
What needs to be done to construct a ring of N: n1 refers to n2, n2 to n3, ..., nk to n1? Is it possible?
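One possible answer, sketched here in Java with final playing the role of readonly: it is possible, by passing the first node's this down a chain of constructors, so the last constructor can close the ring before any constructor finishes.

```java
// A node with a final (readonly) next reference. The ring is closed by passing
// the first node's 'this' down a chain of constructors, so the last node can
// refer back to it.
final class N {
    final N next;

    // Builds a ring of k nodes starting at this one (k >= 1).
    N(int k) {
        this.next = k > 1 ? new N(this, k - 1) : this;
    }

    private N(N first, int k) {
        this.next = k > 1 ? new N(first, k - 1) : first;
    }
}

public class Ring {
    public static void main(String[] args) {
        N n1 = new N(3); // n1 -> n2 -> n3 -> n1
        System.out.println(n1.next.next.next == n1); // prints true
    }
}
```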
To wrap up with immutable trees, at least for now, we've implemented IDictionary<K, V>.
It's named Map<K, V>. Functionally it looks very much like SortedDictionary<K, V>.
There are some differences, however:
- Map, in contrast to SortedDictionary, is very cheap to copy.
- Because Map is based on an AVL tree, which is more rigorously balanced
than an RB tree, it's a little faster asymptotically for lookup than SortedDictionary,
and a little slower on modification.
- Due to the storage structure (node + navigator), Map consumes less memory than
SortedDictionary, and is probably cheaper for the GC (simple garbage graphs).
- As the AVL tree stores left and right subtree sizes, in contrast to a "color" in
the RB tree, we are able to index the data in two ways: by integer index and by key
value.
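The last point can be illustrated with a small Java sketch (hypothetical names, no balancing): storing subtree sizes lets you find the index-th smallest key by comparing the index against the left subtree size.

```java
// Each node stores the size of its subtree; finding the index-th smallest key
// is then a descent guided by the left subtree size (order statistics).
final class SizedNode {
    final int key;
    final SizedNode left, right;
    final int size; // number of nodes in this subtree

    SizedNode(int key, SizedNode left, SizedNode right) {
        this.key = key;
        this.left = left;
        this.right = right;
        this.size = 1 + size(left) + size(right);
    }

    static int size(SizedNode n) {
        return n == null ? 0 : n.size;
    }

    // Returns the index-th smallest key (0-based).
    static int select(SizedNode n, int index) {
        int leftSize = size(n.left);
        if (index < leftSize) return select(n.left, index);
        if (index > leftSize) return select(n.right, index - leftSize - 1);
        return n.key;
    }
}

public class IndexedTree {
    public static void main(String[] args) {
        SizedNode root = new SizedNode(2,
            new SizedNode(1, null, null),
            new SizedNode(3, null, null));
        System.out.println(SizedNode.select(root, 0)); // prints 1
        System.out.println(SizedNode.select(root, 2)); // prints 3
    }
}
```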
Sources are:
Update:
It was impossible to resist the temptation to run some primitive performance
comparison. Map outperforms SortedDictionary both in population and in access.
This does not agree with pure algorithm theory, but there may be other
unaccounted-for factors: memory consumption, quality of implementation, and so on.
Program.cs is updated with the measurements.
Update 2:
More accurate tests show that for some key types Map is faster, for others
SortedDictionary is faster. Usually Map is slower during population (a mutable AVL
tree navigator may fix this). The odd thing is that Map<string, int> is faster
than SortedDictionary<string, int> both for allocation and for access. See the
excel report.
Update 3:
An interesting observation. The following table shows maximal and
average tree heights for different node counts in AVL and RB trees after a random population:

Size     | AVL Max | AVL Avg | RB Max | RB Avg
---------|---------|---------|--------|-------
10       | 4       | 2.90    | 5      | 3.00
50       | 7       | 4.94    | 8      | 4.94
100      | 8       | 5.84    | 9      | 5.86
500      | 11      | 8.14    | 14     | 8.39
1000     | 12      | 9.14    | 16     | 9.38
5000     | 15      | 11.51   | 18     | 11.47
10000    | 16      | 12.53   | 20     | 12.47
50000    | 19      | 14.89   | 23     | 14.72
100000   | 20      | 15.90   | 25     | 15.72
500000   | 25      | 18.26   | 28     | 18.27
1000000  | 25      | 19.28   | 30     | 19.27
Here, in accordance with theory, the height of the AVL tree is shorter than the height
of the RB tree. But what is most interesting is the depth of the "average
node": this value describes the number of steps required to find a random key. In
this regard the RB tree is very close to, and often better than, the AVL tree.
It was obvious as hell from day one of generics that obscure long names would
appear once you start parametrizing your types. It would have been the easiest
thing in the world to take care of this in advance. Alas, C# inherits C++'s bad
practices.
Read Associative containers in a functional language
and
Program.cs to see what we're talking about.
Briefly, there is a pair (string, int), which in C# should be declared as:
System.Collections.Generic.KeyValuePair<string, int>
Obviously we would like to write it in a shorter way. These are our attempts,
which fail:
1. Introduce a generic alias Pair<K, V>:
using System.Collections.Generic;
using Pair<K, V> = KeyValuePair<K, V>;
2. Introduce a type alias for a generic type with specific type arguments:
using System.Collections.Generic;
using Pair = KeyValuePair<string, int>;
And this is the only one that works:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
Do you think it is bearable? Well, consider the following:
- There is a generic type
ValueNode<T>, where T
should be Pair.
- There is a generic type
TreeNavigator<N>, where N should be ValueNode<Pair>.
The declarations look like this:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
using Node = NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>;
using Navigator = NesterovskyBros.Collections.AVL.TreeNavigator<
NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>>;
Do you still think it is acceptable?
P.S. Legacy thinking led C#'s and java's designers to use the word "new" for
object construction. It is not required at all. Consider new Pair("A", 1) vs Pair("A", 1).
C++ prefers the second form; C# and java always use the first one.