To end with immutable trees, at least for now, we've implemented IDictionary<K, V> .
It's named Map<K, V> . Functionally it looks very like SortedDictionary<K, V> .
there are some differences, however:
Map in contrast to SortedDictionary is very cheap on
copy.
- Bacause
Map is based on AVL tree, which is more rigorly balanced
than RB tree, so it's a little bit faster asymptotically for lookup than SortedDictionary ,
and a little bit slower on modification.
- Due to the storage structure: node + navigator,
Map consumes less memory than
SortedDictionary , and is probably cheaper for GC (simple garbage
graphs).
- As AVL tree stores left and right subtree sizes, in contrast to a "color" in
RB tree, we able to index data in two ways: with integer index, and with key
value.
Sources are:
Update:
It was impossible to withstand temptation to commit some primitive performance
comparision. Map outperforms SortedDictionary both in population and in access.
this does not aggree with pure algorithm's theory, but there might be other
unaccounted factors: memory consumption, quality of implementation, and so on.
Program.cs is updated with measurements.
Update 2:
More occurate tests show that for some key types Map 's faster, for others
SortedDictionary 's faster. Usually Map 's slower during population (mutable AVL
tree navigator may fix this). the odd thing is that Map<string, int> is faster
than SortedDictionary<string, int> both for allocaction and for access. See
excel report.
Update 3:
Interesing observation. The following table shows maximal and
average tree heights for different node sizes in AVL and RB trees after a random population:
|
AVL |
RB |
Size |
Max |
Avg |
Max |
Avg |
10 |
4 |
2.90 |
5 |
3.00 |
50 |
7 |
4.94 |
8 |
4.94 |
100 |
8 |
5.84 |
9 |
5.86 |
500 |
11 |
8.14 |
14 |
8.39 |
1000 |
12 |
9.14 |
16 |
9.38 |
5000 |
15 |
11.51 |
18 |
11.47 |
10000 |
16 |
12.53 |
20 |
12.47 |
50000 |
19 |
14.89 |
23 |
14.72 |
100000 |
20 |
15.90 |
25 |
15.72 |
500000 |
25 |
18.26 |
28 |
18.27 |
1000000 |
25 |
19.28 |
30 |
19.27 |
Here, according with theory, the height of AVL tree is shorter than the height
of RB tree. But what is most interesting is that the depth of an "average
node". This value describes a number of steps required to find a random key. RB
tree is very close and often is better than AVL in this regard.
It was obvious as hell from day one of generics that there will appear obscure
long names when you will start to parametrize your types. It was the easiest
thing in the world to take care of this in advanvce. Alas, C# inherits C++'s bad
practices.
Read Associative containers in a functional languages
and
Program.cs to see what we're talking about.
Briefly, there is a pair (string , int ), which in C# should be declared as:
System.Collections.Generic.KeyValuePair<string, int>
Obviously we would like to write it in a short way. These are our attempts, which
fail:
1. Introduce generic alias Pair<K, V>:
using System.Collections.Generic;
using Pair<K, V> = KeyValuePair<K, V>;
2. Introduce type alias for a generic type with specific types.
using System.Collections.Generic;
using Pair = KeyValuePair<string, int>;
And this is only one that works:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
Do you think is it bearable? Well, consider the following:
- There is a generic type
ValueNode<T> , where T
should be Pair .
- There is a generic type
TreeNavigator<N> , where N is should be ValueNode<Pair> .
The declaration looks like this:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
using Node = NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>;
using Navigator = NesterovskyBros.Collections.AVL.TreeNavigator<
NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>>;
Do you still think is it acceptable?
P.S. Legacy thinking led C#'s and java's designers to the use of word "new" for the
object construction. It is not required at all. Consider new Pair("A", 1) vs Pair("A", 1) .
C++ prefers second form. C# and java always use the first one.
Continuing with the post "Ongoing xslt/xquery spec update"
we would like to articulate what options regarding associative containers do we
have in a functional languages (e.g. xslt, xquery), assuming that variables are
immutable and implementation is efficient (in some sense).
There are three common implementation techniques:
- store data (keys, value pairs) in sorted array, and use binary search to
access values by a key;
- store data in a hash map;
- store data in a binary tree (usually RB or AVL trees).
Implementation choice considerably depends on operations, which are taken over
the container. Usually these are:
- construction;
- value lookup by key;
- key enumeration (ordered or not);
- container modification (add and remove data into the
container);
- access elements by index;
Note that modification in a functional programming means a creation of a new
container, so here is a
division:
- If container's use pattern does not include modification, then probably the
simplest solution is to build it as an ordered sequence of
pairs, and use binary search to access the data. Alternatively, one could
implement associative container as a hash map.
- If modification is essential then neither ordered sequence of pairs, hash map
nor classical tree implementation can be used, as they are either too slow
or too greedy for a memory, either during modification or during access.
On the other hand to deal with container's modifications one can build
an implementation, which uses "top-down" RB
or AVL trees. To see the
difference consider a classical tree structure and its functional variant:
|
Classical |
Functional |
Node structure: |
node
parent
left
right
other data |
node
left
right
other data |
Node reference: |
node itself |
node path from a root of a tree |
Modification: |
either mutable or requires a completely new tree |
O(LnN) nodes are created
|
Here we observe that:
- one can implement efficient map (lookup time no worse than O(LnN)) with no
modification support, using ordered array;
- one can implement efficient map with support of modification, using immutable binary tree;
- one can implement all these algorithms purely in xslt and xquery (provided that inline
functions are supported);
- any such imlementation will lose against the same implementation
written in C++, C#, java;
- the best implementation would probably start from sorted array and
will switch to binary tree after some size threshold.
Here we provide a C# implementation of a functional AVL tree, which also supports
element indexing:
Our intention was to show that the usual algorithms for associative
containers apply in functional
programming; thus a feature complete functional language must support
associative containers to make development more conscious, and to free a
developer from inventing basic things existing already for almost a half of
century.
Several years ago we have started a new project. We do not like neither hate
any particular language, thus the decision what language to use was pragmatical:
xslt 2.0 fitted perfectly.
At present it's a solid volume of xslt code. It exhibits all the virtues of any
other good project in other language: clean design, modularity, documentation,
sophisticationless (good code should not be clever).
Runtime profile of the project is that it deals with xml documents with sizes
from a few dozens of bytes to several megabytes, and with xml schemas from
simple ones like a tabular data, and to rich like xhtml and untyped. Pipeline
of stylesheets processes gigabytes of data stored in the database and in files.
All the bragging above is needed here to introduce the context for the following
couple of lines.
The diversity of load conditions and a big code base, exposed xslt engine of
choice to a good proof test. The victim is Saxon. In the course of project we
have found and reported many bugs. Some of them are nasty and important, and
others are bearable. To Michael Kay's credit (he's owner of Saxon) all bugs are being
fixed promtly (see
the last one).
Such project helps us to understand a weak sides of xslt (it seems sometimes they, in WG, lack such experience, which should lead them through).
Well, it has happened so that we're helping to Saxon project. Unintentionally,
however!
P.S.
About language preferences.
Nowdays we're polishing a
COBOL
generation. To this end we have familiarized ourselves with this language.
That's the beatiful language. Its straightforwardness helps to see the evolution
of computer languages and to understand what and why today's languages try to
timidly hide.
We have updated
languages-xom.zip. There are many fixes in cobolxom (well, cobolxom is new,
and probably there will be some more bugs). Also we have included Xml Object
Model for the SQL, which in fact has appeared along with jxom.
SQL xom supports basic sql syntax including common table expressions, and two
dialects for DB2 and Oracle.
Recently W3C has published new drafts for
xquery 1.1 and for
xpath 2.1. We have noticed that
committee has decided to introduce inline functions both for the xquery and for the
xpath.
That's a really good news! This way xquery, xpath and xslt are
being approached the Object Oriented Programming the way of javascript with its
inline functions.
Now we shall be able to implement tuples (a sequence of items wrapped into
single item), object with named properties, trees (e.g. RB Tree), associative
containers (tree maps and hash maps, sets).
Surely, all this will be in the spirit of functional programming.
The only thing we regret about is that the WG did not include built-in
implementations for trees and associative containers, as we don't believe that
one can create an efficient implementation of these abstractions neither in
xquery nor in xslt (asymptotically results will be good, but coefficients will
be painful).
See also:
Tuple and maps
Not sure how things work for others but for us it turns out that Saxon 9.2 introduces new bugs, works slower and eats much more memory than its ancestor v9.1.
See
Memory problem with V9.2.
We hope all this will be fixed soon.
Update: By the way, Saxon 9.2 (at the moment 2009-01-04) does not like (despises in fact) small documents and especially text nodes in those documents. It loves huge in memory documents, however.
Update 2009-01-05: case's closed, fix's commited into svn.
Today, I've tried to upgrade our projects to Saxon 9.2. We have a rather big set
of stylesheets grinding gigabytes of information. It's obvious that we
expected at least the same performance from the new version.
But to my puzzlement a pipeline of transformations failed almost immediately
with en error message:
XPTY0018: Cannot mix nodes and atomic values in the result of a path expression
We do agree with this statement in general, but what it had in common with our
stylesheets? And how everything was working in 9.1?
To find the root of the problem I've created a minimal problem reproduction:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="this"
exclude-result-prefixes="xs t">
<!-- Entry point. -->
<xsl:template match="/">
<xsl:variable name="p" as="element()">
<p l="1"/>
</xsl:variable>
<o l="{$p/t:d(.)}"/>
</xsl:template>
<xsl:function name="t:d" as="item()*">
<xsl:param name="p" as="element()"/>
<xsl:apply-templates mode="p" select="$p"/>
</xsl:function>
<xsl:template match="*" mode="p">
<xsl:sequence select="concat('0', @l)"/>
</xsl:template>
</xsl:stylesheet>
Really simple, isn't it? The problem is in a new optimization of
concat() function, introduced in version 9.2. It tries to eliminate
string concatenation, and in certain cases emits its arguments directly into the
output as text nodes, separating whole output with some stopper strings. The
only problem is that no such optimization is allowed in this particular case
(which is rather common, and surely legal, in our stylesheets); result of
<xsl:template match="p" mode="p"> should not be a node, but of type
xs:string .
Saxon 9.2 is here already for 3 month, at lest! Thus, how come that such
a bug was not
discovered earlier?
Update: the fix is commited into the svn on the next day. That's promptly!
We've added a new language to the set of Xml Object Model schemas and stylesheets.
The newcomer is COBOL! No jokes. It's not a whim, really. Believe it or
not but COBOL is still alive and we need to generate it (mainly different sorts of
proxies).
We've used VS COBOL II grammar Version 1.0.3 as a reference. Implemented grammar
is complete but without preprocessor statements. On the other hand it defines COPY and EXEC SQL constructs.
Definitely, it'll take a time for the xml schema and xslt implementation to
become mature.
Now language XOM is:
- jxom - for java;
- csharpxom - for C#;
- cobolxom - for COBOL.
Sources can be found at
languages-xom.
Given:
- an xml defining elements and groups;
- each element belongs to a group or groups;
- group may belong to another group.
Find:
- groups, a given element directly or inderectly belongs to;
- a function checking whether an element belongs to a group.
Example:
<groups>
<group name="g1">
<element ref="e1"/>
<element ref="e2"/>
<element
ref="e3"/>
<group ref="g2"/>
</group>
<group name="g2">
<element ref="e5"/>
</group>
<group name="g3">
<element ref="e1"/>
<element ref="e4"/>
</group>
</groups>
There are several solutions depending on aggresiveness of optimization. A
moderate one is done through the xsl:key. All this reminds recursive common
table expressions in SQL.
Anyone?
In spite of the fact that our last projects are being developed in Java, the .NET is definitly our favorite platform.
In a twitter I saw the phrase: "Java the language is a stagnant mess". It's said in favour of C#. It's true that C# significantly affects now even on Java (let's remember generics, jaxb, web services, etc.), but in my opinion, the C# won't be the leading language for worldwide enterprise applications in the nearest future.
One of causes is that the main platform for .NET still is Windows. The situation could be changed by Mono project, but I think there are yet not enough projects on platforms other than Windows.
My guess is confirmed by some of observations that I did as a software engineer of an IT company. Our company performs different software porting projects from legacy programming languages like COBOL, ADSO, Natural etc. into up to date languages like Java, C# etc. It worth to say that clients rarely select to migrate to .NET despite to our advices.
The main reason of such choice, according to most of our clients, is that they want to be platform independent and only Java gives them this choice.
It worth for Microsoft to think about cooperation with Mono in order to make .NET really platform indpendent, otherwise C# will always be a step behind Java despite apparent advantages of C# as a programming language.
A client asked us to produce Excel reports in ASP.NET
application. They've given an Excel templates, and also defined what they want to show.
What are our options?
- Work with Office COM API;
- Use Office Open XML SDK (which is a set of pure .NET
API);
- Try to apply xslt somehow;
- Macro, other?
For us, biased to xslt, it's hard to make a fair choice. To judge, we've
tried formalize client's request and to look into future support.
So, we have defined sql stored procedures to provide the data. This way data can be
represented either as ADO.NET DataSet, a set of classes, as xml, or in other reasonable format. We do not
predict any considerable problem with data representation if client will decide
to modify reports in future.
It's not so easy when we think about Excel generation.
Due to ignorance we've thought that Excel is much like xslt in some regard, and
that it's possible to provide a tabular data in some form and create Excel
template, which will consume the data to form a final output. To some extent
it's possible, indeed, but you should start creating macro or vb scripts to
achieve acceptable results.
When we've mentioned macroses to the client, they immediately stated that
such a solution won't work due to security reasons.
Comparing COM API and Open XML SDK we can see that both provide almost the same
level of service for us, except that the later is much more lighter and supports only Open XML format, and the earlier is a heavy
API exposing MS Office and supports earlier versions also.
Both solutions have a considerable drawback: it's not easy to create Excel
report in C#, and it will be a pain to support such solution if client will ask,
say in half a year, to modify something in Excel template or to create one more
report.
Thus we've approached to xslt. There we've found two more directions:
- generate data for Office Open XML;
- generate xml in format of MS Office 2003.
It's turned out that it's rather untrivial task to generate data for Open XML,
and it's not due to the format, which is not xml at all but a zipped folder
containing xmls. The problem is in the complex schemas and in many complex
relations between files constituting Open XML document. In contrast, MS
Office 2003 format allows us to create a single xml file for the spreadsheet.
Selecting between standard and up to date format, and older proprietary one, the
later looks more attractive for the development and support.
At present we're at position to use xslt and to generate files in MS Office
2003 format. Are there better options?
Did you ever hear that double numbers may cause roundings, and that
many financial institutions are very sensitive to those roundings?
Sure you did! We're also aware of this kind of problem, and we thought we've
taken care of it. But things are not that simple, as you're not always
know what an impact the problem can have.
To understand the context it's enough to say that we're converting (using xslt by the way) programs
written in a CASE tool called
Cool:GEN into java and into C#. Originally, Cool:GEN generated COBOL and C
programs as deliverables. Formally, clients compare COBOL results vs java or C#
results, and they want them to be as close as possible.
For one particular client it was crucial to have correct results during
manipulation with numbers with 20-25 digits in total, and with 10 digits after a decimal point.
Clients are definitely right, and we've introduced generation options to control
how to represent numbers in java and C# worlds; either as double or
BigDecimal (in java), and decimal (in C#).
That was our first implementation. Reasonable and clean. Was it enough? - Not at
all!
Client's reported that java's results (they use java and BigDecimal
for every number with decimal point) are too precise, comparing to Mainframe's
(MF) COBOL. This rather unusuall complain puzzles a litle, but client's
confirmed that they want no more precise results than those MF produces.
The reason of the difference was in that that both C# and especially java may
store much more decimal digits than is defined for the particualar result on MF.
So, whenever you define a field storing 5 digits after decimal point, you're
sure that exactly 5 digits will be stored. This contrasts very much with results
we had in java and C#, as both multiplication and division can produce many more
digits after the decimal point. The solution was to truncate(!) (not to round) the
numbers to the specific precision in property setters.
So, has it resolved the problem? - No, still not!
Client's reported that now results much more better (coincide with MF, in fact)
but still there are several instances when they observe differences in 9th and
10th digits after a decimal point, and again java's result are more accurate.
No astonishment this time from us but analisys of the reason of the difference.
It's turned out that previous solution is partial. We're doing a final truncation
but still there were intermediate results like in a/(b * c) , or in a * (b/c) .
For the intermediate results MF's COBOL has its, rather untrivial, formulas (and
options) per each operation defining the number of digits to keep after a
decimal point. After we've added similar options into the generator, several
truncations've manifested in the code to adjust intermediate results. This way
we've reached the same accurateness as MF has.
What have we learned (reiterated)?
- A simple problems may have far reaching impact.
- More precise is not always better. Client often prefers compatible rather than
more accurate results.
Recently we were visiting Ukraine, the capital city, and a
town we've come from.
Today's Ukraine makes a twofold impression.
On the one hand it's a childhood places and relatives, an enormous
pleasure of meeting university and school friends, a good surprise of meeting
university chancellor who was already hoary with age when we were studying.
On the other hand it's already a very different country from what the memory
draws. I must be wrong but my impression was that it's a country of traders and
endless political battles.
It's neither bad nor good but a point of history. Unfortunately we cannot think
ourselves now living in Ukraine.
On the question where is our home now, we have the only answer it's in Israel.
For some reason C# lacks a decimal truncation function
limiting result to a specified number of digits after a decimal point. Don't
know what's the reasoning behind, but it stimulates the thoughts. Internet
is plentiful with workarounds. A tipical answer is like this:
Math.Truncate(2.22977777 * 1000) / 1000; // Returns 2.229
So, we also want to provide our solution to this problem.
public static decimal Truncate(decimal value,
byte decimals)
{
decimal result = decimal.Round(value, decimals);
int c = decimal.Compare(value, result);
bool negative = decimal.Compare(value, 0) < 0;
if (negative ? c <= 0 : c >= 0)
{
return result;
}
return result - new decimal(1, 0, 0, negative, decimals);
}
Definitely, if the function were implemented by the framework it were much more efficient. We assume, however, that above's the best implementation that can be done externally.
|