Monday, 13 April 2009 - Nesterovsky bros

Monday, 13 April 2009

This happens in .NET Framework 3.5, 32 bit, VS 2008.

C#:

namespace NesterovskyBros.Test { using Microsoft.VisualStudio.TestTools.UnitTesting; [TestClass] public class CharAtUnitTest { private TestContext testContextInstance; public TestContext TestContext { get { return testContextInstance; } set { testContextInstance = value; } } [TestMethod] public void CharAtTest() { this.text = "1"; string token = Read(1, false); TestContext.WriteLine("token: {0}", token); } private string Read(int offset, bool flag) { string token = null; int c = 0; if (flag) { goto Whitespace; } Scan: c = CharAt(offset); switch(c) { case -1: { return "<Eof>"; } case '\'': { token = "Literal"; goto Literal; } } Whitespace: if (c == ' ') { return "Space"; } return "Unknown"; Literal: while(true) { int d = CharAt(offset); if (token != "Literal") { goto Scan; } if (d == c) { return token; } } } string text; private int CharAt(int offset) { string text = this.text; return (uint)offset >= (uint)text.Length ? -1 : text[offset]; } } }

In debug mode this test prints: "token: <Eof>". In release - "token: Unknown". The bug is so fragile that even slightest change in code removes it. Looking into disassembly we can see that the problem is near the switch:

Scan: c = CharAt(offset); /* Our old friend, CharAt(). Inlined! */ 00000017 mov edx,dword ptr [edi+8] 0000001a cmp dword ptr [edx+8],esi 0000001d jbe 00000032 0000001f cmp esi,dword ptr [edx+8] 00000022 jae 000000CE 00000028 movzx eax,word ptr [edx+esi*2+0Ch] 0000002d mov dword ptr [ebp-10h],eax 00000030 jmp 00000039 00000032 mov dword ptr [ebp-10h],0FFFFFFFFh /* Move -1 (four bytes) into stack. */ 00000039 movzx edx,word ptr [ebp-10h] /* Get two bytes into edx (0FFFFh) */ switch(c) 0000003d cmp edx,0FFFFFFFFh /* Never true. */ 00000040 je 0000004A 00000042 cmp dword ptr [ebp-10h],27h 00000046 je 0000005A 00000048 jmp 00000062 { case -1: { return "<Eof>"; 0000004a mov eax,dword ptr ds:[022EDE68h] 00000050 lea esp,[ebp-0Ch] 00000053 pop ebx 00000054 pop esi 00000055 pop edi 00000056 pop ebp 00000057 ret 4

This looks like a tremendous bug, like one of those shaking belief in computer's infallibility. :-)

It would be nice if you would verify the case on your computer.

Monday, 13 April 2009 13:46:54 UTC

Comments [4] -
Incremental Parser

Sunday, 05 April 2009

Update on Incremental Parser

Praises: I dare not to think how could we live without AnkhSVN.

At present we have:

a generic parser;
fully functional xquery parser;
detailed error report, and syntax suggestion;
high performance.

The idea of runtime grammar tree and a reader like parser results in a high performace, as we able to build a lookup tables to probe tokens. This allows us to start parsing immediately from the most specific grammar chain. For example, consider the xquery grammar:

[1] Module ::= VersionDecl? (LibraryModule | MainModule) [2] VersionDecl ::= "xquery" "version" StringLiteral ("encoding" StringLiteral)? Separator [3] MainModule ::= Prolog QueryBody [4] LibraryModule ::= ModuleDecl Prolog [5] ModuleDecl ::= "module" "namespace" NCName "=" URILiteral Separator [6] Prolog ::= ((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)* ((VarDecl | FunctionDecl | OptionDecl) Separator)* ... [87] VarRef ::= "$" VarName

Formally, to parse xquery "$v" one needs to go deep into a grammar hierarchy. That's what is usually done. On the contrast, a lookup table for the grammar "Module", containing 80 different token runs, allows us to identify grammar chain just with a couple of probes:

[0] "xquery" "version" [1] "module" "namespace" [2] "declare" "default" "element" "namespace" [3] "declare" "default" "function" "namespace" [4] "declare" "boundary-space" [5] "declare" "default" "collation" [6] "declare" "base-uri" [7] "declare" "construction" [8] "declare" "ordering" [9] "declare" "default" "order" "empty" [10] "declare" "copy-namespaces" [11] "declare" "namespace" [12] "declare" "schema" [13] "import" "module" [14] "declare" "variable" "$" [15] "declare" "function" [16] "declare" "option" [17] "for" "$" [18] "let" "$" [19] "some" "$" [20] "every" "$" [21] "typeswitch" "(" [22] "if" "(" [23] "-" [24] "+" [25] "validate" "{" [26] "validate" "lax" [27] "validate" "strict" [28] "/" [29] "//" [30] <integer> [31] <decimal> [32] <double> [33] <string> [34] "$" [35] "(" [36] "." [37] <functionname> "(" [38] "ordered" "{" [39] "unordered" "{" [40] "<" <qname> [41]  [42] <?pi literal?> [43] "document" "{" [44] "element" <qname> [45] "element" "{" [46] "attribute" <qname> [47] "attribute" "{" [48] "text" "{" [49] "comment" "{" [50] "processing-instruction" <ncname> [51] "processing-instruction" "{" [52] "parent" "::" [53] "ancestor" "::" [54] "preceding-sibling" "::" [55] "preceding" "::" [56] "ancestor-or-self" "::" [57] ".." [58] "child" "::" [59] "descendant" "::" [60] "attribute" "::" [61] "self" "::" [62] "descendant-or-self" "::" [63] "following-sibling" "::" [64] "following" "::" [65] "@" [66] "document-node" "(" [67] "element" "(" [68] "attribute" "(" [69] "schema-element" "(" [70] "schema-attribute" "(" [71] "processing-instruction" "(" [72] "comment" "(" [73] "text" "(" [74] "node" "(" [75] <qname> [76] "*" [77] <ncname:*> [78] <*:ncname> [79] "(#"

This way, algorithmically, we outperform most of conventional parsers.

On the other hand, a parsed tree we're building, has a compact representation. Each tree node is defined with two text bookmarks, grammar chain, and a grammar specific data. What's important is that the production of garbage memory is very low, as the rate of parser's fail assumptions is small.

What should be done:

Attach events to the xquery grammar to collect program constructions: variables, functions, namespaces in scope. This will provide auto completion info.
Release inactive parsed subtrees. E.g. we can free tree of function body, and preserve its text range (two bookmarks).

Well, I'd like to think someone could understand anything in all this mumbling. All sources are at "Incremental parser" home.

Sunday, 05 April 2009 15:50:49 UTC

Comments [0] -
Incremental Parser

More on .NET's JIT

There is a method Right() in the RB tree implementation:

public int Right(int node) { return items[node].right; }

JIT does not want to inline it, probably as the method may throw:

public int Right(int node) { return items[node].right; 00000000 mov eax,dword ptr [ecx+4] 00000003 cmp edx,dword ptr [eax+4] 00000006 jae 00000013 00000008 shl edx,4 0000000b lea eax,[eax+edx+8] 0000000f mov eax,dword ptr [eax+8] 00000012 ret 00000013 call 74C3A62C 00000018 int 3

Too sad.

Sunday, 05 April 2009 13:16:06 UTC

Comments [0] -
Incremental Parser | Tips and tricks

Thursday, 02 April 2009

.NET JIT Quality

Early in 2001 we've read that .NET's JIT is smart enough to optimize repeated boundary checks.

In the year 2009 we still can verify that this is not the case (no matter how hard you try).

C#:

private int CharAt(int offset) { string text = this.text; return (uint)offset >= (uint)text.Length ? -1 : text[offset]; }

Disassembly:

private int CharAt(int offset) { string text = this.text; 00000000 push ebp 00000001 mov ebp,esp 00000003 mov ecx,dword ptr [ecx+30h] return (uint)offset >= (uint)text.Length ? -1 : text[offset]; 00000006 cmp dword ptr [ecx+8],edx 00000009 jbe 00000017 0000000b cmp edx,dword ptr [ecx+8] 0000000e jae 0000001C 00000010 movzx eax,word ptr [ecx+edx*2+0Ch] 00000015 pop ebp 00000016 ret 00000017 or eax,0FFFFFFFFh 0000001a pop ebp 0000001b ret 0000001c call 74C24C6C 00000021 int 3

P.S. Neither this method is inlined (IL length is 25 bytes).

Thursday, 02 April 2009 07:56:00 UTC

Comments [0] -
Incremental Parser | Tips and tricks

Tuesday, 31 March 2009

IE8's Shocked

Yesterday, I've installed IE8.

Looks better here and there.

Today, I'm shocked!

I've reopened my web mail and it remembered the session. It keeps session cookies after closing IE8 instance!

I did not believe to myself and logged into an another web application, and then opened another IE8 instance. What do you think? - It shares the session between instances!

That is a serious security problem.

It prevents me from opening two sessions of a web application on my computer.

P.S. we have found that this problem was already discussed. See IE8 handles sessions/cookies different than IE7 - big trouble for - ...

Someone needs a brain surgery...

Quick solution: run IE8 with -nomerge command line option.

Tuesday, 31 March 2009 11:45:52 UTC

Comments [2] -
Tips and tricks

Wednesday, 18 March 2009

What you cannot do with C#'s generics

We'd like to return to the binary tree algorithms and spell what you cannot do with generics in C#. Well, you can do many things, however with generalization penalty.

Consider a binary tree node: Node(Parent, Left, Right). RB, AVL, and others algorithms attach some private information to this node to perform balancing.

You can express this idea methematically (and in C++), you cannot implement it efficiently in C#.

More focused example. Consider RB tree: Node(Parent, Left, Right, Color). There are a number of ways you may implement the internal structure of the tree. Algorithms themselves stay the same.

Straightforward implementation:

class Node { Node Parent; Node Left; Node Right; bool Color; }

This implementation allocates nodes in the heap and each node refers to other nodes.

Node navigator implementation:

class Node { Node Left; Node Right; bool Color; } struct NodeNavigator { Node[] nodes; int index; }

Node does not refer to the parent. This reduces the memory consumption and simplifies object graph, which is good for GC. Tree is walked using a node navigator, which stores ancestors of the node.

Node as a structure:

struct Node { int Parent; int Left; int Right; bool Color; // This might be integrated as highest bit of parent. }

Tree is stored as an array of nodes. This is compact and GC efficient implementation.

Node as a structure, and with node navigator:

struct Node { int Left; int Right; bool Color; // This might be integrated as highest bit of left. } struct NodeNavigator { Tree tree; int[] nodes; int index; }

Tree is stored as an array of nodes, and a navigator is used to walk it. This is the most compact implementation.

Each implementation has its virtues. The common between implementations is that they share the same balancing and navigation algorithms. Storage differences prevent a single C# implementation. To the contrast, C++ allows to define a concept "tree" and to define specializations of this concept, allowing a unified algorithms; all this is done without performance penalty.

P.S. java in this regard, is almost alternativeless...

Wednesday, 18 March 2009 06:53:05 UTC

Comments [0] -
Incremental Parser | Tips and tricks

Sunday, 08 March 2009

Incremental Parser

Recently, we have started looking into a task of creating an interactive parser. A generic one.

Yes, we know there are plenty of them all around, however the goals we have defined made us to construct the new implementation.

The goals:

Parser must be incremental.
You should direct what to parse, and when to stop.
This virtually demands rather "pull" than conventional "push" implementation.
Parser must be able to synchronize a tree with text.
Whenever the underlying text is changed, a limited part of a tree should to be updated.
Parser should be able to recover from errors, and continue parsing.
Parser should be manageable.
This is a goal of every program, really.
Parser must be fast.
A low memory footprint is desired.

What's implemented (VS2008, C#) and put at SourceForge, is called an Incremental Parser.

These are parser's insights:

Bookmarks are objects to track text points. We use a binary tree (see Bare binary tree algorithms) to adjust positions of bookmarks when text is changed.
Ranges define parsed tree elements. Each range is defined by two bookmarks, and a grammar annotation.
There are grammar primitives, which can be composed into a grammar graph.
A grammar graph along with ranges form a state machine.
Grammar chains are cached, turning parsing into a series of probes of literal tokens and transitions between grammar chains. This caching is done on demand, which results in warming-up effect.
Parser itself includes a random access tokenizer, and a queue of ranges pending to be parsed.
Parsing is conducted as a cycle of pulling and parsing of pending ranges.
Whenever text is changed a closest range is queued for the reparsing.
A balance between amount of parsing and memory consumption can be achieved through a detalization of grammar annotation for a range. An active text fragment can be fully annotated, while for other text parts a coarse range can be stored.

We have defined xpath like grammar to test our ideas. See printed parsed trees to get understanding of what information can be seen from ranges.

Sunday, 08 March 2009 21:00:38 UTC

Comments [0] -
Announce | Incremental Parser | xslt

Thursday, 12 February 2009

Bare binary tree algorithms

Do you agree that binary trees and algorithms that keep trees reasonably balanced are important?

Our answer is yes!

It's interesting enough, however, that you won't easily find these algorithms publicly available.

Though red-black, AVL and other algorithms described in the wikipedia are defined in terms of tree manipulation, all implementations we have seen, deal with trees annotated with keys and values. These implementations really use tree balancing algorithms behind the schene, and expose a commonplace set or map containers to a client. Even C++ Standard Library suffers from this disease.

We think that binary trees are valuable independent concepts, and they worth to be implemented separately, at least because there are other algorithms, except sets and maps, using trees.

And well, we did it in C#! See RedBlackTree.cs.

Consider an example - a simple scheduler, ScheduleBookmark.cs, with operations:

schedule an action;
remove an action from the schedule;
enumerate actions;
find a date, an action is scheduled for;
find an action (or at least closest one) for a specified date;
postpone actions due to delays;

A balanced binary tree allows efficient implementation of such a scheduler. Tree node stores an action, and a time span between parent node and this node. This way:

Operation	Steps
schedule an action	find place + link node + rebalance tree
remove an action from the schedule	unlink node + rebalance tree
enumerate actions	navigate tree
find a date, an action is scheduled for	find node in tree
find an action for a specified date	cumulate time spans up to the tree root
postpone actions due to delays	fixup time spans from a node up to the tree root

Compare operation complexities between tree, array, list and map:

Operation	Tree	Array	List	Map
schedule an action	O(ln(N))	O(N)	O(N)	O(ln(N))
remove an action from the schedule	O(ln(N))	O(N)	O(1)	O(ln(N))
enumerate actions	O(ln(N))	O(1)	O(1)	O(ln(N))
find a date, an action is scheduled for	O(ln(N))	O(1)	O(1)	O(1)
find an action for a specified date	O(ln(N))	O(ln(N))	O(N)	O(ln(N))
postpone actions due to delays	O(ln(N))	O(N)	O(N)	O(N*ln(N))

Complexity of each operation for the tree is O(ln(N)). No arrays, lists, or maps achieve similar worst case guaranty.

Finally, the test program is Program.cs, and a whole project (VS2008) is Tree.zip

Thursday, 12 February 2009 13:17:36 UTC

Comments [0] -
Incremental Parser | Tips and tricks

Wednesday, 11 February 2009

Generic way to screw things up

Could you think of a C# method accepting an ancestor, and forbidding a descendant of a class at compile time?

The answer to this probably is: why do you need such a reptile.

Well, I don't. I didn't meant to create such a method, but generics help a lot!

public class BinaryTreeNode<Node> where Node: BinaryTreeNode<Node> { public Node parent; public Node left; public Node right; } public class MyNode: BinaryTreeNode<MyNode> { public int key; } public class MyRoot: MyNode { } public class Test { public void test() { MyRoot root = new MyRoot(); // print((MyNode)root); // This works. print(root); // This does not work. } private static void print<T>(T node) where T: BinaryTreeNode<T> { Console.WriteLine("print me"); } }

By the way, BinaryTreeNode is an "abstract" class, as you cannot instantiate it but inherit only.

Wednesday, 11 February 2009 13:59:17 UTC

Comments [0] -
Incremental Parser | Tips and tricks

Thursday, 15 January 2009

A good xslt IDE

A simple demand nowdays - a good IDE.

Almost a ten years have passed since xslt has appeared but still, we're not pleased with IDEs claiming xslt support. Our expectaions are not too high. There are things however, which must be present in such an IDE.

A notion of project, and possibly a group of projects. You may think of it as a main xslt including other xslts participationg in the project.
A code completion. A feature providing typing hints for language constructs, includes, prefixes, namespaces, functions, templates, modes, variables, parameters, schema elements, and other (all this should work in a context of the project).
A code refactoring. A means to move parts of code between (or inside) files and projects, rename things (functions, templates, parameters, variables, prefixes, namespaces, and other).
Code validation and run.
Optional debug feature.

We would be grateful if someone had pointed to any such IDE.

Thursday, 15 January 2009 14:41:35 UTC

Comments [13] -
Incremental Parser | xslt

Wednesday, 14 January 2009

is-upper-case() function in xslt 2.0

Once upon a time, we created a function mimicking decapitalize() method defined in java in java.beans.Introspector. Nothing special, indeed. See the source:

/** * Utility method to take a string and convert it to normal Java variable * name capitalization. This normally means converting the first * character from upper case to lower case, but in the (unusual) special * case when there is more than one character and both the first and * second characters are upper case, we leave it alone. * <p> * Thus "FooBah" becomes "fooBah" and "X" becomes "x", but "URL" stays * as "URL". * * @param name The string to be decapitalized. * @return The decapitalized version of the string. */ public static String decapitalize(String name) { if (name == null || name.length() == 0) { return name; } if (name.length() > 1 && Character.isUpperCase(name.charAt(1)) && Character.isUpperCase(name.charAt(0))){ return name; } char chars[] = name.toCharArray(); chars[0] = Character.toLowerCase(chars[0]); return new String(chars); }

We typed implementation immediately:

<xsl:function name="t:decapitalize" as="xs:string"> <xsl:param name="value" as="xs:string?"/> <xsl:variable name="c" as="xs:string" select="substring($value, 2, 1)"/> <xsl:sequence select=" if ($c = upper-case($c)) then $value else concat ( lower-case(substring($value, 1, 1)), substring($value, 2) )"/> </xsl:function>

It worked, alright, until recently, when it has fallen to work, as the output was different from java's counterpart.

The input was W9Identifier. Function naturally returned the same value, while java returned w9Identifier. We has fallen with the assumption that $c = upper-case($c) returns true when character is an upper case letter. That's not correct for numbers. Correct way is:

<xsl:function name="t:decapitalize" as="xs:string"> <xsl:param name="value" as="xs:string?"/> <xsl:variable name="c" as="xs:string" select="substring($value, 2, 1)"/> <xsl:sequence select=" if ($c != lower-case($c)) then $value else concat ( lower-case(substring($value, 1, 1)), substring($value, 2) )"/> </xsl:function>

Wednesday, 14 January 2009 15:46:23 UTC

Comments [0] -
Tips and tricks | xslt

Tuesday, 06 January 2009

Hundred percent agree

100% Agree.

> Basically we work for clients and if they ask that we need
> this output from this input then we don't thing about it and
> get the result.

I guess the definition of professionalism is that if you're a professional,
you advise the client when he gets the requirements wrong, and if you're
not, you build whatever rubbish he asks for.

Michael Kay
http://www.saxonica.com/

Tuesday, 06 January 2009 11:22:36 UTC

Comments [0] -

Saturday, 03 January 2009

A beautiful language and a labour force

The last year we were working on a project, which in essence dealt with transformation of graphs. Our experience with xslt 1.0, and other available information was promising - xslt 2.0 is a perfect match.

We were right, xslt 2.0 fitted very well to the problem.

It's easy to learn xslt 2.0/xquery: be acquainted with xml schema; read through a syntax, which is rather concise; look at examples, and start coding. API you will learn incrementally.

The same as other languages, xslt 2.0 is only a media to express algorithms. As such it fills its role rather good, as good as SQL:2003 and its variations do, and sometimes even better than other programming languages like C++ do.

Compare expressions "get data satisfying to a specific criteria" and "for each data part check a specific condition, and if it true, add it to the result". These often represent the same idea from two perspectives: human (or math) thinkning; and thinking in terms of execution procedure.

Both kinds of expressions have their use, however it has happened so that we're the human beings and perceive more easily natural language notions like: subjects, objects, predicates, deduction, induction and so on. I think the reason is that a human's (not positronic) brain grasps ideas, conceptions, images as something static, while execution procedure demands a notion of time (or at least notions of a sequence and an order) for the comprehension. ("Are you serious?", "Joke!" :-) )

There is the other side to this story.

We have made the project design in relatively short terms. A good scalable design. We needed people who know xslt 2.0 to implement it. It has turned out, this was a strong objection against xslt!

Our fellow, xslt guru, Oleg Tkachenko has left our company to make his career at Microsoft, and to our disbelief it was impossible to find a person who was interested in a project involvong 85% of xslt and 15% of other technologies including java. Even in java world people prefer routine projects, like standard swing or web application, to a project demanding creativeness.

Possibly, it was our mistake, to allow to our company to look for developers the standard way: some secretary was looking through her sources, and inevitably was finding so-so java + poor xml + almost zero xslt knowledge graduates. We had to make appeals on xslt forums especially since the project could be easily developed with a distributed group.

Finally, we have designed and implemented the project by ourselves but to the present day our managers are calling and suggesting java developers for our project. What a bad joke!

Saturday, 03 January 2009 09:51:05 UTC

Comments [0] -
xslt

Wednesday, 17 December 2008

exlst sketches

Just for fun I've created exslt2.xslt and exslt2-test.xslt to model concepts discussed at EXSLT 2.0 forum. I did nothing special but used tuple as reference, and also I've defined f:call() to make function call indirectly.

<?xml version="1.0" encoding="utf-8"?>  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:f="http://exslt.org/v2" xmlns:t="this" xmlns:p="private" exclude-result-prefixes="xs t f"> <xsl:include href="exslt2.xslt"/> <xsl:template match="/" name="main"> <root> <xsl:variable name="refs" as="item()*" select=" for $i in 1 to 20 return f:ref(1 to $i)"/> <total-items> <xsl:sequence select=" sum ( for $ref in $refs return count(f:deref($ref)) )"/> </total-items> <sums-per-ref> <xsl:for-each select="$refs"> <xsl:variable name="index" as="xs:integer" select="position()"/> <sum index="{$index}" value="{sum(f:deref(.))}"/> </xsl:for-each> </sums-per-ref> <add> <xsl:text>1 + 2 = </xsl:text> <xsl:sequence select="f:call(xs:QName('t:add'), (1, 2))"/> </add> </root> </xsl:template> <xsl:function name="t:add" as="xs:integer"> <xsl:param name="arguments" as="xs:integer+"/> <xsl:variable name="first" as="xs:integer" select="$arguments[1]"/> <xsl:variable name="second" as="xs:integer" select="$arguments[2]"/> <xsl:sequence select="$first + $second"/> </xsl:function> </xsl:stylesheet>

Code can be found at saxon.extensions.9.1.zip.

Wednesday, 17 December 2008 13:53:03 UTC

Comments [0] -
xslt

Wednesday, 10 December 2008

People start looking into jxom

We have created Java Xml Object Model purely for purposes of our project. In fact jxom at present has siblings: xml models for sql dialects. There are also different APIs like name normalizations, refactorings, compile time evaluation.

It turns out that jxom is also good enough for other developers.

The drawback of jxom, however, is rather complex xml schema. It takes time to understand it. To simplify things we have created (and planning to create more) a couple of examples allowing to feel how jxom xml looks like.

The latest version can be loaded from jxom.zip

We would be pleased to see more comments on the subject.

Wednesday, 10 December 2008 09:35:26 UTC

Comments [0] -
Announce | xslt

Navigation