In the
previous post we have announced an
API to parse a COBOL source into the cobolxom.
We exploited the
incremental parser to build a grammar xml tree and then were planning to
create an xslt transformation to generate
cobolxom.
Now, we would like to declare that such xslt is ready.
At present all standard COBOL constructs are supported, but more tests
are required. Preprocessor support is still in the todo list.
You may peek into an examples of
COBOL:
Cobol grammar:
And
cobolxom:
While we were building a grammar to cobolxom stylesheet we asked ourselves
whether the COBOL parsing could be done entirely in xslt. The answer is yes, so
who knows it might be that we shall turn this task into pure xslt one. :-)
Once ago we have created an
incremental parser, and now when we have decided to load COBOL sources
directly into
cobolxom (XML Object Model for a COBOL) the parser did the job perfectly.
The good point about incremental parser is that it easily handles COBOL's
grammar.
The whole process looks like this:
- incremental parser having a COBOL grammar builds a grammar tree;
- we stream this tree into xml;
- xslt to transform xml from previous step into
cobolxom (TODO).
This is an example of a COBOL:
IDENTIFICATION DIVISION.
PROGRAM-ID. FACTORIAL RECURSIVE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 NUMB PIC 9(4) VALUE IS 5.
01 FACT PIC 9(8) VALUE IS 0.
LOCAL-STORAGE SECTION.
01 NUM PIC 9(4).
PROCEDURE DIVISION.
MOVE 'X' TO XXX
MOVE NUMB TO NUM
IF NUMB = 0 THEN
MOVE 1 TO FACT
ELSE
SUBTRACT 1 FROM NUMB
CALL 'FACTORIAL'
MULTIPLY NUM BY FACT
END-IF
DISPLAY NUM '! = ' FACT
GOBACK.
END PROGRAM FACTORIAL.
And a grammar tree:
<Program>
<Name data="FACTORIAL"/>
<Recursive/>
<DataDivision>
<WorkingStorageSection>
<Data>
<Level data="01"/>
<Name data="NUMB"/>
<Picture data="9(4)"/>
<Value>
<Numeric data="5"/>
</Value>
</Data>
<Data>
<Level data="01"/>
<Name data="FACT"/>
<Picture data="9(8)"/>
<Value>
<Numeric data="0"/>
</Value>
</Data>
</WorkingStorageSection>
<LocalStorageSection>
<Data>
<Level data="01"/>
<Name data="NUM"/>
<Picture data="9(4)"/>
</Data>
</LocalStorageSection>
</DataDivision>
<ProcedureDivision>
<Sentence>
<MoveStatement>
<From>
<String data="'X'"/>
</From>
<To>
<Identifier>
<DataName data="XXX"/>
</Identifier>
</To>
</MoveStatement>
<MoveStatement>
<From>
<Identifier>
<DataName data="NUMB"/>
</Identifier>
</From>
<To>
<Identifier>
<DataName data="NUM"/>
</Identifier>
</To>
</MoveStatement>
<IfStatement>
<Condition>
<Relation>
<Identifier>
<DataName data="NUMB"/>
</Identifier>
<Equal/>
<Numeric data="0"/>
</Relation>
</Condition>
<Then>
<MoveStatement>
<From>
<Numeric data="1"/>
</From>
<To>
<Identifier>
<DataName data="FACT"/>
</Identifier>
</To>
</MoveStatement>
</Then>
<Else>
<SubtractStatement>
<Value>
<Numeric data="1"/>
</Value>
<From>
<Identifier>
<DataName data="NUMB"/>
</Identifier>
</From>
</SubtractStatement>
<CallStatement>
<Name>
<String data="'FACTORIAL'"/>
</Name>
</CallStatement>
<MultiplyStatement>
<Value>
<Identifier>
<DataName data="NUM"/>
</Identifier>
</Value>
<By>
<Identifier>
<DataName data="FACT"/>
</Identifier>
</By>
</MultiplyStatement>
</Else>
</IfStatement>
<DisplayStatement>
<Values>
<Identifier>
<DataName data="NUM"/>
</Identifier>
<String data="'! = '"/>
<Identifier>
<DataName data="FACT"/>
</Identifier>
</Values>
</DisplayStatement>
<GobackStatement/>
</Sentence>
</ProcedureDivision>
<EndName data="FACTORIAL"/>
</Program>
The last step is to transform tree into cobolxom is in the TODO list.
We have commited COBOL grammar in the same place at
SourceForge as it was with XQuery grammar. Solution is now under the VS
2010.
This happens in .NET Framework 3.5, 32 bit, VS 2008.
C#:
namespace NesterovskyBros.Test
{
using Microsoft.VisualStudio.TestTools.UnitTesting;
[TestClass]
public class CharAtUnitTest
{
private TestContext testContextInstance;
public TestContext TestContext
{
get { return testContextInstance; }
set { testContextInstance = value; }
}
[TestMethod]
public void CharAtTest()
{
this.text = "1";
string token = Read(1, false);
TestContext.WriteLine("token: {0}", token);
}
private string Read(int offset, bool flag)
{
string token = null;
int c = 0;
if (flag)
{
goto Whitespace;
}
Scan:
c = CharAt(offset);
switch(c)
{
case -1:
{
return "<Eof>";
}
case '\'':
{
token = "Literal";
goto Literal;
}
}
Whitespace:
if (c == ' ')
{
return "Space";
}
return "Unknown";
Literal:
while(true)
{
int d = CharAt(offset);
if (token != "Literal")
{
goto Scan;
}
if (d == c)
{
return token;
}
}
}
string text;
private int CharAt(int offset)
{
string text = this.text;
return (uint)offset >= (uint)text.Length ? -1 : text[offset];
}
}
}
In debug mode this test prints: "token: <Eof>". In release - "token: Unknown".
The bug is so fragile that even slightest change in code removes it. Looking
into disassembly we can see that the problem is near the switch:
Scan:
c = CharAt(offset); /* Our old friend, CharAt(). Inlined! */
00000017 mov edx,dword ptr [edi+8]
0000001a cmp dword ptr [edx+8],esi
0000001d jbe 00000032
0000001f cmp esi,dword ptr [edx+8]
00000022 jae 000000CE
00000028 movzx eax,word ptr [edx+esi*2+0Ch]
0000002d mov dword ptr [ebp-10h],eax
00000030 jmp 00000039
00000032 mov dword ptr [ebp-10h],0FFFFFFFFh /* Move -1 (four bytes) into
stack. */
00000039 movzx edx,word ptr [ebp-10h] /* Get two bytes into edx (0FFFFh) */
switch(c)
0000003d cmp edx,0FFFFFFFFh /* Never true. */
00000040 je 0000004A
00000042 cmp dword ptr [ebp-10h],27h
00000046 je 0000005A
00000048 jmp 00000062
{
case -1:
{
return "<Eof>";
0000004a mov eax,dword ptr ds:[022EDE68h]
00000050 lea esp,[ebp-0Ch]
00000053 pop ebx
00000054 pop esi
00000055 pop edi
00000056 pop ebp
00000057 ret 4
This looks like a tremendous bug, like one of those shaking belief in
computer's infallibility.
It would be nice if you would verify the case on your computer.
Praises: I dare not to think how could we live without
AnkhSVN.
At present we have:
- a generic parser;
- fully functional xquery parser;
- detailed error report, and syntax suggestion;
- high performance.
The idea of runtime grammar tree and a reader like parser results in a high performace, as we able to
build a lookup tables to probe tokens. This allows
us to start
parsing immediately from the most specific grammar chain. For example, consider the xquery
grammar:
[1] Module ::= VersionDecl? (LibraryModule | MainModule)
[2] VersionDecl ::=
"xquery" "version" StringLiteral ("encoding" StringLiteral)? Separator
[3] MainModule ::= Prolog QueryBody
[4] LibraryModule ::= ModuleDecl Prolog
[5] ModuleDecl ::= "module" "namespace" NCName "=" URILiteral Separator
[6] Prolog ::=
((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)*
((VarDecl | FunctionDecl | OptionDecl) Separator)*
...
[87] VarRef ::= "$" VarName
Formally, to parse xquery "$v" one needs to go deep into a grammar hierarchy.
That's what is usually done. On the contrast, a lookup table for the grammar
"Module", containing 80 different token runs, allows us to identify grammar chain just with a couple of probes:
[0] "xquery" "version"
[1] "module" "namespace"
[2] "declare" "default" "element" "namespace"
[3] "declare" "default" "function" "namespace"
[4] "declare" "boundary-space"
[5] "declare" "default" "collation"
[6] "declare" "base-uri"
[7] "declare" "construction"
[8] "declare" "ordering"
[9] "declare" "default" "order" "empty"
[10] "declare" "copy-namespaces"
[11] "declare" "namespace"
[12] "declare" "schema"
[13] "import" "module"
[14] "declare" "variable" "$"
[15] "declare" "function"
[16] "declare" "option"
[17] "for" "$"
[18] "let" "$"
[19] "some" "$"
[20] "every" "$"
[21] "typeswitch" "("
[22] "if" "("
[23] "-"
[24] "+"
[25] "validate" "{"
[26] "validate" "lax"
[27] "validate" "strict"
[28] "/"
[29] "//"
[30] <integer>
[31] <decimal>
[32] <double>
[33] <string>
[34] "$"
[35] "("
[36] "."
[37] <functionname> "("
[38] "ordered" "{"
[39] "unordered" "{"
|
[40] "<" <qname>
[41] <!--literal-->
[42] <?pi literal?>
[43] "document" "{"
[44] "element" <qname>
[45] "element" "{"
[46] "attribute" <qname>
[47] "attribute" "{"
[48] "text" "{"
[49] "comment" "{"
[50] "processing-instruction" <ncname>
[51] "processing-instruction" "{"
[52] "parent" "::"
[53] "ancestor" "::"
[54] "preceding-sibling" "::"
[55] "preceding" "::"
[56] "ancestor-or-self" "::"
[57] ".."
[58] "child" "::"
[59] "descendant" "::"
[60] "attribute" "::"
[61] "self" "::"
[62] "descendant-or-self" "::"
[63] "following-sibling" "::"
[64] "following" "::"
[65] "@"
[66] "document-node" "("
[67] "element" "("
[68] "attribute" "("
[69] "schema-element" "("
[70] "schema-attribute" "("
[71] "processing-instruction" "("
[72] "comment" "("
[73] "text" "("
[74] "node" "("
[75] <qname>
[76] "*"
[77] <ncname:*>
[78] <*:ncname>
[79] "(#" |
This way, algorithmically, we outperform most of conventional parsers.
On the other hand, a parsed tree we're building, has a
compact representation. Each tree node is defined with two text bookmarks,
grammar chain, and a grammar specific data. What's important is that the
production of garbage memory is very low, as the rate of parser's fail
assumptions is small.
What should be done:
-
Attach events to the xquery grammar to collect program constructions: variables,
functions, namespaces in scope. This will provide auto completion info.
-
Release inactive parsed subtrees. E.g. we can free tree of function body, and
preserve its text range (two bookmarks).
Well, I'd like to think someone could understand anything in all this
mumbling. All sources are at "Incremental parser" home.
There is a method Right() in the RB tree implementation:
public int Right(int node)
{
return items[node].right;
}
JIT does not want to inline it, probably as the method may throw:
public int Right(int node)
{
return items[node].right;
00000000 mov eax,dword ptr [ecx+4]
00000003 cmp edx,dword ptr [eax+4]
00000006 jae 00000013
00000008 shl edx,4
0000000b lea eax,[eax+edx+8]
0000000f mov eax,dword ptr [eax+8]
00000012 ret
00000013 call 74C3A62C
00000018 int 3
Too sad.
Early in 2001 we've read that .NET's JIT is smart enough to optimize repeated
boundary checks.
In the year 2009 we still can verify that this is not the case (no matter how
hard you try).
C#:
private int CharAt(int offset)
{
string text = this.text;
return (uint)offset >= (uint)text.Length ? -1 : text[offset];
}
Disassembly:
private int CharAt(int offset)
{
string text = this.text;
00000000 push ebp
00000001 mov ebp,esp
00000003 mov ecx,dword ptr [ecx+30h]
return (uint)offset >= (uint)text.Length ? -1 : text[offset];
00000006 cmp dword ptr [ecx+8],edx
00000009 jbe 00000017
0000000b cmp edx,dword ptr [ecx+8]
0000000e jae 0000001C
00000010 movzx eax,word ptr [ecx+edx*2+0Ch]
00000015 pop ebp
00000016 ret
00000017 or eax,0FFFFFFFFh
0000001a pop ebp
0000001b ret
0000001c call 74C24C6C
00000021 int 3
P.S. Neither this method is inlined (IL length is 25 bytes).
We'd like to return to the binary tree algorithms and spell what you cannot
do with generics in C#. Well, you can do many things, however with generalization
penalty.
Consider a binary tree node: Node(Parent, Left, Right) . RB, AVL, and
others algorithms attach some private information to this node to perform
balancing.
You can express this idea methematically (and in C++), you cannot implement it efficiently in C#.
More focused example. Consider RB tree: Node(Parent, Left, Right, Color) .
There are a number of ways you may implement the internal structure of the tree.
Algorithms themselves stay the same.
Straightforward implementation:
class Node
{
Node Parent;
Node Left;
Node Right;
bool Color;
}
This implementation allocates nodes in the heap and each node refers to other
nodes.
Node navigator implementation:
class Node
{
Node Left;
Node Right;
bool Color;
}
struct NodeNavigator
{
Node[] nodes;
int index;
}
Node does not refer to the parent. This reduces the memory consumption and
simplifies object graph, which is good for GC. Tree is walked using a node
navigator, which stores ancestors of the node.
Node as a structure:
struct Node
{
int Parent;
int Left;
int Right;
bool Color; // This might be integrated as highest bit of parent.
}
Tree is stored as an array of nodes. This is compact and GC efficient
implementation.
Node as a structure, and with node navigator:
struct Node
{
int Left;
int Right;
bool Color; // This might be integrated as highest bit of left.
}
struct NodeNavigator
{
Tree tree;
int[] nodes;
int index;
}
Tree is stored as an array of nodes, and a navigator is used to walk it. This is the most compact implementation.
Each implementation has its virtues. The common between implementations is that
they share the same balancing and navigation algorithms. Storage
differences prevent a single C# implementation. To the contrast, C++ allows to
define a concept "tree" and to define specializations of this concept, allowing
a unified algorithms; all this is done without performance penalty.
P.S. java in this regard, is almost alternativeless...
Recently, we have started looking into a task of creating an interactive parser. A generic one.
Yes, we know there are plenty of them all around, however the goals we have
defined made us to construct the new implementation.
The goals:
- Parser must be incremental.
You should direct what to parse, and when to stop. This virtually demands rather "pull" than conventional "push"
implementation.
- Parser must be able to synchronize a tree with text.
Whenever the underlying text is changed, a limited part of a tree should
to be updated.
- Parser should be able to recover from errors, and continue parsing.
- Parser should be manageable.
This is a goal of every program, really.
- Parser must be fast.
- A low memory footprint is desired.
What's implemented (VS2008, C#) and put at SourceForge, is called an
Incremental Parser.
These are parser's insights:
- Bookmarks are objects to track text points. We use a binary tree (see
Bare binary tree algorithms)
to adjust positions of bookmarks when text is changed.
- Ranges define parsed tree elements. Each range is defined by two
bookmarks, and a grammar annotation.
- There are grammar primitives, which can be composed into a grammar graph.
- A grammar graph along with ranges form a state machine.
- Grammar chains are cached, turning
parsing into a series of probes of literal tokens and transitions between
grammar chains. This caching is done on demand, which results in warming-up effect.
- Parser itself includes a random access tokenizer, and a queue of ranges
pending to be parsed.
- Parsing is conducted as a cycle of pulling and parsing of pending ranges.
- Whenever text is changed a closest range is queued for the reparsing.
- A balance between amount of parsing and memory consumption can be achieved
through a detalization of grammar annotation for a range. An active text
fragment can be fully annotated, while for other text parts a coarse range can
be stored.
We have defined
xpath like grammar to test our ideas. See printed
parsed trees to get understanding of what information can be seen from
ranges.
Do you agree that binary trees and algorithms that keep trees reasonably balanced
are important?
Our answer is yes!
It's interesting enough, however, that you won't easily find these algorithms
publicly available.
Though red-black,
AVL and other algorithms
described in the wikipedia are defined in terms of tree manipulation, all
implementations we have seen, deal with trees annotated with keys and values.
These implementations really use tree balancing algorithms behind the schene,
and expose a commonplace set or map containers to a client. Even
C++ Standard
Library suffers from this disease.
We think that binary trees are valuable independent concepts, and they worth to
be implemented separately, at least because there are other algorithms, except
sets and maps, using trees.
And well, we did it in C#! See
RedBlackTree.cs.
Consider an example - a simple scheduler,
ScheduleBookmark.cs, with operations:
- schedule an action;
- remove an action from the schedule;
- enumerate actions;
- find a date, an action is scheduled for;
- find an action (or at least closest one) for a specified date;
- postpone actions due to delays;
A balanced binary tree allows efficient implementation of such a scheduler. Tree
node stores an action, and a time span between parent node and this node.
This way:
Operation |
Steps |
schedule an action |
find place + link node + rebalance tree |
remove an action from the schedule |
unlink node + rebalance tree |
enumerate actions |
navigate tree |
find a date, an action is scheduled for |
find node in tree |
find an action for a specified date |
cumulate time spans up to the tree root |
postpone actions due to delays |
fixup time spans from a node up to the tree root |
Compare operation complexities between tree, array, list and map:
Operation |
Tree |
Array |
List |
Map |
schedule an action |
O(ln(N)) |
O(N) |
O(N) |
O(ln(N)) |
remove an action from the schedule |
O(ln(N)) |
O(N) |
O(1) |
O(ln(N)) |
enumerate actions |
O(ln(N)) |
O(1) |
O(1) |
O(ln(N)) |
find a date, an action is scheduled for |
O(ln(N)) |
O(1) |
O(1) |
O(1) |
find an action for a specified date |
O(ln(N)) |
O(ln(N)) |
O(N) |
O(ln(N)) |
postpone actions due to delays |
O(ln(N)) |
O(N) |
O(N) |
O(N*ln(N)) |
Complexity of each operation for the tree is O(ln(N)). No arrays, lists, or maps achieve similar worst case guaranty.
Finally, the test program is
Program.cs,
and a whole project (VS2008) is
Tree.zip
Could you think of a C# method accepting an ancestor, and
forbidding a descendant of a class at compile time?
The answer to this probably is: why do you need such a reptile.
Well, I don't. I didn't meant to create such a method, but generics help a lot!
public class BinaryTreeNode<Node>
where Node: BinaryTreeNode<Node>
{
public Node parent;
public Node left;
public Node right;
}
public class MyNode: BinaryTreeNode<MyNode>
{
public int key;
}
public class MyRoot: MyNode
{
}
public class Test
{
public void test()
{
MyRoot root = new MyRoot();
// print((MyNode)root); // This works.
print(root); // This does not work.
}
private static void print<T>(T node)
where T: BinaryTreeNode<T>
{
Console.WriteLine("print me");
}
}
By the way, BinaryTreeNode is an "abstract" class, as you cannot instantiate it but inherit only.
A simple demand nowdays - a good IDE.
Almost a ten years have passed since xslt has appeared but still, we're not pleased with IDEs
claiming xslt support. Our expectaions are not too high. There are things however, which must be present in such
an IDE.
- A notion of project, and possibly a group of projects. You may think of
it as a main xslt including other xslts participationg in the project.
- A code completion. A feature providing typing hints for language constructs, includes,
prefixes, namespaces, functions, templates, modes, variables, parameters, schema
elements, and other (all this should work in a context of the project).
- A code refactoring. A means to move parts of code between (or inside) files and projects,
rename things (functions, templates, parameters, variables, prefixes,
namespaces, and other).
- Code validation and run.
- Optional debug feature.
We would be grateful if someone had pointed to any such IDE.
|