Writing a language serializer is as easy as riding a bicycle: once you have learned how, it no longer takes much mental effort to create a new one.
It still takes a fair amount of mechanical effort to write and test everything, though.
Well, this is the first draft of the C# xslt serializer. The archive contains both C# xom and jxom.
Note: comments are not supported yet, and nothing is done to format the code except line wrapping.
Today my son brought an old book out into the light of day. The book immediately opened on this verse:
Any trifle can become the main business of your life.
You just need to believe firmly that there is nothing more important to be achieved. And then nothing will stop you from engaging in this nonsense, gasping with delight.
Unfortunately, all too often these facetious verses of Gregory Oster come true.
I've read some popular science about DNA, RNA, proteins, cells, prokaryotes and eukaryotes: their structures, roles, operating principles, and evolution.
All of computer technology and robotics seems like childish babbling compared to microbiology and molecular biology.
I wish I were as open-minded, and had as capable a brain with infinite working capacity (and as long a life), to push at and break through the borders of human knowledge.
Ah, I envy the Renaissance people who were able to embrace and advance both science and art; this contrasts so much with contemporary specialization.
Well, it's no longer just jxom, but also csharpxom!
Project concerns demanded that we create a C# 3.0 xml schema.
Shortly we expect to create an xslt serializing an xml document in this schema into text. Thanks to the original design we can reuse the java streamer almost without changes.
A fact: the C# schema is more than twice as big as the java one.
Today I've found the C++0x FAQ by Bjarne Stroustrup, reviewing most of the new features that we shall see in the next version of the language.
A good insight for those who don't track the working group's progress.
But what caught my attention is this passage:
Sounds rather pessimistic to my taste.
There is a nice ServiceLoader API in java 6 implementing the service provider idiom. It's a good (good because it's standard) class for resolving interface implementations using the META-INF/services location.
Unfortunately, there is not even a JSR implementation of this class for java 5. This makes it impossible for us to use it.
What a nuisance!
We honour the memory of our grandfathers and grandmothers who fought in that cruel war. One of our grandfathers fell in that war; the other grandfather and our grandmothers survived and lived long lives.
Time is relentless; they have left this world, but we shall keep them and their deeds in our memory.
Yesterday we found an article, "Repackaging Saxon". It's about the decision to move away from the Saxon-B/Saxon-SA packaging to a more conventional product line: Home/Professional/Enterprise Editions.
The good news is that Saxon stays open source. That's most important, as the open community spirit will be preserved. On the other hand, the Professional and Enterprise Editions will not be free.
In this regard the most interesting comments are:
John Cowan> I suspect that providing packaging only for $$ (or pounds or euros) won't actually work, because someone else will step in and provide that packaging for free, as the licensing permits.
and response:
Michael Kay> This will be interesting to see. I'm relying partly on the idea that there's a fair degree of trust, and expectation of support, associated with Saxonica's reputation, and that the people who are risking their business on the product might be hesitant to rely on third parties, who won't necessarily be prompt in issuing maintenance releases etc; at the same time, such third parties may serve the needs of the hobbyists who are the real market for the open source version.
and also:
Michael Kay> ...I haven't been able to make a model based on paid services plus free software work for me. It's hard enough to get the services business; when you do get it, it's hard to get enough revenue from it to fund the time spent on developing and supporting the software. Personally, I think the culture of free software has gone too far, and it is now leading to a lack of investment in new software...
Sunny> Look what I have found! Consider this C#:
public class T
{
  public T free;
}

public void NewTest()
{
  T cache = new T();
  Stopwatch timer = new Stopwatch();

  timer.Reset();
  timer.Start();

  for(int i = 0; i < 10000000; ++i)
  {
    // Get from cache.
    T t;

    if (cache.free == null)
    {
      cache.free = new T();
    }

    t = cache.free;

    // Release
    cache.free = t;
    t = null;
  }

  timer.Stop();

  long cacheTicks = timer.ElapsedTicks;

  timer.Reset();
  timer.Start();

  for(int i = 0; i < 10000000; ++i)
  {
    new T();
  }

  timer.Stop();

  long newTicks = timer.ElapsedTicks;

  Console.WriteLine("cache: {0}, new: {1}", cacheTicks, newTicks);
}
Gloomy> And?
Sunny> Tests show that new T() is almost as fast as caching! The GC's "new" probably has a fast route, where it shifts the free memory border atomically, so an allocation takes just a few cycles.
Gloomy> Well, you're probably right, there is a fast route. I, however, have a different opinion. To track references, a generational garbage collector implements a reference field assignment as a call rather than a mov. Besides the move itself, this routine marks the touched memory page in a special card table (who said GC is cheap?); thus, I think, a reference field setter is almost as slow as the "new" call.
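A minimal sketch of how one could probe Gloomy's point (illustrative only; the RefHolder type and the iteration count are assumptions, not part of the original test). It compares a reference-field store, which goes through the GC write barrier, with a plain int-field store, which is a simple mov:

using System;
using System.Diagnostics;

public class RefHolder
{
  public object reference; // Storing here goes through the GC write barrier.
  public int number;       // Storing here is a plain move.
}

public static class WriteBarrierTest
{
  public static void Main()
  {
    RefHolder holder = new RefHolder();
    object value = new object();
    Stopwatch timer = Stopwatch.StartNew();

    for(int i = 0; i < 100000000; ++i)
    {
      holder.reference = value; // Reference store: write barrier.
    }

    timer.Stop();

    long referenceTicks = timer.ElapsedTicks;

    timer = Stopwatch.StartNew();

    for(int i = 0; i < 100000000; ++i)
    {
      holder.number = i; // Value store: no write barrier.
    }

    timer.Stop();

    Console.WriteLine(
      "reference store: {0}, int store: {1}",
      referenceTicks,
      timer.ElapsedTicks);
  }
}

If the reference store turns out noticeably slower, that supports the card table argument; if not, the write barrier on the platform at hand is cheaper than assumed.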
.Net is known for its array covariance. That means that an array of a reference type can be cast to an array of its base element type:
public class T: B
{
}
T[] tlist = ...
B[] blist = tlist;
This feature comes at a cost:
B b = ...
T t = ...
blist[0] = b; // This effectively is: blist[0] = (T)b;
tlist[0] = t; // This is the same: tlist[0] = (T)t;
We pay the cost of an additional cast for nothing. Let this dubious design decision weigh on the conscience of the .Net/Java inventors.
You can eliminate the cast: just use an array of structs:
struct S<T>
{
  public T t;
}

S<T>[] slist = ...
slist[0].t = t; // Works without cast.
Measurements show that S<T>[] is ~35% faster than T[] on write, and slower (the JIT could do better) on read.
Well, an ugly workaround for an ugly design.
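A sketch of the kind of measurement meant above (the type names and iteration counts are invented for illustration; this is not the original benchmark). A store into T[] must pass the covariant array store check, while a store into the struct field of S<T>[] does not:

using System;
using System.Diagnostics;

public class B { }
public class T: B { }

public struct S<T>
{
  public T t;
}

public static class CovarianceTest
{
  public static void Main()
  {
    T[] tlist = new T[1000];
    S<T>[] slist = new S<T>[1000];
    T t = new T();
    Stopwatch timer = Stopwatch.StartNew();

    for(int pass = 0; pass < 10000; ++pass)
    {
      for(int i = 0; i < tlist.Length; ++i)
      {
        tlist[i] = t; // Covariant array store: runtime element type check.
      }
    }

    timer.Stop();

    long classTicks = timer.ElapsedTicks;

    timer = Stopwatch.StartNew();

    for(int pass = 0; pass < 10000; ++pass)
    {
      for(int i = 0; i < slist.Length; ++i)
      {
        slist[i].t = t; // Struct field store: no array store check.
      }
    }

    timer.Stop();

    Console.WriteLine("T[]: {0}, S<T>[]: {1}", classTicks, timer.ElapsedTicks);
  }
}

Note that the struct version still pays the write barrier for the reference field it contains, so whatever difference is measured comes from the store check itself.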
P.S. In java there is no relief...
This happens in .NET Framework 3.5, 32 bit, VS 2008.
C#:
namespace NesterovskyBros.Test
{
  using Microsoft.VisualStudio.TestTools.UnitTesting;

  [TestClass]
  public class CharAtUnitTest
  {
    private TestContext testContextInstance;

    public TestContext TestContext
    {
      get { return testContextInstance; }
      set { testContextInstance = value; }
    }

    [TestMethod]
    public void CharAtTest()
    {
      this.text = "1";

      string token = Read(1, false);

      TestContext.WriteLine("token: {0}", token);
    }

    private string Read(int offset, bool flag)
    {
      string token = null;
      int c = 0;

      if (flag)
      {
        goto Whitespace;
      }

    Scan:
      c = CharAt(offset);

      switch(c)
      {
        case -1:
        {
          return "<Eof>";
        }
        case '\'':
        {
          token = "Literal";

          goto Literal;
        }
      }

    Whitespace:
      if (c == ' ')
      {
        return "Space";
      }

      return "Unknown";

    Literal:
      while(true)
      {
        int d = CharAt(offset);

        if (token != "Literal")
        {
          goto Scan;
        }

        if (d == c)
        {
          return token;
        }
      }
    }

    string text;

    private int CharAt(int offset)
    {
      string text = this.text;

      return (uint)offset >= (uint)text.Length ? -1 : text[offset];
    }
  }
}
In debug mode this test prints: "token: <Eof>". In release mode it prints "token: Unknown".
The bug is so fragile that even the slightest change in the code makes it disappear. Looking into the disassembly we can see that the problem is near the switch:
Scan:
c = CharAt(offset); /* Our old friend, CharAt(). Inlined! */
00000017 mov edx,dword ptr [edi+8]
0000001a cmp dword ptr [edx+8],esi
0000001d jbe 00000032
0000001f cmp esi,dword ptr [edx+8]
00000022 jae 000000CE
00000028 movzx eax,word ptr [edx+esi*2+0Ch]
0000002d mov dword ptr [ebp-10h],eax
00000030 jmp 00000039
00000032 mov dword ptr [ebp-10h],0FFFFFFFFh /* Move -1 (four bytes) into the stack. */
00000039 movzx edx,word ptr [ebp-10h] /* Get two bytes into edx (0FFFFh) */
switch(c)
0000003d cmp edx,0FFFFFFFFh /* Never true. */
00000040 je 0000004A
00000042 cmp dword ptr [ebp-10h],27h
00000046 je 0000005A
00000048 jmp 00000062
{
case -1:
{
return "<Eof>";
0000004a mov eax,dword ptr ds:[022EDE68h]
00000050 lea esp,[ebp-0Ch]
00000053 pop ebx
00000054 pop esi
00000055 pop edi
00000056 pop ebp
00000057 ret 4
The JIT stores -1 into the stack slot as four bytes, but reloads it for the switch with movzx as an unsigned 16-bit value, so c becomes 0FFFFh and the case -1 branch can never be taken.
This looks like a tremendous bug, one of those that shake one's belief in the computer's infallibility.
It would be nice if you could verify the case on your computer.
Praises: I dare not think how we could live without AnkhSVN.
At present we have:
- a generic parser;
- a fully functional xquery parser;
- detailed error reporting and syntax suggestions;
- high performance.
The idea of a runtime grammar tree and a reader-like parser results in high performance, as we are able to build lookup tables to probe tokens. This allows us to start parsing immediately from the most specific grammar chain. For example, consider the xquery grammar:
[1] Module ::= VersionDecl? (LibraryModule | MainModule)
[2] VersionDecl ::=
"xquery" "version" StringLiteral ("encoding" StringLiteral)? Separator
[3] MainModule ::= Prolog QueryBody
[4] LibraryModule ::= ModuleDecl Prolog
[5] ModuleDecl ::= "module" "namespace" NCName "=" URILiteral Separator
[6] Prolog ::=
((DefaultNamespaceDecl | Setter | NamespaceDecl | Import) Separator)*
((VarDecl | FunctionDecl | OptionDecl) Separator)*
...
[87] VarRef ::= "$" VarName
Formally, to parse the xquery "$v" one needs to descend deep into the grammar hierarchy. That's what is usually done. In contrast, a lookup table for the grammar "Module", containing 80 different token runs, allows us to identify the grammar chain with just a couple of probes (a toy sketch of the idea follows the list below):
[0] "xquery" "version"
[1] "module" "namespace"
[2] "declare" "default" "element" "namespace"
[3] "declare" "default" "function" "namespace"
[4] "declare" "boundary-space"
[5] "declare" "default" "collation"
[6] "declare" "base-uri"
[7] "declare" "construction"
[8] "declare" "ordering"
[9] "declare" "default" "order" "empty"
[10] "declare" "copy-namespaces"
[11] "declare" "namespace"
[12] "declare" "schema"
[13] "import" "module"
[14] "declare" "variable" "$"
[15] "declare" "function"
[16] "declare" "option"
[17] "for" "$"
[18] "let" "$"
[19] "some" "$"
[20] "every" "$"
[21] "typeswitch" "("
[22] "if" "("
[23] "-"
[24] "+"
[25] "validate" "{"
[26] "validate" "lax"
[27] "validate" "strict"
[28] "/"
[29] "//"
[30] <integer>
[31] <decimal>
[32] <double>
[33] <string>
[34] "$"
[35] "("
[36] "."
[37] <functionname> "("
[38] "ordered" "{"
[39] "unordered" "{"
[40] "<" <qname>
[41] <!--literal-->
[42] <?pi literal?>
[43] "document" "{"
[44] "element" <qname>
[45] "element" "{"
[46] "attribute" <qname>
[47] "attribute" "{"
[48] "text" "{"
[49] "comment" "{"
[50] "processing-instruction" <ncname>
[51] "processing-instruction" "{"
[52] "parent" "::"
[53] "ancestor" "::"
[54] "preceding-sibling" "::"
[55] "preceding" "::"
[56] "ancestor-or-self" "::"
[57] ".."
[58] "child" "::"
[59] "descendant" "::"
[60] "attribute" "::"
[61] "self" "::"
[62] "descendant-or-self" "::"
[63] "following-sibling" "::"
[64] "following" "::"
[65] "@"
[66] "document-node" "("
[67] "element" "("
[68] "attribute" "("
[69] "schema-element" "("
[70] "schema-attribute" "("
[71] "processing-instruction" "("
[72] "comment" "("
[73] "text" "("
[74] "node" "("
[75] <qname>
[76] "*"
[77] <ncname:*>
[78] <*:ncname>
[79] "(#"
This way, algorithmically, we outperform most conventional parsers.
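Here is that toy sketch (the grammar chain names and the table contents are invented for illustration; this is not the actual parser code). The leading tokens are probed against a prebuilt table that maps a token run directly to the most specific grammar chain to start from:

using System;
using System.Collections.Generic;

public static class LookupSketch
{
  // Maps a short run of leading tokens to the grammar chain to try first.
  // In the real table the entries are token kinds rather than raw strings.
  static readonly Dictionary<string, string> moduleLookup =
    new Dictionary<string, string>
    {
      { "xquery version", "Module/VersionDecl" },
      { "module namespace", "Module/LibraryModule/ModuleDecl" },
      { "declare variable $", "Module/MainModule/Prolog/VarDecl" },
      { "for $", "Module/MainModule/QueryBody/FLWORExpr" },
      { "$", "Module/MainModule/QueryBody/VarRef" }
    };

  // Probes at most three leading tokens, longest run first, and returns
  // the grammar chain to start parsing from, or null for the generic descent.
  public static string Probe(string[] tokens)
  {
    for(int count = Math.Min(3, tokens.Length); count > 0; --count)
    {
      string chain;

      if (moduleLookup.TryGetValue(string.Join(" ", tokens, 0, count), out chain))
      {
        return chain;
      }
    }

    return null;
  }

  public static void Main()
  {
    // Parsing "$v" starts directly at the VarRef chain, with no descent
    // through Module, MainModule, Prolog and so on.
    Console.WriteLine(Probe(new[] { "$", "v" }));
  }
}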
On the other hand, the parse tree we build has a compact representation. Each tree node is defined by two text bookmarks, a grammar chain, and grammar-specific data. What's important is that very little garbage memory is produced, as the parser's assumptions rarely fail.
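As a hypothetical illustration of that node layout (the field names and types are guesses, not the actual sources):

// A sketch of the compact parse tree node described above.
public struct ParseNode
{
  public int start;        // Bookmark: offset of the node's first character.
  public int end;          // Bookmark: offset just past the node's last character.
  public int grammarChain; // Index of the shared grammar chain that produced the node.
  public object data;      // Grammar-specific payload; often null.
}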
What should be done:
- Attach events to the xquery grammar to collect program constructions: variables, functions, namespaces in scope. This will provide auto-completion info.
- Release inactive parsed subtrees. E.g. we can free the tree of a function body and preserve its text range (two bookmarks).
Well, I'd like to think that someone can make sense of all this mumbling. All the sources are at the "Incremental parser" home.
There is a method Right() in the RB tree implementation:
public int Right(int node)
{
  return items[node].right;
}
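For context, the snippet presumably works against node storage roughly like the following; this is a reconstruction for illustration, not the actual sources:

// A guess at the storage Right() works against: nodes live in a flat
// array of structs and are addressed by index.
public struct Node
{
  public int parent; // Index of the parent node.
  public int left;   // Index of the left child, or -1.
  public int right;  // Index of the right child, or -1.
  public int color;  // Red/black flag.
}

public class RedBlackTree
{
  private Node[] items;

  public int Right(int node)
  {
    // The range check on items[node] may throw IndexOutOfRangeException,
    // which is presumably why the JIT refuses to inline the method.
    return items[node].right;
  }
}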
The JIT does not want to inline it, probably because the method may throw:
public int Right(int node)
{
return items[node].right;
00000000 mov eax,dword ptr [ecx+4]
00000003 cmp edx,dword ptr [eax+4]
00000006 jae 00000013
00000008 shl edx,4
0000000b lea eax,[eax+edx+8]
0000000f mov eax,dword ptr [eax+8]
00000012 ret
00000013 call 74C3A62C
00000018 int 3
Too sad.
Early in 2001 we read that .NET's JIT is smart enough to optimize away repeated boundary checks.
In the year 2009 we can still verify that this is not the case (no matter how hard you try).
C#:
private int CharAt(int offset)
{
  string text = this.text;
  return (uint)offset >= (uint)text.Length ? -1 : text[offset];
}
Disassembly:
private int CharAt(int offset)
{
string text = this.text;
00000000 push ebp
00000001 mov ebp,esp
00000003 mov ecx,dword ptr [ecx+30h]
return (uint)offset >= (uint)text.Length ? -1 : text[offset];
00000006 cmp dword ptr [ecx+8],edx
00000009 jbe 00000017
0000000b cmp edx,dword ptr [ecx+8]
0000000e jae 0000001C
00000010 movzx eax,word ptr [ecx+edx*2+0Ch]
00000015 pop ebp
00000016 ret
00000017 or eax,0FFFFFFFFh
0000001a pop ebp
0000001b ret
0000001c call 74C24C6C
00000021 int 3
P.S. This method is not inlined either (its IL length is 25 bytes).
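For what it's worth, here is a sketch of one way to sidestep the duplicated check (an illustration only: it assumes compilation with /unsafe, and whether pinning the string pays off depends on the call pattern):

private unsafe int CharAtUnchecked(int offset)
{
  string text = this.text;

  if ((uint)offset >= (uint)text.Length)
  {
    return -1;
  }

  // Pointer access emits no range check; the comparison above is the only one.
  fixed (char* chars = text)
  {
    return chars[offset];
  }
}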