Sunday, 09 May 2010 - Nesterovsky bros

Sunday, 09 May 2010

We're facing the task of parsing reports produced by legacy applications and converting them into a structured form, e.g. into xml. These xml files can be processed further with up-to-date tools to produce good-looking reports.

The reports at hand differ widely in structure and in size: from a couple of KB to several GB. The good part is that they mostly have a tabular form, so it's easy to think of a specific parser for each report type.

Our goal is to create an environment where less qualified people could create and manage such parsers, and only rarely engage someone to handle the nontrivial cases.

Our analysis has shown that it's possible to write such a parser in almost any language: xslt, C#, java.

Our approach was to create xml schema annotations that on one side define a data structure, and on the other map the report layout. Then we're able to create an xslt that generates an xslt, C#, or java parser according to the schema definitions. Thanks to languages xom, which provides XML Object Models and serialization stylesheets for C# and java, it does not really matter whether we generate xslt or C#/java, as the code will look the same.

The approach we're going to use to describe reports is not as powerful as conventional parsers. Its virtue, however, is simplicity of specification.

Consider a report sample (the data to extract is in bold):

1 TITLE ...                    PAGE:            1
 BUSINESS DATE: 09/30/09   ... RUN DATE: 02/23/10
 CYCLE : ITD      RUN: 001 ... RUN TIME: 09:22:39

        CM         BUS   ...
  CO    NBR  FRM   FUNC  ...
 ----- ----- ----- -----  
 XXX   065   065   CLR   ...
 YYY   ...
...
1 TITLE ...                    PAGE:            2
 BUSINESS DATE: 09/30/09   ... RUN DATE: 02/23/10
 CYCLE : ITD      RUN: 001 ... RUN TIME: 09:22:39

        CM         BUS   ...
  CO    NBR  FRM   FUNC  ...
 ----- ----- ----- -----  
 AAA   NNN   MMM   PPP   ...
 BBB   ...
...

* * * * *  E N D   O F   R E P O R T  * * * * *

We approach the report through a sequence of views (filters) of this report. Each view localizes some report data, either for subsequent filtering or for the extraction of final data.

Looking at the example, one can build the following views of the report:

View of data before the "E N D O F R E P O R T" line.
View of remaining data without page headers and footers.
Views of table rows.
Views of cells.

A sequence of filters allows us to build a pipeline of transformations of the original text. This also allows us to generate clean xslt, C# or java code to parse the data.
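The chain of views can be sketched in java roughly as follows. This is only an illustration for the sample report above and not the code generated from report-mapping.xsd: the class name, the rule that a page header starts at a line beginning with "1" and ends at the line of dashes, and the whitespace splitting of cells are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A sketch of the "sequence of views" idea. All rules here are assumptions
// made for the sample report, not the semantics of report-mapping.xsd.
final class ReportViews {
    // View 1: data before the "E N D   O F   R E P O R T" line.
    static List<String> beforeEndOfReport(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (line.replace(" ", "").contains("ENDOFREPORT")) {
                break;
            }
            result.add(line);
        }
        return result;
    }

    // View 2: remaining data without page headers. Assumption: a header
    // starts at a line beginning with '1' and ends at the line of dashes.
    static List<String> withoutPageHeaders(List<String> lines) {
        List<String> result = new ArrayList<>();
        boolean inHeader = false;
        for (String line : lines) {
            if (line.startsWith("1")) {
                inHeader = true;
            } else if (inHeader) {
                if (line.trim().startsWith("-----")) {
                    inHeader = false;
                }
            } else if (!line.trim().isEmpty()) {
                result.add(line);
            }
        }
        return result;
    }

    // Views 3 and 4: every remaining line is a row, cells are split on whitespace.
    static List<List<String>> rows(List<String> lines) {
        List<List<String>> result = new ArrayList<>();
        for (String line : lines) {
            result.add(Arrays.asList(line.trim().split("\\s+")));
        }
        return result;
    }

    // The pipeline: each view narrows the output of the previous one.
    static List<List<String>> parse(List<String> lines) {
        return rows(withoutPageHeaders(beforeEndOfReport(lines)));
    }
}
```

For reports of several GB the lists would have to become lazy iterators, but the composition of the views stays the same.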

At first, our favorite language for such a parser was xslt. Unfortunately, we're dealing with the Saxon xslt implementation, which is not very strong in streaming processing. Without a couple of extension functions to prevent caching, it tends to cache the whole input in memory, which is not acceptable.

At present we have decided to start from C# code, which is pure C#, naturally. :-)

The code is still in development, but for now we would like to share the xml schema annotations describing the report layout: report-mapping.xsd, and a sample report description: test.xsd.

Sunday, 09 May 2010 05:18:57 UTC

Announce | Thinking aloud | xslt

Wednesday, 05 May 2010

Languages XOM update

A few small changes to the streaming and name normalization algorithms in jxom and csharpxom, and the generation speed has almost doubled (especially for big files).

We suspect, however, that our xslt code is tuned for the saxon engine.

It would be nice to know whether anybody has used languages XOM with other engines. Is anyone using it at all (well, at least there are downloads)?

Languages XOM (jxom, csharpxom, cobolxom, sqlxom) can be downloaded from: languages-xom.zip

Wednesday, 05 May 2010 06:48:10 UTC

Announce | xslt

Sunday, 02 May 2010

Xslt match puzzle

At times a simple task in xslt looks like a puzzle. Today we have this one.

Given a string and a regular expression, find the position and the length of the matched substring.

The problem looks so simple that you do not immediately realize that you are going to spend ten minutes trying to solve it in the best way.

Try it yourself before proceeding:

<!-- Pairs of (is-match, length) for each piece of the line. -->
<xsl:variable name="match" as="xs:integer*">
  <xsl:analyze-string select="$line" regex="my-reg-ex">
    <xsl:matching-substring>
      <xsl:sequence select="1, string-length(.)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:sequence select="0, string-length(.)"/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<xsl:choose>
  <!-- The line starts with a match. -->
  <xsl:when test="$match[1]">
    <xsl:sequence select="1, $match[2]"/>
  </xsl:when>
  <!-- The match follows $match[2] non-matching characters (1-based position). -->
  <xsl:when test="$match[3]">
    <xsl:sequence select="$match[2] + 1, $match[4]"/>
  </xsl:when>
</xsl:choose>
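For comparison, outside xslt the same question is a few lines over java.util.regex. This is a minimal sketch, assuming java regex syntax (which differs in details from the xslt one) and a 1-based position as in xslt; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstMatch {
    // Returns {position, length} of the first match, where position is
    // 1-based as in xslt, or null when the line does not match at all.
    static int[] find(String line, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(line);

        if (!matcher.find()) {
            return null;
        }

        return new int[] { matcher.start() + 1, matcher.end() - matcher.start() };
    }
}
```

The xslt puzzle exists precisely because xsl:analyze-string reports the matched pieces but not their offsets, so the position has to be recovered from the length of the preceding non-matching piece.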

Sunday, 02 May 2010 15:35:02 UTC

Tips and tricks | xslt

Saturday, 01 May 2010

Generator functions in xslt in Saxon

To see that the problem with Generator functions in xslt is a bit more complicated, compare two functions.

The first one is quoted from the earlier post:

<xsl:function name="t:generate" as="xs:integer*">
  <xsl:param name="value" as="xs:integer"/>

  <xsl:sequence select="$value"/>
  <xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>

It does not work in Saxon: it crashes with an out of memory error.

The second one is a slightly modified version of the same function:

<xsl:function name="t:generate" as="xs:integer*">
  <xsl:param name="value" as="xs:integer"/>

  <xsl:sequence select="$value + 0"/>
  <xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>

It works without problems. In the first case Saxon decides to cache all of the function's output; in the second case it decides to evaluate data lazily, on demand.

It seems that optimization algorithms implemented in Saxon are so plentiful and complex that at times they fool one another. :-)

See also: Generator functions

Saturday, 01 May 2010 07:18:24 UTC

Thinking aloud | Tips and tricks | xslt

Friday, 23 April 2010

Complications with streamed tree

There are some complications with the streamed tree that we have implemented in saxon. They are due to the fact that only a view of the input data is available at any time. Whenever you access an element that is not available, you get an exception.

Consider an example. We have a log created with java logging. It looks like this:

<log>
  <record>
    <date>...</date>
    <millis>...</millis>
    <sequence>...</sequence>
    <logger>...</logger>
    <level>INFO</level>
    <class>...</class>
    <method>...</method>
    <thread>...</thread>
    <message>...</message>
  </record>
  <record>
    ...
  </record>
  ...

We would like to write an xslt that returns a page of log as html:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xs t">

  <xsl:param name="start-page" as="xs:integer" select="1"/>
  <xsl:param name="page-size" as="xs:integer" select="50"/>

  <xsl:output method="xhtml" byte-order-mark="yes" indent="yes"/>

  <xsl:template match="/log">
    <xsl:variable name="start" as="xs:integer"
      select="($start-page - 1) * $page-size + 1"/>
    <xsl:variable name="records" as="element()*"
      select="subsequence(record, $start, $page-size)"/>

    <html>
      <head>
        <title>
          <xsl:text>A log file. Page: </xsl:text>
          <xsl:value-of select="$start-page"/>
        </title>
      </head>
      <body>
        <table border="1">
          <thead>
            <tr>
              <th>Level</th>
              <th>Message</th>
            </tr>
          </thead>
          <tbody>
            <xsl:apply-templates mode="t:record" select="$records"/>
          </tbody>
        </table>
      </body>
    </html>
  </xsl:template>

  <xsl:template mode="t:record" match="record">
    <xsl:variable name="log">
      <xsl:copy-of select="."/>
    </xsl:variable>
    <xsl:variable name="level" as="xs:string" select="$log/record/level"/>
    <xsl:variable name="message" as="xs:string" select="$log/record/message"/>

    <tr>
      <td>
        <xsl:value-of select="$level"/>
      </td>
      <td>
        <xsl:value-of select="$message"/>
      </td>
    </tr>
  </xsl:template>

</xsl:stylesheet>

This code does not work. Guess why? Yes, it's subsequence(), which is too greedy. It always wants to know what the next node is, so it naturally skips the content of the current node. Algorithmically, such saxon code could be rewritten, and could possibly work better in modes other than streaming as well.

A viable workaround, which does not use subsequence(), looks rather nontrivial:

<xsl:template match="/log">
  <xsl:variable name="start" as="xs:integer"
    select="($start-page - 1) * $page-size + 1"/>
  <!-- Index of the last record of the page (inclusive). -->
  <xsl:variable name="end" as="xs:integer"
    select="$start + $page-size - 1"/>

  <html>
    <head>
      <title>
        <xsl:text>A log file. Page: </xsl:text>
        <xsl:value-of select="$start-page"/>
      </title>
    </head>
    <body>
      <table border="1">
        <thead>
          <tr>
            <th>Level</th>
            <th>Message</th>
          </tr>
        </thead>
        <tbody>
          <xsl:sequence select="
            t:generate-records(record, $start, $end, ())"/>
        </tbody>
      </table>
    </body>
  </html>
</xsl:template>

<xsl:function name="t:generate-records" as="element()*">
  <xsl:param name="records" as="element()*"/>
  <xsl:param name="start" as="xs:integer"/>
  <xsl:param name="end" as="xs:integer?"/>
  <xsl:param name="result" as="element()*"/>

  <xsl:variable name="record" as="element()?" select="$records[$start]"/>

  <xsl:choose>
    <xsl:when test="(exists($end) and ($start > $end)) or empty($record)">
      <xsl:sequence select="$result"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Copy the record while it is still available in the stream. -->
      <xsl:variable name="log">
        <xsl:copy-of select="$record"/>
      </xsl:variable>
      <xsl:variable name="level" as="xs:string" select="$log/record/level"/>
      <xsl:variable name="message" as="xs:string" select="$log/record/message"/>

      <xsl:variable name="next-result" as="element()*">
        <tr>
          <td>
            <xsl:value-of select="$level"/>
          </td>
          <td>
            <xsl:value-of select="$message"/>
          </td>
        </tr>
      </xsl:variable>

      <xsl:sequence select="
        t:generate-records
        (
          $records,
          $start + 1,
          $end,
          ($result, $next-result)
        )"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>

Here we observed the greediness of saxon, which tried to consume more input than required, and too early. In other cases we have seen that it may defer the actual data access to a point when the data is no longer available.

So, without tuning internal saxon logic it's possible but not easy to write stylesheets that exploit streaming features.
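To show what the stylesheet is expected to do with the stream, here is the same paging written directly against StAX. It is only a sketch under assumptions: the class and method names are made up, the log has the layout shown above, and the input is taken as a string to keep the example self-contained (a real log would be read from a file stream).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class LogPager {
    // Returns "level: message" for at most pageSize records, beginning with
    // the record number start (1-based). Only the current record is kept
    // in memory, and reading stops as soon as the page is complete.
    static List<String> page(String xml, int start, int pageSize) {
        List<String> rows = new ArrayList<>();

        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            int index = 0;
            String level = null;
            String message = null;

            while (rows.size() < pageSize && reader.hasNext()) {
                int event = reader.next();

                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();

                    if (name.equals("record")) {
                        index++;
                        level = null;
                        message = null;
                    } else if (index >= start && name.equals("level")) {
                        level = reader.getElementText();
                    } else if (index >= start && name.equals("message")) {
                        message = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && index >= start
                        && reader.getLocalName().equals("record")) {
                    rows.add(level + ": " + message);
                }
            }

            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }

        return rows;
    }
}
```

The loop never looks ahead of the current record, which is exactly the behaviour subsequence() fails to provide over the streamed tree.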

P.S. Updated sources are at streamedtree.zip

Friday, 23 April 2010 10:12:38 UTC

Thinking aloud | xslt

Thursday, 22 April 2010

Very basic things about java

At some point we needed to have an array with volatile elements in java.

We knew that such a beast is not found in the java world. So we searched the Internet and found answers that are so wrong, and introduce such obscure threading bugs, that the guys who provided them had better hide them and run immediately to fix their buggy programs...

The first one is Volatile arrays in Java. They suggest this solution:

volatile int[] arr = new int[...];
...
arr[4] = 100;
arr = arr;

Number two: What Volatile Means in Java

The author assures us that this code works:

Fields:

int answer = 0;
volatile boolean ready = false;

Thread1:

answer = 42;
ready = true;

Thread2:

if (ready)
{
  print(answer);
}

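They are very wrong! Non-volatile accesses can be reordered by the implementation. See Java's Threads and Locks (the wording below is that of the original memory model; the Java 5 revision, JSR-133, tightened the ordering around volatile accesses, so check the specification your JVM follows):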

The rules for volatile variables effectively require that main memory be touched exactly once for each use or assign of a volatile variable by a thread, and that main memory be touched in exactly the order dictated by the thread execution semantics. However, such memory actions are not ordered with respect to read and write actions on nonvolatile variables.

They probably thought of locks when they argued about volatiles:

a lock action acts as if it flushes all variables from the thread's working memory; before use they must be assigned or loaded from main memory.

P.S. They had better recommend AtomicReferenceArray.
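A minimal sketch of the recommended direction for an int array, using AtomicIntegerArray from java.util.concurrent.atomic (AtomicReferenceArray plays the same role for object elements). The wrapper class and its method names are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

final class VolatileElements {
    // Each element is read and written with volatile semantics,
    // which a plain "volatile int[]" field does not give.
    private final AtomicIntegerArray arr = new AtomicIntegerArray(16);

    // Volatile write of a single element.
    void publish(int index, int value) {
        arr.set(index, value);
    }

    // Volatile read of a single element.
    int read(int index) {
        return arr.get(index);
    }
}
```

With this, the first example becomes arr.set(4, 100) with no "arr = arr" trick at all.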

Thursday, 22 April 2010 13:05:48 UTC

Thinking aloud | Tips and tricks

Wednesday, 21 April 2010

Streamed tree

When the time came to process big xml log files, we decided to implement a streamable tree in saxon the very same way it was implemented in .net eight years ago (see How would we approach to streaming facility in xslt).

Interestingly enough, the implementation is similar to that of the composable tree. There a node never stores a reference to its parent, while in the streamed tree no references to children are stored. This way only a limited subview of the tree is available at any time. The implementation does not support the preceding and preceding-sibling axes. Also, one cannot navigate to a node that is out of scope.

The implementation is external (there are no changes to saxon itself). To use it one needs to create an instance of DocumentInfo, which pulls data from XMLStreamReader, and to pass it as an input to a transformation:

Controller controller = (Controller)transformer;
XMLInputFactory factory = XMLInputFactory.newInstance();
StreamSource inputSource = new StreamSource(new File(input));
XMLStreamReader reader = factory.createXMLStreamReader(inputSource);
StaxBridge bridge = new StaxBridge();

bridge.setPipelineConfiguration(controller.makePipelineConfiguration());
bridge.setXMLStreamReader(reader);

// DocumentImpl is the streamed DocumentInfo; it is passed as a Source,
// since it cannot be assigned to the StreamSource variable above.
Source source = new DocumentImpl(bridge);

transformer.transform(source, new StreamResult(output));

This helped us to format an xml log file of arbitrary size. An xslt like this can do the work:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xs">

  <xsl:template match="/log">
    <html>
      <head>
        <title>Log</title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="message">
    ...
  </xsl:template>

  <xsl:template match="message[@error]">
    ...
  </xsl:template>

  ...

</xsl:stylesheet>

Implementation can be found at: streamedtree.zip

Wednesday, 21 April 2010 07:10:34 UTC

Announce | Thinking aloud | xslt

Thursday, 15 April 2010

jxom else if (google search)

Google helps with many things but with retrospective support.

Probably someone is trying to build nested if-then-else jxom elements.

We expected this and have defined a function t:generate-if-statement() in java-optimizer.xslt.

Its signature:

<xsl:function name="t:generate-if-statement" as="element()">
  <xsl:param name="closure" as="element()*"/>
  <xsl:param name="index" as="xs:integer"/>
  <xsl:param name="result" as="element()?"/>
  ...
</xsl:function>

Usage is like this:

<xsl:variable name="branches" as="element()+">
  <xsl:for-each select="...">
    <scope>
    </scope>
  </xsl:for-each>
</xsl:variable>

<xsl:variable name="else" as="element()?">
</xsl:variable>

<xsl:sequence select="t:generate-if-statement($branches, count($branches) - 1, $else)"/>
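The idea behind such a function, folding a list of branches into nested if/else starting from the last one, can be sketched in java like this. It is not the code of java-optimizer.xslt: the class, the Branch type and the textual output are assumptions made only to show the folding.

```java
import java.util.List;

final class IfChain {
    // One branch: a condition and the statement guarded by it.
    static final class Branch {
        final String condition;
        final String body;

        Branch(String condition, String body) {
            this.condition = condition;
            this.body = body;
        }
    }

    // Folds branches from the last to the first: on every step the result
    // built so far becomes the else part of the preceding branch.
    static String generate(List<Branch> branches, String elseBody) {
        String result = elseBody == null ? null : "{ " + elseBody + " }";

        for (int i = branches.size() - 1; i >= 0; i--) {
            Branch branch = branches.get(i);
            String statement =
                "if (" + branch.condition + ") { " + branch.body + " }";

            result = result == null ? statement : statement + " else " + result;
        }

        return result;
    }
}
```

The accumulated result plays the role of the $result parameter in the signature above, and the loop index the role of $index.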

P.S. By the way, we like that someone is looking into jxom.

Thursday, 15 April 2010 06:59:01 UTC

Tips and tricks | xslt

Friday, 09 April 2010

Generator functions in xslt

By a generator we mean a function that produces an infinite output sequence for a particular input.

That's a rather theoretical question, as xslt does not allow infinite sequences, but look at the example:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt"
  exclude-result-prefixes="xs t">

  <xsl:template match="/">
    <xsl:variable name="value" as="xs:string" select="'10100101'"/>
    <xsl:variable name="values" as="xs:integer+" select="t:generate(1)"/>

    <xsl:variable name="integer" as="xs:integer" select="
      sum
      (
        for $index in 1 to string-length($value) return
          $values[$index][substring($value, $index, 1) = '1']
      )"/>

    <xsl:message select="$integer"/>
  </xsl:template>

  <xsl:function name="t:generate" as="xs:integer*">
    <xsl:param name="value" as="xs:integer"/>

    <xsl:sequence select="$value"/>
    <xsl:sequence select="t:generate($value * 2)"/>
  </xsl:function>

</xsl:stylesheet>

Here the logic uses such a generator and decides by itself where to break.

Should such code be valid?

From the algorithmic perspective the example had better work, as the generator logic and its use are two separate things.
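For comparison, this is how the same separation looks in a language with lazy sequences. It is a sketch in java, assuming Stream.iterate as the generator; the class and method names are made up. The generator is infinite, and the consumer decides where to break simply by not pulling more items.

```java
import java.util.Iterator;
import java.util.stream.Stream;

final class Generators {
    // An infinite, lazily evaluated sequence: value, value * 2, value * 4, ...
    static Stream<Long> generate(long value) {
        return Stream.iterate(value, v -> v * 2);
    }

    // Sums the generated items that stand at the positions of '1' characters.
    // Exactly one item per character is pulled from the generator.
    static long weightedSum(String bits) {
        Iterator<Long> values = generate(1).iterator();
        long sum = 0;

        for (int i = 0; i < bits.length(); i++) {
            long weight = values.next();

            if (bits.charAt(i) == '1') {
                sum += weight;
            }
        }

        return sum;
    }
}
```

For the string '10100101' from the stylesheet this gives 1 + 4 + 32 + 128 = 165.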

Friday, 09 April 2010 14:38:34 UTC

Thinking aloud | xslt

Name pool

Lately, after playing a little with saxon tree models, we thought that the design would be cleaner and the implementation faster if NamePool were implemented differently.

Now, saxon is very pessimistic about java objects, thus it prefers to encode qualified names with integers. The encoding and decoding is done in the NamePool. Other parts of the code use these integer values.

Operations done over these integers are:

equality comparison of two such integers in order to check whether two qualified or extended names are equal;
get different parts of qualified name from NamePool.

We would design this differently. We would:

create a QualifiedName class to store all name parts.
declare NamePool to create and cache QualifiedName instances.

This way:

equality comparison would be a reference comparison of two instances;
different parts of qualified name would become a trivial getter;
contention of such name pool would be lower.

Here is the implementation we would propose: QualifiedName.java, NameCache.java
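The shape of the idea can be sketched as below. This is not the content of QualifiedName.java or NameCache.java: the sketch keeps only the namespace and the local name, uses a ConcurrentHashMap as the cache, and all names in it are assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class NameCacheSketch {
    // An immutable name; instances are created only by the cache, so two
    // equal names are always represented by the same instance.
    static final class QualifiedName {
        private final String namespace;
        private final String localName;

        private QualifiedName(String namespace, String localName) {
            this.namespace = namespace;
            this.localName = localName;
        }

        String getNamespace() {
            return namespace;
        }

        String getLocalName() {
            return localName;
        }
    }

    private final ConcurrentMap<String, QualifiedName> names =
        new ConcurrentHashMap<>();

    // Returns the cached instance for a name, creating it on first use.
    // The key is the name in {namespace}local form.
    QualifiedName get(String namespace, String localName) {
        String key = "{" + namespace + "}" + localName;

        return names.computeIfAbsent(
            key,
            k -> new QualifiedName(namespace, localName));
    }
}
```

With such a cache, a name test becomes a == comparison of two references, and the name parts are plain getters with no lookup in the pool.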

Friday, 09 April 2010 13:05:30 UTC

Thinking aloud | xslt

Thursday, 08 April 2010

Composable tree model in saxon

Earlier, in the entry "Inline functions in xslt 2.1" we described an implementation of an xml tree model that may share subtrees among different trees.

This way, in code like:

<xsl:variable name="elements" as="element()*" select="..."/>

<xsl:variable name="result" as="element()">
  <result>
    <xsl:sequence select="$elements"/>
  </result>
</xsl:variable>

the implementation shares the internal representation between $elements and the subtree of $result. From the perspective of xslt they look like completely different subtrees with different node identities, which is in accordance with its view of the world.

After a short study we decided to create a research implementation of this tree model in saxon. It took only a couple of days to introduce minimal changes to the engine, to refactor the linked tree into a new composable tree, and to perform some tests.

In many cases saxon has benefited immediately from this new tree model; in some other cases more tuning is required.

Our tests showed that this new tree performed better than the linked tree, but a little worse than the tiny tree. On the other hand, it's obvious that conventional code patterns avoid subtree copying, assuming it's an expensive operation, thus one should rethink some coding practices to benefit from the composable tree.

Implementation can be downloaded at: saxon.composabletree.zip

Thursday, 08 April 2010 06:26:02 UTC

Announce | Thinking aloud | xslt

Sunday, 04 April 2010

How would we approach to streaming facility in xslt

From the web we know that the xslt WG is now thinking about how to make xslt more friendly to huge documents. They will probably introduce some xslt syntax to allow implementations to be ready for such processing.

They will probably introduce an indicator marking a specific mode for streaming. XPath in this mode will probably be restricted to some subset.

The funny part is that we implemented a similar feature back in 2002 in .net. It was called XPathForwardOnlyNavigator.

The implementation stored only several nodes at a time (the context node and its ancestors), and read data from XmlReader on demand. Thus one could navigate to ancestor elements, to children, and to the following siblings, but never to any previous node. When one tried to reach a node that was no longer available, we threw an exception.

It was simple, not perfect (too restrictive), but it was pluggable into .net's xslt, and allowed processing files of any size.

That old implementation looks very attractive even now in 2010. We expect that the WG with their decisions will not rule out such or similar solutions, and will not force implementers to write an alternative engine for xslt streaming.

Sunday, 04 April 2010 20:53:27 UTC

Thinking aloud | xslt

Friday, 02 April 2010

Xslt Heisenbug

Xslt 1.0 was designed with the best intentions. Xslt 2.0 got legacy baggage.

If you're not entirely concentrated while translating your algorithms into xslt 2.0, you can fall into a trap, as we did.

Consider a code snippet:

<xsl:variable name="elements" as="element()+">
  <xsl:apply-templates/>
</xsl:variable>

<xsl:variable name="converted-elements" as="element()+"
  select="$elements/t:convert(.)"/>

Looks simple, doesn't it?

Our intention was to get converted elements, which result from some xsl:apply-templates logic.

Well, this code works... but rather sporadically, as results are often in wrong order! This bug is very close to what is called a Heisenbug.

So, where is the problem?

Elementary, my dear Watson:

xsl:apply-templates constructs a sequence of rootless elements.
$elements/t:convert(.) converts elements and orders them in document order.

Here is the tricky part:

The relative order of nodes in distinct trees is stable but implementation-dependent...

Clearly each rootless element belongs to a unique tree.

Once we realized what the problem was, the code was immediately rewritten as:

<xsl:variable name="elements" as="element()+">
  <xsl:apply-templates/>
</xsl:variable>

<xsl:variable name="converted-elements" as="element()+" select="
  for $element in $elements return
    t:convert($element)"/>

P.S. Taking into account the size of our xslt code base, it took half an hour to localize the problem. Now we're in a position to review all uses of slashes in xslt. How do you like that?

Friday, 02 April 2010 17:53:18 UTC

Thinking aloud | xslt

Saturday, 27 March 2010

Xml namespaces

Opinions on xml namespaces

olegtk: @srivatsn Originally the idea was that namespace URI would point to some schema definition. Long abandoned idea.

Not so long ago, I've seen a good reasoning about the same subject:

XML Namespaces by James Clark;
A comment by M. Kay.

Saturday, 27 March 2010 09:49:45 UTC

xslt

Monday, 22 March 2010

Inline functions in xslt 2.1

Inline functions in xslt 2.1 often look like some strange aberration. Sure, there are very useful cases when they are delegates of program logic (e.g. comparators and filters), but often (probably more often) we can see that their use is to model data structures.

As an example, suppose you want to model a structure with three properties, say a, b, and c. You implement this by creating functions that wrap and unwrap the data:

function make-data($a as item(), $b as item(), $c as item()) as function() as item()+ { function() { $a, $b, $c } }

function a($data as function() as item()+) as item() { $data()[1] }

function b($data as function() as item()+) as item() { $data()[2] }

function c($data as function() as item()+) as item() { $data()[3] }

Clever?

Sure, it is! Here, we have modeled a structure with the help of a sequence, which we have wrapped into a function item.

Alas, clever is not always good (often it's a sign of bad). We just wanted to define a simple structure. What does it have in common with a function?

There is a distance between what we want to express when designing an algorithm and what we see when looking at the code. The greater the distance, the more effort is required to write and to read the code.

It would be so good to have a simpler way to express such a concept as a structure. Let's dream a little. Suppose you already have a structure, and just want to access its members. An idea we can think about is an xpath-like access method:

$data/a, $data/b, $data/c

But wait a second, doesn't $data look very much like an xml element, with its accessors being just node tests? That's correct, so the data constructor may coincide with the element constructor.

Then what are the pros and cons of using xml elements to model structures?

Pros are: the existing xml type system, and sensible-looking code (you just understand that here we're constructing a structure).

Cons are: xml trees are implemented in a way that does not assume fast (from the performance perspective) composition, as a copy of the data is made when you construct a structure.

But here we observe that "implemented" is a very important word in this context. If an xml tree implementation did not store a reference to the parent node, then subtrees could be composed very efficiently (note that the tree is assumed to be immutable). The parent node could be available through a tree navigator, which would contain a reference to the node itself and to a parent tree navigator (or a child-to-parent map could be stored somewhere near the root).
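A minimal java sketch of such a tree, under the stated assumption of immutability: nodes know their children only, and the parent is a property of the navigator that reached the node. All class and method names here are made up; this is not Saxon's NodeInfo.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ComposableTree {
    // An immutable node that stores its children but not its parent, so
    // the same instance can be a subtree of many trees without copying.
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children =
                Collections.unmodifiableList(Arrays.asList(children.clone()));
        }
    }

    // A navigator pairs a node with the navigator of its parent; the
    // parent depends on the path taken to the node, not on the node itself.
    static final class Navigator {
        final Node node;
        final Navigator parent;

        Navigator(Node node, Navigator parent) {
            this.node = node;
            this.parent = parent;
        }

        Navigator child(int index) {
            return new Navigator(node.children.get(index), this);
        }

        String path() {
            return (parent == null ? "" : parent.path()) + "/" + node.name;
        }
    }
}
```

Composing a structure is then just a matter of passing existing nodes to a new Node; node identity, if needed, would be defined by the navigator rather than by the node instance.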

Such a tree structure would probably help not only in this particular case but also with other conventional xslt code patterns.

P.S. Saxon probably could implement its NodeInfo this way.

Update: see also Custom tree model.

Monday, 22 March 2010 11:02:07 UTC

Thinking aloud | xslt
