We face the task of parsing reports produced by legacy applications and converting them into a structured form, e.g. XML. These XML files can then be processed with up-to-date tools to produce good-looking reports.
The reports at hand vary widely in structure and in size: from a couple of KB to several GB. The good news is that they are mostly tabular, so it's easy to imagine a specific parser for each report type.
Our goal is to create an environment where less qualified people can create and manage such parsers, engaging a specialist only rarely, for the less trivial cases.
Our analysis has shown that such a parser can be written in almost any language: XSLT, C#, Java.
Our approach was to create XML schema annotations that on one side define a data structure and on the other map the report layout. We can then create an XSLT that generates an XSLT, C#, or Java parser according to the schema definitions. Thanks to languages xom, which provides an XML Object Model and serialization stylesheets for C# and Java, it does not really matter whether we generate XSLT or C#/Java, as the code will look the same.
The approach we're going to use to describe reports is not as powerful as conventional parsers. Its virtue, however, is the simplicity of the specification.
Consider a sample report (the data to extract is in bold):
1 TITLE ... PAGE: 1
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
XXX 065 065 CLR ...
YYY ...
...
1 TITLE ... PAGE: 2
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
AAA NNN MMM PPP ...
BBB ...
...
* * * * * E N D O F R E P O R T * * * * *
We approach the report through a sequence of views (filters). Each view localizes some report data, either for subsequent filtering or for the extraction of final data.
Looking at the example, one can build the following views of the report:
A sequence of filters lets us build a pipeline of transformations of the original text. It also lets us generate clean XSLT, C#, or Java code to parse the data.
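To make the idea of chained views more concrete, here is a minimal hand-written sketch in Java. It is not the generated parser and not taken from report-mapping.xsd; the class name ReportViews, its method names, and the layout rules are assumptions made for illustration only: a page is assumed to start at a line whose first character is "1", the data is assumed to follow the dashed separator line, and fields are assumed to be separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: each method is one "view" (filter) over the report text. */
final class ReportViews {
    /** View 1: split the report into pages. Assumes a page starts at a line beginning with '1'. */
    static List<List<String>> pages(List<String> lines) {
        List<List<String>> pages = new ArrayList<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.startsWith("1")) {      // assumed page-start marker
                current = new ArrayList<>();
                pages.add(current);
            }
            if (current != null) {           // lines before the first page are ignored
                current.add(line);
            }
        }
        return pages;
    }

    /** View 2: keep only the lines of a page that follow the dashed column separator. */
    static List<String> body(List<String> page) {
        for (int i = 0; i < page.size(); i++) {
            if (page.get(i).trim().startsWith("-----")) {
                return page.subList(i + 1, page.size());
            }
        }
        return List.of();                    // no separator found: no data on this page
    }

    /** View 3: split each data line into whitespace-separated fields. */
    static List<String[]> rows(List<String> body) {
        List<String[]> rows = new ArrayList<>();
        for (String line : body) {
            String trimmed = line.trim();
            // Skip blank lines and the "* * * E N D ..." banner.
            if (!trimmed.isEmpty() && !trimmed.startsWith("*")) {
                rows.add(trimmed.split("\\s+"));
            }
        }
        return rows;
    }
}
```

Each view takes the output of the previous one, so extracting the rows of the first page reads as ReportViews.rows(ReportViews.body(ReportViews.pages(lines).get(0))). Real reports with fixed-width columns would likely need position-based field extraction instead of whitespace splitting.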
At first, our favorite language for such a parser was XSLT. Unfortunately, we're dealing with the Saxon XSLT implementation, which is not very strong at streaming processing. Without a couple of extension functions to prevent caching, it tends to cache the whole input in memory, which is not acceptable.
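The memory concern is what matters for reports of several GB: the parser should hold only the current line, never the whole input. The following Java sketch illustrates that style under stated assumptions. It reads the report line by line and writes XML through the JDK's XMLStreamWriter, so neither input nor output is accumulated in memory. The class name StreamingReportParser, the element names report, row and field, and the layout rules (page start at a line beginning with "1", data after the dashed separator) are hypothetical and do not come from our schema annotations.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical sketch of a streaming parser: one line in memory at a time. */
final class StreamingReportParser {
    /** Reads the report line by line and writes one row element per data line. */
    static void parse(Reader in, Writer out) {
        try {
            BufferedReader reader = new BufferedReader(in);
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("report");      // assumed element name
            boolean inBody = false;               // true once the dashed separator was seen
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (line.startsWith("1")) {
                    inBody = false;               // assumed page start: header lines follow
                } else if (trimmed.startsWith("-----")) {
                    inBody = true;                // data lines follow the separator
                } else if (inBody && !trimmed.isEmpty() && !trimmed.startsWith("*")) {
                    xml.writeStartElement("row");
                    for (String field : trimmed.split("\\s+")) {
                        xml.writeStartElement("field");
                        xml.writeCharacters(field);
                        xml.writeEndElement();
                    }
                    xml.writeEndElement();
                }
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.flush();
        } catch (IOException e) {
            // Checked exceptions are wrapped only to keep the sketch easy to call.
            throw new UncheckedIOException(e);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In practice the Reader would wrap a file stream and the Writer an output file, so memory use stays roughly constant regardless of the report size.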
For now we have decided to start with C# code, which is, naturally, pure C#.
The code is still in development, but we would already like to share the XML schema annotations describing the report layout, report-mapping.xsd, and a sample report description, test.xsd.