# Thursday, 19 January 2012

While reviewing some SQL we realized that it could be considerably optimized.

Consider a table source like this:

with Data(ID, Type, SubType) as
(
  select 1, 'A', 'X'
  union all
  select 2, 'A', 'Y'
  union all
  select 3, 'A', 'Y'
  union all
  select 4, 'B', 'Z'
  union all
  select 5, 'B', 'Z'
  union all
  select 6, 'C', 'X'
  union all
  select 7, 'C', 'X'
  union all
  select 8, 'C', 'Z'
  union all
  select 9, 'C', 'X'
  union all
  select 10, 'C', 'X'
)

Suppose you want to group the data by Type, count the number of rows in each group, and display the SubType only if all rows in a group share the same SubType.

Earlier we wrote the code like this:

select
  Type,
  case when count(distinct SubType) = 1 then min(SubType) end SubType,
  count(*) C
from
  Data
group by
  Type;

Namely, we select min(SubType) provided that there is a single distinct SubType; otherwise null is shown. That works perfectly, but algorithmically count(distinct SubType) = 1 has to build a set of distinct values for each group just to check the size of that set. That is expensive!
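For the sample data above the query returns the following (our own illustration of the expected output, not part of the original post):

Type  SubType  C
----  -------  -
A     NULL     3
B     Z        2
C     NULL     5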

What we want can be expressed differently: if min(SubType) and max(SubType) are the same then we display that value, otherwise we show null.

That's the new version:

select
  Type,
  case when min(SubType) = max(SubType) then min(SubType) end SubType,
  count(*) C
from
  Data
group by
  Type;

Such a simple rewrite has radically simplified the execution plan:

Execution plans

Another bizarre problem we have discovered is that SQL Server 2008 R2 just does not support the following:

select
  count(distinct SubType) over(partition by Type)
from
  Data

That's really strange, but it's a known limitation (see Microsoft Connect).
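For completeness, when a distinct count per partition is really needed, a well-known workaround is to combine two dense_rank() calls. The following sketch is ours and is not part of the original Connect item; note that it counts distinct non-null values only:

select
  Type,
  SubType,
  dense_rank() over(partition by Type order by SubType) +
    dense_rank() over(partition by Type order by SubType desc) - 1 DistinctSubTypes
from
  Data;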

Thursday, 19 January 2012 21:12:11 UTC  #    Comments [0] -
SQL Server puzzle | Tips and tricks
# Friday, 13 January 2012

A database we support for a client contains multi-billion row tables. Many users query the data from that database, and it is constantly populated with new data.

Every day we load several million rows of new data. Such loads can lock tables for a considerable time, so our loading procedures collect new data into intermediate tables and insert it into the final destination in chunks, usually after working hours.

SQL Server supports the READ_COMMITTED_SNAPSHOT database option. This feature trades locks for an increased tempdb size (to store row versions) and possible performance degradation during a transaction.
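For reference, switching a database to this option looks roughly like this; the database name below is a placeholder, and the with rollback immediate clause is only needed when other connections are active:

alter database MyDatabase
set read_committed_snapshot on
with rollback immediate;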

When we switched the database to that option we did not notice any considerable performance change. Encouraged, we decided to increase the size of the data chunks we insert at once.

Earlier we had found that when we insert no more than 1000 rows at once, users don't notice any impact, but for bigger chunk sizes users start to complain about performance degradation. This probably happened due to lock escalation.

Now, with chunks of 10,000 or even 100,000 rows, we have found that no queries became slower, while the load process became several times faster.
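A minimal sketch of such a chunked load follows; the staging and target table names and columns are placeholders rather than our real schema:

declare @chunk int = 10000;

while 1 = 1
begin
  -- Move one chunk of rows from the staging table into the target table.
  delete top (@chunk)
  from
    dbo.Staging
  output deleted.ID, deleted.Payload into dbo.Target(ID, Payload);

  -- Stop when the staging table is exhausted.
  if @@rowcount = 0
    break;
end;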

We were ready to pay with increased tempdb and transaction log sizes for better performance, but in our case we did not even approach the limits set by the DBA. Another gain is that we can now load data at any time, which keeps the stored data more up to date.

Friday, 13 January 2012 13:43:56 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks
# Saturday, 17 December 2011

Yesterday, by accident, we came across an article about some design principles of the V8 JavaScript Engine. It clarified what techniques are used in today's script implementations.

In particular, the V8 engine optimizes property access using "dynamically created hidden classes". These are structures that store an object's layout; they are derived when a new property is created (or deleted) on the object. When code accesses a property, and a cached hidden class for the object is available at that code point, the access time is comparable to that of native fields.

In our opinion this tactic might lead to a proliferation of such dynamic hidden classes, which requires considerable housekeeping and also slows property write access, especially when a property is written for the first time.

We would like to suggest a slightly different strategy, one that exploits cache hits and does not require dynamic hidden classes.

Consider an implementation data type with the following characteristics:

  • object is implemented as a hash map of property id to property value: Map<ID, Value>;
  • it stores data as an array of pairs and can be accessed directly: Pair<ID, Value> values[];
  • property index can be acquired with a method: int index(ID);

Pseudo code for the property access looks like this:

pair = object.values[cachedIndex];

if (pair.ID == propertyID)
{
  // Cache hit: the cached index still points at the requested property.
  value = pair.Value;
}
else
{
  // Cache miss: look the property up and refresh the cached index.
  cachedIndex = object.index(propertyID);
  value = object.values[cachedIndex].Value;
}

This approach brings us back to a dictionary-like implementation, but with the important optimization of array-speed access when the property index is cached, and with no dynamic hidden classes.

Saturday, 17 December 2011 10:18:14 UTC  #    Comments [0] -
Thinking aloud
# Saturday, 10 December 2011

@michaelhkay Saxon 9.4 is out.

But why doesn't the author state that the HE version is still xslt/xpath 2.0, given that neither xslt maps nor function items are supported?

Saturday, 10 December 2011 12:16:28 UTC  #    Comments [0] -
Thinking aloud | xslt
# Saturday, 03 December 2011

Recently, we found and reported a bug in SQL Server 2008 (see SQL Server 2008 with(recompile), and also Microsoft Connect).

The persons responsible for the bug evaluation closed it as "By Design". This strange resolution, in our opinion, says something about those persons only.

Well, we shall try once more (see Microsoft Connect). We have posted another trivial demonstration of the bug, where we show that option(recompile) is not used, which leads to a table scan (nothing worse can happen for a huge table).

Saturday, 03 December 2011 15:06:44 UTC  #    Comments [0] -
SQL Server puzzle | Thinking aloud
# Friday, 18 November 2011

Recently we introduced a stored procedure into production and found that it performs incredibly slowly.

Our reasoning and tests in the development environment did not reveal any problem at all.

In essence that procedure executes a SELECT and returns a status as a single output variable. The procedure receives several input parameters, and the SELECT statement uses the option(recompile) query hint to optimize performance for the specific parameters.

We analyzed the execution plan of that procedure and found that it works as if the option(recompile) hint were not specified. Without that hint the SELECT fails to use an index seek and uses an index scan instead.

What we have lately found is that the same SELECT performs very well when it produces a result set instead of reading the result into a variable.

We think that this is a bug in SQL Server 2008 R2 (and in SQL Server 2008).

To demonstrate the problem you can run this test:

-- Setup
create table dbo.Items
(
  Item int not null primary key
);
go

insert into dbo.Items
select 1
union all
select 2
union all
select 3
union all
select 4
union all
select 5
go

create procedure dbo.GetMaxItem
(
  @odd bit = null,
  @result int output
)
as
begin
  set nocount on;

  with Items as
  (
    select * from dbo.Items where @odd is null
    union all
    select * from dbo.Items where (@odd = 1) and ((Item & 1) = 1)
    union all
    select * from dbo.Items where (@odd = 0) and ((Item & 1) = 0)
  )
  select @result = max(Item) from Items
  option(recompile);
end;
go

create procedure dbo.GetMaxItem2
(
  @odd bit = null,
  @result int output
)
as
begin
  set nocount on;

  declare @results table
  (
    Item int
  );

  with Items as
  (
    select * from dbo.Items where @odd is null
    union all
    select * from dbo.Items where (@odd = 1) and ((Item & 1) = 1)
    union all
    select * from dbo.Items where (@odd = 0) and ((Item & 1) = 0)
  )
  insert into @results
  select max(Item) from Items
  option(recompile);

  select @result = Item from @results;
end;
go

Test with output into a variable:

declare @result1 int;

execute dbo.GetMaxItem @odd = null, @result = @result1 output

Execution plan of dbo.GetMaxItem

Test that reads the result through a table variable instead of directly into an output variable:

declare @result2 int;

execute dbo.GetMaxItem2 @odd = null, @result = @result2 output

Execution plan of dbo.GetMaxItem2

Now you can see the difference: the first execution plan uses startup expressions, while the second optimizes away the execution branches that are not really used. In our case this was crucial, as the execution time difference was minutes (and potentially more in the future) versus a split second.

See also Microsoft Connect Entry.

Friday, 18 November 2011 14:49:50 UTC  #    Comments [0] -
SQL Server puzzle | Tips and tricks
# Thursday, 10 November 2011

A bit of history: the first release of this solution was about 9.5 years ago...

Today we ran into a strange situation. One of our clients asked us about automatic conversion of mainframe data (defined as COBOL copybooks) into XML or Java/.NET objects. In response to our suggestion to use eXperanto, which is well known to him, he stated that he would not like to use a tool from a company that no longer exists...

The situation, in our opinion, becomes even stranger when you consider the following:

  • eXperanto (the design-time tool and run-time libraries for Java and .NET) was developed, well tested, and delivered by us to production several years ago.
  • the client bought this set (the tool and libraries).
  • the set is still in production in another big company, and is used from time to time by our own company in different migration projects.
  • the client talks with the developers of this tool and its run-time libraries, and he knows about this fact.
  • the client widely uses open source solutions, even ones without a dedicated vendor or support warranties.
Thursday, 10 November 2011 22:14:09 UTC  #    Comments [2] -
.NET | Java | Thinking aloud
# Friday, 28 October 2011

It so happened that we had never worked with jQuery, although we were aware of it.

In early 2000 we developed a web application that contained rich JavaScript APIs, including UI components. Later we actively practiced ASP.NET, and later still JSF.

At present, looking at jQuery more closely, we regret that we failed to start using it earlier.

The separation of business logic and presentation is remarkable when one uses JSON web services. In fact the server part can be seen as a set of web services representing the business logic, plus a set of resources: html, styles, scripts, and others. Neither ASP.NET nor JSF approaches such a consistent separation.

The only trouble, in our opinion, is that jQuery has no standard data binding: a way to bind JSON data to (and from) html controls. The technique that will probably be standardized is called jQuery Templates or JsViews.

Unfortunately, after reading about this binding API, and being in love with XSLT and XQuery, we just want to cry. We don't know what the best solution for the task would be, but what we see looks uncomfortable to us.

Friday, 28 October 2011 22:59:23 UTC  #    Comments [0] -
ASP.NET | JSF and Facelets | Thinking aloud | Tips and tricks | xslt
# Sunday, 16 October 2011

Incidentally, we have found another implementation of yield return in Java that is still in development. Sources can be found at https://github.com/peichhorn/lombok-pg/zipball/master. Just to be sure, we have copied those sources to another place: peichhorn-lombok-pg-0.10.0-39-g384fb7b.zip (you may search for "yield" in the archive).

It's broken according to the issue tracker, but the funny thing is that the sources, however different, still resemble our yield return implementation (Yield.jar, Yield.3.7.jar - Indigo, Yield.zip - sources) very much: variable names, error messages, algorithmic structure.

Those programmers have probably forgotten good manners: to reference the base work, at least.

Well, we generously forgive them this blunder.

P.S. Our implementation, in contrast, works without bugs.

P.P.S. The misunderstanding has been resolved. See the comments.

Sunday, 16 October 2011 13:49:33 UTC  #    Comments [3] -
Java | Thinking aloud
# Thursday, 29 September 2011

A couple of weeks ago we suggested introducing an enumerator function into XPath (see [F+O30] A enumerator function):

I would like the WG to consider the addition of a function that turns a sequence into an enumeration of values.

Consider a function like this:  fn:enumerator($items as item()*) as function() as item()?;

alternatively, the signature could be:

 fn:enumerator($items as function() as item()*) as function() as item()?;

This function receives a sequence and returns a function item which, upon its Nth call, returns the Nth element of the original sequence. This way, a sequence of items is turned into a function providing an enumeration of the items of the sequence.

As an example consider two functions:

a) t:rand($seed as xs:double) as xs:double* - a function producing a random number sequence;
b) t:work($input as element()) as element() - a function that generates output from its input, and that needs random numbers in the course of its execution.

t:work() may contain code like this:
  let $rand := fn:enumerator(t:rand($seed)),

and later it can call $rand() to get random numbers.

Enumerators will help to compose algorithms in which one algorithm communicates with other independent algorithms, thus making the code simpler. The most obvious class of enumerators is generators: ordered numbers, unique identifiers, random numbers.

Technically, the function returned from fn:enumerator() is nondeterministic, but its "side effect" is similar to the "side effect" of the generate-id() function applied to a newly created node (see bug #13747 and bug #13494).

The idea is inspired by a generator function, which returns a new value upon each call.

Such a function can be seen as a stateful object, but our goal is to look at it in a more functional way. So we look at the algorithm as a function that produces a sequence of output, which is purely functional, and at an enumerator that allows iterating over the algorithm's output.

This way, the function that implements an algorithm and the function that uses it can be seen as two threads of functional programs that use messaging to communicate with each other.

Honestly, we doubt that the WG will accept it, but it's interesting to watch the discussion.

Thursday, 29 September 2011 11:56:05 UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, 14 September 2011

More than a month has passed since we reported a problem to the Saxon forum (see Saxon optimizer bug and Saxon 9.2 generate-id() bug).

The essence of the problem is that we constructed an argumentless function that returns a unique identifier each time the function is called. To achieve this we created a temporary node and returned its generate-id() value.

Such a function is nondeterministic, as we cannot state that its result depends on its arguments only. This means that the engine's optimizer is not free to reorder calls to such a function. That's what happens in Saxon 9.2 and Saxon 9.3, where the engine hoists the function call out of a loop, thus producing invalid results.

Michael Kay, the author of the Saxon engine, argued that this is "a gray area of the xslt spec":

If the spec were stricter about defining exactly when you can rely on identity-dependent operations then I would be obliged to follow it, but I think it's probably deliberate that it currently allows implementations some latitude, effectively signalling to users that they should avoid depending on this aspect of the behaviour.

He advised raising a bug in the W3C Bugzilla to resolve the issue. In the end two related bugs were raised:

  • Bug 13494 - Node uniqueness returned from XSLT function;
  • Bug 13747 - [XPath 3.0] Determinism of expressions returning constructed nodes.

Yesterday, the WG has resolved the issue:

The Working Group agreed that default behavior should continue to require these nodes to be constructed with unique IDs. We believe that this is the kind of thing implementations can do with annotations or declaration options, and it would be best to get implementation experience with this before standardizing.

This means that the technique we used to generate unique identifiers is correct and the behaviour is well defined.

The only problem now is to wait until Saxon fixes its behaviour accordingly.

Wednesday, 14 September 2011 05:54:56 UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, 06 September 2011

We're not big fans of Entity Framework, as we don't expose the database structure directly to the client program, but rather access it through stored procedures and functions. So EF for us is a tool to expose those stored procedures as .NET wrappers. Even this limited use of EF greatly automates the data access code.

But what we have lately found is that EF has a problem with char parameters. Namely, if you import a procedure, say MyProc, that accepts char(1), and then call it through the generated wrapper, you will see in SQL Profiler that the char(1) parameter is passed with many trailing spaces, as if it were char(8000). There is no need to prove that this is highly inefficient.
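To put the issue in context, the shape of such a procedure is roughly as follows; this is our reconstruction for illustration only, since the post does not show the actual definition:

create procedure Data.MyProc
(
  @recipientType char(1)
)
as
begin
  set nocount on;

  -- The body is irrelevant here; the point is the char(1) parameter,
  -- which EF passes padded to a much bigger length.
  select @recipientType RecipientType;
end;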

We can see that the problem happens in the VS 2010 designer rather than in the EF runtime, as the procedure's parameters are not attributed with a length; see the model xml (*.edmx):

<Function Name="MyProc" Schema="Data">
  ...
  <Parameter Name="recipientType" Type="char" Mode="In" />
  ...
</Function>

while if we set:

  <Parameter Name="recipientType" Type="char" MaxLength="1" Mode="In" />

the runtime starts working as expected. So the workaround is to fix the model file manually.

See also: Stored Proc and Char parm

Tuesday, 06 September 2011 21:11:38 UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Monday, 29 August 2011

Please welcome a new human being Masha Vladimirovna Nesterovsky!

Masha

Monday, 29 August 2011 13:54:22 UTC  #    Comments [0] -
Announce
# Sunday, 28 August 2011

AjaxControlToolkit has methods to access ViewState:

protected V GetPropertyValue<V>(string propertyName, V nullValue)
{
  if (this.ViewState[propertyName] == null)
  {
    return nullValue;
  }

  return (V) this.ViewState[propertyName];
}

protected void SetPropertyValue<V>(string propertyName, V value)
{
  this.ViewState[propertyName] = value;
}

...

public bool EnabledOnClient
{
  get { return base.GetPropertyValue("EnabledOnClient", true); }
  set { base.SetPropertyValue("EnabledOnClient", value); }
}

We find that code unnecessarily complex and suboptimal. Our code to access ViewState looks like this:

public bool EnabledOnClient
{
  get { return ViewState["EnabledOnClient"] as bool? ?? true; }
  set { ViewState["EnabledOnClient"] = value; }
}

Sunday, 28 August 2011 19:35:13 UTC  #    Comments [0] -
ASP.NET | Tips and tricks

Recently one of the users of the Java yield return annotation kindly informed us about a problem that occurred in his environment (see Java's @Yield return annotation update).

Incidentally, we had never noticed the problem before. Along with this issue we found that the Eclipse compiler has changed in Indigo in a way that forced us to recompile the source. Well, that's the price you pay when you access an internal API.

Updated sources can be found at Yield.zip, and compiled jars at Yield.jar (pre-Indigo), and Yield.3.7.jar (Indigo and probably higher).

See also:

Yield return feature in java
Why @Yield iterator should be Closeable
What you can do with jxom.

Sunday, 28 August 2011 19:11:45 UTC  #    Comments [0] -
Announce | Java | xslt