Recently our colleague turned to us and asked to help to deal with some complex query.
It has turned out that the complex part was to understand what he wants to achieve.
After listening to him we have forumulated the task in our words and have confirmed that that is what he wants.
So, that's the task in our formulation:
Data is defined like this:
create table dbo.Event ( EventID bigint not null, AccountID varchar(18) not null, primary key(EventID, AccountID) );
Requested query turned out to be very simple, yet, not as simple as one would think to account big amout of data:
with D as ( select * from dbo.Event ), S as ( select EventID, count(*) Items, checksum_agg(checksum(AccountID)) Hash from D group by EventID ) select S1.EventID, S2.EventID from S S1 inner join S S2 on S1.EventID < S2.EventID and S1.Items = S2.Items and S1.Hash = S2.Hash and not exists ( select AccountID from D where EventID = S1.EventID except select AccountID from D where EventID = S2.EventID );
The idea is to:
Even simpler solution that does not use hashes is not scaleable, as it's performance is slower than O(N^2), where N - is a number of events. It has unacceptable time with N ~1e4, nothing to say about N ~1e7.
At this point our colleague was already satisfied, as he got result in couple of minutes for a task that he could not even formalize as SQL.
But we felt it could be even better.
We looked at statistics:
with D as ( select * from dbo.Event ), S as ( select EventID, count(*) Items from D group by EventID ) select Items, count(*) EventCount from S group by Items order by EventCount desc;
and have seen that most of the events, about 90%, deal with single account, and all other with two and more (some of them act upon big number of accounts).
The nature of the dataset gave us a hint of more verbose but more fast query:
with D as ( select * from dbo.Event ), S as ( select EventID, min(AccountID) AccountID, count(*) Items, checksum_agg(checksum(AccountID)) Hash from D group by EventID ) select S1.EventID, S2.EventID from S S1 inner join S S2 on S1.EventID < S2.EventID and S1.Items = 1 and S2.Items = 1 and S1.AccountID = S2.AccountID union all select S1.EventID, S2.EventID from S S1 inner join S S2 on S1.EventID < S2.EventID and S1.Items > 1 and S2.Items > 1 and S1.Items = S2.Items and S1.Hash = S2.Hash and not exists ( select AccountID from D where EventID = S1.EventID except select AccountID from D where EventID = S2.EventID );
This query produced results in twenty seconds instead of couple of minutes for a dataset with ~1e7 rows.