Groups in SQL - Nesterovsky bros

Wednesday, 05 August 2020

Groups in SQL

Recently our colleague turned to us and asked to help to deal with some complex query.

It has turned out that the complex part was to understand what he wants to achieve.

After listening to him we have forumulated the task in our words and have confirmed that that is what he wants.

So, that's the task in our formulation:

Assume you have events.
Each event acts upon one or more accounts.
Find all events that act on the same set of accounts.
Note we deal with mutiple millions of events and accounts.

Data is defined like this:

create table dbo.Event
(
  EventID bigint not null,
  AccountID varchar(18) not null,
  primary key(EventID, AccountID)
);

Requested query turned out to be very simple, yet, not as simple as one would think to account big amout of data:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    count(*) Items,
    checksum_agg(checksum(AccountID)) Hash
  from
    D
  group by
    EventID
)
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items = S2.Items and
    S1.Hash = S2.Hash and
    not exists
    (
      select AccountID from D where EventID = S1.EventID
      except
      select AccountID from D where EventID = S2.EventID
    );

The idea is to:

calculate a hash derived from list of accounts for each group;
join groups with the same hash;
verify that matched groups fit perfectly.

Even simpler solution that does not use hashes is not scaleable, as it's performance is slower than O(N^2), where N - is a number of events. It has unacceptable time with N ~1e4, nothing to say about N ~1e7.

At this point our colleague was already satisfied, as he got result in couple of minutes for a task that he could not even formalize as SQL.

But we felt it could be even better.

We looked at statistics:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    count(*) Items
  from
    D
  group by
    EventID
)
select
  Items, count(*) EventCount
from
  S
group by
  Items
order by
  EventCount desc;

and have seen that most of the events, about 90%, deal with single account, and all other with two and more (some of them act upon big number of accounts).

The nature of the dataset gave us a hint of more verbose but more fast query:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    min(AccountID) AccountID,
    count(*) Items,
    checksum_agg(checksum(AccountID)) Hash
  from
    D
  group by
    EventID
)
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items = 1 and
    S2.Items = 1 and
    S1.AccountID = S2.AccountID
union all
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items > 1 and
    S2.Items > 1 and
    S1.Items = S2.Items and
    S1.Hash = S2.Hash and
    not exists
    (
      select AccountID from D where EventID = S1.EventID
      except
      select AccountID from D where EventID = S2.EventID
    );

This query produced results in twenty seconds instead of couple of minutes for a dataset with ~1e7 rows.

Wednesday, 05 August 2020 07:44:07 UTC

Comments [0] -
SQL Server puzzle | Thinking aloud | Tips and tricks

Comments are closed.