We migrate code from the mainframe (MF) to Azure. The tool we use produces plain, functionally equivalent C# code. But it turns out that's not enough!
So, what's the problem?
The converted code is very slow, especially in batch processing: where the MF completes a job in, say, 30 minutes, the converted code finishes in 8 hours.
At this point someone usually appears and whispers in your ear:
Look, those old technologies are time-proven. It's worth sticking to good old COBOL, or better yet to Assembler, if you want to do the real thing.
We're curious though: why is there a difference?
It turns out the issue lies in the difference in network topology between the MF and Azure solutions. On the MF, all programs, the database, and the file storage virtually sit in a single box, so network latency is negligible.
It's quite common to see chatty SQL programs on the MF that issue a lot of small SQL queries.
In Azure, programs, the database, and file storage are different services, almost certainly sitting in different physical boxes. You should be thankful if they are co-located in a single datacenter. So network latency immediately becomes a factor. Even if it adds just 1 millisecond per SQL round trip, that adds up in loops and turns into a showstopper.
There is no easy workaround on the hardware level.
People advise writing programs differently: "Tune applications and databases for performance in Azure SQL Database".
That's good advice for new development, but discouraging for a migration done by a tool.
So, what is the way forward?
Well, there is one. While accepting Azure's weak sides, we can exploit its strong sides.
Before continuing, let's consider code demonstrating the problem:
public void CreateReport(StringWriter writer)
{
  var index = 0;

  foreach(var transaction in dataService.
    GetTransactions().
    OrderBy(item => (item.At, item.SourceAccountId)))
  {
    var sourceAccount = dataService.GetAccount(transaction.SourceAccountId);
    var targetAccount = transaction.TargetAccountId != null ?
      dataService.GetAccount(transaction.TargetAccountId) : null;

    ++index;

    if (index % 100 == 0)
    {
      Console.WriteLine(index);
    }

    writer.WriteLine($"{index},{transaction.Id},{transaction.At},{transaction.Type},{transaction.Amount},{transaction.SourceAccountId},{sourceAccount?.Name},{transaction.TargetAccountId},{targetAccount?.Name}");
  }
}
This loop queries transactions and, for each transaction, issues two more small queries to get the source and target accounts. The results are printed into a report.
If we assume a latency of just 1 millisecond per query and run this code for 100K transactions, we easily reach 200+ seconds of execution time: 100,000 iterations × 2 queries × 1 ms is 200 seconds of pure waiting.
Reality turns out to be much worse: the program spends most of its lifetime waiting for database results, and an iteration doesn't advance until all work of the previous iteration is complete.
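For concreteness, here is a minimal sketch of the data model and service the example assumes; the types and signatures below are hypothetical, inferred from the call sites, and the real project's model may differ:

using System;
using System.Collections.Generic;

// Hypothetical shapes, inferred from the call sites in the example.
public record Transaction(
  string Id,
  DateTime At,
  string Type,
  decimal Amount,
  string SourceAccountId,
  string? TargetAccountId);

public record Account(string Id, string Name);

public interface IDataService
{
  // One big query streaming all transactions.
  IEnumerable<Transaction> GetTransactions();

  // One small query per call; two such calls per loop iteration are
  // exactly what makes the program chatty.
  Account? GetAccount(string accountId);
}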
We can do better even without rewriting all the code! Let's articulate our goals: speed the program up while keeping the converted code recognizable and its observable behavior unchanged.
The idea is to form two processing pipelines: (a) a parallel one and (b) a serial one.
Each pipeline may post sub-tasks to the other, so (a) runs its tasks in parallel, unordered, while (b) runs its tasks as if everything were running serially.
So the parallel plan is: run the slow per-iteration queries through the parallel pipeline, and post the order-dependent report writing to the serial pipeline.
To reduce the burden of managing task pipelines, we use Dataflow (Task Parallel Library) and encapsulate everything in a small wrapper.
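To make the idea concrete, here is one plausible way to build such a wrapper on top of Dataflow. This is a simplified sketch, not the project's actual Parallel class: a TransformBlock runs bodies concurrently, and since dataflow blocks propagate outputs in input order by default (EnsureOrdered), a single-threaded ActionBlock downstream executes the serial continuations as if everything ran serially.

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// A sketch of a two-pipeline runner built on TPL Dataflow (hypothetical;
// see the project for the real wrapper).
public sealed class TwoStagePipeline<T>
{
  private readonly TransformBlock<T, Action> parallelStage;
  private readonly ActionBlock<Action> serialStage;

  public TwoStagePipeline(int parallelism, Func<T, Action> body)
  {
    // Parallel stage: runs up to `parallelism` bodies at once; outputs
    // still leave the block in input order (EnsureOrdered defaults to true).
    parallelStage = new TransformBlock<T, Action>(
      body,
      new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = parallelism });

    // Serial stage: executes continuations one at a time, in order.
    serialStage = new ActionBlock<Action>(action => action());

    parallelStage.LinkTo(
      serialStage,
      new DataflowLinkOptions { PropagateCompletion = true });
  }

  // Feed one item into the parallel stage.
  public void Post(T item) => parallelStage.Post(item);

  // Signal the end of input and wait for the serial stage to drain.
  public async Task CompleteAsync()
  {
    parallelStage.Complete();
    await serialStage.Completion;
  }
}

With such a runner, the parallel body fetches the accounts and returns the report-writing closure, and the closures execute in the original transaction order even though the fetches complete out of order. The project's Parallel class wraps a similar idea in a richer API.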
Consider the refactored code:
public void CreateReport(StringWriter writer)
{
  using var parallel = new Parallel(options.Value.Parallelism); // ⬅️ 1

  var index = 0;

  parallel.ForEachAsync( // ⬅️ 2
    dataService.
      GetTransactions().
      OrderBy(item => (item.At, item.SourceAccountId)),
    transaction => // ⬅️ 3
    {
      var sourceAccount = dataService.GetAccount(transaction.SourceAccountId);
      var targetAccount = transaction.TargetAccountId != null ?
        dataService.GetAccount(transaction.TargetAccountId) : null;

      parallel.PostSync( // ⬅️ 4
        (transaction, sourceAccount, targetAccount), // ⬅️ 5
        data =>
        {
          var (transaction, sourceAccount, targetAccount) = data; // ⬅️ 6

          ++index;

          if (index % 100 == 0)
          {
            Console.WriteLine(index);
          }

          writer.WriteLine($"{index},{transaction.Id},{transaction.At},{transaction.Type},{transaction.Amount},{transaction.SourceAccountId},{sourceAccount?.Name},{transaction.TargetAccountId},{targetAccount?.Name}");
        });
    });
}
Consider the ⬅️ points:
Parallel (⬅️ 1) - the wrapper instance, created with a configured degree of parallelism and disposed at the end of the method.
parallel.ForEachAsync() (⬅️ 2) - runs the loop body (⬅️ 3) for each transaction in the parallel, unordered pipeline.
parallel.PostSync() (⬅️ 4) - posts a task to the serial pipeline; the state (⬅️ 5) is passed explicitly and deconstructed inside the task (⬅️ 6), and serial tasks run in the original iteration order (see the API sketch below).
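Putting these points together, the wrapper's surface as used in the example looks roughly like this (hypothetical signatures, inferred from the call sites; the real Parallel class lives in the GitHub project):

using System;
using System.Collections.Generic;

// Inferred API surface of the wrapper; names and signatures are guesses.
public interface IParallelRunner: IDisposable
{
  // Runs body for each item of source in the parallel, unordered pipeline.
  void ForEachAsync<T>(IEnumerable<T> source, Action<T> body);

  // Posts a task to the serial pipeline; state is passed explicitly, and
  // serial tasks run in the original iteration order.
  void PostSync<T>(T state, Action<T> action);
}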
What we achieve with this refactoring: the per-transaction queries now run in parallel, while report lines are still written in the original order, so the output is identical to the serial version.
We think such refactored code is still recognizable.
As for performance, this is what the log shows (roughly a 16× speedup):
Serial test 100 ...
Execution time: 00:01:33.8152540

Parallel test 100 ...
Execution time: 00:00:05.8705468
Please take a look at the project to understand the implementation details, in particular the Parallel class, which implements the API to post parallel and serial tasks, run loops, and more.
Please continue reading on GitHub.