Performance Left on the Table: Precompiled Reporting Queries for Analytics

Transactional vs. Analytical

Traditionally, SQL databases have been designed around what’s known as a “transaction processing” workload. This is the typical database work most developers and DBAs will be familiar with from running an app: creating, looking up, and editing individual items or small batches of items at a time, with a high volume of small transactions occurring more or less constantly. A transactional database needs good solutions to problems like concurrency (two or more transactions trying to work on the same data at the same time) and must be able to perform small inserts and updates very quickly.

But it’s been understood as far back as the 1970s that there’s a second “flavor” of database workload that works very differently: analytical processing. This is the exact opposite of a transactional workload: it involves executing long-running, heavyweight queries over large volumes of data to produce reports. It’s not unheard of for reporting queries to run for hours, or even a day or longer! Oftentimes, though, a good deal of this execution time is not inherent to the nature of the queries being run, but rather is overhead caused by running analytical queries on a database designed and optimized for the transactional workload.

In the age of cloud computing, a variety of cloud-native analytical database services have arisen, using different techniques to make heavy analytical queries easier to work with. Characterized by architectural decisions such as columnar data storage (an on-disc storage style that is slower to insert data into, but can make reading and filtering on a SELECT query much faster) and horizontal scalability (improving speed by running multiple compute nodes against the same data store), analytical databases are designed first and foremost for the sort of heavyweight reporting that transactional databases struggle with.
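
To make the columnar idea concrete, here’s a minimal C# sketch (purely hypothetical; the type names are ours, not any real engine’s) contrasting a row-oriented layout with a column-oriented one. The point is that an aggregate or filter over one column only has to scan one tightly-packed array, instead of touching every field of every row.

using System;

// Hypothetical illustration of row-oriented vs. column-oriented storage.
// Row-oriented: each record keeps all of its fields together.
public readonly record struct PostRow(int Id, string Title, int Score);

// Column-oriented: each column is stored as its own contiguous array.
public sealed class PostColumns
{
    public int[] Ids = Array.Empty<int>();
    public string[] Titles = Array.Empty<string>();
    public int[] Scores = Array.Empty<int>();

    // An aggregate like SUM(Score) only scans the Scores array and never
    // touches the Ids or Titles data at all.
    public long TotalScore()
    {
        long total = 0;
        foreach (var score in Scores) total += score;
        return total;
    }
}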

Ad-hoc vs. Prepared

A typical SELECT query is parsed, planned, executed, and then essentially thrown away by the database. When a query needs to be reused, often with different parameters, it can be prepared, with some of the parsing and planning work cached by the database engine so that it doesn’t have to be redone every time you run the query. This can be a big help on transactional databases, where every millisecond of overhead can add up over the course of a high volume of queries. But neither ordinary query planning nor prepared statements typically spend very long optimizing a query; the idea is that the time spent working out the optimal query plan should never overwhelm the time it saves during execution.
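
For illustration, here’s a minimal sketch of the prepared-statement pattern from client code, using ADO.NET with the Npgsql driver as an assumed example (the connection string, table, and column names are hypothetical): the command is prepared once, then executed repeatedly with different parameter values.

using System;
using Npgsql;

// Hypothetical example: prepare once, execute many times with new parameters.
var connectionString = "Host=localhost;Database=shop"; // hypothetical
await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();

await using var cmd = new NpgsqlCommand(
    "select count(*) from orders where customer_id = @id", conn);
cmd.Parameters.AddWithValue("id", 0);
cmd.Prepare(); // the parse/plan work is cached and reused on later executions

foreach (var customerId in new[] { 1, 2, 3 })
{
    cmd.Parameters["id"].Value = customerId;
    var count = (long)(await cmd.ExecuteScalarAsync() ?? 0L);
    Console.WriteLine($"customer {customerId}: {count} orders");
}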

One notable counterexample is IBM’s DB2, which always does a thorough job of query planning, as an early reviewer of this article pointed out, saying, “DB2 takes no such compromises: its optimizer is really thorough and the compilation can and does [take a] significant amount of time. This is alleviated by aggressive caching (basically every query is internally treated as Prepared Statement) and I have personally witnessed what happens when this cache is flushed by careless creation of many different queries (it brought that app to [a] halt, serious production incident).” Such incidents show the downside of a database dogmatically treating every query as one that needs to be optimized aggressively.

Having said that, in the analytical world where long-running queries abound and users frequently complain about the high compute costs associated with them, a more aggressive approach seems more likely to be helpful than harmful. So several months ago, Pansynchro Technologies began working on an experimental SQL implementation designed to deal with this problem.

Where databases typically look at query optimization from the perspective of relational algebra, as a matter for the query planner to solve by cleverly rearranging filters, indices and joins, we decided to approach it from a compiler-theory perspective instead: could we turn a SELECT query into a traditional program, and let an advanced native code compiler squeeze a high level of performance out of it?

Is It Even Worth Trying?

As far as we can tell, this is not an approach that’s been tried before. We did some research and found no evidence of modern databases using such an approach, so we asked professional developers with a wide range of experience and saw a broad consensus emerge almost immediately: no one does it because it will never work. “It isn’t that big a win by comparison with getting the data movements right (correct indices, best order of joins, that sort of thing) as I/O is the dominant cost in all non-trivial customer queries,” read one reply that was largely representative of the general consensus.

Another user pointed out that this is similar to how it was done in the Bad Old Days of Embedded SQL, stating, “It was horrible and everyone jumped to the CLI (or its refined version, ODBC) as soon as possible … [largely due to] the actual performance in real life. As several people already noted, the [query plan] uses, in the best case, an outdated snapshot of the data statistics (made at the compilation time). Or, in the worst case, does not use any statistics at all (as was common in 1980s). That is BAD, because the actual optimal Query changes quite significantly during application’s lifetime; ESPECIALLY if the database starts empty (as it usually does) or with non-representative data.”

An interesting analysis. Still, we remained unconvinced. I/O optimization is definitely a very important component of database query time, but in analytical queries there’s typically also a significant amount of processing going on, such as computing groups, aggregates, and window functions, or evaluating one-to-many or many-to-many joins. Are hours-long reports really spending all that time reading data from the disc? Of course not!

You Use Different Moves

In the classic adventure film The Princess Bride, there’s a sequence where the Man In Black has to fight Vizzini’s two henchmen, and then face off with Vizzini himself, to rescue the kidnapped princess. The second fight is against the giant Fezzik, played by pro wrestler Andre the Giant, and when he begins to lose the match, Fezzik comes to a realization: he’s used to fighting groups of people and is out of practice fighting against a single opponent. “You use different moves when you’re fighting half a dozen people, than when you only have to be worried about one,” he says.

While the movie plays this as a humorous excuse for how he could possibly lose an unarmed fight against a man half his size, the principle espoused here is very relevant in the software world. It brings to mind language developer Eric Lippert’s discussion of different programming languages being used for different purposes:

Let’s take JavaScript for example. (I worked on the original versions of JScript at Microsoft from 1996 through 2001.) The by-design purpose of JavaScript was to make the monkey dance when you moused over it. Scripts were often a single line. We considered ten line scripts to be pretty normal, hundred line scripts to be huge, and thousand line scripts were unheard of. The language was absolutely not designed for programming in the large, and our implementation decisions, performance targets, and so on, were based on that assumption.

This sounds a lot like the transactional SQL workflow. It typically happens one small INSERT, UPDATE, or DELETE at a time, or with a relatively simple SELECT query with, at most, one or two JOINs and a single GROUP BY. And it should come as no surprise that under the hood, a transactional database looks very similar to a scripting language.

Let’s look at PostgreSQL. (We’re not trying to insult or demean Postgres here; it’s a great database that’s generally considered one of the very best at what it does, and for good reasons! We’re simply using it because 1) it’s extremely popular and thus likely to be familiar to our audience, 2) the source code is freely available to examine, and 3) large analytical queries are generally understood to be an issue for Postgres.) Every datum in Postgres — the individual numbers, strings, boolean values, and other single points of data — is represented by a single C type called Datum. The source file that defines it contains over 500 lines of code and comments explaining how a Datum works and providing routines to convert between Datum and specific data types.

Representing every piece of data with a “uni-type” like this is a classic scripting language technique. It simplifies a lot of work that the script compiler has to do, at the expense of execution efficiency, which isn’t a very significant expense at all when you’re dealing with small volumes of data. With a larger program, though, the picture looks very different. (This is why people working on “big data” code in Python tend to hand the number-crunching off to external libraries written in high-performance statically-typed programming languages; it’s just impossible to get good performance in pure Python with all of the scripting-language overhead.)

Any time Postgres has to perform any calculations on data, such as evaluating JOIN, ORDER BY, and GROUP BY criteria, or loading values into an aggregate, it has to take the Datum values and “unpack” them into real C data types. When an operation produces new data, such as a SQL function, aggregate, or window function, the results have to be converted back into Datum values for the SQL processor to work on. It’s as if there are two different “realms” of code, the Land of SQL and the Land of Computation, and any trip across the border between these two realms requires a stop at Passport Control. The stops may be relatively brief, but they add up over time. This process is referred to as marshalling, and performance-minded developers try to avoid it whenever possible.
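
As a rough analogy in C# (hypothetical code, not Postgres’s actual implementation), here’s what computation over a uni-type looks like: every value arrives as a boxed object, and each operation has to unpack it before doing any real work.

using System;
using System.Collections.Generic;

// Hypothetical analogy only; not Postgres's actual code.
public static class UniTypeDemo
{
    // Every value is a boxed "datum" (object), so the aggregate has to stop
    // at Passport Control (unbox/convert) for every single value it touches.
    public static long SumScores(IEnumerable<object?[]> rows, int scoreColumn)
    {
        long total = 0;
        foreach (var row in rows)
        {
            object? datum = row[scoreColumn];
            if (datum is null) continue;        // SQL NULL
            total += Convert.ToInt64(datum);    // per-value marshalling
        }
        return total;
    }
}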

Strongly-Typed SQL

In our project, which we’re calling DataMountain, we took a different approach. Every datum is strongly typed throughout the system. Every tuple has a well-defined struct type, and every table is defined as a collection of the strongly-typed tuples that make up its rows.
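
As a minimal sketch of what that looks like (the names here are hypothetical, not DataMountain’s actual generated types), each tuple is a struct with real field types and a table is just a collection of those structs, so an aggregate reads native values directly, with no unpacking step.

using System.Collections.Generic;

// Hypothetical sketch of a strongly-typed tuple and table.
public readonly record struct PostTuple(int Id, int Score, string? Tags);

public sealed class PostsTable
{
    public List<PostTuple> Rows { get; } = new();

    // select count(*) from Posts where Tags is not null
    public long CountTagged()
    {
        long n = 0;
        foreach (var row in Rows)
            if (row.Tags is not null) n++;
        return n;
    }
}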

How is this possible? One of the textbook characteristics of dynamically-typed programming languages is working with types that aren’t known until compile-time is over, and that is definitely an apt description of a SQL query: the database code has no idea, ahead of time, what the tuples in the end user’s database will look like.

We resolve this dilemma using metaprogramming and code generation: between the Roslyn compiler being available at runtime and the .NET Core runtime’s Reflection capabilities, it’s possible to generate and interact with new strongly-typed .NET types on the fly, a capability we take advantage of extensively in DataMountain. Ad-hoc queries are translated into strongly-typed .NET code and executed in-memory. There is no “datum” uni-type that data needs to be marshalled to and from in order to process and perform computations on it.
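
Here’s a rough sketch of that technique in isolation (this is not DataMountain’s actual code generator; the generated source below is hand-written for illustration and requires the Microsoft.CodeAnalysis.CSharp package): build C# source for a query, compile it in-memory with Roslyn, load the resulting assembly, and invoke it through Reflection.

using System;
using System.IO;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Rough sketch only: generate C# for a query, compile it in-memory, and run it.
var source = @"
public static class GeneratedQuery
{
    // Generated body for: select count(*) from Posts where Tags is not null
    public static long CountTagged(string[] tags)
    {
        long n = 0;
        foreach (var t in tags)
            if (t != null) n++;
        return n;
    }
}";

var coreDir = Path.GetDirectoryName(typeof(object).Assembly.Location)!;
var compilation = CSharpCompilation.Create(
    "GeneratedQuery",
    new[] { CSharpSyntaxTree.ParseText(source) },
    new[]
    {
        MetadataReference.CreateFromFile(typeof(object).Assembly.Location),
        MetadataReference.CreateFromFile(Path.Combine(coreDir, "System.Runtime.dll")),
    },
    new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

using var ms = new MemoryStream();
if (!compilation.Emit(ms).Success)
    throw new InvalidOperationException("code generation failed");

// Load the freshly compiled assembly and call into it via Reflection.
var assembly = Assembly.Load(ms.ToArray());
var count = assembly.GetType("GeneratedQuery")!
                    .GetMethod("CountTagged")!
                    .Invoke(null, new object[] { new string?[] { "c#;.net", null, "sql-server" } });
Console.WriteLine(count); // prints 2

In DataMountain, the generated source is of course produced from the parsed SQL rather than written by hand; the sketch just shows the compile-and-load step.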

But this compilation process comes with overhead of its own. In our testing, compiling even the simplest SQL queries tends to cost between 0.1 and 0.2 seconds of CPU time. Definitely not what you want in a transactional workflow of thousands of sub-second queries, but when dealing with queries that run for minutes or hours? It’s negligible.

Prepared Reports

There’s one downside to using the Roslyn code generator: it relies on the .NET runtime’s JIT compiler for native code generation, and the JIT is designed to compile quickly rather than to search for maximum performance. For the best performance, then, we have to go one step further. So we extended the SQL engine’s syntax for prepared queries with the “REPORT” keyword:

prepare report MyReport as select * from MyTable;

Unlike a traditional prepared query, which lasts only for the duration of your session, a prepared Report is produced by generating code for the query, feeding it into a high-performance native code compiler, and saving the result as a persisted part of the database.

But Is It Any Good?

Let’s look at some numbers. For this test, we used Brent Ozar’s 2021 StackOverflow data dump, a 411 GB monster of a SQL Server database containing hundreds of millions of rows of data. For a baseline, we ran a very simple analytical query. The Posts table contains all of the posts, both questions and answers, in StackOverflow’s history up to that point. There are several different kinds of posts, but only questions have tags. So we asked SQL Server, select count(*) from Posts where Tags is not null. On our dev box, it spun for 217 seconds (3 minutes and 37 seconds) before returning a result of 20,890,055. (20.8 million questions out of 52.1 million total posts.)

Then we imported the database’s contents into DataMountain, using its built-in support for the Pansynchro protocol for bulk loading. The load ran for 6622 seconds (1:50:22), and resulted in 216 GB of on-disc data, including a primary key index on every table, the same as the SQL Server database had. The same query shown above, executed as an ad-hoc query in DataMountain, ran for 30.5 seconds and returned the same result, 20,890,055, on the same hardware as the SQL Server database.

But we already know that analytical databases will run heavyweight queries faster than a transactional database like SQL Server. This isn’t what we’re trying to test here. The next step, the moment of truth, was to build the query as a prepared report and execute it.

This time, DataMountain produced the same result as the other two in 23.2 seconds, an improvement of approximately 24% over the already-fast ad-hoc analytical query engine!

(This query is largely for illustrative purposes. It could, of course, be optimized pretty aggressively by a query planner if the column in question was indexed. But when you’re running a large report involving multiple joins, for example, you can end up running calculations on a large volume of intermediate data that isn’t indexable. Our intention here is to show that the DataMountain architecture can help a lot with such number-crunching work.)

The process of compiling the query into an optimized report took 14.5 seconds, roughly twice the amount of time saved by the optimized query. This is not automatically a net gain, which is why it’s not the default for DataMountain query execution. But the cost of compiling the report only has to be paid once. For large business-intelligence or data-mining reports that you’re likely to run over and over (once a day or once a week, for example), we believe this technique shows that DataMountain is able to capture a significant amount of performance that other databases leave on the table. And in the cloud, where you’re billed by the second for compute resources, time literally is money.
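
To put a rough number on that break-even point, using the timings above:

30.5 s (ad-hoc) - 23.2 s (compiled report) = 7.3 s saved per run
14.5 s (one-time compilation) / 7.3 s ≈ 2 runs to come out ahead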

Moving Forward

DataMountain is still a largely experimental project. At the time of this writing, large swathes of SQL remain unimplemented, including window functions, JSON data operators, and the bulk of DDL operations. The query planner also remains largely unimplemented: aside from basic constant folding and the simplest, quickest-and-dirtiest pushdown algorithm we could come up with (to ensure that index-based and columnar filtering stay viable on as many queries as possible, and that joins run in a reasonable order), we’ve made no attempt to transform or optimize the queries fed into it. It hasn’t even been run through a profiler yet!

And it’s still very, very fast. At least on simple queries. We believe that the benchmark shown here, and other results like it, vindicate our approach, and that we can continue to squeeze out even better performance as we focus on building a proper query planner and statistics system to deal with optimizing complicated queries. At Pansynchro Technologies, everything we do is designed to save our users time and money. If you’d like to try DataMountain out and see if it can help you, check it out at [not available yet.]

A Target To Reach For

An early reviewer of this paper pointed out some similar testing that ClickHouse has done on a comparably-sized StackOverflow data dump, which ran significantly faster (1.013 seconds) on a query that was likewise a table scan plus aggregation, in their case over the Votes table. It’s not entirely clear how the hardware ClickHouse ran on compares to the somewhat meager dev box we used for our benchmarks. They describe it as “a 96 GiB, 24 vCPU ClickHouse Cloud cluster,” which presumably means a cluster with 96 GiB of RAM and 24 vCPUs in total, distributed across an unknown number of virtual machines.

As noted above, the DataMountain PoC is largely unoptimized. Among other things, this means that everything here, so far, is running single-threaded on a single machine. This will not remain the case for long.

And while Amdahl’s Law tells us that it’s never quite this simple when dealing with parallelism, it’s worth noting that the time achieved by this 24-vCPU solution is approximately 1/24 of our completely unoptimized time. Because a table scan and aggregate is a highly parallelizable problem by nature, this suggests that it should be quite doable for DataMountain to achieve similar performance, and perhaps even surpass it.
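
As a rough back-of-the-envelope check (treating the scan as perfectly parallelizable, which Amdahl’s Law warns it won’t quite be), take our 23.2-second compiled-report time from above:

23.2 s / 24 ≈ 0.97 s, which is right in the neighborhood of ClickHouse’s reported 1.013 s.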

Let’s call that a target to reach for. Hopefully we’ll be able to report back in a future post that we did it!

Expectation

With our current pace of progress, we expect to be able to launch a public beta of DataMountain in Q1 of 2025. Watch this space for updates!