The One Billion Row Challenge: Optimizing CSV Performance

A few weeks ago, Gunnar Morling announced the One Billion Row Challenge, described as follows: “write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!” The challenge was to ingest this billion lines of input, calculate these three aggregates, and format the output in a very specific way, as quickly as possible.

Pansynchro isn’t written in Java, so this can’t be an official entry to Morling’s contest, but it’s exactly the sort of data processing that’s in our wheelhouse: see how quickly we can chew through a lot of data! The data file is formatted as a series of lines of text, with each of the 1 billion lines containing a city name and a temperature value, separated by a semicolon. In other words, it’s a CSV file, but with semicolons instead of commas. This reading scenario is already supported by Pansynchro. So far, so good.
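For a concrete sense of the shape of the data, the first few lines of a measurements file look something like this (the station names and values here are just illustrative):

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8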

Calculating these aggregate values sounds a lot like a SQL query, and at least one developer decided to implement it in SQL. (He claims an optimized time of 25.58 seconds, but with no baseline to compare against this number is kind of meaningless. Hardware can make a huge difference in performance, as we’ll see further on.) Our PanSQL implementation looked fairly similar:

load dataDict from '.\1brc.pansync'
load resultDict from '.\1brc_results.pansync'

open mySource as Files for source with '{ "Files": [ { "Name": "Data", "File": ["E:\\1brc\\measurements.txt"] } ] }'
open myInput as Csv for read with dataDict, 'Delimiter='';''', mySource
open myOutput as Console for write with resultDict, ''

stream data as dataDict.Data
stream result as resultDict.Result

with aggs as (
	select Name, min(Temperature) as minTemp, avg(Temperature) as meanTemp, max(Temperature) as maxTemp
	from data
	group by Name
)
select '{' + string_agg(format('{0}={1:F1}/{2:F1}/{3:F1}', Name, MinTemp, MeanTemp, MaxTemp), ', ') + '}'
from aggs
order by Name
into result

sync myInput to myOutput

The data dicts were trivial: the input defines a single stream with two fields, and the output a single stream with one field. The only real problem was that this solution uses three features that didn’t exist in PanSQL: a CTE, a string_agg aggregate, and a string-formatting function. So we implemented them. These features, along with some other things unrelated to the One Billion Row Challenge, are now available in the v0.2 release.

Once we had that working, it was time to get some numbers. We built the dataset from Morling’s repository and, to get a baseline, ran what was at that point the fastest entry, Roy van Rijn’s solution. (At the time of this writing it’s tied for 2nd place, a tiny fraction of a second behind the first-place entry.) It ran in 3 minutes and 4 seconds on our dev machine, which has an 8-core Ryzen 7 processor and 32 GB of RAM. Morling’s test system is much beefier, a 32-core EPYC Zen2 with 128 GB of RAM, and it runs Roy’s code in about 2.15 seconds. He doesn’t mention storage, but given some of the numbers posted, it seems likely that his beefy test rig also has a significantly faster data transfer rate than our consumer-grade SSD. (See? Hardware matters. A lot! Our run of the same code was approximately 85 times slower.) After building the PanSQL code and running it, our initial attempt took 25 minutes, or roughly 1,500 seconds. (Scale that down by a factor of 85 and you get about 17.65 seconds, which would put our initial work ahead of only the bottom 30% or so of 1BRC entries. Not particularly satisfying.)

PanSQL produces .NET code, so we broke out the JetBrains dotTrace profiler to have a look at what was taking so long. The first thing we found was that a great deal of time was going into splitting each line into the substrings delimited by the semicolons. The CSV-parsing code we had in there was very general-purpose, designed to handle a wide range of variations on CSV input, but this input was very simple, so we added a “happy path” for cases where the more complicated code isn’t necessary. This got us down from 25 minutes to 17:10.
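To illustrate the idea (this is a simplified sketch in C#, not the actual Pansynchro parser, and ParseQuotedLine is a hypothetical stand-in), the fast path amounts to checking whether a line needs the full quote-aware state machine at all, and falling back to a plain Split when it doesn’t:

using System;

static class CsvFastPath
{
    // Simplified sketch: if a line contains no quote characters, skip the
    // general-purpose CSV state machine and just split on the delimiter.
    public static string[] ParseLine(string line, char delimiter)
    {
        if (!line.Contains('"'))
            return line.Split(delimiter);          // happy path: every 1BRC line lands here

        return ParseQuotedLine(line, delimiter);   // slow path: quote/escape-aware parsing
    }

    // Stand-in for the full, general-case parser (quoting, escaped delimiters, etc.).
    private static string[] ParseQuotedLine(string line, char delimiter) =>
        throw new NotImplementedException("general-case parser omitted from this sketch");
}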

After that, the next-biggest “hot spot” was converting the CSV field values into typed input data. For every field of every row, the parser compared the field’s type against a list of possible field types to determine what to do with it. But because the field types are known from the beginning, we rewrote this code to work out how to process each row once, at the start of the file. This shaved over 30% off the processing time, bringing it down to 11:50.
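The shape of that change looks roughly like the following sketch (the type names are ours, not Pansynchro’s internals): instead of checking the field type for every value, build an array of converter delegates once from the known schema, then simply index into it for each field of each row:

using System;
using System.Globalization;

static class RowConverter
{
    // Hypothetical field-type enum standing in for the real schema types.
    public enum FieldType { Text, Int, Double }

    // Built once, at the start of the file, from the schema.
    public static Func<string, object>[] BuildConverters(FieldType[] schema)
    {
        var converters = new Func<string, object>[schema.Length];
        for (int i = 0; i < schema.Length; ++i)
        {
            switch (schema[i])
            {
                case FieldType.Double:
                    converters[i] = s => double.Parse(s, CultureInfo.InvariantCulture);
                    break;
                case FieldType.Int:
                    converters[i] = s => int.Parse(s, CultureInfo.InvariantCulture);
                    break;
                default:
                    converters[i] = s => s;
                    break;
            }
        }
        return converters;
    }

    // Applied per row: no type checks remain in the hot loop.
    public static object[] ConvertRow(string[] fields, Func<string, object>[] converters)
    {
        var row = new object[fields.Length];
        for (int i = 0; i < fields.Length; ++i)
            row[i] = converters[i](fields[i]);
        return row;
    }
}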

Next we tried poking at various optimizations: caching a few things, enlarging the file read buffer, and using some of the Span APIs introduced in .NET 8 that are supposed to provide serious speed gains. Altogether, this gained us only a few seconds, bringing the time to 11:40. A bit disappointing, but better than nothing. It seemed like the CSV parsing was about as fast as we were going to get it. But the CSV parsing accounted for only about 60% of the time reported by the profiler; the rest was the SQL transformation.
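For the specific 1BRC shape (exactly two fields per line, the second one a number), the span-based happy path looks something like this sketch, which slices the line in place and parses the temperature without allocating an intermediate substring:

using System;
using System.Globalization;

static class SpanParsing
{
    // Sketch: parse a "Name;Temperature" line without allocating substrings for the split.
    public static (string Name, double Temperature) ParseMeasurement(ReadOnlySpan<char> line)
    {
        int separator = line.IndexOf(';');
        ReadOnlySpan<char> name = line.Slice(0, separator);
        ReadOnlySpan<char> temp = line.Slice(separator + 1);

        // double.Parse has span-based overloads, so no temporary string is created
        // for the numeric field; only the station name still needs to be materialized.
        return (name.ToString(),
                double.Parse(temp, NumberStyles.Float, CultureInfo.InvariantCulture));
    }
}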

However, the SQL transformation is logically separate from the CSV parsing: it just consumes whatever data the parser produces, without caring about the specifics. That suggested another performance enhancement: pipeline them. Parse the data in its own thread and feed it to the SQL transformer through a channel. This would add a bit of overhead, of course, but we’d make up for it in volume. …or so we thought, until we tried it. It was slow. It was horrendously slow, with the pipeline code adding so much overhead that the total time ballooned to about 27 minutes!
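The first version of the pipeline was essentially the textbook System.Threading.Channels pattern, with the parser writing one row at a time (the names below are our own sketch, not the generated code):

using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

static class RowPipeline
{
    // Producer: runs concurrently with the consumer (e.g. via Task.Run)
    // and pushes each parsed row into the channel.
    public static async Task ProduceAsync(IEnumerable<object[]> rows, ChannelWriter<object[]> writer)
    {
        foreach (var row in rows)
            await writer.WriteAsync(row);   // one channel operation per row; this per-item cost is what hurt
        writer.Complete();
    }

    // Consumer: the SQL transformer reads rows as they arrive.
    public static async Task ConsumeAsync(ChannelReader<object[]> reader, Action<object[]> transform)
    {
        await foreach (var row in reader.ReadAllAsync())
            transform(row);
    }
}

A bounded channel (Channel.CreateBounded<object[]>(capacity)) keeps the parser from racing arbitrarily far ahead of the transformer, but as noted above, paying for a channel operation on every single row swamped any benefit.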

The theory was still sound, but channel operations turned out to carry too much overhead to make it worthwhile to run a billion of them. But what if we ran far fewer, by batching the data up into large blocks? We tested it with a block size of 1,000 (i.e., parse 1,000 rows at a time and deliver them to the SQL transformer as a unit) and saw an immediate improvement: 10 minutes, 23 seconds. So we started experimenting with different block sizes, and the sweet spot seemed to be around 7,000, which yielded a new time of 7:37. (We ended up making the block size configurable, as results can easily vary depending on the nature of the data being read. The pipelining is disabled by default, since you’re unlikely to see any performance gains from it unless you’re reading large CSV files with hundreds of thousands of rows at a minimum.)
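Batching changes very little structurally; the producer just accumulates rows into a block of the configured size and sends the whole array through the channel as a single operation (again a sketch, with a hypothetical blockSize parameter):

using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

static class BatchedRowPipeline
{
    // Producer: one channel write per block of rows instead of one per row.
    public static async Task ProduceAsync(IEnumerable<object[]> rows,
                                          ChannelWriter<object[][]> writer,
                                          int blockSize = 7000)
    {
        var block = new List<object[]>(blockSize);
        foreach (var row in rows)
        {
            block.Add(row);
            if (block.Count == blockSize)
            {
                await writer.WriteAsync(block.ToArray());
                block.Clear();
            }
        }
        if (block.Count > 0)
            await writer.WriteAsync(block.ToArray());   // flush the final partial block
        writer.Complete();
    }

    // Consumer: unpack each block and feed the rows to the transformer one by one.
    public static async Task ConsumeAsync(ChannelReader<object[][]> reader, Action<object[]> transform)
    {
        await foreach (var block in reader.ReadAllAsync())
            foreach (var row in block)
                transform(row);
    }
}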

Rebuilding everything in release mode with full optimizations on brought it down to 6:38, almost a full minute faster. (We hadn’t done this so far in the process because the optimizations can interfere with the accuracy of profiling data.)

And finally, a bit of optimization on the SQL side: profiling showed that a notable amount of time was being spent on the dictionaries holding the aggregate values, looking up each key once to find its current value and then a second time to store the updated value. There’s an obscure, low-level API added in .NET 6 called CollectionsMarshal.GetValueRefOrAddDefault, designed to optimize exactly this use case so that it can be done with only one dictionary lookup. Using it in the aggregator code brought the time down another 48 seconds, to 5:50.
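In aggregator terms, the pattern looks like the following sketch (our own illustration of the technique, not the generated code): instead of a TryGetValue followed by an indexer assignment, GetValueRefOrAddDefault hands back a writable reference into the dictionary’s storage, so the entry can be updated in place after a single lookup:

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Mutable aggregate state for one station; the field names here are ours.
struct StationStats
{
    public double Min, Max, Sum;
    public long Count;
}

static class Aggregator
{
    static readonly Dictionary<string, StationStats> _stats = new();

    public static void Accumulate(string name, double temperature)
    {
        // One hash lookup: returns a ref to the (possibly freshly added) entry.
        ref StationStats s = ref CollectionsMarshal.GetValueRefOrAddDefault(_stats, name, out bool exists);
        if (!exists)
        {
            s.Min = double.MaxValue;
            s.Max = double.MinValue;
        }
        s.Min = Math.Min(s.Min, temperature);
        s.Max = Math.Max(s.Max, temperature);
        s.Sum += temperature;
        s.Count++;
    }
}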

Scale that 5:50 down by a factor of 85 and we’re at about 4.12 equivalent seconds, placing us in the top 15% of Gunnar Morling’s results instead of the bottom 30%. And honestly, that’s good enough. The top-tier contest entries are hyper-specifically optimized for this particular problem, while Pansynchro is a framework designed to solve the general problem of data syncs as quickly as possible. It’s not going to pull out a lot of the tricks that 1BRC contestants used to hammer away at this specific file, because those tricks don’t apply to your data in the same way. That flexibility will always add a bit of overhead, but it turns out to be far less than you’d imagine!

Paul Graham once wrote, citing computing pioneer Bill Woods, that “as a rule of thumb, each layer of interpretation costs a factor of 10 in speed. This extra cost buys you flexibility.” Well, our little PanSQL script just shattered that rule of thumb, running within a factor of 2 of the fastest solution in the contest! We picked .NET so we could build on the years the .NET Core team and community have spent focusing on high performance in the runtime and the core libraries, and these results show just how powerful that can be! If you’re dealing with CSV data, you simply won’t find a faster solution than Pansynchro.