ETL Hello World!

One of the simplest possible ETL tasks is to clone a database. Extract and load, with no transforms, from a database filled with data to an empty one with an identical schema. Many ETL jobs are just this simple, and even if yours isn’t, it provides a good baseline.

How does Pansynchro stack up against the competition? For our benchmark we used Brent Ozar’s StackOverflow Medium (2013) data dump, which has enough data to provide a significant test for the system, but not so much that it will take forever to run. The source and destination databases are MS SQL Server, both residing on the same machine. A sync job was configured to copy the entire database from the data dump to the blank destination, then run. We benchmarked the same sync job on Airbyte, SSIS, and Pansynchro.

The test was run on a Windows 10 dev machine, with an AMD Ryzen 7 2700X 8-core CPU and 32 GB of RAM, with the database installed locally on an SSD drive. No attempts were made to fine-tune any of these systems for optimal performance. This is a raw, out-of-the-box, simplest-possible-setup performance test.

Airbyte

Airbyte uses a two-step sync system, encoding data as JSON to transfer it and storing the JSON-encoded data in intermediary tables on your destination database, then running DBT to decode the JSON back to your destination data. Running a sync on this StackOverflow database, Airbyte took 15,594 seconds (4 hours, 19 minutes, 54 seconds) to transfer the data, and measured the raw data transfer at 31.5 GB. The DBT decoding phase took an additional 22,780 seconds (6 hours, 19 minutes, and 40 seconds), for an overall sync time of 38,374 seconds (10 hours, 39 minutes, and 34 seconds).

After running, the quality of the resulting data is… questionable. Rather than dropping the data in the existing tables with identical layouts, Airbyte created its own tables under a new airbyte schema. It appears that Airbyte’s table creation code and DBT scripts make no attempt to faithfully preserve the original schema of your data; instead opting for a “close enough” representation. For example, all integer types were converted to bigint, and textual types to nvarchar(max). Datetime values were also converted to nvarchar(max) rather than parsed from their JSON string values back to SQL datetimes!

SQL Server Integration Services

SSIS has a reputation for being fast, so it would be expected to beat Airbyte’s time, right? As it turns out, yes, it did. After configuring an SSIS package to transfer all tables, it took 3,910 seconds (1 hour, 5 minutes, 10 seconds) to run the sync, with no intermediate tables. Unfortunately, SSIS does not provide any way to audit the total data transfer size. As the sync used existing tables, data types were preserved faithfully.

Pansynchro

On empty destinations, Pansynchro copies data directly; otherwise it loads data to intermediate tables, then merges the data to the destination tables, because profiling has shown this actually improves performance slightly vs. direct loading to the destination. This project was set up as a client and a server, sending data from the server to the client using the Pansynchro binary protocol. This took 1,078 seconds (17 minutes, 58 seconds) to complete, and measured the raw data transfer at 7.04 GB. Again, the sync used existing tables, so all data types were preserved faithfully.

Native vs. WSL performance

To be entirely fair to Airbyte, the system we ran this benchmark on was a Windows box. Airbyte doesn’t support Windows, so this benchmark was run under WSL, which could easily have added some level of overhead to the process. Having said that, there are two important points to be aware of.

  1. Any WSL overhead would have been negligible on the DBT side of things. This mostly consists of running some very heavyweight SQL queries on the server to execute the transformations, and the server is not running under WSL.
  2. Airbyte is written primarily in Java, with the DBT work implemented in Python scripts. Both Java and Python are cross-platform languages, and in fact Java’s cross-platform nature has been one of its biggest selling points since the very beginning. It’s not clear why Airbyte shouldn’t be able to run under Windows, but they don’t appear to have any support for it; their own documentation says you need to run under WSL.

Pansynchro, on the other hand, is a fully cross-platform system. Just to get an apples-to-apples comparison, we ran the Pansynchro sync again under WSL so it would have the same handicap Airbyte was presumably operating under. Surprisingly, it took 1,077 seconds, almost identical to the Windows duration. So it would seem that the effect of WSL on performance was not significant afterall.

Parallel vs. Serial performance

Pansynchro’s outperforming of SSIS by over 3x is particularly noteworthy, given that SSIS was copying all nine tables in parallel, whereas the Pansynchro protocol is a purely serial system. So not only did Pansynchro beat SSIS handily, it did so with most of its CPU cores proverbially tied behind its back! This does inspire some further research into possible ways to speed up Pansynchro’s performance even further by running multiple data streams in parallel. Because Pansynchro’s binary encoder uses a 2-stage parallel pipeline internally, it should be possible to efficiently run up to (# of cores / 2) streams at the same time.

The Pansynchro protocol has gone through several revisions before it was ready for public release. At the time of this writing, it’s currently running on version 6. Parallelization will likely be the main focus of the upcoming version 7. Watch this space for updates.

Conclusion

Pansynchro ran the sync job 3.63x faster than SSIS and 35.6x faster than Airbyte, with 4.47x less data transfer than Airbyte.

That’s right. Pansynchro outperformed Airbyte by a factor of 35, and even beat “ETL world record” framework SSIS by a factor of 3.5 on its own home turf!

The numbers speak for themselves. Even when we go out of our way to not disadvantage the competition, Pansynchro handily outperforms both systems. Simply put, Pansynchro is the fastest way to run your data sync jobs, hands down.