Network Protocols: The Hidden Transformation

We speak of ETL and ELT: data is (E)xtracted from a source, (L)oaded into a destination, and (T)ransformed from one representation to another somewhere along the way. It’s a bit of an oversimplification, but not a particularly large one, to say that the basic work of ETL is to copy a database from the source to the destination.

But we’re not literally making a copy of the database. That can be done, for almost any database, by taking a backup and restoring it, but no one does ETL that way, for plenty of good reasons. Instead, we query the database to Extract the data, possibly Transform it along the way, and then Load it into the destination. And in that intermediate state, the data has to be represented in some way.

If you’re just using a hand-rolled script, this could be as simple as reading the data into a Pandas dataframe and then writing it out to the destination. In a more sophisticated system, where the reader and the writer are not running on the same hardware, the data has to be written out in some form onto the network connection between the two. Either way, this constitutes a form of transformation: the data changes from its representation in the database to some other form, designed for holding it in memory or sending it across the Internet. And transforming data always takes some amount of CPU time and effort.
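As a minimal sketch of the hand-rolled version (the connection strings, table, and column here are all hypothetical, and we’re assuming SQLAlchemy for connectivity):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical source and destination databases.
source = create_engine("postgresql://user:pass@source-host/appdb")
dest = create_engine("postgresql://user:pass@dest-host/warehouse")

# Extract: the query result is materialized as a DataFrame -- the data has
# already been transformed once, into Pandas' in-memory representation.
df = pd.read_sql("SELECT * FROM orders", source)

# Transform: any cleanup happens on that in-memory representation.
df["order_date"] = pd.to_datetime(df["order_date"])

# Load: writing the DataFrame out transforms it again, into whatever
# format the destination driver sends over the wire.
df.to_sql("orders", dest, if_exists="replace", index=False)
```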

The representation your data takes on the network is its wire format, or protocol, and it can make or break your data sync performance. The larger and more complex that representation is, the more CPU work it takes to encode and decode, and the more time and bandwidth it takes to transfer over the network.
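To make that concrete, here’s a toy measurement (not a benchmark of any particular product) that encodes the same rows two ways: as JSON objects, and as a fixed-width binary layout:

```python
import json
import struct
import time

# 200,000 rows of a hypothetical orders table: (order_id, customer_id, amount).
rows = [(i, 42, 19.99) for i in range(200_000)]

# Encode as a JSON array of objects, field names and all.
t0 = time.perf_counter()
as_json = json.dumps(
    [{"order_id": a, "customer_id": b, "amount": c} for a, b, c in rows]
).encode("utf-8")
t1 = time.perf_counter()

# Encode as fixed-width binary: two 32-bit ints and one 64-bit float per row.
as_binary = b"".join(struct.pack("<iid", *row) for row in rows)
t2 = time.perf_counter()

print(f"JSON:   {len(as_json):>12,} bytes in {t1 - t0:.3f}s")
print(f"binary: {len(as_binary):>12,} bytes in {t2 - t1:.3f}s")
```

The byte counts tell most of the story: every one of those bytes has to be produced on one end, carried over the network, and parsed again on the other.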

For example, some systems encode your data in a JSON-based protocol. And don’t get us wrong, JSON is a great format… for its intended use case. The purpose of JSON is to move small amounts of data around in a human-readable way that’s also relatively easy for a machine to parse. But ETL jobs don’t deal with small amounts of data! If you have gigabytes or terabytes of data to move, nobody is going to load it into a text editor and look it over. So why expend the time and bandwidth of, say, putting a field name in front of every single value, repeated over and over again for millions or billions of rows, all for the benefit of nonexistent human readers?
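To put a number on that repetition (the table and field names below are made up), here’s how much of a JSON payload is nothing but key names:

```python
import json

# One row of a hypothetical orders table, repeated 100,000 times.
template = {"order_id": 0, "customer_id": 42, "amount": 19.99, "status": "shipped"}
rows = [dict(template, order_id=i) for i in range(100_000)]

payload = json.dumps(rows).encode("utf-8")

# Every key costs its own length plus two quotes and a colon -- in every row.
per_row_key_bytes = sum(len(key) + 3 for key in template)
key_bytes = per_row_key_bytes * len(rows)

print(f"payload:   {len(payload):,} bytes")
print(f"key names: {key_bytes:,} bytes "
      f"({100 * key_bytes / len(payload):.0f}% of the payload)")
```

On this toy table, roughly half the payload is field names that a schema-aware binary format wouldn’t send at all.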

At Pansynchro Technologies, we’ve done it differently. The Pansynchro protocol is designed from the ground up for efficiency and obsessively optimized. All data is written in a low-overhead binary format, with various filters applied to intelligently simplify certain common patterns, to represent the data in as few bytes as possible. It’s then compressed to reduce the network payload even further. This is how we’re able to copy the same data with a reduction of over 77% in bandwidth and 97% in processing time compared to a leading JSON-based protocol.
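The details of the Pansynchro wire format aren’t reproduced here, but the general techniques it combines are easy to sketch: pack each row into a compact binary layout, apply a filter that simplifies a common pattern (here, delta-encoding a monotonically increasing ID column, chosen purely for illustration), and compress the result:

```python
import struct
import zlib

# A generic sketch of the techniques described above -- NOT the actual
# Pansynchro format: compact binary encoding, a pattern-simplifying
# filter, and compression on top.

rows = [(i, 42 + i % 7, 19.99 + i * 0.01) for i in range(100_000)]

buf = bytearray()
prev_id = 0
for order_id, customer_id, amount in rows:
    # Filter: delta-encode the ID column. A long run of tiny deltas
    # compresses far better than full-width, ever-changing values.
    buf += struct.pack("<iid", order_id - prev_id, customer_id, amount)
    prev_id = order_id

# Compress to shrink the network payload even further.
compressed = zlib.compress(bytes(buf), 6)
print(f"binary: {len(buf):,} bytes  compressed: {len(compressed):,} bytes")
```

Each layer compounds with the next: the binary layout removes the per-field overhead, the filter turns noisy columns into highly repetitive bytes, and the compressor exploits that repetition.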

If your data is in the cloud, you’re paying for all that compute and data transfer. If you’d like to cut your bill down significantly, have a look at Pansynchro.