Building A Simple Data Pipeline Part 2: Network Sync

In the first part, we built a very simple, local data sync script. But most data sync work isn’t running locally; you typically want to copy data across a network connection. And since network bandwidth is a significant bottleneck in moving large amounts of data around, you want to make sure that the network data can be encoded as efficiently and compactly as possible. In this tutorial, we’ll look at how this is done in PanSQL.

First, open up your Analyze script from the previous tutorial. On the line analyze myInput as myDataDict, add the words with optimize to the end of the line. The with optimize directive is optional; it means that the analyzer will spend some additional time analyzing the content of the data in addition to reading the schema, looking for data patterns that the Pansynchro network protocol can use to transmit your data more efficiently. (The optimization pass can take a while to run if your database is large, but it should only have to be done once.)

Use the PanSQL executable to rebuild your analyzer and run it. While the optimizer is running, let’s work on the sync scripts.

Now make a copy of your sync script file. Call one of the copies Server.pansql and the other Client.pansql. Open up the Server script in your text editor. On the third line, where you’re opening a write connection to your destination database, change the type to Network and the connection string to '127.0.0.1'. (The connection string for the network server connection is the URL it should listen at. For most network setups, telling it to run on localhost is just fine. If this won’t work for your system, use the appropriate IP address or host name.)

Open the Client script now, and replace the second line (the reader) with a Network type. Set as its connection string the IP address or host name of the computer you’re going to run the server on. If you’re running both the client and the server on the same system, because this is just a simple tutorial, this can be '127.0.0.1' too.

And that’s it. Once you’ve got these scripts saved, go ahead and build them. (Make sure the analyzer is finished preparing the optimized data dictionary first, of course!) Then run the server, then run the client, and Pansynchro will run a network sync of your data. (If your destination database is not empty because you ran the sync in the previous tutorial, you might want to go in and truncate everything first, to have a clean slate.)

Of course, a lot of data sync work isn’t as simple as just copying data straight across as-is. Oftentimes you’ll want to transform it in some way. In the next tutorial we’ll look at how to write SQL-based transformations in your PanSQL scripts.

A few notes:

  • The server must be running before the client. If the client tries to run and can’t connect to the server, it will fail with an error.
  • When run, the server will sit around waiting for a client to connect, sync data to it once, and then terminate. Building a server that will run multiple sync jobs can be done, but it’s outside the scope of this tutorial.
  • If you put the client on a different machine from the server, make sure the server’s firewall is configured to allow incoming TCP connections on port 24601.