Building A Simple Data Pipeline Part 1: Your First Sync

A Pansynchro data pipeline has three major components: a reader that ingests the data from the source, a writer that loads it into the destination, and a data dictionary that describes the data to be synced. All three can be set up using our simple scripting system, known as PanSQL.

Before writing any PanSQL scripts, you’ll need three things set up:

  1. A source database with data in it.
  2. A destination database with the same schema as your source database. This should be empty.
  3. The .NET 8 SDK needs to be installed on your computer. (Don’t worry, you don’t need to know how to program in .NET to use PanSQL, but the script system does use .NET under the hood. Go ahead and install it if you don’t already have it.)

The first thing to do is to prepare the data dictionary, as it needs to be ready before setting up the reader or the writer. You can get started by copying the following code block into your text editor of choice:

open myInput as <TYPE> for analyze with '<CONNECTION STRING>'
analyze myInput as myDataDict
save myDataDict to '\Pansynchro\DataDictionary.pansync'

Where it says <TYPE>, replace it with the name of your database type for your source database. Pansynchro currently supports the following databases (type name on the left):

Firebird   -- Firebird (server or embedded)
MSSQL      -- Microsoft SQL Server
MySql      -- MySql and MySql-compatible databases such as MariaDB
Postgres   -- PostgreSQL
Snowflake  -- Snowflake data cloud
Sqlite     -- SQLite database files

Likewise, replace <CONNECTION STRING> with the ADO.NET database connection string for the database type in question. The exact format of the connection strings differs from one database to another, but you should be able to find help for how to build one at connectionstrings.com. (Be aware that, if you’re using PostgreSQL, the “Npgsql” style connection string is required.) The connection string should be written inside of ‘single quotes’, just like a string in a SQL script.

This script is pretty straightforward:

  1. It opens a database connection for analyze, meaning that it connects an analyzer to the database rather than a reader or a writer, and names the connection myInput.
  2. It runs the analyze command on myInput, producing a data dictionary named myDataDict. The data dictionary contains information about the structure of your database.
  3. It saves the data dictionary to the specified filename.

Once you have the script ready, save it in a new folder called Pansynchro, as analyze.pansql. Now that you have the script set up, it’s time to build it. Go ahead and grab the latest PanSQL release from GitHub, and unzip it on your computer. Once you have the PanSQL executable unzipped to some place, open a command prompt in that folder and tell it to run the PanSQL executable on your scripts. Assuming you put your scripts in a folder called Pansynchro, at the root of your drive, it would look like this:

Windows PowerShell: .\PanSQL.exe C:\Pansynchro\analyze.pansql
Bash: ./PanSQL /Pansynchro/analyze.pansql

This will compile the script into .NET code for you, then build the code into a functioning executable, which it should place under the Build subdirectory. Go ahead and run the analyzer executable. It should very quickly analyze your database, then drop the data dictionary file in the specified location.

Now let’s write the sync script. Open a new text file. This one should look like this:

load myDataDict from '\Pansynchro\DataDictionary.pansync'
open myInput as <TYPE> for read with myDataDict, '<CONNECTION STRING>'
open myOutput as <TYPE> for write with myDataDict, '<CONNECTION STRING>'
sync myInput to myOutput

Again, substitute in types and connection strings as appropriate. The source and destination databases should have identical schemas. (For the moment. We’ll get into transformations a bit later on.) Like the first script, this is very straightforward:

  1. Load the data dictionary produced by the Analyze script into a variable in memory.
  2. Open a connection to the source database. This one is opened for read, designating it as a reader. Unlike the analyzer connection, this requires a data dictionary as well as the connection string.
  3. Open a connection to the destination database, designated for write. This works just like the reader.
  4. The final line does exactly what it says: performs the sync from the reader to the writer.

Save the script alongside the Analyze script, as sync.pansql. This script assumes that both your source and destination databases are local; it’s not likely to give the best performance if you want to send lots of data across the Internet. We’ll get to that later on; right now we’re just looking at how to write a PanSQL script. Run the PanSQL executable again, but this time use *.pansql for the filename. It will build all of your PanSQL scripts into their own projects and drop them in the Build subdirectory.

Once you have the sync executable built, run it. And… that’s it. Sit back and relax while Pansynchro makes a copy of your database. Not bad for just seven lines of script code, huh? Of course, this isn’t doing all that much yet. In the next tutorial we’ll look into copying data across a network connection.

(NOTE: It’s important to build and run the Analyze script before trying to build the sync script. The load command in the sync script will actually load the data dictionary into the script compiler to verify your data — this is important later on when you start writing transformations — and if the data dictionary is not present, the build will fail with an error.)