Data Factory is a fantastic integration tool built into the Azure platform that helps with the movement and transformation of data between locations. It participates in the ETL pathway as highlighted below in this reference architecture shared by Microsoft.

For most data analysts working solely in Power BI, it’s a natural next step to encounter Data Factory and other technologies that sit ‘upstream’ of the Power BI ecosystem. Learning how to use these upstream tools only enhances our ability to solve problems and deliver best-in-class solutions.

I’ll be introducing a series of posts over the next year that focus on different ‘patterns’ in Data Factory. These patterns are the foundation for much larger and more complex solutions, which depend on strong, durable foundations.

The pattern today is relatively simple: we are moving tables from a SQL data source to Azure Data Lake Storage. It is best captured by the highlighted part of the reference diagram.

When getting started with Data Factory, it’s worth noting that ADF now exists in two independent locations in the Azure Portal: as a stand-alone product and embedded within Azure Synapse.

For the purposes of this post we could use either one, as both offer a nearly identical feature set and are arguably the same program under the hood. That said, the version in Synapse is perhaps more future-proof, as Synapse continues to be shaped into a ‘one-stop shop’ for data.

So here’s what we are trying to accomplish:

Move 3 tables from Azure SQL to Azure Data Lake Storage (Gen2) and save as Parquet files.

The goal of this post is to focus on orchestrating this movement using a control table, so we are going to keep things simple (hence the ‘series’ of posts to come).

For every orchestration that takes place in Data Factory – there are 3 required components:

  • A Linked Service, which establishes a connection to the data location
  • A Dataset, which represents the data to be handled
  • A Pipeline, which executes the operations required to ‘move’ the data
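
For reference, each of these components is just a JSON definition under the hood. As a rough sketch, an Azure SQL linked service might look something like the following (the name and connection string here are placeholders, and the exact properties depend on how you authenticate):

{
	"name": "AzureSqlSource",
	"properties": {
		"type": "AzureSqlDatabase",
		"typeProperties": {
			"connectionString": "Server=tcp:<your-server>.database.windows.net,1433;Database=<your-database>;"
		}
	}
}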

Given these required components, things in Data Factory can escalate QUICKLY if we just start creating each representative component for our simple 3-table move. Here’s what that might look like.

This ad-hoc setup requires 6 datasets, 3 pipelines, and 2 linked services. That may not seem like much, but imagine what it would look like if our table count grew to 100!

Alternatively, Data Factory provides a really handy wizard for setting up these components as part of a simple copy activity (which is essentially what we plan to do). It provides a step-by-step flow for creating and selecting the representative pieces. Here’s what part of that wizard flow looks like on the ‘source’ side of our 3-table move.

We aren’t going to look at each step along the way, but the output of the copy wizard is a MUCH cleaner result: 2 datasets, 2 linked services, and 1 pipeline.

What the copy wizard does really well is ‘parameterize’ the datasets, which reduces the number required to just 2. Parameterizing a dataset means we no longer hard-code its values; instead, they are passed in at runtime.

This is what the Azure SQL Dataset looks like once parameterized by the copy wizard:
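
Under the hood, that parameterized dataset is also just JSON. A simplified sketch (the names below are illustrative rather than exactly what the wizard generates) looks something like this, with the schema and table properties pointing at dataset parameters instead of hard-coded values:

{
	"name": "SourceDataset_sql",
	"properties": {
		"type": "AzureSqlTable",
		"linkedServiceName": {
			"referenceName": "AzureSqlSource",
			"type": "LinkedServiceReference"
		},
		"parameters": {
			"Schema": { "type": "string" },
			"Tablename": { "type": "string" }
		},
		"typeProperties": {
			"schema": { "value": "@dataset().Schema", "type": "Expression" },
			"table": { "value": "@dataset().Tablename", "type": "Expression" }
		}
	}
}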

The downside of the copy wizard appears when we need to change something. Let’s say we want to add another table from the same database to the activity. Where would this be done?

Well, it’s not immediately clear, but after clicking on the pipeline background we can pull up the pipeline parameter menu and see that the copy wizard dropped our hard-coded values here.

That means any changes we want to make to our pipeline must be made within this ‘default value’ box holding an array of values. Here’s what the array text looks like when it is formatted and pasted into a separate area:

[
	{
		"source": {
			"schema": "dbo",
			"table": "Address"
		},
		"destination": {
			"fileName": "dboAddress.parquet"
		}
	},
	{
		"source": {
			"schema": "SalesLT",
			"table": "Customer"
		},
		"destination": {
			"fileName": "SalesLTCustomer.parquet"
		}
	},
	{
		"source": {
			"schema": "SalesLT",
			"table": "Product"
		},
		"destination": {
			"fileName": "SalesLTProduct.parquet"
		}
	}
]

While it’s certainly ‘doable’ to edit and update the information in this array to add or remove a table, it can be a bit daunting as the sole means of managing this activity.

Control Table to the rescue!

The idea of a control table is to provide an orchestration framework for our orchestration tool. Think of it as sitting on top of our entire process and feeding the necessary information to the appropriate parts.

We are going to apply a similar framework to what we saw happen automatically in the Copy Activity Wizard – except now the values will be kept in a table rather than in a pipeline array parameter. In this way – when we want to make changes to the Data Factory pipeline – we only need to change the values in the table.

Let’s convert our example from above into a Control Table managed process.

What’s in a Control Table?

A control table can contain as much information as we want! There is no ‘perfect’ structure, and control tables can vary based on the activities they need to orchestrate. They can be built in real time during pipeline execution or constructed beforehand, as we will do in this exercise. The beautiful part of working with a control table is that we can always change and update its structure during development.

Let’s have a look at the simple control table built for this activity.

For simplicity’s sake, we can actually build this table in an Excel or CSV file and just import it into SQL when it’s ready to use.

The column titles were built to be as clear as possible about what this table provides. In short – all but 1 of these columns will be used to populate values in our Data Factory parameters. The “last_run” column is the only one that ‘receives’ info from the pipeline about the last time it was executed successfully. There are also columns here that we aren’t going to use (source_watermark_column, sql_statement, and turnonoff). These might hint at other functions the control table could provide, but aren’t necessary for this simple exercise.
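
If it helps to picture the structure, here is a rough SQL sketch of such a control table. The source_schema, source_table, and last_run names match what the stored procedure later in this post expects; the destination_* column names are simply my assumptions for illustration:

CREATE TABLE etl.ControlTable (
	source_schema            varchar(50),
	source_table             varchar(50),
	source_watermark_column  varchar(50),   -- not used in this exercise
	sql_statement            varchar(max),  -- not used in this exercise
	destination_filesystem   varchar(100),  -- the ADLS Gen2 container
	destination_filename     varchar(100),  -- e.g. dboAddress
	destination_filetype     varchar(20),   -- e.g. parquet
	turnonoff                bit,           -- not used in this exercise
	last_run                 datetime       -- written back by the pipeline
)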

One key thing to understand about using control tables in ADF is that we use an iterator to pass the rows of the table to the pipeline one at a time in order to populate the parameters. Each row therefore runs independently of the rows above and below it.

Dataset Prep

In order to use the values in the control table – we have to first add the control table as a dataset with an appropriate linked service. That setup is pictured below.

Next we are going to create datasets and linked services for both our source (Azure SQL) and destination (Azure Data Lake Storage). Below is the setup for Azure SQL DB. During this creation step we will add parameters – which act as placeholders for the data to come later.

We use the Dynamic Content screen to add formulas for the Schema and Tablename parameters. These formulas act as a pass-through for values that have yet to be specified. (Remember, these values will be passed in at runtime.)
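
For reference, the dynamic content for those two fields is nothing more than a reference to each dataset parameter, something like:

schema:  @dataset().Schema
table:   @dataset().Tablename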

The final product should look something like this:

Now we rinse and repeat for the Azure Data Lake Storage dataset. This time we are going to add 3 parameters: one for the FileSystemname (i.e., the main navigation container in ADLS Gen2), one for the filename, and one for the filetype (which will be parquet).

These parameters all correspond with columns in our control table.

It may be helpful to name the parameters similarly to the control table columns, but it’s not required.

Back on the dataset’s Connection screen, we have to input the parameter formulas into the corresponding fields. In this simple example we are skipping the directory field (i.e., sub-folders inside the main container). We are also adding our filetype to the end of the filename using the concat function, separating the two with a period.
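
Assuming the three parameter names from above, the dynamic content on the ADLS Gen2 dataset ends up looking roughly like this (note that inside a concat expression the inner references drop the leading @ sign):

File system:  @dataset().FileSystemname
File name:    @concat(dataset().filename, '.', dataset().filetype)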

Ok – we are ready to build our pipeline!

Pipeline Construction

The first step in our pipeline is to call our Control Table! This is done with a Lookup Activity.

On the settings of the lookup activity – we are going to use a SQL query to return all columns and rows.
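
Assuming the control table is named etl.ControlTable (the same name the stored procedure later in this post updates), the query can be as simple as:

SELECT * FROM etl.ControlTable

Later on, a column like turnonoff could be used in a WHERE clause here to switch individual tables in and out of the run, but for now we return everything.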

Here we can also use the Preview Data button to “see” if our query is returning the values we expect. In this case we see all 3 rows – perfect!

Next we need to add an iterator step – such as a ‘ForEach’ – which will go “row-by-row” through the results of the control table query.

This step is connected to our previous one, and we have to specify what information we want to grab from the Lookup step. This is done by adding a formula to the Items field on the Settings tab. In this example we are going to take all columns of our query (we will sort them out later).
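
Assuming the Lookup activity was named ‘Lookup Control Table’ (use whatever name you gave yours), the Items formula is simply the array output of that activity:

@activity('Lookup Control Table').output.value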

Next we are going to click into the Activities of the ForEach loop to add our actual process, which is a Copy Activity. This is the step that actually ‘moves’ the data.

After pulling the Copy Activity onto the canvas, we need to add some information in order to pass the contents of the ForEach loop into each of the respective dataset parameters. This is where we specify which columns of the current row we want, and where we want to put those values.

Here’s what that looks like for the Source (Azure SQL).

We follow this same pattern for all the parameters in our Source and Sink datasets. Here’s what that would look like once it’s all finished:
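
As a sketch, assuming the control table column names from earlier, the dataset parameters on both sides of the Copy Activity get their values from the current ForEach item like so:

Source (Azure SQL dataset)
	Schema          @item().source_schema
	Tablename       @item().source_table

Sink (ADLS Gen2 dataset)
	FileSystemname  @item().destination_filesystem
	filename        @item().destination_filename
	filetype        @item().destination_filetype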

Done!

Technically that is all we have to do to have this copy activity running.

There is one extra step added after the Copy Activity: it runs a VERY simple stored procedure to update the last_run column of our control table. This step only happens if the Copy Activity is SUCCESSFUL (as represented by the green arrow connecting the two).

CREATE PROCEDURE [etl].[updatelastruntime_controltable]
	@Schema     varchar(50),
	@TableName  varchar(50)
AS

-- Stamp the matching control table row with the current datetime.
-- (Azure SQL Database runs in UTC, so GETDATE() returns a UTC value here.)
UPDATE etl.ControlTable
SET last_run = GETDATE()
WHERE
	source_schema = @Schema AND
	source_table = @TableName

The stored procedure simply grabs the current datetime (UTC, since Azure SQL Database runs in UTC) and updates the last_run column of the row with a matching schema and table name.

Similar to how we passed parameters to the datasets – we can pass parameters to this stored procedure.
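
Concretely, the Stored Procedure activity’s two parameters can be filled from the same ForEach item:

Schema      @item().source_schema
TableName   @item().source_table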

We can now test the whole process using the Debug button:

If everything works as expected – all steps of our pipeline will have a green check as noted in the Output

We should also see the Parquet files in our Storage Account (which we do)

And our SQL Control Table column should be updated with new times (which it is)

At this point we are ready to publish our pipeline!

If we later want to add more tables or edit details of the existing run, it can all be done via the control table. We no longer need to go into ADF at all; everything can be managed from one central, easy-to-read location.

And this is just the tip of the iceberg when it comes to what we can do with ADF, but nearly all of our future work will utilize some variation of a control table!