As a self-service tool, Power BI provides a slew of connections to various data sources. When building reports in a more ad-hoc capacity, the data often comes from whatever file type is most convenient, saved wherever is most convenient. During these build scenarios, I can't help but wonder how Power BI handles these different file types and locations upon import. Is one file type and location clearly the 'best', or do they all perform well? Should there be a winner?
To answer this question, I set up a fairly simple experiment that involved loading the same amount of data into Power BI Desktop via six different pathways. The source data in this experiment is a CSV file of sales data with approximately 1.5 million rows.
The Data Sources
I took this data and placed it in a number of cloud-based locations that are easily available to an analyst working in a similar ad-hoc capacity. Some of these locations (like Azure SQL) require changing the format of the data away from the original .csv file, but importantly, the 'amount' of data should remain the same. Here's the list:
There are a few immediate caveats to this experiment that I want to make sure to call out early.
- Using the Excel file format truncates the data to Excel's row limit of 1,048,576 rows.
- The Azure SQL DB is only set to 5 DTUs (the lowest available).
- The parquet file was created via a copy activity from an Azure SQL table using Azure Data Factory.
- No transformations beyond the required/default steps to load the file were done in Power Query.
From a Power Query perspective, some sources (like the parquet file) load with only one visible step, whereas importing a CSV file from Azure Data Lake Storage Gen2 using the distributed file system (dfs) address (available when hierarchical namespace is enabled) requires a bunch of supporting queries and extra steps, similar to the experience of loading a folder of CSV files. In any case, the Power Query work represents the minimum set of transformations needed to load the data; no additional transformations were applied to any table.
Measuring Refresh Speed
Phil Seamark created a handy tool for visualizing the data in a SQL Profiler trace XML file. We will be using this tool to connect to our Power BI Desktop file to measure the events taking place during each manual refresh.
Refreshes were conducted four times: three immediately after the Power BI Desktop file was opened, and one after the file had been open for a while (and had itself already been refreshed).
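The repeat-and-record pattern behind those runs can be sketched generically in Python. This is only an analogy for the methodology (the real durations come from SQL Profiler trace events, not wall-clock timing), and the function name is made up:

```python
import time

def timed_runs(refresh, repeats=4):
    """Invoke `refresh` several times and record each duration in seconds,
    mirroring the four repeated manual refreshes in the experiment."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        refresh()  # stand-in for a manual Power BI Desktop refresh
        durations.append(time.perf_counter() - start)
    return durations
```

Repeating the measurement matters because, as we'll see, even identical setups don't produce identical timings.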
Here is one of the reports that is generated when we analyze the Trace XML file generated for each run.
The Object names represent each file type and location. The blue ExecuteSQL steps on the top graph represent operations taking place in Power Query. Here we can see that pathways with more transformation steps (like loading the CSV file from Azure Data Lake Storage Gen2) require more work.
The yellow areas of the chart represent the process of loading data into the data model. While every pathway (except for Excel) loaded the same number of rows, some pathways are clearly faster than others. Compare the bottom SQL yellow line to the top yellow line.
It's also worth noting that the results were not identical across the three refreshes that used the exact same setup.
Here's another refresh (similar to the one above) that was run immediately after opening the Power BI Desktop file. In this refresh, the ExecuteSQL step is nearly identical across all pathways, while the Process step varies considerably.
To compare each run, let's chart the duration that each pathway took, using the matrix chart at the bottom of each result page.
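That per-pathway aggregation can be sketched with pandas. The numbers below are placeholder durations for illustration only, not the measured values from the trace files, and only three of the pathways are shown:

```python
import pandas as pd

# Placeholder durations in seconds -- illustrative values, NOT the
# measured results from the experiment.
runs = pd.DataFrame({
    "source":   ["ADLS Gen2 CSV (dfs)", "Parquet", "Azure SQL"] * 2,
    "run":      [1, 1, 1, 2, 2, 2],
    "duration": [21.0, 24.5, 38.2, 20.4, 25.1, 40.0],
})

# Average refresh duration per data source, fastest first.
avg = runs.groupby("source")["duration"].mean().sort_values()
print(avg)
```

A simple bar chart of `avg` then gives the same at-a-glance comparison as the report pages.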
Here’s what all 4 refresh runs look like when charted together.
And here's the average refresh duration for each data source.
The CSV file imported from Azure Data Lake Storage Gen2 appears to be the fastest refresh mechanism of all the options. This import method uses the Azure Data Lake Storage Gen2 connector with "dfs" in the URL instead of "blob".
This should come as no surprise, since the distributed file system (dfs) endpoint was designed to provide significant performance improvements for loading and reading data.
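The only difference between the two import addresses is the endpoint host. A minimal sketch, with a hypothetical storage account name:

```python
account = "mystorageacct"   # hypothetical storage account name
container = "sales"
path = "sales-data.csv"

# Standard Blob endpoint vs. the Data Lake Storage Gen2 (dfs) endpoint;
# the dfs form requires hierarchical namespace to be enabled on the account.
blob_url = f"https://{account}.blob.core.windows.net/{container}/{path}"
dfs_url = f"https://{account}.dfs.core.windows.net/{container}/{path}"

print(blob_url)
print(dfs_url)
```

Pointing the Azure Data Lake Storage Gen2 connector at the `dfs` form of the URL is what selects the faster pathway in this experiment.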
The Excel result is likely a wash and should probably have been eliminated from this comparison entirely, since its data is about 500,000 rows smaller than the other files. That smaller size is likely why it performed better than expected.
Interestingly, the Azure SQL table consistently took the longest amount of time to refresh. Initially, I suspected that this was merely due to the low DTU allocation on the database, so I scaled it up to 3,000 DTUs to see if that made any noticeable change to the load time. (It didn't.)
DTUs evidently don't play a role in the load (or Process) portion of this refresh. Perhaps partitioning the table would have had a more noticeable impact on refresh time, but we'll save that for another round.
Another point to note is that these refreshes were run in Power BI Desktop, and results may differ once migrated to the Power BI service. Unfortunately, I was unable to run a trace against my Premium Per User workspace. In a future post, I'll look at how refresh times change with workloads in the service.
Is there a clear winner?
In this simple experiment, it would appear that loading data from Azure Data Lake Storage Gen2 using the distributed file system (dfs) import pathway yields the fastest result.
However, it's important to recognize that this experiment used the simplest of import strategies, with no additional transformations done in Power Query. It's possible that as we start adding more work in Power Query, other factors like query folding may provide distinct advantages to the SQL pathway that simply weren't observed here.
See you at the next race!