In the world of self-service BI – analysts often start a project by trying to answer a set of desired business questions. In a perfect world, the data needed to answer these questions comes from a ‘single-source-of-truth’ data mart where information is catalogued, cleaned, defined, and separated into appropriate dimension and fact tables.
Probably just as common is the scenario where someone in the business hurriedly points the analyst to a saved Excel file and requests the same important questions to be answered.
In this latter scenario, we can import the single Excel file into Power BI and get right to work, but it may be less obvious why it is good practice to always build a data model – even when all the data comes from a single, simple document.
Let’s demonstrate this using a sample Excel sheet filled with sports data.
In this sample data table each row represents a matchup between two teams and includes the following items.
For simplicity – we will use one driving question for this work:
Which team had the highest average viewership rating?
The first thing we must confront with this data table is that it’s not optimized for Power BI. The ideal data structure for Power BI work is a star schema, where a large fact table is surrounded by individual dimension tables.
A star-schema structure is optimized for performance and usability. This may not be obvious with small tables like we are using, but it becomes more and more evident as the size of the data grows. A bad data model can derail a Power BI report by causing needless load times, inaccurate measures, and bloated report size. Also – as is the case with our data table – putting this data in a star-schema format makes it easier to arrive at the answer to our business question.
If we consider plugging our sports data into this star schema – it would look something like this:
I always think of dimension tables as the source for building my slicers and the fact table as where the raw data is stored. This future-proofs the data model and makes it much easier to connect additional fact tables or grow the model size to millions of rows.
To answer our driving business question (Which team had the highest average viewership rating?) we still need to overcome a few hurdles with our initial data table.
The most obvious issue in the original data table is that our teams appear in BOTH the HomeTeam and AwayTeam columns.
We need to modify this so that all the teams appear in a single column. If we tried to connect our Team dimension table to both the HomeTeam and AwayTeam columns – by default – one relationship would always remain inactive. This is not good practice.
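To see the cost of that workaround: with two relationships, every measure that needs the inactive path must activate it explicitly with USERELATIONSHIP. A sketch only – the table and column names (MatchSummary, Team[TeamName], AwayTeam) are assumptions based on the description above:

```dax
-- Sketch: assumes a Team[TeamName] → MatchSummary[AwayTeam]
-- relationship exists in the model but is marked inactive.
Away Viewership Rating =
CALCULATE (
    AVERAGE ( MatchSummary[Rating] ),
    USERELATIONSHIP ( Team[TeamName], MatchSummary[AwayTeam] )
)
```

Every away-side calculation would need this boilerplate, which is why unpivoting is the cleaner fix.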
Instead – using Power Query – we can add an index column to our table (let’s call it Match ID) and unpivot our HomeTeam and AwayTeam columns. That puts all the teams in the same column and still gives us a way to identify each matchup.
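In Power Query’s M language, those two steps might look something like this – a sketch, where the source query name MatchSummary and the column names are assumptions based on the description above:

```m
// Sketch: add an index, then unpivot the two team columns.
let
    Source = MatchSummary,
    // "Match ID" uniquely identifies each matchup before unpivoting
    AddMatchID = Table.AddIndexColumn(Source, "Match ID", 1, 1, Int64.Type),
    // Each row becomes two rows: one per team, with a Home/Away label
    Unpivoted = Table.Unpivot(AddMatchID, {"HomeTeam", "AwayTeam"}, "Home/Away", "Team")
in
    Unpivoted
```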
Likewise, we are confronted with a similar problem when we review how the demographic ratings data are organized in the table. It is possible to leave the data as it is and just write an average measure for each demographic column… but this would be working harder, not smarter.
Instead, let’s unpivot the demographic rating data into one single column so we only have to write one measure!
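The same Table.Unpivot pattern handles this step. A sketch – the demographic column names here (“18-34”, “35-54”, “55+”) are purely illustrative stand-ins for whatever demographics the sheet actually contains:

```m
// Sketch: collapse the demographic rating columns into two columns.
UnpivotedRatings = Table.Unpivot(
    PreviousStep,                 // whatever step precedes this one
    {"18-34", "35-54", "55+"},    // hypothetical demographic columns
    "Demographic",                // new attribute column
    "Rating"                      // new value column
)
```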
At this point we have our data in a shape to easily answer the business question. We use Power Query to create our dimension tables by isolating the respective columns of our original table and removing duplicates. Then we use those tables to define relationships to the fact table. (In my model – the date dimension table filters the Match dimension table – which then filters the fact table).
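As a sketch of that dimension-building step – assuming the cleaned fact query is named MatchSummary_fixed and holds a Team column – the Team dimension query could be as simple as:

```m
// Sketch: build a Team dimension by isolating one column
// of the fact query and removing duplicates.
let
    Source = MatchSummary_fixed,
    TeamsOnly = Table.SelectColumns(Source, {"Team"}),
    Team = Table.Distinct(TeamsOnly)   // one row per unique team
in
    Team
```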
Back to our business question: Which team had the highest average viewership rating?
Let’s write a simple average measure:
Average Viewership Rating =
AVERAGE ( MatchSummary_fixed[Rating] )
When we drop this measure into a table with our Team dimension – we arrive at the answer: Orlando.
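If you want the model itself to surface the winner rather than reading it off a sorted visual, a DAX sketch (assuming a Team dimension with a hypothetical TeamName column) could look like this:

```dax
-- Sketch: return the name of the team with the highest
-- average viewership rating in the current filter context.
Top Team by Avg Rating =
MAXX (
    TOPN ( 1, VALUES ( Team[TeamName] ), [Average Viewership Rating], DESC ),
    Team[TeamName]
)
```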
You might be asking yourself – why go through all the trouble of setting up dimension tables when we could have easily arrived at this answer using the team names still in the fact table?
Over time this answer will reveal itself – especially as your use of Power BI and dimensional modelling grows. Larger, more complex datasets (where fact tables run to millions or billions of rows) require a data model that is optimized for compression and performance. In these cases, we write formulas that reference fields in the dimension tables (which are typically much smaller) and let those selections filter the much larger fact tables. The VertiPaq engine behind Power BI was built to do this exact thing well.
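This is what “filter the small table and let it propagate to the big one” looks like in practice – a sketch, again assuming a hypothetical Team[TeamName] column:

```dax
-- Sketch: the filter is applied to the small Team dimension;
-- the relationship propagates it down to the large fact table.
Orlando Avg Rating =
CALCULATE (
    [Average Viewership Rating],
    Team[TeamName] = "Orlando"
)
```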
If you shirk convention, sooner or later this will become painfully obvious the first time your Power BI report returns this:
Additionally – most Power BI datasets rarely stop at one fact table. Most models grow to include multiple fact tables, and we use dimension tables to bind everything together. They become the conduit that connects disparate tables and allows us to build full-featured reports containing all relevant data.