My dataset is taking too long to refresh!
Performance tuning dataset refresh in Power BI
⏳ one eternity later…
The almighty Chris Webb recently did a presentation on performance tuning dataset refresh in Power BI! I always find it helpful to write things down (or type things out) to help my 🧠 to soak all this up! I hope you may find this useful!
Why is refresh performance important?
Your reports are ready for yours users to view faster
You can refresh more frequently during the day if you need to
Dataset development is easier
If something goes wrong with your data, you can fix and reload faster
Slow refresh of one dataset may impact
- Refresh performance of other datasets
- Report performanceBut how fast is fast enough?
Requirements for data refresh
Don’t ask what your users want, ask what they need
Questions:
- When is your source data ready to use?
- How often does your source data change?
- What time do you need your data by?
- How many times do you need to refresh in a day? What is the business need?
- What if you unexpectedly need to refresh (eg to fix data problems)
- How important is keeping data up-to-date versus report performance?
You need to think about performance right from the very beginning!
Choosing a storage mode
Import (default) — fastest query performance but data must be refreshed
Push — data is pushed into a dataset; many limitations
DirectQuery — no need to refresh but query performance is
- Composite models allow you to mix DirectQuery and Import tables
- Aggregations are pre-aggregated tables that improve query performance
- Use auto-refresh to make sure your report always shows the latest dataUse Import unless you have a good reason not to!
Performance tuning dataset refresh in Power BI
What happens during a refresh?
The Analysis Services engine runs queries to get the data it needs for all of the tables in the dataset. Each of these queries is linked to a Power Query query and each Power Query query might be linked to other Power Query queries.
The Power Query queries go back to the data sources to get the data.
The data is returned to the Power Query engine where it does the transformations.
From the Power Query engine, the data goes back to the Analysis Services engine where its loaded into the data structures used by the dataset.
There are 3 places where the refresh might be slow:
Data source
Power Query engine
Analysis Services engine
Import refresh tuning methodology
Photo by Julian Hochgesang on Unsplash
Steps:
Model your data properly
Remove all data that isn’t needed for your report/analysis
Tune your data source
Tune your Power Query queries
Tune the Analysis Services engine inside Power BI
You need to check:
Performance of a single refresh while developing
Actual performance of dateset refresh in production
Important: Tuning performance of a refresh on your Power BI Desktop might be difference to the performance that you get on the Power BI service. Need to test both
Measuring overall refresh performance
SQL Server Profiler is the best tool for measuring refresh
- Connect to Power BI Desktop via DAX Studio
- Connect to Power BI Premium capacities via XMLA endpoint
— not possible to connect to Power BI Shared capacity
- Displays all activity in the Analysis Services engine
- Look for Process command and Duration columnPower BI Service refresh history also has overall refresh times
Refresh summary page (and API) shows refresh times for datasets in Premium
Power BI Capacity Metrics app shows refresh times for Premium
How to measure your dataset refresh in Power BI Desktop using SQL Server Profiler
Open your .pbix
Open DAX Studio. Ensure the ‘Show External Tools’ is enabled in the Advanced options.
Go to the Advanced tab and click on SQL Server Profiler
Go to the ‘Events Selection’ tab in the ‘Trace Properties’
Select both
Command Begin
andCommand End
under Command Events. Click Run.Go back to your Power BI Desktop and hit refresh.
You’ll see in the SQL Server Profiler the command that took the longest is the refresh command.
SQL Server Profiler
Trace Properties > Events Selection > Command Events
Refresh
How to check your dataset refresh in Power BI Service using the Power BI Service, Power BI Admin portal and Power BI Premium Capacity Metrics App
Power BI Service
Go to your dataset settings
Click on Refresh history
Refresh history
For Power BI Premium, admin portal
Go to the Power BI admin portal
Under Capacity settings, click on ‘Refresh summary’
In the ‘Schedule’ tab you can also check what refresh schedule is set to for any given time
Admin portal > Capacity settings > Refresh summary
Power BI Premium Capacity Metrics app
Capacity Metrics Report >
PowerBIPremiumCapacityMetricsapp
Data Source
Data modelling and refresh performance
Good data modelling is important for many reasons — data refresh performance is only one of them
Good modelling may make refresh performance slower, but will make report query performance faster
Basic rule: always build a star schema!
Common problems:
- Tables with lots of columns
— Do you need to unpivot measures?
— Do some of your fact table columns actually belong on a dimension table?
— Are you even going to use all of these columns?
- One big table instead of fact tables and dimension tables
- Use of expensive data types, eg Double instead of Currency
Understand star schema and the importance for Power BI - Power BI
This article targets Power BI Desktop data modelers. It describes star schema design and its relevance to developing…
Only load the data you need
The more data you load, the slower the refresh will be
So:
- Remove any columns you don’t need
- Filter out any rows you don’t need
- Think about applying a limit on history, eg only loading one year of data
- Do this as soon as possible, ideally before the data even reaches Power BIIt’s easier to add data back if you need it than remove data from a dataset in production
Deployment pipelines (in Premium) can be used to limit the amount of data you work with in a development environment.
— You can check out my post about Power BI Deployment Pipelines <here\> and Chris Webb’s “Limit The Amount Of Data You Work With In Power BI Desktop Using Parameters and Deployment Pipelines” <here\>.
Data source type and refresh performance
How quickly can your data source send data to Power BI
Some tips:
- Relational databases perform better than files
- CSV files will perform better than JSON, XML and especially Excel
- Files stored in SharePoint may be slow to load compared to local files
- Web services may also be slowConsider loading your data into a fast data source before loading in into Power BI
Tuning your data source
If your data source is a relational database, tune the SQL queries that are run when refresh takes place
- Tools like SQL Server Profiler can be used to see what queries are runOther useful tools:
- Fiddler for viewing requests made to web services
- Process Monitor for viewing reads from text files
- Power Query Query Diagnostics
Data source location
Network latency between your data source and Power BI can affect refresh performance
- If you’re using an On-premises data gateway, think about the location of the gateway machinePower BI Premium allows you to locate different capacities in different Azure Regions
Power Query engine performance
Power Query performance can vary depending on where Power Query queries are run:
- Power BI Desktop — when you are developing
- Power BI Service — if you’re only connecting to cloud data sources
- On-premises data gateway — if any of your data sources are on-prem, all traffic has to go through a gatewayPerformance will depend on:
- Hardware of the machine where queries are run
- Configuration settings and properties
- Efficiency of the queries themselves
How to measure Power Query query performance using SQL Server Profiler
Open your .pbix
Open DAX Studio. Ensure the ‘Show External Tools’ is enabled in the Advanced options.
Go to the Advanced tab and click on SQL Server Profiler
Go to the ‘Events Selection’ tab in the ‘Trace Properties’
Under Command Events, select both
Command Begin
andCommand End
.
Under Progress Reports, selectProgress Report Begin
andProgress Report End
.
Click Run.Go back to your Power BI Desktop and hit refresh.
You’ll see in the SQL Server Profiler how long each Power Query query took. Inside the refresh command you’ll see a large number of
Progress Report Begin
andProgress Report End
events.
There are 2 events that will show you how long a Power Query query step took to run. You’ll want to arrange your columns so you can see the EventClass
, EventSubclass
, TextData
, and Duration
.
Eventclass:Progress Report End
EventSubclass:25 — ExecuteSQL
. The amount of time the Power Query engine took for the query to start to return data.Eventclass:Progress Report End
EventSubclass:17— ReadData
. This is the amount of time it took for the Power Query engine to read the data from the query.
Power Query query performance using SQL Server Profiler
How to measure Power Query query performance using Power Query diagnostics
Query Diagnostics is a powerful new feature that will allow you to determine what Power Query is doing during authoring time. Earlier this year in March, Power Query diagnostics was announced to GA.
Query Diagnostics
Query Diagnostics is a powerful new feature that will allow you to determine what Power Query is doing during authoring…
Understanding folding with Query Diagnostics
One of the most common reasons to use Query Diagnostics is to have a better understanding of what operations were…
Check out Chris Webb’s blog post <here\> to help further make sense of this data.
Power Query Power BI Desktop
Measure performance of Power Query queries in Desktop using:
- SQL Server Profiler
- Power Query Query DiagnosticsSettings to improve performance in Power BI
- Disable queries that you don’t need to load into
- Turn off “Allow data preview to download in the background”
- Turn off data privacy checks — but only if you know what this means!
- Experiment with “Enable parallel loading of tables”
- Use Table.View to stop multiple reads
- Turn off “Include in report refresh” if a query doesn’t need to be refreshed
Disable Background Data option
Query Folding
Query folding refers to the way the Power Query engine can push transformations back to the data source
Almost always results in much better performance
Only possible with some data sources: relational databases, Analysis Services, OData feeds, some others
Only possible for some transformations
- Different data sources support folding for different transformationsSome transformations stop other folding happening
Writing your own SQL queries also prevents folding
Alex Powers recently did a 30 Day Challenge: Query Folding — #30DQUERY
Query folding
This article targets data modelers developing models in Power Pivot or Power BI Desktop. It describes what Power Query…
Tuning the Power Query engine
If query folding is not taking place, then the Power Query engine does the transformations in your queries
Some transformations such as sort, merge, pivot/unpivot require all data to be loaded into RAM
- A query is limited to using 256 RAM, so paging may take place (TIL)Some transformations force multiple reads from a data source
- Using Table.Buffer may help — but may also cause paging
Tuning the on-premises data gateway
If you are using an on-premises data gateway to load data, your Power Query queries will be executed on the gateway machine
Tips:
- Locate the gateway machine close to the data source
- Make sure the gateway server has enough CPU and memory
- Clustered gateways allow for the load to be spread across multiple serversTurn on performance logging and use the Power BI template report to analyse it. You can download it from this link <here\>.
Using dataflows to improve performance
Power BI dataflow
Dataflows let you share the output of a Power Query query between multiple datasets
- Do complex transformations once instead of inside multiple datasets
- Do transformations when the data for one query is ready, no need to wait until all data needed by the dataset is readyData privacy checks are off by default -> better performance
In a Premium capacity:
- Enhanced compute engine improves performance by loading data into SQL
- Container Size property = more RAM for the Power Query engine
- Computed entities allow you to stage data from slow data sources
transformation donce twice
transformation done once
You can also have different dataflows executed at different times when different data available. So if one part of your data sources is available much earlier than the others you can use a dataflow to do any transformations so that the transformed data from there is available much more quickly
Tuning the Analysis Services engine
SQL Server Profiler displays a lot of detail about what happens during refresh in the Analysis Services engine.
Official support for Tabular Editor within Desktop will allow changing more properties:
- IsAvailableinMDX — controlswhether hiearchies are built on columns (only relevant for clients that query using MDX such as Excel)
- EncodingHints — forces the use of a certain type of encoding for a column
Calculated columns and calculated tables
Calculated columns and calculated tables are evaluated during refresh
- So the more you have, the slower refresh will beCan you replace a calculated column with a measure?
- Strange but true: this may also help query performanceCan you replace a calculated table with a Power Query query or a table in your data source?
Loading data into hidden tables and then using DAX to transform it is usually a bad thing
BUT certain calculations will be much quick in DAX
Incremental refresh
Incremental refresh lets you refresh only the data that is new or has changed
- Less data to load -> faster
- Works by creating and managing partitions within the tableNow available in Power BI Shared as well as Premium
Design for use with data warehouses built on relational databases
Can be adapted for use with other data sources such as:
- Web services
- Folders containing multiple files
Check out Chris Webb’s blog post on this topic <here\>
Refresh in the Power BI Service
Refresh in the Power BI Service only when resources are available
Therefore, refresh does not always start at the scheduled time
Refresh may be slower in the Service because:
- You have a very fast development PC
- It takes longer to load data into the cloud than into Power BI DesktopRefresh may run faster on Premium because:
- More resources = more parallelism, but only on a P2+
- More likely to start on time — assuming your capacity isn’t overloaded