My dataset is taking too long to refresh!

Performance tuning dataset refresh in Power BI

⏳ one eternity later…

The almighty Chris Webb recently did a presentation on performance tuning dataset refresh in Power BI! I always find it helpful to write things down (or type things out) to help my 🧠 to soak all this up! I hope you may find this useful!

Why is refresh performance important?

  • Your reports are ready for yours users to view faster

  • You can refresh more frequently during the day if you need to

  • Dataset development is easier

  • If something goes wrong with your data, you can fix and reload faster

  • Slow refresh of one dataset may impact
    - Refresh performance of other datasets
    - Report performance

  • But how fast is fast enough?

Requirements for data refresh

  • Don’t ask what your users want, ask what they need

  • Questions:
    - When is your source data ready to use?
    - How often does your source data change?
    - What time do you need your data by?
    - How many times do you need to refresh in a day? What is the business need?
    - What if you unexpectedly need to refresh (eg to fix data problems)
    - How important is keeping data up-to-date versus report performance?

You need to think about performance right from the very beginning!

Choosing a storage mode

  • Import (default) — fastest query performance but data must be refreshed

  • Push — data is pushed into a dataset; many limitations

  • DirectQuery — no need to refresh but query performance is
    - Composite models allow you to mix DirectQuery and Import tables
    - Aggregations are pre-aggregated tables that improve query performance
    - Use auto-refresh to make sure your report always shows the latest data

  • Use Import unless you have a good reason not to!

Performance tuning dataset refresh in Power BI

What happens during a refresh?

  1. The Analysis Services engine runs queries to get the data it needs for all of the tables in the dataset. Each of these queries is linked to a Power Query query and each Power Query query might be linked to other Power Query queries.

  2. The Power Query queries go back to the data sources to get the data.

  3. The data is returned to the Power Query engine where it does the transformations.

  4. From the Power Query engine, the data goes back to the Analysis Services engine where its loaded into the data structures used by the dataset.

There are 3 places where the refresh might be slow:

  1. Data source

  2. Power Query engine

  3. Analysis Services engine

Import refresh tuning methodology

Photo by Julian Hochgesang on Unsplash

Steps:

  1. Model your data properly

  2. Remove all data that isn’t needed for your report/analysis

  3. Tune your data source

  4. Tune your Power Query queries

  5. Tune the Analysis Services engine inside Power BI

You need to check:

  • Performance of a single refresh while developing

  • Actual performance of dateset refresh in production

Important: Tuning performance of a refresh on your Power BI Desktop might be difference to the performance that you get on the Power BI service. Need to test both

Measuring overall refresh performance

  • SQL Server Profiler is the best tool for measuring refresh
    - Connect to Power BI Desktop via DAX Studio
    - Connect to Power BI Premium capacities via XMLA endpoint
    — not possible to connect to Power BI Shared capacity
    - Displays all activity in the Analysis Services engine
    - Look for Process command and Duration column

  • Power BI Service refresh history also has overall refresh times

  • Refresh summary page (and API) shows refresh times for datasets in Premium

  • Power BI Capacity Metrics app shows refresh times for Premium

How to measure your dataset refresh in Power BI Desktop using SQL Server Profiler

  1. Open your .pbix

  2. Open DAX Studio. Ensure the ‘Show External Tools’ is enabled in the Advanced options.

  3. Go to the Advanced tab and click on SQL Server Profiler

  4. Go to the ‘Events Selection’ tab in the ‘Trace Properties’

  5. Select bothCommand Begin and Command End under Command Events. Click Run.

  6. Go back to your Power BI Desktop and hit refresh.

  7. You’ll see in the SQL Server Profiler the command that took the longest is the refresh command.

SQL Server Profiler

Trace Properties > Events Selection > Command Events

Refresh

How to check your dataset refresh in Power BI Service using the Power BI Service, Power BI Admin portal and Power BI Premium Capacity Metrics App

Power BI Service

  1. Go to your dataset settings

  2. Click on Refresh history

Refresh history

For Power BI Premium, admin portal

  1. Go to the Power BI admin portal

  2. Under Capacity settings, click on ‘Refresh summary’

  3. In the ‘Schedule’ tab you can also check what refresh schedule is set to for any given time

Admin portal > Capacity settings > Refresh summary

Power BI Premium Capacity Metrics app

Capacity Metrics Report >

PowerBIPremiumCapacityMetricsapp

go.microsoft.com

Data Source

Data modelling and refresh performance

  • Good data modelling is important for many reasons — data refresh performance is only one of them

  • Good modelling may make refresh performance slower, but will make report query performance faster

  • Basic rule: always build a star schema!

  • Common problems:
    - Tables with lots of columns
    — Do you need to unpivot measures?
    — Do some of your fact table columns actually belong on a dimension table?
    — Are you even going to use all of these columns?
    - One big table instead of fact tables and dimension tables
    - Use of expensive data types, eg Double instead of Currency

Understand star schema and the importance for Power BI - Power BI

This article targets Power BI Desktop data modelers. It describes star schema design and its relevance to developing…

docs.microsoft.com

Only load the data you need

  • The more data you load, the slower the refresh will be

  • So:
    - Remove any columns you don’t need
    - Filter out any rows you don’t need
    - Think about applying a limit on history, eg only loading one year of data
    - Do this as soon as possible, ideally before the data even reaches Power BI

  • It’s easier to add data back if you need it than remove data from a dataset in production

  • Deployment pipelines (in Premium) can be used to limit the amount of data you work with in a development environment.
    — You can check out my post about Power BI Deployment Pipelines <here\> and Chris Webb’s “Limit The Amount Of Data You Work With In Power BI Desktop Using Parameters and Deployment Pipelines” <here\>.

Data source type and refresh performance

  • How quickly can your data source send data to Power BI

  • Some tips:
    - Relational databases perform better than files
    - CSV files will perform better than JSON, XML and especially Excel
    - Files stored in SharePoint may be slow to load compared to local files
    - Web services may also be slow

  • Consider loading your data into a fast data source before loading in into Power BI

Tuning your data source

  • If your data source is a relational database, tune the SQL queries that are run when refresh takes place
    - Tools like SQL Server Profiler can be used to see what queries are run

  • Other useful tools:
    - Fiddler for viewing requests made to web services
    - Process Monitor for viewing reads from text files
    - Power Query Query Diagnostics

Data source location

  • Network latency between your data source and Power BI can affect refresh performance
    - If you’re using an On-premises data gateway, think about the location of the gateway machine

  • Power BI Premium allows you to locate different capacities in different Azure Regions

Power Query engine performance

  • Power Query performance can vary depending on where Power Query queries are run:
    - Power BI Desktop — when you are developing
    - Power BI Service — if you’re only connecting to cloud data sources
    - On-premises data gateway — if any of your data sources are on-prem, all traffic has to go through a gateway

  • Performance will depend on:
    - Hardware of the machine where queries are run
    - Configuration settings and properties
    - Efficiency of the queries themselves

How to measure Power Query query performance using SQL Server Profiler

  1. Open your .pbix

  2. Open DAX Studio. Ensure the ‘Show External Tools’ is enabled in the Advanced options.

  3. Go to the Advanced tab and click on SQL Server Profiler

  4. Go to the ‘Events Selection’ tab in the ‘Trace Properties’

  5. Under Command Events, select bothCommand Begin and Command End.
    Under Progress Reports, select Progress Report Begin and Progress Report End.
    Click Run.

  6. Go back to your Power BI Desktop and hit refresh.

  7. You’ll see in the SQL Server Profiler how long each Power Query query took. Inside the refresh command you’ll see a large number of Progress Report Begin and Progress Report End events.

There are 2 events that will show you how long a Power Query query step took to run. You’ll want to arrange your columns so you can see the EventClass, EventSubclass, TextData, and Duration.

  1. Eventclass:Progress Report End EventSubclass:25 — ExecuteSQL. The amount of time the Power Query engine took for the query to start to return data.

  2. Eventclass:Progress Report End EventSubclass:17— ReadData . This is the amount of time it took for the Power Query engine to read the data from the query.

Power Query query performance using SQL Server Profiler

How to measure Power Query query performance using Power Query diagnostics

Query Diagnostics is a powerful new feature that will allow you to determine what Power Query is doing during authoring time. Earlier this year in March, Power Query diagnostics was announced to GA.

Query Diagnostics

Query Diagnostics is a powerful new feature that will allow you to determine what Power Query is doing during authoring…

docs.microsoft.com

Understanding folding with Query Diagnostics

One of the most common reasons to use Query Diagnostics is to have a better understanding of what operations were…

docs.microsoft.com

Check out Chris Webb’s blog post <here\> to help further make sense of this data.

Power Query Power BI Desktop

  • Measure performance of Power Query queries in Desktop using:
    - SQL Server Profiler
    - Power Query Query Diagnostics

  • Settings to improve performance in Power BI
    - Disable queries that you don’t need to load into
    - Turn off “Allow data preview to download in the background”
    - Turn off data privacy checks — but only if you know what this means!
    - Experiment with “Enable parallel loading of tables”
    - Use Table.View to stop multiple reads
    - Turn off “Include in report refresh” if a query doesn’t need to be refreshed

Disable Background Data option

Query Folding

  • Query folding refers to the way the Power Query engine can push transformations back to the data source

  • Almost always results in much better performance

  • Only possible with some data sources: relational databases, Analysis Services, OData feeds, some others

  • Only possible for some transformations
    - Different data sources support folding for different transformations

  • Some transformations stop other folding happening

  • Writing your own SQL queries also prevents folding

Alex Powers recently did a 30 Day Challenge: Query Folding — #30DQUERY

Query folding

This article targets data modelers developing models in Power Pivot or Power BI Desktop. It describes what Power Query…

docs.microsoft.com

Tuning the Power Query engine

  • If query folding is not taking place, then the Power Query engine does the transformations in your queries

  • Some transformations such as sort, merge, pivot/unpivot require all data to be loaded into RAM
    - A query is limited to using 256 RAM, so paging may take place (TIL)

  • Some transformations force multiple reads from a data source
    - Using Table.Buffer may help — but may also cause paging

Tuning the on-premises data gateway

  • If you are using an on-premises data gateway to load data, your Power Query queries will be executed on the gateway machine

  • Tips:
    - Locate the gateway machine close to the data source
    - Make sure the gateway server has enough CPU and memory
    - Clustered gateways allow for the load to be spread across multiple servers

  • Turn on performance logging and use the Power BI template report to analyse it. You can download it from this link <here\>.

Using dataflows to improve performance

Power BI dataflow

  • Dataflows let you share the output of a Power Query query between multiple datasets
    - Do complex transformations once instead of inside multiple datasets
    - Do transformations when the data for one query is ready, no need to wait until all data needed by the dataset is ready

  • Data privacy checks are off by default -> better performance

  • In a Premium capacity:
    - Enhanced compute engine improves performance by loading data into SQL
    - Container Size property = more RAM for the Power Query engine
    - Computed entities allow you to stage data from slow data sources

transformation donce twice

transformation done once

You can also have different dataflows executed at different times when different data available. So if one part of your data sources is available much earlier than the others you can use a dataflow to do any transformations so that the transformed data from there is available much more quickly

Tuning the Analysis Services engine

  • SQL Server Profiler displays a lot of detail about what happens during refresh in the Analysis Services engine.

  • Official support for Tabular Editor within Desktop will allow changing more properties:
    - IsAvailableinMDX — controlswhether hiearchies are built on columns (only relevant for clients that query using MDX such as Excel)
    - EncodingHints — forces the use of a certain type of encoding for a column

Calculated columns and calculated tables

  • Calculated columns and calculated tables are evaluated during refresh
    - So the more you have, the slower refresh will be

  • Can you replace a calculated column with a measure?
    - Strange but true: this may also help query performance

  • Can you replace a calculated table with a Power Query query or a table in your data source?

  • Loading data into hidden tables and then using DAX to transform it is usually a bad thing

  • BUT certain calculations will be much quick in DAX

Incremental refresh

  • Incremental refresh lets you refresh only the data that is new or has changed
    - Less data to load -> faster
    - Works by creating and managing partitions within the table

  • Now available in Power BI Shared as well as Premium

  • Design for use with data warehouses built on relational databases

  • Can be adapted for use with other data sources such as:
    - Web services
    - Folders containing multiple files

Check out Chris Webb’s blog post on this topic <here\>

Refresh in the Power BI Service

  • Refresh in the Power BI Service only when resources are available

  • Therefore, refresh does not always start at the scheduled time

  • Refresh may be slower in the Service because:
    - You have a very fast development PC
    - It takes longer to load data into the cloud than into Power BI Desktop

  • Refresh may run faster on Premium because:
    - More resources = more parallelism, but only on a P2+
    - More likely to start on time — assuming your capacity isn’t overloaded


Did you find this article valuable?

Support Ian's blog by becoming a sponsor. Any amount is appreciated!