Tune Shuffle Partitions – Troubleshoot Data Storage Processing-1
Azure Monitoring Overview, Develop a Batch Processing Solution, Microsoft DP-203, Tune Queries by Using CacheAs you learned in the previous section, data shuffling occurs when the data required to successfully execute a SQL command must be retrieved from more than a single node. There are numerous reasons for this. One reason is skewing. When data is skewed, it means that there is an uneven distribution of data across the nodes. The results of either PDW_SHOWSPACEUSED or sys.dm_pdw_nodes_db_partition_stats are helpful to identify skewing. Using the results of skewing statistics and the results from the explain plan described in the previous section, you should be able to gain enough insights into whether your queries are suffering from shuffling. If you do find that there are examples of shuffles that are impacting the execution speed of your SQL queries, the following are some methods for improvement.
The first step is to identify the specific SQL query impacted by shuffling, and then compare it to the values in the WITH segment of the CREATE TABLE statement against which the SQL query executes. The WITH segment contains the index type, the distribution model, the distribution key, and the partition description, which can resemble the following:
WITH
( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([SCENARIO_ID]),
PARTITION ([SCENARIO_ID] RANGE RIGHT FOR VALUES(1, 2, 3, 4, 5, 6, 7, 8))););
The different types of indexes are as follows:
- Clustered columnstore index
- Heap
- Clustered index or nonclustered index
You can read about how these indexes operate in the “Data Concepts” section of Chapter 2. What may be obvious in the WITH clause is that SCENARIO_ID is used as the distribution key and the partition key. If the SQL query is optimized to select data that targets a single SCENARIO_ID, then you would not expect much shuffling. However, if the SQL query pulls data for multiple scenarios and those distributions are on different nodes, then shuffling will occur. Here are a few techniques that can be used to tune partition shuffling:
- Rewrite the SQL queries.
- Review EXPLAIN plan recommendations.
- Reshuffle/repartition.
- Optimize clustered columnstore tables.
Although this is a possible solution, it may not be practical in many scenarios. Building on the SCENARIO_ID example, consider the following SQL statement:
SELECT * FROM READING WHERE SCENARIO_ID IN (1, 2, 3, 4)
The SQL query requests all the data that matches four scenarios. Looking at the distribution and the partition settings, it should be obvious that there is a high probability of shuffling. Based on the configuration within the WITH clause, the data for each scenario is stored into different distributions and partitions. How to specifically optimize the SQL query depends greatly on which procedures are performed on the dataset after retrieval. You might realize faster performance by executing the query four times in parallel after projecting the SQL statement to a single scenario, like the following:
SELECT * FROM READING WHERE SCENARIO_ID = 1
SELECT * FROM READING WHERE SCENARIO_ID = 2
SELECT * FROM READING WHERE SCENARIO_ID = 3
SELECT * FROM READING WHERE SCENARIO_ID = 4