Tune Shuffle Partitions – Troubleshoot Data Storage Processing-2
It would also make sense to identify the specific columns that are required from the READING table instead of using the wildcard character (*). This reduces the size of the retrieved dataset and lowers latency. The change does not reduce shuffling, but it is simply good practice that makes your data operations perform better.
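As a minimal sketch, the column names below are hypothetical; substitute only the columns your query actually needs:

-- Prefer an explicit column list (column names here are illustrative):
SELECT [READING_ID], [READING_DATETIME], [READING_VALUE]
FROM   [dbo].[READING]
WHERE  [READING_DATETIME] >= '2022-10-01';

-- rather than retrieving every column:
-- SELECT * FROM [dbo].[READING];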
Some additional patterns that cause data shuffling, and should therefore be avoided, are using the CROSS JOIN type, using operators like IN or CONTAINS instead of equals (=) in join predicates, performing joins on columns of different data types, and performing a join on a column that is not part of the distribution key.
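The preferred pattern can be sketched as follows; the BRAND table and its columns are hypothetical, and the sketch assumes both tables are hash-distributed on the same join column:

-- Equality join on the shared distribution key of both tables,
-- with matching data types on each side, so no shuffle is required:
SELECT r.[READING_DATETIME], b.[BRAND_NAME]
FROM   [dbo].[READING] r
INNER JOIN [dbo].[BRAND] b
       ON r.[BRAND_ID] = b.[BRAND_ID];

By contrast, rewriting the predicate as r.[BRAND_ID] IN (SELECT [BRAND_ID] FROM ...), or joining on a column outside the distribution key, would force the engine to move rows between distributions.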
The next tuning technique for managing shuffling is to analyze the explain plan for the given SQL query. As stated in the previous section, details about shuffling appear under the SHUFFLE_MOVE attribute in the XML output. There are also cases where the database engine recognizes missing indexes that could benefit performance.
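In a dedicated SQL pool, prefixing a query with EXPLAIN returns the estimated plan as XML without executing the query; the grouping column below is illustrative. Search the output for SHUFFLE_MOVE operations:

EXPLAIN
SELECT [READING_DATETIME], COUNT(*)
FROM   [dbo].[READING]
GROUP BY [READING_DATETIME];

Running EXPLAIN WITH_RECOMMENDATIONS instead also includes suggestions, such as candidate distribution changes, in the returned XML.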
Those and other recommendations are highlighted in the XML file as well as in the explain plan illustration in Figure 10.5. The database engine query optimizer uses statistics captured during query execution as the basis for building the explain plan, discovering inefficiencies, and recommending solutions.
You can improve performance by executing UPDATE STATISTICS or by running sp_updatestats more often than the database engine does automatically. There will be more detail later about these two approaches to maintaining statistics.
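A minimal sketch of the two approaches, using the READING table from this section:

-- Refresh all statistics on a single table:
UPDATE STATISTICS [dbo].[READING];

-- Or refresh statistics across the entire database:
EXEC sp_updatestats;

UPDATE STATISTICS is the more targeted option when you know which table's data has changed; sp_updatestats sweeps every table and is better suited to scheduled maintenance.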
The third technique for tuning the effects of shuffling is to manually reshuffle or repartition the data. The objective here is to redistribute the data more evenly across the distributions and nodes, to avoid not only shuffling but skewing as well.
A word of caution here: before you attempt either of these activities, recognize that they are considered expensive operations, which means they will be disruptive to the operational data activities happening while they are being applied and can affect data consumers after the change has been implemented. From a dedicated SQL pool perspective, you reshuffle and repartition your data by using CTAS (CREATE TABLE AS SELECT).
This specific example was provided in Chapter 2 from a copy perspective, but it is useful in this scenario as well. The process entails creating a table appended with _NEW, which has the new, optimized distribution and partition details contained in the WITH clause.
Once the new table is created and populated with existing data, for example READING_NEW, you rename the existing table, as shown in the following syntax, rename the new table to the original name, and then drop the old table that contains the undesired distribution and partitions:
CREATE TABLE [dbo].[READING_NEW]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([READING_DATETIME]),
    PARTITION ([READING_DATETIME] RANGE RIGHT FOR VALUES
        ('2022-10-01', '2022-11-01', '2022-12-01'))
)
AS SELECT *
FROM [dbo].[READING];

RENAME OBJECT READING TO READING_OLD;
RENAME OBJECT READING_NEW TO READING;
DROP TABLE READING_OLD;