Tune Shuffle Partitions – Troubleshoot Data Storage Processing
For the Apache Spark pool perspective, refer to the "Design a Partition Strategy" section in Chapter 3. Methods like spark.conf.set("spark.sql.shuffle.partitions", 200) and df.rdd.getNumPartitions() provide insight into, and control over, how data is partitioned and shuffled. One final note concerns the number of partitions required for optimal performance. There is no absolute or standard recommendation for the number of partitions, because it depends greatly on your data and your specific scenario. You should have neither too few partitions nor too many. Too many partitions can decrease performance because they reduce the effectiveness of clustered columnstore indexes, whereas too few can cause data skew, leaving individual partitions holding large volumes of widely varying data. Using the information described so far for identifying shuffling should help you determine the best ratio over time, after implementing and testing these tuning techniques.
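To make these two methods concrete, here is a minimal PySpark sketch; the session name spark, the DataFrame built with spark.range, and the value 64 are illustrative assumptions rather than recommendations from the text:

from pyspark.sql import SparkSession

# Hypothetical session and DataFrame, for illustration only.
spark = SparkSession.builder.appName("shuffle-tuning").getOrCreate()
df = spark.range(1_000_000)

# Inspect the current shuffle partition setting (Spark's default is 200).
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lower the shuffle partition count to better fit a smaller dataset.
spark.conf.set("spark.sql.shuffle.partitions", 64)

# Inspect how many partitions the DataFrame currently has.
print(df.rdd.getNumPartitions())

# A wide transformation such as groupBy triggers a shuffle; its output
# now has at most 64 partitions (adaptive query execution may coalesce
# them further).
counts = df.groupBy((df.id % 10).alias("bucket")).count()
print(counts.rdd.getNumPartitions())

Lowering the setting reduces the overhead of running many tiny shuffle tasks over a small dataset; raising it spreads a very large shuffle across more, smaller tasks.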
The last option discussed here for tuning shuffling is to optimize clustered columnstore tables. There are two approaches: avoid scenarios where a clustered columnstore is not optimal, and push the data into compressed columnstore segments. When you choose to use clustered columnstore indexes, you need to know which scenarios will not benefit from them. If your queries hit these scenarios more often than not, you should either rewrite the queries or use one of the other indexing varieties available for your dedicated SQL pool tables. Tables that store frequently changing data will not benefit from a clustered columnstore; consider heap or temporary tables instead. Data types such as varchar(max), nvarchar(max), and varbinary(max) will also not realize any benefit; consider a heap or a clustered index instead. The last scenario to avoid is using a clustered columnstore on tables with fewer than 60 million rows. The number 60 should trigger a term mentioned previously in this book, the law of 60, which leads well into the topic of pushing the data into compressed segments.
By default, a dedicated SQL pool divides your database across 60 distributions. When you then partition the data, it is divided into even more segments. Each partition, within each distribution, must contain 1 million rows in order to benefit from clustered columnstore compression. This is an example of how having too many partitions causes problems: if you partition so heavily that the number of rows within a partition falls below a million, performance is reduced, assuming that over a million rows of data existed prior to the partitioning.
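To make the arithmetic concrete, the following minimal Python sketch (the constant and function names are illustrative) computes the minimum row count a table needs so that every partition, in every distribution, reaches the million-row threshold:

DISTRIBUTIONS = 60            # a dedicated SQL pool always uses 60 distributions
ROWS_PER_SEGMENT = 1_000_000  # rows needed for a compressed columnstore segment

def min_rows_for_compression(num_partitions):
    # Each partition exists in every distribution, and each of those
    # slices needs roughly a million rows to compress.
    return num_partitions * DISTRIBUTIONS * ROWS_PER_SEGMENT

print(min_rows_for_compression(3))  # 180000000, i.e., 180 million rows

In other words, a table with three partitions needs at least 180 million rows before every columnstore segment can compress.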
As mentioned, there are some features helpful for determining the state of your partitions: DBCC PDW_SHOWPARTITIONSTATS and sys.dm_db_partition_stats. You can also use the following formula to determine the number of partitions:
total number of rows / (1 million * 60) = number of partitions
Consider the previous provisioning example, which used the most common data type for partitioning: a date. There were three partitions: 2022-10-01, 2022-11-01, and 2022-12-01. Using 155 million as a hypothetical total number of rows and plugging it into the formula results in the following: 155,000,000 / (1,000,000 * 60) = 2.58
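The same formula is easy to script; this minimal Python sketch (the function name recommended_partitions is hypothetical) reproduces the calculation:

def recommended_partitions(total_rows):
    # total number of rows / (1 million * 60) = number of partitions
    return total_rows / (1_000_000 * 60)

print(recommended_partitions(155_000_000))  # 2.58 (approximately)

A result of 2.58 suggests that this hypothetical table supports only about two partitions before segments fall below the million-row threshold; with three partitions, each of the 180 partition-distribution slices would average roughly 861,000 rows.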