Tune Shuffle Partitions – Troubleshoot Data Storage Processing-4

14 Nov, 2023

Tune Shuffle Partitions – Troubleshoot Data Storage Processing-4

The result identifies that the partitions have been overprovisioned. It would be prudent to repartition from three to two so that your queries can benefit from the performance gained by the clustered columnstore and compressed segments. The equation should also call attention to the point that unless your table contains at least 60 million rows, using a clustered columnstore or partitions is not justified for performance reasons. To determine the number of rows necessary to warrant the three partitions, use this formula. This is solved by multiplying both sides of the equation by 60 million.

x / (1000000 * 60) = 3
x = 180000000

The result is that the data that is to be stored into the table with these three partitions must be at least 180 million rows to gain any benefits.

One final point about partitioning is that partitions are applied onto each distribution. This means that because there are 60 by default, the provisioning of three results in 180 total partitions. The algorithm is interested only in the partitions in a single distribution; however, as this implementation is hard to visualize, it is worth stating here.

Before moving on to the next section, there is one final concept to discuss that can decrease shuffling: partition elimination. Partition elimination is implemented by optimizing the WHERE clause predicate so that the database engine can infer which data partitions contain the data. In the following SQL statement, notice that theWHERE clause specifically targets one of the three partitions created on the READING table:

SELECT VALUE
FROM READING
WHERE READING_DATETIME BETWEEN ‘2022-12-01’ and ‘2022-12-31’

Structuring the SQL query in such a way results in the data server determining that access to a single partition is enough to fulfill the query request, meaning that all partitions do not require scanning before returning the requested data‐saving time. Notice that the WITH clause, which created the three partitions based on date, also includes the keywords RANGE RIGHT FOR VALUES. The RANGE RIGHT keywords indicate that the associated values represent the lower value boundary.

This is especially important when working with dates and partitions based on months. Without the RIGHT designation, data from midnight on the first day of the month would not be included in the same partition as the later values on the same day. Using RANGE with either LEFT or RIGHT makes partition elimination and your calculations based on a date range more performant and precise.

Table 10.2 provides some details about the partitioning tools discussed here.

Consider having a look back at other Parallel Data Warehouse Dynamic Management Views which were listed in Table 9.2. Ask yourself if they could be helpful in identifying or resolving shuffling or skewing, or optimizing data partitions.

TABLE 10.2 Database partition analysis features

Name	Description
sys.database_files	Identities how many files the database occupies
sys.dm_db_partition_stats	Renders the number of rows per partition
PDW_SHOWPARTITIONSTATS	Renders the size and number of rows for each partition
UPDATE_STATISTICS	Updates query optimization statistics (SQL query)
sp_updatestats	Updates query optimization statistics (stored procedure)

trygone.org

trygone.org

Tune Shuffle Partitions – Troubleshoot Data Storage Processing-4

Leave a Reply Cancel reply