Handle Skew in Data – Troubleshoot Data Storage Processing
Handle Skew in Data, Microsoft DP-203, Tune Queries by Using CacheData skew, introduced in Chapter 2, is the uneven distribution of data across nodes. Refer to Figure 2.19 to visualize data skewing. As mentioned in Chapter 3, a large influx or purging of data that happens to be stored on a specific node due to the applied partition attributes can cause skewing. That event can cause shuffling, which is the platform’s response to such an influx or purging of data. It is important, however, to proactively monitor your partitions so that the data distribution remains optimal as data scenarios change. One approach for monitoring data distribution and skewing is to execute the following code snippet, which was introduced in Chapter 9. An example output follows the code snippet.
DBCC PDW_SHOWSPACEUSED (‘[SQLPool].[brainwaves].[FactREADING]’);
+——-+—————-+————+————–+—————–+
| ROWS |RESERVED_SPACE|DATA_SPACE|UNUSED_SPACE|DISTRIBUTION_ID|
+——-+—————-+————+————–+—————–+
| 72546 | 136472 | 136256 | 184 | 31 |
| 72524 | 98032 | 97792 | 208 | 37 |
| 72040 | 97276 | 98295 | 182 | 46 |
| 69311 | 98128 | 97920 | 176 | 17 |
| 35343 | 98160 | 97904 | 224 | 40 |
+——-+—————-+————+————–+—————–+
The result of PDW_SHOWSPACEUSED shows that data is in fact skewed across distributions. This is recognizable by the wide variety of data rows placed onto each node, which is identified by the value in the DISTRIBUTION_ID column. The goal is to have data spread evenly across all the distributions.
Another method for discovering and analyzing data skew is by executing the following PDW query, which is followed by a sampling of the results:
SELECT row_count, pdw_node_id, distribution_id
FROM sys.dm_pdw_nodes_db_partition_stats
ORDER BY row_count DESC
+———–+————-+—————–+
| row_count | pdw_node_id | distribution_id |
+———–+————-+—————–+
| 72546 | 9 | 31 |
| 72524 | 9 | 37 |
| 72040 | 9 | 46 |
| 69311 | 9 | 17 |
| 35343 | 9 | 40 |
+———–+————-+—————–+
The output is similar to the DBCC command, with the addition of the node on which the distribution is hosted. Finally, recall from Chapter 5, “Transform, Manage, and Prepare Data,” (Exercise 5.11) that you changed data in order to remove outlying data that was skewing the median and average calculations of brain wave scenarios. In addition to reconfiguring the partition parameters to improve skewing, the action you took in Exercise 5.11 is an example of how to approach solving this kind of issue.