Optimize Pipelines for Analytical or Transactional Purposes
It is important to state that you should utilize all of the DMVs discussed in this chapter. This section focuses specifically on the transaction DMVs, but all of them are useful for finding latency and implementing tuning techniques, and the same approach applies when taking the findings from this DMV forward. The sys.dm_pdw_nodes_tran_database_transactions DMV can identify whether transactions are the source of a latency problem. You can also use some of the other DMVs to find the specific query that is failing and the specific step within the query. Ideas surrounding how to pursue that and how to determine which DMVs to use have already been discussed, many of them in Table 10.4. Table 10.6 provides an extended list of transaction-related DMVs; a sample query against the transaction DMV follows the table.
TABLE 10.6 Transaction and HTAP dynamic management views
| Name | Description |
| --- | --- |
| sys.dm_tran_commit_table | Lists committed transactions tracked by change tracking |
| sys.dm_db_xtp_transactions | Lists active in-memory OLTP transactions |
| sys.dm_tran_current_snapshot | Displays the transactions that were active when the current snapshot transaction started |
| sys.dm_tran_database_transactions, sys.dm_pdw_nodes_tran_database_transactions | Returns information about transactions at the database level (the pdw_nodes variant reports per node) |
| sys.dm_tran_session_transactions | Provides correlation information between sessions and transactions |
| sys.dm_tran_active_transactions | Returns information about active transactions for the instance |
| sys.dm_tran_current_transaction | Returns state information about the transaction in the current session |
| sys.dm_tran_locks | Returns information about currently active lock requests and the resources they lock |
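As a starting point, here is a minimal sketch of how you might query the transaction DMV to spot long-running or log-heavy transactions per node. The column list reflects the documented shape of sys.dm_pdw_nodes_tran_database_transactions, and ordering by log usage is simply one reasonable choice:

```sql
-- A minimal sketch: list open transactions per distribution node,
-- ordered by how much transaction log space each one has consumed.
SELECT  t.pdw_node_id,
        t.transaction_id,
        t.database_transaction_begin_time,
        t.database_transaction_state,
        t.database_transaction_log_bytes_used
FROM    sys.dm_pdw_nodes_tran_database_transactions AS t
ORDER BY t.database_transaction_log_bytes_used DESC;
```

A transaction with an old begin time, or one that has written an outsized amount of log, is a good candidate for further investigation with the query- and step-level DMVs referenced above.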
Another DMV that is useful for monitoring transactions is sys.dm_pdw_nodes_os_performance_counters, which can provide the size of the transaction log file. This DMV was introduced in Table 9.2. Performing a SELECT on that DMV and checking the value in the counter_value column will determine whether the transaction log file has grown too large, which means you should scale up to get more compute resources. To filter the query down a bit more, search for "log file(s)" in the counter_name column and "distribution" in the instance_name column. The size threshold for the transaction log file is around 160 GB; if you reach that, you should consider scaling up.
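As a hedged example, the following query applies the filters just described; the exact counter strings can vary by version, so treat the LIKE patterns as a starting point rather than a definitive filter:

```sql
-- Check transaction log file size per distribution.
-- counter_value is reported in KB for the log file size counters.
SELECT  pdw_node_id,
        instance_name,
        counter_name,
        counter_value
FROM    sys.dm_pdw_nodes_os_performance_counters
WHERE   counter_name LIKE '%log file(s) size%'
  AND   instance_name LIKE '%distribution%';
```

If the totals approach the roughly 160 GB threshold mentioned above, scaling up is the recommended response.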
The concept of scaling, as discussed numerous times, has to do with adding compute resources to your workspace for running your data analytics processes. Scaling in the context just discussed refers to the performance level of the dedicated SQL pool. The compute infrastructure responsible for running your data integration capabilities (pipelines, triggers, linked services, access controls, and so on) is the integration runtime (IR). Chapter 3 discussed the capabilities of an IR, with Table 3.12 listing the different IR types and the compute resources they provide. An IR does not provide the compute for running SQL queries on your dedicated SQL pools, for Apache Spark pools, or for batch processing, which is discussed later, nor does it provide the compute resources for a data flow, discussed next. Keep in mind that each datastore requires its own allocated amount of compute resources, and you need to keep an eye on which one specifically requires scaling.
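If you decide that scaling up the dedicated SQL pool is warranted, the performance level can be changed with T-SQL as well as through the Azure portal. The following is a minimal sketch, run against the master database; the pool name and target service objective shown here are hypothetical, so substitute your own:

```sql
-- Scale a dedicated SQL pool (hypothetical name) to a higher
-- performance level; run this against the master database.
ALTER DATABASE examplepool
MODIFY (SERVICE_OBJECTIVE = 'DW400c');

-- Verify the current performance level afterward.
SELECT  db.name,
        so.service_objective
FROM    sys.database_service_objectives AS so
JOIN    sys.databases AS db ON db.database_id = so.database_id;
```

The scale operation is asynchronous and briefly takes the pool offline, so schedule it outside of active processing windows.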
The last point of discussion in this section is an activity in a pipeline that can have a significant impact on performance: the Data Flow activity. An insight concerning Data Flow activities that you should have gained from Exercises 4.7, 5.1, and 5.3 is that this activity performs the heavy data ingestion and initial transformation work. Specifically, a data flow begins with a data source from which the data is pulled. The data is copied to the workspace, numerous transformations are performed on it, and, finally, the transformed result is inserted into a sink. Those three segments (source, transformation, and sink) are a good paradigm for compartmentalizing where to focus your optimization efforts. In Exercise 9.2 you learned how to determine the duration of a Data Flow activity that runs within your pipeline. If the duration is longer than expected, or if you simply want to drill into the details of the data flow, that capability is available on the Monitor hub and shown in Figure 9.15. Clicking the icon in the Stages column for the data flow sink transformation, as shown in Figure 9.15, breaks that duration down by transformation and stage, as shown in Figure 10.10.
You can also click each transformation in the data flow to view further diagnostic details, as shown in Figure 10.11.