Optimize Pipelines for Analytical or Transactional Purposes – Troubleshoot Data Storage Processing-5
The other option in the Compute Size drop-down is Custom, which enables the Compute Type and Core Count boxes, allowing you to select up to 256 cores (+ 256 driver cores) for either the Basic or Standard compute type. In addition to allowing a greater core count, the Custom option lets you add dynamic content. This means you can write logic that determines the compute type and/or the core count at runtime. Consider the following code snippet, for example:
@if(greater(activity('GetAKVMonitorLog').output.size, 1000000000), 128, 64)
This snippet checks the output of the GetAKVMonitorLog activity, which in this hypothetical scenario returns the size of the data that needs to be transformed during this run. If that size is greater than 1 GB, the core count is set to 128. If the greater() method returns false, meaning the data is smaller than 1 GB, 64 cores are allocated to transforming the data. This is a very powerful approach for managing the amount of compute resources allocated at runtime.
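To see where that expression fits, the following is a minimal sketch of how an Execute Data Flow activity with a dynamic core count might appear in exported pipeline JSON. The activity name, data flow name, and the computeType value are hypothetical, and the exact property names should be verified against the JSON your own pipeline generates:

{
  "name": "TransformMonitorLogs",
  "type": "ExecuteDataFlow",
  "dependsOn": [
    { "activity": "GetAKVMonitorLog", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "dataFlow": {
      "referenceName": "MonitorLogDataFlow",
      "type": "DataFlowReference"
    },
    "compute": {
      "computeType": "General",
      "coreCount": {
        "value": "@if(greater(activity('GetAKVMonitorLog').output.size, 1000000000), 128, 64)",
        "type": "Expression"
      }
    }
  }
}

The dependsOn entry ensures that GetAKVMonitorLog completes before the expression evaluates its output.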
The third segment of the data flow discussed in this section falls between the source and the sink, which is where the transformation of the data occurs. Many transformations can take place during this stage (refer to Table 4.4 and Table 4.5). For transformations like joins, exists, and lookups, there is a feature called broadcasting. If the data stream that any of those three transformations receives is small enough to fit inside the memory of the node on which the data flow executes, the data can be held in memory instead of written to disk, which is much faster because there is no disk I/O. Figure 10.14 shows the three options for configuring broadcasting: Auto, Fixed, and Off.
FIGURE 10.14 Optimizing pipelines for analytics or transactional purposes: data flow broadcasting
When the Auto option is selected, the data flow engine decides if and when to broadcast. Setting the option to Off is recommended when you know in advance that the data stream, incoming (Left) or outgoing (Right), is too large to fit into the memory allocated to the data flow. Setting the option to Off disables the feature for the data flow engine, which reduces the required compute resources. Compute requirements are also reduced when the Broadcast option is set to Fixed, which not only disables the automatic decision process but also forces the selected stream to be held in memory instead of written to disk.
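The broadcast selection you make on the Optimize tab is persisted in the data flow script behind the transformation. The following is a rough sketch of a join transformation with the smaller left stream broadcast into memory; the stream names, column names, and output name are hypothetical, and the exact values the designer writes for Auto, Fixed, and Off can be confirmed by viewing your own data flow's script:

CustomerStream, OrderStream join(
    CustomerStream@customerId == OrderStream@customerId,
    joinType: 'inner',
    broadcast: 'left') ~> JoinCustomerOrders

When the broadcast property is set to 'auto', the engine makes the decision for you, which corresponds to the Auto option shown in Figure 10.14.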