Optimize and Troubleshoot Data Storage and Data Processing – Troubleshoot Data Storage Processing-2
Performing I/O inherently adds latency, so when you do it, it needs to be worth the effort. If your application code frequently writes very small pieces of data to disk, retrieves a single row at a time from a database, or sends small but frequent network messages, you might be experiencing the chatty I/O antipattern. The simple solution is to bundle these individual requests together and send them all at once. You need to look into your own application source code to determine this, but anywhere you write files or data to disk or issue queries against a database, make sure this antipattern is not happening.
You have learned that selecting every column with SELECT * is not a pattern that represents projection. Projection means that you select only the columns you need. Doing this reduces the time required to retrieve the data from the database and the time it takes to return the dataset to the client. When working with files, it is common to retrieve more than necessary in an effort to avoid additional or chatty I/O, but doing this results in the extraneous fetching antipattern. Design the retrieval of data to be as precise as possible to avoid this antipattern.
When coding in Scala or C#, it is common to create an object that stores data or manages a connection, for example, a DataFrame or an Event Hub client. Both of those classes are intended to be reused throughout the lifetime of the process. If you destroy them and then re-create them within the same process, your application code runs more slowly, because instantiating such objects carries a high overhead cost and should be avoided when possible. Not reusing these objects results in the improper instantiation antipattern. Performing the instantiation at the class level in C#, rather than within a specific method, resolves this behavior.
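The improper instantiation point can be illustrated with a minimal Python sketch. The names `ExpensiveClient`, `Publisher`, and `publish_wastefully` are hypothetical and stand in for a costly object such as an Event Hub client; they are not part of any real SDK. The counter exists only to make the number of instantiations visible.

```python
class ExpensiveClient:
    """Hypothetical stand-in for an object that is costly to create."""

    instances_created = 0  # class-level counter, for illustration only

    def __init__(self):
        # A real client would open connections, authenticate, and so on.
        ExpensiveClient.instances_created += 1

    def send(self, message):
        # Pretend to send the message; return its length as a fake result.
        return len(message)


class Publisher:
    """Creates the client once and reuses it for every call."""

    def __init__(self):
        self._client = ExpensiveClient()

    def publish(self, message):
        return self._client.send(message)


def publish_wastefully(messages):
    """Antipattern: a new client is instantiated for every single message."""
    return [ExpensiveClient().send(m) for m in messages]
```

Publishing three messages through one `Publisher` creates a single client, whereas `publish_wastefully` creates three. In a real application the difference shows up as connection setup time rather than as a counter.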
In addition to storing relational data in an Azure SQL database, it is also possible to store documents in it. Doing so can result in a performant solution; however, you know that ADLS and Azure Cosmos DB are datastores designed specifically for storing documents. Having a single datastore for all your data makes the solution architecture easier to understand and maintain, but it may also result in poorer performance. If you identify that better performance can be attained by storing your blob documents in a datastore other than the one that contains your relational data, then separating them resolves the monolithic persistence antipattern.
The benefits of caching data have been covered in detail numerous times in this book. Chapter 9 discussed how to determine whether data is being retrieved from cache for a dedicated SQL pool (refer to Figure 9.17). The metric is Adaptive Cache Used Percentage. Wherever possible, it is highly recommended to implement data caching, as it greatly improves performance by avoiding high-cost I/O retrievals.
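As a rough illustration of why caching avoids high-cost retrievals, the sketch below uses Python's standard `functools.lru_cache` as an in-process cache. This is an application-level analogy and an assumption on our part, not how the dedicated SQL pool adaptive cache is implemented. The function `load_customer` is hypothetical and its body stands in for a real database query.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def load_customer(customer_id):
    """Hypothetical expensive lookup; imagine a database round trip here."""
    # Return an immutable tuple so cached values cannot be modified by callers.
    return (customer_id, f"customer-{customer_id}")
```

After calling `load_customer(1)` twice and `load_customer(2)` once, `load_customer.cache_info()` reports two misses (the calls that actually ran the lookup) and one hit (the call served from memory).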
The noisy neighbor antipattern happens when an application on one server in a network consumes all available resources, impacting other servers and applications in that network. For example, take the limit of 2^16 (65,536) ports on a Windows OS server. If one application consumes all those ports, then other applications on that server cannot open new outbound connections. Additionally, networks have finite bandwidth, so heavy consumption by other clients on the same network can slow the transmission of data packets.
To avoid the retry storm antipattern, use a coding pattern called circuit breaker. This pattern describes how to determine whether the error causing the failure will last for some time, is expected to be resolved quickly enough to retry immediately, or falls somewhere in between. The point is that you do not want to retry a failed request immediately, over and over again, if the resource you need to fulfill the request is not going to be back online for some time. Instead, implement the circuit breaker pattern to reduce the high impact of repeated failed requests, and retry the request after an interval that reflects how long the resource is expected to be unavailable.
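A minimal sketch of the circuit breaker idea, under stated assumptions: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all illustrative choices, not a definitive implementation or any library's API. After a number of consecutive failures the breaker "opens" and rejects calls immediately; once the timeout elapses it lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so it can be tested
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still inside the cool-down window: fail fast, do not hit the resource.
                raise CircuitOpenError("resource assumed unavailable; not retrying yet")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success: close the circuit and clear the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request, for example `breaker.call(send_request, payload)` where `send_request` is whatever function performs the I/O, and treat `CircuitOpenError` as a signal to back off rather than retry. Choosing `reset_timeout` is where the judgment described above comes in: a short value suits transient faults, a longer one suits outages expected to last.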
The opposite of an asynchronous thread is a synchronous thread. As previously mentioned, prior to the ability to manage the thread type in managed code, all threads executed synchronously, one after the other, and waited idly until the long-running actions on those threads completed. I/O threads were, and still are, notorious for this, and you should avoid this kind of thread usage when possible.
Knowing which antipattern is causing the unwanted or unexpected computational outcome helps you choose the right tools to troubleshoot and drives the discovery of the issue location. The location of the issue is easier to find once the antipattern is identified, because each antipattern occurs in a well-defined place. For example, improper instantiation happens in application source code, whereas a noisy neighbor happens in the network.
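To illustrate the synchronous I/O point above, here is a small sketch using Python's standard `asyncio` module. The `fetch` coroutine is hypothetical and `asyncio.sleep` stands in for a real I/O wait such as a network call; the idea is only that the waits overlap instead of running one after the other.

```python
import asyncio


async def fetch(name, delay):
    """Hypothetical I/O-bound operation; the sleep simulates waiting on I/O."""
    await asyncio.sleep(delay)  # yields control so other tasks can run meanwhile
    return name


async def fetch_all(requests):
    """Start every request at once and wait for all of them together."""
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(fetch(name, delay) for name, delay in requests))
```

Running `asyncio.run(fetch_all([("a", 0.02), ("b", 0.01)]))` waits roughly as long as the slowest request rather than the sum of all delays, which is the benefit lost when each I/O call blocks its thread.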