Develop a Batch Processing Solution – Monitoring Azure Data Storage and Processing
Chapter 6 covered everything you need to know about designing and implementing batch processing from an Azure Synapse Analytics pipeline. To test and troubleshoot pipelines, you first need to know what kinds of logs exist, how to configure them, how to retrieve them, and what they mean. That is why this section is here, in the chapter that covers monitoring, and also why additional troubleshooting information is provided in Chapter 10. You should now have significant insight into the monitoring capabilities in Azure, so the following section can be better consumed and placed into context, and you will be better able to apply this knowledge in situations that require it.
Design and Create Tests for Data Pipelines
The pipeline that used Azure Batch in Chapter 6 was named TransformSessionFrequencyToMedian and was created in Exercise 6.2. After its initial creation, the pipeline was modified and updated in seven other exercises. There are three different components to this pipeline: the Azure Batch job (Calculate Frequency Median), a Spark job (To Avro), and the notebook (Identify Brainwave Scenario), which runs in the Azure Databricks workspace. To build a test plan, you must know precisely what each component does and the expected outcome of each activity. The Azure Batch account created in Exercise 6.1 had a primary objective of converting JSON documents that contain full brain wave sessions for a given scenario into median values per frequency. For every JSON document containing a session, there should be a JSON file created with just the median frequencies. In this case there are 20 sessions, so 20 output files are expected, with content resembling the following:
{"ALPHA": 1.4492, "BETA_H": 0.7406, "BETA_L": 2.4036, "GAMMA": 0.4342, "THETA": 3.3934}
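For illustration, the median-per-frequency calculation that produces output like the sample above can be sketched in Python. The session document shape assumed here (a dictionary mapping frequency names to lists of readings) and the rounding to four decimal places are assumptions for the sketch, not the exact behavior of brainjammer‐batch.exe:

```python
import json
from statistics import median

def session_medians(session: dict) -> dict:
    """Compute the median reading for each frequency in a session.

    Assumes the session document maps frequency names (ALPHA, BETA_H, ...)
    to lists of numeric readings; the real document layout may differ.
    """
    return {freq: round(median(values), 4) for freq, values in session.items()}

# Hypothetical session with a few readings per frequency
session = {
    "ALPHA": [1.2, 1.4492, 1.7],
    "BETA_H": [0.7406, 0.6, 0.9],
    "THETA": [3.1, 3.3934, 3.5],
}
print(json.dumps(session_medians(session)))
# → {"ALPHA": 1.4492, "BETA_H": 0.7406, "THETA": 3.3934}
```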
The Azure Batch job achieved this via a program named brainjammer‐batch.exe, which is in the Chapter06\Ch06Ex01 directory on GitHub. You should check the following during testing:
- Whether the Azure Batch job completed successfully
- Whether the number of files taken from the input directory is equal to the number of files in the output directory
The Spark job activity was added in Exercise 6.3 with the purpose of converting the JSON files into AVRO files, a file format well suited to batch processing. The Python code for this activity is also on GitHub, in the Chapter06\Ch06Ex03 directory. The output is a group of AVRO files placed into an output directory. You should check the following during testing:
- Whether the Spark job completed successfully
- Whether files are in the specified output directory
The final activity was created in Exercise 6.4 and executed on an Azure Databricks Apache Spark cluster. The notebook, written in Python, retrieves the AVRO files generated by the To Avro activity. The frequency values contained in the AVRO files are compared against the values in Table 5.2. The output is written to a Delta table and then retrieved. You should check the following during testing:
- Whether the notebook run completed successfully
- Whether there are rows in the Delta table
In total, six tests must be carried out after changes, and all must pass before those changes are published to the production environment. There are numerous approaches to testing. The first decision is whether the tests should be carried out manually or automated. In this scenario you are working alone and only a single pipeline is in scope; the pipeline is not very complex, so a manual testing approach is the most sensible. Setting up an automated testing solution is a large undertaking, and in this scenario the effort required to set it up and manage it would outweigh its benefit. However, if you are working in a team environment, with many pipelines or pipelines of some complexity, and you perhaps want to implement CI/CD, then the effort to automate testing would be beneficial. You will learn why as the discussion progresses.
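Even a manual test plan benefits from being written down as a checklist. As a sketch, the six checks could be collected as named callables and reported together; the check names mirror the bullets above, and the lambda bodies here are placeholders for the real verifications:

```python
from typing import Callable, Dict

def run_test_plan(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check, treating any raised exception as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

# Placeholder checks standing in for the six real verifications
checks = {
    "batch job completed successfully": lambda: True,
    "input/output file counts match": lambda: True,
    "spark job completed successfully": lambda: True,
    "avro files in output directory": lambda: True,
    "notebook run completed successfully": lambda: True,
    "delta table has rows": lambda: True,
}
results = run_test_plan(checks)
assert all(results.values()), f"failed: {[n for n, ok in results.items() if not ok]}"
```

A structure like this is also the natural first step toward automation, since the same checklist can later be wired into a CI/CD pipeline.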
Once you decide to go with the automated approach (and to some extent this applies to manual testing as well), you need to determine which types of tests to perform. There are many, but five of the most common are summarized in Table 9.8.
TABLE 9.8 Different types of testing
| Type | Description |
| --- | --- |
| Unit | Low level; tests classes, methods, and functions in the source code |
| Integration | High level; cross‐component testing of architecture and code dependencies |
| Functional | Tests output against business requirements |
| Performance | Compares the last version's performance metrics with the current version's |
| Regression | Checks for old bugs reintroduced into the release pipeline |
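As an example of the first row, a unit test exercises a single function in isolation. The frequency_median helper below is hypothetical, but the pattern, using Python's built-in unittest module, applies to any helper code in a pipeline:

```python
import unittest
from statistics import median

def frequency_median(values):
    """Hypothetical helper: median of one frequency's readings, 4 decimals."""
    return round(median(values), 4)

class FrequencyMedianTests(unittest.TestCase):
    def test_odd_count_returns_middle_value(self):
        self.assertEqual(frequency_median([1.2, 1.4492, 1.7]), 1.4492)

    def test_even_count_averages_middle_pair(self):
        self.assertEqual(frequency_median([1.0, 2.0, 3.0, 4.0]), 2.5)

# Run with: python -m unittest <module name>
```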