Compact Small Files – Troubleshoot Data Storage Processing
The “Design for Efficient Querying” section in Chapter 3, “Data Sources and Ingestion,” discussed how the number of files can affect performance; specifically, processing many small files can degrade it. One reason is that an ADLS container is designed to work best with larger files. Another reason is that loading a file from disk requires an I/O operation, so retrieving many small files can result in a chatty I/O antipattern. To learn how to merge a group of small files into a single file, perform Exercise 10.1. The exercise merges all the Classical Music brain wave CSV files located on GitHub, in the BrainwaveData/SessionCSV/ClassicalMusic/POW directory. These files were uploaded to the ADLS container you created in Exercise 3.1, as described in Exercise 3.2.
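Before compacting anything, it can help to quantify the problem. The following is a minimal sketch, not part of Exercise 10.1, that assumes a Synapse Apache Spark pool (where mssparkutils is available) and uses placeholder container and account names; it lists the files in the POW directory and reports how many there are and how small they are on average.
%%pyspark
# Illustrative sketch: count the brain wave CSV files and measure their sizes
# to confirm the many-small-files problem. Replace <container> and <account>.
from notebookutils import mssparkutils

path = 'abfss://<container>@<account>.dfs.core.windows.net/SessionCSV/ClassicalMusic/POW'
files = [f for f in mssparkutils.fs.ls(path) if f.isFile and f.name.endswith('.csv')]
total_bytes = sum(f.size for f in files)

print(f'{len(files)} CSV files totaling {total_bytes / (1024 * 1024):.2f} MB')
print(f'average file size: {total_bytes / max(len(files), 1) / 1024:.1f} KB')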
EXERCISE 10.1 Compact Small Files
- Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ navigate to the Integrate hub ➢ create a new pipeline ➢ expand the Move & Transform group ➢ drag and drop a Copy Data activity to the pipeline ➢ and then select the Source tab.
- Create a new ADLS Gen2 Delimited Text Source dataset that targets the Classical Music CSV brain wave files from Exercise 3.2. Leave the file name empty. The file path should resemble the following:
brainjammer/SessionCSV/ClassicalMusic/POW/
- Select the Wildcard File Path radio button as the File Path Type value, something similar to the following:
brainjammer/SessionCSV/ClassicalMusic/POW/*
- Select the Sink tab, and then create a new ADLS Gen2 Delimited Text Sink dataset to write the single merged CSV file into a directory in the ADLS container, perhaps like the following:
brainjammer/EMEA/brainjammer/in/2022/11/30/14/
- Select Merge Files from the Copy Behavior drop‐down list box. Figure 10.2 shows a summarized view of the Source tab and Sink tab configurations.
FIGURE 10.2 Compacting small files—Source and Sink tabs
- Click Commit ➢ click Commit All ➢ click Debug ➢ and then view the merged CSV file in the targeted sink location within the chosen ADLS container.
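If you want to confirm the merge outside of the pipeline monitoring view, a small verification sketch like the following can be run from a Synapse notebook. The path mirrors the sink directory configured above, and the container and account names are placeholders.
%%pyspark
# Illustrative check, not part of Exercise 10.1: read the merged output back
# and count the rows to confirm the small files were combined.
merged = spark.read.load(
    'abfss://<container>@<account>.dfs.core.windows.net/EMEA/brainjammer/in/2022/11/30/14/',
    format='csv', header=True)
print(f'merged row count: {merged.count()}')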
The pipeline and dataset JSON configuration files are located in the Chapter10/Ch10Ex01 directory on GitHub. These files will help you identify the specific configuration details if something is unclear or not working as expected. None of the steps performed in Exercise 10.1 are new; you have already created a pipeline and datasets and configured a source and sink. The same result can be achieved with Python code in a notebook running on an Apache Spark pool, using code similar to the following:
%%pyspark
# The container and account segment of these paths is a placeholder; replace
# <container> and <account> with your ADLS Gen2 values.
df = spark.read \
    .load('abfss://<container>@<account>.dfs.core.windows.net/path/*.csv',
          format='csv', header=True)
df.write \
    .mode('overwrite').format('csv') \
    .save('abfss://<container>@<account>.dfs.core.windows.net/path/merged.csv')
The syntax should look familiar to you at this point, as you have performed similar activities in previous exercises. The code snippet loads all the CSV files found in the provided path into a DataFrame and then writes the contents of that DataFrame back out to the provided path in CSV format.
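Keep in mind that Spark writes one output file per DataFrame partition, so the snippet above can produce a folder named merged.csv containing several part files rather than a single file. The following sketch shows one way to force a single merged file, assuming the data fits comfortably in one partition; the paths are the same placeholders used above.
%%pyspark
# Illustrative variant: coalesce to one partition so Spark emits a single part
# file in the target folder. Replace <container> and <account> with your values.
df = spark.read \
    .load('abfss://<container>@<account>.dfs.core.windows.net/path/*.csv',
          format='csv', header=True)
df.coalesce(1) \
    .write \
    .mode('overwrite') \
    .option('header', True) \
    .format('csv') \
    .save('abfss://<container>@<account>.dfs.core.windows.net/path/merged')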
To conclude this section, remember two points, the first concerning optimal file sizes. In Chapter 3 you learned that the optimal file size for dedicated SQL pools is between 100 MB and 10 GB; for Apache Spark pools, it is between 256 MB and 100 GB. The second point to keep in mind is the law of 60 (a dedicated SQL pool distributes table data across 60 distributions) and the fact that queries perform much better on tables with at least 1,000,000 rows. Details on both of those points are covered in Chapter 3, where designing a partitioning strategy is discussed.
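As a rough illustration of how the Spark pool file-size target can guide a compaction job, the following sketch estimates how many output files to write based on the total size of the input directory. The 256 MB target comes from the guidance above; the paths, account names, and the sizing heuristic itself are illustrative assumptions rather than exercise steps.
%%pyspark
# Illustrative sketch: size the output so each written file lands near the
# 256 MB target recommended for Apache Spark pools.
from notebookutils import mssparkutils
import math

source = 'abfss://<container>@<account>.dfs.core.windows.net/SessionCSV/ClassicalMusic/POW'
target_file_bytes = 256 * 1024 * 1024  # 256 MB per output file

total_bytes = sum(f.size for f in mssparkutils.fs.ls(source) if f.isFile)
num_files = max(1, math.ceil(total_bytes / target_file_bytes))

df = spark.read.load(source + '/*.csv', format='csv', header=True)
df.repartition(num_files) \
    .write.mode('overwrite').option('header', True).format('csv') \
    .save('abfss://<container>@<account>.dfs.core.windows.net/path/compacted')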