How New QLC SATA SSDs Deliver 8x Faster Machine Learning

Posted on October 9, 2018 in Benchmarks, Publications, Recent

We report performance measurements of the Micron 5210 ION SSD in a Machine Learning workflow. Even though Machine Learning is highly CPU-intensive, fast storage can lower training time through faster file pre-processing and serialization, particularly when the size of a data set exceeds the amount of installed memory. A popular format for datasets is TFRecord, and in our measurements we compare the throughput and completion time of TFRecord creation on a 7.68TB Micron 5210 ION SSD with those on an 8TB Seagate 7200 RPM HDD.

Colfax-Machine-Learning-and-QLC-SSDs.pdf (151 KB)

1. QLC SSDs
2. Micron QLC SSD
3. Test System Configuration
4. Test Workload: TFRecord
5. Test Results
6. Summary

1. QLC SSDs

For years, 7200 RPM hard disk drives (HDDs) have been the standard media on which Machine Learning (ML) training data sets have been stored. These traditional HDDs have been preferred due to their low cost and easy-to-adopt SATA interfaces. However, HDDs suffer from relatively slow throughput. Solid State Drives (SSDs) offer much higher throughput but have been too expensive to justify the potential gain. Although SSD cost per GB has been decreasing, their overall cost has remained just out of reach for all but the most demanding (and expensive) ML platforms. This barrier is starting to fall with the introduction of the world’s first QLC SSD, the Micron® 5210 ION enterprise SATA SSD.

2. Micron QLC SSD

Micron QLC SSDs are a new type of SSD that uses Quad-Level Cell (QLC) NAND technology. QLC flash stores 4 bits of data per cell, which increases density and therefore enables higher storage capacity while retaining fast read speeds. However, storing more bits per cell decreases the number of write cycles available, which ultimately shortens the lifespan of the SSD if write-heavy workloads are run. QLC SSDs are therefore aimed at eventually replacing standard HDDs in read-intensive workloads by providing SSD read speed with HDD capacity. They come in a 2.5-inch form factor, and the capacities range from 1.92TB to 7.68TB at an approachable cost.
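The density side of this trade-off follows from simple arithmetic: a cell that stores n bits has to distinguish 2^n charge states. A small sketch of that relationship (the names `BITS_PER_CELL` and `charge_states` are illustrative only and do not come from any vendor tool):

```python
# Bits stored per cell for the common NAND flash types.
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def charge_states(bits):
    """Number of distinct charge states a cell must distinguish."""
    return 2 ** bits


for name, bits in BITS_PER_CELL.items():
    print(f"{name}: {bits} bit(s) per cell, {charge_states(bits)} states")
```

A QLC cell has to tell apart 16 states where a TLC cell has 8, which leaves narrower margins between states; this is commonly cited as one reason for the lower write endurance.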

3. Test System Configuration

Our test configuration is a standard x86 server platform equipped as noted in Table 1.

Component	Details
CPU	2x Intel® Xeon® Platinum 8168 CPU @ 2.70GHz
Cores	48 cores with 2-way Intel® Hyper-Threading Technology
Memory	96 GB DDR4
Drives	7.68TB Micron® 5210 QLC SSD; 8TB Seagate® 7200 RPM HDD

Table 1: System Configuration

4. Test Workload: TFRecord

TFRecord is the standard format for TensorFlow. It is used to store large amounts of data (for example, a collection of images) in a single TFRecord file, which can be read from storage faster than individual files and loaded into TensorFlow in batches for training.

In our test workload, we read large high-resolution images in the TIFF format from the drive using OpenCV, rescale each image to low resolution, append the resulting image to a TFRecord, and write the resulting TFRecord file back onto the drive. This process simulates the preprocessing of real-world data into a format suitable for deep learning training. To reach the performance limits of our system, we designed our TFRecord creation script to ingest images in parallel using Python multiprocessing. When we run the code, we can specify the number of concurrent processes that read and resize the images. Parallel processing of data reading and conversion taps into the additional performance of both the drive and the multi-core CPU. We can tune this parallelism to make full use of the available read speed and CPU processing power.
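The parallel structure described above can be sketched roughly as follows. This is a minimal illustration and not the script used for these measurements: `shrink` is a hypothetical stand-in for the OpenCV read-and-resize step, and the output uses a simple length-prefixed record layout instead of the real TFRecord framing, so that the sketch runs with only the Python standard library. The names `shrink`, `pack_records`, and `pool_cls` are assumptions made for illustration.

```python
import multiprocessing
import struct
from pathlib import Path


def shrink(path, keep_every=4):
    """Stand-in for 'read image, rescale to low resolution'.

    Reads the file and keeps every `keep_every`-th byte. A real pipeline
    would decode the TIFF and resize it with an image library instead.
    """
    data = Path(path).read_bytes()
    return data[::keep_every]


def pack_records(paths, out_path, processes, pool_cls=multiprocessing.Pool):
    """Shrink files in parallel and append the results to one output file.

    Each record is written as an 8-byte little-endian length followed by
    the payload (a simplified layout, not the actual TFRecord framing).
    Returns the number of records written.
    """
    count = 0
    with pool_cls(processes) as pool, open(out_path, "wb") as out:
        # imap preserves input order while the workers read and shrink
        # several files concurrently; only the parent process writes.
        for payload in pool.imap(shrink, paths):
            out.write(struct.pack("<Q", len(payload)))
            out.write(payload)
            count += 1
    return count
```

The `processes` argument plays the role of the process count varied in the tests. On platforms that start worker processes by re-importing the main module, the call to `pack_records` belongs under an `if __name__ == "__main__":` guard.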

5. Test Results

First, we fixed the number of images and increased the number of processes to show how performance and utilization scale. Each configuration was run 3 times to reduce the effect of run-to-run variation in speed and time. Barring any anomalies, the fastest run was recorded.

Table 2 shows the bandwidth and time for each test and drive by process count.

	Micron 5210 ION 7.68TB		Typical 7200 RPM HDD 8TB
Number of Processes	Bandwidth MB/s	Time (sec)	Bandwidth MB/s	Time (sec)
1	24.4	981	24.2	898
2	47.7	501	39.4	607
3	69.2	345	49.2	523
4	91.6	261	47.5	507
6	116	206	50.1	507
8	174	137	52.1	504
10	208	115	53.5	501
14	287	83.2	54.5	494
18	320	74.6	55.5	487
22	348	68.7	55.8	485
26	362	66.1	56.2	478
30	368	64.9	57.0	459
40	353	67.7	57.9	448
50	355	67.3	47.5	439
60	364	65.6	49.4	431
70	367	65.1	47.2	429
80	376	63.6	48.5	426
90	377	63.3	45.8	420
100	375	64.0	47.7	413

Table 2: Performance test of bandwidth and completion time. Data set is 1000 images at 22MB each, totaling 23GB data set size.

Figure 1: Performance test of bandwidth and completion time. Data set is 1000 images at 22MB each, totaling 23GB data set size.

With a single process, the Micron 5210 and the HDD performed similarly (a rate of about 1 image per second), resulting in run times of 981 and 898 seconds, respectively. With one process, the performance is limited by the CPU and not by the drive read speed. As the number of concurrent processes increased, the throughput of the Micron 5210 SSD also increased until it reached a maximum sustained read speed at around 30 processes.

Parallel processing did not significantly accelerate image conversion on the HDD, which achieved at best a 2.3x speedup compared to single-process performance. We attribute this to the nature of a spinning drive, which cannot efficiently handle parallel read operations. In contrast, the Micron 5210 SSD, with its fast reads and internal parallelism, serviced many more concurrent requests than the HDD and showed a commensurate performance improvement. At 90 concurrent processes, the Micron 5210 was 8.2x faster than the HDD and 15.5x faster than either drive running only 1 process.
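The two headline ratios can be reproduced from the bandwidth columns of Table 2. The following is a small sketch of that arithmetic; the variable names are illustrative, and the single-process baseline used here is the SSD figure.

```python
# Bandwidth figures (MB/s) copied from Table 2.
ssd_1_proc = 24.4    # Micron 5210, 1 process
ssd_90_proc = 377.0  # Micron 5210, 90 processes
hdd_90_proc = 45.8   # HDD, 90 processes


def speedup(fast, slow):
    """Ratio of two bandwidths, rounded to one decimal place."""
    return round(fast / slow, 1)


ssd_vs_hdd = speedup(ssd_90_proc, hdd_90_proc)
ssd_vs_single = speedup(ssd_90_proc, ssd_1_proc)
print(ssd_vs_hdd, ssd_vs_single)  # 8.2 15.5
```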

Next, to fine-tune our results, we tested all the process counts between 90 and 100 to find the optimal parallelism (the point at which speed reached its maximum). These results are shown in Table 3.

	Micron 5210 ION 7.68TB
Number of Processes	Bandwidth MB/s	Time (sec)
91	388.9	61.5
92	382.3	62.6
93	391.1	61.2
94	386.0	62.0
95	391.3	61.2
96	391.7	61.1
97	390.1	61.3
98	391.1	61.1
99	389.6	61.4

Table 3: 5210 process count > 90.

Performance results in Table 3 demonstrate how the process count affects bandwidth. Here we noted that the performance peaked at 96 processes. Running the test script multiple times showed some variation where 97 processes was faster, but on average 96 processes consistently showed the best results.

Because 96 processes showed the highest throughput, and because this value matches the number of logical CPUs available in the server (48 cores, each running 2 hyper-threads), we used it as the base for our next performance measurement.
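The relationship between core count and process count can be expressed in a few lines. This is only a sketch of a starting point for tuning, under the assumption that one worker per logical CPU is a reasonable first guess; the helper name `suggested_process_count` is hypothetical.

```python
import os


def suggested_process_count(cores=None, threads_per_core=2):
    """Return one worker per logical CPU.

    If `cores` is given, multiply physical cores by threads per core;
    otherwise ask the operating system for the logical CPU count.
    """
    if cores is not None:
        return cores * threads_per_core
    # os.cpu_count() can return None, so fall back to a single worker.
    return os.cpu_count() or 1


# 48 cores with 2 hyper-threads each, as in Table 1.
print(suggested_process_count(cores=48))  # 96
```

As Table 3 shows, neighbouring process counts perform almost identically, so a value derived this way is best confirmed by measurement on the target system.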

Using the optimal process count of 96, we then ran the TFRecord creation workload on various numbers of images. Given that many ML workloads involve very large numbers of images, we tested large data sets to ensure real-world applicability. For example, skin cancer detection algorithms (reported to be more accurate than doctors) have been built by training on roughly 100,000 images, so we used 100,000 images as a barometer of real-world results.

The results for image sets of 10,000 and above are the key numbers to focus on below. In all of these instances, storage is effectively isolated as the test variable because the size of the dataset exceeds the available DRAM, so the data cannot be held entirely in memory. Given that many ML datasets are far larger than a system’s installed memory capacity, these results are a good indicator of the actual performance differences between the storage drives being compared. Results are shown below in Table 4.
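The memory argument above can be checked with a short calculation. The sketch below uses the per-image size from Table 4 and the installed memory from Table 1; the helper names are illustrative.

```python
IMAGE_MB = 23   # size of one image, from Table 4
DRAM_GB = 96    # installed memory, from Table 1


def dataset_gb(num_images, image_mb=IMAGE_MB):
    """Total data set size in GB (1 GB = 1000 MB)."""
    return num_images * image_mb / 1000


def exceeds_memory(num_images, dram_gb=DRAM_GB):
    """True if the data set cannot fit in installed memory."""
    return dataset_gb(num_images) > dram_gb


for n in (1_000, 10_000, 50_000, 100_000):
    print(n, dataset_gb(n), exceeds_memory(n))
```

With these figures, the 1,000-image set (23 GB) still fits in memory, while the 10,000-image set (230 GB) and larger do not.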

Test Parameters			7.68TB Micron 5210		8TB HDD		Performance Delta
Number of Images	Processes	Size of Dataset*	Bandwidth, MB/s*	Completion Time, s*	Bandwidth, MB/s*	Completion Time, s*	Speedup	Time Saved
1	96	23 MB	23	1	22	1	1x	–
2	96	46 MB	24	1	23	1	1x	–
4	96	92 MB	46	1	42	1	1x	–
8	96	184 MB	113	1	49	3	2x	2 seconds
16	96	368 MB	255	1	48	7	5x	6 seconds
32	96	736 MB	340	2	44	16	9x	14 seconds
64	96	2 GB	371	4	44	34	9x	30 seconds
128	96	3 GB	378	8	47	65	8x	1 minute
256	96	6 GB	389	16	44	138	9x	2 minutes
512	96	12 GB	389	32	40	305	10x	5 minutes
1,000	96	23 GB	394	61	43	605	10x	9 minutes
10,000	96	230 GB	376	636	44	5,423	9x	1.3 hours
50,000	96	1150 GB	370	3,234	44	27,258	8x	6.7 hours
100,000	96	2300 GB	354	6,764	43	54,617	8x	13.3 hours

Table 4: Varying image counts tested. *All data rounded to the nearest whole number.

Table 4 shows that TFRecord creation reaches near-maximum throughput once the dataset contains more than 64 images and stays fairly consistent all the way to 100,000 images. Compared with the HDD, the time difference is substantial, especially at 100,000 images: the Micron 5210 SSD completes the workload in 6,764 seconds (113 minutes, or about 2 hours), versus 54,617 seconds for the HDD (910 minutes, or about 15 hours). The delta between these times was 13.3 hours, and for image sets above 10,000, the QLC SSD performed 8x faster than the HDD.
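The conversions in this comparison follow directly from the 100,000-image row of Table 4. A brief sketch of the arithmetic (the names are illustrative):

```python
SSD_SECONDS = 6_764    # Micron 5210, 100,000 images (Table 4)
HDD_SECONDS = 54_617   # HDD, 100,000 images (Table 4)


def hours(seconds):
    """Convert seconds to hours, rounded to two decimal places."""
    return round(seconds / 3600, 2)


ssd_hours = hours(SSD_SECONDS)
hdd_hours = hours(HDD_SECONDS)
saved_hours = round((HDD_SECONDS - SSD_SECONDS) / 3600, 1)
speedup = round(HDD_SECONDS / SSD_SECONDS, 1)
print(ssd_hours, hdd_hours, saved_hours, speedup)  # 1.88 15.17 13.3 8.1
```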

6. Summary

In our study with the Micron 5210 ION SSD, a read-intensive transformation of an image dataset into a TFRecord file was accelerated by about 8x compared to a similar-sized HDD. For a 100,000-image dataset at 23MB per image, our test HDD took 15.17 hours to resize the images and pack them into a single TFRecord file, while the Micron 5210 ION took only 1.88 hours to do the same task. The HDD needed about 13 additional hours to complete the same work.

Faster completion times can have dramatic effects on overall project value, delivery and expense. When analyzing the potential benefits of the Micron 5210 ION, a cost per GB analysis at the time of acquisition may be short-sighted. One should consider the total cost of ownership, the value of keeping expensive CPU assets busy, and the potential reductions in yearly operating expenses related to power and cooling. How often machine learning workloads are run is also very important, as performance deltas compound over time, particularly given how long many ML workflows take to complete.

Overall, the Micron 5210 ION is a high-capacity, read-intensive SSD that can add significant value to data pre-processing and speed up machine learning workflows at an approachable price.
