Response to Comments on our paper "Optimizing Databases by Learning Hidden Parameters of Solid State Drives"

Learning Hidden Parameters of SSDs

We recently wrote a paper on how to learn the hidden parameters of SSDs. Commercial SSDs are black boxes, and their internal operation remains hidden from users. Reverse engineering the internal operation of SSDs is challenging given the narrow view into these devices through the block interface, and no single rule holds across devices. In our work, we were able to uncover some of the hidden parameters of SSDs from major manufacturers using a systematic set of benchmarks, and we used these parameters to improve the performance of SQLite and MariaDB.

Response to the comments

Recently, Mark Callaghan wrote a blog post commenting on our work. Many of his comments were insightful, and I will try to respond to them to the best of my ability:

1. Measuring desirable write request sizes - In a nutshell, we measure desirable write request sizes for an SSD by issuing write requests of different sizes and measuring the latency of subsequent read requests to the device. The lower the read latency, the better the SSD's performance. Here, we are trying to find which write request sizes are best for the device, irrespective of the application's requirements or performance metrics. This is the first step in understanding the SSD's internal parameters. Mark is right in pointing out that overall database performance goals differ across files (e.g., the undo/redo log of InnoDB is written in sector-sized requests and rarely read from disk, so it makes no sense to optimize for read performance there). Which write request sizes to issue to the SSD is a joint decision between application requirements and SSD parameters, and in some cases it will not make sense to directly use the SSD's desirable write request size. In fact, we mention this at the end of section 3.2.1 in the paper.
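To make the probe concrete, here is a minimal sketch of the kind of measurement loop described above (not the paper's actual benchmark harness; the file path, file size, and candidate write sizes are illustrative assumptions). It writes a test file with a given request size through O_DIRECT, so requests reach the device as issued, then times small random reads back from it:

    /*
     * Minimal sketch of the write-size probe (the file path, file size, and
     * candidate write sizes below are illustrative, not the paper's actual
     * benchmark parameters).
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define FILE_SIZE (256UL * 1024 * 1024)   /* 256KB-aligned 256MB test file */
    #define READ_SIZE 4096                    /* fixed probe read size */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Write the whole test file using write requests of wsize bytes.
     * O_DIRECT bypasses the page cache so each request hits the device. */
    static void write_file(const char *path, size_t wsize) {
        int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0644);
        void *buf;
        posix_memalign(&buf, 4096, wsize);    /* O_DIRECT needs aligned buffers */
        memset(buf, 0xab, wsize);
        for (size_t off = 0; off < FILE_SIZE; off += wsize)
            pwrite(fd, buf, wsize, off);
        fdatasync(fd);
        close(fd);
        free(buf);
    }

    /* Time random reads back; lower mean latency suggests a layout with
     * more internal parallelism, i.e. a more desirable write request size. */
    static double mean_read_latency(const char *path, int nreads) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        void *buf;
        posix_memalign(&buf, 4096, READ_SIZE);
        double start = now_sec();
        for (int i = 0; i < nreads; i++) {
            off_t off = (off_t)(rand() % (FILE_SIZE / READ_SIZE)) * READ_SIZE;
            pread(fd, buf, READ_SIZE, off);
        }
        double elapsed = now_sec() - start;
        close(fd);
        free(buf);
        return elapsed / nreads;
    }

    int main(void) {
        size_t sizes[] = {16384, 32768, 65536, 131072};
        for (int i = 0; i < 4; i++) {
            write_file("/mnt/ssd/probe.dat", sizes[i]);
            printf("write size %6zu B -> mean read latency %.1f us\n",
                   sizes[i], mean_read_latency("/mnt/ssd/probe.dat", 1000) * 1e6);
        }
        return 0;
    }

Running this for each candidate size and comparing the reported latencies is, in essence, how a desirable write request size reveals itself.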

2. Difference between desirable write request size and stripe size - We define the stripe size as the "unit of placement decision inside the SSD". Put simply, it is how the SSD breaks down request data internally. To bring out the difference between the stripe size and desirable write request sizes, I will describe my guess about how the Samsung 960 Evo SSD works internally. When a write request is issued to the SSD, it is internally divided into chunks of 64KB, which are distributed over the channels in a round-robin manner. Thus, by running experiment 1 in the paper on the Samsung SSD, we find that all files created with write request sizes that are multiples of 64KB have a similar internal layout and similar read latency. However, when a write request of 32KB (< 64KB) is issued, there are two competing factors. The 32KB chunks are again distributed over the channels in round-robin fashion, but since there are now twice as many chunks for the same amount of data, channel-level parallelism increases, providing higher internal bandwidth. At the same time, the overhead of assembling these chunks to return a single response to a read request also increases; in the case of the Samsung SSD, the former wins out. It's difficult to be sure about the internal operation of an SSD, and one can only speculate about the algorithms used internally. But from an application's perspective, all we need to know is that issuing write requests of size >= 32KB is desirable for this device.
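To illustrate, here is a toy model of the striping guess above, assuming round-robin placement over a fixed channel count (the 8 channels and the stripe sizes are illustrative assumptions, not measured 960 Evo parameters):

    /*
     * Toy model of round-robin striping: count how many stripes of a write
     * land on each channel. The 8 channels and the stripe sizes are
     * illustrative assumptions, not measured 960 Evo parameters.
     */
    #include <stdio.h>

    #define NUM_CHANNELS 8

    static void map_write(size_t write_size, size_t stripe_size) {
        size_t per_channel[NUM_CHANNELS] = {0};
        size_t nstripes = write_size / stripe_size;
        for (size_t i = 0; i < nstripes; i++)
            per_channel[i % NUM_CHANNELS]++;      /* round-robin placement */
        size_t busy = 0;
        for (int c = 0; c < NUM_CHANNELS; c++)
            if (per_channel[c] > 0)
                busy++;
        printf("write %3zu KB, stripe %2zu KB: %zu stripes over %zu channels\n",
               write_size / 1024, stripe_size / 1024, nstripes, busy);
    }

    int main(void) {
        map_write(256 * 1024, 64 * 1024);  /* 4 stripes -> 4 channels engaged */
        map_write(256 * 1024, 32 * 1024);  /* 8 stripes -> all 8 channels engaged */
        return 0;
    }

With 64KB stripes, a 256KB write keeps 4 channels busy; halving the stripe size keeps all 8 busy. That is the extra parallelism described above, bought at the cost of doubling the number of chunks that must be reassembled to serve a read.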

3. Indexes issuing larger IO to make use of desirable write request sizes - Mark rightly pointed out that for a B-Tree index, the read/write request sizes are usually too small (a few KB) for desirable write request sizes to be useful. We appreciate his suggestions about indexing structures that issue larger I/O, such as LSM trees, copy-on-write B-Trees, and heap-organized tables in Postgres, and we will be looking into them as follow-up work. Studying these data structures with larger I/O sizes in a cloud setting will be an interesting exercise as well, to see how the overhead of transferring data over the network affects the performance gains.

4. Impact of the kernel, concurrent write requests, and garbage collection - Multiple factors come into play when issuing IO requests to a device. The OS block layer and IO scheduler can alter the requests issued by the application -- smaller contiguous write requests can be merged into bigger ones, and larger IO requests can be broken down into smaller ones (as described in the document by Jens Axboe). As a less-than-ideal workaround for the former case, we used the fdatasync system call in one of our experiments that exploited hot locations (a hidden parameter) of the SSD; a sketch of this workaround follows below. While using hot locations boosted the performance of select operations considerably, the fdatasync system call reduced the performance of insert operations in the case of SQLite. The flip side, larger write requests being divided into smaller ones, is a possibility that remains to be studied. Most of the SSDs we have studied so far have desirable write request sizes of ~64KB, so as long as the device limitations and kernel configuration do not split a request of such a small size, this might not be a pressing concern. Another aspect to consider is the impact of concurrently issued requests, as they can have subtle effects inside the device. A lot depends on the placement scheme used by the SSD, as interfering write requests can change the resulting internal layout of data (and the resulting channel-level parallelism) for a file stored on the SSD. This is related to garbage collection (GC) as well: GC in SSDs prefers moving data within the same channel over moving it across channels (as shifting data across channels is more expensive). More generally, a common concern here is the database's ability to control the IO requests that reach the device.
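Here is a minimal sketch of the fdatasync workaround mentioned above (the file name and 4KB request size are made up for illustration). Syncing after every write flushes it to the device before the block layer can merge it with a contiguous neighbor; those extra syncs are also why inserts slowed down:

    /*
     * Sketch of the fdatasync merge-avoidance workaround. The file name and
     * 4KB request size are illustrative, not the paper's exact setup.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Write len bytes at offset off and flush immediately, so the request
     * reaches the SSD at exactly the size the application issued. */
    static ssize_t write_unmerged(int fd, const void *buf, size_t len, off_t off) {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n >= 0)
            fdatasync(fd);   /* blocks merging, at the cost of sync latency */
        return n;
    }

    int main(void) {
        int fd = open("hotloc.dat", O_CREAT | O_WRONLY, 0644);
        char page[4096];
        memset(page, 0, sizeof page);
        /* Two contiguous 4KB writes that the block layer might otherwise
         * merge into one 8KB request stay separate: each is synced first. */
        write_unmerged(fd, page, sizeof page, 0);
        write_unmerged(fd, page, sizeof page, 4096);
        close(fd);
        return 0;
    }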

Overall, the possible directions described by Mark -- index structures with larger IO sizes, performance in the cloud, the TRIM command in different SSDs, the desirable amount of parallelism for an application -- are all great candidates for future work. @Mark: thanks for your comments, and for taking a close look at our work!
