USENIX ATC '20 Paper #452 Reviews and Comments
===========================================================================
Paper #452 PinK: High-speed In-storage Key-value Store with Bounded Tails


Review #452A
===========================================================================

Overall merit
-------------
4. Accept - Suitable for ATC, and I will advocate for it

Paper summary
-------------
The paper proposes PinK, an LSM-tree-based in-storage key-value engine, for use when offloading a key-value storage structure into flash memory storage. The work starts by noting that hash schemes, although quite easy to implement, have high tail latencies and variance. Obviously, one wants a storage structure to have predictable, fast performance. PinK achieves this by adopting an LSM structure, but innovating in many steps of the implementation. A main idea is to keep just the top levels of the tree in memory and to keep the tree itself in a sorted form, permitting better locality of access. PinK also has some hardware accelerator features that greatly enhance performance.

Strengths
---------
The paper is exceptionally well written and structured, and gives a very clean step-by-step justification for each major design decision, very often using experiments to evaluate choices and to identify the winning option. Performance is excellent, and the stated goal of reducing tail latency and variance is definitely achieved.

Weaknesses
----------
The paper makes use of some unusual hardware capabilities (the specialized comparators described in Section 4.4). I personally am a believer in hardware acceleration, and I found this step in the project exciting, innovative, and effective. Just the same, we end up with a specialized "proprietary" element. I suspect that some ATC participants may feel that the work is less broadly valuable because of this one aspect. In contrast, if this unit were something highly standard (like a bit of code that can run on an FPGA or GPU accelerator, rather than an ASIC that literally lives inside the flash controller), I think the concern would not arise. The ASPLOS or ISCA communities, on the other hand, would love that aspect...

Detailed feedback
-----------------
I don't really have a lot of feedback to offer -- I like the work, will support it in the PC discussions, and I think you've really done a very impressive job here.


Review #452B
===========================================================================

Overall merit
-------------
3. Weak accept - Suitable for ATC with some shepherding, and I may advocate for it

Paper summary
-------------
The paper proposes new optimizations to the design and implementation of LSM trees for SSDs with spare resources (CPU, DRAM, and FPGAs) that expose a key-value interface. In particular, the authors propose eliminating the use of Bloom filters to reduce tail latency and instead propose pinning multiple top levels to DRAM by representing keys and values more compactly. Finally, they propose optimizations to compaction, garbage collection, and deep search by leveraging and adding additional metadata.

Strengths
---------
The tradeoffs make a lot of sense. Optimizing for the popular ones should bring down the tail latency, and redesigning the metadata makes better use of DRAM. The offloading of compaction to the lower level and the garbage collection that leverages the higher-level data layout are also nice ideas.
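To make the pinning idea described above concrete, here is a minimal, purely illustrative sketch of a point lookup over DRAM-pinned levels that hold compact key-pointer entries. All type, field, and function names are hypothetical and are not taken from the paper; this is a sketch of the general technique, not PinK's actual code.

```c
/* Illustrative sketch only (hypothetical names, not the paper's code):
 * top levels of the tree are pinned in DRAM as sorted arrays of compact
 * key-pointer entries; a lookup probes them top-down and needs at most
 * one flash read to fetch the value on a hit. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct meta_entry {              /* compact key-pointer pair kept in DRAM */
    char     key[16];            /* fixed-size key (or key prefix) */
    uint32_t data_segment;       /* flash segment holding the full KV pair */
    uint32_t offset;             /* offset of the pair inside that segment */
};

struct pinned_level {
    const struct meta_entry *entries;  /* sorted by key */
    size_t count;
};

/* Search the pinned levels from the newest level down; the first match wins. */
static const struct meta_entry *
lookup_pinned(const struct pinned_level *levels, size_t nlevels, const char *key)
{
    for (size_t l = 0; l < nlevels; l++) {
        size_t lo = 0, hi = levels[l].count;
        while (lo < hi) {                         /* binary search within one level */
            size_t mid = lo + (hi - lo) / 2;
            int cmp = strncmp(key, levels[l].entries[mid].key,
                              sizeof levels[l].entries[mid].key);
            if (cmp == 0)
                return &levels[l].entries[mid];   /* one flash read then fetches the value */
            if (cmp < 0) hi = mid; else lo = mid + 1;
        }
    }
    return NULL;   /* not in the pinned levels; fall through to on-flash levels */
}
```

Under this scheme, a hit in a pinned level costs only in-DRAM searches plus a single flash read, which is the bounded-tail property the reviews attribute to pinning.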
Weaknesses
----------
Several recent papers have proposed changes to LSM trees to improve both read and write performance, including reduction of write amplification. A comparison to such systems is missing. Also, it was not clear to me why a larger Bloom filter with more hash functions won't push the tail down farther. It is not considered as an option at all.

Detailed feedback
-----------------
I think the techniques make a lot of sense, but there are some motivation elements that are missing.

You can get arbitrarily higher accuracy from Bloom filters by using more bits and more hash functions. I was surprised to see that you evaluated them using only one configuration and fixed on the 1.3% of requests requiring multiple SSD accesses. Would a different Bloom filter configuration achieve a level of tail latency acceptable for the application? Perhaps varying the Bloom filter configuration to show that it is not tenable in general would have been good.

The motivation for using the optimized search is also not clear. You should show asymptotically that for deeper searches your scheme would be better than using Bloom filters just for the levels on the SSD. Only then is your optimization really needed.

One other thing is the use of some bits of the key for detecting presence. This might work well if keys are drawn from a random space. Many applications have a lot of bias in the key space they work with, and therefore it might lead to false positives, thus reducing performance. A smaller hash of the keys might be what you are looking for.

Another missing motivation element was a comparison with other recent LSM optimizations (SOSP 2019) that show how to reduce compaction overheads. With reduced compaction overheads from such techniques, do we really need to worry about the compute overhead of Bloom filters during compactions? Without a deeper qualitative and quantitative comparison to these techniques, it will be harder to understand how much your optimizations are actually necessary.

Finally, I think that rather than reducing unpredictability, your focus is on reducing standard deviation. Unpredictability in SSDs happens mainly because of interference from background operations in the FTL. Many recent optimizations have reduced that. What you are trying to do is to ensure that most operations take the same latency (one DRAM access for checking and one SSD access for data).


Review #452C
===========================================================================

Overall merit
-------------
5. Strong accept - Definitely suitable for ATC, and I will champion it

Paper summary
-------------
KV-SSDs embed a key-value store in an SSD. The key challenge of this design is the resource-limited environment of an embedded system. In the past this has motivated implementing the key-value store using a hash table instead of the traditionally more resource-intensive implementation of an LSM tree. Furthermore, the traditional optimization for LSM trees is to use Bloom filters to quickly find a key/value pair, which, due to their probabilistic properties, can lead to long tail latencies. This paper proposes a new way to implement an LSM tree using data structures that allow for a non-probabilistic, complete index into the LSM tree, small enough to be pinned into main memory and carefully optimized for the performance profiles of flash devices. The data structures are also designed for hardware-supported compaction, a usually CPU- and I/O-intensive process.
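As a purely illustrative aside on what hardware-supported compaction involves: the core of compaction is merging sorted runs of key-pointer entries, which is the comparison loop a hardware comparator could stream through. The sketch below shows that merge step in plain C with hypothetical types; it is not the paper's implementation.

```c
/* Illustrative sketch of the compaction merge step that a hardware comparator
 * could accelerate: two sorted runs of key-pointer entries are merged into one,
 * keeping the newer entry when keys collide. Names are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct kp_entry {
    char     key[16];       /* fixed-size key (or key prefix) */
    uint64_t ptr;           /* location of the value in a data segment */
};

/* 'upper' is the newer run (upper level of the tree), 'lower' the older one. */
static size_t merge_runs(const struct kp_entry *upper, size_t nu,
                         const struct kp_entry *lower, size_t nl,
                         struct kp_entry *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < nu && j < nl) {
        int cmp = strncmp(upper[i].key, lower[j].key, sizeof upper[i].key);
        if (cmp < 0)       out[k++] = upper[i++];
        else if (cmp > 0)  out[k++] = lower[j++];
        else {             /* same key: newer (upper) entry wins, older is dropped */
            out[k++] = upper[i++];
            j++;
        }
    }
    while (i < nu) out[k++] = upper[i++];
    while (j < nl) out[k++] = lower[j++];
    return k;              /* number of merged entries written to 'out' */
}
```

Offloading this comparison loop to dedicated logic is what relieves the embedded CPU during compaction, per the summaries above.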
The result, compared to traditional hash and LSM-tree implementations, is comparable throughput, guaranteed worst-case read latencies, improved average read latency, and greatly reduced CPU and I/O load. The approach does assume a small amount of battery-backed DRAM (64MB in the evaluation) for the pinned index.

More detail on the data structures: L0 is converted into a skip list that serves as a write buffer to enable sufficient parallelism when buffered k/v pairs are flushed. All other levels are stored in flash, where meta segments store keys and pointers to key/value pairs in data segments. To quickly find k/v items, the other levels are indexed by pinned-in-memory level lists, one per level, that index by the start key of every meta segment. Overall, the design is similar to WiscKey but optimized for embedded environments and hardware accelerators.

Strengths
---------
- This is a very nice example of how to take the state-of-the-art design of a key/value store that evolved in the resource-rich environment of hosts and adapt it to the resource-poor embedded environment, carefully leveraging optimization opportunities created by specific hardware like FPGAs and flash devices.
- The paper also uses one example to first explain LSM-tree basics and then all the optimizations they performed.
- I like the integration of hardware-accelerated compaction and garbage collection.

Weaknesses
----------
- The approach relies on battery-backed DRAM (BB-DRAM) for all pinned data structures (skip list plus level lists) so those never have to be flushed. The authors acknowledge that the data structure size grows linearly with the number of k/v pairs stored, roughly BB-DRAM-size = SSD-size / 8000. The BB-DRAM size is also sensitive to the average key size, the meta segment size, and the average value size. Let's hope that SSD sizes are not growing exponentially -- experience suggests otherwise.
- BB-DRAM also creates an additional mode of data loss: if the battery fails, all data is lost at the next power failure. The authors do not describe a scheme for recovery from the data that would still be available in flash.


Review #452D
===========================================================================

Overall merit
-------------
4. Accept - Suitable for ATC, and I will advocate for it

Paper summary
-------------
PinK is an LSM-tree implementation targeted at KV-SSDs. To reduce the computation overhead of index generation for each SSD block, PinK instead uses key-pointer pairs that are stored separately from the data itself. To reduce read latency, the top-level blocks are stored in DRAM, which is backed by capacitors to avoid having to flush them to flash. Other optimizations include hardware-assisted compaction, and range pointers that reduce the binary-search cost at read time.

Strengths
---------
+ Several optimizations to the LSM-tree design to adapt it to the context of KV-SSDs (even though each technique by itself is not novel)
+ Evaluated using an FPGA-based prototype and realistic baselines

Weaknesses
----------
- Cost or feasibility of the design's hardware support is not discussed
- Missing detailed comparison with the design of previous LSM-tree-based drive-internal implementations such as iLSM-SSD and Kinetic drives
- Evaluation limited to 64GB capacity (and 44GB of data), when modern SSDs are up to two orders of magnitude larger in capacity

Detailed feedback
-----------------
Thank you for your submission to ATC '20!
Your submission is really well written and taught me a number of interesting facts about KV-SSDs. PinK's design is based on several optimizations that collectively improve both read/write performance and tail latency. The novelty, however, is limited to the combination of those techniques, as each one of them has been explored by prior work. What makes me bring this up is that your introduction led me to believe that KV-SSDs primarily use hashing-based schemes, but then Section 2.4 implies there are several existing drive-internal implementations based on LSM trees (LightStore, iLSM-SSD, Kinetic HDDs) which you avoid discussing in more detail than "We expect iLSM-SSD to exhibit similar behavior to LightStore". For your camera-ready or a future version of the paper, please correct this discrepancy to properly manage reader expectations.

Using YCSB as your baseline workload makes sense, given the timing of your submission. But since then, a new paper has been published at FAST '20 from Facebook ("Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook" by Cao et al.) which finds that YCSB workloads differ significantly from their three key KV workloads. They have released an extension to RocksDB's db_bench benchmark to simulate their workloads. You might want to look into trying those workloads for a future version of this work.

One of my major concerns with this work is that your design relies on hardware support, such as capacitors to avoid flushing blocks pinned in DRAM to storage and hardware acceleration to speed up merge-sorting blocks during compaction, but there is no discussion of how this would affect the cost/GB of the final product.

It is unclear why your evaluation is limited to 44GB of data, and why you have decided to use only 64GB of capacity instead of the entire 256GB. Given that most SSDs today are up to two orders of magnitude larger than the capacity you evaluate with, I think it is important to test the scalability of your proposed design at larger capacities. Furthermore, please fix Section 3.2, which currently states that you use 256GB of NAND flash and then that you reduce your SSD's capacity to 64GB without stating a valid reason.

I would like to see a discussion of the way your DRAM is allocated between the skip list, the compaction buffers, and the level lists.

In Figure 3c, why is the throughput w/o Bloom filters higher than w/ Bloom filters? Is it the BF construction overhead?

In Figure 1, your KV-SSD with 64GB of data was able to achieve 100 KIOPS. I assume you used a different prototype (an actual KV-SSD) for that measurement. Is the FPGA-based prototype you used in your evaluation capped much lower for that workload? Please address this discrepancy.

In Figure 10b, are all your compaction I/Os the same size? If not, what is the amount of data moved for both the LSM-tree and PinK?

Are you planning to release the code for your Hash and PinK implementations? If not, please explain why not.

Section 3.1: "run in the tree has each own filter" -> its
Section 4.2: "Bloom filters is typically" -> are


Review #452E
===========================================================================

Overall merit
-------------
4. Accept - Suitable for ATC, and I will advocate for it

Paper summary
-------------
This paper carefully tunes LSM-trees for use in KV-SSDs, and argues they outperform hash-based data structures (due to memory constraints), while providing stronger semantics.
The paper proposes a number of optimizations, and explains how they tuned the tree for the available hardware.

Strengths
---------
They achieve good constant-factor performance on the hardware at hand, and show that LSM-trees cope with memory-constrained environments well. The experiments and implementation seem thorough.

Weaknesses
----------
The paper describes many old optimizations as though they are new, and misses some important related work. It would be nice if the experiments made the tradeoffs between the different approaches more clear.

Detailed feedback
-----------------
Level pinning is closely related to what conventional LSM-trees do: they manage index and metadata pages using an LRU cache. By setting the index page fanout appropriately, this allows them to pin the upper levels of all the indexes in memory. In the absence of a page cache (and with hardware configurations that are known ahead of time), level pinning is probably simpler to implement, and more performant (especially on embedded hardware), but, even when tuned optimally, it doesn’t seem likely to do much better than a simple LRU strategy in terms of number of I/Os.

The idea of trading between caching levels in memory and allocating space to Bloom filters isn’t new. I doubt that either of the configurations tested is optimal. (The experiments only consider scenarios where almost all memory goes to Bloom filters or all of it goes to pinning levels.)

FPGAs have been used to implement Bloom filters in the past. You should cite one of the papers on that topic; https://scholar.google.com/scholar?q=fpga+bloom+filter produces many examples. Such techniques seem complementary to yours.

I haven’t seen hardware-accelerated LSM merges before. That’s a nice trick. Did it come from LightStore, or is it a contribution of this paper?

You should include throughput-latency curves for some of the experiments (see the YCSB paper, Figure 3, for an example). Systems aren’t typically run at 100% throughput, so measuring tail latency there isn’t very useful. The curve lets readers look at the maximum throughput achievable given a latency SLA. (Since you’re focused on tail latency, having a Y axis of 99th (or even higher) percentile latency would be reasonable.)

GCOPT is kind of strange. I’ve never heard of an LSM-tree that attempts to update tree components in place (doing so defeats the purpose of using an LSM-tree), and you show that doing so performs poorly. So, PinK-GCOPT is a strange strawman, and PinK+GCOPT is just “PinK doing what most other systems do”.

Figure 3 (especially b): the caption should make it clear that this is an LSM-tree vs. a strawman, and not LSM vs. PinK.

Figure 3a: maybe put the Y axis on a log scale.

Figure 3(c): You mention a few reasons why YCSB-Load is slower with Bloom filters, but I didn’t see a discussion of write amplification. The less memory you devote to the lower levels of the tree, the greater the write amplification. Figure 3d kind of covers this, but it’s worth noting, especially if you add that reference to hardware-accelerated Bloom filter creation.

Your second technique, “reduce search ranges of the level lists”, is called fractional cascading, and is already implemented by some LSM-trees (such as RocksDB); see the illustrative sketch below.

Section 4.5: You say you separate the “meta” and the “data” in order to get hot/cold separation for your GC, but then in the GC for meta segments, it sounds like meta and data can coexist on the same erase block. I’m confused.
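Returning to the fractional-cascading remark above, here is a minimal, purely illustrative sketch of the idea in C. All structure and function names are hypothetical; this is neither the paper's nor RocksDB's code. Each level-list entry carries a precomputed search window into the next level's list, so a point lookup pays one full binary search at the top level and only narrow searches below.

```c
/* Illustrative sketch of fractional cascading over per-level lists
 * (hypothetical names). Each entry stores the index window it maps to
 * in the next level's list, narrowing the next binary search. */
#include <stddef.h>
#include <string.h>

struct level_entry {
    char   start_key[16];     /* smallest key of the meta segment it points to */
    size_t next_lo, next_hi;  /* precomputed search window into the next level */
};

struct level_list {
    const struct level_entry *e;  /* sorted by start_key */
    size_t n;
};

/* Binary search within [lo, hi): returns the last entry whose start_key <= key,
 * assuming the entry at 'lo' already satisfies that condition. */
static size_t find_segment(const struct level_list *ll, size_t lo, size_t hi,
                           const char *key)
{
    while (lo + 1 < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strncmp(ll->e[mid].start_key, key, sizeof ll->e[mid].start_key) <= 0)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Walk the levels top-down; the range pointers shrink each search window,
 * so total work is roughly one full binary search plus a few small ones. */
static void cascade_lookup(const struct level_list *lists, size_t h, const char *key)
{
    size_t lo = 0, hi = lists[0].n;
    for (size_t lvl = 0; lvl < h; lvl++) {
        if (lists[lvl].n == 0)
            return;                          /* nothing stored at this level */
        size_t idx = find_segment(&lists[lvl], lo, hi, key);
        /* ...read the meta segment that lists[lvl].e[idx] points to and probe it... */
        if (lvl + 1 < h) {
            lo = lists[lvl].e[idx].next_lo;  /* narrowed window for the next level */
            hi = lists[lvl].e[idx].next_hi;
        }
    }
}
```

Without the range pointers, each of the h levels pays a full binary search over its level list; with them, only the first search is full-width, which is in the spirit of the complexity improvement the authors describe in rebuttal item B2 below.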
It would be nice if you provided a rule of thumb saying how much memory one needs for a hash table to outperform PinK.


Rebuttal Response by Sungjin Lee (Author) (753 words)
---------------------------------------------------------------------------
We appreciate the comments from all reviewers. We answer some specific questions below; we will incorporate these and other suggestions if we have a chance.

#### [#452A]

A1. Hardware-Accelerator: We implemented PinK on a widely available Xilinx board (ZCU102). All the source code (HW/SW) will be released as a reference design.

#### [#452B]

B1. Bloom Filter: The Bloom filter in the paper was fine-tuned for a minimal lookup cost given a DRAM budget. Showing the tradeoff between tail latency and rebuilding costs using various Bloom filter parameters is a great suggestion.

B2. Optimized Search: PinK requires a computation time of $O(h^2 \cdot \log T)$, or $O(h^2)$ when $T$ is fixed. While PinK issues only one flash read, its computation time is asymptotically larger than that of the Bloom filter approach, which has $O(h)$ computation time but issues multiple flash reads depending on false hits. The optimized search improves PinK to the same time complexity, $O(h \cdot \log T) \approx O(h)$.

B3. Prefix Comparison: Good point. If there is a bias in the key space, the prefix approach doesn't work well. '+RANGE' in Figure 11 shows the worst-case read latency (it still exhibits shorter tails than the LSM-tree). Some techniques can mitigate this limitation.

B4. KVell: Since the system environment PinK assumes is somewhat different from that of other KVSs, a one-to-one comparison of PinK with other techniques is difficult. For example, KVell provides excellent performance but assumes that all the indices are loaded entirely in the host DRAM. This is not a valid assumption for resource-constrained systems. To the best of our knowledge, there is no KVS that *guarantees* the worst-case read latency when this assumption is violated. This may be the biggest distinguishing feature of PinK.

#### [#452C]

C1. BB-DRAM: We agree that there will be issues with the scalability of BB-DRAM. Perhaps new non-volatile memory like Optane may help remedy the problem.

C2. Recovery: The most straightforward but slow way is to scan the flash when the battery fails. PinK may regularly flush out dirty meta segments and level lists, although this causes extra I/Os. A hash-based KVS has the same issue.

#### [#452D]

D1. HW-Cost: Enterprise-class SSDs already have capacitors for data protection; PinK uses them without any additional cost. PinK's hardware accelerator compares streams of sorted strings and requires a small amount of logic (~6500 LUTs, 64KB buffer). Commercial SSDs use custom accelerators, and the cost of adding the PinK accelerator to the controller ASIC would be insignificant.

D2. Scalability: Owing to the long runtime (>1 hour for each configuration), we reduced the SSD size to 64 GB. According to a model proposed by Monkey (SIGMOD '17), increasing the SSD capacity doesn't affect read latency, but WAF increases at a root scale (SSD-capacity$^{1/h}$, where $h$ is the tree height). The WAFs of 64 GB and 1 TB SSDs are 1.74 and 2.49, respectively. Thus, we believe that scalability is not a serious issue in PinK. All LSM-tree-based KVSs have the same property.

D3. Bloom Filters: Yes, due to the overheads of reconstructing Bloom filters.

D4. IOPS: Yes.
A real KV-SSD (in Figure 1) doesn't allow us to modify its firmware, so we had to implement the hash and LSM-tree algorithms in the slower SSD prototype. The performance of our prototype is about half that of the real KV-SSD.

D5. Compaction I/Os: PinK involves less I/O for compaction, because many compaction I/Os are absorbed by the pinned levels in DRAM. In 'Load', the LSM-tree moved 138GB for compaction out of a total of 183GB of flash reads/writes, whereas PinK moved 46GB out of 91GB.

D6. Releasing Code: See A1.

#### [#452E]

E1. LRU vs. Bloom filter vs. Pinning: LRU caching provides good read latency under highly localized workloads; Bloom filters offer good average latency, independently of workloads; pinning guarantees the worst-case latency. There might exist a point where the three techniques can be combined optimally. This study focuses on pinning, but this point is definitely worth exploring.

E2. HW-accelerated Merge-Sort: LightStore doesn't use HW-accelerated merge-sort; it is a new contribution of this study.

E3. GCOPT: Sorry for the confusion. "PinK" is based on WiscKey's GC approach. "PinK+GCOPT" represents the technique introduced in Section 4.5, which rewrites KV pairs (victims for GC) to L0 in DRAM. To show its impact, we compare PinK+GCOPT with PinK.

E4. WAF in Figure 3(c): 'w/o BF' didn't make use of the available DRAM for caching popular entries. Therefore, the WAFs of the LSM-tree w/o BF and w/ BF were the same. Thus, the degraded performance of w/ BF shows how much the BF reconstruction affects throughput. This is our mistake -- we will clarify the description.


Comment @A1 by Reviewer B
---------------------------------------------------------------------------
Congratulations on your paper being accepted. The reviewers unanimously agreed that this is a novel, timely, interesting, and well-written paper. Please address some of the minor concerns around motivation raised by some reviewers.