Authors:
Juwon Kim, Minsu Jang, Muhammad Danish Tehseen, Joontaek Oh, and YouJip Won, KAIST
Journal/Conference:
USENIX ATC '22
Source:
https://www.usenix.org/conference/atc22/presentation/kim-juwon
Abstract
In this work, we develop the log-structured filesystem that is free from garbage collection. There are two key technical ingredients: \emph{IPLFS}, a log-structured filesystem for infinite partition, and \emph{Interval Mapping}, a space-efficient LBA-to-PBA mapping for infinite filesystem partition. In IPLFS, we separate the filesystem partition size from the physical storage size and set the size of the logical partition large enough so that there is no lack of free segments in the logical partition during SSD’s lifespan. This allows the filesystem to write the updates in append-only fashion without reclaiming the invalid filesystem blocks. We revise the metadata structure of the baseline filesystem, F2FS, so that it can efficiently handle the storage partition with 2^{64} sectors. We develop Interval Mapping to minimize the memory requirement for the LBA-to-PBA translation in FTL. Interval Mapping is a three level mapping tree. It maintains mapping only for actively used filesystem region. With Interval Mapping, the FTL can maintain the mapping for the 2^{64} sector range with almost identical memory requirement with the page mapping whose LBA range is limited by the size of the storage capacity. We implement the IPLFS on Linux kernel 5.11.0 and prototype the Interval Mapping in OpenSSD. By eliminating the filesystem level garbage collection, IPLFS outperforms F2FS by up to 12.8 times (FIO) and 3.73 times (MySQL YCSB A), respectively.
Introduction
The garbage collection exposes the underlying log-structured filesystem under excessive tail latency and lowers the throughput of the filesystem. The garbage collection also generates extra write traffics that increase flash wears. Until now, a large amount of works has been dedicated to mitigate the overhead of garbage collection in the log-structured filesystem. Despite all previous works, the root cause of garbage collection still remains neglected; the filesystem needs to reclaim the invalid filesystem blocks to make the free segments.
Thus, the authors propose eliminate garbage collection from the log-structured filesystem by separating the logical partition from the physical storage and allowing the filesystem to define very large logical partition independent of the capacity of the physical storage. Without garbage collections, log-structured filesystem can by greatly simplified because block allocation bitmap and reverse mapping are not required anymore. For further optimization, Interval mapping, fixed-region mapping and map node compaction are proposed to manage the space for LBA-to-PBA mapping efficiently.
The fundamental design philosophy is the separation of the logical storage partition from the physical storage. To do so, the size of the logical partition is large enough so that there is no lack of free LBAs during SSD’s lifespan. There are two main components: IPLFS and Interval Mapping.
IPLFS
IPLFS consists of three key design ingredients: (1) multi-area partition layout, (2) garbage collection-less metadata design and (3) discard map and discard logging.
Firstly, IPLFS has six logs, each of which accommodates the filesystem blocks of the same type and hotness level so that maintains the size of the actively used regions small. MSB 3 bits of the LBA address are used as the area identifier.
Secondly, in the metadata section, reverse mapping and block allocation bitmap are no longer used because they aim to support garbage collection. The node address table (NAT) is retained as well as in F2FS. The number of entries of NAT corresponds to the maximum number of inodes in the filesystem, which is limited by the capacity of the physical storage.
Lastly, discard bitmap and discard log are developed to keep track of the newly invalidated filesystem blocks and to restore the filesystem from any crashes, respectively. A discard bitmap represents the newly invalidated filesystem blocks since the last checkpoint in per-section basis. IPLFS organizes a set of the discard bitmaps using the hash table and the section number is used as a hash key. In each checkpoint, IPLFS constructs the discard commands with the hash table and removes the discard bitmap from it. Then, IPLFS dispatches the commands to a separate thread.
Meanwhile, to mitigate storage leak, IPLFS saves the commands as logs at dedicated area of the in-memory checkpoint pack prior to issuing them to the storage. When the system crashes, IPLSF recovers the commands in two phases: reconstructs the commands with respect to the logs in the most recent checkpoint pack and identifies the past changes from the checkpoint using fsync-ed files.
Interval Mapping
Interval mapping is developed for space-efficient LBA-to-PBA mapping using three-height of interval tree. The six data partition have their own interval mapping tree. Root node consists of an array of zones and each zone node consists of 1024 map nodes as its child nodes. Map node consists of three components: the range of the mapping segment, region directory, and the array of mappings for individual regions. For efficient management of the interval tree, two core technique are proposed: (1) fixed-region mapping and (2) map node compaction.
Fixed-region mapping supports to minimize the memory requirements and the mapping latency in interval mapping. The region size is set as the size of the smallest hole in the mapping segment to maximize the mapping efficiency and minimize the map node size.
To reduce the map node size, map node compaction is proposed. If the mapping efficiency becomes smaller than a certain threshold, FTL reorganizes the map node by removing holes (invalid flash pages) in the corresponding mapping segment so that only one consecutive array of valid blocks should be in the region.
Evaluation
IPLFS and Interval Mapping was implemented on F2FS (Linux 5.11.0) and OpenSSD (230GB, 8 channels), respectively. All the experiments conducted on a PC with Intel CPU i7-4770K and 8 GB DRAM.
Under FIO workload, the performance of F2FS drops to 1/10 when it starts to perform garbage collection while IPLFS performance remains steady. Under YCSB-A workload with MySQL, the average read latency and the average update latency of IPLFS are 1/2 and 1/3 of those of F2FS, respectively. Moreover, tail latencies of update at 95% and at 99% in IPLFS are 5.2× and 5× lower than those of F2FS.
Address translation overhead was evaluated as well. Interval Mapping yields 88% longer mapping latency than the page mapping. When creating a new mapping entry, Interval Mapping exhibits 3.3× longer latency than the page mapping. However, for end-to-end latency, the read latency and the write latency of Interval Mapping and page mapping are almost identical.
Furthermore, the effectiveness of map node compaction was examined with a 160GB file. Without compaction, the mapping tree size increases as the filesystem ages and reaches over 80 Mbyte. On the other hand, the mapping tree size stays at around 40 MB with compaction. Mapping table size remains almost the same as the mapping table size for 160 GB even though the active interval expands to 305 GB.
Conclusion
Separating the logical filesystem partition size from the physical storage size and making the logical filesystem partition size virtually infinite, IPLFS get freed from garbage collection. In addition, interval-mapping allow IPLFS to manage infinite virtual address space efficiently by only mapping actively used filesystem region. The authors proved its efficiency in terms of throughput, latency and space usage.