The Software Model

Drivers (and other software) written for traditional block storage devices need several modifications to work with zoned storage devices. The most obvious is that the host software must obey the new constraint of only writing sequentially within a zone, but that's not the end of the story. Zoned storage also makes the host software responsible for more of the management of data placement. Handling that starts with keeping track of each zone's state, which is more complex than it might sound at first. ZNS adopts the same set of possible zone states used for host-managed SMR hard drives, defined by the ZBC and ZAC extensions to the SCSI and ATA command sets respectively.

Each zone on a Zoned Namespace SSD is in one of seven possible states. A few of these states have an obvious purpose: empty and full zones are pretty much self-explanatory.
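
For readers who prefer code to prose, here is a minimal sketch of those seven states as host software might model them. The numeric values mirror the encoding used by Linux's enum blk_zone_cond in linux/blkzoned.h; the identifier names themselves are just shorthand for this sketch.

```c
/* Illustrative sketch of the seven ZNS zone states. The numeric values
 * mirror the encoding used by Linux's enum blk_zone_cond in
 * <linux/blkzoned.h>; the names are shorthand for this sketch. */
enum zone_state {
    ZONE_EMPTY     = 0x1,  /* nothing written; write pointer at the zone start */
    ZONE_IMP_OPEN  = 0x2,  /* implicitly opened by a write command */
    ZONE_EXP_OPEN  = 0x3,  /* explicitly opened by a zone management command */
    ZONE_CLOSED    = 0x4,  /* still active, but not currently open for writes */
    ZONE_READ_ONLY = 0xD,  /* error state: readable, but no further writes */
    ZONE_FULL      = 0xE,  /* write pointer reached zone capacity, or the zone was finished */
    ZONE_OFFLINE   = 0xF,  /* error state: data no longer accessible */
};
```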

(A zone may be put into the full state without actually storing as much data as its capacity allows. In those scenarios, putting a zone into the full state is like finalizing an optical disc after burning: nothing more can be written to the zone until it is reset (erased).)

The read-only and offline states are error states used when a drive's flash is failing. While ZNS SSDs reduce write amplification, they still have to perform wear leveling at the hardware level. The read-only and offline states are only expected to come into play when the drive as a whole is at the end of its life. Consequently, a lot of software targeting zoned storage won't do anything interesting with these states and will simply treat the entire device as dead once a zone fails into one of these states.

That still leaves three states: implicitly opened, explicitly opened, and closed.

A zone that is in any one of these three states is considered active. Drives will tend to have limits on the number of zones that can be opened (explicitly or implicitly) or active at any given time. These limitations arise because active or open zones require a bit of extra tracking information beyond just knowing what state the zone is in. For every active zone, the drive needs to keep track of the write pointer, which indicates how full the zone is and where the next write to the zone will go. A write pointer isn't needed for full or empty zones because full zones cannot accept more writes, and empty zones will be written to starting at the beginning of the zone.
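
As a rough sketch of that bookkeeping (the structure and field names below are illustrative, not anything defined by the ZNS specification), host software building on the zone_state enum above might track each zone like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side bookkeeping for a single zone; builds on the
 * zone_state enum sketched earlier. */
struct zone_info {
    uint64_t start_lba;   /* first LBA of the zone */
    uint64_t capacity;    /* writable LBAs in the zone (may be less than the zone size) */
    uint64_t write_ptr;   /* next LBA a regular write to this zone must target */
    enum zone_state state;
};

/* Opened (implicitly or explicitly) and closed zones are "active" and are
 * the ones that count against the drive's active/open zone limits. */
static bool zone_is_active(const struct zone_info *z)
{
    return z->state == ZONE_IMP_OPEN ||
           z->state == ZONE_EXP_OPEN ||
           z->state == ZONE_CLOSED;
}
```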

A zone must be opened in order to accept new writes. Zones can be implicitly opened simply by issuing a write command, or they can be explicitly opened with a zone management open command, which doesn't write any data.

The distinction between implicitly and explicitly opened zones is that the SSD controller is free to automatically close a zone that was opened implicitly through a write command. An explicitly opened zone, one the host opened with an explicit open command, will only be put into the closed state when the host software commands it.

If a ZNS SSD is operating at its limit for the number of zones that can be open and they're all explicitly opened, then any attempt to open a new zone will fail. However, if some of the zones are only implicitly opened, then trying to open a new zone will cause the SSD to close one of those implicitly open zones.
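
That behavior can be summarized in a few lines of code. The following is a deliberately simplified, hypothetical model of the controller-side decision (real firmware is obviously more involved), continuing the zone_info sketch from earlier:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of how a drive at its open-zone limit might handle a
 * request to open another zone: implicitly opened zones can be closed
 * automatically to make room, explicitly opened zones cannot. */
static int open_zone(struct zone_info *zones, size_t nr_zones, size_t max_open,
                     struct zone_info *target, bool explicit)
{
    size_t open_count = 0;
    struct zone_info *victim = NULL;

    for (size_t i = 0; i < nr_zones; i++) {
        if (zones[i].state == ZONE_IMP_OPEN || zones[i].state == ZONE_EXP_OPEN) {
            open_count++;
            if (zones[i].state == ZONE_IMP_OPEN)
                victim = &zones[i];        /* candidate for automatic closing */
        }
    }

    if (open_count >= max_open) {
        if (victim == NULL)
            return -1;                     /* every open zone is explicit: the open fails */
        victim->state = ZONE_CLOSED;       /* quietly close an implicitly opened zone */
    }

    target->state = explicit ? ZONE_EXP_OPEN : ZONE_IMP_OPEN;
    return 0;
}
```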

The distinction between open and closed zones allows drives to keep a practical limit on the internal resources (e.g. buffers) needed to handle new writes to zones. To some extent this is just a holdover from SMR hard drives, but there is also a relevant limitation in how flash memory works. These days, NAND flash memory typically has page sizes of about 16kB, but ZNS SSDs still support writes of individual LBAs, which will typically be 4kB (or 512 bytes). That means writing to a zone can leave flash memory cells in a partially programmed state. Even when doing only page-sized and properly aligned writes, cells may be left partially programmed until further writes arrive, due to how SSDs commonly map pages onto physical memory cells.

Flash memory cells that are in a partially programmed state are particularly at risk of suffering from a read disturb error, where attempts to read from that cell or an adjacent cell may change the voltage of the partially programmed cell. Open Channel SSDs deal with this by simply disallowing reads from such pages, but the zoned storage model tries to avoid imposing extra restrictions on read commands. ZNS SSDs will typically cache recently-written data so that a read command can be handled without touching partially programmed NAND pages. The available memory for such caching is what leads to a limit on the number of open zones.
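
As a purely illustrative sketch of that idea, continuing the zone_info structure from earlier (no real controller firmware is this simple, and the buffer depth is a made-up parameter), the read path might decide between the NAND array and the write cache like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Purely illustrative: a read that targets data sitting just below the
 * write pointer may still live in partially programmed NAND pages, so it
 * is served from the controller's write buffer instead. The buffer depth
 * (buffered_lbas) is a made-up parameter. */
static bool serve_read_from_write_buffer(const struct zone_info *z,
                                         uint64_t lba, uint64_t buffered_lbas)
{
    return lba < z->write_ptr && (z->write_ptr - lba) <= buffered_lbas;
}
```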

If an open zone with some partially programmed memory cells is to be closed, the drive has two choices. It can finish programming those cells with filler data, keep track of the resulting hole in the zone, and hope the host doesn't try to use the full zone capacity later. Alternatively, it can keep buffering recently written data even for closed zones. Depending on how many active zones a drive wants to support, the latter can still allow a ZNS SSD to get by with much less DRAM than a conventional SSD, so that approach is the one more likely to be used in practice. An SSD that supports both zoned and block IO namespaces will probably be able to keep all of its zones active or open simultaneously.


In principle, a ZNS SSD could expose each individual flash erase block as a separate zone of several megabytes, with the exact size depending on the underlying flash memory. This would mean writes to a single zone are limited to the write speed of a single NAND flash die. For recent TLC NAND flash, single-die write speeds top out at about 82 MB/s (Samsung's 6th-gen V-NAND), and for QLC the single-die write speed can be below 10 MB/s. In practice, drives will tend to support zone sizes that aggregate many erase blocks across multiple dies and all of the controller's channels, so that sequential writes (or reads) to a single zone can be as fast as they would be on a conventional FTL-based SSD.

A recent Western Digital demo with a 512GB ZNS prototype SSD showed the drive using a zone size of 256MB (for 2047 zones total) but also supporting 2GB zones. Within a single zoned namespace, all zones will use the same zone size, but a drive can support reformatting a namespace to change its zone size or multiple namespaces with different zone sizes.
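
On Linux, zone geometry like this is visible to host software through the kernel's zoned block device interface. The short program below is a minimal sketch (error handling trimmed, device path hypothetical) that uses the BLKREPORTZONE ioctl from linux/blkzoned.h to print the first few zones' start, length, write pointer and condition:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

#define ZONES_PER_REPORT 16

int main(void)
{
    int fd = open("/dev/nvme0n2", O_RDONLY);    /* hypothetical ZNS namespace */
    if (fd < 0) { perror("open"); return 1; }

    size_t bufsz = sizeof(struct blk_zone_report) +
                   ZONES_PER_REPORT * sizeof(struct blk_zone);
    struct blk_zone_report *rep = calloc(1, bufsz);
    if (!rep) return 1;
    rep->sector = 0;                    /* start reporting from the first zone */
    rep->nr_zones = ZONES_PER_REPORT;   /* how many descriptors fit in the buffer */

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    for (unsigned i = 0; i < rep->nr_zones; i++) {
        const struct blk_zone *z = &rep->zones[i];
        /* start, len and wp are reported in 512-byte sectors */
        printf("zone %u: start %llu len %llu wp %llu cond 0x%x\n", i,
               (unsigned long long)z->start, (unsigned long long)z->len,
               (unsigned long long)z->wp, z->cond);
    }
    free(rep);
    close(fd);
    return 0;
}
```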

Hints or Warnings

Many recent NVMe features allow SSDs and host software to exchange optional hints about data layout, access patterns and lifetimes. Because they are only hints, one side can provide them without requiring the other side to support or act on that information. ZNS makes zones an explicit concept that the host must deal with directly, but it takes the hinting approach for some of the remaining internal operations of the SSD, with the hints in this case flowing from the drive to the host.

ZNS SSDs don't perform garbage collection in the sense that traditional SSDs do, but they are still responsible for wear leveling. That can sometimes mean the drive has to relocate data to different physical NAND erase blocks, especially if the drive is relatively full of data that is infrequently modified. Rewriting an entire zone of, say, 256MB is a pretty big background job that would have a noticeable impact on the latency of handling IO commands coming from the host. A ZNS SSD can notify the host that it recommends resetting a zone because it plans to do some background work on that zone soon, and it can include an estimate of how many seconds remain until that work starts. This gives the host an opportunity to reset the zone itself, which may involve the host doing some garbage collection of its own if only some of the data in the zone is still needed. (To help with such situations, NVMe has also added a Copy command to collect disparate chunks of data into a single contiguous chunk without the data having to leave the SSD.)

Similarly, a ZNS SSD can recommend that an active zone be moved to the full state, which the host can do either by writing the rest of the zone's capacity or by issuing a Zone Finish command.
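
In code, the host's side of this bargain can be as simple as mapping those two hints to two actions. The flag names below are placeholders standing in for the attribute bits in the ZNS zone descriptor; the real bit positions are defined by the specification, not by this sketch.

```c
/* Hypothetical host-side policy for reacting to the drive's hints. The
 * flag values are placeholders, not the actual ZNS zone attribute bits. */
#define HINT_RESET_RECOMMENDED   (1u << 0)   /* placeholder bit */
#define HINT_FINISH_RECOMMENDED  (1u << 1)   /* placeholder bit */

enum zone_action { ZONE_ACTION_NONE, ZONE_ACTION_RESET, ZONE_ACTION_FINISH };

static enum zone_action plan_for_hints(unsigned int hint_flags)
{
    if (hint_flags & HINT_RESET_RECOMMENDED)
        return ZONE_ACTION_RESET;    /* host relocates any live data, then resets the zone */
    if (hint_flags & HINT_FINISH_RECOMMENDED)
        return ZONE_ACTION_FINISH;   /* host fills the zone or issues Zone Finish */
    return ZONE_ACTION_NONE;
}
```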

When the host software pays heed to both of the above hints and takes the recommended actions, the SSD will be able to avoid almost all of the background operations that have a large impact on performance or write amplification. But because these are merely hints, if the host software ignores them or simply isn't in a position to comply, the SSD is still obligated to preserve user data throughout its background processing. There may still be some side effects, such as the drive having to move an open or active zone to the full state in exceptional circumstances, and host software must be written to tolerate these events. It's also impossible to completely eliminate write amplification. For example, static data may need to be rewritten eventually to prevent uncorrectable errors from accumulated read disturb errors.

Supporting Multiple Writers

The requirement to write data sequentially within a zone presents obvious challenges for software to manage data layout and especially updates to existing data. But it also creates a performance bottleneck when multiple threads want to write to the same zone. Each write command sent to the SSD needs to be addressed to the LBA currently pointed to by the zone's write pointer. When multiple threads are writing to a zone, there's a race condition where the write pointer can be advanced by another thread's write between when a thread checks for the location of the write pointer and when its write command gets to the SSD. That will lead to writes being rejected by the SSD. To prevent this, software has to synchronize between threads to properly serialize writes to each zone. The resulting locking overhead will tend to cause write performance to decrease when more threads are writing, and it is difficult to get the queue depth above 1. 
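
A sketch of that serialization burden, with hypothetical helper names, looks something like the following. Note that the per-zone lock has to be held from sampling the write pointer until the write is done, which is exactly what keeps the effective queue depth per zone at one.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical synchronous submission helper; stands in for whatever IO
 * path the application actually uses. */
void submit_write_and_wait(uint64_t lba, const void *buf, uint64_t nr_lbas);

struct zone_writer {
    pthread_mutex_t lock;     /* serializes all writers targeting this zone */
    uint64_t write_ptr;       /* host's copy of the zone's write pointer */
};

/* Every regular write must target the current write pointer, so the lock is
 * held from sampling the pointer until the write has completed, keeping the
 * per-zone queue depth at one. */
static uint64_t write_to_zone(struct zone_writer *zw,
                              const void *buf, uint64_t nr_lbas)
{
    pthread_mutex_lock(&zw->lock);
    uint64_t lba = zw->write_ptr;
    submit_write_and_wait(lba, buf, nr_lbas);
    zw->write_ptr += nr_lbas;
    pthread_mutex_unlock(&zw->lock);
    return lba;
}
```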

To address this limitation, the ZNS specification includes an optional append command that can be used instead of the write command. Append commands are always addressed to the beginning of the zone, but the SSD will write the data wherever the write pointer happens to be when it gets around to processing that command. When signaling completion of the command, the SSD returns to the host the LBAs where the data actually landed. This eliminates the synchronization requirement and allows many threads to write new data to a zone simultaneously with no core-to-core communication at all. The downside is that even more complexity has been moved into host software, which now must record data locations after the fact instead of allocating space before writing the data. Even returning the address where the data ended up has proven to be a challenge for existing IO APIs, which are usually only set up to return error codes to the application.
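
The contract of the append command is easier to see in code. The zone_append helper below is hypothetical (no standard userspace API exposes Zone Append this directly); the point is simply that the caller learns the data's location from the completion rather than choosing it up front.

```c
#include <stdint.h>

/* Hypothetical helper: issues a Zone Append addressed to the start of the
 * zone and returns the LBA at which the drive actually placed the data, as
 * reported in the command's completion. */
uint64_t zone_append(uint64_t zone_start_lba, const void *buf, uint64_t nr_lbas);

/* With append there is no shared write pointer to coordinate around: each
 * thread appends independently and records where its record landed. */
static void append_record(uint64_t zone_start_lba, const void *record,
                          uint64_t nr_lbas, uint64_t *where_it_landed)
{
    *where_it_landed = zone_append(zone_start_lba, record, nr_lbas);
}
```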

The append command isn't the only possible solution to this scalability challenge; it's just the one that has been standardized in this initial version of the NVMe ZNS specification. Other solutions have been proposed and implemented in prototypes or non-standard zoned SSDs. Radian Memory has been supporting its own form of zoned storage on its SSDs for years. Its solution is to allow out-of-order writes within a certain distance ahead of the write pointer: the SSD caches these writes and advances the write pointer up to the first gap in the data that has arrived so far. Another NVMe Technical Proposal on its way toward standardization defines a Zone Random Write Area (ZRWA) that allows random writes and in-place overwriting of data while it's still in the SSD's cache. Both of these methods require more resources on the SSD than the Zone Append command, but arguably make life easier for software developers. Since Zone Append, ZRWA, and any other solution have to be optional extensions to the basic ZNS feature set, there's potential for some annoying fragmentation here.

Comments

  • jeremyshaw - Monday, August 10, 2020 - link

    The early 70s and 80s timeframe saw CPUs and memory scaling roughly the same, year to year. After a while, memory advanced a whole lot slower, necessitating the multiple tiers of memory we have now, from L1 cache to HDD. Modern CPUs didn't become lots of SRAM with an attached ALU just because CPU designers love throwing their transistor budget into measly megabytes of cache. They became that way simply because other tiers of memory and storage are just too slow.
  • WorBlux - Wednesday, December 22, 2021 - link

    Modern CPUs have instructions that let you skip the cache, and then there was SPARC with streaming accelerators, where you could unleash a true vector/CUDA-style instruction directly against a massive chunk of memory.
  • Arbie - Thursday, August 6, 2020 - link

    An excellent article; readable and interesting even to those (like me) who don't know the tech but with depth for those who do. Right on the AT target.
  • Arbie - Thursday, August 6, 2020 - link

    And - I appreciated the "this is important" emphasis so I knew where to pay attention.
  • ads295 - Friday, August 7, 2020 - link

    +1 all the way
  • batyesz - Thursday, August 6, 2020 - link

    UltraRAM is the next big step in the computer market.
  • tygrus - Thursday, August 6, 2020 - link

    The first 512-byte sectors I remember go back to the days of IBM XT compatibles, 5¼ inch floppies, 20MB HDDs, MSDOS, FAT12 & FAT16. That's well over 30 years of baggage to carry around. They moved to 32-bit based file systems and 4KB blocks/clusters or larger (e.g. 64- or 128-bit addresses and 2MB blocks/clusters are possible).

    It wastes space to save small files/fragments in large blocks but it also wastes resources to handle more locations (smaller blocks) with longer addresses taking up more space and processing.

    Management becomes more complex to overcome the quirks of HW & increased capacities.
  • WaltC - Tuesday, August 11, 2020 - link

    Years ago, just for fun, I formatted a HD with 1k clusters because I wanted to see how much of a slowdown the increased overhead would create--I remember it being quite pronounced and quickly jumped back to 4k clusters. I was surprised at how much of a slowdown it created. That was many years ago--I can't even recall what version of Windows I was using at the time...;)
  • Crazyeyeskillah - Thursday, August 6, 2020 - link

    I'll ask the dumb questions no one else has posted:
    What kind of performance numbers will this equate to?

    Cheers
  • Billy Tallis - Thursday, August 6, 2020 - link

    There are really too many variables and too little data to give a good answer at this point. Some applications will be really ill-suited to running on zoned storage, and may not gain any performance. Even for applications that are a good fit for zoned storage, the most important benefits may be to latency/QoS metrics that are less straightforward to interpret than throughput.

    The Radian/IBM Research case study mentioned near the end of the article claims 65% improvement to throughput and 22x improvement to some tail latency metric for a Sysbench MySQL test. That's probably close to best-case numbers.
