The SSD seemed to be the second best invention after sliced bread and Netflix, unless we remember those OCZ models that died far too early, after a rather small number of writes. Still, modern SSDs have improved a lot, and since this bloody Windows 10 seems to hit the disk only too often, “SSD” is the winning keyword, unless you’d rather it be “Advil.”

The problem is that most people don’t know how SSDs actually work. To start with the biggest mystery: most people have heard of the TRIM command, but they don’t quite understand it. Some quick definitions call it a “background garbage collection mechanism,” and Wikipedia comes up with this crap:

The TRIM command is designed to enable the operating system to notify the SSD which pages no longer contain valid data due to erases either by the user or operating system itself. During a delete operation, the OS will mark the sectors as free for new data and send a TRIM command to the SSD to mark them as not containing valid data. After that the SSD knows not to preserve the contents of the block when writing a page, resulting in less write amplification with fewer writes to the flash, higher write speed, and increased drive life.

But hey, no matter the storage medium, the file system or the operating system, no standard delete command actually deletes data! The deleted files are left intact on the disk; only the corresponding entries in the FAT, the MFT or the inode table are updated (there’s a toy sketch of this right after the list below). Of course, this still requires a disk write, and it still happens on an SSD, because this can’t be volatile info: the disk has to know what files are there and where they are located (virtually, as the firmware takes care of the actual translation). So, at first sight, it’s not clear:

  • what is not written back on an SSD when a deletion occurs;
  • what a TRIM command is actually doing;
  • why it’s done this way, and how it’s supposed to increase an SSD’s life.
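
To make the starting point concrete, here’s a toy sketch in Python of what a “delete” typically touches. It’s purely illustrative (no real file system uses structures this simple, and the names are made up): the file table entry is flipped, the data blocks are left alone.

# Toy model: deleting a file only updates the file table (think FAT/MFT/inode),
# while the data blocks keep their old contents until they are overwritten.

disk_blocks = {}      # block number -> bytes currently stored there
file_table = {}       # file name -> (list of block numbers, "exists" flag)

def write_file(name, blocks, data_per_block):
    for b in blocks:
        disk_blocks[b] = data_per_block       # actual writes to the medium
    file_table[name] = (blocks, True)

def delete_file(name):
    blocks, _ = file_table[name]
    file_table[name] = (blocks, False)        # only the metadata changes...
    # ...the old data is still sitting in disk_blocks, untouched.

write_file("report.doc", [10, 11, 12], b"secret")
delete_file("report.doc")
print(file_table["report.doc"])   # ([10, 11, 12], False) -- "deleted"
print(disk_blocks[10])            # b'secret' -- the data is still there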

As someone once said, SSD manufacturers protect their technology secrets more carefully than Coca-Cola protects its soda formula. And yet, someone must know how this works, right? Especially as there are still scary figures out there (the article is from 2013 though):

Consider the read/write longevity of SLC (Single-Level Cell) and consumer-grade MLC (Multi-Level Cell) NAND memory, the storage media used to build SSDs: The former is typically rated to last 100,000 cycles, but the latter is rated for only 10,000. Relax—you’d need to write the entire capacity of the drive every day for 25 years or so to wear out all the cells. The latest TLC (Triple-Level Cell) NAND that Samsung is shipping is rated for only a few thousand writes, but you’d still need to write the entire drive’s capacity for something less than ten years to use up the drive. No average user will ever come remotely close to that.
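
The arithmetic behind that reassurance is easy to check. A back-of-the-envelope sketch, assuming perfect wear leveling and ignoring write amplification (both optimistic assumptions):

# With perfect wear leveling, writing the drive's full capacity once per day
# costs roughly one P/E cycle per cell per day, so the rated cycle count
# translates directly into days. Real drives suffer write amplification,
# so these are optimistic upper bounds.

rated_pe_cycles = {"SLC": 100_000, "MLC": 10_000, "TLC": 3_000}

for kind, cycles in rated_pe_cycles.items():
    years = cycles / 365   # one full-capacity write per day ~= one cycle per day
    print(f"{kind}: ~{years:.0f} years at one full drive write per day")

# SLC: ~274 years, MLC: ~27 years, TLC: ~8 years -- the same ballpark as the
# "25 years or so" and "something less than ten years" figures quoted above.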

A few thousand writes, and this is supposed to be enough?! TRIM must be the key to that, but how? Here’s the best quick explanation of how an SSD works, well hidden in an inconspicuous paragraph from the same article:

Having the controller write to every NAND cell once before it writes to any cell a second time—a technology known as wear leveling—also helps to extend a drive’s life span. Wear leveling ensures that no cell endures heavy use while another sits virgin next to it. Newer controllers also compress data on the fly before writing it to the disk. Less data equals less wear.
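
Here’s how the basic rule could look in code, as a hedged sketch (real controllers are far more elaborate, e.g. they also relocate rarely-changed “cold” data, which this toy allocator ignores): when placing new data, prefer the erased block that has been written the fewest times.

# Minimal wear-leveling allocator: among the blocks that are free (already
# erased), pick the one with the lowest write count, so no block races ahead
# of the others while another "sits virgin next to it".

blocks = [{"id": i, "write_count": 0, "free": True} for i in range(8)]

def allocate_block():
    candidates = [b for b in blocks if b["free"]]
    if not candidates:
        raise RuntimeError("no erased block available; erase (or TRIM) first")
    victim = min(candidates, key=lambda b: b["write_count"])
    victim["free"] = False
    victim["write_count"] += 1
    return victim["id"]

print([allocate_block() for _ in range(4)])   # [0, 1, 2, 3] -- spread across blocks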

On my Kingston RBU-SNS8152S3256GF SSD, I’ve found the following two S.M.A.R.T. attributes to be relevant:

  • 173 [0xAD] = Erase Count, aka Wear Leveling Count, incorrectly read by Speccy as 131,082: it’s actually 0x12, meaning 18 erasures;
  • 231 [0xE7] = SSD Wear Indicator, unknown to Speccy (which thinks it’s a temperature), reading 0x13, meaning 19 writes (the only logical interpretation is 19 writes and 18 erasures).

Let’s note that there are some more constraints:

  • An SSD cannot simply overwrite a storage cell in one action: it must first erase the old data before it can write the new information. Such a “zeroing” is expensive in terms of time.
  • An SSD cannot erase individual allocation units (let’s use the antiquated term cluster to think of them), but only larger blocks (I’d call them pages, although the terminology is not well established; some people call a set of pages a block, whereas I’ll call a set of allocation units a page or a block). This can sometimes make it difficult to find an erasable block on a fragmented SSD (there’s a small sketch of these two constraints right after this list).
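
A toy sketch of the two constraints, using the article’s own terminology (a “page” here is a group of clusters that can only be erased as a whole; the costs are invented, only to show that the erase step dominates):

# Two NAND constraints, in the terminology used above:
#  - a cluster can only be rewritten after its whole page has been erased;
#  - erasure works on whole pages (groups of clusters), never on single clusters.
# The costs are made-up time units; real numbers vary per device.

CLUSTERS_PER_PAGE = 4
WRITE_COST, ERASE_COST = 1, 20

class Page:
    def __init__(self):
        self.written = [False] * CLUSTERS_PER_PAGE  # programmed since last erase?
        self.live    = [False] * CLUSTERS_PER_PAGE  # still referenced by a file?

    def write_cluster(self, i):
        cost = 0
        if self.written[i]:
            # No overwrite in place: the whole page must be erased first, and
            # that's only possible once no cluster in it holds live data.
            if any(self.live):
                raise RuntimeError("live data in this page; relocate it before erasing")
            cost += ERASE_COST                       # the expensive "zeroing"
            self.written = [False] * CLUSTERS_PER_PAGE
        self.written[i] = True
        self.live[i] = True
        return cost + WRITE_COST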

Let’s now try to design a quick and dirty way to implement an SSD, given some NAND flash and a controller (a toy simulation of these rules follows the list):

  1. The SSD must keep one or both of a “master write counter” and an “erasure counter,” starting at zero. Also, for each storage block (say, cluster), there must be a “write counter” and an “in use” flag saying whether the block has been zeroed or not, regardless of its occupancy status.
  2. When data is written to a storage block, its own “write counter” is incremented. The “master write counter” is updated with the largest such value.
  3. When a file is deleted, it’s marked as such in the MFT (or equivalent), no change so far compared to a regular HDD. This MFT must be part of the SSD, possibly stored in a portion implemented in a different technology, because such OS information isn’t preserved by divine intervention; it has to live on the SSD itself!
  4. The SSD’s firmware, however, doesn’t change the “in use” flag or the “write counter” of a storage block whose contents are no longer useful; the flag staying set means “don’t write here unless there’s no other choice.”
  5. When new data must be written, the firmware looks for storage blocks whose “write counter” is smaller than the largest such counter, which is the “master write counter.” For instance, if parts of the disk have only been written once (counter is 1) and parts of the disk have never been written to (counter is 0), the latter will be used when storing new data.
  6. When the disk is “full” with regard to this algorithm, the firmware looks for blocks that are actually unoccupied (based on the MFT allocation), yet still have the “in use” flag set. To free up space, it starts to actually delete the contents of such blocks, zeroing them and clearing the “in use” flag. It’s up to the implementation whether this also increments the master “erasure counter” or whether only a TRIM-triggered erasure is counted by it. Note that, because of the granularity of the SSD, all the elementary storage units (clusters) in a block (or page) must be unoccupied in order to allow an erasure. This actually makes an SSD much more fragmented than a traditional HDD! The need to delete-before-write severely affects the write speed of the respective SSD. Most devices include a huge cache buffer (3 to 12 GB) that helps mitigate the reduced write speed in such conditions by delaying the actual write operation until the blocks are zeroed as needed.
  7. If and when a TRIM command is issued, either internally by the firmware (the “background garbage collection,” when it deems it necessary based on decreasing write performance) or by the OS (e.g. by a weekly Windows 10 maintenance task), all the blocks that are no longer occupied but not yet zeroed are erased and the relevant counters are updated. (Note that older devices could not understand a TRIM command; they needed distinct zero-filling commands issued for each erasable block. The same went for the eMMC storage used in smartphones.)
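
Here’s the promised toy simulation of these rules, in Python. It’s a sketch of my own reading of the algorithm, not of any vendor’s actual firmware, and for brevity one “block” here is a single allocatable unit (the page-level erase granularity from the earlier sketch is ignored):

# Toy SSD firmware following steps 1-7 above.

class ToySSD:
    def __init__(self, n_blocks):
        self.write_count = [0] * n_blocks    # per-block write counters (step 1)
        self.in_use = [False] * n_blocks     # True = not zeroed since last write
        self.occupied = [False] * n_blocks   # True = referenced by the file table
        self.master_write_count = 0          # largest per-block counter (step 1)
        self.erasure_count = 0               # master erasure counter (step 1)

    def delete(self, block):
        # Steps 3-4: only the "file table" changes; the block stays un-zeroed.
        self.occupied[block] = False

    def _erase(self, block):
        # The expensive zeroing (steps 6 and 7).
        self.in_use[block] = False
        self.erasure_count += 1

    def write(self, units):
        """Write `units` new blocks of data, wear-leveling as in step 5."""
        targets = []
        for _ in range(units):
            # Prefer already-zeroed blocks with the lowest write counter (step 5).
            free = [b for b in range(len(self.write_count))
                    if not self.in_use[b] and not self.occupied[b]]
            if not free:
                # Step 6: nothing pre-erased is left, so reclaim an unoccupied
                # block that still carries stale data, paying the erase cost now.
                stale = [b for b in range(len(self.write_count))
                         if self.in_use[b] and not self.occupied[b]]
                if not stale:
                    raise RuntimeError("disk genuinely full")
                victim = min(stale, key=lambda b: self.write_count[b])
                self._erase(victim)
                free = [victim]
            target = min(free, key=lambda b: self.write_count[b])
            self.write_count[target] += 1                          # step 2
            self.master_write_count = max(self.master_write_count,
                                          self.write_count[target])
            self.in_use[target] = True
            self.occupied[target] = True
            targets.append(target)
        return targets

    def trim(self):
        # Step 7: zero every block that is no longer occupied but not yet erased.
        for b in range(len(self.write_count)):
            if self.in_use[b] and not self.occupied[b]:
                self._erase(b)

ssd = ToySSD(n_blocks=8)
written = ssd.write(6)        # six never-written blocks get the data
for b in written[:3]:
    ssd.delete(b)             # "deleting" doesn't zero anything yet
ssd.write(4)                  # 2 pristine blocks, then 2 slow lazy erasures
ssd.trim()                    # the remaining stale block is zeroed ahead of time
print(ssd.write_count, ssd.erasure_count)   # [2, 2, 1, 1, 1, 1, 1, 1] 3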

If I’m not wrong, this should ensure that “no cell endures heavy use while another sits virgin next to it” and that “every NAND cell is written once before a write occurs to any cell a second time.”

But basically, the TRIM command is only responsible for releasing for reuse the cells that are no longer in use but not yet zeroed. While this improves the write speed (those cells no longer need to be erased at write time), TRIM cannot decrease the fragmentation of an SSD and, as we’ll see, it cannot fully restore an SSD’s performance; only a full erasure of the entire drive could do that!

The other day, I used CrystalDiskMark to measure the speed of my SSD. And here’s the big surprise:

[Image: SSD_Kingstone_Acer (CrystalDiskMark results)]

That’s right, the TRIM command made things… worse! But there’s an explanation for that.

Let’s consider the following fragmentation chart as discussion material:

[Image: fragmented (fragmentation chart)]

The two shades of red mark different levels of fragmentation. Now, consider this:

  • Prior to TRIM, the red and reddish pages would have been completely skipped for writing, as in most cases they were a mix of occupied cells and “not-used-but-not-yet-zeroed” cells. Writing in the empty area of the disk is faster.
  • After TRIM, a large number of the pale red cells, having a low occupancy, are most likely released for use (note that such a map is not granular enough, and the colors only approximate the actual situation). Now the colored part of the chart is like Swiss cheese; just imagine the pale red cells are white. Writing 1 GB of data starting with the first unoccupied cell results in a very fragmented file! Also, it might be that writing to many different pages (the larger sets of blocks that can be addressed or erased at once) in rapid sequence further decreases the speed, compared to accessing fewer pages (a small counting sketch follows this list).
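
A crude way to see the “Swiss cheese” effect is to count how many fragments a new file ends up in when it is written into scattered holes versus into a contiguous free tail. The free-space maps below are invented, not derived from the chart above:

# Toy free-space maps (1 = free, 0 = occupied). A file needing 6 units written
# into scattered holes ends up in several fragments; written into a contiguous
# free tail, it ends up in one. Illustrative numbers only.

def count_fragments(free_map, units_needed):
    fragments, in_run = 0, False
    for free in free_map:
        if units_needed == 0:
            break
        if free:
            if not in_run:
                fragments += 1     # a new run of allocated units starts here
            units_needed -= 1
            in_run = True
        else:
            in_run = False
    return fragments

before_trim = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # writes go to the free tail
after_trim  = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1]   # TRIM opened small holes

print(count_fragments(before_trim, 6))   # 1 fragment
print(count_fragments(after_trim, 6))    # 4 fragments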

I cannot explain why both the sequential writes and the random (“4k”) writes are affected by a TRIM, nor why the read speed is also affected! Maybe the TRIM command also clears some temporary tables used by the firmware to optimize the access; I don’t know.

This Kingston disk seems to lack the aforementioned “huge cache” of 3 to 12 GB that some SSDs have. Here’s an SSD (Toshiba THNSNH128GBST) that seems to have such a cache:

[Image: SSD_Tudor (CrystalDiskMark results)]

Technically, the raw write speed has to be much lower than the read speed (unless we’re talking about an HDD), and only a significant amount of cache can lessen the difference.

A fact remains: SSDs cannot be defragmented, and this is an uncorrectable flaw. In Windows 10, Disk Defrag issues a TRIM command when told to defragment an SSD; go try it!

Of course, people like Scott Hanselman will tell you that there is some kind of SSD defragmentation performed by Windows, namely that “your SSD will get intelligently defragmented once a month,” and that it can be triggered (as expected) by performance issues:

It’s also somewhat of a misconception that fragmentation is not a problem on SSDs. If an SSD gets too fragmented you can hit maximum file fragmentation (when the metadata can’t represent any more file fragments) which will result in errors when you try to write/extend a file. Furthermore, more file fragments means more metadata to process while reading/writing a file, which can lead to slower performance.

Who can tell what the truth is with regard to defragmenting an SSD? Even in the age of open-source software, the intricacies of SSD internals are unknown to the general public… If an SSD is really “intelligently defragmented” by Windows once in a while, it must perform some sort of “free space consolidation,” so as to hurt the SSD’s wear level as little as possible.

Either way, as noted, the eMMC storage of your smartphone also needs a TRIM every now and then, and even so, full performance cannot be regained unless you reset it to factory settings. External storage, be it microSDHC cards or USB flash drives, is even worse: there is no TRIM command for them, so the only way to make them faster again is to reformat them!

Back to our mysterious SSD and the TRIM that sometimes hurts, how do we choose an SSD? We read benchmarks and specialized magazines and their recommendations, right?

I don’t trust most benchmarks, for various reasons, but let’s find a practical example. Say I read ComputerBILD 4/2016, which has some nice recommendations, from which I chose to focus on a cheap line, the Crucial BX200, ranked as follows: 8th in the 240-256 GB range; 7th in the 480-512 GB range; and 3rd in the 960+ GB range. Umm… third place must be a good one, right? Especially when, for a 960 GB SSD, it’s 55 € cheaper than the 2nd place and 86 € cheaper than the 1st place.

The problem is that the speed measurements for large files, small files and sustained throughput are not always relevant enough. Here’s the very same Crucial BX200 SSD (960GB) reviewed by PCWorld: Crucial BX200 SSD review: Good for casual users, but not for slinging extra-large files:

The BX200 is actually two drives in one: a very small and fast one that uses DRAM and SLC (single-level cell) memory, and another much larger and slower drive using TLC (triple-level cell/3-bit). In the BX200’s case, that TLC can only write data to its cells at about 80MBps. No, that’s not a typo. But because of that small cache drive, the BX200 acts just like a high-end SSD most of the time.

We tested two versions of the BX200: the 960GB with 12GB of cache and the 480GB with 6GB of cache. The 240GB version has only 3GB. […] everything looks hunky-dory in the artificial benchmark AS SSD for the 960GB version because the 10GB tests fit well within its 12GB cache. Not so with the 480GB version, and you can expect even worse sequential write numbers out of the 240GB version.

Even the 960’s 12GB of cache is not large enough to mask slow TLC performance in our 20GB copy tests.

Bottom line: the cache can help, but when the actual writing has to occur, make sure your SSD can perform it at more than 80 MBps. The Crucial BX200 (480 GB & 960 GB) has pathetic results even in the tests made by AnandTech. And yet, the Crucial BX200 gets positive reviews on other sites: on Tom’s Hardware; on TweakTown; on StorageReview.com. Go figure…
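
A quick, hedged calculation shows why the 20 GB copy exposes the drive. Assume the 12 GB cache absorbs the first part at a high speed (say 400 MB/s, an invented figure for illustration) and the remaining 8 GB has to go straight to TLC at roughly 80 MB/s, per the quote above; the slow tail dominates:

# Rough effective write speed of a cached TLC drive on a 20 GB copy.
# The 80 MB/s TLC figure comes from the PCWorld quote; the 400 MB/s cache
# speed is an assumed, illustrative number, not a measured one.

total_gb, cache_gb = 20, 12
cache_speed, tlc_speed = 400, 80          # MB/s

time_s = (cache_gb * 1024) / cache_speed + ((total_gb - cache_gb) * 1024) / tlc_speed
effective = (total_gb * 1024) / time_s
print(f"~{effective:.0f} MB/s effective")  # ~154 MB/s, nowhere near the cached burst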

RECAP: All in all, I’ve only debunked (sort of) a very few myths, but here’s what I discovered:

  1. TRIM doesn’t primarily serve as a means to minimize and delay the deletions, but rather as a means to ensure uniform wear, by making sure no storage cell endures heavy writing while there are empty cells available.
  2. TRIM is not magic; while it’s normally needed on a regular basis, there are cases when it actually decreases performance.
  3. When this happens, it’s because of fragmentation. Whether a defragmentation actually occurs in an SSD is still unknown (Scott Hanselman claims it happens), but either way it’s not something that the end user can trigger, as the standard defrag command triggers a TRIM.
  4. The true performance of an SSD can be masked by a huge, fast cache (3 to 12 GB), and unfortunately most reviewers don’t design their benchmarks so that the actual speed of the “true SSD” can be determined. Don’t trust everything you read unless the review persuades you that the internal limitations of the device have been revealed.