A few weeks ago Solid State Drives (SSDs) came under scrutiny when a Seagate engineer said in a presentation that SSDs lose data when they are left unpowered. That presentation has since been taken down and the story debunked by many reputable sites. But SSDs are once again in the news, and this time for a legitimate issue.
Earlier this month developers at Algolia, which provides a hosted search API for application developers, noticed that its SSDs were getting corrupted and switching to read-only mode. As expected, the company had backups and the servers were restored. But the problem continued and started to get worse.
Algolia Site Reliability Engineer Adam Surak wrote about the issue on the company blog, “The system was issuing a TRIM to erase empty blocks, the command got misinterpreted by the drive and the controller erased blocks it was not supposed to. Therefore our files ended-up with 512 bytes of zeroes, files smaller than 512 bytes were completely zeroed. When we were lucky enough, the misbehaving TRIM hit the super-block of the filesystem and caused a corruption. After disabling the TRIM, the live big files were no longer corrupted but the small files that were once mapped to the memory and never changed since then had two states – correct content in the memory and corrupted one on the drive. Running a check on the files found nothing because they were never fetched again from the drive and just silently read from the memory.”
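The symptom Surak describes, aligned 512-byte runs of zeroes inside otherwise intact files, can be looked for with a short script. What follows is only a minimal sketch in Python, not Algolia's actual tooling; the function names are made up for illustration. Keep in mind that legitimate files can contain zero-filled blocks too, so a hit is a hint rather than proof of corruption.

```python
from typing import List

BLOCK_SIZE = 512  # block size mentioned in the Algolia write-up


def find_zeroed_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the byte offsets of aligned blocks that contain only zero bytes."""
    offsets = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        # A short final chunk is checked as well, because files smaller
        # than 512 bytes were reported as completely zeroed.
        if chunk == bytes(len(chunk)):
            offsets.append(offset)
    return offsets


def scan_file(path: str) -> List[int]:
    """Hypothetical helper: read a whole file and report its zeroed blocks."""
    with open(path, "rb") as handle:
        return find_zeroed_blocks(handle.read())
```

As the quote points out, a check like this can pass on a damaged file if the kernel serves the content from memory instead of the drive, so the result only means something when the data is actually read back from the disk.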
Initially they thought that the problem was related to the queued TRIM feature in the Linux kernel. Surak clarified in the blog that this was not the case: “The TRIM on our drives is un-queued and the issue we have found is not related to the latest changes in the Linux Kernel to disable this feature.”
Upon in-depth investigation, the company discovered that the problem was specific to the Samsung SSDs it was using, including the Samsung MZ7WD480HCGM-00003, Samsung MZ7GE480HMHP-00003, Samsung MZ7GE240HMGR-00003, Samsung SSD 840 PRO Series and Samsung SSD 850 PRO 512GB. It also listed some Intel SSDs it deems safe for its infrastructure, including the Intel S3500, Intel S3700 and Intel S3710.
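For administrators who want to compare their own hardware against the models Algolia named, a small lookup is enough. The Python sketch below is illustrative only: the function name is hypothetical, the list simply repeats the models above, and how the model string is obtained (for example, from whatever drive inventory tool is already in use) is left outside the sketch.

```python
# Models named in the Algolia write-up, with the "Samsung" prefix dropped
# so that matching does not depend on how a tool reports the vendor.
AFFECTED_MODELS = (
    "MZ7WD480HCGM-00003",
    "MZ7GE480HMHP-00003",
    "MZ7GE240HMGR-00003",
    "SSD 840 PRO Series",
    "SSD 850 PRO 512GB",
)


def is_listed_model(model: str) -> bool:
    """Return True if a reported model string contains one of the listed names."""
    # Collapse repeated whitespace and ignore case before comparing.
    normalized = " ".join(model.split()).upper()
    return any(name.upper() in normalized for name in AFFECTED_MODELS)
```

A match only says that the drive appears on Algolia's list; it does not show that a particular drive or firmware revision misbehaves.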
I reached out to some SSD vendors and users of SSDs to get a bigger picture and to find out which drives are the best for enterprise customers. Given the nature of the discovery, and ongoing investigation by Samsung, some vendors declined to comment on the story.
Kingston’s advice: a vendor’s point of view
When asked whether their drives are affected by the issue, Cameron Crandall, senior technology manager for SSD manufacturer Kingston, said via email, “Kingston is aware of the article and our SSDs are not affected due to our implementation of TRIM support on our drives.”
The more important question was their recommendation for enterprise customers running Linux. Crandall said, “SSDs are OS independent so we would recommend any of our KC300/310 and E Series SSDs for enterprise applications. The workload of the drives would be the only variable as to which drive series Kingston recommends for the given application(s).”
Digital Ocean’s advice: a user’s point of view
Digital Ocean is one of the leading virtual private server providers. They are well known for offering SSDs on their servers. When asked whether SSDs in their datacenter are affected by this, Sam Kottler, Platform Engineer at Digital Ocean, said, “We utilize TRIM on top of drives which are not affected by this issue.”
When I asked for their recommendations or best practices for choosing the right drive for an enterprise set-up, Kottler explained, “The market for SSD’s has yielded a huge amount of variety in both features and quality. The most prominent divide in terms of quality is between drives which are marketed as consumer grade versus those intended for the enterprise or datacenter. While initial performance of consumer drives can generally match some of the performance characteristics of higher end drives, they degrade both more quickly and less gracefully. Performance for multi-threaded write operations stand out in a particularly stark manner between these classes of drives. Firmware can also make a stark difference in multi-threaded performance; having a relationship with vendors to customize that firmware for specific workloads can be quite beneficial. It’s important to simulate workloads on the drives and firmware which are headed for production. Additionally, measuring how disk fill and ensuring that TRIM is enabled in tandem with internal garbage collection is critical for ensuring the longevity and health of drives.
Single-cell versus multi-cell designs can make drastic performance differences as well. Lower-end drives tend to be multi-cell because of the lower cost driven by higher density per cell. That density means that the cells wear down faster; multi-cell drives wear at a rate of about an order of magnitude faster than their single-cell counterparts. SSD’s continue to outperform their spinning counterparts at comparable quality levels. The degradation properties of SSD’s, where they slowly degrade rather than failing in a catastrophic manner make them operationally desirable.”
However, he didn’t recommend any particular brand or make. He said, “It’s hard to make a generalized recommendation given the current options on the market. We’ve extensively tested a number of different drives from several manufacturers and know what works best for us, but drive performance is workload specific.”
Conclusion
If you are planning to purchase SSDs for Linux, keep an eye on the drives that are blacklisted by the Linux kernel. Also pay heed to what Surak suggests: “…be careful, even when you don’t enable the TRIM explicitly, at least since Ubuntu 14.04 the explicit FSTRIM runs in a cron once per week on all partitions – the freeze of your storage for a couple of seconds will be your smallest problem.”
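One way TRIM gets enabled explicitly on Linux is the discard mount option, and checking for it is straightforward. The following Python sketch assumes the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options); the function name is hypothetical. It does not detect the weekly FSTRIM cron job Surak mentions, which has to be checked separately.

```python
from typing import List, Tuple


def mounts_with_discard(mounts_text: str) -> List[Tuple[str, str]]:
    """Return (device, mount point) pairs mounted with a discard option."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        for option in options.split(","):
            # Match "discard" and forms such as "discard=async",
            # but not "nodiscard".
            if option == "discard" or option.startswith("discard="):
                result.append((device, mount_point))
                break
    return result


if __name__ == "__main__":
    # /proc/mounts is only available on Linux.
    with open("/proc/mounts") as handle:
        for device, mount_point in mounts_with_discard(handle.read()):
            print(device, mount_point)
```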
If you pay attention to these points, your data may just stay in the solid state.