Minimizing Hard Disk Drive Failure and Data Loss/Drive Life-Cycle Management

Drive selection
Some brands of drives are more reliable than others. While reliability data for particular models can be hard to come by, several factors can be used to estimate a drive's reliability. These include product ratings, drive class, error recovery behavior, and the number of heads.

Product rating
Drives with relatively higher user ratings and good reviews should be preferred. Drives with relatively lower ratings should not be purchased unless they will be used in a RAID environment, where redundancy can tolerate the failure of a single drive. The reliability of a model with fewer than five user ratings, as is the case with brand-new models, is harder to estimate.

Newegg is one of the websites that provide user ratings and reviews for many drives. Google Product Search aggregates user ratings and reviews from various websites.

Drive class
Enterprise-class drives are advertised as having slightly higher reliability than standard desktop-class drives, but they also cost more.

Error recovery mechanism
Hard drives can come with a built-in recovery mechanism that attempts a repair when an error occurs. This recovery cycle attempts to recover the data from the problematic area, and then allocates a spare area to replace it.

This process can take up to a few minutes, depending on the severity of the issue.

Drives meant to be used in a RAID environment must have a feature that prevents them from entering a long recovery cycle. Without it, the RAID controller can treat the unresponsive drive as failed and drop it from the array. This feature is known as Time-Limited Error Recovery (TLER) by Western Digital, Error Recovery Control (ERC) by Seagate, and Command Completion Time Limit (CCTL) by Samsung and Hitachi.

Desktop drives that can enter a long recovery cycle should therefore not be used in RAID environments, although drives with TLER / ERC / CCTL can be used in non-RAID environments.
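
As an illustration, the recovery time limit of a drive can be inspected before it is added to an array. The following minimal sketch assumes the text report printed by the smartmontools command `smartctl -l scterc <device>`, in which limits are given in tenths of a second or shown as "Disabled". The sample text, the function names, and the 7-second threshold are assumptions made for this example, not values taken from any vendor's specification.

```python
import re

def parse_scterc(output: str) -> dict:
    """Extract read/write recovery limits (in seconds) from an scterc report.

    A limit of None means the limit is disabled or was not reported.
    """
    limits = {"read": None, "write": None}
    for line in output.splitlines():
        match = re.match(r"\s*(Read|Write):\s+(\d+|Disabled)", line)
        if match:
            kind, value = match.group(1).lower(), match.group(2)
            # Numeric values are reported in tenths of a second.
            limits[kind] = None if value == "Disabled" else int(value) / 10
    return limits

def suitable_for_raid(limits: dict, max_seconds: float = 7.0) -> bool:
    """True only if both limits are enabled and within the assumed threshold."""
    return all(v is not None and v <= max_seconds for v in limits.values())

# Assumed sample of the report format, for illustration only.
SAMPLE = """SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
"""

if __name__ == "__main__":
    print(suitable_for_raid(parse_scterc(SAMPLE)))
```

A drive that reports both limits as disabled, or does not support the command at all, would be treated by this sketch as unsuitable for RAID use.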

Number of heads
There is a strong positive correlation between the number of heads in a drive and its failure rate. When choosing between two drives of equal capacity and speed, the one with fewer heads is therefore preferred. This point may be of limited use in practice, however, because drives with similar features tend to have the same number of heads.

Burn-in
A drive is more likely than usual to fail in its first few months of use. This elevated failure rate is due to assembly, configuration, or component-level problems. If a drive is susceptible to failing due to such a problem, it is beneficial to detect the problem before the drive is put into use. Care must be taken to ensure that a drive does not overheat during a burn-in.

To aid with this, new drives can first be put through a short burn-in process using special software. This process performs read and write stress tests on the drive. It thus aims to catch problems in the drive that may lead to its early failure. One commercial software application for both Windows and Linux that performs this and other burn-in tests is PassMark BurnInTest.
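
The basic idea of a write and read test can be sketched as follows. This is only a minimal illustration and not a substitute for dedicated burn-in software: it writes random blocks to a scratch file, reads them back, and counts mismatches. The function name and the default sizes are assumptions made for this example. Because the operating system may serve the read from its cache rather than from the disk, a real burn-in tool has to bypass caching and cover far more of the drive.

```python
import hashlib
import os
import tempfile

def write_verify_pass(directory, blocks=8, block_size=1024 * 1024):
    """Write random blocks to a scratch file, read them back, count mismatches."""
    mismatches = 0
    digests = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                data = os.urandom(block_size)
                digests.append(hashlib.sha256(data).digest())
                f.write(data)
            # Ask the operating system to push the data to the device.
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            for expected in digests:
                if hashlib.sha256(f.read(block_size)).digest() != expected:
                    mismatches += 1
    finally:
        # Always remove the scratch file, even if an I/O error occurred.
        os.remove(path)
    return mismatches

if __name__ == "__main__":
    # The directory should be located on the drive under test.
    print(write_verify_pass(tempfile.gettempdir()))
```

Any nonzero result, or an I/O error raised during the pass, would be a reason to examine the drive more closely.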

S.M.A.R.T. reliability data can be queried before and after the burn-in. If a new error is found after the burn-in, it can be indicative of the drive being susceptible to an early failure.
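
The before-and-after comparison can be sketched as below. The sketch assumes that the raw values of the S.M.A.R.T. attributes have already been collected into dictionaries, for example by reading the output of a tool such as `smartctl -A`. The watched attribute names follow the names smartmontools commonly reports, but which attributes a drive exposes varies by model, so the list is an assumption and should be adjusted as needed.

```python
# Error-related attributes whose raw value should not grow during a burn-in.
# This selection is an assumption for illustration; drives differ in what they report.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

def new_errors(before: dict, after: dict, watched=WATCHED) -> dict:
    """Return the watched attributes whose raw value increased, with the increase."""
    increases = {}
    for name in watched:
        old = before.get(name, 0)
        new = after.get(name, 0)
        if new > old:
            increases[name] = new - old
    return increases

if __name__ == "__main__":
    # Hypothetical snapshots taken before and after a burn-in.
    before = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0}
    after = {"Reallocated_Sector_Ct": 8, "Current_Pending_Sector": 0}
    print(new_errors(before, after))
```

An empty result means that none of the watched counters grew during the burn-in.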

Routine drive upgrades
Planned obsolescence is usually something imposed by the company selling a product, but in this case it is chosen by the consumer. Older, smaller drives can routinely be replaced by newer, larger ones. Besides making more storage capacity available, this reduces the risk of losing the data on the older drive, because that drive is replaced well before the end of its life. This approach suits consumers who require increasing amounts of storage, as they benefit most from the added capacity.

Drives can be replaced based on their features, age, or their fitness as determined by S.M.A.R.T. parameters.
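
A replacement policy built on these three criteria might be sketched as follows. All names and thresholds here (a five-year age limit, a 1 TB minimum capacity, and zero tolerance for reallocated or pending sectors) are hypothetical values chosen for illustration; they are not recommendations from this text.

```python
from dataclasses import dataclass

@dataclass
class DriveStatus:
    age_years: float
    capacity_tb: float
    reallocated_sectors: int  # S.M.A.R.T. raw value
    pending_sectors: int      # S.M.A.R.T. raw value

def replacement_reasons(drive: DriveStatus,
                        max_age_years: float = 5.0,
                        min_capacity_tb: float = 1.0) -> list:
    """List the criteria (features, age, fitness) under which the drive is due."""
    reasons = []
    if drive.capacity_tb < min_capacity_tb:
        reasons.append("features")
    if drive.age_years >= max_age_years:
        reasons.append("age")
    if drive.reallocated_sectors > 0 or drive.pending_sectors > 0:
        reasons.append("fitness")
    return reasons

if __name__ == "__main__":
    old_drive = DriveStatus(age_years=6.0, capacity_tb=0.5,
                            reallocated_sectors=0, pending_sectors=3)
    print(replacement_reasons(old_drive))
```

Returning the list of reasons, rather than a single yes or no, makes it possible to tell an optional capacity upgrade apart from an urgent replacement.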