Self-Monitoring, Analysis, and Reporting Technology

From Wikipedia, the free encyclopedia


Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T., is a monitoring system for computer hard disks to detect and report on various indicators of reliability, in the hope of anticipating failures.

1 Background
2 History and predecessors
3 SMART Information
4 Standards and implementation
5 Attributes
- 5.1 Known S.M.A.R.T. attributes
- 5.2 Threshold Exceeds Condition
6 Comparison of S.M.A.R.T. tools
7 References
8 External links

Background

Fundamentally, hard drives can suffer one of two classes of failure:

Predictable failures: some failure modes, especially mechanical wear and aging, develop gradually over time. A monitoring device can detect these, much as a temperature dial on the dashboard of an automobile can warn a driver, before serious damage occurs, that the engine has started to overheat.
Unpredictable failures: other failures occur suddenly and without warning, such as an electronic component failing.

Mechanical failures, which are usually predictable, account for 60 percent of drive failures.^[1] The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while time remains to take preventive action, such as copying the data to a replacement device. Approximately 30% of failures can be predicted by S.M.A.R.T.^[2] Work at Google on over 100,000 drives found little overall predictive value in S.M.A.R.T. status as a whole, but found that certain sub-categories of information that S.M.A.R.T. implementations might track do correlate with actual failure rates. Specifically, following the first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors, and first errors in reallocations, offline reallocations, and probational counts are also strongly correlated with higher failure probabilities.^[3]

Pctechguide's page on S.M.A.R.T. (2003)^[4] comments that the technology has gone through three phases:

"In its original incarnation SMART provided failure prediction by monitoring certain online hard drive activities. A subsequent version improved failure prediction by adding an automatic off-line read scan to monitor additional operations. The latest SMART III technology not only monitors hard drive activities but adds failure prevention by attempting to detect and repair sector errors. Also, whilst earlier versions of the technology only monitored hard drive activity for data that was retrieved by the operating system, SMART III tests all data and all sectors of a drive by using off-line data collection to confirm the drive's health during periods of inactivity."

History and predecessors

The industry's first hard disk monitoring technology was introduced by IBM in 1992 in its IBM 9337 Disk Arrays for AS/400 servers^[5] using IBM 0662 SCSI-2 disk drives. It was later named Predictive Failure Analysis (PFA) technology. It measured several key device health parameters and evaluated them within the drive firmware. Communication between the physical unit and the monitoring software was limited to a binary result: either the device is OK, or it is likely to fail soon.

Later,^[6] another variant, named IntelliSafe, was created by computer manufacturer Compaq and disk drive manufacturers Seagate, Quantum, and Conner. The disk drives measured their own health parameters, and the values were transferred to the operating system and user-space monitoring software. Each disk drive vendor was free to decide which parameters to monitor and what their thresholds should be. The unification was at the level of the protocol with the host.

Compaq submitted its implementation to the Small Form Factor (SFF) Committee for standardization in early 1995.^[7] It was supported by IBM, by Compaq's development partners Seagate, Quantum, and Conner, and by Western Digital, which did not have a failure prediction system at the time. The Committee chose IntelliSafe's approach because it gave more flexibility. The resulting jointly developed standard was named S.M.A.R.T.

SMART Information

The technical documentation for SMART is in the AT Attachment standard.^[8]

The most basic information that SMART provides is the SMART status. It provides only two values, "threshold not exceeded" or "threshold exceeded". Often these are represented as "drive OK" or "drive fail" respectively. A "threshold exceeded" value is intended to indicate that there is a relatively high probability that the drive will not be able to honor its specification in the future: that is, it's "about to fail". The predicted failure may be catastrophic or may be something as subtle as inability to write to certain sectors or slower performance than the manufacturer's minimum.

The SMART status does not necessarily indicate the drive's reliability now or in the past. If the drive has already failed catastrophically, the SMART status may be inaccessible. If the drive was experiencing problems in the past, but now the sensors indicate that the problems no longer exist, the SMART status may indicate the drive is OK, depending on the manufacturer's programming.

The inability to read some sectors is not always an indication that the drive is about to fail; one way that unreadable sectors can be created even when the drive is functioning within specification is if the power fails while the drive is writing. Even if the physical disk is damaged in one location so that a sector is unreadable, the disk may be able to use spare space to replace the bad area so that the sector can be overwritten.^[9]

More detail on the health of the drive may be obtained by examining the SMART Attributes. SMART Attributes were included in some drafts of the ATA standard but were removed before the standard became final. The meaning and interpretation of the attributes vary between manufacturers and are sometimes considered a trade secret by the manufacturer. Attributes are discussed further below.^[10]

Drives with SMART may optionally support a number of 'logs'. The error log records information about the most recent errors that the drive has reported back to the host computer. Examining this log may help to determine whether computer problems are disk-related or caused by something else.

A drive supporting SMART may optionally support a number of self-test or maintenance routines, and the results of the tests are kept in the self-test log. The self-test routines can be efficiently used to detect any unreadable sectors on the disk so that they may be restored from backup (for example, from other disks in a RAID). This helps to reduce the risk of a situation where one sector on a disk becomes unreadable, then the backup is damaged, and the data is lost forever.
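As an illustration of how the status, attributes, logs, and self-tests described above are typically reached in practice, the following minimal sketch builds command lines for smartctl, the command-line program from smartmontools (one of the tools compared later in this article). The device path /dev/sda and the helper names smartctl_commands and run_smartctl are assumptions made for this example. The options shown (-H, -A, -l error, -l selftest, -t short) are standard smartctl options, but their output differs between drives and versions, so the sketch returns it unparsed.

```python
import shutil
import subprocess


def smartctl_commands(device):
    """Return smartctl argument lists for the SMART features described above.

    `device` is a placeholder such as "/dev/sda" (an assumption for this sketch).
    """
    return {
        "status": ["smartctl", "-H", device],                    # overall SMART status
        "attributes": ["smartctl", "-A", device],                # vendor-specific attributes
        "error_log": ["smartctl", "-l", "error", device],        # most recent reported errors
        "selftest_log": ["smartctl", "-l", "selftest", device],  # results of past self-tests
        "start_short_test": ["smartctl", "-t", "short", device], # begin a short self-test
    }


def run_smartctl(name, device):
    """Run one of the commands above and return its text output.

    Returns None when smartctl is not installed. Reading SMART data
    usually requires administrator privileges.
    """
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smartctl_commands(device)[name],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag drive problems
    )
    return result.stdout
```

Calling run_smartctl("selftest_log", "/dev/sda") after a self-test has completed would return the log text, in which unreadable sectors found by the test are reported.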

Standards and implementation

Many motherboards will display a warning message when a disk drive approaches failure. Although S.M.A.R.T. is an industry standard among most major hard drive manufacturers,^[11] some issues remain, and individual manufacturers hold much proprietary "secret knowledge" about their specific approaches. As a result, S.M.A.R.T. is not always implemented correctly on many computer platforms, owing to the absence of industry-wide software and hardware standards for S.M.A.R.T. data interchange.^{[citation needed]}

From a legal perspective, the term "S.M.A.R.T." refers only to a signaling method between internal disk drive electromechanical sensors and the host computer — thus a disk drive manufacturer could include a sensor for just one physical attribute and then advertise the product as S.M.A.R.T. compatible. For example, a drive manufacturer might claim to support S.M.A.R.T. but not include a temperature sensor, which the customer might reasonably expect to be present.

Some S.M.A.R.T.-enabled motherboards and related software may not communicate with certain S.M.A.R.T.-capable drives, depending on the type of interface. Few external drives connected via USB and FireWire correctly send S.M.A.R.T. data over those interfaces. With so many ways to connect a hard drive (e.g. SCSI, Fibre Channel, ATA, SATA, SAS, SSA), it is difficult to predict whether S.M.A.R.T. reports will function correctly.

Even on hard drives and interfaces that support it, S.M.A.R.T. data may not be reported correctly to the computer's operating system. Some disk controllers can duplicate all write operations on a secondary "backup" drive in real time, a feature known as "RAID mirroring". However, many programs designed to analyze changes in drive behavior and relay S.M.A.R.T. alerts to the operator do not function when a computer system is configured for RAID. Under normal operating conditions the RAID subsystem usually does not permit the computer to 'see' (or directly access) the individual physical drives, only the logical volumes.
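Some tools work around these interface and RAID limitations by letting the operator name the device type explicitly. The sketch below shows how this might look with smartctl from smartmontools, whose -d option accepts types such as sat (SCSI/ATA Translation, used by some USB bridges), 3ware,N, and cciss,N (Compaq/HP controllers), where N is the disk's position behind the controller. The function name smartctl_device_args, the connection labels, and any device path passed in are assumptions for illustration. Whether a given controller or bridge actually passes S.M.A.R.T. data through still depends on the hardware and driver.

```python
def smartctl_device_args(kind, device, disk=0):
    """Build a smartctl command line for a drive behind a bridge or RAID controller.

    `kind` is a label invented for this sketch; `device` is the controller or
    disk device node, and `disk` is the drive's index behind a RAID controller.
    """
    if kind == "direct":      # drive directly visible to the operating system
        return ["smartctl", "-a", device]
    if kind == "usb_sat":     # bridge that supports SCSI/ATA Translation
        return ["smartctl", "-a", "-d", "sat", device]
    if kind == "3ware":       # disk number `disk` behind a 3ware controller
        return ["smartctl", "-a", "-d", f"3ware,{disk}", device]
    if kind == "cciss":       # disk number `disk` behind a Compaq/HP controller
        return ["smartctl", "-a", "-d", f"cciss,{disk}", device]
    raise ValueError(f"unknown connection kind: {kind}")
```

For example, smartctl_device_args("3ware", "/dev/twa0", 2) produces the arguments for querying the third physical disk behind a 3ware controller rather than the logical volume.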

On the Windows platform, many programs designed to monitor and report S.M.A.R.T. information will only function under an administrator account. At present S.M.A.R.T. is implemented individually by manufacturers, and while some aspects are standardized for compatibility, others are not.

Attributes

Each drive manufacturer defines a set of attributes and selects threshold values which attributes should not pass under normal operation. Each attribute has a raw value whose meaning is entirely up to the drive manufacturer, and a normalized value that ranges from 1 to 253 (1 representing the worst case and 253 representing the best). Depending on the manufacturer, a value of 100 or 200 will often be chosen as the "normal" value.
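The relationship between normalized values, thresholds, and the overall status can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the Attribute class and the sample numbers are hypothetical, and the rule that an attribute has failed when its normalized value is less than or equal to its threshold is the convention assumed here.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    attr_id: int    # attribute ID, e.g. 5 for Reallocated Sectors Count
    name: str
    value: int      # normalized value, 1 (worst) to 253 (best)
    threshold: int  # manufacturer-chosen limit the value should stay above
    raw: int        # raw value; its meaning is up to the manufacturer


def failed_attributes(attributes):
    """Return the attributes whose normalized value has reached the threshold."""
    return [a for a in attributes if a.value <= a.threshold]


def smart_status(attributes):
    """Reduce the attribute list to the two-valued SMART status."""
    if failed_attributes(attributes):
        return "threshold exceeded"
    return "threshold not exceeded"


# Hypothetical readings, for illustration only.
sample = [
    Attribute(5, "Reallocated Sectors Count", value=100, threshold=36, raw=0),
    Attribute(10, "Spin Retry Count", value=90, threshold=97, raw=12),
]
```

With these sample readings, only Spin Retry Count has fallen to or below its threshold, so smart_status(sample) returns "threshold exceeded".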

Manufacturers that have supported one or more S.M.A.R.T. attributes in various products include: Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor, Toshiba, Western Digital and ExcelStor Technology.

Known S.M.A.R.T. attributes


The following chart lists some S.M.A.R.T. attributes and the typical meaning of their raw values. Normalized values are always mapped so that higher values are better (with only very rare exceptions such as the "Temperature" attribute on certain Seagate drives^[12]), but higher raw attribute values could be better or worse depending on the attribute and manufacturer. As manufacturers do not necessarily agree on precise attribute definitions and measurement units, the following list of attributes should be regarded as a general reference only.

As an example, the "Reallocated Sectors Count" attribute's normalized value decreases as the number of reallocated sectors increases. In this case, the attribute's raw value will often indicate the actual number of sectors that were reallocated, although vendors are in no way required to adhere to this convention.

Legend
	Higher raw value is better		Lower raw value is better
Critical		Potential indicators of imminent electromechanical failure

ID	Hex	Attribute name	Better	Description
01	01	Read Error Rate		Indicates the rate of hardware read errors that occurred when reading data from a disk surface. A non-zero value indicates a problem with either the disk surface or read/write heads. Note that Seagate drives often report a high raw value even when new; this does not by itself mean the drive is failing.
02	02	Throughput Performance		Overall (general) throughput performance of a hard disk drive. If the value of this attribute is decreasing there is a high probability that there is a problem with the disk.
03	03	Spin-Up Time		Average time of spindle spin up (from zero RPM to fully operational [millisec]).
04	04	Start/Stop Count		A tally of spindle start/stop cycles.
05	05	Reallocated Sectors Count		Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks this sector as "reallocated" and transfers data to a special reserved area (spare area). This process is also known as remapping and "reallocated" sectors are called remaps. This is why, on modern hard disks, "bad blocks" cannot be found while testing the surface — all bad blocks are hidden in reallocated sectors. However, the more sectors that are reallocated, the more read/write speed will decrease. A decrease in the attribute value indicates bad sectors.
06	06	Read Channel Margin		Margin of a channel while reading data. The function of this attribute is not specified.
07	07	Seek Error Rate		Rate of seek errors of the magnetic heads. Seek errors arise if there is a failure in the mechanical positioning system, servo damage, or thermal widening of the hard disk. An increase in seek errors indicates a worsening condition of the disk surface and the mechanical subsystem.
08	08	Seek Time Performance		Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.
09	09	Power-On Hours (POH)		Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state.
10	0A	Spin Retry Count		Count of spin start retries. This attribute stores the total count of spin start attempts made to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
11	0B	Recalibration Retries		This attribute indicates the number of times recalibration was requested (under the condition that the first attempt was unsuccessful). A decrease of this attribute value is a sign of problems in the hard disk mechanical subsystem.
12	0C	Device Power Cycle Count		This attribute indicates the count of full hard disk power on/off cycles.
13	0D	Soft Read Error Rate		Uncorrected read errors reported to the operating system. If the value is non-zero, you should back up your data.
190	BE	Airflow Temperature (WDC)		Airflow temperature on Western Digital HDs (Same as temp. (C2), but current value is 50 less for some models. Marked as obsolete.)
190	BE	Temperature Difference from 100		Value is equal to (100 - temp °C), allowing the manufacturer to set a minimum threshold which corresponds to a maximum temperature. (Seagate only?)^{[citation needed]} Reported as present on the following drives:^{[citation needed]} Seagate ST910021AS; Seagate ST3802110A (2007-02-13); Seagate ST980825AS (2007-04-05); Seagate ST3320620AS (2007-04-23 and 2008-06-12); Seagate ST3500641AS (2007-06-12); Seagate ST3250824AS (2007-08-07); Seagate ST31000340AS (2008-02-05); Seagate ST3160211AS (2008-06-12); Seagate ST3400620AS (2008-06-12); Samsung HD501LJ, under the name "Airflow Temperature" (2008-03-02).
191	BF	G-sense error rate		Frequency of errors resulting from impact loads^{[citation needed]}
192	C0	Power-off Retract Count		Number of times the heads are loaded off the media. Heads can be unloaded without actually powering off.^{[citation needed]} (or Emergency Retract Cycle count - Fujitsu)^{[citation needed]}
193	C1	Load/Unload Cycle		Count of load/unload cycles into head landing zone position.^{[citation needed]}
194	C2	Temperature		Current internal temperature.
195	C3	Hardware ECC Recovered		Time between ECC-corrected errors.
196	C4	Reallocation Event Count		Count of remap operations. The raw value of this attribute shows the total number of attempts to transfer data from reallocated sectors to a spare area. Both successful & unsuccessful attempts are counted.
197	C5	Current Pending Sector Count		Number of "unstable" sectors (waiting to be remapped). If the unstable sector is subsequently written or read successfully, this value is decreased and the sector is not remapped. Read errors on the sector will not remap it; the sector is only remapped on a failed write attempt. This can be problematic to test because cached writes will not remap the sector; only direct I/O writes to the disk will.
198	C6	Uncorrectable Sector Count		The total number of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
199	C7	UltraDMA CRC Error Count		The number of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).
200	C8	Write Error Rate / Multi-Zone Error Rate		The total number of errors when writing a sector.
201	C9	Soft Read Error Rate		Number of off-track errors. If non-zero, make a backup.
202	CA	Data Address Mark errors		Number of Data Address Mark errors (or vendor-specific).^{[citation needed]}
203	CB	Run Out Cancel		Number of ECC errors
204	CC	Soft ECC Correction		Number of errors corrected by software ECC^{[citation needed]}
205	CD	Thermal Asperity Rate (TAR)		Number of thermal asperity errors.^{[citation needed]}
206	CE	Flying Height	?	Height of heads above the disk surface.^{[citation needed]}
207	CF	Spin High Current	?	Amount of high current used to spin up the drive.^{[citation needed]}
208	D0	Spin Buzz	?	Number of buzz routines to spin up the drive^{[citation needed]}
209	D1	Offline Seek Performance	?	Drive’s seek performance during offline operations^{[citation needed]}
220	DC	Disk Shift		Distance the disk has shifted relative to the spindle (usually due to shock). Unit of measure is unknown.
221	DD	G-Sense Error Rate		The number of errors resulting from externally-induced shock & vibration.
222	DE	Loaded Hours	?	Time spent operating under data load (movement of magnetic head armature)^{[citation needed]}
223	DF	Load/Unload Retry Count	?	Number of times head changes position.^{[citation needed]}
224	E0	Load Friction		Resistance caused by friction in mechanical parts while operating.^{[citation needed]}
225	E1	Load/Unload Cycle Count		Total number of load cycles^{[citation needed]}
226	E2	Load 'In'-time	?	Total time of loading on the magnetic heads actuator (time not spent in parking area).^{[citation needed]}
227	E3	Torque Amplification Count		Number of attempts to compensate for platter speed variations^{[citation needed]}
228	E4	Power-Off Retract Cycle		The number of times the magnetic armature was retracted automatically as a result of cutting power.^{[citation needed]}
230	E6	GMR Head Amplitude	?	Amplitude of "thrashing" (distance of repetitive forward/reverse head motion)^{[citation needed]}
231	E7	Temperature		Drive Temperature
240	F0	Head Flying Hours	?	Time while head is positioning^{[citation needed]}
250	FA	Read Error Retry Rate		Number of errors while reading from a disk
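Two small conversions are implicit in the chart above: the Hex column is simply the decimal ID written as two hexadecimal digits, and attribute 190, where it is reported as "Temperature Difference from 100", encodes a temperature as 100 minus degrees Celsius. The helper names below are assumptions for this sketch, and whether a particular drive uses the "difference from 100" encoding is vendor-specific, as the chart notes.

```python
def attribute_hex(attr_id):
    """Format a decimal attribute ID as the two-digit hex code used in the chart."""
    return format(attr_id, "02X")


def temperature_from_difference(value):
    """Recover degrees Celsius from a 'Temperature Difference from 100' value."""
    return 100 - value


def difference_from_temperature(celsius):
    """Encode degrees Celsius as a 'Temperature Difference from 100' value."""
    return 100 - celsius
```

For instance, attribute_hex(194) gives "C2", matching the Temperature row, and a reported value of 62 corresponds to 38 °C. Because a hotter drive yields a lower value, a minimum threshold on the value acts as a maximum temperature.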

Threshold Exceeds Condition


Threshold Exceeds Condition (TEC) is an estimated date on which a critical drive attribute will reach its threshold value. When Drive Health software reports a "Nearest T.E.C.", it should be treated as a predicted failure date.

The date is predicted from the rate of attribute change: how many points per month the value is decreasing or increasing. This rate is recalculated automatically, for each attribute individually, whenever a S.M.A.R.T. attribute changes. Note that TEC dates are not guarantees; hard drives can last much longer, or fail much sooner, than the date given by a TEC.
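One plausible way to arrive at such a date is a straight-line extrapolation from two readings of an attribute's normalized value. The sketch below is an assumption about how a tool might do this, not the method of any particular product. The function name estimate_tec is hypothetical, it works in days rather than months, and it only handles normalized values that fall toward the threshold.

```python
from datetime import date, timedelta


def estimate_tec(first_date, first_value, last_date, last_value, threshold):
    """Linearly extrapolate when a normalized value will reach its threshold.

    Values are integers. Returns a date, or None when the readings show no
    downward trend to extrapolate from.
    """
    if last_value <= threshold:
        return last_date                     # threshold already reached
    elapsed_days = (last_date - first_date).days
    drop = first_value - last_value          # points lost over the interval
    if elapsed_days <= 0 or drop <= 0:
        return None                          # no usable downward trend
    remaining = last_value - threshold       # points left before the threshold
    # Integer ceiling division: round the remaining time up to whole days.
    days_left = -(-remaining * elapsed_days // drop)
    return last_date + timedelta(days=days_left)
```

With hypothetical readings of 100 on 2008-01-01 and 94 on 2008-03-01 and a threshold of 36, the value loses 6 points in 60 days, so the sketch projects the threshold being reached 580 days after the second reading. A single new reading can shift the estimate substantially, which is one reason such dates are not guarantees.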

Comparison of S.M.A.R.T. tools

Some popular tools for reading S.M.A.R.T. data are listed below.

	smartmontools	HDAT2	DriveSitter	HDD Health	Active Smart	SpeedFan	SMARTReporter	HDTune	Norton System Doctor	SMART Utility	DiskCheckup	Hard Disk Sentinel
Operating System	Windows (native or Cygwin) Linux Darwin (Mac OS X) Free/Open/NetBSD Solaris OS/2	DOS	Windows	Windows	Windows	Windows	Mac OS X	Windows	Windows	Mac OS X	Windows	Windows, DOS, Linux
price	Open Source	Freeware	29,69 $	Freeware	18,46 €	Freeware	Open Source	Freeware	proprietary	20.00 $	Freeware	Windows: from 18,00 €, DOS/Linux: Freeware
Trial version can be used	-	-	30 days	-	21 days	-	-	-	-	30 days/5 launches	-	Unlimited
Target group	professionals	professionals	advanced	beginners to advanced	beginners to advanced	beginners to advanced	beginners	beginners to advanced	beginners	beginners to advanced	beginners to advanced	beginners to advanced
User Interface	Commandline, optional daemon or service	text menu	graphical	graphical	graphical	graphical	graphical	graphical	graphical	graphical	graphical	Windows: graphical, DOS/Linux: commandline
connection	(S)ATA, SCSI, SAT	(S)ATA	(S)ATA	(S)ATA	(S)ATA, SCSI, USB	(S)ATA, SCSI	(S)ATA	(S)ATA	(S)ATA, SCSI, USB	(S)ATA, SCSI, SAT	(S)ATA, SCSI	(S)ATA, SCSI, SAT, USB
reads hard discs on RAID controllers:	3ware (Linux, FreeBSD, Windows), Compaq/HP (Linux, FreeBSD), and HighPoint (only Linux)	yes	-	-	announced^{[citation needed]}	-	-	-	?	3ware, Compaq/HP OS X Software RAID	Software RAID only	Software RAID only
shows error log	yes	yes	yes	yes	no	no	no	no	no	yes	yes	yes
self testing	yes (also scheduled)	yes	yes	yes	no	no	no	no	no	announced^{[citation needed]}	yes	yes
prediction of failure	no	no	yes	yes	yes	yes	no	no	no	yes	yes	yes
notification at	selectable parameter changes, threshold, temperature	-	selectable parameter changes, threshold, temperature	every parameter change, temperature	threshold, temperature	selectable parameter changes, threshold, temperature	threshold	-	threshold (for every single medium)	announced^{[citation needed]}	temperature	temperature, parameter changes, threshold, new problem found, low disk space
notification by	window (only Windows), e-mail, system log, run a certain command	-	window, sound, e-mail, network message, system log, run a certain command	window, sound, e-mail, network message	window, sound, e-mail, network message	window, sound, e-mail, run a certain command	window, e-mail, run a certain command	-	taskbar symbol, sound, administrative message	announced^{[citation needed]}	window, e-mail	taskbar, window, e-mail, network message, sound alert (optionally repeating), run a certain command, run automatic backup projects, shutdown/hibernate
vendor	smartmontools	Lubomir Cabla	Oliver Marr	PANTERASoft	Ariolic ATA / SCSI / USB	Alfredo Milani Comparetti	Julian Mayer	EFD Software	Symantec	Volitans Software	Passmark Software	H.D.S. Hungary
notes	Possibility of AAM and further parameters, surface testing	highly scalable, can be set to activate hibernation mode on critical temperature		can be set to activate hibernation mode on critical temperature		offers online drive analysis, monitors PC temperatures	benchmarks and surface testing	individual configuration for every medium, Interface for Disc Doktor/chkdsk: surface testing, complete testing at restart		based on smartmontools		acoustic management, highly scalable scheduled and automatic projects upon failure, offers scheduled and manual disk testing, performance and logical disk information

References

S.M.A.R.T. attribute meaning. PalickSoft. Retrieved on February 3, 2006.
Zbigniew Chlondowski. S.M.A.R.T. Site: attributes reference table. S.M.A.R.T. Linux. Retrieved on January 17, 2007.
S.M.A.R.T. attributes meaning. Ariolic Software, Ltd (2007). Retrieved on October 26, 2007.
Can we believe S.M.A.R.T. ? - How hard disk S.M.A.R.T. really works. H.D.S. Hungary (2007). Retrieved on June 4, 2008.

^[1] Seagate statement on enhanced smart attributes
^[2] http://smartlinux.sourceforge.net/smart/faq.php?#2 ("How does S.M.A.R.T. work?")
^[3] Failure Trends in a Large Disk Drive Population (Conclusion section) by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043
^[4] pctechguide's page on S.M.A.R.T. (2003)
^[5] IBM Announcement Letter No. ZG92-0289 dated September 1, 1992
^[6] Seagate - The evolution of S.M.A.R.T.
^[7] Compaq. IntelliSafe. Technical Report SFF-8035, Small Form Factor Committee, January 1995.
^[8] Stephens, Curtis E, ed. (December 11, 2006), Information technology - AT Attachment 8 - ATA/ATAPI Command Set (ATA8-ACS), working draft revision 3f, ANSI INCITS, pp. 198–213, 327–344, <http://www.t13.org/Documents/UploadedDocuments/docs2006/D1699r3f-ATA8-ACS.pdf>
^[9] Hitachi Global Storage Technologies (19 September 2003), Hard Disk Drive Specification: Hitachi Travelstar 80GN, revision 2.0, Hitachi Document Part Number S13K-1055-20, <http://www.hitachigst.com/tech/techlib.nsf/techdocs/85CC1FF9F3F11FE187256C4F0052E6B6/$file/80GNSpec2.0.pdf>
^[10] Hatfield, Jim (September 30, 2005), SMART Attribute Annex, e05148r0, <http://www.t13.org/Documents/UploadedDocuments/docs2005/e05148r0-ACS-SMARTAttributesAnnex.pdf>
^[11] pctechguide: "Industry acceptance of PFA technology eventually led to SMART (Self-Monitoring, Analysis and Reporting Technology) becoming the industry-standard reliability prediction indicator..."
^[12] smartmontools FAQ ("Attribute 194 (Temperature Celsius) behaves strangely on my Seagate disk")

External links
