Talk:Self-Monitoring, Analysis, and Reporting Technology

SMART Attributes List

Some descriptions of the SMART attributes are clearly incorrect. "Load" refers to the operation of moving the heads from parked to unparked, and the number of times this happens, not to the drive seeking. "GMR Head Amplitude" refers to the signal amplitude from the read head, not to any movement. The "Read Channel Margin" description is content-free.

Spin Retry Count

This description does not appear to correspond to actual data values in recent Western Digital and Seagate drives. Seagate drives post 100-100-97-0-OK, which HDTune marks with a yellow (warning) bar, while WD drives post 100-100-51-0-OK, which HDTune leaves unmarked. (Values correspond to Current-Worst-Threshold-Data-Status.) With respect to these two manufacturers, the description makes no sense. —The preceding unsigned comment was added by FUBARinSFO (talk • contribs) 00:39, 2 May 2007 (UTC).

Reallocated sectors

Please make it easier for me to use this information. Please enhance the table entries.

Could we get practical and say: this is a down-counter, and when it reaches zero there is no way to deal with additional sectors whose read errors are too severe to be fixed with error-correcting codes.

In the discussion above the table, tell me: if there is no more space to absorb a sector needing reallocation, does my drive now pass errors up to the operating system's file system, which then reports read errors and/or other file unavailability?

In the table, you can make room by erasing "the more sectors that are reallocated, the more read/write speed will decrease". This is true, but it hardly matters to a user who is suffering data loss, possible data corruption, and potential boot failure, which blocks access to everything.

My smartctl output under WinXP or Knoppix shows, for Reallocated_Sector_Ct:
VALUE 1
WORST 1
THRESHOLD 63 (Please confirm that a number less than 63 is bad -- but which number?)
This "pre-failure" category of parameter is UPDATED "always".
WHEN_FAILED is "FAILING_NOW". How can I tell? Because VALUE 1 is less than THRESHOLD 63?
The "RA" column (I do not know what this is. Do you? Raw counts?) is 12. 12 is not 1 and 12 is not 63. I wonder what 12 is.

This page has not yet evolved into a practical guide and it still lacks an accessible exposition of the topic's salient points. Nevertheless, and IMHO, it is already far ahead of most pages and posts on the Internet. So let's not stop now! Jerry-va 01:27, 29 May 2006 (UTC)jerry-va

The final column is RAW_VALUE, and 12 for the Reallocated_Sector_Ct attribute means that 12 bad sectors have been remapped. You should replace this disk soon, as that number will only rise, and the higher it gets, the more data you're going to lose. --Error28 12:03, 4 September 2006 (UTC)
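
To make the relationship between those columns concrete, here is a minimal sketch (Python, using the figures quoted above, and assuming the usual smartmontools convention): RAW_VALUE is the literal count of remapped sectors, while the normalized VALUE is a vendor-scaled health score that triggers FAILING_NOW once it drops to or below the threshold.

    # Illustrative only: the figures quoted above, read the way smartmontools
    # usually reports them (exact semantics are vendor-defined).
    value, worst, thresh, raw = 1, 1, 63, 12

    # RAW_VALUE is the literal count: 12 sectors have already been remapped.
    print(f"{raw} sectors remapped so far")

    # The normalized VALUE is a vendor-scaled health score (higher = better).
    # A pre-fail attribute is flagged once it drops to or below the threshold.
    if value <= thresh:
        print("FAILING_NOW: normalized value is at or below the threshold")
    elif worst <= thresh:
        print("Failed in the past: WORST once dropped to the threshold")
    else:
        print("OK")

So in the output quoted above, VALUE 1 against THRESHOLD 63 is what produces the FAILING_NOW flag, and the 12 in the last column is simply the number of sectors already remapped.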

It comes down to money. If you have reallocated sectors you should replace your disk. In my experience you can go much longer with reallocated sectors on desktop drives; once reallocated sectors show up in laptops, they increase fast. Josiah 20:02, 12 October 2007 (UTC)

"This is why, on modern hard disks, "bad blocks" cannot be found while testing the surface — all bad blocks are hidden in reallocated sectors. " I don't think that's quite accurate, based on my understanding. When you write to a bad sector, sure, it gets silently reallocated. If you read the sector and the data is bad but corrected by ECC, the drive should correct it and copy it to a reallocated sector. But if the data is uncorrectable, an error must be returned, since it would be unacceptable to return bogus data. Moreover, it must continue to give an error on a future read, until the sector is rewritten. So you can sometimes find bad sectors by reading the entire disk. "Testing the surface" is confusingly vague. 76.254.84.64 07:30, 31 October 2007 (UTC)

I find this table entry confusing as well -- It says in the table "A decrease in the attribute value indicates bad sectors", *but* the 'Better' column indicates that a decrease in this number is a *good* thing?? It seems like the arrow should be changed for this table entry. If this is really an indicator of the number of sectors potentially available for reallocation (in the event of a bad sector being detected), then it would make sense that a higher number is better, a decrease is bad. When using applications such as HD Tune, under the 'Health' tab it tells me that my particular drive has 100 as the current count for this field, and that 36 is the threshold. It does not seem to see any problem with this and it is telling me that it is OK -- so it seems to coincide... ChrisTracy (talk) 18:09, 15 May 2008 (UTC)

Temperature sensor

The section on temperature and temperature sensors is opinionated and somewhat incorrect / not up to date. All hard drives from 1998 onward include a temperature sensor. The reason: all modern hard drives use GMR (giant magnetoresistive) heads, which require very accurate temperature measurement to read the data back (the difference between reading back a 0 and a 1 is about the same order of magnitude as the effect of a 0.1 degree Celsius change in the GMR head).

Also, the temperature failure mode is not necessarily cumulative.

Curious sentence

SMART is a system used to kill the drive when the warranty is up. —Clarknova 03:35, 28 February 2006 (UTC)

Removed the following curious sentence from "working modality".

Manufacturing companies which claim to support S.M.A.R.T. but withhold specific sensor information on individual products include Seagate, [...]

[...], indeed! What the frip. - 194.89.3.244 17:56, 28 February 2006 (UTC)

Read Error Rate description incorrect

Elsewhere I have read that a high value for Read Error Rate is good, and that the attribute value decreases as the read error rate increases.

Consistent with this, the two SMART monitoring tools I have used alert the user when the Read Error Rate attribute value falls below a threshold.

This description deserves accuracy and careful explanation perhaps more than any other, since this attribute is so critical.

-- I think it means 'time between read errors'; the smaller the number, the higher the rate, but whether it's seconds, hours, or fortnights, I couldn't begin to guess.

Perhaps a more logical definition would be 'number of successful reads between errors'? --217.173.195.210 09:23, 14 August 2007 (UTC)

Frustration with SMART

I'd like it if the table spelled out what the "good" and "bad" values of the attributes are.

The general rule is that higher is better than lower, except in the case of temperature. The specific thresholds of "OK" and "failing" are up to the manufacturers to specify. Most of the numbers involved are arbitrary and defined separately by the manufacturers. GreenReaper 16:07, 24 August 2006 (UTC)
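
As a rough illustration of that rule, the sketch below (Python; it assumes smartmontools is installed and that /dev/sda is the disk of interest, so adjust both for your system) walks the attribute table printed by smartctl -A and flags anything whose normalized value has sunk to its manufacturer-defined threshold. A threshold of 0 conventionally means the attribute is informational and never fails.

    # Rough sketch: list attributes whose normalized value has reached the
    # manufacturer-set threshold. Assumes smartmontools is installed and that
    # /dev/sda is the disk you care about -- adjust both for your system.
    import subprocess

    out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                         capture_output=True, text=True).stdout

    for line in out.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the columns are
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit():
            name, value, thresh = fields[1], int(fields[3]), int(fields[5])
            state = "at/below threshold!" if thresh and value <= thresh else "ok"
            print(f"{name:<26} value={value:3d} thresh={thresh:3d}  {state}")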

More frustration...

I also feel that nobody tells you some useful (average, maybe IQ 100, no IT degree) human-readable information about your hard disk. Something like: "Your hard disk /dev/hda is 2.8 years old; the probability that it will survive this month is 98% (suggested replacement value: 96%; suggested backup value: 99%)." But I'm confused by 1000 different values. How bad are they really? Where should the values be (see comment above)? It does not really help to make a business decision to replace or not to replace the drive. Can someone please shed some light on this? Can the smartmontools developers please think of the CTO's business decision of replacing or not replacing a disk? And some useful information for the home user. THANKS -- Michael Janich 09:15, 31 July 2006 (UTC)

Most hard disks fail within the first two years; if a disk doesn't fail within those years, it is a good idea to keep it for another 3 years.

If you plot the failure rate of hard disks, it starts off very high, reaches its lowest point around two years, and then slowly climbs back up to the rate at which it started. Hope that helps. Hqduong 08:10, 5 December 2006 (UTC)

Google published a study on hard disks that claimed (based on memory, not citation) that 1) only half the disks that failed had something significant in their SMART readouts, and 2) only half the disks with something significant in their SMART readings actually failed. So after all the hoopla, it may not be that useful after all. --Alvestrand 07:39, 19 March 2007 (UTC)
Hard Disk Sentinel software can display this information in an understandable way. It gives a textual description of the hard disks and displays the real number of problems found, so you can get some idea of the real status instead of seeing just some numbers/values. Because threshold + value pairs and T.E.C. dates are not really able to predict hard disk failures, this software uses a completely different method to detect and display real hard disk problems found on IDE/SATA/USB/SCSI hard disks. It works under Windows, DOS and Linux. —Preceding unsigned comment added by 87.229.50.242 (talk) 08:12, 4 June 2008 (UTC)

SMART and RAID

Any idea if SMART can still be used on HDDs included in a RAID array? --Evolve2k 05:06, 7 January 2007 (UTC)

I have seen some motherboards with hardware RAID support/PCI RAID expansion cards that have a BIOS/firmware capable of retrieving and displaying SMART data. No idea if there's anything out there that lets you do this in software though. SMART is a very mysterious technology IMO. --86.138.51.21 08:20, 26 January 2007 (UTC)
I'm building a RAID array with four Seagate ST3320620AS (7200.10 320GB) drives in it. Once I get the second pair of drives I'll let you guys know. Using NVIDIA MediaShield on a P5N32-E SLI Plus. I can also confirm that BE is definitely a temperature sensor on that drive, btw. 66.146.62.42 22:23, 10 May 2007 (UTC)
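
On the software side: smartmontools can often query individual disks behind a hardware RAID controller by passing a controller-specific device type to smartctl with -d, though support depends on the controller and driver. A hedged sketch in Python (the device path and the megaraid,0 type are examples only; check the smartctl man page for your controller):

    # Sketch: asking smartctl for a disk that sits behind a RAID controller.
    # "/dev/sda" and "megaraid,0" are examples only -- the right device path
    # and -d type depend on your controller (see the smartctl man page).
    import subprocess

    cmd = ["smartctl", "-a", "-d", "megaraid,0", "/dev/sda"]
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)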

Merging in Threshold Exceeds Condition

Since the mergeto tag of the Threshold Exceeds Condition article says to discuss the subject here:

  • Merge. My opinion. --Alvestrand 22:16, 13 January 2007 (UTC)

Background

According to the cited Google study, SMART can predict about 40-60% of all drive failures, depending on the monitored attributes. The stated 30% taken from some FAQ might be too pessimistic here.

—The preceding unsigned comment was added by Michi cc (talkcontribs) 17:38, 21 April 2007 (UTC).

Attribute list is confusing

Some of the arrows in the attribute list don't appear to be correct. "Power On Hours" is marked with an up arrow -- I would think that a *lower* number of operating hours would be considered better, not a higher one. Same thing with calibration retries. It's also not clear in many of the descriptions whether the values being referred to are the raw values, normalized values, worst values, threshold values, or something else, making the table even more unintelligible to someone unfamiliar with SMART. All of this should be made much clearer. Travis Evans 11:39, 16 June 2007 (UTC)

As of today, these issues now appear to be largely improved. Travis Evans (talk) 14:38, 16 December 2007 (UTC)

Set Load/Unload Cycle Count with a down arrow: when the head unloads/reloads it creates wear on the servo, and the read/write head has a possibility of failing to load if it isn't loaded or unloaded completely. —Preceding unsigned comment added by 64.228.219.208 (talk) 03:59, 11 October 2007 (UTC)

Contradictory statement about higher vs lower

Note that the attribute values are always mapped to the range of 1 to 253 in a way that means higher values are better.

This is then followed by a chart which describes whether it is better to have lower or higher values, seeming to contradict the above sentence. Can someone please clarify or correct? Ham Pastrami (talk) 19:20, 24 November 2007 (UTC)

I believe that the reason for this apparent conflict is that the chart refers to the “raw” attribute values rather than the “normalized” ones. For normalized values, the statement in the article that higher numbers are “always” better is almost correct (I'll explain why I say “almost” in a moment), but the raw values can follow any rule that the drive manufacturer wants.
The biggest problem with the article, I think, is that it doesn't explain clearly enough that there are actually several different values involved for each attribute. The chart is totally unclear about it. The chart is also problematic because some of the attributes the chart describes appear to function in a totally different (even the exact opposite) manner on certain drives.
The statement “...The attribute values are always mapped ...in a way...that higher values are better” also isn't true in the strictest sense, because I know of some drives (such as mine) which actually indicate the normalized temperature value directly in Celsius (e.g., a value of 40 means 40°C), which means that for this attribute, higher values are actually worse. This is likely a rare exception, though.
I may attempt to greatly clarify the article myself some time if I get a chance, but if anyone else wants to do it right now instead, feel free to go ahead and do so. Travis Evans (talk) 21:58, 5 December 2007 (UTC)
Okay, I just edited the Attributes section in the hope that it will now make much more sense. Travis Evans (talk) 14:35, 16 December 2007 (UTC)
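
For readers still puzzled by the raw-versus-normalized distinction, here is a toy illustration in Python. The scaling is entirely made up, since real formulas are vendor-specific and undocumented; the point is only that the two kinds of numbers answer different questions.

    # Toy example only -- real normalization formulas are vendor-specific and
    # undocumented. It just shows that raw and normalized are different numbers.

    def normalized_reallocated(raw_remapped, reserve_size=2000):
        """Hypothetical scaling: 100 with no spares used, falling toward 1."""
        used = min(raw_remapped / reserve_size, 1.0)
        return max(1, round(100 * (1 - used)))

    def normalized_temperature(raw_celsius):
        """Some drives simply report the temperature itself, so lower is better."""
        return raw_celsius

    print(normalized_reallocated(0))     # 100 -> healthy
    print(normalized_reallocated(500))   # 75  -> spare sectors being consumed
    print(normalized_temperature(40))    # 40  -> i.e. 40 degrees C; higher is worse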

More info on Selftest specifications please

A good article. You can get SMART drives to initiate either a short or a long self-test (managed by the drive itself). But what exactly does the SMART specification require a drive to do during these tests? Robin April 2008
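
Partial answer, as far as I can tell: the ATA specification defines the commands for starting the tests and reading back their results, but leaves most of the actual test content vendor-defined. The short test is commonly described as a quick check of the electronics, the servo and a small portion of the surface, while the extended ("long") test reads the entire surface. In practice the tests are driven roughly like this (a sketch using smartmontools; /dev/sda is an example path):

    # Sketch: launching a drive self-test and reading back the result with
    # smartmontools. "/dev/sda" is an example path; adjust for your machine.
    import subprocess

    dev = "/dev/sda"

    # Ask the drive to run its short self-test in the background
    # (use "long" for the extended, whole-surface test).
    subprocess.run(["smartctl", "-t", "short", dev], check=True)

    # ...a couple of minutes later, the self-test log shows the outcome:
    log = subprocess.run(["smartctl", "-l", "selftest", dev],
                         capture_output=True, text=True).stdout
    print(log)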