InfoDoc ID | Synopsis | Date |
49203 | Sun StorEdge[TM] Axx00:InfoDoc-Tech Tip:Finding a Failed Disk that RM6 Reports as Optimal | 3 Dec 2002 |
Status | Issued |
Description |
How To Find A Bad Drive With SCSI Analysis

First, confirm that this is indeed the problem. Typically, RM6 reports an Optimal module while the host scrolls errors against various blocks on a LUN. In this example, an A3500FC array attached to a server is scrolling read errors against LUN 8. Note that different blocks (12658240 and 40903280) are being reported.

    Nov 27 09:39:35 cfs1000 scsi: [ID 107833 kern.notice]  Vendor: Symbios  Serial Number: : : 6t
    Nov 27 09:39:35 cfs1000 scsi: [ID 107833 kern.notice]  Sense Key: Hardware Error
    Nov 27 09:39:35 cfs1000 scsi: [ID 107833 kern.warning] WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
    Nov 27 09:39:36 cfs1000  Error for Command: read(10)  Error Level: Retryable
    Nov 27 09:39:36 cfs1000 scsi: [ID 107833 kern.notice]  Requested Block: 40903280  Error Block: 40903280
    Nov 27 09:39:37 cfs1000 scsi: [ID 107833 kern.notice]  Vendor: Symbios  Serial Number: : : 6t
    Nov 27 09:39:37 cfs1000 scsi: [ID 107833 kern.notice]  Sense Key: Hardware Error
    Nov 27 09:39:38 cfs1000 scsi: [ID 107833 kern.warning] WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
    Nov 27 09:39:38 cfs1000  Error for Command: read(10)  Error Level: Retryable
    Nov 27 09:39:38 cfs1000 scsi: [ID 107833 kern.notice]  Requested Block: 12658240  Error Block: 12658240
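The symptom described above (one LUN, hardware errors, differing error blocks) can be checked across a long messages file with a short script. The following is a minimal sketch, not part of RM6 or any Sun tool: the function names error_blocks_by_device and lun_of are hypothetical, the regular expressions are written only against the message layout shown in this example, and reading the trailing ",8" of the ssd device path as the LUN number (in hex) is an assumption.

```python
import re
from collections import defaultdict

# Patterns written against the message layout in this document's example only.
WARNING_RE = re.compile(r"WARNING: (\S+) \((\w+)\):")
BLOCK_RE = re.compile(r"Error Block: (\d+)")

# Lines copied from the log excerpt in this document.
SAMPLE_LOG = """\
Nov 27 09:39:35 cfs1000 scsi: [ID 107833 kern.warning] WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
Nov 27 09:39:36 cfs1000 Error for Command: read(10) Error Level: Retryable
Nov 27 09:39:36 cfs1000 scsi: [ID 107833 kern.notice] Requested Block: 40903280 Error Block: 40903280
Nov 27 09:39:38 cfs1000 scsi: [ID 107833 kern.warning] WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
Nov 27 09:39:38 cfs1000 Error for Command: read(10) Error Level: Retryable
Nov 27 09:39:38 cfs1000 scsi: [ID 107833 kern.notice] Requested Block: 12658240 Error Block: 12658240
"""


def error_blocks_by_device(lines):
    """Map each device path to the set of error blocks logged against it."""
    blocks = defaultdict(set)
    current = None  # device named by the most recent WARNING line
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            current = match.group(1)
            continue
        match = BLOCK_RE.search(line)
        if match and current is not None:
            blocks[current].add(int(match.group(1)))
    return dict(blocks)


def lun_of(device_path):
    """Assumption: the text after the last comma is the LUN, written in hex."""
    return int(device_path.rsplit(",", 1)[1], 16)


if __name__ == "__main__":
    for device, blocks in error_blocks_by_device(SAMPLE_LOG.splitlines()).items():
        print("LUN", lun_of(device), "distinct error blocks:", sorted(blocks))
```

Several distinct error blocks against a single LUN match the pattern this tech tip describes, although the script by itself cannot say which member drive is at fault.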
We now need to collect more data. RAID Manager (6.22 and up) includes a tool that collects performance data for the cache, LUNs, controllers and disk drives. With this "back end" SCSI analyzer we can identify which disks are causing the errors. The utility is perfutil. The command to execute is

    /usr/lib/osa/bin/perfutil -c {module_name} > perf.txt

This command generates an ASCII text file. For simplicity, the file can be analyzed with a PTS-provided Perl script called drive_stats.pl. The script is available internally at http://storage.east/mreid/ in the "downloads" section. Simply run the file through the analyzer. The command to execute is

    ./drive_stats.pl perf.txt v

In this example, "channel = 2,5" logged 20273 Unrecovered Errors and 20212 Retried Requests. Clearly, "channel = 2,5" is bad. The first value, "2", is the drive tray. The second value, "5", is the target. Drive [2,5] therefore needs replacing. As expected, drive [2,5] is indeed a member of LUN 8.

    # ./drive_stats.pl perf.txt v
    drive_stats.pl version 1.0
    drive_stats filename modifier [v for verbose error info] [p for performance info]
    Controller = c1t5d0
    Host Time/Date: 10:01:34 11/27/2002
    hrs of runtime = 1.42752777777778
    channel = 2,5  Unrcvrd Err = 20273  Rtry Req = 20212
    total_recovered_errors = 0
    total_unrecovered_errors = 20273
    total_request_time_outs = 0
    total_retried_requests = 20212
    total_drive_bus_resets = 0

A snippet from the module profile:
Controllers:

    Name         Serial Number   Mode     Logical Units
    A (c1t5d0)   1T82421948      Active   6
    B (c4t4d1)   1T92401253      Active   5

Detailed LUN Information for cfs1000_01 (continued)

    LUN   Associated Drives
    8     [1,5] [2,5] [3,5] [4,5] [5,5]

    LUN   Controller   Capacity (MB)   RAID Level
    8     c1t5d8       138771          5

Drives: Detailed Drive Information for cfs1000_01

    Location   Capacity (MB)   Status    Vendor    Product ID
    [1,5]      34732           Optimal   FUJITSU   MAJ3364M SUN36G
    [2,5]      34732           Optimal   FUJITSU   MAJ3364M SUN36G
    [3,5]      34732           Optimal   FUJITSU   MAJ3364M SUN36G
    [4,5]      34732           Optimal   FUJITSU   MAJ3364M SUN36G
    [5,5]      34732           Optimal   FUJITSU   MAJ3364M SUN36G

Important Notes

1. The data collected by perfutil does not survive controller reboots.
2. Controllers are rebooted by a power cycle, a host reboot, or the "a3k.release" script.
3. This tool can help identify other problems, such as channel failures.
4. Collecting this data should be a priority when troubleshooting RM6.

For those who want to know more about interpreting the perfutil output, the man page gives some pretty good information. The relevant perfutil data and individual drive statistics are provided below. Important points are:

1. "Parameter: cXXX" is a disk drive. For example, c205 is tray 2, target 5, or drive [2,5].
2. The man page identifies a hex offset and length for each field of data.
3. The 5th, 6th, 7th and 8th data fields document disk errors, retries and time-outs.
4. Each group of four hex bytes (for example, 00 00 4f 31) is one data field.
5. For example, drive c205 had:

    000a2c71   Read Requests
    00002b33   Write Requests
    .........
    00004f31   Unrecovered Errors. This is 20273 decimal (reported by script).
    00004ef4   Retried Requests. This is 20212 decimal (reported by script).

Snippet From the Man Page

Individual Drive Statistics

The controller will report individual drive statistics for all drives recognized by the controller. Parameters 0xCXXX report drive statistics for individual drives.
The field can be interpreted as follows:

    Bits    Description
    _____   ________________________________________________________
    15-12   1100b indicates parameter is a drive specific parameter
            that will follow the format provided below.
    11-8    Channel Number for the Drive.
    7-0     Drive ID for the Drive.

    Offset   Parameter                           Parameter Length
    ______   _________________________________   ________________
    0x00     Total # of Drive Read Requests      0x04
    0x04     Total # of Drive Write Requests     0x04
    0x08     Total # of Drive Blocks Requested   0x04
    0x0C     Total Drive IOs Requested           0x04
    0x10     Total # of Recovered Errors         0x04
    0x14     Total # of Unrecovered Errors       0x04
    0x18     Total # of Request Time Outs        0x04
    0x1C     Total # of Retried Requests         0x04
    0x20     Total Number of Drive Bus Resets    0x04
    0x24     Reserved                            0x1C

Snippet from perfutil file generated:

    Parameter: c204 Length: 40
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    Parameter: c205 Length: 40
    00 0a 2c 71 00 00 2b 33 00 ac ce ac 00 0a 9e f4
    00 00 00 00 00 00 4f 31 00 00 00 00 00 00 4e f4
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    Parameter: c206 Length: 40
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 7a 34
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
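For readers without access to drive_stats.pl, the man page layout above is enough to decode a parameter block by hand. The following is a minimal sketch of that decoding, not a reimplementation of the PTS script: the function and field names are hypothetical, and it assumes the dump has already been cut down to the hex bytes of one parameter. It reads "Length: 40" as hex (0x40, or 64 bytes), which matches both the offset table (0x24 of counters plus 0x1C reserved) and the 64 bytes shown per parameter. The counters are read most-significant byte first, as the c205 example implies (00 00 4f 31 is 20273).

```python
import struct

# Counter names in the order of the man page offset table (0x00 to 0x20).
# Each counter is 4 bytes; the final 0x1C bytes are reserved.
FIELDS = (
    "read_requests", "write_requests", "blocks_requested", "ios_requested",
    "recovered_errors", "unrecovered_errors", "request_time_outs",
    "retried_requests", "drive_bus_resets",
)

# The c205 block copied from the perfutil snippet in this document.
C205_DUMP = (
    "00 0a 2c 71 00 00 2b 33 00 ac ce ac 00 0a 9e f4 "
    "00 00 00 00 00 00 4f 31 00 00 00 00 00 00 4e f4 "
    + "00 " * 32
)


def decode_parameter(code):
    """Split a parameter code such as 'c205' into (channel, drive ID)."""
    value = int(code, 16)
    if value >> 12 != 0b1100:  # bits 15-12 must be 1100b for a drive parameter
        raise ValueError("not a drive-specific parameter: " + code)
    return (value >> 8) & 0xF, value & 0xFF  # bits 11-8, bits 7-0


def decode_drive_stats(hex_dump):
    """Turn the 0x40-byte hex dump of one drive parameter into named counters."""
    data = bytes.fromhex("".join(hex_dump.split()))
    if len(data) != 0x40:
        raise ValueError("expected 0x40 bytes, got %d" % len(data))
    # Nine big-endian unsigned 32-bit counters occupy the first 0x24 bytes.
    return dict(zip(FIELDS, struct.unpack(">9I", data[:0x24])))


if __name__ == "__main__":
    channel, drive_id = decode_parameter("c205")
    stats = decode_drive_stats(C205_DUMP)
    print("drive [%d,%d]" % (channel, drive_id))
    for name in ("unrecovered_errors", "retried_requests"):
        print(name, "=", stats[name])
```

Run against the c205 block, the sketch yields drive [2,5] with 20273 unrecovered errors and 20212 retried requests, the same figures drive_stats.pl reported.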
INTERNAL SUMMARY:
SUBMITTER: Daniel Caporale
APPLIES TO: Hardware/Disk Storage Subsystem/StorEdge Disk Array, Storage/RAID Manager
ATTACHMENTS: