InfoDoc ID   Synopsis   Date
49203   Sun StorEdge[TM] Axx00:InfoDoc-Tech Tip:Finding a Failed Disk that RM6 Reports as Optimal   3 Dec 2002

Status Issued

Description

                 How To Find A Bad Drive With SCSI Analysis


First, confirm that this is indeed the problem. Typically, RM6 reports the Module as Optimal 
while the host logs scroll errors against various blocks on a lun. In this example, an A3500FC 
array attached to a server is scrolling read block errors to lun 8. Note that different blocks 
(12658240 and 40903280) are being reported. 


      Nov 27 09:39:35 cfs1000 scsi: [ID 107833 kern.notice]
         Vendor: Symbios Serial Number:    :    : 6t
      Nov 27 09:39:35 cfs1000 scsi: [ID 107833 kern.notice]
         Sense Key: Hardware Error
      Nov 27 09:39:35 cfs1000 scsi: [ID 107833 kern.warning]
         WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
      Nov 27 09:39:36 cfs1000         Error for Command: read(10)
         Error Level: Retryable
      Nov 27 09:39:36 cfs1000 scsi: [ID 107833 kern.notice]   
         Requested Block: 40903280   Error Block: 40903280
      Nov 27 09:39:37 cfs1000 scsi: [ID 107833 kern.notice]  
         Vendor: Symbios Serial Number:    :    : 6t
      Nov 27 09:39:37 cfs1000 scsi: [ID 107833 kern.notice]
          Sense Key: Hardware Error
      Nov 27 09:39:38 cfs1000 scsi: [ID 107833 kern.warning] 
         WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
      Nov 27 09:39:38 cfs1000         Error for Command: read(10)
         Error Level: Retryable
      Nov 27 09:39:38 cfs1000 scsi: [ID 107833 kern.notice]
         Requested Block: 12658240   Error Block: 12658240                       
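
  On a long messages file, the pattern above (one lun, many different error blocks) can be 
confirmed by grouping the entries by device. The following is a minimal Python sketch and is 
not part of RM6 or any PTS tool. The function name and the regular expressions are assumptions 
based only on the message layout shown above.

```python
import re
from collections import defaultdict

# Assumed layout: "WARNING: <device path>,<lun> (<instance>):"
WARNING_RE = re.compile(r"WARNING:\s+(\S+),(\d+)\s+\((\w+)\)")
BLOCK_RE = re.compile(r"Error Block:\s+(\d+)")

def error_blocks_by_device(log_text):
    """Map (instance, lun) -> set of error blocks reported for that device."""
    blocks = defaultdict(set)
    current = None
    for line in log_text.splitlines():
        m = WARNING_RE.search(line)
        if m:
            # Remember which device the following "Error Block" line belongs to.
            current = (m.group(3), int(m.group(2)))
            continue
        m = BLOCK_RE.search(line)
        if m and current is not None:
            blocks[current].add(int(m.group(1)))
    return dict(blocks)

SAMPLE_LOG = """\
   WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
   Requested Block: 40903280   Error Block: 40903280
   WARNING: /sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w200000a0b809b08b,8 (ssd25):
   Requested Block: 12658240   Error Block: 12658240
"""

for (instance, lun), blks in error_blocks_by_device(SAMPLE_LOG).items():
    print(instance, "lun", lun, "distinct error blocks:", len(blks))
```

  Several distinct error blocks on a single lun, as here, is the signature described above.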

  We now need to collect more data. RAID Manager (6.22 and up) contains a tool which 
collects performance data for cache, luns, controllers and disk drives. With this 
"back end" scsi analyzer we can identify which disks are causing the errors. The 
utility is perfutil. The command to execute is 

            /usr/lib/osa/bin/perfutil -c {module_name} > perf.txt

  This command generates an ascii text file. For simplicity, the file can be analyzed 
with a PTS provided Perl script called drive_stats.pl. The script is available internally
at http://storage.east/mreid/ in the "downloads" section. Simply run the file through the 
analyzer. The command to execute is 

                  ./drive_stats.pl perf.txt v 

  In this example, "channel = 2,5" logged 20273 Unrecovered Errors and 20212 Retried Requests 
in about 1.4 hours of runtime, so drive [2,5] is the one that needs replacing. The first value 
"2" is the drive tray. The second value "5" is the target. The module profile confirms that 
drive [2,5] is a member of lun 8, the lun reporting the errors. 


 # ./drive_stats.pl perf.txt v

          drive_stats.pl version 1.0
          drive_stats filename modifier [v for verbose error info] [p for performance info]

          Controller = c1t5d0   Host Time/Date: 10:01:34  11/27/2002
          hrs of runtime =        1.42752777777778
          channel = 2,5   Unrcvrd Err =   20273 Rtry Req =        20212 
          total_recovered_errors =        0
          total_unrecovered_errors =      20273
          total_request_time_outs =       0
          total_retried_requests =        20212
          total_drive_bus_resets =        0
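
 When many drives are listed, the per-drive lines can be filtered automatically. The sketch 
 below is a hypothetical Python helper, not part of drive_stats.pl. It assumes every per-drive 
 line follows the "channel = T,N   Unrcvrd Err = ... Rtry Req = ..." layout shown above.

```python
import re

# Assumed per-drive line layout, taken from the sample output above.
CHANNEL_RE = re.compile(
    r"channel\s*=\s*(\d+),(\d+)\s+Unrcvrd Err\s*=\s*(\d+)\s+Rtry Req\s*=\s*(\d+)"
)

def suspect_drives(report_text, threshold=0):
    """Return [((tray, target), unrecovered, retried)] for drives above threshold."""
    suspects = []
    for m in CHANNEL_RE.finditer(report_text):
        tray, target, unrecovered, retried = map(int, m.groups())
        if unrecovered > threshold or retried > threshold:
            suspects.append(((tray, target), unrecovered, retried))
    return suspects

SAMPLE_REPORT = """\
          Controller = c1t5d0   Host Time/Date: 10:01:34  11/27/2002
          hrs of runtime =        1.42752777777778
          channel = 2,5   Unrcvrd Err =   20273 Rtry Req =        20212
"""

for (tray, target), unrecovered, retried in suspect_drives(SAMPLE_REPORT):
    print("drive [%d,%d]: %d unrecovered, %d retried" % (tray, target, unrecovered, retried))
```

 The threshold parameter is illustrative only; the source gives no guidance on what error 
 count is acceptable.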

 A snippet from the module profile.

          Controllers:
           
            Name          Serial Number    Mode           Logical Units
            A (c1t5d0)    1T82421948       Active         6
            B (c4t4d1)    1T92401253       Active         5

            Detailed LUN Information for cfs1000_01 (continued)
           
            LUN   Associated Drives
            8     [1,5] [2,5] [3,5] [4,5] [5,5] 
           
                                  Capacity
            LUN   Controller      (MB)      RAID Level
            8     c1t5d8          138771    5     

          Drives:
           
            Detailed Drive Information for cfs1000_01
           
            Location   Capacity (MB)   Status         Vendor   Product ID
            [1,5]      34732           Optimal        FUJITSU  MAJ3364M SUN36G 
            [2,5]      34732           Optimal        FUJITSU  MAJ3364M SUN36G 
            [3,5]      34732           Optimal        FUJITSU  MAJ3364M SUN36G 
            [4,5]      34732           Optimal        FUJITSU  MAJ3364M SUN36G 
            [5,5]      34732           Optimal        FUJITSU  MAJ3364M SUN36G 
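
 The membership check (is the suspect drive part of the lun that is logging errors?) can be 
 scripted against a saved module profile. This is a minimal Python sketch under the assumption 
 that the "LUN   Associated Drives" rows look exactly like the snippet above; the function 
 name is hypothetical.

```python
import re

# A row such as "8     [1,5] [2,5] [3,5]": a lun number followed only by drive locations.
LUN_ROW_RE = re.compile(r"\s*(\d+)\s+((?:\[\d+,\d+\]\s*)+)$")

def lun_drives(profile_text):
    """Map lun number -> list of (tray, target) drive locations."""
    members = {}
    for line in profile_text.splitlines():
        m = LUN_ROW_RE.match(line)
        if m:
            pairs = re.findall(r"\[(\d+),(\d+)\]", m.group(2))
            members[int(m.group(1))] = [(int(a), int(b)) for a, b in pairs]
    return members

SAMPLE_PROFILE = """\
            LUN   Associated Drives
            8     [1,5] [2,5] [3,5] [4,5] [5,5] 

            LUN   Controller      (MB)      RAID Level
            8     c1t5d8          138771    5     
"""

members = lun_drives(SAMPLE_PROFILE)
print("drive [2,5] in lun 8:", (2, 5) in members.get(8, []))
```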


                          Important Notes

   1. The data collected by perfutil does not survive controller reboots. 
   2. Controllers are rebooted by a powercycle, a host reboot, or the "a3k.release" script. 
   3. This tool can help identify other problems such as channel failures. 
   4. Collecting this data should be a routine early step when troubleshooting RM6. 

  For those who want to know more about interpreting the perfutil output, the man page 
gives good information. The relevant perfutil data and individual drive statistics are 
provided below. Important points are ...

    1. "Parameter: cXXX" is a disk drive. For example, c205 is tray 2, target 5 or drive [2,5]. 
    2. The man page identifies a hex offset and length for each field of data. 
    3. The 5th, 6th, 7th and 8th data fields document disk Errors, Timeouts and Retries. 
    4. Each group of 4 bytes (four two-digit hex values) is one data field! 
    5. For example, Drive c205 had .....
          000a2c71 Read Requests 
          00002b33 Write Requests 
          .........
          00004f31 Unrecovered Errors. This is 20273 decimal (reported by script).
          00004ef4 Retried Requests. This is 20212 decimal (reported by script).
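
 The hex-to-decimal conversions above can be checked with a few lines of Python. This is only 
 a worked example of the arithmetic; the dictionary keys are labels chosen for illustration.

```python
# Hex field values for drive c205, copied from the list above.
fields = {
    "read_requests": "000a2c71",
    "write_requests": "00002b33",
    "unrecovered_errors": "00004f31",
    "retried_requests": "00004ef4",
}

# int(text, 16) interprets the string as a base-16 number.
decoded = {name: int(value, 16) for name, value in fields.items()}

for name, value in decoded.items():
    print("%-20s %s hex = %d decimal" % (name, fields[name], value))
```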

 Snippet From the Man Page

            Individual Drive Statistics
               The controller will report individual drive  statistics  for
               all drives recognized by the controller.

               Parameters 0xCXXX report  drive  statistics  for  individual
               drives. The field can be interpreted as follows:

               Bits                          Description
               _____   ________________________________________________________
               15-12   1100b indicates parameter is a drive specific parameter
                       that will follow the format provided below.
               11-8    Channel Number for the Drive.
                7-0    Drive ID for the Drive.

               Offset               Parameter               Parameter Length
               ______   _________________________________   ________________
                0x00    Total # of Drive Read Requests            0x04
                0x04    Total # of Drive Write Requests           0x04
                0x08    Total # of Drive Blocks Requested         0x04
                0x0C    Total Drive IOs Requested                 0x04
                0x10    Total # of Recovered Errors               0x04
                0x14    Total # of Unrecovered Errors             0x04
                0x18    Total # of Request Time Outs              0x04
                0x1C    Total # of Retried Requests               0x04
                0x20    Total Number of Drive Bus Resets          0x04
                0x24    Reserved                                  0x1C
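
 The two man page tables translate directly into code. The Python sketch below decodes a 
 parameter code using the bit layout, then unpacks the nine 4-byte counters at offsets 0x00 
 through 0x20. Big-endian byte order is an assumption, inferred from the c205 example where 
 the bytes 00 00 4f 31 read as 0x4f31. The function and field names are hypothetical.

```python
import struct

# Counter names in offset order (0x00, 0x04, ... 0x20), per the man page table.
FIELD_NAMES = (
    "read_requests", "write_requests", "blocks_requested", "ios_requested",
    "recovered_errors", "unrecovered_errors", "request_time_outs",
    "retried_requests", "drive_bus_resets",
)

def decode_parameter_code(code):
    """Split a code such as 'c205' into (channel, drive_id)."""
    value = int(code, 16)
    if value >> 12 != 0b1100:           # bits 15-12 must be 1100b
        raise ValueError("not a drive specific parameter: %s" % code)
    return (value >> 8) & 0xF, value & 0xFF   # bits 11-8, bits 7-0

def decode_drive_stats(payload):
    """Unpack the nine 32-bit counters; the reserved bytes from 0x24 on are ignored."""
    if len(payload) < 0x24:
        raise ValueError("payload too short")
    return dict(zip(FIELD_NAMES, struct.unpack(">9I", payload[:0x24])))

# The 0x40-byte payload reported for drive c205 in this document.
PAYLOAD_C205 = bytes.fromhex(
    "000a2c71 00002b33 00acceac 000a9ef4 00000000"
    "00004f31 00000000 00004ef4 00000000" + "00" * 0x1C
)

print(decode_parameter_code("c205"))
print(decode_drive_stats(PAYLOAD_C205))
```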

 Snippet from perfutil file generated:

          Parameter: c204
          Length: 40
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 00 00 00 00 00 00 00 00 00
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
          00 00 00 00 00 00 00 00 

          Parameter: c205
          Length: 40
          00 0a 2c 71 00 00 2b 33 00 ac ce ac 00 0a 9e f4 00 00 00 00 00 00 4f 31 00 00 00 00 
          00 00 4e f4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
          00 00 00 00 00 00 00 00 

          Parameter: c206
          Length: 40
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 7a 34 00 00 00 00 00 00 00 00 00 00 00 00 
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
          00 00 00 00 00 00 00 00 
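
 Without drive_stats.pl, the same check can be made directly on the perfutil text. The Python 
 sketch below reads "Parameter:" blocks in the layout shown above and reports drives whose 
 error counters are non-zero. It is an illustration and not a reimplementation of the PTS 
 script. It assumes big-endian counters and that drive blocks always look like this snippet; 
 other parameter types in the file are skipped.

```python
import re

# Offsets of the error-related counters, from the man page table.
ERROR_OFFSETS = {
    "recovered": 0x10, "unrecovered": 0x14, "time_outs": 0x18,
    "retried": 0x1C, "bus_resets": 0x20,
}
HEX_LINE_RE = re.compile(r"(?:[0-9a-fA-F]{2}\s*)+")

def parse_blocks(text):
    """Map parameter name -> payload bytes for each 'Parameter:' block."""
    blocks, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Parameter:"):
            name = line.split(":", 1)[1].strip()
            blocks[name] = bytearray()
        elif name is not None and HEX_LINE_RE.fullmatch(line):
            blocks[name] += bytes.fromhex(line)
    return blocks

def drives_with_errors(blocks):
    """Return {parameter: counters} for drive parameters with any non-zero error counter."""
    report = {}
    for name, payload in blocks.items():
        if len(name) != 4 or int(name, 16) >> 12 != 0xC:
            continue  # not a 0xCXXX drive parameter
        counts = {k: int.from_bytes(payload[o:o + 4], "big")
                  for k, o in ERROR_OFFSETS.items()}
        if any(counts.values()):
            report[name] = counts
    return report

SAMPLE = """\
Parameter: c204
Length: 40
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00

Parameter: c205
Length: 40
00 0a 2c 71 00 00 2b 33 00 ac ce ac 00 0a 9e f4 00 00 00 00 00 00 4f 31 00 00 00 00
00 00 4e f4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
"""

report = drives_with_errors(parse_blocks(SAMPLE))
print(report)
```

 Drive c204 shows only a small IO count and no errors, so it is not reported; only c205 is 
 flagged.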

                       

INTERNAL SUMMARY:
                       

SUBMITTER: Daniel Caporale
APPLIES TO: Hardware/Disk Storage Subsystem/StorEdge Disk Array, Storage/RAID Manager
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.