CHAPTER 3
Troubleshooting
This chapter discusses two common types of failure: unconfigure operation failures and configure operation failures.
The following are examples of cfgadm diagnostic messages. (Syntax error messages are not included here.)
See the following man pages for additional error message detail: cfgadm(1M), cfgadm_sbd(1M), cfgadm_pci(1M), and config_admin(3X).
An unconfigure operation for a CPU/Memory board or an I/O board can fail if the system is not in a correct state before you begin the operation.
If you try to unconfigure a system board whose memory is interleaved across system boards, the system displays an error message such as:
cfgadm: Hardware specific failure: unconfigure N0.SB2::memory: Memory is interleaved across boards: /ssm@0,0/memory-controller@b,400000
If you try to unconfigure a CPU to which a process is bound, the system displays an error message such as the following:
cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu3: Failed to off-line: /ssm@0,0/SUNW,UltraSPARC-III
Unbind the process from the CPU and retry the unconfigure operation.
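As a rough sketch of the unbind step, the following POSIX shell function turns pbind -q style output into the pbind -u commands that would release every process bound to a given processor. The input format ("process id PID: CPU") is an assumption based on typical pbind(1M) output, and the function name and sample process IDs are hypothetical. The function only prints commands; it changes nothing.

```shell
#!/bin/sh
# Print a "pbind -u PID" command for each process bound to the given CPU.
# Reads lines of the assumed form "process id PID: CPU" on standard input.
unbind_plan() {
    cpu=$1
    while read -r w1 w2 pid bound; do
        # Skip anything that is not a "process id" line.
        [ "$w1 $w2" = "process id" ] || continue
        # Keep only processes bound to the requested CPU.
        [ "$bound" = "$cpu" ] || continue
        # Drop the trailing colon that follows the PID.
        echo "pbind -u ${pid%:}"
    done
}

# Sample input with made-up PIDs; on a live system, pipe in "pbind -q" instead.
unbind_plan 3 <<'EOF'
process id 412: 3
process id 590: 1
process id 733: 3
EOF
```

Review the printed commands before running any of them, because unbinding a process changes how it is scheduled.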
All memory on a system board must be unconfigured before you try to unconfigure a CPU. If you try to unconfigure a CPU before all memory on the board is unconfigured, the system displays an error message such as:
cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu0: Can't unconfig cpu if mem online: /ssm@0,0/memory-controller
Unconfigure all memory on the board and then unconfigure the CPU.
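The ordering rule above (memory first, then CPUs) can be sketched as a small shell function that prints the cfgadm commands in a safe order. The function name is hypothetical, and the CPU component names cpu0 through cpu3 are an assumption; list the components that your board actually reports. The function prints commands only and does not run cfgadm.

```shell
#!/bin/sh
# Print unconfigure commands for one board: memory first, then each CPU.
# Usage: unconfigure_order BOARD CPU...
unconfigure_order() {
    board=$1
    shift
    # Memory must be unconfigured before any CPU on the same board.
    echo "cfgadm -c unconfigure ${board}::memory"
    for cpu in "$@"; do
        echo "cfgadm -c unconfigure ${board}::${cpu}"
    done
}

# Example for board N0.SB2, assuming four CPU components.
unconfigure_order N0.SB2 cpu0 cpu1 cpu2 cpu3
```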
To unconfigure the memory on a board that has permanent memory, move the permanent memory pages to another board that has enough available memory to hold them. Such an additional board must be available before the unconfigure operation begins.
If the unconfigure operation fails with a message such as the following, the memory on the board could not be unconfigured:
cfgadm: Hardware specific failure: unconfigure N0.SB0: No available memory target: /ssm@0,0/memory-controller@3,400000
Add enough memory to another board to hold the permanent memory pages, and then retry the unconfigure operation.
To confirm that a memory page cannot be moved, use the verbose option with the cfgadm command and look for the word "permanent" in the listing:
# cfgadm -av -s "select=type(memory)"
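To avoid scanning a long listing by eye, the search for "permanent" can be sketched as a filter that prints the attachment point of each matching line. The function name is hypothetical, and the sample lines are illustrative rather than captured from a real system; the sketch assumes that the attachment point ID is the first field on the line that mentions permanent memory.

```shell
#!/bin/sh
# Print the attachment point (first field) of every listing line that
# mentions permanent memory.
permanent_memory_ap_ids() {
    awk '/permanent/ { print $1 }'
}

# Illustrative input only; the values below are made up.
permanent_memory_ap_ids <<'EOF'
N0.SB0::memory connected configured ok base address 0x0, 2097152 KBytes total, 485768 KBytes permanent
N0.SB2::memory connected configured ok base address 0x1000000000, 2097152 KBytes total
EOF
```

On a live system the filter would read the output of the cfgadm command shown above through a pipe.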
If the unconfigure operation fails with one of the messages below, there would not be enough available memory in the system if the board were removed:
cfgadm: Hardware specific failure: unconfigure N0.SB0: Insufficient memory
cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation failed
Reduce the memory load on the system and try again. If practical, install more memory in another board slot.
If the unconfigure operation fails with the following message, memory demand increased while the unconfigure operation was in progress:
cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation refused
Reduce the memory load on the system and try again.
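The CPU and memory messages above can be collected into a simple lookup. The following sketch maps a cfgadm diagnostic message to the remedy that this chapter gives for it. The function name is hypothetical, and the patterns cover only the messages quoted in this chapter.

```shell
#!/bin/sh
# Map a cfgadm diagnostic message to the remedy described in this chapter.
suggest_remedy() {
    case $1 in
        *"Failed to off-line"*)
            echo "Unbind the process from the CPU and retry." ;;
        *"Can't unconfig cpu if mem online"*)
            echo "Unconfigure all memory on the board, then the CPU." ;;
        *"No available memory target"*)
            echo "Add memory to another board to hold the permanent pages and retry." ;;
        *"Insufficient memory"*|*"Memory operation failed"*|*"Memory operation refused"*)
            echo "Reduce the memory load on the system and retry." ;;
        *)
            # Anything else is outside the scope of this lookup.
            echo "No remedy listed; see the cfgadm man pages." ;;
    esac
}

suggest_remedy "cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation refused"
```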
CPU unconfiguration is part of the unconfiguration operation for a CPU/Memory board. If the operation fails to take the CPU offline, the following message is logged to the console:
WARNING: Processor number failed to offline.
It is possible to unconfigure a board and then discover that it cannot be disconnected. The cfgadm status display lists the board as not detachable. This problem occurs when the board is supplying an essential hardware service that cannot be relocated to an alternate board.
A device cannot be unconfigured or disconnected while it is in use. Many failures to unconfigure I/O boards occur because activity on the boards has not been stopped, or because an I/O device becomes active again after it has been stopped.
Disks attached to an I/O board must be idled before you attempt to unconfigure or disconnect that board. Any attempt to unconfigure or disconnect a board whose devices are still in use is rejected.
If an unconfigure operation fails because an I/O board has a busy or open device, the board is left only partially unconfigured, because the operation sequence stops at the busy device.
To regain access to the devices that were not unconfigured, the board must be completely unconfigured and then reconfigured.
If a device on the board is busy, the system logs a message such as the following after an attempt to unconfigure:
cfgadm: Hardware specific failure: unconfigure N0.IB6: Device busy: /ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@6,0
Unmount the device and retry the unconfigure operation. The board must be in the unconfigured state before you try to reconfigure it.
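When several such messages appear in a log, it can help to pull out just the device path that follows "Device busy:". The following filter is a minimal sketch of that step; the function name is hypothetical.

```shell
#!/bin/sh
# Extract the device path that follows "Device busy:" in a cfgadm message.
# Lines without that text produce no output.
busy_device() {
    sed -n 's/.*Device busy: *//p'
}

echo "cfgadm: Hardware specific failure: unconfigure N0.IB6: Device busy: /ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@6,0" | busy_device
```

The printed value is a physical device path. Matching it to a mounted file system is a separate step that this sketch does not attempt.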
All I/O devices must be closed before they are unconfigured.
1. To see which processes have these devices open, use the fuser(1M) command.
2. Run the following command to stop the vold daemon gracefully:
# /etc/init.d/volmgt stop
3. Disconnect all SCSI controllers that are associated with the card that you are trying to unconfigure. To get a list of all connected SCSI controllers, use the following command:
# cfgadm -l -s "select=class(scsi)"
4. If the redundancy features of Solaris Volume Manager (SVM) mirroring are used to access a device connected to the board, reconfigure these subsystems so that the device or network is accessible by way of controllers on other system boards.
5. Unmount file systems, including SVM metadevices that have a board-resident partition (for example, umount /partition).
6. Remove the SVM database from board-resident partitions. The location of the SVM database is explicitly chosen by the user and can be changed.
7. Remove any private regions used by Sun Volume Manager or Veritas Volume Manager.
Volume Manager by default uses a private region on each device that it controls, so such devices must be removed from Sun Volume Manager control before they can be detached.
8. Remove disk partitions from the swap configuration.
9. Either kill any process that directly opens a device or raw partition, or direct it to close the open device on the board.
Note - Unmounting file systems may affect NFS client systems.
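Several of the steps above correspond to standard Solaris commands. The following sketch prints those commands for one file system, one SVM state database slice, and one swap slice, without running any of them. The function name, the mount point, and the slice names are hypothetical, and the mapping of steps 6 and 8 to metadb -d and swap -d is an assumption about how those steps would usually be carried out.

```shell
#!/bin/sh
# Print, without running them, commands that correspond to the steps above.
# Usage: idle_plan MOUNT_POINT REPLICA_SLICE SWAP_DEVICE
idle_plan() {
    mnt=$1
    replica=$2
    swapdev=$3
    echo "fuser -c $mnt"                        # step 1: who has it open
    echo "/etc/init.d/volmgt stop"              # step 2: stop vold
    echo 'cfgadm -l -s "select=class(scsi)"'    # step 3: list SCSI controllers
    echo "umount $mnt"                          # step 5: unmount file system
    echo "metadb -d $replica"                   # step 6: remove SVM database replica
    echo "swap -d $swapdev"                     # step 8: remove swap partition
}

# Hypothetical names; substitute the devices that reside on your board.
idle_plan /export/data c1t0d0s7 /dev/dsk/c1t0d0s1
```

Steps 4, 7, and 9 depend on the site configuration and are left out of the sketch.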
Time-outs occur by default after two minutes. Administrators may need to increase this time-out value to avoid time-outs during a DR-induced operating system quiescence, which may take longer than two minutes. Quiescing a system makes the system and related network services unavailable for a period of time that can exceed two minutes. These changes affect both the client and server machines.
The following problems can prevent configuration of a CPU/Memory board:
Before you try to configure either CPU0 or CPU1, make sure that the other CPU is unconfigured.
All CPUs on the system board must be configured before you configure memory. If you try to configure memory while one or more CPUs are unconfigured, the system displays an error message such as:
cfgadm: Hardware specific failure: configure N0.SB2::memory: Can't config memory if not all cpus are online: /ssm@0,0/memory-controller
A configure operation may fail because the I/O board has a device that does not currently support hot-plugging. In that case the board is left only partially configured, because the operation stops at the unsupported device. The board must be returned to the unconfigured state before you attempt another configure operation. The system logs a message such as the following:
Copyright © 2002, Sun Microsystems, Inc. All rights reserved.