CHAPTER 3

Troubleshooting

This chapter discusses the following common types of failure:

  • Unconfigure Operation Failure
  • Configure Operation Failure

The following are examples of cfgadm diagnostic messages. (Syntax error messages are not included here.)

cfgadm: Configuration administration not supported on this machine
cfgadm: hardware component is busy, try again
cfgadm: operation: configuration operation not supported on this machine
cfgadm: operation: Data error: error_text
cfgadm: operation: Hardware specific failure: error_text
cfgadm: operation: Insufficient privileges
cfgadm: operation: Operation requires a service interruption
cfgadm: System is busy, try again
WARNING: Processor number failed to offline.

See the following man pages for additional error message detail: cfgadm(1M), cfgadm_sbd(1M), cfgadm_pci(1M), and config_admin(3X).


Unconfigure Operation Failure

An unconfigure operation for a CPU/Memory board or an I/O board can fail if the system is not in a correct state before you begin the operation.

CPU/Memory Board Unconfiguration Failures

Cannot Unconfigure a Board Whose Memory Is Interleaved Across Boards

If you try to unconfigure a system board whose memory is interleaved across system boards, the system displays an error message such as:

cfgadm: Hardware specific failure: unconfigure N0.SB2::memory: Memory is 
interleaved across boards: /ssm@0,0/memory-controller@b,400000 

Cannot Unconfigure a CPU to Which a Process is Bound

If you try to unconfigure a CPU to which a process is bound, the system displays an error message such as the following:

cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu3: Failed to off-line: 
/ssm@0,0/SUNW,UltraSPARC-III 

  • Unbind the process from the CPU and retry the unconfigure operation.
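A minimal sketch of this step for a Bourne-compatible shell, using the Solaris pbind(1M) command to list bindings (pbind -q) and remove one (pbind -u). The function name unbind_and_retry and the example PID and attachment point are hypothetical placeholders; identify the bound process from the pbind -q output before you run it.

```shell
# Hypothetical helper: clear one process binding, then retry the CPU unconfigure.
# Source this into sh, ksh, or bash.
# Usage: unbind_and_retry <pid> <cpu_ap_id>
unbind_and_retry() {
    pid=$1
    ap_id=$2

    # Show all current process-to-processor bindings so you can confirm the PID.
    pbind -q || return 1

    # Remove the binding for the given process.
    pbind -u "$pid" || return 1

    # Retry the unconfigure operation on the CPU attachment point.
    cfgadm -c unconfigure "$ap_id"
}

# Example (placeholder values):
#   unbind_and_retry 1234 N0.SB2::cpu3
```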

Cannot Unconfigure a CPU Before All Memory is Unconfigured

All memory on a system board must be unconfigured before you try to unconfigure a CPU. If you try to unconfigure a CPU before all memory on the board is unconfigured, the system displays an error message such as:

cfgadm: Hardware specific failure: unconfigure N0.SB2::cpu0: Can't unconfig cpu 
if mem online: /ssm@0,0/memory-controller 

  • Unconfigure all memory on the board and then unconfigure the CPU.
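The required order (memory first, then CPUs) can be sketched as a small shell function. It assumes that component attachment points follow the board::memory and board::cpuN naming seen in the messages in this chapter; the function name and example values are hypothetical, so confirm the actual names with cfgadm -al on your system.

```shell
# Hypothetical helper: unconfigure a board's memory first, then its CPUs.
# Assumes component ap_ids of the form <board>::memory and <board>::cpu<N>.
# Usage: unconfigure_mem_then_cpus <board_ap_id> <cpu_number>...
unconfigure_mem_then_cpus() {
    board=$1
    shift

    # Memory must be unconfigured before any CPU on the same board.
    cfgadm -c unconfigure "${board}::memory" || return 1

    # Unconfigure each listed CPU; stop at the first failure.
    for n in "$@"; do
        cfgadm -c unconfigure "${board}::cpu${n}" || return 1
    done
}

# Example (placeholder board and CPU numbers):
#   unconfigure_mem_then_cpus N0.SB2 0 1 2 3
```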

Unable to Unconfigure Memory on a Board With Permanent Memory

Before the memory on a board that has permanent memory can be unconfigured, the permanent memory pages must be moved to another board that has enough available memory to hold them. That board must be available before the unconfigure operation begins.

Memory Cannot Be Reconfigured

If the unconfigure operation fails with a message such as the following, the memory on the board could not be unconfigured:

cfgadm: Hardware specific failure: unconfigure N0.SB0: No available memory 
target: /ssm@0,0/memory-controller@3,400000 

Add enough memory to another board to hold the permanent memory pages, and then retry the unconfigure operation.

  • To confirm that a memory page cannot be moved, use the verbose option with the cfgadm command and look for the word "permanent" in the listing:

# cfgadm -av -s "select=type(memory)"
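To scan the verbose listing rather than reading it by eye, the command above can be piped through grep. This wrapper is a hypothetical convenience, not part of cfgadm: it prints only the lines that mention permanent memory and returns a non-zero status if there are none.

```shell
# Hypothetical helper: show memory attachment points whose verbose
# listing contains the word "permanent".
# Returns 0 if at least one matching line is found, 1 otherwise (grep's status).
check_permanent_memory() {
    cfgadm -av -s "select=type(memory)" | grep permanent
}
```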

Not Enough Available Memory

If the unconfigure operation fails with one of the following messages, the system would not have enough available memory if the board were removed:

cfgadm: Hardware specific failure: unconfigure N0.SB0: Insufficient memory

cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation failed

  • Reduce the memory load on the system and try again. If practical, install more memory in another board slot.

Memory Demand Increased

If the unconfigure operation fails with the following message, memory demand increased while the unconfigure operation was in progress:

cfgadm: Hardware specific failure: unconfigure N0.SB0: Memory operation refused

  • Reduce the memory load on the system and try again.

Unable to Unconfigure a CPU

CPU unconfiguration is part of the unconfiguration operation for a CPU/Memory board. If the operation fails to take the CPU offline, the following message is logged to the console:

WARNING: Processor number failed to offline. 

This failure occurs if:

  • The CPU has processes bound to it.
  • The CPU is the last one in a CPU set.
  • The CPU is the last online CPU in the system.
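The three conditions above can be checked with standard Solaris commands: psrinfo(1M) for processor status, pbind(1M) for process bindings, and psrset(1M) for processor sets. The wrapper below is a hypothetical diagnostic sketch that simply runs these in turn; it does not change any binding or set.

```shell
# Hypothetical diagnostic helper: gather the information needed to check
# why a CPU failed to go offline. It only reports; it changes nothing.
diagnose_cpu_offline() {
    echo "== Processor status (is this the last online CPU?) =="
    psrinfo

    echo "== Process bindings (are processes bound to the CPU?) =="
    pbind -q

    echo "== Processor sets (is the CPU the last one in its set?) =="
    psrset -i
}
```

If a binding or a processor set turns out to be the cause, pbind -u or psrset -r (see the respective man pages) can clear it before you retry the unconfigure operation.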

Unable to Disconnect a Board

It is possible to unconfigure a board and then discover that it cannot be disconnected. The cfgadm status display lists the board as not detachable. This problem occurs when the board is supplying an essential hardware service that cannot be relocated to an alternate board.

I/O Board Unconfiguration Failure

A device cannot be unconfigured or disconnected while it is in use. Many failures to unconfigure I/O boards occur because activity on the boards has not been stopped, or because an I/O device becomes active again after it has been stopped.

Device Busy

Disks attached to an I/O board must be idled before you attempt to unconfigure or disconnect that board. Any attempt to unconfigure or disconnect a board whose devices are still in use is rejected.

If an unconfigure operation fails because an I/O board has a busy or open device, the board is left only partially unconfigured: the operation sequence stops at the busy device.

To regain access to the devices that were not unconfigured, you must completely unconfigure the board and then configure it again.

If a device on the board is busy, the system logs a message such as the following after an attempt to unconfigure:

cfgadm: Hardware specific failure: unconfigure N0.IB6: Device busy: /ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@6,0

To continue, unmount the file systems on the busy device and retry the unconfigure operation. The board must be completely unconfigured before you try to configure it again.
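For a busy disk that holds a mounted file system, the sequence can be sketched as follows. fuser -cu reports the processes (and their login names) that use files on the file system; stop those processes, then run the function again. The function name, mount point, and board attachment point are hypothetical placeholders.

```shell
# Hypothetical helper: report users of a board-resident file system,
# unmount it, and retry the board unconfigure.
# Usage: release_mount_and_retry <mount_point> <board_ap_id>
release_mount_and_retry() {
    mnt=$1
    ap_id=$2

    # List processes (with login names) using files in the mounted file system.
    fuser -cu "$mnt"

    # Unmount the file system; this fails while the file system is still busy.
    umount "$mnt" || return 1

    # Retry the unconfigure operation on the I/O board.
    cfgadm -c unconfigure "$ap_id"
}

# Example (placeholder values):
#   release_mount_and_retry /export/data N0.IB6
```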

Problems with I/O Devices

All I/O devices must be closed before they are unconfigured.

1. To see which processes have these devices open, use the fuser(1M) command.

2. Stop the vold daemon gracefully by running the following command:

 # /etc/init.d/volmgt stop

3. Disconnect all SCSI controllers that are associated with the card that you are trying to unconfigure. To get a list of all connected SCSI controllers, use the following command:

 # cfgadm -l -s "select=class(scsi)"

4. If the redundancy features of Solaris Volume Manager (SVM) mirroring are used to access a device connected to the board, reconfigure these subsystems so that the device or network is accessible by way of controllers on other system boards.

5. Unmount file systems, including SVM metadevices that have a board-resident partition (for example, umount /partition).

6. Remove the SVM database from board-resident partitions. The location of the SVM database is explicitly chosen by the user and can be changed.

7. Remove any private regions used by Sun Volume Manager or Veritas Volume Manager.

Volume Manager by default uses a private region on each device that it controls, so such devices must be removed from Sun Volume Manager control before they can be detached.

8. Remove disk partitions from the swap configuration.

9. Either kill any process that directly opens a device or raw partition, or direct it to close the open device on the board.



Note - Unmounting file systems may affect NFS client systems.
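Steps 6 and 8 above can be sketched with the SVM metadb(1M) command and swap(1M). This is a hypothetical helper: the slice names are placeholders, and it assumes that enough SVM state database replicas remain on other boards after the board-resident ones are deleted.

```shell
# Hypothetical helper for steps 6 and 8: release one board-resident
# SVM replica slice and one board-resident swap slice.
# Assumes sufficient SVM replicas remain on other boards.
# Usage: release_board_slices <metadb_slice> <swap_slice>
release_board_slices() {
    metadb_slice=$1
    swap_slice=$2

    # Step 6: delete the SVM state database replicas on the board-resident slice.
    metadb -d "$metadb_slice" || return 1

    # Step 8: remove the board-resident slice from the swap configuration.
    swap -d "$swap_slice" || return 1

    # Show what remains so you can confirm nothing else lives on the board.
    metadb -i
    swap -l
}

# Example (placeholder slices):
#   release_board_slices c1t6d0s7 /dev/dsk/c1t6d0s1
```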



RPC or TCP Time-out or Loss of Connection

By default, RPC and TCP time-outs occur after two minutes. Quiescing the operating system during a DR operation makes the system and its network services unavailable for a period that can exceed two minutes, so administrators may need to increase this time-out value to avoid time-outs. These changes affect both client and server machines.


Configure Operation Failure

CPU/Memory Board Configuration Failure

The following problems prevent configuration of a CPU/Memory board:

  • You try to configure either CPU0 or CPU1 while the other is configured.
  • A CPU remains configured on the board.

Cannot Configure Either CPU0 or CPU1 While the Other Is Configured

Before you try to configure either CPU0 or CPU1, make sure that the other CPU is unconfigured.

CPUs on a Board Must Be Configured Before Memory

Before configuring memory, all CPUs on the system board must be configured. If you try to configure memory while one or more CPUs are unconfigured, the system displays an error message such as:

cfgadm: Hardware specific failure: configure N0.SB2::memory: Can't config memory if not all cpus are online: /ssm@0,0/memory-controller
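The configure order (all CPUs first, then memory) can be sketched as a hypothetical helper that configures the CPUs in the order given and configures memory only if every CPU succeeded. It does not handle the CPU0/CPU1 restriction described above, and the attachment-point names are assumed to follow the form shown in the error message.

```shell
# Hypothetical helper: configure the listed CPUs on a board, then its memory.
# Assumes component ap_ids of the form <board>::cpu<N> and <board>::memory.
# Usage: configure_cpus_then_mem <board_ap_id> <cpu_number>...
configure_cpus_then_mem() {
    board=$1
    shift

    # Every CPU on the board must be configured before its memory.
    for n in "$@"; do
        cfgadm -c configure "${board}::cpu${n}" || return 1
    done

    # Configure memory only after all CPUs succeeded.
    cfgadm -c configure "${board}::memory"
}

# Example (placeholder board and CPU numbers):
#   configure_cpus_then_mem N0.SB2 0 1 2 3
```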

I/O Board Configuration Failure

A configure operation may fail because the driver for a device on the I/O board does not support hot-plugging. In that case, the operation stops at the unsupported device and leaves the board only partially configured. Return the board to the unconfigured state before you attempt another configure operation. The system logs a message such as the following:

cfgadm: Hardware specific failure: configure N0.IB6: Unsafe driver present: <device path>

  • To continue the configure operation, either remove the unsupported device driver or replace it with a version of the driver that supports hot-plugging.