Posted on 2007-07-04 10:14

Introduction

A SAN is a complex system that can consist of multiple switches, hosts, storage devices, routers, and hubs. A SAN can also be as simple as a single switch with attached storage and hosts. A breakdown of the individual components yields a range of subcomponents, from simple subcomponents, such as cables, to complex subcomponents, such as switches. At a macro level, the fabric itself is considered a component that might require troubleshooting. Switches are logically positioned in the middle of the network between hosts and storage, and have visibility to both storage and hosts. This visibility into both sides of the storage network enables you to use switches to determine the cause of any malfunction in the SAN. This chapter presents a structured process for identifying marginal or faulty SAN components by helping you figure out where to start and then to methodically home in on the problem. Specific areas of focus include troubleshooting the following symptoms and SAN components:

§ Fabric
§ “Missing” devices
§ Marginal links
§ Input/Output (I/O) interruptions

The context of your problem influences how to interpret the data output by the variety of commands available in Fabric OS. For example, focus on the port state information in the switchShow output when you are troubleshooting a port issue, and on the switch status information from the same command when you are investigating a fabric issue. We cover the details of how to troubleshoot using Fabric OS commands such as switchShow, errShow, and portStatsShow. Understanding host behavior and interpreting host information is also an important part of the troubleshooting process we discuss in this chapter.

The Troubleshooting Approach: The SAN Is a Virtual Cable

When first approaching troubleshooting, think of the SAN as a virtual cable. Storage traditionally involved connecting a Small Computer Systems Interface (SCSI) disk via a SCSI cable to a host; with this scenario, you focus on four components: the storage, the Host Bus Adapter (HBA), the host’s OS, and the cable/terminator. Troubleshooting a SAN is more challenging, but still has many things in common with the traditional storage troubleshooting process. To the operating system, the SAN provides a link to a disk, just as a traditional SCSI connection would.

You can apply the same “tried-and-true” process of elimination used to troubleshoot a direct-attach SCSI issue or an Ethernet network issue to SAN troubleshooting. At a macro level, if you consider the SAN a virtual cable, the issue can reside in three possible areas: the host, the “cable,” or the storage. Troubleshooting can work like a binary search when you start investigating these areas. Start in the middle and determine whether you are “above” or “below” the problem, and then keep dividing the suspect path until you resolve the problem.
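The binary-search idea can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: there is a single fault, so every component before it passes a health check and every component from the fault onward fails. The function name, the component labels, and the check callback are hypothetical and are not part of Fabric OS.

```python
from typing import Callable, Optional, Sequence


def first_failing_hop(
    path: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first component in `path` that fails the check, or None.

    Assumes a single fault: components before the fault pass the check,
    and the faulty component and everything after it fail.
    """
    lo, hi = 0, len(path)  # the fault, if any, lies in path[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2  # start in the middle of the suspect path
        if is_healthy(path[mid]):
            lo = mid + 1  # the problem is further along the path
        else:
            hi = mid  # the problem is at mid or earlier
    return path[lo] if lo < len(path) else None


# Hypothetical data path, ordered from host to storage.
data_path = ["host OS", "HBA", "host cable", "switch port",
             "storage cable", "storage"]

# Pretend everything up to the switch port checks out.
healthy = {"host OS", "HBA", "host cable", "switch port"}
print(first_failing_hop(data_path, lambda c: c in healthy))  # storage cable
```

In practice each "check" is a manual step, such as reading switchShow output, but the halving logic is the same: one check in the middle removes half of the suspect path.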

When troubleshooting with a simple single-switch configuration, a single host, and a single storage device, you need to focus on the HBA, the Gigabit Interface Converter (GBIC), the host’s OS, the cable, the switch, and the storage. Brocade fabrics run a single-image distributed operating system known as Fabric OS. Fabric OS delivers functionality such as Name Server, Registered State Change Notification (RSCN), Zoning, and security. These functions are part of the SAN and are also variables in the troubleshooting equation. A large SAN can consist of dozens of switches and is capable of growing to thousands of ports. Knowing where in the SAN to initiate troubleshooting can be daunting. The next section uses a typical SAN troubleshooting scenario—a host unable to “see” its disks—to illustrate the method of resolving the problem by treating the SAN as a virtual cable and working with a process of elimination.

A Typical Scenario: “I Cannot See My Disks”

We provide the scenario described in this section to introduce the troubleshooting process and to establish a framework with which you are familiar. Some terms, commands, and concepts may seem foreign. This is okay. We address everything discussed in this section in greater detail later in the chapter.

When a host cannot see its disks, one thing to check is whether that device is logically connected to the switch by reviewing the output from the switchShow command. If the device is not logically connected (that is, it does not show up as an Nx_Port), you need to focus on the port initialization. Notice that port 15 in Figure 8.1 indicates a logically connected device, as this port is connected as an F_Port. Port 14 is an example of an unsuccessful device connection, as the device connected to port 14 is connected as a G_Port. A G_Port indicates an incomplete connection to the fabric. Initially knowing that the missing device is not logically connected eliminates the host and everything on that side of the data path from the suspect list, as depicted in Figure 8.2. This includes all aspects of the host’s OS, the HBA driver settings and binaries, the HBA Basic Input Output System (BIOS) settings, the HBA GBIC, the cable going from the switch to the host, the GBIC on the switch side of that cable, and all switch settings related to the host. That is quite a lot for one command! If the missing device is logically connected to the switch, you need to check to see if the device is present in the Simple Name Server (SNS).

Figure 8.1 Example of a Successful and Unsuccessful Device Connection

core2:admin> switchshow
switchName:     core2
switchType:     2.4
switchState:    Online
switchRole:     Subordinate
switchDomain:   5
switchId:       fffc05
switchWwn:      10:00:00:60:69:10:9b:5b
switchBeacon:   OFF
port  0: sw  Online     E-Port  10:00:00:60:69:11:f9:f7 "edge1" (upstream)
port  1: sw  Online     E-Port  10:00:00:60:69:10:9b:52 "edge2"
port  2: sw  Online     E-Port  10:00:00:60:69:11:f9:f7 "edge1"
port  3: sw  Online     E-Port  10:00:00:60:69:10:9b:52 "edge2"
port  4: sw  Online     E-Port  10:00:00:60:69:12:f9:8c "edge3"
port  5: sw  Online     E-Port  10:00:00:60:69:12:f9:8c "edge3"
port  6: --  No_Module
port  7: --  No_Module
port  8: --  No_Module
port  9: --  No_Module
port 10: --  No_Module
port 11: id  Online     E-Port  10:00:00:60:69:12:f9:8c "edge3"
port 12: --  No_Module
port 13: --  No_Module
port 14: cu  Online     G-Port  //incomplete fabric connection
port 15: id  Online     F-Port  50:06:04:82:bc:01:9a:0c
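If you capture switchShow output to a text file, a short script can flag ports that are online but have not completed fabric login. The following Python sketch assumes one port per line in the layout of Figure 8.1 (port number, media type, state, optional port type); other Fabric OS versions may format the output differently, so treat the regular expression and the function names as illustrative assumptions, not as a Brocade tool.

```python
import re

# Assumed line layout: "port 14: cu  Online  G-Port ..."
PORT_LINE = re.compile(
    r"^port\s+(?P<num>\d+):\s+(?P<media>\S+)\s+(?P<state>\S+)"
    r"(?:\s+(?P<ptype>[A-Z]+-Port))?"
)


def classify_ports(switchshow_text):
    """Map port number -> (state, port type or None)."""
    ports = {}
    for line in switchshow_text.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            ports[int(match.group("num"))] = (
                match.group("state"),
                match.group("ptype"),
            )
    return ports


def incomplete_logins(ports):
    """Ports that are Online but still show up as a G-Port."""
    return sorted(
        num
        for num, (state, ptype) in ports.items()
        if state == "Online" and ptype == "G-Port"
    )


# A few lines taken from Figure 8.1.
sample = """\
port  0: sw  Online  E-Port  10:00:00:60:69:11:f9:f7 "edge1" (upstream)
port  6: --  No_Module
port 14: cu  Online  G-Port
port 15: id  Online  F-Port  50:06:04:82:bc:01:9a:0c
"""

print(incomplete_logins(classify_ports(sample)))  # [14]
```

A port reported here is a candidate for port-initialization troubleshooting; a device that appears as an F-Port has logged in, and the next check is the Name Server.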

Figure 8.2 The SAN Virtual Cable

The SNS is a directory service provided by the fabric. Initiators query the Name Server much in the same way you would query a telephone directory looking for a particular person or service. If a device is not in the Name Server, it is essentially invisible to other devices in the fabric. When a device connects to the fabric, that device will register itself with the Name Server. This is similar to the situation in which you change neighborhoods and have your name listed in the new telephone directory. When an initiator, which is normally an HBA, enters the fabric, it queries the Name Server to identify all accessible devices and obtain the addresses of these devices, just like you might scan your telephone directory for a name. Some targets also will query the Name Server. Then the initiator starts the process of establishing a connection with those devices for which the Name Server provides addresses.
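The telephone-directory analogy can be made concrete with a toy model. The Python sketch below is not the Fibre Channel Name Server protocol; it only mimics the register-then-query behavior described above, and the class and method names are hypothetical. The addresses and World Wide Names are the two entries shown in Figure 8.3.

```python
class ToyNameServer:
    """A toy directory: port WWN -> fabric address (PID)."""

    def __init__(self):
        self._entries = {}

    def register(self, port_wwn, pid):
        """Called when a device connects to the fabric."""
        self._entries[port_wwn.lower()] = pid

    def query_all(self, requester_wwn):
        """Return every registered device except the requester itself."""
        requester = requester_wwn.lower()
        return {
            wwn: pid
            for wwn, pid in self._entries.items()
            if wwn != requester
        }

    def is_visible(self, port_wwn):
        """A device that never registered is invisible to other devices."""
        return port_wwn.lower() in self._entries


ns = ToyNameServer()
ns.register("20:00:00:e0:69:f0:07:c6", "021a00")  # device 1 joins the fabric
ns.register("21:00:00:20:37:d9:77:96", "051edc")  # device 2 joins the fabric

# Device 1 asks the directory what else is out there.
print(ns.query_all("20:00:00:e0:69:f0:07:c6"))
# A device that never registered cannot be found.
print(ns.is_visible("50:06:04:82:bc:01:9a:0c"))  # False
```

The model deliberately ignores everything else the fabric does with this information; it only shows why a device that is missing from the directory cannot be discovered by an initiator.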

Check the Name Server for the presence of your missing device by issuing the nsShow command on the switch to which the device is attached (see the sample output in Figure 8.3). This lists all of the nodes connected to that switch, allowing you to determine if a particular node is accessible on the network. An alternate method is to check the Name Server list in the WEB TOOLS Graphical User Interface (GUI) on any switch in the fabric, as it contains a consolidated list of all devices in the fabric. Note that we started the process in the middle of the virtual SAN cable, which is the fabric. This is the process we described earlier as being like a binary search algorithm. You start in the middle of the data path, figure out whether you are “above” or “below” the problem, and keep dividing the suspect path in half until you identify the problem.
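When the Name Server holds many entries, searching saved nsShow output for one port World Wide Name by script is less error-prone than reading it by eye. The Python sketch below assumes each entry sits on one line with semicolon-separated fields (type, PID, COS, PortName, NodeName), as in the sample output of Figure 8.3; the function name and the parsing pattern are assumptions made for illustration.

```python
import re

# Assumed entry layout: "*N  021a00;  2,3;<PortName>;<NodeName>; 895"
ENTRY = re.compile(
    r"^\*?\s*(?P<type>NL|N)\s+(?P<pid>[0-9a-fA-F]{6});\s*(?P<cos>[\d,]+);"
    r"(?P<port_wwn>[0-9a-fA-F:]{23});(?P<node_wwn>[0-9a-fA-F:]{23});"
)


def find_entry(nsshow_text, port_wwn):
    """Return the parsed entry for `port_wwn`, or None if it is absent."""
    wanted = port_wwn.lower()
    for line in nsshow_text.splitlines():
        match = ENTRY.match(line.strip())
        if match and match.group("port_wwn").lower() == wanted:
            return match.groupdict()
    return None


# The two entries shown in Figure 8.3.
sample = """\
*N    021a00;    2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; 895
    Fabric Port Name: 20:0a:00:60:69:10:8d:fd
 NL   051edc;      3;21:00:00:20:37:d9:77:96;20:00:00:20:37:d9:77:96; na
"""

entry = find_entry(sample, "21:00:00:20:37:d9:77:96")
print(entry["pid"] if entry else "not in the Name Server")  # 051edc
```

A result of None means the device is missing from the Name Server on that switch, which points the investigation toward the device's fabric login and registration.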

Figure 8.3 nsShow Sample Output

core2:admin> nsshow
The Local Name Server has 9 entries {
 Type Pid    COS     PortName                NodeName                 TTL(sec)
*N    021a00;    2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; 895
    Fabric Port Name: 20:0a:00:60:69:10:8d:fd
 NL   051edc;      3;21:00:00:20:37:d9:77:96;20:00:00:20:37:d9:77:96; na