[Storage Networking] SAN Troubleshooting (Brocade)

Posted 2007-07-04 10:14
Introduction
A SAN is a complex system that can consist of multiple switches, hosts, storage devices, routers, and hubs. A SAN can also be as simple as a single switch with attached storage and hosts. A breakdown of the individual components yields a range of subcomponents, from simple subcomponents, such as cables, to complex subcomponents, such as switches. At a macro level, the fabric itself is considered a component that might require troubleshooting. Switches are logically positioned in the middle of the network between hosts and storage, and have visibility to both storage and hosts. This visibility into both sides of the storage network enables you to use switches to determine the cause of any malfunction in the SAN. This chapter presents a structured process for identifying marginal or faulty SAN components by helping you figure out where to start and then to methodically home in on the problem. Specific areas of focus include troubleshooting the following symptoms and SAN components:
                 Fabric
                 “Missing” devices
                 Marginal links
                 Input/Output (I/O) interruptions
The context of your problem influences how to interpret the data output by the variety of commands available in Fabric OS. For example, focus on the port state information in the switchShow output when you are troubleshooting a port issue, and on the switch status information from the same command when investigating a fabric issue. We will cover the details of how to troubleshoot using Fabric OS commands such as switchShow, errShow, portStatsShow, and other commands. Understanding host behavior and interpreting host information is also an important part of the troubleshooting process we discuss in this chapter.
The Troubleshooting Approach: The SAN Is a Virtual Cable
When first approaching troubleshooting, think of the SAN as a virtual cable. Storage traditionally involved connecting a Small Computer Systems Interface (SCSI) disk via a SCSI cable to a host; with this scenario, you focus on four components: the storage, the Host Bus Adapter (HBA), the host’s OS, and the cable/terminator. Troubleshooting a SAN is more challenging, but still has many things in common with the traditional storage troubleshooting process. To the operating system, the SAN provides a link to a disk, just as a traditional SCSI connection would.
You can apply the same “tried-and-true” process of elimination used to troubleshoot a direct-attach SCSI issue or Ethernet network issue to SAN troubleshooting. At a macro level, if you consider the SAN a virtual cable, the issue can reside in three possible areas: the host, the “cable,” or the storage. Troubleshooting can work like a binary search when you start investigating these areas. Start in the middle and determine whether you are “above” or “below” the problem, and then keep dividing the suspect path until you resolve the problem.
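The binary-search idea can be sketched in a few lines of Python. This is only an illustration: the component names in `DATA_PATH` and the `works_up_to` check are hypothetical stand-ins for whatever test you actually run at each point (for example, looking at switchShow), and the sketch assumes a single break somewhere along the path.

```python
from typing import Callable, Sequence

# Hypothetical ordering of the "virtual cable", host side first.
DATA_PATH = ["host OS", "HBA", "host cable", "switch", "storage cable", "storage"]


def first_failing(path: Sequence[str], works_up_to: Callable[[int], bool]) -> str:
    """Return the first component at which the path stops working.

    works_up_to(i) answers: does everything from path[0] through path[i]
    check out? With a single break, the answers read True, ..., True,
    False, ..., False, so the boundary can be found by halving the range.
    """
    lo, hi = 0, len(path) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1   # the problem is "below" this point
        else:
            hi = mid       # the problem is at or "above" this point
    return path[lo]


# Pretend the storage-side cable (index 4) is the faulty component.
suspect = first_failing(DATA_PATH, lambda i: i < 4)
```

Each check halves the suspect range, so a path of six components needs at most three checks.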
When troubleshooting with a simple single-switch configuration, a single host, and a single storage device, you need to focus on the HBA, the Gigabit Interface Converter (GBIC), the host’s OS, the cable, the switch, and the storage. Brocade fabrics run a single-image distributed operating system known as Fabric OS. Fabric OS delivers functionality such as Name Server, Registered State Change Notification (RSCN), Zoning, and security. These functions are part of the SAN and are also variables in the troubleshooting equation. A large SAN can consist of dozens of switches and is capable of growing to thousands of ports. Knowing where in the SAN to initiate troubleshooting can be daunting. The next section uses a typical SAN troubleshooting scenario—a host unable to “see” its disks—to illustrate the method of resolving the problem by treating the SAN as a virtual cable and working with a process of elimination.
A Typical Scenario: “I Cannot See My Disks”
We provide the scenario described in this section to introduce the troubleshooting process and to establish a framework with which you are familiar. Some terms, commands, and concepts may seem foreign. This is okay. We address everything discussed in this section in greater detail later in the chapter.
When a host cannot see its disks, one thing to check is whether that device is logically connected to the switch by reviewing the output from the switchShow command. If the device is not logically connected (that is, it does not show up as an Nx_Port), you need to focus on the port initialization. Notice that port 15 in Figure 8.1 indicates a logically connected device, as this port is connected as an F_Port. Port 14 is an example of an unsuccessful device connection, as the device connected to port 14 is connected as a G_Port. A G_Port indicates an incomplete connection to the fabric. Initially knowing that the missing device is not logically connected eliminates the host and everything on that side of the data path from the suspect list, as depicted in Figure 8.2. This includes all aspects of the host’s OS, the HBA driver settings and binaries, the HBA Basic Input Output System (BIOS) settings, the HBA GBIC, the cable going from the switch to the host, the GBIC on the switch side of that cable, and all switch settings related to the host. That is quite a lot for one command! If the missing device is logically connected to the switch, you need to check to see if the device is present in the Simple Name Server (SNS).
Figure 8.1 Example of a Successful and Unsuccessful Device Connection
core2:admin> switchshow
switchName:     core2
switchType:     2.4
switchState:    Online
switchRole:     Subordinate
switchDomain:   5
switchId:       fffc05
switchWwn:      10:00:00:60:69:10:9b:5b
switchBeacon:   OFF
port  0: sw  Online     E-Port  10:00:00:60:69:11:f9:f7 "edge1" (upstream)
port  1: sw  Online     E-Port  10:00:00:60:69:10:9b:52 "edge2"
port  2: sw  Online     E-Port  10:00:00:60:69:11:f9:f7 "edge1"
port  3: sw  Online     E-Port  10:00:00:60:69:10:9b:52 "edge2"
port  4: sw  Online     E-Port  10:00:00:60:69:12:f9:8c "edge3"
port  5: sw  Online     E-Port  10:00:00:60:69:12:f9:8c "edge3"
port  6: --  No_Module
port  7: --  No_Module
port  8: --  No_Module
port  9: --  No_Module
port 10: --  No_Module
port 11: id  Online     E-Port  10:00:00:60:69:12:f9:8c "edge3"
port 12: --  No_Module
port 13: --  No_Module
port 14: cu  Online     G-Port  //incomplete fabric connection
port 15: id  Online     F-Port  50:06:04:82:bc:01:9a:0c
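If you have captured switchShow output to a text file, a small script can pick out ports that are online but stuck as a G_Port. The following Python sketch assumes the switch prints one line per port in the layout "port N: media state type ..."; the regular expression and the function names are illustrative, not part of Fabric OS.

```python
import re

# A few port lines in the one-line-per-port layout assumed above.
SAMPLE = """\
port 11: id Online E-Port 10:00:00:60:69:12:f9:8c "edge3"
port 12: -- No_Module
port 14: cu Online G-Port
port 15: id Online F-Port 50:06:04:82:bc:01:9a:0c
"""

# port number, media type, state, and (when present) port type
PORT_LINE = re.compile(r"^port\s+(\d+):\s+(\S+)\s+(\S+)(?:\s+(\S+))?")


def port_types(text):
    """Map port number -> (state, port type or None)."""
    result = {}
    for line in text.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            number, _media, state, ptype = match.groups()
            result[int(number)] = (state, ptype)
    return result


def incomplete_ports(text):
    """Ports that are Online but never got past G-Port."""
    return [
        number
        for number, (state, ptype) in sorted(port_types(text).items())
        if state == "Online" and ptype == "G-Port"
    ]


suspects = incomplete_ports(SAMPLE)
```

For the sample lines, only port 14 is reported, which matches the unsuccessful connection called out in Figure 8.1.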
Figure 8.2 The SAN Virtual Cable
The SNS is a directory service provided by the fabric. Initiators query the Name Server much in the same way you would query a telephone directory looking for a particular person or service. If a device is not in the Name Server, it is essentially invisible to other devices in the fabric. When a device connects to the fabric, that device will register itself with the Name Server. This is similar to the situation in which you change neighborhoods and have your name listed in the new telephone directory. When an initiator, which is normally an HBA, enters the fabric, it queries the Name Server to identify all accessible devices and obtain the addresses of these devices, just like you might scan your telephone directory for a name. Some targets also will query the Name Server. Then the initiator starts the process of establishing a connection with those devices for which the Name Server provides addresses.
Check the Name Server for the presence of your missing device by issuing the nsShow command on the switch to which the device is attached (see the sample output in Figure 8.3). This will list all of the nodes connected to that switch, allowing you to determine if a particular node is accessible on the network. An alternate method is to check the Name Server list in the WEB TOOLS Graphical User Interface (GUI) on any switch in the fabric, as it contains a consolidated list of all devices in the fabric. Note that we started the process in the middle of the virtual SAN cable, which is the fabric. This is the process we described earlier as being like a binary search algorithm. You start in the middle of the data path, figure out if you are “above” the problem or “below” it, and keep dividing the suspect path in half until you identify the problem.
Figure 8.3 nsShow Sample Output
core2:admin> nsshow
The Local Name Server has 9 entries {
 Type Pid    COS     PortName                NodeName                 TTL(sec)
*N    021a00;    2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; 895
    Fabric Port Name: 20:0a:00:60:69:10:8d:fd
 NL   051edc;      3;21:00:00:20:37:d9:77:96;20:00:00:20:37:d9:77:96; na
    FC4s: FCP [SEAGATE ST318304FC      0005]
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 NL   051ee0;      3;21:00:00:20:37:d9:73:0f;20:00:00:20:37:d9:73:0f; na
    FC4s: FCP [SEAGATE ST318304FC      0005]
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 NL   051ee1;      3;21:00:00:20:37:d9:76:b3;20:00:00:20:37:d9:76:b3; na
    FC4s: FCP [SEAGATE ST318304FC      0005]
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 NL   051ee2;      3;21:00:00:20:37:d9:77:5a;20:00:00:20:37:d9:77:5a; na
    FC4s: FCP [SEAGATE ST318304FC      0005]
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 NL   051ee4;      3;21:00:00:20:37:d9:74:d7;20:00:00:20:37:d9:74:d7; na
    FC4s: FCP [SEAGATE ST318304FC      0005]
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 NL   051ee8;      3;21:00:00:20:37:d9:6f:eb;20:00:00:20:37:d9:6f:eb; na
    FC4s: FCP [SEAGATE ST318304FC      0005]
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 NL   051eef;      3;21:00:00:20:37:d9:77:45;20:00:00:20:37:d9:77:45; na
    FC4s: FCP [SEAGATE ST318304FC      0005]
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 N    051f00;    2,3;50:06:04:82:bc:01:9a:0c;50:06:04:82:bc:01:9a:0c; na
    FC4s: FCP [EMC     SYMMETRIX       5267]
    Fabric Port Name: 20:0f:00:60:69:10:9b:5b
}
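Checking a long nsShow listing by eye for one World Wide Name is error-prone. As a rough sketch, the Python below pulls the port WWNs out of captured nsShow text so you can test whether a device is registered. It assumes each entry line has the semicolon-separated layout "Type Pid; COS;PortName;NodeName; TTL"; the sample lines are taken from Figure 8.3, and the function name is made up for illustration.

```python
import re

# Two entries from Figure 8.3, one line per entry plus its detail line.
SAMPLE = """\
 NL   051edc;      3;21:00:00:20:37:d9:77:96;20:00:00:20:37:d9:77:96; na
    Fabric Port Name: 20:0e:00:60:69:10:9b:5b
 N    051f00;    2,3;50:06:04:82:bc:01:9a:0c;50:06:04:82:bc:01:9a:0c; na
    Fabric Port Name: 20:0f:00:60:69:10:9b:5b
"""

WWN = r"(?:[0-9a-f]{2}:){7}[0-9a-f]{2}"
ENTRY = re.compile(
    rf"^\*?\s*(NL|N)\s+([0-9a-f]{{6}});\s*[\d,]+;({WWN});({WWN});",
    re.IGNORECASE,
)


def registered_ports(text):
    """Map port WWN -> (port type, 24-bit port ID) for each entry line."""
    found = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            ptype, pid, port_wwn, _node_wwn = match.groups()
            found[port_wwn.lower()] = (ptype.upper(), pid.lower())
    return found


ports = registered_ports(SAMPLE)
symmetrix_present = "50:06:04:82:bc:01:9a:0c" in ports
```

A device whose WWN is absent from the result has not registered with the Name Server on that switch, which is the condition the next step of the process looks for.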

At this point, if the device is not present in the Name Server, you have narrowed your search along the virtual SAN cable to the Name Server interface between the storage and the fabric. The missing device process defined in this section is summarized in flowchart form in Figure 8.4. Note that Figure 8.4 is an excerpt from the complete missing-device troubleshooting process, which is shown in Figure 8.25. Remember that we will go deeper into this missing-device troubleshooting process and flowchart later in the chapter.
Where to Start and What Data to Gather
As stated in the previous section, SAN troubleshooting should begin in the center of the SAN and proceed outward. Once you know where to start troubleshooting, the next question is how to proceed. Start the troubleshooting process by gathering a preliminary set of data, and then analyze this data to identify where the problem resides: the host, the fabric, or the storage. Then gather additional data from the appropriate area and home in on the cause of the problem. A plethora of data is available from the switches, hosts, and storage. Knowing what data to look at and when to look at it is fundamental to the SAN troubleshooting process.
Take a Snapshot: Describe the Problem and Gather Information
Start with a general description of the problem and identify as much supporting detail as possible. At the very least, this should include a statement about what the “bad” behavior is, and a statement about what you are doing or have done to expose this behavior. Note that this is not the same as describing what you have done that causes the behavior. You might be doing something correctly, like plugging in a disk array and adding it to a zone, yet it might affect something else in the fabric if there is an underlying problem that is exposed whenever a zone change occurs.
For example, an HBA responding incorrectly to an RSCN could fail when the new zone configuration is enabled. An RSCN is a fabric service for which an edge device optionally registers. When a device registers for an RSCN, it is asking the fabric to send that device a notice anytime something in the fabric changes. For example, when a new device is added to the fabric, any devices that registered for RSCNs will receive a notice. The registered device receiving the RSCN then checks the Name Server to see what has changed and takes appropriate action. For example, if the registered device is a host and a new disk drive is added to the SAN, the host might create the necessary device operating system structures so the new device is accessible to the user.
This information will help you with the problem resolution, and might be necessary if you need to contact Brocade or any Brocade-authorized support channel. Some examples of a general problem description include:
                 When I enable a switch zoning configuration with cfgEnable, storage devices are no longer accessible to the host.
                 There are frequent pauses in I/O when I copy large files between arrays.
                 My edge device sometimes connects as an N_Port, and other times it connects as a Node Loop (NL)_Port when I power it up.
                 The fabric segments and the following error message is logged (provide error message in your description). It does this under normal operation, even when I do not touch any device on the SAN.
Include the answers to the following questions with your problem description:
                 Can you recreate the problem on demand? If so, how? (Go into detail.)
                 Is the problem intermittent? If so, how frequently does it occur?
                 Has anything at all changed recently on the fabric? If so, what? (Provide a complete list.)
                 Is the problem localized or fabric-wide? For example, is the problem happening with other devices in the fabric, or just locally with a single device attached to the switch?
                 Is this an initial install and the device was never working, or was the device working and now it has stopped working?
Other information to record:
                 If there are any error messages, include them with the problem description.
                 Firmware and driver versions for the affected HBA and storage devices.
                 Firmware and operating system versions for affected hosts and all fabric switches.
                 External switch information, such as LED state.
                 External HBA and port information, such as LED state.
                 A diagram of the SAN configuration.
                 If long-distance links are present, include information about the length and quality of the lines, and the mechanism being used to achieve the distance (for example, “The line is 10 km long, and we are using Long Wavelength [LWL] GBICs,” or “It is 80 km long, and we are using a Dense Wave Division Multiplexor [DWDM] and the Extended Fabrics product”).
Finally, gather supportShow information from the switches. The supportShow command is a switch command used to gather information about the switch and the fabric; it can provide valuable clues about what is happening in your switch network. It is like a macro in that it executes a long list of switch commands, which Brocade identifies as important for the troubleshooting process. Note that the commands that supportShow executes vary between Fabric OS releases. The v2.4.1 supportShow command executes the following switch commands:
                 version
                 uptime
                 tempShow
                 psShow
                 licenseShow
                 diagShow
                 errDump
                 switchShow
                 portFlagsShow
                 portErrShow
                 mqShow
                 portSemShow
                 portShow
                 portRegShow
                 portRouteShow
                 fabricShow
                 topologyShow
                 qlShow
                 faShow
                 portCfgLport
                 nsShow
                 nsAllShow
                 cfgShow
                 configShow
                 faultShow
                 traceShow
                 portLogDump
One benefit of supportShow is that you do not have to repeatedly retrieve various types of data, since most of the data you need is available from supportShow in one place. As this command rapidly streams in a telnet window, capture mode should be turned on prior to executing the command so that it can be captured to a text file for later review.
NOTE
It is important to execute the supportShow command at the time the problem is occurring, rather than waiting until the fabric is functioning normally.
Due to the large volume of data created by supportShow, you might choose to gather the supportShow data once and then selectively issue a subset of its commands as part of your troubleshooting process.
Troubleshooting Tools
Many tools are available to the SAN troubleshooter. Many of these tools are switch commands. Other tools involve viewing the switch LEDs, host information such as Solaris’ /var/adm/messages file, Fibre Channel analyzers, and diagnostics available on many storage arrays. Rarely is it possible to use a single tool to successfully troubleshoot a problem. It is more common to use several tools to attain a successful resolution of a problem.
Using the Switch LEDs
A significant amount of information can be gathered just by looking at the switch LEDs. At a rudimentary level, it is possible to identify that a device has faulted or is not yet online by looking for a “fast yellow.” If the switch is located in another room, you can get a visual real-time LED status using the WEB TOOLS interface. Fast flickering green lights are a sign of a healthy SAN. By physically observing the switches that make up a SAN, it is possible to detect patterns and identify a marginal or faulty component. For example, if you have a situation in which you are trying to identify a device that is repeatedly toggling online and offline, you can use the switch LEDs.
While observing a functional fabric, you can easily identify a potentially disruptive device by scanning for a port that goes offline (no LED light), sends light (steady yellow), comes online (steady green), and then cycles through the same steps: blank, yellow, green. You also want to look for correlations or patterns, such as one device going offline followed by a group of devices going offline and back online again. This situation is common in QuickLoop configurations when the first device going offline is sending a Loop Initialization Primitive (LIP), which then causes the other devices to LIP.
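If you note down the LED states you see on a port over time (or record them from WEB TOOLS), the blank, yellow, green cycle can be counted mechanically. The Python sketch below is only an illustration: the state strings and the `count_flaps` name are assumptions about how you might write your notes, not anything the switch produces.

```python
# The repeating pattern described above, in the order it is observed.
CYCLE = ("off", "steady yellow", "steady green")


def count_flaps(observations):
    """Count completed off -> steady yellow -> steady green cycles."""
    flaps = 0
    stage = 0          # index of the next state expected in CYCLE
    previous = None
    for state in observations:
        if state == previous:
            continue   # the same state noted twice in a row adds nothing
        previous = state
        if state == CYCLE[stage]:
            stage += 1
            if stage == len(CYCLE):
                flaps += 1
                stage = 0
        else:
            # Restart the match; an "off" reading begins a new cycle.
            stage = 1 if state == CYCLE[0] else 0
    return flaps


noted = [
    "steady green", "off", "steady yellow", "steady green",
    "off", "steady yellow", "steady green",
]
flap_count = count_flaps(noted)
```

A port that simply alternates between steady and flickering green yields zero, while the sequence above yields two completed cycles.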
How to Identify a Healthy SAN Using the LEDs
A settled and healthy fabric should have solid green or fast flickering green lights. A solid green light indicates an active link, while a fast flickering green light indicates I/O activity.
How to Identify a SAN Problem Using the LEDs
A yellow light or blinking yellow light indicates a problem with your SAN. An LED that transitions from yellow to green, however, is not a problem. A powered-off edge device, or edge device that is not yet online, might cause the switch LEDs to blink yellow.
Another helpful use for the LEDs is for fabric “bring up.” When bringing up a fabric, one sign that the fabric has reached convergence is steady green lights. When the fabric is coming up, the Inter-Switch Links (ISLs) go through initialization, which appears to the observer as flickering green and yellow lights prior to the fabric fully converging. Once the fabric is converged, the lights go to a steady green. Then, as I/O in the fabric begins, you will see flickering green lights on the ISL ports and the edge device ports.
A slowly flashing switch power LED indicates that the switch failed the Power-On Self-Test (POST) and is not able to come online. Refer to the associated switch manual for the location of the power LED. Table 8.1 lists the port LEDs and their definitions (you can also find this table in the Brocade SilkWorm 2800 Hardware Reference Manual).
Table 8.1 Front Panel LED Port Indicators
Port LED State                              Definition
No light showing                            No light or signal carrier (no module, no cable) for media interface
Steady yellow                               Receiving light or signal carrier, but not yet online
Slow yellow (flashes every two seconds)     Disabled (result of diagnostics, switchDisable, or portDisable command)
Fast yellow (flashes every half second)     Error, fault with port
Steady green                                Online (connected with external device over cable)
Slow green (flashes every two seconds)      Online, but segmented (loopback cable or incompatible fabric parameters)
Fast green (flashes every half second)      Internal loopback (diagnostic)
Flickering green                            Online and frames flowing through port
Switch Diagnostics
A robust set of switch diagnostics is available so you can validate the operational level of a SilkWorm switch. Several of these diagnostics, such as portLoopbackTest, are also helpful in the troubleshooting process. For example, if you suspect a bad GBIC or switch port, you can use portLoopbackTest to confirm your suspicion. Using portLoopbackTest for troubleshooting is discussed in the section “Troubleshooting Marginal Links” later in the chapter. The supportShow diagnostic command in particular, discussed in detail later in this chapter, is very helpful to the troubleshooting process. The Brocade Fabric OS manuals provide detailed descriptions of the usage of diagnostic commands. To see what diagnostic commands are available online, issue the command diagHelp at the switch prompt. The following list of diagnostic commands is available in Fabric OS v2.4.1:
                 ramTest  System DRAM diagnostic
                 portRegTest  Port register diagnostic
                 centralMemoryTest  Central memory diagnostic
                 cmiTest  CMI bus connection diagnostic
                 camTest  QuickLoop CAM diagnostic
                 portLoopbackTest  Port internal loopback diagnostic
                 sramRetentionTest  SRAM Data Retention diagnostic
                 cmemRetentionTest  Central Mem Data Retention diagnostic
                 crossPortTest  Cross-connected port diagnostic
                 spinSilk  Cross-connected line-speed exerciser
                 diagClearError  Clear diag error on specified port
                 diagDisablePost  Disable Power-On-Self-Test
                 diagEnablePost  Enable Power-On-Self-Test
                 setGbicMode  Enable tests only on ports with GBICs
                 setSplbMode  Enable 0=Dual, 1=Single port LB mode
                 supportShow  Print version, error, portLog, etc.
                 diagShow  Print diagnostic status information
                 parityCheck  Dram Parity 0=Disabled, 1=Enable
                 spinFab  ISL link diagnostic
                 loopPortTest  L_Port cable loopback diagnostic
Helpful Commands
With dozens of switch commands at your disposal, it can be difficult to determine which command to use in a given situation. An annotated list of helpful commands follows in this section, with additional commands highlighted as they relate to specific issues discussed in following sections. This list of commands is a starting point for gathering data and initiating your troubleshooting process. While the information generated by these commands is also available in supportShow, you will want to use individual commands as you advance through the troubleshooting process. SupportShow creates a significant amount of data and is helpful when you want to perform the original snapshot of the configuration and environment (to report a problem to your switch supplier), or you are not sure what data to capture.
NOTE
Although the switch commands are shown with various capitalization as originally coded in Fabric OS, the commands are no longer case-sensitive and can be entered with all lowercase if desired.
Entering the command help at the switch prompt generates a list of commands available to the user as shown in Figure 8.5. Entering the command help <command> generates a help page (similar to UNIX man pages) for that specific command. Many commands differ by the extension show or dump (for example, errShow and errDump). The difference is that show commands require you to type a return between entries, while the dump commands stream data to the screen without any pauses. Dump commands are used when you have a facility for logging command output to a file. It might be necessary to execute commands on more than one switch in the fabric, especially if the location of the problem is unclear.

As of Fabric OS 2.4.1, there is no time synchronization among the switches, which can make troubleshooting a challenge if the clocks between the switches are skewed. Before you begin troubleshooting your fabric, you should make a note of any time skew so that you can compensate for it when reading command outputs. You should also make an effort to keep switch clocks set correctly during normal operation to avoid this problem.
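Once you have noted each switch's clock against a reference clock, compensating for the skew is simple arithmetic. The following Python sketch shows one way to do it; the switch names, the sample times, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def skew_table(reference_now, switch_now):
    """Offset to add to each switch's timestamps to get reference time.

    reference_now is your reference clock reading; switch_now maps a
    switch name to what that switch's clock showed at the same moment.
    """
    return {name: reference_now - shown for name, shown in switch_now.items()}


def normalize(timestamp, switch, skews):
    """Express a timestamp logged on `switch` in reference time."""
    return timestamp + skews[switch]


# Hypothetical readings taken at the same instant.
reference = datetime(2007, 7, 4, 10, 0, 0)
skews = skew_table(
    reference,
    {
        "core2": datetime(2007, 7, 4, 10, 2, 30),   # runs 2 min 30 s fast
        "edge1": datetime(2007, 7, 4, 9, 59, 50),   # runs 10 s slow
    },
)

# An error logged on core2 at 10:05:00 by its own clock.
event = normalize(datetime(2007, 7, 4, 10, 5, 0), "core2", skews)
```

With all timestamps on one timeline, entries from different switches can be sorted together to see which event came first.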
Figure 8.5 Use the help Command to See What Commands Are Available, or help <command> for Help About a Specific Command
dev172:admin> help

agtcfgSet                      Set SNMP agent configuration
agtcfgShow                       Print SNMP agent configuration
agtcfgDefault                     Reset SNMP agent to factory default
                            .
                            .
                            .
qlHelp                            Print quick loop help info
routeHelp                        Print routing help info
trackChangesHelp                  Print Track Changes help info
zoneHelp                         Print zoning help info

dev172:admin> help errShow

NAME
     errShow - display the error log

SYNOPSIS
     errShow

AVAILABILITY
     all users

DESCRIPTION
     This command displays the error log, prompting the user to type
     return between each log entry. It is identical to errDump, except

                            .
                            .
                            .
SEE ALSO
     errDump, uptime
The errShow Command
The errShow command provides a listing of up to 64 logged errors and is helpful for identifying where a problem might reside. It sends messages to the console and to the error log. Note that the error log is cleared after a reboot or power cycle; if you want to maintain error logs that persist after reboots or power cycles, consider using the syslog facilities of the switch to log errors to persistent storage. See syslogdIpAdd, syslogdIpRemove, and syslogdIpShow for further detail on how to set up persistent logging.
When examining errShow data, which can be quite wordy, look for trends or patterns. For example, look for an excessive number of errors associated with a specific port. In addition, watch for high error-count values, which indicate a repeated error that has been logged many times. Recording a repeat count keeps errors that occur multiple times from consuming the space provided for the error log. It is important to note that a severity level is associated with every error. A warning (error level 3) is just that: a warning. An error (error level 2) or critical (error level 1) message is more severe and requires further attention.
An excerpt from the errShow help entry is provided in Figure 8.6. Please refer to the help page or the Fabric OS manual for details on interpreting Diag Err#, as the list of codes is lengthy. A Diag Err# usually indicates a problem with hardware, so contact your switch supplier for further assistance.
In addition to software errors, errShow logs environmental issues, such as over-temperature conditions, and equipment issues such as fan failures or power supply failures. A detailed list of error messages, descriptions, probable causes, and actions is maintained in the Fabric OS Reference Manual Version 2.4 (Publication Number 53-0001569-01).
Figure 8.6 Excerpt from the errShow help Entry
Each entry in the log has the same format:

   Error Number
   ------------
   taskId (taskName): Time Stamp (count)
        Error Type, Error Level, Error Message
   Diag Err#

Error Number       Starting from one. If there are more error than
                  the size of the log, only the most recent errors
                  are shown.

Task Id & Name     The ID and name of the task recording the error.

Time Stamp         The date and time of the first occurrence of
                   the error.

Error Count        For errors that occur multiple times, the repeat
                  count is shown in parenthesis. The maximum count
                  is 999.

Error Type         An uppercase string showing the firmware module
                  and error type. The switch manual contains a
                  detailed explanation of each error type.

Error Level        0  panic (the switch reboots)
                  1  critical
                  2  error
                  3  warning
                  4  information
                  5  debug

Error Message      Additional information about the error.
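When a log holds dozens of entries, it helps to rank them by severity and repeat count. The Python sketch below assumes you have already transcribed the entries into simple records; the record fields and the placeholder error types are hypothetical, and only the level numbers come from the help text above.

```python
# Error levels as listed in the errShow help entry.
LEVELS = {
    0: "panic",
    1: "critical",
    2: "error",
    3: "warning",
    4: "information",
    5: "debug",
}


def triage(entries, worst_level=2):
    """Entries at level <= worst_level, most severe first.

    Ties on severity are broken by the highest repeat count, since a
    high count indicates an error that keeps recurring.
    """
    urgent = [entry for entry in entries if entry["level"] <= worst_level]
    return sorted(urgent, key=lambda entry: (entry["level"], -entry["count"]))


# Hypothetical transcribed log entries; the type strings are placeholders.
log = [
    {"number": 1, "type": "TYPE-A", "level": 3, "count": 1},
    {"number": 2, "type": "TYPE-B", "level": 2, "count": 40},
    {"number": 3, "type": "TYPE-C", "level": 1, "count": 2},
    {"number": 4, "type": "TYPE-B", "level": 2, "count": 999},
]

ranked = [(entry["number"], LEVELS[entry["level"]]) for entry in triage(log)]
```

Passing `worst_level=3` widens the view to include warnings, which is useful when the more severe entries do not explain the symptom.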
Figure 8.7 is an example of an errShow message. The fabric is segmented, meaning that the switch that generated this message is logically disconnected from the SAN, and any devices in the SAN that are not directly connected to this switch are inaccessible to this switch. Moreover, any devices located on this switch are unable to access other devices in the fabric. The error level is a warning (3). The task ID (0x10e2b7f0) can be cross-referenced by issuing the telnet command “i” to obtain additional information on the task in question. The Task Name is self-explanatory, and interpreting it is somewhat intuitive. For example, tTransmit is the transmit task. The Task Name can be helpful in identifying the nature of the problem. Finally, the error message indicates that there is a discrepancy between the zone information contained on this switch and the zone information contained in the rest of the fabric. When the switch tried to join the fabric with this conflicting information, the join request was denied; hence, the segmented fabric. The message even identifies the zone that is causing the conflict; in this case, it is the “red” zone. This zone should be checked and compared to the rest of the fabric, and if the zone information is different, either correct or delete it.

Figure 8.7 errShow Example
The portErrShow Command
The portErrShow command is an effective command for troubleshooting marginal ports. This command provides an error summary for all ports associated with the switch and provides a status of all ports from a link integrity perspective. The key to interpreting the statistics is looking for a very high number of errors relative to the frames transmitted and frames received. For example, if 2,000,000 frames have been received and only three Cyclic Redundancy Check (CRC) errors have been logged, the ratio of CRC errors to frames received is very low and the associated port is not suspected as being marginal. On the other hand, if 2,000,000 frames have been received and 10,000 CRC errors have been logged, the ratio of CRC errors to frames received is high and the associated port should be examined further. A rough guideline is to look for errors in excess of 0.5 percent of the total number of frames transferred.
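The two examples above work out as follows. This Python sketch applies the 0.5 percent rule of thumb using integer arithmetic; note that the second example lands exactly on 0.5 percent, so the sketch treats the threshold as inclusive, which is an assumption on our part rather than something the guideline states.

```python
def error_percent(errors, frames):
    """Errors as a percentage of frames; 0.0 when no frames were seen."""
    if frames == 0:
        return 0.0
    return 100.0 * errors / frames


def is_suspect(errors, frames):
    """True when errors reach 0.5 percent of frames (inclusive).

    0.5 percent is 1 in 200, so the comparison is done with integers
    to avoid floating-point rounding at the boundary.
    """
    return frames > 0 and errors * 200 >= frames


low = is_suspect(3, 2_000_000)        # 3 CRC errors in 2,000,000 frames
high = is_suspect(10_000, 2_000_000)  # 10,000 CRC errors in 2,000,000 frames
```

Three errors in 2,000,000 frames is 0.00015 percent, far below the guideline, while 10,000 errors is 0.5 percent and flags the port for a closer look.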
Another important trend to watch is a steadily increasing number of errors. You can track increasing errors by sampling every five or ten seconds and monitoring the delta between the samples. Simple Network Management Protocol (SNMP) polling can be used to facilitate this. Also, the optionally licensed Fabric Watch product can be used to note changes in error rates over time and send out an SNMP trap or error log entry. Streaming errors is a high-order indicator and requires close monitoring—even if the error rate is less than one percent. While the error count relative to frames transmitted or received might be low, a steadily increasing number of errors indicates a marginal port.
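Tracking the delta between samples can be sketched as below. The Python assumes you already have a list of readings of one error counter taken at a fixed interval (how you collect them, by SNMP or by rerunning the command, is up to you); the function names and the three-interval window are illustrative choices, and counter resets are not handled.

```python
def deltas(samples):
    """Change in the counter between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def steadily_increasing(samples, min_intervals=3):
    """True when the counter grew in each of the last min_intervals intervals."""
    changes = deltas(samples)
    if len(changes) < min_intervals:
        return False
    return all(change > 0 for change in changes[-min_intervals:])


# Hypothetical readings of one port's CRC error counter.
quiet_port = [12, 12, 12, 12]
marginal_port = [100, 140, 185, 230]

quiet = steadily_increasing(quiet_port)
marginal = steadily_increasing(marginal_port)
```

A port whose counter jumped once and then stayed flat is not flagged, which separates a past event from errors that are still streaming.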
The portErrShow statistics shown in Figure 8.8 were gathered from a switch that had a marginal NL_Port (HBA), connected to port 6. It turned out that the Gigabit Link Module (or GLM, a connector similar to a GBIC) was failing and causing a degraded signal. Note how high the enc_in and CRC errors are!
Figure 8.8 portErrShow Example
        frames      enc   crc   too   too   bad   enc   disc  link  loss  loss  frjt  fbsy
        tx    rx    in    err   shrt  long  eof   out   c3    fail  sync  sig
------------------------------------------------------------------------------------------
port 0: 2.9g  1.7g  0     12    0     0     0     0     0     0     2     1     0     0
port 1: 305m  3.0g  0     0     0     0     0     0     0     0     1     0     0     0
port 2: 1.2g  892m  0     0     0     0     0     0     0     0     556   27    0     0
port 3: 1.1m  25m   0     0     0     0     0     82    0     4     9     4     0     0
port 4: 0     0     0     0     0     0     0     0     0     0     0     0     0     0
port 5: 9.5m  4.0g  0     0     0     0     0     0     0     0     1.4k  1.4k  0     0
port 6: 668m  4.0g  6.0m  66m   0     0     236   51m   0     87    54    11    0     0
The error statistics shown in boldface are the primary statistics on which to focus. The following listing explains relevant statistics and associated definitions:
                 enc_in  Received data: the number of 8b/10b encoding errors that have occurred inside frame boundaries. This counter is generally a zero value, although occasional errors might occur on a normal link and give a nonzero result. (Minimum compliance with the link-bit error rate specification on a link continuously receiving frames would cause approximately one error every 20 minutes.) Reinitialization or reboots of the associated Nx_Port can also cause these errors, resulting in a low error count.
                 crc_err  Received frames: the number of CRC errors detected. A CRC error indicates that the contents of a frame are no longer valid. Reinitialization or reboots of the associated Nx_Port can also cause these errors, resulting in a low count.
                 too_long  Received frames: the number of frames that were longer than the maximum Fibre Channel frame size (such as a header with more than a 2112-byte payload).
                 bad_eof  The number of frames received with a badly formed end-of-frame.
                 enc_out  Receive link: the number of 8b/10b encoding errors recorded outside frame boundaries. This number might become nonzero during link initialization, but it indicates a problem if it increments faster than the allowed link-bit error rate (approximately once every 20 minutes).
                 er_disc_c3  Receive link: the number of Class 3 frames discarded. Class 3 frames can be discarded due to timeouts or invalid or unreachable destinations. This quantity could increment at times during normal operation, but might be used for diagnosing problems in some situations.
NOTE
Steadily increasing errors between samples is a very strong sign that the associated port is not functioning properly.
Marginal link troubleshooting and related troubleshooting commands are discussed in more detail in the “Troubleshooting Marginal Links” section later in this chapter.
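To apply the ratio guideline to captured portErrShow output, the abbreviated counters first have to be turned into numbers. The sketch below assumes that the k, m, and g suffixes are decimal multipliers (thousand, million, billion) and that each port line carries the 14 columns shown in Figure 8.8; the column names and function names are illustrative only.

```python
# Column order taken from the Figure 8.8 header; the names are illustrative.
COLUMNS = [
    "frames_tx", "frames_rx", "enc_in", "crc_err", "too_shrt", "too_long",
    "bad_eof", "enc_out", "disc_c3", "link_fail", "loss_sync", "loss_sig",
    "frjt", "fbsy",
]

# Assumption: suffixes are decimal multipliers.
SUFFIXES = {"k": 1e3, "m": 1e6, "g": 1e9}


def parse_count(text):
    """Convert a counter such as '66m' or '1.4k' into a float."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def parse_port_line(line):
    """Split 'port N: v1 v2 ...' into the port number and a counter dict."""
    label, _, rest = line.partition(":")
    port = int(label.split()[1])
    values = [parse_count(field) for field in rest.split()]
    return port, dict(zip(COLUMNS, values))


# Port 6 from Figure 8.8, the port with the failing GLM.
port, stats = parse_port_line(
    "port 6: 668m 4.0g 6.0m 66m 0 0 236 51m 0 87 54 11 0 0"
)
crc_ratio = stats["crc_err"] / stats["frames_rx"]
print(port, crc_ratio)
```

Because the switch rounds the displayed counters, any ratio computed this way is approximate; it is good enough to separate a port with a handful of errors from one that is well past the 0.5 percent guideline.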
The switchShow Command
The switchShow command is another powerful command that has many uses for the troubleshooting process. An excerpt from the switchShow help entry is provided here. It is helpful for troubleshooting fabric as well as edge device connectivity issues. This command is likely to be one of the first commands you will execute as part of your troubleshooting process. The nature of the problem will dictate what switchShow data to focus on and how to interpret this data. As shown in Table 8.2 and Figure 8.9, switchShow data is loosely organized into three categories.
Table 8.2 How switchShow Data Relates to the SAN Functional Areas
Fabric-Related        Edge Device-Related        Miscellaneous
switchState           port state                 switchId
switchRole                                       switchBeacon
switchDomain                                     switchType
port state                                       switchName
Figure 8.9 switchShow Definitions
This switchShow command displays switch and port status information.
Some information varies with the switch model, e.g. number of
ports, and Domain ID values. The lines of the display show:

switchName        The switch's symbolic name.
switchType        The switch's model and revision numbers.
switchState        The switch's state: Online, Offline, Testing, Faulty.
switchRole        The switch's role: Principal, Subordinate, Disabled.
switchDomain        The switch's Domain ID: 0-31 or 1-239.
switchId        The switch's embedded port D_ID.
switchWwn        The switch's Worldwide Name.
switchBeacon        The switch's beaconing state (either ON or OFF).

The switch summary is followed by one line per port:

port number        The port number: 0-7 or 0-15.

module type        The port module type (GBIC or other):
                   -- - no module present
                   sw - shortwave laser
                   lw - longwave laser
                   cu - copper
                   id - serial ID

port state         The port's state:
                   No_Card   - no interface card present
                   No_Module - no module (GBIC or other) present
                   No_Light  - the module is not receiving light
                   No_Sync   - receiving light but out of sync
                   In_Sync   - receiving light and in sync
                   Laser_Flt - module is signaling a laser fault
                   Port_Flt  - port marked faulty
                   Diag_Flt  - port failed diagnostics
                   Lock_Ref  - locking to the reference signal
                   Testing   - running diagnostics
                   Online    - the port is up and running
comment            The comment field may be blank, or may show:
                   Disabled  - the port is disabled
                   Bypassed  - the port is bypassed (loop only)
                   Loopback  - the port is in loopback mode
                   E_Port    - fabric port, shows WWN of attached
                              switch
                   F_Port    - pt-pt port, shows WWN of attached
                              N_Port
                   G_Port    - pt-pt but not yet E_Port or F_Port
                   L_Port    - loop port, shows number of
                              NL_Ports

                   if a port is configured as a long-distance port,
                   the long distance level is shown in the format of
                   "Lx", x being the long-distance level number.
                   See portCfgLongDistance for the level description.
When troubleshooting issues involve the fabric services or a switch’s ability to participate in the fabric, the important parts of switchShow data to focus on are switchState, switchRole, and switchDomain.
Port state is applicable from a fabric perspective for observing the state of expansion ports (E_Ports). E_Ports associated with ISLs are the ports used to connect multiple switches together forming a fabric. Port state is also useful for troubleshooting connectivity problems with end devices (F_Ports and FL_Ports).
In a running fabric, the switchState should always be online. If not, access to and from the switch is not possible. It is possible that the switch may be in a transitory state as it comes online from a power cycle or reboot, so check again to make sure this is not the case. It is also possible that the switch has been manually disabled using the switchDisable command.
A switch can be operating as a principal, subordinate, or disabled, which is indicated by the switchRole variable. There is only one principal switch in the fabric, and if the principal fails, another switch will assume this role. The principal switch facilitates the bring up of the fabric and assignment of domain IDs. A switch domain ID is an address that defines the switch in a fabric. Domain IDs are automatically assigned as part of the fabric initialization process by the principal switch. It is possible to manually assign a domain ID as well. SilkWorm 1000 series switches use the domains 0–31, and SilkWorm 2000 series switches and beyond use the domains 1–239. If a switch is not a principal, it operates in a subordinate switch role. If the switch role indicates disabled, access to and from the switch is not possible and it is likely that someone disabled the switch by typing switchDisable, or the switch was unable to obtain a domain ID. When a switch is disabled, a comment of “unconfirmed” accompanies the domain ID (Figure 8.10). Normally, a switch will be in disabled state after issuing the command switchDisable. The “unconfirmed” attribute could also be caused by a problem with the fabric, which causes a switch to be unable to confirm its domain ID even though the switch is enabled. When the switch is disabled, the LEDs will blink yellow every two seconds and the port state will indicate disabled.
Figure 8.10 Switch Disabled and Unconfirmed Domain
core1:admin> switchshow
switchName:     core1
switchType:     2.4
switchState:    Offline  
switchRole:     Disabled
switchDomain:   1 (unconfirmed)
switchId:       fffc01
switchWwn:      10:00:00:60:69:10:8d:fd
switchBeacon:   OFF
port  0: sw  Laser_Flt  Disabled
port  1: sw  In_Sync    Disabled
port  2: sw  In_Sync    Disabled
port  3: sw  In_Sync    Disabled
port  4: sw  In_Sync    Disabled
port  5: sw  In_Sync    Disabled
port  6: --  No_Module  Disabled
port  7: --  No_Module  Disabled
port  8: --  No_Module  Disabled
port  9: --  No_Module  Disabled
port 10: --  No_Module  Disabled
port 11: --  No_Module  Disabled
port 12: --  No_Module  Disabled
port 13: --  No_Module  Disabled
port 14: --  No_Module  Disabled
port 15: --  No_Module  Disabled
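When switchShow output has been captured to a text file, the fabric-related fields discussed above (switchState, switchRole, and switchDomain) can be checked automatically. The following is a minimal sketch that assumes the "name: value" layout shown in Figure 8.10; the function names are hypothetical and the sample is abbreviated.

```python
SAMPLE = """\
switchName:     core1
switchType:     2.4
switchState:    Offline
switchRole:     Disabled
switchDomain:   1 (unconfirmed)
port  0: sw  Laser_Flt  Disabled
port  1: sw  In_Sync    Disabled
port  6: --  No_Module  Disabled
"""


def parse_switchshow(text):
    """Return (summary fields, per-port fields) from switchShow text."""
    summary, ports = {}, {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        key, value = key.strip(), value.strip()
        if key.startswith("port"):
            fields = value.split()
            if len(fields) < 2:
                continue
            ports[int(key.split()[1])] = {
                "module": fields[0],
                "state": fields[1],
                "comment": " ".join(fields[2:]),
            }
        else:
            summary[key] = value
    return summary, ports


def fabric_warnings(summary):
    """List the fabric-related fields that deserve a closer look."""
    warnings = []
    if summary.get("switchState") != "Online":
        warnings.append("switchState is " + summary.get("switchState", "missing"))
    if summary.get("switchRole") == "Disabled":
        warnings.append("switchRole is Disabled")
    if "unconfirmed" in summary.get("switchDomain", ""):
        warnings.append("domain ID is unconfirmed")
    return warnings


summary, ports = parse_switchshow(SAMPLE)
for warning in fabric_warnings(summary):
    print(warning)
```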

The SilkWorm 1000 series of switches uses the domain IDs 0–31, and the SilkWorm 2000 series and beyond switches use the domain IDs 1–239. Normally, a domain ID is automatically assigned when a switch joins the fabric; however, there are circumstances that can result in domain ID conflicts. This can happen when connecting two online switches that have already been assigned the same domain ID. When two switches in a fabric have the same domain ID, the fabric segments along an ISL that allows domain IDs to be unique in each segment.
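Before connecting two running fabrics, it can help to compare their domain IDs and catch a conflict ahead of time. The sketch below assumes you have already recorded each switch's domain ID (for example, from switchShow); the switch names and IDs are hypothetical.

```python
def domain_conflicts(fabric_a, fabric_b):
    """Return {domain_id: (switch_in_a, switch_in_b)} for IDs used in both fabrics."""
    by_id_a = {domain: name for name, domain in fabric_a.items()}
    conflicts = {}
    for name, domain in fabric_b.items():
        if domain in by_id_a:
            conflicts[domain] = (by_id_a[domain], name)
    return conflicts


# Hypothetical switch names and domain IDs for two fabrics about to be joined.
fabric_a = {"core1": 1, "edge1": 2}
fabric_b = {"core2": 1, "edge2": 3}

conflicts = domain_conflicts(fabric_a, fabric_b)
print(conflicts)
```

Any domain ID reported here would need to be changed on one of the two switches before the ISL is connected; otherwise the fabric is likely to segment as described above.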
The port state information generated by switchShow is pertinent to fabric-related issues if an ISL port is affected. One issue that relates to ISLs involves the port’s inability to fully initialize. While the port is online, it remains in a generic port (G_Port) state since it could not initialize as an E_Port. Another issue that affects ISLs occurs when the link is unable to initialize, resulting in the port not coming online at all. This could be caused by a marginal link, an offline switch connected to the other end of the ISL, or a fabric initialization issue. In either circumstance, it is incumbent upon the SAN administrator to establish that the port is an ISL port or an edge device that is not connected, as there is no way to tell the type of device connected until after the port initializes. Execute the commands portDisable and portEnable, providing the offending port number as an argument to try to reinitialize the port.
The Switch Name is assigned by the user and does not have to be unique in the fabric. However, uniquely naming each switch can make your SAN administration easier. With some Fabric OS versions, WEB TOOLS might not function properly if the Switch Name does not match the switch’s actual host name. You assign a Switch Name with the switchName command.
The switchId value is the switch’s 24-bit Destination ID (D_ID) address in the fabric. This is the Fibre Channel address that another switch would use to send the frame to the switch itself, rather than to a device connected to the switch. This value might appear in portLog data—for example, when the switch probes an edge device for Name Server information.
Using the switchBeacon command, you can have the switch flash a back-and-forth pattern (from left to right, and right to left) in yellow to identify the switch. This is helpful if you are doing maintenance and need to identify a switch that is positioned in a rack with many other switches. Finally, the switchType information indicates the switch model and revision in the form model.revision, as shown in Table 8.3.
Table 8.3 switchType Values and Associated Architecture
SwitchType Value        Switch Model
1        1000 series
2        2800
3        2400
4        20x0
5        22x0
Information in the port state section includes the port state, the type of media, and the World-Wide Name (WWN) of the connected device. If the attached device is a switch, the switch name is shown as well. The section also shows private, phantom, and upstream or downstream information.
The port state will typically be online or offline; however, as shown in Figure 8.4, a laser fault is also indicated when encountered. The type of interface media is shown as well, indicating the type of GBIC used. SW is for shortwave GBICs, LW is for long wavelength GBICs (for longer distances), and ID is for serial ID GBICs. Serial ID GBICs are smart GBICs with serial number and state information.
A private device is normally a loop device that does not perform a Fabric Login (FLOGI) and uses an 8-bit address. A phantom address is a 24-bit translated address for an 8-bit device. A phantom is created for the embedded port so that services and other devices within the SAN can communicate with the devices on a private loop. The switch recognizes only device addresses of 24 bits in length. Therefore, services on the switch that need to communicate with the private devices need to have a 24-bit proxy for their 8-bit addresses. Each device that wants to communicate with devices on a private loop needs to be “represented” on the loop directly. This is done by creating a phantom device for each host that wants to communicate with devices on the private loop. This phantom is acting on behalf of each of the devices that wish to communicate to devices on the loop.
The terms upstream and downstream designate that particular switch’s position in reference to the principal switch in the fabric. These paths are used in the process for assigning switch domain IDs. In Figure 8.11, notice that switch core1 is the principal switch, and all “stream” designators are downstream. For switch edge1, the path to the principal switch is upstream through port 2. There is also a downstream path from switch edge1. This path is used by switch core2 to access switch core1; hence, port 3 is designated as a downstream port. The principal switch has no upstream ports.

Figure 8.11 Upstream and Downstream Paths in Reference to switchShow Output
The port state section of the switchShow output is very helpful in identifying edge-device connection issues. These issues can involve a range of problems, from missing devices to devices initializing with the wrong topology (for example, a loop-configured device initializing as point-to-point topology). The explanation of port states and associated comments is fairly straightforward. When in doubt, check to see that the port is online, assuming a device is attached, and that the topology is correct (F_Port or L_Port). If neither of these values is present, you will need to do further analysis.
The nsShow Command
An excerpt from the nsShow help entry is provided in Figure 8.12. The most important thing about nsShow output is whether the device in which you are interested appears in the command output. If a device does not appear in the Name Server, other devices will not be able to access it. There are some instances where initiators bypass the Name Server and directly communicate with a device by using an earlier obtained address or doing a table scan of addresses. This behavior is considered suspect, as it bypasses a standard methodology. Note that hard zoning prevents such activities from occurring, ensuring that all devices behave appropriately within the SAN.
NOTE
If the device is not in the Name Server, it is most likely invisible to the rest of the fabric and therefore inaccessible.
Figure 8.12 nsShow help Page
NAME
     nsShow - display local Name Server information

SYNOPSIS
     nsShow

AVAILABILITY
     all users

DESCRIPTION
This command displays local Name Server information, which
includes information about devices connected to this switch,
and cached information about devices connected to other
switches in the fabric.

The message "There is no entry in the Local Name Server" is displayed
if there is no information in this switch, but there still may be
devices connected to other switches in the fabric. The command
nsAllShow shows information from all switches.

Each line of output shows:
*        an asterisk indicates a cached entry from another switch.
Type        U for unknown, N for N_Port, NL for NL_Port.
Pid        The 24-bit Fibre Channel address.
COS        A list of classes of service supported by the device.
PortName        The device's port Worldwide Name.
NodeName        The device's node Worldwide Name.
TTL        The time-to-live (in seconds) for cached entries, or
        'na' (not-applicable) if the entry is local.

There may be additional lines if the device has registered any of
the following information (the switch automatically registers
SCSI inquiry data for FCP target devices): FC4s supported,
(node) IP address, IPA, port and node symbolic names, fabric
port name, hard address and/or port IP address.
Often, the returned SCSI inquiry data is meaningful and indicates telling information such as the vendor, model, and the firmware revision level of the attached device, as shown in Figure 8.13. For HBAs, SCSI inquiry data occasionally is not returned and the Name Server entry is a bit sparse, so it is harder to identify the device. Some vendors are starting to allow administrators to manually populate this field to allow the textual information to be site-specific, such as node names or locations.

Figure 8.13 The nsShow Output Explained
The difference between a device's node WWN and its port WWN can be confusing. A device has only one node WWN and can have one or more port WWNs. This way, it is possible to uniquely identify multiple paths or interfaces to the same device. For example, today’s Just a Bunch of Disks (JBOD) systems usually have two ports (A and B), and each port has an associated port WWN. This enables two paths to connect to the same disk. How do you know it is the same disk? The node WWN is the same for each path, with each path having a unique port WWN. In Figure 8.14, if the entry for Port ID (PID) 0a19cb is connected on both ports A and B, the node WWN stays the same (20:00:00:20:37:26:b0:6c), the A port would have a WWN of 21:00:00:20:37:26:b0:6c, and the B port would have a WWN of 22:00:00:20:37:26:b0:6c.

Figure 8.14 The Difference between Port WWN and Node WWN
The use of node WWNs and port WWNs is not always strictly followed, and the Fibre Channel specifications are not clear on their usage. A node WWN sometimes is used to represent an entire system and all ports (Port WWNs) associated with that system.
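One way to spot multiple paths to the same device in Name Server data is to group port WWNs by node WWN. The sketch below uses the WWNs from the JBOD example above; the function name is hypothetical and, given the caveat about inconsistent WWN usage, the result should be treated as a hint rather than proof.

```python
from collections import defaultdict


def group_ports_by_node(entries):
    """Group (port_wwn, node_wwn) pairs into {node_wwn: [port_wwn, ...]}."""
    grouped = defaultdict(list)
    for port_wwn, node_wwn in entries:
        grouped[node_wwn.lower()].append(port_wwn.lower())
    return dict(grouped)


# Port A and port B of the same disk, as in the example above.
entries = [
    ("21:00:00:20:37:26:b0:6c", "20:00:00:20:37:26:b0:6c"),
    ("22:00:00:20:37:26:b0:6c", "20:00:00:20:37:26:b0:6c"),
]

paths = group_ports_by_node(entries)
multipath = {node: ports for node, ports in paths.items() if len(ports) > 1}
print(multipath)
```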
The Name Server also provides information about a device’s PID. Knowing how to decode a PID is helpful in translating a device’s SAN logical address into a SAN physical location. If you know a device’s PID, you know the physical port that device is attached to, the domain ID of the switch that device is attached to, and whether that device is an N_Port or an NL_Port. Figure 8.15 explains this decoding process further.

Figure 8.15 How to Interpret the Port Addressing
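The decoding can also be sketched in code. A PID is a 24-bit address made up of three bytes: domain, area, and AL_PA. The sketch below rests on two assumptions that are inferred from the sample data in this chapter rather than stated in it: that on these switches the area byte is 0x10 plus the port number (PID 661600 pairs with Fabric Port Name 20:06:..., and 661b00 with 20:0b:...), and that a nonzero AL_PA indicates a loop (NL_Port) device. Verify both against your own switch before relying on them.

```python
def decode_pid(pid):
    """Split a 24-bit PID given as six hex digits into its parts."""
    value = int(pid, 16)
    domain = (value >> 16) & 0xFF
    area = (value >> 8) & 0xFF
    al_pa = value & 0xFF
    return {
        "domain": domain,
        # Assumption: the area byte is 0x10 + port number on these switches.
        "port": area & 0x0F,
        "al_pa": al_pa,
        # Assumption: a nonzero AL_PA means the device sits on a loop.
        "type": "NL_Port" if al_pa else "N_Port",
    }


print(decode_pid("661600"))
print(decode_pid("7215e1"))
```

For 661600 this yields domain 102 and port 6, which matches the Local Domain ID and Fabric Port Name that appear in the SAN profile example later in this chapter.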
The topologyShow Command
The topologyShow command displays the fabric topology, as seen by the local switch. topologyShow output consists of a list of all domains that are part of the fabric, and for each of those domains, all the possible paths to reach these domains from the local switch. In addition, topologyShow displays the total number of switches in the fabric, and the domain ID of the local switch. It is also helpful to issue the switchShow command to identify directly connected switches. Look for E_Ports and the name of the switch located at the other end of the E_Port to create a SAN topology. Perform a switchShow for every switch in the fabric. First, write down the name of the switch on which the command is issued. For each E_Port on that switch, write down the name of the switch to which the E_Port connects. Then draw a line between the switch on which the command is being run and the switch that shows up on the other end of the E_Port. The data in Figure 8.16 indicates that switch edge3 is directly connected to switches core1 and core2. To identify direct-connect switches in the topologyShow output, look for domain entries with a hop count of one. To obtain additional information on the switches in the fabric, such as their IP address, use the fabricShow command.

Figure 8.16 Use topologyShow to Determine the Number of Online Switches in the SAN
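The pencil-and-paper procedure described above (write down each switch, then draw a line for every E_Port neighbor) amounts to building a set of links. A minimal sketch, assuming you have already noted the neighbor switch names from each switch's switchShow output; the neighbor lists for core1 and core2 are hypothetical and only edge3's connections come from the text.

```python
def fabric_links(neighbors):
    """Collapse per-switch E_Port neighbor lists into a sorted list of unique links."""
    links = set()
    for switch, peers in neighbors.items():
        for peer in peers:
            # Sort each pair so that A-B and B-A count as the same link.
            links.add(tuple(sorted((switch, peer))))
    return sorted(links)


neighbors = {
    "edge3": ["core1", "core2"],   # from the text: edge3 connects to core1 and core2
    "core1": ["edge3"],            # hypothetical
    "core2": ["edge3"],            # hypothetical
}

links = fabric_links(neighbors)
for left, right in links:
    print(left, "<->", right)
```

Note that a set of switch-name pairs collapses parallel ISLs between the same two switches into one link; track port numbers as well if you need to count individual ISLs.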
SAN Profile
It is recommended that you create a profile of your fabric when it is functioning normally so that you always have a baseline to compare the current state of your SAN. You will want to create a profile before making any changes to the SAN, such as firmware upgrades or additions or deletions of switches or edge devices. This information can be captured from a logging facility within telnet and stored as a text file.
When you finish your maintenance or suspect a problem, take a new profile and compare the baseline profile to your current profile. Any discrepancies require further investigation. For troubleshooting purposes, a profile should consist of the following information extracted from a healthy SAN:
                 The number of domains in the fabric, which can be obtained from topologyShow outputs.
                 The overall topology of the fabric, again from topologyShow and switchShow outputs.
                 The number of noncached Name Server entries for each switch in the fabric, which can be obtained by issuing the command nsShow.
                 The total number of Name Server entries, which can be determined by issuing the command nsAllShow.
You can also obtain this data by issuing the command supportShow for every switch and then pulling the required data out of the log. Another option is to automate the acquisition of data and then parse out the necessary fields. Figure 8.17 and Table 8.4 are examples of the necessary data collection and what a SAN profile looks like. The data to collect is bolded in Figure 8.17 as well.
Figure 8.17 Data to Collect When Establishing a SAN Profile
BigSAN102:admin> nsShow
The Local Name Server has 2 entries {
Type Pid    COS     PortName                NodeName                 TTL(sec)
N    661600;      3;50:00:60:e8:02:76:b9:04;50:00:60:e8:02:76:b9:04; na
    FC4s: FCP [HITACHI OPEN-9          0112]
    Fabric Port Name: 20:06:00:60:69:10:67:c4
N    661b00;      3;50:00:60:e8:02:76:b9:00;50:00:60:e8:02:76:b9:00; na
    FC4s: FCP [HITACHI OPEN-9          0112]
    Fabric Port Name: 20:0b:00:60:69:10:67:c4
}
BigSAN102:admin> nsAllShow
16 Nx_Ports in the Fabric {
  641300 661600 661b00 6a1100 6b1000 6b1101 6b1600 6d1100
  6d1200 6d1300 7215e1 761d01 761e00 771d00 771f00 781e00
}
BigSAN102:admin> topologyShow

26 domains in the fabric; Local Domain ID: 102

Output truncated. Make sure you capture all domain IDs in the fabric.
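If the telnet session is logged to a text file, the profile numbers can be pulled out with a few regular expressions. The sketch below matches only the summary lines that appear in Figure 8.17 and the "no entry" message quoted in the nsShow help page; the field names are hypothetical, and other Fabric OS versions may word these lines differently.

```python
import re

# Patterns based on the summary lines in Figure 8.17.
PATTERNS = {
    "local_ns_entries": re.compile(r"The Local Name Server has (\d+) entr"),
    "fabric_nx_ports": re.compile(r"(\d+) Nx_Ports in the Fabric"),
    "domains": re.compile(r"(\d+) domains in the fabric"),
    "local_domain_id": re.compile(r"Local Domain ID: (\d+)"),
}


def extract_profile(captured_text):
    """Return the profile counts found in captured CLI output."""
    profile = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(captured_text)
        if match:
            profile[field] = int(match.group(1))
    # nsShow prints this message instead of a count when nothing is local.
    if "There is no entry in the Local Name Server" in captured_text:
        profile["local_ns_entries"] = 0
    return profile


captured = """\
The Local Name Server has 2 entries {
16 Nx_Ports in the Fabric {
26 domains in the fabric; Local Domain ID: 102
"""

profile = extract_profile(captured)
print(profile)
```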

Table 8.4 Formatted SAN Profile
Switch        Local NS Entries
BigSAN100        0
BigSAN101        0
BigSAN102        2
BigSAN103        4
BigSAN104        1
BigSAN105        1
BigSAN106        4
BigSAN107        4
BigSAN108        0
BigSAN109        0
BigSAN110        0
BigSAN111        0
BigSAN112        0
BigSAN113        0
BigSAN114        0
BigSAN115        0
BigSAN116        0
BigSAN117        0
BigSAN118        0
BigSAN119        0
BigSAN120        0
BigSAN121        0
BigSAN122        0
BigSAN123        0
BigSAN124        0
BigSAN125        0
Total Nodes        16
Total Switches        26
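Once a baseline such as Table 8.4 is stored, comparing it with a later sample is a simple dictionary diff. This is a minimal sketch: the baseline values are taken from Table 8.4, while the "current" values are hypothetical and chosen to show one discrepancy.

```python
def compare_profiles(baseline, current):
    """Return {key: (baseline_value, current_value)} for every difference."""
    differences = {}
    for key in sorted(set(baseline) | set(current)):
        before, after = baseline.get(key), current.get(key)
        if before != after:
            differences[key] = (before, after)
    return differences


# Local Name Server entries per switch (subset of Table 8.4).
baseline = {"BigSAN102": 2, "BigSAN103": 4, "BigSAN104": 1}
# Hypothetical sample taken after maintenance.
current = {"BigSAN102": 2, "BigSAN103": 3, "BigSAN104": 1}

diffs = compare_profiles(baseline, current)
for switch, (before, after) in diffs.items():
    print(switch, "baseline:", before, "current:", after)
```

A switch that is missing from one of the two profiles shows up with a value of None on that side, which makes an added or removed switch as visible as a changed count.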
What Data Can a Host Provide?
A host can provide a significant amount of data to aid the SAN troubleshooting process. Think again of the SAN as a virtual cable. A working virtual SAN cable means that edge devices that are expected to communicate with each other are successfully connected as N_Port or NL_Port (verify this with switchShow), and that the devices are present in the Name Server (verify this with nsShow). Assuming that zoning is properly configured, these edge devices should be able to communicate with each other, just as if they are directly connected to each other with a cable.
A host can indicate if devices are visible to that host. In a Windows environment, do this by running Disk Administrator; in a UNIX environment, do this by issuing the format command. Many tools from HBA vendors are GUI-based and allow for real-time, live viewing of connection status to storage devices. Some examples of these tools are TROIKA’s SAN Command and JNI’s EZ Fibre. If the devices do not show up at the host when these commands are issued, the next step is to see why these devices are not visible to the host. The key to this involves reviewing the host log files. For Solaris, the message log file is normally /var/adm/messages. You can watch the SAN HBA events in real time by running tail -f /var/adm/messages. For a Microsoft environment, you can use the Event Viewer to see the HBA-related activity. You might need to change the verbosity levels or set the HBAs to debug mode to see detailed data in the message logs. An example of a log from a Solaris host is provided in Figure 8.18 to familiarize you with the data and how you might use it to assist in the troubleshooting process.

Figure 8.18 Solaris Host SAN-Related Messages
In Figure 8.18, the HBA recognizes and has visibility to seven JBOD disks. At this point, you can conclude that the SAN virtual cable is working fine and that the host has visibility to the devices at a SAN level. The next step is to see if the devices are visible to the operating system. Based on the next set of error messages, the indications are that the devices are not visible to the operating system and that the HBA should be investigated further. The problem illustrated in Figure 8.18 occurred because the host HBA drivers were not configured to bind with any targets; hence, the disks were not presented to the operating system. To resolve this issue, it is necessary to follow the HBA directions for binding SAN targets.
When to Use portLog and Other Advanced Tools
The portLog debugging tool is a low-level tool for debugging the SAN. The portLog facilities are available in two forms: portLogDump and portLogShow. The help page for portLogShow is reasonably detailed and helpful for decoding portLog entries. To effectively understand portLog data, you will need a solid background in Fibre Channel fundamentals. Training on decoding portLog data is available from Brocade (www.brocade.com/education_services).
An annotated example of a portLog entry is shown in Figure 8.19 to provide some insight into how a portLog entry is decoded. You will most likely encounter portLog data when entering the supportShow command, which calls portLogDump, or if you are requested to obtain this data by Brocade support.

Figure 8.19 portLog Entry Example
For Fibre Channel developers and people who are intimately involved with SANs, a programmer’s guide is available (Fabric Programming Guide Revision 2.1 Publication number 53-0001561-01Rev. A 4/11/00). The guide is available from the Brocade Web site (www.brocade.com); however, a login and password are required. Instructions for obtaining a login and password are posted on the Web site.
Another low-level debugging tool is a Fibre Channel analyzer. Companies such as Finisar (www.finisar.com) and Xyratex (www.xyratex.com) manufacture Fibre Channel analyzers. An analyzer is typically used in a development environment and rarely to debug production environments. Analyzers can generate a tremendous amount of data. An analyzer is usually inserted into the SAN between the switch and an edge device, or between two switches. Normally, a detailed analysis and troubleshooting effort is required to identify where to insert the analyzer into the SAN and what data the analyzer should look for. Again, an extensive background in Fibre Channel is necessary to effectively use an analyzer.
In-Depth Troubleshooting with Fibre Channel Analyzers
Although configuring SANs is getting easier with each new generation of equipment, it is often useful to have the appropriate tools for configuring and testing your SAN. As in standard Ethernet-based networks, and even in local parallel SCSI bus installations, network sniffers and bus analyzers are very handy tools to really understand what is going on. Fibre Channel cable testers can be purchased for nominal amounts, link activity analyzers for several hundred dollars, and full-blown protocol analyzers for several thousand dollars.
Fibre Channel cable testers provide simple connectivity tests for a cable; in the case of copper cables, they test for connectivity between two ends of a cable. Similar optical tools are available for checking the amount of light that is transmitted through an optical cable, and they provide convenient diagnostic capabilities for cable integrity.
An affordable alternative to full-blown protocol analyzers is a link activity analyzer. Link activity analyzers attach to Fibre Channel cables and analyze basic activity on the link. Basic functionality includes LEDs to indicate when traffic is being sent and received, as well as information such as MB/sec counters, online or offline information, error lights for CRC errors, and optical signal quality indicators. These types of link activity analyzers are ideal for isolating specific problem areas in a SAN, and identifying questionable links or devices.
Finally, for the most information about what is happening on a SAN, protocol analyzers are the best tools available. These tools will record every bit of information that comes across a wire, and through user software can play back activity, show errors, highlight questionable transactions, and more. Ranging from simple two-channel analyzers embedded in a PC to multichannel testers that can test all of the ports of a Fibre Channel switch in a single box, these analyzers are invaluable if you really want to know what is going wrong with your network.
These tools can be invaluable for debugging problems directly at the source, and are often bundled with training and classes to help you learn the basic protocol and debugging techniques. For many problems you encounter in a development environment, a protocol analyzer is the only tool that will help you really see what is going on. However, in production environments it is unnecessary to invest in a full analyzer for day-to-day operation.
Troubleshooting the Fabric
A problem with the fabric is a pervasive issue that often affects more than one device. When a fabric issue is experienced in a resilient SAN, it might have no impact on SAN functionality since the SAN redundancy compensates for the marginal situation. Table 8.5 provides a high-level review of problematic fabric symptoms and associated possible causes. Fabric issues are normally associated with heterogeneous storage and server environments in which all devices have not been tested as a system.
Table 8.5 Symptoms Indicative of a Fabric Problem
Symptom        Possible Causes
Multiple edge devices are inaccessible from multiple hosts
                 Fabric segmentation (zone conflict, mismatched fabric parameters)
                 Switch failure
                 Edge device timeout or communication conflict when accessing the Name Server (FFFFFC) or Fabric F_Port (FFFFFE)
                 Unconfirmed domain
                 Message Queue (MQ) issues
                 Hosts and/or storage attempted to access the fabric prior to fabric convergence
                 Domain ID conflict
                 Port configuration conflict
                 No fabric license installed
Incompletely initialized ISLs: ISL port initializes as a G_Port or does not come online
                 Marginal link
                 Fabric initialization error
The remainder of this section identifies what tools to use and data to analyze when a fabric issue is suspected. Symptoms are explained in further detail and specific issue traits are identified. Where possible, workarounds or corrective actions are specified.
What to Look for in a Malfunctioning Fabric
If a switch is unable to join the fabric, all devices on that switch become inaccessible to the fabric and possibly to each other. When edge devices time out or are unable to properly communicate with fabric services, communication between numerous edge devices is interrupted and some devices become inaccessible.
NOTE
When initially identifying a fabric issue, look for a large number of edge devices to be behaving marginally or not communicating at all. See if you can identify a pattern. Is the outage random throughout the fabric, or can you correlate the outage to a particular switch? Does the outage correlate to one particular host type or storage device?
Host Behavior
Hosts that are involved with a fabric problem exhibit a variety of symptoms, one of which is that some or all edge devices become inaccessible. You can verify this situation for UNIX hosts using the command format to see if any devices have disappeared. For Microsoft Windows 2000, start up the Disk Management utility and check if any devices have disappeared. The Solaris /var/adm/messages file and the Microsoft Event Viewer might provide further insight into the issue. ISL initialization issues normally are invisible to the host, as the fabric will reroute around failed ISLs and ensure connectivity—unless the ISL failure results in the SAN becoming segmented, in which case edge devices will become inaccessible to the host. Another possible symptom on the server is reduced performance of the application. In the event of an ISL failure, the fabric will reroute the traffic as mentioned. When this happens, typically, the traffic will have to share ISLs with more devices than normal, possibly resulting in reduced performance due to congestion on the ISLs. Utilities supplied by your HBA vendor can also be helpful in identifying host SAN status.
SAN Profile
If you suspect a SAN issue, create a new SAN profile and compare your baseline SAN profile to your newly created SAN profile. Any unexplained discrepancies require further investigation—whether one or more switches have dropped out, or if there are several missing Name Server entries.
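The profile comparison described above can be sketched in a few lines of Python. This is only an illustration: the SanProfile structure, the switch name edge2, and the idea of storing a profile as sets of switch names and Name Server port WWNs are assumptions made for the example, not a format defined by Fabric OS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SanProfile:
    """Hypothetical snapshot: switch names plus Name Server port WWNs."""
    switches: frozenset
    ns_entries: frozenset


def diff_profiles(baseline, current):
    """Report what is missing from, or new in, the current profile."""
    return {
        "missing_switches": sorted(baseline.switches - current.switches),
        "new_switches": sorted(current.switches - baseline.switches),
        "missing_ns_entries": sorted(baseline.ns_entries - current.ns_entries),
        "new_ns_entries": sorted(current.ns_entries - baseline.ns_entries),
    }


# Illustrative data only; "edge2" is a made-up switch name.
baseline = SanProfile(
    switches=frozenset({"core1", "edge1", "edge2"}),
    ns_entries=frozenset({"10:00:00:00:c9:21:5f:a7", "20:00:00:e0:69:f0:07:c6"}),
)
current = SanProfile(
    switches=frozenset({"core1", "edge1"}),
    ns_entries=frozenset({"10:00:00:00:c9:21:5f:a7"}),
)
report = diff_profiles(baseline, current)
```

Any non-empty "missing" list is an unexplained discrepancy that calls for the further investigation described in this section.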
Switch LEDs
If you can observe the SAN switches while the problem is occurring, try to detect an LED pattern. Focus on the ISL ports first. Any yellow lights (blinking or steady) indicate that manual intervention is required. At this point, log in to the switch with yellow lights and issue the command supportShow to extract debugging information for further analysis. If the switch is disabled (all ports blinking a slow yellow), issuing a switchEnable command will bring the switch back into the SAN. If a port is yellow (blinking or steady), you can bring the device back online by issuing the command portDisable and then a portEnable on the yellow port. Issue the command switchShow to verify the port state or a disabled switch.

The errShow Command
Start the troubleshooting process by reviewing errShow data for every switch in the fabric. Fabric segmentation and Message Queue (MQ) errors both indicate a condition that will cause the switch and its connected devices to become inaccessible to the fabric. Fabric segmentation itself is caused by zone conflicts, incompatible fabric parameters, or domain ID conflicts.
The switchShow Command
When investigating fabric issues, you need to look at switchShow for port state information and for fabric-related information. Issue the switchShow command on every switch in the fabric. Examine the port state section of the switchShow data for incompletely initialized E_Ports, which will show up as G_Ports or as ports that are not online. If the port does not reinitialize itself, then manually reinitialize the ISL by executing the commands portDisable and portEnable, providing the offending port number as an argument.
A fabric issue that has less impact involves incomplete ISL initialization. If ISL initialization issues occur, it is usually during fabric bring up. ISL initialization issues can also occur during a fabric reconfiguration, which is triggered when an ISL is added or removed or when a switch is added or removed. If the SAN is designed to be resilient, an incomplete ISL initialization minimally impacts the fabric, since there are multiple ISLs connecting the switches and edge devices are still able to communicate with each other. On the other hand, if the SAN is not resilient, an ISL initialization problem may result in a segmentation of the SAN and many devices may lose communications with the SAN.
Resilient topologies provide at least two internal fabric routes, so each can sustain a switch or ISL failure while the remaining switches and the fabric stay operational. This self-healing capability is enabled by Fabric Shortest Path First (FSPF) and is depicted in Figure 8.20.

Figure 8.20 In a Resilient SAN, an ISL Failure Does Not Affect Communication
Figure 8.20 also depicts the failure of an ISL in a cascade topology, which is the SAN located on the left. Note that switches A and B are unable to communicate with the remaining switches when the ISL marked with the “X” fails. However, a similar ISL failure in a resilient topology SAN (located on the right) does not sever communications between the remaining switches. If the ISL fails, it is still possible for switch A to communicate with switch C, using several paths, such as the path highlighted in Figure 8.20. In a resilient topology, an ISL failure might go unnoticed unless some type of monitoring is used (such as Fabric Watch, a separately licensed product available from Brocade). Additionally, with the loss of an ISL, there may also be performance degradation due to a loss of overall available bandwidth.
When reviewing the fabric-related information of switchShow, search for a switch that is disabled or has an unconfirmed domain. An unconfirmed domain indicates that the switch was unable to communicate with the principal switch in the fabric to obtain a domain ID. To resolve either situation, issue the command switchDisable followed by switchEnable to enable the switch to join the fabric.
The topologyShow Command
The topologyShow information is straightforward. You have to issue the topologyShow command on only one switch, unless that switch happens to be disabled or segmented. If this is the case, the topologyShow data will indicate the number of switches in your fabric as one, and you need to pick another switch to obtain the topologyShow information. The number of domains should equal the number of switches in the SAN. You can reference your SAN profile to establish the expected number of switches in the SAN. If there is an unexplained discrepancy, you most likely have a failed, segmented, or disabled switch. You can use switchShow data to identify a disabled or segmented switch.
If a switch that is supposed to be part of the fabric does not show up in the topologyShow output (the previous SAN profile helps here), the administrator should identify the switch, log in to it, and try first a portDisable-portEnable sequence on any of the ports that should be an E_Port. If this does not work, try a switchDisable-switchEnable sequence.
The nsShow and nsAllShow Commands
Issue the command nsAllShow on any switch in the fabric to obtain the total number of edge devices registered with the Name Server. Note that issuing the nsAllShow command on a switch that is segmented or disabled will return Name Server data for only the switch and not the entire fabric. If there is an unexplained discrepancy between this number and the number of Name Server entries recorded in your SAN profile, you will need to identify which switches are associated with the missing Name Server entries. First, check to see if there are a number of missing devices; if so, then it is likely that one of the switches has segmented or is offline. This should have been seen in the prior step. If you are unsure of what devices are missing, issue the command nsShow on each switch in the SAN and compare the number of Name Server entries to your SAN profile. Next, attempt to correlate the missing Name Server entries. Are the missing entries all associated with any particular switch or edge device? Once you rule out a segmented or disabled switch, determine if the port associated with the missing devices is online. If the port is not online, bring the port online by executing the commands portDisable and portEnable, supplying the questionable port number as an argument to these commands. This should refresh the Name Server with the missing port edge devices. If the missing Name Server device port comes online, and it still does not register with the Name Server, then it indicates that there is either a timeout or a conflict in communication between the Name Server and the edge device in question. It is now time to work with your switch supplier and edge-device supplier to resolve this complex problem.
Now that You Suspect a SAN Issue: Digging Deeper
Now that you suspect a SAN issue, you will need to investigate further to identify the root cause. The use and context of each command follows, relative to troubleshooting a SAN issue. Where possible, workarounds or corrective actions are identified. Several commands must be run on each switch; this is something that can be automated. The details for doing so are presented later in the book in Chapter 9.
Timeout of Edge Devices during Fabric Bring Up
If the problem occurs after a SAN bring up or during reconfiguration, it is possible that the edge devices came online before the SAN was ready. If this is the case, you will see flickering green and possibly flickering yellow lights on the ISL ports as the SAN converges while the edge ports remain steady green. You will also see messages on the switch console as edge devices attempt to FLOGI and Port-to-Port Login (PLOGI). Normally this is acceptable; however, if the SAN requires an extended period of time for bring up, devices might time out. Be careful to differentiate between an edge device that successfully retries PLOGIs and FLOGIs while the fabric converges and one that has actually timed out; do not interpret the retries themselves as failures. When the fabric is completely up, most devices that time out will try again; however, if they do not, a timeout failure is to be expected.
If you suspect a PLOGI/FLOGI timeout failure during fabric convergence, you can confirm your suspicions by reviewing the host logs. You can determine the SAN state by issuing a topologyShow command and verifying that the correct number of domains is in the fabric. If the edge devices are not tolerant of the time it takes the SAN to converge, they might time out their FLOGI or not successfully interact with the Name Server. In either case, that device will be inaccessible to the fabric. If you suspect this is happening with your SAN, investigate the edge device logs to conclusively determine that timeouts are occurring, the type of timeout, and how long these timeouts last. If timeouts are occurring, one resolution is to increase the timeout values in the fabric (Resource Allocation Time Out Value [R_A_TOV] or Error-Detect Time Out Value [E_D_TOV]) or on the edge devices. There might be other timeout values on the edge device that help prevent this issue; however, changing timeout values is a complex procedure and it is suggested that you work with your switch supplier and edge device supplier at that point.
Port Configuration Conflict or Missing Fabric License
If your switch is not configured with a fabric license, it cannot join the fabric. The port state section of the switchShow will indicate that the E_Ports are unknown. When you issue the command licenseShow, you should see a fabric license. If the switchShow data indicates unknown E_Ports and you do not have a fabric license installed, you will not be able to join that switch into a fabric until you acquire a fabric license from your switch supplier. The SilkWorm 2010 and 2100 switches are entry-level switches and are not configured with a fabric license, but can be upgraded with a simple license key. These switches are designed for switched loop connectivity using Brocade QuickLoop. They can have a single E_Port for connecting another QuickLoop switch; however, if additional ISLs are connected, they will not come online. The SilkWorm 2240 and 2250 are entry fabric switches designed for small SANs or for the edge of a larger SAN. They can also only support a single E_Port unless you upgrade them to a full fabric license. Figure 8.21 provides an example of a properly installed fabric license.
Figure 8.21 Example of a Properly Installed Fabric License
core1:admin> licenseShow
SRzy9Sz9zeTS0zAG:
    Web license
bbSz9eQb9zccT0AQ:
    Zoning license
RdzdSRcSyzSe0eTn:
    QuickLoop license
cSczRScd9RdTd0SY:
    Fabric license  <------A Fabric license is properly installed
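As a rough illustration of automating the license check, the following Python sketch scans licenseShow text shaped like Figure 8.21 for a Fabric license. It assumes the two-line layout shown in the figure (a key line followed by an indented license name); other Fabric OS versions may print the output differently, and the function names are invented for this example.

```python
# Sample text copied from Figure 8.21, without the annotation arrow.
LICENSE_SHOW = """\
SRzy9Sz9zeTS0zAG:
    Web license
bbSz9eQb9zccT0AQ:
    Zoning license
RdzdSRcSyzSe0eTn:
    QuickLoop license
cSczRScd9RdTd0SY:
    Fabric license
"""


def installed_licenses(text):
    """Collect license names: the indented lines that follow each key line."""
    names = []
    for line in text.splitlines():
        if line[:1].isspace() and line.strip():
            names.append(line.strip())
    return names


def has_fabric_license(text):
    """True when a 'Fabric license' entry appears in the licenseShow text."""
    return "Fabric license" in installed_licenses(text)
```

A switch whose output lacks the Fabric license entry would be the first candidate to check when its E_Ports show as unknown.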
It is possible to prevent switches from connecting into a fabric by disabling E_Port functionality on a port. You might want to do this for security purposes, since turning off E_Port capability prevents someone from attaching a switch to your fabric without first obtaining your approval. If your E_Ports are unknowingly disabled, however, it will not be possible to join the switch into a fabric. To verify the status of the switch E_Ports, issue the command portCfgEport, as shown in Figure 8.22; the same command is used to disable or enable E_Port support for a port. Note that switch port 0 E_Port capability is disabled and that you cannot use this port as an E_Port.
Figure 8.22 E_Port Configured as Disabled Example
core1:admin> portcfgEport
Ports:  0   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
      -----------------------------------------------------------
      NO  -   -   -   -   -   -   -   -   -   -   -   -   -  
Segmented Fabrics
A fabric can segment for a variety of reasons, including zone conflicts, incompatible fabric parameters, and domain ID conflicts. This section helps you identify whether you have fabric segmentation, and what type of fabric segmentation you are experiencing. A fabric might segment when you add a new switch to the fabric or upon fabric reconfiguration or bring up. The segmented fabric error message will occur on any switch to which the new switch is trying to connect. The new switch that is trying to join the fabric will show its E_Ports as unknown in the output of the switchShow command. If the fabric segments during a reconfiguration or bring up, you will have to search for a switch with unknown E_Ports, which can be determined by examining the switchShow output. You can also compare your current SAN profile to your baseline SAN profile to identify the missing switch.
Zoning Conflict
A zone conflict and fabric segmentation can occur when introducing a single- or multiple-switch fabric into an existing fabric. As these conflicts may affect the connected online devices, the switches segment and await human intervention to determine the proper resolution. There is no way to identify the correct configuration without first investigating the nature of the conflict. If there are conflicts, it may be easier to clear the configuration on the conflicted switch and then have that switch absorb the zone information when it becomes part of the fabric. Typically, there are three conditions that will create a zone conflict:
                 Multiple zoning configurations enabled  Enabling zoning on both fabrics when they are connected will create a zone conflict. Only one zone configuration can be enabled in a single fabric at a time. An example of this is if the Day configuration is enabled on one switch and the Night configuration is enabled on the other. The administrator will have to decide which one is appropriate and disable the other.
                 Zone definition type conflict  This occurs when introducing a single- or multiple-switch fabric into an existing fabric that has zoning definitions already defined, but the definition type (in other words, alias, zone) is in conflict. An example of this would be a definition of Red as a zone defining one fabric, and Red as an alias definition on another fabric. This is a definition conflict and will segment the fabric.
                 Zone definition content conflict  This occurs when introducing a single- or multiple-switch fabric into an existing fabric that has zoning definitions (in other words, alias, zone) already defined, but the content is in conflict. This is where the definition name and type match, but the content is different. An example of this would be a Red zone defined on both fabrics. On the first fabric, the Red zone was defined with domain 5, port 4, and the second fabric has the Red zone defined with domain 7, port 3. Both have a zone definition of Red, but the content is in conflict and will cause the fabric to segment. Again it will require that the administrator determine which Red zone is correct and either update the incorrect one or delete it. Once the fabrics merge, the proper Red zone will be propagated to all the switches in the fabric.
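The three conflict conditions above can be expressed as a small comparison routine. The Python sketch below is a simplified model, not the merge logic that Fabric OS actually runs: the dictionary layout, the function name, and the Blue definition are assumptions made for illustration, while the Day/Night and Red examples come from the text.

```python
def find_zone_conflicts(fabric_a, fabric_b):
    """Compare two hypothetical zoning databases.

    Each fabric is a dict with:
      "enabled_cfg": name of the enabled configuration, or None
      "defs": {name: (kind, frozenset_of_members)}, kind is "zone" or "alias"
    """
    conflicts = []
    a_cfg, b_cfg = fabric_a["enabled_cfg"], fabric_b["enabled_cfg"]
    # Condition 1: both fabrics have a (different) configuration enabled.
    if a_cfg and b_cfg and a_cfg != b_cfg:
        conflicts.append(("enabled-config", a_cfg, b_cfg))
    for name in sorted(set(fabric_a["defs"]) & set(fabric_b["defs"])):
        kind_a, members_a = fabric_a["defs"][name]
        kind_b, members_b = fabric_b["defs"][name]
        if kind_a != kind_b:
            # Condition 2: same name, different definition type.
            conflicts.append(("type", name, (kind_a, kind_b)))
        elif members_a != members_b:
            # Condition 3: same name and type, different content.
            conflicts.append(("content", name, (sorted(members_a), sorted(members_b))))
    return conflicts


fabric_a = {"enabled_cfg": "Day",
            "defs": {"Red": ("zone", frozenset({"5,4"})),
                     "Blue": ("zone", frozenset({"5,1"}))}}
fabric_b = {"enabled_cfg": "Night",
            "defs": {"Red": ("zone", frozenset({"7,3"})),
                     "Blue": ("alias", frozenset({"7,1"}))}}
conflicts = find_zone_conflicts(fabric_a, fabric_b)
```

Running a check like this against saved cfgShow data before connecting two fabrics gives the administrator a list of definitions to reconcile in advance.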

The workaround for this situation involves correcting the conflicts or clearing the zoning information on either the fabric or new switch, depending on which zoning configuration you consider to be correct and want to keep. You can clear a zoning configuration by issuing a cfgClear <configuration you want to delete> command followed by a cfgDisable <active configuration you want to delete> command. You should first save a copy of the zone configuration by issuing cfgShow and saving the output to a file (in case you mistakenly delete the wrong configurations). The configUpload command is also useful for this operation. Figure 8.23 shows a zone conflict error message.
Figure 8.23 Zone Conflict Error Message
0x10addf10 (tZone): May 15 09:37:01 (12)
    Error FABRIC-SEGMENTED, 3, port 4, zone conflict
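When errShow output is collected from many switches, a pattern match can pull out the segmentation events. The Python sketch below is written against the single message in Figure 8.23; treating the number after FABRIC-SEGMENTED as a severity level is an assumption, and messages from other Fabric OS versions may be formatted differently.

```python
import re

# Sample text copied from Figure 8.23.
MESSAGE = """\
0x10addf10 (tZone): May 15 09:37:01 (12)
    Error FABRIC-SEGMENTED, 3, port 4, zone conflict
"""

# Groups: numeric field (assumed to be a severity level), port, reason text.
SEGMENTED = re.compile(r"Error\s+FABRIC-SEGMENTED,\s*(\d+),\s*port\s+(\d+),\s*(.+)")


def segmentation_events(log_text):
    """Return one dict per FABRIC-SEGMENTED line found in the log text."""
    events = []
    for m in SEGMENTED.finditer(log_text):
        events.append({
            "level": int(m.group(1)),
            "port": int(m.group(2)),
            "reason": m.group(3).strip(),
        })
    return events
```

The reason text (here, zone conflict) indicates which of the segmentation causes in this section to pursue.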
Incompatible Fabric Parameters
Certain system configuration settings are changed by issuing the command configure. The fabric parameter system configuration settings must be the same for every switch in the fabric. The fabric will segment if there is a difference between the parameters that exist in the fabric and the parameters on a switch that is trying to join the fabric. The following parameters must be consistent between the fabric and the switch that is joining it:
  BB credit: (1..27) [16]
  R_A_TOV: (4000..120000) [10000]
  E_D_TOV: (1000..5000) [2000]
  Data field size: (256..2112) [2112]
  Sequence Level Switching: (0..1) [0]
  Disable Device Probing: (0..1) [0]
  Suppress Class F Traffic: (0..1) [0]
  SYNC IO mode: (0..1) [0]
  VC Encoded Address Mode: (0..1) [0]
  Core Switch PID Format: (0..1) [0]
  Per-frame Route Priority: (0..1) [0]
  Long Distance Fabric: (0..1) [0]
NOTE
The range of values and defaults for Fabric OS 2.4.1a are shown in the list of parameters in this section. Fabric parameters are subject to change, and you should consult the documentation of the Fabric OS version you intend to use.
Figure 8.24 shows an example of an incompatible fabric parameters error message occurring on switch edge1, and the resulting switchShow data from that switch. It is necessary to track down the switch connected on ports 0 and 2 of switch edge1 and compare the fabric parameters from that switch to those of edge1. Once you identify the discrepancy, use the configure command to change the discrepant fabric parameters of the joining switch to those of switch edge1.

Figure 8.24 Incompatible Fabric Parameters
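The parameter comparison can be sketched as a dictionary diff. In the Python example below, the parameter names and defaults are taken from the list earlier in this section; the joining switch's E_D_TOV value of 5000 and the function name are invented for illustration, and how the values are collected from each switch is left out.

```python
# Defaults from the fabric parameter list (Fabric OS 2.4.1a).
FABRIC_PARAMS = {
    "BB credit": 16,
    "R_A_TOV": 10000,
    "E_D_TOV": 2000,
    "Data field size": 2112,
    "Core Switch PID Format": 0,
}


def parameter_mismatches(fabric, joining):
    """Return {parameter: (fabric_value, joining_value)} for every difference."""
    return {
        name: (value, joining.get(name))
        for name, value in fabric.items()
        if joining.get(name) != value
    }


# Hypothetical joining switch with one altered timeout value.
joining_switch = dict(FABRIC_PARAMS, E_D_TOV=5000)
mismatches = parameter_mismatches(FABRIC_PARAMS, joining_switch)
```

Each entry in the result is a parameter to correct with the configure command on the joining switch.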
Domain ID Conflict
A domain ID conflict can occur if you join a switch that is in the online state into a fabric, and the joining switch domain ID conflicts with the domain ID of a switch in the fabric. Normally, domain IDs are automatically assigned; however, once a switch is online, the domain ID cannot change, as it would change the port addressing and potentially disrupt critical I/O. The resolution for this problem involves performing a switchDisable followed by a switchEnable on the joining switch. This will enable the joining switch to obtain a new domain ID as part of the process of coming online. The fabric principal switch will allocate the next available domain ID to the new switch during this process.
NOTE
Changing domain IDs can have an impact on port zoning entries. Be sure to check to see if any port zoning entries exist for devices on a switch before changing its domain ID, and update any affected zones to reflect the change.
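To illustrate the zone update mentioned in the note, the following Python sketch rewrites port zoning members of the form domain,port when a switch's domain ID changes. The in-memory zone dictionary and the function name are assumptions for this example; on a real fabric the change would be made with the zoning commands or WEB TOOLS.

```python
def renumber_domain(zones, old_domain, new_domain):
    """Return a copy of zones with 'old_domain,port' members renumbered.

    zones maps a zone name to a list of members. Members that are not
    'domain,port' pairs (for example, WWNs) are left untouched.
    """
    updated = {}
    for name, members in zones.items():
        new_members = []
        for member in members:
            domain, sep, port = member.partition(",")
            if sep and domain.strip().isdigit() and int(domain) == old_domain:
                new_members.append(f"{new_domain},{port.strip()}")
            else:
                new_members.append(member)
        updated[name] = new_members
    return updated


# Illustrative zone: two port members and one WWN member.
zones = {"Red": ["5,4", "7,3", "10:00:00:00:c9:21:5f:a7"]}
renumbered = renumber_domain(zones, old_domain=5, new_domain=9)
```

Only the member on the renumbered switch changes; the original dictionary is not modified.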
Message Queue Errors
An MQ error is a message queue error. You can identify an MQ error message by looking for the two letters M and Q in the error message. MQ errors can result in edge devices dropping from the Name Server or can prevent a switch from joining the fabric. MQ errors are rare and difficult to troubleshoot, and it is suggested that you resolve them by working with your switch supplier. When you encounter MQ errors, execute the supportShow command to capture debug information about the switch. A switch reboot will likely clear any associated problems. Then forward the supportShow data to your switch supplier for further investigation.
Troubleshooting Devices that Cannot Be Seen
A host that is unable to access a SAN device is one of the more common SAN issues. Again, consider the virtual SAN cable analogy to start the troubleshooting process. We want to determine whether the SAN is the cause of the problem or whether it is an edge device issue. To do this you need to work your way along the virtual SAN cable to the edge device(s) that cannot be seen. Figure 8.25 depicts a flowchart that outlines the process for troubleshooting a missing device.

Figure 8.25 Troubleshooting Devices that Cannot Be Seen
What to Look for in the Fabric
The first step is to determine whether the missing device problem is a fabric issue. A quick way to determine this is to establish whether the problem is localized to a single missing device or involves multiple missing devices. You also want to ensure all switches are online in the fabric. You can quickly check your fabric status by issuing the command topologyShow to verify that the correct number of domains exists in your fabric. You can verify that the missing device is a localized issue by entering the command nsAllShow to establish the total number of devices in the fabric. If you suspect a fabric issue, since multiple devices are missing, follow the fabric troubleshooting process. If you suspect a missing device issue, since only one or two devices are inaccessible, move on to the next section, “Are the Host and Storage Visible via switchShow on Their Respective Switches?”
Are the Host and Storage Visible via switchShow on Their Respective Switches?
Use the command switchShow on the switch to which the subject host is connected. Verify that the host port and the storage port are online. If both the storage and the host port are online, move on to the next section, as the virtual SAN cable is logically connected to both the storage and the host. If the port is not online, your host or storage might be malfunctioning, you might have a link initialization issue, or you might have a marginal link. If the edge port is not online or is a G_Port, this is analogous to having a disconnected cable. A host malfunction is a very broad term and can include problems such as incorrect or improperly installed HBA drivers, HBA parameters, or a faulty HBA. A storage malfunction can include an incorrect or improperly configured storage interface or a faulty storage interface.
NOTE
A quick method of identifying the cause of a missing device is to visually inspect your switch LEDs. Any steady or flashing yellow lights indicate that a port is not online and manual intervention is required.
Brocade SilkWorm switches by default automatically configure the appropriate port topology based on the connecting port topology, which is either N_Port or NL_Port, or in the case of a switch, an E_Port. This functionality is invaluable for SAN management, because it relieves the SAN administrator of managing and maintaining the configuration for potentially thousands of ports. In some situations, it is necessary to configure a port for a particular topology by using one or more of the commands portCfgEport, portcfgFAport, or portcfgLport to lock the port into a certain state. This may help with an issue where the edge device supports multiple port topologies and does not initialize in the mode that is desired.
A switch or port might also be configured for QuickLoop. First, check to see that the switch or port in question is configured correctly for the intended purpose. For example, if the attaching edge device is configured as an NL_Port and the switch port is configured as an F_Port, there is a conflict and that edge device might initialize as a G_Port. Initializing as a G_Port is just as bad as not initializing at all, as the associated device is essentially inaccessible. The G_Port, or generic port, is a transitional state defined in the standards as a device transitions to an F_Port or an E_Port. If the port connecting to your edge device is not intended to be a QuickLoop port, you will need to reconfigure that port, or the edge device might not initialize properly. If there is any conflict, resolve it either on the switch, by reconfiguring the port, or on the edge device, and move on to the next section, “Do the Devices Show Up in the Name Server?” If the devices support both loop and fabric modes, utilize the fabric setting to get the best performance and fault isolation.
See Figure 8.26 for the usage and examples of various port configuration commands. Switch core1 is configured for QuickLoop, as evidenced by the enabled entries in the QuickLoop mode column. Switch core1 port 8 is configured as a loop port, and no ports are configured as Fabric Assist (FA) ports. You can also use the command qlShow to determine if the switch is configured for QuickLoop. If the switch is in QuickLoop mode and no QuickLoop is required, you can issue a qlDisable command to disable QuickLoop for the entire switch. If QuickLoop is required, but is not needed for the port in question, use the qlPortDisable <port #> command for the port that needs to be changed.
Figure 8.26 Port Configuration Examples
core1:admin> qlportshowall

PortNum  QuickLoop Mode  Port State
0        Enabled         fabric      E PORT
1        Enabled         fabric      E PORT
2        Enabled         fabric      E PORT
3        Enabled         fabric      E PORT
4        Enabled         fabric      E PORT
5        Enabled         fabric      E PORT
6        Enabled         offline
7        Enabled         offline
8        Enabled         fabric
9        Enabled         offline
10       Enabled         fabric
11       Enabled         offline
12       Enabled         offline
13       Enabled         offline
14       Enabled         offline
15       Enabled         offline
core1:admin> portcfgLport
Ports:     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
        ----------------------------------------------------------------
Lock       -   -   -   -   -   -   -   - YES   -   -   -   -   -   -   -
Private    -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -

core1:admin> portcfgFAport
Ports:     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
        ----------------------------------------------------------------
           -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -

If the port is not online or initializes as a G_Port, attempt to reinitialize the port by executing the commands portDisable and portEnable, supplying the port number in question as an argument to these commands. If this process works, monitor the situation carefully. If the host port consistently does not come online or comes up as a G_Port repeatedly, you might have a marginal link issue, a faulty HBA, HBA driver, or some type of configuration conflict between the host and the switch. At this point, you need to follow the process of troubleshooting a marginal link. If the link is not marginal, contact your switch supplier and HBA supplier to assist with further troubleshooting.
Follow a similar process for the storage port. If the storage port is not online or is a G_Port, this is analogous to a disconnected cable at the storage end. Attempt to reinitialize the port by issuing a portDisable/portEnable. Next, rule out a marginal link, faulty storage equipment, and configuration conflict between the storage and the switch. If you are still unable to establish the root cause, work with your switch supplier and your storage supplier to assist with further troubleshooting.
Do the Devices Show Up in the Name Server?
At this point, you have verified that the host and storage are logically connected to the virtual SAN cable, and it is now necessary to confirm that the two edge ports are able to communicate. Use nsShow on the switch to which the storage is connected and the switch to which the host is connected to verify that these edge devices are registered with the Name Server. If you intend to verify that an Emulex HBA located on switch core1 port 8 is registered with the Name Server, the data in Figure 8.27 would confirm this.
Figure 8.27 nsShow Example—Verifying that an Emulex HBA Is Registered with the Name Server
core1:admin> nsShow
The Local Name Server has 2 entries {
Type Pid    COS     PortName          NodeName          TTL(sec)
N    011800;   
    2,3;10:00:00:00:c9:21:5f:a7;20:00:00:00:c9:21:5f:a7; na
    NodeSymb: [35] "Emulex LP8000 FV3.02    DV5-4.52A7 "
    Fabric Port Name: 20:08:00:60:69:10:8d:fd
N    011a00;   
    2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; na
    Fabric Port Name: 20:0a:00:60:69:10:8d:fd
}
If the devices in question are registered with the Name Server, it is possible that you are experiencing a zoning mismatch or a host/storage issue. If one or both devices are not registered with the Name Server, it is possible that there is a timeout or communication issue between the edge device(s) and the Name Server. Check the edge device documentation to determine if there is a timeout setting or parameter that may help. If this does not work, contact the support organization for the product that appears to be timing out.
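Checking Name Server registration across many switches is easier with a small parser. The Python sketch below reads text laid out like Figure 8.27 and extracts each entry's port ID and WWNs; the regular expressions are written against that one sample and may need adjusting for other Fabric OS versions. The domain is taken from the first byte of the 24-bit port ID, which follows the standard Fibre Channel address layout.

```python
import re

# Sample text copied from Figure 8.27.
NS_SHOW = """\
The Local Name Server has 2 entries {
Type Pid    COS     PortName          NodeName          TTL(sec)
N    011800;
    2,3;10:00:00:00:c9:21:5f:a7;20:00:00:00:c9:21:5f:a7; na
    NodeSymb: [35] "Emulex LP8000 FV3.02    DV5-4.52A7 "
    Fabric Port Name: 20:08:00:60:69:10:8d:fd
N    011a00;
    2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; na
    Fabric Port Name: 20:0a:00:60:69:10:8d:fd
}
"""

ENTRY = re.compile(r"^(\w+)\s+([0-9a-fA-F]{6});")
WWNS = re.compile(r"^\s*[\d,]+;([0-9a-fA-F:]{23});([0-9a-fA-F:]{23});")


def parse_ns_show(text):
    """Return a list of dicts with type, pid, domain, port_wwn, node_wwn."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            pid = m.group(2).lower()
            entries.append({"type": m.group(1), "pid": pid,
                            "domain": int(pid[:2], 16)})
            continue
        m = WWNS.match(line)
        if m and entries:
            # The COS/PortName/NodeName line belongs to the latest entry.
            entries[-1]["port_wwn"], entries[-1]["node_wwn"] = m.groups()
    return entries


def is_registered(entries, port_wwn):
    """True when the given port WWN appears in the parsed entries."""
    return any(e.get("port_wwn") == port_wwn.lower() for e in entries)


entries = parse_ns_show(NS_SHOW)
```

With the entries parsed, is_registered(entries, "10:00:00:00:c9:21:5f:a7") confirms that the Emulex HBA from the figure is present.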
Rule Out Zoning Issues
It is easy to rule out a zoning mismatch if zoning is not enabled. Check to see if zoning is enabled by issuing the cfgShow command. If the output states that no configuration is in effect, zoning is not enabled. If zoning is enabled, it is possible that the two edge devices are unable to communicate with each other due to zoning conflicts. To confirm whether this is the case, review the active zoning configuration. You can do this by again issuing the command cfgShow, as shown in Figure 8.28. In this example, host1 can access disk1, and host2 can access disk2, but host1 cannot access host2 or disk2, and host2 cannot access host1 or disk1. Confirm that the specific edge devices that need to communicate with each other are in the same zone. If they are not, and zoning is active, you need to update your zoning configuration before the edge devices in question are able to communicate with each other. For example, if host1 needs to get access to disk2, it is necessary to update the zoning configuration to enable this access. Once the zone changes are made via the command line or WEB TOOLS-based GUI, the devices should be able to access one another; however, some operating systems might require that you run a disk utility such as format or disk administrator. It is also possible that some operating systems might require a reboot to allow discovery of the new devices.
Figure 8.28 Zoning Example
core1:admin> cfgshow
Defined configuration:
cfg:   colors  red; yellow
zone:  red     host1; disk1
zone:  yellow  host2; disk2
alias: disk1   0,0
alias: disk2   0,1
alias: host1   1,14
alias: host2   1,15

Effective configuration:
cfg:   colors
zone:  red     1,14
                0,0
zone:  yellow  1,15
                0,1
NOTE
If zoning is active, any devices that are not explicitly defined in a zone together are not able to communicate with each other.
At this point, if you establish that there is no switch zoning mismatch, then you have established that the SAN virtual cable is working and that it is likely a host or storage issue. One possible host or storage issue that could be causing the “missing” devices is a mismatch with the HBA or storage-based zoning; be sure to check this first when troubleshooting the edge devices.
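The zone membership check described above lends itself to a small script. The following Python sketch is illustrative only: it assumes the cfgShow text has been captured and looks like Figure 8.28, it reads only the "Effective configuration" section, and it compares members as literal strings (here in domain,port form). The names effective_zones and can_communicate are hypothetical.

```python
# Sample text copied from Figure 8.28.
CFGSHOW_SAMPLE = """\
Defined configuration:
cfg:   colors  red; yellow
zone:  red     host1; disk1
zone:  yellow  host2; disk2
alias: disk1   0,0
alias: disk2   0,1
alias: host1   1,14
alias: host2   1,15

Effective configuration:
cfg:   colors
zone:  red     1,14
                0,0
zone:  yellow  1,15
                0,1
"""


def effective_zones(text):
    """Map each effective zone name to the set of its members."""
    zones = {}
    in_effective = False
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Effective configuration"):
            in_effective = True
            continue
        if not in_effective or not stripped:
            continue
        if stripped.startswith("zone:"):
            parts = stripped.split()
            current = parts[1]
            zones[current] = {p.strip(";") for p in parts[2:] if p.strip(";")}
        elif stripped.startswith("cfg:"):
            current = None
        elif current is not None:
            # Continuation line: further members of the current zone.
            zones[current].update(
                p.strip(";") for p in stripped.split() if p.strip(";")
            )
    return zones


def can_communicate(text, member_a, member_b):
    """True if both members share at least one effective zone.

    Assumption: when no effective zones are found, zoning is treated as
    not enabled, so nothing blocks the two devices at the switch level.
    """
    zones = effective_zones(text)
    if not zones:
        return True
    return any(member_a in m and member_b in m for m in zones.values())
```

With the sample configuration, can_communicate(CFGSHOW_SAMPLE, "1,14", "0,0") reflects that host1 and disk1 share zone red, while the pair "1,14" and "0,1" (host1 and disk2) shares no zone.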
NOTE
Incorrect or incomplete zoning is one of the most common causes of SAN communication problems. Checking for this is analogous to checking to see if a “malfunctioning” computer monitor is plugged in.
Edge Device Not in the Name Server
Reaching this point implies that you have verified that the edge devices in question are connected to the switch, and that one or more of the edge devices are not registered in the Name Server. Attempt to reinitialize the edge device(s) with the Name Server by executing the commands portDisable and portEnable, supplying the port number(s) in question as an argument to these commands. If the devices then register with the Name Server, you have resolved the problem. However, keep watching for this issue: if the problem recurs, it indicates a complex problem that is best resolved by working with your switch and edge device suppliers. You should also seek this type of assistance if the devices do not register with the Name Server after a portDisable/portEnable, because this indicates a complex issue such as a communication conflict or timeout condition. Although edge devices should reconnect to the fabric and register when the port is re-enabled, some older devices might time out and no longer retry logging in. If this happens, you might need to reboot the device to get it to reset and log in to the fabric and Name Server.
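The order of checks for a "missing" device can be summarized as a small decision function. This Python sketch only paraphrases the steps in this section; the function name next_step and its four boolean inputs are hypothetical, and the values would come from your own reading of switchShow, nsShow, and cfgShow output.

```python
def next_step(port_logged_in, in_name_server, zoning_enabled, share_zone):
    """Suggest the next troubleshooting action for a missing device.

    port_logged_in: switchShow shows the port online and not as a G_Port
    in_name_server: the device appears in nsShow
    zoning_enabled: cfgShow reports an effective configuration
    share_zone:     host and storage are members of a common zone
    """
    if not port_logged_in:
        return ("Reinitialize the port with portDisable/portEnable, then "
                "rule out a marginal link, faulty equipment, and a "
                "configuration conflict.")
    if not in_name_server:
        return ("Reinitialize the port with portDisable/portEnable; if the "
                "device still does not register, reboot it or contact the "
                "switch and edge device suppliers.")
    if zoning_enabled and not share_zone:
        return "Update the zoning configuration so both devices share a zone."
    return ("The SAN virtual cable is working; check the host and storage, "
            "starting with HBA or storage-based zoning.")
```

The checks are ordered from the physical connection outward, so each answer is only meaningful once the earlier ones have passed.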
Troubleshooting Marginal Links
A marginal switch port is defined as a switch port that either is receiving a marginal incoming signal or has a receiver that is not functioning properly. A marginal Nx_Port transmit signal can be caused by a failing optical component (GBIC or GLM) on the Nx_Port or by a cable issue. A failing Fx_Port receiver can be caused by a failing switch optical component or a failing switch port, as depicted in Figure 8.29.

Figure 8.29 Marginal Port Elements
Marginal Point-to-Point/Fabric Device Links
The impact of a marginal port can be significant. For example, a port on a large storage device such as an HP XP512, an IBM Enterprise Storage Server, or an EMC Symmetrix might be accessed by dozens of hosts. Marginal behavior on this storage port has the potential to impact every device that accesses it. Imagine that you are a part of a geographically distributed team of six workers. The primary communication for this team is via telephone. Assume that your telephone is functioning marginally (similar to a poor cellular connection). Anyone who wants to call you will not be able to communicate effectively with you. Conversely, anyone whom you call will also be unable to communicate effectively with you. If you are the team leader for this group, the impact of your marginal telephone is significant, since many people rely on you as a resource. Note that the others in the group are free to communicate with each other without experiencing any impact from your telephone problems. The story can have a happy ending if you have access to two telephones and, realizing that one line is marginal, switch to the working telephone. Many SANs are constructed in a similar fashion, as in Figure 8.30, with dual paths between hosts and storage, so that a single failure does not result in an I/O failure. In applications where availability is key, dual- or even triple-redundant fabrics are always recommended.

Figure 8.30 Dual-Fabric SAN Design
Marginal Loop Connections
While a marginal point-to-point link affects only devices that access the point-to-point device, the ramifications of a malfunctioning loop-connected device can impact all devices in that loop. Extending the geographically distributed team analogy further, imagine that the only way the team communicates is via teleconference. Whenever the team needs to communicate, everyone dials in to a conference call. Unfortunately, the teleconference is disrupted by your marginal telephone link. What makes things even worse is that communication between any other team members is impossible or very difficult. For example, it is very difficult for one member to speak with another on the teleconference because your marginal telephone continually creates static on the teleconference.
Brocade QuickLoop and Fabric Assist are unique Fibre Channel topologies that combine aspects of arbitrated loop and fabric topologies. They are composed of multiple private arbitrated loops (looplets) interconnected by a fabric. This arrangement is best described as a Private Loop Fabric Attach, as compared to Private Loop Direct Attached (PLDA) or Fabric Loop Attachment (FLA). The FL_Port of each looplet is hidden from the NL_Ports. QuickLoop is a logical PLDA that complies with the FC-AL standard. Although NL_Port devices are attached to different arbitrated loops interconnected by a fabric, the fabric and the physical device locations are transparent. QuickLoop enables switches to be used in place of hubs in environments where all attached devices are private devices. Fabric Assist mode allows the configuration of a virtual private loop, called a QuickLoop Fabric Assist mode zone, in which a private host can see and access public or private targets anywhere on the fabric, provided they are configured in the same Fabric Assist zone. A public target accessed by a private host remains public, with full fabric functionality.
The nature of loops is such that the behavior of an unhealthy device on the loop can adversely impact the behavior of the remaining devices on the loop. For example, a marginal GBIC could degrade the signal to the point where the connected NL_Port (host or storage) device is no longer able to effectively communicate. This in turn causes the loop to reset. When a loop resets, so do the individual hosts or storage devices connected to that loop. Under normal circumstances, a loop reset does not cause any harm. However, if a device is constantly resetting, I/O flow can become severely restricted or halted.
Loop Initialization Primitives (LIPs) are part of a healthy loop and are used for a variety of purposes—most commonly to signal other devices on the loop that a new device has been added, or that an existing device has left the loop. When a loop or NL_Port resets, LIPs are generated. However, an excessive number of LIPs will make a loop unstable.
The Fibre Channel standards community is making great strides in further enhancing the functionality of loops. However, loops are starting to become a legacy issue. It is important to note that Fibre Channel and SilkWorm switches also support point-to-point topologies, which are not subject to the same disruptive behaviors that loops are. When a public device accesses a private device (known as translative mode), the LIP is not propagated to that public device, nor is that public device subject to disruption.
Nx_Port (Host/Storage) Behavior with a Marginal Port in the Loop
When a marginal device disrupts the loop, a variety of symptoms can be present. Performance for devices connected to the QuickLoop, or for devices accessing a common device, can be described as slow. Host logs (that is, /var/adm/messages, eventlog, or syslog) might indicate that I/O is timing out or that the interface is being reset. On a healthy port, the switch LED should be solid green or blinking green. Green lights mixed with yellow lights, or flashing yellow lights, indicate that the ports are resetting themselves. Devices on the affected loop might FLOGI and/or PLOGI repeatedly onto the fabric as part of a reset process initiated by the HBA. This would show up on the console or telnet management session for the switch to which the affected device is attached. N_Port devices are less susceptible to disruption for reasons stated earlier.

Marginal GBIC/Cable
You can use the er_enc_out statistic to identify a marginal GBIC. Active devices (such as disks) normally clean up an encoding error as these errors are encountered, and mark the frame as having bad CRC. Any er_enc_out errors are encoding errors outside a frame, and do not generate a CRC error. If a high count (for example, several thousand) or incrementing counts of er_enc_out errors are experienced on a particular port, this indicates that the signal is marginal between the connected device’s transmit port and the switch’s receive port. Because this situation is being recorded as encoding errors, the implication is that there is no active device cleaning up the errors between the switch receive and the connected device transmit. The diagnosis: marginal GBIC or cable on the connected device.
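The "high or incrementing count" rule can be applied mechanically to two samples of the counter taken some time apart. The Python sketch below is an illustration under assumptions: the er_enc_out values are supplied as plain dictionaries keyed by port number (collected however you prefer), the function name marginal_ports is hypothetical, and the default threshold of 3000 is only a stand-in for "several thousand".

```python
def marginal_ports(first_sample, second_sample, high_count=3000):
    """Return ports whose er_enc_out count is high or still incrementing.

    first_sample, second_sample: dicts mapping port number to the
    er_enc_out value read at two points in time.
    high_count: assumed threshold for a "high" count; tune for your SAN.
    """
    suspects = []
    for port in sorted(second_sample):
        now = second_sample[port]
        # A port missing from the first sample is judged on its count alone.
        before = first_sample.get(port, now)
        if now > before or now >= high_count:
            suspects.append(port)
    return suspects


# Example: port 1 is incrementing, port 2 is high but static.
earlier = {0: 0, 1: 12, 2: 5000}
later = {0: 0, 1: 40, 2: 5000}
suspect_ports = marginal_ports(earlier, later)
```

A port that is both high and still incrementing deserves attention first, since a static count may be left over from an earlier event such as a cable reseat.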
Connected Device
Note that LIPs are normal in a healthy loop. An imbalance where the Lip_in count is larger than the Lip_out count indicates that the associated connected device is the originator of LIPs in the loop. A device that generates a large number of LIPs might be malfunctioning. The switch will propagate LIPs in accordance with the Fibre Channel specification. Propagated LIPs are recorded as Lip_out.
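The Lip_in versus Lip_out comparison is easy to automate across the ports of a loop. This Python sketch assumes the two counters have already been read from portShow for each port and placed in a dictionary; the function name lip_originators is hypothetical.

```python
def lip_originators(lip_counts):
    """Rank ports whose attached device appears to originate LIPs.

    lip_counts: dict mapping port number to a (lip_in, lip_out) tuple.
    A port qualifies when lip_in exceeds lip_out; the result is ordered
    by the size of the imbalance, largest first.
    """
    imbalanced = [
        (lip_in - lip_out, port)
        for port, (lip_in, lip_out) in lip_counts.items()
        if lip_in > lip_out
    ]
    return [port for _, port in sorted(imbalanced, reverse=True)]


# Example: the device on port 5 sends far more LIPs than it receives.
counts = {4: (3, 3), 5: (250, 4), 6: (10, 8)}
originators = lip_originators(counts)
```

Because LIPs are normal in a healthy loop, a small imbalance is not proof of a fault; the ranking only indicates where to look first.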
Fault Isolation
Once a marginal port is identified, it is necessary to identify where the fault resides. Figure 8.31 depicts a suggested fault isolation process. Fault isolation on a loop is very difficult, which is one of the reasons why loops had limited success.

Figure 8.31 Marginal Link Fault Isolation
How the Switch Can Help: Fabric Watch and QuickLoop Zoning
By virtue of being positioned between storage and host, the switch is a natural resource for gathering statistics and troubleshooting. As shown earlier, the switch can help mitigate the issues that arise when a marginal device disrupts a loop or other N_Port devices.
Brocade Fabric Watch allows each switch to continuously monitor fabric elements for irregular conditions. Fabric Watch can assist in rapidly identifying and escalating potential problems. This proactive management improves the overall availability of the SAN. Specific to troubleshooting marginal links, Fabric Watch can detect such failing port symptoms as excessive CRC errors and proactively send an SNMP alert. It is also possible to telnet into the switch and quickly analyze statistics to identify the marginal port.
To minimize the impact of a marginal device in a loop, you can utilize QuickLoop zoning or Fabric Assist to compartmentalize various host/storage pairs. QuickLoop zoning or Fabric Assist prevents LIPs from propagating between QuickLoop zones. In some respects, QuickLoop zoning turns one loop into multiple virtual loops. In Figure 8.32, a LIP generated by Host A in zone qlZone1 due to a marginal port does not propagate to qlZone2 or qlZone3. Without QuickLoop zoning, a marginal port has the potential to limit or halt I/O for all devices connected to the switch!

Figure 8.32 QuickLoop Zoning Example
Overview of SilkWorm Port Error Statistics
Additional SilkWorm port statistics can be obtained by executing the following telnet commands:
                 portShow <port #>
                 portStatsShow <port #>
Use portStatsShow for error statistics (such as CRC, encoding, bad End of Frame [EOF], etc.), and use portShow for link-level and LIP statistics (such as link failure, loss of sync, loss of signal, etc.). The portShow command offers similar statistics to portStatsShow. However, the statistics gathered by portShow are updated in software whenever a port interrupt is received, while the statistics for portStatsShow are updated in hardware registers as they occur. The significance of this difference is that many errors, such as CRC errors, could occur between interrupts. The hardware counters (portStatsShow) will capture these between-interrupt errors, while the software counters (portShow) might not. Another difference between the two commands is that portShow provides LIP statistics and link statistics (link failure, loss of signal, loss of sync), while portStatsShow does not. A partial listing of relevant portShow statistics follows:
                 Lip_in  Number of LIPs transmitted from the connected device to the switch port. Does not apply to F_Port.
                 Lip_out  Number of LIPs transmitted from the switch port to the connected device. Does not apply to F_Port.
                 Lip_rx  Type of LIP (F7, F8) last received by the switch from the connected device. Does not apply to F_Port.
Troubleshooting I/O Pauses
I/O pauses happen, and both the SAN and edge device can and should tolerate such events. The term I/O pause is somewhat generic. An I/O pause can be as harsh as the powering down of a host or storage device while I/O is in transit, which will cause I/O to cease. Alternatively, it can be as lightweight as a port-level RSCN, which might be a problem for only the most latency-sensitive of applications. Most HBAs currently pause I/O during RSCN processing; however, updated drivers are expected to minimize this effect. Relative to the SAN, fabric events can also cause a pause in I/O. A fabric event can be broken down into a change, such as a switch reboot, and the resultant activity to respond to that change. In the case of a switch reboot, not only are the devices connected to that switch affected, but also devices connected to the fabric—even if the fabric is resilient. This is because the fabric needs to reroute, which takes less than a second, and because all devices connected to the SAN that have registered for state change notification must process a global RSCN. Edge devices such as HBAs and storage devices should be tolerant of such pauses in I/O. It is possible to adjust the settings for these devices to accommodate longer or shorter delays in I/O when a SAN event occurs. RSCNs are normal and key to SAN operation.
Several applications, such as video-on-demand and applications that are evolving into the SAN model (tape backup, for example), are very sensitive to latency and/or RSCNs. High latencies and large numbers of RSCNs can adversely affect these applications. Storage vendors, switch vendors, application vendors, and HBA vendors are working with the standards bodies (T11) as well as modifying their product implementations to handle these types of exceptions. Table 8.6 lists common events that cause fabric rerouting and/or fabric RSCNs.
Table 8.6 Fabric Events and Their Impact
Event        Generate Global RSCN?        Will Result in Reroute?
switchDisable  Disabling a switch in the fabric requires the fabric to reconfigure and a new set of data path routes to be established for the resulting downsized fabric.        Yes        Yes
switchEnable  The counterpart of switchDisable. A new switch added to the fabric results in new route calculations to allow for the added ports.        Yes        Sometimes
E_Port connection/disconnection  Adding or removing an ISL causes a fabric RSCN.        Yes        Sometimes
Zone update  Occurs when you execute a cfgEnable or cfgDisable command.        Yes        No
Switch addition/removal  Adding a switch to the fabric or removing one from it.        Yes        Sometimes
Troubleshooting fabric events and their adverse impact on applications and the SAN is a complex process. If you suspect that a fabric event is adversely affecting your SAN, work with your switch supplier for resolution.
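When planning maintenance, it can help to encode Table 8.6 as a lookup so that the expected impact of a set of changes is visible in advance. The Python sketch below simply transcribes the table; the event labels and the function name expected_impact are hypothetical, and the result is a worst-case summary rather than a prediction for a specific fabric.

```python
# (generates global RSCN?, results in reroute?) per event, from Table 8.6.
FABRIC_EVENTS = {
    "switchDisable": ("yes", "yes"),
    "switchEnable": ("yes", "sometimes"),
    "E_Port connection/disconnection": ("yes", "sometimes"),
    "zone update": ("yes", "no"),
    "switch addition/removal": ("yes", "sometimes"),
}

# Ordering used to pick the worst case across several events.
_RANK = {"no": 0, "sometimes": 1, "yes": 2}


def expected_impact(events):
    """Summarize the worst-case impact of a list of planned fabric events."""
    rscn = reroute = "no"
    for event in events:
        event_rscn, event_reroute = FABRIC_EVENTS[event]
        rscn = max(rscn, event_rscn, key=_RANK.get)
        reroute = max(reroute, event_reroute, key=_RANK.get)
    return {"global_rscn": rscn, "reroute": reroute}


# Example: a zone change combined with bringing a switch back online.
plan = expected_impact(["zone update", "switchEnable"])
```

An unknown event label raises a KeyError, which is deliberate: an event missing from the table should be looked up rather than silently assumed harmless.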
Summary
It can be helpful to think of the SAN as a virtual cable when it comes to troubleshooting, approaching the problem by breaking components down to a host, the SAN virtual cable, and the storage. To the operating system, the SAN provides a link to a disk, just as a traditional SCSI connection would. Troubleshooting a SAN is more challenging, but still has many things in common with the traditional storage troubleshooting process. Switches are logically positioned in the middle of the network between hosts and storage, and have visibility to both storage and hosts. This visibility into both sides of the storage network enables you to use switches to determine the cause of any malfunction in the SAN.
SAN troubleshooting should begin in the center of the SAN and proceed outward. Once you know where to start troubleshooting, the next question is how to proceed. Start the troubleshooting process by gathering a preliminary set of data, and then analyze this data to identify where the problem resides: the host, the fabric, or the storage. Next, gather additional data from the appropriate area and focus in on the cause of the problem. A plethora of data is available from the switches, hosts, and storage.
Many tools are available to the SAN troubleshooter. Several of these tools are switch commands. Other tools involve viewing the switch LEDs, host information, Fibre Channel analyzers, and diagnostics available on many storage arrays. It is rarely possible to use a single tool to successfully troubleshoot a problem. It is more common to use several tools in concert.
A fabric problem is a pervasive issue that can often affect more than one device. When a fabric issue is experienced in a resilient SAN, it might have no impact on SAN functionality, because the SAN redundancy compensates for the marginal situation. However, these “soft” errors can cause degradation in the performance of the enterprise application and thus require immediate attention. Fabric issues are normally associated with large fabrics, which are defined as fabrics consisting of 10 or more switches and 100 or more edge devices.
A host that is unable to access a SAN device is a more common issue. This type of issue is classified as a missing device. Use of the commands switchShow and nsShow can quickly reveal the cause of the missing device. Missing device issues are normally limited to a few devices. If more devices are involved, it is likely a fabric issue.
The impact of a marginal port can be significant. For example, a large storage device might be accessed by potentially dozens of hosts. The marginal behavior of this storage device then has the potential to impact all devices that access this storage port. A marginal link involves the connection between the switch and the edge device. Isolating the exact cause of a marginal link involves analyzing and testing many of the components that make up the link: switch port, switch GBIC, cable, edge device GBIC, and the edge device.
I/O pauses do happen, and both the SAN and edge device can and should tolerate such events. The term I/O pause is somewhat generic. An I/O pause can be as severe as the powering down of a host or storage device while I/O is in transit, which will cause I/O to cease. Alternatively, it can be as lightweight as a port-level RSCN, which might be a problem for only the most latency-sensitive applications. Relative to the SAN, fabric events can also cause a pause in I/O. Calibrating your edge devices to handle I/O pauses and troubleshooting I/O pauses is a complex process.
Solutions Fast Track
The Troubleshooting Approach: The SAN Is a Virtual Cable
               Use the SAN’s visibility to both storage and hosts to start your troubleshooting process.
               The switchShow, nsShow, nsAllShow, errShow, and topologyShow commands are highly informative and invaluable to the troubleshooting process.
               The UNIX format command or HBA vendor-supplied utilities are also helpful in troubleshooting.
               When you start the troubleshooting process, determine if the issue is fabric related or device related. A fabric-related issue impacts many devices, and a device issue impacts only a few devices.
Troubleshooting the Fabric
               A fabric issue impacts many devices. A logical switch outage, such as segmentation or physical switch outage, can cause many devices to drop out of the fabric. Problems with ISL initialization are also considered fabric issues.
               The quickest way to narrow your search of a fabric problem is to compare your baseline SAN profile to your current SAN profile and investigate discrepancies.
               A SAN profile includes the number of devices per switch (nsShow), number of devices in the fabric (nsAllShow), and number of switches in the fabric (topologyShow). The errShow and switchShow commands are also helpful in tracking down fabric issues.
               Some fabric issues are caused by a mismatch in fabric service timeout variables and the edge device timeout settings. Careful analysis of both the fabric and the edge devices is necessary to resolve this complex issue.
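The baseline comparison described in these points can be sketched as a simple dictionary diff. The following Python example is a sketch under assumptions: the profile keys are hypothetical names for the counts you would record from topologyShow, nsAllShow, and nsShow, and the numbers are made up for illustration.

```python
def profile_discrepancies(baseline, current):
    """Return {key: (baseline_value, current_value)} for every difference.

    Keys present in only one profile are reported with None for the
    missing side, which flags a switch that has appeared or disappeared.
    """
    diffs = {}
    for key in sorted(set(baseline) | set(current)):
        if baseline.get(key) != current.get(key):
            diffs[key] = (baseline.get(key), current.get(key))
    return diffs


# Hypothetical profiles: counts recorded when the SAN was healthy, and now.
baseline_profile = {
    "switches_in_fabric": 4,   # from topologyShow
    "devices_in_fabric": 36,   # from nsAllShow
    "devices_on_core1": 10,    # from nsShow on core1
    "devices_on_edge2": 8,     # from nsShow on edge2
}
current_profile = {
    "switches_in_fabric": 3,
    "devices_in_fabric": 28,
    "devices_on_core1": 10,
}
changes = profile_discrepancies(baseline_profile, current_profile)
```

In this made-up case the diff points at edge2: its device count is missing entirely and the fabric has lost one switch and eight devices, which suggests a switch outage or segmentation rather than a single missing device.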
Troubleshooting Devices that Cannot Be Seen
               The first thing to check is that the missing device is logically connected to the SAN as indicated by switchShow output.
               Next, check to see that the device is present in the Name Server, using the command nsShow. If the device is not in the Name Server, it is invisible to the other devices in the fabric.
               Other common causes of missing devices are zone conflicts or marginal links.
Troubleshooting Marginal Links
               Use portErrShow to establish if there are a relatively high number of errors, such as CRC errors. Look for a steadily increasing number of errors to confirm a marginal link.
               A marginal link can impact multiple devices. For example, a shared storage device with a marginal link can cause communication problems with all devices that access that shared storage.
               A marginal link can be caused by any of the components that make up the link: switch port, switch GBIC, cable, edge device GBIC, and the edge device.
Troubleshooting I/O Pauses
               I/O pauses happen, and both the SAN and edge device can and should tolerate such events.
               An I/O pause can be as harsh as the powering down of a host or storage device while I/O is in transit, which will cause I/O to cease. Alternatively, it might be as lightweight as a port-level RSCN, which might be a problem for only the most latency-sensitive applications. Relative to the SAN, fabric events can also cause a pause in I/O.
               Several applications, such as video-on-demand and applications that are evolving into the SAN model, such as tape backup, are very sensitive to latency and/or RSCNs. High latencies and large numbers of RSCNs can adversely affect these applications.
               Storage vendors, switch vendors, application vendors, and HBA vendors are working with the standards bodies (T11) as well as modifying their product implementations to handle these types of exceptions.
Frequently Asked Questions
Q: When I activate a zone change (cfgEnable), I notice a pause in I/O and several of my hosts log warnings. What causes this?
A: When you issue a zone change, an RSCN is delivered to any host in the fabric that registers to receive an RSCN. The pause you notice is the initiator responding to the RSCN, which involves the initiator querying the Name Server and resolving any changes to the fabric.
Q: If I exhaust my troubleshooting options and cannot resolve an issue after reading this chapter, what should my next step be?
A: Contact your switch supplier and request support. Provide the information outlined earlier in this chapter. Of special importance is the supportShow output, which is ideally captured while the problem is happening.
Q: How can I tell if my fabric is segmented?
A: Normally, a segmented fabric will generate an error message on the switch that segments. You can view errors by issuing the command errShow.
Q: How come my device inconsistently connects to the switch as either an N_Port or an NL_Port?
A: It is likely that there is a bug in the port initialization of either the edge device or the switch. A short-term solution is to configure a port for a specific topology. For example, configure a port as an FL_Port by using the command portcfgLport. Longer term, you should resolve this behavior by escalating the problem to your switch supplier and your edge device supplier.
Q: What is a quick way to reinitialize to clear a fault or re-enable a link?
A: The commands portDisable and portEnable, issued in that order, cause a port to reinitialize and potentially clear a fault. When the port comes back up, the edge device should log in again and register with the Name Server.
