http://dsumsky.blogspot.com/2008_07_01_archive.html
I will continue from the point where I finished last time. The remaining diagnostic tools depend on the sysfs interface: the information they provide is extracted from this filesystem. If you don't remember the meaning of each entry under the /sys/class/infiniband directory, use these tools instead.
One common IB network issue is that the IB subnet manager is not running. The IB nodes then have no LIDs assigned and they aren't able to see each other. The ports of a node are connected but they aren't initialized yet. To find this out without sysfs, use the ibstat command:
CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.400
Hardware version: a0
Node GUID: 0x0003ba0001007ba8
System image GUID: 0x0003ba0001007bab
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007baa
The output contains everything we need: port state, LID, GUID and rate. The IB link is up but the ports are in the INIT state. No IB subnet manager is running. This is clear because the SM lid parameter is zero; it should hold the LID of the node which acts as the IB subnet manager. The same holds for Base lid, where a zero value means that the IB network isn't initialized yet. Similar information is provided by the ibnetdiscover command:
vendid=0x144f
devid=0x0
switchguid=0x144f00006e9794
Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 0
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 0
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 0
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 0
vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001003de7
caguid=0x3ba0001003de4
Ca 2 "H-0003ba0001003de4" # "node2 HCA-1"
[2] "S-00144f00006e9794"[6] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[5] # lid 0 lmc 0 "" lid 2
vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001007bab
caguid=0x3ba0001007ba8
Ca 2 "H-0003ba0001007ba8" # "node1 HCA-1"
[2] "S-00144f00006e9794"[4] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[3] # lid 0 lmc 0 "" lid 2
The square brackets contain the physical port number at the switch. As we can see, the network is up and discoverable. It consists of one IB switch and two IB nodes, each with a dual-ported IB HCA. The switch has LID 2 assigned, while the nodes aren't initialized yet. To display the node GUIDs only, use the ibnodes command:
Ca : 0x0003ba0001003de4 ports 2 "node2 HCA-1"
Ca : 0x0003ba0001007ba8 ports 2 "node1 HCA-1"
Switch : 0x00144f00006e9794 ports 9 "" base port 0 lid 2 lmc 0
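The ibnetdiscover listing shown earlier can also be scanned by a script instead of by eye. The following is only a minimal sketch in Python: the regular expression is my assumption and matches just the link line format printed in this post, which may differ between OFED versions. The function name unassigned_links is hypothetical.

```python
import re

# Sample lines copied from the ibnetdiscover output shown in this post.
SAMPLE = '''\
Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 0
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 0
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 0
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 0
'''

# Assumed line layout: [switch port] "peer node"[peer port] # "description" lid N
LINK_RE = re.compile(
    r'^\[(\d+)\]\s+"([^"]+)"\[(\d+)\]\s+#\s+"([^"]*)"\s+lid\s+(\d+)'
)


def unassigned_links(text):
    """Return (switch_port, description, peer_port) for peers whose LID is 0."""
    result = []
    for line in text.splitlines():
        match = LINK_RE.match(line.strip())
        if match and int(match.group(5)) == 0:
            result.append(
                (int(match.group(1)), match.group(4), int(match.group(3)))
            )
    return result


for switch_port, description, peer_port in unassigned_links(SAMPLE):
    print(f"switch port {switch_port}: {description} port {peer_port} has no LID")
```

Every HCA port behind the switch is reported here, which matches the observation that the nodes have not been initialized by a subnet manager.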
Finally, two commands remain: ibroute and ibchecknet. As the IB network is not fully initialized, the nodes can't contact the switch for its forwarding table, so the ibroute command, which is helpful otherwise, doesn't work here. The ibchecknet command produces address resolution errors because the IB network is not valid:
lid 2 address resolution: FAILED
# Switch: nodeguid 0x00144f00006e9794 failed
# Checking Ca: nodeguid 0x0003ba0001003de4
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001003de4 failed
# Checking Ca: nodeguid 0x0003ba0001007ba8
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001007ba8 failed
## Summary: 3 nodes checked, 3 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold
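If you want to watch the ibchecknet result from a script, the summary lines are the easiest part to pick up. This is just a sketch in Python under the assumption that the summary always has the wording shown above; the function names parse_summary and fabric_ok are hypothetical.

```python
import re

# Summary lines copied from the ibchecknet output shown in this post.
SAMPLE = '''\
## Summary: 3 nodes checked, 3 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold
'''


def parse_summary(text):
    """Extract the counters from the '##' summary lines."""
    counters = {}
    match = re.search(r"(\d+) nodes checked, (\d+) bad nodes found", text)
    if match:
        counters["nodes"] = int(match.group(1))
        counters["bad_nodes"] = int(match.group(2))
    match = re.search(r"(\d+) ports checked, (\d+) bad ports found", text)
    if match:
        counters["ports"] = int(match.group(1))
        counters["bad_ports"] = int(match.group(2))
    match = re.search(r"(\d+) ports have errors beyond threshold", text)
    if match:
        counters["ports_over_threshold"] = int(match.group(1))
    return counters


def fabric_ok(counters):
    """True only when every failure counter is present and zero."""
    keys = ("bad_nodes", "bad_ports", "ports_over_threshold")
    return all(counters.get(key, 1) == 0 for key in keys)


summary = parse_summary(SAMPLE)
print(summary)
print("fabric ok" if fabric_ok(summary) else "fabric has problems")
```

A missing counter is treated as a failure on purpose, so truncated output is never reported as a healthy fabric.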
At the beginning, I stated that the IB subnet manager is not running. Let's launch it with the /etc/init.d/opensmd script and see how the behaviour of the tools changes.
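To compare the situation before and after starting opensm, the ibstat fields discussed in this post (State, Physical state, SM lid) can be checked with a few lines of Python. Take it as a sketch only: it assumes a single CA and the "Key: value" layout printed above, and the function names parse_ports and diagnose are hypothetical.

```python
# Shortened copy of the ibstat output shown in this post.
SAMPLE = '''\
CA 'mthca0'
Number of ports: 2
Port 1:
State: Initializing
Physical state: LinkUp
Base lid: 0
SM lid: 0
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Initializing
Physical state: LinkUp
Base lid: 0
SM lid: 0
Port GUID: 0x0003ba0001007baa
'''


def parse_ports(text):
    """Return {port_number: {field: value}}; assumes one CA in the output."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Port ") and line.endswith(":"):
            # A header such as "Port 1:" opens a new port section.
            current = int(line[len("Port "):-1])
            ports[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def diagnose(fields):
    """Turn the fields of one port into a short verdict."""
    if fields.get("Physical state") != "LinkUp":
        return "link down: check the cable"
    if fields.get("State") == "Active":
        return "ok"
    if fields.get("SM lid") == "0":
        return "link up but no subnet manager seen"
    return "link up, port not active yet"


for number, fields in sorted(parse_ports(SAMPLE).items()):
    print(f"port {number}: {diagnose(fields)}")
```

With the sample above both ports get the "no subnet manager" verdict; once opensm is running, the State and SM lid fields should change.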
I am going to close the article series about Infiniband technology on the RHEL platform (check the previous posts 1, 2, 3) with posts devoted to IB troubleshooting. I would like to introduce the basic diagnostic steps for an IB environment which may help you to uncover errors and misconfiguration.
Most of the troubles you may meet are traceable via the OFED diagnostic tools. Up to OFED 1.2 they are part of the openib-diags package. Since version 1.3, it is replaced with the infiniband-diags package. Let's take a look at the most useful ones:
- ibstat - shows IB device status like firmware version, ports state, their rate, GUIDs, LIDs ...
- ibnetdiscover - discovers IB network topology
- ibroute - queries for IB switch forwarding table (like routing table)
- ibnodes - shows IB nodes in topology
- ibchecknet - runs IB network validation
- ibping - ping IB address
- sysfs - Linux virtual filesystem representing kernel structures; for IB there is the /sys/class/infiniband directory
The IB network is similar to other high performance network technologies like Fibre Channel, and most of the troubles are common to all of them. You may need to resolve connectivity issues, incompatibilities between firmware or higher level software revisions, driver bugs and similar.
First, I would like to explain the usage of the last two tools, ibping and sysfs. They are simple enough and known from other fields. The IB ping works in a client-server fashion. That means you need to run ibping in server mode on one side while the other side acts as a client. The server answers the client's pings with pongs.
- Server mode - ibping -S -v
- Client mode - ibping -v SERVER_LID_ADDR
The -v argument only increases the verbosity level. The right LID address can be found with the ibnetdiscover command: run it, find the line of the server node and use the associated LID. I will explain it later. If the IB network is healthy, ibping should produce output like this at the server side (the server LID is 4, its hostname is node2):
ibwarn: [6795] ibping_serv: starting to serve...
ibwarn: [6795] ibping_serv: Pong: node2.(none)
The pongs have to be visible at the client side:
ibwarn: [17946] ibping: Ping..Pong from node2.(none) (Lid 4): time 0.235 ms
If you aren't able to see them, you should check the connectivity status of your IB HCA. One way to do it is via sysfs. Each IB HCA is represented by a subdirectory under the /sys/class/infiniband directory where you can find a lot of useful stuff. For example, if you have a dual-ported HCA from Mellanox, there should be the following entries for port states:
- /sys/class/infiniband/mthca0/ports/1/state
- /sys/class/infiniband/mthca0/ports/2/state
The state usually has one of three values with these meanings:
- DOWN - port is physically disconnected
- INIT - port is connected but it is not initialized yet
- ACTIVE - port is online and it is working
For ibping to work, the ports of both nodes have to be in the ACTIVE state. If they are in the INIT state, the subnet manager may not be running. The DOWN state simply means a cable problem. By the way, there are other methods to find this out with the help of the remaining tools. I am going to explore them next time.
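Reading the state files one by one gets boring with more HCAs, so here is a small Python sketch which collects all of them. It assumes the /sys/class/infiniband/DEVICE/ports/N/state layout described above and that an active port contains the word ACTIVE in its state file (on my understanding the file looks like "4: ACTIVE"). The function names port_states and not_active are hypothetical.

```python
import glob
import os


def port_states(sysfs_root="/sys/class/infiniband"):
    """Map 'device/port' to the stripped content of its state file."""
    states = {}
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "state")
    for path in sorted(glob.glob(pattern)):
        parts = path.split(os.sep)
        # .../<device>/ports/<port>/state
        device, port = parts[-4], parts[-2]
        with open(path) as state_file:
            states[f"{device}/{port}"] = state_file.read().strip()
    return states


def not_active(states):
    """Return the names of ports whose state does not mention ACTIVE."""
    return sorted(name for name, state in states.items() if "ACTIVE" not in state)


# On a machine without IB hardware the directory is missing and nothing is printed.
for name, state in port_states().items():
    print(name, state)
```

The sysfs_root parameter exists only so that the function can be pointed at a copy of the directory tree, for example when testing on a machine without an HCA.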
Within a few days or weeks, VMware should release their lightweight hypervisor VMware ESXi for free. It is an enterprise-class hypervisor with a footprint of about 32MB which is integrated into modern servers through e.g. solid state disks. The small footprint is achieved by dropping the so-called Console Operating System (based on RHEL 3). It includes basic functionalities like vSMP or VMFS; for the advanced ones, you need to manage it with VMware VirtualCenter. You can download it from here.
Have you ever wanted to check the content of the first virtual console without switching to it with "Ctrl+Alt+F1" shortcut from your desktop session? Or the second console of a remote server? Or would you like to send something to the user who is working at the third virtual console (not via wall command)?
The Linux kernel provides two kinds of character devices for such tasks:
- /dev/ttyX - represents the Xth virtual console
- /dev/vcsX - represents the text contents of the Xth virtual console
So, to answer the questions use these commands:
- cat /dev/vcs1
- ssh root@server 'cat /dev/vcs2'
- echo "something" > /dev/tty3
More information about Linux allocated devices can be found in /usr/src/linux/Documentation/devices.txt. You need to have the Linux kernel sources installed.
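One detail worth knowing: /dev/vcsX holds the screen characters one after another without any newlines, so the output of cat is readable only because your terminal wraps it. A minimal Python sketch which cuts the content into rows follows. The width of 80 columns and the latin-1 decoding are my assumptions, and the function names screen_rows and read_console are hypothetical.

```python
def screen_rows(raw, columns=80):
    """Split the raw bytes of a /dev/vcsX device into one string per screen row."""
    # latin-1 maps every byte to one character, so the row length is preserved.
    text = raw.decode("latin-1")
    return [text[i:i + columns].rstrip() for i in range(0, len(text), columns)]


def read_console(number=1, columns=80):
    """Read a virtual console; usually needs root permissions."""
    with open(f"/dev/vcs{number}", "rb") as device:
        return screen_rows(device.read(), columns)


# A tiny fake screen with 2 rows of 10 columns, to show the reshaping.
fake_screen = b"login:    " + b"          "
print(screen_rows(fake_screen, columns=10))
```

If the console uses another width, pass it in the columns argument.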
As I wrote in the previous post, the /etc/init.d/openibd init script is in charge of starting the Infiniband (IB) network. The script parses the /etc/ofed/openib.conf configuration file where you can specify which ULPs should be initialized. By default, all the ULPs I mentioned last time (ipoib, srp, sdp) are enabled.
The opensm IB network manager is controlled with the /etc/init.d/opensmd init script which is configurable via the /etc/ofed/opensm.conf configuration file. You can turn on debugging there but it is not normally needed. It is more useful to enable verbose mode, which increases the log verbosity level. The default log file is /var/log/osm.log. So, if something goes wrong, enable verbose mode and check the log file.
After executing the init scripts, you should check the IB network state. The openibd script is started automatically during system startup, while opensmd has to be enabled (with ntsysv or chkconfig). Follow this checklist:
- Is Mellanox HCA recognized?
  - check the output of lsmod | grep ib_mthca
  - check the output of dmesg
- Are appropriate ULPs loaded?
  - check the output of lsmod | grep ib_
  - should contain ib_ipoib, ib_srp, ib_sdp
- Is IB network initialized and working?
  - check the output of cat /sys/class/infiniband/mthca0/ports/X/state
- Is ib0 network interface available?
  - check the output of ifconfig -a
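The module part of the checklist can be automated easily. The following Python sketch looks for the modules named above in the text printed by lsmod. The sample listing is made up for illustration (the sizes are not real), and the function names loaded_modules and missing_modules are hypothetical.

```python
# Modules named in the checklist: the Mellanox HCA driver and the three ULPs.
REQUIRED = ("ib_mthca", "ib_ipoib", "ib_srp", "ib_sdp")

# Made-up lsmod listing, for illustration only.
SAMPLE = """\
Module                  Size  Used by
ib_ipoib               80000  0
ib_mthca              150000  0
ib_core                70000  2 ib_ipoib,ib_mthca
"""


def loaded_modules(lsmod_text):
    """Return the module names from lsmod output (first column, header skipped)."""
    names = set()
    for line in lsmod_text.splitlines()[1:]:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def missing_modules(lsmod_text, required=REQUIRED):
    """Return the required modules which are not loaded, in checklist order."""
    loaded = loaded_modules(lsmod_text)
    return [name for name in required if name not in loaded]


print("missing:", missing_modules(SAMPLE))
```

On a real node you would feed it the output of lsmod, for example via subprocess.run(["lsmod"], capture_output=True, text=True).stdout.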
If all the checks passed, you should be able to use the IP protocol over the IB network. I suppose you have at least two IB nodes in the IB network, both configured the same way and both having passed the checks (like in the first article). To configure it, follow these steps:
- assign an IP address to the nodes
  - run ifconfig ib0 IP_ADDR1 up at first node
  - run ifconfig ib0 IP_ADDR2 up at second node
- check the IPoIB functionality
  - run ping IP_ADDR2 from the first node
  - run ping IP_ADDR1 from the second node
So, wasn't it simple? If everything is working, the pings should receive replies from the other side. Now you can run any IP based application over IB (FTP, NFS and so on) and utilize its benefits like high throughput and low latency. Please, if you are interested in the topic, leave me a comment.
Sometimes it happens that I'm not able to uninstall an RPM package because of some internal SPEC file errors related to the scriptlets. Last time it happened when I was uninstalling the HP OpenView Storage Data Protector packages from a RHEL server. By mistake, I uninstalled one package which was a dependency of another package. After that I wasn't able to uninstall the second one, because its scriptlet needed the removed package and the dependency wasn't checked correctly. The whole uninstall procedure looked like this:
- rpm -e OB2-CORE-A.06.00-1
- rpm -e OB2-DA-A.06.00-1
And the produced error follows:
- ERROR: Cannot find /opt/omni//bin/omnicc
- error: %preun(OB2-DA-A.06.00-1.x86_64) scriptlet failed, exit status 3
So, is there a way to get rid of such a package? Yes, there is, and it is simple: just disable executing the scriptlets like this:
- rpm -e --noscripts OB2-DA-A.06.00-1
I think it is a pretty simple feature of RPM but it is a bit difficult to remember.
We had to wait a few months for the new AMD Opteron servers from Sun since the new quad-core AMD processors, code named Barcelona, were released. Now, it has been a few days since Sun officially announced here the availability of the second generation of their AMD servers and new storage arrays, together called "next generation open storage hardware". You can find more about the Open Storage hardware and related projects here.
The new servers based on quad-core AMD processors are the Sun Fire X4140, X4240 and X4540. In the storage field, the new Sun Storage J4200, J4400 and J4500 arrays were introduced. All of them are SAS JBOD arrays. For more details, look at the Sun System Handbook at SunSolve.
Let's continue with the software introduction. As I wrote, the switch is equipped with the ALOM remote management. There is a universal set of commands for platform independent management, like password, poweroff, setupsc, resetsc and so on, and then a set of commands which are more specific to the platform. In the case of our IB switch there are two such commands:
- setbp - for setting the so-called blueprint of the switch
- showbp - for showing the current blueprint
- there are five predefined blueprints: 9 node, 12 node, 18 node, none and unmanaged
The natural question is: what does the blueprint mean? According to the official documentation, it seems to be something like a predefined configuration of the switch. You can change it with the setbp command, which asks you whether you want to run the IB management software, how many hosts will be in the subnet and what the subnet identifier is. By default, if you use switches preconfigured from the factory, two switches will have the same subnet ID. The trouble is, if you intend to configure some level of redundancy between IB switches, you will have to have them in different subnets with different subnet IDs. I find it strange, because I had to disable the IB management software, otherwise I wasn't able to see the nodes in the fabric. As we will see, the IB management software including the IB subnet manager doesn't seem to like the OFED included in the RHEL distro (I wrote more about RHEL and OFED here).
What about the servers? I preinstalled them with the CentOS 5.1 distribution (which is binary compatible with RHEL 5.1). The distribution contains the OFED implementation in version 1.2. The complete OFED implementation in CentOS is divided into a set of RPM packages. The platform dependent part of OFED, that is the kernel modules, is distributed with the kernel package. Let's make a quick summary of the basic packages:
- kernel - contains IB hardware, IB core and IB ULP modules
  - ULP means Upper Level Protocol
  - everything is placed in the following directories:
    - /lib/modules/`uname -r`/kernel/drivers/infiniband/hw
    - /lib/modules/`uname -r`/kernel/drivers/infiniband/core
    - /lib/modules/`uname -r`/kernel/drivers/infiniband/ulp
  - currently only IB HCAs from Mellanox are supported
  - the supported ULPs are
    - ipoib - IP over IB driver
    - srp - IB SCSI RDMA initiator driver
    - sdp - SDP driver
- openib - this package contains a lot of useful documentation; the important parts are the OFED configuration file /etc/ofed/openib.conf and the init script /etc/init.d/openibd, which takes care of activating/deactivating the IB network interfaces. Simply put, it loads the IB core modules and the ULP modules specified in the config.
- openib-diags - this package contains diagnostic tools for IB debugging, I will introduce them later.
- opensm - here we have our IB subnet manager. The package provides the init script /etc/init.d/opensmd for starting it and the /etc/ofed/opensm.conf configuration file.
- libibverbs - this package provides a library allowing userspace programs direct hardware access.
- libibcommon, libibmad, libibumad, opensm-libs - and finally library dependencies for the above packages.
I need to add that the OFED packages belong to the System Environment/Libraries RPM group and, apart from openib, libibverbs and of course the kernel package, they are not installed by default. That's all for now; next time I'm going to describe how to work with it.
In my two previous articles, I summarized a few facts about Infiniband support in RHEL distros and the included protocols. You can go through them via the following links: RHEL and Infiniband support and Infiniband, RDP, SDP.... Let's be more specific now.
My scenario was based on two Sun Fire X4200 M2 servers and one Infiniband (IB) switch, the Sun IB Switch 9P. The servers had Infiniband host channel adapters (HCA) installed, the Sun Dual Port 4x IB HCA, to be able to communicate over the IB fabric. The switch provides nine IB compliant ports at dual speeds of 4X/12X, which means that each port is able to deliver 10/30 Gbit/s of raw bandwidth. What surprised me was that the switch management is like on the SUN SPARC midrange servers. Yes, it is ALOM and it is perfect because you can use the same interface and similar commands you are used to. By the way, the switch chassis looks like a regular SUN server.
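The 10/30 Gbit/s figures follow from simple arithmetic, sketched here in Python. I assume single data rate (SDR) links with a signalling rate of 2.5 Gbit/s per lane and 8b/10b line coding, which is where the difference between raw and usable data bandwidth comes from. The function names raw_gbit and data_gbit are hypothetical.

```python
LANE_SIGNAL_GBIT = 2.5  # assumed SDR signalling rate per lane in Gbit/s


def raw_gbit(lanes):
    """Raw signalling bandwidth of a link with the given number of lanes."""
    return lanes * LANE_SIGNAL_GBIT


def data_gbit(lanes):
    """Usable data bandwidth: 8b/10b coding carries 8 data bits per 10 line bits."""
    return raw_gbit(lanes) * 8 / 10


for lanes in (4, 12):
    print(f"{lanes}X: raw {raw_gbit(lanes):g} Gbit/s, data {data_gbit(lanes):g} Gbit/s")
```

So a 4X port gives 10 Gbit/s raw and 8 Gbit/s for data, and a 12X port gives 30 Gbit/s raw and 24 Gbit/s for data.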
The switch is equipped with the IB subnet manager (SM) which is required to initialize the IB hardware and to allow communication over the IB fabric. Each IB subnet has to have at least one SM and each subnet has an unambiguous identifier (ID) within the fabric. To be complete, the fabric comprises the defined subnets. In my opinion, the IB SM seems to work like an ARP cache and a DHCP server in LANs. Each HCA in a fabric is globally identified with a so-called node GUID, which is like a WWN in FC or a MAC address in a LAN. The switch has its own GUID as well. The ports of an HCA have so-called port GUIDs. Now, when one HCA or its port wants to communicate with another one in the subnet, it needs to have some network address assigned. This address is called the LID, or local identifier, and the IB SM is in charge of assigning it to the members of the subnet. The conclusion is that LIDs are valid inside the subnet only, while GUIDs are routable across the subnets of the fabric.
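To make the DHCP comparison more concrete, here is a toy model in Python. It is only an illustration of the idea that the SM hands out subnet-local LIDs to ports which are otherwise known by their GUIDs. It is not how opensm or the switch firmware works internally, the class ToySubnetManager is hypothetical and the GUIDs are just example values.

```python
class ToySubnetManager:
    """Hands out LIDs to port GUIDs, loosely like a DHCP server hands out addresses."""

    def __init__(self, first_lid=1):
        # LID 0 is kept to mean "no LID assigned yet".
        self._next_lid = first_lid
        self._lid_by_guid = {}

    def assign(self, port_guid):
        """Give the port a LID; a port that already has one keeps it."""
        if port_guid not in self._lid_by_guid:
            self._lid_by_guid[port_guid] = self._next_lid
            self._next_lid += 1
        return self._lid_by_guid[port_guid]

    def lid_of(self, port_guid):
        """Return the LID of a port, or 0 when none has been assigned yet."""
        return self._lid_by_guid.get(port_guid, 0)


# Example port GUIDs of a dual-ported HCA.
port1 = 0x0003BA0001007BA9
port2 = 0x0003BA0001007BAA

sm = ToySubnetManager()
print("before:", sm.lid_of(port1), sm.lid_of(port2))
sm.assign(port1)
sm.assign(port2)
print("after: ", sm.lid_of(port1), sm.lid_of(port2))
```

Without a running SM nobody calls assign, so every port stays at LID 0 and the nodes cannot address each other.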
But one thing confused me a bit. When you configure the switch, you need to remember to set its blueprint, otherwise you are asking for trouble. I'm going to write about it in the next part.