Decipher OSM log messages
The last post described a situation where the OpenSM InfiniBand (IB) subnet manager was logging hundreds of messages per second to its log file /tmp/osm.log, and proposed an interim solution to keep it from filling up the file system. This post is about tracking down the root of the issue.
Suppose the OpenSM log file contains messages like these:
Aug 04 23:27:02 910870 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 TID:0x000000002c20f6fa
Aug 04 23:27:02 910896 [40A04960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 1091363 times consecutively
Aug 04 23:27:02 912507 [40401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 TID:0x000000002c20f6fb
Aug 04 23:27:02 912533 [40401960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 1091364 times consecutively
Aug 04 23:27:02 914191 [40602960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 TID:0x000000002c20f6fc
Aug 04 23:27:02 914212 [40602960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 1091365 times consecutively
These messages are rather cryptic and, by themselves, not particularly helpful. However, they contain information about the origin of the error: Producer:2 from LID:0x008B Port 6. This obscure pair of LID and port actually refers to a port on the IB switch, which is reporting the error to the subnet manager. Now, one can log in to the IB switch and try to figure out which physical port and cable on the switch are associated with the given (LID, port) pair. Then a trip to the server room and some digging among numerous IB cables may reveal the machine that is the cause of all this trouble.
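Pulling the (LID, port) pair out of the log by hand gets old quickly when the file holds millions of lines. The following is a minimal Python sketch that counts traps per source, assuming the log lines look exactly like the ones shown above. The function name count_trap_sources and the regular expression are illustrative, not part of OpenSM.

```python
import re
from collections import Counter

# Matches the trap source in lines such as:
# "... Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 TID:0x..."
# Assumption: the message layout is the one shown in the log excerpt above.
TRAP_RE = re.compile(
    r"Received Generic Notice .*?num:(\d+) .*?from LID:0x([0-9A-Fa-f]+) Port (\d+)"
)


def count_trap_sources(lines):
    """Count trap notices per (trap number, LID, port) triple."""
    counts = Counter()
    for line in lines:
        m = TRAP_RE.search(line)
        if m:
            trap_num = int(m.group(1))
            lid = int(m.group(2), 16)   # the LID is logged in hex
            port = int(m.group(3))      # the port is logged in decimal
            counts[(trap_num, lid, port)] += 1
    return counts


sample = [
    "Aug 04 23:27:02 910870 [40A04960] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 "
    "TID:0x000000002c20f6fa",
    "Aug 04 23:27:02 910896 [40A04960] -> __osm_trap_rcv_process_request: "
    "ERR 3804: Received trap 1091363 times consecutively",
    "Aug 04 23:27:02 912507 [40401960] -> __osm_trap_rcv_process_request: "
    "Received Generic Notice type:0x01 num:131 Producer:2 from LID:0x008B Port 6 "
    "TID:0x000000002c20f6fb",
]

for (num, lid, port), n in count_trap_sources(sample).most_common():
    print(f"trap {num} from LID 0x{lid:04X} port {port}: {n} message(s)")
```

To run it against the real log, pass an open file instead of the sample list, for example count_trap_sources(open("/tmp/osm.log")).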
Alternatively, one can open the software toolbox and pull out ibdiagnet, a tool that is part of the OpenIB/OFED distribution. ibdiagnet provides a number of useful functions for debugging InfiniBand networks and, in addition to general IB network path information, it conveniently provides a mapping from IB switch ports to machine hostnames for all IB host interfaces that are reachable. Even though it can only report that mapping for reachable interfaces, it can still be used to identify interfaces that are offline, assuming the IB cabling follows some predictable pattern.
When run without any command-line arguments, ibdiagnet performs a number of diagnostics and leaves several files in /tmp. A detailed list of the files can be found in the ibdiagnet manpage. The file /tmp/ibdiagnet.lst provides a list of all active ports in the IB fabric, including ports that are internal to the switch. Additionally, for any host port that is active, it shows the hostname configured for the corresponding host. This information can be used to identify the IB host that is causing the trouble:
...
{ SW Ports:18 ... Chip A} LID:008B PN:02 } { CA ... {compute-0-3 HCA-1} LID:008F PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:04 } { CA ... {compute-0-5 HCA-1} LID:0002 PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:05 } { CA ... {compute-0-6 HCA-1} LID:0006 PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:07 } { CA ... {compute-0-8 HCA-1} LID:0004 PN:01 } PHY=4x LOG=ACT SPD=5
{ SW Ports:18 ... Chip A} LID:008B PN:08 } { CA ... {compute-0-9 HCA-1} LID:0005 PN:01 } PHY=4x LOG=ACT SPD=5
...
Since /tmp/osm.log explicitly refers to switch LID 0x8B Port 6, it is easy to see that the entry for LID 0x8B Port 6 (PN:06) is missing from the list. The nodes are connected to the switch in a particular order and, following that wiring pattern, LID 0x8B Port 6 would be connected to compute-0-7, the presumed troublemaker. A quick check on the node indeed revealed a kernel panic, which had prevented the IB driver from initializing the host interface correctly.
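The wiring-pattern reasoning can be written down as a small Python sketch. It assumes, as in this cluster, that hostnames end in a number and that this number differs from the switch port by a constant offset. The helper infer_host is hypothetical and only returns a guess when all known ports agree on one pattern.

```python
import re

# Switch port -> hostname for switch LID 0x8B, copied from the
# ibdiagnet.lst excerpt above.
known = {
    2: "compute-0-3",
    4: "compute-0-5",
    5: "compute-0-6",
    7: "compute-0-8",
    8: "compute-0-9",
}


def infer_host(port, known_ports):
    """Guess the host on `port`, assuming host index = port + constant offset.

    Returns None when the known ports do not agree on a single
    name prefix and a single offset.
    """
    offsets = set()
    prefixes = set()
    for p, name in known_ports.items():
        m = re.fullmatch(r"(.*-)(\d+)", name)
        if not m:
            return None
        prefixes.add(m.group(1))
        offsets.add(int(m.group(2)) - p)
    if len(offsets) != 1 or len(prefixes) != 1:
        return None  # no single consistent wiring pattern
    return f"{prefixes.pop()}{port + offsets.pop()}"


print(infer_host(6, known))
```

For the excerpt above, every known port yields an offset of 1, so port 6 maps to compute-0-7. If the cabling does not follow such a pattern, the function returns None and the trip to the server room is still required.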
Obviously, repeating this investigation every time there is a problem with the IB network becomes tedious. Using the information collected by ibdiagnet to maintain an up-to-date chart of (LID, port) to hostname mappings is therefore a good idea.
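Such a chart could be generated with a few lines of Python. The sketch below assumes that switch-to-host lines in ibdiagnet.lst have the shape shown in the excerpt above, that the node description starts with the hostname, and that LID and PN are both hexadecimal (the PN base is an assumption; it makes no difference for ports below 10). The names LINK_RE and build_chart are illustrative.

```python
import re

# One switch-to-host link per line:
# "{ SW ... LID:008B PN:02 } { CA ... {compute-0-3 HCA-1} LID:008F PN:01 } ..."
# Assumption: real files carry more fields where the excerpt shows "...",
# which the lazy ".*?" parts skip over.
LINK_RE = re.compile(
    r"\{ SW .*?LID:([0-9A-Fa-f]+) PN:([0-9A-Fa-f]+) \} "
    r"\{ CA .*?\{(\S+) [^}]*\} LID:[0-9A-Fa-f]+ PN:[0-9A-Fa-f]+ \}"
)


def build_chart(lines):
    """Map (switch LID, switch port) to the hostname on the other end."""
    chart = {}
    for line in lines:
        m = LINK_RE.search(line)
        if m:
            sw_lid = int(m.group(1), 16)
            sw_port = int(m.group(2), 16)
            chart[(sw_lid, sw_port)] = m.group(3)
    return chart


sample = [
    "{ SW Ports:18 ... Chip A} LID:008B PN:02 } "
    "{ CA ... {compute-0-3 HCA-1} LID:008F PN:01 } PHY=4x LOG=ACT SPD=5",
    "{ SW Ports:18 ... Chip A} LID:008B PN:04 } "
    "{ CA ... {compute-0-5 HCA-1} LID:0002 PN:01 } PHY=4x LOG=ACT SPD=5",
    "{ SW Ports:18 ... Chip A} LID:008B PN:07 } "
    "{ CA ... {compute-0-8 HCA-1} LID:0004 PN:01 } PHY=4x LOG=ACT SPD=5",
]

chart = build_chart(sample)
for (lid, port), host in sorted(chart.items()):
    print(f"LID 0x{lid:04X} port {port:2d} -> {host}")
```

Running build_chart on the lines of /tmp/ibdiagnet.lst while all nodes are up, and saving the output, gives a reference chart. The next time OpenSM complains about a (LID, port) pair, the hostname can be looked up directly, even if the host itself is no longer reachable.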