A Musical Tour of Hints and Tools for Debugging Host Networks


Shannon Nelson from the Oracle Linux Kernel Development team offers these tips and tricks to help make host network diagnostics easier. He also includes a recommended playlist for accompanying your debugging!

Ain’t Misbehavin’ (Dinah Washington)

As with many debugging situations, digging into and resolving a network-based problem can seem like a lot of pure guesswork and magic.  In the networking realm, not only do we have the host system’s processes and configurations to contend with, but also the exciting and often frustrating asynchronicity of network traffic.

Some of the problems that can trigger a debug session are reports of lost packets, corrupt data, poor performance, or even random system crashes.  These don’t always end up being actual network problems, but as soon as the customer mentions anything about their wiring rack or routers, the network engineer is brought in and put on the spot.

This post is intended not as a full how-to in debugging any particular network issue, but more a set of some of the tips and tools that we use when investigating network misbehavior.

Start Me Up (The Rolling Stones)

Probably the most important debugging tool available, and the one needed just to get started, is a concise and clear description of what is happening that shouldn’t happen.  This is harder to get than one might think.  You know what I mean, right?  The customer might give us anything from “it’s broken” to a 3-page dissertation on everything but the actual problem.

We start gathering a clearer description by asking simple questions that should be easy to answer.  Things like:

  • Who found it, who is the engineering contact?
  • Exactly what equipment was it running on?
  • When/how often does this happen?
  • What machines/configurations/NICs/etc are involved?
  • Do all such machines have this problem, or only one or two?
  • Are there routers and/or switches involved?
  • Are there Virtual Machines, Virtual Functions, or Containers involved?
  • Are there macvlans, bridges, bonds or teams involved?
  • Are there any network offloads involved?

With this information, we should be able to write our own description of the problem and see if the customer agrees with our summary.  Once we can refine that, we should have a better idea of what needs to be looked into.

Some of the most valuable tools for getting this information are simple commands that the user can run on the misbehaving systems.  These should help detail what actual NICs and drivers are on the system and how they might be connected.

uname -a – This is an excellent way to start, if only to get a basic idea of what the system is and how old the kernel being used is.  This can catch the case where the customer isn’t running a supported kernel.

These next few are good for finding what all is on the system and how they are connected:

ip addr, ip link – these are good for getting a view of the network ports that are configured, and can point out devices that are either offline or not set to the right address.  They can also give a hint as to what bonds or teams might be in place.  These replace the deprecated “ifconfig” command.

ip route – shows what network devices are going to handle outgoing packets.  This is mostly useful on systems with many network ports. This replaces the deprecated “route” command and the similar “netstat -rn“.

brctl show – lists software bridges set up and what devices are connected to them.

netstat -i – gives a summary list of the interfaces and their basic statistics. These are also available with “ip -s link“, just formatted differently.

lseth – this is a non-standard command that gives a nice summary combining a lot of the output from the above commands.  (See http://vcojot.blogspot.com/2015/11/introducing-lsethlsnet.html)
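
Putting those first few commands together, a quick survey pass over an unfamiliar system might look something like the sketch below; eth0 is only a placeholder name, and the brief “-br” output format is just a convenience:

# get a first impression of the system and its network ports
uname -a                  # kernel version and architecture
ip -br link               # one line per port: state and MAC address
ip -br addr               # addresses assigned to each port
ip route                  # which ports handle which outgoing destinations
brctl show                # software bridges and their member ports
netstat -i                # per-interface packet and error counters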

Watchin’ the Detectives (Elvis Costello)

Once we have an idea which particular device is involved, the following commands can help gather more information about that device.  This can get us an initial clue as to whether or not the device is configured in a generally healthy way.

ethtool <ethX> – lists driver and connection attributes such as the current connection speed and whether link is detected.

ethtool -i <ethX> – lists device driver information, including kernel driver and firmware versions, useful for being sure the customer is working with the right software; and PCIe device bus address, good for tracking the low level system hardware interface.

ethtool -l <ethX> – shows the number of Tx and Rx queues that are set up, which usually should match the number of CPU cores to be used.

ethtool -g <ethX> – shows the number of packet buffers for each Tx and Rx queue; too many and we’re wasting memory, too few and we risk dropping packets under heavy throughput pressure.

lspci -s <bus:dev:func> -vv – lists detailed information about the NIC hardware and its attributes. You can get the interface’s <bus:dev:func> from “ethtool -i“.
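
Putting these together, a typical inspection pass over a single suspect port might look like the following sketch, where eth0 and the PCI bus address are placeholders; the bus address comes from the “bus-info” line of “ethtool -i”:

# inspect one suspect port
ethtool eth0                  # link speed, duplex, and whether link is detected
ethtool -i eth0               # driver name, firmware version, PCIe bus-info
ethtool -l eth0               # number of Tx/Rx queues (channels) configured
ethtool -g eth0               # ring buffer sizes for each queue
lspci -s 0000:3b:00.0 -vv     # hardware details, using the bus-info from above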

Diary (Bread)

The system logfiles usually have some good clues in them as to what may have happened around the time of the issue being investigated.  “dmesg” gives the direct kernel log messages, but beware that it is a limited sized buffer that can get overrun and lose history over time. In older Linux distributions the system logs are found in /var/log, most usefully in either /var/log/messages or /var/log/syslog, while newer “systemd” based systems use “journalctl” for accessing log messages. Either way, there are often interesting traces to be found that can help describe the behavior.

One thing to watch out for is that when the customer sends a log extract, it usually isn’t enough.  Too often they will capture something like the kernel panic message, but not the few lines before that show what led up to the panic.  Much more useful is a copy of the full logfile, or at least something with several hours of log before the event.

Once we have the full file, it can be searched for error messages and for any log messages with the ethX name or the PCI device address, looking for more hints.  Sometimes just scanning through the file shows patterns of behavior that can be related.
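
A minimal sketch of that kind of search, assuming a systemd-based host, an interface named eth0, and a made-up PCI address and time window:

# kernel messages mentioning the interface or its PCI address
journalctl -k | grep -iE 'eth0|0000:3b:00\.0'
# on an older distribution, search the traditional log files instead
grep -i eth0 /var/log/messages
# narrow the search to the hours around the reported event
journalctl -k --since "2018-07-10 20:00" --until "2018-07-11 02:00"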

Fakin’ It (Simon & Garfunkel)

With the information gathered so far, we should have a chance at creating a simple reproducer.  Much of the time we can’t go poking at the customer’s running systems, but need to demonstrate the problem and the fix on our own lab systems.  Of course, we don’t have the same environment, but with a concise enough problem description we stand a good chance of finding a simple case that shows the same behavior.

Some traffic generator tools that help in reproducing the issues include:

ping – send one or a few packets, or send a packet flood to a NIC.  It has flags for size, timing, and other send parameters.
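
A few hedged ping variations, with a placeholder destination address (flood mode requires root and should be used with care):

ping -c 5 -s 1472 10.0.0.2      # five large packets, near a 1500-byte MTU once headers are added
ping -c 1000 -i 0.2 10.0.0.2    # a longer, steady stream at five packets per second
ping -f -c 100000 10.0.0.2      # flood mode, root only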

iperf – good for heavy traffic exercise, and can run several in parallel to get a better RSS spread on the receiver.
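
A sketch using the iperf3 variant of the tool, with a placeholder server address and arbitrary stream count and duration:

iperf3 -s                         # on the receiving system
iperf3 -c 10.0.0.2 -P 8 -t 60     # on the sender: 8 parallel streams for 60 seconds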

pktgen – this kernel module can be used to generate much more traffic than user level programs, in part because the packets don’t have to traverse the sender’s network stack.  There are also several options for packet shapes and throughput rates.
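
The module is driven by writing commands into files under /proc/net/pktgen; a rough sketch follows, with placeholder device name, addresses, and counts (details vary by kernel version, see Documentation/networking/pktgen.txt):

modprobe pktgen
echo "add_device eth0" > /proc/net/pktgen/kpktgend_0    # bind eth0 to CPU 0's pktgen thread
echo "count 1000000"   > /proc/net/pktgen/eth0          # number of packets to send
echo "pkt_size 64"     > /proc/net/pktgen/eth0          # small packets stress the stack hardest
echo "dst 10.0.0.2"    > /proc/net/pktgen/eth0          # destination IP address
echo "dst_mac aa:bb:cc:dd:ee:ff" > /proc/net/pktgen/eth0
echo "start"           > /proc/net/pktgen/pgctrl        # start all threads and begin sending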

scapy – this is a Python tool that allows scripting of specially crafted packets, useful in making sure certain data patterns are exactly what you need for a particular test.
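
A minimal sketch of that idea, here wrapped in a shell one-liner; the addresses, port, and payload pattern are made up for illustration, and sending raw packets requires root:

python3 -c '
from scapy.all import IP, TCP, Raw, send
# one TCP SYN carrying a recognizable payload pattern
pkt = IP(dst="10.0.0.2")/TCP(dport=5001, flags="S")/Raw(load=b"\xde\xad\xbe\xef" * 16)
send(pkt, iface="eth0")
'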

All Along the Watchtower (The Jimi Hendrix Experience)

With our own model of the problem, we can start looking deeper into the system to see what is happening: looking at throughput statistics and watching actual packet contents.  Easy statistic gathering can come from these tools:

ethtool -S <ethX> – most NIC device drivers offer Tx and Rx packets counts, as well as error data, through the ‘-S’ option of ethtool.  This device specific information is a good window into what the NIC thinks it is doing, and can show when the NIC sees low level issues, including malformed packets and bad checksums.

netstat -s – this gives protocol statistics from the upper networking stack, such as TCP connections, segments retransmitted, and other related counters.

ip -s link show <ethX> – another method for getting a summary of traffic counters, including some dropped packets.

grep <ethX> /proc/interrupts – looking at the interrupt counters can give a better idea of how well the processing is getting spread across the available CPU cores.  For some loads, we can expect a wide dispersal, and other loads might end up with one core more heavily loaded than others.

/proc/net/* – there are lots of data files exposed by the kernel networking stack available here that can show many different aspects of the network stack operations. Many of the command line utilities get their info directly from these files. Sometimes it is handy to write your own scripts to pull the very specific data that you need from these files.
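
For example, a small one-liner that pulls just the drop counters for one interface out of /proc/net/dev (eth0 is a placeholder, and the field positions assume the standard /proc/net/dev column layout):

# rx drops are the 4th and tx drops the 12th counter after the interface name
awk '/eth0:/ { sub(/.*:/, ""); print "rx_drop=" $4, "tx_drop=" $12 }' /proc/net/dev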

watch – The above tools give a snapshot of the current status, but sometimes we need to get a better idea of how things are working over time.  The “watch” utility can help here by repeatedly running the snapshot command and displaying the output, even highlighting where things have changed since the last snapshot.  Example uses include:

# See the interrupt activity as it happens
watch "grep ethX /proc/interrupts"
# Watch all of the NIC's non-zero stats
watch "ethtool -S ethX | grep -v ': 0'"

Also useful for catching data in flight are tcpdump and its cousins wireshark and tcpreplay.  These are invaluable in catching packets from the wire, dissecting what exactly got sent and received, and replaying the conversation for testing.  These have whole tutorials in and of themselves so I won’t detail them here, but here’s an example of tcpdump output from a single network packet:

23:12:47.471622 IP (tos 0x0, ttl 64, id 48247, offset 0, flags [DF], proto TCP (6), length 52)
    14.0.0.70.ssh > 14.0.0.52.37594: Flags [F.], cksum 0x063a (correct), seq 2358, ack 2055, win 294, options [nop,nop,TS val 2146211557 ecr 3646050837], length 0
    0x0000:  4500 0034 bc77 4000 4006 61d3 0e00 0046
    0x0010:  0e00 0034 0016 92da 21a8 b78a af9a f4ea
    0x0020:  8011 0126 063a 0000 0101 080a 7fec 96e5
    0x0030:  d952 5215
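
A capture invocation along these lines is a reasonable starting point; the interface, filters, and file name here are just assumptions for the sketch:

# one ssh packet with verbose decode and hex dump, similar to the output above
tcpdump -i eth0 -nn -vv -x -c 1 'tcp port 22'
# or save a longer capture to a file for later analysis in wireshark
tcpdump -i eth0 -s 0 -w /tmp/capture.pcap 'host 14.0.0.52'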

Photographs and Memories (Jim Croce)

Once we’ve made it this far and we have some idea that it might be a particular network device driver issue, we can do a little research into the history of the driver.  A good web search is an invaluable friend. For example, a web search for “bnxt_en dropping packets” brings up some references to a bugfix for the Nitro A0 hardware – perhaps this is related to a packet drop problem we are seeing?

If we have a clone of the Linux kernel git repository, we can do a search through the patch history for keywords.  If there’s something odd happening with macvlan filters, for example, searching the driver’s history will point out some patches that might be related to the issue.  Here’s a macvlan issue with driver resets that was fixed upstream in v4.18:

$ git log --oneline drivers/net/ethernet/intel/ixgbe | grep -i macvlan | grep -i reset 
8315ef6 ixgbe: Avoid performing unnecessary resets for macvlan offload 
e251ecf ixgbe: clean macvlan MAC filter table on VF reset
 
$ git describe --contains 8315ef6 
v4.18-rc1~114^2~380^2

Reelin’ In the Years (Steely Dan)

A couple of examples can show a little of how these tools have been used in real life.  Of course, it’s never as easy as it sounds when you’re in the middle of it.

lost/broken packets with TSO from sunvnet through bridge

When doing some performance testing on the sunvnet network driver, a virtual NIC in the SPARC Linux kernel, we found that enabling TSO actually significantly hurt throughput, rather than helping, when going out to a remote system.  After using netstat and ethtool -S to find that there were a lot of lost packets and retries through the base machine’s physical NIC, we used tcpdump on the NIC and at various points in the internal software bridge to find where packets were getting broken and dropped.  We also found comments on the netdev mailing list about an issue with TSO’d packets getting messed up when going into the software bridge.  We turned off TSO for packets headed into the host bridge and the performance issue was fixed.
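
As a side note, offloads like TSO can be checked and toggled per interface from user space with ethtool; a minimal sketch with a placeholder interface name (the actual sunvnet fix was made in the driver rather than with these commands):

ethtool -k eth0 | grep -i segmentation    # show the current TSO/GSO settings
ethtool -K eth0 tso off                   # turn off TCP segmentation offload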

log file points out misbehaving process

In a case where NIC hardware was randomly freezing up on several servers, we found that a compute service daemon had recently been updated with a broken version that would immediately die and restart several times a second, on scores of servers at the same time, resetting the NICs each time it restarted.  Once the daemon was fixed, the NIC resets stopped and the network problem went away.

Bring It On Home

This is just a quick overview of some of the tools for debugging a network issue.  Everyone has their favorite tools and different uses; we’ve only touched on a few here.  They are all handy, but all need our imagination and perseverance to be useful in getting to the root of whatever problem we are chasing.  Also useful are quick shell scripts written to collect specific sets of data and to process various bits of it when looking for something specific.  For more ideas, see the links below.

And sometimes, when we’ve dug so far and haven’t yet found the gold, it’s best to just get up from the keyboard, take a walk, grab a snack, listen to some good music, and let the mind wander.

Good hunting.

This article originally appeared at Oracle Developers Blog