
Move your Dotfiles to Version Control

There is something truly exciting about customizing your operating system through the collection of hidden files we call dotfiles. In What a Shell Dotfile Can Do For You, H. “Waldo” Grunenwald goes into excellent detail about the why and how of setting up your dotfiles. Let’s dig into the why and how of sharing them.

What’s a dotfile?

“Dotfiles” is a common term for all the configuration files we have floating around our machines. These files usually start with a . at the beginning of the filename, like .gitconfig, and operating systems often hide them by default. For example, when I use ls -a on MacOS, it shows all the lovely dotfiles that would otherwise not be in the output.

dotfiles on master
➜ ls
README.md  Rakefile   bin       misc    profiles   zsh-custom

dotfiles on master
➜ ls -a
.               .gitignore      .oh-my-zsh      README.md       zsh-custom
..              .gitmodules     .tmux           Rakefile
.gemrc          .global_ignore .vimrc           bin
.git            .gvimrc         .zlogin         misc
.gitconfig      .maid           .zshrc          profiles

If I take a look at one, .gitconfig, which I use for Git configuration, I see a ton of customization. I have account information, terminal color preferences, and tons of aliases that make my command-line interface feel like mine. 
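As a rough illustration (these are hypothetical aliases, not necessarily the ones in my own .gitconfig), aliases can be added from the command line, and Git writes them into the [alias] section of the file:

# add a couple of example aliases to ~/.gitconfig
git config --global alias.st "status -sb"
git config --global alias.lg "log --oneline --graph --decorate"

# confirm what landed in the file
git config --global --get-regexp '^alias\.'

After that, git st and git lg behave like the longer commands they stand for.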

Read more at OpenSource.com

Text Processing in Rust

Create handy command-line utilities in Rust.

This article is about text processing in Rust, but it also contains a quick introduction to pattern matching, which can be very handy when working with text.

Strings are a big subject in Rust, as evidenced by the fact that the language has two data types for representing strings as well as macros for formatting them. All of this also shows how powerful Rust is at string and text processing.

Apart from covering some theoretical topics, this article shows how to develop some handy yet easy-to-implement command-line utilities that let you work with plain-text files. If you have the time, it’d be great to experiment with the Rust code presented here, and maybe develop your own utilities.

Rust and Text

Rust supports two data types for working with strings: String and str. The String type is for working with mutable strings that belong to you, and it has both a length and a capacity property. The str type, on the other hand, is for working with immutable strings that you want to pass around. You will most likely see an str variable used as &str. Put simply, an str variable is accessed as a reference to some UTF-8 data. An str variable is usually called a “string slice” or, even more simply, a “slice”. Due to its nature, you can’t add or remove any data from an existing str variable.

Read more at Linux Journal

How Open Source Is Accelerating NFV Transformation

Red Hat is noted for making open source a culture and business model, not just a way of developing software, and its message of open source as the path to innovation resonates on many levels.

In anticipation of the upcoming Open Networking Summit, we talked with Thomas Nadeau, Technical Director NFV at Red Hat, who gave a keynote address at last year’s event, to hear his thoughts regarding the role of open source in innovation for telecommunications service providers.

One reason for open source’s broad acceptance in this industry, he said, was that some very successful projects have grown too large for any one company to manage, or single-handedly push their boundaries toward additional innovative breakthroughs.

“There are projects now, like Kubernetes, that are too big for any one company to do. There’s technology that we as an industry need to work on, because no one company can push it far enough alone,” said Nadeau. “Going forward, to solve these really hard problems, we need open source and the open source software development model.”

Here are more insights he shared on how and where open source is making an innovative impact on telecommunications companies.

Linux.com: Why is open source central to innovation in general for telecommunications service providers?

Nadeau: The first reason is that the service providers can be in more control of their own destiny. There are some service providers that are more aggressive and involved in this than others. Second, open source frees service providers from having to wait for long periods for the features they need to be developed.

And third, open source frees service providers from having to struggle with using and managing monolithic systems when all they really wanted was a handful of features. Fortunately, network equipment providers are responding to this overkill problem. They’re becoming much more flexible, more modular, and open source is the best means to achieve that.

Linux.com: In your ONS keynote presentation, you said open source levels the playing field for traditional carriers in competing with cloud-scale companies in creating digital services and revenue streams. Please explain how open source helps.

Nadeau: Kubernetes again. OpenStack is another one. These are tools that these businesses really need, not to just expand, but to exist in today’s marketplace. Without open source in that virtualization space, you’re stuck with proprietary monoliths, no control over your future, and incredibly long waits to get the capabilities you need to compete.

There are two parts in the NFV equation: the infrastructure and the applications. NFV is not just the underlying platforms, but this constant push and pull between the platforms and the applications that use the platforms.

NFV is really virtualization of functions. It started off with monolithic virtual machines (VMs). Then came “disaggregated VMs” where individual functions, for a variety of reasons, were run in a more distributed way. To do so meant separating them, and this is where SDN came in, with the separation of the control plane from the data plane. Those concepts were driving changes in the underlying platforms too, which drove up the overhead substantially. That in turn drove interest in container environments as a potential solution, but it’s still NFV.

You can think of it as the latest iteration of SOA with composite applications. Kubernetes is the kind of SOA model that they had at Google, which dropped the worry about the complicated networking and storage underneath and simply allowed users to fire up applications that just worked. And for the enterprise application model, this works great.

But not in the NFV case. In the NFV case, in the previous iteration of the platform at OpenStack, everybody enjoyed near one-for-one network performance. But when we move it over here to OpenShift, we’re back to square one where you lose 80% of the performance because of the latest SOA model that they’ve implemented. And so now evolving the underlying platform rises in importance, and so the pendulum swing goes, but it’s still NFV. Open source allows you to adapt to these changes and influences effectively and quickly. Thus innovations happen rapidly and logically, and so do their iterations.

Linux.com: Tell us about the underlying Linux in NFV, and why that combo is so powerful.

Nadeau: Linux is open source and it always has been in some of the purest senses of open source. The other reason is that it’s the predominant choice for the underlying operating system. The reality is that all major networks and all of the top networking companies run Linux as the base operating system on all their high-performance platforms. Now it’s all in a very flexible form factor. You can lay it on a Raspberry Pi, or you can lay it on a gigantic million-dollar router. It’s secure, it’s flexible, and scalable, so operators can really use it as a tool now.

Linux.com: Carriers are always working to redefine themselves. Indeed, many are actively seeking ways to move out of strictly defensive plays against disruptors, and onto offense where they ARE the disruptor. How can network function virtualization (NFV) help in either or both strategies?

Nadeau: Telstra and Bell Canada are good examples. They are using open source code in concert with the ecosystem of partners they have around that code, which allows them to do things differently than they have in the past. There are two main things they do differently today. One is they design their own network. They design their own things in a lot of ways, whereas before they would possibly need to use a turnkey solution from a vendor that looked a lot like, if not identical to, their competitors’ businesses.

These telcos are taking a real “in-depth, roll up your sleeves” approach. Now that they understand what they’re using at a much more intimate level, they can collaborate with the downstream distro providers or vendors. This goes back to the point that the ecosystem, which is analogous to partner programs that we have at Red Hat, is the glue that fills in gaps and rounds out the network solution that the telco envisions.

Learn more at Open Networking Summit, happening April 3-5 at the San Jose McEnery Convention Center.

How to Monitor Disk IO in Linux

iostat is used to get input/output statistics for storage devices and partitions. iostat is part of the sysstat package. With iostat, you can monitor the read/write speeds of your storage devices (such as hard disk drives and SSDs) and their partitions. In this article, I am going to show you how to monitor disk input/output using iostat in Linux. So, let’s get started.

Installing iostat on Ubuntu/Debian:

The iostat command is not available on Ubuntu/Debian by default. But, you can easily install the sysstat package from the official package repository of Ubuntu/Debian using the APT package manager. iostat is a part of the sysstat package as I’ve mentioned before.

First, update the APT package repository cache with the following command:

sudo apt update
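The full article walks through the rest; as a minimal sketch of the next steps (assuming the sysstat package provides iostat, as noted above, and a 2-second sampling interval):

# install sysstat, which provides iostat
sudo apt install sysstat -y

# per-device statistics with extended columns, refreshed every 2 seconds
iostat -d -x 2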

Read more at LinuxHint

Linux Desktop News: Zorin OS 15 Gets New Touch Interface, Android Sync And Native Flatpak Support

One of the things I love about using Linux is how connected you feel to the community. That’s especially true when the actual creator and CEO of a Linux desktop OS reaches out and personally invites you to give it a test drive. And after reading what’s in store for Zorin OS 15 (currently in beta), this one just climbed higher on my list of distributions to discover. … Read more at Forbes

Future of the Firm

The “future of the firm” is a big deal. As jobs become more automated, and people more often work in teams, with work increasingly done on a contingent and contract basis, you have to ask: “What does a firm really do?” Yes, successful businesses are increasingly digital and technologically astute. But how do they attract and manage people in a world where two billion people work part-time? How do they develop their workforce when automation is advancing at light speed? And how do they attract customers and full-time employees when competition is high and trust is at an all-time low?

When thinking about the big-picture items affecting the future of the firm, we identified several topics that we discuss in detail in this report:

Trust, responsibility, credibility, honesty, and transparency.

Customers and employees now look for, and hold accountable, firms whose values reflect their own personal beliefs. We’re also seeing a “trust shakeout,” where brands that were formerly trusted lose trust, and new companies build their positions based on ethical behavior. And companies are facing entirely new “trust risks” in social media, hacking, and the design of artificial intelligence (AI) and machine learning (ML) algorithms.

The search for meaning.

Employees don’t just want money and security; they want satisfaction and meaning. They want to do something worthwhile with their lives.

New leadership models and generational change.

Firms of the 20th century were based on hierarchical command and control models. Those models no longer work. In successful firms, leaders rely on their influence and trustworthiness, not their position.

Read more at O’Reilly

How to Install OpenLDAP on Ubuntu Server 18.04

The Lightweight Directory Access Protocol (LDAP) allows for the querying and modification of an X.500-based directory service. In other words, LDAP is used over a Local Area Network (LAN) to manage and access a distributed directory service. LDAP’s primary purpose is to provide a set of records in a hierarchical structure. What can you do with those records? The best use case is user validation/authentication against desktops. If both server and client are set up properly, you can have all your Linux desktops authenticating against your LDAP server. This makes for a great single point of entry so that you can better manage (and control) user accounts.

The most popular iteration of LDAP for Linux is OpenLDAP. OpenLDAP is a free, open-source implementation of the Lightweight Directory Access Protocol, and makes it incredibly easy to get your LDAP server up and running.

In this three-part series, I’ll be walking you through the steps of:

  1. Installing OpenLDAP server.

  2. Installing the web-based LDAP Account Manager.

  3. Configuring Linux desktops, such that they can communicate with your LDAP server.

In the end, all of your Linux desktop machines (that have been configured properly) will be able to authenticate against a centralized location, which means you (as the administrator) have much more control over the management of users on your network.

In this first piece, I’ll be demonstrating the installation and configuration of OpenLDAP on Ubuntu Server 18.04. All you will need to make this work is a running instance of Ubuntu Server 18.04 and a user account with sudo privileges.
Let’s get to work.

Update/Upgrade

The first thing you’ll want to do is update and upgrade your server. Do note, if the kernel gets updated, the server will need to be rebooted (unless you have Live Patch, or a similar service running). Because of this, run the update/upgrade at a time when the server can be rebooted.
To update and upgrade Ubuntu, log into your server and run the following commands:

sudo apt-get update

sudo apt-get upgrade -y

When the upgrade completes, reboot the server (if necessary), and get ready to install and configure OpenLDAP.

Installing OpenLDAP

We’ll be using OpenLDAP as our LDAP server software, which can be installed from the standard repository. To install the necessary pieces, log into your Ubuntu Server and issue the following command:

sudo apt-get install slapd ldap-utils -y

During the installation, you’ll first be asked to create an administrator password for the LDAP directory. Type and verify that password (Figure 1).

Figure 1: Creating an administrator password for LDAP.

Configuring LDAP

With the installation of the components complete, it’s time to configure LDAP. Fortunately, there’s a handy tool we can use to make this happen. From the terminal window, issue the command:

sudo dpkg-reconfigure slapd

In the first window, hit Enter to select No and continue on. In the second window of the configuration tool (Figure 2), you must type the DNS domain name for your server. This will serve as the base DN (the point from where a server will search for users) for your LDAP directory. In my example, I’ve used example.com (you’ll want to change this to fit your needs).

Figure 2: Configuring the domain name for LDAP.

In the next window, type your Organizational name (i.e., the name of your company or department). You will then be prompted to (once again) create an administrator password (you can use the same one as you did during the installation). Once you’ve taken care of that, you’ll be asked the following questions:

  • Database backend to use – select MDB.

  • Do you want the database to be removed when slapd is purged? – Select No.

  • Move old database? – Select Yes.

OpenLDAP is now ready for data.
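As an optional sanity check (assuming the example.com domain configured above), you can dump the directory’s current contents with slapcat, which is installed alongside slapd, and confirm the base DN you configured is present:

sudo slapcat

The output should include an entry whose dn matches dc=example,dc=com.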

Adding Initial Data

Now that OpenLDAP is installed and running, it’s time to populate the directory with a bit of initial data. In the second piece of this series, we’ll be installing a web-based GUI that makes it much easier to handle this task, but it’s always good to know how to add data the manual way.

One of the best ways to add data to the LDAP directory is via a text file, which can then be imported with the ldapadd command. Create a new file with the command:

nano ldap_data.ldif

In that file, paste the following contents:

dn: ou=People,dc=EXAMPLE,dc=COM
objectClass: organizationalUnit
ou: People

dn: ou=Groups,dc=EXAMPLE,dc=COM
objectClass: organizationalUnit
ou: Groups

dn: cn=DEPARTMENT,ou=Groups,dc=EXAMPLE,dc=COM
objectClass: posixGroup
cn: DEPARTMENT
gidNumber: 5000

dn: uid=USER,ou=People,dc=EXAMPLE,dc=COM
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
uid: USER
sn: LASTNAME
givenName: FIRSTNAME
cn: FULLNAME
displayName: DISPLAYNAME
uidNumber: 10000
gidNumber: 5000
userPassword: PASSWORD
gecos: FULLNAME
loginShell: /bin/bash
homeDirectory: USERDIRECTORY

In the above file, every entry in all caps needs to be modified to fit your company’s needs. Once you’ve modified the above file, save and close it with the [Ctrl]+[x] key combination.
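One caveat worth flagging: the userPassword value above is stored as plain text. If you’d rather store a hash, the slappasswd utility (installed with slapd) will generate one you can paste in place of PASSWORD:

# prompts for a password twice and prints a salted {SSHA} hash
slappasswd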

To add the data from the file to the LDAP directory, issue the command:

ldapadd -x -D cn=admin,dc=EXAMPLE,dc=COM -W -f ldap_data.ldif

Remember to alter the dc entries (EXAMPLE and COM) in the above command to match your domain name. After running the command, you will be prompted for the LDAP admin password. When you successfully authenticate to the LDAP server, the data will be added. You can then ensure the data is there by running a search like so:

ldapsearch -x -LLL -b dc=EXAMPLE,dc=COM 'uid=USER' cn gidNumber

Where EXAMPLE and COM are your domain name components and USER is the user to search for. The command should report the entry you searched for (Figure 3).

Figure 3: Our search was successful.

Now that you have your first entry into your LDAP directory, you can edit the above file to create even more. Or, you can wait until the next entry into the series (installing LDAP Account Manager) and take care of the process with the web-based GUI. Either way, you’re one step closer to having LDAP authentication on your network.

Managing Changes in Open Source Projects

Why bother having a process for proposing changes to your open source project? Why not just let people do what they’re doing and merge the features when they’re ready? Well, you can certainly do that if you’re the only person on the project. Or maybe if it’s just you and a few friends.

But if the project is large, you might need to coordinate how some of the changes land. Or, at the very least, let people know a change is coming so they can adjust if it affects the parts they work on. A visible change process is also helpful to the community. It allows them to give feedback that can improve your idea. And if nothing else, it lets people know what’s coming so that they can get excited, and maybe get you a little bit of coverage on Opensource.com or the like. Basically, it’s “here’s what I’m going to do” instead of “here’s what I did,” and it might save you some headaches as you scramble to QA right before your release.

So let’s say I’ve convinced you that having a change process is a good idea. How do you build one?

Decide who needs to review changes

One of the first things you need to consider when putting together a change process for your community is: “who needs to review changes?” This isn’t necessarily approving the changes; we’ll come to that shortly. But are there people who should take a look early in the process? 

Read more at OpenSource.com

CI/CD Gets Governance and Standardization

Kubernetes, microservices and the advent of cloud native deployments have created a Renaissance era in computing. As developers write and deploy code as part of continuous integration and continuous delivery (CI/CD) production processes, an explosion of tools has emerged for CI/CD processes, often targeted at cloud native deployments.

“Basically, when we all started looking at microservices as a possible paradigm of development, we needed to learn how to operationalize them,” Priyanka Sharma, director of alliances at GitLab and a member of the governing board at the Cloud Native Computing Foundation (CNCF), said. “That was something new for all of us. And from a very good place, a lot of technology came out, whether it’s open source projects or vendors who are going to help us with every niche problem we were going to face.”

As a countermeasure to this chaos, The Linux Foundation created the CD Foundation, along with more than 20 industry partners, to help standardize tools and processes for CI/CD production pipelines. Sharma has played a big part in establishing the CD Foundation, which she discusses in this episode of The New Stack Makers podcast hosted by Alex Williams, founder and editor-in-chief of The New Stack.

Read more at The New Stack

A Musical Tour of Hints and Tools for Debugging Host Networks

Shannon Nelson from the Oracle Linux Kernel Development team offers these tips and tricks to help make host network diagnostics easier. He also includes a recommended playlist for accompanying your debugging!

Ain’t Misbehavin’ (Dinah Washington)

As with many debugging situations, digging into and resolving a network-based problem can seem like a lot of pure guesswork and magic. In the networking realm, not only do we have the host system’s processes and configurations to contend with, but also the exciting and often frustrating asynchronicity of network traffic.

Some of the problems that can trigger a debug session are reports of lost packets, corrupt data, poor performance, even random system crashes.  Not always do these end up as actual network problems, but as soon as the customer mentions anything about their wiring rack or routers, the network engineer is brought in and put on the spot.

This post is intended not as a full how-to in debugging any particular network issue, but more a set of some of the tips and tools that we use when investigating network misbehavior.

Start Me Up (The Rolling Stones)

Probably the most important debugging tool available, and the one needed even to get started, is a concise and clear description of what is happening that shouldn’t be happening. This is harder to get than one might think. You know what I mean, right? The customer might give us anything from “it’s broken” to a three-page dissertation covering everything but the actual problem.

We start gathering a clearer description by asking simple questions that should be easy to answer.  Things like:

  • Who found it, who is the engineering contact?
  • Exactly what equipment was it running on?
  • When/how often does this happen?
  • What machines/configurations/NICs/etc are involved?
  • Do all such machines have this problem, or only one or two?
  • Are there routers and/or switches involved?
  • Are there Virtual Machines, Virtual Functions, or Containers involved?
  • Are there macvlans, bridges, bonds or teams involved?
  • Are there any network offloads involved?

With this information, we should be able to write our own description of the problem and see if the customer agrees with our summary.  Once we can refine that, we should have a better idea of what needs to be looked into.

Some of the most valuable tools for getting this information are simple user commands that the user can do on the misbehaving systems.  These should help detail what actual NICs and drivers are on the system and how they might be connected.

uname -a – This is an excellent way to start, if for nothing else than to get a basic idea of what the system is and how old the kernel being used is. This can catch the case where the customer isn’t running a supported kernel.

These next few are good for finding what all is on the system and how they are connected:

ip addr, ip link – these are good for getting a view of the network ports that are configured, and perhaps point out devices that are either offline or not set to the right address.  These can also give a hint as to what bonds or teams might be in place.  These replace the deprecated “ifconfig” command.

ip route – shows what network devices are going to handle outgoing packets.  This is mostly useful on systems with many network ports. This replaces the deprecated “route” command and the similar “netstat -rn“.

brctl show – lists software bridges set up and what devices are connected to them.

netstat -i – gives a summary list of the interfaces and their basic statistics. These are also available with “ip -s link“, just formatted differently.

lseth – this is a non-standard command that gives a nice summary combining a lot of the output from the above commands.  (See http://vcojot.blogspot.com/2015/11/introducing-lsethlsnet.html)

Watchin’ the Detectives (Elvis Costello)

Once we have an idea which particular device is involved, the following commands can help gather more information about that device.  This can get us an initial clue as to whether or not the device is configured in a generally healthy way.

ethtool <ethX> – lists driver and connection attributes, such as the current connection speed and whether link is detected.

ethtool -i <ethX> – lists device driver information, including kernel driver and firmware versions, useful for being sure the customer is working with the right software; and PCIe device bus address, good for tracking the low level system hardware interface.

ethtool -l <ethX> – shows the number of Tx and Rx queues that are set up, which usually should match the number of CPU cores to be used.

ethtool -g <ethX> – shows the number of packet buffers for each Tx and Rx queue; too many and we’re wasting memory, too few and we risk dropping packets under heavy throughput pressure.

lspci -s <bus:dev:func> -vv – lists detailed information about the NIC hardware and its attributes. You can get the interface’s <bus:dev:func> from “ethtool -i“.

Diary (Bread)

The system logfiles usually have some good clues in them as to what may have happened around the time of the issue being investigated. “dmesg” gives the direct kernel log messages, but beware that it is a limited-size buffer that can get overrun and lose history over time. In older Linux distributions the system logs are found in /var/log, most usefully in either /var/log/messages or /var/log/syslog, while newer “systemd”-based systems use “journalctl” for accessing log messages. Either way, there are often interesting traces to be found that can help describe the behavior.

One thing to watch out for is that when the customer sends a log extract, it usually isn’t enough.  Too often they will capture something like the kernel panic message, but not the few lines before that show what led up to the panic.  Much more useful is a copy of the full logfile, or at least something with several hours of log before the event.

Once we have the full file, it can be searched for error messages, any log messages with the ethX name or the PCI device address, to look for more hints.  Sometimes just scanning through the file shows patterns of behavior that can be related.
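A couple of hedged examples of that kind of search (the interface name, PCI address, and log path here are placeholders):

# classic syslog: look for the NIC name, its PCI address, or obvious errors
grep -E -i 'eth0|0000:3b:00\.0|error' /var/log/messages

# systemd-based systems: kernel messages since yesterday, filtered for the NIC
journalctl -k --since yesterday | grep -i eth0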

Fakin’ It (Simon & Garfunkel)

With the information gathered so far, we should have a chance at creating a simple reproducer.  Much of the time we can’t go poking at the customer’s running systems, but need to demonstrate the problem and the fix on our own lab systems.  Of course, we don’t have the same environment, but with a concise enough problem description we stand a good chance of finding a simple case that shows the same behavior.

Some traffic generator tools that help in reproducing the issues include:

ping – send one or a few packets, or send a packet flood to a NIC.  It has flags for size, timing, and other send parameters.

iperf – good for heavy traffic exercise, and can run several in parallel to get a better RSS spread on the receiver.

pktgen – this kernel module can be used to generate much more traffic than user level programs, in part because the packets don’t have to traverse the sender’s network stack.  There are also several options for packet shapes and throughput rates.

scapy – this is a Python tool that allows scripting of specially crafted packets, useful in making sure certain data patterns are exactly what you need for a particular test.
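As a sketch of how a couple of these might be aimed at a test receiver (host name, sizes, and counts are placeholders, and iperf3 is shown here, though the classic iperf takes similar flags):

# 1000 larger-than-default pings sent 10 ms apart (sub-200 ms intervals need root)
sudo ping -c 1000 -s 1400 -i 0.01 testbox

# four parallel TCP streams for 30 seconds, to spread work across receive queues
iperf3 -c testbox -P 4 -t 30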

All Along the Watchtower (The Jimi Hendrix Experience)

With our own model of the problem, we can start looking deeper into the system to see what is happening: looking at throughput statistics and watching actual packet contents.  Easy statistic gathering can come from these tools:

ethtool -S <ethX> – most NIC device drivers offer Tx and Rx packets counts, as well as error data, through the ‘-S’ option of ethtool.  This device specific information is a good window into what the NIC thinks it is doing, and can show when the NIC sees low level issues, including malformed packets and bad checksums.

netstat -s – this gives protocol statistics from the upper networking stack, such as TCP connections, segments retransmitted, and other related counters.

ip -s link show <ethX> – another method for getting a summary of traffic counters, including some dropped packets.

grep <ethX> /proc/interrupts – looking at the interrupt counters can give a better idea of how well the processing is getting spread across the available CPU cores.  For some loads, we can expect a wide dispersal, and other loads might end up with one core more heavily loaded than others.

/proc/net/* – there are lots of data files exposed by the kernel networking stack available here that can show many different aspects of the network stack operations. Many of the command line utilities get their info directly from these files. Sometimes it is handy to write your own scripts to pull the very specific data that you need from these files.
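As one small example of such a script (assuming an interface named eth0; the field positions follow the /proc/net/dev layout):

#!/bin/bash
# print running rx/tx byte counts for eth0 once a second
while sleep 1; do
    awk '/eth0:/ {printf "rx_bytes=%s tx_bytes=%s\n", $2, $10}' /proc/net/dev
done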

watch – The above tools give a snapshot of the current status, but sometimes we need to get a better idea of how things are working over time.  The “watch” utility can help here by repeatedly running the snapshot command and displaying the output, even highlighting where things have changed since the last snapshot.  Example uses include:

# See the interrupt activity as it happens
watch "grep ethX /proc/interrupts"
# Watch all of the NIC's non-zero stats
watch "ethtool -S ethX | grep -v ': 0'"

Also useful for catching data in flight are tcpdump and its cousins wireshark and tcpreplay.  These are invaluable in catching packets from the wire, dissecting exactly what got sent and received, and replaying the conversation for testing.  These have whole tutorials in and of themselves, so I won’t detail them here, but here’s an example of tcpdump output for a single network packet:

23:12:47.471622 IP (tos 0x0, ttl 64, id 48247, offset 0, flags [DF], proto TCP (6), length 52)
    14.0.0.70.ssh > 14.0.0.52.37594: Flags [F.], cksum 0x063a (correct), seq 2358, ack 2055, win 294, options [nop,nop,TS val 2146211557 ecr 3646050837], length 0
    0x0000:  4500 0034 bc77 4000 4006 61d3 0e00 0046
    0x0010:  0e00 0034 0016 92da 21a8 b78a af9a f4ea
    0x0020:  8011 0126 063a 0000 0101 080a 7fec 96e5
    0x0030:  d952 5215

Photographs and Memories (Jim Croce)

Once we’ve made it this far and we have some idea that it might be a particular network device driver issue, we can do a little research into the history of the driver.  A good web search is an invaluable friend. For example, a web search for “bnxt_en dropping packets” brings up some references to a bugfix for the Nitro A0 hardware – perhaps this is related to a packet drop problem we are seeing?

If we have a clone of the Linux kernel git repository, we can do a search through the patch history for keywords.  If there’s something odd happening with macvlan filters, this will point out some patches that might be related to the issue.  For example, here’s a macvlan issue with driver resets that was fixed upstream in v4.18:

$ git log --oneline drivers/net/ethernet/intel/ixgbe | grep -i macvlan | grep -i reset 
8315ef6 ixgbe: Avoid performing unnecessary resets for macvlan offload 
e251ecf ixgbe: clean macvlan MAC filter table on VF reset
 
$ git describe --contains 8315ef6 
v4.18-rc1~114^2~380^2

Reelin’ In the Years (Steely Dan)

A couple of examples can show a little of how these tools have been used in real life.  Of course, it’s never as easy as it sounds when you’re in the middle of it.

lost/broken packets with TSO from sunvnet through bridge

When doing some performance testing on the sunvnet network driver, a virtual NIC in the SPARC Linux kernel, we found that enabling TSO actually hurt throughput significantly, rather than helping, when going out to a remote system.  After using netstat and ethtool -S to find that there were a lot of lost packets and retries through the base machine’s physical NIC, we used tcpdump on the NIC and at various points in the internal software bridge to find where packets were getting broken and dropped.  We also found comments on the netdev mailing list about an issue with TSO’d packets getting messed up when going into the software bridge.  We turned off TSO for packets headed into the host bridge, and the performance issue was fixed.

log file points out misbehaving process

In a case where NIC hardware was randomly freezing up on several servers, we found that a compute service daemon had recently been updated with a broken version that would immediately die and restart several times a second on scores of servers at the same time and was resetting the NICs each time.  Once the daemon was fixed, the NIC resetting stopped and the network problem went away.

Bring It On Home

This is just a quick overview of some of the tools for debugging a network issue.  Everyone has their favorite tools and different uses; we’ve only touched on a few here.  They are all handy, but all need our imagination and perseverance to be useful in getting to the root of whatever problem we are chasing.  Also useful are quick shell scripts written to collect specific sets of data, and shell scripts to process various bits of data when looking for something specific.  For more ideas, see the links below.

And sometimes, when we’ve dug so far and haven’t yet found the gold, it’s best to just get up from the keyboard, take a walk, grab a snack, listen to some good music, and let the mind wander.

Good hunting.

This article originally appeared at Oracle Developers Blog