Testing Network Connectivity for Applications in Containers

420

Testing applications is a critical part of software development as illustrated by the rise of continuous integration and automated testing. In his upcoming LinuxCon + ContainerCon talk — Testing Applications with Traffic Control in Containers — Alban Crequy will focus on one area of testing that is difficult to automate: poor network connectivity. He will describe a testing approach which emulates network connectivity and which integrates existing Linux kernel features into higher level tools such as Kubernetes and Weave Scope. Additionally, he will show how an application running in Kubernetes behaves in response to changing network parameters.

In this interview, Crequy, co-founder and software engineer at Kinvolk GmbH, explains some of the challenges involved in automating testing, describes how the Traffic Control approach works, and talks about potential consequences of network issues.

Linux.com: Can you give us some background on why testing is so important — particularly network testing in containers?

Alban Crequy: Testing applications is important. Some people even go as far as saying, “If it isn’t tested, it doesn’t work.” Although that may have both a degree of truth and untruth to it, the rise of continuous integration (CI) and automated testing has shown that the software industry is taking testing seriously. Network connectivity, however, is difficult to automate and, thus, hasn’t been adequately incorporated into testing scenarios.

The typical testing process has the developer as the first line of defence. Developers usually work within reliable networking conditions. The developers then submit their code to a CI system, which also runs tests under good networking conditions. Once the CI system goes green, internal testing is usually done; ship it!

Nowhere in this process were scenarios tested in which your application experienced degraded network conditions. If your internal tests don’t cover these scenarios, then it’s your users who’ll be doing the testing. This is far from an ideal situation and goes against the “test early, test often” mantra of CI; a bug will cost you more the later it’s caught.

While the development of microservice architectures and containers is replacing the monolithic approach of developing applications, the need for testing degraded networks has not disappeared. On the contrary, when an application scales over more nodes and more containers, there is more communication between containers over the network than on a monolithic application running on one server. Thus, there are many more network links that could be degraded and introduce unforeseen behaviour.

Linux.com: What are some consequences of failure to test? Are there security issues involved?

Alban: I will give three examples showing the consequences of network issues.

Let’s say you have a web shop. Your customer clicks on “buy,” it redirects to a new page but freezes because of a connection issue. The user does not get feedback on whether the javascript code will try again automatically; the user does not know whether they should refresh. That’s a bug. They might click again on “buy” and might end up buying the product twice.

Second example, you have a video stream server. The Real-Time Protocol (RTP) uses UDP packets. If some packets drop or arrive too late, it’s not a big deal; the video player will display a degraded video because of the missing packets but the stream will otherwise play just fine. Or, will it? How can the developers of a video stream server test a scenario where 3% of packets are dropped or delayed?

Finally, applications like etcd or zookeeper implement a consensus algorithm. Nodes regularly send heartbeat messages to other nodes to inform them of their existence. They should be designed to handle a node disconnecting from the network and network splits. What happens when a heartbeat message is delayed? For this kind of applications, testing different scenarios of degraded networks is critical.

I don’t know about the security issues involved. It might depend on your application. Security implications are often difficult to foresee.

Linux.com: What are the challenges of automating this testing?

Alban: It is challenging because testing techniques have to follow architectural changes. When you have one monolithic application that can just be run on a developer’s laptop, there is no need to test a degraded network between the different components. But when that application becomes distributed, the testing framework has to become distributed as well. This brings the complexity of a distributed system into the testing area. With the development of cloud native applications, naturally scalable, and microservices, we need distributed testing frameworks that can be integrated with continuous integration tools.

However, we can benefit from new container and orchestration tools. A testing framework does not have to reinvent the wheel and can benefit from being run in containers and using tools like Kubernetes in the same way as your application.

Although I think that both manual testing and automated testing are useful, automating the user interaction with a website brings its own set of challenges. Tools like Selenium can be used to script that user interaction. This helps, of course. But developers will have to write unit tests and keep them updated whenever the website changes. Another difficulty is the integration between unit tests and the traffic control configuration. It might be required if a unit test has to configure what kind of degraded network it requires.

Automating network testing is not only about high-level tooling: it can also benefit from the latest developments in the Linux kernel. For example, in the recently released Linux 4.7, it is possible to attach BPF programs on tracepoints. Each subsystem of the kernel defines a list of tracepoints, for example on system calls. This allows to run more complex code to gather statistics each time a specific system call such as sendmsg() is used. Before this new version, BPF programs could already be attached on functions with kprobes but without any ABI guarantees, so the monitoring program was tied to a specific kernel version. I am looking forward to exploring the new possibilities it exposes to monitor for testing purposes.

Linux.com: How are you approaching this problem? Briefly, what are the benefits of a tool such as Traffic Control on Linux?

Alban: The traffic control tools on Linux exist since Linux 2.2 was released in 1999. The traffic control subsystem on Linux is composed of different network schedulers and filters. They can be used to distribute the bandwidth fairly, to reserve the bandwidth to specific applications, or to avoid bufferbloat. In my case, I am using a specific network scheduler called “netem” or network emulator. The network emulator is installed on the network interface of a container. It can be configured to limit the bandwidth arbitrarily, to add latency, and to randomly drop packets to simulate a degraded network.

Those are existing features in the Linux kernel. They are made useful in the context of testing container networks by integrating them in higher-level tools such as the container orchestrator Kubernetes and the visualization and monitoring tool Weave Scope. As much as possible, it is not reinventing the wheel but reusing and integrating existing components. In my talk, I will show how this integration was done in a plugin of Weave Scope.

This approach does not require any changes in the applications being tested: they open network connections using the usual Linux API. Moreover, the traffic control configuration can be changed dynamically without restarting any applications and without breaking existing TCP connections.

Linux.com: Is there a particularly important issue that developers should be aware of? If so, what and why?

Alban: I would just suggest thinking about user feedback and how it gets impacted by a slow network. Then, just try it on your websites!

My approach has some limitations at the time of writing. For example, the traffic control plugin in Weave Scope controls the network on each container but it cannot control individual connections. But that should be enough to start with for most use cases. Let us at Kinvolk know if you have other testing needs that would benefit from traffic control. We are always interested in hearing new use cases.

LinuxCon + ContainerCon Europe 

Look forward to three days and 175+ sessions of content covering the latest in containers, Linux, cloud, security, performance, virtualization, DevOps, networking, datacenter management and much more. You don’t want to miss this year’s event, which marks the 25th anniversary of Linux! Register now before tickets sell out. 

Alban Crequy
Originally from France, Alban Crequy currently lives in Berlin where he is a co-founder and software engineer at Kinvolk GmbH. He is the technical project lead for rkt, a container runtime for Linux. Before falling into containers, Alban worked on various projects core to modern Linux; kernel IPC and storage, dbus performance and security, etc. His current technical interests revolve around networking, security, systemd, and containers at the lower levels of the system.