Running applications in our brave new container orchestration world is like managing herds of fireflies; they blink in and out. There is no such thing as uptimes anymore. Applications run, and when they fail, replacements launch from vanilla images. Easy come, easy go. But if your application needs to preserve state, it and must either take periodic snapshots or have some other method of recovering state. Snapshots are far from ideal as you will likely lose data, as with any non-graceful shutdown. This is not optimal, so Apache Mesophere’s Isabel Jimenez and Kapil Arya presented some new ideas at LinuxCon North America.
Arya explains how managing stateless applications is different from managing stateful applications: “When you scale up, you basically launch the new instances, or new loads, or a new cluster. They are pretty much starting all from the vanilla image, the idea being that everything is immutable. When you want to scale down, you just kill the extra instances. If the need comes and you want to, say, schedule some high-priority task, you can easily kill the additional instances that are no longer needed or that need to be preempted, and your high-priority task can actually get the node or the resources right away.”
Stateful applications are different. “To kill an application that is already running, if it’s not a graceful shutdown, then you lose the computation time, and so on. Basically, what that means is, if you have a high-priority task coming in, then killing some instances of the stateful application will definitely result in some compute time loss.”
Container orchestration tools are more optimized for stateless applications. How can we make it better for stateful applications? Arya says, “Make them stateless.” How? One way is to start from scratch. Rewrite your stateful apps to be stateless. That is probably not going to happen. Instead, you could offload the job of managing state to your container orchestration framework and migrate your processes. “We’ll see what actually is involved in doing such a migration. This is a very general recipe that pretty much works on all these scenarios. You first pause the running process, or the container, or the virtual machine, so that the state is now immutable. You then take a snapshot of the current state. You copy over the snapshot to the target node, or the new data center, or the new cluster. Finally, you restart from that snapshot that you just took, and you have the application or the virtual machine up and running.”
Taking the snapshot is referred to as checkpointing. Ideally this happens very quickly, in milliseconds, so that nobody notices any interruptions or delays. Several factors influence this, especially the memory footprint of the application. Arya says that “If you have a memory footprint of a gigabyte, and you’re writing a checkpoint image to a regular disk, then assuming there’s roughly 100 megabytes per second, it’ll take 10 seconds to dump the checkpoint image. If you have some fancy hardware back end, like Cluster File System, then you can get pretty amazing speeds like 60 gigabytes per second or so.”
Watch the complete presentation (below) to learn more details of how Apache is building this functionality into Mesos and to see it demonstrated.