Solving Monitoring in the Cloud With Prometheus


Hundreds of companies are now using the open source Prometheus monitoring solution in production, across industries ranging from telecommunications and cloud providers to video streaming and databases.

In advance of CloudNativeCon + KubeCon Europe 2017 to be held March 29-30 in Berlin, we talked to Brian Brazil, the founder of Robust Perception and one of the core developers of the Prometheus project, who will be giving a keynote on Prometheus at CloudNativeCon. Make sure to catch the full Prometheus track at the conference.

Linux.com: What makes monitoring more challenging in a Cloud Native environment?

Brian Brazil: Traditional monitoring tools come from a time when environments were static and machines and services were individually managed. By contrast, a Cloud Native environment is highly automated and dynamic, which requires a more sophisticated approach.

With a traditional setup there were a relatively small number of services, each with its own machine. Monitoring focused on machine metrics such as CPU usage and free memory, which were the best available way to alert on user-facing issues. In a Cloud Native world, where many different services not only share machines, but the way in which they’re sharing them is in constant flux, such an approach is not scalable.

For example, with a mixed workload of user-facing and batch jobs, high CPU usage merely indicates that you’re getting good value for money out of your resources. It doesn’t necessarily indicate anything about end-user experience. Thus, metrics like latency, failure ratios, and processing times from services spread across machines must be aggregated up and then used for graphs and alerts.
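To illustrate what "aggregated up" can mean in practice (this sketch is an illustration, not part of Brazil's answer), the Python below computes a service-wide failure ratio from per-instance counts. The `InstanceStats` type, the instance names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    """Request counts reported by one instance of a service (hypothetical shape)."""
    instance: str
    requests: int
    failures: int


def failure_ratio(stats: list[InstanceStats]) -> float:
    """Service-wide failure ratio: total failures over total requests."""
    total_requests = sum(s.requests for s in stats)
    total_failures = sum(s.failures for s in stats)
    if total_requests == 0:
        return 0.0  # no traffic, so nothing has failed
    return total_failures / total_requests


# Made-up numbers: one busy healthy instance, one quiet unhealthy one.
fleet = [
    InstanceStats("host-a:8080", requests=900, failures=9),
    InstanceStats("host-b:8080", requests=100, failures=41),
]
ratio = failure_ratio(fleet)
print(f"service failure ratio: {ratio:.1%}")
```

Summing before dividing weights each instance by its traffic. Averaging the two per-instance ratios instead (1% and 41%) would suggest 21%, which overstates what users of this hypothetical service actually see.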

In the same way that the move was made from manual management of machines and services to tools like Chef and now Kubernetes, we must make a similar transition in the monitoring space.

Linux.com: What are the advantages of Prometheus?

Brian Brazil: Prometheus was created with a dynamic cloud environment in mind. It has integrations with systems such as Kubernetes and EC2 that keep it up to date with what type of containers are running where, which is essential with the rate of change in a modern environment.

Prometheus client libraries allow you to instrument your applications for the metrics and KPIs that matter in your system. For third-party applications such as Cassandra, HAProxy or MySQL, there’s a variety of exporters to expose their useful metrics.

The data Prometheus collects is enriched by labels. Labels are arbitrary key-value pairs that can be used, for example, to distinguish the development cluster from the production environment, or to break a metric out by HTTP endpoint.
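As a rough picture of how labels split one metric into several time series (an illustration, not part of Brazil's answer), here is a toy labeled counter in plain Python. `LabeledCounter` is a hypothetical class written for this sketch, not the API of any Prometheus client library, and the metric and label names are invented. The output lines imitate the style of the Prometheus text exposition format, without the escaping and metadata a real library handles.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter keyed by label sets; not the real client library API."""

    def __init__(self, name: str):
        self.name = name
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Sort the labels so the same set always maps to the same time series.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def expose(self) -> list[str]:
        """Render one line per label set, in the style of the text format."""
        lines = []
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_text}}} {value:g}")
        return lines


requests = LabeledCounter("http_requests_total")
requests.inc(env="production", handler="/api/users")
requests.inc(env="production", handler="/api/users")
requests.inc(env="development", handler="/api/users")
exposed = requests.expose()
print("\n".join(exposed))
```

Each distinct combination of label values becomes its own series, which is what makes it possible to slice by environment or endpoint later.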

The PromQL query language allows for aggregation based on these labels, calculation of 95th percentile latencies per container, service or datacenter, forecasting, and any other math you’d care to do. What’s more: if you can graph it, you can alert on it. This gives you the power to have alerts on what really matters to you and your users, and helps eliminate those late-night alerts for non-issues.
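The sketch below (an illustration, not part of Brazil's answer) mimics in plain Python what label-based aggregation followed by a percentile estimate involves. In PromQL itself this would be a single expression along the lines of `histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))`, where the metric name is an assumption. The Python version works on made-up cumulative bucket counts rather than rates, and assumes observations are spread evenly within each bucket.

```python
import math
from collections import defaultdict

# Each sample is (labels, cumulative count). The "le" label holds the
# bucket's upper bound in seconds. All values here are invented.
samples = [
    ({"service": "api", "instance": "a", "le": "0.1"}, 50.0),
    ({"service": "api", "instance": "a", "le": "0.5"}, 90.0),
    ({"service": "api", "instance": "a", "le": "1.0"}, 100.0),
    ({"service": "api", "instance": "a", "le": "+Inf"}, 100.0),
    ({"service": "api", "instance": "b", "le": "0.1"}, 30.0),
    ({"service": "api", "instance": "b", "le": "0.5"}, 90.0),
    ({"service": "api", "instance": "b", "le": "1.0"}, 98.0),
    ({"service": "api", "instance": "b", "le": "+Inf"}, 100.0),
]


def sum_by(samples, keep):
    """Sum values over every label not named in keep, like sum by (...)."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into an unbounded bucket
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Drop the instance label, keeping one set of buckets per service.
by_bucket = sum_by(samples, keep=("service", "le"))
api_buckets = [
    (float(dict(key)["le"]), count)
    for key, count in by_bucket.items()
    if dict(key)["service"] == "api"
]
p95 = quantile_from_buckets(0.95, api_buckets)
should_alert = p95 > 0.5  # hypothetical threshold: the graphed value doubles as an alert condition
print(f"estimated p95 latency for api: {p95:.3f}s, alert: {should_alert}")
```

The estimate is only as fine-grained as the bucket boundaries, so the choice of buckets matters when instrumenting latency.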

Linux.com: Are there things that catch new users off guard?

Brian Brazil: One common misunderstanding is the type of monitoring system that Prometheus is, and where it fits as part of your overall monitoring strategy.

Prometheus is metrics-based, meaning it is designed to deal efficiently with numbers, such as how many HTTP requests you’ve served and their latency. Prometheus is not an event logging system, and is thus not suitable for tracking the details of each individual HTTP request. By having both a metrics solution and an event logging solution (such as the ELK stack), you’ll cover a good range in terms of breadth and depth. Neither is sufficient on its own, due to the different engineering tradeoffs each must make.

Linux.com: What has the response to Prometheus been?

Brian Brazil: Prometheus had humble beginnings in 2012, with just two developers working on it part time. Today, in 2017, hundreds of developers have contributed to the Prometheus project itself. In addition, a rich ecosystem has grown up around it, with over 150 third-party integrations, and those are just the ones we know of.

There are hundreds of companies using Prometheus in production across all industries from telecommunications to cloud providers, video streaming to databases and startups to Fortune 500s. Since announcing 1.0 last year, the growth in users and the ecosystem has only accelerated.

Linux.com: Are there any talks in particular to watch out for at CloudNativeCon + KubeCon Europe?

Brian Brazil: For those who are used to more static environments, or just trying to reduce pager noise, Alerting in Cloud Native Environments by Fabian Reinartz of CoreOS is essential. If you’re already running Prometheus in a rapidly growing system, then in Configuring Prometheus for High Performance, SoundCloud’s Björn Rabenstein, who wrote the current storage system, will cover what you’ll need to know.

For those on the development side, there’s a workshop on Prometheus Instrumentation that’ll take you from instrumenting your code all the way through visualising the results. My own talk on Counting in Prometheus is a deep dive into the deceptively simple-sounding question of counting how many requests there were in the past hour, and how it really works in various monitoring systems.
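One reason the counting question is less simple than it sounds, offered here as an illustration rather than a summary of the talk: a counter starts again from zero whenever its process restarts, so subtracting the first sample from the last can give nonsense. The Python sketch below treats any drop as a restart. The sample values are invented, and unlike the real PromQL `increase()` function it makes no attempt to extrapolate to the edges of the time window.

```python
def increase(samples: list[float]) -> float:
    """Total growth of a counter across ordered samples, tolerating resets.

    A drop between consecutive samples is treated as a restart from zero,
    so the whole post-restart value counts as new requests.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted and counted up from zero again
    return total


# Hypothetical scrapes over an hour; the process restarted before the 4th.
scraped = [1000.0, 1040.0, 1100.0, 15.0, 60.0]
naive = scraped[-1] - scraped[0]
corrected = increase(scraped)
print(f"naive difference: {naive:g}, reset-aware increase: {corrected:g}")
```

Requests served between the last scrape before the restart and the restart itself are still lost, which is part of why counts over a window are estimates.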

Not everything is cloud native. Prometheus: The Unsung Heroes is a user story of how Prometheus can monitor infrastructure such as load balancers via SNMP. Finally, in Integrating Long-Term Storage with Prometheus, Julius Volz looks at the plans for our most sought-after pieces of future functionality.

All talks will be recorded, so if you aren’t lucky enough to attend in person, you can watch the talks later online.

CloudNativeCon + KubeCon Europe is almost sold out! Register now to secure your seat.