Twitter runs on massively complex infrastructure comprising thousands of services, so small efficiencies result in large gains. But measuring performance in a system this complex is a hard problem, as is giving Twitter’s teams the incentive and the tools to improve resource allocation. Vinu Charanya and Michael Benedict’s talk at LinuxCon North America goes into fascinating detail on the metering and chargeback system Twitter engineers built to solve this problem, using both a technical and a social approach.
One of the events responsible for the creation of this system was the 2010 World Cup. Twitter engineers anticipated several times greater demand and scaled up to meet it. But the scale-up was not entirely successful. This resulted in a fundamental architecture change, breaking down functionality into multiple independent microservices.
In 2014, Ellen DeGeneres tweeted a selfie from the Oscars, which exposed additional weaknesses in the system. It was retweeted so many times, and so fast, that the original tweet became inaccessible for over an hour. Diagnosing exactly what went wrong was not easy. Benedict says, “Given the scale and size of Twitter, it’s important to really understand what is really the overall use of infrastructure platform resources across all of these services. How do you know who’s really using what? Given the number of services and teams at Twitter, it’s extremely important to understand how we can start capturing the utilization of resources per team, per project, per hour. Finally, how do you really incentivize the right behavior for these engineers, the team leads, the managers, to do the right thing in using our resources?”
Four Challenges
Vinu Charanya describes the Chargeback system that they built to address these problems. She says, “Chargeback provides the ability to track and measure infrastructure usage on a per engineering team basis and charge each owner their usage cost accordingly. Keeping this in mind as we started designing the system, we identified the top four challenges.
“Number one: service identity. We designed a generic service identification abstraction that provides a canonical way to identify a service across infrastructures.
“Number two: resource catalog. We worked with the infrastructure teams to identify and abstract resources that can be published for developers to consume and build.
“Number three: metering. Each infrastructure graphs the consumption of resources by each service through their service identifiers. We built a classic ETL data pipeline to collect all the usage metrics to aggregate and process them in a central location.
“Number four: service metadata. We also built a service metadata system that keeps track of ops and other service-related metadata.”
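To make the metering step concrete, here is a minimal sketch of how an ETL aggregation pass like the one Charanya describes might roll raw usage records, keyed by a canonical service identifier, up to per-team, per-resource, per-hour totals. All field names, record shapes, and values here are illustrative assumptions, not Twitter’s actual schema.

```python
from collections import defaultdict

def aggregate_usage(records):
    """Roll raw usage records up to (team, resource, hour) totals.

    Each record is assumed to carry the owning team (derived from the
    canonical service identifier), a resource name, an hour bucket,
    and an amount consumed.
    """
    totals = defaultdict(float)
    for rec in records:
        key = (rec["team"], rec["resource"], rec["hour"])
        totals[key] += rec["amount"]
    return dict(totals)

# Invented sample records for illustration.
records = [
    {"team": "search", "resource": "cpu_core_hours", "hour": "2016-08-01T00", "amount": 120.0},
    {"team": "search", "resource": "cpu_core_hours", "hour": "2016-08-01T00", "amount": 30.0},
    {"team": "ads", "resource": "storage_gb", "hour": "2016-08-01T00", "amount": 512.0},
]
print(aggregate_usage(records))
```

In a real pipeline the extract and load stages would read from and write to durable stores; the point here is only the transform: collapsing many raw metrics into billable usage keyed by who consumed what, and when.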
The end result of Chargeback is three reports for users: a Chargeback bill, an infrastructure profit-and-loss report, and a budgeting report.
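The first of those reports, the Chargeback bill, can be thought of as pricing aggregated usage against the resource catalog’s unit rates. The sketch below is an assumption about how such a bill might be itemized; the resource names and dollar rates are invented for the example.

```python
# Hypothetical unit rates, as a resource catalog might publish them.
RATES = {"cpu_core_hours": 0.02, "storage_gb": 0.001}  # dollars per unit (assumed)

def chargeback_bill(team_usage, rates=RATES):
    """Return a line-itemized bill and total for one team's aggregated usage."""
    lines = {res: round(amount * rates[res], 2) for res, amount in team_usage.items()}
    return {"lines": lines, "total": round(sum(lines.values()), 2)}

bill = chargeback_bill({"cpu_core_hours": 150.0, "storage_gb": 512.0})
print(bill)  # {'lines': {'cpu_core_hours': 3.0, 'storage_gb': 0.51}, 'total': 3.51}
```

Summing the same bills from the infrastructure provider’s side, rather than the consumer’s, is essentially what the profit-and-loss and budgeting reports do with the same underlying data.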
Chargeback not only gives Twitter teams measurements of their resource usage and its real-world costs, it is also an amazing tool for understanding exactly what is happening inside this huge, fast-moving, interdependent system. Watch Charanya and Benedict’s talk (below) to learn more about the tools and architecture behind this bleeding-edge technology.