How DevOps Failed 60K Users

Back in 2012, when I was an operations engineer at SlideShare, I was part of a team that launched a DevOps model to speed up our processes and stay ahead of our competition.

We were a small startup with fewer than 20 employees, building what became one of the most successful professional content-sharing tools on the web. We didn’t know it at the time, but DevOps was one of the keys to our success in quickly reaching 29 million unique visitors per month, which led to the $119 million acquisition by LinkedIn in 2012.

Our goals in adopting DevOps practices were to create a more cohesive team and achieve maximum efficiency. The development team was split between San Francisco and New Delhi, and the infrastructure was quite complicated. A DevOps environment pushes every contributor to work on different parts of the product, so it helped us overcome geographic barriers by making people interact and help each other.

It also helped us spread technical knowledge to as many people as possible, so that if someone went on vacation or left the company, the impact was limited.

However, our DevOps success didn’t come without some failures, which have since become valuable lessons I share with my engineering students at Holberton School.

Lessons from a DevOps Failure

One of the main ideas behind DevOps is a greater sense of ownership over work responsibilities. That requires giving developers access to parts of the infrastructure they do not usually have access to. At SlideShare, engineers had access to production servers and production databases.

A software engineer was working on a database-related project and trying out a tool for exploring a MySQL database graphically. He decided to reorder the database columns in that tool so that the data would make more sense to him. What he did not know was that the tool was also changing the column order on the actual production database and locking it, which brought down Slideshare.net and shut out the more than 60,000 users trying to access it. The engineer did not realize that the tool was performing real actions, so it took 15 minutes of collective effort to find the source of the problem.

There were two takeaways from this failure:

  1. While DevOps pushes for everyone to have an impact on every step of the product or service cycle, it’s good practice to step back every time you grant access to something and make sure that access is actually valuable. In the case of this database outage, we realized that giving access to production data was not useful at all and was very dangerous. The developer could have gotten exactly the same value from a staging database, with a far smaller impact on the company.

  2. It’s important to better educate developers on how infrastructure works. Many of them have never been exposed to production infrastructure. DevOps is a way of working, so it is largely about human interaction. You can’t expect everyone to naturally know the hidden rules. That’s why onboarding is critical and should be mandatory.
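
To make the first takeaway concrete, here is a minimal sketch of a guard that lets only read-only statements reach a production database while leaving staging unrestricted. It is not what SlideShare ran: the host name, the `is_allowed` function, and the list of read-only keywords are all assumptions made up for illustration.

```python
import re

# Hypothetical host name, used only for illustration.
PRODUCTION_HOSTS = {"db-prod.example.internal"}

# Statements that only read data; anything else is treated as a change.
READ_ONLY = re.compile(r"^\s*(SELECT|SHOW|DESCRIBE|EXPLAIN)\b", re.IGNORECASE)


def is_allowed(host: str, statement: str) -> bool:
    """Return True if the statement may run against the given host."""
    if host not in PRODUCTION_HOSTS:
        # Staging and local databases are unrestricted in this sketch.
        return True
    # On production, only statements that start with a read-only keyword pass.
    return bool(READ_ONLY.match(statement))


if __name__ == "__main__":
    # Hypothetical table and column names.
    reorder = "ALTER TABLE slides MODIFY title VARCHAR(255) AFTER id"
    print(is_allowed("db-staging.example.internal", reorder))
    print(is_allowed("db-prod.example.internal", reorder))
```

A check like this in application code is easy to bypass, so in practice the same rule is usually better enforced in the database itself, for example by giving developers an account that has only read privileges on production.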

Sylvain Kalache
Sylvain Kalache is a co-founder of Holberton School and a former senior Site Reliability Engineer at LinkedIn. He was part of the small SlideShare startup team and a key contributor to the LinkedIn acquisition in 2012.