GitHub uses MySQL as its main datastore for all things non-git
, and its availability is critical to GitHub’s operation. The site itself, GitHub’s API, authentication and more, all require database access. We run multiple MySQL clusters serving our different services and tasks. Our clusters use classic master-replicas setup, where a single node in a cluster (the master) is able to accept writes. The rest of the cluster nodes (the replicas) asynchronously replay changes from the master and serve our read traffic.
The availability of master nodes is particularly critical. With no master, a cluster cannot accept writes: any writes that need to be persisted cannot be persisted. Any incoming changes such as commits, issues, user creation, reviews, new repositories, etc., would fail.
To support writes we clearly need to have an available writer node, a master of a cluster. But just as important, we need to be able to identify, or discover, that node.
On a failure, say a master box crash scenario, we must ensure the existence of a new master, as well as be able to quickly advertise its identity. The time it takes to detect a failure, run the failover and advertise the new master’s identity makes up the total outage time.
This post illustrates GitHub’s MySQL high availability and master service discovery solution, which allows us to reliably run a cross-data-center operation, be tolerant of data center isolation, and achieve short outage times on a failure.
Read more at GitHub