For DevOps, installation is one of the major tasks. People may think package installation is pretty straightforward and easy now: Just run commands like apt-get, yum, brew, etc. Or simply leave it to containers.
Is it really that easy? Here is a list of headaches and hidden costs.
Admit it. We all have unexpected installation failures.
Okay, we have wrapped up multiple scripts, which will install and configure all required services and components. And the test looks good. Services are running correctly. The GUI opens nicely. It feels just great. Maybe we’re even a bit proud of our achievements. Shouldn’t we be?
Then more and more people start to use our code to do deployment. That’s when the real fun starts. Oh, yes, and surprises and embarrassments, too. Package installations fail with endless issues. The process mysteriously sticks somewhere with few clues. Or the installation itself seems to be fine, but the system just doesn’t behave the same as our testing environments.
At first, people won’t complain. They understand it happens, but with more and more issues, the mood changes. And you feel the pressure! Your boss and colleagues have their concerns, too. The task seems quite straightforward. Why is it taking so long? And how much longer will you need to stabilize the installation? Sound familiar?
So what are the moving parts and obstacles really, in terms of system installation? We want to deliver the installation feature quickly, and it has to be reliable and stable.
Problem 1: Tools are in rapid development
Linux is powerful because it believes in the philosophy of simplicity. Each tool is there for one simple purpose. Then we combine different tools into bigger ones, for bigger missions. That’s the so-called integration. Yeah, the integration!
If we only integrate stable and well-known tools, we’re in luck, and things will probably go smoothly. Otherwise, the situation can be very different.
- Tools in rapid development mean issues, limitations, and workarounds.
Even worse, the error messages can be confusing. See the example below of an error in Chef development. How can we tell at first glance that it’s a local issue, not a bug?
Installing yum-epel (0.6.0) from https://supermarket.getchef.com ([opscode] https://supermarket.chef.io/api/v1)
Installing yum (3.5.3) from https://supermarket.getchef.com ([opscode] https://supermarket.chef.io/api/v1)
/var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `encode': "\xC2" on US-ASCII (Encoding::InvalidByteSequenceError)
from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `initialize'
from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `new'
from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `parse'
from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook/metadata.rb:473:in `from_json'
from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook/metadata.rb:29:in `from_json'
from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook.rb:36:in `from_path'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cached_cookbook.rb:15:in `from_store_path'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:86:in `cookbook'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:67:in `import'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:30:in `import'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/installer.rb:106:in `block in install'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:38:in `block in download'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:35:in `each'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:35:in `download'
from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/installer.rb:105:in `install'
from /var/lib/gems/1.9.1/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
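One hint in the trace above is the phrase "on US-ASCII": Ruby was trying to read a non-ASCII byte while its default encoding was US-ASCII, which usually follows from the shell's locale. The sketch below is a pre-flight check under that assumption, not a fix confirmed by the original article; the function names `is_utf8_charmap` and `warn_if_not_utf8` are made up for illustration.

```shell
# Return 0 when the given charmap name denotes UTF-8, 1 otherwise.
is_utf8_charmap() {
  case "$1" in
    UTF-8|utf-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a warning when the given charmap is not UTF-8; print nothing otherwise.
warn_if_not_utf8() {
  if ! is_utf8_charmap "$1"; then
    echo "warning: charmap is $1, not UTF-8; non-ASCII metadata may fail to parse"
  fi
}

# Usage (not executed here). `locale charmap` prints the current charmap,
# for example "UTF-8" or "ANSI_X3.4-1968" (the formal name of US-ASCII):
#   warn_if_not_utf8 "$(locale charmap)"
```

If the warning fires, exporting a UTF-8 locale (for example via LC_ALL) before running the tool is the usual thing to try first.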
- Incompatible versions are a frequent problem in system integration. Using the latest released version of every tool usually works, but not always. Sometimes the development team has its own version preferences, which complicates things further.
We see issues like the following constantly. Yes, I know. I need to upgrade Ruby, Python, or whatever. It just takes time, which means unplanned work, again.
sudo gem install rack -v '2.0.1'
ERROR: Error installing rack: rack requires Ruby version >= 2.2.2.
Tip: Record the exact version of every component, including the OS. After a successful deployment, I usually dump the versions automatically, using the trick described in another post: Compare Difference Of Two Envs.
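As a rough illustration of dumping versions after a deployment, here is a minimal sketch. It is not the method from the linked post; the helper name `record_version`, the tool list, and the output path are all assumptions chosen for the example.

```shell
# Print "<label>: <first line of output>" for a version command,
# or "<label>: not installed" when the command is missing.
record_version() {
  label=$1
  shift
  if command -v "$1" >/dev/null 2>&1; then
    echo "$label: $("$@" 2>&1 | head -n 1)"
  else
    echo "$label: not installed"
  fi
}

# Usage (not executed here; tool list and output path are illustrative):
#   {
#     record_version os     uname -sr
#     record_version ruby   ruby --version
#     record_version python python --version
#   } > /tmp/versions.txt
```

Saving one such file per environment makes it easy to diff a working deployment against a failing one.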
Problem 2: Every network request is a vulnerable failing point
The installation frequently runs commands like apt-get/yum or curl/wget, all of which launch outgoing network requests.
Well, watch out for any network request, my friends.
- The external server may return a 5XX error, time out, or respond more slowly than before.
- Files are removed from the server, which results in an HTTP 404 error.
- A corporate firewall blocks the requests because of security or data-leak concerns.
Each outgoing network request is a failure point. Consequently, our deployment fails or suffers.
Tip: Host replicas of as many external servers as possible under our own control, for example a local HTTP server or an apt repo server.
People might try to pre-cache all Internet downloads by building customized OS images or Docker images. This makes sense for an installation with no network access, but it comes with a cost: things become more complicated, and it takes a significant amount of effort.
Tip: Record all outgoing network requests during deployment. Yes, the issue is still there. But this gives us valuable input: what to improve and what to check. Tracking requests can be done easily: Monitor Outbound Traffic In Deployment.
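One lightweight way to get a first list of remote endpoints is to scan the deployment log for URLs. This is only a heuristic sketch, not the approach from the linked post: it sees only the URLs that tools happen to print, so packet-level monitoring will catch more. The function name `list_remote_hosts` and the log file name in the usage line are assumptions.

```shell
# Read log text on stdin and print each distinct scheme://host once, sorted.
list_remote_hosts() {
  grep -oE 'https?://[A-Za-z0-9.-]+' | LC_ALL=C sort -u
}

# Usage (not executed here; deploy.log is a hypothetical log file):
#   list_remote_hosts < deploy.log
```

Run against the Chef output shown earlier, this would report the two supermarket hosts, which is already enough to know which endpoints to mirror or to ask the firewall team about.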
Problem 3: Always installing the latest version will guarantee issues
People quite often install packages like this:
apt-get -y update && apt-get -y install ruby
But what version will we get? Today we get ruby 1.9.5. Months later, it could be ruby 2.0.0, or 2.2.2. You do see the potential risks, don’t you?
Tip: Only install packages with a fixed version.
| Platform | Before | After |
|---|---|---|
| Ubuntu | apt-get install docker-engine | apt-get install docker-engine=1.12.1-0~trusty |
| CentOS | yum install kernel-debuginfo | yum install kernel-debuginfo-2.6.18-238.19.1.el5 |
| Ruby | gem install rubocop | gem install rubocop -v "0.44.1" |
| Python | pip install flake8 | pip install flake8==2.0 |
| Node.js | npm install express | npm install express@3.0.0 |
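To catch unpinned installs before they reach production, a deployment script can be scanned for install commands that lack a version. The sketch below is a rough heuristic, not a complete linter: it covers only apt-get, pip, gem, and npm, it skips yum because the version there is simply appended to the package name, and it can be fooled by unrelated "=" or "@" characters on the same line. The function names are made up for the example.

```shell
# Succeed when string $1 contains substring $2.
contains() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read shell commands on stdin and print every install command
# that does not appear to pin a version.
find_unpinned() {
  while IFS= read -r line; do
    case "$line" in
      *apt-get*install*) contains "$line" "="    || echo "$line" ;;
      *"pip install"*)   contains "$line" "=="   || echo "$line" ;;
      *"gem install"*)   contains "$line" " -v " || echo "$line" ;;
      *"npm install"*)   contains "$line" "@"    || echo "$line" ;;
    esac
  done
}

# Usage (not executed here; install.sh is a hypothetical script):
#   find_unpinned < install.sh
```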
Problem 4: Avoid installing from third-party repos
Let’s say we want to install haproxy 1.6. However, the official Ubuntu repo only provides haproxy 1.4 or 1.5. So we do this.
sudo apt-get install software-properties-common
add-apt-repository ppa:vbernat/haproxy-1.6
apt-get update
apt-get dist-upgrade
apt-get install haproxy
It works like a charm, but does this really put an end to this problem? Mostly. However, it still fails from time to time.
- The availability of a third-party repo is usually lower than that of the official repo.
---- Begin output of apt-key adv --keyserver keyserver.ubuntu.com --recv 1C61B9CD ----
STDOUT: Executing: gpg --ignore-time-conflict --no-options --no-default-keyring --homedir /tmp/tmp.VTYpQ40FG8 --no-auto-check-trustdb --trust-model always --keyring /etc/apt/trusted.gpg --primary-keyring /etc/apt/trusted.gpg --keyring /etc/apt/trusted.gpg.d/brightbox-ruby-ng.gpg --keyring /etc/apt/trusted.gpg.d/oreste-notelli-ppa.gpg --keyring /etc/apt/trusted.gpg.d/webupd8team-java.gpg --keyserver keyserver.ubuntu.com --recv 1C61B9CD
gpgkeys: key 1C61B9CD can't be retrieved
STDERR: gpg: requesting key 1C61B9CD from hkp server keyserver.ubuntu.com
gpg: no valid OpenPGP data found.
gpg: Total number processed: 0
---- End output of apt-key adv --keyserver keyserver.ubuntu.com --recv 1C61B9CD ----
- A third-party repo is more likely to change. Today you get 1.6.5 and are happy with that. But suddenly, days later, it starts to install 1.6.6 or 1.6.7. Surprise!
Tip: Avoid third-party repos as much as possible. If there’s no way to avoid one, track and examine the installed version closely.
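Examining the installed version can be automated as a post-install check that fails loudly when the repo has moved on. The following is a minimal sketch; the function name `verify_version` and the version string in the usage comment are hypothetical, and the comparison is a plain string match rather than a real version comparison.

```shell
# Compare the expected and the actually installed version of a package.
# Prints a one-line verdict; returns 0 on match, 1 on mismatch.
verify_version() {
  name=$1 expected=$2 actual=$3
  if [ "$expected" = "$actual" ]; then
    echo "OK: $name $actual"
  else
    echo "MISMATCH: $name expected $expected, got $actual"
    return 1
  fi
}

# Usage on Debian/Ubuntu (not executed here; the version string is a
# hypothetical example):
#   verify_version haproxy "1.6.5-1ppa1~trusty" \
#     "$(dpkg-query -W -f='${Version}' haproxy)"
```

Because the function returns non-zero on a mismatch, a deployment script can stop right there instead of discovering the new version later in production.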
Problem 5: Installing from source code could be painful
If we can install directly from source code, it’s much more reliable. But the problem is …
- It’s usually harder. Try to build Linux from scratch, and you will feel the disaster and mess: too many weird errors, missing packages, conflicting versions, etc. You may feel like you’re flying a plane without a manual.
- Compiling from source takes much longer. For example, compiling nodejs takes ~30 min, while apt-get takes only seconds.
- Service management facilities are missing. We want to manage the service via “service XXX status/stop/start” and configure it to start automatically. With a source code installation, these facilities might be missing.
Do containers cure the pain?
Nowadays, more and more people are starting to use containers to avoid installation failure. Yes, this largely reduces the failures for end users. But, it doesn’t solve the problem completely, especially for DevOps. We’re the ones who provide the Docker image. Right?
To build images from a Dockerfile, we still face the five common failures listed above. Containers shift the failure risks from the real deployment to the image build process.
Further reading: 5 Tips For Building Docker Image.
Bring it all together
Improvement suggestions for package installation:
- List all versions and hidden dependencies
- Monitor all external outgoing traffic
- Only install packages with a fixed version
- Try your best to avoid third-party repos
Containers can help to reduce installation failure. But, DevOps folks still need to deal with all of the above possible failures in the image build process.
Original Article: http://dennyzhang.com/installation_failure
More Reading: How To Check Linux Process Deeply With Common Sense