How to Sync Files to Amazon S3 on Linux


Amazon’s Simple Storage Service (S3) has a lot to like. It’s cheap, can be used for storing a little bit of data or as much as you want, and it can be used for distributing files publicly or just storing your private data. Let’s look at how you can take advantage of Amazon S3 on Linux.

Amazon S3 isn’t what you’d want to use for storing just a little bit of personal data. For that, you might want to use Dropbox, SpiderOak, ownCloud, or SparkleShare. Which one depends on how much data, your tolerance for non-free software, and which features you prefer. For my work files, I use Dropbox – in large part because of its LAN sync feature.

But S3 is really good if you need to back up a large amount of data, or a smaller amount that needs to live offsite. It’s also good for hosting files for public distribution if you don’t have a server, or if you need to offload data sharing because of capacity issues. Maybe you just want to use it to host a blog, cheaply. S3 also has some nifty features for content distribution and data storage from multiple regions, which we’ll get into another time.

Getting the Tools

You can use S3 in a number of ways on Linux, depending on how you’d like to manage your backups. If you look around, you’ll find a bunch of tools that support S3, including S3 Tools, Duplicity, Deja Dup, and Dragon Disk.

S3 Tools and Duplicity are command line utilities that support S3. S3 Tools, as the name implies, focuses on Amazon S3. Duplicity has S3 support, but also supports several other methods of transferring files. Deja Dup is a fairly simple GNOME app for backups, which has S3 support thanks to Duplicity. Dragon Disk is a freeware (but not free software) utility that provides more fine-grained control of backups to S3. It also supports Google Cloud Storage and other cloud storage software.

For the purposes of this article, I’m going to focus on S3 Tools. If you’re a GNOME user, it should take very little effort to set up Deja Dup for S3. We’ll tackle Duplicity and Dragon Disk another time.

S3 Tools

You might find S3 Tools in your distribution’s repositories. If not, the S3 Tools folks maintain package repositories for several versions of Red Hat, CentOS, Fedora, openSUSE, SUSE Linux Enterprise, Debian, and Ubuntu. You’ll also find instructions for adding the tools on the package repositories page.

Once you have S3 Tools installed, you need to configure it with your Amazon S3 credentials. If you haven’t signed up for them yet, hit the Sign Up button at the top of the S3 overview page. You’ll also want to look at the pricing, which starts at $0.125 per GB per month.

The pricing calculator can help you get an idea of how much it would cost to store your data in S3. For example, if you’re storing 100GB in S3, it would run about $12.50 per month (100 × $0.125), before any costs for data transfer out of S3. Transfer in to S3 is free. Amazon also charges for GET/PUT requests and so forth, so if you’re using S3 to serve up content, the pricing is going to be higher.
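If you’d rather do the storage arithmetic yourself, here is a minimal sketch. The function name s3_monthly_cost is made up for this example, and the default rate is simply the $0.125 per GB figure quoted above; it covers storage only, not requests or transfer out.

```shell
# Rough monthly storage cost: gigabytes stored times the per-GB rate.
# Usage: s3_monthly_cost GB [RATE]
# RATE defaults to the $0.125/GB/month figure quoted in this article;
# substitute whatever the pricing page currently lists.
s3_monthly_cost() {
    awk -v gb="$1" -v rate="${2:-0.125}" 'BEGIN { printf "%.2f\n", gb * rate }'
}

s3_monthly_cost 100   # prints 12.50
```

Pass a second argument to try a different rate, for example s3_monthly_cost 250 0.10.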

Back to the tools. You need to configure s3cmd (the command line utility from the S3 Tools project) like so:

s3cmd --configure

It will walk you through adding your Amazon credentials and GPG information if you want to encrypt files while stored on S3. Amazon’s storage is supposed to be private, but you should always assume that data stored on remote servers is potentially visible to others. Since I’m storing information that has no real need for privacy (WordPress backups, MP3s, photos that I’d happily publish online anyway) I don’t worry overmuch about encrypting for storage on S3.

There’s another advantage to forgoing GPG encryption, which is that s3cmd can use an rsync-like algorithm for syncing files instead of just re-copying everything.
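Before moving on, it can be worth confirming that the configuration step actually took. The sketch below assumes s3cmd saved its settings to ~/.s3cfg, which is its default location; the function name s3cmd_preflight is hypothetical, and you can pass a different path as the first argument if your config lives elsewhere.

```shell
# Pre-flight check: is s3cmd on the PATH, and is there a config file?
# ~/.s3cfg is assumed here because it is where "s3cmd --configure"
# saves settings by default; pass another path as $1 to override.
s3cmd_preflight() {
    cfg="${1:-$HOME/.s3cfg}"
    if ! command -v s3cmd >/dev/null 2>&1; then
        echo "s3cmd not installed"
        return 1
    fi
    if [ ! -f "$cfg" ]; then
        echo "no config at $cfg; run: s3cmd --configure"
        return 2
    fi
    echo "ready"
}
```

Run s3cmd_preflight with no arguments; anything other than "ready" tells you which step to revisit.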

Now to copy files and use s3cmd sync. You’ll find that the s3cmd syntax mimics standard *nix commands. Want to see what is being stored in your S3 account? Use s3cmd ls to show all buckets. (Amazon calls ’em buckets instead of directories.)

Want to copy between buckets? Use s3cmd cp bucket1 bucket2. Note that buckets are specified by the syntax s3://bucketname.

To put files in a bucket, use s3cmd put filename s3://bucket. To get files, use s3cmd get s3://bucket/filename local. To upload directories, you need to use the --recursive option.
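Written out in full, the basic commands look something like the sketch below. The bucket, file, and directory names (my-bucket, other-bucket, notes.txt, photos/) are placeholders, and the S3CMD variable is a convenience of this example rather than anything s3cmd itself requires: by default it only prints each command, so nothing touches your account until you set S3CMD=s3cmd.

```shell
# Preview mode: S3CMD defaults to "echo s3cmd", so each command is
# printed rather than run. Export S3CMD=s3cmd to execute for real.
S3CMD="${S3CMD:-echo s3cmd}"

# All bucket, file, and directory names below are placeholders.
s3_examples() {
    # List all buckets in the account.
    $S3CMD ls
    # Upload a single file to a bucket.
    $S3CMD put notes.txt s3://my-bucket/
    # Upload a directory tree (needs --recursive).
    $S3CMD put --recursive photos/ s3://my-bucket/photos/
    # Download an object to a local file.
    $S3CMD get s3://my-bucket/notes.txt notes-copy.txt
    # Copy an object from one bucket to another.
    $S3CMD cp s3://my-bucket/notes.txt s3://other-bucket/notes.txt
}

s3_examples
```

Printing first is just a cautious habit; once the commands look right, rerun with S3CMD=s3cmd set in the environment.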

But if you want to sync files and save yourself some trouble down the road, there’s the sync command. It’s dead simple to use:

s3cmd sync directory s3://bucket/

The first time, it will copy up all files. The next time, it will only copy up files that are new or have changed since the last sync. If you also want to remove files from S3 that you have deleted locally, use the --delete-removed option. Test this with the --dry-run option first, because --delete-removed makes it easy to delete files on S3 by accident.
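One way to make the dry run a habit is to wrap the two steps in small functions. This is only a sketch: sync_preview and sync_apply are made-up names, the backups/ directory and my-bucket are placeholders, and as written the S3CMD variable prints the commands instead of running them until you set S3CMD=s3cmd.

```shell
# Preview mode: S3CMD defaults to "echo s3cmd", so commands are printed,
# not run. Export S3CMD=s3cmd to execute them for real.
S3CMD="${S3CMD:-echo s3cmd}"

# Show what a sync with deletions WOULD do, without changing anything.
# $1 = local directory, $2 = s3://bucket/prefix/ destination
sync_preview() {
    $S3CMD sync --dry-run --delete-removed "$1" "$2"
}

# Perform the sync, deleting remote files that no longer exist locally.
sync_apply() {
    $S3CMD sync --delete-removed "$1" "$2"
}

# Placeholder names: check the preview output before calling sync_apply.
sync_preview backups/ s3://my-bucket/backups/
```

Only after the preview lists the uploads and deletions you expect would you call sync_apply with the same two arguments.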

It’s pretty simple to use s3cmd, and you should look at its man page as well. It even has some support for the CloudFront CDN service if you need that. Happy syncing!