CLI Magic: Be careful what you wget for

Author: Joe Barr

What you see is what you wget, as the old saying goes. You can use GNU’s wget package in a variety of ways to bring remote stuff to wherever it is that you are on the network. FTP stuff, sure. HTTP stuff, sure. Even do a little spider work if you’re feeling arachnological. Your GUI may be a sticky mess, but that’s not the same. Follow me, grasshopper, we’re going to the CLI.

With wget, not only can you step away from the GUI, you can step away from the CLI. It’s perfectly happy running in non-interactive mode, meaning you can wget it and forget it. Need a vital download from a busy FTP server that’s not letting you in at the moment? Wget it. Here’s how.

Let’s say you want to get a mission-critical app from Ibiblio, not an uncommon situation for many. But when you try to FTP the download file, shutbox-0.3.tar.gz, there is no room at the inn. So instead of retrying by hand in an FTP client, let wget do the knocking:

wget ftp://ftp.ibiblio.org/pub/linux/games/shutbox-0.3.tar.gz

By default, wget will try 20 times to get the desired file. If that’s not enough, you can specify the number of times to try by inserting the -t option followed by a larger number, like this:

wget -t 30 ftp://ftp.ibiblio.org/pub/linux/games/shutbox-0.3.tar.gz

Then just walk away and forget it, because wget doesn’t need a babysitter.
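
If you would rather not keep a terminal open while wget keeps knocking, a small shell function can bundle the options. This is only a sketch: the function name fetch_patiently and the log file name fetch.log are made up for illustration, while -t (tries), -c (resume a partial download), -b (go to the background) and -o (log file) are standard wget options.

```shell
# Hypothetical helper: retry a download in the background.
#   $1 = URL to fetch
#   $2 = number of tries (optional; falls back to 30)
fetch_patiently() {
    # -t N : try up to N times
    # -c   : continue a partially downloaded file instead of starting over
    # -b   : detach and run in the background
    # -o F : write progress messages to log file F
    wget -t "${2:-30}" -c -b -o fetch.log "$1"
}
```

Calling fetch_patiently with the Ibiblio URL hands the job to wget and returns you to the prompt; tail -f fetch.log shows how it is going.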

Of course, FTP is just half the story. You can grab Web content just as easily. In fact, using the recursive option (that’s -r) with the command will let you build a replica of the target site on your own system. PLEASE NOTE: This can have an undesirable impact on both the target and your own system if not used properly, so be careful what you wish for and what you wget.
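
One way to keep a recursive fetch from getting out of hand is to put limits on it. The sketch below is an illustration, not a recommendation from the original tip: the function name polite_crawl, the default depth of 2, the two-second wait and the 200k rate cap are all arbitrary choices. The options themselves (-l, -np, -w, --limit-rate) are standard wget.

```shell
# Hypothetical helper: recursive fetch with some brakes on.
#   $1 = starting URL
#   $2 = recursion depth (optional; the default of 2 is an arbitrary choice)
polite_crawl() {
    # -r                : follow links recursively
    # -l N              : go at most N levels deep
    # -np               : never climb above the starting directory
    # -w 2              : wait 2 seconds between retrievals
    # --limit-rate=200k : cap download speed at about 200 KB/s
    wget -r -l "${2:-2}" -np -w 2 --limit-rate=200k "$1"
}
```

The depth limit and -np protect your own disk; the wait and the rate cap are there for the sake of the server on the other end.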

Let’s say you’re about to leave work but want to read another article here at Linux.com. Your commuter train doesn’t provide Internet access. Call on wget! Download the story now, read it later. Enter this:

wget 'http://www.linux.com/article.pl?sid=04/12/13/1954228'

Opening the file wget downloads in response to that command gives you a readable version of the story, but it really doesn’t look the same as it does online. How to fix that? Let’s add -pk to the command:

wget -pk 'http://www.linux.com/article.pl?sid=04/12/13/1954228'

Those two options make all the difference. The -k option converts the links as they are written to your machine to make them suitable for local viewing instead of reaching out to the Internet to pull bits and pieces in when you need them. The -p option tells wget to get everything required to display the page, such as images and stylesheets. Now the story looks the same on the train ride as it did during your Coke break. Pretty neat.
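
If you do this every evening, the command is easy to wrap up. A minimal sketch, assuming you want the pages collected in one directory: the function name save_for_offline and the default directory name offline are invented here, and -P (directory prefix) is a standard wget option that the commands above did not use.

```shell
# Hypothetical helper: save one page, with its images and styles, for later.
#   $1 = URL of the page
#   $2 = directory to save into (optional; "offline" is a made-up default)
save_for_offline() {
    # -p     : fetch everything needed to display the page
    # -k     : rewrite links so they point at the local copies
    # -P DIR : put the downloaded files under DIR
    wget -p -k -P "${2:-offline}" "$1"
}
```

Quote the URL when you call it, for example save_for_offline 'http://www.linux.com/article.pl?sid=04/12/13/1954228', because the shell treats an unquoted ? as a wildcard.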

A more common usage of wget’s HTTP side is to create a mirror image of a remote site. It’s a great way to back up Web sites you may have hosted elsewhere. I saw a tip on OpenDeveloper.org about how to do exactly that. I used it like this the first time:

wget -rpk --level=0 http://www.joebarr.org

But because the site is using PHP to deliver the pages, Mozilla couldn’t parse the pages linked to from the front page. After adding the -E option, which appends “.html” to CGI- or PHP-generated pages, all the pages worked just fine.
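
The article does not print the command with -E added, so the following is a reconstruction rather than a quote: the same options as before plus -E, wrapped in a function whose name, mirror_site, is made up for illustration. Note that --level=0 asks for unlimited recursion depth.

```shell
# Hypothetical helper: mirror a site whose pages are generated by PHP or CGI.
#   $1 = URL of the site's front page
mirror_site() {
    # -r        : follow links recursively
    # -p        : fetch images, stylesheets and other page requisites
    # -k        : rewrite links to point at the local copies
    # -E        : save generated HTML pages with an .html suffix
    # --level=0 : no limit on recursion depth
    wget -r -p -k -E --level=0 "$1"
}
```

A call such as mirror_site http://www.joebarr.org would then leave a browsable copy in a directory named after the host.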

So fine, in fact, that when I tried to view some of the other pages offline, I got “Unauthorized access” messages. But don’t worry, you can provide user and password information using the --http-user= and --http-passwd= options (newer wget releases spell the second one --http-password=). Of course, sending credentials over an insecure network is not good security practice.
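
Here is one way the login options might be combined with the mirror command. Treat it as a sketch under assumptions: the function name mirror_with_login and the environment variable SITE_PASSWORD are invented for this example, and the code uses the newer --http-password spelling, so an old wget may need --http-passwd instead. Reading the password from a variable keeps it out of the command you type, but it can still show up in the process list while wget runs.

```shell
# Hypothetical helper: mirror a password-protected site.
#   $1 = user name
#   $2 = URL of the site's front page
#   SITE_PASSWORD (a made-up variable name) must hold the password.
mirror_with_login() {
    # Refuse to run if the password variable is unset or empty.
    if [ -z "${SITE_PASSWORD:-}" ]; then
        echo "set SITE_PASSWORD first" >&2
        return 1
    fi
    # Same mirror options as before, plus the HTTP login details.
    wget -r -p -k -E --level=0 \
         --http-user="$1" --http-password="$SITE_PASSWORD" "$2"
}
```

Set SITE_PASSWORD in the environment first, then call the function with the user name and the URL.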

Wget is a very cool tool, and there are many ways to use it productively. As always, ask the man for more information.