Extract data from the Internet with Web scraping

12478

Author: Rob Reilly

Even if you don’t know how to access databases using a Web browser or use an RSS reader, you can extract information from the Internet through Web page scraping. Here’s how you can use some Linux-based tools to get data.

First, you need to decide what data you want and what search strings you’ll use to get it. Obviously, if you need only three lines of information, cutting and pasting is the best way to go. However, if a Web page has 640 lines of useful data that you need to download once a day for the next week, automating the process makes a lot of sense.

Start small. Live Web pages can be complicated to scrape, and designing search strings can be challenging. As your experience grows, you’ll see faster and more efficient ways to use the techniques.

For this example, I wanted see how my articles were doing with page views. I created a demo Web page using a subset of my actual home page. I added fictitious page-view numbers to the original data to give myself some numerical information to work with.

Take a look at this portion of the HTML code for the example page, named index-demo.htm. I’ve kept the example HTML code simple, so you can see how the commands and techniques work. Assume that the page resides on a Web server. Also, note that I haven’t coded any wrapping into the title and page-view line.


<html>
<body>

<b><font size="+1">Rob's Demo Web Page</font></b>
<br>
<br>

<!----------- Articles ------------------>

&nbsp;<a href="http://software.newsforge.com/article.pl?sid=05/08/25/167213">
Using WireFusion for 3D animation</a>&nbsp;&nbsp;&nbsp;(09/08/2005 on NewsForge.com)&nbsp;&nbsp;&nbsp;(103,000 page views)<br>
&nbsp;&nbsp;&nbsp;<em>An interesting GUI program for 3D animation</em>
   <br>
   <br>

&nbsp;<a href="http://software.newsforge.com/software/05/07/08/1913220.shtml">
Spice up your presentations with OpenOffice.org Impress</a>&nbsp;&nbsp;&nbsp;(07/13/2005 on NewsForge.com)&nbsp;&nbsp;&nbsp;(652,000 page views)<br>
&nbsp;&nbsp;&nbsp;<em>Advanced OOo Impress techniques</em>
   <br>
   <br>
.
.
.

&nbsp;<a href="http://management.itmanagersjournal.com/article.pl?sid=04/12/28/2159255">
Personal branding for IT professionals - Part 2</a>&nbsp;&nbsp;&nbsp;(12/29/2004 on IT Managers Journal.com)&nbsp;&nbsp;&nbsp;(489,000 page views)<br>
&nbsp;&nbsp;&nbsp;<em>Part 2 - How to use Open Source tools to help market yourself</em>
   <br>
   <br>

</body>
</html>

Prepare the data for import

You’ll use several command-line programs to manipulate the data into a form that you can import right into an application. In this case, I’m using OpenOffice.org Calc, but you can use others.

First, download the page using GNU Wget, a command-line Web-page grabber that features an impressive selection of options. It can download a single page or a whole site recursively.

For this example, I want to download the index-demo.htm page and save it in the local directory. Here, the -q option suppresses unnecessary command commentary:

> wget -q http://home.earthlink.net/~robreilly/index-demo.htm

Next, pull out the desired lines using grep, which is a text-search tool that finds lines matched by a string. Looking at the index-demo.htm file, you can see that you can use <a> tag and three non-breaking space string (&nbsp;) to pull the title and page-view numbers out of the file. Redirect the output to tmp1.txt for use in the next step:

> grep '</a>&nbsp;&nbsp;&nbsp' index-demo.htm > tmp1.txt

Next, grab the required fields using gawk, a pattern-matching tool that can also format the text output. Looking at tmp1.txt, you can see that a line is separated into parts using the “(” character. You want to extract the article name (field 1) and page-view number (field 3) and direct the output to tmp2.txt:

> gawk -F'(' '{print $1 " " $3}' tmp1.txt > tmp2.txt

Next, clean up the output so you can import it into OpenOffice.org Calc smoothly. Use sed, which is a text-stream editor that can read a file and make string substitutions on the fly. Use the two instances of “*” in a row to substitute in nothing, thus cleaning up the text:

> sed 's*</a>&nbsp;&nbsp;&nbsp**; s*page views)<br>**' tmp2.txt > output.csv

The “*” character marks the strings. You’ll quickly find real Web pages where using the standard sed “/” string marker makes the substitution confusing. It’s OK to use another character, as long as it isn’t a special character (check the sed manual) and doesn’t appear in your desired text string.

I also put two substitutions together in one sed command using the “;” character. The first one gets rid of the &nbsp; string, and the second gets rid of the “page views)<br>” string. Notice that I left one “;” character after the third “&nbsp;” string. This conveniently delimits the two fields when importing into Calc.

Import the list into OOo Calc

Click to enlarge

Finally, we can import the output.csv file into the OOo Calc application. OpenOffice.org versions 2.0 and above use separate commands for the various programs. Use oocalc to start Calc from the command line:

> oocalc output.csv

Calc knows it’s a comma-delimited file and starts the import dialog box automatically. Click the semicolon checkbox and press OK to bring the data into the blank spreadsheet.

Create a chart by highlighting the text and clicking the Chart button. Choose the bar chart using the first column as the label. Click the Create button, and the chart with the page-view data appears (see figure).

Once you get the hang of using these tools, you can streamline the process by putting the steps into a script:


wget -q http://home.earthlink.net/~robreilly/index-demo.htm
grep '</a>&nbsp;&nbsp;&nbsp' index-demo.htm > tmp1.txt
gawk -F'(' '{print $1 " " $3}' tmp1.txt > tmp2.txt
sed 's*</a>&nbsp;&nbsp;&nbsp**; s*page views)<br>**' tmp2.txt > output.csv
rm tmp1.txt tmp2.txt
oocalc output.csv

Change the permission on the file to 744 and run it:

> ./script.txt

Once it executes, all you need to do is finish up creating the graph in Calc.

Many Web pages have a text-search box. Use it to your advantage by putting in your search text and using the generated URL with the Wget command. This technique is great for spanning multiple pages and even multiple sites.

Real-life Web pages can be complicated. Understanding Wget, grep, gawk, and sed will help you make adjustments and retrieve the right data.

SQL, XML, and RSS are all the buzz today, but the lost art of screen (Web) scraping still has a useful place in today’s business environment. That’s especially true as the Linux desktop continues to march into the office world.

Rob Reilly is a consultant, trend spotter, and writer. His company advises clients in a variety of business and technology areas.