July 10, 2009, 8:43 am
In this final part of this three-part series, I’ll start out using SQLite as the relational database and graph data from there before moving on to using PostgreSQL and pivot tables to transform relational data into column format for gnuplot.
Sourcing the Data from Relational Databases
I’ll start with the trivial SQLite database shown below.
$ sqlite3 sqlite1.db
create table data ( label text, a int, b int, id
integer primary key autoincrement );
insert into data values ( 'kittens', 4, 5, null );
insert into data values ( 'mice', 6, 7, null );
insert into data values ( 'rats', 2, 3, null );
$ sqlite3 sqlite1.db 'select * from data'
kittens|4|5|1
mice|6|7|2
rats|2|3|3
$ gnuplot
set style data lines
set datafile separator "|"
set style data histograms
plot "< sqlite3 sqlite1.db 'select * from data'"
using 2:xtic(1) title "hair", "" using 4 t "vision"
The dataset has been completely made up, and instead of using a column from the dataset as the x-axis values, the labels have been explicitly supplied for each row in the first column. The extra :xtic(1) column specification in the first using clause tells gnuplot what to use for the x-axis tic labels. Because of the “<” at the start of the file name, the data is taken from the standard output of the sqlite3 command which issues the query to the database. Note that the second using clause is specified as the empty string so the same query is used for both using clauses. The downside of taking the data from a sub-command like this is that the command is executed for each using clause, in this case causing the query to be run twice. The gain is that it is a fairly quick and dirty way to get the data if the query is simple.
Looking at a slightly more complex example, imagine that we have just bought a new digital camera and taken a bunch of photos with it. We would like to see if there is any correlation between exposure setting and the size of the jpeg image generated by the camera. There are many ways to get the EXIF exposure time from an image and put it into an SQLite database, so I’ll leave that part as an exercise for the reader and their favorite scripting language. Lets say we have an SQLite database as shown below. So we have the URL, size and EXIF exposure time for each image we have taken with the camera.
sqlite> .schema rpix
CREATE TABLE rpix(
url text,
name text,
size int,
exif_exposure_time real
);
sqlite> select * from rpix limit 10;
~/Pictures/IMG_0001.JPG|IMG_0001.JPG|6467510|0.0125
~/Pictures/IMG_0002.JPG|IMG_0002.JPG|6581525|0.0125
~/Pictures/IMG_0003.JPG|IMG_0003.JPG|5762417|0.0166667
The below SQL query will calculate the mean size for every exposure setting used. As queries become more complex, verifying the results on screen and using .output to capture them for use in gnuplot creates simpler plot commands.
sqlite> select avg(size), exif_exposure_time as E
from rpix group by E;
...
4967955.48529412|0.02
4712271.54385965|0.025
...
sqlite> .output /tmp/exposures.dat
sqlite> select avg(size), exif_exposure_time as E
from rpix group by E;
In the above plots, I have always set the default “data” style to be lines or histogram. The default style for “data” is to put little markers at each coordinate specified which works well for this graph. Since the exposure time can vary considerably I set a logarithmic scale on the exposure axis.
set datafile separator "|"
set xlabel "exposure (seconds)"
set ylabel "size (bytes)"
set title "How jpeg file size varies as the exposure setting is altered"
set log x
plot '/tmp/exposures.dat' u 2:1 t ""
The resulting graph, exported as a png image is shown below.
Only three of the gnuplot commands shown for the exposure vs file size graph were really needed (the first and last two). So knowing how to set the field separator and pick numeric data out of columns with the plot command makes gnuplot a very powerful tool for analysing data in a relational database.
The above plots were fairly easy to generate because the output of the SQL query produced data in columns. Alas, it is fairly common for SQL queries to generate row oriented results instead of column oriented ones. Consider the prices table shown below. Normally you would not have this as a base table in the database, but it is quite common for a join between multiple tables to produce a query result like the prices table.
create table prices (
tt timestamp,
product varchar,
price numeric(5,2) );
select * from prices;
tt | product | price
----------------------------+--------------+--------
2009-04-15 18:30:14.749206 | 8gb DDR2 RAM | 200.00
2009-03-15 18:30:14.749206 | 8gb DDR2 RAM | 220.00
2009-02-10 12:30:14.749206 | 8gb DDR2 RAM | 230.00
2009-01-21 06:30:14.749206 | 8gb DDR2 RAM | 245.00
2008-11-03 13:33:04.749206 | 8gb DDR2 RAM | 280.00
2009-04-11 18:03:14.749206 | 500gb HDD | 100.00
2009-02-15 13:30:14.749206 | 500gb HDD | 150.00
2009-01-31 21:34:14.749206 | 500gb HDD | 140.00
2009-01-11 18:03:14.749206 | case | 100.00
2008-12-28 12:45:14.749206 | case | 140.00
2008-08-30 21:34:14.749206 | case | 140.00
The problem with trying to use the prices table with gnuplot is that information is stored row oriented instead of column oriented. That is, there is no single column showing the “case” product and with its price at various times.
To translate row oriented relational data into a column oriented result, many databases support the pivot command. The closest thing in PostgreSQL to pivot is the crosstab function. To use the crosstab function you’ll first need to import the tablefunc.sql file into the database using the below command. Because this includes some functions implemented in the C language into the database you’ll have to have suitable privileges to the database for the import to work.
psql database-name < /usr/share/pgsql/contrib/tablefunc.sql
The below block demonstrates crosstab function and shows the drawback to using it directly. Firstly, crosstab takes two arguments, the SQL to generate a three column table like that shown first, which must be ordered by the first two columns. The layout of the table has to be the first column is the row identifier, the second a category, and the third is the value for the category,row-id combination. This will be clearer once you examine the result of the second SQL query below.
The second parameter to crosstab is the complete list of categories.
# select tt,product,price from prices order by 1,2;
tt | product | price
----------------------------+--------------+--------
2008-08-30 21:34:14.749206 | case | 140.00
2008-11-03 13:33:04.749206 | 8gb DDR2 RAM | 280.00
2008-12-28 12:45:14.749206 | case | 140.00
2009-01-11 18:03:14.749206 | case | 100.00
2009-01-21 06:30:14.749206 | 8gb DDR2 RAM | 245.00
2009-01-31 21:34:14.749206 | 500gb HDD | 140.00
2009-02-10 12:30:14.749206 | 8gb DDR2 RAM | 230.00
2009-02-15 13:30:14.749206 | 500gb HDD | 150.00
2009-03-15 18:30:14.749206 | 8gb DDR2 RAM | 220.00
2009-04-11 18:03:14.749206 | 500gb HDD | 100.00
2009-04-15 18:30:14.749206 | 8gb DDR2 RAM | 200.00
(11 rows)
# select * from crosstab(
'select tt,product,price from prices order by 1,2',
'select distinct(product) from prices')
as
ct(tt timestamp,b numeric(5,2),c numeric(5,2),d numeric(5,2));
tt | b | c | d
----------------------------+--------+--------+--------
2008-08-30 21:34:14.749206 | | | 140.00
2008-11-03 13:33:04.749206 | | 280.00 |
2008-12-28 12:45:14.749206 | | | 140.00
2009-01-11 18:03:14.749206 | | | 100.00
2009-01-21 06:30:14.749206 | | 245.00 |
2009-01-31 21:34:14.749206 | 140.00 | |
2009-02-10 12:30:14.749206 | | 230.00 |
2009-02-15 13:30:14.749206 | 150.00 | |
2009-03-15 18:30:14.749206 | | 220.00 |
2009-04-11 18:03:14.749206 | 100.00 | |
2009-04-15 18:30:14.749206 | | 200.00 |
(11 rows)
The downside to the crosstab function is that it returns a set of records which do not carry type information. So you have to cast the return record giving the column name and type for each column in the result. And as the result will have the number of categories plus one columns in it, knowing the column information for the cast is quite a chore. Without the cast the SQL query fails as shown below.
# select * from
crosstab(
'select tt,product,price from prices order by 1,2',
'select distinct(product) from prices');
ERROR: a column definition list is required for functions returning "record"
Because the casting makes the query very long, it is a fairly large hindrance to experimenting with SQL queries that generate graphs with gnuplot. A custom function can be used to get around this. You might think this is a chicken and egg problem: trying to get around a variable return type to a generic function by wrapping it in a generic function. The savior is that our function can create a temporary table, and load the result into that.
The new function, which I called ct() for short crosstable, takes the name of a table as the first argument, the second and third are the same as the first and second for crosstab(). The fourth parameter defines the key type and the sixth the value type. For our example above the key is a timestamp and the values have type numeric(5,2). The table name supplied in argument one will be deleted and recreated by the function.
The function is reasonably simple, first it drops the temporary table, then starts building the original select * from crosstab() query. The tabstruct is used to build up the structure of both the crosstab function’s return type and the body of the create table declaration for the new temporary table that will contain the results. Notice that the category names you supplied in cats are used one by one to build the table structure, with any illegal characters removed so that they always provide a legal SQL column name. Prepending a ‘c’ to the column names ensures that they do not start with a number.
The final few lines create the new table and execute the built crosstab() query to populate the new table. Unfortunately SELECT INTO has not been implemented for this purpose yet (PostgreSQL 8.3).
create or replace function ct( outname varchar, sel varchar,
cats varchar, kt varchar, rt varchar ) returns varchar as $$
DECLARE
qs varchar;
tmp varchar ;
tabstruct varchar;
BEGIN
execute 'drop table if exists ' || outname;
qs := 'select * from crosstab(''' || sel || ''',''' || cats || ''')';
qs := qs || ' as myrec( ';
tabstruct := 'key ' || kt;
FOR tmp IN execute cats LOOP
tabstruct := tabstruct || ',c' || regexp_replace(tmp,'[^a-zA-Z0-9]','','g');
tabstruct := tabstruct || ' ' || rt;
END LOOP;
qs := qs || tabstruct;
qs := qs || ')';
execute 'create table ' || outname || ' ( ' || tabstruct || ' );';
execute 'insert into ' || outname || ' ( ' || qs || ' );';
return qs;
END;
$$ LANGUAGE plpgsql;
The below shows usage of the function. The main saving is not having to list the return type of crosstab() in the SQL, which becomes more and more substantial as the number of categories increases.
# select * from ct( 'tmp',
'select tt,product,price from prices order by 1,2',
'select distinct(product) from prices', 'timestamp', 'numeric(5,2)' );
...
# select * from tmp;
key | c500gbhdd | c8gbddr2ram | ccase
----------------------------+-----------+-------------+--------
2008-08-30 21:34:14.749206 | | | 140.00
2008-11-03 13:33:04.749206 | | 280.00 |
2008-12-28 12:45:14.749206 | | | 140.00
2009-01-11 18:03:14.749206 | | | 100.00
2009-01-21 06:30:14.749206 | | 245.00 |
2009-01-31 21:34:14.749206 | 140.00 | |
2009-02-10 12:30:14.749206 | | 230.00 |
2009-02-15 13:30:14.749206 | 150.00 | |
2009-03-15 18:30:14.749206 | | 220.00 |
2009-04-11 18:03:14.749206 | 100.00 | |
2009-04-15 18:30:14.749206 | | 200.00 |
While the last two parameters of ct() could possibly be derived by the function itself rather than passed in, having ct around provides a solid base for pivoting tables in preparation for gnuplot. Improvements are of course welcome.
Now that we have pivoted the data into columns, exporting it for gnuplot is simple. Turn off alignment, set and output file and execute a select from the temporary table. Notice that I left the display of the column name metadata on.
t
o /tmp/pg.dat
select * from tmp;
And now to actually plot the pivoted, exported data with gnuplot. The first two lines we’ve seen before, setting the style and separator. The third line tells gnuplot to use the column heading on the first line of the data file as the label for each line plotted. We then tell gnuplot that the xaxis is going to be a timestamp and how to parse the time stored in the data file. the format x setting tells gnuplot how we want to format the tic labels on the xaxis. And the last line which does the plot has nothing new, the main thing to notice is that the none of the three using lines supply a title. We get gnuplot to use the first line of the data for titles so the script doesn’t need to know that information.
set style data lines
set datafile separator "|"
set key autotitle columnhead
set xdata time
set timefmt x "%Y-%m-%d %T"
set format x "%bn%y"
plot "/tmp/pg.dat" using 1:2, "" u 1:3, "" u 1:4
And the resulting graph is shown below.
Wrap Up
Gnuplot allows some very sophisticated graphs to be generated. There is a learning curve to its command line and the interactive graphical terminals do not provide much help to the first time user. There is a help command at the gnuplot prompt, but it helps to know the main commands of interest before you start.
I’ve focused only on plotting existing data rather than mathematical functions. There should be enough to get you going with gnuplot, and the color, font and export functions should help you to generate more appealing image files. If you want to take things further I highly recommend the Gnuplot in Action book.