Author: Ben Martin
Libferris allows you to index and perform full text search on a number of file formats, including PDF, manual pages, and office documents. The recent availability of packages of libferris and its dependencies for Fedora, Ubuntu, and openSUSE makes it simpler to use the library to provide a file server search interface for the Web. Libferris was initially created to provide a virtual filesystem interface, similar to GnomeVFS and KDE’s KIO. Over time libferris has gained sophisticated support for indexing and searching filesystems.
The technique described here makes use of a new user called libferrissearch on the file server to run the search interface. Using a dedicated user allows you to explicitly grant libferrissearch access to only files that you want the Web interface to find, and allows the search interface to return results which might be accessible to the user via NFS but which are not accessible to the Web server. This makes the software more useful to people who wish to take advantage of libferris for file server search, but it does introduce a bit of extra complexity in setting up the search system.
Packages are available for 32- and 64-bit Fedora 7 and 8 and Ubuntu 7.10 (Gutsy), and in 32-bit form for openSUSE 10.3.
Creating the index
The most robust index plugin for libferris is for the PostgreSQL database, which should be running on the file server in order for you to use it with libferris. If you wish to have PostgreSQL running on another machine, you can pass host=pghostname to the fcreate commands that are contained in the script below.
The commands below start out being executed as the root user, and take advantage of two scripts which are shipped with libferris. The scripts have quite long names, and are used only during the initial setup. The first command creates some template databases in the PostgreSQL server which are tailored for libferris use. Once you have these template databases, a regular user can create new libferris indexes that support full text search. The script next creates a new user, makes PostgreSQL aware of that user, and allows the user to create new databases. We then change to that user to set up libferris and its indexing. First we execute ferris-first-time-user to set up ~/.ferris and its various files for this new user, then create a default home database for the user. Finally, we execute the second setup script from the libferris distribution to create a new PostgreSQL database and tell libferris that it should use that index for full text and metadata searches. Each user can have a default full text and metadata index for performing searches with libferris.
# adduser libferrissearch
# psql
root=# create user libferrissearch CREATEDB;
root=# \q
# su -l libferrissearch
$ ferris-first-time-user
$ psql template1
template1=> create database libferrissearch;
template1=> \q
$ ferris-recreate-primary-fulltext-and-eaindex-as-postgresql.sh libferrissearch
Once libferris is set up you can use the findexadd and feaindexadd commands to populate the index. The first command updates only full text information in the index, while the second updates only file metadata. Running the commands below as the libferrissearch user populates the libferris indexes with all the files under /docs. If a file has not been modified since it was last indexed, libferris quickly skips over it, so the commands can be run from a cron job to keep the index up to date.
$ id
uid=501(libferrissearch) gid=501(libferrissearch) groups=501(libferrissearch)
$ find /docs > /tmp/files-to-index
$ feaindexadd -f /tmp/files-to-index
$ findexadd -f /tmp/files-to-index
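The cron job mentioned above could wrap the same three commands in a small function. This is only a sketch: the function name refresh_index and the use of a temporary file from mktemp are my own choices for illustration, while the feaindexadd -f and findexadd -f invocations are the ones shown in the listing.

```shell
# refresh_index [DIR]
# Rebuilds the list of files under DIR (default /docs) and feeds it to
# both libferris index commands, metadata first and then full text.
refresh_index() {
    docroot="${1:-/docs}"
    list="$(mktemp)" || return 1
    # Same listing the article builds by hand, written to a scratch file.
    find "$docroot" > "$list"
    # Only run the full text pass if the metadata pass succeeded.
    feaindexadd -f "$list" && findexadd -f "$list"
    status=$?
    rm -f "$list"
    return $status
}
```

Saved in a script that ends by calling refresh_index, this could be run from the libferrissearch user's crontab with an entry such as "15 2 * * * /home/libferrissearch/bin/refresh-index.sh", where both the schedule and the path are hypothetical.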
For this article I’ve populated /docs with some text files from Project Gutenberg, as well as the PDF file valgrind_manual.pdf from the Valgrind distribution. The following commands verify that the index can be used to find the documents. The final command shows that the Valgrind manual can be retrieved by its content just like the text files.
$ id
uid=501(libferrissearch) gid=501(libferrissearch) groups=501(libferrissearch)
$ ls -l /docs
total 2876
-rw-r----- 1 libferrissearch root 153477 2008-01-12 13:24 alice13a.txt
-rw-r----- 1 libferrissearch root 48923 2008-01-12 13:24 boysw10.txt
-rw-r----- 1 libferrissearch root 259214 2008-01-12 13:24 dmoro11.txt
-rw-r----- 1 libferrissearch root 342169 2008-01-12 13:24 frsls10.txt
-rw-r----- 1 libferrissearch root 112244 2008-01-12 13:24 nobos10.txt
-rw-r----- 1 libferrissearch root 468646 2008-01-12 13:24 sbshp10.txt
-rw-r----- 1 libferrissearch root 40662 2008-01-12 13:24 snark12.txt
-rw-r--r-- 1 libferrissearch root 1074618 2008-01-12 13:29 valgrind_manual.pdf
-rw-r----- 1 libferrissearch root 363974 2008-01-12 13:24 warw11.txt
$ findexquery alice
Found 1 matches at the following locations:
file:///docs/alice13a.txt
$ findexquery cache
Found 1 matches at the following locations:
file:///docs/valgrind_manual.pdf
Setting up the Web interface
We want our PHP code to be executed as the libferrissearch user. I use the mod_suphp Apache module to force this to happen. On a Fedora 8 machine you can install this module from the default repositories using yum. Because some PHP code does not expect to run as a different user, I tend to enable this module only for the directories that need it. The commands below set up mod_suphp to operate in the directory served as http://localhost/libferrissearch, which I will use for the libferris search interface.
# cd /etc/httpd/conf.d
# vi libferrissearch.conf
<Directory "/var/www/html/libferrissearch">
suPHP_Engine on
</Directory>
To turn off suPHP by default add the following to the end of the main HTML directory directive in /etc/httpd/conf/httpd.conf:
…
suPHP_Engine off
suPHP_RemoveHandler .php
php_admin_flag engine on
php_admin_flag register_globals on
</Directory>
Once suPHP is off by default you can enable it by editing /etc/httpd/conf.d/mod_suphp.conf and uncommenting the following line:
You should then restart the Apache server. At this stage we have an Apache Web server that can use mod_suphp on directories which we have explicitly nominated. Now we can move on to setting up the libferrissearch directory and the PHP scripts. Inside the /var/www/html/libferrissearch directory we need to create three files: a PHP script to actually perform the search and return the result, an XSL stylesheet, and a main form page to let the user input the query and see the results.
The first script is runquery-simple.php, which performs the heavy lifting. Some parameters the user can change are defined at the top of the script. I’ll cover the stylesheet in a moment. The restriction can be one of filter, filter-10, or filter-100, with the latter two returning a maximum of 10 or 100 results respectively. The showea definition lists the metadata we want to see for each result. For information on the metadata that libferris makes available, see the libferris eadescriptions page. Having the parent-url in the results allows us to group files by the directory that contains them.
Next, the query itself is taken from a CGI parameter and a query is formed using ferrisls and its --xml mode to obtain the result set as an XML file. In order to include a link to a custom stylesheet we pass the --hide-xml-declaration option to ferrisls so that the <?xml… declaration is left out of the output of ferrisls. This way the XML declaration can be included in the PHP code and we can explicitly link to the stylesheet for rendering the XML result set.
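<?php
// Stylesheet that the browser applies to the XML result set.
$STYLESHEET="xml-results-to-xhtml.xsl";
// Result limit: "filter", "filter-10", or "filter-100". The last assignment wins.
$restriction="filter";
$restriction="filter-100";
// Metadata to return for each result.
$showea=escapeshellarg("url,name,size,size-human-readable,mtime,mtime-display,parent-url");
// The form page loads this script once without a query, so default to "".
$q=isset($_REQUEST["q"]) ? $_REQUEST["q"] : "";
header('Content-type: text/xml');
print "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n";
print "<?xml-stylesheet href=\"$STYLESHEET\" type=\"text/xsl\"?>\n";
$cmd="/usr/local/bin/ferrisls --xml --hide-xml-declaration ";
$cmd.=" --show-ea=$showea ";
$cmd.=escapeshellarg("eaquery://$restriction/$q");
system( $cmd );
?>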
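The command line that the script assembles can also be run from a shell as the libferrissearch user, which is a convenient way to check a query before it goes through Apache. The following is a minimal sketch, assuming ferrisls is on the PATH. The function names build_query and run_query are made up for illustration; the ferrisls options, the eaquery:// URL shape, and the is-file restriction are the ones used by the scripts in this article.

```shell
# build_query EANAME OPCODE VALUE [FILES_ONLY]
# Prints a libferris query such as (url=~alice). When FILES_ONLY is 1 the
# query is wrapped in the is-file restriction that the front-end page adds.
build_query() {
    q="($1$2$3)"
    if [ "${4:-0}" = "1" ]; then
        q="(&(is-file==1)$q)"
    fi
    printf '%s\n' "$q"
}

# run_query RESTRICTION QUERY
# Runs the same kind of ferrisls command that runquery-simple.php builds,
# with a shorter metadata list to keep the output readable.
run_query() {
    ferrisls --xml --hide-xml-declaration \
        --show-ea=url,name,size-human-readable,mtime-display \
        "eaquery://$1/$2"
}
```

For example, run_query filter-10 "$(build_query ferris-fulltext-search == mad 1)" would ask for at most 10 files whose text contains "mad" and print the raw XML that the stylesheet later turns into a table.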
The XSL file, xml-results-to-xhtml.xsl, which the above PHP links to, is shown below. The transform takes the XML output from ferrisls and creates an HTML document, complete with color-coding on alternate rows in the result set. The first template matches the top-level XML element and creates the bulk of the HTML document. The second template outputs a single result as a color-coded table row.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html"/>
<xsl:template match="/ferrisls">
<xsl:variable name="number-of-columns">3</xsl:variable>
<html>
<head>
<title>Ferris index</title>
<style>
td.light { background-color:#d5cccc; }
td.dark { background-color:lightgrey; }
a:link {
COLOR: #000055;
}
a:visited {
COLOR: #000022;
}
a:hover {
COLOR: #aa0000;
}
a:active {
COLOR: #00FF00;
}
</style>
</head>
<body bgcolor="#bdbbbb">
<table border="0" columns="{$number-of-columns}" >
<!-- header for table -->
<tr bgcolor="pink" color="#FFFFFF" >
<td>size</td>
<td>mtime</td>
<td>url</td>
</tr>
<xsl:for-each select="//context">
<xsl:sort select="@name" />
<xsl:apply-templates select=".">
<xsl:with-param name="lexpos" select="position()"/>
</xsl:apply-templates>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
<xsl:template match="context">
<xsl:param name="lexpos"/>
<xsl:variable name="bgcolor">
<xsl:choose>
<xsl:when test="($lexpos) mod 2">light</xsl:when>
<xsl:otherwise>dark</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<tr bgcolor="#DDCCCC" >
<td class="{$bgcolor}">
<xsl:value-of select="@size-human-readable" />
</td>
<td class="{$bgcolor}">
<xsl:value-of select="@mtime-display" />
</td>
<td class="{$bgcolor}">
<xsl:value-of select="@url" />
</td>
</tr>
</xsl:template>
</xsl:stylesheet>
The last PHP script, shown below, is a basic front end to the search interface. Two form entries allow for either searching the text contents of files or searching for a regular expression in part of the file path. The search JavaScript function is where the search is actually performed. If the “files only” form element is checked then we explicitly remove results that are not files by adding a boolean metadata restriction to the query. The query CGI parameter is then escaped and the IFRAME that contains the results is directed to load the new search output from the runquery-simple.php script.
<html>
<head>
<title>Ferris Query</title>
<script language="JavaScript">
function search( eaname, opcode, val )
{
var q = '(' + eaname + opcode + val + ')';
if( document.onlyfiles.v.checked )
{
q = '(&(is-file==1)' + q + ')';
}
// Escape the whole query so that characters such as & and + survive the URL.
var earl = "runquery-simple.php?q=" + encodeURIComponent(q);
document.getElementById("results").src = earl;
}
function OnLoadPage()
{
// set focus onto search box
document.searchurlr.query.focus();
}
</script>
</head>
<body onLoad="javascript:OnLoadPage()" bgcolor="lightgrey">
<form name="onlyfiles" action="">
Only search for files:<input type="checkbox" name="v" value="1" checked="checked" >
</form>
<table border="0" columns="2">
<tr>
<td>Search by URL (regex)</td>
<td>
<form name="searchurlr" action="javascript: search( 'url', '=~', document.searchurlr.query.value)">
<input type='text' name='query'></form>
</td>
</tr>
<tr>
<td>Search full text</td>
<td>
<form name="searchurlftx" action="javascript: search( 'ferris-fulltext-search', '==', document.searchurlftx.query.value)">
<input type='text' name='query'></form>
</td>
</tr>
</table>
<hr/>
<iframe id="results" src="runquery-simple.php" width="100%" height="100%"></iframe>
</body>
</html>
These three scripts should go into /var/www/html/libferrissearch.
# mkdir libferrissearch
# chown -R libferrissearch.libferrissearch libferrissearch
# chmod 755 libferrissearch
# chmod 644 libferrissearch/*
# cd libferrissearch
# ls -l
-rw-r--r-- 1 libferrissearch libferrissearch 1.5K 2008-01-12 14:57 index.php
-rw-r--r-- 1 libferrissearch libferrissearch 562 2008-01-12 14:57 runquery-simple.php
-rw-r--r-- 1 libferrissearch libferrissearch 2.0K 2008-01-12 14:57 xml-results-to-xhtml.xsl
Searching
In the screen shot at right I have performed a full text search for “mad” on the file server.
There are many more possible uses for a PHP Web interface to libferris. Since libferris has the ability to compute cryptographic checksums such as MD5 and SHA1 you can include checksums in the index and later compare them against the current cryptographic checksum for files to detect file modifications or possible media errors. If you have geotagged files, such as JPEG images with GPS coordinates in them, you can create a “network link” endpoint for use in Google Earth.
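The checksum comparison can be illustrated without the index at all. The sketch below uses the coreutils md5sum command and a plain manifest file rather than libferris, because this article does not show how libferris names its checksum metadata. The function names record_checksums and changed_files are hypothetical, and the manifest should live outside the directory being scanned.

```shell
# record_checksums DIR MANIFEST
# Stores an MD5 checksum for every regular file under DIR in MANIFEST.
record_checksums() {
    find "$1" -type f -exec md5sum {} + > "$2"
}

# changed_files MANIFEST
# Prints the files whose current checksum no longer matches MANIFEST,
# or which can no longer be read.
changed_files() {
    md5sum -c "$1" 2>/dev/null | sed -n 's/: FAILED.*$//p'
}
```

Running record_checksums /docs /var/tmp/docs.md5 once and changed_files /var/tmp/docs.md5 later lists files that were modified or damaged in between. The manifest path is only an example.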
Different libferris indexes can also be federated to form a single index. This is useful for allowing different storage and update policies for different parts of an index. For example, you could create a single index for manual pages that is updated only when new software is installed on the system.